In this post, you will learn how to reduce the bandwidth cost for both the user and the server owner. But first we need to understand the issue a bit (if you already know all about it, you can jump to the tips at the end of the blog post).
Well-Known Location Doom Empire
A long time ago, in 1994, because of a badly behaving spidering program, robots.txt was introduced and quickly adopted by WebCrawler, Lycos and other search engines of the time. Web clients could now first inspect the robots.txt at the root of the Web site and skip indexing the sections of the Web site declared "not welcome" to Web spiders. This file was put at the root of the Web site, http://example.org/robots.txt. It is called a "well-known location" resource. It means that the HTTP client expects to find something at that address when it makes a request.
Unfortunately, many of these resources have been created since then. The issue is that they impose on server owners certain names that the owners might have wanted to use for something else. Let's say that, as a Web site owner, I decide to create a Web page /contact at the root of my Web site. One day, a powerful company decides that it would be cool if everyone had a /contact with a dedicated format. I am then forced to adjust my own URI space to avoid a conflict with this new de facto popular practice. We usually say that this clutters the Web site namespace.
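What other common resources have been created since? Among them are the ones used in the configuration example later in this post: favicon.ico, humans.txt, crossdomain.xml, w3c/p3p.xml, apple-touch-icon.png and apple-touch-icon-precomposed.png.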
Note that if you would like to create a new well-known resource in the future, RFC 5785 (Defining Well-Known Uniform Resource Identifiers (URIs)) has been proposed specifically to address this issue.
In terms of bandwidth, why could it be an issue? These files are most of the time requested by autonomous Web clients. When an HTTP client requests a resource which is not available on the HTTP server, the server sends back a 404 response. This response can be a very light piece of plain text or a full HTML page with a lot of code.
Google estimated that the bandwidth wasted by missing apple-touch-icon files on mobile was 3% to 4%. This means that the server is sending useless bits on the wire (a cost for the site owner) and that the client is receiving them (a cost for the mobile owner).
Is there a way to fix that? Maybe.
Let's Hack Around It
So instead of carrying the burden of providing every resource for every client, what if we sent a very light 404 answer to the Web clients requesting resources we do not have on our server?
Let's say, for the purpose of the demo, that only favicon and robots are available on your Web site. We then need to send a specialized light 404 for the rest of the possible resources.
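Before looking at a server configuration, the decision itself can be sketched in a few lines of Python. This is only an illustration of the idea, not the setup used in this post: the function name pick_404, the path set and the placeholder HTML body are all assumptions made for the example.

```python
# Well-known paths probed by automated clients. This set mirrors the demo
# in this post and is an assumption: adjust it to your own site.
LIGHT_404_PATHS = {
    "/humans.txt",
    "/crossdomain.xml",
    "/w3c/p3p.xml",
    "/apple-touch-icon.png",
    "/apple-touch-icon-precomposed.png",
}

PLAIN_BODY = b"NOT FOUND"
# Placeholder standing in for a full, human-friendly HTML error page.
FANCY_BODY = b"<!DOCTYPE html><title>Not found</title><p>Sorry, nothing here.</p>"


def pick_404(path):
    """Return (content_type, body) for a missing resource at `path`."""
    # Drop any query string so /humans.txt?x=1 still counts as a probe.
    bare_path = path.split("?", 1)[0]
    if bare_path in LIGHT_404_PATHS:
        return "text/plain", PLAIN_BODY
    return "text/html; charset=utf-8", FANCY_BODY


if __name__ == "__main__":
    for probe in ("/crossdomain.xml", "/foba6365djh"):
        content_type, body = pick_404(probe)
        print(probe, content_type, len(body))
```

Automated probes get the short plain text body, while everything else keeps the page written for humans.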
With Apache, we can use the
Location directive. This must be defined in the server configuration file
httpd.conf or the virtual host configuration file. It can not be defined in
    <VirtualHost *:80>
        DocumentRoot "/somewhere/over/the/rainbow"
        ServerName example.org

        <Directory "/somewhere/over/the/rainbow">
            # Here some options
            # And your common 404 file
            ErrorDocument 404 /fancy-404.html
        </Directory>

        # your customized errors
        #<Location /robots.txt>
        #    ErrorDocument 404 /plain-404.txt
        #</Location>
        #<Location /favicon.ico>
        #    ErrorDocument 404 /plain-404.txt
        #</Location>
        <Location /humans.txt>
            ErrorDocument 404 /plain-404.txt
        </Location>
        <Location /crossdomain.xml>
            ErrorDocument 404 /plain-404.txt
        </Location>
        <Location /w3c/p3p.xml>
            ErrorDocument 404 /plain-404.txt
        </Location>
        <Location /apple-touch-icon.png>
            ErrorDocument 404 /plain-404.txt
        </Location>
        <Location /apple-touch-icon-precomposed.png>
            ErrorDocument 404 /plain-404.txt
        </Location>
    </VirtualHost>
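The Location blocks in this configuration are repetitive, so they could be generated instead of typed by hand. The following is a minimal sketch, assuming the same paths and the same /plain-404.txt file as in the configuration; the helper names location_block and render_locations are made up for this example.

```python
# Paths taken from the virtual host example; edit the list for your own site.
PATHS = [
    "/humans.txt",
    "/crossdomain.xml",
    "/w3c/p3p.xml",
    "/apple-touch-icon.png",
    "/apple-touch-icon-precomposed.png",
]


def location_block(path, error_document="/plain-404.txt"):
    """Render one <Location> block sending 404s for `path` to a light file."""
    return (
        f"<Location {path}>\n"
        f"    ErrorDocument 404 {error_document}\n"
        f"</Location>"
    )


def render_locations(paths, error_document="/plain-404.txt"):
    """Render one block per path, separated by newlines."""
    return "\n".join(location_block(p, error_document) for p in paths)


if __name__ == "__main__":
    # Paste the printed text inside the <VirtualHost> section.
    print(render_locations(PATHS))
```

Adding a new well-known path then only means adding one entry to the list.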
Here I have commented out the robots.txt and favicon.ico blocks, but you can adjust this to your own needs and send the light error or not for specific requests.
plain-404.txt is a very simple text file with just NOT FOUND inside, and fancy-404.html is an HTML file helping humans understand what is happening and inviting them to find their way on the site. The result is quite cool.
For a classic mistake, let's say a request to http://example.org/foba6365djh, we receive the HTML error.
    GET /foba6365djh HTTP/1.1
    Host: example.org

    HTTP/1.1 404 Not Found
    Content-Length: 1926
    Content-Type: text/html; charset=utf-8
    Date: Wed, 30 Jul 2014 05:30:33 GMT
    ETag: "f7660-786-4e55273ef8a80;4ff4eb6306700"
    Last-Modified: Sun, 01 Sep 2013 13:30:02 GMT

    <!DOCTYPE html>
    …
And then for a request to, let's say, http://example.org/crossdomain.xml, we get the plain light error message.
    GET /crossdomain.xml HTTP/1.1
    Host: example.org

    HTTP/1.1 404 Not Found
    Content-Length: 9
    Content-Type: text/plain
    Date: Wed, 30 Jul 2014 05:29:11 GMT

    NOT FOUND
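To get a feeling for what the difference between the two responses means, here is a rough back-of-the-envelope sketch. The two body sizes are the Content-Length values shown in the responses; the number of probe requests is a hypothetical figure chosen for illustration, not a measurement, and response headers are left out of the count.

```python
FANCY_404_BYTES = 1926  # Content-Length of the HTML error response
PLAIN_404_BYTES = 9     # Content-Length of the plain text error response


def bytes_saved(probe_requests, fancy=FANCY_404_BYTES, plain=PLAIN_404_BYTES):
    """Body bytes no longer sent when `probe_requests` misses get the light 404."""
    return probe_requests * (fancy - plain)


if __name__ == "__main__":
    monthly_probes = 100_000  # hypothetical traffic, not a measured value
    saved = bytes_saved(monthly_probes)
    print(f"{saved} bytes, about {saved / 1_000_000:.1f} MB")
```

Each light response saves 1917 bytes of body, so with this assumed traffic the saving is 191,700,000 bytes, about 191.7 MB, on both the server side and the client side.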
It is probably possible to do it for nginx too. Be my guest, I'll link your post from here.