otsukare Thoughts after a day of work

Reducing your bandwidth bill by customizing 404 responses.

In this post, you will learn how to reduce the bandwidth cost of 404 responses for both the user and the server owner. But first we need to understand the issue a bit (if you already know all about it, you can jump to the tips at the end of the blog post).

Well-Known Location Doom Empire

A long time ago, in 1994, because of a spidering program behaving badly, robots.txt was introduced and quickly adopted by WebCrawler, Lycos and other search engines of the time. Web clients could now first inspect the robots.txt file and skip indexing the sections of the Web site declared "not welcome" to Web spiders. This file was put at the root of the Web site, http://example.org/robots.txt. It is called a "well-known location" resource. It means that the HTTP client expects to find something at that address when doing an HTTP GET.

Since then, unfortunately, many of these resources have been created. The issue is that they impose on server owners certain names they might have wanted to use for something else. Let's say, as a Web site owner, I decide to create a Web page /contact at the root of my Web site. One day, a powerful company decides that it would be cool if everyone had a /contact with a dedicated format. I am then forced to adjust my own URI space to avoid a conflict with this new de facto popular practice. We usually say that it clutters the Web site namespace.

What are the other common resources which have been created since robots.txt?

Note that if you would like to create a new well-known resource in the future, RFC 5785 (Defining Well-Known Uniform Resource Identifiers (URIs)) has been proposed specifically to address this issue.

Bandwidth Waste

In terms of bandwidth, why could it be an issue? These are files which are most of the time requested by autonomous Web clients. When an HTTP client requests a resource which is not available on the HTTP server, the server sends back a 404 response. These responses can be very simple, light text or a full HTML page with a lot of code.

Google evaluated that the waste of bandwidth generated by missing apple-touch-icon on mobile was 3% to 4%. This means that the server is sending bits on the wire which are useless (cost for the site owner) and the same for the client receiving them (cost for the mobile owner).

Is there a way to fix that? Maybe.

Let's Hack Around It

So instead of carrying the burden of putting every resource in place for every client, what about sending a very light 404 answer to the Web clients requesting resources we do not have on our own server?

Let's say, for the purpose of the demo, that only favicon and robots are available on your Web site. We then need to send a specialized light 404 for the rest of the possible resources.

Apache

With Apache, we can use the Location directive. This must be defined in the server configuration file httpd.conf or the virtual host configuration file. It cannot be defined in .htaccess.

<VirtualHost *:80>
    DocumentRoot "/somewhere/over/the/rainbow"
    ServerName example.org
    <Directory "/somewhere/over/the/rainbow">
        # Here some options
        # And your common 404 file
        ErrorDocument 404 /fancy-404.html
    </Directory>
    # your customized errors
    #<Location /robots.txt>
    #    ErrorDocument 404 /plain-404.txt
    #</Location>
    #<Location /favicon.ico>
    #    ErrorDocument 404 /plain-404.txt
    #</Location>
    <Location /humans.txt>
        ErrorDocument 404 /plain-404.txt
    </Location>
    <Location /crossdomain.xml>
        ErrorDocument 404 /plain-404.txt
    </Location>
    <Location /w3c/p3p.xml>
        ErrorDocument 404 /plain-404.txt
    </Location>
    <Location /apple-touch-icon.png>
        ErrorDocument 404 /plain-404.txt
    </Location>
    <Location /apple-touch-icon-precomposed.png>
        ErrorDocument 404 /plain-404.txt
    </Location>
</VirtualHost>

Here I commented out robots.txt and favicon.ico, but you can adjust to your own needs and send errors or not to specific requests.
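
If the list of paths grows, typing each Location block by hand gets tedious. Here is a minimal sketch, in Python, that prints the blocks from a list of paths. The helper names (location_block, render_locations) are my own invention for illustration, not part of Apache, and the output still has to be pasted into the virtual host by hand.

```python
# Hypothetical helper: print one Apache <Location> block per well-known path.
# The paths are the ones used in the virtual host example above.
WELL_KNOWN_PATHS = [
    "/humans.txt",
    "/crossdomain.xml",
    "/w3c/p3p.xml",
    "/apple-touch-icon.png",
    "/apple-touch-icon-precomposed.png",
]


def location_block(path, error_document="/plain-404.txt"):
    """Return the <Location> block sending a light 404 for one path."""
    return (
        f"    <Location {path}>\n"
        f"        ErrorDocument 404 {error_document}\n"
        f"    </Location>"
    )


def render_locations(paths, error_document="/plain-404.txt"):
    """Join the blocks for all paths, one after the other."""
    return "\n".join(location_block(p, error_document) for p in paths)


if __name__ == "__main__":
    print(render_locations(WELL_KNOWN_PATHS))
```

Add or remove paths in the list to match what your own site really serves.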

The plain-404.txt is a very simple text file with just NOT FOUND inside, and the fancy-404.html is an HTML file helping humans understand what is happening and inviting them to find their way on the site. The result is quite cool.

For a classic mistake, let's say a request to http://example.org/foba6365djh, we receive the HTML error.

GET /foba6365djh HTTP/1.1
Host: example.org

HTTP/1.1 404 Not Found
Content-Length: 1926
Content-Type: text/html; charset=utf-8
Date: Wed, 30 Jul 2014 05:30:33 GMT
ETag: "f7660-786-4e55273ef8a80;4ff4eb6306700"
Last-Modified: Sun, 01 Sep 2013 13:30:02 GMT

<!DOCTYPE html>
…

And then for a request to, let's say, http://example.org/crossdomain.xml, we get the plain light error message.

GET /crossdomain.xml HTTP/1.1
Host: example.org

HTTP/1.1 404 Not Found
Content-Length: 9
Content-Type: text/plain
Date: Wed, 30 Jul 2014 05:29:11 GMT

NOT FOUND
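
To get a rough feeling for what this saves, here is a small sketch in Python using the two Content-Length values from the responses above (1926 and 9 bytes). The number of requests is a made-up figure for illustration, and the estimate only counts response bodies, not headers.

```python
# Body sizes taken from the two example responses above.
FANCY_404_BYTES = 1926  # HTML error page
PLAIN_404_BYTES = 9     # plain text "NOT FOUND"


def body_bytes_saved(requests, fancy=FANCY_404_BYTES, plain=PLAIN_404_BYTES):
    """Body bytes saved when `requests` 404s get the plain body instead of the HTML one."""
    return requests * (fancy - plain)


if __name__ == "__main__":
    # Hypothetical number of 404s on well-known resources per month.
    monthly_requests = 100_000
    saved = body_bytes_saved(monthly_requests)
    print(f"{saved} bytes saved, about {saved / 1_000_000:.1f} MB")
```

Replace the request count with a number from your own access logs to see whether it is worth the trouble for your site.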

nginx

It is probably possible to do it for nginx too. Be my guest, I'll link your post from here.

Otsukare.