Most recent update: 26th March 2022 - 06:53:08 - 5135 characters

Texting Robots: The Taming of robots.txt

If you are a webmaster or have ever written software to crawl the web, you've hopefully discovered robots.txt. Simply put, robots.txt is intended to tell automated crawlers which resources to include or exclude in their crawling process.

The "Robot Exclusion Protocol" was first proposed over a quarter century ago and decades later is used by most websites on the internet.

One might imagine that a quarter-century-old specification in broad use would be well defined. One might also be tremendously wrong.

The specification was never officially formalized. There's movement on that front now in the form of a draft IETF proposal, but the draft proposal doesn't cover Crawl-Delay, an essential addition if you're anyone but Google.

More complicated than a loose specification, however, is the number of edge cases that web scale reveals.

Why is robots.txt important?

Web crawling is wondrous. Web crawling enables us to take this wild and strangely interlinked behemoth network of compute and knowledge which we call the internet, fueled by billions of independent actors, and pull it together for use as a corpus of humanity.

That this was ever allowed or became standard practice is a marvel in and of itself. Hence the danger. Nefarious or ill-behaved web crawlers can push us towards the tragedy of the commons.

The more web crawling is abused, the more likely it is for websites to prevent web crawling. When a website decides to prevent web crawling, it's a coin flip as to whether they block the specific abusive bot or restrict crawling more generally.

If we collectively aren't careful we'll slowly watch the societal benefit given to us by web crawling dissolve or be centralized to only a few massive companies.

The robots.txt specification

The format is incredibly simple. A plaintext file located at /robots.txt on the domain specifies a set of instructions like:

User-Agent: Smerity
Allow: /
Disallow: /favorites/movies/*/horror/

User-Agent: *
Disallow: /favorites/
Crawl-Delay: 10
Sitemap: https://example.com/sitemap.xml

For a given user agent ("bot", "robot", "crawler", ...), a set of patterns indicates which URLs are allowed or disallowed. A simplified form of regular expressions is allowed - specifically * for matching anything and $ for matching the end of the input. If a URL doesn't match any pattern then we assume it's allowed.
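
To make the matching rules concrete, here is a minimal sketch in Rust of such a matcher (my own illustration, not the actual implementation of Texting Robots or any other library): patterns match from the start of the URL path, * matches any run of characters, and a trailing $ anchors the match to the end of the path.

/// Returns true if `pattern` matches `path` under robots.txt semantics:
/// patterns match from the start of the path, `*` matches any run of
/// characters and a trailing `$` anchors the match to the end of the path.
fn matches(pattern: &str, path: &str) -> bool {
    fn go(p: &[char], s: &[char]) -> bool {
        match p.first().copied() {
            // Pattern exhausted: the prefix matched, so the rule applies.
            None => true,
            // `$` is only special as the final character of the pattern.
            Some('$') if p.len() == 1 => s.is_empty(),
            // `*` matches zero or more characters. This naive backtracking
            // search is exactly what blows up on patterns stuffed with `*`s.
            Some('*') => (0..=s.len()).any(|i| go(&p[1..], &s[i..])),
            // Anything else must match the next path character literally.
            Some(c) => s.first() == Some(&c) && go(&p[1..], &s[1..]),
        }
    }
    let p: Vec<char> = pattern.chars().collect();
    let s: Vec<char> = path.chars().collect();
    go(&p, &s)
}

fn main() {
    assert!(matches("/favorites/", "/favorites/movies/alien/horror/"));
    assert!(matches("/favorites/movies/*/horror/", "/favorites/movies/alien/horror/"));
    assert!(!matches("/favorites/$", "/favorites/movies/"));
    // A URL that matches no pattern at all is assumed to be allowed.
    println!("all assertions hold");
}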

These patterns may apply to a specific user agent or to anyone not explicitly listed (i.e. the wildcard *).

The Sitemap directive points to an XML file listing as much or as little of the web domain as the website might like.
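
For illustration, a minimal sitemap (with made-up URLs) looks something like:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2022-03-26</lastmod>
  </url>
  <url>
    <loc>https://example.com/favorites/</loc>
  </url>
</urlset>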

The crawl delay is an informal proposal with widespread adoption that specifies how long a crawler must wait between successive requests. Google ignores this as they get to define the game being played, but most other crawlers respect it lest they be blocked.
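
Respecting the delay is simple in principle: remember when you last hit a host and sleep out the remainder before the next request. A rough sketch, assuming a ten second Crawl-Delay and using a placeholder in place of the actual HTTP fetch:

use std::thread::sleep;
use std::time::{Duration, Instant};

fn main() {
    // Assume this host's robots.txt contained "Crawl-Delay: 10".
    let crawl_delay = Duration::from_secs(10);
    let urls = ["https://example.com/a", "https://example.com/b"];

    let mut last_request: Option<Instant> = None;
    for url in urls {
        if let Some(start) = last_request {
            let elapsed = start.elapsed();
            if elapsed < crawl_delay {
                // Sleep out the remainder of the delay before hitting the host again.
                sleep(crawl_delay - elapsed);
            }
        }
        last_request = Some(Instant::now());
        // Placeholder for the actual HTTP request.
        println!("fetching {url}");
    }
}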

So what's the problem with robots.txt?

While the format above sounds simple and has been in existence for decades, the real-world usage of robots.txt is chaotic and haphazard at best.

When you crawl the web, you're inevitably going to hit land mines of misunderstanding.

  • Adversarial examples: a pattern consisting of a megabyte of repeated "A" characters
  • Automatically generated patterns
  • Patterns which result in exploding regular expressions (e.g. "/a/*/b/*/c/*/..." or "a/**************/b")
    • The example of "a/**************/b" is likely an error in most cases, but one webmaster noted that the longest matching pattern takes precedence, meaning that even if the mentioned pattern is equivalent to "a/*/b" it will win out over all shorter patterns (see the sketch after this list)
  • Null bytes in the middle of the robots.txt file
  • Byte order mark (BOM) at the beginning of the robots.txt file
  • Images encoded as Base64 (?)
  • Entirely different programming languages / file formats (such as entries for .htaccess)
  • Misspellings of basic words ("Disallow") and minor variations of instructions ("User-agent" becoming "User agent")
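
To make the precedence observation above concrete, here is a rough sketch (my own simplification, not any particular parser's logic) of rule selection: among the rules whose pattern matches a URL, the longest pattern wins, which is why a padded pattern like "a/**************/b" can deliberately outrank its shorter equivalent "a/*/b".

/// A single Allow/Disallow line from a robots.txt group.
struct Rule {
    pattern: String,
    allow: bool,
}

/// For brevity this only handles literal prefix patterns; `*` and `$`
/// would be handled as in the earlier matching sketch.
fn matches(pattern: &str, path: &str) -> bool {
    path.starts_with(pattern)
}

/// Among all rules whose pattern matches the path, the rule with the
/// longest pattern wins. If nothing matches, the path is allowed by default.
fn is_allowed(rules: &[Rule], path: &str) -> bool {
    rules
        .iter()
        .filter(|r| matches(&r.pattern, path))
        .max_by_key(|r| r.pattern.len())
        .map(|r| r.allow)
        .unwrap_or(true)
}

fn main() {
    let rules = vec![
        Rule { pattern: "/favorites/".to_string(), allow: false },
        Rule { pattern: "/favorites/movies/".to_string(), allow: true },
    ];
    // The longer (more specific) Allow rule outranks the shorter Disallow rule.
    assert!(is_allowed(&rules, "/favorites/movies/alien/"));
    assert!(!is_allowed(&rules, "/favorites/books/"));
    println!("all assertions hold");
}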

While the purist may wish to simply say "Your robots.txt file is broken so I'm allowed to ignore it", that's not how you can treat it in practice. In practice, it's your fault that you're crawling their website when they asked you not to.

That's not necessarily reasonable or fair but it's the game we have to play.

My solution: Texting Robots

To ensure sane coverage of edge cases I manually combined the unit tests of Google's robotstxt (C++) and Moz's Reppy (Python bindings to their C++ library rep-cpp), and then tested Texting Robots against 54.9 million robots.txt files.
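
The bulk test itself is conceptually simple: feed every robots.txt file to the parser and confirm it never panics, whatever bytes it is handed. Below is a rough sketch of that style of harness, where the parse_robots stand-in and the corpus/ directory are placeholders rather than the actual test code:

use std::fs;
use std::panic;

// Stand-in for the real parser entry point; the actual library parses the
// bytes and then answers allow/deny queries for the given agent.
fn parse_robots(_agent: &str, txt: &[u8]) -> usize {
    // Placeholder: count lines so the harness has something to call.
    txt.split(|&b| b == b'\n').count()
}

fn main() -> std::io::Result<()> {
    let mut failures = 0usize;
    // Assume one saved robots.txt file per entry in ./corpus/.
    for entry in fs::read_dir("corpus")? {
        let path = entry?.path();
        let txt = fs::read(&path)?;
        // The parser must survive arbitrary bytes without panicking.
        if panic::catch_unwind(|| parse_robots("TestBot", &txt)).is_err() {
            failures += 1;
            eprintln!("panicked on {}", path.display());
        }
    }
    println!("{failures} files caused a panic");
    Ok(())
}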