
Texting Robots: Taming robots.txt with 34 million tests

Handling robots.txt is a vital part of web crawling. This post is about the creation of the Texting Robots library for handling it.

If you are a webmaster or have ever written software to crawl the web, you've hopefully discovered robots.txt. Simply put, robots.txt is intended to tell automated crawlers which resources to include or exclude in their crawling process.

The "Robot Exclusion Protocol" was first proposed over a quarter century ago and decades later is used by most websites on the internet.

One might imagine that a quarter century old specification in broad use would be well defined. One might also be tremendously wrong.

The specification was never officially formalized. There's movement on that front now in the form of a draft IETF proposal, but the draft doesn't even cover the widely adopted Crawl-Delay, an essential addition if you're anyone but Google.

More complicated than a loose specification, however, is the number of edge cases that web scale reveals.

Why is robots.txt important?

Web crawling is wondrous. Web crawling enables us to take this wild and strangely interlinked behemoth network of compute and knowledge which we call the internet, fueled by billions of independent actors, and pull it together for use as a corpus of humanity.

That this was ever allowed or became standard practice is a marvel in and of itself. Hence the danger. Nefarious or ill behaving web crawlers can push us towards the tragedy of the commons.

Whenever I see someone writing a crawler that doesn't treat rate limiting and robots.txt compliance as first-class citizens, I feel a shiver down my spine.

The more web crawling is abused, the more likely it is for websites to prevent web crawling. When a website decides to prevent web crawling it's a coin flip as to whether they prevent the specific abusive bot or whether they restrict crawling more generally.

If we collectively aren't careful we'll slowly watch the societal benefit given to us by web crawling dissolve, leaving it in the hands of only a few massive companies.

The robots.txt specification

Skip this section if you're already familiar with robots.txt

The format is incredibly simple. A plain text file located at /robots.txt on the domain specifies a set of instructions:

User-Agent: Smerity
Allow: /
Disallow: /favorites/movies/*/horror/

User-Agent: *
Disallow: /favorites/
Crawl-Delay: 10
Sitemap: https://example.com/sitemap.xml

For a given user agent ("bot", "robot", "crawler", ...) a set of patterns indicates which URLs are allowed or disallowed. A simplified form of regular expressions is supported, specifically * for matching anything and $ for matching the end of the input. If a URL doesn't match any pattern then we assume it's allowed.

These patterns may apply to a specific user agent or to anyone not explicitly listed (i.e. the wildcard *).

The Sitemap directive points to an XML file listing as much or as little of the web domain as the website might like.

Crawl-Delay is an informal proposal with widespread adoption that specifies how long a crawler must wait between successive requests. Google ignores this as they get to define the game being played, but most other crawlers respect it lest they be blocked. I did say most other crawlers, right?
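
To make this concrete, here is a rough sketch of checking URLs against the example above using the Texting Robots library introduced later in this post. The Robot::new, allowed, delay, and sitemaps names are recalled from the crate's documentation rather than quoted from it, so treat the exact API as an assumption.

use texting_robots::Robot;

fn main() {
    // The example robots.txt from above (string lines left unindented so the
    // directives don't pick up leading whitespace)
    let txt = "User-Agent: Smerity
Allow: /
Disallow: /favorites/movies/*/horror/

User-Agent: *
Disallow: /favorites/
Crawl-Delay: 10
Sitemap: https://example.com/sitemap.xml";

    // Rules for the named agent take precedence over the wildcard group
    let smerity = Robot::new("Smerity", txt.as_bytes()).unwrap();
    assert!(smerity.allowed("https://example.com/favorites/movies/2001/comedy/"));
    assert!(!smerity.allowed("https://example.com/favorites/movies/2001/horror/"));

    // Any other crawler falls through to the wildcard (*) group
    let other = Robot::new("SomeOtherBot", txt.as_bytes()).unwrap();
    assert!(!other.allowed("https://example.com/favorites/"));
    println!("{:?} {:?}", other.delay, other.sitemaps);
}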

So what's the problem with robots.txt?

While robots.txt appears to be a simple specification at first glance, the scale and complexity of the web tease out every possible edge case. The real world usage of robots.txt, now decades old, remains chaotic and haphazard at best.

When you crawl the web you're going to inevitably hit land mines of misunderstanding and aggression.

  • Adversarial examples: a pattern consisting of a megabyte of repeated "A" characters
  • Automatically generated patterns that query a stupendous amount of text from a database and/or have odd redundant loops
  • Imaginary regular expression syntax (when robots.txt only supports the self-described set of * and $)
  • Patterns which result in exploding regular expressions (e.g. "/a/*/b/*/c/*/..." or "a/**************/b")
    • The example of "a/**************/b" is likely an error in most cases, but one webmaster ingeniously noted that the loose robots.txt spec states longest matching patterns take precedence, and hence even though the mentioned pattern is equivalent to "a/*/b" it should win out over any shorter patterns (a small mitigation sketch follows this list)
  • Null bytes in the middle of the robots.txt file
  • Byte order mark (BOM) at the beginning of the robots.txt file
  • Images encoded as Base64 added as rules (...but why?)
  • Entirely different programming languages / file formats (such as entries for .htaccess weirdly embedded)
  • Misspellings of basic words ("Disallow") and minor variations ("User-agent" becoming "User agent")
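
One simple defensive tactic against the exploding pattern case is to collapse runs of * before compiling a pattern, since consecutive wildcards match exactly the same URLs as a single one. This is a minimal sketch of the idea rather than how any particular library handles it; note that the original pattern length would still need to be kept if longest-match precedence is based on it, given the webmaster trick above relies on exactly that.

// Sketch: runs of '*' match the same URLs as a single '*', so collapse them
// before compiling. For example, collapse_stars("a/**************/b") == "a/*/b".
fn collapse_stars(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len());
    for c in pattern.chars() {
        // Skip a '*' that directly follows another '*'
        if c == '*' && out.ends_with('*') {
            continue;
        }
        out.push(c);
    }
    out
}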

If you're deeply curious you can check out a few examples of rules that were particularly odd. Warning: I didn't directly link to the text as it's 4.5 megabytes of madness.

While the purist may wish to simply say "Your robots.txt file is broken so I'm allowed to ignore it", that's not how you can treat it in practice. In practice it's your fault if you crawl their website when they've asked you not to. If they don't like you, we return to the tragedy of the commons where the end result is Disallow: /.

That's not necessarily reasonable or fair but it is the game we have to play.

The Texting Robots library

I have extensive past experience crawling the web. While other libraries exist for handling robots.txt, they either didn't cater to my use cases or were no longer actively developed.

The most important thing to me was to feel certain that standard cases and edge cases were well handled. Being unable to understand why an issue is occurring, and hence unable to fix the inevitable edge cases the web will attack you with, is a non-starter.

Rust was chosen as I've recently become quite fond of it as a language and its core principles align well with my aims: safe, correct, and fast. Rust also features first-class support for WebAssembly (WASM) and trivial integration with Python. As an example, a WASM proof of concept already exists for Texting Robots that runs only 50-100% slower than native code.

To ensure safety, I've put the library through fuzz testing and a stress test of 38.4 million robots.txt URLs. Thanks to Common Crawl for supplying the trove of robots.txt files.
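
For the curious, a fuzzing harness for a parser like this can be tiny. Below is a sketch of what a cargo-fuzz target might look like, assuming the crate exposes a Robot::new constructor and an allowed method (names recalled from its documentation, so treat them as assumptions); the repository's actual fuzz targets may differ.

#![no_main]
use libfuzzer_sys::fuzz_target;
use texting_robots::Robot;

// Feed arbitrary bytes in as a robots.txt file and ensure that parsing and
// matching never panic, whatever the input looks like.
fuzz_target!(|data: &[u8]| {
    if let Ok(robot) = Robot::new("FuzzBot", data) {
        let _ = robot.allowed("https://example.com/some/path");
    }
});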

For correctness I've manually combined the unit tests of Google's robotstxt (C++) and Moz's Reppy (Python bindings to their C++ library rep-cpp) to ensure sane coverage of robots.txt and potential edge cases.

The unit tests from these two libraries serve as a strong source of knowledge as many of the tests are drawn from incidents found in the real world. Hilariously, certain aspects of the Google and Moz interpretations (as seen in their unit tests) disagree with each other. When this occurred I deferred to as much common sense as I was able to muster.

Speed, while important, remains a secondary concern to the former two points. When a speed-up might result in incorrect coverage or substantial complexity it was not pursued. Speed will continue to be worked on as needed and appropriate, built on the stable foundation of the stress tests and unit tests above.

Speed and efficiency are always important but are not core to robots.txt parsing. Even when crawling hundreds of pages per second it's unlikely to be a bottleneck. Still:

  • Reppy (with numbers from their repository):
    • 100k parses per second
    • 1M URL checks per second
  • Texting Robots:
    • 16.46µs per parsed robots.txt = 61k parses per second
    • 980.00ns per allow check = 1.02 million checks per second

These results are for a single core parsing Twitter's robots.txt and can be replicated by running cargo run --release in the repository.

Reppy's numbers are impressive and come thanks to a custom regular expression engine. Given robots.txt only allows * and $ there are many possibilities for speed-ups. Unfortunately their implementation does increase complexity substantially and edge cases can sneak through.

Texting Robots backs off to full regular expressions using the regex crate but features two shortcut modes, enabled by default.
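
The fallback itself is roughly a matter of escaping the pattern and then restoring the two supported operators. A minimal sketch of that translation (not the library's actual implementation) using the regex crate:

use regex::Regex;

// Sketch: translate a robots.txt pattern into a regex by escaping everything,
// then restoring '*' (match anything) and a trailing '$' (end of URL).
fn pattern_to_regex(pattern: &str) -> Regex {
    let anchored_end = pattern.ends_with('$');
    let body = if anchored_end { &pattern[..pattern.len() - 1] } else { pattern };
    let escaped = regex::escape(body).replace(r"\*", ".*");
    let regex_str = if anchored_end {
        format!("^{}$", escaped)
    } else {
        format!("^{}", escaped)
    };
    Regex::new(&regex_str).expect("escaped pattern should always compile")
}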

If no regular expression operator is included (i.e. $ and * are not used) then the pattern is equivalent to a starts_with check.

If the pattern only involves * then a simple matcher can be used. The pattern is broken into spans separated by * and we ensure that each span occurs sequentially in the target. The resulting implementation is short and readable.
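
A rough sketch of that span-based matcher follows. It is a simplification rather than the library's exact code, and with a pattern containing no * at all it reduces to the starts_with check described above.

// Sketch: match a '*'-only pattern by splitting it on '*' and requiring each
// span to appear in order within the path.
fn star_match(pattern: &str, path: &str) -> bool {
    let mut remaining = path;
    for (i, span) in pattern.split('*').enumerate() {
        if span.is_empty() {
            continue;
        }
        if i == 0 {
            // The first span is anchored to the start of the path
            if !remaining.starts_with(span) {
                return false;
            }
            remaining = &remaining[span.len()..];
        } else {
            // Later spans may appear anywhere after the previous match
            match remaining.find(span) {
                Some(idx) => remaining = &remaining[idx + span.len()..],
                None => return false,
            }
        }
    }
    true
}

For example, star_match("/favorites/movies/*/horror/", "/favorites/movies/2001/horror/house") returns true.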

The experience of using Rust

This was my first major project written in Rust.

  • The nom parsing library, while likely overkill here, was particularly helpful in handling the annoyingly flexible input requirements (i.e. arbitrarily many spaces here, dash or space there, colon or ...); a rough sketch follows this list
  • Integrated documentation and testing
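
As a flavour of what that flexibility handling looks like, here is a rough nom-based sketch (not the library's actual grammar) for a single directive line, tolerating arbitrary spaces around the separator:

use nom::{
    bytes::complete::{tag, tag_no_case, take_till},
    character::complete::space0,
    IResult,
};

// Sketch: parse a "Disallow: <value>" line, allowing any amount of whitespace
// around the colon. A real grammar also has to accept variations such as
// "User-agent" vs "User agent", misspellings, and missing values.
fn disallow_line(input: &str) -> IResult<&str, &str> {
    let (input, _) = tag_no_case("disallow")(input)?;
    let (input, _) = space0(input)?;
    let (input, _) = tag(":")(input)?;
    let (input, _) = space0(input)?;
    take_till(|c| c == '\r' || c == '\n' || c == '#')(input)
}

Here disallow_line("Disallow:   /private/") yields "/private/" with nothing left over.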

So robots.txt is solved?

Far from it.

The missing parts of Texting Robots

Texting Robots will continue to be refined as new issues are discovered and dealt with.

Texting Robots doesn't yet handle the entire robots.txt process end to end. What is in scope and out of scope will continue to be considered, but I hope it can serve as the basis for other work.

  • WASM implementation for use in other languages
  • What the HTTP error codes encountered when fetching robots.txt mean in terms of permission to crawl
  • How to handle the 429 ("Too Many Requests") response from a server during crawling (and especially during robots.txt fetching)
  • Whether robots.txt input should be truncated at the library level to 500 kibibytes, as recommended by Google, or whether this should be left to the user calling the library (a small sketch of caller-side truncation follows this list)
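
If truncation is left to the caller it is only a couple of lines; a sketch of the caller-side version (the constant name is purely illustrative):

// Sketch: cap a fetched robots.txt body at Google's recommended 500 KiB
// before handing it to the parser.
const MAX_ROBOTS_BYTES: usize = 500 * 1024;

fn truncate_robots(body: &[u8]) -> &[u8] {
    &body[..body.len().min(MAX_ROBOTS_BYTES)]
}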

Texting Robots cannot guard you against all possible edge cases but should give you a strong starting point from which to ensure you and your code constitute a positive addition to the internet at large.

The missing parts of robots.txt

  • Thousands or millions of domains backing on to a single backend
  • An adaptive scheme that reflects real world traffic and availability
  • Moving to a more standard format rather than the loose robots.txt spec

My dream proposal would be robots.json, generated statically or dynamically.