Texting Robots: Taming robots.txt with Rust and 34 million tests

Most recent update: 28th March 2022 - 15:10:11 - 18356 characters

Image above generated using Midjourney - thanks David Holz

In the distant past I was part of Common Crawl and contributed over 2.5 petabytes of data and 35 billion webpages to that massive archive. From that experience and many others like it I've seen the complexity the web can throw at you. A well-functioning robots.txt parser is the first step in any web crawling enterprise, small or large, and yet it's surprisingly difficult to get a battle-tested implementation.

To remedy this for myself I set about creating Texting Robots, a robots.txt parsing library written in Rust aiming for reliability and correctness at scale. In the future I hope through FFI and WASM bindings to offer this as a library for many other languages.

Surely a robots.txt parser isn't that complicated though... right? The rest of this article documents the true terror behind this deceptively simple-looking foe.

The "Robot Exclusion Protocol" was first proposed over a quarter century ago and decades later is used by most websites on the internet. The intent of robots.txt is to tell automated crawlers the list of resources to include or exclude in their crawling process.

One might imagine that a quarter century old specification in broad use would be well defined. One might also be tremendously wrong.

The specification was never officially formalized. There's movement on that front now in the form of a draft IETF proposal but the draft proposal doesn't even cover the widespread Crawl-Delay, an essential addition if you're anyone but Google. The formalization of such a specification will also be unlikely to update the existing robots.txt files scattered across the web.

More complicated than a loose specification however is the number of edge cases that web scale reveals.

Why is robots.txt important?

Web crawling is wondrous. Web crawling enables us to take this wild and strangely interlinked behemoth network of compute and knowledge which we call the internet, fueled by billions of independent actors, and pull it together for use as a corpus of humanity.

That this was ever allowed or became standard practice is a marvel in and of itself. Hence the danger. Nefarious or ill behaving web crawlers can push us towards the tragedy of the commons.

Whenever I see someone writing a crawler that doesn't have rate limiting and robots.txt compliance as first class citizens I feel a shiver down my spine.

The more web crawling is abused, the more likely it is for websites to prevent web crawling. When a website decides to prevent web crawling it's a coin flip as to whether they prevent the specific abusive bot or whether they restrict crawling more generally.

If we collectively aren't careful we'll slowly watch the societal benefit given to us by web crawling dissolve and be granted to only a few massive companies.

The robots.txt specification

Skip this section if you're already familiar with robots.txt

The format is incredibly simple at a glance: a plaintext file located at /robots.txt on the domain specifies a set of instructions:

User-Agent: Smerity
Allow: /
Disallow: /favorites/movies/*/horror/

User-Agent: *
Disallow: /favorites/
Crawl-Delay: 1.5
Sitemap: https://example.com/sitemap.xml

For a given user agent ("bot", "robot", "crawler", ...) a set of patterns indicates which URLs are allowed or disallowed. A simplified form of regular expressions is allowed - specifically * for matching anything and $ for matching the end of the input. If a URL doesn't match any pattern then we assume it's allowed.

These patterns may apply to a specific user agent or to anyone not explicitly listed (i.e. the wildcard *).

Sitemap provides an XML file listing as much or as little of the web domain as the website might like.

The crawl delay is an informal proposal with widespread adoption that specifies how long a crawler must wait between successive requests. Google ignores this as they get to define the game being played but most other crawlers respect it lest they be blocked. I did say most others, right?
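
To make the crawl delay concrete, here's a minimal politeness sketch in Rust. It isn't from any particular library: fetch is a hypothetical stand-in for your HTTP client of choice and a real crawler would track the delay per domain rather than per call.

use std::{thread, time::Duration};

// Hypothetical stand-in for an HTTP client call.
fn fetch(url: &str) {
    println!("GET {url}");
}

// Honour a domain's Crawl-Delay (1.5 seconds in the example above) by sleeping
// between successive requests to that domain.
fn crawl_domain(urls: &[&str], crawl_delay_secs: f64) {
    for url in urls {
        fetch(url);
        thread::sleep(Duration::from_secs_f64(crawl_delay_secs));
    }
}

fn main() {
    crawl_domain(&["https://example.com/a", "https://example.com/b"], 1.5);
}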

So what's the problem with robots.txt?

While robots.txt appears a simple specification at first glance the scale and complexity of the web teases out every possible edge case. The real world usage of robots.txt remains chaotic and haphazard at best.

To test my robots.txt parsing implementation I constructed a test harness for Texting Robots against the 34 million robots.txt files in Common Crawl's January 2022 archive. The robots.txt crawl archive is 140 gigabytes in size compressed, containing both the requests and responses encoded in Web Archive (WARC) format.

After analyzing the results of this stress test I was able to find many examples of the land mines of misunderstanding and aggression that are inevitable upon crawling the web at scale. As a brief summary of what you'll see when you crawl enough robots.txt files:

  • Adversarial examples: a pattern consisting of a megabyte of repeated "A" characters
    • The creator is a security pen tester so at least they're on brand?
  • Automatically generated patterns that query a stupendous amount of text from a database and/or have odd redundant loops
    • It's fun to see /shop/Filing/Document Storage Wallets and Files and Files and Files and... repeated for 2KB!
  • Imaginary regular expression syntax (when robots.txt only supports the self-described set of * and $)
  • Patterns which result in exploding regular expressions (e.g. "/a/*/b/*/c/*/d/*/..." or "a/**************/b")
    • The example of "a/**************/b" is likely an error in most cases, but one webmaster ingeniously noted that the loose robots.txt spec states longest matching patterns take precedence; hence, even though the pattern is equivalent to "a/*/b", it should win out over any shorter patterns (a simple defence for these patterns is sketched just after this list)
  • Null bytes in the middle of the robots.txt file
  • Byte order mark (BOM) at the beginning of the robots.txt file
  • Images encoded as Base64 added as rules (...but why?)
  • Entirely different programming languages / file formats (such as entries for .htaccess weirdly embedded)
  • Misspellings of basic words ("Disallow") and minor variations ("User-agent" becoming "User agent")
    • Google's robotstxt library puts real work into enumerating many "Disallow" variants and, according to their responses to issues, these misspellings are chosen based upon datasets they collected!
  • ...and predictably, many files that were never even intended to be robots.txt
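
As promised in the list above, one simple defence against exploding wildcard patterns is to collapse runs of consecutive * before compiling anything. The sketch below is illustrative rather than Texting Robots' actual code; if you honour the "longest matching pattern wins" rule you'd keep the original pattern length around for precedence even after collapsing.

// Collapse any run of consecutive '*' into a single '*'. The matching semantics
// are unchanged, but the pattern can no longer blow up a regex engine.
fn collapse_wildcards(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len());
    let mut prev_star = false;
    for c in pattern.chars() {
        if c == '*' && prev_star {
            continue; // skip repeated '*'
        }
        prev_star = c == '*';
        out.push(c);
    }
    out
}

fn main() {
    assert_eq!(collapse_wildcards("a/**************/b"), "a/*/b");
    assert_eq!(collapse_wildcards("/a/*/b/*/c/"), "/a/*/b/*/c/");
}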

The adversarial robots.txt file from a security pen tester

If you're deeply curious you can check out a few examples of rules that were particularly odd. Warning: I didn't directly link to the text as it's 4.5 megabytes of madness.

While the purist may wish to simply say "Your robots.txt file is broken so I'm allowed to ignore it" that's not how you can treat it in practice. In practice it's your fault that you're crawling their website when they asked you not to. If they don't like you we return to the tragedy of the commons where the end result is Disallow: /.

That's not necessarily reasonable or fair but it is the game we have to play.

The Texting Robots library

I have extensive past experience with crawling, and handling the complexities of robots.txt has always been an issue. While other robots.txt libraries exist they either didn't cater to my use cases, weren't in my set of targeted languages, or were no longer actively developed.

Developing a robots.txt parser from scratch was also a good experience for me. Being able to understand why an issue is occurring, and hence how to fix the inevitable edge cases the web will attack you with, is a necessity. This library was architected for that exact purpose. Understanding the intricacies of each of these issues is even more valuable as a learning experience.

Rust was chosen as I've recently become quite fond of it as a language and its core principles align well with my aims: safety, correctness, and speed. Rust also features first class support for WebAssembly (WASM) and trivial integration with Python. As an example, a WASM proof of concept already exists for Texting Robots that sees only a 100-200% slowdown compared to link time optimized native code. No optimization has yet been done on the WASM / WASI side.

To ensure safety I've put the library through fuzz testing and a stress test of 34 million robots.txt files to ensure no unexpected panics. Thanks to Common Crawl for supplying the underlying robots.txt dataset as part of their standard crawl archives.
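
For reference, a cargo-fuzz target for this kind of parser can be as small as the sketch below. It's illustrative rather than the exact harness in the repository; the only goal is "no panics on arbitrary bytes", not correctness.

// fuzz/fuzz_targets/parse.rs - run with `cargo fuzz run parse`
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    // Any parse entry point works here; see the crate docs for the current API.
    let _ = texting_robots::Robot::new("FuzzBot", data);
});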

The view from htop when running 80 cores, each fuzzing Texting Robots

For correctness I've manually combined the unit tests of Google's robotstxt (C++) and Moz's Reppy (Python bindings to their C++ library rep-cpp) to ensure sane coverage of robots.txt and potential edge cases.

The unit tests from these two libraries serve as a strong source of knowledge as many of the tests are populated from incidents found in the real world. Hilariously, certain aspects of the Google and Moz interpretations (as seen in their unit tests) disagree with each other. When this occurred the author deferred to as much common sense as they were able to muster.

Speed, while important, remains a secondary concern to the former two points. Where a speed-up might result in incorrect coverage or substantial complexity it was not pursued. Speed will continue to be worked on as needed and appropriate, building upon the stable foundation of the stress tests and unit tests above, but was not my initial aim.

Speed and efficiency are in general not core to robots.txt parsing. Even when crawling hundreds of pages per second it's unlikely to be a bottleneck given the number of domains is dwarfed by the number of pages from each domain.

For benchmarking however:

  • Reppy (using numbers from their repository):
    • 100k parses per second
    • 1M URL checks per second
  • Texting Robots:
    • 16.4µs per parsed robots.txt = 61k per second
    • 980ns per allow check = 1.02 million per second
  • Texting Robots with link time optimization (LTO):
    • 10.86µs per parsed robots.txt = 92k per second
    • 896ns per allow check = 1.12 million per second

These results are for a single thread parsing Twitter's robots.txt and can be replicated by running cargo run --release in the repository. The parsing speed will be dependent on each robots.txt file but the existing numbers are well past what's needed for the vast majority of applications.

How is Reppy so fast?

Reppy's numbers are impressive and come thanks to a custom regular expression engine. Given robots.txt only allows * and $ there are many possibilities for speed-ups. Unfortunately their implementation does increase complexity substantially and edge cases can sneak through.

Texting Robots backs off to full regular expressions using the regex crate but features two shortcut modes enabled by default that provide substantial improvements.

If no regular expression operator is included (i.e. $ and * are not used) then the pattern is equivalent to a starts_with check.

If the regular expression only involves * then a simple matcher can be used. The pattern is broken into spans separated by * and we ensure that each of the spans occurs sequentially within the target. The resulting implementation is short and readable.
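
A minimal sketch of that fast path, assuming patterns are anchored to the start of the path and contain no $ (this mirrors the idea rather than Texting Robots' exact implementation):

// Split the pattern on '*' and check that each literal span appears in order.
fn wildcard_match(pattern: &str, target: &str) -> bool {
    let spans: Vec<&str> = pattern.split('*').collect();

    // The first span must match at the very start, as robots.txt patterns are
    // anchored to the beginning of the URL path.
    let mut rest = match target.strip_prefix(spans[0]) {
        Some(rest) => rest,
        None => return false,
    };

    // Every following span only needs to appear somewhere after the previous match.
    for &span in &spans[1..] {
        match rest.find(span) {
            Some(idx) => rest = &rest[idx + span.len()..],
            None => return false,
        }
    }
    true
}

fn main() {
    assert!(wildcard_match("/favorites/*/horror/", "/favorites/movies/2001/horror/"));
    assert!(!wildcard_match("/favorites/*/horror/", "/favorites/movies/comedy/"));
}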

The experience of using Rust

This is my first major project in Rust and first released crate. I have been using it for some time and the promises have held true so far.

The most wondrous experience was in learning a complex part of Rust (specifically how Atomic Reference Counting (Arc) interacts with Self on structs) entirely through compiler errors. Whilst that experience wasn't from this project the default mode with Rust has been "friendly compiler error driven development".

While the ecosystem can still be rough in places the majority of core crates (read: libraries) are of high quality. Documentation is common and automatically generated as part of the package manager process.

Of particular note for this library was the nom parsing library, which was likely overkill here but provided particularly helpful handling for robots.txt's annoyingly flexible input requirements (i.e. infinite spaces here, dash or space there, colon or ...). Props to my friend Adam Chalmers for introducing nom through parsing text, parsing bitstreams, and parsing DNS headers.
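
To give a flavour of why parser combinators help here, below is a rough nom sketch for a single Disallow line, tolerating arbitrary whitespace and any casing of the key. It's illustrative and not Texting Robots' actual grammar.

use nom::{
    bytes::complete::{tag_no_case, take_till},
    character::complete::{char, space0},
    IResult,
};

// Parse "Disallow: <pattern>", allowing leading whitespace, spaces before the
// colon, and any casing of the key.
fn disallow_line(input: &str) -> IResult<&str, &str> {
    let (input, _) = space0(input)?;
    let (input, _) = tag_no_case("disallow")(input)?;
    let (input, _) = space0(input)?;
    let (input, _) = char(':')(input)?;
    let (input, _) = space0(input)?;
    // The pattern is everything up to a comment or the end of the line.
    take_till(|c| c == '#' || c == '\r' || c == '\n')(input)
}

fn main() {
    println!("{:?}", disallow_line("  DISALLOW :   /favorites/"));
}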

The overall Rust community, being both technically engaging as well as kind and welcoming, is worthy of a shout out.

So robots.txt is solved?

Far from it!

For as long as the internet has strange and wondrous websites and flaky / old hardware you can expect robots.txt to remain a bounty of madness.

The missing parts of Texting Robots

Texting Robots will continue to be refined as new issues are discovered and dealt with.

Texting Robots purposely doesn't fetch the robots.txt file itself to limit the scope of the library. With this smaller scope Texting Robots should be able to support simple integrations through FFI or WASM with languages other than Rust.
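
A rough sketch of that fetch-it-yourself workflow is below; the names shown are illustrative, so consult the crate documentation for the exact, current API.

use texting_robots::Robot;

fn main() {
    // Fetch robots.txt however you like (reqwest, curl, a WARC archive, ...)
    // and hand the raw bytes plus your user agent to the library.
    let robots_txt = "User-Agent: *\nDisallow: /favorites/\nCrawl-Delay: 1.5";
    let r = Robot::new("SmerityBot", robots_txt.as_bytes()).unwrap();

    assert!(r.allowed("https://example.com/about/"));
    assert!(!r.allowed("https://example.com/favorites/movies/"));

    // Crawl-Delay and Sitemap entries are surfaced for the caller to act on.
    println!("delay: {:?}, sitemaps: {:?}", r.delay, r.sitemaps);
}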

Battle testing this interpretation of the robots.txt specification against the web is easier when done with friends. Or to misquote Linus Torvalds:

Given enough eyeballs, all robots.txt parsing bugs are shallow.

If you have experience with WASM you could get rolling right away based on the WASI proof of concept.

Beyond that, the library user must handle a set of further questions, including (a rough sketch follows the list):

  • How do HTTP error codes encountered when fetching robots.txt impact your permission to crawl?
  • How do you rate limit yourself in the event of 429 ("Too Many Requests") responses from the server?
    • Was a Retry-After header included that suggests a different delay compared to the one stated in robots.txt?
  • Should you truncate robots.txt to 500 kibibytes as recommended by Google?
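
As a starting point for the first and last of these questions, here's a hedged sketch loosely following Google's documented guidance. The status code policy and the 500 KiB cap are assumptions to tune for your crawler, not behaviour provided by Texting Robots.

// Truncation limit recommended by Google's robots.txt documentation.
const MAX_ROBOTS_BYTES: usize = 500 * 1024;

enum CrawlPermission {
    AllowAll,       // robots.txt effectively unavailable (4xx): assume no restrictions
    DisallowAll,    // server errors (5xx) or 429: back off and retry later
    Rules(Vec<u8>), // a body worth handing to a robots.txt parser
}

fn interpret_robots_response(status: u16, mut body: Vec<u8>) -> CrawlPermission {
    match status {
        200..=299 => {
            // Guard against multi-megabyte (or adversarial) robots.txt files.
            body.truncate(MAX_ROBOTS_BYTES);
            CrawlPermission::Rules(body)
        }
        // 429 and 5xx mean "come back later": honour any Retry-After header before
        // falling back to the Crawl-Delay from a previously cached robots.txt.
        429 | 500..=599 => CrawlPermission::DisallowAll,
        // Everything else in the 4xx range: the file effectively doesn't exist.
        400..=499 => CrawlPermission::AllowAll,
        _ => CrawlPermission::DisallowAll,
    }
}

fn main() {
    assert!(matches!(interpret_robots_response(404, Vec::new()), CrawlPermission::AllowAll));
    assert!(matches!(interpret_robots_response(503, Vec::new()), CrawlPermission::DisallowAll));
}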

Texting Robots cannot guard you against all possible edge cases but should give you a strong starting point from which to ensure you and your code constitute a positive addition to the internet at large.

The missing parts of robots.txt

  • Thousands or millions of domains backing on to a single backend
    • As an approximation you can bucket by IP address and be aware of when one of the domains in the bucket requests a slower crawl rate
    • 429 "Too Many Requests" responses are the only way to correctly rate limit yourself as you might not know all these domains are hitting a single set of servers
  • An adaptive scheme that reflects real world traffic and availability
  • Moving to a more standardized format rather than the loose robots.txt spec
  • In what instances should we morally and righteously ignore robots.txt?

My dream proposal would be a robots.json endpoint which could simultaneously remove the parsing complexity of the robots.txt spec, allow for backward compatibility (i.e. compile down to robots.txt), and potentially react dynamically to different automated crawlers / different situations.

Who knows when that dream world might appear however. For now I'll settle for the chaos that is robots.txt.