Smerity.com: Crawling in three stories

Crawling in three stories

Most recent update: 8th February 2021 - 02:56:03 - 6586 characters

Inspired by "A Facebook crawler was making 7M requests per day to my stupid website" (Hacker News), I thought it worth revisiting my history with web crawlers. I'll do so through a few specific stories. For several of them the names of the companies have been removed.

My background in crawling

A web crawler was one of the first substantial projects I ever wrote in Python thanks to a summer camp called NCSS which teaches high school students programming. Since then I've returned to be a tutor almost every year for over a decade - though web crawlers are sadly no longer the project.

Far later than that I became the only engineer at Common Crawl, a non-profit dedicated to providing an open repository of web crawl data that can be accessed and analyzed by anyone. The dataset has been used in a million ways but my favourites include the largest open web graph at the time, composed of 3.5 billion pages and 128 billion hyperlinks, and the use of the dataset as a source of text within my field of machine learning.

According to a quick summation of the statistics, I crawled over 2.5 petabytes and 35 billion webpages mostly by myself. Most months the AWS bill was only $3-4k thanks to utilizing AWS Spot Instances and AWS gifting storage for the dataset.

Luckily there are far smarter people and tools working on the Common Crawl initiative now!

Note: being the sole engineer and crawling billions of web pages is ... not fun. You're on call 24/7 handling a cluster of hundreds of nodes, and the emails you receive bleary-eyed at 4am can send a spike of adrenaline and fear through you akin to a sudden drop when trapped in an elevator that you yourself designed with sticky tape and glue.

Ancient history: Slashdot

Reasoning behind my excessive tax on the servers? I'm a young student, just left high school, about to go into university, and I'm looking into doing some NLP (natural language processing)

I was hoping on creating a tool that will automatically extract some of the most common memes ("But does it run Linux?" and "In Soviet Russia..." style jokes etc) and I needed corpus - I wrote a primitive (threaded :S) web crawler and started it before I considered robots.txt. I haven't ever really written a web crawler before, and when I suddenly remembered and checked robots.txt it had already been running for some time.

I do intensely apologise.

So, if at all possible, can I get unblocked? And on top of that, I know it's a hard ask, but if I obey robots.txt (religiously lol) can I still crawl Slashdot? Even just from my preliminary research I'm turning up some interesting things. Extracting important collocations from the corpus I've attained so far has resulted in some interesting collocations -

The spread of Common Crawl responses

Given the number of pages crawled at Common Crawl, the relatively low number of issues that occurred is stunning. As noted, the project has far better talent and tooling now, but at the time I would do my best to handle all these cases with the help of the only other member of Common Crawl, the org's director.

The small website

If the internet were a hundred billion little blinking lights, how frequently do the lights blink out? How likely is it for you to have crawled their page recently?

Web pages fail frequently. Non-technical and somewhat technical teams can decide you were at fault.

"You crawled my website 8 times in the last day, right before it broke!"

I wanted to reply with "... Perhaps your website not handling a request per three hours is your bigger problem ...", but instead we'd work with the person as best we could, usually pointing them toward more standard technical issues like a low quality hosting provider or a very odd PHP/Perl/Python/... script on their system.

What frustrated me most was that these emails were frequently the most aggressive. They read like a ransom demand from space pirates who were down to their last twelve doubloons.

The BIG website

To be polite to websites you want to segment a crawl according to domain name and/or IP address. This allows you to rate limit each segment to a sane number of requests per second, set either by a reasonable guess or by the rate the site requests in robots.txt.
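As a rough illustration of that idea, here is a minimal sketch of a per-host rate limiter. It is not the code Common Crawl ran; the class name `HostRateLimiter`, the two-second default delay, and the `per_host_delay` override table are all assumptions made for the example. The clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time
from urllib.parse import urlsplit


class HostRateLimiter:
    """Enforce a minimum delay between requests to the same host (hypothetical helper)."""

    def __init__(self, default_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.default_delay = default_delay  # assumed guess: seconds between hits to one host
        self.per_host_delay = {}            # optional overrides, e.g. a delay taken from robots.txt
        self.next_allowed = {}              # host -> earliest time the next request may go out
        self.clock = clock
        self.sleep = sleep

    def wait_time(self, url):
        """Seconds to wait before `url` may be fetched (0.0 if it may go now)."""
        host = urlsplit(url).hostname
        return max(0.0, self.next_allowed.get(host, 0.0) - self.clock())

    def acquire(self, url):
        """Block until the host of `url` may be hit, then reserve the next slot."""
        host = urlsplit(url).hostname
        delay = self.wait_time(url)
        if delay > 0:
            self.sleep(delay)
        interval = self.per_host_delay.get(host, self.default_delay)
        self.next_allowed[host] = self.clock() + interval
```

A crawler would call `limiter.acquire(url)` right before each fetch. Keying on the hostname is the simplest choice here; keying on the resolved IP address instead is one way to also cover many hostnames that share the same servers.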

Certain big websites however run across huge and non-transparent IP ranges. The worst case may be a web platform that fronts tens or hundreds of thousands of websites.

We're fans of your project, and think it's a really cool dataset.
We're not very big fans of your crawler though.
We saw bursts of crawl activity of around 3000 requests per second a couple weeks back. The crawl pattern and rate caused cache thrash and degraded the user experience.

The team behind this were incredibly nice. My only regret was not solving the problem sooner for them as it represented quite a complicated situation.

For some time I was pretty hard on myself but it turns out even Google have similar issues:

I used to work for a top five web site and even we couldn’t get ahold of anyone - one day Google decided to start crawling us at a rate of 120k rps and it was killing the site by pulling ancient content that was 100% cache miss. No way for us to get in touch with Google officially, our billionaire CEO hadn’t traded numbers with their billionaire CEO so no help there, one of the developers had a college buddy that landed at Google and that guy was able to use some sort of internal mailing list to get them to drop the crawl rate down to 20k rps.

The litigious website

Certain websites decide that not only do they not want to set robots.txt but they want to treat each and every instance of crawling their site as if it were an armed robbery.

tl;dr

Crawling should be an act of love

You appreciate a resource so much that you want to back it up, analyze it, understand it, ...

Given that:

Crawling is a tragedy of the commons so don't contribute to the tragedy
Pay attention to robots.txt for each and every website - and if it's missing proceed with caution
"Don't be evil"
Provide a detailed way to contact you
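The robots.txt and contact points above can be sketched with Python's standard `urllib.robotparser`. This is only an illustration under stated assumptions: the crawler name, the contact URL and email address, and the sample robots.txt content are all made up, and the file is parsed from a string rather than fetched so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler identity; the User-Agent carries a way to reach the operator.
USER_AGENT = "ExampleCrawler"
HEADERS = {"User-Agent": "ExampleCrawler/0.1 (+https://example.org/crawler; crawler@example.org)"}

# Made-up robots.txt content standing in for what a site might serve.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def load_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
allowed = rules.can_fetch(USER_AGENT, "https://example.com/index.html")    # not disallowed
blocked = rules.can_fetch(USER_AGENT, "https://example.com/private/data")  # under /private/
delay = rules.crawl_delay(USER_AGENT)  # seconds requested by the site, or None if absent
```

When a site has no robots.txt at all there is nothing to parse, so "proceed with caution" in practice means falling back to a conservative delay of your own choosing.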

Present and past versions: 8th February 2021 - 02:56:03, 8th February 2021 - 02:55:11, 17th June 2020 - 12:47:34, 12th June 2020 - 13:36:11
