Open Source Web Crawl Copy of the Internet

A nonprofit called Common Crawl is using its own Web crawler to build a giant copy of the Web and make it accessible to anyone. The organization offers more than five billion Web pages, available for free, so that researchers and entrepreneurs can attempt projects that would otherwise be possible only for those with resources on the scale of Google’s.

The ccBot crawler is a distributed crawling infrastructure that makes use of the Apache Hadoop project. We use MapReduce to process and extract crawl candidates from our crawl database. This candidate list is sorted by host (domain name) and then distributed to a set of spider (bot) servers. We do not use Nutch for crawling, but instead use a custom crawl infrastructure to strictly limit the rate at which we crawl individual web hosts. The resulting crawl data is then post-processed (for link extraction and deduplication) and reintegrated into the crawl database.
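The pipeline itself is internal to Common Crawl, but the two ideas in that description — grouping crawl candidates by host and strictly limiting how often any one host is hit — can be sketched roughly as follows. The function names, the one-URL-per-list-entry input, and the one-request-per-host-every-N-seconds policy are illustrative assumptions, not Common Crawl's actual code.

```python
import time
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical politeness policy: at most one request per host every N seconds.
MIN_DELAY_PER_HOST = 5.0  # illustrative value, not Common Crawl's setting


def group_candidates_by_host(candidate_urls):
    """Bucket crawl candidates by host, mirroring the 'sorted by host' step."""
    by_host = defaultdict(list)
    for url in candidate_urls:
        host = urlparse(url).netloc.lower()
        if host:
            by_host[host].append(url)
    return by_host


def rate_limited_fetch(by_host, fetch):
    """Walk the per-host queues, never hitting the same host too quickly.

    `fetch` is a caller-supplied function (e.g. an HTTP GET); it is left as a
    placeholder so the sketch stays self-contained.
    """
    last_hit = {}  # host -> timestamp of the last request to that host
    queues = {host: list(urls) for host, urls in by_host.items()}
    while any(queues.values()):
        for host, urls in queues.items():
            if not urls:
                continue
            if time.time() - last_hit.get(host, 0.0) < MIN_DELAY_PER_HOST:
                continue  # too soon for this host; move on to another queue
            url = urls.pop(0)
            last_hit[host] = time.time()
            fetch(url)
        time.sleep(0.1)  # avoid a busy loop while every host is cooling down


if __name__ == "__main__":
    candidates = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/",
    ]
    rate_limited_fetch(group_candidates_by_host(candidates), fetch=print)
```

In the real system the per-host queues are distributed across many spider servers rather than walked in a single process, but the per-host throttle is the key politeness property the description calls out.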

They store the crawl data on Amazon’s S3 service, so it can be bulk downloaded as well as accessed directly for MapReduce processing in EC2.
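Because the corpus sits in a public S3 bucket, a few lines of code are enough to start exploring it. A minimal sketch using boto3 is shown below; the bucket name and key prefix are assumptions to be checked against Common Crawl's current documentation, since the layout changes between crawl releases.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Assumed public bucket and prefix; consult Common Crawl's documentation for
# the exact layout of the crawl you want.
BUCKET = "commoncrawl"
PREFIX = "crawl-data/"

# The corpus is public, so an unsigned (anonymous) client is sufficient.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List a handful of objects under the prefix to see what is available.
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Bulk-download a single archive file once you know its key, e.g.:
# s3.download_file(BUCKET, "crawl-data/<segment path>", "segment.warc.gz")
```

Running the same kind of listing from an EC2 instance in the bucket's region avoids data-transfer charges, which is the main reason processing is usually done inside Amazon's cloud rather than by downloading everything first.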

Technology Review discusses the open web crawl.

Common Crawl has so far indexed more than five billion pages, adding up to 81 terabytes of data, made available through Amazon’s cloud computing service. For about $25, a programmer could set up an account with Amazon and get to work crunching Common Crawl data, says Lisa Green, Common Crawl’s director. The Internet Archive, another nonprofit, also compiles a copy of the Web and offers a service called the “Wayback Machine” that can show old versions of a particular page. However, it doesn’t allow anyone to analyze all of its data at once in that way.
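To give a flavor of what "crunching Common Crawl data" on Amazon's cloud can look like, here is a minimal Hadoop Streaming job that counts crawled pages per domain. The one-URL-per-line input format, the script name, and the choice of Hadoop Streaming rather than Common Crawl's own tooling are all illustrative assumptions.

```python
#!/usr/bin/env python
"""Minimal Hadoop Streaming job: count fetched pages per domain.

Run as the mapper with `pages_per_domain.py map` and as the reducer with
`pages_per_domain.py reduce`, e.g. on Amazon Elastic MapReduce over input
files containing one crawled URL per line (an assumed format for this
sketch, not Common Crawl's actual on-disk layout).
"""
import sys
from urllib.parse import urlparse


def run_mapper():
    # Emit "host<TAB>1" for every URL on stdin.
    for line in sys.stdin:
        host = urlparse(line.strip()).netloc.lower()
        if host:
            print(f"{host}\t1")


def run_reducer():
    # Hadoop delivers mapper output sorted by key, so equal hosts are adjacent.
    current, total = None, 0
    for line in sys.stdin:
        host, count = line.rstrip("\n").split("\t")
        if host != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = host, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")


if __name__ == "__main__":
    run_reducer() if sys.argv[1:] == ["reduce"] else run_mapper()
```

The same script can be tested locally before spending anything on the cloud: `cat urls.txt | python pages_per_domain.py map | sort | python pages_per_domain.py reduce`.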
