A free database of the entire Web may spawn the next Google

January 24, 2013

A nonprofit called Common Crawl is running its own Web crawler to build a giant copy of the Web, which it makes accessible to anyone.

The organization offers more than five billion Web pages for free, so that researchers and entrepreneurs can try things otherwise possible only for those with resources on the scale of Google’s, MIT Technology Review reports.

Common Crawl has so far indexed more than five billion pages, adding up to 81 terabytes of data, which it makes available through Amazon’s cloud computing service. For about $25, a programmer could set up an account with Amazon and get to work crunching Common Crawl data, says Lisa Green, Common Crawl’s director.

Common Crawl also has Google’s director of research, Peter Norvig, and MIT Media Lab director Joi Ito on its advisory board.


Hmm, could the entire Web be stored in DNA or graphene molecules? — Editor