Google crawls the web, caches and indexes the pages it finds into a database, and provides a consumer user interface to search that database. What if we could build other applications on top of that same database?
Without access to Google’s copy of the web, other companies have to crawl and index the web themselves in order to provide their services. Some examples:
- Attributor lets large publishers find video, image, and text copyright infringements.
- TinEye lets users upload an image and see where it is used online.
- MajesticSEO lets website owners track backlinks to their pages.
Amazon has a growing list of Public Data Sets. What if they could provide cached “views” of the web that could be processed using EC2 or Elastic MapReduce? That would allow more entrepreneurs to think big about using the whole of the web as a data set.
Amazon’s Public Data Sets already include some cached versions of Wikipedia. Companies (like Freebase) and researchers (such as Jun Liu & Sudha Ram) seem to use them. I wish we had a similar queryable data store for all websites.
Until we have such a data source and platform, here’s Ilya Grigorik’s excellent presentation on Building a Mini-Google in Ruby to do it ourselves.
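To make the "index the pages, then search the index" idea concrete, here is a toy sketch in Python. It is not taken from Ilya's presentation (which uses Ruby) and skips crawling and ranking entirely. The `pages` dictionary stands in for a cache of crawled pages, and the URLs, the sample text, and the function names are all made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index: term -> set of URLs containing that term."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    # Start from the pages matching the first term, then intersect the rest.
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return sorted(results)


# Hypothetical "cached" pages; a real system would fill this from a crawler.
pages = {
    "http://example.com/a": "Ruby makes building a search engine fun",
    "http://example.com/b": "Building a crawler in Python",
    "http://example.com/c": "Search engine ranking with backlinks",
}

index = build_index(pages)
print(search(index, "search engine"))
# ['http://example.com/a', 'http://example.com/c']
```

A shared, cached view of the web would let applications skip straight to the `build_index` step, or to whatever processing they need, instead of each running their own crawler first.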
An aside: For simple site search problems, I prefer the design of Bing’s API over Google’s Site Search. We use Bing for LearnHub Search.

“For every friend who joins Dropbox, we’ll give you both 250 MB of bonus space.”
“For every company that your company invites that joins Dropbox, we’ll give both companies 50 GB of bonus space.”

