The idea below rests on a few assumptions of mine. It seems workable, unless this "misuse" was foreseen and measures were put in place to thwart it. I first wrote this idea down in Jun'09, but never posted it on the Web.
Online news media is abuzz about the release of an updated version of the Google Search Appliance (GSA) - version 6.0. Among the many updates in this version, one feature has intrigued me the most - the ability to index billions of documents (using clustering). (1)
To understand a possible implication of this, it's important to realize that Google's search algorithms are among its most important pieces of IP. What makes Google Google is its secret sauce - the algorithms it uses to rank Web content. Anyone can crawl the Web, but it's the relevance-determining algorithms that give Google much of its competitive edge over rival search engines such as Yahoo Search, Ask.com, and Bing. (2)
It's also known that GSA uses Google's ranking algorithms to rank indexed content. (3)
We also know that the ability to crawl and ingest the Web is not a major source of competitive advantage for search engines. Even a simple program such as HTTrack can do a relatively decent job of downloading a website by jumping from URL to URL. The process HTTrack uses to crawl a website is similar to how contemporary search engines crawl the Web, and it should be possible to configure (or customize) HTTrack to crawl the Web at large, rather than just a single website. (4)
What all of this leads me to believe is that it should be possible to cluster multiple GSAs to create a pseudo-Google - a search engine that uses Google's secret algorithms to rank the Web, but is powered by a cluster of GSAs. If this is indeed possible, it'll make it super-easy for clever entrepreneurs to launch new search engines that provide high-quality results.