Behind Google
Here is a good paper on Google's architecture.
I was wondering why Google needed so many servers (apparently more than 10000) to handle "only" 1000 requests/sec.
It turns out that a single request involves many servers besides the front-end web server: index servers and document servers, as well as ad servers and spell-checking services. In fact, many (apparently around a dozen) index and document servers take part in handling a query, because of the sheer size of the data and the need for it to be replicated (for scalability and reliability): the indexes and web content are split into redundant partitions (which they call shards), and a query is run in parallel across these shards. Additional boxes are needed for the spidering and indexing itself, but I couldn't find a mention of how many.
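To make the shard fan-out concrete, here is a minimal Python sketch of a front end querying every shard in parallel and merging the partial hit lists. The shard layout, scores, and helper names are all made up for illustration; this is not Google's actual code.

```python
# Illustrative sketch (not Google's actual code) of a front end
# fanning a query out to index shards in parallel and merging the
# partial results. Shard contents and helper names are hypothetical.

from concurrent.futures import ThreadPoolExecutor

# Each "shard" holds an inverted index over its own slice of the
# document collection; in the real system each shard is a pool of
# replicated machines, and any replica can answer for the shard.
SHARDS = [
    {"apple": [(1, 0.9), (2, 0.4)]},
    {"apple": [(7, 0.7)], "banana": [(8, 0.6)]},
    {"banana": [(12, 0.8)]},
]

def search_shard(shard, term):
    """Return (doc_id, score) hits for one shard's slice of the index."""
    return shard.get(term, [])

def search(term, top_k=10):
    # Query every shard in parallel, then merge the partial hit lists.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partial = pool.map(lambda s: search_shard(s, term), SHARDS)
    hits = [hit for shard_hits in partial for hit in shard_hits]
    return sorted(hits, key=lambda h: h[1], reverse=True)[:top_k]

if __name__ == "__main__":
    print(search("apple"))  # hits from all shards, best score first
```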
This TechNetCast mp3 stream with Jim Reese (Chief Operations Engineer) and the Q&A that follows humorously go over some of the problems that appear when you manage that many servers (networking issues, heat/power/hosting issues, keeping the software consistent across all the boxes) and the strategies they employ to mitigate them.
One of the early papers on Google's anatomy by Sergey Brin and Lawrence Page. It gives a lot of interesting details on the design of the search engine.
Also good reads: an analysis of some of Google's success factors, and Wired's summary of Google's "Don't be evil" attitude.
A Wired article on GRUB, a search engine that attempts to offload the indexing work by distributing it to volunteers (à la SETI@Home and Folding@Home). This could allow faster indexing of the whole web, but it certainly raises issues around the reliability of the resulting index, as well as other technical problems.
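The article doesn't describe GRUB's actual protocol, but a classic defence against unreliable volunteers (used by projects like SETI@Home) is to assign each work unit redundantly and only accept a result that a majority of volunteers agree on. Here is a hypothetical Python sketch of that idea; everything in it is assumed, not taken from GRUB.

```python
# Hypothetical sketch of redundant work assignment with majority
# voting, a common defence against unreliable volunteers. This is
# NOT GRUB's actual protocol, which I haven't seen published.

from collections import Counter

REDUNDANCY = 3  # how many volunteers crawl each URL

def accept_result(reports):
    """reports: content hashes returned by volunteers for one URL.
    Accept a hash only once enough reports are in and a strict
    majority of them agree."""
    if len(reports) < REDUNDANCY:
        return None  # still waiting for more volunteers
    digest, votes = Counter(reports).most_common(1)[0]
    return digest if votes > len(reports) // 2 else None

if __name__ == "__main__":
    print(accept_result(["abc", "abc", "xyz"]))  # -> "abc"
    print(accept_result(["abc", "def", "xyz"]))  # -> None (no majority)
```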
Via Nauman Leghari's blog ( http://weblogs.asp.net/nleghari/posts/9569.aspx ), many papers by Google researchers: http://labs.google.com/papers.html
Posted by: Dumky (July 1, 2003 04:05 PM)

More info on PageRank at http://pagerank.stanford.edu/
Sepandar Kamvar ( http://www.stanford.edu/~sdkamvar/research.html ) is doing some excellent research on speeding up the PageRank computation and personalizing the PageRank results.
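For reference, here is a minimal Python sketch of the basic PageRank power iteration from the Brin/Page paper; the toy graph is hypothetical, and Kamvar's acceleration techniques (extrapolation, adaptive and block methods) go well beyond this plain version.

```python
# Minimal power-iteration sketch of the basic PageRank computation
# (the plain algorithm from the Brin/Page paper, not Kamvar's
# accelerated variants). The toy graph below is made up.

def pagerank(links, damping=0.85, tol=1e-6, max_iter=100):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Every page keeps a baseline (1 - d) / n of rank...
        new_rank = {p: (1.0 - damping) / n for p in pages}
        # ...and passes a damped share of its rank along its outlinks.
        for p, outlinks in links.items():
            if outlinks:
                share = damping * rank[p] / len(outlinks)
                for q in outlinks:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        # Stop once the ranks have converged (small L1 change).
        if sum(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    print(pagerank(toy_graph))
```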
I haven't heard more from GRUB ( http://www.grub.org/ ) since last time, but I wonder where it fits in Jim Gray's analysis of distributed computing ( http://www.clustercomputing.org/content/tfcc-5-1-gray.html ).
Posted by: Dumky (July 24, 2003 11:19 AM)

A video of a recorded lecture (by Urs Hoelzle) about Google is available: http://norfolk.cs.washington.edu/htbin-post/unrestricted/mmedia/ondemand_colloq.cgi (search for Google on that page).
It goes through both a software and a hardware overview.
The video is mirrored at http://www.uwtv.org/programs/displayevent.asp?rid=1680 and http://www.researchchannel.org/program/displayevent.asp?rid=1680
Posted by: Dumky (August 28, 2003 06:46 PM)