Review of "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
1. KOSURU SAI MALLESWAR; SC09B093; SEM-6.
Sergey Brin and Lawrence Page designed 'Google' to build a search engine that can crawl and
index the web quickly and efficiently, and that can cope with large, uncontrolled hypertext
collections. One of the main goals was to improve the quality and scalability of search. Another
was to set up a system that supports novel research on large-scale web data, one that a
reasonable number of people can actually use for their academic research.
Google makes efficient use of storage space to store the index, which allows search quality to
scale with the size of the web as it grows. Its data structures are optimized for fast, efficient
access. To achieve high precision, Google uses the link structure of the web to compute a quality
ranking for each page, called PageRank. A page's PageRank is the probability that a 'random
surfer' visits it. The ranking also involves a damping factor: the probability that, at each page,
the random surfer gets bored and requests another random page. PageRank allows for
personalization and makes it nearly impossible to deliberately mislead the system into giving a
higher ranking. The text of a link is associated both with the page the link is on and with the
page the link points to. This idea of anchor-text propagation improves search quality, but using
it efficiently was a challenge because of the heavy data processing involved. Along with
PageRank, Google keeps track of the location of all hits and some visual presentation details,
and stores the full raw HTML of pages in a repository.
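The random-surfer model described above can be sketched as a small power iteration. This is only an illustration: the three-page graph, the damping factor of 0.85, and the iteration count are assumptions for the example, not parameters from the paper.

```python
# Minimal PageRank sketch via power iteration (illustrative, not Google's code).
# `links` maps each page to the pages it links to; `d` is the damping factor:
# the probability the random surfer follows a link rather than jumping at random.
# Assumes every page has at least one outgoing link (no dangling nodes).

def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # start from a uniform distribution
    for _ in range(iterations):
        new = {p: (1.0 - d) / n for p in pages}  # random-jump share for every page
        for p, outs in links.items():
            share = d * rank[p] / len(outs)      # surfer follows one outgoing link
            for q in outs:
                new[q] += share
        rank = new
    return rank

# Hypothetical three-page web: A links to B and C; B and C link back to A.
ranks = pagerank({"A": ["B", "C"], "B": ["A"], "C": ["A"]})
```

Since both B and C link only to A, A accumulates the highest rank, while B and C end up equal by symmetry; the ranks always sum to 1.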
Most of Google's architecture is implemented in C or C++ for efficiency and runs on either
Solaris or Linux. Google's data structures include BigFiles, a document index, a lexicon,
forward and inverted indexes, and a huge repository; they are optimized for cost by avoiding
disk seeks whenever possible. Google has a fast distributed crawling system in which the URL
server and the crawlers are implemented in Python. Each crawler maintains its own DNS cache
to reduce the number of DNS lookups, and uses asynchronous IO and a number of queues. The
steps involved in indexing are parsing, indexing documents into barrels using multiple indexers
running in parallel, and sorting. Google's ranking system is designed so that no particular factor
can have too much influence. The dot product of the vector of count-weights with the vector of
type-weights gives an IR score for the document. Finally, the IR score is combined with
PageRank to produce the document's final rank. For multi-word searches, Google uses a more
complex algorithm. Google also considers feedback from trusted users when updating the ranks
of webpages.
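The dot-product ranking described above can be sketched as follows. The specific type-weight values, the count-weight cap, and the way the IR score is combined with PageRank are all assumptions for illustration; the paper does not publish its actual weights or combination function.

```python
# Illustrative single-word IR-score sketch; the numeric weights below are
# assumptions, not values from the paper.

# Hit types roughly as in the paper: title, anchor, URL, and plain text,
# each with its own type-weight.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 4.0, "plain": 1.0}

def count_weight(count, cap=8):
    # Count-weights grow with occurrences but are capped, so sheer
    # repetition of a term cannot dominate the score.
    return min(count, cap)

def ir_score(hit_counts):
    # Dot product of the count-weight vector with the type-weight vector.
    return sum(count_weight(c) * TYPE_WEIGHTS[t] for t, c in hit_counts.items())

def final_rank(hit_counts, pagerank):
    # The paper combines the IR score with PageRank; a weighted sum is one
    # plausible (hypothetical) choice.
    return ir_score(hit_counts) + 10.0 * pagerank

# One title hit and three plain-text hits, on a page with PageRank 0.15.
score = final_rank({"title": 1, "plain": 3}, pagerank=0.15)
```

The cap on count-weights reflects the design goal stated above: no single factor, not even raw term frequency, should have too much influence on the final rank.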
Google can produce better results than the major commercial search engines for most searches.
Google has evolved to overcome a number of bottlenecks in CPU, memory access, memory
capacity, disk seeks, disk throughput, disk capacity, and network IO during various operations.
Thanks to the efficient crawling and indexing Google performs, the index can be kept up to date
and major changes can be tested relatively quickly. Google does not yet have optimizations such
as query caching or sub-indices on common terms. The authors intended to speed Google up
considerably through distribution and through hardware, software, and algorithmic
improvements. They wished to make Google a high-quality search tool for searchers and
researchers all around the world, sparking the next generation of search engine technology.