Google Paper
1. The Anatomy of a Large-Scale
Hypertextual Web Search Engine
Lawrence Page & Sergey Brin
Presented By : Girish Malkarnenkar
Email: girish@cs.utexas.edu
INF384H / CS395T Concepts of Information Retrieval and
Web Search (Fall 2011) - (12th September 2011)
2. Motivation behind Google
• Rapid growth in
– Amount of information on the web
– Number of new, inexperienced web users
3. Motivation behind Google
• Human-maintained indices like Yahoo! were
subjective, expensive to build and maintain,
slow to improve, and did not cover all topics.
• Automated search engines relying on simple
keyword matching returned low quality
results.
• Attempts by advertisers to mislead automated
search engines
4. How bad were things in 1997?
• “Junk results” washed out any relevant search
results.
• Only one of the top 4 commercial search
engines at the time could find itself (in the
top 10 results)!
• There was a desperate need for a search
engine that could cope with the
ever-increasing information flow and still
return relevant information.
5. Challenges in scaling with the web!
• In 1994, one of the first web search engines,
the World Wide Web Worm (WWWW), indexed
around 10^5 pages.
• By November 1997, the top engines
claimed to index up to 10^8 web documents!
• In 1994, the WWWW handled 1500
queries per day.
• By November 1997, Altavista handled
around 20 million queries per day!
6. Challenges in scalability
• Fast crawling technology
• Storage Space
• Efficient indexing system
• Fast handling of queries
7. Google’s design goals
• Aiming for very high precision in results, since
most users look only at the first few tens of
results.
• Precision is important even at the expense of
recall (i.e. the total number of relevant
documents returned)
8. The irony of it all…
• In this paper, the authors criticized the
commercialization of academic search engines,
as it caused search engine technology to
remain a black art.
• They had also stated their aims of making
Google an open academic environment for
researchers working on large scale web data.
• In the appendix, they had also blasted
advertising funded search engines for being
“inherently biased”
9. System features of Google
• PageRank
• A Top 10 IEEE ICDM data mining algorithm
• Tries to incorporate ideas from
academic community (publishing and citations)
• Anchor Text
• <a href=http://www.com> ANCHOR TEXT </a>
10. PageRank!
It isn't the only factor that Google uses to rank pages, but it is an
important one.
11. Why does PageRank use links?
• Links represent citations
• Quantity of links to a website makes the
website more popular
• Quality of links to a website also helps in
computing rank
• Link structure was largely unused before Larry
Page proposed it to his thesis advisor
• Idea based on academic citation literature
which counted citations or backlinks to a given
page.
12. How does PageRank work?
Counts links from all pages but not
equally
Normalizes by the number of links on a
page.
13. Simplified PageRank algorithm
• Assume four web pages: A, B, C and D. Each page
begins with an estimated PageRank of 0.25.
[Figure: link graph over the four pages A, B, C and D]
• L(A) is defined as the number of links going out of page
A. The PageRank of a page A is given as follows:
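The formula itself did not survive on this slide. A plausible reconstruction, assuming the standard simplified form in which every page splits its rank evenly across its outgoing links, and assuming for illustration that B, C and D all link to A (the set name B_u for the pages linking to u is introduced here, not taken from the slide):

```latex
PR(A) = \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)}
\qquad\text{and in general}\qquad
PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}
```

For example, if B has two outgoing links and a rank of 0.25, it contributes 0.25 / 2 = 0.125 to each page it links to.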
14. PageRank algorithm including damping factor
Assume page A has pages B, C, D, ... which point
to it. The parameter d is a damping factor which
can be set between 0 and 1. It is usually set to
0.85. The PageRank of a page A is given as
follows:
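The damped formula is commonly written as PR(A) = (1 - d)/N + d (PR(B)/L(B) + PR(C)/L(C) + ...), where N is the number of pages. The original paper writes the first term as (1 - d) without dividing by N. A minimal iterative sketch of the normalized form follows. The function name `pagerank`, the example graph and the fixed iteration count are all assumptions made for illustration, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PageRank for a dict {page: [pages it links to]}.

    Assumes every page has at least one outgoing link (no dangling pages).
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # e.g. 0.25 each for four pages
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Each page q linking to p passes on rank[q] / L(q).
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new_rank[p] = (1 - d) / n + d * incoming
        rank = new_rank
    return rank


# Hypothetical four-page graph; D links out but nothing links to D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A", "B", "C"]}
ranks = pagerank(graph)
```

In this sketch the ranks keep summing to 1, and D, which has no incoming links, ends up with only the baseline share (1 - d)/N.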
15. Intuitive Justification
• A "random surfer" is given a web page at random and keeps
clicking on links, never hitting "back", but eventually gets bored
and starts on another random page.
– The probability that the random surfer visits a page is its
PageRank.
– The damping factor d is the probability that, at each page, the
"random surfer" keeps following links. With probability 1 - d the
surfer gets bored and requests another random page.
• A page can have a high PageRank
– If there are many pages that point to it
– Or if there are some pages that point to it, and have a high
PageRank.
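The random surfer model can be checked with a small simulation. This is a sketch under stated assumptions: the surfer follows a link with probability d and jumps to a random page with probability 1 - d (consistent with d = 0.85 in the damped formula). The function name `random_surfer`, the graph, the step count and the seed are all hypothetical.

```python
import random


def random_surfer(links, d=0.85, steps=200_000, seed=0):
    """Estimate visit frequencies of a surfer on {page: [linked pages]}."""
    rng = random.Random(seed)
    pages = list(links)
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if links[page] and rng.random() < d:
            page = rng.choice(links[page])  # keep clicking on links
        else:
            page = rng.choice(pages)  # bored: start on another random page
    return {p: count / steps for p, count in visits.items()}


# Hypothetical four-page graph; nothing links to D.
surf_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A", "B", "C"]}
freq = random_surfer(surf_graph)
```

The visit frequencies approximate the PageRank values: pages with many or highly ranked incoming links are visited most, and D is reached only through random jumps.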
16. Anchor Text
• <A href="http://www.yahoo.com/">Yahoo!</A>
The text of a hyperlink (anchor text) is
associated with the page that the link is on,
and it is also associated with the page the link
points to.
Why?
– Anchors often provide more accurate descriptions of
web pages than the pages themselves.
– Anchors may exist for documents which cannot be
indexed by a text-based search engine, such as images,
programs, and databases.
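As an illustration of associating anchor text with the page a link points to, here is a minimal sketch using Python's standard `html.parser`. The class name `AnchorCollector` and the `anchor_index` dictionary are hypothetical names for this example, not part of Google's system.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (target URL, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names, so <A> arrives as "a".
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


collector = AnchorCollector()
collector.feed('<A href="http://www.yahoo.com/">Yahoo!</A>')
collector.close()

# Index the anchor text under the page the link points to.
anchor_index = {}
for url, text in collector.anchors:
    anchor_index.setdefault(url, []).append(text)
```

A query for "Yahoo!" could then match http://www.yahoo.com/ through `anchor_index` even if the target page had never been crawled.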
17. Other Features
• It has location information for all hits (uses
proximity in search)
• Google keeps track of some visual
presentation details such as font size of words.
• Words in a larger or bolder font are weighted
higher than other words.
• Full raw HTML of pages is available in a
repository
19. Google Architecture
• URL server: keeps track of URLs that have been crawled and that need to be crawled.
• Crawlers: multiple crawlers run in parallel. Each crawler keeps its own DNS lookup cache and ~300 connections open at once.
• Store server: compresses and stores web pages.
• Anchors file: stores each link and the text surrounding the link.
• URL resolver: converts relative URLs into absolute URLs.
• Indexer: uncompresses and parses documents. Stores link information in the anchors file.
• Repository: contains the full HTML of every web page. Each document is prefixed by docID, length, and URL.
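The relative-to-absolute URL conversion can be illustrated with Python's standard `urllib.parse.urljoin`. This is only a sketch of the idea, not the paper's implementation, and the base URL and relative links are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical page on which the links were found.
base = "http://www.example.com/dir/page.html"

sibling = urljoin(base, "other.html")  # relative to the current directory
rooted = urljoin(base, "/top.html")    # relative to the site root
parent = urljoin(base, "../up.html")   # one directory up
```

Here `sibling` resolves to http://www.example.com/dir/other.html, `rooted` to http://www.example.com/top.html, and `parent` to http://www.example.com/up.html.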
20. Google Architecture
• URL resolver: maps absolute URLs into docIDs stored in the doc index. Stores anchor text in “barrels”. Generates a database of links (pairs of docIDs).
• Indexer: parses documents and distributes hit lists into “barrels.”
• Barrels: partially sorted forward indexes sorted by docID. Each barrel stores hitlists for a given range of wordIDs.
• Lexicon: in-memory hash table that maps words to wordIDs. Contains a pointer to the doclist in the barrel which the wordID falls into.
• Sorter: creates the inverted index, whereby the document list containing docIDs and hitlists can be retrieved given a wordID.
• Doc index: docID-keyed index where each entry includes info such as a pointer to the doc in the repository, checksum, statistics, status, etc. Also contains URL info if the doc has been crawled. If not, it just contains the URL.
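The relationship between the lexicon, the forward index and the inverted index can be sketched in a few lines. This is a toy model under stated assumptions: a hit is just a word position, everything lives in one in-memory "barrel", and the documents and the helper name `word_id` are hypothetical.

```python
from collections import defaultdict

lexicon = {}  # word -> wordID


def word_id(word):
    """Return the wordID for a word, assigning the next free ID if new."""
    return lexicon.setdefault(word, len(lexicon))


# Hypothetical documents keyed by docID.
docs = {1: "google ranks pages", 2: "pages link to pages"}

# Forward index: docID -> {wordID: hit list of positions}.
forward = {}
for doc_id, text in docs.items():
    hits = defaultdict(list)
    for pos, word in enumerate(text.split()):
        hits[word_id(word)].append(pos)
    forward[doc_id] = dict(hits)

# Inverted index: wordID -> list of (docID, hit list), in docID order.
inverted = defaultdict(list)
for doc_id in sorted(forward):
    for wid, positions in forward[doc_id].items():
        inverted[wid].append((doc_id, positions))
```

Looking up `inverted[lexicon["pages"]]` returns the document list for "pages" with the hit positions in each document, which is what the searcher needs at query time.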
21. Single Word Query Ranking
• Hitlist is retrieved for single word
• Each hit can be one of several types: title, anchor,
URL, large font, small font, etc.
• Each hit type is assigned its own weight
• Type-weights make up vector of weights
• Number of hits of each type is counted, and the counts
are converted into a count-weight vector
• Dot product of type-weight and count-weight vectors
is used to compute IR score
• IR score is combined with PageRank to compute final
rank
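A minimal sketch of this scoring follows. The slide gives no actual weights, so the type weights, the cap at which count-weights stop growing, and the linear blend with PageRank (`alpha`) are all hypothetical values chosen for illustration.

```python
# Hypothetical type-weights; the real values are not given in the slides.
TYPE_WEIGHTS = {
    "title": 10.0,
    "anchor": 8.0,
    "url": 6.0,
    "large_font": 3.0,
    "small_font": 1.0,
}


def count_weight(count, cap=8):
    """Grow linearly with the count, then taper off (assumed cap)."""
    return float(min(count, cap))


def ir_score(hit_counts):
    """Dot product of the type-weight and count-weight vectors."""
    return sum(
        weight * count_weight(hit_counts.get(hit_type, 0))
        for hit_type, weight in TYPE_WEIGHTS.items()
    )


def final_rank(hit_counts, pagerank, alpha=0.5):
    """Blend IR score and PageRank; the linear blend is an assumption."""
    return alpha * ir_score(hit_counts) + (1 - alpha) * pagerank


score = ir_score({"title": 1, "small_font": 3})
```

With these weights, one title hit and three small-font hits score 10 * 1 + 1 * 3 = 13, while a page stuffed with 100 small-font hits is capped at 8.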
22. Multi-word Query Ranking
• Similar to single-word ranking except now must
analyze proximity of words in a document
• Hits occurring closer together are weighted higher
than those farther apart
• Each proximity relation is classified into 1 of 10 bins
ranging from a “phrase match” to “not even close”
• Each type and proximity pair has a type-prox weight
• Counts converted into count-weights
• Take dot product of count-weights and type-prox
weights to compute the IR score
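Proximity binning can be sketched as follows. This is a simplification under stated assumptions: it ignores hit types, uses raw counts as count-weights, and the bin thresholds (powers of two) and per-bin weights are hypothetical, since the slide only says there are 10 bins from "phrase match" to "not even close".

```python
# Hypothetical weights for bins 0 ("phrase match") to 9 ("not even close").
PROX_WEIGHTS = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]


def proximity_bin(distance, num_bins=10):
    """Map a word distance to a bin; bin k holds distances up to 2**k."""
    for k in range(num_bins - 1):
        if distance <= 2 ** k:
            return k
    return num_bins - 1


def proximity_score(positions_a, positions_b):
    """Dot product of per-bin counts with per-bin weights for two words."""
    counts = [0] * len(PROX_WEIGHTS)
    for a in positions_a:
        # Distance from this hit of word A to the nearest hit of word B.
        nearest = min(abs(a - b) for b in positions_b)
        counts[proximity_bin(nearest)] += 1
    return sum(c * w for c, w in zip(counts, PROX_WEIGHTS))


adjacent = proximity_score([5], [6])
```

Adjacent words fall into bin 0 and score highest; a second occurrence 94 positions away lands in bin 7 and adds only a small amount.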
23. The Past: Original Page # 1
When Larry Page and Sergey Brin began work on their search engine, it
wasn’t called Google. They called it BackRub (a reference to the
algorithm, which used backlinks to rank pages) and only changed the name
a year into development. And yes, the hand in the logo was Larry Page’s, scanned.
26. The Future?
“The ultimate search engine would
understand exactly what you mean and give
back exactly what you want.”
- Larry Page
27. References…
• Brin, Page. The Anatomy of a Large-Scale
Hypertextual Web Search Engine.
• www.cs.uvm.edu/~xwu/kdd
• http://www.ics.uci.edu/~scott/google.htm