The Anatomy of a Large-Scale
Hypertextual Web Search Engine
           Lawrence Page & Sergey Brin




          Presented By : Girish Malkarnenkar
             Email: girish@cs.utexas.edu
INF384H / CS395T Concepts of Information Retrieval and
    Web Search (Fall 2011) - (12th September 2011)
Motivation behind Google
• Rapid growth in:
   – Amount of information on the web
   – Number of new, inexperienced web users
Motivation behind Google
• Human-maintained indices like Yahoo! were
  subjective, expensive to build & maintain,
  slow to improve and did not cover all topics.
• Automated search engines relying on simple
  keyword matching returned low-quality
  results.
• Advertisers attempted to mislead automated
  search engines.
How bad were things in 1997?
• “Junk results” washed out any relevant search
  results.
• Only one of the top 4 commercial search
  engines at the time could find itself (in the
  top 10 results)!
• There was a desperate need for a search
  engine that could cope with the
  ever-increasing information flow and still return
  relevant information.
Challenges in scaling with the web!
• In 1994, one of the first web search engines, the
  World Wide Web Worm (WWWW), indexed around 10^5 pages.
• By November 1997, the top engines
  indexed up to 10^8 web documents!
• In 1994, the WWWW handled about 1500
  queries per day.
• By November 1997, AltaVista handled
  around 20 million queries per day!
Challenges in scalability


•   Fast crawling technology
•   Storage Space
•   Efficient indexing system
•   Fast handling of queries
Google’s design goals
• Aim for very high precision in results, since
  most users look only at the first few tens of
  results.

• Precision is important even at the expense of
  recall (i.e. the share of all relevant
  documents that the system returns).
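
The trade-off between precision and recall can be made concrete with a small sketch. The function names and document IDs below are hypothetical and chosen only for illustration; precision is measured on the top k results, recall over the whole result list.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k returned documents that are relevant."""
    top_k = ranked[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)


def recall(ranked, relevant):
    """Fraction of all relevant documents that appear anywhere in the results."""
    if not relevant:
        return 0.0
    return len(set(ranked) & relevant) / len(relevant)


# Hypothetical document IDs, for illustration only.
ranked_results = ["d1", "d7", "d3", "d9", "d4"]
relevant_docs = {"d1", "d3", "d5", "d8"}

print(precision_at_k(ranked_results, relevant_docs, 3))  # 2 of the top 3 are relevant
print(recall(ranked_results, relevant_docs))             # 2 of the 4 relevant docs were returned
```

A system tuned for precision tries to push the first number up for small k, even if some relevant documents are never returned.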
The irony of it all…
• In this paper, the authors criticized the
  commercialization of academic search engines,
  as it caused search engine technology to
  remain a black art.
• They also stated their aim of making
  Google an open academic environment for
  researchers working on large-scale web data.
• In the appendix, they also blasted
  advertising-funded search engines for being
  “inherently biased”.
System features of Google
• PageRank
   – A Top 10 IEEE ICDM data mining algorithm
   – Tries to incorporate ideas from the
     academic community (publishing and citations)

• Anchor Text
   – <a href=http://www.com> ANCHOR TEXT </a>
PageRank!




It isn't the only factor that Google uses to rank pages, but it is an important one.
Why does PageRank use links?
• Links represent citations
• The quantity of links to a website indicates its
  popularity
• The quality of links to a website also helps in
  computing rank
• Link structure was largely unused before Larry
  Page proposed it to his thesis advisor
• The idea is based on academic citation literature,
  which counted citations or backlinks to a given
  page.
How does PageRank work?


• Counts links from all pages, but not
  equally
• Normalizes by the number of links on a
  page.
Simplified PageRank algorithm
• Assume four web pages: A, B, C and D. Each page
  begins with an estimated PageRank of 0.25.

[Figure: two link graphs over pages A, B, C and D]


• L(A) is defined as the number of links going out of page
  A. The PageRank of a page A is given as follows:
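
The formula on the original slide was an image and did not survive extraction. Under the definitions above, the simplified PageRank is usually written as follows, where B_A (a symbol assumed here, not taken from the slide) denotes the set of pages that link to A:

```latex
PR(A) = \sum_{v \in B_A} \frac{PR(v)}{L(v)}
```

For instance, if B, C and D all linked to A, this would read PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D).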
PageRank algorithm including damping factor
 Assume page A has pages B, C, D, ... which point
 to it. The parameter d is a damping factor, which
 can be set between 0 and 1 and is usually set to
 0.85. The PageRank of a page A is given as
 follows:
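
The slide's formula was an image and is missing here. The Brin and Page paper states it in the following form, written with this deck's L(·) for the number of outgoing links (the paper writes C(·)):

```latex
PR(A) = (1 - d) + d \left( \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)} + \cdots \right)
```

A common normalized variant replaces the leading (1 − d) with (1 − d)/N for N pages, so that the ranks sum to 1.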
Intuitive Justification

• A "random surfer" is given a web page at random and keeps
  clicking on links, never hitting "back", but eventually gets bored
  and starts on another random page.
   – The probability that the random surfer visits a page is its
      PageRank.
   – With d set to 0.85, d is the probability that the surfer follows
      a link from the current page; with probability 1 − d the
      "random surfer" gets bored and requests another random
      page.

• A page can have a high PageRank
   – If there are many pages that point to it
   – Or if there are some pages that point to it, and have a high
     PageRank.
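
The random-surfer model can be sketched as a simple iteration. This is a minimal illustration, not the production algorithm: it uses the normalized variant, in which the random jump contributes (1 − d)/N so that ranks sum to 1, takes d as the probability of following a link, and assumes every page has at least one outgoing link. The four-page graph is hypothetical and is not the one drawn on the slide.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(p) = (1 - d)/N + d * sum(PR(q)/L(q)) over pages q linking to p.

    `links` maps each page to the list of pages it links to. This sketch
    assumes every page has at least one out-link (no dangling pages).
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}              # uniform start, 0.25 for 4 pages
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) / n for p in pages}    # "bored surfer" random jump
        for q, outlinks in links.items():
            share = d * rank[q] / len(outlinks)         # q splits its rank over L(q) links
            for p in outlinks:
                new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph (not the one drawn on the slide).
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page D has no incoming links, so it keeps only the random-jump share, while C collects rank from three pages and ends up on top.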
Anchor Text
•   <A href="http://www.yahoo.com/">Yahoo!</A>
The text of a hyperlink (anchor text) is
associated with the page that the link is on,
and it is also associated with the page the link
points to.

Why?
• Anchors often provide more accurate descriptions of
  web pages than the pages themselves.
• Anchors may exist for documents which cannot be
  indexed by a text-based search engine, such as images,
  programs, and databases.
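
Associating anchor text with the target page can be sketched with Python's standard HTML parser. The class name `AnchorCollector`, the `anchor_index` structure, and the example URLs are assumptions made for illustration; the deck does not describe Google's implementation at this level.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects (target URL, anchor text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.current_href = None   # set while inside an <a href=...> element
        self.text_parts = []
        self.anchors = []          # list of (absolute target URL, anchor text)

    def handle_starttag(self, tag, attrs):
        if tag == "a":             # the parser lowercases tag names
            href = dict(attrs).get("href")
            if href:
                self.current_href = urljoin(self.base_url, href)
                self.text_parts = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            text = " ".join("".join(self.text_parts).split())
            self.anchors.append((self.current_href, text))
            self.current_href = None


# Index anchor text under the page the link points to.
anchor_index = defaultdict(list)
page_url = "http://example.com/portals.html"   # hypothetical source page
page_html = '<p>Try <A href="http://www.yahoo.com/">Yahoo!</A> or <a href="/logo.gif">our logo</a>.</p>'

collector = AnchorCollector(page_url)
collector.feed(page_html)
for target, text in collector.anchors:
    anchor_index[target].append(text)

print(dict(anchor_index))
```

The image at `/logo.gif` has no indexable text of its own, yet it becomes searchable through the words "our logo".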
Other Features

• Google stores location information for all hits
  (and uses word proximity in search)
• Google keeps track of some visual
  presentation details such as font size of words.
• Words in a larger or bolder font are weighted
  higher than other words.
• Full raw HTML of pages is available in a
  repository
Google Architecture
Implemented in C and C++ on Solaris and Linux
Google Architecture
• URL Server: keeps track of URLs that have been crawled
  and that still need to be crawled.
• Crawlers: multiple crawlers run in parallel. Each crawler
  keeps its own DNS lookup cache and about 300 connections
  open at once.
• Store Server: compresses and stores web pages.
• Repository: contains the full HTML of every web page. Each
  document is prefixed by docID, length, and URL.
• Indexer: uncompresses and parses documents. Stores link
  information in the anchors file.
• Anchors file: stores each link and the text surrounding the link.
• URL Resolver: converts relative URLs into absolute URLs.
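
The repository idea (compressed pages, each prefixed by docID, length, and URL) can be sketched as follows. The exact header layout, field widths, and function names are assumptions for illustration only; zlib is used here because the paper names it as the compression choice.

```python
import struct
import zlib

# Hypothetical header: docID (8 bytes), URL length and compressed-page length (4 bytes each).
HEADER = struct.Struct("<QII")


def pack_record(doc_id, url, html):
    """Serialize one page as header + URL + zlib-compressed HTML."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(body)) + url_bytes + body


def unpack_record(buf, offset=0):
    """Read one record starting at `offset`; return (doc_id, url, html, next_offset)."""
    doc_id, url_len, body_len = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    url = buf[start:start + url_len].decode("utf-8")
    body = buf[start + url_len:start + url_len + body_len]
    html = zlib.decompress(body).decode("utf-8")
    return doc_id, url, html, start + url_len + body_len


repository = pack_record(1, "http://example.com/", "<html>first</html>")
repository += pack_record(2, "http://example.com/a", "<html>second</html>")

# Walk the repository sequentially, one record after another.
offset = 0
while offset < len(repository):
    doc_id, url, html, offset = unpack_record(repository, offset)
    print(doc_id, url, html)
```

Because each record carries its own lengths, the file can be scanned front to back without any other index, which is what lets the other data structures be rebuilt from the repository alone.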
Google Architecture
• URL Resolver: maps absolute URLs into docIDs stored in the Doc
  Index. Stores anchor text in “barrels”. Generates a database
  of links (pairs of docIDs).
• Indexer: parses documents and distributes hit lists into “barrels”.
• Barrels: partially sorted forward indexes, sorted by docID. Each
  barrel stores hit lists for a given range of wordIDs.
• Lexicon: in-memory hash table that maps words to wordIDs.
  Contains a pointer to the doclist in the barrel that the wordID
  falls into.
• Sorter: creates the inverted index, whereby the document list
  containing docIDs and hit lists can be retrieved given a wordID.
• Doc Index: docID-keyed index where each entry includes information
  such as a pointer to the document in the repository, checksum,
  statistics, status, etc. Also contains URL information if the
  document has been crawled; if not, it contains just the URL.
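
The step from forward index to inverted index can be sketched in a few lines. This is a toy in-memory version under stated assumptions: a "hit" is reduced to a word position, the whole index fits in one dictionary rather than in barrels on disk, and all IDs are made up.

```python
from collections import defaultdict

# Toy forward index: docID -> {wordID: hit list}. A hit here is just a word position;
# all IDs and positions are made up for illustration.
forward_index = {
    1: {10: [0, 7], 42: [3]},
    2: {42: [1, 2], 99: [5]},
    3: {10: [4]},
}


def invert(forward):
    """Build wordID -> list of (docID, hit list), with doclists ordered by docID."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, hits in forward[doc_id].items():
            inverted[word_id].append((doc_id, hits))
    return dict(inverted)


# Stand-in for the lexicon: word -> wordID.
lexicon = {"search": 10, "engine": 42, "rank": 99}

inverted_index = invert(forward_index)
print(inverted_index[lexicon["engine"]])
```

A query for "engine" now goes from word to wordID through the lexicon, then straight to the list of documents and hits, without scanning any document.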
Single Word Query Ranking
• Hitlist is retrieved for single word
• Each hit can be one of several types: title, anchor,
  URL, large font, small font, etc.
• Each hit type is assigned its own weight
• Type-weights make up vector of weights
• Number of hits of each type is counted to form
  count-weight vector
• Dot product of type-weight and count-weight vectors
  is used to compute IR score
• IR score is combined with PageRank to compute final
  rank
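
The single-word scoring steps above can be sketched as a dot product. All weights, the cap on counts, and the way the IR score is blended with PageRank are assumptions for illustration; the slides give none of the actual values.

```python
# Hypothetical type-weights; the slides do not give the actual values.
TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0, "large_font": 3.0, "plain": 1.0}
COUNT_CAP = 5   # assumed cap so that many repeated hits stop adding weight


def ir_score(hit_types):
    """Dot product of the type-weight vector and the (capped) per-type hit counts."""
    score = 0.0
    for hit_type, weight in TYPE_WEIGHTS.items():
        count = min(hit_types.count(hit_type), COUNT_CAP)
        score += weight * count
    return score


def final_rank(hit_types, pagerank, alpha=0.5):
    """Blend IR score and PageRank; this linear mix is an assumption."""
    return alpha * ir_score(hit_types) + (1.0 - alpha) * pagerank


hits = ["title", "plain", "plain", "anchor"]
print(ir_score(hits))            # 10 + 8 + 2 * 1
print(final_rank(hits, 4.0))
```

Capping the counts keeps a page from winning simply by repeating the query word many times.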
Multi-word Query Ranking
• Similar to single-word ranking except now must
  analyze proximity of words in a document
• Hits occurring closer together are weighted higher
  than those farther apart
• Each proximity relation is classified into 1 of 10 bins
  ranging from a “phrase match” to “not even close”
• Each type and proximity pair has a type-prox weight
• Counts converted into count-weights
• Take the dot product of count-weights and type-prox
  weights to compute the IR score
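
The proximity step can be sketched in the same style. The bin boundaries and weights below are invented for illustration (the slides give only the number of bins), and the sketch covers a single hit type and a two-word query.

```python
# Assumed bin boundaries (max word distance per bin); index 0 is a phrase match,
# index 9 is "not even close". The slides give only the number of bins.
BIN_LIMITS = [1, 2, 3, 5, 8, 13, 21, 34, 55]
# Hypothetical type-prox weights for plain-text hits: closer bins weigh more.
PROX_WEIGHTS = [10.0, 8.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.5, 0.1]


def proximity_bin(distance):
    """Map a positional distance between two query words to one of 10 bins."""
    for index, limit in enumerate(BIN_LIMITS):
        if distance <= limit:
            return index
    return len(BIN_LIMITS)   # bin 9


def proximity_score(positions_a, positions_b):
    """Count hit pairs per bin, then take the dot product with the bin weights."""
    counts = [0] * len(PROX_WEIGHTS)
    for a in positions_a:
        for b in positions_b:
            counts[proximity_bin(abs(a - b))] += 1
    return sum(c * w for c, w in zip(counts, PROX_WEIGHTS))


# Word positions of two query terms in one hypothetical document.
print(proximity_score([4, 40], [5, 200]))
```

The adjacent pair at positions 4 and 5 dominates the score, while the three distant pairs contribute very little.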
The Past: Original Page # 1




When Larry Page and Sergey Brin began work on their search engine, it
wasn’t originally called Google. They called it BackRub (a reference to the
algorithm, which used backlinks to rank pages) and changed the name only a
year into development. And yes, the hand in the logo was Larry Page’s, scanned.
The Past: Original Page # 2




The original Google webpage (in 1997)
The Present
The Future?


“The ultimate search engine would
understand exactly what you mean and give
back exactly what you want.”

- Larry Page
References…
• Brin, Page. The Anatomy of a Large-Scale
  Hypertextual Web Search Engine.
• www.cs.uvm.edu/~xwu/kdd
• http://www.ics.uci.edu/~scott/google.htm
Thank you!

Google Paper

  • 1. The Anatomy of a Large-Scale Hypertextual Web Search Engine Lawrence Page & Sergey Brin Presented By : Girish Malkarnenkar Email: girish@cs.utexas.edu INF384H / CS395T Concepts of Information Retrieval and Web Search (Fall 2011) - (12th September 2011)
  • 2. Motivation behind Google • Rapid growth in Amount of information on the web Number of new inexperienced web users
  • 3. Motivation behind Google • Usage of human maintained indices like Yahoo! which were subjective, expensive to build & maintain, slow to improve and did not cover all topics. • Automated search engines relying on simple keyword matching returned low quality results. • Attempts by advertisers to mislead automated search engines
  • 4. How bad were things in 1997? • “Junk results” washed out any relevant search results. • Only one of the top 4 commercial search engines at the time could find itself (in the top 10 results)! • There was a desperate need for a search engine that could cope up with the ever- increasing information flow and still return relevant information.
  • 5. Challenges in scaling with the web! • In 1994, the 1st web search engine, the WWWW indexed around 105 pages. • By November 1997, the top engines indexed 108 web documents! • In 1994, the WWWW handled 1500 queries per day. • By November 1997, Altavista handled around 20 million queries per day!
  • 6. Challenges in scalability • Fast crawling technology • Storage Space • Efficient indexing system • Fast handling of queries
  • 7. Google’s design goals • Aiming for very high precision in results since most users look only at the first few 10s of results. • Precision is important even at the expense of recall (i.e. the total number of relevant documents returned)
  • 8. The irony of it all… • In this paper, the authors had criticized the commercialization of academic search engine as it caused search engine technology to remain a black art. • They had also stated their aims of making Google an open academic environment for researchers working on large scale web data. • In the appendix, they had also blasted advertising funded search engines for being “inherently biased”
  • 9. System features of Google • PageRank • A Top 10 IEEE ICDM data mining algorithm • Tries to incorporate ideas from academic community (publishing and citations) • Anchor Text • <a href=http://www.com> ANCHOR TEXT </a>
  • 10. PageRank! It isn't the only factor that Google uses to rank pages, but it is an important one.
  • 11. Why does PageRank use links? • Links represent citations • The quantity of links to a website indicates how popular it is • The quality of links to a website also helps in computing rank • Link structure was largely unused before Larry Page proposed it to his thesis advisor • Idea based on the academic citation literature, which counted citations or backlinks to a given page.
  • 12. How does PageRank work? • Counts links from all pages, but not equally. • Normalizes by the number of links on a page.
  • 13. Simplified PageRank algorithm • Assume four web pages: A, B, C and D. Let each page begin with an estimated PageRank of 0.25. • [Figure: link diagram among pages A, B, C and D] • L(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: PR(A) = sum of PR(v) / L(v) over every page v that links to A.
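As a minimal sketch of the simplified update, each page can be thought of as splitting its current PageRank evenly among the pages it links to. The link graph below is hypothetical (the slide's diagram is not recoverable from the transcript), and `simplified_step` is an illustrative name, not something from the paper.

```python
from fractions import Fraction

# Hypothetical link graph: each key links to the pages in its list.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

def simplified_step(links, rank):
    """One update: PR(p) = sum of PR(v) / L(v) over pages v linking to p."""
    new_rank = {page: Fraction(0) for page in links}
    for source, targets in links.items():
        share = rank[source] / len(targets)  # L(source) = number of out-links
        for target in targets:
            new_rank[target] += share
    return new_rank

# Every page starts with an estimated PageRank of 0.25.
rank = {page: Fraction(1, 4) for page in links}
rank = simplified_step(links, rank)

for page, value in sorted(rank.items()):
    print(page, value)
```

In this assumed graph nothing links to D, so D ends the step with no rank at all; the damping factor on the next slide is what keeps such pages from dropping to zero.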
  • 14. PageRank algorithm including damping factor • Assume page A has pages B, C, D, ..., which point to it. • The parameter d is a damping factor which can be set between 0 and 1. It is usually set to 0.85. • The PageRank of a page A is given as follows: PR(A) = (1 - d) + d (PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) + ...)
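The damped formula is usually evaluated by repeated substitution until the values settle. The sketch below assumes the form given in the paper, PR(A) = (1 - d) + d * (PR(B)/L(B) + ...), a hypothetical four-page graph, and a fixed iteration count instead of a convergence test; `pagerank` is an illustrative name.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(p) = (1 - d) + d * sum(PR(v) / L(v)) over pages v linking to p."""
    rank = {page: 1.0 for page in links}  # assumed starting value
    for _ in range(iterations):
        new_rank = {page: 1.0 - d for page in links}
        for source, targets in links.items():
            for target in targets:
                # source passes a d-damped, equal share to each page it links to
                new_rank[target] += d * rank[source] / len(targets)
        rank = new_rank
    return rank

# Hypothetical link graph: each key links to the pages in its list.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

With this form of the formula the ranks add up to the number of pages rather than to 1, provided every page has at least one outgoing link. A page with no incoming links, such as D here, keeps the baseline value 1 - d.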
  • 15. Intuitive Justification • A "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back", but eventually gets bored and starts on another random page. – The probability that the random surfer visits a page is its PageRank. – With damping factor d, the "random surfer" follows a link with probability d and, with probability 1 - d, gets bored and requests another random page. • A page can have a high PageRank – If there are many pages that point to it – Or if there are some pages that point to it, and have a high PageRank.
  • 16. Anchor Text • <A href="http://www.yahoo.com/">Yahoo!</A> • The text of a hyperlink (anchor text) is associated with the page that the link is on, and it is also associated with the page the link points to. • Why? – Anchors often provide more accurate descriptions of web pages than the pages themselves. – Anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases.
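To illustrate associating anchor text with the page a link points to, the sketch below pulls (target URL, anchor text) pairs out of an HTML fragment with Python's standard `html.parser` and files the text under the target URL. The class name `AnchorCollector` and the `anchor_index` dictionary are assumptions made for this example, not structures from the paper.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collects (target URL, anchor text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":  # HTMLParser lowercases tag names
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None

collector = AnchorCollector()
collector.feed('<A href="http://www.yahoo.com/">Yahoo!</A>')

# File each anchor text under the page the link points to.
anchor_index = {}
for href, text in collector.anchors:
    anchor_index.setdefault(href, []).append(text)

print(anchor_index)
```

Because the text is stored under the target URL, a page can be found by words that other pages use to describe it, even if the page itself was never crawled or contains no indexable text.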
  • 17. Other Features • It has location information for all hits (uses proximity in search) • Google keeps track of some visual presentation details such as font size of words. • Words in a larger or bolder font are weighted higher than other words. • Full raw HTML of pages is available in a repository
  • 18. Google Architecture Implemented in C and C++ on Solaris and Linux
  • 19. Google Architecture • Multiple crawlers run in parallel. • Keeps track of URLs that have and need to be crawled. • Each crawler keeps its own DNS lookup cache and ~300 connections open at once. • Compresses and stores web pages. • Contains full HTML of every web page. Each document is prefixed by docID, length, and URL. • Uncompresses and parses documents. Stores link information in anchors file. • Stores each link and text surrounding link. Converts relative URLs into absolute URLs.
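The conversion of relative URLs into absolute URLs can be illustrated with the standard library's `urllib.parse.urljoin`. This is only a sketch of the idea, not the original system's code, and the base page and link values below are made-up examples.

```python
from urllib.parse import urljoin

# Hypothetical page on which the links were found.
base_url = "http://example.com/dir/page.html"

# Hypothetical href values as they might appear in that page.
raw_links = ["img/logo.gif", "../other.html", "/top.html", "http://www.yahoo.com/"]

# Resolve every link against the page it was found on.
absolute_links = [urljoin(base_url, link) for link in raw_links]

for raw, absolute in zip(raw_links, absolute_links):
    print(f"{raw:24} -> {absolute}")
```

Links that are already absolute pass through unchanged, so the same call can be applied to every link found on a page.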
  • 20. Google Architecture • Maps absolute URLs into docIDs stored in Doc Index. Stores anchor text in “barrels”. Generates database of links (pairs of docIDs). • Parses & distributes hit lists into “barrels”. • Partially sorted forward indexes sorted by docID. Each barrel stores hitlists for a given range of wordIDs. • In-memory hash table that maps words to wordIDs. Contains pointer to doclist in barrel which wordID falls into. • Creates inverted index whereby document list containing docID and hitlists can be retrieved given wordID. • DocID keyed index where each entry includes info such as pointer to doc in repository, checksum, statistics, status, etc. Also contains URL info if doc has been crawled. If not, just contains URL.
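The relationship between the word-to-wordID table, the forward index keyed by docID, and the inverted index keyed by wordID can be sketched in a few lines. Everything here is a toy in-memory version under stated assumptions: the two documents are invented, a hit is reduced to a word position, and there are no barrels or on-disk formats.

```python
from collections import defaultdict

# Hypothetical documents keyed by docID.
documents = {
    1: "the anatomy of a search engine",
    2: "a web search engine",
}

lexicon = {}        # word -> wordID
forward_index = {}  # docID -> {wordID: [positions]} (hitlists per document)

for doc_id, text in documents.items():
    hits = defaultdict(list)
    for position, word in enumerate(text.split()):
        word_id = lexicon.setdefault(word, len(lexicon))  # assign next free wordID
        hits[word_id].append(position)
    forward_index[doc_id] = dict(hits)

# Re-sort the forward index by wordID to obtain the inverted index.
inverted_index = defaultdict(list)  # wordID -> [(docID, hitlist), ...]
for doc_id in sorted(forward_index):
    for word_id, hitlist in forward_index[doc_id].items():
        inverted_index[word_id].append((doc_id, hitlist))

def lookup(word):
    """Return [(docID, hitlist), ...] for a word, or [] if it is unknown."""
    word_id = lexicon.get(word)
    if word_id is None:
        return []
    return inverted_index.get(word_id, [])

print(lookup("search"))
```

A query then needs only two lookups: word to wordID in the lexicon, and wordID to document list in the inverted index.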
  • 21. Single Word Query Ranking • Hitlist is retrieved for single word • Each hit can be one of several types: title, anchor, URL, large font, small font, etc. • Each hit type is assigned its own weight • Type-weights make up vector of weights • Number of hits of each type is counted to form count-weight vector • Dot product of type-weight and count-weight vectors is used to compute IR score • IR score is combined with PageRank to compute final rank
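The single-word scoring steps can be sketched as a dot product between a type-weight vector and a count-weight vector. The weights, the cap used to turn counts into count-weights, and the way the IR score is blended with PageRank (`alpha`) are all assumptions for illustration; the paper does not publish these values.

```python
# Hypothetical weight per hit type.
TYPE_WEIGHTS = {
    "title": 5.0,
    "anchor": 4.0,
    "url": 3.0,
    "large_font": 2.0,
    "small_font": 1.0,
}

def count_weight(count, cap=4):
    """Assumed count-weight: grows with the count, then stops at a cap."""
    return float(min(count, cap))

def ir_score(hit_types):
    """Dot product of type-weights and count-weights for one document."""
    counts = {hit_type: 0 for hit_type in TYPE_WEIGHTS}
    for hit_type in hit_types:
        counts[hit_type] += 1
    return sum(TYPE_WEIGHTS[t] * count_weight(counts[t]) for t in TYPE_WEIGHTS)

def final_rank(hit_types, pagerank, alpha=0.5):
    """Assumed combination: weighted sum of IR score and PageRank."""
    return alpha * ir_score(hit_types) + (1 - alpha) * pagerank

# Hypothetical hitlist for the query word in one document.
hits = ["title", "anchor", "small_font", "small_font"]
print(ir_score(hits), final_rank(hits, pagerank=1.0))
```

Capping the count-weight means that repeating a word many times adds nothing beyond the cap, which limits the benefit of keyword stuffing.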
  • 22. Multi-word Query Ranking • Similar to single-word ranking, except that the proximity of words in a document must now be analyzed • Hits occurring closer together are weighted higher than those farther apart • Each proximity relation is classified into 1 of 10 bins ranging from a “phrase match” to “not even close” • Each type and proximity pair has a type-prox weight • Counts are converted into count-weights • Take the dot product of count-weights and type-prox weights to compute the IR score
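A rough sketch of the proximity step for a two-word query follows. The bin boundaries, the per-bin weights, and the count cap are hypothetical, and hit types are left out so that each bin has a single weight instead of one weight per type and proximity pair.

```python
import bisect

# Hypothetical upper bounds (distance in word positions) for bins 0..8.
# Bin 0 is a "phrase match"; anything beyond the last bound falls in bin 9,
# "not even close".
BIN_UPPER_BOUNDS = [1, 2, 3, 5, 8, 13, 21, 34, 55]

# Hypothetical weight per proximity bin (closer hits count for more).
PROX_WEIGHTS = [10.0, 8.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.5, 1.0, 0.0]

def proximity_bin(pos_a, pos_b):
    """Classify the distance between two hit positions into one of 10 bins."""
    distance = abs(pos_a - pos_b)
    return bisect.bisect_left(BIN_UPPER_BOUNDS, distance)

def proximity_score(positions_a, positions_b, cap=4):
    """Dot product of capped per-bin counts and the per-bin weights."""
    counts = [0] * len(PROX_WEIGHTS)
    for a in positions_a:
        for b in positions_b:
            counts[proximity_bin(a, b)] += 1
    count_weights = [min(count, cap) for count in counts]
    return sum(w * cw for w, cw in zip(PROX_WEIGHTS, count_weights))

# Hypothetical hit positions of two query words in one document.
print(proximity_score([3, 40], [4, 100]))
```

In this example only the adjacent pair (positions 3 and 4) contributes substantially; the pairs that are far apart land in the last bins and add little or nothing.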
  • 23. The Past: Original Page # 1 When Larry Page and Sergey Brin began work on their search engine, it wasn’t originally called Google. They called it Backrub (a reference to the algorithm, which used backlinks to rank pages) and only changed the name a year into development. And yes, the hand in the logo was Larry Page’s, scanned.
  • 24. The Past: Original Page # 2 The original Google webpage (in 1997)
  • 26. The Future? “The ultimate search engine would understand exactly what you mean and give back exactly what you want.” - Larry Page
  • 27. References… • Brin, Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. • www.cs.uvm.edu/~xwu/kdd • http://www.ics.uci.edu/~scott/google.htm