Websrc~1

This document discusses the architecture and design of search engines like Google. It covers their basic requirements around recall, precision and handling query volume. It also describes PageRank and how it prioritizes pages that are frequently referenced by other important pages. The document outlines Google's architecture including components like crawlers, data storage, indexing and ranking. It provides technical details on how Google processes queries and ranks results.

Anatomy of a search engine
• Not much is known about AltaVista (AV), Lycos, Yahoo, etc.
• But Google and Clever (to some extent) are published
• Design criteria
• Differences
• Architecture
• Data structures

Requirements
• Basic IR concepts:
– Recall: what % of relevant docs are retrieved
– Precision: what % of docs retrieved are relevant
• Quantity:
– handle hundreds of thousands of queries/sec
• Quality:
– High precision (not achieved by present engines)
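
To make the two IR measures concrete, here is a minimal sketch in Python. The function name `precision_recall` and the document IDs are made up for illustration; they are not from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant docs that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 docs retrieved, 3 docs are actually relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Two of the four retrieved docs are relevant (precision 1/2), and two of the three relevant docs are retrieved (recall 2/3).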

PageRank
• Idea: a page is important when it is referred to a lot, or referred to from an important page
• PR is used to prioritize; works well even when search is just on page titles

PR details
• Pages T1,…,Tn point to page A; C(A) is the link fan-out of A (the number of links going out of A)
PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
d = damping factor = 0.85
Model of a random walk on the Web
PR(p) = probability that a “random” user will visit p
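
The formula can be evaluated by simple repeated substitution. The sketch below is one way to do that, not Google's implementation: the three-page graph, the function name `pagerank`, and the fixed iteration count are assumptions, and pages without out-links get no special handling.

```python
def pagerank(links, d=0.85, iterations=100):
    """links maps each page to the list of pages it links to.

    Every page must appear as a key. Applies
    PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A.
    """
    pr = {page: 1.0 for page in links}        # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for a in links:
            # Contribution of every page t that points to a: PR(t) / C(t)
            incoming = sum(pr[t] / len(links[t]) for t in links if a in links[t])
            new_pr[a] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical toy web: A -> B, C; B -> C; C -> A
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph C ends up ranked highest: it is pointed to by both A and B, and B passes its whole rank to C because C(B) = 1.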

Other features and terms
• Anchor text is associated with the page it
links to
• Some markup aspects are used
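
A minimal sketch of the anchor-text idea: the words of a link's text are indexed under the page the link points to, not the page that contains the link. The URLs and the function name `index_anchor_text` are hypothetical.

```python
from collections import defaultdict


def index_anchor_text(links):
    """links: iterable of (source_url, target_url, anchor_text) triples.

    Returns a mapping word -> set of target URLs described by that word.
    """
    index = defaultdict(set)
    for _source, target, text in links:
        for word in text.lower().split():
            index[word].add(target)           # credit the link target
    return index


# Hypothetical links; the target page itself never has to contain the words.
links = [
    ("fan.example/list", "vendor.example", "mainframe vendor"),
    ("blog.example/post", "vendor.example", "mainframe history"),
    ("blog.example/post", "music.example", "CD store"),
]
anchor_index = index_anchor_text(links)
print(sorted(anchor_index["mainframe"]))
```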

Google architecture
• URL server sends lists of URLs to be fetched to the crawlers
• StoreServer compresses and stores pages
• Indexer extracts words, their position, font size, and capitalization
• Anchors file contains links and their text
• Sorter generates the inverted index
• Searcher uses the Lexicon, the inverted index (II), and PR

Some details
• Barrels store words (as wordIDs); if a doc contains a word, the doc's ID and the wordID are stored with the hit list of this word in the doc
• Lexicon points to the inverted barrels; each word points to docIDs and hits
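
The relationship between lexicon, barrels, and hit lists can be sketched as follows. This is a simplified, single-barrel illustration with assumed names (`build_index`, `lexicon`, `barrel`); a hit is reduced to a word position here.

```python
from collections import defaultdict


def build_index(docs):
    """docs maps docID -> text. Returns (lexicon, barrel).

    lexicon: word -> wordID
    barrel:  wordID -> {docID: hit list (word positions in that doc)}
    """
    lexicon = {}
    barrel = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            # Assign the next free wordID the first time a word is seen.
            word_id = lexicon.setdefault(word, len(lexicon))
            barrel[word_id].setdefault(doc_id, []).append(pos)
    return lexicon, barrel


# Hypothetical two-document collection.
docs = {1: "web search engine", 2: "search the web"}
lexicon, barrel = build_index(docs)
print(lexicon)
print(dict(barrel))
```

Looking up a word means going from the lexicon to its wordID, and from there to the docIDs and hit lists stored in the barrel.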

Operation
• Crawling
• Searching
• Ranking

Crawling and indexing
• Parsing into anchors and words: must be robust to errors (flex + stack)
• Indexing in parallel: hashing into barrels using the lexicon; the problem is that new words must be shared

Searching
1. Parse the query
2. Convert words into wordIDs
3. Identify a barrel for each word
4. Scan the doclists until a doc that matches all the search words is found
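
The four steps above can be sketched in a few lines. Note the swap: instead of scanning the doclists until the first match, this sketch intersects them as sets and returns every matching doc. The `lexicon` and `barrel` literals are hypothetical toy data.

```python
def search(query, lexicon, barrel):
    """Return the sorted docIDs that contain every word of the query."""
    word_ids = []
    for word in query.lower().split():        # 1. parse the query
        if word not in lexicon:               # unknown word: no doc can match
            return []
        word_ids.append(lexicon[word])        # 2. convert words into wordIDs
    if not word_ids:
        return []
    # 3. fetch the doclist of each word from the barrel
    doclists = [set(barrel[word_id]) for word_id in word_ids]
    # 4. keep only the docs that match all the search words
    return sorted(set.intersection(*doclists))


# Hypothetical toy index: word -> wordID, wordID -> {docID: positions}
lexicon = {"web": 0, "search": 1, "engine": 2}
barrel = {0: {1: [0], 2: [2]}, 1: {1: [1], 2: [0]}, 2: {1: [2]}}
print(search("web search", lexicon, barrel))
print(search("search engine", lexicon, barrel))
```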

Ranking
• For a single word, identify the hit list and the type of each hit, count the number of hits of each type, and vector-multiply the counts by the type weights
• Combine with PR
• For multiple words, take proximity into account
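
A sketch of the single-word case: count the hits of each type, take the dot product with a weight vector, and combine the result with PR. The hit types, the weight values, and the linear combination are all assumptions for illustration; the slides do not give them, and proximity for multi-word queries is not covered.

```python
from collections import Counter

# Hypothetical weights per hit type (not from the source).
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "plain": 1.0}


def ir_score(hit_types):
    """Dot product of per-type hit counts with the type weights."""
    counts = Counter(hit_types)
    return sum(weight * counts[hit_type] for hit_type, weight in TYPE_WEIGHTS.items())


def final_score(hit_types, pagerank, pr_weight=0.5):
    """Combine the IR score with PR; a linear mix is assumed here."""
    return (1 - pr_weight) * ir_score(hit_types) + pr_weight * pagerank


# Hypothetical doc: the word occurs once in the title and twice in plain text.
hit_types = ["title", "plain", "plain"]
print(ir_score(hit_types))                    # 8.0 + 2 * 1.0 = 10.0
print(final_score(hit_types, pagerank=2.0))   # 0.5 * 10.0 + 0.5 * 2.0 = 6.0
```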

Going further
• Google will not return any IBM pages for the query ‘mainframes’
• Many pages that point to the IBM page use the term ‘mainframe’, so this page should be returned

• Clever ranks authority pages and hub pages. Authorities are pages with high PR. Hubs are pages that point to authorities, e.g. my friend’s page with a list of links to on-line CD stores. Hubs may not be chosen by PR alone
• Clever/HITS (Hyperlink-Induced Topic Search) starts with an initial set of pages and hubs

Mathematically speaking…
• Let $x_p$ be the authority weight of page p and $y_q$ the hub weight of page q; $q \to p$ denotes that q links to p
$x_p = \sum_{q \to p} y_q$, $y_p = \sum_{p \to q} x_q$

• Let A be the adjacency matrix: $A_{i,j} = 1$ if there is a link from i to j, 0 otherwise

$x \leftarrow A^T y$ and $y \leftarrow A x$
$x \leftarrow A^T A x$, and we can iterate that further, working with powers of $A^T A$
This sequence of powers (normalized at each step) converges to the principal eigenvector of $A^T A$
This means that the result does not depend on the initial weights
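
The iteration above can be sketched in plain Python. The four-page graph, the function names, and the fixed iteration count are assumptions; the weights are rescaled to unit length after every step so that they stay bounded.

```python
import math


def normalize(v):
    """Scale v to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm else v


def hits(adj, iterations=50):
    """adj[i][j] == 1 if page i links to page j. Returns (authority, hub)."""
    n = len(adj)
    x = [1.0] * n                             # authority weights
    y = [1.0] * n                             # hub weights
    for _ in range(iterations):
        # x <- A^T y: authority of p sums the hub weights of pages linking to p
        x = [sum(adj[q][p] * y[q] for q in range(n)) for p in range(n)]
        # y <- A x: hub weight of p sums the authorities of pages p links to
        y = [sum(adj[p][q] * x[q] for q in range(n)) for p in range(n)]
        x, y = normalize(x), normalize(y)
    return x, y


# Hypothetical graph: page 0 links to 2 and 3, page 1 links to 2.
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
authority, hub = hits(adj)
print([round(a, 3) for a in authority])
print([round(h, 3) for h in hub])
```

Page 2 comes out as the strongest authority (both hubs point to it) and page 0 as the strongest hub (it points to both authorities).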

• Remove ‘local’ links (“back to the main page”)
• Drift: transfer of main authority to, e.g., topics of hobbies
• Hijacking: if several pages from the same site occur in the base set, they may take over a topic

• Remedied by partial content indexing (anchors), and by dividing a page into pagelets (contiguous sequences of links)
• Hubs are good when learning about a topic, less so when seeking specific info.

Other engines
• AltaVista and Lycos probably use simple selection methods
• Excite seems to use many properties of the pages
• See “What is a tall poppy among Web pages?”, 7th Int’l WWW Conf.
