Anatomy of a search engine
• Not much is known about AltaVista (AV), Lycos,
  Yahoo, etc.
• But the designs of Google and Clever (to some
  extent) are published
• Design criteria
• Differences
• Architecture
• Data structures
Requirements
• Basic IR concepts:
  – Recall: what % of relevant docs are retrieved
  – Precision: what % of docs retrieved are relevant
• Quantity:
  – handle hundreds of thousands of queries/sec
• Quality
  – High precision (not achieved by present engines)
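
The two IR measures above can be made concrete with a small sketch. The function name and the toy document IDs are assumptions chosen for illustration; they do not come from any particular engine.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant docs that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: 4 docs retrieved, 3 docs relevant, 2 in common.
p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"],
                        relevant=["d2", "d4", "d7"])
print(p, r)
```

Here half of the retrieved docs are relevant (precision) and two of the three relevant docs were found (recall).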
PageRank
• Idea: a page is important when it is referred
  to a lot, or referred to from an important
  page
• PR is used to prioritize; works well even
  when the search is just on page titles
PR details
• Pages T1,…,Tn point to page A; C(A) is the link
  fan-out of A (its number of outgoing links)
PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
d = damping factor = 0.85
Model of a random walk on the Web
PR(p) = probability that a “random” user will visit p
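
A minimal sketch of how the PR formula can be iterated to a fixed point. The graph, the function name, and the iteration count are assumptions for illustration, and every page in the toy graph has at least one outgoing link. In this form of the formula the scores of such a graph sum to the number of pages rather than to 1.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A.

    links maps each page to the list of pages it points to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {p: 1.0 - d for p in pages}
        for src, targets in links.items():
            if not targets:
                continue  # a page without out-links passes nothing on
            share = d * pr[src] / len(targets)  # d * PR(T) / C(T)
            for t in targets:
                new_pr[t] += share
        pr = new_pr
    return pr


# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A
pr = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
for page in sorted(pr):
    print(page, round(pr[page], 3))
```

C is pointed to by both A and B, so it ends up ranked above B, which only receives half of A's vote.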
Other features and terms
• Anchor text is associated with the page it
  links to
• Some markup aspects are used (e.g., font size
  and capitalization of words)
Google architecture
• URL server sends lists of URLs to be fetched
  to the crawlers
• StoreServer compresses and stores the fetched
  pages
• Indexer extracts words, their position, font
  size, and capitalization
• Anchors file contains links and their text
• Sorter generates the inverted index
• Searcher uses the Lexicon, the inverted index
  (II), and PR
Some details
• Barrels store words (wordIDs); if a doc
  contains a word, the doc's ID and the wordID
  are stored with the hit list of this word in the doc
• Lexicon points to Inverted Barrels; each word
  points to docIDs and hits
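
The relationship between lexicon, barrels, and hit lists can be sketched as follows. This is a simplification under stated assumptions: a hit here is only a word position (the slides say real hits also carry size and capitalization), everything is kept in in-memory dicts, and the function name is hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Build a lexicon, forward barrels, and an inverted index from {docID: text}."""
    lexicon = {}   # word -> wordID
    forward = {}   # docID -> {wordID: hit list of positions}
    for doc_id, text in docs.items():
        hits = defaultdict(list)
        for pos, word in enumerate(text.lower().split()):
            word_id = lexicon.setdefault(word, len(lexicon))  # new words get the next ID
            hits[word_id].append(pos)
        forward[doc_id] = dict(hits)

    # The "sorter" step: regroup the forward entries by wordID.
    inverted = defaultdict(list)  # wordID -> [(docID, hit list), ...] sorted by docID
    for doc_id in sorted(forward):
        for word_id, hitlist in forward[doc_id].items():
            inverted[word_id].append((doc_id, hitlist))
    return lexicon, forward, dict(inverted)


docs = {1: "web search engine", 2: "search the web web"}
lexicon, forward, inverted = build_index(docs)
print(inverted[lexicon["web"]])
```

For the word "web" the inverted entry lists doc 1 with one hit and doc 2 with two hits.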
Operation
• Crawling
• Searching
• Ranking
Crawling and indexing
• Parsing into anchors and words – error
  robustness (flex + its own stack)
• Indexing in parallel – hashing into barrels
  using the lexicon – the problem of sharing
  the lexicon when new words appear
Searching
1 parse the query
2 convert words into wordIDs
3 identify a barrel for each word
4 scan doclists until a doc that matches all the
  search words is found
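
The four steps above can be sketched as a merge over sorted doclists. The toy lexicon and inverted index are assumptions for illustration, and this version collects every matching doc instead of stopping at the first one.

```python
def search(query, lexicon, inverted):
    """Return the docIDs that contain every word of the query."""
    # Steps 1 and 2: parse the query and convert words into wordIDs.
    word_ids = []
    for word in query.lower().split():
        if word not in lexicon:
            return []  # an unknown word cannot match any doc
        word_ids.append(lexicon[word])
    if not word_ids:
        return []

    # Step 3: fetch the doclist (sorted docIDs) for each word.
    doclists = [[doc_id for doc_id, _ in inverted[w]] for w in word_ids]

    # Step 4: advance through all doclists in parallel.
    result = []
    pointers = [0] * len(doclists)
    while all(p < len(dl) for p, dl in zip(pointers, doclists)):
        current = [dl[p] for p, dl in zip(pointers, doclists)]
        target = max(current)
        if all(c == target for c in current):
            result.append(target)               # this doc matches all words
            pointers = [p + 1 for p in pointers]
        else:
            # skip ahead in every list that is behind the largest docID
            pointers = [p + 1 if c < target else p
                        for p, c in zip(pointers, current)]
    return result


# Hypothetical index: wordID -> [(docID, hit list), ...]
lexicon = {"web": 0, "search": 1, "engine": 2}
inverted = {
    0: [(1, [0]), (2, [2, 3]), (5, [1])],
    1: [(1, [1]), (2, [0]), (4, [0])],
    2: [(1, [2]), (5, [0])],
}
print(search("web search", lexicon, inverted))
print(search("web engine", lexicon, inverted))
```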
Ranking
• For a single word, get the hit list, count
  the # of hits of each type (title, anchor, ...),
  and vector-multiply the counts with the type weights
• Combine with PR
• For multiple words, take proximity into
  account
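
A rough sketch of single-word ranking as described above. All numbers are hypothetical: the type weights, the cap on counts, and the linear way of combining the IR score with PR are assumptions for illustration, since the slides do not give them.

```python
from collections import Counter

# Hypothetical weights per hit type (not taken from the slides).
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "plain": 1.0}
COUNT_CAP = 5  # assumed cap so that many repeated hits stop adding to the score


def ir_score(hit_types):
    """Dot product of the capped per-type hit counts with the type weights."""
    counts = Counter(hit_types)
    return sum(TYPE_WEIGHTS[t] * min(counts[t], COUNT_CAP) for t in TYPE_WEIGHTS)


def final_score(hit_types, pagerank, pr_weight=0.5):
    """Combine the IR score with PR; a weighted sum is one possible choice."""
    return (1 - pr_weight) * ir_score(hit_types) + pr_weight * pagerank


print(ir_score(["title", "plain", "plain"]))
print(final_score(["title"], pagerank=2.0))
```

Proximity scoring for multi-word queries is left out of this sketch.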
Going further
• Google will not return any IBM pages for
  the query ‘mainframes’
• Many pages that point to the IBM page use the
  term ‘mainframe’, so this page should be
  returned
• Clever ranks authority pages and hub pages.
  Authorities are pages with high PR. Hubs are
  pages that point to authorities. E.g. my friend’s
  page with a list of links to on-line CD stores. Hubs
  may not be chosen by PR alone
• Clever/HITS (Hyperlink Induced Topic Search)
  starts with an initial set of pages and hubs
Mathematically speaking…
• Let $x_p$ be the authority weight of page $p$ and $y_q$ the hub weight
  of page $q$; $q \to p$ denotes that $q$ links to $p$
     $x_p = \sum_{q \to p} y_q \qquad y_p = \sum_{p \to q} x_q$


• Let $A$ be the adjacency matrix: $A_{i,j} = 1$ if there is a
  link from $i$ to $j$, 0 otherwise
$x \leftarrow A^T y$ and $y \leftarrow A x$
$x \leftarrow A^T A x$, and we can iterate that further,
  working with powers of $A^T A$
This sequence of powers (with $x$ normalized after each step)
  converges to the principal eigenvector of $A^T A$
This means that the result does not depend on
  the initial weights
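
The iteration above can be sketched in a few lines. The toy graph and the function name are assumptions for illustration, and the weights are normalized to unit length after every round so that they stay bounded.

```python
import math


def hits(links, iterations=50):
    """Iterate authority (x) and hub (y) weights on a link graph.

    links maps each page to the list of pages it points to.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # x_p = sum of y_q over all q that link to p  (x <- A^T y)
        auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                auth[p] += hub[q]
        # y_p = sum of x_q over all q that p links to  (y <- A x)
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # normalize both vectors to unit Euclidean length
        for vec in (auth, hub):
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for p in vec:
                vec[p] /= norm
    return auth, hub


# Hypothetical graph: three hub pages pointing at two candidate authorities.
auth, hub = hits({"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]})
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this graph a1 receives links from all three hubs and gets the largest authority weight, while h3, which points to only one authority, gets the smallest hub weight among the hubs.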
• Remove ‘local’ links (“back to the main
  page”)
• Drift: transfer of main authority to, e.g.,
  topics of hobbies
• Hijacking: if several pages from the same
  site occur in the base set, they may take
  over a topic
• Remedied by partial content indexing (anchor
  text), and by dividing a page into pagelets –
  contiguous sequences of links
• Hubs are good when learning about a topic,
  less so when seeking specific info.
Other engines
• AltaVista and Lycos probably use simple
  selection methods
• Excite seems to use many properties of
  the pages
• See “What is a tall poppy among Web pages?”, 7th Int’l
  WWW Conf.
