Build Your Own Search Engine
Siddhartha Reddy
@the5el

1
SOME Ideas Underlying a Search Engine
Siddhartha Reddy
@the5el

2
Siddhartha Reddy
o Loves Lives Search
o Leads Product Discovery (search) team at Flipkart
o @sids

3
“Should I build my own search engine?”	

No.	


4
“Should I build my own search engine?”	

No. Use Lucene/Solr/ElasticSearch/Sphinx/etc.	

“What are we doing here then?”	


5
[Diagram: search engine components. Query side: “User”, Query/Response, Text Analysis, Searcher. Indexing side: Documents, Text Analysis, Indexer. Both sides meet at the Index.]
Text Analysis

8
Text Analysis
“I was killed i' the Capitol; Brutus killed me.”

• Tokenization
  I, was, killed, i', the, Capitol, Brutus, killed, me
• Stop-word Removal (drops "was" and "the")
  I, killed, i', Capitol, Brutus, killed, me
• Normalization
  o Case-folding
    i, killed, i', capitol, brutus, killed, me
  o Synonyms (colour = color; pavement = footpath)
  o Others (Accents and diacritics, Abbreviations etc.)
• Stemming
  i, kill, i', capitol, brutus, kill, me
  - English Porter (Rule-based)
  - KStemmer (Dictionary-based)
• Lemmatization (saw = see; saw = hacksaw)

9
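The analysis steps on this slide can be sketched as a small pipeline. This is only an illustration: the stop-word list is a two-word assumption chosen to match the example sentence, and `naive_stem` is a toy suffix stripper standing in for a real stemmer such as Porter or KStemmer. All function names are hypothetical.

```python
import re

# Assumed stop-word list, just enough for the example sentence.
STOP_WORDS = {"was", "the"}


def tokenize(text):
    """Split text into runs of letters and apostrophes."""
    return re.findall(r"[A-Za-z']+", text)


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def case_fold(tokens):
    """Normalize case so that 'Brutus' and 'brutus' become the same term."""
    return [t.lower() for t in tokens]


def naive_stem(token):
    """Toy stemmer: strip 'ing' or 'ed' when at least 3 characters remain."""
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Run the full chain: tokenize, remove stop words, case-fold, stem."""
    tokens = case_fold(remove_stop_words(tokenize(text)))
    return [naive_stem(t) for t in tokens]


terms = analyze("I was killed i' the Capitol; Brutus killed me.")
print(terms)
```

The same `analyze` function would be applied to both documents and queries, so that a query term and an indexed term end up in the same normalized form.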
The Index

11
Term-Document Matrix

Brutus AND Caesar AND NOT Calpurnia

Term            Incidence vector
Brutus          110100
Caesar          110111
Calpurnia       010000
NOT Calpurnia   101111

110100 AND 110111 AND 101111 = 100100
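The bitwise evaluation above can be reproduced with plain integers. This is a minimal sketch that assumes six documents and a Calpurnia vector of 010000 (the value implied by NOT Calpurnia = 101111); the helper names are made up for illustration.

```python
N_DOCS = 6
ALL_DOCS = (1 << N_DOCS) - 1  # 0b111111, used to mask the complement

# One bit per document; the leftmost bit is the first document.
incidence = {
    "brutus": 0b110100,
    "caesar": 0b110111,
    "calpurnia": 0b010000,
}


def bits(vector):
    """Render a vector as a fixed-width bit string."""
    return format(vector, f"0{N_DOCS}b")


def matching_docs(vector):
    """Return 1-based document numbers whose bit is set."""
    return [i + 1 for i, b in enumerate(bits(vector)) if b == "1"]


not_calpurnia = ~incidence["calpurnia"] & ALL_DOCS
result = incidence["brutus"] & incidence["caesar"] & not_calpurnia

print(bits(not_calpurnia))    # 101111
print(bits(result))           # 100100
print(matching_docs(result))  # [1, 4]
```

The matrix is simple to query but needs one bit per term per document, which is the usual motivation for moving to an inverted index.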

12
Inverted Index

Brutus AND Caesar AND NOT Calpurnia
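With an inverted index, the same query is answered by merging sorted postings lists instead of ANDing bit vectors. The sketch below assumes postings that mirror the term-document matrix of the previous slide (Brutus in documents 1, 2, 4; Caesar in 1, 2, 4, 5, 6; Calpurnia in 2); the function names are hypothetical.

```python
# Term -> sorted list of document IDs (postings list).
index = {
    "brutus": [1, 2, 4],
    "caesar": [1, 2, 4, 5, 6],
    "calpurnia": [2],
}


def intersect(p1, p2):
    """AND: walk both sorted lists in step, keeping IDs found in both."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def difference(p1, p2):
    """AND NOT: keep IDs from p1 that do not appear in p2."""
    i = j = 0
    out = []
    while i < len(p1):
        if j >= len(p2) or p1[i] < p2[j]:
            out.append(p1[i])
            i += 1
        elif p1[i] == p2[j]:
            i += 1
            j += 1
        else:
            j += 1
    return out


hits = difference(intersect(index["brutus"], index["caesar"]), index["calpurnia"])
print(hits)  # [1, 4]
```

Both merges run in time proportional to the combined length of the two lists, which is why postings are kept sorted by document ID.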

13
[Diagram: the same components as before (“User”, Query/Response, Text Analysis, Searcher, Index, Indexer, Documents), with the Searcher labeled “(Ranking/Scoring)”.]
Relevance
Ranking/Scoring

15
Ranking/Scoring
“mysql performance”
Top 25 Best Linux Performance Monitoring and Debugging Tools
8 great MySQL Performance Tips
Linux performance: is Linux becoming just too slow and bloated?
MySQL Performance Blog

16
Ranking/Scoring
“mysql performance”
Term Frequency (Tf)

Document                                                         mysql  performance  Total
Top 25 Best Linux Performance Monitoring and Debugging Tools         1           23     24
8 great MySQL Performance Tips                                       5            7     12
Linux performance: is Linux becoming just too slow and bloated?      3           12     15
MySQL Performance Blog                                               6            8     14

* random numbers
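Term frequency is just a per-document count of each term. The sketch below counts query terms in a made-up sentence (not one of the documents on the slide, whose numbers are random); `term_frequencies` is a hypothetical helper.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count occurrences of each lowercase alphanumeric token."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


# Hypothetical document text, for illustration only.
doc = "MySQL performance tips: tune MySQL for performance, measure performance"
tf = term_frequencies(doc)

query = ["mysql", "performance"]
per_term = {term: tf[term] for term in query}
total = sum(per_term.values())

print(per_term)  # {'mysql': 2, 'performance': 3}
print(total)     # 5
```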

17
Ranking/Scoring
“mysql performance”
Term Frequency (Tf)

Document                                                         mysql  performance  Total
Top 25 Best Linux Performance Monitoring and Debugging Tools         1           23     24
Linux performance: is Linux becoming just too slow and bloated?      3           12     15
MySQL Performance Blog                                               6            8     14
8 great MySQL Performance Tips                                       5            7     12

18
Ranking/Scoring
“mysql performance”

• Inverse Document Frequency (Idf)	

• How common (or rare) is a term?	

• 1 / (no. of documents the term occurs in)

19
Ranking/Scoring
“mysql performance”

score = Tf * Idf

• Tf normalized by document length
• Idf dampened by applying a function (log)
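One common way to combine these two adjustments is sketched below. The slide does not give an exact formula, so this is an assumption: Tf is divided by the document length, and Idf is taken as the log of (total documents / documents containing the term). The function and parameter names are hypothetical.

```python
import math


def tf_idf(term_count, doc_length, n_docs, doc_freq):
    """Score one term in one document.

    term_count: occurrences of the term in the document
    doc_length: total number of terms in the document
    n_docs:     number of documents in the collection
    doc_freq:   number of documents containing the term
    """
    tf = term_count / doc_length       # Tf normalized by document length
    idf = math.log(n_docs / doc_freq)  # Idf dampened with a log
    return tf * idf


# A rare term (in 10 of 1000 docs) vs. a common one (in 500 of 1000 docs),
# each occurring 3 times in a 100-term document.
rare = tf_idf(3, 100, 1000, 10)
common = tf_idf(3, 100, 1000, 500)
print(rare > common)  # True
```

With this variant, a term that occurs in every document gets an Idf of log(1) = 0 and contributes nothing to the score.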

20
Ranking/Scoring
“mysql performance”

Term         Idf
mysql         10
performance    2

Tf * Idf

Document                                                            mysql  performance  Total
Top 25 Best Linux Performance Monitoring and Debugging Tools       1 * 10       23 * 2     56
Linux performance: is Linux becoming just too slow and bloated?    3 * 10       12 * 2     54
MySQL Performance Blog                                             6 * 10        8 * 2     76
8 great MySQL Performance Tips                                     5 * 10        7 * 2     64

21
Ranking/Scoring
“mysql performance”
Tf * Idf

Document                                                            mysql  performance  Total
MySQL Performance Blog                                             6 * 10        8 * 2     76
8 great MySQL Performance Tips                                     5 * 10        7 * 2     64
Top 25 Best Linux Performance Monitoring and Debugging Tools       1 * 10       23 * 2     56
Linux performance: is Linux becoming just too slow and bloated?    3 * 10       12 * 2     54
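The reordering can be checked with a few lines of code. The sketch below uses the Tf values (random numbers, per the slides) and the Idf values 10 and 2 from the deck; the `rank` helper is hypothetical. It ranks once by raw Tf and once by Tf * Idf.

```python
# Term frequencies and Idf values as shown on the slides.
TF = {
    "Top 25 Best Linux Performance Monitoring and Debugging Tools": {"mysql": 1, "performance": 23},
    "8 great MySQL Performance Tips": {"mysql": 5, "performance": 7},
    "Linux performance: is Linux becoming just too slow and bloated?": {"mysql": 3, "performance": 12},
    "MySQL Performance Blog": {"mysql": 6, "performance": 8},
}
IDF = {"mysql": 10, "performance": 2}


def rank(tf, query_terms, idf=None):
    """Rank documents by summed Tf, or by summed Tf * Idf when idf is given."""
    def score(counts):
        return sum(counts.get(t, 0) * (idf[t] if idf else 1) for t in query_terms)

    scored = [(score(counts), title) for title, counts in tf.items()]
    return sorted(scored, reverse=True)


query = ["mysql", "performance"]
by_tf = rank(TF, query)
by_tf_idf = rank(TF, query, IDF)

for total, title in by_tf_idf:
    print(total, title)
```

Ranking by Tf alone puts the two Linux articles first, because "performance" is frequent in them; weighting by Idf moves the two MySQL documents to the top.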

22
Boolean Search vs. Ranked Search
• Boolean Search
  o Rich query syntax
  o No relevance scoring
  o Precision & Recall controlled by user
  o Ex: Patent search, Enterprise search
• Ranked Search
  o Simple query syntax
  o Relevance ranking/scoring is key
  o Search Engine needs to balance Precision & Recall
  o Ex: Web Search, Flipkart Search

23
Indexing

25
Building an Inverted Index
• Documents
• Text Analysis
• term,documentId pairs
• Sort (Disk)
• term,documentId pairs, sorted
• Persist
  o termId = term
  o termId = postingId (dictionary)
  o postingId = postingsList (postings file)

26
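The pipeline on this slide can be sketched in a few lines. This is a simplified illustration under stated assumptions: everything stays in memory instead of going through disk, text analysis is reduced to lowercasing and splitting on whitespace, and the termId indirection is skipped so that the dictionary maps a term directly to a postingId. The function name `build_index` is hypothetical.

```python
from itertools import groupby


def build_index(docs):
    """Build (dictionary, postings_file) from a {doc_id: text} mapping."""
    # 1. Text analysis: emit (term, documentId) pairs.
    pairs = [
        (term, doc_id)
        for doc_id, text in docs.items()
        for term in text.lower().split()
    ]
    # 2. Sort the pairs by term, then documentId (duplicates removed).
    pairs = sorted(set(pairs))
    # 3. Persist: group pairs by term into postings lists.
    dictionary = {}      # term -> postingId
    postings_file = []   # postingId -> postings list
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        dictionary[term] = len(postings_file)
        postings_file.append([doc_id for _, doc_id in group])
    return dictionary, postings_file


# Hypothetical documents, for illustration only.
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar was noble",
    3: "Brutus was noble",
}
dictionary, postings_file = build_index(docs)
print(postings_file[dictionary["brutus"]])  # [1, 3]
```

Sorting is what brings all pairs for the same term together, so each postings list can be written out in a single pass.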
Scaling Indexing and Searching

28
`large` Inverted Index

29
Distributed Indexing

30
Distributed Search

Term      Postings
brutus    d1, d3, d6, d7
caesar    d1, d2, d4, d8, d9
noble     d5, d10
with      d1, d2, d3, d5
killed    d8

31
Distributed Search

Partitioning by terms
• First partition
  brutus    d1, d3, d6, d7
  caesar    d1, d2, d4, d8, d9
• Second partition
  noble     d5, d10
  with      d1, d2, d3, d5
  killed    d8

Partitioning by documents
• First partition
  brutus    d1, d3
  caesar    d1, d2, d4
  noble     d5
  with      d1, d2, d3, d5
• Second partition
  brutus    d6, d7
  caesar    d8, d9
  noble     d10
  killed    d8

32
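The two partitioning schemes can be sketched as follows. The index is the one from the slide (with document IDs as integers); the assignment rules, terms starting before "d" on the first partition and documents d1 to d5 on the first partition, are assumptions chosen to reproduce the split shown, and all function names are hypothetical.

```python
index = {
    "brutus": [1, 3, 6, 7],
    "caesar": [1, 2, 4, 8, 9],
    "noble": [5, 10],
    "with": [1, 2, 3, 5],
    "killed": [8],
}


def partition_by_terms(index, n_shards, shard_of_term):
    """Each shard holds the complete postings list for a subset of terms."""
    shards = [{} for _ in range(n_shards)]
    for term, postings in index.items():
        shards[shard_of_term(term)][term] = list(postings)
    return shards


def partition_by_documents(index, n_shards, shard_of_doc):
    """Each shard holds every term, but only for its own documents."""
    shards = [{} for _ in range(n_shards)]
    for term, postings in index.items():
        for doc_id in postings:
            shards[shard_of_doc(doc_id)].setdefault(term, []).append(doc_id)
    return shards


def search_all_shards(shards, term):
    """Document-partitioned lookup: ask every shard, then merge the results."""
    hits = []
    for shard in shards:
        hits.extend(shard.get(term, []))
    return sorted(hits)


by_terms = partition_by_terms(index, 2, lambda term: 0 if term[0] < "d" else 1)
by_docs = partition_by_documents(index, 2, lambda doc_id: 0 if doc_id <= 5 else 1)

print(by_docs[1])                            # second partition: d6 to d10
print(search_all_shards(by_docs, "brutus"))  # [1, 3, 6, 7]
```

With term partitioning a single-term query touches one shard, while with document partitioning every shard answers every query and the partial results are merged.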
Images Attribution
• Introduction to Information Retrieval
  o By Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze

33
Thank You
siddhartha@flipkart.com	

@sids

34
