Slides for a talk I gave at The Fifth Elephant conference in 2012.
Description copied from the conference's website:
No, this is not another tutorial on using Solr/ElasticSearch/Sphinx/Lucene. Imagine that none of these existed and you need a search engine for your shiny new eCommerce startup. What would you do? Build your own search engine, of course. In this session at The Fifth Elephant 2012, Siddhartha Reddy describes what it takes to do that.
Here is a video of the talk: https://hasgeek.tv/fifthelephant/2012-1/49-siddhartha-reddy-build-your-own-search-engine
12. Term-Document Matrix
Brutus AND Caesar AND NOT Calpurnia
Brutus: 110100
Caesar: 110111
Calpurnia: 010000
NOT Calpurnia: 101111
110100 AND 110111 AND 101111 = 100100
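The bitwise evaluation above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the dictionary `vectors`, the helper `bit_not`, and the convention that the leftmost bit is the first document are assumptions made for the example.

```python
# Incidence vectors from the slide: one bit per document,
# leftmost bit = first document (an assumed convention).
NUM_DOCS = 6
vectors = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,
}


def bit_not(vector, width=NUM_DOCS):
    """Complement a vector, keeping only the lowest `width` bits."""
    return ~vector & ((1 << width) - 1)


# Brutus AND Caesar AND NOT Calpurnia
result = vectors["Brutus"] & vectors["Caesar"] & bit_not(vectors["Calpurnia"])
result_bits = format(result, f"0{NUM_DOCS}b")

# 0-based positions of the matching documents, counted from the left.
matching_docs = [i for i, bit in enumerate(result_bits) if bit == "1"]

print(result_bits)
print(matching_docs)
```

The mask in `bit_not` matters because Python integers are unbounded, so a plain `~` would produce a negative number rather than a 6-bit complement.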
16. Ranking/Scoring
“mysql performance”
Top 25 Best Linux Performance Monitoring and Debugging Tools
8 great MySQL Performance Tips
Linux performance: is Linux becoming just too slow and bloated?
MySQL Performance Blog
17. Ranking/Scoring
“mysql performance”
Term Frequency (Tf)
Document | mysql | performance | Total
Top 25 Best Linux Performance Monitoring and Debugging Tools | 1 | 23 | 24
8 great MySQL Performance Tips | 5 | 7 | 12
Linux performance: is Linux becoming just too slow and bloated? | 3 | 12 | 15
MySQL Performance Blog | 6 | 8 | 14
* random numbers
18. Ranking/Scoring
“mysql performance”
Term Frequency (Tf), sorted by Total
Document | mysql | performance | Total
Top 25 Best Linux Performance Monitoring and Debugging Tools | 1 | 23 | 24
Linux performance: is Linux becoming just too slow and bloated? | 3 | 12 | 15
MySQL Performance Blog | 6 | 8 | 14
8 great MySQL Performance Tips | 5 | 7 | 12
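Ranking by raw term frequency, as in the table above, amounts to summing the per-term counts for each document and sorting. A minimal sketch, assuming the slide's illustrative counts; the names `term_counts` and `tf_score` are hypothetical:

```python
# Term counts per document for the query terms, taken from the slide
# (the slide notes these are random numbers, not real measurements).
term_counts = {
    "Top 25 Best Linux Performance Monitoring and Debugging Tools": {"mysql": 1, "performance": 23},
    "8 great MySQL Performance Tips": {"mysql": 5, "performance": 7},
    "Linux performance: is Linux becoming just too slow and bloated?": {"mysql": 3, "performance": 12},
    "MySQL Performance Blog": {"mysql": 6, "performance": 8},
}


def tf_score(counts, query_terms):
    """Sum the raw term frequencies of the query terms in one document."""
    return sum(counts.get(term, 0) for term in query_terms)


query_terms = "mysql performance".split()

# Highest total term frequency first.
ranked_by_tf = sorted(
    term_counts,
    key=lambda title: tf_score(term_counts[title], query_terms),
    reverse=True,
)

for title in ranked_by_tf:
    print(tf_score(term_counts[title], query_terms), title)
```

With these counts the Linux monitoring article ranks first even though it mentions "mysql" only once, because the common term "performance" dominates the sum.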
21. Ranking/Scoring
“mysql performance”
Term | Idf
mysql | 10
performance | 2
Tf * Idf
Document | mysql | performance | Total
Top 25 Best Linux Performance Monitoring and Debugging Tools | 1 * 10 | 23 * 2 | 56
Linux performance: is Linux becoming just too slow and bloated? | 3 * 10 | 12 * 2 | 54
MySQL Performance Blog | 6 * 10 | 8 * 2 | 76
8 great MySQL Performance Tips | 5 * 10 | 7 * 2 | 64
22. Ranking/Scoring
“mysql performance”
Tf * Idf, sorted by Total
Document | mysql | performance | Total
MySQL Performance Blog | 6 * 10 | 8 * 2 | 76
8 great MySQL Performance Tips | 5 * 10 | 7 * 2 | 64
Top 25 Best Linux Performance Monitoring and Debugging Tools | 1 * 10 | 23 * 2 | 56
Linux performance: is Linux becoming just too slow and bloated? | 3 * 10 | 12 * 2 | 54
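The Tf * Idf totals above can be reproduced with a short script. This is a sketch under the slide's own numbers: the Idf weights (10 and 2) and the term counts are the illustrative values from the slides, not computed from a real corpus, and the names `idf`, `term_counts`, and `tf_idf_score` are hypothetical.

```python
# Idf weights from the slide: the rarer term "mysql" gets the larger weight.
idf = {"mysql": 10, "performance": 2}

# Term counts per document, taken from the slide (illustrative numbers).
term_counts = {
    "Top 25 Best Linux Performance Monitoring and Debugging Tools": {"mysql": 1, "performance": 23},
    "Linux performance: is Linux becoming just too slow and bloated?": {"mysql": 3, "performance": 12},
    "MySQL Performance Blog": {"mysql": 6, "performance": 8},
    "8 great MySQL Performance Tips": {"mysql": 5, "performance": 7},
}


def tf_idf_score(counts, query_terms):
    """Sum Tf * Idf over the query terms for one document."""
    return sum(counts.get(term, 0) * idf.get(term, 0) for term in query_terms)


query_terms = "mysql performance".split()
scores = {title: tf_idf_score(counts, query_terms) for title, counts in term_counts.items()}

# Highest Tf * Idf total first.
ranked_by_tf_idf = sorted(scores, key=scores.get, reverse=True)

for title in ranked_by_tf_idf:
    print(scores[title], title)
```

Weighting by Idf moves the two MySQL-focused documents to the top, which is the reordering the slide illustrates.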
23. Boolean Search vs. Ranked Search
• Boolean Search
o Rich query syntax
o No relevance scoring
o Ex: Patent search, Enterprise search
o Precision/Recall controlled by user
• Ranked Search
o Simple query syntax
o Relevance ranking/scoring is key
o Ex: Web Search, Flipkart Search
o Search engine needs to balance Precision/Recall
26. Building an Inverted Index
Documents
→ Text Analysis
→ term,documentId pairs
→ Sort (Disk)
→ term,documentId pairs, sorted
→ Persist
→ termId = term, termId = postingId (dictionary)
→ postingId = postingsList (postings file)
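The pipeline above can be sketched in memory as follows. Several simplifications are assumptions for the sake of a short example: the `analyze` function (lowercase, split on non-alphanumeric characters) is a hypothetical stand-in for real text analysis, the sort happens in memory rather than on disk, and the dictionary maps each term directly to a posting id instead of going through a separate termId.

```python
import re
from itertools import groupby


def analyze(text):
    """Hypothetical text analysis: lowercase, split on non-alphanumerics."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]


def build_inverted_index(documents):
    """Return (dictionary, postings) for a {documentId: text} mapping."""
    # Step 1: emit (term, documentId) pairs; the set drops repeated
    # occurrences of a term within the same document.
    pairs = {
        (term, doc_id)
        for doc_id, text in documents.items()
        for term in analyze(text)
    }

    # Step 2: sort by term, then documentId (in memory here; the slide
    # sorts on disk, which matters once the pairs no longer fit in RAM).
    sorted_pairs = sorted(pairs)

    # Step 3: group consecutive pairs by term into postings lists.
    dictionary = {}  # term -> postingId
    postings = []    # postingId -> sorted list of documentIds
    for term, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        dictionary[term] = len(postings)
        postings.append([doc_id for _, doc_id in group])
    return dictionary, postings


documents = {
    1: "MySQL Performance Blog",
    2: "8 great MySQL Performance Tips",
    3: "Linux performance: is Linux becoming just too slow and bloated?",
}
dictionary, postings = build_inverted_index(documents)
print(postings[dictionary["mysql"]])
print(postings[dictionary["performance"]])
```

Sorting first is what makes the grouping step a single sequential pass, which is why the same approach carries over to an on-disk sort of much larger pair files.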