Presented by Stefan Pohl, Senior Research Engineer, HERE, a Nokia Business
Besides result quality, the time from submitting a query to displaying its results is of utmost importance to user satisfaction. Search engine implementations such as Apache Lucene therefore direct significant development effort towards reducing query latency. In this session, I will explain the reasons for high query latencies and describe general approaches and recent developments within Lucene to counter them. To make the material relevant to a wider audience, I will focus on the actual query processing, as this is at the core of every query and search use case.
2. Who Am I
● Search user, developer, researcher
● Many years in industry & academia
● Ph.D. in Information Retrieval
● Interests: Search, Big Data, Machine Learning
● Currently working on the Geocoding offer of HERE, Nokia's Location Platform
● Spare time: Lucene contributor
7 Nov 2013
Query Latency Optimization with Lucene
4. Motivation: Query Latency
● Human Reaction Time: 200 ms *
→ Backend latency: << 200 ms
● Faster queries mean a higher manageable load
● Costs
* Steven C. Seow, Designing and Engineering Time: The Psychology of Time Perception in Software, Addison-Wesley Professional, 2008.
7. First: Do Your Homework
● Keep enough RAM for OS (disk buffer cache)
● Reduce HDD “pressure” (e.g. throttle indexing)
● SSDs
● Warming
● Ideally: your index fits in memory
See http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
8. Mining Hypothesis
● Check if query latencies are reproducible
● If not, try to find correlations with system events:
– Many new incoming docs to index?
– Other daemons spike in disk or CPU activity?
– Garbage Collections?
– Other sar statistics (e.g. paging)
● If yes, profile
– First, your code
– Don't instrument Lucene internal low-level classes
9. Hypothesis Testing
● You really think you understand the problem and have a potential solution?
● Try it out (if it's cheap)!
● Otherwise, think of (cheap) experiments that
– Give confidence
– Tell you (and others) what the gains are (ROI)
10. Example: In-memory
● Buy more memory / bigger machine!?
● Simulate¹
– Consecutively execute the same query multiple times
– Much lower memory requirement (i.e. the size of the involved postings)
– Repeat for sample of queries of interest
● Gives lower bound on query latency
¹ S. Pohl, A. Moffat. Measurement Techniques and Caching Effects. In Proceedings of the 31st European Conference on Information Retrieval, Toulouse, France, April 2009. Springer.
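The simulation described on this slide could be sketched as a small timing harness. This is only an illustration under assumptions: `WarmLatencyProbe`, `measureWarmNanos`, and the `Runnable` standing in for "execute one query" are hypothetical names, not part of Lucene. The first run is treated as warm-up (it pulls the involved postings into the caches), and the fastest of the repeated runs is taken as the approximate in-memory lower bound.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper: estimates the warm-cache latency of a single query. */
final class WarmLatencyProbe {

    /**
     * Runs the query once without measuring it, then repeats it and returns
     * the fastest observed run, in the units of the supplied clock.
     * `query` is a stand-in for whatever executes a search in your system.
     */
    static long measureWarmNanos(Runnable query, int repetitions, LongSupplier clock) {
        if (repetitions < 1) {
            throw new IllegalArgumentException("repetitions must be >= 1");
        }
        query.run();                        // cold run: loads postings, not measured
        long best = Long.MAX_VALUE;
        for (int i = 0; i < repetitions; i++) {
            long start = clock.getAsLong();
            query.run();                    // consecutive execution of the same query
            best = Math.min(best, clock.getAsLong() - start);
        }
        return best;
    }

    /** Convenience overload that measures with System.nanoTime(). */
    static long measureWarmNanos(Runnable query, int repetitions) {
        return measureWarmNanos(query, repetitions, System::nanoTime);
    }
}
```

Repeating this for a sample of queries of interest gives one lower-bound figure per query; only the postings of the query under test need to fit in memory at a time.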
12. Conjunctions (i.e. AND / Occur.MUST)
● Sort Boolean clauses by increasing DocFreq f_t
13. Conjunctions (i.e. AND / Occur.MUST)
● next() on sparsest posting list (“lead”)
14. Conjunctions (i.e. AND / Occur.MUST)
● advance(18) on next sparsest posting list → fail
15. Conjunctions (i.e. AND / Occur.MUST)
● Start all over again with “lead”, but advance(22)
16. Conjunctions (i.e. AND / Occur.MUST)
● Try to advance(31) on all other posting lists
17. Conjunctions (i.e. AND / Occur.MUST)
● Try to advance(31) on all other posting lists
18. Conjunctions (i.e. AND / Occur.MUST)
● Try to advance(31) on all other posting lists
19. Conjunctions (i.e. AND / Occur.MUST)
● Match found → R = {31
20. Conjunctions (i.e. AND / Occur.MUST)
● next() on “lead” → R = {31}
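The conjunction walk-through on slides 12 to 20 can be sketched in plain Java. This is a toy model under assumptions, not Lucene's scorer: `Postings` and `Conjunction` are hypothetical classes over sorted `int[]` doc IDs, `advance()` uses a binary search where a real index would use skip data, and the three example lists in the usage note are invented because the slide figure is not part of this transcript.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy posting list over a sorted int[]; hypothetical, not Lucene's iterator. */
final class Postings {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docs;
    private int pos = -1;

    Postings(int... sortedDocIds) { this.docs = sortedDocIds; }

    int size() { return docs.length; }

    int docId() { return pos < 0 ? -1 : (pos < docs.length ? docs[pos] : NO_MORE_DOCS); }

    /** Moves to the following doc. */
    int next() { pos++; return docId(); }

    /** Moves to the first doc >= target, never backwards. */
    int advance(int target) {
        int from = Math.min(Math.max(pos, 0), docs.length);
        int i = Arrays.binarySearch(docs, from, docs.length, target);
        pos = i >= 0 ? i : -i - 1;          // insertion point if target is absent
        return docId();
    }
}

final class Conjunction {
    /** Intersects the clauses; assumes at least one clause is given. */
    static List<Integer> intersect(Postings... clauses) {
        Postings[] sorted = clauses.clone();
        // Sort clauses by increasing document frequency: sparsest list leads.
        Arrays.sort(sorted, (a, b) -> Integer.compare(a.size(), b.size()));
        Postings lead = sorted[0];
        List<Integer> result = new ArrayList<>();
        int candidate = lead.next();
        outer:
        while (candidate != Postings.NO_MORE_DOCS) {
            for (int i = 1; i < sorted.length; i++) {
                int doc = sorted[i].advance(candidate);
                if (doc > candidate) {              // fail: start over with the lead
                    candidate = lead.advance(doc);
                    continue outer;
                }
            }
            result.add(candidate);                  // all clauses sit on candidate
            candidate = lead.next();
        }
        return result;
    }
}
```

With the invented lists {18, 31}, {5, 22, 31} and {2, 9, 22, 27, 31, 37}, the sketch follows the same steps as the slides: the lead yields 18, advance(18) on the next list lands on 22, the lead is advanced to 22 and lands on 31, and all other lists then agree on 31.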
21. Disjunctions (i.e. OR / Occur.SHOULD)
22. Disjunctions (i.e. OR / Occur.SHOULD)
● next() on all clauses
23. Disjunctions (i.e. OR / Occur.SHOULD)
● Track clauses in min-heap → R = {2
24. Disjunctions (i.e. OR / Occur.SHOULD)
● next() on all previously matched clauses → R = {2,4
25. Disjunctions (i.e. OR / Occur.SHOULD)
● next() on all previously matched clauses → R = {2,4,5
26. Disjunctions (i.e. OR / Occur.SHOULD)
● next() → R = {2,4,5,7
27. Disjunctions (i.e. OR / Occur.SHOULD)
● next() → R = {2,4,5,7,9
28. Disjunctions (i.e. OR / Occur.SHOULD)
● next() → R = {2,4,5,7,9,11
29. Disjunctions (i.e. OR / Occur.SHOULD)
● next() → R = {2,4,5,7,9,11,12
30. Disjunctions (i.e. OR / Occur.SHOULD)
● next() → R = {2,4,5,7,9,11,12,16
31. Disjunctions (i.e. OR / Occur.SHOULD)
● next() → R = {2,4,5,7,9,11,12,16,18
32. Disjunctions (i.e. OR / Occur.SHOULD)
● next() → R = {2,4,5,7,9,11,12,16,18,20
33. Disjunctions (i.e. OR / Occur.SHOULD)
● next() → R = {2,4,5,7,9,11,12,16,18,20,22
34. Disjunctions (i.e. OR / Occur.SHOULD)
● next() → R = {2,4,5,7,9,11,12,16,18,20,22,26
35. Disjunctions (i.e. OR / Occur.SHOULD)
● next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27
36. Disjunctions (i.e. OR / Occur.SHOULD)
● next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29
37. Disjunctions (i.e. OR / Occur.SHOULD)
● next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31
38. Disjunctions (i.e. OR / Occur.SHOULD)
● next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32
39. Disjunctions (i.e. OR / Occur.SHOULD)
● next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32,37
40. Disjunctions (i.e. OR / Occur.SHOULD)
● next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32,37}
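The disjunction walk-through on slides 21 to 40 can be sketched with a min-heap of cursors ordered by current doc ID. Again this is a toy model, not Lucene's implementation: `Disjunction`, `Cursor`, and `union` are hypothetical names, and the three input lists in the usage note are invented so that their union equals the final result set R shown on slide 40.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Toy disjunction over sorted doc-ID arrays; a sketch, not Lucene's scorer. */
final class Disjunction {

    /** Cursor into one posting list. */
    private static final class Cursor {
        final int[] docs;
        int pos = 0;
        Cursor(int[] docs) { this.docs = docs; }
        int doc() { return docs[pos]; }
        boolean next() { return ++pos < docs.length; }   // false once exhausted
    }

    static List<Integer> union(int[]... postingLists) {
        // Min-heap ordered by each clause's current doc ID.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<Cursor>((a, b) -> Integer.compare(a.doc(), b.doc()));
        for (int[] docs : postingLists) {
            if (docs.length > 0) {
                heap.add(new Cursor(docs));              // position every clause on its first doc
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int doc = heap.peek().doc();                 // smallest current doc is the next match
            result.add(doc);
            // Move every clause that sits on the matched doc to its next doc.
            while (!heap.isEmpty() && heap.peek().doc() == doc) {
                Cursor c = heap.poll();
                if (c.next()) {
                    heap.add(c);                         // each re-insert costs O(log |C|)
                }
            }
        }
        return result;
    }
}
```

For the invented lists {2, 7, 12, 18, 22, 31}, {4, 5, 9, 11, 16, 20, 26, 27, 31, 37} and {2, 12, 29, 31, 32}, every posting is touched once and passes through the heap, which is where a cost proportional to n log |C| comes from.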
41. Why Can Query Processing Be Slow?
● Disjunctive Processing: O(n log |C|)
– High DF terms (large n)
– Many terms (large |C|), e.g. query expansion
– No / too little use of advance()
● Filter (over-use)
42. Filter
● Aims:
– (Pre-)computation of common sub-queries
– Cache result
– Don't influence scoring
● Limitation
– Additional cost for 1st query
– Currently, no skip information generated
→ Adding filter as a conjunct to queries can sometimes be faster
e.g. http://java.dzone.com/news/fast-lucene-search-filters
43. Stopword Removal
● Removal of High-DocFreq terms from
– Index: 10-30% space saving
– Query: no very expensive terms
● Limitation:
– “To be or not to be”
● In general, don't do it
44. Minor, But Easy Improvements
● Reduce information, increase locality:
– Don't store TF, if it's almost always 1 (and you don't need positions),
  fieldType.setIndexOptions(IndexOptions.DOCS_ONLY);
– Use BlockPostingsFormat (default in Lucene ≥ 4.1)
● Tune Space/Time/Quality tradeoffs:
– DirectDocValues
– Less complex scoring function
46. MinShouldMatch
(Lucene-4571)
● Don't want matches on only one (stop-)word?
● Enforce at least mm>1 terms to be present!
● Synthetic example query used during dev:
Terms:    ref   restored  struck  wings  dublin
DocFreq:  3.8M  32k       32k     32k    32k
E.g. mm=2:
Conjunctive processing: advance()
Disjunctive processing: next()
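One way to see why a minimum-should-match constraint saves work is a pigeonhole argument: a document that matches at least mm of |C| clauses must occur in at least one of the |C| - mm + 1 sparsest posting lists. The sketch below illustrates that argument only; it is not Lucene's scorer, `MinShouldMatchSketch` and `match` are hypothetical names, and a binary search stands in for `advance()`. For mm=2 and the five example terms, only the four 32k lists generate candidates with next(), while the 3.8M list of "ref" is merely probed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy minimum-should-match evaluation over sorted doc-ID arrays. */
final class MinShouldMatchSketch {

    static List<Integer> match(int mm, int[]... postingLists) {
        if (mm < 1 || mm > postingLists.length) {
            throw new IllegalArgumentException("mm must be in [1, number of clauses]");
        }
        int[][] byDocFreq = postingLists.clone();
        // Sparsest lists first, so the most frequent clauses end up at the tail.
        Arrays.sort(byDocFreq, (a, b) -> Integer.compare(a.length, b.length));

        // Disjunctive part: candidates come only from the |C| - mm + 1 sparsest lists.
        int leads = byDocFreq.length - mm + 1;
        TreeSet<Integer> candidates = new TreeSet<>();
        for (int i = 0; i < leads; i++) {
            for (int doc : byDocFreq[i]) {
                candidates.add(doc);
            }
        }

        // Conjunctive part: count in how many clauses each candidate occurs.
        List<Integer> result = new ArrayList<>();
        for (int doc : candidates) {
            int hits = 0;
            for (int[] list : byDocFreq) {
                if (Arrays.binarySearch(list, doc) >= 0) {   // stands in for advance(doc)
                    hits++;
                }
            }
            if (hits >= mm) {
                result.add(doc);
            }
        }
        return result;
    }
}
```

The documents that occur only in the most frequent list are never visited at all, which is the source of the savings for queries that mix one very common term with several rare ones.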
51. MinShouldMatch – Open Questions
(Lucene-4571)
● How bad is it to exclude docs that only match one, but an important term?
● Why is it enough to match any mm terms?
● Why not provide a list of stop-words to a 'StopwordExcludingScorer'? (But be careful: “To Be Or Not To Be”)
52. ReqOptSumScorer
● Benefit:
– Conjunctive processing on required clauses
– Calls advance() on optional clauses
● How do you determine which clauses are required?
– Lookup term statistics (i.e. DocFreq)
– 2nd lookup unnecessary, if you hand over stats to query
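The required/optional split can be illustrated with a small sketch: the required clause drives iteration, and each optional clause is only moved forward to the docs the required clause produces. This is a simplified model with one required clause and a made-up scoring rule (1 point per matching clause); `ReqOptSketch` and `score` are hypothetical names, and a bounded binary search stands in for `advance()`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy required-plus-optional evaluation over sorted doc-ID arrays. */
final class ReqOptSketch {

    /** Returns doc ID -> score for every doc of the required clause. */
    static Map<Integer, Integer> score(int[] required, int[]... optional) {
        Map<Integer, Integer> scores = new LinkedHashMap<>();
        int[] pos = new int[optional.length];       // current position per optional clause
        for (int doc : required) {
            int score = 1;                          // the required clause always matches
            for (int i = 0; i < optional.length; i++) {
                // Search forward only, from where this clause stopped last time.
                int idx = Arrays.binarySearch(optional[i], pos[i], optional[i].length, doc);
                if (idx >= 0) {
                    score++;                        // optional clause adds to the score
                    pos[i] = idx;
                } else {
                    pos[i] = -idx - 1;              // first doc greater than the target
                }
            }
            scores.put(doc, score);
        }
        return scores;
    }
}
```

Docs that occur only in optional clauses are skipped over and never scored, so the cost is driven by the length of the required list rather than by the optional ones.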
53. CommonTermsQuery (≥ 4.1)
(Lucene-4628)
● Looks up term infos (docfreq, posting list offset)
● Categorizes query terms as
– Low-freq: At least one low-freq term MUST occur in result doc
– High-freq: SHOULD occur in doc → their presence adds to score
● Executes query, but hands over term statistics
→ no 2nd round of term lookups necessary!
● Also supports MinShouldMatch
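The categorization step might look roughly like the following sketch. It is an illustration under assumptions rather than the CommonTermsQuery source: `TermCategories`, `categorize`, and the cutoff expressed as a fraction of the index size are hypothetical, and the document frequencies are passed in as a map instead of being looked up in an index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy split of query terms into low-frequency and high-frequency groups. */
final class TermCategories {
    final List<String> lowFreq = new ArrayList<>();    // at least one of these MUST match
    final List<String> highFreq = new ArrayList<>();   // SHOULD match, only add to the score

    /**
     * @param docFreqs       term -> document frequency (looked up once, then reused)
     * @param maxDoc         number of documents in the index
     * @param maxDocFraction terms in more than this fraction of docs count as high-freq
     */
    static TermCategories categorize(Map<String, Long> docFreqs, long maxDoc, double maxDocFraction) {
        TermCategories out = new TermCategories();
        double cutoff = maxDocFraction * maxDoc;
        for (Map.Entry<String, Long> e : docFreqs.entrySet()) {
            (e.getValue() > cutoff ? out.highFreq : out.lowFreq).add(e.getKey());
        }
        return out;
    }
}
```

With the document frequencies of the earlier example query, an assumed index of 10 million docs, and a cutoff of 10%, "ref" (3.8M) falls into the high-frequency group and the 32k terms into the low-frequency group.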
54. Cost-Model (≥ 4.3)
(Lucene-4607)
● What about structured queries? E.g. +(a b) +c
● Currently: worst-case estimate of returned #docs (docfreq)
– Disjunctions: Σ_{c∈C} df_c
– Conjunctions: min_{c∈C} df_c
● Limitations:
– Effort to generate returned docs?
– Only one cost (next() vs. advance())
● Open Question:
– Can we do better with more detailed cost models?
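The two rules above compose recursively over a query tree, which is what makes them usable for structured queries such as +(a b) +c. A minimal sketch, assuming a hypothetical `QueryNode` type with `term`, `or`, and `and` factories (none of these names come from Lucene):

```java
/** Toy query tree whose cost is a worst-case estimate of the docs it returns. */
abstract class QueryNode {

    abstract long cost();

    /** Leaf: the cost of a term is its document frequency. */
    static QueryNode term(long docFreq) {
        return new QueryNode() {
            long cost() { return docFreq; }
        };
    }

    /** Disjunction: at worst, all clauses match disjoint docs, so costs add up. */
    static QueryNode or(QueryNode... clauses) {
        return new QueryNode() {
            long cost() {
                long sum = 0;
                for (QueryNode c : clauses) {
                    sum += c.cost();
                }
                return sum;
            }
        };
    }

    /** Conjunction: cannot return more docs than its cheapest clause. */
    static QueryNode and(QueryNode... clauses) {
        return new QueryNode() {
            long cost() {
                long min = Long.MAX_VALUE;   // also the result for zero clauses
                for (QueryNode c : clauses) {
                    min = Math.min(min, c.cost());
                }
                return min;
            }
        };
    }
}
```

For +(a b) +c with assumed document frequencies a=1000, b=200, c=50, the inner disjunction costs 1200 and the conjunction costs min(1200, 50) = 50, so c would be chosen as the lead clause.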
55. Maxscore Top-k Scoring Algorithm¹
(Lucene-4100)
● Experimental prototype code attached to Lucene-4100
● Limitation:
– Requires final run over whole index (i.e. only for static indexes)
¹ H. Turtle, J. Flood. Query Evaluation: Strategies and Optimizations, IPM, 31(6), 1995.
56. Index Sorting (≥ 4.3)
(Lucene-4752)
● Advantages (if appropriate sort order chosen)
– Better compression → more locality → faster processing
– Early termination
● Use together with EarlyTerminatingSortingCollector
– Can terminate scoring within sorted segments
– Fully scores as-yet unsorted segments
→ see 2nd half of Shai & Adrian's talk yesterday for details
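Early termination within a sorted segment can be illustrated as follows. The sketch assumes that doc IDs in the segment are already assigned in the desired result order, so the first k matches in index order are the top k. `EarlyTermination` and `firstK` are hypothetical names, the `IntPredicate` stands in for "does this doc match the query", and the fallback for unsorted segments is not covered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Toy early-terminating collection over one segment sorted in result order. */
final class EarlyTermination {

    /**
     * Visits docs 0..maxDoc-1 in index order and stops as soon as k matches
     * have been collected; later docs cannot rank higher in a sorted segment.
     */
    static List<Integer> firstK(int maxDoc, IntPredicate matches, int k) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < maxDoc && hits.size() < k; doc++) {
            if (matches.test(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

For a segment of 1000 docs where every third doc matches and k=4, only docs 0 to 9 are examined instead of all 1000.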
57. Parallelization
● In general, sharding is better:
– Shared-nothing
– Better use cores for handling load
● Multi-threaded query execution:
– Static indexes: for slow queries, almost perfect speedups (if docs are uniformly distributed over shards)
– Dynamic indexes: Lucene-2840, Lucene-5299
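Multi-threaded execution of a single query can be sketched as "search each partition in its own task, then merge the partial top-k lists". The code below is a toy model under assumptions: `ParallelTopK` is a hypothetical class, each `int[]` holds the scores of the matching docs of one partition, and a full sort stands in for a proper top-k heap.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy parallel top-k: one task per partition, followed by a merge step. */
final class ParallelTopK {

    static List<Integer> topK(List<int[]> partitionScores, int k, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<Integer>>> partials = new ArrayList<>();
            for (int[] partition : partitionScores) {
                Callable<List<Integer>> task = () -> localTopK(partition, k);
                partials.add(pool.submit(task));
            }
            List<Integer> merged = new ArrayList<>();
            for (Future<List<Integer>> f : partials) {
                try {
                    merged.addAll(f.get());          // wait for each partial result
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return best(merged, k);
        } finally {
            pool.shutdown();
        }
    }

    /** Top k scores of one partition, highest first. */
    private static List<Integer> localTopK(int[] scores, int k) {
        List<Integer> all = new ArrayList<>();
        for (int s : scores) {
            all.add(s);
        }
        return best(all, k);
    }

    private static List<Integer> best(List<Integer> scores, int k) {
        scores.sort(Collections.reverseOrder());
        return new ArrayList<>(scores.subList(0, Math.min(k, scores.size())));
    }
}
```

The speedup of such a scheme depends on the partitions taking similar time, which is why a uniform distribution of docs matters; the merge step is sequential but only handles k entries per partition.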
58. Summary
● Understand your problem
● Scoring can become an issue with many million docs
● Many recent efficiency improvements
● More to come... patches welcome
59. We're Hiring @HERE
Frankfurt, Berlin, Boston, Chicago.
Come work with us.
Get in touch!
developer.here.com/geocoder
60. Thank You!
Contact
Email : stefan.pohl@here.com
Web : http://linkedin.com/in/stefanpohl
Twitter : @pohlstefan