
Query Latency Optimization with Lucene

Nov 18, 2013

  1. Query Latency Optimization ● Stefan Pohl, Sr. Research Engineer, Ph.D. ● stefan.pohl@here.com
  2. Who Am I ● Search user, developer, researcher ● Many years in industry & academia ● Ph.D. in Information Retrieval ● Interests: Search, Big Data, Machine Learning ● Currently working on the Geocoding offer of HERE, Nokia's Location Platform ● Spare time: Lucene contributor 7 Nov 2013 Query Latency Optimization with Lucene 2
  3. Agenda ● Motivation ● Latency Optimization ● Query Processing / Scoring ● Recent Developments in Lucene
  4. Motivation: Query Latency ● Human reaction time: 200 ms [*] → backend latency must be << 200 ms ● Faster queries mean a higher manageable load ● Costs [*] Steven C. Seow, Designing and Engineering Time: The Psychology of Time Perception in Software, Addison-Wesley Professional, 2008.
  5. Motivation: Query Latency Distribution
  6. Latency Optimization
  7. First: Do Your Homework ● Keep enough RAM for the OS (disk buffer cache) ● Reduce HDD “pressure” (e.g. throttle indexing) ● SSDs ● Warming ● Ideally: your index fits in memory ● See http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
  8. Mining Hypothesis ● Check if query latencies are reproducible ● If not, try to find correlations with system events: – Many new incoming docs to index? – Other daemons spiking in disk or CPU activity? – Garbage collections? – Other sar statistics (e.g. paging) ● If yes, profile: – First, your code – Don't instrument Lucene-internal low-level classes
  9. Hypothesis Testing ● You really think you understand the problem and have a potential solution? ● Try it out (if it's cheap)! ● Otherwise, think of (cheap) experiments that – Give confidence – Tell you (and others) what the gains are (ROI)
  10. Example: In-memory ● Buy more memory / a bigger machine!? ● Simulate [1]: – Consecutively execute the same query multiple times – Much lower memory requirement (i.e. the size of the involved postings) – Repeat for a sample of queries of interest ● Gives a lower bound on query latency [1] S. Pohl, A. Moffat. Measurement Techniques and Caching Effects. In Proceedings of the 31st European Conference on Information Retrieval, Toulouse, France, April 2009. Springer.
  11. Query Processing
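The simulation on the "Example: In-memory" slide can be sketched as a small timing harness. This is a minimal sketch, not the measurement code from the cited paper: the class and method names are hypothetical, the query is passed in as an opaque `Runnable`, and taking the fastest of the repeated runs as the warm-cache estimate is an assumption.

```java
/** Sketch: estimate a warm-cache (in-memory) lower bound on one query's latency. */
final class WarmLatencySketch {

    /**
     * Runs the same query `runs` times back to back and returns the fastest
     * measured wall-clock time in nanoseconds. The first run is treated as
     * warm-up (it pulls the involved postings into the OS buffer cache) and
     * is not measured.
     */
    static long warmLowerBoundNanos(Runnable query, int runs) {
        if (runs < 2) {
            throw new IllegalArgumentException("need at least 2 runs: one to warm, one to measure");
        }
        query.run(); // warm-up run, not measured
        long best = Long.MAX_VALUE;
        for (int i = 1; i < runs; i++) {
            long start = System.nanoTime();
            query.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in workload; in practice this would execute a real search.
        Runnable fakeQuery = () -> Math.sqrt(12345.678);
        System.out.println("warm lower bound (ns): " + warmLowerBoundNanos(fakeQuery, 10));
    }
}
```

Repeating this over a sample of the queries of interest gives a rough picture of what an all-in-memory setup could achieve before any hardware is bought.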
  12. Conjunctions (i.e. AND / Occur.MUST) ● Sort Boolean clauses by increasing DocFreq $f_t$
  13. Conjunctions (i.e. AND / Occur.MUST) ● Next() on sparsest posting list (“lead”)
  14. Conjunctions (i.e. AND / Occur.MUST) ● Advance(18) on next sparsest posting list → fail
  15. Conjunctions (i.e. AND / Occur.MUST) ● Start all over again with “lead”, but advance(22)
  16. Conjunctions (i.e. AND / Occur.MUST) ● Try to advance(31) on all other posting lists
  17. Conjunctions (i.e. AND / Occur.MUST) ● Try to advance(31) on all other posting lists
  18. Conjunctions (i.e. AND / Occur.MUST) ● Try to advance(31) on all other posting lists
  19. Conjunctions (i.e. AND / Occur.MUST) ● Match found → R = {31
  20. Conjunctions (i.e. AND / Occur.MUST) ● Next() on “lead” → R = {31}
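The conjunction walk-through on the slides above can be sketched in plain Java. This is an illustration under stated assumptions, not Lucene's scorer code: `Postings` is a hypothetical stand-in for a posting-list iterator, `advance()` is a linear scan rather than a skip-list lookup, and the doc IDs in `main` are invented so that the run follows the same steps as the slides (advance(18) fails, the lead restarts at 22, and 31 is the only match).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ConjunctionSketch {

    /** Hypothetical cursor over a sorted array of doc IDs (stand-in for a posting list). */
    static final class Postings {
        static final int NO_MORE_DOCS = Integer.MAX_VALUE;
        final int[] docs;
        int pos = -1;

        Postings(int... docs) { this.docs = docs; }

        int docFreq() { return docs.length; }

        int docID() { return pos < 0 ? -1 : (pos < docs.length ? docs[pos] : NO_MORE_DOCS); }

        int next() { pos++; return docID(); }

        /** Moves to the first doc ID >= target (linear scan; real lists use skip data). */
        int advance(int target) {
            if (pos < 0) pos = 0;
            while (pos < docs.length && docs[pos] < target) pos++;
            return docID();
        }
    }

    static List<Integer> intersect(List<Postings> clauses) {
        List<Postings> sorted = new ArrayList<>(clauses);
        sorted.sort(Comparator.comparingInt(Postings::docFreq)); // sparsest first
        Postings lead = sorted.get(0);
        List<Integer> result = new ArrayList<>();
        int candidate = lead.next();
        while (candidate != Postings.NO_MORE_DOCS) {
            boolean match = true;
            for (int i = 1; i < sorted.size(); i++) {
                int doc = sorted.get(i).advance(candidate);
                if (doc != candidate) {
                    // Failed: start over with the lead, skipping to the doc we landed on.
                    candidate = lead.advance(doc);
                    match = false;
                    break;
                }
            }
            if (match) {
                result.add(candidate);
                candidate = lead.next();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented posting lists; only doc 31 occurs in all three.
        System.out.println(intersect(Arrays.asList(
                new Postings(2, 9, 31, 37),
                new Postings(18, 31),
                new Postings(5, 22, 31))));
    }
}
```

Driving the loop from the sparsest list keeps the number of candidates small, which is why the clauses are sorted by increasing DocFreq first.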
  21. Disjunctions (i.e. OR / Occur.SHOULD)
  22. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() on all clauses
  23. Disjunctions (i.e. OR / Occur.SHOULD) ● Track clauses in min-heap → R = {2
  24. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() on all previously matched clauses → R = {2,4
  25. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() on all previously matched clauses → R = {2,4,5
  26. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7
  27. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9
  28. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11
  29. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12
  30. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16
  31. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18
  32. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20
  33. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22
  34. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26
  35. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27
  36. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29
  37. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31
  38. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32
  39. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32,37
  40. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32,37}
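The heap-based disjunction shown step by step above can be sketched as follows. The posting lists in `main` are invented (the slide figures are not part of this transcript) and were chosen only so that their union equals the result set R on the last slide; `Cursor` and `union` are hypothetical names, and each list is assumed to be strictly increasing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class DisjunctionSketch {

    /** Hypothetical cursor over one sorted, non-empty posting list. */
    static final class Cursor {
        final int[] docs;
        int pos = 0;
        Cursor(int[] docs) { this.docs = docs; }
        boolean exhausted() { return pos >= docs.length; }
        int doc() { return docs[pos]; }
    }

    static List<Integer> union(List<int[]> postingLists) {
        // Min-heap ordered by each clause's current doc ID.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(Comparator.comparingInt(Cursor::doc));
        for (int[] docs : postingLists) {
            if (docs.length > 0) heap.add(new Cursor(docs));
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int doc = heap.peek().doc();
            result.add(doc);
            // next() on every clause that matched this doc, then re-insert it.
            while (!heap.isEmpty() && heap.peek().doc() == doc) {
                Cursor c = heap.poll();
                c.pos++;
                if (!c.exhausted()) heap.add(c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(union(Arrays.asList(
                new int[] {2, 5, 9, 12, 18, 22, 27, 31, 37},
                new int[] {4, 7, 11, 16, 20, 26, 29, 31, 32},
                new int[] {18, 31})));
    }
}
```

Every posting of every clause passes through the heap once, so the work grows with the total number of postings n times log |C|, with no opportunity to skip.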
  41. Why Can Query Processing Be Slow? ● Disjunctive Processing: O(n log |C|) – High-DF terms (large n) – Many terms (large |C|), e.g. query expansion – No / too little use of advance() ● Filter (over-use)
  42. Filter ● Aims: – (Pre-)computation of common sub-queries – Cache result – Don't influence scoring ● Limitations: – Additional cost for 1st query – Currently, no skip information generated → Adding the filter as a conjunct to queries can sometimes be faster, e.g. http://java.dzone.com/news/fast-lucene-search-filters
  43. Stopword Removal ● Removal of high-DocFreq terms from – Index: 10-30% space saving – Query: no very expensive terms ● Limitation: – “To be or not to be” ● In general, don't do it
  44. Minor, But Easy Improvements ● Reduce information, increase locality: – Don't store TF if it's almost always 1 (and you don't need positions): fieldType.setIndexOptions(IndexOptions.DOCS_ONLY); – Use BlockPostingsFormat (default in Lucene ≥ 4.1) ● Tune Space/Time/Quality tradeoffs: – DirectDocValues – Less complex scoring function
  45. Recent Developments within Lucene
  46. MinShouldMatch (Lucene-4571) ● Don't want matches on only one (stop-)word? ● Enforce at least mm>1 terms to be present! ● Synthetic example query used during dev (term: DocFreq): ref: 3.8M, restored: 32k, struck: 32k, wings: 32k, dublin: 32k ● E.g. mm=2: Conjunctive Processing: advance() / Disjunctive Processing: next()
  47. MinShouldMatch (Lucene-4571)
  48. MinShouldMatch (Lucene-4571)
  49. MinShouldMatch (Lucene-4571) ● HighDF 1/5: ref restored struck wings dublin (DocFreq: 3.8M 32k 32k 32k 32k) ● HighDF 2/5: ref http struck wings dublin ● HighDF 3/5: ref http from wings dublin ● HighDF 4/5: ref http from name dublin ● HighDF 5/5: ref http from name title (DocFreq: 3.8M 3.5M 3.2M 2.8M 2.4M)
  50. MinShouldMatch – Results (Lucene-4571)
  51. MinShouldMatch – Open Questions (Lucene-4571) ● How bad is it to exclude docs that match only one, but important, term? ● Why is it enough to match any mm terms? ● Why not provide a list of stop-words to a 'StopwordExcludingScorer'? (But be careful: “To Be Or Not To Be”)
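One way to see why a minimum-should-match constraint lets part of a disjunction be processed with advance() is the sketch below. It is an illustration of the idea only, not the Lucene-4571 patch: a document that matches at least mm of |C| clauses must match at least one of the |C| - mm + 1 sparsest clauses, so only those need to generate candidates, while the mm - 1 most frequent clauses are merely advanced to each candidate. The class name, the linear-scan `advance`, and the posting lists in `main` are all assumptions; doc IDs are assumed to be below `Integer.MAX_VALUE`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class MinShouldMatchSketch {

    /** Returns all doc IDs that occur in at least mm of the given sorted posting lists. */
    static List<Integer> match(List<int[]> postingLists, int mm) {
        int n = postingLists.size();
        if (mm < 1 || mm > n) throw new IllegalArgumentException("mm must be in [1, #clauses]");
        List<int[]> sorted = new ArrayList<>(postingLists);
        sorted.sort(Comparator.comparingInt((int[] docs) -> docs.length)); // sparsest first
        int headSize = n - mm + 1; // sparsest clauses: these generate candidates via next()
        int[] pos = new int[n];    // one cursor per clause
        List<Integer> result = new ArrayList<>();
        while (true) {
            // Candidate = smallest current doc among the head clauses.
            int candidate = Integer.MAX_VALUE;
            for (int i = 0; i < headSize; i++) {
                int[] docs = sorted.get(i);
                if (pos[i] < docs.length) candidate = Math.min(candidate, docs[pos[i]]);
            }
            if (candidate == Integer.MAX_VALUE) break; // all head clauses exhausted
            int count = 0;
            for (int i = 0; i < n; i++) {
                int[] docs = sorted.get(i);
                // advance(candidate): the frequent tail clauses skip everything in between.
                while (pos[i] < docs.length && docs[pos[i]] < candidate) pos[i]++;
                if (pos[i] < docs.length && docs[pos[i]] == candidate) {
                    count++;
                    pos[i]++; // consume the match
                }
            }
            if (count >= mm) result.add(candidate);
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> lists = Arrays.asList(
                new int[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, // invented high-DocFreq clause
                new int[] {2, 5, 9},
                new int[] {5, 7},
                new int[] {9, 11});
        System.out.println(match(lists, 2));
    }
}
```

With real skip data, the advance on a 3.8M-posting clause such as "ref" costs far less than stepping through every posting, which is where the slides' speedups would come from.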
  52. ReqOptSumScorer ● Benefit: – Conjunctive processing on required clauses – Calls advance() on optional clauses ● How do you determine which clauses are required? – Look up term statistics (i.e. DocFreq) – 2nd lookup unnecessary if you hand over stats to the query
  53. CommonTermsQuery (≥ 4.1) (Lucene-4628) ● Looks up term infos (docfreq, posting list offset) ● Categorizes query terms as – Low-freq: at least one low-freq term MUST occur in result doc – High-freq: SHOULD occur in doc → their presence adds to the score ● Executes query, but hands over term statistics → no 2nd round of term lookups necessary! ● Also supports MinShouldMatch
  54. Cost-Model (≥ 4.3) (Lucene-4607) ● What about structured queries? E.g. +(a b) +c ● Currently: worst-case estimate of returned #docs (docfreq) – Disjunctions: $\sum_{c \in C} df_c$ – Conjunctions: $\min_{c \in C} df_c$ ● Limitations: – Effort to generate returned docs? – Only one cost (next() vs. advance()) ● Open Question: – Can we do better with more detailed cost models?
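The low-freq / high-freq split described on the CommonTermsQuery slide can be illustrated with a few lines of plain Java. This is a sketch of the categorization step only and does not use Lucene's classes: the `cutoffFraction` parameter, the `maxDoc` value of 10 million, and the class names are assumptions, while the terms and DocFreq values are taken from the example query in the slides.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TermCategorizerSketch {

    static final class Split {
        final List<String> lowFreq = new ArrayList<>();  // at least one of these must match
        final List<String> highFreq = new ArrayList<>(); // optional, only contributes to score
    }

    /** Terms occurring in more than cutoffFraction * maxDoc documents count as high-freq. */
    static Split categorize(Map<String, Long> docFreqByTerm, long maxDoc, double cutoffFraction) {
        Split split = new Split();
        for (Map.Entry<String, Long> e : docFreqByTerm.entrySet()) {
            if (e.getValue() > cutoffFraction * maxDoc) {
                split.highFreq.add(e.getKey());
            } else {
                split.lowFreq.add(e.getKey());
            }
        }
        return split;
    }

    public static void main(String[] args) {
        Map<String, Long> docFreq = new LinkedHashMap<>();
        docFreq.put("ref", 3_800_000L);
        docFreq.put("restored", 32_000L);
        docFreq.put("struck", 32_000L);
        docFreq.put("wings", 32_000L);
        docFreq.put("dublin", 32_000L);
        Split split = categorize(docFreq, 10_000_000L, 0.1); // hypothetical index size and cutoff
        System.out.println("low-freq:  " + split.lowFreq);
        System.out.println("high-freq: " + split.highFreq);
    }
}
```

Because the DocFreq values are already in hand after this step, they can be passed along to the query so that no second round of term lookups is needed.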
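The worst-case estimate on the Cost-Model slide (sum of costs for disjunctions, minimum for conjunctions) can be written as a small recursive structure. This is a sketch under stated assumptions: the `Node`, `Term`, `Or` and `And` types are hypothetical and not Lucene classes, and the DocFreq values for the slide's example query `+(a b) +c` are invented.

```java
import java.util.Arrays;
import java.util.List;

final class CostModelSketch {

    interface Node { long cost(); }

    /** Leaf: cost is the term's document frequency. */
    static final class Term implements Node {
        final long docFreq;
        Term(long docFreq) { this.docFreq = docFreq; }
        public long cost() { return docFreq; }
    }

    /** Disjunction: may return up to the sum of its clauses' costs. */
    static final class Or implements Node {
        final List<Node> clauses;
        Or(Node... clauses) { this.clauses = Arrays.asList(clauses); }
        public long cost() {
            long sum = 0;
            for (Node c : clauses) sum += c.cost();
            return sum;
        }
    }

    /** Conjunction: cannot return more than its cheapest clause (expects >= 1 clause). */
    static final class And implements Node {
        final List<Node> clauses;
        And(Node... clauses) { this.clauses = Arrays.asList(clauses); }
        public long cost() {
            long min = Long.MAX_VALUE;
            for (Node c : clauses) min = Math.min(min, c.cost());
            return min;
        }
    }

    public static void main(String[] args) {
        // +(a b) +c with invented DocFreqs a=1000, b=500, c=200
        Node query = new And(new Or(new Term(1000), new Term(500)), new Term(200));
        System.out.println("estimated cost: " + query.cost());
    }
}
```

The single number says nothing about how expensive it is to produce those documents, which is the limitation the slide points out.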
  55. Maxscore Top-k Scoring Algorithm [1] (Lucene-4100) ● Experimental prototype code attached to Lucene-4100 ● Limitation: – Requires final run over whole index (i.e. only for static indexes) [1] H. Turtle, J. Flood. Query Evaluation: Strategies and Optimizations, IPM, 31(6), 1995.
  56. Index Sorting (≥ 4.3) (Lucene-4752) ● Advantages (if appropriate sort order chosen) – Better compression → more locality → faster processing – Early termination ● Use together with EarlyTerminatingSortingCollector – Can terminate scoring within sorted segments – Fully scores as-yet unsorted segments → see 2nd half of Shai & Adrian's talk yesterday for details
  57. Parallelization ● In general, sharding is better: – Shared-nothing – Better use cores for handling load ● Multi-threaded query execution: – Static indexes: for slow queries, almost perfect speedups (if docs are uniformly distributed over shards) – Dynamic indexes: Lucene-2840, Lucene-5299
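The early-termination behaviour described on the Index Sorting slide can be sketched without Lucene. This is an illustration, not the EarlyTerminatingSortingCollector implementation: a `Segment` here is a hypothetical list of sort values for the matching docs in doc-ID order, smaller values rank better, and the numbers in `main` are invented. In a sorted segment collection stops after k hits; an unsorted segment is scanned completely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class EarlyTerminationSketch {

    /** Hypothetical segment: sort values of its matching docs, in doc-ID order. */
    static final class Segment {
        final boolean sorted;
        final int[] sortValues;
        Segment(boolean sorted, int... sortValues) {
            this.sorted = sorted;
            this.sortValues = sortValues;
        }
    }

    static final class Result {
        final List<Integer> top; // the k best (smallest) sort values, ascending
        final int visited;       // number of matching docs actually looked at
        Result(List<Integer> top, int visited) { this.top = top; this.visited = visited; }
    }

    static Result topK(List<Segment> segments, int k) {
        PriorityQueue<Integer> worstFirst = new PriorityQueue<>(Comparator.reverseOrder());
        int visited = 0;
        for (Segment s : segments) {
            int collected = 0;
            for (int value : s.sortValues) {
                // In a sorted segment, docs after the first k cannot rank among the top k.
                if (s.sorted && collected == k) break;
                collected++;
                visited++;
                worstFirst.add(value);
                if (worstFirst.size() > k) worstFirst.poll(); // drop the current worst
            }
        }
        List<Integer> top = new ArrayList<>(worstFirst);
        Collections.sort(top);
        return new Result(top, visited);
    }

    public static void main(String[] args) {
        Result r = topK(Arrays.asList(
                new Segment(true, 1, 3, 5, 7, 9), // already sorted: stop after k hits
                new Segment(false, 8, 2, 6)),     // not yet sorted: scan everything
                3);
        System.out.println(r.top + " after visiting " + r.visited + " docs");
    }
}
```

The saving grows with the share of the index that is already sorted, since only unsorted segments still need a full pass.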
  58. Summary ● Understand your problem ● Scoring can become an issue with many million docs ● Many recent efficiency improvements ● More to come... patches welcome
  59. We're Hiring @HERE ● Frankfurt, Berlin, Boston, Chicago. Come work with us. Get in touch! ● developer.here.com/geocoder
  60. Thank You! ● Contact: – Email: stefan.pohl@here.com – Web: http://linkedin.com/in/stefanpohl – Twitter: @pohlstefan ● developer.here.com/geocoder