Query Latency Optimization
Stefan Pohl, Sr. Research Engineer, Ph.D.
stefan.pohl@here.com

Who Am I
● Search user, developer, researcher
● Many years in industry & academia
● Ph.D. in Information Retrieval
● Interests: Search, Big Data, Machine Learning
● Currently working on the Geocoding offer of HERE, Nokia's Location Platform
● Spare time: Lucene contributor

7 Nov 2013

Query Latency Optimization with Lucene

Agenda
● Motivation
● Latency Optimization
● Query Processing / Scoring
● Recent Developments in Lucene

Motivation: Query Latency
● Human Reaction Time: 200 ms *
→ Backend latency: << 200 ms
● Faster queries mean higher manageable load
● Costs
* Steven C. Seow, Designing and Engineering Time: The Psychology of Time Perception in Software, Addison-Wesley Professional, 2008.

Motivation: Query Latency Distribution

Latency Optimization

First: Do Your Homework
● Keep enough RAM for OS (disk buffer cache)
● Reduce HDD “pressure” (e.g. throttle indexing)
● SSDs
● Warming
● Ideally: your index fits in memory
See http://wiki.apache.org/lucene-java/ImproveSearchingSpeed

Mining Hypothesis
● Check if query latencies are reproducible
● If not, try to find correlations with system events:
– Many new incoming docs to index?
– Other daemons spike in disk or CPU activity?
– Garbage Collections?
– Other sar statistics (e.g. paging)
● If yes, profile
– First, your code
– Don't instrument Lucene internal low-level classes

Hypothesis Testing
● You really think you understand the problem and have a potential solution?
● Try it out (if it's cheap)!
● Otherwise, think of (cheap) experiments that
– Give confidence
– Tell you (and others) what the gains are (ROI)

Example: In-memory
● Buy more memory / bigger machine !?
● Simulate ¹
– Consecutively execute the same query multiple times
– Much lower memory requirement (i.e. the size of the involved postings)
– Repeat for sample of queries of interest
● Gives lower bound on query latency
¹ S. Pohl, A. Moffat. Measurement Techniques and Caching Effects. In Proceedings of the 31st European Conference on Information Retrieval, Toulouse, France, April 2009. Springer.

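The simulation idea can be sketched as a small timing helper. This is only an illustration under assumptions: the class and method names are hypothetical, and the `Runnable` stands in for whatever code executes one query against your index. The first run is treated as warm-up so that the involved postings end up in memory; the fastest of the following runs approximates the in-memory lower bound.

```java
public final class WarmLatency {
    private WarmLatency() {}

    /**
     * Runs the same query back to back and returns the fastest timed run in nanoseconds.
     * The first run is not timed: it only pulls the involved postings into the caches.
     * (Hypothetical helper, not part of Lucene.)
     */
    public static long warmLatencyNanos(Runnable query, int repeats) {
        if (repeats < 2) {
            throw new IllegalArgumentException("need one warm-up run plus at least one timed run");
        }
        query.run(); // warm-up run, result discarded
        long best = Long.MAX_VALUE;
        for (int i = 1; i < repeats; i++) {
            long start = System.nanoTime();
            query.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }
}
```

Calling this helper once per query in a sample of interest gives a per-query lower bound that can be compared against the latencies seen in production.
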
Query Processing

Conjunctions (i.e. AND / Occur.MUST)
● Sort Boolean clauses by increasing DocFreq f_t
● Next() on sparsest posting list (“lead”)
● Advance(18) on next sparsest posting list → fail
● Start all over again with “lead”, but advance(22)
● Try to advance(31) on all other posting lists
● Match found → 31 added to R
● Next() on “lead” → R = {31}

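The walk-through above can be sketched in plain Java. This is a simplified illustration, not Lucene's scorer: the `Postings` cursor is a hypothetical stand-in for a posting list iterator, and `advance()` scans linearly where a real index would use skip data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public final class ConjunctionSketch {
    public static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical cursor over one posting list, given as ascending doc IDs. */
    public static final class Postings {
        private final int[] docs;
        private int pos = -1; // -1 = not positioned yet

        public Postings(int... docs) { this.docs = docs; }

        public int docFreq() { return docs.length; }

        public int docID() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        public int nextDoc() {
            if (pos < docs.length) pos++;
            return docID();
        }

        /** Moves forward to the first doc >= target (linear scan in this sketch). */
        public int advance(int target) {
            int doc = nextDoc();
            while (doc < target) doc = nextDoc();
            return doc;
        }
    }

    /** Intersects the clauses, always driving from the sparsest list ("lead"). */
    public static List<Integer> intersect(Postings... clauses) {
        Postings[] sorted = clauses.clone();
        Arrays.sort(sorted, Comparator.comparingInt(Postings::docFreq)); // increasing DocFreq
        Postings lead = sorted[0];
        List<Integer> result = new ArrayList<>();
        int doc = lead.nextDoc();
        while (doc != NO_MORE_DOCS) {
            int candidate = doc;
            for (int i = 1; i < sorted.length; i++) {
                Postings other = sorted[i];
                int otherDoc = other.docID() < candidate ? other.advance(candidate) : other.docID();
                if (otherDoc > candidate) { // advance failed: this list skipped past the candidate
                    candidate = otherDoc;
                    break;
                }
            }
            if (candidate == doc) {
                result.add(doc);              // all clauses agree: match found
                doc = lead.nextDoc();
            } else {
                doc = lead.advance(candidate); // start over with the lead at the new target
            }
        }
        return result;
    }
}
```

The posting lists from the slides were in figures that did not survive, so any input is an assumption. With the hypothetical lists {18, 31}, {5, 22, 31, 37} and {2, 4, 7, 9, 11, 12, 16, 18, 20, 26, 27, 29, 31, 32}, the sketch performs the same advance(18), advance(22), advance(31) steps and returns [31].
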
Disjunctions (i.e. OR / Occur.SHOULD)
● Next() on all clauses
● Track clauses in min-heap → first match: 2
● Next() on all previously matched clauses → next matches: 4, then 5
● Repeat Next() on the previously matched clauses, one match per step, until all posting lists are exhausted
→ R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32,37}

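A minimal sketch of this heap-driven merge, using only the JDK. The `Cursor` class is a hypothetical stand-in for a posting list iterator, and this is not Lucene's actual disjunction scorer. Every posting passes through the heap once, which is where a cost of O(n log |C|) for n postings and |C| clauses comes from.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public final class DisjunctionSketch {
    /** Hypothetical cursor over one posting list, given as ascending doc IDs. */
    public static final class Cursor {
        private final int[] docs;
        private int pos = 0;

        public Cursor(int... docs) { this.docs = docs; }

        public boolean exhausted() { return pos >= docs.length; }

        public int doc() { return docs[pos]; }

        public void next() { pos++; }
    }

    /** Returns every doc that occurs in at least one clause, in increasing order. */
    public static List<Integer> union(Cursor... clauses) {
        // Min-heap ordered by the doc each cursor currently points at.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(Comparator.comparingInt(Cursor::doc));
        for (Cursor c : clauses) {
            if (!c.exhausted()) heap.add(c);
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int doc = heap.peek().doc();
            result.add(doc);
            // Step every clause that matched this doc, then put it back into the heap.
            while (!heap.isEmpty() && heap.peek().doc() == doc) {
                Cursor matched = heap.poll();
                matched.next();
                if (!matched.exhausted()) heap.add(matched);
            }
        }
        return result;
    }
}
```

The slide's posting lists are not preserved, so inputs are assumptions. For example, the hypothetical lists {18, 31}, {5, 22, 31, 37} and {2, 4, 7, 9, 11, 12, 16, 18, 20, 26, 27, 29, 31, 32} produce the result set R shown above.
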
Why Can Query Processing Be Slow?
● Disjunctive Processing: O(n log |C|)
– High DF terms (large n)
– Many terms (large |C|), e.g. query expansion
– No / too little use of advance()
● Filter (over-use)

Filter
● Aims:
– (Pre-)computation of common sub-queries
– Cache result
– Don't influence scoring
● Limitation
– Additional cost for 1st query
– Currently, no skip information generated
→ Adding filter as a conjunct to queries can sometimes be faster
e.g. http://java.dzone.com/news/fast-lucene-search-filters

Stopword Removal
● Removal of High-DocFreq terms from
– Index: 10-30% space saving
– Query: no very expensive terms
● Limitation:
– “To be or not to be”
● In general, don't do it

Minor, But Easy Improvements
● Reduce information, increase locality:
– Don't store TF, if it's almost always 1 (and you don't need positions),
fieldType.setIndexOptions(IndexOptions.DOCS_ONLY);
– Use BlockPostingsFormat (default in Lucene ≥ 4.1)
● Tune Space/Time/Quality tradeoffs:
– DirectDocValues
– Less complex scoring function

Recent Developments within Lucene

MinShouldMatch (Lucene-4571)
● Don't want matches on only one (stop-)word?
● Enforce at least mm>1 terms to be present !
● Synthetic example query used during dev:
Terms:    ref   restored  struck  wings  dublin
DocFreq:  3.8M  32k       32k     32k    32k
E.g. mm=2:
Conjunctive Processing: advance()
Disjunctive Processing: next()

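One way to see why mm>1 allows advance() on the expensive terms is a pigeonhole argument: a doc that matches at least mm of n clauses must match at least one of the n − mm + 1 sparsest clauses. The sketch below relies on that observation. It is a simplified illustration with hypothetical names, not the scorer from Lucene-4571: candidates come from next() on the sparse lists, and the mm − 1 densest lists are only skipped forward to each candidate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public final class MinShouldMatchSketch {
    private MinShouldMatchSketch() {}

    /** Returns the doc IDs that occur in at least mm of the given ascending posting lists. */
    public static List<Integer> match(int mm, int[]... postings) {
        int n = postings.length;
        if (mm < 1 || mm > n) {
            throw new IllegalArgumentException("mm must be between 1 and the number of clauses");
        }
        int[][] lists = postings.clone();
        Arrays.sort(lists, Comparator.comparingInt((int[] list) -> list.length)); // sparsest first
        int heads = n - mm + 1; // sparse lists, processed disjunctively with next()
        int[] pos = new int[n];
        List<Integer> result = new ArrayList<>();
        while (true) {
            // Next candidate: smallest current doc among the sparse lists
            // (a linear scan here; a min-heap would be used for many clauses).
            int candidate = Integer.MAX_VALUE;
            for (int i = 0; i < heads; i++) {
                if (pos[i] < lists[i].length) {
                    candidate = Math.min(candidate, lists[i][pos[i]]);
                }
            }
            if (candidate == Integer.MAX_VALUE) {
                return result; // all sparse lists exhausted, no further match is possible
            }
            int matches = 0;
            for (int i = 0; i < n; i++) {
                // Dense lists are skipped forward to the candidate (advance() in spirit).
                while (pos[i] < lists[i].length && lists[i][pos[i]] < candidate) {
                    pos[i]++;
                }
                if (pos[i] < lists[i].length && lists[i][pos[i]] == candidate) {
                    matches++;
                    if (i < heads) pos[i]++; // sparse lists step past the candidate
                }
            }
            if (matches >= mm) result.add(candidate);
        }
    }
}
```

With the slide's example (one 3.8M term and four 32k terms, mm=2), only the four 32k lists would generate candidates, and the 3.8M list would merely be skipped forward to each of them.
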
MinShouldMatch (Lucene-4571)
DocFreq:     3.8M  32k       32k     32k    32k
HighDF 1/5:  ref   restored  struck  wings  dublin
HighDF 2/5:  ref   http      struck  wings  dublin
HighDF 3/5:  ref   http      from    wings  dublin
HighDF 4/5:  ref   http      from    name   dublin
HighDF 5/5:  ref   http      from    name   title
DocFreq:     3.8M  3.5M      3.2M    2.8M   2.4M

MinShouldMatch – Results (Lucene-4571)

MinShouldMatch – Open Questions (Lucene-4571)
● How bad is it to exclude docs that only match one, but an important term?
● Why is it enough to match any mm terms?
● Why not provide a list of stop-words to a 'StopwordExcludingScorer'?
(But be careful: “To Be Or Not To Be”)

ReqOptSumScorer
● Benefit:
– Conjunctive processing on required clauses
– Calls advance() on optional clauses
● How do you determine which clauses are required?
– Lookup term statistics (i.e. DocFreq)
– 2nd lookup unnecessary, if you hand over stats to query

CommonTermsQuery (≥ 4.1) (Lucene-4628)
● Looks up term infos (docfreq, posting list offset)
● Categorizes query terms as
– Low-freq: At least one low-freq term MUST occur in result doc
– High-freq: SHOULD occur in doc → their presence adds to score
● Executes query, but hands over term statistics
→ no 2nd round of term lookups necessary !
● Also supports MinShouldMatch

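The categorization step can be illustrated with a few lines of plain Java. This is a sketch under assumptions, not the CommonTermsQuery implementation: the class name, the `maxFraction` cutoff parameter and the index size used in the example are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public final class TermCategorizer {
    /** Query terms split by document frequency. */
    public static final class Split {
        public final List<String> lowFreq = new ArrayList<>();
        public final List<String> highFreq = new ArrayList<>();
    }

    /**
     * Puts a term into highFreq when it occurs in more than maxFraction of all docs,
     * otherwise into lowFreq. (Hypothetical cutoff rule for illustration.)
     */
    public static Split categorize(Map<String, Long> docFreqs, long numDocs, double maxFraction) {
        Split split = new Split();
        for (Map.Entry<String, Long> entry : docFreqs.entrySet()) {
            boolean high = entry.getValue() > maxFraction * numDocs;
            (high ? split.highFreq : split.lowFreq).add(entry.getKey());
        }
        return split;
    }
}
```

For a query with DocFreqs ref=3.8M and restored, struck, wings, dublin at 32k each, an assumed index of 10M docs and a cutoff of 0.1 would make "ref" the only high-freq (SHOULD) term, while the other four become the low-freq group of which at least one MUST occur.
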
Cost-Model (≥ 4.3) (Lucene-4607)
● What about structured queries? E.g. +(a b) +c
● Currently: worst-case estimate of returned #docs (docfreq)
– Disjunctions: $\sum_{c \in C} df_c$
– Conjunctions: $\min_{c \in C} df_c$
● Limitations:
– Effort to generate returned docs?
– Only one cost (next() vs. advance())
● Open Question:
– Can we do better with more detailed cost models?

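The two estimation rules compose naturally over a query tree. The following sketch uses hypothetical names and is not Lucene's API; it estimates the cost of a structured query such as +(a b) +c as the minimum over the conjuncts, where the disjunction (a b) contributes the sum of its clauses.

```java
public final class CostModelSketch {
    private CostModelSketch() {}

    /** A query node that can give a worst-case estimate of the number of docs it returns. */
    public interface Node {
        long cost();
    }

    /** A single term costs its document frequency. */
    public static Node term(long docFreq) {
        return () -> docFreq;
    }

    /** Disjunction: sum of the clause costs. */
    public static Node disjunction(Node... clauses) {
        return () -> {
            long sum = 0;
            for (Node clause : clauses) sum += clause.cost();
            return sum;
        };
    }

    /** Conjunction: minimum of the clause costs. */
    public static Node conjunction(Node... clauses) {
        return () -> {
            long min = Long.MAX_VALUE;
            for (Node clause : clauses) min = Math.min(min, clause.cost());
            return min;
        };
    }
}
```

With assumed document frequencies a=1000, b=500 and c=200, the query +(a b) +c gets the estimate min(1000 + 500, 200) = 200, so c would be the natural clause to lead the conjunction.
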
Maxscore Top-k Scoring Algorithm ¹ (Lucene-4100)
● Experimental prototype code attached to Lucene-4100
● Limitation:
– Requires final run over whole index (i.e. only for static indexes)
¹ H. Turtle, J. Flood. Query Evaluation: Strategies and Optimizations, IPM, 31(6), 1995.

Index Sorting (≥ 4.3) (Lucene-4752)
● Advantages (if appropriate sort order chosen)
– Better compression → more locality → faster processing
– Early termination
● Use together with EarlyTerminatingSortingCollector
– Can terminate scoring within sorted segments
– Fully scores as-yet unsorted segments
→ see 2nd half of Shai & Adrian's talk yesterday for details

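The early-termination idea can be sketched without Lucene. In this simplified illustration (hypothetical names, not the EarlyTerminatingSortingCollector code), each segment lists the sort keys of its matching docs in index order, and a smaller key is assumed to be better. In a sorted segment the first k hits are already that segment's best k, so collection stops there; an unsorted segment has to be visited completely.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public final class EarlyTerminationSketch {
    /** Sort keys of the matching docs of one segment, in index order. */
    public static final class Segment {
        final boolean sorted;
        final int[] hitKeys;

        public Segment(boolean sorted, int... hitKeys) {
            this.sorted = sorted;
            this.hitKeys = hitKeys;
        }
    }

    /** The k best keys overall plus the number of hits that had to be visited. */
    public static final class Result {
        public final List<Integer> topK;
        public final int visited;

        Result(List<Integer> topK, int visited) {
            this.topK = topK;
            this.visited = visited;
        }
    }

    public static Result collect(int k, Segment... segments) {
        // Max-heap that keeps the k smallest keys seen so far.
        PriorityQueue<Integer> best = new PriorityQueue<>(Collections.reverseOrder());
        int visited = 0;
        for (Segment segment : segments) {
            int hitsInSegment = 0;
            for (int key : segment.hitKeys) {
                if (segment.sorted && hitsInSegment == k) {
                    break; // later hits in a sorted segment cannot beat its first k
                }
                visited++;
                hitsInSegment++;
                best.add(key);
                if (best.size() > k) best.poll(); // drop the current worst
            }
        }
        List<Integer> topK = new ArrayList<>(best);
        Collections.sort(topK);
        return new Result(topK, visited);
    }
}
```

For k=2, a sorted segment with hit keys 1, 3, 5, 7, 9 contributes only its first two hits, while an unsorted segment with keys 8, 2, 6 is scanned in full: five hits are visited instead of eight, and the top 2 are 1 and 2.
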
Parallelization
● In general, sharding is better:
– Shared-nothing
– Better use cores for handling load
● Multi-threaded query execution:
– Static indexes: For slow queries, almost perfect speedups (if docs are uniformly distributed over shards)
– Dynamic indexes: Lucene-2840, Lucene-5299

Summary
● Understand your problem
● Scoring can become an issue with many million docs
● Many recent efficiency improvements
● More to come... patches welcome

We're Hiring @HERE
Frankfurt, Berlin, Boston, Chicago.
Come work with us.
Get in touch!
developer.here.com/geocoder

Thank You!
Contact
Email : stefan.pohl@here.com
Web : http://linkedin.com/in/stefanpohl
Twitter : @pohlstefan

Más contenido relacionado

La actualidad más candente

Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsOpenSource Connections
 
プログラミングコンテストでの動的計画法
プログラミングコンテストでの動的計画法プログラミングコンテストでの動的計画法
プログラミングコンテストでの動的計画法Takuya Akiba
 
Independent component analysis
Independent component analysisIndependent component analysis
Independent component analysisVanessa S
 
【DL輪読会】SUMO: Unbiased Estimation of Log Marginal Probability for Latent Varia...
【DL輪読会】SUMO: Unbiased Estimation of Log Marginal Probability for Latent Varia...【DL輪読会】SUMO: Unbiased Estimation of Log Marginal Probability for Latent Varia...
【DL輪読会】SUMO: Unbiased Estimation of Log Marginal Probability for Latent Varia...Deep Learning JP
 
データ解析14 ナイーブベイズ
データ解析14 ナイーブベイズデータ解析14 ナイーブベイズ
データ解析14 ナイーブベイズHirotaka Hachiya
 
딥러닝 논문읽기 모임 - 송헌 Deep sets 슬라이드
딥러닝 논문읽기 모임 - 송헌 Deep sets 슬라이드딥러닝 논문읽기 모임 - 송헌 Deep sets 슬라이드
딥러닝 논문읽기 모임 - 송헌 Deep sets 슬라이드taeseon ryu
 
First Order Logic resolution
First Order Logic resolutionFirst Order Logic resolution
First Order Logic resolutionAmar Jukuntla
 
High performance python computing for data science
High performance python computing for data scienceHigh performance python computing for data science
High performance python computing for data scienceTakami Sato
 
Unit4: Knowledge Representation
Unit4: Knowledge RepresentationUnit4: Knowledge Representation
Unit4: Knowledge RepresentationTekendra Nath Yogi
 
ガウス過程入門
ガウス過程入門ガウス過程入門
ガウス過程入門ShoShimoyama
 
検索可能暗号の概観と今後の展望(第2回次世代セキュア情報基盤ワークショップ)
検索可能暗号の概観と今後の展望(第2回次世代セキュア情報基盤ワークショップ)検索可能暗号の概観と今後の展望(第2回次世代セキュア情報基盤ワークショップ)
検索可能暗号の概観と今後の展望(第2回次世代セキュア情報基盤ワークショップ)Akira Kanaoka
 
Using AI to Build a Self-Driving Query Optimizer with Shivnath Babu and Adria...
Using AI to Build a Self-Driving Query Optimizer with Shivnath Babu and Adria...Using AI to Build a Self-Driving Query Optimizer with Shivnath Babu and Adria...
Using AI to Build a Self-Driving Query Optimizer with Shivnath Babu and Adria...Databricks
 
第21回アルゴリズム勉強会
第21回アルゴリズム勉強会第21回アルゴリズム勉強会
第21回アルゴリズム勉強会Yuuki Ono
 
Introducción al Stack Elastic y Machine Learning con Elasticsearch
Introducción al Stack Elastic y Machine Learning con ElasticsearchIntroducción al Stack Elastic y Machine Learning con Elasticsearch
Introducción al Stack Elastic y Machine Learning con ElasticsearchImma Valls Bernaus
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySparkRussell Jurney
 
第一原理計算と密度汎関数理論
第一原理計算と密度汎関数理論第一原理計算と密度汎関数理論
第一原理計算と密度汎関数理論dc1394
 
Variational autoencoder
Variational autoencoderVariational autoencoder
Variational autoencoderMikio Shiga
 

La actualidad más candente (20)

Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
 
プログラミングコンテストでの動的計画法
プログラミングコンテストでの動的計画法プログラミングコンテストでの動的計画法
プログラミングコンテストでの動的計画法
 
Independent component analysis
Independent component analysisIndependent component analysis
Independent component analysis
 
【DL輪読会】SUMO: Unbiased Estimation of Log Marginal Probability for Latent Varia...
【DL輪読会】SUMO: Unbiased Estimation of Log Marginal Probability for Latent Varia...【DL輪読会】SUMO: Unbiased Estimation of Log Marginal Probability for Latent Varia...
【DL輪読会】SUMO: Unbiased Estimation of Log Marginal Probability for Latent Varia...
 
データ解析14 ナイーブベイズ
データ解析14 ナイーブベイズデータ解析14 ナイーブベイズ
データ解析14 ナイーブベイズ
 
Managing Postgres with Ansible
Managing Postgres with AnsibleManaging Postgres with Ansible
Managing Postgres with Ansible
 
Hoare論理
Hoare論理Hoare論理
Hoare論理
 
딥러닝 논문읽기 모임 - 송헌 Deep sets 슬라이드
딥러닝 논문읽기 모임 - 송헌 Deep sets 슬라이드딥러닝 논문읽기 모임 - 송헌 Deep sets 슬라이드
딥러닝 논문읽기 모임 - 송헌 Deep sets 슬라이드
 
First Order Logic resolution
First Order Logic resolutionFirst Order Logic resolution
First Order Logic resolution
 
High performance python computing for data science
High performance python computing for data scienceHigh performance python computing for data science
High performance python computing for data science
 
Unit4: Knowledge Representation
Unit4: Knowledge RepresentationUnit4: Knowledge Representation
Unit4: Knowledge Representation
 
ガウス過程入門
ガウス過程入門ガウス過程入門
ガウス過程入門
 
検索可能暗号の概観と今後の展望(第2回次世代セキュア情報基盤ワークショップ)
検索可能暗号の概観と今後の展望(第2回次世代セキュア情報基盤ワークショップ)検索可能暗号の概観と今後の展望(第2回次世代セキュア情報基盤ワークショップ)
検索可能暗号の概観と今後の展望(第2回次世代セキュア情報基盤ワークショップ)
 
Using AI to Build a Self-Driving Query Optimizer with Shivnath Babu and Adria...
Using AI to Build a Self-Driving Query Optimizer with Shivnath Babu and Adria...Using AI to Build a Self-Driving Query Optimizer with Shivnath Babu and Adria...
Using AI to Build a Self-Driving Query Optimizer with Shivnath Babu and Adria...
 
第21回アルゴリズム勉強会
第21回アルゴリズム勉強会第21回アルゴリズム勉強会
第21回アルゴリズム勉強会
 
双対性
双対性双対性
双対性
 
Introducción al Stack Elastic y Machine Learning con Elasticsearch
Introducción al Stack Elastic y Machine Learning con ElasticsearchIntroducción al Stack Elastic y Machine Learning con Elasticsearch
Introducción al Stack Elastic y Machine Learning con Elasticsearch
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
第一原理計算と密度汎関数理論
第一原理計算と密度汎関数理論第一原理計算と密度汎関数理論
第一原理計算と密度汎関数理論
 
Variational autoencoder
Variational autoencoderVariational autoencoder
Variational autoencoder
 

Similar a Query Latency Optimization with Lucene

tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-data
tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-datatranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-data
tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-dataDavid Peyruc
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowLaura Lorenz
 
PAGOdA Presentation
PAGOdA PresentationPAGOdA Presentation
PAGOdA PresentationDBOnto
 
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010ivan provalov
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowPyData
 
Using SigOpt to Tune Deep Learning Models with Nervana Cloud
Using SigOpt to Tune Deep Learning Models with Nervana CloudUsing SigOpt to Tune Deep Learning Models with Nervana Cloud
Using SigOpt to Tune Deep Learning Models with Nervana CloudSigOpt
 
Schedulers
SchedulersSchedulers
SchedulersKai Liu
 
Multi-Tenant Data Cloud with YARN & Helix
Multi-Tenant Data Cloud with YARN & HelixMulti-Tenant Data Cloud with YARN & Helix
Multi-Tenant Data Cloud with YARN & HelixKishore Gopalakrishna
 
Camp 4-data workshop presentation
Camp 4-data workshop presentationCamp 4-data workshop presentation
Camp 4-data workshop presentationPaolo Missier
 
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo..."Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...Lucidworks
 
Evolve your toolchains dev/ops with OpenStack
Evolve your toolchains dev/ops with OpenStackEvolve your toolchains dev/ops with OpenStack
Evolve your toolchains dev/ops with OpenStackRyan Richard
 
Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?Tugdual Grall
 
Scale Splunk
Scale SplunkScale Splunk
Scale SplunkSplunk
 
Unit7 & 8 Performance and optimization
Unit7 & 8 Performance and optimization Unit7 & 8 Performance and optimization
Unit7 & 8 Performance and optimization leenachandra
 
Unit7 & 8 performance analysis and optimization
Unit7 & 8 performance analysis and optimizationUnit7 & 8 performance analysis and optimization
Unit7 & 8 performance analysis and optimizationleenachandra
 
Apache Solr - An Experience Report
Apache Solr - An Experience ReportApache Solr - An Experience Report
Apache Solr - An Experience ReportNetcetera
 
What’s New In PostgreSQL 9.3
What’s New In PostgreSQL 9.3What’s New In PostgreSQL 9.3
What’s New In PostgreSQL 9.3Pavan Deolasee
 
Splunk and map_reduce
Splunk and map_reduceSplunk and map_reduce
Splunk and map_reduceGreg Hanchin
 

Similar a Query Latency Optimization with Lucene (20)

tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-data
tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-datatranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-data
tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-data
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
 
PAGOdA Presentation
PAGOdA PresentationPAGOdA Presentation
PAGOdA Presentation
 
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
 
Postgres
PostgresPostgres
Postgres
 
Using SigOpt to Tune Deep Learning Models with Nervana Cloud
Using SigOpt to Tune Deep Learning Models with Nervana CloudUsing SigOpt to Tune Deep Learning Models with Nervana Cloud
Using SigOpt to Tune Deep Learning Models with Nervana Cloud
 
Schedulers
SchedulersSchedulers
Schedulers
 
Multi-Tenant Data Cloud with YARN & Helix
Multi-Tenant Data Cloud with YARN & HelixMulti-Tenant Data Cloud with YARN & Helix
Multi-Tenant Data Cloud with YARN & Helix
 
Camp 4-data workshop presentation
Camp 4-data workshop presentationCamp 4-data workshop presentation
Camp 4-data workshop presentation
 
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo..."Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
 
Evolve your toolchains dev/ops with OpenStack
Evolve your toolchains dev/ops with OpenStackEvolve your toolchains dev/ops with OpenStack
Evolve your toolchains dev/ops with OpenStack
 
Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?
 
Scale Splunk
Scale SplunkScale Splunk
Scale Splunk
 
Unit7 & 8 Performance and optimization
Unit7 & 8 Performance and optimization Unit7 & 8 Performance and optimization
Unit7 & 8 Performance and optimization
 
Unit7 & 8 performance analysis and optimization
Unit7 & 8 performance analysis and optimizationUnit7 & 8 performance analysis and optimization
Unit7 & 8 performance analysis and optimization
 
Apache Solr - An Experience Report
Apache Solr - An Experience ReportApache Solr - An Experience Report
Apache Solr - An Experience Report
 
What’s New In PostgreSQL 9.3
What’s New In PostgreSQL 9.3What’s New In PostgreSQL 9.3
What’s New In PostgreSQL 9.3
 
isd312-09-summarization
isd312-09-summarizationisd312-09-summarization
isd312-09-summarization
 
Splunk and map_reduce
Splunk and map_reduceSplunk and map_reduce
Splunk and map_reduce
 

Más de lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

Más de lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud

Query Latency Optimization with Lucene

  • 1. Query Latency Optimization Stefan Pohl stefan.pohl@here.com Sr. Research Engineer, Ph.D.
  • 2. Who Am I ● Search user, developer, researcher ● Many years in industry & academia ● Ph.D. in Information Retrieval ● Interests: Search, Big Data, Machine Learning ● Currently working on the Geocoding offer of HERE, Nokia's Location Platform ● Spare time: Lucene contributor 7 Nov 2013 Query Latency Optimization with Lucene 2
  • 3. Agenda ● Motivation ● Latency Optimization ● Query Processing / Scoring ● Recent Developments in Lucene
  • 4. Motivation: Query Latency ● Human Reaction Time: 200 ms * → Backend latency: << 200 ms ● Faster queries mean higher manageable load ● Costs * Steven C. Seow, Designing and Engineering Time: The Psychology of Time Perception in Software, Addison-Wesley Professional, 2008.
  • 5. Motivation: Query Latency Distribution
  • 6. Latency Optimization
  • 7. First: Do Your Homework ● Keep enough RAM for OS (disk buffer cache) ● Reduce HDD “pressure” (e.g. throttle indexing) ● SSDs ● Warming ● Ideally: your index fits in memory See http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
  • 8. Mining Hypothesis ● Check if query latencies are reproducible ● If not, try to find correlations with system events: – Many new incoming docs to index? – Other daemons spike in disk or CPU activity? – Garbage Collections? – Other sar statistics (e.g. paging) ● If yes, profile – First, your code – Don't instrument Lucene internal low-level classes
  • 9. Hypothesis Testing ● You really think you understand the problem and have a potential solution? ● Try it out (if it's cheap)! ● Otherwise, think of (cheap) experiments that – Give confidence – Tell you (and others) what the gains are (ROI)
  • 10. Example: In-memory ● Buy more memory / bigger machine !? ● Simulate¹ – Consecutively execute the same query multiple times – Much lower memory requirement (i.e. the size of the involved postings) – Repeat for sample of queries of interest ● Gives lower bound on query latency ¹ S. Pohl, A. Moffat. Measurement Techniques and Caching Effects. In Proceedings of the 31st European Conference on Information Retrieval, Toulouse, France, April 2009. Springer.
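The simulation idea on slide 10 can be sketched in plain Java. This is a rough illustration, not the measurement setup of the cited paper: the `Runnable` stands in for whatever executes one query against your index, and `WarmLatencySketch` and `minLatencyNanos` are hypothetical names. The first run pulls the involved postings into memory, so the fastest of the back-to-back runs approximates the in-memory lower bound.

```java
final class WarmLatencySketch {

    /**
     * Executes the same query `runs` times in a row and returns the fastest
     * wall-clock time in nanoseconds. Later runs find the postings already
     * cached, so the minimum approximates the in-memory latency.
     */
    static long minLatencyNanos(Runnable query, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            query.run();
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed);
        }
        return best;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with a real search call (assumption).
        Runnable fakeQuery = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum == 42) {
                System.out.println(sum); // keeps the loop from being optimized away
            }
        };
        System.out.println("fastest run (ns): " + minLatencyNanos(fakeQuery, 5));
    }
}
```

Repeating this for a sample of the queries of interest gives a per-query lower bound to compare against production latencies.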
  • 11. Query Processing
  • 12. Conjunctions (i.e. AND / Occur.MUST) ● Sort Boolean clauses by increasing DocFreq ft
  • 13. Conjunctions (i.e. AND / Occur.MUST) ● Next() on sparsest posting list (“lead”)
  • 14. Conjunctions (i.e. AND / Occur.MUST) ● Advance(18) on next sparsest posting list → fail
  • 15. Conjunctions (i.e. AND / Occur.MUST) ● Start all over again with “lead”, but advance(22)
  • 16. Conjunctions (i.e. AND / Occur.MUST) ● Try to advance(31) on all other posting lists
  • 17. Conjunctions (i.e. AND / Occur.MUST) ● Try to advance(31) on all other posting lists
  • 18. Conjunctions (i.e. AND / Occur.MUST) ● Try to advance(31) on all other posting lists
  • 19. Conjunctions (i.e. AND / Occur.MUST) ● Match found → R = {31
  • 20. Conjunctions (i.e. AND / Occur.MUST) ● Next() on “lead” → R = {31}
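The walk-through on slides 12 to 20 can be sketched without Lucene as an intersection of sorted integer arrays. This is a minimal sketch, not Lucene's actual scorer: `ConjunctionSketch`, `Postings` and `intersect` are hypothetical names, `advance()` scans linearly where a real posting list would use skip data, and the posting lists in `main` are invented so that the steps mirror the slides (the original figure's lists are not part of the transcript).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ConjunctionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Cursor over one sorted posting list. */
    static final class Postings {
        private final int[] docs;
        private int pos = -1;

        Postings(int[] docs) { this.docs = docs; }

        int docFreq() { return docs.length; }

        int docID() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        int next() {
            if (pos < docs.length) pos++;
            return docID();
        }

        /** Moves forward to the first doc >= target (linear scan in this sketch). */
        int advance(int target) {
            while (pos < docs.length) {
                pos++;
                if (pos == docs.length || docs[pos] >= target) break;
            }
            return docID();
        }
    }

    static int[] intersect(int[]... lists) {
        if (lists.length == 0) throw new IllegalArgumentException("need at least one list");
        Postings[] p = new Postings[lists.length];
        for (int i = 0; i < lists.length; i++) p[i] = new Postings(lists[i]);
        // Sort clauses by increasing document frequency; the sparsest list leads.
        Arrays.sort(p, Comparator.comparingInt(Postings::docFreq));
        Postings lead = p[0];
        List<Integer> result = new ArrayList<>();
        int doc = lead.next();
        while (doc != NO_MORE_DOCS) {
            boolean match = true;
            for (int i = 1; i < p.length; i++) {
                int other = p[i].docID() < doc ? p[i].advance(doc) : p[i].docID();
                if (other > doc) {
                    // This clause skipped past the candidate: restart from the lead.
                    doc = lead.advance(other);
                    match = false;
                    break;
                }
            }
            if (match) {
                result.add(doc);
                doc = lead.next();
            }
        }
        return result.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        int[] r = intersect(
            new int[] {2, 9, 22, 31, 37},
            new int[] {18, 31},
            new int[] {4, 5, 7, 11, 31, 32});
        System.out.println(Arrays.toString(r));
    }
}
```

With these made-up lists the lead first proposes 18, the next list answers 22, the lead advances to 31, and all lists then agree on 31.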
  • 21. Disjunctions (i.e. OR / Occur.SHOULD)
  • 22. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() on all clauses
  • 23. Disjunctions (i.e. OR / Occur.SHOULD) ● Track clauses in min-heap → R = {2
  • 24. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() on all previously matched clauses → R = {2,4
  • 25. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() on all previously matched clauses → R = {2,4,5
  • 26. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7
  • 27. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9
  • 28. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11
  • 29. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12
  • 30. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16
  • 31. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18
  • 32. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20
  • 33. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22
  • 34. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26
  • 35. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27
  • 36. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29
  • 37. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31
  • 38. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32
  • 39. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32,37
  • 40. Disjunctions (i.e. OR / Occur.SHOULD) ● Next() → R = {2,4,5,7,9,11,12,16,18,20,22,26,27,29,31,32,37}
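The disjunctive processing on slides 21 to 40 amounts to a k-way merge driven by a min-heap. The following is a minimal sketch under stated assumptions: `DisjunctionSketch` and `union` are hypothetical names, each input array is assumed to be strictly increasing, and the three lists in `main` are invented so that their union equals the result set shown on slide 40.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

final class DisjunctionSketch {

    /** Returns the sorted union of the given strictly increasing posting lists. */
    static int[] union(int[]... lists) {
        // Heap entries: {current doc, index of the list, position within the list}.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < lists.length; i++) {
            if (lists[i].length > 0) {
                heap.add(new int[] {lists[i][0], i, 0});
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int doc = heap.peek()[0];
            result.add(doc);
            // next() on every clause that is positioned on the doc just emitted.
            while (!heap.isEmpty() && heap.peek()[0] == doc) {
                int[] top = heap.poll();
                int[] list = lists[top[1]];
                int nextPos = top[2] + 1;
                if (nextPos < list.length) {
                    heap.add(new int[] {list[nextPos], top[1], nextPos});
                }
            }
        }
        return result.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        int[] r = union(
            new int[] {2, 9, 12, 22, 31, 37},
            new int[] {4, 5, 7, 11, 16, 20, 26, 31, 32},
            new int[] {18, 27, 29, 31});
        System.out.println(Arrays.toString(r));
    }
}
```

Every posting of every clause passes through the heap once, which is where the O(n log |C|) cost of disjunctions comes from: n postings in total, each paying a heap operation over |C| clauses.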
  • 41. Why Query Processing Can Be Slow? ● Disjunctive Processing: O(n log |C|) – High DF terms (large n) – Many terms (large |C|), e.g. query expansion – No / too little use of advance() ● Filter (over-use)
  • 42. Filter ● Aims: – (Pre-)computation of common sub-queries – Cache result – Don't influence scoring ● Limitation – Additional cost for 1st query – Currently, no skip information generated → Adding filter as a conjunct to queries can sometimes be faster e.g. http://java.dzone.com/news/fast-lucene-search-filters
  • 43. Stopword Removal ● Removal of High-DocFreq terms from – Index: 10-30% space saving – Query: no very expensive terms ● Limitation: – “To be or not to be” ● In general, don't do it
  • 44. Minor, But Easy Improvements ● Reduce information, increase locality: – Don't store TF, if it's almost always 1 (and you don't need positions), fieldType.setIndexOptions(IndexOptions.DOCS_ONLY); – Use BlockPostingsFormat (default in Lucene ≥ 4.1) ● Tune Space/Time/Quality tradeoffs: – DirectDocValues – Less complex scoring function
  • 45. Recent Developments within Lucene
  • 46. MinShouldMatch (Lucene-4571) ● Don't want matches on only one (stop-)word? ● Enforce at least mm>1 terms to be present ! ● Synthetic example query used during dev: Terms (DocFreq): ref (3.8M), restored (32k), struck (32k), wings (32k), dublin (32k) E.g. mm=2: Conjunctive Processing: advance() Disjunctive Processing: next()
  • 47. MinShouldMatch (Lucene-4571)
  • 48. MinShouldMatch (Lucene-4571)
  • 49. MinShouldMatch (Lucene-4571) – HighDF 1/5: ref restored struck wings dublin (DocFreq: 3.8M 32k 32k 32k 32k) – HighDF 2/5: ref http struck wings dublin – HighDF 3/5: ref http from wings dublin – HighDF 4/5: ref http from name dublin – HighDF 5/5: ref http from name title (DocFreq: 3.8M 3.5M 3.2M 2.8M 2.4M)
  • 50. MinShouldMatch – Results (Lucene-4571)
  • 51. MinShouldMatch – Open Questions (Lucene-4571) ● How bad is it to exclude docs that match only one, but important, term? ● Why is it enough to match any mm terms? ● Why not provide a list of stop-words to a 'StopwordExcludingScorer'? (But be careful: “To Be Or Not To Be”)
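The split into conjunctive and disjunctive processing on slide 46 rests on a counting argument: a document that matches at least mm of |C| clauses must match at least one of the |C| − mm + 1 sparsest clauses, because only mm − 1 clauses remain outside that group. The sketch below illustrates that idea only and is not the scorer from Lucene-4571. `MinShouldMatchSketch` and `match` are hypothetical names, the candidate search uses a linear scan where a real implementation would use a heap, and the advance on the dense lists is a linear scan instead of a skip-list lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class MinShouldMatchSketch {

    /** Docs that occur in at least mm of the given strictly increasing lists. */
    static int[] match(int mm, int[]... lists) {
        if (mm < 1 || mm > lists.length) {
            throw new IllegalArgumentException("mm must be in [1, number of clauses]");
        }
        int[][] sorted = lists.clone();
        Arrays.sort(sorted, Comparator.comparingInt((int[] l) -> l.length));
        // The sparsest lists generate candidates; the mm-1 densest are only probed.
        int numLead = sorted.length - mm + 1;
        int[] pos = new int[sorted.length];
        List<Integer> result = new ArrayList<>();
        while (true) {
            // Disjunctive part: smallest current doc among the lead lists.
            int doc = Integer.MAX_VALUE;
            for (int i = 0; i < numLead; i++) {
                if (pos[i] < sorted[i].length) {
                    doc = Math.min(doc, sorted[i][pos[i]]);
                }
            }
            if (doc == Integer.MAX_VALUE) break; // all lead lists exhausted
            int count = 0;
            for (int i = 0; i < numLead; i++) {
                if (pos[i] < sorted[i].length && sorted[i][pos[i]] == doc) {
                    count++;
                    pos[i]++; // next() on the matching lead clause
                }
            }
            // Conjunctive-style part: advance the dense lists to the candidate.
            for (int i = numLead; i < sorted.length; i++) {
                while (pos[i] < sorted[i].length && sorted[i][pos[i]] < doc) {
                    pos[i]++;
                }
                if (pos[i] < sorted[i].length && sorted[i][pos[i]] == doc) {
                    count++;
                }
            }
            if (count >= mm) result.add(doc);
        }
        return result.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        int[] dense = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        int[] r = match(2, dense, new int[] {3, 8}, new int[] {3, 5}, new int[] {10});
        System.out.println(Arrays.toString(r));
    }
}
```

In the slide's example with mm=2, this corresponds to calling next() on the four 32k-DocFreq terms and only advance() on the 3.8M-DocFreq term, so most of the long posting list is never visited.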
  • 52. ReqOptSumScorer ● Benefit: – Conjunctive processing on required clauses – Calls advance() on optional clauses ● How do you determine which clauses are required? – Lookup term statistics (i.e. DocFreq) – 2nd lookup unnecessary, if you hand over stats to query
  • 53. CommonTermsQuery (≥ 4.1) (Lucene-4628) ● Looks up term infos (docfreq, posting list offset) ● Categorizes query terms as – Low-freq: At least one low-freq term MUST occur in result doc – High-freq: SHOULD occur in doc → their presence adds to the score ● Executes query, but hands over term statistics → no 2nd round of term lookups necessary ! ● Also supports MinShouldMatch
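The categorization step on slide 53 can be illustrated with a plain map of term statistics. This is a sketch of the idea, not the CommonTermsQuery implementation: `CommonTermsSketch`, `lowFreq`, `highFreq` and the absolute `cutoff` parameter are assumptions made for illustration, and the document frequencies reuse the synthetic example from slide 46.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CommonTermsSketch {

    /** Terms with docFreq <= cutoff: at least one of them must match. */
    static List<String> lowFreq(Map<String, Long> docFreqs, long cutoff) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : docFreqs.entrySet()) {
            if (e.getValue() <= cutoff) out.add(e.getKey());
        }
        return out;
    }

    /** Terms with docFreq > cutoff: optional, they only contribute to the score. */
    static List<String> highFreq(Map<String, Long> docFreqs, long cutoff) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : docFreqs.entrySet()) {
            if (e.getValue() > cutoff) out.add(e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> df = new LinkedHashMap<>();
        df.put("ref", 3_800_000L);
        df.put("restored", 32_000L);
        df.put("dublin", 32_000L);
        long cutoff = 1_000_000L; // hypothetical threshold
        System.out.println("MUST group:   " + lowFreq(df, cutoff));
        System.out.println("SHOULD group: " + highFreq(df, cutoff));
    }
}
```

Because the statistics are looked up once for the categorization, they can be handed to the executing query instead of being fetched a second time.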
  • 54. Cost-Model (≥ 4.3) (Lucene-4607) ● What about structured queries? E.g. +(a b) +c ● Currently: worst-case estimate of returned #docs (docfreq) – Disjunctions: $\sum_{c \in C} df_c$ – Conjunctions: $\min_{c \in C} df_c$ ● Limitations: – Effort to generate returned docs? – Only one cost (next() vs. advance()) ● Open Question: – Can we do better with more detailed cost models?
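The two estimates on slide 54 compose recursively over a query tree. A minimal sketch, assuming a hypothetical `Node` interface rather than Lucene's own classes, and using invented document frequencies for the slide's example query +(a b) +c:

```java
final class CostModelSketch {

    /** A query node that can report a worst-case number of returned docs. */
    interface Node {
        long cost();
    }

    static Node term(long docFreq) {
        return () -> docFreq;
    }

    /** Disjunction: the sum of the clause costs. */
    static Node or(Node... clauses) {
        return () -> {
            long sum = 0;
            for (Node c : clauses) sum += c.cost();
            return sum;
        };
    }

    /** Conjunction: the minimum of the clause costs (expects at least one clause). */
    static Node and(Node... clauses) {
        return () -> {
            long min = Long.MAX_VALUE;
            for (Node c : clauses) min = Math.min(min, c.cost());
            return min;
        };
    }

    public static void main(String[] args) {
        // +(a b) +c with assumed DocFreqs a = 3.8M, b = 32k, c = 32k.
        Node query = and(or(term(3_800_000L), term(32_000L)), term(32_000L));
        System.out.println("estimated cost: " + query.cost());
    }
}
```

The estimate lets a conjunction pick its cheapest clause as the lead even when that clause is itself a nested query, but as the slide notes it is a single number and does not distinguish next() from advance().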
  • 55. Maxscore Top-k Scoring Algorithm¹ (Lucene-4100) ● Experimental prototype code attached to Lucene-4100 ● Limitation: – Requires final run over whole index (i.e. only for static indexes) ¹ H. Turtle, J. Flood. Query Evaluation: Strategies and Optimizations, IPM, 31(6), 1995.
  • 56. Index Sorting (≥ 4.3) (Lucene-4752) ● Advantages (if appropriate sort order chosen) – Better compression → more locality → faster processing – Early termination ● Use together with EarlyTerminatingSortingCollector – Can terminate scoring within sorted segments – Fully scores as-yet unsorted segments → see 2nd half of Shai & Adrian's talk yesterday for details
  • 57. Parallelization ● In general, sharding is better: – Shared-nothing – Better use cores for handling load ● Multi-threaded query execution: – Static indexes: For slow queries, almost perfect speedups (if docs are uniformly distributed over shards) – Dynamic indexes: Lucene-2840, Lucene-5299
  • 58. Summary ● Understand your problem ● Scoring can become an issue with many million docs ● Many recent efficiency improvements ● More to come... patches welcome
  • 59. We're Hiring @HERE Frankfurt, Berlin, Boston, Chicago. Come work with us. Get in touch! developer.here.com/geocoder
  • 60. Thank You! Contact Email: stefan.pohl@here.com Web: http://linkedin.com/in/stefanpohl Twitter: @pohlstefan developer.here.com/geocoder