Point field types in Solr. Evolution of the Range Queries.

Privileged and Confidential
Point Field Types in Solr
Evolution of Range Filters
amikryukov@griddynamics.com

Agenda
1. Recap: From query parser to TopDocsCollector.
2. TermQuery search flow.
3. How are Range Filters implemented?
4. Optimizations for Range Filters.
5. Point Fields.

Recap: From query parser to TopDocsCollector
? What is a query parser?

? What is the difference between Query and Scorer? And why do you need a Collector?

Recap: Query execution flow
LeafReader

? What is the difference between Query and Scorer?
? What is the difference between TermsEnum and PostingsEnum?

Recap: inverted index, terms, posting list

TermQuery search flow
q=Price:10

TermQuery search flow
q=Price:10
TermQuery(
field='Price',
val='10')
TermQuery
TermWeight
TermScorer

TermQuery
Idea:
Iterate over posting list of the term `10`
q=Price:10
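
The idea above can be illustrated with a toy inverted index. This is a minimal sketch, not Lucene code: the dictionary `PRICE_INDEX`, its contents, and the function `term_query` are hypothetical names chosen for illustration.

```python
from typing import Dict, Iterator, List

# Toy inverted index for one field: term -> sorted posting list of doc ids.
# The data is made up for illustration.
PRICE_INDEX: Dict[str, List[int]] = {
    "10": [2, 5, 9],
    "20": [1, 5],
    "35": [7],
}


def term_query(index: Dict[str, List[int]], term: str) -> Iterator[int]:
    """Yield matching doc ids by walking the posting list of a single term."""
    # A term that is not in the dictionary simply matches nothing.
    for doc_id in index.get(term, []):
        yield doc_id


matches = list(term_query(PRICE_INDEX, "10"))
print(matches)
```

The cost of the query is proportional to the length of one posting list, which is why a single-term lookup is cheap.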

TermQuery source code

TermScorer source code

How are Range Filters implemented?
term -> document ids
421 -> [1]
423 -> [2]
445 -> [3]
446 -> [3]
448 -> [4]
521 -> [5]
522 -> [7]
632 -> [5]
633 -> [6]
634 -> [7]
641 -> [5]
642 -> [6]
644 -> [7]
q=PRICE:[423 TO 642]

How are Range Filters implemented?
421 -> [1]
423 -> [2]
445 -> [3]
446 -> [3]
448 -> [4]
521 -> [5]
522 -> [7]
632 -> [5]
633 -> [6]
634 -> [7]
641 -> [5]
642 -> [6]
644 -> [7]
q=PRICE:[423 TO 642]
q=PRICE:423 PRICE:445 PRICE:446 … PRICE:642

MultiTermQuery
421 -> [1]
423 -> [2]
445 -> [3]
446 -> [3]
448 -> [4]
521 -> [5]
522 -> [7]
632 -> [5]
633 -> [6]
634 -> [7]
641 -> [5]
642 -> [6]
644 -> [7]

Naive implementation
421 -> [1]
423 -> [2]
445 -> [3]
446 -> [3]
448 -> [4]
521 -> [5]
522 -> [7]
632 -> [5]
633 -> [6]
634 -> [7]
641 -> [5]
642 -> [6]
644 -> [7]
In total = 11 should clauses.
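
The count of 11 clauses for q=PRICE:[423 TO 642] can be reproduced with a small sketch. It assumes the term dictionary is a sorted list and treats every term in the range as one "should" clause; `INDEX` and `naive_range` are illustrative names, not Lucene APIs.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Tuple

# Term -> doc ids, copied from the slide.
INDEX: Dict[int, List[int]] = {
    421: [1], 423: [2], 445: [3], 446: [3], 448: [4], 521: [5], 522: [7],
    632: [5], 633: [6], 634: [7], 641: [5], 642: [6], 644: [7],
}


def naive_range(index: Dict[int, List[int]], low: int, high: int) -> Tuple[List[int], List[int]]:
    """Expand [low, high] into one clause per indexed term and union the postings."""
    terms = sorted(index)
    # Seek to the first term >= low and stop after the last term <= high.
    clauses = terms[bisect_left(terms, low):bisect_right(terms, high)]
    docs = set()
    for term in clauses:
        docs.update(index[term])
    return clauses, sorted(docs)


clauses, docs = naive_range(INDEX, 423, 642)
print(len(clauses), docs)
```

The number of clauses grows with the number of distinct terms inside the range, which is the weakness the following slides address.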

Optimizations for Range Filters
? How can we improve the naive implementation of RangeFilterQuery?
Original values
421 -> [1]
423 -> [2]
445 -> [3]
446 -> [3]
448 -> [4]
521 -> [5]
522 -> [7]
632 -> [5]
633 -> [6]
634 -> [7]
641 -> [5]
642 -> [6]
644 -> [7]

Trie
Original values
421 -> [1]
423 -> [2]
445 -> [3]
446 -> [3]
448 -> [4]
521 -> [5]
522 -> [7]
632 -> [5]
633 -> [6]
634 -> [7]
641 -> [5]
642 -> [6]
644 -> [7]

Trie*Field index time
Original values
421 -> [1]
423 -> [2]
445 -> [3]
446 -> [3]
448 -> [4]
521 -> [5]
522 -> [7]
632 -> [5]
633 -> [6]
634 -> [7]
641 -> [5]
642 -> [6]
644 -> [7]
Additional values
42* -> [1, 2]
44* -> [3, 4]
52* -> [5, 7]
63* -> [5, 6, 7]
64* -> [5, 6, 7]
4** -> [1, 2, 3, 4]
5** -> [5, 7]
6** -> [5, 6, 7]
Exploit the Trie*Field
Shift 2
Shift 1
Shift 0
(since Lucene 2.9)
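
How the additional values are derived at index time can be sketched as follows. This is a simplified decimal illustration matching the slide: real Trie*Field terms are built by shifting bits according to a precision step, not by dropping decimal digits. `ORIGINAL` and `additional_terms` are hypothetical names.

```python
from typing import Dict, List, Set

# Original term -> doc ids, copied from the slide.
ORIGINAL: Dict[int, List[int]] = {
    421: [1], 423: [2], 445: [3], 446: [3], 448: [4], 521: [5], 522: [7],
    632: [5], 633: [6], 634: [7], 641: [5], 642: [6], 644: [7],
}


def additional_terms(index: Dict[int, List[int]], digits: int = 3) -> Dict[str, List[int]]:
    """Build lower-precision terms by dropping 1..digits-1 trailing decimal digits."""
    extra: Dict[str, Set[int]] = {}
    for value, docs in index.items():
        for shift in range(1, digits):
            # shift=1 turns 423 into "42*", shift=2 turns it into "4**".
            label = str(value // 10 ** shift) + "*" * shift
            extra.setdefault(label, set()).update(docs)
    return {label: sorted(docs) for label, docs in extra.items()}


EXTRA = additional_terms(ORIGINAL)
print(EXTRA["42*"], EXTRA["4**"])
```

Each document is indexed once per precision level, so the index grows, but a whole sub-range can later be matched with a single term.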

Trie*Field query time
Original values
421 -> [1]
423 -> [2]
445 -> [3]
446 -> [3]
448 -> [4]
521 -> [5]
522 -> [7]
632 -> [5]
633 -> [6]
634 -> [7]
641 -> [5]
642 -> [6]
644 -> [7]
Additional values
42* -> [1, 2]
44* -> [3, 4]
52* -> [5, 7]
63* -> [5, 6, 7]
64* -> [5, 6, 7]
4** -> [1, 2, 3, 4]
5** -> [5, 7]
6** -> [5, 6, 7]
Exploit the Trie*Field
In total = 6 should clauses in the end
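
The 6 clauses for q=PRICE:[423 TO 642] can be derived with a range-splitting sketch. It uses the same decimal simplification as the slide and assumes three-digit values; `split_range` and `trie_clauses` are illustrative names, not the Lucene implementation.

```python
from typing import List, Tuple

VALUES = [421, 423, 445, 446, 448, 521, 522, 632, 633, 634, 641, 642, 644]


def split_range(low: int, high: int, shift: int = 0, max_shift: int = 2) -> List[Tuple[int, int, int]]:
    """Cover [low, high] with (shift, first_prefix, last_prefix) sub-ranges,
    using the coarsest precision wherever a whole prefix fits inside the range."""
    if low > high:
        return []
    if shift == max_shift:
        return [(shift, low, high)]
    next_low = -(-low // 10)            # first coarser prefix fully inside the range
    next_high = (high + 1) // 10 - 1    # last coarser prefix fully inside the range
    if next_low > next_high:
        return [(shift, low, high)]
    parts = []
    if low < next_low * 10:             # left edge stays at the current precision
        parts.append((shift, low, next_low * 10 - 1))
    parts += split_range(next_low, next_high, shift + 1, max_shift)
    if high > next_high * 10 + 9:       # right edge stays at the current precision
        parts.append((shift, next_high * 10 + 10, high))
    return parts


def trie_clauses(values: List[int], low: int, high: int) -> List[str]:
    """Keep only the prefix terms that actually exist in the index."""
    out = []
    for shift, lo, hi in split_range(low, high):
        present = sorted({v // 10 ** shift for v in values if lo <= v // 10 ** shift <= hi})
        out += [str(p) + "*" * shift for p in present]
    return out


print(trie_clauses(VALUES, 423, 642))
```

Only the two edges of the range are resolved at full precision; the middle is covered by coarse terms such as 5**.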

Isn't that enough? Distribution of terms
The trie-based approach does not take the distribution of the terms into account.
q=PRICE:[100 TO 2002222]
Original values
1 -> [1]
100 -> [2]
2000001 -> [3]
2000022 -> [3]
2000222 -> [4]
2002222 -> [5]
50000005 -> [7]

Isn't that enough?
I/O efficiency.
We need to store all original and additional values.
We need to read all Terms of the field at search time.
Original values
1 -> [1]
100 -> [2]
2000001 -> [3]
2000022 -> [3]
2000222 -> [4]
2002222 -> [5]
50000005 -> [7]
Additional values
10* -> [2]
1** -> [1, 2]
200002* -> [3]
200022* -> [4]
20002** -> [4]
200**** -> [3, 4, 5]
200222* -> [5]
20022** -> [5]
2002*** -> [5]

Point Fields
Point fields replace the now-deprecated numeric fields (Trie*Field) and numeric range queries: they have
better overall performance and are more general, supporting multiple dimensions (since Lucene 6.0).
● Based on Bkd-Tree: A Dynamic Scalable kd-Tree.
The tree naturally adapts to each data set's particular distribution, in contrast to legacy numeric fields,
which always index the same precision levels for every value regardless of how the points are
distributed.
● Most of the data structure resides in on-disk blocks, with a small in-heap binary tree index
structure to locate the blocks at search time.
● Supports multi-dimensional points (maps, 3D models).

Bkd-Tree
Binary Space Partitioning tree
B - Blocked
Number of points in the cell = 2
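
The blocked partitioning can be sketched as a recursive build that stops once a cell holds at most 2 points. This is a simplified in-memory sketch, assuming the split is made at the median of the dimension with the widest spread; the sample points and the names `build` and `leaves` are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, ...]


def build(points: List[Point], max_points: int = 2) -> Dict:
    """Recursively split the space until every cell holds at most max_points points."""
    if len(points) <= max_points:
        return {"leaf": sorted(points)}
    dims = range(len(points[0]))
    # Assumption: split on the dimension where the points are spread the widest.
    dim = max(dims, key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    return {
        "dim": dim,
        "split": ordered[mid][dim],
        "left": build(ordered[:mid], max_points),
        "right": build(ordered[mid:], max_points),
    }


def leaves(node: Dict) -> List[List[Point]]:
    """Collect the leaf cells from left to right."""
    if "leaf" in node:
        return [node["leaf"]]
    return leaves(node["left"]) + leaves(node["right"])


POINTS = [(1, 9), (2, 3), (4, 7), (5, 1), (6, 8), (7, 2), (8, 6), (9, 4)]
TREE = build(POINTS)
print([len(cell) for cell in leaves(TREE)])
```

Because splits follow the median of the data, dense regions end up with many small cells and sparse regions with few large ones.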

Bkd-Tree adapts to particular distribution
Example from
https://www.elastic.co/blog/lucene-points-6.0

Point Fields: index time
Disk
Heap
In Lucene, the number of points per cell is 1024.

Point Fields: search time
Disk
Heap
q=PRICE:[100 TO 2002222]
If a block only partially overlaps the query, we have to check every value inside it.
If a block is fully contained within the query, the documents with values in that cell are
efficiently collected without having to test each point.
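
The two cases above can be sketched for a one-dimensional field. This is a simplification: it scans a flat list of blocks, whereas Lucene walks the in-heap tree to locate candidate blocks. The block size of 2 and the names `BLOCKS` and `search` are assumptions for illustration; the values come from the earlier slide.

```python
from typing import List, Tuple

# Leaf blocks of (value, doc_id) pairs, sorted by value, 2 points per block.
BLOCKS: List[List[Tuple[int, int]]] = [
    [(1, 1), (100, 2)],
    [(2000001, 3), (2000022, 3)],
    [(2000222, 4), (2002222, 5)],
    [(50000005, 7)],
]


def search(blocks: List[List[Tuple[int, int]]], low: int, high: int) -> Tuple[List[int], int]:
    """Return matching doc ids and the number of values that had to be tested."""
    hits = set()
    tested = 0
    for block in blocks:
        block_min, block_max = block[0][0], block[-1][0]
        if block_max < low or block_min > high:
            continue                                  # block is outside the query
        if low <= block_min and block_max <= high:
            hits.update(doc for _, doc in block)      # fully inside: collect blindly
        else:
            for value, doc in block:                  # partial overlap: test each value
                tested += 1
                if low <= value <= high:
                    hits.add(doc)
    return sorted(hits), tested


docs, tested = search(BLOCKS, 100, 2002222)
print(docs, tested)
```

Only the blocks that straddle a boundary of the range pay the per-value comparison cost.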

Performance testing (Lucene 6.0)

Point Fields

Links
Numeric Range Queries in Lucene/Solr
http://blog-archive.griddynamics.com/2014/10/numeric-range-queries-in-lucenesolr.html
Lucene Search Essentials: Scorers, Collectors and Custom Queries
https://www.slideshare.net/lucenerevolution/lucene-search-essentials-scorers-collectors-and-custom-queries-dublin13
Multi-dimensional points, coming in Apache Lucene 6.0
https://www.elastic.co/blog/lucene-points-6.0
Bkd-Tree: A Dynamic Scalable kd-Tree
https://users.cs.duke.edu/~pankaj/publications/papers/bkd-sstd.pdf
The Evolution of Lucene & Solr Numerics from Strings to Points
https://www.slideshare.net/lucidworks/the-evolution-of-lucene-solr-numerics-from-strings-to-points-presented-by-steve-rowe-lucidworks?from_action=save
