Privileged and Confidential
Point Field Types in Solr
Evolution of Range Filters
amikryukov@griddynamics.com
Agenda
1. Recap: From query parser to TopDocsCollector.
2. TermQuery search flow.
3. How are Range Filters implemented?
4. Optimizations for Range Filters.
5. Point Fields.
Recap: From query parser to TopDocsCollector
? What is a query parser?
? What is the difference between Query and Scorer? And why do you need a Collector?
Recap: Query execution flow
LeafReader
Recap: From query parser to TopDocsCollector
? What is the difference between TermsEnum and PostingsEnum?
Recap: inverted index, terms, posting list
TermQuery search flow
q=Price:10
TermQuery(field='Price', val='10')
TermQuery -> TermWeight -> TermScorer
TermQuery
Idea: iterate over the posting list of the term `10`.
q=Price:10
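The idea on this slide can be sketched in a few lines. This is a toy model, not Lucene code: the index contents, `term_postings`, and `search_term` are made-up names for illustration, and the loop in `search_term` only stands in for what a scorer and a collector do together.

```python
from typing import Dict, Iterator, List

# Toy inverted index for the Price field: term -> sorted posting list of doc ids.
# The contents are invented for this example.
PRICE_INDEX: Dict[str, List[int]] = {
    "10": [2, 5, 9],
    "20": [1, 5],
    "35": [3],
}


def term_postings(index: Dict[str, List[int]], term: str) -> Iterator[int]:
    """Yield the doc ids in the posting list of `term`, in increasing order."""
    # A term that is not in the index simply has an empty posting list.
    yield from index.get(term, [])


def search_term(index: Dict[str, List[int]], term: str) -> List[int]:
    """Walk the posting list and gather every doc id, as a collector would."""
    hits: List[int] = []
    for doc_id in term_postings(index, term):
        hits.append(doc_id)
    return hits
```

Calling `search_term(PRICE_INDEX, "10")` walks exactly one posting list, which is why a single-term query is cheap compared with the range queries that follow.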
TermQuery source code
TermScorer source code
How are Range Filters implemented?
term -> document ids
421 -> [1]
423 -> [2]
445 -> [3]
446 -> [3]
448 -> [4]
521 -> [5]
522 -> [7]
632 -> [5]
633 -> [6]
634 -> [7]
641 -> [5]
642 -> [6]
644 -> [7]
q=PRICE:[423 TO 642]
Rewritten as: q=PRICE:423 PRICE:445 PRICE:446 … PRICE:642
MultiTermQuery
Naive implementation
In total = 11 should clauses.
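A minimal sketch of the naive expansion, using the term table from the slides. The function names are hypothetical; the point is only that every indexed term inside the range becomes its own should clause.

```python
from typing import Dict, List

# Term table from the slides: value -> doc ids.
PRICE_INDEX: Dict[int, List[int]] = {
    421: [1], 423: [2], 445: [3], 446: [3], 448: [4],
    521: [5], 522: [7], 632: [5], 633: [6], 634: [7],
    641: [5], 642: [6], 644: [7],
}


def naive_range_terms(index: Dict[int, List[int]], low: int, high: int) -> List[int]:
    """Enumerate every indexed term that falls inside [low, high]."""
    return [term for term in sorted(index) if low <= term <= high]


def naive_range_search(index: Dict[int, List[int]], low: int, high: int) -> List[int]:
    """OR together the posting lists of all terms in the range."""
    docs = set()
    for term in naive_range_terms(index, low, high):
        docs.update(index[term])
    return sorted(docs)
```

For `[423 TO 642]`, `naive_range_terms` returns 11 terms, one per clause. The number of clauses grows with the number of distinct values in the range, which is what the following slides try to reduce.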
Optimizations for Range Filters
? How can we improve the naive implementation of RangeFilterQuery?
Trie
Trie*Field index time
Original values
421 -> [1]
423 -> [2]
445 -> [3]
446 -> [3]
448 -> [4]
521 -> [5]
522 -> [7]
632 -> [5]
633 -> [6]
634 -> [7]
641 -> [5]
642 -> [6]
644 -> [7]
Additional values
Shift 1:
42* -> [1, 2]
44* -> [3, 4]
52* -> [5, 7]
63* -> [5, 6, 7]
64* -> [5, 6, 7]
Shift 2:
4** -> [1, 2, 3, 4]
5** -> [5, 7]
6** -> [5, 6, 7]
(Shift 0 = the original values.)
Exploit the Trie*Field (since Lucene 2.9)
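The index-time step can be sketched as follows. It follows the slide's decimal simplification, where a shift drops one trailing digit; real Trie*Field implementations work on bits with a configurable precisionStep. `trie_terms` and `build_trie_index` are illustrative names, and the fixed width of 3 digits is an assumption that fits this data set.

```python
from collections import defaultdict
from typing import Dict, List

# Original terms from the slide: value -> doc ids.
ORIGINAL: Dict[int, List[int]] = {
    421: [1], 423: [2], 445: [3], 446: [3], 448: [4],
    521: [5], 522: [7], 632: [5], 633: [6], 634: [7],
    641: [5], 642: [6], 644: [7],
}


def trie_terms(value: int, width: int = 3) -> List[str]:
    """Return the term for every shift, e.g. 421 -> ['421', '42*', '4**']."""
    digits = str(value).zfill(width)
    return [digits[: width - shift] + "*" * shift for shift in range(width)]


def build_trie_index(original: Dict[int, List[int]], width: int = 3) -> Dict[str, List[int]]:
    """Index each value under its full-precision term and under every prefix."""
    postings = defaultdict(set)
    for value, doc_ids in original.items():
        for term in trie_terms(value, width):
            postings[term].update(doc_ids)
    return {term: sorted(docs) for term, docs in postings.items()}


TRIE_INDEX = build_trie_index(ORIGINAL)
```

The 13 original terms grow to 21 indexed terms here (5 at shift 1 and 3 at shift 2), which is the storage cost paid for faster range queries.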
Trie*Field query time
In total = 6 should clauses in the end (423, 44*, 5**, 63*, 641, 642) instead of 11.
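A hedged sketch of how the query-time split could work, again in the slide's decimal simplification rather than Lucene's bit-based one. `split_range` peels unaligned values off both ends of the range at each shift and moves the aligned middle up to the next, coarser level; only candidate terms that actually exist in the index become clauses. All names and the fixed `WIDTH = 3` are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

ORIGINAL: Dict[int, List[int]] = {
    421: [1], 423: [2], 445: [3], 446: [3], 448: [4],
    521: [5], 522: [7], 632: [5], 633: [6], 634: [7],
    641: [5], 642: [6], 644: [7],
}
WIDTH = 3  # assumed number of decimal digits per value


def prefix_term(prefix: int, shift: int) -> str:
    """Render a prefix at a given shift, e.g. (44, 1) -> '44*'."""
    return str(prefix).zfill(WIDTH - shift) + "*" * shift


def build_trie_index(original: Dict[int, List[int]]) -> Dict[str, Set[int]]:
    postings: Dict[str, Set[int]] = defaultdict(set)
    for value, doc_ids in original.items():
        for shift in range(WIDTH):
            postings[prefix_term(value // 10 ** shift, shift)].update(doc_ids)
    return postings


def split_range(low: int, high: int) -> List[str]:
    """Cover [low, high] exactly with as few prefix terms as possible."""
    terms: List[str] = []
    lo, hi = low, high  # inclusive bounds, expressed as prefixes at the current shift
    for shift in range(WIDTH):
        last_level = shift == WIDTH - 1
        # Lower edge: prefixes that do not start a full block of ten.
        while lo <= hi and (last_level or lo % 10 != 0):
            terms.append(prefix_term(lo, shift))
            lo += 1
        # Upper edge: prefixes that do not end a full block of ten.
        while lo <= hi and hi % 10 != 9:
            terms.append(prefix_term(hi, shift))
            hi -= 1
        if lo > hi:
            break
        lo, hi = lo // 10, hi // 10  # the aligned middle moves one level up
    return terms


def trie_range_search(index: Dict[str, Set[int]], low: int, high: int) -> Tuple[List[str], List[int]]:
    # Keep only the candidate terms that exist in the index.
    clauses = [term for term in split_range(low, high) if term in index]
    docs: Set[int] = set()
    for term in clauses:
        docs.update(index[term])
    return clauses, sorted(docs)
```

For `[423 TO 642]` the split produces 22 candidate terms, of which 6 exist in this index, and the matching documents are the same as with the naive expansion.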
Isn't it enough? Distribution of terms
The trie-based approach does not take the distribution of the terms into account.
q=PRICE:[100 TO 2002222]
Original values
1 -> [1]
100 -> [2]
2000001 -> [3]
2000022 -> [3]
2000222 -> [4]
2002222 -> [5]
50000005 -> [7]
Isn't it enough?
IO efficiency.
We need to store all original and additional values.
We need to read all terms of the field at search time.
Additional values
10* -> [2]
1** -> [1, 2]
200002* -> [3]
200022* -> [4]
20002** -> [4]
200**** -> [3, 4, 5]
200222* -> [5]
20022** -> [5]
2002*** -> [5]
Point Fields
Point fields replace the now-deprecated numeric fields (Trie*Field) and the numeric range query: they perform better overall and are more general, supporting multiple dimensions (since Lucene 6.0).
● Based on the Bkd-tree ("Bkd-Tree: A Dynamic Scalable kd-Tree").
The tree naturally adapts to each data set's particular distribution, in contrast to legacy numeric fields, which always index the same precision levels for every value regardless of how the points are distributed.
● Most of the data structure resides in on-disk blocks, with a small in-heap binary tree index used to locate the blocks at search time.
● Supports multi-dimensional points (maps, 3D models).
Bkd-Tree
Binary Space Partitioning tree
B - Blocked
Number of points in the cell = 2
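The blocked partitioning can be illustrated with a small recursive build. This is a plain kd-tree sketch under simplifying assumptions (2D integer points, split at the median of the widest dimension), not the bulk-loading algorithm of the Bkd-tree paper or Lucene's writer; `build_kd` and `leaves` are hypothetical names.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]


def build_kd(points: List[Point], max_points_in_cell: int = 2) -> Dict:
    """Recursively split the points until every cell holds at most max_points_in_cell."""
    if len(points) <= max_points_in_cell:
        return {"points": sorted(points)}  # leaf cell (a "block")
    # Split along the dimension in which the points are spread out the most.
    dim = max(range(2), key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2  # median split keeps both halves balanced
    return {
        "dim": dim,
        "split": ordered[mid][dim],
        "left": build_kd(ordered[:mid], max_points_in_cell),
        "right": build_kd(ordered[mid:], max_points_in_cell),
    }


def leaves(node: Dict) -> List[List[Point]]:
    """Return the leaf cells from left to right."""
    if "points" in node:
        return [node["points"]]
    return leaves(node["left"]) + leaves(node["right"])
```

Because each split is placed at the median of the data rather than at a fixed value, dense regions end up with many small cells and sparse regions with a few large ones.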
Bkd-Tree adapts to the particular distribution of the data
Example from https://www.elastic.co/blog/lucene-points-6.0
Point Fields: index time
Disk
Heap
In Lucene, the number of points in a cell is 1024.
Point Fields: search time
Disk
Heap
q=PRICE:[100 TO 2002222]
● If a block only partially overlaps the query range, every value inside it has to be checked.
● If a block is fully contained within the query range, the documents with values in that cell are collected efficiently, without testing each point.
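The two cases above, plus the case of a block that lies entirely outside the range, can be sketched in one dimension. This is a simplified model under stated assumptions: a flat list of sorted blocks stands in for the in-heap tree, the block size is 2 instead of 1024, and the names `build_blocks`, `relation`, and `range_search` are invented for the example.

```python
from typing import List, Tuple

# (value, doc id) pairs from the earlier slide.
POINTS: List[Tuple[int, int]] = [
    (1, 1), (100, 2), (2000001, 3), (2000022, 3),
    (2000222, 4), (2002222, 5), (50000005, 7),
]


def build_blocks(points: List[Tuple[int, int]], max_points_in_block: int = 2):
    """Index time: sort by value and cut the sorted run into fixed-size blocks."""
    ordered = sorted(points)
    return [ordered[i:i + max_points_in_block]
            for i in range(0, len(ordered), max_points_in_block)]


def relation(block, low: int, high: int) -> str:
    """Compare a block's [min, max] value range with the query range."""
    block_min, block_max = block[0][0], block[-1][0]
    if block_max < low or block_min > high:
        return "outside"
    if low <= block_min and block_max <= high:
        return "inside"
    return "crosses"


def range_search(blocks, low: int, high: int) -> Tuple[List[int], int]:
    """Return matching doc ids and how many individual values had to be tested."""
    docs = set()
    values_checked = 0
    for block in blocks:
        rel = relation(block, low, high)
        if rel == "outside":
            continue  # skip the whole block
        if rel == "inside":
            docs.update(doc for _, doc in block)  # collect without testing values
        else:
            for value, doc in block:  # partial overlap: test every value
                values_checked += 1
                if low <= value <= high:
                    docs.add(doc)
    return sorted(docs), values_checked
```

For the range `[100 TO 2002222]` only the first block crosses the query boundary, so 2 of the 7 values are tested individually; the two middle blocks are collected wholesale and the last block is skipped.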
Performance testing (Lucene 6.0)
Point Fields
Links
Numeric Range Queries in Lucene/Solr
http://blog-archive.griddynamics.com/2014/10/numeric-range-queries-in-lucenesolr.html
Lucene Search Essentials: Scorers, Collectors and Custom Queries
https://www.slideshare.net/lucenerevolution/lucene-search-essentials-scorers-collectors-and-custom-queries-dublin13
Multi-dimensional points, coming in Apache Lucene 6.0
https://www.elastic.co/blog/lucene-points-6.0
Bkd-Tree: A Dynamic Scalable kd-Tree
https://users.cs.duke.edu/~pankaj/publications/papers/bkd-sstd.pdf
The Evolution of Lucene & Solr Numerics from Strings to Points
https://www.slideshare.net/lucidworks/the-evolution-of-lucene-solr-numerics-from-strings-to-points-presented-by-steve-rowe-lucidworks?from_action=save