SlideShare una empresa de Scribd logo
1 de 66
Descargar para leer sin conexión
Is Your IndexReader Really
  Atomic or Maybe Slow?
         Uwe Schindler
    SD DataSolutions GmbH,
 uschindler@sd-datasolutions.de

                                  1
My Background
• I am committer and PMC member of Apache Lucene and Solr. My
  main focus is on development of Lucene Java.
• Implemented fast numerical search and maintaining the new
  attribute-based text analysis API. Well known as Generics and
  Sophisticated Backwards Compatibility Policeman.
• Working as consultant and software architect for SD DataSolutions
  GmbH in Bremen, Germany. The main task is maintaining PANGAEA
  (Publishing Network for Geoscientific & Environmental Data) where
  I implemented the portal's geo-spatial retrieval functions with
  Apache Lucene Core.
• Talks about Lucene at various international conferences like the
  previous Lucene Revolution, Lucene Eurocon, ApacheCon EU/US,
  Berlin Buzzwords, and various local meetups.

                                                                      2
Agenda

• Motivation / History of Lucene

• AtomicReader & CompositeReader

• Reader contexts

• Wrap up

                                   3
Lucene Index Structure
• Lucene was the first full text search engine
  that supported document additions and
  updates
• Snapshot isolation ensures consistency


          Segmented index structure
  Committing changes creates new segments
                                                 4
Segments in Lucene




                      • Each index consists of various segments
                        placed in the index directory. All documents
                        are added to new new segment files, merged with other
                        on-disk files after flushing.
Image: Doug Cutting




                                   *) The term “optimal” does not mean all indexes must be optimized!
                                                                                                  5
Segments in Lucene




                      • Each index consists of various segments
                        placed in the index directory. All documents
                        are added to new new segment files, merged with other
                        on-disk files after flushing.
                      • Lucene writes segments incrementally and can merge
                        them.
Image: Doug Cutting




                                   *) The term “optimal” does not mean all indexes must be optimized!
                                                                                                  6
Segments in Lucene




                      • Each index consists of various segments
                        placed in the index directory. All documents
                        are added to new new segment files, merged with other
                        on-disk files after flushing.
                      • Lucene writes segments incrementally and can merge
                        them.
Image: Doug Cutting




                      • Optimized* index consists of one segment.
                                   *) The term “optimal” does not mean all indexes must be optimized!
                                                                                                  7
Lucene merges while indexing
   all of English Wikipedia




                                    Video by courtesy of:
                                                    8
                Mike McCandless’ blog, http://goo.gl/kI53f
Indexes in Lucene (up to version 3.6)
• Each segment (“atomic index”) is a completely
  functional index:
  – SegmentReader implements the IndexReader
    interface for single segments
• Composite indexes
  – DirectoryReader implements the IndexReader
    interface on top of a set of SegmentReaders
  – MultiReader is an abstraction of multiple
    IndexReaders combined to one virtual index

                                                  9
Composite Indexes (up to version 3.6)
Atomic “views” on multi-segment index:
  – Term Dictionary: on-the-fly merged & sorted term
    index (priority queue for TermEnum,…)
  – Postings: posting lists for each term appended,
    convert document IDs to be global
  – Metadata: doc frequency, term frequency,…
  – Stored fields, term vectors, deletions: delegate global
    document IDs -> segment document IDs (binary
    search)
  – FieldCache: duplicate instances for single segments
    and composite view (memory!!!)
                                                          10
Merging Term Index and Postings
                         Term 1

Term 1   Term 1          Term 2

Term 2   Term 3          Term 3

Term 4   Term 5          Term 4

Term 7   Term 6          Term 5

Term 8   Term 8          Term 6

Term 9   Term 9          Term 7
                         Term 8
                         Term 9


                                   11
Merging Term Index and Postings
                         Term 1

Term 1   Term 1          Term 2

Term 2   Term 3          Term 3

Term 4   Term 5          Term 4

Term 7   Term 6          Term 5

Term 8   Term 8          Term 6

Term 9   Term 9          Term 7
                         Term 8
                         Term 9


                                   12
Composite Indexes (up to version 3.6)
Atomic “views” on multi-segment index:
  – Term Dictionary: on-the-fly merged & sorted term
    index (priority queue for TermEnum,…)
  – Postings: posting lists for each term appended,
    convert document IDs to be global
  – Metadata: doc frequency, term frequency,…
  – Stored fields, term vectors, deletions: delegate
    global document IDs -> segment document IDs
    (binary search)
  – FieldCache: duplicate instances for single
    segments and composite view (memory!!!)
                                                   13
Searching before Version 2.9
• IndexSearcher used the underlying index
  always as a “single” (atomic) index:
  – Queries are executed on the atomic view of a
    composite index
  – Slowdown for queries that scan term dictionary
    (MultiTermQuery) or hit lots of documents,
    facetting




                                                          Image: © The Walt Disney Company
    => recommendation to “optimize” index
  – On every index change, FieldCache used for
    sorting had to be reloaded completely

                                                     14
Lucene 2.9 and later:
             Per segment search

Search is executed separately on each index segment:




             Atomic view no longer used!
                                                       15
Per Segment Search: Pros
• No live term dictionary merging
• Possibility to parallelize
   – ExecutorService in IndexSearcher since Lucene 3.1+
   – Do not optimize to make this work!

• Sorting only needs per-segment FieldCache
   – Cheap reopen after index changes!

• Filter cache per segment
   – Cheap reopen after index changes!
                                                          16
Per Segment Search: Cons
• Query/Filter API changes:
  – Scorer / Filter’s DocIdSet no longer use global
    document IDs
• Slower sorting by string terms
  – Term ords are only comparable inside each
    segment
  – String comparisons needed after segment
    traversal
  – Use numeric sorting if possible!!!
    (Lucene supports missing values since version 3.4 [buggy], corrected 3.5+)

                                                                                 17
Agenda

• Motivation / History of Lucene

• AtomicReader & CompositeReader

• Reader contexts

• Wrap up

                                   18
Composite Indexes (up to version 3.6)
Atomic “views” on multi-segment index:
  – Term Dictionary: on-the-fly merged & sorted term
    index (priority queue for TermEnum,…)
  – Postings: posting lists for each term appended,
    convert document IDs to be global
  – Metadata: doc frequency, term frequency,…
  – Stored fields, term vectors, deletions: delegate global
    document IDs -> segment document IDs (binary
    search)
  – FieldCache: duplicate instances for single segments
    and composite view (memory!!!)
                                                          19
Composite Indexes (version 4.0)
Atomic “views” on multi-segment index:
  – Term Dictionary: on-the-fly merged & sorted term
    index (priority queue for TermEnum,…)
  – Postings: posting lists for each term appended,
    convert document IDs to be global
  – Metadata: doc frequency, term frequency,…
  – Stored fields, term vectors, deletions: delegate global
    document IDs -> segment document IDs (binary
    search)
  – FieldCache: duplicate instances for single segments
    and composite view (memory!!!)
                                                          20
Early Lucene 4.0
                                   • Only “historic” IndexReader interface
                                     available since Lucene 1.0
                                   • 80% of all methods of composite IndexReaders
                                     throwed UnsupportedOperationException
                                     – This affected all user-facing APIs
                                       (SegmentReader was hidden / marked experimental)
                                   • No compile time safety!
Image: © The Walt Disney Company




                                     – Query Scorers and Filters need term dictionary and
                                       postings, throwing UOE when executed on composite
                                       reader


                                                                                          21
Image: © The Walt Disney Company




                                   Heavy Committing™ !!!




 22
IndexReader
•   Most generic (abstract) API to an index
     – superclass of more specific types
     – cannot be subclassed directly (no public constructor)




                                                               23
IndexReader
•   Most generic (abstract) API to an index
     – superclass of more specific types
     – cannot be subclassed directly (no public constructor)

•   Does not know anything about “inverted index” concept
     – no terms, no postings!




                                                               24
IndexReader
•   Most generic (abstract) API to an index
     – superclass of more specific types
     – cannot be subclassed directly (no public constructor)

•   Does not know anything about “inverted index” concept
     – no terms, no postings!

•   Access to stored fields (and term vectors) by document ID
     – to display / highlight search results only!




                                                                25
IndexReader
•   Most generic (abstract) API to an index
     – superclass of more specific types
     – cannot be subclassed directly (no public constructor)

•   Does not know anything about “inverted index” concept
     – no terms, no postings!

•   Access to stored fields (and term vectors) by document ID
     – to display / highlight search results only!

•   Some very limited metadata
     – number of documents,…




                                                                26
IndexReader
•   Most generic (abstract) API to an index
     – superclass of more specific types
     – cannot be subclassed directly (no public constructor)

•   Does not know anything about “inverted index” concept
     – no terms, no postings!

•   Access to stored fields (and term vectors) by document ID
     – to display / highlight search results only!

•   Some very limited metadata
     – number of documents,…

•   Unmodifiable: No more deletions / changing of norms possible!
     – Applies to all readers in Lucene 4.0 !!!




                                                                    27
IndexReader
•   Most generic (abstract) API to an index
     – superclass of more specific types
     – cannot be subclassed directly (no public constructor)

•   Does not know anything about “inverted index” concept
     – no terms, no postings!

•   Access to stored fields (and term vectors) by document ID
     – to display / highlight search results only!

•   Some very limited metadata
     – number of documents,…

•   Unmodifiable: No more deletions / changing of norms possible!
     – Applies to all readers in Lucene 4.0 !!!

•   Passed to IndexSearcher
     – just as before!

                                                                    28
IndexReader
•   Most generic (abstract) API to an index
     – superclass of more specific types
     – cannot be subclassed directly (no public constructor)

•   Does not know anything about “inverted index” concept
     – no terms, no postings!

•   Access to stored fields (and term vectors) by document ID
     – to display / highlight search results only!

•   Some very limited metadata
     – number of documents,…

•   Unmodifiable: No more deletions / changing of norms possible!
     – Applies to all readers in Lucene 4.0 !!!

•   Passed to IndexSearcher
     – just as before!

•   Allows IndexReader.open() for backwards compatibility (deprecated)   29
AtomicReader
• Inherits from IndexReader




                              30
AtomicReader
• Inherits from IndexReader
• Access to “atomic” indexes (single segments)




                                                 31
AtomicReader
• Inherits from IndexReader
• Access to “atomic” indexes (single segments)
• Full term dictionary and postings API




                                                 32
AtomicReader
• Inherits from IndexReader
• Access to “atomic” indexes (single segments)
• Full term dictionary and postings API




                                                 33
AtomicReader
• Inherits from IndexReader
• Access to “atomic” indexes (single segments)
• Full term dictionary and postings API
• Access to DocValues (new in Lucene 4.0) and
  norms



                                                 34
CompositeReader
• No additional functionality on top of
  IndexReader




                                          35
CompositeReader
• No additional functionality on top of
  IndexReader
• Provides getSequentialSubReaders() to
  retrieve all child readers




                                          36
CompositeReader
• No additional functionality on top of
  IndexReader
• Provides getSequentialSubReaders() to
  retrieve all child readers
• DirectoryReader and MultiReader implement
  this class



                                          37
DirectoryReader
• Abstract class, defines interface for:




                                           38
DirectoryReader
• Abstract class, defines interface for:
   – access to on-disk indexes (on top of Directory class)
   – access to commit points, index metadata, index
     version, isCurrent() for reopen support
   – defines abstract openIfChanged (for cheap reopening
     of indexes)
   – child readers are always AtomicReader instances




                                                         39
DirectoryReader
• Abstract class, defines interface for:
   – access to on-disk indexes (on top of Directory class)
   – access to commit points, index metadata, index version,
     isCurrent() for reopen support
   – defines abstract openIfChanged (for cheap reopening of
     indexes)
   – child readers are always AtomicReader instances
• Provides static factory methods for opening indexes
   – well-known from IndexReader in Lucene 1 to 3
   – factories return internal DirectoryReader implementation
     (StandardDirectoryReader with SegmentReaders as childs)

                                                               40
Basic Search Example
Image: © The Walt Disney Company




                                       Looks familiar, doesn’t it?

                                                                     41
“Arrgh! I still need terms and postings on my
DirectoryReader!!! What should I do? Optimize
to only have one segment? Help me please!!!”
          (example question on java-user@lucene.apache.org)




                                                              42
What to do?

• Calm down
• Take a break and drink a beer*
• Don’t optimize / force merge your index!!!




     *) Robert’s solution                      43
Solutions
• Most efficient way:
  – Retrieve atomic leaves from your composite:
    reader.getTopReaderContext().leaves()
  – Iterate over sub-readers, do the work
    (possibly parallelized)
  – Merge results




                                                  44
Solutions
• Most efficient way:
  – Retrieve atomic leaves from your composite:
    reader.getTopReaderContext().leaves()
  – Iterate over sub-readers, do the work
    (possibly parallelized)
  – Merge results


• Otherwise: wrap your CompositeReader:

                                                  45
(Slow) Solution

AtomicReader
  SlowCompositeReaderWrapper.wrap(IndexReader r)


• Wraps IndexReaders of any kind as atomic reader,
  providing terms, postings, deletions, doc values
   – Internally uses same algorithms like previous Lucene readers
   – Segment-merging uses this to merge segments, too




                                                                    46
(Slow) Solution

AtomicReader
  SlowCompositeReaderWrapper.wrap(IndexReader r)


• Wraps IndexReaders of any kind as atomic reader,
  providing terms, postings, deletions, doc values
   – Internally uses same algorithms like previous Lucene readers
   – Segment-merging uses this to merge segments, too


• Solr always provides an AtomicReader for convenience
  through SolrIndexSearcher. Plugin writers should use:
AtomicReader
  rd = mySolrIndexSearcher.getAtomicReader()
                                                                    47
Other readers
• FilterAtomicReader
  – Was FilterIndexReader in 3.x
    (but now solely works on atomic readers)
  – Allows to filter terms, postings, deletions
  – Useful for index splitters (e.g., PKIndexSplitter,
    MultiPassIndexSplitter)
    (provide own getLiveDocs() method, merge to IndexWriter)

• ParallelAtomicReader, -CompositeReader


                                                               48
IndexReader Reopening
• Reopening solely provided by directory-based
  DirectoryReader instances




                                             49
IndexReader Reopening
• Reopening solely provided by directory-based
  DirectoryReader instances
• No more reopen for:
  – AtomicReader: they are atomic, no refresh possible
  – MultiReader: reopen child readers separately, create
    new MultiReader on top of reopened readers
  – Parallel*Reader, FilterAtomicReader: reopen
    wrapped readers, create new wrapper afterwards


                                                           50
Agenda

• Motivation / History of Lucene

• AtomicReader & CompositeReader

• Reader contexts

• Wrap up

                                   51
IndexReaderContext
AtomicReaderContext
CompositeReaderContext


WTF ?!?
                         52
The Problem
              SuperComposite: MultiReader


    CompositeA:                     CompositeB:
   DirectoryReader                DirectoryReader
Atomic0   Atomic1   Atomic2    Atomic0   Atomic1   Atomic2

Doc0      Doc0      Doc0        Doc0     Doc0      Doc0
Doc1      Doc1      Doc1        Doc1     Doc1      Doc1
Doc2      Doc2      Doc2        Doc2     Doc2      Doc2
Doc3      Doc3      Doc3        Doc3     Doc3      Doc3




                                                             53
The Problem
              SuperComposite: MultiReader


    CompositeA:                     CompositeB:
   DirectoryReader                DirectoryReader
Atomic0   Atomic1   Atomic2    Atomic0   Atomic1   Atomic2

Doc0      Doc4      Doc8        Doc0     Doc0      Doc0
Doc1      Doc5      Doc9        Doc1     Doc1      Doc1
Doc2      Doc6      Doc10       Doc2     Doc2      Doc2
Doc3      Doc7      Doc11       Doc3     Doc3      Doc3




                                                             54
The Problem
              SuperComposite: MultiReader


    CompositeA:                     CompositeB:
   DirectoryReader                DirectoryReader
Atomic0   Atomic1   Atomic2    Atomic0   Atomic1   Atomic2

Doc0      Doc4      Doc8        Doc12    Doc16     Doc20

Doc1      Doc5      Doc9        Doc13    Doc17     Doc21

Doc2      Doc6      Doc10       Doc14    Doc18     Doc22

Doc3      Doc7      Doc11       Doc15    Doc19     Doc23




                                                             55
Solution
• Each IndexReader instance provides:
  IndexReaderContext getTopReaderContext()




                                             56
Solution
• Each IndexReader instance provides:
  IndexReaderContext getTopReaderContext()

• The context provides a “view” on all childs (direct
  descendants) and leaves (down to lowest level
  AtomicReaders)




                                                        57
Solution
• Each IndexReader instance provides:
  IndexReaderContext getTopReaderContext()

• The context provides a “view” on all childs (direct
  descendants) and leaves (down to lowest level
  AtomicReaders)
• Each atomic leave has a docBase (document ID
  offset)




                                                    58
Solution
• Each IndexReader instance provides:
  IndexReaderContext getTopReaderContext()

• The context provides a “view” on all childs (direct
  descendants) and leaves (down to lowest level
  AtomicReaders)
• Each atomic leave has a docBase (document ID offset)
• IndexSearcher passes a context instance relative to its
  own top-level reader to each Query-Scorer / Filter
   – allows to access complete reader tree up to the current top-
     level reader
   – Allows to get the “global” document ID
                                                              59
Agenda

• Motivation / History of Lucene

• AtomicReader & CompositeReader

• Reader contexts

• Wrap up

                                   60
Summary


• Up to Lucene 3.6 indexes are still accessible on
  any hierarchy level with same interface
• Lucene 4.0 will split the IndexReader class into
  several abstract interfaces
• IndexReaderContexts will support per-segment
  search preserving top-level document IDs
                                                     61
Summary
• Lucene moved from global searches to per-
  segment search in Lucene 2.9
• Up to Lucene 3.6 indexes are still accessible on
  any hierarchy level with same interface
• Lucene 4.0 will split the IndexReader class into
  several abstract interfaces
• IndexReaderContexts will support per-segment
  search preserving top-level document IDs
                                                     62
Summary


• Up to Lucene 3.6 indexes are still accessible on
  any hierarchy level with same interface
• Lucene 4.0 will split the IndexReader class into
  several abstract interfaces
• IndexReaderContexts will support per-segment
  search preserving top-level document IDs
                                                     63
Summary




• Lucene 4.0 will split the IndexReader class into
  several abstract interfaces
• IndexReaderContexts will support per-segment
  search preserving top-level document IDs
                                                     64
Summary




• Lucene 4.0 will split the IndexReader class into
  several abstract interfaces
• IndexReaderContexts will support per-segment
  search preserving top-level document IDs
                                                     65
Contact
                  Uwe Schindler
                uschindler@apache.org
                http://www.thetaphi.de
                        @thetaph1




    SD DataSolutions GmbH
         Wätjenstr. 49
    28213 Bremen, Germany
      +49 421 40889785-0
http://www.sd-datasolutions.de

                                         66

Más contenido relacionado

La actualidad más candente

Hive join optimizations
Hive join optimizationsHive join optimizations
Hive join optimizationsSzehon Ho
 
PostgreSQL Performance Tuning
PostgreSQL Performance TuningPostgreSQL Performance Tuning
PostgreSQL Performance Tuningelliando dias
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
mongodb와 mysql의 CRUD 연산의 성능 비교
mongodb와 mysql의 CRUD 연산의 성능 비교mongodb와 mysql의 CRUD 연산의 성능 비교
mongodb와 mysql의 CRUD 연산의 성능 비교Woo Yeong Choi
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기NAVER D2
 
Neo4j Training Modeling
Neo4j Training ModelingNeo4j Training Modeling
Neo4j Training ModelingMax De Marzi
 
Introduction to Data Vault Modeling
Introduction to Data Vault ModelingIntroduction to Data Vault Modeling
Introduction to Data Vault ModelingKent Graziano
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonIgor Anishchenko
 
Design Patterns using Amazon DynamoDB
 Design Patterns using Amazon DynamoDB Design Patterns using Amazon DynamoDB
Design Patterns using Amazon DynamoDBAmazon Web Services
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDatabricks
 
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...DataStax
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInLinkedIn
 
Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez Willy Lulciuc
 
Scylla Summit 2022: Making Schema Changes Safe with Raft
Scylla Summit 2022: Making Schema Changes Safe with RaftScylla Summit 2022: Making Schema Changes Safe with Raft
Scylla Summit 2022: Making Schema Changes Safe with RaftScyllaDB
 

La actualidad más candente (20)

Log Structured Merge Tree
Log Structured Merge TreeLog Structured Merge Tree
Log Structured Merge Tree
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Hive join optimizations
Hive join optimizationsHive join optimizations
Hive join optimizations
 
PostgreSQL Performance Tuning
PostgreSQL Performance TuningPostgreSQL Performance Tuning
PostgreSQL Performance Tuning
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
mongodb와 mysql의 CRUD 연산의 성능 비교
mongodb와 mysql의 CRUD 연산의 성능 비교mongodb와 mysql의 CRUD 연산의 성능 비교
mongodb와 mysql의 CRUD 연산의 성능 비교
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기
 
Neo4j Training Modeling
Neo4j Training ModelingNeo4j Training Modeling
Neo4j Training Modeling
 
Introduction to Data Vault Modeling
Introduction to Data Vault ModelingIntroduction to Data Vault Modeling
Introduction to Data Vault Modeling
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
 
MongoDB
MongoDBMongoDB
MongoDB
 
Design Patterns using Amazon DynamoDB
 Design Patterns using Amazon DynamoDB Design Patterns using Amazon DynamoDB
Design Patterns using Amazon DynamoDB
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
 
Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez
 
Scylla Summit 2022: Making Schema Changes Safe with Raft
Scylla Summit 2022: Making Schema Changes Safe with RaftScylla Summit 2022: Making Schema Changes Safe with Raft
Scylla Summit 2022: Making Schema Changes Safe with Raft
 

Similar a Is Your Index Reader Really Atomic or Maybe Slow?

Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialeckilucenerevolution
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingabial
 
Elasticsearch Architechture
Elasticsearch ArchitechtureElasticsearch Architechture
Elasticsearch ArchitechtureAnurag Sharma
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, EvernoteSearch Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, EvernoteLucidworks
 
Illuminating Lucene.Net
Illuminating Lucene.NetIlluminating Lucene.Net
Illuminating Lucene.NetDean Thrasher
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearchpmanvi
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrSease
 
Glusterfs session #9 index xlator
Glusterfs session #9   index xlatorGlusterfs session #9   index xlator
Glusterfs session #9 index xlatorPranith Karampuri
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetuprcmuir
 
Introduction to couchbase
Introduction to couchbaseIntroduction to couchbase
Introduction to couchbaseDipti Borkar
 
Elasticsearch Basics
Elasticsearch BasicsElasticsearch Basics
Elasticsearch BasicsShifa Khan
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1YI-CHING WU
 
Musings on Mesos: Docker, Kubernetes, and Beyond.
Musings on Mesos: Docker, Kubernetes, and Beyond.Musings on Mesos: Docker, Kubernetes, and Beyond.
Musings on Mesos: Docker, Kubernetes, and Beyond.Timothy St. Clair
 
IntelliJ IDEA Architecture and Performance
IntelliJ IDEA Architecture and PerformanceIntelliJ IDEA Architecture and Performance
IntelliJ IDEA Architecture and Performanceintelliyole
 

Similar a Is Your Index Reader Really Atomic or Maybe Slow? (20)

Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialecki
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processing
 
Flexible Indexing in Lucene 4.0
Flexible Indexing in Lucene 4.0Flexible Indexing in Lucene 4.0
Flexible Indexing in Lucene 4.0
 
Lucene indexing
Lucene indexingLucene indexing
Lucene indexing
 
Elasticsearch Architechture
Elasticsearch ArchitechtureElasticsearch Architechture
Elasticsearch Architechture
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, EvernoteSearch Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
 
Illuminating Lucene.Net
Illuminating Lucene.NetIlluminating Lucene.Net
Illuminating Lucene.Net
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 
Elastic search
Elastic searchElastic search
Elastic search
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
 
Glusterfs session #9 index xlator
Glusterfs session #9   index xlatorGlusterfs session #9   index xlator
Glusterfs session #9 index xlator
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
 
Introduction to couchbase
Introduction to couchbaseIntroduction to couchbase
Introduction to couchbase
 
Elasticsearch Basics
Elasticsearch BasicsElasticsearch Basics
Elasticsearch Basics
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1
 
Lucene 101
Lucene 101Lucene 101
Lucene 101
 
Musings on Mesos: Docker, Kubernetes, and Beyond.
Musings on Mesos: Docker, Kubernetes, and Beyond.Musings on Mesos: Docker, Kubernetes, and Beyond.
Musings on Mesos: Docker, Kubernetes, and Beyond.
 
IntelliJ IDEA Architecture and Performance
IntelliJ IDEA Architecture and PerformanceIntelliJ IDEA Architecture and Performance
IntelliJ IDEA Architecture and Performance
 
Lucene basics
Lucene basicsLucene basics
Lucene basics
 

Más de lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

Más de lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Último

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 

Último (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

Is Your Index Reader Really Atomic or Maybe Slow?

  • 1. Is Your IndexReader Really Atomic or Maybe Slow? Uwe Schindler SD DataSolutions GmbH, uschindler@sd-datasolutions.de 1
  • 2. My Background • I am committer and PMC member of Apache Lucene and Solr. My main focus is on development of Lucene Java. • Implemented fast numerical search and maintaining the new attribute-based text analysis API. Well known as Generics and Sophisticated Backwards Compatibility Policeman. • Working as consultant and software architect for SD DataSolutions GmbH in Bremen, Germany. The main task is maintaining PANGAEA (Publishing Network for Geoscientific & Environmental Data) where I implemented the portal's geo-spatial retrieval functions with Apache Lucene Core. • Talks about Lucene at various international conferences like the previous Lucene Revolution, Lucene Eurocon, ApacheCon EU/US, Berlin Buzzwords, and various local meetups. 2
  • 3. Agenda • Motivation / History of Lucene • AtomicReader & CompositeReader • Reader contexts • Wrap up 3
  • 4. Lucene Index Structure • Lucene was the first full text search engine that supported document additions and updates • Snapshot isolation ensures consistency  Segmented index structure  Committing changes creates new segments 4
  • 5. Segments in Lucene • Each index consists of various segments placed in the index directory. All documents are added to new new segment files, merged with other on-disk files after flushing. Image: Doug Cutting *) The term “optimal” does not mean all indexes must be optimized! 5
  • 6. Segments in Lucene • Each index consists of various segments placed in the index directory. All documents are added to new new segment files, merged with other on-disk files after flushing. • Lucene writes segments incrementally and can merge them. Image: Doug Cutting *) The term “optimal” does not mean all indexes must be optimized! 6
  • 7. Segments in Lucene • Each index consists of various segments placed in the index directory. All documents are added to new new segment files, merged with other on-disk files after flushing. • Lucene writes segments incrementally and can merge them. Image: Doug Cutting • Optimized* index consists of one segment. *) The term “optimal” does not mean all indexes must be optimized! 7
  • 8. Lucene merges while indexing all of English Wikipedia Video by courtesy of: 8 Mike McCandless’ blog, http://goo.gl/kI53f
  • 9. Indexes in Lucene (up to version 3.6) • Each segment (“atomic index”) is a completely functional index: – SegmentReader implements the IndexReader interface for single segments • Composite indexes – DirectoryReader implements the IndexReader interface on top of a set of SegmentReaders – MultiReader is an abstraction of multiple IndexReaders combined to one virtual index 9
  • 10. Composite Indexes (up to version 3.6) Atomic “views” on multi-segment index: – Term Dictionary: on-the-fly merged & sorted term index (priority queue for TermEnum,…) – Postings: posting lists for each term appended, convert document IDs to be global – Metadata: doc frequency, term frequency,… – Stored fields, term vectors, deletions: delegate global document IDs -> segment document IDs (binary search) – FieldCache: duplicate instances for single segments and composite view (memory!!!) 10
  • 11. Merging Term Index and Postings Term 1 Term 1 Term 1 Term 2 Term 2 Term 3 Term 3 Term 4 Term 5 Term 4 Term 7 Term 6 Term 5 Term 8 Term 8 Term 6 Term 9 Term 9 Term 7 Term 8 Term 9 11
  • 12. Merging Term Index and Postings Term 1 Term 1 Term 1 Term 2 Term 2 Term 3 Term 3 Term 4 Term 5 Term 4 Term 7 Term 6 Term 5 Term 8 Term 8 Term 6 Term 9 Term 9 Term 7 Term 8 Term 9 12
  • 13. Composite Indexes (up to version 3.6) Atomic “views” on multi-segment index: – Term Dictionary: on-the-fly merged & sorted term index (priority queue for TermEnum,…) – Postings: posting lists for each term appended, convert document IDs to be global – Metadata: doc frequency, term frequency,… – Stored fields, term vectors, deletions: delegate global document IDs -> segment document IDs (binary search) – FieldCache: duplicate instances for single segments and composite view (memory!!!) 13
  • 14. Searching before Version 2.9 • IndexSearcher used the underlying index always as a “single” (atomic) index: – Queries are executed on the atomic view of a composite index – Slowdown for queries that scan term dictionary (MultiTermQuery) or hit lots of documents, facetting Image: © The Walt Disney Company => recommendation to “optimize” index – On every index change, FieldCache used for sorting had to be reloaded completely 14
  • 15. Lucene 2.9 and later: Per segment search Search is executed separately on each index segment: Atomic view no longer used! 15
  • 16. Per Segment Search: Pros • No live term dictionary merging • Possibility to parallelize – ExecutorService in IndexSearcher since Lucene 3.1+ – Do not optimize to make this work! • Sorting only needs per-segment FieldCache – Cheap reopen after index changes! • Filter cache per segment – Cheap reopen after index changes! 16
  • 17. Per Segment Search: Cons • Query/Filter API changes: – Scorer / Filter’s DocIdSet no longer use global document IDs • Slower sorting by string terms – Term ords are only comparable inside each segment – String comparisons needed after segment traversal – Use numeric sorting if possible!!! (Lucene supports missing values since version 3.4 [buggy], corrected 3.5+) 17
  • 18. Agenda • Motivation / History of Lucene • AtomicReader & CompositeReader • Reader contexts • Wrap up 18
  • 19. Composite Indexes (up to version 3.6) Atomic “views” on multi-segment index: – Term Dictionary: on-the-fly merged & sorted term index (priority queue for TermEnum,…) – Postings: posting lists for each term appended, convert document IDs to be global – Metadata: doc frequency, term frequency,… – Stored fields, term vectors, deletions: delegate global document IDs -> segment document IDs (binary search) – FieldCache: duplicate instances for single segments and composite view (memory!!!) 19
  • 20. Composite Indexes (version 4.0) Atomic “views” on multi-segment index: – Term Dictionary: on-the-fly merged & sorted term index (priority queue for TermEnum,…) – Postings: posting lists for each term appended, convert document IDs to be global – Metadata: doc frequency, term frequency,… – Stored fields, term vectors, deletions: delegate global document IDs -> segment document IDs (binary search) – FieldCache: duplicate instances for single segments and composite view (memory!!!) 20
  • 21. Early Lucene 4.0 • Only “historic” IndexReader interface available since Lucene 1.0 • 80% of all methods of composite IndexReaders throwed UnsupportedOperationException – This affected all user-facing APIs (SegmentReader was hidden / marked experimental) • No compile time safety! Image: © The Walt Disney Company – Query Scorers and Filters need term dictionary and postings, throwing UOE when executed on composite reader 21
  • 22. Image: © The Walt Disney Company Heavy Committing™ !!! 22
  • 23. IndexReader • Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor) 23
  • 24. IndexReader • Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor) • Does not know anything about “inverted index” concept – no terms, no postings! 24
  • 25. IndexReader • Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor) • Does not know anything about “inverted index” concept – no terms, no postings! • Access to stored fields (and term vectors) by document ID – to display / highlight search results only! 25
  • 26. IndexReader • Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor) • Does not know anything about “inverted index” concept – no terms, no postings! • Access to stored fields (and term vectors) by document ID – to display / highlight search results only! • Some very limited metadata – number of documents,… 26
  • 27. IndexReader • Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor) • Does not know anything about “inverted index” concept – no terms, no postings! • Access to stored fields (and term vectors) by document ID – to display / highlight search results only! • Some very limited metadata – number of documents,… • Unmodifiable: No more deletions / changing of norms possible! – Applies to all readers in Lucene 4.0 !!! 27
  • 28. IndexReader • Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor) • Does not know anything about “inverted index” concept – no terms, no postings! • Access to stored fields (and term vectors) by document ID – to display / highlight search results only! • Some very limited metadata – number of documents,… • Unmodifiable: No more deletions / changing of norms possible! – Applies to all readers in Lucene 4.0 !!! • Passed to IndexSearcher – just as before! 28
  • 29. IndexReader • Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor) • Does not know anything about “inverted index” concept – no terms, no postings! • Access to stored fields (and term vectors) by document ID – to display / highlight search results only! • Some very limited metadata – number of documents,… • Unmodifiable: No more deletions / changing of norms possible! – Applies to all readers in Lucene 4.0 !!! • Passed to IndexSearcher – just as before! • Allows IndexReader.open() for backwards compatibility (deprecated) 29
  • 31. AtomicReader • Inherits from IndexReader • Access to “atomic” indexes (single segments) 31
  • 32. AtomicReader • Inherits from IndexReader • Access to “atomic” indexes (single segments) • Full term dictionary and postings API 32
  • 33. AtomicReader • Inherits from IndexReader • Access to “atomic” indexes (single segments) • Full term dictionary and postings API 33
  • 34. AtomicReader • Inherits from IndexReader • Access to “atomic” indexes (single segments) • Full term dictionary and postings API • Access to DocValues (new in Lucene 4.0) and norms 34
  • 35. CompositeReader • No additional functionality on top of IndexReader 35
  • 36. CompositeReader • No additional functionality on top of IndexReader • Provides getSequentialSubReaders() to retrieve all child readers 36
  • 37. CompositeReader • No additional functionality on top of IndexReader • Provides getSequentialSubReaders() to retrieve all child readers • DirectoryReader and MultiReader implement this class 37
  • 38. DirectoryReader • Abstract class, defines interface for: 38
  • 39. DirectoryReader • Abstract class, defines interface for: – access to on-disk indexes (on top of Directory class) – access to commit points, index metadata, index version, isCurrent() for reopen support – defines abstract openIfChanged (for cheap reopening of indexes) – child readers are always AtomicReader instances 39
  • 40. DirectoryReader • Abstract class, defines interface for: – access to on-disk indexes (on top of Directory class) – access to commit points, index metadata, index version, isCurrent() for reopen support – defines abstract openIfChanged (for cheap reopening of indexes) – child readers are always AtomicReader instances • Provides static factory methods for opening indexes – well-known from IndexReader in Lucene 1 to 3 – factories return internal DirectoryReader implementation (StandardDirectoryReader with SegmentReaders as childs) 40
  • 41. Basic Search Example Image: © The Walt Disney Company Looks familiar, doesn’t it? 41
  • 42. “Arrgh! I still need terms and postings on my DirectoryReader!!! What should I do? Optimize to only have one segment? Help me please!!!” (example question on java-user@lucene.apache.org) 42
  • 43. What to do? • Calm down • Take a break and drink a beer* • Don’t optimize / force merge your index!!! *) Robert’s solution 43
  • 44. Solutions • Most efficient way: – Retrieve atomic leaves from your composite: reader.getTopReaderContext().leaves() – Iterate over sub-readers, do the work (possibly parallelized) – Merge results 44
  • 45. Solutions • Most efficient way: – Retrieve atomic leaves from your composite: reader.getTopReaderContext().leaves() – Iterate over sub-readers, do the work (possibly parallelized) – Merge results • Otherwise: wrap your CompositeReader: 45
  • 46. (Slow) Solution AtomicReader SlowCompositeReaderWrapper.wrap(IndexReader r) • Wraps IndexReaders of any kind as atomic reader, providing terms, postings, deletions, doc values – Internally uses same algorithms like previous Lucene readers – Segment-merging uses this to merge segments, too 46
  • 47. (Slow) Solution AtomicReader SlowCompositeReaderWrapper.wrap(IndexReader r) • Wraps IndexReaders of any kind as atomic reader, providing terms, postings, deletions, doc values – Internally uses same algorithms like previous Lucene readers – Segment-merging uses this to merge segments, too • Solr always provides an AtomicReader for convenience through SolrIndexSearcher. Plugin writers should use: AtomicReader rd = mySolrIndexSearcher.getAtomicReader() 47
  • 48. Other readers • FilterAtomicReader – Was FilterIndexReader in 3.x (but now solely works on atomic readers) – Allows to filter terms, postings, deletions – Useful for index splitters (e.g., PKIndexSplitter, MultiPassIndexSplitter) (provide own getLiveDocs() method, merge to IndexWriter) • ParallelAtomicReader, -CompositeReader 48
  • 49. IndexReader Reopening • Reopening solely provided by directory-based DirectoryReader instances 49
  • 50. IndexReader Reopening • Reopening solely provided by directory-based DirectoryReader instances • No more reopen for: – AtomicReader: they are atomic, no refresh possible – MultiReader: reopen child readers separately, create new MultiReader on top of reopened readers – Parallel*Reader, FilterAtomicReader: reopen wrapped readers, create new wrapper afterwards 50
  • 51. Agenda • Motivation / History of Lucene • AtomicReader & CompositeReader • Reader contexts • Wrap up 51
  • 53. The Problem SuperComposite: MultiReader CompositeA: CompositeB: DirectoryReader DirectoryReader Atomic0 Atomic1 Atomic2 Atomic0 Atomic1 Atomic2 Doc0 Doc0 Doc0 Doc0 Doc0 Doc0 Doc1 Doc1 Doc1 Doc1 Doc1 Doc1 Doc2 Doc2 Doc2 Doc2 Doc2 Doc2 Doc3 Doc3 Doc3 Doc3 Doc3 Doc3 53
  • 54. The Problem SuperComposite: MultiReader CompositeA: CompositeB: DirectoryReader DirectoryReader Atomic0 Atomic1 Atomic2 Atomic0 Atomic1 Atomic2 Doc0 Doc4 Doc8 Doc0 Doc0 Doc0 Doc1 Doc5 Doc9 Doc1 Doc1 Doc1 Doc2 Doc6 Doc10 Doc2 Doc2 Doc2 Doc3 Doc7 Doc11 Doc3 Doc3 Doc3 54
  • 55. The Problem SuperComposite: MultiReader CompositeA: CompositeB: DirectoryReader DirectoryReader Atomic0 Atomic1 Atomic2 Atomic0 Atomic1 Atomic2 Doc0 Doc4 Doc8 Doc12 Doc16 Doc20 Doc1 Doc5 Doc9 Doc13 Doc17 Doc21 Doc2 Doc6 Doc10 Doc14 Doc18 Doc22 Doc3 Doc7 Doc11 Doc15 Doc19 Doc23 55
  • 56. Solution • Each IndexReader instance provides: IndexReaderContext getTopReaderContext() 56
  • 57. Solution • Each IndexReader instance provides: IndexReaderContext getTopReaderContext() • The context provides a “view” on all childs (direct descendants) and leaves (down to lowest level AtomicReaders) 57
  • 58. Solution • Each IndexReader instance provides: IndexReaderContext getTopReaderContext() • The context provides a “view” on all childs (direct descendants) and leaves (down to lowest level AtomicReaders) • Each atomic leave has a docBase (document ID offset) 58
  • 59. Solution • Each IndexReader instance provides: IndexReaderContext getTopReaderContext() • The context provides a “view” on all childs (direct descendants) and leaves (down to lowest level AtomicReaders) • Each atomic leave has a docBase (document ID offset) • IndexSearcher passes a context instance relative to its own top-level reader to each Query-Scorer / Filter – allows to access complete reader tree up to the current top- level reader – Allows to get the “global” document ID 59
  • 60. Agenda • Motivation / History of Lucene • AtomicReader & CompositeReader • Reader contexts • Wrap up 60
  • 61. Summary • Up to Lucene 3.6 indexes are still accessible on any hierarchy level with same interface • Lucene 4.0 will split the IndexReader class into several abstract interfaces • IndexReaderContexts will support per-segment search preserving top-level document IDs 61
  • 62. Summary • Lucene moved from global searches to per- segment search in Lucene 2.9 • Up to Lucene 3.6 indexes are still accessible on any hierarchy level with same interface • Lucene 4.0 will split the IndexReader class into several abstract interfaces • IndexReaderContexts will support per-segment search preserving top-level document IDs 62
  • 63. Summary • Up to Lucene 3.6 indexes are still accessible on any hierarchy level with same interface • Lucene 4.0 will split the IndexReader class into several abstract interfaces • IndexReaderContexts will support per-segment search preserving top-level document IDs 63
  • 64. Summary • Lucene 4.0 will split the IndexReader class into several abstract interfaces • IndexReaderContexts will support per-segment search preserving top-level document IDs 64
  • 65. Summary • Lucene 4.0 will split the IndexReader class into several abstract interfaces • IndexReaderContexts will support per-segment search preserving top-level document IDs 65
  • 66. Contact Uwe Schindler uschindler@apache.org http://www.thetaphi.de @thetaph1 SD DataSolutions GmbH Wätjenstr. 49 28213 Bremen, Germany +49 421 40889785-0 http://www.sd-datasolutions.de 66