Lucene

3. Basics Lucene is an open-source search engine library. It supports full-text search, sorting, filtering, and many other search functions. The core of Lucene is: inverted index, relevance score, search algorithms, and tokenization.

4. Index An index is a collection of documents. These documents may or may not share a schema. Fields: A document consists of one or more fields. Each field can have a different data type and is represented as a key-value pair. Terms: When a field is processed by an analyzer, it produces terms. A term is “the unit of search” in a search engine.
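The relationship between documents, fields, and terms can be sketched in a few lines of Python. This is only an illustration: the `analyze` function is a hypothetical toy analyzer (lowercase, split on non-alphanumeric characters), not one of Lucene's analyzers, and the sample document is made up.

```python
import re

def analyze(text):
    """Toy analyzer (an assumption, not Lucene's): lowercase, split on non-alphanumerics."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]

# A document is a set of key-value fields; each field yields its own terms.
document = {"title": "Lucene in Action", "body": "Full-text search, made simple."}
terms_by_field = {field: analyze(value) for field, value in document.items()}

for field, terms in terms_by_field.items():
    print(field, "->", terms)
```

Real analyzers usually do more (stop-word removal, stemming, and so on), so the terms they produce may differ from this sketch.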

5. Segment An index is split into smaller sections called segments. Each segment is a self-contained index of its own. Lucene searches all the segments in sequence. Data (a document) once written to a segment can never be modified. However, Lucene can merge multiple segments to improve performance.

6. Inverted Index An inverted index is an index data structure. In simple terms, it inverts the “document-centric” data structure (document -> terms) into a “term-centric” data structure (term -> documents).
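The inversion from document -> terms to term -> documents can be sketched as follows. This is a minimal illustration with made-up documents; the `invert` helper is a hypothetical name, and a real index also stores positions, frequencies, and other data that the sketch leaves out.

```python
from collections import defaultdict

# Document-centric view: document id -> terms
docs = {
    1: ["lucene", "is", "a", "search", "library"],
    2: ["search", "engines", "use", "an", "inverted", "index"],
    3: ["lucene", "builds", "an", "inverted", "index"],
}

def invert(documents):
    """Build the term-centric view: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:
            index[term].add(doc_id)
    return index

inverted = invert(docs)
print(sorted(inverted["lucene"]))    # [1, 3]
print(sorted(inverted["inverted"]))  # [2, 3]
```

Looking up a term is now a single dictionary access instead of a scan over every document.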

7. Lucene: Insert (Indexing) “Indexing” is the process of inserting documents into Lucene. Lucene writes data to an in-memory buffer. When the buffer reaches a certain size, it is flushed to a segment.
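The buffer-then-flush behaviour can be sketched with a toy writer. The class `ToyIndexWriter` and its `max_buffered_docs` threshold are assumptions made for illustration: the sketch flushes after a fixed number of documents only to stay small, whereas the slide speaks of buffer size.

```python
class ToyIndexWriter:
    """Hypothetical writer: buffers documents in memory, flushes them to segments."""

    def __init__(self, max_buffered_docs=2):
        self.max_buffered_docs = max_buffered_docs  # assumed flush threshold
        self.buffer = []     # in-memory buffer
        self.segments = []   # flushed segments

    def add_document(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.max_buffered_docs:
            self.flush()

    def flush(self):
        if self.buffer:
            # A tuple stands in for a segment that is never modified after writing.
            self.segments.append(tuple(self.buffer))
            self.buffer = []

writer = ToyIndexWriter(max_buffered_docs=2)
for text in ["doc a", "doc b", "doc c"]:
    writer.add_document(text)

print(writer.segments)  # [('doc a', 'doc b')]
print(writer.buffer)    # ['doc c']
```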

8. Lucene: Delete A document is never removed from its segment; it is only marked as deleted in a file, so that it cannot be accessed during search. This can be considered a soft delete.

9. Lucene: Update A document is never really updated in place. An update is actually a two-step process: the older version is marked as deleted in its original segment, and the new version is added to the current segment.
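Soft delete and the two-step update can be sketched together. Everything here (`ToySegment`, `delete`, `update`, the `id` field) is a hypothetical model for illustration, not Lucene's API; the point is only that the original segment's data is never rewritten.

```python
class ToySegment:
    """Hypothetical segment: documents are written once; deletes are only marks."""

    def __init__(self, docs):
        self.docs = tuple(docs)  # never modified after creation
        self.deleted = set()     # positions marked as deleted

    def live_docs(self):
        return [doc for i, doc in enumerate(self.docs) if i not in self.deleted]

def delete(segments, doc_id):
    """Soft delete: mark every matching document as deleted."""
    for segment in segments:
        for i, doc in enumerate(segment.docs):
            if doc["id"] == doc_id:
                segment.deleted.add(i)

def update(segments, new_doc):
    """Update = mark the old version deleted, then add the new version."""
    delete(segments, new_doc["id"])
    segments.append(ToySegment([new_doc]))

segments = [ToySegment([{"id": 1, "name": "Jack"}, {"id": 2, "name": "Jill"}])]
update(segments, {"id": 1, "name": "Jack Sparrow"})

print(len(segments[0].docs))  # 2: the old version is still physically present
print([doc for segment in segments for doc in segment.live_docs()])
```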

10. Lucene: Get or Search Searching or retrieving results from Lucene is a multi-step process. Query Parser: creates a query. Index Searcher: runs the query against the index.

11. Near Real Time Search Lucene provides “near real time” (NRT) search, not real-time search. This is a consequence of the way documents are inserted: a new document is first added to the in-memory buffer, and the buffer is then flushed to become a segment. Until the document reaches a segment, it is unsearchable.
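The gap between adding a document and being able to find it can be sketched with a toy index. `ToyIndex` is a hypothetical class for illustration; its `search` looks only at flushed segments, which is the behaviour the slide describes.

```python
class ToyIndex:
    """Hypothetical index: search sees flushed segments only, never the buffer."""

    def __init__(self):
        self.buffer = []
        self.segments = []

    def add(self, text):
        self.buffer.append(text)

    def flush(self):
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, word):
        return [doc for segment in self.segments for doc in segment
                if word in doc.split()]

index = ToyIndex()
index.add("hello lucene")
before = index.search("lucene")  # document is still in the buffer
index.flush()
after = index.search("lucene")   # document has reached a segment

print(before)  # []
print(after)   # ['hello lucene']
```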

12. Document Scoring The official doc says: “Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given Document is to a User's query.” In simpler terms, this is called “TF-IDF” (Term Frequency-Inverse Document Frequency): the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query. Note: Scoring is a detailed topic; I would publish a detailed study of it separately. For reference, the Similarity formula is described in the Lucene Similarity documentation.
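The TF-IDF idea can be sketched with a textbook formula. This is an assumption-laden simplification, not Lucene's actual Similarity formula: here tf is the term's share of the document and idf is log(N / df), and the documents are made up.

```python
import math

docs = {
    "d1": ["lucene", "search", "lucene"],
    "d2": ["search", "engine"],
    "d3": ["database", "engine"],
}

def tf_idf(term, doc_id, documents):
    """Textbook TF-IDF (not Lucene's exact formula)."""
    terms = documents[doc_id]
    tf = terms.count(term) / len(terms)                    # frequency within the document
    df = sum(1 for t in documents.values() if term in t)   # documents containing the term
    if df == 0:
        return 0.0
    idf = math.log(len(documents) / df)                    # rarer terms weigh more
    return tf * idf

# "lucene" is frequent in d1 and rare in the collection, so it outscores "search".
print(tf_idf("lucene", "d1", docs))
print(tf_idf("search", "d1", docs))
```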

13. Boosting Score Lucene lets you apply boosts at various levels, namely: document-level boost (while indexing), field-level boost (while indexing), and query-level boost (while searching).

14. Query Boost Query-time boosts allow one to specify which terms/clauses are “more important”. A query boost plays its role during searching. The higher the boost factor, the more relevant the term will be, and therefore the higher the corresponding document scores. E.g., boosting first name over last name by a factor of 2: (first_name:"Jack")^2 (last_name:"Jack")
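The effect of the boost in the example query can be sketched with a toy scorer. This is a deliberately crude model, not Lucene's scoring: each matching clause is assumed to contribute a base score of 1.0 multiplied by its boost, and the `score` function and the sample people are hypothetical.

```python
def score(doc, clauses):
    """Sum the boosts of all clauses whose field matches (base score assumed 1.0)."""
    return sum(boost for field, value, boost in clauses if doc.get(field) == value)

# Mirrors (first_name:"Jack")^2 (last_name:"Jack")
clauses = [("first_name", "Jack", 2.0), ("last_name", "Jack", 1.0)]

people = [
    {"first_name": "Samuel", "last_name": "Jack"},
    {"first_name": "Jack", "last_name": "Ryan"},
]

ranked = sorted(people, key=lambda doc: score(doc, clauses), reverse=True)
print([person["first_name"] for person in ranked])  # ['Jack', 'Samuel']
```

With the boost removed (both factors 1.0), the two documents would tie in this model; the boost is what ranks the first-name match higher.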

15. References Lucene Documentation; Segment; Inverted index; Lucene tutorial; Lucene Query Syntax; Lucene Similarity


More from Surinder Kaur
