Lucene
The Search Engine
By Surinder Kaur
Table of Contents
Basics
Index
Segment
Inverted Index
Indexing
Lucene Delete
Lucene Update
Searching
Near Real Time Search
Query Boost
Scoring
References
Basics
Search Engine
Open Source
Supports Full Text Search, Sorting, Filtering and many other search functionalities
The core of Lucene is:
Inverted Index
Relevance Score
Search Algorithms
Tokenization
Index
An index is a collection of documents.
These documents may or may not share a schema.
Fields: A document consists of one or more fields. Each field can
be of a different data type.
Each field is represented as a key-value pair.
Terms: When a field is processed by an analyzer, it produces
terms.
A term is “the unit of search” in search engines.
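To make the document, field, and term vocabulary concrete, here is a minimal Python sketch. It is not Lucene's API: the `analyze` function is a hypothetical stand-in for an analyzer (it only lowercases and splits on non-alphanumeric characters), and the field names are made up for illustration.

```python
import re

# A document is a set of named fields (key-value pairs); values may differ in type.
doc = {"title": "Lucene in Action", "year": 2010}

def analyze(text):
    """Hypothetical analyzer: lowercase the text and split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

# A term pairs a field name with one token produced by the analyzer.
terms = [("title", tok) for tok in analyze(doc["title"])]
print(terms)  # [('title', 'lucene'), ('title', 'in'), ('title', 'action')]
```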
Segment
An index is split into many smaller
sections, called segments. Each
segment has its own index.
Lucene searches all the segments in
sequence.
A document, once written to a
segment, can never be modified.
However, Lucene can merge multiple
segments to improve search
performance.
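The segment behaviour described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Lucene's on-disk format: each segment is a plain dictionary from term to a tuple of document ids, and `search` and `merge` are hypothetical helper names.

```python
# Each segment is modelled as a mapping: term -> tuple of doc ids.
segment_a = {"lucene": (0,), "search": (0, 1)}
segment_b = {"lucene": (2,), "index": (2, 3)}

def search(segments, term):
    """Visit the segments one after another and collect matching doc ids."""
    hits = []
    for segment in segments:
        hits.extend(segment.get(term, ()))
    return hits

def merge(segments):
    """Build one new segment from several; the input segments are not modified."""
    merged = {}
    for segment in segments:
        for term, doc_ids in segment.items():
            merged[term] = merged.get(term, ()) + doc_ids
    return merged

segments = [segment_a, segment_b]
print(search(segments, "lucene"))  # [0, 2]

segments = [merge(segments)]       # fewer segments to visit per search
print(search(segments, "lucene"))  # [0, 2]
```

Merging does not change the results; it only reduces how many segments each search has to visit.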
Inverted Index
An inverted index is an index data structure.
In simple words, it inverts the “document-centric” data
structure (document -> terms) into a “term-centric” data
structure (term -> documents).
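The inversion from (document -> terms) to (term -> documents) can be illustrated with a small Python sketch. The sample documents and the whitespace tokenization are assumptions made for the example; a real analyzer does considerably more.

```python
from collections import defaultdict

# Document-centric view: doc id -> text.
docs = {
    0: "lucene is a search library",
    1: "an inverted index maps terms to documents",
    2: "lucene builds an inverted index",
}

def build_inverted_index(docs):
    """Return the term-centric view: term -> set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

index = build_inverted_index(docs)
print(sorted(index["lucene"]))    # [0, 2]
print(sorted(index["inverted"]))  # [1, 2]
```

Looking up a term is now a single dictionary access instead of a scan over every document.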
Lucene: Insert (Indexing)
“Indexing” is the process of inserting documents into Lucene.
Lucene first writes data to an “in-memory buffer”.
When the buffer reaches a certain size, it gets
flushed to a “segment”.
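The buffer-then-flush flow can be sketched as follows. `ToyWriter` is a hypothetical class, not Lucene's writer, and the threshold here is a document count chosen for simplicity; it stands in for whatever size limit is configured.

```python
class ToyWriter:
    """Hypothetical writer: buffers documents and flushes them as segments."""

    def __init__(self, max_buffered_docs=2):
        self.max_buffered_docs = max_buffered_docs  # assumed flush threshold
        self.buffer = []    # in-memory buffer
        self.segments = []  # flushed segments, stored as tuples

    def add_document(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.max_buffered_docs:
            self.flush()

    def flush(self):
        """Turn the current buffer into a new segment and start a fresh buffer."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

writer = ToyWriter(max_buffered_docs=2)
for title in ["a", "b", "c"]:
    writer.add_document({"title": title})

# Two documents were flushed into one segment; the third is still buffered.
print(len(writer.segments), len(writer.buffer))  # 1 1
```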
Lucene: Delete
A document is never physically removed from a segment; it is only
marked as deleted in a file, so that it cannot be
accessed during search.
This can be considered a soft delete.
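A soft delete can be modelled with a separate set of "deleted" markers kept next to an unchanged segment. The names below (`segment`, `deleted`, `delete_by_id`, `live_docs`) are illustrative assumptions, not Lucene identifiers.

```python
# The segment itself is a tuple and is never changed after it is written.
segment = ({"id": 1, "title": "a"}, {"id": 2, "title": "b"})

# Positions marked as deleted are tracked outside the segment.
deleted = set()

def delete_by_id(doc_id):
    """Mark matching documents as deleted without touching the segment."""
    for pos, doc in enumerate(segment):
        if doc["id"] == doc_id:
            deleted.add(pos)

def live_docs():
    """Documents a search is allowed to see."""
    return [doc for pos, doc in enumerate(segment) if pos not in deleted]

delete_by_id(1)
print(len(segment))                    # 2
print([d["id"] for d in live_docs()])  # [2]
```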
Lucene: Update
A document is never really updated in place.
An update is actually a two-step process:
The “older version” is marked “deleted” in the “original
segment”.
The “new version” is “added” to the “current segment”.
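The two steps can be sketched as a delete marker followed by an add. This is a toy model: segments are plain lists of dictionaries, the `id` field is an assumed unique key, and `update` and `live_docs` are hypothetical helpers.

```python
segments = [[{"id": 1, "title": "old"}]]  # the original segment
deleted = set()                           # (segment index, position) pairs

def update(doc_id, new_doc):
    # Step 1: mark the older version as deleted in the segment where it lives.
    for s, segment in enumerate(segments):
        for pos, doc in enumerate(segment):
            if doc["id"] == doc_id:
                deleted.add((s, pos))
    # Step 2: add the new version to a new (current) segment.
    segments.append([new_doc])

def live_docs():
    """Documents that are not marked as deleted."""
    return [
        doc
        for s, segment in enumerate(segments)
        for pos, doc in enumerate(segment)
        if (s, pos) not in deleted
    ]

update(1, {"id": 1, "title": "new"})
print([d["title"] for d in live_docs()])  # ['new']
print(segments[0])  # the old version is still stored: [{'id': 1, 'title': 'old'}]
```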
Lucene: Get or Search
Searching or retrieving results from Lucene is a multi-step
process:
Query Parser: Creates a query from the search text.
Index Searcher: Runs the query against the index.
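The parse-then-search split can be illustrated with a toy pipeline. Both functions are hypothetical simplifications: `parse_query` understands only `field:term` and bare-term clauses, and `search` applies OR semantics over a hand-built inverted index.

```python
# Toy inverted index: (field, term) -> set of doc ids.
index = {
    ("title", "lucene"): {0, 2},
    ("title", "search"): {0, 1},
    ("body", "lucene"): {1},
}

def parse_query(text, default_field="title"):
    """Turn 'field:term' or bare 'term' clauses into (field, term) pairs."""
    clauses = []
    for clause in text.lower().split():
        field, sep, term = clause.partition(":")
        clauses.append((field, term) if sep else (default_field, field))
    return clauses

def search(index, query):
    """Return ids of documents matching any clause (OR semantics)."""
    hits = set()
    for clause in query:
        hits |= index.get(clause, set())
    return sorted(hits)

query = parse_query("lucene body:lucene")
print(query)                 # [('title', 'lucene'), ('body', 'lucene')]
print(search(index, query))  # [0, 1, 2]
```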
Near Real Time Search
Lucene provides “near real time” (NRT) search, not
true real-time search.
This follows from the way documents get inserted:
any new document is first added to the in-memory
buffer, and the buffer is later flushed to become a segment.
Until the document reaches a segment, it is
“unsearchable”.
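The visibility gap can be shown with a toy index whose search only consults flushed segments. `ToyIndex` is a hypothetical class written for this illustration; it does not model how Lucene opens readers.

```python
class ToyIndex:
    """Hypothetical index: documents are searchable only after a flush."""

    def __init__(self):
        self.buffer = []    # newly added documents, not yet searchable
        self.segments = []  # flushed segments, visible to search

    def add(self, text):
        self.buffer.append(text)

    def flush(self):
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, term):
        # Only flushed segments are consulted; the buffer is ignored.
        return [t for seg in self.segments for t in seg if term in t.split()]

index = ToyIndex()
index.add("near real time search")
print(index.search("search"))  # []  (still in the buffer)

index.flush()
print(index.search("search"))  # ['near real time search']
```

The delay between `add` and `flush` is what makes the search "near" real time rather than real time.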
Document Scoring
The official documentation says: “Lucene scoring uses a combination of
the Vector Space Model (VSM) of Information Retrieval and
the Boolean model to determine how relevant a given Document is to
a User's query.”
In simpler terms, this is called “TF-IDF” (Term Frequency-Inverse Document
Frequency), i.e. the more times a query term appears in a document
relative to the number of times the term appears in all the documents
in the collection, the more relevant that document is to the query.
Note: Scoring is a detailed topic; I plan to publish a separate detailed
study of it. For reference, the Similarity formula is described in the
Lucene Similarity documentation (see References).
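The TF-IDF intuition can be sketched with the textbook formula tf × log(N / df). This is a simplified illustration, not Lucene's actual Similarity formula, which adds further normalization and boost factors; the sample documents are made up.

```python
import math

# Each document is already tokenized into a list of terms.
docs = [
    "lucene search engine".split(),
    "search search results".split(),
    "cooking recipes".split(),
]

def tf_idf(term, doc, docs):
    """Textbook TF-IDF: raw term frequency times log(N / document frequency)."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

scores = [tf_idf("search", d, docs) for d in docs]

# The document that repeats "search" ranks above the one that mentions it once,
# and the document without the term scores zero.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [1, 0, 2]
```

A rarer term such as "lucene" gets a larger inverse document frequency than "search", so a single occurrence of it contributes more to the score.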
Boosting Score
Lucene lets you apply boosts at various levels,
namely:
Document Level Boost (while Indexing)
Field Level Boost (while Indexing)
Query Level Boost (while Searching)
Query Boost
Query-time boosts allow one to specify which terms/clauses
are “more important”.
Query boost plays a role during searching.
The higher the boost factor, the more relevant the term will
be, and therefore the higher the corresponding document
scores.
E.g., boosting first name over last name by a factor of 2:
(first_name:"Jack")^2 (last_name:"Jack")
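The effect of the boost in the example query can be sketched with a toy scorer in which each matching clause contributes its boost to the document score. This is an assumed simplification: real scoring multiplies the boost into a similarity score instead of adding flat weights, and the sample documents are invented.

```python
docs = {
    "a": {"first_name": "Jack", "last_name": "Smith"},
    "b": {"first_name": "John", "last_name": "Jack"},
}

# Each clause is (field, term, boost); first_name is boosted by a factor of 2.
query = [("first_name", "Jack", 2.0), ("last_name", "Jack", 1.0)]

def score(doc, query):
    """Sum the boosts of all clauses that match the document."""
    return sum(boost for field, term, boost in query if doc.get(field) == term)

ranked = sorted(docs, key=lambda name: score(docs[name], query), reverse=True)
print(ranked)  # ['a', 'b']
```

Both documents match one clause, but the match on the boosted `first_name` field ranks document `a` first.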
References
Lucene Documentation
Segment
Inverted index
Lucene tutorial
Lucene Query Syntax
Lucene Similarity
