Lucene
The Search Engine
By Surinder Kaur
Table of Contents
Basics
Index
Segment
Inverted Index
Indexing
Lucene Delete
Lucene Update
Searching
Near Real Time Search
Scoring
Query Boost
References
Basics
Search Engine
Open Source
Supports full-text search, sorting, filtering, and many other search features
The core of Lucene consists of:
Inverted Index
Relevance Score
Search Algorithms
Tokenization
Index
An index is a collection of documents.
These documents may or may not share a schema.
Fields: A document consists of one or more fields. Each field can
be of a different data type.
Each field is represented as a key-value pair.
Terms: When a field is processed by an analyzer, it produces
terms.
A term is “the unit of search” in a search engine.
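As a rough illustration of how an analyzer turns field values into terms, the sketch below lowercases each value and splits it on non-alphanumeric characters. The `analyze` function and the sample document are hypothetical; Lucene's real analyzers are configurable pipelines that do considerably more.

```python
import re

def analyze(text):
    """Toy analyzer (an assumption, not Lucene's): lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

# A document is a set of fields, each a key-value pair (sample values are illustrative).
doc = {"title": "Lucene: The Search Engine", "author": "Surinder Kaur"}

# Each field is analyzed separately and yields its own list of terms.
terms = {field: analyze(value) for field, value in doc.items()}
print(terms)
```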
Segment
An index is split into many smaller
sections, called segments. Each
segment is a self-contained index of its own.
By default, Lucene searches the segments
one after another.
A document, once written to a
segment, is never modified in place.
However, Lucene can merge multiple
segments to improve
performance.
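The idea of searching segment by segment, and of merging segments, can be sketched as follows. Each toy segment is a plain dictionary from term to document IDs. The `search` and `merge` helpers are hypothetical, and the globally unique document IDs are a simplifying assumption.

```python
# Two toy segments; each maps a term to the IDs of documents containing it.
segments = [
    {"lucene": {0}, "index": {0, 1}},
    {"lucene": {2}, "search": {3}},
]

def search(segments, term):
    """Visit every segment in sequence and collect the matching document IDs."""
    hits = set()
    for seg in segments:
        hits |= seg.get(term, set())
    return hits

def merge(segments):
    """Build one new segment from several; the input segments are left untouched."""
    merged = {}
    for seg in segments:
        for term, ids in seg.items():
            merged.setdefault(term, set()).update(ids)
    return merged

print(search(segments, "lucene"))           # {0, 2}
print(search([merge(segments)], "lucene"))  # {0, 2}
```

After the merge, the same query needs to visit one segment instead of two, which is the performance benefit the slide refers to.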
Inverted Index
An inverted index is an index data structure.
In simple terms, it inverts the “document-centric” data
structure (document -> terms) into a “term-centric” data
structure (term -> documents).
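A minimal sketch of the inversion, assuming whitespace tokenization and a handful of made-up documents. The `build_inverted_index` helper is illustrative only and says nothing about how Lucene stores its postings on disk.

```python
from collections import defaultdict

# Document-centric view: document ID -> text (sample data is hypothetical).
docs = {
    0: "lucene is a search engine",
    1: "an inverted index maps terms to documents",
    2: "lucene builds an inverted index",
}

def build_inverted_index(docs):
    """Invert document -> terms into term -> set of document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

inverted = build_inverted_index(docs)
print(sorted(inverted["lucene"]))    # [0, 2]
print(sorted(inverted["inverted"]))  # [1, 2]
```

Looking up a term is now a single dictionary access rather than a scan over every document.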
Lucene: Insert (Indexing)
“Indexing” is the process of inserting documents into Lucene.
Lucene first writes data to an in-memory buffer.
When the buffer reaches a certain size, it is
flushed to a segment.
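The buffer-then-flush behaviour can be sketched with a toy writer. The `ToyWriter` class and its `max_buffered_docs` threshold are assumptions for illustration; the threshold here is a simple document count.

```python
class ToyWriter:
    """Toy model of buffered indexing; not Lucene's IndexWriter."""

    def __init__(self, max_buffered_docs=2):
        self.max_buffered_docs = max_buffered_docs
        self.buffer = []     # in-memory buffer of pending documents
        self.segments = []   # flushed, immutable segments

    def add_document(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.max_buffered_docs:
            self.flush()

    def flush(self):
        """Turn the current buffer into a new immutable segment."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

writer = ToyWriter(max_buffered_docs=2)
for text in ["doc a", "doc b", "doc c"]:
    writer.add_document(text)

print(writer.segments)  # [('doc a', 'doc b')]
print(writer.buffer)    # ['doc c']
```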
Lucene: Delete
A document is not removed from its segment on deletion; it is only
marked as deleted in a separate file, so that it cannot be
returned by a search.
This can be considered a soft delete. The space is reclaimed later, when segments are merged.
Lucene: Update
A document is never really updated in place.
An update is actually a two-step process:
The older version is marked as deleted in its original
segment.
The new version is added to the current segment.
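The soft delete and the two-step update can be sketched together. `ToySegment`, `update`, and the sample records are hypothetical; the point is only that the stored documents stay untouched while a separate set of deletion marks hides the old version.

```python
class ToySegment:
    """Immutable documents plus a separate set of 'deleted' marks (toy model)."""

    def __init__(self, docs):
        self.docs = tuple(docs)  # never modified after creation
        self.deleted = set()     # positions of documents marked as deleted

    def live_docs(self):
        return [d for i, d in enumerate(self.docs) if i not in self.deleted]

def update(segments, current, doc_id, new_doc):
    """Step 1: mark the old version deleted. Step 2: add the new version."""
    for seg in segments:
        for i, d in enumerate(seg.docs):
            if d["id"] == doc_id:
                seg.deleted.add(i)
    current.append(new_doc)

segments = [ToySegment([{"id": 1, "name": "Jack"}, {"id": 2, "name": "Jill"}])]
current = []  # documents waiting to become the next segment

update(segments, current, 1, {"id": 1, "name": "Jack Sparrow"})

print(segments[0].live_docs())  # [{'id': 2, 'name': 'Jill'}]
print(len(segments[0].docs))    # 2 (the old version is still physically stored)
print(current)                  # [{'id': 1, 'name': 'Jack Sparrow'}]
```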
Lucene: Get or Search
Searching, or retrieving results from Lucene, is a multi-step
process:
Query Parser: turns the query string into a query.
Index Searcher: runs the query against the index.
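The two steps can be sketched with a toy parser and searcher. `parse_query`, `search`, the `body` default field, and the tiny index are all assumptions for illustration; Lucene's query parser supports a much richer syntax.

```python
def parse_query(query_string, default_field="body"):
    """Toy parser: 'field:term' -> (field, term); a bare term uses the default field."""
    field, sep, term = query_string.partition(":")
    if not sep:
        field, term = default_field, query_string
    return field, term.lower()

def search(index, query):
    """Toy searcher: look the term up in the per-field inverted index."""
    field, term = query
    return sorted(index.get(field, {}).get(term, set()))

# Per-field inverted index: field -> term -> document IDs (sample data).
index = {
    "title": {"lucene": {0, 2}},
    "body": {"search": {0, 1}},
}

print(search(index, parse_query("title:Lucene")))  # [0, 2]
print(search(index, parse_query("search")))        # [0, 1]
```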
Near Real Time Search
Lucene provides “near real time” (NRT) search rather than
true real-time search.
This follows from the way documents are inserted:
a new document is first added to the in-memory
buffer, and the buffer is later flushed to become a segment.
Until the document reaches a segment, it is
unsearchable.
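The visibility gap can be sketched with a toy index in which only flushed documents are searchable. The `ToyIndex` class and its `refresh` method are hypothetical names that stand in for the buffer-to-segment step described above.

```python
class ToyIndex:
    """Toy model: documents become searchable only after the buffer is flushed."""

    def __init__(self):
        self.buffer = []      # newly added, not yet searchable
        self.searchable = []  # documents that have reached a segment

    def add(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        """Flush the buffer so that its documents become visible to searches."""
        self.searchable.extend(self.buffer)
        self.buffer.clear()

    def search(self, word):
        return [d for d in self.searchable if word in d.split()]

idx = ToyIndex()
idx.add("near real time search")

before = idx.search("search")  # the document is still in the buffer
idx.refresh()
after = idx.search("search")   # the document is now visible

print(before)  # []
print(after)   # ['near real time search']
```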
Document Scoring
The official documentation says: “Lucene scoring uses a combination of
the Vector Space Model (VSM) of Information Retrieval and
the Boolean model to determine how relevant a given Document is to
a User's query.”
In simpler terms, this is “TF-IDF” (Term Frequency-Inverse Document
Frequency): the more often a query term appears in a document,
and the rarer that term is across all the documents
in the collection, the more relevant that document is to the query.
(Recent Lucene versions default to BM25, which builds on the same TF-IDF idea.)
Note: Scoring is a detailed topic; I plan to publish a detailed study of
it. For reference, the similarity formula is described in the Lucene Similarity documentation listed under References.
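A minimal sketch of the TF-IDF idea, using the textbook form tf × ln(N / df) rather than Lucene's actual similarity formula, which adds normalization and other factors. The documents and the `tf_idf` helper are made up for illustration.

```python
import math

# Sample collection (hypothetical); each document is tokenized on whitespace.
docs = [
    "lucene search engine",
    "search search search",
    "apache hive",
]
tokenized = [d.split() for d in docs]

def tf_idf(term, doc_tokens, all_docs):
    """Textbook TF-IDF: raw term frequency times ln(N / document frequency)."""
    tf = doc_tokens.count(term)
    df = sum(1 for d in all_docs if term in d)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(all_docs) / df)

# Score every document for the query term "search".
scores = [tf_idf("search", d, tokenized) for d in tokenized]
print(scores)
```

The second document scores highest because it contains the term three times, and the third scores zero because it does not contain it at all. A rarer term such as "lucene" gets a larger inverse document frequency than "search", since it occurs in fewer documents.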
Boosting Score
Lucene lets you apply boosts at various levels,
namely:
Document Level Boost (while Indexing)
Field Level Boost (while Indexing)
Query Level Boost (while Searching)
Query Boost
Query-time boosts allow one to specify which terms/clauses
are “more important”.
A query boost takes effect at search time.
The higher the boost factor, the more weight the term
carries, and therefore the higher the matching documents
score.
E.g., boosting first name over last name by a factor of 2:
(first_name:"Jack")^2 (last_name:"Jack")
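The effect of the boost in this example can be sketched as follows, assuming every matching clause contributes a base score of 1.0 multiplied by its boost. The `score` function and the sample records are hypothetical; a real score also depends on the similarity formula.

```python
def score(doc, clauses):
    """Sum the boost of every (field, term, boost) clause the document matches."""
    return sum(boost for field, term, boost in clauses if doc.get(field) == term)

# Toy equivalent of: (first_name:"Jack")^2 (last_name:"Jack")
clauses = [("first_name", "Jack", 2.0), ("last_name", "Jack", 1.0)]

people = [
    {"first_name": "Samuel", "last_name": "Jack"},
    {"first_name": "Jack", "last_name": "Smith"},
]

# Documents matching the boosted clause rank first.
ranked = sorted(people, key=lambda d: score(d, clauses), reverse=True)
for person in ranked:
    print(person["first_name"], person["last_name"], score(person, clauses))
```

With the boost in place, the person whose first name is Jack outranks the person whose last name is Jack, even though each matches exactly one clause.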
References
Lucene Documentation
Segment
Inverted index
Lucene tutorial
Lucene Query Syntax
Lucene Similarity

