SlideShare una empresa de Scribd logo
1 de 30
Descargar para leer sin conexión
Apache Lucene/Solr London User Group
How the Lucene More Like
This Works
Alessandro Benedetti, Software Engineer
16th May 2019
Apache Lucene/Solr London User GroupWho I am
▪ Search Consultant
▪ R&D Software Engineer
▪ Master in Computer Science
▪ Apache Lucene/Solr Enthusiast
▪ Semantic, NLP, Machine Learning
Technologies passionate
▪ Beach Volleyball Player & Snowboarder
Alessandro Benedetti
Apache Lucene/Solr London User GroupSease
Search Services
● Open Source Enthusiasts
● Apache Lucene/Solr experts
! Community Contributors
● Active Researchers
● Hot Trends : Learning To Rank, Document Similarity,
Search Quality Evaluation, Relevancy Tuning
Apache Lucene/Solr London User Group
● Document Similarity
● Apache Lucene More Like This
! Term Scorer
! BM25
● Interesting Terms Retrieval
● Query Building
! DEMO
! Future Work
! JIRA References
Agenda
Apache Lucene/Solr London User Group
Document Similarity
Problem : find similar documents to a seed one
Solution(s) :

● Collaborative approach 

(users interactions)
● Content Based
● Hybrid
Similar ? 

● Documents accessed in
similar manners by similar
people
● Terms distributions
● All of above
Apache Lucene/Solr London User Group
Real World Use Cases - Streaming Services
Apache Lucene/Solr London User Group
Real World Use Cases - Hotels
Apache Lucene/Solr London User Group
Apache Lucene
Apache LuceneTM is a high-performance, full-featured text search engine library
written entirely in Java.
It is a technology suitable for nearly any application that requires full-text
search, especially cross-platform.
Apache Lucene is an open source project available for free download.
Apache Lucene/Solr London User Group
● Search Library (java)
● Structured Documents
! Inverted Index
! Similarity Metrics ( TF-IDF, BM25)
! Fast Search
! Support for advanced queries
! Relevancy tuning
Apache Lucene
Apache Lucene/Solr London User Group
Inverted Index
Indexing
Apache Lucene/Solr London User Group
Input
Document More Like This
Params
Interesting
Terms

Retriever
Term Scorer
Query Builder QUERY
More Like This - Break Up
Apache Lucene/Solr London User Group
Responsibility : define a set of parameters (and defaults) that affect the
various components of the More Like This module
● Regulate MLT behavior
● Groups parameters specific to each component
● Javadoc documentation
● Default values
! Useful container for various parameters to be passed
More Like This Params
Apache Lucene/Solr London User Group
● Field Name
● Field Stats ( Document Count)
● Term Stats ( Document Frequency)
! Term Frequency
! TF-IDF -> tf * (log ( numDocs / docFreq + 1) + 1)
! BM25
Term Scorer
Responsibility : assign a score to a term that measure how distinctive is the term
for the document in input
Apache Lucene/Solr London User Group
! Origin from Probabilistic Information Retrieval
! Default Similarity from Lucene 6.0 [1]
! 25th iteration in improving TF-IDF
! TF
! IDF
! Document Length
[1] LUCENE-6789
BM25 Term Scorer
Apache Lucene/Solr London User Group
BM25 Term Scorer - Inverse Document Frequency
IDF Score

has very similar
behavior
Apache Lucene/Solr London User Group
BM25 Term Scorer - Term Frequency
TF Score

approaches

asymptotically (k+1)



k=1.2 in this
example
Apache Lucene/Solr London User Group
BM25 Term Scorer - Document Length
Document Length /

Avg Document
Length



affects how fast we
saturate TF score
Apache Lucene/Solr London User Group
Responsibility : retrieve from the document a queue of weighted interesting
terms Params Used
! Analyzer
! Max Num Token Parsed
! Min Term Frequency
! Min/Max Document Frequency
! Max Query Terms
! Query Time Field Boost
Interesting Term Retriever
! Analyze content / Term Vector
! Skip Tokens
! Score Tokens
! Build Queue of Top Scored terms
Apache Lucene/Solr London User Group
Params Used
! Term Boost Enabled
More Like This Query Builder
Field1 :

Term1
Field2 :

Term2
Field1 :

Term3
Field1 :

Term4
Field3 :

Term5
3.0 4.0 4.5 4.8 7.5
Q = Field1:Term1^3.0 Field2:Term2^4.0
Field1:Term3^4.5 Field1:Term4^4.8
Field3:Term5^7.5
Apache Lucene/Solr London User Group
Term Boost
! on/off
! Affect each term weight in the
MLT query
! It is the term score 

( it depends of the Term Scorer
implementation chosen)
More Like This Boost
Field Boost
! field1^5.0 field2^2.0 field3^1.5
! Affect Term Scorer
! Affect the interesting terms
retrieved
N.B. a highly boosted field can
dominate the interesting terms
retrieval
Apache Lucene/Solr London User Group
More Like This Usage - Lucene Classification
! Given a document D to classify
! K Nearest Neighbours Classifier
! Find Top K similar documents to D ( MLT)
! Classes are extracted
! Class Frequency + Class ranking -> Class probability
Apache Lucene/Solr London User Group
More Like This Usage - Apache Solr
! More Like This query parser

( can be concatenated with other queries)
! More Like This search component

( can be assigned to a Request Handler)
! More Like This handler

( handler with specific request parameters)
Apache Lucene/Solr London User Group
More Like This Demo - Movie Data Set
This data consists of the following fields:
● id - unique identifier for the movie
● Title - Name of the movie
● Directors - The person(s) who directed the making of the film
● Genres - The genre(s) that the movie belongs to
Apache Lucene/Solr London User Group
More Like This Demo - Tuned
! Enable/Disable Term Boost
! Min Term Frequency
! Min Document Frequency
! Field Boost
Apache Lucene/Solr London User Group
Future Work
! Query Builder just use Terms and Term Score
! Term Positions ?
! Phrase Queries Boost

(for terms close in position)
! Sentence boundaries
! Field centric vs Document centric

( should high boosted fields kick out

relevant terms from low boosted fields)
Apache Lucene/Solr London User Group
Future Work - More Like These
! Multiple documents in input
! Interesting terms across
documents
● Useful for Content Based
recommender engines
Apache Lucene/Solr London User Group
Pros
● Apache Lucene Module
! Advanced Params
! Input : 

- structured document

- just text
! Build an advanced query
! Leverage the Inverted Index

( and additional data structures)
More Like This
Cons
● Massive single class
! Low cohesion
! Low readability
! Minimum test coverage
! Difficult to extend

( and improve)
Apache Lucene/Solr London User Group
● LUCENE-7498 - Introducing BM25 Term Scorer
● LUCENE-7802 - Architectural Refactor
● LUCENE-8326 - MLT Params Refactor
JIRA References
Apache Lucene/Solr London User Group
Questions ?
Apache Lucene/Solr London User GroupThanks!

Más contenido relacionado

La actualidad más candente

Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Tweaking the Base Score: Lucene/Solr Similarities Explained
Tweaking the Base Score: Lucene/Solr Similarities ExplainedTweaking the Base Score: Lucene/Solr Similarities Explained
Tweaking the Base Score: Lucene/Solr Similarities ExplainedSematext Group, Inc.
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiManish Gupta
 
ElasticSearch Basic Introduction
ElasticSearch Basic IntroductionElasticSearch Basic Introduction
ElasticSearch Basic IntroductionMayur Rathod
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentationCyanny LIANG
 
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetStreaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetHostedbyConfluent
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrSease
 
Elasticsearch for beginners
Elasticsearch for beginnersElasticsearch for beginners
Elasticsearch for beginnersNeil Baker
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Apache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceApache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceSijie Guo
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsDatabricks
 
How to Lock Down Apache Kafka and Keep Your Streams Safe
How to Lock Down Apache Kafka and Keep Your Streams SafeHow to Lock Down Apache Kafka and Keep Your Streams Safe
How to Lock Down Apache Kafka and Keep Your Streams Safeconfluent
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developersconfluent
 
Understanding of Apache kafka metrics for monitoring
Understanding of Apache kafka metrics for monitoring Understanding of Apache kafka metrics for monitoring
Understanding of Apache kafka metrics for monitoring SANG WON PARK
 
Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드
Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드
Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드confluent
 

La actualidad más candente (20)

Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Tweaking the Base Score: Lucene/Solr Similarities Explained
Tweaking the Base Score: Lucene/Solr Similarities ExplainedTweaking the Base Score: Lucene/Solr Similarities Explained
Tweaking the Base Score: Lucene/Solr Similarities Explained
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
 
ElasticSearch Basic Introduction
ElasticSearch Basic IntroductionElasticSearch Basic Introduction
ElasticSearch Basic Introduction
 
Apache Solr
Apache SolrApache Solr
Apache Solr
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentation
 
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetStreaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
 
Elasticsearch for beginners
Elasticsearch for beginnersElasticsearch for beginners
Elasticsearch for beginners
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Apache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceApache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage Service
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
 
How to Lock Down Apache Kafka and Keep Your Streams Safe
How to Lock Down Apache Kafka and Keep Your Streams SafeHow to Lock Down Apache Kafka and Keep Your Streams Safe
How to Lock Down Apache Kafka and Keep Your Streams Safe
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developers
 
Understanding of Apache kafka metrics for monitoring
Understanding of Apache kafka metrics for monitoring Understanding of Apache kafka metrics for monitoring
Understanding of Apache kafka metrics for monitoring
 
Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드
Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드
Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드
 
Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
 

Similar a How the Lucene More Like This Works

Faceted search using Solr and Ontopia
Faceted search using Solr and OntopiaFaceted search using Solr and Ontopia
Faceted search using Solr and OntopiaGeir Ove Grønmo
 
Advanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache LuceneAdvanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache LuceneAlessandro Benedetti
 
Advanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneAdvanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneSease
 
Lucene Bootcamp -1
Lucene Bootcamp -1 Lucene Bootcamp -1
Lucene Bootcamp -1 GokulD
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5israelekpo
 
Building multi billion ( dollars, users, documents ) search engines on open ...
Building multi billion ( dollars, users, documents ) search engines  on open ...Building multi billion ( dollars, users, documents ) search engines  on open ...
Building multi billion ( dollars, users, documents ) search engines on open ...Andrei Lopatenko
 
Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6DEEPAK KHETAWAT
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSease
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveAndrea Gazzarini
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in SolrTommaso Teofili
 
Apace Solr Web Development.pdf
Apace Solr Web Development.pdfApace Solr Web Development.pdf
Apace Solr Web Development.pdfAbanti Aazmin
 
Search explained T3DD15
Search explained T3DD15Search explained T3DD15
Search explained T3DD15Hans Höchtl
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemTrey Grainger
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesMax Irwin
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAsad Abbas
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solrsagar chaturvedi
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Search Intelligence @elo7.com
Search Intelligence @elo7.comSearch Intelligence @elo7.com
Search Intelligence @elo7.comFernando Meyer
 
Solr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studySolr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studyCharlie Hull
 
An Introduction to NLP4L
An Introduction to NLP4LAn Introduction to NLP4L
An Introduction to NLP4LKoji Sekiguchi
 

Similar a How the Lucene More Like This Works (20)

Faceted search using Solr and Ontopia
Faceted search using Solr and OntopiaFaceted search using Solr and Ontopia
Faceted search using Solr and Ontopia
 
Advanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache LuceneAdvanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache Lucene
 
Advanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneAdvanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache Lucene
 
Lucene Bootcamp -1
Lucene Bootcamp -1 Lucene Bootcamp -1
Lucene Bootcamp -1
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5
 
Building multi billion ( dollars, users, documents ) search engines on open ...
Building multi billion ( dollars, users, documents ) search engines  on open ...Building multi billion ( dollars, users, documents ) search engines  on open ...
Building multi billion ( dollars, users, documents ) search engines on open ...
 
Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer Perspective
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer Perspective
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
 
Apace Solr Web Development.pdf
Apace Solr Web Development.pdfApace Solr Web Development.pdf
Apace Solr Web Development.pdf
 
Search explained T3DD15
Search explained T3DD15Search explained T3DD15
Search explained T3DD15
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solr
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Search Intelligence @elo7.com
Search Intelligence @elo7.comSearch Intelligence @elo7.com
Search Intelligence @elo7.com
 
Solr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studySolr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance study
 
An Introduction to NLP4L
An Introduction to NLP4LAn Introduction to NLP4L
An Introduction to NLP4L
 

Más de Sease

Multi Valued Vectors Lucene
Multi Valued Vectors LuceneMulti Valued Vectors Lucene
Multi Valued Vectors LuceneSease
 
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...Sease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaSease
 
Introducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneIntroducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneSease
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Sease
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveSease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaSease
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache SolrSease
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale IndexingSease
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfSease
 
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Sease
 
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfWord2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfSease
 
How to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxHow to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxSease
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingSease
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Sease
 
Introduction to Music Information Retrieval
Introduction to Music Information RetrievalIntroduction to Music Information Retrieval
Introduction to Music Information RetrievalSease
 
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationSease
 
Explainability for Learning to Rank
Explainability for Learning to RankExplainability for Learning to Rank
Explainability for Learning to RankSease
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusSease
 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupSease
 

Más de Sease (20)

Multi Valued Vectors Lucene
Multi Valued Vectors LuceneMulti Valued Vectors Lucene
Multi Valued Vectors Lucene
 
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
 
Introducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneIntroducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache Lucene
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspective
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache Solr
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale Indexing
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdf
 
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
 
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfWord2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
 
How to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxHow to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptx
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr Interleaving
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
 
Introduction to Music Information Retrieval
Introduction to Music Information RetrievalIntroduction to Music Information Retrieval
Introduction to Music Information Retrieval
 
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
 
Explainability for Learning to Rank
Explainability for Learning to RankExplainability for Learning to Rank
Explainability for Learning to Rank
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval Meetup
 

Último

Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Último (20)

Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

How the Lucene More Like This Works

  • 1. Apache Lucene/Solr London User Group How the Lucene More Like This Works Alessandro Benedetti, Software Engineer 16th May 2019
  • 2. Apache Lucene/Solr London User GroupWho I am ▪ Search Consultant ▪ R&D Software Engineer ▪ Master in Computer Science ▪ Apache Lucene/Solr Enthusiast ▪ Semantic, NLP, Machine Learning Technologies passionate ▪ Beach Volleyball Player & Snowboarder Alessandro Benedetti
  • 3. Apache Lucene/Solr London User GroupSease Search Services ● Open Source Enthusiasts ● Apache Lucene/Solr experts ! Community Contributors ● Active Researchers ● Hot Trends : Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning
  • 4. Apache Lucene/Solr London User Group ● Document Similarity ● Apache Lucene More Like This ! Term Scorer ! BM25 ● Interesting Terms Retrieval ● Query Building ! DEMO ! Future Work ! JIRA References Agenda
  • 5. Apache Lucene/Solr London User Group Document Similarity Problem : find similar documents to a seed one Solution(s) :
 ● Collaborative approach 
 (users interactions) ● Content Based ● Hybrid Similar ? 
 ● Documents accessed in similar manners by similar people ● Terms distributions ● All of above
  • 6. Apache Lucene/Solr London User Group Real World Use Cases - Streaming Services
  • 7. Apache Lucene/Solr London User Group Real World Use Cases - Hotels
  • 8. Apache Lucene/Solr London User Group Apache Lucene Apache LuceneTM is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Apache Lucene is an open source project available for free download.
  • 9. Apache Lucene/Solr London User Group ● Search Library (java) ● Structured Documents ! Inverted Index ! Similarity Metrics ( TF-IDF, BM25) ! Fast Search ! Support for advanced queries ! Relevancy tuning Apache Lucene
  • 10. Apache Lucene/Solr London User Group Inverted Index Indexing
  • 11. Apache Lucene/Solr London User Group Input Document More Like This Params Interesting Terms
 Retriever Term Scorer Query Builder QUERY More Like This - Break Up
  • 12. Apache Lucene/Solr London User Group Responsibility : define a set of parameters (and defaults) that affect the various components of the More Like This module ● Regulate MLT behavior ● Groups parameters specific to each component ● Javadoc documentation ● Default values ! Useful container for various parameters to be passed More Like This Params
  • 13. Apache Lucene/Solr London User Group ● Field Name ● Field Stats ( Document Count) ● Term Stats ( Document Frequency) ! Term Frequency ! TF-IDF -> tf * (log ( numDocs / docFreq + 1) + 1) ! BM25 Term Scorer Responsibility : assign a score to a term that measure how distinctive is the term for the document in input
  • 14. Apache Lucene/Solr London User Group ! Origin from Probabilistic Information Retrieval ! Default Similarity from Lucene 6.0 [1] ! 25th iteration in improving TF-IDF ! TF ! IDF ! Document Length [1] LUCENE-6789 BM25 Term Scorer
  • 15. Apache Lucene/Solr London User Group BM25 Term Scorer - Inverse Document Frequency IDF Score
 has very similar behavior
  • 16. Apache Lucene/Solr London User Group BM25 Term Scorer - Term Frequency TF Score
 approaches
 asymptotically (k+1)
 
 k=1.2 in this example
  • 17. Apache Lucene/Solr London User Group BM25 Term Scorer - Document Length Document Length /
 Avg Document Length
 
 affects how fast we saturate TF score
  • 18. Apache Lucene/Solr London User Group Responsibility : retrieve from the document a queue of weighted interesting terms Params Used ! Analyzer ! Max Num Token Parsed ! Min Term Frequency ! Min/Max Document Frequency ! Max Query Terms ! Query Time Field Boost Interesting Term Retriever ! Analyze content / Term Vector ! Skip Tokens ! Score Tokens ! Build Queue of Top Scored terms
  • 19. Apache Lucene/Solr London User Group Params Used ! Term Boost Enabled More Like This Query Builder Field1 :
 Term1 Field2 :
 Term2 Field1 :
 Term3 Field1 :
 Term4 Field3 :
 Term5 3.0 4.0 4.5 4.8 7.5 Q = Field1:Term1^3.0 Field2:Term2^4.0 Field1:Term3^4.5 Field1:Term4^4.8 Field3:Term5^7.5
  • 20. Apache Lucene/Solr London User Group Term Boost ! on/off ! Affect each term weight in the MLT query ! It is the term score 
 ( it depends of the Term Scorer implementation chosen) More Like This Boost Field Boost ! field1^5.0 field2^2.0 field3^1.5 ! Affect Term Scorer ! Affect the interesting terms retrieved N.B. a highly boosted field can dominate the interesting terms retrieval
  • 21. Apache Lucene/Solr London User Group More Like This Usage - Lucene Classification ! Given a document D to classify ! K Nearest Neighbours Classifier ! Find Top K similar documents to D ( MLT) ! Classes are extracted ! Class Frequency + Class ranking -> Class probability
  • 22. Apache Lucene/Solr London User Group More Like This Usage - Apache Solr ! More Like This query parser
 ( can be concatenated with other queries) ! More Like This search component
 ( can be assigned to a Request Handler) ! More Like This handler
 ( handler with specific request parameters)
  • 23. Apache Lucene/Solr London User Group More Like This Demo - Movie Data Set This data consists of the following fields: ● id - unique identifier for the movie ● Title - Name of the movie ● Directors - The person(s) who directed the making of the film ● Genres - The genre(s) that the movie belongs to
  • 24. Apache Lucene/Solr London User Group More Like This Demo - Tuned ! Enable/Disable Term Boost ! Min Term Frequency ! Min Document Frequency ! Field Boost
  • 25. Apache Lucene/Solr London User Group Future Work ! Query Builder just use Terms and Term Score ! Term Positions ? ! Phrase Queries Boost
 (for terms close in position) ! Sentence boundaries ! Field centric vs Document centric
 ( should high boosted fields kick out
 relevant terms from low boosted fields)
  • 26. Apache Lucene/Solr London User Group Future Work - More Like These ! Multiple documents in input ! Interesting terms across documents ● Useful for Content Based recommender engines
  • 27. Apache Lucene/Solr London User Group Pros ● Apache Lucene Module ! Advanced Params ! Input : 
 - structured document
 - just text ! Build an advanced query ! Leverage the Inverted Index
 ( and additional data structures) More Like This Cons ● Massive single class ! Low cohesion ! Low readability ! Minimum test coverage ! Difficult to extend
 ( and improve)
  • 28. Apache Lucene/Solr London User Group ● LUCENE-7498 - Introducing BM25 Term Scorer ● LUCENE-7802 - Architectural Refactor ● LUCENE-8326 - MLT Params Refactor JIRA References
  • 29. Apache Lucene/Solr London User Group Questions ?
  • 30. Apache Lucene/Solr London User GroupThanks!