LexisNexis Legal Search Guild
Learning to Rank v2.0
Sujit Pal, Elsevier Labs
October 11, 2018
What it is, how it’s done, and what it can do for you
Outline
• History
• Overview
• Practical Considerations
• Case Studies
• Wrap Up
History
• 1992: Idea of LTR (or Machine Learned Ranking) first proposed
• 2003: AltaVista (later acquired by Yahoo!) using LTR in its engine
• 2005: Microsoft invents RankNet, deploys it in its search engine (later Bing)
• 2008: In contrast, Google’s engine is hand tuned, relies on up to ~200 signals
• 2009: Yandex invents and deploys MatrixNet in its engine
• 2016: Google says RankBrain is #3 signal to its search engine
• 2016: Bloomberg contributes LTR plugin to Solr
• 2017: Open Source Connections contributes LTR plugin for Elasticsearch
Overview
LTR Pipeline
Image Credit: https://towardsdatascience.com/when-to-use-a-machine-learned-vs-score-based-search-ranker-aa8762cd9aa9
• Training: build the LTR model from training data of (query, document, label) triples
• The label is the rank
• Inference: use the model to predict the label ŷ = h(x) for unseen (query, document) pairs
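The two phases can be sketched as follows. This is a minimal illustration, assuming a pointwise linear model h(x) = w · x fitted by stochastic gradient descent on squared error. The feature vectors, labels, document IDs, and the names `train` and `predict` are made up for the example and are not from the talk.

```python
# Pointwise LTR sketch: fit h(x) = w . x on (feature vector, label) examples.
# Features, labels, and hyperparameters are illustrative assumptions.

def predict(w, x):
    """Inference: score one (query, document) feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train(examples, lr=0.05, epochs=500):
    """Training: SGD on squared error over (features, label) examples."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            err = predict(w, x) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

# Each x stands for the features of one (query, document) pair; y is its label.
training_data = [
    ([1.0, 0.9, 0.1], 2.0),
    ([1.0, 0.5, 0.4], 1.0),
    ([1.0, 0.1, 0.8], 0.0),
]
w = train(training_data)

# Unseen candidates for one query, ranked by predicted label (highest first).
unseen = {"docA": [1.0, 0.2, 0.7], "docB": [1.0, 0.8, 0.2]}
ranking = sorted(unseen, key=lambda d: predict(w, unseen[d]), reverse=True)
```

Real LTR libraries replace the linear model and the squared-error objective with the algorithms discussed later in the deck; the train/predict split stays the same.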
LTR used as re-ranker
• LTR models are usually more computationally expensive than the search engine's own scoring.
• The search engine is used to pull out matched documents.
• The top N of these documents are fed into the LTR model, and the top n of the results are replaced with the output of the model, for N >> n (typically 50-100x).
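The re-ranking step above can be sketched as a small function. It assumes the engine returns results already sorted by its own score, and that `model_score` is some trained LTR scoring function; both the function names and the `rating` feature are hypothetical.

```python
def rerank(engine_results, model_score, big_n, small_n):
    """Re-score the engine's top big_n results and return the best small_n.

    engine_results: list of (doc_id, features), sorted by engine score.
    model_score: callable mapping features to an LTR score (hypothetical).
    """
    candidates = engine_results[:big_n]  # only the top N reach the model
    reranked = sorted(candidates, key=lambda r: model_score(r[1]), reverse=True)
    return [doc_id for doc_id, _ in reranked[:small_n]]

# Toy example: the "model" simply reads a made-up rating feature.
results = [
    ("d1", {"rating": 3}),
    ("d2", {"rating": 9}),
    ("d3", {"rating": 5}),
    ("d4", {"rating": 7}),
    ("d5", {"rating": 10}),
]
top = rerank(results, lambda f: f["rating"], big_n=4, small_n=2)
```

Note that `d5` never reaches the model because it falls outside the top N, which is the price paid for keeping the expensive model off the full match set.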
[Figure: Query and Index → Matched (10k) → Scored (10k) → Top 1000 retrieved → Ranking Model → Re-ranked Top 10]
Image Credit: https://lucidworks.com/2016/08/17/learning-to-rank-solr/
Types of LTR Algorithms
• Pointwise – documents ranked by the relevance of each (query, document) pair on its own
• Pairwise – documents ranked by considering the relative preference between two (query, document) pairs
• Listwise – documents ranked by considering the entire relevance ordering of all (query, document) pairs for a query
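One way to see the difference between the three families is to look at what the loss function receives. The sketch below uses one representative loss per family, chosen for illustration: squared error (pointwise), a RankNet-style logistic loss over preference pairs (pairwise), and a ListNet-style cross-entropy between top-one distributions (listwise). Other choices exist within each family.

```python
import math

def pointwise_loss(scores, labels):
    """Each document is compared only with its own label (mean squared error)."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)

def pairwise_loss(scores, labels):
    """Logistic loss over every pair (i, j) where label_i > label_j."""
    losses = [
        math.log(1.0 + math.exp(-(scores[i] - scores[j])))
        for i in range(len(scores))
        for j in range(len(scores))
        if labels[i] > labels[j]
    ]
    return sum(losses) / len(losses) if losses else 0.0

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def listwise_loss(scores, labels):
    """Cross-entropy between softmax(labels) and softmax(scores) for one query."""
    p_true, p_pred = softmax(labels), softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))

# One query with three documents; labels are relevance grades.
labels = [2, 1, 0]
good_scores = [3.0, 2.0, 1.0]  # same order as the labels
bad_scores = [1.0, 2.0, 3.0]   # reversed order
```

All three losses are lower for `good_scores` than for `bad_scores`, but only the pairwise and listwise losses depend on how documents relate to each other within the query.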
Image Credit: https://www.hiit.fi/wp-content/uploads/2018/04/Learning-to-Rank-S11.pdf
Popular LTR Libraries
• RankLib (Java) – from Lemur Project (UMass, CMU), provides Coordinate Ascent, Random Forest (pointwise), MART, RankNet, RankBoost (pairwise), LambdaMART (pair/listwise), AdaRank and ListNet (listwise)
• SVMRank (C++) – from Cornell, provides SVMRank (pairwise)
• XGBoost (Python/C++) – LambdaRank (pairwise)
• PyLTR (Python) – LambdaMART (pairwise)
• Michael Alcorn (Python) – RankNet and LambdaMART (pairwise)
Common Algorithms
• Linear Model
  • Predicted rank is a linear combination of input features
• RankNet
  • Neural network based
  • Good for binary (relevant/irrelevant) labels
  • Weight matrix transforms input features into rank probabilities
• LambdaMART
  • Tree (forest) based
  • Good for multi-class labels
  • Feature splits with thresholds
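To make the RankNet idea concrete, here is a deliberately simplified sketch. It keeps RankNet's pairwise probability P(i above j) = sigmoid(s_i − s_j) and its cross-entropy loss, but swaps the neural network for a linear scorer so the gradient fits in one line. The training pairs and the function names are illustrative assumptions.

```python
import math

def score(w, x):
    """Linear scorer standing in for RankNet's neural network."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_ranknet_linear(pairs, n_features, lr=0.1, epochs=200):
    """pairs: list of (x_preferred, x_other) feature vectors for the same query."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x_i, x_j in pairs:
            diff = score(w, x_i) - score(w, x_j)
            p_ij = 1.0 / (1.0 + math.exp(-diff))  # P(doc i ranked above doc j)
            grad_scale = p_ij - 1.0               # derivative of -log(p_ij) w.r.t. diff
            w = [wk - lr * grad_scale * (a - b) for wk, a, b in zip(w, x_i, x_j)]
    return w

# Made-up preference pairs: the first vector of each pair should rank higher.
pairs = [
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.7, 0.3], [0.4, 0.6]),
]
w = train_ranknet_linear(pairs, n_features=2)
```

LambdaMART keeps the same pairwise gradients (scaled by the change in a ranking metric) but fits them with boosted regression trees rather than a weight vector.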
Image Credits: https://slideplayer.com/slide/9347431/, https://dimensionless.in/introduction-to-random-forest/
Practical Considerations
Reasons to consider LTR
• Too many parameters to tune manually without overfitting to a particular query set.
• Enough good training data (implicit or explicit) is available to train an LTR model.
Acquiring labels
• Implicit
  • Intrinsic features (words, phrases)
  • Document metadata
  • User clicks
  • Time spent on document
  • Purchases (if applicable)
  • Cheap to build but noisy
• Explicit
  • Human expert rates relevance of each document against the query
  • Cleaner but expensive to build
Image Credit: https://www.datasciencecentral.com/profiles/blogs/handling-imbalanced-data-sets-in-supervised-learning-using-family
Traditional Ranking Models
• Vector Space Models
  • Boolean – predicts if document relevant to query or not
  • TF-IDF – rank documents by cosine similarity between document and query
• Probabilistic Models
  • BM25 – rank documents by log odds of relevance to query
  • LMIR – probability of document’s LM generating terms in query
• Importance based Models
  • HITS – rank documents by hubness/authority (inlinks/outlinks)
  • PageRank – rank documents by probability of random surfer arriving on page
  • Impact Factor – rank documents by number of citations
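As an example of one of these traditional models, the following is a minimal sketch of Okapi BM25 over pre-tokenized documents. The parameter values k1 = 1.2 and b = 0.75 are commonly used defaults, and the IDF uses the non-negative variant log(1 + (N − n + 0.5) / (n + 0.5)); the toy corpus is made up for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            s += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(s)
    return scores

docs = [
    ["learning", "to", "rank"],
    ["rank", "rank", "models"],
    ["search", "engine"],
]
scores = bm25_scores(["rank"], docs)
```

In an LTR setting a score like this is typically not the final ranking but one input feature among several.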
Feature Selection
• Document Features
  • Document length
  • URL length
  • Publication date
  • Number of outlinks
  • PageRank
• Query Features
  • Number of words
  • Person (PER) or organization (ORG) entity in query
• Query-Document Features
  • TF-IDF, BM25 similarity
  • Frequency of query in anchor text
  • Document contains query words in title
• User Dependent Features
  • Star ratings
  • Age, gender
  • Device
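A feature extractor turns each (query, document) pair into a fixed-length numeric vector drawn from groups like those above. The sketch below covers a handful of them; the document keys (`title`, `body`, `url`), the whitespace tokenization, and the choice of features are assumptions made for illustration.

```python
def extract_features(query, doc):
    """Build one feature vector for a (query, document) pair.

    doc is a dict with hypothetical keys "title", "body", and "url".
    """
    q_terms = query.lower().split()
    title_terms = set(doc["title"].lower().split())
    body_terms = doc["body"].lower().split()
    return [
        float(len(body_terms)),                                  # document: length in words
        float(len(doc["url"])),                                  # document: URL length
        float(len(q_terms)),                                     # query: number of words
        float(sum(body_terms.count(t) for t in q_terms)),        # query-document: term hits in body
        1.0 if all(t in title_terms for t in q_terms) else 0.0,  # query-document: query words in title
    ]

doc = {
    "title": "Learning to Rank",
    "body": "rank documents with a learned rank model",
    "url": "http://example.com/ltr",
}
features = extract_features("rank", doc)
```

The order of entries must stay fixed between training and inference, since the model only sees positions, not names.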
Evaluation Metrics
• Mean Average Precision (MAP@k)
• Mean Reciprocal Rank (MRR)
• Normalized Discounted Cumulative Gain (NDCG@k)
• Rank Correlation
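The first three metrics can be sketched in a few lines each. Every function takes the relevance labels of one query's results in the order the system returned them. Two conventions here are assumptions, since several variants are in use: DCG uses the gain 2^label − 1, and average precision is normalized by the number of relevant documents found in the top k.

```python
import math

def dcg_at_k(labels, k):
    """Discounted cumulative gain with gain 2^label - 1 and log2 discount."""
    return sum((2 ** y - 1) / math.log2(i + 2) for i, y in enumerate(labels[:k]))

def ndcg_at_k(labels, k):
    """DCG divided by the DCG of the ideal (label-sorted) ordering."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

def reciprocal_rank(labels):
    """1 / position of the first relevant (label > 0) result, or 0 if none."""
    for i, y in enumerate(labels):
        if y > 0:
            return 1.0 / (i + 1)
    return 0.0

def average_precision_at_k(labels, k):
    """Mean of precision@i over the positions i <= k that hold a relevant result."""
    hits, total = 0, 0.0
    for i, y in enumerate(labels[:k]):
        if y > 0:
            hits += 1
            total += hits / (i + 1)
    return total / hits if hits else 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# One ranked label list per query; MRR and MAP average over queries.
runs = [[0, 1, 2], [3, 0, 1]]
mrr = mean(reciprocal_rank(r) for r in runs)
```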
LETOR Data Format
2 qid:1 1:3 2:3 3:0 4:0 5:3 6:1 7:1 8:0 9:0 10:1 11:156... # 11
2 qid:1 1:3 2:0 3:3 4:0 5:3 6:1 7:0 8:1 9:0 10:1 11:406... # 23
0 qid:1 1:3 2:0 3:2 4:0 5:3 6:1 7:0 8:0.666667 9:0 10:1 ... # 44
2 qid:1 1:3 2:0 3:3 4:0 5:3 6:1 7:0 8:1 9:0 10:1 11:287 ... # 57
1 qid:1 1:3 2:0 3:3 4:0 5:3 6:1 7:0 8:1 9:0 10:1 11:2009 ... # 89
Fields in each line, left to right:
• Label
• Query ID
• Features: query, document, query/document, other (sparse or dense format)
• Comments (e.g., docID)
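Reading this format back is straightforward. The parser below is a minimal sketch that handles the four fields described above; the sample line is a shortened, illustrative line in the same shape as the slide's examples, not one of them.

```python
def parse_letor_line(line):
    """Parse one LETOR line into (label, qid, features, comment)."""
    body, _, comment = line.partition("#")  # everything after '#' is a comment
    tokens = body.split()
    label = float(tokens[0])
    qid = tokens[1].split(":", 1)[1]        # "qid:1" -> "1"
    features = {}
    for tok in tokens[2:]:
        idx, value = tok.split(":", 1)      # "8:0.666667" -> (8, 0.666667)
        features[int(idx)] = float(value)
    return label, qid, features, comment.strip()

# Illustrative line with only three features.
label, qid, features, comment = parse_letor_line("2 qid:1 1:3 2:0 3:3 # 57")
```

Storing features in a dict keyed by feature number handles both dense lines and sparse lines that skip zero-valued features.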
Case Studies
Dataset
• We use The Movie Database (TMDB) dataset from Kaggle.
• 45k movies, 20 genres, 31k unique keywords
• We extract the following fields: docID, title, description, popularity, release date, running time, rating (0-10), keywords, genres
• Categorical labels 1-5 are created from the rating
• The objective is to build an LTR model that learns the ordering implied by the rating, and to re-rank the top 10 results using this model
• Features chosen: query-title and query-description similarity using TF-IDF and BM25, document recency, original score, and a boolean 0/1 for each genre
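Two of these steps are easy to show in code. The slides do not say how the 0-10 rating is mapped to labels 1-5, so the equal-width binning below is an assumption, as are the function names and the short genre list.

```python
def rating_to_label(rating):
    """Map a 0-10 rating to a 1-5 label using equal-width bins (an assumption)."""
    if not 0.0 <= rating <= 10.0:
        raise ValueError("rating must be in [0, 10]")
    # [0, 2) -> 1, [2, 4) -> 2, ..., [8, 10] -> 5
    return min(int(rating // 2) + 1, 5)

def genre_flags(doc_genres, all_genres):
    """One boolean 0/1 feature per known genre, in a fixed order."""
    return [1 if g in doc_genres else 0 for g in all_genres]

# Hypothetical subset of the genre vocabulary.
ALL_GENRES = ["Action", "Comedy", "Drama"]
label = rating_to_label(7.3)
flags = genre_flags({"Drama", "Action"}, ALL_GENRES)
```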
General Flow
• Set up plugin if applicable
• Load data
• Define LTR features to index (if applicable)
• Generate LTR features for query set in LETOR format
• Train LTR model using a third-party library (RankLib, etc.).
• Upload trained model definition to index (if applicable)
• Run re-rank query using trained LTR model
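The "generate LTR features in LETOR format" step amounts to serializing one line per (query, document) example. The sketch below writes dense lines with 1-based feature numbers and keeps each query's lines together, which training tools commonly expect. The function names and sample values are illustrative.

```python
def to_letor_line(label, qid, features, doc_id):
    """Format one training example as a dense LETOR line."""
    feats = " ".join(f"{i}:{v:g}" for i, v in enumerate(features, start=1))
    return f"{label} qid:{qid} {feats} # {doc_id}"

def to_letor(examples):
    """examples: iterable of (label, qid, features, doc_id) tuples.

    Lines are sorted by qid so that all documents of a query are contiguous.
    """
    ordered = sorted(examples, key=lambda e: e[1])
    return "\n".join(to_letor_line(*e) for e in ordered)

examples = [
    (2, 2, [1.5, 0], 23),
    (0, 1, [3, 0.666667], 44),
    (1, 2, [0.25, 1], 57),
]
letor_text = to_letor(examples)
```

The resulting text can be saved to a file and passed to the training library in the next step of the flow.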
Examples
• Solr Case Study
• Elasticsearch Case Study
• DIY LTR Case Study
• Project: https://github.com/sujitpal/ltr-examples
Wrap Up
Resources
• Book – Learning to Rank for Information Retrieval, by Tie-Yan Liu.
• Paper – From RankNet to LambdaRank to LambdaMART: An Overview, by Christopher J. C. Burges
• Tutorials
  • Solr – https://github.com/airalcorn2/Solr-LTR
  • Elasticsearch – Learning to Rank 101 by Pere Urbon-Bayes, ES-LTR Demo by Doug Turnbull.
• Product Centric LTR Documentation
  • Solr Learning To Rank Docs
  • Elasticsearch Learning to Rank Docs
Thank you!
• Contact: sujit.pal@elsevier.com
• Twitter: @palsujit
Learning to Rank Presentation (v2) at LexisNexis Search Guild

  • 1. LexisNexisLegalSearchGuildLexisNexisLegalSearchGuild Learning to Rank v2.0 Sujit Pal, Elsevier Labs October 11, 2018 What it is, how it’s done, and what it can do for you
  • 2. Outline
      • History
      • Overview
      • Practical Considerations
      • Case Studies
      • Wrap Up
  • 3. History
      • 1992: Idea of LTR (or Machine Learned Ranking) first proposed
      • 2003: Altavista (later acquired by Yahoo!) is using LTR in its engine
      • 2005: Microsoft invents RankNet and deploys it in its search engine (later Bing)
      • 2008: In contrast, Google's engine is hand tuned and relies on up to ~200 signals
      • 2009: Yandex invents MatrixNet and deploys it in its engine
      • 2016: Google says RankBrain is the #3 signal in its search engine
      • 2016: Bloomberg contributes an LTR plugin to Solr
      • 2017: Open Source Connections contributes an LTR plugin for Elasticsearch
  • 4. Overview
  • 5. LTR Pipeline
      • Training: build the LTR model from training data given as (query, document, label) triples
          • The label is the rank
      • Inference: use the model to predict the label ŷ = h(x) for unseen (query, document) pairs
      • Image Credit: https://towardsdatascience.com/when-to-use-a-machine-learned-vs-score-based-search-ranker-aa8762cd9aa9
  • 6. LTR used as re-ranker
      • LTR models are usually more computationally expensive than search engines.
      • The search engine is used to pull out the matched documents.
      • The top N of these documents are fed into the LTR model, and the top n results are replaced with the output of the model, for N >> n (N is typically 50-100x n).
      • [Figure: Index + Query → Matched (10k) → Scored (10k) → Top 1000 retrieved → Ranking Model → Re-ranked Top 10]
      • Image Credit: https://lucidworks.com/2016/08/17/learning-to-rank-solr/
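The re-ranking step on slide 6 can be sketched roughly as follows. This is a minimal illustration, not the Solr or Elasticsearch implementation: `rerank`, `big_n`, `small_n` and `model_score` are hypothetical names, and `model_score` stands in for whatever trained LTR model h(x) is available.

```python
from typing import Callable, List, Tuple


def rerank(
    results: List[Tuple[str, float]],
    model_score: Callable[[str], float],
    big_n: int = 1000,
    small_n: int = 10,
) -> List[str]:
    """Re-score the engine's top big_n hits and return the best small_n.

    results: (doc_id, engine_score) pairs, already sorted by the search engine.
    model_score: hypothetical stand-in for the trained LTR model.
    """
    # Only the first big_n hits are ever shown to the (expensive) model.
    candidates = [doc_id for doc_id, _ in results[:big_n]]
    # sorted() is stable, so ties keep the engine's original order.
    reranked = sorted(candidates, key=model_score, reverse=True)
    return reranked[:small_n]


# Toy usage: the model scores are made up for illustration.
hits = [("a", 3.0), ("b", 2.0), ("c", 1.0), ("d", 0.5)]
toy_model = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 1.0}
top = rerank(hits, toy_model.__getitem__, big_n=3, small_n=2)
```

Document "d" never reaches the model here because it falls outside the top `big_n`, which is the cost/quality trade-off the slide describes.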
  • 7. Types of LTR Algorithms
      • Pointwise – documents ranked by the relevance of each (query, document) pair
      • Pairwise – documents ranked by considering the priority between pairs of (query, document) pairs
      • Listwise – documents ranked by considering the entire relevance ordering of all (query, documents) tuples per query
      • Image Credit: https://www.hiit.fi/wp-content/uploads/2018/04/Learning-to-Rank-S11.pdf
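One way to see the difference between the three families is to compare the loss each one might minimize for a single query. The functions below are illustrative choices, not taken from the slides: squared error for pointwise, a RankNet-style logistic loss for pairwise, and a ListNet-style (top-one) cross entropy for listwise.

```python
import math
from typing import List, Sequence


def pointwise_loss(scores: Sequence[float], labels: Sequence[float]) -> float:
    """Mean squared error: each document is judged on its own."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def pairwise_loss(scores: Sequence[float], labels: Sequence[float]) -> float:
    """Logistic loss over every pair where document i should outrank j."""
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                total += math.log(1.0 + math.exp(-(scores[i] - scores[j])))
                pairs += 1
    return total / pairs if pairs else 0.0


def _softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def listwise_loss(scores: Sequence[float], labels: Sequence[float]) -> float:
    """Cross entropy between softmax(labels) and softmax(scores)."""
    target = _softmax(labels)
    predicted = _softmax(scores)
    return -sum(p * math.log(q) for p, q in zip(target, predicted))
```

All three return a lower value when the scores agree with the labels; they differ in whether the unit of comparison is one document, a pair of documents, or the whole result list.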
  • 8. Popular LTR Libraries
      • RankLib (Java) – from the Lemur Project (UMass, CMU); provides Coordinate Ascent, Random Forest (pointwise), MART, RankNet, RankBoost (pairwise), LambdaMART (pair/listwise), AdaRank and ListNet (listwise)
      • SVMRank (C++) – from Cornell; provides SVMRank (pairwise)
      • XGBoost (Python/C++) – LambdaRank (pairwise)
      • PyLTR (Python) – LambdaMART (pairwise)
      • Michael Alcorn (Python) – RankNet and LambdaMART (pairwise)
  • 9. Common Algorithms
      • Linear Model
          • Predicted rank is a linear combination of the input features
      • RankNet
          • Neural network based
          • Good for binary (relevant/irrelevant) labels
          • Weight matrix transforms input features into rank probabilities
      • LambdaMART
          • Tree (forest) based
          • Good for multi-class labels
          • Feature splits with thresholds
      • Image Credits: https://slideplayer.com/slide/9347431/, https://dimensionless.in/introduction-to-random-forest/
  • 10. Practical Considerations
  • 11. Reasons to consider LTR
      • There are too many parameters to tune manually without overfitting to a particular query set.
      • Enough good (implicit or explicit) training data is available to train an LTR model.
  • 12. Acquiring labels
      • Implicit
          • Intrinsic features (words, phrases)
          • Document metadata
          • User clicks
          • Time spent on document
          • Purchases (if applicable)
          • Cheap to build but noisy
      • Explicit
          • Human expert rates the relevancy of each document against the query
          • Cleaner but expensive to build
      • Image Credit: https://www.datasciencecentral.com/profiles/blogs/handling-imbalanced-data-sets-in-supervised-learning-using-family
  • 13. Traditional Ranking Models
      • Vector Space Models
          • Boolean – predicts whether a document is relevant to the query or not
          • TF-IDF – ranks documents by cosine similarity between document and query
      • Probabilistic Models
          • BM25 – ranks documents by log odds of relevance to the query
          • LMIR – probability of the document's language model (LM) generating the terms in the query
      • Importance based Models
          • HITS – ranks documents by hubness/authority (inlinks/outlinks)
          • PageRank – ranks documents by the probability of a random surfer arriving on the page
          • Impact Factor – ranks documents by number of citations
  • 14. Feature Selection
      • Document Features
          • Document length
          • URL length
          • Publication date
          • Number of outlinks
          • PageRank
      • Query Features
          • Number of words
          • PER or ORG in query
      • Query-Document Features
          • TF-IDF, BM25 similarity
          • Frequency of query in anchor text
          • Document contains query words in title
      • User Dependent Features
          • Star ratings
          • Age, gender
          • Device
  • 15. Evaluation Metrics
      • Mean Average Precision (MAP @k)
      • Mean Reciprocal Rank (MRR)
      • Normalized Discounted Cumulative Gain (NDCG @k)
      • Rank Correlation
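The slide only names the metrics, so the following is a sketch of three of them under stated assumptions rather than the definitions used in the talk. Each function takes the relevance judgments of one query's results in ranked order. Average precision here is normalized by the number of relevant hits found in the top k (other conventions exist), DCG uses the common 2^gain - 1 form, and the ideal ordering for NDCG is derived from the same list of judged results.

```python
import math
from typing import Sequence


def reciprocal_rank(relevance: Sequence[float]) -> float:
    """1 / position of the first relevant result, or 0 if there is none."""
    for pos, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1.0 / pos
    return 0.0


def average_precision_at_k(relevance: Sequence[float], k: int) -> float:
    """Mean of precision@i over the relevant positions i within the top k."""
    hits, total = 0, 0.0
    for pos, rel in enumerate(relevance[:k], start=1):
        if rel > 0:
            hits += 1
            total += hits / pos
    return total / hits if hits else 0.0


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with graded labels."""
    return sum(
        (2 ** g - 1) / math.log2(pos + 1)
        for pos, g in enumerate(gains[:k], start=1)
    )


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

MAP and MRR are then the means of `average_precision_at_k` and `reciprocal_rank` over all queries in the evaluation set.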
  • 16. LETOR Data Format
        2 qid:1 1:3 2:3 3:0 4:0 5:3 6:1 7:1 8:0 9:0 10:1 11:156... # 11
        2 qid:1 1:3 2:0 3:3 4:0 5:3 6:1 7:0 8:1 9:0 10:1 11:406... # 23
        0 qid:1 1:3 2:0 3:2 4:0 5:3 6:1 7:0 8:0.666667 9:0 10:1 ... # 44
        2 qid:1 1:3 2:0 3:3 4:0 5:3 6:1 7:0 8:1 9:0 10:1 11:287 ... # 57
        1 qid:1 1:3 2:0 3:3 4:0 5:3 6:1 7:0 8:1 9:0 10:1 11:2009 ... # 89
      • Fields, left to right: label; query ID; features (query, document, query/document, other; sparse or dense format); comments (e.g. docID)
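A row in this format can be split into its four parts with a few lines of Python. This is a minimal sketch: `LetorRow` and `parse_letor_line` are hypothetical names, and the sample row is a shortened, made-up line in the same shape as the slide's rows (the slide's own rows are elided with "..." and would not parse as-is).

```python
from typing import Dict, NamedTuple


class LetorRow(NamedTuple):
    label: float
    qid: str
    features: Dict[int, float]
    comment: str


def parse_letor_line(line: str) -> LetorRow:
    """Parse '<label> qid:<id> <idx>:<val> ... # <comment>'."""
    # Everything after the first '#' is a free-text comment (e.g. a docID).
    body, _, comment = line.partition("#")
    tokens = body.split()
    label = float(tokens[0])
    qid = tokens[1].split(":", 1)[1]
    features: Dict[int, float] = {}
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return LetorRow(label, qid, features, comment.strip())


# Made-up sample row for illustration.
row = parse_letor_line("2 qid:1 1:3 2:3 3:0 8:0.666667 # 11")
```

Storing features in a dict keyed by feature index handles both the sparse and the dense variants, since missing indices are simply absent.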
  • 17. Case Studies
  • 18. Dataset
      • We use The Movie Database (TMDB) from Kaggle.
      • 45k movies, 20 genres, 31k unique keywords
      • We extract the following fields: docID, title, description, popularity, release date, running time, rating (0-10), keywords, genres
      • Categorical labels 1-5 are created from the rating
      • The objective is to build an LTR model that learns the ordering implied by the rating, and to re-rank the top 10 results using this model
      • Features chosen: query-title and query-description similarity using TF-IDF and BM25, document recency, original score, and a boolean 0/1 for each genre
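The slide does not say how the 0-10 rating was turned into labels 1-5. One plausible scheme, shown here purely as an assumption, is equal-width binning; `rating_to_label` is a hypothetical helper, not code from the linked project.

```python
def rating_to_label(
    rating: float, num_labels: int = 5, max_rating: float = 10.0
) -> int:
    """Map a rating in [0, max_rating] to a label in 1..num_labels.

    Assumes equal-width bins, e.g. [0, 2) -> 1, [2, 4) -> 2, ..., [8, 10] -> 5.
    """
    if not 0.0 <= rating <= max_rating:
        raise ValueError(f"rating {rating} outside [0, {max_rating}]")
    width = max_rating / num_labels
    # The top edge (rating == max_rating) is folded into the last bin.
    return min(num_labels, int(rating // width) + 1)
```

With real rating data, quantile-based bins may be preferable if the ratings cluster in a narrow range, since equal-width bins would then leave some labels almost empty.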
  • 19. General Flow
      • Set up the plugin (if applicable)
      • Load data
      • Define LTR features to index (if applicable)
      • Generate LTR features for the query set in LETOR format
      • Train the LTR model using a third-party library (RankLib, etc.)
      • Upload the trained model definition to the index (if applicable)
      • Run a re-rank query using the trained LTR model
  • 20. Examples
      • Solr Case Study
      • Elasticsearch Case Study
      • DIY LTR Case Study
      • Project: https://github.com/sujitpal/ltr-examples
  • 21. Wrap Up
  • 22. Resources
      • Book – Learning to Rank for Information Retrieval, by Tie-Yan Liu
      • Paper – From RankNet to LambdaRank to LambdaMART: An Overview, by Christopher J. C. Burges
      • Tutorials
          • Solr – https://github.com/airalcorn2/Solr-LTR
          • Elasticsearch – Learning to Rank 101 by Pere Urbon-Bayes, ES-LTR Demo by Doug Turnbull
      • Product Centric LTR Documentation
          • Solr Learning To Rank Docs
          • Elasticsearch Learning to Rank Docs
  • 23. Thank you!
      • Contact: sujit.pal@elsevier.com
      • Twitter: @palsujit

Editor's notes

  1. Most of the key work was done between 2008 and 2011, with competitions sponsored by Microsoft, Yahoo and Yandex. Bloomberg LTR meetup: Michael Nillson, Erick Erickson. OSC LTR: at Haystack earlier this year.
  2. In all cases you need a judgment list (i.e. relevant vs. irrelevant). For MRR you need the first good result, so a notion of position; for DCG you need graded results; and for NDCG and Rank Correlation you also need the ideal ordering.