Searching for The Matrix in haystack
        (with Elasticsearch)
         Synopsi.TV case study



           Tomáš Sirný
           @junckritter

 Pyvo/Rubyslava November 2012
The Environment
●   Recommendation service for movies, TV shows
●   People mark titles they have watched (check-in)
    and rate them
●   Get recommendations
●   Make „Watch Later“ or other-purpose lists
●   …
●   Search (to check-in, add to list, share, etc.)
The Problem
●   Input box for search on top of web page
●   Many movies, TV shows in database
●   Many of them have similar titles or use similar
    words
●   Some are more likely to be searched for
●   Little input to go on – often just 3 or 4 letters
●   Autocomplete, not only exact match
The Red Pill
The Blue Pill
The Tool
●   Elasticsearch – designed for searching in
    documents
●   Based on Lucene – de facto standard
●   Young yet feature-rich
●   Rapid development (despite 1 core developer)
●   A company was recently founded around it
●   $10M in Series A funding
The (Wannabe) Solution
●   Differentiate titles
●   Have cover, plot, cast, directors
●   Year
●   Popularity (whatever it means)
●   Prefer ones with more data, more popular
The Text – First Attempt

●   Text Query (now Match Query)
●   phrase_prefix type – all words in input with
    matching of prefixes („m“, „ma“, „mat“, …), same
    order of words
●   operator: and
●   not_analyzed „name“ field (not broken down into
    words)
The Text – First Attempt

●   slop parameter – allows words to change order or
    be skipped
                 „matrix revolutions“

                 „revolutions matrix“

              „matrix first revolutions“
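A minimal sketch of the request body these two slides describe, built as a plain Python dict. It assumes the pre-1.0 `match` query syntax with `type: phrase_prefix` (later Elasticsearch versions moved this into a separate `match_phrase_prefix` query), and the helper name `title_prefix_query` and the default slop value are illustrative, not taken from the talk.

```python
import json


def title_prefix_query(user_input, slop=2):
    """Build a phrase-prefix match query on the "name" field.

    Mirrors the settings listed on the slides: phrase_prefix type,
    operator "and", and a slop value that tolerates reordered or
    skipped words. The slop default of 2 is an assumption.
    """
    return {
        "query": {
            "match": {
                "name": {
                    "query": user_input,
                    "type": "phrase_prefix",
                    "operator": "and",
                    "slop": slop,
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the body that would be sent to the search endpoint.
    print(json.dumps(title_prefix_query("matrix rev"), indent=2))
```

Raising `slop` makes the phrase match looser, which is what lets „revolutions matrix“ and „matrix first revolutions“ still match the input „matrix revolutions“.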
The Sorting – First Attempt
●   Default scoring considers only occurrences of the
    text in documents
●   We also want other properties of document to
    count
●   Custom Score Query
●   Define script for scoring

        "script": "_score * doc['rating'].value"
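For orientation, here is roughly how such a script wraps an ordinary text query. This is a sketch assuming the `custom_score` query of the Elasticsearch versions current at the time of the talk (it was later replaced by `function_score`); the helper name `rated_query` is made up for this example.

```python
def rated_query(text_query):
    """Wrap a text query so its score is multiplied by the "rating" field.

    `text_query` is any query dict, for example a match query on "name".
    The custom_score syntax shown here is the old pre-1.0 form.
    """
    return {
        "query": {
            "custom_score": {
                "query": text_query,
                "script": "_score * doc['rating'].value",
            }
        }
    }


if __name__ == "__main__":
    print(rated_query({"match": {"name": "matrix"}}))
```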
The Rating
●   Lets us prefer more „popular“ titles
●   External – top lists, links, etc.
●   Internal – usage data from system
●   Problem for newly added titles – lack of data of
    both types
The Tuning of Rating
●   Get rid of external data
●   Only score „completeness“ of each document
●   Release year


        "script": "3 * log(_score) +
                   1 * log(doc['year'].date.year - 1880) +
                   0.75 * log(doc['watched_count'].value + 1)"
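To see how the three terms trade off, the same formula can be evaluated outside Elasticsearch. The sketch below assumes `log` is the natural logarithm; the function name `tuned_score` and the sample numbers are illustrative only.

```python
import math


def tuned_score(text_score, year, watched_count):
    """Evaluate the scoring formula from the slide in plain Python.

    text_score    -- the text relevance score (_score), must be > 0
    year          -- release year, must be later than 1880
    watched_count -- number of check-ins; the +1 keeps log() defined at 0
    """
    return (
        3 * math.log(text_score)
        + 1 * math.log(year - 1880)
        + 0.75 * math.log(watched_count + 1)
    )


if __name__ == "__main__":
    # Same text match, same year: the more-watched title ranks higher.
    print(tuned_score(2.0, 1999, 1000))
    print(tuned_score(2.0, 1999, 0))
```

Because every term goes through a logarithm, a title needs many more check-ins to gain the same lift each time, so a heavily watched title cannot completely drown out a better text match.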
The Tuning of Query
●    Name field analyzed, edgeNGram filter

index:
  analysis:
    filter:
      my_ngram:
        type: edgeNGram
        min_gram: 1
        max_gram: 11
        side: front
    analyzer:
      my_analyzer:
        type: custom
        tokenizer: standard
        filter: [lowercase, asciifolding, my_ngram]
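The effect of this analyzer can be approximated in a few lines of Python. This is only a rough imitation: the word split stands in for the standard tokenizer and Unicode decomposition stands in for the asciifolding filter, and the function names are invented for this sketch.

```python
import re
import unicodedata


def fold_ascii(text):
    """Strip diacritics, roughly what the asciifolding filter does."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_ngrams(token, min_gram=1, max_gram=11):
    """Return prefixes of the token, from min_gram up to max_gram characters."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(text):
    """Lowercase, split into words, fold to ASCII, emit front edge n-grams."""
    grams = []
    for word in re.findall(r"\w+", text.lower()):
        grams.extend(edge_ngrams(fold_ascii(word)))
    return grams


if __name__ == "__main__":
    print(analyze("The Matrix"))
```

Indexing every prefix up front means that a query for „mat“ becomes a plain term lookup, which is why the approach works from the very first typed letter.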
The AKA's

●   Also known as – names of a title in different
    countries
●   A lot of additional data, sometimes only „noise“
●   „original“ is still most important
The AKA's
●   Array of AKAs – problems with scoring of short
    names
●   Nested AKA documents - query does not return
    nested document which matched

●   AKA document is a child of the title – it has its
    own information (original, country, slug)
●   Top Children Query – which AKA matched
●   Another query with Ids Filter – get titles
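The two-step lookup can be pictured with plain in-memory data. The sketch below does not use Elasticsearch at all; the sample records, field names (`title_id`, `original`, `country`) and function names are made up to illustrate the parent/child idea only.

```python
# Made-up sample data: each AKA points at its parent title.
AKAS = [
    {"title_id": 1, "name": "The Matrix", "original": True, "country": "US"},
    {"title_id": 1, "name": "Matrix", "original": False, "country": "SK"},
    {"title_id": 2, "name": "The Matrix Reloaded", "original": True, "country": "US"},
]
TITLES = {
    1: {"name": "The Matrix", "year": 1999},
    2: {"name": "The Matrix Reloaded", "year": 2003},
}


def matching_akas(prefix):
    """Step 1: find AKA documents where some word starts with the prefix."""
    needle = prefix.lower()
    return [
        aka for aka in AKAS
        if any(word.startswith(needle) for word in aka["name"].lower().split())
    ]


def titles_for(akas):
    """Step 2: fetch the parent titles by ID, keeping first-seen order."""
    seen = []
    for aka in akas:
        if aka["title_id"] not in seen:
            seen.append(aka["title_id"])
    return [TITLES[title_id] for title_id in seen]


if __name__ == "__main__":
    hits = matching_akas("matr")
    print(hits)              # which AKA matched
    print(titles_for(hits))  # the titles to display
```

Keeping the matched AKA from step 1 is what allows the result list to show the name the user actually typed, while step 2 supplies the title data.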
The Sorting – Second Attempt
●   Custom Filter Score Query – apply set of filters,
    each filter boosts documents which pass its
    condition
●   boost parameter of filter – differentiate
    importance of that filter
●   score_mode – sum, product of boost values
The Sorting – Used Score Filters
●   Release date (in case of TV show last episode)
    in last 6 months
●   Release date in next 3 months
●   „original“ AKA
●   Have all important categories filled
●   Not Short genre
●   Not TV movie
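A rough simulation of how filter-based boosting combines, using three of the conditions listed above. The boost values, document fields and function name are assumptions for illustration; the talk does not give the real numbers. The sketch also assumes that a document passing no filter keeps its original score.

```python
import math

# (predicate, boost) pairs; the boost values are hypothetical.
SCORE_FILTERS = [
    (lambda doc: doc["original_aka"], 2.0),
    (lambda doc: "Short" not in doc["genres"], 1.5),
    (lambda doc: not doc["tv_movie"], 1.25),
]


def boosted_score(base_score, doc, filters=SCORE_FILTERS, score_mode="sum"):
    """Multiply the text score by the combined boosts of all passing filters."""
    boosts = [boost for predicate, boost in filters if predicate(doc)]
    if not boosts:
        return base_score  # no filter matched, score left unchanged
    if score_mode == "sum":
        factor = sum(boosts)
    elif score_mode == "product":
        factor = math.prod(boosts)
    else:
        raise ValueError("unknown score_mode: " + score_mode)
    return base_score * factor


if __name__ == "__main__":
    feature = {"original_aka": True, "genres": ["Action"], "tv_movie": False}
    short = {"original_aka": True, "genres": ["Short"], "tv_movie": False}
    print(boosted_score(1.0, feature), boosted_score(1.0, short))
```

With `sum`, each passing filter adds a fixed amount; with `product`, passing several filters compounds, which separates "complete" titles from the rest more sharply.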
The Sorting – Short Input
●   Special case 1 – 3 letters
●   Exact matches are very rare
●   Should work after the first letter is typed
●   Only titles from this year
●   3 letters – also titles in near future and previous
    year
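The short-input rule can be expressed as a small helper that picks the release-year window to filter on. This is a sketch: the function name is invented, and "near future" is assumed here to mean the following year, which the slides do not specify.

```python
def release_year_window(user_input, current_year):
    """Return (first_year, last_year) to restrict results, or None.

    1-2 letters: only titles from the current year.
    3 letters:   previous year through next year (next year is an assumption).
    Longer or empty input: no year restriction.
    """
    length = len(user_input.strip())
    if length == 0 or length > 3:
        return None
    if length < 3:
        return (current_year, current_year)
    return (current_year - 1, current_year + 1)


if __name__ == "__main__":
    for text in ("m", "ma", "mat", "matr"):
        print(repr(text), release_year_window(text, 2012))
```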
The Year in Input
●   Matrix 1999
●   Matrix Reloaded (2003)
●   Matrix 2000- (released up to 2000)
●   Matrix 2000+ (released from 2000 onwards)
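One way to recognise these four input shapes is a single regular expression applied before the text query is built. The pattern and the function name `split_year` below are a sketch, not the parser used in the talk; it assumes years between 1800 and 2099 and leaves input untouched when the year is the only thing typed.

```python
import re

# Trailing year, optionally in parentheses, optionally followed by + or -.
_YEAR_SUFFIX = re.compile(
    r"^(?P<text>.*?)\s*\(?(?P<year>(?:18|19|20)\d{2})\)?(?P<op>[+-])?\s*$"
)


def split_year(query):
    """Return (text, year_from, year_to); the years are None when open-ended."""
    match = _YEAR_SUFFIX.match(query)
    if not match or not match.group("text"):
        return query.strip(), None, None
    text = match.group("text")
    year = int(match.group("year"))
    op = match.group("op")
    if op == "-":
        return text, None, year   # released up to that year
    if op == "+":
        return text, year, None   # released from that year onwards
    return text, year, year       # exactly that year


if __name__ == "__main__":
    for q in ("Matrix 1999", "Matrix Reloaded (2003)", "Matrix 2000-", "Matrix 2000+"):
        print(q, "->", split_year(q))
```

The returned year bounds can then be turned into a range filter on the release year, while the remaining text goes through the normal title query.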
One More Thing – Advanced Search
●   Titles also carry data about their usage
●   „Watched by Friends“ filter
    Shows titles that have the IDs of your „friends“ in
    the proper field (TermsFilter([IDS]))
●   „Not Watched“ filter
    Shows titles from which your ID is absent
    (NotFilter(TermFilter(ID)))
●   Combination – titles to watch to catch up with
    friends
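The combined filter boils down to two set checks per title. The sketch below runs them over in-memory records rather than Elasticsearch; the field name `watched_by`, the sample data and the function name are assumptions made for this example.

```python
def catch_up(titles, my_id, friend_ids):
    """Titles watched by at least one friend but not yet by me."""
    friends = set(friend_ids)
    return [
        title["name"]
        for title in titles
        if friends & set(title["watched_by"])   # "Watched by Friends"
        and my_id not in title["watched_by"]    # "Not Watched"
    ]


if __name__ == "__main__":
    # Made-up sample data: user IDs that checked in each title.
    sample = [
        {"name": "The Matrix", "watched_by": [1, 2, 3]},
        {"name": "The Matrix Reloaded", "watched_by": [2]},
        {"name": "The Matrix Revolutions", "watched_by": []},
    ]
    print(catch_up(sample, my_id=1, friend_ids=[2, 3]))
```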
The End




  Thanks


Tomáš Sirný
@junckritter
