Searching for The Matrix in haystack (with Elasticsearch)
1. Searching for The Matrix in haystack
(with Elasticsearch)
Synopsi.TV case study
Tomáš Sirný
@junckritter
Pyvo/Rubyslava November 2012
2. The Environment
● Recommendation service for movies, TV shows
● People mark titles they watched (check-in), rate
them
● Get recommendations
● Make „Watch Later“ or other-purpose lists
● …
● Search (to check-in, add to list, share, etc.)
3. The Problem
● Input box for search on top of web page
● Many movies, TV shows in database
● Many of them have similar titles, use similar
words
● Some are more likely to be searched for
● Very little input to work with – 3 or 4 letters
● Autocomplete, not only exact match
6. The Tool
● Elasticsearch – designed for searching in
documents
● Based on Lucene – de facto standard
● Young yet feature-rich
● Quick development (despite 1 core developer)
● Company behind it founded recently
● $10M funding in Series A round
7. The (Wannabe) Solution
● Differentiate titles
● Have cover, plot, cast, directors
● Year
● Popularity (whatever it means)
● Prefer ones with more data, more popular
8. The Text – First Attempt
● Text Query (now Match Query)
● phrase_prefix type – all words of the input must
match, with prefix matching („m“, „ma“, „mat“, …),
in the same order
● operator and
● not_analyzed „name“ field (not broken down into
words)
9. The Text – First Attempt
● slop parameter – allows changing the order of
words and skipping words
„matrix revolutions“
„revolutions matrix“
„matrix first revolutions“
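The first attempt can be sketched as a request body built in Python. This is only an illustration under assumptions: it uses the match query syntax of the Elasticsearch versions of that time (type given inside the field object), the slop value of 2 is a placeholder, and build_name_query is a hypothetical helper, not code from the talk.

```python
def build_name_query(user_input, slop=2):
    """Return a match query body (plain dict) for the "name" field.

    Assumptions: the old-style match query with a "type" key; slop=2 is
    an illustrative value, the talk does not say which one was used.
    """
    return {
        "match": {
            "name": {
                "query": user_input,
                # all words must appear in order, with prefix matching
                "type": "phrase_prefix",
                "operator": "and",
                # tolerate reordered or skipped words
                "slop": slop,
            }
        }
    }


body = {"query": build_name_query("matrix rev")}
```

The resulting dict can be serialized with json.dumps and sent as the search request body.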
10. The Sorting – First Attempt
● Default scoring considers only occurrences of the
text in documents
● We also want other properties of document to
count
● Custom Score Query
● Define script for scoring
"script": "_score * doc['rating'].value"
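A minimal sketch of how such a script could be wrapped around a text query. It assumes the custom_score query of the Elasticsearch versions of that time; with_rating_score is a hypothetical helper name.

```python
def with_rating_score(inner_query):
    """Wrap a query so its text score is multiplied by the "rating" field.

    Assumption: the custom_score query form {"query": ..., "script": ...}.
    """
    return {
        "custom_score": {
            "query": inner_query,
            # _score is the text relevance, doc['rating'] the stored rating
            "script": "_score * doc['rating'].value",
        }
    }


scored = with_rating_score({"match": {"name": "matrix"}})
```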
11. The Rating
● Lets us prefer more „popular“ titles
● External – top lists, links, etc.
● Internal – usage data from system
● Problem for newly added titles – lack of data of
both types
12. The Tuning of Rating
● Get rid of external data
● Only score „completeness“ of each document
● Release year
"script": "3 * log(_score) +
1 * log(doc['year'].date.year - 1880) +
0.75 * log(doc['watched_count'].value + 1)"
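To get a feeling for how the three terms interact, the formula can be reproduced offline. The sketch below assumes that log in the script is the natural logarithm; tuned_score and the sample numbers are made up for illustration.

```python
import math


def tuned_score(text_score, year, watched_count):
    """Offline re-computation of the scoring script (natural log assumed)."""
    return (
        3 * math.log(text_score)
        + 1 * math.log(year - 1880)
        + 0.75 * math.log(watched_count + 1)
    )


# Hypothetical titles with the same text score:
old_popular = tuned_score(2.0, 1999, 1000)  # older, watched a lot
new_unknown = tuned_score(2.0, 2012, 0)     # brand new, no usage data yet
```

Because of the logarithms, the year term changes slowly for recent titles, so a new title without check-ins still ranks below an older, heavily watched one with the same text score.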
13. The Tuning of Query
● Name field analyzed, edgeNGram filter
index:
  analysis:
    filter:
      my_ngram:
        type: edgeNGram
        min_gram: 1
        max_gram: 11
        side: front
    analyzer:
      my_analyzer:
        type: custom
        tokenizer: standard
        filter: [lowercase, asciifolding, my_ngram]
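What this analyzer does to a title can be approximated in plain Python. This is a rough simulation, not Elasticsearch code: the standard tokenizer is approximated by splitting on non-word characters, asciifolding by stripping combining accents, and all function names are hypothetical.

```python
import re
import unicodedata


def ascii_fold(text):
    """Approximate asciifolding: drop accents (does not cover every case)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_ngrams(token, min_gram=1, max_gram=11):
    """Prefixes of the token, from min_gram up to max_gram characters."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(text):
    """Tokenize, lowercase, fold accents, then emit front edge n-grams."""
    tokens = []
    for word in re.findall(r"\w+", text.lower()):
        tokens.extend(edge_ngrams(ascii_fold(word)))
    return tokens


print(analyze("The Matrix"))
# ['t', 'th', 'the', 'm', 'ma', 'mat', 'matr', 'matri', 'matrix']
```

Since every prefix is indexed as its own term, an input such as „mat“ matches as an ordinary term and no prefix query is needed at search time.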
14. The AKA's
● Also known as – names of a title in different
countries
● A lot of additional data, sometimes only „noise“
● The „original“ name is still the most important
16. The AKA's
● Array of AKAs – problems with scoring of short
names
● Nested AKA documents – the query does not return
which nested document matched
● AKA document as a child of the title – has its own
information (original, country, slug)
● Top Children Query – which AKA matched
● Another query with Ids Filter – get titles
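The second step of this two-query flow can be sketched as follows. The shape of the AKA hits (parent_id, name, score) is a simplified, hypothetical stand-in for the real response of the first query, and titles_query_from_aka_hits is an illustrative helper; only the ids filter inside a filtered query is taken from the Elasticsearch DSL of that time.

```python
def titles_query_from_aka_hits(aka_hits):
    """Build the follow-up query that fetches titles for matched AKAs.

    aka_hits: list of dicts with "parent_id", "name" and "score"
    (hypothetical, simplified shape of the first query's hits).
    Returns the query body and the best matching AKA per title.
    """
    best = {}
    for hit in aka_hits:
        parent = hit["parent_id"]
        if parent not in best or hit["score"] > best[parent]["score"]:
            best[parent] = hit
    # Title IDs ordered by the score of their best AKA
    ordered = sorted(best, key=lambda p: best[p]["score"], reverse=True)
    query = {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {"ids": {"values": ordered}},
        }
    }
    return query, best


hits = [
    {"parent_id": 1, "name": "The Matrix", "score": 2.0},
    {"parent_id": 1, "name": "Matrix", "score": 3.0},
    {"parent_id": 2, "name": "Matrix Reloaded", "score": 2.5},
]
query, best_aka = titles_query_from_aka_hits(hits)
```

Keeping the best AKA per title lets the result list show the name that actually matched next to each title.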
17. The Sorting – Second Attempt
● Custom Filters Score Query – apply a set of filters,
each filter boosts documents which pass its
condition
● boost parameter of a filter – sets the
importance of that filter
● score_mode – how boost values are combined
(sum, product)
18. The Sorting – Used Score Filters
● Release date (for a TV show, of the last episode)
in the last 6 months
● Release date in the next 3 months
● „original“ AKA
● Has all important categories filled in
● Not in the Short genre
● Not a TV movie
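A sketch of how these filters might be combined in one request body. Everything specific here is an assumption: the field names (release_date, is_original, is_complete, genres, type), the boost values, the day counts used for 6 and 3 months, and the score_mode name "total" for summing, which is how the custom_filters_score query of that time named it as far as I recall.

```python
from datetime import date, timedelta


def build_boosted_query(inner_query, today):
    """Wrap inner_query in a custom_filters_score query (illustrative only)."""
    six_months_ago = (today - timedelta(days=182)).isoformat()
    in_three_months = (today + timedelta(days=91)).isoformat()
    now = today.isoformat()

    def term(field, value):
        return {"term": {field: value}}

    def negate(flt):
        return {"not": {"filter": flt}}

    filters = [
        # released in the last 6 months
        {"filter": {"range": {"release_date": {"gte": six_months_ago, "lte": now}}},
         "boost": 2.0},
        # to be released in the next 3 months
        {"filter": {"range": {"release_date": {"gt": now, "lte": in_three_months}}},
         "boost": 1.5},
        {"filter": term("is_original", True), "boost": 3.0},
        {"filter": term("is_complete", True), "boost": 1.5},
        {"filter": negate(term("genres", "short")), "boost": 1.2},
        {"filter": negate(term("type", "tv_movie")), "boost": 1.2},
    ]
    return {
        "custom_filters_score": {
            "query": inner_query,
            "filters": filters,
            "score_mode": "total",  # assumed name for summing boosts
        }
    }


boosted = build_boosted_query({"match": {"name": "matrix"}}, date(2012, 11, 15))
```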
19. The Sorting – Short Input
● Special case: 1–3 letters
● An exact match is very rare
● Should work from the first letter typed
● Only titles from this year
● 3 letters – also titles from the near future and
the previous year
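The short-input rule can be written down as a small function. It is a sketch under assumptions: „near future“ is taken to mean the next calendar year, and year_window is a hypothetical helper whose result would feed a year range filter.

```python
def year_window(user_input, current_year):
    """Return (first_year, last_year) to restrict results, or None.

    1-2 letters: only titles from the current year.
    3 letters:   previous year up to next year (assumed "near future").
    Longer or empty input: no restriction.
    """
    length = len(user_input.strip())
    if length == 0 or length > 3:
        return None
    if length < 3:
        return (current_year, current_year)
    return (current_year - 1, current_year + 1)


print(year_window("m", 2012))    # (2012, 2012)
print(year_window("mat", 2012))  # (2011, 2013)
```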
20. The Year in Input
● Matrix 1999
● Matrix Reloaded (2003)
● Matrix 2000-: released up to 2000
● Matrix 2000+: released since 2000
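One way to split such input into a title part and a year range is a regular expression. The sketch below is an assumption about how this could be done, not the parser used in the talk; parse_year is a hypothetical name, and an input that consists only of a year is treated as a plain title.

```python
import re

# title, whitespace, optional "(", four-digit year, optional ")", optional +/-
YEAR_PATTERN = re.compile(
    r"^(?P<title>.+?)\s+\(?(?P<year>(?:18|19|20)\d{2})\)?(?P<mod>[+-])?\s*$"
)


def parse_year(text):
    """Return (title, year_from, year_to); None means an open bound."""
    text = text.strip()
    match = YEAR_PATTERN.match(text)
    if not match:
        return text, None, None
    title = match.group("title")
    year = int(match.group("year"))
    modifier = match.group("mod")
    if modifier == "-":
        return title, None, year   # released up to the year
    if modifier == "+":
        return title, year, None   # released since the year
    return title, year, year       # released exactly in the year


print(parse_year("Matrix Reloaded (2003)"))  # ('Matrix Reloaded', 2003, 2003)
print(parse_year("Matrix 2000+"))            # ('Matrix', 2000, None)
```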
21. One More Thing – Advanced Search
● Titles also carry data about their usage
● „Watched by Friends“ filter
Shows titles that have IDs of your „friends“ in the
proper field (TermsFilter([IDS]))
● „Not Watched“ filter
Shows titles in which your ID is absent
(NotFilter(TermFilter(ID)))
● Combination – titles to watch to catch up with
friends
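Expressed as raw filter bodies, the combination could look like the sketch below. The field name watched_by and the helper catch_up_filter are hypothetical; the terms, term, not and and filters are written in the filter syntax of the Elasticsearch versions of that time.

```python
def catch_up_filter(my_id, friend_ids):
    """Titles watched by at least one friend but not by me (illustrative)."""
    # "Watched by Friends": any friend ID present in the usage field
    watched_by_friends = {"terms": {"watched_by": list(friend_ids)}}
    # "Not Watched": my own ID absent from the same field
    not_watched = {"not": {"filter": {"term": {"watched_by": my_id}}}}
    return {"and": [watched_by_friends, not_watched]}


body = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": catch_up_filter(42, [7, 8, 9]),
        }
    }
}
```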