Recommendations from the search
engine
Sesam Hackathon, Warsaw, 2014-03-23
Lars Marius Garshol, larsga@bouvet.no, http://twitter.com/larsga
This whole presentation is about Ted Dunning’s
proposed approach to recommendations
Based on his 1993 paper (below)
– references at the end
Very simple method, dead easy to implement
– seems to work pretty well
Inspiration
Recommender systems are usually designed to predict ratings
– Dunning believes this is the wrong approach
– people’s ratings don’t necessarily reflect what they’ll
buy
– go by what people do rather than what they say
You don’t want to recommend Bob Dylan
– everyone’s already heard of him, and knows what
they think
– you want to recommend things that are new to the user
You don’t want to recommend things everyone
likes
Thoughts on recommendations
Step 1
– work out which things tend to occur together
– that is, if you buy this, you’re likely to also buy that
– however, we only want pairs which are statistically
significant
Step 2
– index up the significant pairs in a search engine
– use search to produce the actual results
The actual approach
Statistically significant co-occurrence
Part the first
User Item
u1 i1
u1 i2
u2 i1
u3 i2
u3 i3
u3 i4
... ...
The starting point
Some kind of log of user actions
User has
– bought a movie | album | book | ...
– opened a document
– ...
From this raw material, we can work
out what things tend to go together
– and whether this is significant
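A minimal sketch of how such a log could be turned into co-occurrence counts, assuming the log is a list of (user, item) pairs like the table above. The function name `cooccurrence` and the dictionary layout are illustrative assumptions, not taken from the demo code.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(log):
    """Count, for each pair of items, how many users touched both."""
    # Group the items by user; a set ignores repeated actions on the same item
    items_by_user = defaultdict(set)
    for user, item in log:
        items_by_user[user].add(item)

    # Count every unordered pair once per user, stored in both directions
    counts = defaultdict(int)
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return dict(counts)


# The sample rows from the User/Item table
log = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"),
       ("u3", "i2"), ("u3", "i3"), ("u3", "i4")]
counts = cooccurrence(log)
```

Storing each pair in both directions gives the symmetric item-to-item matrix directly.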
      i1   i2   i3   i4   i5   i6   i7
i1     -   23   42    0    0    5    7
i2    23    -    6    1  129    2   10
i3    42    6    -    3    0  492    1
i4     0    1    3    -    2    3    1
i5     0  129    0    2    -   94    2
i6     5    2  492    3   94    -    1
i7     7   10    1    1    2    1    -
Item-to-item matrix
k[0][0] = the number in the matrix on the previous slide
k[0][1] = the sum of that whole column minus k[0][0]
k[1][0] = the sum of that whole row minus k[0][0]
k[1][1] = the sum of the entire matrix minus k[0][0] minus k[1][0] minus k[0][1]
Producing the k 2x2 matrix
How to compute the k matrix for a given cell in the matrix
on the previous slide
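The four definitions translate almost directly into code. This is a sketch under two assumptions: the item-to-item matrix is a list of rows, and the blank diagonal is stored as 0. The name `k_matrix` is made up for illustration.

```python
def k_matrix(matrix, i, j):
    """Build the 2x2 table k for the cell at row i, column j."""
    k00 = matrix[i][j]
    column_sum = sum(row[j] for row in matrix)
    row_sum = sum(matrix[i])
    total = sum(sum(row) for row in matrix)

    k01 = column_sum - k00          # rest of that column
    k10 = row_sum - k00             # rest of that row
    k11 = total - k00 - k10 - k01   # everything else in the matrix
    return [[k00, k01], [k10, k11]]


# Item-to-item matrix from the previous slide (diagonal as 0, an assumption)
MATRIX = [
    [0, 23, 42, 0, 0, 5, 7],
    [23, 0, 6, 1, 129, 2, 10],
    [42, 6, 0, 3, 0, 492, 1],
    [0, 1, 3, 0, 2, 3, 1],
    [0, 129, 0, 2, 0, 94, 2],
    [5, 2, 492, 3, 94, 0, 1],
    [7, 10, 1, 1, 2, 1, 0],
]

k = k_matrix(MATRIX, 1, 4)  # the i2/i5 cell
```

For the i2/i5 cell (129) this gives k = [[129, 98], [42, 1379]].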
If the output of LLR(k) is above some threshold, the pair is considered significant.
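The slides don't spell out LLR itself. The sketch below uses a common entropy-based form of Dunning's log-likelihood ratio; whether it matches the linked Python code line for line is an assumption, and the threshold is a parameter you have to choose yourself. The names `x_log_x`, `entropy`, `llr` and `is_significant` are illustrative.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalised entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k):
    """Log-likelihood ratio for a 2x2 table k of counts."""
    row = entropy(k[0][0] + k[0][1], k[1][0] + k[1][1])
    col = entropy(k[0][0] + k[1][0], k[0][1] + k[1][1])
    mat = entropy(k[0][0], k[0][1], k[1][0], k[1][1])
    # Rounding can push the result slightly below zero, so clamp it
    return max(0.0, 2.0 * (row + col - mat))


def is_significant(k, threshold):
    """True if the pair described by k scores above the chosen threshold."""
    return llr(k) > threshold
```

A table where the two items are independent scores close to 0, while a table where they always occur together scores high.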
Check the Python code on
– https://github.com/larsga/py-snippets/tree/master/machine-learning/llr
– this requires a lot of memory and CPU
Or just use Mahout
– RowSimilarityJob does exactly this
Doing it for real
Search engine as recommender
Part the second
Take all the items and index them up with the
search engine in the usual way
– that is, each item has an id, a title, a description, etc.
Then, add a “magic” field
– put into it the IDs of all the items that appear in a
significant pair with this item
– let’s call this field “indicators”
Now we’re ready to do recommendations
Indexing with the search engine
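A rough sketch of what the documents could look like before they go into the search engine, assuming the significant pairs are available as (item, item) tuples. The function `build_documents` and the dictionary keys are assumptions for illustration; the real demo builds a Lucene index instead.

```python
from collections import defaultdict


def build_documents(items, significant_pairs):
    """Attach an 'indicators' field to each item document."""
    # For every significant pair, each item becomes an indicator of the other
    indicators = defaultdict(set)
    for a, b in significant_pairs:
        indicators[a].add(b)
        indicators[b].add(a)

    return [
        {"id": item_id, "title": title,
         "indicators": sorted(indicators[item_id])}
        for item_id, title in items.items()
    ]


# Hypothetical items and pairs
items = {"i1": "Movie A", "i2": "Movie B", "i3": "Movie C"}
docs = build_documents(items, [("i1", "i2")])
```

Items that appear in no significant pair simply get an empty indicators field.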
Collect some set of items for which the user has
expressed a preference
– by buying them, looking at them, rating them, whatever
The IDs of these items are your query
– search the “indicators” field
– the search results are your recommendations
That’s it!
– pack up, go home
Doing recommendations
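To make the query step concrete, here is a toy stand-in for the search engine: it scores each document by how many of the liked IDs appear in its indicators field. This plain overlap count is a simplification chosen for illustration; a real engine would weight the matches by relevance. The function `recommend` and the sample documents are hypothetical.

```python
def recommend(docs, liked, limit=10):
    """Return ids of documents whose indicators match the liked items."""
    liked = set(liked)
    scored = []
    for doc in docs:
        if doc["id"] in liked:
            continue  # don't recommend what the user already likes
        hits = len(liked.intersection(doc["indicators"]))
        if hits:
            scored.append((hits, doc["id"]))
    # Most matches first; ties broken by id to keep the order stable
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:limit]]


# Hypothetical index contents
docs = [
    {"id": "i1", "indicators": ["i2", "i3"]},
    {"id": "i2", "indicators": ["i1"]},
    {"id": "i3", "indicators": ["i1", "i2"]},
    {"id": "i4", "indicators": []},
]
```

With these documents, a user who liked i1 and i2 gets i3 back, since both liked items are indicators for it.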
Imagine that you’re searching for movies, and you
type “the godfather”
– “the” appears in all documents, so documents matching that
get a low relevance score
– “godfather” appears in very few documents, so matches on
that get a high score
– this is basically TF/IDF in a nutshell
Now, imagine you liked two movies: “The Godfather”
and “The Daytrippers”
– nearly all movies have “The Godfather” as an indicator
– very few have “The Daytrippers”
– the second will therefore influence recommendations much
more
Why does it work?
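The effect can be shown with a bare-bones inverse document frequency weight, log(N / df). This is the textbook form, used here as an assumption; real search engines use smoothed variants plus other factors. The names `idf_weights` and `score`, and the five sample documents, are made up for illustration.

```python
import math


def idf_weights(docs):
    """Weight each indicator by log(N / number of docs containing it)."""
    n = len(docs)
    df = {}
    for doc in docs:
        for indicator in set(doc["indicators"]):
            df[indicator] = df.get(indicator, 0) + 1
    return {ind: math.log(n / count) for ind, count in df.items()}


def score(doc, liked, weights):
    """Sum the weights of the liked items found among doc's indicators."""
    return sum(weights[item] for item in liked if item in doc["indicators"])


# Hypothetical index: a common indicator and a rare one
docs = [
    {"id": "m1", "indicators": ["godfather"]},
    {"id": "m2", "indicators": ["godfather"]},
    {"id": "m3", "indicators": ["godfather"]},
    {"id": "m4", "indicators": ["godfather"]},
    {"id": "m5", "indicators": ["daytrippers"]},
]
weights = idf_weights(docs)
liked = ["godfather", "daytrippers"]
```

Here the rare indicator gets a much larger weight than the common one, so the document matching it scores higher for a user who liked both movies.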
Trying it out for real
Part the third
Again, the code is on Github
– very simple webapp based on web.py and Lucene
– https://github.com/larsga/py-snippets/tree/master/machine-learning/llr
The underlying data is the MovieLens dataset
– 10 million ratings of 10,000 movies by 72,000 users
– http://grouplens.org/datasets/movielens/
Real demo with real data
llr.py
– this chews the data, producing the significant pairs
– takes a huge amount of memory and about 30 minutes
– I have made absolutely no attempt to optimize it
llr_index.py
– reads output of previous script, makes Lucene index
recom-ui.py
– the actual web application
Three scripts
Liked one movie
Liked two movies
Movies with the highest LLR score
together with this movie
Liked three movies
Recommendations are actually now spot-on. At least for me.
class Movie:
    def GET(self, movieid):
        nocache()
        # Look up the movie being displayed by its id
        doc = search.do_query('id', movieid)[0]
        #recoms = search.do_query('indicators', movieid)
        # Fetch the document for each movie id stored in doc.bets
        recoms = [search.do_query('id', bet_id)[0] for bet_id in doc.bets]
        # If the user has liked anything, use it to query the indicators field
        if hasattr(session, 'liked'):
            youlike = search.do_query('indicators', session.liked)
        else:
            youlike = []
        return render.movie(doc, recoms, youlike)
Complete code for movie page
Further work
Winding up
Tweak the parameters a bit to see what happens
Can we support a “Dislike” button?
Test it with more kinds of data
Learn how to do this with Mahout
Things left to do
[Three figure slides from Ted Dunning’s slides, captioned “What is this?”, “And this?” and “And this?”; images not reproduced]
The original 1993 paper
– http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962
Ebook with lots of background but little detail
– http://www.mapr.com/practical-machine-learning
Slides covering the same material
– www.slideshare.net/tdunning/building-multimodal-recommendation-engines-using-search-engines
Blog post with actual equations
– http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
References
