Differences in Distributions and Their Effect on Recommendation System Performance
Why Collaborative Filtering Doesn’t Scale
(portions reference Prismatic’s Silicon Valley talk)
History of Recommendation
Overfitting
• Distribution of All Items Across Users
• Distribution of All Items Across All Users in the Future
• Concrete Set of Past Items Across Users
• Concrete Set of Future Items Across Users
Recommender Systems Dilemma
• Set of All Items Possible
• Set of Items Known to Users in the Future
• Set of Items Known to Users in the Past
• Set of Items Recommended By Recommenders
• Items Viewed Or Liked in the Future
• Items Users Viewed Or Rated in the Past
• Items Seen in Ground Truth Without Changes in Item Access
• ??????
Collaborative Filtering in Music
• Construct correlations between items from the set of past known items
• Generate an estimated distribution for past users across all items
• Hope the ‘errors’ correspond to items users will like in the future
• The gap between the distributions grows with the scale of the data
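The steps above can be sketched as a small item-to-item example. This is only an illustration under stated assumptions: the `plays` data, the function names, and the choice of cosine similarity over binary listens are all hypothetical and not taken from the talk.

```python
from math import sqrt

# Hypothetical play data: user -> set of songs the user has listened to.
plays = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"a", "c"},
    "u4": {"d"},
}

def item_similarity(plays):
    """Cosine similarity between items, built only from past known items."""
    users_by_item = {}
    for user, items in plays.items():
        for item in items:
            users_by_item.setdefault(item, set()).add(user)
    sims = {}
    for i, users_i in users_by_item.items():
        for j, users_j in users_by_item.items():
            if i != j:
                overlap = len(users_i & users_j)
                if overlap:
                    sims[(i, j)] = overlap / sqrt(len(users_i) * len(users_j))
    return sims

def score_unseen(user, plays, sims):
    """Estimate a score for every item the user has not listened to."""
    seen = plays[user]
    all_items = set().union(*plays.values())
    return {
        candidate: sum(sims.get((candidate, s), 0.0) for s in seen)
        for candidate in all_items - seen
    }

sims = item_similarity(plays)
scores_u1 = score_unseen("u1", plays, sims)
print(scores_u1)
```

In this toy data, song `d` never co-occurs with anything `u1` has heard, so it scores zero no matter how much `u1` might like it. That is the gap between the known-items distribution and the all-items distribution in miniature.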
Resulting Biases
• Huge number of items: 50%+ of users only ever saw 20 songs a month out of 3 million
• Massive gap between the all-items distribution and the known-items distribution
• Cross-validation ground truth assumes the 50%+ of users only ever saw the new top 20 songs in the new set
• Results are supposed to reflect what users would do if they knew all sets
• Continuous user testing assumes ‘all items seen’ distributions, but the only new items seen are the recommended ones
• User data itself is a biased subset of the whole
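To make the size of that gap concrete, the figures from this slide (20 songs a month, 3 million songs) can be turned into an exposure fraction. The yearly figure is an upper bound that assumes, hypothetically, that a user never hears the same song in two different months.

```python
CATALOG_SIZE = 3_000_000   # catalog size quoted on the slide
SONGS_PER_MONTH = 20       # songs seen per month by 50%+ of users, per the slide

def exposure_fraction(songs_seen, catalog_size=CATALOG_SIZE):
    """Fraction of the catalog a user has been exposed to."""
    return songs_seen / catalog_size

monthly = exposure_fraction(SONGS_PER_MONTH)
# Upper bound for one year, assuming no song repeats across months.
yearly_upper = exposure_fraction(SONGS_PER_MONTH * 12)

print(f"per month: {monthly:.6%}, at most per year: {yearly_upper:.4%}")
```

Even the optimistic yearly bound stays below one hundredth of a percent of the catalog, so any ground truth built from these users describes only a tiny slice of the items.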
First Generation Problems
• Everyone likes The Beatles or Norah Jones
• Extremely frequent in biased data sets
• Since everyone has listened to them before, everyone gets them recommended
• Recommendations usually repeat the top 40 of the data collection
• Users might like novel recommendations, but those will never be in the evaluation set in cross-validation, because users never saw them
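A minimal sketch of why offline evaluation rewards the top-40 effect, assuming a hypothetical listening log and a simple hit-rate metric (neither comes from the talk). Held-out plays can only contain songs users were exposed to, so a list of never-seen songs cannot score.

```python
from collections import Counter

# Hypothetical listening logs: user -> songs heard in the training period.
train = {
    "u1": ["beatles", "norah", "beatles"],
    "u2": ["beatles", "norah"],
    "u3": ["beatles", "indie_a"],
}
# Held-out plays: by construction, only songs the user was exposed to.
held_out = {"u1": {"norah"}, "u2": {"beatles"}, "u3": {"beatles"}}

def top_n(train, n):
    """The n most played songs in the training data."""
    counts = Counter(song for songs in train.values() for song in songs)
    return [song for song, _ in counts.most_common(n)]

def hit_rate(recs, held_out):
    """Share of users with at least one held-out song in the recommendations."""
    rec_set = set(recs)
    hits = sum(1 for liked in held_out.values() if liked & rec_set)
    return hits / len(held_out)

popular = top_n(train, 2)
novel = ["indie_b", "indie_c"]  # hypothetical songs no user has ever seen

popular_score = hit_rate(popular, held_out)
novel_score = hit_rate(novel, held_out)
print(popular, popular_score, novel_score)
```

The novel list scores zero here regardless of whether users would enjoy those songs, which is the evaluation blind spot the last bullet describes.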
Problems Over Time
• The ground truth is heavily biased by recommendations controlling the set of known items
• Machine learning, including collaborative filtering, learns the algorithm's distribution more than users' preferences
• Performance Bias
• Future ground truth comes from those who stayed in the system
• They liked the system
• It doesn’t represent those who were unhappy and left
• Biases data to keep existing users happy without regard to ex-users
• In extreme cases, even new users are discarded
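The performance bias can be illustrated with a tiny deterministic example. The satisfaction values and the churn threshold below are hypothetical assumptions chosen only to show the effect: measuring the users who stayed overstates how well the system serves everyone.

```python
# Hypothetical users: (user_id, satisfaction in [0, 1]) after one round
# of recommendations.
users = [("u1", 0.9), ("u2", 0.8), ("u3", 0.3), ("u4", 0.2), ("u5", 0.1)]

CHURN_THRESHOLD = 0.5  # assumption: users below this satisfaction leave

def survivors(users, threshold=CHURN_THRESHOLD):
    """Users who stay in the system and keep producing ground truth."""
    return [(uid, s) for uid, s in users if s >= threshold]

def mean_satisfaction(users):
    return sum(s for _, s in users) / len(users)

true_mean = mean_satisfaction(users)                 # everyone who tried the system
observed_mean = mean_satisfaction(survivors(users))  # only those who stayed

print(f"true: {true_mean:.2f}, observed from survivors: {observed_mean:.2f}")
```

Any model retrained on the survivors' data sees only the higher figure, so it is tuned toward the users it already satisfies.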
Best Solution So Far
• Past Data
• Idealized Future Distribution
• Idealized Function
• Feature Value => Rating
Best Solution So Far
• Requires all items to be categorized and quantized
• Requires accuracy and general agreement on these values
• (Socially defined versus absolute)
• At least all features are present in all sets
• Transforms recommendation into optimization and personalization
• Set of items with the highest score for a user
• Ability to predict poorly performing product or agent solutions
• Better able to incorporate additional data
• Prediction is usually linear time in the number of items
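One way to picture the "Feature Value => Rating" idea is a per-user scoring function over quantized item features. This is a sketch under assumptions, not the author's method: the feature names, the weights, and the linear form are all hypothetical.

```python
import heapq

# Hypothetical item features, each quantized to the range [0, 1].
items = {
    "song_a": {"tempo": 0.9, "acoustic": 0.1},
    "song_b": {"tempo": 0.2, "acoustic": 0.8},
    "song_c": {"tempo": 0.5, "acoustic": 0.5},
}
# Hypothetical per-user weights, e.g. fitted from the user's past ratings.
user_weights = {"tempo": 1.0, "acoustic": -0.5}

def score(features, weights):
    """Predicted rating: weighted sum of an item's feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def recommend(items, weights, k):
    """Score every item once, then keep the k highest-scoring items."""
    return heapq.nlargest(k, items, key=lambda name: score(items[name], weights))

top = recommend(items, user_weights, 2)
print(top)
```

Because every item carries the same features, a song nobody has heard yet can still be scored, and each item is scored once per request, which matches the linear-time bullet above.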
Evaluation Adjustments
• No Replacement for Real World A/B testing
• Machine Learning for evaluation, not just the question
• Hidden dependencies and ‘cheating’
• Learned Algorithm
• Model Training
• Evaluation Model
• Model Training
• Business Objective
• Ground Truth
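For the A/B testing point, a common way to compare two recommender variants is a two-proportion z-test on a conversion-style metric. The slides do not name a test, so this is an assumed choice, and the counts below are made up for illustration.

```python
from math import erfc, sqrt

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return z, p_value

# Hypothetical counts: users who liked a recommendation under variants A and B.
z, p = two_proportion_z_test(120, 1000, 150, 1000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

A test like this only compares what the two variants actually showed to live users, so it complements the offline evaluation rather than replacing it.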
Distribution Problems in Recommender Systems
