Traditional machine learning and collaborative filtering pay little attention to the sources of the data they use. The distribution backing the training data, the distribution backing the algorithm's output, and the distribution backing the ground truth often differ from one another, and all of them are almost unrelated to the target distribution: true ratings across all items for every user.
1. Differences in Distributions and Their Effect on Recommendation System Performance
Why Collaborative Filtering Doesn’t Scale
(portions reference Prismatic’s Silicon Valley talk)
3. Overfitting
[Diagram: four overlapping distributions – the distribution of all items across users; the distribution of all items across all users in the future; the concrete set of past items across users; and the concrete set of future items across users]
4. Recommender Systems Dilemma
[Diagram: nested sets – the set of all items possible; the set of items known to users in the future; the set of items known to users in the past; the set of items recommended by recommenders; items viewed or liked in the future; items users viewed or rated in the past; and items seen in the ground truth without changes in item access – with one overlap region left unlabeled ("??????")]
5. Collaborative Filtering in Music
• Construct correlations between items from the set of past known items
• Generate an estimated distribution for past users across all items
• Hope the residual 'errors' correspond to items future users would like
• The gap between the distributions widens as the data scales
6. Resulting Biases
• A huge number of items, yet 50%+ of users only ever saw 20 songs a month out of 3 million
• Massive gap between the all-items distribution and the known-items distribution
• Cross-validation ground truth assumes those 50%+ of users again saw only the new top 20 songs in the held-out set
• Results are supposed to reflect what users would choose if they knew all the sets
• Continuous user testing assumes 'all items seen' distributions, but the only new items users see are the recommended ones
• User data itself is a biased subset of the whole
7. First Generation Problems
• Everyone likes The Beatles or Norah Jones
• They are extremely frequent in biased data sets
• Since everyone has listened to them before, everyone gets them recommended
• Recommendations usually repeat the top 40 of the data collection
• Users might like novel recommendations, but those will never appear in the cross-validation evaluation set – users never saw them
8. Problems Over Time
• The ground truth is heavily biased by recommendations controlling the set of known items
• Machine learning – including collaborative filtering – learns the algorithm's distribution more than users' preferences
• Performance Bias
  • Future ground truth comes from those who stayed in the system
  • They liked the system
  • It doesn't represent those who were unhappy and left
  • This biases the data toward keeping existing users happy, with no regard for ex-users
  • In extreme cases, even new users are discarded
9. Best Solution So Far
Past Data => Idealized Future Distribution
Idealized Function: Feature Values => Rating
10. Best Solution So Far
• Requires all items to be categorized and quantized
• Requires accuracy and general agreement on these values
  • (Socially defined versus absolute)
• At least all features are present in all sets
• Transforms recommendation into optimization and personalization
  • The set of items with the highest score for a user
• Ability to predict poorly performing product or agent solutions
• Better able to incorporate additional data
• Prediction is usually linear time in the number of items
11. Evaluation Adjustments
• No replacement for real-world A/B testing
• Machine learning for the evaluation itself, not just the question
• Hidden dependencies and 'cheating'
[Diagram: model training for the learned algorithm and for the evaluation model, connected to the business objective and the ground truth]