Building a Recommendation Engine - An example of a product recommendation engine
R Recommendation System Contest Analysis
1. R Recommendation System Contest
John Myles White
March 10, 2011
John Myles White R Recommendation System Contest
2. Kaggle
Kaggle is a platform for data prediction competitions
that allows organizations to post their data and have it
scrutinized by the world’s best data scientists.
3. Kaggle Features
Kaggle provides every contest with:
Centralized data downloads
Public and private leaderboards using RMSE, AUC and other
metrics
Public discussion forums for participants to use
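As a rough illustration of the two leaderboard metrics named above, the sketch below computes RMSE and AUC in base R. The helper names `rmse` and `auc` and the toy vectors are assumptions made for this example; they are not Kaggle's scoring code.

```r
# Root mean squared error between numeric predictions and observed values.
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive case is scored above a randomly chosen negative one.
auc <- function(scores, labels) {
  labels <- as.logical(labels)
  n.pos <- sum(labels)
  n.neg <- sum(!labels)
  ranks <- rank(scores)  # tied scores receive average ranks
  (sum(ranks[labels]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Toy data, invented for illustration only.
actual <- c(1, 0, 1, 1, 0)
scores <- c(0.9, 0.2, 0.7, 0.4, 0.5)

rmse(scores, actual)
auc(scores, actual)
```

Lower RMSE is better, while AUC ranges from 0.5 (random ranking) to 1 (perfect ranking), so the two numbers are not directly comparable across contests.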
4. Kaggle Features
5. Recent Kaggle Contests
Tourism Forecasting
Chess Ratings: Elo versus the Rest of the World
INFORMS 2010: Short Term Stock Price Movements
6. Current and Upcoming Kaggle Contests
Arabic Writer Identification
Don’t Overfit: Dealing with Many Variables and Few
Observations
Heritage Health Prize
7. Advice on Running Kaggle Contests
Stay involved: respond to forum posts quickly and make the
contest seem alive
Don’t use a prediction task where near perfect accuracy can
be achieved
8. Mistakes We Made
Netflix Prize: 0.8616 RMSE
R Recommendation Contest: 0.9882 AUC
9. The R Recommendation System Contest
Contestants must be able to predict whether a user U will
have a package P installed on their system
10. Full Data Set
Outcomes: List of all packages installed on 52 R users’
systems
Predictors: Metadata about 2485 CRAN packages
11. Metadata
Dependencies
Suggests
Imports
Views
Core
Recommended
Maintainer
Maintainer’s Package Count
12. Training Data / Test Data Split
Uniform random split over rows in full data set
Training Set: 99373 rows
Test Set: 33125 rows
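The row counts above correspond to roughly a 75/25 split. A minimal sketch of such a uniform random split over rows, using a small invented data frame in place of the contest data (the names `full.data` and `test.data`, the columns, and the seed are all assumptions for illustration):

```r
set.seed(1)  # arbitrary seed so the split is reproducible

# Stand-in for the full data set: one row per (user, package) pair.
# The column names and sizes here are invented for illustration.
full.data <- expand.grid(User = 1:5, Package = paste0("pkg", 1:20))
full.data$Installed <- rbinom(nrow(full.data), size = 1, prob = 0.3)

# Draw 75% of the row indices uniformly at random, without replacement.
n <- nrow(full.data)
training.rows <- sample(n, size = round(0.75 * n))

training.data <- full.data[training.rows, ]
test.data <- full.data[-training.rows, ]

nrow(training.data)
nrow(test.data)
```

Because the split is over rows rather than over users, every user is likely to appear in both sets, which is what makes per-user terms usable at prediction time.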
13. Additional Metadata
LDA topic assignments for CRAN packages
Used 25 topics
Used all documentation: manuals, vignettes, etc.
14. Example Models
1. Package Metadata
2. Package Metadata + Per User Intercepts
3. Package Metadata + Per User Intercepts + Package Topic
Assignments
15. Example Model 1
library('ProjectTemplate')
try(load.project())

# Logistic regression on package metadata only.
logit.fit <- glm(Installed ~ LogDependencyCount +
                   LogSuggestionCount +
                   LogImportCount +
                   LogViewsIncluding +
                   LogPackagesMaintaining +
                   CorePackage +
                   RecommendedPackage,
                 data = training.data,
                 family = binomial(link = 'logit'))
16. Example Model 2
# Package metadata plus one intercept per user.
logit.fit <- glm(Installed ~ LogDependencyCount +
                   LogSuggestionCount +
                   LogImportCount +
                   LogViewsIncluding +
                   LogPackagesMaintaining +
                   CorePackage +
                   RecommendedPackage +
                   factor(User),
                 data = training.data,
                 family = binomial(link = 'logit'))
17. Example Model 3
# Package metadata, per-user intercepts, and the package's LDA topic.
logit.fit <- glm(Installed ~ LogDependencyCount +
                   LogSuggestionCount +
                   LogImportCount +
                   LogViewsIncluding +
                   LogPackagesMaintaining +
                   CorePackage +
                   RecommendedPackage +
                   factor(User) +
                   Topic,
                 data = training.data,
                 family = binomial(link = 'logit'))
18. Model Performance
Model 1: ∼ 0.80 AUC
Model 2: ∼ 0.95 AUC
Model 3: > 0.95 AUC
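To show how a comparison like this can be made, the sketch below fits a metadata-only logistic regression and one with per-user intercepts on simulated data, then scores both by AUC on a held-out set. Everything here is an assumption for illustration: the data are simulated, `Feature` stands in for the real metadata columns, and the resulting numbers are not the contest's.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated stand-in for the contest data; every name and number is invented.
n.users <- 20
n.packages <- 100
sim <- expand.grid(User = factor(1:n.users), Package = 1:n.packages)

package.feature <- rnorm(n.packages)     # e.g. a log dependency count
user.effect <- rnorm(n.users, sd = 1.5)  # how install-happy each user is
sim$Feature <- package.feature[sim$Package]
linear.predictor <- -1 + sim$Feature + user.effect[as.integer(sim$User)]
sim$Installed <- rbinom(nrow(sim), size = 1, prob = plogis(linear.predictor))

# 75/25 uniform random split over rows.
train.rows <- sample(nrow(sim), size = round(0.75 * nrow(sim)))
train <- sim[train.rows, ]
test <- sim[-train.rows, ]

# Rank-based AUC (Mann-Whitney identity).
auc <- function(scores, labels) {
  labels <- as.logical(labels)
  n.pos <- sum(labels)
  n.neg <- sum(!labels)
  (sum(rank(scores)[labels]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Metadata only, then metadata plus per-user intercepts.
fit.1 <- glm(Installed ~ Feature, data = train,
             family = binomial(link = "logit"))
fit.2 <- glm(Installed ~ Feature + User, data = train,
             family = binomial(link = "logit"))

auc.1 <- auc(predict(fit.1, newdata = test, type = "response"), test$Installed)
auc.2 <- auc(predict(fit.2, newdata = test, type = "response"), test$Installed)
c(metadata.only = auc.1, with.user.intercepts = auc.2)
```

The per-user intercept lets the model learn that some users simply install more packages than others, which is one plausible reason for the jump from Model 1 to Model 2.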
20. Future Work
What makes a package useful?
Need subjective ratings
Some packages are only installed because they’re
dependencies for other popular packages
21. Future Work
Get a better data sample:
Contest only used data from 52 users
But we do have complete data for those users
But data was not a random sample of R users
22. Future Work
Do more with LDA to categorize R packages
Prediction task allows us to evaluate the “quality” of the topic
count and the topic assignments
23. Future Work
Build up various package-package similarity matrices for
conditional recommendations
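One possible form of such a matrix is Jaccard similarity over co-installation. The sketch below is an assumption about how this could look, not the authors' method: the installation matrix is invented toy data, and `recommend` is a hypothetical helper.

```r
# Toy user-by-package installation matrix (1 = installed); invented data.
installed <- matrix(
  c(1, 1, 0, 0,
    1, 1, 1, 0,
    0, 1, 1, 0,
    1, 0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("user", 1:4),
                  c("ggplot2", "plyr", "reshape", "Matrix"))
)

# Co-installation counts: entry (i, j) is the number of users with both.
both <- crossprod(installed)

# Jaccard similarity: users with both / users with at least one.
counts <- diag(both)
either <- outer(counts, counts, "+") - both
jaccard <- both / either

# Conditional recommendation: the k packages most similar to a given one.
recommend <- function(package, k = 2) {
  sims <- jaccard[package, colnames(jaccard) != package]
  names(sort(sims, decreasing = TRUE))[seq_len(k)]
}

recommend("ggplot2")
```

Other similarity matrices could be built the same way from shared dependencies, shared maintainers, or shared LDA topics instead of co-installation.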
24. Future Work
Can we understand the clustering in the network structure
graph?
25. Resources
For more information, see
The original Dataists’ contest announcement
GitHub project page