R Recommendation System Contest Analysis

R Recommendation System Contest

John Myles White

March 10, 2011

Kaggle

Kaggle is a platform for data prediction competitions
that allows organizations to post their data and have it
scrutinized by the world’s best data scientists.


Kaggle Features

Kaggle provides every contest with:
Centralized data downloads
Public and private leaderboards using RMSE, AUC and other
metrics
Public discussion forums for participants to use
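
As a rough illustration of the two leaderboard metrics named above, the following sketch computes RMSE and AUC in base R. The function names `rmse` and `auc` and the toy vectors are assumptions made for this example; they are not Kaggle's scoring code.

```r
# Root mean squared error between numeric predictions and observed values.
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive case is scored above a randomly chosen negative one.
# rank() assigns average ranks to ties, so tied pairs count as one half.
auc <- function(scores, labels) {
  ranks <- rank(scores)
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  (sum(ranks[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Toy data, invented for illustration only.
actual <- c(1, 0, 1, 1, 0)
scores <- c(0.9, 0.2, 0.7, 0.4, 0.5)

print(rmse(scores, actual))
print(auc(scores, actual))
```

Lower RMSE is better, while AUC runs from 0.5 (random ordering) to 1 (perfect ordering), which is why the two scores on a leaderboard are not directly comparable.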



Recent Kaggle Contests

Tourism Forecasting
Chess Ratings: Elo versus the Rest of the World
INFORMS 2010: Short Term Stock Price Movements


Current and Upcoming Kaggle Contests

Arabic Writer Identification
Don’t Overfit: Dealing with Many Variables and Few
Observations
Heritage Health Prize


Advice on Running Kaggle Contests

Stay involved: respond to forum posts quickly and make the
contest seem alive
Don’t use a prediction task where near perfect accuracy can
be achieved


Mistakes We Made

Netflix Prize: 0.8616 RMSE
R Recommendation Contest: 0.9882 AUC


The R Recommendation System Contest

Contestants must be able to predict whether a user U will
have a package P installed on their system


Full Data Set

Outcomes: List of all packages installed on 52 R users’
systems
Predictors: Metadata about 2485 CRAN packages


Metadata

Dependencies
Suggests
Imports
Views
Core
Recommended
Maintainer
Maintainer’s Package Count


Training Data / Test Data Split

Uniform random split over rows in full data set
Training Set: 99373 rows
Test Set: 33125 rows
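
A uniform random split over rows can be sketched as below. The 75% training fraction is inferred from the row counts above (99373 of 132498 rows), and `full.data` is a small hypothetical stand-in, not the contest data.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in for the full data set: one row per (user, package) pair.
full.data <- expand.grid(User = 1:4, Package = paste0("pkg", 1:10))
full.data$Installed <- rbinom(nrow(full.data), size = 1, prob = 0.3)

# Draw 75% of the row indices uniformly at random, without replacement.
n <- nrow(full.data)
training.rows <- sample(n, size = floor(0.75 * n))

# Rows drawn go to the training set; all remaining rows form the test set.
training.data <- full.data[training.rows, ]
test.data <- full.data[-training.rows, ]

print(c(training = nrow(training.data), test = nrow(test.data)))
```

Because the split is over rows rather than over users, the same user typically appears in both sets, which is what makes per-user terms usable at prediction time.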


Additional Metadata

LDA topic assignments for CRAN packages
Used 25 topics
Used all documentation: manuals, vignettes, etc.


Example Models

1. Package Metadata
2. Package Metadata + Per User Intercepts
3. Package Metadata + Per User Intercepts + Package Topic
Assignments


Example Model 1

# Load the project's data and helper code via ProjectTemplate
library('ProjectTemplate')
try(load.project())

# Logistic regression on package metadata only
logit.fit <- glm(Installed ~ LogDependencyCount +
                             LogSuggestionCount +
                             LogImportCount +
                             LogViewsIncluding +
                             LogPackagesMaintaining +
                             CorePackage +
                             RecommendedPackage,
                 data = training.data,
                 family = binomial(link = 'logit'))
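
To try this kind of model without the contest files, here is a minimal sketch on simulated data. Only two of the column names from the slide are reused, and every value, including the coefficients that generate the outcome, is invented for illustration.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated stand-in for the training data (not the contest data).
n <- 500
toy.data <- data.frame(
  LogDependencyCount = log(1 + rpois(n, 2)),
  CorePackage = rbinom(n, 1, 0.1)
)

# Simulated outcome: installation is more likely for core packages and
# for packages with more dependencies (an assumption made for this toy).
linear.predictor <- -1 + 0.8 * toy.data$LogDependencyCount +
                         1.5 * toy.data$CorePackage
toy.data$Installed <- rbinom(n, 1, plogis(linear.predictor))

# Same model family as on the slide, with a shortened formula.
toy.fit <- glm(Installed ~ LogDependencyCount + CorePackage,
               data = toy.data,
               family = binomial(link = 'logit'))

# type = 'response' returns predicted installation probabilities
# rather than values on the log-odds scale.
predicted.probabilities <- predict(toy.fit, newdata = toy.data,
                                   type = 'response')
print(summary(predicted.probabilities))
```

The predicted probabilities, rather than hard 0/1 labels, are what an AUC-scored contest needs, since AUC depends only on how the cases are ranked.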


Example Model 2

# Excerpt from the glm() formula, which now ends with a per-user intercept
                             LogImportCount +
                             LogViewsIncluding +
                             CorePackage +
                             RecommendedPackage +
                             factor(User),


Example Model 3

# Excerpt from the glm() formula, which now also includes the LDA topic
                             LogImportCount +
                             LogViewsIncluding +
                             CorePackage +
                             RecommendedPackage +
                             factor(User) +
                             Topic,


Model Performance

Model 1: ∼ 0.80 AUC
Model 2: ∼ 0.95 AUC
Model 3: > 0.95 AUC
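
The jump from Model 1 to Model 2 can be reproduced in spirit on simulated data, where users differ in how many packages they install. This is a sketch under invented assumptions (30 users, one metadata column, made-up effect sizes); its AUC values are not the contest's.

```r
set.seed(2011)  # arbitrary seed for reproducibility

# Synthetic data, invented for illustration: 30 users, 100 packages each.
n.users <- 30
n.packages <- 100
toy.data <- expand.grid(User = 1:n.users, Package = 1:n.packages)
user.effect <- rnorm(n.users, mean = -1, sd = 1)  # per-user install propensity
package.feature <- rnorm(n.packages)              # stand-in metadata column
toy.data$LogDependencyCount <- package.feature[toy.data$Package]
toy.data$Installed <- rbinom(nrow(toy.data), 1,
                             plogis(user.effect[toy.data$User] +
                                    0.5 * toy.data$LogDependencyCount))

# Uniform random 75/25 split over rows.
training.rows <- sample(nrow(toy.data), size = floor(0.75 * nrow(toy.data)))
training.data <- toy.data[training.rows, ]
test.data <- toy.data[-training.rows, ]

# AUC via the rank-sum identity (ties get average ranks).
auc <- function(scores, labels) {
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  (sum(rank(scores)[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Metadata only, then metadata plus per-user intercepts.
fit.1 <- glm(Installed ~ LogDependencyCount,
             data = training.data, family = binomial(link = 'logit'))
fit.2 <- glm(Installed ~ LogDependencyCount + factor(User),
             data = training.data, family = binomial(link = 'logit'))

# Score both models on the held-out rows.
auc.1 <- auc(predict(fit.1, newdata = test.data, type = 'response'),
             test.data$Installed)
auc.2 <- auc(predict(fit.2, newdata = test.data, type = 'response'),
             test.data$Installed)
print(c(Model1 = auc.1, Model2 = auc.2))
```

In this toy setup the per-user intercept helps because it absorbs each user's overall tendency to install packages, which the package metadata alone cannot capture.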


Unexploited Structure in Data


Future Work

What makes a package useful?
Need subjective ratings
Some packages are only installed because they’re
dependencies for other popular packages


Future Work

Get a better data sample:
Contest only used data from 52 users
But we do have complete data for those users
But data was not a random sample of R users


Future Work

Do more with LDA to categorize R packages
The prediction task lets us evaluate the “quality” of the topic
count and topic assignments


Future Work

Build up various package-package similarity matrices for
conditional recommendations
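
One possible form of such a matrix is cosine similarity over co-installations. In this sketch the install matrix is invented, the package names serve only as labels, and `recommend.given` is a hypothetical helper; other similarity measures would fit the same pattern.

```r
# Hypothetical user-by-package install matrix (1 = installed).
installs <- matrix(c(1, 1, 0, 0,
                     1, 1, 1, 0,
                     0, 1, 1, 0,
                     1, 0, 0, 1),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(paste0("user", 1:4),
                                   c("ggplot2", "plyr", "reshape", "lme4")))

# Co-installation counts: entry (i, j) is the number of users who have
# both package i and package j; the diagonal holds per-package counts.
co.installs <- crossprod(installs)

# Cosine similarity: divide by the geometric mean of the install counts.
install.counts <- diag(co.installs)
similarity <- co.installs / sqrt(outer(install.counts, install.counts))

# Conditional recommendation: the k packages most similar to one the
# user already has, excluding that package itself.
recommend.given <- function(package, k = 2) {
  scores <- similarity[package, colnames(similarity) != package]
  head(sort(scores, decreasing = TRUE), k)
}

print(round(similarity, 3))
print(recommend.given("plyr"))
```

With only 52 users behind the real data, similarities for rarely installed packages would rest on very few co-installations, so some smoothing or a minimum-count threshold would likely be needed.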


Future Work

Can we understand the clustering in the network structure
graph?


Resources

For more information, see
The original Dataists’ contest announcement
GitHub project page

