R Recommendation System Contest

           John Myles White


             March 10, 2011




      John Myles White   R Recommendation System Contest
Kaggle




         Kaggle is a platform for data prediction competitions
         that allows organizations to post their data and have it
         scrutinized by the world’s best data scientists.




Kaggle Features




   Kaggle provides every contest with:
       Centralized data downloads
       Public and private leaderboards using RMSE, AUC and other
       metrics
       Public discussion forums for participants to use
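The two leaderboard metrics named above can be sketched in base R. This is only an illustration of the definitions, not Kaggle's scoring code; the function names `rmse` and `auc` are assumptions made for this example.

```r
# Root-mean-squared error between observed and predicted values.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# AUC via the rank-sum (Mann-Whitney) identity.
# labels are 0/1; tied scores receive average ranks.
auc <- function(labels, scores) {
  ranks <- rank(scores)
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  (sum(ranks[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Toy inputs, invented for illustration.
rmse(c(1, 2, 3), c(1, 2, 5))
auc(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))
```

Lower RMSE is better, while AUC runs from 0.5 (random ranking) to 1 (perfect ranking), so the two numbers are not directly comparable.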




Recent Kaggle Contests




      Tourism Forecasting
      Chess Ratings: Elo versus the Rest of the World
      INFORMS 2010: Short Term Stock Price Movements




Current and Upcoming Kaggle Contests




      Arabic Writer Identification
      Don’t Overfit: Dealing with Many Variables and Few
      Observations
      Heritage Health Prize




Advice on Running Kaggle Contests




      Stay involved: respond to forum posts quickly and make the
      contest seem alive
      Don’t use a prediction task where near-perfect accuracy can
      be achieved




Mistakes We Made




     Netflix Prize: 0.8616 RMSE
     R Recommendation Contest: 0.9882 AUC




The R Recommendation System Contest




      Contestants must be able to predict whether a user U will
      have a package P installed on their system




Full Data Set




      Outcomes: List of all packages installed on 52 R users’
      systems
      Predictors: Metadata about 2485 CRAN packages




Metadata



     Dependencies
     Suggests
     Imports
     Views
     Core
     Recommended
     Maintainer
     Maintainer’s Package Count




Training Data / Test Data Split




      Uniform random split over rows in full data set
      Training Set: 99373 rows
      Test Set: 33125 rows
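The row counts above work out to roughly a 75/25 split (99373 of 132498 rows). A uniform random split over rows might be sketched as follows; `full.data` is a small simulated stand-in, not the contest data, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical stand-in for the full table: one row per (user, package) pair.
full.data <- data.frame(User = rep(1:4, each = 25),
                        Package = rep(1:25, times = 4),
                        Installed = rbinom(100, 1, 0.3))

# Draw 75% of the row indices uniformly at random, without replacement.
n <- nrow(full.data)
train.rows <- sample(n, size = floor(0.75 * n))

training.data <- full.data[train.rows, ]
test.data <- full.data[-train.rows, ]
```

Because the split is over rows rather than over users, every user can appear in both sets, which is what makes per-user intercepts usable at prediction time.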




Additional Metadata




      LDA topic assignments for CRAN packages
      Used 25 topics
      Used all documentation: manuals, vignettes, etc.




Example Models




    1. Package Metadata
    2. Package Metadata + Per User Intercepts
    3. Package Metadata + Per User Intercepts + Package Topic
       Assignments




Example Model 1


  # Load the project environment with ProjectTemplate.
  library('ProjectTemplate')
  try(load.project())

  # Model 1: logistic regression on package metadata only.
  logit.fit <- glm(Installed ~ LogDependencyCount +
                               LogSuggestionCount +
                               LogImportCount +
                               LogViewsIncluding +
                               LogPackagesMaintaining +
                               CorePackage +
                               RecommendedPackage,
                   data = training.data,
                   family = binomial(link = 'logit'))



Example Model 2



  # Model 2: package metadata plus a per-user intercept via factor(User).
  logit.fit <- glm(Installed ~ LogDependencyCount +
                               LogSuggestionCount +
                               LogImportCount +
                               LogViewsIncluding +
                               LogPackagesMaintaining +
                               CorePackage +
                               RecommendedPackage +
                               factor(User),
                   data = training.data,
                   family = binomial(link = 'logit'))




Example Model 3


  # Model 3: Model 2 plus the package's LDA topic assignment.
  logit.fit <- glm(Installed ~ LogDependencyCount +
                               LogSuggestionCount +
                               LogImportCount +
                               LogViewsIncluding +
                               LogPackagesMaintaining +
                               CorePackage +
                               RecommendedPackage +
                               factor(User) +
                               Topic,
                   data = training.data,
                   family = binomial(link = 'logit'))




Model Performance




      Model 1: ∼ 0.80 AUC
      Model 2: ∼ 0.95 AUC
      Model 3: > 0.95 AUC
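AUC figures like these come from scoring held-out rows with a fitted model. The sketch below shows that workflow on simulated data; the data frame `sim`, the coefficients, and the seed are all invented for illustration, and the resulting number has no relation to the contest results.

```r
set.seed(42)  # arbitrary seed

# Simulated stand-in data: one predictor with a real effect on the outcome.
n <- 400
sim <- data.frame(LogDependencyCount = rnorm(n))
sim$Installed <- rbinom(n, 1, plogis(-1 + 1.5 * sim$LogDependencyCount))

train <- sim[1:300, ]
test <- sim[301:400, ]

fit <- glm(Installed ~ LogDependencyCount,
           data = train,
           family = binomial(link = 'logit'))

# Predicted installation probabilities for the held-out rows.
p <- predict(fit, newdata = test, type = 'response')

# AUC as the share of (installed, not installed) pairs that the model
# ranks in the right order, counting ties as half.
pos <- p[test$Installed == 1]
neg <- p[test$Installed == 0]
test.auc <- mean(outer(pos, neg, '>') + 0.5 * outer(pos, neg, '=='))
```

The pairwise form is quadratic in the number of rows, so it suits small checks like this one; a rank-based formula is the usual choice at the scale of the contest's test set.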




Unexploited Structure in Data




Future Work




  What makes a package useful?
      Need subjective ratings
      Some packages are only installed because they’re
      dependencies for other popular packages




Future Work




  Get a better data sample:
      Contest only used data from 52 users
      But we do have complete data for those users
      But data was not a random sample of R users




Future Work




      Do more with LDA to categorize R packages
      Prediction task allows us to evaluate the “quality” of the topic
      count and topic assignments




Future Work




      Build up various package-package similarity matrices for
      conditional recommendations
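One possible similarity matrix is Jaccard similarity over co-installation counts. The slides do not say which measure was intended, so this is only one candidate; the toy install matrix and its user names are invented for illustration.

```r
# Toy install matrix (hypothetical data): rows are users, columns are packages.
installs <- matrix(c(1, 1, 0,
                     1, 1, 0,
                     1, 0, 1,
                     0, 0, 1),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(paste0('user', 1:4),
                                   c('ggplot2', 'plyr', 'zoo')))

# co[i, j] = number of users with both package i and package j installed.
co <- crossprod(installs)
counts <- diag(co)

# Jaccard similarity: users with both packages / users with either package.
jaccard <- co / (outer(counts, counts, '+') - co)

# Conditional recommendation: given ggplot2, rank the remaining packages.
others <- setdiff(colnames(jaccard), 'ggplot2')
ranked <- sort(jaccard['ggplot2', others], decreasing = TRUE)
```

Here `ranked` lists `plyr` ahead of `zoo`, since two of the three ggplot2 users also have plyr installed.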




Future Work




      Can we understand the clustering in the network structure
      graph?




Resources




   For more information, see
       The original Dataists’ contest announcement
       GitHub project page




