Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial

There is more to recommendation algorithms than rating prediction, and there is more to recommender systems than algorithms. In this tutorial, given at the 2012 ACM Recommender Systems Conference in Dublin, I review topics such as interaction and user feedback mechanisms, offline experimentation and A/B testing, and software architectures for recommender systems.

  1. 1. Building Industrial-scale Real-world Recommender Systems. September 11, 2012. Xavier Amatriain, Personalization Science and Engineering, Netflix. @xamat
  2. 2. Outline
      1. Anatomy of Netflix Personalization
      2. Data & Models
      3. Consumer (Data) Science
      4. Architectures
  3. 3. Anatomy of Netflix Personalization: Everything is a Recommendation
  4. 4. Everything is personalized. [Screenshot labels: Ranking, Rows.] Note: Recommendations are per household, not individual user.
  5. 5. Top 10: personalization awareness and diversity. [Figure: Top 10 row with titles labeled by the household members they target: All, Dad, Dad & Mom, Daughter, Son, Mom.]
  6. 6. Support for Recommendations: Social Support
  7. 7. Watch again & Continue Watching
  8. 8. Genres
  9. 9. Genre rows
      • Personalized genre rows focus on user interest
        • Also provide context and “evidence”
        • Important for member satisfaction: moving personalized rows to top on devices increased retention
      • How are they generated?
        • Implicit: based on user’s recent plays, ratings, & other interactions
        • Explicit taste preferences
        • Hybrid: combine the above
        • Also take into account:
          • Freshness: has this been shown before?
          • Diversity: avoid repeating tags and genres, limit number of TV genres, etc.
  10. 10. Genres - personalization
  11. 11. Genres - personalization
  12. 12. Genres - explanations
  13. 13. Genres - explanations
  14. 14. Genres - user involvement
  15. 15. Genres - user involvement
  16. 16. Similars
      • Displayed in many different contexts
      • In response to user actions/context (search, queue add…)
      • More like… rows
  17. 17. Anatomy of Personalization - Recap
      • Everything is a recommendation: not only rating prediction, but also ranking, row selection, similarity…
      • We strive to make it easy for the user, but…
      • We want the user to be aware and be involved in the recommendation process
      • Deal with implicit/explicit and hybrid feedback
      • Add support/explanations for recommendations
      • Consider issues such as diversity or freshness
  18. 18. Data & Models
  19. 19. Big Data: Plays, Behavior, Geo-Information, Time, Ratings, Searches, Impressions, Device info, Metadata, Social, …
  20. 20. Big Data
      • 25M+ subscribers at Netflix
      • Ratings: 4M/day
      • Searches: 3M/day
      • Plays: 30M/day
      • 2B hours streamed in Q4 2011
      • 1B hours in June 2012
  21. 21. Models: Logistic/linear regression, Elastic nets, Matrix Factorization, Markov Chains, Clustering, LDA, Association Rules, Gradient Boosted Decision Trees, …
  22. 22. Rating Prediction
  23. 23. 2007 Progress Prize
      • KorBell team (AT&T) improved on Netflix's Cinematch baseline by 8.43%
      • Spent ~2,000 hours
      • Combined 107 prediction algorithms with a linear equation
      • Gave us the source code
  24. 24. 2007 Progress Prize
      • Top 2 algorithms
        • SVD - Prize RMSE: 0.8914
        • RBM - Prize RMSE: 0.8990
      • Linear blend - Prize RMSE: 0.88
      • Limitations
        • Designed for 100M ratings, we have 5B ratings
        • Not adaptable as users add ratings
        • Performance issues
      • Currently in use as part of Netflix's rating prediction component
  25. 25. SVD: $X_{[m \times n]} = U_{[m \times r]} \, S_{[r \times r]} \, (V_{[n \times r]})^{T}$
      • X: m x n matrix (e.g., m users, n videos)
      • U: m x r matrix (m users, r concepts)
      • S: r x r diagonal matrix (strength of each ‘concept’; r: rank of the matrix)
      • V: n x r matrix (n videos, r concepts)
  26. 26. Simon Funk’s SVD
      • One of the most interesting findings during the Netflix Prize came out of a blog post
      • Incremental, iterative, and approximate way to compute the SVD using gradient descent
  27. 27. SVD for Rating Prediction
      • Associate each user with a user-factors vector $p_u \in \mathbb{R}^f$
      • Associate each item with an item-factors vector $q_v \in \mathbb{R}^f$
      • Define a baseline estimate $b_{uv} = \mu + b_u + b_v$ to account for user and item deviation from the average
      • Predict the rating using the rule $\hat{r}_{uv} = b_{uv} + p_u^{T} q_v$
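
As a rough illustration of the prediction rule on this slide, combined with the gradient-descent idea behind Simon Funk's approach, the sketch below fits the baseline terms and the factor vectors on a handful of made-up ratings. The function name `train_biased_mf`, the hyperparameters (`n_factors`, `lr`, `reg`, `n_epochs`) and the toy data are assumptions for illustration only; they are not taken from the tutorial or from Netflix's system.

```python
import random


def train_biased_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit mu, b_u, b_v, p_u and q_v by stochastic gradient descent.

    ratings: list of (user, item, rating) tuples.
    Returns a predict(user, item) function for users/items seen in training.
    """
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({v for _, v, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global average
    b_u = {u: 0.0 for u in users}                       # user deviations
    b_v = {v: 0.0 for v in items}                       # item deviations
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {v: [rng.gauss(0, 0.1) for _ in range(n_factors)] for v in items}

    def predict(u, v):
        # r_hat_uv = mu + b_u + b_v + p_u^T q_v
        return mu + b_u[u] + b_v[v] + sum(pf * qf for pf, qf in zip(p[u], q[v]))

    for _ in range(n_epochs):
        for u, v, r in ratings:
            err = r - predict(u, v)
            # Gradient step on the regularized squared error
            b_u[u] += lr * (err - reg * b_u[u])
            b_v[v] += lr * (err - reg * b_v[v])
            for f in range(n_factors):
                pu, qv = p[u][f], q[v][f]
                p[u][f] += lr * (err * qv - reg * pu)
                q[v][f] += lr * (err * pu - reg * qv)
    return predict


# Hypothetical toy ratings (user, item, stars)
toy_ratings = [
    ("ann", "a", 5), ("ann", "b", 3), ("bob", "a", 4), ("bob", "c", 1),
    ("cat", "b", 2), ("cat", "c", 1), ("dan", "a", 5), ("dan", "c", 2),
]
predict = train_biased_mf(toy_ratings)
```

Because each update touches only one rating at a time, new ratings can be folded in incrementally, which is part of what made this family of methods attractive.
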
  28. 28. SVD++
      • Koren et al. proposed an asymmetric variation that includes implicit feedback: $\hat{r}_{uv} = b_{uv} + q_v^{T} \left( |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} (r_{uj} - b_{uj}) \, x_j + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \right)$
      • Where
        • $q_v, x_v, y_v \in \mathbb{R}^f$ are three item factor vectors
        • Users are not parametrized, but rather represented by:
          • R(u): items rated by user u
          • N(u): items for which the user has given an implicit preference (e.g. rated vs. not rated)
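
To make the asymmetric rule concrete, here is a small sketch that evaluates it for one user and one item. It only computes a prediction from factors that are assumed to be already learned; the function name `predict_asymmetric`, its argument layout and the numbers in the usage lines are hypothetical.

```python
def predict_asymmetric(b_uv, q_v, rated, implicit, baselines, x, y):
    """Evaluate the asymmetric prediction rule for one (user, item) pair.

    b_uv:      baseline estimate for the target pair
    q_v:       factor vector of the target item
    rated:     dict item -> rating r_uj, for j in R(u)
    implicit:  list of items in N(u)
    baselines: dict item -> baseline b_uj
    x, y:      dict item -> factor vector (explicit / implicit item factors)
    """
    f = len(q_v)
    profile = [0.0] * f  # the user is represented only through their items
    if rated:
        norm_r = len(rated) ** -0.5          # |R(u)|^(-1/2)
        for j, r_uj in rated.items():
            for k in range(f):
                profile[k] += norm_r * (r_uj - baselines[j]) * x[j][k]
    if implicit:
        norm_n = len(implicit) ** -0.5       # |N(u)|^(-1/2)
        for j in implicit:
            for k in range(f):
                profile[k] += norm_n * y[j][k]
    return b_uv + sum(qk * pk for qk, pk in zip(q_v, profile))


# Hypothetical values: one rated item "a", which is also the implicit signal
example_prediction = predict_asymmetric(
    b_uv=3.5,
    q_v=[1.0, 0.0],
    rated={"a": 4.0},
    implicit=["a"],
    baselines={"a": 3.0},
    x={"a": [2.0, 0.0]},
    y={"a": [0.5, 0.0]},
)
```

Since there is no per-user vector, a new user can be scored as soon as they have rated or interacted with a few items, without retraining.
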
  29. 29. RBM
  30. 30. First generation neural networks (~60s)
      • Perceptrons (~1960)
        • Single layer of hand-coded features
        • Linear activation function
        • Fundamentally limited in what they can learn to do.
      [Figure: perceptron with input units (features), non-adaptive hand-coded features, and output units (class labels, e.g. Like/Hate).]
  31. 31. Second generation neural networks (~80s)
      • Compare output to correct answer to compute error signal
      • Back-propagate error signal to get derivatives for learning
      [Figure: network with input features, hidden layers, non-linear activation function, and outputs.]
  32. 32. Belief Networks (~90s)
      • Directed acyclic graph composed of stochastic variables with weighted connections.
      • Can observe some of the variables
      • Solve two problems:
        • Inference: Infer the states of the unobserved variables.
        • Learning: Adjust the interactions between variables to make the network more likely to generate the observed data.
      [Figure: stochastic hidden cause connected to visible effect.]
  33. 33. Restricted Boltzmann Machine
      • Restrict the connectivity to make learning easier.
        • Only one layer of hidden units.
          • Although multiple layers are possible
        • No connections between hidden units.
      • Hidden units are independent given the visible states.
        • So we can quickly get an unbiased sample from the posterior distribution over hidden “causes” when given a data-vector
      • RBMs can be stacked to form Deep Belief Nets (DBN)
      [Figure: bipartite graph of hidden units j and visible units i.]
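
The independence of hidden units given the visible states is what makes sampling cheap: each hidden unit can be sampled on its own from a sigmoid of its input. The sketch below shows this for a plain binary RBM. The function names and the tiny weight matrix are assumptions for illustration; the RBM used for the Netflix Prize was a more elaborate variant and is not reproduced here.

```python
import math
import random


def hidden_probabilities(visible, weights, hidden_bias):
    """P(h_j = 1 | v) = sigmoid(c_j + sum_i v_i * W[i][j]) for each hidden unit j."""
    probs = []
    for j, c_j in enumerate(hidden_bias):
        activation = c_j + sum(v_i * weights[i][j] for i, v_i in enumerate(visible))
        probs.append(1.0 / (1.0 + math.exp(-activation)))
    return probs


def sample_hidden(visible, weights, hidden_bias, rng):
    """Draw every hidden unit independently, which is valid because there are
    no hidden-to-hidden connections."""
    return [1 if rng.random() < p else 0
            for p in hidden_probabilities(visible, weights, hidden_bias)]


# Hypothetical RBM with 2 visible and 2 hidden units
W = [[0.0, 1.0],
     [0.0, -1.0]]
probs = hidden_probabilities([1, 0], W, [0.0, 0.0])
sample = sample_hidden([1, 0], W, [0.0, 0.0], random.Random(0))
```
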
  34. 34. RBM for the Netflix Prize
  35. 35. What about the final prize ensembles?
      • Our offline studies showed they were too computationally intensive to scale
      • Expected improvement not worth the engineering effort
      • Plus, focus had already shifted to other issues that had more impact than rating prediction...
  36. 36. Ranking: key algorithm, sorts titles in most contexts
  37. 37. Ranking
      • Ranking = Scoring + Sorting + Filtering bags of movies for presentation to a user
      • Goal: Find the best possible ordering of a set of videos for a user within a specific context in real-time
      • Objective: maximize consumption
      • Aspirations: Played & “enjoyed” titles have best score
      • Akin to CTR forecast for ads/search results
      • Factors
        • Accuracy
        • Novelty
        • Diversity
        • Freshness
        • Scalability
        • …
  38. 38. Ranking
      • Popularity is the obvious baseline
      • Ratings prediction is a clear secondary data input that allows for personalization
      • We have added many other features (and tried many more that have not proved useful)
      • What about the weights?
        • Based on A/B testing
        • Machine-learned
  39. 39. Example: Two features, linear model
      • Linear model: $f_{rank}(u,v) = w_1 \, p(v) + w_2 \, r(u,v) + b$
      [Figure: titles plotted by Popularity (x-axis) against Predicted Rating (y-axis, 1 to 5), with the direction of the Final Ranking.]
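
A minimal sketch of the two-feature linear model on this slide: score each title from its popularity and the user's predicted rating, then sort. The function name `rank_titles`, the candidate tuples and the weights are made up for illustration; in practice the weights would come from A/B testing or a learned model, as the previous slide notes.

```python
def rank_titles(candidates, w1, w2, b=0.0):
    """Order titles by f_rank(u, v) = w1 * p(v) + w2 * r(u, v) + b.

    candidates: list of (title, popularity p(v), predicted rating r(u, v)).
    """
    scored = [(w1 * pop + w2 * rating + b, title)
              for title, pop, rating in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best score first
    return [title for _, title in scored]


# Hypothetical candidates for one user
candidates = [("A", 0.9, 2.0), ("B", 0.2, 4.5), ("C", 0.5, 3.0)]
by_popularity = rank_titles(candidates, w1=1.0, w2=0.0)
by_rating = rank_titles(candidates, w1=0.0, w2=1.0)
blended = rank_titles(candidates, w1=2.0, w2=1.0)
```

Setting `w2` to zero recovers the popularity baseline, and setting `w1` to zero ranks purely by predicted rating; the interesting cases sit in between.
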
  40. 40. Results
  41. 41. Learning to rank
      • Machine learning problem: goal is to construct ranking model from training data
      • Training data can have partial order or binary judgments (relevant/not relevant).
      • Resulting order of the items typically induced from a numerical score
      • Learning to rank is a key element for personalization
      • You can treat the problem as a standard supervised classification problem
  42. 42. Learning to Rank Approaches
      1. Pointwise
        • Ranking function minimizes loss function defined on individual relevance judgment
        • Ranking score based on regression or classification
        • Ordinal regression, Logistic regression, SVM, GBDT, …
      2. Pairwise
        • Loss function is defined on pair-wise preferences
        • Goal: minimize number of inversions in ranking
        • Ranking problem is then transformed into the binary classification problem
        • RankSVM, RankBoost, RankNet, FRank…
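
The pairwise idea can be sketched in a few lines: every pair of items with different relevance becomes one binary example (the difference of their feature vectors, labeled by which item should rank higher), and the quantity to minimize is the number of inversions. The helper names `pairwise_examples` and `count_inversions` and the toy data are assumptions for illustration; any binary classifier could then be trained on the generated examples.

```python
from itertools import combinations


def pairwise_examples(items):
    """Turn (feature_vector, relevance) items into binary classification examples.

    Returns a list of (difference_vector, label) with label +1 if the first
    item of the pair should rank above the second, and -1 otherwise.
    """
    examples = []
    for (x_a, rel_a), (x_b, rel_b) in combinations(items, 2):
        if rel_a == rel_b:
            continue  # ties carry no preference information
        diff = [a - b for a, b in zip(x_a, x_b)]
        examples.append((diff, 1 if rel_a > rel_b else -1))
    return examples


def count_inversions(scores, relevances):
    """Number of item pairs that the scores order opposite to the relevances."""
    inversions = 0
    for a, b in combinations(range(len(scores)), 2):
        if (scores[a] - scores[b]) * (relevances[a] - relevances[b]) < 0:
            inversions += 1
    return inversions


# Hypothetical items: (features, relevance)
toy_items = [([1.0, 0.0], 2), ([0.0, 1.0], 1), ([0.5, 0.5], 1)]
toy_examples = pairwise_examples(toy_items)
```
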
  43. 43. Learning to rank - metrics
      • Quality of ranking measured using metrics such as
        • Normalized Discounted Cumulative Gain (NDCG): $NDCG = \frac{DCG}{IDCG}$, where $DCG = relevance_1 + \sum_{i=2}^{n} \frac{relevance_i}{\log_2 i}$ and $IDCG$ is the DCG of the ideal ranking
        • Mean Reciprocal Rank (MRR): $MRR = \frac{1}{|H|} \sum_{h_i \in H} \frac{1}{rank(h_i)}$, where $h_i$ are the positive “hits” from the user
        • Mean Average Precision (MAP): $MAP = \frac{\sum_{n=1}^{N} AveP(n)}{N}$, where $N$ can be the number of users, items… and $P = \frac{tp}{tp + fp}$
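
The three metrics on this slide can be sketched directly from their definitions. The DCG below follows the slide's form (first relevance undiscounted, then a $\log_2 i$ discount). Normalizing average precision by the number of hits found in the list is one common convention and is an assumption here, as are the function names.

```python
import math


def dcg(relevances):
    """DCG = rel_1 + sum_{i>=2} rel_i / log2(i), for relevances in ranked order."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def mean_reciprocal_rank(hit_ranks):
    """hit_ranks: 1-based rank of each positive hit h_i."""
    return sum(1.0 / rank for rank in hit_ranks) / len(hit_ranks)


def average_precision(hits):
    """hits: 0/1 flags in ranked order; averages precision at each hit position."""
    found, total = 0, 0.0
    for k, hit in enumerate(hits, start=1):
        if hit:
            found += 1
            total += found / k  # precision tp / (tp + fp) at cut-off k
    return total / found if found else 0.0


def mean_average_precision(hit_lists):
    """Mean of average precision over N lists (e.g. one per user)."""
    return sum(average_precision(h) for h in hit_lists) / len(hit_lists)
```

For instance, `ndcg([3, 2, 1])` is 1.0 because the list is already in the ideal order, while `ndcg([1, 2, 3])` is lower.
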
  44. 44. Learning to rank - metrics
      • Quality of ranking measured using metrics such as
        • Fraction of Concordant Pairs (FCP)
          • Given items $x_i$ and $x_j$, user preference $P$ and a ranking method $R$, a concordant pair (CP) is $\{x_i, x_j\}$ s.t. $P(x_i) > P(x_j) \Leftrightarrow R(x_i) < R(x_j)$
          • Then $FCP = \frac{\sum_{i \neq j} CP(x_i, x_j)}{n(n-1)/2}$
        • Others…
      • But, it is hard to optimize machine-learned models directly on these measures
        • They are not differentiable
      • Recent research on models that directly optimize ranking measures
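
A small sketch of FCP as defined on this slide. It counts each unordered pair once, so that the count divided by $n(n-1)/2$ stays between 0 and 1; that reading of the sum, the treatment of tied pairs as non-concordant, and the function name are assumptions.

```python
from itertools import combinations


def fraction_concordant_pairs(preference, rank):
    """Fraction of item pairs that the ranker orders the way the user prefers.

    preference: dict item -> user preference P (higher is better)
    rank:       dict item -> position R given by the ranker (1 = top)
    """
    items = list(preference)
    n = len(items)
    if n < 2:
        return 0.0
    concordant = 0
    for a, b in combinations(items, 2):
        a_preferred_and_first = preference[a] > preference[b] and rank[a] < rank[b]
        b_preferred_and_first = preference[b] > preference[a] and rank[b] < rank[a]
        if a_preferred_and_first or b_preferred_and_first:
            concordant += 1
    return concordant / (n * (n - 1) / 2)


# Hypothetical user preferences and three candidate rankings
prefs = {"a": 5, "b": 3, "c": 1}
fcp_perfect = fraction_concordant_pairs(prefs, {"a": 1, "b": 2, "c": 3})
fcp_reversed = fraction_concordant_pairs(prefs, {"a": 3, "b": 2, "c": 1})
fcp_one_swap = fraction_concordant_pairs(prefs, {"a": 1, "b": 3, "c": 2})
```
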
  45. 45. Learning to Rank Approaches
      3. Listwise
        a. Directly optimizing IR measures (difficult since they are not differentiable)
          • Directly optimize IR measures through Genetic Programming
          • Directly optimize measures with Simulated Annealing
          • Gradient descent on smoothed version of objective function
          • SVM-MAP relaxes the MAP metric by adding it to the SVM constraints
          • AdaRank uses boosting to optimize NDCG
        b. Indirect Loss Function
          • RankCosine uses similarity between the ranking list and the ground truth as loss function
          • ListNet uses KL-divergence as loss function by defining a probability distribution
          • Problem: optimization in the listwise loss function does not necessarily optimize IR metrics
  46. 46. Similars
      • Different similarities computed from different sources: metadata, ratings, viewing data…
      • Similarities can be treated as data/features
      • Machine Learned models improve our concept of “similarity”
  47. 47. Data & Models - Recap
      • All sorts of feedback from the user can help generate better recommendations
      • Need to design systems that capture and take advantage of all this data
      • The right model is as important as the right data
      • It is important to come up with new theoretical models, but also need to think about application to a domain, and practical issues
      • Rating prediction models are only part of the solution to recommendation (think about ranking, similarity…)
  48. 48. Consumer (Data) Science
  49. 49. Consumer Science
      • Main goal is to effectively innovate for customers
      • Innovation goals
        • “If you want to increase your success rate, double your failure rate.” – Thomas Watson, Sr., founder of IBM
        • The only real failure is the failure to innovate
        • Fail cheaply
        • Know why you failed/succeeded
  50. 50. Consumer (Data) Science
      1. Start with a hypothesis:
        • Algorithm/feature/design X will increase member engagement with our service, and ultimately member retention
      2. Design a test
        • Develop a solution or prototype
        • Think about dependent & independent variables, control, significance…
      3. Execute the test
      4. Let data speak for itself
  51. 51. Offline/Online testing process. [Flowchart: Offline testing (days) → on success → Online A/B testing (weeks to months) → on success → Rollout feature to all users; on failure, go back.]
  52. 52. Offline testing process. [Flowchart; recoverable node labels: Initial Hypothesis, Decide Model, Train Model offline, Test offline, Hypothesis validated offline? (yes/no), Try different model? (yes/no), Reformulate Hypothesis, Online A/B testing, Wait for Results, Analyze Results, Significant improvement on users? (yes/no), Rollout Prototype, Rollout Feature to all users.]
  53. 53. Offline testing
      • Optimize algorithms offline
      • Measure model performance, using metrics such as:
        • Mean Reciprocal Rank, Normalized Discounted Cumulative Gain, Fraction of Concordant Pairs, Precision/Recall & F-measures, AUC, RMSE, Diversity…
      • Offline performance used as an indication to make informed decisions on follow-up A/B tests
      • A critical (and unsolved) issue is how offline metrics can correlate with A/B test results.
      • Extremely important to define a coherent offline evaluation framework (e.g. how to create training/testing datasets is not trivial)
  54. 54. Online A/B testing process. [Flowchart; recoverable node labels: Offline testing, Hypothesis validated offline? (yes/no), Design A/B Test, Choose Control Group, Rollout Prototype, Wait for Results, Analyze Results, Significant improvement on users? (yes/no), Try different model? (yes/no), Decide Model, Train Model offline, Reformulate Hypothesis, Rollout Feature to all users.]
  55. 55. Executing A/B tests
      • Many different metrics, but ultimately trust user engagement (e.g. hours of play and customer retention)
      • Think about significance and hypothesis testing
        • Our tests usually have thousands of members and 2-20 cells
      • A/B Tests allow you to try radical ideas or test many approaches at the same time.
        • We typically have hundreds of customer A/B tests running
      • Decisions on the product always data-driven
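
One textbook way to reason about significance for a retention-style metric is a two-proportion z-test between a control cell and a test cell. The sketch below shows that standard technique under a normal approximation; it is not presented as the test Netflix runs, and the cell sizes and retention counts are invented. With many cells, a correction for multiple comparisons would also be needed, which this sketch leaves out.

```python
import math


def two_proportion_z_test(retained_a, n_a, retained_b, n_b):
    """Two-sided z-test for a difference in retention rate between two cells.

    Returns (z, p_value); positive z means cell B retained a larger fraction.
    """
    p_a, p_b = retained_a / n_a, retained_b / n_b
    pooled = (retained_a + retained_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical cells: control retains 800 of 1000 members, test retains 850 of 1000
z, p_value = two_proportion_z_test(800, 1000, 850, 1000)
```
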
  56. 56. What to measure
      • OEC: Overall Evaluation Criteria
      • In an A/B test framework, the measure of success is key
      • Short-term metrics do not always align with long-term goals
        • E.g. CTR: generating more clicks might mean that our recommendations are actually worse
      • Use long-term metrics such as LTV (lifetime value) whenever possible
        • At Netflix, we use member retention
  57. 57. What to measure
      • Short-term metrics can sometimes be informative, and may allow for faster decision-making
        • At Netflix we use many, such as hours streamed by users or % of hours from a given algorithm
      • But, be aware of several caveats of using early decision mechanisms
      [Figure caption: Initial effects appear to trend. See “Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained” [Kohavi et al., KDD 2012]]
  58. 58. Consumer Data Science - Recap
      • Consumer Data Science aims to innovate for the customer by running experiments and letting data speak
      • This is mainly done through online A/B testing
      • However, we can speed up innovation by experimenting offline
      • But, both for online and offline experimentation, it is important to choose the right metric and experimental framework
  59. 59. Architectures
  60. 60. Technology http://
  61. 61. [Slide with no extractable text]
  62. 62. Event & Data Distribution
  63. 63. Event & Data Distribution
      • UI devices should broadcast many different kinds of user events
        • Clicks
        • Presentations
        • Browsing events
        • …
      • Events vs. data
        • Some events only need to be propagated and trigger an action (low latency, low information per event)
        • Others need to be processed and “turned into” data (higher latency, higher information quality).
        • And… there are many in between
      • Real-time event flow managed through internal tool (Manhattan)
      • Data flow mostly managed through Hadoop.
  64. 64. Offline Jobs
  65. 65. Offline Jobs
      • Two kinds of offline jobs
        • Model training
        • Batch offline computation of recommendations/intermediate results
      • Offline queries either in Hive or Pig
      • Need a publishing mechanism that solves several issues
        • Notify readers when result of query is ready
        • Support different repositories (S3, Cassandra…)
        • Handle errors, monitoring…
        • We do this through Hermes
  66. 66. Computation
  67. 67. Computation
      • Two ways of computing personalized results
        • Batch/offline
        • Online
      • Each approach has pros/cons
        • Offline
          + Allows more complex computations
          + Can use more data
          - Cannot react to quick changes
          - May result in staleness
        • Online
          + Can respond quickly to events
          + Can use most recent data
          - May fail because of SLA
          - Cannot deal with “complex” computations
      • It’s not an either/or decision
        • Both approaches can be combined
  68. 68. Signals & Models
  69. 69. Signals & Models
      • Both offline and online algorithms are based on three different inputs:
        • Models: previously trained from existing data
        • (Offline) Data: previously processed and stored information
        • Signals: fresh data obtained from live services
          • User-related data
          • Context data (session, date, time…)
  70. 70. Results
  71. 71. Results
      • Recommendations can be serviced from:
        • Previously computed lists
        • Online algorithms
        • A combination of both
      • The decision on where to service the recommendation from can respond to many factors including context.
      • Also, important to think about the fallbacks (what if plan A fails)
      • Previously computed lists/intermediate results can be stored in a variety of ways
        • Cache
        • Cassandra
        • Relational DB
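
The "what if plan A fails" point can be sketched as a simple fallback chain: try the online algorithm, fall back to a previously computed list, and finally to a non-personalized default. Everything named here (`serve_recommendations`, the in-memory dict standing in for a cache or Cassandra, the `TimeoutError` used to mimic an SLA miss) is a hypothetical stand-in, not the architecture described in the talk.

```python
def serve_recommendations(user_id, online_ranker, precomputed, default_list):
    """Return (recommendations, source), falling back when a stage fails.

    online_ranker: callable user_id -> list of titles; may raise on timeout
    precomputed:   dict user_id -> list of titles from an offline job
    default_list:  non-personalized list used as the last resort
    """
    try:
        return online_ranker(user_id), "online"
    except Exception:
        # Plan A failed (for example, the call exceeded its SLA); fall through
        pass
    cached = precomputed.get(user_id)
    if cached:
        return cached, "precomputed"
    return default_list, "fallback"


def failing_ranker(user_id):
    """Hypothetical online algorithm that always misses its SLA."""
    raise TimeoutError("SLA exceeded")


precomputed_lists = {"u1": ["A", "B"]}
popular_titles = ["P"]
```

Returning the source alongside the list makes it possible to monitor how often each path is actually used.
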
  72. 72. Alerts and Monitoring
      • A non-trivial concern in large-scale recommender systems
      • Monitoring: continuously observe quality of system
      • Alert: fast notification if quality of system goes below a certain pre-defined threshold
      • Questions:
        • What do we need to monitor?
        • How do we know something is “bad enough” to alert?
  73. 73. What to monitor
      • Staleness
        • Monitor time since last data update
      [Figure: chart annotated “Did something go wrong here?”]
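
A staleness check can be as simple as comparing the time since the last data update against a pre-defined limit. The sketch below does just that; the function name `staleness_alert`, the message format and the four-hour limit in the usage lines are assumptions for illustration.

```python
from datetime import datetime, timedelta


def staleness_alert(last_update, now, max_age):
    """Return an alert message if the data is older than max_age, else None."""
    age = now - last_update
    if age > max_age:
        return f"stale data: last update {age} ago (limit {max_age})"
    return None


# Hypothetical timestamps: data last refreshed six hours before the check
now = datetime(2012, 9, 11, 12, 0)
last_update = datetime(2012, 9, 11, 6, 0)
alert = staleness_alert(last_update, now, max_age=timedelta(hours=4))
```
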
  74. 74. What to monitor
      • Algorithmic quality
        • Monitor different metrics by comparing what users do and what your algorithm predicted they would do
  75. 75. What to monitor
      • Algorithmic quality
        • Monitor different metrics by comparing what users do and what your algorithm predicted they would do
      [Figure: chart annotated “Did something go wrong here?”]
  76. 76. What to monitor
      • Algorithmic source for users
        • Monitor how users interact with different algorithms
      [Figure: chart for Algorithm X with a “New version” marker, annotated “Did something go wrong here?”]
  77. 77. When to alert
      • Alerting thresholds are hard to tune
        • Avoid unnecessary alerts (the “learn-to-ignore problem”)
        • Avoid important issues being noticed before the alert happens
      • Rules of thumb
        • Alert on anything that will impact user experience significantly
        • Alert on issues that are actionable
        • If a noticeable event happens without an alert… add a new alert for next time
  78. 78. Conclusions
  79. 79. The Personalization Problem
      • The Netflix Prize simplified the recommendation problem to predicting ratings
      • But…
        • User ratings are only one of the many data inputs we have
        • Rating predictions are only part of our solution
        • Other algorithms such as ranking or similarity are very important
      • We can reformulate the recommendation problem
        • Function to optimize: probability a user chooses something and enjoys it enough to come back to the service
  80. 80. More to Recsys than Algorithms
      • Not only is there more to algorithms than rating prediction
      • There is more to Recsys than algorithms
        • User Interface & Feedback
        • Data
        • A/B Testing
        • Systems & Architectures
  81. 81. More data + Better models + More accurate metrics + Better approaches & architectures. Lots of room for improvement!
  82. 82. We’re hiring! Xavier Amatriain (@xamat)