Big, Practical Recommendations
with Alternating Least Squares

Sean Owen • Apache Mahout / Myrrix.com
WHERE’S BIG LEARNING?
 • Next: Application Layer
    - Analytics
    - Machine Learning
 • Like Apache Mahout
    - Common Big Data app today
    - Clustering, recommenders, classifiers on Hadoop
    - Free, open source; not mature
 • Where’s commercialized Big Learning?
 [Diagram: layered stack, top to bottom: Applications, Processing, Database, Storage]
A RECOMMENDER SHOULD …
 • Answer in Real-time
    - Ingest new data, now
    - Modify recommendations based on newest data
    - No “cold start” for new data
 • Scale Horizontally
    - For queries per second
    - For size of data set
 • Accept Diverse Input
    - Not just people and products
    - Not just explicit ratings: clicks, views, buys
    - Side information
 • Be “Pretty Accurate”
NEED: 2-TIER ARCHITECTURE
 • Real-time Serving Layer
    - Quick results based on precomputed model
    - Incremental update
    - Partitionable for scale
 • Batch Computation Layer
    - Builds model
    - Scales out (on Hadoop?)
    - Asynchronous, occasional, long-lived runs
A PRACTICAL ALGORITHM

MATRIX FACTORIZATION
 • Factor user-item matrix into user-feature and feature-item matrices
 • Well understood in ML, as:
    - Principal Component Analysis
    - Latent Semantic Indexing
 • Several algorithms, like:
    - Singular Value Decomposition
    - Alternating Least Squares

BENEFITS
 • Models intuition
 • Factorization is batch parallelizable
 • Reconstruction (recs) in low-dimension is fast
 • Allows projection of new data
    - Cold start solution
    - Approximate update solution
A PRACTICAL IMPLEMENTATION

ALTERNATING LEAST SQUARES
 • Simple factorization $P \approx X Y^T$
 • Approximate: X, Y are “skinny” (low-rank)
 • Faster than the SVD
    - Trivially parallel, iterative
 • Dumber than the SVD
    - No singular values, orthonormal basis

BENEFITS
 • Parallelizable by row: very Hadoop-friendly
 • Iterative: OK answer fast, refine as long as desired
 • Yields to “binary” input model
    - Ratings enter as weights on the error instead
    - Sparseness / 0s no longer a problem
ALS ALGORITHM 1
 • Input: (user, item, strength) tuples
    - Anything you can quantify is input
    - Strength is positive
 • Many tuples per user-item
 • R is sparse user-item interaction matrix
 • $r_{ij}$ = total strength of interaction between user i and item j

 R (· = no interaction) =
   1  4  3  ·  ·
   ·  ·  3  ·  ·
   ·  4  ·  3  2
   5  ·  2  ·  3
   ·  ·  ·  5  ·
   2  4  ·  ·  ·
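
The aggregation described on this slide can be sketched in a few lines of Python. This is only an illustration, not the Myrrix or Mahout implementation: the function name `build_interaction_matrix`, the integer user and item indices, and the sample events are all hypothetical, and a dense NumPy array stands in for what would be a sparse structure at real scale.

```python
from collections import defaultdict

import numpy as np


def build_interaction_matrix(tuples, num_users, num_items):
    """Sum the strengths of all (user, item, strength) tuples into R."""
    totals = defaultdict(float)
    for user, item, strength in tuples:
        # Many tuples may refer to the same user-item pair; strengths add up.
        totals[(user, item)] += strength

    # Dense array for readability only; pairs never seen stay at 0.
    R = np.zeros((num_users, num_items))
    for (user, item), total in totals.items():
        R[user, item] = total
    return R


# Hypothetical events: user 0 interacts with item 1 twice.
events = [(0, 0, 1.0), (0, 1, 3.0), (0, 1, 1.0), (2, 3, 2.0)]
R = build_interaction_matrix(events, num_users=3, num_items=4)
print(R)
```

Here `R[0, 1]` ends up as the total strength of the two events for that pair, which is the $r_{ij}$ of the slide.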
ALS ALGORITHM 2
 • Follow “Collaborative Filtering for Implicit Feedback Datasets”
   www2.research.att.com/~yifanhu/PUB/cf.pdf
 • Construct “binary” matrix P
    - 1 where R > 0
    - 0 where R = 0
 • Factor P, not R
    - R returns later as weights on the squared error
 • Still sparse; implicit 0s fine

 P =
   1  1  1  0  0
   0  0  1  0  0
   0  1  0  1  1
   1  0  1  0  1
   0  0  0  1  0
   1  1  0  0  0
ALS ALGORITHM 3
 • P is $m \times n$
 • Choose $k \ll m, n$
 • Factor P as $Q = X Y^T$, $Q \approx P$
    - X is $m \times k$; $Y^T$ is $k \times n$
 • Find best approximation Q
    - Minimize L2 norm of diff: $\|P - Q\|_2$
    - Minimal squared error: “Least Squares”
 • Recommendations are largest values in Q
 [Figure: tall, skinny X multiplied by short, wide $Y^T$]
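
As a rough illustration of "recommendations are largest values in Q", the sketch below scores one user's row of $Q = X Y^T$ and returns the top items. The function name `recommend` and the tiny hand-made factors are hypothetical. Skipping items the user has already interacted with is an added assumption, not something the slide states.

```python
import numpy as np


def recommend(X, Y, P, user, how_many=2):
    """Return indices of the highest-scoring items for one user."""
    # Row `user` of Q = X Y^T, computed without materializing all of Q.
    scores = X[user] @ Y.T
    # Assumption: items with P = 1 are already known to the user, so mask them.
    scores = np.where(P[user] > 0, -np.inf, scores)
    ranked = np.argsort(-scores)
    return [int(i) for i in ranked[:how_many] if np.isfinite(scores[i])]


# Hand-made factors with k = 2: 2 users, 3 items.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
P = np.array([[1, 0, 0], [0, 0, 1]])

print(recommend(X, Y, P, user=0))
print(recommend(X, Y, P, user=1))
```

Because only one row of Q is needed per query, the work per request is proportional to the number of items times k, which is what makes low-dimension reconstruction fast.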
ALS ALGORITHM 4
 • Optimizing X, Y simultaneously is non-convex, hard
 • If X or Y is fixed, system of linear equations: convex, easy
 • Initialize Y with random values
 • Solve for X
 • Fix X, solve for Y
 • Repeat (“Alternating”)
 [Figure: X and $Y^T$, each held fixed in turn]
ALS ALGORITHM 5
 • Define confidence weights $c_{ui} = 1 + \alpha r_{ui}$
 • Minimize:

   $\sum_{u,i} c_{ui} (p_{ui} - x_u^T y_i)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)$

 • Simple least-squares regression objective, plus
    - Squared-error terms weighted by strength: the penalty for failing to reconstruct a 1 is higher where the association is “strong”
    - Standard L2 regularization term
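
The objective on this slide can be written down directly in NumPy. The sketch below is illustrative only: the function names `preference_and_confidence` and `objective` are hypothetical, and the sum is taken over every user-item pair, including those where $p_{ui} = 0$, which is how the referenced paper defines it.

```python
import numpy as np


def preference_and_confidence(R, alpha):
    """Binary matrix P (1 where R > 0) and weights C with c_ui = 1 + alpha * r_ui."""
    P = (R > 0).astype(float)
    C = 1.0 + alpha * R
    return P, C


def objective(P, C, X, Y, lam):
    """Weighted squared reconstruction error plus L2 penalty on both factors."""
    error = P - X @ Y.T                      # p_ui - x_u^T y_i for every pair
    weighted_sq_error = np.sum(C * error ** 2)
    penalty = lam * (np.sum(X ** 2) + np.sum(Y ** 2))
    return weighted_sq_error + penalty


# Tiny hypothetical example: 2 users, 2 items, k = 1.
R = np.array([[2.0, 0.0], [0.0, 1.0]])
P, C = preference_and_confidence(R, alpha=40.0)
X = np.array([[1.0], [0.0]])
Y = np.array([[1.0], [0.0]])
print(objective(P, C, X, Y, lam=2.0))
```

With these factors the only reconstruction error is at the second user's observed item, so that single term is weighted by $1 + 40 \cdot 1 = 41$ while the unobserved zeros cost nothing.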
ALS ALGORITHM 6
 • With fixed Y, compute optimal X
 • Each row $x_u$ is independent
 • Define $C_u$ as diagonal matrix of $c_u$ (user strength weights)
 • $x_u = (Y^T C_u Y + \lambda I)^{-1} Y^T C_u p_u$
 • Compare to simple least-squares regression solution $(Y^T Y)^{-1} Y^T p_u$
    - Adds Tikhonov / ridge regression regularization term $\lambda I$
    - Attaches $c_u$ weights to $Y^T$
 • See paper for how $Y^T C_u Y$ is computed efficiently; skipping the engineering!
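
Putting the alternating steps and the closed-form row update together, a minimal dense sketch might look like the following. It is a naive illustration, not the paper's efficient computation of $Y^T C_u Y$ and not the Myrrix code. The names `solve_rows` and `als` and the random initialization scale are hypothetical. The matrix `R` is read off the ALS ALGORITHM 1 slide, with blank cells taken as 0 and positioned to match the nonzero pattern of P, which is an assumption about the flattened layout.

```python
import numpy as np


def solve_rows(F, C, P, lam):
    """With factor F (n x k) fixed, solve each row of the other factor.

    Row u is (F^T C_u F + lam*I)^-1 F^T C_u p_u, where C_u = diag(C[u]).
    """
    k = F.shape[1]
    out = np.empty((C.shape[0], k))
    for u in range(C.shape[0]):
        FtC = F.T * C[u]                  # F^T C_u via broadcasting (k x n)
        A = FtC @ F + lam * np.eye(k)     # k x k, positive definite for lam > 0
        b = FtC @ P[u]
        out[u] = np.linalg.solve(A, b)    # each row is an independent solve
    return out


def als(R, k=3, lam=2.0, alpha=40.0, iterations=10, seed=0):
    """Alternate between solving for X with Y fixed and Y with X fixed."""
    rng = np.random.default_rng(seed)
    P = (R > 0).astype(float)
    C = 1.0 + alpha * R
    Y = rng.normal(scale=0.1, size=(R.shape[1], k))   # random start for Y
    for _ in range(iterations):
        X = solve_rows(Y, C, P, lam)       # users: rows of X
        Y = solve_rows(X, C.T, P.T, lam)   # items: rows of Y, roles swapped
    return X, Y


# R as reconstructed from the slide (assumption, see above).
R = np.array([
    [1, 4, 3, 0, 0],
    [0, 0, 3, 0, 0],
    [0, 4, 0, 3, 2],
    [5, 0, 2, 0, 3],
    [0, 0, 0, 5, 0],
    [2, 4, 0, 0, 0],
], dtype=float)

X, Y = als(R, k=3, lam=2.0, alpha=40.0, iterations=10)
print(np.round(X @ Y.T, 2))
```

The exact numbers depend on the random starting point, so the printed Q should not be expected to match any particular published values digit for digit. Since every row solve is independent, the loop over `u` is the part that would be distributed across workers.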
EXAMPLE FACTORIZATION
 • k = 3, λ = 2, α = 40, 10 iterations

         P                           Q = X•YT
   1  1  1  0  0          0.96   0.99   0.99   0.38   0.93
   0  0  1  0  0          0.44   0.39   0.98  -0.11   0.39
   0  1  0  1  1     ≈    0.70   0.99   0.42   0.98   0.98
   1  0  1  0  1          1.00   1.04   0.99   0.44   0.98
   0  0  0  1  0          0.11   0.51  -0.13   1.00   0.57
   1  1  0  0  0          0.97   1.00   0.68   0.47   0.91
FOLD-IN
 • Need immediate, if approximate, updates for new data
 • New user u needs new row $Q_u = X_u Y^T$
 • We have $P_u \approx Q_u$
 • Compute $X_u$ via right inverse: $X Y^T (Y^T)^{-1} = Q (Y^T)^{-1}$, so $X = Q (Y^T)^{-1}$
 • What is $(Y^T)^{-1}$?
    - Note $(Y^T Y)(Y^T Y)^{-1} = I$
    - Gives $Y^T$’s right inverse: $Y^T (Y (Y^T Y)^{-1}) = I$
 • $X_u = Q_u Y (Y^T Y)^{-1}$
 • $X_u \approx P_u Y (Y^T Y)^{-1}$
 • Recommend as usual: $Q_u = X_u Y^T$
 • For existing user, instead add to existing row $X_u$
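
The fold-in formulas can be sketched as follows. This is a hedged illustration: the function names `fold_in_new_user` and `fold_in_existing_user` are hypothetical, Y is assumed to have full column rank so that $Y^T Y$ is invertible, and the existing-user update is read here as adding the fold-in of the newly observed entries to the stored row, which is one plausible interpretation of "add to existing row".

```python
import numpy as np


def fold_in_new_user(p_u, Y):
    """Approximate x_u = p_u Y (Y^T Y)^-1 for a new user's binary row p_u."""
    # Solve (Y^T Y) x = Y^T p_u instead of forming the inverse explicitly.
    return np.linalg.solve(Y.T @ Y, Y.T @ p_u)


def fold_in_existing_user(x_u, delta_p_u, Y):
    """Add the fold-in of newly observed entries to an existing row x_u."""
    # The fold-in map is linear in p_u, so the update is additive.
    return x_u + fold_in_new_user(delta_p_u, Y)


# Hypothetical item factors: 5 items, k = 3.
rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 3))

p_new = np.array([1.0, 0.0, 1.0, 0.0, 0.0])   # new user touched items 0 and 2
x_new = fold_in_new_user(p_new, Y)
q_new = x_new @ Y.T                            # recommend as usual from this row
print(np.round(q_new, 2))
```

No batch run is needed for this update: the cost is one k x k solve per user, and $Y^T Y$ can be cached between requests while Y stays fixed.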
THIS IS MYRRIX
 • Soft-launched
 • Serving Layer available as open source download
 • Computation Layer available as beta
 • Ready on Amazon EC2 / EMR
 • Full launch Q4 2012
 • myrrix.com
 • srowen@myrrix.com
APPENDIX
EXAMPLES

STACKOVERFLOW TAGS
 • Recommend tags to questions
 • Tag questions automatically, improve tag coverage
 • 3.5M questions x 30K tags
 • 4.3 hours x 5 machines on Amazon EMR
 • $3.03 ≈ $0.08 per 100,000 recs

WIKIPEDIA LINKS
 • Recommend new linked articles from existing links
 • Propose missing, related links
 • 2.5M articles x 1.8M articles
 • 28 hours x 2 PCs on Apache Hadoop 1.0.3

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 

Big Practical Recommendations with Alternating Least Squares

  • 1. Big, Practical Recommendations with Alternating Least Squares Sean Owen • Apache Mahout / Myrrix.com
  • 2. WHERE’S BIG LEARNING?
      • Next: Application Layer
          • Analytics
          • Machine Learning
      • Like Apache Mahout
          • Common Big Data app today
          • Clustering, recommenders, classifiers on Hadoop
          • Free, open source; not mature
      • Where’s commercialized Big Learning?
      • Stack diagram (top to bottom): Applications, Processing, Database, Storage
  • 3. A RECOMMENDER SHOULD …
      • Answer in Real-time
          • Ingest new data, now
          • Modify recommendations based on newest data
          • No “cold start” for new data
      • Scale Horizontally
          • For queries per second
          • For size of data set
      • Accept Diverse Input
          • Not just people and products
          • Not just explicit ratings
          • Clicks, views, buys
          • Side information
      • Be “Pretty Accurate”
  • 4. NEED: 2-TIER ARCHITECTURE
      • Real-time Serving Layer
          • Quick results based on precomputed model
          • Incremental update
          • Partitionable for scale
      • Batch Computation Layer
          • Builds model
          • Scales out (on Hadoop?)
          • Asynchronous, occasional, long-lived runs
  • 5. A PRACTICAL ALGORITHM
      MATRIX FACTORIZATION
      • Factor user-item matrix to user-feature + feature-item matrix
      • Well understood in ML, as:
          • Principal Component Analysis
          • Latent Semantic Indexing
      • Several algorithms, like:
          • Singular Value Decomposition
          • Alternating Least Squares
      BENEFITS
      • Models intuition
      • Factorization is batch parallelizable
      • Reconstruction (recs) in low-dimension is fast
      • Allows projection of new data
          • Cold start solution
          • Approximate update solution
  • 6. A PRACTICAL IMPLEMENTATION
      ALTERNATING LEAST SQUARES
      • Simple factorization $P \approx X Y^T$
      • Approximate: X, Y are very “skinny” (low-rank)
      • Faster than the SVD
          • Trivially parallel, iterative
      • Dumber than the SVD
          • No singular values, orthonormal basis
      BENEFITS
      • Parallelizable by row -- Hadoop-friendly
      • Iterative: OK answer fast, refine as long as desired
      • Yields to “binary” input model
          • Ratings as regularization instead
          • Sparseness / 0s no longer a problem
  • 7. ALS ALGORITHM 1
      • Input: (user, item, strength) tuples
          • Anything you can quantify is input
          • Strength is positive
          • Many tuples per user-item
      • R is sparse user-item interaction matrix
          • $r_{ij}$ = total strength of interaction between user i and item j
      • Example R, nonzero entries by row: 1 4 3 / 3 / 4 3 2 / 5 2 3 / 5 / 2 4
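The aggregation of tuples into R can be sketched as follows. This is a minimal illustration, not the Myrrix or Mahout code: the function name `build_interactions`, the dict-of-dicts layout, and the sample events are all assumptions made for the example.

```python
from collections import defaultdict


def build_interactions(tuples):
    """Sum (user, item, strength) tuples into a sparse dict-of-dicts R.

    R[user][item] holds the total strength for that pair; pairs that
    never occur are simply absent (the implicit zeros of a sparse matrix).
    """
    r = defaultdict(lambda: defaultdict(float))
    for user, item, strength in tuples:
        if strength <= 0:
            raise ValueError("strength must be positive")
        r[user][item] += strength
    return {user: dict(items) for user, items in r.items()}


# Hypothetical events: several tuples for one user-item pair add up.
events = [
    ("alice", "book", 1.0),  # e.g. a view
    ("alice", "book", 3.0),  # e.g. a purchase
    ("alice", "film", 2.0),
    ("bob", "film", 5.0),
]
R = build_interactions(events)
print(R)
```

Storing only the observed pairs keeps memory proportional to the number of interactions rather than to users times items.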
  • 8. ALS ALGORITHM 2
      • Follow “Collaborative Filtering for Implicit Feedback Datasets”: www2.research.att.com/~yifanhu/PUB/cf.pdf
      • Construct “binary” matrix P
          • 1 where R > 0
          • 0 where R = 0
      • Factor P, not R
          • R returns in regularization
          • Still sparse; implicit 0s fine
      • Example P:
          1 1 1 0 0
          0 0 1 0 0
          0 1 0 1 1
          1 0 1 0 1
          0 0 0 1 0
          1 1 0 0 0
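A small sketch of the binarization step. The strengths in `R` are the values shown on the previous slide; where the zeros sit is an assumption, inferred by lining those values up with the nonzero positions of the P matrix on this slide.

```python
# Strengths from the slides; zero positions are assumed (inferred from P).
R = [
    [1, 4, 3, 0, 0],
    [0, 0, 3, 0, 0],
    [0, 4, 0, 3, 2],
    [5, 0, 2, 0, 3],
    [0, 0, 0, 5, 0],
    [2, 4, 0, 0, 0],
]


def binarize(r):
    """Return P with 1 where R > 0 and 0 where R = 0."""
    return [[1 if value > 0 else 0 for value in row] for row in r]


P = binarize(R)
for row in P:
    print(row)
```

A dense list of lists is used here only for readability; at scale P would stay sparse, as the slide notes.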
  • 9. ALS ALGORITHM 3
      • P is m x n
      • Choose k << m, n
      • Factor P as $Q = X Y^T$, $Q \approx P$
          • X is m x k; $Y^T$ is k x n
      • Find best approximation Q
          • Minimize L2 norm of diff: $\|P - Q\|_2$
          • Minimal squared error: “Least Squares”
      • Recommendations are largest values in Q
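Reading recommendations out of the factors can be sketched like this: one row of Q is the dot product of the user's row of X with each item's row of Y, and the largest values win. The function name `recommend`, the factor values, and the choice to skip items the user has already interacted with are assumptions for illustration.

```python
def recommend(x_u, Y, seen, how_many=2):
    """Rank unseen items by q_ui = x_u . y_i, largest first."""
    scores = []
    for item, y_i in enumerate(Y):
        if item in seen:
            continue  # assumption: do not recommend what the user already has
        score = sum(a * b for a, b in zip(x_u, y_i))
        scores.append((score, item))
    scores.sort(reverse=True)
    return [item for _, item in scores[:how_many]]


# Hypothetical factors with k = 2: one user row and four item rows.
x_u = [1.0, 0.5]
Y = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.7], [0.0, 0.1]]
top = recommend(x_u, Y, seen={0})
print(top)
```

Because only one row of Q is computed per request, and each entry costs k multiplications, this is the step that makes low-dimension reconstruction fast in a serving layer.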
  • 10. ALS ALGORITHM 4
      • Optimizing X, Y simultaneously is non-convex, hard
      • If X or Y are fixed, system of linear equations: convex, easy
      • Initialize Y with random values
      • Solve for X
      • Fix X, solve for Y
      • Repeat (“Alternating”)
  • 11. ALS ALGORITHM 5
      • Define regularization weights $c_{ui} = 1 + \alpha r_{ui}$
      • Minimize: $\sum_{u,i} c_{ui} (p_{ui} - x_u^T y_i)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)$
      • Simple least-squares regression objective, plus:
          • Error terms weighted by strength: the penalty for not reconstructing 1 at a “strong” association is higher
          • Standard L2 regularization term
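To make the objective concrete, here is a minimal sketch that evaluates it for given factors. The function name `als_loss` and the tiny inputs are assumptions; X holds one row per user and Y one row per item, so the prediction for a pair is the dot product of the two rows.

```python
def als_loss(R, X, Y, alpha, lam):
    """Weighted squared error over all cells plus L2 regularization."""
    loss = 0.0
    for u, row in enumerate(R):
        for i, r_ui in enumerate(row):
            p_ui = 1.0 if r_ui > 0 else 0.0   # binary target
            c_ui = 1.0 + alpha * r_ui         # weight grows with strength
            pred = sum(a * b for a, b in zip(X[u], Y[i]))
            loss += c_ui * (p_ui - pred) ** 2
    norms = sum(v * v for x in X for v in x) + sum(v * v for y in Y for v in y)
    return loss + lam * norms


# Hypothetical 1 user x 2 items example with k = 1.
loss = als_loss(R=[[2, 0]], X=[[1.0]], Y=[[0.5], [0.5]], alpha=1.0, lam=0.1)
print(loss)
```

In this example both cells are predicted as 0.5, but the observed cell (strength 2, weight 3) contributes three times the error of the unobserved one (weight 1), which is the weighting the slide describes.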
  • 12. ALS ALGORITHM 6
      • With fixed Y, compute optimal X
      • Each row $x_u$ is independent
      • Define $C_u$ as diagonal matrix of $c_u$ (user strength weights)
      • $x_u = (Y^T C_u Y + \lambda I)^{-1} Y^T C_u p_u$
      • Compare to simple least-squares regression solution $(Y^T Y)^{-1} Y^T p_u$
          • Adds Tikhonov / ridge regression regularization term $\lambda I$
          • Attaches $c_u$ weights to $Y^T$
      • See paper for how $Y^T C_u Y$ is computed efficiently; skipping the engineering!
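The row update and the alternation can be sketched in plain Python as below. This is an illustrative, dense, unoptimized version under stated assumptions: the names `solve`, `update_rows` and `als` are made up for the example, the k x k systems are solved with a hand-written Gauss-Jordan elimination instead of a linear algebra library, and the efficient computation of $Y^T C_u Y$ from the paper is not attempted. The zero positions in `R` are inferred from the P matrix on the slides.

```python
import random


def solve(A, b):
    """Solve the small k x k system A z = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        d = M[col][col]
        M[col] = [v / d for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def update_rows(R, Y, alpha, lam):
    """For each row r_u of R: x_u = (Y^T C_u Y + lam I)^-1 Y^T C_u p_u."""
    k = len(Y[0])
    X = []
    for r_u in R:
        A = [[lam if a == b else 0.0 for b in range(k)] for a in range(k)]
        rhs = [0.0] * k
        for y_i, r_ui in zip(Y, r_u):
            c_ui = 1.0 + alpha * r_ui
            p_ui = 1.0 if r_ui > 0 else 0.0
            for a in range(k):
                rhs[a] += c_ui * p_ui * y_i[a]
                for b in range(k):
                    A[a][b] += c_ui * y_i[a] * y_i[b]
        X.append(solve(A, rhs))  # rows are independent of each other
    return X


def als(R, k, alpha, lam, iterations, seed=0):
    """Alternate between solving for X and Y; assumes iterations >= 1."""
    rng = random.Random(seed)
    Rt = [list(col) for col in zip(*R)]  # item-major view of R
    Y = [[rng.random() for _ in range(k)] for _ in Rt]
    for _ in range(iterations):
        X = update_rows(R, Y, alpha, lam)   # fix Y, solve for X
        Y = update_rows(Rt, X, alpha, lam)  # fix X, solve for Y
    return X, Y


# Slide settings; zero positions in R are an assumption inferred from P.
R = [[1, 4, 3, 0, 0], [0, 0, 3, 0, 0], [0, 4, 0, 3, 2],
     [5, 0, 2, 0, 3], [0, 0, 0, 5, 0], [2, 4, 0, 0, 0]]
X, Y = als(R, k=3, alpha=40.0, lam=2.0, iterations=10)
Q = [[sum(a * b for a, b in zip(x, y)) for y in Y] for x in X]
for row in Q:
    print(" ".join(f"{v:5.2f}" for v in row))
```

The printed Q depends on the random initialization, so it will not match any particular published example digit for digit. Since each `x_u` is computed from Y and one row of R only, the inner loop of `update_rows` is the part that parallelizes by row.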
  • 13. EXAMPLE FACTORIZATION
      • $k = 3$, $\lambda = 2$, $\alpha = 40$, 10 iterations
      • $Q = X Y^T \approx P$

          Q                                    P
           0.96  0.99  0.99  0.38  0.93        1 1 1 0 0
           0.44  0.39  0.98 -0.11  0.39        0 0 1 0 0
           0.70  0.99  0.42  0.98  0.98        0 1 0 1 1
           1.00  1.04  0.99  0.44  0.98        1 0 1 0 1
           0.11  0.51 -0.13  1.00  0.57        0 0 0 1 0
           0.97  1.00  0.68  0.47  0.91        1 1 0 0 0
  • 14. FOLD-IN
      • Need immediate, if approximate, updates for new data
      • New user u needs new row $Q_u = X_u Y^T$
      • We have $P_u \approx Q_u$
      • Compute $X_u$ via right inverse: $X Y^T (Y^T)^{-1} = Q (Y^T)^{-1}$, so $X = Q (Y^T)^{-1}$
      • What is $(Y^T)^{-1}$?
      • Note $(Y^T Y)(Y^T Y)^{-1} = I$
      • Gives $Y^T$’s right inverse: $Y^T \left( Y (Y^T Y)^{-1} \right) = I$
      • $X_u = Q_u Y (Y^T Y)^{-1}$
      • $X_u \approx P_u Y (Y^T Y)^{-1}$
      • Recommend as usual: $Q_u = X_u Y^T$
      • For existing user, instead add to existing row $X_u$
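A minimal sketch of fold-in for a new user, following $X_u \approx P_u Y (Y^T Y)^{-1}$. Rather than forming the inverse, it solves $(Y^T Y) x = Y^T p_u$, which is the same thing because $Y^T Y$ is symmetric. The names `solve` and `fold_in` and the item factors are assumptions made for the example.

```python
def solve(A, b):
    """Solve the small k x k system A z = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        d = M[col][col]
        M[col] = [v / d for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def fold_in(p_u, Y):
    """Approximate a new user's row x_u = p_u Y (Y^T Y)^-1 without retraining."""
    k = len(Y[0])
    YtY = [[sum(y[a] * y[b] for y in Y) for b in range(k)] for a in range(k)]
    Ytp = [sum(y[a] * p for y, p in zip(Y, p_u)) for a in range(k)]
    return solve(YtY, Ytp)


# Hypothetical item factors (3 items, k = 2) and a new user who has
# interacted with items 0 and 1 but not item 2.
Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p_u = [1, 1, 0]
x_u = fold_in(p_u, Y)
q_u = [sum(a * b for a, b in zip(x_u, y_i)) for y_i in Y]
print(x_u, q_u)
```

Only the k x k matrix $Y^T Y$ is involved, and it does not depend on the user, so a serving layer could compute it once per model and reuse it for every fold-in.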
  • 15. THIS IS MYRRIX
      • Soft-launched
      • Serving Layer available as open source download
      • Computation Layer available as beta
      • Ready on Amazon EC2 / EMR
      • Full launch Q4 2012
      • srowen@myrrix.com
      • myrrix.com
  • 17. EXAMPLES
      STACKOVERFLOW TAGS
      • Recommend tags to questions
      • Tag questions automatically, improve tag coverage
      • 3.5M questions x 30K tags
      • 4.3 hours x 5 machines on Amazon EMR
      • $3.03 ≈ $0.08 per 100,000 recs
      WIKIPEDIA LINKS
      • Recommend new linked articles from existing links
      • Propose missing, related links
      • 2.5M articles x 1.8M articles
      • 28 hours x 2 PCs on Apache Hadoop 1.0.3