SlideShare a Scribd company logo
1 of 56
Download to read offline
May 9, 2014
Collaborative
Filtering with Spark
Chris Johnson
@MrChrisJohnson
Friday, May 9, 14
Who am I??
•Chris Johnson
– Machine Learning guy from NYC
– Focused on music recommendations
– Formerly a graduate student at UT Austin
Friday, May 9, 14
3
What is MLlib?
Algorithms:
• classification: logistic regression, linear support vector machine
(SVM), naive bayes
• regression: generalized linear regression
• clustering: k-means
• decomposition: singular value decomposition (SVD), principle
component analysis (PCA
• collaborative filtering: alternating least squares (ALS)
http://spark.apache.org/docs/0.9.0/mllib-guide.html
Friday, May 9, 14
4
What is MLlib?
Algorithms:
• classification: logistic regression, linear support vector machine
(SVM), naive bayes
• regression: generalized linear regression
• clustering: k-means
• decomposition: singular value decomposition (SVD), principle
component analysis (PCA
• collaborative filtering: alternating least squares (ALS)
http://spark.apache.org/docs/0.9.0/mllib-guide.html
Friday, May 9, 14
Collaborative Filtering - “The Netflix Prize”
5
Friday, May 9, 14
Collaborative Filtering
6
Hey,
I like tracks P, Q, R, S!
Well,
I like tracks Q, R, S, T!
Then you should check out
track P!
Nice! Btw try track T!
Image via Erik Bernhardsson
Friday, May 9, 14
7
Collaborative Filtering at Spotify
• Discover (personalized recommendations)
• Radio
• Related Artists
• Now Playing
Friday, May 9, 14
Section name 8
Friday, May 9, 14
Explicit Matrix Factorization 9
Movies
Users
Chris
Inception
•Users explicitly rate a subset of the movie catalog
•Goal: predict how users will rate new movies
Friday, May 9, 14
• = bias for user
• = bias for item
• = regularization parameter
Explicit Matrix Factorization 10
Chris
Inception
? 3 5 ?
1 ? ? 1
2 ? 3 2
? ? ? 5
5 2 ? 4
•Approximate ratings matrix by the product of low-
dimensional user and movie matrices
•Minimize RMSE (root mean squared error)
• = user rating for movie
• = user latent factor vector
• = item latent factor vector
X Y
Friday, May 9, 14
Implicit Matrix Factorization 11
1 0 0 0 1 0 0 1
0 0 1 0 0 1 0 0
1 0 1 0 0 0 1 1
0 1 0 0 0 1 0 0
0 0 1 0 0 1 0 0
1 0 0 0 1 0 0 1
•Replace Stream counts with binary labels
– 1 = streamed, 0 = never streamed
•Minimize weighted RMSE (root mean squared error) using a
function of stream counts as weights
• = bias for user
• = bias for item
• = regularization parameter
• = 1 if user streamed track else 0
•
• = user latent factor vector
• =i tem latent factor vector
X Y
Friday, May 9, 14
Alternating Least Squares 12
• Initialize user and item vectors to random noise
• Fix item vectors and solve for optimal user vectors
– Take the derivative of loss function with respect to user’s vector, set
equal to 0, and solve
– Results in a system of linear equations with closed form solution!
• Fix user vectors and solve for optimal item vectors
• Repeat until convergence
code: https://github.com/MrChrisJohnson/implicitMF
Friday, May 9, 14
Alternating Least Squares 13
• Note that:
• Then, we can pre-compute once per iteration
– and only contain non-zero elements for tracks that
the user streamed
– Using sparse matrix operations we can then compute each user’s
vector efficiently in time where is the number of
tracks the user streamed
code: https://github.com/MrChrisJohnson/implicitMF
Friday, May 9, 14
14
Alternating Least Squares
code: https://github.com/MrChrisJohnson/implicitMF
Friday, May 9, 14
Section name 15
Friday, May 9, 14
Scaling up Implicit Matrix Factorization
with Hadoop
16
Friday, May 9, 14
Hadoop at Spotify 2009
17
Friday, May 9, 14
Hadoop at Spotify 2014
18
700 Nodes in our London data center
Friday, May 9, 14
Implicit Matrix Factorization with Hadoop
19
Reduce stepMap step
u % K = 0
i % L = 0
u % K = 0
i % L = 1
...
u % K = 0
i % L = L-1
u % K = 1
i % L = 0
u % K = 1
i % L = 1
... ...
... ... ... ...
u % K = K-1
i % L = 0
... ...
u % K = K-1
i % L = L-1
item vectors
item%L=0
item vectors
item%L=1
item vectors
i % L = L-1
user vectors
u % K = 0
user vectors
u % K = 1
user vectors
u % K = K-1
all log entries
u % K = 1
i % L = 1
u % K = 0
u % K = 1
u % K = K-1
Figure via Erik Bernhardsson
Friday, May 9, 14
Implicit Matrix Factorization with Hadoop
20
One map task
Distributed
cache:
All user vectors
where u % K = x
Distributed
cache:
All item vectors
where i % L = y
Mapper Emit contributions
Map input:
tuples (u, i, count)
where
u % K = x
and
i % L = y
Reducer New vector!
Figure via Erik Bernhardsson
Friday, May 9, 14
Hadoop suffers from I/O overhead
21
IO Bottleneck
Friday, May 9, 14
Spark to the rescue!!
22
Vs
http://www.slideshare.net/Hadoop_Summit/spark-and-shark
Spark
Hadoop
Friday, May 9, 14
Section name 23
Friday, May 9, 14
First Attempt
24
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Join user vectors along with all ratings for that user and all item vectors for
which the user rated the item
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
Friday, May 9, 14
First Attempt
25
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Join user vectors along with all ratings for that user and all item vectors for
which the user rated the item
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
node 2 node 3 node 4 node 5
Friday, May 9, 14
First Attempt
26
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Join user vectors along with all ratings for that user and all item vectors for
which the user rated the item
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
node 2 node 3 node 4 node 5
Friday, May 9, 14
First Attempt
27
Friday, May 9, 14
First Attempt
28
•Issues:
–Unnecessarily sending multiple copies of item vector to each node
–Unnecessarily shuffling data across cluster at each iteration
–Not taking advantage of Spark’s in memory capabilities!
Friday, May 9, 14
Second Attempt
29
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Group ratings matrix into blocks, and join blocks with necessary user and
item vectors (to avoid multiple item vector copies at each node)
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
Friday, May 9, 14
Second Attempt
30
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Group ratings matrix into blocks, and join blocks with necessary user and
item vectors (to avoid multiple item vector copies at each node)
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
Friday, May 9, 14
Second Attempt
31
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Group ratings matrix into blocks, and join blocks with necessary user and
item vectors (to avoid multiple item vector copies at each node)
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
Friday, May 9, 14
Second Attempt
32
Friday, May 9, 14
Second Attempt
33
Friday, May 9, 14
Second Attempt
34
•Issues:
–Still Unnecessarily shuffling data across cluster at each iteration
–Still not taking advantage of Spark’s in memory capabilities!
Friday, May 9, 14
So, what are we missing?...
35
•Partitioner: Defines how the elements in a key-value pair RDD are
partitioned across the cluster.
node 1 node 2 node 3 node 4 node 5 node 6
user vectors
partition 1
partition 2
partition 3
Friday, May 9, 14
So, what are we missing?...
36
•partitionBy(partitioner): Partitions all elements of the same key to the
same node in the cluster, as defined by the partitioner.
node 1 node 2 node 3 node 4 node 5 node 6
user vectors
partition 1
partition 2
partition 3
Friday, May 9, 14
So, what are we missing?...
37
•mapPartitions(func): Similar to map, but runs separately on each
partition (block) of the RDD, so func must be of type Iterator[T] =>
Iterator[U] when running on an RDD of type T.
node 1 node 2 node 3 node 4 node 5 node 6
user vectors
partition 1
partition 2
partition 3
function()
function()
function()
Friday, May 9, 14
So, what are we missing?...
38
• persist(storageLevel): Set this RDD's storage level to persist (cache)
its values across operations after the first time it is computed.
node 1 node 2 node 3 node 4 node 5 node 6
user vectors
partition 1
partition 2
partition 3
Friday, May 9, 14
Third Attempt
39
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
40
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
41
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
42
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
43
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
44
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third attempt
45
Friday, May 9, 14
Third attempt
46
Friday, May 9, 14
Third attempt
47
Friday, May 9, 14
ALS Running Times
48
Via Xiangrui Meng (Databricks) http://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf
Friday, May 9, 14
Section name 49
Fin
Friday, May 9, 14
Section name 50
Friday, May 9, 14
Section name 51
Friday, May 9, 14
Section name 52
Friday, May 9, 14
Section name 53
Friday, May 9, 14
Section name 54
Friday, May 9, 14
Section name 55
Friday, May 9, 14
Section name 56
Friday, May 9, 14

More Related Content

What's hot

How Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At SpotifyHow Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At SpotifyJosh Baer
 
Approximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupApproximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupErik Bernhardsson
 
Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.Esh Vckay
 
From Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyFrom Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyChris Johnson
 
ML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsErik Bernhardsson
 
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...DataWorks Summit/Hadoop Summit
 
Personalizing the listening experience
Personalizing the listening experiencePersonalizing the listening experience
Personalizing the listening experienceMounia Lalmas-Roelleke
 
Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com confluent
 
Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyChris Johnson
 
Building Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyBuilding Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyVidhya Murali
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...Hakka Labs
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explainedconfluent
 
Big data and machine learning @ Spotify
Big data and machine learning @ SpotifyBig data and machine learning @ Spotify
Big data and machine learning @ SpotifyOscar Carlsson
 
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Xavier Amatriain
 
Meta-Prod2Vec: Simple Product Embeddings with Side-Information
Meta-Prod2Vec: Simple Product Embeddings with Side-InformationMeta-Prod2Vec: Simple Product Embeddings with Side-Information
Meta-Prod2Vec: Simple Product Embeddings with Side-Informationrecsysfr
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry confluent
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupAndy Sloane
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 

What's hot (20)

How Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At SpotifyHow Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At Spotify
 
Approximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupApproximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetup
 
Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.
 
From Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyFrom Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover Weekly
 
ML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive Analytics
 
Apache KAfka
Apache KAfkaApache KAfka
Apache KAfka
 
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
 
Personalizing the listening experience
Personalizing the listening experiencePersonalizing the listening experience
Personalizing the listening experience
 
Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com
 
Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and Spotify
 
Building Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyBuilding Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at Spotify
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
 
Big data and machine learning @ Spotify
Big data and machine learning @ SpotifyBig data and machine learning @ Spotify
Big data and machine learning @ Spotify
 
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Meta-Prod2Vec: Simple Product Embeddings with Side-Information
Meta-Prod2Vec: Simple Product Embeddings with Side-InformationMeta-Prod2Vec: Simple Product Embeddings with Side-Information
Meta-Prod2Vec: Simple Product Embeddings with Side-Information
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data Meetup
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 

Viewers also liked

Lecture 6 lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
Lecture 6   lu factorization & determinants - section 2-5 2-7 3-1 and 3-2Lecture 6   lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
Lecture 6 lu factorization & determinants - section 2-5 2-7 3-1 and 3-2njit-ronbrown
 
Intro to Factorization Machines
Intro to Factorization MachinesIntro to Factorization Machines
Intro to Factorization MachinesPavel Kalaidin
 
Neighbor methods vs matrix factorization - case studies of real-life recommen...
Neighbor methods vs matrix factorization - case studies of real-life recommen...Neighbor methods vs matrix factorization - case studies of real-life recommen...
Neighbor methods vs matrix factorization - case studies of real-life recommen...Domonkos Tikk
 
Matrix factorization
Matrix factorizationMatrix factorization
Matrix factorizationrubyyc
 
Factorization Machines with libFM
Factorization Machines with libFMFactorization Machines with libFM
Factorization Machines with libFMLiangjie Hong
 
Nonnegative Matrix Factorization
Nonnegative Matrix FactorizationNonnegative Matrix Factorization
Nonnegative Matrix FactorizationTatsuya Yokota
 
Matrix Factorization Technique for Recommender Systems
Matrix Factorization Technique for Recommender SystemsMatrix Factorization Technique for Recommender Systems
Matrix Factorization Technique for Recommender SystemsAladejubelo Oluwashina
 
آموزش محاسبات عددی - بخش دوم
آموزش محاسبات عددی - بخش دومآموزش محاسبات عددی - بخش دوم
آموزش محاسبات عددی - بخش دومfaradars
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introductionLiang Xiang
 
Introduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative FilteringIntroduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative FilteringDKALab
 
Beginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBeginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBenjamin Bengfort
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender SystemsT212
 
Recommendation system
Recommendation system Recommendation system
Recommendation system Vikrant Arya
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemMilind Gokhale
 
Building a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineBuilding a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineNYC Predictive Analytics
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architectureLiang Xiang
 

Viewers also liked (17)

Lecture 6 lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
Lecture 6   lu factorization & determinants - section 2-5 2-7 3-1 and 3-2Lecture 6   lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
Lecture 6 lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
 
Intro to Factorization Machines
Intro to Factorization MachinesIntro to Factorization Machines
Intro to Factorization Machines
 
Neighbor methods vs matrix factorization - case studies of real-life recommen...
Neighbor methods vs matrix factorization - case studies of real-life recommen...Neighbor methods vs matrix factorization - case studies of real-life recommen...
Neighbor methods vs matrix factorization - case studies of real-life recommen...
 
Matrix factorization
Matrix factorizationMatrix factorization
Matrix factorization
 
Factorization Machines with libFM
Factorization Machines with libFMFactorization Machines with libFM
Factorization Machines with libFM
 
Nonnegative Matrix Factorization
Nonnegative Matrix FactorizationNonnegative Matrix Factorization
Nonnegative Matrix Factorization
 
Matrix Factorization Technique for Recommender Systems
Matrix Factorization Technique for Recommender SystemsMatrix Factorization Technique for Recommender Systems
Matrix Factorization Technique for Recommender Systems
 
آموزش محاسبات عددی - بخش دوم
آموزش محاسبات عددی - بخش دومآموزش محاسبات عددی - بخش دوم
آموزش محاسبات عددی - بخش دوم
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introduction
 
Introduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative FilteringIntroduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative Filtering
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Beginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBeginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix Factorization
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Recommendation system
Recommendation system Recommendation system
Recommendation system
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation System
 
Building a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineBuilding a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engine
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architecture
 

Similar to Collaborative Filtering with Spark

Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...Sc Huang
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReducesscdotopen
 
IntroductionRecommenderSystems_Petroni.pdf
IntroductionRecommenderSystems_Petroni.pdfIntroductionRecommenderSystems_Petroni.pdf
IntroductionRecommenderSystems_Petroni.pdfAlphaIssaghaDiallo
 
Deep learning Tutorial - Part II
Deep learning Tutorial - Part IIDeep learning Tutorial - Part II
Deep learning Tutorial - Part IIQuantUniversity
 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analyticsCollin Bennett
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedOmid Vahdaty
 
Final Presentation - Edan&Itzik
Final Presentation - Edan&ItzikFinal Presentation - Edan&Itzik
Final Presentation - Edan&Itzikitzik cohen
 
Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network	Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network Cigdem Aslay
 
Music Recommender Systems
Music Recommender SystemsMusic Recommender Systems
Music Recommender Systemsfuchaoqun
 
Recent advances in deep recommender systems
Recent advances in deep recommender systemsRecent advances in deep recommender systems
Recent advances in deep recommender systemsNAVER Engineering
 
Machine Learning.pptx
Machine Learning.pptxMachine Learning.pptx
Machine Learning.pptxJasonTuran2
 
Download
DownloadDownload
Downloadbutest
 
Download
DownloadDownload
Downloadbutest
 
Ire presentation
Ire presentationIre presentation
Ire presentationRaj Patel
 
Understanding Large Social Networks | IRE Major Project | Team 57 | LINE
Understanding Large Social Networks | IRE Major Project | Team 57 | LINEUnderstanding Large Social Networks | IRE Major Project | Team 57 | LINE
Understanding Large Social Networks | IRE Major Project | Team 57 | LINERaj Patel
 
Online learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and HadoopOnline learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and HadoopHéloïse Nonne
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2Mohit Garg
 

Similar to Collaborative Filtering with Spark (20)

Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduce
 
IntroductionRecommenderSystems_Petroni.pdf
IntroductionRecommenderSystems_Petroni.pdfIntroductionRecommenderSystems_Petroni.pdf
IntroductionRecommenderSystems_Petroni.pdf
 
Deep learning Tutorial - Part II
Deep learning Tutorial - Part IIDeep learning Tutorial - Part II
Deep learning Tutorial - Part II
 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analytics
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data Demystified
 
Matrix Factorization
Matrix FactorizationMatrix Factorization
Matrix Factorization
 
Final Presentation - Edan&Itzik
Final Presentation - Edan&ItzikFinal Presentation - Edan&Itzik
Final Presentation - Edan&Itzik
 
Project Matsu
Project MatsuProject Matsu
Project Matsu
 
Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network	Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network
 
Music Recommender Systems
Music Recommender SystemsMusic Recommender Systems
Music Recommender Systems
 
Recent advances in deep recommender systems
Recent advances in deep recommender systemsRecent advances in deep recommender systems
Recent advances in deep recommender systems
 
Machine Learning.pptx
Machine Learning.pptxMachine Learning.pptx
Machine Learning.pptx
 
Download
DownloadDownload
Download
 
Download
DownloadDownload
Download
 
Ire presentation
Ire presentationIre presentation
Ire presentation
 
Understanding Large Social Networks | IRE Major Project | Team 57 | LINE
Understanding Large Social Networks | IRE Major Project | Team 57 | LINEUnderstanding Large Social Networks | IRE Major Project | Team 57 | LINE
Understanding Large Social Networks | IRE Major Project | Team 57 | LINE
 
Online learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and HadoopOnline learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and Hadoop
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2
 
L04.pdf
L04.pdfL04.pdf
L04.pdf
 

Recently uploaded

Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfproinshot.com
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionOnePlan Solutions
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfVishalKumarJha10
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 

Recently uploaded (20)

Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 

Collaborative Filtering with Spark

  • 1. May 9, 2014 Collaborative Filtering with Spark Chris Johnson @MrChrisJohnson Friday, May 9, 14
  • 2. Who am I?? •Chris Johnson – Machine Learning guy from NYC – Focused on music recommendations – Formerly a graduate student at UT Austin Friday, May 9, 14
  • 3. 3 What is MLlib? Algorithms: • classification: logistic regression, linear support vector machine (SVM), naive bayes • regression: generalized linear regression • clustering: k-means • decomposition: singular value decomposition (SVD), principle component analysis (PCA • collaborative filtering: alternating least squares (ALS) http://spark.apache.org/docs/0.9.0/mllib-guide.html Friday, May 9, 14
  • 4. 4 What is MLlib? Algorithms: • classification: logistic regression, linear support vector machine (SVM), naive bayes • regression: generalized linear regression • clustering: k-means • decomposition: singular value decomposition (SVD), principle component analysis (PCA • collaborative filtering: alternating least squares (ALS) http://spark.apache.org/docs/0.9.0/mllib-guide.html Friday, May 9, 14
  • 5. Collaborative Filtering - “The Netflix Prize” 5 Friday, May 9, 14
  • 6. Collaborative Filtering 6 Hey, I like tracks P, Q, R, S! Well, I like tracks Q, R, S, T! Then you should check out track P! Nice! Btw try track T! Image via Erik Bernhardsson Friday, May 9, 14
  • 7. 7 Collaborative Filtering at Spotify • Discover (personalized recommendations) • Radio • Related Artists • Now Playing Friday, May 9, 14
  • 9. Explicit Matrix Factorization 9 Movies Users Chris Inception •Users explicitly rate a subset of the movie catalog •Goal: predict how users will rate new movies Friday, May 9, 14
  • 10. • = bias for user • = bias for item • = regularization parameter Explicit Matrix Factorization 10 Chris Inception ? 3 5 ? 1 ? ? 1 2 ? 3 2 ? ? ? 5 5 2 ? 4 •Approximate ratings matrix by the product of low- dimensional user and movie matrices •Minimize RMSE (root mean squared error) • = user rating for movie • = user latent factor vector • = item latent factor vector X Y Friday, May 9, 14
  • 11. Implicit Matrix Factorization 11 1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 •Replace Stream counts with binary labels – 1 = streamed, 0 = never streamed •Minimize weighted RMSE (root mean squared error) using a function of stream counts as weights • = bias for user • = bias for item • = regularization parameter • = 1 if user streamed track else 0 • • = user latent factor vector • =i tem latent factor vector X Y Friday, May 9, 14
  • 12. Alternating Least Squares 12 • Initialize user and item vectors to random noise • Fix item vectors and solve for optimal user vectors – Take the derivative of loss function with respect to user’s vector, set equal to 0, and solve – Results in a system of linear equations with closed form solution! • Fix user vectors and solve for optimal item vectors • Repeat until convergence code: https://github.com/MrChrisJohnson/implicitMF Friday, May 9, 14
  • 13. Alternating Least Squares 13 • Note that: • Then, we can pre-compute once per iteration – and only contain non-zero elements for tracks that the user streamed – Using sparse matrix operations we can then compute each user’s vector efficiently in time where is the number of tracks the user streamed code: https://github.com/MrChrisJohnson/implicitMF Friday, May 9, 14
  • 14. 14 Alternating Least Squares code: https://github.com/MrChrisJohnson/implicitMF Friday, May 9, 14
  • 16. Scaling up Implicit Matrix Factorization with Hadoop 16 Friday, May 9, 14
  • 17. Hadoop at Spotify 2009 17 Friday, May 9, 14
  • 18. Hadoop at Spotify 2014 18 700 Nodes in our London data center Friday, May 9, 14
  • 19. Implicit Matrix Factorization with Hadoop 19 Reduce stepMap step u % K = 0 i % L = 0 u % K = 0 i % L = 1 ... u % K = 0 i % L = L-1 u % K = 1 i % L = 0 u % K = 1 i % L = 1 ... ... ... ... ... ... u % K = K-1 i % L = 0 ... ... u % K = K-1 i % L = L-1 item vectors item%L=0 item vectors item%L=1 item vectors i % L = L-1 user vectors u % K = 0 user vectors u % K = 1 user vectors u % K = K-1 all log entries u % K = 1 i % L = 1 u % K = 0 u % K = 1 u % K = K-1 Figure via Erik Bernhardsson Friday, May 9, 14
  • 20. Implicit Matrix Factorization with Hadoop 20 One map task Distributed cache: All user vectors where u % K = x Distributed cache: All item vectors where i % L = y Mapper Emit contributions Map input: tuples (u, i, count) where u % K = x and i % L = y Reducer New vector! Figure via Erik Bernhardsson Friday, May 9, 14
  • 21. Hadoop suffers from I/O overhead 21 IO Bottleneck Friday, May 9, 14
  • 22. Spark to the rescue!! 22 Vs http://www.slideshare.net/Hadoop_Summit/spark-and-shark Spark Hadoop Friday, May 9, 14
  • 24. First Attempt 24 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Join user vectors along with all ratings for that user and all item vectors for which the user rated the item – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
  • 25. First Attempt 25 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Join user vectors along with all ratings for that user and all item vectors for which the user rated the item – Sum up YtCuIY and YtCuPu and solve for optimal user vectors node 2 node 3 node 4 node 5 Friday, May 9, 14
  • 26. First Attempt 26 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Join user vectors along with all ratings for that user and all item vectors for which the user rated the item – Sum up YtCuIY and YtCuPu and solve for optimal user vectors node 2 node 3 node 4 node 5 Friday, May 9, 14
  • 28. First Attempt 28 •Issues: –Unnecessarily sending multiple copies of item vector to each node –Unnecessarily shuffling data across cluster at each iteration –Not taking advantage of Spark’s in memory capabilities! Friday, May 9, 14
  • 29. Second Attempt 29 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Group ratings matrix into blocks, and join blocks with necessary user and item vectors (to avoid multiple item vector copies at each node) – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
  • 30. Second Attempt 30 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Group ratings matrix into blocks, and join blocks with necessary user and item vectors (to avoid multiple item vector copies at each node) – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
  • 31. Second Attempt 31 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Group ratings matrix into blocks, and join blocks with necessary user and item vectors (to avoid multiple item vector copies at each node) – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
  • 34. Second Attempt 34 •Issues: –Still Unnecessarily shuffling data across cluster at each iteration –Still not taking advantage of Spark’s in memory capabilities! Friday, May 9, 14
  • 35. So, what are we missing?... 35 •Partitioner: Defines how the elements in a key-value pair RDD are partitioned across the cluster. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 Friday, May 9, 14
  • 36. So, what are we missing?... 36 •partitionBy(partitioner): Partitions all elements of the same key to the same node in the cluster, as defined by the partitioner. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 Friday, May 9, 14
  • 37. So, what are we missing?... 37 •mapPartitions(func): Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator[T] => Iterator[U] when running on an RDD of type T. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 function() function() function() Friday, May 9, 14
  • 38. So, what are we missing?... 38 • persist(storageLevel): Set this RDD's storage level to persist (cache) its values across operations after the first time it is computed. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 Friday, May 9, 14
  • 39. Third Attempt 39 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 40. Third Attempt 40 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 41. Third Attempt 41 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 42. Third Attempt 42 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 43. Third Attempt 43 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 44. Third Attempt 44 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 48. ALS Running Times 48 Via Xiangrui Meng (Databricks) http://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf Friday, May 9, 14