June 27, 2014
Music Recommendations at Scale with Spark
Chris Johnson
@MrChrisJohnson
Who am I??
•Chris Johnson
– Machine Learning guy from NYC
– Focused on music recommendations
– Formerly a PhD student at UT Austin
Recommendations at Spotify
• Discover (personalized recommendations)
• Radio
• Related Artists
• Now Playing
How can we find good recommendations?
• Manual Curation
• Manually Tag Attributes
• Audio Content, Metadata, Text Analysis
• Collaborative Filtering
Collaborative Filtering - “The Netflix Prize”
Collaborative Filtering
Hey, I like tracks P, Q, R, S!
Well, I like tracks Q, R, S, T!
Then you should check out track P!
Nice! Btw try track T!
Image via Erik Bernhardsson
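The exchange above can be sketched with plain sets: two listeners who overlap on most tracks each recommend the track the other is missing. This is only a toy illustration of the idea, not the method used in production; the function name `recommend_from_neighbor` is hypothetical.

```python
def recommend_from_neighbor(my_tracks, neighbor_tracks):
    """Tracks the neighbor likes that I have not listed, in sorted order."""
    return sorted(set(neighbor_tracks) - set(my_tracks))

# The two listeners from the slide.
user_a = {"P", "Q", "R", "S"}
user_b = {"Q", "R", "S", "T"}

overlap = sorted(user_a & user_b)                 # shared taste: Q, R, S
for_b = recommend_from_neighbor(user_b, user_a)   # what A suggests to B
for_a = recommend_from_neighbor(user_a, user_b)   # what B suggests to A
print(overlap, for_b, for_a)
```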
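Explicit Matrix Factorization
•Users explicitly rate a subset of the movie catalog
•Goal: predict how users will rate new movies
•Approximate ratings matrix by the product of low-dimensional user and movie matrices
•Minimize RMSE (root mean squared error), summing over the observed ratings

$$\min_{x,y} \sum_{u,i} \left(r_{ui} - x_u^T y_i - \beta_u - \beta_i\right)^2 + \lambda \left(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\right)$$

• $r_{ui}$ = user $u$'s rating for movie $i$
• $x_u$ = user latent factor vector
• $y_i$ = item latent factor vector
• $\beta_u$ = bias for user
• $\beta_i$ = bias for item
• $\lambda$ = regularization parameter

Ratings matrix (rows = users, e.g. Chris; columns = movies, e.g. Inception; ? = not yet rated), approximated by the product of a user matrix X and a movie matrix Y:
? 3 5 ?
1 ? ? 1
2 ? 3 2
? ? ? 5
5 2 ? 4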
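As a rough illustration of the explicit objective, the sketch below scores a pair of factor matrices against the observed ratings only. It assumes the bias terms are zero unless passed in, and the helper names (`predict`, `rmse`, `regularized_loss`) and the toy numbers are made up for this example.

```python
import math

def predict(x_u, y_i, b_u=0.0, b_i=0.0):
    """Predicted rating: dot product of the factor vectors plus optional biases."""
    return sum(a * b for a, b in zip(x_u, y_i)) + b_u + b_i

def rmse(ratings, X, Y):
    """ratings: dict {(user, movie): rating} holding only the observed entries."""
    squared = [(r - predict(X[u], Y[i])) ** 2 for (u, i), r in ratings.items()]
    return math.sqrt(sum(squared) / len(squared))

def regularized_loss(ratings, X, Y, lam):
    """Squared error over observed ratings plus an L2 penalty on all factors."""
    squared = sum((r - predict(X[u], Y[i])) ** 2 for (u, i), r in ratings.items())
    penalty = sum(v * v for vec in X + Y for v in vec)
    return squared + lam * penalty

# Toy data (assumed): 2 users, 2 movies, 2 latent factors.
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [[3.0, 1.0], [5.0, 2.0]]
ratings = {(0, 0): 3.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 1.0}

print(rmse(ratings, X, Y), regularized_loss(ratings, X, Y, lam=0.1))
```

Unrated entries (the `?` cells) simply never appear in `ratings`, which is the main difference from the implicit case that follows.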
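Implicit Matrix Factorization
•Instead of explicit ratings use binary labels
– 1 = streamed, 0 = never streamed
•Minimize weighted RMSE (root mean squared error) using a function of total streams as weights

$$\min_{x,y} \sum_{u,i} c_{ui} \left(p_{ui} - x_u^T y_i - \beta_u - \beta_i\right)^2 + \lambda \left(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\right)$$

• $p_{ui}$ = 1 if user streamed track else 0
• $c_{ui}$ = weight, a function of user $u$'s total streams of track $i$
• $x_u$ = user latent factor vector
• $y_i$ = item latent factor vector
• $\beta_u$ = bias for user
• $\beta_i$ = bias for item
• $\lambda$ = regularization parameter

Binary matrix (rows = users, columns = songs), approximated by the product of a user matrix X and a song matrix Y:
1 0 0 0 1 0 0 1
0 0 1 0 0 1 0 0
1 0 1 0 0 0 1 1
0 1 0 0 0 1 0 0
0 0 1 0 0 1 0 0
1 0 0 0 1 0 0 1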
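To make the implicit setup concrete, here is a small sketch that turns raw stream counts into binary labels and weights and evaluates the weighted loss. The linear weighting `1 + alpha * count` is an assumption (one common choice for implicit feedback); the slides only say the weight is a function of total streams. Bias terms are left out, and all names and numbers are illustrative.

```python
def preference(count):
    """Binary label: 1 if the user streamed the track at all, else 0."""
    return 1.0 if count > 0 else 0.0

def confidence(count, alpha):
    """Assumed weighting: grows linearly with the number of streams."""
    return 1.0 + alpha * count

def weighted_loss(counts, X, Y, lam, alpha):
    """Weighted squared error over ALL user/track pairs plus an L2 penalty."""
    error = 0.0
    for u, row in enumerate(counts):
        for i, count in enumerate(row):
            pred = sum(a * b for a, b in zip(X[u], Y[i]))
            error += confidence(count, alpha) * (preference(count) - pred) ** 2
    penalty = sum(v * v for vec in X + Y for v in vec)
    return error + lam * penalty

# Toy data (assumed): stream counts for 2 users x 2 tracks, 1 latent factor.
counts = [[3, 0], [0, 1]]
X = [[1.0], [0.0]]
Y = [[1.0], [0.0]]
print(weighted_loss(counts, X, Y, lam=0.5, alpha=2.0))
```

Note that, unlike the explicit case, the zeros are part of the sum; they are just trusted less than the streamed entries.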
Alternating Least Squares (ALS)
•Same setup as implicit matrix factorization: binary labels (1 = streamed, 0 = never streamed), weighted RMSE
•Fix songs, solve for users
•Fix users, solve for songs
•Repeat until convergence…
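The alternation can be sketched in pure Python on the small binary matrix from the slides. This is a minimal single-machine sketch, not the linked implicitMF code: it assumes the weighting `1 + alpha * count`, drops the bias terms, runs a fixed number of iterations instead of testing convergence, and uses a hand-written Gauss-Jordan solver so it needs no third-party libraries.

```python
import random

def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [A[r][:] + [b[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[r][n] / M[r][r] for r in range(n)]

def solve_side(counts, fixed, lam, alpha):
    """Hold `fixed` constant and solve each row's weighted ridge regression."""
    f = len(fixed[0])
    solved = []
    for row in counts:
        A = [[lam if a == b else 0.0 for b in range(f)] for a in range(f)]
        rhs = [0.0] * f
        for j, count in enumerate(row):
            c = 1.0 + alpha * count           # assumed weight
            p = 1.0 if count > 0 else 0.0     # binary label
            y = fixed[j]
            for a in range(f):
                rhs[a] += c * p * y[a]
                for b in range(f):
                    A[a][b] += c * y[a] * y[b]
        solved.append(solve(A, rhs))
    return solved

def loss(counts, X, Y, lam, alpha):
    """Weighted squared error plus L2 penalty (no bias terms)."""
    err = sum((1.0 + alpha * c) * ((1.0 if c > 0 else 0.0)
              - sum(a * b for a, b in zip(X[u], Y[i]))) ** 2
              for u, row in enumerate(counts) for i, c in enumerate(row))
    return err + lam * sum(v * v for vec in X + Y for v in vec)

def als(counts, factors, lam, alpha, iterations, seed=0):
    rng = random.Random(seed)
    counts_t = [list(col) for col in zip(*counts)]   # songs x users
    Y = [[rng.random() for _ in range(factors)] for _ in counts_t]
    X = []
    for _ in range(iterations):
        X = solve_side(counts, Y, lam, alpha)    # fix songs, solve for users
        Y = solve_side(counts_t, X, lam, alpha)  # fix users, solve for songs
    return X, Y

streams = [
    [1, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 1],
]
X, Y = als(streams, factors=2, lam=0.1, alpha=10.0, iterations=10)
print(loss(streams, X, Y, lam=0.1, alpha=10.0))
```

Because each half-step solves its least-squares problem exactly, the loss cannot increase from one iteration to the next, which is what makes the alternation a sensible thing to repeat.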
Alternating Least Squares
code: https://github.com/MrChrisJohnson/implicitMF
Scaling up Implicit Matrix Factorization with Hadoop
Hadoop at Spotify 2009
Hadoop at Spotify 2014
700 Nodes in our London data center
Implicit Matrix Factorization with Hadoop
[Figure: all log entries are split into a K x L grid of blocks, one block per pair (u % K, i % L). In the map step, the block with u % K = x and i % L = y is combined with the user vectors where u % K = x and the item vectors where i % L = y. The reduce step is keyed by u % K (0, 1, ..., K-1). Figure via Erik Bernhardsson]
Implicit Matrix Factorization with Hadoop
One map task:
•Map input: tuples (u, i, count) where u % K = x and i % L = y
•Distributed cache: all user vectors where u % K = x
•Distributed cache: all item vectors where i % L = y
•Mapper: emit contributions
•Reducer: new vector!
Figure via Erik Bernhardsson
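The blocking scheme can be sketched as follows: each log tuple is routed to the block keyed by (u % K, i % L), and a map task for block (x, y) only needs the user vectors with u % K = x and the item vectors with i % L = y. This is a single-process illustration with made-up data, not the Hadoop job itself; `split_into_blocks` and `vectors_for_block` are hypothetical helper names.

```python
from collections import defaultdict

def split_into_blocks(log_entries, K, L):
    """Route each (u, i, count) tuple to the block keyed by (u % K, i % L)."""
    blocks = defaultdict(list)
    for u, i, count in log_entries:
        blocks[(u % K, i % L)].append((u, i, count))
    return dict(blocks)

def vectors_for_block(x, y, user_ids, item_ids, K, L):
    """IDs of the user and item vectors one map task would load from cache."""
    users = [u for u in user_ids if u % K == x]
    items = [i for i in item_ids if i % L == y]
    return users, items

# Toy log (assumed): (user, item, stream count).
log_entries = [(0, 0, 5), (1, 3, 2), (2, 4, 1), (3, 1, 7), (4, 2, 1)]
blocks = split_into_blocks(log_entries, K=2, L=3)
users, items = vectors_for_block(0, 1, range(5), range(5), K=2, L=3)
print(sorted(blocks), users, items)
```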
Hadoop suffers from I/O overhead
IO Bottleneck
Spark to the rescue!!
[Figure: Spark vs Hadoop, via http://www.slideshare.net/Hadoop_Summit/spark-and-shark]
First Attempt (broadcast everything)
[Figure: ratings, user vectors, and item vectors distributed across workers 1-6, with YtY broadcast to every worker]
• For each iteration:
1. Compute YtY over item vectors and broadcast
2. Broadcast item vectors
3. Group ratings by user
4. Solve for optimal user vector
•Cons: 
– Unnecessarily shuffling all data across wire each iteration.
– Not caching ratings data
– Unnecessarily sending a full copy of user/item vectors to all workers.
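One way to see why YtY is computed once and broadcast: if the weights have the form `1 + alpha * count` (an assumption, as the slides do not state the formula), then every never-streamed item has weight 1, so the per-user weighted Gram matrix equals YtY plus a correction that touches only the items the user streamed. The sketch below checks that identity on toy numbers; the helper names are hypothetical.

```python
def outer(y):
    """Outer product y y^T as a list of rows."""
    return [[a * b for b in y] for a in y]

def add_scaled(A, B, scale=1.0):
    """Return A + scale * B for two equally sized matrices."""
    return [[a + scale * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def gram(Y):
    """YtY: sum of y y^T over all item vectors (shared by every user)."""
    f = len(Y[0])
    G = [[0.0] * f for _ in range(f)]
    for y in Y:
        G = add_scaled(G, outer(y))
    return G

def weighted_gram_naive(Y, weights):
    """Sum of c_i * y_i y_i^T over ALL items: the expensive per-user route."""
    f = len(Y[0])
    G = [[0.0] * f for _ in range(f)]
    for y, c in zip(Y, weights):
        G = add_scaled(G, outer(y), c)
    return G

def weighted_gram_fast(YtY, Y, weights):
    """Start from the shared YtY and correct only items with weight != 1."""
    G = [row[:] for row in YtY]
    for y, c in zip(Y, weights):
        if c != 1.0:
            G = add_scaled(G, outer(y), c - 1.0)
    return G

# Toy data (assumed): 3 item vectors; this user streamed only the second item.
Y = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]
weights = [1.0, 5.0, 1.0]
YtY = gram(Y)
print(weighted_gram_naive(Y, weights), weighted_gram_fast(YtY, Y, weights))
```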
Second Attempt (full gridify)
[Figure: ratings, user vectors, and item vectors distributed across workers 1-6, with YtY broadcast to every worker]
•Group ratings matrix into K x L blocks, partition, and cache
•For each iteration:
1. Compute YtY over item vectors and broadcast
2. For each item vector, send a copy to each rating block in the item % L column
3. Compute intermediate terms for each block (partition)
4. Group by user, aggregate intermediate terms, and solve for optimal user vector
•Pros
– Ratings get cached and never shuffled
– Each partition only requires a subset of item (or user) vectors in memory each iteration
– Potentially requires less local memory than a “half gridify” scheme
•Cons
– Sending lots of intermediate data over the wire each iteration in order to aggregate and solve for optimal vectors
– More IO overhead than a “half gridify” scheme
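The extra shuffle in the full gridify scheme comes from the fact that one user's ratings are spread over several blocks, so each block can only produce partial terms that must then be grouped by user. The sketch below shows this for one of the intermediate terms (the weighted sum of item vectors), again assuming weights of the form `1 + alpha * count`; the matrix term would be aggregated the same way. All names and data are illustrative.

```python
def block_terms(block, item_vectors, alpha):
    """Per-user partial sums of c_ui * y_i for the ratings in one block."""
    partial = {}
    for u, i, count in block:
        c = 1.0 + alpha * count          # assumed weight
        y = item_vectors[i]
        acc = partial.setdefault(u, [0.0] * len(y))
        for k, v in enumerate(y):
            acc[k] += c * v
    return partial

def aggregate_by_user(partials):
    """Group the per-block partial sums by user and add them up."""
    total = {}
    for partial in partials:
        for u, vec in partial.items():
            acc = total.setdefault(u, [0.0] * len(vec))
            for k, v in enumerate(vec):
                acc[k] += v
    return total

# Toy data (assumed): ratings already split into a 2 x 2 grid by (u % 2, i % 2).
item_vectors = {0: [1.0, 0.0], 1: [0.0, 1.0]}
grid = {
    (0, 0): [(0, 0, 1), (2, 0, 3)],
    (0, 1): [(0, 1, 2)],
    (1, 1): [(1, 1, 1)],
}
partials = [block_terms(block, item_vectors, alpha=1.0) for block in grid.values()]
totals = aggregate_by_user(partials)
print(totals)
```

User 0 receives contributions from two different blocks, which is exactly the data that has to travel over the network before the user vector can be solved.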
Third Attempt (half gridify)
[Figure: ratings, user vectors, and item vectors distributed across workers 1-6, with YtY broadcast to every worker]
•Partition ratings matrix into K user (row) and item (column) blocks, partition, and cache
•For each iteration:
1. Compute YtY over item vectors and broadcast
2. For each item vector, send a copy to each user rating partition that requires it (potentially all partitions)
3. Each partition aggregates intermediate terms and solves for optimal user vectors
Note that we removed the extra shuffle from the full gridify approach.
•Pros
– Ratings get cached and never shuffled
– Once item vectors are joined with ratings partitions, each partition has enough information to solve for optimal user vectors without any additional shuffling/aggregation (which occurs with the “full gridify” scheme)
•Cons
– Each partition could potentially require a copy of each item vector (which may not all fit in memory)
– Potentially requires more local memory than the “full gridify” scheme
Actual MLlib code!
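The routing step of the half gridify scheme can be sketched as below. This toy sketch is not the MLlib code; it only illustrates how one might work out, once up front, which user partitions need which item vectors. Partitioning users by `u % K` is an assumption, and the function names are hypothetical.

```python
def items_needed_by_partition(ratings, K):
    """For each user partition, the set of items its users have rated."""
    needed = {p: set() for p in range(K)}
    for u, i, _count in ratings:
        needed[u % K].add(i)   # assumed partitioning rule: u % K
    return needed

def destinations_for_item(needed):
    """Invert the mapping: for each item, the partitions that need its vector."""
    dest = {}
    for p, items in needed.items():
        for i in items:
            dest.setdefault(i, set()).add(p)
    return dest

# Toy ratings (assumed): (user, item, stream count).
ratings = [(0, 0, 1), (0, 1, 2), (1, 1, 1), (2, 0, 3), (3, 2, 1)]
needed = items_needed_by_partition(ratings, K=2)
dest = destinations_for_item(needed)
print(needed, dest)
```

Item 1 is needed by both partitions here; in the worst case every item vector goes to every partition, which is the memory concern listed under Cons.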
ALS Running Times
Hadoop: 10 hours
Spark (full gridify): 3.5 hours
Spark (half gridify): 1.5 hours
•Dataset consisting of Spotify streaming data for 4 Million users and 500k artists
-Note: full dataset consists of 40M users and 20M songs but we haven’t yet successfully run with Spark
•All jobs run using 40 latent factors
•Spark jobs used 200 executors with 20G containers
•Hadoop job used 1k mappers, 300 reducers
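For a quick sense of scale, the reported running times translate into the following speedups over the Hadoop job. The arithmetic uses only the three numbers above; the jobs ran with different resource settings, so these ratios are indicative rather than a controlled comparison.

```python
hours = {
    "Hadoop": 10.0,
    "Spark (full gridify)": 3.5,
    "Spark (half gridify)": 1.5,
}

# Speedup of each job relative to the Hadoop baseline.
speedup = {name: hours["Hadoop"] / h for name, h in hours.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor:.2f}x")
```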
ALS Running Times
ALS runtime numbers via @evansparks using Spark version 0.8.0
Random Learnings
•PairRDDFunctions are your friend!
•Kryo serialization is faster than Java serialization but may require you to write and/or register your own serializers
•Running with larger datasets often results in failed executors, and the job never fully recovers
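As a hedged illustration of the Kryo point, the sketch below collects the two Spark properties usually involved, `spark.serializer` and `spark.kryo.registrator`, and formats them as `--conf` flags for `spark-submit`. The registrator class name is a made-up placeholder, and whether `--conf` is available depends on the Spark version in use.

```python
kryo_settings = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    # Hypothetical class: your own registrator listing the classes to serialize.
    "spark.kryo.registrator": "com.example.MyKryoRegistrator",
}

def as_submit_flags(settings):
    """Format the settings as a flat list of spark-submit --conf arguments."""
    flags = []
    for key, value in sorted(settings.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags

print(" ".join(as_submit_flags(kryo_settings)))
```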
Fin

Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...Akihiro Suda
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 

Último (20)

Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 

Music Recommendations at Scale with Spark

  • 1. June 27, 2014 Music Recommendations at Scale with Spark Chris Johnson @MrChrisJohnson
  • 2. Who am I?? •Chris Johnson – Machine Learning guy from NYC – Focused on music recommendations – Formerly a PhD student at UT Austin
  • 3. Recommendations at Spotify • Discover (personalized recommendations) • Radio • Related Artists • Now Playing
  • 4-5. How can we find good recommendations? • Manual Curation • Manually Tag Attributes • Audio Content, Metadata, Text Analysis • Collaborative Filtering
  • 6. Collaborative Filtering - “The Netflix Prize”
  • 7. Collaborative Filtering • “Hey, I like tracks P, Q, R, S!” • “Well, I like tracks Q, R, S, T!” • “Then you should check out track P!” • “Nice! Btw try track T!” • Image via Erik Bernhardsson
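  • 9. Explicit Matrix Factorization
      Figure: a Users x Movies ratings matrix, with the entry for user “Chris” and movie “Inception” highlighted
      • Users explicitly rate a subset of the movie catalog
      • Goal: predict how users will rate new movies
  • 10. Explicit Matrix Factorization
      • Approximate the ratings matrix by the product of low-dimensional user and movie matrices, X (users) and Y (movies)
      • Minimize RMSE (root mean squared error):
        $$\min_{x,y,\beta} \sum_{(u,i)\ \mathrm{rated}} \left(r_{ui} - x_u^T y_i - \beta_u - \beta_i\right)^2 + \lambda \left(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\right)$$
      • $r_{ui}$ = user $u$'s rating for movie $i$
      • $x_u$ = user latent factor vector
      • $y_i$ = item latent factor vector
      • $\beta_u$ = bias for user
      • $\beta_i$ = bias for item
      • $\lambda$ = regularization parameter
      Example ratings matrix (rows = users, columns = movies, ? = not rated):
        ? 3 5 ?
        1 ? ? 1
        2 ? 3 2
        ? ? ? 5
        5 2 ? 4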
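  • 11. Implicit Matrix Factorization
      • Instead of explicit ratings use binary labels
        – 1 = streamed, 0 = never streamed
      • Minimize weighted RMSE (root mean squared error) using a function of total streams as weights:
        $$\min_{x,y,\beta} \sum_{u,i} c_{ui} \left(p_{ui} - x_u^T y_i - \beta_u - \beta_i\right)^2 + \lambda \left(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\right)$$
      • $p_{ui}$ = 1 if user $u$ streamed track $i$, else 0
      • $c_{ui}$ = weight, a function of total streams
      • $x_u$ = user latent factor vector
      • $y_i$ = item latent factor vector
      • $\beta_u$ = bias for user
      • $\beta_i$ = bias for item
      • $\lambda$ = regularization parameter
      Example label matrix (rows = users, columns = songs), approximated by the product of X (users) and Y (songs):
        1 0 0 0 1 0 0 1
        0 0 1 0 0 1 0 0
        1 0 1 0 0 0 1 1
        0 1 0 0 0 1 0 0
        0 0 1 0 0 1 0 0
        1 0 0 0 1 0 0 1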
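
The weighted objective on slide 11 can be evaluated directly on a small example. The following is a minimal sketch, not the author's code: it assumes the weight takes the form c_ui = 1 + alpha * r_ui (a common choice in the implicit-feedback literature, not stated on the slide), where r_ui is the stream count, and it omits the bias terms for brevity. The names `implicit_loss`, `alpha`, and `lam` are illustrative.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def implicit_loss(streams, X, Y, alpha=40.0, lam=0.1):
    """Weighted squared error over ALL (user, track) pairs plus L2 regularization.

    streams[u][i] is the number of times user u streamed track i.
    ASSUMPTION: c_ui = 1 + alpha * r_ui; bias terms are left out.
    """
    loss = 0.0
    for u, row in enumerate(streams):
        for i, r_ui in enumerate(row):
            p_ui = 1.0 if r_ui > 0 else 0.0   # binary label: streamed or not
            c_ui = 1.0 + alpha * r_ui         # weight grows with stream count
            loss += c_ui * (p_ui - dot(X[u], Y[i])) ** 2
    loss += lam * (sum(dot(x, x) for x in X) + sum(dot(y, y) for y in Y))
    return loss


# One user, two tracks, one latent factor.
example = implicit_loss([[2, 0]], X=[[1.0]], Y=[[1.0], [0.5]], alpha=1.0, lam=0.0)
```

Unlike the explicit case, the sum runs over every user-track pair, including tracks never streamed, which is what makes the problem expensive at scale.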
  • 12. Alternating Least Squares (ALS) • Same weighted objective and Users x Songs matrix as slide 11 • Fix songs
  • 13. Alternating Least Squares (ALS) • Fix songs • Solve for users
  • 14. Alternating Least Squares (ALS) • Fix users
  • 15. Alternating Least Squares (ALS) • Fix users • Solve for songs
  • 16-17. Alternating Least Squares (ALS) • Fix users • Solve for songs • Repeat until convergence…
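
The alternation described on the ALS slides (fix songs, solve for users, fix users, solve for songs, repeat) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the code from the linked repository: it assumes weights c_ui = 1 + alpha * r_ui, omits biases, runs a fixed number of iterations rather than testing convergence, and uses a small hand-written Gaussian elimination (`solve`) so that it needs no third-party libraries. All function and parameter names are hypothetical.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def solve_side(R, fixed, alpha, lam):
    """Hold `fixed` constant and solve one weighted ridge regression per row of R."""
    f = len(fixed[0])
    out = []
    for row in R:
        A = [[lam if a == k else 0.0 for k in range(f)] for a in range(f)]
        rhs = [0.0] * f
        for j, r in enumerate(row):
            c = 1.0 + alpha * r          # ASSUMED weight form
            p = 1.0 if r > 0 else 0.0    # binary label
            y = fixed[j]
            for a in range(f):
                rhs[a] += c * p * y[a]
                for k in range(f):
                    A[a][k] += c * y[a] * y[k]
        out.append(solve(A, rhs))
    return out


def als(R, factors=2, iters=10, alpha=40.0, lam=0.1, seed=0):
    """R[u][i] = stream count. Returns (user vectors X, song vectors Y)."""
    rng = random.Random(seed)
    Rt = [list(col) for col in zip(*R)]  # songs x users
    Y = [[rng.random() for _ in range(factors)] for _ in Rt]
    X = []
    for _ in range(iters):
        X = solve_side(R, Y, alpha, lam)   # fix songs, solve for users
        Y = solve_side(Rt, X, alpha, lam)  # fix users, solve for songs
    return X, Y
```

Each half-step is a set of independent least-squares problems, one per user or per song, which is the property the distributed implementations on the later slides exploit.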
  • 18. Alternating Least Squares code: https://github.com/MrChrisJohnson/implicitMF
  • 20. Scaling up Implicit Matrix Factorization with Hadoop
  • 21. Hadoop at Spotify 2009
  • 22. Hadoop at Spotify 2014 • 700 Nodes in our London data center
  • 23. Implicit Matrix Factorization with Hadoop
      Figure via Erik Bernhardsson: in the map step, all log entries are split into a K x L grid of blocks keyed by (u % K, i % L), for u % K = 0 … K-1 and i % L = 0 … L-1. User vectors are bucketed by u % K and item vectors by i % L. The reduce step produces output grouped by u % K.
  • 24. Implicit Matrix Factorization with Hadoop
      Figure via Erik Bernhardsson: one map task
      • Map input: tuples (u, i, count) where u % K = x and i % L = y
      • Distributed cache: all user vectors where u % K = x
      • Distributed cache: all item vectors where i % L = y
      • Mapper: emit contributions
      • Reducer: new vector!
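
The blocking scheme on slides 23 and 24 can be illustrated without Hadoop. The sketch below, with hypothetical function names, groups (u, i, count) tuples into the K x L grid and selects the vectors that the map task for block (x, y) would load from the distributed cache. It only models the partitioning, not the mapper's contributions or the reducer.

```python
from collections import defaultdict


def partition_logs(log_entries, K, L):
    """Group (user, item, count) tuples into a K x L grid keyed by (u % K, i % L)."""
    blocks = defaultdict(list)
    for u, i, count in log_entries:
        blocks[(u % K, i % L)].append((u, i, count))
    return dict(blocks)


def vectors_for_block(x, y, user_vectors, item_vectors, K, L):
    """Select the vectors one map task would need for block (x, y)."""
    users = {u: v for u, v in user_vectors.items() if u % K == x}
    items = {i: v for i, v in item_vectors.items() if i % L == y}
    return users, items


logs = [(0, 0, 3), (1, 2, 1), (2, 1, 5), (3, 3, 2)]
blocks = partition_logs(logs, K=2, L=2)
```

Each map task therefore holds only about 1/K of the user vectors and 1/L of the item vectors in memory, at the cost of re-reading and re-writing data on every iteration.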
  • 25. Hadoop suffers from I/O overhead • IO Bottleneck
  • 26. Spark to the rescue!! • Spark vs Hadoop • http://www.slideshare.net/Hadoop_Summit/spark-and-shark
  • 28-32. First Attempt (broadcast everything)
      Figure: ratings, user vectors, and item vectors spread across workers 1-6, with a copy of YtY sent to every worker
      • For each iteration:
        1. Compute YtY over item vectors and broadcast
        2. Broadcast item vectors
        3. Group ratings by user
        4. Solve for optimal user vector
  • 33. First Attempt (broadcast everything)
  • 34. First Attempt (broadcast everything)
      • Cons:
        – Unnecessarily shuffling all data across the wire each iteration
        – Not caching ratings data
        – Unnecessarily sending a full copy of user/item vectors to all workers
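
Step 1 of each iteration computes YtY once and broadcasts it. A sketch of why that helps, assuming weights c_ui = 1 + alpha * r_ui and no biases (neither is stated on the slides): the matrix each user needs, the sum of c_ui * y_i * y_i^T over all items, equals YtY plus a correction that touches only the items the user actually streamed. The function names below are hypothetical.

```python
def gram(Y):
    """YtY: the sum of outer products y_i y_i^T over all item vectors."""
    f = len(Y[0])
    return [[sum(y[a] * y[k] for y in Y) for k in range(f)] for a in range(f)]


def user_system(YtY, Y, user_streams, alpha, lam):
    """Build (A, b) for one user; the optimal user vector solves A x = b.

    user_streams maps item index -> stream count, for streamed items only.
    ASSUMPTION: c_ui = 1 + alpha * r_ui, so c_ui - 1 is zero for unstreamed items.
    """
    f = len(YtY)
    A = [[YtY[a][k] + (lam if a == k else 0.0) for k in range(f)] for a in range(f)]
    b = [0.0] * f
    for i, r in user_streams.items():
        c = 1.0 + alpha * r
        y = Y[i]
        for a in range(f):
            b[a] += c * y[a]                     # p_ui = 1 for streamed items
            for k in range(f):
                A[a][k] += (c - 1.0) * y[a] * y[k]
    return A, b


Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A, b = user_system(gram(Y), Y, {2: 2}, alpha=1.0, lam=0.5)
```

The per-user work then scales with the number of tracks that user streamed rather than with the size of the whole catalog.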
  • 35-41. Second Attempt (full gridify)
      Figure: ratings, user vectors, and item vectors spread across workers 1-6, with a copy of YtY sent to every worker
      • Group the ratings matrix into K x L blocks, partition, and cache
      • For each iteration:
        1. Compute YtY over item vectors and broadcast
        2. For each item vector, send a copy to each rating block in the item % L column
        3. Compute intermediate terms for each block (partition)
        4. Group by user, aggregate intermediate terms, and solve for optimal user vector
  • 43. Second Attempt
      • Pros
        – Ratings get cached and never shuffled
        – Each partition only requires a subset of item (or user) vectors in memory each iteration
        – Potentially requires less local memory than a “half gridify” scheme
      • Cons
        – Sending lots of intermediate data over the wire each iteration in order to aggregate and solve for optimal vectors
        – More IO overhead than a “half gridify” scheme
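
Steps 3 and 4 of the full gridify scheme can be sketched as follows: each ratings block produces partial terms for the users it contains, and those partial terms are then summed per user before solving. This is a sketch under the same assumptions as before (weights c_ui = 1 + alpha * r_ui, no biases), with hypothetical names, and it runs in a single process rather than on Spark.

```python
def block_terms(block, Y, alpha):
    """Partial (A_u, b_u) contributions from one ratings block.

    block is a list of (user, item, count); Y maps item index -> item vector.
    ASSUMPTION: c_ui = 1 + alpha * r_ui.
    """
    terms = {}
    for u, i, r in block:
        y = Y[i]
        f = len(y)
        if u not in terms:
            terms[u] = ([[0.0] * f for _ in range(f)], [0.0] * f)
        A, b = terms[u]
        c = 1.0 + alpha * r
        for a in range(f):
            b[a] += c * y[a]
            for k in range(f):
                A[a][k] += (c - 1.0) * y[a] * y[k]
    return terms


def aggregate_by_user(per_block_terms):
    """Sum the partial terms for each user across blocks (the extra shuffle)."""
    totals = {}
    for terms in per_block_terms:
        for u, (A, b) in terms.items():
            if u not in totals:
                totals[u] = ([row[:] for row in A], b[:])
                continue
            total_A, total_b = totals[u]
            for a in range(len(b)):
                total_b[a] += b[a]
                for k in range(len(b)):
                    total_A[a][k] += A[a][k]
    return totals


Y = {0: [1.0, 0.0], 1: [0.0, 1.0]}
partials = [block_terms([(0, 0, 1)], Y, alpha=1.0), block_terms([(0, 1, 2)], Y, alpha=1.0)]
totals = aggregate_by_user(partials)
```

The partial terms are f x f matrices per user per block, which is the intermediate data the cons list refers to; adding the broadcast YtY and the regularization term to each aggregated matrix gives the system to solve.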
  • 44-50. Third Attempt (half gridify)
      Figure: ratings, user vectors, and item vectors spread across workers 1-6, with a copy of YtY sent to every worker
      • Partition the ratings matrix into K user (row) and item (column) blocks, and cache
      • For each iteration:
        1. Compute YtY over item vectors and broadcast
        2. For each item vector, send a copy to each user rating partition that requires it (potentially all partitions)
        3. Each partition aggregates intermediate terms and solves for optimal user vectors
      Note that we removed the extra shuffle from the full gridify approach.
  • 51. Third Attempt (half gridify)
      • Pros
        – Ratings get cached and never shuffled
        – Once item vectors are joined with ratings partitions, each partition has enough information to solve for optimal user vectors without any additional shuffling/aggregation (which occurs with the “full gridify” scheme)
      • Cons
        – Each partition could potentially require a copy of each item vector (which may not all fit in memory)
        – Potentially requires more local memory than the “full gridify” scheme
      Actual MLlib code!
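
Step 2 of the half gridify scheme sends each item vector only to the user partitions that need it. A minimal sketch of that routing decision, assuming users are assigned to partitions by u % K (the slides say only "K user blocks") and using hypothetical names; it is not the MLlib implementation:

```python
from collections import defaultdict


def item_destinations(ratings, K):
    """Map each item id to the set of user partitions (u % K) that contain it."""
    dest = defaultdict(set)
    for u, i, _count in ratings:
        dest[i].add(u % K)
    return dict(dest)


def vectors_shipped(dest):
    """Number of item-vector copies sent per iteration under this routing."""
    return sum(len(partitions) for partitions in dest.values())


ratings = [(0, 0, 1), (1, 0, 2), (2, 1, 1), (3, 2, 4), (5, 2, 1)]
dest = item_destinations(ratings, K=2)
shipped = vectors_shipped(dest)      # copies sent with targeted routing
broadcast = 2 * len(dest)            # copies sent if every partition got every item
```

In the worst case every item appears in every partition and the two counts are equal, which matches the first item in the cons list above.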
  • 52. ALS Running Times
      Hadoop      Spark (full gridify)    Spark (half gridify)
      10 hours    3.5 hours               1.5 hours
      • Dataset consisting of Spotify streaming data for 4 million users and 500k artists
        – Note: the full dataset consists of 40M users and 20M songs, but we haven’t yet successfully run it with Spark
      • All jobs run using 40 latent factors
      • Spark jobs used 200 executors with 20G containers
      • Hadoop job used 1k mappers, 300 reducers
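
The relative speed-ups implied by the reported running times can be worked out directly. The snippet below only does arithmetic on the three figures from slide 52; the variable names are illustrative.

```python
# Reported ALS running times from slide 52, in hours.
hours = {
    "Hadoop": 10.0,
    "Spark (full gridify)": 3.5,
    "Spark (half gridify)": 1.5,
}

# Speed-up of each implementation relative to the Hadoop job.
speedup = {name: hours["Hadoop"] / t for name, t in hours.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor:.1f}x")
```

On these figures, half gridify is roughly 6.7 times faster than the Hadoop job and roughly 2.3 times faster than full gridify.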
  • 53. ALS Running Times • ALS runtime numbers via @evansparks using Spark version 0.8.0
  • 56-57. Random Learnings • Kryo serialization is faster than Java serialization but may require you to write and/or register your own serializers
  • 58. Random Learnings • Running with larger datasets often results in failed executors, and the job never fully recovers