Big data and machine learning @ Spotify

Oscar Carlsson
Data Engineer
lad@spotify.com
Friday 6/3 2015

Me
● D-student starting 2009
● Graduated last year from CSALL (student in this class 2013)
● Master thesis at Spotify
● Data Engineer at Spotify in Gothenburg

Outline
● What is data at Spotify?
● Big data and processing it
● Using data at Spotify
● Machine Learning

In the Machine Learning class:
Supervised learning: data (X), labels (Y)
Unsupervised learning: data (X)

What is data at Spotify?
● Songs
○ Track metadata
○ Cover arts
○ Album
○ Genres, mood etc.
● User generated
○ Listens
○ Clicks
○ Page views
● Users
○ Country, email etc.
● Playlists
○ Tracks of playlist
○ Add/Removes
30 Million songs
60 Million Monthly Active Users
58 Markets
15 Million subscribers
1.5 Billion Playlists

Big Data and processing it
● 20 TB compressed data / DAY
○ 200 TB generated and stored / day (replication)
● Our business is highly dependent on these logs
○ We pay artists depending on plays, and plays = logs
Too much to store on a single computer. We need a cluster to process it!
.. this is typically what is called “Big Data”

Big Data and processing it
● Distributed computing and storage
○ Hadoop
■ MapReduce
○ Cassandra
● Hadoop cluster
○ 1100 nodes
○ ~8000 jobs/day
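A minimal sketch of the MapReduce idea, run in a single process rather than on Hadoop. The log schema (`user_id`, `track_id`) and the function names are assumptions for illustration; the slides do not describe the real log format. It counts plays per track, the kind of aggregate that payouts would depend on.

```python
from collections import defaultdict

# Hypothetical play log: one (user_id, track_id) record per play.
play_log = [
    ("u1", "track_a"),
    ("u2", "track_a"),
    ("u1", "track_b"),
    ("u3", "track_a"),
    ("u2", "track_b"),
    ("u3", "track_c"),
]


def map_play(record):
    """Map step: emit (track_id, 1) for every play."""
    _user_id, track_id = record
    yield track_id, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_plays(track_id, counts):
    """Reduce step: sum the counts for one track."""
    return track_id, sum(counts)


mapped = (pair for record in play_log for pair in map_play(record))
plays_per_track = dict(reduce_plays(k, v) for k, v in shuffle(mapped).items())
print(plays_per_track)
```

On a cluster the map and reduce steps run in parallel on different nodes, and the framework performs the shuffle between them.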

Using data at Spotify
Every part of the company is interested in our data
● Product
○ Are people using X? Should we focus on features such as Y?
● Insights
○ What music is trending? Which artists are popular where?
● Performance
○ How is latency in country Y? Did this reduce stutter in country X?

Using data at Spotify
● Data-driven decision making
○ Like.. every decision.
○ Analysts / Data scientists
● A/B test everything!
● A/B testing:
○ Statistical hypothesis testing
○ Simple randomized experiment with >= 2 variants (A, B)
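One common way to do the statistical hypothesis test for an A/B comparison of two shares is a two-proportion z-test. The sketch below is an assumption about how such a test could look, not a description of Spotify's tooling; the function name and the example counts are made up.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants have the same share."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled share under the null hypothesis that A and B do not differ.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: users with a "slow" first play out of all users in the variant.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between the variants is unlikely to be due to chance alone.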

Using data at Spotify: A/B testing
Objective: Decrease the time from loading a playlist to the first play
Hypothesis: The bigger the button, the faster users find it
Test setup:
● A - variant 1
○ 2% of US and SE MAU
● B - variant 2
○ 2% of US and SE MAU
● Control - normal
○ Rest of users in US and SE
“The shuffle button”
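A sketch of how users could be split into the groups above. Hashing the user id is one common approach and is an assumption here; the slides do not say how the assignment is done. The experiment name and function name are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment="shuffle_button"):
    """Deterministically map a user to "A", "B" or "control".

    Hashing (experiment, user_id) gives a stable bucket in 0..99, so a user
    always sees the same variant and different experiments split differently.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < 2:  # 2% of users
        return "A"
    if bucket < 4:  # another 2% of users
        return "B"
    return "control"  # rest of the users


print(assign_variant("user-42"))
```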

Using data at Spotify: A/B testing
[Figure: the CONTROL, A and B variants side by side]

Analytics: A/B testing
Metric:
Share of users whose first play takes > 500 ms
(500 ms is made up)
Let's roll out A to all users and throw away B!
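The metric can be computed per variant as below. All numbers are invented for illustration and are not results from the real test; the 500 ms threshold is made up, as the slide says.

```python
# Invented measurements: milliseconds from playlist load to first play, per user.
time_to_first_play_ms = {
    "control": [620, 480, 910, 530, 450],
    "A": [410, 390, 560, 470, 300],
    "B": [600, 520, 480, 700, 650],
}
THRESHOLD_MS = 500  # made up


def share_above(times, threshold_ms):
    """Share of users whose time to first play exceeds the threshold."""
    return sum(t > threshold_ms for t in times) / len(times)


shares = {
    variant: share_above(times, THRESHOLD_MS)
    for variant, times in time_to_first_play_ms.items()
}
# Lower is better: fewer users wait longer than the threshold.
best_variant = min(shares, key=shares.get)
print(shares, best_variant)
```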

Outline
● Machine Learning
○ User analysis
○ Artist disambiguation
○ Recommender systems

“A music session somehow represents a moment for the user. Can we find these moments and describe them?”

Machine Learning: Cluster user music sessions
● Take a subset of user listening data with new genre data
○ Combine listens into sessions
■ Consecutive plays with no pause of 15 min or more
○ Session = [genres]
● Clustering algorithms to find similar sessions
○ K-means / Hierarchical clustering
● Describe the clusters using logistic regression
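The session step can be sketched as follows. The input format, a time-sorted list of (timestamp, genre) pairs for one user, is an assumption, and the gap is measured between play start times because the slides do not say how the pause is defined.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(minutes=15)


def split_into_sessions(plays, max_gap=MAX_GAP):
    """Group one user's time-sorted (timestamp, genre) plays into sessions.

    A new session starts whenever the gap to the previous play is max_gap
    or longer. Each session is returned as a list of genres.
    """
    sessions = []
    last_time = None
    for timestamp, genre in plays:
        if last_time is None or timestamp - last_time >= max_gap:
            sessions.append([])
        sessions[-1].append(genre)
        last_time = timestamp
    return sessions


# Hypothetical listening history for one user.
plays = [
    (datetime(2015, 3, 6, 10, 0), "rock"),
    (datetime(2015, 3, 6, 10, 4), "rock"),
    (datetime(2015, 3, 6, 10, 9), "indie"),
    (datetime(2015, 3, 6, 10, 40), "jazz"),
    (datetime(2015, 3, 6, 10, 43), "jazz"),
]
sessions = split_into_sessions(plays)
print(sessions)
```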

[Figure panels: K-Means; Per cluster classification]
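A small, plain-Python sketch of K-means on sessions. Representing each session as a vector of genre shares is an assumption; the slides only say that a session is a list of genres. The data is made up.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return [sum(column) / len(points) for column in zip(*points)]


def k_means(points, k, iterations=20, seed=0):
    """Return (assignment, centroids) after a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids


# Made-up sessions as shares of [rock, jazz, pop].
session_vectors = [
    [0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [1.0, 0.0, 0.0],
    [0.0, 0.9, 0.1], [0.1, 0.8, 0.1], [0.0, 1.0, 0.0],
]
assignment, centroids = k_means(session_vectors, k=2)
print(assignment)
```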

Per cluster logistic regression
w: weight vector
Each w_i can be interpreted as the effect of the variable x_i
x_i = genres

Clusters described by logistic regression
Cluster label: name of the x_i with the largest w_i
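A sketch of describing one cluster with logistic regression: train a "in this cluster or not" classifier on the genre features and name the cluster after the genre with the largest weight. The genre list, the data and the plain gradient-descent training loop are assumptions for illustration.

```python
import math

GENRES = ["rock", "jazz", "pop"]  # hypothetical genre features x_i


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(X, y, learning_rate=0.5, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, target in zip(X, y):
            prediction = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            error = prediction - target
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        w = [wi - learning_rate * g / len(X) for wi, g in zip(w, grad_w)]
        b -= learning_rate * grad_b / len(X)
    return w, b


# Made-up sessions as genre shares; y = 1 if the session is in the cluster.
X = [
    [0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [1.0, 0.0, 0.0],
    [0.0, 0.9, 0.1], [0.1, 0.8, 0.1], [0.0, 1.0, 0.0],
]
y = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic_regression(X, y)
cluster_label = GENRES[max(range(len(w)), key=lambda i: w[i])]
print(cluster_label)
```

Repeating this for every cluster gives one readable label per cluster.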

Machine Learning
Artist disambiguation
Cleaning up the artists pages

Machine Learning: Artist disambiguation

Let's listen to those tracks!
Is it really the same Fredrik?

● Rank artists by their probability of being ambiguous
● Apply clustering to each “ambiguous” artist's albums/tracks
○ Using features such as country, release year, label/licensor etc.
○ Distinct clusters could be different artists
● Present this nicely for manual curation
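The clustering step could look roughly like this. The distance function, the 10-year window, the threshold and the album data are all hypothetical; the slides name the features but not the algorithm or its parameters. Single-linkage grouping is used here only because it is short.

```python
def album_distance(a, b):
    """Hypothetical distance: number of features on which two albums disagree."""
    distance = 0
    distance += a["country"] != b["country"]
    distance += a["label"] != b["label"]
    distance += abs(a["year"] - b["year"]) > 10  # far apart in release year
    return distance


def cluster_albums(albums, max_distance=1):
    """Single-linkage clustering: albums joined by a chain of close pairs
    end up in the same cluster."""
    clusters = []
    for album in albums:
        merged = [album]
        remaining = []
        for cluster in clusters:
            if any(album_distance(album, other) <= max_distance for other in cluster):
                merged.extend(cluster)
            else:
                remaining.append(cluster)
        clusters = remaining + [merged]
    return clusters


# Made-up albums credited to one "ambiguous" artist name.
albums = [
    {"title": "Album 1", "country": "SE", "year": 1975, "label": "Label X"},
    {"title": "Album 2", "country": "SE", "year": 1978, "label": "Label X"},
    {"title": "Album 3", "country": "US", "year": 2012, "label": "Label Y"},
    {"title": "Album 4", "country": "US", "year": 2014, "label": "Label Z"},
]
clusters = cluster_albums(albums)
cluster_titles = sorted(sorted(a["title"] for a in c) for c in clusters)
print(cluster_titles)
```

Two distinct clusters here hint at two different artists sharing a name, which a curator can then confirm.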

Machine Learning: Recommender system
The discover page

Collaborative filtering
● Build a matrix of user plays
● Compute similarity between items
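The two steps above, sketched on a toy play log. Cosine similarity is one common choice of item similarity and is an assumption here, as are the data and function names.

```python
import math
from collections import defaultdict

# Made-up play log: (user_id, track_id).
plays = [
    ("u1", "track_a"), ("u1", "track_b"),
    ("u2", "track_a"), ("u2", "track_b"),
    ("u3", "track_c"), ("u3", "track_a"),
]


def build_play_matrix(plays):
    """Sparse item-by-user matrix: {track: {user: play_count}}."""
    matrix = defaultdict(lambda: defaultdict(int))
    for user, track in plays:
        matrix[track][user] += 1
    return matrix


def cosine_similarity(row_a, row_b):
    """Cosine of the angle between two sparse play-count rows."""
    dot = sum(count * row_b[user] for user, count in row_a.items() if user in row_b)
    norm_a = math.sqrt(sum(c * c for c in row_a.values()))
    norm_b = math.sqrt(sum(c * c for c in row_b.values()))
    return dot / (norm_a * norm_b)


matrix = build_play_matrix(plays)
sim_ab = cosine_similarity(matrix["track_a"], matrix["track_b"])
sim_ac = cosine_similarity(matrix["track_a"], matrix["track_c"])
sim_bc = cosine_similarity(matrix["track_b"], matrix["track_c"])
print(round(sim_ab, 3), round(sim_ac, 3), sim_bc)
```

Tracks played by the same users come out as similar, without looking at the audio at all.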

4 Million tracks x 60 Million users
→ Pairwise similarity infeasible
Approximate the matrix with NMF
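A toy sketch of NMF using the classic multiplicative update rules, which is one of several ways to fit such a factorization and not necessarily what runs in production. The play matrix V (users x tracks) is approximated as W x H with small non-negative factors. Data and parameter values are made up.

```python
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Sum of squared differences between V and its approximation W x H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))


def nmf(V, rank, iterations=200, seed=0, eps=1e-9):
    """Factorize V (n x m) into non-negative W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    rows, cols = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(rows)]
    H = [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        numer = matmul(Wt, V)
        denom = matmul(matmul(Wt, W), H)
        H = [[h * nu / (de + eps) for h, nu, de in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, numer, denom)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        numer = matmul(V, Ht)
        denom = matmul(W, matmul(H, Ht))
        W = [[w * nu / (de + eps) for w, nu, de in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, numer, denom)]
    return W, H


# Made-up play counts: 4 users x 4 tracks with two taste groups.
V = [
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 3, 4],
    [0, 0, 4, 3],
]
W, H = nmf(V, rank=2)
print(round(frobenius_error(V, W, H), 3))
```

Each row of W is a small user vector and each column of H a small track vector.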

Matrix factorization (latent factor models)

Small vectors: cosine similarity and dot product are efficient to compute

Finding recommendations:
Approximate nearest neighbour (ANN)
code: https://github.com/spotify/annoy
Related artists & Radio:
Similar to user recommendations, but with more models, not all of them CF-based
Multiple models:
Score candidates from all models, combine and rank!
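A sketch of "score, combine and rank". The latent vectors, the second model's scores and the weighted sum are assumptions; the slides do not say how the model scores are combined. The exhaustive dot-product scoring here is what an ANN index such as annoy would approximate at scale.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical latent vectors from a collaborative filtering model.
user_vector = [0.9, 0.1]
track_vectors = {
    "track_a": [1.0, 0.0],
    "track_b": [0.8, 0.3],
    "track_c": [0.0, 1.0],
}

model_scores = {
    # Collaborative filtering: dot product of user and track vectors.
    "cf": {track: dot(user_vector, vec) for track, vec in track_vectors.items()},
    # Hypothetical second model, for example a content-based one.
    "content": {"track_a": 0.2, "track_b": 0.9, "track_c": 0.1},
}
weights = {"cf": 0.5, "content": 0.5}  # made-up weights


def combine_and_rank(model_scores, weights):
    """Weighted sum of each candidate's scores, highest combined score first."""
    combined = {}
    for model, scores in model_scores.items():
        for candidate, score in scores.items():
            combined[candidate] = combined.get(candidate, 0.0) + weights[model] * score
    return sorted(combined, key=combined.get, reverse=True)


ranking = combine_and_rank(model_scores, weights)
print(ranking)
```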

I went through this quickly; read more about the Spotify Rec sys here:
● Doing this on MapReduce
● Comparing with Netflix
● Music Rec @ MLConf 2014

.. some last words on the future of ML at Spotify
● More content-based ML
○ Fingerprinting: Echo Nest
○ Content-based music recommendation using convolutional neural networks
● Personalize everything
○ Emails
○ Ads
○ User profiling
● ML on parts of the product other than Rec Sys

Summary
● Multiple data sources -> multiple angles
● Data drives decisions through A/B testing
● User analysis
○ Cluster and describe with classifier
● Artist disambiguation
○ Cluster and give to manual curators
● Recommender systems
○ Collaborative filtering

.. and potentially you could help us?
● We supervise thesis workers
○ Artist disambiguation/deduplication
○ Cluster user music sessions
○ Context-based recommender systems
○ Personalized ads / Personalized emails
● We have internships!
www.spotify.com/jobs

Oscar Carlsson
lad@spotify.com
LinkedIn
Thank you for listening!
