Presented at the Machine Learning class at Chalmers, Gothenburg.
http://www.cse.chalmers.se/research/lab/courses.php?coid=9
The talk tries to connect the theoretical machine learning class with industry examples.
2. ● D-student starting 2009
● Graduated last year from CSALL
(Student in this class 2013)
● Master's thesis at Spotify
● Data Engineer at Spotify in Gothenburg
Me
3. ● What is data at Spotify?
● Big data and processing it
● Using data at Spotify
● Machine Learning
Outline
5. What is data at Spotify?
● Songs
○ Track metadata
○ Cover arts
○ Album
○ Genres, mood etc.
● User generated
○ Listens
○ Clicks
○ Page views
● Users
○ Country, email etc.
● Playlists
○ Tracks of playlist
○ Add/Removes
30 Million songs
60 Million Monthly Active Users
58 Markets
15 Million subscribers
1.5 Billion Playlists
7. Big Data and processing it
● 20 TB of compressed data per day
○ 200 TB generated and stored per day (with replication)
● Our business is highly dependent on these logs
○ We pay artists based on plays, and plays = logs
Too much to store on a single computer. We need a
cluster to process it!
.. this is what is typically called “Big Data”
8. Big Data and processing it
● Distributed computing and storage
○ Hadoop
■ MapReduce
○ Cassandra
● Hadoop cluster
○ 1100 nodes
○ ~8000 jobs/day
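To make the MapReduce idea concrete, here is a minimal single-machine sketch of counting plays per track, the kind of aggregation that "plays = logs" implies. The log format, the track names and all function names are assumptions made for illustration. A real Hadoop job would run the map and reduce steps in parallel across the cluster.

```python
from collections import defaultdict

# Hypothetical log lines: "user_id<TAB>track_id<TAB>ms_played"
LOG_LINES = [
    "u1\ttrack_a\t210000",
    "u2\ttrack_a\t185000",
    "u1\ttrack_b\t30000",
    "u3\ttrack_a\t200000",
    "u2\ttrack_b\t240000",
]

def map_play(line):
    """Map step: emit (track_id, 1) for every play record."""
    _user, track, _ms = line.split("\t")
    yield track, 1

def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_plays(track, counts):
    """Reduce step: sum the counts for one track."""
    return track, sum(counts)

def run_job(lines):
    """Run map, shuffle and reduce sequentially on one machine."""
    pairs = (pair for line in lines for pair in map_play(line))
    return dict(reduce_plays(k, v) for k, v in shuffle(pairs).items())

play_counts = run_job(LOG_LINES)
print(play_counts)
```

The map and reduce functions only ever see one record or one key at a time, which is what lets a framework spread them over many nodes.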
10. Using data at Spotify
Every part of the company is interested in our data
● Product
○ Are people using X? Should we focus on features such as Y?
● Insights
○ What music is trending? Which artists are popular where?
● Performance
○ How is latency in country Y? Did this change reduce stutter in country X?
11. Using data at Spotify
● Data-driven decision making
○ Like.. every decision.
○ Analysts / Data scientists
● A/B test everything!
● A/B testing:
○ Statistical hypothesis testing
○ Simple randomized experiment with >= 2
variants (A, B)
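As an illustration of the statistical hypothesis testing step, the sketch below compares two variants with a permutation test on the difference in means. The choice of test, the metric (seconds to first play) and the synthetic data are all assumptions. The slides do not say which test is actually used.

```python
import random
from statistics import fmean

def permutation_test(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference mean(a) - mean(b) and an estimated p-value.
    """
    rng = random.Random(seed)
    observed = fmean(a) - fmean(b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the group labels are exchangeable.
        rng.shuffle(pooled)
        diff = fmean(pooled[:len(a)]) - fmean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (extreme + 1) / (n_resamples + 1)

# Synthetic seconds-to-first-play samples (not real data).
data_rng = random.Random(42)
variant_a = [data_rng.gauss(8.0, 2.0) for _ in range(200)]
variant_b = [data_rng.gauss(7.0, 2.0) for _ in range(200)]

diff, p_value = permutation_test(variant_a, variant_b)
print(f"difference in means: {diff:.2f} s, p-value: {p_value:.4f}")
```

A permutation test is used here because it needs no distributional assumptions. With more than two variants, each one would typically be compared against the control.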
12. Using data at Spotify: A/B testing
Objective: Decrease the time from loading a playlist to the first play
Hypothesis: The bigger the button, the faster users find it
Test setup:
● A - variant 1
○ 2% of MAU in the US and SE
● B - variant 2
○ 2% of MAU in the US and SE
● Control - normal
○ Rest of the users in the US and SE
“The shuffle button”
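One common way to get a stable 2% / 2% / rest split is to hash the user id into a bucket. The sketch below shows that idea. The hashing scheme, the experiment name and the user ids are assumptions, not a description of how Spotify assigns users. Filtering to US and SE monthly active users is assumed to happen before this step.

```python
import hashlib

def assign_group(user_id, experiment="shuffle_button", a_pct=2, b_pct=2):
    """Deterministically map a user to 'A', 'B' or 'control'."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100  # bucket in [0, 100)
    if bucket < a_pct:
        return "A"
    if bucket < a_pct + b_pct:
        return "B"
    return "control"

# Check the split on hypothetical user ids.
counts = {"A": 0, "B": 0, "control": 0}
for i in range(100_000):
    counts[assign_group(f"user-{i}")] += 1
print(counts)
```

Including the experiment name in the hash means the same user can land in different groups for different experiments, while always staying in the same group within one experiment.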
16. ● Machine Learning
○ User analysis
○ Artist disambiguation
○ Recommender systems
Outline
17. “ A music session
somehow represents
a moment for the
user. Can we find
these moments and
describe them? ”
18. ● Take a subset of user listening data, joined with new genre
data
○ Combine listens into sessions
■ Consecutive plays with no pause longer than 15 min
○ Session = [genres]
● Clustering algorithms to find similar sessions
○ K-means / Hierarchical clustering
● Describe the clusters using logistic regression
Machine Learning: Cluster user music sessions
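The pipeline on this slide can be sketched roughly as follows: split plays into sessions at 15-minute gaps, turn each session into a genre vector, and run k-means. Representing a session as the share of plays per genre, the fixed starting centroids and the toy data are assumptions made for this sketch.

```python
from collections import Counter

GAP_SECONDS = 15 * 60  # a pause longer than this starts a new session

def sessionize(plays, gap=GAP_SECONDS):
    """Split one user's (timestamp, genre) plays into sessions of genres."""
    sessions, current, last_ts = [], [], None
    for ts, genre in sorted(plays):
        if last_ts is not None and ts - last_ts > gap:
            sessions.append(current)
            current = []
        current.append(genre)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions

def genre_vector(session, genres):
    """Represent a session as the share of plays per genre."""
    counts = Counter(session)
    return [counts[g] / len(session) for g in genres]

def nearest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return dists.index(min(dists))

def kmeans(points, centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) from the given starting centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its members (keep it if empty).
        centroids = [
            [sum(dim) / len(members) for dim in zip(*members)] if members else old
            for members, old in zip(clusters, centroids)
        ]
    return [nearest(p, centroids) for p in points]

GENRES = ["rock", "jazz", "pop"]
plays = [  # (timestamp in seconds, genre), toy data
    (0, "rock"), (200, "rock"), (450, "pop"),
    (5000, "jazz"), (5300, "jazz"),
    (9000, "rock"), (9200, "rock"),
    (20000, "jazz"), (20100, "jazz"), (20400, "pop"),
]
sessions = sessionize(plays)
vectors = [genre_vector(s, GENRES) for s in sessions]
labels = kmeans(vectors, centroids=[vectors[0], vectors[1]])
print(sessions, labels)
```

In practice the starting centroids would be chosen at random or with a seeding method, and the number of clusters would need tuning.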
20. Machine Learning: Cluster user music sessions
Per cluster logistic regression
w: weight vector
Each w_i can be interpreted as the effect of the x_i variable
x_i = genres
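For reference, the standard logistic regression model behind these weights can be written as below. Reading "per cluster" as one one-vs-rest model per cluster c is an assumption, and so is the bias term b_c, which the slide does not mention.

```latex
P(y_c = 1 \mid x) = \sigma\!\left(w_c^{\top} x + b_c\right)
  = \frac{1}{1 + \exp\!\left(-\left(\sum_i w_{c,i}\, x_i + b_c\right)\right)}
```

Here y_c = 1 means the session belongs to cluster c. Increasing x_i by one unit changes the log-odds of belonging to the cluster by w_{c,i}, which is why a large positive w_{c,i} marks genre i as characteristic of that cluster.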
21. Machine Learning: Cluster user music sessions
Clusters described by logistic regression
Cluster label = name of the x_i with the largest w_i
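A minimal sketch of this labelling step: fit one logistic regression per cluster (that cluster against the rest) and name the cluster after the genre with the largest weight. The gradient-descent training loop, the learning rate and the toy session vectors are assumptions. Any logistic regression implementation could be used instead.

```python
import math

def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, target in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            error = 1.0 / (1.0 + math.exp(-z)) - target  # prediction minus label
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def describe_clusters(X, labels, feature_names):
    """Name each cluster after the feature with the largest weight w_i."""
    names = {}
    for cluster in sorted(set(labels)):
        y = [1.0 if label == cluster else 0.0 for label in labels]
        w, _b = train_logistic(X, y)
        names[cluster] = feature_names[w.index(max(w))]
    return names

GENRES = ["rock", "jazz", "pop"]
# Toy session vectors (share of plays per genre) with their cluster labels.
X = [[2 / 3, 0.0, 1 / 3], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2 / 3, 1 / 3]]
labels = [0, 1, 0, 1]

cluster_names = describe_clusters(X, labels, GENRES)
print(cluster_names)
```

On real data some regularisation would usually be added so the weights stay comparable across genres.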
28. Machine Learning: Artist disambiguation
● Rank artists by their probability of being ambiguous
● Apply clustering to each “ambiguous” artist's
albums/tracks
○ Using features such as country, release year,
label/licensor etc.
○ Distinct clusters could be different artists
● Present the result clearly for manual curation
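The clustering step could look something like the sketch below, which groups the releases credited to one artist name using single-linkage agglomerative clustering. The distance function, the threshold and the example releases are all hypothetical. The slides do not specify the algorithm or how features are weighted.

```python
def release_distance(a, b):
    """Hypothetical distance: country and label mismatches plus a capped year gap."""
    d = float(a["country"] != b["country"])
    d += float(a["label"] != b["label"])
    d += min(abs(a["year"] - b["year"]) / 10.0, 1.0)
    return d

def single_link_clusters(items, distance, threshold):
    """Agglomerative single-linkage clustering cut at `threshold`.

    Two clusters are merged while any pair of their members is closer
    than the threshold. Returns clusters as sorted lists of item indices.
    """
    clusters = [[i] for i in range(len(items))]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(distance(items[a], items[b]) < threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return [sorted(c) for c in clusters]

# Hypothetical releases that all carry the same artist name.
releases = [
    {"title": "Album 1", "country": "SE", "year": 1998, "label": "Label X"},
    {"title": "Album 2", "country": "SE", "year": 2001, "label": "Label X"},
    {"title": "Album 3", "country": "US", "year": 2012, "label": "Label Y"},
    {"title": "Album 4", "country": "US", "year": 2014, "label": "Label Y"},
]
groups = single_link_clusters(releases, release_distance, threshold=1.0)
for group in groups:
    print([releases[i]["title"] for i in group])
```

Each resulting group is a candidate for "a different artist with the same name", which is then handed to a human curator rather than acted on automatically.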
35. Machine Learning: Recommender system
Finding recommendations:
Approximate nearest neighbour (ANN)
code: https://github.com/spotify/annoy
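To show what an approximate nearest neighbour search approximates, here is the exact version: score every candidate against the query vector and keep the top k. Searching over latent vectors from a collaborative filtering model, the vector values and the use of cosine similarity are assumptions made for this sketch. A library such as Annoy avoids scoring every candidate by using an index, trading a little accuracy for speed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbours(query, vectors, k=2):
    """Exact search: score every candidate and keep the k most similar."""
    scored = sorted(vectors.items(),
                    key=lambda kv: cosine(query, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# Hypothetical latent vectors for tracks and one user.
track_vectors = {
    "track_a": [0.9, 0.1, 0.0],
    "track_b": [0.8, 0.2, 0.1],
    "track_c": [0.0, 0.1, 0.9],
    "track_d": [0.1, 0.9, 0.2],
}
user_vector = [1.0, 0.2, 0.0]

recommendations = nearest_neighbours(user_vector, track_vectors)
print(recommendations)
```

The exact search is linear in the number of candidates per query, which is what becomes too slow with tens of millions of tracks.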
Related artists & Radio:
Similar to user recommendations, but with more models,
not all of them CF-based
Multiple models:
Score candidates from all models, combine and rank!
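"Score, combine and rank" can be sketched as a weighted sum of per-model scores. The weighted-sum rule, the model names, the weights and the scores are assumptions. The slides do not say how the models are actually combined.

```python
def combine_and_rank(model_scores, weights, top_n=3):
    """Weighted sum of per-model scores; a candidate a model did not score gets 0."""
    combined = {}
    for model, scores in model_scores.items():
        for candidate, score in scores.items():
            combined[candidate] = combined.get(candidate, 0.0) + weights[model] * score
    # Highest combined score first; ties broken alphabetically for stability.
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    return [candidate for candidate, _ in ranked[:top_n]]

# Hypothetical scores from a CF model and a content-based model.
model_scores = {
    "cf": {"artist_a": 0.9, "artist_b": 0.4, "artist_c": 0.1},
    "content": {"artist_b": 0.8, "artist_c": 0.7, "artist_d": 0.6},
}
weights = {"cf": 0.6, "content": 0.4}

ranking = combine_and_rank(model_scores, weights)
print(ranking)
```

A candidate that several models agree on can outrank one that a single model scores highly, which is the point of combining them.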
36. Machine Learning: Recommender system
I only went through this quickly. Read more about the
Spotify recommender system here:
Doing this on MapReduce
Comparing with Netflix
Music Rec @ MLConf 2014
37. ● More content-based ML
○ Fingerprinting: The Echo Nest
○ Content-based music recommendation using
convolutional neural networks
● Personalize everything
○ Emails
○ Ads
○ User profiling
● ML in parts of the product other than the recommender system
.. some final words on the future of ML at Spotify
38. Summary
● Multiple data sources -> multiple angles
● Data drives decisions through A/B testing
● User analysis
○ Cluster and describe with classifier
● Artist disambiguation
○ Cluster and give to manual curators
● Recommender systems
○ Collaborative filtering
39. ● We supervise thesis students
○ Artist disambiguation/deduplication
○ Cluster user music sessions
○ Context-based recommender systems
○ Personalized ads / Personalized emails
● We have internships!
www.spotify.com/jobs
.. and potentially you could help us?