How Spotify uses large-scale machine learning on top of Hadoop to power music discovery. From the NYC Predictive Analytics meetup: http://www.meetup.com/NYC-Predictive-Analytics/events/129778152/
2. I’m Erik Bernhardsson
Master’s in Physics from KTH in Stockholm
Started at Spotify in 2008, managed the Analytics team for two years
Moved to NYC in 2011; now Engineering Manager of the Discovery team at Spotify NYC
3. What’s Spotify? What are the challenges?
Started in 2006
Currently has 24 million users
6 million paying users
Available in 20 countries
About 300 engineers, of whom 70 are in NYC
4. Big challenge: Spotify has over 20 million tracks
...and we’re adding 20K every day
5. Good and bad news: we also have 100B streams
Let’s use collaborative filtering!
[Illustration: two users compare tastes]
User A: “Hey, I like tracks P, Q, R, S!”
User B: “Well, I like tracks Q, R, S, T!”
User A: “Then you should check out track P!”
User B: “Nice! Btw try track T!”
10. Here’s a secret behind the Discover page
It’s precomputed every night
[Architecture diagram: log streams flow into Hadoop; the music recs are pushed to Cassandra via hdfs2cass and served by Bartender]
https://github.com/spotify/luigi
https://github.com/spotify/hdfs2cass
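The nightly flow is wired up with Luigi (first URL above). As a minimal sketch of what such a task could look like — the task names, file paths, and the trivial run() body are hypothetical, and the real job would kick off Hadoop steps and an hdfs2cass export:

import luigi

class LogStreams(luigi.ExternalTask):
    """Hypothetical stand-in for the day's play logs landing in storage."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget('logs/%s.tsv' % self.date)

class MusicRecs(luigi.Task):
    """Hypothetical nightly job: read the day's logs, write recommendations."""
    date = luigi.DateParameter()

    def requires(self):
        return LogStreams(self.date)

    def output(self):
        return luigi.LocalTarget('recs/%s.tsv' % self.date)

    def run(self):
        with self.input().open('r') as logs, self.output().open('w') as out:
            for line in logs:
                out.write(line)  # placeholder; the real job computes recs

if __name__ == '__main__':
    luigi.run()

Luigi tracks the dependency graph (requires) and only reruns tasks whose outputs are missing, which is what makes a nightly precompute pipeline like this manageable.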
14. OK so how do we come up with recommendations?
Let’s do collaborative filtering!
In particular, implicit collaborative filtering
In particular, matrix factorization (aka latent factor methods)
16. Step 1: Collect data
[Diagram: access points (APs) stream play events — “play track x”, “play track y”, “play track z” — into Hadoop at about 5K tracks/s; over 100B streams in total]
17. Step 2: Put everything into a big sparse matrix
A very big matrix, too:

M = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1n} \\ c_{21} & c_{22} & \cdots & c_{2n} \\ \vdots & & & \vdots \\ c_{m1} & c_{m2} & \cdots & c_{mn} \end{pmatrix}

with roughly 10^7 users (rows) and 10^7 items (columns); c_{ui} is the number of times user u played item i.
18. Matrix example
Roughly 25 billion nonzero entries
Total size is roughly 25 billion × 12 bytes = 300 GB (“medium data”)
[Annotated entry: the row for Erik and the column for “Never gonna give you up” hold the value 1 — Erik listened to it once]
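As a toy illustration of assembling that matrix (the IDs and play counts here are made up), scipy can build it straight from (user, track, count) triples:

import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical aggregated triples from the logs: (user_id, track_id, play_count)
plays = [(0, 0, 1), (0, 2, 7), (1, 2, 3)]

users, tracks, counts = zip(*plays)
M = coo_matrix((counts, (users, tracks)), dtype=np.float32)
M = M.tocsr()     # compressed sparse rows: store only the nonzero entries
print(M[0, 0])    # -> 1.0 (user 0 played track 0 once)

Storing only the nonzeros is what keeps 25 billion entries at hundreds of GB instead of the petabytes a dense 10^7 × 10^7 matrix would need.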
20. Step 3: Matrix factorization
The idea is to find vectors for each user and each item. Here’s how it looks algebraically:

P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & & & \vdots \\ p_{m1} & p_{m2} & \cdots & p_{mn} \end{pmatrix}

The idea with matrix factorization is to represent this probability distribution like this:

p_{ui} = a_u^T b_i, \qquad M' = A^T B

where A is an f × m matrix whose columns are the user vectors a_u, B is an f × n matrix whose columns are the item vectors b_i, and f is the number of latent factors.
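A minimal sketch of learning such a factorization by stochastic gradient descent on toy data — the log1p target, learning rate, and regularization are assumptions for illustration, not necessarily the objective the real system used:

import numpy as np

plays = [(0, 0, 5), (0, 1, 2), (1, 1, 4), (1, 2, 1)]  # toy (u, i, count) triples
num_users, num_items, f = 2, 3, 40

rng = np.random.default_rng(0)
A = rng.normal(0, 0.1, (num_users, f))   # user vectors a_u (one per row)
B = rng.normal(0, 0.1, (num_items, f))   # item vectors b_i (one per row)
lr, reg = 0.01, 0.01

for epoch in range(200):
    for u, i, c in plays:
        a_u, b_i = A[u].copy(), B[i].copy()
        err = np.log1p(c) - a_u @ b_i     # fit p_ui = a_u^T b_i to the play count
        A[u] += lr * (err * b_i - reg * a_u)
        B[i] += lr * (err * a_u - reg * b_i)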
21. For instance: PLSA
Probabilistic Latent Semantic Indexing (Hofmann, 1999)
Originally invented as a method for text classification

P(u, i) = \sum_z P(u \mid z) \, P(i, z)

i.e. the full matrix P is approximated by the product of a tall, skinny matrix of user vectors (entries P(u|z)) and a short, wide matrix of item vectors (entries P(i, z)).
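A dense toy implementation of the PLSA updates (standard EM, folded into multiplicative updates) — the matrix sizes are tiny and made up, and the real version is the sharded Hadoop job described a few slides later:

import numpy as np

def plsa(C, f=2, iters=100, seed=0):
    """EM for P(u,i) = sum_z P(u|z) P(i,z) on a dense count matrix C (m x n)."""
    rng = np.random.default_rng(seed)
    m, n = C.shape
    U = rng.random((m, f)); U /= U.sum(axis=0)   # U[u,z] = P(u|z): columns sum to 1
    V = rng.random((n, f)); V /= V.sum()         # V[i,z] = P(i,z): sums to 1 overall
    for _ in range(iters):
        R = C / np.maximum(U @ V.T, 1e-12)       # c_ui / P(u,i)
        U, V = U * (R @ V), V * (R.T @ U)        # E-step and M-step numerators
        U /= U.sum(axis=0)                       # renormalize back to distributions
        V /= V.sum()
    return U, V

U, V = plsa(np.array([[5., 2., 0.], [0., 4., 1.]]))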
22. Why are vectors nice?
Super small fingerprints of the musical style or the user’s taste
Usually something like 40-200 elements
Hard to illustrate 40 dimensions in a 2-dimensional slide, but here’s an attempt:
Track X’s vector: (0.87, 1.17, -0.26, 0.56, 2.21, 0.77, -0.03)
[Plot: track x’s vector drawn in the plane of latent factors 1 and 2]
24. Implementing matrix factorization is a little tricky
Iterative algorithms that take many steps to converge
40 parameters for each item and user
So something like 1.2 billion parameters
See “Google News Personalization: Scalable Online Collaborative Filtering”
25. One iteration, one map/reduce job
[Diagram: all log entries are keyed by (u % K, i % L), giving K × L map shards — (u % K = 0, i % L = 0), (u % K = 0, i % L = 1), ..., (u % K = K-1, i % L = L-1); the map step joins each shard with the matching block of user vectors (u % K = k) and item vectors (i % L = l), and the reduce step combines the emitted contributions into new vectors]
27. Here’s what happens in one map shard
Input is a bunch of (user, item, count) tuples
user is the same modulo K for all users
item is the same modulo L for all items
[Diagram: one map task reads tuples (u, i, count) with u % K = x and i % L = y; the distributed cache provides all user vectors with u % K = x and all item vectors with i % L = y; the mapper emits contributions and the reducer produces the new vectors]
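A sketch of the bucketing in plain Python (K and L are arbitrary toy values here); on Hadoop the same (u % K, i % L) key is what routes each tuple to its shard:

from collections import defaultdict

K, L = 4, 8   # number of user buckets / item buckets (arbitrary toy values)

def shard(plays):
    """Group (user, item, count) tuples into the K*L map shards."""
    shards = defaultdict(list)
    for u, i, c in plays:
        shards[(u % K, i % L)].append((u, i, c))
    return shards

# Each shard (x, y) then only needs the user vectors with u % K == x and the
# item vectors with i % L == y -- small enough to fit in the distributed cache.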
28. Might take a while to converge
Start with random vectors around the origin
29. Hadoop?
Yeah, we could probably do it 10x or 100x faster in Spark.
Still, Hadoop is a great way to scale things horizontally.
30. Nice compact vectors, and it’s super fast to compute similarity
[Plot: track x and track y in the plane of latent factors 1 and 2; cos(x, y) = HIGH]

Vectors: p_{ui} = a_u^T b_i, and sim_{ij} = \cos(b_i, b_j) = \frac{b_i^T b_j}{|b_i|\,|b_j|} — computed in O(f) per pair.

IPMF item-item: P(i \to j) = \exp(b_j^T b_i)/Z_i = \frac{\exp(b_j^T b_i)}{\sum_k \exp(b_k^T b_i)}

IPMF item-item, MDS variant: P(i \to j) = \frac{\exp(-|b_j - b_i|^2)}{\sum_k \exp(-|b_k - b_i|^2)}

i                        j                        sim_{ij}
2pac                     2pac                     1.0
2pac                     Notorious B.I.G.         0.91
2pac                     Dr. Dre                  0.87
2pac                     Florence + the Machine   0.26
Florence + the Machine   Lana Del Rey             0.81
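Computing a table like the one above is just normalized dot products. A sketch with numpy — brute force over all items, which is exactly what the later slides on approximate nearest neighbors avoid:

import numpy as np

def most_similar(B, i, topn=5):
    """Cosine similarity between item i and every row of B (n_items x f)."""
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    sims = Bn @ Bn[i]                    # cos(b_i, b_j) for all j at once
    best = np.argsort(-sims)[:topn]
    return list(zip(best, sims[best]))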
31. Music recommendations are now just dot products
[Plot: user u’s vector next to track x and track y in the plane of latent factors 1 and 2]
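So scoring a user against the whole catalog is one matrix-vector product plus a top-N selection; a sketch (function and argument names are illustrative):

import numpy as np

def recommend(a_u, B, topn=10):
    """p_ui = a_u^T b_i for every item, then return the top-N item indices."""
    scores = B @ a_u
    top = np.argpartition(-scores, topn)[:topn]   # O(n) partial selection
    return top[np.argsort(-scores[top])]          # order the winners by score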
32. It’s still tricky to search for similar tracks though
We have many millions of tracks, and you don’t want to compute the cosine for all pairs
33. Approximate nearest neighbors to the rescue!
Cut the space recursively with random hyperplanes.
If two points are close, they are more likely to end up on the same side of each plane.
https://github.com/spotify/annoy
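Usage looks roughly like this with today’s Python bindings (the vectors here are random stand-ins for real item vectors):

import numpy as np
from annoy import AnnoyIndex   # pip install annoy

f = 40
item_vectors = np.random.normal(size=(1000, f))   # stand-in for real item vectors

index = AnnoyIndex(f, 'angular')      # angular distance ~ cosine similarity
for i, v in enumerate(item_vectors):
    index.add_item(i, v)
index.build(50)                       # 50 trees; more trees -> better recall
print(index.get_nns_by_item(0, 10))   # 10 approximate neighbours of item 0

Each tree is one recursive random-hyperplane partition of the space; querying many trees and merging the candidates is what trades a little accuracy for sub-linear lookups.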
34. How do you retrain the model?
It takes a long time to train a full factorization model.
We want to update user vectors much more frequently (at least daily!)
However, item vectors are fairly stable.
Throw away user vectors and recreate them from scratch!
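With the item vectors held fixed, recreating one user’s vector is just a small f × f solve over the items that user played. A sketch assuming a regularized least-squares objective (the regularizer and the log1p target are assumptions, matching the earlier training sketch):

import numpy as np

def user_vector(B, played, reg=0.1):
    """Solve (Bu^T Bu + reg*I) a_u = Bu^T y over the items the user played.
    B: (n_items, f) fixed item vectors; played: {item_id: play_count}."""
    items, counts = zip(*played.items())
    Bu = B[list(items)]                          # (n_played, f)
    y = np.log1p(np.array(counts, dtype=float))  # same target as training
    f = B.shape[1]
    return np.linalg.solve(Bu.T @ Bu + reg * np.eye(f), Bu.T @ y)

Because this touches only one user’s history at a time, it parallelizes trivially and can run daily, while the full factorization runs much less often.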
35. The pipeline
A “hack” to recalculate user vectors more frequently.
Is this a little complicated? Yeah, probably.
[Diagram: a full matrix factorization runs on each month’s logs (May 2013, June 2013, ...) and produces item vectors and user vectors; between runs, the stable item vectors seed a chain of cheaper jobs — user vectors (1), (2), (3), (4), (5) — each recomputed as more logs arrive over time]
36. Ideal case
Put all vectors in Cassandra/Memcached, use Storm to update in real time
37. But Hadoop is pretty nice at parallelizing recommendations
24 cores but not a lot of RAM? mmap is your friend.
[Diagram: one map/reduce job; on each machine a group of mappers (M) shares one distributed-cache copy (DC) of the ANN index of all item vectors, reads user vectors as input, and emits recs]
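The mmap trick, sketched with numpy (the file name and sizes are made up): write the item vectors to disk once, then let every mapper process on the box map the same file read-only, so the OS shares the pages and the matrix costs RAM only once regardless of core count:

import numpy as np

n_items, f = 1_000_000, 40   # toy sizes

vecs = np.memmap('item_vectors.f32', dtype=np.float32, mode='w+',
                 shape=(n_items, f))
vecs[:] = np.random.normal(size=(n_items, f)).astype(np.float32)
vecs.flush()                 # written once, up front

B = np.memmap('item_vectors.f32', dtype=np.float32, mode='r',
              shape=(n_items, f))           # each process maps it read-only
user_vec = np.random.normal(size=f).astype(np.float32)
scores = B @ user_vec                       # pages are faulted in on demand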
38. Music recommendations!
Our latest baby, the Discover page. Featuring lots of different types of recommendations.
Expect this to change quite a lot in the next few months!