SlideShare una empresa de Scribd logo
1 de 41
Descargar para leer sin conexión
August 5, 2013
ML ♡ Hadoop @ Spotify
If it’s slow, buy more racks
I’m Erik Bernhardsson
Master’s in Physics from KTH in Stockholm
Started at Spotify in 2008, managed the Analytics team for two years
Moved to NYC in 2011, now the Engineering Manager of the Discovery team at Spotify in NYC
2
August 5, 2013
What’s Spotify? What are the
challenges?
Started in 2006
Currently has 24 million users
6 million paying users
Available in 20 countries
About 300 engineers, of which 70 in NYC
And adding 20K every day...
Big challenge: Spotify has over 20 million tracks
4
Good and bad news: we also have 100B streams
Let’s use collaborative
filtering!
5
Hey,
I like tracks P, Q, R, S!
Well,
I like tracks Q, R, S, T!
Then you should check out
track P!
Nice! Btw try track T!
Hadoop at Spotify
6
Back in 2009
Matrix factorization
causing cluster to
overheat? Don’t worry,
put up curtain
7
Source:
Hadoop today
700 nodes at our data center in London
8
The Discover page
9
Here’s a secret behind the Discover page
It’s precomputed every night
10
HADOOP
Cassandra
Bartender
Log streams
Music recs
hdfs2cass
Here’s a secret behind the Discover page
It’s precomputed every night
10
HADOOP
Cassandra
Bartender
Log streams
Music recs
hdfs2cass
Here’s a secret behind the Discover page
It’s precomputed every night
10
HADOOP
Cassandra
Bartender
Log streams
Music recs
hdfs2cass
https://github.com/spotify/luigi
Here’s a secret behind the Discover page
It’s precomputed every night
10
HADOOP
Cassandra
Bartender
Log streams
Music recs
hdfs2cass
https://github.com/spotify/luigi
https://github.com/spotify/hdfs2cass
OK so how do we come up with recommendations?
Let’s do collaborative filtering!
In particular, implicit collaborative filtering
In particular, matrix factorization (aka latent factor methods)
11
Stop!!!
Break it down!!
12
AP AP AP AP AP AP
Hadoop
(>100B streams)
Play track z
play track y
play track x
5k tracks/s
Step 1: Collect data
13
Step 2: Put everything into a big sparse matrix
14
@ . . . 7 . . . . . . . . .
...
...
...
A
very big matrix too:
M =
0
B
B
B
@
c11 c12 . . . c1n
c21 c22 . . . c2n
...
...
cm1 cm2 . . . cmn
1
C
C
C
A
| {z }
107
items
9
>>>>>>>>>=
>>>>>>>>>;
107
users
Matrix example
Roughly 25 billion nonzero entries
Total size is roughly 25 billion * 12 bytes = 300 GB (“medium data”)
15
Matrix example
Roughly 25 billion nonzero entries
Total size is roughly 25 billion * 12 bytes = 300 GB (“medium data”)
15
Erik
Never gonna give
you up
Erik listened to Never
gonna give you up 1
times
Idea is to find vectors for each user and item
Here’s how it looks like algebraically:
Step 3: Matrix factorization
16P =
B
B
B
@
p21 p22 . . . p2n
...
...
pm1 pm2 . . . pmn
C
C
C
A
The idea with matrix factorization is to represent this probability distribu-
tion like this:
pui = aT
u bi
M0
= AT
B
0
B
B
B
B
B
B
@
1
C
C
C
C
C
C
A
⇡
0
B
B
B
B
B
B
@
1
C
C
C
C
C
C
A
| {z }
f
f
0
. . . . . . .
1 0
. .
1
For instance, for PLSA
Probabilistic Latent Semantic Indexing (Hoffman, 1999)
Invented as a method intended for text classification
17
P =
0
B
B
B
B
B
B
@
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
1
C
C
C
C
C
C
A
⇡
0
B
B
B
B
B
B
@
. .
. .
. .
. .
. .
. .
1
C
C
C
C
C
C
A
| {z }
user vectors
✓
. . . . . . .
. . . . . . .
◆
| {z }
item vectors
PLSA
0
B
B
B
B
B
B
@
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
1
C
C
C
C
C
C
A
| {z }
P (u,i)=
P
z
P (u|z)P (i,z)
⇡
0
B
B
B
B
B
B
@
. .
. .
. .
. .
. .
. .
1
C
C
C
C
C
C
A
| {z }
P (u|z)
✓
. . . . . . .
. . . . . . .
◆
| {z }
P (i,z)
X
Why are vectors nice?
Super small fingerprints of the musical style or the user’s taste
Usually something like 40-200 elements
Hard to illustrate 40 dimensions in a 2 dimensional slide, but here’s an attempt:
18
0.87 1.17 -0.26 0.56 2.21 0.77 -0.03
Latent factor 1
Latent factor 2
track x's vector
Track X:
Another example of tracks in two dimensions
19
Implementing matrix factorization is a little tricky
Iterative algorithms that stake many steps to converge
40 parameters for each item and user
So something like 1.2 billion parameters
“Google News Personalization: Scalable Online Collaborative Filtering”
20
One iteration, one map/reduce job
21
Reduce stepMap step
u % K = 0
i % L = 0
u % K = 0
i % L = 1
...
u % K = 0
i % L = L-1
u % K = 1
i % L = 0
u % K = 1
i % L = 1
... ...
... ... ... ...
u % K = K-1
i % L = 0
... ...
u % K = K-1
i % L = L-1
item vectors
item%L=0
item vectors
item%L=1
item vectors
i % L = L-1
user vectors
u % K = 0
user vectors
u % K = 1
user vectors
u % K = K-1
all log entries
u % K = 1
i % L = 1
u % K = 0
u % K = 1
u % K = K-1
One iteration, one map/reduce job
21
Reduce stepMap step
u % K = 0
i % L = 0
u % K = 0
i % L = 1
...
u % K = 0
i % L = L-1
u % K = 1
i % L = 0
u % K = 1
i % L = 1
... ...
... ... ... ...
u % K = K-1
i % L = 0
... ...
u % K = K-1
i % L = L-1
item vectors
item%L=0
item vectors
item%L=1
item vectors
i % L = L-1
user vectors
u % K = 0
user vectors
u % K = 1
user vectors
u % K = K-1
all log entries
u % K = 1
i % L = 1
u % K = 0
u % K = 1
u % K = K-1
Here’s what happens in one map shard
Input is a bunch of (user, item, count) tuples
user is the same modulo K for all users
item is the same modulo L for all items
22
One map task
Distributed
cache:
All user vectors
where u % K = x
Distributed
cache:
All item vectors
where i % L = y
Mapper Emit contributions
Map input:
tuples (u, i, count)
where
u % K = x
and
i % L = y
Reducer New vector!
Might take a while to converge
Start with random vectors
around the origin
23
Hadoop?
Yeah we could probably do it in Spark 10x or 100x faster.
Still, Hadoop is a great way to scale things horizontally.
????
24
Nice compact vectors and it’s super fast to compute
similarity
25
Latent factor 1
Latent factor 2
track x
track y
cos(x, y) = HIGH
IPMF item item:
P(i ! j) = exp(bT
j bi)/Zi =
exp(bT
j bi)
P
k exp(bT
k bi)
VECTORS:
pui = aT
u bi
simij = cos(bi, bj) =
bT
i bj
|bi||bj|
O(f)
i j simi,j
2pac 2pac 1.0
2pac Notorious B.I.G. 0.91
2pac Dr. Dre 0.87
2pac Florence + the Machine 0.26
IPMF item item:
P(i ! j) = exp(bT
j bi)/Zi =
exp(bT
j bi)
P
k exp(bT
k bi)
VECTORS:
pui = aT
u bi
simij = cos(bi, bj) =
bT
i bj
|bi||bj|
O(f)
i j simi,j
2pac 2pac 1.0
2pac Notorious B.I.G. 0.91
2pac Dr. Dre 0.87
2pac Florence + the Machine 0.26
Florence + the Machine Lana Del Rey 0.81
IPMF item item MDS:
P(i ! j) = exp(bT
j bi)/Zi =
exp( |bj bi|
2
)
P
k exp( |bk bi|
2
)
Music recommendations are now just dot products
26
Latent factor 1
Latent factor 2
track x
User u's vector
track y
It’s still tricky to search for similar tracks though
We have many million tracks and you don’t want to compute cosine for all pairs
27
Approximate nearest neighbors to the rescue!
Cut the space recursively by random
plane.
If two points are close, they are more
likely to end up on the same side of
each plane.
https://github.com/spotify/annoy
28
How do you retrain the model?
It takes a long time to train a full factorization model.
We want to update user vectors much more frequently (at least daily!)
However, item vectors are fairly stable.
Throw away user vectors and recreate them from scratch!
29
The pipeline
“Hack” to recalculate user vectors
more frequently.
Is this a little complicated? Yeah
probably.
30
May 2013 logs
Matrix
factorization
Item
vectors
User
vectors
June 2013 logs
Matrix
factorization
Item
vectors
User
vectors
+ more logs
Seeding
User vectors
(1)
Logs
User vectors
(2)
More logs
User vectors
(3)
More logs
User vectors
(4)
More logs
User vectors
(5)
More logs
Time
Ideal case
Put all vectors in Cassandra/Memcached, use Storm to update in real time
31
But Hadoop is pretty nice at parallelizing recommendations
24 core but not a lot of
RAM? mmap is your
friend
32
One map reduce job
Recs!
ANN index
of all vectors
Distributed cache:
User vectors
M M
M M
DC
M M
M M
DC
M M
M M
DC
Music recommendations!
Our latest baby, the
Discover page. Featuring
lots of different types of
recommendations.
Expect this to change
quite a lot in the next few
months!
33
More music recommendations!
Radio!
34
More music recommendations!
Related artists
35
Thanks!
Btw, we’re hiring Machine Learning Engineers
and Data Engineers!
Email me at erikbern@spotify.com!

Más contenido relacionado

La actualidad más candente

Algorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyAlgorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyChris Johnson
 
Building Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyBuilding Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyVidhya Murali
 
Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At SpotifyVidhya Murali
 
From Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyFrom Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyChris Johnson
 
Approximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupApproximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupErik Bernhardsson
 
Machine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyMachine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyChing-Wei Chen
 
Interactive Recommender Systems
Interactive Recommender SystemsInteractive Recommender Systems
Interactive Recommender SystemsRoelof van Zwol
 
Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyChris Johnson
 
Big data and machine learning @ Spotify
Big data and machine learning @ SpotifyBig data and machine learning @ Spotify
Big data and machine learning @ SpotifyOscar Carlsson
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...Hakka Labs
 
Playlist Recommendations @ Spotify
Playlist Recommendations @ SpotifyPlaylist Recommendations @ Spotify
Playlist Recommendations @ SpotifyNikhil Tibrewal
 
The Universal Recommender
The Universal RecommenderThe Universal Recommender
The Universal RecommenderPat Ferrel
 
Scala Data Pipelines @ Spotify
Scala Data Pipelines @ SpotifyScala Data Pipelines @ Spotify
Scala Data Pipelines @ SpotifyNeville Li
 
Big Data At Spotify
Big Data At SpotifyBig Data At Spotify
Big Data At SpotifyAdam Kawa
 
Storm at Spotify
Storm at SpotifyStorm at Spotify
Storm at SpotifyNeville Li
 
Personalizing the listening experience
Personalizing the listening experiencePersonalizing the listening experience
Personalizing the listening experienceMounia Lalmas-Roelleke
 
Engagement, Metrics & Personalisation at Scale
Engagement, Metrics &  Personalisation at ScaleEngagement, Metrics &  Personalisation at Scale
Engagement, Metrics & Personalisation at ScaleMounia Lalmas-Roelleke
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Spark Summit
 
How Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At SpotifyHow Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At SpotifyJosh Baer
 

La actualidad más candente (20)

Algorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyAlgorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at Spotify
 
Building Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyBuilding Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at Spotify
 
Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At Spotify
 
From Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyFrom Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover Weekly
 
Approximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupApproximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetup
 
Machine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyMachine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at Spotify
 
Interactive Recommender Systems
Interactive Recommender SystemsInteractive Recommender Systems
Interactive Recommender Systems
 
Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and Spotify
 
Big data and machine learning @ Spotify
Big data and machine learning @ SpotifyBig data and machine learning @ Spotify
Big data and machine learning @ Spotify
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
 
Playlist Recommendations @ Spotify
Playlist Recommendations @ SpotifyPlaylist Recommendations @ Spotify
Playlist Recommendations @ Spotify
 
The Universal Recommender
The Universal RecommenderThe Universal Recommender
The Universal Recommender
 
Scala Data Pipelines @ Spotify
Scala Data Pipelines @ SpotifyScala Data Pipelines @ Spotify
Scala Data Pipelines @ Spotify
 
Big Data At Spotify
Big Data At SpotifyBig Data At Spotify
Big Data At Spotify
 
Storm at Spotify
Storm at SpotifyStorm at Spotify
Storm at Spotify
 
Personalizing the listening experience
Personalizing the listening experiencePersonalizing the listening experience
Personalizing the listening experience
 
Engagement, Metrics & Personalisation at Scale
Engagement, Metrics &  Personalisation at ScaleEngagement, Metrics &  Personalisation at Scale
Engagement, Metrics & Personalisation at Scale
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
 
How Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At SpotifyHow Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At Spotify
 
Search @ Spotify
Search @ Spotify Search @ Spotify
Search @ Spotify
 

Destacado

RapidMiner: Word Vector Tool And Rapid Miner
RapidMiner:   Word Vector Tool And Rapid MinerRapidMiner:   Word Vector Tool And Rapid Miner
RapidMiner: Word Vector Tool And Rapid MinerRapidmining Content
 
Phoenix - A High Performance Open Source SQL Layer over HBase
Phoenix - A High Performance Open Source SQL Layer over HBasePhoenix - A High Performance Open Source SQL Layer over HBase
Phoenix - A High Performance Open Source SQL Layer over HBaseSalesforce Developers
 
Collaborative Filtering with Spark
Collaborative Filtering with SparkCollaborative Filtering with Spark
Collaborative Filtering with SparkChris Johnson
 
Music data is scary, beautiful and exciting
Music data is scary, beautiful and excitingMusic data is scary, beautiful and exciting
Music data is scary, beautiful and excitingBrian Whitman
 
The Echo Nest at Music and Bits, October 21 2009
The Echo Nest at Music and Bits, October 21 2009The Echo Nest at Music and Bits, October 21 2009
The Echo Nest at Music and Bits, October 21 2009Brian Whitman
 
Cut Bait - 10 Years of Dorkbot
Cut Bait - 10 Years of DorkbotCut Bait - 10 Years of Dorkbot
Cut Bait - 10 Years of DorkbotBrian Whitman
 
The echo nest-music_discovery(1)
The echo nest-music_discovery(1)The echo nest-music_discovery(1)
The echo nest-music_discovery(1)Sophia Yeiji Shin
 
The Echo Nest Remix at Dorkbot NYC, March 4 2009
The Echo Nest Remix at Dorkbot NYC, March 4 2009The Echo Nest Remix at Dorkbot NYC, March 4 2009
The Echo Nest Remix at Dorkbot NYC, March 4 2009Brian Whitman
 
The future music platform
The future music platformThe future music platform
The future music platformBrian Whitman
 
Echo nest-api-boston-2012
Echo nest-api-boston-2012Echo nest-api-boston-2012
Echo nest-api-boston-2012Paul Lamere
 
Luigi Presentation at OSCON 2013
Luigi Presentation at OSCON 2013Luigi Presentation at OSCON 2013
Luigi Presentation at OSCON 2013Erik Bernhardsson
 
Analysis of Spotify & New Feature Ideas
Analysis of Spotify & New Feature IdeasAnalysis of Spotify & New Feature Ideas
Analysis of Spotify & New Feature IdeasSarah L. Miller
 
Exploring BigData with Google BigQuery
Exploring BigData with Google BigQueryExploring BigData with Google BigQuery
Exploring BigData with Google BigQueryDharmesh Vaya
 
Recommendation at Netflix Scale
Recommendation at Netflix ScaleRecommendation at Netflix Scale
Recommendation at Netflix ScaleJustin Basilico
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data ScienceErik Bernhardsson
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark Summit
 
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Xavier Amatriain
 

Destacado (19)

RapidMiner: Word Vector Tool And Rapid Miner
RapidMiner:   Word Vector Tool And Rapid MinerRapidMiner:   Word Vector Tool And Rapid Miner
RapidMiner: Word Vector Tool And Rapid Miner
 
Phoenix - A High Performance Open Source SQL Layer over HBase
Phoenix - A High Performance Open Source SQL Layer over HBasePhoenix - A High Performance Open Source SQL Layer over HBase
Phoenix - A High Performance Open Source SQL Layer over HBase
 
storm at twitter
storm at twitterstorm at twitter
storm at twitter
 
Collaborative Filtering with Spark
Collaborative Filtering with SparkCollaborative Filtering with Spark
Collaborative Filtering with Spark
 
Music data is scary, beautiful and exciting
Music data is scary, beautiful and excitingMusic data is scary, beautiful and exciting
Music data is scary, beautiful and exciting
 
The Echo Nest at Music and Bits, October 21 2009
The Echo Nest at Music and Bits, October 21 2009The Echo Nest at Music and Bits, October 21 2009
The Echo Nest at Music and Bits, October 21 2009
 
Cut Bait - 10 Years of Dorkbot
Cut Bait - 10 Years of DorkbotCut Bait - 10 Years of Dorkbot
Cut Bait - 10 Years of Dorkbot
 
The echo nest-music_discovery(1)
The echo nest-music_discovery(1)The echo nest-music_discovery(1)
The echo nest-music_discovery(1)
 
The Echo Nest Remix at Dorkbot NYC, March 4 2009
The Echo Nest Remix at Dorkbot NYC, March 4 2009The Echo Nest Remix at Dorkbot NYC, March 4 2009
The Echo Nest Remix at Dorkbot NYC, March 4 2009
 
The future music platform
The future music platformThe future music platform
The future music platform
 
Echo nest-api-boston-2012
Echo nest-api-boston-2012Echo nest-api-boston-2012
Echo nest-api-boston-2012
 
Luigi future
Luigi futureLuigi future
Luigi future
 
Luigi Presentation at OSCON 2013
Luigi Presentation at OSCON 2013Luigi Presentation at OSCON 2013
Luigi Presentation at OSCON 2013
 
Analysis of Spotify & New Feature Ideas
Analysis of Spotify & New Feature IdeasAnalysis of Spotify & New Feature Ideas
Analysis of Spotify & New Feature Ideas
 
Exploring BigData with Google BigQuery
Exploring BigData with Google BigQueryExploring BigData with Google BigQuery
Exploring BigData with Google BigQuery
 
Recommendation at Netflix Scale
Recommendation at Netflix ScaleRecommendation at Netflix Scale
Recommendation at Netflix Scale
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data Science
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
 
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
 

Similar a ML+Hadoop at NYC Predictive Analytics

streamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxstreamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxGopiNathVelivela
 
Introduction to recommender systems
Introduction to recommender systemsIntroduction to recommender systems
Introduction to recommender systemsArnaud de Myttenaere
 
New zealand bloom filter
New zealand bloom filterNew zealand bloom filter
New zealand bloom filterxlight
 
Intro matlab and convolution islam
Intro matlab and convolution islamIntro matlab and convolution islam
Intro matlab and convolution islamIslam Alabbasy
 
High-Performance Haskell
High-Performance HaskellHigh-Performance Haskell
High-Performance HaskellJohan Tibell
 
Large-scale computation without sacrificing expressiveness
Large-scale computation without sacrificing expressivenessLarge-scale computation without sacrificing expressiveness
Large-scale computation without sacrificing expressivenessSangjin Han
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataAlbert Bifet
 
Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013MLconf
 
SAMPLE QUESTIONExercise 1 Consider the functionf (x,C).docx
SAMPLE QUESTIONExercise 1 Consider the functionf (x,C).docxSAMPLE QUESTIONExercise 1 Consider the functionf (x,C).docx
SAMPLE QUESTIONExercise 1 Consider the functionf (x,C).docxagnesdcarey33086
 
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Manifold Blurring Mean Shift algorithms for manifold denoising, presentation,...
Manifold Blurring Mean Shift algorithms for manifold denoising, presentation,...Manifold Blurring Mean Shift algorithms for manifold denoising, presentation,...
Manifold Blurring Mean Shift algorithms for manifold denoising, presentation,...Florent Renucci
 
3.2 Graphs of Functions
3.2 Graphs of Functions3.2 Graphs of Functions
3.2 Graphs of Functionssmiller5
 
3.2 Graphs of Functions
3.2 Graphs of Functions3.2 Graphs of Functions
3.2 Graphs of Functionssmiller5
 
lecture 23
lecture 23lecture 23
lecture 23sajinsc
 

Similar a ML+Hadoop at NYC Predictive Analytics (20)

streamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxstreamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptx
 
Introduction to recommender systems
Introduction to recommender systemsIntroduction to recommender systems
Introduction to recommender systems
 
New zealand bloom filter
New zealand bloom filterNew zealand bloom filter
New zealand bloom filter
 
Programming Hp33s talk v3
Programming Hp33s talk v3Programming Hp33s talk v3
Programming Hp33s talk v3
 
Intro matlab and convolution islam
Intro matlab and convolution islamIntro matlab and convolution islam
Intro matlab and convolution islam
 
MUMS: Agent-based Modeling Workshop - Practical Bayesian Optimization for Age...
MUMS: Agent-based Modeling Workshop - Practical Bayesian Optimization for Age...MUMS: Agent-based Modeling Workshop - Practical Bayesian Optimization for Age...
MUMS: Agent-based Modeling Workshop - Practical Bayesian Optimization for Age...
 
Deep Learning for Computer Vision: Attention Models (UPC 2016)
Deep Learning for Computer Vision: Attention Models (UPC 2016)Deep Learning for Computer Vision: Attention Models (UPC 2016)
Deep Learning for Computer Vision: Attention Models (UPC 2016)
 
High-Performance Haskell
High-Performance HaskellHigh-Performance Haskell
High-Performance Haskell
 
Large-scale computation without sacrificing expressiveness
Large-scale computation without sacrificing expressivenessLarge-scale computation without sacrificing expressiveness
Large-scale computation without sacrificing expressiveness
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013
 
An Efficient and Parallel Abstract Interpreter in Scala — First Algorithm
An Efficient and Parallel Abstract Interpreter in Scala — First AlgorithmAn Efficient and Parallel Abstract Interpreter in Scala — First Algorithm
An Efficient and Parallel Abstract Interpreter in Scala — First Algorithm
 
Matlab1
Matlab1Matlab1
Matlab1
 
SAMPLE QUESTIONExercise 1 Consider the functionf (x,C).docx
SAMPLE QUESTIONExercise 1 Consider the functionf (x,C).docxSAMPLE QUESTIONExercise 1 Consider the functionf (x,C).docx
SAMPLE QUESTIONExercise 1 Consider the functionf (x,C).docx
 
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
 
Realtime Analytics
Realtime AnalyticsRealtime Analytics
Realtime Analytics
 
Manifold Blurring Mean Shift algorithms for manifold denoising, presentation,...
Manifold Blurring Mean Shift algorithms for manifold denoising, presentation,...Manifold Blurring Mean Shift algorithms for manifold denoising, presentation,...
Manifold Blurring Mean Shift algorithms for manifold denoising, presentation,...
 
3.2 Graphs of Functions
3.2 Graphs of Functions3.2 Graphs of Functions
3.2 Graphs of Functions
 
3.2 Graphs of Functions
3.2 Graphs of Functions3.2 Graphs of Functions
3.2 Graphs of Functions
 
lecture 23
lecture 23lecture 23
lecture 23
 

Último

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 

Último (20)

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

ML+Hadoop at NYC Predictive Analytics

  • 1. August 5, 2013 ML ♡ Hadoop @ Spotify If it’s slow, buy more racks
  • 2. I’m Erik Bernhardsson Master’s in Physics from KTH in Stockholm Started at Spotify in 2008, managed the Analytics team for two years Moved to NYC in 2011, now the Engineering Manager of the Discovery team at Spotify in NYC 2
  • 3. August 5, 2013 What’s Spotify? What are the challenges? Started in 2006 Currently has 24 million users 6 million paying users Available in 20 countries About 300 engineers, of which 70 in NYC
  • 4. And adding 20K every day... Big challenge: Spotify has over 20 million tracks 4
  • 5. Good and bad news: we also have 100B streams Let’s use collaborative filtering! 5 Hey, I like tracks P, Q, R, S! Well, I like tracks Q, R, S, T! Then you should check out track P! Nice! Btw try track T!
  • 7. Back in 2009 Matrix factorization causing cluster to overheat? Don’t worry, put up curtain 7
  • 8. Source: Hadoop today 700 nodes at our data center in London 8
  • 10. Here’s a secret behind the Discover page It’s precomputed every night 10 HADOOP Cassandra Bartender Log streams Music recs hdfs2cass
  • 11. Here’s a secret behind the Discover page It’s precomputed every night 10 HADOOP Cassandra Bartender Log streams Music recs hdfs2cass
  • 12. Here’s a secret behind the Discover page It’s precomputed every night 10 HADOOP Cassandra Bartender Log streams Music recs hdfs2cass https://github.com/spotify/luigi
  • 13. Here’s a secret behind the Discover page It’s precomputed every night 10 HADOOP Cassandra Bartender Log streams Music recs hdfs2cass https://github.com/spotify/luigi https://github.com/spotify/hdfs2cass
  • 14. OK so how do we come up with recommendations? Let’s do collaborative filtering! In particular, implicit collaborative filtering In particular, matrix factorization (aka latent factor methods) 11
  • 16. AP AP AP AP AP AP Hadoop (>100B streams) Play track z play track y play track x 5k tracks/s Step 1: Collect data 13
  • 17. Step 2: Put everything into a big sparse matrix 14 @ . . . 7 . . . . . . . . . ... ... ... A very big matrix too: M = 0 B B B @ c11 c12 . . . c1n c21 c22 . . . c2n ... ... cm1 cm2 . . . cmn 1 C C C A | {z } 107 items 9 >>>>>>>>>= >>>>>>>>>; 107 users
  • 18. Matrix example Roughly 25 billion nonzero entries Total size is roughly 25 billion * 12 bytes = 300 GB (“medium data”) 15
  • 19. Matrix example Roughly 25 billion nonzero entries Total size is roughly 25 billion * 12 bytes = 300 GB (“medium data”) 15 Erik Never gonna give you up Erik listened to Never gonna give you up 1 times
  • 20. Idea is to find vectors for each user and item Here’s how it looks like algebraically: Step 3: Matrix factorization 16P = B B B @ p21 p22 . . . p2n ... ... pm1 pm2 . . . pmn C C C A The idea with matrix factorization is to represent this probability distribu- tion like this: pui = aT u bi M0 = AT B 0 B B B B B B @ 1 C C C C C C A ⇡ 0 B B B B B B @ 1 C C C C C C A | {z } f f 0 . . . . . . . 1 0 . . 1
  • 21. For instance, for PLSA Probabilistic Latent Semantic Indexing (Hoffman, 1999) Invented as a method intended for text classification 17 P = 0 B B B B B B @ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 C C C C C C A ⇡ 0 B B B B B B @ . . . . . . . . . . . . 1 C C C C C C A | {z } user vectors ✓ . . . . . . . . . . . . . . ◆ | {z } item vectors PLSA 0 B B B B B B @ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 C C C C C C A | {z } P (u,i)= P z P (u|z)P (i,z) ⇡ 0 B B B B B B @ . . . . . . . . . . . . 1 C C C C C C A | {z } P (u|z) ✓ . . . . . . . . . . . . . . ◆ | {z } P (i,z) X
  • 22. Why are vectors nice? Super small fingerprints of the musical style or the user’s taste Usually something like 40-200 elements Hard to illustrate 40 dimensions in a 2 dimensional slide, but here’s an attempt: 18 0.87 1.17 -0.26 0.56 2.21 0.77 -0.03 Latent factor 1 Latent factor 2 track x's vector Track X:
  • 23. Another example of tracks in two dimensions 19
  • 24. Implementing matrix factorization is a little tricky Iterative algorithms that stake many steps to converge 40 parameters for each item and user So something like 1.2 billion parameters “Google News Personalization: Scalable Online Collaborative Filtering” 20
  • 25. One iteration, one map/reduce job 21 Reduce stepMap step u % K = 0 i % L = 0 u % K = 0 i % L = 1 ... u % K = 0 i % L = L-1 u % K = 1 i % L = 0 u % K = 1 i % L = 1 ... ... ... ... ... ... u % K = K-1 i % L = 0 ... ... u % K = K-1 i % L = L-1 item vectors item%L=0 item vectors item%L=1 item vectors i % L = L-1 user vectors u % K = 0 user vectors u % K = 1 user vectors u % K = K-1 all log entries u % K = 1 i % L = 1 u % K = 0 u % K = 1 u % K = K-1
  • 26. One iteration, one map/reduce job 21 Reduce stepMap step u % K = 0 i % L = 0 u % K = 0 i % L = 1 ... u % K = 0 i % L = L-1 u % K = 1 i % L = 0 u % K = 1 i % L = 1 ... ... ... ... ... ... u % K = K-1 i % L = 0 ... ... u % K = K-1 i % L = L-1 item vectors item%L=0 item vectors item%L=1 item vectors i % L = L-1 user vectors u % K = 0 user vectors u % K = 1 user vectors u % K = K-1 all log entries u % K = 1 i % L = 1 u % K = 0 u % K = 1 u % K = K-1
  • 27. Here’s what happens in one map shard Input is a bunch of (user, item, count) tuples user is the same modulo K for all users item is the same modulo L for all items 22 One map task Distributed cache: All user vectors where u % K = x Distributed cache: All item vectors where i % L = y Mapper Emit contributions Map input: tuples (u, i, count) where u % K = x and i % L = y Reducer New vector!
  • 28. Might take a while to converge Start with random vectors around the origin 23
  • 29. Hadoop? Yeah we could probably do it in Spark 10x or 100x faster. Still, Hadoop is a great way to scale things horizontally. ???? 24
  • 30. Nice compact vectors and it’s super fast to compute similarity 25 Latent factor 1 Latent factor 2 track x track y cos(x, y) = HIGH IPMF item item: P(i ! j) = exp(bT j bi)/Zi = exp(bT j bi) P k exp(bT k bi) VECTORS: pui = aT u bi simij = cos(bi, bj) = bT i bj |bi||bj| O(f) i j simi,j 2pac 2pac 1.0 2pac Notorious B.I.G. 0.91 2pac Dr. Dre 0.87 2pac Florence + the Machine 0.26 IPMF item item: P(i ! j) = exp(bT j bi)/Zi = exp(bT j bi) P k exp(bT k bi) VECTORS: pui = aT u bi simij = cos(bi, bj) = bT i bj |bi||bj| O(f) i j simi,j 2pac 2pac 1.0 2pac Notorious B.I.G. 0.91 2pac Dr. Dre 0.87 2pac Florence + the Machine 0.26 Florence + the Machine Lana Del Rey 0.81 IPMF item item MDS: P(i ! j) = exp(bT j bi)/Zi = exp( |bj bi| 2 ) P k exp( |bk bi| 2 )
  • 31. Music recommendations are now just dot products 26 Latent factor 1 Latent factor 2 track x User u's vector track y
  • 32. It’s still tricky to search for similar tracks though We have many million tracks and you don’t want to compute cosine for all pairs 27
  • 33. Approximate nearest neighbors to the rescue! Cut the space recursively by random plane. If two points are close, they are more likely to end up on the same side of each plane. https://github.com/spotify/annoy 28
  • 34. How do you retrain the model? It takes a long time to train a full factorization model. We want to update user vectors much more frequently (at least daily!) However, item vectors are fairly stable. Throw away user vectors and recreate them from scratch! 29
  • 35. The pipeline “Hack” to recalculate user vectors more frequently. Is this a little complicated? Yeah probably. 30 May 2013 logs Matrix factorization Item vectors User vectors June 2013 logs Matrix factorization Item vectors User vectors + more logs Seeding User vectors (1) Logs User vectors (2) More logs User vectors (3) More logs User vectors (4) More logs User vectors (5) More logs Time
  • 36. Ideal case Put all vectors in Cassandra/Memcached, use Storm to update in real time 31
  • 37. But Hadoop is pretty nice at parallelizing recommendations 24 core but not a lot of RAM? mmap is your friend 32 One map reduce job Recs! ANN index of all vectors Distributed cache: User vectors M M M M DC M M M M DC M M M M DC
  • 38. Music recommendations! Our latest baby, the Discover page. Featuring lots of different types of recommendations. Expect this to change quite a lot in the next few months! 33
  • 41. Thanks! Btw, we’re hiring Machine Learning Engineers and Data Engineers! Email me at erikbern@spotify.com!