Spark and GraphX in the Netflix Recommender System: We at Netflix strive to deliver maximum enjoyment and entertainment to our millions of members across the world. We do so by offering great content and by constantly innovating on our product. A key strategy for optimizing both is to follow a data-driven approach. Data allows us to find optimal approaches to applications such as content buying or our renowned personalization algorithms. But in order to learn from this data, we need to be smart about the algorithms we use, how we apply them, and how we can scale them to our volume of data (over 50 million members and 5 billion hours streamed over three months). In this talk we describe how Spark and GraphX can be leveraged to address some of our scale challenges. In particular, we share insights and lessons learned on how to run large probabilistic clustering and graph diffusion algorithms on top of GraphX, making it possible to apply them at Netflix scale.
4. Introduction
● Goal: Help members find content that they’ll enjoy, to maximize satisfaction and retention
● Core part of the product
○ Every impression is a recommendation
6. Main Challenge - Scale
● Algorithms @ Netflix Scale
○ > 62M members
○ > 50 countries
○ > 1,000 device types
○ > 100M hours / day
● Distributed machine learning algorithms help with scale
○ Spark and GraphX
8. Spark And GraphX
● Spark - distributed in-memory computational engine using Resilient Distributed Datasets (RDDs)
● GraphX - extends RDDs to multigraphs and provides graph analytics
● Convenient and fast, all the way from prototyping (iSpark, Zeppelin) to production
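To make the RDD-to-graph relationship concrete, here is a minimal, hypothetical sketch (not from the talk): a GraphX Graph[VD, ED] is assembled from one RDD of (VertexId, attribute) pairs and one RDD of Edge objects. The vertex and edge data below is invented for illustration; `sc` is the usual SparkContext.

```scala
// Illustrative only: building a tiny GraphX multigraph from two RDDs.
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Vertices: (id, attribute) pairs; edges: Edge(srcId, dstId, attribute).
val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq(
  (1L, "Seattle"), (2L, "Movie A"), (3L, "Movie B")))
val edges: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(2L, 1L, "shot in"), Edge(3L, 1L, "related to")))

// Assemble the graph; vertex and edge attributes ride along as type params.
val graph: Graph[String, String] = Graph(vertices, edges)
```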
9. Two Machine Learning Problems
● Generate a ranking of items with respect to a given item from an interaction graph
○ Graph diffusion algorithms (e.g. Topic Sensitive Pagerank)
● Find clusters of related items using co-occurrence data
○ Probabilistic graphical models (Latent Dirichlet Allocation)
15. Topic Sensitive Pagerank @ Netflix
● Popular graph diffusion algorithm
● Captures vertex importance with respect to a particular vertex
● e.g. for the topic “Seattle”
16. Iteration 0
We start by activating a single node, “Seattle”.
[Graph diagram: vertices connected by edges labeled “related to”, “shot in”, “featured in”, and “cast”.]
17. Iteration 1
With some probability we follow outbound edges; otherwise, we go back to the origin.
20. GraphX implementation
● Running one propagation for each possible starting node would be slow
● Keep a vector of activation probabilities at each vertex
● Use GraphX to run all propagations in parallel
21. Topic Sensitive Pagerank in GraphX
[Diagram: activation probabilities stored as vertex attributes. Each vertex holds a vector with one entry per starting vertex: the activation probability starting from vertex 1, from vertex 2, from vertex 3, and so on.]
24. LDA @ Netflix
● A popular clustering/latent factors model
● Discovers clusters/topics of related videos from Netflix data
● e.g., a topic of Animal Documentaries
26. LDA - Graphical Model
Question: How to parallelize inference?
Answer: Read conditional independencies in the model
31. Gibbs Sampler 2 (Uncollapsed)
Suitable for GraphX:
● Sample topic labels in a given document in parallel
● Sample topic labels in different documents in parallel
57. What we learned so far ...
● Where is the cross-over point for your iterative ML algorithm?
○ GraphX brings performance benefits if you’re on the right side of that point
○ GraphX lets you easily throw more hardware at a problem
● GraphX is very useful (and fast) for other graph processing tasks
○ Data pre-processing
○ Efficient joins
59. What we learned so far ...
● Regularly save the state
○ With a 99.9% per-iteration success rate, what’s the probability of successfully running 1,000 iterations?
○ ~36%
● Multi-core machine learning (r3.8xl, 32 threads, 220 GB) is very efficient
○ if your data fits in the memory of a single machine!
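Where the ~36% comes from: iteration successes are independent, so the probability that all 1,000 iterations succeed is 0.999^1000 = e^(1000 · ln 0.999) ≈ e^−1 ≈ 0.368, which is why checkpointing state regularly matters.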
63. Creating a GraphX graph
scala> // false = don't canonicalize edge direction; 100 = minimum edge partitions
scala> val graph = GraphLoader.edgeListFile(sc, edgesFile, false, 100)
graph: org.apache.spark.graphx.Graph[Int,Int] = org.apache.spark.graphx.impl.GraphImpl@547a8dc1
scala> graph.edges.count
res3: Long = 16090021
scala> graph.vertices.count
res4: Long = 4548083
64. Pagerank in GraphX
scala> val ranks = graph.staticPageRank(10, 0.15).vertices
scala> // mapping (loaded earlier): CSV lines with the resource URI first and its vertex id last
scala> val resources = mapping.map { row =>
  val fields = row.split(",")
  (fields.last.toLong, fields.head)
}
scala> val ranksByResource = resources.join(ranks).map {
  case (id, (resource, rank)) => (resource, rank)
}
scala> ranksByResource.top(3)(Ordering.by(_._2)).foreach(println)
(http://dbpedia.org/resource/United_States,15686.671749384182)
(http://dbpedia.org/resource/Animal,6530.621240073025)
(http://dbpedia.org/resource/United_Kingdom,5780.806077968981)
65. Topic-sensitive pagerank in GraphX
● Initialization:
○ Construct a message VertexRDD holding the initial activation probabilities at each vertex (a sparse vector with one non-zero entry)
● Propagate messages along outbound edges using flatMap
○ (involves shuffling)
● Sum incoming messages at each vertex
○ aggregateUsingIndex, summing up sparse vectors
○ join the messages onto the old graph to create a new one
● count to materialize the new graph
● unpersist to clean up the old graph and messages
● Repeat for a fixed number of iterations or until convergence (one iteration is sketched below)
● Zeppelin notebook, using DBpedia data
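As a rough illustration of one such iteration, here is a minimal, hypothetical sketch. It is not the talk’s actual code: it represents each sparse activation vector as a plain Map[VertexId, Double], assumes edge attributes already hold out-degree-normalized transition weights, and uses aggregateMessages (which fuses the propagate and sum steps) in place of the flatMap/aggregateUsingIndex pair named above.

```scala
// Hypothetical sketch of one topic-sensitive PageRank iteration in GraphX.
// Activations: starting-vertex id -> probability mass at this vertex.
import org.apache.spark.graphx._

type Activations = Map[VertexId, Double]

def iterate(graph: Graph[Activations, Double], resetProb: Double): Graph[Activations, Double] = {
  // Propagate each vertex's activation vector along outbound edges,
  // scaled by the follow probability (1 - resetProb) and the edge weight.
  val messages: VertexRDD[Activations] = graph.aggregateMessages[Activations](
    ctx => ctx.sendToDst(
      ctx.srcAttr.map { case (start, p) => (start, p * (1.0 - resetProb) * ctx.attr) }),
    // Sum incoming sparse vectors entry-wise.
    (a, b) => (a.keySet ++ b.keySet)
      .map(k => k -> (a.getOrElse(k, 0.0) + b.getOrElse(k, 0.0))).toMap
  )
  // Join the summed messages onto the old graph; each starting vertex also
  // receives the teleport (reset) mass for its own walk.
  val next = graph.outerJoinVertices(messages) { (id, _, msgOpt) =>
    val msg = msgOpt.getOrElse(Map.empty[VertexId, Double])
    msg.updated(id, msg.getOrElse(id, 0.0) + resetProb)
  }.cache()
  next.vertices.count()                      // materialize the new graph
  graph.unpersistVertices(blocking = false)  // then drop the old one
  graph.edges.unpersist(blocking = false)
  next
}
```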
66. Distributed Gibbs Sampler in GraphX
1) Initialize the document-word graph, G
2) For each triplet in G,
a) Construct a categorical distribution from the vertex attributes (P(topic | document), P(word | topic))
b) Sample a topic label from that categorical distribution
3) Aggregate the topic labels on the vertex id
4) Sample vertex attributes from the Dirichlet distribution
a) This involves computing and distributing a marginal over the topic matrix, which materializes the graph in every iteration
5) Join the vertices with the updated attributes with the graph and repeat from step 2
Note: Steps 2 and 3 can be accomplished jointly using the aggregateMessages method on the Graph (sketched below)
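A minimal, hypothetical sketch of that fused step 2-3 follows; it is not the talk’s actual code. Vertex attributes are assumed to be per-topic weight arrays (proportional to P(topic | document) on document vertices and P(word | topic) on word vertices), edge attributes are unused here, and numTopics is an assumed hyperparameter.

```scala
// Hypothetical sketch: sample one topic label per document-word edge and
// aggregate the resulting one-hot counts at both endpoint vertices.
import java.util.concurrent.ThreadLocalRandom
import org.apache.spark.graphx._

val numTopics = 100  // assumed hyperparameter

def sampleTopicCounts(g: Graph[Array[Double], Int]): VertexRDD[Array[Int]] =
  g.aggregateMessages[Array[Int]](
    ctx => {
      // Step 2a: unnormalized categorical over topics,
      // proportional to P(topic | document) * P(word | topic).
      val p = Array.tabulate(numTopics)(k => ctx.srcAttr(k) * ctx.dstAttr(k))
      // Step 2b: draw one topic label from that categorical.
      var u = ThreadLocalRandom.current().nextDouble() * p.sum
      var topic = 0
      while (topic < numTopics - 1 && u > p(topic)) { u -= p(topic); topic += 1 }
      // Step 3: send a one-hot count vector to both endpoints.
      val counts = Array.fill(numTopics)(0)
      counts(topic) = 1
      ctx.sendToSrc(counts)
      ctx.sendToDst(counts)
    },
    // Merge per-edge counts into per-vertex topic counts.
    (a, b) => a.zip(b).map { case (x, y) => x + y }
  )
```

The per-vertex topic counts returned here would then feed step 4, the Dirichlet draws that produce the updated vertex attributes.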