
Spark Meetup @ Netflix, 05/19/2015


Spark & GraphX for recommendation algorithms - presented at a Netflix-hosted Spark Meetup, on 05/19/2015

Published in: Engineering


  1. Spark and GraphX in the Netflix Recommender System. Ehtsham Elahi (@EhtshamElahi) and Yves Raimond (@moustaki), Algorithms Engineering, Netflix
  2. Machine Learning @ Netflix
  3. Recommendations @ Netflix ● Goal: help members find content that they’ll enjoy, to maximize satisfaction and retention ● Core part of the product ○ Every impression is a recommendation
  4. Models & Algorithms ▪ Regression (linear, logistic, elastic net) ▪ SVD and other matrix factorizations ▪ Factorization Machines ▪ Restricted Boltzmann Machines ▪ Deep Neural Networks ▪ Markov models and graph algorithms ▪ Clustering ▪ Latent Dirichlet Allocation ▪ Gradient Boosted Decision Trees / Random Forests ▪ Gaussian Processes ▪ …
  5. Main Challenge - Scale ● Algorithms @ Netflix scale ○ > 62M members ○ > 50 countries ○ > 1000 device types ○ > 100M hours / day ● Can distributed machine learning algorithms help with scale?
  6. Spark and GraphX
  7. Spark and GraphX ● Spark: distributed in-memory computational engine built on Resilient Distributed Datasets (RDDs) ● GraphX: extends RDDs to multigraphs and provides graph analytics ● Convenient and fast, all the way from prototyping (spark-notebook, iSpark, Zeppelin) to production
  8. Two Machine Learning Problems ● Generate a ranking of items with respect to a given item from an interaction graph ○ Graph diffusion algorithms (e.g. Topic Sensitive Pagerank) ● Find clusters of related items using co-occurrence data ○ Probabilistic graphical models (Latent Dirichlet Allocation)
  9. Iterative Algorithms in GraphX [Figure: example graph with vertices v1, v2, v3, v4, v6, v7, annotated with vertex attributes and edge attributes]
  10. Iterative Algorithms in GraphX: GraphX represents the graph as RDDs, e.g. VertexRDD and EdgeRDD. [Figure: same example graph]
  11. Iterative Algorithms in GraphX: GraphX provides APIs to propagate and update attributes. [Figure: same example graph]
  12. Iterative Algorithms in GraphX: an iterative algorithm proceeds by creating updated graphs. [Figure: same example graph]
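
The propagate-and-update loop described in the slides above can be illustrated with a minimal sketch. It uses plain Python dictionaries rather than the GraphX API; the function names (`aggregate_messages`, `iterate`) and the toy graph are assumptions made for illustration, loosely modelled on the idea of sending messages along edges and merging them per vertex. Each step returns a new vertex map instead of mutating the old one, mirroring the "creating updated graphs" point.

```python
def aggregate_messages(vertices, edges, send, merge):
    """Send one message along every edge and merge the messages per destination."""
    inbox = {}
    for src, dst, edge_attr in edges:
        message = send(vertices[src], edge_attr)
        inbox[dst] = merge(inbox[dst], message) if dst in inbox else message
    return inbox


def iterate(vertices, edges, send, merge, update):
    """Return a NEW vertex map; the input map is left untouched."""
    inbox = aggregate_messages(vertices, edges, send, merge)
    return {v: update(attr, inbox.get(v)) for v, attr in vertices.items()}


# Hypothetical toy graph: vertex id -> attribute, and (src, dst, edge attribute) triples.
vertices = {1: 1.0, 2: 0.0, 3: 0.0}
edges = [(1, 2, 0.5), (1, 3, 0.5), (2, 3, 1.0)]

updated = iterate(
    vertices,
    edges,
    send=lambda attr, weight: attr * weight,  # message = source attribute scaled by edge weight
    merge=lambda a, b: a + b,                 # sum the messages arriving at the same vertex
    update=lambda old, msg: 0.0 if msg is None else msg,  # no inbound message -> attribute becomes 0
)
```

Calling `iterate` repeatedly on its own output gives the iterative scheme: every pass yields a fresh set of vertex attributes derived from the previous one.
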
  13. Graph Diffusion algorithms
  14. Topic Sensitive Pagerank @ Netflix ● Popular graph diffusion algorithm ● Captures vertex importance with regard to a particular vertex ● e.g. for the topic “Seattle”
  15. Iteration 0: We start by activating a single node, “Seattle”. [Figure: graph around “Seattle” with edges labeled “related to”, “shot in”, “featured in” and “cast”]
  16. Iteration 1: With some probability, we follow outbound edges; otherwise we go back to the origin.
  17. Iteration 2: Vertex accumulates higher mass
  18. Iteration 2: And again, until convergence
  19. GraphX implementation ● Running one propagation for each possible starting node would be slow ● Keep a vector of activation probabilities at each vertex ● Use GraphX to run all propagations in parallel
  20. Topic Sensitive Pagerank in GraphX: activation probabilities as vertex attributes. Each vertex holds a vector: activation probability starting from vertex 1, activation probability starting from vertex 2, activation probability starting from vertex 3, ...
  21. Example graph diffusion results: “Matrix”, “Zombies”, “Seattle”
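
The diffusion described in the slides above (follow outbound edges with some probability, otherwise return to the origin, while keeping one activation probability per starting vertex at every vertex) can be sketched as follows. This is a single-machine illustration in plain Python, not the Netflix GraphX code, which is internal. The restart probability of 0.15, the handling of vertices without outbound edges, the function name and the toy graph are all assumptions.

```python
def personalized_pagerank_all(out_edges, restart=0.15, iterations=100):
    """Run one diffusion per starting vertex, all at once.

    activation[v][s] approximates the probability that a walk which
    started at s is currently at v.
    """
    nodes = sorted(out_edges)
    activation = {v: {s: (1.0 if v == s else 0.0) for s in nodes} for v in nodes}
    for _ in range(iterations):
        nxt = {v: {s: 0.0 for s in nodes} for v in nodes}
        for v in nodes:
            targets = out_edges[v]
            for s, mass in activation[v].items():
                if mass == 0.0:
                    continue
                if not targets:
                    # Assumption: a vertex with no outbound edges sends everything back to the origin.
                    nxt[s][s] += mass
                    continue
                nxt[s][s] += restart * mass                    # go back to the origin
                share = (1.0 - restart) * mass / len(targets)  # follow outbound edges uniformly
                for t in targets:
                    nxt[t][s] += share
        activation = nxt
    return activation


# Hypothetical toy graph (adjacency lists); the vertex names are placeholders.
toy_graph = {
    "seattle": ["title_a", "title_b"],
    "title_a": ["actor_x"],
    "title_b": ["actor_x", "seattle"],
    "actor_x": ["title_a"],
}
activation = personalized_pagerank_all(toy_graph)
# Rank every vertex by its importance with respect to the starting vertex "seattle".
ranking = sorted(toy_graph, key=lambda v: activation[v]["seattle"], reverse=True)
```

Storing the whole vector at each vertex is what lets a single pass over the edges advance every propagation together, at the cost of moving a vector (rather than a scalar) along each edge.
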
  22. Distributed Clustering algorithms
  23. LDA @ Netflix ● A popular clustering/latent factors model ● Discovers clusters/topics of related videos from Netflix data ● e.g. a topic of Animal Documentaries
  24. LDA - Graphical Model [Figure: graphical model labeled with per-topic word distributions, per-document topic distributions, and the topic label for document d and word w]
  25. LDA - Graphical Model. Question: How to parallelize inference?
  26. LDA - Graphical Model. Question: How to parallelize inference? Answer: Read the conditional independencies in the model
  27. Gibbs Sampler 1 (Semi-Collapsed)
  28. Gibbs Sampler 1 (Semi-Collapsed) ● Sample topic labels within a given document sequentially ● Sample topic labels in different documents in parallel
  29. Gibbs Sampler 2 (Uncollapsed)
  30. Gibbs Sampler 2 (Uncollapsed) ● Sample topic labels within a given document in parallel ● Sample topic labels in different documents in parallel
  31. Gibbs Sampler 2 (Uncollapsed): suitable for GraphX ● Sample topic labels within a given document in parallel ● Sample topic labels in different documents in parallel
  32. Distributed Gibbs Sampler: a distributed parameterized graph for LDA with 3 topics. [Figure: graph with word vertices w1, w2, w3 and document vertices d1, d2; each vertex carries a 3-element vector, shown in order as (0.3, 0.4, 0.1), (0.3, 0.2, 0.8), (0.4, 0.4, 0.1), (0.3, 0.6, 0.1), (0.2, 0.5, 0.3)]
  33. Distributed Gibbs Sampler [same figure; callout: document vertex]
  34. Distributed Gibbs Sampler [same figure; callout: word vertex]
  35. Distributed Gibbs Sampler [same figure; callout: an edge exists if the word appeared in the document]
  36. Distributed Gibbs Sampler [same figure; callout: per-document topic distribution]
  37. Distributed Gibbs Sampler [same figure; callout: per-topic word distributions]
  38. Distributed Gibbs Sampler [same figure; callout: (vertex, edge, vertex) = triplet]
  39. Distributed Gibbs Sampler [same figure; callout: categorical distribution for the triplet, computed from the vertex attributes]
  40. Distributed Gibbs Sampler [same figure; callout: categorical distributions for all triplets]
  41. Distributed Gibbs Sampler [same figure; callout: sample topics for all edges; sampled edge topics shown: 1, 1, 2, 0]
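  42. Distributed Gibbs Sampler: neighborhood aggregation for topic histograms. [Figure: same graph; vertex vectors replaced by topic-count histograms, shown in order as (0, 1, 0), (0, 1, 1), (1, 0, 0), (0, 2, 0), (1, 0, 1); edge topics 1, 1, 2, 0]
  43. Distributed Gibbs Sampler: realize samples from Dirichlet distributions to update the graph. [Figure: same graph with resampled vertex vectors, shown in order as (0.1, 0.4, 0.3), (0.1, 0.4, 0.4), (0.8, 0.2, 0.3), (0.1, 0.8, 0.1), (0.45, 0.1, 0.45)]
  44. Example LDA Results: cluster of Bollywood movies, cluster of kids shows, cluster of Western movies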
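
One sweep of the uncollapsed Gibbs sampler walked through in the slides above (sample a topic on every edge, aggregate topic histograms at the vertices, then draw new distributions from Dirichlets) can be sketched as below. This is a sequential illustration in plain Python under stated assumptions, not the distributed GraphX implementation: the helper names, the symmetric priors `alpha` and `beta`, their value of 0.5, and the toy document/word edges are all hypothetical.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(weights, rng):
    """Draw an index with probability proportional to its (unnormalized) weight."""
    r = rng.random() * sum(weights)
    acc = 0.0
    for k, w in enumerate(weights):
        acc += w
        if r < acc:
            return k
    return len(weights) - 1


def resample_phi(word_counts, num_topics, beta, rng):
    """phi[w][k]: probability of word w under topic k; each topic is normalized over words."""
    words = sorted(word_counts)
    phi = {w: [0.0] * num_topics for w in words}
    for k in range(num_topics):
        column = sample_dirichlet([beta + word_counts[w][k] for w in words], rng)
        for w, p in zip(words, column):
            phi[w][k] = p
    return phi


def gibbs_sweep(edges, theta, phi, num_topics, alpha, beta, rng):
    # 1. Sample a topic per (document, word) edge from the categorical built from
    #    both endpoint vectors. Given theta and phi the edges are independent,
    #    which is why this step can run in parallel.
    topics = [
        sample_categorical([theta[d][k] * phi[w][k] for k in range(num_topics)], rng)
        for d, w in edges
    ]
    # 2. Neighborhood aggregation: topic histograms per document and per word.
    doc_counts = {d: [0] * num_topics for d in theta}
    word_counts = {w: [0] * num_topics for w in phi}
    for (d, w), z in zip(edges, topics):
        doc_counts[d][z] += 1
        word_counts[w][z] += 1
    # 3. Realize new vertex attributes from the Dirichlet posteriors.
    new_theta = {d: sample_dirichlet([alpha + c for c in doc_counts[d]], rng) for d in doc_counts}
    new_phi = resample_phi(word_counts, num_topics, beta, rng)
    return topics, new_theta, new_phi


# Hypothetical toy corpus: one edge per (document, word) occurrence.
rng = random.Random(0)
K, ALPHA, BETA = 3, 0.5, 0.5
edges = [("d1", "w1"), ("d1", "w2"), ("d2", "w2"), ("d2", "w3")]
theta = {d: sample_dirichlet([ALPHA] * K, rng) for d in ("d1", "d2")}
phi = resample_phi({w: [0] * K for w in ("w1", "w2", "w3")}, K, BETA, rng)
for _ in range(20):
    topics, theta, phi = gibbs_sweep(edges, theta, phi, K, ALPHA, BETA, rng)
```

Note that step 3 normalizes each topic across all word vertices, so in a distributed setting it needs a global aggregation per topic in addition to the per-vertex histograms.
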
  45. GraphX performance comparison
  46. Algorithm Implementations ● Topic Sensitive Pagerank ○ Distributed GraphX implementation ○ Alternative implementation: broadcast graph adjacency matrix, Scala/Breeze code, triggered by Spark ● LDA ○ Distributed GraphX implementation ○ Alternative implementation: single machine, multi-threaded Java code ● All implementations are Netflix internal code
  47. Performance Comparison
  48. Performance Comparison: open source DBPedia dataset
  49. Performance Comparison: sublinear rise in time with GraphX vs. linear rise in the alternative implementation
  50. Performance Comparison: doubling the size of the cluster gives a 2.0x speedup in the alternative implementation vs. 1.2x in GraphX
  51. Performance Comparison: a large number of vertices propagated in parallel leads to large shuffle data, causing failures in GraphX on small clusters
  52. Performance Comparison: Netflix dataset, number of topics = 100
  53. Performance Comparison: the GraphX setup has 8x the resources of the multi-core setup
  54. Performance Comparison: Wikipedia dataset, 100-topic LDA, cluster: 16 x r3.2xl (source: Databricks)
  55. Performance Comparison: for very large datasets, GraphX outperforms the multi-core uncollapsed implementation
  56. Lessons Learned
  57. What we learned so far... ● Where is the cross-over point for your iterative ML algorithm? ○ GraphX brings performance benefits if you’re on the right side of that point ○ GraphX lets you easily throw more hardware at a problem ● GraphX is very useful (and fast) for other graph processing tasks ○ Data pre-processing ○ Efficient joins
  58. What we learned so far... ● Regularly save the state ○ With a 99.9% success rate per iteration, what’s the probability of successfully running 1,000 iterations? ● Multi-core machine learning (r3.8xl, 32 threads, 220 GB) is very efficient ○ if your data fits in the memory of a single machine!
  59. What we learned so far... ● Regularly save the state ○ With a 99.9% success rate per iteration, what’s the probability of successfully running 1,000 iterations? ○ ~37% (0.999^1000 ≈ 0.368) ● Multi-core machine learning (r3.8xl, 32 threads, 220 GB) is very efficient ○ if your data fits in the memory of a single machine!
  60. We’re hiring! (come talk to us) https://jobs.netflix.com/
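
The "regularly save the state" arithmetic can be checked with a short sketch. It assumes each iteration succeeds independently with the same probability; the function name and the 100-iteration checkpoint interval are hypothetical choices for illustration.

```python
def run_success_probability(per_iteration_success, iterations):
    """Probability that every one of `iterations` independent steps succeeds."""
    return per_iteration_success ** iterations


# Whole job without saving state: all 1,000 iterations must succeed in a row.
p_full_run = run_success_probability(0.999, 1000)  # roughly 0.368

# With state saved every 100 iterations (assumed interval), a failure only
# costs the current 100-iteration segment rather than the whole run.
p_segment = run_success_probability(0.999, 100)    # roughly 0.905
```

Under these assumptions, an uncheckpointed 1,000-iteration job fails more often than it succeeds, while each checkpointed segment completes about nine times out of ten.
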
