Scala Data Pipelines @ Spotify

Talk at Big Data Scala, Aug 18, 2015

  1. Scala Data Pipelines @ Spotify Neville Li @sinisa_lyh
  2. Who am I? ‣ Spotify NYC since 2011 ‣ Formerly Yahoo! Search ‣ Music recommendations ‣ Data infrastructure ‣ Scala since 2013
  3. Spotify in numbers • Started in 2006, 58 markets • 75M+ active users, 20M+ paying • 30M+ songs, 20K new per day • 1.5 billion playlists • 1 TB logs per day • 1200+ node Hadoop cluster • 10K+ Hadoop jobs per day
  4. Music recommendation @ Spotify • Discover Weekly • Radio • Related Artists • Discover Page
  5. Recommendation systems
  6. A little teaser PGroupedTable<K,V>::combineValues(CombineFn<K,V> combineFn, CombineFn<K,V> reduceFn) Crunch: CombineFns are used to represent the associative operations… Grouped[K, +V]::reduce[U >: V](fn: (U, U) => U) Scalding: reduce with fn which must be associative and commutative… PairRDDFunctions[K, V]::reduceByKey(fn: (V, V) => V) Spark: Merge the values for each key using an associative reduce function…
  7. Monoid! enables map-side reduce. Actually it's a semigroup
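A minimal plain-Scala sketch (not from the deck) of why associativity is what makes map-side reduction safe: each mapper can pre-aggregate its own partition, and only partial results cross the shuffle.

    case class PlayCount(plays: Long, maxPerDay: Long)

    // Associative (and commutative) combine => safe to apply before the shuffle.
    def combine(a: PlayCount, b: PlayCount): PlayCount =
      PlayCount(a.plays + b.plays, math.max(a.maxPerDay, b.maxPerDay))

    val mapper1 = Seq(PlayCount(3, 3), PlayCount(7, 7))
    val mapper2 = Seq(PlayCount(5, 5))

    // Each "mapper" reduces locally; the reducer only merges one value per mapper.
    val partials = Seq(mapper1, mapper2).map(_.reduce(combine))
    val total    = partials.reduce(combine)  // PlayCount(15, 7)

A semigroup only needs this associative plus; a monoid additionally has a zero element, which the combine above never requires.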
  8. One more teaser Linear equation in Alternating Least Squares (ALS) Matrix factorization: x_u = (YᵀY + Yᵀ(Cᵘ − I)Y)⁻¹ YᵀCᵘp(u) vectors.map { case (id, v) => (id, v * v) }.map(_._2).reduce(_ + _) // YtY ratings.keyBy(fixedKey).join(outerProducts) // YtCuIY .map { case (_, (r, op)) => (solveKey(r), op * (r.rating * alpha)) }.reduceByKey(_ + _) ratings.keyBy(fixedKey).join(vectors) // YtCupu .map { case (_, (r, v)) => val (Cui, pui) = (r.rating * alpha + 1, if (r.rating > 0.0) 1.0 else 0.0) (solveKey(r), v * (Cui * pui)) }.reduceByKey(_ + _) http://www.slideshare.net/MrChrisJohnson/scala-data-pipelines-for-music-recommendations
  9. Success story • Mid 2013: 100+ Python Luigi M/R jobs, few tests • 10+ new hires since, most fresh grads • Few with Java experience, none with Scala • Now: 300+ Scalding jobs, 400+ tests • More ad-hoc jobs untracked • Spark also taking off
  10. First 10 months ……
  11. Activity over time
  12. Guess how many jobs written by yours truly?
  13. Performance vs. Agility https://nicholassterling.wordpress.com/2012/11/16/scala-performance/
  14. Let’s dive into something technical
  15. To join or not to join? val streams: TypedPipe[(String, String)] = _ // (track, user) val tgp: TypedPipe[(String, String)] = _ // (track, genre) streams .join(tgp) .values // (user, genre) .group .mapValueStream(vs => Iterator(vs.toSet)) // reducer-only
  16. Hash join val streams: TypedPipe[(String, String)] = _ // (track, user) val tgp: TypedPipe[(String, String)] = _ // (track, genre) streams .hashJoin(tgp.forceToDisk) // tgp replicated to all mappers .values // (user, genre) .group .mapValueStream(vs => Iterator(vs.toSet)) // reducer-only
  17. CoGroup val streams: TypedPipe[(String, String)] = _ // (track, user) val tgp: TypedPipe[(String, String)] = _ // (track, genre) streams .cogroup(tgp) { case (_, users, genres) => users.map((_, genres.toSet)) } // (track, (user, genres)) .values // (user, genres) .group .reduce(_ ++ _) // map-side reduce!
  18. CoGroup val streams: TypedPipe[(String, String)] = _ // (track, user) val tgp: TypedPipe[(String, String)] = _ // (track, genre) streams .cogroup(tgp) { case (_, users, genres) => users.map((_, genres.toSet)) } // (track, (user, genres)) .values // (user, genres) .group .sum // SetMonoid[Set[T]] from Algebird * sum[U >: V](implicit sg: Semigroup[U])
  19. Key-value file as distributed cache val streams: TypedPipe[(String, String)] = _ // (gid, user) val tgp: SparkeyManager = _ // tgp replicated to all mappers streams .map { case (track, user) => (user, tgp.get(track).split(",").toSet) } .group .sum https://github.com/spotify/sparkey SparkeyManager wraps DistributedCacheFile
  20. Joins and CoGroups • Require shuffle and reduce step • Some ops force everything to reducers, e.g. mapGroup, mapValueStream • CoGroup more flexible for complex logic • Scalding flattens a.join(b).join(c)… into MultiJoin(a, b, c, …)
  21. Distributed cache • Faster with off-heap binary files • Building cache = more wiring • Memory mapping may interfere with YARN • E.g. 64GB nodes with 48GB for containers (no cgroup) • 12 × 2GB containers each with 2GB JVM heap + mmap cache • OOM and swap! • Keep files small (< 1GB) or fall back to joins…
  22. Analyze your jobs • Concurrent Driven • Visualize job execution • Workflow optimization • Bottlenecks • Data skew
  23. Not enough math?
  24. Recommending tracks • User listened to Rammstein - Du Hast • Recommend 10 similar tracks • 40-dimension feature vectors for tracks • Compute cosine similarity between all pairs • O(n) lookup per user where n ≈ 30m • Try that with 50m users * 10 seed tracks each
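For reference, a brute-force version of the lookup this slide describes (illustrative code, not Spotify's): cosine similarity of one seed vector against every track vector, i.e. O(n) per query with n ≈ 30M tracks.

    def cosine(a: Array[Double], b: Array[Double]): Double = {
      val dot   = a.zip(b).map { case (x, y) => x * y }.sum
      val normA = math.sqrt(a.map(x => x * x).sum)
      val normB = math.sqrt(b.map(x => x * x).sum)
      dot / (normA * normB)
    }

    // Scan all ~30M 40-dimensional track vectors for every seed track.
    def mostSimilar(seed: Array[Double],
                    tracks: Map[String, Array[Double]],
                    k: Int = 10): Seq[(String, Double)] =
      tracks.toSeq
        .map { case (id, v) => (id, cosine(seed, v)) }
        .sortBy(-_._2)
        .take(k)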
  25. ANNOY - cheat by approximation • Approximate Nearest Neighbor Oh Yeah • Random projections and binary tree search • Build index on single machine • Load in mappers via distributed cache • O(log n) lookup https://github.com/spotify/annoy https://github.com/spotify/annoy-java
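A hedged sketch of how an ANN index replaces the scan above; AnnIndex is a hypothetical stand-in interface, not annoy-java's actual API.

    trait AnnIndex {
      def nearest(vector: Array[Float], n: Int): Seq[Int]  // ids of the n closest tracks
    }

    // The prebuilt index file ships to every mapper via the distributed cache and is
    // memory-mapped once; each record then pays an O(log n) lookup instead of an O(n) scan.
    def recommendForUser(index: AnnIndex,
                         seedVectors: Seq[Array[Float]],
                         perSeed: Int = 10): Seq[Int] =
      seedVectors.flatMap(v => index.nearest(v, perSeed)).distinct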
  26. ANN Benchmark https://github.com/erikbern/ann-benchmarks
  27. Filtering candidates • Users don’t like seeing artists/albums/tracks they already know • But may forget what they listened to long ago • 50m users * thousands of items each • Over 5 years of streaming logs • Need to update daily • Need to purge old items per user
  28. Options • Aggregate all logs daily • Aggregate last x days daily • CSV of artist/album/track ids • Bloom filters
  29. Decayed value with cutoff • Compute new user-item score daily • Weighted on context, e.g. radio, search, playlist • score’ = score + previous * 0.99 • half life = log₀.₉₉ 0.5 ≈ 69 days • Cut off at top 2000 • Items that users might remember seeing recently
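A minimal sketch of the daily update described here (names and structure are illustrative): decay yesterday's scores, add today's, keep the top 2000 per user. The half-life follows from the decay factor: ln(0.5) / ln(0.99) ≈ 69 days.

    val decay  = 0.99
    val cutoff = 2000

    def updateUserScores(previous: Map[String, Double],
                         today: Map[String, Double]): Map[String, Double] = {
      val decayed = previous.map { case (item, s) => item -> s * decay }
      val merged = (decayed.keySet ++ today.keySet).map { item =>
        item -> (decayed.getOrElse(item, 0.0) + today.getOrElse(item, 0.0))
      }.toMap
      // Keep only the items the user might still remember seeing.
      merged.toSeq.sortBy(-_._2).take(cutoff).toMap
    }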
  30. Bloom filters • Probabilistic data structure • Encoding set of items with m bits and k hash functions • No false negative • Tunable false positive probability • Size proportional to capacity & FP probability • Let’s build one per user-{artists,albums,tracks} • Algebird BloomFilterMonoid: z = all zero bits, + = bitwise OR
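A small sketch against Algebird's string-based BloomFilter API as it existed around the time of this talk (the capacity of 1000 and 1% FP probability are arbitrary numbers for illustration):

    import com.twitter.algebird.BloomFilter

    val bfMonoid = BloomFilter(1000, 0.01)  // numEntries, fpProb

    // Daily per-user filters...
    val day1 = bfMonoid.create("track:a", "track:b")
    val day2 = bfMonoid.create("track:c")

    // ...merged with monoid plus: zero = all-zero bits, + = bitwise OR.
    val seen = bfMonoid.plus(day1, day2)

    seen.contains("track:a").isTrue  // true: no false negatives
    seen.contains("track:z").isTrue  // false, up to the tunable FP probability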
  31. Size versus max items & FP prob • User-item distribution is uneven • Assuming same setting for all users • # items << capacity → wasting space • # items > capacity → high FP rate
  32. Scalable Bloom Filter • Growing sequence of standard BFs • Increasing capacity and tighter FP probability • Most users have few BFs • Power users have many • Serialization and lookup overhead
  33.–36. Scalable Bloom Filter (build-up frames of the previous slide; as each filter fills, a larger one is appended: n=1k, n=10k, n=100k, n=1m)
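A rough sketch of the scalable Bloom filter idea from slides 32–36 (illustrative, not production code), again on top of Algebird's BloomFilter; the growth factor and FP tightening below are arbitrary choices.

    import com.twitter.algebird.BloomFilter

    class ScalableBF(initialCapacity: Int = 1000, initialFp: Double = 0.03) {
      private var capacity = initialCapacity
      private var fp       = initialFp
      private var monoid   = BloomFilter(capacity, fp)
      private var filters  = List(monoid.zero)  // newest (largest) filter at the head
      private var inserted = 0

      def add(item: String): Unit = {
        if (inserted >= capacity) {       // current filter is full: append a bigger one
          capacity *= 10                  // growing capacity...
          fp *= 0.9                       // ...and tighter FP probability
          monoid   = BloomFilter(capacity, fp)
          filters  = monoid.zero :: filters
          inserted = 0
        }
        filters  = monoid.plus(filters.head, monoid.create(item)) :: filters.tail
        inserted += 1
      }

      // An item may live in any of the filters, so every one is checked:
      // this is the serialization and lookup overhead the slide mentions.
      def mightContain(item: String): Boolean =
        filters.exists(_.contains(item).isTrue)
    }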
  37. Opportunistic Bloom Filter • Building n BFs of increasing capacity in parallel • Up to << N max possible items • Keep smallest one with capacity > items inserted • Expensive to build • Cheap to store and lookup
  38. Opportunistic Bloom Filter (build-up frame of the previous slide, showing the smallest candidate filter, n=1k)
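And a sketch of the opportunistic variant (illustrative only): build several candidate filters of increasing capacity in parallel, then keep just the smallest one whose capacity exceeds the number of items actually inserted.

    import com.twitter.algebird.BloomFilter

    val capacities = Seq(1000, 10000, 100000, 1000000)
    val monoids    = capacities.map(c => c -> BloomFilter(c, 0.01))  // 1% FP, arbitrary

    def buildUserFilter(items: Seq[String]) = {
      // Expensive to build: every item goes into every candidate filter...
      val candidates = monoids.map { case (cap, m) => cap -> m.create(items: _*) }
      // ...but only the smallest adequately sized one is stored and looked up.
      candidates
        .find { case (cap, _) => cap > items.size }
        .getOrElse(candidates.last)
        ._2
    }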
