
Scio


Scio - A Scala API for Google Cloud Dataflow
https://github.com/spotify/scio

Published in: Software

  1. Scio: A Scala API for Google Cloud Dataflow. Neville Li, @sinisa_lyh
  2. Who am I?
  3. Origin Story: • Scalding and Spark • ML, recommendations, analytics • 50+ users, 400+ unique jobs
  4. Moving to Google Cloud: Early 2015 - Dataflow Scala hack project
  5. What is Dataflow?
  6. Data model. Spark: • RDD for batch, DStream for streaming • Explicit caching semantics • Two sets of APIs. Dataflow: • PCollection for both batch and streaming • Windowed and timestamped values • One unified API
  7. Execution. Spark: • Driver and executors • Dynamic execution from driver • Transforms and actions. Dataflow: • No master • Static execution planning • Transforms only, no actions
  8. Why Dataflow?
  9. Why not Scalding on GCE, pros: • Community: Twitter, eBay, Etsy, Stripe, LinkedIn, … • Stable and proven
  10. Why not Scalding on GCE, cons: • Hadoop cluster operations • Multi-tenancy: resource contention and utilization • No streaming mode (Summingbird?)
  11. Why not Spark on GCE, pros: • Batch, streaming, interactive and SQL • MLlib, GraphX • Scala, Python, and R support • Zeppelin, spark-notebook, Hue
  12. Why not Spark on GCE, cons: • Hard to tune and scale • Cluster lifecycle management
  13. Why Dataflow with Scala, Dataflow: • Hosted solution, no operations • Ecosystem: GCS, BigQuery, PubSub, Bigtable, … • Unified batch and streaming model
  14. Why Dataflow with Scala, Scala: • High level DSL: easy transition for developers • Reusable and composable code via FP • Numerical libraries: Breeze, Algebird
  15. Scio. Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]. Verb: I can, know, understand, have knowledge.
  16. github.com/spotify/scio
  17. WordCount: almost identical to the Spark version
        val sc = ScioContext()
        sc.textFile("shakespeare.txt")
          .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
          .countByValue()
          .saveAsTextFile("wordcount.txt")
  18. PageRank
        def pageRank(in: SCollection[(String, String)]) = {
          val links = in.groupByKey()
          var ranks = links.mapValues(_ => 1.0)
          for (i <- 1 to 10) {
            val contribs = links.join(ranks).values
              .flatMap { case (urls, rank) =>
                val size = urls.size
                urls.map((_, rank / size))
              }
            ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
          }
          ranks
        }
  19. Spotify Running: • 60 million tracks • 30m users * 10 tempo buckets * 25 tracks • Audio: tempo, energy, time signature ... • Metadata: genres, categories, … • Latent vectors from collaborative filtering
  20. Personalized new releases: • Pre-computed weekly on Hadoop (on-premise cluster) • 100GB recommendations from HDFS to Bigtable in US+EU • 250GB Bloom filters from Bigtable to HDFS • 200 LOC
  21. User conversion analysis: • For marketing and campaigning strategies • Track user transitions through products • Aggregated for simulation and projection • 150GB BigQuery in and out
  22. Demo Time!
  23. Design and Implementation: • Simplicity over premature optimization • Usability over Python/Java inter-op • Ser/de: ☑ kryo/chill ☒ Coder[T] • Closure cleaner
  24. What’s next? • Apache Beam donation • Migrating internal teams • BigQuery SQL-2011 dialect • Better streaming support • PRs and issues welcome!
  25. Neville Li, @sinisa_lyh. Thank you!
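
The WordCount pipeline on slide 17 can be tried out locally without Scio or Dataflow. The following is a minimal sketch on plain Scala collections, not the Scio implementation: `WordCountSketch`, `tokenize` and `wordCount` are hypothetical names introduced here, and the `groupBy` step is only a local stand-in for what `countByValue` does on a distributed collection. The splitting regex is the one from the slide.

```scala
object WordCountSketch {
  // Split on runs of characters that are neither letters nor apostrophes
  // (the regex from slide 17) and drop the empty tokens this can produce.
  def tokenize(line: String): Seq[String] =
    line.split("[^a-zA-Z']+").filter(_.nonEmpty).toSeq

  // Local stand-in for countByValue: each distinct word mapped to its
  // number of occurrences across all input lines.
  def wordCount(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(tokenize)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size.toLong) }
}
```

Calling `WordCountSketch.wordCount(Seq("to be or not to be"))` counts `to` and `be` twice and the other words once.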

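The PageRank loop on slide 18 can likewise be followed step by step on in-memory data. This is a sketch under stated assumptions rather than the Scio code: `PageRankSketch` and the `iterations` parameter are hypothetical, `Map` and `Seq` stand in for `SCollection`, and the `groupBy` calls imitate `groupByKey` and `sumByKey`. The damping factor 0.85 and the ten iterations come from the slide.

```scala
object PageRankSketch {
  // edges are (source URL, destination URL) pairs, like the slide's input.
  def pageRank(edges: Seq[(String, String)], iterations: Int = 10): Map[String, Double] = {
    // Stand-in for groupByKey: each source mapped to its outgoing links.
    val links: Map[String, Seq[String]] =
      edges.groupBy(_._1).map { case (src, es) => (src, es.map(_._2)) }

    // Every page that has outgoing links starts with rank 1.0.
    var ranks: Map[String, Double] = links.map { case (src, _) => (src, 1.0) }

    for (_ <- 1 to iterations) {
      // Inner join of links and ranks on the source URL, then split each
      // page's rank evenly across its outgoing links.
      val contribs: Seq[(String, Double)] = links.toSeq.flatMap { case (src, urls) =>
        ranks.get(src).toList.flatMap(rank => urls.map(url => (url, rank / urls.size)))
      }
      // Stand-in for sumByKey, followed by the damping step from the slide.
      ranks = contribs.groupBy(_._1).map { case (url, cs) =>
        (url, (1 - 0.85) + 0.85 * cs.map(_._2).sum)
      }
    }
    ranks
  }
}
```

As in the slide, the join is an inner join, so a page without outgoing links receives rank but never passes any on, and a page without incoming links drops out of `ranks` after the first iteration.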