Scio


Neville Li

Scio - A Scala API for Google Cloud Dataflow https://github.com/spotify/scio


Scio
A Scala API for Google Cloud Dataflow
Neville Li @sinisa_lyh

Origin Story
Scalding and Spark
ML, recommendations, analytics
50+ users, 400+ unique jobs

Moving to Google Cloud
Early 2015 - Dataflow Scala hack project

Data model
Spark
• RDD for batch, DStream for streaming
• Explicit caching semantics
• Two sets of APIs
Dataflow
• PCollection for both batch and streaming
• Windowed and timestamped values
• One unified API
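
To make "windowed and timestamped values" concrete, here is a minimal plain-Scala sketch that assigns each timestamped value to a fixed-size window and sums per window. It only illustrates the idea and is not the Dataflow or Scio API; `Timestamped`, `windowStart`, and `sumPerWindow` are hypothetical names chosen for this example.

```scala
// Plain-Scala sketch of fixed windowing; not the Dataflow or Scio API.
// `Timestamped`, `windowStart`, and `sumPerWindow` are hypothetical names.
object FixedWindowSketch {
  final case class Timestamped[A](value: A, timestampMs: Long)

  // Start of the fixed window that contains the timestamp.
  // Math.floorMod keeps negative timestamps in the correct window.
  def windowStart(timestampMs: Long, sizeMs: Long): Long =
    timestampMs - Math.floorMod(timestampMs, sizeMs)

  // Group values by window start and sum the values inside each window.
  def sumPerWindow(events: Seq[Timestamped[Int]], sizeMs: Long): Map[Long, Int] =
    events
      .groupBy(e => windowStart(e.timestampMs, sizeMs))
      .map { case (start, es) => start -> es.map(_.value).sum }
}
```

The same function works whether `events` came from a bounded file or was collected from a stream, which is the point of attaching timestamps and windows to the values themselves.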

Execution
Spark
• Driver and executors
• Dynamic execution from driver
• Transforms and actions
Dataflow
• No master
• Static execution planning
• Transforms only, no actions
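
The "transforms only, no actions" model can be sketched as a toy plan builder: each transform only extends a description of the pipeline, and nothing touches data until the finished plan is handed to a runner. This is a plain-Scala illustration, not the Dataflow or Scio API; `Plan`, `Source`, `MapStep`, `FilterStep`, and `run` are hypothetical names.

```scala
// Toy model of deferred execution; not the Dataflow or Scio API.
// `Plan`, `Source`, `MapStep`, `FilterStep`, and `run` are hypothetical names.
object DeferredPlanSketch {
  sealed trait Plan[A] {
    // Transforms only extend the plan; they do not process any data.
    def map[B](f: A => B): Plan[B] = MapStep(this, f)
    def filter(p: A => Boolean): Plan[A] = FilterStep(this, p)
    // Stand-in for submitting the finished plan to a runner.
    def run(): Seq[A]
  }

  final case class Source[A](data: Seq[A]) extends Plan[A] {
    def run(): Seq[A] = data
  }

  final case class MapStep[A, B](parent: Plan[A], f: A => B) extends Plan[B] {
    def run(): Seq[B] = parent.run().map(f)
  }

  final case class FilterStep[A](parent: Plan[A], p: A => Boolean) extends Plan[A] {
    def run(): Seq[A] = parent.run().filter(p)
  }
}
```

Building `Source(Seq(1, 2, 3, 4)).map(_ * 2).filter(_ > 4)` evaluates no user function; only the final `run()` call does. A real runner would optimize and distribute the plan instead of evaluating it recursively.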

Why not Scalding on GCE
Pros
• Community: Twitter, eBay, Etsy, Stripe, LinkedIn, …
• Stable and proven

Why not Scalding on GCE
Cons
• Hadoop cluster operations
• Multi-tenancy: resource contention and utilization
• No streaming mode (Summingbird?)

Why not Spark on GCE
Pros
• Batch, streaming, interactive and SQL
• MLlib, GraphX
• Scala, Python, and R support
• Zeppelin, spark-notebook, Hue

Why not Spark on GCE
Cons
• Hard to tune and scale
• Cluster lifecycle management

Why Dataflow with Scala
Dataflow
• Hosted solution, no operations
• Ecosystem: GCS, BigQuery, PubSub, Bigtable, …
• Unified batch and streaming model

Why Dataflow with Scala
Scala
• High-level DSL: easy transition for developers
• Reusable and composable code via FP
• Numerical libraries: Breeze, Algebird

Scio
Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]
Verb: I can, know, understand, have knowledge.

WordCount
Almost identical to Spark version
val sc = ScioContext()
sc.textFile("shakespeare.txt")
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue()
  .saveAsTextFile("wordcount.txt")
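
For checking the tokenizer without a pipeline, the same logic can be run on an in-memory collection. The sketch below is a local analogue using only the Scala standard library, not Scio itself; `LocalWordCount` and `wordCount` are hypothetical names, and `groupBy` plus a size count stands in for `countByValue()`.

```scala
// Local, in-memory analogue of the WordCount slide using Scala collections.
// `LocalWordCount` and `wordCount` are hypothetical names, not Scio API.
object LocalWordCount {
  def wordCount(lines: Seq[String]): Map[String, Long] =
    lines
      // Same tokenizer as the slide: split on runs of characters that are
      // neither letters nor apostrophes, then drop empty tokens.
      .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
      // Stand-in for countByValue(): count occurrences of each distinct word.
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size.toLong }
}
```

The `filter(_.nonEmpty)` step matters because `split` yields a leading empty string when a line starts with a separator, for example leading whitespace.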

PageRank
def pageRank(in: SCollection[(String, String)]) = {
  val links = in.groupByKey()
  var ranks = links.mapValues(_ => 1.0)
  for (i <- 1 to 10) {
    val contribs = links.join(ranks).values
      .flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map((_, rank / size))
      }
    ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
  }
  ranks
}
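
The PageRank iteration can be traced on a tiny graph with ordinary Scala maps. This is a local sketch under the assumption that `groupByKey`, `join`, and `sumByKey` behave like the map operations used here (in particular, that `join` is an inner join); `LocalPageRank` and its `iterations` parameter are hypothetical names for this example.

```scala
// In-memory analogue of the PageRank slide using Scala collections.
// `LocalPageRank` and `iterations` are hypothetical, not Scio API.
object LocalPageRank {
  def pageRank(edges: Seq[(String, String)], iterations: Int = 10): Map[String, Double] = {
    // groupByKey: source page -> outgoing link targets.
    val links: Map[String, Seq[String]] =
      edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }
    // Every page with outgoing links starts with rank 1.0.
    var ranks: Map[String, Double] = links.map { case (src, _) => src -> 1.0 }
    for (_ <- 1 to iterations) {
      // Inner join of links and ranks, then split each rank evenly
      // across the page's outgoing links.
      val contribs: Seq[(String, Double)] = links.toSeq.flatMap { case (src, urls) =>
        ranks.get(src).toSeq.flatMap(rank => urls.map(url => (url, rank / urls.size)))
      }
      // sumByKey, then apply the damping factor 0.85 as on the slide.
      ranks = contribs.groupBy(_._1).map { case (url, cs) =>
        url -> ((1 - 0.85) + 0.85 * cs.map(_._2).sum)
      }
    }
    ranks
  }
}
```

On a two-page cycle, `pageRank(Seq("a" -> "b", "b" -> "a"))`, both ranks stay at 1.0, since each page passes its whole rank to the other. Pages that receive no contributions drop out of `ranks` after an iteration, as they would with an inner join.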

Spotify Running
60 million tracks
30m users * 10 tempo buckets * 25 tracks
Audio: tempo, energy, time signature ...
Metadata: genres, categories, …
Latent vectors from collaborative filtering

Personalized new releases
• Pre-computed weekly on Hadoop (on-premise cluster)
• 100GB recommendations from HDFS to Bigtable in US+EU
• 250GB Bloom filters from Bigtable to HDFS
• 200 LOC

User conversion analysis
• For marketing and campaigning strategies
• Track user transitions through products
• Aggregated for simulation and projection
• 150GB BigQuery in and out

Design and Implementation
• Simplicity over premature optimization
• Usability over Python/Java inter-op
• Ser/de: ☑ kryo/chill, ☒ Coder[T]
• Closure cleaner

What’s next?
• Apache Beam donation
• Migrating internal teams
• BigQuery SQL-2011 dialect
• Better streaming support
• PRs and issues welcome!


