SlideShare una empresa de Scribd logo
1 de 52
Descargar para leer sin conexión
Cloud Dataflow
A Google Service and SDK
The Presenter
Alex Van Boxel, Software Architect, software engineer, devops, tester,...
at Vente-Exclusive.com
Introduction
Twitter @alexvb
Plus +AlexVanBoxel
E-mail alex@vanboxel.be
Web http://alex.vanboxel.be
Dataflow
what, how, where?
Dataflow is...
A set of SDK that define the
programming model that
you use to build your
streaming and batch
processing pipeline (*)
Google Cloud Dataflow is a
fully managed service that
will run and optimize your
pipeline
Dataflow where...
ETL
● Move
● Filter
● Enrich
Analytics
● Streaming Compu
● Batch Compu
● Machine Learning
Composition
● Connecting in Pub/Sub
○ Could be other dataflows
○ Could be something else
NoOps Model
● Resources allocated on demand
● Lifecycle managed by the service
● Optimization
● Rebalancing
Cloud Dataflow Benefits
Unified Programming Model
Unified: Streaming and Batch
Open Sourced
● Java 8 implementation
● Python in the works
Cloud Dataflow Benefits
Unified Programming Model (streaming)
Cloud Dataflow Benefits
DataFlow
(streaming)
DataFlow (batch)
BigQuery
Unified Programming Model (stream and backup)
Cloud Dataflow Benefits
DataFlow
(streaming)
DataFlow (batch)
BigQuery
Unified Programming Model (batch)
Cloud Dataflow Benefits
DataFlow
(streaming)
DataFlow (batch)
BigQuery
Beam
Apache Incubation
DataFlow SDK becomes Apache Beam
Cloud Dataflow Benefits
DataFlow SDK becomes Apache Beam
● Pipeline first, runtime second
○ focus your data pipelines, not the characteristics runner
● Portability
○ portable across a number of runtime engines
○ runtime based on any number of considerations
■ performance
■ cost
■ scalability
Cloud Dataflow Benefits
DataFlow SDK becomes Apache Beam
● Unified model
○ Batch and streaming are integrated into a unified model
○ Powerful semantics, such as windowing, ordering and triggering
● Development tooling
○ tools you need to create portable data pipelines quickly and easily
using open-source languages, libraries and tools.
Cloud Dataflow Benefits
DataFlow SDK becomes Apache Beam
● Scala Implementation
● Alternative Runners
○ Spark Runner from Cloudera Labs
○ Flink Runner from data Artisans
Cloud Dataflow Benefits
DataFlow SDK becomes Apache Beam
● Cloudera
● data Artisans
● Talend
● Cask
● PayPal
Cloud Dataflow Benefits
SDK Primitives
Autopsy of a dataflow pipeline
Pipeline
● Inputs: Pub/Sub, BigQuery, GCS, XML, JSON, …
● Transforms: Filter, Join, Aggregate, …
● Windows: Sliding, Sessions, Fixed, …
● Triggers: Correctness….
● Outputs: Pub/Sub, BigQuery, GCS, XML, JSON, ...
Logical Model
PCollection<T>
● Immutable collection
● Could be bound or unbound
● Created by
○ a backing data store
○ a transformation from other PCollection
○ generated
● PCollection<KV<K,V>>
Dataflow SDK Primitives
ParDo<I,O>
● Parallel Processing of each element in the PCollection
● Developer provide DoFn<I,O>
● Shards processed independent
of each other
Dataflow SDK Primitives
ParDo<I,O>
Interesting concept of side outputs: allows for data sharing
Dataflow SDK Primitives
PTransform
● Combine small operations into bigger transforms
● Standard transform (COUNT, SUM,...)
● Reuse and Monitoring
Dataflow SDK Primitives
public class BlockTransform extends PTransform<PCollection<Legacy>, PCollection<Event>> {
@Override
public PCollection<Event> apply(PCollection<Legacy> input) {
return input.apply(ParDo.of(new TimestampEvent1Fn()))
.apply(ParDo.of(new FromLegacy2EventFn())).setCoder(AvroCoder.of(Event.class))
.apply(Window.<Event>into(Sessions.withGapDuration(Duration.standardMinutes(20))))
.apply(ParDo.of(new ExtactChannelUserFn()))
}
}
Merging
● GroupByKey
○ Takes a PCollection of KV<K,V>
and groups them
○ Must be keyed
● Flatten
○ Join with same type
Dataflow SDK Primitives
Input/Output
Dataflow SDK Primitives
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
// streamData is Unbounded; apply windowing afterward.
PCollection<String> streamData =
p.apply(PubsubIO.Read.named("ReadFromPubsub")
.topic("/topics/my-topic"));
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
PCollection<TableRow> weatherData = p
.apply(BigQueryIO.Read.named("ReadWeatherStations")
.from("clouddataflow-readonly:samples.weather_stations"));
● Multiple Inputs + Outputs
Input/Output
Dataflow SDK Primitives
Primitives on steroids
Autopsy of a dataflow pipeline
Windows
● Split unbounded collections (ex. Pub/Sub)
● Works on bounded collection as well
Dataflow SDK Primitives
.apply(Window.<Event>into(Sessions.withGapDuration(Duration.standardMinutes(20))))
Window.<CAL>into(SlidingWindows.of(Duration.standardMinutes(5)));
Triggers
● Time-based
● Data-based
● Composite
Dataflow SDK Primitives
AfterEach.inOrder(
AfterWatermark.pastEndOfWindow(),
Repeatedly.forever(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardMinutes(10)))
.orFinally(AfterWatermark
.pastEndOfWindow()
.plusDelayOf(Duration.standardDays(2))));
14:22Event time 14:21
Input/Output
Dataflow SDK Primitives
14:21 14:22
14:21:25Processing time
Event time
14:2314:22
Watermark
14:21:20 14:21:40
Early trigger Late trigger Late trigger
14:25
14:22Event time 14:21
Input/Output
Dataflow SDK Primitives
14:21 14:22
14:21:25Processing time
Event time
14:2314:22
Watermark
14:21:20 14:21:40
Early trigger Late trigger Late trigger
14:25
Google Cloud Dataflow
The service
What you need
● Google Cloud SDK
○ utilities to authenticate and manage project
● Java IDE
○ Eclipse
○ IntelliJ
● Standard Build System
○ Maven
○ Gradle
Dataflow SDK Primitives
What you need
Dataflow SDK Primitives
subprojects {
apply plugin: 'java'
repositories {
maven {
url "http://jcenter.bintray.com"
}
mavenLocal()
}
dependencies {
compile 'com.google.cloud.dataflow:google-cloud-dataflow-java-sdk-all:1.3.0'
testCompile 'org.hamcrest:hamcrest-all:1.3'
testCompile 'org.assertj:assertj-core:3.0.0'
testCompile 'junit:junit:4.12'
}
}
How it work
Google Cloud Dataflow Service
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
.apply(FlatMapElements.via((String word) -> Arrays.asList(word.split("[^a-zA-Z']+")))
.withOutputType(new TypeDescriptor<String>() {}))
.apply(Filter.byPredicate((String word) -> !word.isEmpty()))
.apply(Count.<String>perElement())
.apply(MapElements
.via((KV<String, Long> wordCount) -> wordCount.getKey() + ": " + wordCount.getValue())
.withOutputType(new TypeDescriptor<String>() {}))
.apply(TextIO.Write.to("gs://YOUR_OUTPUT_BUCKET/AND_OUTPUT_PREFIX"));
Service
● Stage
● Optimize (Flume)
● Execute
Google Cloud Dataflow Service
Optimization
Google Cloud Dataflow Service
Optimization
Google Cloud Dataflow Service
Console
Google Cloud Dataflow Service
Lifecycle
Google Cloud Dataflow Service
Compute Worker Scaling
Google Cloud Dataflow Service
Compute Worker Rescaling
Google Cloud Dataflow Service
Compute Worker Rebalancing
Google Cloud Dataflow Service
Compute Worker Rescaling
Google Cloud Dataflow Service
Lifecycle
Google Cloud Dataflow Service
BigQuery
Google Cloud Dataflow Service
Testability
Test Driven Development Build-In
Testability - Built in
Dataflow Testability
KV<String, List<CalEvent>> input = KV.of("", new ArrayList<>());
input.getValue().add(CalEvent.event("SESSION", "REFERRAL")
.build());
input.getValue().add(CalEvent.event("SESSION", "START")
.data("{"referral":{"Sourc...}").build());
DoFnTester<KV<?, List<CalEvent>>, CalEvent> fnTester = DoFnTester.of(new
SessionRepairFn());
fnTester.processElement(input);
List<CalEvent> calEvents = fnTester.peekOutputElements();
assertThat(calEvents).hasSize(2);
CalEvent referral = CalUtil.findEvent(calEvents, "REFERRAL");
assertNotNull("REFERRAL should still exist", referral);
CalEvent start = CalUtil.findEvent(calEvents, "START");
assertNotNull("START should still exist", start);
assertNull("START should have referral removed.",
Appendix
Follow the yellow brick road
Paper - Flume Java
FlumeJava: Easy, Efficient Data-Parallel Pipelines
http://pages.cs.wisc.edu/~akella/CS838/F12/838-
CloudPapers/FlumeJava.pdf
More Dataflow Resources
DataFlow/Bean vs Spark
Dataflow/Beam & Spark: A Programming Model
Comparison
https://cloud.google.com/dataflow/blog/dataflow-beam-
and-spark-comparison
More Dataflow Resources
Github - Integrate Dataflow with Luigi
Google Cloud integration for Luigi
https://github.com/alexvanboxel/luigiext-gcloud
More Dataflow Resources
Questions and Answers
The last slide
Twitter @alexvb
Plus +AlexVanBoxel
E-mail alex@vanboxel.be
Web http://alex.vanboxel.be

Más contenido relacionado

La actualidad más candente

Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Yohei Onishi
 
From airflow to google cloud composer
From airflow to google cloud composerFrom airflow to google cloud composer
From airflow to google cloud composerBruce Kuo
 
Orchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWSOrchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWSDerrick Qin
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...DataWorks Summit
 
Tom Grey - Google Cloud Platform
Tom Grey - Google Cloud PlatformTom Grey - Google Cloud Platform
Tom Grey - Google Cloud PlatformFondazione CUOA
 
Introduction to Apache Airflow
Introduction to Apache AirflowIntroduction to Apache Airflow
Introduction to Apache Airflowmutt_data
 
Performance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and UnderscoresPerformance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and UnderscoresJitendra Singh
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentationIlias Okacha
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar
 
Exactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsExactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsGuozhang Wang
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation Brett VanderPlaats
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Riccardo Zamana
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...HostedbyConfluent
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowYohei Onishi
 
Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ...Simplilearn
 

La actualidad más candente (20)

Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
 
From airflow to google cloud composer
From airflow to google cloud composerFrom airflow to google cloud composer
From airflow to google cloud composer
 
Orchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWSOrchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWS
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
 
Apache airflow
Apache airflowApache airflow
Apache airflow
 
Airflow for Beginners
Airflow for BeginnersAirflow for Beginners
Airflow for Beginners
 
Tom Grey - Google Cloud Platform
Tom Grey - Google Cloud PlatformTom Grey - Google Cloud Platform
Tom Grey - Google Cloud Platform
 
Introduction to Apache Airflow
Introduction to Apache AirflowIntroduction to Apache Airflow
Introduction to Apache Airflow
 
Performance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and UnderscoresPerformance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and Underscores
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
Apache Airflow overview
Apache Airflow overviewApache Airflow overview
Apache Airflow overview
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
 
Exactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsExactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka Streams
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
 
Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ...
 

Destacado

Big Query Basics
Big Query BasicsBig Query Basics
Big Query BasicsIdo Green
 
Google Cloud Platform at Vente-Exclusive.com
Google Cloud Platform at Vente-Exclusive.comGoogle Cloud Platform at Vente-Exclusive.com
Google Cloud Platform at Vente-Exclusive.comAlex Van Boxel
 
Exploring BigData with Google BigQuery
Exploring BigData with Google BigQueryExploring BigData with Google BigQuery
Exploring BigData with Google BigQueryDharmesh Vaya
 
Google BigQuery for Everyday Developer
Google BigQuery for Everyday DeveloperGoogle BigQuery for Everyday Developer
Google BigQuery for Everyday DeveloperMárton Kodok
 
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture  for Big Data AppGoogle Cloud Dataflow and lightweight Lambda Architecture  for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data AppTrieu Nguyen
 
An indepth look at Google BigQuery Architecture by Felipe Hoffa of Google
An indepth look at Google BigQuery Architecture by Felipe Hoffa of GoogleAn indepth look at Google BigQuery Architecture by Felipe Hoffa of Google
An indepth look at Google BigQuery Architecture by Felipe Hoffa of GoogleData Con LA
 
You might be paying too much for BigQuery
You might be paying too much for BigQueryYou might be paying too much for BigQuery
You might be paying too much for BigQueryRyuji Tamagawa
 
From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...Neville Li
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGAdam Kawa
 
Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...
Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...
Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...Cloudera, Inc.
 
Streaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud DataflowStreaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud DataflowC4Media
 
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQueryIntro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQueryChris Schalk
 
How BigQuery broke my heart
How BigQuery broke my heartHow BigQuery broke my heart
How BigQuery broke my heartGabriel Hamilton
 
BigQuery for the Big Data win
BigQuery for the Big Data winBigQuery for the Big Data win
BigQuery for the Big Data winKen Taylor
 
Google Analytics and BigQuery, by Javier Ramirez, from datawaki
Google Analytics and BigQuery, by Javier Ramirez, from datawakiGoogle Analytics and BigQuery, by Javier Ramirez, from datawaki
Google Analytics and BigQuery, by Javier Ramirez, from datawakijavier ramirez
 
Dataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data ProcessingDataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data ProcessingDoiT International
 

Destacado (20)

Google BigQuery
Google BigQueryGoogle BigQuery
Google BigQuery
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query Basics
 
Google Cloud Platform at Vente-Exclusive.com
Google Cloud Platform at Vente-Exclusive.comGoogle Cloud Platform at Vente-Exclusive.com
Google Cloud Platform at Vente-Exclusive.com
 
Exploring BigData with Google BigQuery
Exploring BigData with Google BigQueryExploring BigData with Google BigQuery
Exploring BigData with Google BigQuery
 
Google BigQuery for Everyday Developer
Google BigQuery for Everyday DeveloperGoogle BigQuery for Everyday Developer
Google BigQuery for Everyday Developer
 
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture  for Big Data AppGoogle Cloud Dataflow and lightweight Lambda Architecture  for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
 
An indepth look at Google BigQuery Architecture by Felipe Hoffa of Google
An indepth look at Google BigQuery Architecture by Felipe Hoffa of GoogleAn indepth look at Google BigQuery Architecture by Felipe Hoffa of Google
An indepth look at Google BigQuery Architecture by Felipe Hoffa of Google
 
You might be paying too much for BigQuery
You might be paying too much for BigQueryYou might be paying too much for BigQuery
You might be paying too much for BigQuery
 
From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...
 
Google Cloud Dataflow を理解する - #bq_sushi
Google Cloud Dataflow を理解する - #bq_sushiGoogle Cloud Dataflow を理解する - #bq_sushi
Google Cloud Dataflow を理解する - #bq_sushi
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUG
 
Big query
Big queryBig query
Big query
 
Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...
Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...
Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...
 
Streaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud DataflowStreaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud Dataflow
 
Google and big query
Google and big queryGoogle and big query
Google and big query
 
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQueryIntro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
 
How BigQuery broke my heart
How BigQuery broke my heartHow BigQuery broke my heart
How BigQuery broke my heart
 
BigQuery for the Big Data win
BigQuery for the Big Data winBigQuery for the Big Data win
BigQuery for the Big Data win
 
Google Analytics and BigQuery, by Javier Ramirez, from datawaki
Google Analytics and BigQuery, by Javier Ramirez, from datawakiGoogle Analytics and BigQuery, by Javier Ramirez, from datawaki
Google Analytics and BigQuery, by Javier Ramirez, from datawaki
 
Dataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data ProcessingDataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data Processing
 

Similar a Google Cloud Dataflow

Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21JDA Labs MTL
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT_MTL
 
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Kaxil Naik
 
Session 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers ProgramSession 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers ProgramFIWARE
 
JCConf 2016 - Google Dataflow 小試
JCConf 2016 - Google Dataflow 小試JCConf 2016 - Google Dataflow 小試
JCConf 2016 - Google Dataflow 小試Simon Su
 
AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...
AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...
AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...Lucas Jellema
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Structured Streaming in Spark
Structured Streaming in SparkStructured Streaming in Spark
Structured Streaming in SparkDigital Vidya
 
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...tdc-globalcode
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark DownscalingDatabricks
 
Scalable Clusters On Demand
Scalable Clusters On DemandScalable Clusters On Demand
Scalable Clusters On DemandBogdan Kyryliuk
 
Workshop on Google Cloud Data Platform
Workshop on Google Cloud Data PlatformWorkshop on Google Cloud Data Platform
Workshop on Google Cloud Data PlatformGoDataDriven
 
Day 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramDay 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramFIWARE
 
[20200720]cloud native develoment - Nelson Lin
[20200720]cloud native develoment - Nelson Lin[20200720]cloud native develoment - Nelson Lin
[20200720]cloud native develoment - Nelson LinHanLing Shen
 
"Offline mode for a mobile application, redux on server and a little bit abou...
"Offline mode for a mobile application, redux on server and a little bit abou..."Offline mode for a mobile application, redux on server and a little bit abou...
"Offline mode for a mobile application, redux on server and a little bit abou...Viktor Turskyi
 
Quick prototyping with VulcanJS
Quick prototyping with VulcanJSQuick prototyping with VulcanJS
Quick prototyping with VulcanJSMelek Hakim
 
Data Engineer's Lunch #50: Airbyte for Data Engineering
Data Engineer's Lunch #50: Airbyte for Data EngineeringData Engineer's Lunch #50: Airbyte for Data Engineering
Data Engineer's Lunch #50: Airbyte for Data EngineeringAnant Corporation
 
Encode Club workshop slides
Encode Club workshop slidesEncode Club workshop slides
Encode Club workshop slidesVanessa Lošić
 
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...Lightbend
 

Similar a Google Cloud Dataflow (20)

Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017
 
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
 
Session 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers ProgramSession 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers Program
 
JCConf 2016 - Google Dataflow 小試
JCConf 2016 - Google Dataflow 小試JCConf 2016 - Google Dataflow 小試
JCConf 2016 - Google Dataflow 小試
 
AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...
AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...
AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...
 
AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...
AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...
AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Structured Streaming in Spark
Structured Streaming in SparkStructured Streaming in Spark
Structured Streaming in Spark
 
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
 
Scalable Clusters On Demand
Scalable Clusters On DemandScalable Clusters On Demand
Scalable Clusters On Demand
 
Workshop on Google Cloud Data Platform
Workshop on Google Cloud Data PlatformWorkshop on Google Cloud Data Platform
Workshop on Google Cloud Data Platform
 
Day 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramDay 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers Program
 
[20200720]cloud native develoment - Nelson Lin
[20200720]cloud native develoment - Nelson Lin[20200720]cloud native develoment - Nelson Lin
[20200720]cloud native develoment - Nelson Lin
 
"Offline mode for a mobile application, redux on server and a little bit abou...
"Offline mode for a mobile application, redux on server and a little bit abou..."Offline mode for a mobile application, redux on server and a little bit abou...
"Offline mode for a mobile application, redux on server and a little bit abou...
 
Quick prototyping with VulcanJS
Quick prototyping with VulcanJSQuick prototyping with VulcanJS
Quick prototyping with VulcanJS
 
Data Engineer's Lunch #50: Airbyte for Data Engineering
Data Engineer's Lunch #50: Airbyte for Data EngineeringData Engineer's Lunch #50: Airbyte for Data Engineering
Data Engineer's Lunch #50: Airbyte for Data Engineering
 
Encode Club workshop slides
Encode Club workshop slidesEncode Club workshop slides
Encode Club workshop slides
 
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
 

Último

Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 

Último (20)

Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 

Google Cloud Dataflow

  • 1. Cloud Dataflow A Google Service and SDK
  • 2. The Presenter Alex Van Boxel, Software Architect, software engineer, devops, tester,... at Vente-Exclusive.com Introduction Twitter @alexvb Plus +AlexVanBoxel E-mail alex@vanboxel.be Web http://alex.vanboxel.be
  • 4. Dataflow is... A set of SDK that define the programming model that you use to build your streaming and batch processing pipeline (*) Google Cloud Dataflow is a fully managed service that will run and optimize your pipeline
  • 5. Dataflow where... ETL ● Move ● Filter ● Enrich Analytics ● Streaming Compu ● Batch Compu ● Machine Learning Composition ● Connecting in Pub/Sub ○ Could be other dataflows ○ Could be something else
  • 6. NoOps Model ● Resources allocated on demand ● Lifecycle managed by the service ● Optimization ● Rebalancing Cloud Dataflow Benefits
  • 7. Unified Programming Model Unified: Streaming and Batch Open Sourced ● Java 8 implementation ● Python in the works Cloud Dataflow Benefits
  • 8. Unified Programming Model (streaming) Cloud Dataflow Benefits DataFlow (streaming) DataFlow (batch) BigQuery
  • 9. Unified Programming Model (stream and backup) Cloud Dataflow Benefits DataFlow (streaming) DataFlow (batch) BigQuery
  • 10. Unified Programming Model (batch) Cloud Dataflow Benefits DataFlow (streaming) DataFlow (batch) BigQuery
  • 12. DataFlow SDK becomes Apache Beam Cloud Dataflow Benefits
  • 13. DataFlow SDK becomes Apache Beam ● Pipeline first, runtime second ○ focus your data pipelines, not the characteristics runner ● Portability ○ portable across a number of runtime engines ○ runtime based on any number of considerations ■ performance ■ cost ■ scalability Cloud Dataflow Benefits
  • 14. DataFlow SDK becomes Apache Beam ● Unified model ○ Batch and streaming are integrated into a unified model ○ Powerful semantics, such as windowing, ordering and triggering ● Development tooling ○ tools you need to create portable data pipelines quickly and easily using open-source languages, libraries and tools. Cloud Dataflow Benefits
  • 15. DataFlow SDK becomes Apache Beam ● Scala Implementation ● Alternative Runners ○ Spark Runner from Cloudera Labs ○ Flink Runner from data Artisans Cloud Dataflow Benefits
  • 16. DataFlow SDK becomes Apache Beam ● Cloudera ● data Artisans ● Talend ● Cask ● PayPal Cloud Dataflow Benefits
  • 17. SDK Primitives Autopsy of a dataflow pipeline
  • 18. Pipeline ● Inputs: Pub/Sub, BigQuery, GCS, XML, JSON, … ● Transforms: Filter, Join, Aggregate, … ● Windows: Sliding, Sessions, Fixed, … ● Triggers: Correctness…. ● Outputs: Pub/Sub, BigQuery, GCS, XML, JSON, ... Logical Model
  • 19. PCollection<T> ● Immutable collection ● Could be bound or unbound ● Created by ○ a backing data store ○ a transformation from other PCollection ○ generated ● PCollection<KV<K,V>> Dataflow SDK Primitives
  • 20. ParDo<I,O> ● Parallel Processing of each element in the PCollection ● Developer provide DoFn<I,O> ● Shards processed independent of each other Dataflow SDK Primitives
  • 21. ParDo<I,O> Interesting concept of side outputs: allows for data sharing Dataflow SDK Primitives
  • 22. PTransform ● Combine small operations into bigger transforms ● Standard transform (COUNT, SUM,...) ● Reuse and Monitoring Dataflow SDK Primitives public class BlockTransform extends PTransform<PCollection<Legacy>, PCollection<Event>> { @Override public PCollection<Event> apply(PCollection<Legacy> input) { return input.apply(ParDo.of(new TimestampEvent1Fn())) .apply(ParDo.of(new FromLegacy2EventFn())).setCoder(AvroCoder.of(Event.class)) .apply(Window.<Event>into(Sessions.withGapDuration(Duration.standardMinutes(20)))) .apply(ParDo.of(new ExtactChannelUserFn())) } }
  • 23. Merging ● GroupByKey ○ Takes a PCollection of KV<K,V> and groups them ○ Must be keyed ● Flatten ○ Join with same type Dataflow SDK Primitives
  • 24. Input/Output Dataflow SDK Primitives PipelineOptions options = PipelineOptionsFactory.create(); Pipeline p = Pipeline.create(options); // streamData is Unbounded; apply windowing afterward. PCollection<String> streamData = p.apply(PubsubIO.Read.named("ReadFromPubsub") .topic("/topics/my-topic")); PipelineOptions options = PipelineOptionsFactory.create(); Pipeline p = Pipeline.create(options); PCollection<TableRow> weatherData = p .apply(BigQueryIO.Read.named("ReadWeatherStations") .from("clouddataflow-readonly:samples.weather_stations"));
  • 25. ● Multiple Inputs + Outputs Input/Output Dataflow SDK Primitives
  • 26. Primitives on steroids Autopsy of a dataflow pipeline
  • 27. Windows ● Split unbounded collections (ex. Pub/Sub) ● Works on bounded collection as well Dataflow SDK Primitives .apply(Window.<Event>into(Sessions.withGapDuration(Duration.standardMinutes(20)))) Window.<CAL>into(SlidingWindows.of(Duration.standardMinutes(5)));
  • 28. Triggers ● Time-based ● Data-based ● Composite Dataflow SDK Primitives AfterEach.inOrder( AfterWatermark.pastEndOfWindow(), Repeatedly.forever(AfterProcessingTime .pastFirstElementInPane() .plusDelayOf(Duration.standardMinutes(10))) .orFinally(AfterWatermark .pastEndOfWindow() .plusDelayOf(Duration.standardDays(2))));
  • 29. 14:22Event time 14:21 Input/Output Dataflow SDK Primitives 14:21 14:22 14:21:25Processing time Event time 14:2314:22 Watermark 14:21:20 14:21:40 Early trigger Late trigger Late trigger 14:25
  • 30. 14:22Event time 14:21 Input/Output Dataflow SDK Primitives 14:21 14:22 14:21:25Processing time Event time 14:2314:22 Watermark 14:21:20 14:21:40 Early trigger Late trigger Late trigger 14:25
  • 32. What you need ● Google Cloud SDK ○ utilities to authenticate and manage project ● Java IDE ○ Eclipse ○ IntelliJ ● Standard Build System ○ Maven ○ Gradle Dataflow SDK Primitives
  • 33. What you need Dataflow SDK Primitives subprojects { apply plugin: 'java' repositories { maven { url "http://jcenter.bintray.com" } mavenLocal() } dependencies { compile 'com.google.cloud.dataflow:google-cloud-dataflow-java-sdk-all:1.3.0' testCompile 'org.hamcrest:hamcrest-all:1.3' testCompile 'org.assertj:assertj-core:3.0.0' testCompile 'junit:junit:4.12' } }
  • 34. How it work Google Cloud Dataflow Service Pipeline p = Pipeline.create(options); p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*")) .apply(FlatMapElements.via((String word) -> Arrays.asList(word.split("[^a-zA-Z']+"))) .withOutputType(new TypeDescriptor<String>() {})) .apply(Filter.byPredicate((String word) -> !word.isEmpty())) .apply(Count.<String>perElement()) .apply(MapElements .via((KV<String, Long> wordCount) -> wordCount.getKey() + ": " + wordCount.getValue()) .withOutputType(new TypeDescriptor<String>() {})) .apply(TextIO.Write.to("gs://YOUR_OUTPUT_BUCKET/AND_OUTPUT_PREFIX"));
  • 35. Service ● Stage ● Optimize (Flume) ● Execute Google Cloud Dataflow Service
  • 40. Compute Worker Scaling Google Cloud Dataflow Service
  • 41. Compute Worker Rescaling Google Cloud Dataflow Service
  • 42. Compute Worker Rebalancing Google Cloud Dataflow Service
  • 43. Compute Worker Rescaling Google Cloud Dataflow Service
  • 47. Testability - Built in Dataflow Testability KV<String, List<CalEvent>> input = KV.of("", new ArrayList<>()); input.getValue().add(CalEvent.event("SESSION", "REFERRAL") .build()); input.getValue().add(CalEvent.event("SESSION", "START") .data("{"referral":{"Sourc...}").build()); DoFnTester<KV<?, List<CalEvent>>, CalEvent> fnTester = DoFnTester.of(new SessionRepairFn()); fnTester.processElement(input); List<CalEvent> calEvents = fnTester.peekOutputElements(); assertThat(calEvents).hasSize(2); CalEvent referral = CalUtil.findEvent(calEvents, "REFERRAL"); assertNotNull("REFERRAL should still exist", referral); CalEvent start = CalUtil.findEvent(calEvents, "START"); assertNotNull("START should still exist", start); assertNull("START should have referral removed.",
  • 49. Paper - Flume Java FlumeJava: Easy, Efficient Data-Parallel Pipelines http://pages.cs.wisc.edu/~akella/CS838/F12/838- CloudPapers/FlumeJava.pdf More Dataflow Resources
  • 50. DataFlow/Bean vs Spark Dataflow/Beam & Spark: A Programming Model Comparison https://cloud.google.com/dataflow/blog/dataflow-beam- and-spark-comparison More Dataflow Resources
  • 51. Github - Integrate Dataflow with Luigi Google Cloud integration for Luigi https://github.com/alexvanboxel/luigiext-gcloud More Dataflow Resources
  • 52. Questions and Answers The last slide Twitter @alexvb Plus +AlexVanBoxel E-mail alex@vanboxel.be Web http://alex.vanboxel.be