Introduction to Apache Beam
&
No shard left behind: APIs for massive parallel efficiency
Dan Halperin
Apache Beam (incubating) PPMC
Senior Software Engineer, Google
Apache Beam is a unified
programming model designed to
provide efficient and portable
data processing pipelines
What is Apache Beam?
The Beam Programming Model
•What / Where / When / How
SDKs for writing Beam pipelines
•Java, Python
Beam Runners for existing distributed processing backends
•Apache Apex
•Apache Flink
•Apache Spark
•Google Cloud Dataflow
[Architecture diagram: Beam Java, Beam Python, and other-language SDKs feed the Beam Model pipeline-construction layer; Beam Model Fn Runners translate pipelines for execution on Apache Apex, Apache Flink, Apache Gearpump, Apache Spark, and Google Cloud Dataflow.]
Apache Beam is a unified
programming model designed to
provide efficient and portable
data processing pipelines
Quick overview of the Beam model
Clean abstractions hide details:
PCollection – a parallel collection of timestamped elements that are in windows.
Sources & Readers – produce PCollections of timestamped elements and a watermark.
ParDo – flatmap over elements of a PCollection.
(Co)GroupByKey – shuffle & group {K: V} pairs → {K: [V]}.
Side inputs – global view of a PCollection used for broadcast / joins.
Window – reassign elements to zero or more windows; may be data-dependent.
Triggers – user flow control based on window, watermark, element count, lateness.
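The (Co)GroupByKey primitive can be illustrated without any Beam dependency. A minimal plain-Java sketch of the {K: V} → {K: [V]} grouping (the class and method names here are hypothetical stand-ins, not Beam APIs):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Beam's GroupByKey: collects {K: V} pairs
// into {K: [V]}, preserving first-seen key order for readability.
class GroupByKeySketch {
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }
}
```

In a real runner the shuffle is distributed by key, but the per-key result is the same: every value with the same key ends up in one list.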
Example scenarios: 1. Classic batch; 2. Batch with fixed windows; 3. Streaming; 4. Streaming with speculative + late data; 5. Sessions.
Simple clickstream analysis pipeline
Data: JSON-encoded analytics stream from the site
•{"user":"dhalperi", "page":"oreilly.com/feed/7", "tstamp":"2016-08-31T15:07Z", …}
Desired output: per-user session length and activity level
•dhalperi, 17 pageviews, 2016-08-31 15:00-15:35
Other pipeline goals:
•Live data – can track ongoing sessions with speculative output
  dhalperi, 10 pageviews, 2016-08-31 15:00-15:15 (EARLY)
•Archival data – much faster, still correct output respecting event time
Simple clickstream analysis pipeline
PCollection<KV<User, Click>> clickstream =
    pipeline.apply(IO.Read(…))
            .apply(MapElements.of(new ParseClicksAndAssignUser()));

PCollection<KV<User, Long>> userSessions =
    clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3)))
                     .triggering(AtWatermark()
                         .withEarlyFirings(AtPeriod(Minutes(1)))))
               .apply(Count.perKey());

userSessions.apply(MapElements.of(new FormatSessionsForOutput()))
            .apply(IO.Write(…));

pipeline.run();
Unified unbounded & bounded PCollections
pipeline.apply(IO.Read(…)).apply(MapElements.of(new ParseClicksAndAssignUser()));

Apache Kafka, ActiveMQ, tailing a filesystem, …
•A live, roughly in-order stream of messages: unbounded PCollections.
•KafkaIO.read().fromTopic("pageviews")

HDFS, Google Cloud Storage, yesterday's Kafka log, …
•Archival data, often readable in any order: bounded PCollections.
•TextIO.read().from("gs://oreilly/pageviews/*")
Windowing and triggers
PCollection<KV<User, Long>> userSessions =
    clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3)))
                     .triggering(AtWatermark()
                         .withEarlyFirings(AtPeriod(Minutes(1)))));
[Diagram: processing time vs. event time, 3:00–3:25, with the watermark advancing. Early firings emit "1 session, 3:04–3:10 (EARLY)", then "2 sessions, 3:04–3:10 & 3:15–3:20 (EARLY)"; once the watermark passes, the final result is "1 session, 3:04–3:25".]
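The session-window semantics in the snippet above can be sketched in plain Java (no Beam dependency; the class and method names are hypothetical): each element occupies a window [t, t + gap), and overlapping windows merge, so a gap of more than 3 minutes between clicks starts a new session.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of Sessions.withGapDuration semantics.
class SessionsSketch {
    static class Session {
        final long start;  // inclusive
        final long end;    // exclusive: last timestamp + gap
        Session(long start, long end) { this.start = start; this.end = end; }
    }

    // Each timestamp opens a window [t, t + gap); overlapping windows merge.
    static List<Session> assign(List<Long> timestamps, long gap) {
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted);
        List<Session> sessions = new ArrayList<>();
        for (long t : sorted) {
            Session last = sessions.isEmpty() ? null : sessions.get(sessions.size() - 1);
            if (last != null && t < last.end) {
                // Overlaps the open session: extend it.
                sessions.set(sessions.size() - 1, new Session(last.start, t + gap));
            } else {
                sessions.add(new Session(t, t + gap));
            }
        }
        return sessions;
    }
}
```

With a 3-minute gap, clicks at minutes 4 and 6 merge into one session ending at minute 9, while a click at minute 15 starts a new one: the two-session intermediate state of the diagram.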
Writing output, and bundling
userSessions.apply(MapElements.of(new FormatSessionsForOutput()))
            .apply(IO.Write(…));
Writing is the dual of reading: format and then output.
Not its own primitive, but implemented based on sink semantics using ParDo + graph structure.
Two example runs of this pipeline
Streaming job consuming the Kafka stream
•Uses 10 workers.
•Pipeline lag of a few seconds.
•2 million users over 1 day.
•A total of ~4.7M early + final sessions.
•240 worker-hours.
Batch job consuming the HDFS archive
•Uses 200 workers.
•Runs for 30 minutes.
•Same input.
•A total of ~2.1M final sessions.
•100 worker-hours.
What does the user have to change to get these results?
A: O(10 lines of code) + command-line arguments
Apache Beam is a unified
programming model designed to
provide efficient and portable
data processing pipelines
Beam abstractions empower runners
Efficiency at the runner's discretion:
"Read from this source, splitting it 1000 ways" – the user decides.
vs. "Read from this source" – the runner decides.
APIs:
• long getEstimatedSize()
• List<Source> splitIntoBundles(size)
[Diagram: the runner asks the gs://logs/* source for its estimated size (50 TiB), weighs cluster utilization, quota, reservations, bandwidth, throughput, and bottlenecks, then splits the read into bundles (e.g. 1 TiB each) of filenames.]
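A minimal sketch of how a runner might use these two calls, with a hypothetical byte-range source (not a real Beam Source implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical byte-range source: the runner asks for an estimated size,
// then asks the source to split itself into bundles of a desired size.
class RangeSource {
    private final long start, end;  // byte offsets, [start, end)

    RangeSource(long start, long end) {
        this.start = start;
        this.end = end;
    }

    long getEstimatedSize() {
        return end - start;
    }

    List<RangeSource> splitIntoBundles(long desiredBundleSize) {
        List<RangeSource> bundles = new ArrayList<>();
        for (long s = start; s < end; s += desiredBundleSize) {
            bundles.add(new RangeSource(s, Math.min(s + desiredBundleSize, end)));
        }
        return bundles;
    }
}
```

A runner that sees an estimated 50 TiB and decides on ~1 TiB bundles would call splitIntoBundles with that size and get ~50 sub-sources; the user never specifies the split count.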
Bundling and runner efficiency
A bundle is a group of elements processed and committed together (each bundle commits as one transaction).
APIs (ParDo/DoFn):
•startBundle()
•processElement(), n times
•finishBundle()
Streaming runner: small bundles; low-latency pipelining across stages; overhead of frequent commits.
Classic batch runner: large bundles; fewer, larger commits; more efficient; long synchronous stages.
Other runner strategies strike a different balance.
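The bundle lifecycle can be sketched with a hypothetical DoFn-like class (plain Java, not the real Beam DoFn): the runner opens a bundle, processes n elements, and commits once in finishBundle, so commit overhead is paid per bundle rather than per element.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the bundle lifecycle: buffer per-element work
// during a bundle and commit it once in finishBundle.
class BundlingSketch {
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> committed = new ArrayList<>();

    void startBundle() {
        buffer.clear();
    }

    void processElement(String element) {
        buffer.add(element.toUpperCase());  // the per-element "DoFn" work
    }

    void finishBundle() {
        committed.add(new ArrayList<>(buffer));  // one commit per bundle
        buffer.clear();
    }

    List<List<String>> committed() {
        return committed;
    }

    // Runner side: a streaming runner would pass small bundles here,
    // a batch runner large ones.
    void runBundle(List<String> bundle) {
        startBundle();
        for (String e : bundle) processElement(e);
        finishBundle();
    }
}
```

The trade-off in the slide falls out directly: many small bundles mean many commits (low latency, more overhead); few large bundles mean few commits (high throughput, long synchronous stages).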
Runner-controlled triggering
Beam triggers are flow control, not instructions:
•"it is okay to produce data", not "produce data now".
•Runners decide when to produce data, and can make local choices for efficiency.
Streaming clickstream analysis: the runner may optimize for latency and freshness.
•Small bundles and frequent triggering → more files and more (speculative) records.
Batch clickstream analysis: the runner may optimize for throughput and efficiency.
•Large bundles and no early triggering → fewer, larger files and no early records.
Pipeline workload varies
[Workload-vs-time charts: a streaming pipeline's input varies over time; batch pipelines go through stages.]
Perils of fixed decisions
[Workload-vs-time charts: over-provisioned for the worst case vs. under-provisioned for the average case.]
Ideal case
[Workload-vs-time chart: resources track the workload.]
The Straggler Problem
[Chart: per-worker completion times; a few slow workers finish long after the rest.]
Work is unevenly distributed across tasks. Reasons:
•Underlying data.
•Processing.
•Runtime effects.
Effects are cumulative per stage.
Standard workarounds for stragglers
•Split files into equal sizes?
•Preemptively over-split?
•Detect slow workers and re-execute?
•Sample extensively and then split?
All of these have major costs; none is a complete solution.
[Chart: per-worker completion times.]
No amount of upfront heuristic tuning (be it manual or
automatic) is enough to guarantee good performance: the
system will always hit unpredictable situations at run-time.

A system that's able to dynamically adapt and get out of a
bad situation is much more powerful than one that
heuristically hopes to avoid getting into it.
Beam readers enable dynamic adaptation
Readers provide simple progress signals, enabling runners to take action based on execution-time characteristics.
APIs for how much work is pending:
•Bounded: double getFractionConsumed()
•Unbounded: long getBacklogBytes()
Work-stealing:
•Bounded: Source splitAtFraction(double), int getParallelismRemaining()
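A sketch of how splitAtFraction enables work-stealing, using a hypothetical offset-range reader (not the real Beam Reader API): the runner asks a straggling reader to give up the unread tail of its range, which becomes a new source for another worker.

```java
// Hypothetical offset-range reader illustrating dynamic split: the runner
// can ask a running reader to give up the tail of its range at a fraction
// of the original work, leaving the reader the shrunk range [start, split).
class RangeReader {
    private final long start;
    private long end;       // exclusive; may shrink on split
    private long position;  // next offset to read

    RangeReader(long start, long end) {
        this.start = start;
        this.end = end;
        this.position = start;
    }

    boolean advance() {
        if (position >= end) return false;
        position++;  // "read" one offset
        return true;
    }

    double getFractionConsumed() {
        return (double) (position - start) / (end - start);
    }

    // Returns the stolen tail as {split, oldEnd}, or null if the proposed
    // split point has already been consumed (the split is rejected).
    long[] splitAtFraction(double fraction) {
        long split = start + (long) (fraction * (end - start));
        if (split <= position || split >= end) return null;
        long oldEnd = end;
        end = split;
        return new long[] {split, oldEnd};
    }
}
```

The key property is that the split is safe at any time: elements already read stay with the original reader, and only the guaranteed-unread tail moves.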
Dynamic work rebalancing
[Charts: per-task done work, active work, and predicted completion over time. Tasks predicted to finish well past the average are split at "now"; the stolen residual work is redistributed, pulling predicted completion back toward the average.]
Dynamic Work Rebalancing: a real example
Beam pipeline on the Google Cloud Dataflow runner
[Charts: a 2-stage pipeline, split "evenly" but uneven in practice; the same pipeline with dynamic work rebalancing enabled; the difference is the savings.]
Dynamic Work Rebalancing + Autoscaling
Beam pipeline on the Google Cloud Dataflow runner
[Chart: initially allocate ~80 workers based on size; multiple rounds of upsizing enabled by dynamic splitting; upscale to 1000 workers while tasks stay well-balanced and without oversplitting initially; long-running tasks are aborted without causing stragglers.]
Apache Beam is a unified
programming model designed to
provide efficient and portable
data processing pipelines
Write a pipeline once, run it anywhere
PCollection<KV<User, Click>> clickstream =
    pipeline.apply(IO.Read(…))
            .apply(MapElements.of(new ParseClicksAndAssignUser()));

PCollection<KV<User, Long>> userSessions =
    clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3)))
                     .triggering(AtWatermark()
                         .withEarlyFirings(AtPeriod(Minutes(1)))))
               .apply(Count.perKey());

userSessions.apply(MapElements.of(new FormatSessionsForOutput()))
            .apply(IO.Write(…));

pipeline.run();
Unified model for portable execution
Demo time
Building the community
Get involved with Beam
If you analyze data: Try it out!
If you have a data storage or messaging system: Write a connector!
If you have Big Data APIs: Write a Beam SDK, DSL, or library!
If you have a distributed processing backend: Write a Beam Runner!
(& join Apache Apex, Flink, Spark, Gearpump, and Google Cloud Dataflow)
Apache Beam (incubating):
•https://beam.incubator.apache.org/
•documentation, mailing lists, downloads, etc.
Read our blog posts!
•Streaming 101 & 102: oreilly.com/ideas/the-world-beyond-batch-streaming-101
•No shard left behind: Dynamic work rebalancing in Google Cloud Dataflow
•Apache Beam blog: http://beam.incubator.apache.org/blog/
Follow @ApacheBeam on Twitter
Learn more

The Next Generation of Data Processing and Open Source
 
Drools & jBPM Info Sheet
Drools & jBPM Info SheetDrools & jBPM Info Sheet
Drools & jBPM Info Sheet
 
Intro to Drools - St Louis Gateway JUG
Intro to Drools - St Louis Gateway JUGIntro to Drools - St Louis Gateway JUG
Intro to Drools - St Louis Gateway JUG
 
Rules Programming tutorial
Rules Programming tutorialRules Programming tutorial
Rules Programming tutorial
 
Computer Organozation
Computer OrganozationComputer Organozation
Computer Organozation
 

Similar a Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel Efficiency

Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lightbend
 
Reactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka StreamsReactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka StreamsKonrad Malawski
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Provectus
 
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...Flink Forward
 
Talk Python To Me: Stream Processing in your favourite Language with Beam on ...
Talk Python To Me: Stream Processing in your favourite Language with Beam on ...Talk Python To Me: Stream Processing in your favourite Language with Beam on ...
Talk Python To Me: Stream Processing in your favourite Language with Beam on ...Aljoscha Krettek
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...Flink Forward
 
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka...
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka...Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka...
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka...Reactivesummit
 
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & KafkaBack-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & KafkaAkara Sucharitakul
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareDatabricks
 
End to-end async and await
End to-end async and awaitEnd to-end async and await
End to-end async and awaitvfabro
 
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...Lightbend
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamDataWorks Summit/Hadoop Summit
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Apex
 
Building Stream Processing as a Service
Building Stream Processing as a ServiceBuilding Stream Processing as a Service
Building Stream Processing as a ServiceSteven Wu
 
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!confluent
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterPaolo Castagna
 
Streaming all the things with akka streams
Streaming all the things with akka streams   Streaming all the things with akka streams
Streaming all the things with akka streams Johan Andrén
 
Flink Forward Berlin 2018: Thomas Weise & Aljoscha Krettek - "Python Streamin...
Flink Forward Berlin 2018: Thomas Weise & Aljoscha Krettek - "Python Streamin...Flink Forward Berlin 2018: Thomas Weise & Aljoscha Krettek - "Python Streamin...
Flink Forward Berlin 2018: Thomas Weise & Aljoscha Krettek - "Python Streamin...Flink Forward
 
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBaseHBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBaseHBaseCon
 

Similar a Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel Efficiency (20)

Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
 
Reactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka StreamsReactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka Streams
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
 
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...
 
Talk Python To Me: Stream Processing in your favourite Language with Beam on ...
Talk Python To Me: Stream Processing in your favourite Language with Beam on ...Talk Python To Me: Stream Processing in your favourite Language with Beam on ...
Talk Python To Me: Stream Processing in your favourite Language with Beam on ...
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
 
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka...
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka...Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka...
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka...
 
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & KafkaBack-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 
End to-end async and await
End to-end async and awaitEnd to-end async and await
End to-end async and await
 
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
 
Building Stream Processing as a Service
Building Stream Processing as a ServiceBuilding Stream Processing as a Service
Building Stream Processing as a Service
 
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
 
SDAccel Design Contest: Vivado HLS
SDAccel Design Contest: Vivado HLSSDAccel Design Contest: Vivado HLS
SDAccel Design Contest: Vivado HLS
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
Streaming all the things with akka streams
Streaming all the things with akka streams   Streaming all the things with akka streams
Streaming all the things with akka streams
 
Flink Forward Berlin 2018: Thomas Weise & Aljoscha Krettek - "Python Streamin...
Flink Forward Berlin 2018: Thomas Weise & Aljoscha Krettek - "Python Streamin...Flink Forward Berlin 2018: Thomas Weise & Aljoscha Krettek - "Python Streamin...
Flink Forward Berlin 2018: Thomas Weise & Aljoscha Krettek - "Python Streamin...
 
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBaseHBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
 

Último

Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 

Último (20)

Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 

Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel Efficiency

  • 1. Apache Beam (incubating) PPMC Senior Software Engineer, Google Dan Halperin Introduction to Apache Beam & No shard left behind: APIs for massive parallel efficiency
  • 2. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  • 3. What is Apache Beam? The Beam Programming Model •What / Where / When / How
 SDKs for writing Beam pipelines •Java, Python Beam Runners for existing distributed processing backends •Apache Apex •Apache Flink •Apache Spark •Google Cloud Dataflow [Architecture diagram: Beam Java, Beam Python, and other-language SDKs feed the Beam Model (pipeline construction); Beam Model Fn Runners execute pipelines on Apache Apex, Apache Flink, Apache Spark, Apache Gearpump, and Google Cloud Dataflow]
  • 4. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  • 5. Clean abstractions hide details PCollection – a parallel collection of timestamped elements that are in windows.
 Quick overview of the Beam model
  • 6. Clean abstractions hide details PCollection – a parallel collection of timestamped elements that are in windows.
 Sources & Readers – produce PCollections of timestamped elements and a watermark. Quick overview of the Beam model
  • 7. Clean abstractions hide details PCollection – a parallel collection of timestamped elements that are in windows.
 Sources & Readers – produce PCollections of timestamped elements and a watermark. ParDo – flatmap over elements of a PCollection. Quick overview of the Beam model
  • 8. Clean abstractions hide details PCollection – a parallel collection of timestamped elements that are in windows.
 Sources & Readers – produce PCollections of timestamped elements and a watermark. ParDo – flatmap over elements of a PCollection. (Co)GroupByKey – shuffle & group {{K: V}} → {K: [V]}. Quick overview of the Beam model
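The (Co)GroupByKey shuffle described above, {{K: V}} → {K: [V]}, can be sketched in plain Java. This is a sketch of the semantics only, not the Beam API; the class and method names here are illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of GroupByKey semantics: a multiset of {K: V} pairs
// is shuffled into one {K: [V]} entry per distinct key.
public class GroupByKeySketch {
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> kv : pairs) {
            // First occurrence of a key creates its value list; later ones append.
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> clicks = List.of(
            Map.entry("dhalperi", 1), Map.entry("dhalperi", 2), Map.entry("frances", 3));
        System.out.println(groupByKey(clicks).get("dhalperi")); // [1, 2]
    }
}
```

In Beam itself the grouping is additionally scoped by window, which the slides introduce next.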
  • 9. Clean abstractions hide details PCollection – a parallel collection of timestamped elements that are in windows.
 Sources & Readers – produce PCollections of timestamped elements and a watermark. ParDo – flatmap over elements of a PCollection. (Co)GroupByKey – shuffle & group {{K: V}} → {K: [V]}. Side inputs – global view of a PCollection used for broadcast / joins.
 Quick overview of the Beam model
  • 10. Clean abstractions hide details PCollection – a parallel collection of timestamped elements that are in windows.
 Sources & Readers – produce PCollections of timestamped elements and a watermark. ParDo – flatmap over elements of a PCollection. (Co)GroupByKey – shuffle & group {{K: V}} → {K: [V]}. Side inputs – global view of a PCollection used for broadcast / joins.
 Window – reassign elements to zero or more windows; may be data-dependent. Quick overview of the Beam model
  • 11. Clean abstractions hide details PCollection – a parallel collection of timestamped elements that are in windows.
 Sources & Readers – produce PCollections of timestamped elements and a watermark. ParDo – flatmap over elements of a PCollection. (Co)GroupByKey – shuffle & group {{K: V}} → {K: [V]}. Side inputs – global view of a PCollection used for broadcast / joins.
 Window – reassign elements to zero or more windows; may be data-dependent. Triggers – user flow control based on window, watermark, element count, lateness. Quick overview of the Beam model
  • 12. 1. Classic Batch 2. Batch with Fixed Windows 3. Streaming 4. Streaming with Speculative + Late Data 5. Sessions
  • 13. Data: JSON-encoded analytics stream from site •{“user”:“dhalperi”, “page”:“oreilly.com/feed/7”, “tstamp”:”2016-08-31T15:07Z”, …} Desired output: Per-user session length and activity level •dhalperi, 17 pageviews, 2016-08-31 15:00-15:35 Other pipeline goals: •Live data – can track ongoing sessions with speculative output
 dhalperi, 10 pageviews, 2016-08-31 15:00-15:15 (EARLY) •Archival data – much faster, still correct output respecting event time Simple clickstream analysis pipeline
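The 3-minute session gap used in this pipeline can be sketched in plain Java. This is a hypothetical simplification of what Beam's Sessions.withGapDuration computes, assuming sorted in-order timestamps and ignoring triggers and late data.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of session windowing: sorted event timestamps (in minutes) are
// merged into one session while consecutive events are less than gapMinutes
// apart; a larger gap closes the session and starts a new one.
public class SessionSketch {
    static List<long[]> sessions(List<Long> sortedMinutes, long gapMinutes) {
        List<long[]> out = new ArrayList<>();
        long start = -1, end = -1;
        for (long t : sortedMinutes) {
            if (start < 0) { start = t; end = t; }             // first event
            else if (t - end < gapMinutes) { end = t; }        // extends current session
            else { out.add(new long[]{start, end}); start = t; end = t; } // gap: new session
        }
        if (start >= 0) out.add(new long[]{start, end});
        return out;
    }

    public static void main(String[] args) {
        // Events at 3:04, 3:06, 3:08, 3:10, then 3:15 (as minutes since 0:00),
        // with a 3-minute gap: two sessions, like the diagrams that follow.
        List<long[]> s = sessions(List.of(184L, 186L, 188L, 190L, 195L), 3);
        System.out.println(s.size()); // 2
    }
}
```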
  • 14. Simple clickstream analysis pipeline PCollection<KV<User, Click>> clickstream = pipeline.apply(IO.Read(…)) .apply(MapElements.of(new ParseClicksAndAssignUser())); PCollection<KV<User, Long>> userSessions = clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3))) .triggering( AtWatermark()
 .withEarlyFirings(AtPeriod(Minutes(1))))) .apply(Count.perKey()); userSessions.apply(MapElements.of(new FormatSessionsForOutput())) .apply(IO.Write(…)); pipeline.run();
  • 15. Unified unbounded & bounded PCollections pipeline.apply(IO.Read(…)).apply(MapElements.of(new ParseClicksAndAssignUser())); Apache Kafka, ActiveMQ, tailing filesystem, … •A live, roughly in-order stream of messages, unbounded PCollections. •KafkaIO.read().fromTopic(“pageviews”)
 HDFS, Google Cloud Storage, yesterday’s Kafka log, … • Archival data, often readable in any order, bounded PCollections. • TextIO.read().from(“gs://oreilly/pageviews/*”)
  • 16. Windowing and triggers PCollection<KV<User, Long>> userSessions = clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3))) .triggering( AtWatermark()
 .withEarlyFirings(AtPeriod(Minutes(1))))) Event time 3:00 3:05 3:10 3:15 3:20 3:25
  • 17. Windowing and triggers PCollection<KV<User, Long>> userSessions = clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3))) .triggering( AtWatermark()
 .withEarlyFirings(AtPeriod(Minutes(1))))) Event time 3:00 3:05 3:10 3:15 3:20 3:25 One session, 3:04-3:25
  • 18. Windowing and triggers PCollection<KV<User, Long>> userSessions = clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3))) .triggering( AtWatermark()
 .withEarlyFirings(AtPeriod(Minutes(1))))) Processing time Event time 3:00 3:05 3:10 3:15 3:20 3:25
  • 19. Windowing and triggers PCollection<KV<User, Long>> userSessions = clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3))) .triggering( AtWatermark()
 .withEarlyFirings(AtPeriod(Minutes(1))))) Processing time Event time 3:00 3:05 3:10 3:15 3:20 3:25 1 session, 3:04–3:10 (EARLY)
  • 20. Windowing and triggers PCollection<KV<User, Long>> userSessions = clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3))) .triggering( AtWatermark()
 .withEarlyFirings(AtPeriod(Minutes(1))))) Processing time Event time 3:00 3:05 3:10 3:15 3:20 3:25 1 session, 3:04–3:10 (EARLY) 2 sessions, 3:04–3:10 & 3:15–3:20 (EARLY)
  • 21. Windowing and triggers PCollection<KV<User, Long>> userSessions = clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3))) .triggering( AtWatermark()
 .withEarlyFirings(AtPeriod(Minutes(1))))) Processing time Event time 3:00 3:05 3:10 3:15 3:20 3:25 1 session, 3:04–3:25 1 session, 3:04–3:10 (EARLY) 2 sessions, 3:04–3:10 & 3:15–3:20 (EARLY)
  • 22. Windowing and triggers PCollection<KV<User, Long>> userSessions = clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3))) .triggering( AtWatermark()
 .withEarlyFirings(AtPeriod(Minutes(1))))) Processing time Event time 3:00 3:05 3:10 3:15 3:20 3:25 1 session, 3:04–3:25 1 session, 3:04–3:10 (EARLY) 2 sessions, 3:04–3:10 & 3:15–3:20 (EARLY) Watermark
  • 23. Writing output, and bundling userSessions.apply(MapElements.of(new FormatSessionsForOutput())) .apply(IO.Write(…)); Writing is the dual of reading — format and then output. Not its own primitive, but implemented based on sink semantics using ParDo + graph structure.
  • 24. Streaming job consuming Kafka stream •Uses 10 workers. •Pipeline lag of a few seconds. •With 2 million users over 1 day. •A total of ~4.7M early + final sessions. •240 worker-hours Two example runs of this pipeline
  • 25. Streaming job consuming Kafka stream •Uses 10 workers. •Pipeline lag of a few seconds. •With 2 million users over 1 day. •A total of ~4.7M early + final sessions. •240 worker-hours Two example runs of this pipeline Batch job consuming HDFS archive •Uses 200 workers. •Runs for 30 minutes. •Same input. •A total of ~2.1M final sessions. •100 worker-hours
  • 26. Streaming job consuming Kafka stream •Uses 10 workers. •Pipeline lag of a few seconds. •With 2 million users over 1 day. •A total of ~4.7M early + final sessions. •240 worker-hours Two example runs of this pipeline Batch job consuming HDFS archive •Uses 200 workers. •Runs for 30 minutes. •Same input. •A total of ~2.1M final sessions. •100 worker-hours What does the user have to change to get these results?
  • 27. Streaming job consuming Kafka stream •Uses 10 workers. •Pipeline lag of a few seconds. •With 2 million users over 1 day. •A total of ~4.7M early + final sessions. •240 worker-hours Two example runs of this pipeline Batch job consuming HDFS archive •Uses 200 workers. •Runs for 30 minutes. •Same input. •A total of ~2.1M final sessions. •100 worker-hours What does the user have to change to get these results? A: O(10 lines of code) + Command-line Arguments
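The worker-hour figures on the slides above follow from simple arithmetic, assuming the streaming job ran for the full day of input:

```java
// Checking the worker-hour totals quoted on the slides.
public class WorkerHours {
    static double workerHours(int workers, double hours) {
        return workers * hours;
    }

    public static void main(String[] args) {
        // Streaming: 10 workers consuming the stream for 1 day (24 h).
        System.out.println(workerHours(10, 24));   // 240.0
        // Batch: 200 workers running for 30 minutes.
        System.out.println(workerHours(200, 0.5)); // 100.0
    }
}
```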
  • 28. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  • 29. Efficiency at runner’s discretion “Read from this source, splitting it 1000 ways” •user decides “Read from this source” Beam abstractions empower runners
  • 30. Efficiency at runner’s discretion “Read from this source, splitting it 1000 ways” •user decides “Read from this source” APIs: • long getEstimatedSize() • List<Source> splitIntoBundles(size) Beam abstractions empower runners
  • 31. Efficiency at runner’s discretion “Read from this source, splitting it 1000 ways” •user decides “Read from this source” APIs: • long getEstimatedSize() • List<Source> splitIntoBundles(size) Beam abstractions empower runners gs://logs/*
  • 32. Efficiency at runner’s discretion “Read from this source, splitting it 1000 ways” •user decides “Read from this source” APIs: • long getEstimatedSize() • List<Source> splitIntoBundles(size) Beam abstractions empower runners gs://logs/* Size?
  • 33. Efficiency at runner’s discretion “Read from this source, splitting it 1000 ways” •user decides “Read from this source” APIs: • long getEstimatedSize() • List<Source> splitIntoBundles(size) Beam abstractions empower runners gs://logs/* 50 TiB
 • 34. Efficiency at runner’s discretion “Read from this source, splitting it 1000 ways” •user decides “Read from this source” APIs: • long getEstimatedSize() • List<Source> splitIntoBundles(size) Beam abstractions empower runners gs://logs/* Cluster utilization? Quota? Reservations? Bandwidth? Throughput? Bottleneck? 50 TiB
 • 35. Efficiency at runner’s discretion “Read from this source, splitting it 1000 ways” •user decides “Read from this source” APIs: • long getEstimatedSize() • List<Source> splitIntoBundles(size) Beam abstractions empower runners gs://logs/* Cluster utilization? Quota? Reservations? Bandwidth? Throughput? Bottleneck? Split 1 TiB
 • 36. Efficiency at runner’s discretion “Read from this source, splitting it 1000 ways” •user decides “Read from this source” APIs: • long getEstimatedSize() • List<Source> splitIntoBundles(size) Beam abstractions empower runners gs://logs/* Cluster utilization? Quota? Reservations? Bandwidth? Throughput? Bottleneck? filenames
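The splitting contract on these slides can be sketched in a few lines of plain Java. GlobSource here is a hypothetical class, not Beam's real BoundedSource API: the runner calls getEstimatedSize(), picks a desired bundle size from run-time signals (utilization, quota, bandwidth), and then calls splitIntoBundles(size).

```java
import java.util.ArrayList;
import java.util.List;

public class GlobSource {
  private final long[] fileSizes;  // sizes (bytes) of the files matching gs://logs/*

  public GlobSource(long[] fileSizes) { this.fileSizes = fileSizes; }

  // Runner asks: roughly how much data is behind this source?
  public long getEstimatedSize() {
    long total = 0;
    for (long s : fileSizes) total += s;
    return total;
  }

  // Runner asks: split into bundles of roughly desiredBundleSize bytes.
  // Greedy first-fit over whole files; each bundle is a list of file sizes.
  public List<List<Long>> splitIntoBundles(long desiredBundleSize) {
    List<List<Long>> bundles = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentSize = 0;
    for (long s : fileSizes) {
      if (!current.isEmpty() && currentSize + s > desiredBundleSize) {
        bundles.add(current);          // current bundle is full: seal it
        current = new ArrayList<>();
        currentSize = 0;
      }
      current.add(s);
      currentSize += s;
    }
    if (!current.isEmpty()) bundles.add(current);
    return bundles;
  }

  public static void main(String[] args) {
    GlobSource src = new GlobSource(new long[] {10, 10, 10, 10});
    System.out.println(src.getEstimatedSize());          // 40
    System.out.println(src.splitIntoBundles(20).size()); // 2
  }
}
```

The point is that "1000 ways" never appears in user code: a runner seeing 50 TiB and healthy throughput can ask for ~1 TiB bundles, while a constrained cluster can ask for larger ones.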
  • 37. A bundle is a group of elements processed and committed together. Bundling and runner efficiency
  • 38. A bundle is a group of elements processed and committed together. Bundling and runner efficiency APIs (ParDo/DoFn): •startBundle() •processElement() n times •finishBundle()
  • 39. A bundle is a group of elements processed and committed together. Bundling and runner efficiency APIs (ParDo/DoFn): •startBundle() •processElement() n times •finishBundle() Transaction
  • 40. A bundle is a group of elements processed and committed together. Streaming runner: small bundles, low-latency pipelining across stages,
 overhead of frequent commits. Bundling and runner efficiency APIs (ParDo/DoFn): •startBundle() •processElement() n times •finishBundle() Transaction
  • 41. A bundle is a group of elements processed and committed together. Streaming runner: small bundles, low-latency pipelining across stages,
 overhead of frequent commits. Classic batch runner: large bundles, fewer large commits, more efficient,
 long synchronous stages. Bundling and runner efficiency APIs (ParDo/DoFn): •startBundle() •processElement() n times •finishBundle() Transaction
  • 42. A bundle is a group of elements processed and committed together. Streaming runner: small bundles, low-latency pipelining across stages,
 overhead of frequent commits. Classic batch runner: large bundles, fewer large commits, more efficient,
 long synchronous stages. Other runner strategies strike a different balance. Bundling and runner efficiency APIs (ParDo/DoFn): •startBundle() •processElement() n times •finishBundle() Transaction
  • 43. Beam triggers are flow control, not instructions. •“it is okay to produce data” not “produce data now”. •Runners decide when to produce data, and can make local choices for efficiency. Streaming clickstream analysis: runner may optimize for latency and freshness. • Small bundles and frequent triggering → more files and more (speculative) records. Batch clickstream analysis: runner may optimize for throughput and efficiency. • Large bundles and no early triggering → fewer large files and no early records. Runner-controlled triggering
 • 44. Pipeline workload varies [workload-over-time charts] A streaming pipeline’s input varies; batch pipelines go through stages.
 • 45. Perils of fixed decisions [workload-over-time charts] Over-provisioned / worst case vs. under-provisioned / average case.
  • 47. The Straggler Problem Worker Time Work is unevenly distributed across tasks. Reasons: •Underlying data. •Processing. •Runtime effects. Effects are cumulative per stage.
  • 48. Standard workarounds for stragglers Split files into equal sizes? Preemptively over-split? Detect slow workers and re-execute? Sample extensively and then split? Worker Time
  • 49. Standard workarounds for stragglers Split files into equal sizes? Preemptively over-split? Detect slow workers and re-execute? Sample extensively and then split? All of these have major costs, none is a complete solution. Worker Time
  • 50. No amount of upfront heuristic tuning (be it manual or automatic) is enough to guarantee good performance: the system will always hit unpredictable situations at run-time.
 A system that's able to dynamically adapt and get out of a bad situation is much more powerful than one that heuristically hopes to avoid getting into it.
 • 51. Readers provide simple progress signals that enable runners to take action based on execution-time characteristics. APIs for how much work is pending: •Bounded: double getFractionConsumed() •Unbounded: long getBacklogBytes() Work-stealing: •Bounded: Source splitAtFraction(double)
int getParallelismRemaining() Beam readers enable dynamic adaptation
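A toy plain-Java reader (hypothetical RangeReader, not Beam's actual reader classes) illustrates both signals: getFractionConsumed() reports progress, and splitAtFraction() lets the runner hand the unread remainder of a range to an idle worker at run-time.

```java
public class RangeReader {
  private final long start;
  private long end;       // may shrink when the runner splits off the remainder
  private long current;   // next record offset to read

  public RangeReader(long start, long end) {
    this.start = start;
    this.end = end;
    this.current = start;
  }

  // Reads one record; returns false when the (possibly shrunken) range is done.
  public boolean advance() {
    if (current >= end) return false;
    current++;
    return true;
  }

  // Progress signal: fraction of the current range [start, end) consumed.
  public double getFractionConsumed() {
    return (double) (current - start) / (end - start);
  }

  // Work stealing: carve off [splitPos, end) as a new reader for another
  // worker; returns null if reading has already passed the requested point.
  public RangeReader splitAtFraction(double fraction) {
    long splitPos = start + (long) (fraction * (end - start));
    if (splitPos <= current || splitPos >= end) return null;
    RangeReader residual = new RangeReader(splitPos, end);
    this.end = splitPos;   // primary keeps [start, splitPos)
    return residual;
  }

  public static void main(String[] args) {
    RangeReader r = new RangeReader(0, 100);
    for (int i = 0; i < 30; i++) r.advance();
    System.out.println(r.getFractionConsumed());  // 0.3
    RangeReader residual = r.splitAtFraction(0.5);
    System.out.println(r.getFractionConsumed());  // 0.6: range shrank to [0, 50)
  }
}
```

This is the mechanism behind dynamic work rebalancing on the next slides: a straggler's remaining range is repeatedly split and redistributed while it runs.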
  • 52. Dynamic work rebalancing Now Done work Active work Predicted completion Tasks Time Average
  • 53. Dynamic work rebalancing Now Done work Active work Predicted completion Tasks Time
  • 54. Dynamic work rebalancing Now Done work Active work Predicted completion Tasks Now
  • 55. Dynamic work rebalancing Now Done work Active work Predicted completion Tasks Now
  • 56. Beam pipeline on the Google Cloud Dataflow runner Dynamic Work Rebalancing: a real example 2-stage pipeline,
 split “evenly” but uneven in practice
  • 57. Beam pipeline on the Google Cloud Dataflow runner Dynamic Work Rebalancing: a real example 2-stage pipeline,
 split “evenly” but uneven in practice Same pipeline
 dynamic work rebalancing enabled
  • 58. Beam pipeline on the Google Cloud Dataflow runner Dynamic Work Rebalancing: a real example 2-stage pipeline,
 split “evenly” but uneven in practice Same pipeline
 dynamic work rebalancing enabled Savings
  • 59. Beam pipeline on the Google Cloud Dataflow runner Dynamic Work Rebalancing + Autoscaling
  • 60. Beam pipeline on the Google Cloud Dataflow runner Dynamic Work Rebalancing + Autoscaling Initially allocate ~80 workers
 based on size
  • 61. Beam pipeline on the Google Cloud Dataflow runner Dynamic Work Rebalancing + Autoscaling Initially allocate ~80 workers
 based on size Multiple rounds of upsizing enabled by dynamic splitting
  • 62. Beam pipeline on the Google Cloud Dataflow runner Dynamic Work Rebalancing + Autoscaling Initially allocate ~80 workers
 based on size Multiple rounds of upsizing enabled by dynamic splitting Upscale to 1000 workers
 * tasks stay well-balanced * without oversplitting initially
  • 63. Beam pipeline on the Google Cloud Dataflow runner Dynamic Work Rebalancing + Autoscaling Initially allocate ~80 workers
 based on size Multiple rounds of upsizing enabled by dynamic splitting Long-running tasks aborted without causing stragglers Upscale to 1000 workers
 * tasks stay well-balanced * without oversplitting initially
  • 64. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
 • 64. Write a pipeline once, run it anywhere
PCollection<KV<User, Click>> clickstream = pipeline.apply(IO.Read(…))
    .apply(MapElements.of(new ParseClicksAndAssignUser()));
PCollection<KV<User, Long>> userSessions = clickstream
    .apply(Window.into(Sessions.withGapDuration(Minutes(3)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))))
    .apply(Count.perKey());
userSessions.apply(MapElements.of(new FormatSessionsForOutput()))
    .apply(IO.Write(…));
pipeline.run();
Unified model for portable execution
  • 67. Building the community If you analyze data: Try it out! If you have a data storage or messaging system: Write a connector!
 If you have Big Data APIs: Write a Beam SDK, DSL, or library! If you have a distributed processing backend: •Write a Beam Runner!
 (& join Apache Apex, Flink, Spark, Gearpump, and Google Cloud Dataflow) Get involved with Beam
  • 68. Apache Beam (incubating): •https://beam.incubator.apache.org/ •documentation, mailing lists, downloads, etc. Read our blog posts! •Streaming 101 & 102: oreilly.com/ideas/the-world-beyond-batch-streaming-101 •No shard left behind: Dynamic work rebalancing in Google Cloud Dataflow •Apache Beam blog: http://beam.incubator.apache.org/blog/ Follow @ApacheBeam on Twitter Learn more