SlideShare una empresa de Scribd logo
1 de 61
Stream & Batch Processing
with Apache Flink
Robert Metzger
@rmetzger_
rmetzger@apache.org
Munich Apache Flink Meetup,
November 11, 2015
A bit of history
From incubation until now
2
3
Apr 2014 Jun 2015Dec 2014
0.70.60.5 0.90.9-m1 0.10
Nov 2015
Top level
0.8
Community growth
Flink is one of the largest and most active Apache
big data projects with well over 120 contributors
4
Flink meetups around the globe
5
Organizations at Flink Forward
6
Featured in
7
The streaming era
Flink is all about
8
9
batch
event
based
need new systems
well served
10
Streaming is the biggest change in
data infrastructure since Hadoop
What is stream processing
 Real-world data is unbounded and is
pushed to systems
 Right now: people are using the batch
paradigm for stream analysis (there was
no good stream processor available)
 New systems (Flink, Kafka) embrace
streaming nature of data
11
Web server Kafka topic
Stream processing
12
Flink is a stream processor with many faces
Streaming dataflow runtime
Flink's streaming runtime
13
Requirements for a stream processor
 Low latency
• Fast results (milliseconds)
 High throughput
• handle large data amounts (millions of events
per second)
 Exactly-once guarantees
• Correct results, also in failure cases
 Programmability
• Intuitive APIs
14
Pipelining
15
Basic building block to “keep the data moving”
• Low latency
• Operators push
data forward
• Data shipping as
buffers, not tuple-
wise
• Natural handling
of back-pressure
Fault Tolerance in streaming
 at least once: ensure all operators see all
events
• Storm: Replay stream in failure case
 Exactly once: Ensure that operators do
not perform duplicate updates to their
state
• Flink: Distributed Snapshots
• Spark: Micro-batches on batch runtime
16
Flink’s Distributed Snapshots
 Lightweight approach of storing the state
of all operators without pausing the
execution
 high throughput, low latency
 Implemented using barriers flowing
through the topology
17
Kafka
Consumer
offset = 162
Element
Counter
value = 152
Operator
stateData Stream
barrier
Before barrier =
part of the snapshot
After barrier =
Not in snapshot
(backup till next snapshot)
DataStream API
18
case class WordCount(word: String, count: Int)
val text: DataStream[String] = …;
text
.flatMap { line => line.split(" ") }
.map { word => new WordCount(word, 1) }
.keyBy("word")
.window(GlobalWindows.create())
.trigger(new EOFTrigger())
.sum("count")
Batch Word Count in the DataStream API
Batch Word Count
in the DataSet API
19
case class WordCount(word: String, count: Int)
val text: DataStream[String] = …;
text
.flatMap { line => line.split(" ") }
.map { word => new WordCount(word, 1) }
.keyBy("word")
.window(GlobalWindows.create())
.trigger(new EOFTrigger())
.sum("count")
val text: DataSet[String] = …;
text
.flatMap { line => line.split(" ") }
.map { word => new WordCount(word, 1) }
.groupBy("word")
.sum("count")
Best of all worlds for streaming
 Low latency
• Thanks to pipelined engine
 Exactly-once guarantees
• Distributed Snapshots
 High throughput
• Controllable checkpointing overhead
 Programmability
• APIs similar to those known from the batch world
20
Throughput of distributed grep
21
Data
Generator
“grep”
operator
30 machines, 120 cores
0
20,000,000
40,000,000
60,000,000
80,000,000
100,000,000
120,000,000
140,000,000
160,000,000
180,000,000
200,000,000
Flink, no fault
tolerance
Flink, exactly
once (5s)
Storm, no
fault tolerance
Storm, micro-
batches
aggregate throughput
of 175 million
elements per second
aggregate throughput
of 9 million elements
per second
• Flink achieves 20x
higher throughput
• Flink throughput
almost the same
with and without
exactly-once
Aggregate throughput for stream record
grouping
22
0
10,000,000
20,000,000
30,000,000
40,000,000
50,000,000
60,000,000
70,000,000
80,000,000
90,000,000
100,000,000
Flink, no
fault
tolerance
Flink,
exactly
once
Storm, no
fault
tolerance
Storm, at
least once
aggregate throughput
of 83 million elements
per second
8,6 million elements/s
309k elements/s  Flink achieves 260x
higher throughput with
fault tolerance
30 machines,
120 cores Network
transfer
Latency in stream record grouping
23
Data
Generator
Receiver:
Throughput /
Latency measure
• Measure time for a record to
travel from source to sink
0.00
5.00
10.00
15.00
20.00
25.00
30.00
Flink, no
fault
tolerance
Flink, exactly
once
Storm, at
least once
Median latency
25 ms
1 ms
0.00
10.00
20.00
30.00
40.00
50.00
60.00
Flink, no
fault
tolerance
Flink,
exactly
once
Storm, at
least
once
99th percentile
latency
50 ms
24
Exactly-Once with YARN Chaos Monkey
 Validate exactly-once guarantees with
state-machine
25
Performance: Summary
26
Continuous
streaming
Latency-bound
buffering
Distributed
Snapshots
High Throughput &
Low Latency
With configurable throughput/latency tradeoff
“Faces” of Flink
27
Faces of a stream processor
28
Stream
processing
Batch
processing
Machine Learning at scale
Graph Analysis
Streaming dataflow runtime
The Flink Stack
29
Streaming dataflow runtime
Specialized
Abstractions
/ APIs
Core APIs
Flink Core
Runtime
Deployment
The Flink Stack
30
Streaming dataflow runtime
DataSet (Java/Scala) DataStream (Java/Scala)
Experimental
Python API also
available
Data Source
orders.tbl
Filter
Map DataSource
lineitem.tbl
Join
Hybrid Hash
buildHT probe
hash-part [0] hash-part [0]
GroupRed
sort
forward
API independent Dataflow
Graph representation
Batch Optimizer Graph Builder
Batch is a special case of streaming
 Batch: run a bounded stream (data set) on
a stream processor
 Form a global window over the entire data
set for join or grouping operations
31
Batch-specific optimizations
 Managed memory on- and off-heap
• Operators (join, sort, …) with out-of-core
support
• Optimized serialization stack for user-types
 Cost-based Optimizer
• Job execution depends on data size
32
The Flink Stack
33
Streaming dataflow runtime
Specialized
Abstractions
/ APIs
Core APIs
Flink Core
Runtime
Deployment
DataSet (Java/Scala) DataStream
FlinkML: Machine Learning
 API for ML pipelines inspired by scikit-learn
 Collection of packaged algorithms
• SVM, Multiple Linear Regression, Optimization, ALS, ...
34
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...
val scaler = StandardScaler()
val polyFeatures = PolynomialFeatures().setDegree(3)
val mlr = MultipleLinearRegression()
val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)
pipeline.fit(trainingData)
val predictions: DataSet[LabeledVector] = pipeline.predict(testingData)
Gelly: Graph Processing
 Graph API and library
 Packaged algorithms
• PageRank, SSSP, Label Propagation, Community
Detection, Connected Components
35
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
Graph<Long, Long, NullValue> graph = ...
DataSet<Vertex<Long, Long>> verticesWithCommunity = graph.run(
new LabelPropagation<Long>(30)).getVertices();
verticesWithCommunity.print();
env.execute();
Flink Stack += Gelly, ML
36
Gelly
ML
DataSet (Java/Scala) DataStream
Streaming dataflow runtime
Integration with other systems
37
SAMOA
DataSet DataStream
HadoopM/R
GoogleDataflow
Cascading
Storm
Zeppelin
• Use Hadoop Input/Output Formats
• Mapper / Reducer implementations
• Hadoop’s FileSystem implementations
• Run applications implemented against Google’s Data Flow API
on premise with Flink
• Run Cascading jobs on Flink, with almost no code change
• Benefit from Flink’s vastly better performance than
MapReduce
• Interactive, web-based data exploration
• Machine learning on data streams
• Compatibility layer for running Storm code
• FlinkTopologyBuilder: one line replacement for
existing jobs
• Wrappers for Storm Spouts and Bolts
• Coming soon: Exactly-once with Storm
• Build-in serializer support for
Hadoop’s Writable type
Deployment options
Gelly
Table
ML
SAMOA
DataSet(Java/Scala)DataStream
Hadoop
LocalClusterYARNTezEmbedded
Dataflow
Dataflow
MRQL
Table
Cascading
Streamingdataflowruntime
Storm
Zeppelin
• Start Flink in your IDE / on your machine
• Local debugging / development using the
same code as on the cluster
• “bare metal” standalone installation of Flink
on a cluster
• Flink on Hadoop YARN (Hadoop 2.2.0+)
• Restarts failed containers
• Support for Kerberos-secured YARN/HDFS
setups
The full stack
39
Gelly
Table
ML
SAMOA
DataSet (Java/Scala) DataStream
HadoopM/R
Local Cluster Yarn Tez Embedded
Dataflow
Dataflow(WiP)
MRQL
Table
Cascading
Streaming dataflow runtime
Storm(WiP)
Zeppelin
Flink Use-cases and
production users
40
Bouygues Telecom
41
Bouygues Telecom
42
Bouygues Telecom
43
Capital One
44
45
Closing
46
What is currently happening?
 Features for upcoming 0.10 release:
• Master High Availability
• Vastly improved monitoring GUI
• Watermarks / Event time processing /
Windowing rework
 The next talk by Matthias
• Stable streaming API (no more “beta” flag)
 Next release: 1.0
• API stability
47
How do I get started?
48
Mailing Lists: (news | user | dev)@flink.apache.org
Twitter: @ApacheFlink
Blogs: flink.apache.org/blog, data-artisans.com/blog/
IRC channel: irc.freenode.net#flink
Start Flink on YARN in 4 commands:
# get the hadoop2 package from the Flink download page at
# http://flink.apache.org/downloads.html
wget <download url>
tar xvzf flink-0.9.1-bin-hadoop2.tgz
cd flink-0.9.1/
./bin/flink run -m yarn-cluster -yn 4 ./examples/flink-java-
examples-0.9.1-WordCount.jar
flink.apache.org 49
• Check out the slides: http://flink-
forward.org/?post_type=session
• Video recordings on YouTube, “Flink Forward”
channel
Appendix
50
51
52
53
54
Managed (off-heap) memory and out-of-
core support
55Memory runs out
Cost-based Optimizer
56
DataSource
orders.tbl
Filter
Map DataSource
lineitem.tbl
Join
Hybrid Hash
buildHT probe
broadcast forward
Combine
GroupRed
sort
DataSource
orders.tbl
Filter
Map DataSource
lineitem.tbl
Join
Hybrid Hash
buildHT probe
hash-part [0] hash-part [0]
hash-part [0,1]
GroupRed
sort
forward
Best plan
depends on
relative sizes
of input files
57
case class Path (from: Long, to:
Long)
val tc = edges.iterate(10) {
paths: DataSet[Path] =>
val next = paths
.join(edges)
.where("to")
.equalTo("from") {
(path, edge) =>
Path(path.from, edge.to)
}
.union(paths)
.distinct()
next
}
Optimizer
Type extraction
stack
Task
scheduling
Dataflow
metadata
Pre-flight (Client)
JobManager
TaskManagers
Data
Source
orders.tbl
Filter
Map
DataSourc
e
lineitem.tbl
Join
Hybrid Hash
build
HT
probe
hash-part [0] hash-part [0]
GroupRed
sort
forward
Program
Dataflow
Graph
deploy
operators
track
intermediate
results
LocalCluster:YARN,Standalone
Iterative processing in Flink
Flink offers built-in iterations and delta
iterations to execute ML and graph
algorithms efficiently
58
map
join sum
ID1
ID2
ID3
Example: Matrix Factorization
59
Factorizing a matrix with
28 billion ratings for
recommendations
More at: http://data-artisans.com/computing-recommendations-with-flink.html
60
Batch
aggregation
ExecutionGraph
JobManager
TaskManager 1
TaskManager 2
M1
M2
RP1RP2
R1
R2
1
2 3a
3b
4a
4b
5a
5b
"Blocked" result partition
61
Streaming
window
aggregation
ExecutionGraph
JobManager
TaskManager 1
TaskManager 2
M1
M2
RP1RP2
R1
R2
1
2 3a
3b
4a
4b
5a
5b
"Pipelined" result partition

Más contenido relacionado

La actualidad más candente

Data Stream Processing with Apache Flink
Data Stream Processing with Apache FlinkData Stream Processing with Apache Flink
Data Stream Processing with Apache FlinkFabian Hueske
 
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Stephan Ewen
 
From Apache Flink® 1.3 to 1.4
From Apache Flink® 1.3 to 1.4From Apache Flink® 1.3 to 1.4
From Apache Flink® 1.3 to 1.4Till Rohrmann
 
Baymeetup-FlinkResearch
Baymeetup-FlinkResearchBaymeetup-FlinkResearch
Baymeetup-FlinkResearchFoo Sounds
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream ProcessingGyula Fóra
 
Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016Stephan Ewen
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon PresentationGyula Fóra
 
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)Robert Metzger
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkRobert Metzger
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Gyula Fóra
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Robert Metzger
 
Flink Streaming @BudapestData
Flink Streaming @BudapestDataFlink Streaming @BudapestData
Flink Streaming @BudapestDataGyula Fóra
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and visionStephan Ewen
 
Community Update May 2016 (January - May) | Berlin Apache Flink Meetup
Community Update May 2016 (January - May) | Berlin Apache Flink MeetupCommunity Update May 2016 (January - May) | Berlin Apache Flink Meetup
Community Update May 2016 (January - May) | Berlin Apache Flink MeetupRobert Metzger
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache FlinkAKASH SIHAG
 
Data Analysis With Apache Flink
Data Analysis With Apache FlinkData Analysis With Apache Flink
Data Analysis With Apache FlinkDataWorks Summit
 
Timeline Service v.2 (Hadoop Summit 2016)
Timeline Service v.2 (Hadoop Summit 2016)Timeline Service v.2 (Hadoop Summit 2016)
Timeline Service v.2 (Hadoop Summit 2016)Sangjin Lee
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemGyula Fóra
 
Apache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya MeetupApache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya MeetupRobert Metzger
 
Apache Flink: Streaming Done Right @ FOSDEM 2016
Apache Flink: Streaming Done Right @ FOSDEM 2016Apache Flink: Streaming Done Right @ FOSDEM 2016
Apache Flink: Streaming Done Right @ FOSDEM 2016Till Rohrmann
 

La actualidad más candente (20)

Data Stream Processing with Apache Flink
Data Stream Processing with Apache FlinkData Stream Processing with Apache Flink
Data Stream Processing with Apache Flink
 
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
 
From Apache Flink® 1.3 to 1.4
From Apache Flink® 1.3 to 1.4From Apache Flink® 1.3 to 1.4
From Apache Flink® 1.3 to 1.4
 
Baymeetup-FlinkResearch
Baymeetup-FlinkResearchBaymeetup-FlinkResearch
Baymeetup-FlinkResearch
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream Processing
 
Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon Presentation
 
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache Flink
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
 
Flink Streaming @BudapestData
Flink Streaming @BudapestDataFlink Streaming @BudapestData
Flink Streaming @BudapestData
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and vision
 
Community Update May 2016 (January - May) | Berlin Apache Flink Meetup
Community Update May 2016 (January - May) | Berlin Apache Flink MeetupCommunity Update May 2016 (January - May) | Berlin Apache Flink Meetup
Community Update May 2016 (January - May) | Berlin Apache Flink Meetup
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache Flink
 
Data Analysis With Apache Flink
Data Analysis With Apache FlinkData Analysis With Apache Flink
Data Analysis With Apache Flink
 
Timeline Service v.2 (Hadoop Summit 2016)
Timeline Service v.2 (Hadoop Summit 2016)Timeline Service v.2 (Hadoop Summit 2016)
Timeline Service v.2 (Hadoop Summit 2016)
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Apache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya MeetupApache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya Meetup
 
Apache Flink: Streaming Done Right @ FOSDEM 2016
Apache Flink: Streaming Done Right @ FOSDEM 2016Apache Flink: Streaming Done Right @ FOSDEM 2016
Apache Flink: Streaming Done Right @ FOSDEM 2016
 

Similar a Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Integrations and Use Cases

Apache Flink: Past, Present and Future
Apache Flink: Past, Present and FutureApache Flink: Past, Present and Future
Apache Flink: Past, Present and FutureGyula Fóra
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System OverviewFlink Forward
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteFlink Forward
 
Flink Streaming Berlin Meetup
Flink Streaming Berlin MeetupFlink Streaming Berlin Meetup
Flink Streaming Berlin MeetupMárton Balassi
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureRobert Metzger
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Apex
 
Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonStephan Ewen
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015Robert Metzger
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksSlim Baltagi
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Real-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop SummitReal-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop SummitGyula Fóra
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterPaolo Castagna
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseKostas Tzoumas
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkSlim Baltagi
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CARobert Metzger
 

Similar a Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Integrations and Use Cases (20)

Apache Flink: Past, Present and Future
Apache Flink: Past, Present and FutureApache Flink: Past, Present and Future
Apache Flink: Past, Present and Future
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward Keynote
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
Flink Streaming Berlin Meetup
Flink Streaming Berlin MeetupFlink Streaming Berlin Meetup
Flink Streaming Berlin Meetup
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architecture
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
 
Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World London
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Flink internals web
Flink internals web Flink internals web
Flink internals web
 
Real-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop SummitReal-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop Summit
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San Jose
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
 

Más de Robert Metzger

How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)
How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)
How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)Robert Metzger
 
Apache Flink Community Updates November 2016 @ Berlin Meetup
Apache Flink Community Updates November 2016 @ Berlin MeetupApache Flink Community Updates November 2016 @ Berlin Meetup
Apache Flink Community Updates November 2016 @ Berlin MeetupRobert Metzger
 
Flink September 2015 Community Update
Flink September 2015 Community UpdateFlink September 2015 Community Update
Flink September 2015 Community UpdateRobert Metzger
 
Click-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingClick-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingRobert Metzger
 
August Flink Community Update
August Flink Community UpdateAugust Flink Community Update
August Flink Community UpdateRobert Metzger
 
Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)Robert Metzger
 
Apache Flink First Half of 2015 Community Update
Apache Flink First Half of 2015 Community UpdateApache Flink First Half of 2015 Community Update
Apache Flink First Half of 2015 Community UpdateRobert Metzger
 
Berlin Apache Flink Meetup May 2015, Community Update
Berlin Apache Flink Meetup May 2015, Community UpdateBerlin Apache Flink Meetup May 2015, Community Update
Berlin Apache Flink Meetup May 2015, Community UpdateRobert Metzger
 
Flink Community Update April 2015
Flink Community Update April 2015Flink Community Update April 2015
Flink Community Update April 2015Robert Metzger
 
Apache Flink Community Update March 2015
Apache Flink Community Update March 2015Apache Flink Community Update March 2015
Apache Flink Community Update March 2015Robert Metzger
 
Flink Community Update February 2015
Flink Community Update February 2015Flink Community Update February 2015
Flink Community Update February 2015Robert Metzger
 
Compute "Closeness" in Graphs using Apache Giraph.
Compute "Closeness" in Graphs using Apache Giraph.Compute "Closeness" in Graphs using Apache Giraph.
Compute "Closeness" in Graphs using Apache Giraph.Robert Metzger
 
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Robert Metzger
 
Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)Robert Metzger
 

Más de Robert Metzger (16)

How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)
How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)
How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)
 
dA Platform Overview
dA Platform OverviewdA Platform Overview
dA Platform Overview
 
Apache Flink Community Updates November 2016 @ Berlin Meetup
Apache Flink Community Updates November 2016 @ Berlin MeetupApache Flink Community Updates November 2016 @ Berlin Meetup
Apache Flink Community Updates November 2016 @ Berlin Meetup
 
Flink September 2015 Community Update
Flink September 2015 Community UpdateFlink September 2015 Community Update
Flink September 2015 Community Update
 
Click-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingClick-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer Checkpointing
 
August Flink Community Update
August Flink Community UpdateAugust Flink Community Update
August Flink Community Update
 
Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)
 
Apache Flink First Half of 2015 Community Update
Apache Flink First Half of 2015 Community UpdateApache Flink First Half of 2015 Community Update
Apache Flink First Half of 2015 Community Update
 
Apache Flink Hands On
Apache Flink Hands OnApache Flink Hands On
Apache Flink Hands On
 
Berlin Apache Flink Meetup May 2015, Community Update
Berlin Apache Flink Meetup May 2015, Community UpdateBerlin Apache Flink Meetup May 2015, Community Update
Berlin Apache Flink Meetup May 2015, Community Update
 
Flink Community Update April 2015
Flink Community Update April 2015Flink Community Update April 2015
Flink Community Update April 2015
 
Apache Flink Community Update March 2015
Apache Flink Community Update March 2015Apache Flink Community Update March 2015
Apache Flink Community Update March 2015
 
Flink Community Update February 2015
Flink Community Update February 2015Flink Community Update February 2015
Flink Community Update February 2015
 
Compute "Closeness" in Graphs using Apache Giraph.
Compute "Closeness" in Graphs using Apache Giraph.Compute "Closeness" in Graphs using Apache Giraph.
Compute "Closeness" in Graphs using Apache Giraph.
 
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
 
Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)
 

Último

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 

Último (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Integrations and Use Cases

  • 1. Stream & Batch Processing with Apache Flink Robert Metzger @rmetzger_ rmetzger@apache.org Munich Apache Flink Meetup, November 11, 2015
  • 2. A bit of history From incubation until now 2
  • 3. 3 Apr 2014 Jun 2015Dec 2014 0.70.60.5 0.90.9-m1 0.10 Nov 2015 Top level 0.8
  • 4. Community growth Flink is one of the largest and most active Apache big data projects with well over 120 contributors 4
  • 5. Flink meetups around the globe 5
  • 8. The streaming era Flink is all about 8
  • 10. 10 Streaming is the biggest change in data infrastructure since Hadoop
  • 11. What is stream processing  Real-world data is unbounded and is pushed to systems  Right now: people are using the batch paradigm for stream analysis (there was no good stream processor available)  New systems (Flink, Kafka) embrace streaming nature of data 11 Web server Kafka topic Stream processing
  • 12. 12 Flink is a stream processor with many faces Streaming dataflow runtime
  • 14. Requirements for a stream processor  Low latency • Fast results (milliseconds)  High throughput • handle large data amounts (millions of events per second)  Exactly-once guarantees • Correct results, also in failure cases  Programmability • Intuitive APIs 14
  • 15. Pipelining 15 Basic building block to “keep the data moving” • Low latency • Operators push data forward • Data shipping as buffers, not tuple- wise • Natural handling of back-pressure
  • 16. Fault Tolerance in streaming  at least once: ensure all operators see all events • Storm: Replay stream in failure case  Exactly once: Ensure that operators do not perform duplicate updates to their state • Flink: Distributed Snapshots • Spark: Micro-batches on batch runtime 16
  • 17. Flink’s Distributed Snapshots  Lightweight approach of storing the state of all operators without pausing the execution  high throughput, low latency  Implemented using barriers flowing through the topology 17 Kafka Consumer offset = 162 Element Counter value = 152 Operator stateData Stream barrier Before barrier = part of the snapshot After barrier = Not in snapshot (backup till next snapshot)
  • 18. DataStream API 18 case class WordCount(word: String, count: Int) val text: DataStream[String] = …; text .flatMap { line => line.split(" ") } .map { word => new WordCount(word, 1) } .keyBy("word") .window(GlobalWindows.create()) .trigger(new EOFTrigger()) .sum("count") Batch Word Count in the DataStream API
  • 19. Batch Word Count in the DataSet API 19 case class WordCount(word: String, count: Int) val text: DataStream[String] = …; text .flatMap { line => line.split(" ") } .map { word => new WordCount(word, 1) } .keyBy("word") .window(GlobalWindows.create()) .trigger(new EOFTrigger()) .sum("count") val text: DataSet[String] = …; text .flatMap { line => line.split(" ") } .map { word => new WordCount(word, 1) } .groupBy("word") .sum("count")
  • 20. Best of all worlds for streaming  Low latency • Thanks to pipelined engine  Exactly-once guarantees • Distributed Snapshots  High throughput • Controllable checkpointing overhead  Programmability • APIs similar to those known from the batch world 20
  • 21. Throughput of distributed grep 21 Data Generator “grep” operator 30 machines, 120 cores 0 20,000,000 40,000,000 60,000,000 80,000,000 100,000,000 120,000,000 140,000,000 160,000,000 180,000,000 200,000,000 Flink, no fault tolerance Flink, exactly once (5s) Storm, no fault tolerance Storm, micro- batches aggregate throughput of 175 million elements per second aggregate throughput of 9 million elements per second • Flink achieves 20x higher throughput • Flink throughput almost the same with and without exactly-once
  • 22. Aggregate throughput for stream record grouping 22 0 10,000,000 20,000,000 30,000,000 40,000,000 50,000,000 60,000,000 70,000,000 80,000,000 90,000,000 100,000,000 Flink, no fault tolerance Flink, exactly once Storm, no fault tolerance Storm, at least once aggregate throughput of 83 million elements per second 8,6 million elements/s 309k elements/s  Flink achieves 260x higher throughput with fault tolerance 30 machines, 120 cores Network transfer
  • 23. Latency in stream record grouping 23 Data Generator Receiver: Throughput / Latency measure • Measure time for a record to travel from source to sink 0.00 5.00 10.00 15.00 20.00 25.00 30.00 Flink, no fault tolerance Flink, exactly once Storm, at least once Median latency 25 ms 1 ms 0.00 10.00 20.00 30.00 40.00 50.00 60.00 Flink, no fault tolerance Flink, exactly once Storm, at least once 99th percentile latency 50 ms
  • 24. 24
  • 25. Exactly-Once with YARN Chaos Monkey  Validate exactly-once guarantees with state-machine 25
  • 28. Faces of a stream processor 28 Stream processing Batch processing Machine Learning at scale Graph Analysis Streaming dataflow runtime
  • 29. The Flink Stack 29 Streaming dataflow runtime Specialized Abstractions / APIs Core APIs Flink Core Runtime Deployment
  • 30. The Flink Stack 30 Streaming dataflow runtime DataSet (Java/Scala) DataStream (Java/Scala) Experimental Python API also available Data Source orders.tbl Filter Map DataSource lineitem.tbl Join Hybrid Hash buildHT probe hash-part [0] hash-part [0] GroupRed sort forward API independent Dataflow Graph representation Batch Optimizer Graph Builder
  • 31. Batch is a special case of streaming  Batch: run a bounded stream (data set) on a stream processor  Form a global window over the entire data set for join or grouping operations 31
  • 32. Batch-specific optimizations  Managed memory on- and off-heap • Operators (join, sort, …) with out-of-core support • Optimized serialization stack for user-types  Cost-based Optimizer • Job execution depends on data size 32
  • 33. The Flink Stack 33 Streaming dataflow runtime Specialized Abstractions / APIs Core APIs Flink Core Runtime Deployment DataSet (Java/Scala) DataStream
  • 34. FlinkML: Machine Learning  API for ML pipelines inspired by scikit-learn  Collection of packaged algorithms • SVM, Multiple Linear Regression, Optimization, ALS, ... 34 val trainingData: DataSet[LabeledVector] = ... val testingData: DataSet[Vector] = ... val scaler = StandardScaler() val polyFeatures = PolynomialFeatures().setDegree(3) val mlr = MultipleLinearRegression() val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr) pipeline.fit(trainingData) val predictions: DataSet[LabeledVector] = pipeline.predict(testingData)
  • 35. Gelly: Graph Processing  Graph API and library  Packaged algorithms • PageRank, SSSP, Label Propagation, Community Detection, Connected Components 35 ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); Graph<Long, Long, NullValue> graph = ... DataSet<Vertex<Long, Long>> verticesWithCommunity = graph.run( new LabelPropagation<Long>(30)).getVertices(); verticesWithCommunity.print(); env.execute();
  • 36. Flink Stack += Gelly, ML 36 Gelly ML DataSet (Java/Scala) DataStream Streaming dataflow runtime
  • 37. Integration with other systems 37 SAMOA DataSet DataStream HadoopM/R GoogleDataflow Cascading Storm Zeppelin • Use Hadoop Input/Output Formats • Mapper / Reducer implementations • Hadoop’s FileSystem implementations • Run applications implemented against Google’s Data Flow API on premise with Flink • Run Cascading jobs on Flink, with almost no code change • Benefit from Flink’s vastly better performance than MapReduce • Interactive, web-based data exploration • Machine learning on data streams • Compatibility layer for running Storm code • FlinkTopologyBuilder: one line replacement for existing jobs • Wrappers for Storm Spouts and Bolts • Coming soon: Exactly-once with Storm • Build-in serializer support for Hadoop’s Writable type
  • 38. Deployment options Gelly Table ML SAMOA DataSet(Java/Scala)DataStream Hadoop LocalClusterYARNTezEmbedded Dataflow Dataflow MRQL Table Cascading Streamingdataflowruntime Storm Zeppelin • Start Flink in your IDE / on your machine • Local debugging / development using the same code as on the cluster • “bare metal” standalone installation of Flink on a cluster • Flink on Hadoop YARN (Hadoop 2.2.0+) • Restarts failed containers • Support for Kerberos-secured YARN/HDFS setups
  • 39. The full stack 39 Gelly Table ML SAMOA DataSet (Java/Scala) DataStream HadoopM/R Local Cluster Yarn Tez Embedded Dataflow Dataflow(WiP) MRQL Table Cascading Streaming dataflow runtime Storm(WiP) Zeppelin
  • 45. 45
  • 47. What is currently happening?  Features for upcoming 0.10 release: • Master High Availability • Vastly improved monitoring GUI • Watermarks / Event time processing / Windowing rework  The next talk by Matthias • Stable streaming API (no more “beta” flag)  Next release: 1.0 • API stability 47
  • 48. How do I get started? 48 Mailing Lists: (news | user | dev)@flink.apache.org Twitter: @ApacheFlink Blogs: flink.apache.org/blog, data-artisans.com/blog/ IRC channel: irc.freenode.net#flink Start Flink on YARN in 4 commands: # get the hadoop2 package from the Flink download page at # http://flink.apache.org/downloads.html wget <download url> tar xvzf flink-0.9.1-bin-hadoop2.tgz cd flink-0.9.1/ ./bin/flink run -m yarn-cluster -yn 4 ./examples/flink-java- examples-0.9.1-WordCount.jar
  • 49. flink.apache.org 49 • Check out the slides: http://flink- forward.org/?post_type=session • Video recordings on YouTube, “Flink Forward” channel
  • 51. 51
  • 52. 52
  • 53. 53
  • 54. 54
  • 55. Managed (off-heap) memory and out-of- core support 55Memory runs out
  • 56. Cost-based Optimizer 56 DataSource orders.tbl Filter Map DataSource lineitem.tbl Join Hybrid Hash buildHT probe broadcast forward Combine GroupRed sort DataSource orders.tbl Filter Map DataSource lineitem.tbl Join Hybrid Hash buildHT probe hash-part [0] hash-part [0] hash-part [0,1] GroupRed sort forward Best plan depends on relative sizes of input files
  • 57. 57 case class Path (from: Long, to: Long) val tc = edges.iterate(10) { paths: DataSet[Path] => val next = paths .join(edges) .where("to") .equalTo("from") { (path, edge) => Path(path.from, edge.to) } .union(paths) .distinct() next } Optimizer Type extraction stack Task scheduling Dataflow metadata Pre-flight (Client) JobManager TaskManagers Data Source orders.tbl Filter Map DataSourc e lineitem.tbl Join Hybrid Hash build HT probe hash-part [0] hash-part [0] GroupRed sort forward Program Dataflow Graph deploy operators track intermediate results LocalCluster:YARN,Standalone
  • 58. Iterative processing in Flink Flink offers built-in iterations and delta iterations to execute ML and graph algorithms efficiently 58 map join sum ID1 ID2 ID3
  • 59. Example: Matrix Factorization 59 Factorizing a matrix with 28 billion ratings for recommendations More at: http://data-artisans.com/computing-recommendations-with-flink.html

Notas del editor

  1. There are two different data processing problems. The first is when you have collected some data and want to make some sense from it, for example build a report. The second is when you want to react to events that are happening in the world, for example a credit card transaction. And these two are fundamentally different: the first one is akin to polling, while the second one is push-based. There are many tools for the first kind of problem, including Apache Flink. However, while the second is as important and common as the first, for a long time we have not had good tools to deal with it. This is now changing.
  2. GCE 30 instances with 4 cores and 15 GB of memory each. Flink master from July, 24th, Storm 0.9.3. All the code used for the evaluation can be found here. Flink 1.5 million elements per second per core Aggregate Throughput in cluster 182 million elements per second. Storm 82,000 elements per second per core Aggregate 0.57 million elements per second Storm with Acknowledge 4,700 elements per second per core, Latency 30-120 milliseconds Trident: 75,000 elements per second per core
  3. Flink 720,000 events per second per core 690,000 with checkpointing activated Storm With at-least-once: 2,600 events per second per core
  4. Flink 720,000 events per second per core 690,000 with checkpointing activated Storm With at-least-once: 2,600 events per second per core
  5. Flink 0 Buffer timeout: latency median 0 msec, 99 %tile 20 msec 24,500 events per second per core
  6. People previously made the case that high throughput and low latency are mutually exclusive