Apache Flink is an open-source platform for distributed stream and batch data processing. At its core, Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. On top of this core, APIs make it easy to develop distributed data analysis programs. Libraries for graph processing and machine learning provide convenient abstractions for solving large-scale problems. Apache Flink integrates with many other open-source systems, such as Hadoop, databases, and message queues. Its streaming capabilities make it well suited to traditional batch processing as well as state-of-the-art stream processing.
6. Flink Community
One of the top 5 big data projects in the Apache Software Foundation
500+ messages/month on the mailing list
8400+ commits
1500+ pull requests merged
950+ stars
510+ forks
8. Use Case: Log File Analysis
▪ Load log files from a distributed file system
▪ Process them, sessionize according to the user id
▪ Write a view to the database or dump more data for further processing
• Process
• Analyze
• Aggregate
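The sessionize step above can be sketched in plain Python. This is only an illustration, not Flink code: the record layout `(user_id, timestamp, line)`, the function name `sessionize`, and the 30-minute inactivity gap are all assumptions made for the example.

```python
from collections import defaultdict

# Assumption: a session ends after 30 minutes of inactivity (illustrative value).
SESSION_GAP_SECONDS = 30 * 60


def sessionize(records, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp, line) records into per-user sessions.

    Returns {user_id: [session, ...]}; each session is a list of records
    sorted by timestamp.
    """
    by_user = defaultdict(list)
    for record in records:
        by_user[record[0]].append(record)

    sessions = {}
    for user_id, user_records in by_user.items():
        user_records.sort(key=lambda r: r[1])
        user_sessions = [[user_records[0]]]
        for record in user_records[1:]:
            previous = user_sessions[-1][-1]
            if record[1] - previous[1] > gap:
                user_sessions.append([record])  # gap exceeded: start a new session
            else:
                user_sessions[-1].append(record)
        sessions[user_id] = user_sessions
    return sessions


# Hypothetical log records: (user_id, timestamp in seconds, log line).
logs = [
    ("max", 0, "GET /"),
    ("marie", 10, "GET /about"),
    ("max", 60, "GET /cart"),
    ("max", 4000, "GET /"),
]
result = sessionize(logs)
```

Here the third request from `max` arrives more than 30 minutes after the second, so it opens a second session.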
9. Use Case: Tweet Impressions
Continuous stream of Tweets (each with a timestamp)
▪ How do we measure the importance of Tweets?
• Total number of views
• Views within a time period
▪ We need to process and aggregate Tweets!
Max, Marie, Jonas, and Tim are tweeting.
10. Use Case: Tweet Impressions
Impressions: last minute, last hour, last day
Pipeline: Impression Events → Aggregation of Impressions → Output
More at: http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
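As a rough sketch of the aggregation step, the following Python counts impressions per Tweet over the last minute, hour, and day. It recomputes everything from a list of events rather than updating incrementally as a stream processor would; the event layout `(tweet_id, timestamp)` and the names `HORIZONS` and `aggregate_impressions` are assumptions for illustration.

```python
from collections import Counter

# Trailing horizons in seconds: last minute, last hour, last day.
HORIZONS = {"minute": 60, "hour": 3600, "day": 86400}


def aggregate_impressions(events, now):
    """Count impressions per tweet for each trailing horizon.

    events: iterable of (tweet_id, timestamp_seconds) impression events.
    Returns {horizon_name: Counter({tweet_id: views})}.
    """
    totals = {name: Counter() for name in HORIZONS}
    for tweet_id, timestamp in events:
        age = now - timestamp
        for name, length in HORIZONS.items():
            if 0 <= age < length:
                totals[name][tweet_id] += 1
    return totals


# Hypothetical impression events.
events = [("t1", 995), ("t1", 500), ("t2", 990), ("t1", -50000)]
totals = aggregate_impressions(events, now=1000)
```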
12. Why Stream Processing?
▪ Most problems have a streaming nature
▪ Stream processing gives lower latency
▪ Data volumes are more easily tamed
▪ More predictable resource consumption
Figure labels: event stream; batch (solved); event-based
13. Challenges in Streaming
▪ Latency
▪ Throughput
▪ Fault-Tolerance
▪ Correctness
▪ Elements may be out-of-order
▪ Elements may be processed more than once
14. Windows
▪ A grouping of records according to time, count, or session, e.g.
• Count: The last 100 records
• Session: All records for user X
• Time: All records of the last 2 minutes
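The three kinds of window listed above can be sketched over an in-memory list. This is a simplified illustration, not Flink's window API: the record layout (`user`, `ts`) and the function names are assumptions, and the session window follows the slide's reading of "all records for user X".

```python
from collections import defaultdict, deque


def count_window(records, size):
    """Count window: the last `size` records."""
    return list(deque(records, maxlen=size))


def time_window(records, now, length):
    """Time window: records with a timestamp in the last `length` seconds."""
    return [r for r in records if now - length < r["ts"] <= now]


def session_windows(records):
    """Session window: all records grouped per user."""
    windows = defaultdict(list)
    for r in records:
        windows[r["user"]].append(r)
    return dict(windows)


# Hypothetical records with a user and a timestamp in seconds.
records = [
    {"user": "max", "ts": 0},
    {"user": "tim", "ts": 50},
    {"user": "max", "ts": 100},
    {"user": "jonas", "ts": 170},
]
last_two = count_window(records, 2)
last_two_minutes = time_window(records, now=180, length=120)
per_user = session_windows(records)
```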
15. Event Time
▪ Processing time: when data is processed
▪ Ingestion time: when data is loaded
▪ Event time: when data is generated
▪ Almost always, the three are different
▪ Event time makes it possible to process out-of-order elements, or to replay elements, in the order in which they occurred
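A small sketch of why event time helps: if elements are grouped by the time at which they were generated, the result does not depend on the order in which they arrive, so a replay gives the same answer. The element layout `(event_time, payload)` and the one-minute bucket size are assumptions for illustration.

```python
from collections import Counter


def count_per_minute(elements):
    """Count elements per one-minute bucket of their event time.

    elements: iterable of (event_time_seconds, payload), in any arrival order.
    """
    counts = Counter()
    for event_time, _payload in elements:
        counts[event_time // 60] += 1
    return dict(counts)


# The same four elements, once in order and once shuffled on arrival.
in_order = [(5, "a"), (30, "b"), (65, "c"), (130, "d")]
out_of_order = [(65, "c"), (5, "a"), (130, "d"), (30, "b")]

same_result = count_per_minute(in_order) == count_per_minute(out_of_order)
```

Grouping by processing time instead would tie the result to whenever each element happened to be handled.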
16. Event Time & Watermarks
▪ An element arrives: how do we know what time it is?
▪ Processing time: take the hardware clock
▪ Event time: Watermarks
▪ Watermarks are timestamps
▪ A watermark with timestamp t signals that no more elements with a timestamp of t or earlier are expected to arrive
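The watermark idea can be sketched as follows. This is a toy model, not Flink's implementation: it assumes that elements arrive at most `max_delay` time units late, derives the watermark as the highest event time seen minus that bound, and releases buffered elements in event-time order once the watermark has passed them. The function name and the element layout are hypothetical.

```python
import heapq


def release_in_event_time_order(arrivals, max_delay):
    """Buffer elements and release them in event-time order.

    arrivals: (event_time, payload) tuples in arrival order.
    max_delay: assumed bound on how late an element may arrive.
    Returns (released, late), where `late` holds elements that arrived
    behind the watermark.
    """
    buffer, released, late = [], [], []
    watermark = float("-inf")
    for event_time, payload in arrivals:
        if event_time <= watermark:
            late.append((event_time, payload))  # violates the watermark
            continue
        heapq.heappush(buffer, (event_time, payload))
        # No element at or below the watermark is expected any more.
        watermark = max(watermark, event_time - max_delay)
        while buffer and buffer[0][0] <= watermark:
            released.append(heapq.heappop(buffer))
    # End of input: flush what is still buffered, in event-time order.
    while buffer:
        released.append(heapq.heappop(buffer))
    return released, late


arrivals = [(1, "a"), (4, "b"), (2, "c"), (7, "d"), (3, "e"), (9, "f")]
released, late = release_in_event_time_order(arrivals, max_delay=3)
```

The element with event time 3 arrives after the watermark has already advanced to 4, so it is reported as late rather than released out of order.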
77. Pipelining
Basic building block to “keep data moving”
• Low latency
• Operators push data forward
• Data shipping as buffers, not tuple-wise
• Natural handling of
84. Flink Engine
1. Execute everything as streams
2. Iterative (cyclic) dataflows
3. Mutable state in operators
4. Operate on managed memory
5. Special code paths for batch
6. HA mode – no single point of failure
7. Checkpointing of operator state
State + Computation
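To illustrate items 3 and 7, here is a toy single-operator sketch of mutable state with checkpoint and restore. It is not Flink's distributed checkpointing mechanism; the class `CountingOperator`, its methods, and the use of an input offset for replay are assumptions made for the example.

```python
import copy


class CountingOperator:
    """Toy stateful operator: counts elements per key."""

    def __init__(self):
        self.counts = {}  # mutable operator state
        self.offset = 0   # position in the input stream

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        self.offset += 1

    def snapshot(self):
        """Return an independent copy of the state (the checkpoint)."""
        return copy.deepcopy({"counts": self.counts, "offset": self.offset})

    def restore(self, checkpoint):
        """Reset the state to a previously taken checkpoint."""
        self.counts = copy.deepcopy(checkpoint["counts"])
        self.offset = checkpoint["offset"]


stream = ["max", "marie", "max", "tim", "max"]

op = CountingOperator()
for key in stream[:3]:
    op.process(key)
checkpoint = op.snapshot()

op.process(stream[3])    # progress made after the checkpoint...
op = CountingOperator()  # ...is lost when the operator fails
op.restore(checkpoint)
for key in stream[op.offset:]:  # replay input from the checkpointed offset
    op.process(key)
```

After recovery the counts match a failure-free run, because the state and the input position were captured together.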
85. Flink Eco System
▪ Libraries and integrations: Gelly, Table, ML, SAMOA, Hadoop M/R, Dataflow, MRQL, Cascading, Storm, Zeppelin
▪ APIs: DataSet (Java/Scala/Python), DataStream
▪ Runtime: Streaming dataflow runtime
▪ Deployment: Local, Cluster, YARN
93. Apache Flink
▪ A powerful framework with a stream processor at its core
▪ Features
• True Streaming with great Batch support
• Easy to use APIs, library ecosystem
• Fault-tolerant and Consistent
• Low latency - High throughput
• Growing community
94. I ♥ Flink, do you?
▪ More information on flink.apache.org
▪ Flink Training at data-artisans.com
▪ Subscribe to the mailing lists
▪ Follow @ApacheFlink
▪ Next: 1.0.0 release
▪ Soon: Stream SQL, Mesos, Dynamic scaling