Apache Flink is an open source platform for distributed stream and batch data processing. It provides APIs called DataStream for unbounded streaming data and DataSet for bounded batch data. Flink runs streaming topologies that allow for windowing, aggregation and other stream processing functions. It supports exactly-once processing semantics through distributed snapshots and checkpoints. The system is optimized for low latency and high throughput streaming applications.
2. Recent History
April '14: Project Incubation — v0.5, v0.6, v0.7
December '14: Top Level Project — v0.8
April '15: v0.9
Currently moving towards the 0.10 and 1.0 releases.
3. What is Flink?
Deployment: Local (Single JVM) · Cluster (Standalone, YARN)
APIs: DataStream API (Unbounded Data) · DataSet API (Bounded Data)
Runtime: Distributed Streaming Data Flow
Libraries: Machine Learning · Graph Processing · SQL-like API
4. What is Flink?
Streaming topologies over streams, with time and count windows, at low latency.
Long batch pipelines with high resource utilization.
Machine learning with iterative algorithms, e.g. factorizing a rating matrix into a user matrix and an item matrix.
Graph analysis over weighted graphs, with mutable state.
7. Cornerstones of Flink
Low Latency for fast results.
High Throughput to handle many events per second.
Exactly-once guarantees for correct results.
Expressive APIs for productivity.
13. DataStream API
StreamExecutionEnvironment env = StreamExecutionEnvironment
  .getExecutionEnvironment();
DataStream<String> data = env.fromElements(
  "O Romeo, Romeo! wherefore art thou Romeo?", ...);
// DataStream Windowed WordCount
DataStream<Tuple2<String, Integer>> counts = data
  .flatMap(new SplitByWhitespace()) // (word, 1)
  .keyBy(0) // [word, [1, 1, …]] for 10 seconds
  .timeWindow(Time.of(10, TimeUnit.SECONDS))
  .sum(1); // sum per word per 10 second window
counts.print();
env.execute();
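As a sketch of what the tumbling window above produces, the following plain-Java simulation (no Flink dependency; the timestamps and words are made-up examples) groups (timestamp, word) events into 10-second windows and counts per word:

```java
import java.util.*;

public class WindowedCount {
  public static void main(String[] args) {
    // (timestampMillis, word) events; values are made-up examples
    Object[][] events = {
      {1000L, "romeo"}, {2000L, "romeo"}, {11000L, "romeo"}
    };
    // 10-second tumbling windows: window index = timestamp / 10000
    Map<Long, Map<String, Integer>> windows = new TreeMap<>();
    for (Object[] e : events) {
      long window = (Long) e[0] / 10000;
      windows.computeIfAbsent(window, w -> new HashMap<>())
             .merge((String) e[1], 1, Integer::sum);
    }
    System.out.println(windows); // {0={romeo=2}, 1={romeo=1}}
  }
}
```

The first two events fall into window 0 and are summed together; the third lands in the next window, just as `timeWindow(Time.of(10, TimeUnit.SECONDS))` would separate them.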
21. DataStream API
public static class SplitByWhitespace
  implements FlatMapFunction<String, Tuple2<String, Integer>> {
  @Override
  public void flatMap(
      String value, Collector<Tuple2<String, Integer>> out) {
    // split on runs of non-word characters (drops punctuation)
    String[] tokens = value.toLowerCase().split("\\W+");
    for (String token : tokens) {
      if (token.length() > 0) {
        out.collect(new Tuple2<>(token, 1));
      }
    }
  }
}
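For reference, `split("\\W+")` tokenizes on runs of non-word characters, so punctuation disappears along with the whitespace. A quick standalone check, outside Flink:

```java
public class TokenizeCheck {
  public static void main(String[] args) {
    // same tokenization as SplitByWhitespace, applied to a plain string
    String[] tokens = "O Romeo, Romeo!".toLowerCase().split("\\W+");
    System.out.println(String.join(",", tokens)); // o,romeo,romeo
  }
}
```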
27. Pipelining
DataStream<String> data = env.fromElements(
  "O Romeo, Romeo! wherefore art thou Romeo?", …);
// DataStream WordCount
DataStream<Tuple2<String, Integer>> counts = data
  .flatMap(new SplitByWhitespace()) // (word, 1)
  .keyBy(0) // split stream by word
  .sum(1); // sum per word as they arrive
Pipeline: Source → Map → Reduce
37. Streaming Fault Tolerance
At Most Once
• No guarantees at all
At Least Once
• Ensure that all operators see all events.
Exactly Once
• Ensure that all operators see all events.
• Do not perform duplicate updates to operator state.
Flink supports all three guarantees.
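To make the difference concrete, here is a toy simulation of a failure and replay. It shows only the observable effect of the two guarantees, not Flink's actual mechanism (which uses snapshots and replay rather than per-event deduplication); the offsets and failure point are made up:

```java
import java.util.*;

public class Guarantees {
  public static void main(String[] args) {
    // offsets processed before a failure, then replayed from the last checkpoint
    List<Integer> firstRun = List.of(0, 1, 2);
    List<Integer> replay   = List.of(1, 2, 3, 4);

    int atLeastOnce = 0;            // counts every delivery, duplicates included
    int exactlyOnce = 0;            // counts each offset's effect once
    Set<Integer> seen = new HashSet<>();

    for (int off : firstRun) { atLeastOnce++; if (seen.add(off)) exactlyOnce++; }
    for (int off : replay)   { atLeastOnce++; if (seen.add(off)) exactlyOnce++; }

    System.out.println(atLeastOnce + " " + exactlyOnce); // 7 5
  }
}
```

Under at-least-once, offsets 1 and 2 are counted twice after the replay (7 updates for 5 events); exactly-once state reflects each event exactly once.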
39. Distributed Snapshots
Flink guarantees exactly-once processing via distributed snapshots.
A JobManager (master) coordinates checkpoints. A state backend stores the checkpoint data: the position of each source, the state of each operator, and the acknowledgement of each sink.
40. Distributed Snapshots
The JobManager starts a checkpoint by sending a "start checkpoint" message to the sources.
41. Distributed Snapshots
The sources emit barriers into their streams and acknowledge the checkpoint with their current positions (here: offsets 6791, 7252, 5589, 6843).
42. Distributed Snapshots
A stateful operator waits until it has received the barrier at each of its inputs.
43. Distributed Snapshots
Once the barrier has arrived at each input, the operator writes a snapshot of its state to the state backend.
44. Distributed Snapshots
The operator acknowledges the checkpoint with a pointer to its stored state (State 1: PTR1, State 2: PTR2) and forwards the barrier downstream.
45. Distributed Snapshots
Once the sinks have received the barrier at each input, they acknowledge the checkpoint (Sink 1: ACK, Sink 2: ACK).
46. Distributed Snapshots
The checkpoint is now complete: source offsets, state pointers, and sink acknowledgements are all recorded in the state backend.
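The barrier-alignment step can be sketched as a toy single-threaded simulation (this is an illustration of the idea, not Flink's actual implementation; the channel contents and the counting operator are made up). An operator with two inputs stops reading a channel once that channel's barrier has arrived, and snapshots its state only after barriers have arrived on all inputs:

```java
import java.util.*;

public class BarrierAlignment {
  public static void main(String[] args) {
    // two input channels; "B" marks the checkpoint barrier
    Deque<String> in1 = new ArrayDeque<>(List.of("a", "b", "B", "c"));
    Deque<String> in2 = new ArrayDeque<>(List.of("x", "B", "y"));

    int count = 0;                  // operator state: events counted so far
    boolean b1 = false, b2 = false; // barrier seen on each input?

    while (!(b1 && b2)) {
      // alignment: only read from channels that have not yet delivered a barrier
      if (!b1 && !in1.isEmpty()) {
        String r = in1.poll();
        if (r.equals("B")) b1 = true; else count++;
      }
      if (!b2 && !in2.isEmpty()) {
        String r = in2.poll();
        if (r.equals("B")) b2 = true; else count++;
      }
    }

    // snapshot taken only after the barrier arrived on every input;
    // remaining records ("c", "y") belong to the next checkpoint interval
    int snapshot = count;
    System.out.println(snapshot); // 3 (a, b, x)
  }
}
```

The snapshot cleanly separates pre-barrier records (a, b, x) from post-barrier ones, which is what makes the recorded state consistent with the recorded source offsets.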
47. Operator State
Stateless operators:
ds.filter(_ != 0)
System state:
ds.keyBy(0).window(TumblingTimeWindows.of(5, TimeUnit.SECONDS))
User-defined state:
public class CounterSum implements RichReduceFunction<Long> {
  private OperatorState<Long> counter;
  @Override public Long reduce(Long v1, Long v2) throws Exception {
    counter.update(counter.value() + 1);
    return v1 + v2;
  }
  @Override public void open(Configuration config) {
    counter = getRuntimeContext().getOperatorState("counter", 0L, false);
  }
}
48. Batch on Streaming
APIs: DataStream API (Unbounded Data) · DataSet API (Bounded Data)
Runtime: Distributed Streaming Data Flow
Libraries: Machine Learning · Graph Processing · SQL-like API
49. Batch on Streaming
Run a bounded stream (data set) on a stream processor: a bounded data set is a finite slice of an unbounded data stream.
50. Batch on Streaming
Run a bounded stream (data set) on a stream processor.
Infinite streams: stream windows, pipelined data exchange.
Finite streams: global view, pipelined or blocking data exchange.
52. DataSet API
ExecutionEnvironment env = ExecutionEnvironment
  .getExecutionEnvironment();
DataSet<String> data = env.fromElements(
  "O Romeo, Romeo! wherefore art thou Romeo?", ...);
// DataSet WordCount
DataSet<Tuple2<String, Integer>> counts = data
  .flatMap(new SplitByWhitespace()) // (word, 1)
  .groupBy(0) // [word, [1, 1, …]]
  .sum(1); // sum per word for all occurrences
counts.print();
59. Batch-specific optimizations
Cost-based optimizer
• Program adapts to changing data size
Managed memory
• On- and off-heap memory
• Internal operators (e.g. join or sort) with out-of-core support
• Serialization stack for user types
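The managed-memory idea can be sketched in a few lines: records live in serialized form inside a preallocated buffer, and the sort reads keys out of the buffer in place rather than materializing objects. This toy uses a plain `ByteBuffer` and fixed 8-byte records; the layout and sizes are made up and do not reflect Flink's actual memory segments:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ManagedSort {
  static final int REC = 8; // fixed-size record: one long key

  public static void main(String[] args) {
    long[] keys = {42, 7, 19, 3};
    // preallocated "memory segment" holding records in serialized form
    ByteBuffer page = ByteBuffer.allocate(keys.length * REC);
    for (long k : keys) page.putLong(k);

    // sort record indices by comparing serialized keys in place,
    // without deserializing records into heap objects
    Integer[] idx = {0, 1, 2, 3};
    Arrays.sort(idx, (a, b) ->
        Long.compare(page.getLong(a * REC), page.getLong(b * REC)));

    StringBuilder sb = new StringBuilder();
    for (int i : idx) sb.append(page.getLong(i * REC)).append(' ');
    System.out.println(sb.toString().trim()); // 3 7 19 42
  }
}
```

Keeping records serialized in fixed pages is what lets such operators spill pages to disk for out-of-core processing and keep garbage-collection pressure low.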