SlideShare una empresa de Scribd logo
1 de 50
Descargar para leer sin conexión
Apache flink
By pranay
Contents
2
❏ What is Flink?
❏ Why Flink
❏ Flink Ecosystem
❏ Sources and sinks
❏ Flink Program structure
❏ Dataset API and DataStream API
❏ Windowing in flink
❏ Fault tolerance & Availability
❏ Use cases
❏ Limitations
3
Big Data
● We define big data as huge amount of data.
● Structured, semi-structured and unstructured format
● This led to emergence of certain tools & frameworks that help in
storage and processing of data
● Processing part - 1. Batch Processing
2. Stream Processing
Batch processing
Vs
Stream processing
4
STREAM PROCESSING
5
Stream processing challenges
➢ Data is unbounded, means no start and end
➢ Unpredictable and inconsistent intervals of new data
➢ Data can be Out of order with different timestamps
➢ Latency factor impacts accuracy of results
6
What is APache FLink?
7
distributed-Stream processing
framework
WHY FLink?
8
➢ Scalable
➢ High throughput
➢ Low latency
➢ Flexible windowing
➢ Exactly once
semantics
➢ High availability
➢ Fault tolerant
9
MORE ABOUT FLINK
● Flink works on kappa architecture
● The key idea of this architecture is to handle both batch and
real-time data through single stream processing engine.
● Hence, flink treats all data as stream and processes the data in real
time.
● Batch data is special case of streaming here.
10
REmember last slide?
HISTORY IN SHORT
SparkHadoop Flink
11
< <
12
Flink job execution architecture
Flink Ecosystem
13
Storage/streaming
14
● Flink is just computation engine,
doesn’t come with storage system
● Can read, write data from different
storage systems as well as streaming
systems
➢ Apache kafka(source/sink)
➢ Apache Cassandra(source/sink)
➢ Elastic search(sink)
➢ Hadoop filesystem(sink)
➢ RabbitMQ(source/sink)
➢ Amazon Kinesis
streams(source/sink)
➢ Twitter streaming API(source)
➢ Google PubSub(Source/sink)
PreDefined sources/sinks
➢ File based
○ readTextFile(path)
○ readTextFile(fileformat,path)
➢ Socket-based
○ SocketTextStream
➢ Collection-based(java.util.Collection)
○ fromCollection(Collection)
○ Elements must be of same type
➢ Custom(Previous slide)
○ addSource(. . .)
➢ writeAsText() - TExt output format
➢ writeAsCsv() - CSV output format
➢ print() - standard output
➢ writeUsingOutputFormat() - custom
file output
➢ WriteToSocket
➢ Custom( Previous slide)
○ addSink(. . .)
15
Some more information
✘ As you see flink is a layered system. Different layers are built on top
of other raising abstraction level of the program
✘ Runtime layer receives program in the form of job graph. This is the
layer where job manager sits.
✘ Both Datastream API and DataSet API generate jobgraphs through
separate compilation processes.
16
Flink Program sKeleton
Source Transformations Sink
17
Dataset API
It is used to perform batch operations on
the data over a period. Datasets are
created from sources like local files and
transformations like filtering, mapping,
aggregating, joining are applied and the
result is written to sinks.
Flink api’s
DataStream API
It is used for handling data in continuous
stream. Operations like filtering, mapping,
windowing and aggregating can be
applied on the datastream and can be
written to sinks.
18
Datastream TRansformations
19
● Map
● FlatMap
● Filter
● KeyBy
● Reduce
● Fold
● Window
○ windowAll
○ Window apply
○ Window fold
○ Window reduce
● Aggregations
○ Sum
○ Min
○ Max
● Aggregate
● Select
● Split
● Iterate
● Extract Timestamps
● Shuffle
HOw it works?
EXAMPLE PLEASE . . .
20
21
DATASET API EXAMPLE
DATASTREAM API EXAMPLE
22
Windows In Flink
23
● Windows are the heart of processing infinite streams.
● Windows split the stream into fixed sized buckets over which we can
apply computations.
● It is a mechanism to take snapshot of the stream based on time or
other variables
● Need of windowing - in case we need some kind of aggregation on
the incoming data
24
MORE INFO . . .
● Most of window operations are encouraged on keyed datastream.
● The basic structure of windowed transformation is as follows.
25
Window Assigners . . .
❏ Tumbling Windows
❏ Sliding Windows
A window assigner defines how elements are assigned to
windows
❏ Sessions Windows
❏ Global Windows
TIME in FLINK
➢ Processing Time - Assigns elements based on the current clock of
the machines
➢ Event Time - Assigns windows based on timestamp of the
elements.
➢ Ingestion time - Assigns wall clock timestamps as they arrive in the
system and processes as event time semantics based on the
attached timestamps.
26
Tumbling windows
✘ Here window size is specified and elements are assigned to non
overlapping windows.
✘ Eg. If window size is specified as 2 minutes, all elements in 2 minutes
would come in one window for processing.
✘ It is useful in dividing the stream into multiple discrete batches.
27
28
29
Sliding windows
➢ Here windows can overlap unlike tumbling windows
➢ Overlap is user specified. Here one element can be assigned to
multiple windows
➢ Eg. 10 minutes window with a slide of 5 minutes.
30
31
Session Windows
➢ This is applicable cases where window boundaries needs to be
adjusted as per incoming data.
➢ Here we specify session gap or inactivity period before marking a
session closed.
➢ Refer next slide for example.
32
33
Global Windows
➢ Sub-division of elements into windows is not done. Instead, we
assign each element to 1 single key per global window.
➢ There is no end to this window, so triggers should be used if any
computation needs to be done.
➢ Flink also has count windows. For eg. a tumbling count window of
100 will add 100 elements to the window and will evaluate upon
receiving 100th element.
34
35
Evictors & Triggers
➢ A trigger determines when window is ready for processing. Except
Global windows, all windows come with a default trigger.
➢ We can define a custom trigger which tells when to fire elements for
processing .
➢ Evictors are used to evict some elements from the window. After the
window is completed, it is purged
Fault Tolerance
➢ Before diving in, let’s define stateful computing. Most of the
computational tasks require stateful data.
➢ We already came across this in previous slides.
➢ In word count example, word count is an output that accumulates
new objects into existing word count. It is a stateful variable.
➢ Here fault tolerance is achieved with checkpointing. Flink maintains
state locally per task.
➢ State is periodically checkpointed to durable storage. A checkpoint
is consistent snapshot of the state of all tasks.
36
More on Checkpointing . . .
➢ Flink backs up state at a certain interval
➢ In case of a failure, flink recovers all tasks to the state of last
checkpoint and starts running the tasks from that checkpoint.
➢ The application continues as the failure never happened.
➢ Flink guarantees exactly once semantics which means all data is
processed exactly once regardless of any failures.
37
Stopping and resuming jobs . . .
➢ The running jobs must be stopped before upgrading a component
and resumed after the upgrade.
➢ There are 2 modes to resume the jobs.
➢ Savepoint - unlike checkpoint which is periodically triggered by the
system, savepoint is triggered by running commands.
➢ Even the storage format is different from checkpoints.
➢ External checkpoint - Extension of existing internal checkpoint.
After storing internal checkpoint, it is stored specific directory.
38
Savepoints vs checkpoints . . .
➢ The primary purpose of checkpoints is to provide a recovery
mechanism in case of unexpected job failures. Checkpoint’s lifecycle
is managed by flink. Checkpoints are dropped after job was
terminated by user. They are lightweight.
➢ Savepoints are created, owned and deleted by user. Their use-case
is for planned manual backup and resume. They are bit of
expensive. Mostly used for flink version upgrade, changing job graph
and for changing parallelism.
39
More on checkpoints. . .
➢ We already know state is persisted upon checkpoints to avoid data
loss and recover consistently. How the state is stored internally and
how it is persisted upon checkpoints depends on state backend.
➢ 3 state backends available
○ Memory state backend
○ Fs state backend
○ Rocksdb state backend
40
High Availability for standalone cluster
➢ Job manager coordinates flink deployment and it is responsible for
scheduling the jobs and resource management.
➢ However, there is only one job manager per flink cluster creating a
single point of failure.
➢ There will be a single leading job manager at any time and multiple
standby job managers to take leadership incase leader fails.
➢ Flink uses zookeeper for distributed coordination between all
running instances of job manager and leader election.
41
42
43
Limitations
➢ Maturity and community support is less in the industry than its
predecessors but it is growing fastly.
➢ No known adaptation of Flink batch, only popular for streaming
Use Cases
❏ Event driven Applications
❏ Data Analytics Applications
❏ Data Pipeline Applications
44
45
46
47
48
EXAMPLES
Event driven applications
➢ Fraud detection
➢ Anomaly detection
➢ Rule Based Alerting
➢ Business Process
Monitoring
Data Analytics Applications
➢ Quality monitoring of
Telecom networks
➢ Analysis of live data
➢ Analysis of product
updates
➢ Large scale graph
analysis
Data Pipeline Applications
➢ Continuous ETL in
E-commerce
➢ Real time search index
building in E-commerce
Use cases cont’d. . .
Bouygues Telecom
For real time event processing and
analytics for billions of messages
per day and real time customer
insights.
King - candy crush creator
Real time analytics on 30 billion
events generated everyday across
200 countries.
Zalando E-commerce
Uses flink for real time process
monitoring, stream processing,
business process monitoring.
Otto - largest online retailer
For user agent identification and
identifying a search session via
stream processing,
Alibaba Group
For online recommendations of the
products based on the user activity.
They developed Blink on top of the
flink for this.
Capital One - commercial
banking
Monitor customer data in real time,
to detect customer issues, event
correlation, anomaly identification
49
Thank You!
Keep Learning . . .
50

Más contenido relacionado

La actualidad más candente

Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System OverviewFlink Forward
 
Battle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWaveBattle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWaveYingjun Wu
 
Kafka Streams State Stores Being Persistent
Kafka Streams State Stores Being PersistentKafka Streams State Stores Being Persistent
Kafka Streams State Stores Being Persistentconfluent
 
Apache Flink Worst Practices
Apache Flink Worst PracticesApache Flink Worst Practices
Apache Flink Worst PracticesKonstantin Knauf
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
Performance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State StoresPerformance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State Storesconfluent
 
Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)KafkaZone
 
Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020Mayank Shrivastava
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
Introduction To Flink
Introduction To FlinkIntroduction To Flink
Introduction To FlinkKnoldus Inc.
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorFlink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Cruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clustersCruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clustersPrateek Maheshwari
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewenconfluent
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used forAljoscha Krettek
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasFlink Forward
 
Extending Druid Index File
Extending Druid Index FileExtending Druid Index File
Extending Druid Index FileNavis Ryu
 

La actualidad más candente (20)

Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
 
Battle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWaveBattle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWave
 
Kafka Streams State Stores Being Persistent
Kafka Streams State Stores Being PersistentKafka Streams State Stores Being Persistent
Kafka Streams State Stores Being Persistent
 
Apache Flink Worst Practices
Apache Flink Worst PracticesApache Flink Worst Practices
Apache Flink Worst Practices
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Performance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State StoresPerformance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State Stores
 
Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)
 
Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
Introduction To Flink
Introduction To FlinkIntroduction To Flink
Introduction To Flink
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Cruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clustersCruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clusters
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
 
Extending Druid Index File
Extending Druid Index FileExtending Druid Index File
Extending Druid Index File
 

Similar a Apache flink

Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...Timo Walther
 
Real-time Stream Processing using Apache Apex
Real-time Stream Processing using Apache ApexReal-time Stream Processing using Apache Apex
Real-time Stream Processing using Apache ApexApache Apex
 
Introduction to Apache Apex - CoDS 2016
Introduction to Apache Apex - CoDS 2016Introduction to Apache Apex - CoDS 2016
Introduction to Apache Apex - CoDS 2016Bhupesh Chawda
 
Bootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkBootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkDataWorks Summit
 
Stateful streaming data pipelines
Stateful streaming data pipelinesStateful streaming data pipelines
Stateful streaming data pipelinesTimothy Farkas
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streamingdatamantra
 
Scalability truths and serverless architectures
Scalability truths and serverless architecturesScalability truths and serverless architectures
Scalability truths and serverless architecturesRegunath B
 
Flink-window-function-basic
Flink-window-function-basicFlink-window-function-basic
Flink-window-function-basicPreetdeep Kumar
 
Unified stateful big data processing in Apache Beam (incubating)
Unified stateful big data processing in Apache Beam (incubating)Unified stateful big data processing in Apache Beam (incubating)
Unified stateful big data processing in Apache Beam (incubating)Aljoscha Krettek
 
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache BeamAljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache BeamVerverica
 
Workflows via Event driven architecture
Workflows via Event driven architectureWorkflows via Event driven architecture
Workflows via Event driven architectureMilan Patel
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
 
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward
 
Rise of the machines: Continuous Delivery at SEEK - YOW! Night Summary Slides
Rise of the machines: Continuous Delivery at SEEK - YOW! Night Summary SlidesRise of the machines: Continuous Delivery at SEEK - YOW! Night Summary Slides
Rise of the machines: Continuous Delivery at SEEK - YOW! Night Summary SlidesDiUS
 
Automating Banner Financial Aid Dataload - San Jacinto College
Automating Banner Financial Aid Dataload - San Jacinto CollegeAutomating Banner Financial Aid Dataload - San Jacinto College
Automating Banner Financial Aid Dataload - San Jacinto CollegeCA | Automic Software
 
Let's get to know the Data Streaming
Let's get to know the Data StreamingLet's get to know the Data Streaming
Let's get to know the Data StreamingKnoldus Inc.
 
QueueMetrics - Tips and Tricks
QueueMetrics - Tips and TricksQueueMetrics - Tips and Tricks
QueueMetrics - Tips and TricksClarotech_Events
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uberconfluent
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformApache Apex
 
Flink Forward Berlin 2017: Hao Wu - Large Scale User Behavior Analytics by Flink
Flink Forward Berlin 2017: Hao Wu - Large Scale User Behavior Analytics by FlinkFlink Forward Berlin 2017: Hao Wu - Large Scale User Behavior Analytics by Flink
Flink Forward Berlin 2017: Hao Wu - Large Scale User Behavior Analytics by FlinkFlink Forward
 

Similar a Apache flink (20)

Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
 
Real-time Stream Processing using Apache Apex
Real-time Stream Processing using Apache ApexReal-time Stream Processing using Apache Apex
Real-time Stream Processing using Apache Apex
 
Introduction to Apache Apex - CoDS 2016
Introduction to Apache Apex - CoDS 2016Introduction to Apache Apex - CoDS 2016
Introduction to Apache Apex - CoDS 2016
 
Bootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkBootstrapping state in Apache Flink
Bootstrapping state in Apache Flink
 
Stateful streaming data pipelines
Stateful streaming data pipelinesStateful streaming data pipelines
Stateful streaming data pipelines
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
Scalability truths and serverless architectures
Scalability truths and serverless architecturesScalability truths and serverless architectures
Scalability truths and serverless architectures
 
Flink-window-function-basic
Flink-window-function-basicFlink-window-function-basic
Flink-window-function-basic
 
Unified stateful big data processing in Apache Beam (incubating)
Unified stateful big data processing in Apache Beam (incubating)Unified stateful big data processing in Apache Beam (incubating)
Unified stateful big data processing in Apache Beam (incubating)
 
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache BeamAljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
 
Workflows via Event driven architecture
Workflows via Event driven architectureWorkflows via Event driven architecture
Workflows via Event driven architecture
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
 
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
 
Rise of the machines: Continuous Delivery at SEEK - YOW! Night Summary Slides
Rise of the machines: Continuous Delivery at SEEK - YOW! Night Summary SlidesRise of the machines: Continuous Delivery at SEEK - YOW! Night Summary Slides
Rise of the machines: Continuous Delivery at SEEK - YOW! Night Summary Slides
 
Automating Banner Financial Aid Dataload - San Jacinto College
Automating Banner Financial Aid Dataload - San Jacinto CollegeAutomating Banner Financial Aid Dataload - San Jacinto College
Automating Banner Financial Aid Dataload - San Jacinto College
 
Let's get to know the Data Streaming
Let's get to know the Data StreamingLet's get to know the Data Streaming
Let's get to know the Data Streaming
 
QueueMetrics - Tips and Tricks
QueueMetrics - Tips and TricksQueueMetrics - Tips and Tricks
QueueMetrics - Tips and Tricks
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
Flink Forward Berlin 2017: Hao Wu - Large Scale User Behavior Analytics by Flink
Flink Forward Berlin 2017: Hao Wu - Large Scale User Behavior Analytics by FlinkFlink Forward Berlin 2017: Hao Wu - Large Scale User Behavior Analytics by Flink
Flink Forward Berlin 2017: Hao Wu - Large Scale User Behavior Analytics by Flink
 

Último

Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdfKamal Acharya
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"mphochane1998
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdfKamal Acharya
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadhamedmustafa094
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilVinayVitekari
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationBhangaleSonal
 
Engineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesEngineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesRAJNEESHKUMAR341697
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network DevicesChandrakantDivate1
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaOmar Fathy
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptNANDHAKUMARA10
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxmaisarahman1
 
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptxOrlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptxMuhammadAsimMuhammad6
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdfAldoGarca30
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptxJIT KUMAR GUPTA
 
Wadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptxWadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptxNadaHaitham1
 

Último (20)

Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech Civil
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
 
Engineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesEngineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planes
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
 
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptxOrlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
Wadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptxWadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptx
 

Apache flink

  • 2. Contents 2 ❏ What is Flink? ❏ Why Flink ❏ Flink Ecosystem ❏ Sources and sinks ❏ Flink Program structure ❏ Dataset API and DataStream API ❏ Windowing in flink ❏ Fault tolerance & Availability ❏ Use cases ❏ Limitations
  • 3. 3 Big Data ● We define big data as huge amount of data. ● Structured, semi-structured and unstructured format ● This led to emergence of certain tools & frameworks that help in storage and processing of data ● Processing part - 1. Batch Processing 2. Stream Processing
  • 6. Stream processing challenges ➢ Data is unbounded, means no start and end ➢ Unpredictable and inconsistent intervals of new data ➢ Data can be Out of order with different timestamps ➢ Latency factor impacts accuracy of results 6
  • 7. What is APache FLink? 7 distributed-Stream processing framework
  • 8. WHY FLink? 8 ➢ Scalable ➢ High throughput ➢ Low latency ➢ Flexible windowing ➢ Exactly once semantics ➢ High availability ➢ Fault tolerant
  • 9. 9 MORE ABOUT FLINK ● Flink works on kappa architecture ● The key idea of this architecture is to handle both batch and real-time data through single stream processing engine. ● Hence, flink treats all data as stream and processes the data in real time. ● Batch data is special case of streaming here.
  • 12. 12 Flink job execution architecture
  • 14. Storage/streaming 14 ● Flink is just computation engine, doesn’t come with storage system ● Can read, write data from different storage systems as well as streaming systems ➢ Apache kafka(source/sink) ➢ Apache Cassandra(source/sink) ➢ Elastic search(sink) ➢ Hadoop filesystem(sink) ➢ RabbitMQ(source/sink) ➢ Amazon Kinesis streams(source/sink) ➢ Twitter streaming API(source) ➢ Google PubSub(Source/sink)
  • 15. PreDefined sources/sinks ➢ File based ○ readTextFile(path) ○ readTextFile(fileformat,path) ➢ Socket-based ○ SocketTextStream ➢ Collection-based(java.util.Collection) ○ fromCollection(Collection) ○ Elements must be of same type ➢ Custom(Previous slide) ○ addSource(. . .) ➢ writeAsText() - TExt output format ➢ writeAsCsv() - CSV output format ➢ print() - standard output ➢ writeUsingOutputFormat() - custom file output ➢ WriteToSocket ➢ Custom( Previous slide) ○ addSink(. . .) 15
  • 16. Some more information ✘ As you see flink is a layered system. Different layers are built on top of other raising abstraction level of the program ✘ Runtime layer receives program in the form of job graph. This is the layer where job manager sits. ✘ Both Datastream API and DataSet API generate jobgraphs through separate compilation processes. 16
  • 17. Flink Program sKeleton Source Transformations Sink 17
  • 18. Dataset API It is used to perform batch operations on the data over a period. Datasets are created from sources like local files and transformations like filtering, mapping, aggregating, joining are applied and the result is written to sinks. Flink api’s DataStream API It is used for handling data in continuous stream. Operations like filtering, mapping, windowing and aggregating can be applied on the datastream and can be written to sinks. 18
  • 19. Datastream TRansformations 19 ● Map ● FlatMap ● Filter ● KeyBy ● Reduce ● Fold ● Window ○ windowAll ○ Window apply ○ Window fold ○ Window reduce ● Aggregations ○ Sum ○ Min ○ Max ● Aggregate ● Select ● Split ● Iterate ● Extract Timestamps ● Shuffle
  • 20. HOw it works? EXAMPLE PLEASE . . . 20
  • 23. Windows In Flink 23 ● Windows are the heart of processing infinite streams. ● Windows split the stream into fixed sized buckets over which we can apply computations. ● It is a mechanism to take snapshot of the stream based on time or other variables ● Need of windowing - in case we need some kind of aggregation on the incoming data
  • 24. 24 MORE INFO . . . ● Most of window operations are encouraged on keyed datastream. ● The basic structure of windowed transformation is as follows.
  • 25. 25 Window Assigners . . . ❏ Tumbling Windows ❏ Sliding Windows A window assigner defines how elements are assigned to windows ❏ Sessions Windows ❏ Global Windows
  • 26. TIME in FLINK ➢ Processing Time - Assigns elements based on the current clock of the machines ➢ Event Time - Assigns windows based on timestamp of the elements. ➢ Ingestion time - Assigns wall clock timestamps as they arrive in the system and processes as event time semantics based on the attached timestamps. 26
  • 27. Tumbling windows ✘ Here window size is specified and elements are assigned to non overlapping windows. ✘ Eg. If window size is specified as 2 minutes, all elements in 2 minutes would come in one window for processing. ✘ It is useful in dividing the stream into multiple discrete batches. 27
  • 28. 28
  • 29. 29 Sliding windows ➢ Here windows can overlap unlike tumbling windows ➢ Overlap is user specified. Here one element can be assigned to multiple windows ➢ Eg. 10 minutes window with a slide of 5 minutes.
  • 30. 30
  • 31. 31 Session Windows ➢ This is applicable cases where window boundaries needs to be adjusted as per incoming data. ➢ Here we specify session gap or inactivity period before marking a session closed. ➢ Refer next slide for example.
  • 32. 32
  • 33. 33 Global Windows ➢ Sub-division of elements into windows is not done. Instead, we assign each element to 1 single key per global window. ➢ There is no end to this window, so triggers should be used if any computation needs to be done. ➢ Flink also has count windows. For eg. a tumbling count window of 100 will add 100 elements to the window and will evaluate upon receiving 100th element.
  • 34. 34
  • 35. 35 Evictors & Triggers ➢ A trigger determines when window is ready for processing. Except Global windows, all windows come with a default trigger. ➢ We can define a custom trigger which tells when to fire elements for processing . ➢ Evictors are used to evict some elements from the window. After the window is completed, it is purged
  • 36. Fault Tolerance ➢ Before diving in, let’s define stateful computing. Most of the computational tasks require stateful data. ➢ We already came across this in previous slides. ➢ In word count example, word count is an output that accumulates new objects into existing word count. It is a stateful variable. ➢ Here fault tolerance is achieved with checkpointing. Flink maintains state locally per task. ➢ State is periodically checkpointed to durable storage. A checkpoint is consistent snapshot of the state of all tasks. 36
  • 37. More on Checkpointing . . . ➢ Flink backs up state at a certain interval ➢ In case of a failure, flink recovers all tasks to the state of last checkpoint and starts running the tasks from that checkpoint. ➢ The application continues as the failure never happened. ➢ Flink guarantees exactly once semantics which means all data is processed exactly once regardless of any failures. 37
  • 38. Stopping and resuming jobs . . . ➢ The running jobs must be stopped before upgrading a component and resumed after the upgrade. ➢ There are 2 modes to resume the jobs. ➢ Savepoint - unlike checkpoint which is periodically triggered by the system, savepoint is triggered by running commands. ➢ Even the storage format is different from checkpoints. ➢ External checkpoint - Extension of existing internal checkpoint. After storing internal checkpoint, it is stored specific directory. 38
  • 39. Savepoints vs checkpoints . . . ➢ The primary purpose of checkpoints is to provide a recovery mechanism in case of unexpected job failures. Checkpoint’s lifecycle is managed by flink. Checkpoints are dropped after job was terminated by user. They are lightweight. ➢ Savepoints are created, owned and deleted by user. Their use-case is for planned manual backup and resume. They are bit of expensive. Mostly used for flink version upgrade, changing job graph and for changing parallelism. 39
  • 40. More on checkpoints. . . ➢ We already know state is persisted upon checkpoints to avoid data loss and recover consistently. How the state is stored internally and how it is persisted upon checkpoints depends on state backend. ➢ 3 state backends available ○ Memory state backend ○ Fs state backend ○ Rocksdb state backend 40
  • 41. High Availability for standalone cluster ➢ Job manager coordinates flink deployment and it is responsible for scheduling the jobs and resource management. ➢ However, there is only one job manager per flink cluster creating a single point of failure. ➢ There will be a single leading job manager at any time and multiple standby job managers to take leadership incase leader fails. ➢ Flink uses zookeeper for distributed coordination between all running instances of job manager and leader election. 41
  • 42. 42
  • 43. 43 Limitations ➢ Maturity and community support is less in the industry than its predecessors but it is growing fastly. ➢ No known adaptation of Flink batch, only popular for streaming
  • 44. Use Cases ❏ Event driven Applications ❏ Data Analytics Applications ❏ Data Pipeline Applications 44
  • 45. 45
  • 46. 46
  • 47. 47
  • 48. 48 EXAMPLES Event driven applications ➢ Fraud detection ➢ Anomaly detection ➢ Rule Based Alerting ➢ Business Process Monitoring Data Analytics Applications ➢ Quality monitoring of Telecom networks ➢ Analysis of live data ➢ Analysis of product updates ➢ Large scale graph analysis Data Pipeline Applications ➢ Continuous ETL in E-commerce ➢ Real time search index building in E-commerce
  • 49. Use cases cont’d. . . Bouygues Telecom For real time event processing and analytics for billions of messages per day and real time customer insights. King - candy crush creator Real time analytics on 30 billion events generated everyday across 200 countries. Zalando E-commerce Uses flink for real time process monitoring, stream processing, business process monitoring. Otto - largest online retailer For user agent identification and identifying a search session via stream processing, Alibaba Group For online recommendations of the products based on the user activity. They developed Blink on top of the flink for this. Capital One - commercial banking Monitor customer data in real time, to detect customer issues, event correlation, anomaly identification 49