Introduction to Stream Processing

BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF
HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH
Introduction to
Stream Processing
Guido Schmutz
Liverpool - 5.12.2018
@gschmutz guidoschmutz.wordpress.com

Agenda
Introduction to Stream Processing
1. Motivation for Stream Processing?
2. Capabilities for Stream Processing
3. Implementing Stream Processing Solutions
4. Summary

Guido Schmutz
Working at Trivadis for more than 22 years
Oracle Groundbreaker Ambassador & Oracle ACE Director
Consultant, Trainer Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
135th edition

Motivation for Stream Processing?

Why not using Traditional BI Infrastructures?
Enterprise Data
Warehouse
ETL / Stored
Procedures
Bulk Source
DB
Extract
File
DB
BI Tools
Search / Explore
Enterprise Apps
Logic
{ }
API
high latency

Bulk Source
Hadoop Clusterd
Hadoop Cluster
Big Data Platform
BI Tools
Enterprise Data
Warehouse
SQL
Search / Explore
Search
SQL
Export
Service
Parallel
Processing
Storage
Storage
RawRefined
Results
high latency
Enterprise Apps
Logic
{ }
API
File Import / SQL Import
DB
Extract
File
DB
Big Data solves Volume and Variety – not Velocity

Bulk Source
Hadoop Clusterd
Hadoop Cluster
Big Data Platform
BI Tools
Enterprise Data
Warehouse
SQL
Search / Explore
Search
SQL
Export
Service
Parallel
Processing
Storage
Storage
RawRefined
Results
high latency
Enterprise Apps
Logic
{ }
API
DB
Extract
File
DB
Event Source
Location
Telemetry
IoT
Data
Mobile
Apps
Social
Event Stream

Bulk Source
Hadoop Clusterd
Hadoop Cluster
Big Data Platform
BI Tools
Enterprise Data
Warehouse
SQL
Search / Explore
Search
SQL
Export
Service
• Machine Learning
• Graph Algorithms
• Natural Language Processing
Parallel
Processing
Storage
Storage
RawRefined
Results
high latency
Enterprise Apps
Logic
{ }
API
DB
Extract
File
DB
Event Stream
Event Source
Location
IoT
Data
Mobile
Apps
Social
Event
Hub
Event
Hub
Event
Hub
Telemetry

"Data at Rest" vs. "Data in Motion"
Data at Rest Data in Motion
Store
Act
Analyze
StoreAct
Analyze
11101
01010
10110
11101
01010
10110

When to Stream / When not?
Constant low
Milliseconds & under
Low milliseconds to seconds,
delay in case of failures
10s of seconds of more,
Re-run in case of failures
Real-Time Near-Real-Time Batch
Source: adapted from Cloudera

"No free lunch"
Constant low
Milliseconds & under
Low milliseconds to seconds,
delay in case of failures
10s of seconds of more,
Re-run in case of failures
Real-Time Near-Real-Time Batch
"Difficult" architectures, lower latency "Easier architectures", higher latency

Event
Hub
Event
Hub
Hadoop Clusterd
Hadoop Cluster
Stream Analytics
Platform
Stream Processing Architecture solves Velocity
BI Tools
Enterprise Data
Warehouse
Event
Hub
SQ
L
Search / Explore
Enterprise Apps
Search
ServiceResults
Stream Analytics
Reference /
Models
Dashboard
Logic
{ }
API
Event
Stream
Event
Stream
Event
Stream
Bulk Source
Event Source
Location
DB
Extract
File
DB
IoT
Data
Mobile
Apps
Social
Low(est) latency, no history
Telemetry

Hadoop Clusterd
Hadoop Cluster
Stream Analytics
Platform
Big Data for all historical data analysis
BI Tools
Enterprise Data
Warehouse
SQ
L
Search / Explore
Enterprise Apps
Search
ServiceResults
Stream Analytics
Reference /
Models
Dashboard
Logic
{ }
API
Event
Stream
Event
Stream
Hadoop Clusterd
Hadoop Cluster
Big Data Platform
Parallel
Processing
Storage
Storage
RawRefined
Results
Data FlowEvent
Hub
Event
Stream
Bulk Source
Event Source
Location
DB
Extract
File
DB
IoT
Data
Mobile
Apps
Social
Telemetry

Hadoop Clusterd
Hadoop Cluster
Stream Analytics
Platform
Integrate existing systems with lower latency through CDC
BI Tools
Enterprise Data
Warehouse
SQ
L
Search / Explore
Enterprise Apps
Search
ServiceResults
Stream Analytics
Reference /
Models
Dashboard
Logic
{ }
API
Hadoop Clusterd
Hadoop Cluster
Big Data Platform
Parallel
Processing
Storage
Storage
RawRefined
Results
Event
Stream
Event
Stream
Data FlowEvent
Hub
Event
Stream
Bulk Source
Event Source
Location
DB
Extract
File
DB
IoT
Data
Mobile
Apps
Social
Change Data
Capture
Telemetry

Data Store
Integrate existing systems through CDC
Data
Event Hub
Integration
Consuming Systems
StateLogic
CDC
CDC Connector
Traditional Silo-based
System
LogicUser Interface
Capture changes directly on database
Change Data Capture (CDC) => think like
a global database trigger
Transform existing systems to event
producer
Event
Stream
Event
Stream

New systems participate in event-oriented fashion
Hadoop Clusterd
Hadoop Cluster
Big Data Platform
Parallel
Processing
Storage
Storage
RawRefined
Results
Microservice Platform
Microservice State
{ }
API
Stream Analytics Platform
Stream
Processor
State
{ }
API
Event
Stream
SQL
Search
Service
BI Tools
Enterprise Data
Warehouse
Search / Explore
SQL
Export
Search
Service
Enterprise Apps
Logic
{ }
API
Event
Stream
Data FlowEvent
Hub
Event
Stream
Bulk Source
Event Source
Location
DB
Extract
File
DB
IoT
Data
Mobile
Apps
Social
Change Data
Capture
Event
Stream
Event
Stream
Telemetry

Edge computing allows processing close to data sources
Hadoop Clusterd
Hadoop Cluster
Big Data Platform
Parallel
Processing
Storage
Storage
RawRefined
Results
Microservice Platform
Microservice State
{ }
API
Stream Analytics Platform
Stream
Processor
State
{ }
API
SQL
Search
Service
BI Tools
Enterprise Data
Warehouse
Search / Explore
SQL
Export
Search
Service
Enterprise Apps
Logic
{ }
API
Bulk Source
Event Source
Location
DB
Extract
File
DB
IoT
Data
Mobile
Apps
Social
Edge Node
Change DataCapture
D
ata
Flow
Event
Hub
Data Flow
Event
Stream
Event
Stream
Event Stream
Telemetry
Rules
Event Hub
Storage

Hadoop Clusterd
Hadoop Cluster
Big Data
Unified Architecture for Modern Data Analytics Solutions
SQL
Search
Service
BI Tools
Enterprise Data
Warehouse
Search / Explore
Event
Hub
D
ata
Flow
D
ata
Flow
Change DataCapture Parallel
Processing
Storage
Storage
RawRefined
Results
SQL
Export
Microservice State
{ }
API
Stream
Processor
State
{ }
API
Event
Stream
Event
Stream
Search
Service
Stream Analytics
Microservices
Enterprise Apps
Logic
{ }
API
Edge Node
Rules
Event Hub
Storage
Bulk Source
Event Source
Location
DB
Extract
File
DB
IoT
Data
Mobile
Apps
Social
Event Stream
Telemetry

Two Types of Stream Processing
(from Gartner)
Stream Data Integration
• primarily focuses on the ingestion and
processing of data sources targeting real-
time extract-transform-load (ETL) and data
integration use cases
• filter and enrich the data
• optionally calculate time-windowed
aggregations before storing the results in a
database or file system
Stream Analytics
• targets analytics use cases
• calculating aggregates and detecting
patterns to generate higher-level, more
relevant summary information (complex
events)
• Complex events may signify threats or
opportunities that require a response from
the business through real-time dashboards,
alerts or decision automation

Sample Use Case
Truck-2
Truck-1
Truck-3
truck_
position
detect_danger
ous_driving
Truck
Driver
Change Data
Capture
join_dangerous_driv
ing_driver
dangerous_dri
ving_driver
Count By Event Type
Window (1m, 30s)
count_by_event
_type

Important Capabilities for Stream
Processing

Stream vs. Table / Static
Stream
• “History”
• an unbounded sequence of structured
data ("facts")
• Facts in a stream are immutable
Table / Static
• “State”
• a view of a stream, or another table,
and represents a collection of evolving
facts
• Latest value for each key in a stream
• Facts in a table are mutable

Streaming Model: Event-at-a-time vs. Micro Batch
Event-at-a-time Processing
• Events processed as they
arrive
• low-latency
• fault tolerance expensive
Micro-Batch Processing
• Splits incoming stream in
small batches
• Fault tolerance easier
• Better throughput

Fault Tolerance: Delivery Guarantees
• At most once (fire-and-forget) - means the message is sent, but the sender
doesn’t care if it’s received or lost. Imposes no additional overhead to ensure
message delivery, such as requiring acknowledgments from consumers. Hence, it
is the easiest and most performant behavior to support.
• At least once - means that retransmission of a message will occur until an
acknowledgment is received. Since a delayed acknowledgment from the receiver
could be in flight when the sender retransmits the message, the message may be
received one or more times.
• Exactly once - ensures that a message is received once and only once, and is
never lost and never repeated. The system must implement whatever
mechanisms are required to ensure that a message is received and processed
just once
[ 0 | 1 ]
[ 1+ ]
[ 1 ]

API
Compositional / Programmatic
• Manual topology definition and
optimization
Declarative
• High-Level, fluent API
• Higher order function as operators
(filter, mapWithState …)
SQL
• Query language allowing to use
stream in FROM clause
• Extensions supporting windowing,
pattern matching, spatial, ….
Operators
val filteredDf =
truckPosDf.where(
"eventType !='Normal'")
SELECT * FROM truck_position_s
WHERE eventType != ’Normal’
public class MyProcessor
implements Processor<String, String>
{ ... }

Windowing
Computations over events done using
windows of data
Due to size and never-ending nature of it,
it’s not feasible to keep entire stream of
data in memory
A window of data represents a certain
amount of data where we can perform
computations on
Windows give the power to keep a
working memory and look back at recent
data efficiently
Time
Stream of Data Window of Data

Sliding Window (aka Hopping
Window) - uses eviction and
trigger policies that are based on
time: window length and sliding
interval length
Fixed Window (aka Tumbling
Window) - eviction policy always
based on the window being full and
trigger policy based on either the
count of items in the window or
time
Session Window – composed of
sequences of temporarily related
events terminated by a gap of
inactivity greater than some
timeout
Windowing
Time TimeTime

Joining
Challenges of joining streams
1. Data streams need to be aligned as they
come because they have different timestamps
2. since streams are never-ending, the joins
must be limited; otherwise join will never end
3. join needs to produce results continuously as
there is no end to the data
Stream to Static (Table) Join
Stream to Stream Join (one window join)
Stream to Stream Join (two window join)
Stream-to-
Static Join
Stream-to-
Stream
Join
Stream-to-
Stream
Join
Time
Time
Time

Pattern Detection
• Streaming Data often contains interesting patterns that only emerge as new
streaming data arrives, e.g.
• Absence Pattern: event A not followed by event B within time window
• Sequence Pattern: event A followed by event B followed by event C
• Increasing Pattern: up trend of a value of a certain attribute
• Decreasing Pattern: down trend of a value of a certain attribute
• …
• Pattern operators allow developers to define complex relationships between
streaming events

State Management
Necessary if stream processing use case
is dependent on previously seen data or
external data
Windowing, Joining and Pattern
Detection use State Management behind
the scenes
State Management services can be made
available for custom state handling logic
State needs to be managed as close to
the stream processor as possible
Options for State Management
How does it handle failures? If a machine
crashes and the/some state is lost?
In-Memory
Replicated,
Distributed
Store
Local,
Embedded
Store
Operational Complexity and Features
Low high

Queryable State (aka. Interactive Queries)
Exposes the state managed by the
Stream Analytics solution to the
outside world
Allows an application to query the
managed state, i.e. to visualize it
For some scenarios, Queryable State
can eliminate the need for an external
database to keep results
Stream Processing Cluster
Reference
Data
Stream Analytics
{ }
Query API
State
Stream
Processor
Search /
Explore
Online &
Mobile Apps
Model
Dashboard

Event Time vs. Ingestion Time vs. Processing Time
Event time
• the time at which events actually occurred
Ingestion time / Processing Time
• the time at which events are ingested /
observed in the system
Not all use cases care about event times but many do
Examples
• characterizing user behavior over time
• most billing applications
• anomaly detection

Capabilities: Stream Data Integration vs. Stream Analytics
Stream Data Integration Stream Analytics
Support for Various Data Sources Yes (yes)
Streaming ETL (Transformation, Routing, Validation) yes (yes)
Format Translation yes (yes)
Micro-Batching yes (yes)
Auto Scaling yes
Backpressure Support yes -
Processing Time vs. Event Time - yes
Stream-to-Static Joins (Lookup/Enrichment) yes yes
Stream-to-Stream Joins - yes
Windowing - yes
State Management - yes
Queryable State (aka Interactive Queries) - yes
Streaming SQL (yes) yes
Event Pattern Matching / Detection - Yes

Implementing Stream Processing
Solutions

Stream Processing & Analytics Ecosystem
Stream Analytics
Event Hub
Open Source Closed Source
Stream Data Integration
Source: adapted from Tibco
Edge

Summary

Summary
Stream Processing is the solution for low-latency
Event Hub, Stream Data Integration and Stream Analytics are the main building
blocks in your architecture
Kafka is currently the de-facto standard for Event Hub
Various options exists for Stream Data Integration and Stream Analytics
SQL becomes a valid option for implementing Stream Analytics
Still room for improvements (SQL, Event Pattern Detection, Streaming Machine
Learning)

Technology on its own won't help you.
You need to know how to use it properly.

Introduction to Stream Processing

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a Introduction to Stream Processing

Similar a Introduction to Stream Processing (20)

Más de Guido Schmutz

Más de Guido Schmutz (9)

Último

Último (20)

Introduction to Stream Processing