Make streaming processing towards ANSI SQL

Make Stream Processing
Towards ANSI SQL
Shaoxuan Wang
Alibaba Group
2018.6.20

Broadcom: High-Perf Platform
Facebook: Social Graph Storage
Alibaba Group: Real-Time Data Infra
Peking University: EECS
University of California at San Diego: Computer Engineer
Flink Committer since 2017
Shaoxuan Wang
Alibaba Group
wshaoxuan@gmail.com
shaoxuan@apache.org

01 ANSI SQL for Stream Processing
02 Blink SQL Engine
03 Blink SQL Optimization

01 ANSI SQL for Stream Processing

Optimized, Declarative, Understandable, Stable
Unify: One Query, Same Result
Why SQL?

Batch Processing: return one final result (correctness)
VS
Stream Processing: emit results as early as possible (real-time)
Stream processing emits intermediate results and keeps refining them to ensure correctness.
Batch versus Stream Processing

WHAT & HOW results are calculated: can be fully described by SQL
WHEN to emit an (intermediate) result: does not affect business logic
HOW to refine the results: can be solved by the SQL engine
ANSI SQL can Describe Stream Processing
Describing Stream Processing

[Figure: a Stream is applied to a Dynamic Table, and the Dynamic Table's Changelog is emitted as a Stream. Example records (user, clicks): (Mary, 1), (Bob, 1), (Mary, 2), (Liz, 1), (Bob, 2), (Mary, 3)]
Introducing Dynamic Table
Stream-Table Duality
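
The duality can be sketched in a few lines of Python. This is an illustration only, not Blink's implementation: `apply_changelog` and `to_changelog` are hypothetical helper names, and the changelog is assumed to be a sequence of plain `(user, clicks)` upserts keyed by user.

```python
def apply_changelog(stream):
    """Apply a stream of (key, value) upserts to build a dynamic table."""
    table = {}
    for key, value in stream:
        table[key] = value  # a later record for the same key overwrites the earlier one
    return table


def to_changelog(snapshots):
    """Emit every change between consecutive table snapshots as a stream record."""
    previous = {}
    for snapshot in snapshots:
        for key, value in snapshot.items():
            if previous.get(key) != value:
                yield (key, value)
        previous = dict(snapshot)


# The example records from the slide, read as a changelog stream.
stream = [("Mary", 1), ("Bob", 1), ("Mary", 2), ("Liz", 1), ("Bob", 2), ("Mary", 3)]
table = apply_changelog(stream)

# Going the other way: successive versions of a table yield a stream again.
snapshots = [{"Mary": 1}, {"Mary": 1, "Bob": 1}, {"Mary": 2, "Bob": 1}]
changes = list(to_changelog(snapshots))
```

Applying the stream gives the latest value per user, and diffing table versions gives back a stream, which is the sense in which the two views are interchangeable.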

Input Stream -> Dynamic Table clicks (user, url):
Mary, ./home
Bob, ./cart
Mary, ./prod?id=1
Liz, ./home
Mary, ./prod?id=7
Liz, ./prod?id=3
Continuous SQL Query:
SELECT
  user,
  COUNT(url) as cnt
FROM clicks
GROUP BY user
Dynamic Table result (user, cnt):
Mary, 3
Bob, 1
Liz, 2
Output Stream:
Mary, 1
Bob, 1
Mary, 2
Liz, 1
Mary, 3
Liz, 2
Continuous SQL Query on Dynamic Table
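
A minimal Python sketch of how such a continuous query could be evaluated, assuming one updated row is emitted per input record. `continuous_count` is a hypothetical name used for illustration; it is not a Blink or Flink API.

```python
from collections import defaultdict


def continuous_count(clicks):
    """Continuously evaluate SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user.

    Yields one updated (user, cnt) row for every input record that changes a count.
    """
    counts = defaultdict(int)  # per-key state kept by the operator
    for user, url in clicks:
        if url is not None:  # COUNT(url) ignores NULL values
            counts[user] += 1
            yield (user, counts[user])


clicks = [
    ("Mary", "./home"),
    ("Bob", "./cart"),
    ("Mary", "./prod?id=1"),
    ("Liz", "./home"),
    ("Mary", "./prod?id=7"),
    ("Liz", "./prod?id=3"),
]
output_stream = list(continuous_count(clicks))
```

With the click records from the slide as input, `output_stream` reproduces the output stream shown there, and the last row emitted per user is the content of the result table.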

Incorrect! This value should be 2
Retraction for Refinement
Result Refinement can be very Complex

Introducing Retraction
Retraction is not cost-free:
1. Events are doubled
2. Operators can become complex when they have to handle retraction (e.g. max/min aggregate)
You should not have to reason about retraction. Just write simple queries, and the SQL engine will ensure correctness.
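
To see why retraction is needed, consider a nested aggregation: count clicks per user, then count how many users have each click count. The Python sketch below is an assumption-laden illustration, not the engine's code. The names `count_per_user` and `users_per_count` and the `"+"`/`"-"` markers are hypothetical.

```python
from collections import defaultdict


def count_per_user(clicks):
    """Inner aggregate: emits ('-', user, old_cnt) before ('+', user, new_cnt)."""
    counts = {}
    for user in clicks:
        old = counts.get(user)
        if old is not None:
            yield ("-", user, old)  # retract the row that was emitted earlier
        counts[user] = (old or 0) + 1
        yield ("+", user, counts[user])


def users_per_count(changes, honor_retraction=True):
    """Outer aggregate: number of users for each click count."""
    freq = defaultdict(int)
    for op, _user, cnt in changes:
        if op == "+":
            freq[cnt] += 1
        elif honor_retraction:
            freq[cnt] -= 1
    return {cnt: n for cnt, n in freq.items() if n != 0}


clicks = ["Mary", "Bob", "Mary", "Liz", "Mary", "Liz"]
changes = list(count_per_user(clicks))
with_retraction = users_per_count(changes)
without_retraction = users_per_count(changes, honor_retraction=False)
```

With retraction the outer result is one user per count (1, 2 and 3 clicks). Without it, stale intermediate rows are counted again and the result is wrong. The cost is also visible: six input records become nine change messages.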

Introducing Alibaba Blink
Blink 1.0:
enterprise edition of Flink with lots of improvements designed by Alibaba
(Apache Flink + Alibaba’s Improvements)
Blink 2.0:
a new unified high performance compute engine for complete data applications
Introducing Blink

[Figure: layered architecture and query pipeline]
Layers:
SQL & Table API (Relational)
Query Processor: Query Optimizer & Query Executor
Runtime: DAG API & Operators
Deployment: Local (Single JVM), Cluster (Standalone, YARN), Cloud (GCE, EC2)
Pipeline: SQL & Table API -> Logical Plan -> Optimizer -> Physical Plan -> Execution DAG
Annotations: some stages are exactly the same for batch & stream processing; in others, stream processing has some unique designs
Same SQL Query in Batch Mode and Stream Mode: Same Results
Architecture of Blink SQL Engine

• ANSI SQL
• Major data types (numeric, varchar, binary, decimal, array, map)
• UDF/UDTF/UDAF
• Support all types of join (inner/left/right/full/semi/anti)
• Support over window, grouping window (tumbling, sliding, session)
• Various subquery supported (correlated/uncorrelated)
• Advanced analysis (grouping set, cube, rollup…)
Stream processing with Blink SQL fully passes TPCH, and the results are
the same as with batch processing
Blink SQL Functionalities

Predicate, Projection push-down
Sort related rules
State (MapState/ValueState)
Retraction
EMIT SLA -> MicroBatch
Join Reorder
VS
Same as batch
Collect stats in different ways
Stream has unique design
Not useful for stream
Challenges & Opportunities for Stream Processing

[Figure: alternative plans for joining Customer (150 million) with Order (1.5 billion) on custID. Labels shown: HashJoin; ValueState, MapState, CountAgg; ValueState, ValueState, CountAgg, CountAgg, 100 million; PK: custID, PK: orderID. Speedup shown: 25x]
Stream Processing TPCH13:
StateIO-Cost Plays a Big Role on Plan Choosing

[Figure: two plans over lineitem. Before: lineitem -> Agg (Sum) -> Calc -> Agg (MaxWithRetract). After: lineitem -> Agg (Sum) -> Calc -> Agg (Max). Annotations: Input value is unsigned type; Result of sum is ascending; 15x]
Stream Processing TPCH15:
Removing Retraction Operation can Significantly Improve Performance
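
The difference between the two max operators can be sketched as follows. This is a simplified Python illustration under stated assumptions, not Blink's operator code: the class names mirror the labels on the slide, but their methods (`accumulate`, `retract`, `result`) are hypothetical.

```python
from collections import Counter


class MaxWithRetract:
    """Max aggregate that supports retraction: it must remember every live value."""

    def __init__(self):
        self.values = Counter()  # value -> multiplicity, so state grows with the input

    def accumulate(self, v):
        self.values[v] += 1

    def retract(self, v):
        self.values[v] -= 1
        if self.values[v] <= 0:
            del self.values[v]

    def result(self):
        return max(self.values) if self.values else None


class Max:
    """Append-only max: a single value of state is enough."""

    def __init__(self):
        self.current = None

    def accumulate(self, v):
        if self.current is None or v > self.current:
            self.current = v

    def result(self):
        return self.current


# Ascending updates of one SUM over non-negative inputs (made-up numbers).
running_sums = [3, 7, 12]
with_retract, plain = MaxWithRetract(), Max()
previous = None
for s in running_sums:
    if previous is not None:
        with_retract.retract(previous)  # the upstream SUM retracts its old result
    with_retract.accumulate(s)
    plain.accumulate(s)  # the retraction is simply ignored here
    previous = s
```

Because every new sum is at least as large as the one it replaces, ignoring the retraction cannot change the maximum, so the cheaper `Max` gives the same answer as `MaxWithRetract`.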

[Figure: Simple Aggregation versus (forwarding) Local-Global Aggregation, SUM example]
Local-Global Agg to Improve Data Skew
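
A minimal sketch of local-global aggregation for SUM, written in Python for illustration. The function names and the sample records are assumptions. They are not taken from the slide or from Blink.

```python
from collections import defaultdict


def local_agg(partition):
    """Pre-aggregate (key, value) records inside one upstream task."""
    partial = defaultdict(int)
    for key, value in partition:
        partial[key] += value
    return list(partial.items())  # at most one record per key leaves the task


def global_agg(partials):
    """Merge the partial sums that arrive after the shuffle."""
    total = defaultdict(int)
    for key, value in partials:
        total[key] += value
    return dict(total)


# Made-up, skewed input: key "A" dominates every partition.
partitions = [
    [("A", 1), ("A", 3), ("A", 2), ("B", 7)],
    [("A", 1), ("A", 4), ("B", 3)],
    [("A", 1), ("A", 3), ("A", 2)],
]
shuffled = [rec for p in partitions for rec in local_agg(p)]
result = global_agg(shuffled)
```

The task that owns the hot key "A" receives three partial sums instead of eight raw records, which is how the local stage reduces the effect of skew.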

[Figure: (forwarding) Local-Global Aggregation versus (keyed-shuffle) Local-Global Aggregation for Count Distinct. Stages: Map -> Local Agg -> Global Agg. Example records are (key, value) pairs with keys A and B, where key A is far more frequent]
Local-Global Agg to Improve Data Skew
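
One way to read the keyed-shuffle variant for COUNT DISTINCT is a two-stage plan: shuffle by (key, bucket of the distinct value), count distinct values per bucket, then sum the partial counts per key. The Python sketch below follows that reading. It is an assumption about the technique, not a description of Blink's actual operators, and `NUM_BUCKETS`, `bucket_of` and the sample records are hypothetical.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count


def bucket_of(value):
    """Assign a value to a bucket (assumes integers; an engine would hash)."""
    return value % NUM_BUCKETS


def partial_distinct(records):
    """Stage 1: distinct values per (key, bucket); buckets hold disjoint values."""
    seen = defaultdict(set)
    for key, value in records:
        seen[(key, bucket_of(value))].add(value)
    return {kb: len(vals) for kb, vals in seen.items()}


def final_distinct(partials):
    """Stage 2: sum the partial distinct counts per key."""
    totals = defaultdict(int)
    for (key, _bucket), n in partials.items():
        totals[key] += n
    return dict(totals)


records = [
    ("A", 1), ("A", 2), ("A", 2), ("A", 3), ("A", 4), ("A", 4),
    ("B", 1), ("B", 2), ("B", 2),
]
result = final_distinct(partial_distinct(records))
```

Since each distinct value falls into exactly one bucket, the partial counts can be added without double counting, and the work for a hot key is spread over several (key, bucket) groups.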

SQL Query Optimizer: 10x, 100x, ...
SQL Query Executor: 10x
State Storage Engine: 10x
Runtime – OS: <10x
Resource Conf: <10x
Performance Tuning for Stream Processing

Structured Streaming @
Processing 100s of billions of records/hour
1000s of customer streaming apps
in production on Alicloud
Largest app has 1000s of subtasks
and 10s of TB state
Blink Platform (e.g. Alicloud StreamCompute)

• Stream processing can be described by ANSI SQL
• Alibaba Blink SQL follows ANSI SQL
• SQL Optimization of stream processing faces new challenges
and opportunities
• Alibaba Blink Platform (e.g. Alicloud StreamCompute)
operates the world's largest stream processing businesses
Take Away

Thanks
Shaoxuan Wang
wshaoxuan@gmail.com
shaoxuan@apache.org
2018.6.20
We are Hiring!
Hangzhou / Beijing, China
Seattle / Bay Area, US
blink-jobs@list.alibaba-inc.com
