SQL is the most widely used language for data processing. It allows users to declare their business logic concisely and easily. Data analysts usually do not have a deep software programming background, but they can write SQL and use it on a regular basis to analyze data and drive business decisions. Apache Flink is one of the streaming engines that support SQL. Besides Flink, some other stream processing frameworks, like Kafka and Spark Structured Streaming, have SQL-like DSLs, but they do not have the same semantics as Flink. Flink's SQL implementation follows the ANSI SQL standard while the others do not.
In this talk, we will present why following the ANSI SQL standard is an essential characteristic of Flink SQL and how we achieved it. The core business of Alibaba is now fully driven by its data processing engine, Blink, a project based on Flink with Alibaba's improvements. About 90% of Blink jobs are written in Flink SQL. We will show the use cases and our experience of running large-scale Flink SQL jobs at Alibaba.
Speakers
Shaoxuan Wang, Senior Engineering Manager, Alibaba
Xiaowei Jiang, Senior Director, Alibaba
2. Shaoxuan Wang
Alibaba Group
wshaoxuan@gmail.com
shaoxuan@apache.org
Flink Committer since 2017
Broadcom: High-Perf Platform
Facebook: Social Graph Storage
Alibaba Group: Real-Time Data Infra
Peking University: EECS
University of California at San Diego: Computer Engineer
6. Batch versus Stream Processing
Batch processing: return one final result (correctness)
Stream processing: emit results as early as possible (real-time)
Stream processing emits intermediate results and keeps refining them to ensure correctness.
7. ANSI SQL Can Describe Stream Processing
Describing a stream processing job:
WHAT & HOW results are calculated: can be fully described by SQL
WHEN to emit an (intermediate) result: does not affect business logic
HOW to refine the results: can be solved by the SQL engine
8. Introducing Dynamic Table
Stream-Table Duality
[Figure: a Stream and a Dynamic Table are linked by "Apply" and "Changelog". Example rows (user, clicks): Mary 1, Bob 1, Mary 2, Liz 1, Bob 2, Mary 3]
9. Continuous SQL Query on Dynamic Table
Input stream, viewed as the dynamic table clicks (user, url):
Mary, ./home
Bob, ./cart
Mary, ./prod?id=1
Liz, ./home
Mary, ./prod?id=7
Liz, ./prod?id=3
Continuous SQL query:
SELECT
  user,
  COUNT(url) as cnt
FROM clicks
GROUP BY user
Output stream, viewed as the dynamic table result (user, cnt):
Mary, 1
Bob, 1
Mary, 2
Liz, 1
Mary, 3
Liz, 2
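The continuous query on this slide can be imitated in a few lines of Python. This is only an illustrative sketch of the idea, not how Flink or Blink executes the query: the function name `continuous_count` is hypothetical, and the result table is assumed to fit in an in-memory dictionary.

```python
from collections import defaultdict

def continuous_count(clicks):
    """Imitate SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user
    as a continuous query: yield the updated row after every input event."""
    counts = defaultdict(int)  # stands in for the dynamic result table
    for user, url in clicks:
        if url is not None:    # COUNT(url) ignores NULL values
            counts[user] += 1
            yield (user, counts[user])

# Input stream taken from the slide.
clicks = [
    ("Mary", "./home"),
    ("Bob", "./cart"),
    ("Mary", "./prod?id=1"),
    ("Liz", "./home"),
    ("Mary", "./prod?id=7"),
    ("Liz", "./prod?id=3"),
]

output_stream = list(continuous_count(clicks))
for row in output_stream:
    print(row)
```

Each input event refines the row for one user, which is why Mary appears three times in the output stream with counts 1, 2 and 3.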
11. Introducing Retraction
Retraction is not cost-free:
1. Events are doubled
2. Operators become more complex when they have to handle retraction (e.g. max/min aggregates)
You should not have to reason about retraction. Just write simple queries; the SQL engine will ensure correctness.
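To make the cost and the benefit of retraction concrete, here is a minimal Python sketch of two chained aggregates. It is an assumption-laden illustration, not Blink's implementation: the function names, the `"+"`/`"-"` message encoding and the downstream query (number of users per click count) are all hypothetical.

```python
from collections import defaultdict

def count_per_user(clicks):
    """Upstream aggregate: clicks per user, emitted as a changelog.
    Before emitting a new count it retracts the previous one."""
    counts = defaultdict(int)
    for user, _url in clicks:
        old = counts[user]
        if old > 0:
            yield ("-", user, old)      # retraction of the stale result
        counts[user] = old + 1
        yield ("+", user, old + 1)      # the refined result

def users_per_count(changelog):
    """Downstream aggregate: how many users have each click count.
    It stays correct only because it also consumes the retractions."""
    freq = defaultdict(int)
    for sign, _user, cnt in changelog:
        freq[cnt] += 1 if sign == "+" else -1
        if freq[cnt] == 0:
            del freq[cnt]
    return dict(freq)

clicks = [
    ("Mary", "./home"), ("Bob", "./cart"), ("Mary", "./prod?id=1"),
    ("Liz", "./home"), ("Mary", "./prod?id=7"), ("Liz", "./prod?id=3"),
]

changelog = list(count_per_user(clicks))
result = users_per_count(changelog)
print(len(clicks), "input events ->", len(changelog), "changelog messages")
print(result)
```

In this sketch every update after the first one for a key produces two messages, which is the "events are doubled" cost. Without the `"-"` messages the downstream aggregate would keep counting Mary under her old counts as well as her new one.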
13. Introducing Alibaba Blink
Blink 1.0: an enterprise edition of Flink with lots of improvements designed by Alibaba (Apache Flink plus Alibaba's improvements)
Blink 2.0: a new unified high-performance compute engine for complete data applications
14. Architecture of Blink SQL Engine
Stack: SQL & Table API (Relational); Query Processor (Query Optimizer & Query Executor); Runtime (DAG API & Operators); deployment on Local (Single JVM), Cluster (Standalone, YARN) or Cloud (GCE, EC2)
Query flow: SQL & Table API, Logical Plan, Physical Plan, Execution DAG, with the Optimizer in between
Diagram notes: some stages are completely the same between batch and stream processing; in others, stream processing has some unique design
The same SQL query gives the same results in batch mode and in stream mode
15. Blink SQL Functionalities
• ANSI SQL
• Major data types (numeric, varchar, binary, decimal, array, map)
• UDF/UDTF/UDAF
• Supports all types of join (inner/left/right/full/semi/anti)
• Supports over windows and grouping windows (tumbling, sliding, session)
• Supports various subqueries (correlated/uncorrelated)
• Advanced analysis (grouping sets, cube, rollup…)
Stream processing with Blink SQL fully passes TPC-H, and the results are the same as in batch processing
17. Challenges & Opportunities for Stream Processing
Batch Processing vs. Stream Processing
Optimizations compared: predicate and projection push-down; sort-related rules; state (MapState/ValueState); retraction; EMIT SLA -> MicroBatch; join reordering
Labels on the slide: same as batch; collect stats in different ways; stream has unique design; not useful for stream
18. Stream Processing TPCH13: State IO Cost Plays a Big Role in Plan Choosing
[Figure: batch and stream processing plans for joining Customer (150 million rows) with Order (1.5 billion rows) on custID. The batch plan uses a HashJoin. The stream plans combine the join with CountAgg and keep their inputs in ValueState or MapState, keyed by PK:custID and PK:orderID. Annotations on the slide: 25x, 100 million]
22. Performance Tuning for Stream Processing
SQL Query Optimizer: 10x, 100x, …
SQL Query Executor: 10x
State Storage Engine: 10x
Runtime – OS: <10x
Resource Conf: <10x
23. Blink Platform (e.g. Alicloud StreamCompute)
Structured Streaming @
Processing 100s of billions of records/hour
1000s of customer streaming apps in production on Alicloud
Largest app has 1000s of subtasks and 10s of TB of state
24. Take Away
• Stream processing can be described by ANSI SQL
• Alibaba Blink SQL follows ANSI SQL
• SQL optimization of stream processing faces new challenges and opportunities
• Alibaba Blink Platform (e.g. Alicloud StreamCompute) operates the world's largest stream processing businesses