Streaming SQL for Data Engineers: The Next Big Thing?
Streaming SQL Products
Open source and vendor solutions
● Apache Flink
● Apache Spark
● Apache Beam
● AWS Kinesis
● Google Cloud Dataflow
● Databricks
● ksqlDB
● …
Companies building internal platforms
● Meta
● LinkedIn
● Pinterest
● DoorDash
● Alibaba
● …
👋 Hi, I’m Yaroslav
● Principal Software Engineer @ Goldsky
● Staff Data Engineer @ Shopify
● Software Architect @ Activision
● …
❤ Apache Flink
🤔
TableEnvironment tableEnv = TableEnvironment.create(/*…*/);
Table revenue = tableEnv.sqlQuery(
"SELECT cID, cName, SUM(revenue) AS revSum " +
"FROM Orders " +
"WHERE cCountry = 'FRANCE' " +
"GROUP BY cID, cName"
);
… but why SQL?
Why SQL?
● Wide adoption
● Declarative transformation model
● Planner!
● Common type system
What instead of How
Imperative Style: User Intention → Execution Runtime
Declarative SQL Style: User Intention → Planning (Planner) → Execution Runtime
SQL API
SELECT * FROM Orders
INNER JOIN Product
ON Orders.productId = Product.id
DataStream API
● LOTS of code!
● Create an operator to connect two streams
● Define and accumulate state
● Implement a mechanism for emitting the latest value per key
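To make the DataStream-side work concrete, here is a minimal plain-Java sketch (no Flink dependency) of what a hand-written streaming join has to do: accept events from two inputs, keep state per key, and emit joined rows with the latest value per key. The class and method names are illustrative assumptions, not Flink APIs; a real job would use a two-input operator with Flink-managed keyed state, and this sketch does not emit retractions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of a hand-written streaming join (illustrative, not a Flink API). */
final class StreamingJoinSketch {
    // State: every order id seen so far, grouped by productId.
    private final Map<String, List<String>> ordersByProduct = new HashMap<>();
    // State: the latest product name seen for each product id.
    private final Map<String, String> latestProduct = new HashMap<>();

    /** Handles an event from the Orders input and returns the rows to emit. */
    List<String> onOrder(String orderId, String productId) {
        ordersByProduct.computeIfAbsent(productId, k -> new ArrayList<>()).add(orderId);
        List<String> out = new ArrayList<>();
        String productName = latestProduct.get(productId);
        if (productName != null) {
            out.add(orderId + "|" + productName);
        }
        return out;
    }

    /** Handles an event from the Product input and re-emits buffered orders with the new value. */
    List<String> onProduct(String productId, String productName) {
        latestProduct.put(productId, productName);
        List<String> out = new ArrayList<>();
        for (String orderId : ordersByProduct.getOrDefault(productId, new ArrayList<>())) {
            out.add(orderId + "|" + productName);
        }
        return out;
    }

    public static void main(String[] args) {
        StreamingJoinSketch join = new StreamingJoinSketch();
        System.out.println(join.onOrder("o1", "p1"));         // [] (no product seen yet)
        System.out.println(join.onProduct("p1", "Keyboard")); // [o1|Keyboard]
        System.out.println(join.onOrder("o2", "p1"));         // [o2|Keyboard]
    }
}
```

Even this toy version needs two state structures and two event handlers, which is the code the three-line SQL join hides.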
Declarative Transformation Model
SQL API
SELECT * FROM Orders
INNER JOIN Product
ON Orders.productId = Product.id
Why not Table API?
val orders = tEnv.from("Orders")
.select($"productId", $"a", $"b")
val products = tEnv.from("Products")
.select($"id", $"c", $"d")
val result = orders
.join(products)
.where($"productId" === $"id")
.select($"a", $"b", $"c")
Declarative Transformation Model
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ticker
ORDER BY price DESC) AS row_num
FROM stock_table)
WHERE row_num <= 10;
Top-N Query
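For comparison, a rough plain-Java sketch of the state a hand-written version of this Top-N query would maintain: a bounded, sorted list of prices per ticker, updated on every row. The class name and the use of in-memory maps are assumptions for illustration; this is not how Flink implements the query.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Keeps the N highest prices per ticker (illustrative sketch, not Flink code). */
final class TopNSketch {
    private final int n;
    // State: current top-N prices per ticker, highest first.
    private final Map<String, List<Long>> topPrices = new HashMap<>();

    TopNSketch(int n) {
        this.n = n;
    }

    /** Adds one row and returns the current top-N prices for its ticker. */
    List<Long> onRow(String ticker, long price) {
        List<Long> prices = topPrices.computeIfAbsent(ticker, k -> new ArrayList<>());
        prices.add(price);
        prices.sort(Collections.reverseOrder());
        if (prices.size() > n) {
            prices.remove(prices.size() - 1); // drop the lowest price (removal by index)
        }
        return new ArrayList<>(prices);
    }

    public static void main(String[] args) {
        TopNSketch top2 = new TopNSketch(2);
        System.out.println(top2.onRow("ACME", 10)); // [10]
        System.out.println(top2.onRow("ACME", 30)); // [30, 10]
        System.out.println(top2.onRow("ACME", 20)); // [30, 20]
    }
}
```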
Declarative Transformation Model
Row Pattern Recognition in SQL
(ISO/IEC TR 19075-5:2016)
SELECT *
FROM stock_table
MATCH_RECOGNIZE(
PARTITION BY ticker
ORDER BY event_time
MEASURES
A.event_time AS initialPriceTime,
C.event_time AS dropTime,
A.price - C.price AS dropDiff,
A.price AS initialPrice,
C.price AS lastPrice
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (A B* C) WITHIN INTERVAL '10' MINUTES
DEFINE
B AS B.price > A.price - 500
)
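One way to read the pattern above, sketched in plain Java for a single ticker: start at a row A, skip rows B while the price stays above A.price - 500, and match when a row C at or below that level arrives within 10 minutes. This rests on the assumption that C, having no DEFINE condition, ends up being the first row that fails B's condition; the strictness of the WITHIN boundary is also an assumption. The class and method names are hypothetical, and this is not how Flink evaluates the clause (Flink builds MATCH_RECOGNIZE on its CEP library).

```java
import java.util.ArrayList;
import java.util.List;

/** Naive price-drop detector mirroring PATTERN (A B* C) for one ticker (illustrative sketch). */
final class PriceDropSketch {
    /**
     * Each row is {eventTimeMinutes, price}, sorted by event time.
     * Returns {indexOfA, indexOfC} pairs, one per match.
     */
    static List<int[]> findDrops(long[][] rows, long dropThreshold, long withinMinutes) {
        List<int[]> matches = new ArrayList<>();
        int start = 0;
        while (start < rows.length) {
            int end = -1;
            for (int j = start + 1; j < rows.length; j++) {
                if (rows[j][0] - rows[start][0] >= withinMinutes) {
                    break; // time constraint exceeded: no match starting at this row
                }
                if (rows[j][1] <= rows[start][1] - dropThreshold) {
                    end = j; // first row that no longer satisfies B's condition becomes C
                    break;
                }
            }
            if (end >= 0) {
                matches.add(new int[] {start, end});
                start = end + 1; // AFTER MATCH SKIP PAST LAST ROW
            } else {
                start++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        long[][] rows = {{0, 1000}, {1, 900}, {2, 400}, {3, 1200}, {20, 600}};
        for (int[] m : findDrops(rows, 500, 10)) {
            long initialPrice = rows[m[0]][1];
            long lastPrice = rows[m[1]][1];
            // prints: initialPrice=1000 lastPrice=400 dropDiff=600
            System.out.println("initialPrice=" + initialPrice + " lastPrice=" + lastPrice
                    + " dropDiff=" + (initialPrice - lastPrice));
        }
    }
}
```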
Flink Planner Migration
From https://www.ververica.com/blog/a-journey-to-beating-flinks-sql-performance
Planner Decoupling
Planner Optimizations & Query Rewrite
● Predicate push down
● Projection push down
● Join rewrite
● Join elimination
● Constant inlining
● …
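As a toy illustration of the first item, predicate push down, the plain-Java sketch below runs the same join-plus-filter in two orders and counts join lookups. The data layout and names are assumptions made up for this example; a real planner performs the rewrite on the query plan, not on user code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Shows that filtering before a join gives the same rows with fewer join probes. */
final class PredicatePushdownSketch {
    // Each order row is {orderId, productId, country}.

    /** Unoptimized plan: probe the join for every order, filter afterwards. Returns the probe count. */
    static int joinThenFilter(String[][] orders, Map<String, String> products,
                              String country, List<String> out) {
        int probes = 0;
        for (String[] order : orders) {
            probes++;
            String productName = products.get(order[1]);
            if (productName != null && order[2].equals(country)) {
                out.add(order[0] + "|" + productName);
            }
        }
        return probes;
    }

    /** Rewritten plan: the country predicate is evaluated before the join. Returns the probe count. */
    static int filterThenJoin(String[][] orders, Map<String, String> products,
                              String country, List<String> out) {
        int probes = 0;
        for (String[] order : orders) {
            if (!order[2].equals(country)) {
                continue; // filtered out before touching the join
            }
            probes++;
            String productName = products.get(order[1]);
            if (productName != null) {
                out.add(order[0] + "|" + productName);
            }
        }
        return probes;
    }

    public static void main(String[] args) {
        String[][] orders = {{"o1", "p1", "FRANCE"}, {"o2", "p1", "CANADA"}, {"o3", "p2", "FRANCE"}};
        Map<String, String> products = new HashMap<>();
        products.put("p1", "Keyboard");
        products.put("p2", "Mouse");

        List<String> a = new ArrayList<>();
        List<String> b = new ArrayList<>();
        System.out.println(joinThenFilter(orders, products, "FRANCE", a) + " probes -> " + a);
        System.out.println(filterThenJoin(orders, products, "FRANCE", b) + " probes -> " + b);
    }
}
```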
SQL API DataStream API
val postgresSink: SinkFunction[Envelope] = JdbcSink.sink(
"INSERT INTO table " +
"(id, number, timestamp, author, difficulty, size, vid, block_range) " +
"VALUES (?, ?, ?, ?, ?, ?, ?, ?) " +
"ON CONFLICT (id) DO UPDATE SET " +
"number = excluded.number, " +
"timestamp = excluded.timestamp, " +
"author = excluded.author, " +
"difficulty = excluded.difficulty, " +
"size = excluded.size, " +
"vid = excluded.vid, " +
"block_range = excluded.block_range " +
"WHERE excluded.vid > table.vid",
new JdbcStatementBuilder[Envelope] {
override def accept(statement: PreparedStatement, record: Envelope): Unit = {
val payload = record.payload
payload.id.foreach { id => statement.setString(1, id) }
payload.number.foreach { number => statement.setBigDecimal(2, new java.math.BigDecimal(number)) }
payload.timestamp.foreach { timestamp => statement.setBigDecimal(3, new java.math.BigDecimal(timestamp)) }
payload.author.foreach { author => statement.setString(4, author) }
payload.difficulty.foreach { difficulty => statement.setBigDecimal(5, new java.math.BigDecimal(difficulty)) }
payload.size.foreach { size => statement.setBigDecimal(6, new java.math.BigDecimal(size)) }
payload.vid.foreach { vid => statement.setLong(7, vid.toLong) }
payload.block_range.foreach { block_range => statement.setObject(8, new PostgresIntRange(block_range), Types.O
}
},
CREATE TABLE `table` (
  id BIGINT,
  `number` INTEGER,
  `timestamp` TIMESTAMP,
  author STRING,
  difficulty STRING,
  `size` INTEGER,
  vid BIGINT,
  block_range STRING,
  PRIMARY KEY (vid) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'table-name' = 'table'
);
😱
Common Type System
When you start using SQL, you get access to decades of advancements in database design
When NOT to use
● Complex serialization / deserialization logic
● Low-level optimizations, especially with state and timers
● Not always debugging-friendly
Dealing with Complexity
UDFs for heavy lifting
● Calling 3rd-party libraries
● External calls
● Enrichments
Templating
● Control structures
● dbt-style macros and references
Convinced? Let’s use it!
Ways to use
● Structured Statements
● dbt-style Project
● Notebooks
● Managed Runtime
Requirements
● Version control
● Code organization
● Testability
● CI/CD
● Observability
Structured Statements
def revenueByCountry(country: String): Table = {
tEnv.sqlQuery(
s"""
|SELECT name, SUM(revenue) AS totalRevenue
|FROM Orders
|WHERE country = '${country}'
|GROUP BY name""".stripMargin
)
}
✅ structure
✅ mock/stub for testing
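Because a structured statement is just a function that returns SQL text, it can be unit-tested without a running cluster. The sketch below is a plain-Java analogue of revenueByCountry that only builds the statement string; the class name is hypothetical, and doubling single quotes in the parameter is an addition assumed here to keep the literal well-formed, not something shown on the slide.

```java
/** Builds SQL statements as testable functions (illustrative sketch). */
final class RevenueQueries {
    /** Returns the revenue-by-country statement; single quotes in the parameter are doubled. */
    static String revenueByCountry(String country) {
        String escaped = country.replace("'", "''");
        return "SELECT name, SUM(revenue) AS totalRevenue\n"
             + "FROM Orders\n"
             + "WHERE country = '" + escaped + "'\n"
             + "GROUP BY name";
    }

    public static void main(String[] args) {
        System.out.println(revenueByCountry("FRANCE"));
    }
}
```

The returned string can then be passed to sqlQuery in the job itself, or compared against an expected statement in a test.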
Structured Statements
● Treat them like code
● Only makes sense when the Table API is not available
● Mix with other API flavours
● SQL also has style guides
● Otherwise it’s a typical streaming application!
Structured Statements
● Version control: 🟢
● Code organization: 🟢
● Testability: 🟡
● CI/CD: 🟡
● Observability: 🟢
dbt-style Project
➔ models
◆ common
● users.sql
● users.yml
◆ sales.sql
◆ sales.yml
◆ …
➔ tests
◆ …
✅ structured
✅ schematized
✅ testable
dbt-style Project
SELECT
    ((text::jsonb)->>'bid_price')::FLOAT AS bid_price,
    (text::jsonb)->>'order_quantity' AS order_quantity,
    (text::jsonb)->>'symbol' AS symbol,
    (text::jsonb)->>'trade_type' AS trade_type,
    to_timestamp(((text::jsonb)->'timestamp')::BIGINT) AS ts
FROM {{ ref('market_orders_raw') }}

{{ config(materialized='materializedview') }}
SELECT symbol,
    AVG(bid_price) AS avg
FROM {{ ref('market_orders') }}
GROUP BY symbol
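To show what a dbt-style ref reference boils down to, here is a small plain-Java sketch that replaces each `{{ ref('model') }}` placeholder with a concrete relation name. The class name, the regular expression, and the relation names are assumptions for illustration; dbt itself renders models with Jinja and also uses ref to build the dependency graph, which this sketch does not attempt.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Resolves {{ ref('model') }} placeholders against a model-to-relation map (illustrative sketch). */
final class RefTemplateSketch {
    private static final Pattern REF =
            Pattern.compile("\\{\\{\\s*ref\\('([A-Za-z0-9_]+)'\\)\\s*\\}\\}");

    static String render(String sql, Map<String, String> relations) {
        Matcher matcher = REF.matcher(sql);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String model = matcher.group(1);
            String relation = relations.get(model);
            if (relation == null) {
                throw new IllegalArgumentException("Unknown model: " + model);
            }
            matcher.appendReplacement(out, Matcher.quoteReplacement(relation));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> relations = new HashMap<>();
        relations.put("market_orders", "analytics.market_orders"); // hypothetical relation name
        String model = "SELECT symbol, AVG(bid_price) AS avg "
                + "FROM {{ ref('market_orders') }} GROUP BY symbol";
        System.out.println(render(model, relations));
    }
}
```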
dbt-style Project
● Works well for heavy analytical use-cases
● Could write tests in Python/Scala/etc.
● Probably needs more tooling than you think (state management, observability, etc.)
● Check out the dbt adapter from Materialize!
dbt-style Project
● Version control: 🟢
● Code organization: 🟢
● Testability: 🟡
● CI/CD: 🟡
● Observability: 🟡
Notebooks
Apache Zeppelin
Notebooks
● Great UX
● Ideal for exploratory analysis and BI
● Complements all other patterns really well
● Way more important for realtime workloads
Notebooks
"We don't recommend productionizing notebooks and instead encourage empowering data scientists to build production-ready code with the right programming frameworks"
https://www.thoughtworks.com/en-ca/radar/techniques/productionizing-notebooks
Notebooks
● Version control: 🟡
● Code organization: 🔴
● Testability: 🔴
● CI/CD: 🔴
● Observability: 🔴
Managed Runtime
decodable
Managed Runtime
● Managed ≈ “Serverless”
● Auto-scaling
● Automated deployments, rollbacks, etc.
● Testing for different layers is decoupled (runtime vs jobs)
Managed Runtime
Reference Architecture
● UI, CLI
● Control Plane: API, Reconciler
● Data Plane: Streaming Job
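The slide only names the components, so the following plain-Java sketch is an assumption about what the Reconciler does, based on the common control-plane pattern: compare the desired set of jobs (as submitted through the API) with what is actually running in the data plane, and derive the actions needed to converge. All names and the action format are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Computes the actions that move the data plane toward the desired state (illustrative sketch). */
final class ReconcilerSketch {
    /** Both maps go from job name to the version of the SQL that should run or is running. */
    static List<String> reconcile(Map<String, Integer> desired, Map<String, Integer> actual) {
        List<String> actions = new ArrayList<>();
        // TreeMap gives a stable, alphabetical order of actions.
        for (Map.Entry<String, Integer> entry : new TreeMap<>(desired).entrySet()) {
            Integer running = actual.get(entry.getKey());
            if (running == null) {
                actions.add("START " + entry.getKey() + "@" + entry.getValue());
            } else if (!running.equals(entry.getValue())) {
                actions.add("UPGRADE " + entry.getKey() + " " + running + "->" + entry.getValue());
            }
        }
        for (String name : new TreeMap<>(actual).keySet()) {
            if (!desired.containsKey(name)) {
                actions.add("STOP " + name);
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        Map<String, Integer> desired = new HashMap<>();
        desired.put("sales", 2);
        desired.put("users", 1);
        Map<String, Integer> actual = new HashMap<>();
        actual.put("sales", 1);
        actual.put("legacy", 3);
        // prints: [UPGRADE sales 1->2, START users@1, STOP legacy]
        System.out.println(reconcile(desired, actual));
    }
}
```

Running this function in a loop, and treating every change as a new version rather than an in-place edit, is what makes automated deployments and rollbacks straightforward.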
Any managed runtime requires excellent developer experience to succeed
Managed Runtime: Ideal Developer Experience
● Notebooks UX
● Version Control Integration
● dbt-style Project Structure
  ➔ models
    ◆ common
    ◆ sales
    ◆ shipping
    ◆ marketing
    ◆ …
● Versioning
  ● Version 1
  ● Version 2
  ● Version 3
  ● …
● Previews
  User    Count
  Irene   100
  Alex    53
  Josh    12
  Jane    1
Managed Runtime
● Version control: 🟢
● Code organization: 🟢
● Testability: 🟡
● CI/CD: 🟢
● Observability: 🟢
Summary
                     Structured Statements   dbt-style Project   Notebooks   Managed Runtime
Version Control      🟢                      🟢                  🟡          🟢
Code Organization    🟢                      🟢                  🔴          🟢
Testability          🟡                      🟡                  🔴          🟡
CI/CD                🟡                      🟡                  🔴          🟢
Observability        🟢                      🟡                  🔴          🟢
Complexity           🟢                      🟡                  🟡          🔴
General Guidelines
● Long-running streaming apps require special attention to state management
● Try to avoid mutability: every change is a new version
● Integration testing > unit testing
● Embrace the SRE mentality
Really dislike SQL?
● Malloy
● PRQL
Questions?
@sap1ens

Streaming SQL for Data Engineers: The Next Big Thing? With Yaroslav Tkachenko | Current 2022

Histor y of HAM Radio presentation slide
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

Streaming SQL for Data Engineers: The Next Big Thing? With Yaroslav Tkachenko | Current 2022

  • 1. Streaming SQL for Data Engineers: The Next Big Thing?
  • 4. Open source and vendor solutions: ● Apache Flink ● Apache Spark ● Apache Beam ● AWS Kinesis ● Google Cloud Dataflow ● Databricks ● ksqlDB ● … Companies building internal platforms: ● Meta ● LinkedIn ● Pinterest ● DoorDash ● Alibaba ● …
  • 6. 👋 Hi, I’m Yaroslav
  • 7. 👋 Hi, I’m Yaroslav ● Principal Software Engineer @ Goldsky ● Staff Data Engineer @ Shopify ● Software Architect @ Activision ● …
  • 8. 👋 Hi, I’m Yaroslav ● Principal Software Engineer @ Goldsky ● Staff Data Engineer @ Shopify ● Software Architect @ Activision ● … ❤ Apache Flink
  • 9. 🤔
      TableEnvironment tableEnv = TableEnvironment.create(/*…*/);
      Table revenue = tableEnv.sqlQuery(
          "SELECT cID, cName, SUM(revenue) AS revSum " +
          "FROM Orders " +
          "WHERE cCountry = 'FRANCE' " +
          "GROUP BY cID, cName"
      );
  • 10. … but why SQL?
  • 11. Why SQL? ● Wide adoption ● Declarative transformation model ● Planner! ● Common type system
  • 15. Declarative Transformation Model
      SQL API:
        SELECT * FROM Orders
        INNER JOIN Product
        ON Orders.productId = Product.id
      DataStream API:
        ● LOTS of code!
        ● Create an operator to connect two streams
        ● Define and accumulate state
        ● Implement a mechanism for emitting the latest value per key
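The DataStream-side work listed on slide 15 can be sketched in plain Java, without Flink. This is only an illustration under stated assumptions: `Order`, `Product`, `Joined`, and `LatestValueJoin` are hypothetical names that do not come from the slides, state is held in in-memory maps rather than a Flink state backend, and the join simply re-emits rows whenever a newer product value arrives (no retractions).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical event types, introduced only for this sketch.
record Order(String orderId, String productId) {}
record Product(String id, String name) {}
record Joined(String orderId, String productId, String productName) {}

// Toy two-input operator: connects two streams, accumulates keyed state,
// and emits joined rows using the latest product value per key.
final class LatestValueJoin {
    private final Map<String, List<Order>> ordersByProduct = new HashMap<>();
    private final Map<String, Product> latestProduct = new HashMap<>();

    // Called for every order event; emits a row only if the product is already known.
    List<Joined> onOrder(Order order) {
        ordersByProduct.computeIfAbsent(order.productId(), k -> new ArrayList<>()).add(order);
        Product product = latestProduct.get(order.productId());
        List<Joined> out = new ArrayList<>();
        if (product != null) {
            out.add(new Joined(order.orderId(), product.id(), product.name()));
        }
        return out;
    }

    // Called for every product event; re-emits all buffered orders with the new value.
    List<Joined> onProduct(Product product) {
        latestProduct.put(product.id(), product);
        List<Joined> out = new ArrayList<>();
        for (Order order : ordersByProduct.getOrDefault(product.id(), List.of())) {
            out.add(new Joined(order.orderId(), product.id(), product.name()));
        }
        return out;
    }
}
```

Even this toy version needs two handlers and two pieces of state, which is the contrast with the three-line SQL join that the slide draws.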
  • 16. Declarative Transformation Model: Why not Table API?
      SQL API:
        SELECT * FROM Orders
        INNER JOIN Product
        ON Orders.productId = Product.id
      Table API:
        val orders = tEnv.from("Orders")
          .select($"productId", $"a", $"b")
        val products = tEnv.from("Products")
          .select($"id", $"c", $"d")
        val result = orders
          .join(products)
          .where($"productId" === $"id")
          .select($"a", $"b", $"c")
  • 17. Declarative Transformation Model: Top-N Query
      SELECT * FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY ticker
          ORDER BY price DESC) AS row_num
        FROM stock_table)
      WHERE row_num <= 10;
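For comparison, here is roughly what the Top-N query asks the runtime to maintain, written by hand in plain Java. It is a minimal sketch, not how any particular engine implements Top-N: `TopNPerKey` is a hypothetical class, state is an in-memory map, and only prices are tracked rather than whole rows.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Keeps the N highest prices seen so far for each ticker.
final class TopNPerKey {
    private final int n;
    private final Map<String, List<Double>> topPrices = new HashMap<>();

    TopNPerKey(int n) {
        this.n = n;
    }

    // Adds one price for a ticker and returns that ticker's current top-N, highest first.
    List<Double> add(String ticker, double price) {
        List<Double> prices = topPrices.computeIfAbsent(ticker, k -> new ArrayList<>());
        prices.add(price);
        prices.sort(Comparator.reverseOrder());
        if (prices.size() > n) {
            prices.remove(prices.size() - 1); // drop the lowest price (remove by index)
        }
        return List.copyOf(prices);
    }
}
```

In the SQL version, the partition key, ordering, and N are all stated in the query, and the planner decides how to keep this state.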
  • 18. Row Pattern Recognition in SQL (ISO/IEC TR 19075-5:2016)
      SELECT * FROM stock_table
      MATCH_RECOGNIZE(
        PARTITION BY ticker
        ORDER BY event_time
        MEASURES
          A.event_time AS initialPriceTime,
          C.event_time AS dropTime,
          A.price - C.price AS dropDiff,
          A.price AS initialPrice,
          C.price AS lastPrice
        ONE ROW PER MATCH
        AFTER MATCH SKIP PAST LAST ROW
        PATTERN (A B* C) WITHIN INTERVAL '10' MINUTES
        DEFINE
          B AS B.price > A.price - 500
      )
  • 19. Flink Planner Migration From https://www.ververica.com/blog/a-journey-to-beating-flinks-sql-performance Planner Decoupling
  • 20. Planner Optimizations & Query Rewrite ● Predicate push down ● Projection push down ● Join rewrite ● Join elimination ● Constant inlining ● …
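To make one of these rewrites concrete, the sketch below imitates predicate push down in plain Java. It is a toy under stated assumptions, not how Flink's planner works internally: `OrderRow`, `ProductRow`, `JoinedRow`, `PlanResult`, and `PushDownDemo` are hypothetical names, and the "plans" are just two hand-written methods that should return the same rows while feeding a different number of order rows into the join.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical row types for the illustration.
record OrderRow(String productId, String country) {}
record ProductRow(String id, String name) {}
record JoinedRow(String country, String productName) {}
record PlanResult(List<JoinedRow> rows, int orderRowsJoined) {}

final class PushDownDemo {
    // Nested-loop inner join on orders.productId = products.id.
    private static List<JoinedRow> join(List<OrderRow> orders, List<ProductRow> products) {
        List<JoinedRow> out = new ArrayList<>();
        for (OrderRow o : orders) {
            for (ProductRow p : products) {
                if (o.productId().equals(p.id())) {
                    out.add(new JoinedRow(o.country(), p.name()));
                }
            }
        }
        return out;
    }

    // Plan as written by the user: join everything, then filter by country.
    static PlanResult filterAfterJoin(List<OrderRow> orders, List<ProductRow> products, String country) {
        List<JoinedRow> rows = join(orders, products).stream()
                .filter(r -> r.country().equals(country))
                .toList();
        return new PlanResult(rows, orders.size());
    }

    // Rewritten plan: the country predicate is pushed below the join.
    static PlanResult filterBeforeJoin(List<OrderRow> orders, List<ProductRow> products, String country) {
        List<OrderRow> filtered = orders.stream()
                .filter(o -> o.country().equals(country))
                .toList();
        return new PlanResult(join(filtered, products), filtered.size());
    }
}
```

Both methods produce the same result set, but the second one joins fewer rows. With SQL, the user writes the first form and can expect the planner to consider the second.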
  • 21. Common Type System 😱
      DataStream API:
        val postgresSink: SinkFunction[Envelope] = JdbcSink.sink(
          "INSERT INTO table " +
            "(id, number, timestamp, author, difficulty, size, vid, block_range) " +
            "VALUES (?, ?, ?, ?, ?, ?, ?, ?) " +
            "ON CONFLICT (id) DO UPDATE SET " +
            "number = excluded.number, " +
            "timestamp = excluded.timestamp, " +
            "author = excluded.author, " +
            "difficulty = excluded.difficulty, " +
            "size = excluded.size, " +
            "vid = excluded.vid, " +
            "block_range = excluded.block_range " +
            "WHERE excluded.vid > table.vid",
          new JdbcStatementBuilder[Envelope] {
            override def accept(statement: PreparedStatement, record: Envelope): Unit = {
              val payload = record.payload
              payload.id.foreach { id => statement.setString(1, id) }
              payload.number.foreach { number => statement.setBigDecimal(2, new java.math.BigDecimal(number)) }
              payload.timestamp.foreach { timestamp => statement.setBigDecimal(3, new java.math.BigDecimal(timestamp)) }
              payload.author.foreach { author => statement.setString(4, author) }
              payload.difficulty.foreach { difficulty => statement.setBigDecimal(5, new java.math.BigDecimal(difficulty)) }
              payload.size.foreach { size => statement.setBigDecimal(6, new java.math.BigDecimal(size)) }
              payload.vid.foreach { vid => statement.setLong(7, vid.toLong) }
              payload.block_range.foreach { block_range => statement.setObject(8, new PostgresIntRange(block_range), Types.O
            }
          },
      SQL API:
        CREATE TABLE TABLE (
          id BIGINT,
          number INTEGER,
          timestamp TIMESTAMP,
          author STRING,
          difficulty STRING,
          size INTEGER,
          vid BIGINT,
          block_range STRING,
          PRIMARY KEY (vid) NOT ENFORCED
        ) WITH (
          'connector' = 'jdbc',
          'table-name' = 'table'
        );
  • 22. When you start using SQL you get access to the decades of advancements in database design
  • 23. When NOT to use ● Complex serialization / deserialization logic ● Low-level optimizations, especially with state and timers ● Not always debugging-friendly
  • 24. Dealing with Complexity UDFs for heavy lifting ● Calling 3rd-party libraries ● External calls ● Enrichments Templating ● Control structures ● dbt-style macros and references
  • 26. Ways to use ● Structured Statements ● dbt-style Project ● Notebooks ● Managed Runtime
  • 27. Requirements ● Version control ● Code organization ● Testability ● CI/CD ● Observability
  • 28. Structured Statements
      def revenueByCountry(country: String): Table = {
        tEnv.sqlQuery(
          s"""
            |SELECT name, SUM(revenue) AS totalRevenue
            |FROM Orders
            |WHERE country = '${country}'
            |GROUP BY name""".stripMargin
        )
      }
  • 29. Structured Statements
      def revenueByCountry(country: String): Table = {
        tEnv.sqlQuery(
          s"""
            |SELECT name, SUM(revenue) AS totalRevenue
            |FROM Orders
            |WHERE country = '${country}'
            |GROUP BY name""".stripMargin
        )
      }
      ✅ structure ✅ mock/stub for testing
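One way to get the "mock/stub for testing" benefit is to separate building the SQL text from running it. The plain-Java sketch below shows that idea under stated assumptions: `SqlRunner` is a hypothetical seam standing in for a call such as `tableEnv.sqlQuery`, `RevenueQueries` is an invented class, and the allow-list check on `country` is an added safeguard that the slide's snippet does not have.

```java
// Hypothetical seam standing in for the real query executor.
interface SqlRunner {
    String run(String sql);
}

final class RevenueQueries {
    private final SqlRunner runner;

    RevenueQueries(SqlRunner runner) {
        this.runner = runner;
    }

    // Pure function: builds the statement text, so it can be unit-tested without a cluster.
    static String revenueByCountrySql(String country) {
        // Assumption of this sketch: country names contain only letters and spaces.
        if (!country.matches("[A-Za-z ]+")) {
            throw new IllegalArgumentException("unexpected country: " + country);
        }
        return "SELECT name, SUM(revenue) AS totalRevenue\n"
                + "FROM Orders\n"
                + "WHERE country = '" + country + "'\n"
                + "GROUP BY name";
    }

    // Thin wrapper: delegates execution to whatever runner was injected.
    String revenueByCountry(String country) {
        return runner.run(revenueByCountrySql(country));
    }
}
```

In a test, the runner can be a lambda that records the SQL it receives. In production it would wrap the real table environment.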
  • 30. Structured Statements ● Treat them like code ● Only make sense when Table API is not available ● Mix with other API flavours ● SQL also has style guides ● Otherwise it’s a typical streaming application!
  • 31. Structured Statements ● Version control: 🟢 ● Code organization: 🟢 ● Testability: 🟡 ● CI/CD: 🟡 ● Observability: 🟢
  • 32. dbt-style Project
      ➔ models
         ◆ common
            ● users.sql
            ● users.yml
         ◆ sales.sql
         ◆ sales.yml
         ◆ …
      ➔ tests
         ◆ …
  • 33. dbt-style Project
      ➔ models
         ◆ common
            ● users.sql
            ● users.yml
         ◆ sales.sql
         ◆ sales.yml
         ◆ …
      ➔ tests
         ◆ …
      ✅ structured ✅ schematized ✅ testable
  • 34. dbt-style Project
      SELECT
        ((text::jsonb)->>'bid_price')::FLOAT AS bid_price,
        (text::jsonb)->>'order_quantity' AS order_quantity,
        (text::jsonb)->>'symbol' AS symbol,
        (text::jsonb)->>'trade_type' AS trade_type,
        to_timestamp(((text::jsonb)->'timestamp')::BIGINT) AS ts
      FROM {{ ref('market_orders_raw') }}

      {{ config(materialized='materializedview') }}
      SELECT symbol, AVG(bid_price) AS avg
      FROM {{ ref('market_orders') }}
      GROUP BY symbol
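The model references in these templates do two jobs: they declare dependencies between models and they get replaced by concrete relation names at compile time. The plain-Java toy below illustrates both, and is not dbt itself (dbt renders templates with Jinja). `RefTemplate` is a hypothetical class, it only recognises the lowercase `{{ ref('name') }}` form, and the relation name `analytics.market_orders` in the usage note is made up.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RefTemplate {
    // Matches {{ ref('model_name') }} with optional whitespace inside the braces.
    private static final Pattern REF =
            Pattern.compile("\\{\\{\\s*ref\\('([A-Za-z0-9_]+)'\\)\\s*\\}\\}");

    // Model names referenced by a SQL template, in order of first appearance.
    static Set<String> dependencies(String template) {
        Set<String> deps = new LinkedHashSet<>();
        Matcher m = REF.matcher(template);
        while (m.find()) {
            deps.add(m.group(1));
        }
        return deps;
    }

    // Replaces every reference with the relation name found in the mapping.
    static String render(String template, Map<String, String> relations) {
        Matcher m = REF.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String relation = relations.get(m.group(1));
            if (relation == null) {
                throw new IllegalArgumentException("unknown model: " + m.group(1));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(relation));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, rendering the second model with `Map.of("market_orders", "analytics.market_orders")` yields a plain `SELECT ... FROM analytics.market_orders GROUP BY symbol`, while `dependencies` tells a scheduler that `market_orders` must be built first.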
  • 35. dbt-style Project ● Works well for heavy analytical use-cases ● Could write tests in Python/Scala/etc. ● Probably needs more tooling than you think (state management, observability, etc.) ● Check dbt adapter from Materialize!
  • 36. dbt-style Project ● Version control: 🟢 ● Code organization: 🟢 ● Testability: 🟡 ● CI/CD: 🟡 ● Observability: 🟡
  • 39. Notebooks ● Great UX ● Ideal for exploratory analysis and BI ● Complements all other patterns really well ● Way more important for realtime workloads
  • 40. Notebooks: "We don't recommend productionizing notebooks and instead encourage empowering data scientists to build production-ready code with the right programming frameworks" (https://www.thoughtworks.com/en-ca/radar/techniques/productionizing-notebooks)
  • 41. Notebooks ● Version control: 🟡 ● Code organization: 🔴 ● Testability: 🔴 ● CI/CD: 🔴 ● Observability: 🔴
  • 43. Managed Runtime ● Managed ≈ “Serverless” ● Auto-scaling ● Automated deployments, rollbacks, etc. ● Testing for different layers is decoupled (runtime vs jobs)
  • 44. Managed Runtime: Reference Architecture (diagram labels: Control Plane, Data Plane, API, Reconciler, Streaming Job, UI, CLI)
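The slide names a Reconciler but does not describe it. A common reading, assumed here, is a control-plane loop that compares the desired set of jobs with what is actually running and derives the actions to close the gap. The plain-Java sketch below shows only that comparison step. `Reconciler.plan`, the job names, and the action strings are all hypothetical, and versions are plain integers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class Reconciler {
    // Compares desired and actual state (job name -> deployed version)
    // and returns the actions needed, ordered by job name.
    static List<String> plan(Map<String, Integer> desired, Map<String, Integer> actual) {
        TreeSet<String> names = new TreeSet<>(desired.keySet());
        names.addAll(actual.keySet());

        List<String> actions = new ArrayList<>();
        for (String name : names) {
            Integer want = desired.get(name);
            Integer have = actual.get(name);
            if (have == null) {
                actions.add("START " + name + " v" + want);
            } else if (want == null) {
                actions.add("STOP " + name);
            } else if (!want.equals(have)) {
                actions.add("UPGRADE " + name + " v" + have + " -> v" + want);
            }
        }
        return actions;
    }
}
```

Running `plan` repeatedly until it returns an empty list is the essence of the loop. A real runtime would also have to handle state snapshots, failures, and rollbacks.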
  • 45. Any managed runtime requires excellent developer experience to succeed
  • 46. Managed Runtime: Ideal Developer Experience Notebooks UX SELECT * … SELECT * …
  • 47. Managed Runtime: Ideal Developer Experience Version Control Integration SELECT * … SELECT * …
  • 48. Managed Runtime: Ideal Developer Experience dbt-style Project Structure SELECT * … SELECT * … ➔ models ◆ common ◆ sales ◆ shipping ◆ marketing ◆ …
  • 49. Managed Runtime: Ideal Developer Experience Versioning SELECT * … SELECT * … ● Version 1 ● Version 2 ● Version 3 ● …
  • 50. Managed Runtime: Ideal Developer Experience Previews SELECT * … SELECT * … User Count Irene 100 Alex 53 Josh 12 Jane 1
  • 51. Managed Runtime ● Version control: 🟢 ● Code organization: 🟢 ● Testability: 🟡 ● CI/CD: 🟢 ● Observability: 🟢
  • 52. Summary
      | | Structured Statements | dbt-style Project | Notebooks | Managed Runtime |
      | Version Control | 🟢 | 🟢 | 🟡 | 🟢 |
      | Code Organization | 🟢 | 🟢 | 🔴 | 🟢 |
      | Testability | 🟡 | 🟡 | 🔴 | 🟡 |
      | CI/CD | 🟡 | 🟡 | 🔴 | 🟢 |
      | Observability | 🟢 | 🟡 | 🔴 | 🟢 |
      | Complexity | 🟢 | 🟡 | 🟡 | 🔴 |
  • 53. General Guidelines ● Long-running streaming apps require special attention to state management ● Try to avoid mutability: every change is a new version ● Integration testing > unit testing ● Embrace the SRE mentality