1
Streams and Tables: Two Sides of the Same Coin
Matthias J. Sax (1, 2), Guozhang Wang (1), Matthias Weidlich (2), Johann-Christoph Freytag (2)
(1) Confluent Inc., Palo Alto (CA)
matthias@confluent.io
guozhang@confluent.io
(2) Humboldt-Universität zu Berlin
mjsax@informatik.hu-berlin.de
matthias.weidlich@hu-berlin.de
freytag@informatik.hu-berlin.de
@MatthiasJSax
Twelfth International Workshop on Real-Time Business Intelligence and Analytics
27 August, 2018, Rio de Janeiro
3
Count Clicks per Page
[Figure: a distributed data source emits the click records BIRTE(3), VLDB(4), and VLDB(7), where the number in parentheses is the record timestamp; the records are merged into a single input data stream.]
• Arrival order is non-deterministic
• Event-time semantics implies out-of-order data
4
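Ordering: Common Approaches
• Buffer and re-order
[Figure: the input data stream with BIRTE(3), VLDB(4), and VLDB(7) is buffered and re-ordered before it reaches the SPS.]
• Punctuations/watermarks
[Figure: the input data stream with BIRTE(3), VLDB(4), and VLDB(7) carries the punctuation "time=3" on its way to the SPS.]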
5
Design Space
[Figure: design space spanned by Cost, Correctness/Completeness, and Latency.]
• Buffering and Reordering (Ref: CQL [1], Trill [2])
• Punctuations/Watermarks (Ref: Li et al. [3], Krishnamurthy et al. [4])
6
Problem Statement
How to design a model
• for the evaluation of expressive operators
• with low latency over potentially unordered data streams
• that can be implemented by means of distributed online algorithms?
7
High-Level Proposal
• To reduce latency, we need to avoid any processing delays
  • Process data in arrival order
  • Emit current result immediately
  • Law et al. [5]: cannot handle out-of-order data
• To handle out-of-order data, we need to be able to update/refine previous results
  • Data streams must allow for update records
  • Update/delete records by Babu and Widom [6]: no operator semantics defined
  • Borealis [7]: replays data stream after “updating/reordering”; very high cost
8
Data Model
• Offset: physical order (arrival/processing order)
• Timestamp: logical order (event-time)
• Key: optional for grouping
• Value: payload
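The four attributes above can be sketched as a small Java type. This is only an illustration under stated assumptions: the names `StreamRecord`, `DataModelSketch`, `isOutOfOrder`, and `sample` are hypothetical and not taken from the slides or from Kafka, and the sample arrival order is assumed. The check shows how the offset (physical order) and the timestamp (logical order) can disagree.

```java
import java.util.List;

// Hypothetical record type mirroring the four attributes of the data model.
final class StreamRecord {
    final long offset;     // physical order (arrival/processing order)
    final long timestamp;  // logical order (event time)
    final String key;      // optional grouping key, may be null
    final String value;    // payload

    StreamRecord(long offset, long timestamp, String key, String value) {
        this.offset = offset;
        this.timestamp = timestamp;
        this.key = key;
        this.value = value;
    }
}

final class DataModelSketch {
    // Returns true if a record has a smaller timestamp than a record that arrived before it.
    static boolean isOutOfOrder(List<StreamRecord> recordsByOffset) {
        long maxTimestampSeen = Long.MIN_VALUE;
        for (StreamRecord r : recordsByOffset) {
            if (r.timestamp < maxTimestampSeen) {
                return true;
            }
            maxTimestampSeen = r.timestamp;
        }
        return false;
    }

    // Assumed arrival order for illustration: BIRTE(3), VLDB(7), VLDB(4).
    static List<StreamRecord> sample() {
        return List.of(
            new StreamRecord(0, 3, "BIRTE", "click"),
            new StreamRecord(1, 7, "VLDB", "click"),
            new StreamRecord(2, 4, "VLDB", "click"));
    }

    public static void main(String[] args) {
        System.out.println(isOutOfOrder(sample())); // prints true
    }
}
```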
9
Stream Processing Operators
• Stateless, order agnostic
  • filter, projection, flatMap
  • No special handling necessary
• Stateful, order sensitive
  • aggregation, joins, windowing
  • Need to handle out-of-order data
10
Data Stream Aggregation
• Model output of (windowed) aggregations as a table
  • State is not internal but a first-class citizen
  • Update stateful operator continuously
  • Emit changelog stream to downstream operators
• Streams, Tables, and Changelogs
  • Define operator semantics over changelogs and updating tables
  • Temporal operator semantics
11
Example: Count Clicks per Page
countTable = stream.groupBy(r->url).count()
record stream: BIRTE
table (url | count):
  BIRTE | 1
changelog stream: <BIRTE,1>
12
Example: Count Clicks per Page
countTable = stream.groupBy(r->url).count()
record stream: BIRTE, VLDB
table (url | count):
  BIRTE | 1
  VLDB | 1
changelog stream: <BIRTE,1>, <VLDB,1>
13
Example: Count Clicks per Page
countTable = stream.groupBy(r->url).count()
record stream: BIRTE, VLDB, VLDB
table (url | count):
  BIRTE | 1
  VLDB | 2
changelog stream: <BIRTE,1>, <VLDB,1>, <VLDB,2>
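The click-count example can be replayed with a few lines of plain Java. This is a minimal sketch, not the Kafka Streams implementation: `CountChangelogSketch` and its members are hypothetical names, the table is an in-memory map, and changelog records are rendered as strings in the `<url,count>` notation of the slides.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CountChangelogSketch {
    // The result table: url -> current count.
    final Map<String, Long> countTable = new LinkedHashMap<>();
    // Every table update is also emitted as a changelog record.
    final List<String> changelog = new ArrayList<>();

    // Processes one click record in arrival order and emits the updated row immediately.
    void process(String url) {
        long updatedCount = countTable.merge(url, 1L, Long::sum);
        changelog.add("<" + url + "," + updatedCount + ">");
    }

    static CountChangelogSketch run(List<String> urls) {
        CountChangelogSketch sketch = new CountChangelogSketch();
        for (String url : urls) {
            sketch.process(url);
        }
        return sketch;
    }

    public static void main(String[] args) {
        CountChangelogSketch sketch = run(List.of("BIRTE", "VLDB", "VLDB"));
        System.out.println(sketch.countTable); // prints {BIRTE=1, VLDB=2}
        System.out.println(sketch.changelog);  // prints [<BIRTE,1>, <VLDB,1>, <VLDB,2>]
    }
}
```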
14
Example: Processing a Changelog Stream
countTable2 = countTable.filter(url='VLDB').toTable()
input table countTable (url | count):
  BIRTE | 1
  VLDB | 2
changelog stream: <BIRTE,1>, <VLDB,1>, <VLDB,2>
result table countTable2 (url | count):
  VLDB | 2
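A downstream operator can consume the changelog and materialize its own table. The sketch below assumes changelog records are plain key/count pairs and applies each matching record as an upsert; `ChangelogFilterSketch` and `filterToTable` are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ChangelogFilterSketch {
    // Keeps only changelog records for the given url and applies them as upserts,
    // so the latest record per key defines the content of the result table.
    static Map<String, Long> filterToTable(List<Map.Entry<String, Long>> changelog, String url) {
        Map<String, Long> table = new LinkedHashMap<>();
        for (Map.Entry<String, Long> update : changelog) {
            if (update.getKey().equals(url)) {
                table.put(update.getKey(), update.getValue());
            }
        }
        return table;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> changelog = List.of(
            Map.entry("BIRTE", 1L),
            Map.entry("VLDB", 1L),
            Map.entry("VLDB", 2L));
        System.out.println(filterToTable(changelog, "VLDB")); // prints {VLDB=2}
    }
}
```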
15
Example: Windowed Count
countTable = stream.groupBy(r->url).windowedBy(5sec).count()
windowID = <groupingKey,windowStartTimestamp>
windowStartTimestamp = floor(recordTimestamp / windowSize) * windowSize
record stream: BIRTE(3), VLDB(7), VLDB(4)
table (window ID | count):
  <BIRTE,0> | 1
  <VLDB,0> | 1
  <VLDB,5> | 1
changelog stream: <<BIRTE,0>,1>, <<VLDB,5>,1>, <<VLDB,0>,1>
16
Example: Out-of-Order Data
countTable = stream.groupBy(r->url).windowedBy(5sec).count()
windowID = <groupingKey,windowStartTimestamp>
windowStartTimestamp = floor(recordTimestamp / windowSize) * windowSize
record stream: BIRTE(1) (arrives out of order)
table (window ID | count):
  <BIRTE,0> | 2
  <VLDB,0> | 1
  <VLDB,5> | 1
changelog stream: <<BIRTE,0>,2>
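The windowed count, including the late record BIRTE(1), can be traced with the sketch below. It assumes tumbling windows whose start is the record timestamp rounded down to a multiple of the window size, which is what the window IDs <VLDB,0> and <VLDB,5> in the example imply. The class name, the string rendering of window IDs, and the arrival order BIRTE(3), VLDB(7), VLDB(4), BIRTE(1) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class WindowedCountSketch {
    final long windowSize;
    // windowID -> current count, with windowID rendered as "<key,windowStart>".
    final Map<String, Long> countTable = new LinkedHashMap<>();
    final List<String> changelog = new ArrayList<>();

    WindowedCountSketch(long windowSize) {
        this.windowSize = windowSize;
    }

    // Aligns a timestamp to the start of its tumbling window.
    static long windowStart(long timestamp, long windowSize) {
        return Math.floorDiv(timestamp, windowSize) * windowSize;
    }

    // Processes a record in arrival order; a late record simply updates an older window.
    void process(String url, long timestamp) {
        String windowId = "<" + url + "," + windowStart(timestamp, windowSize) + ">";
        long updatedCount = countTable.merge(windowId, 1L, Long::sum);
        changelog.add("<" + windowId + "," + updatedCount + ">");
    }

    // Replays the example with an assumed arrival order; BIRTE(1) arrives last.
    static WindowedCountSketch example() {
        WindowedCountSketch sketch = new WindowedCountSketch(5);
        sketch.process("BIRTE", 3);
        sketch.process("VLDB", 7);
        sketch.process("VLDB", 4);
        sketch.process("BIRTE", 1);
        return sketch;
    }

    public static void main(String[] args) {
        WindowedCountSketch sketch = example();
        System.out.println(sketch.countTable);
        System.out.println(sketch.changelog);
    }
}
```

The late record does not require buffering or re-ordering: it updates the row for window <BIRTE,0> and emits a refined result <<BIRTE,0>,2> to downstream operators.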
17
Duality of Streams and Tables
18
Design Space
[Figure: design space spanned by Cost, Correctness/Completeness, and Latency, now including the Dual Streaming Model.]
• Buffering and Reordering (Ref: CQL [1], Trill [2])
• Punctuations/Watermarks (Ref: Li et al. [3], Krishnamurthy et al. [4])
• Dual Streaming Model
  - continuous updates / changelogs
  - decouple latency from correctness
  - trade off latency and cost
  - trade off cost and completeness (retention time)
19
Stream-Table Transformations
See the paper for details…
20
Implementation
• Implemented in Apache Kafka (v0.10)
• Kafka Streams / Streams API
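StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> textLines = builder.stream("TextLinesTopic");
// A windowed count is keyed by Windowed<String> (the word plus its window)
KTable<Windowed<String>, Long> wordCounts = textLines
    // split each line into lower-case words on non-word characters
    .flatMapValues(textLine -> Arrays.asList(textLine.toLowerCase().split("\\W+")))
    .groupBy((key, word) -> word)
    .windowedBy(TimeWindows.of(5_000L)) // 5-second windows
    .count();
// note: writing Windowed<String> keys requires a suitable key serde to be configured
wordCounts.toStream().to("WordsWithCountsTopic");
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();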
21
Implementation
• Leveraged in Confluent’s KSQL
CREATE TABLE click_count_per_url AS
SELECT url, count(*)
FROM click_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE url LIKE '%confluent%' OR url LIKE '%hu-berlin%'
GROUP BY url;
22
Implementation
• Widely adopted in industry
23
Summary
• We propose the Dual Streaming Model
• Handles out-of-order data within the processing model
• Optimized for low latency
• Streams and Tables are Dual
• Allows trading off processing cost, latency, and completeness
• Adopted in industry via Kafka Streams and KSQL
24
Thank You
We are hiring!
25
References
[1] Arvind Arasu, Shivnath Babu, and Jennifer Widom. 2003. CQL: A Language for Continuous Queries over Streams and Relations. Database Programming Languages, 9th Int. WS. 1–19.
[2] Badrish Chandramouli et al. 2014. Trill: A High-performance Incremental Query Processor for Diverse Analytics. Proc. VLDB Endow. 8, 4 (2014), 401–412.
[3] Jin Li et al. 2005. Semantics and Evaluation Techniques for Window Aggregates in Data Streams. Proc. of the ACM SIGMOD Int. Conf. on Management of Data. 311–322.
[4] Sailesh Krishnamurthy et al. 2010. Continuous Analytics over Discontinuous Streams. Proc. of the 2010 ACM SIGMOD Int. Conf. on Management of Data. 1081–1092.
[5] Yan-Nei Law, Haixun Wang, and Carlo Zaniolo. 2004. Query Languages and Data Models for Database Sequences and Data Streams. Proc. of the 30th Int. Conf. on Very Large Data Bases. 492–503.
[6] Shivnath Babu and Jennifer Widom. 2001. Continuous Queries over Data Streams. SIGMOD Record 30, 3 (2001), 109–120.
[7] Daniel Abadi et al. 2005. The Design of the Borealis Stream Processing Engine. CIDR, 2nd Biennial Conf. on Innovative Data Systems Research. 277–289.
26
Data Stream Types
27
Evolving Table
record stream: BIRTE(3), VLDB(7), VLDB(4), BIRTE(1)
table versions before the out-of-order record BIRTE(1) is processed (window ID | count):
table v3:
  <BIRTE,0> | 1
table v4:
  <BIRTE,0> | 1
  <VLDB,0> | 1
table v7:
  <BIRTE,0> | 1
  <VLDB,0> | 1
  <VLDB,5> | 1
28
Evolving Table
record stream: BIRTE(3), VLDB(7), VLDB(4), BIRTE(1)
table versions after the out-of-order record BIRTE(1) is processed (window ID | count):
table v1:
  <BIRTE,0> | 1
table v3:
  <BIRTE,0> | 2
table v4:
  <BIRTE,0> | 2
  <VLDB,0> | 1
table v7:
  <BIRTE,0> | 2
  <VLDB,0> | 1
  <VLDB,5> | 1
29
Table-Table Join
30
Stream-Stream Join
• Sliding Window Join, i.e., band join
• Window size specifies an additional timestamp-based join predicate
SELECT * FROM stream1, stream2
WHERE stream1.key = stream2.key
  AND stream1.ts - windowSize <= stream2.ts
  AND stream2.ts <= stream1.ts + windowSize
[Figure: stream1 and stream2 with the join window of size windowSize.]
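The join predicate of the query can be written out directly. The sketch below evaluates it with a naive nested loop over two finite lists, which is only meant to show the band condition, not how a streaming engine would evaluate the join incrementally. Records are assumed to be (key, timestamp) pairs, and `BandJoinSketch` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class BandJoinSketch {
    // Same key, and the stream2 timestamp lies within windowSize of the stream1 timestamp.
    static boolean matches(String key1, long ts1, String key2, long ts2, long windowSize) {
        return key1.equals(key2)
            && ts1 - windowSize <= ts2
            && ts2 <= ts1 + windowSize;
    }

    // Naive nested-loop evaluation; each result is rendered as "key@ts1~ts2".
    static List<String> join(List<Map.Entry<String, Long>> stream1,
                             List<Map.Entry<String, Long>> stream2,
                             long windowSize) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> r1 : stream1) {
            for (Map.Entry<String, Long> r2 : stream2) {
                if (matches(r1.getKey(), r1.getValue(), r2.getKey(), r2.getValue(), windowSize)) {
                    result.add(r1.getKey() + "@" + r1.getValue() + "~" + r2.getValue());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> stream1 = List.of(Map.entry("VLDB", 4L));
        List<Map.Entry<String, Long>> stream2 = List.of(
            Map.entry("VLDB", 7L), Map.entry("VLDB", 10L), Map.entry("BIRTE", 4L));
        System.out.println(join(stream1, stream2, 5)); // prints [VLDB@4~7]
    }
}
```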
31
Stream-Table Join
• Temporal table “lookup” join
• For each stream record, look up a matching table record
• Join condition: streamRecord.key == tableRecord.key
• The join is temporal in the sense that the “correct” table version must be used
• i.e., the youngest table version that is before the stream record's timestamp
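A temporal lookup can be sketched by keeping every table version per key in a sorted map and selecting the youngest version that precedes the stream record's timestamp. This is an illustration under assumptions: "before" is read as strictly earlier (hence `lowerEntry`; `floorEntry` would also accept a version with the same timestamp), all versions are retained in memory, and `TemporalLookupJoinSketch` and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class TemporalLookupJoinSketch {
    // key -> (version timestamp -> table value at that version)
    final Map<String, TreeMap<Long, String>> versions = new HashMap<>();

    // Records a new table version for the key at the given timestamp.
    void upsert(String key, long timestamp, String value) {
        versions.computeIfAbsent(key, k -> new TreeMap<>()).put(timestamp, value);
    }

    // Joins a stream record with the youngest table version strictly before its timestamp.
    // Returns null if the key is unknown or no such version exists.
    String join(String key, long streamTimestamp, String streamValue) {
        TreeMap<Long, String> history = versions.get(key);
        if (history == null) {
            return null;
        }
        Map.Entry<Long, String> version = history.lowerEntry(streamTimestamp);
        return version == null ? null : streamValue + "+" + version.getValue();
    }

    public static void main(String[] args) {
        TemporalLookupJoinSketch table = new TemporalLookupJoinSketch();
        table.upsert("VLDB", 2, "a");
        table.upsert("VLDB", 6, "b");
        System.out.println(table.join("VLDB", 4, "click")); // prints click+a
        System.out.println(table.join("VLDB", 7, "click")); // prints click+b
    }
}
```

Keeping older versions is what allows an out-of-order stream record to be joined against the table state that was valid at its event time; how long versions are kept corresponds to the retention time mentioned in the design-space slide.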
32
Stream-Table Join

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 

Último (20)

A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 

Streams and Tables: Two Sides of the Same Coin (BIRTE 2018)

  • 1. Streams and Tables: Two Sides of the Same Coin
    Matthias J. Sax (1, 2), Guozhang Wang (1), Matthias Weidlich (2), Johann-Christoph Freytag (2)
    (1) Confluent Inc., Palo Alto (CA): matthias@confluent.io, guozhang@confluent.io
    (2) Humboldt-Universität zu Berlin: mjsax@informatik.hu-berlin.de, matthias.weidlich@hu-berlin.de, freytag@informatik.hu-berlin.de
    @MatthiasJSax
    Twelfth International Workshop on Real-Time Business Intelligence and Analytics, 27 August 2018, Rio de Janeiro
  • 2. Count Clicks per Page
    [Figure: a distributed data source emits clicks on the pages BIRTE, VLDB, VLDB with event timestamps BIRTE(3), VLDB(4), VLDB(7); the clicks are merged into a single input data stream.]
  • 3. Count Clicks per Page
    [Figure: the same distributed data source as on the previous slide; the records BIRTE(3), VLDB(4), VLDB(7) can arrive in the input data stream in different orders.]
      • Arrival order is non-deterministic
      • Event-time semantics implies out-of-order data
  • 4. Ordering: Common Approaches
    [Figure: two ways for an SPS to deal with the out-of-order input data stream BIRTE(3), VLDB(4), VLDB(7): buffer and re-order the records, or embed punctuations/watermarks (here: time=3) in the stream.]
  • 5. Design Space
    [Figure: design space spanned by cost, correctness/completeness, and latency.]
      • Buffering and Reordering (Ref: CQL [1], Trill [2])
      • Punctuations/Watermarks (Ref: Li et al. [3], Krishnamurthy et al. [4])
  • 6. Problem Statement
    How to design a model
      • for the evaluation of expressive operators
      • with low latency over potentially unordered data streams
      • that can be implemented by means of distributed online algorithms?
  • 7. High-Level Proposal
      • To reduce latency, we need to avoid any processing delays
      • Process data in arrival order
      • Emit current result immediately
      • Law et al. [5]: cannot handle out-of-order data
      • To handle out-of-order data, we need to be able to update/refine previous results
      • Data streams must allow for update records
      • Update/delete records by Babu and Widom [6]: no operator semantics defined
      • Borealis [7]: replays data stream after "updating/reordering"; very high cost
  • 8. Data Model
      • Offset: physical order (arrival/processing order)
      • Timestamp: logical order (event-time)
      • Key: optional for grouping
      • Value: payload
  • 9. Stream Processing Operators
      • Stateless, order agnostic
      • filter, projection, flatMap
      • No special handling necessary
      • Stateful, order sensitive
      • aggregation, joins, windowing
      • Need to handle out-of-order data
  • 10. Data Stream Aggregation
      • Model output of (windowed) aggregations as a table
      • State is not internal but a first-class citizen
      • Update stateful operator continuously
      • Emit changelog stream to downstream operators
      • Streams, Tables, and Changelogs
      • Define operator semantics over changelogs and updating tables
      • Temporal operator semantics
  • 11. Example: Count Clicks per Page
    countTable = stream.groupBy(r->url).count()
    [Figure: the record BIRTE arrives on the record stream; the table (url, count) holds (BIRTE, 1); the changelog stream contains <BIRTE,1>.]
  • 12. Example: Count Clicks per Page
    countTable = stream.groupBy(r->url).count()
    [Figure: the record VLDB arrives; the table holds (BIRTE, 1) and (VLDB, 1); the changelog stream contains <BIRTE,1>, <VLDB,1>.]
  • 13. Example: Count Clicks per Page
    countTable = stream.groupBy(r->url).count()
    [Figure: a second VLDB record arrives; the table holds (BIRTE, 1) and (VLDB, 2); the changelog stream contains <BIRTE,1>, <VLDB,1>, <VLDB,2>.]
  • 14. Example: Processing a Changelog Stream
    countTable2 = countTable.filter(url='VLDB').toTable()
    [Figure: the changelog stream <BIRTE,1>, <VLDB,1>, <VLDB,2> is filtered for url 'VLDB'; the resulting table (url, count) holds (VLDB, 2).]
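The two examples above (counting into a table, then filtering its changelog into a second table) can be sketched in plain Java without Kafka. This is a minimal illustration under stated assumptions: the class `ChangelogSketch` and its methods are hypothetical names, not Kafka Streams API, and a table is modeled simply as a map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration; not part of Kafka Streams.
final class ChangelogSketch {

    // Counts clicks per URL and returns the changelog:
    // one <url,count> update record per input record, in arrival order.
    static List<Map.Entry<String, Long>> countChangelog(List<String> clicks) {
        Map<String, Long> countTable = new LinkedHashMap<>();
        List<Map.Entry<String, Long>> changelog = new ArrayList<>();
        for (String url : clicks) {
            long updated = countTable.merge(url, 1L, Long::sum);
            changelog.add(Map.entry(url, updated));
        }
        return changelog;
    }

    // Replays a changelog, keeps only updates for the given URL,
    // and materializes the surviving updates as a table.
    static Map<String, Long> filterToTable(List<Map.Entry<String, Long>> changelog, String url) {
        Map<String, Long> table = new LinkedHashMap<>();
        for (Map.Entry<String, Long> update : changelog) {
            if (update.getKey().equals(url)) {
                table.put(update.getKey(), update.getValue()); // a later update overwrites the row
            }
        }
        return table;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> changelog = countChangelog(List.of("BIRTE", "VLDB", "VLDB"));
        System.out.println(changelog);                        // [BIRTE=1, VLDB=1, VLDB=2]
        System.out.println(filterToTable(changelog, "VLDB")); // {VLDB=2}
    }
}
```

The changelog carries every intermediate count, while the table keeps only the latest value per key, which is the stream/table relationship the slides illustrate.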
  • 15. Example: Windowed Count
    countTable = stream.groupBy(r->url).windowedBy(5sec).count()
    windowID = <groupingKey,windowStartTimestamp>
    windowStartTimestamp = floor(recordTimestamp / windowSize) * windowSize
    [Figure: record stream BIRTE(3), VLDB(4), VLDB(7); the table (window ID, count) holds (<BIRTE,0>, 1), (<VLDB,0>, 1), (<VLDB,5>, 1); the changelog stream contains <<BIRTE,0>,1>, <<VLDB,0>,1>, <<VLDB,5>,1>.]
  • 16. Example: Out-of-Order Data
    countTable = stream.groupBy(r->url).windowedBy(5sec).count()
    windowID = <groupingKey,windowStartTimestamp>
    windowStartTimestamp = floor(recordTimestamp / windowSize) * windowSize
    [Figure: the out-of-order record BIRTE(1) arrives; the table row <BIRTE,0> is updated to count 2, the rows (<VLDB,0>, 1) and (<VLDB,5>, 1) are unchanged; the changelog stream emits <<BIRTE,0>,2>.]
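The windowed count with an out-of-order record can be traced with a small stand-alone sketch. Assumptions: the class `WindowedCountSketch` is a hypothetical name, the arrival order BIRTE(3), VLDB(4), VLDB(7), BIRTE(1) is read off the figures, timestamps are non-negative, and the window start is computed by rounding the timestamp down to a multiple of the window size, which is what the window IDs in the figures suggest.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; class and method names are illustrative, not Kafka Streams API.
final class WindowedCountSketch {

    static final long WINDOW_SIZE = 5; // seconds, as in the slide example

    // Start of the tumbling window that contains the given event timestamp
    // (integer division rounds down for non-negative timestamps).
    static long windowStart(long timestamp) {
        return (timestamp / WINDOW_SIZE) * WINDOW_SIZE;
    }

    private final Map<String, Long> countTable = new LinkedHashMap<>();
    private final List<String> changelog = new ArrayList<>();

    // Processes one record in arrival order and emits the updated row immediately.
    // No buffering or reordering: a late record simply updates an existing row.
    void process(String url, long timestamp) {
        String windowId = "<" + url + "," + windowStart(timestamp) + ">";
        long count = countTable.merge(windowId, 1L, Long::sum);
        changelog.add("<" + windowId + "," + count + ">");
    }

    Map<String, Long> table() {
        return countTable;
    }

    List<String> changelog() {
        return changelog;
    }

    public static void main(String[] args) {
        WindowedCountSketch op = new WindowedCountSketch();
        op.process("BIRTE", 3);
        op.process("VLDB", 4);
        op.process("VLDB", 7);
        op.process("BIRTE", 1); // out-of-order record: refines the <BIRTE,0> row
        op.changelog().forEach(System.out::println);
    }
}
```

The last changelog record, `<<BIRTE,0>,2>`, is the refinement of an earlier result; downstream consumers treat it as an update to the row they already received.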
  • 17. Duality of Streams and Tables
  • 18. Design Space
    [Figure: design space spanned by cost, correctness/completeness, and latency.]
      • Buffering and Reordering (Ref: CQL [1], Trill [2])
      • Punctuations/Watermarks (Ref: Li et al. [3], Krishnamurthy et al. [4])
      • Dual Streaming Model: continuous updates / changelogs; decouple latency from correctness; trade-off latency and cost; trade-off cost and completeness (retention time)
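  • 20. Implementation
      • Implemented in Apache Kafka (v0.10)
      • Kafka Streams / Streams API

        import java.util.Arrays;
        import org.apache.kafka.streams.KafkaStreams;
        import org.apache.kafka.streams.StreamsBuilder;
        import org.apache.kafka.streams.kstream.KStream;
        import org.apache.kafka.streams.kstream.KTable;
        import org.apache.kafka.streams.kstream.TimeWindows;
        import org.apache.kafka.streams.kstream.Windowed;

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> textLines = builder.stream("TextLinesTopic");
        // Split each line into lowercase words and count each word per 5-second window.
        // A windowed count is keyed by Windowed<String>, not by String.
        KTable<Windowed<String>, Long> wordCounts = textLines
            .flatMapValues(textLine -> Arrays.asList(textLine.toLowerCase().split("\\W+")))
            .groupBy((key, word) -> word)
            .windowedBy(TimeWindows.of(5_000L))
            .count();
        // Write the changelog of the count table to an output topic.
        wordCounts.toStream().to("WordsWithCountsTopic");
        // props (application id, bootstrap servers, serdes) is assumed to be defined elsewhere.
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();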
  • 21. Implementation
      • Implemented in Apache Kafka (v0.10)
      • Kafka Streams / Streams API
      • Leveraged in Confluent's KSQL

        CREATE TABLE click_count_per_url AS
          SELECT url, count(*)
          FROM click_stream
          WINDOW TUMBLING (SIZE 1 MINUTE)
          WHERE url LIKE '%confluent%' OR url LIKE '%hu-berlin%'
          GROUP BY url;
  • 22. Implementation
      • Implemented in Apache Kafka (v0.10)
      • Kafka Streams / Streams API
      • Leveraged in Confluent's KSQL
      • Widely adopted in industry
  • 23. Summary
      • Suggest the Dual-Streaming-Model
      • Handles out-of-order data within the processing model
      • Optimized for low latency
      • Streams and Tables are dual
      • Allows trading off processing cost, latency, and completeness
      • Adopted in industry via Kafka Streams and KSQL
  • 25. References
    [1] Arvind Arasu, Shivnath Babu, and Jennifer Widom. 2003. CQL: A Language for Continuous Queries over Streams and Relations. Database Programming Languages, 9th Int. WS. 1–19.
    [2] Badrish Chandramouli et al. 2014. Trill: A High-performance Incremental Query Processor for Diverse Analytics. Proc. VLDB Endow. 8, 4 (2014), 401–412.
    [3] Jin Li et al. 2005. Semantics and Evaluation Techniques for Window Aggregates in Data Streams. Proc. of the ACM SIGMOD Int. Conf. on Management of Data. 311–322.
    [4] Sailesh Krishnamurthy et al. 2010. Continuous Analytics over Discontinuous Streams. Proc. of the 2010 ACM SIGMOD Int. Conf. on Management of Data. 1081–1092.
    [5] Yan-Nei Law, Haixun Wang, and Carlo Zaniolo. 2004. Query Languages and Data Models for Database Sequences and Data Streams. Proc. of the 13th Int. Conf. on Very Large Data Bases. 492–503.
    [6] Shivnath Babu and Jennifer Widom. 2001. Continuous Queries over Data Streams. SIGMOD Record 30, 3 (2001), 109–120.
    [7] Daniel Abadi et al. 2005. The Design of the Borealis Stream Processing Engine. CIDR, 2nd Biennial Conf. on Innovative Data Systems Research. 277–289.
  • 27. Evolving Table
    [Figure: record stream BIRTE(3), VLDB(4), VLDB(7), BIRTE(1) and three versions of the table (window ID, count). Table v3: (<BIRTE,0>, 1). Table v4: (<BIRTE,0>, 1), (<VLDB,0>, 1). Table v7: (<BIRTE,0>, 1), (<VLDB,0>, 1), (<VLDB,5>, 1).]
  • 28. Evolving Table
    [Figure: after the out-of-order record BIRTE(1), a table v1 with (<BIRTE,0>, 1) is added and the count of <BIRTE,0> in table v3, table v4, and table v7 is updated to 2; the rows (<VLDB,0>, 1) and (<VLDB,5>, 1) are unchanged.]
  • 30. Stream-Stream Join
      • Sliding window join, i.e., band join
      • Window size specifies an additional timestamp-based join predicate

        SELECT * FROM stream1, stream2
        WHERE stream1.key = stream2.key
          AND stream1.ts - windowSize <= stream2.ts
          AND stream2.ts <= stream1.ts + windowSize

    [Figure: stream1 and stream2 with a band of width windowSize around each record.]
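The band-join predicate above can be checked on a few sample records. This is a minimal sketch over two finite lists; `BandJoinSketch`, `Rec`, and the sample data are hypothetical, and the nested loop is only for illustration, not how a streaming engine would evaluate the join.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names are illustrative and not taken from Kafka Streams or KSQL.
final class BandJoinSketch {

    // Minimal stream record: a join key and an event timestamp.
    static final class Rec {
        final String key;
        final long ts;

        Rec(String key, long ts) {
            this.key = key;
            this.ts = ts;
        }

        @Override
        public String toString() {
            return key + "(" + ts + ")";
        }
    }

    // The join predicate from the slide: equal keys and
    // timestamps at most windowSize apart in either direction.
    static boolean matches(Rec r1, Rec r2, long windowSize) {
        return r1.key.equals(r2.key)
                && r1.ts - windowSize <= r2.ts
                && r2.ts <= r1.ts + windowSize;
    }

    // Nested-loop evaluation over two finite inputs, for illustration only.
    static List<String> join(List<Rec> stream1, List<Rec> stream2, long windowSize) {
        List<String> result = new ArrayList<>();
        for (Rec r1 : stream1) {
            for (Rec r2 : stream2) {
                if (matches(r1, r2, windowSize)) {
                    result.add(r1 + "+" + r2);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Rec> stream1 = List.of(new Rec("A", 10), new Rec("B", 20));
        List<Rec> stream2 = List.of(new Rec("A", 12), new Rec("A", 16), new Rec("B", 25));
        System.out.println(join(stream1, stream2, 5)); // [A(10)+A(12), B(20)+B(25)]
    }
}
```

With a window size of 5, A(16) falls outside the band around A(10), while B(25) sits exactly on the upper bound of B(20) and still joins because both comparisons are inclusive.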
  • 31. Stream-Table Join
      • Temporal table "lookup" join
      • For each stream record, look up a matching table record
      • Join condition: streamRecord.key == tableRecord.key
      • The join is temporal in the sense that the "correct" table version must be used, i.e., the youngest table version that is before the stream record's timestamp
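The temporal lookup can be sketched with a versioned table that keeps, per key, a sorted map from version timestamp to value. Assumptions: `TemporalJoinSketch` and its methods are hypothetical names, the sample data is made up, and a table version whose timestamp equals the stream record's timestamp is treated as visible (`floorEntry`); a strictly-before reading of the slide would use `lowerEntry` instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a temporal stream-table join; not Kafka Streams API.
final class TemporalJoinSketch {

    // Per key: all table versions, sorted by the timestamp at which they became valid.
    private final Map<String, TreeMap<Long, String>> versions = new HashMap<>();

    // Records a new table version for the key, valid from the given timestamp on.
    void updateTable(String key, long timestamp, String value) {
        versions.computeIfAbsent(key, k -> new TreeMap<>()).put(timestamp, value);
    }

    // Joins one stream record against the table version that was current
    // at the stream record's timestamp. Returns null if no such version exists.
    String join(String key, long streamTimestamp, String streamValue) {
        TreeMap<Long, String> history = versions.get(key);
        if (history == null) {
            return null; // key not in the table at all
        }
        // Assumption: a version with an equal timestamp is visible (at or before).
        Map.Entry<Long, String> version = history.floorEntry(streamTimestamp);
        return version == null ? null : streamValue + "+" + version.getValue();
    }

    public static void main(String[] args) {
        TemporalJoinSketch join = new TemporalJoinSketch();
        join.updateTable("alice", 2, "Berlin");
        join.updateTable("alice", 8, "Rio");
        System.out.println(join.join("alice", 5, "click")); // click+Berlin
        System.out.println(join.join("alice", 9, "click")); // click+Rio
    }
}
```

Keeping old versions is what lets an out-of-order stream record with timestamp 5 still join against "Berlin" after the table has moved on to "Rio"; how long versions are retained is the cost/completeness trade-off mentioned in the design-space slide.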