Easy, Scalable, Fault-tolerant
stream processing with
Structured Streaming
Tathagata “TD” Das
@tathadas
Spark Meetup @ Intel
Santa Clara, 23rd March 2017
About Me
Spark PMC Member
Built Spark Streaming at UC Berkeley
Currently focused on building Structured Streaming
Software engineer at Databricks
2
building robust
stream processing
apps is hard
3
Complexities in stream processing
4
Complex Data
Diverse data formats
(json, avro, binary, …)
Data can be dirty,
late, out-of-order
Complex
Systems
Diverse storage
systems and formats
(SQL, NoSQL, parquet, ...)
System failures
Complex
Workloads
Event time processing
Combining streaming
with interactive
queries, machine
learning
Structured Streaming
stream processing on Spark SQL engine
fast, scalable, fault-tolerant
rich, unified, high level APIs
deal with complex data and complex workloads
rich ecosystem of data sources
integrate with many storage systems
5
you
should not have to
reason about streaming
6
you
should only write
simple batch-like queries
Spark
should automatically streamify
them
7
Treat Streams as Unbounded Tables
8
data stream unbounded input table
new data in the
data stream
=
new rows appended
to an unbounded table
New Model
Trigger: every 1 sec
Input: data from source as an append-only table
Trigger: how frequently to check input for new data
Query: operations on input
usual map/filter/reduce
new window, session ops
[diagram: the input table grows with the data received up to each trigger t=1, t=2, t=3]
New Model
Result: final operated table, updated after every trigger
Output: what part of result to write to storage after every trigger
Complete output [complete mode]: write the full result table (all rows) to storage every time
[diagram: the query runs on the input up to t=1, t=2, t=3, producing the result up to each trigger]
New Model
Result: final operated table, updated after every trigger
Output: what part of result to write to storage after every trigger
Complete output: write full result table every time
Append output [append mode]: write only the new rows that got added to the result table since the previous trigger
[diagram: same query; at each trigger t=1, t=2, t=3 only the rows new since the last trigger are written to storage]
New Model
Conceptual model that guides how to think of a streaming query as a simple table query
The engine does not need to keep the full input table in memory once it has streamified the query
[diagram: the same input/result/output picture as the previous slide, in append mode]
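A hedged sketch of how the two output modes are selected in code: outputMode is the standard DataStreamWriter setting, while counts, selected, and the paths below are placeholders for an aggregating streaming DataFrame, a simple projection, and real locations.

// Complete mode: the full result table (e.g. running counts) is rewritten every trigger.
val completeQuery = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

// Append mode: only rows added to the result since the last trigger are written,
// which suits queries whose result rows never change once produced (e.g. plain selections).
val appendQuery = selected.writeStream
  .outputMode("append")
  .format("parquet")
  .option("checkpointLocation", "/checkpoint-append")   // placeholder path
  .start("/dest-path")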
DataFrames,
Datasets, SQL
input = spark.readStream
.format("json")
.load("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.writeStream
.format("parquet")
.start("dest-path")
Logical Plan: Streaming Source -> Project [device, signal] -> Filter [signal > 15] -> Streaming Sink
Spark automatically streamifies!
Spark SQL converts batch-like query to a series of incremental execution plans operating on new batches of data
[diagram: the series of incremental execution plans processes new files at t=1, t=2, t=3]
Fault-tolerance with Checkpointing
Checkpointing - metadata
(e.g. offsets) of current batch
stored in a write ahead log in
HDFS/S3
Query can be restarted from the
log
Streaming sources can replay the
exact data range in case of failure
Streaming sinks can dedup
end-to-end exactly-once guarantees
[diagram: each trigger processes new files at t=1, t=2, t=3, with offsets recorded in the write-ahead log]
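A minimal sketch (not from the slides) of what restart-from-checkpoint looks like in code: the query is simply started again with the same checkpointLocation, and the offsets recorded in the write-ahead log tell the source which data range to replay. Broker, topic, and paths are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("restartable-query").getOrCreate()

// Same source, sink, and checkpoint location as before the failure.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "topic")
  .load()

// On restart, the offsets in the checkpoint's write-ahead log are replayed by the source;
// the sink tracks committed batches, giving end-to-end exactly-once output.
val restarted = input.writeStream
  .format("parquet")
  .option("checkpointLocation", "/checkpoint")   // must be the same path as before
  .start("/parquetTable/")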
static data =
bounded table
streaming data =
unbounded table
Unified API - Dataset/DataFrame
Single API!
16
Dataset/DataFrame = Tables
Unified, structured APIs in Spark to transform data
in Scala, Java, Python, R
SQL
spark.sql("
SELECT type, sum(signal)
FROM devices
GROUP BY type
")
val df: DataFrame =
spark
.table("device-data")
.groupBy("type")
.sum("signal"))
DataFrames Dataset
val ds: Dataset[(String, Double)] =
spark
.table("device-data")
.as[DeviceData]
.groupByKey(_.`type`)
.mapValues(_.signal)
.reduceGroups(_ + _)
17
Dataset/DataFrame = Tables
Unified, structured APIs in Spark to transform data
in Scala, Java, Python, R
SQL DataFrames Dataset
Compile-time type safety (Dataset only)
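To make the compile-time-safety point concrete, a hedged sketch (the DeviceData fields are assumptions, not from the deck): a misspelled field in the typed Dataset API is rejected by the compiler, while the equivalent typo in an untyped DataFrame column name only fails when the query is analyzed at runtime.

import org.apache.spark.sql.{Dataset, SparkSession}

case class DeviceData(device: String, deviceType: String, signal: Double)

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val ds: Dataset[DeviceData] = spark.table("device-data").as[DeviceData]

// Typed API: field names and types are checked at compile time.
val strong: Dataset[DeviceData] = ds.filter(_.signal > 15)
// ds.filter(_.signall > 15)            // would not compile: no such field

// Untyped API: a typo in the column name is only caught at runtime.
val weak = spark.table("device-data").where($"signall" > 15)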
Batch Queries with DataFrames
input = spark.read
.format("json")
.load("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.write
.format("parquet")
.save("dest-path")
Read from Json file
Select some devices
Write to parquet file
Streaming Queries with DataFrames
input = spark.readStream
.format("kafka")
.option("subscribe", "topic")
.load()
result = input
.select("device", "signal")
.where("signal > 15")
result.writeStream
.format("parquet")
.start("dest-path")
Read from Kafka
Replace read with readStream
Change format to kafka
Select some devices
Code does not change
Write to Parquet file stream
Replace save() with start()
Complex Streaming ETL
Structured Streaming enables raw data to be
available as structured data in seconds, for more
interactive and complex analytics
20
[diagram: raw bytes (101010...) become a structured table within seconds]
Complex Streaming ETL
21
Example
- Json data being received in
Kafka
- Parse nested json and flatten
it
- Store in structured Parquet
table
- Get end-to-end failure
guarantees
val rawData = spark.readStream
.format("kafka")
.option("subscribe", "topic")
.option("kafka.boostrap.servers",...)
.load()
val parsedData = rawData
.selectExpr("cast (value as string) as json"))
.select(from_json("json", schema).as("data"))
.select("data.*")
val query = parsedData.writeStream
.option("checkpointLocation", "/checkpoint")
.partitionBy("date")
.format("parquet")
.start("/parquetTable/")
Reading from Kafka [Spark 2.1]
22
Supports Kafka 0.10.0.1
Specify options to configure
How?
kafka.bootstrap.servers => broker1
What?
subscribe => topic1,topic2,topic3 // fixed list of topics
subscribePattern => topic* // dynamic list of topics
assign => {"topicA":[0,1] } // specific partitions
Where?
startingOffsets => latest (default) / earliest / {"topicA":{"0":23,"1":345}}
val rawData = spark.readStream
.format("kafka")
.option("kafka.boostrap.servers",...)
.option("subscribe", "topic")
.load()
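A hedged sketch of the other option combinations listed above (broker address, topic names, and offsets are placeholders):

// Dynamic topic list via a pattern, starting from the earliest available offsets.
val fromPattern = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", "earliest")
  .load()

// Specific partitions only, with explicit per-partition starting offsets.
val fromPartitions = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("assign", """{"topicA":[0,1]}""")
  .option("startingOffsets", """{"topicA":{"0":23,"1":345}}""")
  .load()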
Reading from Kafka
23
val rawData = spark.readStream
.format("kafka")
.option("subscribe", "topic")
.option("kafka.boostrap.servers",...)
.load()
The rawData DataFrame has the following columns:

key      | value    | topic    | partition | offset | timestamp
[binary] | [binary] | "topicA" | 0         | 345    | 1486087873
[binary] | [binary] | "topicB" | 3         | 2890   | 1486086721
Transforming Data
Cast binary value to string
Name it column json
24
val parsedData = rawData
.selectExpr("cast (value as string) as json")
.select(from_json("json").as("data"))
.select("data.*")
Transforming Data
Cast binary value to string
Name it column json
Parse json string and expand
into nested columns, name it
data
25
val parsedData = rawData
.selectExpr("cast (value as string) as json")
.select(from_json("json").as("data"))
.select("data.*")
json
{ "timestamp": 1486087873, "device": "devA", …}
{ "timestamp": 1486082418, "device": "devX", …}

    from_json("json") as "data"

data (nested)
timestamp  | device | …
1486087873 | devA   | …
1486086721 | devX   | …
Transforming Data
Cast binary value to string
Name it column json
Parse json string and expand
into nested columns, name it
data
Flatten the nested columns
26
val parsedData = rawData
.selectExpr("cast (value as string) as json")
.select(from_json("json").as("data"))
.select("data.*")
data (nested)
timestamp  | device | …
1486087873 | devA   | …
1486086721 | devX   | …

    select("data.*")

(not nested)
timestamp  | device | …
1486087873 | devA   | …
1486086721 | devX   | …
Transforming Data
Cast binary value to string
Name it column json
Parse json string and expand
into nested columns, name it
data
Flatten the nested columns
27
val parsedData = rawData
.selectExpr("cast (value as string) as json")
.select(from_json("json").as("data"))
.select("data.*")
powerful built-in APIs to
perform complex data
transformations
from_json, to_json, explode, ...
100s of functions
(see our blog post)
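As a runnable sketch of the parse-and-flatten step: from_json actually takes a column and an explicit schema (the slides omit both for brevity), and explode turns an array field into one row per element. The schema and field names here are assumptions, not from the talk.

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

// Assumed shape of the JSON payload; adjust to the real events.
val schema = new StructType()
  .add("timestamp", TimestampType)
  .add("device", StringType)
  .add("readings", ArrayType(DoubleType))

val parsedData = rawData
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json(col("json"), schema).as("data"))   // nested struct column
  .select("data.*")                                    // flatten to top-level columns
  .withColumn("reading", explode(col("readings")))     // one row per array element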
Writing to Parquet table
Save parsed data as
Parquet table in the given
path
Partition files by date so that future queries on time slices of the data are fast
e.g. query on last 48 hours of
data
28
val query = parsedData.writeStream
.option("checkpointLocation", ...)
.partitionBy("date")
.format("parquet")
.start("/parquetTable")
Checkpointing
Enable checkpointing by
setting the checkpoint
location to save offset
logs
start() actually starts a continuously running StreamingQuery in the Spark cluster
29
val query = parsedData.writeStream
.option("checkpointLocation", ...)
.format("parquet")
.partitionBy("date")
.start("/parquetTable/")
Streaming Query
query is a handle to the continuously
running StreamingQuery
Used to monitor and manage the
execution
30
val query = parsedData.writeStream
.option("checkpointLocation", ...)
.format("parquet")
.partitionBy("date")
.start("/parquetTable/")
[diagram: the StreamingQuery processes new data at each trigger t=1, t=2, t=3]
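A short sketch of what the handle offers for monitoring and management (these are standard StreamingQuery methods; the printing and handling around them is illustrative):

// Monitoring: current status and the most recent progress report (input rates, offsets, durations).
println(query.status)
println(query.lastProgress)

// Management: block until the query stops, or stop it explicitly.
// query.awaitTermination()     // blocks the calling thread
// query.stop()                 // stops the streaming query

// If the query has failed, the cause is available on the handle.
query.exception.foreach(e => println(s"query failed: ${e.getMessage}"))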
Data Consistency on Ad-hoc Queries
Data available for complex, ad-hoc analytics within
seconds
Parquet table is updated atomically, ensuring prefix
integrity
Even if distributed, ad-hoc queries will see either all updates
from streaming query or none, read more in our blog
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
31
[diagram: complex, ad-hoc queries on the latest data, available within seconds]
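A hedged sketch of the ad-hoc side: while the streaming query keeps appending, an independent batch query can read the same Parquet table and will see a consistent prefix of the stream's output. The filter and grouping below are illustrative.

// Ad-hoc batch query over the table the stream is writing to.
val latest = spark.read.parquet("/parquetTable")

latest
  .where("date >= '2017-03-22'")      // e.g. only the most recent partitions
  .groupBy("device")
  .count()
  .show()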
Event-time Aggregations
Many use cases require aggregate statistics by event
time
E.g. what's the #errors in each system in 1-hour windows?
Many challenges
Extracting event time from data, handling late, out-of-order data
DStream APIs were insufficient for event-time processing
32
Event time Aggregations
Windowing is just another type of grouping in Structured Streaming
number of records every hour
33
parsedData
.groupBy(window("timestamp","1 hour"))
.count()
parsedData
.groupBy(
"device",
window("timestamp","10 mins"))
.avg("signal")
avg signal strength of each
device every 10 mins
Event-time Aggregations
Any number of
simultaneous aggs
Custom aggs using
reduceGroups, UDAFs
34
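A hedged sketch of several simultaneous aggregations over the same event-time window in one pass (the particular avg/max/approx_count_distinct mix and the errorCode column are illustrative, not from the deck):

import org.apache.spark.sql.functions.{approx_count_distinct, avg, col, max, window}

parsedData
  .groupBy(col("device"), window(col("timestamp"), "10 minutes"))
  .agg(
    avg("signal").as("avg_signal"),
    max("signal").as("max_signal"),
    approx_count_distinct("errorCode").as("distinct_errors"))   // assumed column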
Stateful Processing for Aggregations
Aggregates have to be saved as
distributed state between
triggers
Each trigger reads previous state and
writes updated state
State stored in memory, backed by
write ahead log in HDFS/S3
Fault-tolerant, exactly-once
guarantee!
35
[diagram: at each trigger (t=1, t=2, t=3) the query reads from the source, reads and updates the state, and writes to the sink; state updates are written to the write-ahead log for checkpointing]
Watermarking and Late Data
Watermark [Spark 2.1] -
threshold on how late an event
is expected to be in event time
Trails behind max seen event
time
Trailing gap is configurable
36
[diagram: the watermark trails the max seen event time by the trailing gap (10 mins here); data older than the watermark is not expected]
Watermarking and Late Data
Data newer than watermark may
be late, but allowed to aggregate
Data older than watermark is
"too late" and dropped
Windows older than watermark
automatically deleted to limit the
amount of intermediate state
37
[diagram: data between the watermark and the max event time is late but still aggregated; data older than the watermark is too late and dropped]
Watermarking and Late Data
38
[diagram: watermark trailing the max event time by an allowed lateness of 10 mins]
parsedData
.withWatermark("timestamp", "10 minutes")
.groupBy(window("timestamp","5 minutes"))
.count()
[diagram: late data within the watermark is aggregated; data too late is dropped]
Watermarking to Limit State [Spark 2.1]
39
[diagram: counts per 5-minute event-time window, computed over processing-time triggers at 12:00, 12:05, 12:10, 12:15]
watermark updated to 12:14 - 10 min = 12:04 for the next trigger; state for windows older than 12:04 is deleted
data that is late but newer than the watermark is still considered in the counts
data too late (older than the watermark) is ignored in the counts, and its state is dropped
parsedData
.withWatermark("timestamp", "10 minutes")
.groupBy(window("timestamp","5 minutes"))
.count()
the system tracks the max observed event time (here 12:14), and the watermark trails it by the 10-minute gap
more details in online
programming guide
Arbitrary Stateful Operations [Spark 2.2]
mapGroupsWithState
allows arbitrary user-defined stateful operations over a user-defined state
supports timeouts
fault-tolerant, exactly-once
supports Scala and Java
40
dataset
.groupByKey(groupingFunc)
.mapGroupsWithState(mappingFunc)
def mappingFunc(
key: K,
values: Iterator[V],
state: KeyedState[S]): U = {
// update or remove state
// set timeouts
// return mapped value
}
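A concrete (hypothetical) instance of the pattern above: keeping a running event count per device across triggers. The Event and DeviceCount classes are assumptions, and note that in the released Spark 2.2 API the state class is named GroupState rather than the pre-release KeyedState shown on the slide. Assumes events: Dataset[Event] and spark.implicits._ are in scope.

import org.apache.spark.sql.streaming.GroupState

case class Event(device: String, signal: Double)
case class DeviceCount(device: String, count: Long)

// Reads the previous count for the device, adds the new events, and persists it.
def countEvents(
    device: String,
    events: Iterator[Event],
    state: GroupState[Long]): DeviceCount = {
  val newCount = state.getOption.getOrElse(0L) + events.size
  state.update(newCount)               // fault-tolerant, checkpointed state
  DeviceCount(device, newCount)
}

val counts = events
  .groupByKey(_.device)
  .mapGroupsWithState(countEvents _)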
Many more updates!
StreamingQueryListener [Spark 2.1]
Receive regular progress heartbeats for health and perf monitoring
Automatic in Databricks!!
Streaming Deduplication [Spark 2.2]
Automatically eliminate duplicate data from Kafka/Kinesis/etc.
More Kafka Integration [Spark 2.2]
Run batch queries on Kafka, and write to Kafka from batch/streaming
queries
Kinesis Source
Read from Amazon Kinesis
41
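Two of these in sketch form, with placeholder column, topic, and broker names: streaming deduplication via dropDuplicates bounded by a watermark, and a batch read of a Kafka topic using the same source options.

// Streaming deduplication [Spark 2.2]: drop repeated events by id;
// the watermark bounds how long dedup state must be kept.
val deduped = parsedData
  .withWatermark("timestamp", "10 minutes")
  .dropDuplicates("eventId", "timestamp")      // assumed id column

// Batch query on Kafka [Spark 2.2]: same source, spark.read instead of readStream.
val batch = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "topic")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()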
Future Directions
Stability, stability, stability
Needed to remove the Experimental tag
More supported operations
Stream-stream joins, …
Stable source and sink APIs
Connect to your streams and stores
More sources and sinks
JDBC, …
42
More Info
Structured Streaming Programming Guide
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Databricks blog posts for more focused discussions
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html
and more to come, stay tuned!!
43