Introduction to Structured Streaming
Next Generation Streaming API for Spark
https://github.com/phatak-dev/spark2.0-examples/tree/master/src/main/scala/com/madhukaraphatak/examples/sparktwo/streaming
● Madhukara Phatak
● Big data consultant and trainer at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Evolution in Stream Processing
● Drawbacks of DStream API
● Introduction to Structured Streaming
● Understanding Sources and Sinks
● Stateful stream applications
● Handling State recovery
● Joins
● Window API
Evolution of Stream Processing
Stream as Fast Batch Processing
● Stream processing viewed as low-latency batch processing
● Storm took a stateless per-message approach and Spark took a minibatch approach
● Focused mostly on stateless / limited-state workloads
● Reconciled using the Lambda architecture
● Fewer features and a less powerful API compared to the batch system
● Ex: Storm, Spark DStream API
Drawbacks of Stream as Fast Batch
● Handling state efficiently over long periods is a challenge in these systems
● The Lambda architecture forces duplication of effort between stream and batch
● As the API is limited, doing any kind of complex operation takes a lot of effort
● No clear abstractions for handling stream-specific concerns like late events, event time, state recovery, etc.
Stream as the default abstraction
● Stream becomes the default abstraction on which both stream processing and batch processing are built
● Batch processing is viewed as bounded stream processing
● Supports most of the advanced stream processing constructs out of the box
● Strong state APIs
● On par with the functionality of the batch API
● Ex: Flink, Beam
Challenges with Stream as default
● Stream as the abstraction makes it hard to combine stream with batch data
● The stream abstraction works well for pipeline-style APIs like map and flatMap, but is challenging for SQL
● The stream abstraction can also be difficult to map to the structured world, since at the platform level it is viewed as a byte stream
● There are efforts like Flink SQL, but we have to wait and see how they turn out
Drawbacks of DStream API
Tied to Minibatch execution
● The DStream API treats a stream as fast batch processing at both the API and runtime level
● Batch time is an integral part of the API, which makes it a minibatch-only API
● Batch time dictates how different abstractions of the API, like window and state, behave
RDD-based API
● The DStream API is based on the RDD API, which is deprecated for user-facing APIs in Spark 2.0
● As the DStream API uses RDDs, it doesn't benefit from all the runtime improvements that happened in Spark SQL
● Difficult to combine with batch APIs, as they use the Dataset abstraction
● Running SQL queries over a stream is awkward and not straightforward
Limited support for Time abstraction
● Only supports the concept of processing time
● No support for ingestion time or event time
● As batch time is defined at the application level, there is no framework-level construct to handle late events
● Windowing on anything other than time is not possible
Introduction to Structured Streaming
Stream as the infinite table
● In Structured Streaming, a stream is modeled as an infinite table, i.e. an infinite Dataset
● As we are using a structured abstraction, it's called the Structured Streaming API
● All input sources, stream transformations and output sinks are modeled as Datasets
● As Dataset is the underlying abstraction, stream transformations are expressed using SQL and the Dataset DSL
Advantage of Stream as infinite table
● Structured data analysis is first class, not layered over an unstructured runtime
● Easy to combine with batch data, as both use the same Dataset abstraction
● Can use the full power of the SQL language to express stateful stream operations
● Benefits from SQL optimisations learnt over decades
● Easy to learn and maintain
Sources and Sinks API
Reading from Socket
● Socket is a built-in source in Structured Streaming
● As with the DStream API, we can read from a socket by specifying a hostname and port
● Returns a DataFrame with a single column called value
● Uses console as the sink to write the output
● Once we have set up the source and sink, we use the query interface to start the execution
● Ex: SocketReadExample (a minimal sketch follows below)
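A minimal sketch of what SocketReadExample might look like; the host, port and app name are illustrative placeholders, and the actual code in the repository may differ:

  import org.apache.spark.sql.SparkSession

  object SocketReadExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder
        .master("local[2]")
        .appName("SocketReadExample")
        .getOrCreate()

      // Socket source: returns a streaming DataFrame with a single column named "value"
      val socketStreamDf = spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", 50050)
        .load()

      // Console sink; start() returns a StreamingQuery handle
      val query = socketStreamDf.writeStream
        .format("console")
        .outputMode("append")
        .start()

      // awaitTermination is on the query, not on the session
      query.awaitTermination()
    }
  }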
Questions from DStream users
● Where is the batch time? Or, how frequently is this going to run?
● awaitTermination is on the query, not on the session? Does that mean we can have multiple queries running in parallel?
● We didn't specify local[2]; how does that work?
● As this program uses DataFrames, how does schema inference work?
Flink vs Spark stream processing
● Spark's run-as-soon-as-possible mode may sound like per-event processing, but it's not
● In Flink, all the operations like map / flatMap run as long-lived processes, and data is streamed through them
● In Spark's ASAP mode, tasks are launched for a given batch and destroyed once it completes
● So Spark still does minibatches, but with much lower latency
Flink Operator Graph
Spark Execution Graph
[Diagram: a Socket Stream feeds minibatches; for each batch (Batch 1, Batch 2, ...) Spark spawns tasks for a Map stage, an Aggregate stage and a Sink stage, and tears them down once the batch completes.]
Independence from Execution Model
● Even though the current Structured Streaming runtime is minibatch, the API doesn't dictate the nature of the runtime
● The Structured Streaming API is built in such a way that the query execution model can be changed in the future
● There is already a plan for a continuous processing mode, to bring Structured Streaming on par with Flink's per-message semantics
● https://issues.apache.org/jira/browse/SPARK-20928
Socket Minibatch
● In the last example, we used the default as-soon-as-possible trigger
● We can mimic the DStream minibatch behaviour by changing the trigger via the trigger API
● The trigger is specified on the query, as it determines the frequency of query execution
● In this example, we create a 5 second trigger, which produces a batch every 5 seconds
● Ex: SocketMiniBatchExample (a sketch follows below)
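A sketch of the trigger change, reusing socketStreamDf from the previous sketch. Trigger.ProcessingTime is the Spark 2.2+ spelling; earlier 2.x releases spell it ProcessingTime("5 seconds") from the same package:

  import org.apache.spark.sql.streaming.Trigger

  // Same socket source as before; only the trigger on the query changes
  val query = socketStreamDf.writeStream
    .format("console")
    .outputMode("append")
    .trigger(Trigger.ProcessingTime("5 seconds")) // one minibatch every 5 seconds
    .start()

  query.awaitTermination()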
Word count on Socket Stream
● Once we know how to read from a source, we can run operations on it
● In this example, we will do a word count using the DataFrame and Dataset APIs
● We will use the Dataset API for data cleanup/preparation and the DataFrame API to define the aggregation
● Ex: SocketWordCount (a sketch follows below)
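A sketch of the word count, assuming the spark session and socketStreamDf from the first sketch:

  import spark.implicits._

  // Dataset API for cleanup/preparation: split lines into non-empty words
  val words = socketStreamDf
    .as[String]
    .flatMap(_.split(" "))
    .filter(_.nonEmpty)

  // DataFrame API for the aggregation: count occurrences of each word
  val wordCounts = words.groupBy("value").count()

  // Complete output mode so the sink always sees the up-to-date counts
  val query = wordCounts.writeStream
    .format("console")
    .outputMode("complete")
    .start()

  query.awaitTermination()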
Understanding State
Stateful operations
● In the last example, we observed that Spark remembers state across batches
● In Structured Streaming, all aggregations are stateful
● The developer needs to choose the complete output mode so that aggregations are always up to date
● Spark internally uses both disk- and memory-backed state stores to remember state
● No more complicated state management in application code
Understanding output mode
● Output mode defines the DataFrame the sink sees after each batch
● APPEND signifies that the sink only sees the records from the last batch
● UPDATE signifies that the sink sees the records that changed across batches
● COMPLETE signifies that the sink sees the complete output for every batch
● Depending on the operations, we need to choose the appropriate output mode (see the sketch below)
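For illustration, the output mode is chosen on the writer, here reusing the wordCounts DataFrame from the earlier sketch. Which modes are legal depends on the query; a plain streaming aggregation without a watermark, for example, cannot use APPEND:

  val query = wordCounts.writeStream
    .format("console")
    .outputMode("complete") // or "append" / "update", depending on the query
    .start()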
Stateless aggregations
● Most stream applications benefit from the default statefulness
● But sometimes we need aggregations done per batch rather than on the complete data
● Helpful for porting existing DStream code to Structured Streaming code
● Spark exposes the flatMapGroups API to define stateless aggregations
Stateless wordcount
● In this example, we will define a word count over each batch of data
● The batch is defined as 5 seconds
● Rather than using the groupBy and count APIs, we will use the groupByKey and flatMapGroups APIs
● flatMapGroups defines the operations to be done on each group
● We will be using output mode APPEND
● Ex: StatelessWordCount (a sketch follows below)
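A sketch along the lines of StatelessWordCount, assuming the same socket source and spark session; the exact code in the repository may differ:

  import spark.implicits._
  import org.apache.spark.sql.streaming.Trigger

  // groupByKey + flatMapGroups: counts are computed per batch, no state carries over
  val words = socketStreamDf.as[String].flatMap(_.split(" "))

  val batchCounts = words
    .groupByKey(word => word)
    .flatMapGroups { (word, occurrences) => Iterator((word, occurrences.length)) }

  val query = batchCounts.writeStream
    .format("console")
    .outputMode("append") // flatMapGroups only works with APPEND
    .trigger(Trigger.ProcessingTime("5 seconds"))
    .start()

  query.awaitTermination()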
Limitations of flatMapGroups
● flatMapGroups will be slower than groupBy and count, as it doesn't support partial aggregations
● flatMapGroups can be used only with output mode APPEND, as the output size of the function is unbounded
● flatMapGroups needs grouping done using the Dataset API, not the DataFrame API
Checkpoint and state recovery
● Building stateful applications comes with the additional responsibility of checkpointing the state for safe recovery
● Checkpointing is achieved by writing the state of the application to HDFS-compatible storage
● Checkpointing is per query, so you can mix and match stateless and stateful queries in the same application
● Ex: RecoverableAggregation (a sketch follows below)
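A sketch of adding a checkpoint location to the stateful word count query; the path is a placeholder and would normally point at HDFS-compatible storage:

  // checkpointLocation is set per query; on restart, state is recovered from it
  val query = wordCounts.writeStream
    .format("console")
    .outputMode("complete")
    .option("checkpointLocation", "/tmp/structured-streaming/wordcount-checkpoint")
    .start()

  query.awaitTermination()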
Working with Files
File streams
● Structured Streaming has excellent support for file-based streams
● Supports file formats like CSV, JSON and Parquet out of the box
● Schema inference is not supported
● Picking up new files on arrival works the same as in the DStream file stream API
● Ex: FileStreamExample (a sketch follows below)
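A sketch of a file stream along the lines of FileStreamExample; the schema, column names and directory are assumptions made for illustration:

  import org.apache.spark.sql.types._

  // Schema must be given explicitly for a streaming file source
  val salesSchema = StructType(Seq(
    StructField("transactionId", StringType),
    StructField("customerId", StringType),
    StructField("itemId", StringType),
    StructField("amountPaid", DoubleType)
  ))

  // New CSV files dropped into the directory are picked up as they arrive
  val salesStreamDf = spark.readStream
    .format("csv")
    .schema(salesSchema)
    .load("/tmp/sales")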
Joins with static data
● As Dataset is the common abstraction across the batch and stream APIs, we can easily enrich a structured stream with static data
● As both have schemas built in, Spark can use the Catalyst optimiser to optimise joins between files and streams
● In our example, we will enrich the sales stream with customer data
● Ex: StreamJoin (a sketch follows below)
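A sketch along the lines of StreamJoin, reusing the hypothetical salesStreamDf above together with a static customer dataset; paths and column names are assumptions:

  // Static customer data, read with the ordinary batch API
  val customerDf = spark.read
    .format("csv")
    .option("header", "true")
    .load("/tmp/customers")

  // Enrich the streaming sales data with customer attributes; Catalyst optimises the join
  val enrichedSales = salesStreamDf.join(customerDf, Seq("customerId"))

  val query = enrichedSales.writeStream
    .format("console")
    .outputMode("append")
    .start()

  query.awaitTermination()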
References
● http://blog.madhukaraphatak.com/categories/introduction-structured-streaming/
● https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
● https://flink.apache.org/news/2016/05/24/stream-sql.html