Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. It lets you express a streaming computation the same way you would express a batch computation on static data. This presentation walks through join operations in Structured Streaming, explaining the concepts using the default micro-batch processing model.
1. Introduction to Joins in Structured Streaming
Himanshu Gupta
Lead Consultant
Knoldus Software LLP
https://softwareengineeringdaily.com/2016/03/09/apache-spark-usage-python-or-scala/
2. Agenda
● Quick Recap
● Unsupported Operations
● Join Operations
● Stream-Static Joins
● Stream-Stream Joins
● Support Matrix for Joins in Streaming Queries
● Demo
3. Quick Recap
● A scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
● Allows us to express our streaming computation the same way we would express a batch computation on static data.
● Uses the Dataset/DataFrame API in Scala, Java, Python or R to express streaming aggregations, event-time windows, etc.
● Leverages the Spark SQL engine to optimize computation.
● Ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing & write-ahead logs (WALs).
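To make the recap concrete, here is a minimal word-count sketch (not from the original deck; the file path, host and port are hypothetical) showing the same computation expressed once on static data and once on a stream:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("recap").getOrCreate()
import spark.implicits._

// Batch: word count over a static file (hypothetical path).
val batchCounts = spark.read.textFile("data/words.txt")
  .flatMap(_.split(" ")).groupBy("value").count()

// Streaming: the identical computation over a socket source (hypothetical port).
val streamCounts = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999).load()
  .as[String].flatMap(_.split(" ")).groupBy("value").count()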
5. Unsupported Operations
Unsupported operations in Structured Streaming are:
● Multiple streaming aggregations (i.e. a chain of aggregations on a streaming DF/DS) are not yet supported on streaming Datasets.
● Limit and take the first N rows are not supported on streaming Datasets.
● Distinct operations on streaming Datasets are not supported.
● Sorting operations are supported on streaming Datasets only after an aggregation and in Complete Output Mode.
● Some types of joins between two streaming Datasets (e.g. full outer joins) are not supported; see the support matrix later in this deck.
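As a hedged illustration (not from the original deck), this is how an unsupported operation typically surfaces: the code compiles, but Spark's unsupported-operation checker rejects the plan when the query is started:

// "rate" is a built-in test source that generates (timestamp, value) rows.
val streamingDF = spark.readStream.format("rate").load()

// This compiles, but starting the query throws an AnalysisException, because
// sorting is only allowed after an aggregation in Complete output mode.
streamingDF.orderBy("value")
  .writeStream.format("console").start()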
6. Join Operations
● Structured Streaming supports Stream-Static and Stream-Stream joins.
● The result of a streaming join is generated incrementally.
● The result of a join with a streaming Dataset/DataFrame is exactly the same as if it were a join with a static Dataset/DataFrame containing the same data as the stream.
7. Stream-Static Joins
● Supported since Apache Spark 2.0.
● They are not stateful, so no state management is required.
import org.apache.spark.sql.functions.{col, from_json}

// Static side: companies read once from a CSV file.
val companiesDF =
  spark.read.option("header", "true").csv("src/main/resources/companies.csv")

// Streaming side: stock events from Kafka; the JSON payload is parsed with a
// schema defined elsewhere (as are bootstrapServer and topic).
val stockStreamDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServer)
  .option("subscribe", topic).load()
  .select(from_json(col("value").cast("string"), schema).as("value"))
  .select("value.*")

// Stream-static join on the common companyName column.
val filteredStockStreamDF = stockStreamDF.join(companiesDF, "companyName")
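A sketch of consuming the joined stream (the console sink and checkpoint path are assumptions, not from the deck):

filteredStockStreamDF.writeStream
  .format("console")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/stock-join-checkpoint") // hypothetical path
  .start()
  .awaitTermination()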
8. Stream-Stream Joins
● Supported since Apache Spark 2.3.
● Challenge:
– At any point of time, the view of the dataset is incomplete for both sides of the join, making it much harder to find matches between inputs.
– Any row received from one input stream can match with any future, yet-to-be-received row from the other input stream.
● Solution:
– For both input streams, we have to buffer past input as streaming state, so that we can match every future input against past input and generate joined results accordingly.
– The engine also automatically handles late, out-of-order data and can limit the state using watermarks.
9. Inner Join
● Joins on any kind of columns, with any kind of join conditions, are supported.
● As the stream runs, the size of the streaming state keeps growing indefinitely, because all past input must be saved: any new input can match with any input from the past.
● To avoid unbounded state, we have to define additional join conditions such that indefinitely old inputs cannot match with future inputs and can therefore be cleared from the state.
10. Example
Let’s say we want to join a stream of trading company names
with another stream of stocks to filter out the stocks that a stock
broker is interested in.
val companies = spark.readStream. ...
val stocks = spark.readStream. ...
// Join with event-time constraints
stocks.join(
companies,
expr("""
companyName = stockName AND stockInputTime >= companyTradingTime AND
stockInputTime <= companyTradingTime + interval 20 seconds
""")
)
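The time-range condition above bounds which rows can match, but by itself it does not let the engine drop old state. For cleanup, both inputs also need watermarks, mirroring the outer-join example on the next slide; a minimal sketch reusing the column names above:

// Watermark both sides so that state older than the watermark, relative to
// the time-range condition, can be dropped by the engine.
val companiesWM = companies.withWatermark("companyTradingTime", "10 seconds")
val stocksWM = stocks.withWatermark("stockInputTime", "20 seconds")

stocksWM.join(
  companiesWM,
  expr("""
    companyName = stockName AND
    stockInputTime >= companyTradingTime AND
    stockInputTime <= companyTradingTime + interval 20 seconds
  """)
)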
11. Outer Join
● Similar to an inner join, except that for left and right outer joins, watermarking and event-time constraints must be specified.
● This is because, to generate the NULL results of an outer join, the engine must know when an input row is never going to match anything in the future.
12. Example
Let’s say we want to keep the information of the stocks which
were not traded for future prospects.
// Apply watermarks on event-time columns
val companiesWithWatermark = companies.withWatermark("companiesTradingTime",
"10 seconds")
val stocksWithWatermark = stocks.withWatermark(”stockInputTime”, "20 seconds")
// Join with event-time constraints
stocksWithWatermark.join(
companiesWithWatermark,
expr("""
companyName = stockName AND stockInputTime >= companyTradingTime AND
stockInputTime <= companyTradingTime + interval 20 seconds
"""), joinType = "leftOuter"
)
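A sketch of starting the query (the sink and checkpoint path are assumptions). Note that the NULL-padded outer results for a row are emitted only once the watermark guarantees that no future match can arrive, so they appear with a delay relative to the inner results:

joinedDF.writeStream
  .format("console")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/outer-join-checkpoint") // hypothetical path
  .start()
  .awaitTermination()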
13. Support Matrix for Joins in Streaming Queries

Left Input | Right Input | Join Type   | Supported
-----------|-------------|-------------|---------------------
Static     | Static      | All types   | Yes
Stream     | Static      | Inner       | Yes
Stream     | Static      | Left Outer  | Yes
Stream     | Static      | Right Outer | No
Stream     | Static      | Full Outer  | No
Static     | Stream      | Inner       | Yes
Static     | Stream      | Left Outer  | No
Static     | Stream      | Right Outer | Yes
Static     | Stream      | Full Outer  | No
Stream     | Stream      | Inner       | Yes
Stream     | Stream      | Left Outer  | Yes (conditionally)
Stream     | Stream      | Right Outer | Yes (conditionally)
Stream     | Stream      | Full Outer  | No