4. Why Spark?
● Batch and streaming framework with a large community and wide adoption
● A leader in data processing that scales across many machines
● Relatively easy to get started
5. Yeah, but what is Spark?
Spark consists of a programming API and execution engine
[Diagram: a Master node coordinating multiple Worker nodes]
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# The songs dataset is tab-separated; infer column types from the data
song_df = (spark.read
    .option('sep', '\t')
    .option('inferSchema', 'true')
    .csv('/databricks-datasets/songs/data-001/part-0000*'))

# Keep just the artist and tempo columns, with readable names
tempo_df = song_df.select(
    col('_c4').alias('artist_name'),
    col('_c14').alias('tempo'),
)

# Average tempo per artist, highest first
avg_tempo_df = (tempo_df
    .groupBy('artist_name')
    .avg('tempo')
    .orderBy('avg(tempo)', ascending=False))

avg_tempo_df.show(truncate=False)
7. What is Spark?
● Fast, general-purpose engine for large-scale data processing
● Replaces MapReduce as the Hadoop parallel programming API
● Many options:
○ YARN / Spark Cluster / Local
○ Scala / Python / Java / R
○ Spark Core / SQL / Streaming / MLlib / GraphX
8. What is Spark Structured Streaming?
● Alternative to traditional Spark Streaming, which used DStreams
● If you are familiar with Spark, it is best to think of Structured Streaming as the Spark SQL API, but for streaming
● In Python, the API lives in the pyspark.sql.streaming module
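A minimal sketch of the idea, using the built-in rate source (which emits timestamp/value rows) so no external system is needed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# readStream returns a streaming DataFrame; the API mirrors batch Spark SQL
stream_df = (spark.readStream
    .format('rate')
    .option('rowsPerSecond', 5)
    .load())

# The same transformations as batch: keep only even values
even_df = stream_df.filter(stream_df.value % 2 == 0)

# Start the query, printing each micro-batch to the console
query = (even_df.writeStream
    .format('console')
    .start())
query.awaitTermination()  # block until the query is stopped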
9. What is Spark Structured Streaming?
Tathagata Das “TD” - Lead Developer on Spark Streaming
● “Fast, fault-tolerant, exactly-once stateful stream processing without having to reason about streaming”
● “The simplest way to perform streaming analytics is not having to reason about streaming at all”
● The stream is modeled as a table that is constantly appended to with each micro-batch
Reference: https://youtu.be/rl8dIzTpxrI
10. Lazy Evaluation with Query Optimization
● Transformations are not run immediately
● Spark builds up a DAG of operations
● The optimizer (Catalyst) rewrites and combines steps
● Execution happens only when an Action is called
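A quick illustration of laziness, using the batch API (streaming queries are planned and optimized the same way); nothing below executes until the final action:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)                    # no job runs yet
doubled = df.withColumn('x2', col('id') * 2)   # still lazy: just extends the plan
filtered = doubled.filter(col('x2') > 10)      # plan grows; nothing executes

filtered.explain()   # inspect the optimized plan Spark built
filtered.count()     # count() is an Action: the whole DAG runs now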
11. Why use Spark Structured Streaming?
● Run Spark as batch or streaming
● Open-source with many connectors
● Stateful streaming and joins
● Kafka is not required for every stream
26. Structured Streaming – Output Mode
How data is written to the sink on each trigger
● Append: only add the newest records
● Update: only rows updated since the last trigger
● Complete: output the latest state of the whole result table (useful for aggregation results)
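As a sketch, the output mode is set on the writer; counts_df below is a hypothetical streaming aggregation:

# 'complete' re-emits the full result table each trigger; aggregations allow it
query = (counts_df.writeStream
    .outputMode('complete')   # or 'append' / 'update'
    .format('console')
    .start())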
29. Structured Streaming - Triggers
Specifies how often to check for new data
● Default: micro-batch mode, run each micro-batch as soon as possible
● Fixed interval: micro-batch mode, wait the specified time between micro-batches
● Once: run just one micro-batch, then the job stops
● Continuous: continuously read from the source; experimental and only works for certain types of queries
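A sketch of how each trigger is specified, assuming a hypothetical streaming DataFrame events_df:

query = (events_df.writeStream
    .format('console')
    .trigger(processingTime='30 seconds')   # Fixed: a micro-batch every 30 seconds
    # .trigger(once=True)                   # Once: a single micro-batch, then stop
    # .trigger(continuous='1 second')       # Continuous: experimental
    .start())                               # omit .trigger() for the default mode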
31. Structured Streaming - Windows
Calculate over event time using windows -> e.g. average trips every 5 minutes
● Tumbling window: a fixed, non-overlapping period of time, such as 1:00 to 1:30
● Sliding window: a fixed-length window that overlaps and can track multiple groups… 1:00 – 1:30, 1:05 – 1:35, 1:10 – 1:40, etc.
● Session window: similar to an inactivity timeout… events are grouped together until a long enough gap in time between them
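A sketch of all three window types, assuming a hypothetical streaming DataFrame trips_df with an event-time column pickup_time and a numeric duration column:

from pyspark.sql.functions import window, session_window, avg

# Tumbling: 5-minute buckets (window length equals the slide interval)
tumbling = (trips_df
    .groupBy(window('pickup_time', '5 minutes'))
    .agg(avg('duration')))

# Sliding: a 30-minute window that advances every 5 minutes
sliding = (trips_df
    .groupBy(window('pickup_time', '30 minutes', '5 minutes'))
    .agg(avg('duration')))

# Session (Spark 3.2+): close a group after a 10-minute gap between events
sessions = (trips_df
    .groupBy(session_window('pickup_time', '10 minutes'))
    .agg(avg('duration')))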
34. Streaming Joins
Stream-static
● Simple join
● Does not track state
● Options: Inner, Left Outer, Left Semi
Stream-stream
● Complex join; data may arrive early or late
● Options: Inner, Left/Right Outer, Full Outer, Left Semi
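A sketch of both kinds, with hypothetical DataFrames (orders_stream and payments_stream are streaming; customers_df is static):

from pyspark.sql.functions import expr

# Stream-static: the static side requires no state tracking
enriched = orders_stream.join(customers_df, on='customer_id', how='inner')

# Stream-stream: watermarks bound how much state Spark must keep
orders = orders_stream.withWatermark('order_time', '10 minutes')
payments = payments_stream.withWatermark('pay_time', '10 minutes')
joined = orders.join(
    payments,
    expr("""
        order_id = payment_order_id AND
        pay_time >= order_time AND
        pay_time <= order_time + interval 20 minutes
    """))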
37. Stateful Streaming
Aggregations, dropDuplicates, and joins require storing state
● Custom stateful operations are possible
● State store backends: HDFS-backed (default) or RocksDB (added in Spark 3.2)
● When retaining many keys, the HDFSBackedStateStore is likely to hit JVM memory issues, since it keeps state in executor heap memory
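Switching providers is a session configuration; as a sketch, the setting below selects the RocksDB provider that ships with Spark 3.2+:

# Use RocksDB instead of the default HDFS-backed (in-JVM) state store
spark.conf.set(
    'spark.sql.streaming.stateStore.providerClass',
    'org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider')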