Stream, stream, stream: Different streaming methods with Spark and Kafka

Itai Yaffe + Ron Tevel
Nielsen

Introduction
Ron Tevel
● Big Data developer
● Developing Big Data infrastructure solutions
Itai Yaffe
● Big Data Tech Lead
● Dealing with Big Data challenges since 2012

Introduction - part 2 (or: “your turn…”)
● Data engineers? Data architects? Something else?
● Working with Spark? Planning to?
● Working with Kafka? Planning to?
● Cloud deployments? On-prem?

Agenda
● Nielsen Marketing Cloud (NMC)
○ About
○ High-level architecture
● Data flow - past and present
● Spark Streaming
○ “Stateless” and “stateful” use-cases
● Spark Structured Streaming
● “Streaming” over our Data Lake

Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen in March 2015
● A Data company
● Machine learning models for insights
● Targeting
● Business decisions

Nielsen Marketing Cloud - questions we try to answer
● How many users of a certain profile can we reach?
○ E.g. for a campaign for fancy women's sneakers
● How many hits did a specific web page get in a date range?

Data flow in the old days...
Events flow from our Serving system and need to be ETL'd into our data stores (DB, DWH, etc.)
● In the past, events were written to CSV files
○ Some fields had double quotes, e.g.:
2014-07-17,12:55:38,204,400,US|FL|daytona beach|32114,cdde7b60a3117cc4c539b10faad665a9,"https%3A%2F%2Floadm.exelator.com%2Fload%2F%3Fp%3D204%26g%3D400%26buid%3D6989098507373987292%26j%3D0","http%3A%2F%2Fwww.vacationrentals.com%2Fvacation-rentals%2Fflorida%2Forlando.html",2,2,0,"1619691,9995","","","1",,"Windows 7","Chrome"
● Processed by a standalone Java process
● This architecture had many problems
○ Truncated lines in input files
○ Can’t enforce schema
○ Had to “manually” scale the processes
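The CSV problems above (quoted fields, truncated lines, no schema) can be illustrated with a minimal plain-Scala sketch. It is not the original Java process: the object name, the helper names and the field-count check are assumptions made for illustration, and escaped quotes inside fields are not handled.

```scala
object CsvSketch {
  // Splits one CSV line on commas, honouring double-quoted fields:
  // a comma inside quotes does not start a new field.
  def splitCsvLine(line: String): Vector[String] = {
    val fields = Vector.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false
    line.foreach {
      case '"' =>
        inQuotes = !inQuotes
      case ',' if !inQuotes =>
        fields += current.toString
        current.clear()
      case c =>
        current += c
    }
    fields += current.toString
    fields.result()
  }

  // Hypothetical "schema" check: the only thing we can verify on raw CSV
  // is the number of fields, which catches most truncated lines.
  def validate(line: String, expectedFields: Int): Either[String, Vector[String]] = {
    val fields = splitCsvLine(line)
    if (fields.length == expectedFields) Right(fields)
    else Left(s"expected $expectedFields fields, got ${fields.length}")
  }
}
```

A field count is a weak substitute for a real schema, which is part of why the CSV-related issues remained until the move to Kafka.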

That's one small step for [a] man...
Moved to Spark (and Scala) in 2014
● Spark
○ An engine for large-scale data processing
○ Distributed, scalable
○ Unified framework for batch, streaming, machine learning, etc
○ Was gaining a lot of popularity in the Big Data community
○ Built on RDDs (Resilient Distributed Datasets)
■ A fault-tolerant collection of elements that can be operated on in parallel
● Scala
○ Combines object-oriented and functional programming
○ First-class citizen in Spark
● Converted the standalone Java processes to Spark batch jobs
○ Solved the scaling issues
○ Still faced the CSV-related issues

Data flow - the modern way
Introducing Kafka
● Open-source stream-processing platform
○ Highly scalable
○ Publish/Subscribe (a.k.a. pub/sub)
○ Schema enforcement
○ Much more
● Originally developed by LinkedIn
● Graduated from the Apache Incubator in late 2012
● Quickly became the de facto standard in the industry
● Today commercial development is led by Confluent

Data flow - the modern way (cont.)
… Along with Spark Streaming
● A natural evolution of our Spark batch job (unified framework - remember?)
● Introduced the DStream concept
○ Continuous stream of data
○ Represented by a continuous series of RDDs
● Works in micro-batches
○ Each RDD in a DStream contains data from a certain interval (e.g. 5 minutes)
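The micro-batch idea can be modelled without Spark. The sketch below is an illustration only, with hypothetical names (`Message`, `toMicroBatches`): it buckets messages by arrival time, so each bucket plays the role of one RDD in the DStream. Unlike Spark, it produces no (empty) batch for an interval in which nothing arrived.

```scala
object MicroBatchSketch {
  final case class Message(arrivalMillis: Long, payload: String)

  // Batch k covers the interval [k * intervalMillis, (k + 1) * intervalMillis).
  def toMicroBatches(messages: Seq[Message], intervalMillis: Long): Seq[(Long, Seq[String])] =
    messages
      .groupBy(_.arrivalMillis / intervalMillis)
      .toSeq
      .sortBy(_._1)
      .map { case (batchIndex, msgs) =>
        (batchIndex, msgs.sortBy(_.arrivalMillis).map(_.payload))
      }
}
```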

Spark Streaming - “stateless” app use-case
We started with Spark Streaming over Kafka (in 2015)
● Our Streaming apps were “stateless”, i.e :
○ Reading a batch of messages from Kafka
○ Performing simple transformations on each message (no aggregations)
○ Writing each batch to a persistent storage (S3)
● Stateful operations (aggregations) were performed in batch on files, either by
○ Spark jobs
○ ETLs in our DB/DWH

The need for stateful streaming
Fast forward a few months...
● New requirements were being raised
● Specific use-case :
○ To take the load off the operational DB (used both as OLTP and OLAP), we wanted to move most of the aggregative operations to our Spark Streaming app

Stateful streaming via “local” aggregations
● The way to achieve it was :
○ Read messages from Kafka
○ Aggregate the messages of the current micro-batch
○ Combine the results with the results of the previous micro-batches (stored on the cluster’s HDFS)
○ Write the results back to HDFS
○ Every X batches :
■ Update the DB with the aggregated data (some sort of UPSERT)
■ Delete the aggregated files from HDFS
● UPSERT = INSERT ... ON DUPLICATE KEY UPDATE … (in MySQL)
○ For example, given t1 with columns a (the key) and b (starting from an empty table)
■ INSERT INTO t1 (a,b) VALUES (1,2) ON DUPLICATE KEY UPDATE b=b+VALUES(b); -> a=1, b=2
■ INSERT INTO t1 (a,b) VALUES (1,5) ON DUPLICATE KEY UPDATE b=b+VALUES(b); -> a=1, b=7
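The steps above can be sketched in plain Scala, with in-memory maps standing in for HDFS and the DB. This is an assumed model rather than the production code; `State`, `step` and `flushEvery` are hypothetical names. The `merge` function has the same effect as ON DUPLICATE KEY UPDATE b=b+VALUES(b).

```scala
object LocalAggregationSketch {
  type Counts = Map[String, Long]

  // Aggregates one micro-batch: number of messages per key.
  def aggregate(batch: Seq[String]): Counts =
    batch.groupBy(identity).map { case (key, hits) => key -> hits.size.toLong }

  // Sums two partial aggregates per key (insert if missing, add if present).
  def merge(left: Counts, right: Counts): Counts =
    right.foldLeft(left) { case (acc, (key, value)) =>
      acc.updated(key, acc.getOrElse(key, 0L) + value)
    }

  // pending stands in for the files on HDFS, db for the database table.
  final case class State(pending: Counts, batchesSinceFlush: Int, db: Counts)

  // Processes one micro-batch; every flushEvery batches the pending
  // aggregates are upserted into the DB and the pending state is cleared.
  def step(state: State, batch: Seq[String], flushEvery: Int): State = {
    val pending = merge(state.pending, aggregate(batch))
    val seen = state.batchesSinceFlush + 1
    if (seen >= flushEvery) State(Map.empty, 0, merge(state.db, pending))
    else State(pending, seen, state.db)
  }
}
```

Note that everything in `pending` is lost if the process dies between flushes, which is the state-management burden this approach puts on the application.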

Stateful streaming via “local” aggregations - cons
● Required us to manage the state on our own
● Error-prone
○ E.g. what if the cluster is terminated and the data on HDFS is lost?
● Complicates the code
○ Mixed input sources for the same app (Kafka + files)
● Possible performance impact
○ Might cause the Kafka consumer to lag
● Obviously not the perfect way (but that’s what we had…)

Structured Streaming - to the rescue?
Spark 2.0 introduced Structured Streaming
● Enables running continuous, incremental processes
○ Basically manages the state for you
● Built on Spark SQL
○ DataFrame/Dataset API
○ Catalyst Optimizer
● Allows handling event-time and late data
● Ensures end-to-end exactly-once fault-tolerance
● Was in ALPHA mode in 2.0 and 2.1

Structured Streaming - basic concepts

Structured Streaming - WordCount example

Structured Streaming - stateful app use-case

Structured Streaming in production
So we started moving to Structured Streaming
Use case | Previous architecture | Old flow | New architecture | New flow
Existing Spark app | Periodic Spark batch job | Read Parquet from S3 -> Transform -> Write Parquet to S3 | Stateless Structured Streaming | Read from Kafka -> Transform -> Write Parquet to S3
Existing Java app | Periodic standalone Java process (“manual” scaling) | Read CSV -> Transform and aggregate -> Write to RDBMS | Stateful Structured Streaming | Read from Kafka -> Transform and aggregate -> Write to RDBMS
New app | N/A | N/A | Stateful Structured Streaming | Read from Kafka -> Transform and aggregate -> Write to RDBMS

Structured Streaming - known issues & tips
● 3 major issues we had in 2.1.0 (solved in 2.1.1) :
○ https://issues.apache.org/jira/browse/SPARK-19517
● Using EMRFS consistent view when checkpointing to S3
○ Recommended for stateless apps
○ For stateful apps, we encountered sporadic issues possibly related to the metadata store (i.e. DynamoDB)

Structured Streaming - strengths and weaknesses (IMO)
● Strengths :
○ Running incremental, continuous processing
○ End-to-end exactly-once fault-tolerance (if you implement it correctly)
○ Increased performance (uses the Catalyst SQL optimizer and other DataFrame optimizations like code generation)
○ Massive efforts are invested in it
● Weaknesses :
○ Maturity
○ Inability to perform multiple actions on the exact same Dataset (by-design?)
■ Seems to be resolved by https://issues.apache.org/jira/browse/SPARK-24565 (in the upcoming Spark 2.4, but then you get at-least-once)

Back to the future - DStream revived for “stateful” app use-case

So what’s wrong with our DStream to Druid application?
● Kafka needs to read 300M messages from disk.
● ConcurrentModificationException when working with Spark Streaming on Kafka 0.10
○ Forced us to use 1 core per executor to avoid it
○ https://issues.apache.org/jira/browse/SPARK-19185 supposedly solved in 2.4.0 (possibly solving https://issues.apache.org/jira/browse/SPARK-22562 as well)
● We wish we could run it less frequently.

Enter “streaming” over RDR

What is RDR?
RDR (Raw Data Repository) is our Data Lake
● Kafka topic messages saved to S3 in Parquet.
● RDR Loaders - Spark streaming applications.
● Applications can read from RDR and do analytics on the data.
Can we leverage our Data Lake as the data source instead of Kafka?

The Idea of How to Stream RDR files

How do we “stream” RDR Files

How do we use the new RDR “streaming” infrastructure?

The Scheduler - Airflow
● “A platform to programmatically author, schedule and monitor workflows”
● Developed by Airbnb and part of the Apache Incubator

Did we solve the problems?
No longer a streaming application, so no more idle cluster.
Name | Day 1 | Day 2 | Day 3
Old App to Druid | $1007.68 | $1007.68 | $1007.68
New App to Druid | $150.08 | $198.73 | $174.68
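To put the table in perspective, here is a small worked calculation. It assumes the figures are daily cluster costs in US dollars (the slide does not say so explicitly); the object and function names are hypothetical.

```scala
object CostSketch {
  // Figures copied from the table, assumed to be daily costs in USD.
  val oldDaily: Seq[Double] = Seq(1007.68, 1007.68, 1007.68)
  val newDaily: Seq[Double] = Seq(150.08, 198.73, 174.68)

  // Percentage saved = (old - new) / old * 100
  def savingPercent(oldCost: Double, newCost: Double): Double =
    (oldCost - newCost) / oldCost * 100.0

  // Saving for each of the three days, and over the whole period.
  val perDay: Seq[Double] =
    oldDaily.zip(newDaily).map { case (o, n) => savingPercent(o, n) }
  val overall: Double = savingPercent(oldDaily.sum, newDaily.sum)
}
```

Under that assumption, the new app costs roughly 80-85% less per day, about 83% less over the three days.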

Did we solve the problems?
● Still reads old messages from Kafka’s disk, but instead of 300M messages we read just 1K messages per hour.
● Doesn’t depend on the integration of Spark Streaming with Kafka - no more weird Kafka exceptions.
● We can run the Spark batch application as (in)frequently as we’d like.

Summary
● Started with Spark Streaming for “stateless” use-cases
○ Replaced CSV files with Kafka (de facto standard in the industry)
○ Already had Spark batch in production (Spark as a unified framework)
● Tried Spark Streaming for stateful use-cases (via “local” aggregations)
○ Not the optimal solution
● Moved to Structured Streaming (for all use-cases)
○ Pros include :
■ Enables running continuous, incremental processes
■ Built on Spark SQL
○ Cons include :
■ Maturity
■ Inability to perform multiple actions on the exact same Dataset (by-design?)

Summary - cont.
● Moved (back) to Spark Streaming
○ Aggregations are done per micro-batch (in Spark) and daily (in Druid)
○ Still not perfect
■ Performance penalty in Kafka for long micro-batches
■ Concurrency issue with Kafka 0.10 consumer in Spark
■ Under-utilized Spark clusters
● Introduced “streaming” over our Data Lake
○ Spark Streaming apps (A.K.A “RDR loaders”) write files to S3 and paths to Kafka
○ Spark batch apps read S3 paths from Kafka (and the actual files from S3)
■ Airflow for scheduling and monitoring
■ Meant for apps that don’t require real time
○ Pros :
■ Eliminated the performance penalty we had in Kafka
■ Spark clusters are much better utilized
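The “streaming” over the Data Lake described above can be modelled roughly as follows. This is a plain-Scala sketch with an in-memory vector standing in for the Kafka topic of S3 paths; `PathTopic`, `runBatch` and the offset handling are assumptions for illustration, not the actual RDR infrastructure.

```scala
object RdrSketch {
  // Stand-in for a Kafka topic partition: an append-only log of S3 paths,
  // written by the RDR loaders after each file lands in S3.
  final case class PathTopic(paths: Vector[String]) {
    def append(path: String): PathTopic = copy(paths = paths :+ path)
  }

  final case class RunResult(pathsToProcess: Vector[String], nextOffset: Int)

  // One scheduled batch run: take every path written since the last
  // committed offset, and report the offset to commit when the run succeeds.
  def runBatch(topic: PathTopic, committedOffset: Int): RunResult =
    RunResult(topic.paths.drop(committedOffset), topic.paths.length)
}
```

Because each message is only a path, a batch run reads a handful of small messages from Kafka and the bulk data from S3, however infrequently it is scheduled.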

QUESTIONS?
Join us - https://www.comeet.co/jobs/nielsen/33.000
Big Data Group Leader
Big Data Team Leader
And more...

THANK YOU!
https://www.linkedin.com/in/rontevel/
https://www.linkedin.com/in/itaiy/

Structured Streaming - additional slides

Structured Streaming - basic terms
● Input sources :
○ File
○ Kafka
○ Socket, Rate (for testing)
● Output modes :
○ Append (default)
○ Complete
○ Update (added in Spark 2.1.1)
○ Different types of queries support different output modes
■ E.g. for non-aggregation queries, Complete mode is not supported, as it is infeasible to keep all unaggregated data in the Result Table
● Output sinks :
○ File
○ Kafka (added in Spark 2.2.0)
○ Foreach
○ Console, Memory (for debugging)
○ Different types of sinks support different output modes

Handling event-time and late data
import org.apache.spark.sql.functions.window
import spark.implicits._ // for the $"..." column syntax; assumes a SparkSession named spark

// Without a watermark: group the data by sliding window and word, and count each group
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()

// With a watermark: same grouping, but data arriving more than 10 minutes late may be dropped
val windowedCountsWithWatermark = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"word")
  .count()
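What the watermark does to late data can be modelled in plain Scala. The sketch below is a simplification under stated assumptions: timestamps are whole non-negative minutes, the watermark is updated once per micro-batch, and an event is dropped when its own timestamp is behind the watermark (Spark's actual rule compares the window end against the watermark). All names are hypothetical.

```scala
object WatermarkSketch {
  final case class Event(timestampMin: Long, word: String)
  final case class State(counts: Map[(Long, String), Long], watermarkMin: Long)

  // Start minutes of every sliding window [start, start + windowMin) that contains ts.
  def windowStarts(ts: Long, windowMin: Long, slideMin: Long): Seq[Long] = {
    val latestStart = ts - (ts % slideMin)
    Iterator.iterate(latestStart)(_ - slideMin).takeWhile(_ > ts - windowMin).toSeq
  }

  // Processes one micro-batch with a 10-minute window sliding every 5 minutes.
  def processBatch(state: State, batch: Seq[Event], delayMin: Long): State = {
    // Simplification: drop events older than the watermark from earlier batches.
    val onTime = batch.filter(_.timestampMin >= state.watermarkMin)
    val counts = onTime.foldLeft(state.counts) { (acc, event) =>
      windowStarts(event.timestampMin, 10L, 5L).foldLeft(acc) { (m, start) =>
        val key = (start, event.word)
        m.updated(key, m.getOrElse(key, 0L) + 1L)
      }
    }
    // The watermark trails the maximum event time seen and never moves backwards.
    val maxEventTime =
      batch.map(_.timestampMin).foldLeft(state.watermarkMin + delayMin)(_ max _)
    State(counts, maxEventTime - delayMin)
  }
}
```

For example, after a batch whose latest event is at minute 25, a 10-minute delay puts the watermark at minute 15, so an event stamped minute 13 in the next batch is ignored while one stamped minute 17 is still counted.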

Fault tolerance
● The goal - end-to-end exactly-once semantics
● The means :
○ Trackable sources (i.e offsets)
○ Checkpointing
○ Idempotent sinks
aggDF
  .writeStream
  .outputMode("complete")
  .option("checkpointLocation", "path/to/HDFS/dir")
  .queryName("aggregates") // hypothetical name; the memory sink requires a query name
  .format("memory")
  .start()
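Why an idempotent sink matters can be seen in a small plain-Scala model. It assumes each micro-batch carries a stable batch id that is replayed unchanged after a failure (which is what trackable offsets plus checkpointing provide); the class and method names are hypothetical, not a Spark API.

```scala
object IdempotentSinkSketch {
  // A sink that remembers which batch ids it has already committed.
  final class IdempotentSink {
    private var committed = Map.empty[Long, Seq[String]]

    // Writing the same batch id twice has the same effect as writing it once.
    def write(batchId: Long, rows: Seq[String]): Unit =
      if (!committed.contains(batchId)) committed = committed.updated(batchId, rows)

    // All committed rows, in batch order.
    def allRows: Seq[String] = committed.toSeq.sortBy(_._1).flatMap(_._2)
  }
}
```

If the job crashes after writing batch 0 but before checkpointing it, the replayed batch 0 is written again and ignored, so the output contains each row exactly once.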

Monitoring
● Interactive APIs :
○ streamingQuery.lastProgress / streamingQuery.status
○ Output example
● Asynchronous API :
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark: SparkSession = ...
// Register a listener that is called back on query lifecycle events
spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {
    println("Query started: " + queryStarted.id)
  }
  override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {
    println("Query terminated: " + queryTerminated.id)
  }
  override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
    println("Query made progress: " + queryProgress.progress)
  }
})
