At NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences.
To achieve that, we need to ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner.
In this session, we will discuss how we continuously transform our data infrastructure to support these goals.
Specifically, we will review how we went from CSV files and standalone Java applications all the way to multiple Kafka and Spark clusters, performing a mixture of Streaming and Batch ETLs, and supporting 10x data growth.
We will share our experience as early-adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty...).
We will present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services' costs.
Topics include :
* Kafka and Spark Streaming for stateless and stateful use-cases
* Spark Structured Streaming as a possible alternative
* Combining Spark Streaming with batch ETLs
* "Streaming" over Data Lake using Kafka
3. Introduction - part 2 (or: “your turn…”)
● Data engineers? Data architects? Something else?
● Attended our session yesterday about counting
unique users with Druid?
● Working with Spark/Kafka? Planning to?
4. Agenda
● Nielsen Marketing Cloud (NMC)
○ About
○ High-level architecture
● Data flow - past and present
● Spark Streaming
○ “Stateless” and “stateful” use-cases
● Spark Structured Streaming
● “Streaming” over our Data Lake
5. Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen in March 2015
● A Data company
● Machine learning models for insights
● Targeting
● Business decisions
6. Nielsen Marketing Cloud - questions we try to answer
1. How many unique users of a certain profile can we reach?
E.g. a campaign for young women who love tech
2. How many impressions did a campaign receive?
8. Data flow in the old days...
[Diagram: CSV files → standalone Java process → in-DB aggregation → OLAP]
9. Data flow in the old days… What’s wrong with that?
● CSV-related issues, e.g:
○ Truncated lines in input files
○ Can’t enforce schema
● Scale-related issues, e.g:
○ Had to “manually” scale the processes
10. That's one small step for [a] man… (2014)
“Apache Spark is the Taylor Swift of big data software" (Derrick Harris, Fortune.com, 2015)
[Diagram: CSV files → Spark batch job → in-DB aggregation → OLAP]
11. Why just a small step?
● Solved the scaling issues
● Still faced the CSV-related issues
12. Data flow - the modern way
[Diagram: Spark Streaming + Kafka]
14. The need for stateful streaming
Fast forward a few months...
●New requirements were being raised
●Specific use-case :
○ To take the load off of the operational DB (used both as OLTP and OLAP), we wanted to move most of the aggregative operations to our Spark Streaming app
15. Stateful streaming via “local” aggregations
[Diagram:]
1. Read messages from Kafka
2. Aggregate the current micro-batch
3. Write the combined aggregated data to HDFS
4. Read the aggregated data from HDFS (every X micro-batches)
5. Upsert the aggregated data into the OLAP DB (every X micro-batches)
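A minimal sketch of this pattern in Scala. The DStream (kafkaStream), the HDFS paths and the upsertToDb helper are assumptions, and first-run/cleanup bookkeeping is omitted:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val upsertEvery = 10                        // "X" micro-batches
var batchCounter = 0

kafkaStream                                 // assumed DStream of Kafka records
  .map(record => (record.value, 1L))
  .reduceByKey(_ + _)                       // 2. aggregate the current micro-batch
  .foreachRDD { rdd =>
    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._
    val current  = rdd.toDF("key", "count")
    val previous = spark.read.parquet("hdfs:///aggregations")          // 4. read previous results
    val combined = current.union(previous)
      .groupBy("key").agg(sum("count").as("count"))
    combined.write.mode("overwrite").parquet("hdfs:///aggregations2")  // 3. write combined data
    batchCounter += 1
    if (batchCounter % upsertEvery == 0) {
      upsertToDb(combined)  // 5. hypothetical helper doing INSERT ... ON DUPLICATE KEY UPDATE
      // ...then delete the aggregated files from HDFS and start over
    }
  }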
16. Stateful streaming via “local” aggregations
● Required us to manage the state on our own
● Error-prone
○ E.g what if my cluster is terminated and data on HDFS is lost?
● Complicates the code
○ Mixed input sources for the same app (Kafka + files)
● Possible performance impact
○ Might cause the Kafka consumer to lag
17. Structured Streaming - to the rescue?
Spark 2.0 introduced Structured Streaming
●Enables running continuous, incremental processes
○ Basically manages the state for you
●Built on Spark SQL
○ DataFrame/Dataset API
○ Catalyst Optimizer
●Many other features
●Was in ALPHA mode in 2.0 and 2.1
18. Structured Streaming - stateful app use-case
[Diagram: Structured Streaming]
1. Read messages from Kafka
2. Aggregate the current window
3. Checkpoint (offsets and state) handled internally by Spark
4. Upsert aggregated data into the OLAP DB (on window end)
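A minimal sketch of such an app, assuming hypothetical broker/topic names, checkpoint bucket and UpsertWriter sink:

import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("stateful-agg").getOrCreate()
import spark.implicits._

// 1. read messages from Kafka (broker/topic names are assumptions)
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS key", "timestamp")

// 2. aggregate the current window; 3. offsets and state are checkpointed by Spark
val aggregated = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "1 hour"), $"key")
  .count()

// 4. hypothetical sink that upserts each result row into the OLAP DB
class UpsertWriter extends ForeachWriter[Row] {
  def open(partitionId: Long, version: Long): Boolean = true
  def process(row: Row): Unit = { /* e.g INSERT ... ON DUPLICATE KEY UPDATE */ }
  def close(errorOrNull: Throwable): Unit = {}
}

aggregated.writeStream
  .outputMode("update")
  .option("checkpointLocation", "s3://my-bucket/checkpoints/stateful-agg")
  .foreach(new UpsertWriter())
  .start()
  .awaitTermination()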
19. Structured Streaming - known issues & tips
● 3 major issues we had in 2.1.0 (solved in 2.1.1) :
○ https://issues.apache.org/jira/browse/SPARK-19517
○ https://issues.apache.org/jira/browse/SPARK-19677
○ https://issues.apache.org/jira/browse/SPARK-19407
● Checkpointing to S3 wasn’t straightforward
○ Tried using EMRFS consistent view
■ Worked for stateless apps
■ Encountered sporadic issues for stateful apps
20. Structured Streaming - strengths and weaknesses (IMO)
● Strengths include :
○ Running incremental, continuous processing
○ Increased performance (e.g via Catalyst SQL optimizer)
○ Massive efforts are invested in it
● Weaknesses were mostly related to maturity
21. Back to the future - Spark Streaming revived for “stateful” app use-case
[Diagram:]
1. Read messages from Kafka
2. Aggregate the current micro-batch
3. Write files
4. Load data into the OLAP DB
23. Cool, so… Why can’t we stop here? (cont.)
● Extreme load of Kafka brokers’ disks
○ Each micro-batch needs to read ~300M messages, Kafka can’t store it all in memory
● ConcurrentModificationException when using Spark Streaming + Kafka 0.10 integration
○ Forced us to use 1 core per executor to avoid it
○ https://issues.apache.org/jira/browse/SPARK-19185 supposedly solved in 2.4.0 (possibly solving
https://issues.apache.org/jira/browse/SPARK-22562 as well)
● We wish we could run it even less frequently
○ Remember - longer micro-batches result in a better aggregation ratio
24. Enter “streaming” over RDR
RDR (or Raw Data Repository) is our Data Lake
●Kafka topic messages are stored on S3 in Parquet format
●RDR Loaders - stateless Spark Streaming applications
●Applications can read data from RDR for various use-cases
○ E.g analyzing data of the last 30 days
Can we leverage our Data Lake and use it as the data source (instead of Kafka)?
25. How do we “stream” RDR files - producer side
[Diagram: RDR loaders between Kafka and S3 (RDR)]
1. Read messages from Kafka
2. Write files to S3 (RDR)
3. Write the files’ paths to a designated Kafka topic (the files’ paths are the messages)
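Sketched below under assumed names (bucket, topics, broker), one RDR loader micro-batch might look like:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.SparkSession

kafkaStream.foreachRDD { rdd =>               // 1. kafkaStream is the assumed input DStream
  val spark = SparkSession.builder.getOrCreate()
  import spark.implicits._

  // 2. write the micro-batch as Parquet files to RDR (path layout is hypothetical)
  val path = s"s3://rdr-bucket/topic-x/dt=2019-06-01/batch-${System.currentTimeMillis}"
  rdd.map(_.value).toDF("value").write.parquet(path)

  // 3. publish the files' path to the designated "paths" topic
  val props = new Properties()
  props.put("bootstrap.servers", "broker1:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)
  producer.send(new ProducerRecord("topic-x-paths", path))
  producer.close()
}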
26. How do we “stream” RDR files - consumer side
[Diagram: S3 (RDR)]
1. Read the files’ paths from Kafka
2. Read the RDR files from S3
3. Process the files
27. How do we use the new RDR “streaming” infrastructure?
[Diagram:]
1. Read the files’ paths from Kafka
2. Read the RDR files from S3
3. Write files
4. Load data into the OLAP DB
28. Did we solve the aforementioned problems?
● EMR clusters are now transient - no more idle clusters
Application type            Day 1      Day 2      Day 3
Old Spark Streaming app     $1007.68   $1007.68   $1007.68
“Streaming” over RDR app    $150.08    $198.73    $174.68
29. Did we solve the aforementioned problems? (cont.)
● No more extreme load of Kafka brokers’ disks
○ We still read old messages from Kafka, but now we only read about 1K messages per hour (rather than ~300M)
● The new infra doesn’t depend on the integration of Spark Streaming with Kafka
○ No more weird exceptions...
● We can run the Spark batch applications as (in)frequently as we’d like
30. Summary
● Initially replaced standalone Java with Spark & Scala
○ Still faced CSV-related issues
● Introduced Spark Streaming & Kafka for “stateless” use-cases
○ Quickly needed to handle stateful use-cases as well
● Tried Spark Streaming for stateful use-cases (via “local” aggregations)
○ Required us to manage the state on our own
● Moved to Structured Streaming (for all use-cases)
○ Cons were mostly related to maturity
31. Summary (cont.)
● Went back to Spark Streaming (with Druid as OLAP)
○ Performance penalty in Kafka for long micro-batches
○ Under-utilized Spark clusters
○ Etc.
● Introduced “streaming” over our Data Lake
○ Eliminated Kafka performance penalty
○ Spark clusters are much better utilized = $$$ saved
○ And more...
32. Want to know more?
● Women in Big Data
○ A world-wide program that aims :
■ To inspire, connect, grow, and champion the success of women in Big Data
■ To grow women’s representation in the Big Data field to >25% by 2020
○ Visit the website (https://www.womeninbigdata.org/)
● Counting Unique Users in Real-Time: Here’s a Challenge for You!
○ Presented yesterday, http://tinyurl.com/yxjc72af
● NMC Tech Blog - https://medium.com/nmc-techblog
36. Structured Streaming - basic concepts
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
[Diagram: a data stream as an unbounded table]
New data in the data stream = new rows appended to an unbounded table
38. Structured Streaming - WordCount example
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
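For reference, the guide’s socket-based WordCount looks roughly like this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
import spark.implicits._

// Read lines from a socket source (localhost:9999), split into words, count per word
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()

val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()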
39. Structured Streaming - basic terms
● Input sources :
○ File
○ Kafka
○ Socket, Rate (for testing)
● Output modes :
○ Append (default)
○ Complete
○ Update (added in Spark 2.1.1)
○ Different types of queries support different output modes
■ E.g for non-aggregation queries, Complete mode is not supported as it is infeasible to keep all unaggregated data in the Result Table
● Output sinks :
○ File
○ Kafka (added in Spark 2.2.0)
○ Foreach
○ Console, Memory (for debugging)
○ Different types of sinks support different output modes
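For illustration, wiring an aggregation query (the aggregated Dataset from the earlier sketch is assumed) to the Console sink in Update mode:

val query = aggregated.writeStream
  .outputMode("update")        // Append / Complete / Update
  .format("console")           // File / Kafka / Foreach / Console / Memory
  .option("truncate", "false")
  .start()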
40. Fault tolerance
● The goal - end-to-end exactly-once semantics
● The means :
○ Trackable sources (i.e offsets)
○ Checkpointing
○ Idempotent sinks
42. Structured Streaming in production
So we started moving to Structured Streaming
● Existing Spark app
○ Previous architecture: periodic Spark batch job
○ Old flow: read Parquet from S3 -> transform -> write Parquet to S3
○ New architecture: stateless Structured Streaming
○ New flow: read from Kafka -> transform -> write Parquet to S3
● Existing Java app
○ Previous architecture: periodic standalone Java process (“manual” scaling)
○ Old flow: read CSV -> transform and aggregate -> write to RDBMS
○ New architecture: stateful Structured Streaming
○ New flow: read from Kafka -> transform and aggregate -> write to RDBMS
● New app
○ Previous architecture: N/A
○ Old flow: N/A
○ New architecture: stateful Structured Streaming
○ New flow: read from Kafka -> transform and aggregate -> write to RDBMS
Editor’s notes
Thank you for coming to hear about our different use-cases of Streaming with Spark and Kafka
I will try to make it interesting and valuable for you
Questions - at the end of the session
Nielsen marketing cloud or NMC in short
A group inside Nielsen,
Born from eXelate, a company that was acquired by Nielsen in March 2015
Nielsen is a data company and so are we; we had a strong business relationship, until at some point they decided to go for it and acquired eXelate
Data company meaning
Buying and onboarding data into NMC from data providers, customers and Nielsen data
We have huge high quality dataset
enrich the data using machine learning models in order to create more relevant quality insights
categorize and sell according to a need
Helping brands make intelligent business decisions
E.g. Targeting in the digital marketing world
Meaning help fit ads to viewers
For example, a street sign can fit only a very small % of the people who see it, vs
Online ads that can fit the profile of the individual who sees them
More interesting to the user
Better chances they will click the ad
Better ROI for the marketer
What are the questions we try to answer in NMC that help our customers make business decisions?
A lot of questions, but these lead to what Druid is meant to solve
Translating from human problem to technical problem:
UU (distinct) count
Simple count
Few words on NMC data pipeline architecture:
Frontend layer:
Receives all the online and offline data traffic
Bare metal on different data centers (3 in US, 2 in EU, 3 in APAC)
near real time - high throughput/low latency challenges
Backend layer
AWS cloud-based
Processes all the frontend layer outputs
ETLs - load aggregated and raw data into the data stores
Applications layer
Also in the cloud
Variety of apps above all our data sources
Web - NMC
data configurations (segments, audiences etc)
campaign analysis , campaign management tools etc.
visualized profile graphs
reports
We’ve used Clustrix (our operation DB) as both OLTP and OLAP
Events are flowing from our Serving system, need to ETL the data into our data stores (DB, DWH, etc.)
Events were written to CSV files
Some fields had double quotes, e.g: 2014-07-17,12:55:38,2,2,0,"1619691,9995",1
Processing was done via standalone Java process
Had many problems with this architecture
Truncated lines in input files
Can’t enforce schema
Had to “manually” scale the processes
Around 2014 the standalone Java processes were transformed into Spark batch jobs written in Scala (but in this presentation we’re going to focus on streaming).
This is a simplified version of what we built (simplified it to make it clearer across the presentation)
Spark
A distributed, scalable engine for large-scale data processing
Unified framework for batch, streaming, machine learning, etc
Was gaining a lot of popularity in the Big Data community
Built on RDDs (Resilient distributed dataset)
A fault-tolerant collection of elements that can be operated on in parallel
Scala
Combines object-oriented and functional programming
First-class citizen in Spark
Kafka
Open-source stream-processing platform
Highly scalable
Publish/Subscribe (A.K.A pub/sub)
Schema enforcement - using Schema Registry and relying on Avro format
Much more
Originally developed by LinkedIn
Graduated from Apache Incubator in late 2012
Quickly became the de facto standard in the industry
Today commercial development is led by Confluent
Spark Streaming
A natural evolution of our Spark batch jobs (unified framework – remember?)
Introduced the DStream concept
Continuous stream of data
Represented by a continuous series of RDDs
Works in micro-batches
Each RDD in a DStream contains data from a certain interval (e.g 5 minutes)
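A minimal sketch of setting up such micro-batches (the app name is an assumption):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val conf = new SparkConf().setAppName("nmc-streaming-app")
val ssc  = new StreamingContext(conf, Minutes(5))  // each micro-batch covers a 5-minute interval
// ... create a DStream (e.g via KafkaUtils.createDirectStream) and its transformations here ...
ssc.start()
ssc.awaitTermination()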
We started with Spark Streaming over Kafka (in 2015)
Our Streaming apps were “stateless” (see below) and running 24/7 :
Reading a batch of messages from Kafka
Performing simple transformations on each message (no aggregations)
Writing the output of each batch to a persistent storage (DB, S3, etc.)
Stateful operations (aggregations) were performed periodically in batch either by
Spark jobs
ETLs in our DB/DWH
Looking back, Spark Streaming might have been able to perform stateful operations for us, but (as far as I recall) mapWithState wasn’t available yet, and updateStateByKey had some pending issues.
The way to achieve it was :
Read messages from Kafka
Aggregate the messages of the current micro-batch
Increased micro-batch length to achieve a better aggregation ratio
Combine the results with the results of the previous micro-batches (stored on the cluster’s HDFS)
Write the results back to HDFS
Every X batches :
Update the DB with the aggregated data (some sort of UPSERT)
Delete the aggregated files from HDFS
UPSERT = INSERT ... ON DUPLICATE KEY UPDATE … (in MySQL)
For example, given t1 with columns a (the key) and b (starting from an empty table)
INSERT INTO t1 (a,b) VALUES (1,2) ON DUPLICATE KEY UPDATE b=b+VALUES(b); -> a=1, b=2
INSERT INTO t1 (a,b) VALUES (1,5) ON DUPLICATE KEY UPDATE b=b+VALUES(b); -> a=1, b=7
In this specific use-case, the app was reading from a topic which had only small amounts of data
Required us to manage the state on our own
Error-prone
E.g what if my cluster is terminated and data on HDFS is lost?
Complicates the code
Mixed input sources for the same app (Kafka + files)
Possible performance impact
Might cause the Kafka consumer to lag
Obviously not the perfect way (but that’s what we had…)
DataFrame/Dataset - rather than DStream’s RDD
Catalyst Optimizer - extensible query optimizer which is “at the core of Spark SQL… designed with these key two purposes:
Easily add new optimization techniques and features to Spark SQL
Enable external developers to extend the optimizer (e.g. adding data source specific rules, support for new data types, etc.)” (see https://databricks.com/glossary/catalyst-optimizer)
Other features included :
Handling event-time and late data
End-to-end exactly-once fault-tolerance
Checkpoint folder is the location where Spark stores :
The offsets we already read from Kafka
The state of the stateful operations (e.g aggregations)
We’ve used S3 (via EMRFS) for checkpointing.
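For example (the bucket name and the events streaming DataFrame are assumptions; on EMR, s3:// URIs go through EMRFS):

val query = events.writeStream
  .format("parquet")
  .option("path", "s3://my-bucket/output/")
  .option("checkpointLocation", "s3://my-bucket/checkpoints/my-app")  // offsets + state stored here
  .start()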
We’ve deployed to production various use-cases using Structured Streaming :
Periodic Spark batch job was converted to a stateless Structured Streaming app
Periodic standalone Java app was converted to a stateful Structured Streaming app
A brand new app was written as a stateful Structured Streaming app
EMRFS consistent view - an optional feature on AWS EMR, allows clusters to check for list and read-after-write consistency for S3 objects written by or synced with EMRFS
Checkpointing to S3 wasn’t straightforward
Try using EMRFS consistent view
Recommended for stateless apps
For stateful apps, we encountered sporadic issues possibly related to the metadata store (i.e DynamoDB)
Strengths :
Running incremental, continuous processing
End-to-end exactly-once fault-tolerance (if you implement it correctly)
Increased performance (uses the Catalyst SQL optimizer and other DataFrame optimizations like code generation)
Massive efforts are invested in it
Weaknesses :
Maturity
Inability to perform multiple actions on the exact same Dataset
E.g http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-Avoiding-multiple-streaming-queries-tt30944.html
Seems to be resolved by https://issues.apache.org/jira/browse/SPARK-24565 (in Spark 2.4, but then you get at-least once)
Moved many apps (mostly the ones performing UU counts and “hit” counts) to rely on Druid, which is meant for OLAP, so now :
Spark Streaming app
Runs on a long-lived EMR cluster (cluster is on 24/7)
Performs the “in-batch” aggregation per micro-batch (before writing to S3)
Writes relevant metadata to RDS (e.g S3 path)
This kind of “split” (i.e persisting the Dataset/DataFrame and iterating it a few times) is impossible with Structured Streaming (where every “branch” of processing is a separate query, at least until https://issues.apache.org/jira/browse/SPARK-24565)
M/R ingestion job (loads data into Druid) :
Reads relevant metadata from RDS
Performs the final aggregation (before data is loaded into Druid)
Updates state in RDS (e.g which files were handled)
Screenshot from Ganglia installed on our AWS EMR cluster running the Spark Streaming app
Remember - longer micro-batches result in a better aggregation ratio
Each such app runs on its own long-lived EMR cluster
Extreme load of Kafka brokers’ disks
Each micro-batch needs to read ~300M messages, Kafka can’t store it all in memory
ConcurrentModificationException when using Spark Streaming + Kafka 0.10 integration
Forced us to use 1 core per executor to avoid it
https://issues.apache.org/jira/browse/SPARK-19185 supposedly solved in 2.4.0 (possibly solving https://issues.apache.org/jira/browse/SPARK-22562 as well)
We wish we could run it even less frequently
Remember - longer micro-batches result in a better aggregation ratio
Each Kafka topic has its own RDR loader, which stores the data in a separate bucket on S3 (partitioned by date)
This means each topic has only 1 consumer (the appropriate RDR loader)
RDR loaders use micro-batches of 4-6 minutes, writing about 100 files per micro-batch, each file is ~0.5GB (to allow efficient read)
Only simple transformations on each message (no aggregations)
Hence no need for long micro-batches
Once the RDR loader writes the files to S3, it also writes the files’ paths to a designated topic in Kafka
How does that work?
Spark batch applications are executed every X hours
On each execution :
A transient EMR cluster is launched
An app consumes the next Y messages from the designated Kafka topic (containing Y paths). Since each such app consumes from Kafka in a standard way, the offsets (i.e which messages the consumer already read) are committed (and maintained) the same way as we do for any other Kafka consumer
Then the app reads those Y paths from RDR and processes them
Once done, the EMR cluster is terminated
Applications are now batch rather than streaming
We no longer use Spark Streaming-Kafka integration, but rather Kafka API (from the driver) to read the files’ paths from the designated Kafka topic
Once we got the paths from Kafka, we use the “regular” batch method of reading files, i.e spark.read.parquet
After processing has ended, offsets of the messages we read are committed (as we’d do for any Kafka consumer)
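A minimal sketch of one such batch execution; the broker, topic, group id and “Y” (via max.poll.records) are assumptions:

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.spark.sql.SparkSession

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("group.id", "rdr-batch-app")
props.put("enable.auto.commit", "false")
props.put("max.poll.records", "1000")  // "Y" - how many paths each execution consumes
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Arrays.asList("topic-x-paths"))
val paths = consumer.poll(10000L).asScala.map(_.value).toSeq  // 1. read the next Y paths

val spark = SparkSession.builder.getOrCreate()
val data  = spark.read.parquet(paths: _*)                     // 2. read the RDR files
// ... transform, aggregate and write the output ...
consumer.commitSync()                                         // commit offsets once processing is done
consumer.close()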
We now use Airflow (the de facto standard in the industry) to schedule and monitor our batch jobs
All this obviously is not meant to be used by apps that require actual real time (say milliseconds)
EMR clusters are now transient, so the cluster is terminated as soon as the batch job has finished - no more idle clusters
Cost:
Spark Streaming cluster is on 24/7, so the cost is fixed
With the new infra, the daily cost varies based on the amount of data we processed that day
Initially replaced standalone Java with Spark & Scala
Solved the scale-related issues but not the CSV-related issues
Introduced Spark Streaming & Kafka for “stateless” use-cases
Replaced CSV files with Kafka (de facto standard in the industry)
Already had Spark batch in production (Spark as a unified framework)
Tried Spark Streaming for stateful use-cases (via “local” aggregations)
Not the optimal solution
Moved to Structured Streaming (for all use-cases)
Pros include :
Enables running continuous, incremental processes
Built on Spark SQL
Cons include :
Maturity
Inability to perform multiple actions on the exact same Dataset
Went back to Spark Streaming
Aggregations are done per micro-batch (in Spark) and daily (in Druid)
Still not perfect
Performance penalty in Kafka for long micro-batches
Concurrency issue with Kafka 0.10 consumer in Spark
Under-utilized Spark clusters
Introduced “streaming” over our Data Lake
Spark Streaming apps (A.K.A “RDR loaders”) write files to S3 and paths to Kafka
Spark batch apps read S3 paths from Kafka (and the actual files from S3)
Transient EMR clusters
Airflow for scheduling and monitoring
Pros :
Eliminated the performance penalty we had in Kafka
Spark clusters are much better utilized = $$$ saved
“The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended… You will express your streaming computation as standard batch-like query as on a static table, and Spark runs it as an incremental query on the unbounded input table”
“A query on the input will generate the “Result Table”. Every trigger interval (say, every 1 second), new rows get appended to the Input Table, which eventually updates the Result Table. Whenever the result table gets updated, we would want to write the changed result rows to an external sink.”