Our new product, the Clicktale Experience Cloud, requires processing up to half a million messages per second, sessionizing each user's journey through a web page. In this talk we'll discuss how we achieved that using Spark's stateful streaming capabilities with only a few servers in production, the challenges we faced, and how we solved them. We'll also take a look at Spark 2.2 (the brand-new version) and its new stateful aggregation, and talk about how we used it to improve performance significantly.
Agenda
• Introduction to Spark
• Spark In Depth
• What Is Sessionization?
• Spark Brief Overview
• Sessionization With Spark Streaming
• Scale Challenges
• Structured Streaming with Stateful Aggregations
What is Sessionization?
Session:
“A sequence of requests made by a single end-user during a visit to a particular site”
(Wikipedia)
• Goal: aggregate user actions over time
• The data doesn’t arrive all at once, but piece by piece
Pipeline CEC – Data Types
[Diagram: a Visit is composed of PageViews; each PageView is assembled from an Init message, one or more Chunk messages, and an End message]
• PageView – user’s journey on a single web page
• Visit – user’s journey on the site
Requirements overview
• Message size ranging between 200 B – 1 KB (may grow over time)
• Process incoming user messages at up to 100,000 messages per second
• Handle traffic peaks of up to 1,000,000 messages per second (common with Fortune 500 companies)
• Scale out as needed without user intervention (ideally linearly)
• Save user state until a session is complete, and only then send it down the pipeline
• Latency – up to 10 seconds from ingestion to processing (make data available as soon as it’s ready)
Spark Streaming Challenges
1. Stability & Resiliency -> Checkpointing
• S3
• Task failures on recovery – eventual consistency on read
• AWS EFS (?)
• Not suited for many small files (limited IOPS)
• HDFS
• Best overall write performance of the three
• Can be installed on the same nodes as the Spark workers
• Relatively low maintenance (if used only for checkpoints)
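Whichever DFS backs the checkpoint, wiring it in looks the same. A minimal sketch, assuming an HDFS namenode at `namenode:8020`, an app name, and a 4-second batch interval (all illustrative):

```scala
// Sketch: enabling checkpointing for a stateful Spark Streaming app.
// The HDFS path, app name and batch interval are illustrative assumptions.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs://namenode:8020/checkpoints/sessionization"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("sessionization")
  val ssc = new StreamingContext(conf, Seconds(4))
  ssc.checkpoint(checkpointDir) // state + metadata checkpoints go here
  // ... build the stateful DStream graph here ...
  ssc
}

// On restart, recover the graph and state from the checkpoint if one exists
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```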
Spark Streaming Challenges
1. Stability & Resiliency -> Checkpointing (cont.)
• Problem:
• State is not always recoverable
• No matter the DFS, checkpointing limits your throughput:
• At 1 KB per message and 100,000 messages/sec, state grows by roughly 100 MB per second
• If a full checkpoint takes 1 minute but is scheduled every 40 seconds, checkpointing can never catch up
• Workaround:
• None (in Spark Streaming)
Checkpointing – this is the cost???
Spark Streaming Challenges
2. Resiliency -> Managing user state between application upgrades
• Problems:
• Can’t change the graph
• Can’t change your data structures
• Workaround:
• Roll your own using `stateSnapshots()`
• Provide it on startup using `StateSpec.initialState()`
* Can potentially double the job time (critical with high throughput).
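The workaround can be sketched roughly as follows; `Session`, `updateSession`, the key type, and the snapshot path are illustrative assumptions, not part of Spark:

```scala
// Sketch: roll-your-own state persistence across upgrades with mapWithState.
import org.apache.spark.streaming.StateSpec

// Re-seed state saved by the previous version of the application
val previousState = ssc.sparkContext
  .objectFile[(String, Session)]("hdfs://namenode:8020/state/latest")

val spec = StateSpec
  .function(updateSession _) // (key, Option[value], State[Session]) => output
  .initialState(previousState)

val sessions = keyedMessages.mapWithState(spec)

// Periodically dump a full snapshot of all (key, state) pairs
sessions.stateSnapshots().foreachRDD { rdd =>
  // in practice, write to a timestamped path and repoint "latest" atomically
  rdd.saveAsObjectFile("hdfs://namenode:8020/state/latest")
}
```

Note that the dump of the full snapshot on every interval is exactly the overhead the footnote above warns about.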
Spark Streaming Challenges
3. Stability -> Frozen Jobs
• Problem:
• Spark Streaming defaults to one job (batch) at a time
• If a particular job is stuck, all others wait indefinitely
• Workaround:
• Monitor job status using Spark’s driver REST API (http://<driver ip>:4040/api/v1/applications)
• Consider using speculation (should be done carefully)
• Enable blacklisting if a particular node is faulty
• If you like to live dangerously, consider modifying `spark.streaming.concurrentJobs`
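The mitigations above boil down to a few configuration flags; a sketch with illustrative values:

```scala
// Sketch: configuration for the frozen-job mitigations. Values are illustrative.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")       // re-launch suspiciously slow tasks (use carefully)
  .set("spark.blacklist.enabled", "true") // stop scheduling tasks on repeatedly failing nodes
// Undocumented and risky: lets several batches run concurrently,
// which can break in-order processing assumptions:
// .set("spark.streaming.concurrentJobs", "2")
```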
Structured Streaming
“The key idea in Structured Streaming is to treat a live data stream as a table
that is being continuously appended” (Structured Streaming Documentation)
Source: https://spark.apache.org/docs/latest/img/structured-streaming-stream-as-a-table.png
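As a sketch of the model, here is a Kafka stream read as an unbounded, continuously appended table (the broker address, topic, and column selection are assumptions):

```scala
// Sketch: a live stream exposed as a continuously appended table (Spark 2.x).
val messages = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "user-events")
  .load()                                  // an unbounded DataFrame
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
```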
mapGroupsWithState
A second iteration of stateful aggregation in Spark
Resiliency & Stability -> Checkpointing
• Checkpoints are incremental – only deltas!
• Allows state recovery between upgrades *
* According to a set of tests we ran; may not apply to all cases and isn’t documented behavior
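A rough sketch of sessionization on top of this API, shown with the sibling `flatMapGroupsWithState` so that only completed sessions are emitted downstream; `Message`, `Session`, and the 30-minute inactivity window are illustrative assumptions:

```scala
// Sketch: sessionization with (flat)mapGroupsWithState, Spark 2.2+.
import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Message(userId: String, eventTime: Timestamp, payload: String)
case class Session(userId: String, events: Seq[String], lastEvent: Timestamp)

def updateSession(userId: String, batch: Iterator[Message],
                  state: GroupState[Session]): Iterator[Session] = {
  if (state.hasTimedOut) {
    val finished = state.get // session is complete: emit it down the pipeline
    state.remove()
    Iterator(finished)
  } else {
    val events = batch.toSeq
    val prev = state.getOption.getOrElse(Session(userId, Seq.empty, new Timestamp(0L)))
    val last = new Timestamp((prev.lastEvent.getTime +: events.map(_.eventTime.getTime)).max)
    val next = Session(userId, prev.events ++ events.map(_.payload), last)
    state.update(next)
    state.setTimeoutTimestamp(next.lastEvent, "30 minutes") // close after inactivity
    Iterator.empty // session still open
  }
}

// `messages` is a Dataset[Message]; event-time timeouts require a watermark
val sessions = messages
  .withWatermark("eventTime", "10 minutes")
  .groupByKey(_.userId)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.EventTimeTimeout)(updateSession _)
```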
Spark Structured Streaming
• More new features and cool stuff
• Event-based timeouts (previously only processing-time based)
• Watermarking (New)
• Deduplication (New)
• Timeout per state item (Enhancement)
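Watermarking and deduplication compose in a line each; the column names here are assumptions:

```scala
// Sketch: drop duplicate messages, bounding the dedup state with a watermark.
val deduped = messages
  .withWatermark("eventTime", "10 minutes") // accept data up to 10 minutes late
  .dropDuplicates("messageId", "eventTime") // dedup state older than the watermark is purged
```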
Our experience so far
Running ~1 month in production with Spark 2.2 and mapGroupsWithState:
Pros:
• Queries seem to take less time on average than in Spark Streaming *
• No need to save state manually
• Deduplication out of the box is awesome
• Event-based timeouts + watermarking for late data are also awesome
* In peak hours, from ~3 seconds per batch down to ~0.6 seconds per query (5x)
Our experience so far (Cont.)
Neutral:
• Kafka users: Spark now maps a TopicPartition to a particular executor, improving data locality (less shuffling).
• This means that in order to scale out, you need at least a 1:1 mapping between the number of Kafka partitions and Spark executors.
Cons:
• Creates a significantly larger memory overhead (due to the internal state implementation)
• Makes heavier use of HDFS (many small file writes)
• Doesn’t support multiple states (yet)
• UI not as good as Streaming’s
Wrapping up
• Overall, Spark Streaming is a great candidate for small-to-medium loads or for streams without stateful aggregations.
• If you’re considering Spark as an option for your business, start with Structured Streaming from the get-go.
• Do consider Apache Flink, with its similar state management module that allows pluggable state stores, as an alternative.
Resources
• Real-time Streaming ETL with Structured Streaming: https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
• Making Structured Streaming Ready for Production: https://www.youtube.com/watch?v=UQiuyov4J-4&feature=youtu.be
• Arbitrary Stateful Aggregations in Structured Streaming in Apache Spark: https://www.youtube.com/watch?v=JAb4FIheP28
• Exploring Spark Stateful Streaming: http://asyncified.io/2016/07/31/exploring-stateful-streaming-with-apache-spark
• Exploring Stateful Streaming with Spark Structured Streaming: http://asyncified.io/2017/07/30/exploring-stateful-streaming-with-spark-structured-streaming
How do we aggregate user messages over time in a streaming application?
Do a brief overview of all points, 15–20 seconds per point.
At the end of the slide, do an intro to Spark and talk a little about why we chose it over the alternatives.
Ask questions:
How many people use Spark in production?
How many people use Spark Streaming in production?
How many do sessionization?
Spark is not real-time streaming, but micro-batching.
Where is the state held?