Sessionization At Scale
Using Spark Streaming in production and staying sane
Marina Grechuhin & Yuval Itzchakov
12/09/2017
Yuval Itzchakov
2Confidential
• Developer @ Clicktale for the past 3 years
• Previously developer @ IDF (8200)
• @yuvalitzchakov
• https://stackoverflow.com/users/1870803/yuval-itzchakov
• http://asyncified.io
Marina Grechuhin
• Team Leader @ Clicktale
• Previously co-founder and VP R&D @ SureVisit
• Previously – many more
Agenda
• Introduction to Spark
• Spark In Depth
• What Is Sessionization?
• Spark Brief Overview
• Sessionization With Spark Streaming
• Scale Challenges
• Structured Streaming with Stateful Aggregations
Architecture – Pipeline CEC
[Diagram: Elastic Load Balancing; Auto Scaling group; Ingest Servers]
{
"version": 1,
"location":"http://adobe.com/shoe.html",
"projectId": 10,
"documentReferrer": "",
"visitId": 6403608503386111,
"domContentLoaded": 324,
"visitorId": 3246944914767871,
"pageviewId": 1199465738272767,
"engagementTime": 2336,
"messageId": 0
}
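A minimal sketch of how a consumer might parse a message like the one above, in plain Python. The deck does not show the ingest code, so the `IngestMessage` class and `parse_message` helper are hypothetical names, and only a subset of the fields is modeled.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestMessage:
    """Hypothetical typed view over the identifying fields of one message."""
    project_id: int
    visitor_id: int
    visit_id: int
    pageview_id: int
    message_id: int
    location: str


def parse_message(raw: str) -> IngestMessage:
    """Parse one JSON message; raises KeyError if a modeled field is missing."""
    doc = json.loads(raw)
    return IngestMessage(
        project_id=doc["projectId"],
        visitor_id=doc["visitorId"],
        visit_id=doc["visitId"],
        pageview_id=doc["pageviewId"],
        message_id=doc["messageId"],
        location=doc["location"],
    )


raw = (
    '{"version": 1, "location": "http://adobe.com/shoe.html", "projectId": 10, '
    '"documentReferrer": "", "visitId": 6403608503386111, "domContentLoaded": 324, '
    '"visitorId": 3246944914767871, "pageviewId": 1199465738272767, '
    '"engagementTime": 2336, "messageId": 0}'
)
msg = parse_message(raw)
print(msg.visit_id, msg.pageview_id, msg.message_id)
```

The three identifiers (`visitorId`, `visitId`, `pageviewId`) are what a sessionization step would key on.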
Pipeline CEC – Data Types
• Init Message
• Chunk Messages 0-N
• End Message
Sizing Pipeline CEC
[Diagram residue, labels in extracted order: Elastic Load Balancing, 10, 500G/Day, 100G/Day, Ingest, 1415, Elastic Load Balancing, 8, Ingest, 10]
What is Sessionization?
Session:
“A sequence of requests made by a single end-user during a visit to a particular site”
(Wikipedia)
• To be able to aggregate user actions over time
• Data does not arrive all at once, but piece by piece
Pipeline CEC – Data Types
[Diagram: a Visit is made up of PageViews; each PageView is made up of an Init message, Chunk messages, and an End message. One PageView is shown with an Init and two Chunks but no End.]
PageView – User’s journey on a single web page
Visit – User’s journey on site
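The Visit → PageView → message hierarchy can be sketched in a few lines of plain Python. This is an illustration only: the tuple layout and the `"init"` / `"chunk"` / `"end"` labels are assumptions, since the deck does not show how message types are encoded.

```python
from collections import defaultdict


def group_visit(messages):
    """Group (visit_id, pageview_id, kind) tuples into visit -> pageview -> kinds."""
    visits = defaultdict(lambda: defaultdict(list))
    for visit_id, pageview_id, kind in messages:
        visits[visit_id][pageview_id].append(kind)
    return visits


def is_complete(kinds):
    """A PageView counts as complete once both its Init and End messages arrived."""
    return "init" in kinds and "end" in kinds


messages = [
    (1, "pv-a", "init"), (1, "pv-a", "chunk"), (1, "pv-a", "end"),
    (1, "pv-b", "init"), (1, "pv-b", "chunk"), (1, "pv-b", "chunk"),
]
visits = group_visit(messages)
complete = {pv: is_complete(kinds) for pv, kinds in visits[1].items()}
print(complete)  # {'pv-a': True, 'pv-b': False}
```

Here `pv-b` has no End message, so it would have to stay in state until the End arrives or a timeout fires.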
Requirements overview
• Message size ranging from 200 B to 1 KB (may grow over time)
• Process up to 100,000 incoming user messages per second
• Handle traffic peaks of up to 1,000,000 messages per second (common with Fortune 500 companies)
• Scale out as needed without user intervention (hopefully linearly)
• Save user state until a session is complete, and only then send it down the pipeline
• Latency: up to 10 seconds from ingestion to processing (make data available as soon as it’s ready)
Spark Ecosystem
Source: https://www.linkedin.com/pulse/apache-spark-big-data-dataframe-things-know-abhishek-choudhary
Spark Streaming
• Discretized Stream (DStream)
• Micro batching
• One RDD every batch
Where is the state kept between batches?
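To make the micro-batching idea concrete, here is a rough plain-Python model (not Spark code; `micro_batches` is a hypothetical helper). It cuts a stream of timestamped events into fixed-length batches, which also shows why the question above matters: each batch is an independent collection, and nothing carries over to the next one unless state is stored explicitly.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Bucket (arrival_time_seconds, payload) pairs into fixed-length batches.

    Intervals with no events are skipped here for brevity.
    """
    batches = defaultdict(list)
    for arrival, payload in events:
        batches[int(arrival // batch_seconds)].append(payload)
    return [batches[i] for i in sorted(batches)]


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = micro_batches(events, batch_seconds=1)
print(batches)  # [['a', 'b'], ['c'], ['d']]
```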
mapWithState
Source: https://databricks.com/wp-content/uploads/2016/01/blog-faster-stateful-streaming-figure-1-1024x562.png
[Diagram: Init, Chunk, and End messages; PageView; Visit]
• Partial Updates
• Timeout
• Initial State
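The three features listed above can be modeled with an ordinary dictionary. The sketch below is plain Python, not the Spark API. The update function only loosely mirrors the (key, value, state) shape of a `mapWithState` function, and `update_pageview` and `expire` are hypothetical names.

```python
def update_pageview(key, message, state):
    """Fold one message into the per-key state; return the finished PageView on End."""
    parts = state.get(key, []) + [message]   # partial update: only this key is touched
    if message == "end":
        state.pop(key, None)                 # session complete: drop state and emit
        return parts
    state[key] = parts
    return None


def expire(state, last_seen, now, timeout):
    """Emit and remove keys that received no message for `timeout` seconds."""
    expired = [k for k, t in last_seen.items() if now - t >= timeout and k in state]
    return {k: state.pop(k) for k in expired}


state = {"pv-0": ["init"]}                   # initial state supplied at start-up
last_seen = {"pv-0": 0, "pv-1": 5}
out = [
    update_pageview("pv-0", "chunk", state),
    update_pageview("pv-0", "end", state),
    update_pageview("pv-1", "init", state),
]
timed_out = expire(state, last_seen, now=40, timeout=30)
print(out[1])      # ['init', 'chunk', 'end']
print(timed_out)   # {'pv-1': ['init']}
```

`pv-0` completes normally, while `pv-1` never receives an End and is flushed by the timeout instead.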
DStream[(String, String)]:
(“dardasaba”, “hello”), (“dardasaba”, “goodbye”), (“hathatul”, “w00t”), (“hathatul”, “nope”), (“gargamel”, “muhaha”)

OpenHashMap[String, List[String]] (Key → Value), one per executor:
Executor 1: “dardasaba” → [“hello”, “goodbye”]
Executor 2: “hathatul” → [“w00t”, “nope”]
Executor 3: “gargamel” → [“muhaha”]
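A small plain-Python sketch of the per-executor maps on this slide. CRC32 is used here only as a stable stand-in hash and is an assumption, not Spark's partitioner. With it, two keys may well share a partition, so the three keys will not necessarily land on three different "executors" as drawn.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Stable hash partitioning: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(records, num_partitions):
    """Build one key -> values map per partition from a stream of (key, value) pairs."""
    stores = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in records:
        stores[partition_for(key, num_partitions)][key].append(value)
    return stores


records = [
    ("dardasaba", "hello"), ("dardasaba", "goodbye"),
    ("hathatul", "w00t"), ("hathatul", "nope"),
    ("gargamel", "muhaha"),
]
stores = route(records, num_partitions=3)
for i, store in enumerate(stores, start=1):
    print(f"Executor {i}: {dict(store)}")
```

The property that matters is that every value for a given key ends up in exactly one map, which is what lets the state for that key live on a single executor.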
What Could Possibly Go Wrong?
Scale Challenges
• Stability
• Resiliency
• Scalability (scale up / down)
• Monitoring
Spark Streaming Challenges
1. Stability & Resiliency -> Checkpointing
• S3
  • Task failure - eventual consistency on read
• AWS EFS (?)
  • Not suited for small file systems (limited IOPS)
• HDFS
  • Best overall write performance out of the three
  • Can be installed on the same node as Spark Workers
  • Relatively low maintenance (if used only for checkpoint)
Spark Streaming Challenges
1. Stability & Resiliency -> Checkpointing (cont.)
• Problem:
  • State not always recoverable
  • No matter the DFS, limits your throughput:
    • 1KB message size
    • 100,000 messages/sec
    • 1 minute checkpoint time (occurs every 40 seconds)
• Workaround:
  • None (in Spark Streaming)
Checkpointing – This is the cost???
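A back-of-the-envelope reading of the figures on this slide, written as a few lines of Python. Decimal units (1 KB = 1,000 bytes) are assumed, and how much of the incoming data actually ends up in checkpointed state depends on the application, so treat this as rough arithmetic rather than a measurement.

```python
MESSAGE_BYTES = 1_000        # "1KB message size" (decimal KB assumed)
MESSAGES_PER_SEC = 100_000   # "100,000 messages/sec"
CHECKPOINT_SECONDS = 60      # "1 minute checkpoint time"
CHECKPOINT_INTERVAL = 40     # "occurs every 40 seconds"

ingest_mb_per_sec = MESSAGE_BYTES * MESSAGES_PER_SEC / 1_000_000
ingest_gb_per_interval = ingest_mb_per_sec * CHECKPOINT_INTERVAL / 1_000
checkpoint_load = CHECKPOINT_SECONDS / CHECKPOINT_INTERVAL

print(ingest_mb_per_sec)       # 100.0 MB/s of incoming messages
print(ingest_gb_per_interval)  # 4.0 GB arriving between two checkpoints
print(checkpoint_load)         # 1.5: a checkpoint takes longer than the gap between checkpoints
```

A ratio above 1 means a checkpoint is still running when the next one is due, which is one way to read the slide's claim that checkpointing limits throughput.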
Spark Streaming Challenges
2. Resiliency -> Managing user state between application upgrades
• Problems:
  • Can’t change the graph
  • Can’t change your data structures
• Workaround:
  • Roll your own using `stateSnapshots()`
  • Provide on start up using `StateSpec.initialState()`
* Can potentially double the job time (critical with high throughput).
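The "roll your own" workaround boils down to writing the keyed state somewhere durable, together with a schema version, and loading it back as the initial state after an upgrade. The sketch below is plain Python with a JSON file, not Spark code. The file layout, the version numbers, and the `migrate` rule are all hypothetical.

```python
import json
import os
import tempfile

SCHEMA_VERSION = 2


def save_snapshot(state, path):
    """Write the whole keyed state plus a schema version to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"version": SCHEMA_VERSION, "state": state}, f)


def migrate(version, value):
    """Hypothetical migration: v1 stored a bare list, v2 wraps it in a dict."""
    if version == 1:
        return {"messages": value}
    return value


def load_initial_state(path):
    """Read a snapshot written by any known version and upgrade it to the current one."""
    with open(path, encoding="utf-8") as f:
        doc = json.load(f)
    return {k: migrate(doc["version"], v) for k, v in doc["state"].items()}


path = os.path.join(tempfile.mkdtemp(), "snapshot.json")
with open(path, "w", encoding="utf-8") as f:   # pretend the previous release wrote this
    json.dump({"version": 1, "state": {"pv-1": ["init", "chunk"]}}, f)

initial = load_initial_state(path)   # upgraded while loading
save_snapshot(initial, path)         # the new release writes the current format
reloaded = load_initial_state(path)
print(reloaded)  # {'pv-1': {'messages': ['init', 'chunk']}}
```

Writing every key on every snapshot is what makes this expensive, in line with the caveat above about job time.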
Spark Streaming Challenges
3. Stability -> Frozen Jobs
• Problem:
  • Spark Streaming defaults to one job (batch) at a time
  • If a particular job is stuck, all others wait indefinitely
• Workaround:
  • Monitor job status using Spark’s driver REST API (http://<driver ip>:4040/api/v1/applications)
  • Consider using Speculation (should be done carefully)
  • Enable Blacklisting if a particular node is faulty
  • If you like to live dangerously, consider modifying “spark.streaming.concurrentJobs”
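A hedged sketch of the monitoring workaround in plain Python. It assumes the jobs endpoint under the applications URL returns records with `jobId`, `status`, and `submissionTime` fields, and that timestamps look like `2017-09-12T10:00:00.000GMT`. Check both against the Spark version in use. The `stuck_jobs` helper and the 60-second threshold are illustrative.

```python
import json
from datetime import datetime, timezone
from urllib.request import urlopen


def fetch_json(url):
    """GET a URL and decode the JSON body (intended for the driver's REST API)."""
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)


def parse_time(text):
    """Parse timestamps such as '2017-09-12T10:00:00.000GMT' (format assumed)."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fGMT").replace(tzinfo=timezone.utc)


def stuck_jobs(jobs, now, max_seconds):
    """Return ids of RUNNING jobs that were submitted more than `max_seconds` ago."""
    return [
        j["jobId"] for j in jobs
        if j["status"] == "RUNNING"
        and (now - parse_time(j["submissionTime"])).total_seconds() > max_seconds
    ]


# Offline example with hand-written records; a live check could instead call
# fetch_json("http://<driver ip>:4040/api/v1/applications/<app id>/jobs").
jobs = [
    {"jobId": 7, "status": "RUNNING", "submissionTime": "2017-09-12T10:00:00.000GMT"},
    {"jobId": 8, "status": "SUCCEEDED", "submissionTime": "2017-09-12T09:00:00.000GMT"},
    {"jobId": 9, "status": "RUNNING", "submissionTime": "2017-09-12T10:04:50.000GMT"},
]
now = datetime(2017, 9, 12, 10, 5, 0, tzinfo=timezone.utc)
stuck = stuck_jobs(jobs, now, max_seconds=60)
print(stuck)  # [7]
```

A check like this could run on a schedule and raise an alert, or restart the application, when the list is non-empty.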
Spark Streaming Challenges
4. Scalability
• Scale Up – Just works*
• Scale Down – Who takes over the worker’s state? No one!
Spark Streaming Challenges
5. Monitoring
• Logging mechanism – log only a small, random percentage of traffic
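Sampled logging of this kind can be sketched in plain Python as a predicate that is true for a fixed fraction of calls. The 1% rate and the `make_sampler` helper are assumptions for illustration; the deck does not say which rate was used.

```python
import random


def make_sampler(rate, seed=None):
    """Return a predicate that is True for roughly `rate` of the calls."""
    rng = random.Random(seed)
    return lambda: rng.random() < rate


should_log = make_sampler(rate=0.01, seed=42)
logged = sum(1 for _ in range(100_000) if should_log())
print(logged)  # roughly 1,000, i.e. about 1% of 100,000 messages
```

At 100,000 messages per second, a 1% sample still yields about a thousand log lines per second, so the rate may need to be far lower in practice.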
Is there a better alternative?
Structured Streaming
“The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended” (Structured Streaming Documentation)
Source: https://spark.apache.org/docs/latest/img/structured-streaming-stream-as-a-table.png
Structured Streaming (Cont.)
Source: https://spark.apache.org/docs/latest/img/structured-streaming-model.png
mapGroupsWithState
A second iteration at stateful aggregations in Spark
Resiliency & Stability -> Checkpointing
• Checkpoints are incremental, only deltas!
• Allows state recovery between upgrades *
* Based on our own tests; may not apply to all cases, and isn’t documented behavior.
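The difference between full and incremental checkpoints can be illustrated with a toy model in plain Python. This is not how Spark's state store is implemented. It only shows the idea of writing deltas and recovering by replaying them over a base snapshot.

```python
def delta(previous, current):
    """Only the keys that changed or disappeared since the previous checkpoint."""
    changed = {k: v for k, v in current.items() if previous.get(k) != v}
    removed = [k for k in previous if k not in current]
    return {"put": changed, "remove": removed}


def replay(base, deltas):
    """Recover state by applying deltas, oldest first, on top of a base snapshot."""
    state = dict(base)
    for d in deltas:
        state.update(d["put"])
        for k in d["remove"]:
            state.pop(k, None)
    return state


v0 = {"a": 1, "b": 2, "c": 3}
v1 = {"a": 1, "b": 5, "c": 3}   # one key updated
v2 = {"a": 1, "b": 5, "d": 9}   # one key removed, one key added
d1, d2 = delta(v0, v1), delta(v1, v2)
recovered = replay(v0, [d1, d2])
print(d1)         # {'put': {'b': 5}, 'remove': []}
print(recovered)  # {'a': 1, 'b': 5, 'd': 9}
```

With a large state and few changes per batch, each delta is far smaller than a full snapshot, which is where the saving comes from.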
Spark Structured Streaming
• More new features and cool stuff
  • Event-time-based timeouts (previously only processing-time based)
  • Watermarking (New)
  • Deduplication (New)
  • Timeout per state item (Enhancement)
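Watermarking and deduplication can be modeled together in a simplified plain-Python form. This sketch advances the watermark on every event and keeps all seen ids forever. Spark advances it per trigger and uses the watermark to bound the deduplication state, so the names and exact behavior here are illustrative assumptions.

```python
def process(events, allowed_lateness):
    """Drop duplicates by id and events older than the watermark.

    The watermark is the largest event time seen so far minus `allowed_lateness`.
    """
    seen_ids = set()
    max_event_time = float("-inf")
    accepted = []
    for event_id, event_time in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowed_lateness
        if event_time < watermark:
            continue                 # too late: behind the watermark
        if event_id in seen_ids:
            continue                 # duplicate delivery
        seen_ids.add(event_id)
        accepted.append(event_id)
    return accepted


events = [("e1", 100), ("e2", 105), ("e1", 100), ("e3", 120), ("e4", 101), ("e5", 112)]
accepted = process(events, allowed_lateness=10)
print(accepted)  # ['e1', 'e2', 'e3', 'e5']
```

The second `e1` is dropped as a duplicate, and `e4` is dropped because it arrives after the watermark has moved to 110.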
Our experience so far
Running ~ 1 month in production with Spark 2.2 and mapGroupsWithState:
Pros:
• Queries seem to take less time on average than Spark Streaming *
• No need to save state manually
• Deduplication out of the box is awesome
• Event based timeouts + Watermarking for late data is also awesome
* In peak hours, from ~ 3 seconds per batch to 0.6 seconds per query (x5)
Our experience so far (Cont.)
Neutral:
• Kafka users: Spark now maps a TopicPartition to a particular Executor, improving data locality (less shuffling).
  • This means that in order to scale up, you must have at least a 1:1 mapping between the number of Kafka partitions and Spark Executors.
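Why the partition count caps scale-out can be seen with a toy assignment in plain Python. Round-robin placement is an assumption chosen for illustration and is not Spark's actual placement logic. The point is only that an executor with no pinned partition receives no data.

```python
def assign(partitions, executors):
    """Pin each partition to one executor, round-robin over the executors."""
    mapping = {e: [] for e in executors}
    for i, p in enumerate(partitions):
        mapping[executors[i % len(executors)]].append(p)
    return mapping


def busy_executors(mapping):
    """Executors that own at least one partition and therefore receive data."""
    return [e for e, owned in mapping.items() if owned]


executors = ["exec-1", "exec-2", "exec-3", "exec-4"]
few = assign(partitions=[0, 1], executors=executors)
enough = assign(partitions=[0, 1, 2, 3], executors=executors)
print(len(busy_executors(few)))     # 2: two executors sit idle
print(len(busy_executors(enough)))  # 4: every executor has a partition
```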
Cons:
• Creates a significantly larger memory overhead (due to internal state implementation)
• Makes heavier use of HDFS (many small file writes)
• Doesn’t support multiple states (yet)
• UI not as good as Streaming
Wrapping up
• Overall, Spark Streaming is a great candidate for small to medium loads or for streams without stateful aggregations.
• If you’re considering Spark as an option for your business, start with Structured Streaming from the get-go.
• Do consider Apache Flink as an alternative; its similar state management module allows pluggable state stores.
Resources
• Real-time Streaming ETL with Structured Streaming: https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
• Making Structured Streaming Ready for Production: https://www.youtube.com/watch?v=UQiuyov4J-4&feature=youtu.be
• Arbitrary Stateful Aggregations in Structured Streaming in Apache Spark: https://www.youtube.com/watch?v=JAb4FIheP28
• Exploring Spark Stateful Streaming: http://asyncified.io/2016/07/31/exploring-stateful-streaming-with-apache-spark
• Exploring Stateful Streaming with Spark Structured Streaming: http://asyncified.io/2017/07/30/exploring-stateful-streaming-with-spark-structured-streaming
Thank you for listening!
Questions?

Secure Key Crypto - Tech Paper JET Tech LabsSecure Key Crypto - Tech Paper JET Tech Labs
Secure Key Crypto - Tech Paper JET Tech Labs
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 
CS 3251 Programming in c all unit notes pdf
CS 3251 Programming in c all unit notes pdfCS 3251 Programming in c all unit notes pdf
CS 3251 Programming in c all unit notes pdf
 

Spark Streaming @ Scale (Clicktale)

  • 1. Sessionization At Scale: Using Spark Streaming in production and staying sane. Marina Grechuhin & Yuval Itzchakov, 12/09/2017
  • 2. Yuval Itzchakov • Developer @ Clicktale for the past 3 years • Previously developer @ IDF (8200) • @yuvalitzchakov • https://stackoverflow.com/users/1870803/yuval-itzchakov • http://asyncified.io
  • 3. Marina Grechuhin • Team Leader @ Clicktale • Previously co-founder and VP R&D @ SureVisit • Previously: many more
  • 4. Agenda • Introduction to Spark • Spark In Depth • What Is Sessionization? • Spark Brief Overview • Sessionization With Spark Streaming • Scale Challenges • Structured Streaming with Stateful Aggregations
  • 5. Architecture - Pipeline CEC [diagram: Elastic Load Balancing, Auto Scaling group, Ingest Servers]
  • 6. Pipeline CEC - Data Types • Init Message • Chunk Messages 0-N • End Message. Example message: { "version": 1, "location": "http://adobe.com/shoe.html", "projectId": 10, "documentReferrer": "", "visitId": 6403608503386111, "domContentLoaded": 324, "visitorId": 3246944914767871, "pageviewId": 1199465738272767, "engagementTime": 2336, "messageId": 0 }
  • 7. Sizing Pipeline CEC [diagram: two Elastic Load Balancing + Ingest tiers, labeled 500G/Day and 100G/Day]
  • 8. What is Sessionization? Session: “A sequence of requests made by a single end-user during a visit to a particular site” (Wikipedia) • Goal: aggregate user actions over time • The data does not arrive all at once; it arrives piece by piece
  • 9. Pipeline CEC - Data Types [diagram: a Visit is made up of several PageViews; each PageView is built from an Init message, Chunk messages, and an End message] • PageView: the user’s journey on a single web page • Visit: the user’s journey on the site
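The Init / Chunk / End lifecycle described on slides 6 and 9 can be sketched in plain Python. This is a minimal illustration, not the speakers' implementation: the `Message` and `Sessionizer` names and the `"init"`, `"chunk"`, `"end"` labels are assumptions made for the example, and a real pipeline would also need timeouts for pageviews whose End message never arrives.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

VALID_KINDS = {"init", "chunk", "end"}  # hypothetical labels for the three message types


@dataclass
class Message:
    pageview_id: int
    kind: str
    payload: str = ""


class Sessionizer:
    """Buffers messages per pageview and emits the pageview once End arrives."""

    def __init__(self) -> None:
        self._open: Dict[int, List[str]] = {}

    def on_message(self, msg: Message) -> Optional[List[str]]:
        if msg.kind not in VALID_KINDS:
            raise ValueError(f"unknown message kind: {msg.kind}")
        chunks = self._open.setdefault(msg.pageview_id, [])
        if msg.kind == "chunk":
            chunks.append(msg.payload)
        elif msg.kind == "end":
            # Pageview complete: drop its state and hand the result downstream.
            del self._open[msg.pageview_id]
            return chunks
        return None

    def open_pageviews(self) -> int:
        return len(self._open)


sessionizer = Sessionizer()
stream = [
    Message(1, "init"),
    Message(1, "chunk", "a"),
    Message(2, "init"),
    Message(1, "chunk", "b"),
    Message(1, "end"),
]
results = [sessionizer.on_message(m) for m in stream]
```

Only the last call returns a value (the chunks of pageview 1); pageview 2 stays buffered because its End message has not arrived yet.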
  • 10. Requirements overview • Message size between 200 B and 1 KB (may grow over time) • Process incoming user messages at up to 100,000 messages per second • Handle traffic peaks of up to 1,000,000 messages per second (common with Fortune 500 companies) • Scale out as needed without user intervention (hopefully linearly) • Save user state until a session is complete, and only then send it down the pipeline • Latency: up to 10 seconds from ingestion to processing (make data available as soon as it’s ready)
  • 13. Spark Streaming • Discretized Stream (DStream) • Micro batching • One RDD every batch • Where is the state kept between batches?
  • 15. Input, DStream[(String, String)]: (“dardasaba”, “hello”), (“dardasaba”, “goodbye”), (“hathatul”, “w00t”), (“hathatul”, “nope”), (“gargamel”, “muhaha”)
      State, one OpenHashMap[String, List[String]] per executor:
      • Executor 1: “dardasaba” -> [“hello”, “goodbye”]
      • Executor 2: “hathatul” -> [“w00t”, “nope”]
      • Executor 3: “gargamel” -> [“muhaha”]
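The picture on slide 15 (every key lives in exactly one executor's map) can be imitated with a few lines of plain Python. This is only a sketch of the idea: the `executor_for` and `update_state` helpers are hypothetical, and the CRC32-based placement is an assumption chosen for repeatability, not Spark's actual partitioner. Unlike on the slide, two keys may land on the same executor here.

```python
import zlib
from collections import defaultdict
from typing import DefaultDict, List, Tuple

ExecutorState = DefaultDict[str, List[str]]


def executor_for(key: str, num_executors: int) -> int:
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_executors


def new_state(num_executors: int) -> List[ExecutorState]:
    return [defaultdict(list) for _ in range(num_executors)]


def update_state(state: List[ExecutorState], batch: List[Tuple[str, str]]) -> None:
    """Append each value to the list owned by the executor responsible for its key."""
    for key, value in batch:
        state[executor_for(key, len(state))][key].append(value)


state = new_state(3)
update_state(state, [("dardasaba", "hello"), ("hathatul", "w00t")])       # batch 1
update_state(state, [("dardasaba", "goodbye"), ("hathatul", "nope"),
                     ("gargamel", "muhaha")])                              # batch 2
```

Because placement depends only on the key, later batches always find the earlier values on the same executor, which is what lets state survive between micro-batches.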
  • 17. Scale Challenges • Stability • Resiliency • Scalability (scale up / down) • Monitoring
  • 18. Spark Streaming Challenges: 1. Stability & Resiliency -> Checkpointing
      • S3
          • Task failure - Eventual consistency on read
      • AWS EFS (?)
          • Not suited for small file systems (limited IOPS)
      • HDFS
          • Best overall write performance out of the three
          • Can be installed on the same node as Spark Workers
          • Relatively low maintenance (if used only for checkpoint)
  • 19. Spark Streaming Challenges: 1. Stability & Resiliency -> Checkpointing (cont.)
      • Problem:
          • State not always recoverable
          • No matter the DFS, checkpointing limits your throughput:
              • 1 KB message size
              • 100,000 messages/sec
              • 1 minute checkpoint time (occurs every 40 seconds)
      • Workaround:
          • None (in Spark Streaming)
      • Checkpointing: this is the cost???
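The figures on slide 19 can be turned into a quick back-of-the-envelope calculation. The sketch below assumes 1 KB means 1,000 bytes and treats the incoming volume as a rough proxy for how much state changes between checkpoints; neither assumption comes from the slides.

```python
MESSAGE_SIZE_BYTES = 1_000      # 1 KB per message (slide figure; 1 KB = 1,000 B is an assumption)
MESSAGES_PER_SECOND = 100_000   # sustained rate (slide figure)
CHECKPOINT_DURATION_S = 60      # one checkpoint takes about a minute (slide figure)
CHECKPOINT_INTERVAL_S = 40      # a checkpoint is triggered every 40 seconds (slide figure)

ingest_bytes_per_second = MESSAGE_SIZE_BYTES * MESSAGES_PER_SECOND
bytes_per_interval = ingest_bytes_per_second * CHECKPOINT_INTERVAL_S

# A ratio above 1.0 means a checkpoint is still running when the next one is due.
checkpoint_utilization = CHECKPOINT_DURATION_S / CHECKPOINT_INTERVAL_S
overrun_per_interval_s = CHECKPOINT_DURATION_S - CHECKPOINT_INTERVAL_S

print(f"ingest rate:            {ingest_bytes_per_second / 1e6:.0f} MB/s")
print(f"data per interval:      {bytes_per_interval / 1e9:.1f} GB")
print(f"checkpoint utilization: {checkpoint_utilization:.2f}")
print(f"overrun per interval:   {overrun_per_interval_s} s")
```

With these numbers the job ingests about 100 MB/s, roughly 4 GB per checkpoint interval, and each checkpoint overruns its 40-second slot by 20 seconds, so checkpointing alone cannot keep up.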
  • 20. Spark Streaming Challenges: 2. Resiliency -> Managing user state between application upgrades
      • Problems:
          • Can’t change the graph
          • Can’t change your data structures
      • Workaround:
          • Roll your own using `stateSnapshot()`
          • Provide on start up using `StateSpec.initialState()`
      * Can potentially double the overhead of the job time (critical with high throughput).
  • 21. Spark Streaming Challenges: 3. Stability -> Frozen Jobs
      • Problem:
          • Spark Streaming defaults to one job (batch) at a time
          • If a particular job is stuck, all others wait indefinitely
      • Workaround:
          • Monitor job status using Spark’s driver REST API (http://<driver ip>:4040/api/v1/applications)
          • Consider using Speculation (should be done carefully)
          • Enable Blacklisting if a particular node is faulty
          • If you like to live dangerously, consider modifying “spark.streaming.concurrentJobs”
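One way to act on the "monitor job status" advice is to poll the driver's REST API and flag jobs that have been running for too long. The sketch below shows only the detection step, on an already-parsed job list. It assumes the job objects carry `jobId`, `status` and `submissionTime` fields in the shape returned by the `/api/v1/applications/<app-id>/jobs` endpoint, with timestamps formatted like `2017-09-12T10:00:00.000GMT`; check both against your Spark version. The `stuck_jobs` helper and the two-minute threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def parse_spark_time(value: str) -> datetime:
    # Assumed timestamp format, e.g. "2017-09-12T10:00:00.000GMT".
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fGMT")
    return parsed.replace(tzinfo=timezone.utc)


def stuck_jobs(jobs: List[Dict], now: datetime, max_runtime: timedelta) -> List[int]:
    """Return the ids of jobs that are still RUNNING after max_runtime."""
    stuck = []
    for job in jobs:
        if job["status"] != "RUNNING":
            continue
        if now - parse_spark_time(job["submissionTime"]) > max_runtime:
            stuck.append(job["jobId"])
    return stuck


# Hand-written sample data in the assumed response shape (not real output).
sample_jobs = [
    {"jobId": 1, "status": "RUNNING", "submissionTime": "2017-09-12T10:00:00.000GMT"},
    {"jobId": 2, "status": "RUNNING", "submissionTime": "2017-09-12T10:04:30.000GMT"},
    {"jobId": 3, "status": "SUCCEEDED", "submissionTime": "2017-09-12T09:00:00.000GMT"},
]
now = datetime(2017, 9, 12, 10, 5, 0, tzinfo=timezone.utc)
flagged = stuck_jobs(sample_jobs, now, max_runtime=timedelta(minutes=2))
```

Keeping the detection logic separate from the HTTP call makes it easy to test offline; in production the job list would come from a periodic request to the driver, followed by an alert or a restart.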
  • 22. Spark Streaming Challenges: 4. Scalability • Scale Up: just works* • Scale Down: who takes over the worker’s state? No one!
  • 23. Spark Streaming Challenges: 5. Monitoring • Logging mechanism: log only a small, random percentage of traffic
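Logging only a small random share of traffic can be sketched as a thin wrapper around whatever log sink is in use. The `SampledLogger` class and the 1% rate below are illustrative assumptions; the slides do not say how the sampling was implemented.

```python
import random
from typing import Callable, List, Optional


class SampledLogger:
    """Forwards each message to the sink with probability sample_rate."""

    def __init__(self, sample_rate: float, sink: Callable[[str], None],
                 rng: Optional[random.Random] = None) -> None:
        if not 0.0 <= sample_rate <= 1.0:
            raise ValueError("sample_rate must be between 0.0 and 1.0")
        self._sample_rate = sample_rate
        self._sink = sink
        self._rng = rng if rng is not None else random.Random()

    def log(self, message: str) -> bool:
        # random() returns a value in [0.0, 1.0), so a rate of 1.0 logs everything
        # and a rate of 0.0 logs nothing.
        if self._rng.random() < self._sample_rate:
            self._sink(message)
            return True
        return False


lines: List[str] = []
logger = SampledLogger(sample_rate=0.01, sink=lines.append, rng=random.Random(42))
for i in range(100_000):
    logger.log(f"message {i}")
```

At 100,000 messages per second, a 1% sample still yields about 1,000 log lines per second, so the rate would normally be tuned to what the logging backend can absorb.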
  • 24. Is there a better alternative?
  • 25. Structured Streaming: “The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended” (Structured Streaming Documentation). Source: https://spark.apache.org/docs/latest/img/structured-streaming-stream-as-a-table.png
  • 26. Structured Streaming (Cont.) Source: https://spark.apache.org/docs/latest/img/structured-streaming-model.png
  • 27. mapGroupsWithState: a second iteration at stateful aggregations in Spark. Resiliency & Stability -> Checkpointing • Checkpoints are incremental, only deltas! • Allows state recovery between upgrades* (*According to a set of tests made by us; may not apply to all cases and isn’t documented behavior)
  • 28. Spark Structured Streaming: more new features and cool stuff • Event-time based timeouts (previously only processing-time based) • Watermarking (New) • Deduplication (New) • Timeout per state item (Enhancement)
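How watermarking, deduplication and event-time timeouts fit together can be illustrated with a small pure-Python model. This is a simplified sketch, not Spark's implementation: the `EventTimeTracker` class is hypothetical, times are plain integers (seconds), and the watermark here advances on every event, whereas Structured Streaming updates it between micro-batches.

```python
from typing import Dict, List, Set


class EventTimeTracker:
    def __init__(self, allowed_lateness: int, session_timeout: int) -> None:
        self.allowed_lateness = allowed_lateness
        self.session_timeout = session_timeout
        self.max_event_time = 0
        self.last_seen: Dict[str, int] = {}       # latest event time per key
        self.seen_ids: Dict[str, Set[int]] = {}   # message ids already accepted per key

    @property
    def watermark(self) -> int:
        # Events older than this are considered too late to process.
        return self.max_event_time - self.allowed_lateness

    def process(self, key: str, message_id: int, event_time: int) -> str:
        if event_time < self.watermark:
            return "dropped_late"
        ids = self.seen_ids.setdefault(key, set())
        if message_id in ids:
            return "duplicate"
        ids.add(message_id)
        self.max_event_time = max(self.max_event_time, event_time)
        self.last_seen[key] = max(self.last_seen.get(key, event_time), event_time)
        return "accepted"

    def expire(self) -> List[str]:
        """Remove keys whose timeout has passed according to the watermark."""
        expired = sorted(k for k, t in self.last_seen.items()
                         if t + self.session_timeout <= self.watermark)
        for key in expired:
            del self.last_seen[key]
            self.seen_ids.pop(key, None)  # dedup state is released with the session
        return expired


tracker = EventTimeTracker(allowed_lateness=10, session_timeout=30)
outcomes = [
    tracker.process("a", 0, 100),   # first event for key "a"
    tracker.process("a", 0, 100),   # same message id again
    tracker.process("b", 0, 105),   # moves the watermark to 95
    tracker.process("a", 1, 80),    # older than the watermark
]
expired_early = tracker.expire()
tracker.process("b", 1, 150)        # moves the watermark to 140
expired_later = tracker.expire()
```

The timeout is driven by event time rather than wall-clock time: key "a" expires only once the watermark (140) passes its last event plus the timeout (130), regardless of how long the job has actually been running.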
  • 29. Our experience so far: running ~1 month in production with Spark 2.2 and mapGroupsWithState.
      • Pros:
          • Queries seem to take less time on average than Spark Streaming*
          • No need to save state manually
          • Deduplication out of the box is awesome
          • Event-based timeouts + Watermarking for late data is also awesome
      * In peak hours, from ~3 seconds per batch to 0.6 seconds per query (x5)
  • 30. Our experience so far (Cont.)
      • Neutral:
          • Kafka users: Spark now maps a TopicPartition to a particular Executor, improving data locality (less shuffling).
          • This means that in order to scale up, you must have at least a 1:1 mapping between the number of Kafka partitions and Spark Executors.
      • Cons:
          • Creates a significantly larger memory overhead (due to internal state implementation)
          • Makes heavier use of HDFS (many small file writes)
          • Doesn’t support multiple states (yet)
          • UI not as good as Streaming
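The scaling consequence of pinning each Kafka TopicPartition to one executor can be shown with a toy assignment. The round-robin rule and the `assign_partitions` helper below are assumptions made for illustration only; they are not Spark's actual location strategy.

```python
from typing import Dict, List


def assign_partitions(num_partitions: int, num_executors: int) -> Dict[int, List[int]]:
    """Assign each partition to exactly one executor, round-robin."""
    assignment: Dict[int, List[int]] = {e: [] for e in range(num_executors)}
    for partition in range(num_partitions):
        assignment[partition % num_executors].append(partition)
    return assignment


def idle_executors(assignment: Dict[int, List[int]]) -> List[int]:
    return [executor for executor, partitions in assignment.items() if not partitions]


balanced = assign_partitions(num_partitions=8, num_executors=4)
oversized = assign_partitions(num_partitions=8, num_executors=10)
idle = idle_executors(oversized)
```

With 8 partitions and 10 executors, two executors receive no partition at all, which is why adding executors beyond the partition count does not add read capacity.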
  • 31. Wrapping up • Overall, Spark Streaming is a great candidate for small to medium loads, or for streams without stateful aggregations. • If you’re considering Spark as an option for your business, start with Structured Streaming from the get-go. • As an alternative, do consider Apache Flink and its similar state management module, which allows pluggable state stores.
  • 32. Resources
      • Real-time Streaming ETL with Structured Streaming: https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
      • Making Structured Streaming Ready for Production: https://www.youtube.com/watch?v=UQiuyov4J-4&feature=youtu.be
      • Arbitrary Stateful Aggregations in Structured Streaming in Apache Spark: https://www.youtube.com/watch?v=JAb4FIheP28
      • Exploring Spark Stateful Streaming: http://asyncified.io/2016/07/31/exploring-stateful-streaming-with-apache-spark
      • Exploring Stateful Streaming with Spark Structured Streaming: http://asyncified.io/2017/07/30/exploring-stateful-streaming-with-spark-structured-streaming
  • 33. Thank you for listening! Questions?

Editor's notes

  1. Monitor the message-sending mechanism; 1 Kafka topic
  2. Monitor the message-sending mechanism; 1 Kafka topic
  3. How do we aggregate user messages over time in a streaming application?
  4. Do a brief overview of all points, 15-20 seconds per point. At the end of the slide, do an intro to Spark and talk a little about why we chose it over the alternatives
  5. Ask a question: How many people use Spark in production? How many people use Spark Streaming in production? How many do sessionization?
  6. Spark is not real-time streaming, but micro-batching. Where is the state held?
  7. Talk about each file system briefly