Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture
1. Spark Streaming
and Friends
Chris Fregly
Global Big Data Conference
Sept 2014
Kinesis
Streaming
2. Who am I?
Former Netflix'er:
netflix.github.io
Spark Contributor:
github.com/apache/spark
Founder:
fluxcapacitor.com
Author:
effectivespark.com
sparkinaction.com
6. Agenda
• Spark, Spark Streaming Overview
• Use Cases
• API and Libraries
• Execution Model
• Fault Tolerance
• Cluster Deployment
• Monitoring
• Scaling and Tuning
• Lambda Architecture
• Approximations
7. Spark Overview (1/2)
• Berkeley AMPLab ~2009
• Part of Berkeley Data Analytics Stack (BDAS, aka "badass")
8. Spark Overview (2/2)
• Based on Microsoft Dryad paper ~2007
• Written in Scala
• Supports Java, Python, SQL, and R
• In-memory when possible, not required
• Improved efficiency over MapReduce
  – Up to 100x faster in-memory, 2-10x faster on-disk
• Compatible with Hadoop
  – File formats, SerDes, and UDFs
12. Unified Benefits
• Advancements in higher-level libraries pushed down into core and vice-versa
• Examples
  – Spark Streaming: GC and memory management improvements
  – Spark GraphX: IndexedRDD for random, hashed access within a partition versus scanning entire partition
14. Resilient Distributed Dataset (RDD)
• Core Spark abstraction
• Represents partitions across the cluster nodes
• Enables parallel processing on data sets
• Partitions can be in-memory or on-disk
• Immutable, recomputable, fault tolerant
• Contains transformation lineage on data set
16. Spark API Overview
• Richer, more expressive than MapReduce
• Native support for Java, Scala, Python, SQL, and R (mostly)
• Unified API across all libraries
• Operations = Transformations + Actions
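The "Operations = Transformations + Actions" split can be illustrated with a toy sketch. This is plain Python, not the Spark API: `ToyRDD` and all of its methods are hypothetical names invented for illustration. It assumes only what the slides state, namely that the data set is immutable, transformations extend a lineage, and the lineage can be replayed to recompute any partition.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, lazy, recomputable."""

    def __init__(self, partitions, lineage=()):
        # Source partitions are frozen into tuples and never mutated.
        self._partitions = tuple(tuple(p) for p in partitions)
        # Ordered (kind, function) steps recorded by transformations.
        self._lineage = tuple(lineage)

    # Transformations: return a new ToyRDD with a longer lineage; no work yet.
    def map(self, fn):
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._partitions, self._lineage + (("filter", pred),))

    def compute_partition(self, index):
        """Replay the lineage for one partition (what recovery would redo)."""
        rows = list(self._partitions[index])
        for kind, fn in self._lineage:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

    # Actions: trigger computation over every partition and return a result.
    def collect(self):
        out = []
        for i in range(len(self._partitions)):
            out.extend(self.compute_partition(i))
        return out

    def count(self):
        return len(self.collect())


base = ToyRDD([[1, 2, 3], [4, 5, 6]])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
result = evens_squared.collect()
```

Nothing is computed until `collect()` or `count()` runs, and `base` is unchanged afterwards. Recomputing a single lost partition is just `evens_squared.compute_partition(1)`.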
24. Master High Availability
• Multiple Master Nodes
• ZooKeeper maintains current Master
• Existing applications and workers will be notified of new Master election
• New applications and workers need to explicitly specify current Master
• Alternatives (Not recommended)
  – Local filesystem
  – NFS Mount
26. Spark Streaming Overview
• Low latency, high throughput, fault tolerance (mostly)
• Long-running Spark application
• Supports Flume, Kafka, Twitter, Kinesis, Socket, File, etc.
• Graceful shutdown, in-flight message draining
• Uses Spark Core, DAG Execution Model, and Fault Tolerance
27. Spark Streaming Use Cases
• ETL on streaming data during ingestion
• Anomaly, malware, and fraud detection
• Operational dashboards
• Lambda architecture
  – Unified batch and streaming
  – e.g., different machine learning models for different time frames
• Predictive maintenance
  – Sensors
• NLP analysis
  – Twitter firehose
28. Discretized Stream (DStream)
• Core Spark Streaming abstraction
• Micro-batches of RDDs
• Operations similar to RDD
• Fault tolerance using DStream/RDD lineage
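The micro-batch idea can be sketched in a few lines of plain Python. This is an illustration only, not Spark Streaming code: the `discretize` function and the sample events are hypothetical. It assumes each event carries a timestamp in seconds and that a batch is simply the list of events that arrived within one batch interval.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp_seconds, value) events into consecutive micro-batches."""
    batches = defaultdict(list)
    for ts, value in events:
        # Every event falls into exactly one interval-sized bucket.
        batches[int(ts // batch_interval)].append(value)
    if not batches:
        return []
    # Emit a batch for every interval, including intervals with no events.
    return [batches.get(i, []) for i in range(max(batches) + 1)]


events = [(0.2, "a"), (1.7, "b"), (2.1, "c"), (6.5, "d")]
micro_batches = discretize(events, batch_interval=2)

# The same per-batch operation is then applied to every micro-batch.
counts = [len(batch) for batch in micro_batches]
```

With a 2-second interval the third batch is empty, which is why a streaming job still runs (and does almost no work) when no data arrives.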
30. Spark Streaming API Overview
• Rich, expressive API similar to core
• Operations
  – Transformations
  – Actions
• Window and State Operations
• Requires checkpointing to snip long-running DStream lineage
• Register DStream as a Spark SQL table for querying!
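A window operation combines the most recent micro-batches each time the window slides. The sketch below is plain Python rather than the DStream API: `windowed` is a hypothetical helper, and it assumes the window length and slide interval are both expressed as a whole number of batches.

```python
def windowed(batches, window_length, slide_interval):
    """Merge the last `window_length` batches every `slide_interval` batches."""
    windows = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        # Flatten the batches covered by this window into one list.
        windows.append([x for batch in batches[start:end] for x in batch])
    return windows


batches = [[1], [2, 3], [4], [5, 6], [7], [8]]
windows = windowed(batches, window_length=3, slide_interval=2)
window_sums = [sum(w) for w in windows]
```

Because the window is longer than the slide, consecutive windows overlap. That overlap is one reason windowed state tends to accumulate a long lineage and benefits from checkpointing.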
40. Spark Streaming + Kinesis Architecture
Figure: Spark Streaming + Kinesis architecture. Kinesis Producers write to a Kinesis Stream (Shards 1-3). The Spark Streaming application, built on the Kinesis Client Library (KCL), runs two Kinesis Receiver DStreams: Receiver DStream 1 (KCL with Kinesis Record Processor Threads 1 and 2) and Receiver DStream 2 (KCL with Kinesis Record Processor Thread 1).
41. Throughput and Pricing
Figure: Spark Streaming + Kinesis throughput and pricing (Kinesis Producer to Kinesis Stream shard to Spark Kinesis Receiver in the Spark Streaming application)
• In: 1 MB/sec per shard, 1000 PUTs/sec, 50K/PUT
• Out: 2 MB/sec per shard
• < 10 second delay
• Shard cost: $0.36 per day per shard
• PUT cost: $2.50 per day per shard
• Network transfer cost: free within Region!
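The per-shard limits and prices above lend themselves to a quick sizing calculation. The sketch below uses only the slide's 2014 figures; the function names are hypothetical, and the cost is an upper bound that assumes every shard runs at its full PUT rate all day.

```python
import math

# Per-shard figures taken from the slide (Sept 2014).
WRITE_MB_PER_SEC = 1.0
PUTS_PER_SEC = 1000
READ_MB_PER_SEC = 2.0
SHARD_COST_PER_DAY = 0.36
PUT_COST_PER_DAY = 2.50  # assumed to mean a shard saturated with PUTs


def shards_needed(write_mb_per_sec, puts_per_sec, read_mb_per_sec):
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC),
        math.ceil(puts_per_sec / PUTS_PER_SEC),
        math.ceil(read_mb_per_sec / READ_MB_PER_SEC),
        1,
    )


def max_daily_cost(shards):
    """Upper bound on daily cost in USD, rounded to cents."""
    return round(shards * (SHARD_COST_PER_DAY + PUT_COST_PER_DAY), 2)


shards = shards_needed(write_mb_per_sec=4.5, puts_per_sec=3000, read_mb_per_sec=6.0)
cost = max_daily_cost(shards)
```

Here the write throughput is the binding limit: 4.5 MB/sec needs 5 shards, for at most $14.30 per day at the slide's prices.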
45. Streaming Receiver Failure
• Use a backup receiver
• Use multiple receivers pulling from multiple shards
  – Use checkpoint-enabled, sharded streaming source (e.g., Kafka and Kinesis)
• Data is replicated to 2 nodes immediately upon ingestion
• Possible data loss
• Possible at-least-once delivery
• Use buffered sources (e.g., Kafka and Kinesis)
46. Streaming Driver Failure
• Use a backup Driver
  – Use DStream metadata checkpoint info to recover
• Single point of failure: interrupts stream processing
• Streaming Driver is a long-running Spark application
  – Schedules long-running stream receivers
• State and Window RDD checkpoints help avoid data loss (mostly)
48. Types of Checkpoints
Spark
1. Spark checkpointing of StreamingContext DStreams and metadata
2. Lineage of state and window DStream operations
Kinesis
3. Kinesis Client Library (KCL) checkpoints current position within shard
  – Checkpoint info is stored in DynamoDB per Kinesis application keyed by shard
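Per-shard position checkpoints, and the at-least-once behavior they lead to, can be shown with a toy model. Everything below is hypothetical and in-memory: `ToyCheckpointStore` stands in for the DynamoDB table, and `process_shard` is not the KCL API. It assumes records carry increasing sequence numbers and that a checkpoint is written every few records.

```python
class ToyCheckpointStore:
    """In-memory stand-in for a checkpoint table keyed by (application, shard)."""

    def __init__(self):
        self._table = {}

    def checkpoint(self, app, shard_id, sequence_number):
        self._table[(app, shard_id)] = sequence_number

    def last(self, app, shard_id):
        return self._table.get((app, shard_id))


def process_shard(records, store, app, shard_id, checkpoint_every, fail_at=None):
    """Process (sequence_number, payload) records after the last checkpoint."""
    last = store.last(app, shard_id)
    processed, since_checkpoint = [], 0
    for seq, payload in records:
        if last is not None and seq <= last:
            continue  # already covered by a checkpoint
        if fail_at is not None and seq == fail_at:
            return processed  # simulated crash before this record
        processed.append(payload)
        since_checkpoint += 1
        if since_checkpoint == checkpoint_every:
            store.checkpoint(app, shard_id, seq)
            since_checkpoint = 0
    return processed


store = ToyCheckpointStore()
records = [(i, f"r{i}") for i in range(1, 7)]
first_run = process_shard(records, store, "app", "shard-1", 2, fail_at=4)
position_after_crash = store.last("app", "shard-1")
second_run = process_shard(records, store, "app", "shard-1", 2)
```

Record `r3` was processed before the crash but after the last checkpoint, so the restart replays it. Nothing is lost, but downstream logic has to tolerate duplicates.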
52. Tuning
• Batch interval
  – High: reduces overhead of submitting new tasks for each batch
  – Low: keeps latencies low
  – Sweet spot: DStream job time (scheduling + processing) is steady and less than batch interval
• Checkpoint interval
  – High: reduces checkpointing overhead
  – Low: reduces amount of data lost on failure
  – Recommendation: 5-10x sliding window interval
• Use DStream.repartition() to increase parallelism of processing DStream jobs across cluster
• Use spark.streaming.unpersist=true to let the Streaming Framework figure out when to unpersist
• Use CMS GC for consistent processing times
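The batch-interval "sweet spot" and the checkpoint recommendation can be written down as simple checks. This is a sketch under stated assumptions: the function names are hypothetical, the job times would come from your own monitoring, and the 20% jitter threshold used to decide what counts as "steady" is an arbitrary illustrative choice, not a Spark default.

```python
from statistics import mean, pstdev


def batch_interval_is_stable(job_times, batch_interval, max_jitter=0.2):
    """True if job times (scheduling + processing) are steady and below the interval."""
    avg = mean(job_times)
    # Coefficient of variation as a rough measure of steadiness.
    jitter = pstdev(job_times) / avg if avg > 0 else 0.0
    return max(job_times) < batch_interval and jitter <= max_jitter


def recommended_checkpoint_range(sliding_interval):
    """Checkpoint interval range of 5-10x the sliding interval, per the slide."""
    return (5 * sliding_interval, 10 * sliding_interval)


steady = batch_interval_is_stable([1.6, 1.7, 1.5, 1.8], batch_interval=2.0)
falling_behind = batch_interval_is_stable([1.6, 2.4, 1.5, 1.8], batch_interval=2.0)
checkpoint_range = recommended_checkpoint_range(10)
```

If any batch takes longer than the batch interval, later batches queue up behind it, so the second series fails the check even though its average is still under 2 seconds.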
58. Approximation Overview
• Required for scaling
• Speed up analysis of large datasets
• Reduce size of working dataset
• Data is messy
• Collection of data is messy
• Exact isn't always necessary
• "Approximate is the new Exact"
60. Approximations In Action
Figure: Memory Savings with Approximation Techniques
(http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)
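One probabilistic structure of the kind covered in the linked article is the Bloom filter, which answers "have I seen this item?" in a fixed number of bits. The sketch below is a minimal standard-library version, not code from Spark or from the article. The class name, the sizes, and the SHA-256-based hashing scheme are all illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size set membership: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


seen = BloomFilter(num_bits=8192, num_hashes=4)
users = [f"user-{i}" for i in range(1000)]
for user in users:
    seen.add(user)

all_found = all(seen.might_contain(user) for user in users)
memory_bytes = len(seen.bits)
```

The filter occupies 1 KB no matter how long the user IDs are. The trade-off is that an unseen item can occasionally be reported as present; for these sizes the standard estimate puts that rate at roughly 2%, assuming well-behaved hashes.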
61. Spark Statistics Library
• Correlations
  – Dependence between 2 random variables
  – Pearson, Spearman
• Hypothesis Testing
  – Measure of statistical significance
  – Chi-squared test
• Stratified Sampling
  – Sample separately from different sub-populations
  – Bernoulli and Poisson sampling
  – With and without replacement
• Random data generator
  – Uniform, standard normal, and Poisson distribution
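The difference between the two correlation measures is easiest to see on a small example. The sketch below computes both in plain Python rather than through Spark's statistics library. The function names are hypothetical, and the ranking helper assumes the data contains no ties.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: strength of the linear relationship."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based rank of each value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]  # monotonic but not linear
linear = pearson(xs, ys)
monotonic = spearman(xs, ys)
```

Because `ys` always increases with `xs`, the Spearman coefficient is 1, while the Pearson coefficient falls below 1 because the relationship is not a straight line.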
62. Summary
• Spark, Spark Streaming Overview
• Use Cases
• API and Libraries
• Execution Model
• Fault Tolerance
• Cluster Deployment
• Monitoring
• Scaling and Tuning
• Lambda Architecture
• Approximations
Oct 2014 MEAP Early Access
http://sparkinaction.com