Recipes for running Spark Streaming in Production
Tathagata “TD” Das
Hadoop Summit 2015
@tathadas
Who am I?
Project Management Committee (PMC) member of Spark
Lead developer of Spark Streaming
Formerly in AMPLab, UC Berkeley
Software developer at Databricks
What is Databricks?
Founded by creators of Spark and remains largest contributor
Offers a hosted service:
- Spark on EC2
- Notebooks
- Plot visualizations
- Cluster management
- Scheduled jobs
What is Spark Streaming?
Spark Streaming
Scalable, fault-tolerant stream processing system
Ingests streams from Kafka, Flume, Kinesis, Twitter, HDFS/S3; pushes results to file systems, databases, dashboards
High-level API: joins, windows, … often 5x less code
Fault-tolerant: exactly-once semantics, even for stateful ops
Integration: integrates with MLlib, SQL, DataFrames, GraphX
What can you use it for?
- Real-time fraud detection in transactions
- React to anomalies in sensors in real time
- Spot cat videos in tweets as soon as they go viral
How does it work?
Receivers receive data streams and chop them up into batches
Spark processes the batches and pushes out the results
[Diagram: data streams → receivers → batches → results]
Word Count with Kafka

val context = new StreamingContext(conf, Seconds(1))
val lines = KafkaUtils.createStream(context, ...)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1))
                      .reduceByKey(_ + _)
wordCounts.print()
context.start()

StreamingContext is the entry point of streaming functionality. KafkaUtils.createStream creates a DStream from Kafka data; a DStream represents a data stream.
Word Count with Kafka

val context = new StreamingContext(conf, Seconds(1))
val lines = KafkaUtils.createStream(context, ...)
val words = lines.flatMap(_.split(" "))

flatMap splits the lines into words. Transformations like this transform data to create new DStreams.
Word Count with Kafka

val context = new StreamingContext(conf, Seconds(1))
val lines = KafkaUtils.createStream(context, ...)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1))
                      .reduceByKey(_ + _)
wordCounts.print()
context.start()

map and reduceByKey count the words, print() prints some counts on screen, and context.start() starts receiving and transforming the data.
Languages
Can natively use Scala, Java, and Python
Can use any other language by using pipe()
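A minimal sketch of the pipe() approach, assuming a hypothetical executable script ./wordcount.sh that reads lines on stdin and writes results to stdout:

// Pipe each RDD of the stream through an external process.
// The script path is a placeholder; it must exist on every executor.
val piped = lines.transform { rdd => rdd.pipe("./wordcount.sh") }
piped.print()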
Integrates with Spark Ecosystem
Spark Streaming, Spark SQL, MLlib, and GraphX all run on Spark Core.
Combine batch and streaming processing
Join data streams with static data sets:

// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")
// Join each batch in stream with the dataset
kafkaStream.transform { batchRDD =>
  batchRDD.join(dataset)
          .filter( ... )
}
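A runnable variant of the same pattern, under assumptions: a hypothetical lookup file of userId,name pairs, and a stream whose records are keyed by userId:

// Static data set: (userId, name) pairs, loaded once.
val dataset = sparkContext.textFile("hdfs:///lookup/users.txt")
  .map { line => val Array(id, name) = line.split(","); (id, name) }
// Join each batch (keyed by userId) with the static data set.
val joined = kafkaStream.transform { batchRDD =>
  batchRDD.join(dataset)
          .filter { case (id, (event, name)) => name.nonEmpty }
}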
Combine machine learning with streaming
Learn models offline, apply them online:

// Learn model offline
val model = KMeans.train(dataset, ...)
// Apply model online on stream
kafkaStream.map { event =>
  model.predict(event.feature)
}
Combine SQL with streaming
Interactively query streaming data with SQL:

// Register each batch in stream as table
kafkaStream.foreachRDD { batchRDD =>
  batchRDD.toDF.registerTempTable("events")
}
// Interactively query table
sqlContext.sql("select * from events")
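One assumption hidden in this sketch: toDF needs the SQLContext implicits in scope and a schema-bearing element type. A minimal version with a hypothetical Event case class as the schema:

case class Event(word: String, n: Int)  // hypothetical schema

import sqlContext.implicits._
kafkaStream.map { case (_, line) => Event(line, 1) }.foreachRDD { rdd =>
  rdd.toDF.registerTempTable("events")
}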
How do I write production-quality Spark Streaming applications?
WordCount with Kafka

object WordCount {
  def main(args: Array[String]) {
    val context = new StreamingContext(new SparkConf(), Seconds(1))
    val lines = KafkaUtils.createStream(context, ...)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    context.start()
    context.awaitTermination()
  }
}

Got it working on a small Spark cluster on little data. What’s next?
What’s needed to get it production-ready?
- Fault-tolerance and semantics
- Performance and stability
- Monitoring and upgrading
Deeper View of Spark Streaming
Any Spark Application
User code runs in the driver process. The driver launches executors in the cluster (YARN / Mesos / Spark Standalone), and tasks are sent to the executors for processing data.
Spark Streaming Application: Receive data
The driver runs receivers as long-running tasks. Each receiver divides its data stream into blocks and keeps them in memory; the blocks are also replicated to another executor.
Spark Streaming Application: Process data
Every batch interval, the driver launches tasks to process the blocks, and the results are pushed out to data stores.
Fault-tolerance and semantics
Failures? Why care?
Many streaming applications need zero-data-loss guarantees despite any kind of failure in the system
At-least-once guarantee – every record processed at least once
Exactly-once guarantee – every record processed exactly once
Different kinds of failures – executor and driver failures
Some failures and guarantee requirements need additional configurations and setups
What if an executor fails?
If an executor fails, its receiver is lost and all its blocks are lost. Tasks and receivers are restarted by Spark automatically, no config needed: the receiver is restarted on another executor, and tasks are restarted on the block replicas.
What if the driver fails?
When the driver fails, all the executors fail, and all computation and all received blocks are lost. How do we recover?
Recovering Driver with Checkpointing
Checkpointing: periodically save the state of the computation to fault-tolerant storage. While the driver is active, checkpoint information is written out alongside normal processing.
A failed driver can then be restarted from the checkpoint information: new executors are launched and the receivers are restarted.
Recovering Driver with Checkpointing
1. Configure automatic driver restart
   All cluster managers support this
2. Set a checkpoint directory in an HDFS-compatible file system
   streamingContext.checkpoint(hdfsDirectory)
3. Slightly restructure the code to use checkpoints for recovery
Configuring Automatic Driver Restart
Spark Standalone – submit the job in “cluster” mode with “--supervise”
See http://spark.apache.org/docs/latest/spark-standalone.html
YARN – submit the job in “cluster” mode
See YARN config “yarn.resourcemanager.am.max-attempts”
Mesos – Marathon can restart Mesos applications
Restructuring code for Checkpointing

Before – create + setup, then start:

val context = new StreamingContext(...)
val lines = KafkaUtils.createStream(...)
val words = lines.flatMap(...)
...
context.start()

Put all setup code into a function that returns a new StreamingContext:

def creatingFunc(): StreamingContext = {
  val context = new StreamingContext(...)
  val lines = KafkaUtils.createStream(...)
  val words = lines.flatMap(...)
  ...
  context.checkpoint(hdfsDir)
  context  // return the fully set-up context
}

Get the context set up from the HDFS dir OR create a new one with the function:

val context = StreamingContext.getOrCreate(hdfsDir, creatingFunc)
context.start()
StreamingContext.getOrCreate():
If the HDFS directory has checkpoint info, recover the context from that info; else call creatingFunc() to create and set up a new context.
The restarted process can thus figure out whether to recover using checkpoint info or not.
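A complete minimal sketch under assumptions: the checkpoint directory, ZooKeeper quorum, consumer group, and topic map are placeholders, and the Kafka receiver's (key, message) pairs are mapped down to their message values:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val hdfsDir = "hdfs:///checkpoints/wordcount"  // placeholder checkpoint directory

def creatingFunc(): StreamingContext = {
  val context = new StreamingContext(new SparkConf().setAppName("WordCount"), Seconds(1))
  val lines = KafkaUtils.createStream(context, "zk1:2181", "app-group", Map("events" -> 1))
    .map(_._2)  // keep only the message value
  lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
  context.checkpoint(hdfsDir)
  context
}

val context = StreamingContext.getOrCreate(hdfsDir, creatingFunc _)
context.start()
context.awaitTermination()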
Received blocks lost on restart!
Even with checkpointing, in-memory blocks of buffered data are lost on driver restart.
Recovering data with Write Ahead Logs
Write Ahead Log: synchronously save received data to fault-tolerant storage. As the receiver stores blocks in memory, they are also saved to HDFS.
On restart, the blocks are recovered from the Write Ahead Log.
Recovering data with Write Ahead Logs
1. Enable checkpointing; the logs are written in the checkpoint directory
2. Enable the WAL in the SparkConf configuration
   sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
3. The receiver should also be reliable
   Acknowledge the source only after data is saved to the WAL
   Unacked data will be replayed from the source by the restarted receiver
Fault-tolerance Semantics
Zero data loss = every stage processes each event at least once despite any failure
Pipeline stages: Sources → Receiving → Transforming → Outputting → Sinks
Fault-tolerance Semantics
Receiving: at least once, with Write Ahead Log and reliable receivers
Transforming: exactly once, as long as received data is not lost
Outputting: exactly once, if outputs are idempotent or transactional
End-to-end semantics: at-least once
Fault-tolerance Semantics
Exactly-once receiving with the new Kafka Direct approach
Treats Kafka like a replicated log, reads it like a file
Does not use receivers
No need to create multiple DStreams and union them
No need to enable Write Ahead Logs

val directKafkaStream = KafkaUtils.createDirectStream(...)

https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.htm
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
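A minimal sketch of the direct approach, assuming the Spark 1.3+ spark-streaming-kafka artifact; the broker list and topic name are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("events")
// Reads Kafka partitions directly from the brokers; no receiver involved.
val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  context, kafkaParams, topics)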
Fault-tolerance Semantics
With the Kafka Direct approach:
Receiving: exactly once
Transforming: exactly once, as long as received data is not lost
Outputting: exactly once, if outputs are idempotent or transactional
End-to-end semantics: exactly once!
http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
Performance and stability
Achieving High Throughput
High throughput is achieved by sufficient parallelism at all stages of the pipeline: receiving, transforming, and outputting.
Scaling the Receivers
Sources must be configured with parallel data streams: #partitions in Kafka topics, #shards in Kinesis streams, …
The streaming app should have multiple receivers that receive the data streams in parallel: multiple input DStreams, each running a receiver. These can be unioned together to create one DStream, as in the sketch below.

val kafkaStream1 = KafkaUtils.createStream(...)
val kafkaStream2 = KafkaUtils.createStream(...)
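A minimal sketch of unioning receiver streams, with placeholder ZooKeeper quorum, consumer group, and topic map:

val kafkaStream1 = KafkaUtils.createStream(context, "zk1:2181", "app-group", Map("events" -> 1))
val kafkaStream2 = KafkaUtils.createStream(context, "zk1:2181", "app-group", Map("events" -> 1))
// One logical stream backed by two parallel receivers.
val unionedStream = context.union(Seq(kafkaStream1, kafkaStream2))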
Scaling the Receivers
There must be a sufficient number of executors to run all the receivers:
#executors > #receivers, so that there is no more than 1 receiver per executor and the network is not shared between receivers
The Kafka Direct approach does not use receivers:
It automatically parallelizes data reading across executors
Parallelism = #Kafka partitions
Stability in Processing
For stability, the app must process data as fast as it is received
Must ensure avg batch processing time < batch interval, i.e. the previous batch is done by the time the next batch is received
Otherwise, new batches keep queueing up waiting for previous batches to finish, and the scheduling delay goes up
Reducing Batch Processing Times
More receivers! Executors running receivers do a lot of the processing
Repartition the received data to explicitly distribute load:
unionedStream.repartition(40)
Increase #partitions in shuffles:
transformedStream.reduceByKey(reduceFunc, 40)
Get more executors and cores!
Reducing Batch Processing Times
Use Kryo serialization to reduce serialization costs
Register classes for best performance
See configurations spark.kryo.*
http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization
Larger batch durations improve stability: more data aggregated together amortizes the cost of shuffles
Limit the ingestion rate to handle data surges
See configurations spark.streaming.*maxRate*
http://spark.apache.org/docs/latest/configuration.html#spark-streaming
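A minimal sketch of the Kryo setup; Event is a hypothetical application class standing in for whatever flows through the stream:

import org.apache.spark.SparkConf

case class Event(id: Long, payload: String)  // hypothetical record type

val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Event]))  // registered classes serialize fastest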
Speeding up Output Operations
Write to data stores efficiently.

foreach: inefficient
dataRDD.foreach { event =>
  // open connection
  // insert single event
  // close connection
}

foreachPartition: efficient
dataRDD.foreachPartition { partition =>
  // open connection
  // insert all events in partition
  // close connection
}

foreachPartition + connection pool: more efficient (see the sketch below)
dataRDD.foreachPartition { partition =>
  // get open connection from pool
  // insert all events in partition
  // return connection to pool
}
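A fleshed-out sketch of the pooled pattern, assuming a JDBC sink; the connection URL, table, and the ConnectionPool helper itself are hypothetical:

import java.sql.{Connection, DriverManager}
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical minimal pool: reuses connections within each executor JVM.
object ConnectionPool {
  private val pool = new ConcurrentLinkedQueue[Connection]()
  def borrow(): Connection =
    Option(pool.poll()).getOrElse(DriverManager.getConnection("jdbc:postgresql://db:5432/events"))
  def giveBack(c: Connection): Unit = pool.offer(c)
}

wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val conn = ConnectionPool.borrow()   // get open connection from pool
    val stmt = conn.prepareStatement("INSERT INTO counts (word, n) VALUES (?, ?)")
    partition.foreach { case (word, count) =>
      stmt.setString(1, word)
      stmt.setInt(2, count)
      stmt.executeUpdate()               // insert all events in partition
    }
    stmt.close()
    ConnectionPool.giveBack(conn)        // return connection to pool
  }
}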
Monitoring and Upgrading
Streaming in Spark Web UI
Stats over the last 1000 batches: timelines and histograms
Scheduling delay should be approx 0
Processing time should be approx < batch interval
Streaming in Spark Web UI
Details of individual batches, and details of the Spark jobs run in a batch [New in Spark 1.4]
Operational Monitoring
Streaming statistics are published through Codahale metrics
Ganglia sink, Graphite sink, custom Codahale metrics sinks
Can see long-term trends, across hours and days
Configure the metrics using $SPARK_HOME/conf/metrics.properties
Also need to compile Spark with the Ganglia LGPL profile
(see http://spark.apache.org/docs/latest/monitoring.html#metrics)
Upgrading Apps
1. Shut down your current streaming app gracefully
   Will process all received data before shutting down cleanly
   streamingContext.stop(stopSparkContext = true, stopGracefully = true)
2. Update the app code and start it again
Things not covered
Memory tuning
GC tuning
Using SQLContext
Using Broadcasts
Using Accumulators
…
Refer to the online guide:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
More Spark Streaming Talks
Partners Life - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Modus operandi of Spark Streaming - Recipes for Running your Streaming Application in Production

  • 18. WordCount with Kafka
    object WordCount {
      def main(args: Array[String]) {
        val context = new StreamingContext(new SparkConf(), Seconds(1))
        val lines = KafkaUtils.createStream(context, ...)
        val words = lines.flatMap(_.split(" "))
        val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
        wordCounts.print()
        context.start()
        context.awaitTermination()
      }
    }
    Got it working on a small Spark cluster on little data. What's next?
  • 19. What's needed to get it production-ready?
    - Fault-tolerance and semantics
    - Performance and stability
    - Monitoring and upgrading
  • 20. Deeper View of Spark Streaming
  • 21. Any Spark Application
    - User code runs in the driver process
    - The driver launches executors in the cluster (YARN / Mesos / Spark Standalone)
    - Tasks are sent to the executors for processing data
  • 22. Spark Streaming Application: Receive data
    - The driver runs receivers as long-running tasks
    - Each receiver divides its data stream into blocks and keeps them in memory
    - Blocks are also replicated to another executor
  • 23. Spark Streaming Application: Process data
    - Every batch interval, the driver launches tasks to process the blocks and push results out to data stores
  • 25. Failures? Why care?
    - Many streaming applications need zero-data-loss guarantees despite any kind of failure in the system
    - At-least-once guarantee: every record processed at least once
    - Exactly-once guarantee: every record processed exactly once
    - Different kinds of failures: executor failures and driver failures
    - Some failures and guarantee requirements need additional configuration and setup
  • 26. What if an executor fails?
    - If an executor fails, its receiver is lost and all of its in-memory blocks are lost
    - Tasks and receivers are restarted by Spark automatically, no config needed
    - The receiver is restarted on another executor, and tasks are restarted on the block replicas
  • 27. What if the driver fails?
    - When the driver fails, all the executors fail
    - All computation and all received blocks are lost
    - How do we recover?
  • 28. Recovering Driver with Checkpointing
    - Checkpointing: periodically save the state of the computation to fault-tolerant storage
    - The active driver writes checkpoint information while executors receive and process blocks
  • 29. Recovering Driver with Checkpointing
    - A failed driver can be restarted from the checkpoint information
    - New executors are launched and the receivers are restarted
  • 30. Recovering Driver with Checkpointing
    1. Configure automatic driver restart (all cluster managers support this)
    2. Set a checkpoint directory in an HDFS-compatible file system: streamingContext.checkpoint(hdfsDirectory)
    3. Slightly restructure the code to use checkpoints for recovery
  • 31. Configuring Automatic Driver Restart
    - Spark Standalone: submit the job in "cluster" mode with "--supervise" (see http://spark.apache.org/docs/latest/spark-standalone.html and the example submission below)
    - YARN: submit the job in "cluster" mode (see the YARN config "yarn.resourcemanager.am.max-attempts")
    - Mesos: Marathon can restart Mesos applications
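    For instance, a Spark Standalone submission with supervision might look like the following sketch; the master URL, main class, and JAR name are placeholders:

      spark-submit \
        --master spark://master:7077 \
        --deploy-mode cluster \
        --supervise \
        --class WordCount \
        word-count.jar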
  • 32. Restructuring code for Checkpointing
    Before: create + set up the context, then start it
      val context = new StreamingContext(...)
      val lines = KafkaUtils.createStream(...)
      val words = lines.flatMap(...)
      ...
      context.start()
    After: put all setup code into a function that returns a new StreamingContext
      def creatingFunc(): StreamingContext = {
        val context = new StreamingContext(...)
        val lines = KafkaUtils.createStream(...)
        val words = lines.flatMap(...)
        ...
        context.checkpoint(hdfsDir)
        context
      }
    Get the context set up from the HDFS dir, OR create a new one with the function:
      val context = StreamingContext.getOrCreate(hdfsDir, creatingFunc)
      context.start()
  • 33. Restructuring code for Checkpointing
    - StreamingContext.getOrCreate(): if the HDFS directory has checkpoint info, recover the context from it; else call creatingFunc() to create and set up a new context
    - The restarted process can figure out by itself whether to recover from checkpoint info or not (a complete sketch follows below)
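    Putting slides 30 to 33 together, here is a minimal self-contained sketch; the checkpoint path is a placeholder, and a socket source stands in for the talk's Kafka source:

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object CheckpointedWordCount {
        // Hypothetical checkpoint location; must be an HDFS-compatible path
        val checkpointDir = "hdfs:///user/app/checkpoints"

        def creatingFunc(): StreamingContext = {
          val conf = new SparkConf().setAppName("CheckpointedWordCount")
          val context = new StreamingContext(conf, Seconds(1))
          context.checkpoint(checkpointDir)  // enable checkpointing
          // Stand-in source; the talk uses KafkaUtils.createStream(context, ...)
          val lines = context.socketTextStream("localhost", 9999)
          val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
          wordCounts.print()
          context  // return the fully set-up context
        }

        def main(args: Array[String]): Unit = {
          // Recover from checkpoint info if present, else create a fresh context
          val context = StreamingContext.getOrCreate(checkpointDir, creatingFunc _)
          context.start()
          context.awaitTermination()
        }
      }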
  • 34. Received blocks lost on Restart!
    - In-memory blocks of buffered data are lost on driver restart, even though the driver and receivers recover
  • 35. Recovering data with Write Ahead Logs
    - Write Ahead Log: synchronously save received data to fault-tolerant storage
    - As the receiver buffers blocks in memory, they are also saved to HDFS
  • 36. Received blocks lost on Restart!
    - Write Ahead Log: synchronously save received data to fault-tolerant storage
    - After restart, blocks are recovered from the Write Ahead Log
  • 37. Recovering data with Write Ahead Logs
    1. Enable checkpointing; the logs are written in the checkpoint directory
    2. Enable the WAL in the SparkConf configuration: sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true") (see the sketch below)
    3. The receiver should also be reliable: acknowledge the source only after data is saved to the WAL; unacked data will be replayed from the source by the restarted receiver
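    A minimal configuration sketch of steps 1 and 2; the app name and checkpoint path are placeholders:

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val sparkConf = new SparkConf()
        .setAppName("WALEnabledApp")  // hypothetical app name
        .set("spark.streaming.receiver.writeAheadLog.enable", "true")
      val context = new StreamingContext(sparkConf, Seconds(1))
      // The WAL files are written under this checkpoint directory (placeholder path)
      context.checkpoint("hdfs:///user/app/checkpoints")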
  • 38. Fault-tolerance Semantics
    - Zero data loss = every stage (receiving, transforming, outputting) processes each event at least once despite any failure
  • 39. Fault-tolerance Semantics
    - Receiving: at least once, with Write Ahead Log and reliable receivers
    - Transforming: exactly once, as long as received data is not lost
    - Outputting: exactly once, if outputs are idempotent or transactional
    - End-to-end semantics: at-least-once
  • 40. Fault-tolerance Semantics
    - Exactly-once receiving with the new Kafka Direct approach
    - Treats Kafka like a replicated log, reads it like a file
    - Does not use receivers
    - No need to create multiple DStreams and union them
    - No need to enable Write Ahead Logs
      val directKafkaStream = KafkaUtils.createDirectStream(...)
    (a fuller invocation is sketched below)
    https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
    http://spark.apache.org/docs/latest/streaming-kafka-integration.html
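    A fuller invocation sketch of the direct approach, based on the Spark 1.3/1.4 Kafka integration; the broker list and topic name are placeholders:

      import kafka.serializer.StringDecoder
      import org.apache.spark.streaming.kafka.KafkaUtils

      // Brokers are contacted directly; no ZooKeeper-based receiver is involved
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
      val topics = Set("events")  // hypothetical topic name
      val directKafkaStream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder](context, kafkaParams, topics)
      // Records arrive as (key, value) pairs; parallelism = #Kafka partitions
      val values = directKafkaStream.map(_._2)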
  • 41. Fault-tolerance Semantics
    - Receiving: exactly once, with the new Kafka Direct approach
    - Transforming: exactly once, as long as received data is not lost
    - Outputting: exactly once, if outputs are idempotent or transactional
    - End-to-end semantics: exactly once!
  • 44. Achieving High Throughput
    - High throughput is achieved by sufficient parallelism at all stages of the pipeline: receiving, transforming, and outputting
  • 45. Scaling the Receivers
    - Sources must be configured with parallel data streams: #partitions in Kafka topics, #shards in Kinesis streams, …
    - The streaming app should have multiple receivers that receive the data streams in parallel
    - Multiple input DStreams, each running a receiver, can be unioned together to create one DStream (see the sketch below)
      val kafkaStream1 = KafkaUtils.createStream(...)
      val kafkaStream2 = KafkaUtils.createStream(...)
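    A sketch of creating several receiver-based streams and unioning them; the ZooKeeper quorum, group id, topic map, and receiver count are all placeholders:

      val zkQuorum = "zk1:2181"          // placeholder ZooKeeper quorum
      val groupId = "streaming-app"      // placeholder consumer group
      val topicMap = Map("events" -> 1)  // topic -> #threads per receiver
      // Create N receiver-based input DStreams, each running as a long-lived task
      val numReceivers = 4  // illustrative; roughly match the topic's #partitions
      val kafkaStreams = (1 to numReceivers).map { _ =>
        KafkaUtils.createStream(context, zkQuorum, groupId, topicMap)
      }
      // Union them so the rest of the pipeline sees a single DStream
      val unionedStream = context.union(kafkaStreams)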
  • 46. Scaling the Receivers
    - Have a sufficient number of executors to run all the receivers: #executors > #receivers, so that there is no more than 1 receiver per executor and the network is not shared between receivers
    - The Kafka Direct approach does not use receivers: it automatically parallelizes data reading across executors, with parallelism = #Kafka partitions
  • 47. Stability in Processing
    - For stability, the app must process data as fast as it is received
    - Ensure avg batch processing time < batch interval, i.e. the previous batch is done by the time the next batch is received
    - Otherwise, new batches keep queueing up waiting for previous batches to finish, and the scheduling delay goes up
  • 48. Reducing Batch Processing Times
    - More receivers! Executors running receivers do a lot of the processing
    - Repartition the received data to explicitly distribute load: unionedStream.repartition(40)
    - Increase #partitions in shuffles: transformedStream.reduceByKey(reduceFunc, 40)
    - Get more executors and cores!
  • 49. Reducing Batch Processing Times
    - Use Kryo serialization to reduce serialization costs; register classes for best performance (see the spark.kryo.* configurations: http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization, and the sketch below)
    - Larger batch durations improve stability: more data is aggregated together, amortizing the cost of shuffles
    - Limit the ingestion rate to handle data surges (see the spark.streaming.*maxRate* configurations: http://spark.apache.org/docs/latest/configuration.html#spark-streaming)
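    For example, a sketch of enabling Kryo and registering classes; MyEvent is a hypothetical application class:

      import org.apache.spark.SparkConf

      // MyEvent stands in for your application's record class
      case class MyEvent(id: Long, payload: String)

      val sparkConf = new SparkConf()
        .setAppName("StreamingApp")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        // Registering classes avoids embedding full class names in serialized data
        .registerKryoClasses(Array(classOf[MyEvent]))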
  • 50. Speeding up Output Operations
    - Write to data stores efficiently (a fleshed-out sketch of the pooled variant follows below)
    foreach: inefficient (a connection per record)
      dataRDD.foreach { event =>
        // open connection
        // insert single event
        // close connection
      }
    foreachPartition: efficient (a connection per partition)
      dataRDD.foreachPartition { partition =>
        // open connection
        // insert all events in partition
        // close connection
      }
    foreachPartition + connection pool: more efficient (connections reused)
      dataRDD.foreachPartition { partition =>
        // get open connection from pool
        // insert all events in partition
        // return connection to pool
      }
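    A fleshed-out sketch of the pooled variant, assuming a hypothetical ConnectionPool helper on each executor (any pooling library could play this role):

      wordCounts.foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          // ConnectionPool is a hypothetical static, lazily initialized pool
          // that lives on each executor
          val connection = ConnectionPool.getConnection()
          partition.foreach { case (word, count) =>
            connection.insert(word, count)  // hypothetical insert call
          }
          ConnectionPool.returnConnection(connection)  // reuse across batches
        }
      }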
  • 52. Streaming in Spark Web UI
    - Stats over the last 1000 batches: timelines and histograms
    - Scheduling delay should be approximately 0
    - Processing time should be approximately < the batch interval
  • 53. Streaming in Spark Web UI
    - Details of individual batches, and of the Spark jobs run in each batch [New in Spark 1.4]
  • 54. Operational Monitoring
    - Streaming statistics are published through Codahale metrics: Ganglia sink, Graphite sink, custom Codahale metrics sinks
    - Lets you see long-term trends, across hours and days
    - Configure the metrics using $SPARK_HOME/conf/metrics.properties (see the sketch below)
    - For Ganglia, you also need to compile Spark with the Ganglia LGPL profile (see http://spark.apache.org/docs/latest/monitoring.html#metrics)
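    As an illustration, a minimal metrics.properties sketch for a Graphite sink; the host, port, and prefix are placeholders:

      # $SPARK_HOME/conf/metrics.properties
      *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
      *.sink.graphite.host=graphite.example.com
      *.sink.graphite.port=2003
      *.sink.graphite.prefix=streaming-app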
  • 55. Upgrading Apps
    1. Shut down your current streaming app gracefully; it will process all received data before shutting down cleanly: streamingContext.stop(stopSparkContext = true, stopGracefully = true)
    2. Update the app code and start it again
  • 56. Things not covered
    - Memory tuning
    - GC tuning
    - Using SQLContext
    - Using Broadcasts
    - Using Accumulators
    - …
    Refer to the online guide: http://spark.apache.org/docs/latest/streaming-programming-guide.html