Satendra Kumar
Sr. Software Consultant
Knoldus Software LLP
Stream Processing
Topics Covered
➢ What is Stream
➢ What is Stream processing
➢ The challenges of stream processing
➢ Overview of Spark Streaming
➢ Receivers
➢ Custom receivers
➢ Transformations on DStreams
➢ Failures
➢ Fault-tolerance Semantics
➢ Kafka Integration
➢ Performance Tuning
What is Stream
A stream is a sequence of data elements that is made available over time
and can be accessed in sequential order.
E.g., YouTube video buffering.
What is Stream processing
Stream processing is the real-time processing of data
continuously, concurrently, and in a record-by-record fashion.
It treats data not as static tables or files, but as a continuous
infinite stream of data integrated from both live and historical
sources.
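As a rough illustration only (plain Scala, no streaming framework; the object and method names are hypothetical), an unbounded source can be modelled as a lazy Iterator that is consumed one record at a time:

```scala
object StreamSketch {
  // An unbounded source: elements are produced lazily, only when requested.
  def readings: Iterator[Int] = Iterator.from(1)

  // Record-by-record processing: each element is filtered and transformed as it
  // arrives, without ever materialising the whole (infinite) sequence.
  def firstEvenSquares(n: Int): List[Int] =
    readings.filter(_ % 2 == 0).map(x => x * x).take(n).toList
}
```

Only the take(n) call bounds the computation; the source itself never ends, which is the property that separates a stream from a static table or file.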
The challenges of stream processing
➢ Partitioning & Scalability
➢ Semantics & Fault tolerance
➢ Unifying the streams
➢ Time
➢ Re-Processing
Spark Streaming
➢ Provides a way to process live data streams.
➢ Scalable, high-throughput, and fault-tolerant.
➢ Built on top of the core Spark API.
➢ The API is very similar to the Spark core API.
➢ Supports many sources like Kafka, Flume, Kinesis or TCP
sockets.
➢ Currently based on RDDs.
Discretized Streams
➢ Spark Streaming provides a high-level abstraction called a discretized stream, or
DStream, which represents a continuous stream of data.
➢ DStreams can be created either from input data streams from
sources such as Kafka, Flume, and Kinesis, or by applying
high-level operations on other DStreams.
➢ A DStream is represented as a sequence of RDDs.
High level overview
Driver Program
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object StreamingApp extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp")
  // Streaming context, created with a batch interval of 5 seconds
  val streamingContext = new StreamingContext(sparkConf, Seconds(5))
  // Receiver: reads lines of text from a TCP socket
  val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000)
  // Transformations on DStreams
  val words: DStream[String] = lines.flatMap(_.split(" "))
  val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
  val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
  val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
  // Output operation on DStreams
  wordCounts.print()
  // Start the streaming computation and wait for it to terminate
  streamingContext.start()
  streamingContext.awaitTermination()
}
Important Points
➢ Once a context has been started, no new streaming computations can
be set up or added to it.
➢ Once a context has been stopped, it cannot be restarted.
➢ Only one StreamingContext can be active in a JVM at the same time.
➢ stop() on StreamingContext also stops the SparkContext. To stop only
the StreamingContext, set the optional parameter of stop() called
stopSparkContext to false.
➢ A SparkContext can be re-used to create multiple StreamingContexts, as
long as the previous StreamingContext is stopped (without stopping the
SparkContext) before the next StreamingContext is created.
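The last two points can be sketched as follows. This is a minimal sketch, assuming a local master; the object name ContextReuseSketch and its run() method are hypothetical, and the DStream definitions are left out.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext, StreamingContextState}

object ContextReuseSketch {
  // Returns the state of the first streaming context after stop(), and
  // whether the shared SparkContext had been stopped at that point.
  def run(): (StreamingContextState, Boolean) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ContextReuseSketch")
    val sc = new SparkContext(conf)

    val first = new StreamingContext(sc, Seconds(1))
    // ... define DStreams, start(), run for a while ...
    first.stop(stopSparkContext = false) // keep the SparkContext alive

    // The same SparkContext can now back a new StreamingContext.
    val second = new StreamingContext(sc, Seconds(5))
    val observed = (first.getState(), sc.isStopped)

    second.stop(stopSparkContext = true) // final clean-up stops everything
    observed
  }
}
```

The first context cannot be started again after stop(); a new StreamingContext has to be created instead, as done for second here.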
Spark Streaming Concept
➢ Spark Streaming is based on a micro-batch architecture.
➢ Spark Streaming continuously receives live input data streams and divides
the data into batches.
➢ New batches are created at regular time intervals, called the batch interval.
➢ Each batch has N blocks, where N = batch interval / block interval.
For example, if the batch interval is 1 second and the block interval is 200 ms (the default),
each batch has 5 blocks.
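The block arithmetic above can be checked with a few lines of plain Scala. The helper name blocksPerBatch is hypothetical; in Spark the block interval itself is set through the spark.streaming.blockInterval configuration property.

```scala
object BlockMath {
  // Number of blocks one receiver produces per batch: N = batch interval / block interval.
  def blocksPerBatch(batchIntervalMs: Long, blockIntervalMs: Long): Long = {
    require(blockIntervalMs > 0, "block interval must be positive")
    batchIntervalMs / blockIntervalMs
  }
}
```

Each block becomes one partition of the batch's RDD, so with a 5 second batch interval and the default 200 ms block interval a single receiver yields 25 partitions, and hence 25 tasks, per batch.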
Transforming DStream
➢ DStream is represented by a continuous series of RDDs
➢ Each RDD in a DStream contains data from a certain interval
➢ Any operation applied on a DStream translates to operations on the
underlying RDDs
➢ The processing time of a batch should be less than or equal to the batch
interval.
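Why the last point matters can be seen in a small simulation. This is a simplified model in plain Scala, not Spark code: it assumes batches are processed strictly one after another, and the name schedulingDelays is hypothetical.

```scala
object BatchDelaySketch {
  // Given the batch interval and the processing time of each batch (both in ms),
  // return how long each batch waits before its processing can start.
  def schedulingDelays(batchIntervalMs: Long, processingTimesMs: Seq[Long]): Seq[Long] = {
    var freeAt = 0L // time at which the previous batch finishes processing
    processingTimesMs.zipWithIndex.map { case (cost, i) =>
      val readyAt = (i + 1) * batchIntervalMs // batch i is complete at the end of its interval
      val startAt = math.max(readyAt, freeAt)
      freeAt = startAt + cost
      startAt - readyAt
    }
  }
}
```

With a 1000 ms interval and 1200 ms of work per batch, the delay grows by 200 ms with every batch and never recovers; with 800 ms of work it stays at zero.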
Transformations on DStreams
def map[U: ClassTag](mapFunc: T => U): DStream[U]
def flatMap[U: ClassTag](flatMapFunc: T => TraversableOnce[U]): DStream[U]
def filter(filterFunc: T => Boolean): DStream[T]
def reduce(reduceFunc: (T, T) => T): DStream[T]
def count(): DStream[Long]
def repartition(numPartitions: Int): DStream[T]
def countByValue(numPartitions: Int = ssc.sc.defaultParallelism): DStream[(T, Long)]
def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U]
Transformations on PairDStream
def groupByKey(): DStream[(K, Iterable[V])]
def reduceByKey(reduceFunc: (V, V) => V, numPartitions: Int): DStream[(K, V)]
def join[W: ClassTag](other: DStream[(K, W)]): DStream[(K, (V, W))]
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S], partitioner: Partitioner): DStream[(K, S)]
def cogroup[W: ClassTag](
    other: DStream[(K, W)], numPartitions: Int): DStream[(K, (Iterable[V], Iterable[W]))]
def mapValues[U: ClassTag](mapValuesFunc: V => U): DStream[(K, U)]
def leftOuterJoin[W: ClassTag](
    other: DStream[(K, W)], numPartitions: Int): DStream[(K, (V, Option[W]))]
def rightOuterJoin[W: ClassTag](
    other: DStream[(K, W)], numPartitions: Int): DStream[(K, (Option[V], W))]
updateStateByKey
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object StreamingApp extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp")
  val streamingContext = new StreamingContext(sparkConf, Seconds(5))
  // Stateful transformations require a checkpoint directory
  streamingContext.checkpoint(".")
  val lines = streamingContext.socketTextStream("localhost", 9000)
  val words: DStream[String] = lines.flatMap(_.split(" "))
  val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
  val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
  // Running count per word: new values of this batch plus the previous state
  val updatedState: DStream[(String, Int)] =
    pairs.updateStateByKey[Int] {
      (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0))
    }
  updatedState.print()
  streamingContext.start()
  streamingContext.awaitTermination()
}
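To see what the update function does from batch to batch, here is a plain Scala model of the same logic. It is only a sketch of the semantics, not how Spark implements it; applyBatch and run are hypothetical helpers.

```scala
object UpdateStateSketch {
  // Same shape as the update function passed to updateStateByKey.
  def updateFunc(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(newValues.sum + state.getOrElse(0))

  // Apply one batch of (word, 1) pairs to the running state.
  def applyBatch(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped = batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    // The function is called for every key that has new data or existing state;
    // returning None would drop the key from the state.
    (state.keySet ++ grouped.keySet).flatMap { k =>
      updateFunc(grouped.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
    }.toMap
  }

  // Fold a sequence of batches (each a list of words) into the final state.
  def run(batches: Seq[Seq[String]]): Map[String, Int] =
    batches.foldLeft(Map.empty[String, Int]) { (st, words) => applyBatch(st, words.map(_ -> 1)) }
}
```

A key with no new values in a batch still goes through updateFunc with an empty sequence, which is why its count is carried forward unchanged.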
Window Operations
Spark Streaming also provides windowed computations, which allow
you to apply transformations over a sliding window of data.
A window operation requires two parameters:
● window length - The duration of the window.
● sliding interval - The interval at which the window operation is performed.
Window Operations
def window(windowDuration: Duration): DStream[T]
def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
def reduceByWindow(reduceFunc: (T, T) => T,
    windowDuration: Duration, slideDuration: Duration): DStream[T]
def countByWindow(windowDuration: Duration, slideDuration: Duration): DStream[Long]
def countByValueAndWindow(windowDuration: Duration,
    slideDuration: Duration, numPartitions: Int): DStream[(T, Long)]
// PairDStream operations
def groupByKeyAndWindow(windowDuration: Duration): DStream[(K, Iterable[V])]
def groupByKeyAndWindow(windowDuration: Duration,
    slideDuration: Duration): DStream[(K, Iterable[V])]
def reduceByKeyAndWindow(reduceFunc: (V, V) => V, windowDuration: Duration): DStream[(K, V)]
def reduceByKeyAndWindow(reduceFunc: (V, V) => V,
    windowDuration: Duration, slideDuration: Duration): DStream[(K, V)]
Window Operations
pairs.window(Seconds(15), Seconds(10))
filteredWords.reduceByWindow((a, b) => a + ", " + b, Seconds(15), Seconds(10))
pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(15), Seconds(10))
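Both the window length and the sliding interval must be multiples of the batch interval. With the 5 second batch interval used earlier, Seconds(15) and Seconds(10) correspond to a window of 3 batches that slides by 2 batches. The following plain Scala model (a simplified sketch, not Spark code; windowedCounts is a hypothetical name) shows which batches end up in each window:

```scala
object WindowSketch {
  // batches: one entry per batch interval, oldest first.
  // windowLen and slide are expressed in numbers of batches.
  def windowedCounts(batches: Seq[Seq[String]], windowLen: Int, slide: Int): Seq[Map[String, Int]] =
    (slide to batches.length by slide).map { end =>
      // A window covers the last windowLen batches up to and including batch number `end`.
      val window = batches.slice(math.max(0, end - windowLen), end).flatten
      window.groupBy(identity).map { case (w, ws) => w -> ws.size }
    }
}
```

Because the window (3 batches) is longer than the slide (2 batches), consecutive windows overlap by one batch, so a record can be counted in two results.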
Output Operations on DStreams
def print(num: Int): Unit
def saveAsObjectFiles(prefix: String, suffix: String = ""): Unit
def saveAsTextFiles(prefix: String, suffix: String = ""): Unit
def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
def saveAsHadoopFiles[F <: OutputFormat[K, V]](prefix: String, suffix: String): Unit
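Of these, foreachRDD is the most general, since it hands each batch's RDD to arbitrary code. A common pattern is to open one connection per partition on the executors. The sketch below assumes a hypothetical ConsoleSink class standing in for a real database or queue client; writeOut and format are illustrative names as well.

```scala
import org.apache.spark.streaming.dstream.DStream

object OutputSketch {
  // Hypothetical sink standing in for a database or message-queue client.
  final class ConsoleSink {
    def send(line: String): Unit = println(line)
    def close(): Unit = ()
  }

  def format(record: (String, Int)): String = s"${record._1}=${record._2}"

  // Push every batch of word counts to the sink, one sink instance per partition.
  def writeOut(counts: DStream[(String, Int)]): Unit =
    counts.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // Runs on an executor: the sink is created here rather than on the
        // driver, so it never has to be serialized.
        val sink = new ConsoleSink
        try records.foreach(r => sink.send(format(r)))
        finally sink.close()
      }
    }
}
```

Creating the sink once per record would be correct but wasteful, and creating it on the driver usually fails because connection objects are rarely serializable.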
Receivers
Spark Streaming has two kinds of receivers:
1) Reliable receiver - A reliable receiver sends an acknowledgment
to a reliable source once the data has been received and stored in Spark with
replication.
2) Unreliable receiver - An unreliable receiver does not send an acknowledgment
to the source.
Custom Receiver
A custom receiver must extend the abstract Receiver class and implement its
two abstract methods:
def onStart(): Unit // Things to do to start receiving data
def onStop(): Unit // Things to do to stop receiving data
Custom Receiver
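import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CustomReceiver(path: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // Must not block: start a thread that does the actual receiving.
  def onStart(): Unit = {
    new Thread("File Reader") {
      override def run(): Unit = receive()
    }.start()
  }

  // Nothing to do: the reading thread stops by itself once isStopped returns true.
  def onStop(): Unit = {}

  // Reads the file line by line and stores each line in Spark.
  private def receive(): Unit =
    try {
      println("Reading file " + path)
      val reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))
      try {
        var userInput = reader.readLine()
        while (!isStopped && Option(userInput).isDefined) {
          store(userInput)
          userInput = reader.readLine()
        }
      } finally {
        reader.close() // close the file even if store() or readLine() throws
      }
      println("Stopped receiving")
      // restart() calls onStop() and then onStart(), so the file is read again from the beginning
      restart("Trying to connect again")
    } catch {
      case ex: Exception =>
        restart("Error reading file " + path, ex)
    }
}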
Custom Receiver
object CustomReceiver extends App {
val sparkConf = new SparkConf().setAppName("CustomReceiver")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val lines = ssc.receiverStream(new CustomReceiver(args(0)))
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
Failure is everywhere
Fault-tolerance Semantics
A streaming system should guarantee zero data loss despite any kind of
failure in the system.
➢ At least once: each record will be processed one or more times.
➢ Exactly once: each record will be processed exactly once; no data will be lost and no
data will be processed multiple times.
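The difference between the two guarantees shows up at the output. The following plain Scala sketch (hypothetical classes, not Spark code) replays one batch, as happens after a failure under at-least-once delivery, and compares a sink that blindly appends with one whose writes are keyed by batch id:

```scala
import scala.collection.mutable

object DeliverySketch {
  final case class Batch(id: Long, records: Seq[Int])

  // Non-idempotent sink: a replayed batch is added a second time.
  final class AppendingSink {
    private var total = 0L
    def write(b: Batch): Unit = total += b.records.sum
    def result: Long = total
  }

  // Idempotent sink: writes are keyed by batch id, so a replay overwrites itself.
  final class KeyedSink {
    private val byBatch = mutable.Map.empty[Long, Long]
    def write(b: Batch): Unit = byBatch(b.id) = b.records.sum.toLong
    def result: Long = byBatch.values.sum
  }

  // Batch 2 is delivered twice, as when it is replayed after a failure.
  val delivered: Seq[Batch] = Seq(Batch(1, Seq(1, 2)), Batch(2, Seq(3)), Batch(2, Seq(3)))
}
```

At-least-once delivery combined with an idempotent (or transactional) sink gives an end result equivalent to exactly-once processing.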
Kinds of Failure
There are two kinds of failure:
➢ Executor failure
1) Data received and replicated
2) Data received but not replicated
➢ Driver failure
Executor failure
Would data be lost?
Executor with WAL
Enable write ahead logs
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object Streaming2App extends App {
  // Should be on a fault-tolerant, reliable file system (e.g. HDFS, S3, etc.)
  val checkpointDirectory = "checkpointDir"
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp")
  // Enable write ahead logs
  sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val streamingContext = new StreamingContext(sparkConf, Seconds(5))
  // Enable checkpointing (the write ahead log is stored under the checkpoint directory)
  streamingContext.checkpoint(checkpointDirectory)
  val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000)
  val words: DStream[String] = lines.flatMap(_.split(" "))
  val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
  val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
  val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
  wordCounts.print(20)
  streamingContext.start()
  streamingContext.awaitTermination()
}
Enable write ahead logs
1) Enable checkpointing first, since the WAL requires it
- streamingContext.checkpoint(checkpointDirectory)
2) Enable the WAL in the Spark configuration
- sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
3) The receiver should be reliable
- Acknowledge the source only after the data has been saved to the WAL
- Unacknowledged data will be replayed from the source by the restarted receiver
4) Disable in-memory replication (the data is already replicated by HDFS)
- Use StorageLevel.MEMORY_AND_DISK_SER for the input DStream
Driver failure
How to recover from this failure?
Driver with checkpointing
DStream checkpointing: periodically save the DAG of
DStreams to fault-tolerant storage.
Recover from Driver failure
1) Configure automatic driver restart
- All cluster managers support this
2) Set a checkpoint directory
- The directory should be on a fault-tolerant, reliable file system (e.g., HDFS, S3, etc.)
- streamingContext.checkpoint(checkpointDirectory)
3) Restart the driver from the checkpoint
Configure automatic driver restart
Spark Standalone
- Use spark-submit with "cluster" deploy mode and the "--supervise" flag
YARN
- Use spark-submit with "cluster" deploy mode
Mesos
- Marathon can restart applications, or use the "--supervise" flag
Configure Checkpointing
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object RecoverableWordCount {
  // Should be on a fault-tolerant, reliable file system (e.g. HDFS, S3, etc.)
  val checkpointDirectory = "checkpointDir"

  // Builds a fresh context, including all DStream operations
  def createContext(): StreamingContext = {
    val sparkConf = new SparkConf().setAppName("StreamingApp")
    val streamingContext = new StreamingContext(sparkConf, Seconds(1))
    streamingContext.checkpoint(checkpointDirectory)
    val lines = streamingContext.socketTextStream("localhost", 9000)
    val words: DStream[String] = lines.flatMap(_.split(" "))
    val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
    val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
    val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
    wordCounts.print(20)
    streamingContext
  }
}
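The driver should be restarted using checkpointing
import org.apache.spark.streaming.StreamingContext

object StreamingApp extends App {
  import RecoverableWordCount._
  // Recreate the context from checkpoint data if it exists; otherwise call createContext
  val streamingContext = StreamingContext.getOrCreate(checkpointDirectory, createContext _)
  // do other operations
  streamingContext.start()
  streamingContext.awaitTermination()
}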
Checkpointing
There are two types of data that are checkpointed.
1) Metadata checkpointing
-Configuration
-DStream operations
-Incomplete batches
2) Data checkpointing
- Saving of the generated RDDs to reliable storage. This is necessary in some stateful
transformations that combine data across multiple batches.
Checkpointing Latency
➔ Checkpointing of RDDs incurs the cost of saving to reliable storage, so the
checkpoint interval needs to be set carefully.
dstream.checkpoint(Seconds(batchIntervalSeconds * 10)) // e.g. 10 times the batch interval
➔ A checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
Spark Streaming & Kafka Integration
Why Kafka?
➢ Velocity & volume of streaming data
➢ Reprocessing of streaming
➢ Reliable receiver complexity
➢ Checkpoint complexity
➢ Upgrading Application Code
Kafka Integration
There are two approaches to integrate Kafka with Spark Streaming:
➢ Receiver-based Approach
➢ Direct Approach
Receiver-based Approach
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
Receiver-based Approach
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReceiverBasedStreaming extends App {
  val group = "streaming-test-group"
  val zkQuorum = "localhost:2181"
  // Topic name -> number of consumer threads in the receiver
  val topics = Map("streaming_queue" -> 1)
  val sparkConf = new SparkConf().setAppName("ReceiverBasedStreamingApp")
  sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  // A checkpoint directory is required when the write ahead log is enabled
  ssc.checkpoint("checkpointDir")
  val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topics)
    .map { case (key, message) => message }
  val words = lines.flatMap(_.split(" "))
  val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
  wordCounts.print()
  ssc.start()
  ssc.awaitTermination()
}
Direct Approach
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
Direct Approach
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka._
object KafkaDirectStreaming extends App {
val brokers = "localhost:9092"
val sparkConf = new SparkConf().setAppName("KafkaDirectStreaming")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpointDir") //offset recovery
val topics = Set("streaming_queue")
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages: InputDStream[(String, String)] =
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
val lines = messages.map { case (key, message) => message }
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
Direct Approach
Direct Approach has the following advantages over the receiver-based approach:
➢ Simplified Parallelism
➢ Efficiency
➢ Exactly-once semantics
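With the direct approach Spark Streaming tracks Kafka offsets itself, and each batch's RDD exposes the offset ranges it covers. A minimal sketch of reading them, assuming the same spark-streaming-kafka module as the example above (the object and method names OffsetSketch, describe and logOffsets are hypothetical):

```scala
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

object OffsetSketch {
  def describe(o: OffsetRange): String =
    s"${o.topic}-${o.partition}: [${o.fromOffset}, ${o.untilOffset})"

  // Log the Kafka offsets covered by each batch of a direct stream.
  def logOffsets(messages: InputDStream[(String, String)]): Unit =
    messages.foreachRDD { rdd =>
      // Only RDDs produced directly by createDirectStream carry offset ranges,
      // so the cast must happen before any transformation is applied.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach(o => println(describe(o)))
    }
}
```

Storing these offsets together with the results, in one transaction or as part of an idempotent write, is what makes end-to-end exactly-once output possible.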
Performance Tuning
For best performance of a Spark Streaming application we need to
consider two things:
➢ Reducing the Batch Processing Times
➢ Setting the Right Batch Interval
Reducing the Batch Processing Times
➢ Level of Parallelism in Data Receiving
➢ Level of Parallelism in Data Processing
➢ Data Serialization
-Input data
-Persisted RDDs generated by Streaming Operations
➢ Task Launching Overheads
-Running Spark in Standalone mode or coarse-grained Mesos mode leads
to better task launch times.
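The first two points, parallelism in receiving and in processing, can be sketched as below. This is an illustration under stated assumptions: parallelLines is a hypothetical helper, and the host and consecutive ports are placeholders for whatever sources are actually available.

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

object ParallelismSketch {
  // Hypothetical helper: open `numReceivers` socket receivers and merge them.
  def parallelLines(ssc: StreamingContext, host: String, basePort: Int,
                    numReceivers: Int, numPartitions: Int): DStream[String] = {
    // Parallel receiving: one input DStream (and hence one receiver) per port.
    val streams = (0 until numReceivers).map(i => ssc.socketTextStream(host, basePort + i))
    // union merges the receivers into one DStream; repartition spreads the
    // received data over more tasks before the heavier processing starts.
    ssc.union(streams).repartition(numPartitions)
  }
}
```

Each receiver occupies one core for as long as the application runs, so the application needs more cores than receivers, otherwise nothing is left for processing.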
Setting the Right Batch Interval
➢ Batch processing time should be less than the batch interval.
➢ Memory Tuning
-Persistence Level of DStreams
-Clearing old data
-CMS Garbage Collector
Code samples
https://github.com/knoldus/spark-streaming-meetup
https://github.com/knoldus/real-time-stream-processing-engine
https://github.com/knoldus/kafka-tweet-producer
Questions & DStream[Answer]
References
http://spark.apache.org/docs/latest/streaming-programming-guide.html
http://spark.apache.org/docs/latest/configuration.html#spark-streaming
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
http://spark.apache.org/docs/latest/tuning.html
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
Thanks
Presenters:
@_satendrakumar
Organizer:
@knolspeak
http://www.knoldus.com

More Related Content

What's hot

Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Simplilearn
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 

What's hot (20)

Apache Camel K - Copenhagen
Apache Camel K - CopenhagenApache Camel K - Copenhagen
Apache Camel K - Copenhagen
 
Best Practices for Middleware and Integration Architecture Modernization with...
Best Practices for Middleware and Integration Architecture Modernization with...Best Practices for Middleware and Integration Architecture Modernization with...
Best Practices for Middleware and Integration Architecture Modernization with...
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Introduction to Amazon Kinesis Analytics
Introduction to Amazon Kinesis AnalyticsIntroduction to Amazon Kinesis Analytics
Introduction to Amazon Kinesis Analytics
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
Elastic stack Presentation
Elastic stack PresentationElastic stack Presentation
Elastic stack Presentation
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
 
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
6.hive
6.hive6.hive
6.hive
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
 
Splunk Dashboarding & Universal Vs. Heavy Forwarders
Splunk Dashboarding & Universal Vs. Heavy ForwardersSplunk Dashboarding & Universal Vs. Heavy Forwarders
Splunk Dashboarding & Universal Vs. Heavy Forwarders
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Why Splunk Chose Pulsar_Karthik Ramasamy
Why Splunk Chose Pulsar_Karthik RamasamyWhy Splunk Chose Pulsar_Karthik Ramasamy
Why Splunk Chose Pulsar_Karthik Ramasamy
 

Viewers also liked

Viewers also liked (20)

Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1
 
A Step to programming with Apache Spark
A Step to programming with Apache SparkA Step to programming with Apache Spark
A Step to programming with Apache Spark
 
Spark Streaming, Machine Learning and meetup.com streaming API.
Spark Streaming, Machine Learning and  meetup.com streaming API.Spark Streaming, Machine Learning and  meetup.com streaming API.
Spark Streaming, Machine Learning and meetup.com streaming API.
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Introduction to Apache Kafka- Part 2
Introduction to Apache Kafka- Part 2Introduction to Apache Kafka- Part 2
Introduction to Apache Kafka- Part 2
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Effective way to code in Scala
Effective way to code in ScalaEffective way to code in Scala
Effective way to code in Scala
 
Introduction to Shield and kibana
Introduction to Shield and kibanaIntroduction to Shield and kibana
Introduction to Shield and kibana
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
 
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
 
Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connect
 
Akka Finite State Machine
Akka Finite State MachineAkka Finite State Machine
Akka Finite State Machine
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Introduction to AWS IAM
Introduction to AWS IAMIntroduction to AWS IAM
Introduction to AWS IAM
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
 
Dancing with Stream Processing
Dancing with Stream ProcessingDancing with Stream Processing
Dancing with Stream Processing
 
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
 
Build and Deploy a Python Web App to Amazon in 30 Mins
Build and Deploy a Python Web App to Amazon in 30 MinsBuild and Deploy a Python Web App to Amazon in 30 Mins
Build and Deploy a Python Web App to Amazon in 30 Mins
 

Meet Up - Spark Stream Processing + Kafka

  • 1. Satendra Kumar Sr. Software Consultant Knoldus Software LLP Stream Processing
  • 2. Topics Covered ➢ What is Stream ➢ What is Stream processing ➢ The challenges of stream processing ➢ Overview of Spark Streaming ➢ Receivers ➢ Custom receivers ➢ Transformations on DStreams ➢ Failures ➢ Fault-tolerance Semantics ➢ Kafka Integration ➢ Performance Tuning
  • 3. What is Stream A stream is a sequence of data elements made available over time that can be accessed in sequential order. E.g., YouTube video buffering.
  • 4. What is Stream processing Stream processing is the real-time processing of data continuously, concurrently, and in a record-by-record fashion. It treats data not as static tables or files, but as a continuous infinite stream of data integrated from both live and historical sources.
  • 5. The challenges of stream processing ➢ Partitioning & Scalability ➢ Semantics & Fault tolerance ➢ Unifying the streams ➢ Time ➢ Re-Processing
  • 6. Spark Streaming ➢ Provides a way to process live data streams. ➢ Scalable, high-throughput, fault-tolerant. ➢ Built on top of the core Spark API. ➢ API is very similar to the Spark core API. ➢ Supports many sources such as Kafka, Flume, Kinesis, or TCP sockets. ➢ Currently based on RDDs.
  • 11. Discretized Streams ➢ Spark Streaming provides a high-level abstraction called a discretized stream or DStream, which represents a continuous stream of data. ➢ DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. ➢ A DStream is represented as a sequence of RDDs.
  • 20-26. Driver Program (slides 20 to 26 show the same program and label one part at a time; the labels appear as comments)
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

      object StreamingApp extends App {
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp")

        // Streaming Context, with a Batch Interval of 5 seconds
        val streamingContext = new StreamingContext(sparkConf, Seconds(5))

        // Receiver
        val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000)

        // Transformations on DStreams
        val words: DStream[String] = lines.flatMap(_.split(" "))
        val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
        val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
        val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)

        // Output Operations on DStreams
        wordCounts.print()

        // Start the Streaming
        streamingContext.start()
        streamingContext.awaitTermination()
      }
  • 27. Important Points ➢ Once a context has been started, no new streaming computations can be set up or added to it. ➢ Once a context has been stopped, it cannot be restarted. ➢ Only one StreamingContext can be active in a JVM at the same time. ➢ stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false. ➢ A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.
  • 28. Spark Streaming Concept ➢ Spark Streaming is based on a micro-batch architecture. ➢ Spark Streaming continuously receives live input data streams and divides the data into batches. ➢ New batches are created at regular time intervals called the batch interval. ➢ Each batch has N blocks, where N = batch interval / block interval. E.g., if the batch interval is 1 second and the block interval is 200 ms (the default), each batch has 5 blocks.
  • 33. Transforming DStream ➢ A DStream is represented by a continuous series of RDDs. ➢ Each RDD in a DStream contains data from a certain interval. ➢ Any operation applied on a DStream translates to operations on the underlying RDDs. ➢ The processing time of a batch should be less than or equal to the batch interval.
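The block arithmetic on this slide can be checked with a few lines of plain Scala. This is only a sketch: `BlocksPerBatch` and `blocksPerBatch` are hypothetical names introduced here, and the 200 ms figure is the default block interval quoted on the slide (configurable through `spark.streaming.blockInterval`).

```scala
object BlocksPerBatch {
  // Number of blocks a receiver produces per batch: batch interval / block interval.
  def blocksPerBatch(batchIntervalMs: Long, blockIntervalMs: Long): Long = {
    require(blockIntervalMs > 0, "block interval must be positive")
    batchIntervalMs / blockIntervalMs
  }

  def main(args: Array[String]): Unit = {
    // 1 s batches with the 200 ms block interval from the slide.
    println(blocksPerBatch(1000L, 200L))
    // 5 s batches (as in StreamingApp) with the same block interval.
    println(blocksPerBatch(5000L, 200L))
  }
}
```

Since the blocks of a batch become the partitions of that batch's RDD, this number also gives a rough idea of how many tasks each receiver contributes per batch.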
  • 34. Transformations on DStreams
      def map[U: ClassTag](mapFunc: T => U): DStream[U]
      def flatMap[U: ClassTag](flatMapFunc: T => TraversableOnce[U]): DStream[U]
      def filter(filterFunc: T => Boolean): DStream[T]
      def reduce(reduceFunc: (T, T) => T): DStream[T]
      def count(): DStream[Long]
      def repartition(numPartitions: Int): DStream[T]
      def countByValue(numPartitions: Int = ssc.sc.defaultParallelism): DStream[(T, Long)]
      def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U]
  • 35. Transformations on PairDStream
      def groupByKey(): DStream[(K, Iterable[V])]
      def reduceByKey(reduceFunc: (V, V) => V, numPartitions: Int): DStream[(K, V)]
      def join[W: ClassTag](other: DStream[(K, W)]): DStream[(K, (V, W))]
      def updateStateByKey[S: ClassTag](updateFunc: (Seq[V], Option[S]) => Option[S], partitioner: Partitioner): DStream[(K, S)]
      def cogroup[W: ClassTag](other: DStream[(K, W)], numPartitions: Int): DStream[(K, (Iterable[V], Iterable[W]))]
      def mapValues[U: ClassTag](mapValuesFunc: V => U): DStream[(K, U)]
      def leftOuterJoin[W: ClassTag](other: DStream[(K, W)], numPartitions: Int): DStream[(K, (V, Option[W]))]
      def rightOuterJoin[W: ClassTag](other: DStream[(K, W)], numPartitions: Int): DStream[(K, (Option[V], W))]
  • 36. updateStateByKey
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.dstream.DStream

      object StreamingApp extends App {
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp")
        val streamingContext = new StreamingContext(sparkConf, Seconds(5))
        // Stateful transformations require a checkpoint directory.
        streamingContext.checkpoint(".")

        val lines = streamingContext.socketTextStream("localhost", 9000)
        val words: DStream[String] = lines.flatMap(_.split(" "))
        val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
        val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))

        // Running count per word: add this batch's values to the previous state.
        val updatedState: DStream[(String, Int)] = pairs.updateStateByKey[Int] {
          (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0))
        }
        updatedState.print()

        streamingContext.start()
        streamingContext.awaitTermination()
      }
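To see what the update function does from one batch to the next, the sketch below replays it over two small batches in plain Scala, without Spark. `RunningCountSketch`, `updateCount` and `applyBatch` are hypothetical names used only for this illustration; `applyBatch` is a simplified stand-in for the per-key bookkeeping that `updateStateByKey` performs, not Spark's implementation.

```scala
object RunningCountSketch {
  // Same shape as the function passed to updateStateByKey on the slide.
  def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(newValues.sum + state.getOrElse(0))

  // Apply the update function to one batch of (word, 1) pairs.
  // Keys with existing state are updated even when the batch has no new values for them.
  def applyBatch(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (key, pairs) => key -> pairs.map(_._2) }
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { key =>
      // Returning None from the update function would drop the key from the state.
      updateCount(grouped.getOrElse(key, Seq.empty), state.get(key)).map(key -> _)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val batch1 = Seq("a" -> 1, "b" -> 1, "a" -> 1)
    val batch2 = Seq("b" -> 1, "c" -> 1)
    val afterBatch1 = applyBatch(Map.empty, batch1) // counts: a = 2, b = 1
    val afterBatch2 = applyBatch(afterBatch1, batch2) // counts: a = 2, b = 2, c = 1
    println(afterBatch1)
    println(afterBatch2)
  }
}
```

The state carried between batches is the reason the slide's program sets a checkpoint directory before using `updateStateByKey`.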
  • 37. Window Operations Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data. A window operation requires two parameters: ● window length - the duration of the window. ● sliding interval - the interval at which the window operation is performed.
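The two parameters are easier to picture with concrete numbers. The following plain-Scala sketch (no Spark; `SlidingWindowSketch` and `batchesInWindow` are hypothetical names) lists which 5-second batches fall into each window for a window length of 15 seconds and a sliding interval of 10 seconds. It assumes, as Spark requires, that both values are multiples of the batch interval.

```scala
object SlidingWindowSketch {
  // Batch end times (in seconds) that fall inside the window ending at windowEnd.
  def batchesInWindow(windowEnd: Int, windowLength: Int, batchInterval: Int): Seq[Int] =
    ((windowEnd - windowLength + batchInterval) to windowEnd by batchInterval).filter(_ > 0)

  def main(args: Array[String]): Unit = {
    val batchInterval = 5 // seconds, as in StreamingApp
    val windowLength = 15
    val slideInterval = 10
    require(windowLength % batchInterval == 0 && slideInterval % batchInterval == 0,
      "window length and sliding interval must be multiples of the batch interval")

    // A new window is evaluated every slideInterval seconds.
    for (windowEnd <- slideInterval to 40 by slideInterval) {
      val ends = batchesInWindow(windowEnd, windowLength, batchInterval).mkString(", ")
      println(s"window ending at $windowEnd s covers batches ending at: $ends")
    }
  }
}
```

Because the window is longer than the slide, consecutive windows overlap: the batch ending at 10 s is part of both the window ending at 10 s and the one ending at 20 s.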
  • 38. Window Operations
      def window(windowDuration: Duration): DStream[T]
      def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
      def reduceByWindow(reduceFunc: (T, T) => T, windowDuration: Duration, slideDuration: Duration): DStream[T]
      def countByWindow(windowDuration: Duration, slideDuration: Duration): DStream[Long]
      def countByValueAndWindow(windowDuration: Duration, slideDuration: Duration, numPartitions: Int): DStream[(T, Long)]
      // pairDStream operations
      def groupByKeyAndWindow(windowDuration: Duration): DStream[(K, Iterable[V])]
      def groupByKeyAndWindow(windowDuration: Duration, slideDuration: Duration): DStream[(K, Iterable[V])]
      def reduceByKeyAndWindow(reduceFunc: (V, V) => V, windowDuration: Duration): DStream[(K, V)]
      def reduceByKeyAndWindow(reduceFunc: (V, V) => V, windowDuration: Duration, slideDuration: Duration): DStream[(K, V)]
  • 39. Window Operations
      pairs.window(Seconds(15), Seconds(10))
      filteredWords.reduceByWindow((a, b) => a + ", " + b, Seconds(15), Seconds(10))
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(15), Seconds(10))
  • 40. Output Operations on DStreams
      def print(num: Int): Unit
      def saveAsObjectFiles(prefix: String, suffix: String = ""): Unit
      def saveAsTextFiles(prefix: String, suffix: String = ""): Unit
      def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
      def saveAsHadoopFiles[F <: OutputFormat[K, V]](prefix: String, suffix: String): Unit
  • 41. Receivers Spark Streaming has two kinds of receivers: 1) Reliable Receiver - A reliable receiver sends an acknowledgment to a reliable source once the data has been received and stored in Spark with replication. 2) Unreliable Receiver - An unreliable receiver does not send an acknowledgment to the source.
  • 42. Custom Receiver A custom receiver must extend the abstract Receiver class and implement its two abstract methods:
      def onStart(): Unit // Things to do to start receiving data
      def onStop(): Unit  // Things to do to stop receiving data
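  • 43. Custom Receiver
      import java.io.{BufferedReader, FileInputStream, InputStreamReader}
      import java.nio.charset.StandardCharsets
      import org.apache.spark.storage.StorageLevel
      import org.apache.spark.streaming.receiver.Receiver

      class CustomReceiver(path: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

        // Start a thread that reads the file; onStart() itself must not block.
        def onStart(): Unit = {
          new Thread("File Reader") {
            override def run(): Unit = receive()
          }.start()
        }

        // Nothing to do: the reader thread exits once isStopped returns true.
        def onStop(): Unit = {}

        // Read the file line by line and store each line in Spark.
        private def receive(): Unit =
          try {
            println("Reading file " + path)
            val reader = new BufferedReader(
              new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))
            var userInput = reader.readLine()
            while (!isStopped && Option(userInput).isDefined) {
              store(userInput)
              userInput = reader.readLine()
            }
            reader.close()
            println("Stopped receiving")
            // Restarting the receiver reads the file again from the beginning.
            restart("Trying to connect again")
          } catch {
            case ex: Exception => restart("Error reading file " + path, ex)
          }
      }
  • 44. Custom Receiver
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object CustomReceiver extends App {
        val sparkConf = new SparkConf().setAppName("CustomReceiver")
        val ssc = new StreamingContext(sparkConf, Seconds(1))

        // args(0) is the path of the file to read.
        val lines = ssc.receiverStream(new CustomReceiver(args(0)))
        val words = lines.flatMap(_.split(" "))
        val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
        wordCounts.print()

        ssc.start()
        ssc.awaitTermination()
      }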
  • 46. Fault-tolerance Semantics The streaming system provides zero-data-loss guarantees despite any kind of failure in the system. ➢ At least once - Each record will be processed one or more times. ➢ Exactly once - Each record will be processed exactly once: no data will be lost and no data will be processed multiple times.
  • 47. Kinds of Failure There are two kinds of failure: ➢ Executor failure 1) Data received and replicated 2) Data received but not replicated ➢ Driver failure
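The practical difference between the two guarantees shows up when records are replayed after a failure. The plain-Scala sketch below (no Spark) illustrates one common way to get exactly-once results on top of at-least-once delivery, namely an idempotent sink keyed by record id. All names here (`Record`, `AppendingSink`, `IdempotentSink`) are hypothetical, and the replay is simulated by hand.

```scala
import scala.collection.mutable

object DeliverySemanticsSketch {
  final case class Record(id: Long, amount: Int)

  // Non-idempotent sink: every delivery is added to the total, so replays are double counted.
  final class AppendingSink {
    private var total = 0
    def write(record: Record): Unit = total += record.amount
    def result: Int = total
  }

  // Idempotent sink: a record id is applied at most once, so replays have no effect.
  final class IdempotentSink {
    private val seen = mutable.Set.empty[Long]
    private var total = 0
    def write(record: Record): Unit = if (seen.add(record.id)) total += record.amount
    def result: Int = total
  }

  def main(args: Array[String]): Unit = {
    val records = Seq(Record(1, 10), Record(2, 20), Record(3, 30))
    // At-least-once delivery: after a simulated failure, records 2 and 3 are delivered again.
    val delivered = records ++ records.drop(1)

    val appending = new AppendingSink
    val idempotent = new IdempotentSink
    delivered.foreach { record =>
      appending.write(record)
      idempotent.write(record)
    }

    println(s"appending sink total:  ${appending.result}")  // replayed records counted twice
    println(s"idempotent sink total: ${idempotent.result}") // each record counted once
  }
}
```

With the replay, the appending sink ends at 110 while the idempotent sink ends at 60, which is the sum of the three distinct records.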
  • 56-57. Enable write ahead logs (slides 56 and 57 show the same program and label one part at a time; the labels appear as comments)
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

      object Streaming2App extends App {
        // Should be on a fault-tolerant, reliable file system (e.g. HDFS, S3, etc.)
        val checkpointDirectory = "checkpointDir"
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp")

        // Enable write ahead logs
        sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
        val streamingContext = new StreamingContext(sparkConf, Seconds(5))

        // Enable checkpointing
        streamingContext.checkpoint(checkpointDirectory)

        val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000)
        val words: DStream[String] = lines.flatMap(_.split(" "))
        val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
        val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
        val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
        wordCounts.print(20)

        streamingContext.start()
        streamingContext.awaitTermination()
      }
  • 58. Enable write ahead logs 1) Enable checkpointing first, which the WAL requires - streamingContext.checkpoint(checkpointDirectory) 2) Enable the WAL in the Spark configuration - sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true") 3) The receiver should be reliable - Acknowledge the source only after the data has been saved to the WAL - Unacknowledged data will be replayed from the source by the restarted receiver 4) Disable in-memory replication (the data is already replicated by HDFS) - Use StorageLevel.MEMORY_AND_DISK_SER for the input DStream
  • 63. Driver failure How do we recover from this failure?
  • 64. Driver with checkpointing DStream checkpointing: periodically save the DAG of DStreams to fault-tolerant storage.
  • 68. Recover from Driver failure 1) Configure automatic driver restart - All cluster managers support this 2) Set a checkpoint directory - The directory should be on a fault-tolerant, reliable file system (e.g., HDFS, S3, etc.) - streamingContext.checkpoint(checkpointDirectory) 3) The driver should be restarted using checkpointing
  • 69. Configure Automatic driver restart Spark Standalone - use spark-submit with "cluster" mode and the "--supervise" flag YARN - use spark-submit with "cluster" mode Mesos - Marathon can restart applications, or use the "--supervise" flag
  • 70. Configure Checkpointing
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.dstream.DStream

      object RecoverableWordCount {
        // Should be on a fault-tolerant, reliable file system (e.g. HDFS, S3, etc.)
        val checkpointDirectory = "checkpointDir"

        // Builds a new context; called only when no checkpoint exists yet.
        def createContext(): StreamingContext = {
          val sparkConf = new SparkConf().setAppName("StreamingApp")
          val streamingContext = new StreamingContext(sparkConf, Seconds(1))
          streamingContext.checkpoint(checkpointDirectory)

          val lines = streamingContext.socketTextStream("localhost", 9000)
          val words: DStream[String] = lines.flatMap(_.split(" "))
          val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
          val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
          val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
          wordCounts.print(20)
          streamingContext
        }
      }
  • 71. Driver should be restart using checkpointing object StreamingApp extends App { import RecoverableWordCount._ val streamingContext = StreamingContext.getOrCreate(checkpointDirectory, createContext _) //do other operations streamingContext.start() streamingContext.awaitTermination() }
  • 73. Checkpointing
      There are two types of data that are checkpointed.
      1) Metadata checkpointing
         - Configuration
         - DStream operations
         - Incomplete batches
      2) Data checkpointing
         - Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches.
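A running count per key is a typical stateful transformation that needs data checkpointing, because each batch's result depends on all earlier batches. The sketch below shows only the state update function in plain Scala; `RunningCount` and `updateCount` are names invented for this illustration. With Spark on the classpath and a checkpoint directory set, it could be passed to `updateStateByKey` as noted in the trailing comment.

```scala
// Hypothetical helper, named for this illustration only.
object RunningCount {
  // Merges the counts seen for one key in the current batch with the
  // running total carried over from earlier batches.
  def updateCount(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
    Some(runningCount.getOrElse(0) + newValues.sum)
}

// Assumed usage on a DStream[(String, Int)] named pairs:
//   val totals = pairs.updateStateByKey[Int](RunningCount.updateCount _)
```

Without periodic checkpointing of the state RDDs, the lineage of `totals` would grow with every batch.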
  • 74. Checkpointing Latency
      ➔ Checkpointing of RDDs incurs the cost of saving to reliable storage, so the checkpoint interval needs to be set carefully, for example ten times the batch interval: dstream.checkpoint(Seconds(batchIntervalSeconds * 10))
      ➔ A checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
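The rule of thumb above is simple arithmetic, sketched here as a small helper. `CheckpointInterval` and `suggest` are hypothetical names, and the 5 to 10 range is the slide's starting point, not a hard limit enforced by Spark.

```scala
// Hypothetical helper, named for this illustration only.
object CheckpointInterval {
  // Returns a candidate checkpoint interval in milliseconds:
  // `factor` sliding intervals, with the factor kept in the suggested 5-10 range.
  def suggest(slideIntervalMs: Long, factor: Int = 10): Long = {
    require(slideIntervalMs > 0, "slide interval must be positive")
    require(factor >= 5 && factor <= 10, "factor outside the suggested 5-10 range")
    slideIntervalMs * factor
  }
}

// Assumed usage with Spark on the classpath:
//   dstream.checkpoint(Milliseconds(CheckpointInterval.suggest(1000)))
```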
  • 81. Spark Streaming & Kafka Integration
  • 82. Why Kafka? ➢ Velocity & volume of streaming data ➢ Reprocessing of streaming data ➢ Reliable receiver complexity ➢ Checkpoint complexity ➢ Upgrading application code
  • 83. Kafka Integration There are two approaches to integrate Kafka with Spark Streaming: ➢ Receiver-based Approach ➢ Direct Approach
  • 85. Receiver-based Approach
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.kafka.KafkaUtils
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object ReceiverBasedStreaming extends App {
        val group = "streaming-test-group"
        val zkQuorum = "localhost:2181"
        // Topic name -> number of consumer threads in the receiver.
        val topics = Map("streaming_queue" -> 1)

        val sparkConf = new SparkConf().setAppName("ReceiverBasedStreamingApp")
        sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
        val ssc = new StreamingContext(sparkConf, Seconds(2))
        // The WAL requires a checkpoint directory (see "Enable write ahead logs").
        ssc.checkpoint("checkpointDir")

        val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topics)
          .map { case (key, message) => message }
        val words = lines.flatMap(_.split(" "))
        val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
        wordCounts.print()

        ssc.start()
        ssc.awaitTermination()
      }
  • 87. Direct Approach
      import kafka.serializer.StringDecoder
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming._
      import org.apache.spark.streaming.dstream.InputDStream
      import org.apache.spark.streaming.kafka._

      object KafkaDirectStreaming extends App {
        val brokers = "localhost:9092"
        val sparkConf = new SparkConf().setAppName("KafkaDirectStreaming")
        val ssc = new StreamingContext(sparkConf, Seconds(2))
        ssc.checkpoint("checkpointDir") // offset recovery

        val topics = Set("streaming_queue")
        val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
        val messages: InputDStream[(String, String)] =
          KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

        val lines = messages.map { case (key, message) => message }
        val words = lines.flatMap(_.split(" "))
        val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
        wordCounts.print()

        ssc.start()
        ssc.awaitTermination()
      }
  • 88. Direct Approach Direct Approach has the following advantages over the receiver-based approach: ➢ Simplified Parallelism ➢ Efficiency ➢ Exactly-once semantics
  • 89. Performance Tuning For best performance of a Spark Streaming application we need to consider two things: ➢ Reducing the Batch Processing Times ➢ Setting the Right Batch Interval
  • 90. Reducing the Batch Processing Times
      ➢ Level of Parallelism in Data Receiving
      ➢ Level of Parallelism in Data Processing
      ➢ Data Serialization
         - Input data
         - Persisted RDDs generated by Streaming Operations
      ➢ Task Launching Overheads
         - Running Spark in Standalone mode or coarse-grained Mesos mode leads to better task launch times.
  • 91. Setting the Right Batch Interval
      ➢ Batch processing time should be less than the batch interval.
      ➢ Memory Tuning
         - Persistence Level of DStreams
         - Clearing old data
         - CMS Garbage Collector
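The requirement that processing time stay below the batch interval can be illustrated with a small simulation. It is a simplified model, assuming every batch takes exactly the same time to process and batches run one after another; `BatchBacklog` and `schedulingDelays` are hypothetical names.

```scala
// Hypothetical helper, named for this illustration only.
object BatchBacklog {
  // Scheduling delay (ms) before each batch starts, assuming every batch
  // takes exactly `processingMs` and a new batch arrives every `intervalMs`.
  def schedulingDelays(intervalMs: Long, processingMs: Long, batches: Int): List[Long] =
    Iterator
      .iterate(0L)(delay => math.max(0L, delay + processingMs - intervalMs))
      .take(batches)
      .toList
}

// Processing slower than the interval: the delay grows by 500 ms per batch.
// BatchBacklog.schedulingDelays(2000L, 2500L, 4) == List(0L, 500L, 1000L, 1500L)
// Processing faster than the interval: the delay stays at zero.
// BatchBacklog.schedulingDelays(2000L, 1500L, 4) == List(0L, 0L, 0L, 0L)
```

In the first case the application never catches up, which is why the batch interval has to be chosen so that processing keeps pace with the incoming data.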

Editor's Notes

  1. Data can be ingested from many sources like Kafka, Flume, and Kinesis. Data can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Processed data can be pushed out to filesystems, databases, and live dashboards.
  2. Unreliable - This can be used for sources that do not support acknowledgment, or even for reliable sources when one does not want or need to go into the complexity of acknowledgment.
  3. Spark Streaming can receive streaming data from any arbitrary data source beyond the ones for which it has built-in support (that is, beyond Flume, Kafka, Kinesis, files, sockets, etc.). This requires the developer to implement a receiver that is customized for receiving data from the concerned data source. This guide walks through the process of implementing a custom receiver and using it in a Spark Streaming application. Note that custom receivers can be implemented in Scala or Java.
  6. At most once: Each record will be either processed once or not processed at all. At least once: Each record will be processed one or more times. This is stronger than at-most-once as it ensures that no data will be lost, but there may be duplicates. Exactly once: Each record will be processed exactly once: no data will be lost and no data will be processed multiple times. This is the strongest guarantee of the three.
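The difference between these guarantees can be seen in a toy model with a single failure. This is a plain Scala sketch, not Spark code; `DeliverySemantics`, `Record` and the function names are invented for illustration, and the "crash" is modeled simply as a dropped or repeated record.

```scala
// Hypothetical model, named for this illustration only.
object DeliverySemantics {
  final case class Record(id: Int, value: String)

  // At-least-once: the record with id `crashAfterId` is processed, the worker
  // crashes before acknowledging it, and the source replays it.
  def deliverAtLeastOnce(records: List[Record], crashAfterId: Int): List[Record] =
    records.flatMap(r => if (r.id == crashAfterId) List(r, r) else List(r))

  // At-most-once: the record is acknowledged before processing,
  // so the crash loses it and it is never replayed.
  def deliverAtMostOnce(records: List[Record], crashOnId: Int): List[Record] =
    records.filterNot(_.id == crashOnId)

  // A non-idempotent sink appends every delivery, so replays become duplicates.
  def appendSink(deliveries: List[Record]): List[String] =
    deliveries.map(_.value)

  // An idempotent sink keyed by record id: writing the same record twice leaves
  // one entry, turning at-least-once delivery into an exactly-once result.
  def idempotentSink(deliveries: List[Record]): Map[Int, String] =
    deliveries.map(r => r.id -> r.value).toMap
}
```

With records 1, 2, 3 and a crash on record 2, the append sink sees record 2 twice under at-least-once delivery, while the idempotent sink ends up with exactly one entry per record.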
  7. Note that checkpointing of RDDs incurs the cost of saving to reliable storage. This may cause an increase in the processing time of those batches where RDDs get checkpointed. Hence, the interval of checkpointing needs to be set carefully. At small batch sizes (say 1 second), checkpointing every batch may significantly reduce operation throughput. Conversely, checkpointing too infrequently causes the lineage and task sizes to grow, which may have detrimental effects. For stateful transformations that require RDD checkpointing, the default interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
  10. Simplified Parallelism: No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune. Efficiency: Achieving zero data loss in the first (receiver-based) approach required the data to be stored in a Write Ahead Log, which further replicated the data. This is inefficient as the data effectively gets replicated twice: once by Kafka, and a second time by the Write Ahead Log. The direct approach eliminates the problem as there is no receiver, and hence no need for Write Ahead Logs. As long as you have sufficient Kafka retention, messages can be recovered from Kafka. Exactly-once semantics: The first approach uses Kafka's high-level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least-once semantics), there is a small chance some records may get consumed twice under some failures. This occurs because of inconsistencies between data reliably received by Spark Streaming and offsets tracked by Zookeeper. Hence, the direct approach uses the simple Kafka API, which does not use Zookeeper. Offsets are tracked by Spark Streaming within its checkpoints. This eliminates inconsistencies between Spark Streaming and Zookeeper/Kafka, so each record is received by Spark Streaming effectively exactly once despite failures. To achieve exactly-once semantics for the output of your results, the output operation that saves the data to an external data store must be either idempotent, or an atomic transaction that saves results and offsets (see Semantics of output operations in the main programming guide for further information).
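The "atomic transaction that saves results and offsets" option can be sketched with an in-memory stand-in for the external store. Everything here is illustrative: `TransactionalSink`, `Snapshot` and `commit` are invented names, offsets are assumed to start at 0, and a real implementation would use a database transaction instead of an immutable case class.

```scala
// Hypothetical in-memory stand-in for an external transactional store.
object TransactionalSink {
  // Word counts plus the next offset to consume per Kafka partition.
  // Both fields are always replaced together, which models one atomic commit.
  final case class Snapshot(counts: Map[String, Long], nextOffset: Map[Int, Long])

  val empty: Snapshot = Snapshot(Map.empty, Map.empty)

  // Applies the batch for one partition covering offsets [fromOffset, untilOffset).
  // A batch that does not start at the stored offset is a replay (or a gap) and
  // is ignored, so re-running a batch after a failure cannot double count.
  def commit(s: Snapshot, partition: Int, fromOffset: Long, untilOffset: Long,
             words: Seq[String]): Snapshot =
    if (s.nextOffset.getOrElse(partition, 0L) != fromOffset) s
    else {
      val updated = words.foldLeft(s.counts) { (acc, w) =>
        acc.updated(w, acc.getOrElse(w, 0L) + 1L)
      }
      Snapshot(updated, s.nextOffset.updated(partition, untilOffset))
    }
}
```

In the direct approach, the offset range of each partition in a batch is available by casting the batch RDD to `HasOffsetRanges`, which provides the `fromOffset` and `untilOffset` values used above.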
  11. Level of Parallelism in Data Receiving: 1) Create multiple receivers, which results in multiple DStreams. These DStreams can be unioned together to create a single DStream, and the transformations that were being applied to a single input DStream can then be applied to the unified stream. For example, with Kafka, use one receiver per topic. 2) Another parameter to consider is the receiver's block interval. For most receivers, the received data is coalesced into blocks before being stored in Spark's memory, and the number of tasks per receiver per batch is approximately (batch interval / block interval). Alternatively, the input stream can be explicitly repartitioned using inputStream.repartition(<number of partitions>). Level of Parallelism in Data Processing: Cluster resources can be under-utilized if the number of parallel tasks used in any stage of the computation is not high enough. For example, for distributed reduce operations like reduceByKey and reduceByKeyAndWindow, the default number of parallel tasks is controlled by the spark.default.parallelism configuration property. You can pass the level of parallelism as an argument (see PairDStreamFunctions documentation), or set the spark.default.parallelism configuration property to change the default. Input data: By default, the input data received through receivers is stored in the executors' memory with StorageLevel.MEMORY_AND_DISK_SER_2. That is, the data is serialized into bytes to reduce GC overheads, and replicated for tolerating executor failures. The data is kept in memory first, and spilled over to disk only if the memory is insufficient to hold all of the input data necessary for the streaming computation. This serialization has overheads: the receiver must deserialize the received data and re-serialize it using Spark's serialization format. Persisted RDDs generated by Streaming Operations: RDDs generated by streaming computations may be persisted in memory. For example, window operations persist data in memory as they would be processed multiple times. However, unlike the Spark Core default of StorageLevel.MEMORY_ONLY, persisted RDDs generated by streaming computations are persisted with StorageLevel.MEMORY_ONLY_SER (i.e. serialized) by default to minimize GC overheads. In both cases, using Kryo serialization can reduce both CPU and memory overheads. See the Spark Tuning Guide for more details.
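The (batch interval / block interval) estimate from the receiving section is worked through below. The helper names are hypothetical; the block interval itself is controlled by the `spark.streaming.blockInterval` setting.

```scala
// Hypothetical helper, named for this illustration only.
object ReceiverParallelism {
  // Approximate number of blocks, and therefore tasks, per receiver per batch.
  def tasksPerReceiver(batchIntervalMs: Long, blockIntervalMs: Long): Long = {
    require(blockIntervalMs > 0, "block interval must be positive")
    batchIntervalMs / blockIntervalMs
  }

  // Total receiving-side tasks per batch when several receivers are unioned.
  def totalTasks(receivers: Int, batchIntervalMs: Long, blockIntervalMs: Long): Long =
    receivers * tasksPerReceiver(batchIntervalMs, blockIntervalMs)
}

// A 2 s batch with a 200 ms block interval gives about 10 tasks per receiver,
// and about 30 tasks per batch across three unioned receivers.
```

If the resulting number is lower than the number of cores available, lowering the block interval or repartitioning the input stream are the two levers the note describes.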
  12. A good approach to figure out the right batch size for your application is to test it with a conservative batch interval (say, 5-10 seconds) and a low data rate. Persistence Level of DStreams: As mentioned earlier in the Data Serialization section, the input data and RDDs are by default persisted as serialized bytes. This reduces both the memory usage and GC overheads, compared to deserialized persistence. Enabling Kryo serialization further reduces serialized sizes and memory usage. Further reduction in memory usage can be achieved with compression (see the Spark configuration spark.rdd.compress), at the cost of CPU time. Clearing old data: By default, all input data and persisted RDDs generated by DStream transformations are automatically cleared. Spark Streaming decides when to clear the data based on the transformations that are used. For example, if you are using a window operation of 10 minutes, then Spark Streaming will keep around the last 10 minutes of data, and actively throw away older data. Data can be retained for a longer duration (e.g. interactively querying older data) by setting streamingContext.remember. CMS Garbage Collector: Use of the concurrent mark-and-sweep GC is strongly recommended for keeping GC-related pauses consistently low. Even though concurrent GC is known to reduce the overall processing throughput of the system, its use is still recommended to achieve more consistent batch processing times. Make sure you set the CMS GC on both the driver (using --driver-java-options in spark-submit) and the executors (using Spark configuration spark.executor.extraJavaOptions).
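The memory tuning settings named in this note can be collected in one configuration sketch. The object name, app name, batch interval and the 30 minute retention are placeholders chosen for illustration, and the CMS flag only applies to JVM versions that still ship that collector. The driver-side GC option is passed on the spark-submit command line instead, as the note says.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

// Hypothetical object name, used only for this illustration.
object TunedStreamingApp {
  def createContext(): StreamingContext = {
    val sparkConf = new SparkConf()
      .setAppName("TunedStreamingApp")
      // Kryo reduces serialized sizes and CPU overhead.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Compress serialized RDD partitions, trading CPU time for memory.
      .set("spark.rdd.compress", "true")
      // Concurrent mark-and-sweep GC on the executors.
      .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")

    val ssc = new StreamingContext(sparkConf, Seconds(5))
    // Keep generated RDDs around longer than the default, e.g. for ad hoc queries.
    ssc.remember(Minutes(30))
    ssc
  }
}
```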