Meet Up - Spark Stream Processing + Kafka

Satendra Kumar
Sr. Software Consultant
Knoldus Software LLP
Stream Processing

Topics Covered
➢ What is Stream
➢ What is Stream processing
➢ The challenges of stream processing
➢ Overview of Spark Streaming
➢ Receivers
➢ Custom receivers
➢ Transformations on DStreams
➢ Failures
➢ Fault-tolerance Semantics
➢ Kafka Integration
➢ Performance Tuning

What is Stream
A stream is a sequence of data elements made available over time
that can be accessed in sequential order.
E.g. YouTube video buffering.

What is Stream processing
Stream processing is the real-time processing of data
continuously, concurrently, and in a record-by-record fashion.
It treats data not as static tables or files, but as a continuous
infinite stream of data integrated from both live and historical
sources.

The challenges of stream processing
➢ Partitioning & Scalability
➢ Semantics & Fault tolerance
➢ Unifying the streams
➢ Time
➢ Re-Processing

Spark Streaming
➢ Provides a way to process live data streams.
➢ Scalable, high-throughput, fault-tolerant.
➢ Built on top of the core Spark API.
➢ API is very similar to Spark core API.
➢ Supports many sources like Kafka, Flume, Kinesis or TCP
sockets.
➢ Currently based on RDDs.

Discretized Streams
➢ Spark Streaming provides a high-level abstraction called a discretized stream, or
DStream, which represents a continuous stream of data.
➢ DStreams can be created either from input data streams from
sources such as Kafka, Flume, and Kinesis, or by applying
high-level operations on other DStreams.
➢ DStream is represented as a sequence of RDDs.

Driver Program
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object StreamingApp extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp")
  val streamingContext = new StreamingContext(sparkConf, Seconds(5))
  val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000)
  val words: DStream[String] = lines.flatMap(_.split(" "))
  val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
  val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
  val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
  wordCounts.print()
  streamingContext.start()
  streamingContext.awaitTermination()
}

Driver Program
The driver program above is made of the following parts:
➢ Streaming Context - new StreamingContext(sparkConf, Seconds(5))
➢ Batch Interval - Seconds(5)
➢ Receiver - streamingContext.socketTextStream("localhost", 9000)
➢ Transformations on DStreams - flatMap, filter, map, reduceByKey
➢ Output Operations on DStreams - wordCounts.print()
➢ Start the Streaming - streamingContext.start() and streamingContext.awaitTermination()

Important Points
➢ Once a context has been started, no new streaming computations can
be set up or added to it.
➢ Once a context has been stopped, it cannot be restarted.
➢ Only one StreamingContext can be active in a JVM at the same time.
➢ stop() on StreamingContext also stops the SparkContext. To stop only
the StreamingContext, set the optional parameter of stop() called
stopSparkContext to false.
➢ A SparkContext can be re-used to create multiple StreamingContexts, as
long as the previous StreamingContext is stopped (without stopping the
SparkContext) before the next StreamingContext is created.

Spark Streaming Concept
➢ Spark streaming is based on micro-batch architecture.
➢ Spark streaming continuously receives live input data streams and divides
the data into batches.
➢ New batches are created at regular time intervals called batch interval.
➢ Each batch has N blocks, where N = batch interval / block interval.
E.g. if the batch interval is 1 second and the block interval is 200 ms (the default),
then each batch has 5 blocks.
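
The block arithmetic above can be sketched in plain Scala. This is only an illustration, not part of the Spark API: the object and method names are hypothetical, and the 200 ms default is the value quoted on the slide.

```scala
object BlocksPerBatch {
  /** Number of blocks per batch: batch interval / block interval (both in milliseconds). */
  def blocksPerBatch(batchIntervalMs: Long, blockIntervalMs: Long = 200L): Long = {
    require(blockIntervalMs > 0, "block interval must be positive")
    batchIntervalMs / blockIntervalMs
  }

  def main(args: Array[String]): Unit = {
    println(blocksPerBatch(1000L))       // 1 s batches with the default 200 ms block interval
    println(blocksPerBatch(5000L, 500L)) // 5 s batches with 500 ms blocks
  }
}
```

A shorter block interval gives more blocks per batch, and a longer one gives fewer, so this number is worth checking whenever either interval is changed.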

Transforming DStream
➢ DStream is represented by a continuous series of RDDs
➢ Each RDD in a DStream contains data from a certain interval
➢ Any operation applied on a DStream translates to operations on the
underlying RDDs
➢ Processing time of a batch should be less than or equal to the batch
interval.

def map[U: ClassTag](mapFunc: T => U): DStream[U]
def flatMap[U: ClassTag](flatMapFunc: T => TraversableOnce[U]): DStream[U]
def filter(filterFunc: T => Boolean): DStream[T]
def reduce(reduceFunc: (T, T) => T): DStream[T]
def count(): DStream[Long]
def repartition(numPartitions: Int): DStream[T]
def countByValue(numPartitions: Int = ssc.sc.defaultParallelism): DStream[(T, Long)]
def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U]
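
The statement that an operation on a DStream translates to operations on its underlying RDDs can be modelled without Spark. In this sketch a Vector stands in for one RDD and a Vector of Vectors stands in for a DStream; all names are hypothetical and the model ignores distribution and time.

```scala
object DStreamModel {
  // Hypothetical stand-ins: a Batch plays the role of one RDD,
  // and a Batches value plays the role of a DStream.
  type Batch[T] = Vector[T]
  type Batches[T] = Vector[Batch[T]]

  /** map on the stream is map on every underlying batch. */
  def mapStream[T, U](stream: Batches[T])(f: T => U): Batches[U] =
    stream.map(batch => batch.map(f))

  /** filter on the stream is filter on every underlying batch. */
  def filterStream[T](stream: Batches[T])(p: T => Boolean): Batches[T] =
    stream.map(batch => batch.filter(p))

  /** count yields one single-element batch per input batch. */
  def countStream[T](stream: Batches[T]): Batches[Long] =
    stream.map(batch => Vector(batch.size.toLong))

  def main(args: Array[String]): Unit = {
    val words: Batches[String] = Vector(Vector("spark", "", "kafka"), Vector("stream"))
    val nonEmpty = filterStream(words)(_.trim.nonEmpty)
    println(mapStream(nonEmpty)(w => (w, 1)))
    println(countStream(nonEmpty))
  }
}
```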

Transformations on PairDStream
def groupByKey(): DStream[(K, Iterable[V])]
def reduceByKey(reduceFunc: (V, V) => V, numPartitions: Int): DStream[(K, V)]
def join[W: ClassTag](other: DStream[(K, W)]): DStream[(K, (V, W))]
def updateStateByKey[S: ClassTag](
  updateFunc: (Seq[V], Option[S]) => Option[S], partitioner: Partitioner): DStream[(K, S)]
def cogroup[W: ClassTag](
  other: DStream[(K, W)], numPartitions: Int): DStream[(K, (Iterable[V], Iterable[W]))]
def mapValues[U: ClassTag](mapValuesFunc: V => U): DStream[(K, U)]
def leftOuterJoin[W: ClassTag](
  other: DStream[(K, W)], numPartitions: Int): DStream[(K, (V, Option[W]))]
def rightOuterJoin[W: ClassTag](
  other: DStream[(K, W)], numPartitions: Int): DStream[(K, (Option[V], W))]

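updateStateByKey
// Stateful operations require a checkpoint directory
streamingContext.checkpoint(".")
val lines = streamingContext.socketTextStream("localhost", 9000)
val pairs: DStream[(String, Int)] =
  lines.flatMap(_.split(" ")).filter(!_.trim.isEmpty).map(word => (word, 1))
// Running count per word: add this batch's values to the previous state
val updatedState: DStream[(String, Int)] =
  pairs.updateStateByKey[Int] {
    (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0))
  }
updatedState.print()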
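
To see what the update function does across batches, the sketch below folds it over a few in-memory batches. It is a plain Scala model, not Spark code: RunningWordCount, updateState and run are hypothetical names, and only the shape of the update function is taken from the slide.

```scala
object RunningWordCount {
  // Same shape as the function passed to updateStateByKey on the slide.
  def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(newValues.sum + state.getOrElse(0))

  /** Applies the update function to one batch of (word, 1) pairs, key by key. */
  def updateState(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (key, pairs) => key -> pairs.map(_._2) }
    // Existing keys are updated too, with an empty Seq when the batch has no values for them.
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { key =>
      updateCount(grouped.getOrElse(key, Seq.empty), state.get(key)).map(count => key -> count)
    }.toMap
  }

  /** Folds the state over a sequence of batches, starting from an empty state. */
  def run(batches: Seq[Seq[String]]): Map[String, Int] =
    batches.foldLeft(Map.empty[String, Int]) { (state, words) =>
      updateState(state, words.map(word => (word, 1)))
    }

  def main(args: Array[String]): Unit =
    println(run(Seq(Seq("a", "b", "a"), Seq("b"), Seq.empty)))
}
```

Returning None from the update function would drop the key from the state, which is why the model uses flatMap over the Option result.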

Window Operations
Spark Streaming also provides windowed computations, which allow
you to apply transformations over a sliding window of data.
A window operation takes two parameters, both of which must be multiples of the batch interval:
● window length - The duration of the window.
● sliding interval - The interval at which the window operation is performed.

Window Operations
def window(windowDuration: Duration): DStream[T]
def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
def reduceByWindow(reduceFunc: (T, T) => T,
  windowDuration: Duration, slideDuration: Duration): DStream[T]
def countByWindow(windowDuration: Duration, slideDuration: Duration): DStream[Long]
def countByValueAndWindow(windowDuration: Duration,
  slideDuration: Duration, numPartitions: Int): DStream[(T, Long)]
// PairDStream operations
def groupByKeyAndWindow(windowDuration: Duration): DStream[(K, Iterable[V])]
def groupByKeyAndWindow(windowDuration: Duration,
  slideDuration: Duration): DStream[(K, Iterable[V])]
def reduceByKeyAndWindow(reduceFunc: (V, V) => V, windowDuration: Duration): DStream[(K, V)]
def reduceByKeyAndWindow(reduceFunc: (V, V) => V,
  windowDuration: Duration, slideDuration: Duration): DStream[(K, V)]

Window Operations
pairs.window(Seconds(15), Seconds(10))
filteredWords.reduceByWindow((a, b) => a +", "+ b, Seconds(15), Seconds(10))
pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(15), Seconds(10))
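
The effect of a 15 second window sliding every 10 seconds over 5 second batches can be modelled in plain Scala. The sketch assumes the batch interval of 5 seconds used in the driver program, so the window spans 3 batches and slides by 2. SlidingWindowModel and its method are hypothetical names, not Spark API.

```scala
object SlidingWindowModel {
  /**
   * Groups per-batch data into windows.
   * windowBatches = window length / batch interval,
   * slideBatches  = sliding interval / batch interval.
   */
  def windows[T](batches: Vector[Vector[T]], windowBatches: Int, slideBatches: Int): Vector[Vector[T]] = {
    require(windowBatches > 0 && slideBatches > 0, "window and slide must be positive")
    // A window is emitted every slideBatches batches and covers the last windowBatches batches.
    (slideBatches to batches.length by slideBatches).toVector.map { end =>
      batches.slice(math.max(0, end - windowBatches), end).flatten
    }
  }

  def main(args: Array[String]): Unit = {
    val batches = Vector(Vector(1), Vector(2), Vector(3), Vector(4), Vector(5))
    // 15 s window / 5 s batch = 3 batches, 10 s slide / 5 s batch = 2 batches
    println(windows(batches, windowBatches = 3, slideBatches = 2))
  }
}
```

Because the window is longer than the slide, consecutive windows overlap, so a record can contribute to more than one windowed result.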

Output Operations on DStreams
def print(num: Int): Unit
def saveAsObjectFiles(prefix: String, suffix: String = ""): Unit
def saveAsTextFiles(prefix: String, suffix: String = ""): Unit
def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
def saveAsHadoopFiles[F <: OutputFormat[K, V]](prefix: String, suffix: String): Unit

Receivers
Spark Streaming has two kinds of receivers:
1) Reliable Receiver - A reliable receiver sends an acknowledgment
to a reliable source once the data has been received and stored in Spark with
replication.
2) Unreliable Receiver - An unreliable receiver does not send an acknowledgment
to the source.

Custom Receiver
A custom receiver must extend the abstract Receiver class and implement
two abstract methods:
def onStart(): Unit // Things to do to start receiving data
def onStop(): Unit // Things to do to stop receiving data

Custom Receiver
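import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CustomReceiver(path: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  // Start a background thread so that onStart() returns immediately
  def onStart(): Unit = {
    new Thread("File Reader") {
      override def run(): Unit = receive()
    }.start()
  }
  // Nothing to clean up: the reader thread exits once isStopped returns true
  def onStop(): Unit = {}
  // Read the file line by line and hand each line to Spark
  private def receive(): Unit =
    try {
      println("Reading file " + path)
      val reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))
      var userInput = reader.readLine()
      while (!isStopped && Option(userInput).isDefined) {
        store(userInput)
        userInput = reader.readLine()
      }
      reader.close()
      println("Stopped receiving")
      restart("Trying to connect again")
    } catch {
      case ex: Exception =>
        restart("Error reading file " + path, ex)
    }
}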

Custom Receiver
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CustomReceiver extends App {
  val sparkConf = new SparkConf().setAppName("CustomReceiver")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  // args(0) is the path of the file to read
  val lines = ssc.receiverStream(new CustomReceiver(args(0)))
  val words = lines.flatMap(_.split(" "))
  val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
  wordCounts.print()
  ssc.start()
  ssc.awaitTermination()
}

Fault-tolerance Semantics
A streaming system should guarantee zero data loss despite any kind of
failure in the system. The guarantee it offers is described by its processing semantics:
➢ At least once - Each record will be processed one or more times.
➢ Exactly once - Each record will be processed exactly once: no data will be lost and no
data will be processed multiple times.
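
The difference between the two semantics shows up when a record is replayed after a failure. The sketch below is a plain Scala model with hypothetical names (Record, appendTotal, idempotentTotal); it assumes each record carries a unique id, which the slides do not state.

```scala
object DeliverySemantics {
  // Hypothetical record type: the id is assumed to be unique per record.
  final case class Record(id: Long, amount: Int)

  /** Non-idempotent sink: every delivery is added, so a replayed record is counted twice. */
  def appendTotal(deliveries: Seq[Record]): Int =
    deliveries.map(_.amount).sum

  /** Idempotent sink: writes are keyed by record id, so a replay overwrites instead of adding. */
  def idempotentTotal(deliveries: Seq[Record]): Int =
    deliveries
      .foldLeft(Map.empty[Long, Int]) { (store, record) => store.updated(record.id, record.amount) }
      .values
      .sum

  def main(args: Array[String]): Unit = {
    // Record 2 is delivered twice, as can happen under at-least-once delivery.
    val deliveries = Seq(Record(1L, 10), Record(2L, 20), Record(2L, 20))
    println(appendTotal(deliveries))
    println(idempotentTotal(deliveries))
  }
}
```

With at-least-once delivery, an idempotent output of this kind is one common way to get results that look exactly-once, even though records may be processed more than once.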

Kinds of Failure
There are two kinds of failure:
➢ Executor failure
1) Data received and replicated
2) Data received but not replicated
➢ Driver failure

Executor failure
Would data be lost?

Enable write-ahead logs
object Streaming2App extends App {
  // Should be on a fault-tolerant, reliable file system (e.g. HDFS, S3)
  val checkpointDirectory = "checkpointDir"
  val sparkConf = new SparkConf().setAppName("Streaming2App")
  // Enable write-ahead logs
  sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val streamingContext = new StreamingContext(sparkConf, Seconds(5))
  // Enable checkpointing
  streamingContext.checkpoint(checkpointDirectory)
  val wordCounts = streamingContext.socketTextStream("localhost", 9000)
    .flatMap(_.split(" ")).filter(!_.trim.isEmpty).map(word => (word, 1)).reduceByKey(_ + _)
  wordCounts.print(20)
  streamingContext.start()
  streamingContext.awaitTermination()
}

1) Enable checkpointing first; the WAL depends on it
- streamingContext.checkpoint(checkpointDirectory)
2) Enable the WAL in the Spark configuration
- sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
3) The receiver should be reliable
- Acknowledge the source only after the data has been saved to the WAL
- Unacknowledged data will be replayed from the source by the restarted receiver
4) Disable in-memory replication (the data is already replicated by HDFS)
- Use StorageLevel.MEMORY_AND_DISK_SER for the input DStream

Driver failure
How do we recover from this failure?

Driver with checkpointing
DStream checkpointing: periodically save the DAG of
DStreams to fault-tolerant storage.

Recover from Driver failure
1) Configure automatic driver restart
- All cluster managers support this
2) Set a checkpoint directory
- The directory should be on a fault-tolerant, reliable file system (e.g. HDFS, S3)
- streamingContext.checkpoint(checkpointDirectory)
3) Restart the driver from the checkpoint

Configure Automatic driver restart
Spark Standalone
- Use spark-submit with "cluster" mode and "--supervise"
YARN
- Use spark-submit with "cluster" mode
Mesos
- Marathon can restart applications, or use the "--supervise" flag

Configure Checkpointing
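import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableWordCount {
  // Should be on a fault-tolerant, reliable file system (e.g. HDFS, S3)
  val checkpointDirectory = "checkpointDir"
  // Builds a fresh context; only called when no checkpoint data exists
  def createContext(): StreamingContext = {
    val sparkConf = new SparkConf().setAppName("StreamingApp")
    val streamingContext = new StreamingContext(sparkConf, Seconds(5))
    streamingContext.checkpoint(checkpointDirectory)
    val lines = streamingContext.socketTextStream("localhost", 9000)
    lines.flatMap(_.split(" ")).filter(!_.trim.isEmpty).map(word => (word, 1)).reduceByKey(_ + _).print()
    streamingContext
  }
}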

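Restart the driver from the checkpoint
import RecoverableWordCount._
// Rebuild the context from checkpoint data if it exists, otherwise call createContext()
val streamingContext = StreamingContext.getOrCreate(checkpointDirectory, createContext _)
// do other operations
streamingContext.start()
streamingContext.awaitTermination()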

Checkpointing
There are two types of data that are checkpointed.
1) Metadata checkpointing
-Configuration
-DStream operations
-Incomplete batches
2) Data checkpointing
- Saving of the generated RDDs to reliable storage. This is necessary in some stateful
transformations that combine data across multiple batches.

Checkpointing Latency
➔ Checkpointing of RDDs incurs the cost of saving to reliable storage. The interval of
checkpointing needs to be set carefully.
dstream.checkpoint( Seconds( (batch interval)*10 ) )
➔ A checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
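
The 5 to 10 sliding intervals rule of thumb above is simple arithmetic; a small helper makes the range explicit. CheckpointInterval and suggestedRangeMs are hypothetical names used only for this sketch.

```scala
object CheckpointInterval {
  /** Suggested checkpoint interval range in milliseconds: 5 to 10 sliding intervals. */
  def suggestedRangeMs(slidingIntervalMs: Long): (Long, Long) = {
    require(slidingIntervalMs > 0, "sliding interval must be positive")
    (slidingIntervalMs * 5, slidingIntervalMs * 10)
  }

  def main(args: Array[String]): Unit = {
    // For a DStream that slides every 10 seconds
    val (low, high) = suggestedRangeMs(10000L)
    println(s"try a checkpoint interval between $low ms and $high ms")
  }
}
```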

Spark Streaming & Kafka Integration

Why Kafka ?
➢ Velocity & volume of streaming data
➢ Reprocessing of streaming
➢ Reliable receiver complexity
➢ Checkpoint complexity
➢ Upgrading Application Code

Kafka Integration
There are two approaches to integrate Kafka with Spark Streaming:
➢ Receiver-based Approach
➢ Direct Approach

Receiver-based Approach
Source: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html

Receiver-based Approach
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReceiverBasedStreaming extends App {
  val group = "streaming-test-group"
  val zkQuorum = "localhost:2181"
  val topics = Map("streaming_queue" -> 1) // topic -> number of consumer threads
  val sparkConf = new SparkConf().setAppName("ReceiverBasedStreamingApp")
  val ssc = new StreamingContext(sparkConf, Seconds(5)) // batch interval chosen for illustration
  val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topics)
    .map { case (key, message) => message }
  val words = lines.flatMap(_.split(" "))
  val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
  wordCounts.print()
  ssc.start()
  ssc.awaitTermination()
}

Direct Approach
Source: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html

Direct Approach
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka._

object KafkaDirectStreaming extends App {
  val brokers = "localhost:9092"
  val sparkConf = new SparkConf().setAppName("KafkaDirectStreaming")
  val ssc = new StreamingContext(sparkConf, Seconds(5)) // batch interval chosen for illustration
  ssc.checkpoint("checkpointDir") // offset recovery
  val topics = Set("streaming_queue")
  val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
  val messages: InputDStream[(String, String)] =
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
  val lines = messages.map { case (key, message) => message }
  val words = lines.flatMap(_.split(" "))
  val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
  wordCounts.print()
  ssc.start()
  ssc.awaitTermination()
}

Direct Approach
Direct Approach has the following advantages over the receiver-based approach:
➢ Simplified Parallelism
➢ Efficiency
➢ Exactly-once semantics
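
One way to picture the exactly-once point is to track the consumed offset together with the result. The sketch below is a plain Scala model, not the Kafka or Spark API: the topic partition is a Vector, and Committed and processRange are hypothetical names. It assumes the result and the next offset are written in one atomic step.

```scala
object OffsetTracking {
  /** Hypothetical sink state: the running total and the next offset to read, stored together. */
  final case class Committed(total: Long, nextOffset: Int)

  /** Processes log[state.nextOffset, until) and commits the result and the offset in one step. */
  def processRange(log: Vector[Int], state: Committed, until: Int): Committed = {
    val from = state.nextOffset
    val end = math.min(until, log.length)
    if (end <= from) state // range already committed: a retried batch changes nothing
    else Committed(state.total + log.slice(from, end).map(_.toLong).sum, end)
  }

  def main(args: Array[String]): Unit = {
    val log = Vector(1, 2, 3, 4, 5)
    val afterFirst = processRange(log, Committed(0L, 0), until = 3)
    val afterRetry = processRange(log, afterFirst, until = 3) // same batch retried after a failure
    val afterSecond = processRange(log, afterRetry, until = 5)
    println(afterSecond)
  }
}
```

Because each batch is defined by an offset range, a retried batch reads the same records again, and a sink that stores offsets with its results can recognise and skip the repeat.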

Performance Tuning
For best performance of a Spark Streaming application we need to
consider two things:
➢ Reducing the Batch Processing Times
➢ Setting the Right Batch Interval

Reducing the Batch Processing Times
➢ Level of Parallelism in Data Receiving
➢ Level of Parallelism in Data Processing
➢ Data Serialization
-Input data
-Persisted RDDs generated by Streaming Operations
➢ Task Launching Overheads
-Running Spark in Standalone mode or coarse-grained Mesos mode leads
to better task launch times.

Setting the Right Batch Interval
➢ Batch processing time should be less than the batch interval.
➢ Memory Tuning
-Persistence Level of DStreams
-Clearing old data
-CMS Garbage Collector
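
Why the processing time has to stay below the batch interval can be shown with a small queueing model. This is a plain Scala sketch with hypothetical names; it assumes batches are processed one at a time in arrival order.

```scala
object BatchBacklog {
  /**
   * Scheduling delay (ms) of each batch, given a fixed batch interval and
   * the processing time of each batch. Batch i is ready at (i + 1) * interval
   * and starts once it is ready and the previous batch has finished.
   */
  def schedulingDelays(batchIntervalMs: Long, processingTimesMs: Seq[Long]): Vector[Long] = {
    val (_, delays) =
      processingTimesMs.zipWithIndex.foldLeft((0L, Vector.empty[Long])) {
        case ((previousFinish, acc), (processingTime, index)) =>
          val ready = (index + 1) * batchIntervalMs
          val start = math.max(ready, previousFinish)
          (start + processingTime, acc :+ (start - ready))
      }
    delays
  }

  def main(args: Array[String]): Unit = {
    println(schedulingDelays(1000L, Seq(800L, 800L, 800L, 800L)))     // keeps up with the input
    println(schedulingDelays(1000L, Seq(1200L, 1200L, 1200L, 1200L))) // falls further behind each batch
  }
}
```

When every batch takes longer than the interval, the delay grows without bound, which is the instability the rule above is meant to prevent.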

Code samples
https://github.com/knoldus/spark-streaming-meetup
https://github.com/knoldus/real-time-stream-processing-engine
https://github.com/knoldus/kafka-tweet-producer

References
http://spark.apache.org/docs/latest/streaming-programming-guide.html
http://spark.apache.org/docs/latest/configuration.html#spark-streaming
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
http://spark.apache.org/docs/latest/tuning.html
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html

Thanks
Presenters:
@_satendrakumar
Organizer:
@knolspeak
http://www.knoldus.com
