3. 10.1 A Simple Example
Before we dive into the details of Spark Streaming,
let’s consider a simple example. We will receive a
stream of newline-delimited lines of text from a
server running at port 7777, filter only the lines that
contain the word “error”, and print them.
Spark Streaming programs are best run as
standalone applications built using Maven or sbt.
Spark Streaming, while part of Spark, ships as a
separate Maven artifact and has some additional
imports you will want to add to your project.
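A minimal sketch of this example in Scala (assuming the sending server runs on localhost and a 1-second batch interval):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Create a StreamingContext with a 1-second batch interval
val conf = new SparkConf().setAppName("ErrorLines")
val ssc = new StreamingContext(conf, Seconds(1))

// DStream of newline-delimited lines of text received from port 7777
val lines = ssc.socketTextStream("localhost", 7777)

// Keep only the lines that contain the word "error" and print them
val errorLines = lines.filter(_.contains("error"))
errorLines.print()

// Start the streaming computation and wait for it to terminate
ssc.start()
ssc.awaitTermination()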
7. 10.3 Transformations
Stateless
the processing of each batch does not depend on the data of
previous batches
include the common RDD transformations like map(), filter(),
and reduceByKey() (see the sketch after this list)
Stateful
use data or intermediate results from previous batches to
compute the results of the current batch
include transformations based on:
sliding windows
tracking state across time (updateStateByKey())
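For instance, a stateless per-batch word count (a sketch, reusing the lines DStream from the earlier example):
// flatMap(), map(), and reduceByKey() are applied to each batch
// independently; counts never carry over between batches
val words = lines.flatMap(line => line.split(" "))
val wordCounts = words.map(w => (w, 1)).reduceByKey(_ + _)
wordCounts.print()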
9. 10.3.2 Stateful Transformations
Windowed Transformation
compute results across a longer time period than the
StreamingContext’s batch interval, by combining results from
multiple batches
Example: a windowed stream with a window duration of
3 batches and a slide duration of 2 batches;
every 2 time steps, we compute a result over
the previous 3 time steps
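A sketch in Scala: with a 10-second batch interval, a window duration of 30 seconds and a slide duration of 20 seconds gives exactly this 3-batch/2-batch behavior (words comes from the earlier sketch):
import org.apache.spark.streaming.Seconds

// Window duration of 3 batches (30s), slide duration of 2 batches (20s);
// both must be multiples of the batch interval
val windowedWordCounts = words.map(w => (w, 1)).reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // combine counts across the batches in the window
  Seconds(30),               // window duration
  Seconds(20))               // slide duration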
10. 10.3.2 Stateful Transformations (cont.)
UpdateStateByKey transformation
updateStateByKey() maintains state across the batches in a
DStream by providing access to a state variable for DStreams
of key/value pairs
we supply a function update(events, oldState) that returns a newState
events is a list of events that arrived in the current batch (may be
empty)
oldState is an optional state object, stored within an Option; it
might be missing if there was no previous state for the key
newState is also an Option; we can return an empty Option to
specify that we want to delete the state
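A sketch of a running count per key using updateStateByKey() (checkpointing must be enabled first; wordCounts is the pair DStream from the earlier sketches, and the checkpoint path is a placeholder):
ssc.checkpoint("checkpointDir") // required for stateful transformations

// events: the new counts for a key in the current batch (may be empty)
// oldState: Some(previous total), or None if the key had no prior state
def updateRunningCount(events: Seq[Int], oldState: Option[Int]): Option[Int] = {
  Some(oldState.getOrElse(0) + events.sum) // return None to delete the key's state
}

val runningCounts = wordCounts.updateStateByKey(updateRunningCount _)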
11. 10.4 Output Operations
Specify what needs to be done with the final transformed
data in a stream
print()
save()
Saving DStream to text files in Scala
ipAddressRequestCount.saveAsTextFiles("outputDir", "txt")
Saving SequenceFiles from a DStream in Scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.SequenceFileOutputFormat

val writableIpAddressRequestCount = ipAddressRequestCount.map {
  case (ip, count) => (new Text(ip), new LongWritable(count)) }
writableIpAddressRequestCount.saveAsHadoopFiles[
  SequenceFileOutputFormat[Text, LongWritable]]("outputDir", "txt")
12. 10.5 Input Sources
Spark Streaming has built-in support for a number
of different data sources.
“core” sources are built into the Spark Streaming Maven
artifact
others are available through additional artifacts,
e.g., spark-streaming-kafka
13. 10.5.1 Core Sources
Stream of files
allows a stream to be created from files written in a directory of a
Hadoop-compatible filesystem
needs a consistent date format for the directory names, and the
files have to be created atomically (e.g., by moving them into
the monitored directory)
E.g., streaming text files written to a directory in Scala:
val logData = ssc.textFileStream(logDirectory)
Akka actor stream
allows using Akka actors as a source for streaming
To construct an actor stream:
create an Akka actor
implement the org.apache.spark.streaming.receiver.ActorHelper
interface
15. 10.5.3 Multiple Sources and Cluster Sizing
We can combine data from multiple input DStreams into a
single DStream using operations like union()
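A sketch of combining several socket streams (the host and ports are assumptions):
val ports = Seq(7777, 7778, 7779)
val streams = ports.map(port => ssc.socketTextStream("localhost", port))
// union() merges all the input DStreams into a single DStream
val combined = ssc.union(streams)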
Receivers are executed in the Spark cluster, which is what
lets us run multiple receivers in parallel
Each receiver runs as a long-running task within Spark’s
executors, and hence occupies CPU cores allocated to the
application
Note: Do not run Spark Streaming programs locally with master
configured as "local" or "local[1]"; that allocates only one CPU
core, which the receiver will occupy, leaving none to process the
received data. Use at least "local[2]".
16. 10.6 “24/7” Operations
Spark provides strong fault tolerance guarantees.
As long as the input data is stored reliably, Spark Streaming
will always compute the correct result from it, offering “exactly
once” semantics, even if workers or the driver fail.
To run Spark Streaming applications 24/7, we need to:
1. set up checkpointing to a reliable storage system, such as
HDFS or Amazon S3
2. address the fault tolerance of the driver program and of
unreliable input sources
17. 10.6.1 Checkpointing
Checkpointing is the main mechanism that needs to be set up
for fault tolerance
Allows periodically saving data about the application to a
reliable storage system, such as HDFS or Amazon S3, for use
in recovering
Two purposes:
Limiting the state that must be recomputed on failure
Providing fault tolerance for the driver
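Setting it up is a single call on the StreamingContext (the path below is an assumption):
// Checkpoint to a reliable, replicated filesystem such as HDFS
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")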
18. 10.6.2 Driver Fault Tolerance
Requires creating our StreamingContext in a way that can
recover from the checkpoint directory:
use the StreamingContext.getOrCreate() function
Besides writing the initialization code with getOrCreate(), you
need to actually restart your driver program when it crashes
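A sketch of the pattern (conf and checkpointDir are assumed to be defined elsewhere):
def createStreamingContext(): StreamingContext = {
  // Runs only on the first start; on restart, the context is
  // reconstructed from the checkpoint data instead
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)
  // ... set up the DStream transformations here ...
  ssc
}
val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext _)
ssc.start()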
19. 10.6.3 Worker Fault Tolerance
Spark Streaming uses the same techniques as Spark
for its fault tolerance.
All the data received from external sources is
replicated among the Spark workers
All RDDs created through transformations of this
replicated input data are tolerant to failure of a
worker node, as the RDD lineage allows the system
to recompute the lost data all the way from the
surviving replica of the input data.
20. 10.6.4 Receiver Fault Tolerance
Spark Streaming restarts the failed receivers on
other nodes in the cluster
Receivers provide the following guarantees:
All data read from a reliable filesystem (e.g., with
StreamingContext.hadoopFiles) is reliable, because the underlying
filesystem is replicated.
For unreliable sources such as Kafka, push-based Flume, or
Twitter, Spark replicates the input data to other nodes, but it
can briefly lose data if a receiver task is down.
21. 10.6.5 Processing Guarantees
Spark Streaming provides exactly-once semantics for all
transformations
Even if a worker fails and some data gets reprocessed, the final
transformed result (that is, the transformed RDDs) will be the
same as if the data were processed exactly once.
When the transformed result is pushed to external systems using
output operations, the task pushing the result may get executed
multiple times due to failures, and some data can get pushed
multiple times.
22. 10.7 Streaming UI
UI page that lets us look at what our applications are
doing (typically at http://<driver>:4040)
24. 10.8.1 Batch and Window Sizes
Minimum batch size Spark Streaming can use: 500
milliseconds
The best approach:
start with a larger batch size (around 10 seconds)
work your way down to a smaller batch size.
If the processing times reported in the Streaming UI remain
consistent, then you can continue to decrease the batch size
Note: if they are increasing, you may have reached the limit for
your application.
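For instance, a conservative starting point (a sketch; conf is assumed):
// Begin with a 10-second batch interval, then tune downward while
// watching the processing times in the Streaming UI
val ssc = new StreamingContext(conf, Seconds(10))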
25. 10.8.2 Level of Parallelism
Increasing the parallelism is a common way to reduce the
processing time of batches
Three ways (sketched below):
Increasing the number of receivers
Explicitly repartitioning received data
Increasing parallelism in aggregation
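A sketch of the last two techniques, reusing lines and words from the earlier sketches (the partition counts are assumptions):
// Explicitly repartition received data to spread it across more executors
val repartitioned = lines.repartition(10)

// Increase parallelism in aggregation by passing a parallelism level
val parallelCounts = words.map(w => (w, 1)).reduceByKey((a: Int, b: Int) => a + b, 10)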
26. 10.8.3 Garbage Collection and Memory Usage
Java’s garbage collection is an aspect that can cause problems
To minimize large pauses due to GC, enable Java’s Concurrent
Mark-Sweep garbage collector
The Concurrent Mark-Sweep garbage collector does consume
more resources overall, but introduces fewer pauses.
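For example, by setting the executor JVM options on the SparkConf before the context is created (a sketch; conf is the application's SparkConf):
// Ask executor JVMs to use the Concurrent Mark-Sweep collector
conf.set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")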
To reduce GC pressure
Cache RDDs in serialized form
Use Kryo serialization
Use an LRU cache
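A sketch of the first technique, caching in serialized form (wordCounts comes from the earlier sketches):
import org.apache.spark.storage.StorageLevel

// Serialized caching stores each partition as a single byte buffer,
// which reduces GC pressure from many small objects
wordCounts.persist(StorageLevel.MEMORY_ONLY_SER)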
27. edX and Coursera Courses
Introduction to Big Data with Apache Spark
Spark Fundamentals I
Functional Programming Principles in Scala
28. 10.9 Conclusion
In this chapter, we have seen how to work with
streaming data using DStreams.
Since DStreams are composed of RDDs, the
techniques and knowledge you have gained from the
earlier chapters remain applicable for streaming
and real-time applications.
In the next chapter, we will look at machine learning
with Spark.
Editor’s notes
Spark Streaming uses a “micro-batch” architecture, where the streaming computation is treated as a continuous series of batch computations on small batches of data. Spark Streaming receives data from various input sources and groups it into small batches. New batches are created at regular time intervals. At the beginning of each time interval a new batch is created, and any data that arrives during that interval gets added to that batch. At the end of the time interval the batch is done growing. The size of the time intervals is determined by a parameter called the batch interval. The batch interval is typically between 500 milliseconds and several seconds, as configured by the application developer. Each input batch forms an RDD, and is processed using Spark jobs to create other RDDs. The processed results can then be pushed out to external systems in batches.
Limiting the state that must be recomputed on failure. As discussed in “Architecture and Abstraction” on page 186, Spark Streaming can recompute state using the lineage graph of transformations, but checkpointing controls how far back it must go.
Providing fault tolerance for the driver. If the driver program in a streaming application crashes, you can launch it again and tell it to recover from a checkpoint, in which case Spark Streaming will read how far the previous run of the program got in processing the data and take over from there.