1. Spark Streaming high level overview
An abstraction over core Spark that provides stream processing functionality
2. Core Spark
• A general, in-memory data processing engine
• Batch oriented
• Main abstraction is called an RDD (Resilient Distributed Dataset)
• Represents a distributed collection
• Doesn't store data itself
• It's the API for defining processing steps
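The "defines processing steps, doesn't store data" point can be illustrated without Spark: a plain Scala lazy view behaves analogously, in that building the chain evaluates nothing until a terminal operation (Spark's "action") forces it. This is an illustrative sketch, not Spark code, and the names are made up:

```scala
object LazyPipelineDemo {
  @volatile var evaluations = 0 // counts how many elements were actually processed

  // Building the chain is cheap: like RDD transformations, nothing runs yet
  def pipeline(data: Seq[Int]): scala.collection.View[Int] =
    data.view
      .map { n => evaluations += 1; n * 2 }
      .filter(_ > 4)

  def main(args: Array[String]): Unit = {
    val p = pipeline(1 to 5)
    println(evaluations)  // 0: the chain is defined, nothing computed yet
    val result = p.toList // forcing the view plays the role of an action
    println(result)       // List(6, 8, 10); evaluations is now 5
  }
}
```

The same deferred-evaluation idea is what lets Spark plan a whole pipeline before shipping any work to executors.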
3. stream
     .map(this.tryParseRawLine _)
     .filter(_.isSuccess)
     .map(_.value)
     .map(mapWithKey(_))
     .reduceByKey {
       /* reduce logic */
     }
     .map(_.toString)
• Translates an expression tree into a distributed data processing application
• Serializes functions and their enclosed scope and sends them to executors
• Updates to shared objects won't be reflected across the cluster
• Be aware not to reference large objects - use Spark's 'broadcast variables' instead
• Operations consist of transformations (map operations) and actions; operations such as groupBy and reduce cause shuffling of data across nodes
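The reduceByKey step in the pipeline above can be understood through a local stand-in: group pairs by key, then fold all values that share a key. This is a sketch of the semantics on ordinary collections, not Spark's distributed implementation:

```scala
object ReduceByKeyDemo {
  // Local model of Spark's reduceByKey: combine all values that share a key
  def reduceByKey[K, V](pairs: Seq[(K, V)])(combine: (V, V) => V): Map[K, V] =
    pairs.groupMapReduce(_._1)(_._2)(combine)

  def main(args: Array[String]): Unit = {
    // word-count style usage, as the pipeline's /* reduce logic */ might do
    val counts = reduceByKey(Seq(("a", 1), ("b", 1), ("a", 1)))(_ + _)
    println(counts) // a -> 2, b -> 1
  }
}
```

In Spark the same operation triggers a shuffle, since values for one key may live on different nodes.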
4. Kafka
• A pub-sub messaging system
• Supports topics - named queues
• Supports replaying messages from any index in the queue
• Scalable
• Data is partitioned and replicated
• Durable
• Messages are persisted and replicated - this adds latency
• Master-slave replication at the partition level
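Partitioning by key is what lets Kafka scale consumption: all messages with the same key land in the same partition. A simplified stand-in for the routing logic (Kafka's default partitioner actually hashes keys with murmur2, not `hashCode`):

```scala
object PartitionerSketch {
  // Simplified key -> partition mapping; floorMod keeps the result non-negative
  // even for negative hash codes
  def partitionFor(key: String, numPartitions: Int): Int = {
    require(numPartitions > 0, "need at least one partition")
    Math.floorMod(key.hashCode, numPartitions)
  }

  def main(args: Array[String]): Unit = {
    // the same key is always routed to the same partition
    println(partitionFor("user-42", 4) == partitionFor("user-42", 4)) // true
  }
}
```

This per-key stickiness is also what preserves ordering: Kafka guarantees order only within a partition, not across a topic.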
5. Back to Spark Streaming
• API almost identical to core Spark
• Micro-batches - as data comes in, it is buffered during a given interval and then served to core Spark for processing
• Buffered data is served to core Spark as RDDs
• Each interval produces one RDD
6. Data of the current batch is processed in parallel with buffering data for the next batch
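The interval-to-RDD relationship can be modelled locally: bucket timestamped events by which interval they fall into, one bucket per batch. A sketch with illustrative names:

```scala
object MicroBatchSketch {
  // Assign each (timestampMs, event) pair to its batch: one bucket per interval,
  // mirroring "each interval produces one RDD"
  def toBatches[A](events: Seq[(Long, A)], intervalMs: Long): Map[Long, Seq[A]] =
    events.groupMap { case (ts, _) => ts / intervalMs } { case (_, ev) => ev }

  def main(args: Array[String]): Unit = {
    val events = Seq((0L, "a"), (500L, "b"), (1200L, "c"))
    // with a 1000 ms interval: batch 0 holds "a" and "b", batch 1 holds "c"
    println(toBatches(events, 1000L))
  }
}
```

In real Spark Streaming the buffering happens continuously on receivers while the previous bucket is being processed, which is exactly the parallelism slide 6 describes.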
7. A guideline for a Spark Streaming application
Micro-batch processing time must be less than the batch interval
• Maybe trivial, but it is the key to avoiding bottlenecks and performance degradation
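Why the guideline matters can be seen with a simple model: if each batch takes longer to process than the interval, scheduling delay accumulates linearly and never recovers. A sketch under the assumption of one batch arriving per interval and sequential processing:

```scala
object BacklogSketch {
  // Scheduling delay after n batches: each batch adds max(0, processMs - intervalMs)
  // to the backlog when processing cannot keep up with arrivals
  def delayAfter(batches: Int, intervalMs: Long, processMs: Long): Long =
    math.max(0L, processMs - intervalMs) * batches

  def main(args: Array[String]): Unit = {
    println(delayAfter(10, 1000L, 1200L)) // falling behind 200 ms per batch
    println(delayAfter(10, 1000L, 800L))  // processing keeps up, no backlog
  }
}
```

The unbounded growth in the first case is the "bottleneck and performance degradation" the slide warns about.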
8. Performance tuning at large
• Parallelizing data consumption from the input source
• In the case of Kafka - depends on the topic's partitions (creating multiple Kafka stream consumers)
• Parallelizing data processing (Spark partitions) should be balanced with the total number of cores and consumers
• Each Kafka consumer uses one core
• Cores available for processing = total #cores - #consumers
• Serialization - avoid Java object serialization - Kryo is recommended
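A typical way to apply the last bullet, assuming a standard SparkConf setup; the app name and `LogLine` class are illustrative placeholders for your own records, so this fragment is a sketch rather than drop-in code:

```scala
import org.apache.spark.SparkConf

// Switch Spark to Kryo serialization instead of default Java serialization.
// Registering classes up front avoids writing full class names with each record.
val conf = new SparkConf()
  .setAppName("streaming-app") // illustrative app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[LogLine])) // LogLine: hypothetical record class
```

Kryo generally produces smaller payloads and faster (de)serialization than Java serialization, which matters most for shuffled and cached data.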
Editor's notes
Objects in scope must be serializable - or instantiated within the function