1. Spark Streaming high level overview
An abstraction over core Spark that provides stream processing functionality
2. Core Spark
• A general, in-memory data processing engine
• Batch oriented
• Main abstraction is called an RDD (Resilient Distributed Dataset)
• Represents a distributed collection
• Doesn't store data itself
• It's the API for defining processing steps
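The "defines processing steps, doesn't store data" point can be illustrated without Spark: a plain Scala lazy view behaves analogously, in that building the chain evaluates nothing until a terminal operation (Spark's "action") forces it. This is an illustrative sketch, not Spark code, and the names are made up:

```scala
object LazyPipelineDemo {
  @volatile var evaluations = 0 // counts how many elements were actually processed

  // Building the chain is cheap: like RDD transformations, nothing runs yet
  def pipeline(data: Seq[Int]): scala.collection.View[Int] =
    data.view
      .map { n => evaluations += 1; n * 2 }
      .filter(_ > 4)

  def main(args: Array[String]): Unit = {
    val p = pipeline(1 to 5)
    println(evaluations)  // 0: the chain is defined, nothing computed yet
    val result = p.toList // forcing the view plays the role of an action
    println(result)       // List(6, 8, 10); evaluations is now 5
  }
}
```

The same deferred-evaluation idea is what lets Spark plan a whole pipeline before shipping any work to executors.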
3. stream
     .map(this.tryParseRawLine _)
     .filter(_.isSuccess)
     .map(_.value)
     .map(mapWithKey(_))
     .reduceByKey {
       /* reduce logic */
     }
     .map(_.toString)
• Translates an expression tree into a distributed data processing application
• Serializes functions and their enclosed scope and sends them to executors
• Updates to shared objects won't be reflected across the cluster
• Be aware not to reference large objects - use Spark's 'broadcast variables' instead
• Operations consist of transformations (map operations) and actions; operations such as groupBy and reduce cause shuffling of data across nodes
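The reduceByKey step in the pipeline above can be understood through a local stand-in: group pairs by key, then fold all values that share a key. This is a sketch of the semantics on ordinary collections, not Spark's distributed implementation:

```scala
object ReduceByKeyDemo {
  // Local model of Spark's reduceByKey: combine all values that share a key
  def reduceByKey[K, V](pairs: Seq[(K, V)])(combine: (V, V) => V): Map[K, V] =
    pairs.groupMapReduce(_._1)(_._2)(combine)

  def main(args: Array[String]): Unit = {
    // word-count style usage, as the pipeline's /* reduce logic */ might do
    val counts = reduceByKey(Seq(("a", 1), ("b", 1), ("a", 1)))(_ + _)
    println(counts) // a -> 2, b -> 1
  }
}
```

In Spark the same operation triggers a shuffle, since values for one key may live on different nodes.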
4. Kafka
• A pub-sub messaging system
• Supports topics - named queues
• Supports replaying messages from any index in the queue
• Scalable
• Data is partitioned and replicated
• Durable
• Messages are persisted and replicated - this adds latency
• Master-slave replication at the partition level
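Partitioning by key is what lets Kafka scale consumption: all messages with the same key land in the same partition. A simplified stand-in for the routing logic (Kafka's default partitioner actually hashes keys with murmur2, not `hashCode`):

```scala
object PartitionerSketch {
  // Simplified key -> partition mapping; floorMod keeps the result non-negative
  // even for negative hash codes
  def partitionFor(key: String, numPartitions: Int): Int = {
    require(numPartitions > 0, "need at least one partition")
    Math.floorMod(key.hashCode, numPartitions)
  }

  def main(args: Array[String]): Unit = {
    // the same key is always routed to the same partition
    println(partitionFor("user-42", 4) == partitionFor("user-42", 4)) // true
  }
}
```

This per-key stickiness is also what preserves ordering: Kafka guarantees order only within a partition, not across a topic.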
5. Back to Spark Streaming
• API almost identical to core Spark
• Micro-batches - as data comes in, it is buffered during a given interval and then served to core Spark for processing
• Buffered data is served to core Spark as RDDs
• Each interval produces one RDD
6. Data of the current batch is processed in parallel with buffering data for the next batch
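The interval-to-RDD relationship can be modelled locally: bucket timestamped events by which interval they fall into, one bucket per batch. A sketch with illustrative names:

```scala
object MicroBatchSketch {
  // Assign each (timestampMs, event) pair to its batch: one bucket per interval,
  // mirroring "each interval produces one RDD"
  def toBatches[A](events: Seq[(Long, A)], intervalMs: Long): Map[Long, Seq[A]] =
    events.groupMap { case (ts, _) => ts / intervalMs } { case (_, ev) => ev }

  def main(args: Array[String]): Unit = {
    val events = Seq((0L, "a"), (500L, "b"), (1200L, "c"))
    // with a 1000 ms interval: batch 0 holds "a" and "b", batch 1 holds "c"
    println(toBatches(events, 1000L))
  }
}
```

In real Spark Streaming the buffering happens continuously on receivers while the previous bucket is being processed, which is exactly the parallelism slide 6 describes.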
7. A guideline for a Spark Streaming application
Micro-batch processing time must be less than the batch interval
• Maybe trivial, but it is the key to avoiding bottlenecks and performance degradation
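Why the guideline matters can be seen with a simple model: if each batch takes longer to process than the interval, scheduling delay accumulates linearly and never recovers. A sketch under the assumption of one batch arriving per interval and sequential processing:

```scala
object BacklogSketch {
  // Scheduling delay after n batches: each batch adds max(0, processMs - intervalMs)
  // to the backlog when processing cannot keep up with arrivals
  def delayAfter(batches: Int, intervalMs: Long, processMs: Long): Long =
    math.max(0L, processMs - intervalMs) * batches

  def main(args: Array[String]): Unit = {
    println(delayAfter(10, 1000L, 1200L)) // falling behind 200 ms per batch
    println(delayAfter(10, 1000L, 800L))  // processing keeps up, no backlog
  }
}
```

The unbounded growth in the first case is the "bottleneck and performance degradation" the slide warns about.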
8. Performance tuning at large
• Parallelizing data consumption from the input source
• In the case of Kafka - depends on the topic's partitions (creating multiple Kafka stream consumers)
• Parallelizing data processing (Spark partitions) should be balanced with the total number of cores and consumers
• Each Kafka consumer uses one core
• Cores available for processing = total #cores - #consumers
• Serialization - avoid Java object serialization - Kryo is recommended
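A typical way to apply the last bullet, assuming a standard SparkConf setup; the app name and `LogLine` class are illustrative placeholders for your own records, so this fragment is a sketch rather than drop-in code:

```scala
import org.apache.spark.SparkConf

// Switch Spark to Kryo serialization instead of default Java serialization.
// Registering classes up front avoids writing full class names with each record.
val conf = new SparkConf()
  .setAppName("streaming-app") // illustrative app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[LogLine])) // LogLine: hypothetical record class
```

Kryo generally produces smaller payloads and faster (de)serialization than Java serialization, which matters most for shuffled and cached data.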
Editor's notes
Objects in scope must be serializable - or instantiated within the function