Se ha denunciado esta presentación.
Se está descargando tu SlideShare. ×

So you think you can stream.pptx

Cargando en…3

Eche un vistazo a continuación

1 de 48 Anuncio

Más Contenido Relacionado

Presentaciones para usted (20)

Similares a So you think you can stream.pptx (20)


Más reciente (20)

So you think you can stream.pptx

  1. 1. So you think you can Stream? Vida Ha (@femineer) Prakash Chockalingam (@prakash573)
  2. 2. Who am I ? ● Solutions Architect @ Databricks ● Formerly Square, Reporting & Analytics ● Formerly Google, Mobile Web Search & Speech Recognition 2
  3. 3. Who am I ? ● Solutions Architect @ Databricks ● Formerly Netflix, Personalization Infrastructure 3
  4. 4. About Databricks Founded by creators of Spark in 2013 Cloud enterprise data platform - Managed Spark clusters - Interactive data science - Production pipelines - Data governance, security, …
  5. 5. Sign up for the Databricks Community Edition Learn Apache Spark for free ● Mini 6GB cluster for learning Spark ● Interactive notebooks and dashboards ● Online learning resources ● Public environment to share your work
  6. 6. Agenda • Introduction to Spark Streaming • Lifecycle of a Spark streaming app • Aggregations and best practices • Operationalization tips • Key benefits of Spark streaming
  7. 7. What is Spark Streaming?
  8. 8. Spark Streaming
  9. 9. How does it work? ● Receivers receive data streams and chops them in to batches. ● Spark processes the batches and pushes out the results
  10. 10. Word Count val context = new StreamingContext(conf, Seconds(1)) val lines = context.socketTextStream(...) Entry point Batch Interval DStream: represents a data stream
  11. 11. Word Count val context = new StreamingContext(conf, Seconds(1)) val lines = context.socketTextStream(...) val words = lines.flatMap(_.split(“ “)) Transformations: transform data to create new DStreams
  12. 12. Word Count val context = new StreamingContext(conf, Seconds(1)) val lines = context.socketTextStream(...) val words = lines.flatMap(_.split(“ “)) val wordCounts = => (x, 1)).reduceByKey(_+_) wordCounts.print() context.start() Print the DStream contents on screen Start the streaming job
  13. 13. Lifecycle of a streaming app
  14. 14. Execution in any Spark Application Spark Driver User code runs in the driver process YARN / Mesos / Spark Standalone cluster Tasks sent to executors for processing data Spark Executor Spark Executor Spark Executor Driver launches executors in cluster
  15. 15. Execution in Spark Streaming: Receiving data Executor Executor Driver runs receivers as long running tasks Receiver Data stream Driver Receiver divides stream into blocks and keeps in memory Data Blocks Blocks also replicated to another executor Data Blocks
  16. 16. Execution in Spark Streaming: Processing data Executor Executor Receiver Data Blocks Data Blocks results results Data store Every batch interval, driver launches tasks to process the blocks Driver
  17. 17. End-to-end view 18 t1 = ssc.socketStream(“…”) t2 = ssc.socketStream(“…”) t = t1.union(t2).map(…) t.saveAsHadoopFiles(…)…).foreach(…) t.filter(…).foreach(…) T U M T M FFE FE FE B U M B M F Input DStreams Output operations RDD Actions / Spark Jobs BlockRDDs DStreamGraph DAG of RDDs every interval DAG of stages every interval Stage 1 Stage 2 Stage 3 Streaming app Tasks every interval B U M B M F B U M B M F Stage 1 Stage 2 Stage 3 Stage 1 Stage 2 Stage 3 Spark Streaming JobScheduler + JobGenerator Spark DAGScheduler Spark TaskScheduler Executors YOU write this
  18. 18. Aggregations
  19. 19. Word count over a time window val wordCounts = wordStream.reduceByKeyAndWindow((x: Int, y:Int) => x+y, windowSize, slidingInterval) Reduces over a time window
  20. 20. Word count over a time window Scenario: Word count for the last 30 minutes How to optimize for good performance? ● Increase batch interval, if possible ● Incremental aggregations with inverse reduce function val wordCounts = wordStream.reduceByKeyAndWindow( (x: Int, y:Int) => x+y, (x: Int, y: Int) => x-y, windowSize, slidingInterval) ● Checkpointing wordStream.checkpoint(checkpointInterval)
  21. 21. Stateful: Global Aggregations Scenario: Maintain a global state based on the input events coming in. Ex: Word count from beginning of time. updateStateByKey (Spark 1.5 and before) ● Performance is proportional to the size of the state. mapWithState (Spark 1.6+) ● Performance is proportional to the size of the batch.
  22. 22. Stateful: Global Aggregations
  23. 23. Stateful: Global Aggregations Key features of mapWithState: ● An initial state - Read from somewhere as a RDD ● # of partitions for the state - If you have a good estimate of the size of the state, you can specify the # of partitions. ● Partitioner - Default: Hash partitioner. If you have a good understanding of the key space, then you can provide a custom partitioner ● Timeout - Keys whose values are not updated within the specified timeout period will be removed from the state.
  24. 24. Stateful: Global Aggregations (Word count) val stateSpec = StateSpec.function(updateState _) .initialState(initialRDD) .numPartitions(100) .partitioner(MyPartitioner()) .timeout(Minutes(120)) val wordCountState = wordStream.mapWithState(stateSpec)
  25. 25. Stateful: Global Aggregations (Word count) def updateState(batchTime: Time, key: String, value: Option[Int], state: State[Long]) : Option[(String, Long)] Current batch time A Word in the input stream Current value (= 1) Counts so far for the word The word and its new count
  26. 26. Operationalization
  27. 27. Operationalization Three main themes ● Checkpointing ● Achieving good throughput ● Debugging a streaming job
  28. 28. Checkpoint Two types of checkpointing: ● Checkpointing Data ● Checkpointing Metadata
  29. 29. Checkpoint Data ● Checkpointing DStreams • Primarily needed to cut long lineage on past batches (updateStateByKey/reduceByKeyAndWindow). • Example: wordStream.checkpoint(checkpointInterval)
  30. 30. Checkpoint Metadata ● Checkpointing Metadata • All the configuration, DStream operations and incomplete batches are checkpointed. • Required for failure recovery if the driver process crashes. • Example: –streamingContext.checkpoint(directory) –StreamingContext.getActiveOrCreate(directory, ...)
  31. 31. Achieving good throughput context.socketStream(...) .map(...) .filter(...) .saveAsHadoopFile(...) Problem: There will be 1 receiver which receives all the data and stores it in its executor and all the processing happens on that executor. Adding more nodes doesn’t help.
  32. 32. Achieving good throughput Solution: Increase the # of receivers and union them. ● Each receiver is run in 1 executor. Having 5 receivers will ensure that the data gets received in parallel in 5 executors. ● Data gets distributed in 5 executors. So all the subsequent Spark map/filter operations will be distributed val numStreams = 5 val inputStreams = (1 to numStreams).map(i => context. socketStream(...)) val fullStream = context.union(inputStreams)
  33. 33. Achieving good throughput ● In the case of direct receivers (like Kafka), set the appropriate # of partitions in Kafka. ● Each kafka partition gets mapped to a Spark partition. ● More partitions in Kafka = More parallelism in Spark
  34. 34. Achieving good throughput ● Provide the right # of partitions based on your cluster size for operations causing shuffles. => (x, 1)).reduceByKey(_+_, 100) # of partitions
  35. 35. Debugging a Streaming application Streaming tab in Spark UI
  36. 36. Debugging a Streaming application Processing Time ● Make sure that the processing time < batch interval
  37. 37. Debugging a Streaming application
  38. 38. Debugging a Streaming application Batch Details Page: ● Input to the batch ● Jobs that were run as part of the processing for the batch
  39. 39. Debugging a Streaming application Job Details Page ● DAG Visualization ● Stages of a Spark job
  40. 40. Debugging a Streaming application Task Details Page Ensure that the tasks are executed on multiple executors (nodes) in your cluster to have enough parallelism while processing. If you have a single receiver, sometimes only one executor might be doing all the work though you have more than one executor in your cluster.
  41. 41. Key benefits of Spark streaming
  42. 42. Dynamic Load Balancing
  43. 43. Fast failure and Straggler recovery
  44. 44. Combine Batch and Stream Processing Join data streams with static data sets val dataset = sparkContext.hadoopFile(“file”) … kafkaStream.transform{ batchRdd => batchRdd.join(dataset).filter(...) }
  45. 45. Combine ML and Stream Processing Learn models offline, apply them online val model = KMeans.train(dataset, …) { event => model.predict(event.feature) }
  46. 46. Combine SQL and Stream Processing inputStream.foreachRDD{ rdd => val df = SQLContext.createDataframe(rdd) }
  47. 47. Thank you.