Cheat Sheet - Spark Performance Tuning

  1. Spark UI (monitor and inspect jobs).
  2. Level of parallelism (Clusters will not be fully utilized unless the level of parallelism for each operation is high enough. Spark automatically sets the number of partitions of an input file according to its size; for distributed shuffles such as groupByKey and reduceByKey, it uses the largest parent RDD's number of partitions. You can pass the level of parallelism as a second argument to an operation. In general, 2-3 tasks per CPU core in your cluster are recommended. That said, tasks that are too small are not advisable either, because scheduling and running each task carries some overhead. As a rule of thumb, tasks should take at least 100 ms to execute.)
  3. Reduce working set size (Operations like groupByKey can fail badly when their working set is huge. The best way to deal with this is to increase the level of parallelism so that each task's working set is smaller.)
  4. Avoid groupByKey for associative operations (use operations that can combine, such as reduceByKey).
  5. Multiple disks (Give Spark multiple disks for intermediate persistence. This is done via a setting in the ResourceManager.)
  6. Degree of parallelism (~2 to 3 times the number of cores on the worker nodes).
  7. Performance by language (Scala > Java >> Python > R).
  8. Higher-level APIs are better (use DataFrames for core processing, MLlib for machine learning, Spark SQL for queries, and GraphX for graph processing).
  9. Avoid collecting large RDDs (use take or takeSample).
  10. Use DataFrames (they are more efficient and use the Catalyst optimizer).
  11. Use the "provided" scope in Maven to avoid packaging all the dependencies.
  12. Filter first, shuffle next.
  13. Cache after hard work.
  14. Spark Streaming: enable backpressure (This throttles the rate at which messages are read from Kafka when the processing time exceeds the batch interval and the scheduling delay keeps increasing.)
  15. If using Kafka, choose the direct Kafka approach.
  16. Extend the Catalyst optimizer's code to add or modify rules.
  17. Improve shuffle performance:
      a. Enable LZF or Snappy compression (for shuffle).
      b. Enable Kryo serialization.
      c. Keep shuffle data small (use reduceByKey or filter before the shuffle).
      d. No shuffle block can be greater than 2 GB in size; otherwise Spark throws a "Size exceeds Integer.MAX_VALUE" exception. Spark uses ByteBuffer for shuffle blocks, and ByteBuffer is limited by Integer.MAX_VALUE (2 GB). Ideally, each partition should hold roughly 128 MB.
      e. Think about partitioning/bucketing ahead of time.
      f. Do as much as possible with a single shuffle.
  18. Use cogroup (instead of rdd.flatMap.join.groupBy).
  19. Spend time reading the RDD lineage graph (a handy way is to read RDD.toDebugString()).
  20. Optimize join performance:
      a. Use salting to avoid skewed keys. Skewed data sets are those where data is not distributed evenly: one or a few partitions hold a huge amount of data compared with the other partitions.
         i. Change the (regular key) to (concatenate(regular key, ":", random number)).
         ii. Once this is done, first do the join operation on the salted keys, then do the operation on the unsalted keys.
      b. Use partitionBy(new HashPartitioner(numPartitions)).
  21. Use caching (Instead of MEMORY_ONLY, use MEMORY_ONLY_SER. This has better GC behavior for larger datasets.)
  22. Always cache after repartition.
  23. A map after partitionBy will lose the partitioner information. Use mapValues instead.
  24. Speculative execution (enable speculative execution to tackle stragglers).
  25. Coalesce or repartition to avoid massive partitions (smaller partitions work better).
  26. Use broadcast variables.
  27. Use Kryo serialization (It is more compact and faster than Java serialization. Kryo is only supported in RDD caching and shuffling, not in serialize-to-disk operations like saveAsObjectFile.)
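
The sizing rules of thumb above (2-3 tasks per core, roughly 128 MB per partition) can be turned into a quick back-of-the-envelope calculation. The following is a minimal sketch in plain Python, not Spark code; the helper name `suggest_partitions` and its parameters are hypothetical and only illustrate the arithmetic.

```python
import math

MB = 1024 * 1024


def suggest_partitions(total_bytes, total_cores,
                       target_partition_mb=128, tasks_per_core=2):
    """Hypothetical helper: pick a partition count from data size and cores."""
    # Enough partitions that each one holds roughly target_partition_mb.
    by_size = math.ceil(total_bytes / (target_partition_mb * MB))
    # Enough partitions that every core gets tasks_per_core tasks.
    by_cores = total_cores * tasks_per_core
    return max(by_size, by_cores, 1)


# 100 GB of input on a cluster with 40 cores in total.
print(suggest_partitions(100 * 1024 * MB, total_cores=40))  # 800
```

The resulting number is the kind of value one might pass as the second argument to a shuffle operation such as reduceByKey, or to repartition. It is a starting point to check against the Spark UI, not a guarantee.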
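
To see why the list prefers combining operations such as reduceByKey over groupByKey for associative work, it can help to count how many records would cross the shuffle in each case. The sketch below simulates this in plain Python with lists standing in for partitions; it is an illustration of the idea only, and the function names are hypothetical rather than Spark APIs.

```python
from collections import defaultdict


def shuffle_group_by_key(partitions):
    # groupByKey-style: every (key, value) record crosses the shuffle.
    shuffled = [record for part in partitions for record in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return {k: sum(vs) for k, vs in grouped.items()}, len(shuffled)


def shuffle_reduce_by_key(partitions):
    # reduceByKey-style: combine inside each partition first, so at most
    # one record per key per partition crosses the shuffle.
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1), ("c", 1)],
]
print(shuffle_group_by_key(partitions))   # ({'a': 4, 'b': 3, 'c': 1}, 8)
print(shuffle_reduce_by_key(partitions))  # ({'a': 4, 'b': 3, 'c': 1}, 5)
```

Both approaches give the same totals, but the combining version moves 5 records instead of 8. The gap grows with the number of repeated keys per partition.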
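
Several of the items above (Kryo serialization, shuffle compression, speculative execution, streaming backpressure) are switched on through Spark configuration properties. As a rough sketch, the snippet below collects the corresponding property names in a Python dict and renders them as `spark-submit --conf` arguments. The helper `to_submit_args` and the script name `my_job.py` are hypothetical, and the property values should be checked against the configuration documentation for your Spark version.

```python
# Settings named in the cheat sheet, expressed as Spark configuration properties.
TUNING_CONF = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.shuffle.compress": "true",
    "spark.io.compression.codec": "snappy",  # "lzf" is the other codec named above
    "spark.speculation": "true",
    "spark.streaming.backpressure.enabled": "true",
}


def to_submit_args(conf):
    """Hypothetical helper: render a dict as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# my_job.py is a placeholder for the application being submitted.
print(" ".join(["spark-submit"] + to_submit_args(TUNING_CONF) + ["my_job.py"]))
```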
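
The salting step described above (regular key becomes "regular key:random number") can be illustrated without a cluster. The sketch below is plain Python, not Spark: `partition_of` is a hypothetical stand-in for a hash partitioner (it uses CRC32 so the result is deterministic), and the data set is made up to contain one hot key.

```python
import random
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    # Deterministic stand-in for a hash partitioner (assumption: CRC32 hashing).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def salt(key, num_salts, rng):
    # (regular key) -> "regular key:random number"
    return f"{key}:{rng.randrange(num_salts)}"


def unsalt(salted_key):
    # Recover the regular key for the final, unsalted operation.
    return salted_key.rsplit(":", 1)[0]


def max_partition_load(keys, num_partitions):
    loads = Counter(partition_of(k, num_partitions) for k in keys)
    return max(loads.values())


rng = random.Random(0)
# Skewed data: one hot key accounts for 9000 of 10000 records.
keys = ["hot"] * 9000 + [f"user{i}" for i in range(1000)]
salted = [salt(k, num_salts=16, rng=rng) for k in keys]

# Unsalted: all 9000 "hot" records land on the same partition.
print(max_partition_load(keys, 8))
# Salted: the hot key is split across up to 16 salted keys, so the
# largest partition is much smaller.
print(max_partition_load(salted, 8))
```

In a real join, the other side of the join would need one copy of each row per salt value so that the salted keys still match. That replication cost is the trade-off for spreading the hot key.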