
Spark / Mesos Cluster Optimization

Recipes we applied to optimize our Spark / Mesos Cluster.



  1. 1. HAYSSAM SALEH — Spark / Mesos Cluster Optimization — Paris Spark Meetup, April 28th 2016 @Criteo
  2. 2. PROJECT CONTEXT — architecture diagram: source data (clicks, navigation, analytics tags, reviews, events, support, accounts) flows into a DataLake; a Spark job builds a DataMart (contribution, visitor, session, search, banner [LR] [FD] [click]).
  3. 3. Interactive Discovery — Elasticsearch DataMart (KPI)
  4. 4. Interactive Discovery — Elasticsearch DataMart (KPI)
  5. 5. OBJECTIVES — 100 GB of data / day => 50 million requests — 40 TB / year — How we went from a 4-hour job on 6 nodes (8 cores & 32 GB per node) to a 20-minute job on 4 nodes (8 cores & 8 GB per node)
  6. 6. SUMMARY • Spark concepts • Spark UI offline • Application optimization — Shuffling — Partitioning — Closures • Parameters optimization — Spark shuffling — Mesos application distribution • Elasticsearch optimization — Google “elasticsearch performance tuning” -> blog.ebiznext.com
  7. 7. SPARK CONCEPTS • Application — the main program (driver code) • Job — one round trip Driver -> Cluster, triggered by a Spark action • Stage — delimited by a shuffle boundary • Task — a thread working on a single RDD partition (one task per partition) • Partition — RDDs are split into partitions; a partition is the unit of work for a task • Executor — a system process hosting tasks. Hierarchy: one application runs n jobs, a job runs n stages, a stage runs n tasks (see the sketch below).
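A minimal sketch of those concepts in code, not taken from the deck (the input path and master URL are placeholders): one action triggers one job, and the shuffle introduced by reduceByKey splits that job into two stages.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ConceptsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("concepts").setMaster("local[2]"))

    val counts = sc.textFile("input.txt")   // the RDD is split into partitions
      .flatMap(_.split(" "))                // narrow transformation: same stage
      .map(word => (word, 1))               // narrow transformation: same stage
      .reduceByKey(_ + _)                   // shuffle boundary: a new stage starts here

    counts.count()                          // action: triggers the job (one task per partition)
    sc.stop()
  }
}
```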
  8. 8. SPARK UI OFFLINE — spark-env.sh, on the driver (see the sketch below)
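The slide only names spark-env.sh on the driver; the usual way to browse the Spark UI offline is to persist the event log and replay it with the history server. A minimal sketch, with the log directory as an assumption:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Driver-side settings: record UI events so they can be replayed later.
val conf = new SparkConf()
  .setAppName("offline-ui-demo")
  .set("spark.eventLog.enabled", "true")              // write the event log
  .set("spark.eventLog.dir", "hdfs:///spark-events")  // assumed log location

val sc = new SparkContext(conf)
// After the job finishes, start the history server (sbin/start-history-server.sh)
// with spark.history.fs.logDirectory pointing at the same directory,
// for example via SPARK_HISTORY_OPTS in spark-env.sh.
```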
  9. 9. APPLICATION
  10. 10. JOBS
  11. 11. STAGES
  12. 12. STAGES
  13. 13. TASKS
  14. 14. SHUFFLING APPLICATION OPTIMIZATION
  15. 15. WHY OPTIMIZE SHUFFLING — Distance between data & CPU, duration (scaled): L1 cache ≈ 1 second, RAM ≈ 3 minutes, node-to-node communication ≈ 3 days
  16. 16. TRANSFORMATIONS LEADING TO SHUFFLING • repartition • cogroup • ...join • ...ByKey • distinct
  17. 17. SHUFFLING OPTIMIZATION 1/2 — groupByKey: every (k, v) pair is shuffled across the network so that all values of a key land on the same node before being combined, e.g. (k1, [v1, v4, v5]) then v1+v4+v5. Heavy shuffling; bonus: OOM risk when a key has many values.
  18. 18. SHUFFLING OPTIMIZATION 2/2 — reduceByKey: values are first combined locally on each partition, e.g. (k1, v4+v5), and only the partial results are shuffled before the final combine (k1, v1+v4+v5). Less shuffling (see the sketch below).
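A minimal sketch of the difference described on slides 17 and 18, reusing the sc context from the earlier sketch; the data is illustrative.

```scala
// Count occurrences per key: groupByKey ships every value across the network,
// reduceByKey combines values locally on each partition first.
val pairs = sc.parallelize(Seq(("k1", 1), ("k2", 1), ("k1", 1), ("k3", 1)))

// Heavy shuffle, OOM-prone: all values of a key are collected before summing.
val withGroup = pairs.groupByKey().mapValues(_.sum)

// Map-side combine: only one partial sum per key and partition is shuffled.
val withReduce = pairs.reduceByKey(_ + _)

withReduce.collect().foreach(println)
```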
  19. 19. OPTIMIZATIONS BROUGHT BY SPARK SQL — Drawbacks: weak typing; limited Java support; schema inference only for case classes (requires scala.Product)
  20. 20. DATASETS – EXPERIMENTAL — Scala & Java — upside: API similar to RDD, strongly typed (see the sketch below)
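A minimal sketch of the typed Dataset API as it looked in Spark 1.6; the Visit case class and its fields are made up for illustration.

```scala
import org.apache.spark.sql.SQLContext

// scala.Product (a case class) => schema is inferred automatically.
case class Visit(visitor: String, pages: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val ds = Seq(Visit("a", 3), Visit("b", 7)).toDS()   // Dataset[Visit]
val heavy = ds.filter(_.pages > 5)                  // lambda over the typed object
heavy.show()
```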
  21. 21. REDUCE OBJECT SIZE
  22. 22. CLOSURES
  23. 23. CLOSURES — Indirect reference: referencing a field inside a closure drags the whole enclosing object into it; copy the field to a local variable instead (see the sketch below)
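A minimal sketch of the indirect-reference problem, using a hypothetical Scaler class; not from the deck.

```scala
// Referencing a field inside the lambda captures the whole enclosing object.
class Scaler(val factor: Int) extends Serializable {

  // Bad: `factor` is really `this.factor`, so `this` is serialized and
  // shipped to every task.
  def scaleIndirect(rdd: org.apache.spark.rdd.RDD[Int]) =
    rdd.map(_ * factor)

  // Better: copy the field into a local val; only the Int is captured.
  def scaleLocal(rdd: org.apache.spark.rdd.RDD[Int]) = {
    val localFactor = factor
    rdd.map(_ * localFactor)
  }
}
```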
  24. 24. RIGHT PARTITIONING
  25. 25. PARTITIONING • Symptoms — a large number of short-lived tasks • Solutions — review your implementation — define a custom partitioner — control the number of partitions with coalesce and repartition (see the sketch below)
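A minimal sketch of those solutions; the paths, key extraction, and partition counts are illustrative assumptions.

```scala
import org.apache.spark.HashPartitioner

val events = sc.textFile("events", minPartitions = 400)

// After a selective filter, many partitions are nearly empty:
val errors = events.filter(_.contains("ERROR"))

// coalesce shrinks the partition count without a full shuffle;
// repartition(n) would reshuffle the data into exactly n partitions.
val compact = errors.coalesce(20)

// Custom partitioning for keyed data: co-locate records with the same key.
val byVisitor = events
  .map(line => (line.takeWhile(_ != ';'), line))
  .partitionBy(new HashPartitioner(64))
```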
  26. 26. CHOOSE THE RIGHT SERIALIZATION ALGORITHM
  27. 27. REDUCE OBJECT SIZE • Kryo serialization — space reduced by up to 10x — performance gain of 2x to 3x. 1. Require Spark to use Kryo serialization 2. Register classes so Kryo writes short ids instead of full class names => smaller objects 3. Force all serialized classes to be registered. Not applicable to the Dataset API, which uses Encoders instead (see the sketch below).
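The three steps above expressed as SparkConf settings; the registered classes are the hypothetical ones from the earlier sketches.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // 1. Use Kryo instead of Java serialization
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // 2. Register classes so Kryo writes a numeric id instead of the full class name
  .registerKryoClasses(Array(classOf[Visit], classOf[Scaler]))
  // 3. Fail fast on any class that was not registered
  .set("spark.kryo.registrationRequired", "true")
```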
  28. 28. DATASETS PROMISE
  29. 29. MEMORY TUNING
  30. 30. SPARK MEMORY MANAGEMENT – SPARK 1.6.X — executor memory layout: 300 MB reserved for Spark; user data gets 1 - spark.memory.fraction; Spark data (storage + shuffle fractions) gets spark.memory.fraction = 0.75, split by spark.memory.storageFraction = 0.5. Example with a 16 GB executor: 11.78 GB for Spark data, 3.92 GB for user data.
  31. 31. SPARK MEMORY MANAGEMENT – SPARK 1.6.X • Storage fraction — hosts cached RDDs — hosts broadcast variables — data in this fraction is subject to eviction • Shuffling fraction — holds intermediate shuffle data — spills to disk when full — data in this fraction cannot be evicted by other threads
  32. 32. SPARK MEMORY MANAGEMENT – SPARK 1.6.X • The shuffle and storage fractions may borrow memory from each other, under conditions: — the shuffling fraction cannot grow beyond its defined size if the storage fraction is using all of its own memory — the storage fraction cannot evict data from the shuffling fraction, even when the latter has grown beyond its defined size (see the sketch below)
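Restating the slide's numbers as configuration; a minimal sketch, the values shown are simply the Spark 1.6 defaults made explicit.

```scala
// With a 16 GB executor heap:
//   usable      = 16 GB - 300 MB reserved           ≈ 15.7 GB
//   Spark data  = 0.75 * usable                     ≈ 11.78 GB
//   user data   = 0.25 * usable                     ≈ 3.92 GB
//   storage     = 0.5 * Spark data (soft boundary)  ≈ 5.89 GB
val memConf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "16g")
  .set("spark.memory.fraction", "0.75")        // Spark data vs user data
  .set("spark.memory.storageFraction", "0.5")  // storage vs shuffle, soft limit
```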
  33. 33. THE SHUFFLE MANAGER — spark.shuffle.manager
  34. 34. HASH SHUFFLE MANAGER — concurrent tasks per executor = spark.executor.cores / spark.task.cpus (default 1); spark.executor.cores applies to YARN & standalone only. Each map task writes one file per reducer partition to the local filesystem (spark.local.dir): file 1, file 2, file ...
  35. 35. SORT SHUFFLE MANAGER — same task count per executor (spark.executor.cores / spark.task.cpus, YARN & standalone only). Each map task appends records to an append-only map, sorts them by reducer id and spills to disk (spark.shuffle.spill = true), producing a single file per map task on the local filesystem (spark.local.dir), indexed by reducer id (id1, id2, id3, ...).
  36. 36. SHUFFLE MANAGER • spark.shuffle.manager = hash — performs better when the number of mappers/reducers is small (generates M * R files) • spark.shuffle.manager = sort — performs better with a large number of mappers/reducers • Best of both worlds — spark.shuffle.sort.bypassMergeThreshold • spark.shuffle.manager = tungsten-sort ??? — similar to sort, with the following benefits: off-heap allocation; works directly on serialized binary objects (much smaller than JVM objects), aka memcopy; uses 8 bytes per record for the sort => takes advantage of the CPU caches (see the sketch below)
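A minimal configuration sketch for these settings; note they apply to Spark 1.x (sort is the default since 1.2, and spark.shuffle.manager was removed in Spark 2.0), and the threshold value shown is the default.

```scala
val shuffleConf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.manager", "sort")
  // With at most this many reducers and no map-side aggregation, the sort
  // manager falls back to writing hash-style per-reducer files:
  .set("spark.shuffle.sort.bypassMergeThreshold", "200")
```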
  37. 37. SPARK ON MESOS
  38. 38. COARSE GRAINED MODE • Upside: static resource allocation. Architecture diagram: the Spark driver (CoarseMesosSchedulerBackend) registers with the Mesos master; each Mesos worker hosts one long-lived CoarseGrainedExecutorBackend executor that runs the Spark tasks.
  39. 39. COARSE GRAINED MODE • Downsides: only one executor per node, and no control over the number of executors. Figure: instead of 4 workers of 8 cores / 16 GB each (32 cores / 64 GB), the job may end up on 2 workers of 16 cores / 16 GB each (32 cores / 32 GB).
  40. 40. FINE GRAINED MODE • Downsides: only one executor per node, no guarantee of resource availability, and no control over the number of executors. Architecture diagram: the Spark driver (MesosSchedulerBackend) registers with the Mesos master; each Mesos worker hosts a MesosExecutorBackend executor, and every Spark task is scheduled as a Mesos task (spark.mesos.mesosExecutor.cores, default 1).
  41. 41. SOLUTION 1: USE MESOS ROLES • Limit the number of cores per node on each Mesos worker • Assign a Mesos role to your Spark job • Downside: must be configured for each Spark application. Figure: four workers of 8 cores / 16 GB each (see the sketch below).
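A minimal sketch of a role-based setup; the role name "spark", the resource amounts, and the core cap are assumptions for illustration.

```scala
// On each Mesos worker, resources are reserved for the role (agent flag,
// not a Spark setting), for example:
//   --resources='cpus(spark):8;mem(spark):8192'
//
// On the Spark side, consume only offers made to that role:
val roleConf = new org.apache.spark.SparkConf()
  .set("spark.mesos.role", "spark")   // accept offers reserved for the role
  .set("spark.cores.max", "32")       // cap total cores for this application
```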
  42. 42. SOLUTION 2: DYNAMIC ALLOCATION (COARSE GRAINED MODE ONLY) • Principle — dynamically add and remove Spark executors • Requires a dedicated external shuffle service to serve shuffled data, spawned on each Mesos worker through Marathon (see the sketch below)
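A minimal sketch of the application-side settings; the executor bounds are illustrative, and the external shuffle service (MesosExternalShuffleService) is assumed to already be running on every worker, e.g. launched via Marathon as the slide describes.

```scala
val dynConf = new org.apache.spark.SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")      // shuffle files served externally
  .set("spark.dynamicAllocation.minExecutors", "2")  // illustrative bounds
  .set("spark.dynamicAllocation.maxExecutors", "8")
```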
  43. 43. QUESTIONS ?
