
Improving Spark SQL at LinkedIn


Improving Spark SQL usability and computing efficiency is one of the missions of LinkedIn's Spark team. In this talk, we present the Spark SQL ecosystem and roadmap at LinkedIn and introduce the highlighted projects we are working on, such as:
* Improving Dataset performance with automated column pruning
* Bringing an efficient 2D join algorithm to Spark SQL
* Fixing join skew with adaptive execution
* Enhancing the cost-based optimizer with a history-based learning approach


Improving Spark SQL at LinkedIn

  1. Improving Spark SQL at LinkedIn. Fangshi Li, Staff Software Engineer, LinkedIn
  2. Agenda: 1) Automated column pruning for Dataset; 2) 2D partitioned join; 3) Adaptive Execution; 4) Cost-based optimizer
  3. Spark SQL adoption at LinkedIn: 60% of the jobs running on our cluster are Spark jobs. Of those, ⅔ use Spark SQL and ⅓ use the RDD API. Among the Spark SQL jobs, ⅔ use the DataFrame/SQL API and ⅓ use the Dataset API.
  4. Goals: enable computations that could not be completed before; make every job run faster.
  5. Spark SQL roadmap at LinkedIn: three levels of optimization. Operator-level: Dataset ser/de, joins. Plan-level: Adaptive Execution, CBO. Cluster-level: multi-query optimization.
  6. Agenda: 1) Automated column pruning for Dataset; 2) 2D partitioned join; 3) Adaptive Execution; 4) Cost-based optimization (CBO)
  7. Dataset performance. `val ds: Dataset[TrackingEvent] = createDataset(); val ds2 = ds.filter(x => x.id > 0).map(x => (x.id, x.key))`. Dataset has performance issues due to: 1) excessive conversion overhead between JVM objects and Spark's internal row format; 2) no column pruning for ORC/Parquet.
  8. Solutions. Apple (Spark + AI Summit 2019 talk, "Bridging the Gap Between Datasets and DataFrames"): use a bytecode analyzer to convert the user lambda functions into SQL expressions, e.g. `x.id > 0` -> `isLargerThan(col("id"), Literal(0))`. LinkedIn: use a bytecode analyzer to find out which columns the user lambdas actually read, and prune the columns that are not needed from `val ds: Dataset[TrackingEvent] = createDataset(); val ds2 = ds.filter(x => x.id > 0).map(x => (x.id, x.key))`. This gives a big performance boost for ORC/Parquet, since the column pruning can be pushed down to the readers.
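The pruning idea above can be sketched in plain Scala (no Spark): a bytecode analyzer would discover that the lambdas only read `id` and `key`, so every other column can be dropped before records are deserialized. The names below (`neededColumns`, the map-based rows) are illustrative assumptions, not LinkedIn's implementation.

```scala
// Columns the (hypothetical) bytecode analyzer found in the user lambdas.
val neededColumns = Set("id", "key")

// Rows as column-name -> value maps, standing in for ORC/Parquet records;
// "payload" is a wide column the lambdas never touch.
val rows: Seq[Map[String, Any]] = Seq(
  Map("id" -> 1L, "key" -> "a", "payload" -> ("x" * 1000)),
  Map("id" -> -2L, "key" -> "b", "payload" -> ("y" * 1000))
)

// Prune before deserializing: only the needed columns are materialized.
val pruned = rows.map(_.filter { case (col, _) => neededColumns(col) })

// The original filter/map pipeline still produces the same answer.
val result = pruned
  .filter(r => r("id").asInstanceOf[Long] > 0)
  .map(r => (r("id").asInstanceOf[Long], r("key").asInstanceOf[String]))
// result == Seq((1L, "a"))
```

In a real reader the pruning happens inside the ORC/Parquet scan, so the wide `payload` bytes are never read off disk at all.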
  9. Agenda: 1) Automated column pruning for Dataset; 2) 2D partitioned join; 3) Adaptive Execution; 4) Cost-based optimization (CBO)
  10. A recommendation use case at LinkedIn: 1) the pair-feature table joins with the viewer-feature table; 2) the intermediate result joins with the entity-feature table; 3) each joined record is scored with an ML model; 4) the top N entities are ranked for each viewer.
  11. Exploding intermediate data: can we perform the 3-way join and the scoring in a single step, without materializing the exploding intermediate data?
  12. 2D partitioned join: partition the left, right, and pair tables into M, N, and M*N partitions respectively; the left and pair tables are sorted within each partition. For each partition of the pair table: join with the left table using a sort-merge join, then join with the right table using a shuffle-hash join. For each joined record, perform the scoring right away and output the scorable; finally, rank the scorables.
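The partitioning scheme above can be simulated on in-memory collections. This is an assumed sketch of the idea, not LinkedIn's code: the pair table is bucketed by both join keys into an M x N grid, so each grid cell only ever needs one left partition and one right partition, and no cell's data explodes.

```scala
val M = 2 // partitions of the left (viewer) table
val N = 3 // partitions of the right (entity) table

def part(key: Int, buckets: Int): Int = math.floorMod(key, buckets)

val left  = Seq((1, "viewerA"), (2, "viewerB"))   // (viewerId, viewerFeature)
val right = Seq((10, "entityX"), (11, "entityY")) // (entityId, entityFeature)
val pairs = Seq((1, 10, 0.5), (2, 11, 0.9))       // (viewerId, entityId, pairFeature)

// Left into M partitions, right into N, pair into M*N grid cells.
val leftParts  = left.groupBy { case (v, _) => part(v, M) }
val rightParts = right.groupBy { case (e, _) => part(e, N) }
val pairCells  = pairs.groupBy { case (v, e, _) => (part(v, M), part(e, N)) }

// Per cell: join the left side (sort-merge in Spark), then the right side
// (shuffle-hash), and emit the joined record for immediate scoring.
val scored = pairCells.toSeq.flatMap { case ((i, j), cell) =>
  val lMap = leftParts.getOrElse(i, Seq.empty).toMap
  val rMap = rightParts.getOrElse(j, Seq.empty).toMap
  cell.flatMap { case (v, e, f) =>
    for (lf <- lMap.get(v); rf <- rMap.get(e)) yield (v, e, f)
  }
}
```

Each of the M*N cells is an independent task, which is what lets the 3-way join and scoring run as one step.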
  13. Before: 10+ hours. After: 1 hour.
  14. Agenda: 1) Automated column pruning for Dataset; 2) 2D hash-partitioned join; 3) Adaptive Execution (AE); 4) Cost-based optimization (CBO)
  15. Adaptive Execution (AE) at LinkedIn: optimize the query plan while the job is running (SPARK-23128). Handling data skew in joins works great. Converting a shuffle-based join to a broadcast join at runtime requires running the shuffle map stage before the conversion, which raises the question: should we use Adaptive Execution to optimize the join plan at runtime?
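The skew handling mentioned above can be sketched as follows. This is an assumed, simplified model of the technique (not Spark's AE code): a skewed shuffle partition is split into sub-partitions, and the matching partition of the other side is replicated for each split, turning one straggler task into several balanced tasks.

```scala
// One shuffle partition dominated by hot key 7, plus the matching
// right-side partition (names and sizes are illustrative).
val maxRowsPerTask = 2
val skewedLeft  = Seq((7, "a"), (7, "b"), (7, "c"), (7, "d"))
val rightForKey = Seq((7, "R"))

// Split the skewed partition into balanced sub-partitions.
val splits = skewedLeft.grouped(maxRowsPerTask).toSeq // 2 sub-partitions

// Each sub-partition joins against a full copy of the right-side partition,
// so no joined row is lost and no single task handles all the hot-key rows.
val joined = splits.flatMap { sub =>
  sub.flatMap { case (k, lv) =>
    rightForKey.collect { case (`k`, rv) => (k, lv, rv) }
  }
}
```

The trade-off the slide points at is that this only works after the shuffle map stage has run, because only then are the real partition sizes known.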
  16. Agenda: 1) Automated column pruning for Dataset; 2) 2D hash-partitioned join; 3) Adaptive Execution; 4) Cost-based optimization (CBO)
  17. CBO (cost-based optimizer): the CBO in Spark can optimize the query plan based on operator costs (data size, number of records). Benefits: choosing the best join strategy (broadcast vs. shuffle-hash vs. sort-merge) and multi-join reordering.
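The join-strategy choice can be sketched as a simple cost rule. The thresholds and names below are illustrative assumptions (loosely mirroring Spark's `spark.sql.autoBroadcastJoinThreshold` idea), not Spark's actual planner code.

```scala
sealed trait JoinStrategy
case object Broadcast   extends JoinStrategy
case object ShuffleHash extends JoinStrategy
case object SortMerge   extends JoinStrategy

// Pick a strategy from the estimated sizes of the two sides (in bytes).
def chooseJoin(leftBytes: Long, rightBytes: Long,
               broadcastThreshold: Long = 10L * 1024 * 1024,
               hashThreshold: Long = 512L * 1024 * 1024): JoinStrategy = {
  val smaller = math.min(leftBytes, rightBytes)
  if (smaller <= broadcastThreshold) Broadcast   // small side fits on every executor
  else if (smaller <= hashThreshold) ShuffleHash // small side fits in one task's hash table
  else SortMerge                                 // most scalable fallback
}
```

This is exactly where stats quality matters: an estimate that is wrong by orders of magnitude can push the planner across a threshold into a much worse strategy.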
  18. CBO (cost-based optimizer): the native CBO in Spark has a usability issue. It requires detailed statistics (count, min, max, distinct count, histograms) to be available for the input datasets, which in turn requires scheduled jobs to compute stats on all datasets, and that is very expensive.
  19. CBO (cost-based optimizer): can we learn the stats from history? Yes!
  20. Learning-based CBO: eliminate the CBO's dependency on pre-computed stats by learning the stats from job histories. This is a general approach that can benefit all SQL engines.
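The instance-based flavor of this idea (in the spirit of the LEO reference on the next slide) can be sketched as a feedback loop: after each run, record the observed cardinality under a plan fingerprint, and the next time the same fingerprint is planned, use the observed value instead of a static estimate. All names here are illustrative assumptions, not LinkedIn's design.

```scala
import scala.collection.mutable

// History store: plan fingerprint -> observed row count.
val observedStats = mutable.Map.empty[String, Long]

def fingerprint(table: String, filter: String): String = s"$table|$filter"

// Prefer a learned stat over the static default estimate.
def estimateRows(table: String, filter: String, default: Long): Long =
  observedStats.getOrElse(fingerprint(table, filter), default)

// After a job finishes, feed the actual cardinality back into the store.
def recordRun(table: String, filter: String, actualRows: Long): Unit =
  observedStats(fingerprint(table, filter)) = actualRows

// First run: no history, so we fall back to the default estimate.
val first = estimateRows("events", "country = 'US'", default = 1000000L)
recordRun("events", "country = 'US'", actualRows = 42000L)
// Second run: the learned stat replaces the default.
val second = estimateRows("events", "country = 'US'", default = 1000000L)
```

The model-based flavor generalizes this by training a model that can also estimate fingerprints it has never seen.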
  21. Learning-based CBO. Approach 1: instance-based learning (ref: "LEO: DB2's Learning Optimizer"). Approach 2: model-based learning (ref: "SageDB: A Learned Database System").
  22. Learning-based CBO vs. no-CBO
  23. Summary: 1) Automated column pruning for Dataset; 2) 2D partitioned join; 3) Adaptive Execution; 4) History-based CBO (cost-based optimizer)
  24. Thank you
