Apache Spark 3.0: Overview of What’s New and Why Care

Continuing with the objectives to make Spark faster, easier, and smarter, Apache Spark 3.0 extends its scope with more than 3,000 resolved JIRAs. We will talk about the exciting new developments in Spark 3.0 as well as some other initiatives that are coming in the future. In this talk, we want to share with the Bogotá Spark community an overview of Spark 3.0 features and enhancements.

In particular, we will touch upon the following areas:

* Performance Improvement Features
* Improved Usability Features
* ANSI SQL Compliance
* Pandas UDFs
* Compatibility and migration considerations
* Spark Ecosystem: Delta Lake, Project Hydrogen, and Project Zen

Apache Spark 3.0: Overview of What’s New and Why Care

  1. 1. What's New in Apache Spark 3.0 & Why Care? Good evening, friends of Bogotá Spark! Jules S. Damji, Databricks Apache Spark Bogota Meetup September 23, 2020
  2. 2. About Me: Jules S. Damji, Senior Developer Advocate @ Databricks. Joined Databricks in 2016. 20+ years as a software engineer at companies including Sun, Netscape, VeriSign, @Home, LoudCloud/Opsware, and Hortonworks.
  3. 3. Unified data analytics platform for data science, data engineering, and business analytics to solve tough data problems Original creators of popular data and machine learning open source projects Global company with 5,000 customers and 450+ partners
  4. 4. Performance: Adaptive Query Execution, Dynamic Partition Pruning, Query Compilation Speedup, Join Hints. Richer APIs: Accelerator-aware Scheduler, Built-in Functions, pandas UDF Enhancements, DELETE/UPDATE/MERGE in Catalyst. SQL Compatibility: Reserved Keywords, Proleptic Gregorian Calendar, ANSI Store Assignment, Overflow Checking. Built-in Data Sources: Parquet/ORC Nested Column Pruning, Parquet Nested Column Filter Pushdown, CSV Filter Pushdown, New Binary Data Source. Extensibility and Ecosystem: Data Source V2 API + Catalog Support, Java 11 Support, Hadoop 3 Support, Hive 3.x Metastore, Hive 2.3 Execution. Monitoring and Debuggability: Structured Streaming UI, DDL/DML Enhancements, Observable Metrics, Event Log Rollover.
  5. 5. 3400+ Resolved JIRAs in Spark 3.0 Blog
  6. 6. Agenda. Performance: Spark 3.0 comes with performance improvements to make Spark faster, cheaper, and more flexible. Usability: Spark is easier to use. Compatibility Considerations: view notable compatibility/behavior changes. Spark Ecosystem: learn about developments in Delta Lake, Project Hydrogen, and Project Zen.
  7. 7. Performance: achieve high performance for interactive, batch, streaming, and ML workloads. Adaptive Query Execution, Dynamic Partition Pruning, Join Hints. Blog
  8. 8. Spark Catalyst Optimizer. Spark 1.x: rule-based. Spark 2.x: rule + cost. Spark 3.0: rule + cost + runtime.
  9. 9. Optimization in Spark 2.x Blog
  10. 10. Adaptive Query Execution: based on statistics of the finished plan nodes, re-optimize the execution plan of the remaining queries. ▪ Dynamically switch join strategies ▪ Dynamically coalesce shuffle partitions ▪ Dynamically optimize skew joins (adaptive planning)
  11. 11. Performance Pitfall: using the wrong join strategy. Choose broadcast hash join? ▪ Increase "spark.sql.autoBroadcastJoinThreshold"? ▪ Use the "broadcast" hint? However: hard to tune, hard to maintain over time, OOM…
  12. 12. Adaptive Query Execution. Vision: no more manual setting of broadcast hints/thresholds! Capability: convert a sort-merge join (SMJ) into a broadcast hash join (BHJ) at runtime. [Diagram: the statically estimated size is 15 MB, but the actual size at runtime is 8 MB, so the finished shuffle stages feed a new BHJ plan that uses local shuffle reads.]
  13. 13. Performance Pitfall Tuning spark.sql.shuffle.partitions ▪ Default magic number: 200 !?! However ▪ Too small: GC pressure; disk spilling ▪ Too large: Inefficient I/O; scheduler pressure ▪ Hard to tune over the whole query plan ▪ Hard to maintain over time Choosing the wrong shuffle partition number
  14. 14. Adaptive Query Execution. VISION: No more manual tuning of spark.sql.shuffle.partitions! Capability: coalesce shuffle partitions. [Diagram: a stage planned with 50 shuffle partitions is coalesced to 5 partitions after it executes.] Set the initial partition number (200 or X) high enough to accommodate the largest data size of the entire query execution; partitions are automatically coalesced, if needed, after each query stage.
  15. 15. Performance Pitfall: data skew. Anybody dealt with data skew while running Spark jobs? Symptoms of data skew: ▪ Frozen/long-running tasks ▪ Disk spilling ▪ Low resource utilization in most nodes ▪ OOM. Workarounds today: ▪ Find the skew values and rewrite the queries ▪ Adding extra skew keys…
  16. 16. Adaptive Query Execution Data Skew
  17. 17. Adaptive Query Execution VISION: No more manual tuning of skew hints!
  18. 18. AQE Configuration Settings. AQE is not enabled by default: set spark.sql.adaptive.enabled to true to use the features below (all available since version 3.0.0).
     ▪ spark.sql.adaptive.coalescePartitions.enabled (default: true): when this and spark.sql.adaptive.enabled are true, Spark coalesces contiguous shuffle partitions according to the target size (specified by spark.sql.adaptive.advisoryPartitionSizeInBytes), to avoid too many small tasks.
     ▪ spark.sql.adaptive.coalescePartitions.minPartitionNum (default: default parallelism): the minimum number of shuffle partitions after coalescing. If not set, the default value is the default parallelism of the Spark cluster. Only takes effect when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled.
     ▪ spark.sql.adaptive.coalescePartitions.initialPartitionNum (default: 200): the initial number of shuffle partitions before coalescing. By default it equals spark.sql.shuffle.partitions. Only takes effect when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled.
     ▪ spark.sql.adaptive.advisoryPartitionSizeInBytes (default: 64 MB): the advisory size in bytes of a shuffle partition during adaptive optimization (when spark.sql.adaptive.enabled is true). Takes effect when Spark coalesces small shuffle partitions or splits a skewed shuffle partition.
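  To make these settings concrete, here is a minimal sketch (not from the deck) of enabling AQE when building a session. The app name is illustrative; the property names are those listed above, plus spark.sql.adaptive.skewJoin.enabled for the skew-join optimization.

      from pyspark.sql import SparkSession

      # Minimal sketch: turn on AQE and its sub-features explicitly.
      spark = (
          SparkSession.builder
          .appName("aqe-demo")  # illustrative app name
          .config("spark.sql.adaptive.enabled", "true")                       # AQE master switch (off by default in 3.0)
          .config("spark.sql.adaptive.coalescePartitions.enabled", "true")    # coalesce small shuffle partitions
          .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")  # target post-shuffle partition size
          .config("spark.sql.adaptive.skewJoin.enabled", "true")              # split skewed partitions at runtime
          .getOrCreate()
      )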
  19. 19. Adaptive Query Execution
  20. 20. Performance: achieve high performance for interactive, batch, streaming, and ML workloads. Adaptive Query Execution, Dynamic Partition Pruning, Join Hints.
  21. 21. Dynamic Partition Pruning • Skip scanning partitions based on the query results of other query fragments. • Important for star-schema queries. • Significant speedups in TPC-DS.
  22. 22. Static Partition Pruning. SELECT * FROM Sales WHERE store_id = 5. Most optimizations employ simple static partition pruning. [Diagram: basic data flow with the filter pushed down to the scan of partitioned files with multi-columnar data.]
  23. 23. A Common Workload: star-schema queries. SELECT * FROM Sales JOIN Stores WHERE Stores.city = 'New York'. ● Static pruning cannot be applied ● The filter acts only on the small dimension table (Stores), not on the larger fact table (Sales). [Diagram: Scan Sales and Scan Stores feed a Join, with the city filter on the Stores side.]
  24. 24. Table Denormalization. SELECT * FROM Sales JOIN Stores WHERE Stores.city = 'New York'. [Diagram: Scan Sales joined with a filtered Scan of Stores (city = 'New York').]
  25. 25. Dynamic Partition Pruning: physical plan optimization. [Diagram: the broadcast exchange produced by the file scan with the dimension filter is reused as a dynamic filter on the fact-table file scan, feeding a broadcast hash join over partitioned files with multi-columnar data.]
  26. 26. Dynamic Partition Pruning 60 / 102 TPC-DS queries: a speedup between 2x and 18x
  27. 27. Dynamic Partition Pruning Configuration. Dynamic partition pruning is enabled by default. ▪ spark.sql.optimizer.dynamicPartitionPruning.enabled (default: true): when true, generate a predicate for the partition column when it is used as a join key. Available since version 3.0.0.
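  As a minimal sketch (not from the deck) of the star-schema case from the previous slides: assume a fact table stored under /data/sales and partitioned by store_id, and a small stores dimension table (paths and column names are illustrative). With the default setting, the filter on the dimension side generates a runtime predicate on the fact table's partition column.

      sales = spark.read.parquet("/data/sales")    # fact table, partitioned by store_id (illustrative path)
      stores = spark.read.parquet("/data/stores")  # small dimension table (illustrative path)

      ny_sales = (
          sales.join(stores, sales.store_id == stores.store_id)
               .where(stores.city == "New York")
      )
      # With dynamic partition pruning, only the store_id partitions that survive the
      # dimension filter are scanned; the fact-table scan shows a dynamicpruningexpression.
      ny_sales.explain()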
  28. 28. Performance: achieve high performance for interactive, batch, streaming, and ML workloads. Adaptive Query Execution, Dynamic Partition Pruning, Join Hints.
  29. 29. Optimizer Hints ▪ Join hints influence the optimizer to choose a join strategy: ▪ Broadcast hash join ▪ Sort-merge join NEW ▪ Shuffle hash join NEW ▪ Shuffle nested loop join NEW ▪ Should be used with extreme caution. ▪ Difficult to manage over time.
  30. 30. Join Strategies. Sort-Merge: most robust, handles any data size; needs to shuffle and sort; can be slow when the table size is small. Broadcast Hash: requires one side to be small; no shuffle or sort; very fast. Shuffle Hash: needs to shuffle, but no sort; can handle large tables; will OOM if data is skewed. Shuffle Nested Loop: doesn't require join keys.
  31. 31. ▪ Broadcast Hash Join SELECT /*+ BROADCAST(a) */ id FROM a JOIN b ON a.key = b.key ▪ Sort-Merge Join SELECT /*+ MERGE(a, b) */ id FROM a JOIN b ON a.key = b.key ▪ Shuffle Hash Join SELECT /*+ SHUFFLE_HASH(a, b) */ id FROM a JOIN b ON a.key = b.key ▪ Shuffle Nested Loop Join SELECT /*+ SHUFFLE_REPLICATE_NL(a, b) */ id FROM a JOIN b How to Use SQL Join Hints?
  32. 32. Join Hint Syntax: the shuffle merge hint, shown in both SQL and Python.
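  For reference, a minimal sketch (not a slide from the deck) of the DataFrame-side equivalents of the SQL hints above; a and b are hypothetical DataFrames that share a key column, and each expression returns a new DataFrame.

      a.join(b.hint("broadcast"), "key")             # broadcast hash join
      a.join(b.hint("merge"), "key")                 # sort-merge join
      a.join(b.hint("shuffle_hash"), "key")          # shuffle hash join
      a.join(b.hint("shuffle_replicate_nl"), "key")  # shuffle-and-replicate nested loop join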
  33. 33. Usability and Richer APIs: enable new use cases and simplify Spark application development. Formatted Explain, SQL Engine, pandas UDF enhancements.
  34. 34. Spark SQL: Old Explain How many of you have scratched your heads reading this?
  35. 35. Spark SQL: New EXPLAIN FORMATTED. Header: basic operator tree for the execution plan. Footer: each operator with additional attributes.
  36. 36. EXPLAIN FORMATTED SELECT * FROM tab1 WHERE key = (SELECT max(key) FROM tab2 WHERE val > 5)

      * Project (4)
      +- * Filter (3)
         +- * ColumnarToRow (2)
            +- Scan parquet default.tab1 (1)

      (1) Scan parquet default.tab1
      Output [2]: [key#5, val#6]
      Batched: true
      Location: InMemoryFileIndex [file:/user/hive/warehouse/tab1]
      PushedFilters: [IsNotNull(key)]
      ReadSchema: struct<key:int,val:int>

      (2) ColumnarToRow [codegen id : 1]
      Input [2]: [key#5, val#6]

      (3) Filter [codegen id : 1]
      Input [2]: [key#5, val#6]
      Condition : (isnotnull(key#5) AND (key#5 = Subquery scalar-subquery#27, [id=#164]))

      (4) Project [codegen id : 1]
      Output [2]: [key#5, val#6]
      Input [2]: [key#5, val#6]
  37. 37. DataFrame.explain(mode). Modes: simple, extended, codegen, formatted.

      query = """SELECT * FROM tab1
                 WHERE key = (SELECT max(key) FROM tab2 WHERE val > 5)"""
      df = spark.sql(query)
      df.explain(mode="formatted")
  38. 38. Usability and Richer APIs: enable new use cases and simplify Spark application development. pandas UDF enhancements, Structured Streaming.
  39. 39. Pandas UDFs (a.k.a. Vectorized UDFs): Spark 2.3 style vs. Spark 3.0 style with Python type hints.
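  A minimal sketch (not from the deck) of the change: in Spark 3.0 the pandas UDF variant is inferred from Python type hints rather than from a PandasUDFType argument. The column and function names are illustrative, and a SparkSession named spark is assumed.

      import pandas as pd
      from pyspark.sql.functions import pandas_udf

      # Spark 3.0 style: a scalar pandas UDF declared with Python type hints.
      # (The Spark 2.3 style passed PandasUDFType.SCALAR explicitly instead.)
      @pandas_udf("long")
      def plus_one(v: pd.Series) -> pd.Series:
          return v + 1

      df = spark.range(5).toDF("x")
      df.select(plus_one("x")).show()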
  40. 40. Pandas UDFs Pandas Function APIs - Grouped Map
  41. 41. Pandas Function APIs. Supported function APIs include: Grouped Map, Map, and Co-grouped Map. Spark + AI Session, Blog.
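  As a minimal sketch (not from the deck) of the grouped-map Pandas Function API, with illustrative data and column names:

      import pandas as pd

      df = spark.createDataFrame(
          [("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 5.0)], ["key", "value"]
      )

      def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
          # Runs once per group, on a pandas DataFrame holding that group's rows.
          pdf["value"] = pdf["value"] - pdf["value"].mean()
          return pdf

      df.groupBy("key").applyInPandas(subtract_mean, schema="key string, value double").show()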
  42. 42. Usability and Richer APIs: enable new use cases and simplify Spark application development. Structured Streaming UI.
  43. 43. Improved Web UI: Structured Streaming Tab
  44. 44. Improved Web UI: Structured Streaming Get real-time metrics via the structured streaming tab including: ▪ Input rate ▪ Process Rate ▪ Input rows ▪ Batch duration ▪ Operation duration ▪ 2 minute window display Documentation
  45. 45. Structured Streaming UI
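  The tab only has something to show while a query is running; here is a minimal sketch (not from the deck) that produces metrics using the built-in rate source and console sink.

      # Toy streaming query: its input rate, process rate, and batch/operation
      # durations show up on the Structured Streaming tab of the Spark UI.
      query = (
          spark.readStream.format("rate").option("rowsPerSecond", 10).load()
               .writeStream.format("console").start()
      )
      # query.stop() when done.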
  46. 46. Compatibility and Migration Considerations: improve the plug-in interface and extend the deployment environments. Hive 3.x Metastore, Hive 2.3 Execution, Hadoop 3 Support, Java 11 Support.
  47. 47. Spark 3.0 Builds • Only builds with Scala 2.12 • Deprecates Python 2 (already EOL) • Can build with various Hadoop/Hive versions – Hadoop 2.7 + Hive 1.2 – Hadoop 2.7 + Hive 2.3 (supports Java 11) [Default] – Hadoop 3.2 + Hive 2.3 (supports Java 11) • Supports the following Hive metastore versions: – "0.12", "0.13", "0.14", "1.0", "1.1", "1.2", "2.0", "2.1", "2.2", "2.3", "3.0", "3.1"
  48. 48. The Apache Spark Ecosystem
  49. 49. A New Standard for Building Data Lakes. A new approach to building data lakes: • Open format based on Parquet, with ACID transactions • Adds reliability, data quality, and performance to data lakes • Brings the best of data warehousing and data lakes • Based on open source and an open format (Parquet) • Enabled by Apache Spark
  50. 50. Challenges with data lakes. How many of you have built data lakes? 1. Hard to append data: adding newly arrived data leads to incorrect reads. 2. Modification of existing data is difficult: GDPR/CCPA require making fine-grained changes to existing data lakes, which is very costly with Spark. 3. Jobs failing midway: half of the data appears in the data lake, the rest is missing.
  51. 51. Challenges with data lakes 4. Real-time operations hard – mixing streaming and batch leads to inconsistency. 5. Costly to keep historical versions of the data – regulated environments require reproducibility, auditing, and governance. 6. Difficult to handle large metadata – for large data lakes the metadata itself becomes difficult to manage.
  52. 52. Challenges with data lakes. 7. "Too many files" problems: data lakes are not great at handling millions of small files. 8. Fine-grained access control is difficult: enforcing enterprise-wide role-based access control on the data is hard.
  53. 53. Challenges with data lakes. 9. Hard to get great performance: partitioning the data for performance is error-prone and difficult to change. 10. Data quality issues: hard to ensure that all the data is correct and has the right quality.
  54. 54. Challenges with data lakes (recap): 1. Hard to append data 2. Modification of existing data difficult 3. Jobs failing midway 4. Real-time operations hard 5. Costly to keep historical data versions 6. Difficult to handle large metadata 7. Poor performance 8. "Too many files" problem 9. Fine-grained access control difficult 10. Data quality issues. Addressed by ACID transactions and Spark under the hood: auto-indexing, fine-grained ACLs and RBAC, schema enforcement and evolution.
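  A minimal sketch (not from the deck) of the basic Delta Lake workflow from Spark, assuming the Delta Lake package is on the classpath and /tmp/delta/events is an illustrative path:

      df = spark.range(0, 100)

      # ACID write: readers never see a half-written table.
      df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

      # Read it back like any other data source.
      spark.read.format("delta").load("/tmp/delta/events").count()

      # Time travel: read an earlier version for reproducibility and auditing.
      spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events").show()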
  55. 55. From Data Lakes -> Delta Lake -> Lakehouse VLDB Conference paper.
  56. 56. Delta Lake Connectors Standardize your big data storage with an open format accessible from various tools Amazon Redshift Amazon Athena
  57. 57. https://delta.io and https://databricks.com/diving-into-delta-lake-talks
  58. 58. Project Hydrogen
  59. 59. What is Project Hydrogen? The goal of Project Hydrogen is to enable first-class support for all distributed ML frameworks https://vimeo.com/274267107
  60. 60. Incompatible Execution Models. Spark: ▪ Tasks are independent ▪ Parallel and massively scalable ▪ If a task crashes, rerun it. Distributed ML frameworks: ▪ Complete coordination among tasks ▪ Optimized for communication ▪ If a task crashes, rerun all tasks.
  61. 61. Barrier Execution Mode (Spark 2.4) ▪ Since 2.4, gang scheduling has been implemented on top of the MapReduce execution model ▪ Gang scheduling enables barrier execution mode Stage 1: Data Prep (embarrassingly parallel) Stage 2: Dist ML training (gang scheduled) Stage 3: Data Sink (embarrassingly parallel)
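  A minimal sketch (not from the deck) of barrier execution mode: the ML hand-off is left as a comment, the partition count is illustrative, and the cluster needs at least as many task slots as partitions for the barrier stage to start.

      from pyspark import BarrierTaskContext

      def train_partition(iterator):
          ctx = BarrierTaskContext.get()
          ctx.barrier()  # all tasks in the stage start together and synchronize here
          # ... hand this partition off to a distributed ML framework here (illustrative) ...
          yield (ctx.partitionId(), sum(1 for _ in iterator))

      rdd = spark.sparkContext.parallelize(range(1000), 4)
      print(rdd.barrier().mapPartitions(train_partition).collect())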
  62. 62. Accelerator Aware Scheduling (Spark 3.0) - Motivation ▪ Deep learning workloads often use GPUs or other accelerators to speed up processing on large datasets ▪ Popular cluster managers YARN and Kubernetes support GPUs ▪ Spark 2.x can support those cluster managers, but is not aware of available GPUs and cannot request or schedule them
  63. 63. Accelerator Aware Scheduling in Spark 3.0 ▪ Used to accelerate special workloads like deep learning and signal processing ▪ Supports Standalone, YARN, and Kubernetes ▪ Supports GPUs ▪ Required resources are specified by configuration, so works only at the application level Future work: ▪ Support TPU, FPGA, etc ▪ Support job/stage/task level resource allocation
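  A minimal sketch (not from the deck) of reading the accelerators assigned to a task, assuming the application was submitted with GPU resource configs such as spark.executor.resource.gpu.amount and spark.task.resource.gpu.amount (plus a discovery script where the cluster manager requires one):

      from pyspark import TaskContext

      def which_gpus(_):
          ctx = TaskContext.get()
          gpus = ctx.resources()["gpu"].addresses  # GPU addresses assigned to this task
          # ... pin the deep learning framework to `gpus` here (illustrative) ...
          return gpus

      print(spark.sparkContext.parallelize(range(2), 2).map(which_gpus).collect())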
  64. 64. Web UI for accelerators
  65. 65. Project Zen : PySpark Improvements
  66. 66. • Redesigning PySpark documentation • PySpark type hints • Visualization • Standardized warnings and exceptions Blog How many of you have scratched your head looking at a PySpark stack trace?
  67. 67. THE FREE VIRTUAL EVENT FOR DATA TEAMS ● Three days of sessions, keynotes, training and demos ● Catch up on rapid advances and best practices in Apache Spark™, Delta Lake, MLflow and Redash ● Network with more than 20,000 data professionals from across Europe and around the world
  68. 68. Thank you for your support & contributions! Happy 10th Birthday! Acknowledgements to Xiao Li and Doug Bateman
