This document summarizes a performance analysis by the Barcelona Supercomputing Center comparing Apache Spark and Presto in cloud environments using the TPC-DS benchmark. It finds that Databricks Spark was about 4x faster than AWS EMR Presto without table and column statistics, and about 3x faster with them. Databricks was also more cost-effective, with a more efficient runtime, caching, and query optimizer. EMR Presto required significantly more tuning, while Databricks and EMR Spark were easier to configure and offered interactive notebooks.
3. About BSC
The Barcelona Supercomputing Center (BSC) is the Spanish
national supercomputing facility and a top EU research institution,
established in 2005 by the Spanish government, the Catalan
government, and the UPC/BarcelonaTECH university.
The mission of BSC is to be at the service of the international
scientific community and of industry in need of HPC resources.
BSC's research lines are developed within the framework of
European Union research funding programmes, and the centre
also does basic and applied research in collaboration with
companies such as IBM, Microsoft, Intel, Nvidia, Repsol, and Iberdrola.
5. TPC-DS Benchmark Work
BSC collaborated with Databricks on benchmark comparisons
of large-scale analytics computations, using
the TPC-DS Toolkit v2.10.1rc3.
The Transaction Processing Performance Council (TPC)
Benchmark DS (1) has the objective of evaluating decision
support systems, which process large volumes of data in
order to provide answers to real-world business questions.
Our results are not official TPC Benchmark DS results.
Databricks provided BSC an account and credits, which
BSC then independently used for the benchmarking study
with other analytics products on the market.
The TPC is a non-profit corporation focused on developing
data-centric benchmark standards and disseminating
objective, verifiable performance data to the industry.
6. Context and motivation
• Need to adopt data analytics in a cost-effective
manner
– SQL still very relevant
– Open-source based analytics platforms
– On-demand computing resources from the Cloud
• Evaluate Cloud-based SQL engines
#UnifiedDataAnalytics #SparkAISummit
7. Systems Under Test (SUTs)
• Databricks Unified Analytics Platform
– Based on Apache Spark but with optimized
Databricks Runtime
– Notebooks for interactive development and
production Jobs
– JDBC and custom API access
– Delta storage layer supporting ACID transactions
8. Systems Under Test (SUTs)
• AWS EMR Presto
– Distributed SQL engine created by Facebook
– Connectors to non-relational and relational sources
– JDBC and CLI access
– Based on in-memory, pipelined parallel execution
• AWS EMR Spark
– Based on open-source Apache Spark
9. Plan
• TPC Benchmark DS
• Hardware and software configuration
• Benchmarking infrastructure
• Benchmark results and their analysis
• Usability and developer productivity
• Conclusions
10. TPC Benchmark DS
• Created around 2006 to evaluate decision
support systems
• Based on a retailer with several channels of
distribution
• Process large volumes of data to answer
real-world business questions
11. TPC Benchmark DS
• Snowflake schema: fact tables associated
with multiple dimension tables
• Data produced by data generator
• 99 queries of various types
– reporting
– ad hoc
– iterative
– data mining
20. Speedup with table and column stats
(Chart; annotation: CBO enabled: ↓ 0.60)
21. TPC-DS Power Test – geom. mean
22. TPC-DS Power Test – arith. mean
23. Additional configuration for Presto
Session configuration for all queries:
– query_max_stage_count: 102
– join_reordering_strategy: AUTOMATIC
– join_distribution_type: AUTOMATIC
Query-specific configuration parameters:
– Queries 5, 75, 78, and 80 — join_distribution_type: PARTITIONED
– Queries 78 and 85 — join_reordering_strategy: NONE
– Query 67 — task_concurrency: 32
– Query 18 — join_reordering_strategy: ELIMINATE_CROSS_JOINS
Query modifications (carried over to all systems):
– Query 72 — manual join re-ordering
– Query 95 — add DISTINCT clause
24. TPC-DS Power Test – Query 72
• Manually modified join order
catalog_sales ⋈ date_dim ⋈ date_dim ⋈ inventory ⋈ date_dim ⋈ warehouse ⋈ item
⋈ customer_demographics ⋈ household_demographics ⟕ promotion ⟕ catalog_returns
• Databricks optimized join order, no stats
Same as modified join order + pushed-down selections and projections
• Original benchmark join order
catalog_sales ⋈ inventory ⋈ warehouse ⋈ item ⋈ customer_demographics ⋈
household_demographics ⋈ date_dim ⋈ date_dim ⋈ date_dim ⟕ promotion ⟕ catalog_returns
25. TPC-DS Power Test – Query 72
• Databricks optimized join order with stats
(((((((catalog_sales ⋈ household_demographics) ⋈ date_dim) ⋈ customer_demographics) ⋈ item)
⋈ (((date_dim ⋈ date_dim) ⋈ inventory) ⋈ warehouse))
⟕ promotion) ⟕ catalog_returns) + pushed-down selections and projections
• EMR Spark optimized join order with stats and CBO enabled/disabled
Same as modified join order + pushed-down selections and projections,
but different physical plans
26. Dynamic data partitioning
• Splits a table based on the value of a particular column
– Split only the 7 largest tables by date surrogate keys
– One S3 bucket folder for each value
• Databricks and EMR Spark: limit the number of files per partition
• EMR Presto: out-of-memory error for the largest table
– Used Hive with Tez to load the data
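The partitioning scheme above can be pictured as Hive-style folder routing: every row lands in a folder named after its partition-column value, which is the layout Spark and Presto both read back from S3. The sketch below is illustrative only (TPC-DS table and column names, but the routing logic is not the engines' internals):

```python
from collections import defaultdict

def partition_rows(rows, column="ss_sold_date_sk", table="store_sales"):
    """Group rows into Hive-style partition folders: table/column=value/."""
    folders = defaultdict(list)
    for row in rows:
        folders[f"{table}/{column}={row[column]}/"].append(row)
    return dict(folders)

# Invented rows keyed by the date surrogate key:
rows = [
    {"ss_sold_date_sk": 2450816, "ss_item_sk": 1},
    {"ss_sold_date_sk": 2450816, "ss_item_sk": 2},
    {"ss_sold_date_sk": 2450817, "ss_item_sk": 3},
]
layout = partition_rows(rows)
# → two folders: store_sales/ss_sold_date_sk=2450816/ holds 2 rows,
#   store_sales/ss_sold_date_sk=2450817/ holds 1 row
```

A partition column with many distinct values multiplies folders (and, without a cap, files per folder), which is why the slide notes limiting files per partition on Databricks and EMR Spark.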
27. Benchmark exec. time (part. + stats)
Power Test: 2 failed queries
Throughput Test: 6 failed queries
30. TPC Benchmark DS metric
• The modified primary performance metric is

QphDS@SF = (SF × Q) / (T_LD × T_PT × T_TT)

where SF is the scale factor, Q is the number of weighted queries
(num. streams × 99), T_LD is the load factor
(0.1 × num. streams × load time), and T_PT and T_TT are the
Power Test and Throughput Test times.
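A worked example of the metric on this slide, with invented numbers; the formula is reconstructed from the slide's labels, so treat it as a sketch of this study's modified metric rather than the official TPC-DS definition:

```python
def qph_ds(scale_factor, num_streams, load_hours, power_hours, throughput_hours):
    """Modified QphDS@SF as labeled on the slide:
    SF * Q / (T_LD * T_PT * T_TT), where Q = num_streams * 99 weighted
    queries and T_LD = 0.1 * num_streams * load time."""
    q = num_streams * 99                      # num. weighted queries
    t_ld = 0.1 * num_streams * load_hours     # load factor
    return scale_factor * q / (t_ld * power_hours * throughput_hours)

# Invented example: SF=1000, 4 streams, 2 h load, 1 h power, 2 h throughput
print(qph_ds(1000, 4, 2.0, 1.0, 2.0))  # 247500.0
```

Note the metric rewards short load, Power Test, and Throughput Test times multiplicatively: halving any one of the three doubles the score.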
35. Disk utilization
• Databricks
– Automatically caches hot input data
– Requires machines with NVMe SSDs
• EMR Presto
– Experimental spilling of state to disk
– “we do not configure any of the Facebook
deployments to spill…local disks would increase
hardware costs…”
Raghav Sethi et al. Presto: SQL on Everything. ICDE 2019: 1802-1813
38. Usability and developer productivity

Feature | EMR Presto | EMR Spark | Databricks
Easy and flexible cluster creation | ✓ | ✓ | ✓
Framework configuration at cluster creation time | ✓ | ✓ | ✓
Direct distributed file system support | ✗ | ✗ | ✓
Independent data catalog (metastore) | ✓ | ✓ | ✓
Support for notebooks | ✓ | ✓ | ✓
Integrated Web GUI | ✗ | ✗ | ✓
39. Usability and developer productivity (cont.)

Feature | EMR Presto | EMR Spark | Databricks
JDBC access | ✓ | ✓ | ✓
Programmatic interface | ✗ | ✓ | ✓
Job creation and management infrastructure | ✗ | ✗ | ✓
Customized visualization of query plan execution | ✓ | ✓ | ✓
Resource utilization monitoring with Ganglia and CloudWatch | ✓ | ✓ | ✓
40. Conclusions
• Databricks is about 4x faster than EMR Presto without statistics
– About 3x faster with them
• Difference smaller with EMR Spark
– Databricks still more cost-effective
– More efficient runtime, cache, and cost-based optimizer (CBO)
• Databricks and EMR Spark deal better with concurrency and benefit from data partitioning
41. Conclusions
• EMR Presto requires significantly more tuning
– Minimal tuning for Databricks and EMR Spark
• SQL functionality of Databricks and EMR Presto/Spark is very similar
– Databricks more user-friendly in some aspects
42. DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT