Performance Analysis of Apache Spark and Presto in Cloud Environments
Víctor Cuevas-Vicenttín, Barcelona Supercomputing Center
#UnifiedDataAnalytics #SparkAISummit
About BSC
The Barcelona Supercomputing Center (BSC) is the Spanish
national supercomputing facility, and a top EU research institution,
established in 2005 by the Spanish government, the Catalan
government and the UPC/BarcelonaTECH university.
The mission of BSC is to be at the service of the international
scientific community and of industry in need of HPC resources.
BSC's research lines are developed within the framework of
European Union research funding programmes, and the centre
also does basic and applied research in collaboration with
companies like IBM, Microsoft, Intel, Nvidia, Repsol, and Iberdrola.
13.7 Petaflops
TPC-DS Benchmark Work
The BSC collaborated with Databricks to benchmark
comparisons on large-scale analytics computations, using
the TPC-DS Toolkit v2.10.1rc3
The Transaction Processing Performance Council (TPC)
Benchmark DS (1) has the objective of evaluating decision
support systems, which process large volumes of data in
order to provide answers to real-world business questions.
Our results are not official TPC Benchmark DS results.
Databricks provided BSC an account and credits, which
BSC then independently used for the benchmarking study
with other analytics products on the market.
The TPC is a non-profit
corporation focused on
developing data-centric
benchmark standards and
disseminating objective,
verifiable performance data to
the industry.
Context and motivation
• Need to adopt data analytics in a cost-effective
manner
– SQL still very relevant
– Open-source based analytics platforms
– On-demand computing resources from the Cloud
• Evaluate Cloud-based SQL engines
Systems Under Test (SUTs)
• Databricks Unified Analytics Platform
– Based on Apache Spark but with optimized
Databricks Runtime
– Notebooks for interactive development and
production Jobs
– JDBC and custom API access
– Delta storage layer supporting ACID transactions
Systems Under Test (SUTs)
• AWS EMR Presto
– Distributed SQL engine created by Facebook
– Connectors for non-relational and relational sources
– JDBC and CLI access
– Based on in-memory, pipelined parallel execution
• AWS EMR Spark
– Based on open-source Apache Spark
Plan
• TPC Benchmark DS
• Hardware and software configuration
• Benchmarking infrastructure
• Benchmark results and their analysis
• Usability and developer productivity
• Conclusions
TPC Benchmark DS
• Created around 2006 to evaluate decision
support systems
• Based on a retailer with several channels of
distribution
• Process large volumes of data to answer
real-world business questions
TPC Benchmark DS
• Snowflake schema: fact tables associated
with multiple dimension tables
• Data produced by data generator
• 99 queries of various types
– reporting
– ad hoc
– iterative
– data mining
TPC Benchmark DS
• Load Test (1 TB)
• Power Test
• Data Refresh
• Throughput Test
[Diagram: the Load Test converts .dat files to ORC or Parquet; the Power Test runs Query 1, Query 2, ..., Query 99; the Throughput Test runs n query streams, Query 1,1 ... Query 1,99 through Query n,1 ... Query n,99]
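The difference between the Power Test and the Throughput Test can be sketched in a few lines of Python. This is only an illustration of the two execution patterns, not the benchmarking client used in the study: `execute` is a hypothetical callable standing in for whatever submits SQL to the cluster, and the helper names are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_query(query_id, execute):
    """Run one query through the caller-supplied `execute` and time it."""
    start = time.perf_counter()
    execute(query_id)
    return query_id, time.perf_counter() - start


def power_test(query_ids, execute):
    """Power Test pattern: one stream, queries run one after another."""
    return [run_query(q, execute) for q in query_ids]


def throughput_test(streams, execute):
    """Throughput Test pattern: several streams run concurrently,
    each stream still running its own queries sequentially."""
    with ThreadPoolExecutor(max_workers=len(streams)) as pool:
        futures = [pool.submit(power_test, s, execute) for s in streams]
        return [f.result() for f in futures]


# Usage with a stand-in executor that only records which query was asked for.
submitted = []
queries = list(range(1, 100))            # Query 1 .. Query 99
power_times = power_test(queries, submitted.append)
stream_times = throughput_test([queries, queries[::-1]], submitted.append)
print(len(power_times), len(stream_times), len(submitted))
```

The per-stream query orderings in the usage lines are made up for the example.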
Hardware configuration
• Instance type: i3.2xlarge
– vCPUs: 8 (2.3 GHz Intel Xeon E5 2686 v4)
– Memory: 61 GiB
– Local storage: 1 x 1,900 GB NVMe SSD
• Cluster: 1 master node, 8 worker nodes
Software configuration
• Databricks: Runtime 5.5, Spark 2.4.3, Scala 2.11
– spark.sql.broadcastTimeout: 7200
– spark.sql.crossJoin.enabled: true
• EMR Presto: emr-5.26.0, Presto 0.220
– hive.allow-drop-table: true
– hive.compression-codec: SNAPPY
– hive.s3-file-system-type: PRESTO
– query.max-memory: 240 GB
• EMR Spark: emr-5.26.0, Spark 2.4.3
– spark.sql.broadcastTimeout: 7200
– spark.driver.memory: 5692M
[Diagram of the benchmarking infrastructure, with stages labeled client application, cluster execution, and analysis; elements shown: SQL, .dat, Parquet, ORC, JAR, AWS Glue Metastore, .log, .XLSX]
Benchmark execution time (base)
Cost-Based Optimizer (CBO) stats
• Collect table and column-level statistics to
create optimized query evaluation plans
– distinct count, min, max, null count
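As a rough illustration of what these column-level statistics contain, the sketch below computes them exactly for a single in-memory column. It is not how any of the tested engines gather statistics (real engines scan tables and may approximate the distinct count); the function name `column_stats` is an assumption made for the example.

```python
def column_stats(values):
    """Return distinct count, min, max and null count for one column.

    `None` stands for SQL NULL and is excluded from min, max and the
    distinct count.
    """
    non_null = [v for v in values if v is not None]
    return {
        "distinct_count": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "null_count": len(values) - len(non_null),
    }


# Made-up column values, purely for illustration.
print(column_stats([3, 1, None, 3, 7]))
```

An optimizer can use such numbers to estimate how many rows survive a filter or a join before choosing a plan.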
Benchmark execution time (stats)
CBO enabled: ↑ 27.11
Speedup with table and column stats
CBO enabled: ↓ 0.60
TPC-DS Power Test – geom. mean
TPC-DS Power Test – arith. mean
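The two Power Test summaries weight queries differently, which a small example makes concrete. The times below are hypothetical values chosen for illustration, not results from this study.

```python
from statistics import geometric_mean, mean

# Hypothetical per-query execution times in seconds, not measured results.
times = [2.0, 4.0, 8.0, 512.0]

arith = mean(times)            # pulled up strongly by the one long query
geom = geometric_mean(times)   # every query contributes by ratio, not size

print(f"arithmetic mean: {arith:.1f} s")
print(f"geometric mean:  {geom:.1f} s")
```

Reporting both helps separate systems that are slow across the board from systems that are held back by a handful of very long queries.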
Additional configuration for Presto
Query-specific configuration parameters
• Queries 5, 75, 78, and 80: join_distribution_type: PARTITIONED
• Queries 78 and 85: join_reordering_strategy: NONE
• Query 67: task_concurrency: 32
• Query 18: join_reordering_strategy: ELIMINATE_CROSS_JOINS
Session configuration for all queries
• query_max_stage_count: 102
• join_reordering_strategy: AUTOMATIC
• join_distribution_type: AUTOMATIC
Query modifications (carried over to all systems)
• Query 72: manual join re-ordering
• Query 95: add distinct clause
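One way to apply settings like these is to emit Presto `SET SESSION` statements before each query. The sketch below is an assumption about how that could be scripted, not the client code used in the study; the property names and values are the ones listed on the slide, while the dictionaries and the `session_statements` helper are hypothetical.

```python
# Session-wide defaults, as listed for all queries.
SESSION_DEFAULTS = {
    "query_max_stage_count": 102,
    "join_reordering_strategy": "AUTOMATIC",
    "join_distribution_type": "AUTOMATIC",
}

# Per-query overrides, keyed by TPC-DS query number.
QUERY_OVERRIDES = {
    5: {"join_distribution_type": "PARTITIONED"},
    75: {"join_distribution_type": "PARTITIONED"},
    78: {"join_distribution_type": "PARTITIONED",
         "join_reordering_strategy": "NONE"},
    80: {"join_distribution_type": "PARTITIONED"},
    85: {"join_reordering_strategy": "NONE"},
    67: {"task_concurrency": 32},
    18: {"join_reordering_strategy": "ELIMINATE_CROSS_JOINS"},
}


def session_statements(query_id):
    """Build the SET SESSION statements to issue before one query."""
    settings = {**SESSION_DEFAULTS, **QUERY_OVERRIDES.get(query_id, {})}
    statements = []
    for name, value in settings.items():
        # String-valued properties are quoted, numeric ones are not.
        literal = f"'{value}'" if isinstance(value, str) else str(value)
        statements.append(f"SET SESSION {name} = {literal}")
    return statements


for statement in session_statements(78):
    print(statement)
```

Keeping the overrides in one table makes the amount of per-query tuning easy to see at a glance.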
TPC-DS Power Test – Query 72
• Manually modified join order
catalog_sales ⋈ date_dim ⋈ date_dim ⋈ inventory ⋈ date_dim ⋈ warehouse ⋈ item ⋈ customer_demographics ⋈ household_demographics ⟕ promotion ⟕ catalog_returns
• Databricks optimized join order without stats
Same as modified join order + pushed down selections and projections
• Original benchmark join order
catalog_sales ⋈ inventory ⋈ warehouse ⋈ item ⋈ customer_demographics ⋈ household_demographics ⋈ date_dim ⋈ date_dim ⋈ date_dim ⟕ promotion ⟕ catalog_returns
TPC-DS Power Test – Query 72
• Databricks optimized join order with stats
(((((((catalog_sales ⋈ household_demographics) ⋈ date_dim) ⋈ customer_demographics) ⋈ item) ⋈ (((date_dim ⋈ date_dim) ⋈ inventory) ⋈ warehouse)) ⟕ promotion) ⟕ catalog_returns) + pushed down selections and projections
• EMR Spark optimized join order with stats and CBO enabled/disabled
Same as modified join order + pushed down selections and projections, but different physical plans
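Why the position of the date_dim and inventory joins matters for Query 72 can be illustrated with a toy estimate of intermediate result sizes. All row counts and join factors below are hypothetical and chosen only to show the effect; they are not TPC-DS cardinalities or measurements from this study, and `intermediate_sizes` is an illustrative helper, not an optimizer API.

```python
def intermediate_sizes(base_rows, joins):
    """Estimate the row count after each step of a left-deep join pipeline.

    `joins` is a list of (table_name, factor) pairs, where `factor` is the
    assumed ratio of output rows to input rows for that join.
    """
    sizes, rows = [], base_rows
    for _table, factor in joins:
        rows = rows * factor
        sizes.append(rows)
    return sizes


# Hypothetical factors: the inventory join multiplies rows, while the
# filtered date_dim join keeps only a small fraction of them.
inventory = ("inventory", 50.0)
date_dim = ("date_dim", 0.02)

late_filter = intermediate_sizes(1_000_000, [inventory, date_dim])
early_filter = intermediate_sizes(1_000_000, [date_dim, inventory])

print("inventory first:", late_filter)
print("date_dim first: ", early_filter)
```

Both orders end with the same number of rows, but joining the selective dimension first keeps the largest intermediate result far smaller, which is the kind of saving a manual or statistics-driven reordering aims for.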
Dynamic data partitioning
• Splits a table based on the value of a particular
column
– Split only 7 largest tables by date surrogate keys
– One S3 bucket folder for each value
• Databricks and EMR Spark: limit number of files
per partition
• EMR Presto: out of memory error for largest table
– Use Hive with Tez to load data
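The idea of one folder per partition value can be sketched without any cluster. The example assumes the common `column=value` folder naming and uses `ss_sold_date_sk`, the date surrogate key of the TPC-DS store_sales table; the bucket path, the sample rows, and the `partition_rows` helper are hypothetical.

```python
from collections import defaultdict


def partition_rows(rows, key, table_root):
    """Group rows by the value of `key`, one folder per distinct value."""
    partitions = defaultdict(list)
    for row in rows:
        folder = f"{table_root}/{key}={row[key]}/"
        partitions[folder].append(row)
    return dict(partitions)


# Made-up rows; only the partitioning column matters for the example.
rows = [
    {"ss_sold_date_sk": 2451545, "ss_item_sk": 1},
    {"ss_sold_date_sk": 2451546, "ss_item_sk": 2},
    {"ss_sold_date_sk": 2451545, "ss_item_sk": 3},
]

parts = partition_rows(rows, "ss_sold_date_sk", "s3://example-bucket/store_sales")
for folder, members in sorted(parts.items()):
    print(folder, len(members))
```

A query that filters on the date key then only needs to read the folders whose value matches, instead of scanning the whole table.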
Benchmark exec. time (part + stats)
Power Test: 2 failed queries
Throughput Test: 6 failed queries
Speedup with partitioning and stats
TPC Benchmark total execution time
TPC Benchmark DS metric
• The modified primary performance metric is
$$QphDS@SF = \frac{SF \cdot Q}{\sqrt[3]{T_{LD} \cdot T_{PT} \cdot T_{TT}}}$$
where
• SF: scale factor
• Q: num. weighted queries, num. streams x 99
• T_LD: load factor, 0.1 x num. streams x load time
• T_PT, T_TT: Power Test and Throughput Test times
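The metric can be evaluated with a few lines of Python. This is a sketch under stated assumptions: it takes the modified metric to be SF x Q divided by the cube root of the product of the load factor, the Power Test time and the Throughput Test time, with all times in hours. The function name and the sample inputs are hypothetical and are not results from this study.

```python
def qphds(scale_factor, num_streams, load_time_h, power_time_h, throughput_time_h):
    """Modified QphDS@SF, assuming the cube-root form with times in hours."""
    q = num_streams * 99                       # number of weighted queries
    t_ld = 0.1 * num_streams * load_time_h     # load factor
    product = t_ld * power_time_h * throughput_time_h
    return scale_factor * q / product ** (1 / 3)


# Hypothetical inputs: scale factor 1000 (1 TB), 4 streams.
score = qphds(scale_factor=1000, num_streams=4,
              load_time_h=2.5, power_time_h=2.0, throughput_time_h=4.0)
print(round(score))
```

Because the times sit under a root in the denominator, halving any single time raises the score by less than a factor of two.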
TPC Benchmark DS metric
System costs
System cost = num. nodes × node cost per hour × exec. time in hours
Node cost per hour = node hardware costs + node software costs
System Hardware Software
EMR Presto $0.624 $0.156
EMR Spark $0.624 $0.156
Databricks $0.624 $0.3
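The cost formula is simple enough to check by hand, and a short script makes the arithmetic explicit. The per-node hourly prices are the ones in the table; the node count of 9 (assuming the master is billed like the 8 workers) and the 2-hour execution time are hypothetical inputs, and `system_cost` is an illustrative helper.

```python
# (hardware, software) cost per node per hour in USD, from the table.
NODE_COST_PER_HOUR = {
    "EMR Presto": (0.624, 0.156),
    "EMR Spark": (0.624, 0.156),
    "Databricks": (0.624, 0.3),
}


def system_cost(system, num_nodes, exec_time_hours):
    """num. nodes x (hardware + software cost per hour) x exec. time in hours."""
    hardware, software = NODE_COST_PER_HOUR[system]
    return num_nodes * (hardware + software) * exec_time_hours


# Hypothetical run: 9 nodes for 2 hours on each system.
for name in NODE_COST_PER_HOUR:
    print(f"{name}: ${system_cost(name, 9, 2.0):.2f}")
```

With equal execution times the higher software price makes Databricks the most expensive, so any cost advantage has to come from shorter execution time.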
TPC Benchmark DS cost
TPC-DS price-performance
Disk utilization
• Databricks
– Automatically caches hot input data
– Requires machines with NVMe SSDs
• EMR Presto
– Experimental spilling of state to disk
– “we do not configure any of the Facebook
deployments to spill…local disks would increase
hardware costs…”
Raghav Sethi et al. Presto: SQL on Everything. ICDE 2019: 1802-1813
[Two chart slides with panels labeled Databricks and EMR Presto]
Usability and developer productivity
Feature | EMR Presto | EMR Spark | Databricks
Easy and flexible cluster creation | ✓ | ✓ | ✓
Framework configuration at cluster creation time | ✓ | ✓ | ✓
Direct distributed file system support | ✗ | ✗ | ✓
Independent data catalog (metastore) | ✓ | ✓ | ✓
Support for notebooks | ✓ | ✓ | ✓
Integrated Web GUI | ✗ | ✗ | ✓
JDBC access | ✓ | ✓ | ✓
Programmatic interface | ✗ | ✓ | ✓
Job creation and management infrastructure | ✗ | ✗ | ✓
Customized visualization of query plan execution | ✓ | ✓ | ✓
Resource utilization monitoring with Ganglia and CloudWatch | ✓ | ✓ | ✓
Conclusions
• Databricks is about 4x faster than EMR Presto
without statistics
– About 3x faster with them
• Difference smaller with EMR Spark
– Databricks still more cost-effective
– More efficient runtime, cache, and CBO
• Databricks and EMR Spark deal better with
concurrency and benefit from data partitioning
Conclusions
• EMR Presto requires significantly more tuning
– Minimal for Databricks and EMR Spark
• Functionality of Databricks and EMR
Presto/Spark for SQL very similar
– Databricks more user friendly in some aspects
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 

Último (20)

Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 

Performance Analysis of Apache Spark and Presto in Cloud Environments

  • 2. Performance Analysis of Apache Spark and Presto in Cloud Environments. Víctor Cuevas-Vicenttín, Barcelona Supercomputing Center. #UnifiedDataAnalytics #SparkAISummit
  • 3. About BSC. The Barcelona Supercomputing Center (BSC) is the Spanish national supercomputing facility, and a top EU research institution, established in 2005 by the Spanish government, the Catalan government and the UPC/BarcelonaTECH university. The mission of BSC is to be at the service of the international scientific community and of industry in need of HPC resources. BSC's research lines are developed within the framework of European Union research funding programmes, and the centre also does basic and applied research in collaboration with companies like IBM, Microsoft, Intel, Nvidia, Repsol, and Iberdrola.
  • 5. TPC-DS Benchmark Work. BSC collaborated with Databricks to benchmark comparisons on large-scale analytics computations, using the TPC-DS Toolkit v2.10.1rc3. The Transaction Processing Performance Council (TPC) Benchmark DS (1) has the objective of evaluating decision support systems, which process large volumes of data in order to provide answers to real-world business questions. Our results are not official TPC Benchmark DS results. Databricks provided BSC an account and credits, which BSC then independently used for the benchmarking study with other analytics products on the market. (1) The TPC is a non-profit corporation focused on developing data-centric benchmark standards and disseminating objective, verifiable performance data to the industry.
  • 6. Context and motivation
      • Need to adopt data analytics in a cost-effective manner
          – SQL still very relevant
          – Open-source based analytics platforms
          – On-demand computing resources from the Cloud
      • Evaluate Cloud-based SQL engines
  • 7. Systems Under Test (SUTs)
      • Databricks Unified Analytics Platform
          – Based on Apache Spark but with optimized Databricks Runtime
          – Notebooks for interactive development and production Jobs
          – JDBC and custom API access
          – Delta storage layer supporting ACID transactions
  • 8. Systems Under Test (SUTs)
      • AWS EMR Presto
          – Distributed SQL engine created by Facebook
          – Connectors for non-relational and relational sources
          – JDBC and CLI access
          – Based on in-memory, pipelined parallel execution
      • AWS EMR Spark
          – Based on open-source Apache Spark
  • 9. Plan
      • TPC Benchmark DS
      • Hardware and software configuration
      • Benchmarking infrastructure
      • Benchmark results and their analysis
      • Usability and developer productivity
      • Conclusions
  • 10. TPC Benchmark DS
      • Created around 2006 to evaluate decision support systems
      • Based on a retailer with several channels of distribution
      • Processes large volumes of data to answer real-world business questions
  • 11. TPC Benchmark DS
      • Snowflake schema: fact tables associated with multiple dimension tables
      • Data produced by a data generator
      • 99 queries of various types
          – reporting
          – ad hoc
          – iterative
          – data mining
  • 13. TPC Benchmark DS
      • Load Test (1 TB)
      • Power Test
      • Data Refresh
      • Throughput Test
      [Diagram: .dat files loaded into ORC and Parquet; a single sequence Query 1, Query 2, ..., Query 99; and n query streams, Query 1,1 ... Query 1,99 through Query n,1 ... Query n,99]
  • 14. Hardware configuration
      • Instance type: i3.2xlarge
      • vCPUs: 8 (2.3 GHz Intel Xeon E5 2686 v4)
      • Memory: 61 GiB
      • Local storage: 1 x 1,900 GB NVMe SSD
      • Cluster: 1 master node, 8 worker nodes
  • 15. Software configuration
      • Databricks: Runtime 5.5, Spark 2.4.3, Scala 2.11
          – spark.sql.broadcastTimeout: 7200
          – spark.sql.crossJoin.enabled: true
      • EMR Presto: emr-5.26.0, Presto 0.220
          – hive.allow-drop-table: true
          – hive.compression-codec: SNAPPY
          – hive.s3-file-system-type: PRESTO
          – query.max-memory: 240 GB
      • EMR Spark: emr-5.26.0, Spark 2.4.3
          – spark.sql.broadcastTimeout: 7200
          – spark.driver.memory: 5692M
  • 16. [Diagram; labels: client application, cluster, execution, analysis, SQL, JAR, .dat, Parquet, ORC, AWS Glue Metastore, .log, .XLSX]
  • 17. Benchmark execution time (base)
  • 18. Cost-Based Optimizer (CBO) stats
      • Collect table and column-level statistics to create optimized query evaluation plans
          – distinct count, min, max, null count
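The statistics collection described on this slide can be sketched with Spark SQL's `ANALYZE TABLE` statement, which gathers table-level statistics and, with `FOR COLUMNS`, column-level ones. The helper below only builds the statement strings; the function name and the choice of table and columns (`store_sales` and two of its TPC-DS key columns) are illustrative assumptions, not the commands the authors ran.

```python
def analyze_statements(table, columns):
    """Build Spark SQL statements that collect table- and column-level stats.

    The helper itself is hypothetical; only the ANALYZE TABLE syntax is Spark SQL.
    """
    # Table-level statistics (row count, size in bytes).
    statements = [f"ANALYZE TABLE {table} COMPUTE STATISTICS"]
    if columns:
        # Column-level statistics (distinct count, min, max, null count).
        column_list = ", ".join(columns)
        statements.append(
            f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {column_list}"
        )
    return statements


# Illustrative choice of table and columns (an assumption).
stmts = analyze_statements("store_sales", ["ss_sold_date_sk", "ss_item_sk"])
for stmt in stmts:
    print(stmt)
```

In a Spark session each string would typically be passed to `spark.sql(...)`, and the optimizer uses the collected statistics when `spark.sql.cbo.enabled` is set to true.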
  • 19. Benchmark execution time (stats). Chart annotation: CBO enabled: ↑ 27.11
  • 20. Speedup with table and column stats. Chart annotation: CBO enabled: ↓ 0.60
  • 21. TPC-DS Power Test – geometric mean
  • 22. TPC-DS Power Test – arithmetic mean
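The two Power Test summaries differ in how they weight slow queries: the arithmetic mean is dominated by the longest-running queries, while the geometric mean reflects the typical ratio across all 99. A minimal sketch, using made-up query times rather than any measured results:

```python
import math


def arithmetic_mean(times):
    """Plain average; a single very slow query pulls it up strongly."""
    return sum(times) / len(times)


def geometric_mean(times):
    """n-th root of the product, computed through logs for numerical stability."""
    return math.exp(sum(math.log(t) for t in times) / len(times))


# Hypothetical per-query execution times in seconds (not benchmark data).
query_times = [2.0, 4.0, 8.0, 512.0]

print("arithmetic mean:", arithmetic_mean(query_times))
print("geometric mean: ", geometric_mean(query_times))
```

With one outlier in the list, the arithmetic mean lands far above three of the four values, whereas the geometric mean stays close to the bulk of them, which is why reporting both is informative.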
  • 23. Additional configuration for Presto
      • Query-specific configuration parameters
          – Queries 5, 75, 78, and 80: join_distribution_type: PARTITIONED
          – Queries 78 and 85: join_reordering_strategy: NONE
          – Query 67: task_concurrency: 32
          – Query 18: join_reordering_strategy: ELIMINATE_CROSS_JOINS
      • Session configuration for all queries
          – query_max_stage_count: 102
          – join_reordering_strategy: AUTOMATIC
          – join_distribution_type: AUTOMATIC
      • Query modifications (carried on to all systems)
          – Query 72: manual join re-ordering
          – Query 95: add distinct clause
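One way to apply the per-query settings listed on this slide is to emit Presto `SET SESSION` statements before each query, layering the query-specific overrides on top of the session defaults. This is a sketch under that assumption; the slides do not say how the authors applied the parameters, and the dictionary and function names are hypothetical.

```python
# Session configuration applied to all queries (values from the slide).
SESSION_DEFAULTS = {
    "query_max_stage_count": 102,
    "join_reordering_strategy": "AUTOMATIC",
    "join_distribution_type": "AUTOMATIC",
}

# Query-specific overrides (values from the slide); query 78 appears in two rows.
QUERY_OVERRIDES = {
    5: {"join_distribution_type": "PARTITIONED"},
    75: {"join_distribution_type": "PARTITIONED"},
    78: {"join_distribution_type": "PARTITIONED", "join_reordering_strategy": "NONE"},
    80: {"join_distribution_type": "PARTITIONED"},
    85: {"join_reordering_strategy": "NONE"},
    67: {"task_concurrency": 32},
    18: {"join_reordering_strategy": "ELIMINATE_CROSS_JOINS"},
}


def session_statements(query_number):
    """Return the SET SESSION statements to run before the given TPC-DS query."""
    settings = {**SESSION_DEFAULTS, **QUERY_OVERRIDES.get(query_number, {})}
    statements = []
    for name, value in settings.items():
        # String values are quoted as SQL literals; numbers are emitted as-is.
        literal = f"'{value}'" if isinstance(value, str) else str(value)
        statements.append(f"SET SESSION {name} = {literal}")
    return statements


for stmt in session_statements(78):
    print(stmt)
```

Keeping the overrides in one table makes the amount of Presto-specific tuning explicit: 7 of the 99 queries needed a non-default setting.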
  • 24. TPC-DS Power Test – Query 72
      • Original benchmark join order
          catalog_sales ⋈ inventory ⋈ warehouse ⋈ item ⋈ customer_demographics ⋈ household_demographics ⋈ date_dim ⋈ date_dim ⋈ date_dim ⟕ promotion ⟕ catalog_returns
      • Manually modified join order
          catalog_sales ⋈ date_dim ⋈ date_dim ⋈ inventory ⋈ date_dim ⋈ warehouse ⋈ item ⋈ customer_demographics ⋈ household_demographics ⟕ promotion ⟕ catalog_returns
      • Databricks optimized join order, no stats
          Same as modified join order + pushed down selections and projections
  • 25. TPC-DS Power Test – Query 72
      • Databricks optimized join order with stats
          (((((((catalog_sales ⋈ household_demographics) ⋈ date_dim) ⋈ customer_demographics) ⋈ item) ⋈ (((date_dim ⋈ date_dim) ⋈ inventory) ⋈ warehouse)) ⟕ promotion) ⟕ catalog_returns) + pushed down selections and projections
      • EMR Spark optimized join order with stats and CBO enabled/disabled
          Same as modified join order + pushed down selections and projections, but different physical plans
  • 26. Dynamic data partitioning
      • Splits a table based on the value of a particular column
          – Split only 7 largest tables by date surrogate keys
          – One S3 bucket folder for each value
      • Databricks and EMR Spark: limit number of files per partition
      • EMR Presto: out of memory error for largest table
          – Use Hive with TEZ to load data
  • 27. Benchmark exec. time (part + stats). Power Test: 2 failed queries. Throughput Test: 6 failed queries.
  • 28. Speedup with partitioning and stats
  • 29. TPC Benchmark total execution time
  • 30. TPC Benchmark DS metric
      • The modified primary performance metric is $QphDS@SF = \dfrac{SF \cdot Q}{\sqrt[3]{T_{LD} \cdot T_{PT} \cdot T_{TT}}}$, where
          – $SF$: scale factor
          – $Q$: number of weighted queries, num. streams x 99
          – $T_{LD}$: load factor, 0.1 x num. streams x load time
          – $T_{PT}$, $T_{TT}$: Power Test and Throughput Test times
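A worked version of the modified metric may help. The sketch below reads the slide's formula as the scale factor times the number of weighted queries, divided by the cube root of the product of the load factor, Power Test time, and Throughput Test time. The cube root and the use of hours as the time unit are assumptions, and the input values are invented for illustration.

```python
def qphds(scale_factor, num_streams, load_time, power_time, throughput_time):
    """Modified QphDS@SF as read from the slide (cube root is an assumption).

    All times are assumed to be in hours.
    """
    weighted_queries = num_streams * 99            # Q: num. streams x 99
    load_factor = 0.1 * num_streams * load_time    # T_LD: 0.1 x num. streams x load time
    product = load_factor * power_time * throughput_time
    return scale_factor * weighted_queries / product ** (1.0 / 3.0)


# Hypothetical inputs: 1 TB scale factor, 4 streams, times in hours.
score = qphds(scale_factor=1000, num_streams=4,
              load_time=2.5, power_time=2.0, throughput_time=4.0)
print(f"QphDS@1000 = {score:.1f}")
```

Because the three times enter through a geometric mean, halving any one of them raises the score by the same factor, roughly 1.26x.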
  • 31. TPC Benchmark DS metric
  • 32. System costs
      • Cost = num. nodes × node cost per hour × exec. time in hours
      • Node cost per hour = node hardware costs + node software costs
      System | Hardware | Software
      EMR Presto | $0.624 | $0.156
      EMR Spark | $0.624 | $0.156
      Databricks | $0.624 | $0.3
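The cost formula on this slide can be checked with a few lines of arithmetic. The sketch below uses the per-node hourly prices from the table and assumes that all 9 nodes of the cluster (1 master plus 8 workers) are billed; that assumption, the function name, and the execution time are illustrative rather than taken from the results.

```python
HARDWARE_COST_PER_HOUR = 0.624  # per node, same instance type for every system

SOFTWARE_COST_PER_HOUR = {      # per node, from the slide's table
    "EMR Presto": 0.156,
    "EMR Spark": 0.156,
    "Databricks": 0.3,
}

NUM_NODES = 9  # assumption: 1 master + 8 workers are all billed


def benchmark_cost(system, exec_hours, num_nodes=NUM_NODES):
    """Cost = num. nodes x (hardware + software cost per hour) x exec. hours."""
    node_cost = HARDWARE_COST_PER_HOUR + SOFTWARE_COST_PER_HOUR[system]
    return num_nodes * node_cost * exec_hours


# Hypothetical execution time of 2 hours for every system.
for system in SOFTWARE_COST_PER_HOUR:
    print(f"{system}: ${benchmark_cost(system, 2.0):.2f}")
```

At equal execution time Databricks costs more per hour (0.924 versus 0.780 per node), so it is only cheaper overall when it finishes the benchmark in less than about 84% of the competing system's time.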
  • 33. TPC Benchmark DS cost
  • 35. Disk utilization
      • Databricks
          – Automatically caches hot input data
          – Requires machines with NVMe SSDs
      • EMR Presto
          – Experimental spilling of state to disk
          – “we do not configure any of the Facebook deployments to spill…local disks would increase hardware costs…” (Raghav Sethi et al. Presto: SQL on Everything. ICDE 2019: 1802-1813)
  • 38. Usability and developer productivity
      Feature | EMR Presto | EMR Spark | Databricks
      Easy and flexible cluster creation | ✓ | ✓ | ✓
      Framework configuration at cluster creation time | ✓ | ✓ | ✓
      Direct distributed file system support | ✗ | ✗ | ✓
      Independent data catalog (metastore) | ✓ | ✓ | ✓
      Support for notebooks | ✓ | ✓ | ✓
      Integrated Web GUI | ✗ | ✗ | ✓
  • 39. Usability and developer productivity
      Feature | EMR Presto | EMR Spark | Databricks
      JDBC access | ✓ | ✓ | ✓
      Programmatic interface | ✗ | ✓ | ✓
      Job creation and management infrastructure | ✗ | ✗ | ✓
      Customized visualization of query plan execution | ✓ | ✓ | ✓
      Resource utilization monitoring with Ganglia and CloudWatch | ✓ | ✓ | ✓
  • 40. Conclusions
      • Databricks is about 4x faster than EMR Presto without statistics
          – About 3x faster with them
      • The difference is smaller with EMR Spark
          – Databricks still more cost-effective
          – More efficient runtime, cache, and CBO
      • Databricks and EMR Spark deal better with concurrency and benefit from data partitioning
  • 41. Conclusions
      • EMR Presto requires significantly more tuning
          – Minimal for Databricks and EMR Spark
      • Functionality of Databricks and EMR Presto/Spark for SQL is very similar
          – Databricks more user-friendly in some aspects