SlideShare a Scribd company logo
1 of 19
Hive, Presto, and Spark
on
TPC-DS benchmark
Dongwon Kim, PhD
SK Telecom
Contents
• Experimental setup
• Experimental results
[Experimental setup] TPC-DS dataset and query
• Hive
• Entirely depend on github.com/hortonworks/hive-testbench
• Distributed data generator
• A small dataset (100GB)
• A large dataset (1TB)
• DDLs
• External table declaration
• Partitioned table declaration (ORC)
• 66 queries provided (out of 99 TPC-DS queries)
• Presto
• Use hive-hadoop2 connector to read the same partitioned table
• Use the same query
• Spark
• Connected to Hive MetaStore to read the same partitioned table
• Use the same query
[Experimental setup] Cluster setup
• A single master node with 5 slave nodes
• Two 12-core processes with total 48 hyper threads
• 128GB main memory
• 10 HDDs
• Hadoop 2.7.3
• Hive 2.1.1 + Tez 0.8.4
• A LLAP worker on a node uses 192 cores and 80GB
• Presto 0.162
• A Presto worker on a node uses 192 cores and 80GB
• distributed-joins-enabled = false
• Spark 2.0.2
• 4 Spark executors on a node uses 192 cores and 80GB
[Experimental setup] Performance monitoring tool
• github.com/eastcirclek/swimlane-graphs
• Hive/Presto/Spark task swimlane graph + Ganglia resource utilization graph
• To observe the main cause of performance bottleneck
[Experimental results] Characteristics of each engine
• Hive
• Improve significantly through LLAP
• Good for both small and large workload
• Especially good for IO-bound workloads
• Spark
• Improve CPU performance through Whole Stage Code Generation
• Especially good for CPU-bound workloads
• Does not outperform Hive and Presto for IO-bound workloads
• Presto
• Pipelined execution to reduce unnecessary disk IOs
• Good for simple queries
• Works okay only when data is fit into memory
[Experimental results] Query execution time (100GB)
with query72 without query72
Pairwise comparison
reduction in
sum of running times
Pairwise comparison
reduction in
sum of running times
Spark > Hive 26.3 %
(1668s  1229s)
Hive > Spark 19.8 %
(1143s  916s)
Hive > Presto 55.6 %
(2797s  1241s)
Hive > Presto 50.2 %
(982s  489s)
Spark > Presto 62.0 %
(2932s  1114s)
Spark > Presto 5.2%
(1116s  1057s)
Spark > Hive >>> Presto Hive > Spark >= Presto
Reversed
Gap reduced
significantly
Hive with LLAP is good even for small workload
* When comparing each pair of engines, I count queries that are completed by both of two engines.
[Experimental results] Query execution time (1TB)
with query72 without query72
Pairwise comparison
reduction in
sum of running times
Pairwise comparison
reduction in
sum of running times
Hive > Spark 28.2 %
(6445s  4625s)
Hive > Spark 41.3 %
(6165s  3629s)
Hive > Presto 56.4 %
(5567s  2426s)
Hive > Presto 25.5 %
(1460s  1087s)
Spark > Presto 29.2 %
(5685s  4026s)
Presto > Spark 58.6%
(3812s  1578s)
Hive > Spark >>> Presto Hive > Presto > Spark
Reversed
* When comparing each pair of engines, I count queries that are completed by both of two engines.
Hive with LLAP is good for large, I/O-bound workload
Presto works okay if data fit into memory
[Experimental results] Curse of Query72 on Hive and Presto
0
500
1000
1500
2000
72 89 43 63 19 3 51 52 42 55 82 29 17 25 49 39 91 40 21 13 12 73 96 48 20 85 34 84 79 32 7 26 45 27 46 88 68 15 97 92 93 66 87 24 28 56 71 83 60 76 31 64 74
Presto Spark
100GB dataset
0
500
1000
1500
2000
72 89 43 63 19 3 42 52 55 25 49 29 17 75 82 40 88 31 13 71 91 56 85 60 68 28 46 48 26 7 51 66 21 34 27 20 45 12 87 73 79 15 32 96 84 76 39 93 92 97
Presto Hive
0
500
1000
1500
2000
72 22 67 51 97 92 95 82 39 93 21 94 96 84 73 12 18 79 32 3 91 20 43 98 52 42 89 63 45 15 55 34 48 27 7 26 90 46 68 13 19 85 87 40 66 76 28 88 50 60 29 17 56 58 71 54 25 49 31 80
Spark Hive
0
1000
2000
3000
4000
72 75 91 49 13 26 88 71 40 66 56 60 68 7 31 27 48 79 87 46 45 73 15 34 84 20 21 12 39 32 96 76 28 51 92 85 93
Presto Hive
0
1000
2000
3000
4000
72 67 82 92 97 95 22 51 21 12 20 15 26 19 96 39 27 84 3 13 18 28 43 45 52 7 34 48 42 46 55 32 73 89 87 63 68 90 79 91 76 66 40 71 93 50 94 56 85 54 60 58 25 17 31 29 88 49 74 80
Spark Hive
0
1000
2000
3000
4000
72 26 13 91 21 15 20 12 27 7 84 39 45 96 48 46 34 40 71 68 66 73 87 92 97 79 32 51 28 76 85 56 93 83 60 24 49 88 31 74
Presto Spark
[Experimental results] Curse of Query72 on Hive and Presto 1TB dataset
[Experimental results] Curse of Query72 on Presto and Hive
Presto SparkHive
High CPU utilization
for a long time
(looks like “plateau”)
No plateau observed
thanks to
WholeStageCodeGen
plateau
w/ WholeStageCodeGen w/o WholeStageCodeGen
CPU
Network
Disk
[Experimental results] Whole Stage Code Generation
Presto Hive
Without WholeStageCodeGeneration,
“plateau” is observed like Presto and Hive
 10th stage takes much longer
without WholeStageCodeGeneration
plateau
[Experimental results] Whole Stage Code Generation
1
10
100
1000
75 71 64 83 73 40 56 55 43 85 89 63 84 27 92 82 28 93 91 12 96 32 7 66 20 45 98 42 68 52 87 90 94 79 3 48 26 60 46 34 67 51 15 29 65 76 50 31 49 80 19 18 21 58 17 13 97 22 95 54 25 39 88 24 72
w/ WSCG w/o WSCG
100GB dataset
[Experimental results] Performance of Hive w/ and w/o LLAP
1
10
100
1000
10000
67 51 70 97 92 68 46 42 71 73 90 34 39 48 84 52 43 63 98 13 3 7 79 66 20 26 27 96 21 45 89 80 12 56 15 32 40 18 49 19 55 60 54 87 31 76 75 82 22 88 28 58 94 93 95 72
llap container
LLAP shows improvement over container-based Tez for most queries
100GB
dataset
1TB
dataset
1
10
100
1000
10000
80 84 96 79 94 51 91 31 56 67 27 32 95 46 72 39 19 48 90 52 21 45 15 3 34 55 42 20 12 97 43 60 63 22 26 76 75 83 88 68 70 89 65 28 13 87 40 58 18 85 25 17 29 93 50 92 66 73 71
llap container
All queries : 78.3% reduction
All except 72 : 36.1% reduction
All queries : 44.9% reduction
All except 72 : 27.9% reduction
[Experimental results][Query 75] without and with LLAP
without LLAP without LLAP
with LLAP
with LLAP
[Experimental results][Query 93] without and with LLAP
without LLAP without LLAP with LLAP
with LLAP
[Experimental results][Query 94] without and with LLAP
without LLAP without LLAP with LLAP
with LLAP
[Experimental results][Query 93] Difference pattern of resource utilization
Network  CPU
Presto Hive
CPU  Network
Spark
The end

More Related Content

What's hot

Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Edureka!
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevAltinity Ltd
 
A Deep Dive into Kafka Controller
A Deep Dive into Kafka ControllerA Deep Dive into Kafka Controller
A Deep Dive into Kafka Controllerconfluent
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Databricks
 
Elastic Stack Introduction
Elastic Stack IntroductionElastic Stack Introduction
Elastic Stack IntroductionVikram Shinde
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Presto best practices for Cluster admins, data engineers and analysts
Presto best practices for Cluster admins, data engineers and analystsPresto best practices for Cluster admins, data engineers and analysts
Presto best practices for Cluster admins, data engineers and analystsShubham Tagra
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Databricks
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingDataWorks Summit
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 

What's hot (20)

Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
Catalyst optimizer
Catalyst optimizerCatalyst optimizer
Catalyst optimizer
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
 
A Deep Dive into Kafka Controller
A Deep Dive into Kafka ControllerA Deep Dive into Kafka Controller
A Deep Dive into Kafka Controller
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
 
Elastic Stack Introduction
Elastic Stack IntroductionElastic Stack Introduction
Elastic Stack Introduction
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Presto best practices for Cluster admins, data engineers and analysts
Presto best practices for Cluster admins, data engineers and analystsPresto best practices for Cluster admins, data engineers and analysts
Presto best practices for Cluster admins, data engineers and analysts
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
 
Introduction to sqoop
Introduction to sqoopIntroduction to sqoop
Introduction to sqoop
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
CockroachDB
CockroachDBCockroachDB
CockroachDB
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 

Viewers also liked

A Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkA Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkDongwon Kim
 
Docker and kubernetes
Docker and kubernetesDocker and kubernetes
Docker and kubernetesDongwon Kim
 
Kubernetes introduction
Kubernetes introductionKubernetes introduction
Kubernetes introductionDongwon Kim
 
Performance of Spark vs MapReduce
Performance of Spark vs MapReducePerformance of Spark vs MapReduce
Performance of Spark vs MapReduceEdureka!
 
Predictive Maintenance with Deep Learning and Apache Flink
Predictive Maintenance with Deep Learning and Apache FlinkPredictive Maintenance with Deep Learning and Apache Flink
Predictive Maintenance with Deep Learning and Apache FlinkDongwon Kim
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentationargonauts007
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentationCyanny LIANG
 
Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016kbajda
 
Presto @ Facebook: Past, Present and Future
Presto @ Facebook: Past, Present and FuturePresto @ Facebook: Past, Present and Future
Presto @ Facebook: Past, Present and FutureDataWorks Summit
 
Presto: Distributed SQL on Anything - Strata Hadoop 2017 San Jose, CA
Presto: Distributed SQL on Anything -  Strata Hadoop 2017 San Jose, CAPresto: Distributed SQL on Anything -  Strata Hadoop 2017 San Jose, CA
Presto: Distributed SQL on Anything - Strata Hadoop 2017 San Jose, CAkbajda
 
Presto: Distributed sql query engine
Presto: Distributed sql query engine Presto: Distributed sql query engine
Presto: Distributed sql query engine kiran palaka
 
How to ensure Presto scalability 
in multi use case
How to ensure Presto scalability 
in multi use case How to ensure Presto scalability 
in multi use case
How to ensure Presto scalability 
in multi use case Kai Sasaki
 
Optimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageOptimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageKai Sasaki
 

Viewers also liked (17)

A Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkA Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache Flink
 
Docker and kubernetes
Docker and kubernetesDocker and kubernetes
Docker and kubernetes
 
Kubernetes introduction
Kubernetes introductionKubernetes introduction
Kubernetes introduction
 
Performance of Spark vs MapReduce
Performance of Spark vs MapReducePerformance of Spark vs MapReduce
Performance of Spark vs MapReduce
 
Predictive Maintenance with Deep Learning and Apache Flink
Predictive Maintenance with Deep Learning and Apache FlinkPredictive Maintenance with Deep Learning and Apache Flink
Predictive Maintenance with Deep Learning and Apache Flink
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentation
 
Presto - SQL on anything
Presto  - SQL on anythingPresto  - SQL on anything
Presto - SQL on anything
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentation
 
Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016
 
Presto @ Facebook: Past, Present and Future
Presto @ Facebook: Past, Present and FuturePresto @ Facebook: Past, Present and Future
Presto @ Facebook: Past, Present and Future
 
Presto: Distributed SQL on Anything - Strata Hadoop 2017 San Jose, CA
Presto: Distributed SQL on Anything -  Strata Hadoop 2017 San Jose, CAPresto: Distributed SQL on Anything -  Strata Hadoop 2017 San Jose, CA
Presto: Distributed SQL on Anything - Strata Hadoop 2017 San Jose, CA
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
 
Presto: Distributed sql query engine
Presto: Distributed sql query engine Presto: Distributed sql query engine
Presto: Distributed sql query engine
 
How to ensure Presto scalability 
in multi use case
How to ensure Presto scalability 
in multi use case How to ensure Presto scalability 
in multi use case
How to ensure Presto scalability 
in multi use case
 
Optimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageOptimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud Storage
 
Presto
PrestoPresto
Presto
 

Similar to Hive, Presto & Spark on TPC-DS: A Comparison of Query Performance

Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance SmackdownDataWorks Summit
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudNicolas Poggi
 
Scaling sql server 2014 parallel insert
Scaling sql server 2014 parallel insertScaling sql server 2014 parallel insert
Scaling sql server 2014 parallel insertChris Adkin
 
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade OffDatabases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade OffTimescale
 
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New FeaturesAmazon Web Services
 
Using Derivation-Free Optimization Methods in the Hadoop Cluster with Terasort
Using Derivation-Free Optimization Methods in the Hadoop Cluster with TerasortUsing Derivation-Free Optimization Methods in the Hadoop Cluster with Terasort
Using Derivation-Free Optimization Methods in the Hadoop Cluster with TerasortAnhanguera Educacional S/A
 
Cassandra Performance Benchmark
Cassandra Performance BenchmarkCassandra Performance Benchmark
Cassandra Performance BenchmarkBigstep
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceBrendan Gregg
 
Using BigBench to compare Hive and Spark (short version)
Using BigBench to compare Hive and Spark (short version)Using BigBench to compare Hive and Spark (short version)
Using BigBench to compare Hive and Spark (short version)Nicolas Poggi
 
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongCeph Community
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scalethelabdude
 
Macy's: Changing Engines in Mid-Flight
Macy's: Changing Engines in Mid-FlightMacy's: Changing Engines in Mid-Flight
Macy's: Changing Engines in Mid-FlightDataStax Academy
 
Am I reading GC logs Correctly?
Am I reading GC logs Correctly?Am I reading GC logs Correctly?
Am I reading GC logs Correctly?Tier1 App
 
USE_OF_PACKET_CAPTURE.pptx
USE_OF_PACKET_CAPTURE.pptxUSE_OF_PACKET_CAPTURE.pptx
USE_OF_PACKET_CAPTURE.pptxrajaguru91
 
ELK: Moose-ively scaling your log system
ELK: Moose-ively scaling your log systemELK: Moose-ively scaling your log system
ELK: Moose-ively scaling your log systemAvleen Vig
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeDatabricks
 
Cloud Performance Benchmarking
Cloud Performance BenchmarkingCloud Performance Benchmarking
Cloud Performance BenchmarkingSantanu Dey
 
ClusterPresentation
ClusterPresentationClusterPresentation
ClusterPresentationWill Dixon
 

Similar to Hive, Presto & Spark on TPC-DS: A Comparison of Query Performance (20)

Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance Smackdown
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
Scaling sql server 2014 parallel insert
Scaling sql server 2014 parallel insertScaling sql server 2014 parallel insert
Scaling sql server 2014 parallel insert
 
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade OffDatabases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
 
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
 
Using Derivation-Free Optimization Methods in the Hadoop Cluster with Terasort
Using Derivation-Free Optimization Methods in the Hadoop Cluster with TerasortUsing Derivation-Free Optimization Methods in the Hadoop Cluster with Terasort
Using Derivation-Free Optimization Methods in the Hadoop Cluster with Terasort
 
QCon London.pdf
QCon London.pdfQCon London.pdf
QCon London.pdf
 
Cassandra Performance Benchmark
Cassandra Performance BenchmarkCassandra Performance Benchmark
Cassandra Performance Benchmark
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems Performance
 
Using BigBench to compare Hive and Spark (short version)
Using BigBench to compare Hive and Spark (short version)Using BigBench to compare Hive and Spark (short version)
Using BigBench to compare Hive and Spark (short version)
 
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 
Macy's: Changing Engines in Mid-Flight
Macy's: Changing Engines in Mid-FlightMacy's: Changing Engines in Mid-Flight
Macy's: Changing Engines in Mid-Flight
 
Am I reading GC logs Correctly?
Am I reading GC logs Correctly?Am I reading GC logs Correctly?
Am I reading GC logs Correctly?
 
USE_OF_PACKET_CAPTURE.pptx
USE_OF_PACKET_CAPTURE.pptxUSE_OF_PACKET_CAPTURE.pptx
USE_OF_PACKET_CAPTURE.pptx
 
ELK: Moose-ively scaling your log system
ELK: Moose-ively scaling your log systemELK: Moose-ively scaling your log system
ELK: Moose-ively scaling your log system
 
Silent stores
Silent storesSilent stores
Silent stores
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
 
Cloud Performance Benchmarking
Cloud Performance BenchmarkingCloud Performance Benchmarking
Cloud Performance Benchmarking
 
ClusterPresentation
ClusterPresentationClusterPresentation
ClusterPresentation
 

Recently uploaded

"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 

Recently uploaded (20)

DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 

Hive, Presto & Spark on TPC-DS: A Comparison of Query Performance

  • 1. Hive, Presto, and Spark on TPC-DS benchmark Dongwon Kim, PhD SK Telecom
  • 2. Contents • Experimental setup • Experimental results
  • 3. [Experimental setup] TPC-DS dataset and query • Hive • Entirely depend on github.com/hortonworks/hive-testbench • Distributed data generator • A small dataset (100GB) • A large dataset (1TB) • DDLs • External table declaration • Partitioned table declaration (ORC) • 66 queries provided (out of 99 TPC-DS queries) • Presto • Use hive-hadoop2 connector to read the same partitioned table • Use the same query • Spark • Connected to Hive MetaStore to read the same partitioned table • Use the same query
  • 4. [Experimental setup] Cluster setup • A single master node with 5 slave nodes • Two 12-core processes with total 48 hyper threads • 128GB main memory • 10 HDDs • Hadoop 2.7.3 • Hive 2.1.1 + Tez 0.8.4 • A LLAP worker on a node uses 192 cores and 80GB • Presto 0.162 • A Presto worker on a node uses 192 cores and 80GB • distributed-joins-enabled = false • Spark 2.0.2 • 4 Spark executors on a node uses 192 cores and 80GB
  • 5. [Experimental setup] Performance monitoring tool • github.com/eastcirclek/swimlane-graphs • Hive/Presto/Spark task swimlane graph + Ganglia resource utilization graph • To observe the main cause of performance bottleneck
  • 6. [Experimental results] Characteristics of each engine • Hive • Improve significantly through LLAP • Good for both small and large workload • Especially good for IO-bound workloads • Spark • Improve CPU performance through Whole Stage Code Generation • Especially good for CPU-bound workloads • Does not outperform Hive and Presto for IO-bound workloads • Presto • Pipelined execution to reduce unnecessary disk IOs • Good for simple queries • Works okay only when data is fit into memory
  • 7. [Experimental results] Query execution time (100GB) with query72 without query72 Pairwise comparison reduction in sum of running times Pairwise comparison reduction in sum of running times Spark > Hive 26.3 % (1668s  1229s) Hive > Spark 19.8 % (1143s  916s) Hive > Presto 55.6 % (2797s  1241s) Hive > Presto 50.2 % (982s  489s) Spark > Presto 62.0 % (2932s  1114s) Spark > Presto 5.2% (1116s  1057s) Spark > Hive >>> Presto Hive > Spark >= Presto Reversed Gap reduced significantly Hive with LLAP is good even for small workload * When comparing each pair of engines, I count queries that are completed by both of two engines.
  • 8. [Experimental results] Query execution time (1TB) with query72 without query72 Pairwise comparison reduction in sum of running times Pairwise comparison reduction in sum of running times Hive > Spark 28.2 % (6445s  4625s) Hive > Spark 41.3 % (6165s  3629s) Hive > Presto 56.4 % (5567s  2426s) Hive > Presto 25.5 % (1460s  1087s) Spark > Presto 29.2 % (5685s  4026s) Presto > Spark 58.6% (3812s  1578s) Hive > Spark >>> Presto Hive > Presto > Spark Reversed * When comparing each pair of engines, I count queries that are completed by both of two engines. Hive with LLAP is good for large, I/O-bound workload Presto works okay if data fit into memory
  • 9. [Experimental results] Curse of Query72 on Hive and Presto 0 500 1000 1500 2000 72 89 43 63 19 3 51 52 42 55 82 29 17 25 49 39 91 40 21 13 12 73 96 48 20 85 34 84 79 32 7 26 45 27 46 88 68 15 97 92 93 66 87 24 28 56 71 83 60 76 31 64 74 Presto Spark 100GB dataset 0 500 1000 1500 2000 72 89 43 63 19 3 42 52 55 25 49 29 17 75 82 40 88 31 13 71 91 56 85 60 68 28 46 48 26 7 51 66 21 34 27 20 45 12 87 73 79 15 32 96 84 76 39 93 92 97 Presto Hive 0 500 1000 1500 2000 72 22 67 51 97 92 95 82 39 93 21 94 96 84 73 12 18 79 32 3 91 20 43 98 52 42 89 63 45 15 55 34 48 27 7 26 90 46 68 13 19 85 87 40 66 76 28 88 50 60 29 17 56 58 71 54 25 49 31 80 Spark Hive
  • 10. 0 1000 2000 3000 4000 72 75 91 49 13 26 88 71 40 66 56 60 68 7 31 27 48 79 87 46 45 73 15 34 84 20 21 12 39 32 96 76 28 51 92 85 93 Presto Hive 0 1000 2000 3000 4000 72 67 82 92 97 95 22 51 21 12 20 15 26 19 96 39 27 84 3 13 18 28 43 45 52 7 34 48 42 46 55 32 73 89 87 63 68 90 79 91 76 66 40 71 93 50 94 56 85 54 60 58 25 17 31 29 88 49 74 80 Spark Hive 0 1000 2000 3000 4000 72 26 13 91 21 15 20 12 27 7 84 39 45 96 48 46 34 40 71 68 66 73 87 92 97 79 32 51 28 76 85 56 93 83 60 24 49 88 31 74 Presto Spark [Experimental results] Curse of Query72 on Hive and Presto 1TB dataset
  • 11. [Experimental results] Curse of Query72 on Presto and Hive Presto SparkHive High CPU utilization for a long time (looks like “plateau”) No plateau observed thanks to WholeStageCodeGen plateau
  • 12. w/ WholeStageCodeGen w/o WholeStageCodeGen CPU Network Disk [Experimental results] Whole Stage Code Generation Presto Hive Without WholeStageCodeGeneration, “plateau” is observed like Presto and Hive  10th stage takes much longer without WholeStageCodeGeneration plateau
  • 13. [Experimental results] Whole Stage Code Generation 1 10 100 1000 75 71 64 83 73 40 56 55 43 85 89 63 84 27 92 82 28 93 91 12 96 32 7 66 20 45 98 42 68 52 87 90 94 79 3 48 26 60 46 34 67 51 15 29 65 76 50 31 49 80 19 18 21 58 17 13 97 22 95 54 25 39 88 24 72 w/ WSCG w/o WSCG 100GB dataset
  • 14. [Experimental results] Performance of Hive w/ and w/o LLAP 1 10 100 1000 10000 67 51 70 97 92 68 46 42 71 73 90 34 39 48 84 52 43 63 98 13 3 7 79 66 20 26 27 96 21 45 89 80 12 56 15 32 40 18 49 19 55 60 54 87 31 76 75 82 22 88 28 58 94 93 95 72 llap container LLAP shows improvement over container-based Tez for most queries 100GB dataset 1TB dataset 1 10 100 1000 10000 80 84 96 79 94 51 91 31 56 67 27 32 95 46 72 39 19 48 90 52 21 45 15 3 34 55 42 20 12 97 43 60 63 22 26 76 75 83 88 68 70 89 65 28 13 87 40 58 18 85 25 17 29 93 50 92 66 73 71 llap container All queries : 78.3% reduction All except 72 : 36.1% reduction All queries : 44.9% reduction All except 72 : 27.9% reduction
  • 15. [Experimental results][Query 75] without and with LLAP without LLAP without LLAP with LLAP with LLAP
  • 16. [Experimental results][Query 93] without and with LLAP without LLAP without LLAP with LLAP with LLAP
  • 17. [Experimental results][Query 94] without and with LLAP without LLAP without LLAP with LLAP with LLAP
  • 18. [Experimental results][Query 93] Difference pattern of resource utilization Network  CPU Presto Hive CPU  Network Spark