The state of Hive and Spark
in the cloud
Nicolas Poggi
July 2017
Outline
1. Intro to BSC and ALOJA
2. BigBench
3. Cloud systems
4. Sequential tests 1GB – 10TB
1. Scalability
5. Concurrency tests
6. Summary
Barcelona Supercomputing Center (BSC)
• Peak performance capacity of 13.7 Petaflop/s
• 390 TB Total RAM
• 14 PB storage (GPFS)
• Heterogeneous architecture:
• Intel Xeon: KNL and KNH
• IBM Power9 + Latest NVIDIA, ARMv8
• Networking: InfiniBand EDR / Omni-path
Launched July 1st 2017
ALOJA: towards cost-effective Big Data
• Research project for automating characterization and
optimization of Big Data deployments
• Open source Benchmarking-to-Insights platform and tools
• Largest Big Data public repository (70,000+ jobs)
• Community collaboration with industry and academia
http://aloja.bsc.es
Big Data
Benchmarking
Online
Repository
Web / ML
Analytics
Motivation
• 2016 SQL-on-Hadoop paper and presentations
• Focused on Hive, due to SparkSQL not being ready to use in PaaS
• Different versions (1.3, 1.5, 1.6)
• Some in preview mode
• Not carefully tuned
• Used TPC-H SQL-only benchmark
• Early 2017: BigBench work on Hive and Spark, testing more than SQL
• FOSDEM and HadoopSummit EU presentations
• New code available in May for MLlib2 compatibility
• Goals:
• Understand the different BigBench queries
• Evaluate the current out-of-the-box experience of Spark and Hive in PaaS cloud
• Readiness, scalability, price, and performance
BigBench
The need for a new benchmark standard
• A benchmark captures the solution to a problem and guides decision making
• Database-related benchmark standards
• Transactional (OLTP): TPC-C and TPC-E
• Decision Support (DSS/OLAP): TPC-H and TPC-DS
• And for Big Data analytics properties?
• 3 Vs, ML, M/R
• Benchmark uses:
• System tuning and debugging
• Spread and broad Big Data ecosystem
• Set common rules
• Vendor comparison
• Transparency across the industry
What is BigBench (TPCx-BB)?
• End-to-end application level benchmark specification
• Result of many years of collaboration between industry and academia
• Covers most Big Data Analytical properties (3Vs)
• Covers 30 business use cases for a retailer company
• Defines data scale factors: 1GB to PBs
BigBench history
• 2012: Launched at WBDB
• 2013: Published at SIGMOD
• 2014: First implementation on GitHub
• Feb 2016: Standardized by TPC
• Nov 2016: TPCx-BB Version 1.2
• May 2017: Spark MLlib v2 compatibility (under testing)
BigBench use cases and process overview
• 30 business use cases covering:
• Merchandising,
• Pricing Optimization
• Product Return
• Customers...
• Implementation resulted in:
• 14 Declarative queries (SQL)
• 7 with Natural Language Processing
• 4 with data preprocessing with M/R jobs
• 5 with Machine Learning jobs
1 Data generation
2 Data loading
3 Power test
4 Throughput test 1
5 Data refresh
6 Throughput test 2
Result
• BB queries / min (BBQpm)
BigBench v1.2 – Reference Implementation
• Machine Learning: Mahout, Spark’s MLlib
• SQL Engine: Hive, Spark SQL
• Table Metastore: Hive Metastore
• Execution Engine: MapReduce, Tez, Spark (on Yarn)
• Filesystem: HDFS
Combination options:
• Hive + MapReduce + Mahout
• Hive + MapReduce + Spark’s MLlib
• v1 and v2
• Hive + Tez + Mahout
• Hive + Tez + Spark’s MLlib
• Spark + Mahout
• Spark + MLlib
• v1 and v2
• Also
• Hive-on-Spark
• Hive LLAP …
Previous results: M/R vs Tez and Mahout vs. MLlib v1
Average of three executions using 100 GB Scale Factor
M/R
Tez
Mahout
MLlib v1
3.9x 2.2x
Hive and Spark in PaaS
Platform-as-a-Service Spark
• Cloud-based managed Hadoop services
• Ready to use Spark, Hive, …
• Simplified management
• Deploys in minutes, on-demand, elastic
• You select the instance and the number of processing nodes
• Decoupled compute and storage
• Pay-as-you-go pricing model
• Optimized for general purpose
• Fine-tuned to the cloud provider architecture
Surveyed Hadoop/Hive PaaS services
• Amazon Elastic MapReduce (EMR)
• Released: Apr 2009
• OS: Amazon Linux AMI (RHEL-like)
• SW stack: EMR 5.5 (and 5.6)
• Spark 2.1.0 and Hive 2.1 (no LLAP)
• Google Cloud Dataproc (CDP)
• Released: Feb 2016
• OS: Debian GNU/Linux 8.4
• SW stack: Preview version Spark 2.1.0
• V 1.1 with Spark 2.0.2
• Both with Hive 2.1 (no LLAP)
• Azure HDInsight (HDI)
• Released: Oct 2013
• OS: Windows Server and Ubuntu 16.04
• SW stack: HDP 2.6 based
• Spark 2.1.0 and 1.6.3
• Hive 1.2 (Hive 2 + LLAP in preview mode)
• Target deployment:
• 16 data nodes with 8-cores each
• Master node with 16-cores
• Decoupled storage only
• Object store / elastic stores
VM instances and characteristics
Amazon Elastic MapReduce (EMR)
• 16x M4.2xlarge (datanodes)
• 8-core, 32GB RAM
• 1x M4.4xlarge (master)
• 16-core, 64 GB RAM
• Storage: 2x EBS GP2 volumes
• Price/hr: $10.96 (billed by the hour)
Azure HDInsight (HDI)
• 16x D4v2 (datanodes)
• 8-core, 28GB RAM
• 2x D14v2 (master)
• 16-core, 112GB RAM
• Storage: WASB (Azure Blob Store)
• Price/hr: $20.68 (billed by the minute)
Google Cloud Dataproc (CDP)
• 16x n1-standard-8 (datanodes)
• 8-core, 30GB RAM
• 1x n1-standard-16 (master)
• 16-core, 60GB RAM
• Storage: GCS (Google Cloud Storage)
• Price/hr: $10.38 (billed by the minute)
Disclaimer: snapshot of the out-of-the-box price and performance
during May 2017. Performance and especially costs change
often. We use non-discounted pricing. I/O costs are complex to
estimate for a single benchmark, using per second billing.
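The billing granularity listed above changes what a run actually costs. The following is a minimal sketch, assuming the cluster cost is simply the hourly price times the runtime rounded up to the provider's billing unit; it ignores I/O and storage charges, and the 5,400 s runtime is a made-up value for illustration only.

```python
import math

# Hourly cluster prices from the slides (May 2017, non-discounted).
PRICE_PER_HOUR = {"EMR": 10.96, "HDI": 20.68, "CDP": 10.38}
# Billing granularity in seconds: EMR by the hour, HDI and CDP by the minute.
BILLING_UNIT_S = {"EMR": 3600, "HDI": 60, "CDP": 60}


def run_cost(provider: str, runtime_s: float) -> float:
    """Cluster cost in USD for one run, rounding runtime up to the billing unit."""
    unit = BILLING_UNIT_S[provider]
    billed_s = math.ceil(runtime_s / unit) * unit
    return round(PRICE_PER_HOUR[provider] * billed_s / 3600, 2)


if __name__ == "__main__":
    # 5,400 s (1.5 h) is a hypothetical runtime, not a measured result.
    for provider in PRICE_PER_HOUR:
        print(provider, run_cost(provider, 5400))
```

With hourly billing, a 1.5-hour run is charged as two full hours, so short runs on EMR can cost proportionally more than the hourly price suggests.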
Sequential Hive vs Spark 2.1
Queries 1-30 on Spark 2.1 (power runs)
Query 1 Query 2 …. Query 30
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
BB 1GB-1TB Scalability Dataproc:
Hive 2.1 (M/R) vs Spark 2.1
BB 1GB-1TB Scalability EMR:
Hive 2.1 (Tez) vs Spark 2.1
BB 1GB-1TB Scalability HDI:
Hive 1.2 (Tez) vs Spark 2.1
Notes:
• Times differ for each type of query
• Dataproc Hive is the slowest because it uses M/R
• HDI has the fastest results
• Spark is faster on the ML queries
• Hive is faster on the UDF queries
• EMR shows similar times for both
• Spark has problems on the M/R queries
BB 1TB Power runs : Hive vs Spark 2.1
All providers
BigBench Hive vs Spark data node CPU % (HDI)
BigBench Hive vs Spark data containers (HDI)
Comparison of Q5 (ML) in Hive and Spark
Errors and configs
EMR slow Spark query and solution
Configurations
BB 1TB M/R-only: Spark 2.1 – All providers
Notes:
• When zooming in by query, we can see that query 2 is the slowest on EMR
• On CDP and HDI it is within proportions
BB 1TB Q2: Spark 2.1 – CPU Util % EMR and HDI
Notes:
• Job was CPU bound. Log showed:
• WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
• Solution: increased executor memory, which lowered the time from 6,417 s to 1,501 s (4.3x)
Q2: Find the top 30 products that are mostly viewed together with a given product in online store
CREATE TEMPORARY FUNCTION makePairs AS 'io.bigdatabenchmark.v1.queries.udf.PairwiseUDTF';
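To make the shape of Q2 concrete, here is a small Python sketch of "products viewed together". It is not the benchmark's implementation: the behavior of the pairing UDTF is only assumed from its name, `top_coviewed` is a hypothetical helper, and the sessions are made-up data.

```python
from collections import Counter
from itertools import combinations


def top_coviewed(sessions, product, n=30):
    """Return up to n products most often viewed in the same session as `product`."""
    pair_counts = Counter()
    for viewed in sessions:
        # Emit each unordered pair of distinct products once per session.
        for a, b in combinations(sorted(set(viewed)), 2):
            pair_counts[(a, b)] += 1
    partner_counts = Counter()
    for (a, b), count in pair_counts.items():
        if a == product:
            partner_counts[b] += count
        elif b == product:
            partner_counts[a] += count
    return partner_counts.most_common(n)


if __name__ == "__main__":
    # Made-up click sessions: each list holds the product ids viewed in one session.
    sessions = [[1, 2, 3], [1, 2], [2, 4], [1, 3]]
    print(top_coviewed(sessions, product=1))
```

The number of pairs grows quadratically with session length, which is one plausible reason a query of this shape is heavy on CPU and memory at large scale factors.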
Errors in PaaS out-of-the-box…
• Everything was run out-of-the-box, except for:
• Q14 and Q17 require cross joins to be enabled in Spark v2
• At 10TB,
• spark.sql.broadcastTimeout (default 300) had to be increased in HDI
• Timeout in seconds for the broadcast wait time in broadcast joins
• At 1TB, memory issues:
• Queries 3, 4, 8
• TimSort java.lang.OutOfMemoryError: Java heap space at org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate
• Queries 2 and 30
• 17/05/15 16:57:46 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
• Configs
• spark.yarn.driver.memoryOverhead and spark.yarn.executor.memoryOverhead
• spark.executor.memory
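The settings named in this list can be passed to `spark-submit` as `--conf` arguments. The sketch below only renders such arguments; the property names are standard Spark 2.x settings, but every value is a placeholder chosen for illustration, not the value used in these runs, and `to_submit_args` is a hypothetical helper.

```python
# Hypothetical overrides; the values are placeholders, not the ones used in the talk.
SPARK_OVERRIDES = {
    "spark.sql.crossJoin.enabled": "true",         # allow cross joins in Spark 2.x
    "spark.sql.broadcastTimeout": "1200",          # seconds; the default is 300
    "spark.executor.memory": "8g",
    "spark.yarn.executor.memoryOverhead": "1024",  # MB
    "spark.yarn.driver.memoryOverhead": "1024",    # MB
}


def to_submit_args(overrides):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(overrides.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(SPARK_OVERRIDES)))
```

Keeping the overrides in one dictionary makes it easy to record exactly which settings departed from the out-of-the-box configuration for a given run.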
Spark config
                            EMR                CDP                HDI
Java version                OpenJDK 1.8.0_121  OpenJDK 1.8.0_121  OpenJDK 1.8.0_131
Spark version               2.1.0              2.1                2.1.0.2.6.0.2-76
Driver memory               5G                 5G                 5G
Executor memory             5G                 10G                4G
Executor cores              4                  4                  3
Executor instances          Dynamic            Dynamic            20
dynamicAllocation enabled   TRUE               TRUE               FALSE
Executor memoryOverhead     Default (384MB)    1,117 MB           384 MB
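The executor memory and memoryOverhead rows together determine the container size that YARN enforces. As a rough sketch, assuming the Spark 2.1 documented default for `spark.yarn.executor.memoryOverhead` (10% of executor memory, with 384 MB as the minimum), a 5G executor needs 5,632 MB, which is the 5.5 GB limit reported in the YARN warning. The function name is hypothetical.

```python
def yarn_container_mb(executor_memory_mb, overhead_mb=None):
    """Memory YARN must grant per executor: heap plus off-heap overhead (MB)."""
    if overhead_mb is None:
        # Assumed Spark 2.1 default: 10% of executor memory, at least 384 MB.
        overhead_mb = max(384, int(executor_memory_mb * 0.10))
    return executor_memory_mb + overhead_mb


if __name__ == "__main__":
    # Executor memory and overhead per provider, taken from the config table.
    for name, mem, overhead in [("EMR", 5 * 1024, None),
                                ("CDP", 10 * 1024, 1117),
                                ("HDI", 4 * 1024, 384)]:
        total = yarn_container_mb(mem, overhead)
        print(f"{name}: {total} MB ({total / 1024:.2f} GB)")
```

When an executor's physical memory use exceeds this total, YARN kills the container, so raising either the executor memory or the overhead raises the limit.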
The sky 10TB is the limit…
Results for SQL-only
BB 1GB-10TB Scalability SQL-only queries
Hive
Spark
BigBench 10TB SQL-only: All providers
Notes:
• At 10TB, only the SQL part ran correctly in Spark
• EMR got the fastest results
• The rest still needs tuning to complete
• But we are reaching the limit of the cluster / PaaS config
Other comparisons:
2.0.2 vs 2.1.0
1.6.3 vs 2.1.0
MLlib v1 vs v2
BigBench 1GB-1TB: Spark 2.0.2 vs 2.1.0 (CDP)
Notes:
Spark 2.1 is a bit faster at small scales, slower at 100 GB and 1 TB on the UDF/NLP queries
2.1 faster up to 100GB
Slower at 1TB
BigBench 1GB-1TB: Spark 1.6.3 vs 2.1.0, MLlib 1 vs MLlib 2 (HDI)
Notes:
• Spark 2.1 is always faster than 1.6.3 in HDI
• MLlib 2, using DataFrames over RDDs, is only slightly faster than v1
Query 2 CPU % example (100GB)
Tez Spark 1.6.2 Spark 2.0.2
Average of three executions using 100 GB Scale Factor
Concurrency runs (throughput)
2 to 32 parallel streams 128-core cluster
SQL-only: 100GB – 10TB 512-core cluster
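A throughput run of this kind can be sketched as several query streams executing at the same time. The code below is an illustration only: `run_query` is a hypothetical stand-in for submitting a query to Hive or Spark, and the assumption that each stream runs the full query set in its own shuffled order is mine, not taken from the slides.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def run_query(stream_id, query_id):
    """Hypothetical stand-in: a real driver would submit the query to Hive or Spark."""
    return (stream_id, query_id)


def run_stream(stream_id, query_ids, seed=0):
    """Run every query once, in an order shuffled per stream."""
    order = list(query_ids)
    random.Random(seed + stream_id).shuffle(order)
    return [run_query(stream_id, q) for q in order]


def throughput_test(n_streams, query_ids):
    """Run n_streams concurrently; return per-stream results and elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        results = list(pool.map(lambda s: run_stream(s, query_ids), range(n_streams)))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = throughput_test(n_streams=4, query_ids=range(1, 31))
    print(len(results), "streams finished in", round(elapsed, 3), "s")
```

Threads are enough here because the client only waits on the cluster; the elapsed time of the whole batch, rather than of any single query, is what a throughput comparison would use.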
BB Throughput 1GB, 1-32 streams (128-cores)
Notes:
• Providers are similar on concurrency
• From 16 streams on, the bottleneck is the CPU utilization on the master
• HDI is faster at concurrency
• But it also showed the worst number (variability)
High variability in HDI Spark
BB Throughput at 100GB 8 streams SQL-only
(512-cores)
BB Throughput at 1TB 4 streams SQL-only
(512-cores)
BB Throughput at 10TB 2 streams SQL-only
(512-cores)
Summary: Hive and Spark
Hive vs. Spark
• Strategies
• Thin vs. Fat containers
• Sequential
• Hive-on-Tez faster at lower scales (EMR, HDI)
• Spark catches up at 1TB
• 1TB+
• Spark memory needs tuning
• Especially at 10TB
• Tuning also speeds up query time
• Concurrency
• Similar at 1GB
• Spark significantly faster from 100GB+ (new cluster)
Providers
• CDP should enable Tez by default
• EMR faster at lower scale
• HDI faster at 1TB
• HDI fastest with Hive (uses 1.2)
• No LLAP yet…
• SQL only
• Google cloud fastest with Spark
• HDI slowest with Spark
Average of three executions of 100 GB Scale Factor
Conclusions
• All providers have up-to-date (2.1.0) and well-tuned versions of Spark
• They could run BigBench up to 1TB on a medium-sized cluster
• [Almost] out-of-the-box
• Performance similar among providers for similar cluster types and disk configs
• Difference according to scale (and pricing)
• Spark 2.1.0 is faster than previous versions
• Also MLlib 2 with dataframes
• But improvements within the 30% range
• Hive (+ Tez + MLlib) is still slightly faster than Spark at lower scales for sequential runs
• But Spark significantly faster at high data scales and concurrency
• BigBench has been useful to stress a cluster with different workloads
• Highlights config problems fast and stresses scale limits
• Helpful for tuning the clusters
• And yes, Spark is now production ready and performant in PaaS in the cloud
Future work / WiP
• Continue the query characterization
• Combined for Hive and Spark, in multiple deployments
• Benchmarking
• Compare Hive versions 1 and 2
• HDI still on v1
• Test LLAP with different settings
• Variability study for Spark workloads in the cloud
• Fix 10TB runs to complete results
• Compare to on-prem runs
• Optimizations
• Test G1 GC
• Fat vs. thin executors configs
Resources and references
BigBench and ALOJA
• BigBench Spark 2 branch (thanks Christoph and Michael from bankmark.de):
• https://github.com/carabolic/Big-Data-Benchmark-for-Big-Bench/tree/spark2
• Original BigBench implementation repository
• https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench
• ALOJA benchmarking platform
• https://github.com/Aloja/aloja
• http://aloja.bsc.es/publications
• ALOJA fork of BigBench (adds support for HDI and fixes Spark)
• https://github.com/Aloja/Big-Data-Benchmark-for-Big-Bench
• The State of SQL-on-Hadoop in the Cloud – N. Poggi et al.
• https://doi.org/10.1109/BigData.2016.7840751
Big Data Benchmarking
• Big Data Benchmarking Community (BDBC) mailing list
• (~200 members from ~80 organizations)
• http://clds.sdsc.edu/bdbc/community
• Workshop on Big Data Benchmarking (WBDB)
• http://clds.sdsc.edu/bdbc/workshops
• SPEC Research Big Data working group
• http://research.spec.org/working-groups/big-data-working-group.html
• Benchmarking slides and video:
• Benchmarking Hadoop:
• https://www.slideshare.net/ni_po/benchmarking-hadoop
• Michael Frank on Big Data benchmarking
• http://www.tele-task.de/archive/podcast/20430/
• Tilmann Rabl Big Data Benchmarking Tutorial
• http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl
Thanks, questions?
Follow up / feedback : Nicolas.Poggi@bsc.es
Twitter: ni_po
The state of Hive and Spark in the cloud

Más contenido relacionado

La actualidad más candente

Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Databricks
 
Hive on Spark, production experience @Uber
 Hive on Spark, production experience @Uber Hive on Spark, production experience @Uber
Hive on Spark, production experience @UberFuture of Data Meetup
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudNicolas Poggi
 
State of Spark in the cloud (Spark Summit EU 2017)
State of Spark in the cloud (Spark Summit EU 2017)State of Spark in the cloud (Spark Summit EU 2017)
State of Spark in the cloud (Spark Summit EU 2017)Nicolas Poggi
 
Practice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobilePractice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobileDataWorks Summit
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresSpark Summit
 
Elastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryElastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryDatabricks
 
CaffeOnSpark Update: Recent Enhancements and Use Cases
CaffeOnSpark Update: Recent Enhancements and Use CasesCaffeOnSpark Update: Recent Enhancements and Use Cases
CaffeOnSpark Update: Recent Enhancements and Use CasesDataWorks Summit
 
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Mich Talebzadeh (Ph.D.)
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...DataWorks Summit/Hadoop Summit
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hiverxu
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Tuning up with Apache Tez
Tuning up with Apache TezTuning up with Apache Tez
Tuning up with Apache TezGal Vinograd
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem DataWorks Summit/Hadoop Summit
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 

La actualidad más candente (20)

Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
 
What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS
 
Hive on Spark, production experience @Uber
 Hive on Spark, production experience @Uber Hive on Spark, production experience @Uber
Hive on Spark, production experience @Uber
 
February 2014 HUG : Hive On Tez
February 2014 HUG : Hive On TezFebruary 2014 HUG : Hive On Tez
February 2014 HUG : Hive On Tez
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
State of Spark in the cloud (Spark Summit EU 2017)
State of Spark in the cloud (Spark Summit EU 2017)State of Spark in the cloud (Spark Summit EU 2017)
State of Spark in the cloud (Spark Summit EU 2017)
 
Practice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobilePractice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China Mobile
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi Torres
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Elastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryElastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent Memory
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
CaffeOnSpark Update: Recent Enhancements and Use Cases
CaffeOnSpark Update: Recent Enhancements and Use CasesCaffeOnSpark Update: Recent Enhancements and Use Cases
CaffeOnSpark Update: Recent Enhancements and Use Cases
 
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hive
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Tuning up with Apache Tez
Tuning up with Apache TezTuning up with Apache Tez
Tuning up with Apache Tez
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 

Similar a The state of Hive and Spark in the Cloud (July 2017)

Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheNicolas Poggi
 
Accelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cacheAccelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cacheDavid Grier
 
The State of Spark in the Cloud with Nicolas Poggi
The State of Spark in the Cloud with Nicolas PoggiThe State of Spark in the Cloud with Nicolas Poggi
The State of Spark in the Cloud with Nicolas PoggiSpark Summit
 
Ceph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der SterCeph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der SterCeph Community
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...Chester Chen
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDKKernel TLV
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetupGanesan Narayanasamy
 
DevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
DevOps for ETL processing at scale with MongoDB, Solr, AWS and ChefDevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
DevOps for ETL processing at scale with MongoDB, Solr, AWS and ChefGaurav "GP" Pal
 
stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4Gaurav "GP" Pal
 
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph Ceph Community
 
Retour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenantRetour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenantSwiss Data Forum Swiss Data Forum
 
Taking Splunk to the Next Level - Architecture
Taking Splunk to the Next Level - ArchitectureTaking Splunk to the Next Level - Architecture
Taking Splunk to the Next Level - ArchitectureSplunk
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_SummaryHiram Fleitas León
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionSplunk
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsHPCC Systems
 
Benchmarking your cloud performance with top 4 global public clouds
Benchmarking your cloud performance with top 4 global public cloudsBenchmarking your cloud performance with top 4 global public clouds
Benchmarking your cloud performance with top 4 global public cloudsdata://disrupted®
 
QCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference ArchitectureQCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference ArchitectureCeph Community
 
QCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference ArchitectureQCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference ArchitecturePatrick McGarry
 
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUsChoose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUsDatabricks
 

Similar a The state of Hive and Spark in the Cloud (July 2017) (20)

The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket Cache
 
Accelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cacheAccelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cache
 
The State of Spark in the Cloud with Nicolas Poggi
The State of Spark in the Cloud with Nicolas PoggiThe State of Spark in the Cloud with Nicolas Poggi
The State of Spark in the Cloud with Nicolas Poggi
 
Ceph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der SterCeph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der Ster
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
DevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
DevOps for ETL processing at scale with MongoDB, Solr, AWS and ChefDevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
DevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
 
stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4
 
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph
 
Retour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenantRetour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenant
 
Taking Splunk to the Next Level - Architecture
Taking Splunk to the Next Level - ArchitectureTaking Splunk to the Next Level - Architecture
Taking Splunk to the Next Level - Architecture
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
 
Benchmarking your cloud performance with top 4 global public clouds
Benchmarking your cloud performance with top 4 global public cloudsBenchmarking your cloud performance with top 4 global public clouds
Benchmarking your cloud performance with top 4 global public clouds
 
QCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference ArchitectureQCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference Architecture
 
QCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference ArchitectureQCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference Architecture
 
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUsChoose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
 

Más de Nicolas Poggi

Benchmarking Elastic Cloud Big Data Services under SLA Constraints
Benchmarking Elastic Cloud Big Data Services under SLA ConstraintsBenchmarking Elastic Cloud Big Data Services under SLA Constraints
Benchmarking Elastic Cloud Big Data Services under SLA ConstraintsNicolas Poggi
 
Correctness and Performance of Apache Spark SQL
Correctness and Performance of Apache Spark SQLCorrectness and Performance of Apache Spark SQL
Correctness and Performance of Apache Spark SQLNicolas Poggi
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJANicolas Poggi
 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataNicolas Poggi
 
Vagrant + Docker provider [+Puppet]
Vagrant + Docker provider [+Puppet]Vagrant + Docker provider [+Puppet]
Vagrant + Docker provider [+Puppet]Nicolas Poggi
 
The case for Hadoop performance
The case for Hadoop performanceThe case for Hadoop performance
The case for Hadoop performanceNicolas Poggi
 

Más de Nicolas Poggi (6)

Benchmarking Elastic Cloud Big Data Services under SLA Constraints
Benchmarking Elastic Cloud Big Data Services under SLA ConstraintsBenchmarking Elastic Cloud Big Data Services under SLA Constraints
Benchmarking Elastic Cloud Big Data Services under SLA Constraints
 
Correctness and Performance of Apache Spark SQL
Correctness and Performance of Apache Spark SQLCorrectness and Performance of Apache Spark SQL
Correctness and Performance of Apache Spark SQL
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big Data
 
Vagrant + Docker provider [+Puppet]
Vagrant + Docker provider [+Puppet]Vagrant + Docker provider [+Puppet]
Vagrant + Docker provider [+Puppet]
 
The case for Hadoop performance
The case for Hadoop performanceThe case for Hadoop performance
The case for Hadoop performance
 

The state of Hive and Spark in the Cloud (July 2017)

  • 1. The state of Hive and Spark in the cloud. Nicolas Poggi, July 2017
  • 2. Outline 1. Intro to BSC and ALOJA 2. BigBench 3. Cloud systems 4. Sequential tests 1GB – 10TB (4.1. Scalability) 5. Concurrency tests 6. Summary
  • 3. Barcelona Supercomputing Center (BSC) • Peak performance capacity of 13.7 Petaflop/s • 390 TB Total RAM • 14 PB storage (GPFS) • Heterogeneous architecture: • Intel Xeon: KNL and NKH • IBM Power9 + Latest NVIDIA, ARMv8 • Networking: InfiniBand EDR / Omni-path Launched July 1st 2017
  • 4. ALOJA: towards cost-effective Big Data • Research project for automating characterization and optimization of Big Data deployments • Open source Benchmarking-to-Insights platform and tools • Largest Big Data public repository (70,000+ jobs) • Community collaboration with industry and academia http://aloja.bsc.es Big Data Benchmarking Online Repository Web / ML Analytics
  • 5. Motivation • 2016 SQL-on-Hadoop paper and presentations • Focused on Hive, due to SparkSQL not being ready to use in PaaS • Different versions (1.3, 1.5, 1.6) • Some in preview mode • Not carefully tuned • Used TPC-H SQL-only benchmark • Early 2017, BigBench on Hive and Spark work testing more than SQL • FOSDEM and HadoopSummit EU presentations • New code available in May for MLlib2 compatibility • Goals: • Understand the different BigBench queries • Evaluate the current out-of-the-box experience of Spark and Hive in PaaS cloud • Readiness, scalability, price, and performance
  • 7. The need for a new benchmark standard • A benchmark captures the solution to a problem and guides decision making • Database-related benchmark standards • Transactional (OLTP): TPC C and E • Decision Support (DSS/OLAP): TPC H and DS • And for Big Data analytics properties? • 3 Vs, ML, M/R • Benchmark uses: • System tuning and debugging • Spread and broad Big Data ecosystem • Set common rules • Vendor comparison • Transparency across the industry
  • 8. What is BigBench (TPCx-BB)? • End-to-end application level benchmark specification • Result of many years of collaboration between industry and academia • Covers most Big Data analytical properties (3Vs) • Covers 30 business use cases for a retailer company • Defines data scale factors: 1GB to PBs BigBench history: • 2012: Launched at WBDB • 2013: Published at SIGMOD • 2014: First implementation on github • 2016: Standardized by TPC (Feb) • 2016: TPCx-BB Version 1.2 (Nov) • 2017: Spark MLlib v2 compatibility (under testing - May)
  • 9. BigBench use cases and process overview • 30 business use cases covering: • Merchandising • Pricing Optimization • Product Return • Customers... • Implementation resulted in: • 14 Declarative queries (SQL) • 7 with Natural Language Processing • 4 with data preprocessing with M/R jobs • 5 with Machine Learning jobs Process: 1. Data generation 2. Data loading 3. Power test 4. Throughput test 1 5. Data refresh 6. Throughput test 2 Result: • BB queries / min (BBQpm)
  • 10. BigBench v1.2 – Reference Implementation Stack layers: • Machine Learning: Mahout, Spark’s MLlib • SQL Engine: Hive, Spark SQL • Table Metastore: Hive Metastore • Execution Engine: MapReduce, Tez, Spark (on Yarn) • Filesystem: HDFS Combination options: • Hive + MapReduce + Mahout • Hive + MapReduce + Spark’s MLlib • v1 and v2 • Hive + Tez + Mahout • Hive + Tez + Spark’s MLlib • Spark + Mahout • Spark + MLlib • v1 and v2 • Also • Hive-on-Spark • Hive LLAP …
  • 11. Previous results: M/R vs Tez and Mahout vs. MLlib v1 Average of three executions using 100 GB Scale Factor [Chart: M/R, Tez, Mahout, MLlib v1; speedup labels 3.9x and 2.2x]
  • 12. Hive and Spark in PaaS
  • 13. Platform-as-a-Service Spark • Cloud-based managed Hadoop services • Ready to use Spark, Hive, … • Simplified management • Deploys in minutes, on-demand, elastic • You select the instance and the number of processing nodes • Decoupled compute and storage • Pay-as-you-go pricing model • Optimized for general purpose • Fine-tuned to the cloud provider architecture
  • 14. Surveyed Hadoop/Hive PaaS services • Amazon Elastic Map Reduce (EMR) • Released: Apr 2009 • OS: Amazon Linux AMI (RHEL-like) • SW stack: EMR 5.5 (and 5.6) • Spark 2.1.0 and Hive 2.1 (no LLAP) • Google Cloud DataProc (CDP) • Released: Feb 2016 • OS: Debian GNU/Linux 8.4 • SW stack: Preview version Spark 2.1.0 • V 1.1 with Spark 2.0.2 • Both with Hive 2.1 (no LLAP) • Azure HDInsight (HDI) • Released: Oct 2013 • OS: Windows Server and Ubuntu 16.04 • SW stack: HDP 2.6 based • Spark 2.1.0 and 1.6.3 • Hive 1.2 (Hive 2 + LLAP in preview mode) • Target deployment: • 16 data nodes with 8-cores each • Master node with 16-cores • Decoupled storage only • Object store / elastic stores
  • 15. VM instances and characteristics Amazon Elastic Map Reduce (EMR) • 16x M4.2xlarge (datanodes) • 8-core, 32GB RAM • 1x M4.4xlarge (master) • 16-core, 64 GB RAM • Storage: 2x EBS GP2 volumes • Price/hr: $10.96 (billed by the hour) Azure HDInsight (HDI) • 16x D4v2 (datanodes) • 8-core, 28GB RAM • 2x D14v2 (master) • 16-core, 112GB RAM • Storage: WASB (Azure Blob Store) • Price/hr: $20.68 (billed by the minute) Google Cloud DataProc (CDP) • 16x n1-standard-8 (datanodes) • 8-core, 30GB RAM • 1x n1-standard-16 (master) • 16-core, 60GB RAM • Storage: GCS • Price/hr: $10.38 (billed by the minute) Disclaimer: snapshot of the out-of-the-box price and performance during May 2017. Performance and especially costs change often. We use non-discounted pricing. I/O costs are complex to estimate for a single benchmark, using per second billing.
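The hourly prices and billing granularities on slide 15 can be turned into a rough per-run cost. The sketch below is illustrative only: the helper name `run_cost`, the 90-minute runtime, and the assumption that a partially used billing unit is charged in full are not from the slides, and I/O costs are left out, as the disclaimer suggests they are hard to estimate.

```python
import math

# Hourly cluster prices quoted on slide 15 (May 2017, non-discounted).
PRICE_PER_HOUR = {"EMR": 10.96, "HDI": 20.68, "CDP": 10.38}
# Billing granularity in minutes, as stated on the same slide.
BILLING_MINUTES = {"EMR": 60, "HDI": 1, "CDP": 1}


def run_cost(provider, runtime_seconds):
    """Estimated compute cost in USD for one benchmark run.

    Assumption (not from the slides): a partially used billing
    unit is rounded up and charged in full.
    """
    granule = BILLING_MINUTES[provider] * 60
    billed_seconds = math.ceil(runtime_seconds / granule) * granule
    return round(PRICE_PER_HOUR[provider] * billed_seconds / 3600, 2)


# Hypothetical 90-minute run on each provider.
for name in ("EMR", "HDI", "CDP"):
    print(name, run_cost(name, 90 * 60))
```

Under these assumptions the hourly-billed provider pays for two full hours on a 90-minute run, so billing granularity can matter as much as the list price for short benchmarks.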
  • 16. Sequential Hive vs Spark 2.1 Queries 1-30 on Spark 2.1 (power runs): Query 1, Query 2, …, Query 30 (Spark shell banner: version 2.1.0)
  • 17. BB 1GB-1TB Scalability Dataproc: Hive 2.1 (M/R) vs Spark 2.1
  • 18. BB 1GB-1TB Scalability EMR: Hive 2.1 (Tez) vs Spark 2.1
  • 19. BB 1GB-1TB Scalability HDI: Hive 1.2 (Tez) vs Spark 2.1
  • 20. BB 1TB Power runs: Hive vs Spark 2.1, all providers Notes: • Times differ for each type of query • Dataproc Hive is the slowest because it uses M/R • HDI has the fastest results • Spark is faster on ML • Hive is faster on UDF • EMR shows similar times for both • Spark has problems in the M/R queries
  • 21. BigBench Hive vs Spark data node CPU % (HDI)
  • 22. BigBench Hive vs Spark data containers (HDI)
  • 23. Comparison of Q5 (ML) in Hive and Spark
  • 24. Errors and configs EMR slow Spark query and solution Configurations
  • 25. BB 1TB M/R-only: Spark 2.1 – All providers Notes: • When zooming by query, we can see that query 2 is the slowest on EMR • While on CDP and HDI it is within proportion
  • 26. BB 1TB Q2: Spark 2.1 – CPU Util % EMR and HDI Notes: • Job was CPU bound. Log showed: • WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. • Solution: increased memory for executors, which lowered the time from 6,417 s to 1,501 s (4.3x) Q2: Find the top 30 products that are mostly viewed together with a given product in the online store CREATE TEMPORARY FUNCTION makePairs AS 'io.bigdatabenchmark.v1.queries.udf.PairwiseUDTF';
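The 5.5 GB limit in the YARN warning on slide 26 can be reproduced with a little arithmetic. The sketch below assumes the default overhead rule documented for Spark 2.x on YARN (10% of executor memory, with a 384 MB minimum) and a 5G executor, as listed for EMR on slide 28, which gives the overhead as the 384 MB default. The function name is hypothetical, and YARN's rounding to its allocation increment is ignored.

```python
def yarn_container_mb(executor_memory_mb, overhead_mb=None):
    """Memory YARN reserves for one Spark executor container, in MB.

    If no explicit overhead is given, fall back to the documented
    Spark 2.x default for spark.yarn.executor.memoryOverhead:
    10% of the executor memory, but never less than 384 MB.
    """
    if overhead_mb is None:
        overhead_mb = max(executor_memory_mb // 10, 384)
    return executor_memory_mb + overhead_mb


# A 5G executor: 5120 MB heap + 512 MB overhead.
limit_mb = yarn_container_mb(5 * 1024)
print(limit_mb, limit_mb / 1024)  # 5632 5.5
```

An executor that needs slightly more off-heap memory than this allowance is killed by YARN, which is consistent with the "5.6 GB of 5.5 GB" message; raising the overhead or the executor memory widens the container.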
  • 27. Errors in PaaS out-of-the-box… • Everything was run out-of-the-box, except for: • Q14 and Q17 require cross joins to be enabled in Spark v2 • At 10TB: • spark.sql.broadcastTimeout (default 300) had to be increased in HDI • Timeout in seconds for the broadcast wait time in broadcast joins • At 1TB, memory issues: • Queries 3, 4, 8 • TimSort java.lang.OutOfMemoryError: Java heap space at org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate • Queries 2 and 30 • 17/05/15 16:57:46 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. • Configs • spark.yarn.driver.memoryOverhead and spark.yarn.executor.memoryOverhead • spark.executor.memory
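The exceptions listed on slide 27 map to a handful of Spark 2.x settings. As a minimal sketch, the helper below turns such overrides into `--conf` arguments for `spark-submit`. The property names are real Spark 2.x settings, but the helper itself and every value shown are hypothetical placeholders, since the slides do not state the values that were used.

```python
def spark_submit_conf(settings):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(settings.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Placeholder values for illustration only; not the ones used in the talk.
overrides = {
    "spark.sql.crossJoin.enabled": "true",       # needed by Q14 and Q17 on Spark 2
    "spark.sql.broadcastTimeout": 1200,          # seconds; default is 300
    "spark.yarn.executor.memoryOverhead": 1024,  # MB
}

print(" ".join(["spark-submit"] + spark_submit_conf(overrides)))
```

Keeping the overrides in one dictionary makes it easy to record, per provider and scale factor, exactly which settings departed from the out-of-the-box configuration.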
  • 28. Spark config
      Setting                   | EMR               | CDP               | HDI
      Java version              | OpenJDK 1.8.0_121 | OpenJDK 1.8.0_121 | OpenJDK 1.8.0_131
      Spark version             | 2.1.0             | 2.1               | 2.1.0.2.6.0.2-76
      Driver memory             | 5G                | 5G                | 5G
      Executor memory           | 5G                | 10G               | 4G
      Executor cores            | 4                 | 4                 | 3
      Executor instances        | Dynamic           | Dynamic           | 20
      dynamicAllocation enabled | TRUE              | TRUE              | FALSE
      Executor memoryOverhead   | Default (384MB)   | 1,117 MB          | 384 MB
  • 29. The sky 10TB is the limit… Results for SQL-only
  • 30. BB 1GB-10TB Scalability SQL-only queries Hive Spark
  • 31. BigBench 10TB SQL-only: All providers Notes: • At 10TB, only SQL part ran correctly in Spark • EMR got the fastest results • Rest still needs tuning to complete • But reaching the limit of the cluster / PaaS config
  • 32. Other comparisons: 2.0.2 vs 2.1.0, 1.6.3 vs 2.1.0, MLlib v1 vs v2
  • 33. BigBench 1GB-1TB: Spark 2.0.2 vs 2.1.0 (CDP) Notes: Spark 2.1 is a bit faster at small scales, slower at 100 GB and 1 TB on the UDF/NLP queries. Chart annotations: 2.1 faster up to 100GB; slower at 1TB
  • 34. BigBench 1GB-1TB: Spark 1.6.3 vs 2.1.0 MLlib 1 vs 2.1 MLlib 2 (HDI) Notes: • Spark 2.1 is always faster than 1.6.3 in HDI • MLlib 2, using dataframes over RDDs, is only slightly faster than v1.
  • 35. Query 2 CPU % example (100GB) [Chart panels: Tez, Spark 1.6.2, Spark 2.0.2] Average of three executions using 100 GB Scale Factor
  • 36. Concurrency runs (throughput) • 2 to 32 parallel streams: 128-core cluster • SQL-only, 100GB – 10TB: 512-core cluster
  • 37. BB Throughput 1GB, 1-32 streams (128-cores) Notes: • Providers are similar on concurrency • From 16 streams on, the bottleneck is the CPU utilization on the master • HDI is faster at concurrency, but also showed the worst number (high variability in HDI Spark)
  • 38. BB Throughput at 100GB 8 streams SQL-only (512-cores)
  • 39. BB Throughput at 1TB 4 streams SQL-only (512-cores)
  • 40. BB Throughput at 10TB 2 streams SQL-only (512-cores)
  • 41. Summary: Hive and Spark Hive vs. Spark • Strategies • Thin vs. Fat containers • Sequential • Hive-on-Tez faster at lower scales (EMR, HDI) • Spark catches up at 1TB • 1TB+ • Spark memory needs tuning • Especially at 10TB • Also, speeds up query time • Concurrency • Similar at 1GB • Spark significantly faster from 100GB+ (new cluster) Providers • CDP should enable Tez by default • EMR faster at lower scale • HDI faster at 1TB • HDI fastest with Hive (uses 1.2) • No LLAP yet… • SQL only • Google cloud fastest with Spark • HDI slowest with Spark Average of three executions of 100 GB Scale Factor
  • 42. Conclusions • All providers have up-to-date (2.1.0) and well-tuned versions of Spark • They could run BigBench up to 1TB on a medium-sized cluster • [Almost] out-of-the-box • Performance similar among providers for similar cluster types and disk configs • Difference according to scale (and pricing) • Spark 2.1.0 is faster than previous versions • Also MLlib 2 with dataframes • But improvements within the 30% range • Hive (+Tez + MLlib) is still slightly faster than Spark at lower scales for sequential • But Spark significantly faster at high data scales and concurrency • BigBench has been useful to stress a cluster with different workloads • Highlights config problems fast and stresses scale limits • Helpful for tuning the clusters • And yes, Spark is now production ready and performant in PaaS in the cloud
  • 43. Future work / WiP • Continue the query characterization • Combined for Hive and Spark, in multiple deployments • Benchmarking • Compare Hive versions 1 and 2 • HDI still on v1 • Test LLAP with different settings • Variability study for spark workloads in the cloud • Fix 10TB runs to complete results • Compare to on-prem runs • optimizations • Test G1 GC • Fat vs. thin executors configs
  • 44. Resources and references BigBench and ALOJA • BigBench Spark 2 branch (thanks Christoph and Michael from bankmark.de): • https://github.com/carabolic/Big-Data-Benchmark-for-Big-Bench/tree/spark2 • Original BigBench Implementation repository • https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench • ALOJA benchmarking platform • https://github.com/Aloja/aloja • http://aloja.bsc.es/publications • ALOJA fork of BigBench (adds support for HDI and fixes spark) • https://github.com/Aloja/Big-Data-Benchmark-for-Big-Bench • The State of SQL-on-Hadoop in the Cloud – N. Poggi et al. • https://doi.org/10.1109/BigData.2016.7840751 Big Data Benchmarking • Big Data Benchmarking Community (BDBC) mailing list • (~200 members from ~80 organizations) • http://clds.sdsc.edu/bdbc/community • Workshop Big Data Benchmarking (WBDB) • http://clds.sdsc.edu/bdbc/workshops • SPEC Research Big Data working group • http://research.spec.org/working-groups/big-data-working-group.html • Benchmarking slides and video: • Benchmarking Hadoop: • https://www.slideshare.net/ni_po/benchmarking-hadoop • Michael Frank on Big Data benchmarking • http://www.tele-task.de/archive/podcast/20430/ • Tilmann Rabl Big Data Benchmarking Tutorial • http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl
  • 45. Thanks, questions? Follow up / feedback : Nicolas.Poggi@bsc.es Twitter: ni_po The state of Hive and Spark in the cloud