The state of Hive and Spark in the Cloud (July 2017)

The state of Hive and Spark
in the cloud
Nicolas Poggi
July 2017
____ __
/ __/__ ___ _____/ /__
_ / _ / _ `/ __/ '_/
/___/ .__/_,_/_/ /_/_
/_/

Outline
1. Intro to BSC and ALOJA
2. BigBench
3. Cloud systems
4. Sequential tests 1GB – 10TB
1. Scalability
5. Concurrency tests
6. Summary
2

Barcelona Supercomputing Center (BSC)
• Peak performance capacity of 13.7 Petaflop/s
• 390 TB Total RAM
• 14 PB storage (GPFS)
• Heterogeneous architecture:
• Intel Xeon: KNL and NKH
• IBM Power9 + Latest NVIDIA, ARMv8
• Networking: InfiniBand EDR / Omni-path
Launched July 1st 2017

ALOJA: towards cost-effective Big Data
• Research project for automating characterization and
optimization of Big Data deployments
• Open source Benchmarking-to-Insights platform and tools
• Largest Big Data public repository (70,000+ jobs)
• Community collaboration with industry and academia
http://aloja.bsc.es
Big Data
Benchmarking
Online
Repository
Web / ML
Analytics

Motivation
• 2016 SQL-on-Hadoop paper and presentations
• Focused on Hive, due to SparkSQL not being ready to use in PaaS
• Different versions (1.3, 1.5, 1.6)
• Some in preview mode
• Not carefully tuned
• Used TCP-H SQL-only benchmark
• Early 2017, BigBench on Hive and Spark work testing more than SQL
• FOSDEM and HadoopSummit EU presentations
• New code available in May for MLlib2 compatibility
• Goals:
• Understand the different BigBench queries
• Evaluate the current out-of-the-box experience of Spark and Hive in PaaS cloud
• Readiness, scalability, price, and performance
5

The need for a new benchmark standard
• A benchmark captures the solution to a problem and guides decision making
• Database related benchmarks standards
• Transactional (OLTP): TPC C and E
• Decision Support (DSS/OLAP): TPC H and DS
• And for Big Data analytics properties?
• 3 Vs, ML, M/R
• Benchmark uses:
• System tuning and debugging
• Spread and broad Big Data ecosystem
• Set common rules
• Vendor comparison
• Transparency across the industry
8

What is BigBench (TPCx-BB)?
• End-to-end application level benchmark specification
• result of many years of collaboration of industry and academia
• Covers most Big Data Analytical properties (3Vs)
• Covers 30 business use cases for a retailer company
• Defines data scale factors: 1GB to PBs
9
2012
• Launched at WBDB
2013
• Published at SIGMOD
2014
• First implementation on github
2016
• Standardized by TPC (Feb)
2016
• TCPx-BB Version 1.2 (Nov)
2017
• Spark MLlib v2 compatibility
(under testing - May)
BigBench history

BigBench use cases and process overview
• 30 business uses cases covering:
• Merchandising,
• Pricing Optimization
• Product Return
• Customers...
• Implementation resulted in:
• 14 Declarative queries (SQL)
• 7 with Natural Language Processing
• 4 with data preprocessing with M/R jobs
• 5 with Machine Learning jobs
10
1 Data generation
2 Data loading
3 Power test
4 Throughput test 1
5 Data refresh
6 Throughput test 2
Result
• BB queries / min (BBQpm)

BigBench v1.2 – Reference Implementation
HDFS
Hive Metastore
MapReduce Tez Spark
Yarn
Hive Spark SQL
Mahout Spark’s MLlibMachine Learning
SQL Engine
Table Metastore
Execution Engine
Filesystem
Combination options:
• Hive + MapReduce + Mahout
• Hive + MapReduce + Spark’s Mllib
• v1 and v2
• Hive + Tez + Mahout
• Hive + Tez + Spark’s MLlib
• Spark + Mahout
• Spark + MLlib
• v1 and v2
• Also
• Hive-on-Spark
• Hive LLAP …

Previous results: M/R vs Tez and Mahout vs. MLlib v1
12Average of three executions using 100 GB Scale Factor
M/R
Tez
Mahout
MLlib v1
3.9x 2.2x

Platform-as-a-Service Spark
• Cloud-based managed Hadoop services
• Ready to use Spark, Hive, …
• Simplified management
• Deploys in minutes, on-demand, elastic
• You select the instance and
• the number of processing nodes
• Decoupled compute and storage
• Pay-as-you-go pricing model
• Optimized for general purpose
• Fined tuned to the cloud provider architecture
14

Surveyed Hadoop/Hive PaaS services
• Amazon Elastic Map Reduce (EMR)
• Released: Apr 2009
• OS: Amazon Linux AMI (RHEL-like)
• SW stack: EMR 5.5 (and 5.6)
• Spark 2.1.0 and Hive 2.1 (no LLAP)
• Google Cloud DataProc (CDP)
• Released: Feb 2016
• OS: Debian GNU/Linux 8.4
• SW stack: Preview version Spark 2.1.0
• V 1.1 with Spark 2.0.2
• Both with Hive 2.1 (no LLAP)
• Azure HDInsight (HDI)
• Released: Oct 2013
• OS: Windows Server and Ubuntu 16.04
• SW stack: HDP 2.6 based
• Spark 2.1.0 and 1.6.3
• Hive 1.2 (Hive 2 + LLAP in preview mode)
• Target deployment:
• 16 data nodes with 8-cores each
• Master node with 16-cores
• Decoupled storage only
• Object store / elastic stores
15

VM instances and characteristics
Amazon Elastic Map Reduce (EMR)
• 16x M4.2xlarge (datanodes)
• 8-core, 32GB RAM
• 1x M4.4xlarge (master)
• 16-core, 64 GB RAM
• Storage: 2x EBS GP2 volumes
• Price/hr: $10.96 (billed by the hour)
Azure HDInsight (HDI)
• 16x D4v2 (datanodes)
• 2x D14v2 (master)
• Storage: WASB (Azure Blob Store)
• Price/hr: $20.68 (billed by the minute)
17
Google Cloud DataProc (CDP)
• 16x n1-standard-8 (datanodes)
• 1x n1-standard-16 (master)
• Storage GCS
• Price/hr: $10.38 (billed by the minute)
Disclaimer: snapshot of the out-of-the-box price and performance
during May 2017. Performance and especially costs change
often. We use non-discounted pricing. I/O costs are complex to
estimate for a single benchmark, using per second billing.

Sequential Hive vs Spark 2.1
Queries 1-30 on Spark 2.1 (power runs)
Query 1 Query 2 …. Query 30
Welcome to
____ __
/ __/__ ___ _____/ /__
_ / _ / _ `/ __/ '_/
/___/ .__/_,_/_/ /_/_ version 2.1.0
/_/

BB 1GB-1TB Scalability Dataproc:
Hive 2.1 (M/R) vs Spark 2.1

BB 1GB-1TB Scalability EMR:
Hive 2.1 (Tez) vs Spark 2.1

BB 1GB-1TB Scalability HDI:
Hive 1.2 (Tez) vs Spark 2.1

Notes:
• Times for each type of
query different
• Dataproc hive the
slowest due to using
M/R
• HDI fastest results
• Spark faster
• ML
• Hive faster
• UDF
• EMR similar times both
• Spark problems in
M/R
BB 1TB Power runs : Hive vs Spark 2.1
All providers

BigBench Hive vs Spark data node CPU % (HDI)

BigBench Hive vs Spark data containers (HDI)

Comparison of Q5 (ML) in Hive and Spark
31

Errors and configs
EMR slow Spark query and solution
Configurations

BB 1TB M/R-only: Spark 2.1 – All providers
Notes:
• When zooming by query,
we can see that query 2 is
the slowest on ERM
• While on CDP and HDI is
within proportions

BB 1TB Q2: Spark 2.1 – CPU Util % EMR and HDI
Notes:
• Job was CPU bounded. Log showed:
• WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical
memory used. Consider boosting spark.yarn.executor.memoryOverhead.
• Solution: Increased memory for executors and time was lowered from 6,417s > 1,501 (4.3X)
Q2: Find the top 30 products that are mostly viewed together with a given product in online store
CREATE TEMPORARY FUNCTION makePairs AS io.bigdatabenchmark.v1.queries.udf.PairwiseUDTF';

Errors in PaaS out-of-the-box…
• Everything was run out-of-the-box, except for:
• Q 14 17 requires cross joins to be enabled in Spark v2
• At 10TB,
• spark.sql.broadcastTimeout (default 300) had to be increased in HDI
• Timeout in seconds for the broadcast wait time in broadcast joins
• At 1TB memory issues
• Queries 3, 4, 8
• TimSort java.lang.OutOfMemoryError: Java heap space at
org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate
• Queries 2, and 30
• 17/05/15 16:57:46 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for
exceeding memory limits. 5.6 GB of 5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.
• Configs
• spark.yarn.driver and executor memoryOverhead
• spark.yarn.executor.memory

Spark config
EMR CDP HDI
Java version OpenJDK 1.8.0_121 OpenJDK 1.8.0_121 OpenJDK 1.8.0_131
Spark version 2.1.0 2.1 2.1.0.2.6.0.2-76
Driver memory 5G 5G 5G
Executor memory 5G 10G 4G
Executor cores 4 4 3
Executor instances Dynamic Dynamic 20
dynamicAllocation
enabled
TRUE TRUE FALSE
Executor
memoryOverhead
Default (384MB) 1,117 MB 384 MB
36

The sky 10TB is the limit…
Results for SQL-only
37

BB 1GB-10TB Scalability SQL-only queries
Hive
Spark

BigBench 10TB SQL-only: All providers
Notes:
• At 10TB, only
SQL part ran
correctly in
Spark
• EMR got the
fastest results
• Rest still needs
tuning to
complete
• But reaching
the limit of the
cluster / PaaS
config

Other comparisons:
2.0.2 vs 2.1.0
1.6.3 vs 2.1.0
MLlib v1 vs v2
41

BigBench 1GB-1TB: Spark 2.0.2 vs 2.1.0 (CDP)
Notes:
Spark 2.1 a bit faster at
small scales, slower at
100 GB and 1 TB on the
UDF/NLP queries
2.1 faster up
to 100GB
Slower at 1TB

BigBench 1GB-1TB: Spark 1.6.3 vs 2.1.0
MLlib 1 vs 2.1 MLlib 2(HDI)
Notes:
• Spark 2.1 is always
faster than 1.6.3 in
HDI
• MLlilb 2 using
dataframes over RDDs
is only slightly faster
than V1.

Query 2 CPU % example (100GB)
44
Tez Spark 1.6.2 Spark 2.0.2
Average of three executions using 100 GB Scale Factor

Concurrency runs (throughput)
2 to 32 parallel streams 128-core cluster
SQL-only: 100GB – 10TB 512-core cluster
4545

BB Throughput 1GB, 1-32 streams (128-cores)
Notes:
• Providers similar on
concurrency
• From 16 streams on,
the bottleneck is the
CPU utilization on the
master
• HDI faster at
concurrency,
• But also showed the
worst number
(variability)
High variability in HDI Spark

BB Throughput at 100TB 8 streams SQL-only
(512-cores)

(512-cores)

Summary: Hive and Spark
Hive vs. Spark
• Strategies
• Thin vs. Fat containers
• Sequential
• Hive-on-Tez faster at lower scales (EMR,
HDI)
• Spark catches up at 1TB
• 1TB+
• Spark memory needs tuning
• Especially at 10TB
• Also, speedups query time
• Concurrency
• Similar at 1GB
• Spark significantly faster from 100GB+ (new
cluster)
Providers
• CDP should enable Tez by default
• EMR faster at lower scale
• HDI faster at 1TB
• HDI fastest with Hive (uses 1.2)
• No LLAP yet…
• SQL only
• Google cloud fastest with Spark
• HDI slowest with Spark
Average of three executios of 100 GB Scale Factor 50

Conclusions
• All providers have up to date (2.1.0) and well tuned versions of Spark
• They could run BigBench up to 1TB on medium-sized cluster
• [Almost] Out-of-the box
• Performance similar among providers for similar cluster types and disk configs
• Difference according to scale (and pricing)
• Spark 2.1.0 is faster than previous versions
• Also MLlib 2 with dataframes
• But improvements within the 30% range
• Hive (+Tez + MLlib) are still slightly faster than Spark at lower scales for sequential
• But Spark significantly faster at high data scales and concurrency
• BigBench has been useful to stress a cluster with different workloads
• Highlights config problems fast and stresses scale limits
• Helpful for tuning the clusters
• And yes, Spark is now production ready and performant in PaaS in the cloud
51

Future work / WiP
• Continue the query characterization
• Combined for Hive and Spark, in multiple deployments
• Benchmarking
• Compare Hive versions 1 and 2
• HDI still on v1
• Test LLAP with different settings
• Variability study for spark workloads in the cloud
• Fix 10TB runs to complete results
• Compare to on-prem runs
• optimizations
• Test G1 GC
• Fat vs. thin executors configs

Resources and references
BigBench and ALOJA
• BigBench Spark 2 branch (thanks Christoph and
Michael from bankmark.de):
• https://github.com/carabolic/Big-Data-Benchmark-for-
Big-Bench/tree/spark2
• Original BigBench Implementation repository
• https://github.com/intel-hadoop/Big-Data-Benchmark-
for-Big-Bench
• ALOJA benchmarking platform
• https://github.com/Aloja/aloja
• http://aloja.bsc.es/publications
• ALOJA fork of BigBench (adds support for HDI and fixes
spark)
• https://github.com/Aloja/Big-Data-Benchmark-for-Big-
Bench
• The State of SQL-on-Hadoop in the Cloud – N. Poggi
et. al.
• https://doi.org/10.1109/BigData.2016.7840751
Big Data Benchmarking
• Big Data Benchmarking Community (BDBC) mailing
list
• (~200 members from ~80organizations)
• http://clds.sdsc.edu/bdbc/community
• Workshop Big Data Benchmarking (WBDB)
• http://clds.sdsc.edu/bdbc/workshops
• SPEC Research Big Data working group
• http://research.spec.org/working-groups/big-data-
working-group.html
• Benchmarking slides and video:
• Benchmarking Hadoop:
• https://www.slideshare.net/ni_po/benchmarking-hadoop
• Michael Frank on Big Data benchmarking
• http://www.tele-task.de/archive/podcast/20430/
• Tilmann Rabl Big Data Benchmarking Tutorial
• http://www.slideshare.net/tilmann_rabl/ieee2014-
tutorialbarurabl
53

Thanks, questions?
Follow up / feedback : Nicolas.Poggi@bsc.es
Twitter: ni_po
The state of Hive and Spark in the cloud

The state of Hive and Spark in the Cloud (July 2017)

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a The state of Hive and Spark in the Cloud (July 2017)

Similar a The state of Hive and Spark in the Cloud (July 2017) (20)

Más de Nicolas Poggi

Más de Nicolas Poggi (6)

Último

Último (20)

The state of Hive and Spark in the Cloud (July 2017)