1© Cloudera, Inc. All rights reserved.
Faster Batch Processing with Hive-on-Spark
Santosh Kumar | Cloudera
Rui Li | Intel
Agenda
• What is Hive-on-Spark?
• Using Hive-on-Spark
• Performance Metrics
• Configuration & Tuning
• What’s Next?
• Q&A
Apache Spark
Flexible, in-memory data processing for Hadoop
Easy Development
• Rich APIs for Scala, Java, and Python
• Interactive shell
Flexible Extensible API
• APIs for different types of workloads:
• Batch
• Streaming
• Machine Learning
• Graph
Fast Batch & Stream Processing
• In-Memory processing and caching
Spark Takes Advantage of Memory
• Resilient Distributed Datasets (RDD)
• In-memory data-structure partitioned across a set of machines
• Can fall back to disk when data-set does not fit in memory
• Created by parallel transformations on data in stable storage
• Provides fault-tolerance through concept of lineage
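The lineage idea can be illustrated without Spark itself. The following is a minimal sketch in plain Python, assuming a hypothetical `ToyRDD` class (not part of any Spark API): each dataset remembers its parent and the transformation that produced it, so a lost in-memory partition can be recomputed from stable storage instead of being restored from a replica.

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """Hypothetical stand-in for an RDD; not part of the Spark API."""

    def __init__(self, source: Optional[List[List[int]]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[int], int]] = None) -> None:
        self.source = source   # partitions held in "stable storage"
        self.parent = parent   # lineage: the dataset this one derives from
        self.fn = fn           # lineage: the per-element transformation
        self.cache: Dict[int, List[int]] = {}  # in-memory partitions, may be lost

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # A transformation only records lineage; nothing is computed yet.
        return ToyRDD(parent=self, fn=fn)

    def compute(self, index: int) -> List[int]:
        # Serve from memory when possible, otherwise replay the lineage.
        if index in self.cache:
            return self.cache[index]
        if self.parent is None:
            result = list(self.source[index])
        else:
            result = [self.fn(x) for x in self.parent.compute(index)]
        self.cache[index] = result
        return result


base = ToyRDD(source=[[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
plus_one = doubled.map(lambda x: x + 1)

first = plus_one.compute(1)

# Simulate losing the in-memory copies of the derived partitions.
plus_one.cache.clear()
doubled.cache.clear()

# The partition is rebuilt by replaying map steps from the base data.
recovered = plus_one.compute(1)
```

Real RDDs track lineage per partition across many operator types; this sketch covers only a chain of `map` steps.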
Introduction
• Enables Hive to use Spark as underlying execution engine
• Motivations
• Consolidation of Spark as execution engine
• Better performance
• Increased adoption of Hive (e.g. for Spark users)
• Community effort by Cloudera, IBM, Intel, MapR, and others
Choosing the Right SQL Engine
Know Your Audience, Know Your Use Case
• Batch Processing
• BI and SQL Analytics
• Procedural Development
[Figure: SQL engine logos for each use case, including Impala]
Current State of Hive-on-Spark (HoS)
• Fully supported production release in C5.7
• Functional parity with Hive-on-MapReduce (HoMR)
• Average 3x performance improvement vs HoMR
• Automatic configuration and optimizations via Cloudera Manager
• Strong early user base
• Early commitment for future collaboration from Intel and others
Design Principles
• Minimize impact on existing code path
• Minimizes functional and performance impact
• Minimizes maintenance
• Maximizes support for Hive features – current as well as future
• Spark invoked only at execution layer
• HoS produces similar logical operators plan as HoMR
• Logical plan runs on low-level Spark primitives
• Minimizes usage of advanced Spark primitives
Getting Started with Hive-on-Spark
Configuration
• Minimal configurations needed
• Via Cloudera Manager: Set “Spark on YARN Service” (internally sets spark.master=yarn-cluster)
• Set hive.execution.engine=spark per service or query
• Only yarn-cluster is supported
• Cloudera Manager auto-configures most configurations
• Configuration & Tuning Guide available on Docs
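For the per-query route, the engine can be switched with a session-level `SET` statement. A minimal sketch follows; `render_set_statements` is a hypothetical helper introduced only for illustration, and the only property taken from the slide is `hive.execution.engine`.

```python
def render_set_statements(settings):
    """Turn a dict of properties into HiveQL SET statements (hypothetical helper)."""
    return [f"SET {key}={value};" for key, value in settings.items()]


# hive.execution.engine=spark is the setting named on the slide;
# "mr" is the value Hive uses for the MapReduce engine.
use_spark = render_set_statements({"hive.execution.engine": "spark"})
use_mr = render_set_statements({"hive.execution.engine": "mr"})

print("\n".join(use_spark))
```

Issuing the rendered statement at the start of a session would apply it to the queries that follow in that session, while the service-level setting changes the default for everyone.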
Performance
Avg. ~3X faster than Hive-on-MapReduce
More suitable:
• Complex workloads with multiple MR stages, e.g. a filter followed by JOIN followed by GROUP BY
• Disk-bound workloads with multiple disk reads/writes
• Workloads requiring minutes to hours for completion
Less suitable:
• Simple workloads, e.g. select *
• CPU-bound workloads, e.g. complex UDFs
• Workloads typically requiring <1 min
Query Execution: Background
Input
status_updates( userid int,status string,ds string)
profiles(userid int,school string,gender int)
Output
school_summary(school string,cnt int,ds string)
gender_summary(gender int,cnt int,ds string)
Query Execution: MapReduce
[Figure: MapReduce query plan, drawn across four panels labeled BEGINS, CONTINUES, CONTINUES, ENDS]
FileSinkOperator (disk write) and TableScanOperator (disk read) are very costly
Query Execution: Hive-on-Spark
Costly Steps Removed
[Figure: Hive-on-Spark query plan, drawn across four panels labeled BEGINS, CONTINUES, CONTINUES, ENDS]
Optimization for Resource Management: Long-Lived Executors (LLE)
• MR: Each query is an independent YARN application
• Spark: Each SQL session is a long-lived YARN application
• First query of a session spawns a YARN app
• Subsequent queries re-use same YARN app as well as containers
• Session disconnect shuts down YARN app and releases container resources
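The difference in application lifecycle can be made concrete with a small count. This is a toy model under the slide's own simplification (one YARN application per query for MR, one per session for Spark); `yarn_apps_launched` and the sample workload are hypothetical.

```python
def yarn_apps_launched(queries_per_session, engine):
    """Count YARN applications started for a list of sessions.

    queries_per_session: number of queries issued in each session.
    engine: "mr" (one application per query) or "spark" (one application
    per session, spawned by that session's first query).
    """
    if engine == "mr":
        return sum(queries_per_session)
    if engine == "spark":
        # A session that never runs a query never spawns an application.
        return sum(1 for n in queries_per_session if n > 0)
    raise ValueError(f"unknown engine: {engine}")


sessions = [5, 3, 0]  # hypothetical workload: three sessions
mr_apps = yarn_apps_launched(sessions, "mr")
spark_apps = yarn_apps_launched(sessions, "spark")
```

In this model every query after the first in a session skips the cost of starting a YARN application and acquiring containers.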
Long-Lived Executors Details
• Hive User Session will submit Spark Application to YARN
• Spark YARN Application:
• YARN containers = Spark executors (executors live in YARN containers)
• YARN Application Master = RemoteDriver
• Submits Spark ‘jobs’, aka Hive queries, to Spark executors
• Connects back to HS2 to report job progress from Spark executors
[Figure: User1 and User2 connect to HiveServer2 as Session1 and Session2; in the YARN cluster each session has its own AM (RemoteDriver1, RemoteDriver2) with containers (executors)]
Configuration and Tuning
Hive-on-Spark
Spark Configuration
• Size of executors
• Bigger and fewer executors
• Thread contention
• GC pressure
• Smaller and more executors
• Less memory efficient
• Bigger start-up overhead
Spark Configuration
• CPU
• Around 5-7 cores per executor
• Memory
• Leave 10% for OS cache
• Executor memory overhead
• Tune by case
• Can be heavily used by Netty
• Usually 15% - 20%
• Around 3GB per core
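These rules of thumb can be combined into a rough per-node calculation. The sketch below is illustrative only: `size_executors` is a hypothetical helper, the node size is made up, and it assumes the overhead percentage is applied to the total memory of each executor container (heap plus overhead). It also ignores any cores or memory reserved for other services on the node.

```python
def size_executors(node_cores, node_mem_gb, cores_per_executor=6,
                   os_cache_fraction=0.10, overhead_fraction=0.15):
    """Rough per-node executor sizing from the slide's rules of thumb.

    Assumption: overhead_fraction is a share of each executor container's
    total memory (heap plus overhead).
    """
    executors = node_cores // cores_per_executor
    if executors == 0:
        raise ValueError("node has fewer cores than one executor needs")
    usable_gb = node_mem_gb * (1 - os_cache_fraction)  # leave 10% for OS cache
    container_gb = usable_gb / executors
    overhead_gb = container_gb * overhead_fraction
    heap_gb = container_gb - overhead_gb
    return {
        "executors_per_node": executors,
        "executor_memory_gb": round(heap_gb, 2),
        "memory_overhead_gb": round(overhead_gb, 2),
        "heap_gb_per_core": round(heap_gb / cores_per_executor, 2),
    }


# Hypothetical worker node: 32 cores, 128 GB RAM.
plan = size_executors(node_cores=32, node_mem_gb=128)
print(plan)
```

With these inputs the heap per core comes out a little above 3 GB, in line with the "around 3GB per core" guideline; a node with less memory per core would suggest fewer cores per executor or a smaller overhead share.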
Spark Configuration
• Serialization
• spark.serializer – kryo performs better and is REQUIRED by HoS
• spark.kryo.referenceTracking – disable to avoid a Java performance issue
• Shuffle
• spark.shuffle.compress
• spark.shuffle.spill.compress
• Trade CPU for I/O
• Increase number of reducers
Partitioning
• Number of mappers
• Inputformat
• mapreduce.input.fileinputformat.split.maxsize
• Number of reducers
• hive.exec.reducers.bytes.per.reducer
• mapreduce.job.reduces
• HoS tends to launch more reducers
• Merge small files
• hive.merge.sparkfiles
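How the reducer settings interact can be sketched as a simple size-based estimate. This is a simplification, not the exact logic Hive uses (the slide notes that HoS tends to launch more reducers than MR); `estimate_reducers` and its `max_reducers` cap are hypothetical names.

```python
def estimate_reducers(input_bytes, bytes_per_reducer, max_reducers=None,
                      fixed_reducers=-1):
    """Simplified reducer-count estimate.

    fixed_reducers mirrors mapreduce.job.reduces: a positive value wins
    outright. Otherwise the count is the input size divided by
    hive.exec.reducers.bytes.per.reducer, rounded up, with at least one
    reducer and an optional cap (max_reducers is a hypothetical parameter).
    """
    if fixed_reducers > 0:
        return fixed_reducers
    # Integer ceiling division avoids floating-point rounding.
    count = max(1, (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer)
    if max_reducers is not None:
        count = min(count, max_reducers)
    return count


GB = 1024 ** 3
MB = 1024 ** 2

default_estimate = estimate_reducers(10 * GB, 256 * MB)
smaller_target = estimate_reducers(10 * GB, 64 * MB)
pinned = estimate_reducers(10 * GB, 256 * MB, fixed_reducers=12)
```

Lowering the bytes-per-reducer target is the usual way to raise parallelism without pinning an exact count.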
Hive Configuration
• General optimizations
• Enable vectorization
• Enable CBO
• Map join auto conversion
• Map side aggregation
• Etc.
Hive Configuration
• Map join
• hive.auto.convert.join.noconditionaltask.size
• HoS doesn’t support conditional map join yet
• HoS uses raw data size as small table size – different from MR
• hive.stats.collect.rawdatasize
• Skew join
• Compile time – same as MR
• Runtime - HoS will split the original task at join
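The practical effect of using raw data size for the map join check can be sketched as follows. This is an illustration rather than Hive's actual code: `converts_to_map_join` is a hypothetical helper, the threshold and table sizes are made up, and the MR-side statistic is assumed here to be the on-disk file size.

```python
def converts_to_map_join(table_sizes_bytes, threshold_bytes):
    """Sketch of the unconditional map-join size check.

    table_sizes_bytes: one size per join input. For HoS the slide says this
    is the raw data size; MR uses a different size statistic.
    The join qualifies when all inputs except the largest are, combined,
    smaller than hive.auto.convert.join.noconditionaltask.size.
    """
    if len(table_sizes_bytes) < 2:
        return False
    small_side_total = sum(table_sizes_bytes) - max(table_sizes_bytes)
    return small_side_total < threshold_bytes


MB = 1024 ** 2
threshold = 20 * MB  # hypothetical threshold value

# Hypothetical small table: 8 MB compressed on disk, 40 MB as raw data.
by_file_size = converts_to_map_join([500 * MB, 8 * MB], threshold)
by_raw_size = converts_to_map_join([500 * MB, 40 * MB], threshold)
```

Under these assumptions the same join converts when judged by file size but not when judged by raw data size, so a threshold tuned for MR may need to be revisited for HoS.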
Resource Allocation
• Static allocation
• spark.executor.instances
• Won’t release until session is closed
• Recommended for benchmarking
• Dynamic allocation
• spark.dynamicAllocation.enabled
• spark.dynamicAllocation.initialExecutors
• spark.dynamicAllocation.minExecutors
• spark.dynamicAllocation.maxExecutors
• Number of executors per Spark application scales up and down
• Suited for multi-tenancy scenarios (multi-session)
Resource Allocation
• Pre-warm containers
• hive.prewarm.enabled
• spark.scheduler.maxRegisteredResourcesWaitingTime
• spark.scheduler.minRegisteredResourcesRatio
• Attempt for better parallelism
• Considerable delay for start-up job
• Not recommended for short-lived sessions
Configuration and Tuning Summary
• Number and size of executors are the most important determinants of performance
• Resolve query performance problems and failures by allocating more executors with more CPU and RAM
• spark.executor.instances, spark.executor.cores, spark.executor.memory, spark.yarn.executor.memoryOverhead
• Cloudera Manager takes care of most of the optimizations
• Most Hive config settings are applicable to HoS, but a few have different semantics
• See Config and Tuning Guide for details
Roadmap
• Additional Optimizations
• Dynamic Partition Pruning
• Vectorization support
• Cost-Based Optimizer
• Others – Caching RDDs across queries, Optimize self join/union etc.
• Supportability Enhancements
• Better support for debugging and logging
• More informative stage description in WebUI
• Others: Improve Hue integration, additional metrics specific to HoS etc.
• Rebase to Spark 2.0 and Parquet 1.8
More Information & Next Steps
Get Started
• Download C5.7: www.cloudera.com/downloads
Release Notes
• www.cloudera.com/documentation/enterprise/latest/topics/rg_release_notes.html
Training Classes
• university.cloudera.com
Questions?

Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production

  • 1. 1© Cloudera, Inc. All rights reserved. Faster Batch Processing with Hive-on-Spark Santosh Kumar | Cloudera Rui Li | Intel
  • 2. 2© Cloudera, Inc. All rights reserved. Agenda • What is Hive-on-Spark? • Using Hive-on-Spark • Performance Metrics • Configuration & Tuning • What’s Next? • Q&A
  • 3. 3© Cloudera, Inc. All rights reserved. Apache Spark Flexible, in-memory data processing for Hadoop Easy Development Flexible Extensible API Fast Batch & Stream Processing • Rich APIs for Scala, Java, and Python • Interactive shell • APIs for different types of workloads: • Batch • Streaming • Machine Learning • Graph • In-Memory processing and caching
  • 4. 4© Cloudera, Inc. All rights reserved. Spark Takes Advantage of Memory • Resilient Distributed Datasets (RDD) • In-memory data-structure partitioned across a set of machines • Can fall back to disk when data-set does not fit in memory • Created by parallel transformations on data in stable storage • Provides fault-tolerance through concept of lineage
  • 5. 5© Cloudera, Inc. All rights reserved. Introduction • Enables Hive to use Spark as underlying execution engine • Motivations • Consolidation of Spark as execution engine • Better performance • Increased adoption of Hive (e.g. for Spark users) • Community effort by Cloudera, IBM, Intel, MapR, and others
  • 6. 6© Cloudera, Inc. All rights reserved. Choosing the Right SQL Engine Know Your Audience, Know Your Use Case Batch Processing BI and SQL Analytics Procedural Development SQLOR Impala
  • 7. 7© Cloudera, Inc. All rights reserved. Current State of Hive-on-Spark (HoS) • Fully supported production release in C5.7 • Functional parity with Hive-on-MapReduce (HoMR) • Average 3x performance improvement vs HoMR • Automatic configuration and optimizations via Cloudera Manager • Strong early user base • Early commitment for future collaboration from Intel and others
  • 8. 8© Cloudera, Inc. All rights reserved. Design Principles • Minimize impact on existing code path • Minimizes functional and performance impact • Minimizes maintenance • Maximizes support for Hive features – current as well as future • Spark invoked only at execution layer • HoS produces similar logical operators plan as HoMR • Logical plan runs on low-level Spark primitives • Minimizes usage of advanced Spark primitives
  • 9. 9© Cloudera, Inc. All rights reserved. Getting Started with Hive-on-Spark
  • 10. 10© Cloudera, Inc. All rights reserved. Configuration • Minimal configurations needed • Via Cloudera Manager: Set “Spark on YARN Service” (internally sets spark.master=yarn-cluster) • Set hive.execution.engine=spark per service or query • Only yarn-cluster is supported • Cloudera Manager auto-configures most configurations • Configuration & Tuning Guide available on Docs
  • 11. 11© Cloudera, Inc. All rights reserved. Performance Avg. ~3X faster than Hive-on-MapReduce More Suitable Less Suitable Complex workloads w/ multiple MR stages e.g. filter followed by JOIN followed by GROUP BY Simple workloads e.g. select * Disk-bound w/ multiple disk reads/writes CPU bound workloads e.g. complex UDFs Workloads requiring mins to hours for completion Workloads typically requiring <1 min
  • 12. 12© Cloudera, Inc. All rights reserved. Query Execution: Background Input status_updates( userid int,status string,ds string) profiles(userid int,school string,gender int) Output school_summary(school string,cnt int,ds string) gender_summary(gender int,cnt int,ds string)
  • 13. 13© Cloudera, Inc. All rights reserved. Query Execution: MapReduce BEGINS CONTINUES CONTINUES ENDS
  • 14. 14© Cloudera, Inc. All rights reserved. Query Execution: MapReduce BEGINS CONTINUES CONTINUES ENDS
  • 15. 15© Cloudera, Inc. All rights reserved. Query Execution: MapReduce BEGINS CONTINUES CONTINUES ENDS FileSinkOperator (disk write) and TableScanOperator (disk read) are very costly
  • 16. 16© Cloudera, Inc. All rights reserved. Query Execution: Hive-on-Spark Costly Steps Removed BEGINS CONTINUES CONTINUES ENDS
  • 17. 17© Cloudera, Inc. All rights reserved. Query Execution: Hive-on-Spark Costly Steps Removed BEGINS CONTINUES CONTINUES ENDS
  • 18. 18© Cloudera, Inc. All rights reserved. Optimization for Resource Management: Long-Live Executors (LLE) • MR: Each query an independent YARN application • Spark: Each SQL session is a long-lived YARN application • First query of a session spawns a YARN app • Subsequent queries re-use same YARN app as well as containers • Session disconnect shuts down YARN app and releases container resources
  • 19. Long-Lived Executors Details • A Hive user session submits a Spark application to YARN • Spark YARN application: • YARN containers: Spark executors live in YARN containers • YARN Application Master = RemoteDriver • Submits Spark ‘jobs’, aka Hive queries, to Spark executors • Connects back to HS2 to report job progress from Spark executors • Diagram: User1 and User2 connect to HiveServer2 (Session1, Session2); in the YARN cluster, each session has its own AM (RemoteDriver1, RemoteDriver2) with its containers (executors)
  • 20. Configuration and Tuning Hive-on-Spark
  • 21. Spark Configuration • Size of executors • Bigger and fewer executors: thread contention, GC pressure • Smaller and more executors: less memory efficient, bigger start-up overhead
  • 22. Spark Configuration • CPU: around 5-7 cores per executor • Memory: leave 10% for the OS cache • Executor memory overhead: tune case by case; can be heavily used by Netty; usually 15%-20% • Around 3GB per core
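As a worked example of these rules of thumb, here is a rough sizing sketch. The node shape (24 cores, 96GB), the `size_executors` helper, and the choice to take the overhead as a fraction of the whole YARN container are assumptions for illustration, not figures from the slides. It also ignores cores and memory reserved for other services on the node.

```python
def size_executors(node_cores, node_memory_gb, cores_per_executor=6,
                   os_reserve=0.10, overhead_fraction=0.15):
    """Rough per-node executor sizing from the rules of thumb above."""
    usable_gb = node_memory_gb * (1 - os_reserve)  # leave ~10% for the OS cache
    executors = node_cores // cores_per_executor
    if executors == 0:
        raise ValueError("node has fewer cores than one executor needs")
    container_gb = usable_gb / executors
    # Assumption: overhead is taken as a fraction of the whole YARN container.
    overhead_gb = container_gb * overhead_fraction
    heap_gb = container_gb - overhead_gb
    return {
        "executors_per_node": executors,
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{int(heap_gb)}g",
        # spark.yarn.executor.memoryOverhead is expressed in MB.
        "spark.yarn.executor.memoryOverhead": str(int(overhead_gb * 1024)),
        "heap_gb_per_core": round(heap_gb / cores_per_executor, 2),
    }


plan = size_executors(node_cores=24, node_memory_gb=96)
for key, value in plan.items():
    print(key, value)
```

With these inputs the heap works out to roughly 3GB per core, in line with the guideline; a different node shape may need a different core count per executor to stay near that ratio.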
  • 23. Spark Configuration • Serialization • spark.serializer: Kryo performs better and is REQUIRED by HoS • spark.kryo.referenceTracking: disable to avoid a Java performance issue • Shuffle • spark.shuffle.compress • spark.shuffle.spill.compress • Trade CPU for I/O • Increase the number of reducers
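A small sketch of how these settings might be written down, assuming they are applied as SET statements in a Hive session. The property names come from the slide; the values and the `to_set_statements` helper are illustrative assumptions, and in a Cloudera Manager deployment most of these would normally be managed for you.

```python
# Property names are from the slide; the values are assumptions for illustration.
SPARK_SETTINGS = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.referenceTracking": "false",
    "spark.shuffle.compress": "true",
    "spark.shuffle.spill.compress": "true",
}


def to_set_statements(settings):
    """Render a settings dict as HiveQL SET statements, one per line."""
    return "\n".join(f"SET {key}={value};" for key, value in settings.items())


rendered = to_set_statements(SPARK_SETTINGS)
print(rendered)
```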
  • 24. Partitioning • Number of mappers • InputFormat • mapreduce.input.fileinputformat.split.maxsize • Number of reducers • hive.exec.reducers.bytes.per.reducer • mapreduce.job.reduces • HoS tends to launch more reducers • Merge small files • hive.merge.sparkfiles
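To make the reducer settings concrete, here is a simplified sketch of a size-based reducer estimate. It assumes the estimate is the input size divided by hive.exec.reducers.bytes.per.reducer, capped by an upper bound (`max_reducers`, standing in for a setting such as hive.exec.reducers.max, which the slides do not mention), and that a positive mapreduce.job.reduces overrides the estimate. HoS may weigh other factors as well, so treat this as an approximation rather than Hive's actual code.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer, max_reducers, fixed_reducers=-1):
    """Simplified size-based reducer estimate (illustrative, not Hive's code)."""
    if fixed_reducers > 0:
        # A positive mapreduce.job.reduces pins the reducer count.
        return fixed_reducers
    estimate = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, estimate))


GB = 1024 ** 3
MB = 1024 ** 2

# Lowering bytes-per-reducer raises the reducer count for the same input.
print(estimate_reducers(10 * GB, 256 * MB, max_reducers=1009))
print(estimate_reducers(10 * GB, 64 * MB, max_reducers=1009))
```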
  • 25. Hive Configuration • General optimizations • Enable vectorization • Enable CBO • Automatic map join conversion • Map-side aggregation • Etc.
  • 26. Hive Configuration • Map join • hive.auto.convert.join.noconditionaltask.size • HoS doesn’t support conditional map join yet • HoS uses raw data size as the small-table size, which differs from MR • hive.stats.collect.rawdatasize • Skew join • Compile time: same as MR • Runtime: HoS splits the original task at the join
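The map join threshold can be illustrated with a simplified check. This sketch assumes the rule is that the combined size of every table except the largest must stay under hive.auto.convert.join.noconditionaltask.size; the `can_convert_to_map_join` helper and the sizes below are hypothetical, and the real planner considers more than this.

```python
def can_convert_to_map_join(raw_sizes_bytes, threshold_bytes):
    """True if all tables but the largest fit under the threshold together."""
    if len(raw_sizes_bytes) < 2:
        return False  # a join needs at least two inputs
    small_side_total = sum(raw_sizes_bytes) - max(raw_sizes_bytes)
    return small_side_total < threshold_bytes


THRESHOLD = 10_000_000  # illustrative value for the threshold, in bytes

# Sizes here are raw (uncompressed) data sizes, as HoS uses them.
print(can_convert_to_map_join([5_000_000_000, 8_000_000], THRESHOLD))
print(can_convert_to_map_join([5_000_000_000, 8_000_000, 6_000_000], THRESHOLD))
```

Because HoS compares raw data size, a table that is small on disk thanks to compression can still exceed the threshold, so a value tuned for MR may need revisiting.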
  • 27. Resource Allocation • Static allocation • spark.executor.instances • Executors are not released until the session is closed • Recommended for benchmarking • Dynamic allocation • spark.dynamicAllocation.enabled • spark.dynamicAllocation.initialExecutors • spark.dynamicAllocation.minExecutors • spark.dynamicAllocation.maxExecutors • Number of executors per Spark application scales up and down • Suited for multi-tenancy scenarios (multi-session)
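The two allocation modes can be contrasted with a short sketch that builds the settings for each. It uses the property names as Spark documents them (spark.dynamicAllocation.*); the `allocation_settings` helper and the executor counts are illustrative assumptions.

```python
def allocation_settings(mode, executors=None, min_executors=None, max_executors=None):
    """Build Spark settings for static or dynamic executor allocation."""
    if mode == "static":
        if executors is None:
            raise ValueError("static allocation needs an executor count")
        return {
            "spark.dynamicAllocation.enabled": "false",
            "spark.executor.instances": str(executors),
        }
    if mode == "dynamic":
        if min_executors is None or max_executors is None:
            raise ValueError("dynamic allocation needs min and max executors")
        if min_executors > max_executors:
            raise ValueError("min_executors must not exceed max_executors")
        return {
            "spark.dynamicAllocation.enabled": "true",
            "spark.dynamicAllocation.minExecutors": str(min_executors),
            # Assumption: start at the minimum and let Spark scale up on demand.
            "spark.dynamicAllocation.initialExecutors": str(min_executors),
            "spark.dynamicAllocation.maxExecutors": str(max_executors),
        }
    raise ValueError(f"unknown mode: {mode}")


benchmark = allocation_settings("static", executors=20)
shared = allocation_settings("dynamic", min_executors=2, max_executors=40)
print(benchmark)
print(shared)
```

A fixed count keeps benchmark runs comparable, while the dynamic bounds let idle sessions hand executors back on a shared cluster.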
  • 28. Resource Allocation • Pre-warm containers • hive.prewarm.enabled • spark.scheduler.maxRegisteredResourcesWaitingTime • spark.scheduler.minRegisteredResourcesRatio • Aims for better parallelism • Adds considerable delay to the start-up job • Not recommended for short-lived sessions
  • 29. Configuration and Tuning Summary • The number and size of executors are the most important determinants of performance • Resolve query performance problems and failures by allocating more executors with more CPU and RAM • spark.executor.instances, spark.executor.cores, spark.executor.memory, spark.yarn.executor.memoryOverhead • Cloudera Manager takes care of most of the optimizations • Most Hive config settings apply to HoS, but a few have different semantics • See the Configuration & Tuning Guide for details
  • 30. Roadmap • Additional Optimizations • Dynamic Partition Pruning • Vectorization support • Cost-Based Optimizer • Others: caching RDDs across queries, optimizing self join/union, etc. • Supportability Enhancements • Better support for debugging and logging • More informative stage descriptions in the WebUI • Others: improved Hue integration, additional metrics specific to HoS, etc. • Rebase to Spark 2.0 and Parquet 1.8
  • 31. More Information & Next Steps • Get Started: download C5.7 at www.cloudera.com/downloads • Release Notes: www.cloudera.com/documentation/enterprise/latest/topics/rg_release_notes.html • Training Classes: university.cloudera.com
  • 32. Questions?