Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhang and Jane Wang

At Spark Summit 2017, we described our framework to migrate production Hive workloads to Spark with minimal user intervention. After a year of migration, Spark now powers an important part of our batch processing workload. The migration framework supports syntax compatibility analysis, offline/online shadowing, and data validation. In this session, we first introduce new features and improvements in the migration framework to support bucketed tables and increase automation. Next, we dive deep into the top technical challenges we encountered and how we addressed them. We improved the syntax compatibility between Hive and Spark from around 51% to 85% by identifying and developing the top missing features, fixing incompatible UDFs, and implementing a UDF testing framework. In addition, we developed reliable join operators to improve Spark stability in production when leveraging optimizations such as ShuffledHashJoin. Finally, we share an update on our overall migration effort and examples of migration wins. For example, we were able to migrate one of the most complicated workloads at Facebook from Hive to Spark with a performance gain of more than 2.5X.

Zhan Zhang, Jane Wang, Facebook
Migrating Apache Hive
Workload to Apache Spark:
Bridge the Gap

Overview
• Hive to Spark Migration Effort
• Narrowing Down Feature Gaps
– Regex Column Specification Support
– Local Writes Support
– UDFs
• Performance and Reliability
– Dynamic Join
– Bucket Join
• Advanced Optimization for Extremely Large Jobs
– Secondary Partitioning
– Run-time Optimization

Hive to Spark Migration
• Why migrate workloads from Hive to Spark?
– Performance
– Identify and narrow down the feature gap

Regex Column Specification Support
• One of the most common causes of failure in our syntax analysis.
• Support regex column specification.
– SELECT `(a)?+.+` FROM data table
– SELECT t.`(a)?+.+` FROM data table
• SPARK-12139
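
To make the semantics concrete, here is a minimal sketch of how a regex column specification can be expanded against a table's column list. It is an illustration only, not Spark's implementation: the helper name `expand_regex_columns` and the sample columns are hypothetical. The slide's pattern `(a)?+.+` uses a Java possessive quantifier, which older Python versions reject, so the sketch substitutes the lookahead pattern `(?!a$).+`, which selects the same thing (every column except `a`). In open-source Spark, the feature added by SPARK-12139 is enabled through the `spark.sql.parser.quotedRegexColumnNames` setting.

```python
import re


def expand_regex_columns(pattern, columns):
    """Return the columns whose full name matches the regex (hypothetical helper)."""
    regex = re.compile(pattern)
    return [name for name in columns if regex.fullmatch(name)]


# Lookahead stand-in for the Java pattern `(a)?+.+`: every column except "a".
ALL_BUT_A = r"(?!a$).+"

if __name__ == "__main__":
    # Hypothetical schema used only for illustration.
    print(expand_regex_columns(ALL_BUT_A, ["a", "b", "ab", "c"]))
```

Matching against the full column name matters here: a column such as `ab` must still be selected even though it starts with `a`.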

Local Filesystem Writes
• Support "Writing data into the filesystem from queries" …
– INSERT OVERWRITE LOCAL? DIRECTORY path=STRING rowFormat? createFileFormat?
– INSERT OVERWRITE LOCAL? DIRECTORY (path=STRING)? tableProvider (OPTIONS options=tablePropertyList)?
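
The two grammar rules above (where `?` marks an optional element) can be read as two statement shapes: a Hive-style form with `ROW FORMAT` / `STORED AS`, and a data-source form with `USING` and `OPTIONS`. The sketch below only assembles the statement text for each shape so the optional parts are easy to see. The helper names, paths, and queries are hypothetical, and quoting is deliberately simplified.

```python
def hive_style_insert(path, query, local=False, row_format=None, stored_as=None):
    """Build the first form: DIRECTORY path [ROW FORMAT ...] [STORED AS ...]."""
    parts = ["INSERT OVERWRITE"]
    if local:
        parts.append("LOCAL")
    parts.append(f"DIRECTORY '{path}'")
    if row_format:
        parts.append(f"ROW FORMAT {row_format}")
    if stored_as:
        parts.append(f"STORED AS {stored_as}")
    parts.append(query)
    return " ".join(parts)


def datasource_insert(provider, query, path=None, local=False, options=None):
    """Build the second form: DIRECTORY [path] USING provider [OPTIONS (...)]."""
    parts = ["INSERT OVERWRITE"]
    if local:
        parts.append("LOCAL")
    parts.append("DIRECTORY")
    if path:
        parts.append(f"'{path}'")
    parts.append(f"USING {provider}")
    if options:
        rendered = ", ".join(f"{key} '{value}'" for key, value in options.items())
        parts.append(f"OPTIONS ({rendered})")
    parts.append(query)
    return " ".join(parts)


if __name__ == "__main__":
    # Hypothetical path and query, for illustration only.
    print(hive_style_insert("/tmp/out", "SELECT * FROM t", local=True,
                            stored_as="TEXTFILE"))
    print(datasource_insert("parquet", "SELECT * FROM t", path="/tmp/out"))
```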

UDF Support
• UDAF_JAVA_F/UDTF_JAVA_F/UDF_JAVA_F
• UDF_Bind
• UDF_EVAL_F
• Non-deterministic Expression
• …

Narrowing Down Feature Gaps - Syntax
• Regex Column Specification
• Syntax parser improvement
• UDF compatibility
– Enum value
– User defined class type
– Lambda function

3X Workload Growth in 6 Months
[Chart: Reserved CPU Days and CPU Days]

Joins
• Broadcast Join
• ShuffleHash Join
• SortMerge Join

Dynamic Join
[Flowchart: Start → Build hash table → OOM? No: Hash Join → End. Yes: Reconstruct iterator → SortMergeJoin → End.]
• Leverage HashJoin more aggressively
• Provide a reliable fallback mechanism
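
The fallback flow can be sketched in a few lines: try to build the hash table, and if the build side turns out to be too large, rebuild an iterator over the rows already consumed plus the rows not yet read, then run a sort-merge join instead. This is a simplified, single-process illustration, not the production operator: the row budget `max_rows` stands in for a real out-of-memory signal, and all function names are hypothetical.

```python
from itertools import chain


class BuildSideTooLarge(Exception):
    """Stand-in for an OOM while building the hash table (hypothetical)."""

    def __init__(self, consumed):
        super().__init__("build side exceeded the row budget")
        self.consumed = consumed  # rows already pulled from the build iterator


def build_hash_table(rows, key, max_rows):
    table, consumed = {}, []
    for row in rows:
        consumed.append(row)
        if len(consumed) > max_rows:
            raise BuildSideTooLarge(consumed)
        table.setdefault(key(row), []).append(row)
    return table


def hash_join(table, probe_rows, key):
    for probe in probe_rows:
        for build in table.get(key(probe), []):
            yield (build, probe)


def sort_merge_join(build_rows, probe_rows, key):
    left, right = sorted(build_rows, key=key), sorted(probe_rows, key=key)
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with
            # every left row that carries the same key.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == lk:
                j_end += 1
            while i < len(left) and key(left[i]) == lk:
                for r in right[j:j_end]:
                    yield (left[i], r)
                i += 1
            j = j_end


def dynamic_join(build_rows, probe_rows, key, max_rows):
    """Return (strategy, joined rows); falls back to sort-merge on 'OOM'."""
    build_iter = iter(build_rows)
    try:
        table = build_hash_table(build_iter, key, max_rows)
    except BuildSideTooLarge as err:
        # Reconstruct the iterator: consumed rows first, then the unread rest.
        rebuilt = chain(err.consumed, build_iter)
        return "sort_merge", list(sort_merge_join(rebuilt, probe_rows, key))
    return "hash", list(hash_join(table, probe_rows, key))
```

Both paths return the same set of joined rows; only the order and the memory profile differ, which is what makes the fallback safe to apply transparently.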

Bucket Join
[Diagram: buckets 1 to 4 of each table grouped into splits, with four splits on one side and two on the other]
• Support a different number of buckets on the left and right sides, where one count is a multiple of the other.
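
Why a multiple works can be shown with a small sketch. If both tables assign buckets as `hash(key) mod num_buckets`, and one bucket count is a multiple of the other, then bucket `b` of the larger table can only match bucket `b mod n` of the smaller one. The code below assumes that bucketing scheme and uses `zlib.crc32` as a stand-in hash; the function names are hypothetical and this is not Hive's or Spark's actual hash function.

```python
import zlib
from collections import defaultdict


def bucket_id(key, num_buckets):
    """Stand-in bucketing function: a deterministic hash modulo bucket count."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, key, num_buckets):
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_id(key(row), num_buckets)].append(row)
    return buckets


def bucket_join(left_rows, right_rows, key, left_buckets, right_buckets):
    """Join bucket-by-bucket when one bucket count is a multiple of the other."""
    big, small = max(left_buckets, right_buckets), min(left_buckets, right_buckets)
    if big % small != 0:
        raise ValueError("bucket counts must be multiples of each other")
    left = bucketize(left_rows, key, left_buckets)
    right = bucketize(right_rows, key, right_buckets)
    joined = []
    for b in range(big):
        # On the side with fewer buckets, bucket (b mod n) is reused for
        # several buckets of the larger side.
        l_rows = left.get(b % left_buckets, [])
        r_rows = right.get(b % right_buckets, [])
        for l in l_rows:
            for r in r_rows:
                if key(l) == key(r):
                    joined.append((l, r))
    return joined
```

Each row of the larger side lives in exactly one bucket, so every matching pair is produced once even though buckets on the smaller side are read multiple times.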

Bucket Join Validation
• Verify that the Spark bucket join generates results consistent with the Hive bucket join
– Read the Spark and Hive tables.
– Zip the corresponding splits from the Spark- and Hive-generated tables.
– Compare the sorted column in the two splits sequentially.
– Sort the bucket column in each split and compare the rows in the two splits sequentially.
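
The validation steps above can be sketched as follows. This is an in-memory illustration under simplifying assumptions: each table is already loaded as a list of splits, each split is a list of row tuples, and `bucket_col` / `sort_col` are column positions. The function name and the data layout are hypothetical.

```python
def validate_bucket_join(spark_splits, hive_splits, bucket_col, sort_col):
    """Return a list of human-readable mismatches (empty means consistent)."""
    if len(spark_splits) != len(hive_splits):
        return [f"split count differs: {len(spark_splits)} vs {len(hive_splits)}"]

    problems = []
    # Zip the corresponding splits from the two tables.
    for idx, (spark, hive) in enumerate(zip(spark_splits, hive_splits)):
        # Compare the sorted column sequentially, in stored order.
        if [row[sort_col] for row in spark] != [row[sort_col] for row in hive]:
            problems.append(f"split {idx}: sorted column differs")

        # Sort by the bucket column (full row as tie-breaker so the order is
        # deterministic), then compare the rows sequentially.
        def by_bucket(row):
            return (row[bucket_col], row)

        if sorted(spark, key=by_bucket) != sorted(hive, key=by_bucket):
            problems.append(f"split {idx}: rows differ")
    return problems
```

The first check catches ordering differences within a split; the second catches rows that landed in the wrong split or carry different values.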

Challenges in Large Jobs
• A large job with 10,000 mappers × 10,000 reducers
– IOPS: 100,000,000
– HDFS: 10,000 result files
– Scheduling overhead: 20,000 tasks
– Manual tuning
– Data skew
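
The figures on this slide follow from simple arithmetic, shown here as a small sketch. The interpretation is an assumption: every reducer fetches one shuffle block from every mapper, every mapper and reducer is one scheduled task, and every reducer writes one result file. The function name is hypothetical.

```python
def large_job_costs(mappers, reducers):
    """Back-of-the-envelope costs for a full mapper x reducer shuffle."""
    return {
        # Assumed: one shuffle block fetch per (mapper, reducer) pair.
        "shuffle_fetches": mappers * reducers,
        # Assumed: one task per mapper plus one per reducer.
        "tasks": mappers + reducers,
        # Assumed: one output file per reducer.
        "result_files": reducers,
    }


if __name__ == "__main__":
    print(large_job_costs(10_000, 10_000))
```

The fetch count grows with the product of the two sides, so it is the first number to become a problem as jobs scale up.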

Secondary Partitioning: Pros and Cons
Pros
• Reduce IOPS
• Reduce the number of HDFS files
• Runtime optimization
• Backward compatibility
– Exactly the same behavior when split number = 1
• Auto-configuration
– 503 partitions and 13 buckets to achieve good performance
Cons
• Reduced parallelism
• Need to fetch all data before computation

JIRA
• SPARK-12139
– REGEX Column Specification for Hive Queries
• SPARK-4131
– Support "Writing data into the filesystem from queries"
• SPARK-23306
– Race condition in TaskMemoryManager
• SPARK-19326
– Speculated task attempts do not get launched in few scenarios
• SPARK-19839
– Fix memory leak in BytesToBytesMap

ACKNOWLEDGEMENTS
This presentation includes work from the Spark
team at Facebook. Thanks for their contributions,
especially Lin Wang and Tejas Patil.
