SlideShare una empresa de Scribd logo
1 de 40
Descargar para leer sin conexión
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
METRICS-DRIVEN PERFORMANCE TUNING FOR
AWS GLUE ETL JOBS
Benjamin Sowell
Principal Engineer
AWS Glue
A N T 3 2 6
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Apache Spark and AWS Glue ETL
• Apache Spark is a distributed data processing engine with rich support
for complex analytics.
• AWS Glue builds on the Apache Spark runtime to offer ETL specific
functionality.
Spark Core: RDDs
Spark DataFrames AWS Glue DynamicFrames
SparkSQL AWS Glue ETL
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Glue execution model: jobs and stages
Repartition
FilterRead
Drop
Nulls
Write
Read Show
Job 1
Job 2
Stage 1
Stage 2
Stage 1
Apply
Mapping
Filter
Apply
Mapping
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Glue execution model: jobs and stages
Repartition
FilterRead
Drop
Nulls
Write
Read Show
Job 1
Job 2
Stage 1
Stage 2
Stage 1
Apply
Mapping
Filter
Apply
Mapping
Actions
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Glue execution model: jobs and stages
Repartition
FilterRead
Drop
Nulls
Write
Read Show
Job 1
Job 2
Stage 1
Stage 2
Stage 1
Apply
Mapping
Filter
Apply
Mapping Jobs
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Glue execution model: data partitions
• Apache Spark and AWS Glue
are data parallel.
• Data is divided into partitions
that are processed
concurrently.
• 1 stage x 1 partition = 1 task
Driver
Executors
Overall throughput is limited
by the number of partitions
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Glue performance: key questions
How is your application divided
into jobs and stages?
How is your dataset partitioned?
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
• One common problem is dealing with large numbers of small files.
• Can lead to memory pressure and performance overhead.
• Let's look at a straightforward JSON to Parquet conversion job
• 1.28 million JSON files in 640 partitions:
• We will use AWS Glue job metrics to understand the performance.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Enabling job metrics
• Metrics can be enabled in the CLI/SDK by passing --enable-metrics
as a job parameter key.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
• First try: use a standard SparkSQL job
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
• Driver memory use is growing fast and approaching the 5g max.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
• Case 2: Run using AWS Glue DynamicFrames.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
Driver memory remains below 50%
for the entire duration of execution.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing lots of small files
• 160 max executors: Each partition is assigned 1 task. 4 tasks can be
processed on each executor, so there were 640 partitions.
• Glue automatically grouped all of the files in each partition.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Options for grouping files
• groupFiles
• Set to "inPartition" to enable grouping of files within a partition.
• Set to "acrossPartition" to enable grouping of files from different partitions. The partition
value will not be added to each record!
• Grouping is automatically enabled when there are more than 50,000 files.
• groupSize
• Target size of each group in bytes.
• Default is based on the number of cores in the cluster.
• Let's try increasing the group size.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Aggressively grouping files
• Execution is much slower, but hasn't crashed.
"groupFiles": "acrossPartition"
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Aggressively grouping files
Executor memory is higher than driver. Only one executor is active.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing a few large files
• Processing large text files can also cause problems if they cannot be split.
• How the data is compressed makes a big difference in performance.
• Files that are uncompressed or bzip2 compressed can be split and processed in parallel.
• Files that are gzip compressed cannot be split.
• Let's see how this looks on a sample dataset of 5 large csv files.
• Each file is
• 12.5 GB uncompressed
• 1.6 GB gzip
• 1.3 GB bzip2
• Script converts data to Parquet.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing a few large gzip files
• We only have 5 partitions – one for each file.
• Job fails after 2 hours.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing a few large bzip2 files
• Bzip2 files can be split into blocks, so we see up to 104 tasks.
• Job completes in 18 minutes.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing a few large bzip2 files
• With 15 DPU, the number of active executors closely tracks the
maximum needed number of executors.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing a few large uncompressed files
• Uncompressed files can be split into lines, so we construct 64MB partitions.
• Job completes in 12 minutes.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Processing a few large files
• If you have a choice of compression type, prefer bzip2.
• If you are using gzip, make sure you have enough files to fully utilize
your resources.
• Bandwidth is rarely the bottleneck for AWS Glue jobs, so consider
leaving files uncompressed.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Coalescing a dataset
• Common use case: reduce the number of files (32 GB compressed)
df = spark.read.format("json").load(...)
df2 = df.coalesce(1)
df2.write.format("json").save(...)
Read is parallelized and then
all data is sent to one executor
for writing.
1 hr, 52 min.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Coalescing with grouping
• You can control this further with grouping
• Group size 500 GB.
Only a single executor is
used for the entire job
1 hr, 27 min.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Coalescing with grouping
• You can control this further with grouping
• Group size 2 GB.
26 min.Four executors were used
to process 16 partitions
of ~2 GB each.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Baseline performance without coalescing
• Job is fully parallelized and completes in 6 minutes.
• Output data files are around 500 MB.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DynamicFrames and schema computation
• DynamicFrames are Spark
RDDs of self-describing
records.
• An overall schema is not
required for basic ETL
operations.
• I can drop the field "age" without
looking at other records.
• Some operations do require
a complete schema.
• This can force an extra job in your
application.
age
John
Doe
36
Any
town
WA
DropNullFields
Relationalize
ResolveChoice
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DynamicFrame Schema Example: DropNullFields
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DynamicFrame Schema Example: DropNullFields
ApplyMapping sets the
schema without an additional
pass.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DynamicFrame Schema Example: DropNullFields
• There is only one stage in the application.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Optimizing DynamicFrames
• Use ApplyMapping where appropriate to set a schema.
• Be thoughtful about where to add transformations like DropNullFields.
• The default generated scripts are designed to be safe, but you may be able to optimize if
you know your data.
• Try your old scripts!
• We've increased data conversion speed by 4x in the past year.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Python performance in AWS Glue
• Using map and filter in
Python is expensive for large
data sets.
• All data is serialized and sent
between the JVM and Python.
• Alternatives
• Use AWS Glue Scala SDK.
• Convert to DataFrame and use Spark
SQL expressions.
Spark JVM
Python VM
Thank you!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Benjamin Sowell
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Más contenido relacionado

La actualidad más candente

Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftAmazon Web Services
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Lessons learned from operating Data Platform on Kubernetes(EKS)
Lessons learned from operating Data Platform on Kubernetes(EKS)Lessons learned from operating Data Platform on Kubernetes(EKS)
Lessons learned from operating Data Platform on Kubernetes(EKS)창언 정
 
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Databricks
 
Unleash the Power of Redis with Amazon ElastiCache
Unleash the Power of Redis with Amazon ElastiCacheUnleash the Power of Redis with Amazon ElastiCache
Unleash the Power of Redis with Amazon ElastiCacheAmazon Web Services
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
ECS+Locust로 부하 테스트 진행하기
ECS+Locust로 부하 테스트 진행하기ECS+Locust로 부하 테스트 진행하기
ECS+Locust로 부하 테스트 진행하기Yungon Park
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料Amazon Web Services
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016
Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016
Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016Amazon Web Services
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...Amazon Web Services
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftBuilding a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftAmazon Web Services
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRAmazon Web Services
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsFlink Forward
 

La actualidad más candente (20)

Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Lessons learned from operating Data Platform on Kubernetes(EKS)
Lessons learned from operating Data Platform on Kubernetes(EKS)Lessons learned from operating Data Platform on Kubernetes(EKS)
Lessons learned from operating Data Platform on Kubernetes(EKS)
 
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
 
Unleash the Power of Redis with Amazon ElastiCache
Unleash the Power of Redis with Amazon ElastiCacheUnleash the Power of Redis with Amazon ElastiCache
Unleash the Power of Redis with Amazon ElastiCache
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
ECS+Locust로 부하 테스트 진행하기
ECS+Locust로 부하 테스트 진행하기ECS+Locust로 부하 테스트 진행하기
ECS+Locust로 부하 테스트 진행하기
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016
Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016
Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftBuilding a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobs
 

Similar a Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Invent 2018

Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT332) - AWS re:Inv...
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT332) - AWS re:Inv...Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT332) - AWS re:Inv...
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT332) - AWS re:Inv...Amazon Web Services
 
Open Source Managed Databases: Database Week SF
Open Source Managed Databases: Database Week SFOpen Source Managed Databases: Database Week SF
Open Source Managed Databases: Database Week SFAmazon Web Services
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSKimmo Kantojärvi
 
Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018Amazon Web Services
 
Optimizing Amazon EBS for Performance (CMP371) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP371) - AWS re:Invent 2018Optimizing Amazon EBS for Performance (CMP371) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP371) - AWS re:Invent 2018Amazon Web Services
 
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018Amazon Web Services
 
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...Amazon Web Services LATAM
 
Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...
Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...
Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...Amazon Web Services
 
Amazon Aurora and AWS Database Migration Service
Amazon Aurora and AWS Database Migration ServiceAmazon Aurora and AWS Database Migration Service
Amazon Aurora and AWS Database Migration ServiceAmazon Web Services
 
Open Source Databases on the Cloud - Peter Dachnowicz
Open Source Databases on the Cloud - Peter DachnowiczOpen Source Databases on the Cloud - Peter Dachnowicz
Open Source Databases on the Cloud - Peter DachnowiczAmazon Web Services
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Amazon Web Services
 
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdf
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdfCome scalare da zero ai tuoi primi 10 milioni di utenti.pdf
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdfAmazon Web Services
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSAmazon Web Services
 
Open Source Managed Databases: Database Week San Francisco
Open Source Managed Databases: Database Week San FranciscoOpen Source Managed Databases: Database Week San Francisco
Open Source Managed Databases: Database Week San FranciscoAmazon Web Services
 
Design, Deploy, and Optimize Microsoft SQL Server on AWS (GPSTEC314) - AWS re...
Design, Deploy, and Optimize Microsoft SQL Server on AWS (GPSTEC314) - AWS re...Design, Deploy, and Optimize Microsoft SQL Server on AWS (GPSTEC314) - AWS re...
Design, Deploy, and Optimize Microsoft SQL Server on AWS (GPSTEC314) - AWS re...Amazon Web Services
 
Leadership Session: AWS Semiconductor (MFG201-L) - AWS re:Invent 2018
Leadership Session: AWS Semiconductor (MFG201-L) - AWS re:Invent 2018Leadership Session: AWS Semiconductor (MFG201-L) - AWS re:Invent 2018
Leadership Session: AWS Semiconductor (MFG201-L) - AWS re:Invent 2018Amazon Web Services
 

Similar a Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Invent 2018 (20)

Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT332) - AWS re:Inv...
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT332) - AWS re:Inv...Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT332) - AWS re:Inv...
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT332) - AWS re:Inv...
 
Open Source Managed Databases: Database Week SF
Open Source Managed Databases: Database Week SFOpen Source Managed Databases: Database Week SF
Open Source Managed Databases: Database Week SF
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWS
 
Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP317-R2) - AWS re:Invent 2018
 
Optimizing Amazon EBS for Performance (CMP371) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP371) - AWS re:Invent 2018Optimizing Amazon EBS for Performance (CMP371) - AWS re:Invent 2018
Optimizing Amazon EBS for Performance (CMP371) - AWS re:Invent 2018
 
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
 
SQL Server on AWS
SQL Server on AWSSQL Server on AWS
SQL Server on AWS
 
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
 
SQL Server on AWS
SQL Server on AWSSQL Server on AWS
SQL Server on AWS
 
Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...
Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...
Integrating Amazon Elasticsearch with your DevOps Tooling - AWS Online Tech T...
 
Amazon Aurora and AWS Database Migration Service
Amazon Aurora and AWS Database Migration ServiceAmazon Aurora and AWS Database Migration Service
Amazon Aurora and AWS Database Migration Service
 
Open Source Databases on the Cloud - Peter Dachnowicz
Open Source Databases on the Cloud - Peter DachnowiczOpen Source Databases on the Cloud - Peter Dachnowicz
Open Source Databases on the Cloud - Peter Dachnowicz
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
 
MySQL and MariaDB
MySQL and MariaDBMySQL and MariaDB
MySQL and MariaDB
 
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdf
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdfCome scalare da zero ai tuoi primi 10 milioni di utenti.pdf
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdf
 
SQL Server on AWS
SQL Server on AWSSQL Server on AWS
SQL Server on AWS
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWS
 
Open Source Managed Databases: Database Week San Francisco
Open Source Managed Databases: Database Week San FranciscoOpen Source Managed Databases: Database Week San Francisco
Open Source Managed Databases: Database Week San Francisco
 
Design, Deploy, and Optimize Microsoft SQL Server on AWS (GPSTEC314) - AWS re...
Design, Deploy, and Optimize Microsoft SQL Server on AWS (GPSTEC314) - AWS re...Design, Deploy, and Optimize Microsoft SQL Server on AWS (GPSTEC314) - AWS re...
Design, Deploy, and Optimize Microsoft SQL Server on AWS (GPSTEC314) - AWS re...
 
Leadership Session: AWS Semiconductor (MFG201-L) - AWS re:Invent 2018
Leadership Session: AWS Semiconductor (MFG201-L) - AWS re:Invent 2018Leadership Session: AWS Semiconductor (MFG201-L) - AWS re:Invent 2018
Leadership Session: AWS Semiconductor (MFG201-L) - AWS re:Invent 2018
 

Más de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Más de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Invent 2018

  • 1.
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. METRICS-DRIVEN PERFORMANCE TUNING FOR AWS GLUE ETL JOBS Benjamin Sowell Principal Engineer AWS Glue A N T 3 2 6
  • 3. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Apache Spark and AWS Glue ETL • Apache Spark is a distributed data processing engine with rich support for complex analytics. • AWS Glue builds on the Apache Spark runtime to offer ETL specific functionality. Spark Core: RDDs Spark DataFrames AWS Glue DynamicFrames SparkSQL AWS Glue ETL
  • 4. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS Glue execution model: jobs and stages Repartition FilterRead Drop Nulls Write Read Show Job 1 Job 2 Stage 1 Stage 2 Stage 1 Apply Mapping Filter Apply Mapping
  • 5. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS Glue execution model: jobs and stages Repartition FilterRead Drop Nulls Write Read Show Job 1 Job 2 Stage 1 Stage 2 Stage 1 Apply Mapping Filter Apply Mapping Actions
  • 6. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS Glue execution model: jobs and stages Repartition FilterRead Drop Nulls Write Read Show Job 1 Job 2 Stage 1 Stage 2 Stage 1 Apply Mapping Filter Apply Mapping Jobs
  • 7. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS Glue execution model: data partitions • Apache Spark and AWS Glue are data parallel. • Data is divided into partitions that are processed concurrently. • 1 stage x 1 partition = 1 task Driver Executors Overall throughput is limited by the number of partitions
  • 8. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS Glue performance: key questions How is your application divided into jobs and stages? How is your dataset partitioned?
  • 9. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files • One common problem is dealing with large numbers of small files. • Can lead to memory pressure and performance overhead. • Let's look at a straightforward JSON to Parquet conversion job • 1.28 million JSON files in 640 partitions: • We will use AWS Glue job metrics to understand the performance.
  • 10. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Enabling job metrics • Metrics can be enabled in the CLI/SDK by passing --enable-metrics as a job parameter key.
  • 11. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files • First try: use a standard SparkSQL job
  • 12. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files
  • 13. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files
  • 14. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files • Driver memory use is growing fast and approaching the 5g max.
  • 15. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files • Case 2: Run using AWS Glue DynamicFrames.
  • 16. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files
  • 17. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files Driver memory remains below 50% for the entire duration of execution.
  • 18. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files
  • 19. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing lots of small files • 160 max executors: Each partition is assigned 1 task. 4 tasks can be processed on each executor, so there were 640 partitions. • Glue automatically grouped all of the files in each partition.
  • 20. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Options for grouping files • groupFiles • Set to "inPartition" to enable grouping of files within a partition. • Set to "acrossPartition" to enable grouping of files from different partitions. The partition value will not be added to each record! • Grouping is automatically enabled when there are more than 50,000 files. • groupSize • Target size of each group in bytes. • Default is based on the number of cores in the cluster. • Let's try increasing the group size.
  • 21. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Aggressively grouping files • Execution is much slower, but hasn't crashed. "groupFiles": "acrossPartition"
  • 22. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Aggressively grouping files Executor memory is higher than driver. Only one executor is active.
  • 23. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing a few large files • Processing large text files can also cause problems if they cannot be split. • How the data is compressed makes a big difference in performance. • Files that are uncompressed or bzip2 compressed can be split and processed in parallel. • Files that are gzip compressed cannot be split. • Let's see how this looks on a sample dataset of 5 large csv files. • Each file is • 12.5 GB uncompressed • 1.6 GB gzip • 1.3 GB bzip2 • Script converts data to Parquet.
  • 24. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing a few large gzip files • We only have 5 partitions – one for each file. • Job fails after 2 hours.
  • 25. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing a few large bzip2 files • Bzip2 files can be split into blocks, so we see up to 104 tasks. • Job completes in 18 minutes.
  • 26. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing a few large bzip2 files • With 15 DPU, the number of active executors closely tracks the maximum needed number of executors.
  • 27. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing a few large uncompressed files • Uncompressed files can be split into lines, so we construct 64MB partitions. • Job completes in 12 minutes.
  • 28. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Processing a few large files • If you have a choice of compression type, prefer bzip2. • If you are using gzip, make sure you have enough files to fully utilize your resources. • Bandwidth is rarely the bottleneck for AWS Glue jobs, so consider leaving files uncompressed.
  • 29. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Coalescing a dataset • Common use case: reduce the number of files (32 GB compressed) df = spark.read.format("json").load(...) df2 = df.coalesce(1) df2.write.format("json").save(...) Read is parallelized and then all data is sent to one executor for writing. 1 hr, 52 min.
  • 30. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Coalescing with grouping • You can control this further with grouping • Group size 500 GB. Only a single executor is used for the entire job 1 hr, 27 min.
  • 31. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Coalescing with grouping • You can control this further with grouping • Group size 2 GB. 26 min.Four executors were used to process 16 partitions of ~2 GB each.
  • 32. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: Baseline performance without coalescing • Job is fully parallelized and completes in 6 minutes. • Output data files are around 500 MB.
  • 33. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DynamicFrames and schema computation • DynamicFrames are Spark RDDs of self-describing records. • An overall schema is not required for basic ETL operations. • I can drop the field "age" without looking at other records. • Some operations do require a complete schema. • This can force an extra job in your application. age John Doe 36 Any town WA DropNullFields Relationalize ResolveChoice
  • 34. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DynamicFrame Schema Example: DropNullFields
  • 35. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DynamicFrame Schema Example: DropNullFields ApplyMapping sets the schema without an additional pass.
  • 36. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DynamicFrame Schema Example: DropNullFields • There is only one stage in the application.
  • 37. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Optimizing DynamicFrames • Use ApplyMapping where appropriate to set a schema. • Be thoughtful about where to add transformations like DropNullFields. • The default generated scripts are designed to be safe, but you may be able to optimize if you know your data. • Try your old scripts! • We've increased data conversion speed by 4x in the past year.
  • 38. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Python performance in AWS Glue • Using map and filter in Python is expensive for large data sets. • All data is serialized and sent between the JVM and Python. • Alternatives • Use AWS Glue Scala SDK. • Convert to DataFrame and use Spark SQL expressions. Spark JVM Python VM
  • 39. Thank you! © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Benjamin Sowell
  • 40. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.