© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Hemant Borole
AWS Professional Services Consultant
Marie Yap
AWS Enterprise Solutions Architect
Data Transformation Patterns
Using AWS Glue to transform data in your Data Lake
Agenda
• Our journey starts with the Data Lake
• How Glue connects everything
• ETL in AWS
• Glue Job and Scheduling
• Glue Transforms
• Demo on different Glue Transforms
• Tips and Best Practices
A quick recap on Data Lake
Traditionally, analytics used to look like this
[Diagram labels: OLTP, ERP, CRM, LOB; Data Warehouse; Business Intelligence]
Relational data
TBs-PBs scale
Schema defined prior to data load
Operational reporting and ad hoc
Large initial capex + $10k-$50k / TB / year
Data Lakes extend the traditional approach
Relational and non-relational data
TBs-EBs scale
Schema defined during analysis
Diverse analytical engines to gain insights
Designed for low cost storage and analytics
[Diagram labels: OLTP, ERP, CRM, LOB; Data Warehouse; Business Intelligence; Data Lake; Devices, Web, Sensors, Social; Catalog; Machine Learning; DW Queries; Big data processing; Interactive; Real-time]
Data Lakes on AWS
Most ways to bring data in
Unmatched durability and availability at EB scale
Best security, compliance, and audit capabilities
Run any analytics on the same data without movement
Scale storage and compute independently
Store at $0.023 / GB-month; Query for $0.05/GB scanned
[Diagram labels: Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams; S3; Redshift, EMR, Athena, Kinesis, Elasticsearch Service]
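To make the storage and query prices quoted on this slide concrete, here is a small arithmetic sketch. The two constants are copied from the slide and may not match current pricing; the function name `monthly_cost_usd` and the example volumes are hypothetical.

```python
# Prices as quoted on the slide (assumption: they may be out of date).
STORAGE_USD_PER_GB_MONTH = 0.023
QUERY_USD_PER_GB_SCANNED = 0.05


def monthly_cost_usd(stored_gb, scanned_gb):
    """Estimate one month of storage plus query cost, rounded to cents."""
    storage = stored_gb * STORAGE_USD_PER_GB_MONTH
    queries = scanned_gb * QUERY_USD_PER_GB_SCANNED
    return round(storage + queries, 2)


# Hypothetical workload: 1,000 GB stored, 200 GB scanned by queries in a month.
estimate = monthly_cost_usd(stored_gb=1000, scanned_gb=200)
print(f"Estimated monthly cost: ${estimate}")
```

Because query cost scales with the data scanned rather than the data stored, reducing the bytes each query reads (for example through partitioning, compression and columnar formats) lowers the second term directly.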
Extract, Transform and Load
Extract
• Sources
• Raw
• Relational, semi or unstructured
Transform
• File - Format, Compression, Partitioning
• Data
Load
• Stage or Processed
• Destination(s)
• Visualization
There are many ways to do analytics
Interactive Analytics: Amazon Athena
Big Data Processing: Amazon EMR
Data Warehousing: Amazon Redshift*
Operational Analytics: Amazon ES
Real time Analytics: Amazon Kinesis Analytics
Dashboard and Visualization: Amazon QuickSight
There has to be something that Glues them together
AWS Glue
Glue the services together through Data Catalog
• Amazon Athena
• Amazon EMR
• AWS Glue Jobs
• Amazon Redshift Spectrum
Extract Transform Load in AWS
Extract
• Files
• RDS/Database
• EDW
• Glue Data Catalog
• S3
Transform
• Amazon Athena
• Amazon Redshift
• Amazon EMR
• AWS Glue
Load
• RDS/Databases
• EDW/Redshift
• NoSQL, DynamoDB
• Machine Learning (SageMaker)
• S3 (Processed output bucket)
Knowing the JOB
Terminologies
JOB – performs the ETL work in Glue
JOB BOOKMARKS – save state across multiple job runs
How does a bookmark work?
START OF BOOKMARK
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = source_database,
    table_name = source_tables[0],
    transformation_ctx = "datasource0")
END OF BOOKMARK
job.commit()
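The snippet above marks where bookmark state starts (the read tagged with `transformation_ctx`) and where it is saved (`job.commit()`). The following is a toy model of that idea in plain Python, not Glue's actual implementation: the class `BookmarkStore`, its methods and the file names are all hypothetical, and it only shows that input read under a given context is skipped on the next run once the previous run has committed.

```python
class BookmarkStore:
    """Toy stand-in for persisted bookmark state (not the Glue implementation)."""

    def __init__(self):
        self._committed = {}  # transformation_ctx -> items already processed
        self._pending = {}    # items read in the current run, not yet committed

    def read_new(self, transformation_ctx, available_items):
        """Return only the items not processed by an earlier committed run."""
        seen = self._committed.get(transformation_ctx, set())
        new_items = [item for item in available_items if item not in seen]
        self._pending.setdefault(transformation_ctx, set()).update(new_items)
        return new_items

    def commit(self):
        """Persist the state of the current run, like job.commit() in the slide."""
        for ctx, items in self._pending.items():
            self._committed.setdefault(ctx, set()).update(items)
        self._pending = {}


store = BookmarkStore()

# First run: both files are new, and the run commits its state.
first_run = store.read_new("datasource0", ["orders-01.csv", "orders-02.csv"])
store.commit()

# Second run: only the file that arrived after the first commit is returned.
second_run = store.read_new(
    "datasource0", ["orders-01.csv", "orders-02.csv", "orders-03.csv"]
)
print(first_run, second_run)
```

In this model, a run that reads data but never calls `commit()` leaves the state untouched, so the same items would be returned again on the next run.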
Job Bookmark options
Job Scheduling
Compose jobs globally with event-based dependencies
• Easy to reuse and leverage work across organization boundaries
Multiple triggering mechanisms
• Schedule-based: e.g., time of day
• Event-based: e.g., job completion
• On-demand: e.g., AWS Lambda
• More coming soon: Data Catalog based events, S3 notifications and Amazon CloudWatch events
Logs and alerts are available in Amazon CloudWatch
[Diagram labels: Marketing: Ad-spend by customer segment; Sales: Revenue by customer segment; Central: ROI by customer segment; Weekly sales; triggers: Event based, Lambda trigger, Schedule, Data based]
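The three triggering mechanisms listed on this slide can be sketched as request parameters for the Glue `create_trigger` API as exposed by boto3. This is a minimal sketch under assumptions: the helper names, trigger names, job names and cron expression are hypothetical, and `glue_client` is expected to be something like `boto3.client("glue")`, passed in so the parameter builders can be inspected without AWS access.

```python
def schedule_trigger(name, job_name, cron_expression):
    """Schedule-based: start the job at a time of day."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron_expression,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def job_completion_trigger(name, job_name, upstream_job):
    """Event-based: start the job when an upstream job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def on_demand_trigger(name, job_name):
    """On-demand: the trigger is started explicitly, e.g. from AWS Lambda."""
    return {"Name": name, "Type": "ON_DEMAND", "Actions": [{"JobName": job_name}]}


def create_triggers(glue_client, triggers):
    """Create each trigger and return the names that were submitted."""
    for trigger in triggers:
        glue_client.create_trigger(**trigger)
    return [trigger["Name"] for trigger in triggers]


# Hypothetical job names loosely based on the diagram labels.
triggers = [
    schedule_trigger("weekly-sales", "sales_revenue_by_segment", "cron(0 6 ? * MON *)"),
    job_completion_trigger("roi-after-sales", "central_roi_by_segment", "sales_revenue_by_segment"),
    on_demand_trigger("ad-spend-on-demand", "marketing_ad_spend_by_segment"),
]
```

Calling `create_triggers(boto3.client("glue"), triggers)` would submit them; an on-demand trigger then still has to be started explicitly, for instance with `start_trigger`.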
So how do I transform?
Glue is based on Apache Spark
Writing your transformations
https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint.html
Transforming Distributed, Immutable Data Structures
But why do we have a DynamicFrame?
Limitations of DataFrame for ETL Operations
• Expensive when loading large data sets:
   • Infer the schema
   • Load the data
• Does not handle unstructured data types well
• Does not make error handling easy
DynamicFrame
• No need to know the schema upfront (schema per record)
• Supports ETL-specific transformations
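One way to picture "schema per record" is to let each record describe its own field types and only merge them afterwards. The sketch below is plain Python, not Glue code: the helper names `infer_field_types` and `choice_fields` and the sample records are hypothetical. A field that ends up with more than one type is the situation that a DynamicFrame reports as a choice.

```python
def infer_field_types(records):
    """Collect, per field, every Python type name seen across the records."""
    seen = {}
    for record in records:
        for field, value in record.items():
            seen.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


def choice_fields(field_types):
    """Return the fields whose values do not share a single type."""
    return sorted(field for field, types in field_types.items() if len(types) > 1)


# Hypothetical order records: ordereid arrives as an int in one record
# and as a string in the others.
records = [
    {"ordereid": 1001, "item": "pen"},
    {"ordereid": "1002", "item": "ink"},
    {"ordereid": "N/A", "item": "pad"},
]

field_types = infer_field_types(records)
print(field_types)
print(choice_fields(field_types))
```

No schema has to be declared before the records are read; the ambiguity in `ordereid` only surfaces when the per-record types are combined.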
Problem Solving – Unstructured Data in DynamicFrame
df_order = glueContext.create_dynamic_frame.from_catalog(database="retail", table_name="orders")
Solution - ResolveChoice
Option 1 – cast
Eliminate the string values by replacing them with null.
order_rc_cast = df_order.resolveChoice(specs = [('ordereid','cast:int')])
Output Schema – ordereid: int
Option 2 – project
Project all the data to one data type. In this case it will convert string to int.
order_rc_cast = df_order.resolveChoice(specs = [('ordereid','project:int')])
Output Schema – ordereid: int
Solution - ResolveChoice
Option 3 – make_cols
Splits the column into two, one for each type
order_rc_cast = df_order.resolveChoice(specs = [('ordereid','make_cols')])
Output Schema – ordereid_int: int
                ordereid_string: string
Option 4 – make_struct
Use a struct to represent each of the data types
order_rc_cast = df_order.resolveChoice(specs = [('ordereid','make_struct')])
Output Schema – ordereid: struct
                    int: int
                    string: string
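To see what the output schemas on these slides mean for individual records, here is a toy model of three of the actions (cast, make_cols and make_struct) in plain Python. It is not the Glue implementation: the function `resolve_choice_toy` and the sample records are hypothetical, and the cast behaviour (values that cannot be converted become null) is an assumption that follows the slide's description.

```python
def resolve_choice_toy(records, column, action):
    """Apply a simplified ResolveChoice-style action to a list of dict records."""
    resolved = []
    for record in records:
        row = {key: value for key, value in record.items() if key != column}
        value = record.get(column)
        is_int = isinstance(value, int)
        if action == "cast:int":
            # Assumption: values that cannot be converted become None (null).
            try:
                row[column] = int(value)
            except (TypeError, ValueError):
                row[column] = None
        elif action == "make_cols":
            # One column per type; the column of the other type stays null.
            row[column + "_int"] = value if is_int else None
            row[column + "_string"] = None if is_int else value
        elif action == "make_struct":
            # One struct holding a slot for each type.
            row[column] = {
                "int": value if is_int else None,
                "string": None if is_int else value,
            }
        else:
            raise ValueError("unsupported action: " + action)
        resolved.append(row)
    return resolved


# Hypothetical records where ordereid is sometimes an int and sometimes a string.
orders = [
    {"ordereid": 1001, "item": "pen"},
    {"ordereid": "1002", "item": "ink"},
    {"ordereid": "N/A", "item": "pad"},
]

for action in ("cast:int", "make_cols", "make_struct"):
    print(action, resolve_choice_toy(orders, "ordereid", action))
```

The trade-off is visible in the output: cast yields a single clean column but can lose values, while make_cols and make_struct keep every value at the cost of a wider or nested schema.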
OK, now that we know DynamicFrame, let's get to ETL job programming
Basics of Glue ETL Script
1. Create GlueContext, the entry point
2. Create DynamicFrame from Data Catalog
   Other options:
   • DynamicFrame.fromDF
   • from_options
3. Transform Data by modifying DynamicFrame
4. Write DynamicFrame into the Target (sink)
   Other options:
   • from_catalog
   • from_jdbc_conf
Optionally, you can add your job bookmarks
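The four steps above can be sketched as a single function. This is a minimal sketch, not the script from the demo: the function name `run_etl`, the `ordereid` column, the Parquet format and the output path are assumptions. The GlueContext and Job objects are passed in so the flow can be read (and exercised with stand-ins) outside a Glue environment; the comments show how they would typically be created inside a job.

```python
def run_etl(glue_context, job, args, database, table_name, output_path):
    """Read from the Data Catalog, resolve a choice column, write to S3."""
    # Step 1 happens in the caller. Inside a Glue job this would typically be:
    #   from pyspark.context import SparkContext
    #   from awsglue.context import GlueContext
    #   from awsglue.job import Job
    #   from awsglue.utils import getResolvedOptions
    #   args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    #   glue_context = GlueContext(SparkContext.getOrCreate())
    #   job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Step 2: create a DynamicFrame from the Data Catalog.
    # transformation_ctx lets job bookmarks track this source.
    source = glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        transformation_ctx="datasource0",
    )

    # Step 3: transform the data (here: resolve an ambiguous column type).
    resolved = source.resolveChoice(specs=[("ordereid", "cast:int")])

    # Step 4: write the DynamicFrame to the target (sink).
    glue_context.write_dynamic_frame.from_options(
        frame=resolved,
        connection_type="s3",
        connection_options={"path": output_path},
        format="parquet",
        transformation_ctx="datasink0",
    )

    # Optional bookmarks: committing saves the state for the next run.
    job.commit()
    return resolved
```

Keeping `job.commit()` as the last call means bookmark state is only saved after the write has been issued.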
Using Glue's Built-in Transforms (Demo)
Tips and Best Practices
• Useful for debugging
   • pause bookmark
   • spigot()
   • errorsAsDynamicFrame()
• Take advantage of bookmarks
• Know how much resource you need (DPU)
• Partition/Compress/Format your data to optimize transforms
   • https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html
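The two debugging helpers named above can be combined in a small function. This is a sketch under assumptions: `debug_frame`, the sample path and the sample size are hypothetical, and `frame` is expected to be a Glue DynamicFrame. `spigot` is used here with its `topk` option to write a few sample records, and `errorsAsDynamicFrame` to count the records that failed in earlier transforms.

```python
def debug_frame(frame, sample_path, sample_size=10):
    """Write a small sample of the frame and report how many records errored."""
    # spigot() writes sample records to sample_path and passes the data through.
    passthrough = frame.spigot(sample_path, options={"topk": sample_size})

    # errorsAsDynamicFrame() exposes the error records as their own frame.
    error_count = frame.errorsAsDynamicFrame().count()

    return passthrough, error_count
```

Inside a job this might be called as `debug_frame(resolved, "s3://example-bucket/debug/")` right after a transform, where the bucket name is a placeholder, and removed again once the job behaves as expected.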
Thank You!!!
Documentation:
https://docs.aws.amazon.com/glue/latest/dg/getting-started.html
Built in Glue Transforms:
https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
Examples:
https://github.com/aws-samples/aws-glue-samples

Data Transformation Patterns in AWS - AWS Online Tech Talks

  • 1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Transformation Patterns: Using AWS Glue to transform data in your Data Lake. Hemant Borole, AWS Professional Services Consultant; Marie Yap, AWS Enterprise Solutions Architect.
  • 2. Agenda • Our journey starts with the Data Lake • How Glue connects everything • ETL in AWS • Glue Job and Scheduling • Glue Transforms • Demo on different Glue Transforms • Tips and Best Practices
  • 3. A quick recap on Data Lake
  • 4. Traditionally, analytics used to look like this. Diagram: OLTP, ERP, CRM, LOB; Data Warehouse; Business Intelligence. Relational data; TBs-PBs scale; schema defined prior to data load; operational reporting and ad hoc; large initial capex + $10k-$50k / TB / year.
  • 5. Data Lakes extend the traditional approach. Relational and non-relational data; TBs-EBs scale; schema defined during analysis; diverse analytical engines to gain insights; designed for low cost storage and analytics. Diagram: OLTP, ERP, CRM, LOB; Data Warehouse; Business Intelligence; Data Lake; Devices, Web, Sensors, Social; Catalog; Machine Learning; DW Queries; Big data processing; Interactive; Real-time.
  • 6. Data Lakes on AWS. Most ways to bring data in; unmatched durability and availability at EB scale; best security, compliance, and audit capabilities; run any analytics on the same data without movement; scale storage and compute independently; store at $0.023 / GB-month; query for $0.05/GB scanned. Diagram: Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams; S3; Redshift, EMR, Athena, Kinesis, Elasticsearch Service.
  • 7. Extract, Transform, and Load. Extract: sources; raw; relational, semi-structured, or unstructured. Transform: file (format, compression, partitioning); data. Load: stage or processed; destination(s); visualization.
  • 8. There are many ways to do analytics. Interactive Analytics: Amazon Athena. Big Data Processing: Amazon EMR. Data Warehousing: Amazon Redshift*. Operational Analytics: Amazon ES. Real-time Analytics: Amazon Kinesis Analytics. Dashboard and Visualization: Amazon QuickSight.
  • 9. There has to be something that Glues them together
  • 10. AWS Glue
  • 11. Glue the services together through the Data Catalog: Amazon Athena, Amazon EMR, AWS Glue Jobs, Amazon Redshift Spectrum.
  • 12. Extract, Transform, Load in AWS. Extract: files; RDS/database; EDW; Glue Data Catalog; S3. Transform: Amazon Athena; Amazon Redshift; Amazon EMR; AWS Glue. Load: RDS/databases; EDW/Redshift; NoSQL, DynamoDB; Machine Learning (SageMaker); S3 (processed output bucket).
  • 13. Knowing the JOB
  • 14. Terminology. JOB: performs the ETL work in Glue. JOB BOOKMARKS: save the state across multiple job runs.
  • 15. How does a bookmark work? START OF BOOKMARK: datasource0 = glueContext.create_dynamic_frame.from_catalog(database = source_database, table_name = source_tables[0], transformation_ctx = "datasource0") END OF BOOKMARK: job.commit()
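The bookmark slide can be read as: state is tracked per `transformation_ctx` and only saved when `job.commit()` runs. The following is a plain-Python toy model of that idea, not Glue code. The `BookmarkStore` class, its methods, and the file names are hypothetical, and tracking processed file names is an assumption made for illustration; the slides do not say what Glue stores internally.

```python
class BookmarkStore:
    """Toy stand-in for bookmark state, keyed by a transformation_ctx string."""

    def __init__(self):
        self._committed = {}  # ctx -> set of inputs already processed and committed
        self._pending = {}    # ctx -> state that becomes durable only on commit()

    def read_new(self, ctx, available_files):
        """Return only the inputs not covered by the committed bookmark."""
        seen = self._committed.get(ctx, set())
        new_files = [f for f in available_files if f not in seen]
        self._pending[ctx] = seen | set(new_files)
        return new_files

    def commit(self):
        """Plays the role of job.commit(): persist the pending state."""
        self._committed.update(self._pending)
        self._pending = {}


store = BookmarkStore()

# Run 1: everything is new, and the run commits.
run1 = store.read_new("datasource0", ["a.json", "b.json"])
store.commit()

# Run 2: only c.json is new, but the run ends without committing.
run2 = store.read_new("datasource0", ["a.json", "b.json", "c.json"])

# Run 3: c.json is offered again because run 2 never committed.
run3 = store.read_new("datasource0", ["a.json", "b.json", "c.json"])
```

In this model, leaving out the commit step means the next run reprocesses the same input, which is why the slide marks `job.commit()` as the end of the bookmark.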
  • 16. Job Bookmark options
  • 17. Job Scheduling. Compose jobs globally with event-based dependencies: easy to reuse and leverage work across organization boundaries. Multiple triggering mechanisms: schedule-based (e.g., time of day); event-based (e.g., job completion); on-demand (e.g., AWS Lambda); more coming soon (Data Catalog based events, S3 notifications, and Amazon CloudWatch events). Logs and alerts are available in Amazon CloudWatch. Diagram labels: Marketing: Ad-spend by customer segment; Event Based; Lambda Trigger; Sales: Revenue by customer segment; Schedule; Data based; Central: ROI by customer segment; Weekly sales; Data based.
  • 18. So how do I transform?
  • 19. Glue is based on Apache Spark
  • 20. Writing your transformations: https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint.html
  • 21. Transforming Distributed, Immutable Data Structures
  • 22. But why do we have a DynamicFrame? Limitations of DataFrame for ETL operations: expensive when loading large data sets (infer the schema, then load the data); does not handle unstructured data types well; does not easily deal with error handling. DynamicFrame: no need to know the schema up front (schema per record); supports ETL-specific transformations.
  • 23. Problem Solving: Unstructured Data in DynamicFrame. df_order = glueContext.create_dynamic_frame.from_catalog(database="retail", table_name="orders")
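The problem on this slide is a column whose values do not all share one type. A minimal plain-Python sketch of how a per-record schema can surface that as a choice between types is given here. It is not Glue code: `infer_schema`, the sample records, and the `choice<...>` notation are all hypothetical and only illustrate the idea of "schema per record".

```python
def infer_schema(records):
    """Collect the Python type names seen for each field across all records."""
    seen = {}
    for record in records:
        for field, value in record.items():
            seen.setdefault(field, set()).add(type(value).__name__)
    # A field with one observed type keeps it; several types become a "choice".
    return {
        field: next(iter(types)) if len(types) == 1
        else "choice<" + ",".join(sorted(types)) + ">"
        for field, types in seen.items()
    }


# Hypothetical order records: ordereid arrives as an int in some rows
# and as a string in others.
orders = [
    {"ordereid": 1001, "item": "book"},
    {"ordereid": "1002", "item": "pen"},
    {"ordereid": 1003, "item": "lamp"},
]

schema = infer_schema(orders)
```

Here `ordereid` ends up with two candidate types, which is the ambiguity that ResolveChoice is meant to settle.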
  • 24. Solution - ResolveChoice. Option 1, cast: eliminate the string values by replacing them with null. order_rc_cast = df_order.resolveChoice(specs = [('ordereid','cast:int')]) Output schema: ordereid: int. Option 2, project: project all the data to one data type; in this case it will convert string to int. order_rc_cast = df_order.resolveChoice(specs = [('ordereid','project:int')]) Output schema: ordereid: int.
  • 25. Solution - ResolveChoice. Option 3, make_cols: splits the column into two, one for each type. order_rc_cast = df_order.resolveChoice(specs = [('ordereid','make_cols')]) Output schema: ordereid_int: int, ordereid_string: string. Option 4, make_struct: use a struct to represent each of the data types. order_rc_cast = df_order.resolveChoice(specs = [('ordereid','make_struct')]) Output schema: ordereid: struct (int: int, string: string).
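To make the output schemas on the two ResolveChoice slides concrete, here is a plain-Python model of three of the actions applied to a list of dict records. It is a sketch under stated assumptions, not Glue's implementation: the `resolve_choice` helper and the sample data are hypothetical, `cast:int` is modeled as "try to convert, otherwise null", and `project` is left out because the slides do not pin down how it differs from `cast`.

```python
def resolve_choice(records, field, action):
    """Toy model of resolving an int-or-string field in each record."""
    resolved = []
    for record in records:
        row = dict(record)          # copy so the input records stay untouched
        value = row.pop(field)
        if action == "cast:int":
            # Assumption: values that cannot be converted become null (None).
            try:
                row[field] = int(value)
            except (TypeError, ValueError):
                row[field] = None
        elif action == "make_cols":
            # One column per observed type: <field>_int and <field>_string.
            row[field + "_int"] = value if isinstance(value, int) else None
            row[field + "_string"] = value if isinstance(value, str) else None
        elif action == "make_struct":
            # One struct with a member per observed type.
            row[field] = {
                "int": value if isinstance(value, int) else None,
                "string": value if isinstance(value, str) else None,
            }
        else:
            raise ValueError("unsupported action in this sketch: " + action)
        resolved.append(row)
    return resolved


# Hypothetical input: one int value and one non-numeric string value.
orders = [{"ordereid": 1001}, {"ordereid": "n/a"}]

cast_rows = resolve_choice(orders, "ordereid", "cast:int")
cols_rows = resolve_choice(orders, "ordereid", "make_cols")
struct_rows = resolve_choice(orders, "ordereid", "make_struct")
```

The design difference is whether information is dropped: the cast model loses the string value, while the make_cols and make_struct models keep both values and defer the decision to a later step.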
  • 26. OK, now that we know DynamicFrame, let's get to the ETL job programming
  • 27. Basics of Glue ETL Script. 1. Create GlueContext, the entry point. 2. Create DynamicFrame from Data Catalog (other options: DynamicFrame.fromDF, from_options). 3. Transform data by modifying DynamicFrame. 4. Write DynamicFrame into the target (sink) (other options: from_catalog, from_jdbc_conf). Optionally, you can add your job bookmarks.
  • 28. Using Glue's Built-in Transforms (Demo)
  • 29. Tips and Best Practices • Useful for debugging: pause bookmark, spigot(), errorsAsDynamicFrame() • Take advantage of bookmarks • Know how much resource you need (DPU) • Partition/Compress/Format your data to optimize transforms • https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html
  • 30. Thank You! Documentation: https://docs.aws.amazon.com/glue/latest/dg/getting-started.html • Built-in Glue Transforms: https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html • Examples: https://github.com/aws-samples/aws-glue-samples