Big Data Engineering
using AWS Glue and EMR
The right Foundation
for making Informed Decisions
April 27, 2019
Agenda
• AWS Data and Analytics Services – overview
• EMR Based Solution – overview, demo
• Glue Based Solution – overview, demo
• Summary
• Q&A
Agilisium: Helping Organizations Take the Data-to-Insights Leap
We are a Big Data and Analytics company with a clear focus on helping organizations accelerate their “Data-to-Insights-Leap”.
We are headquartered in Los Angeles (40+), with a global presence in India (250+), Canada, Costa Rica, the Netherlands and the UK (10+).
We are invested in all stages of the data journey: Data Architecture Consulting, Data Integration, Data Storage, Data Governance and Data Analytics.
Data and Analytics Services on AWS
Our Reference Architecture*
[Diagram. Components shown:]
• Data Sources: Enterprise, Unstructured, Informational, External, Web
• Entry and exit layers: In-bound API Layer, In-bound SFTP Layer, Out-bound API Pub / Sub Layer
• Consumers: EDL Subscribers, Other Systems
• Ingestion: Kinesis, Direct Connect, Snowball, DB Migration Service
• S3 Data Lake: Staging Data Pond, Business Domain Data Ponds, User’s Data Pond
• Processing and analytics: Elastic MapReduce, AWS Glue, Elasticsearch on Data Lake, Redshift, Athena, QuickSight, SageMaker
• Orchestration: Step Functions, Data Pipeline
Easy, fast, and cost-effective way to process vast amounts of data
* This is only a representative image. It does not include all services and all scenarios.
About EMR
EMR is an AWS-managed Hadoop framework for easy, fast and cost-effective data processing.
It supports popular distributed frameworks such as Spark, HBase, Presto and Flink.
• Easy to use
• Integrates easily with S3, Glue Catalog, HDFS, Glacier, Redshift, DynamoDB and RDS
• Support for notebook-based development for data science applications
• Multi-user access for EMR Notebooks
• Supports multiple distributed components – Spark, Hadoop, HBase, Presto
• Support for installing additional software (e.g. additional packages)
EMR Architecture
EMR Service Components
Clusters
• Central component of Amazon EMR
• Collection of Amazon EC2 instances
Security Configurations
• Data encryption at rest and in transit
• Identity authentication using Kerberos
VPC Subnet
• View the VPC configurations for the EMR cluster
Events
• Track EMR events and activities, and store them for up to seven days
• Create CloudWatch (CW) rules that match a specified pattern, and route events to take action
Notebooks
• Use EMR notebooks based on Jupyter to analyze data interactively with live code
• Create and attach notebooks to EMR clusters running Hadoop, Spark, and Livy
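To make the Events item more concrete, here is a minimal sketch of the kind of pattern a CloudWatch rule could match on. It only builds the pattern document; it does not call AWS. The function name and the cluster id are made up for illustration, and the field names (`state`, `clusterId`) are what we understand EMR step events to carry, so please verify them against a real event before relying on this.

```python
import json


def emr_step_failure_pattern(cluster_id=None):
    """Build an event pattern that matches failed EMR steps.

    Assumption: EMR step events use source "aws.emr", detail-type
    "EMR Step Status Change", and carry "state" and "clusterId" in detail.
    """
    detail = {"state": ["FAILED"]}
    if cluster_id is not None:
        # Restrict the rule to a single cluster (the id here is hypothetical).
        detail["clusterId"] = [cluster_id]
    return {
        "source": ["aws.emr"],
        "detail-type": ["EMR Step Status Change"],
        "detail": detail,
    }


pattern = emr_step_failure_pattern("j-EXAMPLE12345")
print(json.dumps(pattern, indent=2))
```

The JSON string produced by `json.dumps(pattern)` is what would be supplied as the rule's event pattern, with a target such as an SNS topic or a Lambda function attached to take action.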
EMR Pricing
Ref Link: https://aws.amazon.com/emr/pricing/
• Simple and predictable – pay a per-second rate, with a one-minute minimum
• The EMR price is in addition to the underlying EC2 pricing and, if used, optional EBS pricing
− These are also billed per second, with a one-minute minimum
• EC2 pricing options include on-demand, reserved and spot instances
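As a rough illustration of the billing rule above (per-second billing, one-minute minimum, EMR charge added on top of EC2), the arithmetic can be sketched as follows. The hourly rates in the example are placeholders, not real prices, and EBS is simplified to a flat amount; check the pricing page for actual figures.

```python
def billed_seconds(runtime_seconds, minimum_seconds=60):
    """Per-second billing with a one-minute minimum."""
    return max(runtime_seconds, minimum_seconds)


def emr_cluster_cost(runtime_seconds, instance_count,
                     ec2_rate_per_hour, emr_rate_per_hour, ebs_cost=0.0):
    """Estimate the cost of one cluster run.

    The EMR rate is charged in addition to the EC2 rate for every instance.
    ebs_cost is a flat, optional amount (a simplification for this sketch).
    """
    hours = billed_seconds(runtime_seconds) / 3600
    return instance_count * hours * (ec2_rate_per_hour + emr_rate_per_hour) + ebs_cost


# Hypothetical rates: 0.20 USD/hour for EC2 and 0.05 USD/hour for EMR.
# A 45-second run on 3 instances is billed as 60 seconds per instance.
short_run = emr_cluster_cost(45, 3, ec2_rate_per_hour=0.20, emr_rate_per_hour=0.05)
one_hour = emr_cluster_cost(3600, 3, ec2_rate_per_hour=0.20, emr_rate_per_hour=0.05)
print(round(short_run, 4), round(one_hour, 4))
```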
EMR Demo
USE CASE
Objective: Conduct exploratory data analysis on movie data to narrate the history and story of cinema
• What movies tend to get higher vote counts and vote averages?
Dataset: The dataset is from MovieLens.
• Movie name, genre, budget, revenue, release date, language, countries released, production company, etc.
• Cast and crew information
• User ratings of each movie
Solution Approach using EMR
[Data flow diagram, inside AWS Cloud / VPC. Elements shown: CSV Files, CSV to Parquet, Parquet, Data Cleansing, Business Transformation, Enriched Data, Spark to Redshift. Stage labels shown: Persist, Transform, Load.]
[Step flow orchestration diagram. Elements shown: Start, Launch EMR, Get EMR Step Status, Check EMR Step Status (Success? Yes / No), Copy to Redshift, Get Redshift status, Check Redshift status (Success? Yes / No), Success, Failed, End.]
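The "Get EMR Step Status" / "Check EMR Step Status" pair in the orchestration can be sketched as a simple polling loop. This is not the code used in the demo; it is a minimal illustration that assumes a boto3-style EMR client is passed in, and that any terminal state other than COMPLETED (or running out of polls) counts as the "No" branch. The poll interval and limit are arbitrary.

```python
import time

# Terminal step states as we understand them from the EMR API.
SUCCESS_STATES = {"COMPLETED"}
FAILURE_STATES = {"FAILED", "CANCELLED", "INTERRUPTED"}


def wait_for_emr_step(emr_client, cluster_id, step_id,
                      poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll an EMR step until it finishes; return True only on success."""
    for _ in range(max_polls):
        # "Get EMR Step Status"
        response = emr_client.describe_step(ClusterId=cluster_id, StepId=step_id)
        state = response["Step"]["Status"]["State"]
        # "Check EMR Step Status"
        if state in SUCCESS_STATES:
            return True
        if state in FAILURE_STATES:
            return False
        sleep(poll_seconds)  # still PENDING or RUNNING, so check again later
    return False  # assumption: treat a timeout as a failure
```

With real credentials, `emr_client` would be `boto3.client("emr")`; a True result would lead on to the "Copy to Redshift" step, a False result to "Failed".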
About Glue
Glue is a fully managed, serverless ETL service to prepare and load data for analytics.
It also provides a centralized metadata repository, the Glue Catalog.
Use AWS Glue
• to build a data warehouse to organize, cleanse, validate, and format data
• to run serverless queries against your Amazon S3 data lake
• to create event-driven ETL pipelines
• to understand your data assets
Glue Architecture
Crawler Workflow
Glue Service Components
AWS Glue console
• Discover data, transform it, and make it available for search and querying
AWS Glue Data Catalog
• Persistent metadata store; contains table definitions, job definitions, and other control information
• Athena, Redshift Spectrum and EMR can access the catalog directly
Classifier
• Determines the schema of your data
• Glue supports classifiers for CSV, JSON, Avro, XML and common RDBMS
• You can also develop custom classifiers (e.g. with a grok pattern, or by specifying the row tag in an XML document)
Crawler
• AWS-developed program that connects to a data store
• Progresses through a prioritized list of classifiers to determine the data schema, and then creates metadata in the Glue Data Catalog
Glue Jobs System
• Provides managed infrastructure to orchestrate ETL workflows
• Jobs can be scheduled, chained, or triggered by events (e.g. when new data is received)
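The idea of a crawler walking a prioritized list of classifiers can be illustrated with a toy model. This is not how Glue is implemented; it is a simplified sketch in plain Python where each "classifier" is a hypothetical function that returns a schema description or None, and the first one that recognizes the sample wins.

```python
import csv
import io
import json


def classify_json(sample):
    """Recognize a single JSON object; return its keys as columns."""
    try:
        parsed = json.loads(sample)
    except ValueError:
        return None
    if isinstance(parsed, dict):
        return {"format": "json", "columns": sorted(parsed)}
    return None


def classify_csv(sample):
    """Recognize comma-separated text with a header row and consistent width."""
    rows = list(csv.reader(io.StringIO(sample.strip())))
    if len(rows) < 2 or len(rows[0]) < 2:
        return None
    if any(len(row) != len(rows[0]) for row in rows[1:]):
        return None
    return {"format": "csv", "columns": rows[0]}


def crawl(sample, classifiers):
    """Try classifiers in priority order; the first match determines the schema."""
    for classify in classifiers:
        schema = classify(sample)
        if schema is not None:
            return schema
    return {"format": "UNKNOWN", "columns": []}


priority = [classify_json, classify_csv]  # custom classifiers would go first
print(crawl("movie_id,title\n1,Toy Story\n", priority))
print(crawl('{"title": "Heat", "year": 1995}', priority))
```

In the real service the resulting schema would be written to the Glue Data Catalog as a table definition rather than returned as a dictionary.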
Glue Pricing
ETL Job:
• $0.44 per DPU-Hour, billed per second, with a 10-minute minimum for each ETL job of type Apache Spark
• $0.44 per DPU-Hour, billed per second, with a 1-minute minimum for each ETL job of type Python shell
• $0.44 per DPU-Hour, billed per second, with a 10-minute minimum for each provisioned development endpoint
Crawler:
• $0.44 per DPU-Hour, billed per second, with a 10-minute minimum per crawler run
Data Catalog storage:
• Free for the first million objects stored
• $1 per 100,000 objects stored above 1M, per month
Data Catalog requests:
• Free for the first million requests per month
• $1 per million requests above 1M in a month
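The pricing rules above reduce to a little arithmetic. The sketch below uses only the figures quoted on this slide; the DPU counts and runtimes in the example are made up, and it assumes the storage and request charges are pro-rated rather than rounded up to the next block.

```python
DPU_HOUR_RATE = 0.44  # USD per DPU-Hour, as quoted above


def glue_run_cost(dpus, runtime_seconds, minimum_seconds):
    """Cost of one job, crawler or endpoint run, billed per second with a minimum."""
    billed = max(runtime_seconds, minimum_seconds)
    return dpus * (billed / 3600) * DPU_HOUR_RATE


def catalog_storage_cost(objects_stored):
    """Monthly storage: first million objects free, then $1 per 100,000."""
    billable = max(0, objects_stored - 1_000_000)
    return billable / 100_000 * 1.0


def catalog_request_cost(requests):
    """Monthly requests: first million free, then $1 per million."""
    billable = max(0, requests - 1_000_000)
    return billable / 1_000_000 * 1.0


# A Spark job on 10 DPUs that finishes in 4 minutes is billed for 10 minutes.
spark_job = glue_run_cost(dpus=10, runtime_seconds=240, minimum_seconds=600)
# A Python shell job (1 DPU assumed) that runs 30 seconds is billed for 1 minute.
shell_job = glue_run_cost(dpus=1, runtime_seconds=30, minimum_seconds=60)
print(round(spark_job, 4), round(shell_job, 4))
print(catalog_storage_cost(1_500_000), catalog_request_cost(3_000_000))
```

The first example shows why the 10-minute minimum matters for short Spark jobs: running for 4 minutes costs the same as running for 10.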
Glue Demo
Glue Solution – Architecture and Flow
[Data flow diagram, inside AWS Cloud / VPC. Elements shown: CSV Files, CSV to Parquet, Parquet, Business Transformation, Enriched Data, Spark to Redshift. Stage labels shown: Prepare Data, Persist, Transform, Load.]
AWS Glue – Solution Orchestration
[Orchestration flow diagram in three stages, from Start to End:]
• Prepare Data: Parquet Conversion for Cast Data, Crew Data, Movie Data and Rating Data; Get Parquet Conversion Job Status; Check Parquet Conversion Job Status (Success? Yes / No)
• Transform: Business Transformation; Get Transformation Job Status; Check Transformation Job Status (Success? Yes / No)
• Load: Data storage to Redshift; Get data storage Redshift Status; Check data storage Redshift Status (Success? Yes / No)
• Outcomes: Success, Failed
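The three-stage flow (four Parquet conversions, then the business transformation, then the load to Redshift) can be sketched as follows. This is an illustration only, not the orchestration used in the demo: the job names are invented, each stage is assumed to be a Glue job, and a boto3-style Glue client is passed in so the logic can be exercised without AWS access.

```python
import time

# Hypothetical Glue job names, one per box in the diagram.
CONVERSION_JOBS = ["parquet-cast", "parquet-crew", "parquet-movie", "parquet-rating"]
STAGES = [CONVERSION_JOBS, ["business-transformation"], ["load-to-redshift"]]
FAILED_STATES = {"FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def run_jobs_and_wait(glue_client, job_names,
                      poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Start all jobs of one stage, then poll until every run has succeeded."""
    pending = {name: glue_client.start_job_run(JobName=name)["JobRunId"]
               for name in job_names}
    for _ in range(max_polls):
        for name, run_id in list(pending.items()):
            run = glue_client.get_job_run(JobName=name, RunId=run_id)
            state = run["JobRun"]["JobRunState"]
            if state == "SUCCEEDED":
                del pending[name]
            elif state in FAILED_STATES:
                return False  # one failed job fails the whole stage
        if not pending:
            return True
        sleep(poll_seconds)
    return False  # assumption: treat a timeout as a failure


def run_pipeline(glue_client, **wait_options):
    """Run Prepare Data, Transform and Load in order; stop at the first failure."""
    for stage in STAGES:
        if not run_jobs_and_wait(glue_client, stage, **wait_options):
            return "Failed"
    return "Success"
```

With real credentials this would be called as `run_pipeline(boto3.client("glue"))`. The same sequencing could also be expressed with Glue triggers or a Step Functions state machine instead of a polling script.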
Summary
EMR vs. Glue Quick Comparison Chart
Service Type
• EMR: Managed Hadoop framework
• Glue: Fully managed service
Software Configuration
• EMR: Hadoop ecosystem
• Glue: Only Spark
Development Effort
• EMR: Fully user developed
• Glue: Leverage blueprints to reduce the level of coding
Metadata Repository
• EMR: External metastore for Hive using Glue Catalog / RDS / Aurora
• Glue: Glue Catalog
Redshift Write
• EMR: Connection established using a driver
• Glue: In-built API
Job Scheduling
• EMR: EMR Steps
• Glue: Triggers
Dependent Libraries
• EMR: R, Python, Scala and Java libraries are supported
• Glue: Scala and pure Python libraries
Q&A
AWS Community Day – Chennai
Aug 10, 2018
Avail special discounts
for AWS Meetup Members and Participants
AWS Community
Day Chennai
AWS Chennai
Meetup Group
THANK YOU

Aws meetup 20190427

  • 1. Big Data Engineering using AWS Glue and EMR The right Foundation for making Informed Decisions April 27, 2019
  • 2. Agenda • AWS Data and Analytics Services – overview • EMR Based Solution – overview, demo • Glue Based Solution – overview, demo • Summary • Q&A
  • 3. We are a Big Data and Analytics Company with clear focus on helping organizations accelerate their “Data-to-Insights-Leap” Agilisium. Helping Organizations take Data to Insight Leap 3 We are headquartered in Los Angeles (40+) with global presence in India (250+), Canada, Costa Rica, Netherlands and UK (10+) We are invested in all stages of Data Journey: Data Architecture Consulting, Data Integration, Data Storage, Data Governance and Data Analytics
  • 4. Data and Analytics Services on AWS * This is only a representative image. It does not include all services and all scenarios. Enterprise Unstructured Informational External Web Data Sources In-bound API Layer In-bound SFTP Layer Out-bound API Pub / Sub Layer EDL Subscribers Other Systems Staging Data Pond User’s Data Pond Business Domain Data Pond Business Domain Data Pond S3 Data Lake Elastic Search on Data Lake Elastic Map Reduce AWS Glue Ingestion Kinesis Direct Connect Snow Ball DB Migration Service Quick Sight SageMaker Our Reference Architecture* Easy, fast, and cost-effective way to process vast amounts of data 4 Redshift Athena Step Function Data Pipeline
  • 5. About EMR 5 EMR is an AWS managed Hadoop framework for easy, fast and cost-effective data processing. Supports popular distributed frameworks such as Spark, Hbase, Presto and Flink • Easy to use • Easily integrates with S3, Glue Catalog, HDFS, Glacier, Redshift, Dynamo DB, RDS • Support for notebook based development for data science applications • Multi-user access for EMR Notebooks • Supports multiple distributed components – Spark, Hadoop, Hbase, Presto • Support for installing additional software (e.g. Addl. packages)
  • 7. EMR Service Components
      • Clusters
          • Central component of Amazon EMR
          • Collection of Amazon EC2 instances
      • Security Configurations
          • Data encryption at rest and in transit
          • Identity authentication using Kerberos
      • VPC Subnet
          • View the VPC configurations for EMR
      • Events
          • Track EMR events and activities and store them for up to seven days
          • Create CloudWatch rules that match a specified pattern and route events to take action
      • Notebooks
          • Use EMR Notebooks, based on Jupyter, to analyze data interactively with live code
          • Create and attach notebooks to EMR clusters running Hadoop, Spark, and Livy
  • 8. EMR Pricing
      • Simple and predictable: pay a per-second rate, with a one-minute minimum
      • The EMR price is in addition to the underlying EC2 pricing, and to EBS pricing if EBS is used
          • These are also billed per second, with a one-minute minimum
      • EC2 pricing options include on-demand, reserved and spot instances
      Ref link: https://aws.amazon.com/emr/pricing/
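The per-second billing with a one-minute minimum described on this slide can be illustrated with a small calculation. This is only a sketch: the hourly rates passed in are hypothetical placeholders rather than real AWS prices, the function names are made up for illustration, and EBS charges are left out for simplicity.

```python
import math


def billable_seconds(runtime_seconds: float, minimum_seconds: int = 60) -> int:
    """Round the runtime up to whole seconds and apply the one-minute minimum."""
    return max(math.ceil(runtime_seconds), minimum_seconds)


def emr_cluster_cost(
    runtime_seconds: float,
    instance_count: int,
    ec2_rate_per_hour: float,  # hypothetical EC2 rate, look up the real one
    emr_rate_per_hour: float,  # hypothetical EMR surcharge, paid on top of EC2
) -> float:
    """Estimate the cost of a cluster: the EMR rate is added to the EC2 rate."""
    seconds = billable_seconds(runtime_seconds)
    hourly_rate = ec2_rate_per_hour + emr_rate_per_hour
    return instance_count * hourly_rate * seconds / 3600


if __name__ == "__main__":
    # A 30-second run is still billed as 60 seconds.
    print(billable_seconds(30))
    # Two instances for one hour at placeholder rates.
    print(round(emr_cluster_cost(3600, 2, 0.20, 0.05), 4))
```

With rates taken from the pricing page linked on the slide, the same function gives a rough estimate for a transient cluster that is launched, runs its steps, and terminates.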
  • 10. Use Case
      • Objective: Conduct exploratory data analysis on movie data to narrate the history and story of cinema
          • Which movies tend to get higher vote counts and vote averages?
      • Dataset: The dataset is from MovieLens.
          • Movie name, genre, budget, revenue, release date, language, countries released, production company, etc.
          • Cast and crew information
          • User ratings of each movie
  • 11. Solution Approach using EMR
      • Data flow diagram (AWS Cloud, VPC), labels in slide order: CSV Files, CSV to Parquet, Parquet, Data Cleansing, Business Transformation, Spark to Redshift, Enriched Data. Stage labels: Transform, Load, Persist.
      • Step flow orchestration: Start → Launch EMR → Get EMR Step Status → Check EMR Step Status (Success?) → Copy to Redshift → Get Redshift Status → Check Redshift Status (Success?) → Success → End. A "No" at either check leads to Failed → End.
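The step flow on this slide (launch EMR, poll the step status, copy to Redshift, poll again) can be sketched in plain Python. This is not the orchestration definition used in the talk: the four callables are hypothetical stand-ins for the real AWS calls, and the status strings are illustrative placeholders.

```python
import time
from typing import Callable

FAILED_STATES = {"FAILED", "CANCELLED"}  # illustrative terminal failure states


def wait_for(
    get_status: Callable[[], str],
    poll_seconds: float = 30,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status until it reports success, failure, or polls run out."""
    for _ in range(max_polls):
        status = get_status()
        if status == "COMPLETED":
            return True
        if status in FAILED_STATES:
            return False
        sleep(poll_seconds)
    return False  # treat a timeout as a failure


def run_pipeline(
    launch_emr: Callable[[], None],
    get_emr_step_status: Callable[[], str],
    copy_to_redshift: Callable[[], None],
    get_redshift_status: Callable[[], str],
    **wait_kwargs,
) -> str:
    """Run the two gated stages and return 'Success' or 'Failed'."""
    launch_emr()
    if not wait_for(get_emr_step_status, **wait_kwargs):
        return "Failed"
    copy_to_redshift()
    if not wait_for(get_redshift_status, **wait_kwargs):
        return "Failed"
    return "Success"
```

Passing the status getters in as callables keeps the control flow testable without touching AWS; in practice each one would wrap a call to the relevant service.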
  • 13. About Glue
      Glue is a fully managed, serverless ETL service to prepare and load data for analytics. It also provides a centralized metadata repository, the Glue Catalog.
      Use AWS Glue:
      • to build a data warehouse to organize, cleanse, validate, and format data
      • to run serverless queries against your Amazon S3 data lake
      • to create event-driven ETL pipelines
      • to understand your data assets
  • 16. Glue Service Components
      • AWS Glue console
          • Discover data, transform it, and make it available for search and querying
      • AWS Glue Data Catalog
          • Persistent metadata store; contains table definitions, job definitions, and other control information
          • Athena, Redshift Spectrum and EMR can access the catalog directly
      • Classifier
          • Determines the schema of your data
          • Glue supports classifiers for CSV, JSON, Avro, XML and common RDBMSs
          • You can also develop a custom classifier (a grok pattern, or a row tag specified for an XML document)
      • Crawler
          • AWS-developed program that connects to a data store
          • Progresses through a prioritized list of classifiers to determine the data schema, then creates metadata in the Glue Data Catalog
      • Glue Jobs System
          • Provides managed infrastructure to orchestrate the ETL workflow
          • Jobs can be scheduled, chained, or triggered by events (e.g. new data received)
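The crawler behaviour described on this slide, working through a prioritized list of classifiers until one recognizes the data, can be sketched as a simple chain. This is a simplified illustration and not Glue's implementation: the two classifiers, their detection rules, and the returned dictionary shape are all assumptions made for the example.

```python
import csv
import io
import json
from typing import Callable, List, Optional

# A classifier inspects a text sample and returns a schema, or None if unsure.
Classifier = Callable[[str], Optional[dict]]


def json_classifier(sample: str) -> Optional[dict]:
    """Recognize newline-delimited JSON objects from the first line."""
    try:
        record = json.loads(sample.splitlines()[0])
    except (ValueError, IndexError):
        return None
    if not isinstance(record, dict):
        return None
    return {"format": "json", "columns": list(record)}


def csv_classifier(sample: str) -> Optional[dict]:
    """Recognize CSV with a header row and a consistent column count."""
    rows = list(csv.reader(io.StringIO(sample)))
    if len(rows) < 2 or len(rows[0]) < 2:
        return None
    if any(len(row) != len(rows[0]) for row in rows[1:]):
        return None
    return {"format": "csv", "columns": rows[0]}


def classify(sample: str, classifiers: List[Classifier]) -> Optional[dict]:
    """Try classifiers in priority order; the first match wins."""
    for classifier in classifiers:
        schema = classifier(sample)
        if schema is not None:
            return schema
    return None


if __name__ == "__main__":
    chain = [json_classifier, csv_classifier]
    print(classify("title,budget\nHeat,60\n", chain))
```

Ordering matters in this sketch: a custom classifier placed first in the list would take precedence over the generic ones that follow it.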
  • 17. Glue Pricing
      • ETL Job
          • $0.44 per DPU-Hour, billed per second, with a 10-minute minimum for each ETL job of type Apache Spark
          • $0.44 per DPU-Hour, billed per second, with a 1-minute minimum for each ETL job of type Python shell
          • $0.44 per DPU-Hour, billed per second, with a 10-minute minimum for each provisioned development endpoint
      • Crawler
          • $0.44 per DPU-Hour, billed per second, with a 10-minute minimum per crawler run
      • Storage
          • Free for the first million objects stored
          • $1 per 100,000 objects stored above 1M, per month
      • Requests
          • Free for the first million requests per month
          • $1 per million requests above 1M in a month
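A short worked calculation may help make the DPU-Hour figures on this slide concrete. It uses the rates quoted on the slide (which may since have changed); the function names are made up for illustration, and prorating partial blocks of 100,000 catalog objects is an assumption.

```python
import math

DPU_HOUR_RATE = 0.44  # USD per DPU-Hour, as quoted on the slide


def glue_run_cost(
    dpus: float,
    runtime_seconds: float,
    minimum_minutes: int = 10,  # 10 for Spark jobs and crawlers, 1 for Python shell
    rate: float = DPU_HOUR_RATE,
) -> float:
    """Cost of one job or crawler run: per-second billing with a minimum."""
    billed_seconds = max(math.ceil(runtime_seconds), minimum_minutes * 60)
    return dpus * rate * billed_seconds / 3600


def catalog_storage_cost(objects_stored: int) -> float:
    """Monthly catalog storage: first million objects free, then $1 per 100,000.

    Assumption: partial blocks of 100,000 objects are prorated.
    """
    billable = max(objects_stored - 1_000_000, 0)
    return billable / 100_000 * 1.00


if __name__ == "__main__":
    # A 2-minute Spark job on 10 DPUs is billed as 10 minutes.
    print(round(glue_run_cost(10, 120), 4))
    # The same job running for a full hour.
    print(round(glue_run_cost(10, 3600), 4))
```

The minimum matters most for short jobs: in this sketch a 2-minute run and a 10-minute run on the same number of DPUs cost the same.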
  • 19. Glue Solution – Architecture and Flow
      • Data flow diagram (AWS Cloud, VPC), labels in slide order: CSV Files, CSV to Parquet, Parquet, Prepare Data, Business Transformation, Spark to Redshift, Enriched Data. Stage labels: Transform, Load, Persist.
  • 20. AWS Glue – Solution Orchestration
      • Prepare Data: Start → Parquet Conversion for Cast Data, Crew Data, Movie Data and Rating Data → Get Parquet Conversion Job Status → Check Parquet Conversion Job Status (Success?)
      • Transform: Business Transformation → Get Transformation Job Status → Check Transformation Job Status (Success?)
      • Load: Data Storage to Redshift → Get Data Storage Redshift Status → Check Data Storage Redshift Status (Success?) → Success → End
      • A "No" at any check leads to Failed → End.
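The orchestration on this slide gates each stage on the success of the previous one. The sketch below assumes the four Parquet conversion jobs run in parallel, which the flattened diagram suggests but does not confirm. Each callable is a hypothetical stand-in that starts a job, waits for it, and returns True on success.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

Job = Callable[[], bool]  # runs a job to completion, returns True on success


def run_glue_flow(conversion_jobs: Dict[str, Job], transform: Job, load: Job) -> str:
    """Prepare Data (fan-out), then Transform, then Load; stop at first failure."""
    # Prepare Data: run every Parquet conversion job, then collect all results.
    with ThreadPoolExecutor(max_workers=max(1, len(conversion_jobs))) as pool:
        futures = {name: pool.submit(job) for name, job in conversion_jobs.items()}
        results = {name: future.result() for name, future in futures.items()}
    if not all(results.values()):
        return "Failed"

    # Transform: business transformation only starts after every conversion passes.
    if not transform():
        return "Failed"

    # Load: write the enriched data to Redshift.
    if not load():
        return "Failed"
    return "Success"


if __name__ == "__main__":
    jobs = {name: (lambda: True) for name in ("cast", "crew", "movie", "rating")}
    print(run_glue_flow(jobs, lambda: True, lambda: True))
```

Collecting all conversion results before deciding keeps the failure handling in one place, at the price of waiting for the slowest job even when another has already failed.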
  • 22. EMR vs. Glue: Quick Comparison Chart
      • Service Type. EMR: managed Hadoop framework. Glue: fully managed service.
      • Software Configuration. EMR: Hadoop ecosystem. Glue: only Spark.
      • Development Effort. EMR: fully user developed. Glue: leverage blueprints to reduce the level of coding.
      • Metadata Repository. EMR: external metastore for Hive using Glue Catalog / RDS / Aurora. Glue: Glue Catalog.
      • Redshift Write. EMR: connection established using a driver. Glue: in-built API.
      • Job Scheduling. EMR: EMR Steps. Glue: Triggers.
      • Dependent Libraries. EMR: R, Python, Scala and Java libraries are supported. Glue: Scala and pure Python libraries.
  • 23. Q&A
  • 24. AWS Community Day – Chennai, Aug 10, 2018. Avail special discounts for AWS Meetup Members and Participants. (AWS Community Day Chennai, AWS Chennai Meetup Group)