AWS Meetup 20190427

Big Data Engineering using AWS Glue and EMR
The right Foundation for making Informed Decisions
April 27, 2019

Agenda
• AWS Data and Analytics Services – overview
• EMR Based Solution – overview, demo
• Glue Based Solution – overview, demo
• Summary
• Q&A

We are a Big Data and Analytics Company with a clear focus on helping organizations accelerate their “Data-to-Insights-Leap”
Agilisium. Helping Organizations take Data to Insight Leap
We are headquartered in Los Angeles (40+) with a global presence in India (250+), Canada, Costa Rica, the Netherlands and the UK (10+)
We are invested in all stages of the Data Journey: Data Architecture Consulting, Data Integration, Data Storage, Data Governance and Data Analytics

Data and Analytics Services on AWS
Our Reference Architecture*
• Data Sources: Enterprise, Unstructured, Informational, External, Web
• In-bound layers: API Layer, SFTP Layer
• Ingestion: Kinesis, Direct Connect, Snowball, Database Migration Service
• S3 Data Lake: Staging Data Pond, User’s Data Pond, Business Domain Data Ponds
• Other services shown: Elasticsearch on Data Lake, Elastic MapReduce, AWS Glue, Redshift, Athena, QuickSight, SageMaker, Step Functions, Data Pipeline
• Out-bound: API Pub / Sub Layer, EDL Subscribers, Other Systems
Easy, fast, and cost-effective way to process vast amounts of data
* This is only a representative image. It does not include all services and all scenarios.

About EMR
EMR is an AWS managed Hadoop framework for easy, fast and cost-effective data processing.
Supports popular distributed frameworks such as Spark, HBase, Presto and Flink
• Easy to use
• Easily integrates with S3, Glue Catalog, HDFS, Glacier, Redshift, DynamoDB, RDS
• Support for notebook based development for data science applications
• Multi-user access for EMR Notebooks
• Supports multiple distributed components – Spark, Hadoop, HBase, Presto
• Support for installing additional software (e.g. additional packages)

EMR Service Components
Clusters
• Central component of Amazon EMR
• Collection of Amazon EC2 instances
Security Configurations
• Data encryption at-rest and in-transit
• Identity authentication using Kerberos
VPC Subnet
• View the VPC configurations for EMR
Events
• Track EMR events / activities and store them for up to seven days
• Create CloudWatch rules according to a specified pattern, and route events to take action
Notebooks
• Use EMR notebooks based on Jupyter to analyze data interactively with live code
• Create and attach notebooks to EMR clusters running Hadoop, Spark, and Livy

EMR Pricing
Ref Link: https://aws.amazon.com/emr/pricing/
• Simple and predictable – pay per second rate, with a one-minute minimum
• EMR price is in addition to underlying EC2 pricing and optional EBS pricing if used
− They are also billed per-second, with a one-minute minimum
• EC2 pricing options include on-demand, reserved and spot instances
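The billing rule above (per-second billing, a one-minute minimum, and the EMR charge added on top of EC2) can be worked through in a few lines. This is only a minimal sketch: the helper name and the hourly rates are illustrative placeholders rather than quoted prices, and optional EBS charges are left out.

```python
def emr_cluster_cost(runtime_seconds, instance_count, ec2_hourly_rate, emr_hourly_rate):
    """Estimate cluster cost under per-second billing with a one-minute minimum."""
    # Both the EC2 charge and the EMR charge bill at least 60 seconds.
    billable_seconds = max(runtime_seconds, 60)
    # The EMR price is added on top of the underlying EC2 price.
    hourly_rate_per_instance = ec2_hourly_rate + emr_hourly_rate
    return instance_count * hourly_rate_per_instance * billable_seconds / 3600


# Hypothetical rates: $0.20/hour for EC2 and $0.05/hour for EMR, per instance.
short_run = emr_cluster_cost(30, 3, 0.20, 0.05)    # billed as 60 seconds
long_run = emr_cluster_cost(1800, 3, 0.20, 0.05)   # billed as 1800 seconds
```

With these placeholder rates, a 30-second run on three instances is charged as a full minute, while a 30-minute run is charged for exactly the seconds used.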

USE CASE
Objective: Conduct exploratory data analysis on movie data to narrate the history and story of cinema
• What movies tend to get higher vote counts and vote averages?
Dataset: The dataset is from MovieLens.
• Movie name, genre, budget, revenue, release date, language, countries released, production company, etc.
• Cast and Crew Information
• User ratings of each movie
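The question about vote counts and vote averages comes down to a group-by over the user ratings. A minimal sketch in plain Python follows; the sample rows and the (movie, rating) layout are made up for illustration and are not the actual MovieLens schema.

```python
from collections import defaultdict


def vote_stats(ratings):
    """Return {movie: (vote_count, vote_average)} from (movie, rating) pairs."""
    totals = defaultdict(lambda: [0, 0.0])  # movie -> [count, sum of ratings]
    for movie, rating in ratings:
        totals[movie][0] += 1
        totals[movie][1] += rating
    return {movie: (count, total / count) for movie, (count, total) in totals.items()}


# Made-up sample rows, not taken from the dataset.
sample = [
    ("Movie A", 4.0),
    ("Movie B", 3.0),
    ("Movie A", 5.0),
    ("Movie B", 2.0),
    ("Movie A", 3.0),
]
stats = vote_stats(sample)
# Rank by vote count first, then by vote average.
ranked = sorted(stats, key=lambda movie: stats[movie], reverse=True)
```

On a cluster the same aggregation would normally be expressed as a Spark group-by; the plain-Python version only shows the shape of the computation.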

Solution Approach using EMR
Data flow diagram (AWS Cloud, VPC; stages: Transform, Load, Persist):
CSV Files → CSV to Parquet → Parquet → Data Cleansing → Business Transformation → Enriched Data → Spark to Redshift
Step flow orchestration:
• Start → Launch EMR → Get EMR Step Status → Check EMR Step Status → Success?
• If yes: Copy to Redshift → Get Redshift status → Check Redshift status → Success?
• If yes: Success → End
• If no at either check: Failed
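The step flow above is a "get status, check status, branch" loop run twice: once for the EMR step and once for the Redshift copy. The sketch below illustrates that control flow only. The status functions are injected stand-ins (in practice they would wrap calls to the AWS services), and the state names COMPLETED and FAILED are assumptions made for the example.

```python
import time


def wait_for(get_status, poll_seconds=0.0, max_polls=100):
    """Poll get_status() until it returns a terminal state or polls run out."""
    for _ in range(max_polls):
        status = get_status()
        if status in ("COMPLETED", "FAILED"):
            return status
        time.sleep(poll_seconds)
    return "FAILED"  # treat running out of polls as a failure


def run_pipeline(emr_step_status, redshift_copy_status):
    """Mirror the step flow: EMR step first, Redshift copy only on success."""
    if wait_for(emr_step_status) != "COMPLETED":
        return "Failed"
    if wait_for(redshift_copy_status) != "COMPLETED":
        return "Failed"
    return "Success"


def fake_status(*states):
    """Build a stand-in status function that replays the given states in order."""
    remaining = iter(states)
    return lambda: next(remaining)


ok = run_pipeline(fake_status("RUNNING", "COMPLETED"), fake_status("COMPLETED"))
bad = run_pipeline(fake_status("RUNNING", "FAILED"), fake_status("COMPLETED"))
```

Passing the status functions in as arguments keeps the branching logic testable without a running cluster.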

About Glue
Glue is a fully managed, serverless ETL service to prepare and load data for analytics
It also provides a centralized metadata repository through the Glue Data Catalog
Use AWS Glue
• to build a data warehouse to organize, cleanse, validate, and format data
• to run serverless queries against your Amazon S3 data lake
• to create event-driven ETL pipelines
• to understand your data assets

Glue Service Components
AWS Glue console
• Discover data, transform it, and make it available for search and querying.
AWS Glue Data Catalog
• Persistent metadata store; contains table definitions, job definitions, and other control information
• Athena, Redshift Spectrum and EMR can access the catalog directly.
Classifier
• Determines the schema of your data
• Glue supports classifiers for CSV, JSON, Avro, XML and common RDBMS
• Custom classifiers can also be developed (e.g. a grok pattern, or specifying a row tag in an XML document)
Crawler
• AWS developed program that connects to a data store
• Progresses through a prioritized list of classifiers to determine the data schema and then creates metadata in the Glue Data Catalog
Glue Jobs System
• Provides managed infrastructure to orchestrate ETL workflows
• Jobs can be scheduled, chained, or triggered by events (e.g. on receiving new data)
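The crawler behaviour described above, working through a prioritized list of classifiers and using the first one that recognises the data, can be illustrated with a toy chain. This is not how Glue itself is implemented: the classifier functions, their detection rules and the priority order are simplified assumptions for the example.

```python
import csv
import io
import json


def classify_json(text):
    """Return sorted field names if text is a JSON object, else None."""
    try:
        record = json.loads(text)
    except ValueError:
        return None
    return sorted(record) if isinstance(record, dict) else None


def classify_csv(text):
    """Return header names if text looks like comma-separated rows, else None."""
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) >= 2 and len(rows[0]) > 1:
        return rows[0]
    return None


def crawl(text, classifiers):
    """Try each (name, classifier) in priority order; the first match wins."""
    for name, classifier in classifiers:
        schema = classifier(text)
        if schema is not None:
            return name, schema
    return "UNKNOWN", []


# Hypothetical priority order for the example.
PRIORITY = [("json", classify_json), ("csv", classify_csv)]
csv_result = crawl("title,budget\nMovie A,100\n", PRIORITY)
json_result = crawl('{"title": "Movie A", "budget": 100}', PRIORITY)
```

A custom classifier would slot into the same list ahead of the built-in ones, which is why the order of the list matters.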

Glue Pricing
ETL Job:
• $0.44 per DPU-Hour, billed per second, with 10-minute minimum for each ETL job of type Apache Spark
• $0.44 per DPU-Hour, billed per second, with 1-minute minimum for each ETL job of type Python shell
• $0.44 per DPU-Hour, billed per second, with 10-minute minimum for each provisioned development endpoint
Crawler:
• $0.44 per DPU-Hour, billed per second, with a 10-minute minimum per crawler run
Storage:
• Free for the first million objects stored
• $1 per 100,000 objects stored above 1M, per month
Requests:
• Free for the first million requests per month
• $1 per million requests above 1M in a month
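The effect of the different minimums is easiest to see with a worked calculation. The sketch below applies the rates listed above; the function names and the example job sizes are illustrative, and charging catalog storage pro rata for partial blocks of 100,000 objects is an assumption rather than something the pricing list states.

```python
def glue_job_cost(dpus, runtime_seconds, minimum_seconds, rate_per_dpu_hour=0.44):
    """Cost of one run, billed per second with a minimum billable duration."""
    billable_seconds = max(runtime_seconds, minimum_seconds)
    return dpus * rate_per_dpu_hour * billable_seconds / 3600


def catalog_storage_cost(objects_stored):
    """Monthly catalog storage cost: first million objects free, then $1 per 100,000."""
    billable_objects = max(objects_stored - 1_000_000, 0)
    # Assumption: partial blocks of 100,000 objects are charged pro rata.
    return billable_objects / 100_000 * 1.0


# A 4-minute Spark job on 10 DPUs is billed for the 10-minute minimum.
spark_cost = glue_job_cost(dpus=10, runtime_seconds=240, minimum_seconds=600)
# A 4-minute Python shell job on 1 DPU is billed for the 4 minutes it ran.
shell_cost = glue_job_cost(dpus=1, runtime_seconds=240, minimum_seconds=60)
storage_cost = catalog_storage_cost(1_500_000)
```

For short jobs the 10-minute minimum dominates the Spark job cost, which is worth keeping in mind when splitting work into many small jobs.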

Glue Solution – Architecture and Flow
Data flow diagram (AWS Cloud, VPC; stages: Prepare Data, Transform, Load, Persist):
CSV Files → CSV to Parquet → Parquet → Business Transformation → Enriched Data → Spark to Redshift

AWS Glue – Solution Orchestration
Orchestration flow (stages: Prepare Data, Transform, Load):
• Start → Parquet Conversion for Cast Data, Crew Data, Movie Data and Rating Data → Get Parquet Conversion Job Status → Check Parquet Conversion Job Status → Success?
• If yes: Business Transformation → Get Transformation Job Status → Check Transformation Job Status → Success?
• If yes: Data storage to Redshift → Get data storage Redshift Status → Check data storage Redshift Status → Success?
• If yes: Success → End
• If no at any check: Failed
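Read as a whole, this orchestration is a sequence of stages in which every job of a stage has to succeed before the next stage starts. The sketch below captures that gating logic under stated assumptions: the job names are hypothetical, the mapping of jobs to the Prepare Data, Transform and Load stages is one reading of the diagram, and run_job is a stand-in for starting a job and waiting for its final status.

```python
def run_stages(stages, run_job):
    """Run stages in order; every job in a stage must succeed before moving on."""
    for stage_name, jobs in stages:
        results = {job: run_job(job) for job in jobs}
        failed = sorted(job for job, succeeded in results.items() if not succeeded)
        if failed:
            # Stop at the first stage with a failure, as in the "No" branches.
            return "Failed", stage_name, failed
    return "Success", None, []


# Hypothetical job names, one Parquet conversion per input dataset.
STAGES = [
    ("Prepare Data", ["cast_to_parquet", "crew_to_parquet", "movie_to_parquet", "rating_to_parquet"]),
    ("Transform", ["business_transformation"]),
    ("Load", ["store_to_redshift"]),
]

all_ok = run_stages(STAGES, run_job=lambda job: True)
one_bad = run_stages(STAGES, run_job=lambda job: job != "crew_to_parquet")
```

Returning the failing stage and job names alongside the outcome makes it clear where a rerun should start.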

EMR vs. Glue Quick Comparison Chart
Service Type
• EMR: Managed Hadoop Framework
• Glue: Fully Managed Service
Software Configuration
• EMR: Hadoop Ecosystem
• Glue: Only Spark
Development Effort
• EMR: Fully user developed
• Glue: Leverage blueprints to reduce level of coding
Metadata repository
• EMR: External metastore for Hive using Glue Catalog / RDS / Aurora
• Glue: Glue Catalog
Redshift Write
• EMR: Connection established using a driver
• Glue: In-built API
Job Scheduling
• EMR: EMR Steps
• Glue: Triggers
Dependent Libraries
• EMR: R, Python, Scala, Java Libraries are supported
• Glue: Scala, Pure Python Libraries

AWS Community Day – Chennai
Aug 10, 2018
Avail special discounts for AWS Meetup Members and Participants
AWS Community Day Chennai
AWS Chennai Meetup Group
