This document provides an overview of Amazon Elastic MapReduce (EMR), a service that makes it easy to process large amounts of data using the Hadoop framework. It discusses how EMR allows users to launch Hadoop clusters in minutes, integrate with other AWS services for storage and databases, customize clusters using various Hadoop applications and design patterns, and pay only for the resources used. The document aims to demonstrate how EMR provides an easy, fast, secure and cost-effective way to run Hadoop in the cloud.
2. Agenda
• Is Hadoop the answer?
• Amazon EMR 101
• Integration with AWS storage and database services
• Common Amazon EMR design patterns
• Q+A
3. When leveraging your data to derive new insights,
Big Data problems are everywhere
• Data lacks structure
• Analyzing streams of information
• Processing large datasets
• Warehousing large datasets
• Flexibility for undefined ad hoc analysis
• Speed of queries on large data sets
4. Hadoop is the right system for Big Data
• Massively parallel
• Scalable and fault tolerant
• Flexibility for multiple languages
and data formats
• Open source
• Ecosystem of tools
• Batch and real-time analytics
8. Why Amazon EMR?
• Easy to use: launch a cluster in minutes
• Low cost: pay an hourly rate
• Elastic: easily add or remove capacity
• Reliable: spend less time monitoring
• Secure: manage firewalls
• Flexible: customize the cluster
10. Easy to deploy
Launch from the AWS Management Console or the AWS Command Line Interface.
You can also use the Amazon EMR API with your favorite SDK,
or use AWS Data Pipeline to start your clusters.
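As a minimal sketch of a CLI launch (assuming 2015-era AMI versions and that the default IAM roles already exist; the cluster name, key pair, and log bucket below are placeholders, not names from this deck):

```shell
# Hypothetical example: launch a 3-node cluster with Hive and Pig installed.
# KeyName and the S3 log URI are placeholders to replace with your own.
aws emr create-cluster \
  --name "my-demo-cluster" \
  --ami-version 3.8.0 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair \
  --applications Name=Hive Name=Pig \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --log-uri s3://my-log-bucket/emr-logs/
```

The command returns a cluster ID (j-...) that later CLI calls reference.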
11. Try different configurations to find your optimal architecture.
Choose your instance types:
• CPU: c3 family, cc1.4xlarge, cc2.8xlarge (machine learning)
• Memory: m2 family, r3 family (Spark and interactive)
• Disk/IO: d2 family, i2 family (large HDFS)
• General: m1 family, m3 family (batch process)
13. Mix on-demand and EC2 Spot capacity for low costs
• Spot Instances for task nodes: up to 90% off Amazon EC2 on-demand pricing
• On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity
Meet your SLA at predictable cost, or exceed your SLA at lower cost.
14. Use multiple EMR instance groups
Example Amazon EMR cluster:
• Master node: r3.2xlarge
• Slave group – Core: c3.2xlarge
• Slave group – Task: m3.xlarge (EC2 Spot)
• Slave group – Task: m3.2xlarge (EC2 Spot)
The master node runs the NameNode (HDFS) and ResourceManager (YARN) and
serves as a gateway. Core nodes run HDFS (DataNode); task nodes do not run
HDFS. Core and task nodes each run YARN (NodeManager).
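A cluster like this can be sketched at launch with the CLI's `--instance-groups` option; the instance counts and Spot bid price below are illustrative placeholders:

```shell
# Hypothetical sketch: master and core groups on on-demand capacity,
# a task group on EC2 Spot (BidPrice in USD/hour is a placeholder).
aws emr create-cluster \
  --name "instance-groups-demo" \
  --ami-version 3.8.0 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r3.2xlarge \
    InstanceGroupType=CORE,InstanceCount=4,InstanceType=c3.2xlarge \
    InstanceGroupType=TASK,InstanceCount=8,InstanceType=m3.xlarge,BidPrice=0.10
```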
16. Resizable clusters
Easily add and remove compute capacity in your cluster from the console,
CLI, or API. Match compute demands with cluster sizing.
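A resize from the CLI is a single call; the instance group ID below is a placeholder you would read from `aws emr describe-cluster`:

```shell
# Hypothetical sketch: grow a task instance group to 10 nodes.
aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-EXAMPLE1234,InstanceCount=10
```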
17. Use S3 instead of HDFS for your data layer to decouple
your compute capacity and storage.
Shut down your EMR clusters when you are not processing data,
and stop paying for them!
22. Use Identity and Access Management (IAM) roles with
your Amazon EMR cluster
• IAM roles give you fine-grained control over delegating
permissions to AWS services and over access to AWS
resources
• EMR uses two IAM roles:
• EMR service role is for the Amazon EMR
control plane
• EC2 instance profile is for the actual
instances in the Amazon EMR cluster
• Default IAM roles can be easily created and
used from the AWS Console and AWS CLI
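Creating and using the default roles from the CLI can be sketched as follows (the cluster parameters are placeholders):

```shell
# Create the default EMR service role and EC2 instance profile once
# per account, then reference them at every cluster launch.
aws emr create-default-roles

# Subsequent clusters simply pass --use-default-roles:
aws emr create-cluster --name "iam-demo" --ami-version 3.8.0 \
  --use-default-roles --instance-type m3.xlarge --instance-count 3
```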
23. EMR Security Groups: default and custom
A security group is a virtual firewall which controls
access to the EC2 instances in your Amazon EMR
cluster
• There is a single default master and default
slave security group across all of your clusters
• The master security group has port 22 access
for SSHing to your cluster
You can add additional security groups to the master
and slave groups on a cluster to separate them from
the default master and slave security groups, and
further limit ingress and egress policies.
24. Other Amazon EMR security features
EMRFS encryption options
• S3 server-side encryption
• S3 client-side encryption (use AWS Key Management Service keys or custom keys)
CloudTrail integration
• Track Amazon EMR API calls for auditing
Launch your Amazon EMR clusters in a VPC
• Logically isolated portion of the cloud (“Virtual Private Cloud”)
• Enhanced networking on certain instance types
27. Use Hive on EMR to interact with your data in HDFS
and Amazon S3
• Batch or ad hoc workloads
• Integration with EMRFS for better
performance reading and writing
to S3
• SQL-like query language to make
iterative queries easier
• Schema-on-read to query data
without needing pre-processing
• Use Tez engine for faster queries
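The schema-on-read pattern above can be sketched in HiveQL from the master node; the bucket, table, and column names below are placeholders, not examples from this deck:

```shell
# Hypothetical HiveQL: an external table defined directly over raw
# tab-separated logs in S3, then an iterative aggregate query over it.
hive -e "
CREATE EXTERNAL TABLE access_logs (
  ip  STRING,
  ts  STRING,
  url STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/logs/';

SELECT url, COUNT(*) AS hits
FROM access_logs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
"
```

No pre-processing step is needed: the schema is applied when the data is read.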
28. Use Pig to easily create ETL workflows
• Uses high-level “Pig Latin” language to
easily script data transformations in
Hadoop
• Strong optimizer for workloads
• Options to create robust user defined
functions
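A small Pig Latin ETL flow, run from the master node, might look like this (the S3 paths and field names are placeholders):

```shell
# Hypothetical sketch: load raw logs from S3, aggregate by URL,
# and store the result back to S3.
pig -e "
raw    = LOAD 's3://my-bucket/logs/' USING PigStorage('\t')
         AS (ip:chararray, ts:chararray, url:chararray);
by_url = GROUP raw BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(raw) AS hits;
STORE counts INTO 's3://my-bucket/output/url_counts';
"
```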
29. Use HBase on a persistent EMR cluster as a scalable
NoSQL database
• Billions of rows and millions
of columns
• Backup to and restore from
Amazon S3
• Flexible datatypes
• Modulate your HBase tables
when adding new data to
your system
30. Impala: a fast SQL query engine for EMR Clusters
• Low-latency SQL query engine for Hadoop
• Perfect for fast ad hoc, interactive queries on
structured or unstructured data
• Can be easily installed on an EMR cluster,
and queried from the CLI or a 3rd party BI tool
• Perfect for memory optimized instances
• Currently uses HDFS as data layer
34. To install anything else, use Bootstrap Actions
https://github.com/awslabs/emr-bootstrap-actions
35. Spark: an alternative engine to Hadoop with its
own ecosystem of applications
• Does not use the MapReduce framework
• In-memory for fast queries
• Great for machine learning or other
iterative queries
• Use Spark SQL to create a low-latency
data warehouse
• Spark Streaming for real-time
workloads
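A Spark application on the cluster can be submitted under YARN; the script path and resource sizes below are placeholders:

```shell
# Hypothetical sketch: run a Spark job on the cluster via YARN.
spark-submit \
  --master yarn-cluster \
  --num-executors 4 \
  --executor-memory 2g \
  /home/hadoop/jobs/wordcount.py
```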
36. Also use Bootstrap Actions to configure your
applications
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop
--keyword-config-file (merge values in the new config into the existing config)
--keyword-key-value (override the values provided)

Configuration file name | Keyword | File name shortcut | Key-value pair shortcut
core-site.xml           | core    | C                  | c
hdfs-site.xml           | hdfs    | H                  | h
mapred-site.xml         | mapred  | M                  | m
yarn-site.xml           | yarn    | Y                  | y
37. EMR Step API
• An EMR step can be a MapReduce job, Hive program, Pig
script, or even an arbitrary script
• Easily submit a step from the console, CLI, or API
• Submit multiple steps to use EMR as a sequential workflow
engine
Submit work via the EMR Step API or SSH to the
EMR master node
Connect to Master Node
• Connect to HUE, interact with
application CLIs, or submit
work directly to the Hadoop
APIs
• View the Hadoop UI
• Useful for long-running clusters
and interactive use cases
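Queuing steps from the CLI can be sketched like this; the cluster ID and script locations are placeholders:

```shell
# Hypothetical sketch: queue two steps on a running cluster so EMR
# executes them sequentially, like a simple workflow engine.
aws emr add-steps \
  --cluster-id j-EXAMPLE1234567 \
  --steps \
    Type=HIVE,Name="Daily report",Args=[-f,s3://my-bucket/scripts/report.q] \
    Type=PIG,Name="Cleanup",Args=[-f,s3://my-bucket/scripts/cleanup.pig]
```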
39. Diverse set of partners to use with Amazon EMR
Partner tools span BI/visualization, Hadoop distributions, data transfer,
data transformation, monitoring, performance tuning, graphical IDEs, and
ETL. They are available on AWS Marketplace or as a distribution in
Amazon EMR.
42. Amazon S3 as your persistent data store
• Designed for 99.999999999% durability
• Separate compute and storage
• Resize and shut down Amazon EMR clusters with no data loss
• Point multiple Amazon EMR clusters at the same data in Amazon S3
using the EMR File System (EMRFS)
43. EMRFS makes it easier to leverage Amazon S3
Better performance and error handling options
Transparent to applications – just read/write to “s3://”
Consistent view
• For consistent list and read-after-write for new puts
Support for Amazon S3 server-side and client-side encryption
Faster listing using EMRFS metadata
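Enabling the consistent view at launch can be sketched with the CLI's `--emrfs` option; the retry settings below are illustrative placeholders:

```shell
# Hypothetical sketch: launch a cluster with the EMRFS consistent view
# (backed by DynamoDB) turned on.
aws emr create-cluster \
  --name "emrfs-demo" \
  --ami-version 3.8.0 \
  --use-default-roles \
  --instance-type m3.xlarge --instance-count 3 \
  --emrfs Consistent=true,RetryCount=5,RetryPeriod=30
```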
44. Consistent view and fast listing using the optional EMRFS metadata
in Amazon DynamoDB
• List and read-after-write consistency
• Faster list operations

Listing performance* by number of objects:
Number of objects | Without consistent view | With consistent view
1,000,000         | 147.72                  | 29.70
100,000           | 12.70                   | 3.69
*Tested using a single-node cluster with an m3.xlarge instance.
45. EMRFS support for Amazon S3 client-side encryption
• Amazon S3 holds the client-side encrypted objects
• EMRFS, enabled for Amazon S3 client-side encryption, uses the
Amazon S3 encryption clients
• Key vendor: AWS KMS or your custom key vendor
46. Amazon EMR integration with Amazon Kinesis
• Read data directly into Hive, Apache Pig, Hadoop Streaming,
and Cascading from Amazon Kinesis streams
• No intermediate data persistence required
• Simple way to introduce real-time sources into
batch-oriented systems
• Multi-application support and automatic checkpointing
47. Use Hive with EMR to query data in DynamoDB
• Export data stored in DynamoDB to
Amazon S3
• Import data in Amazon S3 to
DynamoDB
• Query live DynamoDB data using SQL-like statements (HiveQL)
• Join data stored in DynamoDB and
export it or query against the joined data
• Load DynamoDB data into HDFS and
use it in your EMR job
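The Hive-to-DynamoDB mapping can be sketched with the DynamoDB storage handler; the table name, columns, and export path below are placeholders:

```shell
# Hypothetical HiveQL: map an external Hive table onto a live DynamoDB
# table, then export it to S3 with a single INSERT.
hive -e "
CREATE EXTERNAL TABLE ddb_orders (
  order_id STRING,
  total    DOUBLE
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  'dynamodb.table.name'     = 'Orders',
  'dynamodb.column.mapping' = 'order_id:OrderId,total:Total'
);

-- Export live DynamoDB data to Amazon S3
INSERT OVERWRITE DIRECTORY 's3://my-bucket/exports/orders/'
SELECT * FROM ddb_orders;
"
```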
48. Use AWS Data Pipeline and EMR to transform
data and load into Amazon Redshift
Unstructured Data Processed Data
Pipeline orchestrated and scheduled by AWS Data Pipeline
50. Amazon EMR example #1: Batch processing
• GBs of logs pushed to Amazon S3 hourly
• Daily Amazon EMR cluster using Hive to process data
• Input and output stored in Amazon S3
250 Amazon EMR jobs per day, processing 30 TB of data
http://aws.amazon.com/solutions/case-studies/yelp/
51. Using Amazon S3 and HDFS
• Data from multiple sources is aggregated and stored in Amazon S3
• A transient EMR cluster runs batch map/reduce jobs for daily and
weekly reports
• A long-running EMR cluster holds data in HDFS for interactive,
ad hoc Hive queries
52. Multiple EMR workflows using the same S3 dataset
• An input Amazon S3 bucket feeds the computations (S3DistCp,
Cascalog, LZO)
• An intermediate Amazon S3 bucket holds working data
• Final Amazon S3 buckets hold the results
Crashlytics (part of Twitter) uses EMR to analyze data in S3 to power
dashboards on its Answers platform.
53. Amazon EMR example #2: Long-running cluster
• Data pushed to Amazon S3
• A daily Amazon EMR cluster extracts, transforms, and loads (ETL)
data into the database
• A 24/7 Amazon EMR cluster running HBase holds the last 2 years’
worth of data
• A front-end service uses the HBase cluster to power a dashboard
with high concurrency
54. Amazon EMR example #3: Interactive query
• TBs of logs sent daily; logs stored in Amazon S3
• Amazon EMR cluster using Presto for ad hoc analysis of the
entire log set
Interactive query using Presto on a multi-petabyte warehouse:
http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
55. EMR example #4: EMR for ETL and query engine for
investigations which require all raw data
• TBs of logs sent daily; logs stored in S3
• Hourly EMR cluster using Spark for ETL
• Load a subset into the Amazon Redshift data warehouse
• Transient EMR cluster using Spark for ad hoc analysis of the
entire log set
59. Amazon Kinesis + Amazon EMR = fewer moving parts
• Clients and sensors log data (via Log4J) through a recording
service and an aggregator/sequencer
• Amazon Kinesis serves as the streaming data repository
• Amazon EMR handles the continuous processing for dashboards,
plus the data warehouse, analytics, and reporting
60. Real-time processing with Spark Streaming and batch
workloads on Kinesis streams with the Hadoop stack
• A customer application pushes input to Amazon Kinesis with Log4J
• Hive, Pig, Cascading, and Spark on Amazon EMR pull from the stream
• Processed output feeds real-time and batch workflows
(e.g., Amazon DynamoDB)
61. AWS Summit – Chicago: An exciting, free cloud conference designed to educate and inform new
customers about the AWS platform, best practices and new cloud services.
Details
• July 1, 2015
• Chicago, Illinois
• @ McCormick Place
Featuring
• New product launches
• 36+ sessions, labs, and bootcamps
• Executive and partner networking
Registration is now open
• Come and see what AWS and the cloud can do for you.
62. CTA Script
- If you are interested in learning more about how to navigate the cloud to grow
your business - then attend the AWS Summit Chicago, July 1st.
- Register today to learn from technical sessions led by AWS engineers, hear best
practices from AWS customers and partners, and participate in some of the 30+
paid sessions and labs.
- Simply go to
https://aws.amazon.com/summits/chicago/?trkcampaign=summit_chicago_bootcamps&trk=Webinar_slide
to register today.
- Registration is FREE.