Learning Objectives:
- Learn how to use Amazon EMR for easy, fast, and cost-effective processing of vast amounts of data across dynamically scalable Amazon EC2 instances.
- Learn how using EC2 Spot can significantly reduce the cost of running your clusters.
- Learn how Amazon EMR Instance Fleets can make it easier to quickly obtain and maintain your desired capacity for your clusters.
5. What We Will Cover
• Introduction to Amazon EMR
• Introduction to Amazon EC2 Spot Instances
• Walk through provisioning an EMR cluster using EMR
instance fleets
• Brief introduction to AWS Glue
• Walk through configuring Spark SQL to use the AWS
Glue Data Catalog as its metastore
• Q & A
12. Why EMR?
• Easy to Use: Launch a cluster in minutes
• Low Cost: Pay an hourly rate
• Open-Source Variety: Latest versions of software
• Managed: Spend less time monitoring
• Decouple Storage and Compute
• Flexible: Customize the cluster
14. Why EMR? Managed, Easy to Use, & Current
• EC2 Provisioning
• Cluster Setup
• Hadoop Configuration
• Installing Applications
• Job submission
• Monitoring and Failure Handling
16. Create a Fully Configured Cluster in Minutes!
• AWS Management Console
• AWS Command Line Interface (CLI)
• Or use an AWS SDK directly with the Amazon EMR API
Latest versions!
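As a rough illustration of the SDK route, the sketch below assembles a request for the `run_job_flow` operation in boto3 (the AWS SDK for Python). The cluster name, release label, instance type, region, and default role names are placeholders chosen for this example, not values from the slides. Building the request is kept separate from sending it, so the dictionary can be inspected without AWS credentials.

```python
def build_cluster_request(name, release_label, applications,
                          instance_type="m4.large", instance_count=3):
    """Assemble keyword arguments for the EMR RunJobFlow API call."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            # One master plus (instance_count - 1) core nodes of the same type.
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # False makes the cluster transient: it ends when its steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Assumed role names; these are the defaults EMR creates on request.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "VisibleToAllUsers": True,
    }


def launch_cluster(request, region_name="us-east-2"):
    """Send the request to EMR. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without it

    emr = boto3.client("emr", region_name=region_name)
    return emr.run_job_flow(**request)["JobFlowId"]


# Hypothetical values, for illustration only.
request = build_cluster_request("demo-cluster", "emr-5.8.0", ["Spark", "Hive"])
```

Calling `launch_cluster(request)` would start a real, billable cluster, so it is left out of the example run.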
20. Many Storage Layers to Choose From
Amazon DynamoDB
Amazon RDS
Amazon Kinesis
Amazon Redshift
Amazon S3
Amazon EMR
21. Why EMR? Decouple Storage and Compute
Persistent Cluster – Interactive Queries
(Spark-SQL | Presto)
Transient Cluster - Batch Jobs
(X hours nightly) – Add/Remove Nodes
External Metastore
Workload specific clusters
(Different sizes, Different Versions)
Amazon S3
22. Decouple Storage and Compute by Using S3 as Your Data Layer
• S3 is designed for 11 9’s of durability and is massively scalable
• Intermediates stored on local disk or HDFS
(Diagram labels: Amazon S3, Amazon EMR, EC2 Instance Memory, Local, HDFS)
24. S3 Tips: Partitions, Compression, and File Formats
• Avoid key names in lexicographical order
  • Improve throughput and S3 list performance
  • Use hashing/random prefixes or reverse the date-time
• Compress data set to minimize bandwidth from S3 to EC2
  • Make sure you use splittable compression or have each file be the optimal size for parallelization on your cluster
• Columnar file formats like Parquet can give increased performance on reads
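The key-naming tip on this slide can be sketched in a few lines. The two helpers below are hypothetical names invented for this example: one prepends a short hash so that keys no longer sort in lexicographical order, the other reverses a timestamp so the fastest-changing digits come first. The prefix length and timestamp format are assumptions, not recommendations from the slides.

```python
import hashlib
from datetime import datetime


def hashed_key(key, prefix_len=4):
    """Prefix the key with the first hex digits of its SHA-256 digest."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{key}"


def reversed_timestamp_key(ts, name):
    """Build a key whose leading component is the reversed date-time."""
    stamp = ts.strftime("%Y%m%d%H%M%S")
    return f"{stamp[::-1]}/{name}"


print(hashed_key("logs/2017/09/18/part-0000.gz"))
print(reversed_timestamp_key(datetime(2017, 9, 18, 8, 22, 0), "part-0000.gz"))
# second line prints 00228081907102/part-0000.gz
```

The hash prefix is deterministic, so a reader that knows the original key can always recompute where the object lives.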
29. Why EMR? Flexibility
• Compute (C4 Family, C3 Family): Machine Learning
• Memory (X1 Family, R3 Family): Interactive Analysis
• Storage (D2 Family, I2 Family): Large HDFS
• General (M4 Family, M3 Family): Batch Process
30. EMR Nodes - Customizable
EMR cluster: Master instance group, Core instance group (HDFS), Task instance group
• Master Node must keep running
• Core nodes can be added and removed gracefully
• Cluster can tolerate loss of task nodes
31. Performance Tuning - Speed and Cost
Considerations
• Transient or long running
• Instance types
• Cluster size
• Application settings
• File formats and S3 tuning
Master Node: r3.2xlarge
Slave Group - Core: c4.2xlarge
Slave Group - Task: m4.2xlarge (EC2 Spot)
33. Use Spot and Reserved Instances to Lower Cost
• On-demand for core nodes: Standard Amazon EC2 pricing for on-demand capacity. Meet SLA at predictable cost.
• Spot for task nodes: Up to 90% off EC2 on-demand pricing. Exceed SLA at lower cost.
Amazon EMR supports most EC2 instance types
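A minimal sketch of the On-Demand core plus Spot task split, written as the `InstanceGroups` list that boto3's `run_job_flow` accepts inside its `Instances` argument. The instance types follow the layout on the Performance Tuning slide; the node counts and the bid are hypothetical, and `build_instance_groups` is a helper name made up for this example.

```python
def build_instance_groups(core_count, task_count, task_bid_usd):
    """Return uniform instance groups: On-Demand master/core, Spot task."""
    return [
        {
            "Name": "Master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": "r3.2xlarge",
            "InstanceCount": 1,
        },
        {
            # Core nodes hold HDFS, so they stay on On-Demand capacity.
            "Name": "Core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": "c4.2xlarge",
            "InstanceCount": core_count,
        },
        {
            # Task nodes hold no HDFS data, so losing them is tolerable.
            "Name": "Task",
            "InstanceRole": "TASK",
            "Market": "SPOT",
            "BidPrice": f"{task_bid_usd:.2f}",  # the API expects USD as a string
            "InstanceType": "m4.2xlarge",
            "InstanceCount": task_count,
        },
    ]


# Hypothetical sizing: 2 core nodes, 4 Spot task nodes, $0.40 bid.
groups = build_instance_groups(core_count=2, task_count=4, task_bid_usd=0.40)
```

The result would be passed as `Instances={"InstanceGroups": groups, ...}` when creating the cluster.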
34. Instance Fleets for Advanced Spot Provisioning
Master Node Core Instance Fleet Task Instance Fleet
• Provision from a list of instance types with Spot and On-Demand
• Launch in the most optimal Availability Zone based on capacity/price
• Spot Block support
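To make the fleet idea concrete, here is a hedged sketch of a task instance fleet in the shape boto3's `run_job_flow` takes under `Instances["InstanceFleets"]`. The instance types, target capacity, and timeout are illustrative assumptions, and `build_task_fleet` is a hypothetical helper. The block-duration check reflects the six values I understand the API to accept (60 to 360 minutes in steps of 60).

```python
VALID_BLOCK_MINUTES = {60, 120, 180, 240, 300, 360}


def build_task_fleet(instance_types, target_spot_capacity,
                     timeout_minutes=20, block_minutes=None):
    """Describe a TASK fleet that draws Spot capacity from several types."""
    if not instance_types:
        raise ValueError("at least one instance type is required")

    spot_spec = {
        # How long to wait for Spot capacity before falling back.
        "TimeoutDurationMinutes": timeout_minutes,
        "TimeoutAction": "SWITCH_TO_ON_DEMAND",
    }
    if block_minutes is not None:
        if block_minutes not in VALID_BLOCK_MINUTES:
            raise ValueError("block duration must be 60-360 in steps of 60")
        spot_spec["BlockDurationMinutes"] = block_minutes

    return {
        "Name": "TaskFleet",
        "InstanceFleetType": "TASK",
        "TargetSpotCapacity": target_spot_capacity,
        "InstanceTypeConfigs": [
            {
                "InstanceType": instance_type,
                "WeightedCapacity": 1,
                # Bid up to the full On-Demand price for each type.
                "BidPriceAsPercentageOfOnDemandPrice": 100.0,
            }
            for instance_type in instance_types
        ],
        "LaunchSpecifications": {"SpotSpecification": spot_spec},
    }


# Hypothetical fleet: three similar instance types, 8 units of Spot capacity.
fleet = build_task_fleet(["m4.2xlarge", "c4.2xlarge", "r4.2xlarge"], 8)
```

Availability Zone selection is driven separately: listing several subnets under `Ec2SubnetIds` in the `Instances` block lets EMR choose among them.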
36. AWS EC2 Consumption Models
• On-Demand: Pay for compute capacity by the hour with no long-term commitments. For spiky workloads, or to define needs.
• Reserved: Make a low, one-time payment and receive a significant discount on the hourly charge. For committed utilization.
• Spot Market: Bid for unused capacity, charged at a Spot Price which fluctuates based on supply and demand. For time-insensitive, transient, or stateless workloads.
37. Spare Capacity at Scale
AWS has millions of active
customers every month,
including more than 2,300
government agencies, 7,000
education institutions and more
than 22,000 nonprofit
organizations that have used
AWS in the last 12 months.
38. What Are EC2 Spot Instances?
EC2 Spot instances are
spare EC2 On-Demand capacity
with very simple rules…
42. The Very Simple Rules of Spot Instances
Run in markets where the price of
compute changes based on supply
and demand.
You’ll never pay more than your
bid. When the market exceeds your
bid you get 2 minutes to wrap up
your work.
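The two-minute warning is delivered to the instance as a small JSON document, published under the instance metadata path `/latest/meta-data/spot/instance-action`. The sketch below only parses such a document and works out how much time is left. The sample notice and the `now` value are made up for illustration, and fetching the document from the metadata service is deliberately left out.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(notice_json, now):
    """Return the seconds left before the time named in the notice."""
    notice = json.loads(notice_json)
    deadline = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


# Fabricated example notice, in the documented shape.
sample = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
now = datetime(2017, 9, 18, 8, 20, 0, tzinfo=timezone.utc)
print(seconds_until_interruption(sample, now))  # 120.0
```

A worker could poll for the notice every few seconds and use the remaining time to checkpoint or drain its tasks.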
47. Get the Best Value for EC2 Capacity
• Since Spot Instances typically cost 50-90% less than
On-Demand, you can:
• Increase your compute capacity by 2-10x within the same
budget.
• Save 50-90% on your existing workload.
• Or both!
• Either way, you should try it!
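The 2-10x figure follows directly from the 50-90% discount: at a discount d, the same budget buys 1 / (1 - d) times the capacity. A small sketch of that arithmetic, with function names invented for this example:

```python
def capacity_multiplier(discount):
    """Capacity bought per dollar relative to On-Demand, for a discount in [0, 1)."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 / (1 - discount)


def spot_cost(on_demand_cost, discount):
    """Cost of the same workload on Spot at the given discount."""
    return on_demand_cost * (1 - discount)


print(capacity_multiplier(0.5))            # 2.0
print(round(capacity_multiplier(0.9), 1))  # 10.0
print(round(spot_cost(100.0, 0.9), 2))     # 10.0
```

"Or both" corresponds to spending part of the saving on extra capacity: any point between keeping the budget and keeping the capacity is available.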
48. Understanding EC2 Capacity
(Diagram: total capacity for a region (N. California) across AZ1 and AZ2; each AZ has Shared and Dedicated capacity for the P2, C4, M4, I3, R4, and D2 families in x, 2x, and 4x sizes)
54. Capacity and Spot Markets Recap
C4 prices in us-east-2 (Spot price per Availability Zone vs. On-Demand):
Size   2b      2c      2a      On-Demand
8XL    $0.27   $0.29   $0.50   $1.76
4XL    $0.30   $0.16   $0.21   $0.88
2XL    $0.07   $0.08   $0.08   $0.44
XL     $0.05   $0.04   $0.04   $0.22
L      $0.01   $0.04   $0.01   $0.11
• Each instance family
• Each instance size
• Each Availability Zone
• In every region
• Is a separate Spot Market
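Because each family, size, and Availability Zone is its own market, the C4 prices on this recap slide already describe 15 separate markets. The sketch below loads those prices and picks the cheapest pool per size. The pairing of each price with the labels 2b, 2c, and 2a follows the order in which they appear in the transcript and is an assumption; `cheapest_pool` is a hypothetical helper.

```python
# On-Demand and Spot prices (USD per hour) as read from the slide.
ON_DEMAND = {"8xlarge": 1.76, "4xlarge": 0.88, "2xlarge": 0.44,
             "xlarge": 0.22, "large": 0.11}

# Assumed mapping of the three Spot prices to AZ labels (transcript order).
SPOT = {
    "8xlarge": {"2b": 0.27, "2c": 0.29, "2a": 0.50},
    "4xlarge": {"2b": 0.30, "2c": 0.16, "2a": 0.21},
    "2xlarge": {"2b": 0.07, "2c": 0.08, "2a": 0.08},
    "xlarge": {"2b": 0.05, "2c": 0.04, "2a": 0.04},
    "large": {"2b": 0.01, "2c": 0.04, "2a": 0.01},
}


def cheapest_pool(size):
    """Return (az, spot_price, discount_vs_on_demand) for one C4 size."""
    az, price = min(SPOT[size].items(), key=lambda item: item[1])
    discount = 1 - price / ON_DEMAND[size]
    return az, price, discount


# One family x five sizes x three AZs.
market_count = sum(len(per_az) for per_az in SPOT.values())

for size in SPOT:
    az, price, discount = cheapest_pool(size)
    print(f"c4.{size}: cheapest in {az} at ${price:.2f} ({discount:.0%} off)")
```

The same lookup across several families is what diversification amounts to in practice: more pools to choose from at any moment.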
58. Bid Price vs. Market Price
(Chart: 25%, 50%, and 75% bid levels)
• You pay the market price
• Keep it simple and just bid 100% of the On-Demand price!
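A simplified model of why the bid level matters less for cost than for staying power: while the market price is at or below the bid, the charge is the market price, and the instance is interrupted once the market price exceeds the bid. The hourly price series below is invented for illustration, the On-Demand price is borrowed from the C4 4XL row of the recap slide, and real billing has more detail than this loop captures.

```python
def run_until_outbid(bid, hourly_market_prices):
    """Return (hours_run, total_cost) under a fixed bid."""
    hours, cost = 0, 0.0
    for price in hourly_market_prices:
        if price > bid:
            break  # market exceeded the bid: the instance is interrupted
        hours += 1
        cost += price  # charged the market price, not the bid
    return hours, cost


ON_DEMAND_PRICE = 0.88                   # USD per hour
prices = [0.16, 0.21, 0.30, 0.50, 0.21]  # hypothetical hourly market prices

for pct in (0.25, 0.50, 0.75, 1.00):
    hours, cost = run_until_outbid(pct * ON_DEMAND_PRICE, prices)
    print(f"bid {pct:.0%}: ran {hours} h, paid ${cost:.2f}")
```

In this made-up series the 25% bid is interrupted after two hours, while the 100% bid runs all five hours and still pays only the market price for each.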
59. EC2 Spot Instance Best Practices - Diversification
• Multiple EC2 instance types selected
• Multiple Availability Zones selected
• Pick instance types with similar
performance characteristics. For
example: c3.large, m3.large, r3.large,
c4.large, m4.large, r4.large…
60. Amazon EC2 Spot Bid Advisor
• We make this easy using the
Spot bid advisor
• With deliberate pool
selection and bidding, you
will keep your Spot instance
as long as you need to
66. FINRA: Migrating From On-Prem to AWS
• Petabytes of data generated on-premises, brought to AWS, and stored in S3
• Thousands of analytical queries performed on EMR and Amazon Redshift
• Stringent security requirements met by leveraging VPC, VPN, encryption at rest and in transit, CloudTrail, and database auditing
(Diagram labels: Flexible Interactive Queries; Predefined Queries; Surveillance Analytics; Data Management; Data Movement; Data Registration; Version Management; Amazon S3; Web Applications; Analysts; Regulators)
71. AWS Glue: Fully Managed Data Catalog & ETL Service
• Integrates with AWS/Non-AWS Data Stores
• Scalable
• No Admin
Learn more: https://aws.amazon.com/glue/
72. AWS Glue – Fully Managed ETL Service
Glue automates data cataloging & preparation
• Catalogs data sources
• Identifies data formats and data types
• Generates Extract, Transform, Load code
• Executes ETL jobs and manages dependencies
74. Use an External Metastore
• Use the AWS Glue Data Catalog to store external table metadata for Hive and Spark
• Set metastore location in hive-site
(Diagram labels: AWS Glue, Amazon S3)
75. Walk through configuring Spark SQL to use
the AWS Glue Data Catalog as its metastore
(Console and CLI)
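For the CLI and SDK paths, the metastore setting comes down to one property in the cluster's configuration list. The sketch below builds that list in Python; `glue_catalog_configurations` is a helper name made up for this example. The `spark-hive-site` entry covers Spark SQL, and the optional `hive-site` entry does the same for Hive. Which EMR releases support the Glue Data Catalog is worth checking against the EMR release guide before relying on this.

```python
import json

# Client factory class that points the Hive metastore client at AWS Glue.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations(include_hive=True):
    """Return EMR configuration objects that select the Glue Data Catalog."""
    properties = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    configurations = [
        {"Classification": "spark-hive-site", "Properties": dict(properties)}
    ]
    if include_hive:
        configurations.append(
            {"Classification": "hive-site", "Properties": dict(properties)}
        )
    return configurations


configs = glue_catalog_configurations()
print(json.dumps(configs, indent=2))
```

The list can be passed as `Configurations=configs` to boto3's `run_job_flow`, or saved as JSON and supplied to `aws emr create-cluster` through its `--configurations` option.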