Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR)
Chad Schmutzer, Solutions Architect - EC2 Spot Instances
September 13, 2017
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Learning Objectives
• Learn how to use Amazon EMR for easy, fast, and cost-
effective processing of vast amounts of data
• Learn how using EC2 Spot Instances can significantly
reduce the cost of running your clusters
• Learn how Amazon EMR Instance Fleets can make it
easier to quickly obtain and maintain your desired
capacity for your clusters
What We Will Cover
• Introduction to Amazon EMR
• Introduction to Amazon EC2 Spot Instances
• Walk through provisioning an EMR cluster using EMR
instance fleets
• Brief introduction to AWS Glue
• Walk through configuring Spark SQL to use the AWS
Glue Data Catalog as its metastore
• Q & A
What is Amazon EMR?
[Diagram: a layered stack of Infrastructure, Data Layer, Process Layer, Framework, and Applications (Pig, SQL), built up across several slides to show Amazon EMR with YARN, EMRFS, and Amazon S3.]
Why EMR?
• Easy to Use: Launch a cluster in minutes
• Low Cost: Pay an hourly rate
• Open-Source Variety: Latest versions of software
• Managed: Spend less time monitoring
• Decouple Storage and Compute
• Flexible: Customize the cluster
Why EMR? Managed, Easy to Use, & Current
• EC2 Provisioning
• Cluster Setup
• Hadoop Configuration
• Installing Applications
• Job Submission
• Monitoring and Failure Handling
Create a Fully Configured Cluster in Minutes!
• AWS Management Console
• AWS Command Line Interface (CLI)
• Or use an AWS SDK directly with the Amazon EMR API
Latest versions!
Amazon EMR Releases
[Diagram: Amazon EMR release stack on the Amazon EMR service]
• Hue (SQL Interface/Metastore Management), Zeppelin (Interactive Notebook), Ganglia (Monitoring), HiveServer2/Spark Thriftserver (JDBC/ODBC)
• Applications: Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop, HBase/Phoenix, Presto
• Batch: MapReduce
• Interactive: Tez
• In Memory: Spark
• Streaming: Flink
• YARN: Cluster Resource Management
• Storage: S3 (EMRFS), HDFS
Many Storage Layers to Choose From
Amazon DynamoDB
Amazon RDS
Amazon Kinesis
Amazon Redshift
Amazon S3
Amazon EMR
Why EMR? Decouple Storage and Compute
Persistent Cluster – Interactive Queries
(Spark-SQL | Presto)
Transient Cluster - Batch Jobs
(X hours nightly) – Add/Remove Nodes
External Metastore
Workload specific clusters
(Different sizes, Different Versions)
Amazon S3
Decouple Storage and Compute by Using S3
as Your Data Layer
• S3 is designed for 11 nines of durability and is massively scalable
• Intermediates stored on local disk or HDFS
[Diagram: Amazon EMR clusters, with EC2 instance memory, local disk, and HDFS, connected to Amazon S3.]
HBase on S3 for Scalable NoSQL
S3 Tips: Partitions, Compression, and File Formats
• Avoid key names in lexicographical order
• Improve throughput and S3 list performance
• Use hashing/random prefixes or reverse the date-time
• Compress data set to minimize bandwidth from S3 to
EC2
• Make sure you use splittable compression or have each file
be the optimal size for parallelization on your cluster
• Columnar file formats like Parquet can give increased
performance on reads
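The key-naming tip above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names `hashed_key` and `reverse_timestamp`, the 4-character prefix length, and the sample key are all assumptions chosen for the example.

```python
import hashlib


def hashed_key(key, prefix_len=4):
    """Prepend a short hash of the key so that keys which sort next to
    each other (for example, by date) spread across different prefixes."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{key}"


def reverse_timestamp(timestamp):
    """Reverse a timestamp string so the fastest-changing digits lead."""
    return timestamp[::-1]


if __name__ == "__main__":
    # Hypothetical object key, used only for illustration.
    print(hashed_key("logs/2017-09-13/part-00000.gz"))
    print(reverse_timestamp("20170913120000"))
```

The hash prefix is deterministic, so a reader can recompute it from the original key name when it needs to fetch the object.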
Cost & Time
[Figure: two charts of # CPUs against Time, one labeled "Wall clock time: 1 hour" and one labeled "Wall clock time: 10 hours".]
Why EMR? Low-cost
• Transient clusters
• Reserved instances
• Spot Instances
Why EMR? Flexibility
• General (Batch Process): M4 Family, M3 Family
• Compute (Machine Learning): C4 Family, C3 Family
• Memory (Interactive Analysis): X1 Family, R3 Family
• Storage (Large HDFS): D2 Family, I2 Family
EMR Nodes - Customizable
[Diagram: an EMR cluster with a master instance group, a core instance group (HDFS), and a task instance group]
• Master Node must keep running
• Core nodes can be added and removed gracefully
• Cluster can tolerate loss of task nodes
Performance Tuning - Speed and Cost
Considerations
• Transient or long running
• Instance types
• Cluster size
• Application settings
• File formats and S3 tuning
Master Node: r3.2xlarge
Slave Group - Core: c4.2xlarge
Slave Group - Task: m4.2xlarge (EC2 Spot)
Use Spot and Reserved Instances to Lower Cost
• On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity. Meet SLA at predictable cost.
• Spot for task nodes: up to 90% off EC2 on-demand pricing. Exceed SLA at lower cost.
Amazon EMR supports most EC2 instance types
Instance Fleets for Advanced Spot Provisioning
[Diagram: Master Node, Core Instance Fleet, Task Instance Fleet]
• Provision from a list of instance types with Spot and On-Demand
• Launch in the optimal Availability Zone based on capacity/price
• Spot Block support
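The instance fleet options above map onto the `Instances` part of an EMR `RunJobFlow` request. The following is a rough sketch under stated assumptions, not the configuration used in the talk: the fleet names, capacities, 20-minute Spot timeout, cluster name, and release label are illustrative, and the instance types are borrowed from the tuning slide earlier in the deck. The call to boto3 is isolated in `launch` so the request can be built and inspected without AWS credentials.

```python
def spot_type_configs(instance_types, bid_pct=100.0):
    """One config per candidate type, bidding a percentage of On-Demand."""
    return [
        {
            "InstanceType": name,
            "WeightedCapacity": 1,
            "BidPriceAsPercentageOfOnDemandPrice": bid_pct,
        }
        for name in instance_types
    ]


def build_request(subnet_ids):
    """Build keyword arguments for the EMR RunJobFlow API (illustrative values)."""
    fleets = [
        {
            "Name": "master",
            "InstanceFleetType": "MASTER",
            "TargetOnDemandCapacity": 1,
            "InstanceTypeConfigs": [{"InstanceType": "r3.2xlarge"}],
        },
        {
            "Name": "core",
            "InstanceFleetType": "CORE",
            "TargetOnDemandCapacity": 2,
            "InstanceTypeConfigs": [{"InstanceType": "c4.2xlarge"}],
        },
        {
            "Name": "task",
            "InstanceFleetType": "TASK",
            "TargetSpotCapacity": 4,
            "InstanceTypeConfigs": spot_type_configs(
                ["m4.2xlarge", "c4.2xlarge", "r3.2xlarge"]
            ),
            "LaunchSpecifications": {
                "SpotSpecification": {
                    # Assumed policy: fall back to On-Demand after 20 minutes.
                    "TimeoutDurationMinutes": 20,
                    "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                }
            },
        },
    ]
    return {
        "Name": "instance-fleet-sketch",  # hypothetical cluster name
        "ReleaseLabel": "emr-5.8.0",  # assumed release
        "Applications": [{"Name": "Spark"}],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "Instances": {
            # Several subnets let EMR choose among Availability Zones.
            "Ec2SubnetIds": list(subnet_ids),
            "KeepJobFlowAliveWhenNoSteps": False,
            "InstanceFleets": fleets,
        },
    }


def launch(request):
    """Submit the request. Needs boto3 and AWS credentials, so it is not
    called automatically."""
    import boto3

    return boto3.client("emr").run_job_flow(**request)
```

Listing several instance types in the task fleet and several subnets gives EMR more Spot pools to draw from, which is the point of the fleet feature described on the slide.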
What are Amazon EC2 Spot Instances?
AWS EC2 Consumption Models
• On-Demand: Pay for compute capacity by the hour with no long-term commitments. For spiky workloads, or to define needs.
• Reserved: Make a low, one-time payment and receive a significant discount on the hourly charge. For committed utilization.
• Spot Market: Bid for unused capacity, charged at a Spot Price which fluctuates based on supply and demand. For time-insensitive, transient, or stateless workloads.
Spare Capacity at Scale
AWS has millions of active
customers every month,
including more than 2,300
government agencies, 7,000
education institutions and more
than 22,000 nonprofit
organizations that have used
AWS in the last 12 months.
What Are EC2 Spot Instances?
EC2 Spot instances are
spare EC2 On-Demand capacity
with very simple rules…
The Very Simple Rules of Spot Instances
Run in markets where the price of
compute changes based on supply
and demand.
You’ll never pay more than your
bid. When the market exceeds your
bid you get 2 minutes to wrap up
your work.
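The two-minute warning can be observed from inside the instance. As a hedged sketch: EC2 has exposed a `spot/termination-time` item in instance metadata that is absent until the instance is marked for termination and then holds a UTC timestamp. The helper names below are invented for this example, and newer setups may need the `spot/instance-action` item or IMDSv2 session tokens instead, so check the current EC2 documentation before relying on it.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone

TERMINATION_URL = (
    "http://169.254.169.254/latest/meta-data/spot/termination-time"
)


def fetch_termination_time(url=TERMINATION_URL, timeout=1.0):
    """Return the raw timestamp string, or None when no notice is posted
    (the endpoint answers 404) or the metadata service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except (urllib.error.URLError, OSError):
        return None


def seconds_until_termination(raw, now=None):
    """Parse a timestamp such as 2017-09-13T18:02:00Z and return the
    seconds remaining, or None if there is nothing valid to parse."""
    if not raw:
        return None
    try:
        when = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return None
    when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (when - now).total_seconds()
```

A worker would typically poll `fetch_termination_time` every few seconds and, once it returns a value, checkpoint or hand off its work within the remaining time.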
Get the Best Value for EC2 Capacity
• Since Spot Instances typically cost 50-90% less than
On-Demand, you can:
• Increase your compute capacity by 2-10x within the same
budget.
• Save 50-90% on your existing workload.
• Or both!
• Either way, you should try it!
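The "2-10x" figure follows directly from the discount range: at a fixed budget, paying a fraction (1 - d) of the On-Demand price buys 1 / (1 - d) times the capacity. A small sketch of that arithmetic, with `capacity_multiplier` as a name made up for this example:

```python
def capacity_multiplier(discount_pct):
    """Capacity you can buy for the same budget at a given Spot discount,
    relative to On-Demand. The discount is a percentage from 0 up to
    (but not including) 100."""
    if not 0 <= discount_pct < 100:
        raise ValueError("discount_pct must be in the range [0, 100)")
    return 100 / (100 - discount_pct)


if __name__ == "__main__":
    for pct in (50, 75, 90):
        print(f"{pct}% off -> {capacity_multiplier(pct):.1f}x capacity")
```

A 50% discount doubles capacity and a 90% discount gives ten times the capacity, which matches the range quoted on the slide.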
Understanding EC2 Capacity
[Diagram: total capacity in a region (N. California) across AZ1 and AZ2, split into shared and dedicated capacity for the P2, C4, M4, I3, R4, and D2 instance families in x, 2x, and 4x sizes.]
Capacity and Spot Markets Recap
us-east-2, C4 family. Availability Zones shown: 2a, 2b, 2c.

Size   Spot price (three AZs)    On-Demand
8XL    $0.27   $0.29   $0.50     $1.76
4XL    $0.30   $0.16   $0.21     $0.88
2XL    $0.07   $0.08   $0.08     $0.44
XL     $0.05   $0.04   $0.04     $0.22
L      $0.01   $0.04   $0.01     $0.11

• Each instance family
• Each instance size
• Each Availability Zone
• In every region
• Is a separate Spot Market
Bid Price vs. Market Price
[Figure: Spot market price over time, with bid lines labeled 25% Bid, 50% Bid, and 75% Bid.]
• You pay the market price
• Keep it simple and just bid 100% of the On-Demand price!
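The bid rule can be written as a tiny function: while the market price is at or below your bid you run and pay the market price, and once it rises above your bid the instance is interrupted. The sketch below is illustrative only. The function name and the sample price series are assumptions, and the $1.76 On-Demand figure is the 8XL price from the recap table.

```python
def hourly_charge(bid, market_price):
    """Return what you pay for the hour, or None if the market price
    exceeds the bid and the instance would be interrupted."""
    if market_price > bid:
        return None
    return market_price


if __name__ == "__main__":
    on_demand = 1.76  # 8XL On-Demand price from the recap table
    market = [0.27, 0.29, 0.50]  # hypothetical price series
    for pct in (25, 100):
        bid = on_demand * pct / 100
        charges = [hourly_charge(bid, price) for price in market]
        print(f"{pct}% bid: {charges}")
```

With a 100% bid every sample hour runs at the market price, while a 25% bid is interrupted in the hour the market reaches $0.50. Bidding higher does not raise what you pay, only how long you keep the instance.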
EC2 Spot Instance Best Practices - Diversification
• Multiple EC2 instance types selected
• Multiple Availability Zones selected
• Pick instance types with similar
performance characteristics. For
example: c3.large, m3.large, r3.large,
c4.large, m4.large, r4.large…
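Because every instance type in every Availability Zone is its own Spot market, diversification multiplies the number of pools a request can draw from. A minimal sketch, using the instance types listed above and assuming three zones in us-east-2 for illustration:

```python
from itertools import product

INSTANCE_TYPES = [
    "c3.large", "m3.large", "r3.large",
    "c4.large", "m4.large", "r4.large",
]
# Assumed zone names, chosen only for this example.
AVAILABILITY_ZONES = ["us-east-2a", "us-east-2b", "us-east-2c"]


def spot_pools(instance_types, zones):
    """Every (instance type, Availability Zone) pair is a separate pool."""
    return list(product(instance_types, zones))


if __name__ == "__main__":
    pools = spot_pools(INSTANCE_TYPES, AVAILABILITY_ZONES)
    print(len(pools), "pools, for example", pools[0])
```

Six instance types across three zones give eighteen pools, so a price spike or capacity shortage in one pool leaves seventeen alternatives.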
Amazon EC2 Spot Bid Advisor
• We make this easy using the
Spot bid advisor
• With deliberate pool
selection and bidding, you
will keep your Spot instance
as long as you need to
EC2 Spot Advisor in Console (New!)
Example Customer Use Case
FINRA: Migrating From On-Prem to AWS
• Petabytes of data generated on-premises, brought to AWS, and stored in S3
• Thousands of analytical queries performed on EMR and Amazon Redshift
• Stringent security requirements met by leveraging VPC, VPN, encryption at rest and in transit, CloudTrail, and database auditing
[Diagram labels: Flexible Interactive Queries; Predefined Queries; Surveillance Analytics; Data Management (Data Movement, Data Registration, Version Management); Amazon S3; Web Applications; Analysts; Regulators]
Lower Cost and Higher Scale Than On-Premises
FINRA Saved 60% by Moving to HBase on EMR
Walk through provisioning an EMR cluster
using EMR instance fleets (Console and CLI)
What is AWS Glue?
Fully Managed Data Catalog & ETL Service
Integrates with AWS/Non-AWS Data
Stores
Scalable
No Admin
AWS Glue
Learn more: https://aws.amazon.com/glue/
Glue automates data cataloging & preparation
• Catalogs data sources
• Identifies data formats and data types
• Generates Extract, Transform, Load code
• Executes ETL jobs, managing dependencies
AWS Glue – Fully Managed ETL Service
Use an External Metastore
• AWS Glue: Use the AWS Glue Data Catalog to store external table metadata for Hive and Spark
• Amazon S3
• Set metastore location in hive-site
Walk through configuring Spark SQL to use
the AWS Glue Data Catalog as its metastore
(Console and CLI)
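For the CLI path, the metastore setting is supplied as an EMR configuration object. The sketch below generates that JSON in Python. It assumes the `spark-hive-site` and `hive-site` classifications and the Glue client factory class documented for EMR release 5.8.0 and later, and the function name is made up for this example, so verify the values against the documentation for your release.

```python
import json

# Client factory class that points the Hive metastore client at AWS Glue.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore."
    "AWSGlueDataCatalogHiveClientFactory"
)


def glue_metastore_configurations(include_hive=True):
    """Return EMR configuration objects that make Spark SQL (and
    optionally Hive) use the AWS Glue Data Catalog as the metastore."""
    classifications = ["spark-hive-site"]
    if include_hive:
        classifications.append("hive-site")
    return [
        {
            "Classification": name,
            "Properties": {
                "hive.metastore.client.factory.class": GLUE_FACTORY,
            },
        }
        for name in classifications
    ]


if __name__ == "__main__":
    print(json.dumps(glue_metastore_configurations(), indent=2))
```

The printed JSON can be saved to a file and passed to `aws emr create-cluster` through its `--configurations` option, or supplied as the `Configurations` argument when using an SDK.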
Q & A
Thank you!
Appendix
Reference links
EC2 Spot Documentation:
http://aws.amazon.com/ec2/spot/
http://aws.amazon.com/ec2/spot/bid-advisor/
http://aws.amazon.com/ec2/spot/getting-started/
http://aws.amazon.com/ec2/spot/faqs/
http://aws.amazon.com/ec2/spot/testimonials/
User Guide
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet.html
http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-fleet.html
Helpful AWS Blog Posts
https://aws.amazon.com/blogs/aws/focusing-on-spot-instances-lets-talk-about-best-practices/
https://aws.amazon.com/blogs/aws/building-price-aware-applications-using-ec2-spot-instances/
https://aws.amazon.com/blogs/compute/cost-effective-batch-processing-with-amazon-ec2-spot/
https://aws.amazon.com/blogs/compute/dynamic-scaling-with-ec2-spot-fleet/

Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) - AWS Online Tech Talks

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Chad Schmutzer, Solutions Architect - EC2 Spot Instances September 13, 2017 Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR)
  • 2. Learning Objectives • Learn how to use Amazon EMR for easy, fast, and cost- effective processing of vast amounts of data
  • 3. Learning Objectives • Learn how to use Amazon EMR for easy, fast, and cost- effective processing of vast amounts of data • Learn how using EC2 Spot Instances can significantly reduce the cost of running your clusters
  • 4. Learning Objectives • Learn how to use Amazon EMR for easy, fast, and cost- effective processing of vast amounts of data • Learn how using EC2 Spot Instances can significantly reduce the cost of running your clusters • Learn how Amazon EMR Instance Fleets can make it easier to quickly obtain and maintain your desired capacity for your clusters
  • 5. What We Will Cover • Introduction to Amazon EMR • Introduction to Amazon EC2 Spot Instances • Walk through provisioning an EMR cluster using EMR instance fleets • Brief introduction to AWS Glue • Walk through configuring Spark SQL to use the AWS Glue Data Catalog as its metastore • Q & A
  • 12. Why EMR? Easy to Use Launch a cluster in minutes Low Cost Pay an hourly rate Open-Source Variety Latest versions of software Managed Spend less time monitoring Decouple Storage and Compute Flexible Customize the cluster
  • 13. Why EMR? Easy to Use Launch a cluster in minutes Low Cost Pay an hourly rate Open-Source Variety Latest versions of software Managed Spend less time monitoring Decouple Storage and Compute Flexible Customize the cluster
  • 14. Why EMR? Managed, Easy to Use, & Current EC2 Provisioning Cluster Setup Hadoop Configuration Installing ApplicationsJob submissionMonitoring and Failure Handling
  • 15. Create a Fully Configured Cluster in Minutes! AWS Management Console AWS Command Line Interface (CLI) Or use a AWS SDK directly with the Amazon EMR API
  • 16. Create a Fully Configured Cluster in Minutes! AWS Management Console AWS Command Line Interface (CLI) Or use a AWS SDK directly with the Amazon EMR API Latest versions!
  • 18. Hue (SQL Interface/Metastore Management) Zeppelin (Interactive Notebook) Ganglia (Monitoring) HiveServer2/Spark Thriftserver (JDBC/ODBC) Amazon EMR service Storage S3 (EMRFS), HDFS YARN Cluster Resource Management Batch MapReduce Interactive Tez In Memory Spark Applications Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop HBase/Phoenix Presto Streaming Flink Amazon EMR Release
  • 19. Why EMR? Easy to Use Launch a cluster in minutes Low Cost Pay an hourly rate Open-Source Variety Latest versions of software Managed Spend less time monitoring Decouple Storage and Compute Flexible Customize the cluster
  • 20. Many Storage Layers to Choose From Amazon DynamoDB Amazon RDS Amazon Kinesis Amazon Redshift Amazon S3 Amazon EMR
  • 21. Why EMR? Decouple Storage and Compute Persistent Cluster – Interactive Queries (Spark-SQL | Presto) Transient Cluster - Batch Jobs (X hours nightly) – Add/Remove Nodes External Metastore Workload specific clusters (Different sizes, Different Versions) Amazon S3
  • 22. Decouple Storage and Compute by Using S3 as Your Data Layer. S3 is designed for 11 9’s of durability and is massively scalable. Intermediates stored on local disk or HDFS. (Diagram: Amazon EMR with EC2 instance memory, local disk, and HDFS, reading from and writing to Amazon S3)
  • 23. HBase on S3 for Scalable NoSQL
  • 24. S3 Tips: Partitions, Compression, and File Formats • Avoid key names in lexicographical order • Improve throughput and S3 list performance • Use hashing/random prefixes or reverse the date-time • Compress data set to minimize bandwidth from S3 to EC2 • Make sure you use splittable compression or have each file be the optimal size for parallelization on your cluster • Columnar file formats like Parquet can give increased performance on reads
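A minimal sketch of the "hashing/random prefixes or reverse the date-time" tip from the slide above. The function names, the 4-character prefix length, and the sample key names are assumptions made for illustration; they do not come from the deck.

```python
import hashlib


def hashed_key(natural_key: str, prefix_len: int = 4) -> str:
    """Prepend a short hash-derived prefix so keys stop sorting lexicographically."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{natural_key}"


def reversed_timestamp_key(timestamp: str, name: str) -> str:
    """Reverse a date-time string so the fastest-changing digits lead the key."""
    return f"{timestamp[::-1]}/{name}"


if __name__ == "__main__":
    # Hypothetical object names; sequential names get spread across prefixes.
    for key in ["logs/2017-09-13/part-0000.gz", "logs/2017-09-13/part-0001.gz"]:
        print(hashed_key(key))
    print(reversed_timestamp_key("20170913120000", "part-0000.gz"))
```

The hash prefix is deterministic, so a reader can recompute the full key from the natural name without keeping a lookup table.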
  • 26. Cost & Time (# CPUs vs. Time): Wall clock time: 1 hour | Wall clock time: 10 hours
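The Cost & Time slide can be read as simple arithmetic: if a job parallelizes well and you pay per instance-hour, a larger cluster for 1 hour costs about the same as a smaller one for 10 hours. The sketch below assumes perfect parallelism; the instance counts and the hourly rate are hypothetical values, not figures from the deck.

```python
def cluster_cost(instances: int, hours: float, hourly_rate: float) -> float:
    """Total cost when billing is per instance-hour."""
    return instances * hours * hourly_rate


RATE = 0.25  # hypothetical price per instance-hour, not an AWS quote

small_cluster = cluster_cost(instances=10, hours=10, hourly_rate=RATE)
large_cluster = cluster_cost(instances=100, hours=1, hourly_rate=RATE)

print(f"10 instances x 10 hours: ${small_cluster:.2f}")
print(f"100 instances x 1 hour:  ${large_cluster:.2f}")
```

Under these assumptions both runs cost the same, but the larger cluster returns results ten times sooner.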
  • 29. Why EMR? Flexibility. Compute (Machine Learning): C4 Family, C3 Family. Memory (Interactive Analysis): X1 Family, R3 Family. Storage (Large HDFS): D2 Family, I2 Family. General (Batch Process): M4 Family, M3 Family
  • 30. EMR Nodes - Customizable. EMR cluster: Master instance group, Core instance group (HDFS), Task instance group. Master Node must keep running. Core nodes can be added and removed gracefully. Cluster can tolerate loss of task nodes
  • 31. Performance Tuning - Speed and Cost. Considerations: • Transient or long running • Instance types • Cluster size • Application settings • File formats and S3 tuning. Example layout: Master Node: r3.2xlarge; Slave Group - Core: c4.2xlarge; Slave Group - Task: m4.2xlarge (EC2 Spot)
  • 33. Use Spot and Reserved Instances to Lower Cost. On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity; meet SLA at predictable cost. Spot for task nodes: up to 90% off EC2 on-demand pricing; exceed SLA at lower cost. Amazon EMR supports most EC2 instance types
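One way to express "On-Demand for core nodes, Spot for task nodes" is as the `InstanceGroups` list that the EMR `RunJobFlow` API accepts (for example through boto3's `run_job_flow`). This sketch only builds and prints the structure; it makes no AWS call. The instance types are the ones named on the performance tuning slide, while the helper name, the instance counts, and the bid price are hypothetical.

```python
import json


def instance_group(name, role, instance_type, count,
                   market="ON_DEMAND", bid_price=None):
    """Build one InstanceGroups entry for an EMR RunJobFlow request."""
    group = {
        "Name": name,
        "InstanceRole": role,        # MASTER, CORE, or TASK
        "InstanceType": instance_type,
        "InstanceCount": count,
        "Market": market,            # ON_DEMAND or SPOT
    }
    if market == "SPOT" and bid_price is not None:
        group["BidPrice"] = str(bid_price)  # the API expects a string
    return group


instance_groups = [
    instance_group("master", "MASTER", "r3.2xlarge", 1),
    instance_group("core", "CORE", "c4.2xlarge", 4),            # hypothetical count
    instance_group("task", "TASK", "m4.2xlarge", 8,             # hypothetical count
                   market="SPOT", bid_price="0.40"),            # hypothetical bid
]

print(json.dumps(instance_groups, indent=2))
```

Keeping core nodes On-Demand protects HDFS data, while the Spot task group can shrink or disappear without the cluster failing.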
  • 34. Instance Fleets for Advanced Spot Provisioning Master Node Core Instance Fleet Task Instance Fleet • Provision from a list of instance types with Spot and On-Demand • Launch in the most optimal Availability Zone based on capacity/price • Spot Block support
  • 35. What are Amazon EC2 Spot Instances?
  • 36. AWS EC2 Consumption Models. On-Demand: Pay for compute capacity by the hour with no long-term commitments. For spiky workloads, or to define needs. Reserved: Make a low, one-time payment and receive a significant discount on the hourly charge. For committed utilization. Spot Market: Bid for unused capacity, charged at a Spot Price which fluctuates based on supply and demand. For time-insensitive, transient, or stateless workloads
  • 37. Spare Capacity at Scale AWS has millions of active customers every month, including more than 2,300 government agencies, 7,000 education institutions and more than 22,000 nonprofit organizations that have used AWS in the last 12 months.
  • 38. What Are EC2 Spot Instances? EC2 Spot instances are spare EC2 On-Demand capacity with very simple rules…
  • 42. The Very Simple Rules of Spot Instances Run in markets where the price of compute changes based on supply and demand. You’ll never pay more than your bid. When the market exceeds your bid you get 2 minutes to wrap up your work.
  • 47. Get the Best Value for EC2 Capacity • Since Spot Instances typically cost 50-90% less than On-Demand, you can: • Increase your compute capacity by 2-10x within the same budget. • Save 50-90% on your existing workload. • Or both! • Either way, you should try it!
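The "50-90% less" and "2-10x capacity" figures on this slide are two views of the same arithmetic: at a fractional discount d, a fixed budget buys 1 / (1 - d) times as much capacity. A small sketch follows; the function name is made up for illustration.

```python
def capacity_multiplier(discount: float) -> float:
    """Capacity a fixed budget buys at a fractional discount, relative to full price."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 / (1.0 - discount)


for discount in (0.50, 0.90):
    print(f"{discount:.0%} off -> {capacity_multiplier(discount):.1f}x capacity")
```

So a 50% discount doubles the capacity a budget can buy, and a 90% discount gives roughly ten times the capacity.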
  • 48. Understanding EC2 Capacity (diagram: total capacity in N. California across AZ1 and AZ2, split into Shared and Dedicated, by instance family P2, C4, M4, I3, R4, D2 and by size x, 2x, 4x)
  • 54. Capacity and Spot Markets Recap (us-east-2, C4 family; Availability Zones 2a, 2b, 2c). 8XL: On-Demand $1.76, Spot $0.27 / $0.29 / $0.50. 4XL: On-Demand $0.88, Spot $0.30 / $0.16 / $0.21. 2XL: On-Demand $0.44, Spot $0.07 / $0.08 / $0.08. XL: On-Demand $0.22, Spot $0.05 / $0.04 / $0.04. L: On-Demand $0.11, Spot $0.01 / $0.04 / $0.01. • Each instance family • Each instance size • Each Availability Zone • In every region • Is a separate Spot Market
  • 58. Bid Price vs. Market Price (chart with 25%, 50%, and 75% bid lines): You pay the market price. Keep it simple and just bid 100% On-Demand price!
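To illustrate the rules on these slides (you pay the market price, never more than your bid, and you are interrupted when the market exceeds your bid), here is a toy simulation. The On-Demand price and the hourly Spot price series are invented for the example, and the model ignores details such as the 2-minute warning and partial-hour billing.

```python
def simulate(bid: float, market_prices: list) -> tuple:
    """Return (hours_run, total_cost) for a series of hourly market prices."""
    hours = 0
    cost = 0.0
    for price in market_prices:
        if price > bid:
            break          # market exceeded the bid: the instance is interrupted
        hours += 1
        cost += price      # billed at the market price, not at the bid
    return hours, cost


ON_DEMAND = 1.00                               # hypothetical On-Demand price
PRICES = [0.20, 0.22, 0.30, 0.60, 0.25, 0.90]  # hypothetical hourly Spot prices

for pct in (25, 50, 75, 100):
    hours, cost = simulate(ON_DEMAND * pct / 100, PRICES)
    print(f"bid {pct:>3}% of On-Demand: ran {hours} h, paid ${cost:.2f}")
```

With this price series, a higher bid changes how long the instance survives, not the price paid for each hour it runs, which is the reasoning behind bidding 100% of the On-Demand price.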
  • 59. EC2 Spot Instance Best Practices - Diversification • Multiple EC2 instance types selected • Multiple Availability Zones selected • Pick instance types with similar performance characteristics. For example: c3.large, m3.large, r3.large, c4.large, m4.large, r4.large…
  • 60. Amazon EC2 Spot Bid Advisor • We make this easy using the Spot bid advisor • With deliberate pool selection and bidding, you will keep your Spot instance as long as you need to
  • 63. EC2 Spot Advisor in Console (New!)
  • 66. FINRA: Migrating From On-Prem to AWS. Petabytes of data generated on-premises, brought to AWS, and stored in S3. Thousands of analytical queries performed on EMR and Amazon Redshift. Stringent security requirements met by leveraging VPC, VPN, encryption at-rest and in-transit, CloudTrail, and database auditing. (Diagram: Data Management: Data Movement, Data Registration, Version Management; Amazon S3; Surveillance Analytics; Flexible Interactive Queries; Predefined Queries; Web Applications; Analysts; Regulators)
  • 67. Lower Cost and Higher Scale Than On-Premises
  • 68. FINRA Saved 60% by Moving to HBase on EMR
  • 69. Walk through provisioning an EMR cluster using EMR instance fleets (Console and CLI)
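The walkthrough itself is not captured in this transcript. As a rough sketch of what an instance fleet request can look like, the code below builds the `Instances` portion of an EMR `RunJobFlow` request (as accepted by boto3's `run_job_flow`) using the `InstanceFleets` fields. It only constructs and prints the structure. The helper name, instance type lists, capacities, timeout, and subnet IDs are placeholders, not values from the presentation.

```python
import json


def instance_fleet(name, fleet_type, instance_types,
                   on_demand=0, spot=0, bid_pct=100.0, timeout_minutes=20):
    """Build one InstanceFleets entry with a diversified list of instance types."""
    fleet = {
        "Name": name,
        "InstanceFleetType": fleet_type,          # MASTER, CORE, or TASK
        "TargetOnDemandCapacity": on_demand,
        "TargetSpotCapacity": spot,
        "InstanceTypeConfigs": [
            {
                "InstanceType": itype,
                "WeightedCapacity": 1,
                # Bid expressed as a percentage of the On-Demand price.
                "BidPriceAsPercentageOfOnDemandPrice": bid_pct,
            }
            for itype in instance_types
        ],
    }
    if spot > 0:
        fleet["LaunchSpecifications"] = {
            "SpotSpecification": {
                "TimeoutDurationMinutes": timeout_minutes,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        }
    return fleet


instances = {
    # Several subnets let EMR pick an Availability Zone; IDs are placeholders.
    "Ec2SubnetIds": ["subnet-EXAMPLE1", "subnet-EXAMPLE2"],
    "InstanceFleets": [
        instance_fleet("master", "MASTER", ["m4.xlarge"], on_demand=1),
        instance_fleet("core", "CORE", ["c4.2xlarge", "m4.2xlarge"], on_demand=4),
        instance_fleet("task", "TASK",
                       ["c4.2xlarge", "m4.2xlarge", "r4.2xlarge"], spot=8),
    ],
}

print(json.dumps(instances, indent=2))
```

Listing several instance types of similar size in the task fleet follows the diversification advice from the Spot best practices slide, and the 100% bid follows the "keep it simple" advice.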
  • 70. What is AWS Glue?
  • 71. Fully Managed Data Catalog & ETL Service Integrates with AWS/Non-AWS Data Stores Scalable No Admin AWS Glue Learn more: https://aws.amazon.com/glue/
  • 72. AWS Glue: Fully Managed ETL Service. Glue automates data cataloging & preparation: • Catalogues data sources • Identifies data formats and data types • Generates Extract, Transform, Load code • Executes ETL jobs, managing dependencies
  • 74. Use an External Metastore: AWS Glue. Use the AWS Glue Data Catalog to store external table metadata for Hive and Spark (data in Amazon S3). Set metastore location in hive-site
  • 75. Walk through configuring Spark SQL to use the AWS Glue Data Catalog as its metastore (Console and CLI)
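The configuration shown in the walkthrough is not part of this transcript. As far as I know from the EMR documentation, pointing Spark SQL (and Hive) at the Glue Data Catalog is done by setting the metastore client factory class in the `spark-hive-site` and `hive-site` configuration classifications. The sketch below generates that configuration JSON; the function name is invented for illustration.

```python
import json

# Client factory class documented for using the Glue Data Catalog as a metastore.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations(include_hive: bool = True) -> list:
    """Build EMR configuration objects that point Spark SQL (and Hive) at Glue."""
    classifications = ["spark-hive-site"]
    if include_hive:
        classifications.append("hive-site")
    return [
        {
            "Classification": classification,
            "Properties": {"hive.metastore.client.factory.class": GLUE_FACTORY},
        }
        for classification in classifications
    ]


print(json.dumps(glue_catalog_configurations(), indent=2))
```

The printed JSON could be saved to a file and passed to `aws emr create-cluster` with `--configurations file://...`, or the list could be passed as `Configurations=` to boto3's `run_job_flow`.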
  • 76. Q & A
  • 79. Reference links EC2 Spot Documentation: http://aws.amazon.com/ec2/spot/ http://aws.amazon.com/ec2/spot/bid-advisor/ http://aws.amazon.com/ec2/spot/getting-started/ http://aws.amazon.com/ec2/spot/faqs/ http://aws.amazon.com/ec2/spot/testimonials/ User Guide http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet.html http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-fleet.html Helpful AWS Blog Posts https://aws.amazon.com/blogs/aws/focusing-on-spot-instances-lets-talk-about-best-practices/ https://aws.amazon.com/blogs/aws/building-price-aware-applications-using-ec2-spot-instances/ https://aws.amazon.com/blogs/compute/cost-effective-batch-processing-with-amazon-ec2-spot/ https://aws.amazon.com/blogs/compute/dynamic-scaling-with-ec2-spot-fleet/