How do you estimate the cost of a Hadoop infrastructure on Amazon AWS, given data volume estimates and a rough use case? This presentation compares the different options available on AWS.
3. High Level Requirements
• Build an Analytical & BI platform for web log analytics
• Ingest multiple data sources:
• Log data
• Internal user data
• Apply complex business rules
• Manage events, filter crawler-driven logs, apply industry- and domain-specific rules
• Populate/export to a BI tool for visualization.
4. Non-Functional Requirements
• Today’s baseline: ~42 TB per year (~3.5 TB raw data per month), stored for 3 years
• SLA: should process data every day (currently done once a month)
• Predefined processing via Hive; no exploratory analysis
• Everything in the cloud:
• Store (HDFS), Compute (M/R), Analysis (BI tool)
5. Non-Functional Requirements [2]
• Seeding data in S3 (3 years’ worth of data)
• Adding monthly net-new data only.
• Speed not of primary importance
6. Data Estimates for Capacity planning [2]
• Cleaned-up log data per year: 42 TB (3 years = 126 TB)
• Total disk space required should consider:
• Compression (LZO, ~40% savings) reduces disk space required to ~25 TB *
• Replication factor of 3: ~75 TB
• 75% maximum disk utilization in Hadoop: ~100 TB
• Total disk capacity required for DN: ~100 TB/year (~8.5 TB/mo)
(* disclaimer: depends on codec and data input)
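A minimal sketch of the sizing arithmetic above (the function name and defaults are illustrative, not from the deck):

```python
# Back-of-the-envelope HDFS sizing: compress, replicate, then leave
# headroom because Hadoop disks should not run full.
def disk_needed_tb(raw_tb, compression_savings=0.40,
                   replication=3, max_utilization=0.75):
    """Total data-node disk (TB) needed to hold raw_tb of input."""
    compressed = raw_tb * (1 - compression_savings)  # 42 TB -> ~25 TB
    replicated = compressed * replication            # ~25 TB -> ~75 TB
    return replicated / max_utilization              # ~75 TB -> ~100 TB

print(disk_needed_tb(42))    # ~100.8 TB/year
print(disk_needed_tb(3.5))   # ~8.4 TB/month
```

The same chain with a 70% utilization cap reproduces the per-period numbers in the table on the next slide.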
7. Data Estimates for Capacity planning: reduced logs

| Expected data volume | Log data volume (TB) | After compression (Gzip 40%) | Replication on 3 nodes | 70% disk utilization maximum (TB) |
|---|---|---|---|---|
| 1 month | 3.6 | 2.16 | 6.5 | 9.2 |
| 1 year | 42 | 25 | 75 | 107 |
| 3 years | 126 | 75.6 | 226 | 322 |

• Total disk capacity required for DN: ~10 TB/month
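A quick standalone check of the table (the deck's figures differ by a terabyte or two of rounding):

```python
# Recompute each table row: Gzip 40% savings, replication factor 3,
# and a 70% cap on disk utilization.
for label, raw_tb in [("1 month", 3.6), ("1 year", 42), ("3 years", 126)]:
    compressed = raw_tb * 0.60
    replicated = compressed * 3
    total = replicated / 0.70
    print(f"{label}: {compressed:.2f} -> {replicated:.1f} -> {total:.0f} TB")
```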
8. Cloud Solution Architecture

[Architecture diagram: webservers emit client logs; metadata extraction feeds Hadoop on Amazon AWS; a BI tool serves the user. Flow: 1. Copy data to S3; 2. Export data to HDFS; 3. Process in M/R (Hive tables); 4. Display in BI tool; 5. Retain results into S3.]
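Step 1 of the flow (copying logs into S3) could be as simple as the following sketch. It uses today's boto3 client, and the bucket and key names are invented for illustration:

```python
# Hypothetical step 1: ship a day's compressed web logs to S3.
# Assumes AWS credentials are already configured in the environment.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="/var/log/web/access-20130131.log.gz",
    Bucket="example-weblog-archive",           # placeholder bucket
    Key="raw/2013/01/access-20130131.log.gz",  # partitioned by date
)
```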
9. Hadoop on AWS: EC2
• Amazon Elastic Compute Cloud (EC2) is a web
service that provides resizable compute capacity
in the cloud.
• Manual setup of Hadoop on EC2
• Use EBS for storage capacity (HDFS)
• Storage on S3
10. Running Hadoop on AWS: EC2
• EC2 instances options
• Choose instance type
• Choose instance type availability
• Choose instance family
• Choose where the data resides:
• S3 – high latency, but highly available
• EBS
• Permanent storage?
• Snapshots to S3?
• Apache Whirr for cluster setup (an EC2 provisioning sketch follows below)
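Without Whirr, the underlying provisioning call for the EBS-backed option would look roughly like this sketch (modern boto3 API; AMI ID, instance type, and volume size are placeholders):

```python
# Hypothetical sketch: launch data nodes with an extra EBS volume each
# to hold HDFS blocks.
import boto3

ec2 = boto3.client("ec2")
ec2.run_instances(
    ImageId="ami-12345678",        # placeholder Hadoop-ready AMI
    InstanceType="m1.xlarge",
    MinCount=4, MaxCount=4,        # four data nodes
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sdf",
        "Ebs": {"VolumeSize": 1000,             # 1 TB per node for HDFS
                "DeleteOnTermination": False},  # survives instance stop
    }],
)
```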
11. Amazon EC2 – Instance features
• Other choices:
• EBS-optimized instances: dedicated throughput between Amazon EC2 and Amazon EBS, with options between 500 Mbps and 1,000 Mbps depending on the instance type used.
• Inter-region data transfer
• Dedicated instances: run on single-tenant hardware dedicated to a single customer.
• Spot instances: name your price
12. Amazon Instance Families
• Amazon EC2 instances are grouped into six families: general purpose, memory optimized, compute optimized, storage optimized, micro, and GPU.
• General-purpose instances have memory-to-CPU ratios suitable for most general-purpose apps.
• Memory-optimized instances offer larger memory sizes for high-throughput applications.
• Compute-optimized instances have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications.
• Storage-optimized instances are optimized for very high random I/O performance, very high storage density, low storage cost, and high sequential I/O performance (a natural fit for data nodes).
• Micro instances provide a small amount of CPU with the ability to burst to higher amounts for brief periods.
• GPU instances, for graphics and general-purpose GPU compute applications.
13. Amazon Instance types and availability
• On-Demand Instances – pay for compute capacity by the hour with no long-term commitments. This frees you from the costs and complexities of planning, purchasing, and maintaining capacity.
• Reserved Instances – make a one-time payment for each instance you want to reserve and in turn receive a discount on the hourly charge for that instance. Three Reserved Instance types (Light, Medium, and Heavy Utilization) let you balance the amount you pay upfront against your effective hourly price (see the sketch below).
• Spot Instances – bid on unused Amazon EC2 capacity and run those instances for as long as your bid exceeds the current Spot Price, which changes periodically based on supply and demand. If you have flexibility in when your applications can run, Spot Instances can significantly lower your Amazon EC2 costs.
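A sketch of the effective-hourly-price comparison; all dollar figures here are invented for illustration, not AWS rates:

```python
# Amortize a Reserved Instance's one-time payment over the hours used,
# then compare against the on-demand rate. Rates are placeholders.
def effective_hourly(upfront, hourly, hours_used):
    """Effective $/hour once the upfront payment is spread out."""
    return upfront / hours_used + hourly

term_hours = 3 * 8760                     # a 3-year term, running 24x7
on_demand = 0.48                          # hypothetical $/hr
reserved = effective_hourly(4000, 0.16, term_hours)
print(f"on-demand ${on_demand:.2f}/hr vs reserved ${reserved:.2f}/hr")
# Reserved wins for steady 24x7 load; light usage favors on-demand/spot.
```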
15. Amazon EC2 – Instance types

[Table assigning instance types to cluster roles: data nodes, BI instances, master nodes.]
16. Systems Architecture – EC2

[Architecture diagram: client logs land in S3; a Hadoop cluster (NN, SN, DNs, EN) on AWS keeps HDFS on EBS drives; permanent BI nodes sit alongside.]
• Hadoop cluster is initiated when analytics is run
• Data is streamed from S3 to EBS volumes
• Results from analytics are stored to S3 once computed
• BI nodes are permanent
17. Hadoop on AWS: EC2
• Probably not the best choice:
• EBS volumes make the solution costly
• If instead using instance storage, the EC2 instance choices are either too small (a few gigs) or too big (48 TB per instance)
• We don’t need the flexibility – we just want to use Hive
18. Hadoop on AWS: EMR
• Amazon Elastic MapReduce (EMR) is a web service that provides a hosted Hadoop framework running on Amazon EC2 and Amazon Simple Storage Service (S3).
19. Running Hadoop on AWS - EMR
• Elastic MapReduce
• For occasional jobs – ephemeral clusters (see the launch sketch below)
• Ease of use, but 20% costlier
• Data stored in S3 – highly tuned for S3 storage
• Hive and Pig available
• Only pay for S3 + instance time while jobs are running
• Or: leave it always on
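A hedged sketch of launching such an ephemeral cluster with a single Hive step. It uses today's boto3 API (the deck predates it); the cluster name, release label, and S3 paths are invented:

```python
# A transient EMR cluster that runs one Hive step and then terminates.
import boto3

emr = boto3.client("emr")
emr.run_job_flow(
    Name="monthly-weblog-analytics",
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 10,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step ends
    },
    Steps=[{
        "Name": "run-hive-rules",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", "s3://example-bucket/scripts/rules.hql"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```

With KeepJobFlowAliveWhenNoSteps set to False, you pay for instances only while the job runs, which is the "ephemeral cluster" pattern above.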
20. Hadoop on AWS - EMR
• EC2 instances with Amazon’s own flavor of Hadoop
• Amazon’s Apache Hadoop is version 1.0.3; you can also choose MapR M3 or M5 (0.20.205)
• You can run Hive (0.7.1 or 0.8.1), custom JARs, streaming, Pig, or HBase
21. Systems Architecture – EMR

[Architecture diagram: client logs land in S3; an EMR-managed Hadoop cluster (NN, SNN, DNs) builds HDFS from S3; permanent BI instances sit alongside.]
• Hadoop cluster created elastically
• Data is streamed from S3 to initiate the Hadoop cluster dynamically
• Results from analytics are stored to S3 once computed
• BI nodes are permanent
23. AWS calculator – EMR calculation
• Calculate and add (a cost-model sketch follows below):
• S3 cost (seeded data)
• Incremental S3 cost, per month
• EC2 cost
• EMR cost
• Data transfer in/out cost
• Amazon support cost
• Infrastructure support engineer cost
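One way to fold those line items into a single monthly figure; every rate below is a placeholder, not a quoted AWS price:

```python
# Hedged monthly cost model mirroring the checklist above.
def monthly_cost(s3_tb, ec2_hours, n_nodes,
                 s3_per_tb=95.0,          # $/TB-month, illustrative
                 ec2_rate=0.48,           # $/instance-hour, illustrative
                 emr_rate=0.12,           # EMR surcharge per instance-hour
                 transfer=200.0,          # data in/out, flat guess
                 support=800.0 / 12,      # Amazon support
                 engineer=150_000 / 12):  # infrastructure engineer
    compute = ec2_hours * n_nodes * (ec2_rate + emr_rate)
    return s3_tb * s3_per_tb + compute + transfer + support + engineer

# 42 TB seeded in S3, 10 nodes running 24x7:
print(f"${monthly_cost(42, 24 * 30, 10):,.0f}/month")
```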
24. AWS calculator – EMR calculation
• EMR cost, assuming 24 hrs/day [calculator screenshot]

25. AWS calculator – EMR calculation
• 3-year S3 cost, assuming 24 hrs/day [calculator screenshot]

26. AWS calculator – EMR calculation
• 3-year EC2 cost, assuming 24 hrs/day [calculator screenshot]
27. Amazon EMR Pricing – Reduced log volume
| Data volume | Instance types | Price/year, running 24 hours/day | Price/year, running 8 hours/day | Price/year, running 8 hours/week |
|---|---|---|---|---|
| 1 year, storing 42 TB on S3 | 10 instances. Data nodes: m1.xlarge; NN: m2.2xlarge; BI: m2.2xlarge; load balancer: t1.micro; 1-year reserved | $14.1k/mo * 12 = $169.2k | $8.9k * 12 = $106k | $6.6k * 12 = $79.2k |
| 3 years, storing 126 TB on S3 | 10 EMR instances (subject to change depending on actual load) | $19.5k * 36 mos = $684k | $15.5k * 36 mos = $558k | $13.2k * 36 mos = $475k |
28. Hadoop on AWS: trade-offs
| Feature | EC2 | EMR |
|---|---|---|
| Ease of use | Hard – IT Ops costs | Easy; Hadoop clusters can be of any size; can have multiple clusters |
| Cost | Cheaper | Costlier: pay for EC2 + EMR |
| Flexibility | Better: access to the full Hadoop ecosystem stack | On-demand Hadoop cluster: easy to use, Hadoop pre-installed, but with limited options |
| Portability | Easier to move to dedicated hardware | |
| Speed | Faster | Lower performance: all data is streamed from S3 for each job |
| Maintainability | Can choose any vendor; can be updated to the latest version | Debugging is tricky: cluster terminated, no logs |
29. EC2 Pricing Gotchas
• EMR with Spot instances seems to be the trend for minimal cost, if SLA timeliness is not of primary importance.
• Use Reserved instances to bring cost down drastically (~60%).
• Compression on S3?
• Need to account for a secondary NN?
• AWS’s AMI task configuration helps estimate how many EMR nodes are needed.
30. EMR Technical Gotchas
• Transferring data between S3 and EMR clusters is very fast (and free), as long as your S3 bucket and Hadoop cluster are in the same Amazon region
• EMR’s S3 file system streams data directly to S3 instead of buffering to intermediate local files
• EMR’s S3 file system adds multipart upload, which splits your writes into smaller chunks and uploads them in parallel (see the sketch below)
• Store fewer, larger files instead of many smaller ones
Source: http://blog.mortardata.com/post/58920122308/s3-hadoop-performance
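A client-side analogue of that multipart behavior, sketched with boto3's managed transfer; the threshold, chunk size, and object names are illustrative:

```python
# Managed transfer: boto3 splits large objects into parts and uploads
# them concurrently, much like EMR's parallel multipart writes.
import boto3
from boto3.s3.transfer import TransferConfig

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,  # 64 MB parts
    max_concurrency=8,                     # parallel part uploads
)

boto3.client("s3").upload_file(
    "part-00000.gz", "example-weblog-archive", "cleaned/part-00000.gz",
    Config=config,
)
```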
31. In-house Hadoop cluster

| Data volume | Storage for data nodes | Instances | Price, first year |
|---|---|---|---|
| 126 TB | 6 * 12 x 2 TB | 10 data nodes, 3 master | $10.6k * 10 DN + $7.3k * 3 = $128k |
| | | BI: 4 nodes | $43k |

Data-node hardware (Dell PowerEdge R720): E5-2640 processor at 2.50 GHz, 8 cores, 12M cache, Turbo; 64 GB memory, quad-ranked RDIMM for 2 processors, low volt; 12 x 2 TB 7.2K RPM SATA 3.5in hot-plug hard drives; Intel 82599 dual-port 10GbE mezzanine card.

First-year total: $128k + vendor support ($50k) + full-time person ($150k) = $328k
33. Hadoop Distributions
• Cloudera or Hortonworks
• Enterprise 24x7 production support: phone and support portal access (support datasheet attached)
• Minimum $50k
34. Amazon Support – EC2 & EMR

| | Business | Enterprise |
|---|---|---|
| Response time | 1 hour | 15 minutes |
| Access | Phone, chat, and email, 24/7 | Phone, chat, TAM, and email, 24/7 |
| Costs | Greater of $100 or: 10% of monthly AWS usage for the first $0–$10K; 7% from $10K–$80K; 5% from $80K–$250K; 3% from $250K+ (about $800/yr) | Greater of $15,000 or: 10% of monthly AWS usage for the first $0–$150K; 7% from $150K–$500K; 5% from $500K–$1M; 3% from $1M+ |

http://aws.amazon.com/premiumsupport/
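The tiered percentages reduce to a small marginal-rate calculation; here is a sketch for the Business plan (the usage figure is just an example):

```python
# Tiered support pricing from the table above, Business plan.
def business_support_cost(monthly_usage):
    """Greater of $100 or the tiered percentage of monthly AWS usage."""
    tiers = [(10_000, 0.10), (80_000, 0.07), (250_000, 0.05),
             (float("inf"), 0.03)]
    cost, prev = 0.0, 0.0
    for cap, rate in tiers:
        if monthly_usage > prev:
            cost += (min(monthly_usage, cap) - prev) * rate
        prev = cap
    return max(100.0, cost)

print(business_support_cost(14_100))  # ~$1,287/mo on a $14.1k/mo bill
```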
On-demand instances are the most flexible, but also the most expensive. With spot instances, you specify the maximum price you'll pay for an instance, and if there is spare capacity, you get that instance; if you're outbid, your instance can be terminated. This means that if you have large jobs that don't need to complete by any specific time, you can use spot instances to run them when it's most economical.
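The spot workflow described above, sketched with today's boto3 API (the deck predates it); the AMI ID and price ceiling are placeholders:

```python
# Request spot capacity with a maximum price; instances may be
# reclaimed if the spot price rises above it.
import boto3

ec2 = boto3.client("ec2")
ec2.request_spot_instances(
    SpotPrice="0.10",          # max $/hour we are willing to pay
    InstanceCount=10,
    LaunchSpecification={
        "ImageId": "ami-12345678",  # placeholder AMI
        "InstanceType": "m1.xlarge",
    },
)
# Only fault-tolerant, deadline-flexible jobs belong on spot capacity.
```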