With distributed frameworks like Hadoop and Kafka, it is essential to deploy the right environment to successfully support these workloads. Learn about the different block storage options from AWS, and walk through, with our experts, how to select the best option for your big data analytics workloads. We will demonstrate how to set up, select, and modify volume types to right-size your environment.
7. Choosing an EBS volume type
Small, random I/O (IOPS is more important)
• > 65,000 IOPS or < 1 ms latency → i2 instance store
• ≤ 65,000 IOPS, single-digit ms latency → gp2 (≤ 10k IOPS per volume) or io1 (> 10k IOPS per volume)
Large, sequential I/O (throughput is more important)
• Aggregate throughput ≤ 1,250 MB/s → st1 (500 MiB/s per volume) or sc1 (250 MiB/s per volume)
• Aggregate throughput > 1,250 MB/s → d2 instance store
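The decision flow above can be condensed into a small helper. This is only a sketch of the slide's decision tree, with 2017-era thresholds hard-coded; `choose_volume` is a hypothetical function name, not an AWS tool.

```python
def choose_volume(random_io: bool, iops: int = 0,
                  latency_ms: float = 5.0,
                  aggregate_mb_s: float = 0.0) -> str:
    """Suggest a storage option following the slide's decision tree."""
    if random_io:
        # Beyond the per-instance EBS IOPS limit, or sub-millisecond
        # latency, the slide points at SSD instance store (i2).
        if iops > 65_000 or latency_ms < 1:
            return "i2 instance store"
        return "gp2" if iops <= 10_000 else "io1"
    # Large, sequential I/O: split on aggregate throughput.
    if aggregate_mb_s > 1250:
        return "d2 instance store"
    return "st1"  # or sc1 when 250 MiB/s per volume is enough
```

For example, a random-I/O workload needing 20k IOPS at single-digit-millisecond latency lands on io1, while a 2 GB/s sequential scan lands on d2.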
8. AWS Storage Use Cases
Use cases: Big Data & Analytics; Data Warehouses; Search & Indexing; Transactional & NoSQL Databases; Streaming Data
Storage services: Amazon S3, Amazon EC2 instance store, Amazon EBS, Amazon EFS
11. Building a tiered storage model
Hadoop (Hive)
• A single Hive table with partitions in both HDFS and S3:
ALTER TABLE table ADD PARTITION (year='2016') LOCATION 's3://';
ALTER TABLE table ADD PARTITION (year='2017') LOCATION 'hdfs://';
Data warehouse (Vertica)
• Location with different types of volumes (EBS GP2 -> EBS ST1/SC1)
select alter_location_label('/home/dbadmin/SSD/tables', '', 'SSD');
select set_object_storage_policy('table', 'SSD');
• Use Amazon S3 as Virtual Tables
12. Building a tiered storage model
NoSQL (MongoDB)
• Consider replicas with different storage types
• EBS GP2 (primary), I2 (secondary, high I/O), R4 (secondary, in-memory)
Search (Splunk/ELK)
• Different EBS volume types for Hot (GP2), Warm (GP2/ST1), Cold (ST1/SC1), and Frozen (SC1) tiers
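As a sketch, the hot/warm/cold/frozen tiering above reduces to a lookup table. `ebs_type_for_tier` is a hypothetical helper, and where the slide lists two options per tier it picks one of them.

```python
def ebs_type_for_tier(tier: str) -> str:
    """Map a Splunk/ELK index tier to an EBS volume type.

    One option chosen per tier from the slide's choices:
    hot → GP2, warm → GP2/ST1, cold → ST1/SC1, frozen → SC1.
    """
    mapping = {
        "hot": "gp2",
        "warm": "st1",
        "cold": "sc1",
        "frozen": "sc1",
    }
    return mapping[tier.lower()]
```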
13. Customer Example: CrowdStrike
Cassandra on EBS vs. Ephemeral storage
“Amazon EBS offered the performance we
needed, at a third of the cost of the SSD-backed
instance storage.”
Goal: 1 million writes per second on 60 nodes with EBS
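For scale, the goal above works out to a substantial per-node write load:

```python
# Per-node write rate implied by 1 million writes/s across 60 nodes.
writes_per_sec = 1_000_000
nodes = 60
per_node = writes_per_sec // nodes
print(per_node)  # 16666 writes per second per node
```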
14. CrowdStrike, cont.
Used to believe they could never run Cassandra on EBS:
• Noisy Neighbor (jitter)
• Single point of failure in a region
• Too expensive
• Bad volumes (spin up ten, run tests, and pick the best one)
15. CrowdStrike, cont.
Pain points running on ephemeral storage:
• Fewer EC2 instance types offer local instance storage
• There is no data persistence if you stop/start the EC2 instance to resize
• I2s are expensive, especially when you need three of them per node for replication
• You can’t snapshot the data using EBS snapshots
• No equivalent of EBS volume monitoring
18. CrowdStrike, cont.
CrowdStrike today:
• In the past 12 months, zero Amazon EBS-related failures
• Thousands of GP2 data volumes (~2PB of data)
• Transitioning all systems to Amazon EBS root drives
• Moved all data stores to EBS
Benefits of EBS:
• Use EBS volume monitoring
• Schedule snapshots for consistent backups
• Stop/start and resize
• Half the cost (using reserved pricing comparison)
19. Hadoop
HDFS – Hadoop Distributed File System
• Replication done for both durability and performance
• Writes data using block sizes of 64/128/256 MB – sequential I/O only
HCFS – Hadoop Compatible File System
• EMRFS, S3A, S3N – map the Amazon S3 API to the Hadoop FileSystem API
• Athena – Presto talking to S3 through an HCFS implementation
20. Amazon S3 as HDFS
Advantages
• Scale out horizontally
• Storage decoupled from compute
• Backup and DR not required
• Transient clusters with much higher availability
Challenges
• Rename – cost scales with the size of the data (an S3 rename is a copy plus a delete)
• List – cost depends on prefix depth
• Security – IAM, which is not supported by the Hadoop security model
• Compatibility
21. When to use HDFS instead of Amazon S3?
Cluster type: transient vs. long-running
Customer preference, e.g. Cloudera/Hortonworks/MapR
Optimize for multiple data processing iterations and sequential access
• Amazon EC2 instance store – D2 (3+ GB/s)
• Amazon EBS ST1 volumes (per TB: 250 MB/s burst, 40 MB/s baseline)
• 2 TB volume sizes with a read-ahead of 1 MB
• Lower the replication factor to 2x instead of 3x
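The per-TB figures above make ST1 sizing a short calculation. This is a sketch using the slide's numbers (40 MB/s baseline and 250 MB/s burst per TB, capped at 500 MB/s per volume); `st1_volume_throughput_mb_s` is a hypothetical helper, and current AWS limits may differ.

```python
def st1_volume_throughput_mb_s(size_tb: float, burst: bool = False) -> float:
    """Per-volume ST1 throughput from the slide's per-TB figures,
    capped at the 500 MB/s per-volume maximum."""
    per_tb = 250 if burst else 40
    return min(size_tb * per_tb, 500)

# The 2 TB volume size recommended on the slide:
print(st1_volume_throughput_mb_s(2))              # 80.0 MB/s baseline
print(st1_volume_throughput_mb_s(2, burst=True))  # 500.0 MB/s burst
```

This also shows why the slide suggests 2 TB volumes: at that size a single volume already reaches the 500 MB/s per-volume burst cap.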
Amazon EFS?
• HDFS is itself a distributed file system
• There is no adapter to translate from Amazon EFS to HCFS
22. Customer workloads
NETFLIX
• Cloud data warehouse with Amazon S3
• EMR as the computing engine – Hive, Spark, Presto
• Orchestration and scheduling via Genie
PINTEREST
• Large EC2 cluster with EBS GP2
• Search and index using MapReduce
• Always-on, persistent cluster
23. New! EBS Elastic Volumes
Simple; Flexible; Non-disruptive; Automated
Modify the configuration of live volumes attached to instances
Dynamically increase size, tune performance, and change the type of existing and new current-generation volumes
No downtime, no performance impact.
You can automate changes using CloudWatch with Lambda or CloudFormation
No need to plan ahead: provision what you need today and change the configuration as business needs change.
24. What are the new AWS CLI commands?
aws ec2 modify-volume
aws ec2 describe-volumes-modifications
* Requires the latest AWS SDK/CLI
25. How does it work?
Three steps:
• Issue the modification command
• Monitor the progress of the modification
• If the size changed, extend the volume's file system
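The monitoring step above can be sketched as a small polling loop. This assumes the boto3 SDK's `EC2.Client.describe_volumes_modifications` response shape; the `describe` callable is injected so the loop works with any client (or a test double), and `wait_for_modification` is a hypothetical helper, not an AWS API.

```python
import time

def wait_for_modification(describe, volume_id: str, poll_seconds: int = 15) -> str:
    """Poll until a volume modification leaves the 'modifying' state.

    `describe` is any callable with the shape of boto3's
    EC2.Client.describe_volumes_modifications.
    Returns the final observed state, e.g. 'optimizing',
    'completed', or 'failed'.
    """
    while True:
        resp = describe(VolumeIds=[volume_id])
        state = resp["VolumesModifications"][0]["ModificationState"]
        if state != "modifying":
            return state
        time.sleep(poll_seconds)
```

With a real client the three steps would be: `client.modify_volume(VolumeId=vid, Size=200)`, then `wait_for_modification(client.describe_volumes_modifications, vid)`, then growing the file system on the instance (e.g. with resize2fs).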
26. Elastic Volumes – Limitations
Limit – You can modify a volume only once every 6 hours
Limit – Supported only for current-generation volumes (gp2/io1/st1/sc1); not supported for Magnetic (standard) volumes
Limit – Live changes are supported only for volumes attached to current-generation instances
27. Demo: Amazon EMR with HDFS on EBS
…and volume type/size changes!