With distributed frameworks like Hadoop and Kafka, it is essential to deploy the right environment to successfully support these workloads. Learn about the different block storage options from AWS, and walk through, with our experts, how to select the best option for your big data analytics workloads. We will demonstrate how to set up, select, and modify volume types to right-size your environment.
7. Choosing an EBS volume type
Small, random I/O (IOPS is more important)
• > 65,000 IOPS or < 1 ms latency → i2 instance store
• ≤ 65,000 IOPS, single-digit ms latency → gp2 (≤ 10k IOPS per volume) or io1 (> 10k IOPS per volume)
Large, sequential I/O (throughput is more important)
• Aggregate throughput ≤ 1,250 MB/s → st1 (500 MiB/s per volume) or sc1 (250 MiB/s per volume)
• Aggregate throughput > 1,250 MB/s → d2 instance store
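The decision flow above can be condensed into a small helper. This is only a sketch of the slide's decision tree, with 2017-era thresholds hard-coded; `choose_volume` is a hypothetical function name, not an AWS tool.

```python
def choose_volume(random_io: bool, iops: int = 0,
                  latency_ms: float = 5.0,
                  aggregate_mb_s: float = 0.0) -> str:
    """Suggest a storage option following the slide's decision tree."""
    if random_io:
        # Beyond the per-instance EBS IOPS limit, or sub-millisecond
        # latency, the slide points at SSD instance store (i2).
        if iops > 65_000 or latency_ms < 1:
            return "i2 instance store"
        return "gp2" if iops <= 10_000 else "io1"
    # Large, sequential I/O: split on aggregate throughput.
    if aggregate_mb_s > 1250:
        return "d2 instance store"
    return "st1"  # or sc1 when 250 MiB/s per volume is enough
```

For example, a random-I/O workload needing 20k IOPS at single-digit-millisecond latency lands on io1, while a 2 GB/s sequential scan lands on d2.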
8. AWS Storage Use Cases
Use cases: Big Data & Analytics; Data Warehouses; Search & Indexing; Transactional & NoSQL Databases; Streaming Data
Storage services: Amazon S3, Amazon EC2 instance store, Amazon EBS, Amazon EFS
11. Building a tiered storage model
Hadoop (Hive)
• A single Hive table with partitions in both HDFS and S3:
ALTER TABLE table ADD PARTITION (year='2016') LOCATION 's3://';
ALTER TABLE table ADD PARTITION (year='2017') LOCATION 'hdfs://';
Data warehouse (Vertica)
• Location with different types of volumes (EBS GP2 -> EBS ST1/SC1)
select alter_location_label('/home/dbadmin/SSD/tables', '', 'SSD');
select set_object_storage_policy('table', 'SSD');
• Use Amazon S3 as Virtual Tables
12. Building a tiered storage model
NoSQL (MongoDB)
• Consider replicas with different storage types
• EBS GP2 (primary), I2 (secondary, high I/O), R4 (secondary, in-memory)
Search (Splunk/ELK)
• Different EBS volume types for Hot (GP2), Warm (GP2/ST1), Cold (ST1/SC1), and Frozen (SC1) tiers
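As a sketch, the hot/warm/cold/frozen tiering above reduces to a lookup table. `ebs_type_for_tier` is a hypothetical helper, and where the slide lists two options per tier it picks one of them.

```python
def ebs_type_for_tier(tier: str) -> str:
    """Map a Splunk/ELK index tier to an EBS volume type.

    One option chosen per tier from the slide's choices:
    hot → GP2, warm → GP2/ST1, cold → ST1/SC1, frozen → SC1.
    """
    mapping = {
        "hot": "gp2",
        "warm": "st1",
        "cold": "sc1",
        "frozen": "sc1",
    }
    return mapping[tier.lower()]
```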
13. Customer Example: CrowdStrike
Cassandra on EBS vs. Ephemeral storage
“Amazon EBS offered the performance we
needed, at a third of the cost of the SSD-backed
instance storage.”
Goal: 1 million writes per second on 60 nodes with EBS
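For scale, the goal above works out to a substantial per-node write load:

```python
# Per-node write rate implied by 1 million writes/s across 60 nodes.
writes_per_sec = 1_000_000
nodes = 60
per_node = writes_per_sec // nodes
print(per_node)  # 16666 writes per second per node
```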
14. CrowdStrike, cont.
Used to believe they could never run Cassandra on EBS:
• Noisy Neighbor (jitter)
• Single point of failure in a region
• Too expensive
• Bad volumes (spin up ten, run tests, and pick the best one)
15. CrowdStrike, cont.
Pain points running on ephemeral storage:
• Fewer EC2 instance types offer local instance storage
• There is no data persistence if you stop/start the EC2 instance to resize
• I2s are expensive, especially when you need three of them per node for replication
• You can’t snapshot the data using EBS snapshots
• No equivalent of EBS volume monitoring
18. CrowdStrike, cont.
CrowdStrike today:
• In the past 12 months, zero Amazon EBS-related failures
• Thousands of GP2 data volumes (~2PB of data)
• Transitioning all systems to Amazon EBS root drives
• Moved all data stores to EBS
Benefits of EBS:
• Use EBS volume monitoring
• Schedule snapshots for consistent backups
• Stop/start and resize
• Half the cost (using reserved pricing comparison)
19. Hadoop
HDFS – Hadoop Distributed File System
• Replication done for both durability and performance
• Writes data using block sizes of 64/128/256 MB – sequential I/O only
HCFS – Hadoop Compatible File System
• EMRFS, S3A, S3N – map the Amazon S3 API to the Hadoop FileSystem API
• Athena – Presto talking to S3 through an HCFS implementation
20. Amazon S3 as HDFS
Advantages
• Scale out horizontally
• Storage decoupled from compute
• Backup and DR not required
• Transient clusters with much higher availability
Challenges
• Rename – cost scales with the size of the data (an S3 rename is a copy plus a delete)
• List – cost depends on prefix depth
• Security – IAM, which is not supported by the Hadoop security model
• Compatibility
21. When to use HDFS instead of Amazon S3?
Cluster type: transient vs. long-running
Customer preference, e.g. Cloudera/Hortonworks/MapR
Optimize for multiple data processing iterations and sequential access
• Amazon EC2 instance store – D2 (3+ GB/s)
• Amazon EBS ST1 volumes (per TB: 250 MB/s burst, 40 MB/s baseline)
• 2 TB volume sizes with a read-ahead of 1 MB
• Lower the replication factor to 2x instead of 3x
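The per-TB figures above make ST1 sizing a short calculation. This is a sketch using the slide's numbers (40 MB/s baseline and 250 MB/s burst per TB, capped at 500 MB/s per volume); `st1_volume_throughput_mb_s` is a hypothetical helper, and current AWS limits may differ.

```python
def st1_volume_throughput_mb_s(size_tb: float, burst: bool = False) -> float:
    """Per-volume ST1 throughput from the slide's per-TB figures,
    capped at the 500 MB/s per-volume maximum."""
    per_tb = 250 if burst else 40
    return min(size_tb * per_tb, 500)

# The 2 TB volume size recommended on the slide:
print(st1_volume_throughput_mb_s(2))              # 80.0 MB/s baseline
print(st1_volume_throughput_mb_s(2, burst=True))  # 500.0 MB/s burst
```

This also shows why the slide suggests 2 TB volumes: at that size a single volume already reaches the 500 MB/s per-volume burst cap.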
Amazon EFS?
• HDFS is itself a distributed file system
• There is no adapter to translate from Amazon EFS to HCFS
22. Customer workloads
NETFLIX
• Cloud data warehouse with Amazon S3
• EMR as the computing engine – Hive, Spark, Presto
• Orchestration and scheduling via Genie
PINTEREST
• Large EC2 cluster with EBS GP2
• Search and index using MapReduce
• Always-on, persistent cluster
23. New! EBS Elastic Volumes
Simple; Flexible; Non-disruptive; Automated
Modify the configuration of live volumes attached to instances
Dynamically increase size, tune performance, and change the type of existing and new current-generation volumes
No downtime, no performance impact.
You can automate changes using CloudWatch with Lambda or CloudFormation
No need to plan ahead: provision what you need today and change the configuration as business needs change.
24. What are the new AWS CLI commands?
aws ec2 modify-volume
aws ec2 describe-volumes-modifications
* Requires the latest AWS SDK/CLI
25. How does it work?
Three steps:
• Issue the modification command
• Monitor the progress of the modification
• If the size changed, extend the volume's file system
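The monitoring step above can be sketched as a small polling loop. This assumes the boto3 SDK's `EC2.Client.describe_volumes_modifications` response shape; the `describe` callable is injected so the loop works with any client (or a test double), and `wait_for_modification` is a hypothetical helper, not an AWS API.

```python
import time

def wait_for_modification(describe, volume_id: str, poll_seconds: int = 15) -> str:
    """Poll until a volume modification leaves the 'modifying' state.

    `describe` is any callable with the shape of boto3's
    EC2.Client.describe_volumes_modifications.
    Returns the final observed state, e.g. 'optimizing',
    'completed', or 'failed'.
    """
    while True:
        resp = describe(VolumeIds=[volume_id])
        state = resp["VolumesModifications"][0]["ModificationState"]
        if state != "modifying":
            return state
        time.sleep(poll_seconds)
```

With a real client the three steps would be: `client.modify_volume(VolumeId=vid, Size=200)`, then `wait_for_modification(client.describe_volumes_modifications, vid)`, then growing the file system on the instance (e.g. with resize2fs).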
26. Elastic Volumes – Limitations
Limit – You can modify a volume only once every 6 hours
Limit – Supported only for current-generation volumes (gp2/io1/st1/sc1); not supported for Magnetic (standard) volumes
Limit – Live changes are supported only for volumes attached to current-generation instances
27. Demo: Amazon EMR with HDFS on EBS
…and volume type/size changes!