Matt Yanchyshyn, Sr. Manager Solutions Architecture
March 1st, 2017
AWS NYC Loft Storage Day
Optimizing Storage for Big Data/Analytics Workloads
Big Data/Analytics Workloads
• Hadoop etc.: Amazon EMR, Cloudera, Hortonworks, MapR
• Data Warehouse: Amazon Redshift, Vertica, Teradata
• Streaming: Amazon Kinesis, Kafka, Tibco EMS
• NoSQL: Amazon DynamoDB, MongoDB, Cassandra
• Search: Amazon Elasticsearch, Splunk, Elasticsearch, Solr
AWS Storage
• Amazon S3: Multi-tenant, Key-store, Native API
• Amazon EBS: Single-tenant, Block-store
• Amazon EFS: Shared/Distributed, POSIX, NFS/SMB (CIFS)
A little EBS history…
• 2006 – EC2 launched with instance storage
• 2008 – EBS (Elastic Block Store) launched on magnetic storage
• 2012 – EBS Provisioned IOPS and EBS-Optimized instances
• 2014 – SSD-backed General Purpose storage
• 2014 – EBS data volume encryption
• 2015 – Larger and faster EBS volumes
• 2015 – EBS boot volume encryption
• 2016 – EBS Throughput Optimized HDD (st1) and Cold HDD (sc1) volume types
• 2017 – EBS Elastic Volumes!
AWS block storage offerings
• EC2 instance store
• EBS SSD-backed volumes: gp2, io1
• EBS HDD-backed volumes: st1, sc1
EBS volume types
• SSD: General Purpose SSD (gp2), Provisioned IOPS SSD (io1)
• HDD: Throughput Optimized HDD (st1), Cold HDD (sc1)
Choosing an EBS volume type
• IOPS is more important (small, random I/O)
  – > 65,000 IOPS: i2
  – ≤ 65,000 IOPS: Latency?
    · < 1 ms: i2
    · Single-digit ms: IOPS per volume? ≤ 10k IOPS: gp2; > 10k IOPS: io1
• Throughput is more important (large, sequential I/O)
  – Aggregate throughput > 1,250 MB/s: d2
  – Aggregate throughput ≤ 1,250 MB/s: Throughput per volume? 250 MiB/s: sc1; 500 MiB/s: st1
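The decision tree on this slide can be sketched as a small function. This is only one reading of the flattened diagram: the thresholds (65,000 IOPS, 1,250 MB/s, 10k IOPS, 250/500 MiB/s) are the ones printed on the slide, while the function name, parameter names, and the order of the checks are assumptions made for illustration.

```python
def choose_block_storage(priority, iops=0, needs_sub_ms_latency=False,
                         aggregate_mb_s=0, per_volume_mib_s=0):
    """Suggest a storage option following one reading of the slide's tree.

    priority: "iops" (small, random I/O) or "throughput" (large, sequential I/O).
    All thresholds are the 2017 figures from the slide, not current limits.
    """
    if priority == "iops":
        # Above the aggregate IOPS figure, or when sub-millisecond
        # latency is required, the slide points at I2 instance store.
        if iops > 65_000 or needs_sub_ms_latency:
            return "i2"
        # Otherwise choose between the two SSD-backed EBS types.
        return "gp2" if iops <= 10_000 else "io1"
    if priority == "throughput":
        # Above the aggregate throughput figure, the slide points at D2.
        if aggregate_mb_s > 1_250:
            return "d2"
        # Otherwise choose between the two HDD-backed EBS types.
        return "sc1" if per_volume_mib_s <= 250 else "st1"
    raise ValueError("priority must be 'iops' or 'throughput'")


if __name__ == "__main__":
    print(choose_block_storage("iops", iops=5_000))
    print(choose_block_storage("throughput", aggregate_mb_s=800,
                               per_volume_mib_s=400))
```

Treat the result as a starting point for testing rather than a sizing rule; cost is not modelled here at all.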
AWS Storage Use Cases
• Big Data & Analytics
• Data Warehouses
• Search & Indexing
• Transactional & NoSQL Databases
• Streaming data
Storage options: Amazon S3, Amazon EC2 Instance store, Amazon EBS, Amazon EFS
Analytics on AWS
• Compute Only (Spark, Presto): Amazon S3
• Coupled Storage (HDFS+SOLR): Amazon EC2 Instance (D2), Amazon EBS ST1 Volumes
• Shared Filesystem (SAS Grid): Amazon EFS
Everything else…
[Chart: NoSQL (*), DW-Row (Teradata), DW-Columnar (Vertica), Search (Splunk), and Streaming (Kafka) workloads placed by IOPS and size across EBS volumes (ST1, SC1, GP2, PIOPS) and instance store (D2, I2)]
Building a tiered storage model
Hadoop (Hive)
• A single Hive Table with partitions in HDFS and S3
ALTER TABLE table ADD PARTITION (year='2016') location 's3://'
ALTER TABLE table ADD PARTITION (year='2017') location 'hdfs://'
Data warehouse (Vertica)
• Location with different types of volumes (EBS GP2 -> EBS ST1/SC1)
select alter_location_label('/home/dbadmin/SSD/tables', '', 'SSD');
select set_object_storage_policy('table', 'SSD');
• Use Amazon S3 as Virtual Tables
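As a rough illustration of the Hive approach above, the per-partition statements could be generated from a list of years, with recent partitions placed on HDFS and older ones on S3. This is a sketch only: the table name, the base paths, and the `year=` directory layout are hypothetical and do not come from the slides.

```python
def partition_ddl(table, year, hot_years, hdfs_base, s3_base):
    """Build an ALTER TABLE statement that places one partition on a tier.

    Partitions whose year is in hot_years go to HDFS; the rest go to S3.
    The base paths and the year=<n> layout are illustrative assumptions.
    """
    base = hdfs_base if year in hot_years else s3_base
    return (
        f"ALTER TABLE {table} ADD PARTITION (year='{year}') "
        f"LOCATION '{base}/year={year}';"
    )


if __name__ == "__main__":
    # Hypothetical table and locations, for illustration only.
    for y in (2016, 2017):
        print(partition_ddl("events", y, hot_years={2017},
                            hdfs_base="hdfs:///warehouse/events",
                            s3_base="s3://example-bucket/events"))
```

Queries against the table stay the same; only the location of each partition differs.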
Building a tiered storage model, cont.
NoSQL (MongoDB)
• Consider replicas with different storage types
• EBS GP2 (Primary), I2 (Secondary – High-IO), R4 (Secondary – In-Memory)
Search (Splunk/ELK)
• Different types of EBS volumes for Hot (GP2), Warm (GP2/ST1), Cold (ST1/SC1) and Frozen (SC1)
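The Hot/Warm/Cold/Frozen mapping for search indexes can be written down as a lookup by data age. The tier names and volume types below are the ones on the slide; the age cut-offs (7, 30, and 365 days) are placeholders chosen for illustration and would need to come from your own retention policy.

```python
# (maximum age in days, tier name, candidate EBS volume types)
# The age cut-offs are illustrative assumptions, not from the slides.
TIERS = [
    (7, "Hot", ["gp2"]),
    (30, "Warm", ["gp2", "st1"]),
    (365, "Cold", ["st1", "sc1"]),
]


def tier_for_age(age_days):
    """Return (tier name, candidate volume types) for data of a given age."""
    for max_age, name, volume_types in TIERS:
        if age_days <= max_age:
            return name, volume_types
    # Anything older than the last cut-off is treated as Frozen.
    return "Frozen", ["sc1"]


if __name__ == "__main__":
    for age in (1, 14, 90, 1000):
        print(age, tier_for_age(age))
```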
Customer Example: Crowdstrike
Cassandra on EBS vs. Ephemeral storage
“Amazon EBS offered the performance we
needed, at a third of the cost of the SSD-backed
instance storage.”
Goal: 1 million writes per second on 60 nodes with
EBS
Crowdstrike, cont.
Used to believe they could never run Cassandra on
EBS:
• Noisy Neighbor (jitter)
• Single point of failure in a region
• Too expensive
• Bad Volumes (spin up ten, run tests, and pick
the best one)
Crowdstrike, cont.
Pain points running on ephemeral storage
• Fewer EC2 instances are offering local instance storage
• There is no data persistence if you stop/start the EC2 instance to resize
• I2s are expensive, especially when you need three of them per node for the replication
• You can’t snapshot the data using EBS snapshots
• No EBS volume monitoring
Crowdstrike, cont.
Number of Machines for 1 PB
• I2.2xlarge: 1700
• c4.2xlarge + EBS: 400
Crowdstrike, cont.
Yearly Cost for 1 PB Cluster (millions)
• I2.2xlarge on-demand: 14.5
• I2.2xlarge reserved: 8
• C4.2xlarge (EBS) on-demand: 4.2
• C4.2xlarge (EBS) reserved: 3.5
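The two charts can be cross-checked with a little arithmetic. The figures below are read off the chart labels (1700 vs. 400 machines; 14.5, 8, 4.2, and 3.5 million per year); the pairing of values to bars assumes the labels appear in the same order as the values, and the dictionary keys are illustrative names.

```python
# Values read from the two Crowdstrike charts (costs in millions per year).
machines = {"i2.2xlarge": 1700, "c4.2xlarge + EBS": 400}
yearly_cost = {
    "i2 on-demand": 14.5,
    "i2 reserved": 8.0,
    "c4 + EBS on-demand": 4.2,
    "c4 + EBS reserved": 3.5,
}

# How many times more machines the I2 design needs for 1 PB.
machine_ratio = machines["i2.2xlarge"] / machines["c4.2xlarge + EBS"]

# EBS-based cost as a fraction of the I2-based cost, per pricing model.
on_demand_fraction = yearly_cost["c4 + EBS on-demand"] / yearly_cost["i2 on-demand"]
reserved_fraction = yearly_cost["c4 + EBS reserved"] / yearly_cost["i2 reserved"]

if __name__ == "__main__":
    print(f"machine ratio: {machine_ratio:.2f}x")
    print(f"on-demand cost fraction: {on_demand_fraction:.3f}")
    print(f"reserved cost fraction: {reserved_fraction:.3f}")
```

Under that reading, the on-demand fraction is a little under 0.3, in line with the "a third of the cost" quote, and the reserved fraction is a little under 0.45.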
Crowdstrike, Cont.
Crowdstrike Today:
• In the past 12 months, zero Amazon EBS-related failures
• Thousands of GP2 data volumes (~2PB of data)
• Transitioning all systems to Amazon EBS root drives
• Moved all data stores to EBS
Benefits of EBS:
• Use EBS volume monitoring
• Schedule snapshots for consistent backups
• Stop/start and resize
• Half the cost (using reserved pricing comparison)
Hadoop
HDFS – Hadoop Distributed File System
• Replication done for both durability and performance
• Write data using block-sizes of 64/128/256 MB - Sequential IO only
HCFS – Hadoop Compatible File System
• EMRFS, S3A, S3N – Map AWS APIs to Hadoop APIs
• Athena – Presto talking to S3 using an HCFS implementation
Amazon S3 as HDFS
Advantages
• Scale out horizontally
• Storage decoupled from compute
• Backup and DR not required
• Transient clusters with much higher availability
Challenges
• Rename – Cost based on the size of data
• List – Cost based on prefix depth
• Security – IAM; not supported by the Hadoop security model
• Compatibility
When to use HDFS without Amazon S3?
Cluster type: transient vs. long-running
Customer preference, e.g. Cloudera/Hortonworks/MapR
Optimize for multiple data processing iterations and sequential access
• Amazon EC2 Instance Store – D2 (3+ GB/s)
• Amazon EBS ST1 Volumes (per TB - 250 MBps Burst & 40 MBps Baseline)
• 2TB Volume sizes with Read-ahead of 1MB
• Lower the replication factor to 2X instead of 3X
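The ST1 and replication numbers above lend themselves to a quick calculation. The per-TB rates (40 MBps baseline, 250 MBps burst) are from the slide, and the 500 per-volume ceiling is the st1 figure from the volume-selection slide; the function names are hypothetical, and the sketch glosses over the MB/MiB distinction.

```python
ST1_BASELINE_PER_TB = 40   # MBps per TB, from the slide
ST1_BURST_PER_TB = 250     # MBps per TB, from the slide
ST1_VOLUME_CEILING = 500   # per-volume maximum quoted earlier in the deck


def st1_throughput(size_tb):
    """Return (baseline, burst) throughput for one st1 volume of size_tb."""
    baseline = min(ST1_BASELINE_PER_TB * size_tb, ST1_VOLUME_CEILING)
    burst = min(ST1_BURST_PER_TB * size_tb, ST1_VOLUME_CEILING)
    return baseline, burst


def raw_capacity_tb(data_tb, replication_factor):
    """Raw HDFS capacity needed to hold data_tb at a replication factor."""
    return data_tb * replication_factor


if __name__ == "__main__":
    print(st1_throughput(2))          # the 2 TB volume size from the slide
    print(raw_capacity_tb(100, 3))    # hypothetical 100 TB data set at 3X
    print(raw_capacity_tb(100, 2))    # the same data set at 2X
```

A 2 TB volume reaches the per-volume burst ceiling, which is one plausible reason the slide singles out that size.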
Amazon EFS?
• HDFS is itself a distributed filesystem
• There is no adapter to translate from Amazon EFS to HCFS
Customer workloads
NETFLIX
• Cloud Data Warehouse with Amazon S3
• EMR as the computing engine – Hive, Spark, Presto
• Orchestration and scheduling via Genie
PINTEREST
• Large EC2 cluster with EBS GP2
• Search and Index using MapReduce
• Always-on, persistent cluster
New! EBS Elastic Volumes
Simple; Flexible; Non-disruptive; Automated
Modify the configuration of live volumes attached to instances
Dynamically increase size, tune performance, and change the type of existing and new
current generation volumes
No downtime, no performance impact.
You can automate changes using CloudWatch with Lambda or CloudFormation
No need to plan ahead: provision what you need today and change the configuration as business needs change.
What are the new AWS CLI commands?
aws ec2 modify-volume
aws ec2 describe-volumes-modifications
* Requires the latest AWS SDK/CLI
How does it work?
Three steps
• Issue the modification command
• Monitor the progress of the modification
• If the size changed, extend the volume's file system
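The first two steps can be sketched with the boto3 equivalents of the CLI commands above (`modify_volume` and `describe_volumes_modifications`). This is a minimal sketch, not a production script: the `ec2` argument is assumed to be a boto3 EC2 client, and the function name, polling interval, and poll limit are illustrative choices.

```python
import time


def resize_volume(ec2, volume_id, new_size_gib, poll_seconds=15, max_polls=240):
    """Request a larger size for a volume and wait until it can be used.

    ec2 is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    Returns the modification state that ended the wait.
    """
    # Step 1: issue the modification command.
    ec2.modify_volume(VolumeId=volume_id, Size=new_size_gib)

    # Step 2: monitor the progress of the modification.
    for _ in range(max_polls):
        response = ec2.describe_volumes_modifications(VolumeIds=[volume_id])
        state = response["VolumesModifications"][0]["ModificationState"]
        if state in ("optimizing", "completed"):
            return state
        if state == "failed":
            raise RuntimeError(f"modification of {volume_id} failed")
        time.sleep(poll_seconds)
    raise TimeoutError(f"modification of {volume_id} did not finish in time")

    # Step 3 happens on the instance itself, not through this API:
    # extend the partition and file system with the operating system's tools.
```

A call would look like `resize_volume(boto3.client("ec2"), "vol-...", 200)`, with your own volume ID. The sketch returns as soon as the state is "optimizing", on the assumption that the new size is already usable at that point; wait for "completed" instead if you prefer to be conservative.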
Elastic Volume – Limitations
Limit – You can modify a volume once per 6 hours
Limit – Supported only for current generation volumes (gp2/io1/st1/sc1). Not supported
for Magnetic/Standard volumes
Limit – Live changes supported for volumes attached to current generation instances
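The first two limits are easy to check before issuing a modification. The sketch below encodes only what the slide states (one modification per 6 hours; gp2/io1/st1/sc1 only); the function name and the idea of tracking the last modification time yourself are assumptions for illustration.

```python
from datetime import datetime, timedelta

SUPPORTED_TYPES = {"gp2", "io1", "st1", "sc1"}  # current generation, per the slide
COOLDOWN = timedelta(hours=6)                   # one modification per 6 hours


def can_modify(volume_type, last_modified, now):
    """Return (allowed, reason) for a proposed Elastic Volumes change.

    last_modified is the time of the previous modification, or None if the
    volume has never been modified.
    """
    if volume_type not in SUPPORTED_TYPES:
        return False, f"{volume_type} volumes are not supported"
    if last_modified is not None and now - last_modified < COOLDOWN:
        return False, "volume was modified less than 6 hours ago"
    return True, "ok"


if __name__ == "__main__":
    now = datetime(2017, 3, 1, 12, 0)
    print(can_modify("gp2", None, now))
    print(can_modify("standard", None, now))
    print(can_modify("st1", now - timedelta(hours=2), now))
```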
Demo: Amazon EMR with HDFS on EBS
…and volume type/size changes!
Thank you!
@mattgy

 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

Understanding Post Production changes (PPC) in Clinical Data Management (CDM)...
Understanding Post Production changes (PPC) in Clinical Data Management (CDM)...Understanding Post Production changes (PPC) in Clinical Data Management (CDM)...
Understanding Post Production changes (PPC) in Clinical Data Management (CDM)...soumyapottola
 
GESCO SE Press and Analyst Conference on Financial Results 2024
GESCO SE Press and Analyst Conference on Financial Results 2024GESCO SE Press and Analyst Conference on Financial Results 2024
GESCO SE Press and Analyst Conference on Financial Results 2024GESCO SE
 
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptxerickamwana1
 
Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective ...
Testing with Fewer Resources:  Toward Adaptive Approaches for Cost-effective ...Testing with Fewer Resources:  Toward Adaptive Approaches for Cost-effective ...
Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective ...Sebastiano Panichella
 
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunity
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunityDon't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunity
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunityApp Ethena
 
Sunlight Spectacle 2024 Practical Action Launch Event 2024-04-08
Sunlight Spectacle 2024 Practical Action Launch Event 2024-04-08Sunlight Spectacle 2024 Practical Action Launch Event 2024-04-08
Sunlight Spectacle 2024 Practical Action Launch Event 2024-04-08LloydHelferty
 
Application of GIS in Landslide Disaster Response.pptx
Application of GIS in Landslide Disaster Response.pptxApplication of GIS in Landslide Disaster Response.pptx
Application of GIS in Landslide Disaster Response.pptxRoquia Salam
 
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...Sebastiano Panichella
 
General Elections Final Press Noteas per M
General Elections Final Press Noteas per MGeneral Elections Final Press Noteas per M
General Elections Final Press Noteas per MVidyaAdsule1
 
Engaging Eid Ul Fitr Presentation for Kindergartners.pptx
Engaging Eid Ul Fitr Presentation for Kindergartners.pptxEngaging Eid Ul Fitr Presentation for Kindergartners.pptx
Engaging Eid Ul Fitr Presentation for Kindergartners.pptxAsifArshad8
 
Scootsy Overview Deck - Pan City Delivery
Scootsy Overview Deck - Pan City DeliveryScootsy Overview Deck - Pan City Delivery
Scootsy Overview Deck - Pan City Deliveryrishi338139
 
cse-csp batch4 review-1.1.pptx cyber security
cse-csp batch4 review-1.1.pptx cyber securitycse-csp batch4 review-1.1.pptx cyber security
cse-csp batch4 review-1.1.pptx cyber securitysandeepnani2260
 
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATIONRACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATIONRachelAnnTenibroAmaz
 
INDIAN GCP GUIDELINE. for Regulatory affair 1st sem CRR
INDIAN GCP GUIDELINE. for Regulatory  affair 1st sem CRRINDIAN GCP GUIDELINE. for Regulatory  affair 1st sem CRR
INDIAN GCP GUIDELINE. for Regulatory affair 1st sem CRRsarwankumar4524
 

Recently uploaded (14)

Understanding Post Production changes (PPC) in Clinical Data Management (CDM)...
Understanding Post Production changes (PPC) in Clinical Data Management (CDM)...Understanding Post Production changes (PPC) in Clinical Data Management (CDM)...
Understanding Post Production changes (PPC) in Clinical Data Management (CDM)...
 
GESCO SE Press and Analyst Conference on Financial Results 2024
GESCO SE Press and Analyst Conference on Financial Results 2024GESCO SE Press and Analyst Conference on Financial Results 2024
GESCO SE Press and Analyst Conference on Financial Results 2024
 
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx
 
Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective ...
Testing with Fewer Resources:  Toward Adaptive Approaches for Cost-effective ...Testing with Fewer Resources:  Toward Adaptive Approaches for Cost-effective ...
Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective ...
 
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunity
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunityDon't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunity
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunity
 
Sunlight Spectacle 2024 Practical Action Launch Event 2024-04-08
Sunlight Spectacle 2024 Practical Action Launch Event 2024-04-08Sunlight Spectacle 2024 Practical Action Launch Event 2024-04-08
Sunlight Spectacle 2024 Practical Action Launch Event 2024-04-08
 
Application of GIS in Landslide Disaster Response.pptx
Application of GIS in Landslide Disaster Response.pptxApplication of GIS in Landslide Disaster Response.pptx
Application of GIS in Landslide Disaster Response.pptx
 
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
 
General Elections Final Press Noteas per M
General Elections Final Press Noteas per MGeneral Elections Final Press Noteas per M
General Elections Final Press Noteas per M
 
Engaging Eid Ul Fitr Presentation for Kindergartners.pptx
Engaging Eid Ul Fitr Presentation for Kindergartners.pptxEngaging Eid Ul Fitr Presentation for Kindergartners.pptx
Engaging Eid Ul Fitr Presentation for Kindergartners.pptx
 
Scootsy Overview Deck - Pan City Delivery
Scootsy Overview Deck - Pan City DeliveryScootsy Overview Deck - Pan City Delivery
Scootsy Overview Deck - Pan City Delivery
 
cse-csp batch4 review-1.1.pptx cyber security
cse-csp batch4 review-1.1.pptx cyber securitycse-csp batch4 review-1.1.pptx cyber security
cse-csp batch4 review-1.1.pptx cyber security
 
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATIONRACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
 
INDIAN GCP GUIDELINE. for Regulatory affair 1st sem CRR
INDIAN GCP GUIDELINE. for Regulatory  affair 1st sem CRRINDIAN GCP GUIDELINE. for Regulatory  affair 1st sem CRR
INDIAN GCP GUIDELINE. for Regulatory affair 1st sem CRR
 

Optimizing Storage for Big Data Analytics Workloads

  • 1. Optimizing Storage for Big Data/Analytics Workloads. Matt Yanchyshyn, Sr. Manager, Solutions Architecture. AWS NYC Loft Storage Day, March 1st, 2017.
  • 2. Big Data/Analytics Workloads
      • Hadoop etc.: Amazon EMR, Cloudera, Hortonworks, MapR
      • Data Warehouse: Amazon Redshift, Vertica, Teradata
      • Streaming: Amazon Kinesis, Kafka, Tibco EMS
      • NoSQL: Amazon DynamoDB, MongoDB, Cassandra
      • Search: Amazon Elasticsearch Service, Splunk, Elasticsearch, Solr
  • 3. AWS Storage
      • Amazon S3: multi-tenant, key store, native API
      • Amazon EBS: single-tenant, block store
      • Amazon EFS: shared/distributed, POSIX, NFS/SMB (CIFS)
  • 4. A little EBS history…
      • 2006 – EC2 launched with instance storage
      • 2008 – EBS (Elastic Block Store) launched on magnetic storage
      • 2012 – EBS Provisioned IOPS and EBS-Optimized instances
      • 2014 – SSD-backed general purpose storage
      • 2014 – EBS data volume encryption
      • 2015 – Larger and faster EBS volumes
      • 2015 – EBS boot volume encryption
      • 2016 – EBS Throughput Optimized HDD (st1) and Cold HDD (sc1) volume types
      • 2017 – EBS Elastic Volumes!
  • 6. EBS volume types
      • SSD: General Purpose SSD (gp2), Provisioned IOPS SSD (io1)
      • HDD: Throughput Optimized HDD (st1), Cold HDD (sc1)
  • 7. Choosing an EBS volume type (decision tree; figure not reproduced). First question: is IOPS or throughput more important?
      • IOPS (small, random I/O): branches on total IOPS (≤ 65,000 vs. > 65,000), on latency (< 1 ms vs. single-digit ms), and on IOPS per volume (≤ 10k vs. > 10k)
      • Throughput (large, sequential I/O): branches on aggregate throughput (≤ 1,250 MB/s vs. > 1,250 MB/s) and on throughput per volume (250 MiB/s vs. 500 MiB/s)
      • Outcomes: i2, gp2, io1, sc1, st1, d2
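
The thresholds on slide 7 can be sketched as a small function. The transcript preserves the questions, thresholds, and outcomes of the decision tree, but not which branch leads to which outcome, so the leaf assignments below are an assumption inferred from the per-volume limits mentioned on the slide. The function name and parameters are hypothetical.

```python
def choose_block_storage(priority, *, total_iops=0, needs_sub_ms_latency=False,
                         iops_per_volume=0, aggregate_mb_s=0,
                         throughput_per_volume_mib_s=0):
    """Pick a storage option using the thresholds from slide 7.

    ASSUMPTION: the mapping of branches to outcomes is inferred, not taken
    verbatim from the slide, whose figure did not survive extraction.
    """
    if priority == "iops":  # small, random I/O
        if total_iops > 65_000:
            return "i2"  # assumed: beyond EBS range, use instance store
        if needs_sub_ms_latency:
            return "i2"  # assumed: < 1 ms points to local SSD
        # Single-digit millisecond latency is acceptable: stay on EBS SSD.
        return "gp2" if iops_per_volume <= 10_000 else "io1"
    if priority == "throughput":  # large, sequential I/O
        if aggregate_mb_s > 1_250:
            return "d2"  # assumed: dense-storage instance store
        # Assumed: sc1 covers up to 250 MiB/s per volume, st1 up to 500 MiB/s.
        return "sc1" if throughput_per_volume_mib_s <= 250 else "st1"
    raise ValueError("priority must be 'iops' or 'throughput'")


print(choose_block_storage("iops", total_iops=20_000, iops_per_volume=5_000))
print(choose_block_storage("throughput", aggregate_mb_s=800,
                           throughput_per_volume_mib_s=400))
```

Treat this as a reading aid for the slide rather than sizing guidance; the limits are those quoted in the 2017 deck.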
  • 8. AWS Storage Use Cases
      • Use cases: Big Data & Analytics; Data Warehouses; Search & Indexing; Transactional & NoSQL Databases; Streaming data
      • Storage options: Amazon S3, Amazon EC2 instance store, Amazon EBS, Amazon EFS
  • 9. Analytics on AWS
      • Workload patterns: compute only (Spark, Presto); coupled storage (HDFS + Solr); shared filesystem (SAS Grid)
      • Storage options: Amazon S3, Amazon EC2 instance store (D2), Amazon EBS ST1 volumes, Amazon EFS
  • 11. Building a tiered storage model
      Hadoop (Hive)
      • A single Hive table with partitions in HDFS and S3
        ALTER TABLE table ADD PARTITION (year='2016') LOCATION 's3://'
        ALTER TABLE table ADD PARTITION (year='2017') LOCATION 'hdfs://'
      Data warehouse (Vertica)
      • Locations with different types of volumes (EBS GP2 -> EBS ST1/SC1)
        select alter_location_label('/home/dbadmin/SSD/tables', '', 'SSD');
        select set_object_storage_policy('table', 'SSD');
      • Use Amazon S3 as virtual tables
  • 12. Building a tiered storage model
      NoSQL (MongoDB)
      • Consider replicas with different storage types
      • EBS GP2 (primary), I2 (secondary, high I/O), R4 (secondary, in-memory)
      Search (Splunk/ELK)
      • Different types of EBS volumes for hot (GP2), warm (GP2/ST1), cold (ST1/SC1), and frozen (SC1) data
  • 13. Customer example: Crowdstrike. Cassandra on EBS vs. ephemeral storage
      “Amazon EBS offered the performance we needed, at a third of the cost of the SSD-backed instance storage.”
      Goal: 1 million writes per second on 60 nodes with EBS
  • 14. Crowdstrike, cont. They used to believe they could never run Cassandra on EBS:
      • Noisy neighbors (jitter)
      • Single point of failure in a region
      • Too expensive
      • Bad volumes (spin up ten, run tests, and pick the best one)
  • 15. Crowdstrike, cont. Pain points of running on ephemeral storage:
      • Fewer EC2 instance types offer local instance storage
      • There is no data persistence if you stop/start the EC2 instance to resize
      • I2s are expensive, especially when you need three of them per node for replication
      • You can’t snapshot the data using EBS snapshots
      • No EBS volume monitoring
  • 16. Crowdstrike, cont. Chart: number of machines for 1 PB. I2.2xlarge: 1,700; c4.2xlarge + EBS: 400.
  • 17. Crowdstrike, cont. Chart: yearly cost for a 1 PB cluster (millions). I2.2xlarge on-demand: 14.5; I2.2xlarge reserved: 8; C4.2xlarge (EBS) on-demand: 4.2; C4.2xlarge (EBS) reserved: 3.5.
  • 18. Crowdstrike, cont.
      Crowdstrike today:
      • In the past 12 months, zero Amazon EBS-related failures
      • Thousands of GP2 data volumes (~2 PB of data)
      • Transitioning all systems to Amazon EBS root drives
      • Moved all data stores to EBS
      Benefits of EBS:
      • Use EBS volume monitoring
      • Schedule snapshots for consistent backups
      • Stop/start and resize
      • Half the cost (using reserved pricing comparison)
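
The claims "a third of the cost" and "half the cost" can be checked against the chart figures on slides 16 and 17. This is only a back-of-the-envelope sketch: it assumes the chart values pair with the labels in the order they appear in the transcript, and the variable names are illustrative.

```python
# Figures as read from slides 16 and 17 (ASSUMPTION: values pair with the
# labels in transcript order; costs are in millions per year).
machines_for_1pb = {"i2.2xlarge": 1700, "c4.2xlarge + EBS": 400}
yearly_cost_millions = {
    "i2.2xlarge on-demand": 14.5,
    "i2.2xlarge reserved": 8.0,
    "c4.2xlarge (EBS) on-demand": 4.2,
    "c4.2xlarge (EBS) reserved": 3.5,
}

# How many times fewer machines the EBS-backed cluster needs.
machine_reduction = (machines_for_1pb["i2.2xlarge"]
                     / machines_for_1pb["c4.2xlarge + EBS"])

# EBS-backed cost as a fraction of the instance-store cost, per pricing model.
on_demand_share = (yearly_cost_millions["c4.2xlarge (EBS) on-demand"]
                   / yearly_cost_millions["i2.2xlarge on-demand"])
reserved_share = (yearly_cost_millions["c4.2xlarge (EBS) reserved"]
                  / yearly_cost_millions["i2.2xlarge reserved"])

print(f"machines: {machine_reduction:.2f}x fewer")
print(f"on-demand cost share: {on_demand_share:.0%}")
print(f"reserved cost share: {reserved_share:.0%}")
```

Under that reading, the EBS-backed cluster costs roughly 29% of the I2 cluster on demand and about 44% with reserved pricing, which lines up with the "a third" and "half" statements on slides 13 and 18.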
  • 19. Hadoop
      HDFS (Hadoop Distributed File System)
      • Replication is done for both durability and performance
      • Data is written in blocks of 64/128/256 MB; sequential I/O only
      HCFS (Hadoop Compatible File System)
      • EMRFS, S3A, S3N: map AWS APIs to Hadoop APIs
      • Athena: Presto talking to S3 using an HCFS implementation
  • 20. Amazon S3 as HDFS
      Advantages
      • Scales out horizontally
      • Storage decoupled from compute
      • Backup and DR not required
      • Transient clusters with much higher availability
      Challenges
      • Rename: cost based on the size of the data
      • List: cost based on prefix depth
      • Security: IAM-based; not supported by the Hadoop security model
      • Compatibility
  • 21. When to use HDFS without Amazon S3?
      Cluster type: transient vs. long-running
      Customer preference, e.g. Cloudera/Hortonworks/MapR
      Optimize for multiple data processing iterations and sequential access
      • Amazon EC2 instance store: D2 (3+ GB/s)
      • Amazon EBS ST1 volumes (per TB: 250 MBps burst and 40 MBps baseline)
      • 2 TB volume sizes with a read-ahead of 1 MB
      • Lower the replication factor to 2x instead of 3x
      Amazon EFS?
      • HDFS is itself a distributed filesystem
      • There is no adapter to translate from Amazon EFS to HCFS
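
The ST1 sizing advice on slide 21 comes down to simple arithmetic, sketched below. The per-TB rates are the ones quoted on the slide; the 500 per-volume ceiling is taken from the "500 MiB/s" figure on slide 7 and is assumed here to apply to st1. MB/s and MiB/s are treated interchangeably for this rough estimate, and the function names are hypothetical.

```python
ST1_BASELINE_MBPS_PER_TB = 40   # from slide 21
ST1_BURST_MBPS_PER_TB = 250     # from slide 21
ST1_MAX_MBPS_PER_VOLUME = 500   # ASSUMPTION: slide 7's 500 MiB/s limit is st1's


def st1_throughput(size_tb):
    """Return (baseline, burst) throughput for one st1 volume of size_tb."""
    baseline = min(size_tb * ST1_BASELINE_MBPS_PER_TB, ST1_MAX_MBPS_PER_VOLUME)
    burst = min(size_tb * ST1_BURST_MBPS_PER_TB, ST1_MAX_MBPS_PER_VOLUME)
    return baseline, burst


def raw_capacity_tb(logical_tb, replication_factor):
    """Raw HDFS capacity needed to hold logical_tb of data."""
    return logical_tb * replication_factor


baseline, burst = st1_throughput(2)
print(f"2 TB st1 volume: {baseline} baseline, {burst} burst")
print(f"100 TB of data at 3x: {raw_capacity_tb(100, 3)} TB raw")
print(f"100 TB of data at 2x: {raw_capacity_tb(100, 2)} TB raw")
```

Under these assumptions a 2 TB volume is the smallest size whose burst rate reaches the per-volume ceiling, which may be why the slide recommends it, and dropping the replication factor from 3x to 2x cuts raw capacity by a third.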
  • 22. Customer workloads
      NETFLIX
      • Cloud data warehouse with Amazon S3
      • EMR as the computing engine: Hive, Spark, Presto
      • Orchestration and scheduling via Genie
      PINTEREST
      • Large EC2 cluster with EBS GP2
      • Search and index using MapReduce
      • Always-on, persistent cluster
  • 23. New! EBS Elastic Volumes. Simple; flexible; non-disruptive; automated
      • Modify the configuration of live volumes attached to instances
      • Dynamically increase size, tune performance, and change the type of existing and new current-generation volumes
      • No downtime, no performance impact
      • Automate changes using CloudWatch with Lambda or CloudFormation
      • No need to plan ahead: provision what you need today and change the configuration as business needs change
  • 24. What are the new AWS CLI commands?
      aws ec2 modify-volume
      aws ec2 describe-volumes-modifications
      * Requires the latest AWS SDK/CLI
  • 25. How does it work? Three steps:
      • Issue the modification command
      • Monitor the progress of the modification
      • If the size changed, extend the volume's file system
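
The first two steps on slide 25 can be sketched with the boto3 equivalents of the CLI commands from slide 24 (`modify_volume` and `describe_volumes_modifications`). This is a minimal sketch, not the presenter's demo code: the helper name, the polling defaults, and the volume ID are hypothetical, and the EC2 client is passed in so the function can be exercised without AWS credentials.

```python
import time


def resize_volume(ec2, volume_id, new_size_gib, poll_seconds=15, max_polls=240):
    """Request a larger size for a volume and wait until the new size is usable.

    `ec2` is expected to behave like boto3.client("ec2").
    """
    # Step 1: issue the modification command.
    ec2.modify_volume(VolumeId=volume_id, Size=new_size_gib)

    # Step 2: monitor the progress of the modification.
    for _ in range(max_polls):
        response = ec2.describe_volumes_modifications(VolumeIds=[volume_id])
        state = response["VolumesModifications"][0]["ModificationState"]
        if state in ("optimizing", "completed"):
            return state  # the new size can be used from "optimizing" onward
        if state == "failed":
            raise RuntimeError(f"modification of {volume_id} failed")
        time.sleep(poll_seconds)
    raise TimeoutError(f"modification of {volume_id} did not progress in time")


# Usage against a real account (hypothetical volume ID):
#   import boto3
#   resize_volume(boto3.client("ec2"), "vol-0123456789abcdef0", 200)
```

The third step happens inside the instance and depends on the operating system and file system, so it is left out here; on Linux it is typically done with the file system's own grow tool.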
  • 26. Elastic Volumes: limitations
      • You can modify a volume once every 6 hours
      • Supported only for current-generation volumes (gp2/io1/st1/sc1); not supported for Magnetic/Standard volumes
      • Live changes are supported for volumes attached to current-generation instances
  • 27. Demo: Amazon EMR with HDFS on EBS …and volume type/size changes!