3. What is Big Data?
Terabytes of semi-structured log data
from which businesses want to:
find correlations/perform pattern matching
generate recommendations
calculate advanced statistics (e.g., TP99 latency)
Twitter “Firehose”
50 million tweets per day
1,400% growth per year
How can advertisers drink from it?
Social graphs
Value increases with exponential
growth in data connections
Big Data is full of valuable, unanswered questions!
4. Why is Big Data Hard (and Getting Harder)?
Today’s Data Warehouses
Need to consolidate data from multiple sources in multiple formats
across multiple businesses
Unconstrained growth of this business-critical information
Today’s Users
Expect faster response times on fresher data
Sampling is not good enough and history is important
Demand inexpensive experimentation with new data
Become increasingly sophisticated Data Scientists
Current systems don’t scale (and weren’t meant to)
Long time to provision more infrastructure
Specialized DB expertise required
Expensive and inelastic solutions
We need tools built specifically for Big Data!
5. What is this thing called Hadoop?
Dealing with Big Data requires two things:
Distributed, scalable storage
Inexpensive, flexible analytics
Apache Hadoop is an open source software
platform that addresses both of these needs
Includes a fault-tolerant, distributed storage system
(HDFS) developed for commodity servers
Uses a programming model called MapReduce to carry out
analysis over huge distributed data sets in parallel
Key benefits
Affordable – Cost / TB is a fraction of traditional options
Proven at scale – Numerous petabyte implementations in production;
linear scalability
Flexible – Data can be stored with or without schema
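The MapReduce model named above can be sketched in a few lines of plain Python. This is a toy, single-process word count (the function names are mine); real Hadoop runs many map and reduce tasks in parallel across a cluster, with the shuffle happening over the network.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the grouped values for each key."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "hadoop handles big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])  # 3
```

Because each map call and each per-key reduce is independent, the same three-step structure scales out to thousands of nodes.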
6. RDBMS vs. MapReduce/Hadoop
RDBMS:
Predefined schema
Strategic data placement for query tuning
Exploits indexes for fast retrieval
SQL only
Doesn't scale linearly
MapReduce/Hadoop:
No schema required
Random data placement
Fast scan of the entire dataset
Uniform query performance
Scales linearly for reads and writes
Supports many languages, including SQL
Complementary technologies
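The "no schema required" side of the comparison amounts to schema-on-read: raw records are stored untouched, and a structure is imposed only while a query scans them. A minimal sketch (the log line format, separator, and field names here are hypothetical):

```python
# Schema-on-read sketch: raw log lines are stored as-is (no predefined
# schema); structure is imposed only when a query runs over them.
raw_logs = [
    "2013-05-01|/home|200",
    "2013-05-01|/cart|500",
    "2013-05-02|/home|200",
]

def query_errors(lines, sep="|"):
    """Scan the entire dataset, parsing each line at read time."""
    for line in lines:
        date, path, status = line.split(sep)
        if int(status) >= 500:
            yield path

print(list(query_errors(raw_logs)))  # ['/cart']
```

The trade-off matches the table: no upfront schema design or load step, at the cost of parsing and scanning everything on each query rather than exploiting indexes.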
8. Why Amazon Elastic MapReduce?
Managed Apache Hadoop Web Service
Monitor thousands of clusters per day
Use cases span from university students to the Fortune 50
Reduces complexity of Hadoop management
Handles node provisioning, customization, and shutdown
Tunes Hadoop to your hardware and network
Provides tools to debug and monitor your Hadoop clusters
Provides tight integration with AWS services
Improved performance working with S3
Automatic re-provisioning on node failure
Dynamic expanding/shrinking of cluster size
Spot integration
9. Elastic MapReduce Key Features
Simplified Cluster Configuration/Management
Resize running job flows
Support for EIP/IAM/Tagging
Workload-specific configurations
Bootstrap Actions
Enhanced Monitoring/Debugging
Free CloudWatch Metrics / Alarms
Hadoop Metrics in Console
Ganglia Support
Improved Performance
S3 Multipart Upload
Cluster Compute Instances
10. Analytics Use Cases
Targeted advertising / Clickstream analysis
Data warehousing applications
Bio-informatics (Genome analysis)
Financial simulation (Monte Carlo simulation)
File processing (resize jpegs)
Web indexing
Data mining and BI
11. Apache Hive: Data Warehouse for Hadoop
Open source project started at Facebook
Turns data on Hadoop into a virtually limitless
data warehouse
Provides data summarization, ad hoc querying
and analysis
Enables SQL-like queries on structured and
unstructured data
E.g., arbitrary field separators are supported, such as "," for
CSV file formats
Inherits linear scalability of Hadoop
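The field-separator point above can be made concrete with a hedged HiveQL sketch: an external table overlays a comma-delimited layout on raw files already sitting in HDFS. The table name, columns, and path here are hypothetical, not from the original deck.

```
-- Hypothetical external table over comma-separated log files in HDFS;
-- Hive applies this schema at read time and leaves the files untouched.
CREATE EXTERNAL TABLE page_views (
  view_time STRING,
  url       STRING,
  referrer  STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/logs/page_views/';

-- Ad hoc, SQL-like analysis over the raw files:
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```

Behind the scenes, Hive compiles such queries into MapReduce jobs, which is how it inherits Hadoop's linear scalability.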
13. Elastic Data Warehouse
Customize cluster size to support varying resource needs
(e.g. query support during the day versus batch processing
overnight)
Reduce costs by increasing server utilization
Improve performance during high usage periods
(Diagram: a data warehouse cluster running at steady state expands to
25 instances for overnight batch processing, then shrinks back to 9
instances at steady state.)
14. Reducing Costs with Spot Instances
Mix Spot and On-Demand instances to reduce cost and
accelerate computation while protecting against interruption
Scenario #1 (On-Demand only)
Job flow: 4 instances * 14 hrs * $0.50 = $28
Duration: 14 hours
Scenario #2 (On-Demand + Spot)
Job flow: 4 instances * 7 hrs * $0.50 = $14
+ 5 instances * 7 hrs * $0.25 = $8.75
Total: $22.75
Duration: 7 hours
Time savings: 50%
Cost savings: ~19%
Other EMR + Spot Use Cases
Run the entire cluster on Spot for the biggest cost savings
Reduce the cost of application testing
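Working the two-scenario arithmetic directly (the per-instance-hour prices are the deck's illustrative figures, not current AWS rates): adding the five Spot instances halves the duration while cutting total cost.

```python
# Illustrative prices from the slide, per instance-hour.
ON_DEMAND, SPOT = 0.50, 0.25

# Scenario #1: 4 On-Demand instances finish the job in 14 hours.
cost_on_demand_only = 4 * 14 * ON_DEMAND

# Scenario #2: adding 5 Spot instances halves the duration to 7 hours.
cost_with_spot = 4 * 7 * ON_DEMAND + 5 * 7 * SPOT

print(cost_on_demand_only)  # 28.0
print(cost_with_spot)       # 22.75
savings = 1 - cost_with_spot / cost_on_demand_only
print(f"{savings:.0%}")     # 19%
```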
15. Monitoring Clusters with CloudWatch
Free CloudWatch Metrics and Alarms
Track Hadoop job progress
Alarm on degradations in cluster health
Monitor aggregate Elastic MapReduce usage
16. Big Data Ecosystem And Tools
We have a rapidly growing ecosystem and will continue
to integrate with a wide range of partners. Some
examples:
Business Intelligence
MicroStrategy, Pentaho
Analytics
Datameer, Karmasphere, Quest
Open source
Ganglia, SQuirrel SQL