3. What is Big Data?
Terabytes of semi-structured log data
from which businesses want to:
find correlations/perform pattern matching
generate recommendations
calculate advanced statistics (e.g., TP99 latency)
Twitter “Firehose”
50 million tweets per day
1,400% growth per year
How can advertisers drink from it?
Social graphs
Value increases with exponential
growth in data connections
Big Data is full of valuable, unanswered questions!
4. Why is Big Data Hard (and Getting Harder)?
Today’s Data Warehouses
Need to consolidate data from multiple sources in multiple formats
across multiple businesses
Unconstrained growth of this business-critical information
Today’s Users
Expect faster response times on fresher data
Sampling is not good enough and history is important
Demand inexpensive experimentation with new data
Become increasingly sophisticated Data Scientists
Current systems don’t scale (and weren’t meant to)
Long time to provision more infrastructure
Specialized DB expertise required
Expensive and inelastic solutions
We need tools built specifically for Big Data!
5. What is this thing called Hadoop?
Dealing with Big Data requires two things:
Distributed, scalable storage
Inexpensive, flexible analytics
Apache Hadoop is an open source software
platform that addresses both of these needs
Includes a fault-tolerant, distributed storage system
(HDFS) developed for commodity servers
Uses a programming model called MapReduce to carry out
analysis over huge distributed data sets in parallel
Key benefits
Affordable – Cost / TB is a fraction of traditional options
Proven at scale – Numerous petabyte implementations in production;
linear scalability
Flexible – Data can be stored with or without schema
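The MapReduce model named above can be sketched in a few lines of plain Python. This is a toy, single-process word count (the function names are mine); real Hadoop runs many map and reduce tasks in parallel across a cluster, with the shuffle happening over the network.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the grouped values for each key."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "hadoop handles big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])  # 3
```

Because each map call and each per-key reduce is independent, the same three-step structure scales out to thousands of nodes.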
6. RDBMS vs. MapReduce/Hadoop
RDBMS:
Predefined schema
Strategic data placement for query tuning
Exploits indexes for fast retrieval
SQL only
Doesn't scale linearly
MapReduce/Hadoop:
No schema required
Random data placement
Fast scan of the entire dataset
Uniform query performance
Scales linearly for reads and writes
Supports many languages, including SQL
Complementary technologies
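The "no schema required" side of the comparison amounts to schema-on-read: raw records are stored untouched, and a structure is imposed only while a query scans them. A minimal sketch (the log line format, separator, and field names here are hypothetical):

```python
# Schema-on-read sketch: raw log lines are stored as-is (no predefined
# schema); structure is imposed only when a query runs over them.
raw_logs = [
    "2013-05-01|/home|200",
    "2013-05-01|/cart|500",
    "2013-05-02|/home|200",
]

def query_errors(lines, sep="|"):
    """Scan the entire dataset, parsing each line at read time."""
    for line in lines:
        date, path, status = line.split(sep)
        if int(status) >= 500:
            yield path

print(list(query_errors(raw_logs)))  # ['/cart']
```

The trade-off matches the table: no upfront schema design or load step, at the cost of parsing and scanning everything on each query rather than exploiting indexes.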
8. Why Amazon Elastic MapReduce?
Managed Apache Hadoop Web Service
Monitor thousands of clusters per day
Use cases span from university students to the Fortune 50
Reduces complexity of Hadoop management
Handles node provisioning, customization, and shutdown
Tunes Hadoop to your hardware and network
Provides tools to debug and monitor your Hadoop clusters
Provides tight integration with AWS services
Improved performance working with S3
Automatic re-provisioning on node failure
Dynamic expanding/shrinking of cluster size
Spot integration
9. Elastic MapReduce Key Features
Simplified Cluster Configuration/Management
Resize running job flows
Support for EIP/IAM/Tagging
Workload-specific configurations
Bootstrap Actions
Enhanced Monitoring/Debugging
Free CloudWatch Metrics / Alarms
Hadoop Metrics in Console
Ganglia Support
Improved Performance
S3 Multipart Upload
Cluster Compute Instances
10. Analytics Use Cases
Targeted advertising / Clickstream analysis
Data warehousing applications
Bio-informatics (Genome analysis)
Financial simulation (Monte Carlo simulation)
File processing (resize jpegs)
Web indexing
Data mining and BI
11. Apache Hive: Data Warehouse for Hadoop
Open source project started at Facebook
Turns data on Hadoop into a virtually limitless
data warehouse
Provides data summarization, ad hoc querying
and analysis
Enables SQL-like queries on structured and
unstructured data
E.g., arbitrary field separators are supported, such as "," for
CSV file formats
Inherits linear scalability of Hadoop
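The field-separator point above can be made concrete with a hedged HiveQL sketch: an external table overlays a comma-delimited layout on raw files already sitting in HDFS. The table name, columns, and path here are hypothetical, not from the original deck.

```
-- Hypothetical external table over comma-separated log files in HDFS;
-- Hive applies this schema at read time and leaves the files untouched.
CREATE EXTERNAL TABLE page_views (
  view_time STRING,
  url       STRING,
  referrer  STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/logs/page_views/';

-- Ad hoc, SQL-like analysis over the raw files:
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```

Behind the scenes, Hive compiles such queries into MapReduce jobs, which is how it inherits Hadoop's linear scalability.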
13. Elastic Data Warehouse
Customize cluster size to support varying resource needs
(e.g. query support during the day versus batch processing
overnight)
Reduce costs by increasing server utilization
Improve performance during high usage periods
(Diagram: a data warehouse cluster running at steady state expands to
25 instances for overnight batch processing, then shrinks back to 9
instances at steady state.)
14. Reducing Costs with Spot Instances
Mix Spot and On-Demand instances to reduce cost and
accelerate computation while protecting against interruption
Scenario #1 (On-Demand only)
Job flow: 4 instances * 14 hrs * $0.50 = $28
Duration: 14 hours
Scenario #2 (On-Demand + Spot)
Job flow: 4 instances * 7 hrs * $0.50 = $14
+ 5 instances * 7 hrs * $0.25 = $8.75
Total: $22.75
Duration: 7 hours
Time savings: 50%
Cost savings: ~19%
Other EMR + Spot Use Cases
Run the entire cluster on Spot for the biggest cost savings
Reduce the cost of application testing
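Working the two-scenario arithmetic directly (the per-instance-hour prices are the deck's illustrative figures, not current AWS rates): adding the five Spot instances halves the duration while cutting total cost.

```python
# Illustrative prices from the slide, per instance-hour.
ON_DEMAND, SPOT = 0.50, 0.25

# Scenario #1: 4 On-Demand instances finish the job in 14 hours.
cost_on_demand_only = 4 * 14 * ON_DEMAND

# Scenario #2: adding 5 Spot instances halves the duration to 7 hours.
cost_with_spot = 4 * 7 * ON_DEMAND + 5 * 7 * SPOT

print(cost_on_demand_only)  # 28.0
print(cost_with_spot)       # 22.75
savings = 1 - cost_with_spot / cost_on_demand_only
print(f"{savings:.0%}")     # 19%
```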
15. Monitoring Clusters with CloudWatch
Free CloudWatch Metrics and Alarms
Track Hadoop job progress
Alarm on degradations in cluster health
Monitor aggregate Elastic MapReduce usage
16. Big Data Ecosystem And Tools
We have a rapidly growing ecosystem and will continue
to integrate with a wide range of partners. Some
examples:
Business Intelligence
MicroStrategy, Pentaho
Analytics
Datameer, Karmasphere, Quest
Open source
Ganglia, SQuirrel SQL