This document discusses using Hadoop and HBase on Amazon Web Services (AWS) for distributed data storage and analytics. It introduces Hadoop and describes how AWS Elastic MapReduce simplifies implementing and managing Hadoop clusters. It also discusses using HBase, an open-source NoSQL database, on AWS for scalable access to large, unstructured datasets. Finally, it covers strategies for optimizing costs when running big data workloads on AWS infrastructure.
30. “AWS enables Pfizer to explore difficult or deep scientific questions in a timely, scalable manner and helps us make better decisions more quickly”
Michael Miller, Pfizer
62. Hadoop all the way down
Amazon Hadoop distribution
HDFS
Streaming interface
Hive, Pig, Mahout, Spark, Shark
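The streaming interface lets a job's map and reduce steps be ordinary programs that read lines on standard input and write tab-separated key/value pairs on standard output. A minimal sketch of that pattern follows. The log layout (cookie id, search term, ad id, separated by tabs) and the function names are assumptions made for illustration, not something the slides specify.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit (search_term, 1) for each tab-separated log line.

    Assumed (hypothetical) layout: cookie_id <TAB> search_term <TAB> ad_id.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[1]:
            continue  # skip malformed lines instead of failing the task
        yield fields[1].lower(), 1


def reduce_pairs(pairs):
    """Reducer: sum the counts for each key.

    The input must already be sorted by key; Hadoop Streaming sorts
    mapper output by key before handing it to the reducer.
    """
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort -> reduce in a single process for testing."""
    return dict(reduce_pairs(sorted(map_lines(lines))))
```

In a real streaming job the mapper and reducer would be separate scripts that wrap these functions around `sys.stdin` and print `key<TAB>value` lines; `run_local` only exists so the logic can be checked on a laptop before it is run on a cluster.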
63. Data integration
Optimized and integrated into AWS environment
Reads and writes to S3
Analytics on DynamoDB data
Can process data from any source:
Cassandra, Mongo, Couch, Amazon RDS
64. Data movement
Multi-part upload
Import/Export
AWS Direct Connect
Aspera
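Multi-part upload splits one large object into independently uploaded parts, which can then be sent in parallel and retried individually. The sketch below only plans the byte ranges; it makes no network calls. It relies on two S3 limits: every part except the last must be at least 5 MiB, and one upload may have at most 10,000 parts. The function name and the 64 MiB default part size are assumptions chosen for illustration.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for every part except the last
MAX_PARTS = 10_000               # S3 maximum number of parts per upload


def plan_parts(total_size, part_size=64 * 1024 * 1024):
    """Return a list of (part_number, start_byte, end_byte_inclusive)."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")

    # Grow the part size if the object would otherwise need too many parts.
    min_needed = -(-total_size // MAX_PARTS)  # ceiling division
    part_size = max(part_size, min_needed)

    parts = []
    start = 0
    number = 1  # S3 part numbers start at 1
    while start < total_size:
        end = min(start + part_size, total_size) - 1
        parts.append((number, start, end))
        start = end + 1
        number += 1
    return parts
```

Each tuple describes one part that a client could read from the source file and upload on its own; the actual upload calls are deliberately left out of this sketch.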
65. Cluster scalability
Resize running job flows
Add capacity for shorter runs
Remove capacity during off peak hours
Balance scale and cost
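The trade-off between scale and cost can be made concrete with some rough arithmetic. The sketch below assumes idealized linear scaling (doubling the nodes halves the run time) and billing in whole node-hours; both are simplifying assumptions, and the price used in the example is a made-up figure rather than an AWS rate.

```python
import math


def estimate_run(work_node_hours, nodes, price_per_node_hour):
    """Return (wall_clock_hours, cost) under idealized linear scaling."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    wall_clock = work_node_hours / nodes
    billed_hours = math.ceil(wall_clock)  # assumption: whole-hour billing
    cost = nodes * billed_hours * price_per_node_hour
    return wall_clock, cost
```

With 100 node-hours of work at a hypothetical price of 0.50 per node-hour, `estimate_run(100, 10, 0.5)` gives a 10-hour run costing 50, while `estimate_run(100, 40, 0.5)` gives a 2.5-hour run costing 60: four times the capacity shortens the run to a quarter for a modest increase in spend.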
70. Cluster availability
Canonical source of data
Anyone on the engineering team
IAM integration
Monitoring
71. Click stream analysis for retail
3.5 billion records
71 million unique cookies
1.7 million targeted ads
13 TB of clickstream logs
Each day
72. Click stream analysis for retail
Workflow time from 2 days to 8 hours
Procurement time from 2 months to 5 minutes
$13k per month
500% increase in return on advertising spend
74. Log data stored in Amazon S3
Amazon S3: months of user click-through data
Search terms
Ads displayed
Premium listing inventory
76. Find patterns across logs. Write results to S3.
Hadoop Cluster
Amazon S3 Amazon EMR
77. Hadoop in the AWS Cloud
Elastic MapReduce for hosted Hadoop
Optimized, configured, ready to roll
Focus on the business benefit of data
Hadoop all the way down
78. Maturation of two things
Software for distributed storage and analysis
Infrastructure for distributed storage and analysis