2. About Me
• @Facebook (2007-2011):
– First Hadoop Engineer
– Founder - Apache Hive project, PMC Member
– Contributor to Apache Hadoop/HBase
• Founder Qubole (2012-)
– Hadoop-as-a-Service
– 30+ customers: Pinterest, Quora, MediaMath, TubeMogul …
– Design/Code/Ops/Support/…
3. Big Data Cloud
• Elasticity:
– Workloads are Bursty
– Allows easy rolling upgrades and testing
• Lower TCO:
– Cloud Storage is Inexpensive (2-3c/GB/month – globally replicated)
– Zero cost to try new projects
– Upgrade to new hardware easily (no cluster migrations!)
4. Big Data Cloud
• Global:
– Easily set up where employees/customers/entities are located
• Collaboration:
– Zero-Copy sharing of data with Partners and across Departments
– Easy access to great public data sets
• As-a-Service delivery model vastly lowers Operational Cost
5. Cloud-Optimized Big Data?
• Optimized for lower TCO
• Optimized for Speed
• Optimized for Operations/Support
7. Automated LifeCycle Mgmt
select t.county, count(1) from (select
transform(a.zip) using 'geo.py' as county
from SMALL_TABLE a) t group by
t.county;
hadoop jar -Dmapred.min.split.size=32000000
myapp.jar -partitioner org.apache…
insert overwrite table dest
select a.id, a.zip, count(distinct b.uid)
from ads a join LARGE_TABLE b on (a.id=b.ad_id) group
by a.id, a.zip;
AdCo Hadoop
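The first query on this slide streams each zip code through 'geo.py'. Hive's TRANSFORM hands rows to the script as tab-separated lines on standard input and reads its output lines back as rows. The slides do not show the script, so the following is only a minimal sketch of what it might look like; the ZIP_TO_COUNTY table, its two entries, and the function names are hypothetical.

```python
import io

# Hypothetical lookup table; a real script would load a full zip-to-county dataset.
ZIP_TO_COUNTY = {
    "94301": "Santa Clara",
    "10001": "New York",
}

def zip_to_county(zip_code):
    """Return the county for a zip code, or 'UNKNOWN' if it is not in the table."""
    return ZIP_TO_COUNTY.get(zip_code.strip(), "UNKNOWN")

def run_transform(rows, out):
    """Read tab-separated rows, write one county per input row."""
    for line in rows:
        # The query passes a single column (a.zip), so take the first field.
        zip_code = line.rstrip("\n").split("\t")[0]
        out.write(zip_to_county(zip_code) + "\n")

# In an actual geo.py the call would be run_transform(sys.stdin, sys.stdout)
# (after "import sys"). Here an in-memory buffer stands in for the rows Hive sends.
sample_out = io.StringIO()
run_transform(io.StringIO("94301\n10001\n00000\n"), sample_out)
```

Unknown zip codes are mapped to a placeholder value rather than dropped, so the outer group by still counts them.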
8. Auto-Scaling
insert overwrite table dest
select … from ads join campaigns
on … group by …;
[Diagram labels: StarCluster, AWS, Master (Job Tracker), Slaves, Map Tasks, Reduce Tasks, Progress, Demand, Supply]
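The auto-scaling slide contrasts demand (map and reduce tasks) with supply (slaves). One way to picture that comparison is a sizing rule that turns pending tasks into a target node count. This is a rough sketch under stated assumptions, not the StarCluster or Qubole logic: the slot counts per node, the node limits, and the function name are all made up for illustration.

```python
import math

def desired_slaves(pending_maps, pending_reduces,
                   map_slots_per_node=4, reduce_slots_per_node=2,
                   min_nodes=1, max_nodes=100):
    """Estimate how many slave nodes would cover the pending tasks in one wave."""
    # Nodes needed to run every pending map task at once (assumed slot count).
    map_nodes = math.ceil(pending_maps / map_slots_per_node)
    # Nodes needed to run every pending reduce task at once (assumed slot count).
    reduce_nodes = math.ceil(pending_reduces / reduce_slots_per_node)
    # Each node offers both kinds of slots, so the larger requirement wins.
    wanted = max(map_nodes, reduce_nodes)
    # Clamp to the assumed cluster limits.
    return max(min_nodes, min(max_nodes, wanted))

target = desired_slaves(pending_maps=40, pending_reduces=6)
```

A production policy would likely also weigh job progress and billing granularity before adding or removing nodes; this sketch ignores both.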
9. Spot Instances
• On average 50-60% cheaper than regular instances
• Fallback to regular instances when Spot unavailable
• Replace regular instances with Spot when available
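The two Spot rules on this slide (fall back to regular instances, then swap back to Spot) can be expressed as plain bookkeeping. The sketch below makes no cloud API calls; the function names and the idea of a known "spot_available" count are assumptions made for illustration only.

```python
def plan_purchase(nodes_needed, spot_available):
    """Fill as much of the request as possible with Spot, the rest with regular."""
    spot = min(nodes_needed, spot_available)
    regular = nodes_needed - spot  # fallback when Spot cannot cover the request
    return {"spot": spot, "regular": regular}

def plan_replacement(regular_running, spot_available):
    """How many running regular instances could be swapped for Spot right now."""
    return min(regular_running, spot_available)

initial = plan_purchase(nodes_needed=10, spot_available=4)
swaps = plan_replacement(regular_running=initial["regular"], spot_available=2)
```

In this example four of the ten nodes start as Spot, six fall back to regular instances, and two of those six can later be replaced once Spot capacity returns.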
10. Using Fast but ‘Thin’ nodes
• C3 instances: 50% better performance at 20% lower cost
• Little local storage
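Taking the C3 figures on this slide at face value, the two numbers combine into a cost per unit of work. The sketch assumes "performance" means throughput and "cost" means hourly price relative to the same baseline instance; the slide states neither, and the function name is hypothetical.

```python
def relative_cost_per_unit_of_work(perf_gain, cost_reduction):
    """Cost per unit of work relative to the baseline instance (baseline = 1.0)."""
    relative_price = 1.0 - cost_reduction   # e.g. 20% lower cost -> 0.8
    relative_throughput = 1.0 + perf_gain   # e.g. 50% better performance -> 1.5
    return relative_price / relative_throughput

ratio = relative_cost_per_unit_of_work(perf_gain=0.50, cost_reduction=0.20)
```

Under those assumptions the ratio is 0.8 / 1.5, or about 0.53, meaning the same work would cost a little over half as much as on the baseline.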
11. Using Fast but ‘Thin’ nodes
Modify Hadoop to use Network drives for overflow
[Diagram: Map-Reduce and HDFS send disk I/O to the local SSD, with overflow going to network drives]
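The overflow idea on this slide can be pictured as a simple placement rule: write to the local SSD while it has room, otherwise spill to a network drive. The slides do not describe how Hadoop was modified, so this is only a sketch; the function name, the mount paths in the example, and the injected free-space probe are assumptions.

```python
import shutil

def pick_write_dir(local_dir, network_dir, bytes_needed, free_bytes=None):
    """Prefer the local SSD; overflow to the network drive when it is too full."""
    if free_bytes is None:
        # Default probe: ask the operating system how much space is free.
        free_bytes = lambda path: shutil.disk_usage(path).free
    if free_bytes(local_dir) >= bytes_needed:
        return local_dir
    return network_dir

# Example with a fake probe so no real mount points are required.
fake_free = {"/mnt/ssd": 50, "/mnt/network": 10_000}
chosen = pick_write_dir("/mnt/ssd", "/mnt/network", bytes_needed=100,
                        free_bytes=lambda path: fake_free[path])
```

Passing the probe as a parameter keeps the rule testable without touching real disks; with the default probe the same function checks actual free space.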