Presentation at SVForum Cloud SIG, June 26, 2012. I described the MetaZeta.com cluster provisioning service and went into detail about how multiple clusters are coordinated despite Amazon Web Services EC2 request throttling. I also discussed techniques for fast spin-up.
The MetaZeta clusters system was created to spawn clusters for big data, Hadoop, Hive, and HBase training classes where each student gets a dedicated cluster. Screenshots of the clusters are also included.
2. Background of Paul Baclace
2005-2006 Internet Archive with Doug Cutting on Hadoop/Nutch
2008-2010 AT&T Interactive
2010-2012 Euclid Elements, Yoterra, Zettaset, GroupAngle.com, ProductSignals.com, ThirdEye, Hortonworks
July 13, 2012 MetaZeta.com 2
3. Hadoop Clusters for Training
• Generate pre-configured clusters
• Identical and independent
• Hadoop, HDFS, HBase, Hive, Pig
• Spawn N clusters for deadline
• Minimize setup needed by student
4. Cluster Requirements
• Access cluster via a single meta-page
• Avoid need for browser proxy or plugins
• No installation required for student laptop
• ssh is optional
17. Challenges
• Slow Package Installation Process
• Amazon EC2 throttling
• Failures after configuration changes
• Occasional failures of EC2 nodes
  Boot failure
  DNS server failure
  Package repo availability
18. Slow Package Installation Process
TotalTime = Nclusters * installLatency
installLatency = Npackages * repoLatency
Typical case repoLatency = 10-20sec
Worst case repoLatency = ∞
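To get a feel for the formulas above, here is a minimal sketch that plugs in numbers. The cluster count (20) and package count (30) are made-up values for illustration only, and the calculation assumes installs run strictly one after another, as the formula implies.

```python
def install_latency(n_packages, repo_latency_sec):
    """installLatency = Npackages * repoLatency"""
    return n_packages * repo_latency_sec


def total_time(n_clusters, n_packages, repo_latency_sec):
    """TotalTime = Nclusters * installLatency"""
    return n_clusters * install_latency(n_packages, repo_latency_sec)


if __name__ == "__main__":
    # Hypothetical class size and package list; not figures from the talk.
    n_clusters, n_packages = 20, 30
    for repo_latency_sec in (10, 20):  # typical case from the slide
        seconds = total_time(n_clusters, n_packages, repo_latency_sec)
        print(f"repoLatency={repo_latency_sec}s: {seconds / 3600:.1f} hours")
```

Even in the typical case, the total grows with the product of clusters and packages, which is why per-package repo latency dominates.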
19. Slow Package Installation Process
Solution:
• Pre-install everything on custom AMI
• Custom AMI can be slower to load
20. Amazon EC2 throttling
EC2 API Request Rate
At human speeds:
• 100-2000msec latency
• Short sleep in between
Remove sleep time:
• 2-20sec latency
Overlap requests in parallel:
• HTTP 500 (no donut for you)
21. Amazon EC2 throttling
Solution:
• Avoidance by rate-limiting all requests
• Use heuristics to estimate lead-time needed to spawn N clusters
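The two points above can be sketched as follows. This is not the MetaZeta implementation: `RateLimiter`, `estimate_lead_time_sec`, and all parameter names are hypothetical, and the lead-time formula is just one plausible heuristic (every request waits its turn, plus a fixed boot allowance). The limiter funnels every request through one shared gate so that request starts are spaced at least a minimum interval apart, even when callers run in parallel threads.

```python
import threading
import time


class RateLimiter:
    """Spaces out request starts by at least min_interval_sec across all threads."""

    def __init__(self, min_interval_sec, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval_sec
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = clock()

    def call(self, request_fn, *args, **kwargs):
        # Reserve the next free time slot under the lock, then wait outside it.
        with self._lock:
            now = self._clock()
            start = max(now, self._next_allowed)
            self._next_allowed = start + self._min_interval
        delay = start - now
        if delay > 0:
            self._sleep(delay)
        return request_fn(*args, **kwargs)


def estimate_lead_time_sec(n_clusters, requests_per_cluster,
                           min_interval_sec, boot_time_sec):
    """Assumed heuristic: all requests are serialized, then nodes boot."""
    return n_clusters * requests_per_cluster * min_interval_sec + boot_time_sec


if __name__ == "__main__":
    limiter = RateLimiter(min_interval_sec=0.1)
    for i in range(3):
        # The lambda stands in for a real API request.
        print(limiter.call(lambda n=i: f"request {n} sent"))
    print(estimate_lead_time_sec(10, 5, 2.0, 120), "seconds of lead time")
```

Reserving the slot under the lock but sleeping outside it keeps threads from blocking each other longer than necessary, while still guaranteeing the spacing.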
22. EC2 or Config Failures
Solution:
• Acceptance Testing of
  HDFS
  Map-Reduce
  Hive
  HBase
  Hive + HBase
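One way to organize such acceptance tests is a small harness that runs a named check per component and reports which ones failed. The sketch below is an assumption about structure only: `run_acceptance_tests` is a hypothetical helper, and the lambdas are placeholders where real checks (for example, writing and reading a file in HDFS) would go.

```python
from typing import Callable, Dict, List


def run_acceptance_tests(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each named check in order; return the names of the checks that failed.

    A check fails if it returns a falsy value or raises an exception.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Placeholder checks; real ones would exercise the cluster itself.
    checks = {
        "HDFS": lambda: True,
        "Map-Reduce": lambda: True,
        "Hive": lambda: True,
        "HBase": lambda: True,
        "Hive + HBase": lambda: True,
    }
    failed = run_acceptance_tests(checks)
    print("cluster accepted" if not failed else f"failed checks: {failed}")
```

A cluster that fails any check can then be discarded and respawned rather than handed to a student.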
24. Credits
Thank you to:
• Tom White for starting Whirr
• Adrian Cole for starting jclouds
• All the contributors to each project
Photo Credit: Paul Baclace
• Hadoop and Cloud Computing Synergy
  Open Source means no license fee per node
  Cloud computing enables anyone to use Hadoop