2. About Me
• Facebook (c 2007)
– Ran/managed Hadoop for ~3 years
– Wrote Hive
– Mentor/PM for the Hadoop Fair Scheduler
– Used Hadoop/Hive (as a Warehouse/ETL dev)
– Rewrote significant chunks of Hadoop job scheduling (incl. Corona)
• Qubole (c 2014)
– Running the world’s largest Hadoop clusters on AWS
3. The Crime
Shared Hadoop Clusters
• Statistical multiplexing
• Largest jobs only fit on pooled hardware
• Data locality
• Easier to manage
4. … and the Punishment
• “Have you no Hadoop Etiquettes?” (c 2007)
(reducer count capped in response)
• User takes down entire Cluster (OOM) (c 2007-09)
• Bad Job slows down entire Cluster (c 2009)
• Steady State Latencies get intolerable (c 2010-)
• “How do I know I am getting my fair share?” (c 2011)
• “Too few reducer slots, cluster idle” (c 2013)
7. And then there’s Hadoop (1.x) …
• Single JobTracker for all Jobs
– Does not scale; a single point of failure (SPOF)
• Pull-Based Architecture
– Scalability and low latency at permanent war
– Inefficient: leaves slots idle
• Slot-Based Scheduling
– Inefficient
• Pessimistic Locking in the Tracker
– Scalability bottleneck
• Long-Running Tasks
– Fairness and efficiency at permanent war
8. Poll Driven Scheduling
insert overwrite table dest
select … from ads join
campaigns on …group by …;
Map Tasks
Job Tracker
Master
ReduceTasks
Heartbeat
MapTask
TaskTracker
Slave
Child
8
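
A minimal sketch of the pull model above, with hypothetical names (not the real Hadoop 1.x RPC interfaces): a TaskTracker heartbeats on a fixed interval, and the JobTracker can hand out work only inside the heartbeat response, so a slot that frees up just after a beat sits idle until the next one.

class PullingTaskTracker {
    static final long HEARTBEAT_MS = 3000;        // fixed polling interval
    interface JobTrackerStub {
        String[] heartbeat(int freeSlots);        // response carries new tasks
    }
    void run(JobTrackerStub jt) throws InterruptedException {
        while (true) {
            // pull: the only moment the JobTracker can assign us work
            String[] tasks = jt.heartbeat(freeMapSlots() + freeReduceSlots());
            for (String task : tasks) launchInChildJvm(task);
            Thread.sleep(HEARTBEAT_MS);           // freed slots idle until next beat
        }
    }
    int freeMapSlots()    { return 1; }           // stubbed slot accounting
    int freeReduceSlots() { return 1; }
    void launchInChildJvm(String task) { /* fork a Child JVM for the task */ }
}

Multiply one beat-interval of idle time by thousands of trackers and the scalability-vs-latency war of the previous slide falls out directly.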
9. Pessimistic Locking

// The whole scheduling scan runs while holding one global lock:
Task getBestTask() {
    for (Pool pool : sortedPools)
        for (Job job : pool.sortedJobs())
            for (Task task : job.tasks())
                if (betterMatch(task)) …      // track the best candidate so far
}

Task processHeartbeat() {
    synchronized (world) {                    // one cluster-wide lock
        return getBestTask();                 // O(pools × jobs × tasks) inside it
    }
}
10. Slot-Based Scheduling
• N CPUs, M map slots, R reduce slots
– Memory cannot be oversubscribed!
• How to divide?
– M < N: not enough map slots at times
– R < N: not enough reduce slots at times
– M = R = N: enough memory to run 2N tasks?
• Reduce tasks are problematic
– Network-intensive at first (shuffle), so CPU sits wasted
– Memory-intensive later
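
A back-of-envelope sketch (the numbers are illustrative assumptions, not from the deck) of why a static M/R split wastes cores: a map-only phase can occupy at most M cores and a reduce-only phase at most R, regardless of N.

class SlotWaste {
    public static void main(String[] args) {
        int n = 16, m = 10, r = 6;                 // cores, map slots, reduce slots
        int mapOnlyIdle    = Math.max(0, n - m);   // cores idle during a map-only phase
        int reduceOnlyIdle = Math.max(0, n - r);   // cores idle during a reduce-only phase
        System.out.printf("map-only: %d/%d cores idle%n", mapOnlyIdle, n);
        System.out.printf("reduce-only: %d/%d cores idle%n", reduceOnlyIdle, n);
        // Setting m = r = n removes the idle cores but needs memory for 2n tasks.
    }
}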
11. Long-Running Reducers
• Online Scheduling
– No advance information about future workload
• Greedy + Fair Scheduling
– Schedule tasks ASAP
– Preempt if the workload that later arrives disagrees
• Long-Running Reducers
– Preemption causes a restart and wasted work
– No effective way to use short bursts of idle CPU
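
A hypothetical sketch of that trade-off (names and numbers are ours): preempting a reducer throws away all the work it has done, so reclaiming its slot pays off only when the slot stays useful for longer than the work discarded, which is almost never true for a long-running reducer and a short idle burst.

class PreemptionCost {
    static boolean worthPreempting(long taskElapsedMs, long reclaimedUseMs) {
        // a preempted task restarts from scratch, wasting taskElapsedMs of work
        return reclaimedUseMs > taskElapsedMs;
    }
    public static void main(String[] args) {
        // a reducer two hours in vs. a five-minute burst of idle CPU
        System.out.println(worthPreempting(2 * 3_600_000L, 5 * 60_000L)); // false
    }
}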
12. Optimistic Locking

// Scan outside the global lock, iterating over snapshots; lock each
// task only briefly while inspecting it:
Task[] getBestTaskCandidates() {
    for (Pool pool : sortedPools)
        for (Job job : pool.sortedJobs.clone())    // snapshot; may go stale
            for (Task task : job.tasks.clone())
                synchronized (task) {              // per-task lock only
                    …
                }
}

Task[] processHeartbeat() {
    Task[] tasks = getBestTaskCandidates();        // no global lock held here
    synchronized (world) {
        return acquireTasks(tasks);                // re-validate, commit briefly
    }
}
13. Corona: Push Scheduling
1. JT subscribes for M maps and R reduces
– Receives availability from the Cluster Manager (CM)
2. CM publishes availability ASAP
– Pushes events to the JT
3. JT pushes tasks to available TTs
– In parallel
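
A minimal sketch of that push flow, with hypothetical interface names (the real Corona RPC classes differ):

interface CmStub {
    void subscribe(JtCallback jt, int maps, int reduces);  // 1. JT subscribes once
}
interface JtCallback {
    void onGrant(String taskTrackerHost, int slots);       // 2. CM pushes availability
}
class PushingJobTracker implements JtCallback {
    public void onGrant(String taskTrackerHost, int slots) {
        // 3. push tasks to the granted TT immediately, no heartbeat round-trip;
        //    grants for different nodes are processed in parallel
        for (int i = 0; i < slots; i++) pushTask(taskTrackerHost);
    }
    void pushTask(String host) { /* RPC straight to the TaskTracker */ }
}

The latency win is that a freed slot turns into a running task in one push instead of waiting out the next heartbeat poll.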
14. Corona/YARN: Scalability
1. JobTracker for each Job is now independent
– More fault-tolerant and isolated as well
2. Centralized Cluster/Resource Manager
– Must be super-efficient!
3. Fundamental Differences
– Corona ~ latency
– YARN ~ heterogeneous workloads
15. Pesky Reducers
• Hadoop 2 removes the distinction between M and R slots
• Not enough:
– Reduce tasks don’t use much CPU in shuffle
– Still long-running and bad to preempt
Re-architect to run millions of small Reducers
16. The Future is Cloudy
• Data Center Assumption:
– Cluster characteristics are known
– Job spec is fit to the cluster
• In the Cloud:
– Cluster can grow/shrink and change node type
– Job spec must be dynamic
– Uniform task configuration is untenable
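
A hypothetical sketch of a dynamic job spec (names are ours): derive task counts from the cluster's size at submit time rather than baking them into the job.

class DynamicSpec {
    static int reducerCount(int liveNodes, int slotsPerNode, double waves) {
        // scale reducers with whatever the elastic cluster looks like right now
        return (int) Math.ceil(liveNodes * slotsPerNode * waves);
    }
    public static void main(String[] args) {
        // the same job on yesterday's 10-node cluster and today's 40-node one
        System.out.println(reducerCount(10, 8, 1.0));  // 80
        System.out.println(reducerCount(40, 8, 1.0));  // 320
    }
}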