1. Introduction Implementation Concluding Remarks Appendix
Dynamic Fractional Resource Scheduling
Practical Issues and Future Directions
Mark Stillwell Fr´ed´eric Vivien Henri Casanova
INRIA, Universit´e de Lyon, LIP
International Research Workshop on Advanced High
Performance Computing Systems
Cetraro, Italy
June 27–29, 2011
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
2. Introduction Implementation Concluding Remarks Appendix
Outline
Introduction
Implementation
Concluding Remarks
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
3. Introduction Implementation Concluding Remarks Appendix
Outline
Introduction
Implementation
Concluding Remarks
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
4. Introduction Implementation Concluding Remarks Appendix
Dynamic Fractional Resource Scheduling
HPC Scheduling Problem:
homogeneous nodes with high-speed interconnect
network file system
release date and number of tasks known at submission
computation time known only after completion
goal: minimize maximum stretch
approach:
based on virtual machine technology
fractional resource allocations
preemption / migration
on-line scheduling problem → sequence of off-line resource
allocation problems
computable, on-line performance metric related to max
stretch
heuristics for task placement/resource allocation
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
5. Introduction Implementation Concluding Remarks Appendix
Task Placement and Resource Allocation Problem
nodes comprise a number of resources (CPU cores,
memory, bandwith, etc...)
tasks have rigid requirements and fluid needs
job tasks have the same requirements and needs
rigid requirements specify minimum resource allocations
fluid needs specify minimum allocations without
performance degradation
yield of a job is a value between 0 and 1
correlated with performance
jobs allocated product of yield and need in each resource
Goal: find an allocation that maximizes the minimum yield
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
6. Introduction Implementation Concluding Remarks Appendix
Task Placement Heuristics
Greedy Task Placement – incremental, puts each task on
the node with the lowest computational load on which it
can fit without violating memory constraints
MCB Task Placement – global, iteratively applies
multi-capacity (vector) bin-packing heuristics during a
binary search for the maximized minimum yield
achieves higher minimum yield values than Greedy
can potentially cause lots of migration
what if the system is oversubscribed?
use a priority function to decide which jobs to run
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
7. Introduction Implementation Concluding Remarks Appendix
Priority Function?
Virtual Time: subjective time experienced by a job
first idea: 1
VIRTUAL TIME
informed by ideas about fairness
lead to good results
theoretically prone to starvation
second idea: FLOW TIME
VIRTUAL TIME
addresses starvation problem
lead to poor performance
Third Idea: FLOW TIME
(VIRTUAL TIME)2
combines idea #1 and idea #2
addresses starvation
performs about the same as first priority function
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
8. Introduction Implementation Concluding Remarks Appendix
When to apply Heuristics
We consider a number of different options:
Job Submission – heuristics can use greedy or bin packing
approaches
Job Completion – as above, can help with throughput
when there are lots of short running jobs
Periodically – some heuristics periodically apply vector
packing to improve overall job placement
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
9. Introduction Implementation Concluding Remarks Appendix
Methodology
experiments conducted using discrete event simulator
mix of synthetic and real trace data
considered only memory requirements and CPU needs
preemption/migration of jobs causes 300 second penalty to
runtime
periodic approaches use a 600 second (10 minute) period
absolute bound on max stretch computed for each instance
performance comparison based on degradation from max
stretch bound
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
10. Introduction Implementation Concluding Remarks Appendix
Simulation Algorithms
FCFS
EASY
Greedy*
GreedyPM*
/per
GreedyPM*/per
MCB*/per
GreedyPM*/per/mvt
MCB*/per/mvt
/stretch-per
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
11. Introduction Implementation Concluding Remarks Appendix
Max Stretch Degradation vs. Load, 5 minute penalty
1
10
100
1000
10000
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
MaxstretchdegradationFromBound
Load
FCFS
EASY
Greedy*
GreedyPM*
/per
GreedyPM*/per
MCB*/per
/stretch-per
MCB*/per/mvt
GreedyPM*/per/mvt
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
12. Introduction Implementation Concluding Remarks Appendix
Bandwidth vs. Period
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0.55
2000 4000 6000 8000 10000 12000
AverageBandwidth(GB/s)
Period (seconds)
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
13. Introduction Implementation Concluding Remarks Appendix
Max Stretch Degradation vs. Period
5
6
7
8
9
10
11
12
13
14
2000 4000 6000 8000 10000 12000
AverageMaxstretchDegradation
Period (seconds)
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
14. Introduction Implementation Concluding Remarks Appendix
Outline
Introduction
Implementation
Concluding Remarks
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
15. Introduction Implementation Concluding Remarks Appendix
Development Platform
Grid5000 (http://www.grid5000.fr)
Kadeploy images
Genepi Nodes: Quad Core Xeon 2.5GHz w/ 8GB ram
Open-Source Software
Xen 4.0
Debian Squeeze / Linux 2.6.32
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
16. Introduction Implementation Concluding Remarks Appendix
Goals
“small” experiment topics:
VM overhead
resource requirement/needs estimation
time sharing performance impact
preemption/migration costs
bandwidth consumption
performance impact
migration planning strategies
“large” experiment topic: DFRS in practice
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
17. Introduction Implementation Concluding Remarks Appendix
Goals
“small” experiment topics:
VM overhead
resource requirement/needs estimation
time sharing performance impact
preemption/migration costs
bandwidth consumption
performance impact
migration planning strategies
“large” experiment topic: DFRS in practice
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
18. Introduction Implementation Concluding Remarks Appendix
Goals
“small” experiment topics:
VM overhead
resource requirement/needs estimation
time sharing performance impact
preemption/migration costs
bandwidth consumption
performance impact
migration planning strategies
“large” experiment topic: DFRS in practice
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
19. Introduction Implementation Concluding Remarks Appendix
Goals
“small” experiment topics:
VM overhead
resource requirement/needs estimation
time sharing performance impact
preemption/migration costs
bandwidth consumption
performance impact
migration planning strategies
“large” experiment topic: DFRS in practice
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
20. Introduction Implementation Concluding Remarks Appendix
Goals
“small” experiment topics:
VM overhead
resource requirement/needs estimation
time sharing performance impact
preemption/migration costs
bandwidth consumption
performance impact
migration planning strategies
“large” experiment topic: DFRS in practice
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
21. Introduction Implementation Concluding Remarks Appendix
Estimating Resource Needs
Techniques:
repeated runs for benchmarking / model building
online model building
introspection / configuration variation
Models:
constant values
time-varying values
random variables / probability distributions
Questions:
how accurate can we be?
how accurate do we need to be?
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
22. Introduction Implementation Concluding Remarks Appendix
Benchmark Task VM Memory Requirements
bench/tasks 1 2 3 4
CG 477M 251M – 147M
EP <64M <64M <64M <64M
FT 1930M 977M – 499M
IS 637M 329M – 176M
LU 218M 125M – 80M
MG 485M 259M – 147M
SP 369M – – 123M
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
23. Introduction Implementation Concluding Remarks Appendix
Time Sharing Performance Impact
Process thrashing!!!
Gang Scheduling
Flexible Co-scheduling
Uncoordinated might not be so bad...
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
24. Introduction Implementation Concluding Remarks Appendix
VM Consolidation
2
4
6
8
10
12
14
1 2 3 4 5 6 7 8
slowdown
consolidation factor (maximum tasks/core)
y = x
CG
EP
FT
IS
LU
MG
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
25. Introduction Implementation Concluding Remarks Appendix
Parallel Task Preemption/Migration
preemption of parallel distributed jobs is theoretically
difficult
in the general case this may require coordinated shutdown
on the target platform everything happens quickly enough
that it isn’t an issue
migration can be done by preempt/restore or “live”
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
26. Introduction Implementation Concluding Remarks Appendix
Preemption/Migration Timing Results
0
10
20
30
40
50
60
70
80
90
100
400 600 800 1000 1200 1400 1600 1800 2000
Time(seconds)
Memory Allocation (MB)
Preempt
Restore
Preempt+Restore
Migrate
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
27. Introduction Implementation Concluding Remarks Appendix
Migration Planning
using only live migration is NP-hard
could fail (circular dependencies)
heuristic: try to move the highest priority job
if it can’t move, try next highest
if none can move, preempt the lowest priority job scheduled
for migration and try again
restore any preempted jobs at the end
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
28. Introduction Implementation Concluding Remarks Appendix
Planned Implementation Validation Experiment
generate traces using an accepted model
annotate them with CPU/memory values
”fill in” the traces using benchmarks...
NAS parallel benchmarks (HPC)
TCPP benchmarks (Service Hosting)
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
29. Introduction Implementation Concluding Remarks Appendix
Future Work
complete development of practical implementation
extend the model
nonlocal resources
heterogeneous nodes
varying communications delays
new or arbitrary optimization targets
fault tolerance
energy or other costs
formal definitions of fairness
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
30. Introduction Implementation Concluding Remarks Appendix
Summary
DFRS performs well in simulation
orders of magnitude better than batch
close to a theoretical bound
prototype in development
preliminary results are promising
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice
31. Introduction Implementation Concluding Remarks Appendix
References I
Stillwell, M., Schanzenbach, D., Vivien, F., and Casanova,
H. (2010).
Resource allocation algorithms for virtualized service
hosting platforms.
Journal of Parallel and Distributed Computing,
70(9):962–974.
M. Stillwell, F. Vivien, H. Casanova INRIA
DFRS in Practice