Large models such as GPT-3 & ChatGPT have transformed deep learning (DL), powering applications that have captured the public’s imagination. These models are rapidly being adopted across domains for analytics on various modalities, often by finetuning pre-trained base models. Such models need multiple GPUs due to both their size and computational load, driving the development of a bevy of “model parallelism” techniques & tools. Navigating such parallelism choices, however, is a new burden for end users of DL such as data scientists, domain scientists, etc. who may lack the necessary systems know-how. The need for model selection, which leads to many models to train due to hyper-parameter tuning or layer-wise finetuning, compounds the situation with two more burdens: resource apportioning and scheduling. In this work, we tackle these three burdens for DL users in a unified manner by formalizing them as a joint problem that we call SPASE: Select a Parallelism, Allocate resources, and Schedule. We propose a new information system architecture to tackle the SPASE problem holistically, representing a key step toward enabling wider adoption of large DL models. We devise an extensible template for existing parallelism schemes and combine it with an automated empirical profiler for runtime estimation. We then formulate SPASE as an MILP. We find that direct use of an MILP solver is significantly more effective than several baseline heuristics. We optimize the system runtime further with an introspective scheduling approach. We implement all these techniques into a new data system we call Saturn. Experiments with benchmark DL workloads show that Saturn achieves 39-49% lower model selection runtimes than typical current DL practice.
5. Fine-Tuning & Applications
• Off-the-shelf models have to be fine-tuned and adapted
• Model is big…data might not be
• Model Selection is critical - motivating multi-model
• Democratizing fine-tuning for domain scientists & practitioners
6. Critical Challenges - Parallelism
• Parallelism has become essential but complex
• Model Parallel?
• Pipelining?
• Offloading?
• Data Parallel / Sharded Data Parallel?
• Hybrids?
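To make the trade-offs among these techniques concrete, here is a toy per-GPU memory cost model. All numbers and cost formulas are illustrative assumptions for this sketch, not measurements from Saturn or any real framework:

```python
# Toy per-GPU memory model contrasting parallelism techniques.
# The cost formulas and the 16 GB capacity are illustrative assumptions.

def per_gpu_memory_gb(technique, model_gb, n_gpus):
    """Rough GPU memory needed per device for a model of `model_gb` GB."""
    if technique == "data_parallel":          # full replica on every GPU
        return model_gb
    if technique == "sharded_data_parallel":  # parameters sharded (FSDP-style)
        return model_gb / n_gpus
    if technique == "pipeline":               # layers split across GPUs
        return model_gb / n_gpus
    if technique == "offloading":             # bulk of state held in CPU RAM
        return model_gb * 0.1
    raise ValueError(technique)

def feasible(technique, model_gb, n_gpus, gpu_capacity_gb=16):
    return per_gpu_memory_gb(technique, model_gb, n_gpus) <= gpu_capacity_gb

# A 40 GB model does not fit under plain data parallelism on 16 GB GPUs,
# but sharding it across 4 GPUs does.
print(feasible("data_parallel", 40, 4))          # False
print(feasible("sharded_data_parallel", 40, 4))  # True
```

The point: which technique is even *feasible* depends jointly on model size, GPU count, and GPU memory, which is why the choice is hard to make by hand.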
7. Critical Challenges - Resource Allocation
• Non-Linear Scaling Complicates Resource Apportioning
• In a multi-job, how should GPUs be distributed?
• How does each model’s performance scale?
• Local performance vs global throughput
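A small sketch of why non-linear scaling complicates apportioning. The diminishing-returns scaling curve below is made up for illustration; a real system (like Saturn's profiler) would measure each model's curve empirically:

```python
# Sketch of GPU apportioning under non-linear scaling. The scaling model
# (each extra GPU is only 80% as useful as the last) is an assumption.

def runtime_hours(base_hours, n_gpus, efficiency=0.8):
    """Diminishing-returns model of runtime vs. GPU count."""
    speedup = sum(efficiency ** k for k in range(n_gpus))
    return base_hours / speedup

def greedy_allocate(base_times, total_gpus):
    """Give each job 1 GPU, then repeatedly add a GPU wherever it most
    reduces total runtime."""
    alloc = {job: 1 for job in base_times}
    for _ in range(total_gpus - len(base_times)):
        best = max(
            base_times,
            key=lambda j: runtime_hours(base_times[j], alloc[j])
                          - runtime_hours(base_times[j], alloc[j] + 1),
        )
        alloc[best] += 1
    return alloc

# Two jobs, 8 GPUs: the longer job receives more GPUs, but not all of them.
print(greedy_allocate({"vit": 30.0, "gpt": 10.0}, total_gpus=8))
```

Note the outcome is an uneven split driven by marginal gains, not a naive 4/4 division: local per-job speedup and global throughput pull in different directions.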
8. Critical Challenges - Scheduling
• Scheduling requires both local & global understanding
• What’s the estimated runtime of each job?
• How can I most effectively utilize my GPUs to minimize makespan?
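As a baseline intuition for makespan minimization, here is the classic longest-processing-time-first (LPT) greedy heuristic. Saturn solves the joint problem with an MILP instead; LPT is shown only as the textbook greedy comparison point:

```python
import heapq

# Minimal LPT scheduling sketch: assign each job (longest first) to the
# earliest-free worker group; return the resulting makespan.

def lpt_makespan(job_runtimes, n_workers):
    workers = [0.0] * n_workers          # current finish time per worker
    heapq.heapify(workers)
    for rt in sorted(job_runtimes, reverse=True):
        earliest = heapq.heappop(workers)
        heapq.heappush(workers, earliest + rt)
    return max(workers)

print(lpt_makespan([8, 7, 6, 5, 4], n_workers=2))  # 17.0
```

On this instance LPT yields a makespan of 17, while the optimum is 15 (8+7 vs. 6+5+4), which illustrates why an exact MILP solve can beat greedy heuristics.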
11. SPASE: A New Optimization Problem
Given a multi-job of large models, we have to…
• Select a parallelism: pipeline parallel or data parallel?
• Allocate resources: how many GPUs per job?
• Schedule jobs: A before B, or B before A?
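The three decisions interact, which is what makes SPASE a joint problem. A toy exhaustive search over a hypothetical 2-job, 4-GPU instance makes this concrete (the runtime table is invented; Saturn instead profiles real runtimes and solves an MILP):

```python
from itertools import product

# runtime[job][(parallelism, n_gpus)] -> hours (hypothetical numbers)
RUNTIME = {
    "A": {("ddp", 1): 12, ("ddp", 2): 7, ("ddp", 3): 5,
          ("pipeline", 1): 14, ("pipeline", 2): 8, ("pipeline", 3): 6},
    "B": {("ddp", 1): 6, ("ddp", 2): 4, ("ddp", 3): 3,
          ("pipeline", 1): 5, ("pipeline", 2): 3, ("pipeline", 3): 2.5},
}

def solve(total_gpus=4):
    """Brute-force the joint choice of parallelism + GPUs for both jobs,
    assuming they run concurrently, to minimize the makespan."""
    best = None
    for (pa, ga), (pb, gb) in product(RUNTIME["A"], RUNTIME["B"]):
        if ga + gb != total_gpus:
            continue  # concurrent jobs: allocations must sum to the pool
        makespan = max(RUNTIME["A"][(pa, ga)], RUNTIME["B"][(pb, gb)])
        if best is None or makespan < best[0]:
            best = (makespan, (pa, ga), (pb, gb))
    return best

print(solve())
```

Even on this tiny instance, the best plan mixes techniques: job A uses DDP on 3 GPUs while job B uses pipelining on 1 GPU. Picking each job's best parallelism in isolation would not find this.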
12. Saturn - A SPASE System
Components:
1. Library
2. Profiler
3. Joint Optimizer
4. Executor
Users interact with Saturn via parallelism registration and job submission.
13. Saturn - A SPASE System
Library: register & retrieve parallelism techniques
Already supports popular techniques such as pipelining, DDP, FSDP, and more!
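As a rough sketch of what registering and retrieving a technique could look like (the class and method names below are hypothetical illustrations, not Saturn's actual API):

```python
# Hypothetical parallelism-technique registry sketch.

class ParallelismRegistry:
    def __init__(self):
        self._techniques = {}

    def register(self, name):
        """Decorator that stores an execution routine under `name`."""
        def wrap(fn):
            self._techniques[name] = fn
            return fn
        return wrap

    def retrieve(self, name):
        return self._techniques[name]

registry = ParallelismRegistry()

@registry.register("ddp")
def run_ddp(model, gpus):
    return f"running {model} with DDP on {gpus} GPUs"

print(registry.retrieve("ddp")("bert", 4))
```

The extensibility point is that new parallelism schemes plug in through one uniform template rather than bespoke integration code.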
14. Saturn - A SPASE System
Profiler: performance estimates for each model under each parallelism & possible apportionment
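The core idea of empirical profiling can be sketched as timing a few trial batches under each (parallelism, GPU count) configuration and extrapolating. The trial function below is a stand-in for a real short training run:

```python
import time

def profile(trial_fns, gpu_counts, n_batches=3):
    """Return estimated seconds-per-batch for each configuration."""
    estimates = {}
    for name, fn in trial_fns.items():
        for g in gpu_counts:
            start = time.perf_counter()
            for _ in range(n_batches):
                fn(g)                      # run one (simulated) batch
            elapsed = time.perf_counter() - start
            estimates[(name, g)] = elapsed / n_batches
    return estimates

# Stand-in "batch" that gets cheaper with more GPUs.
def fake_batch(gpus):
    time.sleep(0.01 / gpus)

print(profile({"ddp": fake_batch}, gpu_counts=[1, 2]))
```

Profiling only short trial runs keeps the estimation cost small relative to full training, while still capturing each model's actual scaling behavior.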
15. Saturn - A SPASE System
Introspective Solver: MILP-solving tool that takes profiler results and hardware information and produces, for each model: a parallelism selection, a GPU allocation, and a start time.
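A simplified sketch of an MILP in the spirit of SPASE, with binary variable $x_{j,p,g}$ choosing parallelism $p$ and $g$ GPUs for job $j$, profiled runtimes $T_{j,p,g}$, GPU pool size $G$, and makespan $M$ (the paper's actual formulation also encodes start times and time-varying GPU usage, which are omitted here):

```latex
\begin{align*}
\min\; & M \\
\text{s.t. } & \textstyle\sum_{p,g} x_{j,p,g} = 1 \quad \forall j
  && \text{(one parallelism \& allocation per job)} \\
& \textstyle\sum_{j,p,g} g\, x_{j,p,g} \le G
  && \text{(GPU pool respected)} \\
& \textstyle\sum_{p,g} T_{j,p,g}\, x_{j,p,g} \le M \quad \forall j
  && \text{(makespan bounds every job)} \\
& x_{j,p,g} \in \{0,1\}
\end{align*}
```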
17. Evaluations: Single-Node, 8-GPU
Baseline ("Standard Practice"): 8 GPUs per model, run in sequence.
• ViT workload: Standard Practice 30.6 hours → Saturn 17.4 hours (1.76X speedup!)
• GPT workload: Standard Practice 19.05 hours → Saturn 10.75 hours (1.77X speedup!)
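The reported speedups follow directly from the reported runtimes:

```python
# Speedup = Standard Practice hours / Saturn hours (single-node, 8-GPU).
vit = 30.6 / 17.4
gpt = 19.05 / 10.75
print(round(vit, 2), round(gpt, 2))  # 1.76 1.77
```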
18. Evaluations: Two-Node, 16-GPU
Baseline ("Standard Practice"): 8 GPUs per model, run in sequence.
• ViT workload: Standard Practice 14.57 hours → Saturn 8.23 hours (1.77X speedup!)
• GPT workload: Standard Practice 10.15 hours → Saturn 5.17 hours (1.96X speedup!)
20. Conclusion
• Modern DL scale challenges motivate automated, easy-to-use, and resource-efficient training systems
• We should consider DL efficiency holistically
• Saturn, the first work to tackle the new joint problem of parallelism selection, resource allocation, and scheduling, demonstrates 40-50% runtime reductions