Improving the Scalability of Transparent Checkpointing for GPU Computing Systems
1. Improving the Scalability of Transparent Checkpointing for GPU Computing Systems
The 2012 IEEE Region 10 Conference
(TENCON 2012)
Cebu, Philippines
November 21, 2012
Alfian Amrizal, S. Hirasawa, K. Komatsu, H. Takizawa, H. Kobayashi
Tohoku University
2. Outline
• Introduction
• Two-level CheCL
• Performance Model
• Evaluation and Discussion
• Conclusion
2
3. High-Performance Computing & Checkpoint
• High-performance computing (HPC) systems are getting faster
and larger in scale
– Consist of huge numbers of CPUs and GPUs
– Probability of encountering failures also increases
• Checkpoint/restart (CPR) tools are important to ensure that HPC systems can successfully finish their calculations
– Long-running applications, e.g., SPECFEM3D
[Figure: a CPU-GPU heterogeneous HPC system]
3
4. Difficulties in CPR of Heterogeneous Systems
• Heterogeneous systems use both CPUs and GPUs
• Conventional CPR tools such as BLCR and DMTCP do not support GPUs ⇒ CPR fails
[Figure: a compute node with a CPU (host memory) and a GPU (device memory) running the SCR-style checkpoint snippet (SCR_Start_checkpt(); SCR_Route_file(fn,fn2); fwrite(data,…); SCR_Complete_checkpt();). Conventional CPR tools save only the CPU state; CheCL allows conventional tools to save the GPU state as well.]
• CheCL has been developed for checkpointing OpenCL
applications running on CPU-GPU systems [Takizawa, IPDPS’11]
4
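Note: the host-side snippet in the figure follows the SCR library's checkpoint API [Moody, SC'10]; the slide abbreviates SCR_Start_checkpoint() and SCR_Complete_checkpoint(). A minimal sketch of one checkpoint write is given below (the file name, buffer, and error handling are illustrative, not taken from the paper). BLCR saves the CPU process image, and CheCL intercepts the OpenCL API so that device-side state can be re-created on restart, so no GPU-specific code is needed here.

#include <stdio.h>
#include "scr.h"   /* SCR (Scalable Checkpoint/Restart) library [Moody, SC'10] */

/* Write one application-level checkpoint of a host buffer.
   GPU (OpenCL) state is handled transparently by CheCL + BLCR,
   so only host data has to be written here. */
static int write_checkpoint(const void *data, size_t size, int step)
{
    char name[SCR_MAX_FILENAME];
    char path[SCR_MAX_FILENAME];
    int valid = 0;

    SCR_Start_checkpoint();              /* "SCR_Start_checkpt()" on the slide */
    snprintf(name, sizeof(name), "ckpt.%d.dat", step);
    SCR_Route_file(name, path);          /* SCR decides where the file actually goes */

    FILE *fp = fopen(path, "wb");
    if (fp != NULL) {
        valid = (fwrite(data, 1, size, fp) == size);
        fclose(fp);
    }
    SCR_Complete_checkpoint(valid);      /* "SCR_Complete_checkpt()" on the slide */
    return valid;
}

SCR_Route_file() is what lets the same application code target either node-local or global storage, which the two-level scheme proposed later exploits.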
5. Difficulties in CPR of Heterogeneous Systems
• Problem: checkpointing time increases with the # of nodes
5
6. Writing Checkpoints to Global Storage is Ineffective
• To withstand failures, large-scale heterogeneous systems need to checkpoint more frequently to the global storage (low BW)
• However, the global storage is shared among nodes
⇒ CheCL's checkpoint time increases with the # of nodes
• CheCL is not scalable: the larger the number of nodes, the longer it takes to checkpoint
[Figure: compute nodes each running the SCR checkpoint snippet and writing checkpoint files over the network to the shared global storage, causing network contention.]
• Objective
– To establish an effective implementation of the checkpointing mechanism for heterogeneous HPC systems
6
8. Outline
• Introduction
• Two-level CheCL
• Performance Model
• Evaluation and Discussion
• Conclusion
8
9. Local CheCL
• Avoid the network by utilizing each node's local storage
– Simultaneous checkpointing → Fast
– Less reliable
[Figure: each compute node runs the SCR checkpoint snippet and writes its checkpoint file to newly added node-local storage; the write to the global storage (large and reliable, but slow) is avoided.]
9
12. Two-level CheCL
• Writing ckpt files to the global storage is more reliable but time-consuming
• Using the local storage of each compute node is fast but sacrifices reliability
• Proposal: Two-level CheCL uses both local and global storage ⇒ Local CheCL + Global CheCL (sketched below)
[Figure: compute nodes running the SCR checkpoint snippet, each with its own local storage, plus a shared global storage; Two-level CheCL writes checkpoints to both levels.]
12
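How the two levels might be combined in a single checkpoint routine is sketched below. This is illustrative only: the period of 4 (matching the PL:PG = 3:1 ratio used later in the evaluation), the mount points, and the function name are assumptions, not the actual Two-level CheCL implementation. Most checkpoints go to the fast node-local storage (tmpfs RAM disk), and every fourth checkpoint is written to the reliable global storage instead, as in the timeline on the performance-model slides.

#include <stdio.h>

#define GLOBAL_PERIOD 4                  /* every 4th checkpoint is global => local:global = 3:1 */
#define LOCAL_BASE    "/tmp/ckpt"        /* illustrative node-local tmpfs mount point  */
#define GLOBAL_BASE   "/global/ckpt"     /* illustrative shared global-storage path    */

/* Two-level policy sketch: frequent cheap local checkpoints,
   plus an occasional slow but reliable global checkpoint. */
static int take_checkpoint(const void *data, size_t size, int rank, int step)
{
    const char *base = (step % GLOBAL_PERIOD == 0) ? GLOBAL_BASE : LOCAL_BASE;
    char path[512];
    snprintf(path, sizeof(path), "%s/rank%d.step%d.dat", base, rank, step);

    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return 0;
    int ok = (fwrite(data, 1, size, fp) == size);
    fclose(fp);
    return ok;
}

Local checkpoints survive a process crash on a healthy node; only the less frequent global checkpoints survive the loss of a node or its local storage, which is where the reliability trade-off comes from.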
13. Outline
• Introduction
• Two-level CheCL
• Performance Model
• Evaluation and Discussion
• Conclusion
13
14. Performance Model
• Total execution time of an OpenCL application running with
Two-level CheCL is Ttotal
• The original execution time is Ts
[Figure: execution timeline of the original run Ts divided into intervals of length n by local (dL) and global (dG) checkpoints.]
14
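Read together with the following slides, the overall structure of the model can be sketched as below (a reconstruction consistent with the slides, not an equation copied from the paper); Tc, TL, and TG are defined on the next slides.

    T_{total} = T_s + T_c + T_L + T_G

Here T_s is the failure-free execution time, T_c the total checkpointing overhead, T_L the overhead of failures recoverable from a local checkpoint, and T_G the overhead of failures that require the global checkpoint.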
16. Performance Model
• Total time spent for checkpointing is Tc
• Local CheCL checkpoint overhead is CL; Global CheCL checkpoint overhead is CG
[Figure: timeline of length Ts + Tc; each interval of length n is followed by a checkpoint, 75% of them local (overhead CL) and 25% global (overhead CG).]
16
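With the 75%/25% split shown in the figure above, the checkpointing overhead can be sketched as follows (a reconstruction; it assumes the run contains roughly N = T_s / n checkpoints, which is not stated explicitly on the slides):

    T_c \approx N\,(0.75\,C_L + 0.25\,C_G), \qquad N = T_s / n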
17. Performance Model
• Failures are assumed not to occur during the checkpointing process; on average, a failure occurs at 0.5n into an interval
• TL is the time overhead when the process is recoverable from the latest checkpoint file
[Figure: the same timeline (Ts + Tc) with the average failure point marked at 0.5n within each interval.]
17
18. Performance Model
• Failures are assumed not to occur during the checkpointing process; on average, a failure occurs at 0.5n into an interval
• TL is the time overhead when the process is recoverable from the latest checkpoint file
[Figure: per-failure wasted time; of all failures [Moody, SC'10], 85% are recoverable from a local checkpoint (restart time RL) and 15% only from the global checkpoint (restart time RG); each failure also wastes on average 0.5n of computation.]
18
19. Performance Model
• TG is the time overhead when the process is only recoverable from the global checkpoint file
[Figure: recovery timeline for a failure that is only recoverable from the global checkpoint (restart overhead RG, cf. RL for local recovery).]
19
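Putting slides 17-19 together, the failure overheads can be sketched as follows. This is a reconstruction, not the paper's equations: F_L and F_G denote the expected numbers of failures recoverable locally and only globally (split roughly 85:15 [Moody, SC'10]), each failure wastes on average 0.5n of computation, and a global-level recovery additionally loses the work done since the last global checkpoint, which this simplified form omits.

    T_L \approx F_L\,(0.5\,n + R_L), \qquad T_G \approx F_G\,(0.5\,n + R_G), \qquad F_L : F_G \approx 85 : 15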
20. Outline
• Introduction
• Two-level CheCL
• Performance Model
• Evaluation and Discussion
• Conclusion
20
21. Experimental Setup
• The evaluation was conducted on a GPU cluster of four compute nodes; each compute node has:
– Intel Core i7 930 CPU
– NVIDIA Tesla C2070 GPU
– Main memory of 24 GB
– tmpfs RAM Disk of 12 GB
• CPR tools:
– BLCR-0.8-4 (CPU state ckpt)
– CheCL (GPU state ckpt)
• Benchmark:
– Molecular Dynamics (MD)
21
22. Checkpoint Time Comparison for GPU Cluster
[Chart: checkpoint time (ms) for Global CheCL vs. Local CheCL at problem sizes 12288, 24574, and 73728 on 1, 2, and 4 nodes; Local CheCL speeds up checkpointing by more than 4x in the best case.]
22
23. Efficiency (Ts/Ttotal) Improvement (No Failure)
[Chart: efficiency (Ts/Ttotal), from 0% to 100%, vs. checkpoint frequency (1x to 64x) with no failures, comparing Global-only CheCL and Two-level CheCL (PL:PG = 3:1) on 2 and 4 nodes.]
23
24. Efficiency Improvement (MTTF = 3 minutes)
[Schroeder, SciDAC’07]
[Chart: efficiency (Ts/Ttotal) vs. checkpoint frequency (1x to 64x) with MTTF = 3 minutes, comparing Global-only CheCL and Two-level CheCL (PL:PG = 3:1) on 4 nodes.]
24
25. Trade-off Between Local/Global Ratio and Two-level CheCL’s Time Overhead
[Chart: Two-level CheCL's time overhead (ms, 0 to 4500) as the local/global checkpoint ratio varies from (0:10) to (9:1).]
25
26. Conclusion
• Checkpointing is important for HPC system dependability
• Two-level CheCL can improve system efficiency
• Local CheCL can be used for high-speed checkpointing
• There is a trade-off between Local and Global CheCL that must be handled carefully in future implementations on large-scale GPU computing systems
26