Improving the Scalability of Transparent
Checkpointing for GPU Computing Systems
              The 2012 IEEE Region 10 Conference
                        (TENCON 2012)
                       Cebu, Philippines
                     November 21, 2012

Alfian Amrizal, S. Hirasawa, K. Komatsu, H. Takizawa, H. Kobayashi
                         Tohoku University
Outline
•   Introduction
•   Two-level CheCL
•   Performance Model
•   Evaluation and Discussion
•   Conclusion




High-Performance Computing & Checkpoint
• High-performance computing (HPC) systems are getting faster
  and larger in scale
   – Consist of huge numbers of CPUs and GPUs
   – Probability of encountering failures also increases
• Checkpoint/restart (CPR) tools are important to make sure
  HPC systems can successfully finish their calculation
   – Long running applications; e.g. SPECFEM3D




Figure: CPU-GPU in Heterogeneous HPC system
Difficulties in CPR of Heterogeneous Systems
         • Heterogeneous systems use both CPUs and GPUs
         • Conventional CPR tools such as BLCR and DMTCP do not
           assume GPUs ⇒ CPR fails

Figure: a compute node with a CPU (host memory, process) and a GPU (device memory, resource), running application code of the form:
  SCR_Start_checkpt();
  SCR_Route_file(fn,fn2);
  …
  fwrite(data,…);
  …
  SCR_Complete_checkpt();
Conventional CPR tools only save CPU state; CheCL allows conventional tools to save GPU state.
         • CheCL has been developed for checkpointing OpenCL
           applications running on CPU-GPU systems [Takizawa, IPDPS’11]
Difficulties in CPR of Heterogeneous Systems
 • Problem: checkpointing time increases with the # of nodes




Writing Checkpoints to Global Storage is Ineffective
 • To withstand failures, large-scale heterogeneous systems need
   to checkpoint more frequently to the global storage (low BW)
 • However, the global storage is shared among nodes
   ⇒ CheCL's checkpoint time increases with the # of nodes
 • CheCL is not scalable: the larger the number of nodes, the
   longer it takes to checkpoint

 • Objective
    – To establish an effective implementation of the checkpointing
      mechanism for heterogeneous HPC systems

Figure: four compute nodes, each running SCR_Start_checkpt(); SCR_Route_file(fn,fn2); … fwrite(data,…); … SCR_Complete_checkpt();, all writing to one global storage over a shared network (label: Network Contention).
Local CheCL
  • Avoid the network by utilizing each node's local storage
       –  Simultaneous checkpointing → Fast
       –  Less reliable

Figure: four compute nodes, each running SCR_Start_checkpt(); SCR_Route_file(fn,fn2); … fwrite(data,…); … SCR_Complete_checkpt();. Labels: "Add local storage to each node", "Interrupt this process", and, below the nodes, "Large, reliable but slow global storage".
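
The contrast between shared global storage and per-node local storage can be sketched with a simple bandwidth model. This is an idealized illustration, not the authors' model: it assumes the link to the global storage is the only bottleneck and is shared equally by all nodes, while local writes proceed fully in parallel. The function names and the bandwidth figures in the usage lines are hypothetical.

```python
def global_ckpt_time(num_nodes, ckpt_bytes_per_node, shared_bw_bytes_per_s):
    """Time for all nodes to write to one shared global storage.

    Assumption: the shared link is the bottleneck, so the total data
    volume (num_nodes * ckpt_bytes_per_node) passes through it.
    """
    return num_nodes * ckpt_bytes_per_node / shared_bw_bytes_per_s


def local_ckpt_time(ckpt_bytes_per_node, local_bw_bytes_per_s):
    """Time for simultaneous checkpointing to node-local storage.

    Assumption: every node writes independently, so the time does not
    depend on the number of nodes.
    """
    return ckpt_bytes_per_node / local_bw_bytes_per_s


if __name__ == "__main__":
    size = 100e6          # hypothetical checkpoint size per node, in bytes
    for nodes in (1, 2, 4):
        g = global_ckpt_time(nodes, size, shared_bw_bytes_per_s=100e6)
        l = local_ckpt_time(size, local_bw_bytes_per_s=1e9)
        print(f"{nodes} node(s): global {g:.2f} s, local {l:.2f} s")
```

Under these assumptions the global checkpoint time grows linearly with the number of nodes, while the local checkpoint time stays flat.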
Two-level CheCL
  • Writing ckpt files to the global storage is more reliable but time
    consuming
  • Using the local storage of compute nodes is fast but sacrifices reliability

Proposal: Two-level CheCL uses both local and global storage ⇒ Local CheCL + Global CheCL

Figure: four compute nodes, each running SCR_Start_checkpt(); SCR_Route_file(fn,fn2); … fwrite(data,…); … SCR_Complete_checkpt();, each with its own local storage, all connected to a shared global storage.
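
One way to interleave the two levels is a fixed pattern of several local checkpoints per global checkpoint. The sketch below is an assumption for illustration, not CheCL's actual scheduling code: the function name `checkpoint_level` is hypothetical, and the default of three local checkpoints per global one mirrors the PL:PG = 3:1 setting quoted in the evaluation slides.

```python
def checkpoint_level(index, locals_per_global=3):
    """Return which storage level the index-th checkpoint (0-based) targets.

    Assumed pattern: one global checkpoint followed by
    `locals_per_global` local checkpoints, repeated.
    """
    if index < 0:
        raise ValueError("checkpoint index must be non-negative")
    period = locals_per_global + 1
    return "global" if index % period == 0 else "local"


if __name__ == "__main__":
    # Show the level chosen for the first eight checkpoints.
    print([checkpoint_level(i) for i in range(8)])
```

With the default setting, three out of every four checkpoints go to local storage, so most checkpoints avoid the shared network.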
Performance Model

• Total execution time of an OpenCL application running with
  Two-level CheCL is Ttotal
• The original execution time is Ts
• Total time spent for checkpointing is Tc
• Local CheCL ckpt overhead CL, Global CheCL ckpt overhead CG

Figure: three timelines built from computation intervals of length n.
  1. Original run of length Ts, with checkpoint points marked dG and dL.
  2. Run of length Ts + Tc, with a checkpoint overhead Cov after each interval.
  3. Run of length Ts + Tc, with overheads in the order CG, CL, CL, CL (labels: 75% and 25%).
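
A minimal sketch of the no-failure part of this model, under stated assumptions: a checkpoint is taken after every n seconds of work, a fraction of the checkpoints (75% here) are local with overhead CL and the rest are global with overhead CG, and efficiency is Ts/Ttotal with Ttotal = Ts + Tc. The slide text does not spell out the formulas, so the expressions and names below are one plausible reading rather than the authors' exact model.

```python
def total_checkpoint_time(ts, n, c_local, c_global, local_fraction=0.75):
    """Estimate Tc, the total time spent checkpointing (no failures).

    ts             -- original execution time Ts
    n              -- computation time between consecutive checkpoints
    c_local        -- overhead CL of one local checkpoint
    c_global       -- overhead CG of one global checkpoint
    local_fraction -- assumed share of checkpoints that are local
    """
    num_checkpoints = ts / n  # fractional counts allowed in this estimate
    avg_overhead = local_fraction * c_local + (1.0 - local_fraction) * c_global
    return num_checkpoints * avg_overhead


def efficiency(ts, tc):
    """Efficiency Ts / Ttotal, assuming Ttotal = Ts + Tc (no failures)."""
    return ts / (ts + tc)


if __name__ == "__main__":
    ts, n = 1000.0, 100.0  # hypothetical times in seconds
    two_level = total_checkpoint_time(ts, n, c_local=1.0, c_global=5.0)
    global_only = total_checkpoint_time(ts, n, 1.0, 5.0, local_fraction=0.0)
    print(f"two-level:   Tc={two_level:.1f}, efficiency={efficiency(ts, two_level):.3f}")
    print(f"global only: Tc={global_only:.1f}, efficiency={efficiency(ts, global_only):.3f}")
```

Setting `local_fraction` to 0 gives a global-only configuration for comparison; the overhead values in the usage lines are made up.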
Performance Model

• Assume no failure occurs during the ckpt process. On average, a failure
  occurs 0.5n into a computation interval
• TL is time overhead when the process is recoverable by the latest
  checkpoint file.

Figure: two timelines showing the wasted time (labels: 85% and 15% of the # of failures [Moody, SC’10]).
  1. n, CG, n, CL, then a failure after 0.5n, restart RL, then n, CL, n.
  2. n, CG, then a failure after 0.5n, restart RG, then n, CL, n, CL, n.
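
The two timelines can be read as "lost work of 0.5n plus one restart from the latest checkpoint". The sketch below encodes that reading only. It is an interpretation, not a formula given on the slides: it assumes RL and RG are the restart times from a local and a global checkpoint, and that the latest checkpoint is local with the same probability as the share of local checkpoints (75% by default). The function name is hypothetical, and the case where only the global checkpoint is usable (TG) is not modeled here.

```python
def expected_tl(n, r_local, r_global, local_fraction=0.75):
    """Expected overhead TL of a failure recoverable from the latest checkpoint.

    n              -- computation time between checkpoints
    r_local        -- restart time RL from a local checkpoint (assumed meaning)
    r_global       -- restart time RG from a global checkpoint (assumed meaning)
    local_fraction -- assumed probability that the latest checkpoint is local
    """
    lost_work = 0.5 * n  # failures occur, on average, halfway through an interval
    expected_restart = (local_fraction * r_local
                        + (1.0 - local_fraction) * r_global)
    return lost_work + expected_restart


if __name__ == "__main__":
    # Hypothetical values in seconds.
    print(expected_tl(n=100.0, r_local=2.0, r_global=10.0))
```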
Performance Model

• TG is time overhead when the process is only recoverable by the
  global checkpoint file.

Figure: timeline n, CG, n, CL, then a failure after 0.5n, followed by RG and RL, then n, CL, n.
Experimental Setup
• The evaluation was conducted on a GPU cluster of
  four compute nodes, each with:
   –   Intel Core i7 930 CPU
   –   NVIDIA Tesla C2070 GPU
   –   Main memory of 24 GB
   –   tmpfs RAM disk of 12 GB
• CPR tools:
   – BLCR-0.8-4 (CPU state ckpt)
   – CheCL (GPU state ckpt)
• Benchmark:
   – Molecular Dynamics (MD)
Checkpoint Time Comparison for GPU Cluster

Figure: bar chart of checkpoint time (ms, axis from 0 to 16000) for Global CheCL and Local CheCL. Horizontal axis: # of nodes (1, 2, 4) and problem size (12288, 24574, 73728). Annotation: "Accelerate up to > 4x".
Efficiency (Ts/Ttotal) Improvement (No Failure)

Figure: efficiency (0% to 100%) versus checkpoint frequency (1x, 2x, 4x, 8x, 16x, 32x, 64x) for four configurations: 2 nodes, Local and Global; 2 nodes, Global only; 4 nodes, Local and Global; 4 nodes, Global only. Two-level CheCL’s PL:PG = 3:1.
Efficiency Improvement (MTTF = 3 minutes) [Schroeder, SciDAC’07]

Figure: efficiency (0% to 100%) versus checkpoint frequency (1x, 2x, 4x, 8x, 16x, 32x, 64x) for two configurations: 4 nodes, Local and Global; 4 nodes, Global only. Two-level CheCL’s PL:PG = 3:1.
Trade-off Between Local/Global Ratio and Two-level CheCL’s Time Overhead

Figure: time overhead (ms, axis from 0 to 4500) versus Local/Global ratio, from (0:10) to (9:1) in steps of one.
Conclusion
• Checkpointing is important for HPC system
  dependability
• Two-level CheCL can improve system efficiency
• Local CheCL can be used for high speed
  checkpointing
• There is a trade-off between Local and Global CheCL
  which must be treated carefully for future
  implementation on large-scale GPU computing
  systems



Más contenido relacionado

La actualidad más candente

Introduction to eBPF and XDP
Introduction to eBPF and XDPIntroduction to eBPF and XDP
Introduction to eBPF and XDPlcplcp1
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019Brendan Gregg
 
Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545Kernel TLV
 
Fun with Network Interfaces
Fun with Network InterfacesFun with Network Interfaces
Fun with Network InterfacesKernel TLV
 
Linux Timer device driver
Linux Timer device driverLinux Timer device driver
Linux Timer device driver艾鍗科技
 
Linux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloudLinux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloudAndrea Righi
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016Brendan Gregg
 
Linux kernel-rootkit-dev - Wonokaerun
Linux kernel-rootkit-dev - WonokaerunLinux kernel-rootkit-dev - Wonokaerun
Linux kernel-rootkit-dev - Wonokaerunidsecconf
 
Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005dflexer
 
Understanding eBPF in a Hurry!
Understanding eBPF in a Hurry!Understanding eBPF in a Hurry!
Understanding eBPF in a Hurry!Ray Jenkins
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityBrendan Gregg
 
Kernel Recipes 2015: Introduction to Kernel Power Management
Kernel Recipes 2015: Introduction to Kernel Power ManagementKernel Recipes 2015: Introduction to Kernel Power Management
Kernel Recipes 2015: Introduction to Kernel Power ManagementAnne Nicolas
 
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021Valeriy Kravchuk
 
Java util concurrent
Java util concurrentJava util concurrent
Java util concurrentRoger Xia
 
Spectre(v1%2 fv2%2fv4) v.s. meltdown(v3)
Spectre(v1%2 fv2%2fv4) v.s. meltdown(v3)Spectre(v1%2 fv2%2fv4) v.s. meltdown(v3)
Spectre(v1%2 fv2%2fv4) v.s. meltdown(v3)Gavin Guo
 
Berkeley Packet Filters
Berkeley Packet FiltersBerkeley Packet Filters
Berkeley Packet FiltersKernel TLV
 
Building Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCBuilding Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCKernel TLV
 

La actualidad más candente (20)

Introduction to eBPF and XDP
Introduction to eBPF and XDPIntroduction to eBPF and XDP
Introduction to eBPF and XDP
 
eBPF Basics
eBPF BasicseBPF Basics
eBPF Basics
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019
 
Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545
 
Fun with Network Interfaces
Fun with Network InterfacesFun with Network Interfaces
Fun with Network Interfaces
 
Linux Timer device driver
Linux Timer device driverLinux Timer device driver
Linux Timer device driver
 
Linux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloudLinux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloud
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016
 
Linux kernel-rootkit-dev - Wonokaerun
Linux kernel-rootkit-dev - WonokaerunLinux kernel-rootkit-dev - Wonokaerun
Linux kernel-rootkit-dev - Wonokaerun
 
Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005
 
Understanding eBPF in a Hurry!
Understanding eBPF in a Hurry!Understanding eBPF in a Hurry!
Understanding eBPF in a Hurry!
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF Observability
 
Kernel Recipes 2015: Introduction to Kernel Power Management
Kernel Recipes 2015: Introduction to Kernel Power ManagementKernel Recipes 2015: Introduction to Kernel Power Management
Kernel Recipes 2015: Introduction to Kernel Power Management
 
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021
 
Java util concurrent
Java util concurrentJava util concurrent
Java util concurrent
 
Kgdb kdb modesetting
Kgdb kdb modesettingKgdb kdb modesetting
Kgdb kdb modesetting
 
Ch3-2
Ch3-2Ch3-2
Ch3-2
 
Spectre(v1%2 fv2%2fv4) v.s. meltdown(v3)
Spectre(v1%2 fv2%2fv4) v.s. meltdown(v3)Spectre(v1%2 fv2%2fv4) v.s. meltdown(v3)
Spectre(v1%2 fv2%2fv4) v.s. meltdown(v3)
 
Berkeley Packet Filters
Berkeley Packet FiltersBerkeley Packet Filters
Berkeley Packet Filters
 
Building Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCBuilding Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCC
 

Destacado

EC5624A Plus Ethanol Corrosion Inhibitor
EC5624A Plus Ethanol Corrosion InhibitorEC5624A Plus Ethanol Corrosion Inhibitor
EC5624A Plus Ethanol Corrosion InhibitorPhillip Bureman
 
Why alcohols will replace gasoline and diesel fuel to be the fuels of the fut...
Why alcohols will replace gasoline and diesel fuel to be the fuels of the fut...Why alcohols will replace gasoline and diesel fuel to be the fuels of the fut...
Why alcohols will replace gasoline and diesel fuel to be the fuels of the fut...SolarClean Fuels, LLC
 
Crowd sensing, mobiles and feedback
Crowd sensing, mobiles and feedbackCrowd sensing, mobiles and feedback
Crowd sensing, mobiles and feedbackChristian Glahn
 
Oil 101 - Introduction to Petroleum Product Marketing
Oil 101 - Introduction to Petroleum Product MarketingOil 101 - Introduction to Petroleum Product Marketing
Oil 101 - Introduction to Petroleum Product MarketingEKT Interactive
 
ethanol engine modifications
ethanol engine modificationsethanol engine modifications
ethanol engine modificationsSughosh Deshmukh
 
Research report on phil. housing finance sector of Philippines
Research report on phil. housing finance sector of PhilippinesResearch report on phil. housing finance sector of Philippines
Research report on phil. housing finance sector of PhilippinesNelsie Grace Pineda
 
Political and legal environment
Political and legal environmentPolitical and legal environment
Political and legal environmentTala Lorena
 
Political and legal environment
Political and legal environmentPolitical and legal environment
Political and legal environmentluispachon
 
Marketing environment
Marketing environmentMarketing environment
Marketing environmentmustafvi786
 
Macro factors affecting business environment
Macro factors affecting business environmentMacro factors affecting business environment
Macro factors affecting business environmentaayush30
 

Destacado (11)

EC5624A Plus Ethanol Corrosion Inhibitor
EC5624A Plus Ethanol Corrosion InhibitorEC5624A Plus Ethanol Corrosion Inhibitor
EC5624A Plus Ethanol Corrosion Inhibitor
 
Why alcohols will replace gasoline and diesel fuel to be the fuels of the fut...
Why alcohols will replace gasoline and diesel fuel to be the fuels of the fut...Why alcohols will replace gasoline and diesel fuel to be the fuels of the fut...
Why alcohols will replace gasoline and diesel fuel to be the fuels of the fut...
 
Crowd sensing, mobiles and feedback
Crowd sensing, mobiles and feedbackCrowd sensing, mobiles and feedback
Crowd sensing, mobiles and feedback
 
Oil 101 - Introduction to Petroleum Product Marketing
Oil 101 - Introduction to Petroleum Product MarketingOil 101 - Introduction to Petroleum Product Marketing
Oil 101 - Introduction to Petroleum Product Marketing
 
ethanol engine modifications
ethanol engine modificationsethanol engine modifications
ethanol engine modifications
 
Research report on phil. housing finance sector of Philippines
Research report on phil. housing finance sector of PhilippinesResearch report on phil. housing finance sector of Philippines
Research report on phil. housing finance sector of Philippines
 
Marketing Presentation
Marketing PresentationMarketing Presentation
Marketing Presentation
 
Political and legal environment
Political and legal environmentPolitical and legal environment
Political and legal environment
 
Political and legal environment
Political and legal environmentPolitical and legal environment
Political and legal environment
 
Marketing environment
Marketing environmentMarketing environment
Marketing environment
 
Macro factors affecting business environment
Macro factors affecting business environmentMacro factors affecting business environment
Macro factors affecting business environment
 

Similar a Improving the Scalability of Transparent Checkpointing for GPU Computing Systems

Exploitation of counter overflows in the Linux kernel
Exploitation of counter overflows in the Linux kernelExploitation of counter overflows in the Linux kernel
Exploitation of counter overflows in the Linux kernelVitaly Nikolenko
 
NSC #2 - Challenge Solution
NSC #2 - Challenge SolutionNSC #2 - Challenge Solution
NSC #2 - Challenge SolutionNoSuchCon
 
Linux-HA with Pacemaker
Linux-HA with PacemakerLinux-HA with Pacemaker
Linux-HA with PacemakerKris Buytaert
 
Talk 160920 @ Cat System Workshop
Talk 160920 @ Cat System WorkshopTalk 160920 @ Cat System Workshop
Talk 160920 @ Cat System WorkshopQuey-Liang Kao
 
[Kiwicon 2011] Post Memory Corruption Memory Analysis
[Kiwicon 2011] Post Memory Corruption Memory Analysis[Kiwicon 2011] Post Memory Corruption Memory Analysis
[Kiwicon 2011] Post Memory Corruption Memory AnalysisMoabi.com
 
RTOS Material hfffffffffffffffffffffffffffffffffffff
RTOS Material hfffffffffffffffffffffffffffffffffffffRTOS Material hfffffffffffffffffffffffffffffffffffff
RTOS Material hfffffffffffffffffffffffffffffffffffffadugnanegero
 
Linux-HA with Pacemaker
Linux-HA with PacemakerLinux-HA with Pacemaker
Linux-HA with PacemakerKris Buytaert
 
開放運算&GPU技術研究班
開放運算&GPU技術研究班開放運算&GPU技術研究班
開放運算&GPU技術研究班Paul Chao
 
Decoupling Provenance Capture and Analysis from Execution
Decoupling Provenance Capture and Analysis from ExecutionDecoupling Provenance Capture and Analysis from Execution
Decoupling Provenance Capture and Analysis from ExecutionPaul Groth
 
[HITB Malaysia 2011] Exploit Automation
[HITB Malaysia 2011] Exploit Automation[HITB Malaysia 2011] Exploit Automation
[HITB Malaysia 2011] Exploit AutomationMoabi.com
 
LXC on Ganeti
LXC on GanetiLXC on Ganeti
LXC on Ganetikawamuray
 
XDP in Practice: DDoS Mitigation @Cloudflare
XDP in Practice: DDoS Mitigation @CloudflareXDP in Practice: DDoS Mitigation @Cloudflare
XDP in Practice: DDoS Mitigation @CloudflareC4Media
 
HES2011 - Sebastien Tricaud - Capture me if you can
HES2011 - Sebastien Tricaud - Capture me if you canHES2011 - Sebastien Tricaud - Capture me if you can
HES2011 - Sebastien Tricaud - Capture me if you canHackito Ergo Sum
 
Hackito Ergo Sum 2011: Capture me if you can!
Hackito Ergo Sum 2011: Capture me if you can!Hackito Ergo Sum 2011: Capture me if you can!
Hackito Ergo Sum 2011: Capture me if you can!stricaud
 
Linux SMEP bypass techniques
Linux SMEP bypass techniquesLinux SMEP bypass techniques
Linux SMEP bypass techniquesVitaly Nikolenko
 
How to Create AltCoin(Alternative Cryptocurrency)?
How to Create AltCoin(Alternative Cryptocurrency)?How to Create AltCoin(Alternative Cryptocurrency)?
How to Create AltCoin(Alternative Cryptocurrency)?Abdullah Khan Zehady
 
Linux kernel debugging
Linux kernel debuggingLinux kernel debugging
Linux kernel debugginglibfetion
 
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...CanSecWest
 

Similar a Improving the Scalability of Transparent Checkpointing for GPU Computing Systems (20)

Exploitation of counter overflows in the Linux kernel
Exploitation of counter overflows in the Linux kernelExploitation of counter overflows in the Linux kernel
Exploitation of counter overflows in the Linux kernel
 
NSC #2 - Challenge Solution
NSC #2 - Challenge SolutionNSC #2 - Challenge Solution
NSC #2 - Challenge Solution
 
Linux-HA with Pacemaker
Linux-HA with PacemakerLinux-HA with Pacemaker
Linux-HA with Pacemaker
 
Talk 160920 @ Cat System Workshop
Talk 160920 @ Cat System WorkshopTalk 160920 @ Cat System Workshop
Talk 160920 @ Cat System Workshop
 
[Kiwicon 2011] Post Memory Corruption Memory Analysis
[Kiwicon 2011] Post Memory Corruption Memory Analysis[Kiwicon 2011] Post Memory Corruption Memory Analysis
[Kiwicon 2011] Post Memory Corruption Memory Analysis
 
RTOS Material hfffffffffffffffffffffffffffffffffffff
RTOS Material hfffffffffffffffffffffffffffffffffffffRTOS Material hfffffffffffffffffffffffffffffffffffff
RTOS Material hfffffffffffffffffffffffffffffffffffff
 
Linux-HA with Pacemaker
Linux-HA with PacemakerLinux-HA with Pacemaker
Linux-HA with Pacemaker
 
開放運算&GPU技術研究班
開放運算&GPU技術研究班開放運算&GPU技術研究班
開放運算&GPU技術研究班
 
Genode Compositions
Genode CompositionsGenode Compositions
Genode Compositions
 
Decoupling Provenance Capture and Analysis from Execution
Decoupling Provenance Capture and Analysis from ExecutionDecoupling Provenance Capture and Analysis from Execution
Decoupling Provenance Capture and Analysis from Execution
 
[HITB Malaysia 2011] Exploit Automation
[HITB Malaysia 2011] Exploit Automation[HITB Malaysia 2011] Exploit Automation
[HITB Malaysia 2011] Exploit Automation
 
LXC on Ganeti
LXC on GanetiLXC on Ganeti
LXC on Ganeti
 
XDP in Practice: DDoS Mitigation @Cloudflare
XDP in Practice: DDoS Mitigation @CloudflareXDP in Practice: DDoS Mitigation @Cloudflare
XDP in Practice: DDoS Mitigation @Cloudflare
 
HES2011 - Sebastien Tricaud - Capture me if you can
HES2011 - Sebastien Tricaud - Capture me if you canHES2011 - Sebastien Tricaud - Capture me if you can
HES2011 - Sebastien Tricaud - Capture me if you can
 
Hackito Ergo Sum 2011: Capture me if you can!
Hackito Ergo Sum 2011: Capture me if you can!Hackito Ergo Sum 2011: Capture me if you can!
Hackito Ergo Sum 2011: Capture me if you can!
 
Linux SMEP bypass techniques
Linux SMEP bypass techniquesLinux SMEP bypass techniques
Linux SMEP bypass techniques
 
AES on modern GPUs
AES on modern GPUsAES on modern GPUs
AES on modern GPUs
 
How to Create AltCoin(Alternative Cryptocurrency)?
How to Create AltCoin(Alternative Cryptocurrency)?How to Create AltCoin(Alternative Cryptocurrency)?
How to Create AltCoin(Alternative Cryptocurrency)?
 
Linux kernel debugging
Linux kernel debuggingLinux kernel debugging
Linux kernel debugging
 
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
 


Improving the Scalability of Transparent Checkpointing for GPU Computing Systems

  • 1. Improving the Scalability of Transparent Checkpointing for GPU Computing Systems
      • The 2012 IEEE Region 10 Conference (TENCON 2012), Cebu, Philippines, November 21, 2012
      • Alfian Amrizal, S. Hirasawa, K. Komatsu, H. Takizawa, H. Kobayashi (Tohoku University)
  • 2. Outline
      • Introduction
      • Two-level CheCL
      • Performance Model
      • Evaluation and Discussion
      • Conclusion
  • 3. High-Performance Computing & Checkpoint
      • High-performance computing (HPC) systems are getting faster and larger in scale
          – They consist of huge numbers of CPUs and GPUs
          – The probability of encountering failures also increases
      • Checkpoint/restart (CPR) tools are important for making sure HPC systems can successfully finish their calculations
          – Long-running applications, e.g. SPECFEM3D
      • [Figure: CPU-GPU in a heterogeneous HPC system]
  • 4. Difficulties in CPR of Heterogeneous Systems
      • Heterogeneous systems use both CPUs and GPUs
      • Conventional CPR tools such as BLCR and DMTCP do not assume GPUs ⇒ CPR fails
      • [Figure: a compute node running the code fragment SCR_Start_checkpt(); SCR_Route_file(fn,fn2); … fwrite(data,…); … SCR_Complete_checkpt(); on a CPU (host memory, process) and a GPU (device memory, resource). Conventional CPR tools only save CPU state; CheCL allows conventional tools to save GPU state]
      • CheCL has been developed for checkpointing OpenCL applications running on CPU-GPU systems [Takizawa, IPDPS’11]
  • 5. Difficulties in CPR of Heterogeneous Systems
      • Problem: checkpointing time increases with the number of nodes
  • 6. Writing Checkpoints to Global Storage is Ineffective
      • To withstand failures, large-scale heterogeneous systems need to checkpoint more frequently to the global storage (low bandwidth)
      • However, the global storage is shared among nodes ⇒ CheCL’s checkpoint time increases with the number of nodes
      • CheCL is not scalable: the larger the number of nodes, the longer it takes to checkpoint
      • Objective
          – To establish an effective implementation of the checkpointing mechanism for heterogeneous HPC systems
      • [Figure: four compute nodes, each showing the SCR_Start_checkpt() … SCR_Complete_checkpt() code fragment, writing over the network to a single global storage; label: network contention]
  • 7. Writing Checkpoints to Global Storage is Ineffective
      • To withstand failures, large-scale heterogeneous systems need to checkpoint more frequently to the global storage (low bandwidth)
      • However, the global storage is shared among nodes ⇒ CheCL’s checkpoint time increases with the number of nodes
      • CheCL is not scalable: the larger the number of nodes, the longer it takes to checkpoint
      • Objective
          – To establish an effective implementation of the checkpointing mechanism for heterogeneous HPC systems
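
The contention argument on slide 7 can be illustrated with a small back-of-the-envelope sketch. It is not the authors' model: it assumes an idealized global store whose bandwidth is divided evenly among all writing nodes, ignores latency and file-system overhead, and uses hypothetical function names, checkpoint sizes, and bandwidths.

```python
def global_checkpoint_time(num_nodes, ckpt_bytes, shared_bw):
    """Seconds until every node has finished writing to one shared store.

    Assumption: the shared bandwidth is split evenly among the nodes, so
    the total of num_nodes * ckpt_bytes must pass through one link.
    """
    return num_nodes * ckpt_bytes / shared_bw


def local_checkpoint_time(ckpt_bytes, local_bw):
    """Seconds for each node to write to its own local storage in parallel."""
    return ckpt_bytes / local_bw


if __name__ == "__main__":
    ckpt_bytes = 1024**3        # hypothetical: 1 GiB of state per node
    shared_bw = 100 * 1024**2   # hypothetical: 100 MiB/s shared global storage
    local_bw = 1024**3          # hypothetical: 1 GiB/s node-local RAM disk
    for nodes in (1, 2, 4):
        t_global = global_checkpoint_time(nodes, ckpt_bytes, shared_bw)
        t_local = local_checkpoint_time(ckpt_bytes, local_bw)
        print(f"{nodes} node(s): global {t_global:.2f} s, local {t_local:.2f} s")
```

Under these assumptions the global checkpoint time grows linearly with the node count while the local time stays flat, which is the scaling problem the slide describes.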
  • 8. Outline
      • Introduction
      • Two-level CheCL
      • Performance Model
      • Evaluation and Discussion
      • Conclusion
  • 9. Local CheCL
      • Avoid the network by utilizing each node’s local storage
          – Simultaneous checkpointing → fast
          – Less reliable
      • [Figure: four compute nodes, each showing the SCR_Start_checkpt() … SCR_Complete_checkpt() code fragment, with local storage added to each node; the global storage is large and reliable but slow; callout: "Interrupt this process"]
  • 10. Local CheCL
      • Avoid the network by utilizing each node’s local storage
          – Simultaneous checkpointing → fast
          – Less reliable
      • [Figure: four compute nodes, each showing the SCR_Start_checkpt() … SCR_Complete_checkpt() code fragment, with local storage added to each node; the global storage is large and reliable but slow]
  • 11. Local CheCL
      • Avoid the network by utilizing each node’s local storage
          – Simultaneous checkpointing → fast
          – Less reliable
      • [Figure: same diagram as on slide 10]
  • 12. Two-level CheCL
      • Writing checkpoint files to the global storage is more reliable but time consuming
      • Using the local storage of the compute nodes is fast but sacrifices reliability
      • Proposal: Two-level CheCL uses both local and global storage ⇒ Local CheCL + Global CheCL
      • [Figure: four compute nodes, each showing the SCR_Start_checkpt() … SCR_Complete_checkpt() code fragment, with their local storages and a shared global storage]
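
One way to picture the two-level idea is as a schedule that decides, for each checkpoint, which storage level receives it. The sketch below is an illustration only, not CheCL's actual policy: it assumes a fixed repeating pattern of one global checkpoint followed by three local ones, matching the 3:1 local-to-global ratio quoted on the efficiency slides, and the function names are hypothetical.

```python
def checkpoint_level(index, locals_per_global=3):
    """Return "global" or "local" for the checkpoint with this 0-based index.

    Assumption: every (locals_per_global + 1)-th checkpoint, starting with
    the first, goes to global storage; all others go to local storage.
    """
    if index < 0:
        raise ValueError("index must be non-negative")
    if locals_per_global < 0:
        raise ValueError("locals_per_global must be non-negative")
    return "global" if index % (locals_per_global + 1) == 0 else "local"


def schedule(count, locals_per_global=3):
    """Storage level for each of the first `count` checkpoints."""
    return [checkpoint_level(i, locals_per_global) for i in range(count)]


if __name__ == "__main__":
    print(schedule(8))
```

With the default ratio, three out of every four checkpoints take the fast local path, and a global checkpoint is still available as a fallback when local storage is lost.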
  • 13. Outline
      • Introduction
      • Two-level CheCL
      • Performance Model
      • Evaluation and Discussion
      • Conclusion
  • 14. Performance Model
      • The total execution time of an OpenCL application running with Two-level CheCL is Ttotal
      • The original execution time is Ts
      • [Figure: timeline of execution intervals of length n, with boundaries labelled dG and dL, spanning Ts]
  • 15. Performance Model
      • The total time spent on checkpointing is TC
      • [Figure: timeline of intervals of length n, each followed by a checkpoint overhead Cov, spanning Ts + TC]
  • 16. Performance Model
      • The total time spent on checkpointing is TC
      • Local CheCL checkpoint overhead is CL; Global CheCL checkpoint overhead is CG
      • [Figure: timeline n, CG, n, CL, n, CL, n, CL, n spanning Ts + TC, with the labels 75% and 25%]
  • 17. Performance Model
      • No failure is assumed to occur during the checkpoint process; on average, failures occur at 0.5n
      • TL is the time overhead when the process is recoverable from the latest checkpoint file
      • [Figure: timeline n, CG, n, CL, n, CL, n, CL, n spanning Ts + TC, with 0.5n marked in each interval]
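  • 18. Performance Model
      • No failure is assumed to occur during the checkpoint process; on average, failures occur at 0.5n
      • TL is the time overhead when the process is recoverable from the latest checkpoint file
      • [Figure: two failure timelines, each with 0.5n of wasted time: one restarts with RL after a local checkpoint (CL), the other restarts with RG after a global checkpoint (CG); number of failures split 85% / 15% [Moody, SC’10]]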
  • 19. Performance Model
      • TG is the time overhead when the process is only recoverable from the global checkpoint file
      • [Figure: timeline n, CG, n, CL, 0.5n, then RG and RL, then n, CL, n]
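
The quantities named on slides 14 to 19 can be combined into a rough efficiency estimate (Ts/Ttotal, the metric used in the evaluation slides). The slides do not state the closed-form model, so everything below is a sketch under explicit assumptions: 75% of checkpoints are local and 25% global, 85% of failures are recoverable from a local checkpoint and 15% need the global one, each failure wastes 0.5n plus a restart cost (RL or RG), and the number of failures is Ttotal/MTTF. The function and parameter names are hypothetical.

```python
def checkpoint_overhead(ts, n, c_l, c_g, local_share=0.75):
    """Total checkpoint time T_C for a run of length ts with interval n.

    Assumption: ts / n checkpoints are taken, a fraction local_share of
    them local (cost c_l) and the rest global (cost c_g).
    """
    num_ckpts = ts / n
    return num_ckpts * (local_share * c_l + (1.0 - local_share) * c_g)


def efficiency(ts, n, c_l, c_g, r_l=0.0, r_g=0.0, mttf=None,
               local_share=0.75, local_recoverable=0.85):
    """Estimated efficiency Ts / Ttotal; all times share one unit.

    mttf=None models a failure-free run. Otherwise each failure is assumed
    to cost 0.5 * n of lost work plus a restart (r_l or r_g), and Ttotal
    solves Ttotal = Ts + T_C + (Ttotal / mttf) * mean_loss.
    """
    failure_free = ts + checkpoint_overhead(ts, n, c_l, c_g, local_share)
    if mttf is None:
        return ts / failure_free
    loss_local = 0.5 * n + r_l    # recoverable from the latest local checkpoint
    loss_global = 0.5 * n + r_g   # recoverable only from a global checkpoint
    mean_loss = (local_recoverable * loss_local
                 + (1.0 - local_recoverable) * loss_global)
    if mean_loss >= mttf:
        raise ValueError("expected loss per failure exceeds MTTF; no progress")
    t_total = failure_free / (1.0 - mean_loss / mttf)
    return ts / t_total


if __name__ == "__main__":
    # Hypothetical numbers in seconds; MTTF of 180 s mirrors the 3-minute case.
    print(efficiency(ts=1000, n=100, c_l=1, c_g=5))
    print(efficiency(ts=1000, n=100, c_l=1, c_g=5, r_l=2, r_g=10, mttf=180))
```

This simplification charges a global-only recovery the same 0.5n of lost work as a local one, although rolling back to an older global checkpoint would in practice lose more; it is meant to show how the terms interact, not to reproduce the reported results. Sweeping local_share from 0 to 1 gives a feel for the local/global trade-off.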
  • 20. Outline
      • Introduction
      • Two-level CheCL
      • Performance Model
      • Evaluation and Discussion
      • Conclusion
  • 21. Experimental Setup
      • The evaluation was conducted on a GPU cluster of four compute nodes; each compute node has:
          – Intel Core i7 930 CPU
          – NVIDIA Tesla C2070 GPU
          – 24 GB of main memory
          – 12 GB tmpfs RAM disk
      • CPR tools:
          – BLCR-0.8-4 (CPU state checkpoint)
          – CheCL (GPU state checkpoint)
      • Benchmark:
          – Molecular Dynamics (MD)
  • 22. Checkpoint Time Comparison for GPU Cluster
      • [Figure: bar chart of checkpoint time (ms, 0 to 16000) for Global CheCL and Local CheCL on 1, 2, and 4 nodes at problem sizes 12288, 24574, and 73728; annotation: acceleration of up to more than 4x]
  • 23. Efficiency (Ts/Ttotal) Improvement (No Failure)
      • Two-level CheCL’s PL:PG = 3:1
      • [Figure: efficiency (0% to 100%) versus checkpoint frequency (1x to 64x) for four series: 2 nodes, Local and Global; 2 nodes, Global only; 4 nodes, Local and Global; 4 nodes, Global only]
  • 24. Efficiency Improvement (MTTF = 3 minutes) [Schroeder, SciDAC’07]
      • Two-level CheCL’s PL:PG = 3:1
      • [Figure: efficiency (0% to 100%) versus checkpoint frequency (1x to 64x) for two series: 4 nodes, Local and Global; 4 nodes, Global only]
  • 25. Trade-off Between Local/Global Ratio and Two-level CheCL’s Time Overhead
      • [Figure: time overhead (ms, 0 to 4500) versus local/global ratio, from (0:10) to (9:1) in steps of one]
  • 26. Conclusion
      • Checkpointing is important for HPC system dependability
      • Two-level CheCL can improve system efficiency
      • Local CheCL can be used for high-speed checkpointing
      • There is a trade-off between Local and Global CheCL, which must be treated carefully in future implementations on large-scale GPU computing systems