Improving the Scalability of Transparent
Checkpointing for GPU Computing Systems
              The 2012 IEEE Region 10 Conference
                        (TENCON 2012)
                       Cebu, Philippines
                     November 21, 2012

Alfian Amrizal, S. Hirasawa, K. Komatsu, H. Takizawa, H. Kobayashi
                         Tohoku University
Outline
•   Introduction
•   Two-level CheCL
•   Performance Model
•   Evaluation and Discussion
•   Conclusion

High-Performance Computing & Checkpointing
• High-performance computing (HPC) systems are becoming faster
  and larger in scale
   – They consist of huge numbers of CPUs and GPUs
   – The probability of encountering failures also increases
• Checkpoint/restart (CPR) tools are important to ensure that
  HPC systems can successfully finish their calculations
   – Especially for long-running applications, e.g., SPECFEM3D

[Figure: CPU-GPU in a heterogeneous HPC system]

Difficulties in CPR of Heterogeneous Systems
• Heterogeneous systems use both CPUs and GPUs
• Conventional CPR tools such as BLCR and DMTCP do not
  take GPUs into account ⇒ CPR fails

[Figure: a compute node with a CPU (host memory, process) and a GPU (device memory, resource); the application process calls SCR_Start_checkpt(); SCR_Route_file(fn,fn2); … fwrite(data,…); … SCR_Complete_checkpt();. Conventional CPR tools only save the CPU state, whereas CheCL allows conventional tools to save the GPU state.]

• CheCL has been developed for checkpointing OpenCL
  applications running on CPU-GPU systems [Takizawa, IPDPS'11]

Difficulties in CPR of Heterogeneous Systems
 • Problem: checkpointing time increases with the # of nodes

Writing Checkpoints to Global Storage is Ineffective
• To withstand failures, large-scale heterogeneous systems need
  to checkpoint more frequently to the global storage (low bandwidth)
• However, the global storage is shared among nodes
  ⇒ CheCL's checkpoint time increases with the # of nodes
• CheCL is not scalable: the more compute nodes there are, the
  longer it takes to checkpoint

[Figure: four compute nodes, each running SCR_Start_checkpt(); SCR_Route_file(fn,fn2); … fwrite(data,…); … SCR_Complete_checkpt();, all writing to a single global storage, which causes network contention]

• Objective
   – To establish an effective implementation of the checkpointing
     mechanism for heterogeneous HPC systems

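The contention argument above can be made concrete with a back-of-the-envelope model. The sketch below assumes, for illustration only, that the global storage delivers a fixed aggregate bandwidth shared evenly by all nodes writing at the same time, while each node's local storage delivers its own bandwidth independently. The function names and any bandwidth values are hypothetical and are not measurements from this study.

```c
#include <assert.h>

/* HYPOTHETICAL model: time (s) for `nodes` nodes to each write a checkpoint
 * of `bytes_per_node` bytes to a shared global storage whose aggregate
 * bandwidth `global_bw` (bytes/s) is split among all concurrent writers. */
double global_ckpt_time(int nodes, double bytes_per_node, double global_bw)
{
    assert(nodes > 0 && global_bw > 0.0);
    /* Every node competes for the same storage, so the total data volume
     * is pushed through one shared bandwidth. */
    return (nodes * bytes_per_node) / global_bw;
}

/* HYPOTHETICAL model: each node writes to its own local storage with
 * bandwidth `local_bw` (bytes/s); writes proceed in parallel, so the
 * time does not depend on the number of nodes. */
double local_ckpt_time(int nodes, double bytes_per_node, double local_bw)
{
    assert(nodes > 0 && local_bw > 0.0);
    (void)nodes; /* deliberately unused: node count has no effect here */
    return bytes_per_node / local_bw;
}
```

Under these assumptions, doubling the node count doubles the global checkpoint time while the local checkpoint time stays flat, which is the qualitative scaling problem this slide describes.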
Local CheCL
  • Avoid the network by utilizing each node's local storage
       –  Simultaneous checkpointing → Fast
       –  Less reliable

[Figure: four compute nodes, each running SCR_Start_checkpt(); SCR_Route_file(fn,fn2); … fwrite(data,…); … SCR_Complete_checkpt(); and writing to its own local storage instead of the large, reliable but slow global storage; callouts: "Add local storage to each node", "Interrupt this process"]

Two-level CheCL
  • Writing checkpoint files to the global storage is more reliable but
    time-consuming
  • Using the local storage of compute nodes is fast but sacrifices reliability

Proposal: Two-level CheCL uses both local and global storage ⇒ Local CheCL + Global CheCL

[Figure: four compute nodes running the SCR checkpoint sequence (SCR_Start_checkpt(); … SCR_Complete_checkpt();), each with its own local storage, plus a shared global storage]

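One simple way to "use both local and global" storage is a fixed rotation in which one global checkpoint is followed by k local ones; with k = 3 this yields the PL:PG = 3:1 ratio used later in the evaluation. The sketch below is a hypothetical scheduler for such a rotation, not CheCL's actual code. The enum and function names are illustrative, and starting the rotation with a global checkpoint is an assumption taken from the performance-model timeline (n, C_G, n, C_L, n, C_L, n, C_L, n).

```c
#include <assert.h>

/* HYPOTHETICAL: where a given checkpoint is written. */
typedef enum { CKPT_GLOBAL, CKPT_LOCAL } ckpt_level;

/* Return the storage level of the `index`-th checkpoint (0-based) when
 * `locals_per_global` local checkpoints follow each global checkpoint.
 * With locals_per_global = 3 the sequence is G L L L G L L L ... (3:1). */
ckpt_level ckpt_level_for(int index, int locals_per_global)
{
    assert(index >= 0 && locals_per_global >= 0);
    return (index % (locals_per_global + 1) == 0) ? CKPT_GLOBAL : CKPT_LOCAL;
}
```

Setting locals_per_global to 0 reduces this to global-only checkpointing, which is the baseline the evaluation compares against. A fixed rotation keeps the local/global ratio predictable; an adaptive policy would be an alternative design.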
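Performance Model

• The total execution time of an OpenCL application running with
  Two-level CheCL is T_total
• The original execution time (without checkpointing) is T_s

[Figure: timeline of length T_s divided into intervals of length n, separated by global (dG) and local (dL) checkpoint points]
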
Performance Model

• The total time spent on checkpointing is T_C

[Figure: timeline of length T_s + T_C in which each interval n is followed by a checkpoint overhead Cov]

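Performance Model

• The total time spent on checkpointing is T_C
• The Local CheCL checkpoint overhead is C_L; the Global CheCL checkpoint overhead is C_G

[Figure: timeline of length T_s + T_C: n, C_G, n, C_L, n, C_L, n, C_L, n; 75% of the checkpoints are local and 25% are global]
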
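Reading the timeline on this slide, an execution of length T_s split into intervals of length n contains T_s/n - 1 checkpoints (four checkpoints between five intervals in the figure), of which one in four is global. The sketch below turns that reading into arithmetic for T_C. It assumes T_s is an exact multiple of n and that the rotation starts with a global checkpoint; the function name is hypothetical.

```c
#include <assert.h>

/* HYPOTHETICAL helper: total checkpoint overhead T_C (same time unit as the
 * inputs) for an execution of length ts split into intervals of length n,
 * with one global checkpoint (cost cg) followed by `locals_per_global`
 * local checkpoints (cost cl), repeating.
 * ASSUMPTION: ts is an exact multiple of n. */
double total_ckpt_overhead(double ts, double n, double cg, double cl,
                           int locals_per_global)
{
    assert(n > 0.0 && ts >= n && locals_per_global >= 0);
    int intervals = (int)(ts / n);
    int checkpoints = intervals - 1;   /* no checkpoint after the last interval */
    int period = locals_per_global + 1;

    /* Global checkpoints sit at positions 0, period, 2*period, ...,
     * so their count is ceil(checkpoints / period). */
    int globals = (checkpoints + period - 1) / period;
    int locals = checkpoints - globals;
    return globals * cg + locals * cl;
}
```

For example, with the figure's five intervals and illustrative costs C_G = 10 and C_L = 2 (arbitrary units), total_ckpt_overhead(5, 1, 10, 2, 3) gives 16: one global and three local checkpoints.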
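Performance Model

• Assume that no failure occurs during a checkpoint. On average, a failure
  occurs halfway through an interval (at 0.5n)
• T_L is the time overhead when the process can be recovered from the latest
  checkpoint file

[Figure: failures marked at 0.5n within each interval of the timeline n, C_G, n, C_L, n, C_L, n, C_L, n, of total length T_s + T_C]
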
[Figure: wasted time multiplied by the # of failures, with an 85% / 15% split of failures [Moody, SC'10]; example timelines: n, C_G, n, C_L, 0.5n, R_L, n, C_L, n and n, C_G, 0.5n, R_G, n, C_L, n, C_L, n]

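The assumptions on these slides give the expected overhead of failures that can be recovered from the latest checkpoint: each such failure wastes, on average, 0.5n of work plus the restart time R (R_L for a local checkpoint, R_G for a global one). The sketch below computes that term from an expected failure count supplied by the caller. Treating the 85% figure cited from [Moody, SC'10] as the recoverable fraction is an interpretation, and how the failure count is estimated (for example, from an MTTF) is left as an assumption; the function name is hypothetical.

```c
#include <assert.h>

/* HYPOTHETICAL helper for the T_L term.
 *   failures      : expected number of failures during the run
 *   p_recoverable : fraction of failures recoverable from the latest
 *                   checkpoint (ASSUMPTION: e.g. 0.85)
 *   n             : checkpoint interval length
 *   restart       : restart time from the latest checkpoint (R_L or R_G)
 * Each such failure loses half an interval of work on average, plus the
 * restart time. */
double latest_ckpt_recovery_overhead(double failures, double p_recoverable,
                                     double n, double restart)
{
    assert(failures >= 0.0);
    assert(p_recoverable >= 0.0 && p_recoverable <= 1.0);
    double wasted_per_failure = 0.5 * n + restart;
    return failures * p_recoverable * wasted_per_failure;
}
```

For instance, 10 expected failures, p = 0.85, n = 60 s and a 5 s restart give 10 × 0.85 × 35 = 297.5 s of overhead.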
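Performance Model

• T_G is the time overhead when the process can only be recovered from the
  global checkpoint file

[Figure: a failure at 0.5n after a local checkpoint (n, C_G, n, C_L, 0.5n) that can only be recovered from the global checkpoint, followed by R_G, R_L, n, C_L, n]
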
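The slides do not spell out how T_G or T_total are computed, so the following is only one plausible completion, under assumptions stated here and in the comments: (1) a failure that cannot use the latest checkpoint rolls back to the most recent global checkpoint, losing on average half of one global period, 0.5(k + 1)n for k local checkpoints per global one (ignoring checkpoint costs inside the period), plus R_G; (2) T_total = T_s + T_C + T_L + T_G; and (3) the number of failures is T_total / MTTF. Because assumption (3) puts T_total on both sides, the sketch solves for it in closed form, T_total = (T_s + T_C) / (1 - w / MTTF), where w is the expected waste per failure. The efficiency T_s / T_total is the metric plotted in the evaluation. The struct and function names are hypothetical.

```c
#include <assert.h>

/* HYPOTHETICAL model parameters (same time unit throughout). */
typedef struct {
    double ts;              /* failure-free execution time T_s            */
    double tc;              /* total checkpoint overhead T_C              */
    double n;               /* checkpoint interval                        */
    double rl, rg;          /* restart time from local / global ckpt      */
    double p_local;         /* fraction of failures recoverable from the
                               latest checkpoint                          */
    int locals_per_global;  /* k in a 1 global : k local rotation         */
    double mttf;            /* mean time to failure                       */
} ckpt_model;

/* Expected time wasted per failure, mixing the T_L and T_G cases.
 * ASSUMPTION: a global-only recovery loses half a global period of work. */
static double wasted_per_failure(const ckpt_model *m)
{
    double local_case = 0.5 * m->n + m->rl;
    double global_case = 0.5 * (m->locals_per_global + 1) * m->n + m->rg;
    return m->p_local * local_case + (1.0 - m->p_local) * global_case;
}

/* ASSUMPTIONS: T_total = T_s + T_C + T_L + T_G and failures = T_total / MTTF.
 * Solves T_total = T_s + T_C + (T_total / MTTF) * w for T_total.
 * Returns -1.0 when w >= MTTF (the model predicts the run never finishes). */
double expected_total_time(const ckpt_model *m)
{
    assert(m->mttf > 0.0);
    double w = wasted_per_failure(m);
    if (w >= m->mttf)
        return -1.0;
    return (m->ts + m->tc) / (1.0 - w / m->mttf);
}

/* Efficiency as defined in the evaluation: T_s / T_total (0 if unbounded). */
double efficiency(const ckpt_model *m)
{
    double total = expected_total_time(m);
    return total > 0.0 ? m->ts / total : 0.0;
}
```

With T_s = 90, T_C = 10, n = 10, zero restart times, every failure recoverable locally, and MTTF = 10, each failure wastes 5 units, so T_total = 100 / 0.5 = 200 and the efficiency is 0.45.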
Experimental Setup
• The evaluation was conducted on a GPU cluster of
  four compute nodes, each with:
   –   Intel Core i7 930 CPU
   –   NVIDIA Tesla C2070 GPU
   –   24 GB of main memory
   –   12 GB tmpfs RAM disk
• CPR tools:
   – BLCR-0.8-4 (CPU state checkpointing)
   – CheCL (GPU state checkpointing)
• Benchmark:
   – Molecular Dynamics (MD)

Checkpoint Time Comparison for GPU Cluster
[Chart: checkpoint time (ms, 0 to 16000) of Global CheCL and Local CheCL versus # of nodes and problem size (12288, 24574 and 73728 on 1, 2 and 4 nodes); annotation: checkpointing accelerated by up to more than 4x]

Efficiency (T_s/T_total) Improvement (No Failure)
[Chart: efficiency (0% to 100%) versus checkpoint frequency (1x to 64x) for 2 nodes and 4 nodes, comparing Local and Global (Two-level CheCL, PL:PG = 3:1) with Global only]

Efficiency Improvement (MTTF = 3 minutes) [Schroeder, SciDAC'07]
[Chart: efficiency (0% to 100%) versus checkpoint frequency (1x to 64x) for 4 nodes, comparing Local and Global (Two-level CheCL, PL:PG = 3:1) with Global only]

Trade-off Between Local/Global Ratio and Two-level CheCL's Time Overhead
[Chart: time overhead (ms, 0 to 4500) versus Local/Global ratio, from (0:10) to (9:1)]
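A trade-off of this kind can be explored with the same style of model. The hypothetical sketch below sweeps the local share of checkpoints over the chart's x-axis, (0:10) to (9:1), i.e. a local fraction f_L = a/10. It assumes that checkpoint cost is a weighted mix of C_L and C_G, and that a failure needing a global restart loses half the mean spacing between global checkpoints, n / (1 - f_L); the failure count is taken as given. These assumptions, the struct, and the function name are illustrative and are not the authors' model.

```c
#include <assert.h>

/* HYPOTHETICAL inputs for the ratio sweep (same time unit throughout). */
typedef struct {
    int checkpoints;   /* total number of checkpoints in the run          */
    double failures;   /* expected number of failures (taken as given)    */
    double n;          /* checkpoint interval                             */
    double cl, cg;     /* local / global checkpoint cost                  */
    double rl, rg;     /* local / global restart time                     */
    double p_local;    /* fraction of failures recoverable locally        */
} ratio_params;

/* Time overhead for `local_tenths` local checkpoints out of every 10
 * (0..9, matching (0:10) .. (9:1)).
 * ASSUMPTION: a global-only recovery loses half the mean spacing between
 * global checkpoints, n / (1 - f_L); this diverges as f_L approaches 1. */
double ratio_overhead(const ratio_params *p, int local_tenths)
{
    assert(local_tenths >= 0 && local_tenths <= 9);
    double f_local = local_tenths / 10.0;
    double ckpt_cost = p->checkpoints * (f_local * p->cl + (1.0 - f_local) * p->cg);
    double global_spacing = p->n / (1.0 - f_local);
    double per_failure = p->p_local * (0.5 * p->n + p->rl)
                       + (1.0 - p->p_local) * (0.5 * global_spacing + p->rg);
    return ckpt_cost + p->failures * per_failure;
}
```

In this model, more local checkpoints lower the checkpoint cost term, while failures that need a global restart cost more as global checkpoints become rarer. That is one mechanism that can produce a trade-off between the local/global ratio and the total overhead.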
Conclusion
• Checkpointing is important for the dependability of HPC systems
• Two-level CheCL can improve system efficiency
• Local CheCL can be used for high-speed checkpointing
• There is a trade-off between Local and Global CheCL, which must be
  handled carefully in future implementations on large-scale GPU
  computing systems

Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Patrick Viafore
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfFIDO Alliance
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireExakis Nelite
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentationyogeshlabana357357
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxFIDO Alliance
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandIES VE
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftshyamraj55
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...FIDO Alliance
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsLeah Henrickson
 

Recently uploaded (20)

ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage Intacct
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentation
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & Ireland
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
 

Improving the Scalability of Transparent Checkpointing for GPU Computing Systems

  • 1. Improving the Scalability of Transparent Checkpointing for GPU Computing Systems. The 2012 IEEE Region 10 Conference (TENCON 2012), Cebu, Philippines, November 21, 2012. Alfian Amrizal, S. Hirasawa, K. Komatsu, H. Takizawa, H. Kobayashi, Tohoku University
  • 2. Outline • Introduction • Two-level CheCL • Performance Model • Evaluation and Discussion • Conclusion
  • 3. High-Performance Computing & Checkpointing • High-performance computing (HPC) systems are becoming faster and larger in scale – They consist of huge numbers of CPUs and GPUs – The probability of encountering failures also increases • Checkpoint/restart (CPR) tools are important for ensuring that HPC systems can successfully finish their calculations – Especially for long-running applications, e.g. SPECFEM3D [Figure: CPUs and GPUs in a heterogeneous HPC system]
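To make the scaling argument concrete, here is a minimal Python sketch. It assumes (the slide does not say this) that node failures are independent and exponentially distributed, so the per-node failure rates add up and the system MTTF shrinks as 1/N. The node MTTF used below is a hypothetical value chosen for illustration, not a measurement.

```python
import math


def system_mttf(node_mttf_hours: float, num_nodes: int) -> float:
    """System MTTF assuming independent, exponentially distributed node failures.

    The failure rates of the N nodes add up, so the system MTTF is the
    per-node MTTF divided by N.
    """
    return node_mttf_hours / num_nodes


def prob_failure_within(hours: float, node_mttf_hours: float, num_nodes: int) -> float:
    """Probability that at least one of the N nodes fails within `hours`."""
    return 1.0 - math.exp(-hours * num_nodes / node_mttf_hours)


if __name__ == "__main__":
    # Hypothetical per-node MTTF of 5 years (43,800 hours).
    for n in (1, 100, 10_000):
        print(n, system_mttf(43_800, n), prob_failure_within(24, 43_800, n))
```

Under this assumption, 10,000 such nodes have a system MTTF of about 4.4 hours, so a one-day run is almost certain to see at least one failure, which is why CPR becomes indispensable at scale.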
  • 4. Difficulties in CPR of Heterogeneous Systems • Heterogeneous systems use both CPUs and GPUs • Conventional CPR tools such as BLCR and DMTCP are not designed with GPUs in mind ⇒ CPR fails [Figure: a compute node whose CPU side (host memory) holds the process and whose GPU side (device memory) holds its resources; conventional CPR tools only save the CPU state, whereas CheCL allows conventional tools to save the GPU state] • CheCL has been developed for checkpointing OpenCL applications running on CPU-GPU systems [Takizawa, IPDPS’11]
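The gap that CheCL closes can be illustrated with a toy model. In the Python sketch below, which is hypothetical and is not CheCL's actual implementation, a "conventional" checkpoint serializes only host-side state, so buffers held in a simulated device memory are lost on restart; copying the device buffers into host memory just before the dump makes them part of the saved image. All class and method names are invented for illustration.

```python
import pickle


class ToyProcess:
    """Toy process with host state and separately managed 'device' buffers."""

    def __init__(self):
        self.host_state = {"step": 0}
        # Stands in for GPU memory: invisible to a CPU-only checkpoint tool.
        self._device_buffers = {}
        # Host-side copies of device buffers, filled just before a checkpoint.
        self.host_shadow = {}

    def write_device(self, name, data):
        self._device_buffers[name] = list(data)

    def stage_device_to_host(self):
        """Copy every device buffer into host memory (pre-checkpoint hook)."""
        self.host_shadow = {k: list(v) for k, v in self._device_buffers.items()}

    def cpu_only_checkpoint(self) -> bytes:
        """Mimic a conventional CPR tool: save only host-visible state."""
        return pickle.dumps({"host_state": self.host_state,
                             "host_shadow": self.host_shadow})

    @classmethod
    def restart(cls, image: bytes):
        saved = pickle.loads(image)
        proc = cls()
        proc.host_state = saved["host_state"]
        # Re-create device buffers from the staged host copies, if any.
        for name, data in saved["host_shadow"].items():
            proc.write_device(name, data)
        return proc
```

Checkpointing a `ToyProcess` without calling `stage_device_to_host()` restarts with empty device buffers; calling it first restores them, mirroring the "CPU state only" versus "CPU and GPU state" contrast in the figure.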
  • 5. Difficulties in CPR of Heterogeneous Systems • Problem: checkpointing time increases with the number of nodes
  • 6. Writing Checkpoints to Global Storage is Ineffective • To withstand failures, large-scale heterogeneous systems need to checkpoint more frequently to the global storage, which has low bandwidth • However, the global storage is shared among nodes ⇒ CheCL’s checkpoint time increases with the number of nodes • CheCL is not scalable: the larger the number of nodes, the longer it takes to checkpoint [Figure: compute nodes writing checkpoints to a shared global storage, causing network contention] • Objective – To establish an effective implementation of the checkpointing mechanism for heterogeneous HPC systems
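A first-order model, which is an assumption and not taken from the slides, shows why a shared global storage does not scale: if N nodes each write S bytes and the storage offers a fixed aggregate bandwidth B, the checkpoint takes N·S/B, whereas writing to node-local storage with bandwidth b happens in parallel and takes S/b regardless of N. The sizes and bandwidths below are hypothetical.

```python
def global_checkpoint_time(num_nodes: int, bytes_per_node: float,
                           shared_bandwidth: float) -> float:
    """Seconds to checkpoint when all nodes share one storage bandwidth.

    The total volume N * S must drain through the shared bandwidth B.
    """
    return num_nodes * bytes_per_node / shared_bandwidth


def local_checkpoint_time(bytes_per_node: float, local_bandwidth: float) -> float:
    """Seconds to checkpoint when every node writes to its own local storage.

    Nodes write in parallel, so the time does not depend on the node count.
    """
    return bytes_per_node / local_bandwidth


if __name__ == "__main__":
    GB = 1e9
    for n in (1, 2, 4, 64):
        # Hypothetical: 2 GB per node, 1 GB/s shared storage, 2 GB/s local disk.
        print(n, global_checkpoint_time(n, 2 * GB, 1 * GB),
              local_checkpoint_time(2 * GB, 2 * GB))
```

With these made-up numbers the global checkpoint grows from 2 s on one node to 128 s on 64 nodes, while the local checkpoint stays at 1 s.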
  • 9. Local CheCL • Avoid the network by utilizing each node’s local storage – Simultaneous checkpointing → fast – Less reliable [Figure: local storage added to each compute node, in contrast to the large, reliable but slow global storage]
  • 12. Two-level CheCL • Writing checkpoint files to the global storage is more reliable but time-consuming • Using the local storage of compute nodes is fast but sacrifices reliability • Proposal: Two-level CheCL, which uses both local and global storage ⇒ Local CheCL + Global CheCL [Figure: compute nodes with local storages and a shared global storage]
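A minimal Python sketch of the two-level idea, assuming the 3:1 local-to-global ratio (P_L:P_G = 3:1) used in the evaluation and the ordering suggested by the performance-model timeline, where a global checkpoint is followed by three local ones. The two directory arguments stand in for node-local storage (the evaluation used a tmpfs RAM disk) and the shared global storage. The function names and file naming are hypothetical; CheCL itself checkpoints transparently rather than through calls like these.

```python
import os
import tempfile


def is_global_checkpoint(index: int, local_per_global: int = 3) -> bool:
    """True if checkpoint `index` (0-based) should go to global storage.

    With local_per_global = 3 the pattern is G, L, L, L, G, L, L, L, ...
    """
    return index % (local_per_global + 1) == 0


def two_level_checkpoint(index: int, state: bytes, local_dir: str,
                         global_dir: str, local_per_global: int = 3) -> str:
    """Write one checkpoint file to local or global storage; return the level."""
    level = "global" if is_global_checkpoint(index, local_per_global) else "local"
    target_dir = global_dir if level == "global" else local_dir
    path = os.path.join(target_dir, f"ckpt_{index:04d}.bin")
    with open(path, "wb") as f:
        f.write(state)
    return level


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as local_dir, \
         tempfile.TemporaryDirectory() as global_dir:
        levels = [two_level_checkpoint(i, b"state", local_dir, global_dir)
                  for i in range(8)]
        print(levels)
        print(len(os.listdir(local_dir)), len(os.listdir(global_dir)))
```

Eight checkpoints produce six local files and two global ones. Raising `local_per_global` makes checkpointing cheaper but leaves more work to redo when only the global copy survives, which is the trade-off examined at the end of the evaluation.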
  • 14. Performance Model • The total execution time of an OpenCL application running with Two-level CheCL is T_total • The original execution time is T_s [Figure: timeline of T_s divided into intervals of length n, with checkpoints marked d_G and d_L]
  • 15. Performance Model • The total time spent on checkpointing is T_C [Figure: intervals of length n, each followed by a checkpoint overhead C_ov; total time T_s + T_C]
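  • 16. Performance Model • The total time spent on checkpointing is T_C • The Local CheCL checkpoint overhead is C_L and the Global CheCL checkpoint overhead is C_G [Figure: 75% of checkpoints are local (C_L) and 25% are global (C_G), in the order C_G, C_L, C_L, C_L, separated by intervals of length n; total time T_s + T_C]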
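The slides name T_C, C_L and C_G but do not show the closed form. A plausible reconstruction, offered as an assumption rather than the paper's exact formula: with a checkpoint after every interval of length n, about T_s/n checkpoints are taken, a fraction P_L of them local and P_G = 1 − P_L global (0.75 and 0.25 above), so T_C ≈ (T_s/n)(P_L·C_L + P_G·C_G). The failure-free efficiency T_s/T_total used in the evaluation then becomes T_s/(T_s + T_C). The numbers in the note below are hypothetical.

```python
def checkpoint_overhead(ts: float, n: float, c_local: float, c_global: float,
                        p_local: float = 0.75) -> float:
    """Approximate total checkpoint time T_C.

    ts: original execution time T_s; n: interval between checkpoints;
    c_local / c_global: per-checkpoint overheads C_L and C_G;
    p_local: fraction of checkpoints that are local (P_L).
    """
    num_checkpoints = ts / n
    mean_cost = p_local * c_local + (1.0 - p_local) * c_global
    return num_checkpoints * mean_cost


def efficiency_no_failure(ts: float, n: float, c_local: float, c_global: float,
                          p_local: float = 0.75) -> float:
    """Efficiency T_s / T_total when no failure occurs (T_total = T_s + T_C)."""
    return ts / (ts + checkpoint_overhead(ts, n, c_local, c_global, p_local))
```

With T_s = 100, n = 10, C_L = 1 and C_G = 5 (arbitrary units), the 3:1 mix gives T_C = 20 and an efficiency of about 83%, while global-only checkpointing (`p_local=0.0`) gives T_C = 50 and about 67%.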
  • 17. Performance Model • Assume no failure occurs during the checkpoint process; on average, a failure occurs 0.5n into an interval • T_L is the time overhead when the process can be recovered from the latest checkpoint file [Figure: failures occurring at 0.5n within each interval]
  • 18. Performance Model • Wasted time after a failure: 85% of failures are recovered with restart cost R_L and 15% with restart cost R_G [Moody, SC’10] [Figure: timelines showing the lost 0.5n followed by restart R_L or R_G]
  • 19. Performance Model • T_G is the time overhead when the process can only be recovered from the global checkpoint file [Figure: failure at 0.5n followed by recovery from the global checkpoint C_G, with restart costs R_G and R_L]
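Extending the same assumptions to failures gives a rough expected-value sketch, again not the paper's exact formula. Following the slides, a failure strikes on average 0.5n into an interval, and failures are split 85%/15% [Moody, SC’10]; here that split is read as 85% recoverable from the latest checkpoint (T_L, restart cost R_L) and 15% recoverable only from the global checkpoint (T_G, restart cost R_G). For T_G we additionally assume that the work done since the last global checkpoint must be redone; with a G, L, L, L pattern and failures equally likely in each of the four intervals, that averages 1.5 intervals, each costing n + C_L. The parameter values in the note are hypothetical.

```python
def expected_failure_overhead(n: float, r_local: float, r_global: float,
                              c_local: float, num_failures: float,
                              p_recover_local: float = 0.85,
                              mean_intervals_since_global: float = 1.5) -> float:
    """Expected time lost to `num_failures` failures (rough sketch).

    T_L: lose half an interval, then restart from the latest checkpoint.
    T_G: lose half an interval, restart from the global checkpoint, and
         redo the intervals (and local checkpoints) taken since then.
    """
    t_l = 0.5 * n + r_local
    t_g = 0.5 * n + r_global + mean_intervals_since_global * (n + c_local)
    per_failure = p_recover_local * t_l + (1.0 - p_recover_local) * t_g
    return num_failures * per_failure
```

For n = 10, R_L = 2, R_G = 6 and C_L = 1, this gives T_L = 7 and T_G = 27.5, so two failures cost about 20.15 time units, which would be added to T_s + T_C to estimate T_total.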
  • 21. Experimental Setup • The evaluation was conducted on a GPU cluster of four compute nodes, each with: – Intel Core i7 930 CPU – NVIDIA Tesla C2070 GPU – 24 GB of main memory – 12 GB tmpfs RAM disk • CPR tools: – BLCR 0.8.4 (CPU state checkpointing) – CheCL (GPU state checkpointing) • Benchmark: – Molecular Dynamics (MD)
  • 22. Checkpoint Time Comparison for GPU Cluster [Figure: checkpoint time (ms, 0 to 16000) of Global CheCL vs. Local CheCL for problem sizes 12288, 24574 and 73728 on 1, 2 and 4 nodes; Local CheCL accelerates checkpointing by up to more than 4x]
  • 23. Efficiency (T_s/T_total) Improvement (No Failure) [Figure: efficiency (0 to 100%) vs. checkpoint frequency (1x to 64x) for Two-level CheCL with P_L:P_G = 3:1; series: 2 nodes and 4 nodes, each with Local and Global vs. Global only]
  • 24. Efficiency Improvement (MTTF = 3 minutes) [Schroeder, SciDAC’07] [Figure: efficiency (0 to 100%) vs. checkpoint frequency (1x to 64x) for Two-level CheCL with P_L:P_G = 3:1; series: 4 nodes with Local and Global vs. 4 nodes with Global only]
  • 25. Trade-off Between Local/Global Ratio and Two-level CheCL’s Time Overhead [Figure: time overhead (ms, 0 to 4500) vs. Local/Global ratio from (0:10) to (9:1)]
  • 26. Conclusion • Checkpointing is important for HPC system dependability • Two-level CheCL can improve system efficiency • Local CheCL can be used for high-speed checkpointing • There is a trade-off between Local and Global CheCL that must be handled carefully in future implementations on large-scale GPU computing systems