Sharing GPUs across a Cluster

   also known as "The GPU as a service"
Outline

  • Introduction
  • Facing the problems in GPGPU
  • rCUDA functionality
  • New rCUDA version
  • Getting rCUDA


      As this presentation includes a lot of
    information, readers can jump directly to
      the section that interests them most
Improving application performance

 • The complexity of current applications makes
   their execution times extremely high
 • There is a trend toward accelerating parts of
   their code with GPUs
GPU computing: the building block

  • The basic building block is a node with one
    or more GPUs

  [Figure: a node with a CPU and main memory, one or more GPUs (each
  with its own memory) attached over PCI-e, and a network interface
  connecting the node to the rest of the cluster]
GPU computing: programmer view

  From the programming point of view:
    • A set of nodes, each one with:
        − one or more CPUs (several cores per CPU)
        − one or more GPUs (1-4)
    • An interconnection network

  [Figure: several such nodes — CPU, main memory, PCI-e-attached GPUs
  with their own memory, network interface — joined by an
  interconnection network]
Not all kinds of code are eligible for GPUs
   •   For the right kind of code, GPUs bring huge
       benefits in terms of performance and energy
   •   There must be data parallelism in the code:
       this is the only way to benefit from the
       hundreds of processors inside a GPU
   •   From the application's point of view, we can
       find different scenarios:
        − Low level of data parallelism
        − High level of data parallelism
        − Moderate level of data parallelism
        − Applications for multi-GPU computing
Low and high level of data parallelism

   • Low level of data parallelism

                  Regarding GPU computing?
        No GPU is needed; just proceed with traditional HPC
                             strategies


   • High level of data parallelism

                  Regarding GPU computing?
        Add one or more GPUs to every node in the system and
                   rewrite applications to use them
Moderate level of data parallelism

   • The application presents a data parallelism of
     around 40%-80%. This is the most common case

                  Regarding GPU computing?
       The GPUs in the system are used only for some parts of the
            application, remaining idle the rest of the time
Leaking money in current clusters
• The GPUs of a CUDA-enabled cluster may
  be idle for long periods of time
• This wastes resources and energy
• The total cost of ownership (TCO) is no
  longer dominated by acquisition costs:
  the electricity bill and rack space are
  increasingly important contributors
Last scenario: multi-GPU computing

 • An application can use a large number of
   GPUs in parallel


                 Regarding GPU computing?
    The code running in a node can only access the GPUs in that
     node, but it would run faster if it could have access to more
                                 GPUs
GPU computing presents drawbacks

 • Although GPUs effectively accelerate
   applications, their use may bring additional
   concerns
Outline

  • Introduction
  • Facing the problems in GPGPU
  • rCUDA functionality
  • New rCUDA version
  • Getting rCUDA
Looking for an efficient solution


         • A way to avoid the low-GPU-
           utilization inefficiency is to
           reduce the number of GPUs
           and share the remaining
           ones among the CPU nodes in
           the cluster
         • This would increase GPU
           utilization while also reducing
           power consumption
Saving costs by doing better


           • Doing better by spending
             less money on GPUs in the
             initial investment, thereby
             reducing TCO
           • Doing better by deploying
             rCUDA into your new cost-
             effective cluster
rCUDA adds value

  • rCUDA allows sharing GPUs among
    nodes in the cluster
  • rCUDA allows having fewer GPUs
    than nodes in the cluster
  • rCUDA provides remote access
    from each node to any GPU in the
    system
  • rCUDA reduces costs without
    noticeably reducing performance
The main idea behind rCUDA

   Add only the GPU computing nodes that give you
   the necessary computational power!
   Make all the GPUs accessible from every node
rCUDA also extends CUDA’s possibilities

  • rCUDA allows providing all the
    GPUs in the cluster to a single
    application
     • Current limitations in the number
       of GPUs per node are avoided
     • Useful for multi-GPU computing:
       now the only limit is the
       programmer’s ability to
       accelerate her/his application
rCUDA for multi-GPU computing
          • All GPUs available to every node

  [Figure: five nodes, each with a CPU, main memory, network
  interface, and two PCI-e-attached GPUs with their own memory,
  joined by an interconnection network]

          Currently, from a given CPU it is only possible to access
                       the GPUs in that very same node
rCUDA for multi-GPU computing
          • All GPUs available to every node

  [Figure: the same five nodes; logical connections now link every
  CPU to every GPU in the cluster across the interconnection network]

          rCUDA makes all GPUs accessible from every node
rCUDA for multi-GPU computing
          • All GPUs available to every node

  [Figure: as before, with logical connections from a single CPU
  fanning out to many GPUs across the cluster]

  rCUDA makes all GPUs accessible from every node and
  enables a CPU to access as many GPUs as required
Outline

  • Introduction
  • Facing the problems in GPGPU
  • rCUDA functionality
  • New rCUDA version
  • Getting rCUDA
How rCUDA works

  • rCUDA is middleware that enables
    seamless remote CUDA usage
  • rCUDA clusters are equipped with:
     − The rCUDA client at every node
     − The rCUDA server only in those
       nodes having a GPU
  • Client-server communication: general
    TCP/IP communications, or alternatively
    a highly efficient low-level
    communications library for InfiniBand
    networks
Seamless usage

   Usual way to use GPUs currently:

   [Diagram: Application → CUDA library]
Seamless usage

      rCUDA leverages a client and a server

   [Diagram — Client side: Application. Server side: CUDA library]
Seamless usage

      rCUDA leverages a client and a server

   [Diagram — Client side: Application → rCUDA library → Network
   interface. Server side: Network interface → rCUDA daemon →
   CUDA library]
Seamless usage

      rCUDA leverages a client and a server

   [Diagram — Client side: Application → rCUDA library → Network
   interface. Server side: Network interface → rCUDA daemon →
   CUDA library. Requests travel from client to server over the
   network and responses travel back]
rCUDA uses a proprietary communication protocol
Example (sketched in CUDA code after this list):

 1) initialization
 2) memory allocation
    on the remote GPU
 3) CPU to GPU memory
    transfer of the input
    data
 4) kernel execution
 5) GPU to CPU memory
    transfer of the results
 6) GPU memory release
 7) communication
    channel closing and
    server process
    finalization
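
Because rCUDA intercepts the CUDA API transparently, an ordinary CUDA
program exercises exactly this protocol. The following minimal sketch
(the vecAdd kernel, sizes, and values are illustrative, not part of
rCUDA) maps its calls onto the seven steps above:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative kernel: element-wise vector addition. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);   /* 1) the first CUDA call initializes   */
    cudaMalloc(&d_b, bytes);   /*    the (remote) context; 2) memory   */
    cudaMalloc(&d_c, bytes);   /*    is allocated on the remote GPU    */

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  /* 3) CPU to GPU */
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);  /*    input data */

    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);   /* 4) kernel     */

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  /* 5) GPU to CPU */
                                                          /*    results    */
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);          /* 6) GPU memory */
    free(h_a); free(h_b); free(h_c);                      /*    release    */
    return 0;  /* 7) process exit closes the channel; the server finalizes */
}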
Outline

  • Introduction
  • Facing the problems in GPGPU
  • rCUDA functionality
  • New rCUDA version
  • Getting rCUDA
Features in the new rCUDA version

 • CUDA 5 support
 • Efficient InfiniBand support
 • Support for CUDA extensions to C
 • Multithread support
 • Support for providing a single application
   with multiple GPUs across the cluster
New InfiniBand support

    Why InfiniBand support?
      • InfiniBand is the most used HPC network
        − Low latency and high bandwidth

    [Chart: interconnect family share in the TOP500 list]
New InfiniBand support
     Use of IB Verbs
       • All TCP/IP stack overhead is avoided
       • Bandwidth between client and remote GPU is
         near the peak InfiniBand network bandwidth
     Use of GPUDirect
       • Reduces the number of intra-node data
         movements
     Use of pipelined transfers
       • Overlaps intra-node data movements with
         network transfers (see the sketch below)
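
The pipelining idea can be sketched with standard CUDA calls:
double-buffer the incoming network data in pinned host memory, so the
PCI-e copy of one chunk overlaps the network reception of the next.
This is only an illustration of the technique under stated
assumptions, not rCUDA's internal code; recv_chunk is a hypothetical
stand-in for the network layer.

#include <cuda_runtime.h>
#include <string.h>

/* Hypothetical stand-in for the network layer: in rCUDA this would   */
/* return the next chunk of data arriving from the client. The stub   */
/* below just serves 16 MB of zeros so the sketch links and runs.     */
static size_t remaining = (size_t)1 << 24;
static size_t recv_chunk(void *buf, size_t chunk)
{
    size_t got = remaining < chunk ? remaining : chunk;
    memset(buf, 0, got);
    remaining -= got;
    return got;
}

/* Double-buffered pipeline: while the GPU DMA-copies chunk k from one */
/* pinned staging buffer, chunk k+1 is received from the network into  */
/* the other buffer, overlapping the two data movements.               */
static void pipelined_upload(void *d_dst, size_t chunk)
{
    void *stage[2];
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMallocHost(&stage[0], chunk);  /* pinned memory enables async DMA */
    cudaMallocHost(&stage[1], chunk);

    size_t off = 0;
    size_t got = recv_chunk(stage[0], chunk);
    for (int buf = 0; got > 0; buf ^= 1) {
        cudaMemcpyAsync((char *)d_dst + off, stage[buf], got,
                        cudaMemcpyHostToDevice, stream);  /* intra-node copy */
        off += got;
        got = recv_chunk(stage[buf ^ 1], chunk);   /* overlapped reception */
        cudaStreamSynchronize(stream);   /* buffer reusable next iteration */
    }
    cudaFreeHost(stage[0]);
    cudaFreeHost(stage[1]);
    cudaStreamDestroy(stream);
}

int main(void)
{
    void *d;
    cudaMalloc(&d, (size_t)1 << 24);
    pipelined_upload(d, (size_t)1 << 20);  /* 16 MB in 1 MB pipelined chunks */
    cudaFree(d);
    return 0;
}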
Performance example for InfiniBand

   [Chart: remote GPU bandwidth (MB/s) for synchronous transfers, to
   GPU and from GPU, comparing rCUDA over 1 Gbps Ethernet (GigaE), IP
   over InfiniBand (IPoIB), and the low-level IB Verbs library against
   a local GPU. With IB Verbs, rCUDA approaches both the maximum
   bandwidth of InfiniBand QDR and that of the NVIDIA Tesla C2050;
   the enhanced performance comes from an internal algorithm making
   use of pinned memory]
Performance example for InfiniBand

   [Chart: matrix-matrix product, 4096 x 4096, execution time in
   seconds. Relative to the CPU (Intel Xeon E5645 with MKL, 100%),
   CUDA on a local GPU (NVIDIA GeForce 9800 GTX with CUBLAS 3.2)
   takes 48% and rCUDA over 40G InfiniBand takes 50%.
   1. Local GPU computation is much faster than the CPU.
   2. Using a remote GPU through rCUDA is only slightly slower than
      a local GPU.
   3. Therefore, employing a remote GPU is much faster than a local
      CPU]
Performance example for InfiniBand

Overhead time for a matrix-matrix product is lower than 1%

   [Chart: % overhead of the local GPU and of rCUDA versus matrix
   dimension, from 100 to 14600; Tesla C2050, Intel Xeon E5645,
   QDR InfiniBand]
Performance example for InfiniBand

Execution time for the LAMMPS application, in.eam input script
scaled by a factor of 5 in the three dimensions

   [Chart: Tesla C2050, Intel Xeon E5520, QDR InfiniBand]
Support for CUDA extensions to C
  • Previously, rCUDA did not support the CUDA
    extensions to C
  • In order to execute a program with rCUDA, the
    CUDA extensions included in its code had to be
    "unextended" to the plain C API, because NVCC
    inserts calls to undocumented CUDA functions
    (see the sketch below)
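
For instance, a kernel launch written with the triple-angle-bracket
extension was lowered by NVCC, in the CUDA 5 era, into a sequence of
runtime calls outside the documented programming interface, roughly as
sketched below (a fragment, not a complete program; the vecAdd kernel
and the 64-bit argument offsets are illustrative):

/* CUDA extension to C: the triple-angle-bracket kernel launch.      */
vecAdd<<<grid, block>>>(d_a, d_b, d_c, n);

/* Roughly what NVCC emitted for it at the time, using runtime entry */
/* points outside the documented API -- all of which an rCUDA client */
/* has to intercept and forward to the server:                       */
cudaConfigureCall(grid, block, 0 /* shared mem */, 0 /* stream */);
cudaSetupArgument(&d_a, sizeof(d_a),  0);
cudaSetupArgument(&d_b, sizeof(d_b),  8);
cudaSetupArgument(&d_c, sizeof(d_c), 16);
cudaSetupArgument(&n,   sizeof(n),   24);
cudaLaunch((const char *)vecAdd);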
Support for CUDA extensions to C
  • The new rCUDA version to be released will
    support the CUDA extensions to C
  • The exact way we have achieved this goal
    will not be disclosed within this document. We
    ask for some patience …
Multithread support
  • The new rCUDA version supports applications
    with multiple threads
  • All the threads of the application can access
    the remote GPU concurrently, in the same way
    as if the GPU were installed in the node
    executing the application (see the sketch below)
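
A minimal sketch of this, assuming an ordinary CUDA toolchain (nvcc
with C++11 threads): four host threads each create their own stream
and issue work against device 0, which under rCUDA may physically
live in another node. The scale kernel and sizes are illustrative.

#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void scale(float *x, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

/* Each host thread owns its buffer and stream, but all target the   */
/* same device 0 -- which, with rCUDA, may be in a different node.   */
static void worker(int n)
{
    cudaStream_t st;
    cudaStreamCreate(&st);
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemsetAsync(d, 0, n * sizeof(float), st);
    scale<<<(n + 255) / 256, 256, 0, st>>>(d, 2.0f, n);
    cudaStreamSynchronize(st);
    cudaFree(d);
    cudaStreamDestroy(st);
}

int main()
{
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) threads.emplace_back(worker, 1 << 16);
    for (auto &t : threads) t.join();
    return 0;
}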
MultiGPU support for a single application
  • The new rCUDA version is able to provide a single
    application with all the GPUs in the cluster
  • Accelerating applications no longer depends on
    the number of GPUs that fit into a node (see the
    sketch below)
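
A sketch of what this enables: the application enumerates whatever
GPUs the rCUDA client exposes (possibly spread over several nodes)
with the standard device-management calls and distributes work across
them. The fill kernel and sizes are illustrative.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void fill(float *x, float v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = v;
}

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);    /* with rCUDA this may report GPUs */
    printf("visible GPUs: %d\n", count);  /* hosted by several nodes  */

    const int n = 1 << 20;
    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);   /* ordinary CUDA calls; rCUDA routes    */
        float *d;             /* them to whichever node hosts GPU dev */
        cudaMalloc(&d, n * sizeof(float));
        fill<<<(n + 255) / 256, 256>>>(d, (float)dev, n);
        cudaDeviceSynchronize();
        cudaFree(d);
    }
    return 0;
}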
MultiGPU multithreaded support
  • The new rCUDA features can be combined within a
    single application, so that each thread of the
    application can access as many GPUs as it
    requires
Outline

  • Introduction
  • Facing the problems in GPGPU
  • rCUDA functionality
  • New rCUDA version
  • Getting rCUDA
Getting rCUDA

  • Full InfiniBand version freely available:
     − Enhanced client-server data transfers
     − High-performance InfiniBand
       communications library
     − TCP/IP-based communications also
       included for non-InfiniBand networks


    http://www.rcuda.net
