Reducing the Runtime of Collective
                                        Communications
                                  ISC’10 Birds of a Feather Session

                                                          June 3, 2010
© 2010 Voltaire Inc.
Agenda


    ►      Scalability Challenges for Group Communication

    ►      Voltaire Fabric Collective Accelerator™ (FCA™)

             • Yaron Haviv, CTO, Voltaire


    ►      Customer Experience:

           University of Braunschweig

             • Josef Schüle



About Voltaire (NASDAQ: VOLT)

    ►      Leading provider of scale-out data center fabrics
             • Used by more than 30% of Fortune 100 companies
             • Hundreds of installations of over 1000 servers

    ►      Addressing the challenges of HPC, virtualized data centers
           and clouds
    ►      More than half of TOP500 InfiniBand sites
    ►      InfiniBand and 10GbE scale-out fabrics

        End-to-End Scale-out Fabric Product Line




MPI Collectives

    ►      Collective Operations = Group Communication (All to All, One to
           All, All to One)
    ►      Synchronous by nature = consume many “Wait” cycles on large
           clusters

    ►      Popular examples:

             • Reduce
             • Allreduce
             • Barrier
             • Bcast
             • Gather
             • Allgather

    [Chart: Collective Operations % of MPI Job Runtime. Y axis: Percentage, 0 to 100. Applications: ANSYS FLUENT, SAGE, CPMD, LSTC LS-DYNA, CD-Adapco STAR-CD, Dacapo]



                Your cluster might be spending half its time on idle collective cycles
Collective Example - Allreduce

    ►      Allreduce – The Concept
             • Perform a specified operation across the arguments of all processes, and distribute the
               result to all processes. Example with SUM operation:


               [Figure: four processes hold 8, 7, 6 and 9; pairwise partial sums are 15 and 15; every process ends with the total, 30]

    ►      Allreduce on a 4-node cluster




               [Figure: four nodes, each with cores holding the values 1 to 8; after the Allreduce every core holds the total, 144]
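
As a rough illustration of the concept, the sketch below simulates an Allreduce with SUM in plain Python. It uses recursive doubling, one common server-side algorithm, as an assumption; the slides do not say which algorithm the MPI library uses. The function name `allreduce_sum` is hypothetical, and the inputs (8, 7, 6, 9 for the small example, and cores holding 1 to 8 on each of 4 nodes) are read off the figures on this slide.

```python
def allreduce_sum(values):
    """Simulate Allreduce(SUM) with recursive doubling.

    Assumes len(values) is a power of two. Returns (results, steps), where
    results[i] is the value held by process i at the end.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("process count must be a power of two")
    current = list(values)
    steps = 0
    distance = 1
    while distance < n:
        # Each process exchanges its partial sum with the partner whose
        # rank differs in exactly one bit, then both keep the combined sum.
        current = [current[rank] + current[rank ^ distance] for rank in range(n)]
        distance *= 2
        steps += 1
    return current, steps


# Four processes, as in the SUM example: 8 + 7 + 6 + 9.
small, small_steps = allreduce_sum([8, 7, 6, 9])
print(small, small_steps)

# Four nodes with eight cores each, every core contributing 1..8.
cluster = [core for _node in range(4) for core in range(1, 9)]
big, big_steps = allreduce_sum(cluster)
print(big[0], big_steps)
```

In this scheme each process takes log2(P) exchange rounds for P processes, and on a real cluster most of those rounds cross the network.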

Now try running it on a Petascale machine…



               [Figure: fabric hierarchy on a petascale machine]
                  • Dozens of core switches (3 hops)
                  • Hundreds of edge switches (1 hop)
                  • Tens of thousands of cores




                              Single Operation > 3000usec – Not Scalable
The Challenge:
   Collective Operations Scalability

    ►      Grouping algorithms are unaware of the topology
           and inefficient


    ►      Network congestion due to “All-to-All”
           communication


    ►      Slow nodes & OS involvement impair scalability
           and predictability

           [Figure: Expected vs. Actual]




    ►      The more powerful servers get (GPUs, more
           cores), the poorer collectives scale in the fabric
The Voltaire InfiniBand Fabric:
   Equipped for the Challenge

   Grid Director Switches: Fabric Processing Power
        +
   Unified Fabric Manager (UFM): Topology Aware Orchestrator



                   Fabric computing in use to address the collective challenge
Introducing:
   Voltaire Fabric Collective Accelerator

   Grid Director Switches:
             • Collective operations offloaded to switch CPUs
        +
   FCA Manager:
             • Topology-based collective tree
             • Separate Virtual network
             • IB multicast for result distribution
             • Integration with job schedulers
        +
   FCA Agent:
             • Inter-core processing localized & optimized



                       Breakthrough performance with no additional hardware
Efficient Collectives with FCA

     1. Pre-config
     2. Inter-core processing
     3. 1st tier offload
     4. 2nd tier offload (result at root)
     5. Result distribution (single message)
     6. Allreduce on 100K cores in 25 usec

     [Figure: cores holding 1 to 8 sum to 36 per node, 648 per first-tier switch and 11664 at the root; after distribution every core holds 11664]
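
The numbers in the figure are consistent with a three-level sum: cores holding 1 to 8 give 36 per node, 18 nodes give 648 per first-tier switch, and 18 first-tier switches give 11664 at the root. A minimal Python sketch of that flow follows. The fan-out of 18 at each switch tier is inferred from those values rather than stated on the slide, and the function names are hypothetical; this is an illustration of the idea, not the FCA implementation.

```python
CORES_PER_NODE = 8       # cores hold the values 1..8, as in the figure
NODES_PER_SWITCH = 18    # assumption: inferred from 648 = 18 * 36
EDGE_SWITCHES = 18       # assumption: inferred from 11664 = 18 * 648


def node_sum():
    # Step 2: inter-core processing, done locally inside one server.
    return sum(range(1, CORES_PER_NODE + 1))


def edge_switch_sum():
    # Step 3: first-tier offload, one partial result per node reaches the switch.
    return sum(node_sum() for _ in range(NODES_PER_SWITCH))


def root_sum():
    # Step 4: second-tier offload, one partial result per edge switch reaches the root.
    return sum(edge_switch_sum() for _ in range(EDGE_SWITCHES))


def allreduce():
    # Step 5: the root result is distributed so that every core holds it.
    total = root_sum()
    cores = CORES_PER_NODE * NODES_PER_SWITCH * EDGE_SWITCHES
    return [total] * cores


per_node = node_sum()
per_switch = edge_switch_sum()
result = allreduce()
print(per_node, per_switch, result[0], len(result))
```

The point of the layout is that only one partial result travels up each link, so the number of network stages depends on the number of switch tiers rather than on the number of cores.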


UFM Integrated With Job Schedulers

    ►      Job Submitted in Scheduler
    ►      Matching Jobs Automatically Created in UFM
    ►      Fabric-wide Policy Pushed to Match Application Requirements
             • QoS
             • Routing
             • Placement
             • Collectives
    ►      Application Level Monitoring & Optimization Measurements
FCA Benefits:
   Slashing Job Runtime

    ►      Slashing Runtime

           [Chart: IMB Allreduce 2048 Cores. Y axis: usec, 0 to 4000. Open MPI: >3000usec; FCA: <30usec]




    ►      Eliminating Runtime Variation
             • OS jitter – eliminated in switches
             • Traffic congestion – significantly lower number of messages
             • Cross-application interference – collectives offloaded on a private virtual network

           [Figure: Completion Time Distribution, FCA-based Collectives vs. Server-based Collectives]
FCA Benefits:
   Unprecedented Scalability on HPC Clusters
    [Chart: log-scale Y axis from 1 to 10000, X axis from 0 to 1200. Series: ompi-Allreduce-bynode, ompi-Barrier-bynode, FCA-Allreduce, FCA-Barrier. Annotation: > 180X]

        ►   Extreme performance improvement on raw collectives
        ►   Scale according to number of switch hops, not number
            of nodes – O(log18)

    [Chart annotation: > 50%]

        ►   As process count increases
              • % of time spent in MPI increases
              • % of time spent in collectives increases
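
To make the scaling claim concrete, the sketch below compares the number of levels in a fan-out-18 switch tree with the number of pairwise exchange rounds a host-based algorithm would need across the same number of nodes. Reading O(log18) as a logarithm with base 18 is an assumption, as is the use of a base-2 round count for the server-based baseline; `levels_needed` is a hypothetical helper.

```python
def levels_needed(endpoints, fan_out):
    """Smallest number of tree levels whose capacity covers `endpoints`."""
    levels, capacity = 0, 1
    while capacity < endpoints:
        capacity *= fan_out
        levels += 1
    return levels


for nodes in (18, 324, 5832, 100_000):
    switch_levels = levels_needed(nodes, 18)   # tiers in a fan-out-18 tree
    host_rounds = levels_needed(nodes, 2)      # rounds of pairwise exchange
    print(f"{nodes:>7} nodes: {switch_levels} switch levels, {host_rounds} host rounds")
```

Integer multiplication is used instead of a floating-point logarithm so that exact powers of 18 are not rounded to the wrong level.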


                             Enabling capability computing on HPC clusters
Additional Benefits


    ►      Simple, fully integrated
             • No changes to application required

    ►      Tolerance to higher oversubscription (blocking) ratio
             • Same performance at lower cost

    ►      Enables use of non-blocking collectives
             • Part of future MPI implementations

             • FCA guarantees no computation power penalty

    ►      Reduce fabric congestion
             • Avoid interference to other jobs


Customer Experience
                       University of Braunschweig


                                          June 3, 2010
About University of Braunschweig

    ►      General Overview
             • Founded in 1745
             • 120 institutes with ca. 2900 employees
             • Ca. 13000 students
    ►      Main Fields of Research
             • Mobility and transport (road, rail, air and space)
             • Biological and biotechnological research
             • Digital television




System Configuration

    Newest installation:
    ►      Nodes type: NEC HPC 1812Rb-2
               •       CPU: 2 x Intel X5550, MEM: 6 x 2GB, IB: 1 x Infinihost DDR onboard
    ►      System Configuration: 186 nodes
               •       24 nodes per switch (DDR), 12 QDR links to tier2 switches (non-blocking)
    ►      OS: CentOS 5.4
    ►      Open MPI: 1.4.1
    ►      FCA: 1.0_RC3 rev 2760
    ►      UFM: 2.3 RC7
    ►      Switch: 3.0.629

    [Figure labels: 4 x QDR; 24 x DDR]
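
The "non-blocking" label for 24 DDR node links against 12 QDR uplinks can be checked with simple arithmetic. The sketch below assumes every link is 4x wide, so a DDR link signals at 20 Gbit/s and a QDR link at 40 Gbit/s; those rates come from the InfiniBand standard, not from the slide, and `oversubscription` is a hypothetical helper.

```python
from fractions import Fraction

# Signalling rate per 4x link in Gbit/s (assumption: all links are 4x wide).
LINK_RATE_GBPS = {"DDR": 20, "QDR": 40}


def oversubscription(down_links, down_type, up_links, up_type):
    """Ratio of host-facing to uplink bandwidth; 1 means non-blocking."""
    down = down_links * LINK_RATE_GBPS[down_type]
    up = up_links * LINK_RATE_GBPS[up_type]
    return Fraction(down, up)


ratio = oversubscription(24, "DDR", 12, "QDR")
print(f"down:up = {ratio}")
```

A ratio above 1 would indicate an oversubscribed (blocking) edge switch.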




FCA Performance:
   A Real Cluster Example with 2048 Ranks

    [Chart: Collective latency (usec). Y axis: Latency (us), log scale from 10 to 10000. X axis: Number of ranks (16 ranks per node), 0 to 2500. Series: ompi-Allreduce, ompi-Barrier, FCA-Allreduce, FCA-Barrier. Annotations: 4000 Microsecond; 180x Faster]



Real Application Results

    ►      OpenFOAM
             • Open source CFD solver produced by a commercial company, OpenCFD
             • Used by many leading automotive companies

    [Chart: OpenFOAM CFD Aerodynamic Benchmark (64 cores). Y axis: Seconds, 0 to 5000. Series: Open MPI 1.4.1, Open MPI 1.4.1 + FCA. Annotation: 41% better]


    ►      Expected benefits for several other applications
             • e.g. DLPOLY (molecular dynamics)
Voltaire Fabric Collective Accelerator
   Summary

    ►      Fully integrated fabric computing offload
             • Combination of SW & HW in a single solution
             • Offloading blocking computational tasks
             • Algorithms leveraging the topology for computation (trees)

    ►      Extreme MPI performance & scalability
             • Capability computing on commodity clusters
             • Two orders of magnitude (a hundred times) faster collective runtime
             • Scale by number of hops, not number of nodes
             • Variation eliminated, consistent results

    ►      Transparent to the application
             • Plug & play, no need for code changes


                                Accelerate your fabric!
Q&A




Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

Voltaire - Reducing the Runtime of Collective Communications

  • 1. Reducing the Runtime of Collective Communications ISC’10 Birds of a Feather Session June 3, 2010 © 2010 Voltaire Inc.
  • 2. Agenda ► Scalability Challenges for Group Communication ► Voltaire Fabric Collective Accelerator™ (FCA™) • Yaron Haviv, CTO, Voltaire ► Customer Experience: University of Braunschweig • Josef Schüle
  • 3. About Voltaire (NASDAQ: VOLT) ► Leading provider of scale-out data center fabrics • Used by more than 30% of Fortune100 companies • Hundreds of installations of over 1000 servers ► Addressing the challenges of HPC, virtualized data centers and clouds ► More than half of TOP500 InfiniBand sites ► InfiniBand and 10GbE scale-out fabrics End-to-End Scale-out Fabric Product Line
  • 4. MPI Collectives ► Collective Operations = Group Communication (All to All, One to All, All to One) ► Synchronous by nature = consume many “Wait” cycles on large clusters ► Popular examples: • Reduce • Allreduce • Barrier • Bcast • Gather • Allgather [Chart: Collective Operations % of MPI Job Runtime; y-axis: Percentage, 0 to 100; applications: ANSYS FLUENT, SAGE, CPMD, LSTC LS-DYNA, CD-Adapco STAR-CD, Dacapo] Your cluster might be spending half its time on idle collective cycles
  • 5. Collective Example - Allreduce ► Allreduce – The Concept • Perform specific operation on all arguments, and distribute result to all processes. Example with SUM operation: [Diagram: values 8, 7, 6, 9; partial sums 15 and 15; result 30 on every process] ► Allreduce on a 4-node cluster [Diagram: each node holds core values 1 to 8; after the Allreduce every core holds 144]
  • 6. Now try running it on a Petascale machine… Dozens of core switches (3 hops); hundreds of edge switches (1 hop); tens of thousands of cores. Single Operation > 3000 usec – Not Scalable
  • 7. The Challenge: Collective Operations Scalability ► Grouping algorithms are unaware of the topology and inefficient ► Network congestion due to “All-to-All” communication ► Slow nodes & OS involvement impair scalability and predictability [Chart: Expected vs. Actual] ► The more powerful servers get (GPUs, more cores), the poorer collectives scale in the fabric
  • 8. The Voltaire InfiniBand Fabric: Equipped for the Challenge ► Grid Director Switches: Fabric Processing Power ► Unified Fabric Manager (UFM): Topology Aware Orchestrator Fabric computing in use to address the collective challenge
  • 9. Introducing: Voltaire Fabric Collective Accelerator ► Grid Director Switches: Fabric Processing Power ► Unified Fabric Manager (UFM): Topology Aware Orchestrator ► FCA Manager: • Topology-based collective tree • Separate virtual network • IB multicast for result distribution • Integration with job schedulers ► Grid Director Switches: Collective operations offloaded to switch CPUs ► FCA Agent: Inter-core processing localized & optimized Breakthrough performance with no additional hardware
  • 10. Efficient Collectives with FCA 1. Pre-config 2. Inter-core processing 3. 1st tier offload 4. 2nd tier offload (result at root) 5. Result distribution (single message) 6. Allreduce on 100K cores in 25 usec [Diagram: each node holds core values 1 to 8 and reduces them to 36; 1st tier switches hold 648; the root holds 11664, which is distributed to every core]
  • 11. UFM Integrated With Job Schedulers ► Job Submitted in Scheduler ► Matching Jobs Automatically Created in UFM ► Fabric-wide Policy Pushed to Match Application Requirements • QoS • Routing • Placement • Collectives ► Application Level Monitoring & Optimization Measurements
  • 12. FCA Benefits: Slashing Job Runtime ► Slashing Runtime [Chart: IMB Allreduce, 2048 cores; y-axis: usec, 0 to 4000; Open MPI: >3000 usec; FCA: <30 usec] ► Eliminating Runtime Variation • OS jitter – eliminated in switches • Traffic congestion – significantly lower number of messages • Cross-application interference – collectives offloaded on a private virtual network [Chart: Completion Time Distribution; Server-based Collectives vs. FCA-based Collectives]
  • 13. FCA Benefits: Unprecedented Scalability on HPC Clusters [Chart: log scale, 1 to 10000; x-axis 0 to 1200; series: ompi-Allreduce-bynode, ompi-Barrier-bynode, FCA-Allreduce, FCA-Barrier; annotations: > 180X, > 50%] ► Extreme performance improvement on raw collectives ► Scale according to number of switch hops, not number of nodes – O(log18) ► As process count increases • % of time spent in MPI increases • % of time spent in collectives increases Enabling capability computing on HPC clusters
  • 14. Additional Benefits ► Simple, fully integrated • No changes to application required ► Tolerance to higher oversubscription (blocking) ratio • Same performance at lower cost ► Enables use of non-blocking collectives • Part of future MPI implementations • FCA guarantees no computation power penalty ► Reduce fabric congestion • Avoid interference to other jobs
  • 15. Customer Experience: University of Braunschweig, June 3, 2010
  • 16. About University of Braunschweig ► General Overview • Founded in 1745 • 120 institutes with ca. 2900 employees • Ca. 13000 students ► Main Fields of Research • Mobility and transport (road, rail, air and space) • Biological and biotechnological research • Digital television
  • 17. System Configuration Newest installation: ► Nodes type: NEC HPC 1812Rb-2 • CPU: 2 x Intel X5550, MEM: 6 x 2GB, IB: 1 x InfiniHost DDR onboard ► System Configuration: 186 nodes • 24 nodes per switch (DDR), 12 QDR links to tier2 switches (non-blocking) ► OS: CentOS 5.4 ► Open MPI: 1.4.1 ► FCA: 1.0_RC3 rev 2760 ► UFM: 2.3 RC7 ► Switch: 3.0.629 [Diagram: switches with 24 x DDR node links and 4 x QDR uplinks]
  • 18. FCA Performance: A Real Cluster Example with 2048 Ranks [Chart: Collective latency (usec); y-axis: Latency (us), log scale, 10 to 10000; x-axis: Number of ranks (16 ranks per node), 0 to 2500; series: ompi-Allreduce, ompi-Barrier, FCA-Allreduce, FCA-Barrier; annotations: 4000 Microsecond, 180x Faster]
  • 19. Real Application Results ► OpenFoam • Open source CFD solver produced by a commercial company, OpenCFD • Used by many leading automotive companies [Chart: Open Foam CFD Aerodynamic Benchmark (64 cores); y-axis: Seconds, 0 to 5000; series: Open MPI 1.4.1, Open MPI 1.4.1 + FCA; annotation: 41% better] ► Expected benefits for several other applications • e.g. DLPOLY (molecular dynamics)
  • 20. Voltaire Fabric Collective Accelerator Summary ► Fully Integrated Fabric computing offload • Combination of SW & HW in a single solution • Offloading blocking computational tasks • Algorithms leveraging the topology for computation (trees) ► Extreme MPI performance & scalability • Capability computing on commodity clusters • Two orders of magnitude, hundred-times faster collective runtime • Scale by number of hops, not number of nodes • Variation eliminated - Consistent results ► Transparent to the application • Plug & play - No need for code changes Accelerate your fabric!
  • 21. Q&A
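
The tree-based Allreduce described on slides 5, 10 and 13 can be sketched as a small simulation. This is an illustrative model only, not Voltaire's FCA implementation: the function names `hierarchical_allreduce` and `switch_tiers` are hypothetical, and the fan-in of 18 children per switch is an assumption read off the slide figures (36, 648, 11664 and the O(log18) scaling note).

```python
def hierarchical_allreduce(node_values, radix):
    """Simulate a SUM Allreduce over a reduction tree.

    node_values: one list of per-core values for each node.
    radix: assumed number of children each switch aggregates.
    Returns (per-core results, partial sums at each tier).
    """
    if not node_values:
        raise ValueError("at least one node is required")
    # Inter-core processing: each node reduces its own cores locally.
    level = [sum(cores) for cores in node_values]
    tiers = [level]
    # Switch offload: each tier sums groups of up to `radix` children.
    while len(level) > 1:
        level = [sum(level[i:i + radix]) for i in range(0, len(level), radix)]
        tiers.append(level)
    total = level[0]
    # Result distribution: every core receives the same total.
    result = [[total] * len(cores) for cores in node_values]
    return result, tiers


def switch_tiers(num_nodes, radix):
    """Count the switch tiers needed to cover num_nodes at the given fan-in."""
    tiers, span = 0, 1
    while span < num_nodes:
        span *= radix
        tiers += 1
    return tiers


if __name__ == "__main__":
    # 324 nodes (18 x 18), each with cores holding the values 1 to 8.
    nodes = [list(range(1, 9)) for _ in range(324)]
    result, tiers = hierarchical_allreduce(nodes, radix=18)
    print([tier[0] for tier in tiers], result[0][0])
    print(switch_tiers(324, 18))
```

In this model the number of reduction steps grows with the number of switch tiers rather than with the number of nodes, which is the scaling argument made on slide 13. Real latency also depends on link speed, message size and switch processing time, none of which the sketch models.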