Online performance modeling and
analysis of message-passing
parallel applications


PhD Thesis
Oleg Morajko
Universitat Autònoma de Barcelona
Barcelona, 2008

[Title slide figure: execution timeline annotated "Delayed receive" and "Long local calculations"]
Motivation
• Parallel system hardware is evolving at an incredible rate
• Contemporary HPC systems
    – Top500 ranging from 1,000 to 200,000+ processors (June 2008)
    – Take BSC MareNostrum: 10K processors

• Whole industry is shifting to parallel computing




                                                                     2
Motivation
• Challenges of developing large-scale scientific software
   – Evolution of programming models is much slower
   – Hard to achieve good efficiency
   – Hard to achieve scalability

• Parallel applications rarely achieve good
  performance immediately
                                                             3
Motivation
• Challenges of developing large-scale scientific software
   – Evolution of programming models is much slower
   – Hard to achieve good efficiency
   – Hard to achieve scalability

• Parallel applications
  rarely achieve good
  performance immediately



   Careful performance analysis
 and optimization tasks are crucial
                                                             4
Motivation
• Quickly finding performance problems and their reasons is hard
• Requires thorough understanding of the program’s behavior
   – Parallel algorithm, domain decomposition, communication, synchronization

• Large scale brings additional complexities
   – Large data volume, excessive analysis cost

• Existing tools support finding what happens, where, and when
   – Locating root causes of problems is still manual
   – Tools exhibit scalability limitations (e.g., tracing)

• Problem diagnosis still requires substantial time and effort from
  highly skilled professionals

                                                                                5
Our goals
• Analyze the performance of parallel applications
• Detect bottlenecks and explain their causes
   – Focus on communication and synchronization in message-passing
     programs


• Automate the approach to the extent possible
• Scale to thousands of nodes
• Operate online, without trace files



                                                                     6
Contributions
• A systematic approach for automated diagnosis of application
  performance
   – Application is monitored, modeled and diagnosed during its execution

• Scalable modeling technique that generates performance
  knowledge about application behavior

• Analysis technique that diagnoses MPI applications running in
  large-scale parallel systems
   – Detects performance bottlenecks on-the-fly
   – Finds root causes

• Prototype tool to demonstrate the ideas

                                                                        7
Outline

1. Overview of approaches

2. Online performance modeling

3. Online performance analysis

4. Experimental evaluation

5. Conclusions and future work


                                 8
Overview
of approaches

                9
Classical performance analysis

[Cycle diagram: Develop → Compile & instrument → Execute → Trace files →
Analyze trace (visualization tool) → Performance problems → Find solutions →
Code changes → Develop]
                                                         10
Classical performance analysis
Drawbacks

•   Manual task of an experimental nature
•   Time consuming
•   High degree of expertise required
•   Full traces yield an excessive volume of information
•   Poor scalability




                                                  11
Automated offline analysis

[Cycle diagram: Develop → Compile & instrument → Execute → Trace files →
Analyze trace with automated tools (KappaPI, EXPERT) → Performance problems →
Find solutions → Code changes → Develop]
                                                            12
Automated offline analysis
Drawbacks

• Post-mortem
• Addresses only well-known problems
• Capabilities for finding root causes not fully explored




                                                        13
Automated online analysis

[Cycle diagram: Develop → Compile & instrument → Execute → Online monitoring
and diagnosis (Paradyn) → Performance problems (What, Where, When) →
Find solutions → Code changes → Develop]
                                                              14
Automated online analysis

Paradyn advantages
• Locates problems while the app runs
• Automated problem-space search
   – Functional decomposition
   – Refinable measurements
• Scalable

Paradyn drawbacks
• Addresses lower-level problems (profiler)
• No search for root causes of problems
                                                             15
Automated online analysis
Our approach

[Cycle diagram: Develop → Compile → Execute → Monitoring (consumes events) →
Modeling → Analysis (observes the model, refines monitoring) →
Problems and causes → Find solutions → Code changes → Develop]
                                                                          16
Automated online analysis
Key characteristics
• Discovers application model on-the-fly
   – Model execution flows, not modules/functions
   – Lossy trace compression

• Runtime analysis based on continuous model
  observation
• Automatically locates problems while the app runs
• Searches for root causes of problems


                                                    17
Monitoring




Modeling




             Analysis




Online performance
modeling

                        18
Modeling objectives

• Enable high-level understanding of application performance

• Reflect parallel application structure and runtime behavior

• Maintain a tradeoff between the volume of collected data and the
  level of preserved detail
   – Communication and computational patterns

   – Causality of events

• Base for online performance analysis


                                                                19
Online performance modeling

• Novel application performance modeling approach

• Combines static code analysis with runtime monitoring to
  extract performance knowledge

• Three-step approach:
   – Modeling individual tasks

   – Modeling inter-task communication

   – Modeling entire application



                                                             20
Modeling individual tasks
• We decompose execution into units that correspond to
  different activities:
   –   Communication activities (e.g., MPI_Send, MPI_Gather)
   –   Computation activities (e.g., calc_gauss)
   –   Control activities (e.g., program start/termination)
   –   Others (e.g., I/O)


• We capture the execution flow through these activities using a
  directed graph called the Task Activity Graph (TAG); a data-structure
  sketch follows after this slide:
   – Nodes model communication activities and loops
   – Edges represent the sequential flow of execution (computation activities)
   – Nodes and edges maintain the happens-before relationship
                                                                              21
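To make the TAG concrete, here is a minimal C++ sketch of one possible representation, assuming nodes keyed by call path and statistical profiles on both nodes and edges; the names (Profile, TagNode, TagEdge, TaskActivityGraph) are illustrative, not the thesis' actual data structures.

    #include <cstdint>
    #include <map>
    #include <vector>

    // Statistical execution profile kept per node/edge.
    struct Profile {
        uint64_t count = 0;             // number of executions
        double   sum = 0, sum2 = 0;     // accumulated time and squared time
        double   min = 1e300, max = 0;  // extremes
    };

    // Sequential-flow edge: the computation between two activities.
    struct TagEdge {
        int     target;                 // index of the destination node
        Profile prof;                   // computation-time profile
    };

    // Node: a communication activity or a loop head.
    struct TagNode {
        uint64_t callPath;              // hash of the source-code location
        Profile  prof;                  // time spent inside the activity
        std::vector<TagEdge> out;       // outgoing sequential-flow edges
    };

    // One task's activity graph; nodes looked up by call path.
    struct TaskActivityGraph {
        std::vector<TagNode> nodes;
        std::map<uint64_t, int> byPath; // call-path hash -> node index
    };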
Modeling individual tasks
The Task Activity Graph (TAG) reflects program structure by
  modeling the executed flow of activities




                                                      22
Modeling individual tasks
• Each activity corresponds to a particular location
  in the source code




                              23
Modeling individual tasks
• Runtime behavior of activities is described by adding
  performance metrics to nodes and edges
• Data aggregated into statistical execution profiles

[Figure: TAG annotated with metrics; each edge carries a counter and an
accumulative timer {min, max, stddev}, and each node carries an
accumulative timer {min, max, stddev}]
                                                                   24
Modeling communication
• Message edges capture matching send-receive links
   – P2P, Collective
• Completion edges capture non-blocking semantics
• Performance metrics describe runtime behavior




                                                      25
Modeling parallel application
• Individual TAG models connected by message edges
  form a Parallel-TAG model (PTAG)




                                                     26
Modeling techniques
We developed a set of techniques to automatically construct
  and exploit the PTAG model at runtime

•   Targeted at parallel scientific applications
•   Focus on modeling MPI applications
•   But extensible to other programming paradigms
•   Low overhead
•   Scalable to 1000+ nodes




                                                              27
Online PTAG construction

[Architecture diagram: MPI tasks 1..N at the bottom, one Modeler per task,
TBON nodes above them, and a Front-end at the root. Steps: 1) instrument
each MPI task, 2) build the local TAG, 3) sample it, 4) update the Modeler,
5) merge TAGs at the TBON nodes, 6) update the Front-end, 7) analyze]
                                                                                          28
Building individual TAG

[Diagram: the Modeler 1) analyzes the executable and 2) instruments the
MPI task; the RT Library 3) captures events and 4) updates the TAG in
shared memory; the Modeler 5) samples the model and 6) sends updates
upstream]
                                                                          29
Building individual TAG
Offline program analysis
• Parse the binary executable
• Find target functions
• Detect relevant loops

[Diagram: step 1, the Modeler analyzes the executable]
                                                30
Building individual TAG
Dynamic instrumentation
• Instrument all target functions:
     – Record events
     – Collect performance metrics
     – Invoke TAG update
• Refinable at runtime (see the sketch below)

[Diagram: step 2, the Modeler instruments the MPI task]
                                            31
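The prototype uses DynInst for this step; below is a minimal sketch of instrumenting one target function's entry through the public BPatch API. The recordEvent callback in the RT library is a hypothetical name, and details of the DynInst 5.1 API may differ slightly, so treat this as an assumption-laden illustration rather than the tool's actual code.

    #include "BPatch.h"
    #include "BPatch_function.h"
    #include "BPatch_point.h"
    #include "BPatch_process.h"

    // Instrument the entry of one target function (e.g. "MPI_Send") in a
    // running process, so a runtime-library callback fires on every call.
    void instrumentEntry(BPatch_process *proc, const char *name) {
        BPatch_image *image = proc->getImage();

        BPatch_Vector<BPatch_function *> funcs;
        image->findFunction(name, funcs);           // locate the target
        if (funcs.empty()) return;

        const BPatch_Vector<BPatch_point *> *entries =
            funcs[0]->findPoint(BPatch_entry);      // entry points
        if (!entries || entries->empty()) return;

        // recordEvent is assumed to be exported by the preloaded RT library.
        BPatch_Vector<BPatch_function *> callbacks;
        image->findFunction("recordEvent", callbacks);
        if (callbacks.empty()) return;

        BPatch_Vector<BPatch_snippet *> args;       // no args in this sketch
        BPatch_funcCallExpr call(*callbacks[0], args);
        proc->insertSnippet(call, *entries);        // refinable: removable later
    }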
Building individual TAG
Performance metrics
•       Counters
•       Timers {sum, sum2, min, max}
•       Histograms
•       Compound metrics

[Figure: TAG with counter increments (cnt1++ … cnt5++) on edges and
timer probes (t1, t2, t3, t4) around activities]
                                                                                        32
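As an illustration of the {sum, sum2, min, max} timer, the sketch below shows how it might be accumulated: keeping the sum of squares lets mean and standard deviation be derived later without storing individual samples, which is how the per-activity profiles on the earlier slides can stay constant-space. The Timer name is illustrative.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    struct Timer {
        uint64_t n = 0;
        double sum = 0, sum2 = 0;          // sum of times and squared times
        double min = 1e300, max = 0;

        void record(double t) {            // called once per activity execution
            ++n;
            sum += t;
            sum2 += t * t;
            min = std::min(min, t);
            max = std::max(max, t);
        }
        double mean() const { return n ? sum / n : 0.0; }
        double stddev() const {            // sqrt(E[t^2] - E[t]^2)
            if (n < 2) return 0.0;
            double m = mean();
            return std::sqrt(std::max(0.0, sum2 / n - m * m));
        }
    };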
Building individual TAG
Runtime modeling
• Process generated events
• Walk the stack to capture the program location (call path),
  as sketched below
• Update the TAG incrementally

[Diagram: step 3, the RT Library captures events and, step 4, updates the
TAG in shared memory]
                                                  33
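A minimal sketch of the stack walk, assuming a POSIX environment: backtrace() from <execinfo.h> collects the chain of return addresses, and hashing them yields a key identifying the TAG node for the current program location. The FNV-1a hash is an illustrative choice, not necessarily the thesis' scheme.

    #include <execinfo.h>
    #include <cstdint>

    // Hash the current call path (chain of return addresses) into a key
    // identifying the TAG node for this program location.
    uint64_t callPathKey() {
        void *frames[64];
        int depth = backtrace(frames, 64);    // walk the stack
        uint64_t h = 1469598103934665603ULL;  // FNV-1a offset basis
        for (int i = 0; i < depth; ++i) {
            h ^= (uint64_t)reinterpret_cast<uintptr_t>(frames[i]);
            h *= 1099511628211ULL;            // FNV-1a prime
        }
        return h;
    }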
Building individual TAG
Model sampling
•   Goal: examine the model at runtime
•   Read the model from shared memory
•   Sampling is periodic
•   Lock-free synchronization (see the sketch below)

[Diagram: step 5, the Modeler samples the TAG from shared memory]
                                      34
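One common lock-free scheme for a single-writer shared-memory model is a sequence lock: the writer increments a counter before and after each update, and the sampler retries whenever it observes an odd or changed value. A sketch under that assumption; the thesis may use a different mechanism.

    #include <atomic>
    #include <cstring>

    struct SharedModel {
        std::atomic<uint32_t> seq{0};  // even = stable, odd = update in progress
        char data[256 * 1024];         // serialized TAG area in shared memory
    };

    // Writer side (RT library); a single writer is assumed.
    void publish(SharedModel &m, const char *tag, size_t len) {
        m.seq.fetch_add(1, std::memory_order_release);  // becomes odd
        std::memcpy(m.data, tag, len);
        m.seq.fetch_add(1, std::memory_order_release);  // becomes even again
    }

    // Sampler side (Modeler): retry until a consistent snapshot is read.
    void sample(SharedModel &m, char *out, size_t len) {
        uint32_t s1, s2;
        do {
            s1 = m.seq.load(std::memory_order_acquire);
            std::memcpy(out, m.data, len);
            s2 = m.seq.load(std::memory_order_acquire);
        } while ((s1 & 1) != 0 || s1 != s2);
    }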
Online communication modeling
How to model inter-task communication?
• Intercept MPI communication calls (nodes); see the sketch below
• Match sender nodes with receiver nodes
• Add message edges to the TAG models
                                              35
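Interception of MPI calls is commonly done through the standard PMPI profiling interface, sketched below; the updateTagNode hook into the runtime library is a hypothetical name, and the prototype itself inserts probes with dynamic instrumentation, so this is just the simplest equivalent mechanism.

    #include <mpi.h>

    // Hypothetical runtime-library hook: records the event and updates
    // the TAG node for this call site.
    extern "C" void updateTagNode(const char *activity);

    // Redefine MPI_Send and delegate to the PMPI_ entry point that every
    // MPI implementation provides for profiling tools.
    extern "C" int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                            int dest, int tag, MPI_Comm comm) {
        updateTagNode("MPI_Send.entry");
        int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
        updateTagNode("MPI_Send.exit");
        return rc;
    }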
Online communication modeling
• Requires tracking individual messages transmitted from
  sender to receiver(s) at runtime

• Achieved by propagating piggyback data with every
  transmitted MPI message (a sketch follows below)
   • Transmit the node id from sender to receiver(s)
   • P2P / blocking / non-blocking / collective
   • Optimized hybrid strategy to minimize intrusion

• Store references to the sender's nodes in the receiver's TAG
                                                               36
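One simple piggybacking strategy is to prepend a small header, the sender's TAG node id plus its send-entry timestamp, to the user payload in a staging buffer. The slide's optimized hybrid strategy presumably avoids this extra copy for large messages (e.g., via derived datatypes), so treat this as the naive variant with illustrative names.

    #include <mpi.h>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    struct PiggybackHeader {
        uint64_t senderNodeId;  // TAG node id of the sending activity
        double   sendEntryTs;   // send-entry timestamp (e1), used later
                                // for synchronization-cost accounting
    };

    int piggybackSend(const void *buf, int bytes, int dest, int tag,
                      MPI_Comm comm, uint64_t nodeId) {
        PiggybackHeader hdr{nodeId, MPI_Wtime()};
        std::vector<char> staged(sizeof hdr + bytes);
        std::memcpy(staged.data(), &hdr, sizeof hdr);         // header first
        std::memcpy(staged.data() + sizeof hdr, buf, bytes);  // then payload
        return PMPI_Send(staged.data(), (int)staged.size(), MPI_BYTE,
                         dest, tag, comm);
    }

    int piggybackRecv(void *buf, int bytes, int src, int tag,
                      MPI_Comm comm, PiggybackHeader *hdr) {
        std::vector<char> staged(sizeof *hdr + bytes);
        MPI_Status status;
        int rc = PMPI_Recv(staged.data(), (int)staged.size(), MPI_BYTE,
                           src, tag, comm, &status);
        std::memcpy(hdr, staged.data(), sizeof *hdr);         // extract header
        std::memcpy(buf, staged.data() + sizeof *hdr, bytes); // user payload
        return rc;
    }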
Online parallel application modeling
Building and maintaining PTAG

• Individual TAGs are distributed
• Collect TAG snapshots
• Distributed merge
• Periodic process

[Diagram: individual TAGs flow through a hierarchical reduction network
(TBON) that merges groups of TAGs into the global PTAG]
                                                                          37
Online parallel application modeling
Scalable modeling

[Figure: PTAG size grows with scale; 8 nodes: 250 KB, 1,024 nodes: 62 MB,
10,240 nodes: 625 MB]

• Increasing data volume
• Increasing analysis cost
• Non-scalable visualization
                                                               38
Online parallel application modeling
Resolving scalability issues

• Classes of similar tasks
   – e.g., stencil codes, master/worker (M/W)

• TAG clustering
   – Structural equivalence
   – Behavioral equivalence


• Distributed and scalable
  TAG merging algorithm

                                       39
Online parallel application modeling
Scalable PTAG visualization
• Example: 1D stencil, 8 nodes




                                       40
Benefits of modeling

• Facilitates performance understanding

• Reveals communication and computational patterns and their
  causal relationships

• Enables an assortment of online analysis techniques
   – Quick identification of performance bottlenecks and their location

   – Behavioral task clustering

   – Causal relationships permit root-cause analysis

   – Feedback-guided analysis (refinements)


                                                                          41
Monitoring




Modeling




             Analysis




Online performance
analysis

                        42
Online analysis objectives
• Diagnose performance on-the-fly
• Detect relevant performance bottlenecks and their
  reasons
• Distinguish problem symptoms from root causes
• Explain what, where, when and why

• Focus on communication and synchronization
  problems in MPI applications


                                                      43
Online performance analysis
Time-continuous Root-Cause Analysis process

[Diagram: Monitoring feeds Modeling, which feeds Analysis; analysis proceeds
through Phase 1 (problem identification), Phase 2 (problem analysis), and
Phase 3 (cause-effect analysis)]
                                                                    44
Root-cause analysis

Phase 1: Problem identification
• Focus attention on code regions with the biggest potential
  optimization benefits
• A potential bottleneck: an individual task activity with a
  significant amount of execution time

• A TAG node might correspond to a communication or
  synchronization problem
• A TAG edge might indicate a computation-bound problem
                                                                                   45
Problem identification

[Figure: rainbow-colored TAG showing cold and hot activities; a CPU-bound
activity takes ~45% of time, and a blocked receive takes ~42% of time,
indicating a communication or synchronization problem]

• Rainbow-spectrum TAG coloring
• Coloring metric: activity time / max activity time
                                                                                  46
Problem identification

TAG ranking process
• Identify potential bottlenecks for further analysis
• Periodic ranking over a moving time window
• Select top problems by ranking (a sketch follows below):

     Rank = activity time / task time
     > 20% for computation activities
     > 3% for communication activities

[Figure: TAG snapshot → ranking → potential bottlenecks]
                                                                                                47
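The ranking rule as code, a direct transcription of the thresholds stated on this slide; the Activity record and its fields are illustrative names.

    #include <string>
    #include <vector>

    struct Activity {
        std::string location;  // source-code location of the TAG node/edge
        bool isComputation;    // edge (computation) vs. node (communication)
        double time;           // time accumulated in the moving time window
    };

    // Rank = activity time / task time; keep activities above the
    // per-class threshold (>20% computation, >3% communication).
    std::vector<Activity> selectBottlenecks(const std::vector<Activity> &acts,
                                            double taskTime) {
        std::vector<Activity> bottlenecks;
        for (const Activity &a : acts) {
            double rank = a.time / taskTime;
            double threshold = a.isComputation ? 0.20 : 0.03;
            if (rank > threshold)
                bottlenecks.push_back(a);
        }
        return bottlenecks;
    }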
Root-cause analysis

Phase 2: In-depth problem analysis
• For each potential bottleneck, investigate its causes
• Explore a knowledge-based cause space
• Focus on causes that contribute most to the problem time

• Distinguish task-local problems from inter-task problems
   – Find root causes of task-local problems
       • e.g., CPU-bound computation, local I/O
   – Find symptoms of inter-task problems
       • e.g., blocked receive, barrier
                                                                                  48
In-depth problem analysis

Performance models for activities
• Classification of activities
• Each class has a performance model that divides the activity
  cost into separate components
• Each component is a non-exclusive potential cause of the problem
                                                                                  49
In-depth problem analysis

Model for computational activities
•   Sequential code region modeled by a TAG edge
•   No external knowledge about the computation
•   Determine where the edge-constrained code spends time
•   Divide the TAG edge into components
    – Functional or basic-block decomposition
• Apply statistical profiling constrained to an edge (see the sketch below)
    – Dynamic instrumentation
• Other metrics
    – Idle time, I/O time, hardware counters
                                                                                      50
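A minimal sketch of statistical profiling constrained to a TAG edge, assuming a Linux/x86-64 environment: a SIGPROF sampler is armed only while execution is inside the edge, so every sample falls in the edge's code region. The program-counter histogram and the 10 ms period are illustrative; a production sampler would use an async-signal-safe structure instead of a map.

    #define _GNU_SOURCE  // exposes REG_RIP in <ucontext.h> on glibc
    #include <csignal>
    #include <cstdint>
    #include <map>
    #include <sys/time.h>
    #include <ucontext.h>

    static std::map<uintptr_t, uint64_t> pcHistogram;  // PC -> sample count

    static void onSample(int, siginfo_t *, void *ctx) {
        ucontext_t *uc = static_cast<ucontext_t *>(ctx);
        uintptr_t pc = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];  // x86-64
        ++pcHistogram[pc];  // not async-signal-safe; acceptable in a sketch
    }

    // Arm the sampler at edge entry and disarm at edge exit.
    void edgeEnter() {
        struct sigaction sa = {};
        sa.sa_sigaction = onSample;
        sa.sa_flags = SA_SIGINFO | SA_RESTART;
        sigaction(SIGPROF, &sa, nullptr);
        itimerval it = {{0, 10000}, {0, 10000}};  // every 10 ms of CPU time
        setitimer(ITIMER_PROF, &it, nullptr);
    }

    void edgeExit() {
        itimerval off = {{0, 0}, {0, 0}};
        setitimer(ITIMER_PROF, &off, nullptr);    // stop sampling
    }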
In-depth problem analysis

Model for communication activities
Communication cost = Synchronization cost + Transmission cost

[Figure: timeline of a matching send (entry e1, exit e3) and receive
(entry e2, exit e4); the receive's overall communication cost splits into
synchronization cost (waiting for the sender) plus transmission cost]

• Captures the semantics of well-known synchronization inefficiencies
    – Late sender, wait at barrier, early reduce, etc.
                                                                            51
In-depth problem analysis

Model for communication activities
Communication cost = Synchronization cost + Transmission cost

• Piggyback the send-entry timestamp (e1)
• Accumulate the synchronization cost per message edge
  (a sketch follows below)

[Figure: the same send/receive timeline, annotated with the piggybacked
e1 timestamp]

• Captures the semantics of well-known synchronization inefficiencies
    – Late sender, wait at barrier, early reduce, etc.
                                                                              52
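With the piggybacked e1 timestamp available at the receiver, the late-sender synchronization cost can be accumulated per message edge as sketched below: if the sender entered its send after the receiver entered its receive, that difference is waiting time, and the rest of the receive's duration counts as transmission. Synchronized clocks are assumed here (a real tool must correct for clock skew); the names are illustrative.

    #include <algorithm>

    struct MessageEdgeProfile {
        double syncCost = 0;      // accumulated waiting (synchronization) time
        double transmitCost = 0;  // accumulated transmission time
        long   count = 0;
    };

    // Called when a receive completes; sendEntryTs is the piggybacked e1.
    void accountReceive(MessageEdgeProfile &edge,
                        double sendEntryTs,   // e1: sender entered MPI_Send
                        double recvEntryTs,   // e2: receiver entered MPI_Recv
                        double recvExitTs) {  // e4: receive completed
        // Late sender: the receiver blocked until the sender arrived.
        double sync = std::max(0.0, sendEntryTs - recvEntryTs);
        edge.syncCost += sync;
        // The remainder of the receive's duration counts as transmission.
        edge.transmitCost += (recvExitTs - recvEntryTs) - sync;
        edge.count += 1;
    }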
In-depth problem analysis

Example receive activity break-down

[Figure: break-down of a receive activity's cost into components; the
synchronization component requires inter-task cause-effect analysis]
                                                                        53
Root-cause analysis

Phase 3: Cause-effect analysis
• Explain the causes of synchronization inefficiencies
   – Why is the sender late?

• Correlate problems into cause-effect chains
• Distinguish root causes of inefficiencies from their causal
  propagation (symptoms)
• Pinpoint problems in non-dominant code regions
• Improve the feedback provided to application developers
                                                                                    54
Cause-effect analysis

Causal propagation

[Figure: timeline across tasks A, B, and C. ComputationA (Task A) causes a
Late Sender in Task A, which causes Inefficiency 1 at Receive1 in Task B;
ComputationB (Task B) then causes a Late Sender in Task B, which causes
Inefficiency 2 at Receive2 in Task C. Inefficiencies propagate along the
message chain m0, m1]
                                                                                           55
Cause-effect analysis

Explaining problem causes
• Explain the waiting time between two nodes as the differences
  between their execution paths (a sketch follows below)
   – Online adaptation of the Wait-Time Analysis approach by Meira et al.
   – Based on the PTAG model, not a full trace
• Explain synchronization inefficiencies by means of other activities
   – Identify the corresponding execution paths in the PTAG model
   – Compare the paths
   – Build a causal tree with explanations
   – Merge the trees of individual problems
                                                                                                56
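A sketch of the path-comparison idea: align the two tasks' execution paths over the interval ending at the problematic receive, find the segments where the sender spent more time than the receiver, and attribute the waiting time to those segments proportionally. The proportional split is an illustrative simplification of the wait-time analysis the slide refers to.

    #include <map>
    #include <string>
    #include <vector>

    struct PathSegment {
        std::string id;  // TAG edge (computation) or node (communication)
        double time;     // time this task spent in the segment
    };

    // Attribute the receiver's waiting time to sender-side segments,
    // proportionally to the extra time the sender spent in each.
    std::map<std::string, double>
    explainWait(const std::vector<PathSegment> &senderPath,
                const std::vector<PathSegment> &receiverPath,
                double waitingTime) {
        std::map<std::string, double> recvTime, extra, causes;
        for (const PathSegment &s : receiverPath) recvTime[s.id] += s.time;

        double totalExtra = 0;
        for (const PathSegment &s : senderPath) {
            double e = s.time - recvTime[s.id];   // sender-only excess
            if (e > 0) { extra[s.id] += e; totalExtra += e; }
        }
        if (totalExtra <= 0) return causes;       // paths do not differ

        for (const auto &kv : extra)              // share of the wait per segment
            causes[kv.first] = waitingTime * (kv.second / totalExtra);
        return causes;
    }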
Cause-effect analysis

Execution path comparison

[Figure: path q (Task 1) and path p (Task 2) compared edge by edge. The
inefficiency at MPI_Recv in Task 1 (waiting time 138.4 s) is caused by a
Late Sender in Task 2; the root causes are computation edge e3 (91.9%)
and computation edge e2 (7.7%) in Task 2]
                                                                                          57
Benefits of RCA

• Systematic approach to online performance analysis

• Quick identification of problems as they manifest at runtime
  (without trace)

• Causal correlation of different problems

• Discovery of root-causes of synchronization inefficiencies




                                                                 58
Experimental
evaluation

               59
Prototype tool

•   Implemented in C++
•   DynInst 5.1
•   MRNet 1.2
•   OpenMPI 1.2.x
•   Linux platforms
    – x86
    – IA-64 (Itanium)
    – PowerPC 32/64

[Architecture diagram: a global analyzer at the root of an MRNet tree of
comm nodes; a dmad daemon per host attaches to the local MPI tasks]
                                                                                60
Experimental environment

UAB cluster:
     x86/Linux, 32 nodes
     Intel Pentium IV 3 GHz, Linux FC4
     Gigabit Ethernet

BSC MareNostrum:
     PowerPC-64/Linux, 512 nodes (restricted)
     PowerPC 2.3 GHz dual core, SUSE Linux Enterprise Server 9
     Myrinet
                                                               61
Modeling MPI applications
• Experiences with different classes of MPI codes
   – SPMD codes
      • WaveSend – 1D stencil, concurrent wave equation
      • NAS Parallel Benchmarks – 2D stencils
      • SMG2000 – 3D stencil, multigrid solver
   – Master/Worker
      • XFire – forest fire propagation simulator

+ Demonstrated ability to model arbitrary MPI code with
  low overhead
+ Works best with regular codes
– Limitations with recursive codes

                                                          62
Case study #1: Modeling SPMD
Integer sort (IS) NAS Parallel Benchmark
• Large integer sort used in
  “particle method” codes

• Tests both integer computation
  speed and communication
  performance

• Mostly collective communication

• We extract PTAG to understand
  application communication
  patterns and behavior



                                           63
Case study #2: Master/Worker
Forest Fire Propagation Simulator (XFire)
• Calculates the expansion of the fireline
• Computationally intensive code, exploits data parallelism
• We extract and cluster PTAG




                                                              64
Evaluation of overheads
Sources of overheads
    • Offline startup
         – Less than 20 seconds per 1 MB of executable
         – A function of program size

    • Online TAG construction
         – 4-20 μs per instrumented call (*)
         – Depends on the number of instrumented calls and loops

    • Online TAG sampling
         – 40-50 μs per snapshot (256 KB)
         – Depends on program structure size, number of communication links


(*) Experiments conducted on the UAB cluster

                                                                              65
Evaluation of overheads
NAS LU overheads, varying number of nodes

[Chart: absolute overhead (seconds) and relative overhead (%) for 16 to
512 CPUs; the relative overhead grows from 1.26% at 16 CPUs, through
1.34% (32), 1.42% (64), 1.50% (128), and 1.59% (256), to 1.91% at
512 CPUs]
                                                                                                    66
Case study #3: SPMD analysis
WaveSend application
• Parallel calculation of a vibrating string over time

• Wave equation, block-decomposition




• P2P communication to exchange boundary
  points with nearest neighbors

• Synthetic performance problems

                                                        67
Case study #3: SPMD analysis
WaveSend
PTAG

After execution




                               68
Case study #3: SPMD analysis
CPU-bound problem at task 7

PTAG after 30 seconds
of execution




                               69
Case study #3: SPMD analysis
Potential bottlenecks

• Task 0 findings: 35.4% CPU-bound in edge 8→6
• Task 1 findings: 33% CPU-bound in edge 11→6
• Task 6 findings: 32.1% CPU-bound in edge 11→6
• Task 7 findings: 50.5% CPU-bound in edge 8→6
                                         70
Case study #3: SPMD analysis
Potential bottlenecks

• Task 0 findings: 21.4% blocked receive caused by a late sender from task 1
• Task 1 findings: 19.1% blocked receive caused by a late sender from task 2
• Task 6 findings: 19.2% blocked receive caused by a late sender from task 7
                                           71
Case study #3: SPMD analysis
Cause-effect analysis




                               72
Case study #3: SPMD analysis
Analysis results



•   Load imbalance found
•   Multiple instances of the late-sender problem
•   Causal propagation of inefficiencies
•   Root cause found in task 7: an imbalanced computational edge


                                                                73
Conclusions
and future work

                  74
Conclusions
• A novel approach for online performance modeling
   – Discovers high-level application structure and runtime behavior
   – A hybrid technique that combines static code analysis with runtime
     monitoring to extract performance knowledge
   – Scalable to 1000+ processors
• An automated online performance analysis approach
   – Enables quick detection of performance bottlenecks
   – Focuses on explaining sources of communication and synchronization
   – Correlates different problems and identifies their root causes
• A prototype tool that models and analyzes MPI applications
  at runtime

                                                                                    75
Future work
• Modeling
  –   Support for other classes of activities (I/O, MPI RMA)
  –   OpenMP applications
  –   Support for recursive codes
  –   Multi-experiment support


• Analysis
  –   More accurate cause-effect analysis with causal paths
  –   Evaluation of scalability of analysis in large-scale HPC
  –   Actionable recommendations
  –   Integration with automatic tuning framework (MATE)


                                                                 76
Online performance modeling and analysis
 of message-passing parallel applications




        Thank You

           PhD Thesis, Oleg Morajko
       Universitat Autònoma de Barcelona
                                            77

Más contenido relacionado

Similar a Online performance modeling and analysis of message-passing parallel applications

Detection of Seed Methods for Quantification of Feature Confinement
Detection of Seed Methods for Quantification of Feature ConfinementDetection of Seed Methods for Quantification of Feature Confinement
Detection of Seed Methods for Quantification of Feature Confinement
Andrzej Olszak
 
Scientific Workflows Systems :In Drug discovery informatics
Scientific Workflows Systems :In Drug discovery informaticsScientific Workflows Systems :In Drug discovery informatics
Scientific Workflows Systems :In Drug discovery informatics
Khaled Tumbi
 
Lanzamiento Visual Studio 2012 - Modern ALM
Lanzamiento Visual Studio 2012 - Modern ALMLanzamiento Visual Studio 2012 - Modern ALM
Lanzamiento Visual Studio 2012 - Modern ALM
Debora Di Piano
 
Best practices for building and deploying predictive models over big data pre...
Best practices for building and deploying predictive models over big data pre...Best practices for building and deploying predictive models over big data pre...
Best practices for building and deploying predictive models over big data pre...
Kun Le
 
Microsoft ALM Platform Overview
Microsoft ALM Platform OverviewMicrosoft ALM Platform Overview
Microsoft ALM Platform Overview
Steve Lange
 
06 operations and feedback dap-kabel
06   operations and feedback dap-kabel06   operations and feedback dap-kabel
06 operations and feedback dap-kabel
David Alvarez Palomo
 
Il product development - 20 01 2011
Il  product development - 20 01 2011Il  product development - 20 01 2011
Il product development - 20 01 2011
nakham
 
Software Development Life Cycle
Software Development Life CycleSoftware Development Life Cycle
Software Development Life Cycle
Slideshare
 
Cdesc dlp 105_ef_ilt
Cdesc dlp 105_ef_iltCdesc dlp 105_ef_ilt
Cdesc dlp 105_ef_ilt
vncsrabelo
 

Similar a Online performance modeling and analysis of message-passing parallel applications (20)

Design For Testability
Design For TestabilityDesign For Testability
Design For Testability
 
Detection of Seed Methods for Quantification of Feature Confinement
Detection of Seed Methods for Quantification of Feature ConfinementDetection of Seed Methods for Quantification of Feature Confinement
Detection of Seed Methods for Quantification of Feature Confinement
 
POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...
POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...
POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...
 
Scientific Workflows Systems :In Drug discovery informatics
Scientific Workflows Systems :In Drug discovery informaticsScientific Workflows Systems :In Drug discovery informatics
Scientific Workflows Systems :In Drug discovery informatics
 
Instrumentation and measurement
Instrumentation and measurementInstrumentation and measurement
Instrumentation and measurement
 
Lanzamiento Visual Studio 2012 - Modern ALM
Lanzamiento Visual Studio 2012 - Modern ALMLanzamiento Visual Studio 2012 - Modern ALM
Lanzamiento Visual Studio 2012 - Modern ALM
 
DITEC - Software Engineering
DITEC - Software EngineeringDITEC - Software Engineering
DITEC - Software Engineering
 
Software Engineering.ppt
Software Engineering.pptSoftware Engineering.ppt
Software Engineering.ppt
 
Best practices for building and deploying predictive models over big data pre...
Best practices for building and deploying predictive models over big data pre...Best practices for building and deploying predictive models over big data pre...
Best practices for building and deploying predictive models over big data pre...
 
Microsoft ALM Platform Overview
Microsoft ALM Platform OverviewMicrosoft ALM Platform Overview
Microsoft ALM Platform Overview
 
06 operations and feedback dap-kabel
06   operations and feedback dap-kabel06   operations and feedback dap-kabel
06 operations and feedback dap-kabel
 
Introductie Visual Studio ALM 2012
Introductie Visual Studio ALM 2012Introductie Visual Studio ALM 2012
Introductie Visual Studio ALM 2012
 
Introductie Visual Studio ALM 2012
Introductie Visual Studio ALM 2012Introductie Visual Studio ALM 2012
Introductie Visual Studio ALM 2012
 
Unit1
Unit1Unit1
Unit1
 
Pressman ch-3-prescriptive-process-models
Pressman ch-3-prescriptive-process-modelsPressman ch-3-prescriptive-process-models
Pressman ch-3-prescriptive-process-models
 
Il product development - 20 01 2011
Il  product development - 20 01 2011Il  product development - 20 01 2011
Il product development - 20 01 2011
 
CADA english
CADA englishCADA english
CADA english
 
WQD2011 - INNOVATION - DEWA - Substation Signal Analyzer Software
WQD2011 - INNOVATION - DEWA - Substation Signal Analyzer SoftwareWQD2011 - INNOVATION - DEWA - Substation Signal Analyzer Software
WQD2011 - INNOVATION - DEWA - Substation Signal Analyzer Software
 
Software Development Life Cycle
Software Development Life CycleSoftware Development Life Cycle
Software Development Life Cycle
 
Cdesc dlp 105_ef_ilt
Cdesc dlp 105_ef_iltCdesc dlp 105_ef_ilt
Cdesc dlp 105_ef_ilt
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Último (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

Online performance modeling and analysis of message-passing parallel applications

  • 1. Online performance modeling and analysis of message-passing parallel applications Delayed receive PhD Thesis Oleg Morajko Universitat Autònoma de Barcelona, Long local calculations Barcelona, 2008
  • 2. Motivation • Parallel system hardware is evolving at an incredible rate • Contemporary HPC systems – Top500 ranging from 1.000 to 200.000+ processors (June 2008) – Take BSC MareNostrum: 10K processors • Whole industry is shifting to parallel computing 2
  • 3. Motivation • Challenges of developing large-scale scientific software – Evolution of programming models is much slower – Hard to achieve good efficiency – Hard to achieve scalability • The parallel applications rarely achieve good performance immediately MPI 3
  • 4. Motivation • Challenges of developing large-scale scientific software – Evolution of programming models is much slower – Hard to achieve good efficiency – Hard to achieve scalability • The parallel applications rarely achieve good performance immediately Careful performance analysis and optimization tasks are crucial 4
  • 5. Motivation • Quickly finding performance problems and their reasons is hard • Requires thorough understanding of the program’s behavior – Parallel algorithm, domain decomposition, communication, synchronization • Large scale brings additional complexities – Large data volume, excessive analysis cost • Existing tools support finding what happens, where, and when – Locating root causes of problems still manual – Tools expose scalability limitations (E.g. tracing) • Problem diagnosis still requires substantial time and effort of highly-skilled professionals 5
  • 6. Our goals • Analyze the performance of parallel applications • Detect bottlenecks and explain their causes – Focus on communication and synchronization in message-passing programs • Automate the approach to the extent possible • Scalable to thousands of nodes • Online approach without trace files 6
  • 7. Contributions • A systematic approach for automated diagnosis of application performance – Application is monitored, modeled and diagnosed during its execution • Scalable modeling technique that generates performance knowledge about application behavior • Analysis technique that diagnoses MPI applications running in large-scale parallel systems – Detects performance bottlenecks on-the-fly – Finds root causes • Prototype tool to demonstrate the ideas 7
  • 8. Outline 1. Overview of approaches 2. Online performance modeling 3. Online performance analysis 4. Experimental evaluation 5. Conclusions and future work 8
  • 10. Classical performance analysis Code Compile Develop Instrument changes Find Execute solutions Performance Trace problems files Analyze trace Visualization tool 10
  • 11. Classical performance analysis Drawbacks • Manual task of experimental nature • Time consuming • High degree of expertise required • Full trace excessive volume of information • Poor scalability 11
  • 12. Automated offline analysis Code Compile Develop Instrument changes Find Execute solutions Performance Trace problems files Analyze trace Automated tools (KappaPI, EXPERT) 12
  • 13. Automated offline analysis Drawbacks • Post-mortem • Addresses only well-known problems • Not fully explored capabilities to find root causes 13
  • 14. Automated online analysis Develop Code changes Compile Instrument Find solutions Execute Performance problems Online monitoring (What, Where, When) and diagnosis (Paradyn) 14
  • 15. Automated online analysis Paradyn advantages Paradyn drawbacks • Locate problems while app • Addresses lower-level runs problems (profiler) • Automated problem-space • No search for root causes of search problems – Functional decomposition – Refinable measurements • Scalable 15
  • 16. Automated online analysis Our approach Consume Code Develop events Monitoring changes Compile Find Refine solutions Execute Modeling Analysis Observe 16 model Problems and causes
  • 17. Automated online analysis Key characteristics • Discovers application model on-the-fly – Model execution flows, not modules/functions – Lossy trace compression • Runtime analysis based on continuous model observation • Automatically locates problems while app runs • Search for root-causes of problems 17
  • 18. Monitoring Modeling Analysis Online performance modeling 18
  • 19. Modeling objectives • Enable high-level understanding of application performance • Reflect parallel application structure and runtime behavior • Maintain tradeoff between volume of collected data and level of preserved details – Communication and computational patterns – Causality of events • Base for online performance analysis 19
• 20. Online performance modeling
  • A novel application performance modeling approach
  • Combines static code analysis with runtime monitoring to extract performance knowledge
  • Three-step approach:
    – Modeling individual tasks
    – Modeling inter-task communication
    – Modeling the entire application
• 21. Modeling individual tasks
  • Execution is decomposed into units that correspond to different activities:
    – Communication activities (e.g., MPI_Send, MPI_Gather)
    – Computation activities (e.g., calc_gauss)
    – Control activities (e.g., program start/termination)
    – Others (e.g., I/O)
  • Execution flow through these activities is captured in a directed graph called the Task Activity Graph (TAG):
    – Nodes model communication activities and loops
    – Edges represent the sequential flow of execution (computation activities)
    – Nodes and edges maintain the happens-before relationship
• 22. Modeling individual tasks
  The Task Activity Graph (TAG) reflects program structure by modeling the executed flow of activities
• 23. Modeling individual tasks
  Each activity corresponds to a particular location in the source code
• 24. Modeling individual tasks
  • Runtime behavior of activities is described by adding performance metrics to nodes and edges (a sketch follows below)
  • Data is aggregated into statistical execution profiles
  [Diagram labels: edge: counter and accumulative timer {min, max, stddev}; node: accumulative timer {min, max, stddev}]
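As a minimal sketch of what such a metric could look like in the prototype's implementation language, C++ (the struct name `AccTimer` and its layout are illustrative assumptions, not the thesis code): keeping {count, sum, sum of squares, min, max} per node or edge is enough to derive the mean and standard deviation on demand, without storing individual samples.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical accumulative timer attached to a TAG node or edge.
// {count, sum, sum2, min, max} suffice to report mean and stddev
// without keeping per-sample data, which bounds the profile size.
struct AccTimer {
    std::uint64_t count = 0;
    double sum = 0.0, sum2 = 0.0;
    double min = 1e300, max = 0.0;

    void add(double t) {            // called once per activity execution
        ++count;
        sum  += t;
        sum2 += t * t;
        min = std::min(min, t);
        max = std::max(max, t);
    }
    double mean() const { return count ? sum / count : 0.0; }
    double stddev() const {         // population stddev derived from sum/sum2
        if (count == 0) return 0.0;
        double m = mean();
        return std::sqrt(std::max(0.0, sum2 / count - m * m));
    }
};
```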
• 25. Modeling communication
  • Message edges capture matching send/receive links (point-to-point and collective)
  • Completion edges capture non-blocking semantics
  • Performance metrics describe runtime behavior
• 26. Modeling the parallel application
  Individual TAG models connected by message edges form the Parallel-TAG model (PTAG)
• 27. Modeling techniques
  We developed a set of techniques to automatically construct and exploit the PTAG model at runtime:
  • Targeted at parallel scientific applications
  • Focused on modeling MPI applications, but extensible to other programming paradigms
  • Low overhead
  • Scalable to 1000+ nodes
• 28. Online PTAG construction
  [Architecture diagram: MPI tasks at the leaves, one Modeler per task, TBON nodes above them, and the front-end at the root. Steps: (1) instrument the MPI tasks, (2) build the TAGs, (3) sample them, (4) send updates upward, (5) merge TAGs in the TBON, (6) update the front-end, (7) analyze]
• 29. Building the individual TAG
  [Diagram: the Modeler (1) analyzes the executable and (2) instruments the MPI task; the RT library (3) captures events and (4) updates the TAG in shared memory; the Modeler (5) samples the TAG and (6) propagates updates]
• 30. Building the individual TAG: Offline program analysis
  • Parse the binary executable
  • Find target functions
  • Detect relevant loops
• 31. Building the individual TAG: Dynamic instrumentation
  • Instrument all target functions to:
    – Record events
    – Collect performance metrics
    – Invoke TAG updates
  • Refinable at runtime
• 32. Building the individual TAG: Performance metrics
  • Counters
  • Timers {sum, sum2, min, max}
  • Histograms
  • Compound metrics
  [Diagram: counters cnt1..cnt5 and timers t1..t4 inserted at instrumentation points in the code]
• 33. Building the individual TAG: Runtime modeling
  • Process the generated events
  • Walk the stack to capture the program location (call path)
  • Update the TAG incrementally (see the sketch below)
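A hedged sketch of the incremental update step (illustrative names, reusing the AccTimer sketch above; the thesis implementation is richer): each event carries the call path of the activity, the modeler finds or creates the node keyed by that call path, and the time elapsed since leaving the previous node is accounted to the connecting computation edge.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative incremental TAG update. Nodes are keyed by the call path
// captured by the stack walk, e.g. "main>solve>MPI_Send@wave.c:120".
struct TagNode {
    AccTimer timer;                               // time inside the activity
    std::unordered_map<int, AccTimer> out_edges;  // computation time to successors
};

struct Tag {
    std::unordered_map<std::string, int> index;   // call path -> node id
    std::vector<TagNode> nodes;
    int prev = -1;                                // last executed node
    double prev_exit = 0.0;                       // timestamp when we left it

    void on_event(const std::string& call_path, double entry, double exit) {
        auto [it, created] = index.try_emplace(call_path, (int)nodes.size());
        if (created) nodes.emplace_back();        // first visit: create the node
        int id = it->second;
        nodes[id].timer.add(exit - entry);        // time spent in the activity
        if (prev >= 0)                            // computation edge between
            nodes[prev].out_edges[id].add(entry - prev_exit);  // two activities
        prev = id;
        prev_exit = exit;
    }
};
```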
• 34. Building the individual TAG: Model sampling
  • Goal: examine the model at runtime
  • Read the model from shared memory
  • Sampling is periodic
  • Lock-free synchronization (one possible scheme is sketched below)
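The slides do not spell out the protocol, so purely as an assumption, here is one common lock-free scheme that fits a single writer (the RT library) and a periodic reader (the modeler): a seqlock over the shared-memory region. The writer makes the sequence counter odd before mutating the TAG and even afterwards; the reader retries whenever it may have observed a torn snapshot.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct SharedRegion {
    std::atomic<std::uint32_t> seq{0};   // odd while the writer is mutating
    char data[256 * 1024];               // serialized TAG snapshot
};

// Periodic, lock-free sampling: copy the region only if the sequence
// counter is even and unchanged across the copy; otherwise retry.
bool sample(const SharedRegion& r, char* out, std::size_t n) {
    for (int attempt = 0; attempt < 100; ++attempt) {
        std::uint32_t s1 = r.seq.load(std::memory_order_acquire);
        if (s1 & 1) continue;                         // writer active, retry
        std::memcpy(out, r.data, n);
        std::atomic_thread_fence(std::memory_order_acquire);
        if (r.seq.load(std::memory_order_relaxed) == s1)
            return true;                              // consistent snapshot
    }
    return false;                                     // skip this period
}
```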
• 35. Online communication modeling
  How to model inter-task communication?
  • Intercept MPI communication calls (nodes)
  • Match sender nodes with receiver nodes
  • Add message edges to the TAG models
• 36. Online communication modeling
  • Requires tracking individual messages from sender to receiver(s) at runtime
  • Achieved by propagating piggyback data over every transmitted MPI message (illustrated below)
    – The sender's node id is transmitted to the receiver(s)
    – Covers P2P, blocking, non-blocking, and collective operations
    – An optimized hybrid strategy minimizes intrusion
  • References to the sender's nodes are stored in the receiver's TAG
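The slide mentions an optimized hybrid strategy; purely to illustrate the idea, the sketch below takes the naive route of one extra piggyback message per send, using the standard PMPI profiling interface (MPI-3 signatures; `current_tag_node_id` and `record_message_edge` are hypothetical helpers, and the sketch ignores MPI_ANY_SOURCE and message-ordering subtleties the real scheme must handle).

```cpp
#include <mpi.h>

static const int PB_TAG = 32767;           // assumed reserved piggyback tag
extern int current_tag_node_id(void);      // hypothetical: TAG node being executed
extern void record_message_edge(int sender_node);  // hypothetical

// Profiling-interface wrappers: every application send also transmits the
// sender's TAG node id so the receiver can attach a message edge.
int MPI_Send(const void* buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm) {
    int node = current_tag_node_id();
    PMPI_Send(&node, 1, MPI_INT, dest, PB_TAG, comm);
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Recv(void* buf, int count, MPI_Datatype type,
             int src, int tag, MPI_Comm comm, MPI_Status* status) {
    int sender_node = -1;
    MPI_Status pb_status;
    PMPI_Recv(&sender_node, 1, MPI_INT, src, PB_TAG, comm, &pb_status);
    int rc = PMPI_Recv(buf, count, type, src, tag, comm, status);
    record_message_edge(sender_node);      // link sender node -> this receive
    return rc;
}
```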
• 37. Online parallel application modeling: Building and maintaining the PTAG
  • Individual TAGs are distributed
  • TAG snapshots are collected over a hierarchical reduction network (TBON)
  • Distributed merge: individual TAGs → merged groups of TAGs → PTAG
  • Periodic process
• 38. Online parallel application modeling: Scalable modeling
  [Example PTAG sizes: 8 nodes ≈ 250 KB, 1024 nodes ≈ 62 MB, 10240 nodes ≈ 625 MB]
  • Increasing data volume
  • Increasing analysis cost
  • Non-scalable visualization
• 39. Online parallel application modeling: Resolving scalability issues
  • Exploit classes of similar tasks (e.g., stencil codes, master/worker)
  • TAG clustering (a sketch follows below) based on:
    – Structural equivalence
    – Behavioral equivalence
  • A distributed and scalable TAG merging algorithm
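One plausible realization of the structural test (a sketch under assumed definitions; the thesis's equivalence criteria may differ): serialize each task's TAG canonically, for example as sorted node call paths plus adjacency lists, hash the serialization, and group ranks whose signatures collide.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Group task ranks by a structural signature of their TAG. Ranks whose
// graphs have identical node call paths and topology land in one cluster,
// so a regular SPMD code typically collapses to one or two clusters.
std::map<std::size_t, std::vector<int>>
cluster_by_structure(const std::vector<std::string>& canonical_tag_per_rank) {
    std::map<std::size_t, std::vector<int>> clusters;
    std::hash<std::string> hash_fn;
    for (int rank = 0; rank < (int)canonical_tag_per_rank.size(); ++rank)
        clusters[hash_fn(canonical_tag_per_rank[rank])].push_back(rank);
    return clusters;
}
```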
• 40. Online parallel application modeling: Scalable PTAG visualization
  [Figure: merged PTAG of a 1D stencil code on 8 nodes]
• 41. Benefits of modeling
  • Facilitates performance understanding
  • Reveals communication and computational patterns and their causal relationships
  • Enables an assortment of online analysis techniques:
    – Quick identification of performance bottlenecks and their location
    – Behavioral task clustering
    – Causal relationships permit root-cause analysis
    – Feedback-guided analysis (refinements)
• 42. Online performance analysis (pipeline stage: Monitoring → Modeling → Analysis)
• 43. Online analysis objectives
  • Diagnose performance on the fly
  • Detect relevant performance bottlenecks and their reasons
  • Distinguish problem symptoms from root causes
  • Explain what, where, when, and why
  • Focus on communication and synchronization problems in MPI applications
• 44. Online performance analysis
  Root-cause analysis is a time-continuous process built on monitoring, modeling, and analysis:
  • Phase 1: Problem identification
  • Phase 2: Problem analysis
  • Phase 3: Cause-effect analysis
• 45. Root-cause analysis, Phase 1: Problem identification
  • Focus attention on code regions with the biggest potential optimization benefits
  • A potential bottleneck is an individual task activity with a significant amount of execution time
  • A TAG node might correspond to a communication or synchronization problem
  • A TAG edge might indicate a computation-bound problem
• 46. Problem identification
  • Rainbow-spectrum TAG coloring: activity time / max activity time (cold → hot activities)
  [Figure: a CPU-bound activity (~45% of time) and a blocked receive (~42% of time), the latter a communication or synchronization problem, appear as hot nodes]
• 47. Problem identification: TAG ranking process
  • Identify potential bottlenecks for further analysis
  • Periodic ranking over a moving time window
  • Select the top problems by rank, where rank = activity time / task time; thresholds are > 20% for computation activities and > 3% for communication activities (formalized below)
  [Diagram: TAG snapshot → ranking → potential bottlenecks]
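Written out (reconstructed from the slide's thresholds; $W$ denotes the moving time window):

$$
\mathrm{Rank}(a) = \frac{T_{\mathrm{activity}}(a, W)}{T_{\mathrm{task}}(W)},
\qquad
a \text{ is a potential bottleneck if } \mathrm{Rank}(a) >
\begin{cases}
0.20, & a \text{ is a computation activity},\\
0.03, & a \text{ is a communication activity.}
\end{cases}
$$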
• 48. Root-cause analysis, Phase 2: In-depth problem analysis
  • For each potential bottleneck, investigate its causes
  • Explore a knowledge-based cause space
  • Focus on the causes that contribute most to the problem time
  • Distinguish task-local problems from inter-task problems
    – Find the root causes of task-local problems (e.g., CPU-bound computation, local I/O)
    – Find the symptoms of inter-task problems (e.g., blocked receive, barrier)
• 49. In-depth problem analysis: Performance models for activities
  • Classification of activities
  • Each class has a performance model that divides the activity cost into separate components
  • Each component is a non-exclusive potential cause of the problem
• 50. In-depth problem analysis: Model for computational activities
  • A sequential code region is modeled by a TAG edge
  • No external knowledge about the computation is assumed
  • Determine where the edge-constrained code spends time
  • Divide the TAG edge into components
    – Functional or basic-block decomposition
  • Apply statistical profiling constrained to an edge
    – Dynamic instrumentation
  • Other metrics: idle time, I/O time, hardware counters
• 51. In-depth problem analysis: Model for communication activities
  Communication cost = synchronization cost + transmission cost
  [Timeline: the sender enters Send at e1 and exits at e3; the receiver enters Receive at e2 and exits at e4; the waiting portion of the receive is the synchronization cost, the remainder the transmission cost]
  • Captures the semantics of well-known synchronization inefficiencies: late sender, wait at barrier, early reduce, etc.
• 52. In-depth problem analysis: Model for communication activities (continued)
  • The send entry timestamp (e1) is piggybacked on the message
  • The synchronization cost is accumulated per message edge (one consistent decomposition is written out below)
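One consistent reading of the timeline above (an assumption based on the slide, with $e_1$/$e_3$ the sender's entry/exit and $e_2$/$e_4$ the receiver's entry/exit timestamps):

$$
\underbrace{e_4 - e_2}_{\text{overall receive cost}}
= \underbrace{\max(0,\; e_1 - e_2)}_{\text{synchronization cost (late sender)}}
+ \underbrace{e_4 - \max(e_1,\, e_2)}_{\text{transmission cost}}
$$

Because $e_1$ travels as piggyback data on the message itself, the receiver can evaluate this decomposition locally and accumulate the synchronization cost per message edge.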
• 53. In-depth problem analysis
  [Figure: example breakdown of a receive activity's cost; explaining it requires inter-task cause-effect analysis]
• 54. Root-cause analysis, Phase 3: Cause-effect analysis
  • Explain the causes of synchronization inefficiencies (e.g., why is the sender late?)
  • Correlate problems into cause-effect chains
  • Distinguish the root causes of inefficiencies from their causal propagation (symptoms)
  • Pinpoint problems in non-dominant code regions
  • Improve the feedback provided to application developers
• 55. Cause-effect analysis: Causal propagation
  [Diagram: computation in Task A causes a late sender in Task A, which causes inefficiency 1 in Task B; computation in Task B causes a late sender in Task B, which causes inefficiency 2 in Task C. Timeline: Task A computes, then Send1 delivers message m0 to Task B, which accumulated waiting time WT1 (inefficiency 1); Task B computes, then Send2 delivers m1 to Task C, which accumulated WT2 (inefficiency 2)]
• 56. Cause-effect analysis: Explaining problem causes
  • The causes of waiting time between two nodes are derived from the differences between their execution paths
    – An online adaptation of the wait-time analysis approach by Meira et al.
    – Based on the PTAG model, not a full trace
  • Synchronization inefficiencies are explained by means of other activities:
    – Identify the corresponding execution paths in the PTAG model
    – Compare the paths
    – Build a causal tree with explanations
    – Merge the trees of individual problems
• 57. Cause-effect analysis: Execution path comparison
  [Example: an inefficiency at MPI_Recv in Task 1 (waiting time 138.4 s) is caused by a late-sender problem in Task 2; comparing the receiver's path q (Task 1) with the sender's path p (Task 2) attributes 91.9% of the waiting time to computation edge e3 and 7.7% to computation edge e2 in Task 2, which are the root causes. A sketch of the attribution step follows]
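A hedged sketch of the final attribution step (illustrative names; the path identification and matching that precede it are more involved): once the sender-side path segments that exceed the receiver's are known, the observed waiting time is apportioned to them in proportion to their excess time.

```cpp
#include <map>
#include <string>
#include <vector>

struct PathStep {
    std::string edge;   // TAG edge id, e.g. "e3"
    double excess;      // sender time minus receiver time on the matched segment
};

// Apportion a blocked receive's waiting time across the sender's slow
// edges, yielding shares such as {e3: 91.9%, e2: 7.7%} of the wait.
std::map<std::string, double>
attribute_wait(const std::vector<PathStep>& sender_path, double waiting_time) {
    double total = 0.0;
    for (const auto& s : sender_path)
        if (s.excess > 0) total += s.excess;
    std::map<std::string, double> share;
    if (total <= 0) return share;                 // nothing to attribute
    for (const auto& s : sender_path)
        if (s.excess > 0)
            share[s.edge] = waiting_time * (s.excess / total);
    return share;
}
```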
• 58. Benefits of RCA
  • A systematic approach to online performance analysis
  • Quick identification of problems as they manifest at runtime (without a trace)
  • Causal correlation of different problems
  • Discovery of the root causes of synchronization inefficiencies
• 60. Prototype tool
  • Implemented in C++
  • DynInst 5.1, MRNet 1.2, OpenMPI 1.2.x
  • Linux platforms: x86, IA-64 (Itanium), PowerPC 32/64
  [Architecture diagram: a global analyzer at the front-end, a tree of MRNet communication nodes, and one daemon (dmad) attached to each MPI task]
• 61. Experimental environment
  • UAB cluster: x86/Linux, 32 nodes, Intel Pentium IV 3 GHz, Linux FC4, Gigabit Ethernet
  • BSC MareNostrum: PowerPC-64/Linux, 512 nodes (restricted), PowerPC 2.3 GHz dual-core, SUSE Linux Enterprise Server 9, Myrinet
• 62. Modeling MPI applications
  • Experiences with different classes of MPI codes
    – SPMD codes: WaveSend (1D stencil, concurrent wave equation), NAS Parallel Benchmarks (2D stencils), SMG2000 (3D stencil, multigrid solver)
    – Master/Worker: XFire (forest fire propagation simulator)
  + Demonstrated ability to model arbitrary MPI code with low overhead
  + Works best with regular codes
  – Limitations with recursive codes
• 63. Case study #1: Modeling SPMD
  Integer Sort (IS) NAS Parallel Benchmark
  • A large integer sort as used in "particle method" codes
  • Tests both integer computation speed and communication performance
  • Mostly collective communication
  • We extract the PTAG to understand the application's communication patterns and behavior
• 64. Case study #2: Master/Worker
  Forest Fire Propagation Simulator (XFire)
  • Calculates the expansion of the fireline
  • Computationally intensive code that exploits data parallelism
  • We extract and cluster the PTAG
• 65. Evaluation of overheads
  Sources of overhead:
  • Offline startup: less than 20 seconds per 1 MB of executable; depends on program size
  • Online TAG construction: 4-20 μs per instrumented call (*); depends on the number of instrumented calls and loops
  • Online TAG sampling: 40-50 μs per snapshot (256 KB); depends on the program structure size and the number of communication links
  (*) Experiments conducted on the UAB cluster
• 66. Evaluation of overheads
  [Chart: NAS LU overhead in seconds and percent for 16 to 512 CPUs; the relative overhead stays below 2% at every scale, with individual points between roughly 1.26% and 1.91%]
• 67. Case study #3: SPMD analysis
  WaveSend application
  • Parallel calculation of a vibrating string over time
  • Wave equation with block decomposition
  • P2P communication to exchange boundary points with nearest neighbors
  • Synthetic performance problems
• 68. Case study #3: SPMD analysis
  [Figure: WaveSend PTAG after execution]
• 69. Case study #3: SPMD analysis
  [Figure: PTAG after 30 seconds of execution, showing a CPU-bound problem at task 7]
• 70. Case study #3: SPMD analysis
  Potential bottlenecks:
  • Task 0: 35.4% CPU-bound in edge 8→6
  • Task 1: 33% CPU-bound in edge 11→6
  • Task 6: 32.1% CPU-bound in edge 11→6
  • Task 7: 50.5% CPU-bound in edge 8→6
• 71. Case study #3: SPMD analysis
  Potential bottlenecks:
  • Task 0: 21.4% blocked receive caused by a late sender from task 1
  • Task 1: 19.1% blocked receive caused by a late sender from task 2
  • Task 6: 19.2% blocked receive caused by a late sender from task 7
• 72. Case study #3: SPMD analysis
  [Figure: cause-effect analysis]
• 73. Case study #3: SPMD analysis
  Analysis results:
  • Load imbalance found
  • Multiple instances of the late-sender problem
  • Causal propagation of inefficiencies observed
  • The root cause was found in task 7: an imbalanced computational edge
• 75. Conclusions
  • A novel approach for online performance modeling
    – Discovers high-level application structure and runtime behavior
    – A hybrid technique that combines static code analysis with runtime monitoring to extract performance knowledge
    – Scalable to 1000+ processors
  • An automated online performance analysis approach
    – Enables quick detection of performance bottlenecks
    – Focuses on explaining the sources of communication and synchronization problems
    – Correlates different problems and identifies their root causes
  • A prototype tool that models and analyzes MPI applications at runtime
• 76. Future work
  • Modeling
    – Support for other classes of activities (I/O, MPI RMA)
    – OpenMP applications
    – Support for recursive codes
    – Multi-experiment support
  • Analysis
    – More accurate cause-effect analysis with causal paths
    – Evaluation of the scalability of the analysis in large-scale HPC
    – Actionable recommendations
    – Integration with the automatic tuning framework (MATE)
• 77. Online performance modeling and analysis of message-passing parallel applications
  Thank you!
  PhD Thesis, Oleg Morajko
  Universitat Autònoma de Barcelona