Application Profiling & Analysis

Rishi Pathak
National PARAM Supercomputing Facility, C-DAC
riship@cdac.in
What is application profiling
• Profiling
   – Recording of summary information during execution
      • inclusive, exclusive time, # calls, hardware counter statistics, …
   – Reflects performance behavior of program entities
      • functions, loops, basic blocks
      • user-defined “semantic” entities
   – Very good for low-cost performance assessment
   – Helps to expose performance bottlenecks and hotspots
   – Implemented through either
      • sampling: periodic OS interrupts or hardware counter traps
      • measurement: direct insertion of measurement code
Sampling vs. Instrumentation
                     Sampling                                       Instrumentation

Overhead             Typically about 1%                             High, may be 500%

System-wide          Yes; profiles all applications, drivers,       No; only the application and
profiling            and OS functions                               instrumented DLLs

Detects unexpected   Yes; can detect other programs using OS        No
events               resources

Setup                None                                           Automatic insertion of data
                                                                    collection stubs required

Data collected       Counters, processor and OS state               Call graph, call times,
                                                                    critical path

Data granularity     Assembly-level instructions, with source line  Functions, sometimes
                                                                    statements

Detects              No; limited to processes and threads           Yes; can see the algorithm and which
algorithmic issues                                                  call path is expensive




Inclusive v/s Exclusive Profiling
    int main( )
    { /* takes 100 secs */
      f1(); /* takes 20 secs */
      /* other work */
      f2(); /* takes 50 secs */
      f1(); /* takes 20 secs */
      /* other work */
    }

    /* similar for other metrics, such
    as hardware performance counters,
    etc. */

    • Inclusive time for main
         – 100 secs
    • Exclusive time for main
         – 100 - 20 - 50 - 20 = 10 secs
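The arithmetic above can be written as a small helper. This is a minimal sketch: the names exclusive_time and slide_example are made up for illustration, and all times are assumed to be in seconds.

```c
#include <stddef.h>

/* Exclusive time = inclusive time minus the inclusive time of every
   direct child call. All values are in seconds. */
double exclusive_time(double inclusive, const double *child_inclusive,
                      size_t n_children)
{
    double exclusive = inclusive;
    for (size_t i = 0; i < n_children; i++) {
        exclusive -= child_inclusive[i];
    }
    return exclusive;
}

/* The example from the slide: main takes 100 s and calls f1, f2, f1. */
double slide_example(void)
{
    const double calls[] = { 20.0, 50.0, 20.0 };  /* f1, f2, f1 */
    return exclusive_time(100.0, calls, sizeof calls / sizeof calls[0]);
}
```

The same subtraction applies to any additive metric, such as a hardware counter total.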
What are Application Traces
• Tracing
  – Recording of information about significant points (events) during program
    execution
      • entering/exiting code region (function, loop, block, …)
      • thread/process interactions (e.g., send/receive message)
  – Save information in event record
      • timestamp
      • CPU identifier, thread identifier
      • Event type and event-specific information
  – Event trace is a time-sequenced stream of event records
  – Can be used to reconstruct dynamic program behavior
  – Typically requires code instrumentation
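An event record and a trace buffer might look like the sketch below. The struct layout, the event kinds, and the function names are assumptions for illustration, not the format of any particular tracing tool. region_time_ns shows how the time-sequenced stream can be replayed to reconstruct behavior, assuming matching, non-nested enter/exit pairs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event kinds; real tracing tools define many more. */
enum event_type { EV_ENTER, EV_EXIT, EV_SEND, EV_RECV };

/* One event record: when, where, and what happened. */
struct event_record {
    uint64_t timestamp_ns;  /* time of the event */
    uint32_t cpu_id;        /* CPU identifier */
    uint32_t thread_id;     /* thread identifier */
    enum event_type type;   /* event type */
    uint64_t info;          /* event-specific data, e.g. a region id */
};

#define TRACE_CAPACITY 1024

/* Fixed-size in-memory trace; records are appended in time order. */
struct trace_buffer {
    struct event_record records[TRACE_CAPACITY];
    size_t count;
};

/* Append one record. Returns 0 on success, -1 if the buffer is full. */
int trace_append(struct trace_buffer *tb, struct event_record rec)
{
    if (tb->count >= TRACE_CAPACITY) return -1;
    tb->records[tb->count++] = rec;
    return 0;
}

/* Total time one thread spent inside one region, rebuilt from the trace.
   Assumes matching, non-nested ENTER/EXIT pairs for that region. */
uint64_t region_time_ns(const struct trace_buffer *tb,
                        uint32_t thread_id, uint64_t region_id)
{
    uint64_t total = 0, entered_at = 0;
    int inside = 0;
    for (size_t i = 0; i < tb->count; i++) {
        const struct event_record *r = &tb->records[i];
        if (r->thread_id != thread_id || r->info != region_id) continue;
        if (r->type == EV_ENTER) {
            entered_at = r->timestamp_ns;
            inside = 1;
        } else if (r->type == EV_EXIT && inside) {
            total += r->timestamp_ns - entered_at;
            inside = 0;
        }
    }
    return total;
}
```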
Profiling v/s Tracing

• Profiling
   – Summary statistics of performance metrics
       •   Number of times a routine was invoked
       •   Exclusive, inclusive time/hpm counts spent executing it
       •   Number of instrumented child routines invoked, etc.
       •   Structure of invocations (call-trees/call-graphs)
       •   Memory, message communication sizes

• Tracing
   –   When and where events took place along a global timeline
       •   Time-stamped log of events
       •   Message communication events (sends/receives) are tracked
       •   Shows when and from/to where messages were sent
       •   Large volume of performance data generated usually leads to more
           perturbation in the program
The Big Picture

[Diagram: Sampling and Instrumentation; Profiling; Analysis and Optimization]
Instrumentation & Sampling
Measurements - Instrumentation
Instrumentation - Adding measurement probes
to the code to observe its execution
  – Can be done on several levels
  – Different techniques for different levels
  – Different overheads and levels of accuracy with each technique
  – No instrumentation: run in a simulator. E.g., Valgrind
Measurements - Instrumentation
• Source code instrumentation
  – User added time measurement, etc. (e.g.,
    printf(), gettimeofday())
Measurements - Instrumentation
  – Many tools expose mechanisms for source code
    instrumentation in addition to automatic
    instrumentation facilities they offer

  – Instrument program phases:
     • initialization/main iteration loop/data post processing
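Hand-written source instrumentation of one program phase could look like this sketch. main_iteration_loop is a stand-in workload and all function names are hypothetical; only gettimeofday() and printf() come from the slides.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <stdio.h>
#include <sys/time.h>

/* Current wall-clock time in seconds, from gettimeofday(). */
double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Placeholder for a program phase (hypothetical workload). */
double main_iteration_loop(int n)
{
    double sum = 0.0;
    for (int i = 1; i <= n; i++) sum += 1.0 / i;
    return sum;
}

/* Time one phase by hand and report it with printf().
   Returns the elapsed wall-clock time in seconds. */
double time_main_loop(int n)
{
    double start = wall_seconds();
    double result = main_iteration_loop(n);
    double elapsed = wall_seconds() - start;
    printf("main iteration loop: result=%f, elapsed=%.6f s\n",
           result, elapsed);
    return elapsed;
}
```

The same start/stop pair can be placed around initialization and data post-processing to get a per-phase breakdown.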
Measurements - Instrumentation
• Preprocessor Instrumentation
  – Example: Instrumenting OpenMP constructs with
    Opari
  – Preprocessor operation:
    Original source code -> Preprocessor -> Modified (instrumented) source code
  – This is used for OpenMP analysis in tools such as
    KoJak/Scalasca/ompP
  [Figure: /* ORIGINAL CODE in parallel region */ wrapped in instrumentation
   added by Opari]

  – Example: Instrumenta
Measurements - Instrumentation
• Compiler Instrumentation
  – Many compilers can instrument functions
    automatically
  – GNU compiler flag: -finstrument-functions
  – Automatically calls functions on function
    entry/exit that a tool can capture
  – Not standardized across compilers, often
    undocumented flags, sometimes not available at
    all
Measurements - Instrumentation
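– GNU compiler example:
/* Build the application with -finstrument-functions; the attribute
   keeps the hooks themselves from being instrumented. */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
        /* called on function entry */
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
        /* called just before returning from function */
}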
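A slightly fuller sketch of the two hooks, which counts function entries and prints an indented enter/exit log. The counters and the helpers instrumented_calls and current_depth are made up for illustration; the hooks only fire when the application is compiled with -finstrument-functions.

```c
#include <stdio.h>

/* Keep the hooks themselves from being instrumented (GCC/Clang attribute). */
#define NO_INSTRUMENT __attribute__((no_instrument_function))

static int call_depth = 0;         /* current nesting depth */
static unsigned long n_calls = 0;  /* total function entries seen */

NO_INSTRUMENT
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    (void)call_site;
    n_calls++;
    /* Indent by nesting depth; this_fn is the address of the entered function. */
    fprintf(stderr, "%*senter %p\n", 2 * call_depth, "", this_fn);
    call_depth++;
}

NO_INSTRUMENT
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    (void)call_site;
    call_depth--;
    fprintf(stderr, "%*sexit  %p\n", 2 * call_depth, "", this_fn);
}

/* Accessors a tool (or a test) could use to read the counters. */
NO_INSTRUMENT unsigned long instrumented_calls(void) { return n_calls; }
NO_INSTRUMENT int current_depth(void) { return call_depth; }
```

The printed addresses can be mapped back to function names afterwards, for example with addr2line.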
Measurements - Instrumentation
 • Library Instrumentation:
   MPI library interposition
     – All functions are available under two names: MPI_xxx and PMPI_xxx. The MPI_xxx
       symbols are weak and can be overridden by an interposition library
     – Measurement code in the interposition library measures begin, end, transmitted data,
       etc., and calls the corresponding PMPI routine
     – Not all MPI functions need to be instrumented
Measurements - Instrumentation
• Binary Instrumentation
  – Static binary instrumentation
      • LD_PRELOAD (Linux)
      • DLL injection (MS)
  – Debuggers
      • Breakpoints (software & hardware)
  – Dynamic binary instrumentation
      • Injection of instrumentation code into a running process
      • Tools: Pin (Intel), Valgrind
Measurements - Sampling
Using event triggers
  – Recurring: the program counter is sampled many
    times
  – Histogram of program contexts (CCT)
  – Sufficiently large number of samples required
  – Uniformity in event triggers with respect to execution time
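A sketch of how sampled program counters turn into a histogram. For brevity this version is flat (per function) rather than a full calling context tree, and the func_range symbol table is a hypothetical stand-in for real symbol information.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical symbol table entry: a function occupies [start, end). */
struct func_range {
    const char *name;
    uintptr_t start;
    uintptr_t end;
};

/* Attribute each sampled program counter to the function containing it.
   hist[i] receives the number of samples that fell inside funcs[i].
   Returns the number of samples that matched no function. */
size_t build_histogram(const uintptr_t *samples, size_t n_samples,
                       const struct func_range *funcs, size_t n_funcs,
                       unsigned long *hist)
{
    size_t unknown = 0;
    for (size_t f = 0; f < n_funcs; f++) hist[f] = 0;
    for (size_t s = 0; s < n_samples; s++) {
        int found = 0;
        for (size_t f = 0; f < n_funcs && !found; f++) {
            if (samples[s] >= funcs[f].start && samples[s] < funcs[f].end) {
                hist[f]++;
                found = 1;
            }
        }
        if (!found) unknown++;
    }
    return unknown;
}

/* Estimated share of execution time for one function, in percent. */
double percent_of_samples(unsigned long count, size_t n_samples)
{
    return n_samples ? 100.0 * (double)count / (double)n_samples : 0.0;
}
```

With uniform triggers, a function's share of samples approximates its share of execution time, which is why a sufficiently large sample count matters.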
Measurements - Sampling
Event trigger types
• Synchronous
   – Initiated by direct program action
   – E.g. memory allocation, I/O, and inter-process
     communication (including MPI communication)
• Asynchronous
   –   Not initiated by direct program action
   –   OS timer interrupt or
   –   Hardware performance counter events
   –   E.g. CPU, floating point instructions, clock cycles etc.
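An asynchronous trigger based on an OS timer can be sketched with the POSIX calls setitimer() and sigaction(). The function names are hypothetical, and the handler only counts samples; a real sampling profiler would also record the interrupted program counter or call stack.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

static volatile sig_atomic_t sample_count = 0;

/* Asynchronous trigger: runs whenever the profiling timer expires. */
static void on_sigprof(int signo)
{
    (void)signo;
    sample_count++;
}

/* Start a CPU-time interval timer that raises SIGPROF every
   interval_us microseconds. Returns 0 on success, -1 on error. */
int start_sampling(long interval_us)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigprof;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGPROF, &sa, NULL) != 0) return -1;

    struct itimerval it;
    it.it_interval.tv_sec = interval_us / 1000000;
    it.it_interval.tv_usec = interval_us % 1000000;
    it.it_value = it.it_interval;
    return setitimer(ITIMER_PROF, &it, NULL);
}

/* Disarm the timer. Returns 0 on success, -1 on error. */
int stop_sampling(void)
{
    struct itimerval off;
    memset(&off, 0, sizeof off);
    return setitimer(ITIMER_PROF, &off, NULL);
}

/* Burn CPU until `target` samples have arrived or max_cpu_seconds of
   CPU time have been used; returns the number of samples seen so far. */
long spin_until_samples(long target, double max_cpu_seconds)
{
    clock_t start = clock();
    volatile double x = 0.0;
    while (sample_count < target &&
           (double)(clock() - start) / CLOCKS_PER_SEC < max_cpu_seconds) {
        x += 1.0;
    }
    return (long)sample_count;
}
```

ITIMER_PROF counts CPU time consumed by the process, so samples arrive in proportion to the work done rather than to wall-clock time.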
Profiling Tools


• Gprof
• Intel VTune
• Scalasca
• HPCToolkit
HPCToolkit
• Name: HPCToolkit
• Developer: Rice University
• Website:
  – http://hpctoolkit.org




HPCToolkit Overview
•   Consists of
     – hpcviewer
          • Sorts by any collected metric, from any of the processes displayed
          • Displays samples at various levels in the call hierarchy through “flattening”
          • Allows the user to focus on interesting sections of the program through “zooming”
     – hpcprof and hpcprof-mpi
          • Correlate dynamic profiles with static source code structure
     – hpcrun
          • Application profiling using statistical sampling
          • hpcrun-flat: for collection of a “flat” profile
     – hpcstruct
          • Recovers static program structure such as procedures and loop nests
     – hpctraceviewer
     – hpclink
          • For statically linked executables (e.g. for Cray XT or BG/P)




Available Metrics in HPCToolkit
•   Metrics, obtained by sampling/profiling
     – PAPI Hardware counters
     – OS program counters
•   Wallclock time (WALLCLK)
     – However, can’t get PAPI metrics and Wallclock time in a single run
•   Derived metrics
     – Combination of existing metrics created by specifying a mathematical formula in an XML
       configuration file.
•   Source Code Correlation
     – Metrics reflect exclusive time spent in function based on counter overflow events
     – Metrics correlated at the source line level and the loop level
     – Metrics are related back to source code loops (even if code has been significantly altered
       by optimization) (“bloop”)




HPCToolkit workflow
hpcviewer Views

• Calling context view
   – top-down view shows dynamic calling contexts in which costs were
     incurred

• Caller’s view
   – bottom-up view apportions costs incurred in a routine to the
     routine’s dynamic calling contexts

• Flat view
   – aggregates all costs incurred by a routine in any context and shows
     the details of where they were incurred within the routine
hpcviewer User Interface
hpctraceviewer Views

• Trace view (left, top)
   – Time on the horizontal axis
   – Process (or thread) rank on the vertical axis
• Depth view (left, bottom) & Summary view
   – Call-path/time view for the process rank selected
• Call view (right, top)
   – Current call path depth that defines the hierarchical slice
     shown in the Trace View
   – Actual call path for the point selected by the Trace View's
     crosshair
hpctraceviewer user interface - DEMO
Call Path Profiling: Costs in Context

 Event-based sampling method for performance measurement
• When a profile event occurs, e.g. a timer expires
   – determine context in which cost is incurred
        • unwind call stack to determine set of active procedure frames
   – attribute cost of sample to PC in calling context
• Benefits
   –   monitor unmodified fully optimized code
   –   language independent – C/C++, Fortran, assembly code, …
   –   accurate
   –   low overhead (1K samples per second has ~ 3-5% overhead)
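The attribution step can be sketched as a calling context tree (CCT) that is updated once per sample. The node layout, the integer procedure identifiers, and the function names are assumptions for illustration; they are not HPCToolkit's internal data structures, and stack unwinding itself is not shown.

```c
#include <stddef.h>
#include <stdlib.h>

/* One CCT node: a procedure reached through one specific chain of callers. */
struct cct_node {
    int proc_id;                   /* hypothetical procedure identifier */
    unsigned long samples;         /* samples attributed exactly here */
    struct cct_node *first_child;
    struct cct_node *next_sibling;
};

/* Allocate a zeroed node. Returns NULL on allocation failure. */
struct cct_node *cct_new(int proc_id)
{
    struct cct_node *n = calloc(1, sizeof *n);
    if (n) n->proc_id = proc_id;
    return n;
}

/* Find the child for proc_id, creating it if needed. */
static struct cct_node *cct_child(struct cct_node *parent, int proc_id)
{
    for (struct cct_node *c = parent->first_child; c; c = c->next_sibling)
        if (c->proc_id == proc_id) return c;
    struct cct_node *c = cct_new(proc_id);
    if (c) {
        c->next_sibling = parent->first_child;
        parent->first_child = c;
    }
    return c;
}

/* Attribute one sample to an unwound call stack, outermost frame first
   (stack[0] was called from the root). Returns 0 on success, -1 on
   allocation failure. */
int cct_add_sample(struct cct_node *root, const int *stack, size_t depth)
{
    struct cct_node *n = root;
    for (size_t i = 0; i < depth; i++) {
        n = cct_child(n, stack[i]);
        if (!n) return -1;
    }
    n->samples++;
    return 0;
}

/* Inclusive cost of a context: its own samples plus all descendants'. */
unsigned long cct_inclusive(const struct cct_node *n)
{
    unsigned long total = n->samples;
    for (const struct cct_node *c = n->first_child; c; c = c->next_sibling)
        total += cct_inclusive(c);
    return total;
}
```

Because the same procedure appears as separate nodes under different callers, its cost stays split by calling context instead of being merged into one total.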
Demo for:
• hpcrun (list events & profiling)
• hpcstruct
• hpcprof
PAPI
• Performance Application Programming
  Interface
   – The purpose of the PAPI project is to design, standardize
      and implement a portable and efficient API to access the
      hardware performance monitor counters found on most
      modern microprocessors.
• Parallel Tools Consortium project started in 1998
• Developed by University of Tennessee, Knoxville
• http://icl.cs.utk.edu/papi/
PAPI - Support
• Unix/Linux
  – Perfctr kernel patch for kernel < 2.6.30
  – Perf package for kernel >= 2.6.30
PAPI - Implementation
[Layer diagram, top to bottom]
• 3rd Party and GUI Tools
• Portable Layer
   – PAPI High Level
   – PAPI Low Level
• Machine Specific Layer
   – PAPI Machine Dependent Substrate
   – Kernel Extension
   – Operating System
   – Hardware Performance Counters
PAPI - Hardware Events
• Preset Events (platform neutral)
    – Standard set of over 100 events for application performance tuning
    – No standardization of the exact definition
    – Mapped to either single or linear combinations of native events on each
      platform
    – Use papi_avail utility to see what preset events are available on a given
      platform
    – Example: PAPI_TOT_INS
• Native Events (platform dependent)
    – Any event countable by the CPU
    – Same interface as for preset events
    – Use papi_native_avail utility to see all available native events
    – Example: L3_MISSES
• Use papi_event_chooser utility to select a compatible set of events
PAPI events demo
• papi_avail
• papi_native_avail
• papi_event_chooser
• Availability of QPI h/w performance counters using /sbin/lspci
PAPI-Derived Metrics
Instructions

Metric                                                            Formula
Graduated instructions per cycle                                  PAPI_TOT_INS/PAPI_TOT_CYC
Issued instructions per cycle                                     PAPI_TOT_IIS/PAPI_TOT_CYC
Graduated floating point instructions per cycle                   PAPI_FP_INS/PAPI_TOT_CYC
Percentage floating point instructions                            PAPI_FP_INS/PAPI_TOT_INS
Ratio of graduated instructions to issued instructions            PAPI_TOT_INS/PAPI_TOT_IIS
Percentage of cycles with no instruction issue                    100.0 * (PAPI_STL_ICY/PAPI_TOT_CYC)
Data references per instruction                                   PAPI_L1_DCA/PAPI_TOT_INS
Ratio of floating point instructions to L1 data cache accesses    PAPI_FP_INS/PAPI_L1_DCA
Ratio of floating point instructions to L2 cache accesses (data)  PAPI_FP_INS/PAPI_L2_DCA
Issued instructions per L1 instruction cache miss                 PAPI_TOT_IIS/PAPI_L1_ICM
Graduated instructions per L1 instruction cache miss              PAPI_TOT_INS/PAPI_L1_ICM
L1 instruction cache miss ratio                                   PAPI_L2_ICR/PAPI_L1_ICR
PAPI-Derived Metrics
Cache & Memory Hierarchy

Metric                                                   Formula
Graduated loads & stores per cycle                       PAPI_LST_INS/PAPI_TOT_CYC
Graduated loads & stores per floating point instruction  PAPI_LST_INS/PAPI_FP_INS
L1 cache line reuse (data)                               ((PAPI_LST_INS - PAPI_L1_DCM) / PAPI_L1_DCM)
L1 cache data hit rate                                   1.0 - (PAPI_L1_DCM/PAPI_LST_INS)
L1 data cache read miss ratio                            PAPI_L1_DCM/PAPI_L1_DCA
L2 cache line reuse (data)                               ((PAPI_L1_DCM - PAPI_L2_DCM) / PAPI_L2_DCM)
L2 cache data hit rate                                   1.0 - (PAPI_L2_DCM/PAPI_L1_DCM)
L2 cache miss ratio                                      PAPI_L2_TCM/PAPI_L2_TCA
L3 cache line reuse (data)                               ((PAPI_L2_DCM - PAPI_L3_DCM) / PAPI_L3_DCM)
L3 cache data hit rate                                   1.0 - (PAPI_L3_DCM/PAPI_L2_DCM)
L3 data cache miss ratio                                 PAPI_L3_DCM/PAPI_L3_DCA
L3 cache data read ratio                                 PAPI_L3_DCR/PAPI_L3_DCA
L3 cache instruction miss ratio                          PAPI_L3_ICM/PAPI_L3_ICR
Bandwidth used (Lx cache)                                ((PAPI_Lx_TCM * Lx_linesize) / PAPI_TOT_CYC) * Clock(MHz)
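A few of the cache formulas written out as plain C functions. This is a sketch: the struct and function names are hypothetical, the counts are passed in as doubles instead of being read through PAPI, and every denominator is assumed to be nonzero.

```c
/* Raw event counts; in practice these would be read from PAPI counters. */
struct cache_counts {
    double lst_ins;  /* PAPI_LST_INS: graduated loads and stores */
    double l1_dcm;   /* PAPI_L1_DCM: L1 data cache misses */
    double l2_dcm;   /* PAPI_L2_DCM: L2 data cache misses */
};

/* L1 cache line reuse (data): hits per miss. */
double l1_line_reuse(const struct cache_counts *c)
{
    return (c->lst_ins - c->l1_dcm) / c->l1_dcm;
}

/* L1 cache data hit rate, as a fraction between 0 and 1. */
double l1_hit_rate(const struct cache_counts *c)
{
    return 1.0 - c->l1_dcm / c->lst_ins;
}

/* L2 cache line reuse (data): every L1 miss is an L2 access. */
double l2_line_reuse(const struct cache_counts *c)
{
    return (c->l1_dcm - c->l2_dcm) / c->l2_dcm;
}

/* L2 cache data hit rate, as a fraction between 0 and 1. */
double l2_hit_rate(const struct cache_counts *c)
{
    return 1.0 - c->l2_dcm / c->l1_dcm;
}

/* Bandwidth used for one cache level: bytes moved per cycle times the
   clock in MHz, which yields millions of bytes per second. */
double cache_bandwidth_mb_s(double total_misses, double linesize_bytes,
                            double tot_cyc, double clock_mhz)
{
    return (total_misses * linesize_bytes) / tot_cyc * clock_mhz;
}
```

For example, 1024 loads and stores with 256 L1 misses give a reuse of 3 hits per miss and a hit rate of 0.75.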
PAPI-Derived Metrics
Metric                                                 Formula

Branching
Ratio of mispredicted to correctly predicted branches  PAPI_BR_MSP/PAPI_BR_PRC

Processor Stalls
Percentage of cycles waiting for memory access         100.0 * (PAPI_MEM_SCY/PAPI_TOT_CYC)
Percentage of cycles stalled on any resource           100.0 * (PAPI_RES_STL/PAPI_TOT_CYC)

Aggregate Performance
MFLOPS (CPU cycles)                                    (PAPI_FP_INS/PAPI_TOT_CYC) * Clock(MHz)
MFLOPS (effective)                                     PAPI_FP_INS/Wallclock time
MIPS (CPU cycles)                                      (PAPI_TOT_INS/PAPI_TOT_CYC) * Clock(MHz)
MIPS (effective)                                       PAPI_TOT_INS/Wallclock time
Processor utilization                                  (PAPI_TOT_CYC*Clock) / Wallclock time

http://perfsuite.ncsa.illinois.edu/psprocess/metrics.shtml
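The aggregate rates differ only in what they divide by: cycles (scaled by the clock) or wall-clock time. A sketch with hypothetical function names follows. It assumes wall-clock time is given in seconds, so the effective rates are divided by 1e6 to express them in millions per second.

```c
/* MFLOPS based on CPU cycles: FP instructions per cycle times clock in MHz. */
double mflops_cpu(double fp_ins, double tot_cyc, double clock_mhz)
{
    return fp_ins / tot_cyc * clock_mhz;
}

/* Effective MFLOPS: FP instructions per second of wall time, in millions. */
double mflops_effective(double fp_ins, double wall_seconds)
{
    return fp_ins / wall_seconds / 1e6;
}

/* MIPS based on CPU cycles: instructions per cycle times clock in MHz. */
double mips_cpu(double tot_ins, double tot_cyc, double clock_mhz)
{
    return tot_ins / tot_cyc * clock_mhz;
}

/* Effective MIPS: instructions per second of wall time, in millions. */
double mips_effective(double tot_ins, double wall_seconds)
{
    return tot_ins / wall_seconds / 1e6;
}
```

A large gap between the cycle-based and the effective rate suggests that the process spent much of its wall-clock time not executing on the CPU.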
Component PAPI (PAPI-C)
• Goals:
   – Support for simultaneous access to on- and off-processor
     counters
   – Isolation of hardware dependent code in a separable
     ‘substrate’ module
   – Extension of platform independent code to support
     multiple simultaneous substrates
   – API calls to support access to any of several substrates

• Released in PAPI 4.0
Extension to PAPI to Support Multiple Substrates
[Layer diagram, top to bottom]
• Portable Layer
   – PAPI High Level
   – PAPI Low Level
   – Hardware Independent Layer
• Machine Specific Layer (two substrates side by side)
   – PAPI Machine Dependent Substrate, Kernel Extension, Operating System,
     Hardware Performance Counters
   – PAPI Machine Dependent Substrate, Kernel Extension, Operating System,
     Off-Processor Hardware Counters
High-level tools that use PAPI
• TAU (U Oregon)

• HPCToolkit (Rice Univ)

• KOJAK (UTK, FZ Juelich)

• PerfSuite (NCSA)
• SCALASCA

• Open|Speedshop (SGI)

• Intel VTune
• hpcviewer demo (trace, flops and clock cycles)
  – Nek5000 (CFD solver using spectral element
    method)
  – xhpl (Linpack)

MeetUp Monitoring with Prometheus and Grafana (September 2018)MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)
 
Performance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTracePerformance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTrace
 
Instrumentation and measurement
Instrumentation and measurementInstrumentation and measurement
Instrumentation and measurement
 
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
 
Continuous Profiling in Production: What, Why and How
Continuous Profiling in Production: What, Why and HowContinuous Profiling in Production: What, Why and How
Continuous Profiling in Production: What, Why and How
 
Librato's Joseph Ruscio at Heroku's 2013: Instrumenting 12-Factor Apps
Librato's Joseph Ruscio at Heroku's 2013: Instrumenting 12-Factor AppsLibrato's Joseph Ruscio at Heroku's 2013: Instrumenting 12-Factor Apps
Librato's Joseph Ruscio at Heroku's 2013: Instrumenting 12-Factor Apps
 
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
 
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
[CB16] COFI break – Breaking exploits with Processor trace and Practical cont...
 
Embedded Systems Overview
Embedded Systems OverviewEmbedded Systems Overview
Embedded Systems Overview
 
Performance eng prakash.sahu
Performance eng prakash.sahuPerformance eng prakash.sahu
Performance eng prakash.sahu
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix
 
Data Onboarding Breakout Session
Data Onboarding Breakout SessionData Onboarding Breakout Session
Data Onboarding Breakout Session
 
ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016
 

Application Profiling and Analysis Using HPCToolkit

  • 1. Application Profiling & Analysis Rishi Pathak National PARAM Supercomputing Facility, C-DAC riship@cdac.in
  • 2. What is application profiling • Profiling – Recording of summary information during execution • inclusive, exclusive time, # calls, hardware counter statistics, … – Reflects performance behavior of program entities • functions, loops, basic blocks • user-defined “semantic” entities – Very good for low-cost performance assessment – Helps to expose performance bottlenecks and hotspots – Implemented through either • sampling: periodic OS interrupts or hardware counter traps • measurement: direct insertion of measurement code
  • 3. Sampling vs. Instrumentation
                                  Sampling                                            Instrumentation
      Overhead                    Typically about 1%                                  High, may be 500%
      System-wide profiling       Yes, profiles all apps, drivers, OS functions       Just the application and instrumented DLLs
      Detect unexpected events    Yes, can detect other programs using OS resources   No
      Setup                       None                                                Automatic insertion of data collection stubs required
      Data collected              Counters, processor and OS state                    Call graph, call times, critical path
      Data granularity            Assembly-level instructions, with source line       Functions, sometimes statements
      Detects algorithmic issues  No, limited to processes and threads                Yes, can see that an algorithm or call path is expensive
  • 4. Inclusive v/s Exclusive Profiling
      int main( )
      {                    /* takes 100 secs */
          f1();            /* takes 20 secs */
          /* other work */
          f2();            /* takes 50 secs */
          f1();            /* takes 20 secs */
          /* other work */
      }
      /* similar for other metrics, such as hardware performance counters, etc. */
      • Inclusive time for main: 100 secs
      • Exclusive time for main: 100 - 20 - 50 - 20 = 10 secs
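The arithmetic on slide 4 can be written out as a small C sketch. This is only an illustration of the definition, not how any particular profiler stores its data; the `call_node` struct and the `exclusive_time` function are hypothetical names introduced here.

```c
#include <stddef.h>

/* Hypothetical record for one routine in a call tree (illustration only). */
struct call_node {
    const char *name;
    double inclusive_secs;             /* time in the routine plus its callees */
    const struct call_node *children;  /* one entry per direct call made */
    size_t n_children;
};

/* Exclusive time = inclusive time minus the inclusive time of direct callees. */
double exclusive_time(const struct call_node *node)
{
    double t = node->inclusive_secs;
    for (size_t i = 0; i < node->n_children; i++)
        t -= node->children[i].inclusive_secs;
    return t;
}
```

With `main` at 100 secs inclusive and three child calls of 20, 50 and 20 secs, `exclusive_time` returns the 10 secs shown on the slide.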
  • 5. What are Application Traces • Tracing – Recording of information about significant points (events) during program execution • entering/exiting code region (function, loop, block, …) • thread/process interactions (e.g., send/receive message) – Save information in event record • timestamp • CPU identifier, thread identifier • Event type and event-specific information – Event trace is a time-sequenced stream of event records – Can be used to reconstruct dynamic program behavior – Typically requires code instrumentation
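The event record described on slide 5 might look like the following C sketch. The field names, the fixed-size buffer and the `region_time_ns` helper are assumptions made for illustration; real tracing tools define their own record formats. The helper shows how a time-sequenced stream of enter/exit events can be replayed to reconstruct time spent in a (non-recursive) region.

```c
#include <stddef.h>
#include <stdint.h>

enum event_type { EV_ENTER, EV_EXIT, EV_SEND, EV_RECV };

/* One trace event: timestamp, location, type, event-specific payload. */
struct event_record {
    uint64_t timestamp_ns;
    uint32_t cpu_id;
    uint32_t thread_id;
    enum event_type type;
    uint64_t info;   /* region id for ENTER/EXIT, message size for SEND/RECV */
};

#define TRACE_CAPACITY 1024
static struct event_record trace_buf[TRACE_CAPACITY];
static size_t trace_len = 0;

/* Append an event in time order; returns 0 on success, -1 if the buffer is full. */
int trace_append(struct event_record ev)
{
    if (trace_len >= TRACE_CAPACITY)
        return -1;
    trace_buf[trace_len++] = ev;
    return 0;
}

/* Replay the trace: total time one thread spent inside one region.
   Assumes the region is not entered recursively. */
uint64_t region_time_ns(uint32_t thread_id, uint64_t region_id)
{
    uint64_t total = 0, enter_ts = 0;
    int inside = 0;
    for (size_t i = 0; i < trace_len; i++) {
        const struct event_record *e = &trace_buf[i];
        if (e->thread_id != thread_id)
            continue;
        if (e->type == EV_ENTER && e->info == region_id && !inside) {
            enter_ts = e->timestamp_ns;
            inside = 1;
        } else if (e->type == EV_EXIT && e->info == region_id && inside) {
            total += e->timestamp_ns - enter_ts;
            inside = 0;
        }
    }
    return total;
}
```

Every call appends a record, so the buffer grows with run time; this is the data-volume cost of tracing compared with a profile's fixed-size summary.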
  • 6. Profiling v/s Tracing • Profiling – Summary statistics of performance metrics • Number of times a routine was invoked • Exclusive, inclusive time/hpm counts spent executing it • Number of instrumented child routines invoked, etc. • Structure of invocations (call-trees/call-graphs) • Memory, message communication sizes • Tracing – When and where events took place along a global timeline • Time-stamped log of events • Message communication events (sends/receives) are tracked • Shows when and from/to where messages were sent • Large volume of performance data generated usually leads to more perturbation in the program
  • 7. The Big Picture Sampling Instrumentation Profiling Analysis Optimization
  • 9. Measurements - Instrumentation Instrumentation - Adding measurement probes to the code to observe its execution – Can be done on several levels – Different techniques for different levels – Different overheads and levels of accuracy with each technique – No instrumentation: run in a simulator. E.g., Valgrind
  • 10. Measurements - Instrumentation • Source code instrumentation – User-added time measurement, etc. (e.g., printf(), gettimeofday()) – Many tools expose mechanisms for source code instrumentation in addition to the automatic instrumentation facilities they offer – Instrument program phases: • initialization/main iteration loop/data post processing
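A minimal sketch of manual source instrumentation with gettimeofday() and printf(), as mentioned on slide 10. The phase function `main_iteration_loop` and the wrapper `timed_phase` are hypothetical placeholders; only `gettimeofday()` and `printf()` are the real calls named on the slide.

```c
#include <stdio.h>
#include <sys/time.h>

/* Wall-clock time in seconds via gettimeofday() (POSIX). */
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

/* Placeholder for a program phase, e.g. the main iteration loop. */
static double main_iteration_loop(int n)
{
    double sum = 0.0;
    for (int i = 1; i <= n; i++)
        sum += 1.0 / (double)i;
    return sum;
}

/* Time one phase and report it; returns the elapsed seconds. */
double timed_phase(int n)
{
    double t0 = wall_seconds();
    double result = main_iteration_loop(n);
    double t1 = wall_seconds();
    printf("main loop: result=%f elapsed=%.6f s\n", result, t1 - t0);
    return t1 - t0;
}
```

The same pattern can be repeated around initialization and post-processing to get a coarse per-phase breakdown before reaching for a profiling tool.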
  • 11. Measurements - Instrumentation • Preprocessor Instrumentation – Example: Instrumenting OpenMP constructs with Opari – Preprocessor operation: Original source code → Pre-processor → Modified (instrumented) source code – This is used for OpenMP analysis in tools such as KOJAK/Scalasca/ompP – Figure: /* ORIGINAL CODE in parallel region */ with instrumentation added by Opari – Example: Instrumenta
  • 12. Measurements - Instrumentation • Compiler Instrumentation – Many compilers can instrument functions automatically – GNU compiler flag: -finstrument-functions – Automatically calls functions on function entry/exit that a tool can capture – Not standardized across compilers, often undocumented flags, sometimes not available at all
  • 13. Measurements - Instrumentation – GNU compiler example:
      void __cyg_profile_func_enter(void *this, void *callsite)
      {
          /* called on function entry */
      }
      void __cyg_profile_func_exit(void *this, void *callsite)
      {
          /* called just before returning from function */
      }
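The two hooks from slide 13 could be filled in roughly as follows. This is a sketch under assumptions: the counters `call_depth` and `enter_count` are hypothetical, and the hooks are only invoked automatically when the rest of the program is compiled with GCC's `-finstrument-functions`. The `no_instrument_function` attribute keeps the hooks themselves from being instrumented.

```c
#include <stdio.h>

static int call_depth = 0;             /* current nesting depth */
static unsigned long enter_count = 0;  /* total number of function entries seen */

/* The hooks must not be instrumented themselves, or they would recurse. */
void __cyg_profile_func_enter(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));

void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    (void)call_site;
    enter_count++;
    /* Indent by depth so the log reads like a call tree. */
    fprintf(stderr, "%*senter %p\n", 2 * call_depth, "", this_fn);
    call_depth++;
}

void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    (void)call_site;
    call_depth--;
    fprintf(stderr, "%*sexit  %p\n", 2 * call_depth, "", this_fn);
}
```

The hooks receive raw addresses, so a tool still has to map `this_fn` back to a function name, for example through the symbol table.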
  • 14. Measurements - Instrumentation • Library Instrumentation: MPI library interposition – All functions are available under two names: MPI_xxx and PMPI_xxx, MPI_xxx symbols are weak, can be over-written by interposition library – Measurement code in the interposition library measures begin, end, transmitted data, etc… and calls corresponding PMPI routine. – Not all MPI functions need to be instrumented
  • 15. Measurements - Instrumentation • Binary Instrumentation – Static binary instrumentation • LD_PRELOAD (Linux) • DLL injection (MS) – Debuggers – Breakpoints (Software & Hardware) – Dynamic binary instrumentation • Injection of instrumentation code into a running process • Tools: Pin (Intel), Valgrind
  • 16. Measurements - Sampling Using event triggers – Recurring: the program counter is sampled many times – Histogram of program contexts (CCT) – Sufficiently large number of samples required – Uniformity in event triggers with respect to execution time
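The histogram idea on slide 16 can be sketched as below. The struct and function names are hypothetical, and routines are identified by a small integer id rather than a real program counter or calling context. If triggers are uniform in time, a routine's estimated cost is its sample count times the sampling period.

```c
#include <stddef.h>

#define MAX_ROUTINES 64

/* Hypothetical histogram of samples, one bucket per routine id. */
struct sample_histogram {
    unsigned long counts[MAX_ROUTINES];
    unsigned long total;
    double period_secs;   /* time between two sample triggers */
};

/* Called once per trigger with the routine the program counter was in. */
void record_sample(struct sample_histogram *h, size_t routine_id)
{
    if (routine_id < MAX_ROUTINES) {
        h->counts[routine_id]++;
        h->total++;
    }
}

/* Estimated time in a routine: number of samples times the period. */
double estimated_secs(const struct sample_histogram *h, size_t routine_id)
{
    if (routine_id >= MAX_ROUTINES)
        return 0.0;
    return (double)h->counts[routine_id] * h->period_secs;
}

/* Share of all samples that landed in a routine (0.0 if no samples yet). */
double sample_fraction(const struct sample_histogram *h, size_t routine_id)
{
    if (routine_id >= MAX_ROUTINES || h->total == 0)
        return 0.0;
    return (double)h->counts[routine_id] / (double)h->total;
}
```

The estimate is statistical, so a routine that receives only a handful of samples cannot be ranked reliably; this is why a sufficiently large number of samples is required.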
  • 17. Measurements - Sampling Event trigger types • Synchronous – Initiated by direct program action – E.g. memory allocation, I/O, and inter-process communication(including MPI communication) • Asynchronous – Not initiated by direct program action – OS timer interrupt or – Hardware performance counter events – E.g. CPU, floating point instructions, clock cycles etc.
  • 18. Profiling Tools • Gprof • Intel VTune • Scalasca • HPCToolkit
  • 19. HPC Tool Kit • Name: HPCToolkit • Developer: Rice University • Website: – http://hpctoolkit.org
  • 20. HPCToolkit Overview • Consists of – hpcviewer • Sorts by any collected metric, from any processes displayed • Displays samples at various levels in the call hierarchy through “flattening” • Allows the user to focus on interesting sections of the program through “zooming” – hpcprof and hpcprof-mpi • Correlate dynamic profiles with static source code structure – hpcrun • Application profiling using statistical sampling • hpcrun-flat: for collection of a `flat' profile – hpcstruct • Recovers static program structure such as procedures and loop nests – hpctraceviewer – hpclink • For statically-linked executables (e.g. for Cray XT or BG/P)
  • 21. Available Metrics in HPCToolkit • Metrics, obtained by sampling/profiling – PAPI Hardware counters – OS program counters • Wallclock time (WALLCLK) – However, can’t get PAPI metrics and Wallclock time in a single run • Derived metrics – Combination of existing metrics created by specifying a mathematical formula in an XML configuration file. • Source Code Correlation – Metrics reflect exclusive time spent in function based on counter overflow events – Metrics correlated at the source line level and the loop level – Metrics are related back to source code loops (even if code has been significantly altered by optimization) (“bloop”)
  • 23. hpcviewer Views • Calling context view – top-down view shows dynamic calling contexts in which costs were incurred • Caller’s view – bottom-up view apportions costs incurred in a routine to the routine’s dynamic calling contexts • Flat view – aggregates all costs incurred by a routine in any context and shows the details of where they were incurred within the routine
  • 25. hpctraceviewer Views • Trace view (left, top) – Time on the horizontal axis – Process (or thread) rank on the vertical axis • Depth view (left, bottom) & Summary view – Call-path/time view for the process rank selected • Call view (right, top) – Current call path depth that defines the hierarchical slice shown in the Trace View – Actual call path for the point selected by the Trace View's crosshair
  • 27. Call Path Profiling: Costs in Context Event-based sampling method for performance measurement • When a profile event occurs, e.g. a timer expires – determine context in which cost is incurred • unwind call stack to determine set of active procedure frames – attribute cost of sample to PC in calling context • Benefits – monitor unmodified fully optimized code – language independent – C/C++, Fortran, assembly code, … – accurate – low overhead (1K samples per second has ~ 3-5% overhead)
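The "unwind call stack, attribute cost to calling context" step on slide 27 can be illustrated with glibc's `backtrace()`. This is a sketch under assumptions, not HPCToolkit's mechanism: HPCToolkit uses its own unwinder and takes samples from an event handler, whereas `backtrace()` is a glibc extension that is not documented as async-signal-safe. The `context_sample` struct and `take_sample` function are hypothetical.

```c
#include <execinfo.h>   /* backtrace(): glibc extension, assumed available */

#define MAX_FRAMES 32

/* One sample: the active procedure frames plus the cost attributed to them. */
struct context_sample {
    void *frames[MAX_FRAMES];  /* return addresses, innermost frame first */
    int depth;                 /* number of valid entries in frames[] */
    unsigned long cost;        /* e.g. 1 per timer expiry */
};

/* Record the calling context at the point of the call and attach a cost. */
void take_sample(struct context_sample *s, unsigned long cost)
{
    s->depth = backtrace(s->frames, MAX_FRAMES);
    s->cost = cost;
}
```

Samples with identical frame sequences can then be merged, which is how per-context costs accumulate into a calling context tree.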
  • 28. Demo for: • hpcrun (list events & profiling) • hpcstruct • hpcprof
  • 29. PAPI • Performance Application Programming Interface – The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors. • Parallel Tools Consortium project started in 1998 • Developed by University of Tennessee, Knoxville • http://icl.cs.utk.edu/papi/
  • 30. PAPI - Support • Unix/Linux – Perfctr kernel patch for kernel < 2.6.30 – Perf package for kernel >= 2.6.30
  • 31. PAPI - Implementation 3rd Party and GUI Tools Portable PAPI High Level Layer PAPI Low Level PAPI Machine Dependent Substrate Machine Kernel Extension Specific Layer Operating System Hardware Performance Counters
  • 32. PAPI - Hardware Events • Preset Events(Platform neutral) – Standard set of over 100 events for application performance tuning – No standardization of the exact definition – Mapped to either single or linear combinations of native events on each platform – Use papi_avail utility to see what preset events are available on a given platform – PAPI_TOT_INS • Native Events(Platform dependent) – Any event countable by the CPU – Same interface as for preset events – Use papi_native_avail utility to see all available native events – L3_MISSES • Use papi_event_chooser utility to select a compatible set of events
  • 33. PAPI events demo papi_avail papi_native_avail papi_event_chooser Availability for QPI h/w performance counters using /sbin/lspci
  • 34. PAPI-Derived Metrics (Metric: Formula)
      – Instructions
         • Graduated instructions per cycle: PAPI_TOT_INS/PAPI_TOT_CYC
         • Issued instructions per cycle: PAPI_TOT_IIS/PAPI_TOT_CYC
         • Graduated floating point instructions per cycle: PAPI_FP_INS/PAPI_TOT_CYC
         • Percentage floating point instructions: PAPI_FP_INS/PAPI_TOT_INS
         • Ratio of graduated instructions to issued instructions: PAPI_TOT_INS/PAPI_TOT_IIS
         • Percentage of cycles with no instruction issue: 100.0 * (PAPI_STL_ICY/PAPI_TOT_CYC)
         • Data references per instruction: PAPI_L1_DCA/PAPI_TOT_INS
         • Ratio of floating point instructions to L1 data cache accesses: PAPI_FP_INS/PAPI_L1_DCA
         • Ratio of floating point instructions to L2 cache accesses (data): PAPI_FP_INS/PAPI_L2_DCA
         • Issued instructions per L1 instruction cache miss: PAPI_TOT_IIS/PAPI_L1_ICM
         • Graduated instructions per L1 instruction cache miss: PAPI_TOT_INS/PAPI_L1_ICM
         • L1 instruction cache miss ratio: PAPI_L2_ICR/PAPI_L1_ICR
  • 35. PAPI-Derived Metrics (Metric: Formula)
      – Cache & Memory Hierarchy
         • Graduated loads & stores per cycle: PAPI_LST_INS/PAPI_TOT_CYC
         • Graduated loads & stores per floating point instruction: PAPI_LST_INS/PAPI_FP_INS
         • L1 cache line reuse (data): ((PAPI_LST_INS - PAPI_L1_DCM) / PAPI_L1_DCM)
         • L1 cache data hit rate: 1.0 - (PAPI_L1_DCM/PAPI_LST_INS)
         • L1 data cache read miss ratio: PAPI_L1_DCM/PAPI_L1_DCA
         • L2 cache line reuse (data): ((PAPI_L1_DCM - PAPI_L2_DCM) / PAPI_L2_DCM)
         • L2 cache data hit rate: 1.0 - (PAPI_L2_DCM/PAPI_L1_DCM)
         • L2 cache miss ratio: PAPI_L2_TCM/PAPI_L2_TCA
         • L3 cache line reuse (data): ((PAPI_L2_DCM - PAPI_L3_DCM) / PAPI_L3_DCM)
         • L3 cache data hit rate: 1.0 - (PAPI_L3_DCM/PAPI_L2_DCM)
         • L3 data cache miss ratio: PAPI_L3_DCM/PAPI_L3_DCA
         • L3 cache data read ratio: PAPI_L3_DCR/PAPI_L3_DCA
         • L3 cache instruction miss ratio: PAPI_L3_ICM/PAPI_L3_ICR
         • Bandwidth used (Lx cache): ((PAPI_Lx_TCM * Lx_linesize) / PAPI_TOT_CYC) * Clock(MHz)
  • 36. PAPI-Derived Metrics (Metric: Formula)
      – Branching
         • Ratio of mispredicted to correctly predicted branches: PAPI_BR_MSP/PAPI_BR_PRC
      – Processor Stalls
         • Percentage of cycles waiting for memory access: 100.0 * (PAPI_MEM_SCY/PAPI_TOT_CYC)
         • Percentage of cycles stalled on any resource: 100.0 * (PAPI_RES_STL/PAPI_TOT_CYC)
      – Aggregate Performance
         • MFLOPS (CPU cycles): (PAPI_FP_INS/PAPI_TOT_CYC) * Clock(MHz)
         • MFLOPS (effective): PAPI_FP_INS/Wallclock time
         • MIPS (CPU cycles): (PAPI_TOT_INS/PAPI_TOT_CYC) * Clock(MHz)
         • MIPS (effective): PAPI_TOT_INS/Wallclock time
         • Processor utilization: (PAPI_TOT_CYC*Clock) / Wallclock time
      http://perfsuite.ncsa.illinois.edu/psprocess/metrics.shtml
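A few of the derived-metric formulas from slides 34 to 36 are written out in C below. The `counters` struct is a hypothetical container for raw totals that a tool has already collected; no PAPI calls are made here, and the caller is assumed to pass non-zero denominators.

```c
/* Hypothetical container for raw counter totals gathered elsewhere. */
struct counters {
    long long tot_ins;   /* PAPI_TOT_INS: graduated instructions */
    long long tot_cyc;   /* PAPI_TOT_CYC: total cycles */
    long long fp_ins;    /* PAPI_FP_INS: floating point instructions */
    long long lst_ins;   /* PAPI_LST_INS: loads and stores */
    long long l1_dcm;    /* PAPI_L1_DCM: L1 data cache misses */
};

/* Graduated instructions per cycle: PAPI_TOT_INS/PAPI_TOT_CYC */
double instructions_per_cycle(const struct counters *c)
{
    return (double)c->tot_ins / (double)c->tot_cyc;
}

/* L1 cache data hit rate: 1.0 - (PAPI_L1_DCM/PAPI_LST_INS) */
double l1_data_hit_rate(const struct counters *c)
{
    return 1.0 - (double)c->l1_dcm / (double)c->lst_ins;
}

/* MFLOPS (CPU cycles): (PAPI_FP_INS/PAPI_TOT_CYC) * Clock(MHz) */
double mflops_cpu_cycles(const struct counters *c, double clock_mhz)
{
    return (double)c->fp_ins / (double)c->tot_cyc * clock_mhz;
}
```

For example, 2000 instructions over 1000 cycles gives 2.0 instructions per cycle, and 200 L1 data misses out of 800 loads and stores gives a hit rate of 0.75.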
  • 37. Component PAPI (PAPI-C) • Goals: – Support for simultaneous access to on- and off-processor counters – Isolation of hardware dependent code in a separable ‘substrate’ module – Extension of platform independent code to support multiple simultaneous substrates – API calls to support access to any of several substrates • Released in PAPI 4.0
  • 38. Extension to PAPI to Support Multiple Substrates PAPI High Level PAPI Low Level Portable Layer Hardware Independent Layer PAPI Machine Dependent Substrate PAPI Machine Dependent Substrate Machine Kernel Extension Kernel Extension Specific Operating System Operating System Layer Hardware Performance Counters Off-Processor Hardware Counters
  • 39. High-level tools that use PAPI • TAU (U Oregon) • HPCToolkit (Rice Univ) • KOJAK (UTK, FZ Juelich) • PerfSuite (NCSA) • SCALASCA • Open|Speedshop (SGI) • Intel VTune
  • 40. • hpcviewer demo (trace, flops and clock cycles) – Nek5000 (CFD solver using spectral element method) – Xhpl (Linpack)