Online performance modeling and
analysis of message-passing
parallel applications


PhD Thesis
Oleg Morajko
Universitat Autònoma de Barcelona
Barcelona, 2008

[Title slide figure: execution timeline annotated "Delayed receive" and "Long local calculations"]
Motivation
• Parallel system hardware is evolving at an incredible rate
• Contemporary HPC systems
    – Top500 ranging from 1,000 to 200,000+ processors (June 2008)
    – Take BSC MareNostrum: 10K processors

• Whole industry is shifting to parallel computing




                                                                     2
Motivation
• Challenges of developing large-scale scientific software
   – Evolution of programming models is much slower
   – Hard to achieve good efficiency
   – Hard to achieve scalability

• Parallel applications rarely achieve good
  performance immediately
                                                             3
Motivation
• Challenges of developing large-scale scientific software
   – Evolution of programming models is much slower
   – Hard to achieve good efficiency
   – Hard to achieve scalability

• Parallel applications
  rarely achieve good
  performance immediately



   Careful performance analysis
 and optimization tasks are crucial
                                                             4
Motivation
• Quickly finding performance problems and their reasons is hard
• Requires thorough understanding of the program’s behavior
   – Parallel algorithm, domain decomposition, communication, synchronization

• Large scale brings additional complexities
   – Large data volume, excessive analysis cost

• Existing tools support finding what happens, where, and when
   – Locating root causes of problems is still manual
   – Tools exhibit scalability limitations (e.g., tracing)

• Problem diagnosis still requires substantial time and effort from
  highly skilled professionals

                                                                                5
Our goals
• Analyze the performance of parallel applications
• Detect bottlenecks and explain their causes
   – Focus on communication and synchronization in message-passing
     programs


• Automate the approach to the extent possible
• Scale to thousands of nodes
• Operate online, without trace files



                                                                     6
Contributions
• A systematic approach for automated diagnosis of application
  performance
   – Application is monitored, modeled and diagnosed during its execution

• Scalable modeling technique that generates performance
  knowledge about application behavior

• Analysis technique that diagnoses MPI applications running in
  large-scale parallel systems
   – Detects performance bottlenecks on-the-fly
   – Finds root causes

• Prototype tool to demonstrate the ideas

                                                                        7
Outline

1. Overview of approaches

2. Online performance modeling

3. Online performance analysis

4. Experimental evaluation

5. Conclusions and future work


                                 8
Overview
of approaches

                9
Classical performance analysis

[Cycle diagram: Develop → Compile & instrument → Execute → Trace files →
Analyze trace (visualization tool) → Performance problems → Find solutions →
Code changes → Develop]
                                                         10
Classical performance analysis
Drawbacks

•   Manual task of an experimental nature
•   Time consuming
•   High degree of expertise required
•   Full traces yield an excessive volume of information
•   Poor scalability




                                                  11
Automated offline analysis

[Cycle diagram: Develop → Compile & instrument → Execute → Trace files →
Analyze trace with automated tools (KappaPI, EXPERT) → Performance problems →
Find solutions → Code changes → Develop]
                                                            12
Automated offline analysis
Drawbacks

• Post-mortem
• Addresses only well-known problems
• Capabilities for finding root causes not fully explored




                                                        13
Automated online analysis

[Cycle diagram: Develop → Compile & instrument → Execute → Online monitoring
and diagnosis (Paradyn) → Performance problems (What, Where, When) →
Find solutions → Code changes → Develop]
                                                              14
Automated online analysis

Paradyn advantages
• Locates problems while the app runs
• Automated problem-space search
   – Functional decomposition
   – Refinable measurements
• Scalable

Paradyn drawbacks
• Addresses lower-level problems (profiler)
• No search for root causes of problems
                                                             15
Automated online analysis
Our approach

[Cycle diagram: Develop → Compile → Execute → Monitoring (consumes events) →
Modeling → Analysis (observes the model, refines monitoring) →
Problems and causes → Find solutions → Code changes → Develop]
                                                                          16
Automated online analysis
Key characteristics
• Discovers application model on-the-fly
   – Model execution flows, not modules/functions
   – Lossy trace compression

• Runtime analysis based on continuous model
  observation
• Automatically locates problems while the app runs
• Searches for root causes of problems


                                                    17
Monitoring




Modeling




             Analysis




Online performance
modeling

                        18
Modeling objectives

• Enable high-level understanding of application performance

• Reflect parallel application structure and runtime behavior

• Maintain a tradeoff between the volume of collected data and the
  level of preserved detail
   – Communication and computational patterns

   – Causality of events

• Base for online performance analysis


                                                                19
Online performance modeling

• Novel application performance modeling approach

• Combines static code analysis with runtime monitoring to
  extract performance knowledge

• Three-step approach:
   – Modeling individual tasks

   – Modeling inter-task communication

   – Modeling entire application



                                                             20
Modeling individual tasks
• We decompose execution into units that correspond to
  different activities:
   –   Communication activities (e.g., MPI_Send, MPI_Gather)
   –   Computation activities (e.g., calc_gauss)
   –   Control activities (e.g., program start/termination)
   –   Others (e.g., I/O)


• We capture the execution flow through these activities using a
  directed graph called the Task Activity Graph (TAG); a data-structure
  sketch follows after this slide:
   – Nodes model communication activities and loops
   – Edges represent the sequential flow of execution (computation activities)
   – Nodes and edges maintain the happens-before relationship
                                                                              21
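To make the TAG concrete, here is a minimal C++ sketch of one possible representation, assuming nodes keyed by call path and statistical profiles on both nodes and edges; the names (Profile, TagNode, TagEdge, TaskActivityGraph) are illustrative, not the thesis' actual data structures.

    #include <cstdint>
    #include <map>
    #include <vector>

    // Statistical execution profile kept per node/edge.
    struct Profile {
        uint64_t count = 0;             // number of executions
        double   sum = 0, sum2 = 0;     // accumulated time and squared time
        double   min = 1e300, max = 0;  // extremes
    };

    // Sequential-flow edge: the computation between two activities.
    struct TagEdge {
        int     target;                 // index of the destination node
        Profile prof;                   // computation-time profile
    };

    // Node: a communication activity or a loop head.
    struct TagNode {
        uint64_t callPath;              // hash of the source-code location
        Profile  prof;                  // time spent inside the activity
        std::vector<TagEdge> out;       // outgoing sequential-flow edges
    };

    // One task's activity graph; nodes looked up by call path.
    struct TaskActivityGraph {
        std::vector<TagNode> nodes;
        std::map<uint64_t, int> byPath; // call-path hash -> node index
    };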
Modeling individual tasks
The Task Activity Graph (TAG) reflects program structure by
  modeling the executed flow of activities




                                                      22
Modeling individual tasks
• Each activity corresponds to a particular location
  in the source code




                              23
Modeling individual tasks
• Runtime behavior of activities is described by adding
  performance metrics to nodes and edges
• Data aggregated into statistical execution profiles

[Figure: TAG annotated with metrics; each edge carries a counter and an
accumulative timer {min, max, stddev}, and each node carries an
accumulative timer {min, max, stddev}]
                                                                   24
Modeling communication
• Message edges capture matching send-receive links
   – P2P, Collective
• Completion edges capture non-blocking semantics
• Performance metrics describe runtime behavior




                                                      25
Modeling parallel application
• Individual TAG models connected by message edges
  form a Parallel-TAG model (PTAG)




                                                     26
Modeling techniques
We developed a set of techniques to automatically construct
  and exploit the PTAG model at runtime

•   Targeted at parallel scientific applications
•   Focus on modeling MPI applications
•   But extensible to other programming paradigms
•   Low overhead
•   Scalable to 1000+ nodes




                                                              27
Online PTAG construction

[Architecture diagram: MPI tasks 1..N at the bottom, one Modeler per task,
TBON nodes above them, and a Front-end at the root. Steps: 1) instrument
each MPI task, 2) build the local TAG, 3) sample it, 4) update the Modeler,
5) merge TAGs at the TBON nodes, 6) update the Front-end, 7) analyze]
                                                                                          28
Building individual TAG

[Diagram: the Modeler 1) analyzes the executable and 2) instruments the
MPI task; the RT Library 3) captures events and 4) updates the TAG in
shared memory; the Modeler 5) samples the model and 6) sends updates
upstream]
                                                                          29
Building individual TAG
Offline program analysis
• Parse the binary executable
• Find target functions
• Detect relevant loops

[Diagram: step 1, the Modeler analyzes the executable]
                                                30
Building individual TAG
Dynamic instrumentation
• Instrument all target functions:
     – Record events
     – Collect performance metrics
     – Invoke TAG update
• Refinable at runtime (see the sketch below)

[Diagram: step 2, the Modeler instruments the MPI task]
                                            31
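The prototype uses DynInst for this step; below is a minimal sketch of instrumenting one target function's entry through the public BPatch API. The recordEvent callback in the RT library is a hypothetical name, and details of the DynInst 5.1 API may differ slightly, so treat this as an assumption-laden illustration rather than the tool's actual code.

    #include "BPatch.h"
    #include "BPatch_function.h"
    #include "BPatch_point.h"
    #include "BPatch_process.h"

    // Instrument the entry of one target function (e.g. "MPI_Send") in a
    // running process, so a runtime-library callback fires on every call.
    void instrumentEntry(BPatch_process *proc, const char *name) {
        BPatch_image *image = proc->getImage();

        BPatch_Vector<BPatch_function *> funcs;
        image->findFunction(name, funcs);           // locate the target
        if (funcs.empty()) return;

        const BPatch_Vector<BPatch_point *> *entries =
            funcs[0]->findPoint(BPatch_entry);      // entry points
        if (!entries || entries->empty()) return;

        // recordEvent is assumed to be exported by the preloaded RT library.
        BPatch_Vector<BPatch_function *> callbacks;
        image->findFunction("recordEvent", callbacks);
        if (callbacks.empty()) return;

        BPatch_Vector<BPatch_snippet *> args;       // no args in this sketch
        BPatch_funcCallExpr call(*callbacks[0], args);
        proc->insertSnippet(call, *entries);        // refinable: removable later
    }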
Building individual TAG
Performance metrics
•       Counters
•       Timers {sum, sum2, min, max}
•       Histograms
•       Compound metrics

[Figure: TAG with counter increments (cnt1++ … cnt5++) on edges and
timer probes (t1, t2, t3, t4) around activities]
                                                                                        32
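As an illustration of the {sum, sum2, min, max} timer, the sketch below shows how it might be accumulated: keeping the sum of squares lets mean and standard deviation be derived later without storing individual samples, which is how the per-activity profiles on the earlier slides can stay constant-space. The Timer name is illustrative.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    struct Timer {
        uint64_t n = 0;
        double sum = 0, sum2 = 0;          // sum of times and squared times
        double min = 1e300, max = 0;

        void record(double t) {            // called once per activity execution
            ++n;
            sum += t;
            sum2 += t * t;
            min = std::min(min, t);
            max = std::max(max, t);
        }
        double mean() const { return n ? sum / n : 0.0; }
        double stddev() const {            // sqrt(E[t^2] - E[t]^2)
            if (n < 2) return 0.0;
            double m = mean();
            return std::sqrt(std::max(0.0, sum2 / n - m * m));
        }
    };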
Building individual TAG
Runtime modeling
• Process generated events
• Walk the stack to capture the program location (call path),
  as sketched below
• Update the TAG incrementally

[Diagram: step 3, the RT Library captures events and, step 4, updates the
TAG in shared memory]
                                                  33
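A minimal sketch of the stack walk, assuming a POSIX environment: backtrace() from <execinfo.h> collects the chain of return addresses, and hashing them yields a key identifying the TAG node for the current program location. The FNV-1a hash is an illustrative choice, not necessarily the thesis' scheme.

    #include <execinfo.h>
    #include <cstdint>

    // Hash the current call path (chain of return addresses) into a key
    // identifying the TAG node for this program location.
    uint64_t callPathKey() {
        void *frames[64];
        int depth = backtrace(frames, 64);    // walk the stack
        uint64_t h = 1469598103934665603ULL;  // FNV-1a offset basis
        for (int i = 0; i < depth; ++i) {
            h ^= (uint64_t)reinterpret_cast<uintptr_t>(frames[i]);
            h *= 1099511628211ULL;            // FNV-1a prime
        }
        return h;
    }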
Building individual TAG
Model sampling
•   Goal: examine the model at runtime
•   Read the model from shared memory
•   Sampling is periodic
•   Lock-free synchronization (see the sketch below)

[Diagram: step 5, the Modeler samples the TAG from shared memory]
                                      34
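One common lock-free scheme for a single-writer shared-memory model is a sequence lock: the writer increments a counter before and after each update, and the sampler retries whenever it observes an odd or changed value. A sketch under that assumption; the thesis may use a different mechanism.

    #include <atomic>
    #include <cstring>

    struct SharedModel {
        std::atomic<uint32_t> seq{0};  // even = stable, odd = update in progress
        char data[256 * 1024];         // serialized TAG area in shared memory
    };

    // Writer side (RT library); a single writer is assumed.
    void publish(SharedModel &m, const char *tag, size_t len) {
        m.seq.fetch_add(1, std::memory_order_release);  // becomes odd
        std::memcpy(m.data, tag, len);
        m.seq.fetch_add(1, std::memory_order_release);  // becomes even again
    }

    // Sampler side (Modeler): retry until a consistent snapshot is read.
    void sample(SharedModel &m, char *out, size_t len) {
        uint32_t s1, s2;
        do {
            s1 = m.seq.load(std::memory_order_acquire);
            std::memcpy(out, m.data, len);
            s2 = m.seq.load(std::memory_order_acquire);
        } while ((s1 & 1) != 0 || s1 != s2);
    }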
Online communication modeling
How to model inter-task communication?
• Intercept MPI communication calls (nodes); see the sketch below
• Match sender nodes with receiver nodes
• Add message edges to the TAG models
                                              35
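Interception of MPI calls is commonly done through the standard PMPI profiling interface, sketched below; the updateTagNode hook into the runtime library is a hypothetical name, and the prototype itself inserts probes with dynamic instrumentation, so this is just the simplest equivalent mechanism.

    #include <mpi.h>

    // Hypothetical runtime-library hook: records the event and updates
    // the TAG node for this call site.
    extern "C" void updateTagNode(const char *activity);

    // Redefine MPI_Send and delegate to the PMPI_ entry point that every
    // MPI implementation provides for profiling tools.
    extern "C" int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                            int dest, int tag, MPI_Comm comm) {
        updateTagNode("MPI_Send.entry");
        int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
        updateTagNode("MPI_Send.exit");
        return rc;
    }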
Online communication modeling
• Requires tracking individual messages transmitted from
  sender to receiver(s) at runtime

• Achieved by propagating piggyback data with every
  transmitted MPI message (a sketch follows below)
   • Transmit the node id from sender to receiver(s)
   • P2P / blocking / non-blocking / collective
   • Optimized hybrid strategy to minimize intrusion

• Store references to the sender's nodes in the receiver's TAG
                                                               36
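One simple piggybacking strategy is to prepend a small header, the sender's TAG node id plus its send-entry timestamp, to the user payload in a staging buffer. The slide's optimized hybrid strategy presumably avoids this extra copy for large messages (e.g., via derived datatypes), so treat this as the naive variant with illustrative names.

    #include <mpi.h>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    struct PiggybackHeader {
        uint64_t senderNodeId;  // TAG node id of the sending activity
        double   sendEntryTs;   // send-entry timestamp (e1), used later
                                // for synchronization-cost accounting
    };

    int piggybackSend(const void *buf, int bytes, int dest, int tag,
                      MPI_Comm comm, uint64_t nodeId) {
        PiggybackHeader hdr{nodeId, MPI_Wtime()};
        std::vector<char> staged(sizeof hdr + bytes);
        std::memcpy(staged.data(), &hdr, sizeof hdr);         // header first
        std::memcpy(staged.data() + sizeof hdr, buf, bytes);  // then payload
        return PMPI_Send(staged.data(), (int)staged.size(), MPI_BYTE,
                         dest, tag, comm);
    }

    int piggybackRecv(void *buf, int bytes, int src, int tag,
                      MPI_Comm comm, PiggybackHeader *hdr) {
        std::vector<char> staged(sizeof *hdr + bytes);
        MPI_Status status;
        int rc = PMPI_Recv(staged.data(), (int)staged.size(), MPI_BYTE,
                           src, tag, comm, &status);
        std::memcpy(hdr, staged.data(), sizeof *hdr);         // extract header
        std::memcpy(buf, staged.data() + sizeof *hdr, bytes); // user payload
        return rc;
    }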
Online parallel application modeling
Building and maintaining PTAG

• Individual TAGs are distributed
• Collect TAG snapshots
• Distributed merge
• Periodic process

[Diagram: individual TAGs flow through a hierarchical reduction network
(TBON) that merges groups of TAGs into the global PTAG]
                                                                          37
Online parallel application modeling
Scalable modeling

[Figure: PTAG size grows with scale; 8 nodes: 250 KB, 1,024 nodes: 62 MB,
10,240 nodes: 625 MB]

• Increasing data volume
• Increasing analysis cost
• Non-scalable visualization
                                                               38
Online parallel application modeling
Resolving scalability issues

• Classes of similar tasks
   – e.g., stencil codes, master/worker (M/W)

• TAG clustering
   – Structural equivalence
   – Behavioral equivalence


• Distributed and scalable
  TAG merging algorithm

                                       39
Online parallel application modeling
Scalable PTAG visualization
• Example: 1D stencil, 8 nodes




                                       40
Benefits of modeling

• Facilitates performance understanding

• Reveals communication and computational patterns and their
  causal relationships

• Enables an assortment of online analysis techniques
   – Quick identification of performance bottlenecks and their location

   – Behavioral task clustering

   – Causal relationships permit root-cause analysis

   – Feedback-guided analysis (refinements)


                                                                          41
Monitoring




Modeling




             Analysis




Online performance
analysis

                        42
Online analysis objectives
• Diagnose performance on-the-fly
• Detect relevant performance bottlenecks and their
  reasons
• Distinguish problem symptoms from root causes
• Explain what, where, when and why

• Focus on communication and synchronization
  problems in MPI applications


                                                      43
Online performance analysis
Time-continuous Root-Cause Analysis process

[Diagram: Monitoring feeds Modeling, which feeds Analysis; analysis proceeds
through Phase 1 (problem identification), Phase 2 (problem analysis), and
Phase 3 (cause-effect analysis)]
                                                                    44
Root-cause analysis

Phase 1: Problem identification
• Focus attention on code regions with the biggest potential
  optimization benefits
• A potential bottleneck: an individual task activity with a
  significant amount of execution time

• A TAG node might correspond to a communication or
  synchronization problem
• A TAG edge might indicate a computation-bound problem
                                                                                   45
Problem identification

[Figure: rainbow-colored TAG showing cold and hot activities; a CPU-bound
activity takes ~45% of time, and a blocked receive takes ~42% of time,
indicating a communication or synchronization problem]

• Rainbow-spectrum TAG coloring
• Coloring metric: activity time / max activity time
                                                                                  46
Problem identification

TAG ranking process
• Identify potential bottlenecks for further analysis
• Periodic ranking over a moving time window
• Select top problems by ranking (a sketch follows below):

     Rank = activity time / task time
     > 20% for computation activities
     > 3% for communication activities

[Figure: TAG snapshot → ranking → potential bottlenecks]
                                                                                                47
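The ranking rule as code, a direct transcription of the thresholds stated on this slide; the Activity record and its fields are illustrative names.

    #include <string>
    #include <vector>

    struct Activity {
        std::string location;  // source-code location of the TAG node/edge
        bool isComputation;    // edge (computation) vs. node (communication)
        double time;           // time accumulated in the moving time window
    };

    // Rank = activity time / task time; keep activities above the
    // per-class threshold (>20% computation, >3% communication).
    std::vector<Activity> selectBottlenecks(const std::vector<Activity> &acts,
                                            double taskTime) {
        std::vector<Activity> bottlenecks;
        for (const Activity &a : acts) {
            double rank = a.time / taskTime;
            double threshold = a.isComputation ? 0.20 : 0.03;
            if (rank > threshold)
                bottlenecks.push_back(a);
        }
        return bottlenecks;
    }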
Root-cause analysis

Phase 2: In-depth problem analysis
• For each potential bottleneck, investigate its causes
• Explore a knowledge-based cause space
• Focus on causes that contribute most to the problem time

• Distinguish task-local problems from inter-task problems
   – Find root causes of task-local problems
       • e.g., CPU-bound computation, local I/O
   – Find symptoms of inter-task problems
       • e.g., blocked receive, barrier
                                                                                  48
In-depth problem analysis

Performance models for activities
• Classification of activities
• Each class has a performance model that divides the activity
  cost into separate components
• Each component is a non-exclusive potential cause of the problem
                                                                                  49
In-depth problem analysis

Model for computational activities
•   Sequential code region modeled by a TAG edge
•   No external knowledge about the computation
•   Determine where the edge-constrained code spends time
•   Divide the TAG edge into components
    – Functional or basic-block decomposition
• Apply statistical profiling constrained to an edge (see the sketch below)
    – Dynamic instrumentation
• Other metrics
    – Idle time, I/O time, hardware counters
                                                                                      50
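A minimal sketch of statistical profiling constrained to a TAG edge, assuming a Linux/x86-64 environment: a SIGPROF sampler is armed only while execution is inside the edge, so every sample falls in the edge's code region. The program-counter histogram and the 10 ms period are illustrative; a production sampler would use an async-signal-safe structure instead of a map.

    #define _GNU_SOURCE  // exposes REG_RIP in <ucontext.h> on glibc
    #include <csignal>
    #include <cstdint>
    #include <map>
    #include <sys/time.h>
    #include <ucontext.h>

    static std::map<uintptr_t, uint64_t> pcHistogram;  // PC -> sample count

    static void onSample(int, siginfo_t *, void *ctx) {
        ucontext_t *uc = static_cast<ucontext_t *>(ctx);
        uintptr_t pc = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];  // x86-64
        ++pcHistogram[pc];  // not async-signal-safe; acceptable in a sketch
    }

    // Arm the sampler at edge entry and disarm at edge exit.
    void edgeEnter() {
        struct sigaction sa = {};
        sa.sa_sigaction = onSample;
        sa.sa_flags = SA_SIGINFO | SA_RESTART;
        sigaction(SIGPROF, &sa, nullptr);
        itimerval it = {{0, 10000}, {0, 10000}};  // every 10 ms of CPU time
        setitimer(ITIMER_PROF, &it, nullptr);
    }

    void edgeExit() {
        itimerval off = {{0, 0}, {0, 0}};
        setitimer(ITIMER_PROF, &off, nullptr);    // stop sampling
    }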
In-depth problem analysis

Model for communication activities
Communication cost = Synchronization cost + Transmission cost

[Figure: timeline of a matching send (entry e1, exit e3) and receive
(entry e2, exit e4); the receive's overall communication cost splits into
synchronization cost (waiting for the sender) plus transmission cost]

• Captures the semantics of well-known synchronization inefficiencies
    – Late sender, wait at barrier, early reduce, etc.
                                                                            51
In-depth problem analysis

Model for communication activities
Communication cost = Synchronization cost + Transmission cost

• Piggyback the send-entry timestamp (e1)
• Accumulate the synchronization cost per message edge
  (a sketch follows below)

[Figure: the same send/receive timeline, annotated with the piggybacked
e1 timestamp]

• Captures the semantics of well-known synchronization inefficiencies
    – Late sender, wait at barrier, early reduce, etc.
                                                                              52
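With the piggybacked e1 timestamp available at the receiver, the late-sender synchronization cost can be accumulated per message edge as sketched below: if the sender entered its send after the receiver entered its receive, that difference is waiting time, and the rest of the receive's duration counts as transmission. Synchronized clocks are assumed here (a real tool must correct for clock skew); the names are illustrative.

    #include <algorithm>

    struct MessageEdgeProfile {
        double syncCost = 0;      // accumulated waiting (synchronization) time
        double transmitCost = 0;  // accumulated transmission time
        long   count = 0;
    };

    // Called when a receive completes; sendEntryTs is the piggybacked e1.
    void accountReceive(MessageEdgeProfile &edge,
                        double sendEntryTs,   // e1: sender entered MPI_Send
                        double recvEntryTs,   // e2: receiver entered MPI_Recv
                        double recvExitTs) {  // e4: receive completed
        // Late sender: the receiver blocked until the sender arrived.
        double sync = std::max(0.0, sendEntryTs - recvEntryTs);
        edge.syncCost += sync;
        // The remainder of the receive's duration counts as transmission.
        edge.transmitCost += (recvExitTs - recvEntryTs) - sync;
        edge.count += 1;
    }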
In-depth problem analysis

Example receive activity break-down

[Figure: break-down of a receive activity's cost into components; the
synchronization component requires inter-task cause-effect analysis]
                                                                        53
Root-cause analysis

Phase 3: Cause-effect analysis
• Explain the causes of synchronization inefficiencies
   – Why is the sender late?

• Correlate problems into cause-effect chains
• Distinguish root causes of inefficiencies from their causal
  propagation (symptoms)
• Pinpoint problems in non-dominant code regions
• Improve the feedback provided to application developers
                                                                                    54
Cause-effect analysis

Causal propagation

[Figure: timeline across tasks A, B, and C. ComputationA (Task A) causes a
Late Sender in Task A, which causes Inefficiency 1 at Receive1 in Task B;
ComputationB (Task B) then causes a Late Sender in Task B, which causes
Inefficiency 2 at Receive2 in Task C. Inefficiencies propagate along the
message chain m0, m1]
                                                                                           55
Cause-effect analysis

Explaining problem causes
• Explain the waiting time between two nodes as the differences
  between their execution paths (a sketch follows below)
   – Online adaptation of the Wait-Time Analysis approach by Meira et al.
   – Based on the PTAG model, not a full trace
• Explain synchronization inefficiencies by means of other activities
   – Identify the corresponding execution paths in the PTAG model
   – Compare the paths
   – Build a causal tree with explanations
   – Merge the trees of individual problems
                                                                                                56
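A sketch of the path-comparison idea: align the two tasks' execution paths over the interval ending at the problematic receive, find the segments where the sender spent more time than the receiver, and attribute the waiting time to those segments proportionally. The proportional split is an illustrative simplification of the wait-time analysis the slide refers to.

    #include <map>
    #include <string>
    #include <vector>

    struct PathSegment {
        std::string id;  // TAG edge (computation) or node (communication)
        double time;     // time this task spent in the segment
    };

    // Attribute the receiver's waiting time to sender-side segments,
    // proportionally to the extra time the sender spent in each.
    std::map<std::string, double>
    explainWait(const std::vector<PathSegment> &senderPath,
                const std::vector<PathSegment> &receiverPath,
                double waitingTime) {
        std::map<std::string, double> recvTime, extra, causes;
        for (const PathSegment &s : receiverPath) recvTime[s.id] += s.time;

        double totalExtra = 0;
        for (const PathSegment &s : senderPath) {
            double e = s.time - recvTime[s.id];   // sender-only excess
            if (e > 0) { extra[s.id] += e; totalExtra += e; }
        }
        if (totalExtra <= 0) return causes;       // paths do not differ

        for (const auto &kv : extra)              // share of the wait per segment
            causes[kv.first] = waitingTime * (kv.second / totalExtra);
        return causes;
    }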
Cause-effect analysis

Execution path comparison

[Figure: path q (Task 1) and path p (Task 2) compared edge by edge. The
inefficiency at MPI_Recv in Task 1 (waiting time 138.4 s) is caused by a
Late Sender in Task 2; the root causes are computation edge e3 (91.9%)
and computation edge e2 (7.7%) in Task 2]
                                                                                          57
Benefits of RCA

• Systematic approach to online performance analysis

• Quick identification of problems as they manifest at runtime
  (without trace)

• Causal correlation of different problems

• Discovery of root-causes of synchronization inefficiencies




                                                                 58
Experimental
evaluation

               59
Prototype tool

•   Implemented in C++
•   DynInst 5.1
•   MRNet 1.2
•   OpenMPI 1.2.x
•   Linux platforms
    – x86
    – IA-64 (Itanium)
    – PowerPC 32/64

[Architecture diagram: a global analyzer at the root of an MRNet tree of
comm nodes; a dmad daemon per host attaches to the local MPI tasks]
                                                                                60
Experimental environment

UAB cluster:
     x86/Linux, 32 nodes
     Intel Pentium IV 3 GHz, Linux FC4
     Gigabit Ethernet

BSC MareNostrum:
     PowerPC-64/Linux, 512 nodes (restricted)
     PowerPC 2.3 GHz dual core, SUSE Linux Enterprise Server 9
     Myrinet
                                                               61
Modeling MPI applications
• Experiences with different classes of MPI codes
   – SPMD codes
      • WaveSend – 1D stencil, concurrent wave equation
      • NAS Parallel Benchmarks – 2D stencils
      • SMG2000 – 3D stencil, multigrid solver
   – Master/Worker
      • XFire – forest fire propagation simulator

+ Demonstrated ability to model arbitrary MPI code with
  low overhead
+ Works best with regular codes
– Limitations with recursive codes

                                                          62
Case study #1: Modeling SPMD
Integer sort (IS) NAS Parallel Benchmark
• Large integer sort used in
  “particle method” codes

• Tests both integer computation
  speed and communication
  performance

• Mostly collective communication

• We extract PTAG to understand
  application communication
  patterns and behavior



                                           63
Case study #2: Master/Worker
Forest Fire Propagation Simulator (XFire)
• Calculates the expansion of the fireline
• Computationally intensive code, exploits data parallelism
• We extract and cluster PTAG




                                                              64
Evaluation of overheads
Sources of overheads
    • Offline startup
         – Less than 20 seconds per 1 MB of executable
         – A function of program size

    • Online TAG construction
         – 4-20 μs per instrumented call (*)
         – Depends on the number of instrumented calls and loops

    • Online TAG sampling
         – 40-50 μs per snapshot (256 KB)
         – Depends on program structure size, number of communication links


(*) Experiments conducted on the UAB cluster

                                                                              65
Evaluation of overheads
NAS LU overheads, varying number of nodes

[Chart: absolute overhead (seconds) and relative overhead (%) for 16 to
512 CPUs; the relative overhead grows from 1.26% at 16 CPUs, through
1.34% (32), 1.42% (64), 1.50% (128), and 1.59% (256), to 1.91% at
512 CPUs]
                                                                                                    66
Case study #3: SPMD analysis
WaveSend application
• Parallel calculation of a vibrating string over time

• Wave equation, block-decomposition




• P2P communication to exchange boundary
  points with nearest neighbors

• Synthetic performance problems

                                                        67
Case study #3: SPMD analysis
WaveSend
PTAG

After execution




                               68
Case study #3: SPMD analysis
CPU-bound problem at task 7

PTAG after 30 seconds
of execution




                               69
Case study #3: SPMD analysis
Potential bottlenecks

• Task 0 findings: 35.4% CPU-bound in edge 8→6
• Task 1 findings: 33% CPU-bound in edge 11→6
• Task 6 findings: 32.1% CPU-bound in edge 11→6
• Task 7 findings: 50.5% CPU-bound in edge 8→6
                                         70
Case study #3: SPMD analysis
Potential bottlenecks

• Task 0 findings: 21.4% blocked receive caused by a late sender from task 1
• Task 1 findings: 19.1% blocked receive caused by a late sender from task 2
• Task 6 findings: 19.2% blocked receive caused by a late sender from task 7
                                           71
Case study #3: SPMD analysis
Cause-effect analysis




                               72
Case study #3: SPMD analysis
Analysis results



•   Load imbalance found
•   Multiple instances of the late-sender problem
•   Causal propagation of inefficiencies
•   Root cause found in task 7: an imbalanced computational edge


                                                                73
Conclusions
and future work

                  74
Conclusions
• A novel approach for online performance modeling
   – Discovers high-level application structure and runtime behavior
   – A hybrid technique that combines static code analysis with runtime
     monitoring to extract performance knowledge
   – Scalable to 1000+ processors
• An automated online performance analysis approach
   – Enables quick detection of performance bottlenecks
   – Focuses on explaining sources of communication and synchronization
   – Correlates different problems and identifies their root causes
• A prototype tool that models and analyzes MPI applications
  at runtime

                                                                                    75
Future work
• Modeling
  –   Support for other classes of activities (I/O, MPI RMA)
  –   OpenMP applications
  –   Support for recursive codes
  –   Multi-experiment support


• Analysis
  –   More accurate cause-effect analysis with causal paths
  –   Evaluation of scalability of analysis in large-scale HPC
  –   Actionable recommendations
  –   Integration with automatic tuning framework (MATE)


                                                                 76
Online performance modeling and analysis
 of message-passing parallel applications




        Thank You

           PhD Thesis, Oleg Morajko
       Universitat Autònoma de Barcelona
                                            77

Más contenido relacionado

Similar a Online performance modeling and analysis of message-passing parallel applications

Detection of Seed Methods for Quantification of Feature Confinement
Detection of Seed Methods for Quantification of Feature ConfinementDetection of Seed Methods for Quantification of Feature Confinement
Detection of Seed Methods for Quantification of Feature Confinement
Andrzej Olszak
 
Scientific Workflows Systems :In Drug discovery informatics
Scientific Workflows Systems :In Drug discovery informaticsScientific Workflows Systems :In Drug discovery informatics
Scientific Workflows Systems :In Drug discovery informatics
Khaled Tumbi
 
Lanzamiento Visual Studio 2012 - Modern ALM
Lanzamiento Visual Studio 2012 - Modern ALMLanzamiento Visual Studio 2012 - Modern ALM
Lanzamiento Visual Studio 2012 - Modern ALM
Debora Di Piano
 
Best practices for building and deploying predictive models over big data pre...
Best practices for building and deploying predictive models over big data pre...Best practices for building and deploying predictive models over big data pre...
Best practices for building and deploying predictive models over big data pre...
Kun Le
 
Microsoft ALM Platform Overview
Microsoft ALM Platform OverviewMicrosoft ALM Platform Overview
Microsoft ALM Platform Overview
Steve Lange
 
06 operations and feedback dap-kabel
06   operations and feedback dap-kabel06   operations and feedback dap-kabel
06 operations and feedback dap-kabel
David Alvarez Palomo
 
Il product development - 20 01 2011
Il  product development - 20 01 2011Il  product development - 20 01 2011
Il product development - 20 01 2011
nakham
 
Software Development Life Cycle
Software Development Life CycleSoftware Development Life Cycle
Software Development Life Cycle
Slideshare
 
Cdesc dlp 105_ef_ilt
Cdesc dlp 105_ef_iltCdesc dlp 105_ef_ilt
Cdesc dlp 105_ef_ilt
vncsrabelo
 

Similar a Online performance modeling and analysis of message-passing parallel applications (20)

Design For Testability
Design For TestabilityDesign For Testability
Design For Testability
 
Detection of Seed Methods for Quantification of Feature Confinement
Detection of Seed Methods for Quantification of Feature ConfinementDetection of Seed Methods for Quantification of Feature Confinement
Detection of Seed Methods for Quantification of Feature Confinement
 
POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...
POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...
POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...
 
Scientific Workflows Systems :In Drug discovery informatics
Scientific Workflows Systems :In Drug discovery informaticsScientific Workflows Systems :In Drug discovery informatics
Scientific Workflows Systems :In Drug discovery informatics
 
Instrumentation and measurement
Instrumentation and measurementInstrumentation and measurement
Instrumentation and measurement
 
Lanzamiento Visual Studio 2012 - Modern ALM
Lanzamiento Visual Studio 2012 - Modern ALMLanzamiento Visual Studio 2012 - Modern ALM
Lanzamiento Visual Studio 2012 - Modern ALM
 
DITEC - Software Engineering
DITEC - Software EngineeringDITEC - Software Engineering
DITEC - Software Engineering
 
Software Engineering.ppt
Software Engineering.pptSoftware Engineering.ppt
Software Engineering.ppt
 
Best practices for building and deploying predictive models over big data pre...
Best practices for building and deploying predictive models over big data pre...Best practices for building and deploying predictive models over big data pre...
Best practices for building and deploying predictive models over big data pre...
 
Microsoft ALM Platform Overview
Microsoft ALM Platform OverviewMicrosoft ALM Platform Overview
Microsoft ALM Platform Overview
 
06 operations and feedback dap-kabel
06   operations and feedback dap-kabel06   operations and feedback dap-kabel
06 operations and feedback dap-kabel
 
Introductie Visual Studio ALM 2012
Introductie Visual Studio ALM 2012Introductie Visual Studio ALM 2012
Introductie Visual Studio ALM 2012
 
Introductie Visual Studio ALM 2012
Introductie Visual Studio ALM 2012Introductie Visual Studio ALM 2012
Introductie Visual Studio ALM 2012
 
Unit1
Unit1Unit1
Unit1
 
Pressman ch-3-prescriptive-process-models
Pressman ch-3-prescriptive-process-modelsPressman ch-3-prescriptive-process-models
Pressman ch-3-prescriptive-process-models
 
Il product development - 20 01 2011
Il  product development - 20 01 2011Il  product development - 20 01 2011
Il product development - 20 01 2011
 
CADA english
CADA englishCADA english
CADA english
 
WQD2011 - INNOVATION - DEWA - Substation Signal Analyzer Software
WQD2011 - INNOVATION - DEWA - Substation Signal Analyzer SoftwareWQD2011 - INNOVATION - DEWA - Substation Signal Analyzer Software
WQD2011 - INNOVATION - DEWA - Substation Signal Analyzer Software
 
Software Development Life Cycle
Software Development Life CycleSoftware Development Life Cycle
Software Development Life Cycle
 
Cdesc dlp 105_ef_ilt
Cdesc dlp 105_ef_iltCdesc dlp 105_ef_ilt
Cdesc dlp 105_ef_ilt
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Último (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

Online performance modeling and analysis of message-passing parallel applications

  • 1. Online performance modeling and analysis of message-passing parallel applications Delayed receive PhD Thesis Oleg Morajko Universitat Autònoma de Barcelona, Long local calculations Barcelona, 2008
  • 2. Motivation • Parallel system hardware is evolving at an incredible rate • Contemporary HPC systems – Top500 ranging from 1.000 to 200.000+ processors (June 2008) – Take BSC MareNostrum: 10K processors • Whole industry is shifting to parallel computing 2
  • 3. Motivation • Challenges of developing large-scale scientific software – Evolution of programming models is much slower – Hard to achieve good efficiency – Hard to achieve scalability • The parallel applications rarely achieve good performance immediately MPI 3
  • 4. Motivation • Challenges of developing large-scale scientific software – Evolution of programming models is much slower – Hard to achieve good efficiency – Hard to achieve scalability • The parallel applications rarely achieve good performance immediately Careful performance analysis and optimization tasks are crucial 4
  • 5. Motivation • Quickly finding performance problems and their reasons is hard • Requires thorough understanding of the program’s behavior – Parallel algorithm, domain decomposition, communication, synchronization • Large scale brings additional complexities – Large data volume, excessive analysis cost • Existing tools support finding what happens, where, and when – Locating root causes of problems still manual – Tools expose scalability limitations (E.g. tracing) • Problem diagnosis still requires substantial time and effort of highly-skilled professionals 5
  • 6. Our goals • Analyze the performance of parallel applications • Detect bottlenecks and explain their causes – Focus on communication and synchronization in message-passing programs • Automate the approach to the extent possible • Scalable to thousands of nodes • Online approach without trace files 6
  • 7. Contributions • A systematic approach for automated diagnosis of application performance – Application is monitored, modeled and diagnosed during its execution • Scalable modeling technique that generates performance knowledge about application behavior • Analysis technique that diagnoses MPI applications running in large-scale parallel systems – Detects performance bottlenecks on-the-fly – Finds root causes • Prototype tool to demonstrate the ideas 7
  • 8. Outline 1. Overview of approaches 2. Online performance modeling 3. Online performance analysis 4. Experimental evaluation 5. Conclusions and future work 8
  • 10. Classical performance analysis Code Compile Develop Instrument changes Find Execute solutions Performance Trace problems files Analyze trace Visualization tool 10
  • 11. Classical performance analysis Drawbacks • Manual task of experimental nature • Time consuming • High degree of expertise required • Full trace excessive volume of information • Poor scalability 11
  • 12. Automated offline analysis Code Compile Develop Instrument changes Find Execute solutions Performance Trace problems files Analyze trace Automated tools (KappaPI, EXPERT) 12
  • 13. Automated offline analysis Drawbacks • Post-mortem • Addresses only well-known problems • Not fully explored capabilities to find root causes 13
  • 14. Automated online analysis Develop Code changes Compile Instrument Find solutions Execute Performance problems Online monitoring (What, Where, When) and diagnosis (Paradyn) 14
  • 15. Automated online analysis Paradyn advantages Paradyn drawbacks • Locate problems while app • Addresses lower-level runs problems (profiler) • Automated problem-space • No search for root causes of search problems – Functional decomposition – Refinable measurements • Scalable 15
  • 16. Automated online analysis Our approach Consume Code Develop events Monitoring changes Compile Find Refine solutions Execute Modeling Analysis Observe 16 model Problems and causes
  • 17. Automated online analysis Key characteristics • Discovers application model on-the-fly – Model execution flows, not modules/functions – Lossy trace compression • Runtime analysis based on continuous model observation • Automatically locates problems while app runs • Search for root-causes of problems 17
  • 18. Monitoring Modeling Analysis Online performance modeling 18
  • 19. Modeling objectives • Enable high-level understanding of application performance • Reflect parallel application structure and runtime behavior • Maintain tradeoff between volume of collected data and level of preserved details – Communication and computational patterns – Causality of events • Base for online performance analysis 19
• 20. Online performance modeling
  • A novel application performance modeling approach
  • Combines static code analysis with runtime monitoring to extract performance knowledge
  • Three-step approach:
    – Modeling individual tasks
    – Modeling inter-task communication
    – Modeling the entire application
• 21. Modeling individual tasks
  • Execution is decomposed into units that correspond to different activities:
    – Communication activities (e.g., MPI_Send, MPI_Gather)
    – Computation activities (e.g., calc_gauss)
    – Control activities (e.g., program start/termination)
    – Others (e.g., I/O)
  • Execution flow through these activities is captured in a directed graph called the Task Activity Graph (TAG):
    – Nodes model communication activities and loops
    – Edges represent the sequential flow of execution (computation activities)
    – Nodes and edges maintain the happens-before relationship
• 22. Modeling individual tasks
  The Task Activity Graph (TAG) reflects program structure by modeling the executed flow of activities
• 23. Modeling individual tasks
  Each activity corresponds to a particular location in the source code
• 24. Modeling individual tasks
  • Runtime behavior of activities is described by adding performance metrics to nodes and edges (a sketch follows below)
  • Data is aggregated into statistical execution profiles
  [Diagram labels: edge: counter and accumulative timer {min, max, stddev}; node: accumulative timer {min, max, stddev}]
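As a minimal sketch of what such a metric could look like in the prototype's implementation language, C++ (the struct name `AccTimer` and its layout are illustrative assumptions, not the thesis code): keeping {count, sum, sum of squares, min, max} per node or edge is enough to derive the mean and standard deviation on demand, without storing individual samples.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical accumulative timer attached to a TAG node or edge.
// {count, sum, sum2, min, max} suffice to report mean and stddev
// without keeping per-sample data, which bounds the profile size.
struct AccTimer {
    std::uint64_t count = 0;
    double sum = 0.0, sum2 = 0.0;
    double min = 1e300, max = 0.0;

    void add(double t) {            // called once per activity execution
        ++count;
        sum  += t;
        sum2 += t * t;
        min = std::min(min, t);
        max = std::max(max, t);
    }
    double mean() const { return count ? sum / count : 0.0; }
    double stddev() const {         // population stddev derived from sum/sum2
        if (count == 0) return 0.0;
        double m = mean();
        return std::sqrt(std::max(0.0, sum2 / count - m * m));
    }
};
```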
• 25. Modeling communication
  • Message edges capture matching send/receive links (point-to-point and collective)
  • Completion edges capture non-blocking semantics
  • Performance metrics describe runtime behavior
• 26. Modeling the parallel application
  Individual TAG models connected by message edges form the Parallel-TAG model (PTAG)
• 27. Modeling techniques
  We developed a set of techniques to automatically construct and exploit the PTAG model at runtime:
  • Targeted at parallel scientific applications
  • Focused on modeling MPI applications, but extensible to other programming paradigms
  • Low overhead
  • Scalable to 1000+ nodes
• 28. Online PTAG construction
  [Architecture diagram: MPI tasks at the leaves, one Modeler per task, TBON nodes above them, and the front-end at the root. Steps: (1) instrument the MPI tasks, (2) build the TAGs, (3) sample them, (4) send updates upward, (5) merge TAGs in the TBON, (6) update the front-end, (7) analyze]
• 29. Building the individual TAG
  [Diagram: the Modeler (1) analyzes the executable and (2) instruments the MPI task; the RT library (3) captures events and (4) updates the TAG in shared memory; the Modeler (5) samples the TAG and (6) propagates updates]
• 30. Building the individual TAG: Offline program analysis
  • Parse the binary executable
  • Find target functions
  • Detect relevant loops
• 31. Building the individual TAG: Dynamic instrumentation
  • Instrument all target functions to:
    – Record events
    – Collect performance metrics
    – Invoke TAG updates
  • Refinable at runtime
• 32. Building the individual TAG: Performance metrics
  • Counters
  • Timers {sum, sum2, min, max}
  • Histograms
  • Compound metrics
  [Diagram: counters cnt1..cnt5 and timers t1..t4 inserted at instrumentation points in the code]
• 33. Building the individual TAG: Runtime modeling
  • Process the generated events
  • Walk the stack to capture the program location (call path)
  • Update the TAG incrementally (see the sketch below)
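A hedged sketch of the incremental update step (illustrative names, reusing the AccTimer sketch above; the thesis implementation is richer): each event carries the call path of the activity, the modeler finds or creates the node keyed by that call path, and the time elapsed since leaving the previous node is accounted to the connecting computation edge.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative incremental TAG update. Nodes are keyed by the call path
// captured by the stack walk, e.g. "main>solve>MPI_Send@wave.c:120".
struct TagNode {
    AccTimer timer;                               // time inside the activity
    std::unordered_map<int, AccTimer> out_edges;  // computation time to successors
};

struct Tag {
    std::unordered_map<std::string, int> index;   // call path -> node id
    std::vector<TagNode> nodes;
    int prev = -1;                                // last executed node
    double prev_exit = 0.0;                       // timestamp when we left it

    void on_event(const std::string& call_path, double entry, double exit) {
        auto [it, created] = index.try_emplace(call_path, (int)nodes.size());
        if (created) nodes.emplace_back();        // first visit: create the node
        int id = it->second;
        nodes[id].timer.add(exit - entry);        // time spent in the activity
        if (prev >= 0)                            // computation edge between
            nodes[prev].out_edges[id].add(entry - prev_exit);  // two activities
        prev = id;
        prev_exit = exit;
    }
};
```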
• 34. Building the individual TAG: Model sampling
  • Goal: examine the model at runtime
  • Read the model from shared memory
  • Sampling is periodic
  • Lock-free synchronization (one possible scheme is sketched below)
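The slides do not spell out the protocol, so purely as an assumption, here is one common lock-free scheme that fits a single writer (the RT library) and a periodic reader (the modeler): a seqlock over the shared-memory region. The writer makes the sequence counter odd before mutating the TAG and even afterwards; the reader retries whenever it may have observed a torn snapshot.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct SharedRegion {
    std::atomic<std::uint32_t> seq{0};   // odd while the writer is mutating
    char data[256 * 1024];               // serialized TAG snapshot
};

// Periodic, lock-free sampling: copy the region only if the sequence
// counter is even and unchanged across the copy; otherwise retry.
bool sample(const SharedRegion& r, char* out, std::size_t n) {
    for (int attempt = 0; attempt < 100; ++attempt) {
        std::uint32_t s1 = r.seq.load(std::memory_order_acquire);
        if (s1 & 1) continue;                         // writer active, retry
        std::memcpy(out, r.data, n);
        std::atomic_thread_fence(std::memory_order_acquire);
        if (r.seq.load(std::memory_order_relaxed) == s1)
            return true;                              // consistent snapshot
    }
    return false;                                     // skip this period
}
```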
• 35. Online communication modeling
  How to model inter-task communication?
  • Intercept MPI communication calls (nodes)
  • Match sender nodes with receiver nodes
  • Add message edges to the TAG models
• 36. Online communication modeling
  • Requires tracking individual messages from sender to receiver(s) at runtime
  • Achieved by propagating piggyback data over every transmitted MPI message (illustrated below)
    – The sender's node id is transmitted to the receiver(s)
    – Covers P2P, blocking, non-blocking, and collective operations
    – An optimized hybrid strategy minimizes intrusion
  • References to the sender's nodes are stored in the receiver's TAG
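The slide mentions an optimized hybrid strategy; purely to illustrate the idea, the sketch below takes the naive route of one extra piggyback message per send, using the standard PMPI profiling interface (MPI-3 signatures; `current_tag_node_id` and `record_message_edge` are hypothetical helpers, and the sketch ignores MPI_ANY_SOURCE and message-ordering subtleties the real scheme must handle).

```cpp
#include <mpi.h>

static const int PB_TAG = 32767;           // assumed reserved piggyback tag
extern int current_tag_node_id(void);      // hypothetical: TAG node being executed
extern void record_message_edge(int sender_node);  // hypothetical

// Profiling-interface wrappers: every application send also transmits the
// sender's TAG node id so the receiver can attach a message edge.
int MPI_Send(const void* buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm) {
    int node = current_tag_node_id();
    PMPI_Send(&node, 1, MPI_INT, dest, PB_TAG, comm);
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Recv(void* buf, int count, MPI_Datatype type,
             int src, int tag, MPI_Comm comm, MPI_Status* status) {
    int sender_node = -1;
    MPI_Status pb_status;
    PMPI_Recv(&sender_node, 1, MPI_INT, src, PB_TAG, comm, &pb_status);
    int rc = PMPI_Recv(buf, count, type, src, tag, comm, status);
    record_message_edge(sender_node);      // link sender node -> this receive
    return rc;
}
```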
• 37. Online parallel application modeling: Building and maintaining the PTAG
  • Individual TAGs are distributed
  • TAG snapshots are collected over a hierarchical reduction network (TBON)
  • Distributed merge: individual TAGs → merged groups of TAGs → PTAG
  • Periodic process
• 38. Online parallel application modeling: Scalable modeling
  [Example PTAG sizes: 8 nodes ≈ 250 KB, 1024 nodes ≈ 62 MB, 10240 nodes ≈ 625 MB]
  • Increasing data volume
  • Increasing analysis cost
  • Non-scalable visualization
• 39. Online parallel application modeling: Resolving scalability issues
  • Exploit classes of similar tasks (e.g., stencil codes, master/worker)
  • TAG clustering (a sketch follows below) based on:
    – Structural equivalence
    – Behavioral equivalence
  • A distributed and scalable TAG merging algorithm
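One plausible realization of the structural test (a sketch under assumed definitions; the thesis's equivalence criteria may differ): serialize each task's TAG canonically, for example as sorted node call paths plus adjacency lists, hash the serialization, and group ranks whose signatures collide.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Group task ranks by a structural signature of their TAG. Ranks whose
// graphs have identical node call paths and topology land in one cluster,
// so a regular SPMD code typically collapses to one or two clusters.
std::map<std::size_t, std::vector<int>>
cluster_by_structure(const std::vector<std::string>& canonical_tag_per_rank) {
    std::map<std::size_t, std::vector<int>> clusters;
    std::hash<std::string> hash_fn;
    for (int rank = 0; rank < (int)canonical_tag_per_rank.size(); ++rank)
        clusters[hash_fn(canonical_tag_per_rank[rank])].push_back(rank);
    return clusters;
}
```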
• 40. Online parallel application modeling: Scalable PTAG visualization
  [Figure: merged PTAG of a 1D stencil code on 8 nodes]
• 41. Benefits of modeling
  • Facilitates performance understanding
  • Reveals communication and computational patterns and their causal relationships
  • Enables an assortment of online analysis techniques:
    – Quick identification of performance bottlenecks and their location
    – Behavioral task clustering
    – Causal relationships permit root-cause analysis
    – Feedback-guided analysis (refinements)
• 42. Online performance analysis (pipeline stage: Monitoring → Modeling → Analysis)
• 43. Online analysis objectives
  • Diagnose performance on the fly
  • Detect relevant performance bottlenecks and their reasons
  • Distinguish problem symptoms from root causes
  • Explain what, where, when, and why
  • Focus on communication and synchronization problems in MPI applications
• 44. Online performance analysis
  Root-cause analysis is a time-continuous process built on monitoring, modeling, and analysis:
  • Phase 1: Problem identification
  • Phase 2: Problem analysis
  • Phase 3: Cause-effect analysis
• 45. Root-cause analysis, Phase 1: Problem identification
  • Focus attention on code regions with the biggest potential optimization benefits
  • A potential bottleneck is an individual task activity with a significant amount of execution time
  • A TAG node might correspond to a communication or synchronization problem
  • A TAG edge might indicate a computation-bound problem
• 46. Problem identification
  • Rainbow-spectrum TAG coloring: activity time / max activity time (cold → hot activities)
  [Figure: a CPU-bound activity (~45% of time) and a blocked receive (~42% of time), the latter a communication or synchronization problem, appear as hot nodes]
• 47. Problem identification: TAG ranking process
  • Identify potential bottlenecks for further analysis
  • Periodic ranking over a moving time window
  • Select the top problems by rank, where rank = activity time / task time; thresholds are > 20% for computation activities and > 3% for communication activities (formalized below)
  [Diagram: TAG snapshot → ranking → potential bottlenecks]
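Written out (reconstructed from the slide's thresholds; $W$ denotes the moving time window):

$$
\mathrm{Rank}(a) = \frac{T_{\mathrm{activity}}(a, W)}{T_{\mathrm{task}}(W)},
\qquad
a \text{ is a potential bottleneck if } \mathrm{Rank}(a) >
\begin{cases}
0.20, & a \text{ is a computation activity},\\
0.03, & a \text{ is a communication activity.}
\end{cases}
$$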
• 48. Root-cause analysis, Phase 2: In-depth problem analysis
  • For each potential bottleneck, investigate its causes
  • Explore a knowledge-based cause space
  • Focus on the causes that contribute most to the problem time
  • Distinguish task-local problems from inter-task problems
    – Find the root causes of task-local problems (e.g., CPU-bound computation, local I/O)
    – Find the symptoms of inter-task problems (e.g., blocked receive, barrier)
• 49. In-depth problem analysis: Performance models for activities
  • Classification of activities
  • Each class has a performance model that divides the activity cost into separate components
  • Each component is a non-exclusive potential cause of the problem
• 50. In-depth problem analysis: Model for computational activities
  • A sequential code region is modeled by a TAG edge
  • No external knowledge about the computation is assumed
  • Determine where the edge-constrained code spends time
  • Divide the TAG edge into components
    – Functional or basic-block decomposition
  • Apply statistical profiling constrained to an edge
    – Dynamic instrumentation
  • Other metrics: idle time, I/O time, hardware counters
• 51. In-depth problem analysis: Model for communication activities
  Communication cost = synchronization cost + transmission cost
  [Timeline: the sender enters Send at e1 and exits at e3; the receiver enters Receive at e2 and exits at e4; the waiting portion of the receive is the synchronization cost, the remainder the transmission cost]
  • Captures the semantics of well-known synchronization inefficiencies: late sender, wait at barrier, early reduce, etc.
• 52. In-depth problem analysis: Model for communication activities (continued)
  • The send entry timestamp (e1) is piggybacked on the message
  • The synchronization cost is accumulated per message edge (one consistent decomposition is written out below)
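One consistent reading of the timeline above (an assumption based on the slide, with $e_1$/$e_3$ the sender's entry/exit and $e_2$/$e_4$ the receiver's entry/exit timestamps):

$$
\underbrace{e_4 - e_2}_{\text{overall receive cost}}
= \underbrace{\max(0,\; e_1 - e_2)}_{\text{synchronization cost (late sender)}}
+ \underbrace{e_4 - \max(e_1,\, e_2)}_{\text{transmission cost}}
$$

Because $e_1$ travels as piggyback data on the message itself, the receiver can evaluate this decomposition locally and accumulate the synchronization cost per message edge.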
• 53. In-depth problem analysis
  [Figure: example breakdown of a receive activity's cost; explaining it requires inter-task cause-effect analysis]
• 54. Root-cause analysis, Phase 3: Cause-effect analysis
  • Explain the causes of synchronization inefficiencies (e.g., why is the sender late?)
  • Correlate problems into cause-effect chains
  • Distinguish the root causes of inefficiencies from their causal propagation (symptoms)
  • Pinpoint problems in non-dominant code regions
  • Improve the feedback provided to application developers
• 55. Cause-effect analysis: Causal propagation
  [Diagram: computation in Task A causes a late sender in Task A, which causes inefficiency 1 in Task B; computation in Task B causes a late sender in Task B, which causes inefficiency 2 in Task C. Timeline: Task A computes, then Send1 delivers message m0 to Task B, which accumulated waiting time WT1 (inefficiency 1); Task B computes, then Send2 delivers m1 to Task C, which accumulated WT2 (inefficiency 2)]
• 56. Cause-effect analysis: Explaining problem causes
  • The causes of waiting time between two nodes are derived from the differences between their execution paths
    – An online adaptation of the wait-time analysis approach by Meira et al.
    – Based on the PTAG model, not a full trace
  • Synchronization inefficiencies are explained by means of other activities:
    – Identify the corresponding execution paths in the PTAG model
    – Compare the paths
    – Build a causal tree with explanations
    – Merge the trees of individual problems
• 57. Cause-effect analysis: Execution path comparison
  [Example: an inefficiency at MPI_Recv in Task 1 (waiting time 138.4 s) is caused by a late-sender problem in Task 2; comparing the receiver's path q (Task 1) with the sender's path p (Task 2) attributes 91.9% of the waiting time to computation edge e3 and 7.7% to computation edge e2 in Task 2, which are the root causes. A sketch of the attribution step follows]
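A hedged sketch of the final attribution step (illustrative names; the path identification and matching that precede it are more involved): once the sender-side path segments that exceed the receiver's are known, the observed waiting time is apportioned to them in proportion to their excess time.

```cpp
#include <map>
#include <string>
#include <vector>

struct PathStep {
    std::string edge;   // TAG edge id, e.g. "e3"
    double excess;      // sender time minus receiver time on the matched segment
};

// Apportion a blocked receive's waiting time across the sender's slow
// edges, yielding shares such as {e3: 91.9%, e2: 7.7%} of the wait.
std::map<std::string, double>
attribute_wait(const std::vector<PathStep>& sender_path, double waiting_time) {
    double total = 0.0;
    for (const auto& s : sender_path)
        if (s.excess > 0) total += s.excess;
    std::map<std::string, double> share;
    if (total <= 0) return share;                 // nothing to attribute
    for (const auto& s : sender_path)
        if (s.excess > 0)
            share[s.edge] = waiting_time * (s.excess / total);
    return share;
}
```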
• 58. Benefits of RCA
  • A systematic approach to online performance analysis
  • Quick identification of problems as they manifest at runtime (without a trace)
  • Causal correlation of different problems
  • Discovery of the root causes of synchronization inefficiencies
• 60. Prototype tool
  • Implemented in C++
  • DynInst 5.1, MRNet 1.2, OpenMPI 1.2.x
  • Linux platforms: x86, IA-64 (Itanium), PowerPC 32/64
  [Architecture diagram: a global analyzer at the front-end, a tree of MRNet communication nodes, and one daemon (dmad) attached to each MPI task]
• 61. Experimental environment
  • UAB cluster: x86/Linux, 32 nodes, Intel Pentium IV 3 GHz, Linux FC4, Gigabit Ethernet
  • BSC MareNostrum: PowerPC-64/Linux, 512 nodes (restricted), PowerPC 2.3 GHz dual-core, SUSE Linux Enterprise Server 9, Myrinet
• 62. Modeling MPI applications
  • Experiences with different classes of MPI codes
    – SPMD codes: WaveSend (1D stencil, concurrent wave equation), NAS Parallel Benchmarks (2D stencils), SMG2000 (3D stencil, multigrid solver)
    – Master/Worker: XFire (forest fire propagation simulator)
  + Demonstrated ability to model arbitrary MPI code with low overhead
  + Works best with regular codes
  – Limitations with recursive codes
• 63. Case study #1: Modeling SPMD
  Integer Sort (IS) NAS Parallel Benchmark
  • A large integer sort as used in "particle method" codes
  • Tests both integer computation speed and communication performance
  • Mostly collective communication
  • We extract the PTAG to understand the application's communication patterns and behavior
• 64. Case study #2: Master/Worker
  Forest Fire Propagation Simulator (XFire)
  • Calculates the expansion of the fireline
  • Computationally intensive code that exploits data parallelism
  • We extract and cluster the PTAG
• 65. Evaluation of overheads
  Sources of overhead:
  • Offline startup: less than 20 seconds per 1 MB of executable; depends on program size
  • Online TAG construction: 4-20 μs per instrumented call (*); depends on the number of instrumented calls and loops
  • Online TAG sampling: 40-50 μs per snapshot (256 KB); depends on the program structure size and the number of communication links
  (*) Experiments conducted on the UAB cluster
• 66. Evaluation of overheads
  [Chart: NAS LU overhead in seconds and percent for 16 to 512 CPUs; the relative overhead stays below 2% at every scale, with individual points between roughly 1.26% and 1.91%]
• 67. Case study #3: SPMD analysis
  WaveSend application
  • Parallel calculation of a vibrating string over time
  • Wave equation with block decomposition
  • P2P communication to exchange boundary points with nearest neighbors
  • Synthetic performance problems
• 68. Case study #3: SPMD analysis
  [Figure: WaveSend PTAG after execution]
• 69. Case study #3: SPMD analysis
  [Figure: PTAG after 30 seconds of execution, showing a CPU-bound problem at task 7]
• 70. Case study #3: SPMD analysis
  Potential bottlenecks:
  • Task 0: 35.4% CPU-bound in edge 8→6
  • Task 1: 33% CPU-bound in edge 11→6
  • Task 6: 32.1% CPU-bound in edge 11→6
  • Task 7: 50.5% CPU-bound in edge 8→6
• 71. Case study #3: SPMD analysis
  Potential bottlenecks:
  • Task 0: 21.4% blocked receive caused by a late sender from task 1
  • Task 1: 19.1% blocked receive caused by a late sender from task 2
  • Task 6: 19.2% blocked receive caused by a late sender from task 7
• 72. Case study #3: SPMD analysis
  [Figure: cause-effect analysis]
• 73. Case study #3: SPMD analysis
  Analysis results:
  • Load imbalance found
  • Multiple instances of the late-sender problem
  • Causal propagation of inefficiencies observed
  • The root cause was found in task 7: an imbalanced computational edge
• 75. Conclusions
  • A novel approach for online performance modeling
    – Discovers high-level application structure and runtime behavior
    – A hybrid technique that combines static code analysis with runtime monitoring to extract performance knowledge
    – Scalable to 1000+ processors
  • An automated online performance analysis approach
    – Enables quick detection of performance bottlenecks
    – Focuses on explaining the sources of communication and synchronization problems
    – Correlates different problems and identifies their root causes
  • A prototype tool that models and analyzes MPI applications at runtime
• 76. Future work
  • Modeling
    – Support for other classes of activities (I/O, MPI RMA)
    – OpenMP applications
    – Support for recursive codes
    – Multi-experiment support
  • Analysis
    – More accurate cause-effect analysis with causal paths
    – Evaluation of the scalability of the analysis in large-scale HPC
    – Actionable recommendations
    – Integration with the automatic tuning framework (MATE)
• 77. Online performance modeling and analysis of message-passing parallel applications
  Thank you!
  PhD Thesis, Oleg Morajko
  Universitat Autònoma de Barcelona