Introduction Acquisition Process Replay Conclusions and Perspectives
Performance Evaluation and Prediction of
Parallel Applications
Georgios Markomanolis
Avalon Team, INRIA, LIP,
École Normale Supérieure de Lyon, France
Ph.D. Defense
January 20th
2014
Under the supervision of Frédéric Desprez and Frédéric Suter
1 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Dimensioning Through Simulation
User and administrator expertise is not enough
Decisions can cost a lot of money
⇒ Need for objective indicators by exploring various “what-if”
scenarios
Simulation has many advantages
Less simplistic than theoretical models
More reproducible than running on production systems
Execution on a real platform can be time- and money-consuming
Focus on non-adaptive MPI applications
Two complementary approaches
On-line: execute the application with some simulated parts
Off-line: replay an execution trace
2 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Time-Independent Trace Replay
Post-mortem analysis (or off-line simulation) of MPI applications
Well covered field
Mainly profiling tools
Unexpected behaviors and performance bottlenecks detection
TAU, Scalasca, Vampir, SCORE-P, . . .
Usually based on timed traces
Create a tight link between the trace and the acquisition environment
Proposition: get rid of the timestamps
Trace volumes only
Numbers of instructions for computations
Message sizes for communications
Goals
Get environment oblivious traces
Decouple acquisition from actual replay
3 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Time-Independent Traces
for (i=0; i<4; i++){
if (myId == 0){
/* Compute 1M instructions */
MPI_Send(..., (myId+1));
MPI_Recv(...);
} else {
MPI_Recv(...);
/* Compute 1M instructions */
MPI_Send(..., (myId+1)% nproc);
}
}
A list of actions performed by each process
Action described by:
id of the process
type, e.g., computation or communication
volume in instructions or bytes
some action-specific parameters
0 init
0 compute 1e6
0 send 1 1e6
0 recv 3
0 finalize
1 init
1 recv 0
1 compute 1e6
1 send 2 1e6
1 finalize
2 init
2 recv 1
2 compute 1e6
2 send 3 1e6
2 finalize
3 init
3 recv 2
3 compute 1e6
3 send 0 1e6
3 finalize
4 / 37
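To make the action format concrete, a minimal C sketch of a parser for one trace line is given below; the field layout follows the slide (process id, action type, then peer and/or volume), while the struct and helper names are purely illustrative and not part of the actual framework.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative only: one decoded time-independent action. */
typedef struct {
    int    rank;       /* id of the process                       */
    char   type[16];   /* "init", "compute", "send", "recv", ...  */
    int    peer;       /* destination/source rank, if relevant    */
    double volume;     /* instructions or bytes, if relevant      */
} ti_action_t;

/* Decode a single trace line such as "0 send 1 1e6" or "3 compute 1e6". */
static int parse_action(const char *line, ti_action_t *a)
{
    char buf[256];
    strncpy(buf, line, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    a->peer = -1;
    a->volume = 0.0;

    char *tok = strtok(buf, " \t\n");            /* process id  */
    if (!tok) return -1;
    a->rank = atoi(tok);

    tok = strtok(NULL, " \t\n");                 /* action type */
    if (!tok) return -1;
    strncpy(a->type, tok, sizeof(a->type) - 1);
    a->type[sizeof(a->type) - 1] = '\0';

    if (strcmp(a->type, "compute") == 0) {
        tok = strtok(NULL, " \t\n");             /* volume in instructions */
        if (tok) a->volume = atof(tok);
    } else if (strcmp(a->type, "send") == 0) {
        tok = strtok(NULL, " \t\n");             /* destination rank */
        if (tok) a->peer = atoi(tok);
        tok = strtok(NULL, " \t\n");             /* message size in bytes */
        if (tok) a->volume = atof(tok);
    } else if (strcmp(a->type, "recv") == 0) {
        tok = strtok(NULL, " \t\n");             /* source rank */
        if (tok) a->peer = atoi(tok);
    }
    /* "init" and "finalize" carry no extra parameters. */
    return 0;
}

int main(void)
{
    ti_action_t a;
    if (parse_action("1 send 2 1e6", &a) == 0)
        printf("rank %d: %s to %d, %.0f bytes\n", a.rank, a.type, a.peer, a.volume);
    return 0;
}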
Introduction Acquisition Process Replay Conclusions and Perspectives
Experimental Environments
NAS Benchmarks:
EP: an embarrassingly parallel kernel
DT: communication with large messages using quad-trees
LU: solves a synthetic system of nonlinear PDEs
CG: conjugate gradient method
MG, FT, IS, BT, SP (not tested)
Grid’5000: 24 clusters, 1,169 nodes, 8,080 cores (July 2013)
5 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Contributions
An original approach that totally decouples the acquisition of the trace from its replay
Several original scenarios that allow for the acquisition of large execution traces
A study of the state of the art of open source profiling tools
A new profiling tool based on our framework requirements
A trace replay tool on top of a fast, scalable and validated simulation kernel
A complete experimental evaluation of our off-line simulation framework
6 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Outline of the Talk
1 Introduction
2 Acquisition Process
Instrumentation
Execution
Post Processing
Evaluation of the Acquisition Framework
3 Replay
Calibration
Network model
Simulators
Simulation Accuracy
Addressing Issues
Simulation Time
4 Conclusions and Perspectives
7 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Trace Acquisition Process
for (i=0; i<4; i++){
if (myId == 0){
/* Compute 1M instructions */
MPI_Send(..., (myId+1));
MPI_Recv(...);
} else {
MPI_Recv(...);
/* Compute 1M instructions */
MPI_Send(..., (myId+1)% nproc);
}
}
0 init
0 compute 1e6
0 send 1 1e6
0 recv 3
0 finalize
1 init
1 recv 0
1 compute 1e6
1 send 2 1e6
1 finalize
2 init
2 recv 1
2 compute 1e6
2 send 3 1e6
2 finalize
3 init
3 recv 2
3 compute 1e6
3 send 0 1e6
3 finalize
8 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Instrumentation
Evaluation of Profiling Tools - Results
                Profiling    Quality of   Space and Time   Quality of
Tool            features     output       Overheads        Software      Total
(#criteria)     8            3            8                11
PerfBench       2            0            0                5             7
PerfSuite       2            0            0                10            12
MpiP            2            0            0                11            13
IPM             3            0            0                11            14
MPE             4            1            2                10            17
PAPI            4            3            6                11            24
Extrae          7            2            5                11            25
VampirTrace     7            2            5                11            25
MinI            7            3            6                10            26
TAU             8            2            5                11            26
Scalasca        6            2            8                11            27
Score-P         7            2            8                11            28
9 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Instrumentation
Choosing an instrumentation method
Contenders
TAU-full: Full automatic instrumentation of the application
TAU-reduced: Selective instrumentation by instrumenting only
MPI calls
BEGIN_FILE_EXCLUDE_LIST
*
END_FILE_EXCLUDE_LIST
+ -optTauSelectFile=/path/exclude.pdt
MinI: Combination of PMPI library with PAPI support
Metrics
Skew is the discrepancy in instruction count between a run of the instrumented application and a run of the uninstrumented application, due to the instrumentation code
Overhead is the increase in execution time due to the execution of the instrumentation code
10 / 37
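As an illustration of the PMPI-based approach behind MinI, here is a minimal sketch (not MinI's actual code) of an MPI_Send wrapper that turns the instructions counted by PAPI since the last MPI call into a "compute" action and the outgoing message into a "send" action; the ti_trace_append helper and the counter setup in an MPI_Init wrapper are assumptions made for the example.
#include <mpi.h>
#include <papi.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper: a real tool would append to a per-rank trace file. */
static void ti_trace_append(const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    vprintf(fmt, ap);
    va_end(ap);
}

/* Wrapper around MPI_Send: the instructions executed since the previous
 * MPI call become a "compute" action, then the message becomes a "send". */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int rank, type_size;
    long long counters[1], instructions = 0;

    PMPI_Comm_rank(comm, &rank);
    PMPI_Type_size(datatype, &type_size);

    /* Read and reset the PAPI_TOT_INS counter, assumed to have been
     * started in an MPI_Init wrapper (not shown here). */
    if (PAPI_read_counters(counters, 1) == PAPI_OK)
        instructions = counters[0];

    ti_trace_append("%d compute %lld\n", rank, instructions);
    ti_trace_append("%d send %d %lld\n", rank, dest,
                    (long long)count * type_size);

    /* Forward to the real MPI implementation via the PMPI interface. */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}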
Introduction Acquisition Process Replay Conclusions and Perspectives
Instrumentation
Instrumentation Skew
[Bar plot: average skew (in %) vs. number of processes (8–128) for TAU-full, TAU-Reduced, and Minimal instrumentation, Classes B and C]
We call “original” the version with two PAPI calls inserted at the beginning/end of the LU computation
TAU-full leads to instrumentation skew from 3.66% to 21.62%
MinI achieves instrumentation skew less than 5%
11 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Instrumentation
Instrumentation Overhead
[Bar plot: instrumentation overhead (in %) vs. number of processes (8–128) for TAU-Reduced and Minimal instrumentation, Classes B and C]
On average, MinI has 1.6 times less instrumentation overhead than TAU-Reduced
For MinI the instrumentation overhead is up to 23.5%
MinI directly produces Time-Independent trace files
12 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Execution
Execution
Four different acquisition modes
Regular: one process per CPU
Limited scalability
Folded: more than one process per CPU
Acquisition of traces for larger instances
Limited by the available memory
Composite: CPUs don’t necessarily belong to one
cluster
Many nodes available
Composite and Folded: combination of the previous
modes
[Figure: composite acquisition spanning Site 1 and Site 2]
13 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Execution
Execution
Acquisition mode R F-2 F-16 C-2 CF-(2,4)
Number of nodes 64 32 4 (32,32) (8,8)
LU
Execution Time (in sec.) 11.52 24.45 148.95 23.8 72.14
Ratio to regular mode 1 2.12 12.92 2.09 6.26
Linear increase with the folding factor
16 processes per CPU ⇒ 13 times longer execution time
Increasing the number of sites ⇒ larger overhead
A tracing tool produces traces with erroneous timestamps
All the traces are identical, with variations of less than 1%
Acquisition and replay are totally decoupled
14 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Post Processing
Trace Gathering
The replay tool requires the traces to be located on the same hard disk
K-nomial tree reduction
log_{K+1}(N) steps, where N is the total number of files and K is the arity of the tree
For the LU benchmark, Classes B and C, on 64 nodes: 2 to 12.58 times faster than the Kaget tool
15 / 37
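A minimal sketch of how such a k-nomial schedule can be enumerated, assuming the files are gathered toward rank 0; it illustrates the ⌈log_{K+1}(N)⌉ step count rather than the framework's actual gathering scripts.
#include <stdio.h>

/* Minimal sketch of a k-nomial gather toward rank 0: at step s, every rank
 * that is a multiple of (K+1)^(s+1) receives the files of ranks
 * r + j*(K+1)^s, j = 1..K, so all N trace files reach rank 0 in
 * ceil(log_{K+1}(N)) steps. */
static void knomial_schedule(int N, int K)
{
    long stride = 1;
    int step = 0;
    while (stride < N) {
        long block = stride * (K + 1);
        for (long r = 0; r < N; r += block)        /* receivers this step */
            for (int j = 1; j <= K; j++) {
                long child = r + j * stride;       /* sender               */
                if (child < N)
                    printf("step %d: rank %ld sends its files to rank %ld\n",
                           step, child, r);
            }
        stride = block;
        step++;
    }
    printf("total steps: %d\n", step);
}

int main(void)
{
    knomial_schedule(64, 3);   /* e.g., 64 trace files, 3-nomial tree */
    return 0;
}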
Introduction Acquisition Process Replay Conclusions and Perspectives
Post Processing
Analysis of Trace sizes
Trace size (in MB) for the LU benchmark
              TAU-full            TAU-reduced         MinI
#Processes    Class B   Class C   Class B   Class C   Class B   Class C
8             334       531       188       298       29.6      48
16            741       1,200     450       714       72        116.8
32            1,600     2,500     973       1,600     159       255
64            3,200     5,100     2,100     3,300     339       550
128           6,600     11,000    4,300     6,800     711       1,200
TAU_Full >> TAU_Reduced >> MinI
More information → essential information
Size related to number of actions
∼ 15 characters/action, depending on the type of action
16 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Evaluation of the Acquisition Framework
Distribution of the acquisition time
[Stacked bar plot for LU, Classes B and C: time (in seconds) vs. number of processes (8–128), broken down into application time, tracing overhead, and gathering time]
Time to gather the traces: up to 62.02% of the acquisition time
Horizontal line indicates the gather time of compressed files
Tracing overhead between 1.75% and 10.55%
17 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Evaluation of the Acquisition Framework
Extreme folding
Time in minutes / Memory footprint in GiB
Instance    TAU Reduced    Scalasca       Score-P       Minimal Instrumentation
B - 256     2.58 / 11      2.1 / 2.8      1.75 / 4      1.9 / 1.65
C - 1024    N/A            16.3 / 12.9    26.2 / 31     12.9 / 7.95
D - 256     81.8 / 40      55.2 / 16.9    72.16 / 32    47.4 / 15.4
[Plot: acquisition time (in seconds) and memory footprint (in GiB) vs. number of processes (1024–16384) for Classes C, D, and E]
TAU demands a lot of memory
Scalasca is efficient but does not provide the exact Time-Independent trace format
Score-P is getting improved
The MinI tool operates as expected according to our requirements
18 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Evaluation of the Acquisition Framework
Large Scale Experiment: LU - E - 16k
Folded
StRemi cluster, 40 nodes, 960 cores, 48GB memory per node
More than 400 MPI processes per node
Execution time 3.5 hours, 1 TB of memory
Composite and Folded
778 nodes, 18 clusters, 9 geographically distant sites
Folding factor based on the node memory
1.45 TB of Time-Independent traces
Less than 1.5 hours to execute the instrumented application (53 minutes) and gather the compressed trace files (16 minutes)
19 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Evaluation of the Acquisition Framework
Outline of the Talk
1 Introduction
2 Acquisition Process
Instrumentation
Execution
Post Processing
Evaluation of the Acquisition Framework
3 Replay
Calibration
Network model
Simulators
Simulation Accuracy
Addressing Issues
Simulation Time
4 Conclusions and Perspectives
20 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Trace Simulated Replay
The trace replay tool takes as input the Time-Independent trace(s), a platform/topology description, and an application deployment description, and runs on top of the simulation kernel to produce a simulated execution time, a timed trace, and a Gantt chart.
Platform/Topology:
<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "simgrid.dtd">
<platform version="3">
<cluster id="cluster" prefix="c-" suffix=".me" radical="0-3"
power="1E9" bw="1.25E8" lat="15E-6" bb_bw="1.25E9" bb_lat="15e-6"/>
</platform>
Application Deployment:
<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "simgrid.dtd">
<platform version="3">
<process host="c-0.me" function="0"/>
<process host="c-1.me" function="1"/>
<process host="c-2.me" function="2"/>
<process host="c-3.me" function="3"/>
</platform>
Time-Independent trace:
0 compute 1e6
0 send 1 1e6
0 recv 3 1e6
1 recv 0 1e6
1 compute 1e6
1 send 2 1e6
2 recv 1 1e6
2 compute 1e6
2 send 3 1e6
3 recv 2 1e6
3 compute 1e6
3 send 0 1e6
Timed trace:
[0.001000] 0 compute 1e6 0.01000
[0.010028] 0 send 1 1e6 0.009028
[0.040113] 0 recv 3 1e6 0.030085
[0.010028] 1 recv 0 1e6 0.010028
...
Simulated time: 0.0401133
Gantt chart: http://paje.sourceforge.net
Simulation kernel: http://simgrid.gforge.inria.fr
21 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
SimGrid in One Slide
Three user APIs
SimDag: heuristics as DAG of (parallel) tasks
MSG: heuristics as Concurrent Sequential Processes
SMPI: simulate real MPI applications
Under the hood
SURF: The simulation kernel (full of deeply investigated models)
XBT: bundle of useful stuff (data structures, logging, . . . )
22 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Calibration
Computation Calibration and Cache Usage
Platform file contains instruction rate of CPUs
<cluster id="AS_cluster" prefix="c-" suffix=".me" radical="0-3"
power="1E9" bw="1.25E8" lat="15E-6" bb_bw="1.25E9"
bb_lat="15E-6"/>
Calibration procedure
Execute a small instrumented instance of the target application
Typically Class A on 4 processes
Compute the instruction rate for every event
Compute a weighted average of the instruction rates for each
process
Compute the average instruction rate over the whole process set
Do it five times
A single instruction rate for everything
Small instance ⇒ data fit in L2 cache
Larger instance ⇒ exceed L2 capacity ⇒ lower rate!
We take cache usage into account during calibration
23 / 37
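A minimal sketch of the averaging step, assuming each calibration run yields per-event (instructions, duration) samples; weighting each event's rate by its duration (i.e., total instructions over total time) is my reading of the slide, not necessarily the exact formula used by the framework.
#include <stdio.h>

/* One calibration sample: instructions executed and wall-clock duration
 * of a single compute event on a single process. */
typedef struct { double instructions; double seconds; } sample_t;

/* Duration-weighted average rate of one process: the rate of each event
 * weighted by its duration, i.e. sum(I_i) / sum(t_i). */
static double process_rate(const sample_t *s, int n)
{
    double insn = 0.0, secs = 0.0;
    for (int i = 0; i < n; i++) {
        insn += s[i].instructions;
        secs += s[i].seconds;
    }
    return secs > 0.0 ? insn / secs : 0.0;
}

int main(void)
{
    /* Hypothetical events from a Class A run on 2 of the 4 processes. */
    sample_t p0[] = {{1e6, 0.0009}, {2e6, 0.0021}};
    sample_t p1[] = {{1e6, 0.0010}, {2e6, 0.0019}};

    double rates[2] = { process_rate(p0, 2), process_rate(p1, 2) };

    /* Averaging over the whole process set gives the single rate written
     * into the platform description (the "power" attribute). */
    double platform_rate = (rates[0] + rates[1]) / 2.0;
    printf("calibrated rate: %.3e instructions/s\n", platform_rate);
    return 0;
}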
Introduction Acquisition Process Replay Conclusions and Perspectives
Calibration
Impact on XML description
...
<cluster id="AS_sgraphene1"
  prefix="graphene-" suffix=".nancy.grid5000.fr" radical="0-38"
  power="3.68E9" bw="1.25E8" lat="15E-6" bb_bw="1.25E9" bb_lat="15E-6"/>
<cluster id="AS_sgraphene2"
  prefix="graphene-" suffix=".nancy.grid5000.fr" radical="39-73"
  power="3.68E9" bw="1.25E8" lat="15E-6" bb_bw="1.25E9" bb_lat="15E-6"/>

<link id="switch-graphene" bandwidth="1.25E9" latency="5E-4"/>

<ASroute src="AS_sgraphene1" dst="AS_sgraphene2"
  gw_src="graphene-AS_sgraphene1_router.nancy.grid5000.fr"
  gw_dst="graphene-AS_sgraphene2_router.nancy.grid5000.fr">
  <link_ctn id="switch-graphene"/>
</ASroute>
...
24 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Network model
The "hybrid" network model of SMPI
[Figure: the three communication modes of the model depending on the message size k: Asynchronous (k ≤ S_a), Detached (S_a < k ≤ S_d), and Synchronous (k > S_d), each shown as a timeline between sender P_s and receiver P_r]
[Plots: duration (in seconds) vs. message size (in bytes) of MPI_Send, MPI_Recv, and a Ping-Pong, with regions Small, Medium1, Medium2, Detached, and Large]
It is piecewise linear and discontinuous
25 / 37
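A small sketch of how the mode of a send can be selected from the message size using the two thresholds S_a and S_d; the numeric values are taken from the configuration shown on the next slide, and the helper itself is only illustrative.
#include <stdio.h>

/* Thresholds as configured on the next slide (smpi/async_small_thres
 * and smpi/send_is_detached_thres). */
#define S_A 65536L    /* bytes: below or equal -> asynchronous */
#define S_D 327680L   /* bytes: below or equal -> detached     */

static const char *send_mode(long k)
{
    if (k <= S_A) return "asynchronous";   /* small message, buffered send   */
    if (k <= S_D) return "detached";       /* send completes before the recv */
    return "synchronous";                  /* rendezvous with the receiver   */
}

int main(void)
{
    long sizes[] = {1024, 100000, 1000000};
    for (int i = 0; i < 3; i++)
        printf("%7ld bytes -> %s\n", sizes[i], send_mode(sizes[i]));
    return 0;
}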
Introduction Acquisition Process Replay Conclusions and Perspectives
Network model
Impact on XML description
<config id="General">
<prop id="workstation/model" value="compound"/>
<prop id="network/model" value="SMPI"/>

<prop id="smpi/async_small_thres" value="65536"/>
<prop id="smpi/send_is_detached_thres" value="327680"/>

<prop id="smpi/os"
  value="0:8.93009e-06:7.654382e-10; 1420:1.396843e-05:2.974094e-10;
  32768:1.540828e-05:2.441040e-10; 65536:0.000238:0; 327680:0:0"/>

<prop id="smpi/or"
  value="0:8.140255e-06:8.3958e-10; 1420:1.2699e-05:9.092182e-10;
  32768:3.095706e-05:6.956453e-10; 65536:0:0; 327680:0:0"/>

<prop id="smpi/bw_factor"
  value="0:0.400977; 1420:0.913556; 32768:1.078319; 65536:0.956084;
  327680:0.929868"/>

<prop id="smpi/lat_factor"
  value="0:1.35489; 1420:3.437250; 32768:5.721647; 65536:11.988532;
  327680:9.650420"/>
</config>
...
26 / 37
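My reading of the smpi/os and smpi/or values above is a piecewise-linear cost: each size:intercept:slope triplet applies from that message size up to the next breakpoint, the cost being intercept + slope × size. A small illustrative sketch (not SimGrid's actual parser) of that evaluation:
#include <stdio.h>

/* One "size:intercept:slope" segment of the smpi/os property above. */
typedef struct { double from_size, intercept, slope; } segment_t;

/* Segments copied from the smpi/os value on this slide. */
static const segment_t os[] = {
    {0,      8.93009e-06,  7.654382e-10},
    {1420,   1.396843e-05, 2.974094e-10},
    {32768,  1.540828e-05, 2.441040e-10},
    {65536,  0.000238,     0},
    {327680, 0,            0},
};

/* Sender-side overhead for a message of 'size' bytes: pick the last
 * segment whose starting size is <= size, then intercept + slope*size. */
static double send_overhead(double size)
{
    int n = (int)(sizeof(os) / sizeof(os[0]));
    int seg = 0;
    for (int i = 0; i < n; i++)
        if (size >= os[i].from_size) seg = i;
    return os[seg].intercept + os[seg].slope * size;
}

int main(void)
{
    printf("os(1 KiB)   = %.3e s\n", send_overhead(1024));
    printf("os(100 KiB) = %.3e s\n", send_overhead(102400));
    return 0;
}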
Introduction Acquisition Process Replay Conclusions and Perspectives
Network model
SMPI - Hybrid Model
UP/DOWN links to share the available bandwidth separately in
each direction
Limiter link shared by all the flows to and from a processor
[Figure: modeled topology of the graphene cluster: each node has full-duplex UP and DOWN 1G links behind a 1.5G limiter; the four groups of nodes (1-39, 40-74, 75-104, 105-144) are connected through 10G links behind a 13G limiter and a 10G backbone]
27 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Network model
Impact on XML description
<cluster id="AS_sgraphene1" prefix="graphene-"
  suffix=".nancy.grid5000.fr" radical="0-38" power="3.68E9"
  bw="1.25E8" lat="15E-6"
  sharing_policy="FULLDUPLEX" limiter_link="1.875E8"
  loopback_lat="1.5E-9" loopback_bw="6000000000"></cluster>

<link id="switch-backbone1" bandwidth="1162500000" latency="1.5E-6"
  sharing_policy="FULLDUPLEX"/>
<link id="switch-backbone2" bandwidth="1162500000" latency="1.5E-6"
  sharing_policy="FULLDUPLEX"/>

<link id="explicit-limiter1" bandwidth="1511250000" latency="0"
  sharing_policy="SHARED"/>
<link id="explicit-limiter2" bandwidth="1511250000" latency="0"
  sharing_policy="SHARED"/>

<ASroute src="AS_sgraphene1" dst="AS_sgraphene2"
  gw_src="graphene-AS_sgraphene1_router.nancy.grid5000.fr"
  gw_dst="graphene-AS_sgraphene2_router.nancy.grid5000.fr"
  symmetrical="NO">
  <link_ctn id="switch-backbone1" direction="UP"/>
  <link_ctn id="explicit-limiter1"/> <link_ctn id="explicit-limiter2"/>
  <link_ctn id="switch-backbone2" direction="DOWN"/>
28 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Simulators
Trace Replay Tool
Based on SMPI
Complete MPI implementation
Better handling of eager-mode
Factoring the efforts
Implementation of the send action
static void action_send (const char *const *action){
int to = atoi(action[2]);
double size = parse_double(action[3]);
smpi_mpi_send (NULL, size, MPI_BYTE, to, 0, MPI_COMM_WORLD);}
A user has to execute the smpi_replay tool
int main(int argc, char *argv[]){
smpi_replay_init(&argc, &argv);
smpi_action_trace_run();
smpi_replay_finalize();
return 0;}
smpirun is used to execute the simulator
smpirun -np 8 -hostfile hostfile -platform platform.xml \
./smpi_replay trace_description
29 / 37
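The action_send implementation above relies on a parse_double helper that is not shown on the slide; a plausible stand-alone definition (an assumption, not necessarily the one used in smpi_replay) is:
#include <stdio.h>
#include <stdlib.h>

/* Convert a trace field such as "1e6" into a double, aborting on
 * malformed input so that a corrupted trace is detected early. */
static double parse_double(const char *str)
{
    char *end = NULL;
    double value = strtod(str, &end);
    if (end == str) {
        fprintf(stderr, "invalid numeric field in trace: '%s'\n", str);
        exit(EXIT_FAILURE);
    }
    return value;
}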
Introduction Acquisition Process Replay Conclusions and Perspectives
Simulation Accuracy
Simulation Accuracy - Latest Framework
[Plots of real execution time vs. simulated time as a function of the number of processes:
EP: Classes B, C, and D, 8 to 1024 processes
DT: 11 to 171 processes
LU: Classes B and C, 8 to 1024 processes
CG: Classes B and C, 8 to 1024 processes]
30 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Addressing Issues
Source of inaccuracy for CG
Two-second Gantt chart of the real execution of a Class B instance of CG, for 32 processes (left) and 128 processes (right)
Massive switch packet drops lead to 0.2s TCP timeouts for 128 processes
31 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Addressing Issues
Source of inaccuracy for LU
[Plots: relative error on time (in %) vs. number of processes (8–128) for LU, Classes B and C; left plot on a ±15% scale, right plot on a ±4% scale]
32 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Simulation Time
[Plot: simulation time (in seconds) vs. number of processes (8–128) for LU, Classes B and C]
8.6 seconds for one million actions up to 64 processes
13 seconds for one million actions on 128 processes
Almost 8 days would be needed to replay 1.45 TB of LU-E with
16k processes
33 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Simulation Time
Outline of the Talk
1 Introduction
2 Acquisition Process
Instrumentation
Execution
Post Processing
Evaluation of the Acquisition Framework
3 Replay
Calibration
Network model
Simulators
Simulation Accuracy
Addressing Issues
Simulation Time
4 Conclusions and Perspectives
34 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Conclusions
The Time-Independent Trace Replay framework was presented by detailing and evaluating its two main parts: acquisition and replay
The acquisition procedure is decoupled from the replay, so the acquired Time-Independent traces can be simulated under various scenarios
We can use the same Time-Independent traces with future SimGrid network models, e.g., InfiniBand
We implemented a profiling tool respecting our requirements which is more efficient than the other available ones
35 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Perspectives
Short term
More MPI calls can be supported
Framework for automatic acquisition of the traces
Long term
Simulation of real applications
Simulation of larger platforms
Decrease simulation time and trace sizes
Investigate our framework model with regard to multicore
processors
Difficult
A new computation model could be introduced
36 / 37
Introduction Acquisition Process Replay Conclusions and Perspectives
Publications
Simulation of MPI Applications with Time-Independent Traces, Concurrency and
Computation: Practice and Experience, under revision
Toward Better Simulation of MPI Applications on Ethernet/TCP Networks,
Proceedings of the 4th International Workshop on Performance Modeling,
Benchmarking and Simulation of High Performance Computer Systems (PMBS),
2013
Improving the Accuracy and Efficiency of Time-Independent Trace Replay,
Proceedings of the 3rd International Workshop on Performance Modeling,
Benchmarking and Simulation of High Performance Computer Systems (PMBS),
2012
Assessing the Performance of MPI Applications Through Time-Independent Trace
Replay, Proceedings of the 2nd International Workshop on Parallel Software
Tools and Tool Infrastructures (PSTI), 2011
Evaluation of Profiling Tools for the Acquisition of Time Independent Traces,
Technical Report, RT-0437, INRIA
Time-Independent Trace Acquisition Framework – A Grid’5000 How-to, Technical
Report, RT-0407, INRIA
37 / 37

Más contenido relacionado

La actualidad más candente

Fundamentals of the Analysis of Algorithm Efficiency
Fundamentals of the Analysis of Algorithm EfficiencyFundamentals of the Analysis of Algorithm Efficiency
Fundamentals of the Analysis of Algorithm EfficiencySaranya Natarajan
 
الإعتيان والتبديل التمثيلي الرقمي
الإعتيان والتبديل التمثيلي الرقميالإعتيان والتبديل التمثيلي الرقمي
الإعتيان والتبديل التمثيلي الرقميDr. Munthear Alqaderi
 
Performance analysis(Time & Space Complexity)
Performance analysis(Time & Space Complexity)Performance analysis(Time & Space Complexity)
Performance analysis(Time & Space Complexity)swapnac12
 
Analysis of algorithn class 2
Analysis of algorithn class 2Analysis of algorithn class 2
Analysis of algorithn class 2Kumar
 
High Performance Computer Architecture
High Performance Computer ArchitectureHigh Performance Computer Architecture
High Performance Computer ArchitectureSubhasis Dash
 
Unit7 & 8 Performance and optimization
Unit7 & 8 Performance and optimization Unit7 & 8 Performance and optimization
Unit7 & 8 Performance and optimization leenachandra
 
Energy-Efficient Time Synchronization Achieving Nanosecond Accuracy in Wirele...
Energy-Efficient Time Synchronization Achieving Nanosecond Accuracy in Wirele...Energy-Efficient Time Synchronization Achieving Nanosecond Accuracy in Wirele...
Energy-Efficient Time Synchronization Achieving Nanosecond Accuracy in Wirele...Xi'an Jiaotong-Liverpool University
 
A calculus of mobile Real-Time processes
A calculus of mobile Real-Time processesA calculus of mobile Real-Time processes
A calculus of mobile Real-Time processesPolytechnique Montréal
 
09 accelerators
09 accelerators09 accelerators
09 acceleratorsMurali M
 
digital signal-processing-lab-manual
digital signal-processing-lab-manualdigital signal-processing-lab-manual
digital signal-processing-lab-manualAhmed Alshomi
 
Swift profiling middleware and tools
Swift profiling middleware and toolsSwift profiling middleware and tools
Swift profiling middleware and toolszhang hua
 

La actualidad más candente (13)

Fundamentals of the Analysis of Algorithm Efficiency
Fundamentals of the Analysis of Algorithm EfficiencyFundamentals of the Analysis of Algorithm Efficiency
Fundamentals of the Analysis of Algorithm Efficiency
 
الإعتيان والتبديل التمثيلي الرقمي
الإعتيان والتبديل التمثيلي الرقميالإعتيان والتبديل التمثيلي الرقمي
الإعتيان والتبديل التمثيلي الرقمي
 
Performance analysis(Time & Space Complexity)
Performance analysis(Time & Space Complexity)Performance analysis(Time & Space Complexity)
Performance analysis(Time & Space Complexity)
 
Analysis of algorithn class 2
Analysis of algorithn class 2Analysis of algorithn class 2
Analysis of algorithn class 2
 
High Performance Computer Architecture
High Performance Computer ArchitectureHigh Performance Computer Architecture
High Performance Computer Architecture
 
Unit7 & 8 Performance and optimization
Unit7 & 8 Performance and optimization Unit7 & 8 Performance and optimization
Unit7 & 8 Performance and optimization
 
L1
L1L1
L1
 
Energy-Efficient Time Synchronization Achieving Nanosecond Accuracy in Wirele...
Energy-Efficient Time Synchronization Achieving Nanosecond Accuracy in Wirele...Energy-Efficient Time Synchronization Achieving Nanosecond Accuracy in Wirele...
Energy-Efficient Time Synchronization Achieving Nanosecond Accuracy in Wirele...
 
A calculus of mobile Real-Time processes
A calculus of mobile Real-Time processesA calculus of mobile Real-Time processes
A calculus of mobile Real-Time processes
 
09 accelerators
09 accelerators09 accelerators
09 accelerators
 
digital signal-processing-lab-manual
digital signal-processing-lab-manualdigital signal-processing-lab-manual
digital signal-processing-lab-manual
 
Swift profiling middleware and tools
Swift profiling middleware and toolsSwift profiling middleware and tools
Swift profiling middleware and tools
 
Core pipelining
Core pipelining Core pipelining
Core pipelining
 

Destacado

Destacado (11)

What If - Celebrating My Life Plan EOL 2013
What If - Celebrating My Life Plan EOL  2013What If - Celebrating My Life Plan EOL  2013
What If - Celebrating My Life Plan EOL 2013
 
Training Certificates
Training CertificatesTraining Certificates
Training Certificates
 
M. Tech. Chemical Engineer Sagar
M. Tech. Chemical Engineer SagarM. Tech. Chemical Engineer Sagar
M. Tech. Chemical Engineer Sagar
 
Proyecto creativo con T.I.C
Proyecto creativo con T.I.CProyecto creativo con T.I.C
Proyecto creativo con T.I.C
 
Trabajo de informatica
Trabajo de informaticaTrabajo de informatica
Trabajo de informatica
 
SKMBT_C224e16020108580
SKMBT_C224e16020108580SKMBT_C224e16020108580
SKMBT_C224e16020108580
 
All About Me 2015
All About Me 2015 All About Me 2015
All About Me 2015
 
HOME FOR RENT COL SAN BENITO SAN SALVADOR EL SALVADOR CORPORATE - EMBASSY HO...
HOME FOR RENT  COL SAN BENITO SAN SALVADOR EL SALVADOR CORPORATE - EMBASSY HO...HOME FOR RENT  COL SAN BENITO SAN SALVADOR EL SALVADOR CORPORATE - EMBASSY HO...
HOME FOR RENT COL SAN BENITO SAN SALVADOR EL SALVADOR CORPORATE - EMBASSY HO...
 
Sangeetha final
Sangeetha finalSangeetha final
Sangeetha final
 
Collaborative coaching powerpoint
Collaborative coaching powerpointCollaborative coaching powerpoint
Collaborative coaching powerpoint
 
Guide des hébergements 2011
Guide des hébergements 2011Guide des hébergements 2011
Guide des hébergements 2011
 

Similar a markomanolis_phd_defense

Global Interpreter Lock: Episode I - Break the Seal
Global Interpreter Lock: Episode I - Break the SealGlobal Interpreter Lock: Episode I - Break the Seal
Global Interpreter Lock: Episode I - Break the SealTzung-Bi Shih
 
Process wind tunnel - A novel capability for data-driven business process imp...
Process wind tunnel - A novel capability for data-driven business process imp...Process wind tunnel - A novel capability for data-driven business process imp...
Process wind tunnel - A novel capability for data-driven business process imp...Sudhendu Rai
 
02 intel v_tune_session_02
02 intel v_tune_session_0202 intel v_tune_session_02
02 intel v_tune_session_02Vivek chan
 
Winter Simulation Conference 2021 - Process Wind Tunnel Talk
Winter Simulation Conference 2021 - Process Wind Tunnel TalkWinter Simulation Conference 2021 - Process Wind Tunnel Talk
Winter Simulation Conference 2021 - Process Wind Tunnel TalkSudhendu Rai
 
Computer_Architecture_3rd_Edition_by_Moris_Mano_Ch_09.ppt
Computer_Architecture_3rd_Edition_by_Moris_Mano_Ch_09.pptComputer_Architecture_3rd_Edition_by_Moris_Mano_Ch_09.ppt
Computer_Architecture_3rd_Edition_by_Moris_Mano_Ch_09.pptHermellaGashaw
 
Unit 3-pipelining &amp; vector processing
Unit 3-pipelining &amp; vector processingUnit 3-pipelining &amp; vector processing
Unit 3-pipelining &amp; vector processingvishal choudhary
 
Modernization Lessons Learned -Part 2
Modernization Lessons Learned -Part 2Modernization Lessons Learned -Part 2
Modernization Lessons Learned -Part 2Emerson Exchange
 
CS304PC:Computer Organization and Architecture Session 33 demo 1 ppt.pdf
CS304PC:Computer Organization and Architecture  Session 33 demo 1 ppt.pdfCS304PC:Computer Organization and Architecture  Session 33 demo 1 ppt.pdf
CS304PC:Computer Organization and Architecture Session 33 demo 1 ppt.pdfAsst.prof M.Gokilavani
 
Parallel Processing Techniques Pipelining
Parallel Processing Techniques PipeliningParallel Processing Techniques Pipelining
Parallel Processing Techniques PipeliningRNShukla7
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-optJeff Larkin
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesMarina Kolpakova
 
Monitor your Java application with Prometheus Stack
Monitor your Java application with Prometheus StackMonitor your Java application with Prometheus Stack
Monitor your Java application with Prometheus StackWojciech Barczyński
 
Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)University of Washington
 
Analyze phase lean six sigma tollgate template
Analyze phase   lean six sigma tollgate templateAnalyze phase   lean six sigma tollgate template
Analyze phase lean six sigma tollgate templateSteven Bonacorsi
 
Analyze Phase Lean Six Sigma Tollgate Templates
Analyze Phase Lean Six Sigma Tollgate TemplatesAnalyze Phase Lean Six Sigma Tollgate Templates
Analyze Phase Lean Six Sigma Tollgate TemplatesSteven Bonacorsi
 
BP206 - Let's Give Your LotusScript a Tune-Up
BP206 - Let's Give Your LotusScript a Tune-Up BP206 - Let's Give Your LotusScript a Tune-Up
BP206 - Let's Give Your LotusScript a Tune-Up Craig Schumann
 
pipeline in computer architecture design
pipeline in computer architecture  designpipeline in computer architecture  design
pipeline in computer architecture designssuser87fa0c1
 
[Webinar] Camunda Optimize Release 3.0
[Webinar] Camunda Optimize Release 3.0[Webinar] Camunda Optimize Release 3.0
[Webinar] Camunda Optimize Release 3.0camunda services GmbH
 

Similar a markomanolis_phd_defense (20)

Global Interpreter Lock: Episode I - Break the Seal
Global Interpreter Lock: Episode I - Break the SealGlobal Interpreter Lock: Episode I - Break the Seal
Global Interpreter Lock: Episode I - Break the Seal
 
Process wind tunnel - A novel capability for data-driven business process imp...
Process wind tunnel - A novel capability for data-driven business process imp...Process wind tunnel - A novel capability for data-driven business process imp...
Process wind tunnel - A novel capability for data-driven business process imp...
 
BIRTE-13-Kawashima
BIRTE-13-KawashimaBIRTE-13-Kawashima
BIRTE-13-Kawashima
 
Code Tuning
Code TuningCode Tuning
Code Tuning
 
02 intel v_tune_session_02
02 intel v_tune_session_0202 intel v_tune_session_02
02 intel v_tune_session_02
 
Winter Simulation Conference 2021 - Process Wind Tunnel Talk
Winter Simulation Conference 2021 - Process Wind Tunnel TalkWinter Simulation Conference 2021 - Process Wind Tunnel Talk
Winter Simulation Conference 2021 - Process Wind Tunnel Talk
 
Computer_Architecture_3rd_Edition_by_Moris_Mano_Ch_09.ppt
Computer_Architecture_3rd_Edition_by_Moris_Mano_Ch_09.pptComputer_Architecture_3rd_Edition_by_Moris_Mano_Ch_09.ppt
Computer_Architecture_3rd_Edition_by_Moris_Mano_Ch_09.ppt
 
Unit 3-pipelining &amp; vector processing
Unit 3-pipelining &amp; vector processingUnit 3-pipelining &amp; vector processing
Unit 3-pipelining &amp; vector processing
 
Modernization Lessons Learned -Part 2
Modernization Lessons Learned -Part 2Modernization Lessons Learned -Part 2
Modernization Lessons Learned -Part 2
 
CS304PC:Computer Organization and Architecture Session 33 demo 1 ppt.pdf
CS304PC:Computer Organization and Architecture  Session 33 demo 1 ppt.pdfCS304PC:Computer Organization and Architecture  Session 33 demo 1 ppt.pdf
CS304PC:Computer Organization and Architecture Session 33 demo 1 ppt.pdf
 
Parallel Processing Techniques Pipelining
Parallel Processing Techniques PipeliningParallel Processing Techniques Pipelining
Parallel Processing Techniques Pipelining
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-opt
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
 
Monitor your Java application with Prometheus Stack
Monitor your Java application with Prometheus StackMonitor your Java application with Prometheus Stack
Monitor your Java application with Prometheus Stack
 
Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)
 
Analyze phase lean six sigma tollgate template
Analyze phase   lean six sigma tollgate templateAnalyze phase   lean six sigma tollgate template
Analyze phase lean six sigma tollgate template
 
Analyze Phase Lean Six Sigma Tollgate Templates
Analyze Phase Lean Six Sigma Tollgate TemplatesAnalyze Phase Lean Six Sigma Tollgate Templates
Analyze Phase Lean Six Sigma Tollgate Templates
 
BP206 - Let's Give Your LotusScript a Tune-Up
BP206 - Let's Give Your LotusScript a Tune-Up BP206 - Let's Give Your LotusScript a Tune-Up
BP206 - Let's Give Your LotusScript a Tune-Up
 
pipeline in computer architecture design
pipeline in computer architecture  designpipeline in computer architecture  design
pipeline in computer architecture design
 
[Webinar] Camunda Optimize Release 3.0
[Webinar] Camunda Optimize Release 3.0[Webinar] Camunda Optimize Release 3.0
[Webinar] Camunda Optimize Release 3.0
 

Más de George Markomanolis

Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerGeorge Markomanolis
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapGeorge Markomanolis
 
Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer George Markomanolis
 
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PAnalyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PGeorge Markomanolis
 
Introduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IIntroduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IGeorge Markomanolis
 
Performance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIPerformance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIGeorge Markomanolis
 
Performance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IPerformance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IGeorge Markomanolis
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIGeorge Markomanolis
 
How to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerHow to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerGeorge Markomanolis
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaGeorge Markomanolis
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...George Markomanolis
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIGeorge Markomanolis
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIGeorge Markomanolis
 

Más de George Markomanolis (18)

Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI Supercomputer
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
 
Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer
 
Getting started with AMD GPUs
Getting started with AMD GPUsGetting started with AMD GPUs
Getting started with AMD GPUs
 
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PAnalyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
 
Introduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IIntroduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part I
 
Performance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIPerformance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part II
 
Performance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IPerformance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part I
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part II
 
How to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerHow to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit Supercomputer
 
Introducing IO-500 benchmark
Introducing IO-500 benchmarkIntroducing IO-500 benchmark
Introducing IO-500 benchmark
 
Experience using the IO-500
Experience using the IO-500Experience using the IO-500
Experience using the IO-500
 
Harshad - Handle Darshan Data
Harshad - Handle Darshan DataHarshad - Handle Darshan Data
Harshad - Handle Darshan Data
 
Lustre Best Practices
Lustre Best Practices Lustre Best Practices
Lustre Best Practices
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to Omega
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen II
 

markomanolis_phd_defense

  • 1. Introduction Acquisition Process Replay Conclusions and Perspectives Performance Evaluation and Prediction of Parallel Applications Georgios Markomanolis Avalon Team, INRIA, LIP, Ecole Normale Superiéure de Lyon, France Ph.D. Defense January 20th 2014 Under the supervision of Frédéric Desprez and Frédéric Suter 1 / 37
  • 2. Introduction Acquisition Process Replay Conclusions and Perspectives Dimensioning Through Simulation User and administrator expertise is not enough Decisions can cost a lot of money ⇒ Need for objective indicators by exploring various “what-if” scenarios Simulation has many advantages Less simplistic than theoretical models More reproducible than running on production systems Execution on real platform can be time and money consuming Focus on non adaptive MPI applications Two complementary approaches On-line: execute the application with some simulated parts Off-line: replay an execution trace 2 / 37
  • 3. Introduction Acquisition Process Replay Conclusions and Perspectives Dimensioning Through Simulation User and administrator expertise is not enough Decisions can cost a lot of money ⇒ Need for objective indicators by exploring various “what-if” scenarios Simulation has many advantages Less simplistic than theoretical models More reproducible than running on production systems Execution on real platform can be time and money consuming Focus on non adaptive MPI applications Two complementary approaches On-line: execute the application with some simulated parts Off-line: replay an execution trace 2 / 37
  • 4. Introduction Acquisition Process Replay Conclusions and Perspectives Time-Independent Trace Replay Post-mortem analysis (or off-line simulation) of MPI applications Well covered field Mainly profiling tools Unexpected behaviors and performance bottlenecks detection TAU, Scalasca, Vampir, SCORE-P, . . . Usually based on timed traces Create a tight link between trace to acquisition environment 3 / 37
  • 5. Introduction Acquisition Process Replay Conclusions and Perspectives Time-Independent Trace Replay Post-mortem analysis (or off-line simulation) of MPI applications Well covered field Mainly profiling tools Unexpected behaviors and performance bottlenecks detection TAU, Scalasca, Vampir, SCORE-P, . . . Usually based on timed traces Create a tight link between trace to acquisition environment Proposition: get rid off the timestamps Trace volumes only Numbers of instructions for computations Message sizes for communications Goals Get environment oblivious traces Decouple acquisition from actual replay 3 / 37
  • 6. Introduction Acquisition Process Replay Conclusions and Perspectives Time-Independent Traces for (i=0; i<4; i++){ if (myId == 0){ /* Compute 1M instructions */ MPI_Send(..., (myId+1)); MPI_Recv(...); } else { MPI_Recv(...); /* Compute 1M instructions */ MPI_Send(..., (myId+1)% nproc); } } list of actions performed by each process Action described by id of the process type, e.g., computation or communication volume in instructions or bytes some action specific parameters 0 init 0 compute 1e6 0 send 1 1e6 0 recv 3 0 finalize 1 init 1 recv 0 1 compute 1e6 1 send 2 1e6 1 finalize 2 init 2 recv 1 2 compute 1e6 2 send 3 1e6 2 finalize 3 init 3 recv 2 3 compute 1e6 3 send 0 1e6 3 finalize 4 / 37
  • 7. Introduction Acquisition Process Replay Conclusions and Perspectives Experimental Environments NAS Benchmarks: EP: An embarrassingly parallel kernel. DT: Communication with large messages using quad-trees LU: Solve a synthetic system of nonlinear PDEs CG: Conjugate gradient method MG, FT, IS, BT, SP (not tested) Grid’5000: 24 clusters, 1,169 nodes, 8,080 cores (July 2013) 5 / 37
  • 8. Introduction Acquisition Process Replay Conclusions and Perspectives Contributions An original approach that totally decouples the acquisition of the trace from its replay Several original scenarios that allow for the acquisition of large execution traces Study the state of the art and open source profiling tools A new profiling tool based on our framework requirements A trace replay tool on top of a fast, scalable and validated simulation kernel A complete experimental evaluation of our off-line simulation framework 6 / 37
  • 9. Introduction Acquisition Process Replay Conclusions and Perspectives Outline of the Talk 1 Introduction 2 Acquisition Process Instrumentation Execution Post Processing Evaluation of the Acquisition Framework 3 Replay Calibration Network model Simulators Simulation Accuracy Addressing Issues Simulation Time 4 Conclusions and Perspectives 7 / 37
  • 10. Introduction Acquisition Process Replay Conclusions and Perspectives Trace Acquisition Process for (i=0; i<4; i++){ if (myId == 0){ /* Compute 1M instructions */ MPI_Send(..., (myId+1)); MPI_Recv(...); } else { MPI_Recv(...); /* Compute 1M instructions*/ MPI_Send(..., (myId+1)% nproc); } } 0 init 0 compute 1e6 0 send 1 1e6 0 recv 3 0 finalize 1 init 1 recv 0 1 compute 1e6 1 send 2 1e6 1 finalize 2 init 2 recv 1 2 compute 1e6 2 send 3 1e6 2 finalize 3 init 3 recv 2 3 compute 1e6 3 send 0 1e6 3 finalize 8 / 37
  • 11. Introduction Acquisition Process Replay Conclusions and Perspectives Instrumentation Evaluation of Profiling Tools - Results Profiling Quality of Space and Quality of Total features output Time Overheads Software #criteria 8 #criteria 3 #criteria 8 #criteria 11 PerfBench 2 0 0 5 7 PerfSuite 2 0 0 10 12 MpiP 2 0 0 11 13 IPM 3 0 0 11 14 MPE 4 1 2 10 17 PAPI 4 3 6 11 24 Extrae 7 2 5 11 25 VampirTrace 7 2 5 11 25 MinI 7 3 6 10 26 TAU 8 2 5 11 26 Scalasca 6 2 8 11 27 Score-P 7 2 8 11 28 9 / 37
  • 12. Introduction Acquisition Process Replay Conclusions and Perspectives Instrumentation Choosing an instrumentation method Contenders TAU-full: Selective instrumentation TAU-reduced: Selective instrumentation by instrumenting only MPI calls BEGIN_FILE_EXCLUDE_LIST * END_FILE_EXCLUDE_LIST + -optTauSelectFile=/path/exclude.pdt MinI: Combination of PMPI library with PAPI support Metrics Skew is the discrepancy in instruction count between a run of the instrumented application and a run of uninstrumented application due to the instrumentation code Overhead, the execution time increase due to the execution of the instrumentation code 10 / 37
  • 13. Introduction Acquisition Process Replay Conclusions and Perspectives Instrumentation Instrumentation Skew TAU−full −− Class B TAU−full −− Class C TAU−Reduced −− Class B TAU−Reduced −− Class C Minimal −− Class B Minimal −− Class C 0 5 10 15 20 8 16 32 64 128 Averageskew(in%) Number of processes We call “original“ the version with two PAPI calls inserted at the beginning/end of the LU computation TAU-full leads to instrumentation skew from 3.66% to 21.62% MinI achieves instrumentation skew less than 5% 11 / 37
  • 14. Introduction Acquisition Process Replay Conclusions and Perspectives Instrumentation Instrumentation Overhead TAU−Reduced −− Class B TAU−Reduced −− Class C Minimal −− Class B Minimal −− Class C 0 5 10 15 20 25 30 35 40 8 16 32 64 128 Instrumentationoverhead(in%) Number of processes On average MinI has 1.6 times less instrumentation overhead than TAU-Reduced For Mini the instrumentation overhead is up to 23.5% MinI produces directly Time-Independent trace files 12 / 37
  • 15. Introduction Acquisition Process Replay Conclusions and Perspectives Execution Execution Four different acquisition modes Regular: one process per CPU Limited scalability Folded: more than one process per CPU Acquisition of traces for larger instances Limited by the available memory Composite: CPUs don’t necessarily belong to one cluster Many nodes available Composite and Folded: combination of the previous modes Site 1 Site2 13 / 37
  • 16. Introduction Acquisition Process Replay Conclusions and Perspectives Execution Execution Acquisition mode R F-2 F-16 C-2 CF-(2,4) Number of nodes 64 32 4 (32,32) (8,8) LU Execution Time (in sec.) 11.52 24.45 148.95 23.8 72.14 Ratio to regular mode 1 2.12 12.92 2.09 6.26 Linear increase with folded factor 16 processes per CPU ⇒ 13 times bigger execution time Increase the number of the sites ⇒ bigger overhead A trace tool produces traces with erroneous timestamps All the traces are identical with variations less than 1% Acquisition and replay are totally decoupled 14 / 37
  • 17. Introduction Acquisition Process Replay Conclusions and Perspectives Post Processing Trace Gathering The replay tool requires for the traces to be located on the same hard disk K-nomial tree reduction log(K+1) N steps, where N is the total number of files, and K is the arity of the tree For benchmark LU, classes B, C and 64 nodes, 2 - 12.58 times faster than Kaget tool. 15 / 37
  • 18. Introduction Acquisition Process Replay Conclusions and Perspectives Post Processing Analysis of Trace sizes #Processes Trace size (in MB) for LU benchmark TAU-full TAU-reduced MinI Class B Class C Class B Class C Class B Class C 8 334 531 188 298 29.6 48 16 741 1,200 450 714 72 116.8 32 1,600 2,500 973 1,600 159 255 64 3,200 5,100 2,100 3,300 339 550 128 6,600 11,000 4,300 6,800 711 1,200 TAU_Full >> TAU_Reduced >> MinI More information → essential information Size related to number of actions ∼ 15 characters/action, depend on the type of action 16 / 37
  • 19. Introduction Acquisition Process Replay Conclusions and Perspectives Evaluation of the Acquisition Framework Distribution of the acquisition time 0 20 40 60 80 100 120 140 160 8 16 32 64 128 8 16 32 64 128 Time(inseconds) CB Application Tracing overhead Gathering LU Time to gather the traces up to 62.02% of the acquisition time Horizontal line indicates the gather time of compressed files Tracing overhead between 1.75% and 10.55% 17 / 37
  • 20. Introduction Acquisition Process Replay Conclusions and Perspectives Evaluation of the Acquisition Framework Extreme folding Time in minutes / Memory footprint in GiB Instance TAU Scalasca Score-P Minimal Reduced Instrumentation B - 256 2.58 / 11 2.1 / 2.8 1.75 / 4 1.9 / 1.65 C - 1024 N/A 16.3 / 12.9 26.2 / 31 12.9 / 7.95 D - 256 81.8 / 40 55.2 / 16.9 72.16 / 32 47.4 / 15.4 Class C Class D Class E 10 100 1000 10000 1024 2048 4096 8192 16384 10 100 1000 10000 Acquisitiontime(inseconds) Memoryfootprint(inGiB) Number of processes TAU demands a lot of memory Scalasca is efficient but does not provide the exact Time-Independent trace format Score-P is getting improved MinI tool operates as expected according to our requirements 18 / 37
  • 21. Introduction Acquisition Process Replay Conclusions and Perspectives Evaluation of the Acquisition Framework Large Scale Experiment: LU - E - 16k Folded StRemi cluster, 40 nodes, 960 cores, 48GB memory per node More than 400 MPI processes per node Execution time 3.5 hours, 1 TB of memory Composite and Folded 778 nodes, 18 clusters, 9 geographically distant sites Folded factor based on the memory node 1.45 TB Time-Independent traces Less than 1.5 hour to execute the instrumented application (53 minutes ) and gather the compressed trace files (16 minutes) 19 / 37
  • 22. Introduction Acquisition Process Replay Conclusions and Perspectives Evaluation of the Acquisition Framework Outline of the Talk 1 Introduction 2 Acquisition Process Instrumentation Execution Post Processing Evaluation of the Acquisition Framework 3 Replay Calibration Network model Simulators Simulation Accuracy Addressing Issues Simulation Time 4 Conclusions and Perspectives 20 / 37
  • 23. Introduction Acquisition Process Replay Conclusions and Perspectives Trace Simulated Replay Simulation Kernel Platform Topology Application Deployment Simulated Execution Time Time−Independent Trace(s) power="1E9" bw="1.25E8" lat="15E−6" <cluster id="cluster" prefix="c−" suffix=".me" radical="0−3" </platform> <platform version="3"> <?xml version=’1.0’?> bb_bw="1.25E9" bblat="15e−6"/> <!DOCTYPE platform SYSTEM "simgrid.dtd"> <!DOCTYPE platform SYSTEM "simgrid.dtd"> <platform version="3"> <?xml version=’1.0’?> <process host="c−0.me" function="0"/> <process host="c−1.me" function="1"/> <process host="c−2.me" function="2"/> <process host="c−3.me" function="3"/> </platform> Trace ReplayTool 0 compute 1e6 0 send 1 1e6 0 recv 3 1e6 1 recv 0 1e6 1 compute 1e6 1 send 2 1e6 2 recv 1 1e6 2 compute 1e6 2 send 3 1e6 3 recv 2 1e6 3 compute 1e6 3 send 0 1e6 Timed Trace [0.001000] 0 compute 1e6 0.01000 [0.010028] 0 send 1 1e6 0.009028 [0.040113] 0 recv 3 1e6 0.030085 [0.010028] 1 recv 0 1e6 0.010028 ... Gantt Chart Simulated time: 0.0401133 http://paje.sourceforge.net http://simgrid.gforge.inria.fr 21 / 37
  • 24. Introduction Acquisition Process Replay Conclusions and Perspectives SimGrid in One Slide Three user APIs SimDag: heuristics as DAG of (parallel) tasks MSG: heuristics as Concurrent Sequential Processes SMPI: simulate MPI real applications Under the hood SURF: The simulation kernel (full of deeply investigated models) XBT: bundle of useful stuff (data structures, logging, . . . ) 22 / 37
  • 25. Introduction Acquisition Process Replay Conclusions and Perspectives Calibration Computation Calibration and Cache Usage Platform file contains instruction rate of CPUs <cluster id="AS_cluster" prefix="c-" suffix=".me" radical="0-3" power="1E9" bw="1.25E8" lat="15E-6" bb_bw="1.25E9" bb_lat="15E-6"/> Calibration procedure Execute a small instrumented instance of the target application Typically Class A on 4 processes Compute the instruction rate for every event Compute a weighted average of the instruction rates for each process Compute the average instruction rate for all the process set Do it five times A single instruction rate for everything Small instance ⇒ data fit in L2 cache Larger instance ⇒ exceed L2 capacity ⇒ lower rate! We take cache usage into account during calibration 23 / 37
  • 26. Introduction Acquisition Process Replay Conclusions and Perspectives Calibration Impact on XML description 1 ... 2 <cluster id="AS_sgraphene1" 3 prefix="graphene-" suffix=".nancy.grid5000.fr" radical= "0-38" 4 power="3.68E9" bw="1.25E8" lat="15E-6" bb_bw="1.25E9" bb_lat="15E-6"/> 5 <cluster id="AS_sgraphene2" 6 prefix="graphene-" suffix=".nancy.grid5000.fr" radical= "39-73" 7 power="3.68E9" bw="1.25E8" lat="15E-6" bb_bw="1.25E9" bb_lat="15E-6"/> 8 9 <link id="switch-graphene" bandwidth="1.25E9" latency="5E-4"/> 10 11 <ASroute src="AS_sgraphene1" dst="AS_sgraphene2" 12 gw_src="graphene-AS_sgraphene1_router.nancy.grid5000.fr" 13 gw_dst="graphene-AS_sgraphene3_router.nancy.grid5000.fr"> 14 <link_ctn id="switch-graphene"/> 15 </ASroute> 16 ... 24 / 37
  • 27. Introduction Acquisition Process Replay Conclusions and Perspectives Network model The "hybrid" network model of SMPI Three protocols selected by the message size k: Asynchronous (k ≤ Sa), Detached (Sa < k ≤ Sd), Synchronous (k > Sd) [Figure: sender/receiver timelines for the three protocols, and measured duration (seconds) versus message size (bytes) for MPI_Send, MPI_Recv and Ping-Pong, split into Small, Medium1, Medium2, Detached and Large size ranges] The resulting model is piecewise linear and discontinuous 25 / 37
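An illustrative sketch of how a send of k bytes falls into one of the three protocol modes; Sa and Sd correspond to the smpi/async_small_thres and smpi/send_is_detached_thres values shown on the next slide, and the enum and function names are assumptions, not SimGrid source code.
#include <stddef.h>

typedef enum { ASYNCHRONOUS, DETACHED, SYNCHRONOUS } send_mode_t;

/* Select the protocol mode from the message size and the two thresholds. */
static send_mode_t send_mode(size_t k, size_t Sa, size_t Sd) {
    if (k <= Sa) return ASYNCHRONOUS;  /* small: buffered, sender does not block */
    if (k <= Sd) return DETACHED;      /* medium: sender returns once the data is pushed */
    return SYNCHRONOUS;                /* large: rendezvous, sender waits for the receiver */
}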
  • 28. Introduction Acquisition Process Replay Conclusions and Perspectives Network model Impact on XML description
<config id="General">
  <prop id="workstation/model" value="compound"/>
  <prop id="network/model" value="SMPI"/>

  <prop id="smpi/async_small_thres" value="65536"/>
  <prop id="smpi/send_is_detached_thres" value="327680"/>

  <prop id="smpi/os"
        value="0:8.93009e-06:7.654382e-10; 1420:1.396843e-05:2.974094e-10;
               32768:1.540828e-05:2.441040e-10; 65536:0.000238:0; 327680:0:0"/>

  <prop id="smpi/or"
        value="0:8.140255e-06:8.3958e-10; 1420:1.2699e-05:9.092182e-10;
               32768:3.095706e-05:6.956453e-10; 65536:0:0; 327680:0:0"/>

  <prop id="smpi/bw_factor"
        value="0:0.400977; 1420:0.913556; 32768:1.078319; 65536:0.956084;
               327680:0.929868"/>

  <prop id="smpi/lat_factor"
        value="0:1.35489; 1420:3.437250; 32768:5.721647; 65536:11.988532;
               327680:9.650420"/>
</config>
...
26 / 37
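The smpi/os, smpi/or, smpi/bw_factor and smpi/lat_factor properties are piecewise definitions indexed by message size, which is what makes the model piecewise linear and discontinuous. Below is a minimal sketch of how such a table could be evaluated; the segment semantics (threshold:intercept:slope, last matching threshold wins) follows the values above, but the struct and function names are assumptions and the actual combination of these terms into a transfer time happens inside SMPI.
#include <stddef.h>

/* One segment of a piecewise-linear property such as smpi/os:
 * for messages of at least min_size bytes, cost = intercept + slope * size. */
typedef struct { size_t min_size; double intercept; double slope; } segment_t;

/* Values transcribed from the smpi/os property above. */
static const segment_t os_segments[] = {
    {      0, 8.93009e-06,  7.654382e-10 },
    {   1420, 1.396843e-05, 2.974094e-10 },
    {  32768, 1.540828e-05, 2.441040e-10 },
    {  65536, 0.000238,     0.0          },
    { 327680, 0.0,          0.0          },
};

/* Pick the last segment whose threshold does not exceed the message size
 * and evaluate it at that size. */
static double piecewise(const segment_t *seg, int n, size_t size) {
    int best = 0;
    for (int i = 0; i < n; i++)
        if (seg[i].min_size <= size) best = i;
    return seg[best].intercept + seg[best].slope * (double)size;
}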
  • 29. Introduction Acquisition Process Replay Conclusions and Perspectives Network model SMPI - Hybrid Model UP/DOWN links to share the available bandwidth separately in each direction Limiter link shared by all the flows to and from a processor [Figure: graphene topology with node groups 1-39, 40-74, 75-104 and 105-144, 1G node links, 10G and 13G inter-switch links, and the corresponding model with separate Up/Down links, a 1.5G limiter per group and a 1G limiter with Down/Up links per node] 27 / 37
  • 30. Introduction Acquisition Process Replay Conclusions and Perspectives Network model Impact on XML description
<cluster id="AS_sgraphene1" prefix="graphene-"
         suffix=".nancy.grid5000.fr" radical="0-38" power="3.68E9"
         bw="1.25E8" lat="15E-6"
         sharing_policy="FULLDUPLEX" limiter_link="1.875E8"
         loopback_lat="1.5E-9" loopback_bw="6000000000"></cluster>

<link id="switch-backbone1" bandwidth="1162500000" latency="1.5E-6"
      sharing_policy="FULLDUPLEX"/>
<link id="switch-backbone2" bandwidth="1162500000" latency="1.5E-6"
      sharing_policy="FULLDUPLEX"/>

<link id="explicit-limiter1" bandwidth="1511250000" latency="0"
      sharing_policy="SHARED"/>
<link id="explicit-limiter2" bandwidth="1511250000" latency="0"
      sharing_policy="SHARED"/>

<ASroute src="AS_sgraphene1" dst="AS_sgraphene2"
         gw_src="graphene-AS_sgraphene1_router.nancy.grid5000.fr"
         gw_dst="graphene-AS_sgraphene2_router.nancy.grid5000.fr"
         symmetrical="NO">
  <link_ctn id="switch-backbone1" direction="UP"/>
  <link_ctn id="explicit-limiter1"/>
  <link_ctn id="explicit-limiter2"/>
  <link_ctn id="switch-backbone2" direction="DOWN"/>
</ASroute>
28 / 37
  • 31. Introduction Acquisition Process Replay Conclusions and Perspectives Simulators Trace Replay Tool Based on SMPI: complete MPI implementation, better handling of eager mode, factoring the efforts
Implementation of the send action:
static void action_send(const char *const *action) {
    int to = atoi(action[2]);
    double size = parse_double(action[3]);
    smpi_mpi_send(NULL, size, MPI_BYTE, to, 0, MPI_COMM_WORLD);
}
A user has to execute the smpi_replay tool:
int main(int argc, char *argv[]) {
    smpi_replay_init(&argc, &argv);
    smpi_action_trace_run();
    smpi_replay_finalize();
    return 0;
}
smpirun is used to execute the simulator:
smpirun -np 8 -hostfile hostfile -platform platform.xml ./smpi_replay trace_description
29 / 37
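By analogy with action_send above, each other trace action maps to one call into SMPI. A hedged sketch of what a compute handler could look like in the same context (parse_double as above; the exact name of the flop-burning call, smpi_execute_flops here, is an assumption):
/* Sketch of a compute-action handler, assuming the same context as action_send. */
static void action_compute(const char *const *action) {
    double flops = parse_double(action[2]);   /* volume of the compute burst */
    smpi_execute_flops(flops);                /* advance this rank's simulated clock */
}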
  • 32. Introduction Acquisition Process Replay Conclusions and Perspectives Simulation Accuracy Simulation Accuracy - Latest Framework [Figure: execution time versus simulated time (in seconds, log scale) as a function of the number of processes, for EP (Classes B, C and D, 8 to 1024 processes), DT (11 to 171 processes), LU and CG (Classes B and C, 8 to 1024 processes)] 30 / 37
  • 33. Introduction Acquisition Process Replay Conclusions and Perspectives Addressing Issues Source of inaccuracy for CG Two-second Gantt charts of the real execution of a class B instance of CG, for 32 processes (left) and 128 processes (right) Massive switch packet drops lead to 0.2 s TCP timeouts for 128 processes 31 / 37
  • 34. Introduction Acquisition Process Replay Conclusions and Perspectives Addressing Issues Source of inaccuracy for LU [Figure: relative error on time (in %, from -15% to +15%) versus number of processes (8 to 128) for Classes B and C] 32 / 37
  • 35. Introduction Acquisition Process Replay Conclusions and Perspectives Addressing Issues Source of inaccuracy for LU [Figures: relative error on time (in %) versus number of processes (8 to 128) for Classes B and C, with an error range of about ±15% in the left plot and about ±4% in the right plot] 32 / 37
  • 37. Introduction Acquisition Process Replay Conclusions and Perspectives Simulation Time [Figure: simulation time (in seconds, 0 to 1200) versus number of processes (8 to 128) for Classes B and C] 8.6 seconds for one million actions up to 64 processes 13 seconds for one million actions on 128 processes Almost 8 days would be needed to replay 1.45 TB of LU-E with 16k processes 33 / 37
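A rough back-of-the-envelope reading of that last figure, assuming on the order of 30 bytes per trace action (this byte count is an assumption, not a measurement from the experiment): 1.45 TB then corresponds to roughly 5 x 10^10 actions, and at about 13 seconds per million actions this gives around 6.5 x 10^5 seconds, which is indeed close to 8 days.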
  • 38. Introduction Acquisition Process Replay Conclusions and Perspectives Simulation Time Outline of the Talk 1 Introduction 2 Acquisition Process Instrumentation Execution Post Processing Evaluation of the Acquisition Framework 3 Replay Calibration Network model Simulators Simulation Accuracy Addressing Issues Simulation Time 4 Conclusions and Perspectives 34 / 37
  • 39. Introduction Acquisition Process Replay Conclusions and Perspectives Conclusions The Time-Independent Trace Replay Framework was presented by detailing and evaluating its two main parts: acquisition and replay The acquisition procedure is decoupled from the replay, so the acquired Time-Independent traces can be simulated under various scenarios We can reuse the same Time-Independent traces with future SimGrid network models, e.g., InfiniBand We implemented a profiling tool that meets our requirements and is more efficient than the other available ones 35 / 37
  • 40. Introduction Acquisition Process Replay Conclusions and Perspectives Perspectives Short term Support more MPI calls Framework for the automatic acquisition of traces Long term Simulation of real applications Simulation of larger platforms Decrease simulation time and trace sizes Investigate our framework with regard to multicore processors (difficult): a new computation model could be introduced 36 / 37
  • 41. Introduction Acquisition Process Replay Conclusions and Perspectives Publications Simulation of MPI Applications with Time-Independent Traces, Concurrency and Computation: Practice and Experience, under revision Toward Better Simulation of MPI Applications on Ethernet/TCP Networks, Proceedings of the 4th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 2013 Improving the Accuracy and Efficiency of Time-Independent Trace Replay, Proceedings of the 3rd International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 2012 Assessing the Performance of MPI Applications Through Time-Independent Trace Replay, Proceedings of the 2nd International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI), 2011 Evaluation of Profiling Tools for the Acquisition of Time Independent Traces, Technical Report RT-0437, INRIA Time-Independent Trace Acquisition Framework - A Grid'5000 How-to, Technical Report RT-0407, INRIA 37 / 37