Presentation PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Applications Using PPA, by Hui Huang, Zhaoqiang Zheng and Lihua Zhang at the AMD Developer Summit (APU13), November 11-13, 2013
3. Introduction to MCW PPA™ For Cluster
A tracing tool that targets distributed systems.
• Collects instrumented data and hardware measurements in a distributed fashion within a tracing infrastructure.
• Provides visualizations with intuitive graphs and Gantt charts, and generates statistical reports intended for identifying critical paths.
• Performs offline analysis that aids in understanding the target system's behavior and reasoning about performance issues.
• PPA product series
  ‒ PPA For Cluster
  ‒ PPA Workstation Edition
  ‒ PPA For Android
3 | PRESENTATION TITLE | DECEMBER 4, 2013 | CONFIDENTIAL
4. Main Features
• Low overhead
  ‒ Has negligible performance impact on running applications by relying on the PPA runtime library. This is very useful for highly optimized, performance-sensitive cases.
• Instrumentation at the application level
  ‒ The PPA runtime library provides APIs to measure code. The hardware measurement part is transparent to developers, and the PPA instrumentation can easily be removed by turning on a disable option.
  ‒ Auto-instrumentation of binaries will be available soon.
• Scalability
  ‒ The tool can be extended to profile clusters of various scales (currently up to 4000 nodes) and services (e.g. Hadoop). This benefits from PPA's distributed data repositories, big-data processing, buffered views for visualization, etc.
  ‒ PPA Profiler can be extended to support HW vendor-specific features.
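The idea of application-level instrumentation that costs almost nothing when disabled can be sketched as follows. This is a minimal illustration in Python, not the PPA runtime library: the `Tracer` class, its `region` method and the `enabled` flag are hypothetical names chosen for this example.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Hypothetical tracer; illustrates a disable option, not the PPA API."""

    def __init__(self, enabled=True):
        self.enabled = enabled
        self.events = []  # list of (name, start_ns, end_ns)

    @contextmanager
    def region(self, name):
        # When tracing is disabled, skip all timing work and just run the body.
        if not self.enabled:
            yield
            return
        start = time.perf_counter_ns()
        try:
            yield
        finally:
            # Record the region even if the body raised an exception.
            self.events.append((name, start, time.perf_counter_ns()))


tracer = Tracer(enabled=True)
with tracer.region("sort_block"):
    sorted(range(10000, 0, -1))
```

With `enabled=False` the same `with tracer.region(...)` lines stay in the application code but record nothing, which is the kind of behavior the disable option on the slide suggests.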
5. The Highlights
• Profiler and performance analyzer
  ‒ Low overhead (almost no cost if no profiling capture is enabled)
  ‒ CPU & GPU activity traces
  ‒ Hardware utilization measurement
  ‒ HW vendor-specific support
  ‒ Features time-based views and statistical analysis / reports
  ‒ Multi-core profiling at process/thread and source-code level
  ‒ Good data organization in intuitive colour schemes
• Big data support
  ‒ Storage
  ‒ Smooth visualization
• System-wide critical path identification
  ‒ Correlate hardware utilization and CPU events in the same timeline
  ‒ Cluster-wide global clock synchronization
  ‒ Multi-views for sessions from different nodes in the same timeline
  ‒ Runtime monitors
• Customizable for specific applications, e.g. Hadoop
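Placing sessions from different nodes on the same timeline requires their timestamps to share a common clock. The slides do not say how PPA synchronizes clocks, so the sketch below simply assumes a per-node clock offset is already known; `merge_timelines` and its arguments are hypothetical names for this illustration.

```python
def merge_timelines(sessions, offsets_ns):
    """Merge per-node events into one timeline on a shared clock.

    sessions:   {node: [(event_name, start_ns, end_ns), ...]} in local time
    offsets_ns: {node: local_clock_minus_global_clock_ns} (assumed known)
    """
    merged = []
    for node, events in sessions.items():
        offset = offsets_ns.get(node, 0)  # nodes without an entry are assumed in sync
        for name, start_ns, end_ns in events:
            merged.append((start_ns - offset, end_ns - offset, node, name))
    merged.sort()  # order by corrected start time
    return merged


sessions = {
    "node-a": [("map", 100, 200)],
    "node-b": [("map", 1050, 1100)],  # node-b's clock runs 1000 ns ahead
}
timeline = merge_timelines(sessions, {"node-b": 1000})
```

After correction, the event on `node-b` starts at 50 ns on the shared clock and therefore sorts before the event on `node-a`, even though its raw timestamp is larger.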
6. Developer Library Overview
• C/C++ SDK
  ‒ Already used in numerous OpenCL™ applications
• Java support
  ‒ Java bindings for OpenCL™ applications
• Thread-safe
• Low overhead if no capture is running
• Transparent OpenCL instrumentation
  ‒ Timing OpenCL APIs
  ‒ Timing kernels & data transfers: start/submit/queue/complete
  ‒ Visualizing the dependence graph between kernels & data transfers
  ‒ Exclusive sub-kernel support for AMD GFX cards
C/C++: Provides a friendly interface (ppaAPI.h) for the C/C++ developer.
JAVA: Provides a friendly interface (JPPA.jar) for the Java developer.
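The queue/submit/start/complete timing points for kernels and data transfers can be turned into per-phase durations. The sketch below assumes the four timestamps have already been collected (OpenCL exposes comparable values through event profiling); the `CommandTimestamps` type and `phase_durations` function are hypothetical names and are not part of ppaAPI.h or JPPA.jar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommandTimestamps:
    """Timestamps in nanoseconds for one kernel launch or data transfer."""
    queued_ns: int
    submitted_ns: int
    started_ns: int
    completed_ns: int


def phase_durations(ts):
    """Split the lifetime of a command into its phases."""
    return {
        "queue_to_submit_ns": ts.submitted_ns - ts.queued_ns,
        "submit_to_start_ns": ts.started_ns - ts.submitted_ns,
        "execution_ns": ts.completed_ns - ts.started_ns,
        "total_ns": ts.completed_ns - ts.queued_ns,
    }


kernel = CommandTimestamps(queued_ns=1000, submitted_ns=1400,
                           started_ns=2000, completed_ns=7000)
phases = phase_durations(kernel)
```

Here the kernel spends 5000 ns executing out of 6000 ns in total, so 1000 ns goes to queueing and submission. A long `submit_to_start_ns` relative to `execution_ns` would point at launch latency rather than at the kernel itself.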
7. System Overview
• Distributed repositories for trace data
• Distributed post-processing to minimize overhead
• Powerful visualization engine
• Scalability to any scale of cluster system
[Architecture diagram]
Layers: Presentation layer, UI Logic layer, Network layer, Profiler Logic layer, Data layer
Components: Graphics Rendering; Data serialize for Presentation; Communication Framework; Data Transfer; Profiler Control (Start/Stop etc.); Fault-tolerant; Synchronization and heartbeat etc.; Other profiler logic; Data collecting by PPA Profiler; Raw Data Post Processing; Raw Data Repository; Processed Data Repository
8. Getting Started
• Install PPA Clients and PPA Server on the target platforms
  ‒ Deploy PPA Clients by scripts
  ‒ CLI supported for capture
  ‒ The PPA Server generally runs on the master node
• Set up capture options
  ‒ Node IP, communication port…
  ‒ Optionally select nodes to profile
  ‒ Optionally enable CPU Event filters
  ‒ Optionally enable CPU Event merge
  ‒ Hardware measurement is enabled by default
• Collect data and analysis reports
• Operate views
9. Summary View
• Helps find problematic nodes or unbalanced loads.
• Shows the differences between runs.
[Screenshot labels: Multistage Table, Bar Charts]
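One simple way to spot an unbalanced load from per-node summary numbers is to compare each node against the cluster average. The slides do not describe how the Summary View computes this, so the sketch below is only an assumption-based illustration; `find_unbalanced_nodes`, the `tolerance` parameter and the sample figures are all hypothetical.

```python
from statistics import mean


def find_unbalanced_nodes(busy_seconds, tolerance=0.25):
    """Return {node: load_ratio} for nodes far from the cluster average.

    busy_seconds: {node: total busy time in seconds}
    tolerance:    allowed relative deviation from the average (0.25 = 25%)
    """
    if not busy_seconds:
        return {}
    average = mean(busy_seconds.values())
    if average == 0:
        return {}
    return {
        node: value / average
        for node, value in busy_seconds.items()
        if abs(value - average) > tolerance * average
    }


# Made-up numbers: three nodes do equal work, the fourth does much more.
load = {"node-1": 100, "node-2": 100, "node-3": 100, "node-4": 180}
outliers = find_unbalanced_nodes(load)
```

With these sample values the average is 120 seconds, so only `node-4` (ratio 1.5) exceeds the 25% tolerance and is reported.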
10. The Sharp Utility: Timeline View
• Correlate CPU Events to HW performance in analysis
[Screenshot labels: Session and its node list; Monitoring application's behavior; Monitoring hardware behavior; Zoom in/out from hour to ns resolutions]
11. Profiling Data
• CPU Events Level
  ‒ Thread
  ‒ Name
  ‒ Core mitigation
  ‒ Timing
• OpenCL traces
• Hardware counters
  ‒ % CPU usage
  ‒ Memory usage
  ‒ Bytes read/written on disk
  ‒ Bytes in/out on the network
  ‒ Cache hit/miss
• Statistics
  ‒ Process/thread involved
  ‒ # of total CPU Events
  ‒ # of the same CPU Events
  ‒ Min/Max/Average for each
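The statistics listed above (event counts and min/max/average per event) can be derived from raw timed events. The following is a minimal sketch assuming each event is a `(name, start_ns, end_ns)` tuple; that tuple layout and the `event_statistics` function are assumptions for illustration, not the PPA trace format.

```python
from collections import defaultdict


def event_statistics(events):
    """Compute count and min/max/average duration for each event name."""
    durations = defaultdict(list)
    for name, start_ns, end_ns in events:
        durations[name].append(end_ns - start_ns)
    return {
        name: {
            "count": len(values),
            "min_ns": min(values),
            "max_ns": max(values),
            "avg_ns": sum(values) / len(values),
        }
        for name, values in durations.items()
    }


events = [("map", 0, 10), ("map", 10, 40), ("reduce", 5, 25)]
stats = event_statistics(events)
total_events = sum(entry["count"] for entry in stats.values())
```

For the sample input, `map` occurs twice with durations of 10 ns and 30 ns (average 20 ns), `reduce` occurs once, and `total_events` is 3.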
12. Timeline View for CPU Events
• Process-thread-event data
  ‒ Identify the problematic process/thread/event
  ‒ Tell the dependency
  ‒ Tell parent & child
  ‒ Frames analyzer for frame-based program
[Screenshot labels: Expand process, Expand thread]
13. Timeline View for HW measurement
• Aggregate performance data
• Per-core data
[Screenshot callouts: When is the critical throughput on disk? Is there abnormal load on the network? When is the CPU usage very low or high?]
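A question such as "when is the CPU usage very low or high?" can also be answered programmatically from sampled counters. The sketch below assumes samples are `(timestamp, cpu_percent)` pairs sorted by time; the thresholds and the `out_of_range_intervals` function are hypothetical and only illustrate the idea.

```python
def out_of_range_intervals(samples, low=10.0, high=90.0):
    """Group consecutive samples below `low` or above `high` into intervals.

    Returns a list of (label, first_timestamp, last_timestamp) tuples,
    where label is "low" or "high".
    """
    intervals = []
    current = None  # [label, first_timestamp, last_timestamp] or None
    for timestamp, percent in samples:
        if percent < low:
            label = "low"
        elif percent > high:
            label = "high"
        else:
            label = None
        if current and current[0] == label:
            current[2] = timestamp  # extend the running interval
            continue
        if current:
            intervals.append(tuple(current))  # close the previous interval
        current = [label, timestamp, timestamp] if label else None
    if current:
        intervals.append(tuple(current))
    return intervals


samples = [(0, 50), (1, 5), (2, 3), (3, 50), (4, 95), (5, 99), (6, 92)]
intervals = out_of_range_intervals(samples)
```

For these made-up samples the result is one low interval covering timestamps 1 to 2 and one high interval covering timestamps 4 to 6.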
14. Where Multi-views Help Optimization
• Identify node's abnormal behavior
• Difference/relations between nodes
• Job scheduler matters
15. Hadoop with PPA on AWS as Demo
• Overview of the tracing infrastructure
16. Setup AWS EC2 instance
• 16 Hadoop nodes (dual-core nodes with 7.5 GB memory)
• 4 GB Hadoop Terasort workload
• > 1.2 GB PPA trace per node
17. Run Hadoop jobs
• Start the capture
• Jobs are done by map & reduce
18. Remote control by VNC viewer
• Intended for multiple users on AWS
• Experience and operate PPA from different connection points