1. PERFORMANCE ANALYSIS OF HIGH
PERFORMANCE COMPUTING APPLICATIONS ON
THE AMAZON WEB SERVICES CLOUD
Keith R. Jackson, Lavanya Ramakrishnan, Krishna Muriki, Shane
Canon, Shreyas Cholia, Harvey J. Wasserman, Nicholas J. Wright
Lawrence Berkeley National Lab
Presentation by Abhishek Gupta, CS 598 Cloud Computing
2. GOALS
Examine the performance of existing cloud computing
infrastructures and create a mechanism for their
quantitative evaluation
Build upon previous studies by using the NERSC
benchmarking framework to evaluate the performance
of real scientific workloads on EC2
Under the DOE Magellan project: evaluate the ability of
cloud computing to meet DOE’s computing needs
3. CONTRIBUTIONS
Broadest evaluation to date of application performance on
virtualized cloud computing platforms
Experiences with running on Amazon EC2 and the
performance and availability variations encountered.
Analysis of the impact of virtualization based on the
communication characteristics of the application
Impact of virtualization through a simple, well-documented
aggregate measure that expresses the useful potential of
the systems considered
4. METHODS - MACHINES
Carver:
Quad-core, dual-socket Linux / Nehalem / QDR IB
cluster
Medium-sized cluster for jobs scaling to hundreds of
processors; 3,200 total cores
Franklin:
Cray XT4
Linux environment / Quad-core, AMD Opteron / Seastar
interconnect, Lustre parallel filesystem
Integrated HPC system for jobs scaling to tens of
thousands of processors; 38,640 total cores
5. METHODS - MACHINES
Lawrencium
Quad-core, dual-socket Linux / Harpertown / DDR IB
cluster
Designed for jobs scaling to tens to hundreds of
processors; 1,584 total cores
Amazon EC2
m1.large instance type: four EC2 Compute Units, two
virtual cores with two EC2 Compute Units each, and 7.5
GB of memory
Heterogeneous processor types
7. METHODS – APPLICATIONS AND BENCHMARKS
USED
High Performance Computing Challenge (HPCC)
benchmark suite
Consists of seven synthetic benchmarks
Targeted synthetics: DGEMM, STREAM, and two measures
of network latency and bandwidth (see the ping-pong
sketch after this slide).
Complex synthetics: HPL, FFTE, PTRANS, and
RandomAccess.
NERSC 6 Benchmarks
Set of applications representative of the NERSC workload
Covers the science domains, parallelization schemes, and
concurrencies, as well as machine-based characteristics that
influence performance, such as message size, memory
access pattern, and working set sizes
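For concreteness, here is a minimal MPI ping-pong sketch in the spirit of the HPCC latency test. It is illustrative only: HPCC's actual harness also measures bandwidth with large messages and uses randomly-ordered ring tests.

/* Minimal MPI ping-pong latency sketch (compile with mpicc,
   run with mpirun -np 2). Illustrative, not the HPCC code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, reps = 1000;
    char buf[8];                       /* 8-byte messages probe latency */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)                     /* one-way latency = round trip / 2 */
        printf("latency: %.2f us\n",
               1e6 * (MPI_Wtime() - t0) / (2.0 * reps));
    MPI_Finalize();
    return 0;
}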
8. METHODS – NERSC APPLICATIONS
CAM: The Community Atmospheric Model
Lower computational intensity
Large point-to-point & collective MPI messages
GAMESS: General Atomic and Molecular
Electronic Structure System
Memory access
No collectives, very little communication
GTC: Gyrokinetic Turbulence Code
High computational intensity
Bandwidth-bound nearest-neighbor communication plus
collectives with small data payload
9. METHODS – NERSC APPLICATIONS
IMPACT-T: Integrated Map and Particle Accelerator Tracking
Time
Memory bandwidth & moderate computational intensity
Collective performance with small to moderate message sizes
MAESTRO: A Low Mach Number Stellar Hydrodynamics
Code
Low computational intensity
Irregular communication patterns
MILC: Lattice Quantum Chromodynamics (QCD)
High computational intensity
Global communication with small messages
10. METHODS – NERSC APPLICATIONS
PARATEC: PARAllel Total Energy Code
Global (all-to-all) communication with small messages,
from the data transposes in its 3-D FFTs
11. RESULTS: APPLICATION PERFORMANCE
Franklin and Lawrencium are 1.4× to 2.6× slower than
Carver.
EC2
• Best case (GAMESS): EC2 is only 2.7× slower than Carver.
• Worst case (PARATEC): EC2 is more than 50× slower than Carver.
• Large performance spread caused by the different demands
applications place on the network.
o More detailed analysis required
12. RESULTS: PERFORMANCE ANALYSIS USING IPM
Integrated Performance Monitoring (IPM) framework
• Uses the MPI profiling interface
• Examines the relative amounts of time an application spends
computing vs. communicating, and the types of MPI calls made
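A minimal sketch of how an IPM-style tool uses the MPI profiling interface: the MPI standard exports every MPI_* entry point under a PMPI_* name as well, so a profiler can interpose its own MPI_Send and still call the real implementation. The single-call scope and names here are illustrative, not IPM's actual code.

/* Sketch of interposition via the PMPI profiling interface.
   Link this into an MPI program; IPM wraps many more calls. */
#include <mpi.h>
#include <stdio.h>

static double comm_time = 0.0;         /* time accumulated inside MPI_Send */

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);
    comm_time += MPI_Wtime() - t0;     /* charge the call to comm time */
    return rc;
}

int MPI_Finalize(void)
{
    printf("time in MPI_Send: %.3f s\n", comm_time);
    return PMPI_Finalize();
}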
13. RESULTS: SUSTAINED SYSTEM PERFORMANCE
SSP: aggregate measure of the workload-specific,
delivered performance of a computing system
For each code, measure:
• FLOP counts on a reference system
• Wall clock run time on various systems
• N chosen to be 3,200
Problem sets drastically reduced
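As a sketch of the arithmetic, following the usual NERSC SSP formulation (symbols are ours, not the slide's):

p_i = \frac{F_i}{N_i \, t_i}, \qquad
\mathrm{SSP} = N \left( \prod_{i=1}^{M} p_i \right)^{1/M}

Here F_i is code i's flop count measured on the reference system, N_i and t_i are the core count and wall-clock time of its run on the system under evaluation, M is the number of codes, and N = 3,200. The geometric mean keeps one fast code from masking slow ones.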
14. RESULTS: VARIABILITY
Performance Variability across runs
• Non-homogeneous nature of the systems allocated
• Network sharing and contention
• Sharing the un-virtualized hardware
16. CONCLUSIONS
EC2 performance degrades significantly as
applications spend more time communicating
Applications with global (all-to-all) communication
perform worse than those that mostly use point-to-point
communication.
The amount of variability in EC2 performance can be
significant.
17. DISCUSSION QUESTIONS
This paper focused on performance alone. What
are the performance vs. cost tradeoffs for the
different platforms?
How does the above tradeoff differ with application
characteristics such as granularity, communication
sensitivity, etc.?
What is the primary source of performance
variability on Amazon EC2?
Editor's Notes
It has quad-core Intel Nehalem processors running at 2.67 GHz, with dual-socket nodes and a single Quad Data Rate (QDR) IB link per node to a network that is locally a fat-tree with a global 2D mesh. Each XT4 compute node contains a single quad-core 2.3 GHz AMD Opteron "Budapest" processor, which is tightly integrated to the XT4 interconnect via a Cray SeaStar-2 ASIC through a 6.4 GB/s bidirectional HyperTransport interface.
Each compute node is a Dell PowerEdge 1950 server equipped with two Intel Xeon quad-core 64-bit 2.66 GHz Harpertown processors, connected to a Dual Data Rate (DDR) InfiniBand network configured as a fat tree. Amazon EC2 is a virtual computing environment that provides a web services API for launching and managing virtual machine instances. Amazon provides a number of different instance types that have varying performance characteristics. CPU capacity is defined in terms of an abstract Amazon EC2 Compute Unit; one EC2 Compute Unit is approximately equivalent to a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. For our tests we used the m1.large instance type, which has four EC2 Compute Units (two virtual cores with two EC2 Compute Units each) and 7.5 GB of memory. The nodes are connected with gigabit Ethernet.
There are major differences between the Amazon Web Services environment and that at a typical supercomputing center. For example, almost all HPC applications assume the presence of a shared parallel filesystem between compute nodes and a head node that can submit MPI jobs to all of the worker nodes. In our EC2 setup, the head node could submit MPI jobs to all of the worker nodes, and the file server provided a shared filesystem between the nodes.
Targeted: these are microkernels which quantify basic system parameters that separately characterize computation and communication performance. Complex: proxy apps.
P2P vs. all-to-all; communication vs. computation vs. memory; small messages vs. large messages.
The DGEMM results are as one would expect based on the properties of the CPUs. The STREAM results show that EC2 is significantly faster for this benchmark than Lawrencium. We believe this is because of the particular processor distribution we received for our EC2 nodes for this test. The network latency and bandwidth results clearly show the difference between the interconnects on the tested systems. The ping-pong results show the latency and the bandwidth with no self-induced contention, while the randomly-ordered ring tests show the performance degradation with self-contention. The uncontended latency and bandwidth measurements of the EC2 gigabit Ethernet interconnect are more than 20 times worse than the slowest other machine. However, for EC2 the less capable network clearly inhibits overall HPL performance, by a factor of six or more. The FFTE benchmark measures the floating-point rate of execution of a double-precision complex one-dimensional discrete Fourier transform, and the PTRANS benchmark measures the time to transpose a large matrix. Both of these benchmarks' performance depends upon the memory and network bandwidth, and therefore they show similar trends: EC2 is approximately 20 times slower than Carver and four times slower than Lawrencium in both cases. The RandomAccess benchmark measures the rate of random updates of memory, and its performance depends on memory and network latency. In this case EC2 is approximately 10 times slower than Carver and three times slower than Lawrencium.
GAMESS (2.7×), for this benchmark problem, places relatively little demand upon the network and therefore is hardly slowed down at all on EC2. PARATEC shows the worst performance on EC2, 52× slower than Carver. It performs 3-D FFTs, and the global (i.e., all-to-all) data transposes within these FFT operations can incur a large communications overhead. Qualitatively, it seems that those applications that perform the most collective communication with the most messages are those that perform the worst on EC2.
The relative runtime on EC2 compared to Lawrencium is plotted against the percentage of communication for each application as measured on Lawrencium. The overall trend is clear: the greater the fraction of its runtime an application spends communicating, the worse its performance is on EC2. To determine these characteristics we classified the MPI calls of the applications into 4 categories: small and large messages (latency- vs. bandwidth-limited) and point-to-point vs. collective. (Note: for the purposes of this work we classified all messages < 4 KB as latency-bound; the overall conclusions contain no significant dependence on this choice.) From this analysis it is clear why fvCAM behaves anomalously: it is the only one of the applications that performs most of its communication via large messages, both point-to-point and collective.
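The 4-way classification above is simple enough to state in code. A hedged sketch follows: the 4 KB cutoff is the paper's, but the type and function names are hypothetical.

/* Hypothetical sketch of the paper's 4-way MPI message classification:
   point-to-point vs. collective, latency- vs. bandwidth-bound. */
#include <stddef.h>

enum msg_class { P2P_SMALL, P2P_LARGE, COLL_SMALL, COLL_LARGE };

#define LATENCY_CUTOFF 4096   /* paper: messages < 4 KB are latency-bound */

enum msg_class classify(size_t bytes, int is_collective)
{
    int small = bytes < LATENCY_CUTOFF;
    if (is_collective)
        return small ? COLL_SMALL : COLL_LARGE;
    return small ? P2P_SMALL : P2P_LARGE;
}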