This document summarizes work to optimize the parallel MATLAB (pMatlab) software for large-scale computation on the IBM Blue Gene/P supercomputer. Key points include porting pMatlab to run on the Blue Gene/P, evaluating its single-process and parallel performance, and developing new aggregation techniques, BAGG and HAGG, that use a binary tree approach to improve the scalability of collecting distributed data beyond 1024 processes. Single-process Octave performance on Blue Gene/P was found to be comparable to MATLAB, and parallel benchmarks showed near-linear weak scaling up to 16,384 processes.
1. Toward Mega-Scale Computing with pMatlab Chansup Byun and Jeremy Kepner MIT Lincoln Laboratory Vipin Sachdeva and Kirk E. Jordan IBM T.J. Watson Research Center HPEC 2010 This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.
4. IBM Blue Gene/P System. Core speed: 850 MHz. Core counts: LLGrid ~1K; Blue Gene/P ~300K.
10. Single Process Performance: Intel Xeon vs. IBM PowerPC 450 (lower is better). Benchmarks: HPL, MandelBrot, ZoomImage*, Beamformer, Blurimage*, FFT. (Chart timings: 26.4 s, 11.9 s, 18.8 s, 6.1 s, 1.5 s, 5.2 s.) * The conv2() performance issue in Octave has been improved in a subsequent release.
11. Octave Performance With Optimized BLAS: DGEMM performance comparison (x-axis: matrix size N x N; y-axis: MFLOPS).
12. Single Process Performance: Stream Benchmark (higher is better; relative performance to MATLAB). Kernels: Scale b = q*c; Add c = a + b; Triad a = b + q*c. (Chart bandwidths: 944 MB/s, 1208 MB/s, 996 MB/s.)
Title of the presentation, presented at HPEC 2010, September 15-16, 2010.
Outline of the presentation. The introduction covers the motivation for porting pMatlab to mega-core systems such as the IBM Blue Gene/P and how we did it. We then give performance details of the math kernel (Octave) for pMatlab on BG/P. Finally, we present the technical details of, and performance improvements made to, aggregation for large-scale computation.
pMatlab is the library layer; it abstracts away the parallel distribution details so that the user calls pMatlab functions, which provide mechanisms for easily distributing and parallelizing MATLAB code. pMatlab can use either MATLAB or Octave as the mathematics kernel, just as the messaging can be accomplished with any library that can send and receive MATLAB messages. In our current implementation, MatlabMPI is the communication kernel: pMatlab calls MatlabMPI functions to communicate data between processors.
IBM Blue Gene/P system design diagram. Each compute node has four cores running at 850 MHz and 2 GB or 4 GB of memory. The Surveyor system at Argonne National Laboratory, used for this study, has 2 GB per node. One full rack has 1024 compute nodes, or 4096 compute cores.
The IBM Blue Gene/P system supports two major computing environments. The High Throughput Computing (HTC) environment is ideal for serial and pleasantly parallel applications. The High Performance Computing (HPC) environment targets highly scalable message-passing applications using an MPI library. The HTC environment fits well for porting pMatlab to BG/P.
In Symmetric Multiprocessing (SMP) mode, each physical compute node executes a single process with a default maximum of four threads, and that process can access the full physical memory. In DUAL and VN modes for HTC, each process can access only 1/2 or 1/4 of the node memory, respectively. For example, the Surveyor system has 2 GB of memory per compute node, so an SMP-mode process has approximately 2 GB minus the space occupied by the OS, a DUAL-mode process has about 1 GB, and a VN-mode process has about 512 MB.
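The per-process memory arithmetic above can be sketched as follows (Python for illustration; the node memory is Surveyor's, and ignoring OS overhead is a simplification):

```python
# Approximate per-process memory for Blue Gene/P HTC modes.
# Node memory (2 GB on Surveyor) is from the text; OS overhead is ignored.
NODE_MEMORY_MB = 2048
PROCESSES_PER_NODE = {"SMP": 1, "DUAL": 2, "VN": 4}

def memory_per_process(mode, node_memory_mb=NODE_MEMORY_MB):
    """Memory visible to each process in the given HTC mode."""
    return node_memory_mb // PROCESSES_PER_NODE[mode]

for mode in ("SMP", "DUAL", "VN"):
    print(mode, memory_per_process(mode), "MB")  # SMP 2048, DUAL 1024, VN 512
```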
Porting pMatlab to the BG/P system is largely a matter of integrating the BG/P job-launch mechanism into pMatlab's. Running a pMatlab job on BG/P takes two steps. First, request and boot a BG partition in HTC mode using the qsub command on Surveyor. Then, once the partition is booted, execute the submit command to submit a job to the partition. By combining the two steps under the pRUN command, pMatlab jobs can be executed on BG/P.
Outline slide for the performance studies. We discuss single-process performance, point-to-point communication performance, and scalability results using the parallel stream benchmark.
Examples used for the performance studies, which are available from the pMatlab package.
Performance of example codes running on a single processor core is compared between Intel Xeon and IBM PowerPC chips
Using an optimized BLAS library (ATLAS), the Octave 3.2.4 binary ran faster than MATLAB 2009a on the DGEMM and HPL benchmark examples.
Stream benchmark performance is compared between LLGrid and Surveyor (IBM BG/P) systems
Point-to-point communication performance on BG/P is measured by using pSpeed, a pMatlab code example.
Because messages are files in pMatlab, an efficient file system is essential to scale to a large number of processes. A single NFS-shared file system suffers serious resource contention as the number of processes increases. However, using a group of cross-mounted, distributed file systems can mitigate this issue. The next slide demonstrates file-system resource contention.
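The file-based messaging idea can be sketched as follows (Python for illustration; MatlabMPI itself is MATLAB code and uses its own file-naming scheme, so the names here are hypothetical). The sender writes the payload to a data file and then creates a lock file to signal completion; the receiver polls for the lock file on the shared file system.

```python
import os, pickle, tempfile, time

def send(comm_dir, src, dest, tag, data):
    """Write the payload, then a lock file marking the message as ready."""
    base = os.path.join(comm_dir, f"msg_{src}_{dest}_{tag}")
    with open(base + ".dat", "wb") as f:
        pickle.dump(data, f)
    open(base + ".lock", "w").close()  # created only after the data is complete

def recv(comm_dir, src, dest, tag, poll=0.01):
    """Poll for the lock file, then read the payload."""
    base = os.path.join(comm_dir, f"msg_{src}_{dest}_{tag}")
    while not os.path.exists(base + ".lock"):
        time.sleep(poll)  # this polling is what contends on a shared NFS disk
    with open(base + ".dat", "rb") as f:
        return pickle.load(f)

comm = tempfile.mkdtemp()
send(comm, 0, 1, 42, [1, 2, 3])
print(recv(comm, 0, 1, 42))  # prints [1, 2, 3]
```

With many processes polling one NFS-shared directory, every `os.path.exists` check hits the same server, which is the contention the crossmount configuration avoids.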
The pSpeed performance with MATLAB was measured when all messages were written to a single, NFS-shared disk. Resource contention appeared with four processes, and with eight processes the issue is obvious.
When using crossmounts for pMatlab message communication, the jittery behavior is resolved. The overall performance of Octave matches that of MATLAB well with four and eight processes.
The point-to-point communication (pSpeed) on BG/P system performed as expected when using IBM’s proprietary parallel file system, GPFS
Now we look at scalability from a different perspective: weak scaling. We increase the problem size proportionally with the number of processes so that each process has the same amount of work. In SMP mode, the initial global array size is set to 2^25 with Np=1, the largest size we can run on a single process. As the problem size scales proportionally with the number of processes, the bandwidth scales linearly. At Np=1024 (1024 nodes, one process per node), Triad bandwidth is 563 GB/s with 100% parallel efficiency.
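The per-process triad kernel and its bandwidth accounting can be sketched as follows (Python/NumPy for illustration; pStream itself is pMatlab code). The 2^25 default array size matches the per-process size used in the text.

```python
import time
import numpy as np

def stream_triad(n=2**25, q=3.0):
    """STREAM triad kernel a = b + q*c; returns bandwidth in MB/s.

    Triad touches three arrays of 8-byte doubles: reads b and c,
    writes a, so it moves 3 * n * 8 bytes.
    """
    b = np.ones(n)
    c = np.ones(n)
    t0 = time.perf_counter()
    a = b + q * c
    t1 = time.perf_counter()
    bytes_moved = 3 * n * 8
    return bytes_moved / (t1 - t0) / 1e6

# Smaller size for a quick demonstration run.
print(f"Triad bandwidth: {stream_triad(2**20):.0f} MB/s")
```

In the weak-scaling study, each of the Np processes runs this kernel on its own 2^25-element slice, and the aggregate bandwidth is the sum over processes, which is why linear scaling corresponds to 100% parallel efficiency.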
The pStream benchmark scaled up to 16,384 processes on an IBM internal BG/P machine.
Outline slide for the aggregation optimization done for large-scale computation.
The current aggregation architecture is easy to implement, but it is inherently sequential; the collection process therefore slows down significantly as the number of processes grows.
Binary-tree-based aggregation can significantly reduce the number of messages that a process sends and receives, although each message becomes larger. For example, when Np = 8, the leader process receives three messages with BAGG(), versus seven with AGG().
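The message-count difference can be illustrated with a small sketch (Python for illustration; the schedule shown is a generic binary-tree reduction, not pMatlab's exact implementation). Only the communication pattern is modeled; no data is moved.

```python
def agg_recv_counts(np_):
    """Sequential AGG(): the leader receives one message per non-leader."""
    counts = [0] * np_
    counts[0] = np_ - 1
    return counts

def bagg_recv_counts(np_):
    """Binary-tree BAGG(): in each round, active processes pair up and
    the lower-ranked partner receives the other's (larger) message."""
    assert np_ & (np_ - 1) == 0, "BAGG() requires a power-of-two Np"
    counts = [0] * np_
    step = 1
    while step < np_:
        for rank in range(0, np_, 2 * step):
            counts[rank] += 1  # rank receives from rank + step
        step *= 2
    return counts

print(agg_recv_counts(8)[0])   # leader receives 7 messages
print(bagg_recv_counts(8)[0])  # leader receives 3 messages
```

With Np processes, the leader's receive count drops from Np - 1 to log2(Np), which is the source of the scalability gain.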
On an MIT Lincoln Laboratory cluster system, the performance of AGG() and BAGG() is compared for a 2-D data and process distribution using two different file systems. At large process counts, BAGG() performs significantly better than AGG(): with 128 processes, BAGG() achieved a 10x improvement on the IBRIX file system and 8x on Lustre. These numbers are highly dependent on file-system load, since they were obtained on a production cluster.
On the IBM Blue Gene/P system, the performance comparison was done over a wide range of process counts, from 8 to 1024, using a 4-D data and process distribution. With the GPFS file system, BAGG() runs 2.5x faster than AGG().
BAGG() only works when Np is a power of two. We generalized BAGG() so that it can work with any number of processes. The generalization is based on an extended binary tree, so we can still use the binary-tree aggregation architecture. The fictitious processes added to complete the tree incur some additional cost, but message communication with them is simply skipped.
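The extended-binary-tree idea can be sketched as follows (again a generic Python illustration, not the actual generalized code): Np is padded up to the next power of two, the same tree schedule runs, and any exchange involving a fictitious rank is skipped.

```python
def hagg_recv_counts(np_):
    """Extended binary tree for arbitrary Np: pad to the next power of
    two and skip exchanges with fictitious (>= Np) ranks."""
    padded = 1
    while padded < np_:
        padded *= 2  # next power of two >= Np
    counts = [0] * np_
    step = 1
    while step < padded:
        for rank in range(0, padded, 2 * step):
            partner = rank + step
            if rank < np_ and partner < np_:
                counts[rank] += 1  # real exchange: rank receives from partner
            # else: fictitious partner -> communication skipped
        step *= 2
    return counts

print(hagg_recv_counts(6)[0])  # leader receives 3 messages for Np = 6
```

A real rank whose partner is fictitious simply keeps its data and forwards it in a later round, so every message still reaches the leader; the only extra cost is the bookkeeping for the skipped exchanges.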
The cost of the additional bookkeeping appears to be minimal. On a production MIT Lincoln Laboratory cluster, the performance of AGG(), BAGG(), and HAGG() was compared; at Np = 128, BAGG() and HAGG() run about 3x faster than AGG().
On an IBM Blue Gene/P system, the cost of the additional bookkeeping was studied over a wide range of process counts, from 8 to 512, using a 4-D data and process distribution on the GPFS file system. BAGG() and HAGG() perform almost identically throughout the entire range.
On the LLGrid system, using a hybrid configuration (central file system for the leader process and crossmounts for the compute nodes), the new agg() function reduces resource contention on the file system and thus improves agg() performance significantly.