A short and comprehensive manual on extending your local MATLAB with a high-performance computing cluster of NVIDIA Tesla 2070 graphics processing units.
HPC and HPGPU Cluster Tutorial
1. Cluster Tutorial
I. Introduction and Prerequisites
This short manual provides guidance on setting up a proper working environment with the three different clusters available in our institute. The first question to ask yourself is the following: "Do I need the cluster for my problem?" From experience I can tell you that mostly you do not, because the cost of reprogramming the solution in a parallel manner often exceeds the benefit by far. Therefore, please check the following questions; if you answer any of them with yes, it makes absolute sense to solve your problem with the cluster.
I have a huge set of data which won't fit into the memory of a single machine.
I have a huge set of data which won't fit into the memory of a single machine and which I cannot split into chunks, because I have to jump back and forth within it.
I have plenty of iterations to perform for my simulation and want to put them on the cluster, because they would take a couple of months otherwise.
The routine I write for the cluster can be used daily by my peers for the next ten years.
It might sound odd to some, but in general one should not underestimate the initial effort required to get started. If you are still not deterred, you must make sure that the following conditions are met.
The special cluster-enabled MATLAB is installed on your office computer.
HPC Cluster Manager and Job Manager are installed on the very same machine.
You have an Active Directory account, e.g. PHYSIK3\HansWurscht.
If those requirements are not met, please file a ticket at https://opto1:9676/portal stating that you want to join the cluster group, and we will set you up right away.
We have three different clusters available, named ALPHA, GAMMA and SKYNET. They all serve different purposes, so it makes sense to match your problem to the specific grid.
SKYNET: An HPGPU cluster (high-performance GPU computing) which is quite experimental and requires a high degree of expertise. You can also run regular jobs here; it is not forbidden. It has up to 80 workers and eight Tesla M2050 GPUs, which are pretty impressive.
ALPHA: An HPC cluster which makes use of the office computers when they are not busy, for example at night or on weekends. Since this cluster can shrink and grow depending on available resources, there is no fixed size, but the maximum is somewhere around 500 workers.
GAMMA: An HPC cluster with 16 workers but 32 GB of memory. If you must submit a job with huge memory requirements, this is the recommended grid.
II. Connect to the Cluster
Connecting to the cluster is as easy as making coffee. Please download the profiles from the project server https://projects.gwdg.de/projects/cluster and import them into your local MATLAB application (Fig. 2). Afterwards, it is recommended to run the test routines that check your configuration. It is very important that they are all marked as passed (Fig. 4). Where to find the button is shown in Fig. 1. The system may ask for authentication once; please connect with your regular office computer credentials. The dialog window that appears looks like Fig. 3.
Figure 1: Manage Cluster
Figure 2: Import or find Clusters
Figure 3: Connect to cluster with AD credentials
Figure 4: Test Cluster Connectivity
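Once the profiles are imported and the tests pass, you can also check from the MATLAB command line which configuration will be used by default. This is a sketch only, assuming the legacy Parallel Computing Toolbox API used throughout this tutorial; 'SKYNET' stands in for whatever configuration name appeared when you imported the profiles:

```matlab
% Show the current default parallel configuration and all known ones.
[current, known] = defaultParallelConfig
% Make one of the imported cluster configurations the default.
% NOTE: 'SKYNET' is an assumed name; substitute your imported profile.
defaultParallelConfig('SKYNET');
```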
III. Monitor the jobs
On the local computer there is a program called Job Manager, which is used to monitor the cluster resources. If, for instance, a job hangs or you want to cancel one, this program is the necessary tool.
Fig. 5 shows the typical layout of the Job Manager; to cancel your job, right-click on it and choose cancel. To control different clusters, you need to point the Job Manager to the right cluster head node, as shown in Fig. 6. It is very important that you kill your jobs if they hang; otherwise the other users cannot use the cluster at its full resource level.
Figure 5: Job Manager
Figure 6: Select Head Node
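If the Job Manager GUI is not at hand, hung jobs can also be inspected and cancelled from the MATLAB command line. This is a sketch using the same legacy scheduler API as the programming examples below; which job index is actually yours has to be checked by hand:

```matlab
sched = findResource();          % scheduler from your default configuration
jobs  = findJob(sched);          % all jobs known to this head node
get(jobs, {'ID', 'State'})       % inspect the queue before touching anything
cancel(jobs(end));               % stop a hung job (the index is illustrative)
destroy(jobs(end));              % remove its data so resources are freed
```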
IV. Programming Tutorial
The programs explained in this tutorial are available for direct use in the MATLAB editor; please visit the project server https://projects.gwdg.de/projects/cluster and download the folder of example files. Add them to your local MATLAB path, otherwise the interpreter cannot find them. First you will need to select the parallel configuration; this can be a cluster, or your local machine if it has several CPU cores (Fig. 7).
Figure 7: Select Profile
For more information on configurations and programming with user configurations, see:
http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/index.html?/access/helpdesk/help/toolbox/distcomp/f5-16141.html#f5-16540
A. Using an Interactive MATLAB Pool
To interactively run your parallel code, you first need to open a MATLAB pool. This reserves a collection of MATLAB worker sessions to run your code. The MATLAB pool can consist of MATLAB sessions running on your local machine or on a remote cluster; in this case, we are initially running on your local machine. You can use matlabpool open to start an interactive worker pool. If the number of workers is not defined, the default number defined in your configuration will be used. A good rule of thumb is to not open more workers than cores available. If the Configuration argument is not provided, matlabpool will use the default configuration as set up in the beginning of this section. When you are finished running with your MATLAB pool, you can close it using matlabpool close. Two of the main parallel constructs that can be run on a MATLAB pool are parfor loops (parallel for-loops) and spmd blocks (single program - multiple data blocks). Both constructs allow for a straightforward mixture of serial and parallel code.
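The open and close variants described above look as follows; 'SKYNET' is a placeholder for whatever configuration name you imported:

```matlab
% Only one pool can be open at a time; the open variants below are
% alternatives, not a sequence.
matlabpool open                   % default configuration, default pool size
matlabpool open 4                 % default configuration, four workers
matlabpool('open', 'SKYNET', 8)   % named configuration, eight workers

matlabpool close                  % release the workers when finished
```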
parfor loops are used for task-parallel (i.e. embarrassingly parallel) applications. parfor is used to speed up your code. Below is a simple for loop converted into a parfor to run in parallel, with different iterations of the loop running on different workers. The code outside the parfor loop executes as traditional MATLAB code (serially, in your client MATLAB session).
Note: The example below is located in the m-file, ‘parforExample1.m’.
matlabpool open 2 % can adjust according to your resources
N = 100;
M = 200;
a = zeros(N,1);

tic; % serial (regular) for-loop
for i = 1:N
    a(i) = a(i) + max(eig(rand(M)));
end
toc;

tic; % parallel for-loop
parfor i = 1:N
    a(i) = a(i) + max(eig(rand(M)));
end
toc;

matlabpool close
spmd blocks are a single program multiple data (SPMD) language construct. The "single program" aspect of spmd means that the identical code runs on multiple labs. The code within the spmd body executes simultaneously on the MATLAB workers. The "multiple data" aspect means that even though the spmd statement runs identical code on all workers, each worker can have different, unique data for that code. spmd blocks are useful when dealing with large data that cannot fit on a single machine. Unlike parfor, spmd blocks support inter-worker communication. They allow:
Arrays (and operations on them) to be distributed across multiple workers
Messages to be explicitly passed amongst workers.
The example below creates a distributed array (different parts of the array are located on different workers) and computes the SVD of this distributed array. The spmd block returns the data in the form of a Composite object (which behaves similarly to cells in serial MATLAB; for specifics, see the documentation link below).
Note: The example below is located in the m-file, ‘spmdExample1.m’.
matlabpool open 2 % can adjust according to your resources
M = 200;
spmd
N = rand(M,M,codistributor); % 200x100 chunk per worker
A = svd(N);
end
A = max(A{1}); % Indexing into the composite object
disp(A)
clear N
matlabpool close
For information on matlabpool, see:
http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/index.html?/access/helpdesk/help/toolbox/distcomp/matlabpool.html
For information about getting started using parfor loops, see:
http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/index.html?/access/helpdesk/help/toolbox/distcomp/brb2x2l-1.html
For information about getting started using spmd blocks, see:
http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/index.html?/access/helpdesk/help/toolbox/distcomp/brukbno-2.html
For information regarding composite objects:
http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/index.html?/access/helpdesk/help/toolbox/distcomp/brukctb-1.html
For information regarding distributed arrays:
http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/index.html?/access/helpdesk/help/toolbox/distcomp/bqi9fln-1.html
1. Using Batch to Submit Serial Code – Best Practice for Scripts
batch sends your serial script to run on one worker in your cluster. All of the variables in your client workspace (i.e. the MATLAB process you are submitting from) are sent to the worker by default. You can alternatively pass a subset of these variables by defining the Workspace argument and passing the desired variables in a structure. After your job has finished, you can use the load command to retrieve the results from the worker workspace back into your client workspace. In this and all following examples, we use wait to ensure the job is done before we load the worker workspace back in. This is optional, but you cannot load the data from a task or job until that task or job is finished, so we use wait to block the MATLAB command line until that occurs. If the Configuration argument is not provided, batch will use the default configuration that was set up above.

Note: For this example to work, you will need 'testBatch.m' on the machine that you are submitting from (i.e. the client machine). The example below is located in the m-file 'submitJob2a.m'.
%% This script submits a serial script using batch
job2a = batch('testBatch');
wait(job2a); % only can load when job is finished
sprintf('Finished Running Job')
load(job2a); % loads all variables back
sprintf('Loaded Variables into Workspace')
% load(job2a, 'A'); % only loads variable A
destroy(job2a) % permanently removes job data
sprintf('Test Completed')
If you have submitted successfully, you should see the following variables appear in your client workspace:

Figure 9: Workspace
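The Workspace argument mentioned above, for sending only selected variables to the worker, can be sketched like this (same legacy batch API, with 'testBatch.m' as in the example above):

```matlab
A = 1; B = rand(4);        % B stays on the client and is not transferred
% Send only A to the worker instead of the whole client workspace.
job = batch('testBatch', 'Workspace', struct('A', A));
wait(job);                 % block until the job has finished
load(job);                 % pull the worker workspace back in
destroy(job)               % clean up the job data afterwards
```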
Figure 8: Batch Job

For more information on batch, see:
http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/index.html?/access/helpdesk/help/toolbox/distcomp/batch.html
and here:
http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/index.html?/access/helpdesk/help/toolbox/distcomp/brjw1e5-1.html#brjw1fx-3
2. Using Batch to Submit Scripts that Run Using a MATLAB pool
batch with the 'matlabpool' option sends scripts containing parfor or spmd to run on workers via a MATLAB pool. In this process, one worker behaves like a MATLAB client process that facilitates the distribution of the job amongst the workers in the pool and runs the serial portion of the script. Therefore, specifying a 'matlabpool' of size N actually will result in N+1 workers being used. Just like in step 2a, all variables are automatically sent from your client workspace (i.e. the workspace of the MATLAB you are submitting from) to the worker’s workspace on the cluster. load then brings the results from your worker’s workspace back into your client’s workspace. If a configuration is not specified, batch uses the default configuration as defined in the beginning of this section.
Note: For this example to work, you will need ‘testParforBatch.m’ on the machine that you are submitting from (i.e. the client machine). This example below is located in the m-file, submitJob2b.m.
%% This script submits a parfor script using batch
job2b = batch('testParforBatch','matlabpool',2);
wait(job2b); % only can load when job is finished
sprintf('Finished Running Job')
load(job2b); % loads all variables back
sprintf('Loaded Variables into Workspace')
% load(job2b, 'A'); % only loads variable A
destroy(job2b) % permanently removes job data
sprintf('Test Completed')
If you have submitted successfully, you should see the following variables appear in your client workspace:
Figure 10: Workspace Batch Pool
The above code submitted a script containing a parfor. You can submit a script containing an spmd block in the same fashion by changing the name of the submission script in the batch command.

Note: For this example to work, you will need 'testSpmdBatch.m' on the machine that you are submitting from (i.e. the client machine). The example below is located in the m-file 'submitJob2b_spmd.m'.
%% This script submits a spmd script using batch
job2b = batch('testSpmdBatch','matlabpool',2);
wait(job2b); % only can load when job is finished
sprintf('Finished Running Job')
load(job2b); % loads all variables back
sprintf('Loaded Variables into Workspace')
% load(job2b, 'A'); % only loads variable A
destroy(job2b) % permanently removes job data
sprintf('Test Completed')
If you have submitted successfully, you should see the following variables appear in your client workspace:

Figure 11: Batch Pool SPMD
B. Run Task-Parallel Example with Jobs and Tasks
In this example, we are sending a task-parallel job with multiple tasks. Each task evaluates the built-in MATLAB sum function. The createTask function in the example below is passed the job, the function to be run in the form of a function handle (@sum), the number of output arguments of the function (1), and the input argument to the sum function in the form of a cell array ({[1 1]}).
If not given a configuration, findResource uses the scheduler found in the default configuration defined in the beginning of this section.
Note: This example is located in the m-file, ‘submitJob3a.m’.
%% This script submits a job with 3 tasks
sched = findResource();
job3a = createJob(sched);
createTask(job3a, @sum, 1, {[1 1]});
createTask(job3a, @sum, 1, {[2 2]});
createTask(job3a, @sum, 1, {[3 3]});
submit(job3a)
waitForState(job3a, 'finished') %optional
sprintf('Finished Running Job')
results = getAllOutputArguments(job3a);
sprintf('Got Output Arguments')
destroy(job3a) % permanently removes job data
sprintf('Test Completed')
If you have submitted successfully, you should see the following variables appear in your client workspace:
Figure 12: Parallel Task
results should contain the following:
Figure 13: Terminal Output Task Parallel
You can also call a user-created function in the same way as shown above. In that case, you will need to make sure that any scripts, files, or functions that the task function uses are accessible to the cluster. You can do this by sending those files to the cluster via the FileDependencies property or by directing the worker to a shared directory containing those files via the PathDependencies property. An example of using FileDependencies is shown below.

Note: You will need to have a 'testTask.m' file on the machine you are submitting from for this example to work. This example is located in the m-file 'submitJob3b.m'.
% This script submits a job with 3 tasks
sched = findResource();
job3b = createJob(sched,'FileDependencies',{'testTask.m'});
createTask(job3b, @testTask, 1, {1,1});
createTask(job3b, @testTask, 1, {2,2});
createTask(job3b, @testTask, 1, {3,3});
submit(job3b)
waitForState(job3b, 'finished') % optional
sprintf('Finished Running Job')
results = getAllOutputArguments(job3b);
sprintf('Got Output Arguments')
destroy(job3b) % permanently removes job data
sprintf('Test Completed')
If you have submitted successfully, you should see the following variables appear in your client workspace:
Figure 14: Task Parallel Workspace
results should contain the following:

Figure 15: Task Parallel Output
For more information on File and Path Dependencies, see the documentation below.
File Dependencies:
http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/index.html?/access/helpdesk/help/toolbox/distcomp/filedependencies.html
Path Dependencies:
http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/index.html?/access/helpdesk/help/toolbox/distcomp/pathdependencies.html
More general overview about sharing code between client and workers:
http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/index.html?/access/helpdesk/help/toolbox/distcomp/bqur7ev-2.html#bqur7ev-9
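PathDependencies, the shared-directory alternative mentioned above, can be sketched as follows; the UNC path is a placeholder for a share that all nodes can actually reach and that already contains testTask.m:

```matlab
sched = findResource();
job = createJob(sched);
% Point the workers at a shared folder instead of copying files over.
% NOTE: the path below is a placeholder, not a real institute share.
set(job, 'PathDependencies', {'\\fileserver\cluster\code'});
createTask(job, @testTask, 1, {1,1});
submit(job)
```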
C. Run Task-Parallel Example with a MATLAB Pool Job – Best Practice for parfor or spmd in Functions
In this example, we are sending a MATLAB pool job with a single task. This is nearly equivalent to sending a batch job (see step 2b) with a parfor or an spmd block, except this method is best used when sending functions rather than scripts. It behaves just like the jobs/tasks explained in step 3. The function referenced in the task contains a parfor.

Note: For this example to work, you will need 'testParforJob.m' on the machine that you are submitting from (i.e. the client machine). This example is located in the m-file 'submitJob4.m'.
% This script submits a function that contains parfor
sched = findResource();
job4 = createMatlabPoolJob(sched,'FileDependencies',...
{'testParforJob.m'});
createTask(job4, @testParforJob, 1, {});
set(job4, 'MaximumNumberOfWorkers', 3);
set(job4, 'MinimumNumberOfWorkers', 3);
submit(job4)
waitForState(job4, 'finished') % optional
sprintf('Finished Running Job')
results = getAllOutputArguments(job4);
sprintf('Got Output Arguments')
destroy(job4) % permanently removes job data
sprintf('Test Completed')
If you have submitted successfully, you should see the following variables appear in your client workspace:

results{1} should contain a [50x1 double].
For more information on creating and submitting MATLAB pool jobs, see:
http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/index.html?/access/helpdesk/help/toolbox/distcomp/creatematlabpooljob.html

Figure 16: Workspace Variables SPMD in Functions
D. Run Data-Parallel Example
In this step, we are sending a data-parallel job with a single task. The format is similar to that of jobs/tasks (see step 3). For parallel jobs, you only have one task; that task refers to a function that uses distributed arrays, labindex, or some MPI functionality. In this case, we are running a simple built-in function (labindex) which takes no inputs and returns a single output. labindex returns the ID value of each worker process that runs it; the value of labindex spans from 1 to n, where n is the number of labs running the current job.

Note: This example is located in the m-file 'submitJob5.m'.
%% Script submits a data parallel job, with one task
sched = findResource();
job5 = createParallelJob(sched);
createTask(job5, @labindex, 1, {});
set(job5, 'MaximumNumberOfWorkers', 3);
set(job5, 'MinimumNumberOfWorkers', 3);
submit(job5)
waitForState(job5, 'finished') % optional
sprintf('Finished Running Job')
results = getAllOutputArguments(job5);
sprintf('Got Output Arguments')
destroy(job5); % permanently removes job data
sprintf('Test Completed')
If you have submitted successfully, you should see the following variables appear in your client workspace:
Figure 17: Workspace Data Parallel
results should contain the following:

Figure 18: Results Data Parallel
For more information on creating and submitting data parallel jobs, see:
http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/index.html?/access/helpdesk/help/toolbox/distcomp/createparalleljob.html
For more information on labindex, see:
http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/index.html?/access/helpdesk/help/toolbox/distcomp/labindex.html
E. Node GPU Processing
If one needs to accelerate execution even further, a good strategy is to use the built-in GPU/CUDA functions of the MATLAB distributed computing toolbox. Set the cluster profile to SKYNET and open a pool of workers.
matlabpool open …
The scheduler located on the head node of the cluster will recognize that your job contains GPU code and will automatically switch the scheduling profile to dispatch only to nodes which have a GPU integrated. In any case, it is a good idea to catch the error in case the scheduler does not work properly. The next block of source code shows an excellent example of how to do that.
function testGPUInParfor()
spmd
    selectGPUDeviceForLab();
end
parfor i = 1:1000
    % Each iteration will generate some data A
    A = rand(5555);
    if selectGPUDeviceForLab()
        A = gpuArray(A);
        disp( 'Do it on the GPU' )
    else
        disp( 'Do it on the host' )
    end
    % replace the following line with whatever task you need to do
    S = sum(A,1);
    % Maybe collect back from GPU (gather is a no-op if not on the GPU)
    S = gather(S);
end

function ok = selectGPUDeviceForLab()
persistent hasGPU;
if isempty( hasGPU )
    devIdx = mod(labindex-1, gpuDeviceCount()) + 1;
    try
        dev = gpuDevice( devIdx );
        hasGPU = dev.DeviceSupported;
    catch %#ok
        hasGPU = false;
    end
end
ok = hasGPU;
F. Avoid Errors – Use the Compiler To Your Advantage
It is counterintuitive at first, but in a parallel loop or parallel data block, each single iteration of that block runs as a parallel task, i.e. on a worker. Therefore the iterations must be iteration-safe, since there is no guarantee that the iterations run in ascending order. What I prefer to do is borrow the map-reduce approach; of course we do not reduce anything here, but the design pattern is great for preventing headaches. In the map step, a piece of data and a function are scheduled to a worker, where the function produces a return value. (In real map-reduce one can then proceed and use the output of step x-1 in step x, which we normally do not.) I can only advise you to make extensive use of function returns. The idea is that when a chunk of data is distributed along a dimension, or a chunk of iterations is distributed over the total set of iterations, every slice goes to the worker via a function call and comes back to the parallel block via a return value. This also has advantages for the underlying MPI, which would otherwise shovel all variables via multicast to all workers instead of only the needed slice to each worker, but that is beyond the scope of this document.
In addition, it is very clever to write your function in a way that it can run in both local and cluster environments. For reference, see the example folder "sofi" on the project server.
Problem:

parfor i = 1:N
    u = u*v;
end

Better:

parfor i = 1:N
    u = multi(u,v);
end
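As a complete picture of the function-return pattern advocated above, the "better" variant can be written as a small function file with a sliced output; this is a minimal illustration, not code from the example folder, and parforReturnPattern/multi are hypothetical names:

```matlab
function parforReturnPattern()
% Minimal sketch of the map-style pattern: every slice travels to the
% worker via a function call and comes back via the return value.
N = 8;
u = rand(N,1);
v = rand(N,1);
w = zeros(N,1);
parfor i = 1:N
    w(i) = multi(u(i), v(i));   % sliced input, sliced output
end
disp(w)

function r = multi(a, b)
% Pure helper: receives only its slice, returns only its result.
r = a .* b;
```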
G. Summary Chart for Scheduling Options
Figure 19: Scheduler Options
V. Glossary
HPGPU: High-performance graphics processing unit computing.
Worker: A worker executes one parallel task. In other words, if one has 100 workers available, one can run 100 iterations of a parallel loop in one time interval (tick).
Node: A node is a physical machine; for example, a computer connected to a cluster is a node of that very cluster. A node can have 16 workers if it has four CPUs with four cores each.
MPI: Message Passing Interface, a fancy piece of software which distributes processes around a grid via RPCs.
RPC: Remote Procedure Call.