Parallel programs for multi-processor computers!
Author: Igor Oreschenkov

Date: 04.02.2009


Abstract
For instance, when a man cannot manage everything himself, he creates a double: brainless,
irresponsible, able only to solder contacts, or carry heavy loads, or take dictation, but doing this very
well.

Arkady and Boris Strugatsky

This article is a beginner's introduction to parallel programming. It was published in "PC World"
magazine (http://www.viva64.com/go.php?url=429).


A parallel program: a particular case of the division of labor
We would hardly see the world around us as it is if mankind had not, in its time, invented the principle
of the division of labor. Thanks to it there are professionals who specialize in a particular field and
perform their work perfectly. Applying this approach in industry increased output tenfold. So it is no
wonder that when computers appeared, the question arose of applying the division of labor to the
operation of computer programs. Research in this field revealed a pleasing fact: complicated
mathematical calculations, related, for example, to matrix operations, can be split into independent
subtasks. Each subtask can be executed separately, and the solution of the source task can then be
obtained by combining the results of the subtasks. Since the subtasks are independent, they can be
solved simultaneously by several executors. If a computer takes the role of the executor, this approach
lets us tackle computationally demanding tasks: realistic modeling of real-world objects, cryptanalysis,
real-time control of technological processes. It is on this principle that all supercomputers, the pride of
scientific laboratories and computing centers, operate.

Having run into fundamental physical limits on single-processor performance, computer manufacturers
began to increase the number of computational nodes in their products. Now, inside the chassis of an
ordinary PC, you may find what would earlier have been called a supercomputer: a multi-core
computing device. Operating systems have been ready for this technology for a long time. Unix and
Linux, OS/2, Mac OS and modern versions of Microsoft Windows all gain performance when several
programs execute simultaneously. For example, while you watch a film, an antivirus scan of your hard
disk can run without your noticing it. But there are situations when a single task demands a lot of
computational resources. Leaving aside nuclear fusion or cracking the password to an archive, think of
such common things as music and video encoding or computer games. Programs implementing these
tasks fully load the CPU, yet adding processors to the system will give no marked performance gain in
these examples unless special methods were used while developing the programs. Indeed, how can an
operating system know the internals of an encoding method well enough to distribute its calculations
between processors? Can it know how to parallelize the calculation of a game scene?


Processes and threads - twins
So, we have a multi-processor computer. We have analyzed the task and singled out the subtasks that
can be solved simultaneously. How do we implement their solution in a program? Modern operating
systems offer two forms of code execution: processes and threads.

A process is a program loaded into main memory and ready for execution. So when we do two things at
once - surf the news on the Internet and burn files to a CD - two processes execute simultaneously: an
Internet browser and a CD-writing program. A process consists of program code, the data it operates
on, and various resources, such as files or system queues, belonging to the program. Each process
executes in its own address space, i.e. it has access only to its own area of main memory. On the one
hand this is good: one process cannot interfere with another (unless both address a non-shareable
resource, such as the disk drive, but that is a different matter), and errors in one program in no way
affect another. On the other hand, if the processes are meant to solve one common task, a question
arises: how will they exchange information and interact at all?

Launching processes - Unix and Windows
Interestingly, the developers of Microsoft Windows and of Unix (Linux) operating systems took different
approaches to launching a child process. From the viewpoint of a programmer using the WIN32 API
everything is quite natural: at the point where an auxiliary process must be launched, the
CreateProcess() function is called, one of whose arguments specifies the EXE file with the program to
be executed.
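
For illustration, here is a minimal sketch of this call (the child program name "worker.exe" is invented;
error handling is reduced to a single check):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        STARTUPINFOA si = { sizeof(si) };       /* must carry its own size */
        PROCESS_INFORMATION pi;
        char cmd[] = "worker.exe";              /* hypothetical child program */

        /* launch the child process and wait until it terminates */
        if (!CreateProcessA(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi)) {
            printf("CreateProcess failed: %lu\n", GetLastError());
            return 1;
        }
        WaitForSingleObject(pi.hProcess, INFINITE);
        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
        return 0;
    }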

In Unix-like operating systems this is done differently. At the point in the code where a new process is
needed, the fork() system call is used. As a result the running program "forks": the state of the program
at the moment of the fork() call is copied (by state we mean the CPU register values, the stack, the data
area and the list of open files), and on the basis of this copy a new process is launched possessing the
same information as its parent. Both the parent and the child continue executing the same program,
beginning with the instruction that follows the fork() call. Since doing the same work twice is pointless,
a question arises: how do we make each of the resulting processes solve its own independent subtask?
The point is that, from the parent's viewpoint, fork() returns the identifier of the child process, while in
the child fork() simply returns zero. Each process can therefore identify itself by this code, and all the
rest is a technical matter: the child only has to prepare an environment for a new program, load its code
with the exec() family of system calls and transfer control to it.
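
A minimal sketch of this fork()/exec() pattern ("./worker" is a hypothetical program name):

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();                 /* the program "forks" here */
        if (pid == 0) {
            /* child: fork() returned zero, so load and run a new program */
            execl("./worker", "worker", (char *)NULL);
            perror("execl");                /* reached only if exec failed */
            return 1;
        }
        /* parent: fork() returned the child's identifier */
        printf("started child %d\n", (int)pid);
        waitpid(pid, NULL, 0);              /* wait for the child to finish */
        return 0;
    }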

A thread is a means of forking execution inside a process. A process can contain several threads, each
executed independently of the others with its own register values and its own stack. Threads have
access to all the global resources of their process, such as open files or memory segments. Every
process has at least one thread, the main thread. When the solution of a task reaches a section that
allows parallelization, the process spawns the necessary number of additional threads, which execute
simultaneously, solve their subtasks and return the results to the main thread.
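
As a sketch, with POSIX threads this "spawn, solve, return to the main thread" pattern looks roughly as
follows (four workers, each computing an invented partial result; compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>

    #define WORKERS 4

    static void *subtask(void *arg)
    {
        long id = (long)arg;
        return (void *)(id * id);               /* stand-in for a real partial result */
    }

    int main(void)
    {
        pthread_t tid[WORKERS];
        long total = 0;

        for (long i = 0; i < WORKERS; i++)      /* spawn additional threads */
            pthread_create(&tid[i], NULL, subtask, (void *)i);
        for (long i = 0; i < WORKERS; i++) {    /* collect the results */
            void *res;
            pthread_join(tid[i], &res);
            total += (long)res;
        }
        printf("combined result: %ld\n", total);
        return 0;
    }
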
To solve complex tasks it is not enough simply to execute different code sections in parallel. You also
need to exchange the results of their work, that is, to organize interaction between these sections. At
first glance, with threads this is easy: just give them access to a common area of main memory. But this
simplicity is deceptive, because writes to main memory by different threads can interleave quite
unpredictably, so that the integrity of the information kept in that memory area is broken.

An example of thread interaction
Imagine the billing system of a cellular operator running on a powerful multi-processor server.
Naturally, charging the cost of provided services off a client's account is performed by one program
thread, while crediting payments to the account is performed by another. The first operation can be
written as S := S - A, where S is the account balance at the moment of the operation and A is the cost of
the phone calls made. The processor executes this operation in three steps: first the current value of
the variable S is read, then A is subtracted from it, and finally the result is written back into the cell S.
The second operation, S := S + B, where B is the sum paid by the client, is performed similarly. Suppose
the client makes a call shortly after paying at a bank office. Since payments are credited to the account
with some delay, for objective reasons, both operations - the debit and the credit - may be executed in
the billing system at nearly the same time. Suppose the debit is executed first and the credit follows
one step behind (figure 1). After the three steps of the debit operation the account balance equals
S - A, but the fourth step turns this value into S + B.
Figure 1 - An example of a billing system's operation

Thus the final balance equals S + B instead of the correct value S - A + B. With different timing, such a
collision could just as easily turn out badly for the client. To avoid such problems, special measures
must be taken to protect certain program sections from simultaneous execution.
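
The lost update is easy to reproduce. In this sketch two POSIX threads repeatedly debit and credit a
shared balance without any protection; run a few times (compiled with -pthread), the final value is
almost never the expected zero:

    #include <pthread.h>
    #include <stdio.h>

    static long S = 0;                     /* shared account balance */

    static void *debit(void *arg)
    {
        for (int i = 0; i < 1000000; i++)
            S = S - 1;                     /* read, subtract, write: three steps */
        return NULL;
    }

    static void *credit(void *arg)
    {
        for (int i = 0; i < 1000000; i++)
            S = S + 1;
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, debit, NULL);
        pthread_create(&t2, NULL, credit, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("balance = %ld (expected 0)\n", S);
        return 0;
    }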

In the following sections on synchronization and interaction of simultaneously executed program
sections, the terms "process" and "thread" will be used together, as the mechanisms described apply in
most cases to both notions.


Synchronization...
Modern operating systems offer a wide range of means for synchronizing parallel processes. The
simplest way to synchronize simultaneously executed code sections is to have one thread wait for
another to finish. The same mechanism underlies operating system support for asynchronous
operations, such as file input/output or data exchange over a network. Wait functions can also be used
for mutual notification about events (figure 2).
Figure 2 - An event is an operating system object that can switch its state from "nonsignaled" to "signaled"

For example, when developing a bank's trading-day system, an event can be registered to announce the
arrival of funds in a client's account. The thread that executes the client's payment orders can wait for
this event.
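
A WIN32 sketch of this idea: one thread waits on an event object while another signals it (the payment
scenario is, of course, invented):

    #include <windows.h>
    #include <stdio.h>

    static HANDLE g_funds_arrived;          /* the event object */

    static DWORD WINAPI payment_thread(LPVOID arg)
    {
        /* block until another thread switches the event to "signaled" */
        WaitForSingleObject(g_funds_arrived, INFINITE);
        puts("funds arrived - executing the client's payment order");
        return 0;
    }

    int main(void)
    {
        g_funds_arrived = CreateEventA(NULL, FALSE, FALSE, NULL);  /* auto-reset */
        HANDLE t = CreateThread(NULL, 0, payment_thread, NULL, 0, NULL);

        Sleep(100);                         /* pretend money is being received */
        SetEvent(g_funds_arrived);          /* announce the event */

        WaitForSingleObject(t, INFINITE);
        CloseHandle(t);
        CloseHandle(g_funds_arrived);
        return 0;
    }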

If different program sections may modify some variable simultaneously, that variable must be
protected. For example, if one thread calculates a game scene and another repaints it, the repainting
should be performed only after all the necessary calculations are done. For this purpose operating
systems introduce a special object, the mutex (from "mutual exclusion").




Figure 3 - A mutex is a system object that can be in one of two internal states: "free" or "busy".

Before a thread begins to change the variable, it must capture the mutex corresponding to that
variable. If it succeeds, the mutex changes its internal state and the thread continues execution. If
another thread tries to capture the same mutex while the first thread is working with the variable, it is
refused and has to wait until the first thread releases the mutex (figure 3). Thus it is guaranteed that at
any given moment only one code section works with the variable.
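
A sketch of how a POSIX mutex would repair the billing example from above (the amounts are invented;
each thread captures the mutex before touching the balance):

    #include <pthread.h>
    #include <stdio.h>

    static long S = 100;                    /* shared account balance */
    static pthread_mutex_t S_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *charge_off(void *arg)      /* debit thread */
    {
        pthread_mutex_lock(&S_lock);        /* capture the mutex, or wait */
        S = S - 30;                         /* the three-step update is now safe */
        pthread_mutex_unlock(&S_lock);      /* release it for the other thread */
        return NULL;
    }

    static void *charge_on(void *arg)       /* credit thread */
    {
        pthread_mutex_lock(&S_lock);
        S = S + 50;
        pthread_mutex_unlock(&S_lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, charge_off, NULL);
        pthread_create(&t2, NULL, charge_on, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("balance = %ld (always 120)\n", S);
        return 0;
    }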

There are situations in programming when several global variables must be modified within one code
section. If that code may be executed by several threads simultaneously, it must be protected.
Specifically for such cases, operating systems let you arrange the corresponding code into a critical
section.




Figure 4 - A critical section is a program section that can be executed by only one thread at a time.

The critical section can be executed by only one thread at a time. Another thread can inquire whether
the critical section is free and, if it is not, either wait until it becomes free or do other work (figure 4).
Critical sections also differ from mutexes in that the same thread may re-enter a critical section it has
already captured as many times as it likes, whereas a thread that tries to capture a mutex it already
holds will block, and a so-called "deadlock" can occur.
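
In WIN32 terms this looks roughly as follows (a sketch with two invented shared globals;
TryEnterCriticalSection() is the "inquire instead of wait" variant mentioned above):

    #include <windows.h>

    static CRITICAL_SECTION cs;
    static long x, y;                       /* several shared global variables */

    static DWORD WINAPI worker(LPVOID arg)
    {
        EnterCriticalSection(&cs);          /* only one thread at a time gets in */
        x++;                                /* modify all the globals together */
        y--;
        LeaveCriticalSection(&cs);
        return 0;
    }

    int main(void)
    {
        InitializeCriticalSection(&cs);
        HANDLE t1 = CreateThread(NULL, 0, worker, NULL, 0, NULL);
        HANDLE t2 = CreateThread(NULL, 0, worker, NULL, 0, NULL);
        WaitForSingleObject(t1, INFINITE);
        WaitForSingleObject(t2, INFINITE);
        CloseHandle(t1);
        CloseHandle(t2);
        DeleteCriticalSection(&cs);
        return 0;
    }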

Suppose we have a set of resources of a single type, for example several windows for displaying
information or a pool of network printers. To keep track of how such resources are distributed between
threads, we use semaphores.
Figure 5 - A semaphore keeps track of how resources of a single type are distributed between threads

A semaphore is a system object that acts as a counter. Before a thread starts working with a resource
from the set, it must address the semaphore associated with that set. If the semaphore counter is
greater than zero, the thread is allowed to work with a resource and the counter is decremented by
one (figure 5). If all the resources of the set are in use, the thread is enqueued. When a resource is
freed, the thread notifies the semaphore, the counter is incremented, and the resource can be used
again.
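
A POSIX sketch: a semaphore initialized to the size of an invented printer pool, so at most three threads
can print at once:

    #include <semaphore.h>
    #include <stdio.h>

    #define PRINTERS 3
    static sem_t printers;                  /* counter = number of free printers */

    void print_job(void)
    {
        sem_wait(&printers);                /* counter > 0: take one, decrement */
        puts("printing on one printer from the pool");
        sem_post(&printers);                /* done: increment, wake a waiter */
    }

    int main(void)
    {
        sem_init(&printers, 0, PRINTERS);   /* 0 = shared between threads only */
        print_job();
        sem_destroy(&printers);
        return 0;
    }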

Sometimes it is necessary to interrupt a process so that it performs some urgent actions. Some
operating systems use signals for this purpose. A process that needs another process to execute a
special operation sends it a signal. When writing that other process, a signal-processing procedure
should be implemented: on receiving a signal the process can suspend or terminate its work, execute a
special subprogram created for such a case, or simply ignore the signal.
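
A Unix sketch of such a signal-processing procedure (the "special work" is a placeholder):

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_signal = 0;

    static void on_usr1(int sig)            /* the signal-processing procedure */
    {
        got_signal = 1;
    }

    int main(void)
    {
        struct sigaction sa;
        sa.sa_handler = on_usr1;
        sa.sa_flags = 0;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGUSR1, &sa, NULL);      /* register the handler */

        printf("pid %d: waiting for SIGUSR1...\n", (int)getpid());
        while (!got_signal)
            pause();                        /* sleep until a signal arrives */
        puts("signal received - running the special subprogram");
        return 0;
    }

Another process would send the signal with kill(pid, SIGUSR1); from a shell, kill -USR1 <pid> does the
same.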


... and interaction
As said above, to solve a common task together, processes may need not only to coordinate their work
but also to exchange information, for example intermediate results. For this purpose you can use
shared main memory, information exchange through a file, data transfer through unidirectional pipes,
or network mechanisms (figure 6).
Figure 6 - Ways of exchanging data between processes

The quickest and most obvious way for simultaneously executed code sections to exchange information
is a shared area of main memory. One process writes data into the memory, the other reads them, and
vice versa. But in this case you must synchronize the processes by one of the ways mentioned above.
And while threads can exchange data through shared memory directly, processes must first request the
necessary memory segment from the operating system and agree on how to use it.
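
A POSIX sketch of that "request a segment first" step (the segment name "/demo_shm" is invented, and
wait() stands in for real synchronization; on older Linux systems, link with -lrt):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* request a named segment from the operating system */
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, sizeof(long));
        long *shared = mmap(NULL, sizeof(long), PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);

        if (fork() == 0) {                  /* child sees the same memory */
            *shared = 42;                   /* write an intermediate result */
            return 0;
        }
        wait(NULL);                         /* crude synchronization for the sketch */
        printf("read from shared memory: %ld\n", *shared);
        shm_unlink("/demo_shm");
        return 0;
    }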

The next method of information exchange is pipes. A pipe is a system object that, as a rule, transfers
information in one direction. The best-known examples of pipes are the standard input/output streams
(stdin, stdout). The stdout of one process can be directed into the stdin of another; after that, the
information written into the stdout of the first process can be read by the second. Operating systems
also allow us to create additional pipes. Pipes are manipulated with functions whose syntax is similar to
that of file operations.
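
A sketch of an additional pipe between a parent and a child process; note that the read() and write()
calls are exactly the file-operation functions mentioned above:

    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];                         /* fds[0] - read end, fds[1] - write end */
        pipe(fds);
        if (fork() == 0) {                  /* child: reads from the pipe */
            char buf[64];
            close(fds[1]);
            ssize_t n = read(fds[0], buf, sizeof buf - 1);
            buf[n] = '\0';
            printf("child got: %s\n", buf);
            return 0;
        }
        close(fds[0]);                      /* parent: writes into the pipe */
        const char *msg = "intermediate result";
        write(fds[1], msg, strlen(msg));
        close(fds[1]);
        wait(NULL);
        return 0;
    }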

You can exchange data through files as well. Modern operating systems buffer the information involved
in file operations, so this method is reasonably efficient. From a security viewpoint, however, it is
inferior to those described above, as an unauthorized application can get access to the file used for
inter-process exchange. Before using this method you should study how file operations are
implemented in your particular operating system and take measures to protect the data.

Duplex data exchange between processes can be implemented with the operating system's network
facilities. Using sockets, two processes can establish a channel between them and transfer data just as
a browser and an HTTP server do. The format of the transferred data can be absolutely anything; the
point is that the processes follow the same agreed exchange procedure, the protocol. Moreover,
nothing prevents these processes from being executed on different computers.
And with this last remark we pass on to the next section.


Supercomputers - everywhere!
Speaking about parallel execution, we have so far assumed a computer with several processors. But
such a computer is not the only platform for executing parallel programs. Consider the simplest
computational network: ten ordinary single-processor computers connected by Fast Ethernet over
twisted pair. Looks familiar, doesn't it? What is each computer, in fact? Generally speaking, it is a
processor with main and disk memory available to it - a perfect place to execute one process. What if
on each of the ten computers we simultaneously launch programs that solve subtasks of cracking the
password to an archive? Obviously the task will take much less time than on one computer. And what
if the network consists not of 10 but of 100 computers? And is linked to the Internet?

This example shows that even small local networks have great potential computational power. And
while you can hardly use this power for computer games, it will serve CAD systems well enough. A
group of computers united by a data-transfer bus into a single computational system is called a cluster
system, or cluster. You only need to write a special program that allows information exchange between
its simultaneously executed copies (processes) over the local network. Generally speaking, such a
program may have a traditional client-server architecture, i.e. be able both to send messages to its
neighbors and to receive messages from them. But there are also ready-made solutions that simplify
the development of programs for cluster systems, for example MPI and PVM. They give the
programmer means for both point-to-point and collective interaction between processes executed on
the nodes of the cluster system, as well as methods for synchronizing them.

The MPI (Message Passing Interface) specification offers a programming model in which a program
spawns several processes that interact by calling message-passing and message-receiving subprograms.
Its implementations are subprogram libraries usable from the C/C++ and Fortran languages. When an
MPI program is launched, a fixed number of processes is created. The MPI specification was created as
an industry standard, so all its facilities are focused on achieving high performance on symmetric
multi-processor systems and homogeneous cluster systems (supercomputers). There is a free
implementation of MPI, MPICH (MPI CHameleon), whose Windows and Linux versions are available for
download here: http://www.viva64.com/go.php?url=405. Installed on a PC, MPICH lets you develop
and debug programs that can later run without modification on clusters or supercomputers.
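
As a sketch of this model, here is a minimal MPI program in C in which every non-zero process solves an
invented stand-in subtask and sends its partial result to process 0 (compile with mpicc, run with
mpiexec):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);             /* a fixed number of processes starts */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                    /* process 0 collects the results */
            int total = 0, part;
            for (int i = 1; i < size; i++) {
                MPI_Recv(&part, 1, MPI_INT, i, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                total += part;
            }
            printf("combined result: %d\n", total);
        } else {                            /* the others solve their subtasks */
            int part = rank * rank;         /* stand-in for a real subtask */
            MPI_Send(&part, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }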

Unlike MPI, the PVM (Parallel Virtual Machine) system for developing and executing parallel programs
was created within a research project aimed at heterogeneous computing complexes. It allows you to
quickly unite a heterogeneous set of network-connected computers into a single computational
resource, the "parallel virtual machine". The computers may have different architectures and run
different operating systems. The PVM system is a set of libraries and utilities for developing and
debugging parallel programs and for controlling the configuration of the virtual machine. The C/C++
and Fortran languages are supported. The configuration of the computing complex can change
dynamically as nodes are excluded and new ones added. This universality comes at the cost of some
performance loss compared with MPI.

Thus both MPI and PVM let you turn your organization's local network into a powerful computing
machine able to solve complex tasks, without much effort.

We must admit that parallel computational systems are very common nowadays. Operating systems
provide developers with the necessary low-level services, while for applied tasks there are proven,
ready-made tools in the form of specialized utilities and libraries for popular high-level programming
languages. A programmer should master the methods of parallel program development to keep up
with modern trends.
