transactions are related to two require-
ments of HEP experiments:

• the recombining of related data from components executing in parallel must be hidden, and
• when a single component rejects an event, the event and all its associations must be rejected.

Using this transaction concept simplifies the programming effort. Also, because we base CoCa on a general approach, our model can be applied to areas outside HEP, such as radar tracking. Following here, we describe CoCa's relationship to the database model and basic HEP programming principles as implemented in our CoCa model. We then present results from our study comparing a CoCa prototype with other models on a Sun SC2000 and a 64-node Meiko CS-2 computer.

Figure 1. A cross section of an event. Crosses show where a particle passing has been detected; shaded areas in the outer circles mark particles' energy depositions.
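The global-rejection requirement can be illustrated with a small sketch: if every job carries the number of the event it derives from, one component's rejection can discard the event and all its associations in a single step. The JobSpace type and its operations below are invented for illustration and are not CoCa's actual interface:

```cpp
#include <map>
#include <string>

// Illustrative only: a job space keyed by event number. Rejecting an
// event removes every job derived from it in one step, so no single
// component has to track its siblings.
struct JobSpace {
    std::multimap<int, std::string> jobs;  // event number -> derived job

    void put(int event, const std::string& job) {
        jobs.insert({event, job});
    }
    // A single component's rejection discards all associated jobs.
    void reject(int event) {
        jobs.erase(event);
    }
    std::size_t pending(int event) const {
        return jobs.count(event);
    }
};
```

Because the key is the event number, the recombination of related data stays hidden behind the container, which is the flavor of transparency the two requirements above ask for.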
Event reconstruction

CoCa is founded on properties of HEP-event-reconstruction programs, which translate events produced by event detectors into parameters that describe the physical properties of elementary particles. The 40-MHz rate of CMS events makes it impossible to store all events on mass storage. Because decisions to reject or store an event are based on reconstruction program results, events must be reconstructed—both in real time and in stages—immediately after they are produced.

Before a full reconstruction is performed, a three-level trigger system uses physical-property requirements to select interesting events and thus decrease the event rate. The level-1 trigger reduces the event rate from 40 MHz to 100 kHz. Because of these high rates, the level-1 trigger must be implemented in hardware. Triggers for levels 2 and 3 are implemented in software executed by 1,000 processors in a farm. Level 2 deals with an event rate of 100 kHz, uses only 10–25% of an event, and has a deadline as short as 10 ms. Level 3 considers the complete event and brings the event rate down from 2 kHz to 100 Hz, permanently storing 1 Pbyte of data each year.

Figure 1 shows a cross-section from an existing detector and four tracks produced by charged particles (CMS events will contain on the order of 300–500 tracks). The crosses on the tracks mark hits—locations where a particle passing has been detected. The shaded areas in the outer circles mark the particles' energy depositions in the calorimeter. In most cases, a particle's identity can be determined from the track curvature and its energy deposition.

Figure 2. Event-reconstruction modules and their data dependencies. (Modules shown: translation, track reconstruction, particle tracing, and vertex calculation, with calorimeter cluster and shower analysis; input: calibration constants; output: accepted event data.)

Figure 2, which we derived from work
April–June 1999 39
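The quoted trigger rates imply large, easily checked rejection factors. The few lines of C++ below make the bookkeeping explicit; the figure of 10⁷ live data-taking seconds per year used to estimate an average event size is our assumption, not a number from the article:

```cpp
// Rejection factor between two stages of the trigger chain.
double rejection_factor(double rate_in_hz, double rate_out_hz) {
    return rate_in_hz / rate_out_hz;
}

// Average event size implied by a yearly volume and an output rate.
// The default of 1e7 live seconds per year is an assumption for
// illustration; the article states only the rates and the 1-Pbyte total.
double avg_event_bytes(double volume_bytes, double rate_hz,
                       double live_seconds = 1e7) {
    return volume_bytes / (rate_hz * live_seconds);
}
```

With the article's numbers, level 1 rejects by a factor of 400 (40 MHz to 100 kHz), level 2 by 50 (100 kHz to 2 kHz), and level 3 by 20 (2 kHz to 100 Hz), for an overall factor of 400,000; under the stated live-time assumption, 1 Pbyte per year at 100 Hz corresponds to roughly 1 Mbyte per stored event.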
Parallelization for HEP

Like other applications—such as those used for geographical data or radar calculations—HEP uses pattern recognition. Indeed, some striking similarities with radar applications exist, such as crossing-point calculations or the problem of distinguishing two almost parallel tracks that intersect in some projections.

The incoming data represent measurements distributed over a large detector volume. For example, the event, input to the CMS online event-reconstruction program, consists of a set of 10⁶ spatial coordinates within the detector volume. About 300–500 subsets represent coordinates of as many curved lines in space. Lines that intersect with the detector volume surface are terminated by a cluster (a set of tightly bunched coordinates that represent a particle's energy loss in specially designed detectors in the outer installation volume; see Figure 1 in main text). The program's output is a binary decision: "throw-away" or "keep." An event is kept when 2–10 curves out of the possible 500 interact with the desired physics properties. A sequential program typically takes about 100 ms on a future 10⁴-MIPS processor to reach such a decision. The latency of 100 ms can be reduced by parallelizing the treatment of one event.

A first parallelization is indicated by dividing the detector volume into a set of subvolumes. The detector installation's circular symmetry suggests a division in cylindrical sectors. In all subvolumes, the local coordinates are inspected in parallel to construct line segments or (part of) a cluster. Then, another, coarser, spatial division allows the parallel recombination of segments to tracks. Calculation times depend on the number of measurements in a subvolume.

The next stage is to make an initial guess of the particle characteristics. The experimental equipment is designed such that one to five possibilities must be tried for each track. The particles are tried out by retracing all possibilities of all found tracks in parallel. The retracing yields a probability that a proposed particle is indeed responsible for the measured track. Some tracks have a common origin that you can find by calculating the found tracks' probability of intersection. For tracks with a common origin, limitations imposed by well-established physics laws (like conservation of energy) further reduce the possible particle combinations. Calculation times depend on physical characteristics of traversed volumes.

As particle reconstruction progresses, knowledge about the event's physics content increases. At several points, enough knowledge is gathered to reject an event as uninteresting. The goal of the calculation is to retain as many interesting events as possible (process efficiency) and reject as many uninteresting events as possible (rejection rate). Geometrically, a high parallelization is possible at the start of the reconstruction process. Later parallelization is determined by the event characteristics: the number of measured tracks and possible particle combinations. At the end of the process, fewer than eight independent interactions are treated, which determines whether the event will be kept or rejected.
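The cylindrical-sector division described above can be sketched as a simple mapping from a hit's transverse coordinates to a sector index, so that hits in different sectors can be handed to different processors. This is an illustrative sketch, not the experiment's actual code; the function name and sector count are invented:

```cpp
#include <cmath>

// Map a hit's transverse coordinates (x, y) to one of n_sectors equal
// azimuthal sectors. Hits in different sectors can then be processed
// in parallel, as the sidebar's first parallelization suggests.
int sector_of(double x, double y, int n_sectors) {
    const double pi = 3.14159265358979323846;
    double phi = std::atan2(y, x);     // azimuthal angle in (-pi, pi]
    if (phi < 0) phi += 2.0 * pi;      // shift to [0, 2*pi)
    int s = static_cast<int>(phi / (2.0 * pi) * n_sectors);
    return s == n_sectors ? 0 : s;     // guard the phi == 2*pi edge
}
```

A coarser second-stage division (for recombining segments into tracks) could reuse the same mapping with a smaller sector count.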
by Rene Schiefer and David Francis,5 shows reconstruction modules for a single event. Reconstruction starts with the translation of the detector outputs into physics quantities like positions and voltages. Tracks have to be reconstructed from the detector hits. Particle tracing fits particle trajectories to the tracks using physics laws of conservation; the vertices between the trajectories are calculated. The particle energy depositions in the calorimeter are then analyzed. Figure 2 also shows the data dependencies between modules: the output of some modules is input for those that follow. Some modules can be executed independently of each other: calorimeter calculations can be performed in parallel with track reconstruction, particle tracing, and vertex calculation. There are no feedback loops between the modules. An event is rejected if it fails to satisfy even one module's criteria.

Parallelization

Parallelization's main goals are to decrease latency and increase throughput. Constraints on an event's processing time determine latency requirements, and the experiment's data input rate determines the throughput requirements. A third parallelization goal is to increase efficiency in resource use, particularly processor resources.

For CMS reconstruction software, we use explicit parallelization. At the coarsest grain, farming is applied: data from one event are presented to one processor, and independent events are treated in parallel by many processors. This does not reduce event latency, but it does meet event throughput requirements. At a finer grain, algorithmic parallelization can be applied: calorimeter results can be treated in parallel with track reconstruction, particle tracing, and vertex calculation.

Within these three sequentially executed modules, geometric data parallelism can reduce latency in three ways. First, the experiment's circular symmetry allows data detected in different cylinder segments to be recombined in parallel. Second, parallel tracing can be used to track particles belonging to different tracks, and, third, the coordinates of multiple vertices from which tracks originate can be calculated in parallel.

Parallelization for HEP is complicated because, in the reconstruction program,

• a module's processing time depends on the data it treats,
• a single module should reject an event as early as possible to prevent further processing,
• data produced by one module might be needed by others, and
• data input rates can vary.

The "Parallelization for HEP" sidebar explains these issues in more detail.

Data are pending when they are ready to be processed, but no processor has started processing them. Pending data increase latency. If processors are available and large amounts of data are pending, the workload is unevenly distributed across the network. If processors are idle and no data are pending, the implementation is inefficient and increases processor costs.

Ideally, there should be no pending data and no idle processors. When all application characteristics are known and constant—such as module calculation time and data input rate—you can, in principle, calculate a distribution that minimizes pending data for a given processor configuration. Unfortunately, both mod-
40 IEEE Concurrency
Figure 3. An example of module-level parallelization including (a) dataflow between components and (b) data recombination. Circles represent components, arrows represent communications between the components, and capital letters denote the associated module. Numbers indicate different types of communication.
ule calculation time and data input rate are variable. A dynamic data allocation to processors is a good solution, though it increases overhead, as knowledge must be distributed to all processors. In any case, 100% processor utilization is possible only under well-controlled conditions and almost certainly leads to pending data. Therefore, processor farms are overdimensioned to accommodate data rate variations.

Programming model

A program is composed of modules; a component is a compiled and loaded instance of a module. Figure 3a shows examples of dataflow between components. Four types of communication are possible, corresponding to the numbers in Figure 3:

1. One producer passes data to one consumer, which is typical for a pipeline.
2. One producer passes different data to multiple consumers, which is typical for data parallelism.
3. Multiple producers pass data to one consumer, which is typical for geometric data parallelism when component results must be combined.
4. Multiple producers pass data to multiple consumers (a combination of types 2 and 3).

Because component execution time is data dependent, recombining data is not a straightforward process, as Figure 3b shows. Module A's component receives two related items, x1 and y1, and sends them to two module B components for treatment. The upper component finishes before the lower, and sends p1. Module A then receives two related items, x2 and y2, which it sends to two free components of module B. If no special measures are taken, Module C's component might receive p1, q2, p2, and q1 in this order and recombine completely unrelated data. In CoCa, we deal with this problem by using the communication layer to hide it from the programmer, such that the component will recombine p1 with q1 and p2 with q2.

Traditionally, a communication layer takes care of correct point-to-point data transport. Our model provides additional communication-layer functionality, which reduces code complexity at the application level. It also provides

• dynamic communications, in which participating partners are determined at runtime;
• recombination of related data that are distributed over multiple components; and
• global rejection of related data distributed over multiple components.

These properties determine the communication model we use to guide the design of the communication layer.

Communication model

In CoCa, we based the communication model on variables that can be accessed at the module level without decreasing performance. Such a model decouples components by not sending data directly from one component to another. Two modules communicate by writing and reading a variable of a given type: a producer writes a value defined by the variable type, and a consumer reads this value. This is possible because the data structure (variable type) the components exchange remains constant. Because communication is specified at the abstract module level, the concrete producer component does not know about the consumer component and vice versa. At runtime, CoCa decides which producer communicates with which consumer.

The programmer constructs a set of modules and defines typed communication variables that the modules read and write. The component distribution, hidden from the programmer, is performed during a separate configuration phase. A value stored into a communication variable is called a job. Jobs are stored into and read from the job space.

Components write jobs into job space, which is implemented using a set of lists. For a shared-memory implementation, there are as many lists as there are communication variable types. For distributed-memory systems, there can be one or more lists per type. At most, there is one list per variable type in the memory of a given processor.

The closest existing parallel-programming model to our own is Linda, as writing and reading values to and from variables is equivalent to storing and retrieving tuples. The sidebar "Linda and CoCa: comparison" describes Linda's model in more detail.

CoCa and the database transactions model

A database's purpose is to store and retrieve data efficiently and conveniently. Storing and retrieving data is done via transactions. A transaction is composed of a set of actions that access individual data items. It should obey the so-called ACID constraints:

• Atomicity: a transaction finishes completely or aborts without leaving an effect.
• Consistency: a transaction executed on a consistent database should leave the database in a consistent state.
Linda and CoCa: comparison

CoCa uses job space, which is based on the tuple space paradigm introduced by the Linda1 language. A comparison of tuple space with job space clarifies the differences between Linda and CoCa. A tuple is a sequence of typed fields; a field can have one of three types: integer, real, or string. Each field in the tuple contains one value that is compatible with the associated type. For example, the sequence <7, 1.23, "help"> is a tuple with type declaration integer, real, and string. Linda freely deposits and retrieves tuples of a declared type in tuple space.

Linda is a coordination language, coordinating intercomponent communications in a parallel program. It is sometimes classified as a structured, content-addressable distributed-shared-memory system. Its aim is to abstract from any specific machine architecture. A producer can communicate data by putting them into the tuple space; a consumer can use a pattern-matching primitive to take them out of the tuple space. This establishes asynchronous communication and very flexible producer-consumer identification: the producer does not know the consumer and vice versa. This flexibility lets multiple consumers access the same data. At runtime, the system configurer can decide which and how many consumers access the data. They can also change the consumer set dynamically without reprogramming the producer module. Linda's large expressive power lets you develop very compact parallel applications that can run without modification on various hardware architectures. Linda also hides the hardware architecture and component identities from the application code. Unfortunately, Linda's drawback is performance: compared with other paradigms, accessing Linda's unstructured tuple space can require numerous searches and long communication times with large variations.

CoCa differs from Linda in that it is motivated more by performance requirements than general functional requirements, in two ways. First, search time for jobs in CoCa is shorter than the search time for tuples in Linda, which has an unstructured tuple space. CoCa uses various allocation strategies to partition data and distribute values of the same type among computer processors. Second, Linda features dynamic tuple types: when a new type of tuple is inserted into the tuple space, a new tuple type is created. In our model, the set of possible variable types is defined by the specification of the communication variables.

Reference
1. R. Bjornson et al., Linda, The Portable Parallel Language, Tech. Report YALE/DCS/RR-520, Yale Univ., 1987.
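The sidebar's performance contrast can be sketched in a few lines: a Linda-style consumer must scan one unstructured store and inspect tuples until a match is found, while a CoCa-style consumer indexes directly into the list for its declared type. The types and helpers below are illustrative, not the real Linda or CoCa implementations:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative contrast between an unstructured tuple store and a
// per-type job space. Neither is the real Linda or CoCa code.
struct Tuple { std::string type; int value; };

// Linda-style: linear scan over all tuples; returns how many tuples
// had to be inspected before the first match.
int scan_count(const std::vector<Tuple>& space, const std::string& type) {
    int inspected = 0;
    for (const auto& t : space) {
        ++inspected;
        if (t.type == type) break;
    }
    return inspected;
}

// CoCa-style: one list per declared variable type; the type name
// selects the list directly, with no scan over unrelated jobs.
using TypedLists = std::unordered_map<std::string, std::vector<int>>;

int first_job(const TypedLists& space, const std::string& type) {
    return space.at(type).front();  // direct bucket access
}
```

The scan cost grows with the number of unrelated tuples in the store, which is the source of the long and highly variable search times the sidebar attributes to an unstructured tuple space.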
• Isolation: a transaction cannot access results of uncompleted transactions.
• Durability: the results of modifications to the database should be ensured regardless of underlying hardware failures.

The isolation property is generally realized with well-known concurrency control techniques such as two-phase locking and two-phase commit.6 These mechanisms are based on transaction ordering. If two transactions access the same item and one of these accesses is a write access, then the accesses of the individual transactions must be ordered in time according to a proposed transaction order.

Depending on a transaction's characteristics, two-phase locking and commit can increase overhead due to additional communication between processors. When transactions have certain properties—such as, only one transaction writes to a given variable, or the dataflow between the transactions is acyclic—then isolation can be realized without additional communication overhead.7

To clarify the relationship between CoCa and the database transaction model, consider the reconstruction of particle traces originating in one vertex as one transaction. In terms of isolation, the results of several particle-tracing components are combined by a vertex-calculation component to calculate the traced particles' intersection point. Isolation occurs in that the vertex-calculation component does not access data belonging to other vertices. In terms of atomicity, when one component decides that one trace does not satisfy certain physics criteria, none of the reconstruction's results associated with the vertex are made visible.

These properties motivated us to integrate atomicity and isolation properties into the CoCa model. Generally, reconstruction-software designers view database systems as necessary but slow. This slowness is related to the durability property: data are stored on permanent storage, which is typically slow. Dropping the durability requirement removes the permanent-storage requirement; data can thus be stored in the computer's faster main memory.

In addition, thanks to certain reconstruction program characteristics—no dataflow loops, independent and uniquely identifiable events, and components that write only one value into a variable for each event—we could implement database properties with little overhead because no locking of multiple variables is required.7

The CoCa model

The CoCa model lets programmers write modules that are independent of the target hardware architecture. The components are mapped onto the parallel computer using a separate mechanism.

COMPONENT ALLOCATION
In CoCa, the configurer allocates components and variables to processors through a data-definition language file. The DDL permits hardware independence in components while preserving good performance, as it lets us specify hardware-dependent information in the DDL and take advantage of the hardware features.

Shared-memory computers have built-in, dynamic component distribution. On distributed-memory computers, components must be explicitly assigned to processors, and the values written by the components are distributed among the processors where value-consuming components are located. We decided to equip CoCa with built-in data-distribution strategies for load-balancing purposes. A strategy can be specified for each variable type. For our HEP application, we selected three strategies.

• Round robin: the component stores and retrieves a variable type's value in
a circular fashion into and out of a fixed set of lists.
• Range partitioning: a function's range is divided into contiguous segments, each of which is associated with a list location.
• Hash partitioning: a hash function, associated with a key, returns the list location.

In the DDL file, a configurer describes the processors that store values of a given variable type, the distribution strategy associated with the variable type, and the mapping of components to processors. Because CoCa functions have been added to standard programming languages, such as C, C++, and Fortran, a module is written in a known programming language; components are activated by allocating a process (as in the SC2000 shared-memory machine) or a thread (as in the CS-2 distributed-memory machine).

CoCa provides several methods for allocating components to processors on distributed-memory machines. The processor allocation of lists that create the job space is a major issue. CoCa can produce two list accesses:

• Push mode, in which the inserting component stores the job into the list on processors containing the destination component; the destination component obtains the job from its local list.
• Pull mode, in which the inserting component stores the job into the list on its own processor; the receiving component searches a set of processors to find a list that contains an appropriate job.

Given these methods, components can select a particular job by specifying selection criteria on the job-fields' values. When few values are specified, a component has a high selectivity. A component's selectivity will determine whether push mode or pull mode can be used. When the selectivity is low, jobs can be distributed evenly over all consuming components. Push-mode access with a round-robin strategy is often the most efficient combination. When the selectivity is high, push mode with range partitioning is appropriate if the prediction based on range partitioning is likely to be correct. If not, pull-mode access should be used.

A range-partitioning function (which can often be defined in HEP) is specified in the DDL file. Predicates on the job-fields' values are defined for modules. These predicates are stored as compiled functions with arguments and distributed over the network. The DDL file specifies the mapping from predicate-argument value to processor and distributes the mapping functions and the jobs over processors. When a job is stored in job space, CoCa evaluates the mapping function and sends the job to the corresponding processor.

COMMUNICATION
To use CoCa, platforms must provide two synchronous operations: Send and Receive. For implementation on the Meiko CS-2 machine, we used the Elan communication library.

For each job, two messages are sent: the first contains the job type and size; the second, the actual job. This creates an appropriately sized memory area for the incoming job. Efficiency is high because when a job arrives, the receiving processor is always ready to receive, and a good overlap of communication and computation can be maintained.

Components' asynchronous queue access leads to overhead generated by

• the mutual exclusion of components that simultaneously access a list, and
• a component waiting when a list is empty.

In our experiments, we found that mutual exclusion implemented with operating-system synchronization leads to an unacceptable overhead—around hundreds of microseconds. For example, when list access is controlled by a lock, testing a free lock can take a few microseconds, and locking a lock or testing a locked lock can take a few hundred microseconds, whereas list access itself takes only 10 to 20 µs.

Each list uses two Booleans: one that indicates job availability, and the other, list availability. Because access to the Boolean and the list requires more than one statement, a component might conclude that no component is accessing the list while another component actually accesses the list. To safeguard against this, one lock per list is used. In most cases, components do "busy waiting" on the Boolean and find the lock free. Following is an example of the algorithm in pseudocode:

    VAR
        empty   : boolean
        exclude : boolean
        joblist : lock

    void Insert(job, list) {
        lock(joblist)
        put(list, job)
        empty := False
        unlock(joblist)
    }

    job Retrieve(list) {
        while (empty or exclude) {skip}
        exclude := True
        lock(joblist)
        job := get(list)
        if empty(list) {empty := True}
        exclude := False
        unlock(joblist)
        return job
    }

To access a list, three procedures are assumed: empty(), get(), and put(). The lock() and unlock() procedures provide the locking functionality on the lock joblist. The empty() procedure returns True when list is empty and False when it is filled with one or more items. Two Boolean variables, empty and exclude, are also provided. The component waits with the while loop if another component is accessing the list or the list is empty. The moment the list is filled, empty is set to False and the waiting component can proceed. Of course, on occasion, a retrieving component might be interleaved with an inserting component in the interim between the while loop and the lock
setting. The lock is set to prevent this possibility.

ISOLATION
We use isolation to recombine functionally related jobs. For example, tracks originate in a common vertex. For each track, a particle is suggested. For the n tracks, n jobs are inserted in job space to calculate the particle traces and calculate the probability that these particles created the observed tracks. The n constructed traces are recombined in one component to calculate the probability of a given particle combination.

Figure 4 shows a possible scenario. A job arrives at a module A component, which then calculates and stores three related jobs that are read by three components of modules B and E. The module B components each create two jobs read by components of module D. One module F component reads the jobs originating from one module B component. The final results are read by one module G component.

Figure 4. A job-identifier allocation. Components are identified by module name and component number; jobs, by a time stamp and module counters.

To assure that components recombine the related jobs, the jobs are identified by a time stamp and module counters. Components are identified by their module name and component number. When a job arrives in the system, it is assigned a time stamp. Each component that reads one job and produces more than one job postfixes its component number to the job identifier. A component that recombines jobs acts in two stages. In stage one, the component specifies a job with a predicate on the job contents and any identifier. In stage two, it specifies a job with the same predicate and the identifier of the first received job. Because jobs are allocated based on the job identifier, all related jobs are typically sent to the same component. This assures that jobs are efficiently recombined on a distributed-memory machine.

Application example

Figure 5 shows an example CoCa application that uses the C++ binding (bindings also exist for C and Fortran). The parameterized class of CoCa is defined; objects of class T that must be communicated between modules are defined as a CoCa class with parameter T. Two methods, in and out, are used in the example. The content of out objects is available to other components; in indicates that a component wants to receive the contents of an object from the same class.

For our example, we defined event, track, fit, and vertex objects. An event is generated by the hardware and read by the HW module. The HW module passes the event object to a Tracker module that recognizes tracks and stores them into the track object. The track class is read by a Fitter module that fits a line through the measured track. An Intersector module collects all lines and calculates a vertex. The vertex can be used by another module for further treatment if needed.

Ntracks is an application-dependent constant. Although CoCa allows a variable number of tracks, for the sake of clarity we show a fixed number, Ntracks, of tracks here. CoCa assures that an Intersector component receives the lines of one given event. Intersector does not specify which event this is. The first received track of an event determines the event number. A suitable data-distribution strategy in the DDL file—such as range partitioning on event number—assures that tracks of the same event are sent directly to a single component.

Performance

We implemented a prototype of CoCa on a 20-node Sun SC2000 and a 64-node Meiko CS-2 computer. Table 1 shows the total execution time of producer-consumer communication on CoCa, the native communication software, and the Parallel Virtual Machine8 and Message Passing Interface9 message-passing paradigms.

Table 1. Communication and synchronization time (µs).

    Message size   CoCa (SC2000)   CoCa (CS-2)   Elan (CS-2)   PVM, native (CS-2)   PVM, non-native (CS-2)   MPI, non-native (CS-2)
    1 byte         47              143           33            216                  2,815                    257
    1 Kbyte        47              179           59            298                  3,490                    393

On the SC2000 shared-memory computer, CoCa establishes communication by exchanging references—the data are not physically moved. Hence, the communication time is independent of data size, as the table shows. On the CS-2 distributed-memory computer, CoCa establishes a communication bandwidth that is equal to that of the native communication software.

The overhead of our prototype on the SC2000 is mainly synchronization overhead. The performance of our prototype on the CS-2 compared well to that of native and nonnative PVM and MPI, despite the higher functionality.

We modified CPREAD,10 the sequen-
tial event-reconstruction program of the CPLEAR experiment, to run on the SC2000 and CS-2 computers using CoCa for communication between parallel parts. CoCa's event generator creates events and stores them into job space. CPREAD components then read the events from job space, reconstruct the event, and store an approved event into job space. The reconstructed events are read by three components that store the events on disk. In the DDL file, we specified push mode with round robin.

Figure 6 shows the results of the event reconstruction in relation to the number of processors on the CoCa-modified CPREAD program, compared with a theoretical scale-up where the performance of one unmodified ("reference scale-up") and one CoCa-based ("CoCa scale-up") CPREAD program is multiplied by the number of processors. With relatively few modifications to the original program, the production rate for this experiment was met on a distributed-memory machine with 28 processors, 23 of which were efficiently used.

    // first module
    void HW() {
        CoCa<Event> event;
        while (TRUE) {
            equipment(&event);  // get event data from equipment
            event.out();        // store data in job space
        }
    }

    // second module
    void Tracker() {
        CoCa<Event> event;
        CoCa<Track> track;
        while (TRUE) {
            event.in();         // get an event from job space
            for (int i = 0; i < NTRACKS; i++) {
                event.find(&track);
                track.out();    // track to job space
            }
        }
    }

    // third module
    void Fitter() {
        CoCa<Track> track;
        CoCa<Line> line;
        while (TRUE) {
            track.in();         // get track from job space
            track.fit(&line);
            line.out();         // put line in job space
        }
    }

    // fourth module
    void Intersector() {
        CoCa<Line> line[NTRACKS];
        CoCa<Vertex> vertex;
        while (TRUE) {
            for (int i = 0; i < NTRACKS; i++) {
                line[i].in();
            }                       // got all associated lines
            vertex.intersect(line); // calculate intersection point
            vertex.out();           // store vertex in job space
                                    // for later reference
        }
    }

Figure 5. An example CoCa application using a C++ binding. The content of out objects is available to other components; in indicates that a component wants to receive the contents of an object from the same class.

ALTHOUGH MOTIVATED by properties of the HEP event-reconstruction programs, the generality of the CoCa model makes it beneficial to other applications that need parallelization. In addition, CoCa's actual implementation led to a surprisingly low communication overhead when compared with the overhead of PVM and MPI.

With CoCa we have clearly shown that fine-grain parallelization of HEP programs is a viable approach. The CoCa model combines a high level of abstraction and allows parallelization with very little overhead. Our approach, using application characteristics to arrive at a low-overhead programming model, has produced a considerable payoff.
[Figure 6. Event reconstruction rates for CPREAD, CoCa-modified CPREAD, and a CoCa scale-up where the performance of one unmodified reference and one CoCa-based CPREAD program is multiplied by the number of processors. Line plot of event reconstruction rate (Hz, 100–700) against number of processors (5–25), with the CPLEAR event production rate marked; series are CoCa, CoCa scale-up, and reference scale-up.]

ACKNOWLEDGMENTS
This work is jointly funded by CERN and the Eindhoven University of Technology. Results from our experiments are input for the Dutch Science Foundation's project NFI33.3129, which focuses on the construction and performance of real-time transactions. We thank Linköping University, Linköping, Sweden, for use of their SC2000, and Philippe Bloch for the use of the CPREAD program. The CERN CS-2 was funded by the EU Esprit project P7255, GPMIMD2. We are grateful to Martin Rem from EUT for supporting this work and to several anonymous referees for valuable comments on the article. Also, Concurrency editor Keri Schreiner significantly contributed to this article's presentation.

References
1. J. Zalewski, "Real-Time Data Acquisition in High-Energy Physics Experiments," Proc. IEEE Workshop on Real-Time Applications, IEEE Computer Society Press, Los Alamitos, Calif., 1993, pp. 112–115.
2. The Compact Muon Solenoid – Technical Proposal, CMS Tech. Report CERN/LHCC 94-38, LHCC/P1, CERN, Geneva, Dec. 1995.
3. The LHC Study Group, Design Study of the Large Hadron Collider (LHC), CERN 91-03, CERN, May 1991.
4. E. Argante, CoCa: A Model for Parallelization of High-Energy Physics Software, doctoral thesis, Eindhoven University of Technology, Eindhoven, the Netherlands, 1998.
5. R. Schiefer and D. Francis, "Parallelisation of an Existing High Energy Physics Event Reconstruction Software Package," IEEE Trans. Nuclear Science, Vol. 43, No. 1, Feb. 1996, pp. 79–84.
6. H.F. Korth and A. Silberschatz, Database System Concepts, 2nd ed., McGraw-Hill, New York, 1991.
7. M.H. Graham, "How To Get Serializability for Real-Time Transactions Without Having to Pay For It," Proc. 14th IEEE Real-Time Systems Symp., IEEE Computer Society Press, 1993, pp. 56–65.
8. A. Geist et al., PVM 3 User's Guide and Reference Manual, Oak Ridge Nat'l Lab., Oak Ridge, Tenn., May 1993.
9. MPI: A Message Passing Interface Standard, Message Passing Interface Forum, Univ. of Tennessee, Knoxville, 1994.
10. CPLEAR Offline Reference Manual, CERN, 1991.

Erco Argante is a systems engineer at Ericsson Telecommunicatie B.V. in Rijen, the Netherlands. His research interests include parallel and distributed computing and object-oriented software development. Argante received his MS in physics from the Catholic University of Nijmegen and a PhD in computer science from Eindhoven University of Technology. Contact him at Ericsson Telecommunicatie B.V., division ETM/R, Ericssonstraat 2, 5121 ML Rijen, the Netherlands; etmerar@etm.ericsson.se.

Peter van der Stok is an associate professor of computer science at Eindhoven University of Technology. He previously worked at CERN, where he was responsible for the OS software of the LEP accelerator's distributed control system. His research interests are in distributed real-time systems, focusing on real-time databases. He received an MS and PhD in physics from the University of Amsterdam. He is a member of the IEEE and the ACM. Contact him at Eindhoven University of Technology, Dept. of Computing Science, P.O. Box 513, 5600 MB Eindhoven, the Netherlands; wsstok@win.tue.nl.

Ian Willers is a computer scientist on the CMS project at CERN/European Organization for Particle Physics in Geneva. At CERN, he developed techniques in cross compilers and assemblers that led to a machine-independent object model format, later developed into the IEEE standard object model format for microprocessors. His technical interests are in large databases. Willers received his PhD in computer science from Cambridge University. He is a member of the IEEE. Contact him at the European Laboratory for Particle Physics, division EP/CMC, CERN, 1211 Geneva 23, Switzerland; ian.willers@cern.ch.