Parallel Programming

CoCa: a Parallelization Model for High-Energy Physics

Peter van der Stok, Eindhoven University of Technology
Erco Argante and Ian Willers, European Laboratory for Particle Physics

Software parallelization is required to contend with the increasing scale and complexity of High-Energy Physics experiments. The authors have developed a programming model, Communication Capability (CoCa), which allows this parallelization at several levels of granularity and reduces software complexity.

High-Energy Physics is a demanding computing area because it deals in complicated problems that require large-scale, advanced hardware and software to solve.1 Even a mid-sized HEP experiment involves hundreds of interconnected computers and large, geographically dispersed teams that design hardware and software components. During such experiments, the equipment must sustain a throughput of 40 Gbytes/s. In the future, these demands will only increase.

The Compact Muon Solenoid (CMS) experiment2 of the forthcoming Large Hadron Collider (LHC)3 is a good example of HEP's increasing computing demands. In the LHC, two counter-rotating particle bunches cross every 25 nanoseconds. During a single crossing, there are, on average, 20 collisions between highly energetic particles; these collisions produce several new particles, which then pass through detectors that establish their physical parameters. Data associated with a bunch crossing are called an event. Researchers predict that CMS will produce and store 10^9 interesting events per year. That each event will occupy 1 Mbyte of storage testifies to the true scale of the problem.

HEP software performance is characterized by the rate at which it can process events—event throughput—and the time it takes to process one event—event latency. With sequential processing on a single processor, event throughput must be increased when the event processing time is larger than the average interevent arrival time. Decreasing event latency relative to sequential processing is required under certain conditions, such as when the results are used in a control feedback loop with a maximum allowed response time. It can also be useful when results are interactively examined, or when memory size and event throughput make the interval of data availability too short. Event latency is decreased by parallelizing the software that treats a single event.

We designed the Communication Capability (CoCa) programming model to parallelize software to satisfy the throughput and latency requirements of HEP experiments.4 CoCa is also hardware independent. This is important in the HEP domain, as updates are typically required during the 10- to 20-year development time and lifecycle of its large applications (more than 300,000 lines of code).

We based our CoCa design on the database transaction paradigm. The isolation and atomicity required by database


38                          1092-3063/99/$10.00 © 1999 IEEE                                          IEEE Concurrency




transactions are related to two requirements of HEP experiments:

• the recombining of related data from components executing in parallel must be hidden, and
• when a single component rejects an event, the event and all its associations must be rejected.
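These two requirements can be pictured with a small sketch: jobs tagged with an event identifier are recombined by key rather than by arrival order, and one rejection discards the event together with everything recorded for it. The sketch is illustrative only; the job format is invented, and CoCa hides this bookkeeping inside its communication layer.

```python
from collections import defaultdict

class EventAssembler:
    """Toy sketch of the two requirements: results arriving from parallel
    components are recombined per event id (hiding arrival order), and a
    single rejection discards the event and all its associations."""
    def __init__(self, parts_per_event):
        self.parts_per_event = parts_per_event
        self.pending = defaultdict(dict)
        self.rejected = set()

    def put(self, event_id, part, value):
        """Deposit one component's result; return the complete set once
        every expected part for this event has arrived."""
        if event_id in self.rejected:
            return None                      # late results of a rejected event are dropped
        self.pending[event_id][part] = value
        if len(self.pending[event_id]) == self.parts_per_event:
            return self.pending.pop(event_id)
        return None                          # still waiting for related parts

    def reject(self, event_id):
        """Global rejection: the event and everything recorded for it go."""
        self.rejected.add(event_id)
        self.pending.pop(event_id, None)
```

A consumer using this sketch never sees partial or rejected events, which is the point of both requirements.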
Using this transaction concept simplifies the programming effort. Also, because we base CoCa on a general approach, our model can be applied to areas outside HEP, such as radar tracking. In the following, we describe CoCa's relationship to the database model and basic HEP programming principles as implemented in our CoCa model. We then present results from our study comparing a CoCa prototype with other models on a Sun SC2000 and a 64-node Meiko CS-2 computer.

Figure 1. A cross section of an event. Crosses show where a passing particle has been detected; shaded areas in the outer circles mark particles' energy depositions.


Event reconstruction

CoCa is founded on properties of HEP event-reconstruction programs, which translate events produced by event detectors into parameters that describe the physical properties of elementary particles. The 40-MHz rate of CMS events makes it impossible to store all events on mass storage. Because decisions to reject or store an event are based on reconstruction-program results, events must be reconstructed—both in real time and in stages—immediately after they are produced.

Before a full reconstruction is performed, a three-level trigger system uses physical-property requirements to select interesting events and thus decrease the event rate. The level-1 trigger reduces the event rate from 40 MHz to 100 kHz. Because of these high rates, the level-1 trigger must be implemented in hardware. Triggers for levels 2 and 3 are implemented in software executed by 1,000 processors in a farm. Level 2 deals with an event rate of 100 kHz, uses only 10-25% of an event, and has a deadline as short as 10 ms. Level 3 considers the complete event and brings the event rate down from 2 kHz to 100 Hz, permanently storing 1 Pbyte of data each year.

Figure 1 shows a cross section from an existing detector and four tracks produced by charged particles (CMS events will contain on the order of 300-500 tracks). The crosses on the tracks mark hits—locations where a passing particle has been detected. The shaded areas in the outer circles mark the particles' energy depositions in the calorimeter. In most cases, a particle's identity can be determined from the track curvature and its energy deposition.

Figure 2. Event-reconstruction modules and their data dependencies. The event passes through translation, track reconstruction, particle tracing, and vertex calculation, with calorimeter (cluster) and calorimeter (shower) analysis alongside; calibration constants feed the modules, and accepted event data are the output.

Figure 2, which we derived from work


April–June 1999                                                                                                                              39
Parallelization for HEP

Like other applications—such as those used for geographical data or radar calculations—HEP uses pattern recognition. Indeed, some striking similarities with radar applications exist, such as crossing-point calculations or the problem of distinguishing two almost parallel tracks that intersect in some projections.

The incoming data represent measurements distributed over a large detector volume. For example, the event, input to the CMS online event-reconstruction program, consists of a set of 10^6 spatial coordinates within the detector volume. About 300-500 subsets represent coordinates of as many curved lines in space. Lines that intersect with the detector volume surface are terminated by a cluster (a set of tightly bunched coordinates that represent a particle's energy loss in specially designed detectors in the outer installation volume; see Figure 1 in the main text). The program's output is a binary decision: "throw away" or "keep." An event is kept when 2-10 curves out of the possible 500 interact with the desired physics properties. A sequential program typically takes about 100 ms on a future 10^4-MIPS processor to reach such a decision. The latency of 100 ms can be reduced by parallelizing the treatment of one event.

A first parallelization is indicated by dividing the detector volume into a set of subvolumes. The detector installation's circular symmetry suggests a division in cylindrical sectors. In all subvolumes, the local coordinates are inspected in parallel to construct line segments or (part of) a cluster. Then, another, coarser spatial division allows the parallel recombination of segments to tracks. Calculation times depend on the number of measurements in a subvolume.

The next stage is to make an initial guess of the particle characteristics. The experimental equipment is designed such that one to five possibilities must be tried for each track. The particles are tried out by retracing all possibilities of all found tracks in parallel. The retracing yields a probability that a proposed particle is indeed responsible for the measured track. Some tracks have a common origin that you can find by calculating the found tracks' probability of intersection. For tracks with a common origin, limitations imposed by well-established physics laws (like conservation of energy) further reduce the possible particle combinations. Calculation times depend on physical characteristics of the traversed volumes.

As particle reconstruction progresses, knowledge about the event's physics content increases. At several points, enough knowledge is gathered to reject an event as uninteresting. The goal of the calculation is to retain as many interesting events as possible (process efficiency) and reject as many uninteresting events as possible (rejection rate). Geometrically, a high degree of parallelization is possible at the start of the reconstruction process. Later parallelization is determined by the event characteristics: the number of measured tracks and possible particle combinations. At the end of the process, fewer than eight independent interactions are treated, which determines whether the event will be kept or rejected.
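The sidebar's first parallelization step—dividing the detector volume into cylindrical sectors so each sector's coordinates can be inspected independently—can be sketched as follows. The sector count and hit format are illustrative, not taken from CoCa.

```python
import math
from collections import defaultdict

def sector_of(hit, n_sectors=8):
    """Map a hit (x, y, z) to a cylindrical sector index by azimuthal angle."""
    x, y, _ = hit
    phi = math.atan2(y, x) % (2 * math.pi)    # angle in [0, 2*pi)
    width = 2 * math.pi / n_sectors           # angular width of one sector
    return int(phi // width) % n_sectors      # index 0 .. n_sectors - 1

def partition_hits(hits, n_sectors=8):
    """Group hits per sector; each group can then be inspected in parallel
    to build line segments, before a coarser division recombines segments
    into tracks."""
    sectors = defaultdict(list)
    for hit in hits:
        sectors[sector_of(hit, n_sectors)].append(hit)
    return sectors
```

Each sector's list could be handed to a separate worker; calculation time per worker then depends on the number of measurements in its subvolume, as the sidebar notes.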




by Rene Schiefer and David Francis,5 shows reconstruction modules for a single event. Reconstruction starts with the translation of the detector outputs into physics quantities like positions and voltages. Tracks have to be reconstructed from the detector hits. Particle tracing fits particle trajectories to the tracks using physics laws of conservation; the vertices between the trajectories are calculated. The particle energy depositions in the calorimeter are then analyzed. Figure 2 also shows the data dependencies between modules: the output of some modules is input for those that follow. Some modules can be executed independently of each other: calorimeter calculations can be performed in parallel with track reconstruction, particle tracing, and vertex calculation. There are no feedback loops between the modules. An event is rejected if it fails to satisfy even one module's criteria.

Parallelization

Parallelization's main goals are to decrease latency and increase throughput. Constraints on an event's processing time determine the latency requirements, and the experiment's data input rate determines the throughput requirements. A third parallelization goal is to increase efficiency in resource use, particularly processor resources.

For CMS reconstruction software, we use explicit parallelization. At the coarsest grain, farming is applied: data from one event are presented to one processor, and independent events are treated in parallel by many processors. This does not reduce event latency, but it does meet event throughput requirements. At a finer grain, algorithmic parallelization can be applied: calorimeter results can be treated in parallel with track reconstruction, particle tracing, and vertex calculation.

Within these three sequentially executed modules, geometric data parallelism can reduce latency in three ways. First, the experiment's circular symmetry allows data detected in different cylinder segments to be recombined in parallel. Second, parallel tracing can be used to track particles belonging to different tracks. Third, the coordinates of multiple vertices from which tracks originate can be calculated in parallel.

Parallelization for HEP is complicated because

• a module's processing time depends on the data it treats,
• a single module should reject an event as early as possible to prevent further processing,
• data produced by one module might be needed by others, and
• data input rates can vary.

The "Parallelization for HEP" sidebar explains these issues in more detail.

Data are pending when they are ready to be processed but no processor has started processing them. Pending data increase latency. If processors are available and large amounts of data are pending, the workload is unevenly distributed across the network. If processors are idle and no data are pending, the implementation is inefficient and increases processor costs.

Ideally, there should be no pending data and no idle processors. When all application characteristics are known and constant—such as module calculation time and data input rate—you can, in principle, calculate a distribution that minimizes pending data for a given processor configuration. Unfortunately, in the reconstruction program, both module calculation time and data input rate are variable. A dynamic data allocation to processors is a good solution, though it increases overhead, as knowledge must be distributed to all processors. In any case, 100% processor utilization is possible only under well-controlled conditions and almost certainly leads to pending data. Therefore, processor farms are overdimensioned to accommodate data-rate variations.

Programming model

A program is composed of modules; a component is a compiled and loaded instance of a module. Figure 3a shows examples of dataflow between components. Four types of communication are possible, corresponding to the numbers in Figure 3:

1. One producer passes data to one consumer, which is typical for a pipeline.
2. One producer passes different data to multiple consumers, which is typical for data parallelism.
3. Multiple producers pass data to one consumer, which is typical for geometric data parallelism when component results must be combined.
4. Multiple producers pass data to multiple consumers (a combination of types 2 and 3).

Figure 3. An example of module-level parallelization, including (a) dataflow between components and (b) data recombination. Circles represent components, arrows represent communications between the components, and capital letters denote the associated module. Numbers indicate different types of communication.

Because component execution time is data dependent, recombining data is not a straightforward process, as Figure 3b shows. Module A's component receives two related items, x1 and y1, and sends them to two module B components for treatment. The upper component finishes before the lower, and sends p1. Module A then receives two related items, x2 and y2, which it sends to two free components of module B. If no special measures are taken, Module C's component might receive p1, q2, p2, and q1 in this order and recombine completely unrelated data. In CoCa, we deal with this problem by using the communication layer to hide it from the programmer, such that the component will recombine p1 with q1 and p2 with q2.

Traditionally, a communication layer takes care of correct point-to-point data transport. Our model provides additional communication-layer functionality, which reduces code complexity at the application level. It also provides

• dynamic communications, in which participating partners are determined at runtime;
• recombination of related data that are distributed over multiple components; and
• global rejection of related data distributed over multiple components.

These properties determine the communication model we use to guide the design of the communication layer.

Communication model

In CoCa, we based the communication model on variables that can be accessed at the module level without decreasing performance. Such a model decouples components by not sending data directly from one component to another. Two modules communicate by writing and reading a variable of a given type: a producer writes a value defined by the variable type, and a consumer reads this value. This is possible because the data structure (variable type) that components exchange remains constant. Because communication is specified at the abstract module level, the concrete producer component does not know about the consumer component and vice versa. At runtime, CoCa decides which producer communicates with which consumer.

The programmer constructs a set of modules and defines typed communication variables that the modules read and write. The component distribution, hidden from the programmer, is performed during a separate configuration phase. A value stored into a communication variable is called a job. Jobs are stored into and read from the job space.

Components write jobs into job space, which is implemented using a set of lists. For a shared-memory implementation, there are as many lists as there are communication variable types. For distributed-memory systems, there can be one or more lists per type. At most, there is one list per variable type in the memory of a given processor.

The closest existing parallel-programming model to our own is Linda, as writing and reading values to and from variables is equivalent to storing and retrieving tuples. The sidebar "Linda and CoCa: comparison" describes Linda's model in more detail.

CoCa and the database transactions model

A database's purpose is to store and retrieve data efficiently and conveniently. Storing and retrieving data is done via transactions. A transaction is composed of a set of actions that access individual data items. It should obey the so-called ACID constraints:

• Atomicity: a transaction finishes completely or aborts without leaving an effect.
• Consistency: a transaction executed on a consistent database should leave the database in a consistent state.


Linda and CoCa: comparison

CoCa uses job space, which is based on the tuple space paradigm introduced by the Linda1 language. A comparison of tuple space with job space clarifies the differences between Linda and CoCa. A tuple is a sequence of typed fields; a field can have one of three types: integer, real, or string. Each field in the tuple contains one value that is compatible with the associated type. For example, the sequence <7, 1.23, "help"> is a tuple with type declaration integer, real, and string. Linda freely deposits and retrieves tuples of a declared type in tuple space.

Linda is a coordination language, coordinating intercomponent communications in a parallel program. It is sometimes classified as a structured, content-addressable distributed-shared-memory system. Its aim is to abstract from any specific machine architecture. A producer can communicate data by putting them into the tuple space; a consumer can use a pattern-matching primitive to take them out of the tuple space. This establishes asynchronous communication and very flexible producer-consumer identification: the producer does not know the consumer and vice versa. This flexibility lets multiple consumers access the same data. At runtime, the system configurer can decide which and how many consumers access the data. The configurer can also change the consumer set dynamically without reprogramming the producer module. Linda's large expressive power lets you develop very compact parallel applications that can run without modification on various hardware architectures. Linda also hides the hardware architecture and component identities from the application code. Unfortunately, Linda's drawback is performance: compared with other paradigms, accessing Linda's unstructured tuple space can require numerous searches and long communication times with large variations.

CoCa differs from Linda in that it is motivated more by performance requirements than by general functional requirements, in two ways. First, search time for jobs in CoCa is shorter than the search time for tuples in Linda, which has an unstructured tuple space. CoCa uses various allocation strategies to partition data and distribute values of the same type among computer processors. Second, Linda features dynamic tuple types: when a new type of tuple is inserted into the tuple space, a new tuple type is created. In our model, the set of possible variable types is defined by the specification of the communication variables.

Reference

1. R. Bjornson et al., Linda, The Portable Parallel Language, Tech. Report YALE/DCS/RR-520, Yale Univ., 1987.
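The sidebar's deposit and pattern-matching retrieval can be imitated in a few lines. The toy below ignores Linda's blocking semantics and its other primitives (such as rd and eval); its linear search also illustrates why access to an unstructured tuple space can be slow.

```python
class TupleSpace:
    """Toy tuple space: out() deposits a tuple; take() removes the first
    tuple matching a pattern, where None in the pattern matches any value
    (a crude stand-in for Linda's typed formal parameters)."""
    def __init__(self):
        self.tuples = []

    def out(self, *tup):
        self.tuples.append(tup)

    def take(self, *pattern):
        # Linear search over the unstructured space; real Linda's in()
        # would block until a matching tuple appears.
        for i, tup in enumerate(self.tuples):
            if len(tup) == len(pattern) and all(
                p is None or p == v for p, v in zip(pattern, tup)
            ):
                return self.tuples.pop(i)
        return None

ts = TupleSpace()
ts.out(7, 1.23, "help")           # the example tuple from the sidebar
match = ts.take(7, None, None)    # match on the first field only
```

CoCa's typed, partitioned lists replace this linear search with a direct lookup per variable type, which is the performance difference the sidebar describes.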




• Isolation: a transaction cannot access          the results of several particle-tracing        The CoCa model
  results of uncompleted transactions.            components are combined by a vertex-           The CoCa model lets programmers write
• Durability: the results of modifications        calculation component to calculate the         modules that are independent of the tar-
  to the database should be ensured               traced particles’ intersection point. Iso-     get hardware architecture. The compo-
  regardless of underlying hardware               lation occurs in that the vertex-calcula-      nents are mapped onto the parallel com-
  failures.                                       tion component does not access data            puter using a separate mechanism.
                                                  belonging to other vertices. In terms of
   The isolation property is generally real-      atomicity, when one component decides          COMPONENT ALLOCATION
ized with well-known concurrency con-             that one trace does not satisfy certain        In CoCa, the configurer allocates com-
trol techniques such as two-phase locking         physics criteria, none of the reconstruc-      ponents and variables to processors
and two-phase commit.6 These mecha-               tion’s results associated with the vertex      through a data-definition language file.
nisms are based on transaction ordering. If       are made visible.                              The DDL permits hardware indepen-
two transactions access the same item and             These properties motivated us to inte-     dence in components while preserving
one of these accesses is a write access, then     grate atomicity and isolation properties       good performance, as it lets us specify
the accesses of the individual transactions       into the CoCa model. Generally, recon-         hardware-dependent information in the
must be ordered in time according to a            struction-software designers view data-        DDL and take advantage of the hard-
proposed transaction order.                       base systems as necessary but slow. This       ware features.
   Depending on a transaction’s char-             slowness is related to the durability prop-       Shared-memory computers have built-
acteristics, two-phase locking and com-           erty: data are stored on permanent stor-       in, dynamic component distribution. On
mit can increase overhead due to                  age, which is typically slow. Dropping         distributed-memory computers, com-
additional communication between pro-             the durability requirement removes the         ponents must be explicitly assigned to
cessors. When transactions have cer-              permanent storage requirement; data            processors and the values written by the
tain properties—such as, only one                 can thus be stored in the computer’s           components are distributed among the
transaction writes to a given variable,           faster main memory.                            processors where value-consuming com-
or the dataflow between the transac-                  In addition, thanks to certain recon-      ponents are located. We decided to
tions is acyclic—then isolation can be            struction program characteristics—no           equip CoCa with built-in data distribu-
realized without additional communi-              dataflow loops, independent and uniquely       tion strategies for load-balancing pur-
cation overhead.7                                 identifiable events, and components that       poses. A strategy can be specified for
   To clarify the relationship between            write only one value into a variable for       each variable type. For our HEP appli-
CoCa and the database transaction                 each event—we could implement data-            cation, we selected three strategies.
model, consider the reconstruction of             base properties with little overhead
particle traces originating in one vertex         because no locking of multiple variables       • Round robin: the component stores
as one transaction. In terms of isolation,        is required.7                                    and retrieves a variable type’s value in


42                                                                                                                        IEEE Concurrency
  a circular fashion into and out of a fixed set of lists.
• Range partitioning: a function's range is divided into contiguous segments, each of which is associated with a list location.
• Hash partitioning: a hash function, associated with a key, returns the list location.

   In the DDL file, a configurer describes the processors that store values of a given variable type, the distribution strategy associated with the variable type, and the mapping of components to processors.
   Because CoCa functions have been added to standard programming languages, such as C, C++, and Fortran, a module is written in a known programming language; components are activated by allocating a process (as in the SC2000 shared-memory machine) or a thread (as in the CS-2 distributed-memory machine).
   CoCa provides several methods for allocating components to processors on distributed-memory machines. The processor allocation of the lists that create the job space is a major issue. CoCa provides two list-access modes:

• Push mode, in which the inserting component stores the job into the list on the processors containing the destination component; the destination component obtains the job from its local list.
• Pull mode, in which the inserting component stores the job into the list on its own processor; the receiving component searches a set of processors to find a list that contains an appropriate job.

   Given these methods, components can select a particular job by specifying selection criteria on the job fields' values. When few values are specified, a component has a high selectivity. A component's selectivity determines whether push mode or pull mode can be used. When the selectivity is low, jobs can be distributed evenly over all consuming components; push-mode access with a round-robin strategy is often the most efficient combination. When the selectivity is high, push mode with range partitioning is appropriate if the prediction based on range partitioning is likely to be correct. If not, pull-mode access should be used.
   A range-partitioning function (which can often be defined in HEP) is specified in the DDL file. Predicates on the job fields' values are defined for modules. These predicates are stored as compiled functions with arguments and distributed over the network. The DDL file specifies the mapping from predicate-argument value to processor and distributes the mapping functions and the jobs over the processors. When a job is stored in job space, CoCa evaluates the mapping function and sends the job to the corresponding processor.

COMMUNICATION
To use CoCa, platforms must provide two synchronous operations: Send and Receive. For the implementation on the Meiko CS-2 machine, we used the Elan communication library.
   For each job, two messages are sent: the first contains the job type and size; the second, the actual job. The first message lets the receiver create an appropriately sized memory area for the incoming job. Efficiency is high because when a job arrives, the receiving processor is always ready to receive, and a good overlap of communication and computation can be maintained.
   Components' asynchronous queue access leads to overhead generated by

• the mutual exclusion of components that simultaneously access a list, and
• a component waiting when a list is empty.

   In our experiments, we found that mutual exclusion implemented with operating-system synchronization leads to an unacceptable overhead—around hundreds of microseconds. For example, when list access is controlled by a lock, testing a free lock can take a few microseconds and locking a lock or testing a locked lock can take a few hundred microseconds, whereas list access itself takes only 10 to 20 µs.
   Each list uses two Booleans: one indicates job availability; the other, list availability. Because access to the Boolean and the list requires more than one statement, a component might conclude that no component is accessing the list while another component actually accesses the list. To safeguard against this, one lock per list is used. In most cases, components do "busy waiting" on the Boolean and find the lock free. Following is the algorithm in pseudocode:

   VAR
     empty   : boolean
     exclude : boolean
     joblist : lock

   void Insert(job, list) {
     lock(joblist)
     put(list, job)
     empty := False
     unlock(joblist)
   }

   job Retrieve(list) {
     while (empty or exclude)
         {skip}
     exclude := True
     lock(joblist)
     job := get(list)
     IF empty(list) {empty := True}
     exclude := False
     unlock(joblist)
   }

   To access a list, three procedures are assumed: empty(), get(), and put(). The lock() and unlock() procedures provide the locking functionality on the lock joblist. The empty() procedure returns True when the list is empty and False when it holds one or more items. Two Boolean variables, empty and exclude, are also provided. The component waits in the while loop if another component is accessing the list or the list is empty. The moment the list is filled, empty is set to False and the waiting component can proceed. Of course, on occasion, a retrieving component might be interleaved with an inserting component in the interim between the while loop and the lock
April–June 1999                                                                                                                   43
setting. The lock is set to prevent this possibility.

ISOLATION
We use isolation to recombine functionally related jobs. For example, tracks originate in a common vertex. For each track, a particle is suggested. For the n tracks, n jobs are inserted in job space to calculate the particle traces and to calculate the probability that these particles created the observed tracks. The n constructed traces are recombined in one component to calculate the probability of a given particle combination.
   Figure 4 shows a possible scenario. A job arrives at a module A component, which then calculates and stores three related jobs that are read by three components of modules B and E. The module B components each create two jobs read by components of module D. One module F component reads the jobs originating from one module B component. The final results are read by one module G component.

[Figure 4. A job-identifier allocation. Components are identified by module name and component number; jobs, by a time stamp and module counters.]

   To assure that components recombine the related jobs, the jobs are identified by a time stamp and module counters. Components are identified by their module name and component number. When a job arrives in the system, it is assigned a time stamp. Each component that reads one job and produces more than one job postfixes its component number to the job identifier. A component that recombines jobs acts in two stages. In stage one, the component specifies a job with a predicate on the job contents and any identifier. In stage two, it specifies a job with the same predicate and the identifier of the first received job. Because jobs are allocated based on the job identifier, all related jobs are typically sent to the same component. This assures that jobs are efficiently recombined on a distributed-memory machine.

Application example
Figure 5 shows an example CoCa application that uses the C++ binding (bindings also exist for C and Fortran). The parameterized class CoCa is defined; objects of class T that must be communicated between modules are defined as a CoCa class with parameter T. Two methods, in and out, are used in the example. The content of out objects is made available to other components; in indicates that a component wants to receive the contents of an object of the same class.
   For our example, we defined event, track, fit, and vertex objects. An event is generated by the hardware and read by the HW module. The HW module passes the event object to a Tracker module that recognizes tracks and stores them into the track object. The track class is read by a Fitter module that fits a line through the measured track. An Intersector module collects all lines and calculates a vertex. The vertex can be used by another module for further treatment if needed.
   Ntracks is an application-dependent constant. Although CoCa allows a variable number of tracks, for the sake of clarity we show a fixed number, Ntracks, of tracks here. CoCa assures that an Intersector component receives the lines of one given event. Intersector does not specify which event this is. The first received track of an event determines the event number. A suitable data-distribution strategy in the DDL file—such as range partitioning on event number—assures that tracks of the same event are sent directly to a single component.

   // first module
   void HW() {
       CoCa<Event> event;
       while(TRUE) {
           equipment(&event);   // get event data from equipment
           event.out();         // store data in job space
       }
   }

   // second module
   void Tracker() {
       CoCa<Event> event;
       CoCa<Track> track;
       while(TRUE) {
           event.in();          // get an event from job space
           for(int i = 0; i < NTRACKS; i++) {
               event.find(&track);
               track.out();     // track to job space
           }
       }
   }

   // third module
   void Fitter() {
       CoCa<Track> track;
       CoCa<Line> line;
       while(TRUE) {
           track.in();          // get track from job space
           track.fit(&line);
           line.out();          // put line in job space
       }
   }

   // fourth module
   void Intersector() {
       CoCa<Line> line[NTRACKS];
       CoCa<Vertex> vertex;
       while(TRUE) {
           for(int i = 0; i < NTRACKS; i++) {
               line[i].in();
           }                    // got all associated lines
           vertex.intersect(line);  // calculate intersection point
           vertex.out();        // store vertex in job space
                                // for later reference
       }
   }

Figure 5. An example CoCa application using a C++ binding. The content of out objects is available to other components; in indicates that a component wants to receive the contents of an object of the same class.

Performance
We implemented a prototype of CoCa on a 20-node Sun SC2000 and a 64-node Meiko CS-2 computer. Table 1 shows the total execution time of producer–consumer communication on CoCa, the native communication software, and the Parallel Virtual Machine8 and Message Passing Interface9 message-passing paradigms.

Table 1. Communication and synchronization time (µs).

             CoCa       CoCa     Elan     PVM native   PVM non-native   MPI non-native
             (SC2000)   (CS-2)   (CS-2)   (CS-2)       (CS-2)           (CS-2)
1 byte       47         143      33       216          2,815            257
1 Kbyte      47         179      59       298          3,490            393

   On the SC2000 shared-memory computer, CoCa establishes communication by exchanging references—the data are not physically moved. Hence, the communication time is independent of data size, as the table shows. On the CS-2 distributed-memory computer, CoCa establishes a communication bandwidth equal to that of the native communication software.
   The overhead of our prototype on the SC2000 is mainly synchronization overhead. Despite its higher functionality, the performance of our prototype on the CS-2 compared well to that of native and nonnative PVM and MPI.
   We modified CPREAD,10 the sequential event-reconstruction program of the CPLEAR experiment, to run on the SC2000 and CS-2 computers using CoCa for communication between the parallel parts. CoCa's event generator creates events and stores them into job space. CPREAD components then read the events from job space, reconstruct the event, and store an approved event into job space. The reconstructed events are read by three components that store the events on disk. In the DDL file, we specified push mode with round robin.
   Figure 6 shows the results of the event reconstruction in relation to the number of processors for the CoCa-modified CPREAD program, compared with a theoretical scale-up where the performance of one unmodified ("reference scale-up") and one CoCa-based ("CoCa scale-up") CPREAD program is multiplied by the number of processors. With relatively few modifications to the original program, the production rate for this experiment was met on a distributed-memory machine with 28 processors, 23 of which were efficiently used.

ALTHOUGH MOTIVATED by properties of HEP event-reconstruction programs, the generality of the CoCa model makes it beneficial to other applications that need parallelization. In addition, CoCa's implementation led to a surprisingly low communication overhead when compared with the overhead of PVM and MPI.
   With CoCa we have clearly shown that fine-grain parallelization of HEP programs is a viable approach. The CoCa model combines a high level of abstraction with very little parallelization overhead. Our approach, using application characteristics to arrive at a low-overhead programming model, has produced a considerable payoff.
ACKNOWLEDGMENTS
This work is jointly funded by CERN and the Eindhoven University of Technology. Results from our experiments are input for the Dutch Science Foundation's project NFI33.3129, which focuses on the construction and performance of real-time transactions. We thank Linköping University, Linköping, Sweden, for use of their SC2000, and Philippe Bloch for the use of the CPREAD program. The CERN CS-2 was funded by the EU Esprit project P7255, GPMIMD2. We are grateful to Martin Rem from EUT for supporting this work and to several anonymous referees for valuable comments on the article. Also, Concurrency editor Keri Schreiner significantly contributed to this article's presentation.

References
 1. J. Zalewski, "Real-Time Data Acquisition in High-Energy Physics Experiments," Proc. IEEE Workshop on Real-Time Applications, IEEE Computer Society Press, Los Alamitos, Calif., 1993, pp. 112–115.
 2. The Compact Muon Solenoid – Technical Proposal, CMS Tech. Report CERN/LHCC
    94-38, LHCC/P1, CERN, Geneva, Dec. 1995.
 3. The LHC Study Group, Design Study of the Large Hadron Collider (LHC), CERN 91-03, CERN, May 1991.
 4. E. Argante, CoCa: A Model for Parallelization of High-Energy Physics Software, doctoral thesis, Eindhoven University of Technology, Eindhoven, the Netherlands, 1998.
 5. R. Schiefer and D. Francis, "Parallelisation of an Existing High Energy Physics Event Reconstruction Software Package," IEEE Trans. Nuclear Science, Vol. 43, No. 1, Feb. 1996, pp. 79–84.
 6. H.F. Korth and A. Silberschatz, Database System Concepts, 2nd ed., McGraw-Hill, New York, 1991.
 7. M.H. Graham, "How To Get Serializability for Real-Time Transactions Without Having to Pay For It," Proc. 14th IEEE Real-Time Systems Symp., IEEE CS Press, 1993, pp. 65–65.
 8. A. Geist et al., PVM 3 User's Guide and Reference Manual, Oak Ridge Nat'l Lab., Oak Ridge, Tenn., May 1993.
 9. A Message Passing Interface Standard, Message Passing Interface Forum, Univ. of Tennessee, Knoxville, 1994.
10. CPLEAR Offline Reference Manual, CERN, 1991.

[Figure 6. Event reconstruction rates for CPREAD, CoCa-modified CPREAD, and a CoCa scale-up where the performance of one unmodified reference and one CoCa-based CPREAD program is multiplied by the number of processors. Axes: event reconstruction rate (Hz) versus number of processors, with the CPLEAR event production rate marked; curves: CoCa, CoCa scale-up, and reference scale-up.]

Peter van der Stok is an associate professor of computer science at Eindhoven University of Technology. He previously worked at CERN, where he was responsible for the LEP accelerator's distributed control system OS software. His research interests are in distributed real-time systems, focusing on real-time databases. He received an MS and PhD in physics from the University of Amsterdam. He is a member of the IEEE and the ACM. Contact him at Eindhoven University of Technology, Dept. of Computing Science, P.O. Box 513, 5600 MB Eindhoven, the Netherlands; wsstok@win.tue.nl.

Ian Willers is a computer scientist on the CMS project at CERN/European Organization for Particle Physics in Geneva. At CERN, he developed techniques in cross compilers and assemblers that led to a machine-independent object model format that was later developed into the IEEE standard object model format for microprocessors. His technical interests are in large databases. Willers received his PhD in computer science from Cambridge University. He is a member of the IEEE. Contact him at the European Laboratory for Particle Physics, division EP/CMC, CERN, 1211 Geneva 23, Switzerland; ian.willers@cern.ch.

Erco Argante is a systems engineer at Ericsson Telecommunicatie B.V. in Rijen, the Netherlands. His research interests include parallel and distributed computing and object-oriented software development. Argante received his MS in physics from the Catholic University of Nijmegen and a PhD in computer science from Eindhoven University of Technology. Contact him at Ericsson Telecommunicatie B.V., division ETM/R, Ericssonstraat 2, 5121 ML Rijen, the Netherlands; etmerar@etm.ericsson.se.

Cloud, Fog, or Edge: Where and When to Compute?Cloud, Fog, or Edge: Where and When to Compute?
Cloud, Fog, or Edge: Where and When to Compute?
 
Arrhenius.jl: A Differentiable Combustion Simulation Package
Arrhenius.jl: A Differentiable Combustion Simulation PackageArrhenius.jl: A Differentiable Combustion Simulation Package
Arrhenius.jl: A Differentiable Combustion Simulation Package
 
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
 
High Performance Parallel Computing with Clouds and Cloud Technologies
High Performance Parallel Computing with Clouds and Cloud TechnologiesHigh Performance Parallel Computing with Clouds and Cloud Technologies
High Performance Parallel Computing with Clouds and Cloud Technologies
 
(Im2col)accelerating deep neural networks on low power heterogeneous architec...
(Im2col)accelerating deep neural networks on low power heterogeneous architec...(Im2col)accelerating deep neural networks on low power heterogeneous architec...
(Im2col)accelerating deep neural networks on low power heterogeneous architec...
 
Pnp mac preemptive slot allocation and non preemptive transmission for provid...
Pnp mac preemptive slot allocation and non preemptive transmission for provid...Pnp mac preemptive slot allocation and non preemptive transmission for provid...
Pnp mac preemptive slot allocation and non preemptive transmission for provid...
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
Scalable Interconnection Network Models for Rapid Performance Prediction of H...
Scalable Interconnection Network Models for Rapid Performance Prediction of H...Scalable Interconnection Network Models for Rapid Performance Prediction of H...
Scalable Interconnection Network Models for Rapid Performance Prediction of H...
 
2017 18 ieee vlsi titles,IEEE 2017-18 BULK NS2 PROJECTS TITLES,IEEE 2017-18...
2017 18 ieee vlsi titles,IEEE 2017-18  BULK  NS2 PROJECTS TITLES,IEEE 2017-18...2017 18 ieee vlsi titles,IEEE 2017-18  BULK  NS2 PROJECTS TITLES,IEEE 2017-18...
2017 18 ieee vlsi titles,IEEE 2017-18 BULK NS2 PROJECTS TITLES,IEEE 2017-18...
 
Fault tolerance on cloud computing
Fault tolerance on cloud computingFault tolerance on cloud computing
Fault tolerance on cloud computing
 
Parallelizing itinerary based knn query
Parallelizing itinerary based knn queryParallelizing itinerary based knn query
Parallelizing itinerary based knn query
 
50120140505008
5012014050500850120140505008
50120140505008
 
Image transmission in wireless sensor networks
Image transmission in wireless sensor networksImage transmission in wireless sensor networks
Image transmission in wireless sensor networks
 
High Speed and Area Efficient 2D DWT Processor Based Image Compression
High Speed and Area Efficient 2D DWT Processor Based Image CompressionHigh Speed and Area Efficient 2D DWT Processor Based Image Compression
High Speed and Area Efficient 2D DWT Processor Based Image Compression
 
GRID COMPUTING
GRID COMPUTINGGRID COMPUTING
GRID COMPUTING
 
A High-Performance Campus-Scale Cyberinfrastructure for Effectively Bridging ...
A High-Performance Campus-Scale Cyberinfrastructure for Effectively Bridging ...A High-Performance Campus-Scale Cyberinfrastructure for Effectively Bridging ...
A High-Performance Campus-Scale Cyberinfrastructure for Effectively Bridging ...
 
ECP Application Development
ECP Application DevelopmentECP Application Development
ECP Application Development
 
Architecture and Performance of Runtime Environments for Data Intensive Scala...
Architecture and Performance of Runtime Environments for Data Intensive Scala...Architecture and Performance of Runtime Environments for Data Intensive Scala...
Architecture and Performance of Runtime Environments for Data Intensive Scala...
 
Enhancement of Improved Balanced LEACH for Heterogeneous Wireless Sensor Netw...
Enhancement of Improved Balanced LEACH for Heterogeneous Wireless Sensor Netw...Enhancement of Improved Balanced LEACH for Heterogeneous Wireless Sensor Netw...
Enhancement of Improved Balanced LEACH for Heterogeneous Wireless Sensor Netw...
 
Data aggregation in wireless sensor networks
Data aggregation in wireless sensor networksData aggregation in wireless sensor networks
Data aggregation in wireless sensor networks
 

Destacado

Risk in dervatives
Risk in dervativesRisk in dervatives
Risk in dervativesAkhel99
 
#ESA2012 social media workshop
#ESA2012 social media workshop#ESA2012 social media workshop
#ESA2012 social media workshopSandra Chung
 
Miriam investigacion
Miriam investigacionMiriam investigacion
Miriam investigacionmiriam2012pos
 
Historia de messi
Historia de messiHistoria de messi
Historia de messihendiodani
 
Introducing the ItsIts
Introducing the ItsItsIntroducing the ItsIts
Introducing the ItsItsKay Francis
 
Bahan sumber matematik 2
Bahan sumber matematik 2Bahan sumber matematik 2
Bahan sumber matematik 2Hani Zah
 
Weathering, soils, & erosion
Weathering, soils, & erosionWeathering, soils, & erosion
Weathering, soils, & erosionmrmolerat
 
Sulzer Thin Film Presentation July 28 2011
Sulzer Thin Film Presentation   July 28 2011Sulzer Thin Film Presentation   July 28 2011
Sulzer Thin Film Presentation July 28 2011tworcester
 

Destacado (10)

Risk in dervatives
Risk in dervativesRisk in dervatives
Risk in dervatives
 
#ESA2012 social media workshop
#ESA2012 social media workshop#ESA2012 social media workshop
#ESA2012 social media workshop
 
Miriam investigacion
Miriam investigacionMiriam investigacion
Miriam investigacion
 
Mobile computing
Mobile computingMobile computing
Mobile computing
 
Historia de messi
Historia de messiHistoria de messi
Historia de messi
 
Introducing the ItsIts
Introducing the ItsItsIntroducing the ItsIts
Introducing the ItsIts
 
Fotografia
FotografiaFotografia
Fotografia
 
Bahan sumber matematik 2
Bahan sumber matematik 2Bahan sumber matematik 2
Bahan sumber matematik 2
 
Weathering, soils, & erosion
Weathering, soils, & erosionWeathering, soils, & erosion
Weathering, soils, & erosion
 
Sulzer Thin Film Presentation July 28 2011
Sulzer Thin Film Presentation   July 28 2011Sulzer Thin Film Presentation   July 28 2011
Sulzer Thin Film Presentation July 28 2011
 

Similar a Coca1

Super-linear speedup for real-time condition monitoring using image processi...
Super-linear speedup for real-time condition monitoring using  image processi...Super-linear speedup for real-time condition monitoring using  image processi...
Super-linear speedup for real-time condition monitoring using image processi...IJECEIAES
 
Interprocedural Constant Propagation
Interprocedural Constant PropagationInterprocedural Constant Propagation
Interprocedural Constant Propagationjames marioki
 
M3AT: Monitoring Agents Assignment Model for the Data-Intensive Applications
M3AT: Monitoring Agents Assignment Model for the Data-Intensive ApplicationsM3AT: Monitoring Agents Assignment Model for the Data-Intensive Applications
M3AT: Monitoring Agents Assignment Model for the Data-Intensive ApplicationsVladislavKashansky
 
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...IDES Editor
 
Alice data acquisition
Alice data acquisitionAlice data acquisition
Alice data acquisitionBertalan EGED
 
FPGA Based Data Processing for Real-time WSN Applications:
FPGA Based Data Processing for Real-time WSN Applications: FPGA Based Data Processing for Real-time WSN Applications:
FPGA Based Data Processing for Real-time WSN Applications: Ilham Amezzane
 
Run-time power management in cloud and containerized environments
Run-time power management in cloud and containerized environmentsRun-time power management in cloud and containerized environments
Run-time power management in cloud and containerized environmentsNECST Lab @ Politecnico di Milano
 
Software rejuvenation based fault tolerance
Software rejuvenation based fault toleranceSoftware rejuvenation based fault tolerance
Software rejuvenation based fault tolerancewww.pixelsolutionbd.com
 
Detailed Simulation of Large-Scale Wireless Networks
Detailed Simulation of Large-Scale Wireless NetworksDetailed Simulation of Large-Scale Wireless Networks
Detailed Simulation of Large-Scale Wireless NetworksGabriele D'Angelo
 
Algorithm selection for sorting in embedded and mobile systems
Algorithm selection for sorting in embedded and mobile systemsAlgorithm selection for sorting in embedded and mobile systems
Algorithm selection for sorting in embedded and mobile systemsJigisha Aryya
 
Dual-resource TCPAQM for Processing-constrained Networks
Dual-resource TCPAQM for Processing-constrained NetworksDual-resource TCPAQM for Processing-constrained Networks
Dual-resource TCPAQM for Processing-constrained Networksambitlick
 
Echi isca2007
Echi isca2007Echi isca2007
Echi isca2007CAA Sudan
 

Similar a Coca1 (20)

Super-linear speedup for real-time condition monitoring using image processi...
Super-linear speedup for real-time condition monitoring using  image processi...Super-linear speedup for real-time condition monitoring using  image processi...
Super-linear speedup for real-time condition monitoring using image processi...
 
cug2011-praveen
cug2011-praveencug2011-praveen
cug2011-praveen
 
Interprocedural Constant Propagation
Interprocedural Constant PropagationInterprocedural Constant Propagation
Interprocedural Constant Propagation
 
cpc-152-2-2003
cpc-152-2-2003cpc-152-2-2003
cpc-152-2-2003
 
Networking Articles Overview
Networking Articles OverviewNetworking Articles Overview
Networking Articles Overview
 
M3AT: Monitoring Agents Assignment Model for the Data-Intensive Applications
M3AT: Monitoring Agents Assignment Model for the Data-Intensive ApplicationsM3AT: Monitoring Agents Assignment Model for the Data-Intensive Applications
M3AT: Monitoring Agents Assignment Model for the Data-Intensive Applications
 
HYPPO - NECSTTechTalk 23/04/2020
HYPPO - NECSTTechTalk 23/04/2020HYPPO - NECSTTechTalk 23/04/2020
HYPPO - NECSTTechTalk 23/04/2020
 
Grid computing & its applications
Grid computing & its applicationsGrid computing & its applications
Grid computing & its applications
 
A Novel Approach in Scheduling Of the Real- Time Tasks In Heterogeneous Multi...
A Novel Approach in Scheduling Of the Real- Time Tasks In Heterogeneous Multi...A Novel Approach in Scheduling Of the Real- Time Tasks In Heterogeneous Multi...
A Novel Approach in Scheduling Of the Real- Time Tasks In Heterogeneous Multi...
 
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...
 
Alice data acquisition
Alice data acquisitionAlice data acquisition
Alice data acquisition
 
FPGA Based Data Processing for Real-time WSN Applications:
FPGA Based Data Processing for Real-time WSN Applications: FPGA Based Data Processing for Real-time WSN Applications:
FPGA Based Data Processing for Real-time WSN Applications:
 
Run-time power management in cloud and containerized environments
Run-time power management in cloud and containerized environmentsRun-time power management in cloud and containerized environments
Run-time power management in cloud and containerized environments
 
Software rejuvenation based fault tolerance
Software rejuvenation based fault toleranceSoftware rejuvenation based fault tolerance
Software rejuvenation based fault tolerance
 
imagefiltervhdl.pptx
imagefiltervhdl.pptximagefiltervhdl.pptx
imagefiltervhdl.pptx
 
D031201021027
D031201021027D031201021027
D031201021027
 
Detailed Simulation of Large-Scale Wireless Networks
Detailed Simulation of Large-Scale Wireless NetworksDetailed Simulation of Large-Scale Wireless Networks
Detailed Simulation of Large-Scale Wireless Networks
 
Algorithm selection for sorting in embedded and mobile systems
Algorithm selection for sorting in embedded and mobile systemsAlgorithm selection for sorting in embedded and mobile systems
Algorithm selection for sorting in embedded and mobile systems
 
Dual-resource TCPAQM for Processing-constrained Networks
Dual-resource TCPAQM for Processing-constrained NetworksDual-resource TCPAQM for Processing-constrained Networks
Dual-resource TCPAQM for Processing-constrained Networks
 
Echi isca2007
Echi isca2007Echi isca2007
Echi isca2007
 

Último

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 

Último (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Coca1

  • 1. Parallel Programming

Peter van der Stok, Eindhoven University of Technology
Erco Argante and Ian Willers, European Laboratory for Particle Physics

CoCa: a Parallelization Model for High-Energy Physics

Software parallelization is required to contend with the increasing scale and complexity of High-Energy Physics experiments. The authors have developed a programming model, Communication Capability (CoCa), which allows this parallelization at several levels of granularity and reduces software complexity.

High-Energy Physics is a demanding computing area because it deals in complicated problems that require large-scale, advanced hardware and software to solve.[1] Even a mid-sized HEP experiment involves hundreds of interconnected computers and large, geographically dispersed teams that design hardware and software components. During such experiments, the equipment must sustain a throughput of 40 Gbytes/s. In the future, these demands will only increase.

The Compact Muon Solenoid (CMS) experiment[2] at the forthcoming Large Hadron Collider (LHC)[3] is a good example of HEP's increasing computing demands. In the LHC, two counter-rotating particle bunches cross every 25 nanoseconds. During a single crossing, there are an average of 20 collisions between highly energetic particles; these collisions produce several new particles, which then pass through detectors that establish their physical parameters. The data associated with a bunch crossing are called an event. Researchers predict that CMS will produce and store 10^9 interesting events per year. That each event will occupy 1 Mbyte of storage testifies to the true scale of the problem.

HEP software performance is characterized by the rate at which it can process events (event throughput) and the time it takes to process one event (event latency). Compared with sequential processing on a single processor, event throughput must be increased when the event processing time is larger than the average interevent arrival time. Decreasing event latency below that of sequential processing is required under certain conditions, such as when the results are used in a control feedback loop with a maximum allowed response time. It can also be useful when results are examined interactively, or when memory size and event throughput leave too short an interval between data availability. Event latency is decreased by parallelizing the software that treats a single event.

We designed the Communication Capability (CoCa) programming model to parallelize software so that it satisfies the throughput and latency requirements of HEP experiments.[4] CoCa is also hardware independent. This is important in the HEP domain, as updates are typically required during the 10- to 20-year development and lifecycle of its large applications (more than 300,000 lines of code).

We based our CoCa design on the database transaction paradigm. The isolation and atomicity required by database

38 | 1092-3063/99/$10.00 © 1999 IEEE | IEEE Concurrency
  • 2. transactions are related to two requirements of HEP experiments:

• the recombining of related data from components executing in parallel must be hidden, and
• when a single component rejects an event, the event and all its associations must be rejected.

Using this transaction concept simplifies the programming effort. Also, because we base CoCa on a general approach, our model can be applied to areas outside HEP, such as radar tracking. In what follows, we describe CoCa's relationship to the database model and the basic HEP programming principles as implemented in our CoCa model. We then present results from our study comparing a CoCa prototype with other models on a Sun SC2000 and a 64-node Meiko CS-2 computer.

Figure 1. A cross section of an event. Crosses show where a passing particle has been detected; shaded areas in the outer circles mark particles' energy depositions.

Event reconstruction

CoCa is founded on properties of HEP event-reconstruction programs, which translate events produced by the event detectors into parameters that describe the physical properties of elementary particles. The 40-MHz rate of CMS events makes it impossible to store all events on mass storage. Because decisions to reject or store an event are based on reconstruction program results, events must be reconstructed, both in real time and in stages, immediately after they are produced.

Before a full reconstruction is performed, a three-level trigger system uses physical-property requirements to select interesting events and thus decrease the event rate. The level-1 trigger reduces the event rate from 40 MHz to 100 kHz. Because of these high rates, the level-1 trigger must be implemented in hardware. Triggers for levels 2 and 3 are implemented in software executed by 1,000 processors in a farm. Level 2 deals with an event rate of 100 kHz, uses only 10–25% of an event, and has a deadline as short as 10 ms. Level 3 considers the complete event and brings the event rate down from 2 kHz to 100 Hz, permanently storing 1 Pbyte of data each year.

Figure 2. Event-reconstruction modules (translation, track reconstruction, particle tracing, vertex calculation, and calorimeter cluster and shower analysis, with calibration constants as input and accepted event data as output) and their data dependencies.

Figure 1 shows a cross section from an existing detector and four tracks produced by charged particles (CMS events will contain on the order of 300–500 tracks). The crosses on the tracks mark hits: locations where a passing particle has been detected. The shaded areas in the outer circles mark the particles' energy depositions in the calorimeter. In most cases, a particle's identity can be determined from its track curvature and energy deposition. Figure 2, which we derived from work

April–June 1999 | 39
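The arithmetic behind the trigger cascade can be sketched as a back-of-the-envelope calculation. The rates, deadlines, and farm size below are the figures quoted in the text; the deadline-based sizing formula is our illustration of why roughly 1,000 processors are needed, not a derivation from the article.

```python
import math

# Trigger-cascade figures quoted in the text (all rates in Hz).
BUNCH_CROSSING_RATE = 40e6    # one crossing every 25 ns
AFTER_LEVEL_1 = 100e3         # output of the hardware level-1 trigger
INTO_LEVEL_3 = 2e3
STORED = 100.0                # events permanently written to mass storage

def rejection_factor(rate_in_hz, rate_out_hz):
    """How many input events are discarded per accepted event."""
    return rate_in_hz / rate_out_hz

def farm_size(service_time_s, input_rate_hz):
    """Minimum number of processors whose combined throughput matches the
    input rate: each processor handles 1/service_time events per second."""
    return math.ceil(service_time_s * input_rate_hz)

print(rejection_factor(BUNCH_CROSSING_RATE, STORED))  # 400000.0
# Level 2 sees a 100-kHz input rate with deadlines as short as 10 ms.
print(farm_size(0.010, AFTER_LEVEL_1))                # 1000
```

With the quoted 10-ms level-2 deadline and 100-kHz input rate, the sizing formula reproduces the 1,000-processor farm mentioned in the text.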
Figure 2, which we derived from work by Rene Schiefer and David Francis,5 shows reconstruction modules for a single event. Reconstruction starts with the translation of the detector outputs into physics quantities like positions and voltages. Tracks have to be reconstructed from the detector hits. Particle tracing fits particle trajectories to the tracks using physics laws of conservation; the vertices between the trajectories are calculated. The particle energy depositions in the calorimeter are then analyzed. Figure 2 also shows the data dependencies between modules: the output of some modules is input for those that follow. Some modules can be executed independently of each other: calorimeter calculations can be performed in parallel with track reconstruction, particle tracing, and vertex calculation. There are no feedback loops between the modules. An event is rejected if it fails to satisfy even one module's criteria.

Parallelization

Parallelization's main goals are to decrease latency and increase throughput. Constraints on an event's processing time determine latency requirements, and the experiment's data input rate determines the throughput requirements. A third parallelization goal is to increase efficiency in resource use, particularly processor resources.

For CMS reconstruction software, we use explicit parallelization. At the coarsest grain, farming is applied: data from one event are presented to one processor, and independent events are treated in parallel by many processors. This does not reduce event latency, but it does meet event throughput requirements. At a finer grain, algorithmic parallelization can be applied: calorimeter results can be treated in parallel with track reconstruction, particle tracing, and vertex calculation. Within these three sequentially executed modules, geometric data parallelism can reduce latency in three ways. First, the experiment's circular symmetry allows data detected in different cylinder segments to be recombined in parallel. Second, parallel tracing can be used to track particles belonging to different tracks. Third, the coordinates of multiple vertices from which tracks originate can be calculated in parallel.

Parallelization for HEP is complicated because

• a module's processing time depends on the data it treats,
• a single module should reject an event as early as possible to prevent further processing,
• data produced by one module might be needed by others, and
• data input rates can vary.

The "Parallelization for HEP" sidebar explains these issues in more detail.

[Sidebar] Parallelization for HEP

Like other applications—such as those used for geographical data or radar calculations—HEP uses pattern recognition. Indeed, some striking similarities with radar applications exist, such as crossing-point calculations or the problem of distinguishing two almost parallel tracks that intersect in some projections.

The incoming data represent measurements distributed over a large detector volume. For example, the event, input to the CMS online event-reconstruction program, consists of a set of 10^6 spatial coordinates within the detector volume. About 300–500 subsets represent coordinates of as many curved lines in space. Lines that intersect with the detector volume surface are terminated by a cluster (a set of tightly bunched coordinates that represent a particle's energy loss in specially designed detectors in the outer installation volume; see Figure 1 in main text). The program's output is a binary decision: "throw away" or "keep." An event is kept when 2–10 curves out of the possible 500 interact with the desired physics properties. A sequential program typically takes about 100 ms on a future 10^4-MIPS processor to reach such a decision. The latency of 100 ms can be reduced by parallelizing the treatment of one event.

A first parallelization is indicated by dividing the detector volume into a set of subvolumes. The detector installation's circular symmetry suggests a division in cylindrical sectors. In all subvolumes, the local coordinates are inspected in parallel to construct line segments or (part of) a cluster. Then, another, coarser, spatial division allows the parallel recombination of segments to tracks. Calculation times depend on the number of measurements in a subvolume.

The next stage is to make an initial guess of the particle characteristics. The experimental equipment is designed such that one to five possibilities must be tried for each track. The particles are tried out by retracing all possibilities of all found tracks in parallel. The retracing yields a probability that a proposed particle is indeed responsible for the measured track. Some tracks have a common origin that you can find by calculating the found tracks' probability of intersection. For tracks with a common origin, limitations imposed by well-established physics laws (like conservation of energy) further reduce the possible particle combinations. Calculation times depend on physical characteristics of traversed volumes.

As particle reconstruction progresses, knowledge about the event's physics content increases. At several points, enough knowledge is gathered to reject an event as uninteresting. The goal of the calculation is to retain as many interesting events as possible (process efficiency) and reject as many uninteresting events as possible (rejection rate). Geometrically, a high parallelization is possible at the start of the reconstruction process. Later parallelization is determined by the event characteristics: the number of measured tracks and the possible particle combinations. At the end of the process, fewer than eight independent interactions are treated, which determines whether the event will be kept or rejected.

[End sidebar]

Data are pending when they are ready to be processed, but no processor has started processing them. Pending data increase latency. If processors are available and large amounts of data are pending, the workload is unevenly distributed across the network. If processors are idle and no data are pending, the implementation is inefficient and increases processor costs.

Ideally, there should be no pending data and no idle processors. When all application characteristics are known and constant—such as module calculation time and data input rate—you can, in principle, calculate a distribution that minimizes pending data for a given processor configuration. Unfortunately, in the reconstruction program, both module calculation time and data input rate are variable. A dynamic data allocation to processors is a good solution, though it increases overhead, as knowledge must be distributed to all processors. In any case, 100% processor utilization is possible only under well-controlled conditions and almost certainly leads to pending data. Therefore, processor farms are overdimensioned to accommodate data rate variations.

Figure 3. An example of module-level parallelization including (a) dataflow between components and (b) data recombination. Circles represent components, arrows represent communications between the components, and capital letters denote the associated module. Numbers indicate different types of communication.

Programming model

A program is composed of modules; a component is a compiled and loaded instance of a module. Figure 3a shows examples of dataflow between components. Four types of communication are possible, corresponding to the numbers in Figure 3:

1. One producer passes data to one consumer, which is typical for a pipeline.
2. One producer passes different data to multiple consumers, which is typical for data parallelism.
3. Multiple producers pass data to one consumer, which is typical for geometric data parallelism when component results must be combined.
4. Multiple producers pass data to multiple consumers (a combination of types 2 and 3).

Because component execution time is data dependent, recombining data is not a straightforward process, as Figure 3b shows. Module A's component receives two related items, x1 and y1, and sends them to two module B components for treatment. The upper component finishes before the lower, and sends p1. Module A then receives two related items, x2 and y2, which it sends to two free components of module B. If no special measures are taken, module C's component might receive p1, q2, p2, and q1 in this order and recombine completely unrelated data. In CoCa, we deal with this problem by using the communication layer to hide it from the programmer, such that the component will recombine p1 with q1 and p2 with q2.

Traditionally, a communication layer takes care of correct point-to-point data transport. Our model provides additional communication-layer functionality, which reduces code complexity at the application level. It also provides

• dynamic communications, in which participating partners are determined at runtime;
• recombination of related data that are distributed over multiple components; and
• global rejection of related data distributed over multiple components.

These properties determine the communication model we use to guide the design of the communication layer.

Communication model

In CoCa, we based the communication model on variables that can be accessed at the module level without decreasing performance. Such a model decouples components by not sending data directly from one component to another. Two modules communicate by writing and reading a variable of a given type: a producer writes a value defined by the variable type, and a consumer reads this value. This is possible because the data structure (variable type) that components exchange remains constant. Because communication is specified at the abstract module level, the concrete producer component does not know about the consumer component and vice versa. At runtime, CoCa decides which producer communicates with which consumer.

The programmer constructs a set of modules and defines typed communication variables that the modules read and write. The component distribution, hidden from the programmer, is performed during a separate configuration phase. A value stored into a communication variable is called a job. Jobs are stored into and read from the job space. Components write jobs into job space, which is implemented using a set of lists. For a shared-memory implementation, there are as many lists as there are communication variable types. For distributed-memory systems, there can be one or more lists per type. At most, there is one list per variable type in the memory of a given processor.

The closest existing parallel-programming model to our own is Linda, as writing and reading values to and from variables is equivalent to storing and retrieving tuples. The sidebar "Linda and CoCa: comparison" describes Linda's model in more detail.
CoCa and the database transactions model

A database's purpose is to store and retrieve data efficiently and conveniently. Storing and retrieving data is done via transactions. A transaction is composed of a set of actions that access individual data items. It should obey the so-called ACID constraints:

• Atomicity: a transaction finishes completely or aborts without leaving an effect.
• Consistency: a transaction executed on a consistent database should leave the database in a consistent state.
• Isolation: a transaction cannot access the results of uncompleted transactions.
• Durability: the results of modifications to the database should be ensured regardless of underlying hardware failures.

The isolation property is generally realized with well-known concurrency control techniques such as two-phase locking and two-phase commit.6 These mechanisms are based on transaction ordering. If two transactions access the same item and one of these accesses is a write access, then the accesses of the individual transactions must be ordered in time according to a proposed transaction order.

Depending on a transaction's characteristics, two-phase locking and commit can increase overhead due to additional communication between processors. When transactions have certain properties—such as when only one transaction writes to a given variable, or when the dataflow between the transactions is acyclic—isolation can be realized without additional communication overhead.7

To clarify the relationship between CoCa and the database transaction model, consider the reconstruction of particle traces originating in one vertex as one transaction. In terms of isolation, the results of several particle-tracing components are combined by a vertex-calculation component to calculate the traced particles' intersection point. Isolation occurs in that the vertex-calculation component does not access data belonging to other vertices. In terms of atomicity, when one component decides that one trace does not satisfy certain physics criteria, none of the reconstruction's results associated with the vertex are made visible.

These properties motivated us to integrate the atomicity and isolation properties into the CoCa model. Generally, reconstruction-software designers view database systems as necessary but slow. This slowness is related to the durability property: data are stored on permanent storage, which is typically slow. Dropping the durability requirement removes the permanent storage requirement; data can thus be stored in the computer's faster main memory.

In addition, thanks to certain reconstruction program characteristics—no dataflow loops, independent and uniquely identifiable events, and components that write only one value into a variable for each event—we could implement database properties with little overhead because no locking of multiple variables is required.7

[Sidebar] Linda and CoCa: comparison

CoCa uses job space, which is based on the tuple space paradigm introduced by the Linda1 language. A comparison of tuple space with job space clarifies the differences between Linda and CoCa. A tuple is a sequence of typed fields; a field can have one of three types: integer, real, or string. Each field in the tuple contains one value that is compatible with the associated type. For example, the sequence <7, 1.23, "help"> is a tuple with type declaration integer, real, and string. Linda freely deposits and retrieves tuples of a declared type in tuple space.

Linda is a coordination language, coordinating intercomponent communications in a parallel program. It is sometimes classified as a structured, content-addressable distributed-shared-memory system. Its aim is to abstract from any specific machine architecture. A producer can communicate data by putting them into the tuple space; a consumer can use a pattern-matching primitive to take them out of the tuple space. This establishes asynchronous communication and very flexible producer-consumer identification: the producer does not know the consumer and vice versa. This flexibility lets multiple consumers access the same data. At runtime, the system configurer can decide which and how many consumers access the data. They can also change the consumer set dynamically without reprogramming the producer module. Linda's large expressive power lets you develop very compact parallel applications that can run without modification on various hardware architectures. Linda also hides the hardware architecture and component identities from the application code.

Unfortunately, Linda's drawback is performance: compared with other paradigms, accessing Linda's unstructured tuple space can require numerous searches and long communication times with large variations. CoCa differs from Linda in that it is motivated more by performance requirements than general functional requirements, in two ways. First, search time for jobs in CoCa is shorter than the search time for tuples in Linda, which has an unstructured tuple space. CoCa uses various allocation strategies to partition data and distribute values of the same type among computer processors. Second, Linda features dynamic tuple types: when a new type of tuple is inserted into the tuple space, a new tuple type is created. In our model, the set of possible variable types is defined by the specification of the communication variables.

Reference
1. R. Bjornson et al., Linda, The Portable Parallel Language, Tech. Report YALE/DCS/RR-520, Yale Univ., 1987.

[End sidebar]

The CoCa model

The CoCa model lets programmers write modules that are independent of the target hardware architecture. The components are mapped onto the parallel computer using a separate mechanism.

COMPONENT ALLOCATION

In CoCa, the configurer allocates components and variables to processors through a data-definition language (DDL) file. The DDL permits hardware independence in components while preserving good performance, as it lets us specify hardware-dependent information in the DDL and take advantage of the hardware features.

Shared-memory computers have built-in, dynamic component distribution. On distributed-memory computers, components must be explicitly assigned to processors, and the values written by the components are distributed among the processors where value-consuming components are located. We decided to equip CoCa with built-in data-distribution strategies for load-balancing purposes. A strategy can be specified for each variable type. For our HEP application, we selected three strategies:

• Round robin: the component stores and retrieves a variable type's value in a circular fashion into and out of a fixed set of lists.
• Range partitioning: a function's range is divided into contiguous segments, each of which is associated with a list location.
• Hash partitioning: a hash function, associated with a key, returns the list location.

In the DDL file, a configurer describes the processors that store values of a given variable type, the distribution strategy associated with the variable type, and the mapping of components to processors. Because CoCa functions have been added to standard programming languages, such as C, C++, and Fortran, a module is written in a known programming language; components are activated by allocating a process (as in the SC2000 shared-memory machine) or a thread (as in the CS-2 distributed-memory machine).

CoCa provides several methods for allocating components to processors on distributed-memory processors. The processor allocation of the lists that create the job space is a major issue. CoCa can produce two list accesses:

• Push mode, in which the inserting component stores the job into the list on processors containing the destination component; the destination component obtains the job from its local list.
• Pull mode, in which the inserting component stores the job into the list on its own processor; the receiving component searches a set of processors to find a list that contains an appropriate job.

Given these methods, components can select a particular job by specifying selection criteria on the job fields' values. When few values are specified, a component has a high selectivity. A component's selectivity will determine whether push mode or pull mode can be used. When the selectivity is low, jobs can be distributed evenly over all consuming components. Push-mode access with a round-robin strategy is often the most efficient combination. When the selectivity is high, push mode with range partitioning is appropriate if the prediction based on range partitioning is likely to be correct. If not, pull-mode access should be used.

A range-partitioning function (which can often be defined in HEP) is specified in the DDL file. Predicates on the job fields' values are defined for modules. These predicates are stored as compiled functions with arguments and distributed over the network. The DDL file specifies the mapping from predicate-argument value to processor and distributes the mapping functions and the jobs over processors. When a job is stored in job space, CoCa evaluates the mapping function and sends the job to the corresponding processor.

COMMUNICATION

To use CoCa, platforms must provide two synchronous operations: Send and Receive. For implementation on the Meiko CS-2 machine, we used the Elan communication library.

For each job, two messages are sent: the first contains the job type and size; the second, the actual job. This creates an appropriately sized memory area for the incoming job. Efficiency is high because when a job arrives, the receiving processor is always ready to receive, and a good overlap of communication and computation can be maintained.

Components' asynchronous queue access leads to overhead generated by

• the mutual exclusion of components that simultaneously access a list, and
• a component waiting when a list is empty.

In our experiments, we found that mutual exclusion implemented with operating-system synchronization leads to an unacceptable overhead—around hundreds of microseconds. For example, when list access is controlled by a lock, testing a free lock can take a few microseconds and locking a lock or testing a locked lock can take a few hundred microseconds, whereas list access itself takes only 10 to 20 µs.

Each list uses two Booleans: one indicates job availability, and the other, list availability. Because access to the Boolean and the list requires more than one statement, a component might conclude that no component is accessing the list while another component actually accesses the list. To safeguard against this, one lock per list is used. In most cases, components do "busy waiting" on the Boolean and find the lock free. Following is an example of the algorithm in pseudocode:

  VAR
    empty   : boolean
    exclude : boolean
    joblist : lock

  void Insert(job, list) {
    lock(joblist)
    put(list, job)
    empty := False
    unlock(joblist)
  }

  job Retrieve(list) {
    while (empty or exclude) {skip}
    exclude := True
    lock(joblist)
    job := get(list)
    IF empty(list) {empty := True}
    exclude := False
    unlock(joblist)
    return job
  }

To access a list, three procedures are assumed: empty(), get(), and put(). The lock() and unlock() procedures provide the locking functionality on the lock joblist. The empty() procedure returns True when the list is empty and False when it is filled with one or more items. Two Boolean variables, empty and exclude, are also provided. The component waits in the while loop if another component is accessing the list or the list is empty. The moment the list is filled, empty is set to False and the waiting component can proceed. Of course, on occasion, a retrieving component might be interleaved with an inserting component in the interim between the while loop and the lock setting. The lock is set to prevent this possibility.
Figure 4. A job-identifier allocation. Components are identified by module name and component number; jobs, by a time stamp and module counters.

ISOLATION

We use isolation to recombine functionally related jobs. For example, tracks originate in a common vertex. For each track, a particle is suggested. For the n tracks, n jobs are inserted in job space to calculate the particle traces and the probability that these particles created the observed tracks. The n constructed traces are recombined in one component to calculate the probability of a given particle combination.

Figure 4 shows a possible scenario. A job arrives at a module A component, which then calculates and stores three related jobs that are read by three components of modules B and E. The module B components each create two jobs read by components of module D. One module F component reads the jobs originating from one module B component. The final results are read by one module G component.

To assure that components recombine the related jobs, the jobs are identified by a time stamp and module counters. Components are identified by their module name and component number. When a job arrives in the system, it is assigned a time stamp. Each component that reads one job and produces more than one job postfixes its component number to the job identifier. A component that recombines jobs acts in two stages. In stage one, the component specifies a job with a predicate on the job contents and any identifier. In stage two, it specifies a job with the same predicate and the identifier of the first received job. Because jobs are allocated based on the job identifier, all related jobs are typically sent to the same component. This assures that jobs are efficiently recombined on a distributed-memory machine.

Application example

Figure 5 shows an example CoCa application that uses the C++ binding (bindings also exist for C and Fortran). The parameterized class of CoCa is defined; objects of class T that must be communicated between modules are defined as a CoCa class with parameter T. Two methods, in and out, are used in the example. The content of out objects is available to other components; in indicates that a component wants to receive the contents of an object from the same class.

For our example, we defined event, track, fit, and vertex objects. An event is generated by the hardware and read by the HW module. The HW module passes the event object to a Tracker module that recognizes tracks and stores them into the track object. The track class is read by a Fitter module that fits a line through the measured track. An Intersector module collects all lines and calculates a vertex. The vertex can be used by another module for further treatment if needed.

Ntracks is an application-dependent constant. Although CoCa allows a variable number of tracks, for the sake of clarity we show a fixed number, Ntracks, of tracks here. CoCa assures that an Intersector component receives the lines of one given event. Intersector does not specify which event this is. The first received track of an event determines the event number. A suitable data-distribution strategy in the DDL file—such as range partitioning on event number—assures that tracks of the same event are sent directly to a single component.

// first module
void HW() {
  CoCa<Event> event;
  while (TRUE) {
    equipment(&event);  // get event data from equipment
    event.out();        // store data in job space
  }
}

// second module
void Tracker() {
  CoCa<Event> event;
  CoCa<Track> track;
  while (TRUE) {
    event.in();         // get an event from job space
    for (int i = 0; i < NTRACKS; i++) {
      event.find(&track);
      track.out();      // track to job space
    }
  }
}

// third module
void Fitter() {
  CoCa<Track> track;
  CoCa<Line> line;
  while (TRUE) {
    track.in();         // get track from job space
    track.fit(&line);
    line.out();         // put line in job space
  }
}

// fourth module
void Intersector() {
  CoCa<Line> line[NTRACKS];
  CoCa<Vertex> vertex;
  while (TRUE) {
    for (int i = 0; i < NTRACKS; i++) {
      line[i].in();
    }                        // got all associated lines
    vertex.intersect(line);  // calculate intersection point
    vertex.out();            // store vertex in job space for later reference
  }
}

Figure 5. An example CoCa application using a C++ binding. The content of out objects is available to other components; in indicates that a component wants to receive the contents of an object from the same class.

Performance

We implemented a prototype of CoCa on a 20-node Sun SC2000 and a 64-node Meiko CS-2 computer. Table 1 shows the total execution time of producer-consumer communication on CoCa, the native communication software, and the Parallel Virtual Machine8 and Message Passing Interface9 message-passing paradigms.

Table 1. Communication and synchronization time (in µs).

                CoCa       CoCa     Elan (native)  PVM      PVM (non-native)  MPI (non-native)
  Message size  (SC2000)   (CS-2)   (CS-2)         (CS-2)   (CS-2)            (CS-2)
  1 byte        47         143      33             216      2,815             257
  1 Kbyte       47         179      59             298      3,490             393

On the SC2000 shared-memory computer, CoCa establishes communication by exchanging references—the data are not physically moved. Hence, the communication time is independent of data size, as the table shows. On the CS-2 distributed-memory computer, CoCa establishes a communication bandwidth that is equal to that of the native communication software.

The overhead of our prototype on the SC2000 is mainly synchronization overhead. The performance of our prototype on the CS-2 compared well to that of native and nonnative PVM and MPI despite the higher functionality.

We modified CPREAD,10 the sequential event-reconstruction program of the CPLEAR experiment, to run on the SC2000 and CS-2 computers using CoCa for communication between parallel parts. CoCa's event generator creates events and stores them into job space. CPREAD components then read the events from job space, reconstruct the event, and store an approved event into job space. The reconstructed events are read by three components that store the events on disk. In the DDL file, we specified push mode with round robin.

Figure 6 shows the results of the event reconstruction in relation to the number of processors on the CoCa-modified CPREAD program, compared with a theoretical scale-up where the performance of one unmodified ("reference scale-up") and one CoCa-based ("CoCa scale-up") CPREAD program is multiplied by the number of processors. With relatively few modifications to the original program, the production rate for this experiment was met on a distributed-memory machine with 28 processors, 23 of which were efficiently used.

Figure 6. Event reconstruction rates for CPREAD, CoCa-modified CPREAD, and a CoCa scale-up where the performance of one unmodified reference and one CoCa-based CPREAD program is multiplied by the number of processors. [Plot: event reconstruction rate (Hz, 100–700) versus number of processors (5–25), with the CPLEAR event production rate marked.]

ALTHOUGH MOTIVATED by properties of the HEP event-reconstruction programs, the generality of the CoCa model makes it beneficial to other applications that need parallelization. In addition, CoCa's actual implementation led to a surprisingly low communication overhead when compared with the overhead of PVM and MPI.

With CoCa we have clearly shown that fine-grain parallelization of HEP programs is a viable approach. The CoCa model combines a high level of abstraction and allows parallelization with very little overhead. Our approach, using application characteristics to arrive at a low-overhead programming model, has produced a considerable payoff.

ACKNOWLEDGMENTS
This work is jointly funded by CERN and the Eindhoven University of Technology. Results from our experiments are input for the Dutch Science Foundation's project NFI33.3129, which focuses on construction and performance of real-time transactions. We thank Linköping University, Linköping, Sweden, for use of their SC2000, and Philippe Bloch for the use of the CPREAD program. The CERN CS-2 was funded by the EU Esprit project P7255, GPMIMD2. We are grateful to Martin Rem from EUT for supporting this work and to several anonymous referees for valuable comments on the article. Also, Concurrency editor Keri Schreiner significantly contributed to this article's presentation.
References

1. J. Zalewski, "Real-Time Data Acquisition in High-Energy Physics Experiments," Proc. IEEE Workshop on Real-Time Applications, IEEE Computer Society Press, Los Alamitos, Calif., 1993, pp. 112–115.
2. The Compact Muon Solenoid – Technical Proposal, CMS Tech. Report CERN/LHCC 94-38, LHCC/P1, CERN, Geneva, Dec. 1995.
3. The LHC Study Group, Design Study of the Large Hadron Collider (LHC), CERN 91-03, CERN, May 1991.
4. E. Argante, CoCa: A Model for Parallelization of High-Energy Physics Software, doctoral thesis, Eindhoven Univ. of Technology, Eindhoven, the Netherlands, 1998.
5. R. Schiefer and D. Francis, "Parallelisation of an Existing High Energy Physics Event Reconstruction Software Package," Trans. Nuclear Science, Vol. 43, No. 1, Feb. 1996, pp. 79–84.
6. H.F. Korth and A. Silberschatz, Database System Concepts, 2nd ed., McGraw-Hill, New York, 1991.
7. M.H. Graham, "How To Get Serializability for Real-Time Transactions Without Having to Pay For It," Proc. 14th IEEE Real-Time Systems Symp., CS Press, 1993, pp. 65–65.
8. A. Geist et al., PVM 3 User's Guide and Reference Manual, Oak Ridge Nat'l Lab., Oak Ridge, Tenn., May 1993.
9. A Message Passing Interface Standard, Message Passing Interface Forum, Univ. of Tennessee, Knoxville, 1994.
10. CPLEAR Offline Reference Manual, CERN, 1991.

Peter van der Stok is an associate professor of computer science at Eindhoven University of Technology. He previously worked at CERN, where he was responsible for the LEP accelerator's distributed control system OS software. His research interests are in distributed real-time systems, focusing on real-time databases. He received an MS and a PhD in physics from the University of Amsterdam. He is a member of the IEEE and the ACM. Contact him at Eindhoven University of Technology, Dept. of Computing Science, P.O. Box 513, 5600 MB Eindhoven, the Netherlands; wsstok@win.tue.nl.

Ian Willers is a computer scientist on the CMS project at CERN/European Organization for Particle Physics in Geneva. At CERN, he developed techniques in cross compilers and assemblers that led to a machine-independent object model format that was later developed into the IEEE standard object model format for microprocessors. His technical interests are in large databases. Willers received his PhD in computer science from Cambridge University. He is a member of the IEEE. Contact him at the European Laboratory for Particle Physics, division EP/CMC, CERN, 1211 Geneva 23, Switzerland; ian.willers@cern.ch.

Erco Argante is a systems engineer at Ericsson Telecommunicatie B.V. in Rijen, the Netherlands. His research interests include parallel and distributed computing and object-oriented software development. Argante received his MS in physics from the Catholic University of Nijmegen and a PhD in computer science from Eindhoven University of Technology. Contact him at Ericsson Telecommunicatie B.V., division ETM/R, Ericssonstraat 2, 5121 ML Rijen, the Netherlands; etmerar@etm.ericsson.se.