A DSEL for Addressing the Problems Posed by Parallel Architectures

Jason Mc Guiness, Colin Egan
CTCA, School of Computer Science, University of Hertfordshire, Hatfield, Hertfordshire, UK
overload@hussar.demon.co.uk

1. INTRODUCTION
   Computers with multiple pipelines have become increasingly prevalent, hence a rise in the available parallelism to the programming community. Examples range from dual-core desktop workstations, through multi-core, multi-processor blade frames that may contain hundreds of pipelines in data centres, to state-of-the-art mainframes in the Top500 supercomputer list with thousands of cores, and the potential arrival of next-generation cellular architectures that may have millions of cores. This surfeit of hardware parallelism has apparently yet to be tamed in the software architecture arena. Various attempts to meet this challenge have been made over the decades, taking such approaches as languages, compilers or libraries to enable programmers to enhance the parallelism within their various problem domains. Yet the common folklore in computer science has still been that it is hard to program parallel algorithms correctly.
   This paper examines what language features would be required to add to an existing imperative language that has little if any native support for implementing parallelism, apart from a simple library that exposes the OS-level threading primitives. The goal of the authors has been to create a minimal and orthogonal DSEL that would add the capabilities of parallelism to that target language. Moreover the DSEL proposed will be demonstrated to have such useful guarantees as a correct, heuristically efficient schedule. In terms of correctness, the DSEL guarantees that it can provide deadlock-free and race-condition-free schedules. In terms of efficiency, the schedule produced will be shown to add no worse than a poly-logarithmic order to the algorithmic run-time of the schedule of the program on a CREW-PRAM (Concurrent-Read, Exclusive-Write, Parallel Random-Access Machine [19]) or EREW-PRAM (Exclusive-Read, Exclusive-Write PRAM [19]) computation model. Furthermore the DSEL described assists the user with regard to debugging the resultant parallel program. An implementation of the DSEL in C++ exists: further details may be found in [12].

2. RELATED WORK
   From a hardware perspective, the evolution of computer architectures has been heavily influenced by the von Neumann model. This has meant that, given the relative increase in processor speed vs. memory speed, the introduction of memory hierarchies [3] and out-of-order instruction scheduling has been highly successful. However, these extra levels increase the penalty associated with a miss in the memory subsystem, due to memory-access times, limiting the ILP (Instruction-Level Parallelism). Also there may be an increase in the design complexity and power consumption of the overall system. An approach to avoid this problem may be to fetch sets of instructions from different memory banks, i.e. introduce threads, which would allow an increase in ILP in proportion to the number of executing threads.
   From a software perspective, the challenge that has been presented to programmers by these parallel architectures has been the massive parallelism they expose. There has been much work done in the field of parallelizing software:

   • Auto-parallelizing compilers: such as EARTH-C [17]. Much of the work developing auto-parallelizing compilers has derived from the data-flow community [16].

   • Language support: such as Erlang [20], UPC [5] or Intel's [18] and Microsoft's C++ compilers based upon OpenMP.

   • Library support: such as POSIX threads (pthreads) or Win32, MPI, OpenMP, Boost, Intel's TBB [14], Cilk [10] or various libraries targeting C++ [6, 2]. Intel's TBB has higher-level threading constructs, but it has not supplied parallel algorithms, nor has it provided any guarantees regarding its library. It also suffers from mixing the code relating to generating the parallel schedule with the business logic, which would also make testing more complex.

These have all had varying levels of success, as discussed in part in [11], with regards to addressing the issues of programming effectively for such parallel architectures.

3. MOTIVATION
   The basic issues addressed by all of these approaches have been: correctness or optimization.
So far it has appeared that the compiler- and language-based approaches have been the only approaches able to address both of those issues together. But the language-based approaches require that programmers re-implement their programs in a potentially novel language, a change that has been very hard for business to adopt, severely limiting the use of these approaches.
   Amongst the criticisms raised regarding the use of libraries [11, 13] such as pthreads, Win32 or OpenMP have been:

   • They have been too low-level, so using them to write correct multi-threaded programs has been very hard; they suffer from composition problems. This problem may be summarized as: atomic access to an object would be contained within each object (using classic OOD), thus when composing multiple objects, multiple separate locks, from the different objects, have to be manipulated to guarantee correct access. If this were done correctly, the usual outcome has been a serious reduction in scalability.

   • A related issue has been that the programmer often intimately entangles their thread-safety, thread scheduling and the business logic of their code. This means that each program would be effectively a bespoke program, requiring re-testing of each program for threading issues as well as business-logic issues.

   • Also debugging such code has been found to be very hard. Debuggers for multi-threaded code have been an open area of research for some time.

Given that the language has to be immutable, a DSEL defined by a library that attempts to support the correctness and optimality of the language and compiler approaches, and yet somehow overcomes the limitations of the usual library-based approaches, would seem to be ideal. This DSEL will now be presented.

4. THE DSEL TO ASSIST PARALLELISM
   We chose to address these issues by defining a carefully crafted DSEL, then examining its properties to demonstrate that the DSEL achieved the goals. The DSEL should have the following properties:

   • The DSEL shall target what may be termed general-purpose threading, which the authors define to be scheduling in which the conditions or loop-bounds may not be computed at compile-time, nor could they be represented as monads, so could not be memoized¹. In particular the DSEL shall support both data-flow and data-parallel constructs.

   • By being implemented in an existing language it would avoid the necessity of re-implementing the programs, so a more progressive approach to adoption could be taken.

   • It shall be a reasonably small DSEL, but be large enough to provide sufficient extensions to the host language to express parallel constructs in a manner that would be natural to a programmer using that language.

   • It shall assist in debugging any use of a conforming implementation.

   • It should provide guarantees regarding those banes of parallel programming: deadlocks and race-conditions.

   • Moreover it should provide guarantees regarding the algorithmic complexity of any parallel schedule it would generate.

¹A compile- or run-time optimisation technique involving a space-time tradeoff. Re-computation of pure functions when provided with the same arguments may be avoided by caching the result; the result will be the same for each call with the same arguments, if the function has no side-effects.
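As an aside, a minimal sketch of the memoization technique described in the footnote, written in standard C++ purely for illustration (it is not part of the DSEL), might be:

#include <iostream>
#include <map>

// The un-memoized, pure function: its result depends only upon its argument.
long fib(long n) {
	return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

// Memoized variant: trades space for time by caching previously computed
// results, which is valid because the function has no side-effects.
long fib_memo(long n) {
	static std::map<long, long> cache;   // illustrative only; not thread-safe
	auto it = cache.find(n);
	if (it != cache.end())
		return it->second;               // re-computation avoided
	const long r = n < 2 ? n : fib_memo(n - 1) + fib_memo(n - 2);
	cache.emplace(n, r);
	return r;
}

int main() {
	std::cout << fib_memo(40) << '\n';   // each sub-result computed once
}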
Initially a description of the grammar will be given, followed by a discussion of some of the properties of the DSEL. Finally some theoretical results derived from the grammar of the DSEL will be given.

4.1 Detailed Grammar of the DSEL
   The various types, production rules and operations that define the DSEL will be given in this section. The basic types will be defined first, then the operations upon those types will be defined. C++ has been chosen as the target language in which to implement the DSEL, due to the rich ability within C++ to extend the type system at compile-time: primarily using templates, but also by overloading various operators. Hence the presentation of the grammar relies on the grammar of C++, so it would assist the reader to have familiarity with that grammar, in particular Annex A of the ISO C++ Standard [8]. Although C++11 has some support for threading, this had not been widely implemented at the time of writing; moreover the specification had not addressed the points of the DSEL in this paper.
   Some clarifications:

   • The subscript "opt" means that the keyword is optional.

   • The subscript "def" means that the keyword is the default and specifies the default value for the optional keyword.

4.1.1 Types
   The primary types used within the DSEL are derived from the thread-pool type.

   1. Thread pools can be composed with various subtypes that could be used to fundamentally affect the implementation and performance of any client software:

      thread-pool-type:
          thread_pool work-policy size-policy pool-adaptor

          • A thread pool would contain a collection of threads that may be more, fewer or the same as the number of processors on the target architecture. This allows implementations to virtualize the multiple cores available or to make use of operating-system provided thread implementations. An implementation may choose to enforce a synchronization of all threads within the pool when an instance of that pool is destroyed, to ensure that threads managed by the pool are appropriately destroyed and work in the process of mutation could be appropriately terminated.

      work-policy: one of
          worker_threads_get_work one_thread_distributes

          • The library should implement the classic work-stealing or master-slave work-sharing algorithms.
            Clearly the specific implementation of these could affect the internal queue containing unprocessed work within the thread_pool. For example a worker_threads_get_work queue might be implemented such that the addition of work would be independent of the removal of work.
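Purely as an illustration of the worker_threads_get_work style, and not of the queue a conforming implementation would use, a toy pool in standard C++ in which worker threads pull closures from a single shared queue might look as follows:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Worker threads repeatedly take the next item of work from a shared queue
// (the "worker_threads_get_work" style). In the "one_thread_distributes"
// style a single thread would instead hand work to the workers.
class toy_pool {
	std::queue<std::function<void()>> work_;
	std::mutex m_;
	std::condition_variable cv_;
	bool done_ = false;
	std::vector<std::thread> workers_;

	void run() {
		for (;;) {
			std::function<void()> job;
			{
				std::unique_lock<std::mutex> l(m_);
				cv_.wait(l, [this] { return done_ || !work_.empty(); });
				if (work_.empty())
					return;                 // pool is being destroyed
				job = std::move(work_.front());
				work_.pop();
			}
			job();                              // mutate outside the lock
		}
	}

public:
	explicit toy_pool(unsigned n) {
		for (unsigned i = 0; i < n; ++i)
			workers_.emplace_back(&toy_pool::run, this);
	}
	void submit(std::function<void()> f) {
		{ std::lock_guard<std::mutex> l(m_); work_.push(std::move(f)); }
		cv_.notify_one();
	}
	~toy_pool() {                               // synchronize threads on destruction
		{ std::lock_guard<std::mutex> l(m_); done_ = true; }
		cv_.notify_all();
		for (auto &t : workers_) t.join();
	}
};

int main() {
	toy_pool pool(2);
	pool.submit([] { /* some work to be mutated */ });
}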
      size-policy: one of
          fixed_size tracks_to_max infinite

          • The size-policy, when used in combination with the threading-model, could be used to make considerable simplifications in the implementation of the thread-pool-type, which could make it faster on certain architectures.

          • tracks_to_max would implement some model of the cost of re-creating and maintaining threads. If threads were cheap to create and destroy with little overhead, then an infinite size might be a reasonable approximation; conversely, threads with the opposite characteristics might be better maintained in a fixed_size pool.

      pool-adaptor:
          joinability api-type threading-model priority-mode_opt comparator_opt GSS(k)-batch-size_opt

      joinability: one of
          joinable nonjoinable

          • The joinability has been provided to allow certain optimizations to be implementable. A thread-pool-type that is nonjoinable could have a number of simplifying details that would make it not only easier to implement but also faster in operation.

      api-type: one of
          no_api MS_Win32 posix_pthreads IBM_cyclops

          • Both MS_Win32 and posix_pthreads are examples of heavyweight_threading APIs, in which threading at the OS level would be made use of to implement the DSEL. IBM_cyclops would be an implementation of the DSEL using the lightweight_threading API implemented by IBM BlueGene/C Cyclops [1].

      threading-model: one of
          sequential_mode heavyweight_threading lightweight_threading

          • This specifier provides a coarse representation of the various implementations of threadable constructs in the multitude of architectures available. For example Pthreads would be considered to be heavyweight_threading whereas Cyclops would be lightweight_threading. Separation of the threading model from the API allows for the possibility that there may be multiple threading APIs on the same platform, which may have different properties; for example if there were to be a GPU available in a multi-core computer, there could be two different threading models within the same program.

          • The sequential_mode has been provided to allow implementations to remove all threading aspects from the implementing library, which would hugely reduce the burden on the programmer regarding identifying bugs within their code. If all threading is removed, then all bugs that remain should, in principle, reside in their user-code, which, once determined to be bug-free, could then be trivially parallelized by modifying this single specifier and recompiling. Then any further bugs introduced would be due to bugs within the parallel aspects of their code, or the library implementing this DSEL. If the user relies upon the library to provide threading, then there should be no further bugs in their code. We consider this feature of paramount importance, as it directly addresses the complex task of debugging parallel software, by separating the algorithm by which the parallelism should be implemented from the code implementing the mutations on the data.
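A minimal sketch of how such a compile-time switch might be expressed in standard C++ follows; the policy names mirror the grammar, but the code is an illustrative assumption, not the implementation in [12]:

#include <cassert>
#include <future>

struct sequential_mode {};
struct heavyweight_threading {};

// In sequential_mode the work is mutated inline on the calling thread, so a
// debugger sees a single-threaded program; with a threading model the same
// call site dispatches the work to another thread.
template<class ThreadingModel> struct toy_pool;

template<> struct toy_pool<sequential_mode> {
	template<class Work>
	auto transfer(Work w) -> std::future<decltype(w())> {
		std::promise<decltype(w())> p;
		p.set_value(w());                            // executed immediately, inline
		return p.get_future();
	}
};

template<> struct toy_pool<heavyweight_threading> {
	template<class Work>
	auto transfer(Work w) -> std::future<decltype(w())> {
		return std::async(std::launch::async, w);    // executed by another thread
	}
};

int main() {
	toy_pool<sequential_mode> debug_pool;            // change one specifier...
	toy_pool<heavyweight_threading> release_pool;    // ...to re-enable threading
	assert(debug_pool.transfer([] { return 42; }).get() == 42);
	assert(release_pool.transfer([] { return 42; }).get() == 42);
}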
      priority-mode: one of
          normal_fifo_def prioritized_queue

          • This is an optional parameter. The prioritized_queue would allow the user to specify whether specific instances of work to be mutated should be performed ahead of other instances of work, according to a user-specified comparator.

      comparator:
          std::less_def

          • A binary function-type that specifies a strict weak-ordering on the elements within the prioritized_queue.

      GSS(k)-batch-size:
          1_def

          • A natural number specifying the batch-size to be used within the queue specified by the priority-mode. The default is 1, i.e. no batching would be performed. An implementation would be likely to use this for enabling GSS(k) scheduling [9].
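By way of illustration only, a prioritized work queue with a user-supplied comparator and a batch-size, the hook upon which GSS(k)-style batching [9] could be built, might be sketched in standard C++ as:

#include <functional>
#include <iostream>
#include <queue>
#include <vector>

// Work items are ordered by a user-specified strict weak ordering; pop_batch
// removes up to k items at once, which is where an implementation could apply
// its GSS(k)-style batching policy.
template<class Work, class Compare = std::less<Work>>
class prioritized_work_queue {
	std::priority_queue<Work, std::vector<Work>, Compare> q_;
	std::size_t batch_size_;
public:
	explicit prioritized_work_queue(std::size_t k = 1) : batch_size_(k) {}
	void push(Work w) { q_.push(w); }
	std::vector<Work> pop_batch() {
		std::vector<Work> batch;
		while (!q_.empty() && batch.size() < batch_size_) {
			batch.push_back(q_.top());
			q_.pop();
		}
		return batch;
	}
};

int main() {
	prioritized_work_queue<int> q(2);        // batch-size of 2
	q.push(3); q.push(1); q.push(2);
	for (int w : q.pop_batch())
		std::cout << w << ' ';               // highest-priority items first: 3 2
	std::cout << '\n';
}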
   2. Adapted collections to assist in providing thread-safety and also to specify the memory-access model of the collection:

      safe-colln:
          safe_colln collection-type lock-type

          • This adaptor wraps the collection-type and an instance of lock-type in one object, and provides a few thread-safe operations upon that collection, plus access to the underlying collection. This access might seem surprising, but it has been done because locking the operations on collections has been shown to not be composable, and cross-cuts both object-orientated and functional-decomposition designs. This could be open to misuse, but otherwise excessive locking would have to be done in user code.
            This has not been an ideal design decision, but a simple one, with scope for future work. Note that this design choice within the DSEL does not invalidate the rest of the grammar, as it would just affect the overloads of the data-parallel-algorithms, described later.

          • The adaptor also provides access to both read-lock and write-lock types, which may be the same, but allow the user to specify the intent of their operations more clearly.

      lock-type: one of
          critical_section_lock_type read_write read_decaying_write

          (a) A critical_section_lock_type would be a single-reader, single-writer lock, a simulation of EREW semantics. The implementation of this type of lock could be more efficient on certain architectures.

          (b) A read_write lock is a multi-reader, single-writer lock, a simulation of CREW semantics.

          (c) A read_decaying_write lock would be a specialization of a read_write lock that also implements atomic transformation of a write-lock into a read-lock.

          (d) The lock should be used to govern the operations on the collection, and not operations on the items contained within the collection.

          • The lock-type parameter may be used to specify whether EREW or CREW operations upon the collection are allowed. For example, if only EREW operations are allowed, then overlapped dereferences of the execution_contexts resulting from parallel-algorithms operating upon the same instance of a safe-colln should be strictly ordered by an implementation to ensure that EREW semantics are maintained. Alternatively, if CREW semantics were specified, then an implementation may allow read-operations upon the same instance of the safe-colln to occur in parallel, assuming they were not blocked by a write operation.

      collection-type:
          A standard collection such as an STL-style list or vector, etc.
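A minimal sketch of the kind of adaptor that the safe-colln production describes is given below, using standard C++, where a plain mutex would play the role of a critical_section_lock_type (EREW-like) and a std::shared_timed_mutex that of a read_write lock (CREW-like); the names and the choice of operations are illustrative assumptions, not the interface of a conforming implementation:

#include <mutex>
#include <shared_mutex>
#include <vector>

// Pairs a collection with a lock governing operations on the collection (not
// on the items it contains), and exposes a few thread-safe operations plus
// access to the underlying collection.
template<class Collection, class Lock = std::shared_timed_mutex>
class toy_safe_colln {
	Collection c_;
	mutable Lock l_;
public:
	using read_lock  = std::shared_lock<Lock>;   // CREW: many concurrent readers
	using write_lock = std::unique_lock<Lock>;   // single writer

	void push_back(typename Collection::value_type v) {
		write_lock g(l_);
		c_.push_back(std::move(v));
	}
	std::size_t size() const {
		read_lock g(l_);
		return c_.size();
	}
	// Deliberate escape hatch, as in the text: access to the underlying
	// collection, with the caller's intent expressed via the lock types above.
	Collection &underlying() { return c_; }
};

int main() {
	toy_safe_colln<std::vector<int>> v;
	v.push_back(1);
	v.push_back(2);
	return v.size() == 2 ? 0 : 1;
}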
   3. The thread-pool-type defines further sub-types for convenience to the programmer:

      create_direct:
          This adaptor, parametrized by the type of work to be mutated, contains certain sub-types. The input data and the mutation operation combined are termed the work to be mutated, which would be a type of closure. If the mutation operation does not change the state of any data external to the closure, then this would be a type of monad. More specifically, this work to be mutated should also be a type of functor that either:

          (a) provides a type result_type to access the result of the mutation, and specifies the mutation member-function, or

          (b) implements the function process(result_type &), in which case the library may determine the actual type of result_type.

          The sub-types are:

          joinable:
              A method of transferring work to be mutated into an instance of thread-pool-types. If the work to be mutated were to be transferred using this modifier, then the return result of the transfer would be an execution_context, that may subsequently be used to obtain the result of the mutation. Note that this implies that the DSEL implements a form of data-flow operation.

          execution_context:
              This is the type of future that a transfer returns. It is also a type of proxy to the result_type that the mutation returns. Access via this proxy implicitly causes the calling thread to wait until the mutation has been completed. This is the other component of the DSEL that implements the data-flow model. Various sub-types of execution_context exist, specific to the result_types of the various operations that the DSEL supports. Note that the implementation of execution_context should specifically prohibit aliasing instances of these types, copying instances of these types and assigning instances of these types. (An analogy with standard futures is sketched at the end of this list.)

          nonjoinable:
              Another method of transferring work to be mutated into an instance of thread-pool-types. If the work to be mutated were to be transferred using this modifier, then the return result of the transfer would be nothing. The mutation within the pool would occur at some indeterminate time, the result of which would, for example, be detectable by any side-effects of the mutation within the result_type of the work to be mutated.

          time_critical:
              This modifier ensures that when the work is mutated by a thread within an instance of thread-pool-type into which it has been transferred, it will be executed at an implementation-defined higher kernel priority. Other similar modifiers exist in the DSEL for other kernel priorities. This example demonstrates that specifying other modifiers, as extensions to the DSEL, would be possible.

          cliques(natural_number n):
              This modifier is used with data-parallel-algorithms. It causes the instance of thread-pool-type to allow the data-parallel-algorithm to operate with p/n threads, where p is the number of threads in the instance.

   4. The DSEL specifies a number of other utility types such as shared_pointer, various exception types and exception-management adaptors, amongst others. The details of these important, but ancillary, types have been omitted for brevity.
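The future-like behaviour of execution_context described above can be illustrated, by analogy only, with standard C++ futures: dereferencing blocks until the mutation completes, and the handle may be moved but not copied, loosely mirroring the prohibition on aliasing, copying and assigning execution_contexts:

#include <future>
#include <iostream>

struct res_t { int i; };

// The "work to be mutated" is a closure whose result is delivered through a
// future; the calling thread blocks in get(), much as a dereference of an
// execution_context would block until the mutation has completed.
int main() {
	std::future<res_t> context = std::async(std::launch::async, [] {
		return res_t{42};                 // the mutation
	});
	// std::future is movable but not copyable, loosely mirroring the ban on
	// aliasing or copying execution_contexts.
	std::cout << context.get().i << '\n'; // waits for the mutation: prints 42
}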
4.1.2 Operators on the thread-pool-type
   The various operations that are defined in the DSEL will now be given. These operations tie together the types and express the restrictions upon the generation of the control-flow graph that the DSEL may create.

   1. The transfer of work to be mutated into an instance of thread-pool-type is defined as follows:

      transfer-future:
          execution-context-result_opt thread-pool-type transfer-operation

      execution-context-result:
          execution_context <<

          • The token sequence "<<" is the transfer operation, and is also used in the definition of the transfer-modifier-operation, amongst other places.

          • Note how an execution_context can only be created via a transfer of work to be mutated into a suitably defined thread_pool. It is an error to transfer work into a thread_pool that has been defined using the nonjoinable subtype. There is no way to create an execution_context without transferring work to be mutated, so every execution_context is guaranteed to eventually contain the result of a mutation.

      transfer-operation:
          transfer-modifier-operation_opt transfer-data-operation

      transfer-modifier-operation:
          << transfer-modifier

      transfer-modifier: one of
          time_critical joinable nonjoinable cliques

      transfer-data-operation:
          << transfer-data

      transfer-data: one of
          work-to-be-mutated parallel-binary-operation data-parallel-algorithm

The details of the various parallel-binary-operations and data-parallel-algorithms will be given in the next section.
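A hypothetical sketch of how a transfer operator of this general shape might be layered over standard C++ facilities is shown below; the class and operator are assumptions made purely for illustration, the sketch omits the transfer-modifiers, and it is not the grammar's implementation in [12]:

#include <future>
#include <utility>

// A toy pool whose operator<< transfers work and yields a future, mimicking
// the shape "execution_context context(pool << work);".
struct toy_pool {};

template<class Work>
auto operator<<(toy_pool &, Work w) -> std::future<decltype(w())> {
	// A conforming implementation would enqueue w on its internal queue;
	// std::async merely stands in for that machinery here.
	return std::async(std::launch::async, std::move(w));
}

int main() {
	toy_pool pool;
	auto context = pool << [] { return 6 * 7; };  // the transfer-data-operation
	return context.get() == 42 ? 0 : 1;           // dereference-like wait
}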
4.1.3 The Data-Parallel Operations and Algorithms
   This section will describe the various parallel algorithms defined within the DSEL.

   1. The parallel-binary-operations are defined as follows:

      parallel-binary-operation: one of
          binary_fun parallel-logical-operation

      parallel-logical-operation: one of
          logical_and logical_or

          • It is likely that an implementation would not implement the usual short-circuiting of the operands, to allow them to be transferred into the thread pool and executed in parallel.

   2. The data-parallel-algorithms are defined as follows:

      data-parallel-algorithm: one of
          accumulate copy count count_if fill fill_n find find_if for_each min_element max_element reverse transform

          • The style and arguments of the data-parallel-algorithms are similar to those of the STL in the C++ ISO Standard. Specifically they all take a safe-colln as the argument that specifies the range, plus functors as necessary, as specified within the STL. Note that these algorithms all use run-time computed bounds, otherwise it would be more optimal to use techniques similar to those used in HPF or described in [9] to parallelize such operations. Whether the DSEL supports loop-carried dependencies in the functor argument is undefined.

          • If the algorithms were to be implemented using techniques described in [7] and [4], then the algorithms would be optimal, with O(log(p)) complexity in distributing the work to the thread pool. Given that there are no loop-carried dependencies, each thread may operate independently upon a sub-range within the safe-colln for an optimal algorithmic complexity of O(n/p - 1 + log(p)), where n is the number of items to be computed and p is the number of threads, ignoring the operation time of the mutations.

   3. The binary_funs are defined as follows:

      binary_fun:
          work-to-be-mutated work-to-be-mutated binary-functor

          • A binary functor is just a functor that takes two arguments. The order of evaluation of the arguments is undefined. Whether the DSEL supports dependencies between the arguments is undefined. This would imply that the arguments should refrain from modifying any external state.

   4. Similarly, the logical operations are defined as follows:

      logical-operation:
          work-to-be-mutated work-to-be-mutated binary-functor

          • Note that no short-circuiting of the computation of the arguments occurs. The result of mutating the arguments must be boolean. Whether the DSEL supports dependencies between the arguments is undefined. This would imply that the arguments should refrain from modifying any external state.
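For example, a logical_and over two items of work to be mutated might be sketched as follows in standard C++: both operands are evaluated in parallel before the conjunction, with no short-circuiting, and the operand closures are assumed not to modify any external state. This is an illustration only, not the DSEL's own operator:

#include <future>
#include <iostream>

// Both arguments are transferred for evaluation before the result is
// combined, so no short-circuiting takes place.
template<class F, class G>
bool parallel_logical_and(F f, G g) {
	std::future<bool> lhs = std::async(std::launch::async, f);
	std::future<bool> rhs = std::async(std::launch::async, g);
	const bool l = lhs.get();
	const bool r = rhs.get();
	return l && r;
}

int main() {
	bool b = parallel_logical_and(
		[] { return true;  },     // work to be mutated, yielding a boolean
		[] { return false; });
	std::cout << std::boolalpha << b << '\n';   // prints false
}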
4.2 Properties of the DSEL
   In this section some results will be presented that derive from the definitions, the first of which will demonstrate that the CFG (Control Flow Graph) would be a tree, from which the other useful results will directly derive.

   Theorem 1. Using the DSEL described above, the parallel control-flow graph of any program that may use a conforming implementation of the DSEL must be an acyclic directed graph, and comprised of at least one singly-rooted tree, but may contain multiple singly-rooted, independent, trees.
   Proof. From the definitions of the DSEL, the transfer of work to be mutated into the thread_pool may be done only once, according to the definition of transfer-future, the result of which returns a single execution_context, according to the definition of execution-context-result, which has been the only defined way to create execution_contexts. This implies that, from a node in the CFG, each transfer to the thread-pool-type represents a single forward-edge connecting the execution_context with the child-node that contains the mutation. The back-edge from the mutation to the parent-node is the edge connecting the result of the mutation with the dereference of the execution_context. The execution_context and the dereference occur in the same node, because execution_contexts cannot be passed between nodes, by definition. In summary: the parent-node has an edge from the execution_context it contains to the mutation and a back-edge to the dereference in that parent-node. Each node may perform none, one or more transfers resulting in none, one or more child-nodes. A node with no children is a leaf-node, containing only a mutation. Now back-edges to multiple parent nodes cannot be created, according to the definition of execution_context, because execution_contexts cannot be aliased nor copied between nodes. So the only edges in this sub-graph are the forward and back edges from parent to children. Therefore the sub-graph is not only acyclic, but a tree. Due to the definitions of transfer-future and execution-context-result, the only way to generate mutations is via the above technique. Therefore each child-node either returns via the back-edge immediately or generates a further sub-tree attaching to the larger tree that contains its parent. Now if the entry-point of the program is the single thread that runs main(), i.e. the single root, this can only generate a tree, and since each node in the tree can only return or generate a tree, the whole CFG must be a tree. If there were more entry-points, each one can only generate a tree per entry-point, as the execution_contexts cannot be aliased nor copied between nodes, by definition.

According to the above theorem, one may appreciate that a conforming implementation of the DSEL would implement data-flow in software.

   Theorem 2. If the user refrains from using any other threading-related items or atomic objects other than those defined in the DSEL above, then they can be guaranteed to have a schedule free of race-conditions.

   Proof. A race-condition is when two threads attempt to access the same data at the same time. A race-condition in the CFG would be represented by a child node with two parent nodes, with forward-edges connecting the parents to the child. Note that the CFG must be an acyclic tree according to theorem 1, so this sub-graph cannot be represented in a tree, therefore the schedule must be race-condition free.

   Theorem 3. If the user refrains from using any other threading-related items or atomic objects other than those defined in the DSEL above, and the work they wish to mutate may not be aliased by any other object, then the user can be guaranteed to have a schedule free of deadlocks.

   Proof. A deadlock may be defined as: when threads A and B wait on atomic-objects C and D, such that A locks C and waits upon D to unlock C, whilst B locks D and waits upon C to unlock D. In terms of the DSEL, this implies that execution_contexts C and D are shared between two threads, i.e. that one execution_context has been passed from a node A to a sibling node B, and vice versa for the other. But aliasing execution_contexts has been explicitly forbidden in the DSEL by definition 3.

   Corollary 1. If the user refrains from using any other threading-related items or atomic objects other than those defined in the DSEL above, and the work they wish to mutate may not be aliased by any other object, then the user can be guaranteed to have a schedule free of race-conditions and deadlocks.

   Proof. It must be proven that the two theorems 2 and 3 are not mutually exclusive. Let us suppose that a CFG exists that satisfies 2 but not 3. Therefore there must be either an edge formed by aliasing an execution_context or a back-edge from the result of a mutation back to a dereference of an execution_context. The former has been explicitly forbidden in the DSEL by the definition of the execution_context, 3, the latter forbidden by the definition of transfer-future, 1. Both are a contradiction, therefore such a CFG cannot exist. Therefore any conforming CFG must satisfy both theorems 2 and 3.

   Theorem 4. If the user refrains from using any other threading-related items or atomic objects other than those defined in the DSEL above, then the schedule of work to be mutated by a conforming implementation of the DSEL would be executed in time taking at least an algorithmic complexity of O(log(p)) and at most O(n) in units of time to mutate the work, where n is the number of work items to be mutated on p processors. The algorithmic order of the minimal time would be poly-logarithmic, so within NC, therefore at least optimal.

   Proof. The schedule must be a tree according to theorem 1, with at most n leaf-nodes, and each node takes at most O(n/p - 1 + log(p)) computations according to the definition of the parallel-algorithms. Also it has been proven in [7] that to distribute n items of work onto p processors may be performed with an algorithmic complexity of O(log(n)). The fastest computation time would be if the schedule were a balanced tree, where the computation time would be the depth of the tree, i.e. O(log(n)) in the same units. If the n items of work were to be greater than the p processors, then O(log(p)) ≤ O(log(n)), so the computation time would be slower than O(log(p)). The slowest computation time would be if the tree were a chain, i.e. O(n) time. In these cases this implies that a conforming implementation should add at most a constant order to the execution time of the schedule.

4.3 Some Example Usage
   These are two toy examples, based upon an implementation in [12], of how the above DSEL might appear. The first example is a data-flow example showing how the DSEL could be used to mutate some work on a thread within the thread pool, effectively demonstrating how the future would be waited upon. Note how the execution_context has been created via the transfer of work into the thread_pool.

Listing 1: Data-flow example of a Thread Pool and Future.
struct res_t {
    int i;
};
struct work_type {
    void process(res_t &) {}
};

typedef ppd::thread_pool<
    pool_traits::worker_threads_get_work,
    pool_traits::fixed_size,
    pool_adaptor<
        generic_traits::joinable, platform_api,
        heavyweight_threading
    >
> pool_type;

typedef pool_type::create_direct<work_type> creator_t;
typedef creator_t::execution_context execution_context;
typedef creator_t::joinable joinable;

pool_type pool(2);
execution_context context(pool<<joinable()<<work_type());
context->i;

   The typedefs in this example implementation of the grammar are complex, but the typedef for the thread-pool-type would only be needed once and, reasonably, could be held in a configuration trait in a header file.
   The second example shows how a data-parallel version of the C++ accumulate algorithm might appear.

Listing 2: Example of a parallel version of an STL algorithm.

typedef ppd::thread_pool<
    pool_traits::worker_threads_get_work,
    pool_traits::fixed_size,
    pool_adaptor<
        generic_traits::joinable, platform_api,
        heavyweight_threading,
        pool_traits::normal_fifo, std::less, 1
    >
> pool_type;
typedef ppd::safe_colln<
    vector<int>, lock_traits::critical_section_lock_type
> vtr_colln_t;
typedef pool_type::accumulate_t<
    vtr_colln_t
>::execution_context execution_context;

vtr_colln_t v;
v.push_back(1); v.push_back(2);
execution_context context(
    pool<<joinable()
        <<pool.accumulate(
            v, 1, std::plus<vtr_colln_t::value_type>()
        )
);
assert(*context==4);

   All of the parameters have been specified in the thread-pool-type to demonstrate the appearance of the typedef. Note that the example illustrates a map-reduce operation; an implementation might:

   1. take sub-ranges within the safe-colln,

   2. which would be distributed across the threads within the thread_pool,

   3. the mutations upon each element within each sub-range would be performed sequentially, their results combined via the accumulator functor, without locking any other thread's operation,

   4. these sub-results would be combined with the final accumulation, in this case the implementation providing suitable locking to avoid any race-condition,

   5. the total result would be made available via the execution_context.

Moreover, the size of the input collection should be sufficiently large, or the time taken to execute the accumulator operation sufficiently long, that the cost of the above operations would be reasonably amortized.
  All of the parameters have been specified in the thread-                                                     [15], and comparing and contrasting the performance of that
pool-type to demonstrate the appearance of the typedef. Note                                                  implementation versus the literature. The definition of safe-
that the example illustrates a map-reduce operation, an im-                                                   colln has not been an optimal design decision a better ap-
plementation might:                                                                                           proach would have been to define ranges that support lock-
                                                                                                              ing upon the underlying collection. Extending the DSEL
     1. take sub-ranges within the safe-colln,                                                                may be required to admit memoization could be investig-
                                                                                                              ated, such that a conforming implementation might imple-
     2. which would be distributed across the threads within                                                  ment not only inter but intra-procedural analysis.
        the thread_pool,
                                                                                                              7. REFERENCES
     3. the mutations upon each element within each sub-
        range would be performed sequentially, their results                                                   [1] Almasi, G., Cascaval, C., Castanos, J. G.,
        combined via the accumulator functor, without lock-                                                        Denneau, M., Lieber, D., Moreira, J. E., and
        ing any other thread’s operation,                                                                          Henry S. Warren, J. Dissecting Cyclops: a detailed
                                                                                                                   analysis of a multithreaded architecture. SIGARCH
     4. These sub-results would be combined with the final ac-                                                      Comput. Archit. News 31, 1 (2003), 26–38.
        cumulation, in this the implementation providing suit-                                                 [2] Bischof, H., Gorlatch, S., Leshchinskiy, R., and
        able locking to avoid any race-condition,                                                                  M¨ ller, J. Data Parallelism in C++ Template
                                                                                                                     u
                                                                                                                   Programs: a Barnes-hut Case Study. Parallel
     5. The total result would be made available via the exe-                                                      Processing Letters 15, 3 (2005), 257–272.
        cution_context.                                                                                        [3] Burger, D., Goodman, J. R., and Kagi, A.
                                                                                                                   Memory Bandwidth Limitations of Future
Moreover the size of the input collection should be suffi-                                                           Microprocessors. In ISCA (1996), pp. 78–89.
ciently large or the time taken to execute the operation of                                                    [4] Casanova, H., Legrand, A., and Robert, Y.
the accumulator so long, so that the cost of the above oper-                                                       Parallel Algorithms. Chapman & Hall/CRC Press,
ations would be reasonably amortized.                                                                              2008.
[5] El-ghazawi, T. A., Carlson, W. W., and
     Draper, J. M. UPC language specifications v1.1.1.
     Tech. rep., 2003.
 [6] Giacaman, N., and Sinnen, O. Parallel iterator for
     parallelising object oriented applications. In
     SEPADS’08: Proceedings of the 7th WSEAS
     International Conference on Software Engineering,
     Parallel and Distributed Systems (Stevens Point,
     Wisconsin, USA, 2008), World Scientific and
     Engineering Academy and Society (WSEAS),
     pp. 44–49.
 [7] Gibbons, A., and Rytter, W. Efficient parallel
     algorithms. Cambridge University Press, New York,
     NY, USA, 1988.
 [8] ISO. ISO/IEC 14882:2011 Information technology —
     Programming languages — C++. International
     Organization for Standardization, Geneva,
     Switzerland, Feb. 2012.
 [9] Kennedy, K., and Allen, J. R. Optimizing
     compilers for modern architectures: a
     dependence-based approach. Morgan Kaufmann
     Publishers Inc., San Francisco, CA, USA, 2002.
[10] Leiserson, C. E. The Cilk++ concurrency platform.
     J. Supercomput. 51, 3 (Mar. 2010), 244–257.
[11] McGuiness, J. M. Automatic Code-Generation
     Techniques for Micro-Threaded RISC Architectures.
     Master’s thesis, University of Hertfordshire, Hatfield,
     Hertfordshire, UK, July 2006.
[12] McGuiness, J. M. libjmmcg - implementing PPD.
     libjmmcg.sourceforge.net, July 2009.
[13] McGuiness, J. M., Egan, C., Christianson, B.,
     and Gao, G. The Challenges of Efficient
     Code-Generation for Massively Parallel Architectures.
     In Asia-Pacific Computer Systems Architecture
     Conference (2006), pp. 416–422.
[14] Pheatt, C. Intel R threading building blocks. J.
     Comput. Small Coll. 23, 4 (2008), 298–298.
[15] Reilly, J. Evolve or Die: Making SPEC’s CPU Suite
     Relevant Today and Tomorrow. In IISWC (2006),
     p. 119.
[16] Snelling, D. F., and Egan, G. K. A Comparative
     Study of Data-Flow Architectures. Tech. Rep.
     UMCS-94-4-3, 1994.
[17] Tang, X. Compiling for Multithreaded Architectures.
     PhD thesis, University of Delaware, Delaware, USA,
     Fall 1999.
[18] Tian, X., Chen, Y.-K., Girkar, M., Ge, S.,
     Lienhart, R., and Shah, S. Exploring the Use of
     Hyper-Threading Technology for Multimedia
     Applications with Intel R OpenMP* Compiler. In
     IPDPS (2003), p. 36.
[19] Tvrdik, P. Topics in parallel computing - PRAM
     models.
     http://pages.cs.wisc.edu/~tvrdik/2/html/Section2.html,
     January 1999.
[20] Virding, R., Wikström, C., and Williams, M.
     Concurrent programming in ERLANG (2nd ed.).
     Prentice Hall International (UK) Ltd., Hertfordshire,
     UK, 1996.
