A DSEL for Addressing the Problems Posed by Parallel Architectures

Jason Mc Guiness, Colin Egan
CTCA, School of Computer Science, University of Hertfordshire, Hatfield, Hertfordshire, UK
overload@hussar.demon.co.uk

1. INTRODUCTION
Computers with multiple pipelines have become increasingly prevalent, hence the rise in the parallelism available to the programming community. Examples range from dual-core desktop workstations, through multi-core, multi-processor blade frames that may contain hundreds of pipelines in data centres, to state-of-the-art mainframes in the Top500 supercomputer list with thousands of cores, and the potential arrival of next-generation cellular architectures that may have millions of cores. This surfeit of hardware parallelism has apparently yet to be tamed in the software-architecture arena. Various attempts to meet this challenge have been made over the decades, using such approaches as languages, compilers or libraries to enable programmers to exploit the parallelism within their various problem domains. Yet the common folklore in computer science has remained that it is hard to program parallel algorithms correctly.

This paper examines what language features would need to be added to an existing imperative language that has little if any native support for implementing parallelism, apart from a simple library that exposes the OS-level threading primitives. The goal of the authors has been to create a minimal and orthogonal DSEL that adds the capabilities of parallelism to that target language. Moreover, the DSEL proposed will be demonstrated to have such useful guarantees as a correct, heuristically efficient schedule. In terms of correctness, the DSEL guarantees that it can provide deadlock-free and race-condition-free schedules. In terms of efficiency, the schedule produced will be shown to add no worse than a poly-logarithmic order to the algorithmic run-time of the program on a CREW-PRAM (Concurrent-Read, Exclusive-Write Parallel Random-Access Machine [19]) or EREW-PRAM (Exclusive-Read, Exclusive-Write PRAM [19]) computation model. Furthermore, the DSEL described assists the user with regard to debugging the resultant parallel program. An implementation of the DSEL in C++ exists: further details may be found in [12].

2. RELATED WORK
From a hardware perspective, the evolution of computer architectures has been heavily influenced by the von Neumann model. This has meant that, with the relative increase in processor speed versus memory speed, the introduction of memory hierarchies [3] and out-of-order instruction scheduling has been highly successful. However, these extra levels increase the penalty associated with a miss in the memory subsystem, due to memory-access times, limiting the ILP (Instruction-Level Parallelism). There may also be an increase in the design complexity and power consumption of the overall system. An approach to avoiding this problem may be to fetch sets of instructions from different memory banks, i.e. introduce threads, which would allow an increase in ILP in proportion to the number of executing threads.

From a software perspective, the challenge that these parallel architectures have presented to programmers has been the massive parallelism they expose. There has been much work done in the field of parallelizing software:

• Auto-parallelizing compilers: such as EARTH-C [17]. Much of the work developing auto-parallelizing compilers has derived from the data-flow community [16].

• Language support: such as Erlang [20], UPC [5] or Intel's [18] and Microsoft's C++ compilers based upon OpenMP.

• Library support: such as POSIX threads (pthreads) or Win32, MPI, OpenMP, Boost, Intel's TBB [14], Cilk [10] or various libraries targeting C++ [6, 2]. Intel's TBB has higher-level threading constructs, but it has not supplied parallel algorithms, nor has it provided any guarantees regarding its library. It also suffers from mixing the code relating to generating the parallel schedule with the business logic, which also makes testing more complex.

These have all had varying levels of success, as discussed in part in [11], with regard to addressing the issues of programming effectively for such parallel architectures.

3. MOTIVATION
The basic issues addressed by all of these approaches have been: correctness or optimization. So far it has appeared
that the compiler- and language-based approaches have been the only ones able to address both of those issues together. But the language-based approaches require that programmers re-implement their programs in a potentially novel language, a change that has been very hard for business to adopt, severely limiting the use of these approaches.

Amongst the criticisms raised regarding the use of libraries [11, 13] such as pthreads, Win32 or OpenMP have been:

• They have been too low-level, so using them to write correct multi-threaded programs has been very hard; they suffer from composition problems. This problem may be summarized as: atomic access to an object would be contained within each object (using classic OOD), thus when composing multiple objects, multiple separate locks, from the different objects, have to be manipulated to guarantee correct access. Even when this is done correctly, the usual outcome has been a serious reduction in scalability.

• A related issue has been that the programmer often intimately entangles the thread-safety, thread-scheduling and business logic of their code. This means that each program is effectively bespoke, requiring re-testing for threading issues as well as business-logic issues.

• Also, debugging such code has been found to be very hard: debuggers for multi-threaded code have been an open area of research for some time.

Given that the target language has to remain unchanged, a DSEL defined by a library that attempts to support the correctness and optimality of the language and compiler approaches, and yet somehow overcomes the limitations of the usual library-based approaches, would seem to be ideal. This DSEL will now be presented.

4. THE DSEL TO ASSIST PARALLELISM
We chose to address these issues by defining a carefully crafted DSEL, then examining its properties to demonstrate that the DSEL achieves these goals. The DSEL should have the following properties:

• The DSEL shall target what may be termed general-purpose threading, which the authors define to be scheduling in which the conditions or loop-bounds may not be computed at compile-time, nor could they be represented as monads, so could not be memoized¹. In particular the DSEL shall support both data-flow and data-parallel constructs.

¹A compile-time or run-time optimisation technique involving a space-time tradeoff: re-computation of pure functions called with the same arguments may be avoided by caching the result; the result will be the same for each call with the same arguments, provided the function has no side-effects.

• By being implemented in an existing language, it avoids the necessity of re-implementing programs, so a more progressive approach to adoption could be taken.

• It shall be a reasonably small DSEL, yet large enough to provide sufficient extensions to the host language to express parallel constructs in a manner that would be natural to a programmer using that language.

• It shall assist in debugging any use of a conforming implementation.

• It should provide guarantees regarding those banes of parallel programming: deadlocks and race-conditions.

• Moreover, it should provide guarantees regarding the algorithmic complexity of any parallel schedule it would generate.

Initially a description of the grammar will be given, followed by a discussion of some of the properties of the DSEL. Finally some theoretical results derived from the grammar of the DSEL will be given.

4.1 Detailed Grammar of the DSEL
The various types, production rules and operations that define the DSEL will be given in this section. The basic types will be defined first, then the operations upon those types. C++ has been chosen as the target language in which to implement the DSEL, due to the rich ability within C++ to extend the type system at compile-time: primarily using templates, but also by overloading various operators. Hence the presentation of the grammar relies on the grammar of C++, so it would assist the reader to be familiar with that grammar, in particular Annex A of the ISO C++ Standard [8]. Although C++11 has some support for threading, this had not been widely implemented at the time of writing; moreover, that specification does not address the points of the DSEL in this paper. Some clarifications:

• The subscript opt means that the keyword is optional.

• The subscript def means that the keyword is the default, and specifies the default value for the optional keyword.

4.1.1 Types
The primary types used within the DSEL are derived from the thread-pool type.

1. Thread pools can be composed with various subtypes that may be used to fundamentally affect the implementation and performance of any client software:

thread-pool-type:
    thread_pool work-policy size-policy pool-adaptor

• A thread pool would contain a collection of threads that may be more, fewer or the same as the number of processors on the target architecture. This allows implementations to virtualize the multiple cores available, or to make use of operating-system-provided thread implementations. An implementation may choose to enforce a synchronization of all threads within the pool once an instance of that pool is destroyed, to ensure that threads managed by the pool are appropriately destroyed and that work in the process of mutation can be appropriately terminated.

work-policy: one of
    worker_threads_get_work one_thread_distributes

• The library should implement the classic work-stealing or master-slave work-sharing algorithms.
Clearly the specific implementation of these could affect the internal queue containing unprocessed work within the thread_pool. For example a worker_threads_get_work queue might be implemented such that the addition of work would be independent of the removal of work.

size-policy: one of
    fixed_size tracks_to_max infinite

• The size-policy, when used in combination with the threading-model, could be used to make considerable simplifications in the implementation of the thread-pool-type, which could make it faster on certain architectures.

• tracks_to_max would implement some model of the cost of re-creating and maintaining threads. If threads were cheap to create and destroy, with little overhead, then an infinite size might be a reasonable approximation; conversely, threads with the opposite characteristics might be better maintained in a fixed_size pool.

pool-adaptor:
    joinability api-type threading-model priority-modeopt comparatoropt GSS(k)-batch-sizeopt

joinability: one of
    joinable nonjoinable

• The joinability has been provided to allow certain optimizations to be implementable. A thread-pool-type that is nonjoinable could have a number of simplifying details that would make it not only easier to implement but also faster in operation.

api-type: one of
    no_api MS_Win32 posix_pthreads IBM_cyclops

• Both MS_Win32 and posix_pthreads are examples of heavyweight_threading APIs, in which OS-level threading would be used to implement the DSEL. IBM_cyclops would be an implementation of the DSEL using the lightweight_threading API implemented by IBM BlueGene/C Cyclops [1].

threading-model: one of
    sequential_mode heavyweight_threading lightweight_threading

• This specifier provides a coarse representation of the various implementations of threadable constructs in the multitude of architectures available. For example Pthreads would be considered to be heavyweight_threading, whereas Cyclops would be lightweight_threading. Separating the threading model from the API allows for the possibility that there may be multiple threading APIs on the same platform, which may have different properties; for example, if there were a GPU available in a multi-core computer, there could be two different threading models within the same program.

• The sequential_mode has been provided to allow implementations to remove all threading aspects of the implementing library, which would hugely reduce the burden on the programmer of identifying bugs within their code. If all threading is removed, then all remaining bugs should, in principle, reside in their user code, which, once determined to be bug-free, could then be trivially parallelized by modifying this single specifier and recompiling. Any further bugs introduced would then be due to bugs within the parallel aspects of their code, or within the library implementing this DSEL. If the user relies upon the library to provide threading, then there should be no further bugs in their code. We consider this feature of paramount importance, as it directly addresses the complex task of debugging parallel software, by separating the algorithm by which the parallelism should be implemented from the code implementing the mutations on the data.

priority-mode: one of
    normal_fifodef prioritized_queue

• This is an optional parameter. The prioritized_queue would allow the user to specify whether specific instances of work to be mutated should be performed ahead of other instances of work, according to a user-specified comparator.

comparator:
    std::lessdef

• A binary function-type that specifies a strict weak-ordering on the elements within the prioritized_queue.

GSS(k)-batch-size:
    1def

• A natural number specifying the batch-size to be used within the queue specified by the priority-mode. The default is 1, i.e. no batching would be performed. An implementation would be likely to use this to enable GSS(k) scheduling [9].

2. Adapted collections to assist in providing thread-safety and also to specify the memory-access model of the collection:

safe-colln:
    safe_colln collection-type lock-type

• This adaptor wraps the collection-type and an instance of lock-type in one object, and provides a few thread-safe operations upon that collection, plus access to the underlying collection. This access might seem surprising, but it has been provided because locking the operations on collections has been shown not to be composable, and cross-cuts both object-orientated and functional-decomposition designs. This could be open to misuse, but otherwise excessive locking would have
to be done in user code. This has not been an ideal design decision, but a simple one, with scope for future work. Note that this design choice within the DSEL does not invalidate the rest of the grammar, as it only affects the overloads of the data-parallel algorithms, described later.

• The adaptor also provides access to both read-lock and write-lock types, which may be the same, but allow the user to specify the intent of their operations more clearly.

lock-type: one of
    critical_section_lock_type read_write read_decaying_write

(a) A critical_section_lock_type would be a single-reader, single-writer lock, a simulation of EREW semantics. The implementation of this type of lock could be more efficient on certain architectures.

(b) A read_write lock is a multi-reader, single-writer lock, a simulation of CREW semantics.

(c) A read_decaying_write lock would be a specialization of a read_write lock that also implements atomic transformation of a write-lock into a read-lock.

(d) The lock should be used to govern the operations on the collection, not operations on the items contained within the collection.

• The lock-type parameter may be used to specify whether EREW or CREW operations upon the collection are allowed. For example, if only EREW operations are allowed, then overlapped dereferences of the execution_contexts resulting from parallel-algorithms operating upon the same instance of a safe-colln should be strictly ordered by an implementation to ensure that EREW semantics are maintained. Alternatively, if CREW semantics were specified, then an implementation may allow read-operations upon the same instance of the safe-colln to occur in parallel, assuming they were not blocked by a write operation.

collection-type:
    A standard collection such as an STL-style list or vector, etc.

3. The thread-pool-type defines further sub-types for convenience to the programmer:

create_direct:
    This adaptor, parametrized by the type of work to be mutated, contains certain sub-types. The input data and the mutation operation combined are termed the work to be mutated, which would be a type of closure. If the mutation operation does not change the state of any data external to the closure, then this would be a type of monad. More specifically, this work to be mutated should also be a type of functor that either:
    (a) provides a type result_type to access the result of the mutation, and specifies the mutation member-function, or
    (b) implements the function process(result_type &), in which case the library may determine the actual type of result_type.

The sub-types are:

joinable:
    A method of transferring work to be mutated into an instance of thread-pool-type. If the work to be mutated were transferred using this modifier, then the return result of the transfer would be an execution_context, which may subsequently be used to obtain the result of the mutation. Note that this implies that the DSEL implements a form of data-flow operation.

execution_context:
    This is the type of future that a transfer returns. It is also a type of proxy to the result_type that the mutation returns. Access via this proxy implicitly causes the calling thread to wait until the mutation has completed. This is the other component of the DSEL that implements the data-flow model. Various sub-types of execution_context exist, specific to the result_types of the various operations that the DSEL supports. Note that the implementation of execution_context should specifically prohibit aliasing, copying and assigning instances of these types.

nonjoinable:
    Another method of transferring work to be mutated into an instance of thread-pool-type. If the work to be mutated were transferred using this modifier, then the transfer would return nothing. The mutation within the pool would occur at some indeterminate time, the result of which would, for example, be detectable by any side effects of the mutation within the result_type of the work to be mutated.

time_critical:
    This modifier ensures that when the work is mutated by a thread within an instance of the thread-pool-type into which it has been transferred, it will be executed at an implementation-defined higher kernel priority. Other similar modifiers exist in the DSEL for other kernel priorities. This example demonstrates that specifying other modifiers, which would be extensions to the DSEL, would be possible.

cliques(natural_number n):
    This modifier is used with the data-parallel-algorithms. It causes the instance of thread-pool-type to allow the data-parallel-algorithm to operate with p/n threads, where p is the number of threads in the instance.

4. The DSEL specifies a number of other utility types, such as shared_pointer, various exception types and exception-management adaptors, amongst others. The details of these important but ancillary types have been omitted for brevity.
4.1.2 Operators on the thread-pool-type
The various operations that are defined in the DSEL will now be given. These operations tie together the types and express the restrictions upon the generation of the control-flow graph that the DSEL may create.

1. The transfer of work to be mutated into an instance of thread-pool-type is defined as follows:

transfer-future:
    execution-context-resultopt thread-pool-type transfer-operation

execution-context-result:
    execution_context <<

• The token sequence "<<" is the transfer operation, and is also used in the definition of the transfer-modifier-operation, amongst other places.

• Note how an execution_context can only be created via a transfer of work to be mutated into a suitably defined thread_pool. It is an error to transfer work into a thread_pool that has been defined using the nonjoinable subtype. There is no way to create an execution_context without transferring work to be mutated, so every execution_context is guaranteed to eventually contain the result of a mutation.

transfer-operation:
    transfer-modifier-operationopt transfer-data-operation

transfer-modifier-operation:
    << transfer-modifier

transfer-modifier: one of
    time_critical joinable nonjoinable cliques

transfer-data-operation:
    << transfer-data

transfer-data: one of
    work-to-be-mutated parallel-binary-operation data-parallel-algorithm

The details of the various parallel-binary-operations and data-parallel-algorithms will be given in the next section.

4.1.3 The Data-Parallel Operations and Algorithms
This section will describe the various parallel algorithms defined within the DSEL.

1. The parallel-binary-operations are defined as follows:

parallel-binary-operation: one of
    binary_fun parallel-logical-operation

parallel-logical-operation: one of
    logical_and logical_or

• It is likely that an implementation would not implement the usual short-circuiting of the operands, to allow them to be transferred into the thread pool and executed in parallel.

2. The data-parallel-algorithms are defined as follows:

data-parallel-algorithm: one of
    accumulate copy count count_if fill fill_n find find_if for_each min_element max_element reverse transform

• The style and arguments of the data-parallel-algorithms are similar to those of the STL in the C++ ISO Standard. Specifically, they all take a safe-colln as the argument specifying the ranges, and functors as necessary, as specified within the STL. Note that these algorithms all use run-time computed bounds; otherwise it would be more optimal to use techniques similar to those used in HPF, or described in [9], to parallelize such operations. Whether the DSEL supports loop-carried dependencies in the functor argument is undefined.

• If the algorithms were implemented using the techniques described in [7] and [4], then they would be optimal, with O(log(p)) complexity in distributing the work to the thread pool. Given that there are no loop-carried dependencies, each thread may operate independently upon a sub-range within the safe-colln, for an optimal algorithmic complexity of O(n/p − 1 + log(p)), where n is the number of items to be computed and p is the number of threads, ignoring the operation time of the mutations.

3. The binary_funs are defined as follows:

binary_fun:
    work-to-be-mutated work-to-be-mutated binary-functor

• A binary functor is just a functor that takes two arguments. The order of evaluation of the arguments is undefined. Whether the DSEL supports dependencies between the arguments is undefined. This implies that the arguments should refrain from modifying any external state.

4. Similarly, the logical operations are defined as follows:

logical-operation:
    work-to-be-mutated work-to-be-mutated binary-functor

• Note that no short-circuiting of the computation of the arguments occurs. The result of mutating the arguments must be boolean. Whether the DSEL supports dependencies between the arguments is undefined. This implies that the arguments should refrain from modifying any external state.

4.2 Properties of the DSEL
In this section some results will be presented that derive from the definitions above; the first of these will demonstrate that the CFG (Control-Flow Graph) would be a tree, from which the other useful results will directly derive.

Theorem 1. Using the DSEL described above, the parallel control-flow graph of any program that may use a conforming implementation of the DSEL must be an acyclic directed graph comprised of at least one singly-rooted tree, but may contain multiple singly-rooted, independent trees.
Proof. From the definitions of the DSEL, the transfer of work to be mutated into the thread_pool may be done only once, according to the definition of transfer-future; the result of the transfer is a single execution_context, according to the definition of execution-context-result, which is the only defined way to create execution_contexts. This implies that, from a node in the CFG, each transfer to the thread-pool-type represents a single forward-edge connecting the execution_context with the child-node that contains the mutation. The back-edge from the mutation to the parent-node is the edge connecting the result of the mutation with the dereference of the execution_context. The execution_context and the dereference occur in the same node, because execution_contexts cannot be passed between nodes, by definition. In summary: the parent-node has an edge from the execution_context it contains to the mutation, and a back-edge to the dereference in that parent-node. Each node may perform none, one or more transfers, resulting in none, one or more child-nodes. A node with no children is a leaf-node, containing only a mutation. Now back-edges to multiple parent nodes cannot be created, according to the definition of execution_context, because execution_contexts can be neither aliased nor copied between nodes. So the only edges in this sub-graph are the forward and back edges between parent and children. Therefore the sub-graph is not only acyclic, but a tree. Due to the definitions of transfer-future and execution-context-result, the only way to generate mutations is via the above technique. Therefore each child-node either returns via the back-edge immediately, or generates a further sub-tree attached to the larger tree that contains its parent. Now if the entry-point of the program is the single thread that runs main(), i.e. the single root, this can only generate a tree, and each node in the tree can only return or generate a tree, so the whole CFG must be a tree. If there were more entry-points, each one can only generate a tree per entry-point, as the execution_contexts can be neither aliased nor copied between nodes, by definition.

From the above theorem, one may appreciate that a conforming implementation of the DSEL would implement data-flow in software.

Theorem 2. If the user refrains from using any threading-related items or atomic objects other than those defined in the DSEL above, then they can be guaranteed to have a schedule free of race-conditions.

Proof. A race-condition occurs when two threads attempt to access the same data at the same time. A race-condition in the CFG would be represented by a child node with two parent nodes, with forward-edges connecting the parents to the child. Given that the CFG must be an acyclic tree according to theorem 1, this sub-graph cannot occur within a tree, so the schedule must be race-condition free.

Theorem 3. If the user refrains from using any threading-related items or atomic objects other than those defined in the DSEL above, and the work they wish to mutate is not aliased by any other object, then the user can be guaranteed to have a schedule free of deadlocks.

Proof. A deadlock may be defined as follows: threads A and B wait on atomic objects C and D, such that A locks C and waits upon D to unlock C, while B locks D and waits upon C to unlock D. In terms of the DSEL, this implies that execution_contexts C and D are shared between two threads, i.e. that an execution_context has been passed from a node A to a sibling node B, and vice versa. But aliasing execution_contexts has been explicitly forbidden in the DSEL by definition 3.

Corollary 1. If the user refrains from using any threading-related items or atomic objects other than those defined in the DSEL above, and the work they wish to mutate is not aliased by any other object, then the user can be guaranteed to have a schedule free of race-conditions and deadlocks.

Proof. It must be proven that theorems 2 and 3 are not mutually exclusive. Let us suppose that a CFG exists that satisfies 2 but not 3. Then there must be either an edge formed by aliasing an execution_context, or a back-edge from the result of a mutation back to a dereference of an execution_context. The former has been explicitly forbidden by the definition of execution_context, 3; the latter is forbidden by the definition of transfer-future, 1. Both are a contradiction, therefore such a CFG cannot exist, and any conforming CFG must satisfy both theorems 2 and 3.

Theorem 4. If the user refrains from using any threading-related items or atomic objects other than those defined in the DSEL above, then the schedule of work to be mutated by a conforming implementation of the DSEL would be executed in at least O(log(p)) and at most O(n) units of time to mutate the work, where n is the number of work items to be mutated on p processors. The algorithmic order of the minimal time would be poly-logarithmic, so within NC, therefore at least optimal.

Proof. Given that the schedule must be a tree according to theorem 1, with at most n leaf-nodes, each node takes at most O(n/p − 1 + log(p)) computations according to the definition of the parallel-algorithms. Also, it has been proven in [7] that distributing n items of work onto p processors may be performed with an algorithmic complexity of O(log(n)). The fastest computation time would occur if the schedule were a balanced tree, where the computation time would be the depth of the tree, i.e. O(log(n)) in the same units. If the n items of work were greater in number than the p processors, then O(log(p)) ≤ O(log(n)), so the computation time would be slower than O(log(p)). The slowest computation time would occur if the tree were a chain, i.e. O(n) time. In these cases this implies that a conforming implementation should add at most a constant order to the execution time of the schedule.

4.3 Some Example Usage
These are two toy examples, based upon an implementation in [12], of how the above DSEL might appear. The first example is a data-flow example showing how the DSEL could be used to mutate some work on a thread within the thread pool, effectively demonstrating how the future would be waited upon. Note how the execution_context has been created via the transfer of work into the thread_pool.

Listing 1: Data-flow example of a Thread Pool and Future.
struct res_t {
    int i;
};
struct work_type {
    void process(res_t &) {}
};
typedef ppd::thread_pool<
    pool_traits::worker_threads_get_work,
    pool_traits::fixed_size,
    pool_adaptor<
        generic_traits::joinable, platform_api,
        heavyweight_threading
    >
> pool_type;
typedef pool_type::create_direct<work_type> creator_t;
typedef creator_t::execution_context execution_context;
typedef creator_t::joinable joinable;
pool_type pool(2);
execution_context context(pool << joinable() << work_type());
context->i;

The typedefs in this example implementation of the grammar are complex, but the typedef for the thread-pool-type would only be needed once and could reasonably be held in a configuration trait in a header file.

The second example shows how a data-parallel version of the C++ accumulate algorithm might appear.

Listing 2: Example of a parallel version of an STL algorithm.
typedef ppd::thread_pool<
    pool_traits::worker_threads_get_work,
    pool_traits::fixed_size,
    pool_adaptor<
        generic_traits::joinable, platform_api,
        heavyweight_threading,
        pool_traits::normal_fifo, std::less, 1
    >
> pool_type;
typedef ppd::safe_colln<
    vector<int>, lock_traits::critical_section_lock_type
> vtr_colln_t;
typedef pool_type::accumulate_t<
    vtr_colln_t
>::execution_context execution_context;

5. CONCLUSIONS
The goals of the paper have been achieved: a DSEL has been formulated:

• that may be used to express general-purpose parallelism within a language,

• that ensures that there are no deadlocks and race-conditions within the program if the programmer restricts themselves to using the constructs of the DSEL,

• and that does not preclude implementing optimal schedules on a CREW-PRAM or EREW-PRAM computation model.

Intuition suggests that this result should come as no surprise, considering the work done relating to auto-parallelizing compilers, which operate within the ASTs and CFGs of the parsed program [17].

It is interesting to note that the results presented here would be applicable to all programming languages, compiled or interpreted, and that one need not be forced to re-implement a compiler. Moreover, the DSEL has been designed to directly address the problematic issue of debugging any such parallel program. Further advantages of this DSEL are that programmers would not need to learn an entirely new programming language, nor would they have to change to a novel compiler implementing the target language, which may not be available, or, if it were, might be impossible to use for more prosaic business reasons.
vtr colln t v;
v . push back ( 1 ) ; v . push back ( 2 ) ;
6. FUTURE WORK
execution context context (
pool< <j o i n a b l e ( )
There are a number of avenues that arise which could be
< <p o o l . a c c u m u l a t e ( investigated, for example a conforming implementation of
v , 1 , s t d : : p l u s < v t r c o l l n t : : v a l u e t y p e >()
) the DSEL could be presented, for example [12]. The prop-
);
a s s e r t ( ∗ c o n t e x t ==4);
erties of such an implementation could then be investigated
by reimplementing a benchmark suite, such as SPEC2006
All of the parameters have been specified in the thread- [15], and comparing and contrasting the performance of that
pool-type to demonstrate the appearance of the typedef. Note implementation versus the literature. The definition of safe-
that the example illustrates a map-reduce operation, an im- colln has not been an optimal design decision a better ap-
plementation might: proach would have been to define ranges that support lock-
ing upon the underlying collection. Extending the DSEL
1. take sub-ranges within the safe-colln, may be required to admit memoization could be investig-
ated, such that a conforming implementation might imple-
2. which would be distributed across the threads within ment not only inter but intra-procedural analysis.
the thread_pool,
7. REFERENCES
3. the mutations upon each element within each sub-
range would be performed sequentially, their results [1] Almasi, G., Cascaval, C., Castanos, J. G.,
combined via the accumulator functor, without lock- Denneau, M., Lieber, D., Moreira, J. E., and
ing any other thread’s operation, Henry S. Warren, J. Dissecting Cyclops: a detailed
analysis of a multithreaded architecture. SIGARCH
4. These sub-results would be combined with the final ac- Comput. Archit. News 31, 1 (2003), 26–38.
cumulation, in this the implementation providing suit- [2] Bischof, H., Gorlatch, S., Leshchinskiy, R., and
able locking to avoid any race-condition, M¨ ller, J. Data Parallelism in C++ Template
u
Programs: a Barnes-hut Case Study. Parallel
5. The total result would be made available via the exe- Processing Letters 15, 3 (2005), 257–272.
cution_context. [3] Burger, D., Goodman, J. R., and Kagi, A.
Memory Bandwidth Limitations of Future
Moreover the size of the input collection should be suffi- Microprocessors. In ISCA (1996), pp. 78–89.
ciently large or the time taken to execute the operation of [4] Casanova, H., Legrand, A., and Robert, Y.
the accumulator so long, so that the cost of the above oper- Parallel Algorithms. Chapman & Hall/CRC Press,
ations would be reasonably amortized. 2008.
8. [5] El-ghazawi, T. A., Carlson, W. W., and
Draper, J. M. UPC language specifications v1.1.1.
Tech. rep., 2003.
[6] Giacaman, N., and Sinnen, O. Parallel iterator for
parallelising object oriented applications. In
SEPADS’08: Proceedings of the 7th WSEAS
International Conference on Software Engineering,
Parallel and Distributed Systems (Stevens Point,
Wisconsin, USA, 2008), World Scientific and
Engineering Academy and Society (WSEAS),
pp. 44–49.
[7] Gibbons, A., and Rytter, W. Efficient parallel
algorithms. Cambridge University Press, New York,
NY, USA, 1988.
[8] ISO. ISO/IEC 14882:2011 Information technology —
Programming languages — C++. International
Organization for Standardization, Geneva,
Switzerland, Feb. 2012.
[9] Kennedy, K., and Allen, J. R. Optimizing
compilers for modern architectures: a
dependence-based approach. Morgan Kaufmann
Publishers Inc., San Francisco, CA, USA, 2002.
[10] Leiserson, C. E. The Cilk++ concurrency platform.
J. Supercomput. 51, 3 (Mar. 2010), 244–257.
[11] McGuiness, J. M. Automatic Code-Generation
Techniques for Micro-Threaded RISC Architectures.
Master’s thesis, University of Hertfordshire, Hatfield,
Hertfordshire, UK, July 2006.
[12] McGuiness, J. M. libjmmcg - implementing PPD.
libjmmcg.sourceforge.net, July 2009.
[13] McGuiness, J. M., Egan, C., Christianson, B.,
and Gao, G. The Challenges of Efficient
Code-Generation for Massively Parallel Architectures.
In Asia-Pacific Computer Systems Architecture
Conference (2006), pp. 416–422.
[14] Pheatt, C. Intel® threading building blocks. J.
Comput. Small Coll. 23, 4 (2008), 298–298.
[15] Reilly, J. Evolve or Die: Making SPEC’s CPU Suite
Relevant Today and Tomorrow. In IISWC (2006),
p. 119.
[16] Snelling, D. F., and Egan, G. K. A Comparative
Study of Data-Flow Architectures. Tech. Rep.
UMCS-94-4-3, 1994.
[17] Tang, X. Compiling for Multithreaded Architectures.
PhD thesis, University of Delaware, Delaware, USA,
Fall 1999.
[18] Tian, X., Chen, Y.-K., Girkar, M., Ge, S.,
Lienhart, R., and Shah, S. Exploring the Use of
Hyper-Threading Technology for Multimedia
Applications with Intel® OpenMP* Compiler. In
IPDPS (2003), p. 36.
[19] Tvrdik, P. Topics in parallel computing - PRAM
models.
http://pages.cs.wisc.edu/~tvrdik/2/html/Section2.html,
January 1999.
[20] Virding, R., Wikström, C., and Williams, M.
Concurrent programming in ERLANG (2nd ed.).
Prentice Hall International (UK) Ltd., Hertfordshire,
UK, 1996.