Layered Spiral Algorithm for Memory-Aware Mapping and Scheduling on Network-on-Chip

Shuo Li, Fahimeh Jafari, Ahmed Hemani
School of Information and Communication Technology
Royal Institute of Technology
Stockholm, Sweden
Email: {shuol, fjafari, hemani}@kth.se

Shashi Kumar
Department of Electronic Systems, School of Engineering
Jönköping University
Jönköping, Sweden
Email: Shashi.Kumar@jth.hj.se
Abstract—In this paper, the Layered Spiral Algorithm (LSA) is proposed for memory-aware application mapping and scheduling onto Network-on-Chip (NoC) based Multi-Processor System-on-Chip (MPSoC). The energy consumption is optimized while keeping high task-level parallelism. The experimental evaluation indicates that if memory-awareness is not considered during mapping and scheduling, memory overflows may occur. The underlying problem is also modeled as a Mixed Integer Linear Programming (MILP) problem and solved using an efficient branch-and-bound algorithm to compare optimal solutions with the results achieved by LSA. Compared to the MILP solutions, the LSA results show only about 20% and 12% increase of total communication cost for a small and a middle-size synthetic problem, respectively, while LSA is an order of magnitude faster. Therefore, LSA can find acceptable total communication cost with a low run-time complexity, enabling quick exploration of large design spaces, which is infeasible for exhaustive search.

I. INTRODUCTION

Even though Network-on-Chip (NoC) has been around for a decade [1], programming NoCs is arduous since the application mapping and scheduling problem is NP-hard [2] and the existing compilation tools are not suitable in the NoC context [3]. In addition, as memory is critical in the NoC design process [4], memory requirements and availability should not be ignored. This memory-awareness increases the complexity of the application mapping and scheduling problem even more. Since the entire solution space is enormous, problem-specific algorithms are desired beyond exhaustive search-based algorithms to obtain acceptable solutions in a reasonable time. In this paper, we describe (1) the Memory-Aware Communication Task Graph (MACTG) to model applications, (2) the Platform Architecture Graph (PAG) to model NoC-based Multi-Processor System-on-Chip (MPSoC) platforms, (3) a problem-specific heuristic algorithm, the Layered Spiral Algorithm (LSA), to quickly map and schedule an application characterized by a MACTG onto an architecture characterized by a PAG, and (4) a Mixed Integer Linear Programming (MILP) formulation of the underlying problem, solved by a branch-and-bound algorithm. Afterwards, the efficacy of the proposed algorithm is established by comparing the MILP solutions and the LSA results.

The rest of this paper is organized as follows. Section II introduces related work. Section III discusses the application mapping and scheduling problem formulation based on the application and platform models. Section IV is devoted to the description of LSA. We model the underlying problem in MILP in Section V and evaluate our approach in Section VI. Section VII gives conclusions and future work.

II. RELATED WORK

Different mapping algorithms [5][6] have been proposed without memory-awareness and scheduling coverage. In [7], the scheduling problem is covered together with mapping, but without memory-awareness. As claimed in [4], memory is critical in the NoC design process and consequently should be considered when mapping and scheduling practical applications onto practical platforms. In [8], memory-awareness is covered while scheduling is not considered. In this paper, we propose a very fast algorithm called the Layered Spiral Algorithm (LSA) to solve the memory-aware application mapping and scheduling problem. The objective of LSA is to minimize total energy consumption while keeping high task-level parallelism. The proposed algorithm is based on the spiral algorithm [9], a very fast mapping algorithm without memory-awareness. We extend the spiral algorithm by introducing memory-aware concepts and task layers to cover both memory-awareness and scheduling. The paper demonstrates that LSA is able to solve large-scale problems with acceptable accuracy. Although dynamic mapping outperforms static mapping in terms of platform utilization [10], extra control logic is required. Therefore, we consider static application mapping and scheduling in this paper for simplicity; a dynamic extension is planned as future work.

III. PROBLEM FORMULATION

This section models the underlying application and platform. Then, the objective function of the memory-aware application mapping and scheduling problem is formulated.

A. Application Model

We exploit a variant of the communication task graph called the Memory-Aware Communication Task Graph (MACTG) to model an application, and we model task execution as an indivisible process with three non-overlapping phases: an input phase for collecting input data, a computing phase for computing output data, and an output phase for sending out the output data. The upper part of Table I lists the notations in a MACTG and their descriptions. The data memory requirement represents the data memory required in the computing phase. Fig. 1(a) shows an example of a MACTG.

978-1-4244-8971-8/10/$26.00 © 2010 IEEE
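To make the application model concrete, here is a minimal Python sketch of a MACTG container (illustrative only; the paper's own implementation is in C#, and the task values below are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Task:
    name: str
    et: int     # execution time ET_i (e.g., in microseconds)
    dmr: int    # data memory requirement DMR_i (e.g., in KB)

@dataclass
class MACTG:
    tasks: dict = field(default_factory=dict)   # name -> Task
    edges: dict = field(default_factory=dict)   # (src, dst) -> data volume

    def add_task(self, name, et, dmr):
        self.tasks[name] = Task(name, et, dmr)

    def add_edge(self, src, dst, volume):
        # a directed communication edge e_ij annotated with its data volume
        self.edges[(src, dst)] = volume

# Hypothetical two-task fragment in the style of Fig. 1(a).
g = MACTG()
g.add_task("t0", et=1, dmr=2)
g.add_task("t1", et=1, dmr=2)
g.add_edge("t0", "t1", volume=1)
```

The per-task execution time and data memory requirement correspond to the ET_i and DMR_i notations of Table I.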
TABLE I
MACTG AND PAG NOTATIONS

Notation       Description
T = {t_i}      Set of tasks
ET_i           t_i's execution time
DMR_i          t_i's data memory requirement
E = {e_ij}     Set of communication edges
N = {n_j}      Set of PM nodes
MS_j           Total data memory size on n_j
SMax_j         Maximum distributed shared memory size on n_j
B              Communication channel bandwidth
CH = {ch_ij}   Entire communication channel set

Fig. 1. MAG example: (a) a MACTG with normal tasks t0–t9 and annotated communication edges, (b) a PAG of a 2×2 mesh with PM nodes n0–n3 (each a processor plus data memory) connected by communication channels, (c) the corresponding MAG with memory access tasks t01, t02, t13, t14, t24, t25, t36, t49, t57, t58, t69, t79 and t89.

B. Platform Model

We model the target 2D homogeneous mesh NoC-based MPSoC by a Platform Architecture Graph (PAG). The platform consists of Processor Memory (PM) nodes and communication channels. Each PM node consists of one single-threaded processor and one data memory. The data memory is divided into two parts: local memory for the local processor and distributed shared memory for remote processors. The sizes of both parts are tunable during application execution under the condition of a constant total size, and the maximum distributed shared memory size is smaller than the total memory size. By introducing distributed shared memory, efficient memory partitioning becomes a challenge. This problem is addressed from a slightly different angle by cache partitioning problems, which target minimizing cache misses, whereas the memory partitioning problem minimizes access time/energy. We model PM nodes as vertices and communication channels as arcs in a PAG. The lower part of Table I lists the notations of a PAG and their descriptions, and Fig. 1(b) illustrates a sample PAG.

C. Memory Access Graph

The Memory Access Graph (MAG) describes data communication. We define a type of dummy task called a memory access task, whose memory requirement is equal to the communication data volume between the two corresponding normal tasks and whose execution time is zero. Memory access task t_ij represents the data communication from t_i to t_j. Mapping t_ij means locating the output data of t_i on one or multiple PM nodes; this data will be used by t_j as input data. In fact, each memory access task contains the input/output data storage information of two directly related normal tasks. For example, t4 requires data from t1 and t2; the corresponding memory access tasks are t14 and t24, as illustrated in the MAG example in Fig. 1(c). By introducing memory access tasks, the input/output from/to multiple tasks is modeled separately, so we do not have to explicitly give the input/output data location and which tasks fetch/produce it. Hence, in the rest of the paper, we consider two subsets T_N and T_A containing the normal and memory access tasks, respectively, such that T = T_N ∪ T_A. By replacing each communication edge with a memory access task and two new communication edges, a MACTG is converted to a MAG.

Since each task in a MAG can be mapped onto several PM nodes, and the input/output data of each task has to be stored on one or multiple PM nodes, we introduce two new annotations for each task in a MAG. The first one is a list of PM nodes. For a normal task, the first PM node in the list stores the program of the normal task and performs the computation during the computation phase, while the other PM nodes in the list only provide remote data memory for temporary and/or intermediate data during the computation phase. For a memory access task, it is the list of PM nodes storing the data. The other annotation is the list of corresponding data memory usage on the nodes in the PM node list. For example, suppose normal task t_i uses three PM nodes n_x, n_y and n_z, with n_y used for computation. Then the PM node list will be {y, x, z}. Assume t_i uses 8 KB in n_y as local data memory and 4 KB in each of n_x and n_z as remote distributed data memory. The memory usage list will then be {8 KB, 4 KB, 4 KB}, or {8, 4, 4}; if the unit of all entries is the same, it can be omitted from the list.

D. Constraint and Objective Function

In this subsection, we address the constraints and the objective function used in our problem formulation.

Constraints: Besides the constraints related to memory size, a deadline time_d on the execution time of the application is considered. This constraint is a criterion for validating solutions. Let time_0 be the starting time of the application and time_L be the finishing time of the last task in the application. Thus, the time constraint is time_L − time_0 ≤ time_d.

Objective function: Our objective is to minimize the application energy consumption while meeting the mentioned constraints. For simplicity, we assume that the leakage current is negligible [11], which means the total execution energy is fixed, so to minimize total energy consumption it is enough to minimize the communication energy consumption. The underlying system's communications include inter-task and intra-task communications. Inter-task communication is the data communication during the input and output phases, while intra-task communication is the data communication during the computing phase.
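The MACTG-to-MAG conversion described above (replace each communication edge by a memory access task plus two new edges) can be sketched as follows. This is an illustrative Python fragment, not the authors' implementation, and the "tXY" naming is only a convention for this sketch:

```python
def mactg_to_mag(edges):
    # edges: (src, dst) -> data volume, e.g. ("t0", "t1") -> 1.
    # Each edge (ti, tj) becomes a memory access task "tij" whose data
    # memory requirement equals the edge's volume, plus edges ti -> tij -> tj.
    mag_edges = {}
    access_dmr = {}
    for (src, dst), volume in edges.items():
        acc = "t" + src.lstrip("t") + dst.lstrip("t")   # e.g. "t01"
        access_dmr[acc] = volume
        mag_edges[(src, acc)] = volume
        mag_edges[(acc, dst)] = volume
    return mag_edges, access_dmr

# Fragment of Fig. 1(a): t0 sends 1 unit each to t1 and t2.
mag_edges, access_dmr = mactg_to_mag({("t0", "t1"): 1, ("t0", "t2"): 1})
```

After conversion, the memory access tasks carry the storage information, so mapping them decides where the inter-task data lives.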
To illustrate the concepts of inter-task and intra-task communication, consider the example of Fig. 1(c). Task t1 needs data produced by t0, which is modeled by t01, and t1 produces data for some other tasks. Suppose the application mapping and scheduling algorithm is performed such that the PM node list of t0 is {0}, of t01 is {1}, and of t1 is {2, 3}. In this case, the inter-task communications are the data transfers from n0 (on which task t0 runs) to n1 (in which the output data of task t0, i.e., the input data of task t1, is located) and from n1 to n2 (on which task t1 is executed). The intra-task communication is the data transfer between n2 (on which task t1 is executed) and n3 (which allocates remote shared data memory to task t1).

We suppose the length of the communication path between two PM nodes i and j, l_ij, is proportional to the length of the shortest path between them, h_ij. This means that for all communications we can say:

    l_ij = w_R · h_ij    (1)

where w_R is a constant routing factor.

We also assume that the energy consumption for a communication between two nodes i and j, En_ij, is proportional to CV_ij · l_ij, where CV_ij is the communication data volume between these two nodes. Considering Eq. (1), we can say En_ij ∝ CV_ij · h_ij. We denote CV_ij · h_ij as the communication cost between two PM nodes i and j. Therefore, to minimize the total energy consumption, we can minimize the sum of all individual intra-task and inter-task communication costs of the application executed on the given platform.

Before that, we need to introduce several other notations. The allocated data memory sizes on node k for normal task i and memory access task j are denoted as x^N_ik and x^A_jk, respectively. The dependency task set of a task contains all directly dependent tasks of that task. The dependency task set of t_i is denoted as T^d_i. For example, in Fig. 1(c), T^d_1 = {t01} and T^d_01 = {t0}.

As described before, the mapping of each memory access task t_ij specifies where and how much memory is allocated to the output data of normal task t_i, i.e., the input data of normal task t_j. Therefore, for inter-task communication, the communication data is just the input/output data needed/produced by normal tasks. In this respect, the total inter-task communication cost consists of the total communication cost for writing output data, C^out, and for reading input data, C^in. In the mentioned example, the communication cost for writing the output data modeled by memory access task t01 is C^out_01 = CV_{0,1} · h_{0,1} · w_R. Since the required memory size for memory access task t01 in node n1 is equal to CV_{0,1}, we can say C^out_01 = x^A_{01,1} · h_{0,1} · w_R. Therefore, for calculating C^out in an application, it is enough to compute C^out_i for all i ∈ T_A. Assuming task i ∈ T_A is mapped on node n^m_i and task s ∈ T^d_i is mapped and executed on node n^r_s, the total communication cost for writing output data is as follows:

    C^out = Σ_{∀i∈T_A} Σ_{∀s∈T^d_i} x^A_{i,n^m_i} · h_{n^m_i, n^r_s} · w_R    (2)

In the mentioned example, normal task t1 needs data from normal task t0, which is modeled by memory access task t01. Therefore, the communication cost for reading the input data of normal task t1 is C^in_1 = CV_{1,2} · h_{1,2} · w_R. Likewise, since CV_{1,2} = x^A_{01,1}, we can say C^in_1 = x^A_{01,1} · h_{1,2} · w_R. Therefore, for calculating C^in in an application, it is enough to compute C^in_i for all i ∈ T_N. Assuming task i ∈ T_N is mapped and executed on node n^r_i and task s ∈ T^d_i is mapped on node n^m_s, the total communication cost for reading input data is as follows:

    C^in = Σ_{∀i∈T_N} Σ_{∀s∈T^d_i} x^A_{s,n^m_s} · h_{n^r_i, n^m_s} · w_R    (3)

Hence, the total inter-task communication cost, C^inter, is equal to C^out + C^in. Assume normal task t_i is mapped on the set of PM nodes N^T_i and runs on node n_r ∈ N^T_i. Then it has intra-task communication with the other pieces of remote shared data memory on each node k ∈ N^T_i, k ≠ n_r. Thus, the total intra-task communication cost for normal task t_i, C^intra_i, is:

    C^intra_i = Σ_{∀k∈N^T_i} CV_{n_r,k} · h_{k,n_r} · w_R · w_a    (4)

where w_a is a weight factor used to simply model how many times, on average, a normal task accesses its remote data memory, because sometimes it accesses its remote data memory more than once. It is clear that CV_{n_r,k} is equal to the data memory allocated to task t_i on node k, x^N_ik. Also, for computing C^intra in an application, it is necessary to calculate C^intra_i for all i ∈ T_N. Therefore, the total intra-task communication cost in the application can be written as:

    C^intra = Σ_{∀i∈T_N} Σ_{∀k∈N^T_i} x^N_ik · h_{k,n_r} · w_R · w_a    (5)

Thus, the total communication cost in the application is C = C^inter + C^intra, and our objective is to minimize it.
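Under the cost model of Eqs. (1)–(5), evaluating a candidate mapping reduces to summing volume × hop-count products. The following simplified Python sketch assumes Manhattan hop counts on a mesh; the node names, coordinates and allocations in the example are hypothetical:

```python
W_R, W_A = 1, 2   # routing factor w_R and remote-access factor w_a

def hops(a, b):
    # Manhattan hop count between mesh coordinates a = (row, col), b = (row, col)
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def intra_cost(alloc, exec_node, coords):
    # Eq. (4)-style cost: remote allocations of one normal task (KB per node)
    # against its execution node; local memory on exec_node costs nothing.
    return sum(kb * hops(coords[node], coords[exec_node]) * W_R * W_A
               for node, kb in alloc.items() if node != exec_node)

# 4 KB of remote memory one hop away from the execution node:
coords = {"n2": (1, 0), "n3": (1, 1)}
cost = intra_cost({"n2": 2, "n3": 4}, "n2", coords)   # 4 * 1 * w_R * w_a = 8
```

C^out and C^in can be evaluated the same way, iterating over memory access tasks and their dependency nodes instead of remote allocations.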
IV. ALGORITHM DESCRIPTION

A. Spiral Algorithm

Since LSA is an extension of the existing spiral core mapping algorithm [9], we briefly describe the spiral algorithm in this subsection. In this algorithm, IP cores are mapped onto the tiles of the mesh platform based on a Task Priority List (TPL), which is a list of the tasks ordered by the priority with which they should be mapped. In a mesh topology, the inner switches have a higher connection degree compared to the boundary switches. This provides more connectivity to the neighboring switches and forms a Platform Priority List (PPL) (we call it the Node Priority List (NPL) in LSA to reflect the task-node mapping), which starts from the center of the mesh platform and ends at a boundary switch in a spiral fashion. The priority assignment policy is expressed in the following rules:

a) The tasks that have higher data transfer sizes should be placed as close as possible to each other to satisfy the bandwidth constraint.
b) The tasks which are tightly related to each other should have the least possible Manhattan distance on the mesh platform.
c) The tasks which have high connection degrees should not be placed on the boundaries. For these tasks, the central area of the mesh is the best candidate.

All IP cores are mapped onto tiles of the mesh platform based on the TPL, the PPL and the above rules, starting from the IP core of the task with the highest priority in the TPL.
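One way to build the center-out spiral node ordering that the NPL uses is a standard spiral walk over the mesh, skipping out-of-bounds cells. This is an illustrative sketch, not the authors' code, and the turn direction is an arbitrary choice:

```python
def spiral_npl(rows, cols):
    # Walk an outward spiral starting at the mesh center; nodes collected
    # earlier get higher priority. Cells outside the mesh are skipped.
    r, c = rows // 2, cols // 2
    dr, dc = 0, 1                  # first leg moves right
    steps, leg = 1, 0
    order = []
    while len(order) < rows * cols:
        for _ in range(steps):
            if 0 <= r < rows and 0 <= c < cols:
                order.append((r, c))
            r, c = r + dr, c + dc
        dr, dc = dc, -dr           # turn 90 degrees
        leg += 1
        if leg == 2:               # after every second turn, legs get longer
            leg, steps = 0, steps + 1
    return order

# On a 2x2 mesh with row-major node ids this yields n3, n2, n0, n1,
# matching the NPL used in the paper's 2x2 example.
npl = [r * 2 + c for r, c in spiral_npl(2, 2)]
```

For the 4 × 4 platform used in the experiments, the walk starts at a central tile and covers all sixteen nodes exactly once.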
4. access layers. The former are layers which contain normal tasks TPL of each task layer is generated. After that, node priority list
and the latter include memory access tasks. It is clear that there of the PAG is built in line 6. Line 7 maps the first task layer
is no task layer including normal and memory access task both. onto the platform by using the spiral algorithm. Lines 8-16 map
The set of normal layers is denoted as N L and likewise, AL is remaining task layers onto the platform as follows.
the set of memory access layers. First, the dependency node set Nid for each task ti in
2) Task priority list (TPL): Each task layer has a TPL which task priority list is computed. If Nid has only one node nd ,
is an ordered set of tasks of that layer. The ordering criterion try to map ti onto nd . If nd is not an execution-available node,
is based on priority of each task which means the task with try to map ti onto an execution-available PM node that has
the higher priority is located in front of the task with the lower the highest priority among all execution-available PM nodes.
priority in TPL list. The task priorities are specified as follows: If any single PM nodes has no enough available memory for
• For normal tasks: Each task with larger total input and ti , try to map the execution of ti onto nd and the remote data
output data volume has higher priority. memory onto storage-available PM nodes with as high priority
• For memory access tasks: The priority of each memory as possible. If this trial fails, try to map execution of ti onto an
access task depends on priority of its dependency (parent) execution-available PM node that has the highest priority in all
and child tasks which all are normal tasks. Hence, for execution-available nodes and map ti ’s remote data memory onto
comparing two memory access tasks tix and tjy , we first storage-available PM nodes with as high priority as possible. If
check the priority of tx and ty . If we find that tx has higher this mapping also fails, the LSA will fail. This mapping is called
priority than ty , the memory access task priority list is {tix , single dependency mapping in Algorithm 1. If Nid has multiple
tjy }. If x = y, we check the priority of ti and tj . If priority nodes, for each node nd ∈ Nid , follow the above procedure and
of ti is higher than tj ’s priority, the memory access task collect the cost of each mapping solution of ti . Then compare
priority list is {tix , tjy }. If child task of a memory access the costs and pick up the solution with the minimal cost. This
task is not in the next layer of the current layer, this memory mapping is called multiple dependencies mapping in Algorithm
access task has lower priority compared to other memory 1. Finally, record mapping and scheduling results and if the
access tasks in the same layer. total time fulfills the timing constraint, the algorithm successes;
3) Node priority list (NPL) is an ordered set of all nodes so that otherwise the algorithm cannot solve this problem.
the nodes located closer to the center of the mesh have higher C. Application Mapping and Scheduling Example
priority. If some nodes are located as close as to the center of
Consider a simple example of mapping an application shown
the mesh, the priority list will be formed in a spiral style starting
in Fig. 1(a) onto a platform shown in Fig. 1(b). The numbers
from the right bottom corner.
along with the arrows are the communication volumes between
4) Dependency node set: The dependency node set of a task
two tasks and the arguments are listed in Table II.
contains all nodes assigned to all directly dependent tasks of that
We consider the following assumptions for the platform.
task. The dependency node set of ti is denoted as Nid .
Bandwidth is 1 GB/sec, data memory size is 8 KB and maximum
We also introduce execution-available and storage-available
6 KB data memory can be shared.
terms for PM nodes to state that a PM node is ready for executing
The data memory requirements in KB for memory access
a task and it has available data memory for remote or local
tasks are listed in Table III. TPL of each layer is built by
access, respectively. It is worth mentioning that an execution-
sorting tasks such that the tasks with higher total communication
available PM node is also an storage-available PM node. The
volume has higher priority compared with other tasks. Table IV
LSA works as Algorithm 1.
lists the TPL of each task layer. Since all nodes have the same
Algorithm 1 Layered spiral algorithm pseudo-code connectivity, NPL is {n3 , n2 , n0 , n1 } which lists all nodes in a
spiral style.
1: Gm (T, E) → M AG = Gl (T, E);
Then we map task layer 0, which contains only t0 . This
2: Lt = task layers (Gl (T, E));
task is mapped onto n3 and uses 2 KB of data memory in
3: for all task layer li in Lt do
P
4: li = task priority list (li );
5: end for TABLE II
A NNOTATIONS FOR THE MACTG EXAMPLE SHOWN IN F IG . 1( A )
6: Ln = node priority list (Gp (N, CH));
7: M ap(l0 ); Task ET DMR Task ET DMR
8: for task layer 1 to last layer do (µs) (KB) (µs) (KB)
P 0 1 2 5 2 4
9: for all ti in li do
1 1 2 6 2 4
10: if count(Nid ) = 1 then 2 9 16 7 1 2
11: single dependency mapping(ti ); 3 2 2 8 1 2
12: else 4 2 4 9 4 16
13: multiple dependencies mapping(ti );
TABLE III
14: end if DATA MEMORY REQUIREMENTS FOR MEMORY ACCESS TASKS
15: end for
16: end for Task DMR Task DMR Task DMR Task DMR
01 1 24 3 57 2 89 2
02 1 25 1 58 2
Line 1 of the algorithm converts input MACTG Gm (T, E) to 13 2 36 2 69 2
MAG Gl (T, E) and line 2 extracts task layers. In lines 3-5, the 14 3 49 1 79 3
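Line 2 of Algorithm 1 (task layer extraction) can be realized by placing each task one layer after its deepest dependency. Below is a small illustrative Python sketch, run over the first layers of the Fig. 1(c) MAG (not the authors' C# code):

```python
def task_layers(deps):
    # deps: task -> set of direct dependency tasks (a DAG).
    # A task's layer index is the length of its longest dependency chain,
    # so every task lands one layer after its latest dependency.
    memo = {}
    def depth(t):
        if t not in memo:
            memo[t] = 0 if not deps[t] else 1 + max(depth(d) for d in deps[t])
        return memo[t]
    layers = {}
    for t in deps:
        layers.setdefault(depth(t), []).append(t)
    return [sorted(layers[k]) for k in sorted(layers)]

# First three layers of the MAG in Fig. 1(c).
deps = {"t0": set(), "t01": {"t0"}, "t02": {"t0"}, "t1": {"t01"}, "t2": {"t02"}}
layers = task_layers(deps)   # [["t0"], ["t01", "t02"], ["t1", "t2"]]
```

Because every edge of a MAG passes through a memory access task, normal and memory access layers alternate automatically under this scheme.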
TABLE IV
TASK LAYERS

Layer  Priority List       Layer  Priority List         Layer  Priority List
0      {0}                 3      {14, 24, 25, 13}      6      {7, 6, 8}
1      {01, 02}            4      {4, 5, 3}             7      {79, 69, 89}
2      {1, 2}              5      {57, 36, 58, 49}      8      {9}

Now n3 has 6 KB of data memory available. Then we map task layer 1, which contains {t01, t02}; t01 occupies 1 KB and t02 also occupies 1 KB, and both are mapped onto n3. The memory assigned to t0 is still in n3, so n3 now has 4 KB available. Then we map task layer 2, which contains {t1, t2}: t1 is mapped onto n3 (2 KB) and t2 is mapped onto n2 (8 KB), n0 (4 KB) and n3 (4 KB). The memory assigned to t0 in n3 is freed. Now n3 and n2 have no data memory available and n0 has 4 KB available. In the same way, we map all task layers and obtain the results shown in the LSA part of Table V. The time parameter in the table is the LSA computation time. With w_a = 2 and w_R = 1, the total communication cost is C = 65 µs and the total application execution time is time_L − time_0 = 74 µs.

V. MILP FORMULATION

To evaluate the capability of our method, we formulate the underlying problem as a MILP problem which maps tasks onto a generic regular NoC architecture and allocates the required data memory of each task on every PM node based on the task scheduling, minimizing the total communication cost while satisfying the applicable constraints in the network. For formulating this problem, besides the aforementioned notations, we introduce more notations. TaskL(l) is the set of independent tasks in layer l. InTaskL(l) is the set of dependency tasks of all tasks of layer l. For instance, InTaskL(2) in the example of Section IV-C is {01, 02}. The cost minimization MILP problem, Minimize-Cost-MILP, can then be formulated as follows.

Given a MAG = G_l(T, E); the hop count matrix of shortest paths between each two nodes, H = {h_ij | i, j ∈ N}; the memory size and the maximum shared memory size that each node j can allocate to tasks in the network, denoted as MS_j and SMax_j, respectively; and the data memory requirement of each normal task i, DMR^i_N, and of each memory access task i, DMR^i_A. Find the mapping matrix Y = {y_ij | y_ij = 0 or 1; ∀i ∈ T_N, ∀j ∈ N} and the sizes of the data memory allocated on each node j to every normal task i and memory access task i, x^N_ij for ∀i ∈ T_N, ∀j ∈ N and x^A_ij for ∀i ∈ T_A, ∀j ∈ N, respectively, such that:

    min_{y_ij, x^N_ij, x^A_ij}  Σ_{∀i∈T_N} Σ_{∀j∈N} Σ_{∀k∈N} x^N_ij · h_jk · y_ik · w_R · w_a
        + Σ_{∀i∈T_N} Σ_{∀s∈T^d_i} Σ_{∀j∈N} Σ_{∀k∈N} x^A_sk · h_jk · y_ij · w_R
        + Σ_{∀i∈T_A} Σ_{∀s∈T^d_i} Σ_{∀j∈N} Σ_{∀k∈N} x^A_ij · h_jk · y_sk · w_R    (6)

subject to:

    Σ_{∀i∈TaskL(l)} x^N_ij · (1 − y_ij) ≤ SMax_j    ∀j ∈ N, ∀l ∈ NL    (7)

    Σ_{∀i∈TaskL(l)} x^N_ij + Σ_{∀i∈InTaskL(l)} x^A_ij ≤ MS_j    ∀j ∈ N, ∀l ∈ NL    (8)

    Σ_{∀i∈TaskL(l)} x^A_ij + Σ_{∀i∈InTaskL(l)} x^N_ij ≤ MS_j    ∀j ∈ N, ∀l ∈ AL    (9)

    Σ_{∀i∈TaskL(l)} y_ij ≤ 1    ∀j ∈ N, ∀l ∈ NL    (10)

    Σ_{∀j∈N} y_ij = 1    ∀i ∈ T_N    (11)

    Σ_{∀j∈N} x^N_ij · y_ij > 0    ∀i ∈ T_N    (12)

    Σ_{∀j∈N} x^N_ij = DMR^i_N    ∀i ∈ T_N    (13)

    Σ_{∀j∈N} x^A_ij = DMR^i_A    ∀i ∈ T_A    (14)

where y_ij, x^N_ij and x^A_ij are the optimization variables.

Eq. (6) is the objective function of this optimization problem, which minimizes the total communication cost. Constraint (7) says that the data memory allocated on a node j to the tasks of a normal layer l that are not running on j cannot exceed the shared memory size of that node. Constraints (8) and (9) indicate that the data memory allocated on a node j to the tasks of each layer (NL and AL, respectively) and to their dependency tasks cannot exceed the memory size of that node. At most one task of each normal layer l can be executed on every node j, and each normal task i can run on only one node; these constraints are shown in (10) and (11), respectively. Constraint (12) says that each normal task i takes a part or all of its data memory requirement from the node on which it runs. Constraints (13) and (14) state that the data memory allocated to each task i ∈ T_N or T_A must sum to its data memory requirement.

It is worth mentioning that the above problem is a quadratic problem [12]. However, since an integer variable is multiplied by a binary variable, the Big M technique [13] can be used to convert it into a mixed integer linear program. We therefore solve the proposed Minimize-Cost-MILP problem using an efficient branch-and-bound algorithm in CPLEX.

VI. EXPERIMENTAL RESULT

Since there are no existing memory-aware task graph benchmarks, we use the tgff tool [14] to generate synthetic MACTGs. Five task graphs are randomly generated to test our LSA, which is implemented in C#. The program is run on a PC with a 2.8 GHz Intel i7 CPU and 8 GB of main memory. For all case studies, we assume that w_R and w_a are equal to 1 and 2, respectively, and that the bandwidth is 1 GB/sec.

A. Comparing with MILP Problem

To evaluate the capability of our method, we addressed the MILP problem of minimizing the total communication cost subject to the mentioned constraints. We apply the LSA method and the Minimize-Cost-MILP problem to two synthetic task graphs which are mapped to a 4 × 4 2D mesh network.

Table V shows the results of LSA and the Minimize-Cost-MILP problem for the case study of Section IV-C. The results show that, compared with the Minimize-Cost-MILP problem, LSA achieves an about 20% larger cost with only 7.92 ms run time. For a larger problem that maps and schedules a MACTG with 26 tasks, LSA runs in 8.56 ms and yields a cost of 3167 µs, while the Minimize-Cost-MILP problem runs about 3.5 hours and yields a cost of 2821 µs. Therefore, LSA can find an acceptable total communication cost in a very short time.
As the exploration space increases exponentially, the Minimize-Cost-MILP problem cannot solve larger problems within a reasonable time and physical memory. Thus, in the rest of this section, we obtain results for larger problems via the proposed LSA.

TABLE V
COMPARISON BETWEEN LSA AND THE Minimize-Cost-MILP PROBLEM

             LSA                           Minimize-Cost-MILP Problem
Task   PM Node List  Memory Usage List     PM Node List  Memory Usage List
0      {3}           {2}                   {1}           {2}
01     {3}           {1}                   {3}           {1}
02     {3}           {1}                   {1}           {1}
1      {3}           {2}                   {3}           {2}
13     {1}           {2}                   {3}           {2}
14     {0}           {3}                   {3}           {3}
2      {2, 0, 3}     {8, 4, 4}             {0, 1, 2}     {8, 7, 1}
24     {1}           {3}                   {2}           {3}
25     {3}           {1}                   {1}           {1}
3      {1}           {2}                   {3}           {2}
36     {1}           {2}                   {3}           {2}
4      {0}           {4}                   {2}           {4}
49     {0}           {1}                   {2}           {1}
5      {3}           {4}                   {1}           {4}
57     {3}           {2}                   {1}           {2}
58     {3}           {2}                   {1}           {2}
6      {1}           {4}                   {3}           {4}
69     {1}           {2}                   {3}           {2}
7      {3}           {2}                   {0}           {2}
79     {3}           {3}                   {0}           {3}
8      {2}           {2}                   {1}           {1}
89     {2}           {2}                   {1}           {2}
9      {2, 0, 3}     {6, 5, 5}             {1, 0, 3}     {6, 5, 5}
C      65 µs                               54 µs
time   7.92 ms                             10.91 s

B. Comparing with Non-memory-aware LSA

For the larger problems listed in Table VI, we solve them using LSA and then compare the results to those of a non-memory-aware version of LSA (nLSA). nLSA is LSA under the assumption that the platform has infinite memory. In Table VI, we tested LSA and nLSA with five test cases (Middle, Hard, Hard0, Hard1 and Hard2); the number of tasks of each test case is listed in the table. The target platform of each test case has 4 × 4 PM nodes, except for test cases Hard0 and Hard2, whose platforms have 8 × 8 PM nodes.

TABLE VI
EXPERIMENTAL RESULTS

           Number      LSA                        nLSA
Test Case  of tasks    Run Time      Cost (µs)    Run Time      Cost (µs)
Middle     26          8.5654 ms     3167         9.3606 ms     1615
Hard       81          13.4606 ms    13788        11.2038 ms    6819
Hard0      147         18.9718 ms    27872        18.7562 ms    12713
Hard1      32          8.5231 ms     4679         10.4146 ms    1990
Hard2      815         246.6145 ms   165486       274.7816 ms   43494

There are differences in cost between the memory-aware and non-memory-aware results. This means that nLSA actually uses some nodes that do not have enough available memory, which leads to a smaller cost but an invalid solution. For example, in the test case Middle, task t2 has a data memory requirement of 880 KB, while each PM node in the target platform has only 800 KB of memory. Therefore, LSA maps t2 onto two PM nodes (n9 and n10), while nLSA maps it onto only one PM node, n10.

VII. CONCLUSION AND FUTURE WORKS

LSA can find solutions with acceptable energy consumption in a very short time. As the exploration space increases exponentially with the number of tasks, large problems cannot be solved by the MILP solver within a reasonable time and physical memory, so we can only obtain solutions via the proposed LSA. As shown in Table VI, the LSA runtime for mapping and scheduling an application consisting of 815 tasks onto an 8 × 8 platform is still under 0.25 sec. By comparing memory-aware and non-memory-aware solutions, we conclude that memory requirements and availability are critical and should not be ignored when modeling applications and platforms.

In future work, we expect the concept of LSA to also fit dynamic application mapping and scheduling problems. Another direction is to add more tunable options to LSA so that it can be optimized more toward parallelism or toward energy consumption. Implementation of task layer partitioning is also planned as future work.

REFERENCES

[1] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Öberg, M. Millberg, and D. Lindqvist, "Network on a chip: An architecture for billion transistor era," in Proc. NORCHIP 2000, Turku, Finland, Nov. 2000.
[2] R. Pop and S. Kumar, "A survey of techniques for mapping and scheduling applications to network on chip systems," School of Engineering, Jönköping University, Jönköping, Sweden, Tech. Rep. 04:4, Apr. 2004.
[3] G. Chen, F. Li, S. W. Son, and M. Kandemir, "Application mapping for chip multiprocessors," in Proc. DAC '08, Anaheim, California, USA, Jun. 2008, pp. 620–625.
[4] N. Dutt, "Memory-aware NoC exploration and design," in Proc. Design, Automation and Test in Europe (DATE '08), Munich, Germany, Apr. 2008, pp. 1128–1129.
[5] T. Lei and S. Kumar, "A two-step genetic algorithm for mapping task graphs to a network on chip architecture," in Proc. Euromicro Symposium on Digital Systems Design (DSD), Belek, Turkey, Sep. 2003.
[6] S. Yang, L. Li, M. Gao, and Y. Zhang, "An energy- and delay-aware mapping method of NoC," Acta Electronica Sinica, vol. 36, no. 5, pp. 937–942, May 2008.
[7] H. Yu, Y. Ha, and B. Veeravalli, "Communication-aware application mapping and scheduling for NoC-based MPSoCs," in Proc. 2010 IEEE International Symposium on Circuits and Systems (ISCAS), Paris, France, May/Jun. 2010, pp. 3232–3235.
[8] M. Monchiero, G. Palermo, C. Silvano, and O. Villa, "Exploration of distributed shared memory architectures for NoC-based multiprocessors," in Proc. International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (IC-SAMOS 2006), Samos, Greece, Jul. 2006, pp. 144–151.
[9] A. Mehran, S. Saeidi, A. Khademzadeh, and A. Afzali-Kusha, "Spiral: A heuristic mapping algorithm for network on chip," IEICE Electron. Express, vol. 4, no. 15, pp. 478–484, 2007.
[10] E. Carvalho, C. Marcon, N. Calazans, and F. Moraes, "Evaluation of static and dynamic task mapping algorithms in NoC-based MPSoCs," in Proc. International Symposium on System-on-Chip (SoC 2009), Tampere, Finland, Oct. 2009, pp. 87–90.
[11] N. Kim, T. Austin, D. Baauw, T. Mudge, K. Flautner, J. Hu, M. Irwin, M. Kandemir, and V. Narayanan, "Leakage current: Moore's law meets static power," Computer, vol. 36, no. 12, pp. 68–75, Dec. 2003.
[12] D. P. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.
[13] R. Rafeh, M. J. G. de la Banda, K. Marriott, and M. Wallace, "From Zinc to design model," in Proc. 9th International Symposium on Practical Aspects of Declarative Languages (PADL), Nice, France, Jan. 2007, pp. 215–229.
[14] R. Dick, D. Rhodes, and W. Wolf, "TGFF: task graphs for free," in Proc. 6th International Workshop on Hardware/Software Codesign (CODES/CASHE '98), Seattle, WA, USA, Mar. 1998, pp. 97–101.