Layered Spiral Algorithm for Memory-Aware Mapping and Scheduling on Network-on-Chip

Shuo Li, Fahimeh Jafari, Ahmed Hemani
Department of Electronic Systems
School of Information and Communication Technology
Royal Institute of Technology
Stockholm, Sweden
Email: {shuol, fjafari, hemani}@kth.se

Shashi Kumar
School of Engineering
Jönköping University
Jönköping, Sweden
Email: Shashi.Kumar@jth.hj.se
Abstract—In this paper, the Layered Spiral Algorithm (LSA) is proposed for memory-aware application mapping and scheduling onto Network-on-Chip (NoC) based Multi-Processor System-on-Chip (MPSoC). Energy consumption is optimized while high task-level parallelism is maintained. The experimental evaluation indicates that if memory-awareness is not considered during mapping and scheduling, memory overflows may occur. The underlying problem is also modeled as a Mixed Integer Linear Programming (MILP) problem and solved using an efficient branch-and-bound algorithm, in order to compare optimal solutions with the results achieved by LSA. Compared to the MILP solutions, the LSA results show only about a 20% and a 12% increase in total communication cost for a small and a middle-size synthetic problem, respectively, while LSA is orders of magnitude faster. LSA can therefore find solutions with acceptable total communication cost at low run-time complexity, enabling quick exploration of large design spaces that is infeasible for exhaustive search.
I. INTRODUCTION

Even though the Network-on-Chip (NoC) was introduced a decade ago [1], programming such platforms remains arduous, since the application mapping and scheduling problem is NP-hard [2] and existing compilation tools are not suitable in the NoC context [3]. In addition, since memory is becoming critical in the NoC design process [4], memory requirements and availability should not be ignored. This memory-awareness increases the complexity of the application mapping and scheduling problem even further. Since the entire solution space is enormous, problem-specific algorithms are needed, beyond exhaustive search, to obtain acceptable solutions in a reasonable time. In this paper, we describe (1) the Memory-Aware Communication Task Graph (MACTG) to model applications, (2) the Platform Architecture Graph (PAG) to model NoC-based Multi-Processor System-on-Chip (MPSoC) platforms, (3) a problem-specific heuristic algorithm, the Layered Spiral Algorithm (LSA), to quickly map and schedule an application characterized by a MACTG onto an architecture characterized by a PAG, and (4) a Mixed Integer Linear Programming (MILP) formulation of the underlying problem, solved by a branch-and-bound algorithm. The efficacy of the proposed algorithm is then established by comparing the MILP solutions with the LSA results.

The rest of this paper is organized as follows. Section II introduces related work. Section III formulates the application mapping and scheduling problem based on the application and platform models. Section IV is devoted to the description of LSA. We model the underlying problem as an MILP in Section V and evaluate our approach in Section VI. Section VII gives conclusions and future work.

II. RELATED WORK

Several mapping algorithms [5][6] have been proposed without memory-awareness or scheduling coverage. In [7], the scheduling problem is covered together with mapping, but without memory-awareness. As claimed in [4], memory is critical in the NoC design process and should consequently be considered when mapping and scheduling practical applications onto practical platforms. In [8], memory-awareness is covered but scheduling is not. In this paper, we propose a very fast algorithm, the Layered Spiral Algorithm (LSA), to solve the memory-aware application mapping and scheduling problem. The objective of LSA is to minimize total energy consumption while keeping task-level parallelism high. The proposed algorithm is based on the spiral algorithm [9], a very fast mapping algorithm without memory-awareness; we extend it with memory-aware concepts and task layers to cover both memory-awareness and scheduling. The paper demonstrates that LSA can solve large-scale problems with acceptable accuracy. Although dynamic mapping outperforms static mapping in terms of platform utilization [10], it requires extra control logic. We therefore consider static application mapping and scheduling in this paper for simplicity; a dynamic extension is planned as future work.

III. PROBLEM FORMULATION

This section models the underlying application and platform. Then, the objective function of the memory-aware application mapping and scheduling problem is formulated.

A. Application Model

We exploit a variant of the communication task graph, called the Memory-Aware Communication Task Graph (MACTG), to model an application, and we model task execution as an indivisible process with three non-overlapping phases: an input phase for collecting input data, a computing phase for computing output data, and an output phase for sending out the output data. The upper part of Table I lists the notations of a MACTG and their descriptions. The data memory requirement gives the data memory required during the computing phase. Fig. 1(a) shows an example of a MACTG.
TABLE I
MACTG AND PAG NOTATIONS

Notation        Description
T = {t_i}       Set of tasks
ET_i            t_i's execution time
DMR_i           t_i's data memory requirement
E = {e_ij}      Set of communication edges
N = {n_j}       Set of PM nodes
MS_j            Total data memory size on n_j
SMax_j          Maximum distributed shared memory size on n_j
B               Communication channel bandwidth
CH = {ch_ij}    Entire communication channel set

[Fig. 1. MAG Example: (a) a MACTG whose edges are annotated with communication volumes, (b) a four-node PAG (PM nodes n_0-n_3 connected by communication channels), and (c) the MAG obtained by replacing each communication edge with a memory access task.]
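Read as data structures, the notation of Table I maps directly onto a few record types. The following is a minimal sketch in Python; the class and field names are ours, not the paper's:

    from dataclasses import dataclass

    @dataclass
    class Task:            # t_i in T (upper part of Table I)
        et: float          # ET_i: execution time
        dmr: int           # DMR_i: data memory requirement (computing phase)

    @dataclass
    class PMNode:          # n_j in N (lower part of Table I)
        ms: int            # MS_j: total data memory size
        smax: int          # SMax_j: max distributed shared memory size

    @dataclass
    class MACTG:           # application model
        tasks: dict        # {i: Task}
        edges: dict        # E = {e_ij}: (i, j) -> communication volume

    @dataclass
    class PAG:             # platform model
        nodes: dict        # {j: PMNode}
        channels: set      # CH = {ch_ij}: pairs of directly linked nodes
        bandwidth: float   # B: communication channel bandwidth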
B. Platform Model

We model the target 2D homogeneous mesh NoC-based MPSoC with a Platform Architecture Graph (PAG). The platform consists of Processor Memory (PM) nodes and communication channels. Each PM node contains one single-threaded processor and one data memory. The data memory is divided into two parts: local memory for the local processor and distributed shared memory for remote processors. The sizes of both parts are tunable during application execution, under the conditions that the total size stays constant and that the maximum distributed shared memory size is smaller than the total memory size. With distributed shared memory, efficient memory partitioning becomes a challenge. This problem is addressed from a slightly different angle by cache partitioning, which targets minimizing cache misses, whereas the memory partitioning problem minimizes access time/energy. We model PM nodes as vertices and communication channels as arcs in the PAG. The lower part of Table I lists the notations of a PAG and their descriptions, and Fig. 1(b) illustrates a sample PAG.
C. Memory Access Graph

The Memory Access Graph (MAG) describes data communication. We define a type of dummy task, called a memory access task, whose memory requirement equals the communication data volume between the two corresponding normal tasks and whose execution time is zero. Memory access task t_ij represents the data communication from t_i to t_j. Mapping t_ij means locating the output data of t_i on one or multiple PM nodes; this data will be used by t_j as input data. In effect, each memory access task carries the input/output data storage information of its two directly related normal tasks. For example, t_4 requires data from t_1 and t_2; the corresponding memory access tasks are t_14 and t_24, as illustrated in the MAG example in Fig. 1(c). By introducing memory access tasks, the inputs/outputs from/to multiple tasks are modeled separately, so we do not have to state explicitly where input/output data is located and which tasks fetch/produce it. Hence, in the rest of the paper we consider two subsets, T_N and T_A, containing the normal and memory access tasks, respectively, such that T = T_N ∪ T_A. Replacing each communication edge by a memory access task and two new communication edges converts a MACTG into a MAG.

Since each task in a MAG can be mapped onto several PM nodes, and the input/output data of each task has to be stored on one or multiple PM nodes, we introduce two new annotations for each task in the MAG. The first is a list of PM nodes. For a normal task, the first PM node in the list stores the program of the task and performs the computation during the computing phase, while the other PM nodes in the list only provide remote data memory for temporary and/or intermediate data. For a memory access task, it is the list of PM nodes storing the data. The second annotation is the list of corresponding data memory usages on the nodes of the PM node list. For example, suppose normal task t_i uses three PM nodes n_x, n_y and n_z, with n_y used for computation; the PM node list is then {y, x, z}. If t_i uses 8 KB in n_y as local data memory and 4 KB in each of n_x and n_z as remote distributed data memory, the memory usage list is {8 KB, 4 KB, 4 KB}, or {8, 4, 4}: when all entries share the same unit, the unit can be dropped from the list.
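The MACTG-to-MAG conversion just described is mechanical; a minimal sketch (with our own encoding: tasks as dicts keyed by id, an edge list with volumes) could look like this:

    def mactg_to_mag(tasks, edges):
        """Replace every communication edge (i, j) carrying volume v by a
        memory access task 'ij' with DMR = v and ET = 0, plus the two new
        edges i -> ij and ij -> j, so that T = T_N union T_A."""
        mag_tasks = dict(tasks)                        # normal tasks T_N
        mag_edges = []
        for (i, j), volume in edges.items():
            a = f"{i}{j}"                              # memory access task t_ij
            mag_tasks[a] = {"et": 0, "dmr": volume}    # member of T_A
            mag_edges += [(i, a, volume), (a, j, volume)]
        return mag_tasks, mag_edges

    # e.g. the edge t0 -> t1 with volume 1 becomes t0 -> t01 -> t1,
    # where t01 has a data memory requirement of 1.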
D. Constraint and Objective Function

In this subsection, we state the constraints and the objective function used in our problem formulation.

Constraints: Besides the constraints related to memory sizes, a deadline time_d on the execution time of the application is considered. This constraint is the criterion for validating solutions. Let time_0 be the starting time of the application and time_L be the finishing time of its last task. The time constraint is then time_L − time_0 ≤ time_d.

Objective function: Our objective is to minimize the application's energy consumption while meeting the above constraints. For simplicity, we assume that the leakage current is negligible [11], which means the total execution energy is fixed; to minimize the total energy consumption it therefore suffices to minimize the communication energy. The system's communications comprise inter-task and intra-task communication. Inter-task communication is the data communication during the input and output phases, while intra-task communication is the data communication during the computing phase.

To illustrate the concepts of inter-task and intra-task communication, consider the example of Fig. 1(c). Task t_1 needs data produced by t_0, which is modeled by t_01, and t_1 produces data for other tasks. Suppose the application mapping and scheduling algorithm yields the PM node lists {0} for t_0, {1} for t_01, and {2, 3} for t_1. In this case, the inter-task communications are the data transfers from n_0 (on which t_0 runs) to n_1 (where the output data of t_0, i.e., the input data of t_1, is located), and from n_1 to n_2 (on which t_1 executes). The intra-task communication is the data transfer between n_2 (on which t_1 executes) and n_3 (which allocates remote shared data memory to t_1). We assume that the length of the communication path between any two PM nodes i and j, l_ij, is proportional to the length of the shortest path between them, h_ij. For all communications we can therefore write

    l_{ij} = w_R \, h_{ij}        (1)

where w_R is a constant routing factor.

We also assume that the energy consumption of a communication between two nodes i and j, En_ij, is proportional to CV_ij · l_ij, where CV_ij is the communication data volume between the two nodes. Considering Eq. (1), En_ij ∝ CV_ij · h_ij, and we call CV_ij · h_ij the communication cost between PM nodes i and j. To minimize the total energy consumption, we can therefore minimize the sum of all individual intra-task and inter-task communication costs of the application executed on the given platform.
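On a 2D mesh, h_ij is the Manhattan distance between tile coordinates, so the cost of a single communication is easy to evaluate. A small sketch (node indices laid out row by row; the 4 × 4 default width is an assumption matching the later experiments):

    def hops(a, b, width=4):
        """h_ij: shortest-path hop count between PM nodes a and b on a
        2D mesh whose tiles are numbered row by row."""
        return abs(a % width - b % width) + abs(a // width - b // width)

    def comm_cost(cv, a, b, w_r=1):
        """Communication cost CV_ij * h_ij (energy up to a constant),
        with the routing factor w_R of Eq. (1) folded in."""
        return cv * hops(a, b) * w_r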
Before that, we need several more notations. The data memory sizes allocated on node k to normal task i and to memory access task j are denoted x^N_{ik} and x^A_{jk}, respectively. The dependency task set of a task contains all tasks on which it directly depends; the dependency task set of t_i is denoted T_i^d. For example, in Fig. 1(c), T_1^d = {t_01} and T_01^d = {t_0}.

As described before, the mapping of each memory access task t_ij specifies where, and how much, memory is allocated to the output data of normal task t_i, i.e., the input data of normal task t_j. For inter-task communication, the communicated data is thus exactly the input/output data needed/produced by normal tasks. Accordingly, the total inter-task communication cost consists of the total cost of writing output data, C^out, and of reading input data, C^in. In the example above, the communication cost of writing the output data modeled by memory access task t_01 is C^out_01 = CV_{0,1} · h_{0,1} · w_R. Since the memory size required by t_01 on node n_1 equals CV_{0,1}, we can write C^out_01 = x^A_{01,1} · h_{0,1} · w_R. Hence, to calculate C^out for an application it suffices to compute C^out_i for all i ∈ T_A. Assuming task i ∈ T_A is mapped on node n_i^m and task s ∈ T_i^d is mapped and executed on node n_s^r, the total communication cost of writing output data is

    C^{out} = \sum_{\forall i \in T_A} \sum_{\forall s \in T_i^d} x^A_{i,n_i^m} \, h_{n_i^m, n_s^r} \, w_R        (2)

In the same example, normal task t_1 needs data from normal task t_0, modeled by memory access task t_01. The communication cost of reading the input data of t_1 is therefore C^in_1 = CV_{1,2} · h_{1,2} · w_R and, since CV_{1,2} = x^A_{01,1}, we can write C^in_1 = x^A_{01,1} · h_{1,2} · w_R. Hence, to calculate C^in for an application it suffices to compute C^in_i for all i ∈ T_N. Assuming task i ∈ T_N is mapped and executed on node n_i^r and task s ∈ T_i^d is mapped on node n_s^m, the total communication cost of reading input data is

    C^{in} = \sum_{\forall i \in T_N} \sum_{\forall s \in T_i^d} x^A_{s,n_s^m} \, h_{n_i^r, n_s^m} \, w_R        (3)

The total inter-task communication cost, C^inter, is then equal to C^out + C^in. Assume normal task t_i is mapped on the set of PM nodes N_i^T and runs on node n_i^r ∈ N_i^T. It then has intra-task communication with the pieces of remote shared data memory on every node k ∈ N_i^T, k ≠ n_i^r, so the total intra-task communication cost of normal task t_i, C_i^intra, is

    C_i^{intra} = \sum_{\forall k \in N_i^T} CV_{n_i^r,k} \, h_{k,n_i^r} \, w_R \, w_a        (4)

where w_a is a weight factor that models, in a simple way, how many times on average a normal task accesses its remote data memory, since it may do so more than once. Clearly, CV_{n_i^r,k} equals the data memory allocated to task t_i on node k, x^N_{ik}. To compute C^intra for an application, it is necessary to calculate C_i^intra for all i ∈ T_N, so the total intra-task communication cost of the application can be written as

    C^{intra} = \sum_{\forall i \in T_N} \sum_{\forall k \in N_i^T} x^N_{ik} \, h_{k,n_i^r} \, w_R \, w_a        (5)

Thus, the total communication cost of the application is C = C^inter + C^intra, and our objective is to minimize it.
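Given a complete mapping, Eqs. (2)-(5) can be evaluated directly. The sketch below reuses hops() from the previous sketch and the annotation lists of Section III-C, with the first PM node in a normal task's list taken as its execution node; the dictionary encoding is ours:

    def total_cost(normal, access, dep, w_r=1, w_a=2):
        """normal: {i: (pms, mem)} for t_i in T_N, where pms[0] runs t_i;
        access: {a: (pms, mem)} for memory access tasks t_a in T_A;
        dep: {task: dependency task set T^d}. Returns C = C_inter + C_intra."""
        c_out = sum(m * hops(n, normal[s][0][0]) * w_r      # Eq. (2)
                    for a, (pms, mem) in access.items()
                    for s in dep.get(a, ())                 # producer of t_a
                    for n, m in zip(pms, mem))
        c_in = sum(m * hops(normal[i][0][0], n) * w_r       # Eq. (3)
                   for i in normal
                   for s in dep.get(i, ())                  # t_s in T_A
                   for n, m in zip(*access[s]))
        c_intra = sum(m * hops(pms[0], n) * w_r * w_a       # Eqs. (4)-(5)
                      for pms, mem in normal.values()
                      for n, m in zip(pms[1:], mem[1:]))
        return c_out + c_in + c_intra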
IV. ALGORITHM DESCRIPTION

A. Spiral Algorithm

Since LSA is an extension of the existing spiral core mapping algorithm [9], we briefly describe the spiral algorithm in this subsection. In this algorithm, IP cores are mapped onto the tiles of the mesh platform based on a Task Priority List (TPL), a list of the tasks ordered by the priority with which they should be mapped. In a mesh topology, the inner switches have a higher connection degree than the boundary switches. This greater connectivity to neighboring switches yields a Platform Priority List (PPL) (called the Node Priority List (NPL) in LSA, to reflect the task-to-node mapping) that starts at the center of the mesh platform and ends at a boundary switch in a spiral fashion. The priority assignment policy is expressed by the following rules:

a) Tasks with higher data transfer sizes should be placed as close as possible to each other, to satisfy the bandwidth constraint.
b) Tasks that are tightly related to each other should have the smallest possible Manhattan distance on the mesh platform.
c) Tasks with high connection degrees should not be placed on the boundaries; for these tasks, the central area of the mesh is the best candidate.

All IP cores are mapped onto tiles of the mesh platform based on the TPL, the PPL and the above rules, starting from the IP core of the task with the highest priority in the TPL.

B. LSA Algorithm Description

The inputs of LSA are an application modeled by a MACTG = G_m(T, E) and a platform modeled by a PAG = G_p(N, CH). The output of LSA is the application mapping and scheduling.

The intermediate results in the algorithm are:

1) Task layer: a set of independent tasks. Each task in task layer k has its dependency tasks in task layers k−1, ..., 0. There are two kinds of task layers: normal and memory
access layers. The former contain normal tasks and the latter memory access tasks; no task layer contains both a normal and a memory access task. The set of normal layers is denoted NL and, likewise, AL is the set of memory access layers.

2) Task priority list (TPL): Each task layer has a TPL, an ordered set of the tasks of that layer. The ordering criterion is task priority: a task with higher priority precedes a task with lower priority in the TPL. Task priorities are assigned as follows (see the sketch after this list):

• For normal tasks: a task with a larger total input and output data volume has higher priority.
• For memory access tasks: the priority of a memory access task depends on the priorities of its dependency (parent) and child tasks, which are all normal tasks. Hence, to compare two memory access tasks t_ix and t_jy, we first check the priorities of t_x and t_y; if t_x has higher priority than t_y, the memory access task priority list is {t_ix, t_jy}. If x = y, we check the priorities of t_i and t_j; if t_i has higher priority than t_j, the priority list is {t_ix, t_jy}. If the child task of a memory access task is not in the layer immediately after the current one, that memory access task gets lower priority than the other memory access tasks of the same layer.

3) Node priority list (NPL): an ordered set of all nodes such that nodes located closer to the center of the mesh have higher priority. Nodes equally close to the center are ordered in a spiral style starting from the bottom-right corner.

4) Dependency node set: the dependency node set of a task contains all nodes assigned to all tasks on which that task directly depends. The dependency node set of t_i is denoted N_i^d.
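The list constructions above can be sketched compactly. The layering is a standard longest-path levelization of the MAG, and the NPL shown here orders nodes by distance from the mesh centre as an approximation of the spiral order (the memory-access tie-breaking rules are omitted); the encodings are ours:

    def task_layers(tasks, edges):
        """1) Put each task in layer 1 + max(layers of its dependencies)."""
        preds = {t: set() for t in tasks}
        succs = {t: set() for t in tasks}
        for u, v, _ in edges:
            preds[v].add(u); succs[u].add(v)
        layer, ready = {}, [t for t in tasks if not preds[t]]
        while ready:
            t = ready.pop()
            layer[t] = 1 + max((layer[p] for p in preds[t]), default=-1)
            ready += [s for s in succs[t]
                      if s not in layer and all(p in layer for p in preds[s])]
        groups = {}
        for t, k in layer.items():
            groups.setdefault(k, []).append(t)
        return [groups[k] for k in sorted(groups)]

    def tpl(layer_tasks, volume):
        """2) TPL of one layer: larger total in+out volume first."""
        return sorted(layer_tasks, key=volume, reverse=True)

    def npl(width, height):
        """3) NPL: nodes nearer the mesh centre first (the spiral
        tie-break from the bottom-right corner is not reproduced)."""
        cx, cy = (width - 1) / 2, (height - 1) / 2
        return sorted(range(width * height),
                      key=lambda n: abs(n % width - cx) + abs(n // width - cy))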
We also introduce the terms execution-available and storage-available for PM nodes, meaning that a PM node is ready to execute a task and that it has data memory available for remote or local access, respectively. Note that an execution-available PM node is also a storage-available PM node. LSA works as shown in Algorithm 1.
Algorithm 1 Layered spiral algorithm pseudo-code
 1: G_m(T, E) → MAG = G_l(T, E);
 2: L_t = task_layers(G_l(T, E));
 3: for all task layers l_i in L_t do
 4:   l_i^P = task_priority_list(l_i);
 5: end for
 6: L_n = node_priority_list(G_p(N, CH));
 7: Map(l_0);
 8: for task layer 1 to last layer do
 9:   for all t_i in l_i^P do
10:     if count(N_i^d) = 1 then
11:       single_dependency_mapping(t_i);
12:     else
13:       multiple_dependencies_mapping(t_i);
14:     end if
15:   end for
16: end for
Line 1 of the algorithm converts the input MACTG G_m(T, E) into the MAG G_l(T, E), and line 2 extracts the task layers. In lines 3-5, the TPL of each task layer is generated. After that, the node priority list of the PAG is built in line 6. Line 7 maps the first task layer onto the platform using the spiral algorithm. Lines 8-16 map the remaining task layers onto the platform as follows.

First, the dependency node set N_i^d of each task t_i in the task priority list is computed. If N_i^d contains only one node n_d, LSA tries to map t_i onto n_d. If n_d is not an execution-available node, it tries to map t_i onto the execution-available PM node with the highest priority. If no single PM node has enough available memory for t_i, it tries to map the execution of t_i onto n_d and the remote data memory onto storage-available PM nodes of as high priority as possible. If this trial fails, it tries to map the execution of t_i onto the highest-priority execution-available PM node and t_i's remote data memory onto storage-available PM nodes of as high priority as possible. If this mapping also fails, LSA fails. This procedure is called single_dependency_mapping in Algorithm 1. If N_i^d contains multiple nodes, then for each node n_d ∈ N_i^d LSA follows the above procedure and collects the cost of each mapping solution of t_i, compares the costs, and picks the solution with the minimal cost. This is called multiple_dependencies_mapping in Algorithm 1. Finally, the mapping and scheduling results are recorded; if the total time fulfills the timing constraint, the algorithm succeeds, otherwise it cannot solve the problem.
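As a rough illustration of single_dependency_mapping, the following sketch keeps only the memory bookkeeping; free[n] is the data memory still available on PM node n, npl is ordered by priority, and the execution-availability test and the SMax limit of the platform model are deliberately abstracted away:

    def single_dependency_mapping(dmr, nd, npl, free):
        """Try t_i whole on n_d, then whole on the best node, then split:
        run on some node and spill the rest by NPL priority. Returns the
        (pm_list, mem_list) annotations of Section III-C, or None."""
        order = [nd] + [n for n in npl if n != nd]
        for cand in order:                        # whole task fits on cand
            if free[cand] >= dmr:
                free[cand] -= dmr
                return [cand], [dmr]
        for runner in order:                      # split: execute on runner
            if free[runner] <= 0:
                continue
            pms, mem, left = [runner], [free[runner]], dmr - free[runner]
            for n in npl:                         # spill remote data by priority
                if left <= 0 or n == runner or free[n] <= 0:
                    continue
                take = min(free[n], left)
                pms.append(n); mem.append(take); left -= take
            if left <= 0:
                for n, m in zip(pms, mem):
                    free[n] -= m
                return pms, mem
        return None                               # LSA fails on this task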
C. Application Mapping and Scheduling Example

Consider a simple example: mapping the application shown in Fig. 1(a) onto the platform shown in Fig. 1(b). The numbers along the arrows are the communication volumes between tasks, and the task parameters are listed in Table II. We make the following assumptions for the platform: the bandwidth is 1 GB/sec, the data memory size is 8 KB, and at most 6 KB of data memory can be shared.

TABLE II
ANNOTATIONS FOR THE MACTG EXAMPLE SHOWN IN FIG. 1(A)

Task   ET (µs)   DMR (KB)      Task   ET (µs)   DMR (KB)
0      1         2             5      2         4
1      1         2             6      2         4
2      9         16            7      1         2
3      2         2             8      1         2
4      2         4             9      4         16

The data memory requirements (in KB) of the memory access tasks are listed in Table III. The TPL of each layer is built by sorting tasks so that tasks with a higher total communication volume get higher priority. Table IV lists the TPL of each task layer. Since all nodes have the same connectivity, the NPL is {n_3, n_2, n_0, n_1}, which lists all nodes in a spiral style.

TABLE III
DATA MEMORY REQUIREMENTS FOR MEMORY ACCESS TASKS

Task   DMR      Task   DMR      Task   DMR      Task   DMR
01     1        24     3        57     2        89     2
02     1        25     1        58     2
13     2        36     2        69     2
14     3        49     1        79     3

Then we map task layer 0, which contains only t_0. This task is mapped onto n_3 and uses 2 KB of the data memory in n_3.
TABLE IV
TASK LAYERS

Layer   Priority List        Layer   Priority List        Layer   Priority List
0       {0}                  3       {14, 24, 25, 13}     6       {7, 6, 8}
1       {01, 02}             4       {4, 5, 3}            7       {79, 69, 89}
2       {1, 2}               5       {57, 36, 58, 49}     8       {9}
Now n_3 has 6 KB of data memory available. Next we map task layer 1, which contains {t_01, t_02}; t_01 and t_02 each occupy 1 KB and both are mapped onto n_3. The memory assigned to t_0 is still in n_3, so n_3 now has 4 KB available. Then we map task layer 2, which contains {t_1, t_2}: t_1 is mapped onto n_3 (2 KB) and t_2 is mapped onto n_2 (8 KB), n_0 (4 KB) and n_3 (4 KB), while the memory assigned to t_0 in n_3 is freed. Now n_3 and n_2 have no data memory available and n_0 has 4 KB available. In the same way, we map all task layers and obtain the results shown in the LSA part of Table V; the time parameter in the table is the LSA computation time. With w_a = 2 and w_R = 1, the total communication cost is C = 65 µs and the total application execution time is time_L − time_0 = 74 µs.
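The bookkeeping of this walkthrough can be replayed mechanically. A toy check of the first three layers (sizes in KB; the exact moment at which t_0's output buffer is released is a simplification on our part):

    free = {"n0": 8, "n1": 8, "n2": 8, "n3": 8}   # 8 KB data memory per node

    def alloc(node, kb):   free[node] -= kb
    def release(node, kb): free[node] += kb

    alloc("n3", 2)                      # layer 0: t0 runs on n3
    alloc("n3", 1); alloc("n3", 1)      # layer 1: t01 and t02 on n3
    assert free["n3"] == 4              # "n3 has 4 KB available"
    alloc("n3", 2)                      # layer 2: t1 runs on n3
    release("n3", 2)                    # t0's 2 KB on n3 is freed
    alloc("n2", 8); alloc("n0", 4); alloc("n3", 4)   # t2 split over 3 nodes
    assert free == {"n0": 4, "n1": 8, "n2": 0, "n3": 0}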
V. MILP FORMULATION

To evaluate the capability of our method, we formulate the underlying problem as an MILP problem that maps the tasks onto a generic regular NoC architecture and allocates the required data memory of each task on the PM nodes, based on the task schedule, so as to minimize the total communication cost while satisfying the constraints of the network. For this formulation, besides the notations introduced earlier, we need a few more. TaskL(l) is the set of independent tasks in layer l, and InTaskL(l) is the set of dependency tasks of all tasks of layer l; for instance, InTaskL(2) in the example of Section IV-C is {01, 02}. The cost minimization MILP problem, Minimize-Cost-MILP, can then be formulated as follows.

Given a MAG = G_l(T, E); the hop-count matrix of shortest paths between each pair of nodes, H = {h_ij | i, j ∈ N}; the memory size and the maximum shared memory size that each node j can allocate to tasks, denoted MS_j and SMax_j, respectively; and the data memory requirement of each normal task i, DMR^N_i, and of each memory access task i, DMR^A_i; find the mapping matrix Y = {y_ij | y_ij = 0 or 1; ∀i ∈ T_N, ∀j ∈ N} and the sizes of the data memory allocated on each node j to every normal task i and memory access task i, x^N_{ij} for ∀i ∈ T_N, ∀j ∈ N and x^A_{ij} for ∀i ∈ T_A, ∀j ∈ N, respectively, such that

    \min_{y_{ij}, x^N_{ij}, x^A_{ij}}
      \sum_{\forall i \in T_N} \sum_{\forall j \in N} \sum_{\forall k \in N} x^N_{ij} \, h_{jk} \, y_{ik} \, w_R \, w_a
      + \sum_{\forall i \in T_N} \sum_{\forall s \in T_i^d} \sum_{\forall j \in N} \sum_{\forall k \in N} x^A_{sk} \, h_{jk} \, y_{ij} \, w_R
      + \sum_{\forall i \in T_A} \sum_{\forall s \in T_i^d} \sum_{\forall j \in N} \sum_{\forall k \in N} x^A_{ij} \, h_{jk} \, y_{sk} \, w_R        (6)

subject to:

    \sum_{\forall i \in TaskL(l)} x^N_{ij} \, (1 - y_{ij}) \le SMax_j        ∀j ∈ N, ∀l ∈ NL        (7)

    \sum_{\forall i \in TaskL(l)} x^N_{ij} + \sum_{\forall s \in InTaskL(l)} x^A_{sj} \le MS_j        ∀j ∈ N, ∀l ∈ NL        (8)

    \sum_{\forall i \in TaskL(l)} x^A_{ij} + \sum_{\forall s \in InTaskL(l)} x^N_{sj} \le MS_j        ∀j ∈ N, ∀l ∈ AL        (9)

    \sum_{\forall i \in TaskL(l)} y_{ij} \le 1        ∀j ∈ N, ∀l ∈ NL        (10)

    \sum_{\forall j \in N} y_{ij} = 1        ∀i ∈ T_N        (11)

    \sum_{\forall j \in N} x^N_{ij} \, y_{ij} > 0        ∀i ∈ T_N        (12)

    \sum_{\forall j \in N} x^N_{ij} = DMR^N_i        ∀i ∈ T_N        (13)

    \sum_{\forall j \in N} x^A_{ij} = DMR^A_i        ∀i ∈ T_A        (14)

where y_ij, x^N_ij and x^A_ij are the optimization variables.

Eq. (6) is the objective function of this optimization problem, which minimizes the total communication cost. Constraint (7) states that the data memory allocated on a node j to tasks of a normal layer l that do not run on j cannot exceed the shared memory size of that node. Constraints (8) and (9) state that the data memory allocated on a node j to the tasks of each layer (in NL and AL, respectively), together with their dependency tasks, cannot exceed the memory size of that node. At most one task of each normal layer l can execute on a given node j, and each normal task i runs on exactly one node; these constraints are expressed by (10) and (11), respectively. Constraint (12) states that each normal task i takes at least part of its data memory requirement on the node on which it runs. Constraints (13) and (14) state that the total data memory allocated to each task i ∈ T_N or T_A equals its data memory requirement.

It is worth mentioning that the problem above is a quadratic problem [12]. However, since an integer variable is multiplied by a binary variable, the Big-M technique [13] can be used to convert it into a mixed integer linear program. We therefore solve the proposed Minimize-Cost-MILP problem using an efficient branch-and-bound algorithm in CPLEX.
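For reference, the Big-M step replaces each bilinear term, e.g. x^N_{ij} · y_{ik} in Eq. (6), by an auxiliary variable z_{ijk} and four linear constraints; a sketch, where M is any valid upper bound on x^N_{ij} (for instance MS_j):

    % Linearization of z_{ijk} = x^N_{ij} \, y_{ik},
    % with x^N_{ij} \in [0, M] and y_{ik} \in \{0, 1\}:
    \begin{aligned}
      z_{ijk} &\le M \, y_{ik},                  & z_{ijk} &\le x^N_{ij}, \\
      z_{ijk} &\ge x^N_{ij} - M \, (1 - y_{ik}), & z_{ijk} &\ge 0.
    \end{aligned}

With y_ik = 1 the four constraints force z_ijk = x^N_ij, and with y_ik = 0 they force z_ijk = 0, so the objective becomes linear.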
VI. EXPERIMENTAL RESULT

Since no memory-aware task graph benchmarks exist, we use the tgff tool [14] to generate synthetic MACTGs. Five task graphs were randomly generated to test our LSA, which is implemented in C#. The program is run on a PC with a 2.8 GHz Intel i7 CPU and 8 GB of main memory. For all case studies we assume that w_R and w_a equal 1 and 2, respectively, and that the bandwidth is 1 GB/sec.

A. Comparing with the MILP Problem

To evaluate the capability of our method, we solved the MILP problem of minimizing the total communication cost subject to the constraints above. We apply the LSA method and the Minimize-Cost-MILP problem to two synthetic task graphs mapped onto a 4 × 4 2D mesh network.

Table V shows the LSA and Minimize-Cost-MILP results for the case study of Section IV-C. Compared with the Minimize-Cost-MILP problem, LSA yields an about 20% larger cost with only 7.92 ms of run time. For a larger problem that maps and schedules a MACTG with 26 tasks, LSA runs in 8.56 ms with a cost of 3167 µs, while the
Minimize-Cost-MILP problem runs for about 3.5 hours and yields a cost of 2821 µs. LSA can therefore find an acceptable total communication cost in a very short time.

As the exploration space grows exponentially, the Minimize-Cost-MILP problem cannot solve larger problems within reasonable time and physical memory. In the rest of this section we therefore obtain results for larger problems via the proposed LSA.
TABLE V
COMPARISON BETWEEN LSA AND THE Minimize-Cost-MILP PROBLEM

               LSA                               Minimize-Cost-MILP Problem
Task   PM Node List   Memory Usage List   PM Node List   Memory Usage List
0      {3}            {2}                 {1}            {2}
01     {3}            {1}                 {3}            {1}
02     {3}            {1}                 {1}            {1}
1      {3}            {2}                 {3}            {2}
13     {1}            {2}                 {3}            {2}
14     {0}            {3}                 {3}            {3}
2      {2, 0, 3}      {8, 4, 4}           {0, 1, 2}      {8, 7, 1}
24     {1}            {3}                 {2}            {3}
25     {3}            {1}                 {1}            {1}
3      {1}            {2}                 {3}            {2}
36     {1}            {2}                 {3}            {2}
4      {0}            {4}                 {2}            {4}
49     {0}            {1}                 {2}            {1}
5      {3}            {4}                 {1}            {4}
57     {3}            {2}                 {1}            {2}
58     {3}            {2}                 {1}            {2}
6      {1}            {4}                 {3}            {4}
69     {1}            {2}                 {3}            {2}
7      {3}            {2}                 {0}            {2}
79     {3}            {3}                 {0}            {3}
8      {2}            {2}                 {1}            {1}
89     {2}            {2}                 {1}            {2}
9      {2, 0, 3}      {6, 5, 5}           {1, 0, 3}      {6, 5, 5}
C      65 µs                              54 µs
time   7.92 ms                            10.91 s
B. Comparing with a Non-memory-aware LSA

For the larger problems listed in Table VI, we solve with LSA and compare the results against a non-memory-aware version of LSA (nLSA), i.e., LSA under the assumption that the platform has infinite memory. Table VI reports LSA and nLSA on five test cases (Middle, Hard, Hard0, Hard1 and Hard2); the number of tasks of each test case is listed in the table. The target platform has 4 × 4 PM nodes for each test case except Hard0 and Hard2, which use 8 × 8 PM nodes.
TABLE VI
EXPERIMENTAL RESULT

                           LSA                        nLSA
Test     Number      Program                    Program
Case     of Tasks    Run Time       Cost        Run Time       Cost
Middle   26          8.5654 ms      3167        9.3606 ms      1615
Hard     81          13.4606 ms     13788       11.2038 ms     6819
Hard0    147         18.9718 ms     27872       18.7562 ms     12713
Hard1    32          8.5231 ms      4679        10.4146 ms     1990
Hard2    815         246.6145 ms    165486      274.7816 ms    43494
The costs of the memory-aware and non-memory-aware results differ because nLSA places tasks on nodes that do not actually have enough available memory, which leads to a smaller cost but an invalid solution. For example, in the test case Middle, task t_2 has a data memory requirement of 880 KB while each PM node of the target platform has only 800 KB of memory; LSA therefore maps t_2 onto two PM nodes (n_9 and n_10), whereas nLSA maps it onto the single PM node n_10.

VII. CONCLUSION AND FUTURE WORKS

LSA can find solutions with acceptable energy consumption in a very short time. Since the exploration space grows exponentially with the number of tasks, large problems cannot be solved by an MILP solver within reasonable time and physical memory, so for them we can only obtain solutions via the proposed LSA. As shown in Table VI, the LSA runtime for mapping and scheduling an application of 815 tasks onto an 8 × 8 platform is still under 0.25 sec. By comparing memory-aware and non-memory-aware solutions, we conclude that memory requirements and availability are critical and should not be ignored when modeling applications and platforms.

As future work, we expect the concept of LSA to carry over to dynamic application mapping and scheduling problems. Another direction is to add more tunable options to LSA so that it can be optimized further for parallelism or for energy consumption. Implementation of task layer partitioning is also planned.

REFERENCES

[1] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Öberg, M. Millberg, and D. Lindqvist, "Network on a chip: An architecture for billion transistor era," in Proc. NORCHIP 2000, Turku, Finland, Nov. 2000.
[2] R. Pop and S. Kumar, "A survey of techniques for mapping and scheduling applications to network on chip systems," School of Engineering, Jönköping University, Jönköping, Sweden, Tech. Rep. 04:4, Apr. 2004.
[3] G. Chen, F. Li, S. W. Son, and M. Kandemir, "Application mapping for chip multiprocessors," in Proc. DAC '08, Anaheim, CA, USA, Jun. 2008, pp. 620–625.
[4] N. Dutt, "Memory-aware NoC exploration and design," in Proc. Design, Automation and Test in Europe (DATE '08), Munich, Germany, Apr. 2008, pp. 1128–1129.
[5] T. Lei and S. Kumar, "A two-step genetic algorithm for mapping task graphs to a network on chip architecture," in Proc. Euromicro Symposium on Digital Systems Design (DSD), Belek, Turkey, Sep. 2003.
[6] S. Yang, L. Li, M. Gao, and Y. Zhang, "An energy- and delay-aware mapping method of NoC," Acta Electronica Sinica, vol. 36, no. 5, pp. 937–942, May 2008.
[7] H. Yu, Y. Ha, and B. Veeravalli, "Communication-aware application mapping and scheduling for NoC-based MPSoCs," in Proc. 2010 IEEE International Symposium on Circuits and Systems (ISCAS), Paris, France, May/Jun. 2010, pp. 3232–3235.
[8] M. Monchiero, G. Palermo, C. Silvano, and O. Villa, "Exploration of distributed shared memory architectures for NoC-based multiprocessors," in Proc. International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (IC-SAMOS 2006), Samos, Greece, Jul. 2006, pp. 144–151.
[9] A. Mehran, S. Saeidi, A. Khademzadeh, and A. Afzali-Kusha, "Spiral: A heuristic mapping algorithm for network on chip," IEICE Electron. Express, vol. 4, no. 15, pp. 478–484, 2007.
[10] E. Carvalho, C. Marcon, N. Calazans, and F. Moraes, "Evaluation of static and dynamic task mapping algorithms in NoC-based MPSoCs," in Proc. International Symposium on System-on-Chip (SoC 2009), Tampere, Finland, Oct. 2009, pp. 87–90.
[11] N. Kim, T. Austin, D. Baauw, T. Mudge, K. Flautner, J. Hu, M. Irwin, M. Kandemir, and V. Narayanan, "Leakage current: Moore's law meets static power," Computer, vol. 36, no. 12, pp. 68–75, Dec. 2003.
[12] D. P. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.
[13] R. Rafeh, M. J. G. de la Banda, K. Marriott, and M. Wallace, "From Zinc to design model," in Proc. 9th International Symposium on Practical Aspects of Declarative Languages (PADL), Nice, France, Jan. 2007, pp. 215–229.
[14] R. Dick, D. Rhodes, and W. Wolf, "TGFF: Task graphs for free," in Proc. 6th International Workshop on Hardware/Software Codesign (CODES/CASHE '98), Seattle, WA, USA, Mar. 1998, pp. 97–101.