High-Performance NoC Interface with Interrupt Batching for
Micronmesh MPSoC Prototype Platform on FPGA
Heikki Kariniemi and Jari Nurmi
Department of Computer Systems
Tampere University of Technology
Tampere, Finland
Email: {heikki.kariniemi, jari.nurmi}@tut.fi
Abstract—This paper presents a new NoC Interface (NI) targeted for improving the performance of the Micronmesh Multiprocessor System-on-Chip (MPSoC). The previous version of the NI called Micronswitch Interface (MSI) can zero-copy messages as it sends and receives them. It also offloads some functionalities of the communication protocol from software (SW) to hardware (HW), but interrupt processing produces extra SW overhead and reduces the performance. For this reason, an improved version of the MSI called MSI-with-Queues (MSIQ) was designed with a new queue mechanism in order to reduce the frequency of interrupts and the SW overhead. Owing to the new queue mechanism of the MSIQ it is possible to batch and service multiple interrupt service requests by every execution of the Interrupt Service Routine (ISR). Additionally, the new MSIQ HW is able to send and receive messages while the processor is running the ISR. The performance of the MSIQ is also analyzed in this paper. The results show that the queue mechanism improves the performance with moderate hardware costs.

I. INTRODUCTION

In computer systems where computers are connected by high-speed networks, the operation of the network interfaces may become a main obstacle for the communication throughput and the performance. This is because the communication between the CPUs and the network interfaces produces extra software overhead. Several methods such as zero-copying, protocol offloading, jumbo frames, message fragmentation, and interrupt coalescing have been presented in the literature [1, 2, 3, 4, 5, 6, 7] for eliminating this problem. Due to certain similarities of the architectures, these same methods can be used for solving the same problem in MPSoCs where distributed-memory and message-passing communication architectures are used.

In the Micronmesh MPSoC platform [8], the tightly coupled operation of the Micron Message-Passing (MMP) protocol [9] and the MSIQ enables direct message transfers between the local variables of the user threads and the MSIQ, which is a technique called zero-copying in the literature [1, 2, 3, 5]. Zero-copying reduces the communication latency and improves the performance, because it eliminates copying of messages from user memory to the MSIQ through intermediate buffers in the kernel memory. The multiplexing and demultiplexing functions of the MMP protocol are also offloaded to the MSIQ HW in order to reduce the software overhead. Protocol offloading is used for speeding up the protocol functions by HW and for reducing the software overhead [1, 2, 3, 4, 5].

Interrupt-driven systems provide low latency and low SW overhead if the interrupt rate is low, but the performance degrades if the interrupt frequency grows. Interrupts produce additional SW overhead by causing a context switch from user mode to kernel mode before the execution of the ISR and back to user mode from kernel mode after the execution of the ISR is finished [1, 2, 3, 4, 5, 6, 7, 10, 11]. The last three methods mentioned above are used for reducing the software overhead produced by the interrupt processing and the processor utilization. The usage of jumbo frames, i.e. large messages, makes it possible to reduce the message rate and the interrupt frequency [2, 3, 5]. Fragmentation is related to jumbo frames, which are usually fragmented into smaller frames before sending [1, 3, 5]. The MSIQ HW also fragments the messages into small fixed-sized packets as it sends them to the Micronmesh NoC and assembles the received messages from the received packets.

Interrupt coalescing [2, 3, 5] is a technique used for batching interrupt service requests so that every execution of the ISR can serve several requests, which reduces the interrupt frequency and the software overhead. It also has variants called Interrupt Multiplexing [1] and the Enabling-Disabling (ED) technique [4]. In a typical implementation the interrupts are delayed until a certain number of interrupts has been batched or a timeout expires. The implementation used in the new MSIQ works slightly differently. When receiving messages, the MSIQ generates an interrupt immediately after it has received a new message. If more messages arrive or have arrived in bursts during the execution of the ISR, they are also served. This method provides low latency and good tolerance against bursts of short messages in addition to the reduced interrupt frequency. When sending messages, the MSIQ sends several messages successively in batches. It generates the interrupts after finishing the sending of the first message of a batch, which makes it possible to start running the ISR while the sending still continues. As a consequence of this, the ISR can also run concurrently with the MSIQ HW, which improves the performance further.

In the MSIQ the interrupt coalescing is implemented with send-request and receive-request queues. The results of the performance analysis and the logic synthesis presented in this paper show that the improved performance is achieved with small additional HW costs compared to the old MSI [12]. The MSIQ could also be used with polling, but polling is usually used together with interrupts and is more difficult to implement [6, 7]. Furthermore, the length of the polling period must be carefully adapted to the message rate in order to achieve good performance: if it is too long, the communication latency grows, and if it is too short, the software overhead grows.

This paper is organized as follows. Section II presents the architecture and the operation of the new MSIQ. Section III presents the performance analysis and the HW costs of the new MSIQ, and finally, Section IV concludes this paper.

II. MICRONSWITCH INTERFACE WITH QUEUES

The Micronmesh MPSoC platform [8] consists of Micronmesh nodes that contain a local NIOS II processor [13], local on-chip memories, a timer, a local Avalon system bus [14], the MSIQ, and the Micronswitch [8]. The NIOS II processors run distinct MicroC/OS-II real-time kernels [11] in every Micronmesh node. The MSIQs connect the Micronmesh nodes to the Micronmesh NoC through the local Micronswitches.
This research is funded by the Academy of Finland under grant
122361.
978-1-4244-8971-8/10/$26.00 © 2010 IEEE
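The MSIQ's coalescing policy described in the introduction (an interrupt is raised immediately for the first message, and requests that accumulate while the ISR runs are drained by the same ISR execution) can be illustrated with a minimal event simulation. This is a behavioural sketch only, not the MSIQ hardware; the queue model, timing values, and function names are hypothetical.

```python
from collections import deque

def simulate(arrivals, isr_time_per_msg, isr_start_overhead):
    """Count the interrupts needed to service all messages when requests
    arriving during an ISR execution are served by that same ISR run."""
    pending = deque()
    interrupts = 0
    t = 0.0
    i = 0
    arrivals = sorted(arrivals)
    while i < len(arrivals) or pending:
        if not pending:
            # idle: the first arriving message raises an interrupt at once
            t = max(t, arrivals[i])
            pending.append(arrivals[i])
            i += 1
            interrupts += 1
            t += isr_start_overhead        # ISR entry overhead
        while pending:
            t += isr_time_per_msg          # service one request
            pending.popleft()
            # messages that arrived during the service are batched in
            while i < len(arrivals) and arrivals[i] <= t:
                pending.append(arrivals[i])
                i += 1
    return interrupts

# A burst of 8 back-to-back messages is served by a single interrupt,
# while widely spaced messages each get their own interrupt.
print(simulate([100 + k for k in range(8)], 50, 20))   # burst
print(simulate([0, 1000, 2000], 50, 20))               # sparse
```

This reproduces the qualitative claim above: the interrupt count per message drops for bursts, while an isolated message is still served with the low latency of an immediate interrupt.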
A. The Architecture of the MSIQ

The MSIQ consists of three main sub-blocks, which are the MSIQ Rx-master, the MSIQ Tx-master, and the MSIQ Slave. It is depicted at the bottom of the schematic in Fig. 1. The MSIQ Rx-master on the left receives messages from the NoC, the MSIQ Tx-master on the right sends messages to the NoC, and the MSIQ Slave in the middle is used for controlling and configuring the operations of the MSIQ Masters through the MSIQ's register interface. The MSIQ Slave is also responsible for generating interrupt service requests according to the MSIQ Masters' status.

Figure 1. The architecture of the MSIQ.

TABLE I. MSIQ'S REGISTER INTERFACE AND QUEUES

MSIQ-status: The common status register of the MSIQ Masters.
Rx-control: The control register used for controlling the MSIQ Rx-master's operation.
Rx-base-address: The base address of the Rx-buffer table.
Rx-routing-header: The Rx-routing header of the last packet of the received message. This register is part of the receive-request queue and it is the output of the Rx-routing-header-FIFO.
Rx-protocol-control-header: The Rx-protocol-control header of the last packet of the received message. This register is part of the receive-request queue and it is the output of the Rx-protocol-control-header-FIFO.
Tx-control: The Tx-control register used for controlling the MSIQ Tx-master's operation. This register is part of the send-request queue and it is the input of the Tx-control-FIFO.
Tx-base-address: The start address of the message stored into the Tx-buffer. This register is part of the send-request queue and it is the input of the Tx-base-address-FIFO.
Tx-routing-header: The Tx-routing header template of the packets of the message to be sent. This register is part of the send-request queue and it is the input of the Tx-routing-header-FIFO.
Tx-protocol-control-header: The Tx-protocol-control header template of the packets of the message to be sent. This register is part of the send-request queue and it is the input of the Tx-protocol-control-header-FIFO.

The MSIQ's register interface is partly presented in Table I. It contains a status register MSIQ-status, which is a combined status of the MSIQ Masters. The values of the Tx-control, the Tx-base-address, the Tx-routing-header, and the Tx-protocol-control-header registers form send-requests that are stored to the HW send-request queue of the MSIQ HW (HW SEND-REQUEST QUEUE). The writing of these registers starts the sending of one message. Respectively, the values of the Rx-routing-header and the Rx-protocol-control-header registers form the receive-requests that are stored into the HW receive-request queue of the MSIQ HW (HW RECEIVE-REQUEST QUEUE). The reading of these registers ends the receiving of one message. The MSIQ Slave also contains four FIFOs for storing the send-requests and two FIFOs for storing the receive-requests, as Table I explains.

The MSIQ Tx-master starts sending messages as it receives send-requests through the HW send-request queue from the MSIQ Slave. The MSIQ Tx-master's Avalon interface (AVA-TX-IF) reads the messages directly from the Tx-buffers in the local memory (LOCAL MEMORY), fragments the messages, generates packets of the fragments, and writes the packets to the Tx-FIFO, from which the MSIQ Tx-master's Tx-interface (TX-IF) sends them to the Micronswitch. Packets consist of two headers and two payload words [9, 12]. The addresses of the messages are passed to the MSIQ Tx-master's Avalon interface through the Tx-base-address-FIFO. In Fig. 1 this address points to the beginning of the Tx-buffer A of thread B, which is illustrated by arrow A. The routing headers and the protocol control headers of the packets are stored into the Tx-routing-header-FIFO and the Tx-protocol-control-header-FIFO. The control register values are passed through the Tx-control-FIFO. After finishing the sending of a message, the MSIQ Tx-master changes its status in order to make the MSIQ Slave generate an interrupt service request, reads the next send-request from the HW send-request queue, and continues sending messages until the HW send-request queue becomes empty. It can continue sending while the processor is running the ISR. The maximum size of the message batches depends on the size of the HW send-request queue: the larger the HW send-request queue, the more messages can be sent without interrupts. If only one message could be sent at a time, the execution time of the interrupts would dominate the total sending time, especially if the messages were short [12]. Hence, owing to the HW send-request queues it is possible to reduce the interrupt frequency and improve the performance.

The MSIQ Rx-master's Rx-interface (RX-IF) receives packets from the Micronswitch and writes them to the Rx-FIFO. The MSIQ Rx-master's Avalon interface (AVA-RX-IF) reads the packets from the Rx-FIFO and writes the packet payloads to the Rx-buffers, which are in the local memory. It obtains the Rx-buffer addresses from the
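The way writes to the four Tx registers of Table I form one send-request in the HW send-request queue can be sketched as a small software model. This is an illustrative model of the queue discipline only (not the MSIQ's RTL); the class name, queue depth, and field values are hypothetical.

```python
from collections import namedtuple, deque

# One send-request = the four Tx register values listed in Table I,
# each buffered by its own FIFO in the MSIQ Slave.
SendRequest = namedtuple(
    "SendRequest",
    ["tx_control", "tx_base_address",
     "tx_routing_header", "tx_protocol_control_header"],
)

QSIZE = 4  # illustrative HW send-request queue depth


class MsiqSlaveModel:
    def __init__(self):
        self.hw_send_request_queue = deque()

    def write_send_registers(self, control, base, routing, protocol):
        """Writing the four registers enqueues one send-request and
        thereby starts the sending of one message."""
        if len(self.hw_send_request_queue) >= QSIZE:
            raise RuntimeError("HW send-request queue full")
        self.hw_send_request_queue.append(
            SendRequest(control, base, routing, protocol))


slave = MsiqSlaveModel()
slave.write_send_registers(0x1, 0x2000, 0xA0, 0xB0)  # hypothetical values
print(len(slave.hw_send_request_queue))
```

A queue of depth QSIZE lets up to QSIZE messages be sent back-to-back without an intervening interrupt, which is exactly the batching effect described above.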
local memory from the Rx-buffer table (RX-BUFFER TABLE), which is referred to by the Rx-base-address register, as arrow B in Fig. 1 illustrates, and computes the storage addresses of the packet payloads. When doing this, the MSIQ Rx-master's Avalon interface demultiplexes and assembles the messages of different Rx-channels from one input packet stream through the Rx-FIFO to multiple Rx-buffers. The Channel Identifiers (CID) of the protocol control headers are used for addressing the Rx-buffer table elements, as arrow C illustrates. The Rx-buffer table elements contain the Rx-buffer addresses, as arrow D illustrates. They are used by the MSIQ Rx-master for addressing the Rx-buffers, as arrow E illustrates. After finishing the receiving of a message, the MSIQ Rx-master's Avalon interface writes the receive-request to the HW receive-request queue and changes its status in order to make the MSIQ Slave generate an interrupt service request. If the HW receive-request queue is not full, the receiving can be continued while the processor is running the ISR. Since every execution of the ISR can service multiple receive-requests, the number of interrupts can be reduced. This happens especially if messages are short and several messages arrive in bursts between the consecutive executions of the ISR. Furthermore, the performance also improves because the receiving needs to be stopped less frequently.

B. The MSIQ device driver and the MMP protocol

The main parts of the MSIQ device driver (MSIQ SW) are a state data structure, send (msiq_send) and receive (msiq_receive) functions, and the ISR (msiq_isr). The MSIQ SW is used by the MMP protocol's functions for controlling the operations of the MSIQ. The MMP protocol is a messaging layer protocol which forms an Application Programming Interface (API) for programming fault-tolerant message-passing applications [9]. This API contains, for example, functions for sending (mmpp_send) and receiving (mmpp_receive) messages. The MSIQ SW's state data structure also contains a SW send-request queue and a Tx-serviced queue. In the SW send-request queue the send-requests are pointers to the data structures of the MMP protocol's channels [9], which contain the register values of the send-requests to be stored into the HW send-request queue. The elements of the Tx-serviced queue are pointers to the Tx-channels' signaling semaphores.

C. Sending of messages

The messages are sent in the following way.

1. A thread calls the mmpp_send function, which calls the msiq_send function of the MSIQ SW.

2. The msiq_send function first puts the address of the Tx-channel's data structure to the SW send-request queue. Then it reads the status of the MSIQ. If the MSIQ Tx-master is idle, it reads the send-request from the SW send-request queue, stores the address of the Tx-channel's signaling semaphore to the Tx-serviced queue, and writes the send-request to the HW send-request queue. This enables the MSIQ Tx-master to send, and the operation continues in step three. If the MSIQ Tx-master is not idle, msiq_send lets the ISR (msiq_isr) of the MSIQ device driver initialize the sending of the next message as the processor starts running it in step four after the previous send is finished, and returns. The accessing of the MSIQ SW's state data structure and the MSIQ's register interface is controlled by a semaphore so that they can be accessed by only one thread at a time or by the msiq_isr. Additionally, because the msiq_isr also has a higher priority than the threads, it can be guaranteed that the MSIQ SW's data structures and queues are maintained correctly.

3. The MSIQ Tx-master's Avalon interface reads the send-request from the HW send-request queue and starts reading a message from the Tx-buffer, slices it into packet payloads, generates both of the headers for every packet, and writes complete packets to the Tx-FIFO. The MSIQ Tx-master's Tx-interface reads packets from the Tx-FIFO and sends them to the Micronswitch. After the sending of the message is finished, the MSIQ Tx-master's Avalon interface changes its status and the MSIQ Slave generates an interrupt service request accordingly, which starts the execution of the MSIQ ISR in step four. If the HW send-request queue is not yet empty, the MSIQ Tx-master's Avalon interface reads the next send-request from it and continues sending messages until the queue is empty while the processor is running the msiq_isr (ISR) in step four.

4. The processor starts running the msiq_isr (ISR). The msiq_isr acknowledges the interrupt service request, reads the address of the signaling semaphore from the Tx-serviced queue, and posts the signaling semaphore to the thread which called the mmpp_send function. This wakes up the thread and the mmpp_send function returns. If the SW send-request queue is not empty, the msiq_isr reads the next send-request from it, stores the address of the Tx-channel's signaling semaphore to the Tx-serviced queue, and writes the next send-request to the HW send-request queue, which enables the sending and the interrupts again. These operations are repeated in a loop until all of the signaling semaphores of the serviced send-requests have been posted from the Tx-serviced queue and either the HW send-request queue is full or the SW send-request queue is empty.

As steps three and four show, the HW send-request queue enables the interrupt batching. Additionally, the Tx-buffers are mapped to the local variables of the threads and the MSIQ HW uses DMA (Direct Memory Access) transfers for zero-copying the messages directly from the Tx-buffers. The MSIQ also slices the messages into packets as it multiplexes and sends them in one packet stream to the Micronmesh NoC, which implements message fragmentation.

D. Receiving of messages

The messages are received in the following way.

1. A thread calls the mmpp_receive function, which prepares the Rx-channel for receiving by deasserting the lock bit and by updating the address field of the Rx-channel's Rx-buffer table element. Then it calls the msiq_receive function of the MSIQ SW, which enables the MSIQ Rx-master to receive messages.

2. The MSIQ Rx-master's Rx-interface receives packets from the Micronswitch and writes them to the Rx-FIFO. The Rx-master's Avalon interface reads the packets from the Rx-FIFO one by one, computes the addresses of the Rx-buffer table elements by adding the packets' CIDs multiplied by four to the Rx-base-address register's value, and reads the Rx-buffer table elements from the local memory. Then it multiplies the packets' sequence numbers carried in the protocol control headers by eight and the address field of the Rx-buffer table element by four. The sums of these two products are the storage addresses of the packet payloads. These multiplications are performed by simple shift-left operations. After computing the storage addresses, the MSIQ Rx-master writes the packet payloads to the Rx-buffers. If successive packets have the same CID, the Rx-master can reuse the Rx-buffer table element and only the storage address must be computed again for each of the packets separately. Otherwise, the Rx-buffer table elements must be read from the memory. After the last packet of the message is received, the MSIQ Rx-master's Avalon interface asserts the lock bit, updates the address field of the Rx-buffer table element to point to the end of the message, writes the Rx-buffer table element to the memory, writes the receive-request to the HW receive-request queue, and changes its status in order to make the MSIQ Slave generate an interrupt service request. Then it continues receiving messages until the HW receive-request queue is full while the msiq_isr (ISR) is executed in step three.

3. The processor starts running the msiq_isr (ISR) function. The msiq_isr acknowledges the MSIQ Rx-master's interrupt service request, reads the receive-request from the HW receive-request queue, obtains the address of the Rx-channel's data structure by the CID from the
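The storage-address arithmetic of step two of the receiving (Rx-buffer table element address = Rx-base-address + CID × 4; payload storage address = address field × 4 + sequence number × 8, both products computed as shift-left operations) can be written out directly. The register values below are hypothetical, chosen only to exercise the arithmetic.

```python
def rx_table_element_addr(rx_base_address, cid):
    # Rx-base-address + CID * 4, the multiplication done as a shift left
    return rx_base_address + (cid << 2)

def payload_storage_addr(addr_field, seq_number):
    # address field * 4 + sequence number * 8, both done as shifts
    return (addr_field << 2) + (seq_number << 3)

# hypothetical register values for illustration
print(hex(rx_table_element_addr(0x4000, cid=3)))
print(hex(payload_storage_addr(0x100, seq_number=2)))
```

Because both multipliers are powers of two, the hardware needs no multiplier at all, which is why the text notes that the multiplications reduce to shift-left operations.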
MSIQ SW's data structure, and posts the Rx-channel's signaling semaphore to the thread that called the mmpp_receive function. These operations are repeated in a loop until the HW receive-request queue is empty or a certain maximum number of receive-requests has been serviced.

Hence, the HW receive-request queue of the MSIQ can be used for batching the interrupts. The MSIQ Rx-master's Avalon interface also partly offloads the MMP protocol's functions by using the Rx-buffer table for demultiplexing interleaved packets of different channels from a single input packet stream according to the CIDs. Furthermore, because the Rx-channels' Rx-buffers are mapped to the local variables of the threads [9], it can use DMA for zero-copying and assembling the messages to the Rx-buffers.

III. PERFORMANCE ANALYSIS

A theoretic approach is used for estimating the performances of the MSI and the MSIQ. This is because several factors like, for example, the operation speed of memories, the size of cache memories, the operation delay of the interrupt logic, etc. affect the performance, and measurements with only one configuration would not produce reliable estimates. However, the execution time of the ISR was measured for the calculations with a simple platform where the MSIQ Masters were connected to different ports of a dual-port on-chip SRAM which contained the buffers. Furthermore, the program code and data were stored to a different single-port on-chip SRAM. The performance analysis is targeted at comparing the operations, the costs, and the performances of the new MSIQ and the MSI.

The theoretic maximum throughputs with messages of different sizes represent the peak communication performances achievable when as many messages as possible are sent or received continuously. In the first step of the analysis the performance of the MSIQ HW is analyzed. The result of the first step is used for simplifying the second step of the performance analysis, where the performance of both the MSIQ HW and the MSIQ SW is analyzed together.

A. The performance of the MSIQ HW

As messages are sent, the MSIQ Tx-master's Avalon interface reads packet payloads of two words from the Tx-buffers, generates packets, and stores the packets to the Tx-FIFO. After storing the last packet of the message to the Tx-FIFO, it changes its status in order to make the MSIQ Slave generate an interrupt. The latency of reading the payloads of Npck packets is Dread(Npck) = Npck × 4 + 2 clock cycles. This includes the time required for generating and storing Npck packets to the Tx-FIFO. The latency of sending Npck packets from the Tx-FIFO to the Micronswitch is Dsend(Npck) = Npck × 5 clock cycles, respectively. Since Dread(Npck) ≤ Dsend(Npck) when Npck ≥ 2, it can be concluded that the MSIQ Tx-master's Tx-interface limits the throughput.

The MSIQ Rx-master's Avalon interface reads packets from the Rx-FIFO, reads the Rx-buffer table elements and computes the storage addresses, and writes the packet payloads to the Rx-buffers. After the last packet of a message it changes its status in order to make the MSIQ Slave generate an interrupt. The latency of writing the payloads of Npck packets to the Rx-buffer is Dwrite(Npck) = 2 + Npck × 2 + 2 clock cycles. The latency of receiving Npck packets through the Rx-interface of the MSIQ Rx-master (RX-IF) is Dreceive(Npck) = Npck × 5 clock cycles. Since Dwrite(Npck) ≤ Dreceive(Npck) when Npck ≥ 2, it can be concluded that the MSIQ Rx-master's Rx-interface limits the throughput.

As was shown, the Tx-interface and the Rx-interface of the MSIQ Masters limit the throughputs like in the original MSI [8]. Therefore, in order to simplify the performance analysis of the MSIQ HW and SW, it can be assumed that the processing of every packet takes five clock cycles also in both of the Avalon interfaces of the MSIQ Masters and that Dread(Npck) = Dsend(Npck) = Dwrite(Npck) = Dreceive(Npck) = Npck × 5 clock cycles. Owing to this simplification and because the interfaces operate at the same clock rate, it is no longer necessary to take into consideration the filling of the Tx-FIFO and the emptying of the Rx-FIFO.

B. The performance of the MSIQ SW and HW

In the performance analysis a couple of things must be taken into consideration. Firstly, the length of the messages and the size of the queues Qsize affect the theoretic maximum throughput. Secondly, the MSIQ Masters can receive and send messages while the local processors are running the ISR. Additionally, the ISR (msiq_isr) consists of different Tx-ISR and Rx-ISR branches for servicing interrupts caused by the MSIQ Tx-master and the MSIQ Rx-master, as was described in subsections II.C and II.D.

The execution time of the Tx-ISR is

Ttx-isr(n) = Ttx-start + n × Ttx-loop, (1)

where Ttx-start is the time consumed in the beginning of the execution of the ISR before the Tx-loop iterations and n = 1, …, Qsize is the number of serviced send-requests. Parameter Qsize is also the maximum batch size, and Ttx-loop is the execution time of the Tx-ISR's Tx-loop executed in step four of sending as described in subsection II.C. The sending of other messages generates new interrupt service requests, but they are masked during the execution of the ISR.

The service time of the Tx-interrupts is

Ttx-int(n) = Tres + Ttx-isr(n) + Trec, (2)

where parameter Tres is the response time between the assertion of the interrupt request and the start of the ISR's execution, and Trec is the interrupt recovery time. If the NIOS II/f (fast) core is used, parameter Tres = 105 clock cycles and parameter Trec = 62 clock cycles [10].

The execution time of the Rx-ISR is

Trx-isr(n) = Trx-start + n × Trx-loop, (3)

where Trx-start is the time consumed in the beginning of the execution of the ISR before the Rx-loop iterations and n = 1, …, Qsize is the number of the Rx-ISR's Rx-loop iterations, which is limited by the size of the queues Qsize. Parameter Trx-loop is the time consumed by each of the Rx-loop iterations executed in step three of receiving as described in subsection II.D. The receiving of new messages also generates receive-requests, but the interrupts are masked during the execution of the ISR.

The service time of the Rx-interrupts is

Trx-int(n) = Tres + Trx-isr(n) + Trec, (4)

where parameters n, Tres, and Trec are equal to those of formula (2).

In the performance analysis the operation of the MSIQ HW and SW can be divided into periods during which the MSIQ Masters send or receive a certain number of messages and the ISR is executed once. The length of the periods is denoted by Tperiod(n), where n = 1, …, Qsize is the number of serviced send-requests or receive-requests, i.e. the batch size. The length of the period is determined by the execution time of the interrupt services or the time required for sending or receiving n messages. The value of parameter n varies and also depends on the message size. The length of the period determines the theoretic maximum message rate

Rmsg(n) = n / Tperiod(n) (5)

and the theoretic maximum bit rate

Rbit(n) = Msize × Rmsg(n) = Msize × n / Tperiod(n), (6)
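Formulas (1), (2), (5), and (6) can be checked numerically. The sketch below uses the NIOS II/f response and recovery times given above (Tres = 105, Trec = 62 clock cycles); the loop and start times (Ttx-loop = 450, Ttx-start = 20 clock cycles) and the 100 MHz clock are the values used in the comparison of subsection III.C, and the function names are ours.

```python
T_RES, T_REC = 105, 62            # NIOS II/f response and recovery times [10]

def t_tx_isr(n, t_start=20, t_loop=450):
    # formula (1): Ttx-isr(n) = Ttx-start + n * Ttx-loop
    return t_start + n * t_loop

def t_tx_int(n, t_start=20, t_loop=450):
    # formula (2): Ttx-int(n) = Tres + Ttx-isr(n) + Trec
    return T_RES + t_tx_isr(n, t_start, t_loop) + T_REC

def r_bit(msize_bits, n, t_period_cycles, f_clk_hz=100e6):
    # formulas (5) and (6): Rmsg = n / Tperiod, Rbit = Msize * Rmsg
    t_period_s = t_period_cycles / f_clk_hz
    return msize_bits * n / t_period_s

# servicing a full batch of Qsize = 4 send-requests:
print(t_tx_int(4))        # 105 + 20 + 4*450 + 62 = 1987 clock cycles
```

As a sanity check, a minimum-size message of one packet (two 32-bit payload words, Msize = 64 bits) with Tperiod = Dsend(1) = 5 cycles gives r_bit(64, 1, 5) = 1.28e9 bit/s, matching the 1.28 GBits/s saturation throughput reported in subsection III.C.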
where n = 1, …, Qsize and parameter Msize is the message size in bits. The theoretic maximum bit rate Rbit(n) is the theoretic maximum throughput. Formulas for the theoretic maximum throughputs are derived for sending and receiving separately in the following two subsections.

1) The throughput with the send-request queue

If Ttx-int(Qsize) = Qsize × Ttx-msg, where parameter Ttx-msg = Dsend(Npck) is the sending time of a message as defined in subsection III.A, the MSIQ Tx-master is able to send messages continuously without stopping the sending while the processor is running the Tx-ISR. The HW send-request queue can never be emptied by the MSIQ Tx-master, because the processor runs the Tx-ISR, which puts new send-requests to the HW send-request queue from the SW send-request queue. The MSIQ Tx-master generates interrupts after every sending of a message, but these interrupt service requests are masked if the processor is running the ISR. The performance analysis of the MSIQ Tx-master consists of two separate cases, where either Ttx-int(Qsize) > Qsize × Ttx-msg or Ttx-int(Qsize) ≤ Qsize × Ttx-msg, since the message size affects the rate at which the interrupt services are requested and the throughput of the MSIQ Tx-master.

In the case that the messages are shorter, the interrupt service time is longer than the sending time of Qsize messages and Ttx-int(Qsize) > Qsize × Ttx-msg. In this case the HW send-request queue is emptied and the MSIQ Tx-master must stop sending messages until the Tx-ISR puts the next send-requests into the HW send-request queue. Thus, with shorter messages the interrupt service time Ttx-int(n) determines the length of the period and Tperiod(n) = Ttx-int(n). The message rate is Rmsg(n) = n / Tperiod(n) = n / Ttx-int(n), where n = 1, …, Qsize, and the bit rate is Rbit(n) = Msize × Rmsg(n). The theoretic maximum throughput is achieved with value n = Qsize, when the ISR loads Qsize send-requests to the HW send-request queue, and the theoretic maximum throughput is

Rbit(Qsize) = Msize × Rmsg(Qsize) = Msize × Qsize / Ttx-int(Qsize). (7)

In the case that the messages are longer, the interrupt service time can be smaller than the sending time of the messages and Ttx-int(Qsize) ≤ Qsize × Ttx-msg. Because the Tx-ISR can put a larger number of send-requests to the HW send-request queue than the MSIQ Tx-master can send during the interrupt service time Ttx-int(Qsize), the HW send-request queue is nonempty most of the time and the sending can continue without stops. Because the number of Tx-loop iterations of the Tx-ISR depends on the message size, which determines the sending time, parameter n can also be smaller than Qsize. Hence, the sending time of the messages determines the length of the period Tperiod(n) = n × Ttx-msg, where n = 1, …, Qsize, and the theoretic maximum message rate is Rmsg(n) = n / Tperiod(n) = n / (n × Ttx-msg) = 1 / Ttx-msg, where n = 1, …, Qsize. In this case the theoretic maximum throughput does not depend on the value of parameter n and it is

Rbit(n) = Msize × Rmsg(n) = Msize / Ttx-msg. (8)

2) The throughput with the receive-request queue

If Trx-int(Qsize) = Qsize × Trx-msg, where parameter Trx-msg = Dreceive(Npck) is the receiving time of a message as defined in subsection III.A, the MSIQ Rx-master is able to receive the next Qsize messages without stopping the receiving while the processor is running the ISR. This is because each interrupt services Qsize receive-requests while the MSIQ Rx-master receives the next Qsize messages. Like for the Tx-master, the analysis consists of two separate cases, since the message size affects the rate at which the interrupt services are requested and the throughput of the MSIQ Rx-master.

In the case that the messages are shorter, the interrupt service time is longer than the receiving time of Qsize messages and Trx-int(Qsize) > Qsize × Trx-msg. In this case the HW receive-request queue is full most of the time and the MSIQ Rx-master must stop receiving until the Rx-ISR's Rx-loop iterations read receive-requests from the HW receive-request queue. The interrupt service time Trx-int(n) clearly determines the length of the periods and Tperiod(n) = Trx-int(n). Because at most Qsize receive-requests can be read from the HW receive-request queue and Qsize messages can be received during the periods, the theoretic maximum throughput is achieved with value n = Qsize and Tperiod(Qsize) = Trx-int(Qsize). Hence, the theoretic maximum throughput is

Rbit(Qsize) = Msize × Qsize / Trx-int(Qsize). (9)

In the case that the messages are longer, the interrupt service time can be shorter than the receiving time of Qsize messages and Trx-int(Qsize) ≤ Qsize × Trx-msg. Because the processors can service the receive-requests of Qsize messages in a shorter time than the MSIQ Rx-master can receive the next Qsize messages, the receiving can be continued without stops and the receive-request queue can never become full. Finally, if the message size is further increased, the Rx-loop is executed only once during every execution of the Rx-ISR and Trx-int(1) ≤ Trx-msg. Hence, if Trx-int(Qsize) ≤ Qsize × Trx-msg, the message size determines the number of received messages n during the periods and the length of the period Tperiod(n) = n × Trx-msg, where n = 1, …, Qsize. Thus, the theoretic maximum message rate is Rmsg(n) = n / (n × Trx-msg) = 1 / Trx-msg and the theoretic maximum throughput is

Rbit(n) = Msize × Rmsg(n) = Msize / Trx-msg. (10)

C. Comparison of performances and costs

The performances of the MSIQ and the MSI are presented in Fig. 2, where the horizontal axis shows the message size in 32-bit wide words and the vertical axis shows the throughputs in GBits/s. The throughputs were computed with a 100 MHz clock. The throughputs of the basic MSI, which does not have the queues, are presented with lines Q1(300) and Q1(600). These lines are computed like in [13] with interrupt service times (Ttx-int, Trx-int) of 300 and 600 clock cycles. The throughputs of the MSIQ with queues of four send-requests and receive-requests are presented with lines Q4(450) and Q4(900). These lines are computed with equal Tx-loop and Rx-loop execution times (Ttx-loop, Trx-loop) of 450 and 900 clock cycles, and with ISR start times (Ttx-start, Trx-start) of 20 clock cycles. The throughputs of the MSIQ with the queues of eight requests are not presented, since they are quite similar to those of Q4(450) and Q4(900). This is because the total execution times of the loops dominate the total interrupt service times as the number of loop iterations increases, which reduces the effect of the other delay parameters. The threshold message sizes of Q4(450) and Q4(900) are 199 and 379 words, respectively. With the threshold message sizes Ttx-int(Qsize) = Qsize × Ttx-msg = Qsize × Dsend(Npck) and Trx-int(Qsize) = Qsize × Trx-msg = Qsize × Dreceive(Npck). Thus, with a 100 MHz clock the throughputs of the MSIQ actually saturate to 1.28 GBits/s with smaller messages than Fig. 2 presents. Formulas (7) and (9) are used for computing the throughputs of the MSIQ for message sizes that are smaller than the threshold values, and formulas (8) and (10) are used for computing the throughputs with message sizes that are higher than or equal to the thresholds.

By comparing line Q1(300) to line Q4(450) and line Q1(600) to line Q4(900) it can be concluded that with messages which are smaller than 64 and 128 words the theoretic maximum throughputs of the
The MSIQ Rx-master generates new interrupt service request after basic MSI and the MSIQ are quite similar. However, the throughputs
receiving of messages, but these interrupt service requests are masked Q4(450) and Q4(900) of the MSIQ grow much faster as the message
if processor is running the ISR. The analysis divides also into two size is increased and they saturate to 1.28 GBits/s already at the point
separate cases, where either Trx-int (Qsize) > Qsize × Trx-msg or Trx-int of 256 and 512 words. Furthermore, the results in Fig. 2 do not show
(Qsize) ≤ Qsize × Trx-msg, since the message size affects the rate at which the performance with message bursts. Because usually traffic contains
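The two-regime send throughput model of formulas (7) and (8) can be sketched as follows. The linear ISR-time model Ttx-int(Qsize) = Ttx-start + Qsize × Ttx-loop and the per-word sending cost used in the example are illustrative assumptions only; the actual sending time Ttx-msg = Dsend(Npck) is defined in subsection III.A.

```python
# Sketch of the theoretic maximum send throughput of the MSIQ,
# following formulas (7) and (8). The ISR-time model
# Ttx_int = Ttx_start + Qsize * Ttx_loop is an assumption for
# illustration; t_msg_cycles stands for Ttx-msg = Dsend(Npck).

def msiq_send_throughput(msize_words, t_msg_cycles, qsize=4,
                         t_start=20, t_loop=450, f_clk=100e6):
    """Theoretic maximum send throughput in bits/s."""
    msize_bits = 32 * msize_words           # message size in bits (32-bit words)
    t_int = t_start + qsize * t_loop        # assumed ISR time for Qsize requests
    if t_int > qsize * t_msg_cycles:
        # Short messages: the ISR time determines the period, formula (7).
        bits_per_cycle = msize_bits * qsize / t_int
    else:
        # Long messages: the sending time determines the period, formula (8),
        # and the throughput saturates to Msize / Ttx-msg.
        bits_per_cycle = msize_bits / t_msg_cycles
    return bits_per_cycle * f_clk

# Example: with an assumed sending cost of 2.5 cycles per word the
# saturated throughput is 32 bits / 2.5 cycles = 12.8 bits/cycle,
# i.e. 1.28 GBits/s at 100 MHz, matching the saturation level above.
```

The assumed 2.5 cycles/word reproduces only the 1.28 GBits/s saturation level; the exact threshold sizes (199 and 379 words) additionally depend on the per-packet overheads inside Dsend(Npck).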
Traffic usually contains also bursts of messages, and therefore it is necessary that the NI is able to achieve a high peak performance for short time intervals under burst traffic. This can be achieved by the HW send-request and HW receive-request queues. For example, with queues of eight requests the MSIQ masters are able to send and receive bursts of eight messages at the maximum rate without stopping their operation.

Figure 2. Theoretic maximum throughput of the MSI and the MSIQ.

The synthesis results are in Table II. The MSIQs and the MSI contain Tx-FIFOs and Rx-FIFOs of four packets. The logic and register consumption of the MSIQs and the MSI is quite similar, but the amount of block memory bits grows clearly as the size of the queues is increased. The maximum size of the queues is 16 requests. With queues of that size the MSIQ would consume 4096 block memory bits, but it would also provide a better theoretic maximum throughput and burst tolerance. Additionally, it would be possible to use a smaller HW send-request queue so as to reduce the HW costs, because the SW send-request queue can store a large number of send-requests in any case. For example, with a HW send-request queue of four requests and a HW receive-request queue of 16 requests the MSIQ would also consume 2560 block memory bits.

TABLE II. RESOURCE CONSUMPTIONS IN STRATIX III EP3SL150 [15]

FPGA resource        | MSI (Qsize = 1) | MSIQ (Qsize = 4) | MSIQ (Qsize = 8)
Combinational ALUTs  | 1550            | 1665 (+7.4%)     | 1695 (+9.3%)
Memory ALUTs         | 0               | 0 (0.0%)         | 0 (0.0%)
Logic registers      | 1454            | 1609 (+10.6%)    | 1609 (+10.6%)
Block memory bits    | 1024            | 1792 (+75.0%)    | 2560 (+150.0%)

IV. CONCLUSIONS

This paper presents the MSIQ NI, where a new queue mechanism is used for batching interrupts in order to improve the performance. Interrupts generated by the NIs produce a lot of SW overhead, and the performance can be improved by reducing the interrupt frequency. This is achieved by the send-request and receive-request queues, which make it possible to batch interrupt service requests so that individual ISR executions can serve multiple interrupt requests. The throughput improves especially with longer messages. Furthermore, the burst tolerance with short messages improves. In addition to the interrupt batching, this is partly owing to the fact that the request queues allow the MSIQ HW to continue sending and receiving messages while the processor is running the ISR. Hence, the new queue mechanism enables more efficient concurrent operation of the MSIQ HW and SW. The results of the performance analysis and the logic synthesis also show clearly that the performance can be improved with tolerable costs. It would also be possible to reduce the HW costs by using smaller send-request queues in the MSIQ without reducing the performance significantly.

ACKNOWLEDGMENT

This research is funded by the Academy of Finland under grant 122361.

REFERENCES

[1] Z.D. Dittia, G.M. Parulkar, and J.R. Cox, "The APIC Approach to High Performance Interface Design: Protected DMA and Other Techniques," Proc. of the IEEE International Conference on Computer Communications, Kobe, Japan, Apr. 7-12, 1997, pp. 823-831.
[2] A.F. Diaz, J. Ortega, A. Canas, F.J. Fernandez, M. Anguita, and A. Prieto, "The Lightweight Protocol CLIC on Gigabit Ethernet," Proc. of the International Parallel and Distributed Processing Symposium, Nice, France, Apr. 22-26, 2003, 8 pp.
[3] P. Gilfeather and A.B. Maccabe, "Modeling Protocol Offload for Message-Oriented Communication," Proc. of the IEEE International Conference on Cluster Computing, Burlington, Massachusetts, USA, Sept. 27-30, 2005, pp. 1-10.
[4] S.A. AlQahtani, "Performance Evaluation of Handling Interrupts Schemes in Gigabit Networks," Proc. of the IEEE International Conference on Computer and Information Technology, Aizu-Wakamatsu, Fukushima, Japan, Oct. 16-19, 2007, pp. 497-502.
[5] B. Goglin and N. Furmento, "Finding a Tradeoff between Host Interrupt Load and MPI Latency over Ethernet," Proc. of the IEEE International Conference on Cluster Computing, New Orleans, Louisiana, USA, Aug. 31-Sept. 4, 2009, pp. 1-9.
[6] J. Mogul and K.K. Ramakrishnan, "Eliminating Receive Livelock in an Interrupt-Driven Kernel," ACM Transactions on Computer Systems, Vol. 15, No. 3, Aug. 1997, pp. 217-252.
[7] K. Langendoen, J. Romein, R. Bhoedjang, and H. Bal, "Integrating Polling, Interrupts, and Thread Management," Proc. of the Frontiers of Massively Parallel Computing Symposium, Annapolis, MD, USA, Oct. 27-31, 1996, pp. 13-22.
[8] H. Kariniemi and J. Nurmi, "Micronmesh for Fault-tolerant GALS Multiprocessors on FPGA," Proc. of the International Symposium on System-on-Chip, Tampere, Finland, Nov. 4-6, 2008, pp. 1-8.
[9] H. Kariniemi and J. Nurmi, "Fault-Tolerant Communication over Micronmesh NoC with Micron Message-Passing Protocol," Proc. of the 11th International Symposium on System-on-Chip, Tampere, Finland, Oct. 5-7, 2009, pp. 5-12.
[10] Altera Corp., NIOS II Software Developer's Handbook, March 2009. Website, <http://www.pldworld.com/_Semiconductors/Altera/one_click_niosII_docs_9_0/files/n2sw_nii5v2.pdf> 20.08.2010
[11] J. Labrosse, MicroC/OS-II: The Real-Time Kernel, Second ed., CMP Books, San Francisco, USA, 2002.
[12] H. Kariniemi and J. Nurmi, "NoC Interface for Fault-Tolerant Message-Passing Communication on Multiprocessor SoC Platform," Proc. of the NORCHIP, Trondheim, Norway, Nov. 2009.
[13] Altera Corp., NIOS II Processor Reference Handbook, November 2009. Website, <http://www.altera.com/literature/hb/nios2/n2cpu_nii5v1.pdf> 20.08.2010
[14] Altera Corp., Quartus II Handbook v10.0, Ch. 2: System Interconnect Fabric for Memory-Mapped Interfaces, July 2010. Website, <http://www.altera.com/literature/hb/qts/qts_qii54003.pdf> 20.08.2010
[15] Altera Corp., Stratix III Device Handbook, Volume I, San Jose, USA, July 2010. Website, <http://www.altera.com/literature/hb/stx3/stratix3_handbook.pdf> 20.08.2010