Virtual Interface Architecture over Myrinet
EEL5717 - Computer Architecture
Dr. Alan D. George
Project Final Report
Department of Electrical and Computer Engineering
University of Florida
Edwin Hernandez
December 1998.
Implementation of a Virtual Interface Architecture over Myrinet
Edwin Hernandez - hernande@hcs.ufl.edu
1. Introduction
Network interfaces have been improving over the years, but network protocols still add a large amount of
overhead, which translates into higher latency and low useful throughput. Traditional protocols such as
TCP, UDP and TP4 are not appropriate for high-performance environments, where light-weight
communication is required in order to approach the theoretically achievable bandwidth. In other words,
protocols have to be light-weight: they should maximize useful throughput and keep latency to a minimum.
Several software and hardware companies, seeing this problem, have come up with new ideas. One of them
is the Virtual Interface Architecture (VIA); in fact, VIA was born from the combined effort of COMPAQ,
Microsoft and INTEL [VIA98]. The Virtual Interface Specification is defined as a standard, several
organizations have endorsed it, and version 1.0 of the standard is now available [VIA97]. Several papers
have been published on the Virtual Interface Architecture in which Virtual Interfaces (VIs) have been
implemented to prove the concept as well as to demonstrate the reduction in latency and the gain in
bandwidth, using different NICs such as Myrinet and even Ethernet [Dunn98], [Eick98]. This line of work
was preceded by PA-RISC network interface architectures [Banks93] and virtual protocols for Myrinet
[Rosu95]; moreover, several researchers have tried to localize the bottlenecks and performance
improvements in NICs, such as the work done in [Davi93] and [Rama93], which covers general
memory-management concepts as well as I/O handling techniques. As shown in Section 5, all the
measurements and tests were performed on the Myrinet test-bed in the High Performance Computing and
Simulation Research Lab (HCS Lab) at the University of Florida.
2. Background
The Virtual Interface Architecture is a new concept that fits naturally with high-performance networks and
the design of clusters. The VIA tries to boost performance by eliminating excessive copying and by
performing several tasks without traversing many protocol layers; these and other important issues are
explained in Section 2.1.
2.1. The Virtual Interface Architecture [VIA98]
The VIA attacks the problem of the relatively low achievable performance of inter-process communication
(IPC) within a cluster1. It is overhead that determines the performance of IPC: software overhead is added
during the send/receive operations on every message that crosses the network. The many software layers
that are traversed imply a great number of context switches, interrupts and data copies when crossing those
boundaries.
Although increasing the processor clock helps in processing the software layers, clock speed alone does not
recover the lost performance: cache misses carry large penalties, and the software layers imply a large
number of branches.
With the introduction of OC-3 ATM, network bandwidth is being increased from 1 Mbps to 100-150 Mbps,
with 1 Gbps in backbones, but the "raw" bandwidth can almost never be achieved.
Taking those two factors into consideration, INTEL and other companies have developed the VIA, which
can be described by two components:
• User Agent
• Kernel Agent.
A user agent is the software layer using the architecture; it can be an application or a communication-
services layer. The kernel agent is a driver running in protected (kernel) mode. It must set up the necessary
tables and structures that allow communication between cooperating processes.
VIA accomplishes low latency in a message-passing environment by following these rules:
• Eliminate any intermediate copies of the data
• Eliminate the need for a driver running in protected kernel mode to multiplex a hardware resource
• Avoid traps into the operating system whenever possible, to avoid context switches in the CPU as well
as cache thrashing
• Remove the constraint of requiring an interrupt when initiating an I/O operation
• Define a simple set of operations that send and receive data
• Keep the architecture simple enough to be emulated in software
VIA presents each process with the illusion that it owns the interface to the network.
Each VI consists of one send queue and one receive queue, and is owned and maintained by a single process.
1 Cluster computing consists of short-distance, low-latency, high-bandwidth IPC between multiple
building blocks. Cluster building blocks include servers, workstations and I/O subsystems, all of which
connect directly to a network.
A process can own many Virtual Interfaces (VIs), many processes can each own many VIs, and the kernel
itself can also own a VI.
The VI queue is formed by a linked list of variable-length descriptors. To add a descriptor to a queue, the
user builds the descriptor and posts it onto the tail of the appropriate work queue. That same user pulls the
completed descriptor off the head of the work queue it was posted on.
The process that owns the queue can post four types of descriptors. Send, remote-DMA/write, remote-
DMA/read descriptors are placed on the send queue of a VI. Receive descriptors are placed on the receive
queue of a VI.
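The posting and pulling discipline above can be sketched in a few lines of C++. This is an illustrative model only, not the VIPL data types: the names `Descriptor`, `WorkQueue` and `DescType` are invented for the sketch, and a `std::deque` stands in for the linked list of descriptors.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// The four descriptor types a VI owner can post: Send, RDMA-write and
// RDMA-read go on the send queue; Receive goes on the receive queue.
enum class DescType { Send, Receive, RdmaWrite, RdmaRead };

struct Descriptor {
    DescType type;
    const void* data;   // user buffer described by this descriptor
    uint32_t length;    // buffer length in bytes
    bool done = false;  // set by the NIC when processing completes
};

class WorkQueue {
public:
    void Post(Descriptor* d) { q.push_back(d); }  // user posts on the tail
    Descriptor* PullCompleted() {                 // user pulls from the head
        if (q.empty() || !q.front()->done) return nullptr;
        Descriptor* d = q.front();
        q.pop_front();
        return d;
    }
private:
    std::deque<Descriptor*> q;  // FIFO list of posted descriptors
};
```

Note that descriptors come back off the same queue in the order they were posted, which is the FIFO-ordering property the specification relies on.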
VIA also provides polling and blocking mechanisms to synchronize between the user process and
completed operations. When descriptor processing completes, the NIC writes a done bit and includes any
error bits associated with that descriptor in its specified fields. This act transfers ownership of the descriptor
from the NIC back to the process that originally posted it.
Completion queues are an additional construct that allows the coalescing of completion notifications from
multiple work queues into a single queue. The two work queues of one VI can be associated with
completion queues independently of one another.
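The done-bit handshake and the coalescing role of a completion queue can be modeled as follows. Again this is a hypothetical sketch, not the VIPL API: the status bits and the `CompletionQueue` type are invented names, and `NotifyDone` simulates what the NIC does in hardware.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

struct Descriptor {
    uint32_t status = 0;  // done/error bits written by the NIC
    static constexpr uint32_t kDone  = 0x1;
    static constexpr uint32_t kError = 0x2;
};

struct CompletionQueue {
    std::deque<Descriptor*> entries;  // notifications from many work queues

    // Simulates the NIC completing a descriptor: it writes the done bit
    // (plus any error bits), which hands ownership back to the process.
    void NotifyDone(Descriptor* d, bool error) {
        d->status |= Descriptor::kDone | (error ? Descriptor::kError : 0);
        entries.push_back(d);
    }

    // Polling-mode synchronization: returns the next completed descriptor,
    // or nullptr if nothing has completed yet.
    Descriptor* Poll() {
        if (entries.empty()) return nullptr;
        Descriptor* d = entries.front();
        entries.pop_front();
        return d;
    }
};
```

A blocking variant would simply wait instead of returning nullptr; the ownership-transfer rule (NIC owns the descriptor until the done bit is written) is the same in both modes.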
The descriptors mentioned above are constructs that describe the work to be done by the network
interface; this is very similar to the architecture proposed in [Davi93]. Send and receive descriptors
contain one control segment and a variable number of data segments. Remote-DMA/write and remote-
DMA/read descriptors contain one additional address segment following the control segment and
preceding the data segments.
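The segment layout just described can be written down as a struct. The field names below are assumptions for illustration; the real VIP_DESCRIPTOR in the VIA specification differs in detail, but the shape is the same: a control segment first, an optional address segment for RDMA, then a variable number of data segments.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct ControlSegment { uint32_t opcode; uint32_t segCount; uint32_t length; };
struct AddressSegment { uint64_t remoteAddr; uint32_t memoryHandle; };
struct DataSegment    { uint64_t localAddr; uint32_t length; };

struct ViDescriptor {
    ControlSegment control;
    bool hasAddress = false;        // true only for remote-DMA read/write
    AddressSegment address{};       // ignored for plain send/receive
    std::vector<DataSegment> data;  // variable number of data segments
};

// Builds a send descriptor over a single local buffer.
ViDescriptor MakeSend(uint64_t addr, uint32_t len) {
    ViDescriptor d;
    d.control = {/*opcode=*/0, /*segCount=*/1, len};
    d.data.push_back({addr, len});
    return d;
}
```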
The VIA also provides:
• Immediate data: 32-bit immediate data can be carried in a descriptor.
• Ordering: the order of descriptors is preserved in a FIFO queue. It is easy to maintain consistency for
send/receive and remote-DMA/write; however, remote-DMA/read is a round-trip transaction and is
not completed until the requested data is returned from the remote node/endpoint.
• Work queue scheduling: there is no implicit ordering relationship between descriptors across VIs,
so the scheduling service depends on the algorithm used by the NIC.
• Memory protection: ensures that a user process cannot send out of, or receive into, memory that it
does not own.
• Virtual address translation: this is done when the kernel agent registers a memory region. The kernel
agent performs ownership checks (the request comes from a user agent), pins the pages into physical
memory, and sets up the virtual-to-physical address translation for the region.
[Figure 1 depicts the VI Architectural Model: an application (the VI consumer) uses the OS communication
interface and the VI User Agent in user mode; send/receive and RDMA read/write requests go directly to
the per-VI send and receive queues on the VI network adapter, while the VI Kernel Agent, in kernel mode,
handles setup.]
Figure 1. VI Architectural Model
3. MODEL DESIGN
For this class project there was not enough time to build a hardware implementation of the VI on-chip
using the Myrinet interface; however, it is quite possible to interact with the Myrinet card and build an
emulation of the Virtual Interface entirely in software. For this reason, Appendix 1 contains the C++
source code, which sits on top of the Myrinet layer; the performance enhancements achieved therefore
will not be as high as expected, but the model followed by this project remains the one above, with some
modifications.
It should be noted that RDMA transfers, as well as error handling, were left aside for this project; the
only concerns in the design of the VI were:
• VI initialization and interaction with the VI
• Implement the Send and Receive Queues
• Implement the completion queues
• Use the standard data-types mentioned in the specification [VIA97]
• Make use of a small application ECHO/REPLY Server for the performance tests.
In addition, the software makes use of the Myrinet adapter in DMA mode for its transfers. At a very early
stage this choice was not given much thought, but it turns out not to be a good performance "booster" and
the measurements obtained are quite low; this aspect is explained in Section 5.
The basic objects used are:
• Myrinet, which is in charge of the send, receive and initialization operations; it talks directly to the
Myrinet card.
- Init() initializes the interface; in this case it is needed to change the route of the Myrinet DMA
transfer. In other words, first the interface sends data, then it has to be reinitialized to receive the
reply from the server; this also happens at the server.
- Send() and Recv() post to and interact with the shmem* structure, receiving data from the
Myrinet's SRAM and posting data into it.
• VI, the Virtual Interface object, contains the send queue, the receive queue and the completion
queue. A description of the class members follows:
- NIC: object referencing the instance of the interface being used, in this case Myrinet.
- CQ, SendQ, RecvQ: these members hold the queues of descriptors; they are handled as List
objects. The List object was also developed for this project and contains all the functions of a
linked list.
- SetupDescriptorSendRecv(): this function initializes a descriptor, whether for send or
receive, to or from the work queues. The descriptor created here has the data type
VIP_DESCRIPTOR defined in the VIA specification.
- ViPostSend(): this function is in charge of posting the send descriptor to the queue; it does
not transmit the data.
- ViProcesSend(): this function pops the first descriptor from the queue and starts delivering
the content pointed to by the descriptor (DS[0].Local.Data.Address) to the shmem->sendBuffer
pointer. This is the real send.
- ViPostRecv(): posts a reception descriptor into the receive queue; it holds the data addresses
where the information is to be stored. An application can also receive a descriptor from the other
end, depending on the protocol used. For this basic application the receive descriptor is formed
at the receiving peer.
- ViProcessRecv(): reception is done through this method; like ViProcesSend(), it pops the
first element in the receive queue and writes whatever is read from the NIC object into the
destination address.
- EchoServer(): member function that makes the node act as a server, waiting for incoming
data and replying with the same data.
- EchoClient(): member function that sends a block of MTU (Maximum Transfer Unit) data
to the other end, waits for a reply and compares whatever was sent with the content of the
received data.
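A condensed, self-contained sketch of the VI object described above is given below. It is not the project's actual code: a plain byte vector (`FakeShmem`) stands in for the shmem structure that fronts the Myrinet SRAM, the descriptor type is reduced to an address/length pair instead of VIP_DESCRIPTOR, and `ViProcessSend` is a simplified spelling of the class's ViProcesSend().

```cpp
#include <cassert>
#include <cstring>
#include <deque>
#include <vector>

struct FakeShmem { std::vector<unsigned char> sendBuffer; };

struct Desc { const unsigned char* addr; size_t len; };

class VI {
public:
    explicit VI(FakeShmem* s) : shmem(s) {}

    // Posting only enqueues the descriptor; nothing is transmitted yet.
    void ViPostSend(Desc d) { SendQ.push_back(d); }

    // Pops the head descriptor and copies its data into shmem->sendBuffer:
    // this is the "real send" in the software emulation. The processed
    // descriptor then lands on the completion queue.
    bool ViProcessSend() {
        if (SendQ.empty()) return false;
        Desc d = SendQ.front();
        SendQ.pop_front();
        shmem->sendBuffer.assign(d.addr, d.addr + d.len);
        CQ.push_back(d);
        return true;
    }

    std::deque<Desc> SendQ, RecvQ, CQ;  // work and completion queues
private:
    FakeShmem* shmem;
};
```

The receive path (ViPostRecv/ViProcessRecv) mirrors this, copying out of the shmem structure instead of into it.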
VIPL.h is the most important of the libraries, because it contains all the data types stated in the
specification; it defines descriptors, responses, error handling, memory management and some other VI
properties. However, it was not implemented exactly as stated there; it was modified to fit the requirements
of this project and the HCS Lab resources.
As stated before, some aspects were left aside in the VI application implementation:
a) Threads and multi-threading. In order to keep the VI clean of the vices of other protocols, it is
imperative to use a library of light-weight threads; otherwise all the overhead introduced by
traditional thread libraries would distort the results.
b) Remote DMA reads and writes. There are two main reasons for leaving these aside: first, they
require direct memory manipulation, which is not permitted without the proper system-administrator
rights; second, the standard is not entirely clear on how to achieve them.
c) Error handling was not implemented, so an error-free environment should be assumed for all the
results.
4. EXPERIMENTS
Experiments were directed in three main areas: Latency, Throughput and Time overhead attributed to the
client and server of the application developed. They were also made following the performance results
obtained by [Berry97] and [Erick98]. In fact, the values gathered by them have much higher performance
than the values gathered at the HCS Lab, the reason could be the VI implementation in hardware, not a
software emulation and a better understanding of the Myrinet architecture, in terms of modes of operation
and how to improve the data transfers . Fist of all, the SAN used consisted of two computers, viking and
vigilante, both Sun Ultra-2 interconnected through the Myrinet switch version 1.0, Berry and his team used
Pentium Pro 200 MHz, PCI bus and there is not much specification concerning the application used.
The set of experiments selected consisted on:
- Throughput in Myrinet with and without the VI
- Latency with and without the VI
- Time distribution at the Client and Server using the VI
The results and analysis are show in section 5.
5. RESULTS AND ANALYSIS
The mode of operation for the Myrinet adapter was DMA transfer and as shown in Figure 2., it was not the
best option, however it fulfilled the requirement of an easy implementation.
[Figure 2 plots throughput (Mbytes/s) against payload size (0-800 bytes) for three modes of operation:
DMA transfers, mem_map transfers, and a TCP_STREAM netperf test.]
Figure 2. Throughput measurements for Myrinet using different modes of operation
As shown there, the DMA transfer does not improve performance once the payload exceeds about 64 bytes;
moreover, a TCP_STREAM test made with netperf achieves better performance. But the main goal of this
paper is not to find a better mode of operation for Myrinet, but rather to prove that the VIA is a good
concept and can be used in SANs. With this in mind, whatever is found from now on could simply be
carried over to, or scaled by, the performance-improvement factor of a better transfer mode.
The first measurement made is the latency with and without the VI: Raw-Myrinet represents the
application without the VI overhead (bulk data transfers), and VI-Myrinet represents the latency of the
ECHO/Reply round trip divided by two.
[Figure 3 plots latency (microseconds) against payload size (32 to 8192 bytes) for Raw-Myrinet and
VI-Myrinet.]
Figure 3. Latency measurements with Myrinet using raw data and the VI on top of Myrinet
From Figure 3 it can be concluded that the increase in latency is roughly constant and not greater than
25%. If the results shown here are compared with the ones reported by Berry, the latencies differ by a
ratio of about 4:1, with the ones reported here being the larger. However, an aspect not taken into account
either by Berry or by this project is the fact that the VIA specification defines a maximum transfer unit of
32 KB, yet all the measurements were done by the other researchers at payloads no greater than 8 Kbytes.
[Figure 4 plots throughput (Mbytes/s) against payload size (32 to 8192 bytes) for VI-Myrinet and
DMA-Myrinet.]
Figure 4. Throughput measurements with and without the VI
In terms of throughput, performance decreases by about 40% between the raw data transfer and the
transfer done using the VI. This value was not expected, and unfortunately there are no references to
compare against: generally, VI performance is compared between a kernel-agent implementation and a VI
emulation, not against raw transfer performance.
In addition to the throughput, it is necessary to find where the performance bottleneck lies, in other
words where the 40% in question is lost. To discover this, time stamping was performed throughout the
client and server applications. Although the measurement could be made and compared at both client and
server, the client is the more representative of the two peers.
For this reason, Figure 5 shows the distribution of time across every stage of the application. In the VIA,
the processing of descriptors is basically negligible, and most of the time is spent in the data transfer and
in the reception of the reply (waiting on the receive descriptor).
[Figure 5 is a pie chart of the client's time distribution (MTU=8192) across: setting the send descriptor,
posting the descriptor, processing the send, the Myrinet send (DMA), waiting for the receive descriptor,
receiving the descriptor (CQ ready), and reading the data; the two dominant slices are 50% and 39%.]
Figure 5. Time distribution of the Echo/Reply application
This behavior is expected: the Myrinet-to-Myrinet transfer cost appears at both ends, so if roughly 12% is
spent at each end it accounts for approximately 24% of the overhead; of the 50% spent waiting, about 30%
corresponds to the server's reply, leaving roughly 10% more, for a total of 30-35% of processing overhead.
This processing overhead is constant, as shown in Figure 6, so reaching better performance levels is a
matter of improving the Myrinet-to-Myrinet transmission.
[Figure 6 plots the time spent processing descriptors (microseconds) against payload size (32 to 8192
bytes) for the client side and the server side.]
Figure 6. Time variation of descriptor processing at client and server.
This behavior (Figure 6) is explained by the algorithm itself: a block is sent from the client to the server;
the client waits for that block of data; the server copies the data pointed to by the descriptor, posts a send
descriptor with the same data, and the data is sent back to the client. In other words, only one descriptor
is needed per send, and all work queues handle only one element at a time.
6. CONCLUSIONS
First, a proof of concept has been achieved at the HCS Lab: the philosophy of the VIA in a SAN can be
applied, reducing the complexity of the OSI model and layered protocols. The implementation developed
introduces an overhead of about 10% for descriptor and data processing at each end. For an echo-reply
application the average overhead, taking raw Myrinet transmission using DMA as the reference, is 40%.
The average latency added by the VIA to the application is at most 25%. It turns out that the use of DMA
transfer was not the best choice; another transfer technique is recommended.
7. FUTURE RESEARCH
Further research could be done on the implementation: first, improve the Myrinet-to-Myrinet
communication by using mem_map instead of DMA; implement error checking and remote DMA reads
and writes; and make use of SCALE_Threads or another light-weight thread library, together with a
multi-issue implementation. In addition, the queues (completion, reception and transmission) could be
modified, switching from a simple FIFO to something more efficient such as a hash table, which would
keep processing overhead low and could improve performance.
ACKNOWLEDGEMENTS
I would like to thank Wade Cherry for his introduction to and explanation of the LANAI and Myrinet
applications. I would also like to thank the team at INTEL for defining the VIPL.H library and providing
the source code for free use through the Internet, along with their Visual C++ application, from which I
gathered many ideas and finally understood the philosophy of VIA.
REFERENCES
[Banks93] Banks, D., Prudence, M., "A High-Performance Network Architecture for a PA-RISC
Workstation", IEEE Journal on Selected Areas in Communications, Vol. 11, No. 2,
February 1993, pp. 191-202.
[Berry97] Berry, F., Deleganes, E., "The Virtual Interface Architecture Proof-of-Concept
Performance Results", INTEL Corp. white paper.
[Davi93] Davie, B., "The Architecture and Implementation of a High-Speed Host Interface", IEEE
Journal on Selected Areas in Communications, Vol. 11, No. 2, February 1993, pp. 228-239.
[Dunn98] Dunning, D., et al., "The Virtual Interface Architecture", IEEE Micro, Vol. 18, No. 2,
April 1998, pp. 66-75.
[Eick98] von Eicken, T., Vogels, W., "Evolution of the Virtual Interface Architecture", IEEE
Computer, November 1998, pp. 61-68.
[Rama93] Ramakrishnan, K., "Performance Considerations in Designing Network Interfaces",
IEEE Journal on Selected Areas in Communications, Vol. 11, No. 2, February 1993,
pp. 203-219.
[Rosu95] Rosu, M., "Processor Controller Off-Processor I/O", Cornell University, Grant
ARPA/ONR N00014-92-J-1866, August 1995.
[Steen97] Steenkiste, P., "A High-Speed Network Interface for Distributed-Memory Systems:
Architecture and Applications", ACM Transactions on Computer Systems, Vol. 15, No. 1,
February 1997, pp. 75-109.
[Wels98] Welsh, M., et al., "Memory Management for User-Level Network Interfaces", IEEE
Micro, Vol. 18, No. 2, April 1998, pp. 77-82.
Web Pages
[VIA98] http://www.viaarch.org/
[INT98] http://www.intel.com/
APPENDICES
Appendix 1. Source Code for the VIA_Server