Virtual Interface Architecture over Myrinet


     EEL5717 - Computer Architecture
             Dr. Alan D. George
              Project Final Report


Department of Electrical and Computer Engineering
              University of Florida




               Edwin Hernandez




                December 1998.
Implementation of a Virtual Interface Architecture over Myrinet
                             Edwin Hernandez - hernande@hcs.ufl.edu
1. Introduction
Network interfaces have been improving over the years, but network protocols add a large amount of overhead that translates into higher latency and low useful throughput. Traditional protocols such as TCP, UDP and TP4 are not appropriate for high-performance environments, where light-weight communication is required in order to approach the theoretically achievable bandwidth. In other words, protocols have to be light-weight: they should maximize useful throughput and allow minimum latency. Several software and hardware companies, seeing this problem, have come up with new ideas. One of them is the Virtual Interface Architecture (VIA); VIA was born from the combined effort of COMPAQ, Microsoft and INTEL [VIA98]. The Virtual Interface Specification is defined as a standard, several organizations have agreed to it, and version 1.0 of the standard is now available [VIA97]. Several papers have been published regarding the Virtual Interface Architecture in which Virtual Interfaces (VIs) have been implemented to prove the concept, as well as to demonstrate the reduction in latency and the gain in bandwidth, using different NICs such as Myrinet and even Ethernet [Dunn98], [Eick98]. This line of work was preceded by the PA-RISC network interface architecture [Banks93] and by virtual protocols for Myrinet [Rosu95]; moreover, several researchers have tried to localize the bottlenecks and performance improvements in NICs, such as the work in [Davi93] and [Rama93], which covers general memory-management concepts as well as I/O handling techniques. As described in Section 5, all the measurements and tests were performed on the Myrinet test-bed in the High Performance Computing and Simulation Research Lab (HCS Lab) at the University of Florida.


2. Background
The Virtual Interface Architecture is a new concept that fits naturally with high-performance networks and the design of clusters. VIA tries to boost performance by eliminating excessive copying and by performing its tasks without traversing many software layers; these and other important issues are explained in Section 2.1.
2.1. The Virtual Interface Architecture [VIA98]
VIA attacks the problem of the relatively low achievable performance of inter-process communication (IPC) within a cluster1. Overhead is what determines IPC performance: software overhead is added during send/receive operations on a message through the network, and the many software layers that are traversed imply a great number of context switches, interrupts and data copies when crossing those boundaries.


Although the increase in processor clock rate helps in processing the software layers, it is not a decisive factor in recovering the lost performance (cache misses carry large penalties, and the software layers imply many branches).
With the introduction of OC-3 ATM, network bandwidths are being increased from 1 Mbps to 100-150 Mbps, with 1 Gbps backbones, but this "raw" bandwidth can almost never be achieved.


With those two factors in mind, INTEL and other companies have developed VIA, which can be described in terms of:
•    a User Agent
•    a Kernel Agent


A user agent is the software layer using the architecture; it could be an application or a communication-services layer. The kernel agent is a driver running in protected (kernel) mode; it must set up the tables and structures that allow communication between cooperating processes.


VIA accomplishes low latency in a message-passing environment by following these rules:
•    Eliminate any intermediate copies of the data
•    Eliminate the need for a driver running in protected kernel mode to multiplex a hardware resource
•    Avoid traps into the operating system whenever possible, to avoid context switches in the CPU as well as cache thrashing
•    Remove the constraint of requiring an interrupt when initiating an I/O operation
•    Define a simple set of operations that send and receive data
•    Keep the architecture simple enough to be emulated in software


VIA presents each process with the illusion that it owns the interface to the network. Each VI consists of one send queue and one receive queue, and is owned and maintained by a single process.



1
  Cluster computing consists of short-distance, low-latency, high-bandwidth IPC between multiple building blocks. Cluster building blocks include servers, workstations and I/O subsystems, all of which connect directly to a network.
A process can own many Virtual Interfaces (VIs), many processes can each own many VIs, and the kernel itself can also own a VI.


A VI queue is formed by a linked list of variable-length descriptors. To add a descriptor to a queue, the user builds the descriptor and posts it onto the tail of the appropriate work queue. That same user pulls completed descriptors off the head of the work queue on which they were posted.


The process that owns the queue can post four types of descriptors: send, remote-DMA/write and remote-DMA/read descriptors are placed on the send queue of a VI, while receive descriptors are placed on the receive queue.


VIA also provides polling and blocking mechanisms to synchronize the user process with completed operations. When descriptor processing completes, the NIC writes a done bit, together with any error bits associated with that descriptor, into the descriptor's status fields. This act transfers ownership of the descriptor from the NIC back to the process that originally posted it.
Completion queues are an additional construct that allows completion notifications from multiple work queues to be coalesced into a single queue. The two work queues of one VI can be associated with completion queues independently of one another.


The descriptors mentioned above are constructs that describe the work to be done by the network interface, very similar to the architecture proposed in [Davi93]. Send/receive descriptors contain one control segment and a variable number of data segments. Remote-DMA/write and remote-DMA/read descriptors contain one additional address segment following the control segment and preceding the data segments.


VIA also provides:
•    Immediate data: 32-bit immediate data can be carried in a descriptor.
•    Ordering: the order of descriptors is preserved in a FIFO queue. It is easy to maintain consistency for send/receive and remote-DMA/write; however, remote-DMA/read is a round-trip transaction and is not completed until the requested data is returned from the remote node/endpoint.
•    Work-queue scheduling: there is no implicit ordering relationship between descriptors on different VIs, so the scheduling service depends on the algorithm used by the NIC.
•    Memory protection: VIA ensures that a user process cannot send out of, or receive into, memory that it does not own.
•    Virtual address translation: this is done when the kernel agent registers a memory region. On a user-agent request, the kernel agent performs ownership checks, pins the pages into physical memory, and records the virtual-to-physical address translation for the region.
[Figure 1 depicts the VI architectural model: the VI consumer (the application plus the OS communication interface) sits on the VI user agent in user mode; the VI kernel agent runs in kernel mode; each VI exposes a send queue and a receive queue (send/receive/RDMA-read/RDMA-write operations), all multiplexed onto the VI network adapter.]

                                   Figure 1. VI Architectural Model


3.    MODEL DESIGN
For this class project there was not enough time to build a hardware implementation of the VI on-chip using the Myrinet interface; however, it is entirely possible to interact with the Myrinet card and emulate the Virtual Interface purely in software. Appendix 1 therefore contains the C++ source code, which sits on top of the Myrinet layer; the performance gains will consequently not be as high as a hardware implementation could reach, but the model followed by this project remains the same with some modifications.
Note that RDMA transfers, as well as error handling, were left aside for this project; the only concerns in the design of the VI were:
•    VI initialization and interaction with the VI
•    Implementing the send and receive queues
•    Implementing the completion queues
•    Using the standard data types mentioned in the specification [VIA97]
•    Building a small ECHO/REPLY server application for the performance tests


In addition, the software uses the Myrinet adapter in DMA mode for its transfers. At a very early stage this choice was not examined closely, but it turned out not to be a good performance "booster," and the measurements obtained are quite low; this aspect is explained in Section 5.
The basic objects used are:
•   Myrinet, which is in charge of handling send, receive and initialization; it talks directly to the Myrinet card.
    -    Init() initializes the interface; here it is needed to change the direction of the Myrinet DMA transfer. In other words, first the interface sends data, then it has to be reinitialized to receive the reply from the server; the same happens at the server.
    -    Send() and Recv() interact with the shmem* structure, posting data into the Myrinet's SRAM and receiving data from it.
•   VI, the Virtual Interface object, contains the send queue, the receive queue and the completion queue. A description of the class members follows:
    -    NIC: object referencing the instance of the interface being used, in this case Myrinet.
    -    CQ, SendQ, RecvQ: the queues of descriptors, handled as List objects. The List object was also developed for this project and contains all the functions of a linked list.
    -    SetupDescriptorSendRecv(): initializes a descriptor, whether for send or receive, to or from the work queues. The descriptor created here has the data type VIP_DESCRIPTOR defined in the VIA specification.
    -    ViPostSend(): posts the send descriptor to the queue; it does not transmit the data.
    -    ViProcessSend(): pops the first descriptor from the queue and delivers the content pointed to by the descriptor (DS[0].Local.Data.Address) to the shmem->sendBuffer pointer. This is the real send.
    -    ViPostRecv(): posts a reception descriptor into the receive queue; it carries the data addresses at which to store the incoming information. An application can also receive a descriptor from the other end, depending on the protocol used; for this basic application the receive descriptor is formed at the receiving peer.
    -    ViProcessRecv(): reception is done through this method; like ViProcessSend(), it pops the first element of the receive queue and writes whatever is read from the NIC object into the destination address.
    -    EchoServer(): member function that acts as a server, waiting for incoming data and echoing the same data back.
    -    EchoClient(): member function that sends a block of up to the MTU (Maximum Transfer Unit) to the other end, waits for a reply, and compares what was sent with the content of the received data.


VIPL.h is the most important of the libraries because it contains all the data types stated in the specification; it defines descriptors, responses, error handling, memory management and other VI properties. However, it was not implemented exactly as specified; it was modified to fit the requirements of this project and the HCS Lab resources.
As stated before, some aspects were left aside in the VI application implementation:
    a)   Threads and multi-threading. To keep the VI free of the vices of the other protocols, a library of light-weight threads would be required; otherwise the overhead introduced by traditional thread libraries would distort the results.
    b)   Remote DMA reads and writes. There are two main reasons for leaving these aside: first, they require direct memory manipulation, which is not permitted without the proper system-administrator rights; and second, the standard is not entirely clear on how to achieve them.
    c)   Error handling was not implemented, so an error-free environment should be assumed for all the results.


4. EXPERIMENTS
Experiments were directed at three main areas: latency, throughput, and the time overhead attributed to the client and the server of the application developed. They follow the performance methodology of [Berry97] and [Eick98]. In fact, the values gathered there show much higher performance than the values gathered at the HCS Lab; the reason could be that theirs is a VI implementation in hardware, not a software emulation, combined with a better understanding of the Myrinet architecture in terms of modes of operation and how to improve the data transfers. First of all, the SAN used here consisted of two computers, viking and vigilante, both Sun Ultra-2s interconnected through the Myrinet switch version 1.0; Berry and his team used 200-MHz Pentium Pros on a PCI bus, and there is not much specification concerning the application they used.


The set of experiments selected consisted of:
    -    Throughput in Myrinet with and without the VI
    -    Latency with and without the VI
    -    Time distribution at the client and server using the VI
The results and analysis are shown in Section 5.




5. RESULTS AND ANALYSIS


The Myrinet adapter operated in DMA-transfer mode; as shown in Figure 2, this was not the best option, but it fulfilled the requirement of an easy implementation.
[Figure 2 plots throughput (MBytes/s, 0-40) against payload (0-800 bytes) for three series: throughput using DMA, throughput using mem_map, and TCP_STREAM throughput.]

             Figure 2. Throughput measurements for Myrinet using different modes of operation


As shown there, the DMA transfer does not improve performance for payloads beyond 64 bytes; moreover, a TCP_STREAM test made with netperf achieves better performance. But the main goal of this paper is not to find a better mode of operation for Myrinet; it is to show that VIA is a sound concept that can be used in SANs. With this in mind, whatever is found from now on would simply carry over, scaled by the performance-improvement factor of a better transfer mode.
The first measurement is the latency with and without the VI: Raw-Myrinet represents the application without the VI overhead (bulk data transfers), and VI-Myrinet represents the latency of the ECHO/REPLY round trip divided by two.



[Figure 3 plots latency (microseconds, 0-1200) against payload (32-8192 bytes) for two series: Raw-Myrinet and VI-Myrinet.]

    Figure 3. Latency measurements with Myrinet using raw data and the VI on top of Myrinet
From Figure 3 it can be concluded that the increase in latency is roughly constant and no greater than 25%. If the results shown here are compared with the ones reported by Berry, the latencies differ by a ratio of about 4:1, with these being the larger. However, one aspect not taken into account either by Berry or by this project is that the VIA specification defines a Maximum Transfer Unit of 32 KB, yet all the measurements were done by the other researchers at payloads no greater than 8 KB.


[Figure 4 plots throughput (MBytes/s, 0-12) against payload (32-8192 bytes) for two series: VI-Myrinet and DMA-Myrinet.]

                                Figure 4. Throughput measurements with and without the VI


In terms of throughput, performance decreases by about 40% between the raw data transfer and the transfer done using the VI. This value was not expected, and unfortunately there are no references against which to compare it: VI performance is generally compared between a kernel-agent implementation and a VI emulation, not against raw transfer performance.


In addition to the throughput, it is necessary to find where the performance bottleneck stands; in other words, where the 40% mentioned above is lost. To discover this, time stamping was performed throughout the client and server applications. Although the measurement could be made and compared at both client and server, the client is the more representative of the two peers. For this reason Figure 5 shows the distribution of time over every stage of the application. In the VI, descriptor processing is basically negligible, and most of the time is spent in the data transfer and in the reception of the reply (waiting for the receive descriptor).

[Figure 5 is a pie chart of the VI time distribution at the client (MTU=8192). The slices cover: setting the send descriptor, posting the descriptor, processing the send, the Myrinet DMA send, waiting for the receive descriptor, receiving the descriptor (CQ ready), and reading the data. The two dominant slices are 50% and 39%, corresponding to the data transfer and the wait for the reply; the rest are 5%, 4%, 1%, 1% and 0%.]

                    Figure 5. Time distribution of the Echo/Reply application


This behavior is expected because the roughly 30% of Myrinet-to-Myrinet transfer time is incurred at both ends: about 12% spent at each end represents approximately 24% of the remaining overhead, and the 50% share, which includes roughly 30% of server reply time, leaves about 10% more; together this adds up to 30-35% of processing overhead. This processing overhead is constant, as shown in Figure 6, so improving the Myrinet-to-Myrinet transmission is the way to reach better performance levels.


[Figure 6 plots time spent processing descriptors (microseconds, 0-300) against payload (32-8192 bytes) for two series: client side and server side.]

                     Figure 6. Time variation of descriptor processing at client and server.
This behavior (in Figure 6) is explained by the algorithm itself: a block is sent to the server from the client, the client waits for that block of data, the server copies the data pointed to by the descriptor, writes a send-queue descriptor with the same data, and the data is sent back to the client. In other words, only one descriptor is needed for sending: all work queues handle only one element at a time.



6.   CONCLUSIONS
First, a proof of concept has been achieved at the HCS Lab: the philosophy of VIA in a SAN can be applied, reducing the complexity of the OSI model and layered protocols. The implementation developed introduces an overhead of 10% for descriptor and data processing at each end. For an echo/reply application, the average overhead, taking raw Myrinet transmission using DMA as the reference, is 40%. The average latency added by the VIA to the application is at most 25%. It turns out that DMA transfer was not the best choice, and another technique is recommended.


7. FUTURE RESEARCH
Further research could be done in terms of implementation: first, improve the Myrinet-to-Myrinet communication by using mem_map instead of DMA; implement error checking and remote DMA reads and writes; and make use of SCALE_Threads or another light-weight thread library together with a multi-issue implementation. In addition, the completion, reception and transmission queues could be changed from a simple FIFO to something more efficient, such as a hash table, which would keep processing overhead low and could improve performance.


ACKNOWLEDGEMENTS
I would like to thank Wade Cherry for his introduction to and explanation of the LANai and Myrinet applications. I would also like to thank the team at INTEL for defining the VIPL.H library and providing the source code for free use through the Internet, along with their Visual C++ application, from which I gathered many ideas and finally understood the philosophy of VIA.


REFERENCES
[Banks93]    Banks, D., Prudence, M., "A High-Performance Network Architecture for a PA-RISC Workstation", IEEE Journal on Selected Areas in Communications, Vol. 11, No. 2, February 1993, pp. 191-202.

[Berry97]    Berry, F., Deleganes, E., "The Virtual Interface Architecture Proof-of-Concept Performance Results", INTEL Corp. white paper.

[Davi93]     Davie, B., "The Architecture and Implementation of a High-Speed Host Interface", IEEE Journal on Selected Areas in Communications, Vol. 11, No. 2, February 1993, pp. 228-239.

[Dunn98]     Dunning, D., et al., "The Virtual Interface Architecture", IEEE Micro, Vol. 18, No. 2, April 1998, pp. 66-75.

[Eick98]     von Eicken, T., Vogels, W., "Evolution of the Virtual Interface Architecture", IEEE Computer, November 1998, pp. 61-68.

[Rama93]     Ramakrishnan, K., "Performance Considerations in Designing Network Interfaces", IEEE Journal on Selected Areas in Communications, Vol. 11, No. 2, February 1993, pp. 203-219.

[Rosu95]     Rosu, M., "Processor Controller off-Processor I/O", Cornell University, Grant ARPA/ONR N00014-92-J-1866, August 1995.

[Steen97]    Steenkiste, P., "A High Speed Network Interface for Distributed-Memory Systems: Architecture and Applications", ACM Transactions on Computer Systems, Vol. 15, No. 1, February 1997, pp. 75-109.

[Wels98]     Welsh, M., et al., "Memory Management for User-Level Network Interfaces", IEEE Micro, Vol. 18, No. 2, April 1998, pp. 77-82.



Web Pages
[VIA98]      http://www.viaarch.org/
[INT98]      http://www.intel.com/


APPENDICES
Appendix 1. Source Code for the VIA_Server

 
Rain Technology.pptx
Rain Technology.pptxRain Technology.pptx
Rain Technology.pptx
 
Network Function Virtualization (NFV) BoF
Network Function Virtualization (NFV) BoFNetwork Function Virtualization (NFV) BoF
Network Function Virtualization (NFV) BoF
 
MidoNet Overview - OpenStack and SDN integration
MidoNet Overview - OpenStack and SDN integrationMidoNet Overview - OpenStack and SDN integration
MidoNet Overview - OpenStack and SDN integration
 
Multicloud as the Next Generation of Cloud Infrastructure
Multicloud as the Next Generation of Cloud Infrastructure Multicloud as the Next Generation of Cloud Infrastructure
Multicloud as the Next Generation of Cloud Infrastructure
 
Windows server 8 hyper v networking (aidan finn)
Windows server 8 hyper v networking (aidan finn)Windows server 8 hyper v networking (aidan finn)
Windows server 8 hyper v networking (aidan finn)
 

Más de Dr. Edwin Hernandez

Más de Dr. Edwin Hernandez (20)

Propuesta para la creación de un Centro de Innovación para la Refundación ...
Propuesta para la creación de un Centro de Innovación para la Refundación ...Propuesta para la creación de un Centro de Innovación para la Refundación ...
Propuesta para la creación de un Centro de Innovación para la Refundación ...
 
EGLA CORP - Honduras Abril 27 , 2024.pptx
EGLA CORP - Honduras Abril 27 , 2024.pptxEGLA CORP - Honduras Abril 27 , 2024.pptx
EGLA CORP - Honduras Abril 27 , 2024.pptx
 
MEVIA Platform for Music and Video
MEVIA Platform for Music and VideoMEVIA Platform for Music and Video
MEVIA Platform for Music and Video
 
Proposal NFT Metaverse Projects.pdf
Proposal NFT Metaverse Projects.pdfProposal NFT Metaverse Projects.pdf
Proposal NFT Metaverse Projects.pdf
 
Emulation MobileCAD
Emulation MobileCADEmulation MobileCAD
Emulation MobileCAD
 
EGLA NFT Offering
EGLA NFT OfferingEGLA NFT Offering
EGLA NFT Offering
 
Next Generation Spaces for Startups
Next Generation Spaces for Startups Next Generation Spaces for Startups
Next Generation Spaces for Startups
 
Analisis del Fraude Electoral en el 2017 - EGLA CORP
Analisis del Fraude Electoral en el 2017 - EGLA CORPAnalisis del Fraude Electoral en el 2017 - EGLA CORP
Analisis del Fraude Electoral en el 2017 - EGLA CORP
 
EGLAVATOR - Innovation, intellectual property services, and capital 2022 - 1
EGLAVATOR - Innovation, intellectual property services, and capital 2022 - 1EGLAVATOR - Innovation, intellectual property services, and capital 2022 - 1
EGLAVATOR - Innovation, intellectual property services, and capital 2022 - 1
 
MEVIA and Cloud to Cable TV Intellectual Property
MEVIA and Cloud to Cable TV Intellectual PropertyMEVIA and Cloud to Cable TV Intellectual Property
MEVIA and Cloud to Cable TV Intellectual Property
 
EGLAVATOR - Who are we?
EGLAVATOR - Who are we?EGLAVATOR - Who are we?
EGLAVATOR - Who are we?
 
Tips para mejorar ventas digitales
Tips para mejorar ventas digitalesTips para mejorar ventas digitales
Tips para mejorar ventas digitales
 
Securing 4G and LTE systems with Deep Learning and Virtualization
Securing 4G and LTE systems with Deep Learning and VirtualizationSecuring 4G and LTE systems with Deep Learning and Virtualization
Securing 4G and LTE systems with Deep Learning and Virtualization
 
EGLAVATOR by EGLA CORP
EGLAVATOR by EGLA CORPEGLAVATOR by EGLA CORP
EGLAVATOR by EGLA CORP
 
MEVIA - Technology Updates - 2020
MEVIA - Technology Updates -  2020MEVIA - Technology Updates -  2020
MEVIA - Technology Updates - 2020
 
MEVIA - Entertaiment and Cloud-based Solution for Yachts
MEVIA - Entertaiment and Cloud-based Solution for Yachts MEVIA - Entertaiment and Cloud-based Solution for Yachts
MEVIA - Entertaiment and Cloud-based Solution for Yachts
 
NextGENTV broadcasting with Cloud to Cable (ATSC 3.0) - Broadcasting to CABSAT
NextGENTV broadcasting with Cloud to Cable  (ATSC 3.0) - Broadcasting to CABSATNextGENTV broadcasting with Cloud to Cable  (ATSC 3.0) - Broadcasting to CABSAT
NextGENTV broadcasting with Cloud to Cable (ATSC 3.0) - Broadcasting to CABSAT
 
New Revenue Opportunities for Cloud Apps and Services with CloudtoCable
New Revenue Opportunities for Cloud Apps and Services with CloudtoCableNew Revenue Opportunities for Cloud Apps and Services with CloudtoCable
New Revenue Opportunities for Cloud Apps and Services with CloudtoCable
 
EGLA CORP: Innovation, Intellectual Property Services, and Capital
EGLA CORP:  Innovation, Intellectual Property Services, and CapitalEGLA CORP:  Innovation, Intellectual Property Services, and Capital
EGLA CORP: Innovation, Intellectual Property Services, and Capital
 
Music for Cable Music Service for Operators
Music for Cable   Music Service for OperatorsMusic for Cable   Music Service for Operators
Music for Cable Music Service for Operators
 

Via

done by [Davi93] and [Rama93], who describe general memory-management concepts as well as I/O-handling techniques. As shown in Section 5, all measurements and tests were performed on the Myrinet test-bed in the High-performance Computing and Simulation Research Lab (HCS Lab) at the University of Florida.

2. Background

The Virtual Interface Architecture is a new concept that fits naturally with the design of high-performance networks and clusters. VIA aims to boost performance by eliminating excessive data copying and by bypassing many of the software layers that traditional protocols traverse; these issues are explained in Section 2.1.
2.1. The Virtual Interface Architecture [VIA98]

VIA attacks the problem of the relatively low achievable performance of inter-process communication (IPC) within a cluster1. IPC performance is determined by the software overhead added during the send/receive operations on a message traversing the network. Each software layer crossed implies context switches, interrupts, and data copies at the layer boundaries. Although faster processor clocks help in processing the software layers, the clock rate is not the determining factor (cache misses carry large penalties, and layered software implies many branches). With the introduction of OC-3 ATM, network bandwidths have increased from 1 Mbps to 100-150 Mbps, with 1 Gbps backbones, but the "raw" bandwidth can almost never be achieved. With these two trends in mind, INTEL and the other companies developed the VIA, which can be described by:
• a User Agent
• a Kernel Agent
A user agent is the software layer using the architecture; it could be an application or a communication-services layer. The kernel agent is a driver running in protected (kernel) mode. It sets up the tables and structures that allow communication between cooperating processes. VIA accomplishes low latency in a message-passing environment by following these rules:
• Eliminate any intermediate copies of the data.
• Eliminate the need for a driver running in protected kernel mode to multiplex a hardware resource.
• Avoid traps into the operating system whenever possible, avoiding CPU context switches as well as cache thrashing.
• Remove the constraint of requiring an interrupt when initiating an I/O operation.
• Define a simple set of operations that send and receive data.
• Keep the architecture simple enough to be emulated in software.
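The "no kernel traps" rule can be made concrete with a small sketch. The following C++ fragment is purely illustrative and not VIA or VIPL code: the `SharedDescriptor`, `Doorbell`, and `user_level_send` names are invented here to model how a user process can initiate a send by writing a descriptor into memory shared with the NIC and "ringing a doorbell" register, instead of performing a system call for every message.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch of a user-level send path in the spirit of VIA.
// All names are invented for illustration; none come from the VIA spec.

struct SharedDescriptor {
    const void* data = nullptr;       // address of payload (pre-registered memory)
    uint32_t    length = 0;           // payload length in bytes
    std::atomic<bool> posted{false};  // ownership flag: user -> NIC
};

struct Doorbell {
    std::atomic<uint32_t> count{0};   // NIC polls this word; no interrupt, no trap
};

// Post a send without any system call: fill the descriptor, then ring.
void user_level_send(SharedDescriptor& d, Doorbell& bell,
                     const void* payload, uint32_t len) {
    d.data = payload;
    d.length = len;
    d.posted.store(true, std::memory_order_release);  // hand off to the NIC
    bell.count.fetch_add(1, std::memory_order_release);
}
```

The key point the sketch captures is that the entire send initiation is ordinary user-mode memory traffic; the kernel agent is only involved earlier, when the memory region is registered.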
VIA presents each process with the illusion that it owns the interface to the network. Each VI consists of one send queue and one receive queue, and is owned and maintained by a single process.

1 Cluster computing consists of short-distance, low-latency, high-bandwidth IPC between multiple building blocks. Cluster building blocks include servers, workstations, and I/O subsystems, all of which connect directly to a network.
A process can own many Virtual Interfaces (VIs), many processes can each own many VIs, and the kernel itself can also own a VI. A VI queue is formed by a linked list of variable-length descriptors. To add a descriptor to a queue, the user builds the descriptor and posts it onto the tail of the appropriate work queue. The same user pulls completed descriptors off the head of the work queue on which they were posted. The process that owns the queue can post four types of descriptors: send, remote-DMA/write, and remote-DMA/read descriptors are placed on the send queue of a VI, while receive descriptors are placed on the receive queue. VIA also provides polling and blocking mechanisms to synchronize the user process with completed operations. When descriptor processing completes, the NIC writes a done bit, together with any error bits associated with that descriptor, into the descriptor's status fields. This act transfers ownership of the descriptor from the NIC back to the process that originally posted it. Completion queues are an additional construct that allows completion notifications from multiple work queues to be coalesced into a single queue; the two work queues of one VI can be associated with completion queues independently of one another. The descriptors themselves are constructs that describe the work to be done by the network interface, very similar to the architecture proposed in [Davi93]. Send/receive descriptors contain one control segment and a variable number of data segments. Remote-DMA/write and remote-DMA/read descriptors contain one additional address segment following the control segment and preceding the data segments. The VIA also provides:
• Immediate data: access to a 32-bit immediate data field in a descriptor.
• Ordering: the order of descriptors is preserved in a FIFO queue. Consistency is easy to maintain for send/receive and remote-DMA/write, but remote-DMA/read is a round-trip transaction that is not complete until the requested data is returned from the remote node/endpoint.
• Work-queue scheduling: there is no implicit ordering relationship between descriptors on different VIs, so the scheduling service depends on the algorithm used by the NIC.
• Memory protection: ensures that a user process cannot send out of, or receive into, memory that it does not own.
• Virtual-address translation: performed when the kernel agent registers a memory region. On a user-agent request, the kernel agent performs ownership checks, pins the pages into physical memory, and sets up the virtual-to-physical address translation for the region.
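The work-queue mechanics described above can be sketched in a few lines. This is an illustrative model, not the VIPL API: the `Descriptor`, `WorkQueue`, and `nic_complete` names are invented here to show the post-to-tail, done-bit, pull-from-head ownership protocol.

```cpp
#include <cstdint>
#include <deque>

// Illustrative sketch of a VI work queue as a FIFO of descriptors.
// The user posts on the tail; the (simulated) NIC sets the done bit,
// plus any error bits, in the status field when it finishes, handing
// ownership back; the user then reaps completed descriptors from the head.

struct Descriptor {
    uint32_t status = 0;                        // bit 0 = done, bits 1.. = errors
    bool done() const { return status & 1u; }
};

struct WorkQueue {
    std::deque<Descriptor*> q;

    void post(Descriptor* d) { q.push_back(d); }  // user side: tail

    // NIC side (simulated): complete the oldest posted descriptor.
    void nic_complete(uint32_t error_bits = 0) {
        if (!q.empty()) q.front()->status = 1u | (error_bits << 1);
    }

    // User side: reap from the head only if the NIC marked it done.
    Descriptor* pull_completed() {
        if (q.empty() || !q.front()->done()) return nullptr;
        Descriptor* d = q.front();
        q.pop_front();
        return d;
    }
};
```

Note how ownership is expressed purely through the done bit: a posted-but-not-done descriptor belongs to the NIC and must not be touched by the user, which is exactly the rule the specification imposes.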
Figure 1. VI Architectural Model: the VI consumer (application plus VI user agent) runs in user mode and issues Send/Receive/RDMA-Read/RDMA-Write operations directly on the send and receive queues of its VIs; the VI kernel agent runs in kernel mode, and the VIs map onto the VI network adapter.

3. MODEL DESIGN

For this class project there was not enough time to build a hardware implementation of the VI on-chip using the Myrinet interface; however, it is possible to interact with the Myrinet card and produce an emulation of the Virtual Interface purely in software. Appendix 1 therefore contains the source code in C++, which sits on top of Myrinet, so the performance enhancements will not be as high as a hardware implementation could reach; otherwise, the model followed by this project remains the same, with some modifications. Note that RDMA transfers, as well as error handling, were left aside for this project; the only concerns in the design of the VI were:
• VI initialization and interaction with the VI
• Implementing the send and receive queues
• Implementing the completion queues
• Using the standard data types defined in the specification [VIA97]
• A small ECHO/REPLY server application for the performance tests
In addition, the software uses the Myrinet adapter in DMA mode for its transfers. At a very early stage this was not given much thought, but it turned out not to be a good performance "booster" and the measurements are quite low; this aspect is explained in Section 5.
The basic objects used are:
• Myrinet, which is in charge of handling send, receive, and initialization; it talks directly to the Myrinet card.
  - Init(): initializes the interface; in this case the route of the Myrinet DMA transfer must be changed. In other words, the interface first sends data, then has to be reinitialized to receive the reply from the server; the same happens at the server.
  - Send() and Recv(): post to and interact with the shmem* structure, receiving data from the Myrinet's SRAM and posting data into it.
• VI: the Virtual Interface object contains the send queue, the receive queue, and the completion queue. A description of the class members follows:
  - NIC: object referencing the instance of the interface in use, in this case Myrinet.
  - CQ, SendQ, RecvQ: the queues of descriptors, handled as List objects. The List object was also developed for this project and contains all the functions of a linked list.
  - SetupDescriptorSendRecv(): initializes a descriptor, for either send or receive, to or from the work queues. The descriptor created here has the data type VIP_DESCRIPTOR defined in the VIA specification.
  - ViPostSend(): posts the send descriptor to the queue; it does not transmit the data.
  - ViProcessSend(): pops the first descriptor from the queue and delivers the content pointed to by the descriptor (DS[0].Local.Data.Address) to the shmem->sendBuffer pointer. This is the real send.
  - ViPostRecv(): posts a receive descriptor into the receive queue, carrying the data addresses where the information will be stored. An application can also receive a descriptor from the other end, depending on the protocol used; for this basic application the receive descriptor is formed at the receiving peer.
  - ViProcessRecv(): reception is done through this method; like ViProcessSend(), it pops the first element in the receive queue and writes whatever is read from the NIC object into the destination address.
  - EchoServer(): member function that makes the endpoint act as a server, waiting for incoming data and replying with the same data.
  - EchoClient(): member function that sends a block of MTU (Maximum Transfer Unit) data to the other end, waits for a reply, and compares what was sent with the content of the received data.
VIPL.h is the most important of the libraries because it contains all the data types stated in the specification; it defines descriptors, responses, error handling, memory management, and some other VI
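The ECHO/REPLY control flow built from these members can be sketched as follows. This is a hedged stand-in, not the project's actual code: `FakeVI` is an invented loopback class whose methods merely mirror the descriptions of ViPostSend(), ViProcessSend(), and ViProcessRecv() above, with the server side collapsed into the loopback.

```cpp
#include <string>

// Hypothetical loopback stand-in for the VI object described in the text.
// The real code talks to shmem buffers on the Myrinet card; here a single
// string models the "wire" so the client-side flow can be shown end to end.
struct FakeVI {
    std::string pending;  // descriptor posted but not yet sent
    std::string wire;     // models shmem send/recv buffers

    void ViPostSend(const std::string& s) { pending = s; }  // post only
    void ViProcessSend() { wire = pending; }                // the real send
    std::string ViProcessRecv() { return wire; }            // the real recv
};

// Client: send one block, wait for the echo, compare payloads,
// mirroring what EchoClient() is described as doing.
bool echo_client_once(FakeVI& vi, const std::string& block) {
    vi.ViPostSend(block);   // build and post the send descriptor
    vi.ViProcessSend();     // deliver the data (DMA in the real code)
    // FakeVI loops the data straight back, standing in for EchoServer().
    std::string reply = vi.ViProcessRecv();
    return reply == block;  // EchoClient() verifies the payload
}
```

The split between ViPostSend() and ViProcessSend() is the important detail: posting a descriptor and moving the data are separate steps, which is what makes the later timing breakdown (Figure 5) possible.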
properties. However, it was not implemented exactly as stated there; it was modified to match the requirements of this project and the HCS Lab resources. As stated before, some aspects were left aside in the VI application implementation:
a) Threads and multi-threading. To keep the VI free of the vices of the other protocols, a library of light-weight threads is required; otherwise the overhead introduced by traditional thread libraries would distort the results.
b) Remote DMA reads and writes. There are two main reasons for leaving these aside: first, they require direct memory manipulation, which is not permitted without proper system-administrator rights; second, the standard is not entirely clear on how to achieve them.
c) Error handling was not implemented, so an error-free environment should be assumed for all results.

4. EXPERIMENTS

Experiments targeted three main areas: latency, throughput, and the time overhead attributable to the client and server of the application developed. They also follow the performance studies of [Berry97] and [Eick98]. In fact, the values gathered by those authors show much higher performance than the values gathered at the HCS Lab; the reason could be their VI implementation in hardware rather than a software emulation, and a better understanding of the Myrinet architecture in terms of modes of operation and how to improve data transfers. First of all, the SAN used consisted of two computers, viking and vigilante, both Sun Ultra-2s interconnected through a Myrinet switch version 1.0. Berry and his team used Pentium Pro 200 MHz machines on a PCI bus, and there is not much specification concerning the application they used. The set of experiments consisted of:
- Throughput on Myrinet with and without the VI
- Latency with and without the VI
- Time distribution at the client and server using the VI
The results and analysis are shown in Section 5.
5. RESULTS AND ANALYSIS

The mode of operation for the Myrinet adapter was DMA transfer. As shown in Figure 2, it was not the best option, but it fulfilled the requirement of an easy implementation.
Figure 2. Throughput measurements for Myrinet using different modes of operation (throughput vs. payload in bytes, comparing DMA transfers, mem_map transfers, and a TCP_STREAM test).

As shown there, the DMA transfer does not improve performance once the payload is 64 bytes long; moreover, a TCP_STREAM test run with netperf yields better performance. But the main goal of this paper is not to find a better mode of operation for Myrinet; it is to show that VIA is a sound concept that can be used in SANs. With this in mind, any result from here on would simply carry over, or scale by a constant performance factor, under a better mode of operation. The first measurement is the latency with and without the VI: Raw-Myrinet represents the application without the VI overhead (bulk data transfers), and VI-Myrinet represents the latency of the ECHO/REPLY round trip divided by two.

Figure 3. Latency measurements (in microseconds) for payloads of 32 to 8192 bytes, comparing raw Myrinet data transfers with the VI on top of Myrinet.
From Figure 3 it can be concluded that the increase in latency is roughly constant and no greater than 25%. If the results shown here are compared with those reported by Berry, the latencies differ by a ratio of about 4:1, with ours being the larger. However, one aspect taken into account neither by Berry nor by this project is that the VIA specification defines a Maximum Transfer Unit of 32 KB, while all measurements by the other researchers were made at payloads no greater than 8 KB.

Figure 4. Throughput measurements (MB/s) for payloads of 32 to 8192 bytes, with and without the VI on Myrinet.

In terms of throughput, performance drops by about 40% between the raw data transfer and the transfer done using the VI. This value was not expected and, unfortunately, there are no references to compare against: generally, VI performance is compared between a kernel-agent implementation and a VI emulation, not against raw transfer performance. In addition to the throughput, it is necessary to find where the performance bottleneck lies, in other words where that 40% is lost. To discover this, time stamping was added throughout the client and server application. Although the measurement could be made and compared at both client and server, the client is the more representative of the two peers.
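The quantities compared above are simple derivations from the raw timings: one-way latency is the ECHO/REPLY round trip divided by two, and the VI overhead is the relative increase over the raw-Myrinet figure. The sketch below shows these derivations; the sample values in the test are invented for illustration and are not measurements from the paper.

```cpp
// Sketch of how the reported quantities are derived from raw timings.
// Sample inputs used with these helpers are illustrative, not measured data.

// One-way latency from an ECHO/REPLY round trip.
double one_way_latency_us(double round_trip_us) {
    return round_trip_us / 2.0;
}

// Relative overhead of the VI path versus the raw path, in percent.
double overhead_percent(double raw_us, double vi_us) {
    return (vi_us - raw_us) / raw_us * 100.0;
}

// Throughput from payload size and one-way time.
// bytes per microsecond is numerically equal to MB/s (decimal).
double throughput_mb_s(double payload_bytes, double one_way_us) {
    return payload_bytes / one_way_us;
}
```

For example, a raw one-way latency of 400 µs against a VI latency of 500 µs corresponds to the 25% latency overhead quoted from Figure 3.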
For this reason, Figure 5 shows the distribution of time across every step of the application. In VIA the processing of descriptors is essentially negligible, and most of the time is spent in the data transfer and in the reception of the reply (waiting on the receive descriptor).

Figure 5. Time distribution of the ECHO/REPLY application at the client (MTU = 8192), across setting the send descriptor, posting the descriptor, processing the send, the Myrinet send (DMA), waiting on the receive descriptor, receiving the descriptor (CQ ready), and reading the data; the Myrinet send (about 39%) and the wait for the receive descriptor (about 50%) dominate.

This behavior is expected because the roughly 30% Myrinet-to-Myrinet transfer cost occurs at both ends: if about 12% is spent at each end, that represents approximately 24% of overhead, and of the 50% wait, which includes about 30% of server reply time, roughly 10% more remains, ending up at 30-35% of processing overhead. This processing overhead is constant, as shown in Figure 6, and improving the Myrinet-to-Myrinet transmission is the way to reach better performance levels.

Figure 6. Time spent processing descriptors (in microseconds) at client and server, for payloads of 32 to 8192 bytes.
This behavior (in Figure 6) is explained by the algorithm itself: a block is sent to the server from the client, the client waits for that block of data, the server copies the data pointed to by the descriptor and writes a send-queue entry with the same data, and the data is sent back to the client. In other words, only one descriptor is needed per send; all work queues handle only one element at a time.

6. CONCLUSIONS

First, a proof of concept has been achieved at the HCS Lab: the philosophy of VIA in a SAN can be applied, reducing the complexity of the OSI model and layered protocols. The implementation developed introduces an overhead of 10% for descriptor and data processing at both ends. For an echo/reply application the average overhead, taking raw Myrinet transmission using DMA as the reference, is 40%. The average latency added by the VIA to the application is at most 25%. It turns out that DMA transfer was not the best choice; another technique is recommended.

7. FUTURE RESEARCH

Further research could be done on the implementation: first, improve the Myrinet-to-Myrinet communication by using mem_map instead of DMA; implement error checking and remote DMA reads and writes; and make use of SCALE_Threads or another light-weight thread library with a multi-issue implementation. In addition, the completion, receive, and send queues could be changed from a simple FIFO to something more efficient, such as a hash table, which would have low processing overhead and could improve performance.

ACKNOWLEDGEMENTS

I'd like to thank Wade Cherry for his introduction to and explanation of the LANai and Myrinet applications. I'd also like to thank the team at INTEL for defining the VIPL.H library and providing the source code for free use through the Internet, along with their Visual C++ application, from which I gathered many ideas and finally understood the philosophy of VIA.
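The future-work suggestion of swapping the FIFO for a hash table can be illustrated briefly. This sketch is an assumption about what such a change might look like, not part of the project's code: the `DescriptorTable` name is invented, and the point shown is simply that completing a descriptor by identifier becomes an average O(1) lookup rather than a linear scan of a queue.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical hash-table replacement for a FIFO descriptor queue.
// Descriptors are tracked by an id, so the NIC (or a completion path)
// can mark any descriptor done without walking the queue.
struct DescriptorTable {
    std::unordered_map<uint64_t, uint32_t> status;  // id -> status bits

    void post(uint64_t id) { status[id] = 0; }

    void complete(uint64_t id) {                    // O(1) on average
        auto it = status.find(id);
        if (it != status.end()) it->second |= 1u;   // set the done bit
    }

    bool done(uint64_t id) const {
        auto it = status.find(id);
        return it != status.end() && (it->second & 1u);
    }
};
```

The trade-off, which the text hints at, is that a hash table gives up the FIFO's implicit ordering guarantee, so it suits workloads where descriptors may complete out of order.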
REFERENCES

[Banks93] Banks, D., Prudence, M., "A High-Performance Network Architecture for a PA-RISC Workstation", IEEE Journal on Selected Areas in Communications, Vol. 11, No. 2, February 1993, pp. 191-202.

[Berry97] Berry, F., Deleganes, E., "The Virtual Interface Architecture Proof-of-Concept Performance Results", Intel Corp. white paper.

[Davi93] Davie, B., "The Architecture and Implementation of a High-Speed Host Interface", IEEE Journal on Selected Areas in Communications, Vol. 11, No. 2, February 1993, pp. 228-239.

[Dunn98] Dunning, D., et al., "The Virtual Interface Architecture", IEEE Micro, Vol. 18, No. 2, April 1998, pp. 66-75.

[Eick98] von Eicken, T., Vogels, W., "Evolution of the Virtual Interface Architecture", IEEE Computer, November 1998, pp. 61-68.

[Rama93] Ramakrishnan, K., "Performance Considerations in Designing Network Interfaces", IEEE Journal on Selected Areas in Communications, Vol. 11, No. 2, February 1993, pp. 203-219.

[Rosu95] Rosu, M., "Processor Controller off-Processor I/O", Cornell University, Grant ARPA/ONR N00014-92-J-1866, August 1995.

[Steen97] Steenkiste, P., "A High-Speed Network Interface for Distributed-Memory Systems: Architecture and Applications", ACM Transactions on Computer Systems, Vol. 15, No. 1, February 1997, pp. 75-109.

[Wels98] Welsh, M., et al., "Memory Management for User-Level Network Interfaces", IEEE Micro, Vol. 18, No. 2, April 1998, pp. 77-82.

Web Pages

[VIA98] http://www.viaarch.org/
[INT98] http://www.intel.com/

APPENDICES

Appendix 1. Source code for the VIA_Server