End-node challenges with multigigabit networking
GARR – WS4
Roberto Innocente
inno@sissa.it
24 Jun 2002
Gilder's law
• Proposed by G. Gilder (1997): the total bandwidth of communication systems triples every 12 months.
• Compare with Moore's law (1974): the processing power of a microchip doubles every 18 months.
• Compare with memory performance, which increases about 10% per year.
(source: Cisco)
Optical networks
• Large fiber bundles multiply the fibers per route: 144, 432 and now 864 fibers per bundle.
• DWDM (Dense Wavelength Division Multiplexing) multiplies the bandwidth per fiber:
  – Nortel announced 80 x 80 Gb/s per fiber (6.4 Tb/s)
  – Alcatel announced 128 x 40 Gb/s per fiber (5.1 Tb/s)
10 GbE /1
• IEEE 802.3ae 10GbE ratified on June 17, 2002.
• Optical media ONLY: MMF 50µ/400MHz 66 m, new 50µ/2000MHz 300 m, 62.5µ/160MHz 26 m, WWDM 4x 62.5µ/160MHz 300 m, SMF 10 km/40 km.
• Full duplex only (for 1 GbE a CSMA/CD mode of operation was still specified, although no one implemented it).
• LAN/WAN PHYs differ: 10.3 Gbaud vs 9.95 Gbaud (SONET).
• 802.3 frame format (including min/max size).
10GbE /2
[Figure: from Bottor (Nortel Networks): the marriage of Ethernet and DWDM.]
Challenges
• Hardware
• Software
• Network protocols
Standard data flow /1
[Diagram: processors on the processor bus, memory on the memory bus and the NIC on the PCI I/O bus, joined by the North Bridge; arrows 1-3 mark the three copies (user memory to kernel buffer, kernel buffer to NIC memory, NIC memory to network) listed on the next slide.]
Standard data flow /2
• Copies:
  1. Copy from user space to kernel buffer
  2. DMA copy from kernel buffer to NIC memory
  3. DMA copy from NIC memory to network
• Loads:
  – 2x on the processor bus
  – 3x on the memory bus
  – 2x on the NIC memory
Hardware challenges
• Processor bus
• Memory bus
• I/O bus
• Network technologies
Processor / Memory bus
• Processor bus:
  – current technology 1-2 GB/s
  – Athlon/Alpha are point-to-point, IA32 is multi-point
  – e.g. IA32 P4: 4 x 100 MHz x 8 bytes = 3.2 GB/s peak, but with many cycles of overhead for each cache line transferred
• Memory bus:
  – current technology 1-2 GB/s
  – DDR 333, 8 bytes wide = 2.6 GB/s theoretical peak
  – RAMBUS 800 MHz x 2 bytes x 2 channels = 3.2 GB/s theoretical peak
  – setup time should be taken into account
Memory b/w
[Chart: measured memory bandwidth of contemporary platforms.]
PCI Bus
• PCI32/33: 4 bytes @ 33 MHz = 132 MB/s (on i440BX, ...)
• PCI64/33: 8 bytes @ 33 MHz = 264 MB/s
• PCI64/66: 8 bytes @ 66 MHz = 528 MB/s (on i840)
• PCI-X: 8 bytes @ 133 MHz = 1064 MB/s
PCI-X implements split transactions. (The arithmetic behind these peaks is spelled out below.)
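A few lines of C reproduce the arithmetic (theoretical peak = bus width in bytes x clock, using the nominal 33/66/133 MHz figures); this is only the worked calculation behind the list above, not a measurement:

/* bus_peak.c - theoretical PCI peak bandwidths from width x clock. */
#include <stdio.h>

static void peak(const char *name, double bytes, double mhz)
{
    printf("%-9s %5.0f MB/s\n", name, bytes * mhz);
}

int main(void)
{
    peak("PCI32/33", 4,  33);   /* 132 MB/s  */
    peak("PCI64/33", 8,  33);   /* 264 MB/s  */
    peak("PCI64/66", 8,  66);   /* 528 MB/s  */
    peak("PCI-X",    8, 133);   /* 1064 MB/s */
    return 0;
}

As the table on the next slide shows, sustained DMA rates on real chipsets fall well short of these theoretical peaks.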
PCI performance with common chipsets

Write MB/s  Read MB/s  Chipset
   478         418     Supermicro P4D6, P4 Xeon, Intel e7500 chipset
   299         215     P4 Xeon, Intel 860 chipset
   311         299     AMD dual, AMD760MPX chipset
   228         183     API UP2000 alpha, Tsunami chipset
   353         299     Supermicro 370DEI, PIII, Serverset III HE
   379         353     Intel 460GX (Itanium)
   464         387     Alpha ES45, Titan chipset
   486         432     Supermicro 370DLE, dual PIII, Serverworks Serverset III LE
I/O busses
• PCI/PCI-X available, DDR/QDR under development
• Infiniband (HP, Compaq, IBM, Intel): available
  – PCI replacement, memory connected!
  – point-to-point, switch based
  – 2.5 Gb/s per wire full duplex, widths 1x, 4x, 12x
• 3rd Generation I/O (HP, Compaq, IBM, ...)
  – PCI replacement
  – point-to-point, switched
  – 2.5 Gb/s per wire full duplex, widths 1x, 2x, 4x, 8x, 12x ... 32x
• HyperTransport (AMD, ...)
  – PCI and SMP interconnect replacement
  – point-to-point, switched, synchronous
  – 800 Mb/s - 2 Gb/s per wire full duplex, widths 2x, 4x ... 32x
Network technologies
• Extended frames:
  – Alteon Jumbograms (MTU up to 9000)
• Interrupt suppression (coalescing)
• Checksum offloading
(from Gallatin 1999)
Software challenges
• ULNI (User Level Network Interface), OS-bypass
• Zero copy
• Network path pipelining
Software overhead
[Chart: total time for 10 Mb/s Ethernet, 100 Mb/s Ethernet and 1 Gb/s, split into software overhead and transmission time. Being a constant, the software overhead is becoming more and more important!]
ULNI / OS-bypass
• Traditional networking can be defined as in-kernel networking: each network operation (read/write) invokes a kernel call (trap).
• This is frequently too expensive for HPN (high performance networks).
• Therefore, in recent years, User Level Network Interface (ULNI) / OS-bypass schemes have had a wide diffusion in the SAN environment.
• These schemes avoid kernel intervention for read/write operations: the user process directly reads from/writes to the NIC memory using special user-space libraries (see the sketch below). Examples: U-Net, Illinois Fast Messages, ...
• VIA (Virtual Interface Architecture) is a proposed ULNI industry standard backed by Intel, Compaq et al.; VIA over IP is an internet draft.
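A minimal sketch of what OS-bypass looks like in code. The descriptor format, ring size and doorbell register here are invented for illustration (a real ULNI such as GM or VIA defines its own layout, obtained once by mapping the NIC into user space); the point is that the send path is a couple of stores to mapped memory, with no trap. The "NIC" below is plain memory so the sketch runs standalone:

/* ulni_send.c - hedged sketch of a user-level (OS-bypass) send path. */
#include <stdio.h>
#include <stdint.h>

struct tx_desc {                 /* hypothetical descriptor format */
    uint64_t buf_addr;           /* DMA address of a pinned user buffer */
    uint32_t len;
    uint32_t valid;
};

struct nic {                     /* what mmap() of the NIC would yield */
    volatile struct tx_desc ring[16];
    volatile uint32_t doorbell;  /* MMIO register: "descriptors ready" */
    uint32_t head;
};

/* Post a send with no system call: fill a descriptor the NIC can see
 * and ring the doorbell with a single store. The kernel was involved
 * only once, at mapping/registration time. */
static void ulni_send(struct nic *n, uint64_t dma_addr, uint32_t len)
{
    volatile struct tx_desc *d = &n->ring[n->head % 16];
    d->buf_addr = dma_addr;
    d->len      = len;
    d->valid    = 1;
    n->head++;
    n->doorbell = n->head;       /* the NIC DMA engine takes over here */
}

int main(void)
{
    static struct nic fake_nic;  /* stand-in for the mapped device */
    ulni_send(&fake_nic, 0x1000, 1500);
    printf("posted descriptor: doorbell=%u len=%u\n",
           fake_nic.doorbell, fake_nic.ring[0].len);
    return 0;
}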
Standard TCP / OS-bypass
[Chart: throughput of standard TCP vs an OS-bypass ULNI (gm) on the i840 chipset.]
Advanced networking
• Traditional UNIX I/O calls have implicit copy semantics. It is possible to diverge from this by creating a new API that avoids the requirement.
• Zero-copy can be obtained in different ways:
  – user/kernel page remapping (Trapeze BSD drivers): performance strongly depends on the particular hardware; usually a Copy-On-Write flag preserves the copy semantics
  – fbufs (fast buffers, Druschel 1993): shared buffers
Zero-copy data flow
[Diagram: the same node as in "Standard data flow /1", but with only two copies: 1. DMA directly from user memory to NIC memory, 2. DMA from NIC memory to the network. The user-to-kernel copy disappears.]
Trapeze driver for Myrinet2000 on FreeBSD
• zero-copy TCP by page remapping and copy-on-write
• split header/payload
• gather/scatter DMA
• checksum offload to NIC
• adaptive message pipelining
• 220 MB/s (1760 Mb/s) on Myrinet2000 and a dual-processor Dell Pentium III
The application doesn't touch the data.
(from Chase, Gallatin, Yocum 2001)
Fbufs (Druschel 1993)
API (copy-free; a usage sketch follows the slide):
  uf_read(int fd, void **bufp, size_t nbytes)
  uf_get(int fd, void **bufp)
  uf_write(int fd, void *bufp, size_t nbytes)
  uf_allocate(size_t size)
  uf_deallocate(void *ptr, size_t size)
• Shared buffers between user processes and kernel
• It is a general scheme that can be used for either network or I/O operations
• Specific API without copy semantics
• A buffer pool specific to a process is created as part of process initialization
• This buffer pool can be pre-mapped in both the user and kernel address space
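A hedged sketch of how this API is meant to be used. The uf_* signatures are those listed above, but the return types and error conventions are assumptions, and the stubs merely emulate the interface with ordinary read()/write() so the example runs standalone; in a real fbufs system the buffers arrive pre-mapped from the shared pool and only ownership changes hands:

/* fbuf_use.c - forwarding loop in the fbufs style: exchange buffers,
 * never copy payload in user space. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *uf_allocate(size_t size) { return malloc(size); }
static void  uf_deallocate(void *p, size_t size) { (void)size; free(p); }

static ssize_t uf_read(int fd, void **bufp, size_t nbytes)
{   /* real fbufs: the kernel hands over a pre-mapped buffer, no copy */
    *bufp = uf_allocate(nbytes);
    return read(fd, *bufp, nbytes);
}

static ssize_t uf_write(int fd, void *buf, size_t nbytes)
{   /* real fbufs: ownership passes back to the kernel, no copy */
    ssize_t n = write(fd, buf, nbytes);
    uf_deallocate(buf, nbytes);
    return n;
}

int main(void)      /* forward stdin -> stdout without touching the data */
{
    void *buf;
    ssize_t n;
    while ((n = uf_read(0, &buf, 4096)) > 0)
        if (uf_write(1, buf, (size_t)n) != n)
            return 1;
    return n < 0;
}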
Gigabit Ethernet TCP
[Chart: Gigabit Ethernet TCP throughput.]
TOE (TCP Off-loading Engine)
[Figure: TCP processing offloaded from the host stack to the NIC.]
Network Path Pipelining
• It is frequently believed that the overhead of network operations decreases as the frame size increases, so that in principle an infinite frame would give the best performance.
• This is in fact false (e.g. Prilly 1999) for lightweight protocols such as Myrinet and QSnet (it doesn't apply to protocols like TCP that require a checksum header).
• Myrinet doesn't have a maximum frame size, but to use the network path efficiently all the entities along the path (sending CPU, sending NIC DMA, sending interface, receiving interface, ...) must be able to work at the same time on different segments (pipelining). Long frames inhibit this co-working (see the toy model below).
• Finding the best frame size is a linear programming exercise: it differs for different packet sizes.
• On PCI the best frame size is frequently the maximum transfer size allowed by the MLT (maximum latency timer) of the device: 256 or 512 bytes.
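A toy model, not the linear-programming formulation above, but it shows the trade-off: with S store-and-forward stages, small segments buy overlap at the price of per-segment overhead. All constants are invented for illustration:

/* pipeline_model.c - toy model of network-path pipelining.
 * A message of M bytes crosses S stages; each stage spends a fixed
 * overhead O_SEG per segment plus T_BYTE per byte. With segment size s
 * there are n ~ ceil(M/s) segments and, with perfect pipelining,
 * total time ~ (n + S - 1) * (O_SEG + s * T_BYTE). */
#include <stdio.h>

#define M       65536.0   /* message size, bytes */
#define STAGES  4.0       /* CPU, send DMA, wire, receive DMA */
#define O_SEG   1.0e-6    /* per-segment overhead per stage, seconds */
#define T_BYTE  1.0e-9    /* per-byte time per stage, seconds */

int main(void)
{
    double best_s = 0, best_t = 1e9;
    for (double s = 64; s <= M; s *= 2) {
        double n = (M + s - 1) / s;              /* ~ number of segments */
        double t = (n + STAGES - 1) * (O_SEG + s * T_BYTE);
        printf("segment %7.0f B -> total %.1f us\n", s, t * 1e6);
        if (t < best_t) { best_t = t; best_s = s; }
    }
    printf("best segment size under this model: %.0f bytes\n", best_s);
    return 0;
}

The optimum balances per-segment overhead against lost overlap, which is why a moderate segment size beats both tiny and maximal frames.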
Network protocols challenges
• Active Queue Management (AQM): FQ, WFQ, CSFQ, RED
• ECN
• MPLS
• TCP congestion avoidance/control
• TCP friendly streams
• IP RDMA protocols
AQM (active queue management)
• RFC 2309, Braden et al.: Recommendations on Queue Management and Congestion Avoidance in the Internet (Apr 1998)
• "Internet meltdown" / "congestion collapse" was first experienced in the mid-'80s
• The first solution was V. Jacobson's congestion avoidance mechanism for TCP (from 1986/88 on)
• Anyway, there is a limit to how much control can be accomplished from the edges of the network
AQM /2
• Queue disciplines:
  – tail drop: a router sets a maximum length for each queue; when the queue is full, newly arriving packets are dropped until the queue shrinks because a queued packet has been transmitted
  – head drop
  – random drop
AQM /3
Problems of the tail-drop discipline:
• Lock-out:
  – in some situations one or a few connections monopolize queue space; this phenomenon is frequently the result of synchronization
• Full queues:
  – queues are allowed to remain full for long periods of time, while it is important to reduce the steady-state queue size (to guarantee short delays for applications that require them)
Fair Queuing
• Traditional routers route packets independently: there is no memory, no state in the network.
• Demers, Keshav, Shenker (1990): Analysis and Simulation of a Fair Queueing Algorithm. Incoming traffic is separated into flows, and each flow gets an equal share of the available bandwidth. It approximates sending 1 bit in turn for each ongoing flow.
• Statistical FQ (hashing flows)
[Diagram: traditional queuing keeps one queue per output line; fair queuing keeps one queue per flow in front of each output line.]
Weighted Fair Queuing (WFQ)
This queuing algorithm was developed to serve real-time applications with guaranteed bandwidth.
• L. Zhang (1990): Virtual Clock: A New Traffic Control Algorithm for Packet Switching Networks
Traffic is classified, and each class is assigned a percentage of the available bandwidth. Packets are sent to different queues according to their class; each packet in a queue is labelled with a calculated finish time, and packets are transmitted in finish-time order. (A sketch of the finish-time bookkeeping follows.)
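A hedged sketch of the finish-time labelling; the shares, packet sizes and the simplification of virtual time to real time are illustrative choices, not the full algorithm from the paper:

/* wfq_sketch.c - finish-time labelling in the virtual-clock/WFQ style.
 * Each packet of a class gets F = max(F_prev, now) + len / rate_share;
 * packets would then be transmitted in increasing F order. */
#include <stdio.h>

#define LINK_BPS 1.0e9

struct class_state {
    double share;    /* fraction of the link assigned to this class */
    double finish;   /* finish time of the class's previous packet */
};

static double label(struct class_state *c, double now, double len_bits)
{
    double start = (c->finish > now) ? c->finish : now;
    c->finish = start + len_bits / (c->share * LINK_BPS);
    return c->finish;
}

int main(void)
{
    struct class_state a = { .share = 0.75, .finish = 0 };
    struct class_state b = { .share = 0.25, .finish = 0 };
    /* two back-to-back 1500-byte packets per class, all arriving at t=0 */
    for (int i = 0; i < 2; i++) {
        printf("class A pkt %d: finish %6.2f us\n", i, label(&a, 0, 1500 * 8) * 1e6);
        printf("class B pkt %d: finish %6.2f us\n", i, label(&b, 0, 1500 * 8) * 1e6);
    }
    return 0;
}

The 25%-share class accumulates finish times three times faster, so under load it is served proportionally less often: exactly the guaranteed-share behaviour described above.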
CSFQ (Core Stateless Fair Queuing)
• Only ingress/egress routers perform packet classification, per-flow buffer management and scheduling
• Edge routers estimate the incoming rate of each flow and use it to label each packet
• Core routers are stateless
• All routers from time to time compute the fair rate on outgoing links
• This approximates FQ
(from Stoica, Shenker, Zhang 1998)
RED /1 (Random Early Detection)
• Floyd, V. Jacobson 1993
• Detects incipient congestion and drops packets with a probability that depends on the queue length. The drop probability increases linearly from 0 to maxp as the average queue length grows from minth to maxth; when the queue length goes above maxth (maximum threshold), all packets are dropped. (A sketch of this rule follows.)
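A sketch of the drop rule, assuming illustrative thresholds and omitting the count-based correction of the full RED algorithm:

/* red_sketch.c - EWMA queue estimate plus a drop probability rising
 * linearly from 0 at MINTH to MAXP at MAXTH, forced drop above MAXTH. */
#include <stdio.h>
#include <stdlib.h>

#define MINTH  5.0     /* packets */
#define MAXTH 15.0
#define MAXP   0.1
#define WQ     0.002   /* EWMA weight */

static double avg = 0; /* average queue length */

static int red_drop(double instant_qlen)
{
    avg = (1 - WQ) * avg + WQ * instant_qlen;
    if (avg < MINTH) return 0;                       /* no drop */
    if (avg >= MAXTH) return 1;                      /* forced drop */
    double p = MAXP * (avg - MINTH) / (MAXTH - MINTH);
    return ((double)rand() / RAND_MAX) < p;          /* probabilistic drop */
}

int main(void)
{
    int drops = 0;
    for (int i = 0; i < 100000; i++)     /* a persistently long queue */
        drops += red_drop(14.0);
    printf("early drops: %d of 100000 arrivals (avg qlen %.2f)\n", drops, avg);
    return 0;
}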
RED /2
• It is now widely deployed
• There are very good performance reports
• Still an active area of research for possible modifications
• Cisco implements its own WRED (Weighted Random Early Detection), which discards packets based also on precedence (it provides separate thresholds and weights for different IP precedences)
RED /3
[Chart: (from Van Jacobson, 1998) traffic on a busy E1 (2 Mbit/s) Internet link. RED was turned on at 10:00 of the 2nd day; from then on, utilization rose to 100% and remained steady.]
ECN (Explicit Congestion Notification) /1
• RFC 3168, Ramakrishnan, Floyd, Black: The Addition of Explicit Congestion Notification (ECN) to IP (Sep 2001)
• Uses 2 bits of the IPv4 ToS byte (now, and in IPv6, redefined around the DiffServ codepoint). These 2 bits encode the states:
  – ECN-Capable Transport: ECT(0) and ECT(1)
  – Congestion Experienced (CE)
• If the transport is TCP, it uses 2 bits of the TCP header, next to the Urgent flag:
  – ECN-Echo (ECE), set in the first ACK after receiving a CE packet
  – Congestion Window Reduced (CWR), set in the first packet after having reduced cwnd in response to an ECE packet
• With TCP, ECN is negotiated by sending a SYN packet with ECE and CWR on, and receiving a SYN+ACK packet with ECE on
ECN /2
How it works (see the sketch below):
• senders set ECT(0) or ECT(1) to indicate that the end nodes of the transport protocol are ECN capable
• a router experiencing congestion sets the 2 bits of ECT packets to the CE state (instead of dropping the packet)
• the receiver of the packet signals the condition back to the other end
• the transports behave as in the case of a packet drop (no more than once per RTT)
Linux adopted it in 2.4 ... many connection troubles followed (the default is now off; it can be turned on/off via /proc/sys/net/ipv4/tcp_ecn). Major sites are using firewalls or load balancers that refuse connections from ECT hosts (Cisco PIX and LocalDirector, etc.).
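A small sketch of the IP-level bit manipulation. The codepoint values are those of RFC 3168; the ToS value is an arbitrary example:

/* ecn_bits.c - the ECN field in the former IPv4 ToS byte: the two
 * low-order bits hold Not-ECT, ECT(1), ECT(0) or CE. */
#include <stdio.h>
#include <stdint.h>

#define ECN_MASK  0x03
#define NOT_ECT   0x00
#define ECT_1     0x01
#define ECT_0     0x02
#define CE        0x03

/* Sender: declare ECN capability on an outgoing packet. */
static uint8_t set_ect0(uint8_t tos) { return (tos & ~ECN_MASK) | ECT_0; }

/* Congested router: mark instead of dropping, but only if the packet
 * declared an ECN-capable transport; otherwise it must drop as usual. */
static uint8_t mark_ce(uint8_t tos)
{
    if ((tos & ECN_MASK) == NOT_ECT)
        return tos;
    return (tos & ~ECN_MASK) | CE;
}

int main(void)
{
    uint8_t tos = 0x28;              /* some DiffServ codepoint, Not-ECT */
    tos = set_ect0(tos);
    printf("after sender: 0x%02x\n", tos);
    tos = mark_ce(tos);              /* router experiences congestion */
    printf("after router: 0x%02x (CE=%d)\n", tos, (tos & ECN_MASK) == CE);
    return 0;
}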
MPLS (Multi Protocol Label Switching)
This technology is a marriage of IP with ATM-style traffic engineering:
• Special routers called LERs (Label Edge Routers) mark packets entering the net with a label that indicates a class of service (FEC, Forwarding Equivalence Class) or priority: a 32-bit header (the shim header) is inserted between the OSI level-2 header and the upper-level headers (after the Ethernet header, before the IP header): 20 bits for the label, 3 experimental bits, 1 bit for the stack function and 8 bits for the TTL (see the pack/unpack sketch below)
• The IP packet becomes an MPLS packet
• Internal routers work as LSRs (Label Switch Routers) for MPLS packets: they look up and follow the label instructions (usually swapping the label)
• Between MPLS-aware devices, LSPs (Label Switch Paths) are established and engineered for their traffic characteristics
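A sketch of packing and unpacking the shim header laid out above; the label, exp, S and TTL values are arbitrary examples:

/* mpls_shim.c - the 32-bit MPLS shim header: 20-bit label, 3
 * experimental bits, 1 bottom-of-stack bit, 8-bit TTL. */
#include <stdio.h>
#include <stdint.h>

static uint32_t shim_pack(uint32_t label, uint8_t exp, uint8_t s, uint8_t ttl)
{
    return ((label & 0xFFFFFu) << 12) |   /* bits 31..12: label */
           ((exp   & 0x7u)     <<  9) |   /* bits 11..9 : exp   */
           ((s     & 0x1u)     <<  8) |   /* bit  8     : stack */
            (uint32_t)ttl;                /* bits 7..0  : TTL   */
}

int main(void)
{
    uint32_t shim = shim_pack(18, 0, 1, 64);
    printf("shim = 0x%08x\n", shim);
    printf("label=%u exp=%u s=%u ttl=%u\n",
           (shim >> 12) & 0xFFFFFu, (shim >> 9) & 0x7u,
           (shim >> 8) & 0x1u, shim & 0xFFu);
    return 0;
}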
Congestion control in TCP /1
The sending rate is limited by a congestion window (cwnd): the maximum number of packets that may be sent in one RTT (round-trip time). Phases (sketched below):
1. Slow start: cwnd is increased exponentially (the number of packets acknowledged in an RTT is added to cwnd) up to a threshold ssthresh, or until a loss
2. Congestion avoidance: AIMD (Additive Increase, Multiplicative Decrease) phase. cwnd is increased by 1 each RTT; when there is a loss, cwnd is halved.
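A sketch of the two phases, counted in whole segments and driven per RTT rather than per ACK; the single loss is injected for illustration:

/* cwnd_sketch.c - slow start, then AIMD congestion avoidance. */
#include <stdio.h>

int main(void)
{
    double cwnd = 1, ssthresh = 64;
    for (int rtt = 1; rtt <= 20; rtt++) {
        int loss = (rtt == 12);          /* injected loss */
        if (loss) {
            ssthresh = cwnd / 2;         /* multiplicative decrease */
            cwnd = ssthresh;             /* Reno-style halving (no timeout) */
        } else if (cwnd < ssthresh) {
            cwnd *= 2;                   /* slow start: exponential growth */
        } else {
            cwnd += 1;                   /* congestion avoidance: +1 per RTT */
        }
        printf("rtt %2d: cwnd %6.1f ssthresh %6.1f%s\n",
               rtt, cwnd, ssthresh, loss ? "  <- loss" : "");
    }
    return 0;
}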
Congestion control in TCP /2
[Chart: cwnd evolution over time, from Balakrishnan 98.]
AIMD / AIPD congestion control
• AIMD (Additive Increase Multiplicative Decrease):
  – W = W + a (no loss)
  – W = W * (1 - b) (L > 0 losses)
  It achieves a bandwidth ∝ 1/sqrt(L)
• AIPD (Additive Increase Proportional Decrease):
  – W = W + a (no loss)
  – W = W * (1 - b*L) (L > 0 losses)
  It achieves a bandwidth ∝ 1/L
(ns2 plot of 3 flows; source Lee 2001)
TCP stacks
• Tahoe (4.3BSD, Net/1): implements Van Jacobson's slow start / congestion avoidance and fast retransmit algorithms
• Reno (1990, Net/2): fast recovery, TCP header prediction
• Net/3 (4.4BSD, 1994): multicasting, LFN (Long Fat Networks) mods
• NewReno: fast retransmit even with multiple losses (Hoe 1997)
• Vegas: experimental stack (Brakmo, Peterson 1995)
TCP Reno
• It is the most widely referenced implementation; the basic mechanisms are the same in Net/3 and NewReno
• It is biased against connections with long delays (large RTT): for them the congestion window increases slowly, so they obtain less average bandwidth
LFN Optimal Bandwidth and TCP Reno
• Consider a 1 Gb/s WAN connection with an RTT of 100 ms (BW x RTT ≈ 12 MB)
• The optimal congestion window would be about 8,000 segments (of 1.5 KB)
• When there is a loss, cwnd is cut to 4,000 segments
• Growing back by 1 segment per RTT, about 4,000 RTTs = 400 seconds (almost 7 minutes!) are needed to recover the optimal bandwidth. This is awful if the network is not actually congested! (The arithmetic is reproduced below.)
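The arithmetic of this example, spelled out; the numbers are those of the slide, and the halved window is won back one segment per RTT:

/* lfn_recovery.c - time for Reno's additive increase to win back
 * half of an LFN congestion window. */
#include <stdio.h>

int main(void)
{
    double bw_bps = 1e9, rtt_s = 0.1, mss = 1500;
    double bdp_segments = bw_bps / 8 * rtt_s / mss;   /* optimal cwnd */
    double lost = bdp_segments / 2;                   /* halved on loss */
    double recovery_s = lost * rtt_s;                 /* +1 segment per RTT */
    printf("optimal cwnd : %.0f segments\n", bdp_segments);
    printf("after a loss : %.0f segments\n", bdp_segments - lost);
    printf("recovery time: %.0f s (%.1f minutes)\n",
           recovery_s, recovery_s / 60);
    return 0;
}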
TCP Vegas
• Approaches congestion but tries to avoid it: it estimates the available bandwidth by looking at the variation of delays between ACKs (CARD: Congestion Avoidance by RTT Delays)
• It has been shown to utilize the available bandwidth better (30% better than Reno)
• It is fair to long-delay connections
• Anyway, when mixed with TCP Reno it gets an unfair share (50% less) of the bandwidth
Restart of idle connections
• The suggested behaviour after an idle period (more than an RTO without exchanges) is to reset the congestion window to its initial value (usually cwnd = 1) and apply slow start
• This can have a very adverse effect on applications using e.g. MPI-G
• Rate-Based Pacing (RBP): Improving Restart of Idle TCP Connections, Visweswaraiah, Heidemann (1997) suggests reusing the previous cwnd, but smoothly pacing packets out rather than bursting them, using a Vegas-like algorithm to estimate the rate
TCP buffer tuning and parallel streaming
• The optimal TCP flow control window is the bandwidth-delay product of the link. As network speeds have increased, operating systems have raised the default buffer size from 8 KB to 64 KB; this is still far too small for HPN. An OC12 link with a 50 ms RTT requires at least 3.75 MB of buffers! There is a lot of research on mechanisms to automatically set an optimal flow control window and buffer size; some examples (a buffer-sizing sketch follows the list):
  – Auto-tuning (Semke, Mahdavi, Mathis 1998)
  – Dynamic Right-Sizing (DRS): Weigle, Feng 2001, Dynamic Right-Sizing: A Simulation Study
  – Enable (Tierney et al., LBNL 2001): a database of BW-delay products
  – Linux 2.4 autotuning/connection caching: controlled by the new kernel variables net.ipv4.tcp_wmem/tcp_rmem; the advertised receive window starts small and grows with each segment from the transmitter; TCP control information for a destination (cwnd, rtt, rttvar, ssthresh) is cached for 10 minutes
• University of Illinois psocket library to support parallel sockets
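A sketch of sizing socket buffers to the bandwidth-delay product. setsockopt()/getsockopt() with SO_RCVBUF is the standard sockets API; the OC12 numbers are from the slide, and the kernel may clamp the request to its configured maximum (e.g. net.core.rmem_max on Linux), so the granted value is read back:

/* bdp_buffers.c - request BDP-sized TCP receive buffers. */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>

int main(void)
{
    double bw_bps = 622e6, rtt_s = 0.050;             /* OC12, 50 ms RTT */
    int bdp = (int)(bw_bps / 8 * rtt_s);              /* bytes in flight */
    printf("BDP = %d bytes (%.2f MB)\n", bdp, bdp / 1048576.0);

    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) { perror("socket"); return 1; }
    if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bdp, sizeof bdp) < 0)
        perror("setsockopt");
    int got; socklen_t len = sizeof got;
    getsockopt(s, SOL_SOCKET, SO_RCVBUF, &got, &len); /* what we really got */
    printf("kernel granted %d bytes\n", got);
    close(s);
    return 0;
}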
TCP friendly streams
• It is essential to reduce the number of streams unresponsive to network congestion, to avoid congestion collapses
• It is important to develop congestion algorithms that are fair when mixed with current TCP, because TCP constitutes the vast majority of traffic
• TCP-friendly streams are streams that exhibit a behavior similar to a TCP stream in the same conditions: upon congestion, if L is the loss rate, their bandwidth is not more than 1.3*MTU/(RTT*sqrt(L))
• Unfortunately, real-time applications suffer too much from the TCP congestion-window halving. For this reason a wider class of congestion control algorithms has been investigated. These are called binomial algorithms, and they update their window according to (for TCP, k = 0, l = 1):
  – Increase: W = W + a/(W^k)
  – Decrease: W = W - b*(W^l)
  For k + l = 1 these algorithms are TCP friendly (Bansal, Balakrishnan 2000). A sketch follows.
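A sketch of the binomial update. k = 0, l = 1 reproduces TCP's AIMD; k = l = 0.5 is the SQRT variant, which satisfies k + l = 1 and is therefore TCP friendly per Bansal and Balakrishnan. The constants a, b and the injected loss are illustrative:

/* binomial_cc.c - binomial window update, AIMD vs SQRT variant.
 * Compile with -lm. */
#include <stdio.h>
#include <math.h>

static double update(double w, double k, double l, double a, double b, int loss)
{
    if (loss)
        return w - b * pow(w, l);     /* decrease on loss */
    return w + a / pow(w, k);         /* increase per RTT otherwise */
}

int main(void)
{
    double aimd = 20, sqrt_cc = 20;
    for (int rtt = 1; rtt <= 10; rtt++) {
        int loss = (rtt == 5);        /* injected loss */
        aimd    = update(aimd,    0.0, 1.0, 1.0, 0.5, loss);  /* TCP: k=0,l=1 */
        sqrt_cc = update(sqrt_cc, 0.5, 0.5, 1.0, 0.5, loss);  /* k=l=0.5 */
        printf("rtt %2d%s  AIMD %6.2f  SQRT %6.2f\n",
               rtt, loss ? "*" : " ", aimd, sqrt_cc);
    }
    return 0;
}

The SQRT flow backs off much less on a loss, which is why it suits real-time streams, while k + l = 1 keeps its long-run throughput comparable to TCP's.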
iWarp
• iWarp is an initial work on RDMA (Remote DMA). Internet drafts:
  – The Architecture of Direct Data Placement (DDP) and Remote Direct Memory Access (RDMA) on Internet Protocols, S. Bailey, February 2002
  – The Remote Direct Memory Access Protocol (iWarp), S. Bailey, February 2002