Front cover
NIC Virtualization on IBM Flex System
Draft Document for Review May 1, 2014 2:10 pm
SG24-8223-00
ibm.com/redbooks
Scott Irwin
Scott Lorditch
Matt Slavin
Ilya Krutov
Introduces NIC virtualization concepts and technologies
Discusses vNIC deployment scenarios
Provides vNIC configuration examples
International Technical Support Organization

NIC Virtualization on IBM Flex System

May 2014
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corporation in the United States, other countries, or both. These and other IBM trademarked terms are
marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US
registered or common law trademarks owned by IBM at the time this information was published. Such
trademarks may also be registered or common law trademarks in other countries. A current list of IBM
trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml
The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
Blade Network Technologies®
BladeCenter®
BNT®
IBM®
IBM Flex System®
Power Systems™
PowerVM®
PureFlex®
RackSwitch™
Redbooks®
Redbooks (logo) ®
System x®
VMready®
The following terms are trademarks of other companies:
Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel
Corporation or its subsidiaries in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States,
other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
Scott Lorditch is a Consulting Systems Engineer for IBM System Networking. He performs
network architecture assessments, and develops designs and proposals for implementing
GbE Switch Module products for the IBM BladeCenter. He also developed several training
and lab sessions for IBM technical and sales personnel. Previously, Scott spent almost 20
years working on networking in various industries, working as a senior network architect, a
product manager for managed hosting services, and manager of electronic securities transfer
projects. Scott holds a BS degree in Operations Research with a specialization in computer
science from Cornell University.
Matt Slavin is a Consulting Systems Engineer for IBM Systems Networking, based out of
Tulsa, Oklahoma, and currently providing network consulting skills to the Americas. He has a
background of over 30 years of hands-on systems and network design, installation, and
troubleshooting. Most recently, he has focused on data center networking where he is leading
client efforts in adopting new and potentially game-changing technologies into their day-to-day
operations. Matt joined IBM through the acquisition of Blade Network Technologies, and prior
to that has worked at some of the top systems and networking companies in the world.
Thanks to the following people for their contributions to this project:
Tamikia Barrow, Cheryl Gera, Chris Rayns, Jon Tate, David Watts, Debbie Willmschen
International Technical Support Organization, Raleigh Center
Nghiem Chu, Sai Chan, Michael Easterly, Heidi Griffin, Richard Mancini, Shekhar Mishra,
Heather Richardson, Hector Sanchez, Tim Shaughnessy
IBM
Jeff Lin
Emulex
Now you can become a published author, too!
Here’s an opportunity to spotlight your skills, grow your career, and become a published
author—all at the same time! Join an ITSO residency project and help write a book in your
area of expertise, while honing your experience using leading-edge technologies. Your efforts
will help to increase product acceptance and customer satisfaction, as you expand your
network of technical contacts and relationships. Residencies run from two to six weeks in
length, and you can participate either in person or as a remote resident working from your
home base.
Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our books to be as helpful as possible. Send us your comments about this book or
other IBM Redbooks publications in one of the following ways:
Use the online Contact us review Redbooks form found at:
ibm.com/redbooks
Send your comments in an email to:
redbooks@us.ibm.com
Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Stay connected to IBM Redbooks
Find us on Facebook:
http://www.facebook.com/IBMRedbooks
Follow us on Twitter:
http://twitter.com/ibmredbooks
Look for us on LinkedIn:
http://www.linkedin.com/groups?home=&gid=2130806
Explore new Redbooks publications, residencies, and workshops with the IBM Redbooks
weekly newsletter:
https://www.redbooks.ibm.com/Redbooks.nsf/subscribe?OpenForm
Stay current on recent Redbooks publications with RSS Feeds:
http://www.redbooks.ibm.com/rss.html
Chapter 1. Introduction to I/O module and NIC virtualization features in the IBM Flex System environment
1.1 Overview of Flex System I/O module virtualization technologies
The term virtualization can mean many different things to different people, and in different contexts.
For example, in the server world it is often associated with taking a bare metal platform and adding a layer of software (referred to as a hypervisor) that permits multiple virtual machines (VMs) to run on that single physical platform, with each VM behaving as though it owns the entire hardware platform.
In the network world, there are many different concepts of virtualization. Overlay technologies let a user run one network on top of another network, usually with the goal of hiding the complexities of the underlying network (often referred to as overlay networking). Another form of network virtualization is OpenFlow technology, which decouples a switch’s control plane from the switch and allows the switching path decisions to be made from a central control point.
And then there are other forms of virtualization, such as cross-chassis aggregation (also known as cross-switch aggregation), virtualized NIC technologies, and converged fabrics. This paper is focused on the latter set of virtualization forms, specifically the following features:
Converged fabrics - Fibre Channel over Ethernet (FCoE) and Internet Small Computer System Interface (iSCSI)
Virtual Link Aggregation (vLAG) - A form of cross-switch aggregation
Stacking - Virtualizing the management plane and the switching fabric
Switch Partitioning (SPAR) - Masking the I/O module from the host and upstream network
Easy Connect Q-in-Q solutions - More ways to mask the I/O modules from connecting devices
NIC virtualization - Allowing a single physical 10 Gb NIC to represent multiple NICs to the host OS
Although all of these topics are introduced in this section, the primary focus of this paper is how the last item (NIC virtualization) integrates with the various other features and the surrounding customer environment. The specific NIC virtualization features that are discussed in detail in this paper include the following:
IBM Virtual Fabric mode - Also known as vNIC Virtual Fabric mode, including both Dedicated Uplink Mode (default) and Shared Uplink Mode (optional) operations
Switch Independent Mode - Also known as vNIC Switch Independent Mode
Unified Fabric Port - Also known as IBM Unified Fabric Protocol, or just UFP - all modes
Important: The term vNIC can be used either generically for all virtual NIC technologies or as a vendor-specific term. For example, VMware calls the virtual NIC that resides inside a VM a vNIC. Unless otherwise noted, the term vNIC in this paper refers to a specific feature available on the Flex System I/O modules and Emulex CNAs inside physical hosts. Similarly, the term vPort has multiple connotations; for example, it is used by Microsoft for the Hyper-V environment. Unless otherwise noted, the term vPort in this paper refers to the UFP feature on the Flex System I/O modules and Emulex CNAs inside physical hosts.
1.1.1 Introduction to converged fabrics
As the name implies, converged fabrics are all about taking a set of protocols and data designed to run on top of one kind of physical medium and allowing them to be carried on top of a different physical medium. This provides a number of cost benefits, such as reducing the number of physical cabling plants that are required, removing the need for separate physical NICs and HBAs, and potentially reducing power and cooling. From an OpEx perspective, it can reduce the cost associated with the management of separate physical infrastructures. In the data center world, two of the most common forms of converged fabrics are FCoE and iSCSI.
FCoE allows a host to use its 10 Gb Ethernet connections to access Fibre Channel attached
remote storage, as if it were physically Fibre Channel attached to the host, when in fact the
FC traffic is encapsulated into FCoE frames and carried to the remote storage via an Ethernet
network.
iSCSI takes a protocol that was originally designed for hosts to talk to relatively close physical storage over physical SCSI cables and converts it to use IP and run over an Ethernet network, making it possible to access storage well beyond the limitations of a physical SCSI-based solution.
Both of these topics are discussed in more detail in Chapter 2, “Converged networking” on
page 15.
1.1.2 Introduction to vLAG
In its simplest terms, vLAG is a technology designed to enhance traditional Ethernet link
aggregations (sometimes referred to generically as Portchannels or Etherchannels). It is
important to note that vLAG is not a form of aggregation in its own right, but an enhancement
to aggregations.
As some background, under current IEEE specifications, an aggregation is still defined as a bundle of similar links between two, and only two, devices, bound together to operate as a single logical link. By today’s standards-based definitions, you cannot create an aggregation on one device and have the links of that aggregation connect to more than a single device on the other side of the aggregation. The use of only two devices in this fashion limits the ability to offer certain robust designs.
Although the standards bodies are working on a solution that provides split aggregations across devices, most vendors have developed their own versions of this multi-chassis aggregation. For example, Cisco has virtual PortChannel (vPC) on NX-OS products and Virtual Switch System (VSS) on the 6500 IOS products. IBM offers virtual Link Aggregation (vLAG) on many of the IBM Top of Rack (ToR) solutions, and on the EN4093R and CN4093 Flex System I/O modules.
The primary goal of virtual link aggregation is to overcome the limit imposed by the current
standards-based aggregation, and provide a distributed aggregation across a pair of switches
instead of a single switch. Doing so results in a reduction of single points of failure, while still
maintaining a loop-free, non-blocking environment.
Important: All I/O module features discussed in this paper are based on the latest
available firmware at the time of this writing (7.7.9 for the EN4093R and CN4093, and 7.7.8
for the SI4093 System Interconnect Module).
Figure 1-1 shows an example of how vLAG can create a single common uplink out of a pair of embedded I/O modules. This creates a non-looped path with no blocking links, offering the maximum amount of bandwidth for the links and no single point of failure.
Figure 1-1 Non-looped design using multi-chassis aggregation on both sides
Although this vLAG-based design is considered the most optimal, not all I/O module virtualization options support this topology; for example, Virtual Fabric vNIC mode and SPAR are not supported with vLAG.
Another potentially limiting factor with vLAG (and other such cross-chassis aggregations such
as vPC and VSS) is that it only supports a pair of switches acting as one for this cross-chassis
aggregation, and not more than two. If the desire is to split an aggregation across more than
two switches, stacking might be an option to consider.
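To give a sense of what a vLAG deployment involves on the I/O module side, the following isCLI fragment sketches the main elements: an ISL between the two peers, a matching tier ID, and the distributed aggregation itself. This is an illustrative sketch, not a configuration from this book; the port ranges, LACP keys, and tier ID are placeholder values, and the exact syntax should be verified against the Application Guide for your firmware release.

```
! Hypothetical vLAG sketch for an EN4093R/CN4093 pair (isCLI).
! Port numbers, LACP keys, and the tier ID are placeholders.

! ISL between the two vLAG peers, built as an LACP aggregation
interface port EXT8-EXT9
   lacp mode active
   lacp key 200
   exit

! vLAG globals: the tier ID must match on both peers
vlag tier-id 10
vlag isl adminkey 200

! Uplink aggregation toward the upstream switch pair
interface port EXT1-EXT2
   lacp mode active
   lacp key 1000
   exit

! Pair this local aggregation with its counterpart on the vLAG peer
vlag adminkey 1000 enable
vlag enable
```

A mirror of this configuration (same tier ID, same vLAG adminkey) would be applied on the second I/O module of the pair.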
1.1.3 Introduction to stacking
Stacking provides the ability to take up to eight physical I/O modules and treat them as a
single logical switch from a port usage and management perspective. This means ports on
different I/O modules in the stack can be part of a common aggregation, and you only log in to
a single IP address to manage all I/O modules in the stack. For devices that are attaching to
the stack, the stack looks and acts like a single large switch.
Stacking is supported on the EN4093R and CN4093 I/O modules. It is provided by reserving a group of uplinks as stacking links and creating a ring of I/O modules with these links. The ring design ensures that the loss of a single link or single I/O module in the stack does not lead to a disruption of the stack.
Before the v7.7 releases of code, the EN4093R could be stacked only into a common stack of like-model I/O modules. In v7.7 and later code, support was added for a pair of CN4093s in a hybrid stack with EN4093Rs, adding Fibre Channel Forwarder (FCF) capability to the stack. The limit for this hybrid stacking is a maximum of six EN4093Rs and two CN4093s in a common stack.
Important: When using the EN4093R and CN4093 in hybrid stacking, only the CN4093 is
allowed to act as a stack master or stack backup master for the stack.
Stacking the Flex System chassis I/O modules with IBM Top of Rack switches that also
support stacking is not allowed. Connections from a stack of Flex System chassis I/O
modules to upstream switches can be made with normal single or aggregated connections,
including the use of vLAG/vPC on the upstream switches to connect links across stack
members into a common non-blocking fabric between the stack and the Top of Rack switches.
An example of four I/O modules in a highly available stacking design is shown in Figure 1-2.
Figure 1-2 Example of stacking in the Flex System environment
This example shows a design with no single point of failure, via a stack of four I/O modules in a single stack and a pair of upstream vLAG/vPC connected switches.
One of the potential limitations of the current implementation of stacking is that a code upgrade requires a reload of the entire stack. Because upgrades are uncommon and should be scheduled for non-production hours anyway, a single-stack design is usually efficient and acceptable. But some customers do not want any downtime (scheduled or otherwise), and a single-stack design is thus not an acceptable solution for them. For users who still want to make the most use of stacking, a two-stack design might be an option. This design features stacking a set of I/O modules in bay 1 into one stack, and a set of I/O modules in bay 2 into a second stack.
The primary advantage to a two-stack design is that each stack can be upgraded one at a
time, with the running stack maintaining connectivity for the compute nodes during the
upgrade and reload of the other stack. The downside of the two-stack design is that traffic that
is flowing from one stack to another stack must go through the upstream network to reach the
other stack.
As can be seen, stacking might not be suitable for all customers. However, if it is desired, it is
another tool that is available for building a robust infrastructure by using the Flex System I/O
modules.
1.1.4 Introduction to SPAR
Switch partitioning (SPAR) is a feature that, among other things, allows a physical I/O module to be divided into multiple logical switches. After SPAR is configured, ports within a given SPAR group can communicate only with each other. Ports that are members of different SPAR groups on the same I/O module cannot communicate directly with each other without going outside the I/O module.
The EN4093R, CN4093, and SI4093 I/O modules support SPAR.
SPAR features two modes of operation:
Pass-through domain mode (also known as transparent mode)
This mode of SPAR uses a Q-in-Q function to encapsulate all traffic passing through the switch in a second layer of VLAN tagging. This is the default mode when SPAR is enabled, and it is VLAN-agnostic owing to this Q-in-Q operation: it passes tagged and untagged packets through the SPAR session without looking at or interfering with any customer-assigned tag.
SPAR pass-through mode supports passing FCoE packets to an upstream FCF, but without the benefit of FIP snooping within the SPAR group.
Local domain mode
This mode is not VLAN-agnostic and requires a user to create any required VLANs in the SPAR group. Currently, there is a limit of 256 VLANs in Local Domain mode.
Support is available for FIP snooping on FCoE sessions in Local Domain mode. Unlike pass-through domain mode, Local Domain mode provides strict control of end-host VLAN usage.
Consider the following points regarding SPAR:
SPAR is disabled by default on the EN4093R and CN4093. SPAR is enabled by default on the SI4093, with all base-licensed internal and external ports defaulting to a single pass-through SPAR group. This default SI4093 configuration can be changed if desired.
Any port can be a member of only a single SPAR group at one time.
Only a single uplink path is allowed per SPAR group (it can be a single link, a single static aggregation, or a single LACP aggregation). This SPAR-enforced restriction ensures that no network loops are possible with ports in a SPAR group.
SPAR cannot be used with UFP or Virtual Fabric vNIC at this time. Switch Independent Mode vNIC is supported with SPAR. UFP support is slated for a possible future release.
Up to eight SPAR groups per I/O module are supported. This number might be increased in a future release.
SPAR is not supported with vLAG, stacking, or the tagpvid-ingress feature.
SPAR can be a useful solution in environments where simplicity is paramount.
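To illustrate how little configuration a SPAR group needs, the following fragment sketches a pass-through domain on an EN4093R or CN4093. The SPAR number, outer VLAN, port range, and LACP key are all placeholders, and the subcommand names may differ between firmware releases, so treat this as an outline rather than a recipe.

```
! Hypothetical SPAR pass-through sketch (isCLI); all values are placeholders.
spar 1
   ! Outer Q-in-Q VLAN that masks customer VLANs within this group
   domain default vlan 4081
   ! Compute node ports that belong to this SPAR group
   domain default member INTA1-INTA7
   ! Single uplink path per SPAR group (here, one LACP aggregation)
   uplink adminkey 1100
   enable
   exit
```

Because only one uplink path is permitted per group, no loop is possible and no spanning-tree interaction is needed for the group's traffic.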
1.1.5 Easy Connect Q-in-Q solutions
The Easy Connect concept, often referred to as Easy Connect mode, or Transparent mode, is
not a specific feature but a way of using one of four different existing features to attempt to
minimize ongoing I/O module management requirements. The primary goal of Easy Connect
is to make an I/O module transparent to the hosts and the upstream network they need to
access, thus reducing the management requirements for I/O Modules in an Easy Connect
mode.
As noted, several features can be used to accomplish an Easy Connect solution. The following aspects are common to Easy Connect solutions:
At the heart of Easy Connect is some form of Q-in-Q tagging to mask packets traveling through the I/O module. This fundamental requirement of any Easy Connect solution lets the attached hosts and upstream network communicate using any VLAN (tagged or untagged). The I/O module passes those packets through to the other side by wrapping them in an outer VLAN tag and removing that outer tag as the packet exits, thus making the I/O module VLAN-agnostic. This Q-in-Q operation removes the need to manage VLANs on the I/O module, which is usually one of the larger ongoing management requirements of a deployed I/O module.
Pre-creating an aggregation of the uplinks, in some cases, all of the uplinks, to remove the
likelihood of loops (if all uplinks are not used, any unused uplinks/ports should be disabled
to ensure loops are not possible).
Optionally disabling spanning-tree so the upstream network does not receive any
spanning-tree BPDUs. This is especially important in the case of upstream devices that
will shut down a port if BPDUs are received, such as a Cisco FEX device, or an upstream
switch running some form of BPDU guard.
After it is configured, an I/O module in Easy Connect mode does not require ongoing configuration changes as a customer adds and removes VLANs on the hosts and upstream network. In essence, Easy Connect turns the I/O module into a VLAN-agnostic port aggregator, with support for growing up to the maximum bandwidth of the product (for example, by adding Feature on Demand (FoD) upgrade keys to the I/O module to increase the number of 10 Gb links to compute nodes and 10 Gb and 40 Gb links to the upstream networks).
The following are the two primary methods for deploying an Easy Connect solution:
Use an I/O module that defaults to a form of Easy Connect:
– For customers that want an Easy Connect type of solution that is immediately ready for
use out of the box (zero touch I/O module deployment), the SI4093 provides this by
default. The SI4093 accomplishes this by having the following factory default
configuration:
• All base licensed internal and external ports are put into a single SPAR group.
• All uplinks are put into a single common LACP aggregation and the LACP
suspend-port feature is enabled.
• The failover feature is enabled on the common LACP key.
• No spanning-tree support (the SI4093 is designed to never permit more than a
single uplink path per SPAR, so it can not create a loop and does not support
spanning-tree).
For customers that want the option to use advanced features but also want an Easy Connect mode solution, the EN4093R and CN4093 offer configurable options that can make them transparent to the attaching compute nodes and upstream network switches, while maintaining the option of changing to more advanced modes of configuration when needed.
As noted, the SI4093 accomplishes this by defaulting to the SPAR feature in pass-through
mode, which puts all compute node ports and all uplinks into a common Q-in-Q group.
For the EN4093R and CN4093, there are a number of features that can be implemented to
accomplish this Easy Connect support. The primary difference between these I/O modules
and the SI4093 is that you must first perform a small set of configuration steps to set up the
EN4093R and CN4093 into an Easy Connect mode, after which minimal management of the
I/O module is required.
For these I/O modules, this Easy Connect mode can be configured by using one of the
following four features:
The SPAR feature that is default on the SI4093 can be configured on both the EN4093R
and CN4093 as well
Utilize the tagpvid-ingress feature
Configure vNIC Virtual Fabric Dedicated Uplink Mode
Configure UFP vPort tunnel mode
In general, all of these features provide this Easy Connect functionality, with each having some pros and cons. For example, to use Easy Connect with vLAG, you should use tagpvid-ingress mode or UFP vPort tunnel mode (SPAR and Virtual Fabric vNIC do not permit the vLAG ISL). But if you want to use Easy Connect with FCoE today, you cannot use tagpvid-ingress and must use a different form of Easy Connect, such as vNIC Virtual Fabric Dedicated Uplink Mode or UFP tunnel mode (SPAR pass-through mode allows FCoE but does not support FIP snooping, which may or may not be a concern for some customers).
As an example of how Easy Connect works (in all Easy Connect modes), consider the tagpvid-ingress operation shown in Figure 1-3. All internal ports and the desired uplink ports are placed into a common PVID/Native VLAN (4091 in this example), and tagpvid-ingress is enabled on these ports (with any wanted aggregation protocol on the uplinks to match the other end of those links). All ports with a matching Native/PVID setting on this I/O module are then part of a single Q-in-Q tunnel. The Native/PVID VLAN on the port acts as the outer tag, and the I/O module switches traffic based on this outer tag VLAN. The inner customer tag rides through the fabric encapsulated in this Native/PVID VLAN to the destination port (or ports) in this tunnel, and the outer tag is stripped off as the packet exits the I/O module, re-exposing the original customer-facing tag (or no tag) to the device attached to that egress port.
Figure 1-3 Packet flow with Easy Connect
In all modes of Easy Connect, local switching based on destination MAC address is still used.
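The tagpvid-ingress flow described above might be configured roughly as in the following sketch, using the outer VLAN 4091 from the example. The port ranges and LACP key are placeholders, and the commands shown (including the global spanning-tree disable) should be checked against the documentation for your firmware level.

```
! Hypothetical tagpvid-ingress Easy Connect sketch (isCLI).
! Port ranges and the LACP key are placeholders.

! Aggregate all uplinks so only one logical path exists upstream
interface port EXT1-EXT2
   lacp mode active
   lacp key 1000
   exit

! Put internal ports and uplinks into the same outer PVID/Native VLAN
! and enable the Q-in-Q tagpvid-ingress behavior on them
interface port INTA1-INTA14,EXT1-EXT2
   pvid 4091
   tagpvid-ingress
   exit

! Optionally disable spanning-tree so no BPDUs reach the upstream network
spanning-tree mode disable
```

Any unused uplinks should be shut down so that no looped path can form outside the aggregation.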
Some considerations about which form of Easy Connect mode makes the most sense for a given situation:
For users that require virtualized NICs, are already using vNIC Virtual Fabric mode, and are more comfortable staying with it, vNIC Virtual Fabric Easy Connect mode might be the best solution.
For users that require virtualized NICs and have no particular preference for the mode of virtualized NIC, UFP tunnel mode is the best choice for Easy Connect mode, because the UFP feature is the future direction of virtualized NICs in the Flex System I/O module solutions.
For users planning to use the vLAG feature, either UFP tunnel mode or tagpvid-ingress mode is required (vNIC Virtual Fabric mode and SPAR Easy Connect modes do not work with the vLAG feature).
For users that do not need vLAG or virtual NIC functionality, SPAR is a very simple and clean solution to implement as an Easy Connect solution.
1.1.6 Introduction to the Failover feature
Failover, sometimes referred to as Layer 2 Failover or Trunk Failover, is not a virtualization feature in its own right, but it can play an important role when NICs on a server use teaming/bonding (forms of NIC virtualization in the OS). Failover is particularly important in an embedded environment, such as a Flex System chassis.
When NICs are teamed/bonded in an operating system, the OS needs to know when a NIC can no longer reach the upstream network so that it can decide whether to use that NIC in the team. Most commonly, this is a simple link up/link down check in the server: if the link is reporting up, use the NIC; if the link is reporting down, do not use the NIC.
In an embedded environment, this can be a problem if the uplinks out of the embedded I/O module go down but the internal link to the server is still up. In that case, the server still reports the NIC link as up even though there is no path to the upstream network, which leads to the server sending traffic out a NIC that has no path out of the embedded I/O module and disrupts server communications.
The Failover feature can be implemented in these environments. When the set of uplinks that the Failover feature is tracking goes down, configurable internal ports are also taken down, alerting the embedded server to a path fault in this direction, at which time the server can use the team/bond to select a different NIC and maintain network connectivity.
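This monitor-and-control relationship is typically expressed as a failover trigger, roughly as in the following sketch. The trigger number and port ranges are placeholders; verify the syntax against the documentation for your firmware release.

```
! Hypothetical Layer 2 Failover sketch (isCLI); values are placeholders.
! Monitor the uplink ports; if they all go down, take down (control)
! the internal ports so NIC teaming on the server reacts.
failover trigger 1 mmon monitor member EXT1-EXT2
failover trigger 1 mmon control member INTA1-INTA14
failover trigger 1 enable
failover enable
```

A matching trigger would normally be configured on the second I/O module so that either path can signal a fault to the compute nodes.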
An example of how failover can protect Compute Nodes in a PureFlex chassis when there is
an uplink fault out of one of the I/O modules can be seen in Figure 1-4.
Figure 1-4 Example of Failover in action
Without Failover or some other form of remote link failure detection, embedded servers would potentially be exposed to loss of connectivity if the uplink path on one of the embedded I/O modules were to fail.
Note that designs that use vLAG or some form of cross-chassis aggregation such as stacking are not exposed to this issue (and thus do not need the Failover feature) because they have a different way of dealing with uplinks out of an I/O module going down (for example, with vLAG, packets that need to get upstream can cross the vLAG ISL and use the other I/O module’s uplinks to reach the upstream network).
1.2 Introduction to NIC virtualization
As noted previously, although we have introduced a number of virtualization elements, this book is primarily focused on the various options for virtualizing NIC technology within the PureFlex System and Flex System environment. This section introduces the two primary types of NIC virtualization (vNIC and UFP) available on the Flex System I/O modules, and the various sub-elements of these virtual NIC technologies.
At the core of all virtual NICs discussed in this section is the ability to take a single physical 10 GbE NIC and carve it into as many as three or four virtual NICs for use by the attaching host.
The virtual NIC technologies discussed for the I/O module here are all directly tied to the Emulex CNA offerings for the Flex System environment, and are documented in 3.3, “IBM Flex System Ethernet adapters” on page 47.
How Failover works (Figure 1-4):
1. All uplinks out of I/O module 1 have gone down (could be a link failure or a failure of ToR 1, and so forth).
2. Trunk failover takes down the link to NIC 1 to notify the compute node that the path out of I/O module 1 is gone.
3. NIC teaming on the compute node begins using the still-functioning NIC 2 for all communications.
1.2.1 vNIC based NIC virtualization
vNIC is the original virtual NIC technology used in the IBM BladeCenter 10Gb Virtual Fabric Switch Module, and it has been brought forward into the PureFlex System environment so that customers who have standardized on vNIC can continue to use it with the PureFlex System solutions.
vNIC has three primary modes:
vNIC Virtual Fabric - Dedicated Uplink Mode
– Provides a Q-in-Q tunneling action for each vNIC group
– Each vNIC group must have its own dedicated uplink path out
– vNICs in one vNIC group cannot talk with vNICs in any other vNIC group without
first exiting to the upstream network
vNIC Virtual Fabric - Shared Uplink Mode
– Each vNIC group provides a single VLAN for all vNICs in that group
– Each vNIC group must use a unique VLAN (the same VLAN cannot be used on more
than one vNIC group)
– Servers cannot use tagging when Shared Uplink Mode is enabled
– As with Dedicated Uplink Mode, vNICs in one vNIC group cannot talk with
vNICs in any other vNIC group without first exiting to the upstream network
vNIC Switch Independent Mode
– Offers virtual NICs to the server with no special configuration on the I/O module side
– The switch is completely unaware that the 10 GbE NIC is being presented as multiple
logical NICs in the OS
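As a preview, the following fragment sketches what a Virtual Fabric vNIC configuration in
Dedicated Uplink Mode can look like on a supporting I/O module. The port names, vNIC
group number, VLAN, and bandwidth values are illustrative assumptions only, and the exact
ISCLI syntax varies by switch model and firmware release; see Chapter 6 for tested examples.

```
! Enable vNIC globally, define vNIC 1 on internal port INTA1 with 25% of
! the 10 Gb link, and place it in vNIC group 1 with dedicated uplink EXT1
! (all names and values are illustrative)
vnic enable
vnic port INTA1 index 1
    bandwidth 25
    enable
    exit
vnic vnicgroup 1
    vlan 100
    member INTA1.1
    port EXT1
    enable
    exit
```

Because this is Dedicated Uplink Mode, uplink EXT1 carries traffic for vNIC group 1 only and
cannot be shared with any other vNIC group.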
Details for enabling and configuring these modes can be found in Chapter 5, “NIC
virtualization considerations on the server side” on page 75 and Chapter 6, “Flex System NIC
virtualization deployment scenarios” on page 133.
1.2.2 Unified Fabric Port based NIC virtualization
UFP is the current direction of IBM NIC virtualization and provides a more feature-rich
solution than the original vNIC Virtual Fabric mode. Like Virtual Fabric mode vNIC, UFP
allows carving a single 10 Gb port into up to four virtual NICs. UFP also has a number of
modes associated with it, including:
Tunnel mode
Provides a mode very similar to vNIC Virtual Fabric Dedicated Uplink Mode
Trunk mode
Provides a traditional 802.1Q trunk mode to the virtual NIC (vPort) interface
Access mode
Provides a traditional access mode (single untagged VLAN) to the virtual NIC (vPort)
interface
FCoE mode
Provides FCoE functionality to the vPort
Auto-VLAN mode
Provides automatic VLAN creation for IEEE 802.1Qbg and IBM VMready® environments
Only vPort 2 can be bound to FCoE. If FCoE is not desired, vPort 2 can be configured for one
of the other modes.
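To illustrate, a UFP configuration might look like the following fragment, with vPort 1 in
tunnel mode and vPort 2 carrying FCoE. The port name and bandwidth values are illustrative
assumptions only, and the exact ISCLI syntax varies by switch model and firmware release;
see Chapter 6 for tested examples.

```
! Enable UFP, then shape two vPorts on internal port INTA1
! (illustrative values; only vPort 2 can carry FCoE)
ufp enable
ufp port INTA1 vport 1
    network mode tunnel
    qos bandwidth min 25
    enable
    exit
ufp port INTA1 vport 2
    network mode fcoe
    qos bandwidth min 25
    enable
    exit
ufp port INTA1 enable
```

The minimum bandwidth values reserve a share of the 10 Gb link for each vPort; the switch
enforces these guarantees bidirectionally and they can be changed on the fly.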
Details for enabling and configuring these modes can be found in Chapter 5, “NIC
virtualization considerations on the server side” on page 75 and Chapter 6, “Flex System NIC
virtualization deployment scenarios” on page 133.
1.2.3 Comparing vNIC modes and UFP modes
As a general rule of thumb, if a customer wants virtualized NICs in the PureFlex System
environment, UFP is usually the preferred solution, because all new feature development is
going into UFP.
If a customer has standardized on the original vNIC Virtual Fabric mode, then they can still
continue to use that mode in a fully supported fashion.
If a customer does not want any of the virtual NIC functionality controlled by the I/O module
(that is, it is controlled and configured only on the server side), then Switch Independent
Mode vNIC is the solution of choice. This mode has the advantage of being I/O module
independent, so any upstream I/O module can be utilized. Among its downsides, bandwidth
restrictions can be enforced only from the server side, not the I/O module side, and changing
bandwidth requires a reload of the server. (Bandwidth controls for the other virtual NIC
modes discussed here are changed from the switch side, enforce restrictions bidirectionally,
and can be changed on the fly, with no reboot required.)
Table 1-1 shows some of the items that may affect the decision-making process.
Table 1-1   Attributes of virtual NIC options

| Capability | Virtual Fabric vNIC, Dedicated uplink | Virtual Fabric vNIC, Shared uplink | Switch Independent Mode vNIC | UFP |
|---|---|---|---|---|
| Requires support in the I/O module | Yes | Yes | No | Yes |
| Requires support in the NIC/CNA | Yes | Yes | Yes | Yes |
| Supports adapter transmit rate control | Yes | Yes | Yes | Yes |
| Supports I/O module transmit rate control | Yes | Yes | No | Yes |
| Supports changing rate without restart of node | Yes | Yes | No | Yes |
| Requires a dedicated uplink path per vNIC group or vPort | Yes | No | No | Yes, for vPorts in Tunnel mode |
| Support for node OS-based tagging | Yes | No | Yes | Yes |
| Support for failover per vNIC group/UFP vPort | Yes | Yes | No | Yes |
| Support for more than one uplink path per vNIC group/vPort | No | Yes | Yes | Yes, for vPorts in Trunk and Access modes |
| Supported with vLAG | No | No | Yes | Yes, for uplinks out of the I/O module carrying vPort traffic |
| Supported with SPAR | No | No | Yes | No |
| Supported with stacking | Yes | Yes | Yes | No (UFP and stacking on EN/CN4093 in a coming release of code) |
| Supported with an SI4093 | No | No | Yes | No today, but supported in a coming release |
| Supported with EN4093 | Yes | Yes | Yes | Yes |
| Supported with CN4093 | Yes | Yes | Yes | Yes |

For a deeper dive into virtual NIC operational characteristics from the switch side, see
Chapter 4, “NIC virtualization considerations on the switch side” on page 55. For virtual NIC
operational characteristics from the server side, see Chapter 5, “NIC virtualization
considerations on the server side” on page 75.
2.1 What convergence is
Dictionaries describe convergence as follows:
The degree or point at which lines, objects, and so on, converge1
The merging of distinct technologies, industries, or devices into a unified whole2
In the context of this book, convergence addresses the fusion of local area networks (LANs)
and storage area networks (SANs), including servers and storage systems, into a unified
network. In other words, the same infrastructure is used for both data (LAN) and storage
(SAN) networking; the components of this infrastructure are primarily those traditionally used
for LANs.
2.1.1 Calling it what it is
Many terms and acronyms are used to describe convergence in a network environment.
These terms are described in later chapters of this book. For a better understanding of the
basics, let us start with the core.
Data Center Bridging (DCB)
The Institute of Electrical and Electronics Engineers (IEEE) uses the term DCB to group the
required extensions to enable an enhanced Ethernet that is capable of deploying a converged
network where different applications, relying on different link layer technologies, can be run
over a single physical infrastructure. The Data Center Bridging Task Group (DCB TG), part of
the IEEE 802.1 Working Group, provided the required extensions to existing 802.1 bridge
specifications in several projects.
Converged Enhanced Ethernet (CEE)
This is a trademark that was registered by IBM in 2007 and abandoned in 2008. Initially, the
plan was to donate (transfer) the term to the industry (IEEE 802 or the Ethernet Alliance).
Several vendors started using or referring to CEE in the meantime.
Data Center Ethernet (DCE)
Cisco registered the trademark DCE for their initial activity in the converged network area.
Bringing it all together
All three terms describe more or less the same thing. Some of them were introduced before
an industry standard (or name) was available. Because manufacturers have used different
command names and terms, different terms might be used in this book; this clarification that
the terms are interchangeable should help prevent confusion. While all of these terms are
still heard, the open industry standard Data Center Bridging (DCB) terminology is preferred.
Command syntax in some of the IBM products used for testing in this book includes the CEE
acronym.
2.2 Vision of convergence in data centers
The density - processing and storage capability per square foot - of the data center footprint is
increasing over time, allowing the same processing power and storage capacity in significantly
1
Dictionary.com. Retrieved July 08, 2013 from http://dictionary.reference.com/browse/convergence
2 Merriam-Webster.com. Retrieved July 08, 2013 from http://www.merriam-webster.com/dictionary/convergence
smaller space. At the same time, information technology is embracing infrastructure
virtualization more rapidly than ever.
One way to reduce the storage and network infrastructure footprint is to implement a
converged network. Vendors are adopting industry standards which support convergence
when developing products.
Fibre Channel over Ethernet (FCoE) and iSCSI are two of the enablers of storage and network
convergence. Enterprises can preserve investments in traditional Fibre Channel (FC) storage
and at the same time adapt to higher Ethernet throughput demands which arise from server
virtualization. Most of the vendors in the networking market offer 10 Gbps Network Interface
Cards; 40Gbps NICs are also available today. Similarly, data center network switches
increasingly offer an option to choose 40 Gbps for ports, and 100 Gbps is expected relatively
soon.
Convergence has long had a role in networking, but now it takes on a new significance. The
following sections describe storage and networking in data centers today, explain what is
changing, and highlight approaches to storage and network convergence that are explored in
this book.
2.3 The interest in convergence now
Several factors are driving new interest in combining storage and data infrastructure. The
Ethernet community has a history of continually moving to transmission speeds that were
thought impossible only a few years earlier. Although a 100 Mbps Ethernet was once
considered fast, 10 Gbps Ethernet is commonplace today, and 40 Gbps Ethernet is
becoming more and more widely available, with 100 Gbps Ethernet following shortly. From a
simple data transmission speed perspective, Ethernet can now meet or exceed the speeds
that are available by using FC.
The IEEE 802.3 work group is already working on the 400 Gbps standard (results are
expected in 2017), so this process will continue.
A second factor that is enabling convergence is the addition of capabilities that make Ethernet
lower latency and “lossless,” making it more similar to FC. The Data Center Bridging (DCB)
protocols provide several capabilities that substantially enhance the performance of Ethernet
and initially enable its usage for storage traffic.
One of the primary motivations for storage and networking convergence is improved asset
utilization and cost of ownership, similar to the convergence of voice and data networks that
occurred in previous years. By using a single infrastructure for multiple types of network
traffic, the costs of procuring, installing, managing, and operating the data center
infrastructure can be lowered. Where multiple types of adapters, switches, and cables were
once required for separate networks, a single set of infrastructure will take its place, providing
savings in equipment, cabling, and power requirements. The improved speeds and
capabilities of lossless 10 and 40 Gbps Ethernet are enabling such improvements.
2.4 Fibre Channel SANs today
Fibre Channel SANs are generally regarded as the high-performance approach to storage
networking. With a Fibre Channel SAN, storage arrays are equipped with FC ports that
connect to FC switches. Similarly, servers are equipped with Fibre Channel host bus adapters
(HBAs) that also connect to Fibre Channel switches. Therefore, the Fibre Channel SAN,
which is the set of FC switches, is a separate network for storage traffic.
Fibre Channel (FC) was standardized in the early 1990s and became the technology of
choice for enterprise-class storage networks. Compared to its alternatives, FC offered
relatively high-speed, low-latency, and back-pressure mechanisms that provide lossless
connectivity. That is, FC is designed not to drop packets during periods of network
congestion.
Just as the maximum speed of Ethernet networks has increased repeatedly, Fibre Channel
networks have offered increased speed, typically by factors of two: from 4 to 8 to 16 Gbps,
with 32 Gbps becoming available.
FC has many desirable characteristics for a storage network, but with some considerations.
First, because FC is a separate network from the enterprise data Ethernet network, additional
cost and infrastructure are required.
Second, FC is a different technology from Ethernet. Therefore, the skill set required to design,
install, operate and manage the FC SAN is different from the skill set required for Ethernet,
which adds cost in terms of personnel requirements.
Third, despite many years of maturity in the FC marketplace, vendor interoperability within a
SAN fabric is limited. Such technologies as N_Port Virtualization (NPV) or N_Port ID
Virtualization (NPIV) allow the equipment of one vendor to attach at the edge of the SAN
fabric of another vendor. However, interoperability over inter-switch links (ISLs; E_Port links)
within a Fibre Channel SAN is generally viewed as problematic.
2.5 Ethernet-based storage today
Storage arrays can also be networked by using technologies based on Ethernet. Two major
approaches are the Internet Small Computer System Interface (iSCSI) protocol and various
NAS protocols.
iSCSI provides block-level access to data over IP networks. With iSCSI, the storage arrays
and servers use Ethernet adapters. Servers and storage exchange SCSI commands over an
Ethernet network to store and retrieve data.
iSCSI provides a similar capability to FC, but by using a native Ethernet network. For this
reason, iSCSI is sometimes referred to as IP SAN. By using iSCSI, designers and
administrators can take advantage of familiar Ethernet skills for designing and maintaining
networks. Also, unlike FC devices, Ethernet devices are widely interoperable. Ethernet
infrastructure can also be significantly less expensive than FC gear.
When compared to FC, iSCSI also has challenges. FC is lossless and provides low latency
in-sequence data transfer. However, traditional Ethernet drops packets when traffic
congestion occurs, so that higher-layer protocols are required to ensure that no packets are
lost. For iSCSI, TCP/IP is used above an Ethernet network to guarantee that no storage
packets are lost. Therefore, iSCSI traffic undergoes a further layer of encapsulation as it is
transmitted across an Ethernet network.
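That extra layer of encapsulation can be quantified with a back-of-the-envelope sketch. The
header sizes below are typical values (IPv4 and TCP without options, a 48-byte iSCSI basic
header segment, one PDU per frame) and are assumptions for illustration only:

```python
# Approximate payload efficiency of iSCSI in a single Ethernet frame.
# Header sizes are typical values, assumed for illustration.
ETH = 14 + 4       # Ethernet header + frame check sequence
IP = 20            # IPv4 header, no options
TCP = 20           # TCP header, no options
ISCSI_BHS = 48     # iSCSI basic header segment

def payload_efficiency(mtu: int) -> float:
    """Fraction of on-the-wire bytes that carry SCSI data."""
    data = mtu - IP - TCP - ISCSI_BHS
    return data / (mtu + ETH)

print(round(payload_efficiency(1500), 3))   # standard frame: ~0.93
print(round(payload_efficiency(9000), 3))   # jumbo frame: ~0.988
```

The jumbo-frame figure hints at why iSCSI deployments commonly enable jumbo frames: the
fixed per-frame headers are amortized over six times as much data.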
Until recently, Ethernet technology was available only at speeds significantly lower than
those of FC. Although FC offered speeds of 2, 4, 8, or 16 Gbps, with 32 Gbps just arriving,
Ethernet traditionally operated at 100 Mbps and 1 Gbps. Now, 10 Gbps is common, and 40
Gbps is not far behind. iSCSI might offer a lower cost overall than an FC infrastructure, but it
historically has tended to offer lower performance because of its extra encapsulation and lower
speeds. Therefore, iSCSI has been viewed as a lower-cost, lower-performance storage
networking approach compared to FC. Today, the DCB standards, which are a prerequisite
for FCoE to operate with lossless transmission and in-order packet delivery, can also be used
for iSCSI, resulting in improved performance.
NAS also operates over Ethernet. NAS protocols, such as Network File System (NFS) and
Common Internet File System (CIFS), provide file-level access to data, not block-level
access. The server that accesses the NAS over a network detects a file system, not a disk.
The operating system in the NAS device converts file-level commands that are received from
the server to block-level commands. The operating system then accesses the data on its
disks and returns information to the server.
NAS appliances are attractive because, similar to iSCSI, they use a traditional Ethernet
infrastructure and offer a simple file-level access method. However, similar to iSCSI, they
have been limited by Ethernet’s capabilities. NAS protocols are encapsulated in an upper
layer protocol (such as TCP or RPC) to ensure no packet loss. Because NAS works at the
file level, additional processing is possible on the NAS device, which is aware of the stored
content (for example, deduplication or incremental backup). On the other hand, NAS
systems require more processing power, because they must also handle all file-system
related operations, which requires more resources than pure block-level handling.
2.6 Benefits of convergence in storage and network
The term convergence has had various meanings in the history of networking. Convergence
is used generally to refer to the notion of combining or consolidating storage traffic and
traditional data traffic on a single network (or fabric). Because Fibre Channel (FC) storage
area networks (SANs) are generally called “fabrics,” the term fabric is now also commonly
used for an Ethernet network that carries storage traffic.
Convergence of network and storage consolidates data and storage traffic into a single,
highly scalable, highly available, high-performance, and highly reliable storage network
infrastructure.
Converging storage and networking brings many benefits, which outweigh the initial
investment. Here are some of the key benefits:
Simplicity, cost savings, and reliability
Scalability and easier-to-move workloads in the virtual world
Low latency and higher throughput
One single, high-speed network infrastructure for both storage and network
Better utilization of server resources and simplified management
To get an idea of how traditional and converged data centers differ, see the following figures.
Both figures include three major components - servers, storage, and the networks that
establish the connections. The required number of switches in each network depends on the
size of the environment.
Figure 2-1 on page 20 shows a simplified picture of a traditional data center without
convergence. Either servers or storage devices might require multiple interfaces to connect to
the different networks. In addition, each network requires dedicated switches, which leads to
higher investments in multiple devices and more efforts for configuration and management.
Figure 2-1 Conceptual view of a data center without implemented convergence
Using converged network technologies, as shown by the converged data center in Figure 2-2,
only one converged enhanced Ethernet network is needed. This results in fewer required
switches and decreases the number of devices that require management, a reduction that
might improve TCO. Servers, clients, and storage devices also require only one type of
adapter to be connected. For redundancy, performance, or segmentation purposes, it might
still make sense to use multiple adapters.
Figure 2-2 Conceptual view of a converged data center
2.7 Challenge of convergence
Fibre Channel SANs have different design requirements than Ethernet. To provide a better
understanding, they can be compared with two different transportation systems. Each system
moves people or goods from point A to point B.
Railroads
Trains run on rails and tracks. This can be compared with a Fibre Channel SAN.
Figure 2-3 Trains running on rails
Specific aspects of trains that map to network traffic are as follows:
The route is already defined by the rails (shortest path first).
All participating trains are registered and known (name server).
The network is isolated, but accidents (dropped packets) have a huge impact.
The number of trains in one track segment is limited (buffer-to-buffer credits for a lossless
connection).
Signals and railway switches all over the tracks define the allowed routes (zoning).
They have high capacity (a 2112-byte payload in a frame of up to 2148 bytes).
Roads
Cars can use roads with paved or even unpaved lanes. This can be compared with traditional
Ethernet traffic.
Figure 2-4 Cars using roads
Specific aspects of roads that map to network traffic are as follows:
An unknown number of participants may be using the road at the same time. Metering
lights can only be used as a reactive method to slow down traffic (no confirmation of
available receiving capacity before sending).
Accidents are more or less common and expected (packet loss).
All roads lead to Rome (no point-to-point topology).
Navigation is required to prevent moving in circles (the role of TRILL, Spanning Tree, or
SDN).
Everybody can join and hop on or off almost anywhere (no zoning).
They have limited capacity (a 1500-byte payload), while bigger buses and trucks can carry
more (jumbo frames).
Convergence approaches
Maintaining two transportation systems, with separate vehicles and different stations and
routes, is complex to manage and expensive. Convergence for storage and networks can
mean “running trains on the road”, to stay with the analogy. The two potential vehicles that
are enabled to run as trains on the road are iSCSI and Fibre Channel over Ethernet (FCoE).
iSCSI can be used on existing (lossy) and new (lossless) Ethernet infrastructures, with
different performance characteristics. FCoE, however, requires a lossless converged
enhanced Ethernet network, and it relies on additional functionality known from Fibre
Channel (for example, the name server and zoning).
The Emulex CNAs (Converged Network Adapters) that are used in compute nodes in the
Flex chassis can support either iSCSI or FCoE in their onboard ASIC - that is, in hardware.
Their configuration and use are described in the chapters that follow. Testing for this book
was done with FCoE as the storage protocol of choice, because it is more commonly used at
this time and because more configuration steps are required to implement FCoE in a Flex
environment than to implement iSCSI. Many of the scenarios presented in the following
chapters can readily be adapted for deployment in an environment that includes iSCSI
storage networking.
2.8 Conclusion
Convergence is the future. Network convergence can reduce cost, simplify deployment,
make better use of expensive resources, and enable a smaller data center infrastructure
footprint. The IT industry is adopting FCoE more rapidly because the technology is becoming
more mature and offers higher throughput (40/100 Gbps). Sooner or later, CIOs will realize
the cost benefits and advantages of convergence and will adopt storage and network
convergence more rapidly.
The bulk of the chapters of this book focus on the insights and capabilities of FCoE on IBM
Flex System and introduce the available IBM switches and storage solutions that support
converged networks. Most of the content of the previous book, which focused more on IBM
BladeCenter converged solutions, is still valid and is an integrated part of this book.
2.9 Fibre Channel over Ethernet protocol stack
FCoE assumes the existence of a lossless Ethernet, such as one that implements the Data
Center Bridging (DCB) extensions to Ethernet. This section highlights, at a high level, the
concepts of FCoE as defined in FC-BB-5. The EN4093R, CN4093, G8264, and G8264CS
switches support FCoE; the G8264 and EN4093R function as FCoE transit switches, while
the CN4093 and G8264CS have OmniPorts that can be set to function as either FC ports or
Ethernet ports, as specified in the switch configuration.
The basic notion of FCoE is that the upper layers of FC are mapped onto Ethernet, as shown
in Figure 2-5. The upper layer protocols and services of FC remain the same in an FCoE
deployment. Zoning, fabric services, and similar services still exist with FCoE.
Figure 2-5 FCoE protocol mapping
The difference is that the lower layers of FC are replaced by lossless Ethernet, which also
implies that FC concepts, such as port types and lower-layer initialization protocols, must be
replaced by new constructs in FCoE. Such mappings are defined by the FC-BB-5 standard
and are briefly addressed here.
(Figure 2-5 shows that the FC-2V, FC-3, and FC-4 layers of the Fibre Channel protocol stack
carry over unchanged to the FCoE protocol stack, while the lower layers - FC-0, FC-1,
FC-2P, and FC-2M - are replaced by the Ethernet PHY, the Ethernet MAC, and the FCoE
entity.)
Figure 2-6 shows another perspective on FCoE layering compared to other storage
networking technologies. In this figure, FC and FCoE layers are shown with other storage
networking protocols, including iSCSI.
Figure 2-6 Storage Network Protocol Layering
Based on this protocol structure, Figure 2-7 shows a conceptual view of an FCoE frame.
Figure 2-7 Conceptual view of an FCoE frame
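To make the frame layout concrete, the following Python sketch packs just the outer Ethernet
header of an FCoE frame, using the FCoE Ethertype (8906h) shown in Figure 2-7. The MAC
addresses are made-up placeholders; a real frame would carry a complete FC frame (SOF,
FC header, FC payload, CRC, EOF) plus FCoE control information as the Ethernet payload.

```python
import struct

FCOE_ETHERTYPE = 0x8906          # Ethertype that identifies an FCoE frame

# Placeholder MAC addresses (illustrative only)
dst_mac = bytes.fromhex("0efc00000001")
src_mac = bytes.fromhex("0efc00000002")

# Outer Ethernet header: destination MAC, source MAC, Ethertype
eth_header = struct.pack("!6s6sH", dst_mac, src_mac, FCOE_ETHERTYPE)

print(len(eth_header))           # 14-byte Ethernet header
print(eth_header.hex()[-4:])     # '8906'
```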
2.10 iSCSI
The iSCSI protocol allows for longer distances between a server and its storage when
compared to the traditionally restrictive parallel SCSI solutions or the newer serial-attached
SCSI (SAS). iSCSI technology can use a hardware initiator, such as a host bus adapter
(HBA), or a software initiator to issue requests to target devices. Within iSCSI storage
terminology, the initiator is typically known as a client, and the target is the storage device.
The iSCSI protocol encapsulates SCSI commands into protocol data units (PDUs) within the
TCP/IP protocol and then transports them over the network to the target device. The disk is
presented locally to the client as shown in Figure 2-8.
Figure 2-8 iSCSI architecture overview
The iSCSI protocol is a transport for SCSI over TCP/IP. Figure 2-6 on page 24 illustrates a
protocol stack comparison between Fibre Channel and iSCSI. iSCSI provides block-level
access to storage, as does Fibre Channel, but uses TCP/IP over Ethernet instead of Fibre
Channel protocol. iSCSI is defined in RFC 3720, which you can find at:
http://www.ietf.org/rfc/rfc3720.txt
iSCSI uses Ethernet-based TCP/IP rather than a dedicated (and different) storage area
network (SAN) technology. Therefore, it is attractive for its relative simplicity and usage of
widely available Ethernet skills. Its chief limitations historically have been the relatively lower
speeds of Ethernet compared to Fibre Channel and the extra TCP/IP encapsulation required.
With lossless 10 Gbps Ethernet now available, the attractiveness of iSCSI is expected to grow
rapidly. TCP/IP encapsulation will still be used, but 10 Gbps Ethernet speeds will dramatically
increase the appeal of iSCSI.
2.11 iSCSI versus FCoE
This section highlights the similarities and differences between iSCSI and FCoE. In most
cases, however, considerations other than purely technical ones will influence your decision
in choosing one over the other.
2.11.1 Key similarities
iSCSI and FCoE have the following similarities:
Both protocols are block-oriented storage protocols. That is, the file system logic for
accessing storage with either of them is on the computer where the initiator is, not on the
storage hardware. Therefore, they are both different from typical network-attached storage
(NAS) technologies, which are file oriented.
Both protocols implement Ethernet-attached storage.
Both protocols can be implemented in hardware, which is detected by the operating
system of the host as an HBA.
Both protocols can also be implemented by using software initiators which are available in
various server operating systems. However, this approach uses resources of the main
processor to perform tasks which would otherwise be performed by the hardware of an
HBA.
Both protocols can use the Converged Enhanced Ethernet (CEE, also referred to as Data
Center Bridging) standards to deliver “lossless” traffic over Ethernet.
Both protocols are alternatives to traditional FC storage and FC SANs.
2.11.2 Key differences
iSCSI and FCoE have the following differences:
iSCSI uses TCP/IP as its transport, and FCoE uses Ethernet. iSCSI can use media other
than Ethernet, such as InfiniBand, and iSCSI can use Layer 3 routing in an IP network.
Numerous vendors provide local iSCSI storage targets, some of which also support Fibre
Channel and other storage technologies. Relatively few native FCoE targets are available
at this time, which might allow iSCSI to be implemented at a lower overall capital cost.
FCoE requires a gateway function, usually called a Fibre Channel Forwarder (FCF),
which allows FCoE access to traditional FC-attached storage. This approach allows FCoE
and traditional FC storage access to coexist either as a long-term approach or as part of a
migration. The G8264CS and CN4093 switches can be used to provide FCF functionality.
iSCSI-to-FC gateways exist but are not required when a storage device is used that can
accept iSCSI traffic directly.
Except in the case of a local FCoE storage target, the last leg of the connection uses FC to
reach the storage. FC (at 8 Gbps and below) uses 8b/10b encoding, which means that
sending 8 bits of data requires transmitting 10 bits over the wire - a 25% overhead on the
network to prevent corruption of the data. 10 Gbps Ethernet uses 64b/66b encoding, which
has a far smaller overhead.
iSCSI includes IP headers and Ethernet (or other media) headers with every frame, which
adds overhead.
The largest payload that can be sent in an FCoE frame is 2112 bytes. iSCSI can use jumbo
frame support on Ethernet and send 9 KB or more in a single frame.
iSCSI has been on the market for several years longer than FCoE. Therefore, the iSCSI
standards are more mature than FCoE.
Troubleshooting FCoE end-to-end requires Ethernet networking skills and FC SAN skills.
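The encoding-overhead difference noted in the list above can be checked with quick
arithmetic:

```python
def line_overhead(data_bits: int, coded_bits: int) -> float:
    """Extra bits on the wire as a fraction of the data bits."""
    return (coded_bits - data_bits) / data_bits

# FC at 8 Gbps and below: 8b/10b encoding
print(line_overhead(8, 10))    # 0.25 -> 25% overhead
# 10 Gbps Ethernet: 64b/66b encoding
print(line_overhead(64, 66))   # 0.03125 -> ~3% overhead
```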
3.1 Enterprise Chassis I/O architecture
The Ethernet networking I/O architecture for the IBM Flex System Enterprise Chassis
includes an array of connectivity options for server nodes that are installed in the enclosure.
Users can decide to use a local switching model that provides superior performance, cable
reduction and a rich feature set, or use pass-through technology and allow all Ethernet
networking decisions to be made external to the Enterprise Chassis.
By far, the most versatile option is to use modules that provide local switching capabilities and
advanced features that are fully integrated into the operation and management of the
Enterprise Chassis. In particular, the EN4093 10Gb Scalable Switch module offers the
maximum port density, highest throughput, and most advanced data center-class features to
support the most demanding compute environments.
From a physical I/O module bay perspective, the Enterprise Chassis has four I/O bays in the
rear of the chassis. The physical layout of these I/O module bays is shown in Figure 3-1.
Figure 3-1 Rear view of the Enterprise Chassis showing I/O module bays
From a midplane wiring point of view, the Enterprise Chassis provides 16 lanes out of each
half-wide node bay (toward the rear I/O bays) with each lane capable of 16 Gbps or higher
speeds. How these lanes are used is a function of which adapters are installed in a node,
which I/O module is installed in the rear, and which port licenses are enabled on the I/O
module.
How the midplane lanes connect between the node bays up front and the I/O bays in the rear
is shown in Figure 3-2. The concept of an I/O module Upgrade Feature on Demand (FoD) is
also shown in Figure 3-2. From a physical perspective, an upgrade FoD in this context is a
bank of 14 ports and some number of uplinks that can be enabled and used on a switch
module. By default, all I/O modules include the base set of ports, and thus have 14 internal
ports, one each connected to the 14 compute node bays in the front. By adding an upgrade
license to the I/O module, it is possible to add more banks of 14 ports (plus some number of
uplinks) to an I/O module. The node needs an adapter that has the necessary physical ports
to connect to the new lanes enabled by the upgrades. Those lanes connect to the ports in the
I/O module enabled by the upgrade.
Figure 3-2 Sixteen lanes total of a single half-wide node bay toward the I/O bays
For example, if a node were installed with only the dual port LAN on system board (LOM)
adapter, only two of the 16 lanes are used (one to I/O bay 1 and one to I/O bay 2), as shown
in Figure 3-3 on page 30.
If a node was installed without LOM and two quad port adapters were installed, eight of the 16
lanes are used (two to each of the four I/O bays).
The midplane design can potentially provide up to 320 Gb of full duplex Ethernet bandwidth
(16 lanes x 10 Gb x 2) to a single half-wide node and over half a terabit (Tb) per second of
bandwidth to a full-wide node.
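The bandwidth arithmetic above can be sketched in a few lines of Python (illustrative only; the constant names are ours, and the lane counts and speeds come from the text):

```python
# Midplane bandwidth arithmetic for the Enterprise Chassis node bays.
LANES_PER_HALF_WIDE_BAY = 16  # lanes out of each half-wide node bay
LANE_SPEED_GB = 10            # current 10 Gb Ethernet lane speed
FULL_DUPLEX_FACTOR = 2        # count both transmit and receive

half_wide_gb = LANES_PER_HALF_WIDE_BAY * LANE_SPEED_GB * FULL_DUPLEX_FACTOR
full_wide_gb = half_wide_gb * 2  # a full-wide node spans two node bays

print(half_wide_gb)  # 320 Gb full duplex per half-wide node
print(full_wide_gb)  # 640 Gb: over half a terabit per full-wide node
```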
Figure 3-3 Dual port LOM connecting to ports on I/O bays 1 and 2 (all other lanes unused)
Today, the port density of the current I/O modules is limited in that only the first three lanes
are potentially available from the I/O module.
By default, each I/O module provides a single connection (lane) to each of the 14 half-wide
node bays up front. By adding port licenses, an EN2092 1Gb Ethernet Switch can offer two
1 Gb ports to each half-wide node bay, and an EN4093R 10Gb Scalable Switch, CN4093
10Gb Converged Scalable Switch or SI4093 System Interconnect Module can each provide
up to three 10 Gb ports to each of the 14 half-wide node bays. Because it is a one-for-one
14-port pass-through, the EN4091 10Gb Ethernet Pass-thru I/O module can only ever offer a
single link to each of the half-wide node bays.
As an example, if two 8-port adapters were installed and four I/O modules were installed with
all upgrades, the node has access to 12 x 10 Gb lanes (three to each switch). On each 8-port
adapter, two lanes are unavailable at this time.
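The per-module lane limits described above can be sketched as a small Python helper (a sketch only; the module-to-lane mapping follows the text, and the function name is hypothetical):

```python
# Maximum node-facing lanes per I/O module, per the port-density rules above.
MAX_LANES = {
    "EN2092": 2,   # 1 Gb switch: up to two ports per half-wide node bay
    "EN4093R": 3,  # 10 Gb scalable switch: up to three 10 Gb ports
    "CN4093": 3,   # converged scalable switch: up to three 10 Gb ports
    "SI4093": 3,   # system interconnect module: up to three 10 Gb ports
    "EN4091": 1,   # pass-thru: only ever a single link per node bay
}

def usable_lanes(modules, adapter_ports_per_module):
    """Lanes a node can use: limited by both the module's enabled lanes
    and the adapter ports wired toward each I/O bay."""
    return sum(min(MAX_LANES[m], adapter_ports_per_module) for m in modules)

# Two 8-port adapters across four fully upgraded 10 Gb modules: four adapter
# ports face each module, but each module exposes only three lanes.
print(usable_lanes(["EN4093R"] * 4, 4))  # 12 lanes; 2 per adapter unused
```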
Concerning port licensing, the default available upstream connections are also associated
with port licenses. For more information about these connections and the node-facing links,
see 3.2, “IBM Flex System Ethernet I/O modules” on page 31.
All I/O modules include a base set of 14 downstream ports, with the pass-through module
supporting only the single set of 14 server-facing ports. The Ethernet switching and
interconnect I/O modules support more than the base set of ports, with the additional ports
enabled by upgrades. For more information, see the respective I/O module section in 3.2,
“IBM Flex System Ethernet I/O modules” on page 31.
As of this writing, no I/O module and node adapter combination can use all 16 lanes between
a compute node bay and the I/O bays; however, the lanes exist to ensure that the Enterprise
Chassis can use capacity that becomes available in the future.
Beyond the physical aspects of the hardware, there are certain logical aspects that ensure
that the Enterprise Chassis can integrate seamlessly into any modern data center's
infrastructure.
Many of these enhancements, such as vNIC, VMready, and 802.1Qbg, revolve around
integrating virtualized servers into the environment. Fibre Channel over Ethernet (FCoE)
allows users to converge their Fibre Channel traffic onto their 10 Gb Ethernet network, which
reduces the number of cables and points of management that are necessary to connect the
Enterprise Chassis to the upstream infrastructures.
The wide range of physical and logical Ethernet networking options that are available today
and in development ensures that the Enterprise Chassis can meet the most demanding I/O
connectivity challenges now and as the data center evolves.
3.2 IBM Flex System Ethernet I/O modules
The IBM Flex System Enterprise Chassis features a number of Ethernet I/O module solutions
that provide a combination of 1 Gb and 10 Gb ports to the servers and 1 Gb, 10 Gb, and
40 Gb for uplink connectivity to the outside upstream infrastructure. The IBM Flex System
Enterprise Chassis ensures that a suitable selection is available to meet the needs of the
server nodes.
The following Ethernet I/O modules are available for deployment with the Enterprise Chassis:
3.2.1, “IBM Flex System Fabric EN4093 and EN4093R 10Gb Scalable Switches”
3.2.2, “IBM Flex System Fabric CN4093 10Gb Converged Scalable Switch” on page 36
3.2.3, “IBM Flex System Fabric SI4093 System Interconnect Module” on page 42
3.2.4, “I/O modules and cables” on page 46
These modules are described next.
3.2.1 IBM Flex System Fabric EN4093 and EN4093R 10Gb Scalable Switches
The EN4093 and EN4093R 10Gb Scalable Switches are primarily 10 Gb switches that can
provide up to 42 x 10 Gb node-facing ports, and up to 14 SFP+ 10 Gb and two QSFP+ 40 Gb
external upstream facing ports, depending on the applied upgrade licenses.
Note: The EN4093, non R, is no longer available for purchase.
A view of the face plate of the EN4093/EN4093R 10Gb Scalable Switch is shown in
Figure 3-4.
Figure 3-4 The IBM Flex System Fabric EN4093/EN4093R 10Gb Scalable Switch
As listed in Table 3-1, the switch is initially licensed with 14 internal 10 Gb ports and 10
external 10 Gb uplink ports enabled. More ports can be enabled with license options:
Upgrade 1 adds 14 internal 10 Gb ports and the two external 40 Gb uplink ports, and
Upgrade 2 adds another 14 internal 10 Gb ports and four more external SFP+ 10 Gb ports.
Upgrade 1 must be applied before Upgrade 2 can be applied.
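The FoD licensing model can be sketched as a small Python helper (illustrative only; the port counts follow Table 3-1, and the function and dictionary names are ours):

```python
# Port-enablement model for the EN4093/EN4093R, per Table 3-1.
BASE = {"internal": 14, "uplink_10g": 10, "uplink_40g": 0}
UPGRADES = {
    "Upgrade 1": {"internal": 14, "uplink_10g": 0, "uplink_40g": 2},
    "Upgrade 2": {"internal": 14, "uplink_10g": 4, "uplink_40g": 0},
}

def enabled_ports(applied):
    """Total enabled ports for a list of applied upgrade licenses."""
    # Upgrade 2 requires Upgrade 1, per the licensing rule above.
    if "Upgrade 2" in applied and "Upgrade 1" not in applied:
        raise ValueError("Upgrade 2 requires Upgrade 1")
    totals = dict(BASE)
    for name in applied:
        for port_type, count in UPGRADES[name].items():
            totals[port_type] += count
    return totals

print(enabled_ports(["Upgrade 1", "Upgrade 2"]))
# {'internal': 42, 'uplink_10g': 14, 'uplink_40g': 2}
```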
Table 3-1 IBM Flex System Fabric EN4093 10Gb Scalable Switch part numbers and port upgrades

Part      Feature       Product description                        Total ports that are enabled
number    code (a)                                                 Internal  10 Gb uplink  40 Gb uplink
49Y4270   A0TB / 3593   IBM Flex System Fabric EN4093 10Gb         14        10            0
                        Scalable Switch (10x external 10 Gb
                        uplinks, 14x internal 10 Gb ports)
05Y3309   A3J6 / ESW7   IBM Flex System Fabric EN4093R 10Gb        14        10            0
                        Scalable Switch (10x external 10 Gb
                        uplinks, 14x internal 10 Gb ports)
49Y4798   A1EL / 3596   IBM Flex System Fabric EN4093 10Gb         28        10            2
                        Scalable Switch (Upgrade 1) (adds 2x
                        external 40 Gb uplinks, 14x internal
                        10 Gb ports)
88Y6037   A1EM / 3597   IBM Flex System Fabric EN4093 10Gb         42        14            2
                        Scalable Switch (Upgrade 2; requires
                        Upgrade 1) (adds 4x external 10 Gb
                        uplinks, 14x internal 10 Gb ports)

a. The first feature code that is listed is for configurations that are ordered through System x sales channels (HVEC)
by using x-config. The second feature code is for configurations that are ordered through the IBM Power Systems
channel (AAS) by using e-config.

The IBM Flex System Fabric EN4093 and EN4093R 10Gb Scalable Switches have the
following features and specifications:
Internal ports:
– A total of 42 internal full-duplex 10 Gigabit ports (14 ports are enabled by default;
optional FoD licenses are required to activate the remaining 28 ports).
47. Chapter 3. IBM Flex System networking architecture and portfolio 33
Draft Document for Review May 1, 2014 2:10 pm Flex System networking offerings.fm
– Two internal full-duplex 1 GbE ports that are connected to the chassis management
module.
External ports:
– A total of 14 ports for 1 Gb or 10 Gb Ethernet SFP+ transceivers (support for
1000BASE-SX, 1000BASE-LX, 1000BASE-T, 10GBASE-SR, or 10GBASE-LR) or
SFP+ copper direct-attach cables (DAC). There are 10 ports enabled by default and an
optional FoD license is required to activate the remaining four ports. SFP+ modules
and DAC cables are not included and must be purchased separately.
– Two ports for 40 Gb Ethernet QSFP+ transceivers or QSFP+ DACs (these ports are
disabled by default; an optional FoD license is required to activate them). QSFP+
modules and DAC cables are not included and must be purchased separately.
– One RS-232 serial port (mini-USB connector) that provides another means to
configure the switch module.
Scalability and performance:
– 40 Gb Ethernet ports for extreme uplink bandwidth and performance
– Fixed-speed external 10 Gb Ethernet ports to use 10 Gb core infrastructure
– Support for 1G speeds on uplinks via proper SFP selection
– Non-blocking architecture with wire-speed forwarding of traffic and aggregated
throughput of 1.28 Tbps
– Media access control (MAC) address learning:
• Automatic update
• Support of up to 128,000 MAC addresses
– Up to 128 IP interfaces per switch
– Static and LACP (IEEE 802.1AX; previously known as 802.3ad) link aggregation with
up to:
• 220 Gb of total uplink bandwidth per switch
• 64 trunk groups
• 16 ports per group
– Support for cross switch aggregations via vLAG
– Support for jumbo frames (up to 9,216 bytes)
– Broadcast/multicast storm control
– IGMP snooping to limit flooding of IP multicast traffic
– IGMP filtering to control multicast traffic for hosts that participate in multicast groups
– Configurable traffic distribution schemes over aggregated links
– Fast port forwarding and fast uplink convergence for rapid STP convergence
Availability and redundancy:
– VRRP for Layer 3 router redundancy
– IEEE 802.1D Spanning Tree Protocol (STP) to provide L2 redundancy, including support for:
• Multiple STP (MSTP) for topology optimization; up to 32 STP instances are
supported by a single switch (previously known as 802.1s)
• Rapid STP (RSTP) provides rapid STP convergence for critical delay-sensitive
traffic, such as voice or video (previously known as 802.1w)
• Per-VLAN Rapid STP (PVRST) to seamlessly integrate into Cisco infrastructures