1. Introduction to
Cisco UCS and
Userspace NIC (usNIC)
Argonne National Laboratory
September 2, 2014
Dave Goodell
dgoodell@cisco.com
2. Record-setting Intel Ivy Bridge 1U and 2U servers (with GPU support)
• Low-latency Ethernet: 1.6 usecs
• Up to 1.5 TB RAM (yes, really!)
• 10 & 40 Gbps top-of-rack & core switching: 190 nsecs
• 10 & 40 Gbps!
3. • Integrated Design: performance optimized for any type of workload
• Service Profiles: agility and reduced time to deploy and provision applications
• UCS Manager: role-based management, automation, ease of integration
• UCS Central: centralized, multi-domain management, alerting and visibility
• Unified Fabric: simplified infrastructure
• Virtualized I/O: security isolation per application, scale, improved performance
• Form Factor Independence: supports both blades and rack-mount servers in a single domain
• Low Latency: low latency over industry-standard Ethernet networking
4. Consolidating the messaging/interconnect network
• Traditional network: separate fabrics for the Ethernet LAN, Fibre Channel storage, and the InfiniBand cluster interconnect
• Unified Fabric: a single Ethernet fabric (DCB, FCoE & low latency) carries LAN, FC, and cluster traffic
5. • Benefits
• Low-latency Ethernet delivers high performance while retaining all the advantages of managing a unified network fabric
• HPC compute clusters can coexist with enterprise IT under the same management framework
• Leverage true hybrid solutions from all IT resources
• Simplifies procurement
• Accelerates deployment
• Non-intrusive
• Extends the product life cycle / reusability
Lower CAPEX and OPEX
6. One wire to rule them all:
• OS mgmt traffic (e.g., ssh)
• Server hardware mgmt (Cisco CIMC, rich XML interface, unified management)
• File system / IO traffic
• MPI / application traffic (HPC networking / routing)
All carried over 10 & 40 Gbps Ethernet with QoS
7. Host port / switch port view: one physical wire carved into vNICs
• eth0: VLAN 27, MTU 1500 B, bandwidth 100 Mbps
• eth1: VLAN 42, MTU 9000 B, bandwidth 2 Gbps
• eth2: VLAN 64, MTU 9000 B, bandwidth not limited
Each vNIC is a PCIe physical function with isolated HW resources (virtual functions, RX/TX queue pairs); e.g., an SSH process uses eth0 while an MPI process uses eth2.
8. Characteristics
• Up to 20 chassis (160 blades)
• 3,840 CPU cores
• 20 Gbps bandwidth per blade; burst capacity up to 80 Gbps
• Single-wire management
• Enterprise & HPC
• Pod architecture, scalable
• 96 or 48 ports
• 5.3 usecs any-to-any latency
• Up to 82.94 TeraFLOPs (Intel Ivy Bridge)
9. Rack servers (all with 3rd-party GPU expansion)
• C220 M3: 1RU dual-socket rack server (up to 384 GB RAM)
• C240 M3: 2RU dual-socket compute or storage rack server
• C420 M3: 2RU dual- or quad-socket server (up to 1.5 TB RAM)
10. Port-to-Port Latency
• Nexus 3548: 48 ports x 10 Gbps + 12 x 40 Gbps, 190 nsecs
• Nexus 3172PQ: 72 ports x 10 Gbps + 6 x 40 Gbps, <500 nsecs
• Nexus 3132Q: 32 ports x 40 Gbps, <500 nsecs
• Nexus 9000 (9504: 144 ports x 40 Gbps; 9508: 288 ports; 9516: 576 ports), <500 nsecs
12. App-to-App Latency Components
• usNIC (kernel bypass using SR-IOV): 2.02 usecs
• TCP/IP (kernel overhead): 9.42 usecs
The latency is split across middleware, kernel, NIC, and network; usNIC removes the kernel from the data path, with HW resource isolation provided by the IOMMU. The same VIC supports both TCP/IP and usNIC: dual functionality!
13. • Direct access to NIC hardware from Linux userspace
- Operating system bypass via the Linux Verbs API (UD)
• Utilizes the Cisco Virtual Interface Card (VIC) for ultra-low Ethernet latency
- 2nd-generation 80 Gbps Cisco ASIC
- 2 x 10 Gbps or 2 x 40 Gbps Ethernet ports
- PCIe and mezzanine form factors
• Half-round-trip (HRT) ping-pong latencies (Intel E5-2690 v2 servers):
- Raw, back to back: 1.57 μs
- MPI, back to back: 1.85 μs
- MPI through a Nexus 3548: 2.02 μs
- These numbers keep going down
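To make the "OS bypass via the Linux Verbs API (UD)" point concrete, here is a minimal, hypothetical bootstrap using generic libibverbs calls. It is a sketch only, not Cisco's plugin code; the queue sizes and the choice of device 0 are assumptions.

```c
/* Minimal UD queue-pair bootstrap via libibverbs (illustrative sketch;
 * not the actual Cisco usNIC plugin code).  Compile with -libverbs. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) {
        fprintf(stderr, "no verbs devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);  /* e.g., a usNIC device */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);               /* protection domain */
    struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);
    /* (error checks on ctx/pd/cq elided for brevity) */

    /* UD ("unreliable datagram") QP: each send/receive is an independent
     * datagram, which maps naturally onto Ethernet/UDP frames. */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_UD,
        .cap = { .max_send_wr = 64, .max_recv_wr = 64,
                 .max_send_sge = 1, .max_recv_sge = 1 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    if (!qp) {
        perror("ibv_create_qp");
        return 1;
    }
    printf("created UD QP %u on %s\n", qp->qp_num, ibv_get_device_name(devs[0]));

    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

In real use the QP would still be transitioned through the INIT/RTR/RTS states with ibv_modify_qp() before traffic flows; usNIC exposes itself to applications through exactly this UD verbs interface.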
14. • 2nd-generation VIC:
- Can present itself 256 times on the PCI bus
- Has enough hardware queues / buffering for 256 actual NICs
• Created for virtualization; designed for hypervisor bypass
• Intent:
- Each vNIC assigned to a single virtual machine, which can therefore bypass the hypervisor
- "Bare metal" network performance in a VM
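As a generic way to observe this kind of PCI multiplexing from Linux (not VIC-specific), an SR-IOV-capable function advertises its VF limit through sysfs; the PCI address below is a placeholder.

```c
/* Read how many SR-IOV virtual functions a PCI device advertises.
 * The BDF "0000:0b:00.0" is a placeholder -- substitute your device. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:0b:00.0/sriov_totalvfs";
    int totalvfs = 0;

    FILE *f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }
    if (fscanf(f, "%d", &totalvfs) != 1) {
        fclose(f);
        return 1;
    }
    fclose(f);

    printf("device supports up to %d virtual functions\n", totalvfs);
    return 0;
}
```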
15. The VIC presents each vNIC to the host as its own PCI physical function (PF) with its own MAC address (aa:bb:cc:dd:ee:fa through aa:bb:cc:dd:ee:ff in this example); all of them share the card's two physical ports.
16. Traditional virtualized I/O: each VM's app goes through the guest kernel and guest driver to the hypervisor's virtual switch and host driver, which drive the VIC's PCI physical functions (PFs). The data path crosses the hypervisor for every packet.
17. With SR-IOV, a VM's guest driver can attach directly to a PCI virtual function (VF) on the VIC, so its data path bypasses the hypervisor's virtual switch; VMs without a VF still reach the VIC through the virtual switch and host driver on the PF.
18. usNIC applies the same mechanism one level further: a user process's userspace driver (whether in a VM or on bare metal) talks directly to a PCI VF on the VIC, while the host OS's TCP/IP stack keeps using the PF through the normal host driver.
19. TCP/IP vs. usNIC software stacks
• TCP/IP: application → userspace sockets library → (kernel) TCP stack → general Ethernet driver → Cisco VIC driver → Cisco VIC hardware
• usNIC: application → userspace verbs library → Cisco VIC hardware on the send and receive fast path; the Verbs IB core and the Cisco usNIC kernel driver are only involved in bootstrapping and setup
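A hedged sketch of what the "send and receive fast path" looks like in generic verbs terms: posting a UD send and busy-polling its completion, with no system call involved. The address handle, remote QP number, and Q_Key are assumed to have been exchanged during the bootstrapping/setup phase.

```c
/* Illustrative UD fast-path sketch: post one send and poll for its
 * completion.  No system calls are required on this path; the verbs
 * library rings the NIC doorbell from userspace. */
#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* 'qp', 'cq', 'mr', 'buf', 'ah', 'remote_qpn', and 'qkey' are assumed to
 * have been created/exchanged during bootstrapping (see earlier sketch). */
int send_one(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
             void *buf, size_t len, struct ibv_ah *ah,
             uint32_t remote_qpn, uint32_t qkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,            /* from ibv_reg_mr() */
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.ud.ah          = ah,
        .wr.ud.remote_qpn  = remote_qpn,
        .wr.ud.remote_qkey = qkey,
    };
    struct ibv_send_wr *bad = NULL;

    if (ibv_post_send(qp, &wr, &bad))
        return -1;

    /* Busy-poll the completion queue -- also a pure userspace operation. */
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);

    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}
```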
20. MPI (through the userspace verbs library) directly injects L2 frames, carrying UDP/IP payloads, into the Cisco VIC hardware, and receives L2 frames directly from the VIC.
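Since the injected L2 frames carry ordinary Ethernet/IPv4/UDP headers, building one is plain struct packing. The sketch below is illustrative only: addresses and ports are placeholders, and the IP checksum is left for the NIC or a later step.

```c
/* Sketch of building an Ethernet + IPv4 + UDP frame in a userspace buffer
 * before handing it to the NIC.  All addresses/ports are placeholders. */
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <net/ethernet.h>
#include <netinet/ip.h>
#include <netinet/udp.h>

size_t build_frame(uint8_t *frame, const uint8_t dst_mac[6],
                   const uint8_t src_mac[6], uint32_t src_ip, uint32_t dst_ip,
                   uint16_t src_port, uint16_t dst_port,
                   const void *payload, size_t payload_len)
{
    struct ether_header *eth = (struct ether_header *)frame;
    struct iphdr *ip  = (struct iphdr *)(frame + sizeof(*eth));
    struct udphdr *udp = (struct udphdr *)((uint8_t *)ip + sizeof(*ip));
    uint8_t *data = (uint8_t *)udp + sizeof(*udp);

    memset(frame, 0, sizeof(*eth) + sizeof(*ip) + sizeof(*udp));

    memcpy(eth->ether_dhost, dst_mac, 6);
    memcpy(eth->ether_shost, src_mac, 6);
    eth->ether_type = htons(ETHERTYPE_IP);

    ip->version  = 4;
    ip->ihl      = 5;                                   /* 20-byte header */
    ip->tot_len  = htons(sizeof(*ip) + sizeof(*udp) + payload_len);
    ip->ttl      = 64;
    ip->protocol = IPPROTO_UDP;
    ip->saddr    = src_ip;                              /* network byte order */
    ip->daddr    = dst_ip;
    ip->check    = 0;   /* checksum left to the NIC or filled in later */

    udp->source = htons(src_port);
    udp->dest   = htons(dst_port);
    udp->len    = htons(sizeof(*udp) + payload_len);
    udp->check  = 0;                                    /* optional for IPv4 */

    memcpy(data, payload, payload_len);
    return sizeof(*eth) + sizeof(*ip) + sizeof(*udp) + payload_len;
}
```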
21. Each MPI process owns its own queue pairs (QPs) on the SR-IOV-capable VIC: outbound L2 frames are injected directly, the NIC's classifier steers inbound L2 frames to the right QP, and the x86 chipset's VT-d I/O MMU confines each process's DMA to its own memory.
22. Inside the VIC: two physical functions (PFs), one per physical port, with MAC addresses aa:bb:cc:dd:ee:fe and aa:bb:cc:dd:ee:ff; each PF carries multiple virtual functions (VFs), each with its own queue pairs (QPs).
23. Each MPI process is bound to VFs and QPs under one of the VIC's PFs (each PF maps to one physical port), with the Intel IO MMU sitting between the processes and the card so that a process can only touch the VFs, QPs, and memory assigned to it.
24. • Used for virtual-to-physical memory translation
• The usnic verbs driver programs (and de-programs) the IOMMU
• The userspace process and the VIC both work with virtual addresses; the Intel IO MMU translates the VIC's DMA accesses to physical RAM
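The IOMMU programming is driven by memory registration. A hedged sketch of the userspace side using the generic verbs call (the driver's IOMMU mapping happens underneath and is not shown):

```c
/* Registering a buffer pins it and lets the driver set up IOMMU mappings
 * so the NIC can DMA directly to/from it.  Illustrative sketch only. */
#include <stdlib.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len, void **buf_out)
{
    void *buf = NULL;
    /* Page-aligned allocations map cleanly onto IOMMU pages. */
    if (posix_memalign(&buf, 4096, len) != 0)
        return NULL;

    /* ibv_reg_mr() pins the pages and returns lkey/rkey handles that are
     * later placed in scatter/gather entries on the fast path. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) {
        free(buf);
        return NULL;
    }
    *buf_out = buf;
    return mr;                 /* pair with ibv_dereg_mr() + free() later */
}
```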
26. • Do you know what these are?
MAC address
IP Subnet
ARP
GID
LID
GRH
27. • Manage your Ethernet network however you want
• Manage and monitor UDP/IP traffic with standard tools
• Can use IP routing + ECMP to create spine+leaf (Clos) networks
• Incrementally grow deployments without rejiggering existing sub-cluster subnet config
• No additional cost for IP: Cisco switches route L2/L3 at the same speed
28. • Design Principle: behave like the OS network stack as much as possible!
• Examples
Routing
ARP
UDP/IP port usage + visibility
MAC in L2 frames
• Can’t always achieve full parity
exotic routing configurations (e.g., ip rule add blackhole …)
tcpdump (no OS in datapath*)
29. QP creation (MPI → libibverbs → libusnic_verbs in userspace, usnic_verbs.ko in the kernel)
1. The application calls ibv_create_qp()
2. The userspace library allocates a full Linux UDP socket, so the port is reserved in the OS tables and shows up in lsof/netstat
3. The socket is passed to the kernel module with the create_qp command
4. The kernel module bumps the socket's refcount before installing the hardware filter, which prevents the socket from being freed before QP destruction
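A hedged userspace-side sketch of step 2: open an ordinary UDP socket so the kernel reserves a real port (which is why the QP shows up in lsof/netstat), then learn which port was chosen. The handoff to usnic_verbs.ko in step 3 is driver-specific and not shown.

```c
/* Sketch of allocating a "real" UDP port the way step 2 describes: an
 * ordinary socket is created and bound, so the port is reserved in the OS
 * tables and visible in lsof/netstat.  The fd would then be handed to the
 * kernel module as part of the (driver-specific) create_qp command. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int alloc_udp_port(uint16_t *port_out)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = 0;                    /* let the kernel pick a free port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    socklen_t len = sizeof(addr);
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }
    *port_out = ntohs(addr.sin_port);     /* this is the QP's UDP port */
    return fd;                            /* keep open for the QP's lifetime */
}
```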
30. • Open MPI natively supports multi-rail
• Open MPI "automagic" configuration philosophy (when possible)
• VICs have 2 ports, and a server can have more than one VIC
• Want to avoid artificial contention: pair local interfaces with remote interfaces
• A remote MPI process might be on the same subnet, or it might not
• Nontrivial software problem
31. Example interface pairing: MPI process P1 runs on Host A (NICs A1 and A2) and P2 on Host B (NICs B1 and B2). Before pairing, every local interface can potentially reach every remote interface; Open MPI then selects one of the valid pairings (e.g., A1-B1 with A2-B2, or A1-B2 with A2-B1) so that each local interface is paired with exactly one remote interface.
32. Example with two subnets: Host A's NICs A1/A2 and Host B's NICs B1/B2, together with NICs R1a/R2a and R1b/R2b, are spread across subnets S1 and S2, all hanging off a switch that does not need L3 capability.
33. Matching logic must watch for sub-optimal pairings
Suppose A1 can reach B1 and B2, but A2 can only reach B1:
• Case 1 (sub-optimal): pairing A1 with B1 leaves A2 unable to pair with any interface on Host B, reducing aggregate bandwidth
• Case 2 (desired): pairing A1 with B2 and A2 with B1 lets both Host A interfaces pair with Host B interfaces
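To illustrate why the matching logic has to look beyond "first reachable wins", here is a generic maximum-bipartite-matching sketch (Kuhn's augmenting-path algorithm) over a reachability matrix. This is not Open MPI's actual usnic pairing code; MAX_NICS and the reach[][] encoding are made up for the example.

```c
/* Generic maximum bipartite matching over a NIC reachability matrix
 * (Kuhn's augmenting-path algorithm).  This is NOT Open MPI's actual
 * pairing code -- just an illustration of why "first reachable wins"
 * greedy pairing can strand interfaces (Case 1 above) while an
 * augmenting-path search finds the Case 2 pairing. */
#include <stdbool.h>
#include <string.h>

#define MAX_NICS 8                       /* illustrative bound */

static int match_of_remote[MAX_NICS];    /* remote NIC j -> local NIC, or -1 */

static bool try_pair(int local, int n_remote,
                     bool reach[MAX_NICS][MAX_NICS], bool visited[MAX_NICS])
{
    for (int j = 0; j < n_remote; j++) {
        if (!reach[local][j] || visited[j])
            continue;
        visited[j] = true;
        /* Take remote j if it is free, or if its current owner can be
         * re-paired with some other reachable remote NIC. */
        if (match_of_remote[j] < 0 ||
            try_pair(match_of_remote[j], n_remote, reach, visited)) {
            match_of_remote[j] = local;
            return true;
        }
    }
    return false;
}

/* Returns the number of local NICs that ended up paired. */
int pair_interfaces(int n_local, int n_remote, bool reach[MAX_NICS][MAX_NICS])
{
    int paired = 0;
    memset(match_of_remote, -1, sizeof(match_of_remote));
    for (int i = 0; i < n_local; i++) {
        bool visited[MAX_NICS] = { false };
        if (try_pair(i, n_remote, reach, visited))
            paired++;
    }
    return paired;
}
```

For the example above (A1 reaches B1 and B2, A2 reaches only B1), the augmenting-path search re-pairs A1 onto B2 so that both Host A interfaces end up matched, which is the Case 2 outcome.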
35. 1.88 μs on this Sandy Bridge (SB) machine
37. • Everything above the firmware is open source
• Open MPI
- Distributed in Cisco Open MPI v1.6.5 (soon to be v1.8.2)
- Upstream in Open MPI v1.7.3 and beyond (current stable is v1.8.1)
• libibverbs plugin
• Verbs kernel module
38. • 3rd-generation VIC
- 2 x 40G and PCIe gen 3
- More MPI offload to hardware
• Software update (expected this week)
- Upgrade the transport from a custom L2 protocol to UDP
- Key rationale: Cisco switches L2 and L3 at the same speed
- Allows switching usNIC traffic around the data center
- Allows easier monitoring and policy control of usNIC traffic
- Kernel + userspace support for RHEL 7.0, SLES 12
- Open MPI optimizations for the 3rd-generation VIC
Speaker notes: UCS is Cisco's x86 server line. It offers both blade and rack servers with a focus on manageability, virtualization, networking, and performance. It's all designed to integrate smoothly with Cisco's switching products. I'm really here to talk about usNIC, our low-latency Ethernet solution for HPC.
N3K: 48 ports of 10 Gbps, 12 ports of 40 Gbps, 1RU
N6K: 384 ports of 10 Gbps, or 96 ports of 40 Gbps, 4RU. Many innovative features in UCS since we launched in 2009. Simplifies deployment and management by cutting out specialized networks. Saves costs by reducing the number of expensive adapters that need to be plugged into a server and reducing the number of cables and switches that need to be purchased and installed. usNIC allows customers to finally take control of their HPC resources and save time, energy, and money by empowering IT to do what only scientists and researchers have been doing with compute clusters. This technology also enables HPC on demand, in that the same VIC which has already demonstrated world-record performance in the enterprise now enables the speed HPC applications require. Customers can now provision compute at will from a single point over a single network fabric. The trick is in VLANs and QoS, allowing you to carve that single wire into separate slices. Could poll the audience about Ethernet switch latencies. <Main point: approximately 85% of the end-to-end latency is within the server, so let's tackle the big-ticket item>
<Click> Latency within the application depends on the application and the way it has been written and designed
<Click> The middleware layer is a big contributor as well, often taking approximately 20 usecs
<Click> The kernel protocol processing is responsible for at least another 6 usecs
<Click> The adapter itself adds between 3 and 6 usecs depending on the HW vendor's design and implementation
<Click> Finally, the network elements between 2 servers can add up to 5 usecs of latency per hop
The breakdown of these latency elements shows that approximately 85% of the latency (not counting the application latency itself) is within the server. The network only contributes 15% of the total end-to-end application latency. At Cisco, our target is to reduce the overall latency, and we are taking a holistic view in our approach. All over *standard* Ethernet (though the VIC is required). VT-d: Virtualization Technology for Directed I/O
IO MMU: Input / output memory management unit
SR-IOV: Single Root Input/Output Virtualization. Measurements taken on E5-2690 0 @ 2.90 GHz CPUs (Sandy Bridge) with Icehouse 40 GbE cards (PCIe Gen2, x16).