Cisco EuroMPI'13 vendor session presentation
1. Why Cisco is Awesome for Your Next HPC Cluster
Jeff Squyres
Cisco Systems, Inc.
© 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public
2.
Yes, we sell servers now
3.
• Cisco UCS servers: record-setting Intel Ivy Bridge 1U and 2U servers
• Cisco 2 x 10Gb VIC: ultra-low latency Ethernet (yes, really!)
• Cisco 40Gb Nexus switches: 40Gb top-of-rack and core switches
4.
Cisco UCS: Many Server Form Factors, One System
Rack servers:
• UCS C220 M3: ideal for HPC compute-intensive applications (2-socket)
• UCS C240 M3: perfect as HPC cluster head nodes or I/O nodes (2-socket)
• UCS C420 M3: 4-socket rack server for large-memory compute workloads
Blade servers:
• UCS B200 M3: 2-socket blade for HPC performance
• UCS B420 M3: 4-socket blade for large-memory compute workloads
Industry-leading compute without compromise
5.
Market Appetite for Innovation Fuels UCS Growth
Demand for data center innovation has vaulted Cisco Unified Computing System (UCS) to the #2 leader in the fast-growing segment of the x86 server market.
• Worldwide x86 server blade market share: UCS is #2 and climbing
• UCS is impacting the growth of established vendors like HP; legacy offerings are flat-lining or in decline
• Cisco's growth is out-pacing the market: customers have shifted 19.3% of the global x86 blade server market to Cisco, and over 26% in the Americas
Source: IDC Worldwide Quarterly Server Tracker, Q1 2013 Revenue Share, May 2013
6.
UCS #2 in Only Four Years
x86 Server Blade Market Share, Q1CY13¹
• Advanced to #2 worldwide in x86 blades with 19.3%
• Maintained #2 in N. America (27.9%) and #2 in the US (28.3%)¹
• UCS #2 in the Americas with 26.9%
• UCS x86 blade server revenue grew 35% Y/Y in Q1CY13¹
• UCS momentum is fueled by game-changing innovation; Cisco is quickly passing established players
[Bar charts, 0-50% scale: worldwide share (HP, Cisco, IBM, Dell, NEC, Hitachi, Fujitsu, Oracle) and Americas share (HP, Cisco, IBM, Dell, SGI, Oracle)]
Source: ¹ IDC Worldwide Quarterly Server Tracker, Q1 2013, May 2013, Revenue Share
7.
• Best CPU performance: 16 world records
• Best virtualization & cloud performance: 8 world records
• Best database performance: 9 world records
• Best enterprise application performance: 18 world records
• Best enterprise middleware performance: 14 world records
• Best HPC performance: 15 world records
8.
One wire to rule them all:
• Commodity traffic (e.g., ssh)
• Cluster / hardware management
• File system / IO traffic
• MPI traffic
10G or 40G
with real QoS
9.
Low latency, high density 10/40Gb switches
Cisco Nexus: years of experience rolled into dependable solutions
• Nexus 3548 (low latency): 190ns port-to-port latency (L2 and L3); created for HPC / HFT; 48 10Gb / 12 40Gb ports
• Nexus 6004 (high density): 1us port-to-port latency; 384 10Gb / 96 40Gb ports
10.
Two-tier spine/leaf fabric
Characteristics:
• 3 hops
• Low oversubscription or non-blocking
• < ~3.5 usecs depending on config and workload
• 10G or 40G capable
• Spine: 4 to 16 wide
• Leaf: determined by spine density

Fabric (Spine - Leaf)       | Port Scale    | Oversub | Latency      | Forwarding      | Spines | Leafs
10G Fabric (6004 - 6001)    | 18,432 x 10G  | 3:1     | ~3 usecs     | Cut-through     | 16     | 384
40G Fabric (6004 - 6004)    | 7,680 x 40G   | 5:1     | ~3 usecs     | Cut-through     | 16     | 96
Mixed Fabric (6004 - 6001)  | 4,680 x 10G   | 3:1     | ~3 usecs     | Store & forward | 4      | 96
10G Fabric (6004 - 3548)    | 12,288 x 10G  | 3:1     | ~1.5 usecs   | Cut-through     | 16     | 384
40G Fabric (6004 - 3548)    | 1,152 x 40G   | 1:1     | ~1.5 usecs   | Cut-through     | 6      | 96
Mixed Fabric (6004 - 3548)  | 3,072 x 10G   | 3:1     | ~1.5 usecs   | Store & forward | 4      | 96
…many other configurations are also possible
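The first row of the table ("10G Fabric 6004 - 6001": 16 spines, 384 leaves, 18,432 x 10G, 3:1) can be sanity-checked with simple arithmetic; this sketch assumes each Nexus 6001 leaf offers 48 x 10G host ports plus 4 x 40G uplinks, and each Nexus 6004 spine offers 96 x 40G ports:

```python
# Sanity check of the "10G Fabric 6004 - 6001" row (16 spines, 384 leaves).
spine_40g_ports = 96            # Nexus 6004: 96 x 40G ports
leaves = spine_40g_ports * 4    # each spine 40G port breaks out into 4 x 10G,
                                # one 10G link per leaf -> 384 leaves
host_ports = leaves * 48        # Nexus 6001: 48 x 10G host-facing ports
downlink_gbps = 48 * 10         # host-facing bandwidth per leaf
uplink_gbps = 4 * 40            # uplink bandwidth per leaf (4 x 40G)

print(host_ports)                    # 18432 -> matches the table's port scale
print(downlink_gbps / uplink_gbps)   # 3.0   -> the table's 3:1 oversubscription
```

With 16 such spines, each leaf's 4 x 40G uplinks break out into 16 x 10G, one link to every spine.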
11.
Three-tier spine/leaf fabric (Spine2 / Spine1 / Leaf)
Characteristics:
• 3 hops within a pod; 5 hops for DC east-west traffic
• Low oversubscription or non-blocking
• < ~3.5 usecs depending on config and workload
• 10G or 40G capable
• Two spine layers

Fabric (Spine2 - Spine1 - Leaf)     | Port Scale    | Oversub | Latency         | Forwarding      | Spine2 | Spine1 | Leafs
10G Fabric (6004 - 6004 - 6001)     | 55,296 x 10G  | 3:1     | ~3-5 usecs      | Cut-through     | 48     | 16 x 6 | 192
40G Fabric (6004 - 6004 - 6004)     | 23,040 x 40G  | 5:1     | ~3-5 usecs      | Cut-through     | 48     | 16     | 48
Mixed Fabric (6004 - 6004 - 6001)   | 18,432 x 10G  | 3:1     | ~3-5 usecs      | Store & forward | 32     | 4 x 8  | 48
10G Fabric (6004 - 6004 - 3548)     | 24,576 x 10G  | 2:1     | ~1.5-3.5 usecs  | Cut-through     | 32     | 16 x 4 | 192
40G Fabric (6004 - 6004 - 3548)     | 2,304 x 40G   | 1:1     | ~1.5-3.5 usecs  | Cut-through     | 24     | 6 x 8  | 48
Mixed Fabric (6004 - 6004 - 3548)   | 9,216 x 10G   | 2:1     | ~1.5-3.5 usecs  | Store & forward | 24     | 6 x 8  | 48
12.
13.
• Direct access to NIC hardware from Linux userspace
  - Operating system bypass
  - Via the Linux Verbs API (UD)
• Utilizes the Cisco Virtual Interface Card (VIC) for ultra-low Ethernet latency
  - 2nd-generation 80Gbps Cisco ASIC
  - 2 x 10Gbps Ethernet ports
  - 2 x 40Gbps coming in Q4 2013
  - PCI and mezzanine form factors
• Half-round-trip (HRT) ping-pong latencies:
  - Back to back: 1.7μs
  - Through N3548: 1.9μs
  - Through MPI+N3548: 2.16μs (*)
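The half-round-trip figures above are the standard ping-pong metric: time many full round trips and divide by twice the iteration count. A minimal sketch of the arithmetic (the elapsed time and iteration count below are made-up illustrative values, not measurements from the deck):

```python
def half_round_trip_us(elapsed_s, iterations):
    """Half-round-trip (HRT) latency in microseconds: each ping-pong
    iteration is one full round trip (send + matching reply), so the
    one-way latency is elapsed / (2 * iterations)."""
    return elapsed_s / (2 * iterations) * 1e6

# Illustrative: 100,000 ping-pong iterations taking 0.432 s total would
# reproduce the 2.16 us MPI+N3548 figure from the slide.
print(half_round_trip_us(0.432, 100_000))  # 2.16
```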
14.
TCP/IP vs. usNIC datapaths:
• TCP/IP path: application → userspace sockets library → kernel TCP stack → general Ethernet driver + Cisco VIC driver → Cisco VIC hardware
• usNIC path: application → userspace verbs library → Cisco VIC hardware for the send/receive fast path; the Verbs IB core and Cisco usNIC driver in the kernel are used only for bootstrapping and setup
15.
MPI directly injects L2 frames into the network, and receives L2 frames directly from the VIC, through the userspace verbs library and Cisco VIC hardware.
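Because MPI is sending raw L2 frames, a message larger than one frame's payload must be segmented across frames. A sketch of that arithmetic, assuming a standard 1500-byte Ethernet MTU and a hypothetical 60-byte per-frame header/metadata overhead (the real usNIC framing overhead is not given in the deck):

```python
import math

MTU = 1500         # standard Ethernet MTU, in bytes
OVERHEAD = 60      # hypothetical per-frame header/metadata bytes (assumption)
PAYLOAD = MTU - OVERHEAD

def frames_needed(message_bytes):
    """Number of L2 frames required to carry one MPI message."""
    return max(1, math.ceil(message_bytes / PAYLOAD))

print(frames_needed(8))      # 1  -> small messages fit in a single frame
print(frames_needed(65536))  # 46 -> a 64 KiB message is segmented
```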
16.
Send/receive datapath with SR-IOV:
• Outbound: each MPI process injects L2 frames directly from its queue pairs (QPs) on the SR-IOV NIC (the VIC)
• Inbound: the VIC classifier steers incoming L2 frames to the owning process's QP, with DMA into userspace memory mediated by the x86 chipset's VT-d IO MMU
17.
VIC virtualization layout: each physical port is exposed as a Physical Function (PF) with its own MAC address (e.g., aa:bb:cc:dd:ee:ff and aa:bb:cc:dd:ee:fe). Each PF is partitioned into multiple Virtual Functions (VFs), and queue pairs (QPs) are allocated from the VFs.
18.
Each MPI process is assigned its own VF (under a per-physical-port PF/MAC) and drives its QPs directly; the Intel IO MMU isolates and translates the DMA between each process's memory and the VIC's physical ports.
19.
• Everything above the firmware is open source
• Open MPI: distributing Cisco Open MPI 1.6.5; upstream in Open MPI 1.7.3
• Libibverbs plugin
• Verbs kernel module
20.
Hardware
• Cisco UCS C220 M3 Rack Server
• Intel E5-2690 Processor 2.9 GHz (3.3 GHz Turbo), 2 Socket, 8 Cores/Socket
• 1600 MHz DDR3 Memory, 8 GB x 16, 128 GB installed
• Cisco VIC 1225 with Ultra Low Latency Networking usNIC Driver
• Cisco Nexus 3548
• 48 Port 10 Gbps Ultra Low Latency Ethernet Networking Switch
Software
• OS: CentOS 6.4, Kernel: 2.6.32-358.el6.x86_64 (SMP)
• NetPIPE (ver 3.7.1)
• Intel MPI Benchmarks (ver 3.2.4)
• High Performance Linpack (ver 2.1)
• Other: Intel C Compiler (ver 13.0.1), Open MPI (ver 1.6.5), Cisco usNIC (1.0.0.7x)
21.
[Chart: Cisco usNIC latency (usecs) and throughput (Mbps) vs. message size (bytes), NetPIPE]
• 2.16 usecs latency for small messages
• 9.3 Gbps throughput
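The large-message figure can be put in context against the 10 Gbps line rate of the VIC's Ethernet ports:

```python
line_rate_gbps = 10.0
measured_gbps = 9.3   # large-message NetPIPE throughput from the chart
efficiency_pct = measured_gbps / line_rate_gbps * 100

print(f"{efficiency_pct:.0f}% of line rate")  # 93% of line rate
```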
22.
[Chart: IMB PingPong and PingPing latency (usecs) and throughput (MB/s) vs. message size (bytes)]
• 2.16 usecs PingPong latency
• 2.21 usecs PingPing latency
• PingPing and PingPong latency track together!
23.
[Chart: IMB SendRecv and Exchange latency (usecs) and throughput (MB/s) vs. message size (bytes)]
• 2.22 usecs SendRecv latency
• 2.69 usecs Exchange latency
• Full bi-directional performance for both Exchange and SendRecv
24.
# of CPU cores | 16     | 32     | 64      | 128     | 256     | 512
GFlops         | 340.51 | 673.68 | 1271.14 | 2647.09 | 5258.27 | 9773.45
GFLOPS = FLOPS/Cycle x Num CPU Cores x Freq (GHz)
E5-2690 Max GFLOPS = 8 x 16 x 3.3 = 422 GFLOPS
Single Node HPL Score (16 cores): 340.51 GFLOPS*
32 Node HPL Score (512 cores): 9,773.45 GFLOPS
Efficiency based on Single Machine Score:
(9,773.45)/(340.51 x 32) x 100 = 89.69%
* Score may improve with additional compiler settings or newer compiler versions
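The efficiency arithmetic above, restated as a small script:

```python
# Peak per-node GFLOPS for a 2-socket E5-2690 node (Sandy Bridge EP):
flops_per_cycle = 8        # double-precision FLOPS/cycle/core with AVX
cores = 16                 # 2 sockets x 8 cores
turbo_ghz = 3.3
peak_gflops = flops_per_cycle * cores * turbo_ghz
print(round(peak_gflops, 1))   # 422.4 (the slide rounds to 422)

# Cluster scaling efficiency relative to the single-node HPL score:
single_node = 340.51       # measured 16-core HPL score (GFLOPS)
cluster = 9773.45          # measured 512-core (32-node) HPL score
scaling_eff = cluster / (single_node * 32) * 100
print(round(scaling_eff, 2))   # 89.69 -> the slide's 89.69% efficiency
```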
25.
• Cisco usNIC with Cisco Nexus 3548 switch offers 2.16 usecs latency for small
messages with Open MPI
• Cisco usNIC with Cisco Nexus 3548 switch offers up to 89.69% HPL efficiency
across 512 Cores
• Cisco usNIC integrated with open source Open MPI
• Cisco usNIC offers ultra low latency networking performance over standard
Ethernet networking suitable for HPC applications
Editor's notes:
• N3K: 48 ports of 10Gb, 12 ports of 40Gb, 1RU. N6K: 384 ports of 10Gb or 96 ports of 40Gb, 4RU.
• C240: 2RU, 1.5TB RAM. B200: 768GB RAM, 80Gb uplink. B420: 1.5TB RAM, 160Gb uplink.
• The graph shows vendor revenue shares for three quarterly periods: Q1CY11, Q1CY12, Q1CY13. Since UCS first recorded IDC vendor revenue in Q3CY09, HP has lost 7.3% share (50.3% in Q3CY09 to 42.9% in Q1CY13) and IBM has lost 13.3% (29.1% to 15.8%). Cisco's Americas share of Q1CY13 x86 blades was 26.9%.
• x86 blade shares CQ1'13, worldwide: HP 42.9%, Cisco 19.3%, IBM 15.8%, Dell 9.6%, Fujitsu 2.3%, NEC 2.2%, Hitachi Ltd. 1.5%, Oracle 0.9%. Americas: HP 42.4%, Cisco 26.9%, IBM 15.1%, Dell 10.8%.
• VT-d: Virtualization Technology for Directed I/O. IO MMU: input/output memory management unit. SR-IOV: Single Root I/O Virtualization.