SlideShare una empresa de Scribd logo
1 de 41
Descargar para leer sin conexión
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 1© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 1
Cisco’s Journey
From Verbs to Libfabric
Abondon the shackles of Verbs
Embrace the freedom of Libfabric
Jeffrey M. Squyres
Cisco Systems
23 September 2015
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 2
Application
Kernel
Cisco VIC ethX port
TCP stack
General Ethernet driver
enic.ko
Userspace
sockets API userspace library
Application
Verbs IB core
usnic.ko
Send and
receive
fast path
usNICTCP/IP
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 3© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 3
Verbs is a fine API.
…if you make InfiniBand hardware.
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 4© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 4
...but now there’s this libfabric thing
(see libfabric.org community for details)
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 5
Keep in mind, Cisco already supports
UD Verbs
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 6
•  Monotonic enum
•  Could not add popular Ethernet values
1500
9000
•  usNIC verbs provider had to lie (!)
…just like iWARP providers
•  MPI had to match verbs device with IP
interface to find real MTU
Verbs
IBV_MTU_256
IBV_MTU_512
IBV_MTU_1024
IBV_MTU_2048
IBV_MTU_4096
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 7
•  Integer (not enum) endpoint attribute
Libfabric
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 8
•  Integer (not enum) endpoint attribute
Libfabric
DONE
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 9
•  Mandatory GRH structure
InfiniBand-specific header
•  40 bytes
UDP header is 42 bytes
…and a different format
•  Breaks ib_ud_pingpong
•  usnic verbs provider used “magic”
ibv_port_query() to return extensions
pointers
E.g., enable 42-byte UDP mode
Verbs
e
t
len chksmacdmac …
ver len
n
e
x
t
h
o
p
sgid dgid
UDP header: 42 bytes
GRH: 40 bytes
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 10
•  FI_MSG_PREFIX and
ep_attr.msg_prefix_size
Libfabric
e
t
len chksmacdmac …
payload
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 11
•  FI_MSG_PREFIX and
ep_attr.msg_prefix_size
Libfabric
e
t
len chksmacdmac …
payload
DONE
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 12
•  Tuple: (device, port)
Usually a physical device and port
Does not match virtualized VIC hardware
•  Queue pair
•  Completion queue
Verbs
PU P#9
PCI 8086:1521
eth3
PCI 1137:0043
eth4
usnic_0
PCI 1137:00cf
PCI 1137:00cf
PCI 1137:00cf
PCI 1137:00cf
PCI 1137:0043
ibv_device
ibv_port
QP QP CQ
QP
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 13
•  Maps nicely to SR-IOV
•  Fabric à PCI physical function (PF)
•  Domain à PCI virtual function (VF)
•  Endpoint à Resources in VF
PU P#9
PCI 8086:1521
eth3
PCI 1137:0043
eth4
usnic_0
PCI 1137:00cf
PCI 1137:00cf
PCI 1137:00cf
PCI 1137:00cf
PCI 1137:0043
Libfabric
fi_fabric
fi_domain
fi_endpoint
(resources in domain)
EP EP CQ
EP
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 14
•  GID and GUID
No easy mapping back to IP interface
•  usnic verbs provider encoded MAC in
GID
Still cumbersome to map back to IP interface
•  Could use RDMA CM
…but that would be a ton more code
Verbs
mac[0] = gid->raw[8] ^ 2;
mac[1] = gid->raw[9];
mac[2] = gid->raw[10];
mac[3] = gid->raw[13];
mac[4] = gid->raw[14];
mac[5] = gid->raw[15];
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 15
•  Can use IP addressing directly
Libfabric
Everything is awesome
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 16
•  Can use IP addressing directly
Libfabric
Everything is awesomeDONE
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 17
•  Generic send call
ibv_post_send(…SG list…)
Lots of branches
•  Wasteful allocations
•  No prefixed receive
•  Branching in completions
Verbs
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 18
•  Multiple types of send calls
fi_send(buffer, …)
•  Variable-length prefix receive
Provider-specific
•  Fewer branches in completions
Libfabric
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 19
1.9
1.95
2
2.05
2.1
2.15
2.2
2.25
2.3
2.35
2.4
0.1 1 10 100
Time(microseconds)
Buffer size
Open MPI with usNIC: IMB PingPong Latency
imb-pingpong-ompi-1.8-verbs.out
imb-pingpong-ompi-1.8-libfabric.out
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 20
61000
62000
63000
64000
65000
66000
67000
68000
69000
1e+06
Bandwidth(megabits/second)
Buffer size
Open MPI with usNIC: IMB SendRecv Bandwidth
imb-sendrecv-ompi-1.8-verbs.out
imb-sendrecv-ompi-1.8-libfabric.out
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 21
•  Performance issues
•  Memory registration still a problem
•  No MPI-style tag matching
•  One-sided capabilities do not match MPI
•  Network topology is a separate API
Verbs
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 22
•  Performance happiness
•  Many MPI-helpful features:
Tag matching
One-sided operations
Triggered operations
•  Inherently designed to be more than just
point-to-point
•  More work to be done… but promising
MMU notify
Network topology
Libfabric
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 23
•  Long design discussions about how to
expose Ethernet / VIC concepts in the
verbs API
…usually with few good answers
Especially problematic with new VIC features over
time
•  Conclusion: possible (obviously), but not
preferable
•  Whole API designed with multiple vendor
hardware models in mind
•  Much easier to match our hardware to
core Libfabric concepts
•  Conclusion: much more preferable than
verbs
LibfabricVerbs
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 24© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 24
Ok, so let’s do libfabric!
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 25© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 25
Does it play well with MPI?
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 26
Byte Transport Layer
(BTL) plugins
Matching Transport
Layer (MTL) plugins
MPI_Send(…)
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 27
•  Inherently multi-device
•  Round-robin for
small messages
•  Striping for large messages
•  Major protocol decisions and MPI message matching driven by an Open MPI
engine
Byte Transport Layer
(BTL) plugins
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 28
Matching Transport
Layer (MTL) plugins
•  Most details hidden by network API
• MXM
• Portals
• PSM
•  As a side effect, must handle:
• Process loopback
• Server loopback (usually via shared memory)
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 29
Byte Transport Layer
(BTL) plugins
Matching Transport
Layer (MTL) plugins
•  IB / iWarp (verbs)
•  Portals
•  SCIF
•  Shared memory
•  TCP
•  uGNI
•  usNIC (verbs)
•  MXM
•  Portals
•  PSM
•  PSM2
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 30
•  IB / iWarp (verbs)
•  Portals
•  SCIF
•  Shared memory
•  TCP
•  uGNI
•  usNIC
Byte Transport Layer
(BTL) plugins
Matching Transport
Layer (MTL) plugins
•  MXM
•  Portals
•  PSM
•  PSM2
•  ofi
libfabric
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 31
libfabric
usnic BTL ofi MTL
•  Cisco developed
•  usNIC-specific
•  OFI point-to-point / UD
•  Tested with usNIC
•  Intel developed
•  Provider neutral
•  OFI tag matching
•  Tested with PSM / PSM2
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 32
Bootstrapping
Message passing
There are two main parts of the usNIC BTL
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 33
verbs
bootstrapping
verbs
message passing
These two parts were previously
written to the Verbs API
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 34
verbs
bootstrapping
verbs
message passing
sideband
bootstrapping
1.  Find the corresponding ethX device
2.  Obtain MTU
3.  Open usNIC-specific configuration
options
Per the previous slides, the Verbs API
requires some… help… in the form of
sideband bootstrapping
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 35
verbs
bootstrapping
verbs
message passing
sideband
bootstrapping
libfabric
bootstrapping
à
libfabric
message passing
à
Now let’s convert to use the libfabric API
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 36
verbs
bootstrapping
verbs
message passing
sideband
bootstrapping
libfabric
bootstrapping
à
libfabric
message passing
àPretty much a ~1:1 swap of verbs à libfabric calls
Bootstrapping sequence totally different / not comparable
…but libfabric needs no sideband bootstrapping
(got to delete several hundred lines of OMPI code – yay!)
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 37
•  For a specific provider
Ask fi_getinfo() for
prov_name=“usnic”
•  Use usNIC extensions
Netmask, link speed, IP device
name, etc.
•  usNIC-specific error
messages
•  For any tag-matching
provider
•  No extension use
100% portable
•  Generic error messages
usnic BTL ofi MTL
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 38
•  For a specific provider
Ask fi_getinfo() for
prov_name=“usnic”
•  Use usNIC extensions
Netmask, link speed, IP device
name, etc.
•  usNIC-specific error
messages
•  For any tag-matching
provider
•  No extension use
100% portable
•  Generic error messages
usnic BTL ofi MTL
Both libfabric usage models co-exist
(and play well with each other)
inside a single MPI implementation.
Proof positive
of successful co-design
of libfabric and MPI implementations.
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 39
•  For a specific provider
Ask fi_getinfo() for
prov_name=“usnic”
•  Use usNIC extensions
Netmask, link speed, IP device
name, etc.
•  usNIC-specific error
messages
•  For any tag-matching
provider
•  No extension use
100% portable
•  Generic error messages
usnic BTL ofi MTL
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 40
•  Libfabric is the Way
Forward for Cisco
Open community
Matches our hardware
Performance benefits
Features benefits
•  Libfabric matches MPI
Has features MPI has been
asking for… for years
Optimistic about its future
(come join us!)
http://libfabric.org
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 41
Thank you.

Más contenido relacionado

La actualidad más candente

Isn't it ironic - managing a bare metal cloud (OSL TES 2015)
Isn't it ironic - managing a bare metal cloud (OSL TES 2015)Isn't it ironic - managing a bare metal cloud (OSL TES 2015)
Isn't it ironic - managing a bare metal cloud (OSL TES 2015)
Devananda Van Der Veen
 

La actualidad más candente (20)

iSCSI Protocol and Functionality
iSCSI Protocol and FunctionalityiSCSI Protocol and Functionality
iSCSI Protocol and Functionality
 
Lisa 2015-gluster fs-hands-on
Lisa 2015-gluster fs-hands-onLisa 2015-gluster fs-hands-on
Lisa 2015-gluster fs-hands-on
 
Tutorial: Using GoBGP as an IXP connecting router
Tutorial: Using GoBGP as an IXP connecting routerTutorial: Using GoBGP as an IXP connecting router
Tutorial: Using GoBGP as an IXP connecting router
 
OSC2011 Tokyo/Spring 自宅SAN友の会(前半)
OSC2011 Tokyo/Spring 自宅SAN友の会(前半)OSC2011 Tokyo/Spring 自宅SAN友の会(前半)
OSC2011 Tokyo/Spring 自宅SAN友の会(前半)
 
Pushing Packets - How do the ML2 Mechanism Drivers Stack Up
Pushing Packets - How do the ML2 Mechanism Drivers Stack UpPushing Packets - How do the ML2 Mechanism Drivers Stack Up
Pushing Packets - How do the ML2 Mechanism Drivers Stack Up
 
Isn't it ironic - managing a bare metal cloud (OSL TES 2015)
Isn't it ironic - managing a bare metal cloud (OSL TES 2015)Isn't it ironic - managing a bare metal cloud (OSL TES 2015)
Isn't it ironic - managing a bare metal cloud (OSL TES 2015)
 
DPDK QoS
DPDK QoSDPDK QoS
DPDK QoS
 
VPP事始め
VPP事始めVPP事始め
VPP事始め
 
The Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast StorageThe Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast Storage
 
Accelerating Virtual Machine Access with the Storage Performance Development ...
Accelerating Virtual Machine Access with the Storage Performance Development ...Accelerating Virtual Machine Access with the Storage Performance Development ...
Accelerating Virtual Machine Access with the Storage Performance Development ...
 
プログラマ目線から見たRDMAのメリットと その応用例について
プログラマ目線から見たRDMAのメリットとその応用例についてプログラマ目線から見たRDMAのメリットとその応用例について
プログラマ目線から見たRDMAのメリットと その応用例について
 
BGP Unnumbered で遊んでみた
BGP Unnumbered で遊んでみたBGP Unnumbered で遊んでみた
BGP Unnumbered で遊んでみた
 
Lisa 2015-gluster fs-introduction
Lisa 2015-gluster fs-introductionLisa 2015-gluster fs-introduction
Lisa 2015-gluster fs-introduction
 
I/O仮想化最前線〜ネットワークI/Oを中心に〜
I/O仮想化最前線〜ネットワークI/Oを中心に〜I/O仮想化最前線〜ネットワークI/Oを中心に〜
I/O仮想化最前線〜ネットワークI/Oを中心に〜
 
Introduction to OpenStack Cinder
Introduction to OpenStack CinderIntroduction to OpenStack Cinder
Introduction to OpenStack Cinder
 
Gluster technical overview
Gluster technical overviewGluster technical overview
Gluster technical overview
 
StarlingX - A Platform for the Distributed Edge | Ildiko Vancsa
StarlingX - A Platform for the Distributed Edge | Ildiko VancsaStarlingX - A Platform for the Distributed Edge | Ildiko Vancsa
StarlingX - A Platform for the Distributed Edge | Ildiko Vancsa
 
OVS VXLAN Network Accelaration on OpenStack (VXLAN offload and DPDK) - OpenSt...
OVS VXLAN Network Accelaration on OpenStack (VXLAN offload and DPDK) - OpenSt...OVS VXLAN Network Accelaration on OpenStack (VXLAN offload and DPDK) - OpenSt...
OVS VXLAN Network Accelaration on OpenStack (VXLAN offload and DPDK) - OpenSt...
 
NETCONF Call Home
NETCONF Call Home NETCONF Call Home
NETCONF Call Home
 
Introduction to CNI (Container Network Interface)
Introduction to CNI (Container Network Interface)Introduction to CNI (Container Network Interface)
Introduction to CNI (Container Network Interface)
 

Similar a Cisco's journey from Verbs to Libfabric

MCET WebEx vs Lync competitive analysis,
MCET WebEx vs Lync competitive analysis,MCET WebEx vs Lync competitive analysis,
MCET WebEx vs Lync competitive analysis,
dark19921
 
Foreman-and-Puppet-for-Openstack-Audo-Deployment
Foreman-and-Puppet-for-Openstack-Audo-DeploymentForeman-and-Puppet-for-Openstack-Audo-Deployment
Foreman-and-Puppet-for-Openstack-Audo-Deployment
yating yang
 

Similar a Cisco's journey from Verbs to Libfabric (20)

Cisco usNIC libfabric provider
Cisco usNIC libfabric providerCisco usNIC libfabric provider
Cisco usNIC libfabric provider
 
2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANL2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANL
 
Vbrownbag container networking for real workloads
Vbrownbag container networking for real workloadsVbrownbag container networking for real workloads
Vbrownbag container networking for real workloads
 
BRKACI-2102_Tshoot.pdf
BRKACI-2102_Tshoot.pdfBRKACI-2102_Tshoot.pdf
BRKACI-2102_Tshoot.pdf
 
DEVNET-1148 Leveraging Cisco OpenStack Private Cloud for Developers
DEVNET-1148	Leveraging Cisco OpenStack Private Cloud for DevelopersDEVNET-1148	Leveraging Cisco OpenStack Private Cloud for Developers
DEVNET-1148 Leveraging Cisco OpenStack Private Cloud for Developers
 
ACI Hands-on Lab
ACI Hands-on LabACI Hands-on Lab
ACI Hands-on Lab
 
InfluxDB Community Office Hours September 2020
InfluxDB Community Office Hours September 2020 InfluxDB Community Office Hours September 2020
InfluxDB Community Office Hours September 2020
 
(Very) Loose proposal to revamp MPI_INIT and MPI_FINALIZE
(Very) Loose proposal to revamp MPI_INIT and MPI_FINALIZE(Very) Loose proposal to revamp MPI_INIT and MPI_FINALIZE
(Very) Loose proposal to revamp MPI_INIT and MPI_FINALIZE
 
MCET WebEx vs Lync competitive analysis,
MCET WebEx vs Lync competitive analysis,MCET WebEx vs Lync competitive analysis,
MCET WebEx vs Lync competitive analysis,
 
Foreman-and-Puppet-for-Openstack-Audo-Deployment
Foreman-and-Puppet-for-Openstack-Audo-DeploymentForeman-and-Puppet-for-Openstack-Audo-Deployment
Foreman-and-Puppet-for-Openstack-Audo-Deployment
 
Triangle Kubernetes Meetup: Container cloud networking - Contiv for K8S & Ope...
Triangle Kubernetes Meetup: Container cloud networking - Contiv for K8S & Ope...Triangle Kubernetes Meetup: Container cloud networking - Contiv for K8S & Ope...
Triangle Kubernetes Meetup: Container cloud networking - Contiv for K8S & Ope...
 
Integration and Interoperation of existing Nexus networks into an ACI Archite...
Integration and Interoperation of existing Nexus networks into an ACI Archite...Integration and Interoperation of existing Nexus networks into an ACI Archite...
Integration and Interoperation of existing Nexus networks into an ACI Archite...
 
Advanced coding & deployment for Cisco Video Devices - CL20B - DEVNET-3244
Advanced coding & deployment for Cisco Video Devices - CL20B - DEVNET-3244Advanced coding & deployment for Cisco Video Devices - CL20B - DEVNET-3244
Advanced coding & deployment for Cisco Video Devices - CL20B - DEVNET-3244
 
Application Centric Infrastructure (ACI), the policy driven data centre
Application Centric Infrastructure (ACI), the policy driven data centreApplication Centric Infrastructure (ACI), the policy driven data centre
Application Centric Infrastructure (ACI), the policy driven data centre
 
Luca Relandini - Microservices and containers networking: Contiv, deep dive a...
Luca Relandini - Microservices and containers networking: Contiv, deep dive a...Luca Relandini - Microservices and containers networking: Contiv, deep dive a...
Luca Relandini - Microservices and containers networking: Contiv, deep dive a...
 
Cisco Live 2017: Container networking deep dive with Docker Enterprise Editio...
Cisco Live 2017: Container networking deep dive with Docker Enterprise Editio...Cisco Live 2017: Container networking deep dive with Docker Enterprise Editio...
Cisco Live 2017: Container networking deep dive with Docker Enterprise Editio...
 
Microservices and containers networking: Contiv, an industry leading open sou...
Microservices and containers networking: Contiv, an industry leading open sou...Microservices and containers networking: Contiv, an industry leading open sou...
Microservices and containers networking: Contiv, an industry leading open sou...
 
Docker Enterprise Networking and Cisco Contiv - Cisco Live 2017 BRKSDN-2256
Docker Enterprise Networking and Cisco Contiv - Cisco Live 2017 BRKSDN-2256Docker Enterprise Networking and Cisco Contiv - Cisco Live 2017 BRKSDN-2256
Docker Enterprise Networking and Cisco Contiv - Cisco Live 2017 BRKSDN-2256
 
Breizhcamp: Créer un bot, pas si simple. Faisons le point.
Breizhcamp: Créer un bot, pas si simple. Faisons le point.Breizhcamp: Créer un bot, pas si simple. Faisons le point.
Breizhcamp: Créer un bot, pas si simple. Faisons le point.
 
Microservices and containers networking: Contiv, an industry leading open sou...
Microservices and containers networking: Contiv, an industry leading open sou...Microservices and containers networking: Contiv, an industry leading open sou...
Microservices and containers networking: Contiv, an industry leading open sou...
 

Más de Jeff Squyres

Cisco EuroMPI'13 vendor session presentation
Cisco EuroMPI'13 vendor session presentationCisco EuroMPI'13 vendor session presentation
Cisco EuroMPI'13 vendor session presentation
Jeff Squyres
 

Más de Jeff Squyres (19)

Open MPI State of the Union X SC'16 BOF
Open MPI State of the Union X SC'16 BOFOpen MPI State of the Union X SC'16 BOF
Open MPI State of the Union X SC'16 BOF
 
MPI Sessions: a proposal to the MPI Forum
MPI Sessions: a proposal to the MPI ForumMPI Sessions: a proposal to the MPI Forum
MPI Sessions: a proposal to the MPI Forum
 
MPI Fourm SC'15 BOF
MPI Fourm SC'15 BOFMPI Fourm SC'15 BOF
MPI Fourm SC'15 BOF
 
Open MPI SC'15 State of the Union BOF
Open MPI SC'15 State of the Union BOFOpen MPI SC'15 State of the Union BOF
Open MPI SC'15 State of the Union BOF
 
Fun with Github webhooks: verifying Signed-off-by
Fun with Github webhooks: verifying Signed-off-byFun with Github webhooks: verifying Signed-off-by
Fun with Github webhooks: verifying Signed-off-by
 
Open MPI new version number scheme and roadmap
Open MPI new version number scheme and roadmapOpen MPI new version number scheme and roadmap
Open MPI new version number scheme and roadmap
 
2014 01-21-mpi-community-feedback
2014 01-21-mpi-community-feedback2014 01-21-mpi-community-feedback
2014 01-21-mpi-community-feedback
 
(Open) MPI, Parallel Computing, Life, the Universe, and Everything
(Open) MPI, Parallel Computing, Life, the Universe, and Everything(Open) MPI, Parallel Computing, Life, the Universe, and Everything
(Open) MPI, Parallel Computing, Life, the Universe, and Everything
 
Cisco usNIC: how it works, how it is used in Open MPI
Cisco usNIC: how it works, how it is used in Open MPICisco usNIC: how it works, how it is used in Open MPI
Cisco usNIC: how it works, how it is used in Open MPI
 
Cisco EuroMPI'13 vendor session presentation
Cisco EuroMPI'13 vendor session presentationCisco EuroMPI'13 vendor session presentation
Cisco EuroMPI'13 vendor session presentation
 
Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)
Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)
Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)
 
MPI History
MPI HistoryMPI History
MPI History
 
MOSSCon 2013, Cisco Open Source talk
MOSSCon 2013, Cisco Open Source talkMOSSCon 2013, Cisco Open Source talk
MOSSCon 2013, Cisco Open Source talk
 
Ethernet and TCP optimizations
Ethernet and TCP optimizationsEthernet and TCP optimizations
Ethernet and TCP optimizations
 
Friends don't let friends leak MPI_Requests
Friends don't let friends leak MPI_RequestsFriends don't let friends leak MPI_Requests
Friends don't let friends leak MPI_Requests
 
MPI-3 Timer requests proposal
MPI-3 Timer requests proposalMPI-3 Timer requests proposal
MPI-3 Timer requests proposal
 
MPI_Mprobe is good for you
MPI_Mprobe is good for youMPI_Mprobe is good for you
MPI_Mprobe is good for you
 
The Message Passing Interface (MPI) in Layman's Terms
The Message Passing Interface (MPI) in Layman's TermsThe Message Passing Interface (MPI) in Layman's Terms
The Message Passing Interface (MPI) in Layman's Terms
 
What is [Open] MPI?
What is [Open] MPI?What is [Open] MPI?
What is [Open] MPI?
 

Último

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 

Cisco's journey from Verbs to Libfabric

  • 1. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 1© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 1 Cisco’s Journey From Verbs to Libfabric Abondon the shackles of Verbs Embrace the freedom of Libfabric Jeffrey M. Squyres Cisco Systems 23 September 2015
  • 2. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 2 Application Kernel Cisco VIC ethX port TCP stack General Ethernet driver enic.ko Userspace sockets API userspace library Application Verbs IB core usnic.ko Send and receive fast path usNICTCP/IP
  • 3. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 3© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 3 Verbs is a fine API. …if you make InfiniBand hardware.
  • 4. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 4© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 4 ...but now there’s this libfabric thing (see libfabric.org community for details)
  • 5. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 5 Keep in mind, Cisco already supports UD Verbs
  • 6. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 6 •  Monotonic enum •  Could not add popular Ethernet values 1500 9000 •  usNIC verbs provider had to lie (!) …just like iWARP providers •  MPI had to match verbs device with IP interface to find real MTU Verbs IBV_MTU_256 IBV_MTU_512 IBV_MTU_1024 IBV_MTU_2048 IBV_MTU_4096
  • 7. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 7 •  Integer (not enum) endpoint attribute Libfabric
  • 8. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 8 •  Integer (not enum) endpoint attribute Libfabric DONE
  • 9. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 9 •  Mandatory GRH structure InfiniBand-specific header •  40 bytes UDP header is 42 bytes …and a different format •  Breaks ib_ud_pingpong •  usnic verbs provider used “magic” ibv_port_query() to return extensions pointers E.g., enable 42-byte UDP mode Verbs e t len chksmacdmac … ver len n e x t h o p sgid dgid UDP header: 42 bytes GRH: 40 bytes
  • 10. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 10 •  FI_MSG_PREFIX and ep_attr.msg_prefix_size Libfabric e t len chksmacdmac … payload
  • 11. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 11 •  FI_MSG_PREFIX and ep_attr.msg_prefix_size Libfabric e t len chksmacdmac … payload DONE
  • 12. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 12 •  Tuple: (device, port) Usually a physical device and port Does not match virtualized VIC hardware •  Queue pair •  Completion queue Verbs PU P#9 PCI 8086:1521 eth3 PCI 1137:0043 eth4 usnic_0 PCI 1137:00cf PCI 1137:00cf PCI 1137:00cf PCI 1137:00cf PCI 1137:0043 ibv_device ibv_port QP QP CQ QP
  • 13. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 13 •  Maps nicely to SR-IOV •  Fabric à PCI physical function (PF) •  Domain à PCI virtual function (VF) •  Endpoint à Resources in VF PU P#9 PCI 8086:1521 eth3 PCI 1137:0043 eth4 usnic_0 PCI 1137:00cf PCI 1137:00cf PCI 1137:00cf PCI 1137:00cf PCI 1137:0043 Libfabric fi_fabric fi_domain fi_endpoint (resources in domain) EP EP CQ EP
  • 14. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 14 •  GID and GUID No easy mapping back to IP interface •  usnic verbs provider encoded MAC in GID Still cumbersome to map back to IP interface •  Could use RDMA CM …but that would be a ton more code Verbs mac[0] = gid->raw[8] ^ 2; mac[1] = gid->raw[9]; mac[2] = gid->raw[10]; mac[3] = gid->raw[13]; mac[4] = gid->raw[14]; mac[5] = gid->raw[15];
  • 15. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 15 •  Can use IP addressing directly Libfabric Everything is awesome
  • 16. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 16 •  Can use IP addressing directly Libfabric Everything is awesomeDONE
  • 17. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 17 •  Generic send call ibv_post_send(…SG list…) Lots of branches •  Wasteful allocations •  No prefixed receive •  Branching in completions Verbs
  • 18. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 18 •  Multiple types of send calls fi_send(buffer, …) •  Variable-length prefix receive Provider-specific •  Fewer branches in completions Libfabric
  • 19. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 19 1.9 1.95 2 2.05 2.1 2.15 2.2 2.25 2.3 2.35 2.4 0.1 1 10 100 Time(microseconds) Buffer size Open MPI with usNIC: IMB PingPong Latency imb-pingpong-ompi-1.8-verbs.out imb-pingpong-ompi-1.8-libfabric.out
  • 20. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 20 61000 62000 63000 64000 65000 66000 67000 68000 69000 1e+06 Bandwidth(megabits/second) Buffer size Open MPI with usNIC: IMB SendRecv Bandwidth imb-sendrecv-ompi-1.8-verbs.out imb-sendrecv-ompi-1.8-libfabric.out
  • 21. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 21 •  Performance issues •  Memory registration still a problem •  No MPI-style tag matching •  One-sided capabilities do not match MPI •  Network topology is a separate API Verbs
  • 22. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 22 •  Performance happiness •  Many MPI-helpful features: Tag matching One-sided operations Triggered operations •  Inherently designed to be more than just point-to-point •  More work to be done… but promising MMU notify Network topology Libfabric
  • 23. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 23 •  Long design discussions about how to expose Ethernet / VIC concepts in the verbs API …usually with few good answers Especially problematic with new VIC features over time •  Conclusion: possible (obviously), but not preferable •  Whole API designed with multiple vendor hardware models in mind •  Much easier to match our hardware to core Libfabric concepts •  Conclusion: much more preferable than verbs LibfabricVerbs
  • 24. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 24© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 24 Ok, so let’s do libfabric!
  • 25. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 25© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 25 Does it play well with MPI?
  • 26. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 26 Byte Transport Layer (BTL) plugins Matching Transport Layer (MTL) plugins MPI_Send(…)
  • 27. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 27 •  Inherently multi-device •  Round-robin for small messages •  Striping for large messages •  Major protocol decisions and MPI message matching driven by an Open MPI engine Byte Transport Layer (BTL) plugins
  • 28. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 28 Matching Transport Layer (MTL) plugins •  Most details hidden by network API • MXM • Portals • PSM •  As a side effect, must handle: • Process loopback • Server loopback (usually via shared memory)
  • 29. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 29 Byte Transport Layer (BTL) plugins Matching Transport Layer (MTL) plugins •  IB / iWarp (verbs) •  Portals •  SCIF •  Shared memory •  TCP •  uGNI •  usNIC (verbs) •  MXM •  Portals •  PSM •  PSM2
  • 30. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 30 •  IB / iWarp (verbs) •  Portals •  SCIF •  Shared memory •  TCP •  uGNI •  usNIC Byte Transport Layer (BTL) plugins Matching Transport Layer (MTL) plugins •  MXM •  Portals •  PSM •  PSM2 •  ofi libfabric
  • 31. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 31 libfabric usnic BTL ofi MTL •  Cisco developed •  usNIC-specific •  OFI point-to-point / UD •  Tested with usNIC •  Intel developed •  Provider neutral •  OFI tag matching •  Tested with PSM / PSM2
  • 32. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 32 Bootstrapping Message passing There are two main parts of the usNIC BTL
  • 33. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 33 verbs bootstrapping verbs message passing These two parts were previously written to the Verbs API
  • 34. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 34 verbs bootstrapping verbs message passing sideband bootstrapping 1.  Find the corresponding ethX device 2.  Obtain MTU 3.  Open usNIC-specific configuration options Per the previous slides, the Verbs API requires some… help… in the form of sideband bootstrapping
  • 35. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 35 verbs bootstrapping verbs message passing sideband bootstrapping libfabric bootstrapping à libfabric message passing à Now let’s convert to use the libfabric API
  • 36. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 36 verbs bootstrapping verbs message passing sideband bootstrapping libfabric bootstrapping à libfabric message passing àPretty much a ~1:1 swap of verbs à libfabric calls Bootstrapping sequence totally different / not comparable …but libfabric needs no sideband bootstrapping (got to delete several hundred lines of OMPI code – yay!)
  • 37. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 37 •  For a specific provider Ask fi_getinfo() for prov_name=“usnic” •  Use usNIC extensions Netmask, link speed, IP device name, etc. •  usNIC-specific error messages •  For any tag-matching provider •  No extension use 100% portable •  Generic error messages usnic BTL ofi MTL
  • 38. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 38 •  For a specific provider Ask fi_getinfo() for prov_name=“usnic” •  Use usNIC extensions Netmask, link speed, IP device name, etc. •  usNIC-specific error messages •  For any tag-matching provider •  No extension use 100% portable •  Generic error messages usnic BTL ofi MTL Both libfabric usage models co-exist (and play well with each other) inside a single MPI implementation. Proof positive of successful co-design of libfabric and MPI implementations.
  • 39. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 39 •  For a specific provider Ask fi_getinfo() for prov_name=“usnic” •  Use usNIC extensions Netmask, link speed, IP device name, etc. •  usNIC-specific error messages •  For any tag-matching provider •  No extension use 100% portable •  Generic error messages usnic BTL ofi MTL
  • 40. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 40 •  Libfabric is the Way Forward for Cisco Open community Matches our hardware Performance benefits Features benefits •  Libfabric matches MPI Has features MPI has been asking for… for years Optimistic about its future (come join us!) http://libfabric.org
  • 41. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 41 Thank you.