Internet Research Lab at NTU, Taiwan.
A survey of routing in data center networks and of the latest IEEE 802.1Qau Congestion Notification standard from the Data Center Bridging Task Group.
1. Data Center Network Multipathing
Peregrine: An All-Layer-2 Container Computer Network
Tzi-cker Chiueh*§, Cheng-Chun Tu*§, Yu-Cheng Wang§, Pai-Wei Wang§, Kai-Wen Li§, Yu-Ming Huang§
*Industrial Technology Research Institute, Taiwan
§Computer Science Department, Stony Brook University
IEEE Cloud 2012
Leveraging Performance of Multiroot Data Center Networks by Reactive Reroute
Adrian S.-W. Tam, Kang Xi, H. Jonathan Chao
Department of Electrical and Computer Engineering, Polytechnic Institute of New York University
2010 18th IEEE Symposium on High Performance Interconnects
Presenter: Jason, Tsung-Cheng, HOU
Advisor: Wanjiun Liao
May 17th, 2012
2. Motivation
• Summarize features of the popular multi-root
Clos / fat-tree data center topology
Take ITRI’s prototype as an example
• Surveyed solutions of multipathing
• Recap Jin-Jia Chang’s presentation on QCN
• Present another solution to multipathing
• Compare several multipathing methods
3. Agenda
• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods
Peregrine: An All-Layer-2 Container Computer Network
Tzi-cker Chiueh*§, Cheng-Chun Tu*§, Yu-Cheng Wang§, Pai-Wei Wang§, Kai-Wen Li§, Yu-Ming Huang§
*Industrial Technology Research Institute, Taiwan
§Computer Science Department, Stony Brook University
IEEE Cloud 2012
4. Multi-Root Clos / Fat-Tree
• Adopted by various publications
– VL2, PortLand, BCube, Elastic Tree, Peregrine
• Scale-out, cheap commodity switches
• Through fixed maximum switches / hops
– If no bouncing, no routing loop
• Nearly full bisection, multipathing, symmetric
• Possibly tremendous routing table entries
• Up and down paths, handled differently
• High rate but limited capability (buffer, CPU, ...)
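A minimal sizing sketch for the topology above, assuming the standard three-tier k-ary fat-tree built from identical k-port switches (the slides do not fix k, and the function name is hypothetical). It illustrates why the design scales out with commodity switches and why many equal-cost paths exist between pods.

```python
def fat_tree_size(k: int) -> dict:
    """Element counts for a 3-tier k-ary fat-tree of k-port switches.

    Assumes the textbook construction: k pods, each with k/2 edge and
    k/2 aggregation switches, and (k/2)^2 core switches.
    """
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even integer >= 2")
    half = k // 2
    edge = k * half            # k pods, k/2 edge switches per pod
    aggregation = k * half     # k pods, k/2 aggregation switches per pod
    core = half * half         # (k/2)^2 core switches
    hosts = edge * half        # each edge switch serves k/2 hosts
    return {
        "pods": k,
        "edge": edge,
        "aggregation": aggregation,
        "core": core,
        "hosts": hosts,
        # One shortest path per core switch between hosts in different pods.
        "inter_pod_paths": core,
    }


if __name__ == "__main__":
    for ports in (4, 24, 48):
        print(ports, fat_tree_size(ports))
```

With 48-port switches this construction reaches tens of thousands of hosts, which is also where the concern about large routing tables comes from.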
5. High rate but limited capability
• All-L2 Ethernet switches
• Up to 1 GE or 10 GE links, dozens of ports
• Limited buffer, hundreds of KB
• Limited CPU ability, processing bottleneck
• Limited flow table entries, at most tens of thousands
• Optimized for fast table lookups
• Take Peregrine for example
– ITRI’s industrial, commodity production prototype
– Others, mostly experimental or high-end
9. DS and RAS
• Directory Server
– Address association, mgmt, and reuse
– Performs IP-MAC lookup, mappings
– Updates mappings to end hosts
• Route Algorithm Server
– Collects entries of the traffic matrix
– Runs load-balancing algorithms, based on TM
– Distributes routing entries to switches, updates DS
• Within one container, cross-container unclear
• Scalability unclear, VM mobility unclear
(Only refers to something like Mobile IP)
14. ITRI Container Computer Prototype
• 6.096m shipping container
• 12 server racks, 12 storage racks
• All-L2 network, commodity switches
• “Folded” Clos topology
• Directory Server, Route Algorithm Server
• Unclear: Load-balancing algo., VM mobility,
DS-RAS scalability, cross-container
• In the future: OpenFlow, OpenStack
(Currently not using OpenFlow to connect
switches… how? unclear)
15. Discussions
• Spanning tree for multipathing and load-
balancing: Simple but limited flexibility
• How to plug and play? Scalable?
– A new switch leads to reconfiguration
– VM migration: affects TM and direct routes?
• DS-RAS: a simple version of controller
But mechanism, performance unclear
• Seems to be trying to combine various
advantages: Address mapping, ST
multipathing, converged network, folded-Clos
16. Agenda
• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods
17. Multipathing
• VLB:
– Traffic splits to intermediate points
– Automatically balances load
– Ideally great, but subject to PKT reordering
• ECMP-hashing
– Different hashing functions, big difference
– Flow always sticks to one path during transmit
• Hedera:
– Flow-to-core mapping, flow scheduling
– Requires global information, higher complexity
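A minimal sketch of the ECMP-hashing idea listed above, assuming a 5-tuple flow key and using SHA-256 only as a stand-in for whatever hash a real switch applies. All names here (`FiveTuple`, `ecmp_pick`, the uplink labels) are hypothetical. It shows why a flow always sticks to one path: the choice depends only on header fields that stay constant for the flow.

```python
import hashlib
from typing import NamedTuple, Sequence


class FiveTuple(NamedTuple):
    """Header fields that identify a flow (an assumed choice of key)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int


def ecmp_pick(flow: FiveTuple, uplinks: Sequence[str], seed: bytes = b"") -> str:
    """Map a flow to one of the equal-cost uplinks by hashing its headers.

    The seed stands in for a per-switch hash function; changing it changes
    how the same set of flows spreads across the uplinks.
    """
    if not uplinks:
        raise ValueError("at least one uplink is required")
    key = seed + "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(uplinks)
    return uplinks[index]


if __name__ == "__main__":
    uplinks = ["agg0", "agg1", "agg2", "agg3"]
    flow = FiveTuple("10.0.0.1", "10.0.1.2", 40000, 80, 6)
    # Every packet of this flow gets the same answer, so no reordering.
    print(ecmp_pick(flow, uplinks), ecmp_pick(flow, uplinks))
```

Because the mapping ignores load, two large flows can hash onto the same uplink, which is the gap that Hedera-style flow scheduling tries to close.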
18. Multipathing
• Spanning Tree / VLAN: (SPAIN)
– Near-static, pre-computation required, but simple
– Re-computes when topology changes
– Segmentation of resources, limited flexibility
• Multipath TCP:
– One flow, many parallel paths
– VLAN-based routing in publication (like SPAIN)
– Shifts traffic to less congested paths
– A new transport mechanism, adaptive
– Still with segmentation of resources
19. Multipathing References
• M. Kodialam, T. V. Lakshman, S. Sengupta, “Efficient and Robust Routing of Highly
Variable Traffic”, HotNets, 2004.
• R. Zhang-Shen and N. McKeown “Designing a Predictable Internet Backbone Network”,
Third Workshop on Hot Topics in Networks (HotNets-III), November 2004.
• A. Greenberg et al., “VL2: A Scalable and Flexible Data Center Network”, ACM SIGCOMM
2009.
• R. N. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan,
V. Subramanya, and A. Vahdat, “PortLand: A Scalable, Fault-Tolerant Layer 2 Data
Center Network Fabric”, ACM SIGCOMM 2009.
• M. Al-Fares et al., “Hedera: Dynamic Flow Scheduling for Data Center Networks”,
USENIX NSDI 2010.
• J. Mudigonda, P. Yalagandula, M. Al-Fares, and J. C. Mogul. “SPAIN: COTS Data-Center
Ethernet for Multipathing over Arbitrary Topologies.” In USENIX NSDI, April 2010.
• C. Raiciu, C. Pluntke, S. Barre, A. Greenhalgh, D. Wischik, and M. Handley. “Data center
networking with multipath TCP.” In HotNets, 2010.
20. Agenda
• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods
Data center transport mechanisms: Congestion control theory and IEEE
standardization
M. Alizadeh, B. Atikoglu, A. Kabbani, A. Lakshmikantha, R. Pan, B. Prabhakar, and M. Seaman,
Communication, Control, and Computing, 2008 46th Annual Allerton Conference on
AF-QCN: Approximate fairness with quantized congestion notification for
multitenanted data centers
A. Kabbani, M. Alizadeh, M. Yasuda, R. Pan, and B. Prabhakar,
In High Performance Interconnects (HOTI), 2010, IEEE 18th Annual Symposium on
21. Data Center Bridging Task Group
• Converged network
– LAN: no priority control
Qbb: Priority-based Flow Control
– FCoE (SAN): no congestion control
Qau: Quantized Congestion Notification
• Need to survey more on converged network
– Respective features and requirements
– Could be a very important trend
22. QCN
• CP: Congestion Point
– A switch monitors queue, Q, Qeq, Qold
– Samples and sends Fb msg to RP
– Fb a combination of (queue, rate) excess
– Targets for no PKT loss
• RP: Reaction Point
– A host with Rate Limiter, Counter, and Timer
– Retries for more BW like AIMD
– Decreases according to Fb msg
– Counter and Timer both control RL
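The CP and RP roles above can be sketched as follows, assuming the commonly described QCN form in which feedback combines the queue offset from the equilibrium point Qeq with the queue growth since the last sample. The weight `W`, gain `GD`, cap `FB_MAX`, and all class and method names are illustrative assumptions, not values taken from these slides.

```python
from dataclasses import dataclass

W = 2.0            # assumed weight on queue growth (the "rate excess" term)
FB_MAX = 64.0      # assumed cap on |Fb| after quantization
GD = 1.0 / 128.0   # assumed decrease gain, so GD * FB_MAX = 1/2


def congestion_feedback(q: float, q_eq: float, q_old: float, w: float = W) -> float:
    """Congestion Point: Fb = -(Qoff + w * Qdelta).

    Qoff = q - q_eq is the queue excess; Qdelta = q - q_old approximates
    the rate excess. A negative Fb signals congestion.
    """
    q_off = q - q_eq
    q_delta = q - q_old
    return -(q_off + w * q_delta)


@dataclass
class RateLimiter:
    """Reaction Point: rate limiter driven by Fb messages and recovery cycles."""
    current_rate: float
    target_rate: float

    def on_feedback(self, fb: float) -> None:
        """Multiplicative decrease in proportion to |Fb| (at most one half)."""
        if fb >= 0:
            return  # only negative feedback is acted on in this sketch
        magnitude = min(abs(fb), FB_MAX)
        self.target_rate = self.current_rate
        self.current_rate *= 1.0 - GD * magnitude

    def on_recovery_cycle(self, active_increase: float = 0.0) -> None:
        """Called when the byte counter or timer expires: move halfway
        back toward the target, optionally probing for more bandwidth."""
        self.target_rate += active_increase
        self.current_rate = (self.current_rate + self.target_rate) / 2.0


if __name__ == "__main__":
    fb = congestion_feedback(q=30, q_eq=20, q_old=25)
    rl = RateLimiter(current_rate=10.0, target_rate=10.0)
    rl.on_feedback(fb)
    rl.on_recovery_cycle()
    print(fb, rl.current_rate)
```

The decrease step depends on the feedback value, while the increase step is driven only by the counter and timer, which is what makes the behavior resemble AIMD.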
27. Agenda
• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods
Leveraging Performance of Multiroot Data Center Networks by Reactive Reroute
Adrian S.-W. Tam, Kang Xi, H. Jonathan Chao
Department of Electrical and Computer Engineering, Polytechnic Institute of New York University
28. Exploit Multipath Property
• Use QCN to further leverage redundancy
– Per-flow CN adjusts BW: Spectral
– Relocates flows among paths: Spatial
– Both mitigate congestion
• Multiroot, Clos / fat-tree topology
– Upward: destination based, deterministic
– Downward: could be randomized or rerouted
• Hashed ECMP: Distributes flow population
• Flow-reroute: Balancing congested links
29. Reactive Reroute
• Edge switches count received QCNs per port
– Only edge switches reroute, considered sufficient
– Only for upward PKTs, not for downward
• Reroutes flows that are both elephant and congested,
detected by counting QCNs in a short period
• Three reroute methods:
– Uniform random
– Min. prob. of congestion (conditional prob.)
– Weighted combination of the above two
• Freezes a rerouted flow to avoid flapping
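A rough sketch of the edge-switch logic above, covering only the uniform-random method. The threshold, window, and freeze duration are hypothetical parameters, as are the class and method names, and excluding the current uplink from the random choice is an assumption of this sketch rather than something the slides state.

```python
import random
from collections import defaultdict, deque


class EdgeRerouter:
    """Counts QCN messages per flow and reroutes flows that look congested."""

    def __init__(self, uplinks, threshold=3, window=0.001, freeze=0.01, rng=None):
        self.uplinks = list(uplinks)
        self.threshold = threshold    # QCNs within the window that trigger a reroute
        self.window = window          # length of the "short period", in seconds
        self.freeze = freeze          # hold time after a reroute, to avoid flapping
        self.rng = rng or random.Random()
        self.route = {}                       # flow id -> current uplink
        self.qcn_times = defaultdict(deque)   # flow id -> recent QCN timestamps
        self.frozen_until = {}                # flow id -> time the freeze ends

    def assign(self, flow, uplink):
        """Record the initial upward path of a flow (e.g. from hashed ECMP)."""
        self.route[flow] = uplink

    def on_qcn(self, flow, now):
        """Handle one QCN message for a flow; return the uplink to use next."""
        times = self.qcn_times[flow]
        times.append(now)
        while times and now - times[0] > self.window:
            times.popleft()           # forget QCNs outside the window
        if len(times) < self.threshold:
            return self.route[flow]   # not enough evidence of congestion
        if now < self.frozen_until.get(flow, float("-inf")):
            return self.route[flow]   # recently rerouted: stay put
        candidates = [u for u in self.uplinks if u != self.route[flow]]
        if not candidates:
            return self.route[flow]
        self.route[flow] = self.rng.choice(candidates)   # uniform random
        self.frozen_until[flow] = now + self.freeze
        times.clear()
        return self.route[flow]


if __name__ == "__main__":
    r = EdgeRerouter(["up0", "up1", "up2"], rng=random.Random(1))
    r.assign("flowA", "up0")
    for t in (0.0000, 0.0002, 0.0004):
        print(t, r.on_qcn("flowA", t))
```

A flow that keeps drawing QCNs is treated here as both large and congested, which matches the counting-based detection on the slide without needing a separate elephant-flow detector.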
33. Outlier Latency
• Very large flows are throttled by L2 congestion
control and thus see large latency
• 60% finish within 1 ms, but the average is 15 ms!
34. Discussion
• Why is Min. reroute always worse?
– Some flows’ path overlap in the beginning
– Edge switches have no global information
– Receives QCN from the same (port, agg),
leading to synchronized reroute
• Operates a centralized controller?
– Authors argue that gain is very small
– But they do not present more on the “outliers”
– The flows with the longest latencies are the larger ones
– The larger flows could be vital connections
35. Discussion
• L2 congestion control protects TCP over UDP
• No PKT loss, almost no incast problem
• Out-of-order problem is more severe for UDP
• However, because the switch buffer is tightly
monitored, the number of out-of-order PKTs is
bounded by at most 5nr/s
(n: buffer size) (r: sending rate) (s: link rate)
• Freezes a rerouted flow: Also limits reordering
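A small worked example of the 5nr/s bound above. It assumes n is measured in packets and that r and s share the same unit, so the result is a packet count; the slide does not state units, and the function name and sample numbers are hypothetical.

```python
def max_out_of_order_packets(buffer_size: float, send_rate: float, link_rate: float) -> float:
    """Upper bound 5 * n * r / s on out-of-order packets.

    buffer_size (n) is assumed to be in packets; send_rate (r) and
    link_rate (s) must use the same unit, e.g. bits per second.
    """
    if link_rate <= 0:
        raise ValueError("link_rate must be positive")
    return 5.0 * buffer_size * send_rate / link_rate


if __name__ == "__main__":
    # Hypothetical numbers: 100-packet buffer, 1 Gb/s flow on a 10 Gb/s link.
    print(max_out_of_order_packets(100, 1e9, 10e9))
```

The bound shrinks as the flow's share of the link (r/s) shrinks, so slower flows see proportionally less reordering after a reroute.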
36. Agenda
• Multi-Root Clos / Fat-Tree Topology
• Surveyed Solutions to Multipathing
• 802.1Qau – QCN
• QCN and Reactive Reroute
• Comparison of Multipathing Methods
Comparative Evaluation of CEE-based Switch Adaptive Routing
Daniel Crisan, Mitch Gusat, Cyriel Minkenberg,
2nd Workshop on Data Center - Converged and Virtual Ethernet Switching (DC CAVES),
2010
37. Multipathing Methods
• Deterministic, static, or preconfigured
– Single fixed path
– VLAN-based, multiple fixed paths, ST-per-VLAN
• Oblivious, randomized
– Hashed by headers
– Split to intermediaries
• Reactive, switch adaptive routing
• Controller-enabled centralized scheduling
38. Comparison
• Deterministic, static, or preconfigured
– Simple, no re-ordering
• Oblivious, randomized, good when…
– Single prio., symmetric traffic
• Reactive, switch adaptive routing, realistic…
– Multiple prio., asymmetric
• Controller-enabled centralized scheduling
– Large input set, higher complexity
– Controller hard to implement, high cost low gain?
• Convergence and virtualization are trends
39. Discussion
• Data center traffic patterns are evolving and
unknown a priori in many cases
• Justifies multiple routing / balancing schemes
Currently no single killer solution
• Should be able to switch between modes
Reactive-Adaptive and Randomized
• Role of controller still to be optimized
– Could be useful for critical flows / situations
– Detects and reacts in a slower manner
– Not ideal for dynamic fast reaction
40. Reference
• Tzi-cker Chiueh, Cheng-Chun Tu, Yu-Cheng Wang, Pai-Wei Wang, Kai-Wen Li, Yu-Ming Huang,
“Peregrine: An All-Layer-2 Container Computer Network”, IEEE Cloud 2012
• M. Alizadeh, B. Atikoglu, A. Kabbani, A. Lakshmikantha, R. Pan, B. Prabhakar, and M. Seaman, “Data
center transport mechanisms: Congestion control theory and IEEE standardization,”
Communication, Control, and Computing, 2008 46th Annual Allerton Conference on
• A. Kabbani, M. Alizadeh, M. Yasuda, R. Pan, and B. Prabhakar. “AF-QCN: Approximate fairness with
quantized congestion notification for multitenanted data centers”, In High Performance
Interconnects (HOTI), 2010, IEEE 18th Annual Symposium on
• Adrian S.-W. Tam, Kang Xi, H. Jonathan Chao, “Leveraging Performance of Multiroot Data Center
Networks by Reactive Reroute”, 2010 18th IEEE Symposium on High Performance Interconnects
• Daniel Crisan, Mitch Gusat, Cyriel Minkenberg, “Comparative Evaluation of CEE-based Switch
Adaptive Routing”, 2nd Workshop on Data Center - Converged and Virtual Ethernet Switching (DC
CAVES), 2010