- TCP was originally designed with a goal of allowing communication networks to withstand nuclear war through a loose network of interconnected nodes with ad-hoc routing and weak service guarantees.
- TCP provides reliable, in-order delivery over high-latency wide-area networks, but it is not optimized for high-performance computing clusters built on low-latency InfiniBand or 10 Gigabit Ethernet networks.
- Alternatives exist: TCMP, the UDP-based cluster transport in Oracle Coherence, exploits ordered delivery for fast congestion detection, and SCTP adds message orientation, multi-streaming, and multihoming that TCP lacks.
2. Session outline
• Good old TCP
design goals, tuning, caveats
• Network congestion
• LAN vs. WAN
• Alternatives to TCP
TCMP – cluster transport in Coherence
SCTP
• Death detection
• Multicast
3. TCP/IP origins
Main design goal
– communication infrastructure resistant to the effects of thermonuclear war.
• Loose network of interconnected nodes
• Ad-hoc routing decisions
• Very weak service guarantees
4. TCP’s dirty little secrets
High latency networks
• SO_RCVBUF – receive buffer size caps connection bandwidth (at most one window per round trip)
Nagle’s algorithm – up to 200 ms delay
• Use TCP_NODELAY
• Does not affect localhost connections
Firewalls
• Silent connection drops
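The tuning points on this slide can be sketched with Python's standard socket module. This is a minimal sketch, not a recommended configuration: the helper names `max_throughput_bps` and `tune_tcp_socket` are made up for illustration, and enabling SO_KEEPALIVE is one common way to cope with firewalls that silently drop idle connections, not something the slide itself prescribes.

```python
import socket


def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound in bits/s: at most one receive window per round trip."""
    return window_bytes * 8 / rtt_seconds


def tune_tcp_socket(sock: socket.socket, rcvbuf_bytes: int) -> None:
    """Apply the tweaks to a TCP socket that is not connected yet."""
    # Request a larger receive buffer. The OS may cap or adjust the value,
    # so read it back with getsockopt() if the exact size matters.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
    # Disable Nagle's algorithm so small writes are sent immediately.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Assumption: keepalive probes on idle connections help notice (and
    # sometimes prevent) silent drops. The default idle time is an OS
    # setting and is often long, so it may need tuning as well.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    tune_tcp_socket(s, 1024 * 1024)
    print("granted SO_RCVBUF:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
    s.close()
    # A 64 KiB window over a 100 ms round trip: bits per second upper bound.
    print(max_throughput_bps(64 * 1024, 0.1))
```

With a 64 KiB window and a 100 ms round trip the bound is about 5.2 Mbit/s, regardless of how fast the link is, which is why the receive buffer matters on high-latency paths.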
6. TCP summary
• Thermonuclear resistance approach
Cowardly in bandwidth utilization
Vulnerable to random packet drops
• No messages
• Head-of-line blocking
• No multihoming
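"No messages" means TCP delivers a byte stream, so the application has to mark message boundaries itself. A minimal sketch of one common approach, a 4-byte length prefix, follows; the names `frame` and `Deframer` are illustrative and not taken from any library.

```python
import struct
from typing import List

HEADER = struct.Struct("!I")  # 4-byte big-endian payload length


def frame(payload: bytes) -> bytes:
    """Prefix a message with its length so the receiver can find its end."""
    return HEADER.pack(len(payload)) + payload


class Deframer:
    """Reassembles messages from arbitrarily split chunks of a byte stream."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, chunk: bytes) -> List[bytes]:
        """Add received bytes; return every message completed so far."""
        self._buf += chunk
        messages = []
        while len(self._buf) >= HEADER.size:
            (length,) = HEADER.unpack_from(self._buf)
            end = HEADER.size + length
            if len(self._buf) < end:
                break  # the rest of this message has not arrived yet
            messages.append(bytes(self._buf[HEADER.size:end]))
            del self._buf[:end]
        return messages
```

In use, whatever `recv()` returns is passed to `feed()`, which may yield zero, one, or several complete messages per call.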
7. Reality of HPC clusters
• In-order frame delivery
• InfiniBand / 10GbE – link-level flow control
• Low latency
• No slow start needed
• Congestion control could be drastically simplified
8. UDP based transport
• No flow control
– a much larger receive buffer is required to avoid losses on the receiver side (e.g. raise net.core.rmem_max with sysctl -w)
– congestion prevention must be implemented by the application
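Raising the system-wide limit is only half of the job: the application still has to request the larger buffer on its own socket. A minimal sketch, assuming a hypothetical helper `open_udp_receiver`; reading the value back matters because the kernel may silently grant less than requested (on Linux the request is capped by net.core.rmem_max, and the reported value includes kernel bookkeeping overhead).

```python
import socket
from typing import Tuple


def open_udp_receiver(requested_rcvbuf: int) -> Tuple[socket.socket, int]:
    """Create a UDP socket and ask for a large receive buffer.

    Returns the socket and the buffer size the OS reports afterwards.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_rcvbuf)
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, granted


if __name__ == "__main__":
    sock, granted = open_udp_receiver(256 * 1024)
    # If the granted size is far below the request, the system limit
    # probably needs raising before the request can take effect.
    print("requested:", 256 * 1024, "granted:", granted)
    sock.close()
```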
9. TCMP (Oracle Coherence)
• UDP-based protocol
• Exploits ordered delivery for fast NACK
• Fast NACK -> very fast congestion detection
• Extra logic to account for JVM-specific behavior
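The fast NACK idea can be illustrated without any Coherence internals. If the network never reorders packets, a gap in sequence numbers can only mean loss, so the receiver can request retransmission immediately instead of waiting for a timeout. The class below is a sketch under that assumption; it is not TCMP's actual implementation, and `FastNackReceiver` is a made-up name.

```python
from typing import List, Set


class FastNackReceiver:
    """Detects lost packets from sequence gaps on an in-order network."""

    def __init__(self) -> None:
        self.next_expected = 0
        self.missing: Set[int] = set()

    def on_packet(self, seq: int) -> List[int]:
        """Process one packet; return sequence numbers to NACK right away."""
        if seq < self.next_expected:
            # A retransmission filling an earlier gap, or a duplicate.
            self.missing.discard(seq)
            return []
        # Everything between the expected number and this one was skipped.
        # With ordered delivery those packets are lost, not merely late.
        lost = list(range(self.next_expected, seq))
        self.missing.update(lost)
        self.next_expected = seq + 1
        return lost
```

A burst of NACKs is also an early congestion signal for the sender, which is the point of the third bullet on the slide.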
11. Peer death detection
Response timeouts are a poor detector
• Temporary network outages
– change of route, congestion, etc.
• Temporary application outages
– GC, swapping, server CPU starvation
• Positive feedback loop
• “Corrupted witness” syndrome
12. Peer death detection
Ingredients of good death detection:
• Process death detection using TCP
• Monitor remote OS liveness, not just the peer process
• Multiple-witness suspect protocol
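One way to read the multiple-witness idea: a node is declared dead only when several distinct members independently suspect it, so a single witness with a broken network link (the "corrupted witness") cannot evict healthy peers on its own. The sketch below assumes a simple fixed quorum; `SuspectTracker` and its methods are hypothetical names, and real cluster protocols are considerably more involved.

```python
from typing import Dict, Iterable, Set


class SuspectTracker:
    """Collects suspicions and confirms death only on a quorum of witnesses."""

    def __init__(self, members: Iterable[str], quorum: int) -> None:
        self.members = set(members)
        self.quorum = quorum
        self._witnesses: Dict[str, Set[str]] = {}

    def report_suspect(self, witness: str, suspect: str) -> bool:
        """Record that `witness` suspects `suspect`.

        Returns True once at least `quorum` distinct witnesses agree.
        """
        if witness == suspect:
            return False
        if witness not in self.members or suspect not in self.members:
            return False
        voters = self._witnesses.setdefault(suspect, set())
        voters.add(witness)  # a set, so repeated reports count only once
        return len(voters) >= self.quorum

    def report_alive(self, witness: str, suspect: str) -> None:
        """Withdraw a suspicion, e.g. after the witness hears from the peer."""
        self._witnesses.get(suspect, set()).discard(witness)
```

With a quorum of 2, one witness that suspects every other member never gets any of them declared dead unless a second member agrees.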
13. SCTP
SCTP – L4 protocol, TCP replacement
• Works over IP; originally designed to carry SS7 signaling
• Message-oriented
• Multi-stream delivery
• Multihoming
• Fast networks and jumbo frames in mind
14. Multicast
Group multicast
• Suitable for group communication
• Not supported in network hardware AFAIK
Hub and spoke multicast
• Replicating large amounts of data
• Video broadcasting
• PIM/IGMP (IGMP snooping HW support)
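On the receiving host, IGMP comes into play when a socket joins a group: the join makes the OS announce membership, which routers and snooping switches use to decide where to forward traffic. A minimal sketch using the standard socket API; the helper names and the group address in the usage example are illustrative only.

```python
import socket


def membership_request(group_ip: str, interface_ip: str = "0.0.0.0") -> bytes:
    """Build the ip_mreq value: group address, then local interface address.

    "0.0.0.0" lets the OS choose the interface.
    """
    return socket.inet_aton(group_ip) + socket.inet_aton(interface_ip)


def join_group(group_ip: str, port: int) -> socket.socket:
    """Open a UDP socket and subscribe it to an IPv4 multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Allow several receivers on the same host to share the port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Joining the group is what triggers the IGMP membership report.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group_ip))
    return sock


if __name__ == "__main__":
    # Requires a multicast-capable network interface.
    s = join_group("239.1.2.3", 5007)
    data, sender = s.recvfrom(65535)
    print(sender, len(data))
```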