High-resolution Timer-based 
Packet Pacing Mechanism 
on the Linux Operating System 
Ryousei Takano, Tomohiro Kudoh, 
Yuetsu Kodama, Fumihiro Okazaki 
 
Information Technology Research Institute, 
National Institute of Advanced Industrial 
Science and Technology (AIST) 
Internet Conference 2010, October 26, 2010
Outline
• Packet pacing
• Packet pacing mechanism using a high-resolution timer
• Evaluation experiments
• Summary
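Background (1/2)
• Network capacity keeps growing, but maximizing bandwidth utilization is becoming difficult
• The cause of the problem is contention between communications, e.g., the TCP incast problem
• MPI All-to-all communication
• MapReduce shuffle communication
[Figure: compute nodes 1 to N send through one switch, whose buffer overflows]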
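Background (2/2)
• How can bandwidth utilization be maximized?
– (transmission rate) = (available bandwidth) / (number of nodes communicating at the same time)
• However, because communication is bursty, the available bandwidth is momentarily exceeded
Packet pacing (smoothing of burstiness) is required
[Figure: actual communication (with bursts), where nodes 1 to 3 burst at BW into the switch and its buffer overflows, compared with ideal communication (without bursts), where each node sends at BW / 3]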
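Achieving ideal pacing
• Accurate control of the packet transmission interval is required
• When (packet size) = (inter-packet gap), the transmission rate is 1/2 of the physical bandwidth
– e.g., required precision of the packet transmission interval
  1 Gbps, MTU 1500 B → 24 microseconds
  10 Gbps, MTU 9000 B → 14.4 microseconds
[Figure: packet transmission interval and inter-packet gap]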
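The two interval figures on this slide follow from simple arithmetic. The sketch below reproduces them under two assumptions that are not stated in the slides: the interval is measured from the start of one packet to the start of the next, and Ethernet preamble and inter-frame gap are ignored. The function name is illustrative only.

```python
def required_interval_us(link_bps: float, packet_bytes: int, rate_fraction: float) -> float:
    """Start-to-start packet interval in microseconds.

    rate_fraction is the target rate divided by the physical link rate,
    e.g. 0.5 when the inter-packet gap equals the packet size.
    """
    wire_time_us = packet_bytes * 8 * 1e6 / link_bps  # time the packet occupies the wire
    return wire_time_us / rate_fraction


if __name__ == "__main__":
    # Gap equal to packet size, so the rate is half the physical bandwidth.
    print(required_interval_us(1e9, 1500, 0.5))    # about 24 microseconds
    print(required_interval_us(10e9, 9000, 0.5))   # about 14.4 microseconds
```

A pacing mechanism therefore has to place packets with an error well below these intervals, which is what motivates a timer finer than the millisecond-scale OS tick.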
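PSPacer
• Software for using communication bandwidth efficiently
• Achieves precise pacing on Gigabit Ethernet in software alone
– By smoothing burst transmissions to match the available bandwidth, it achieves stable, high communication performance
[Figure: at a switch/router, highly bursty traffic causes buffer overflow and packet loss, which degrades communication performance; PSPacer adjusts the packet interval and generates smoothed, stable traffic]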
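Implementation approaches for packet pacing
• Hardware implementation
– Implementations using FPGAs or NPs
– Chelsio T210
• Software implementation
– Timer-driven approach
  • Low-resolution timer
  • High-resolution timer: PSPacer/HT
– Gap packet approach: PSPacer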
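PSPacer: byte clock
• Timer-driven approach
– On gigabit-class networks, control with microsecond precision is difficult
  • OS timer interval: 1 to 10 milliseconds
– Frequent timer interrupt handling increases overhead
• Byte clock
– Uses the transmitted bytes as a clock, without using timer interrupts
  • The time needed to send 1 byte is constant (10 Gbps → 0.8 nanoseconds)
– If packets can be sent back-to-back at wire rate, the packet transmission interval can be controlled accurately
[Figure: byte clock with 9000 B packets at offsets 0, 9K, 18K, 27K, each taking 7.2 us; transmission timing is determined by the number of bytes sent since the start of communication]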
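The byte clock idea can be illustrated with a few lines of arithmetic. This is a minimal sketch, not PSPacer code: it assumes fixed-size packets and a single flow, and the function name is made up for illustration. Positions are expressed in bytes of link time and converted to nanoseconds using the constant per-byte time.

```python
def byte_clock_schedule(link_bps: int, target_bps: int, packet_bytes: int, count: int):
    """Return (byte-clock offset, start time in ns) for each of `count` packets."""
    ns_per_byte = 8e9 / link_bps                      # 0.8 ns per byte at 10 Gbps
    # Byte-clock distance between consecutive packet starts for the target rate.
    stride = packet_bytes * link_bps // target_bps
    return [(i * stride, i * stride * ns_per_byte) for i in range(count)]


if __name__ == "__main__":
    # 9000-byte packets paced to 5 Gbps on a 10 Gbps link.
    for offset, start_ns in byte_clock_schedule(10_000_000_000, 5_000_000_000, 9000, 3):
        print(offset, round(start_ns / 1000, 1), "us")
```

With these inputs the offsets are 0, 18000 and 36000 bytes, so every other 9000-byte slot on the wire is left empty.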
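PSPacer: gap packet approach
• Uses PAUSE frames (IEEE 802.3x flow control)
– No side effects
  • Always discarded at the input port of the switch or router
– Only the real packets are forwarded, keeping their transmission interval
– No special hardware required
[Figure: sending PC, switch, real packets, gap packets]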
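How large a gap packet must be follows from the rate relation used earlier (a gap equal to the packet size halves the rate). The sketch below is an assumption-laden illustration, not the PSPacer implementation: it ignores preamble, inter-frame gap, and the minimum and maximum Ethernet frame sizes that a real PAUSE frame must respect, and the function name is hypothetical.

```python
def gap_bytes(link_bps: int, target_bps: int, packet_bytes: int) -> int:
    """Bytes of gap to insert after one real packet so the flow averages target_bps."""
    if not 0 < target_bps <= link_bps:
        raise ValueError("target rate must be positive and no higher than the link rate")
    # real / (real + gap) = target / link  =>  gap = real * (link - target) / target
    return packet_bytes * (link_bps - target_bps) // target_bps


if __name__ == "__main__":
    print(gap_bytes(10_000_000_000, 5_000_000_000, 9000))   # gap equals the packet size
    print(gap_bytes(10_000_000_000, 2_500_000_000, 9000))   # three packet sizes of gap
```

Because the sender keeps the link busy with real plus gap packets, it must itself be able to sustain wire rate.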
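Limitations of the gap packet approach
1. The sender must be able to transmit packets at wire rate
– If CPU performance is insufficient or the PCI bus becomes a bottleneck, accurate pacing is impossible
  • e.g., 10 GbE, or GbE on a 32-bit/33 MHz PCI bus (theoretical rate 133 MB/s)
2. It cannot be used on anything other than Ethernet
– There is no means of realizing gap packets (PAUSE frames)
3. No support for pseudo devices
– e.g., bonding and tap devices
– Support is possible in principle, but the driver of the pseudo device would have to be modified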
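The Linux timer system
• Low-resolution timer
– Runs handlers at a period of 1/HZ seconds
– Also performs various processing other than timer events at the same time
• High-resolution timer
– Handlers can be registered for an arbitrary time
  • Periodic or one-shot
– Lightweight event processing
[Figure: timeline of ticks (jiffies) 1000 to 1004, marking low-resolution timer ticks with and without events, and high-resolution timer events]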
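Revisiting the timer-driven approach
Problem: CPU load of high-frequency interrupt handling
→ High-resolution timer approach
– Resolution is sub-microsecond, and processing is lightweight
  • Resolution as of Linux kernel 2.6.31: 1/16 microsecond
– OS support
Other measures for reducing CPU load
– Multi-core support in the network stack
– Offload functions provided by the NIC
  • e.g., TCP Segmentation Offload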
Implementation of PSPacer/HT
• Implemented as a kernel module (Qdisc)
– No kernel recompilation required
– Independent of applications, the protocol stack, and drivers
• Usable from standard Linux tools
– Iproute2 (tc(8))
[Figure: PSPacer/HT in the transmit path. Labels: socket layer, socket buffer, protocol stack, enqueue, classifier, interface queues, byte clock scheduler, dequeue, device driver, Netlink socket I/F]
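Byte clock scheduler
Class clock: holds the scheduled transmission time of the packet at the head of the queue
Global clock: the number of bytes transmitted so far
If the smallest class clock is smaller than the global clock, the head packet of that queue is sent and the clock is updated
In practice, "transmission idle time" is needed: a high-resolution timer is set, and the scheduler waits until the transmission time of the next packet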
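The scheduling rule described on this slide can be simulated in a few lines. This is a simplified sketch under stated assumptions, not the PSPacer/HT source: clocks are counted in bytes of link time, a packet is treated as due when its class clock is not ahead of the global clock, and the point where the real qdisc would arm a high-resolution timer is modelled by simply advancing the global clock. The names `PacedClass` and `schedule` are invented for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PacedClass:
    target_bps: int
    queue: deque = field(default_factory=deque)  # queued packet lengths in bytes
    clock: int = 0  # class clock: byte-clock value at which the head packet is due


def schedule(classes, link_bps):
    """Return (start byte-clock, class index, length) for every queued packet."""
    global_clock = 0  # bytes of link time elapsed so far
    sent = []
    while any(c.queue for c in classes):
        # Choose the backlogged class with the smallest class clock.
        idx, cls = min(
            ((i, c) for i, c in enumerate(classes) if c.queue),
            key=lambda pair: pair[1].clock,
        )
        if cls.clock > global_clock:
            # Nothing is due yet: idle until the class clock is reached.
            # (Assumption: this stands in for waiting on a high-resolution timer.)
            global_clock = cls.clock
        length = cls.queue.popleft()
        sent.append((global_clock, idx, length))
        global_clock += length                              # link is busy for `length` bytes
        cls.clock += length * link_bps // cls.target_bps    # next due time for this class
    return sent


if __name__ == "__main__":
    flow = PacedClass(5_000_000_000, deque([9000, 9000, 9000]))
    for start, idx, length in schedule([flow], 10_000_000_000):
        print(start, idx, length)
```

With one 5 Gbps class on a 10 Gbps link the packets start at byte-clock 0, 18000 and 36000; two such classes interleave and fill the link back-to-back.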
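Comparison of the approaches
[Table: rates the low-resolution timer, the high-resolution timer (PSPacer/HT), and gap packets (PSPacer) on generality, accuracy, and CPU load; the cell ratings are not preserved in this transcript]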
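Evaluation
• Experiment items
– Average bandwidth
  • Vary the transmission rate in 100 Mbps steps and measure the error between the target value and the measured value
– Burstiness
  • A metric of buffer usage at the bottleneck router or switch
  • Computed by simulation from packet capture results
– CPU load
  • Relationship to the number of streams
• Evaluated targets
– PSPacer, PSPacer/HT, HTB (Hierarchical Token Bucket)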
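HTB: Hierarchical Token Bucket
• A standard Linux Qdisc module
• Supports hierarchical bandwidth control like CBQ (Class based queuing)
• Uses the high-resolution timer for packet scheduling
– Resolution as of Linux kernel 2.6.31: 1/16 microsecond
• Differences in how the wait time is computed
– PSPacer/HT: computed for each packet from the target rate
– HTB: looked up in the l2t (length to time) table
  • The table has only 256 entries, which limits precision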
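The precision limit of a 256-entry length-to-time table can be seen with a small model. The sketch below is only loosely modelled on how such rate tables are commonly built (one slot per power-of-two "cell" of packet length, each slot holding the time for the top of its cell); the exact construction used by HTB and tc may differ, and all function names here are hypothetical.

```python
def build_l2t(rate_bps: int, mtu: int, slots: int = 256):
    """Build a length-to-time table: slot i holds the time (ns) for (i + 1) cells."""
    cell_log = 0
    while (mtu >> cell_log) >= slots:   # grow the cell until the MTU fits in the table
        cell_log += 1
    table = [((i + 1) << cell_log) * 8 * 10**9 // rate_bps for i in range(slots)]
    return cell_log, table


def table_time_ns(length: int, cell_log: int, table) -> int:
    """Transmission time as read from the table (quantized to the cell size)."""
    return table[min(length >> cell_log, len(table) - 1)]


def exact_time_ns(length: int, rate_bps: int) -> int:
    """Transmission time computed per packet from the target rate."""
    return length * 8 * 10**9 // rate_bps


if __name__ == "__main__":
    rate = 5_000_000_000
    cell_log, table = build_l2t(rate, 9000)
    print("cell size:", 1 << cell_log, "bytes")
    print("table:", table_time_ns(9000, cell_log, table), "ns")
    print("exact:", exact_time_ns(9000, rate), "ns")
```

In this model an MTU of 9000 bytes forces 64-byte cells, so every length in a 64-byte range gets the same time, whereas the per-packet computation has no such quantization.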
Experimental environment
[Figure: sender and receiver PCs, each with a Myri-10G NIC, connected through GtrcNET-10]
• Machine specifications (PC A)
– CPU: Quad-core Xeon (E5430) x 2
– NIC: Myricom Myri-10G (PCIe x8)
  • MTU: 9000 bytes
– Memory: 8 GB DDR2-667
• OS: Ubuntu 9.10 server
– Linux kernel 2.6.31-10 + myri10ge driver 1.5.1
– sysctl parameters:
  • net.core.netdev_max_backlog 25000
  • net.core.rmem_max 16777216
  • net.core.wmem_max 16777216
  • net.ipv4.tcp_rmem 4096 65536 16777216
  • net.ipv4.tcp_wmem 4096 87380 16777216
  • net.ipv4.tcp_no_metrics_save 1
GtrcNET
• A hardware network testbed equipped with a large-scale FPGA
– Various functions can be programmed at wire rate
– GtrcNET-1: GbE (GBIC) x 4 ports + 16 MBytes memory/port
– GtrcNET-10: 10GbE (XENPAK) x 3 ports + 1 GBytes memory/port
• Implemented functions
– Bandwidth measurement (per port, per stream, per VLAN)
– Delay insertion
– Packet capture
– Test packet generation
– Transmission rate control (pacing, shaping, policing)
http://projects.itri.aist.go.jp/gnet/
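Accuracy of average bandwidth control
The transmission bandwidth while running Iperf for 5 seconds was measured with GtrcNET-10
[Figure: Observed Bandwidth - Target Bandwidth (Gbps), from -0.4 to 0.4, plotted against Target Bandwidth (Gbps), from 0 to 10, for PSPacer/HT, PSPacer, and HTB]
Maximum error (error rate):
+473 Mbps (+9.5%)
+36 Kbps (0.0%)
-287 Mbps (-5.7%)
HTB: error in the transmission rate estimate
PSPacer: wire rate is not reached
PSPacer/HT: achieves accurate pacing even on a system that cannot reach wire rate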
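Burstiness
• A metric of buffer usage at the bottleneck router or switch
– The larger the burstiness, the higher the risk of buffer overflow
• 70 packets sent at 5 Gbps were captured, and the maximum burstiness was computed by simulation

max. burstiness
PSPacer 7
PSPacer/HT 9
HTB 8
Low-resolution timer (1 ms) 39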
The high-precision timer is very effective at suppressing burstiness
Note: TSO is enabled; ideally the value is 7
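Delay in high-resolution timer handler processing
The kernel timer interrupt setting was changed, and the delay of timer events was measured
(1) Kernel timer interrupt: 10 ms, burstiness = 53
(2) Kernel timer interrupt: 1 ms, burstiness = 9
Events that the high-resolution timer handler could not process are delayed until the low-resolution timer handler next runs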
CPU load (1 stream)

Total target bandwidth   PSPacer   PSPacer/HT   HTB
1 Gbps                   0.66      0.71         0.84
2 Gbps                   1.80      1.60         1.83
4 Gbps                   3.74      3.66         3.92
8 Gbps                   7.67      8.35         8.88

As the communication bandwidth grows, the load of high-resolution timer processing …
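CPU load (multiple streams)
Each approach is measured with 1 stream and with (total bandwidth / 50 Mbps) streams

Total target   PSPacer            PSPacer/HT         HTB
bandwidth      1      BW/50 Mbps  1      BW/50 Mbps  1      BW/50 Mbps
1 Gbps         0.66   1.04        0.71   0.91        0.84   0.82
2 Gbps         1.80   2.16        1.60   2.44        1.83   1.88
4 Gbps         3.74   4.78        3.66   8.19        3.92   4.49
8 Gbps         7.67   11.19       8.35   17.04       8.88   25.55

As the number of streams grows, the load of high-resolution timer processing …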
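Summary of experimental results
[Table: rates the low-resolution timer, the high-resolution timer, and gap packets on generality, accuracy, and CPU load; the cell ratings are not preserved in this transcript]
CPU load leaves room for improvement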
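Summary
• Proposed and evaluated a packet pacing mechanism that uses a high-precision timer
– Removes the limitations of the gap packet approach
– Precise pacing is possible even in a 10GbE environment
– CPU load is somewhat higher with multiple streams and with a high target bandwidth
• Future work
– Reducing CPU load during multi-stream communication
– Improving the transmission rate estimate and related aspects of HTB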
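Thank you for your attention.
PSPacer/HT is released under the GNU GPL license
http://www.gridmpi.org/pspacer.jsp
Part of this research makes use of results from a Grant-in-Aid for Scientific Research (20800083) from the Ministry of Education, Culture, Sports, Science and Technology, and from the "Green Network System Technology Research and Development Project (Green IT Project)" commissioned by the New Energy and Industrial Technology Development Organization (NEDO).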

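TCP Segmentation Offload
• From the software's point of view, it is enough to regard the MTU size as having become larger
– The usual TSO size is 64 KB
[Figure: packet timing at 5 Gbps with MTU 9000 B. With TSO disabled, each packet takes 7.2 microseconds; with TSO enabled, one TSO packet takes 50.4 microseconds]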
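The 7.2 and 50.4 microsecond figures on this slide are consistent with a 64 KB TSO segment being cut into MTU-sized packets that leave a 10 Gbps NIC back-to-back. The sketch below checks that arithmetic. It assumes a 10 Gbps wire rate (the Myri-10G NIC of the experimental setup), treats each packet as exactly MTU bytes, and ignores headers and framing overhead; the function name is illustrative.

```python
def tso_burst(tso_bytes: int, mtu: int, link_bps: float):
    """Return (packets per TSO segment, back-to-back burst duration in microseconds)."""
    packets = tso_bytes // mtu                       # whole MTU-sized packets per segment
    burst_us = packets * mtu * 8 * 1e6 / link_bps    # time they occupy the wire
    return packets, burst_us


if __name__ == "__main__":
    packets, burst_us = tso_burst(64 * 1024, 9000, 10e9)
    print(packets, "packets,", round(burst_us, 1), "us")   # 7 packets, about 50.4 us
    print(round(tso_burst(9000, 9000, 10e9)[1], 1), "us")  # single packet, about 7.2 us
```

Seen this way, enabling TSO behaves like sending with a roughly seven times larger MTU, which is why a burst of about 7 packets is the best a software pacer can achieve in that configuration.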