1. Joint Techs Workshop, TIP 2004
Jan 28, 2004
Honolulu, Hawaii
Trans-Pacific Grid Datafarm
Osamu Tatebe
Grid Technology Research Center, AIST
On behalf of the Grid Datafarm Project
National Institute of Advanced Industrial Science and Technology
2. Key points of this talk
Trans-Pacific Grid file system and testbed
70 TBytes disk capacity, 13 GB/sec disk I/O performance
Trans-Pacific file replication [SC2003 Bandwidth Challenge]
1.5 TB of data transferred in an hour
Multiple high-speed Trans-Pacific networks:
APAN/TransPAC (2.4 Gbps OC-48 POS, 500 Mbps OC-12
ATM), SuperSINET (2.4 Gbps x 2, 1 Gbps available)
6,000 miles
Stable 3.79 Gbps out of a theoretical peak of 3.9 Gbps (97%)
using 11 node pairs (MTU 6000 B)
We won the "Distributed Infrastructure" award!
3. [Background] Petascale Data Intensive
Computing
High Energy Physics
CERN LHC, KEK Belle
~MB/collision, 100 collisions/sec
~PB/year
2,000 physicists, 35 countries
(Figures: detectors for the LHCb and ALICE experiments)
Astronomical Data Analysis
Data analysis of the whole archive
TB~PB/year/telescope
SUBARU telescope: 10 GB/night, 3 TB/year
4. [Background 2] Large-scale File Sharing
P2P – exclusive and special-purpose approach
Napster, Gnutella, Freenet, . . .
Grid technology – file transfer, metadata management
GridFTP, Replica Location Service
Storage Resource Broker (SRB)
Large-scale file system – general approach
Legion, Avaki [Grid, no replica management]
Grid Datafarm [Grid]
Farsite, OceanStore [P2P]
AFS, DFS, . . .
5. Goal and feature of Grid Datafarm
Goal
Dependable data sharing among multiple organizations
High-speed data access, High-speed data processing
Grid Datafarm
Grid File System – Global dependable virtual file system
Integrates CPU + storage
Parallel & distributed data processing
Features
Secure, based on the Grid Security Infrastructure
Scalable with data size and usage scenarios
Location-transparent data access
Automatic and transparent replica access for fault tolerance
High-performance data access and processing by accessing multiple
dispersed storage nodes in parallel (file-affinity scheduling)
6. Grid Datafarm (1): Gfarm file system -
World-wide virtual file system [CCGrid 2002]
Transparent access to dispersed file data in a Grid
POSIX I/O APIs, plus native Gfarm APIs for extended file-view
semantics and replication
Maps the virtual directory tree to physical files
Automatic and transparent replica access for fault
tolerance and access-concentration avoidance
(Figure: a virtual directory tree rooted at /grid, with branches ggf and
jp/aist/gtrc holding file1 and file2; the file system metadata maps the
virtual tree onto physical files file1-file4 in the Gfarm file system,
including file replica creation across storage nodes.)
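To make the mapping concrete, here is a minimal sketch in Python (not the real Gfarm API; the metadata table, remote_open, and host names are all invented) of how a client could resolve a virtual path via the metadata server and fail over transparently between replicas:

    # Hypothetical sketch (not the real Gfarm API): resolve a virtual path
    # to one of several physical replicas, with transparent failover.
    METADATA = {
        # virtual path -> list of (host, physical path) replica locations
        "/grid/jp/aist/gtrc/file1": [
            ("node03.example.jp", "/data/gfarm/file1"),
            ("node17.example.org", "/data/gfarm/file1"),
        ],
    }

    def remote_open(host, path):
        # Stand-in for an RPC that opens `path` on a remote storage node.
        return f"<handle {host}:{path}>"

    def open_virtual(vpath):
        # Look up the path on the metadata server, then try each replica.
        replicas = METADATA.get(vpath)
        if not replicas:
            raise FileNotFoundError(vpath)
        for host, ppath in replicas:
            try:
                return remote_open(host, ppath)
            except ConnectionError:
                continue  # replica unreachable: fail over to the next one
        raise OSError(f"no live replica for {vpath}")

    print(open_virtual("/grid/jp/aist/gtrc/file1"))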
7. Grid Datafarm (2): High-performance data access
and processing support [CCGrid 2002]
World-wide parallel and distributed processing
Aggregate of files = superfile
Data processing of superfiles = parallel and
distributed data processing of member files
Local file view (SPMD parallel file access)
File-affinity scheduling (“Owner-computes”)
(Figure: a world-wide virtual CPU runs parallel and distributed
processing over the Grid file system; example: a year of astronomical
archival data, 365 files grouped as one superfile, analyzed with
365-way parallelism.)
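As a rough illustration of file-affinity scheduling, the Python sketch below (hypothetical; locate() stands in for a metadata-server lookup, and all node and file names are invented) assigns each member file of a superfile to the node that already stores it, moving computation to the data:

    # Sketch of file-affinity ("owner computes") scheduling.
    def locate(member):
        # Stand-in for a metadata-server lookup: member file -> owning node.
        return {"day001": "nodeA", "day002": "nodeB"}.get(member, "nodeA")

    def schedule(superfile_members, process):
        # Group member files by the node that owns them.
        by_node = {}
        for member in superfile_members:
            by_node.setdefault(locate(member), []).append(member)
        # In the real system each node would run `process` on its own files
        # in parallel (SPMD, local file view); here we apply it sequentially.
        return {node: [process(m) for m in files]
                for node, files in by_node.items()}

    print(schedule(["day001", "day002"], lambda m: f"analyzed {m}"))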
8. Transfer technology in long fat networks
Bandwidth and latency between US and Japan
1-10 Gbps, 150-300 msec RTT
TCP acceleration
Adjustment of congestion window
Multiple TCP connections
HighSpeed TCP, Scalable TCP, FAST TCP
XCP (not TCP)
UDP based acceleration
Tsunami, UDT, RBUDP, atou, . . .
Bandwidth prediction without packet loss
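The underlying difficulty is the bandwidth-delay product: a single standard TCP stream must keep the whole pipe's worth of data in flight, far beyond the classic 64 KB window. A quick back-of-the-envelope in Python:

    # Bandwidth-delay product: bytes that must be in flight to fill the pipe.
    def bdp_bytes(gbps, rtt_ms):
        return gbps * 1e9 / 8 * rtt_ms / 1e3

    # e.g. a 2.4 Gbps trans-Pacific path at ~150 ms RTT:
    print(f"{bdp_bytes(2.4, 150) / 2**20:.0f} MiB in flight")  # ~43 MiB

After a single loss, standard TCP halves such a window and then grows it back by only one segment per RTT, which is why the accelerated variants above exist.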
9. Multiple TCP streams sometimes
considered harmful . . .
Multiple TCP streams achieve good bandwidth, but they congest the
network excessively; in effect they "shoot themselves in the foot".
(Figure: bandwidth vs. time on APAN/TransPAC LA-Tokyo (2.4 Gbps),
10 msec averages, for three streams TxBW0-TxBW2 and their total
TxTotal. The total oscillates heavily and is not stable: the streams
compensate for each other, push too much flow into the network, and
cause too much congestion.)
Need to limit bandwidth appropriately
10. A programmable network testbed device
GNET-1
Large high-speed
memory blocks
Programmable hardware
network testbed
WAN emulation
- latency, bandwidth,
packet loss, jitter, . . .
Precise measurement
- bandwidth at 100 usec resolution
- latency and jitter between two GNET-1 units
General purpose, very flexible!
11. IFG-based pace control by GNET-1
(Figure: three bandwidth-vs-time plots of shaping by GNET-1,
700 Mbps x 3 over the APAN LA-Tokyo 2.4 Gbps link; each stream
(TxBW0-TxBW2, received as RxBW0-RxBW1) holds flat at 700 Mbps.
Setup: three 1 Gbps senders paced to 700 Mbps through a GNET-1
bottleneck with flow control enabled. NO PACKET LOSS!)
GNET-1 provides
Precise traffic pacing at any data rate by
changing the IFG (Inter-Frame Gap)
A packet-loss-free network using its large input buffer
(16 MB)
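The arithmetic behind IFG pacing is simple: each frame occupies its own bytes plus the preamble plus the gap on the wire, so stretching the gap dilutes the effective rate with no queueing and no drops. A sketch under assumed Ethernet byte accounting (GNET-1's exact bookkeeping may differ):

    # IFG pacing arithmetic (assumed accounting: frame + 8 B preamble + IFG
    # per frame on the wire; GNET-1's exact bookkeeping may differ).
    LINE_MBPS = 1000  # gigabit line rate
    PREAMBLE = 8      # preamble + start-of-frame delimiter, bytes

    def ifg_for_rate(target_mbps, frame_bytes=1518):
        # Gap (bytes) so that back-to-back frames average `target_mbps`.
        return frame_bytes * (LINE_MBPS / target_mbps - 1) - PREAMBLE

    # Pacing one stream to 700 Mbps with full-size 1518 B frames:
    print(f"IFG = {ifg_for_rate(700):.0f} bytes")  # ~643 B vs. the minimum 12 B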
12. Summary of technologies for performance
improvement
[Disk I/O performance] Grid Datafarm – A Grid file system with high-
performance data-intensive computing support
A world-wide virtual file system that federates local file systems of
multiple clusters
It provides scalable disk I/O performance for file replication over high-
speed network links and for large-scale data-intensive applications
Trans-Pacific Grid Datafarm testbed
5 clusters in Japan, 3 clusters in the US, and 1 cluster in Thailand, providing
70 TBytes of disk capacity and 13 GB/sec of disk I/O performance
It supports file replication for fault tolerance and access-concentration
avoidance
[Efficient world-wide high-speed network utilization] GNET-1 – a gigabit
network testbed device
Provides precise IFG-based rate-controlled flows at any rate
Enables stable and efficient use of Trans-Pacific networks with HighSpeed
TCP
13. Trans-Pacific Grid Datafarm testbed:
Network and cluster configuration
(Figure: testbed map. Trans-Pacific theoretical peak 3.9 Gbps; total
Gfarm disk capacity 70 TBytes; disk read/write 13 GB/sec.)
Clusters:
Titech: 147 nodes, 16 TBytes, 4 GB/sec
Univ Tsukuba: 10 nodes, 1 TBytes, 300 MB/sec
KEK: 7 nodes, 3.7 TBytes, 200 MB/sec
AIST: 32 nodes, 23.3 TBytes, 2 GB/sec
SDSC: 16 nodes, 11.7 TBytes, 1 GB/sec
SC2003 booth (Phoenix): 16 nodes, 11.7 TBytes, 1 GB/sec
Indiana Univ and Kasetsart Univ (Thailand) also participate
Networks: SuperSINET (10G domestic, 2.4G (1G) to New York [950 Mbps]);
APAN/TransPAC Tokyo XP-Los Angeles 2.4G [2.34 Gbps] and OC-12 ATM
622M via Chicago [500 Mbps]; Abilene; Maffin; NII; Tsukuba WAN (5G)
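The three trans-Pacific routes' quoted rates account for the headline figures, as a quick check shows:

    # Per-route rates (Gbps) as quoted in the talk.
    routes = {"APAN/TransPAC LA": 2.34,
              "APAN/TransPAC Chicago": 0.50,
              "SuperSINET NYC": 0.95}
    peak = 3.9
    total = sum(routes.values())
    print(f"{total:.2f} Gbps = {total / peak:.0%} of the {peak} Gbps peak")
    # -> 3.79 Gbps = 97% of the 3.9 Gbps peak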
14. Scientific Data for Bandwidth Challenge
Trans-Pacific File Replication of scientific data
For transparent, high-performance, and fault-tolerant access
Astronomical Object Survey on Grid Datafarm [HPC Challenge participant]
World-wide data analysis on the whole archive
652 GBytes data observed by SUBARU telescope
N. Yamamoto (AIST)
Large configuration data from Lattice QCD
Three sets of hundreds of gluon field configurations on a 24^3 x 48 4-D
space-time lattice (3 sets x 800 x 364.5 MB = 854.3 GB)
Generated by the CP-PACS parallel computer at the Center for
Computational Physics, Univ. of Tsukuba (300 Gflops x years of CPU
time) [Univ Tsukuba Booth]
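The quoted volume checks out if a GB is counted as 1024 MB:

    # 3 sets x 800 configurations x 364.5 MB each, with 1 GB = 1024 MB:
    total_mb = 3 * 800 * 364.5
    print(f"{total_mb / 1024:.1f} GB")  # -> 854.3 GB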
15. Network bandwidth in APAN/TransPAC
LA route
(Figure: PC clusters in Tokyo and LA, each aggregated at 3 Gbps through
switches, a Force10 E600, and a Juniper M20 router, across the 2.4 Gbps
(10G local) APAN/TransPAC LA route; RTT 141 ms; traffic paced by GNET-1.
Bandwidth-vs-time plot: with no pacing the flow is unstable; with pacing
at 2.3 Gbps total (900 + 900 + 500 Mbps) the transfer holds a stable
2.3 Gbps.)
16. APAN/TransPAC LA route (1)
17. APAN/TransPAC LA route (2)
18. APAN/TransPAC LA route (3)
19. File replication between Japan and US
(network configuration)
(Figure: network configuration. PC clusters in Tokyo and Tsukuba connect
through switches, a Force10 E600, and a Juniper M20 router, with GNET-1
pacing, to three trans-Pacific routes: APAN/TransPAC to LA at 3G/2.4G
(RTT 141 ms); APAN/TransPAC at 500M/1G via Chicago to Abilene (RTT
250 ms); and SuperSINET at 2.4G (1G) via NYC to Abilene and on to
Phoenix (RTT 285 ms).)
20. File replication performance between Japan
and US (total)
21. APAN/TransPAC Chicago
Pacing at 500 Mbps, quite stable
22. APAN/TransPAC LA (1)
After re-pacing from 800 to 780 Mbps, quite stable
23. APAN/TransPAC LA (2)
After re-pacing of LA (1), quite stable
24. APAN/TransPAC LA (3)
After re-pacing of LA (1), quite stable
25. SuperSINET NYC
Re-pacing from 930 to 950 Mbps
26. Summary
Efficient use of long fat networks near the peak rate
Precise IFG-based pacing with GNET-1, kept within the packet-loss-free
bandwidth -> a packet-loss-free network
Stable network flows even with HighSpeed TCP
Disk I/O performance improvement
Parallel disk access using Gfarm
Trans-Pacific file replication performance: 3.79 Gbps out of a theoretical peak
of 3.9 Gbps (97%) using 11 node pairs (MTU 6000 B)
1.5 TB of data transferred in an hour
Linux 2.4 kernel problem during file replication (transfer)
Network transfer stalled within a few minutes when the buffer cache was
flushed to disk; a Linux kernel bug?
Defensive solution: set a very short interval for buffer-cache flushing
This limits the file transfer rate to 400 Mbps per node pair
Successful Trans-Pacific-scale data analysis
. . . but a scalability problem with the LDAP server used as the metadata server
Further improvement needed
27. Future work
Standardization effort with GGF Grid File System WG
Foster (world-wide) storage sharing and integration:
dependable data sharing and high-performance data access
among multiple organizations
Application area
High energy physics experiment
Astronomical data analysis
Bioinformatics, . . .
Dependable data processing in eGovernment and
eCommerce
Other applications that need dependable file sharing
among multiple organizations
28. Special thanks to
Hirotaka Ogawa, Yuetsu Kodama, Tomohiro Kudoh, Satoshi
Sekiguchi (AIST), Satoshi Matsuoka, Kento Aida (Titech),
Taisuke Boku, Mitsuhisa Sato (Univ Tsukuba),
Youhei Morita (KEK), Yoshinori Kitatsuji (APAN Tokyo XP),
Jim Williams, John Hicks (TransPAC/Indiana Univ)
Eguchi Hisashi (Maffin), Kazunori Konishi, Jin Tanaka,
Yoshitaka Hattori (APAN), Jun Matsukata (NII), Chris Robb
(Abilene)
Tsukuba WAN NOC team, APAN NOC team, NII SuperSINET
NOC team
Force10 Networks
PRAGMA, ApGrid, SDSC, Indiana University, Kasetsart
University