In this presentation from the DDN User Meeting at SC13, Tommy Minyard from the Texas Advanced Computing Center describes TACC's new Corral data storage system.
Watch the video presentation: http://insidehpc.com/2013/11/13/ddn-user-meeting-coming-sc13-nov-18/
Corralling Big Data at TACC
1. Corralling Big Data at TACC
Tommy Minyard
Texas Advanced Computing Center
DDN User Group Meeting
November 18, 2013
2. TACC Mission & Strategy
The mission of the Texas Advanced Computing Center is to enable
scientific discovery and enhance society through the application of
advanced computing technologies.
To accomplish this mission, TACC:
– Evaluates, acquires & operates
advanced computing systems
– Provides training, consulting, and
documentation to users
– Collaborates with researchers to
apply advanced computing techniques
– Conducts research & development to
produce new computational technologies
[Slide graphic: Resources & Services; Research & Development]
3. TACC Storage Needs
• Cluster-specific storage
– High performance (tens to hundreds of GB/s bandwidth)
– Large capacity (~2TB per teraflop), purged frequently
– Very scalable to thousands of clients
• Center-wide persistent storage
– Global filesystem available on all systems
– Very large capacity, quota enabled
– Moderate performance, very reliable, high availability
• Permanent archival storage
– Maximum capacity (tens of PBs)
– Slower performance, tape-based offline storage with a spinning-disk cache
4. History of DDN at TACC
• 2006 – Lonestar 3 with DDN S2A9500
controllers and 120TB of disk
• 2008 – Corral with DDN S2A9900 controller
and 1.2PB of disk
• 2010 – Lonestar 4 with DDN SFA10000
controllers with 1.8PB of disk
• 2011 – Corral upgrade with DDN SFA10000
controllers and 5PB of disk
5. Global Filesystem Requirements
• User requests for persistent storage available
on all production systems
– Corral limited to UT System users only
• RFP issued for storage system capable of:
– At least 20PB of usable storage
– At least 100GB/s aggregate bandwidth
– High availability and reliability
• DDN solution selected for project
7. Stockyard: Design and Setup
• A Lustre 2.4.1-based global filesystem, with scalability for future upgrades
• Scalable Unit (SU): 16 OSS nodes providing access to 168 OSTs of RAID6 arrays from two SFA12K couplets, corresponding to 5PB capacity and 25+ GB/s throughput per SU
• Four SUs provide 20PB with 100GB/s now
• An initial set of 16 LNET routers for external mounts
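As a quick sanity check of this layout, the OST count and capacity are visible from any client that mounts the filesystem; a minimal sketch, where the /stockyard mount point is an assumption rather than a path given in the slides:

    # Per-OST capacity as seen from a client; the mount point is an assumption
    lfs df -h /stockyard
    # Count the OSTs: 168 per Scalable Unit, so 672 once all four SUs are online
    lfs df /stockyard | grep -c "OST"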
11. Stockyard: Capabilities and Features
• 20PB usable capacity with 100+ GB/s
aggregate bandwidth
• Client systems can bring their own LNET router set to connect to the Stockyard core IB switches, or connect to the built-in LNET routers over either IB (FDR14) or TCP (10GigE); see the routing sketch after this list
• HSM potential to Ranch tape archival system
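For a routed client, the remote Stockyard network and its gateway routers would typically be declared in the lnet module options; a minimal sketch, assuming the client sits on o2ib0, Stockyard is o2ib100 (the network name that appears in the failover logs later), and the router NIDs shown are hypothetical:

    # /etc/modprobe.d/lustre.conf on a routed client (addresses are hypothetical)
    options lnet networks="o2ib0(ib0)" routes="o2ib100 10.50.0.[1-16]@o2ib0"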
12. Capabilities and Features (cont’d)
• Metadata performance enhancement possible with DNE (Phase 1)
• NRS (Network Request Scheduler) evaluation: characteristics of the different ost_io.nrs_policies settings, particularly CRR-N (client round-robin over NIDs) under contention dominated by a few jobs
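On Lustre 2.4 servers the NRS policy for the OSS bulk I/O service is switched at runtime with lctl; a minimal sketch of how such an evaluation might be driven on an OSS node (everything beyond the stock policy names is an assumption):

    # Show the active and available policies for the bulk I/O service
    lctl get_param ost.OSS.ost_io.nrs_policies
    # Enable client round-robin over NIDs for the evaluation
    lctl set_param ost.OSS.ost_io.nrs_policies="crrn"
    # Revert to the default FIFO policy afterwards
    lctl set_param ost.OSS.ost_io.nrs_policies="fifo"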
13. Stockyard: Numbers So Far
• 16 LNET routers configured as direct clients (within the Stockyard fabric) can push 25GB/s to a single SU
• With two SUs the same set of clients can achieve 50GB/s, and 75GB/s with three SUs
• With four SUs we hit the 16-client limit: no improvement beyond 75GB/s (corresponding to ~4.7GB/s from each client)
14. Numbers So Far (Single Client)
• Single-thread write performance with Lustre 2.4.1 is ~770MB/s
– a big improvement over 2.1.x, which was about 500MB/s
• Multi-threaded writes from a single client saturate around 4.7GB/s (with credits=256 on both servers and clients)
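The credits mentioned here are most likely the o2iblnd transmit credits, which are raised through module options on both servers and clients; a minimal sketch of that setting, where the module choice and the peer_credits value are assumptions:

    # /etc/modprobe.d/lustre.conf on servers and clients (peer_credits is an assumption)
    options ko2iblnd credits=256 peer_credits=16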
15. Numbers So Far (Aggregate)
• Performance numbers with 16 LNET routers: 75GB/s from 16 direct clients
• Numbers from Stampede compute clients: 65GB/s with 256 clients (IOR, POSIX, file-per-process, with 8 tasks per node; a sketch of such a run follows below)
• Saturation point for Stampede clients: 65GB/s
• N.B. credits=64 on the client nodes of Stampede
– A quick test on an interactive 2.1.x node with a higher credit count gives the expected boost.
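A run along those lines (256 nodes, 8 tasks per node, POSIX API, file-per-process) might be launched roughly as follows; ibrun is TACC's MPI launcher, while the transfer/block sizes and output path are assumptions:

    # Task count (256 nodes x 8 tasks = 2048 ranks) comes from the batch job settings
    ibrun ior -a POSIX -F -w -r -t 1m -b 8g -o /stockyard/benchmarks/ior_fpp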
16. Numbers So Far (Failover Tests)
• OSS failover test setup and results
• Procedure (sketched below):
– Identify the OSTs for the test pair
– Initiate dd processes targeted to those OSTs, each about 67GB in size so that they do not finish before the failover
– Interrupt one of the OSS servers with a shutdown using ipmitool
– Record the individual dd process outputs as well as the server- and client-side Lustre messages
– Compare and confirm the recovery and operation of the failover pair with 21 OSTs
• All I/O completes within 2 minutes of failover
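A rough sketch of how those steps could be scripted from a client and an admin node; the OST index, file paths, BMC hostname, and credentials are all assumptions, not details from the slides:

    # Pin a test file to one OST of the failover pair and write ~67GB to it
    lfs setstripe -c 1 -i 42 /stockyard/failover_test/file_ost42
    dd if=/dev/zero of=/stockyard/failover_test/file_ost42 bs=1M count=68000 &
    # Power off the primary OSS through its BMC while the writes are in flight
    ipmitool -I lanplus -H oss-a-bmc -U admin chassis power off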
17. Failover Testing (cont’d)
• Similarly for the MDS pair: the same sequence of interrupted I/O and collection of Lustre messages on both servers and clients; the client-side log shows the recovery:
Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1381348698/real 0] req@ffff88180cfcd000 x1448277242593528/t0(0) o250->MGC192.168.200.10@o2ib100@192.168.200.10@o2ib100:26/25 lens 400/544 e 0 to 1 dl 1381348704 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: Evicted from MGS (at MGC192.168.200.10@o2ib100_1) after server handle changed from 0xb9929a99b6d258cd to 0x6282da9e97a66646
Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: MGC192.168.200.10@o2ib100: Connection restored to MGS (at 192.168.200.11@o2ib100)
18. Automated Failover
• The tests were run on an artificial setup to simplify tracking the completion of the I/O on the clients, and the shutdown and failover mounts were done manually.
• Corosync and Pacemaker are being set up to automate the process.
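Under Pacemaker, each OST mount typically becomes a cluster-managed Filesystem resource that can migrate between the two OSS nodes of a failover pair; a minimal sketch using the pcs shell, where the resource name, device path, mount point, and monitor interval are assumptions:

    # Manage one OST mount as a Pacemaker resource (device/mountpoint are assumptions)
    pcs resource create stockyard-OST0000 ocf:heartbeat:Filesystem \
        device=/dev/mapper/ost0000 directory=/mnt/ost0000 fstype=lustre \
        op monitor interval=30s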
19. Routed Clients
• We monitor the routerstat output on the attached routers, taking differences between two timestamps and focusing on the even distribution of request streams
• Contrary to the expectation that auto_down may suffice, Lustre clients need check_routers_before_use=1 to get automatic updates of router status
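A minimal sketch of the corresponding client-side module options and the monitoring loop; the 5-second interval is just an example:

    # /etc/modprobe.d/lustre.conf on clients: verify router health before use
    options lnet check_routers_before_use=1 auto_down=1
    # On an LNET router node: print forwarding statistics every 5 seconds
    routerstat 5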
20. Routed Clients (cont’d)
• Even with automatic router checks, clients cannot detect non-functional routers: a router that is alive only on the client-side network will still be assumed active by the clients
• Clients encounter timeouts due to the non-functional routers
• Resolution: separate router checks on the router nodes themselves were added
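One way such a router-side check might look, pinging a server NID on the Stockyard side and flagging the router when the ping fails; the NID and the alerting mechanism are assumptions about the approach, not TACC's actual script:

    # Periodic check on a router node (the server NID is an assumption)
    lctl ping 192.168.200.10@o2ib100 > /dev/null 2>&1 || \
        logger -t lnet-router-check "Stockyard-side link unreachable; take this router out of service"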
21. Stockyard: Looking Ahead
• Deploy as a global $WORK space for TACC resources, which will extend the client count to all TACC systems
• Evaluation of Lustre 2.5.0 before full production, for HSM functionality and compatibility with SAM-FS on Ranch
• Quota management (handled differently on 2.4+; see the sketch below)
• Integrated monitoring setup
• Security evaluation
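On Lustre 2.4+ quota enforcement is enabled through configuration parameters on the MGS rather than a mount option, and per-user limits are managed with lfs; a minimal sketch, where the filesystem name, user, and limits (in KB) are assumptions:

    # On the MGS: enforce user and group block quotas for the filesystem
    lctl conf_param stockyard.quota.ost=ug
    # On a client: set and query a user's limits (values in KB are assumptions)
    lfs setquota -u alice -b 1000000000 -B 1100000000 /stockyard
    lfs quota -u alice /stockyard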
22. Summary
• Storage capacity and performance needs are growing at an exponential rate
• High-performance and reliable filesystems
critical for HPC productivity
• Benefits of large parallel filesystems outweigh
the system administration overhead
• The current best solution for cost, performance, and scalability is a Lustre-based filesystem