In this presentation from the DDN User Meeting at SC13, Tommy Minyard from the Texas Advanced Computing Center describes TACC's new Corral data storage system.
Watch the video presentation: http://insidehpc.com/2013/11/13/ddn-user-meeting-coming-sc13-nov-18/
Corralling Big Data at TACC
1. Corralling Big Data at TACC
Tommy Minyard
Texas Advanced Computing Center
DDN User Group Meeting
November 18, 2013
2. TACC Mission & Strategy
The mission of the Texas Advanced Computing Center is to enable
scientific discovery and enhance society through the application of
advanced computing technologies.
To accomplish this mission, TACC:
– Evaluates, acquires & operates
advanced computing systems
– Provides training, consulting, and
documentation to users
– Collaborates with researchers to
apply advanced computing techniques
– Conducts research & development to
produce new computational technologies
[Slide graphic: Resources & Services; Research & Development]
3. TACC Storage Needs
• Cluster-specific storage
– High performance (tens to hundreds of GB/s bandwidth)
– Large capacity (~2TB per teraflop), purged frequently
– Very scalable to thousands of clients
• Center-wide persistent storage
– Global filesystem available on all systems
– Very large capacity, quota enabled
– Moderate performance, very reliable, high availability
• Permanent archival storage
– Maximum capacity (tens of PBs)
– Slower performance, tape-based offline storage with a spinning-disk cache
4. History of DDN at TACC
• 2006 – Lonestar 3 with DDN S2A9500
controllers and 120TB of disk
• 2008 – Corral with DDN S2A9900 controller
and 1.2PB of disk
• 2010 – Lonestar 4 with DDN SFA10000
controllers with 1.8PB of disk
• 2011 – Corral upgrade with DDN SFA10000
controllers and 5PB of disk
5. Global Filesystem Requirements
• User requests for persistent storage available
on all production systems
– Corral limited to UT System users only
• RFP issued for storage system capable of:
– At least 20PB of usable storage
– At least 100GB/s aggregate bandwidth
– High availability and reliability
• DDN solution selected for project
7. Stockyard: Design and Setup
• A Lustre 2.4.1-based global filesystem, with scalability for future upgrades
• Scalable Unit (SU): 16 OSS nodes providing access to 168 OSTs of RAID6 arrays from two SFA12K couplets, corresponding to 5PB capacity and 25+ GB/s throughput per SU
• Four SUs provide 20PB with 100GB/s now
• An initial set of 16 LNET routers for external mounts
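As a quick sanity check of this layout, the OST count and capacity are visible from any client that mounts the filesystem; a minimal sketch, where the /stockyard mount point is an assumption rather than a path given in the slides:

    # Per-OST capacity as seen from a client; the mount point is an assumption
    lfs df -h /stockyard
    # Count the OSTs: 168 per Scalable Unit, so 672 once all four SUs are online
    lfs df /stockyard | grep -c "OST"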
11. Stockyard: Capabilities and Features
• 20PB usable capacity with 100+ GB/s
aggregate bandwidth
• Client systems can bring their own LNET router set to connect to the Stockyard core IB switches, or connect to the built-in LNET routers over either IB (FDR14) or TCP (10GigE); see the routing sketch after this list
• HSM potential to Ranch tape archival system
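For a routed client, the remote Stockyard network and its gateway routers would typically be declared in the lnet module options; a minimal sketch, assuming the client sits on o2ib0, Stockyard is o2ib100 (the network name that appears in the failover logs later), and the router NIDs shown are hypothetical:

    # /etc/modprobe.d/lustre.conf on a routed client (addresses are hypothetical)
    options lnet networks="o2ib0(ib0)" routes="o2ib100 10.50.0.[1-16]@o2ib0"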
12. Capabilities and Features (cont’d)
• Metadata performance enhancement possible with DNE (Phase 1)
• NRS (Network Request Scheduler) evaluation: characteristics of the different ost_io.nrs_policies settings, particularly CRR-N (client round-robin over NIDs) under contention dominated by a few jobs
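On Lustre 2.4 servers the NRS policy for the OSS bulk I/O service is switched at runtime with lctl; a minimal sketch of how such an evaluation might be driven on an OSS node (everything beyond the stock policy names is an assumption):

    # Show the active and available policies for the bulk I/O service
    lctl get_param ost.OSS.ost_io.nrs_policies
    # Enable client round-robin over NIDs for the evaluation
    lctl set_param ost.OSS.ost_io.nrs_policies="crrn"
    # Revert to the default FIFO policy afterwards
    lctl set_param ost.OSS.ost_io.nrs_policies="fifo"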
13. Stockyard: Numbers So Far
• 16 LNET routers configured as direct clients (within the Stockyard fabric) can push 25GB/s to a single SU
• With two SUs the same set of clients can achieve 50GB/s, and 75GB/s with three SUs
• With four SUs we hit the 16-client limit: no improvement beyond 75GB/s (corresponding to ~4.7GB/s from each client)
14. Numbers So Far (Single Client)
• Single-thread write performance with Lustre 2.4.1 is ~770MB/s
– a big improvement over 2.1.x, which was about 500MB/s
• Multi-threaded writes from a single client saturate around 4.7GB/s (with credits=256 on both servers and clients)
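The credits mentioned here are most likely the o2iblnd transmit credits, which are raised through module options on both servers and clients; a minimal sketch of that setting, where the module choice and the peer_credits value are assumptions:

    # /etc/modprobe.d/lustre.conf on servers and clients (peer_credits is an assumption)
    options ko2iblnd credits=256 peer_credits=16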
15. Numbers So Far (Aggregate)
• Performance numbers with 16 LNET routers: 75GB/s from 16 direct clients
• Numbers from Stampede compute clients: 65GB/s with 256 clients (IOR, POSIX, file-per-process, with 8 tasks per node; a sketch of such a run follows below)
• Saturation point for Stampede clients: 65GB/s
• N.B. credits=64 on the client nodes of Stampede
– A quick test on an interactive 2.1.x node with a higher credit count gives the expected boost.
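A run along those lines (256 nodes, 8 tasks per node, POSIX API, file-per-process) might be launched roughly as follows; ibrun is TACC's MPI launcher, while the transfer/block sizes and output path are assumptions:

    # Task count (256 nodes x 8 tasks = 2048 ranks) comes from the batch job settings
    ibrun ior -a POSIX -F -w -r -t 1m -b 8g -o /stockyard/benchmarks/ior_fpp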
16. Numbers So Far (Failover Tests)
• OSS failover test setup and results
• Procedure (sketched below):
– Identify the OSTs for the test pair
– Initiate dd processes targeted to those OSTs, each about 67GB in size so that they do not finish before the failover
– Interrupt one of the OSS servers with a shutdown using ipmitool
– Record the individual dd process outputs as well as the server- and client-side Lustre messages
– Compare and confirm the recovery and operation of the failover pair with 21 OSTs
• All I/O completes within 2 minutes of failover
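A rough sketch of how those steps could be scripted from a client and an admin node; the OST index, file paths, BMC hostname, and credentials are all assumptions, not details from the slides:

    # Pin a test file to one OST of the failover pair and write ~67GB to it
    lfs setstripe -c 1 -i 42 /stockyard/failover_test/file_ost42
    dd if=/dev/zero of=/stockyard/failover_test/file_ost42 bs=1M count=68000 &
    # Power off the primary OSS through its BMC while the writes are in flight
    ipmitool -I lanplus -H oss-a-bmc -U admin chassis power off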
17. Failover Testing (cont’d)
• Similarly for the MDS pair: the same sequence of interrupted I/O and collection of Lustre messages on both servers and clients; the client-side log shows the recovery:
Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1381348698/real 0] req@ffff88180cfcd000 x1448277242593528/t0(0) o250->MGC192.168.200.10@o2ib100@192.168.200.10@o2ib100:26/25 lens 400/544 e 0 to 1 dl 1381348704 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: Evicted from MGS (at MGC192.168.200.10@o2ib100_1) after server handle changed from 0xb9929a99b6d258cd to 0x6282da9e97a66646
Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: MGC192.168.200.10@o2ib100: Connection restored to MGS (at 192.168.200.11@o2ib100)
18. Automated Failover
• The tests were run on an artificial setup to simplify tracking the completion of the I/O on the clients, and the shutdown and failover mounts were done manually.
• Corosync and Pacemaker are being set up to automate the process.
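Under Pacemaker, each OST mount typically becomes a cluster-managed Filesystem resource that can migrate between the two OSS nodes of a failover pair; a minimal sketch using the pcs shell, where the resource name, device path, mount point, and monitor interval are assumptions:

    # Manage one OST mount as a Pacemaker resource (device/mountpoint are assumptions)
    pcs resource create stockyard-OST0000 ocf:heartbeat:Filesystem \
        device=/dev/mapper/ost0000 directory=/mnt/ost0000 fstype=lustre \
        op monitor interval=30s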
19. Routed Clients
• We monitor the routerstat output on the attached routers, taking differences between two timestamps and focusing on the even distribution of request streams
• Contrary to the expectation that auto_down may suffice, Lustre clients need check_routers_before_use=1 to get automatic updates of router status
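A minimal sketch of the corresponding client-side module options and the monitoring loop; the 5-second interval is just an example:

    # /etc/modprobe.d/lustre.conf on clients: verify router health before use
    options lnet check_routers_before_use=1 auto_down=1
    # On an LNET router node: print forwarding statistics every 5 seconds
    routerstat 5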
20. Routed Clients (cont’d)
• Even with automatic router checks, clients cannot detect non-functional routers: a router that is alive only on the client-side network will still be assumed active by the clients
• Clients encounter timeouts due to the non-functional routers
• Resolution: separate router checks on the router nodes themselves were added
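One way such a router-side check might look, pinging a server NID on the Stockyard side and flagging the router when the ping fails; the NID and the alerting mechanism are assumptions about the approach, not TACC's actual script:

    # Periodic check on a router node (the server NID is an assumption)
    lctl ping 192.168.200.10@o2ib100 > /dev/null 2>&1 || \
        logger -t lnet-router-check "Stockyard-side link unreachable; take this router out of service"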
21. Stockyard: Looking Ahead
• Deploy as a global $WORK space for TACC resources, which will extend the client count to all TACC systems
• Evaluation of Lustre 2.5.0 before full production, for HSM functionality and compatibility with SAM-FS on Ranch
• Quota management (handled differently on 2.4+; see the sketch below)
• Integrated monitoring setup
• Security evaluation
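On Lustre 2.4+ quota enforcement is enabled through configuration parameters on the MGS rather than a mount option, and per-user limits are managed with lfs; a minimal sketch, where the filesystem name, user, and limits (in KB) are assumptions:

    # On the MGS: enforce user and group block quotas for the filesystem
    lctl conf_param stockyard.quota.ost=ug
    # On a client: set and query a user's limits (values in KB are assumptions)
    lfs setquota -u alice -b 1000000000 -B 1100000000 /stockyard
    lfs quota -u alice /stockyard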
22. Summary
• Storage capacity and performance needs are growing at an exponential rate
• High-performance and reliable filesystems
critical for HPC productivity
• Benefits of large parallel filesystems outweigh
the system administration overhead
• The current best solution for cost, performance, and scalability is a Lustre-based filesystem