Corralling Big Data at TACC

Tommy Minyard
Texas Advanced Computing Center
DDN User Group Meeting
November 18, 2013
TACC Mission & Strategy
The mission of the Texas Advanced Computing Center is to enable
scientific discovery and enhance society through the application of
advanced computing technologies.
To accomplish this mission, TACC:
– Evaluates, acquires & operates
advanced computing systems
– Provides training, consulting, and
documentation to users
– Collaborates with researchers to
apply advanced computing techniques
– Conducts research & development to
produce new computational technologies

Resources &
Services

Research &
Development
TACC Storage Needs
• Cluster specific storage
– High performance (tens to hundreds of GB/s bandwidth)
– Large capacity (~2TB per teraflop), purged frequently
– Very scalable to thousands of clients

• Center-wide persistent storage
– Global filesystem available on all systems
– Very large capacity, quota enabled
– Moderate performance, very reliable, high availability

• Permanent archival storage
– Maximum capacity (tens of PB)
– Slow performance: tape-based offline storage with a spinning-disk
cache
History of DDN at TACC
• 2006 – Lonestar 3 with DDN S2A9500
controllers and 120TB of disk
• 2008 – Corral with DDN S2A9900 controller
and 1.2PB of disk
• 2010 – Lonestar 4 with DDN SFA10000
controllers with 1.8PB of disk
• 2011 – Corral upgrade with DDN SFA10000
controllers and 5PB of disk
Global Filesystem Requirements
• User requests for persistent storage available
on all production systems
– Corral limited to UT System users only

• RFP issued for storage system capable of:
– At least 20PB of usable storage
– At least 100GB/s aggregate bandwidth
– High availability and reliability

• DDN solution selected for project
Stockyard: Design and Setup
• A Lustre 2.4.1-based global filesystem, with
scalability for future upgrades
• Scalable Unit (SU): 16 OSS nodes providing
access to 168 OSTs on RAID6 arrays from
two SFA12k couplets, corresponding to 5PB
capacity and 25+ GB/s throughput per SU
• Four SUs provide 20PB with 100GB/s now
• Initial set of 16 LNET routers for external mounts
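The scalable-unit arithmetic above can be checked with a short sketch. The per-SU figures are the ones quoted on the slide; the class and function names are illustrative only, and the OSTs-per-pair figure assumes the 16 OSS nodes are grouped into 8 failover pairs, which matches the 21-OST pair mentioned in the failover tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalableUnit:
    """Per-SU figures as quoted on the slide."""
    oss_nodes: int = 16           # OSS servers per SU
    osts: int = 168               # RAID6 OSTs per SU
    capacity_pb: float = 5.0      # usable capacity per SU
    bandwidth_gbs: float = 25.0   # quoted as "25+" GB/s, so a lower bound


def totals(su: ScalableUnit, count: int) -> dict:
    """Scale one SU linearly to `count` SUs."""
    return {
        "oss_nodes": su.oss_nodes * count,
        "osts": su.osts * count,
        "capacity_pb": su.capacity_pb * count,
        "bandwidth_gbs": su.bandwidth_gbs * count,
        # ASSUMPTION: OSS nodes are grouped two per failover pair.
        "osts_per_oss_pair": su.osts / (su.oss_nodes / 2),
    }


stockyard = totals(ScalableUnit(), 4)
for key, value in stockyard.items():
    print(f"{key}: {value}")
```

Linear scaling is itself an assumption here; the measured numbers later in the deck show where client-side limits cap it.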
SU (one server rack with two DDN
SFA12k couplet racks)
SU Hardware Details
• SFA12k Rack: 50U rack with 8x L6-30p
• SFA12k couplet with 16 IB FDR ports (direct
attachment to the 16 OSS servers)
• 84-slot SS8460 drive enclosures (10 per rack,
20 enclosures per SU)
• 4TB 7200RPM NL-SAS drives
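As a rough cross-check of the hardware list against the 5PB-per-SU figure, the drive budget can be worked through as below. This is a sketch under two assumptions that the slides do not state: every enclosure slot holds an OST drive, and each RAID6 OST is an 8+2 group. Capacities are decimal TB and ignore filesystem overhead.

```python
# Figures from the slides
ENCLOSURES_PER_SU = 20
SLOTS_PER_ENCLOSURE = 84
DRIVE_TB = 4
OSTS_PER_SU = 168

# ASSUMPTION (not stated in the slides): RAID6 laid out as 8 data + 2 parity
DATA_DRIVES_PER_OST = 8
PARITY_DRIVES_PER_OST = 2


def su_drive_budget() -> dict:
    """Derive raw and usable capacity for one SU from the drive counts."""
    slots = ENCLOSURES_PER_SU * SLOTS_PER_ENCLOSURE
    drives_per_ost = DATA_DRIVES_PER_OST + PARITY_DRIVES_PER_OST
    return {
        "slots": slots,
        "drives_needed": OSTS_PER_SU * drives_per_ost,
        "raw_tb": slots * DRIVE_TB,
        "usable_tb": OSTS_PER_SU * DATA_DRIVES_PER_OST * DRIVE_TB,
    }


budget = su_drive_budget()
print(budget)
```

Under these assumptions the 168 OSTs use exactly the available slots, and the usable figure lands a little above 5PB before formatting overhead.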
Stockyard Logical Layout
Stockyard: Capabilities and Features
• 20PB usable capacity with 100+ GB/s
aggregate bandwidth
• Client systems can bring their own LNET router
set to connect to the Stockyard core IB
switches, or connect to the built-in LNET
routers using either IB (FDR14) or TCP
(10GigE)
• HSM potential to Ranch tape archival system
Capabilities and Features (cont’d)
• Metadata performance enhancement
possible with DNE (phase 1)
• NRS (Network Request Scheduler)
evaluation: characteristics of different policies
set through ost_io.nrs_policies, particularly
crrn (client round-robin over NIDs) under
contention dominated by a few jobs
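To illustrate why a client round-robin policy matters when a few jobs dominate the request stream, here is a toy model of round-robin scheduling over NIDs. It is not Lustre's NRS code; the function name, the `quantum` parameter, and the sample NIDs are all hypothetical.

```python
from collections import OrderedDict, deque


def crrn_order(requests, quantum=1):
    """Serve up to `quantum` requests per NID in turn until all queues drain.

    `requests` is an iterable of (nid, request_id) pairs in arrival order.
    """
    queues = OrderedDict()
    for nid, req in requests:
        queues.setdefault(nid, deque()).append((nid, req))

    order = []
    while queues:
        for nid in list(queues):
            queue = queues[nid]
            for _ in range(min(quantum, len(queue))):
                order.append(queue.popleft())
            if not queue:
                del queues[nid]
    return order


# One heavy client queues four requests ahead of two light clients.
arrivals = [("10.0.0.1@o2ib", i) for i in range(4)]
arrivals += [("10.0.0.2@o2ib", 0), ("10.0.0.3@o2ib", 0)]

served = crrn_order(arrivals)
for nid, req in served:
    print(nid, req)
```

In arrival (FIFO) order the two light clients would wait behind all four requests from the heavy client; with round-robin over NIDs they are served in the first pass.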
Stockyard: Numbers So Far
• 16 LNET routers configured as direct clients
(within the Stockyard fabric) can push 25GB/s
to a single SU
• With two SUs the same set of clients can
achieve 50GB/s, and 75GB/s with three SUs
• With four SUs we hit the 16-client limit: no
improvement beyond 75GB/s (corresponding
to ~4.7GB/s from each client)
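The plateau at 75GB/s is what a simple two-sided bottleneck model predicts: aggregate throughput is capped by whichever is smaller, the server side (SUs) or the client side. The sketch below uses the slide's figures; the function name is illustrative and the linear model is an assumption.

```python
SU_BANDWIDTH_GBS = 25.0   # per-SU throughput from the slides
CLIENT_LIMIT_GBS = 4.7    # approximate per-client saturation from the slides


def expected_aggregate(num_su: int, num_clients: int) -> float:
    """Aggregate GB/s is limited by the smaller of server and client capacity."""
    server_side = num_su * SU_BANDWIDTH_GBS
    client_side = num_clients * CLIENT_LIMIT_GBS
    return min(server_side, client_side)


for num_su in range(1, 5):
    print(num_su, "SU:", expected_aggregate(num_su, 16), "GB/s")
```

With 16 clients the client-side cap is about 75GB/s, so the fourth SU adds capacity but no measurable bandwidth until more clients are attached.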
Numbers So Far (Single Client)
• Single-thread write performance with Lustre
2.4.1 is ~770MB/s
– big improvement over 2.1.x at about 500MB/s

• Multi-threaded I/O from a single client saturates
around 4.7GB/s (with credits=256 on both
servers and clients)
Numbers So Far (Aggregate)
• Performance numbers with 16 LNET routers:
75GB/s from 16 direct clients
• Numbers from Stampede compute clients:
65GB/s with 256 clients (IOR, POSIX, file-per-process, with
8 tasks per node)
• Saturation point for Stampede clients: 65GB/s
• N.B. credits=64 on client nodes of Stampede
– Quick test on an interactive 2.1.x node with a higher
credit number gives the expected boost.
Numbers So Far (Failover Tests)
• OSS failover test setup and results
• Procedure:
– Identify the OSTs for the test pair
– Initiate the dd processes targeted to the particular OSTs, each
about 67GB in size, so that they do not finish before the failover
– Interrupt one of the OSS servers with a shutdown using ipmitool
– Record the individual dd process outputs as well as server- and
client-side Lustre messages
– Compare and confirm the recovery and operation of the failover
pair with 21 OSTs

• All I/O completes within 2 minutes of failover
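The procedure above could be scripted along the following lines. This is a dry-run sketch that only builds command lines and never executes them. The mount point, directory, OST indices, BMC hostname, and user are hypothetical; it also assumes the dd processes were writes pinned to an OST with a single-stripe layout and that the shutdown was a chassis power-off, neither of which the slides specify.

```python
import shlex


def dd_commands(mount, ost_indices, size_gb=67):
    """Build one setstripe + dd pair per OST (writes are an assumption)."""
    commands = []
    for index in ost_indices:
        path = f"{mount}/failover_test/ost{index:04d}.dat"
        # Pin the file to one OST: stripe count 1, starting at the given index.
        commands.append(["lfs", "setstripe", "-c", "1", "-i", str(index), path])
        # Write size_gb GiB in 1 MiB blocks without truncating the new layout.
        commands.append([
            "dd", "if=/dev/zero", f"of={path}",
            "bs=1M", f"count={size_gb * 1024}", "conv=notrunc",
        ])
    return commands


def power_off_command(bmc_host, user):
    """Build the ipmitool call used to interrupt one OSS (prompts for a password)."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user,
            "chassis", "power", "off"]


plan = dd_commands("/mnt/stockyard", [0, 1])
plan.append(power_off_command("oss01-bmc.example", "admin"))
for command in plan:
    print(shlex.join(command))
```

Printing the plan first makes it easy to review which OSTs and which server are targeted before anything is run by hand.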
Failover Testing (cont’d)
• Similarly for the MDS pair: the same sequence of interrupted
I/O and collection of Lustre messages on both servers and
clients. The client-side log shows the recovery:
Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1381348698/real 0] req@ffff88180cfcd000 x1448277242593528/t0(0) o250->MGC192.168.200.10@o2ib100@192.168.200.10@o2ib100:26/25 lens 400/544 e 0 to 1 dl 1381348704 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: Evicted from MGS (at MGC192.168.200.10@o2ib100_1) after server handle changed from 0xb9929a99b6d258cd to 0x6282da9e97a66646
Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: MGC192.168.200.10@o2ib100: Connection restored to MGS (at 192.168.200.11@o2ib100)
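Log excerpts like the one above can be reduced to a recovery window with a small parser. The sketch below is illustrative: the regular expression, the event labels, and the function names are assumptions, the first sample line is abbreviated, and the year is supplied by hand because syslog timestamps omit it. It also assumes an English locale for month names.

```python
import re
from datetime import datetime

LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: : Lustre: (?P<msg>.*)$"
)

# Sample lines taken from the client-side log; the first one is abbreviated.
SAMPLE = [
    "Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: "
    "13689:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has "
    "timed out for sent delay: [sent 1381348698/real 0]",
    "Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: Evicted from MGS (at "
    "MGC192.168.200.10@o2ib100_1) after server handle changed from "
    "0xb9929a99b6d258cd to 0x6282da9e97a66646",
    "Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: MGC192.168.200.10@o2ib100: "
    "Connection restored to MGS (at 192.168.200.11@o2ib100)",
]


def classify(message):
    """Map a Lustre message to a coarse event label, or None if uninteresting."""
    if "timed out" in message:
        return "timeout"
    if message.startswith("Evicted from"):
        return "evicted"
    if "Connection restored" in message:
        return "restored"
    return None


def recovery_window(lines, year=2013):
    """Seconds between the first logged timeout and the last restored connection."""
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        kind = classify(match["msg"])
        if kind:
            stamp = datetime.strptime(f"{year} {match['ts']}", "%Y %b %d %H:%M:%S")
            events.append((stamp, kind))
    first_timeout = min(stamp for stamp, kind in events if kind == "timeout")
    restored = max(stamp for stamp, kind in events if kind == "restored")
    return (restored - first_timeout).total_seconds()


print(recovery_window(SAMPLE))
```

The window measured this way starts at the first timeout the client logged, not at the moment the server was interrupted, so it understates the full outage.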
Automated Failover
• The tests used an artificial setup to
simplify tracking the completion of
I/O on clients; shutdown and failover
mounts were done manually.
• Corosync and Pacemaker are being set up to
automate the process.
Routed Clients
• We monitor the routerstat output on the
attached routers and differences between two
timestamps, focusing on the even distribution
of request streams
• Contrary to the expectation that “autodown”
may suffice, Lustre clients also need
“check_routers_before_use=1” set to get
automatic updates of router status
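The router monitoring described above amounts to differencing counters between two samples and checking how evenly the traffic is spread. The sketch below assumes the per-router byte counters have already been collected into dictionaries; the router names, sample values, and the 25% tolerance are hypothetical, and no particular routerstat output format is implied.

```python
from statistics import median


def traffic_deltas(before, after):
    """Bytes moved by each router between two samples."""
    return {router: after[router] - before[router] for router in before}


def flag_uneven(deltas, tolerance=0.25):
    """Routers whose traffic deviates from the median by more than `tolerance`."""
    typical = median(deltas.values())
    if typical == 0:
        return []
    return sorted(
        router for router, moved in deltas.items()
        if abs(moved - typical) / typical > tolerance
    )


# Hypothetical counters sampled at two timestamps.
sample_t0 = {"rtr01": 1000, "rtr02": 1000, "rtr03": 1000, "rtr04": 1000}
sample_t1 = {"rtr01": 2000, "rtr02": 2100, "rtr03": 1950, "rtr04": 1000}

deltas = traffic_deltas(sample_t0, sample_t1)
print(deltas)
print(flag_uneven(deltas))
```

The median is used as the reference so that one stalled router does not drag the baseline down and make the healthy routers look like outliers.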
Routed Clients (cont’d)
• Even with automatic router checks, clients
cannot detect nonfunctional routers: a
router that is active only on its client-facing side
is still assumed to be active by clients
• Clients encounter timeouts due to the nonfunctional routers
• Resolution: separate router checks were added on the router
nodes.
Stockyard: Looking Ahead
• Deploy as a global $WORK space for TACC
resources, which will extend the client population to
all TACC resources
• Evaluation of Lustre 2.5.0 before full
production for HSM functionality and
compatibility with SAMFS on Ranch
• Quota management (different on 2.4+)
• Integrated monitoring setup
• Security evaluation
Summary
• Storage capacity and performance needs are
growing at an exponential rate
• High-performance and reliable filesystems are
critical for HPC productivity
• Benefits of large parallel filesystems outweigh
the system administration overhead
• Current best solution for cost, performance
and scalability is a Lustre-based filesystem

Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Último

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 

Último (20)

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 

Corralling Big Data at TACC

  • 1. Corralling Big Data at TACC Tommy Minyard Texas Advanced Computing Center DDN User Group Meeting November 18, 2013
  • 2. TACC Mission & Strategy The mission of the Texas Advanced Computing Center is to enable scientific discovery and enhance society through the application of advanced computing technologies. To accomplish this mission, TACC: – Evaluates, acquires & operates advanced computing systems – Provides training, consulting, and documentation to users – Collaborates with researchers to apply advanced computing techniques – Conducts research & development to produce new computational technologies Resources & Services Research & Development
  • 3. TACC Storage Needs • Cluster specific storage – High performance (tens to hundreds GB/s bandwidth) – Large-capacity (~2TBs per Teraflop), purged frequently – Very scalable to thousands of clients • Center-wide persistent storage – Global filesystem available on all systems – Very large capacity, quota enabled – Moderate performance, very reliable, high availability • Permanent archival storage – Maximum capacity, tens of PBs of capacity – Slow performance, tape-based offline storage with spinning storage cache
  • 4. History of DDN at TACC • 2006 – Lonestar 3 with DDN S2A9500 controllers and 120TB of disk • 2008 – Corral with DDN S2A9900 controller and 1.2PB of disk • 2010 – Lonestar 4 with DDN SFA10000 controllers with 1.8PB of disk • 2011 – Corral upgrade with DDN SFA10000 controllers and 5PB of disk
  • 5. Global Filesystem Requirements • User requests for persistent storage available on all production systems – Corral limited to UT System users only • RFP issued for storage system capable of: – At least 20PB of usable storage – At least 100GB/s aggregate bandwidth – High availability and reliability • DDN solution selected for project
  • 7. Stockyard: Design and Setup • A Lustre 2.4.1 based global filesystem, with scalability for future upgrades • Scalable Unit (SU): 16 OSS nodes providing access to 168 OSTs of RAID6 arrays from two SFA12k couplets, corresponding to 5PB capacity and 25+ GB/s throughput per SU • Four SUs provide 20PB with 100GB/s now • Initial set of 16 LNET routers for external mounts
  • 8. SU (One server rack with Two DDN SFA12k couplet racks)
  • 9. SU Hardware Details • SFA12k Rack: 50U rack with 8x L6-30p • SFA12k couplet with 16 IB FDR ports (direct attachment to the 16 OSS servers) • 84 slot SS8460 drive enclosures (10 per rack, 20 enclosures per SU) • 4TB 7200RPM NL-SAS drives
  • 11. Stockyard: Capabilities and Features • 20PB usable capacity with 100+ GB/s aggregate bandwidth • A client system can bring its own LNET router set to connect to the Stockyard core IB switches, or connect to the built-in LNET routers using either IB or TCP (FDR14 or 10GigE) • HSM potential to Ranch tape archival system
  • 12. Capabilities and Features (cont’d) • Metadata performance enhancement possible with DNE (phase 1) • NRS (Network Request Scheduler) evaluation: characteristics of different policies on ost_io.nrs_policies, particularly crrn (client round-robin over NIDs), under contention dominated by a few jobs
  • 13. Stockyard: Numbers So Far • 16 LNET routers configured as direct clients (within the Stockyard fabric) can push 25GB/s on one SU • With two SUs the same set of clients can achieve 50GB/s, and 75GB/s with three SUs • With four SUs we hit the 16-client limit: no improvement beyond 75GB/s (corresponding to ~4.7GB/s from each client)
  • 14. Numbers So Far (Single Client) • Single-thread write performance with Lustre 2.4.1 is ~770MB/s, a big improvement over 2.1.X at about 500MB/s • Multi-threaded I/O from a single client saturates around 4.7GB/s (with credits=256 on both servers and clients)
  • 15. Numbers So Far (Aggregate) • Performance numbers with 16 LNET routers: 75GB/s from 16 direct clients • Numbers from Stampede compute clients: 65GB/s with 256 clients (IOR, POSIX, fpp, with 8 tasks per node) • Saturation point for Stampede clients: 65GB/s • N.B. credits=64 on client nodes of Stampede – A quick test on an interactive 2.1.x node with a higher credit number gives the expected boost.
  • 16. Numbers So Far (Failover Tests) • OSS failover test setup and results • Procedure: – Identify the OSTs for the test pair – Start dd processes targeting those OSTs, each writing about 67GB so that none finishes before the failover – Interrupt one of the OSS servers with a shutdown using ipmitool – Record the individual dd process outputs as well as server- and client-side Lustre messages – Compare and confirm the recovery and operation of the failover pair with 21 OSTs • All I/O completes within 2 minutes of failover
  • 17. Failover Testing (cont’d) • Similarly for the MDS pair: same sequence of interrupted I/O and collection of Lustre messages on both servers and clients; the client-side log shows the recovery:
      Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1381348698/real 0] req@ffff88180cfcd000 x1448277242593528/t0(0) o250>MGC192.168.200.10@o2ib100@192.168.200.10@o2ib100:26/25 lens 400/544 e 0 to 1 dl 1381348704 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: Evicted from MGS (at MGC192.168.200.10@o2ib100_1) after server handle changed from 0xb9929a99b6d258cd to 0x6282da9e97a66646
      Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: MGC192.168.200.10@o2ib100: Connection restored to MGS (at 192.168.200.11@o2ib100)
  • 18. Automated Failover • The tests used an artificial setup to simplify tracking I/O completion on the clients; the shutdown and failover mounts were done manually. • Corosync and Pacemaker are being set up to automate the process.
  • 19. Routed Clients • We monitor the routerstat output on the attached routers and the differences between two timestamps, focusing on whether request streams are evenly distributed • Contrary to the expectation that “autodown” may suffice, Lustre clients need to have “check_routers_before_use=1” to have automatic updates of router status
  • 20. Routed Clients (cont’d) • Even with automatic router checks, clients cannot detect some non-functional routers: a router that is active only on its client-facing side is still assumed to be active by clients • Clients encounter timeouts due to the non-functional routers • Resolution: separate router checks, run on the router nodes, were added.
  • 21. Stockyard: Looking Ahead • Deploy as a global $WORK space for TACC resources, which will extend the client count to all TACC resources • Evaluation of Lustre 2.5.0 before full production for HSM functionality and compatibility with SAMFS on Ranch • Quota management (different on 2.4+) • Integrated monitoring setup • Security evaluation
  • 22. Summary • Storage capacity and performance needs growing at exponential rate • High-performance and reliable filesystems critical for HPC productivity • Benefits of large parallel filesystems outweigh the system administration overhead • Current best solution for cost, performance and scalability is Lustre-based filesystem
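
The sizing figures on slides 7, 9, 13 and 16 can be cross-checked with a little arithmetic. The Python sketch below is only an illustration: the class and function names are hypothetical, the drive-slot count assumes every enclosure slot is populated, raw capacity uses decimal units, and the `min()` bandwidth model is a simplification that takes the per-client ceiling from the quoted 75GB/s over 16 clients.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalableUnit:
    """Per-SU figures as quoted on the slides (names are illustrative)."""
    oss_nodes: int = 16
    osts: int = 168
    enclosures: int = 20
    slots_per_enclosure: int = 84
    drive_tb: int = 4
    capacity_pb: int = 5      # usable capacity per SU, as quoted
    bandwidth_gbs: int = 25   # "25+ GB/s" per SU, taken as a lower bound

    def drive_slots(self) -> int:
        # Assumption: all enclosure slots are populated.
        return self.enclosures * self.slots_per_enclosure

    def raw_capacity_pb(self) -> float:
        # Assumption: decimal units (1 PB = 1000 TB), before RAID6 overhead.
        return self.drive_slots() * self.drive_tb / 1000

    def osts_per_failover_pair(self) -> int:
        # OSS nodes are paired for failover, so each pair serves two nodes' OSTs.
        return self.osts * 2 // self.oss_nodes


def aggregate_gbs(n_su: int, n_clients: int,
                  su: ScalableUnit = ScalableUnit(),
                  per_client_gbs: float = 75 / 16) -> float:
    """Simplified model: throughput is capped by servers or by clients."""
    return min(n_su * su.bandwidth_gbs, n_clients * per_client_gbs)


if __name__ == "__main__":
    su = ScalableUnit()
    print("drive slots per SU:", su.drive_slots())
    print("raw PB per SU:", su.raw_capacity_pb())
    print("OSTs per failover pair:", su.osts_per_failover_pair())
    for n in (1, 2, 3, 4):
        print(n, "SU ->", aggregate_gbs(n, 16), "GB/s with 16 clients")
```

Under these assumptions the model gives 25, 50, 75 and 75 GB/s for one to four SUs with 16 clients, and 21 OSTs per failover pair, which agrees with the numbers quoted on the slides.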
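
The client-side messages on slide 17 carry enough information to estimate how long the client waited for the MGS to come back. The following Python sketch parses those syslog lines and reports the gap between the first logged timeout and the "Connection restored" message. The function names are hypothetical, the year is an assumption taken from the talk date (syslog stamps omit it), and the result measures only what the client logged, not the actual server downtime.

```python
import re
from datetime import datetime

LOG_YEAR = 2013  # assumption: syslog stamps carry no year; taken from the talk date

LINE_RE = re.compile(
    r"^(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+kernel:\s*:\s*Lustre:\s*(.*)$"
)

# Three of the client-side lines quoted on slide 17, copied as they appear there.
SAMPLE = [
    "Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: "
    "13689:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has "
    "timed out for sent delay: [sent 1381348698/real 0] req@ffff88180cfcd000 "
    "x1448277242593528/t0(0) "
    "o250>MGC192.168.200.10@o2ib100@192.168.200.10@o2ib100:26/25 "
    "lens 400/544 e 0 to 1 dl 1381348704 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1",
    "Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: Evicted from MGS "
    "(at MGC192.168.200.10@o2ib100_1) after server handle changed from "
    "0xb9929a99b6d258cd to 0x6282da9e97a66646",
    "Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: MGC192.168.200.10@o2ib100: "
    "Connection restored to MGS (at 192.168.200.11@o2ib100)",
]


def parse_line(line):
    """Return (timestamp, host, message) for a Lustre kernel line, else None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    when = datetime.strptime(f"{LOG_YEAR} {m.group(1)}", "%Y %b %d %H:%M:%S")
    return when, m.group(2), m.group(3)


def recovery_seconds(lines):
    """Seconds from the first logged timeout to the first later reconnect."""
    first_timeout = None
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        when, _host, msg = rec
        if first_timeout is None and "timed out" in msg:
            first_timeout = when
        elif first_timeout is not None and "Connection restored" in msg:
            return (when - first_timeout).total_seconds()
    return None


if __name__ == "__main__":
    print("seconds from timeout to reconnect:", recovery_seconds(SAMPLE))
```

For the quoted lines the gap is 19 seconds (14:58:24 to 14:58:43), consistent with the statement that I/O completes within 2 minutes of failover.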