Taming Big Science Data Growth with
Converged Infrastructure
©2014 BioTeam, Inc. All Rights Reserved.
Real-world strategies and implementation details for building converged storage
infrastructure to support the performance, scalability and collaborative
requirements of today's NGS workflows.
Aaron D. Gardner
Senior Scientific Consultant, BioTeam, Inc.
aaron@bioteam.net
BIOTEAM
Enabling Science
1. Introduction
2. The State of NGS Data Analysis
3. Converged Infrastructure
4. Solutions to Support NGS Data Analysis
| About Myself
Who am I?
• A computer engineer who spent the last 14 years with biologists in situ
• Exposed to NGS in 2005
• Have worked (for better or worse) with most NGS platforms and data types
• Along the way learned bioinformatics, data management, HPC, storage, and general research cyberinfrastructure
• Desire to help the broader life sciences community led me to BioTeam
| About BioTeam
Who are we?
• Independent consulting shop
• Staffed by scientists forced to learn IT, SW & HPC to get our own research done
• 12+ years bridging the “gap” between science, IT & high performance computing
BioTeam @ Bio-IT World ’14
• Did you just come from Chris’s talk? Make sure to check out his slides…
• We have lots going on at the conference this year
• Come visit us at booth #324
| About This Talk
What are we going to talk about?
• A quick look at NGS analysis trends
• Challenges in performance, scalability, and collaboration
• Strategies that address these challenges
• The benefits of pairing converged infrastructure with NGS
• Example topologies and implementations
Approach
• Topics discussed the same way they would be over coffee (or tea)
• I talk about vendors and technologies I have experience with; that’s why DDN invited me to speak (thanks)
• Feel free to reach out to me during the conference if any of this interests you
| A Note About the Big Picture
At BioTeam our mission is to enable science (see above)
i.e. Great people, enabled by great technology, actively
engaging in broader scientific communities
Technology alone doesn’t cover this mandate…
• Instruments never installed, unopened server boxes, idle accelerator racks
Gathering minds without the right resources and tools…
• They flee for the cloud, desk clusters, or other companies or institutions
Locking away resources and data stifles collaboration…
• Focus on services that empower instead of barriers that contain scientists
| NGS Analysis Challenges
Performance
• Compute is easy, just not necessarily efficient
• Analysis pipelines are longer and more complex
• Usually serial steps are still lurking in them
• “New programs” are often wrapper scripts with a twist
• They don’t address the performance of the fundamental algorithms underneath
Scalability
• We see a few analysis algorithms scaled to 1-100K cores each year, same w/ accelerators
• The vast majority are still lucky to reach 10-100
• Life sciences is still mostly an HTC problem
• Checkpointing is becoming increasingly important
• Varying and mixed IO patterns make HTC problematic on shared storage
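The point about serial steps lurking in pipelines can be made concrete with Amdahl's law, which bounds the speedup of a job when part of its runtime cannot be parallelized. The sketch below is illustrative only: the law itself is standard, but the 10% serial fraction and the core counts are assumed values, not measurements from this talk.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Upper bound on speedup when `serial_fraction` of the runtime stays serial."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    # The serial part takes the same time on any core count;
    # only the parallel part shrinks as cores are added.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Assumed example: a pipeline in which 10% of the runtime is a serial step.
    for cores in (1, 10, 100, 1000):
        bound = amdahl_speedup(0.10, cores)
        print(f"{cores:>5} cores: at most {bound:.1f}x faster")
```

With any fixed serial fraction the bound flattens out quickly, which is one way to read why most codes stop benefiting long before they reach very large core counts.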
| NGS Analysis Challenges
Collaboration
• Community movement to more efficient sequence data structures is very encouraging (e.g. SAM/BAM/CRAM/VCF/HDF5)
• Sharing of datasets still incredibly problematic
• Large sequencing centers, institutes, commercial interests embracing Science DMZ, data transfer node concept (w/ Globus, Aspera, etc.)
• Without data lifecycle management, this newfound scientific data mobility will only amplify storage issues (enter iRODS, etc.)
• Last mile problem for collaborators with poor network connectivity
• Need real Big Data collaboration solutions
• Find a way to bring computation to the data so the last mile disappears
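To give a feel for the last mile problem, here is a small back-of-the-envelope calculation of how long a dataset takes to move over links of different speeds. It is a rough sketch: the function name, the link speeds, and the 80% efficiency factor are all assumptions chosen for illustration, not figures from this talk.

```python
def transfer_hours(dataset_terabytes: float, link_gbps: float,
                   efficiency: float = 0.8) -> float:
    """Hours needed to move a dataset over a link.

    `efficiency` is the assumed fraction of the nominal link rate
    that a transfer actually achieves.
    """
    if dataset_terabytes < 0:
        raise ValueError("dataset_terabytes must not be negative")
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    total_bits = dataset_terabytes * 1e12 * 8           # decimal terabytes to bits
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return total_bits / usable_bits_per_second / 3600.0


if __name__ == "__main__":
    # Assumed link speeds for three kinds of collaborator.
    links = (("100 Mb/s last mile", 0.1),
             ("1 Gb/s campus link", 1.0),
             ("10 Gb/s research path", 10.0))
    for label, gbps in links:
        print(f"{label}: {transfer_hours(1.0, gbps):.1f} hours per TB")
```

The two orders of magnitude between the slowest and fastest link are the motivation for moving computation to the data instead of the other way around.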
| Observation: NGS Analysis Inversion I
[Figure: two bars, each split between “Infrastructure Itself (the 80%)” and “SW and HW Integrations (the 20%)”; one bar shows infrastructure spending in theory, the other shows NGS analysis facility infrastructure spending in practice]
Minimizing integration overhead is one of the principal challenges right now when designing scientific computing environments.
• This holds for NGS, as well as other scientific domains
• Analysis environments are built from pieces which have never previously been tried together
• When they are synthesized based on what’s best for business instead of technical merit, efficiency is wasted, and for small and midsize infrastructures integration overhead balloons
| Observation: NGS Analysis Inversion II
[Figure: two bars, each split between “The 20%” and “The 80%”, contrasting what the industry is shooting for (in theory) with where we seem to be (in practice)]
| Traditional Infrastructure
Traditional computational infrastructures are composed of separate hardware (storage, networking, computation) and software (provisioning, monitoring, management, etc.) components
• Pieced into one-off solutions
• Integrated and tuned on-site (can take months for large systems)
• As these infrastructures scale, they become snowflakes
| Converged Infrastructure
With converged infrastructure, multiple hardware and software components are developed, selected, integrated, and tuned together, producing a pre-optimized solution
• Infrastructure building block approach
• Some vendors (e.g. DDN) offer mature converged storage products like the SFA embedded platform
• Facebook’s Open Compute Project lends itself to building converged infrastructures w/ OCP compliant components
• Analysis appliances (e.g. SlipStream) also use the converged infrastructure model
Converged infrastructure shifts the focus from integrating hardware components to building software services, which is where organizations can better distinguish and define themselves
| Converged Infrastructure: Example Solution
[Diagram, traditional iRODS deployment. Components shown: remote iRODS server, iRODS clients, cluster, network switch, three file servers, iRODS/iCAT server, SAN switch, three RAID controllers, three disk arrays. Access paths shown: iRODS data and control access, NAS file access, block storage access]
| Converged Infrastructure: Example Solution
[Diagram, converged iRODS deployment. Components shown: remote iRODS server, iRODS clients, cluster, network switch, iCAT & iRODS servers. Access paths shown: iRODS data and control access, high performance NAS file access]
Integrated Appliance Reduces Complexity and Integration Time
| Is Converged Infrastructure in Your Critical Path?
Converged infrastructure is not as necessary when:
1. Hiring lots of smart people and committing their time to infrastructure
2. Attacking a single or small set of large problems
3. Rarely revalidating or reintegrating your HW stack after deployment
• This is because tying your platform closely to a mixed and disparate hardware stack costs staff time to explore reintegration and revalidation issues and to rewrite code for new architectures. This can work for hyper giants and single service efforts, but for legacy and vendor-controlled codes, flexible infrastructure, and infrastructure for yet unknown or unsolved problems, converged infrastructure buys these costs down…
| Tiered Service Models and Changing Staff Roles
• DevOps and the cloud have changed the relationship between the researcher and the IT practitioner permanently
• Research computing staff should be developing best practices, not acting as a human ‘sudo’ for informaticists
Solution: Move to a tiered service and support model
• Users instantiate resources on demand which they have privileged access to, but no support is offered beyond clearing hang-ups
• Services requiring a higher degree of reliability and/or security are built and managed by IT staff, with unprivileged access provided to users
• Core computational services are still supported end-to-end by IT staff, and are consumed by resources in the previous two levels
Challenges With This Model:
• Need for single instance resources capable of dealing with big data
• Now need multitenancy capabilities even as a single organization
• Must minimize latency to better utilize limited resources; public cloud’s massive scalability approach might not be suitable for a small or midsize research environment with legacy codes, inexperienced users, etc.
| Tiered Data Storage (e.g. GPFS w/ HSM)
• Raw NGS data is big
• NGS analysis datasets are getting bigger
• They can require lots of IOPS during analysis
• Lots of space required to store what comes after
Can’t satisfy all of these considerations with a single storage tier without tremendous cost
Solution: Hierarchical Storage Management (HSM)
• Create different pools of storage, policies govern data movement
• SSD for metadata, small files, VMs, etc. and SATA for capacity and sequential access
• Can also use tape, object storage, and others as cold archive or warm near-line tiers
NOTE: Lustre now has some HSM capabilities too as of version 2.5
GPFS is a fast parallel file system written by IBM
• Distributed metadata and locking
• Good performance with small files
• Tunable for large numbers of small files
• Native Linux and Windows clients
• CIFS and NFSv3 (v4 works, unsupported)
Example GPFS based GRIDScaler System from DDN: [diagram not preserved]
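The HSM idea on this slide, storage pools with policies governing data movement, can be illustrated with a toy placement function. This is only a sketch in plain Python: it is not GPFS policy syntax, and the tier names, the 1 MiB small-file threshold, and the 180-day cold threshold are hypothetical values picked for the example.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
SMALL_FILE_BYTES = 1024 * 1024   # files at or below this size go to SSD
COLD_AFTER_DAYS = 180            # untouched this long means cold archive


@dataclass(frozen=True)
class FileInfo:
    path: str
    size_bytes: int
    days_since_access: int


def choose_tier(info: FileInfo) -> str:
    """Return the storage pool a file should live in under the toy policy."""
    if info.days_since_access >= COLD_AFTER_DAYS:
        return "archive"   # tape or object storage as the cold tier
    if info.size_bytes <= SMALL_FILE_BYTES:
        return "ssd"       # small files and metadata-like objects
    return "sata"          # capacity tier for large, sequential data


if __name__ == "__main__":
    examples = [
        FileInfo("run42/sample.bam", 40 * 1024**3, 3),
        FileInfo("run42/config.yaml", 2048, 3),
        FileInfo("run01/sample.bam", 35 * 1024**3, 400),
    ]
    for info in examples:
        print(f"{info.path}: {choose_tier(info)}")
```

A real deployment would express rules like these in the file system's own policy engine and run them on a schedule; the point here is only that placement is decided by policy, not by where a user happened to write the file.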
| Science DMZ (e.g. ESnet Model)
Core Drivers
• Enterprise networking architecture is optimized for many small data flows (Web 2.0, mobile, Internet of Things)
• Not optimized for fewer large data flows
• Deep packet inspection & stateful firewalls can’t handle large flows, performance tanks
3 Components of a Science DMZ
1. Fast network paths with streamlined security specific to large scientific data flows
2. Data Transfer Node(s) specifically tuned and dedicated to moving large data flows
3. Network monitoring and measurement node(s)
• Government & academic sites have done similar things for years without the name
• BioTeam strongly believes in the Science DMZ concept
• At this point anybody moving large scientific data should be evaluating it
• We are already helping deploy them
ESnet has a great web resource available: http://fasterdata.es.net/
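One way to see why large flows suffer on paths that drop packets is the Mathis et al. approximation for TCP throughput, in which the achievable rate falls with round-trip time and with the square root of the loss rate. The slides do not cite this formula; it is brought in here as an illustration, in its simplified form with the constant factor taken as 1, and the MSS, RTT, and loss values below are assumed.

```python
import math


def mathis_throughput_mbps(mss_bytes: int, rtt_ms: float, loss_rate: float) -> float:
    """Approximate upper bound on single-stream TCP throughput, in Mb/s.

    Simplified Mathis et al. model: rate <= MSS / (RTT * sqrt(loss_rate)).
    """
    if mss_bytes <= 0:
        raise ValueError("mss_bytes must be positive")
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    if not 0.0 < loss_rate < 1.0:
        raise ValueError("loss_rate must be strictly between 0 and 1")
    rtt_seconds = rtt_ms / 1000.0
    bits_per_second = (mss_bytes * 8) / (rtt_seconds * math.sqrt(loss_rate))
    return bits_per_second / 1e6


if __name__ == "__main__":
    # Assumed values: standard 1460-byte MSS, 50 ms cross-country RTT.
    for loss in (1e-3, 1e-4, 1e-6):
        rate = mathis_throughput_mbps(1460, 50.0, loss)
        print(f"loss rate {loss:g}: about {rate:.1f} Mb/s per stream")
```

Under this model even a small loss rate on a long path caps a single stream far below the link speed, which is the motivation for a clean, dedicated path for large scientific flows.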
| Implementation: Science DMZ
Design Source: “The Science DMZ: Introduction & Architecture” – ESnet
| Information Lifecycle Management (ILM) (e.g. iRODS)
iRODS, the Integrated Rule-Oriented Data System, is a project for building the next generation data management cyberinfrastructure. One of the main ideas behind iRODS is to provide a system that enables a flexible, adaptive, customizable data management architecture suitable for preserving data over its lifecycle.
At the iRODS core, a Rule Engine interprets the Rules to decide how the system is to respond to various requests and conditions.
Interfaces: GUI, Web, WebDAV, CLI
Operations:
• Search, Access and View
• Add/Extract Metadata, Annotate
• Analyze & Process
• Manage, Replicate, Copy, Share, Repurpose
• Track access, Subscribe & more…
iRODS Server software and Rule Engine run on each data server. The iRODS iCAT Metadata Catalog uses a database to track metadata describing data and everything that happens to it
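The rule-oriented idea, where a rule engine decides how the system responds to requests and a catalog records what happens to data, can be sketched as a toy in plain Python. This is not the iRODS rule language or any iRODS API: every class, function, event name, and path below is hypothetical and exists only to show the pattern.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class DataObject:
    path: str
    metadata: Dict[str, str] = field(default_factory=dict)
    replicas: List[str] = field(default_factory=list)


@dataclass
class Rule:
    event: str                                # e.g. "put" (hypothetical event name)
    condition: Callable[[DataObject], bool]   # when the rule applies
    action: Callable[[DataObject], None]      # what the rule does


class RuleEngine:
    def __init__(self) -> None:
        self.rules: List[Rule] = []
        # Stand-in for a metadata catalog: a log of every event seen.
        self.audit_log: List[Tuple[str, str]] = []

    def add_rule(self, rule: Rule) -> None:
        self.rules.append(rule)

    def fire(self, event: str, obj: DataObject) -> None:
        """Record the event, then run every matching rule in order."""
        self.audit_log.append((event, obj.path))
        for rule in self.rules:
            if rule.event == event and rule.condition(obj):
                rule.action(obj)


def build_engine() -> RuleEngine:
    engine = RuleEngine()
    # Example rule: alignment files get a replica on the archive tier.
    engine.add_rule(Rule("put",
                         lambda o: o.path.endswith(".bam"),
                         lambda o: o.replicas.append("archive")))
    # Example rule: every new object gets an owner tag if it has none.
    engine.add_rule(Rule("put",
                         lambda o: True,
                         lambda o: o.metadata.setdefault("owner", "unknown")))
    return engine


if __name__ == "__main__":
    engine = build_engine()
    sample = DataObject("/lab/run1/sample.bam")
    engine.fire("put", sample)
    print(sample.replicas, sample.metadata, engine.audit_log)
```

The design point is that policy lives in rules attached to events, so replication, tagging, and auditing happen whenever data arrives, without relying on each user to remember them.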
| Implementation: Science DMZ + ILM
| NGS Data Analysis (on a Hybrid HPC Cloud)
General Concept
• On-site local resources are a “cache” that exists…
• To be used constantly
• For best data locality
• For specialized resources
• For security
• Elastic resources from public or parent organization’s private cloud
• The middleware offers cloud-style IaaS and/or PaaS
• Multi-tenant: users/virtual communities can spin up their own resources, clusters, etc.
• These on-demand systems accommodate unique software configurations and services
(suited to varying NGS workflows, etc.)
Sounds great, but…
• It will be a while before you can pull a solution like this off the shelf
• Would be a good candidate for a converged infrastructure offering
Goal: HPC-like performance and latency, cloud-like elasticity and provisioning
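The "local resources as a cache, elastic resources for overflow" concept can be sketched as a simple placement decision. This is a hypothetical illustration, not an existing scheduler or middleware API: the job fields, the placement labels, and the rule that data-bound jobs wait for local capacity are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Job:
    name: str
    cores: int
    needs_local_data: bool = False   # e.g. data locality or security constraints


def place_jobs(jobs: Iterable[Job], local_free_cores: int) -> Dict[str, str]:
    """Assign each job to 'local', 'cloud', or 'queued', in submission order."""
    placements: Dict[str, str] = {}
    for job in jobs:
        if job.cores <= local_free_cores:
            # Prefer on-site resources so they stay constantly used.
            placements[job.name] = "local"
            local_free_cores -= job.cores
        elif job.needs_local_data:
            # Cannot leave the site, so wait for local capacity.
            placements[job.name] = "queued"
        else:
            # Overflow bursts to elastic cloud resources.
            placements[job.name] = "cloud"
    return placements


if __name__ == "__main__":
    jobs = [Job("align", 16, needs_local_data=True),
            Job("variant-call", 16),
            Job("annotate", 8, needs_local_data=True),
            Job("assemble", 4, needs_local_data=True)]
    print(place_jobs(jobs, local_free_cores=24))
```

A production system would also weigh data transfer cost, latency, and tenancy, which is part of why the slide notes that an off-the-shelf solution is still some way off.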
| Implementation: Science DMZ + ILM + HPC Cloud
| Parting Thoughts & Lessons Learned
1. Confirmation Bias
• Just because it wasn’t viable before, doesn’t mean it won’t ever be
2. Depth Perception
• Bleeding Edge? Leading Edge? State of the Art? Legacy? Ready to Sunset?
3. Outliers
• The existence of edge or corner cases does not necessarily invalidate a solution, but it does mean you better understand the scope the solution covers
4. The Power of &&
• Multipart solutions seen as complex, abandoned in search of a silver bullet
• Combining ideas is more collaborative and doesn’t force an ultimatum
5. Game Theory
• Bringing a chess set to a checkers tournament…
6. Relationship over Technology
• Work with vendors and collaborators that are interested in making a long term investment in what you do
Thank You
Questions and Discussion Welcome
Aaron D. Gardner
Senior Scientific Consultant, BioTeam, Inc.
aaron@bioteam.net

Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 

Último (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Taming Big Science Data Growth with Converged Infrastructure

  • 1. Taming Big Science Data Growth with Converged Infrastructure ©2014 BioTeam, Inc. All Rights Reserved. Real-world strategies and implementation details for building converged storage infrastructure to support the performance, scalability and collaborative requirements of today's NGS workflows. Aaron D. Gardner Senior Scientific Consultant, BioTeam, Inc. aaron@bioteam.net BIOTEAM Enabling Science
  • 2. Outline: 1. Introduction; 2. The State of NGS Data Analysis; 3. Converged Infrastructure; 4. Solutions to Support NGS Data Analysis
  • 3. About Myself. Who am I? A computer engineer who spent the last 14 years with biologists in situ; exposed to NGS in 2005; have worked (for better or worse) with most NGS platforms and data types; along the way learned bioinformatics, data management, HPC, storage, and general research cyberinfrastructure; a desire to help the broader life sciences community led me to BioTeam.
  • 4. About BioTeam. Who are we? An independent consulting shop; staffed by scientists forced to learn IT, SW & HPC to get our own research done; 12+ years bridging the “gap” between science, IT & high performance computing. BioTeam @ Bio-IT World ’14: Did you just come from Chris’s talk? Make sure to check out his slides… We have lots going on at the conference this year. Come visit us at booth #324.
  • 5. About This Talk. What are we going to talk about? A quick look at NGS analysis trends; challenges in performance, scalability, and collaboration; strategies that address these challenges; the benefits of pairing converged infrastructure with NGS; example topologies and implementations. Approach: topics are discussed the same way they would be over coffee (or tea); I talk about vendors and technologies I have experience with, which is why DDN invited me to speak (thanks); feel free to reach out to me during the conference if any of this interests you.
  • 6. A Note About the Big Picture. At BioTeam our mission is to enable science (see above), i.e. great people, enabled by great technology, actively engaging in broader scientific communities. Technology alone doesn’t cover this mandate… (instruments never installed, unopened server boxes, idle accelerator racks). Gathering minds without the right resources and tools… (they flee for the cloud, desk clusters, or other companies or institutions). Locking away resources and data stifles collaboration… (focus on services that empower instead of barriers that contain scientists).
  • 7. Outline: 1. Introduction; 2. The State of NGS Data Analysis; 3. Converged Infrastructure; 4. Solutions to Support NGS Data Analysis
  • 8. NGS Analysis Challenges. Performance: compute is easy, just not necessarily efficient; analysis pipelines are longer and more complex; serial steps are usually still lurking in them; “new programs” are often wrapper scripts with a twist that don’t address the performance of the fundamental algorithms underneath. Scalability: we see few analysis algorithms scaled to 1-100K cores each year (same with accelerators); the vast majority are still lucky to reach 10-100; life sciences is still mostly an HTC problem; checkpointing is becoming increasingly important; varying and mixed IO patterns make HTC problematic on shared storage.
  • 9. NGS Analysis Challenges. Collaboration: the community movement to more efficient sequence data structures is very encouraging (e.g. SAM/BAM/CRAM/VCF/HDF5); sharing of datasets is still incredibly problematic; large sequencing centers, institutes, and commercial interests are embracing the Science DMZ and data transfer node concept (with Globus, Aspera, etc.); without data lifecycle management, this newfound scientific data mobility will only amplify storage issues (enter iRODS, etc.); there is a last mile problem for collaborators with poor network connectivity; we need real Big Data collaboration solutions; find a way to bring computation to the data so the last mile disappears.
  • 10. Observation: NGS Analysis Inversion I. [Figure: NGS analysis facility infrastructure spending, shown in theory and in practice; both charts are labeled “Infrastructure Itself (the 80%)” and “SW and HW Integrations (the 20%)”.] Minimizing integration overhead is one of the principal challenges right now when designing scientific computing environments. This holds for NGS, as well as other scientific domains. Analysis environments are built from pieces which have never previously been tried together. When they are synthesized based on what’s best for business instead of technical merits, efficiency is wasted, and for small and midsize infrastructures integration overhead balloons.
  • 11. Observation: NGS Analysis Inversion II. [Figure: two charts, each split into “The 20%” and “The 80%”, captioned “What the industry is shooting for (in theory)” and “Where we seem to be (in practice)”.]
  • 12. Outline: 1. Introduction; 2. The State of NGS Data Analysis; 3. Converged Infrastructure; 4. Solutions to Support NGS Data Analysis
  • 13. Traditional Infrastructure. Traditional computational infrastructures are composed of separate hardware (storage, networking, computation) and software (provisioning, monitoring, management, etc.) components that are pieced into one-off solutions and integrated and tuned on-site (which can take months for large systems). As these infrastructures scale, they become snowflakes.
  • 14. Converged Infrastructure. With converged infrastructure, multiple hardware and software components are developed, selected, integrated, and tuned together, producing a pre-optimized solution. It is an infrastructure building block approach; some vendors (e.g. DDN) offer mature converged storage products like the SFA embedded platform; Facebook’s Open Compute Project lends itself to building converged infrastructures with OCP compliant components; analysis appliances (e.g. SlipStream) also use the converged infrastructure model. Converged infrastructure shifts the focus from integrating hardware components to building software services, which is where organizations can better distinguish and define themselves.
  • 15. Converged Infrastructure: Example Solution. [Diagram, “Traditional iRODS”: iRODS clients and a remote iRODS server with iRODS data and control access; a cluster with NAS file access through a network switch; an iRODS/iCAT server; three file servers with block storage access through a SAN switch; three RAID controllers; three disk arrays.]
  • 16. Converged Infrastructure: Example Solution. [Diagram, “Converged iRODS”: iRODS clients and a remote iRODS server with iRODS data and control access; a cluster with high performance NAS file access through a network switch; iCAT & iRODS servers inside an integrated appliance.] The integrated appliance reduces complexity and integration time.
  • 17. Is Converged Infrastructure in Your Critical Path? Converged infrastructure is not as necessary when you are: 1. Hiring lots of smart people and committing their time to infrastructure; 2. Attacking a single or small set of large problems; 3. Rarely revalidating or reintegrating your HW stack after deployment. This is because tying your platform closely to a mixed and disparate hardware stack costs staff time to explore reintegration and revalidation issues and to rewrite code for new architectures. This can work for hyper giants and single service efforts, but it is a poor fit for legacy and vendor-controlled codes, flexible infrastructure, and infrastructure for yet unknown or unsolved problems; converged infrastructure buys these costs down…
  • 18. Tiered Service Models and Changing Staff Roles. DevOps and the cloud have permanently changed the relationship between the researcher and the IT practitioner. Research computing staff should be developing best practices, not acting as a human ‘sudo’ for informaticists. Solution: move to a tiered service and support model. (1) Users instantiate resources on demand to which they have privileged access, but no support is offered beyond clearing hang-ups. (2) Services requiring a higher degree of reliability and/or security are built and managed by IT staff, with unprivileged access provided to users. (3) Core computational services are still supported end-to-end by IT staff, and are consumed by resources in the previous two levels. Challenges with this model: need for single instance resources capable of dealing with big data; now need multitenancy capabilities even as a single organization; must minimize latency to better utilize limited resources, since the public cloud’s massive scalability approach might not be suitable for a small or midsize research environment with legacy codes, inexperienced users, etc.
  • 19. Outline: 1. Introduction; 2. The State of NGS Data Analysis; 3. Converged Infrastructure; 4. Solutions to Support NGS Data Analysis
  • 20. Tiered Data Storage (e.g. GPFS w/ HSM). Raw NGS data is big; NGS analysis datasets are getting bigger; they can require lots of IOPS during analysis; lots of space is required to store what comes after. You can’t satisfy all of these considerations with a single storage tier without tremendous cost. Solution: Hierarchical Storage Management (HSM). Create different pools of storage, with policies governing data movement; use SSD for metadata, small files, VMs, etc. and SATA for capacity and sequential access; tape, object storage, and others can also serve as cold archive or warm near-line tiers. NOTE: Lustre now has some HSM capabilities too as of version 2.5. GPFS is a fast parallel file system written by IBM: distributed metadata and locking; good performance with small files; tunable for large numbers of small files; native Linux and Windows clients; CIFS and NFSv3 (v4 works, unsupported). [Figure: example GPFS based GRIDScaler system from DDN.]
  • 21. Science DMZ (e.g. ESnet Model). Core drivers: enterprise networking architecture is optimized for many small data flows (Web 2.0, mobile, Internet of Things); it is not optimized for fewer large data flows; deep packet inspection & stateful firewalls can’t handle large flows, and performance tanks. 3 components of a Science DMZ: 1. Fast network paths with streamlined security specific to large scientific data flows; 2. Data Transfer Node(s) specifically tuned and dedicated to moving large data flows; 3. Network monitoring and measurement node(s). Government & academic sites have done similar things for years without the name. BioTeam strongly believes in the Science DMZ concept. At this point anybody moving large scientific data should be evaluating it. We are already helping deploy them. ESnet has a great web resource available: http://fasterdata.es.net/
  • 22. Implementation: Science DMZ. [Diagram. Design source: “The Science DMZ: Introduction & Architecture”, ESnet.]
  • 23. Information Lifecycle Management (ILM) (e.g. iRODS). iRODS, the Integrated Rule-Oriented Data System, is a project for building the next generation data management cyberinfrastructure. One of the main ideas behind iRODS is to provide a system that enables a flexible, adaptive, customizable data management architecture, suitable for preserving data over its lifecycle. At the iRODS core, a Rule Engine interprets the Rules to decide how the system is to respond to various requests and conditions. Interfaces: GUI, Web, WebDAV, CLI. Operations: search, access and view; add/extract metadata, annotate; analyze & process; manage, replicate, copy, share, repurpose; track access, subscribe & more… The iRODS Server software and Rule Engine run on each data server. The iRODS iCAT Metadata Catalog uses a database to track metadata describing data and everything that happens to it.
  • 24. Implementation: Science DMZ + ILM. [Diagram.]
  • 25. NGS Data Analysis (on a Hybrid HPC Cloud). General concept: on-site local resources are a “cache” that exists to be used constantly, for best data locality, for specialized resources, and for security; elastic resources come from a public cloud or the parent organization’s private cloud; the middleware offers cloud-style IaaS and/or PaaS; it is multi-tenant, so users/virtual communities can spin up their own resources, clusters, etc.; these on-demand systems accommodate unique software configurations and services (suited to varying NGS workflows, etc.). Sounds great, but… it will be a while before you can pull a solution like this off the shelf; it would be a good candidate for a converged infrastructure offering. Goal: HPC-like performance and latency, cloud-like elasticity and provisioning.
  • 26. Implementation: Science DMZ + ILM + HPC Cloud. [Diagram.]
  • 27. Parting Thoughts & Lessons Learned. 1. Confirmation Bias: just because it wasn’t viable before doesn’t mean it won’t ever be. 2. Depth Perception: Bleeding Edge? Leading Edge? State of the Art? Legacy? Ready to Sunset? 3. Outliers: the existence of edge or corner cases does not necessarily invalidate a solution, but it does mean you had better understand the scope the solution covers. 4. The Power of &&: multipart solutions are seen as complex and abandoned in search of a silver bullet; combining ideas is more collaborative and doesn’t force an ultimatum. 5. Game Theory: bringing a chess set to a checkers tournament… 6. Relationship over Technology: work with vendors and collaborators that are interested in making a long term investment in what you do.
  • 28. Thank You. Questions and Discussion Welcome. Aaron D. Gardner, Senior Scientific Consultant, BioTeam, Inc., aaron@bioteam.net
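Slide 20 describes Hierarchical Storage Management as pools of storage with policies governing data movement. The sketch below illustrates that idea in plain Python. It is not GPFS policy syntax, and the pool names, the 1 MiB "small file" cutoff, and the 30-day and 365-day age thresholds are assumptions chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real sites would tune these to their own workloads.
SMALL_FILE_BYTES = 1024 * 1024   # files under 1 MiB are treated as "small"
WARM_AFTER_DAYS = 30             # not accessed for 30 days: warm near-line tier
COLD_AFTER_DAYS = 365            # not accessed for a year: cold archive tier


@dataclass
class FileInfo:
    path: str
    size_bytes: int
    days_since_access: int
    is_metadata: bool = False


def choose_pool(f: FileInfo) -> str:
    """Return the (hypothetical) pool name a file should be placed in."""
    if f.is_metadata:
        return "ssd"            # metadata stays on the fastest tier
    if f.days_since_access >= COLD_AFTER_DAYS:
        return "archive"        # e.g. tape or object storage
    if f.days_since_access >= WARM_AFTER_DAYS:
        return "nearline"
    if f.size_bytes < SMALL_FILE_BYTES:
        return "ssd"            # small, recently used files are IOPS-bound
    return "sata"               # large, recently used files want capacity


if __name__ == "__main__":
    for f in (
        FileInfo("/runs/sample.bam", 50 * 1024**3, 2),
        FileInfo("/runs/small.vcf", 10_000, 2),
        FileInfo("/runs/old.bam", 50 * 1024**3, 45),
    ):
        print(f.path, "->", choose_pool(f))
```

In a real deployment the equivalent rules would live in the file system's own policy engine; the point here is only that placement is a function of file attributes, evaluated periodically.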
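Slides 9 and 21 both turn on how long it takes to move large datasets, whether over a collaborator's "last mile" or through a Science DMZ path. A back-of-the-envelope estimate makes the gap concrete. The function below is a simple arithmetic sketch; the default 80% link efficiency is an assumption, not a measured figure, and real transfers depend on tuning, latency, and storage speed.

```python
def transfer_hours(size_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Estimate hours needed to move size_tb terabytes over a link_gbps link.

    efficiency is the assumed fraction of nominal bandwidth actually achieved.
    Uses decimal units: 1 TB = 1e12 bytes, 1 Gbps = 1e9 bits per second.
    """
    if size_tb < 0 or link_gbps <= 0 or not (0 < efficiency <= 1):
        raise ValueError("size must be >= 0, link speed > 0, efficiency in (0, 1]")
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Compare a 10 Gbps research path with a 100 Mbps last-mile connection.
    for gbps in (10, 1, 0.1):
        print(f"10 TB over {gbps} Gbps: {transfer_hours(10, gbps):.1f} h")
```

Even under these generous assumptions, a slow last-mile link turns hours into weeks, which is the motivation for bringing computation to the data.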
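Slide 23 says that a Rule Engine interprets rules to decide how the system responds to requests and conditions, while a metadata catalog tracks what happens to the data. The toy below shows that pattern in miniature. It does not use the iRODS API or rule language; the class, the event fields (`op`, `path`, `user`), and the two example rules are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, str]
Condition = Callable[[Event], bool]
Action = Callable[[Event], str]


class RuleEngine:
    """Toy rule engine: each rule is a (condition, action) pair."""

    def __init__(self) -> None:
        self.rules: List[Tuple[Condition, Action]] = []
        # Stand-in for a metadata catalog: a log of events and what was done.
        self.catalog: List[Dict[str, object]] = []

    def add_rule(self, condition: Condition, action: Action) -> None:
        self.rules.append((condition, action))

    def handle(self, event: Event) -> List[str]:
        """Run every matching rule in order and record the outcome."""
        fired = [action(event) for condition, action in self.rules if condition(event)]
        self.catalog.append({"event": event, "actions": fired})
        return fired


def build_demo_engine() -> RuleEngine:
    engine = RuleEngine()
    # Hypothetical policy: newly stored BAM files get a replica in an archive.
    engine.add_rule(
        lambda e: e["op"] == "put" and e["path"].endswith(".bam"),
        lambda e: f"replicate {e['path']} to archive",
    )
    # Hypothetical policy: every stored object is tagged with its owner.
    engine.add_rule(
        lambda e: e["op"] == "put",
        lambda e: f"tag {e['path']} with owner={e['user']}",
    )
    return engine


if __name__ == "__main__":
    engine = build_demo_engine()
    print(engine.handle({"op": "put", "path": "/lab/run1.bam", "user": "alice"}))
```

Keeping policy in rules rather than in client code is what lets data management behavior change over the data's lifecycle without touching the tools that read and write it.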
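Slide 25 treats on-site resources as a "cache" used constantly for data locality, specialized resources, and security, with elastic capacity coming from a cloud. One way to picture the placement decision is sketched below. The job fields, the three outcome labels, and the rule that sensitive or hardware-bound jobs never leave the site are assumptions for illustration, not a description of any particular middleware.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    sensitive: bool = False         # data must stay on site for security
    needs_special_hw: bool = False  # depends on specialized local resources


def place(job: Job, free_local_cores: int) -> str:
    """Decide where a job runs: 'local', 'wait-local', or 'cloud'."""
    fits_locally = job.cores <= free_local_cores
    if job.sensitive or job.needs_special_hw:
        # These jobs are pinned to the site, so they queue if it is full.
        return "local" if fits_locally else "wait-local"
    if fits_locally:
        return "local"              # prefer local for best data locality
    return "cloud"                  # burst to elastic resources


if __name__ == "__main__":
    for job in (
        Job("align", 64),
        Job("assemble", 512),
        Job("clinical-variant-calling", 512, sensitive=True),
    ):
        print(job.name, "->", place(job, free_local_cores=128))
```

A production scheduler would also weigh data transfer cost and queue wait time; this sketch only captures the "local first, burst when full" shape of the idea.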