NEXT-GENERATION GENOME SEQUENCING USING EMC ISILON SCALE-OUT NAS: SIZING AND PERFORMANCE GUIDELINES
White Paper
July 2013

Abstract
This EMC Isilon Sizing and Performance Guideline white paper reviews the Key Performance Indicators (KPIs) that most strongly impact the production processes for Next-Generation Sequencing (NGS) workflows.
Copyright © 2013 EMC Corporation. All Rights Reserved.
EMC believes the information in this publication is accurate as
of its publication date. The information is subject to change
without notice.
The information in this publication is provided “as is.” EMC
Corporation makes no representations or warranties of any kind
with respect to the information in this publication, and specifically
disclaims implied warranties of merchantability or fitness for a
particular purpose.
Use, copying, and distribution of any EMC software described in
this publication requires an applicable software license.
For the most up-to-date listing of EMC product names, see
EMC Corporation Trademarks on EMC.com.
EMC², EMC, the EMC logo, Isilon, OneFS, InsightIQ, SmartConnect,
SmartLock, SmartPools, SmartQuotas, SnapshotIQ, and SyncIQ
are registered trademarks or trademarks of EMC Corporation in
the United States and other countries.
VMware and vCenter are registered trademarks or trademarks of
VMware, Inc. in the United States and/or other jurisdictions.
All other trademarks used herein are the property of their
respective owners.
Part Number H19061.2
Next-generation genome sequencing using EMC Isilon scale-out NAS
Table of Contents
Executive summary
Introduction
NGS workflow—sequencing instruments and file types
NGS workflow—HPC
NGS workflow—Isilon scale-out NAS
EMC Isilon scale-out NAS overview
  Simple
  Scalable
  Predictable
  Efficient
  Available
  Enterprise-ready
NGS: key performance indicators
  HPC server parameters
  Network infrastructure parameters
  Isilon storage configuration parameters
  Summary
Conclusion
Executive summary
Next-generation sequencing (NGS) workflows consist of genome sequencer
instrumentation, high-performance computing (HPC) infrastructure, a network-
attached storage (NAS) platform, and the network infrastructure connecting
these components.
Raw NGS data is the largest component of an NGS process, making data storage
capacity and scalability important factors in NGS performance. The raw TIFF image
from the sequencer can be up to 70 percent of the total dataset. These files may be
compressed and stored for later use. Most organizations do not save the TIFF images,
but retain either the BCL or FASTQ files as the raw files. Each sequencing run can also
generate analysis data in the range of 50-200 gigabytes (GB). With faster sequencers
and larger read lengths, this can add up to between approximately 1 petabyte (PB)
and 2 PB per year for a facility with three NGS sequencers.
Beyond capacity scalability, I/O performance is also a critical file storage attribute for
overall NGS performance and efficiency. NGS is I/O-bound rather than processor-bound,
and therefore storage I/O performance has a high impact on overall NGS performance
in relation to other NGS workflow parameters.
Internal EMC testing has determined that the key performance indicators (KPIs) that
most affect the performance of NGS applications are:
• Total random access memory (RAM) size on HPC cluster nodes (recommended at 3 GB/core)
• RAM and SSD allocation on the EMC® Isilon® storage cluster—place the maximum allowable RAM on the performance layer and the minimum recommended on the archival layer, with about 1 percent to 2 percent of the raw storage capacity as SSD
• Storage configuration parameters: NFS version 4, NFS async enabled, TCP MTU (jumbo frames), LACP (2x 1 Gb/s or 4x 1 Gb/s), and tuning the Grid Engine package
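The SSD guideline above translates into a quick sizing check. A minimal sketch (the 500 TB raw capacity is a hypothetical example, not a figure from this paper):

```python
def ssd_sizing_tb(raw_capacity_tb):
    """Return the (low, high) SSD capacity suggested by the
    1-2 percent-of-raw-capacity guideline, in TB."""
    return (0.01 * raw_capacity_tb, 0.02 * raw_capacity_tb)

# Hypothetical 500 TB raw cluster: plan for 5-10 TB of SSD.
low, high = ssd_sizing_tb(500)
```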
Introduction
Over the past five years, the precision and effectiveness of sequencing technology have
considerably increased the pace of biological research and discovery. The resources
focused on molecular biology, cellular biology, and bioinformatics continue to accelerate
at a significant pace. Projections indicate that before the end of the 21st century, we
could gain a full understanding of the workings of our DNA. Such knowledge could
allow us to improve our collective quality of life through a better understanding of
how a specific genetic variation impacts a drug’s efficacy or toxicity, or by possibly
providing the knowledge to eradicate a range of genetically based disorders.
DNA exome sequencing is an approach to selectively sequence the coding regions of
the genome as an easier yet still effective alternative to whole genome sequencing.
The exome of the human genome is formed by exons. Exons are short, functionally
important coding sequences of DNA within the gene’s mature messenger RNA that
constitute about 1.5 percent of the human genome.¹
Many large-scale exome sequencing projects are underway to analyze human
diseases. This technology is often the choice as it is more affordable than whole
genome sequencing (WGS) and therefore allows the analysis of more patients. In
addition, it has an advantage in that resulting data volumes are much smaller and
therefore easier to handle. However, recent studies² focused on this question found
that both technologies complement each other. As neither the whole genome nor the
large-scale exome sequencing technologies cover all sequencing variants, it is optimal
to conduct both experiments in parallel.
A single human genome—composed of a total of about 3.2 billion base pairs—requires
about 1.2 GB of unassembled storage. Industry analysts predict that the estimated
number of human whole genomes sequenced will explode from 25,000 genomes in
2012, to between 50,000 and 100,000 in 2013, and up to about one million by 2015.
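These projections can be checked with simple arithmetic, using the paper's figure of about 1.2 GB per unassembled genome:

```python
GB_PER_GENOME = 1.2  # unassembled storage per human genome, per the text

def unassembled_storage_pb(genomes):
    """Raw unassembled capacity in PB for a given number of genomes
    (decimal units: 1 PB = 1e6 GB)."""
    return genomes * GB_PER_GENOME / 1e6

# One million genomes (the ~2015 projection) is on the order of a
# petabyte before alignment data, redundancy, or replicas are counted.
pb_needed = unassembled_storage_pb(1_000_000)
```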
The key enabling technologies for NGS are the many commercial sequencers available
from various companies, including Illumina, Life Technologies, Roche/454 Life Sciences,
and others. These sequencers interface to a computer network, which correlates and
concatenates the billions of overlapping segments of DNA sequence short reads that
have been streamed to or stored on a NAS system.
Accommodating the output rate of the sequencers requires a precisely designed and
balanced system. The peak rate of data (base pairs) produced by an Illumina sequencer,
for example, is already approaching 600 gigabases per week, equivalent to about 100
whole human genomes. An Illumina sequencer generates roughly 350 TB to 1 PB of
data per year.
The NGS workflow comprises:
• Genome sequencer instruments
• HPC infrastructure
• NAS platform
• Network infrastructure that stitches these components together
These four components make up the hierarchy of the NGS gene-sequencing architecture.
Each component depends on the other and must have the ability to adapt and scale
to meet current and future sequencing needs. If one component creates a bottleneck,
then the performance of the entire NGS system suffers. The focus of this document is
optimum performance as well as sizing guidelines for the core components of NGS:
the HPC infrastructure and network-attached storage.
1. See Gilbert W (February 1978). “Why genes in pieces?”. Nature 271 (5645): 501.
2. See Clark MJ, Chen R, Lam HY, Karczewski KJ, Chen R, Euskirchen G, Butte AJ, Snyder M. “Performance comparison of exome DNA sequencing technologies.” Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA.
NGS workflow—sequencing instruments and file types
The applications at the heart of NGS data creation come from important established
and emerging organizations involved in bringing NGS to market. The list includes
software from Illumina, Life Technologies (Applied Biosystems), Roche/454, Ion
Torrent, Pacific Biosciences, and a myriad of open source offerings such as Galaxy.
Running these applications in a research and analysis environment places complex
and special requirements on the IT systems, and in particular, the storage
infrastructure. This document focuses on the Illumina technologies—specifically
CASAVA for the sequence analysis software. Other genome assembly and analysis
platforms, such as Galaxy, will be covered in subsequent documents.
An NGS environment typically consists of scientific, lab, and analysis users:
• The scientific user initiates the method of genome sequencing and
instrumentation. This may also be the analysis user.
• The lab user runs the experiment (chemistry workflow) using a multiplexed
sampling scheme (or lanes) supported by the NGS instrument.
• The analysis user works on the results from the genome sequencing study with
bioinformatics tools and algorithms.
Most commercial NGS data centers also have a trained storage administrator on their
staff. With the growing use of NGS technologies, a new user has emerged for these
storage systems. The scientist or researcher running the experiments frequently handles
the data directly. Data management has to be intuitive to allow this new user to run
experiments and administer the data with minimal difficulty. In addition, the storage
administrator needs access to the more advanced management features to set
sophisticated management policies. These help with optimization of performance
and use of the storage system. It is important that the storage system deployed
provide management capabilities tuned to both types of users.
A graphical representation of the typical NGS data flow is shown below in Figure 1:
Figure 1. NGS architecture, data flow, and file types
The results stage of the NGS workflow as shown in Figure 1 consists of a number of
successive steps, each involving file conversions and each resulting in approximately
5x smaller file sizes. These steps include conversion of the raw image file into
base-call data, then of base-call data into FASTQ text-based file format for storing both
biological sequence and its corresponding quality scores—for example, using LQUAL
or QUAL formats. This is followed by conversion into BAM (Binary Alignment/Map) file
data followed by conversion into Variant Call Format (VCF) file data, which is converted
next into results data in SRA format. This tertiary file data is typically kept forever,
needs to be kept safe and available, and accumulates over time.
Today’s instruments produce higher level information and may avoid some of the
intermediate steps, thus reducing output data compared to previous NGS systems.
Therefore, data flows generated by the latest NGS instruments have typically decreased
in size per run. This decrease has been offset by a larger number of experiments,
secondary data, and increased consumption by users working downstream on many
different efforts and workflows. The size and characteristics of data produced from
these efforts place unpredictable demands on capacity as well as on throughput of the
storage systems. NGS storage environments need to be able to adapt to demands for
more capacity from post-processing work done by researchers downstream from the
first data capture.
NGS workflow—HPC
NGS applications have both common and unique analysis tools. All applications generate
large files that must be managed through multiple rounds of processing. Although many
tools were written specifically for easy implementation on a high-end desktop computer
(e.g., 64-bit dual- or quad-core, 16 GB RAM), routine analysis is typically conducted
on high-performance compute clusters.
Using a high-performance compute cluster, secondary analysis processing can generally
be done at a rate equal to or faster than primary data generation. Due to the open-ended
nature of tertiary analysis, a similar rate estimate cannot be precisely stated.
It is important that the parallelization of the NGS analysis platform be well understood
before planning on optimum server CPU core sizing. Most of the NGS tools are at least
multiprocessor-aware or are highly parallelized by simply dividing the sequence data,
the assembly algorithm, variant calling, or all, and starting separate analysis on these
data subsets. For NGS applications, the current parallelization per process is typically
between 75 percent and 90 percent.
As genomics has very large, semi-structured, file-based data and is modeled on post-
process streaming data access and I/O patterns that can be parallelized, it is ideally
suited for the Hadoop software framework,³ which consists of two main components:
a file system and a compute system—the Hadoop Distributed File System (HDFS)
and the MapReduce framework, respectively.
3. See Joshi S. Hadoop in the life sciences: an Isilon Systems white paper.
Figure 2. Amdahl’s Law and parallelization
One of the basic tenets in HPC, Amdahl’s Law,⁴ postulates that adding more
microprocessor cores to a process does not speed it up linearly. A 64-core HPC
platform is estimated to be the performance threshold for 75 percent parallelization
per NGS process, which delivers a speedup of 4x (see Figure 2). Even more than
100 cores per active NGS process do not speed up the process substantially when
the algorithm(s) are between 75 percent and 90 percent parallelization. During
actual testing of the NGS processes in the range of 75-90 percent parallelization,
the speedup from 12 cores to 72 cores was found to be only about 1.25x.
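The figures quoted above follow directly from Amdahl's Law, S = 1 / ((1 - p) + p/N), for parallel fraction p on N cores. A minimal sketch:

```python
def amdahl_speedup(p, n):
    """Amdahl's Law: speedup for parallel fraction p on n cores."""
    return 1.0 / ((1.0 - p) + p / n)

# 75 percent parallelization on 64 cores yields roughly the 4x noted above.
s64 = amdahl_speedup(0.75, 64)   # ~3.8x

# Scaling from 12 to 72 cores at p = 0.75 buys only ~1.2x, consistent
# with the ~1.25x speedup observed in testing.
gain = amdahl_speedup(0.75, 72) / amdahl_speedup(0.75, 12)
```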
Horizontal platforms like Hadoop that combine compute and data in a parallel
context would benefit genome assembly considerably.
4. See Amdahl G. “Validity of the single processor approach to achieving large-scale computing capabilities.” AFIPS Conference Proceedings (30): 483–485, 1967.
Figure 3. Performance curves for NGS using Illumina CASAVA
As shown in Figure 3 above, the NGS process is storage I/O and memory-bound. The
performance curves show a direct relationship between NGS performance and saturation
of read/write I/O and memory functionality. In contrast, there is an inverse relationship
between the CPU core utilization and storage I/O and memory functionality. This behavior
may be due to mutual dependencies or portions of the process that can only be
performed sequentially; NGS algorithms requiring movement of large amounts of data
in and out of the CPU; startup overhead, including base calling and other large numbers
of small-file writes; and the degree of serialization involved in communication.
In view of the above discussion, it is recommended that the HPC server hardware
platform be configured with:
• Best I/O chipset, for example, using the latest generation Intel I/O controllers
• Highest DRAM speed (with a minimum of 3 GB per core of RAM)
• Multicore CPU set with > 2 GHz processors
• Simplified BIOS and driver upgrades with a single management console for all
driver upgrades
• Linux driver compatibility (over 90 percent of all HPC systems are Linux-based)
• Disk drives between 200 GB and 600 GB with RAID 10
• Cluster management tools such as Ganglia
Increasing the network bandwidth up to 4 Gbps would alleviate the read I/O and
memory saturation.
NGS workflow—Isilon scale-out NAS
Figure 4. Data flow using Illumina NGS process
NGS production processes generate potentially millions of files with terabytes of
aggregate storage impacting the capacity and manageability limits of existing file
server structures.
Figure 4 shows the data flow including a file number and capacity summary of an actual
NGS process using an Illumina sequencer and Isilon scale-out NAS storage. As shown,
the process generates over 500,000 files with an aggregate size greater than 5 TB
over the course of the 48-hour run.
Raw NGS data is the largest component of an NGS process. The raw TIFF image can
be up to 70 percent of the total dataset. These files may be compressed and stored
for later use. Most organizations do not save the TIFF images, but retain either the
BCL or FASTQ files. If sequencing as a service is used, the input to the process is a
BAM file. Each sequencing run can also generate intermediate and final analysis data
in the range of 50-200 GB. With faster sequencers and larger read lengths, this can
add up to between 1 PB and 2 PB per year for a facility with three NGS sequencers.
Genomics is a data reduction process from the raw instrument information (images
or voltages) to the variants. This reduction process follows the “Rule of One-Fifth”
as shown in the sizing table below:
Table 1. Data reduction for the NGS process (human whole genome; all file sizes are approximate)

File format   Illumina (GB)   Ion Torrent (GB)   Comments
TIFF, WELLS   2,500           750                TIFF range: 2.5 to 4 TB; Ion Torrent uses the WELLS voltage format
BCL/SFF       500             500                Ion Torrent uses SFF
BAM           100             100                2x compression (~200 GB uncompressed)
VCF           20              20                 Variant calls
SRA, EMR      4               4                  EMR (Electronic Medical Record) includes radiology and pathology images
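The "Rule of One-Fifth" behind Table 1 can be sketched as a simple reduction chain; the starting size here follows the Illumina column:

```python
def reduction_chain(raw_gb, stages, factor=5):
    """Apply the Rule of One-Fifth: each conversion stage yields files
    roughly `factor` times smaller than the previous stage."""
    sizes = [raw_gb]
    for _ in range(stages):
        sizes.append(sizes[-1] / factor)
    return sizes

# Illumina column of Table 1: TIFF -> BCL -> BAM -> VCF -> SRA
# 2500 -> 500 -> 100 -> 20 -> 4 GB
sizes = reduction_chain(2500, 4)
```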
Raw instrument data typically consists of large image files (2–5 TB per run is the
norm), usually in TIFF format or an electropherogram file format native to a sequencer
(for example, the SEQ format native to the Illumina sequencer). These files are only
kept long enough (7–10 days) to verify that the experiment worked. The image file
for the experiment is usually the largest file size in NGS.
Intermediate or secondary data consists of raw data processed into information
of increasing value. It is stored for the medium to long term (one year or more),
requires high-bandwidth access for fast analysis, and is expensive to re-create,
so its storage needs to be highly available. These files include BCL-format
base-call and conversion data, with an aggregate size approximately one-fifth
that of the raw instrument data.
Beyond capacity scalability, I/O performance is also a critical file storage attribute for
overall NGS performance and efficiency. As discussed earlier, NGS is I/O-bound, rather
than processor-bound, and thus storage I/O performance has a high impact on overall
NGS performance in relation to other NGS workflow parameters. As a result, NGS
environments require a file storage infrastructure that is purpose-built to address the
capacity and performance scalability, efficiency, availability, and manageability
challenges of NGS environments.
EMC Isilon scale-out NAS overview
NGS is an unstructured file-based process, not a block-based storage process. EMC Isilon
scale-out NAS manages unstructured file data through a single namespace through its
storage appliance nodes arranged in clusters, which support massive scalability.
A short description of the EMC Isilon storage solution and the EMC Isilon OneFS®
operating system, with each of its features summarized below, confirms its
suitability for next-generation genomic sequencing:
Simple
OneFS combines the three layers of traditional storage architectures—the file system,
volume manager, and RAID/data protection—into one unified software layer, creating
a single intelligent distributed file system that runs on an Isilon storage cluster.
Figure 5: OneFS eliminates the need for complex file management
This scale-out hardware provides the appliance on which the OneFS distributed file
system resides. A single EMC Isilon cluster consists of multiple storage nodes, which
are rack-mountable enterprise appliances containing memory, CPU, networking, NVRAM,
storage media, and the InfiniBand back-end network that connects the nodes together.
Hardware components are best-of-breed and benefit from ever-improving cost and
efficiency curves. OneFS allows nodes to be added or removed from the cluster at
will and at any time, abstracting the data and applications away from the hardware.
Adding nodes—instead of adding volumes and LUNs via physical disks—becomes an
extremely simple task at the petabyte (PB) scale, which is common in NGS.
Scalable
Figure 6: Linear scalability with OneFS
EMC Isilon provides a high-performance, fully symmetric cluster-based distributed
storage platform. It has linear scalability with increasing capacity—from 18 TB to
20 PB in a single file system—as compared to traditional storage. The concept of
node-based capacity growth with linear scaling is critical to NGS, where scale needs
to be painless, since the process can generate upwards of 8 TB per week per instrument.
The researchers and clinicians need to focus on managing scientific data and patients,
not managing storage.
Predictable
Along with raw scaling of capacity, balancing of the content across the new nodes needs
to be predictable for an NGS workflow, due to its sustained throughput requirement.
Since the instrument end keeps changing with newer technologies faster than the HPC
or storage, this balancing and scale become invaluable. Dynamic content balancing
is performed as nodes are added or data capacity changes. There is no added
management time for the administrator, or increased complexity within the storage
system. The storage reporting application, InsightIQ™, can be used to plan the growth
of a system from storage statistics both for infrastructure and for budgeting.
Efficient
Operational Expenditure (OPEX) hinges upon efficiency, specifically in NGS, since the
total storage can run into petabytes. A recent survey conducted by Scripps Institute
concluded that more than 35 percent of institutions today are at petabyte scale in
NGS with a 10 percent year-over-year growth.
Isilon scale-out NAS offers an 80 percent efficiency ratio and “smart pooling” of the
data across multiple tiers, making dynamic, rule-based data transfer between storage
pools an integral piece of the NGS process. This efficiency is at the application level
and tiered by the performance types:
• S-Series node for high performance (I/O per second)
• X-Series node for high throughput
• NL-Series node for archive
Figure 7: Storage tiering based on node type
The tiers in the storage cluster as shown in Figure 7 above are identified as “pools”
and managed by the EMC Isilon SmartPools® application. A pool is a group of similar
nodes, which is defined by the user and is based on the functionality or workflow. A
pool is governed by policies that can be changed based on needs; default policies are
built in. Policies may be defined by any standard file metadata: file type, size, name,
location, owner, age, last accessed, etc. Data can be migrated from pool to pool. The
timing for this data movement is configurable: default is 1x per day at 10:00 p.m.
Available
Data availability and redundancy are the core requirements of the scientific and clinical
staff in NGS. As NGS moves into the clinical realm, availability becomes even more
important. Flexible data protection occurs during power loss, node or disk failures,
loss of quorum, and storage rebuild. OneFS avoids the use of hot spare drives, and
simply borrows from the available free space in the system in order to recover from
failures; this technique is called virtual hot spare.
Since all data, metadata, and parity information is distributed across the nodes of
the cluster, the Isilon cluster does not require a dedicated parity node or drive, or a
dedicated device or set of devices to manage metadata. This helps to ensure that no
one node can become a single point of failure and makes the cluster “self-healing.”
Enterprise-ready
The NGS data system does not exist as an island; it usually coexists with other storage
and IT systems. The standard protocols that OneFS supports build the standards-based
protocol bridges to other information systems from NGS. Specifically, connectivity to
the Isilon scale-out NAS cluster is via standard protocols: CIFS, SMB, NFS, FTP/HTTP,
Object, and HDFS. The complete data lifecycle is accessible to the centralized IT group.
Snapshots, replication, and quotas are supported via a simple Web-based UI.
Data is given infinite longevity and future-proofs the enterprise from evolving hardware
generations—eliminating the cost and pain of data migrations and hardware refreshes.
Standardized authentication and access control are available at scale: Active Directory
(AD), LDAP, NIS, and local users. Simultaneous or rolling upgrades to OneFS are
possible, with little or no impact to the production environment.
Figure 8: Standard protocols are critical to enterprises
The software to manage OneFS is automated to eliminate complexity, as shown in
Figure 9 below:
Figure 9: OneFS software management suite
All of the applications shown above are available as software licenses and are Web-based
through the main administrative user interface. A comprehensive command-line–based
administration interface is also available.
Table 2. Functional overview of the OneFS software suite

OneFS software management suite—making data management easier for NGS. OneFS infrastructure software solutions meet critical data protection, access, management, and availability needs.

Application     Category                 What it does
SmartPools®     Resource management      Implements a highly efficient, automated tiered storage strategy to optimize storage performance and costs
SmartConnect™   Data access              Enables load balancing and dynamic NFS failover and failback of client connections across storage nodes to optimize use of cluster resources
SnapshotIQ™     Data protection          Protects data efficiently and reliably with secure, near-instantaneous snapshots while incurring little to no performance overhead
InsightIQ™      Performance management   Maximizes performance of your Isilon scale-out storage system with innovative performance-monitoring and reporting tools
SmartQuotas™    Data management          Assigns and manages quotas that partition storage into easily managed segments at the cluster, directory, sub-directory, user, and group levels
SyncIQ®         Data replication         Replicates and distributes large, mission-critical data sets to multiple shared storage systems in multiple sites for reliable disaster recovery capability
NGS: key performance indicators
As discussed in the HPC section, performance of the NGS process is highly dependent
on the I/O performance of the file storage system and memory resources available in
the NGS architecture. In addition, there is a range of second-order factors that need to
be considered in terms of optimizing performance for a specific NGS process, including:⁵
• How much faster can a given problem be solved with multiple workers (or server
cores) instead of one?
• How much more work can be done with multiple workers (or server cores) instead
of one?
• What impact do the communication requirements of the parallel NGS application
have on overall performance and scalability?
• What fraction of the resources in an NGS configuration is actually used productively
for solving the NGS problem?
The KPI for NGS consists of factors that can be used to predict and optimize the
performance of an NGS configuration and can be broken down into four categories:
• HPC server attributes: RAID, number of processor cores per HPC node, total RAM
size per HPC node
• NGS network infrastructure: TCP MTU, Channel Bonding, DNS
• Sun Grid Engine parameters: number of nodes, PAR_EXECD_INST_COUNT
• Isilon file storage attributes:
– SSD size and RAM
– NFS protocol parameters: NFS server OS, async, number of threads, locks
– Software RAID, maximum number of directories at a level, maximum number of files in a directory, number of files less than 8 KB
HPC server parameters
RAID: With modern multicore CPUs, the performance of software RAID is very
close to that of hardware RAID. RAID 10 (first mirroring, then striping the mirrors)
is recommended for the HPC nodes with a minimum of two identical drives per node
where both drives are bootable. The benefit of such a configuration is that the server
continues to boot seamlessly even in the face of a failure of a single drive.
Total processor cores: The empirical rule-of-thumb for total number of threads and
processes running in parallel is determined by the equation:
Σ (threads + processes) = (2 × total cores) + 1
Please note that the total number of threads includes functions such as NFS and HPC
queuing, as well as the processes that run NGS algorithms; it is important to document
all the processes that are multithreaded. Amdahl’s law vis-à-vis parallelization is also
an important consideration.
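The rule of thumb above can be expressed directly:

```python
def max_concurrency(total_cores):
    """Empirical rule of thumb: total threads plus processes that can
    usefully run in parallel on an HPC node: (2 x cores) + 1."""
    return 2 * total_cores + 1

# A 12-core node should be sized for about 25 concurrent threads and
# processes, counting NFS and HPC queuing threads as well as the
# processes running the NGS algorithms themselves.
limit = max_concurrency(12)
```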
5. See Hager G, Wellein G. Introduction to High Performance Computing for Scientists and Engineers. © 2010 Taylor & Francis Group, LLC.
Total RAM size: NGS analysis requires large file processing, including functions
related to string processing, clustering of large files, and statistical quality measures,
and thus easily becomes memory-bound. As a result, a large DDR3-based RAM pool
is optimal.
Network infrastructure parameters
TCP MTU: The default maximum transmission unit (MTU) (or frame size) of current
Ethernet systems is 1500 B. However, higher bandwidth network infrastructures can
handle a much higher MTU of 9000 B (called “jumbo frames”) for efficient data transfer.
Please note that the jumbo frame setting needs to be completed both on the HPC
server node(s) and the switch(es).
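The benefit of jumbo frames can be estimated from per-frame overhead. A rough sketch, assuming 40 bytes of TCP/IPv4 headers inside the MTU and 38 bytes of Ethernet framing overhead on the wire (exact figures vary with TCP options and VLAN tags):

```python
IP_TCP_HEADERS = 40   # IPv4 (20 B) + TCP (20 B), no options
ETH_OVERHEAD = 38     # preamble + header + FCS + interframe gap

def payload_efficiency(mtu):
    """Fraction of bytes on the wire that carry application payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)

# Standard 1500 B frames carry ~95 percent payload; 9000 B jumbo frames
# raise that to ~99 percent and need 6x fewer frames for the same data,
# reducing per-frame CPU and interrupt load.
std = payload_efficiency(1500)    # ~0.949
jumbo = payload_efficiency(9000)  # ~0.991
```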
Ethernet bonding (LACP): Ethernet bonding using the Link Aggregation Control Protocol
(LACP) is a method used to alleviate bandwidth limitations and port-cable-port failure
issues. By combining several Ethernet interfaces to a virtual “bond” interface, the
network bandwidth can be increased since LACP splits the communications and sends
frames among all the Ethernet links. Bonding 2x 1 GbE interfaces provides the
required bandwidth between HPC server nodes and NAS file storage.
Isilon storage configuration parameters
NFS master OS: By default, the EMC Isilon OneFS operating system acts as the NFS
server. It is recommended that this default be maintained, since SmartConnect and
other OneFS features may be affected if the HPC master node OS is chosen as the NFS
server instead.
NFSv4: NFSv4 improves on NFSv3 in performance, security, and robustness.
Improvements include support for multiple operations per RPC request via COMPOUND
(vs. a single operation per RPC in NFSv3), use of Kerberos and access control lists
(ACLs) for security (vs. UNIX file permissions in NFSv3), mandatory TCP transport
(vs. optional UDP in NFSv3), and integrated file locking (vs. the adjunct Network
Lock Manager protocol in NFSv3). It is therefore recommended that sites use NFSv4
for NGS environments. Note that the initial setup of NFSv4 (ID mapping in
particular) can be cumbersome.
NFS async: In NFS async (asynchronous) mode, the server replies to client requests
as soon as it has processed the request and handed it off to the local file system,
without waiting for the data to be written to stable storage. Write performance is
therefore better in async mode than in synchronous mode (also called "noasync"),
especially for smaller file sizes. Async is the recommended mode, especially since
NFSv4 uses TCP connectivity.
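A typical NFSv4 mount from an HPC node might look like the following; the SmartConnect zone name and paths are placeholders, and note that async vs. sync is set on the export, not in the client mount options:

```shell
# Mount the Isilon export over NFSv4/TCP with large read/write sizes.
mount -t nfs4 -o proto=tcp,hard,rsize=131072,wsize=131072 \
    isilon.example.com:/ifs/data/ngs /mnt/ngs
# Verify the negotiated protocol version and mount options.
nfsstat -m
```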
NFS number of threads: This is the number of NFS server daemon threads started when
the system boots. The OneFS NFS server defaults to 16 threads; this value can be
changed via the Command Line Interface (CLI):
isi_sysctl_cluster sysctl vfs.nfsrv.rpc.[minthreads,maxthreads]
Increasing the number of NFS daemon threads improves response times only marginally;
the maximum number of NFS threads should be capped at 64.
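For example, raising the thread pool to the 64-thread ceiling noted above might look like this on the cluster CLI; the exact sysctl invocation varies by OneFS release, so treat this as a sketch and consult the release documentation:

```shell
# Hypothetical example: set the NFS daemon thread pool bounds on OneFS.
isi_sysctl_cluster vfs.nfsrv.rpc.minthreads=16
isi_sysctl_cluster vfs.nfsrv.rpc.maxthreads=64
```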
NFS ACL: An NFSv4 ACL (access control list) is a list of permissions associated with
a file or directory, containing one or more access control entries (ACEs). There are
four types of ACEs (Allow, Deny, Audit, and Alarm) and three kinds of ACE flags
(group, inheritance, and administrative), with 13 file permissions and 14 directory
permissions. OneFS manages NFS ACLs, which need to be mapped to the NFS client using
the idmapd configuration.
NFS locks: The mounting and locking processes have been enhanced in NFSv4, which
supports mandatory as well as advisory locking. Caching and open delegation provide
performance improvements in most situations. NFSv4 also stores more state on the
server, enabling recovery of files while they are in use.6,7
Maximum number of directories at a level and files within a directory: Isilon OneFS
supports up to 100,000 files in a directory, and a similar number of directories at
any level. To ensure the highest performance while traversing a directory tree,
however, keep both the number of directories at a level and the number of files
within a directory below 10,000.
Number of small (<8 KB) files: Random-write operations on small files have high
response times and can degrade overall application performance. To optimize
performance, it is recommended that Base Call files, which are typically <8 KB, be
aggregated into 128 KB or larger ZIP archive files.
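A sketch of such aggregation with the standard zip utility; the run directory and file names are placeholders:

```shell
# Create a few small placeholder Base Call files, then bundle the run
# directory into a single ZIP archive and remove the loose originals.
run=run_0001
mkdir -p "$run"
for i in 1 2 3; do head -c 4096 /dev/zero > "$run/cycle_${i}.bcl"; done
zip -q -r "${run}.zip" "$run" && rm -r "$run"
unzip -l "${run}.zip"   # list the aggregated files
```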
SGE number of nodes: The Sun Grid Engine (SGE) package is a popular distributed
resource manager (DRM) and scheduler for controlling access to cluster resources. A
minimum of three SGE nodes is recommended for NGS, for both performance and backup
reasons. While a commercial version of SGE is available from Oracle, SGE is also
available as open source. Other popular open source DRM packages are Torque/Maui
and Lava.
Execution daemons: The SGE PAR_EXECD_INST_COUNT variable in the SGE configuration
file defines the number of parallel execution daemons (execd) for the NGS HPC
cluster.
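As an illustration, submitting an alignment step to SGE as a parallel array job might look like the following; the job name, parallel environment, and wrapper script are hypothetical:

```shell
# Submit 24 array tasks, each requesting 8 slots in the 'smp' parallel
# environment; align_lane.sh is a placeholder wrapper script.
qsub -N ngs_align -pe smp 8 -t 1-24 -cwd align_lane.sh
```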
DNS location: If the HPC NGS system is run within a private network, it is
recommended that Linux BIND be installed on the HPC master node with DNS
forwarding to the organization’s DNS server.
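The forwarding setup on the HPC master node might look like the following sketch; the forwarder address is a placeholder for the organization's DNS server:

```shell
# Sketch: enable DNS forwarding in BIND on the HPC master node.
# Add these directives inside the options block of /etc/named.conf
# (10.0.0.53 is a placeholder for the organization's DNS server):
#
#   forwarders { 10.0.0.53; };
#   forward only;
#
# Then validate the configuration and reload BIND.
named-checkconf && systemctl reload named
```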
Summary
Internal EMC testing determined that the KPIs that most strongly affect performance are:
• RAM on HPC cluster server nodes (recommended at 3 GB/core)
• RAM and SSD on the Isilon storage cluster—maximum allowable RAM on the
performance layer and minimum recommended on the archival layer with about
1 percent to 2 percent of the raw storage capacity as SSD
• Storage configuration parameters: NFSv4, NFS async enabled, TCP MTU (jumbo
frames), LACP, and the Grid Engine package
6
See info on Isilon SmartLock: http://www.emc.com/collateral/software/white-papers/h8325-wp-isd-
smartlock.pdf
7
See info on Isilon high-performance computing: http://www.isilon.com/high-performance-computing
Conclusion
NGS production processes generate potentially millions of files with terabytes
of aggregate storage impacting the capacity and manageability limits of existing
file server structures. Raw instrument data typically consists of large image files
(2–5 TB per run is the norm), usually in TIFF format. The image file for the
experiment is usually the largest file size in NGS.
Genomics is a data reduction process from the raw instrument information (images
or voltages) to the variants, which follows the “Rule of One-Fifth.” Intermediate or
secondary data consists of raw data files, including files in BCL format for base calling
and conversion, and have an aggregate ratio of approximately one-fifth compared to
raw instrument data.
Internal EMC testing has determined that the KPIs that most strongly affect the
performance of NGS applications are: total RAM size on HPC cluster nodes
(recommended at 3 GB/core), RAM and SSD on the Isilon storage cluster (typically
1 percent to 2 percent of raw storage capacity as SSD), and storage configuration
parameters with NFSv4, NFS async enabled, TCP MTU (jumbo frames), LACP (2x 1 Gb/s
or 4x 1 Gb/s), and a Grid Engine package.
NGS environments require a file storage infrastructure that is purpose-built to address
the capacity and performance scalability, efficiency, availability, and manageability
challenges of NGS applications. Cumulative network bandwidth between HPC and
NAS increases with the total number of Isilon nodes on the storage cluster.
Isilon scale-out NAS presents a range of benefits optimal for NGS. The Isilon approach
of enabling storage I/O and capacity growth through addition of cluster nodes is optimal,
since NGS requires storage performance and capacity scalability to be implemented as
seamlessly as possible. In addition, dynamic content balancing performed within Isilon
scale-out NAS as nodes are added or data capacity changes is ideal for an NGS
workflow due to its sustained throughput requirement.
Isilon scale-out NAS also offers an 80 percent efficiency ratio and “smart pooling” of
the data across multiple performance tiers, making dynamic, rule-based data transfer
between storage pools an integral piece of the NGS process. Flexible, multidimensional
data protection, which occurs within Isilon scale-out NAS during power loss, node or disk
failures, loss of quorum, and storage rebuild, enables non-stop data availability for NGS.
Using EMC Symmetrix Storage in VMware vSphere Environments
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBook
 

Último

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 

Último (20)

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 

EMC Isilon Sizing and Performance Guidelines for Next-Generation Genome Sequencing

NGS workflow—Isilon scale-out NAS ...............................................10
EMC Isilon scale-out NAS overview ...............................................12
Simple ..........................................................................12
Scalable ........................................................................13
Predictable .....................................................................13
Efficient .......................................................................13
Available .......................................................................14
Enterprise-ready ................................................................14
NGS: key performance indicators .................................................17
HPC server parameters ...........................................................17
Network infrastructure parameters ...............................................18
Isilon storage configuration parameters .........................................18
Summary .........................................................................19
Conclusion ......................................................................20
Executive summary
Next-generation sequencing (NGS) workflows comprise genome sequencer instrumentation, high-performance computing (HPC) infrastructure, a network-attached storage (NAS) platform, and the network infrastructure connecting these components together.

Raw NGS data is the largest component of an NGS process, making data storage capacity and scalability important factors in NGS performance. The raw TIFF images from the sequencer can account for up to 70 percent of the total dataset. These files may be compressed and stored for later use; most organizations do not save the TIFF images, but instead retain either the BCL or FASTQ files as the raw files. Each sequencing run can also generate analysis data in the range of 50-200 gigabytes (GB). With faster sequencers and longer read lengths, this can add up to between approximately 1 petabyte (PB) and 2 PB per year for a facility with three NGS sequencers.

Beyond capacity scalability, I/O performance is also a critical file storage attribute for overall NGS performance and efficiency. NGS is I/O-bound rather than processor-bound, and therefore storage I/O performance has a high impact on overall NGS performance relative to other NGS workflow parameters.
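The capacity figures above can be turned into a quick planning estimate. A minimal sketch, assuming a hypothetical run frequency and per-run raw size (only the 50-200 GB analysis range and the 1-2 PB yearly total come from the text):

```python
def yearly_capacity_tb(sequencers, runs_per_week, raw_tb_per_run, analysis_gb_per_run):
    """Rough yearly storage estimate for an NGS facility, in terabytes."""
    runs_per_year = sequencers * runs_per_week * 52
    raw = runs_per_year * raw_tb_per_run
    analysis = runs_per_year * analysis_gb_per_run / 1024.0
    return raw + analysis

# Hypothetical facility: 3 sequencers, 1 run per week each,
# 5 TB of raw data and 200 GB of analysis data per run.
total = yearly_capacity_tb(3, 1, 5.0, 200)
```

With these assumed values the estimate works out to roughly 0.8 PB per year, at the low end of the 1-2 PB range cited above.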
Internal EMC testing has determined that the key performance indicators (KPIs) that most affect the performance of NGS applications are:
• Total random access memory (RAM) size on HPC cluster nodes (recommended at 3 GB per core)
• RAM and SSD allocation on the EMC® Isilon® storage cluster—place the maximum allowable RAM on the performance tier and the minimum recommended on the archival tier, with about 1 to 2 percent of the raw storage capacity as SSD
• Storage configuration parameters: NFS version 4, NFS async enabled, TCP MTU (jumbo frames), LACP (2x 1 Gb/s or 4x 1 Gb/s), and tuning of the Grid Engine package

Introduction
Over the past five years, the precision and effectiveness of sequencing technology have considerably increased the pace of biological research and discovery. The resources focused on molecular biology, cellular biology, and bioinformatics continue to grow at a significant pace. Projections indicate that before the end of the 21st century, we could gain a full understanding of the workings of our DNA. Such knowledge could allow us to improve our collective quality of life through a better understanding of how a specific genetic variation impacts a drug's efficacy or toxicity, or by possibly providing the knowledge to eradicate a range of genetically based disorders.

DNA exome sequencing is an approach that selectively sequences the coding regions of the genome as an easier yet still effective alternative to whole genome sequencing. The exome of the human genome is formed by exons: short, functionally important coding sequences of DNA within the gene's mature messenger RNA that constitute about 1.5 percent of the human genome. [1] Many large-scale exome sequencing projects are underway to analyze human diseases. This technology is often the choice because it is more affordable than whole genome sequencing (WGS) and therefore allows the analysis of more patients. It also has the advantage that the resulting data volumes are much smaller and therefore easier to handle. However, recent studies [2] focused on this question found that the two technologies complement each other: because neither whole genome nor large-scale exome sequencing covers all sequence variants, it is optimal to conduct both experiments in parallel.

A single human genome—composed of a total of about 3.2 billion base pairs—requires about 1.2 GB of unassembled storage. Industry analysts predict that the estimated number of human whole genomes sequenced will explode from 25,000 genomes in 2012, to between 50,000 and 100,000 in 2013, and up to about one million by 2015.

The key enabling technologies for NGS are the many commercial sequencers available from various companies, including Illumina, Life Technologies, Roche/454 Life Sciences, and others. These sequencers interface to a computer network, which correlates and concatenates the billions of overlapping segments of DNA sequence short reads that have been streamed to or stored on a NAS system. Accommodating the output rate of the sequencers requires a precisely designed and balanced system. The peak rate of data (base pairs) produced by an Illumina sequencer, for example, is already approaching 600 gigabases per week, equivalent to about 100 whole human genomes. The range of data per year for an Illumina sequencer is from 350 TB to 1 PB.
The components of the NGS workflow are:
• Genome sequencer instruments
• HPC infrastructure
• A NAS platform
• The network infrastructure that stitches these components together

These four components make up the hierarchy of the NGS gene-sequencing architecture. Each component depends on the others and must be able to adapt and scale to meet current and future sequencing needs. If one component creates a bottleneck, the performance of the entire NGS system suffers. The focus of this document is optimum performance, as well as sizing guidelines, for the core components of NGS: the HPC infrastructure and network-attached storage.

[1] Gilbert W (February 1978). "Why genes in pieces?" Nature 271 (5645): 501.
[2] Clark MJ, Chen R, Lam HY, Karczewski KJ, Chen R, Euskirchen G, Butte AJ, Snyder M. Performance comparison of exome DNA sequencing technologies. Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA.
NGS workflow—sequencing instruments and file types
The applications at the heart of NGS data creation come from important established and emerging organizations involved in bringing NGS to market. The list includes software from Illumina, Life Technologies (Applied Biosystems), Roche/454, Ion Torrent, Pacific Biosciences, and a myriad of open source offerings such as Galaxy. Running these applications in a research and analysis environment places complex and special requirements on the IT systems and, in particular, the storage infrastructure. This document focuses on the Illumina technologies—specifically CASAVA as the sequence analysis software. Other genome assembly and analysis platforms, such as Galaxy, will be covered in subsequent documents.

An NGS environment typically consists of scientific, lab, and analysis users:
• The scientific user initiates the method of genome sequencing and instrumentation, and may also be the analysis user.
• The lab user runs the experiment (chemistry workflow) using a multiplexed sampling scheme (or lanes) supported by the NGS instrument.
• The analysis user works on the results from the genome sequencing study with bioinformatics tools and algorithms.

Most commercial NGS data centers also have a trained storage administrator on staff. With the growing use of NGS technologies, a new user has emerged for these storage systems: the scientist or researcher running the experiments frequently handles the data directly. Data management has to be intuitive enough to allow this new user to run experiments and administer the data with minimal difficulty. In addition, the storage administrator needs access to the more advanced management features to set sophisticated management policies, which help optimize the performance and use of the storage system. It is important that the storage system deployed provide management capabilities tuned to both types of users.
A graphical representation of the typical NGS data flow is shown in Figure 1.

Figure 1. NGS architecture, data flow, and file types
The results stage of the NGS workflow, as shown in Figure 1, consists of a number of successive steps, each involving a file conversion and each producing files approximately 5x smaller than the previous step. These steps include conversion of the raw image file into base-call data, then of base-call data into the FASTQ text-based file format, which stores both the biological sequence and its corresponding quality scores—for example, using LQUAL or QUAL formats. This is followed by conversion into BAM (Binary Alignment Map) file data, then into Variant Call Format (VCF) file data, which is finally converted into results data in SRA format. This tertiary file data is typically kept forever, needs to be kept safe and available, and accumulates over time.

Today's instruments produce higher-level information and may avoid some of the intermediate steps, thus reducing output data compared to previous NGS systems. Data flows generated by the latest NGS instruments have therefore typically decreased in size per run. This decrease has been offset by a larger number of experiments, secondary data, and increased consumption by users working downstream on many different efforts and workflows. The size and characteristics of the data produced by these efforts place unpredictable demands on the capacity as well as the throughput of the storage systems. NGS storage environments need to be able to adapt to demands for more capacity from post-processing work done by researchers downstream from the first data capture.

NGS workflow—HPC
NGS applications have both common and unique analysis tools. All applications generate large files that must be managed through multiple rounds of processing. Although many tools were written specifically for easy implementation on a high-end desktop computer (for example, 64-bit dual- or quad-core with 16 GB RAM), routine analysis is typically conducted on high-performance compute clusters.
Using a high-performance compute cluster, secondary analysis processing can generally be done at a rate equal to or faster than primary data generation. Due to the open-ended nature of tertiary analysis, a similar rate estimate cannot be precisely stated. It is important that the parallelization of the NGS analysis platform be well understood before planning optimum server CPU core sizing. Most NGS tools are at least multiprocessor-aware or are highly parallelized, typically by dividing the sequence data, the assembly algorithm, the variant calling, or all three, and starting separate analyses on these data subsets. For NGS applications, the current parallelization per process is typically between 75 percent and 90 percent.

Because genomics has very large, semi-structured, file-based data and is modeled on post-process streaming data access and I/O patterns that can be parallelized, it is ideally suited to the Hadoop software framework [3], which consists of two main components: a file system and a compute system—the Hadoop Distributed File System (HDFS) and the MapReduce framework, respectively.

[3] Joshi S. Hadoop in the life sciences: an Isilon Systems white paper.
Figure 2. Amdahl's Law and parallelization

One of the basic tenets of HPC, Amdahl's Law [4], postulates that adding more microprocessor cores to a process does not speed it up linearly. A 64-core HPC platform is estimated to be the performance threshold for an NGS process with 75 percent parallelization, which delivers a speedup of about 4x (see Figure 2). Even more than 100 cores per active NGS process do not speed up the process substantially when the algorithms are between 75 percent and 90 percent parallelized. During actual testing of NGS processes in the 75-90 percent parallelization range, the speedup from 12 cores to 72 cores was found to be only about 1.25x. Horizontal platforms like Hadoop that combine compute and data in a parallel context would benefit genome assembly considerably.

[4] Amdahl G. "Validity of the single processor approach to achieving large-scale computing capabilities." AFIPS Conference Proceedings (30): 483-485, 1967.
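Amdahl's Law makes these thresholds easy to verify. A minimal sketch of the speedup formula S(n) = 1 / ((1 - p) + p/n), where p is the parallel fraction of the work and n the core count:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's Law: overall speedup on `cores` processors when
    `parallel_fraction` of the work can run in parallel."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# At 75 percent parallelization, 64 cores already approach the
# 1 / (1 - 0.75) = 4x ceiling...
s64 = amdahl_speedup(0.75, 64)
# ...and scaling from 12 to 72 cores buys only a modest gain,
# consistent with the ~1.25x observed in testing.
gain = amdahl_speedup(0.75, 72) / amdahl_speedup(0.75, 12)
```

At p = 0.75 the model gives about 3.8x on 64 cores and a 12-to-72-core gain of about 1.2x, in line with the figures quoted in the text.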
Figure 3. Performance curves for NGS using Illumina CASAVA

As shown in Figure 3, the NGS process is storage I/O- and memory-bound. The performance curves show a direct relationship between NGS performance and saturation of read/write I/O and memory, and an inverse relationship between CPU core utilization and storage I/O and memory saturation. This behavior may be due to several factors: mutual dependencies or portions of the process that can only be performed sequentially; NGS algorithms requiring movement of large amounts of data in and out of the CPU; startup overhead, including base calling and other large numbers of small-file writes; and the degree of serialization involved in communication.

In view of the above, it is recommended that the HPC server hardware platform be configured with:
• The best available I/O chipset, for example, the latest generation of Intel I/O controllers
• The highest available DRAM speed, with a minimum of 3 GB of RAM per core
• A multicore CPU set with > 2 GHz processors
• Simplified BIOS and driver upgrades, with a single management console for all driver upgrades
• Linux driver compatibility (over 90 percent of all HPC systems are Linux-based)
• Disk drives between 200 GB and 600 GB, configured with RAID 10
• Cluster management tools such as Ganglia

Increasing the network bandwidth up to 4 Gb/s would alleviate the read I/O and memory saturation.
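The hardware guidance above reduces to a few numeric checks. A minimal sketch (the `NodeSpec` structure and the example node are illustrative; the thresholds, 3 GB of RAM per core, > 2 GHz processors, and 200-600 GB drives, come from the recommendations above):

```python
from dataclasses import dataclass

@dataclass
class NodeSpec:
    cores: int
    ram_gb: int
    cpu_ghz: float
    disk_gb: int

def meets_ngs_guidelines(node: NodeSpec) -> bool:
    """Check an HPC node against the sizing recommendations in the text."""
    return (
        node.ram_gb >= 3 * node.cores          # minimum 3 GB of RAM per core
        and node.cpu_ghz > 2.0                 # multicore CPUs with > 2 GHz clocks
        and 200 <= node.disk_gb <= 600         # local drives of 200-600 GB (RAID 10)
    )

# Hypothetical 16-core node with 64 GB of RAM:
ok = meets_ngs_guidelines(NodeSpec(cores=16, ram_gb=64, cpu_ghz=2.6, disk_gb=400))
```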
NGS workflow—Isilon scale-out NAS

Figure 4. Data flow using the Illumina NGS process

NGS production processes generate potentially millions of files with terabytes of aggregate storage, straining the capacity and manageability limits of existing file server structures. Figure 4 shows the data flow, including a file-count and capacity summary, of an actual NGS process using an Illumina sequencer and Isilon scale-out NAS storage. As can be seen, the process generates over 500,000 files with an aggregate size greater than 5 TB over the course of a 48-hour run.

Raw NGS data is the largest component of an NGS process. The raw TIFF images can be up to 70 percent of the total dataset. These files may be compressed and stored for later use. Most organizations do not save the TIFF images, but retain either the BCL or FASTQ files. If sequencing as a service is used, the input to the process is a BAM file. Each sequencing run can also generate intermediate and final analysis data in the range of 50-200 GB. With faster sequencers and longer read lengths, this can add up to between 1 PB and 2 PB per year for a facility with three NGS sequencers.

Genomics is a data reduction process, from the raw instrument information (images or voltages) down to the variants. This reduction follows the "Rule of One-Fifth," as shown in the sizing table below:

Table 1. Data reduction for the NGS process (human whole genome; all file sizes are approximate)

File format    Size, GB (Illumina)    Size, GB (Ion Torrent)    Comments
TIFF, WELLS    2500                   750                       TIFF range: 2.5 to 4 TB; Ion Torrent uses the WELLS voltage format
BCL/SFF        500                    500                       Ion Torrent uses SFF
BAM            100                    100                       2x compression (~200 GB uncompressed)
VCF            20                     20                        Variant calls
SRA, EMR       4                      4                         EMR (Electronic Medical Record) includes radiology and pathology images

Raw instrument data typically consists of large image files (2-5 TB per run is the norm), usually in TIFF format or an electropherogram file format native to a sequencer (for example, the SEQ format native to the Illumina sequencer). These files are kept only long enough (7-10 days) to verify that the experiment worked. The image file for the experiment is usually the largest file in NGS.

Intermediate or secondary data consists of raw data processed into information of increasing value. It is stored for the medium to long term (one year or more), requires high-bandwidth access for fast analysis, and is expensive to re-create, so its storage needs to be highly available. These files include the BCL format for base calling and conversion, with an aggregate size of approximately one-fifth that of the raw instrument data.

Beyond capacity scalability, I/O performance is also a critical file storage attribute for overall NGS performance and efficiency. As discussed earlier, NGS is I/O-bound rather than processor-bound, and thus storage I/O performance has a high impact on overall NGS performance relative to other NGS workflow parameters.
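The "Rule of One-Fifth" in Table 1 can be expressed directly: each processing stage retains roughly one-fifth of the previous stage's data. A minimal sketch using the Illumina column of the table:

```python
def one_fifth_chain(raw_gb, stages):
    """Apply the 'Rule of One-Fifth': each stage keeps ~1/5 of the prior stage."""
    sizes = [raw_gb]
    for _ in range(stages):
        sizes.append(sizes[-1] / 5.0)
    return sizes

# Starting from a 2,500 GB raw TIFF dataset, four reduction stages
# reproduce the Illumina column of Table 1 (TIFF -> BCL -> BAM -> VCF -> SRA).
chain = one_fifth_chain(2500, 4)
```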
As a result, NGS environments require a file storage infrastructure that is purpose-built to address the
capacity and performance scalability, efficiency, availability, and manageability challenges of NGS environments.

EMC Isilon scale-out NAS overview

NGS is an unstructured, file-based process, not a block-based storage process. EMC Isilon scale-out NAS manages unstructured file data in a single namespace through storage appliance nodes arranged in clusters, which support massive scalability. The following short description of the EMC Isilon storage solution and the EMC Isilon OneFS® file operating system, with each of its features summarized, shows its suitability for next-generation genomic sequencing:

Simple

OneFS combines the three layers of traditional storage architectures—the file system, volume manager, and RAID/data protection—into one unified software layer, creating a single intelligent distributed file system that runs on an Isilon storage cluster.

Figure 5: OneFS eliminates the need for complex file management

This scale-out hardware provides the appliance on which the OneFS distributed file system resides. A single EMC Isilon cluster consists of multiple storage nodes, which are rack-mountable enterprise appliances containing memory, CPU, networking, NVRAM, and storage media, connected together by an InfiniBand back-end network. Hardware components are best-of-breed and benefit from ever-improving cost and efficiency curves. OneFS allows nodes to be added to or removed from the cluster at will and at any time, abstracting the data and applications away from the hardware. Adding nodes—instead of adding volumes and LUNs via physical disks—becomes an extremely simple task at the petabyte (PB) scale, which is common in NGS.
Scalable

Figure 6: Linear scalability with OneFS

EMC Isilon provides a high-performance, fully symmetric, cluster-based distributed storage platform. Capacity scales linearly—from 18 TB to 20 PB in a single file system—compared to traditional storage. The concept of node-based capacity growth with linear scaling is critical to NGS, where scaling needs to be painless, since the process can generate upwards of 8 TB per week per instrument. Researchers and clinicians need to focus on managing scientific data and patients, not managing storage.

Predictable

Along with raw capacity scaling, balancing of content across new nodes needs to be predictable for an NGS workflow, due to its sustained throughput requirement. Since the instrument end changes with newer technologies faster than the HPC or storage tiers, this balancing and scaling become invaluable. Dynamic content balancing is performed as nodes are added or data capacity changes, with no added management time for the administrator and no increased complexity within the storage system. The storage reporting application, InsightIQ™, can be used to plan the growth of a system from storage statistics, both for infrastructure and for budgeting.

Efficient

Operational expenditure (OPEX) hinges on efficiency, specifically in NGS, since total storage can run into petabytes. A recent survey conducted by the Scripps Institute concluded that more than 35 percent of institutions today are at petabyte scale in NGS, with 10 percent year-over-year growth. Isilon scale-out NAS offers an 80 percent efficiency ratio and "smart pooling" of data across multiple tiers, making dynamic, rule-based data transfer between storage pools an integral piece of the NGS process.
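The 8 TB/week/instrument figure lines up with the earlier 1-2 PB/year estimate for a three-sequencer facility. A back-of-the-envelope check (the 52-week annualization is an assumption added here; this is not an EMC sizing tool):

```python
# Annual raw-capacity estimate from the per-instrument weekly growth
# figure cited in the text (8 TB/week/instrument, three sequencers).
TB_PER_WEEK_PER_INSTRUMENT = 8
WEEKS_PER_YEAR = 52  # assumption: year-round operation

def yearly_raw_tb(instruments, tb_per_week=TB_PER_WEEK_PER_INSTRUMENT):
    """Total raw TB generated per year across all instruments."""
    return instruments * tb_per_week * WEEKS_PER_YEAR

total_tb = yearly_raw_tb(3)
# 3 instruments x 8 TB/week x 52 weeks = 1248 TB, i.e. about 1.2 PB/year,
# consistent with the 1-2 PB/year range cited earlier.
```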
This efficiency is at the application level and tiered by performance type:
• S-Series node for high performance (I/O per second)
• X-Series node for high throughput
• NL-Series node for archive
Figure 7: Storage tiering based on node type

The tiers in the storage cluster shown in Figure 7 above are identified as "pools" and managed by the EMC Isilon SmartPools® application. A pool is a user-defined group of similar nodes, based on functionality or workflow. A pool is governed by policies that can be changed based on needs; default policies are built in. Policies may be defined by any standard file metadata: file type, size, name, location, owner, age, last accessed, and so on. Data can be migrated from pool to pool, and the timing of this data movement is configurable; the default is once per day at 10:00 p.m.

Available

Data availability and redundancy are core requirements of the scientific and clinical staff in NGS, and as NGS moves into the clinical realm, availability becomes even more important. Flexible data protection operates through power loss, node or disk failures, loss of quorum, and storage rebuild. OneFS avoids the use of hot spare drives and simply borrows from the available free space in the system in order to recover from failures; this technique is called virtual hot spare. Since all data, metadata, and parity information is distributed across the nodes of the cluster, the Isilon cluster does not require a dedicated parity node or drive, or a dedicated device or set of devices to manage metadata. This helps to ensure that no one node can become a single point of failure and makes the cluster "self-healing."

Enterprise-ready

The NGS data system does not exist as an island; it usually coexists with other storage and IT systems. The standard protocols that OneFS supports build standards-based bridges from NGS to other information systems. Specifically, connectivity to the Isilon scale-out NAS cluster is via standard protocols: CIFS, SMB, NFS, FTP/HTTP, Object, and HDFS. The complete data lifecycle is accessible to the centralized IT group.
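The idea of metadata-driven pool policies can be sketched as follows. This is an illustration of the concept only, not the SmartPools API; the tier names echo the node series described earlier, and the age/size thresholds are hypothetical examples of the kind of rule an administrator might define:

```python
# Toy policy evaluator: route a file to a storage pool based on standard
# file metadata (age since last access, size), mimicking the rule-based
# tiering that SmartPools performs. Thresholds are illustrative only.
import time

SECONDS_PER_DAY = 86400

def choose_pool(size_bytes, last_access_epoch, now=None):
    """Pick a pool name from file age and size."""
    now = time.time() if now is None else now
    age_days = (now - last_access_epoch) / SECONDS_PER_DAY
    if age_days > 365:                 # untouched for over a year -> archive
        return "NL-Series (archive)"
    if size_bytes > 1 << 30:           # files over 1 GiB favor throughput
        return "X-Series (throughput)"
    return "S-Series (IOPS)"           # small, active files favor IOPS
```

A real policy engine would evaluate such rules on a schedule (the default cited above is once per day at 10:00 p.m.) and migrate data between pools accordingly.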
Snapshots, replication, and quotas are supported via a simple Web-based UI. Data persists across evolving hardware generations, future-proofing the enterprise and eliminating the cost and pain of data migrations and hardware refreshes.
Standardized authentication and access control are available at scale: Active Directory (AD), LDAP, NIS, and local users. Simultaneous or rolling upgrades to OneFS are possible, with little or no impact on the production environment.

Figure 8: Standard protocols are critical to enterprises

The software that manages OneFS is automated to eliminate complexity, as shown in Figure 9 below:

Figure 9: OneFS software management suite
All of the applications shown above are available as software licenses and are Web-based through the main administrative user interface. A comprehensive command-line administration interface is also available.

Table 2. Functional overview of the OneFS software suite (OneFS infrastructure software solutions meet critical data protection, access, management, and availability needs, making data management easier for NGS)

Application    Category                  What it does
SmartPools®    Resource management       Implements a highly efficient, automated tiered storage strategy to optimize storage performance and costs
SmartConnect™  Data access               Enables load balancing and dynamic NFS failover and failback of client connections across storage nodes to optimize use of cluster resources
SnapshotIQ™    Data protection           Protects data efficiently and reliably with secure, near-instantaneous snapshots while incurring little to no performance overhead
InsightIQ™     Performance management    Maximizes performance of your Isilon scale-out storage system with innovative performance-monitoring and reporting tools
SmartQuotas™   Data management           Assigns and manages quotas that partition storage into easily managed segments at the cluster, directory, sub-directory, user, and group levels
SyncIQ®        Data replication          Replicates and distributes large, mission-critical data sets to multiple shared storage systems in multiple sites for reliable disaster recovery capability
NGS: key performance indicators

As discussed in the HPC section, performance of the NGS process is highly dependent on the I/O performance of the file storage system and the memory resources available in the NGS architecture. In addition, a range of second-order factors needs to be considered when optimizing performance for a specific NGS process, including5:
• How much faster can a given problem be solved with multiple workers (or server cores) instead of one?
• How much more work can be done with multiple workers (or server cores) instead of one?
• What impact do the communication requirements of the parallel NGS application have on overall performance and scalability?
• What fraction of the resources in an NGS configuration is actually used productively for solving the NGS problem?

The KPIs for NGS consist of factors that can be used to predict and optimize the performance of an NGS configuration and can be broken down into four categories:
• HPC server attributes: RAID, number of processor cores per HPC node, total RAM size per HPC node
• NGS network infrastructure: TCP MTU, channel bonding, DNS
• Sun Grid Engine parameters: number of nodes, PAR_EXECD_INST_COUNT
• Isilon file storage attributes: SSD size and RAM; NFS protocol parameters (NFS server OS, async, number of threads, locks); software RAID, maximum number of directories at a level, maximum number of files in a directory, number of files less than 8 KB

HPC server parameters

RAID: With modern multicore CPUs, the performance of software RAID is very close to that of hardware RAID. RAID 10 (first mirroring, then striping the mirrors) is recommended for the HPC nodes, with a minimum of two identical drives per node where both drives are bootable. The benefit of such a configuration is that the server continues to boot seamlessly even in the face of a single drive failure.
Total processor cores: The empirical rule of thumb for the total number of threads and processes running in parallel is determined by the equation:

∑ (threads + processes) = (2 × total cores) + 1

Please note that the total number of threads includes functions such as NFS and HPC queuing, as well as the processes that run NGS algorithms; it is important to document all the processes that are multithreaded. Amdahl's law vis-à-vis parallelization is also an important consideration.

5 See Introduction to High Performance Computing for Scientists and Engineers, Hager G, Wellein G, © 2010 Taylor & Francis Group, LLC
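The rule of thumb above can be expressed as a simple budget check (an illustrative helper, not part of any EMC tooling):

```python
# Thread/process budget per the text's rule of thumb:
# sum(threads + processes) = (2 x total cores) + 1.
def thread_budget(total_cores):
    """Maximum recommended concurrent threads plus processes."""
    return 2 * total_cores + 1

def within_budget(thread_count, process_count, total_cores):
    # Remember to count NFS and HPC-queuing threads as well as the
    # (possibly multithreaded) NGS algorithm processes themselves.
    return (thread_count + process_count) <= thread_budget(total_cores)
```

For example, a 16-core HPC node has a budget of 33 concurrent threads and processes.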
Total RAM size: NGS analysis requires large-file processing, including functions related to string processing, clustering of large files, and statistical quality measures, and thus easily becomes memory-bound. As a result, a large DDR3-based RAM pool is optimal.

Network infrastructure parameters

TCP MTU: The default maximum transmission unit (MTU), or frame size, of current Ethernet systems is 1500 B. However, higher-bandwidth network infrastructures can handle a much larger MTU of 9000 B (called "jumbo frames") for efficient data transfer. Please note that the jumbo frame setting needs to be applied both on the HPC server node(s) and on the switch(es).

Ethernet bonding (LACP): Ethernet bonding using the Link Aggregation Control Protocol (LACP) is a method used to alleviate bandwidth limitations and port-cable-port failure issues. By combining several Ethernet interfaces into a virtual "bond" interface, network bandwidth can be increased, since LACP splits the communications and distributes frames among all the Ethernet links. Bonding 2x 1 GbE interfaces provides the required bandwidth between HPC server nodes and NAS file storage.

Isilon storage configuration parameters

NFS server OS: By default, the EMC Isilon OneFS operating system is the NFS server. It is recommended that this default be maintained, since SmartConnect and other OneFS features may be affected if the HPC master node OS is chosen as the NFS server.

NFSv4: NFSv4 provides improved performance, security, and robustness vis-à-vis NFSv3. These improvements include support for multiple operations per RPC request (vs. a single operation per RPC in NFSv3), use of Kerberos and access control lists (ACLs) for security (vs. UNIX file permissions in NFSv3), use of TCP transport (vs. UDP in NFSv3), and integrated file locking (vs. use of the adjunct Network Lock Manager protocol in NFSv3). As a result, it is recommended that sites utilize NFSv4 for NGS environments.
Please note that the initial setup of NFSv4 can be cumbersome.

NFS async: The NFS async (asynchronous) mode allows the server to reply to client requests as soon as it has processed the request and handed it off to the local file system, without waiting for the data to be written to stable storage. Write performance, especially for smaller file sizes, is better in async mode than in synchronous mode (also called "noasync"). Async is the recommended mode, especially since NFSv4 uses TCP connectivity.

NFS number of threads: This is the number of NFS server daemon threads started when the system boots. The OneFS NFS server defaults to 16 threads; this value can be changed via the command-line interface (CLI):

isi_sysctl_cluster sysctl vfs.nfsrv.rpc.[minthreads,maxthreads]

Increasing the number of NFS daemon threads improves response times only minimally; the maximum number of NFS threads should be limited to 64.

NFS ACL: The NFS ACL (access control list) for NFSv4 is a list of permissions associated with a set of files or directories that contains one or more access control entries (ACEs). There are four types of ACEs: Allow, Deny, Audit, and Alarm; with three kinds of flags: group, inheritance, and administrative. There are 13 file permissions and
14 directory permissions. OneFS manages NFS ACLs, which need to be mapped to the NFS client using the idmapd configuration.

NFS locks: The mounting and locking processes have been enhanced in NFSv4, which supports mandatory as well as advisory locking. Caching and open delegation provide performance improvements in most situations. More state information is stored on the servers in the HPC tier, enabling recovery of files while they are in use.6,7

Maximum number of directories at a level and files within a directory: While Isilon OneFS supports an upper bound of 100,000 files in a directory, and the same number of directories at a level, to ensure the highest performance while traversing a directory tree, both the maximum number of directories at a level and the maximum number of files within a directory should be kept below 10,000.

Number of small (<8 KB) files: Random-write operations on small files are slow and can degrade overall application performance. To optimize performance, it is recommended that base call files, which are typically <8 KB, be aggregated into ZIP archive files of 128 KB or larger.

SGE number of nodes: The Sun Grid Engine (SGE) package is a popular distributed resource manager (DRM) and scheduler for controlling access to cluster resources. A minimum of three SGE nodes is recommended for NGS, for performance and backup reasons. While a commercial version of SGE is available from Oracle, SGE is also available as open source. Other popular open source DRM packages are Torque/Maui and Lava.

Execution daemons: The SGE PAR_EXECD_INST_COUNT variable in the SGE configuration file defines the number of parallel execution daemons (execd) for the NGS HPC cluster.

DNS location: If the HPC NGS system runs within a private network, it is recommended that Linux BIND be installed on the HPC master node, with DNS forwarding to the organization's DNS server.
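The small-file aggregation recommendation can be sketched with the standard library. The function name, file extensions, and the exact 8 KB cutoff here are illustrative assumptions; the technique itself (collecting sub-8 KB base call files into one larger ZIP archive) is what the text recommends:

```python
# Aggregate files smaller than 8 KB in a directory into a single ZIP
# archive, so that storage sees one larger sequential object instead of
# many small random-write targets.
import os
import zipfile

SMALL_FILE_LIMIT = 8 * 1024  # 8 KB threshold from the text

def aggregate_small_files(directory, archive_path):
    """Add every file under 8 KB in `directory` to one ZIP archive.

    Returns the list of archived file names. `archive_path` should live
    outside `directory` so the archive does not capture itself.
    """
    added = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if os.path.isfile(path) and os.path.getsize(path) < SMALL_FILE_LIMIT:
                zf.write(path, arcname=name)
                added.append(name)
    return added
```

After aggregation, the original small files could be removed and downstream tools pointed at the archive, trading a little unpack overhead for far better storage behavior.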
Summary

Internal EMC testing determined that the KPIs that affect performance the most are:
• RAM on HPC cluster server nodes (recommended at 3 GB/core)
• RAM and SSD on the Isilon storage cluster: the maximum allowable RAM on the performance layer and the minimum recommended on the archival layer, with about 1 to 2 percent of the raw storage capacity as SSD
• Storage configuration parameters: NFSv4, NFS async enabled, TCP MTU (jumbo frames), LACP, and the Grid Engine package

6 See info on Isilon SmartLock: http://www.emc.com/collateral/software/white-papers/h8325-wp-isd-smartlock.pdf
7 See info on Isilon high-performance computing: http://www.isilon.com/high-performance-computing
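The two headline sizing figures from the testing (3 GB of RAM per core, and roughly 1-2 percent of raw capacity as SSD) can be turned into a quick calculator. This is a hedged sketch of the guideline arithmetic only, not a substitute for a real sizing exercise with InsightIQ data; the 1.5 percent midpoint is an assumption:

```python
# Quick sizing helpers based on the guideline figures above.
def hpc_ram_gb(total_cores, gb_per_core=3):
    """HPC-node RAM recommendation: 3 GB per core."""
    return total_cores * gb_per_core

def ssd_capacity_tb(raw_storage_tb, fraction=0.015):
    """SSD sizing: 1-2% of raw capacity; 1.5% used here as the midpoint."""
    return raw_storage_tb * fraction

# e.g. a 64-core HPC tier calls for 192 GB of RAM, and a 1 PB (1000 TB)
# raw cluster calls for on the order of 15 TB of SSD.
```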
Conclusion

NGS production processes generate potentially millions of files with terabytes of aggregate storage, straining the capacity and manageability limits of existing file server structures. Raw instrument data typically consists of large image files (2–5 TB per run is the norm), usually in TIFF format; the image file for the experiment is usually the largest file in NGS. Genomics is a data reduction process from the raw instrument information (images or voltages) to the variants, which follows the "Rule of One-Fifth." Intermediate or secondary data consists of processed raw data, including files in BCL format for base calling and conversion, with an aggregate size of approximately one-fifth that of the raw instrument data.

Internal EMC testing has determined that the KPIs that affect the performance of NGS applications the most are: total RAM size on HPC cluster nodes (recommended at 3 GB/core), RAM and SSD on the Isilon storage cluster (typically 1 to 2 percent of the raw storage capacity as SSD), and storage configuration parameters with NFSv4, NFS async enabled, TCP MTU (jumbo frames), LACP (2x 1 GbE or 4x 1 GbE), and a Grid Engine package.

NGS environments require a file storage infrastructure that is purpose-built to address the capacity and performance scalability, efficiency, availability, and manageability challenges of NGS applications. Cumulative network bandwidth between HPC and NAS increases with the total number of Isilon nodes on the storage cluster. Isilon scale-out NAS presents a range of benefits optimal for NGS. The Isilon approach of enabling storage I/O and capacity growth through the addition of cluster nodes is optimal, since NGS requires storage performance and capacity scalability to be implemented as seamlessly as possible. In addition, the dynamic content balancing performed within Isilon scale-out NAS as nodes are added or data capacity changes is ideal for an NGS workflow, due to its sustained throughput requirement.
Isilon scale-out NAS also offers an 80 percent efficiency ratio and "smart pooling" of data across multiple performance tiers, making dynamic, rule-based data transfer between storage pools an integral piece of the NGS process. Flexible, multidimensional data protection, which operates within Isilon scale-out NAS during power loss, node or disk failures, loss of quorum, and storage rebuild, enables non-stop data availability for NGS.