A brief overview of a case study where 438 human genomes underwent read mapping and variant calling in under two months. Architectural requirements for the multi-stage pipeline are covered.
1. Large-scale Genomic Analysis Enabled by Gordon
Kristopher Standish*^, Tristan M. Carland*, Glenn K. Lockwood+^, Mahidhar Tatineni+^, Wayne Pfeiffer+^, Nicholas J. Schork*^
* Scripps Translational Science Institute
+ San Diego Supercomputer Center
^ University of California San Diego
Project funding provided by Janssen R&D
2. Background
• Janssen R&D performed whole-genome sequencing on 438 patients undergoing treatment for rheumatoid arthritis
• Problem: correlate response or non-response to drug therapy with genetic variants
• Solution combines multi-disciplinary expertise
  • Genomic analytics from Janssen R&D and the Scripps Translational Science Institute (STSI)
  • Data-intensive computing from the San Diego Supercomputer Center (SDSC)
SAN DIEGO SUPERCOMPUTER CENTER
3. Technical Challenges
• Data volume: raw reads from 438 full human genomes
  • 50 TB of compressed data from Janssen R&D
  • encrypted on 8x 6 TB SATA RAID enclosures
• Compute: perform read mapping and variant calling on all genomes
  • 9-step pipeline to achieve high-quality read mapping
  • 5-step pipeline to do group variant calling for analysis
• Project requirements:
  • FAST turnaround (assembly in < 2 months)
  • EFFICIENT (minimum core-hours used)
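The data-volume figures above are internally consistent, which a few lines of arithmetic can confirm. All inputs below come from the slide; the per-genome figure is derived.

```python
# Quick arithmetic on the slide's data-volume figures (all inputs from the slide).

total_tb = 50          # compressed raw reads delivered by Janssen R&D
genomes = 438          # whole genomes sequenced
enclosures = 8         # shipped RAID enclosures
enclosure_tb = 6       # capacity of each 6 TB SATA RAID enclosure

per_genome_gb = total_tb * 1000 / genomes     # average compressed size per genome
raid_capacity_tb = enclosures * enclosure_tb  # total shipped capacity

print(f"~{per_genome_gb:.0f} GB compressed per genome")            # ~114 GB
print(f"{raid_capacity_tb} TB RAID capacity for {total_tb} TB of data")
```

The 48 TB of raw enclosure capacity closely matches the 50 TB of compressed data, which explains the 8-enclosure shipment.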
4. Read Mapping Pipeline: Looks Uniform from Traditional HPC Perspective...
[Figure: thread-level parallelism vs. walltime for each step, dimensions drawn to scale]
1. Map (BWA)
2. sam to bam (SAMtools)
3. Merge Lanes (SAMtools)
4. Sort (SAMtools)
5. Mark Duplicates (Picard)
6. Target Creator (GATK)
7. Indel Realigner (GATK)
8. Base Quality Score Recalibration (GATK)
9. Print Reads (GATK)
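The nine steps can be sketched as per-sample shell commands. The file names (ref.fa, picard.jar, GenomeAnalysisTK.jar, the sample/lane naming) and the flags below are illustrative assumptions using GATK3-era syntax, not the exact invocations run on Gordon.

```python
# Sketch of the slide's 9-step read-mapping pipeline as per-sample commands.
# Paths, sample naming, and flags are hypothetical; tool names match the slide.

def read_mapping_steps(sample):
    """Return the nine pipeline steps for one sample as command strings."""
    gatk = "java -jar GenomeAnalysisTK.jar -R ref.fa"
    return [
        # 1. Map paired-end reads with BWA (one job per lane)
        f"bwa mem -t 16 ref.fa {sample}_L1_R1.fq {sample}_L1_R2.fq > {sample}_L1.sam",
        # 2. Convert SAM to compressed BAM
        f"samtools view -bS {sample}_L1.sam > {sample}_L1.bam",
        # 3. Merge per-lane BAMs into one file per sample
        f"samtools merge {sample}.bam {sample}_L1.bam {sample}_L2.bam",
        # 4. Coordinate-sort (the IO- and capacity-bound step discussed later)
        f"samtools sort -o {sample}.sorted.bam {sample}.bam",
        # 5. Flag PCR duplicates with Picard
        f"java -jar picard.jar MarkDuplicates I={sample}.sorted.bam "
        f"O={sample}.dedup.bam M={sample}.dup_metrics.txt",
        # 6. Find intervals that need indel realignment
        f"{gatk} -T RealignerTargetCreator -I {sample}.dedup.bam -o {sample}.intervals",
        # 7. Realign reads around indels
        f"{gatk} -T IndelRealigner -I {sample}.dedup.bam "
        f"-targetIntervals {sample}.intervals -o {sample}.realn.bam",
        # 8. Model systematic base-quality errors
        f"{gatk} -T BaseRecalibrator -I {sample}.realn.bam -o {sample}.recal.table",
        # 9. Write the final recalibrated BAM
        f"{gatk} -T PrintReads -I {sample}.realn.bam "
        f"-BQSR {sample}.recal.table -o {sample}.final.bam",
    ]

for cmd in read_mapping_steps("patient001"):
    print(cmd)
```

Expressing the pipeline as a command list per sample matches the slide's point: from a scheduler's perspective the 438 samples look like 438 identical serial job chains.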
5. Read Mapping Pipeline: Non-Traditional Bottlenecks (DRAM & IO)
[Figure: memory requirement vs. walltime for each step, dimensions drawn to scale]
1. Map (BWA)
2. sam to bam (SAMtools)
3. Merge Lanes (SAMtools)
4. Sort (SAMtools)
5. Mark Duplicates (Picard)
6. Target Creator (GATK)
7. Indel Realigner (GATK)
8. Base Quality Score Recalibration (GATK)
9. Print Reads (GATK)
6. Sort Step: Bound by Disk IO and Capacity
Problem: 16 threads require...
• 25 GB DRAM
• 3.5 TB local disk
• 1.6 TB input data
which generate...
• 3,500 IOPS (metadata-rich)
• 1 GB/s read rate
Solution: BigFlash
• 64 GB DRAM/node
• 16x 300 GB SSDs (4.4 TB usable local flash)
• 1.6 GB/s from Lustre to SSDs via a dedicated I/O InfiniBand rail
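The slide's numbers show why a BigFlash node fits this step; a short sanity check makes the headroom and the staging cost explicit. All inputs come from the slide; the staging-time estimate is derived from them.

```python
# Sanity check of the sort step's requirements against a BigFlash node.
# All figures are taken from the slide; staging time is derived.

need_dram_gb = 25        # DRAM required by 16 sort threads
need_disk_tb = 3.5       # local scratch required
input_tb = 1.6           # input data per sample

node_dram_gb = 64                 # BigFlash node DRAM
node_flash_tb = 16 * 300 / 1000   # 16x 300 GB SSDs = 4.8 TB raw
usable_flash_tb = 4.4             # usable local flash per the slide
lustre_bw_gbs = 1.6               # Lustre-to-SSD bandwidth over the I/O rail

# Both requirements fit with headroom, and staging the 1.6 TB input
# over the dedicated rail takes on the order of 17 minutes.
stage_minutes = input_tb * 1000 / lustre_bw_gbs / 60

assert need_dram_gb < node_dram_gb
assert need_disk_tb < usable_flash_tb < node_flash_tb
print(f"staging time ~= {stage_minutes:.0f} min")  # prints "staging time ~= 17 min"
```

The key design point is that staging to local flash converts 3,500 metadata-rich IOPS against Lustre into sequential local I/O, at the cost of a ~17-minute bulk copy.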
7. Group Variant Calling Pipeline
[Figure: thread-level parallelism vs. walltime for the 5-step pipeline, dimensions approx. drawn to scale]
• Massive data reduction at first step
• Reduction in data parallelism
• Subsequent steps (#2 - #5) offloaded to campus cluster
  • 1-6 threads each
  • 10-30 min each
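The slide names only the first tool (GATK HaplotypeCaller, per the timing slide); steps #2 - #5 are not itemized. A typical GATK3-era group-calling pattern, assumed here purely as an illustration of the data-reduction and parallelism-reduction points, looks like this:

```python
# Illustrative GATK3-style group-calling flow. Only HaplotypeCaller is named
# on the slides; the per-sample/joint split and all flags below are assumed,
# not a statement of the exact 5 steps used.

gatk = "java -jar GenomeAnalysisTK.jar -R ref.fa"
samples = [f"patient{i:03d}" for i in range(1, 439)]  # the 438 genomes

# Step 1: one HaplotypeCaller job per sample -- 438-way data parallelism,
# and a massive reduction in data volume (large BAM -> much smaller gVCF).
per_sample = [
    f"{gatk} -T HaplotypeCaller -I {s}.final.bam "
    f"--emitRefConfidence GVCF -o {s}.g.vcf"
    for s in samples
]

# Subsequent steps operate on the whole cohort at once, so data parallelism
# collapses from 438 independent jobs to a handful of lightweight tasks
# (1-6 threads, 10-30 min each) that fit on a campus cluster.
joint = (f"{gatk} -T GenotypeGVCFs "
         + " ".join(f"-V {s}.g.vcf" for s in samples)
         + " -o cohort.vcf")

print(len(per_sample), "parallel per-sample jobs, then joint cohort steps")
```

This shape explains the offload decision: only step 1 needs supercomputer-scale parallelism, while the fan-in steps are cheap enough for ordinary hardware.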
8. Footprint on Gordon: CPUs and Storage Used
• 5,000 cores (30% of Gordon) in use at once
• 257 TB Lustre scratch used at peak
9. Time to Completion...
• Overall:
  • 36 core-years of compute used in 6 weeks, equivalent to 310 cores running 24/7
  • 57 TB DRAM used (aggregate)
• Read Mapping (9-step pipeline)
  • 5 weeks including time for learning on Gordon (16 days of compute in the public batch queue)
  • Over 2.5 years of 24/7 compute on a single 8-core workstation (> 4 years realistically)
• Variant Calling (GATK HaplotypeCaller)
  • 5 days and 3 hours on Gordon
  • 10.5 months of 24/7 compute on a 16-core workstation
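The timing claims above can be cross-checked against each other; every input below is taken from the slide, and the derived values agree with the slide's rounded figures.

```python
# Consistency check of the slide's timing numbers (all inputs from the slide).

core_years = 36
weeks = 6
cores_equiv = core_years * 52 / weeks  # = 312, which the slide rounds to 310
print(f"{cores_equiv:.0f} cores running 24/7")

# Read mapping: >2.5 years on an 8-core workstation -> >20 core-years
mapping_core_years = 2.5 * 8
# Variant calling: 10.5 months on a 16-core workstation -> 14 core-years
calling_core_years = 10.5 / 12 * 16
total = mapping_core_years + calling_core_years  # = 34, consistent with 36
print(f"~{total:.0f} core-years implied by the workstation comparisons")
```

The ~34 core-years implied by the workstation comparisons lines up with the stated 36 core-years, the gap being absorbed by rounding and the "> 4 years realistically" caveat.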
10. Acknowledgements
Janssen Research & Development:
• Chris Huang
• Ed Jaeger
• Sarah Lamberth
• Lance Smith
• Zhenya Cherkas
• Martin Dellwo
• Carrie Brodmerkel
• Sandor Szalma
• Mark Curran
• Guna Rajagopal