Large-scale Genomic Analysis Enabled by Gordon

Kristopher Standish*^, Tristan M. Carland*,
Glenn K. Lockwood+^, Mahidhar Tatineni+^,
Wayne Pfeiffer+^, Nicholas J. Schork*^

* Scripps Translational Science Institute
+ San Diego Supercomputer Center
^ University of California San Diego

Project funding provided by Janssen R&D

Background
•  Janssen R&D performed whole-genome sequencing on 438 patients undergoing treatment for rheumatoid arthritis
•  Problem: correlate response or non-response to drug therapy with genetic variants
•  Solution combines multi-disciplinary expertise
    •  Genomic analytics from Janssen R&D and Scripps Translational Science Institute (STSI)
    •  Data-intensive computing from San Diego Supercomputer Center (SDSC)

SAN DIEGO SUPERCOMPUTER CENTER
Technical Challenges
•  Data Volume: raw reads from 438 full human genomes
    •  50 TB of compressed data from Janssen R&D
    •  encrypted on 8x 6 TB SATA RAID enclosures
•  Compute: perform read mapping and variant calling on all genomes
    •  9-step pipeline to achieve high-quality read mapping
    •  5-step pipeline to do group variant calling for analysis
•  Project requirements:
    •  FAST turnaround (assembly in < 2 months)
    •  EFFICIENT (minimum core-hours used)

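To put the data volume and turnaround requirement in per-genome terms, here is a rough back-of-the-envelope sketch. The 438 genomes and 50 TB come from the slide; decimal units (1 TB = 1000 GB) and reading "< 2 months" as 60 days are assumptions made only for illustration.

```python
# Figures from the slide; decimal units (1 TB = 1000 GB) and a 60-day
# deadline for "< 2 months" are assumptions made for this sketch.
GENOMES = 438
COMPRESSED_TB = 50.0
DEADLINE_DAYS = 60


def per_genome_gb(total_tb: float, genomes: int) -> float:
    """Average compressed input per genome, in GB."""
    return total_tb * 1000.0 / genomes


def genomes_per_day(genomes: int, days: int) -> float:
    """Sustained throughput needed to finish every genome by the deadline."""
    return genomes / days


if __name__ == "__main__":
    print(f"{per_genome_gb(COMPRESSED_TB, GENOMES):.0f} GB of compressed reads per genome")
    print(f"{genomes_per_day(GENOMES, DEADLINE_DAYS):.1f} genomes per day to meet the deadline")
```

Under these assumptions the input averages a little over 100 GB per genome, and several genomes must finish every day to meet the deadline.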
Read Mapping Pipeline: Looks Uniform from Traditional HPC Perspective...
1.  Map (BWA)
2.  SAM to BAM (SAMtools)
3.  Merge Lanes (SAMtools)
4.  Sort (SAMtools)
5.  Mark Duplicates (Picard)
6.  Target Creator (GATK)
7.  Indel Realigner (GATK)
8.  Base Quality Score Recalibration (GATK)
9.  Print Reads (GATK)
[Figure: thread-level parallelism vs. walltime for each step; dimensions drawn to scale]
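The nine steps can also be written down as plain data, which makes it easy to see how many steps each tool covers. This is only a sketch of the step order on the slide; the `Stage` type and `stages_by_tool` helper are hypothetical names, and no actual tool invocations or flags are implied.

```python
from typing import Dict, List, NamedTuple


class Stage(NamedTuple):
    number: int
    name: str
    tool: str


# Step order and tool names as listed on the slide.
READ_MAPPING: List[Stage] = [
    Stage(1, "Map", "BWA"),
    Stage(2, "SAM to BAM", "SAMtools"),
    Stage(3, "Merge Lanes", "SAMtools"),
    Stage(4, "Sort", "SAMtools"),
    Stage(5, "Mark Duplicates", "Picard"),
    Stage(6, "Target Creator", "GATK"),
    Stage(7, "Indel Realigner", "GATK"),
    Stage(8, "Base Quality Score Recalibration", "GATK"),
    Stage(9, "Print Reads", "GATK"),
]


def stages_by_tool(stages: List[Stage]) -> Dict[str, List[int]]:
    """Group step numbers by tool, keeping tools in first-seen order."""
    grouped: Dict[str, List[int]] = {}
    for stage in stages:
        grouped.setdefault(stage.tool, []).append(stage.number)
    return grouped


if __name__ == "__main__":
    for tool, steps in stages_by_tool(READ_MAPPING).items():
        print(f"{tool}: steps {steps}")
```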

Read Mapping Pipeline: Non-Traditional Bottlenecks (DRAM & IO)
[Figure: memory requirement vs. walltime for the nine steps of the read mapping pipeline, Map (BWA) through Print Reads (GATK); dimensions drawn to scale]

Sort Step: Bound by Disk IO and Capacity
Problem: 16 threads require...
•  25 GB DRAM
•  3.5 TB local disk
•  1.6 TB input data
which generate...
•  3,500 IOPS (metadata-rich)
•  1 GB/s read rate
Solution: BigFlash
•  64 GB DRAM/node
•  16x300 GB SSDs (4.4 TB usable local flash)
•  1.6 GB/s from Lustre to SSDs, dedicated I/O InfiniBand rail

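A quick way to see why flash capacity, not DRAM, is the binding constraint is to check the sort job's requirements against one BigFlash node. The numbers are the ones on the slide. The class and function names are hypothetical, and the sketch assumes that the 3.5 TB local-disk figure is the job's total local footprint and that the 1.6 GB/s Lustre-to-SSD rate is sustained for a single node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SortJob:
    dram_gb: float = 25.0      # DRAM needed by a 16-thread sort
    scratch_tb: float = 3.5    # local disk needed (assumed to be the total footprint)
    input_tb: float = 1.6      # input staged in from Lustre


@dataclass(frozen=True)
class BigFlashNode:
    dram_gb: float = 64.0
    flash_tb: float = 4.4              # usable local flash
    stage_in_gb_per_s: float = 1.6     # Lustre -> SSD rate (assumed sustained)


def concurrent_sorts(job: SortJob, node: BigFlashNode) -> int:
    """Sort jobs that fit on one node, limited by DRAM and by flash capacity."""
    by_dram = int(node.dram_gb // job.dram_gb)
    by_flash = int(node.flash_tb // job.scratch_tb)
    return min(by_dram, by_flash)


def stage_in_minutes(job: SortJob, node: BigFlashNode) -> float:
    """Time to copy the input from Lustre to local flash (1 TB = 1000 GB)."""
    return job.input_tb * 1000.0 / node.stage_in_gb_per_s / 60.0


if __name__ == "__main__":
    job, node = SortJob(), BigFlashNode()
    print(f"sorts per node: {concurrent_sorts(job, node)}")
    print(f"stage-in time: {stage_in_minutes(job, node):.1f} min")
```

Under these assumptions DRAM alone would allow two sorts per node, but flash capacity allows only one, and staging the input takes roughly a quarter of an hour.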
Group Variant Calling Pipeline
•  Massive data reduction at first step
•  Reduction in data parallelism
•  Subsequent steps (#2 - #5) offloaded to campus cluster
    •  1-6 threads each
    •  10-30 min each
[Figure: thread-level parallelism vs. walltime for each step; dimensions approx. drawn to scale]

Footprint on Gordon: CPUs and Storage Used
257 TB Lustre scratch used at peak
5,000 cores (30% of Gordon) in use at once

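Two derived figures help put this footprint in context: how much larger the peak scratch usage was than the compressed input, and the machine size implied by "30% of Gordon". The sketch below uses only numbers from the slides (257 TB, 5,000 cores, 30%, and the 50 TB of input from the Technical Challenges slide). The function names are hypothetical, and the implied core count is an estimate from a rounded percentage, not an official specification.

```python
PEAK_SCRATCH_TB = 257.0
INPUT_TB = 50.0          # compressed input, from the Technical Challenges slide
CORES_IN_USE = 5000
FRACTION_OF_GORDON = 0.30


def expansion_factor(peak_tb: float, input_tb: float) -> float:
    """Peak scratch footprint relative to the compressed input."""
    return peak_tb / input_tb


def implied_total_cores(cores_in_use: int, fraction: float) -> float:
    """Machine size implied by the in-use core count and its stated share."""
    return cores_in_use / fraction


if __name__ == "__main__":
    print(f"scratch expansion: {expansion_factor(PEAK_SCRATCH_TB, INPUT_TB):.1f}x")
    print(f"implied machine size: ~{implied_total_cores(CORES_IN_USE, FRACTION_OF_GORDON):,.0f} cores")
```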
Time to Completion...
•  Overall:
    •  36 core-years of compute used in 6 weeks, equivalent to 310 cores running 24/7
    •  57 TB DRAM used (aggregate)
•  Read Mapping (9-step Pipeline)
    •  5 weeks including time for learning on Gordon (16 days of compute in public batch queue)
    •  Over 2.5 years of 24/7 compute on a single 8-core workstation (> 4 years realistically)
•  Variant Calling (GATK HaplotypeCaller)
    •  5 days and 3 hours on Gordon
    •  10.5 months of 24/7 compute on a 16-core workstation

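The figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes a 365.25-day year and a month of exactly one twelfth of a year. Its function names are hypothetical, and because the read-mapping estimate is stated as "over 2.5 years", the workstation total is a lower bound.

```python
DAYS_PER_YEAR = 365.25  # assumption: average calendar year


def average_cores(core_years: float, weeks: float) -> float:
    """Cores that must run 24/7 to deliver `core_years` within `weeks`."""
    return core_years * DAYS_PER_YEAR / (weeks * 7.0)


def workstation_core_years(cores: int, years: float) -> float:
    """Core-years consumed by a workstation running 24/7 for `years`."""
    return cores * years


def speedup(workstation_years: float, gordon_hours: float) -> float:
    """Wall-clock speedup of the Gordon run over the workstation estimate."""
    return workstation_years * DAYS_PER_YEAR * 24.0 / gordon_hours


if __name__ == "__main__":
    print(f"average cores busy: {average_cores(36, 6):.0f}")
    mapping = workstation_core_years(8, 2.5)          # read mapping estimate
    calling = workstation_core_years(16, 10.5 / 12)   # variant calling estimate
    print(f"workstation estimates: {mapping + calling:.0f} core-years")
    print(f"variant calling speedup: {speedup(10.5 / 12, 5 * 24 + 3):.0f}x")
```

With these assumptions the average comes out slightly above the slide's rounded 310 cores. The two workstation estimates total about 34 core-years, consistent with the 36 core-years reported overall.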
Acknowledgements
•  Chris Huang
•  Ed Jaeger
•  Sarah Lamberth
•  Lance Smith
•  Zhenya Cherkas
•  Martin Dellwo
•  Carrie Brodmerkel
•  Sandor Szalma
•  Mark Curran
•  Guna Rajagopal
Janssen Research & Development
