A brief overview of a case study where 438 human genomes underwent read mapping and variant calling in under two months. Architectural requirements for the multi-stage pipeline are covered.
1. Large-scale Genomic Analysis Enabled by Gordon
Kristopher Standish*^, Tristan M. Carland*, Glenn K. Lockwood+^, Mahidhar Tatineni+^, Wayne Pfeiffer+^, Nicholas J. Schork*^
* Scripps Translational Science Institute
+ San Diego Supercomputer Center
^ University of California San Diego
Project funding provided by Janssen R&D
2. Background
• Janssen R&D performed whole-genome sequencing on 438 patients undergoing treatment for rheumatoid arthritis
• Problem: correlate response or non-response to drug therapy with genetic variants
• Solution combines multi-disciplinary expertise
  • Genomic analytics from Janssen R&D and the Scripps Translational Science Institute (STSI)
  • Data-intensive computing from the San Diego Supercomputer Center (SDSC)
SAN DIEGO SUPERCOMPUTER CENTER
3. Technical Challenges
• Data volume: raw reads from 438 full human genomes
  • 50 TB of compressed data from Janssen R&D
  • encrypted on 8x 6 TB SATA RAID enclosures
• Compute: perform read mapping and variant calling on all genomes
  • 9-step pipeline to achieve high-quality read mapping
  • 5-step pipeline to do group variant calling for analysis
• Project requirements:
  • FAST turnaround (assembly in < 2 months)
  • EFFICIENT (minimum core-hours used)
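The data-volume figures above are internally consistent, which a few lines of arithmetic can confirm. All inputs below come from the slide; the per-genome figure is derived.

```python
# Quick arithmetic on the slide's data-volume figures (all inputs from the slide).

total_tb = 50          # compressed raw reads delivered by Janssen R&D
genomes = 438          # whole genomes sequenced
enclosures = 8         # shipped RAID enclosures
enclosure_tb = 6       # capacity of each 6 TB SATA RAID enclosure

per_genome_gb = total_tb * 1000 / genomes     # average compressed size per genome
raid_capacity_tb = enclosures * enclosure_tb  # total shipped capacity

print(f"~{per_genome_gb:.0f} GB compressed per genome")            # ~114 GB
print(f"{raid_capacity_tb} TB RAID capacity for {total_tb} TB of data")
```

The 48 TB of raw enclosure capacity closely matches the 50 TB of compressed data, which explains the 8-enclosure shipment.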
4. Read Mapping Pipeline: Looks Uniform from Traditional HPC Perspective...
[Figure: thread-level parallelism vs. walltime for each step, dimensions drawn to scale]
1. Map (BWA)
2. sam to bam (SAMtools)
3. Merge Lanes (SAMtools)
4. Sort (SAMtools)
5. Mark Duplicates (Picard)
6. Target Creator (GATK)
7. Indel Realigner (GATK)
8. Base Quality Score Recalibration (GATK)
9. Print Reads (GATK)
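The nine steps can be sketched as per-sample shell commands. The file names (ref.fa, picard.jar, GenomeAnalysisTK.jar, the sample/lane naming) and the flags below are illustrative assumptions using GATK3-era syntax, not the exact invocations run on Gordon.

```python
# Sketch of the slide's 9-step read-mapping pipeline as per-sample commands.
# Paths, sample naming, and flags are hypothetical; tool names match the slide.

def read_mapping_steps(sample):
    """Return the nine pipeline steps for one sample as command strings."""
    gatk = "java -jar GenomeAnalysisTK.jar -R ref.fa"
    return [
        # 1. Map paired-end reads with BWA (one job per lane)
        f"bwa mem -t 16 ref.fa {sample}_L1_R1.fq {sample}_L1_R2.fq > {sample}_L1.sam",
        # 2. Convert SAM to compressed BAM
        f"samtools view -bS {sample}_L1.sam > {sample}_L1.bam",
        # 3. Merge per-lane BAMs into one file per sample
        f"samtools merge {sample}.bam {sample}_L1.bam {sample}_L2.bam",
        # 4. Coordinate-sort (the IO- and capacity-bound step discussed later)
        f"samtools sort -o {sample}.sorted.bam {sample}.bam",
        # 5. Flag PCR duplicates with Picard
        f"java -jar picard.jar MarkDuplicates I={sample}.sorted.bam "
        f"O={sample}.dedup.bam M={sample}.dup_metrics.txt",
        # 6. Find intervals that need indel realignment
        f"{gatk} -T RealignerTargetCreator -I {sample}.dedup.bam -o {sample}.intervals",
        # 7. Realign reads around indels
        f"{gatk} -T IndelRealigner -I {sample}.dedup.bam "
        f"-targetIntervals {sample}.intervals -o {sample}.realn.bam",
        # 8. Model systematic base-quality errors
        f"{gatk} -T BaseRecalibrator -I {sample}.realn.bam -o {sample}.recal.table",
        # 9. Write the final recalibrated BAM
        f"{gatk} -T PrintReads -I {sample}.realn.bam "
        f"-BQSR {sample}.recal.table -o {sample}.final.bam",
    ]

for cmd in read_mapping_steps("patient001"):
    print(cmd)
```

Expressing the pipeline as a command list per sample matches the slide's point: from a scheduler's perspective the 438 samples look like 438 identical serial job chains.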
5. Read Mapping Pipeline: Non-Traditional Bottlenecks (DRAM & IO)
[Figure: memory requirement vs. walltime for each step, dimensions drawn to scale]
1. Map (BWA)
2. sam to bam (SAMtools)
3. Merge Lanes (SAMtools)
4. Sort (SAMtools)
5. Mark Duplicates (Picard)
6. Target Creator (GATK)
7. Indel Realigner (GATK)
8. Base Quality Score Recalibration (GATK)
9. Print Reads (GATK)
6. Sort Step: Bound by Disk IO and Capacity
Problem: 16 threads require...
• 25 GB DRAM
• 3.5 TB local disk
• 1.6 TB input data
which generate...
• 3,500 IOPS (metadata-rich)
• 1 GB/s read rate
Solution: BigFlash
• 64 GB DRAM/node
• 16x 300 GB SSDs (4.4 TB usable local flash)
• 1.6 GB/s from Lustre to SSDs via a dedicated I/O InfiniBand rail
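The slide's numbers show why a BigFlash node fits this step; a short sanity check makes the headroom and the staging cost explicit. All inputs come from the slide; the staging-time estimate is derived from them.

```python
# Sanity check of the sort step's requirements against a BigFlash node.
# All figures are taken from the slide; staging time is derived.

need_dram_gb = 25        # DRAM required by 16 sort threads
need_disk_tb = 3.5       # local scratch required
input_tb = 1.6           # input data per sample

node_dram_gb = 64                 # BigFlash node DRAM
node_flash_tb = 16 * 300 / 1000   # 16x 300 GB SSDs = 4.8 TB raw
usable_flash_tb = 4.4             # usable local flash per the slide
lustre_bw_gbs = 1.6               # Lustre-to-SSD bandwidth over the I/O rail

# Both requirements fit with headroom, and staging the 1.6 TB input
# over the dedicated rail takes on the order of 17 minutes.
stage_minutes = input_tb * 1000 / lustre_bw_gbs / 60

assert need_dram_gb < node_dram_gb
assert need_disk_tb < usable_flash_tb < node_flash_tb
print(f"staging time ~= {stage_minutes:.0f} min")  # prints "staging time ~= 17 min"
```

The key design point is that staging to local flash converts 3,500 metadata-rich IOPS against Lustre into sequential local I/O, at the cost of a ~17-minute bulk copy.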
7. Group Variant Calling Pipeline
[Figure: thread-level parallelism vs. walltime for the 5-step pipeline, dimensions approx. drawn to scale]
• Massive data reduction at first step
• Reduction in data parallelism
• Subsequent steps (#2 - #5) offloaded to campus cluster
  • 1-6 threads each
  • 10-30 min each
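The slide names only the first tool (GATK HaplotypeCaller, per the timing slide); steps #2 - #5 are not itemized. A typical GATK3-era group-calling pattern, assumed here purely as an illustration of the data-reduction and parallelism-reduction points, looks like this:

```python
# Illustrative GATK3-style group-calling flow. Only HaplotypeCaller is named
# on the slides; the per-sample/joint split and all flags below are assumed,
# not a statement of the exact 5 steps used.

gatk = "java -jar GenomeAnalysisTK.jar -R ref.fa"
samples = [f"patient{i:03d}" for i in range(1, 439)]  # the 438 genomes

# Step 1: one HaplotypeCaller job per sample -- 438-way data parallelism,
# and a massive reduction in data volume (large BAM -> much smaller gVCF).
per_sample = [
    f"{gatk} -T HaplotypeCaller -I {s}.final.bam "
    f"--emitRefConfidence GVCF -o {s}.g.vcf"
    for s in samples
]

# Subsequent steps operate on the whole cohort at once, so data parallelism
# collapses from 438 independent jobs to a handful of lightweight tasks
# (1-6 threads, 10-30 min each) that fit on a campus cluster.
joint = (f"{gatk} -T GenotypeGVCFs "
         + " ".join(f"-V {s}.g.vcf" for s in samples)
         + " -o cohort.vcf")

print(len(per_sample), "parallel per-sample jobs, then joint cohort steps")
```

This shape explains the offload decision: only step 1 needs supercomputer-scale parallelism, while the fan-in steps are cheap enough for ordinary hardware.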
8. Footprint on Gordon: CPUs and Storage Used
• 5,000 cores (30% of Gordon) in use at once
• 257 TB Lustre scratch used at peak
9. Time to Completion...
• Overall:
  • 36 core-years of compute used in 6 weeks, equivalent to 310 cores running 24/7
  • 57 TB DRAM used (aggregate)
• Read Mapping (9-step pipeline)
  • 5 weeks including time for learning on Gordon (16 days of compute in the public batch queue)
  • Over 2.5 years of 24/7 compute on a single 8-core workstation (> 4 years realistically)
• Variant Calling (GATK HaplotypeCaller)
  • 5 days and 3 hours on Gordon
  • 10.5 months of 24/7 compute on a 16-core workstation
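The timing claims above can be cross-checked against each other; every input below is taken from the slide, and the derived values agree with the slide's rounded figures.

```python
# Consistency check of the slide's timing numbers (all inputs from the slide).

core_years = 36
weeks = 6
cores_equiv = core_years * 52 / weeks  # = 312, which the slide rounds to 310
print(f"{cores_equiv:.0f} cores running 24/7")

# Read mapping: >2.5 years on an 8-core workstation -> >20 core-years
mapping_core_years = 2.5 * 8
# Variant calling: 10.5 months on a 16-core workstation -> 14 core-years
calling_core_years = 10.5 / 12 * 16
total = mapping_core_years + calling_core_years  # = 34, consistent with 36
print(f"~{total:.0f} core-years implied by the workstation comparisons")
```

The ~34 core-years implied by the workstation comparisons lines up with the stated 36 core-years, the gap being absorbed by rounding and the "> 4 years realistically" caveat.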
10. Acknowledgements
Janssen Research & Development:
• Chris Huang
• Ed Jaeger
• Sarah Lamberth
• Lance Smith
• Zhenya Cherkas
• Martin Dellwo
• Carrie Brodmerkel
• Sandor Szalma
• Mark Curran
• Guna Rajagopal