Healthcare systems around the world are looking to Precision Medicine -- care decisions tailored to the individual patient -- as a means to drive better care outcomes at lower cost. Today, the most promising technology for this, already applied in diseases such as cancer, is sequencing a patient's genome. For infectious diseases, sequencing has revolutionized our understanding of outbreaks and how they spread. Over the past decade, advances in genome sequencing have improved throughput and lowered costs by 100X or more. It remains a data- and compute-intensive endeavor that most biomedical research and care delivery networks are not equipped to handle. This session features Dr. Swaine Chen from the Genome Institute of Singapore, and the Broad Institute Cromwell team, discussing the problems of handling genomic data at scale and how they solved them to deliver results.
11. GIS, Pre-AWS
• Locations: on-site data center, office area
• User workstations
• Head node
• 128 nodes: 40-80 CPUs, 128-512 GB RAM
• “SMPs”: 96 CPUs, 1 TB RAM
• Network links: 1 Gbps, 40-100 Gbps
• Archival storage (10 PB)
• Office, home storage: 3 PB
• Compute storage: 4 PB
12. GIS, Pre-AWS
• Locations: on-site data center, office area
• User workstations
• Head node
• Cluster nodes (~500): 4-8 CPUs, 64-128 GB RAM
• “SMPs”: 128 CPUs, 1 TB RAM
• Network links: 1 Gbps, 10-100 Gbps
• Archival storage (3 PB)
• Office, home storage: 1 PB
• Compute storage: 100 TB
Challenges
First-time command line users
Heterogeneous compute, storage, network
No/low experience
• Job management
• Optimization
• Software config/documentation
Spiky workloads
Self-inflicted denial of service
13. How did we first use AWS?
Phase 1
• Reimplement “SMPs”
• Users can’t DoS each other
• Infinite capacity (and potential for waste)
• Full complexity
Diagram labels: individual user, GIS, AWS, single instance, EBS / compute storage, S3 / object storage
15. Our current efforts on AWS
Phase 2
• Nextflow + AWS Batch
• Totally new paradigm, enabled by cloud
• AWS for elastic provisioning
• Cluster is abstracted away
• Leverage this for software
Diagram labels: individual user, GIS, AWS, S3 / object storage, AWS Batch
16. Phase 2
Diagram labels: individual user, GIS, AWS, S3 / object storage, job repo, job tasks, Docker repo (ECR), AWS Batch
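With Nextflow, jobs are submitted to AWS Batch on the user's behalf, which is what "cluster is abstracted away" means in practice. As a rough illustration of what a single submission could look like, here is a minimal Python sketch using boto3's Batch `submit_job` call. The queue name, job definition, script name, and bucket are hypothetical placeholders, not details from the talk.

```python
from typing import Any, Dict, List


def build_batch_job_request(
    sample_id: str,
    command: List[str],
    job_queue: str = "genomics-queue",          # hypothetical queue name
    job_definition: str = "variant-calling:1",  # hypothetical job definition
    vcpus: int = 4,
    memory_mib: int = 16384,
) -> Dict[str, Any]:
    """Build the keyword arguments for boto3's Batch submit_job call."""
    return {
        "jobName": f"variant-calling-{sample_id}",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "command": command,
            # Batch expects resource values as strings.
            "resourceRequirements": [
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},
            ],
        },
    }


def submit(request: Dict[str, Any]) -> str:
    """Submit the job. Needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    response = boto3.client("batch").submit_job(**request)
    return response["jobId"]


# Hypothetical sample ID, script, and S3 path, for illustration only.
request = build_batch_job_request(
    "sample001",
    ["call_variants.sh", "s3://example-bucket/sample001.bam"],
)
print(request["jobName"])
```

The job definition would point at a container image, for example one stored in ECR, so the software environment travels with the job rather than being installed on a shared cluster.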
17. Why is this complexity needed?
GATK Best Practices – a standard workflow in genomics
26. Preparing for 1 Million Genomic Devices
Phase 3
• Serverless, event-driven model
• Massive scale
• No user intervention
• Fundamentally cloud-driven transformation of our problem solving
• Enables continuous monitoring
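One way to picture the serverless, event-driven model is a function that runs whenever new sequencing data lands in object storage. The Python sketch below assumes an S3 upload notification as the trigger, which the slides do not specify. The `start_analysis` helper, bucket name, and key are hypothetical.

```python
from typing import Any, Dict, List
from urllib.parse import unquote_plus


def uploaded_objects(event: Dict[str, Any]) -> List[Dict[str, str]]:
    """Extract bucket and key for every object in an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        objects.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 notifications.
            "key": unquote_plus(s3["object"]["key"]),
        })
    return objects


def start_analysis(bucket: str, key: str) -> str:
    """Hypothetical placeholder for kicking off analysis of one upload."""
    return f"queued s3://{bucket}/{key}"


def handler(event: Dict[str, Any], context: Any) -> List[str]:
    """Lambda entry point: runs once per notification, with no user involved."""
    return [start_analysis(o["bucket"], o["key"]) for o in uploaded_objects(event)]


# Hypothetical event, trimmed to the fields the handler reads.
example_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-reads"},
                "object": {"key": "device42/run+1.fastq.gz"}}}
    ]
}
print(handler(example_event, None))
```

Because each upload triggers its own invocation, the same code path serves one device or many, which is the property the "no user intervention" and "continuous monitoring" bullets rely on.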
27. Preparing for 1 Million Genomic Devices
Reimplement variant calling
6 hours 15 minutes
Auto scatter-gather, high parallelism
1,000 genomes, 25 million GB-s, no intervention
12 genomes on Lambda free tier!
Chart: genomes per unit cost (axis ticks from 1 to 100,000 in powers of ten), run own servers vs. GIS + Lambda, annotated 20×
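The scatter-gather pattern named on this slide can be sketched in a few lines of Python: split the genome into regions, process each region independently, then merge the results. This is a toy illustration only. Threads stand in for the parallel function invocations, and `call_variants` is a hypothetical stand-in that fabricates positions rather than calling real variants.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), end exclusive
Variant = Tuple[str, int]      # (chromosome, position)


def scatter(chrom_lengths: Dict[str, int], chunk_size: int) -> List[Region]:
    """Split each chromosome into fixed-size regions."""
    regions = []
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, chunk_size):
            regions.append((chrom, start, min(start + chunk_size, length)))
    return regions


def call_variants(region: Region) -> List[Variant]:
    """Hypothetical stand-in for a variant caller: reports a fake variant
    at every position in the region that is a multiple of 1000."""
    chrom, start, end = region
    first = -(-start // 1000) * 1000  # round start up to a multiple of 1000
    return [(chrom, pos) for pos in range(first, end, 1000)]


def gather(per_region: Iterable[List[Variant]]) -> List[Variant]:
    """Merge per-region results into one sorted list."""
    return sorted(v for calls in per_region for v in calls)


# Toy chromosome lengths, chosen only to keep the example small.
chrom_lengths = {"chr1": 5000, "chr2": 2500}
regions = scatter(chrom_lengths, chunk_size=2000)

# Each region is independent, so all of them can run at once.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(call_variants, regions))

variants = gather(results)
print(len(regions), "regions,", len(variants), "variants")
```

Because regions do not depend on one another, the degree of parallelism is limited mainly by how finely the genome is split, which is what makes the pattern a natural fit for many short-lived function invocations.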
29. Maggie Leong
Vincent Quah
Adrian White
Julian Lau
Liew Jun Xian
Andreas Wilm
Shih Chih Chuan
Ng Huck Hui
Pauline Ng
Anders Skanderup
National Precision Medicine Program