The PanCancer Analysis of Whole Genomes (PCAWG) project is a large-scale, highly distributed research collaboration designed to identify common patterns of mutations across 2,800 cancer genomes. The use of public and private clouds was instrumental in analyzing this dataset with current best-practice containerized pipelines. This session describes the technical infrastructure built for the project, how we leveraged cloud environments to perform the “core” analysis, and the lessons learned along the way.
3. PCAWG: A Cloud-Based, Distributed Collaboration
● International Cancer Genome Consortium (ICGC)
● ~5,800 Whole Genomes
  – ~2,800 Cancer Donors
  – ~1,300 with RNA-Seq data
  – Goal is to consistently analyze data
● 8 sites storing and sharing data via GNOS
  – 300TB -> 900TB
● 14 Cloud (and HPC) environments
  – 3 Commercial, 7 OpenStack, 4 HPC
  – ~630 VMs, ~15K cores, ~60TB of RAM
8. PCAWG Analysis Architecture & AWS
[Architecture diagram: sequencing projects deposit data into GNOS repositories; cloud orchestrators, backed by a metadata index, dispatch work orders to compute at academic compute centers and to the AWS cloud (spot instances, Amazon S3), alongside the DNAnexus and Seven Bridges platforms.]
Represents a major shift: ICGC data is now redistributed within Amazon’s cloud.
9. Lesson 2: Portable Tools
Containerized workflows for portability between sites
Core Workflows:
● Alignment: BWA-MEM
● Variant Calling: Broad, DKFZ/EMBL, and Sanger
https://github.com/ICGC-TCGA-PanCancer
10. Lesson 3: Fault-Tolerant Cloud Execution
Architecture 1.0
● cloud-based clusters
● Gluster distributed filesystem
● scheduling per cloud
Architecture 2.0
● single-node workers
● no distributed filesystem
● Ansible for setup
Architecture 3.0
● a complete rethink
11. Lesson 4: Cloud Costs
Workflow     Hardware (cores / machine)   Runtime                     Cost on AWS
BWA          8 cores (16 GB RAM)          5 days (± 5) per specimen   $11.16
Sanger       8 cores (32 GB RAM)          4 days (± 3) per donor      $17.22
DKFZ/EMBL    16 cores (64 GB RAM)         2 days (± 0.6) per donor    $12.80
Broad        32 cores (256 GB RAM)        2.6 days per donor          $20.48

Workflow     Storage required per donor
BWA          240 GB
Sanger       4 GB
DKFZ/EMBL    5 GB
Total        249 GB

Data analysis: Create a cloud commons, Nature 2015
$62/donor
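The per-donor figure is approximately the sum of the four per-workflow compute costs above: $11.16 + $17.22 + $12.80 + $20.48 = $61.66, which rounds to the quoted ~$62 per donor.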
12. ICGC PCAWG Legacy
Publications soon
AWS Public Datasets Program
~1,400 PCAWG Donors
– BAM files (~70% of ICGC donors)
– VCFs from all three pipelines
– more ICGC data uploaded regularly
https://dcc.icgc.org/icgc-in-the-cloud
13. The Present
Goal: to formalize lessons from PCAWG into reusable tools
Dockstore - Tool/Workflow Sharing
Toil - Workflow Execution
Redwood - File Storage
15. Redwood - Scalable Storage
Key features: based on the ICGC Storage Service; supports FUSE, BAM slicing, and highly parallel access; typically a WORM (write once, read many) usage pattern.
[Diagram: a client talks to authentication & storage services running on Amazon EC2 instances in the AWS cloud, with objects stored in Amazon S3.]
16. Redwood - Storage System Performance
The Redwood storage system (and the underlying S3) provided a stable and secure mechanism to store and use genomic data.
An example run of ~100 simultaneous downloads saw ~45-100 MB/s.
17. Dockstore.org - Sharing Tools & Workflows
Dockstore:
● Share tools and workflows
● Package tools with Docker, describe them with CWL/WDL
● PCAWG goal: provide our tools via Dockstore
http://dockstore.org and https://github.com/ga4gh
19. Dockstore 1.0 Release
Highlighted New Features
Support for the 1.0.0 GA4GH Tool Registry API
Support for displaying, sharing, and natively launching CWL 1.0 & WDL tools
Preliminary support for CWL/WDL workflows
Full list of updates since 0.4-beta.4 at https://github.com/ga4gh/dockstore/releases
New Content
ICGC PanCancer Analysis of Whole Genomes (PCAWG) tools
• BWA-MEM, Sanger, Delly, DKFZ
22. Running Dockstore Tools
Execution with the Dockstore Command Line Interface (CLI)
The goal was something simple, but we want the same process to be accessible via other execution systems!
Steps: provision input files -> pull Docker images -> execute the tool with its inputs using CWL -> provision output files somewhere.
Other execution systems: Seven Bridges, Curoverse, Galaxy, Consonance, etc.
Simple Dockstore Command Line
27. Coming Soon to Dockstore
● Workflow DAG view
● Testing with PCAWG test data
● “Launch With…”
  • Consonance
  • Commercial partner(s)
● Signed Docker images
● Cross-site indexing
See roadmap: https://goo.gl/4D9a8F
28. Toil - Efficient Compute on AWS
● A system for large-scale, efficient work on AWS
● Toil recently completed a 30K-core, 20K-sample recompute
● Per-job granularity allows for better efficiency and robustness
29. Toil - Dynamic DAGs
The job graph in Toil can be either statically or dynamically declared.
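A minimal sketch of dynamic declaration, closely patterned on Toil's dynamic job-creation example (the job-store path and function name here are illustrative): child jobs are added from inside a running job, so the DAG takes shape at runtime.

from toil.job import Job

def binaryStrings(job, depth, message=""):
    # Children are added while this job runs, so the shape of the DAG is
    # decided at runtime (here, a binary tree of jobs of the given depth).
    if depth > 0:
        job.addChildJobFn(binaryStrings, depth - 1, message + "0")
        job.addChildJobFn(binaryStrings, depth - 1, message + "1")
    else:
        job.log("Generated binary string: %s" % message)

if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("./toilJobStore")
    Job.Runner.startToil(Job.wrapJobFn(binaryStrings, 3), options)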
31. Toil - Accessible to New Developers
User scripts are written in pure Python:

from toil.job import Job

def helloWorld(message, memory="2G", cores=2, disk="3G"):
    # Per-job resource requirements (memory, cores, disk) are declared here.
    return "Hello, world!, here's a message: %s" % message

j = Job.wrapFn(helloWorld, "You did it!")

if __name__ == "__main__":
    parser = Job.Runner.getDefaultArgumentParser()
    options = parser.parse_args()
    print Job.Runner.startToil(j, options)  # Prints "Hello, world!, ..."
32. Toil - Portable
● Toil can be installed on any system with Python 2.7
● Built-in support for various batch systems – a few thanks in part to open-source community support!
  ○ Mesos
  ○ SGE (GridEngine)
  ○ UCSC’s Parasol
  ○ Single machine mode
  ○ LSF
  ○ SLURM
● All batch systems can be used interchangeably with any of the job stores (see the sketch below)
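A minimal portability sketch, assuming the jobStore positional argument and the --batchSystem flag exposed by Toil's default argument parser (the flag name and option values should be checked against Toil's command-line help of this era; the job-store path is illustrative). The workflow code itself does not change when the scheduler does.

from toil.job import Job

def hello(job):
    job.log("Running under whichever batch system was selected")

if __name__ == "__main__":
    # Only the batch-system option changes between schedulers, e.g.
    # "singleMachine" here versus "mesos" on a Mesos cluster.
    parser = Job.Runner.getDefaultArgumentParser()
    options = parser.parse_args(["./toilJobStore", "--batchSystem", "singleMachine"])
    Job.Runner.startToil(Job.wrapJobFn(hello), options)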
33. Toil - Scalable
● Cloud-based job stores are designed to handle many concurrent workers
● Mesos has been shown to scale to 50k simulated nodes in Amazon Elastic Compute Cloud (EC2)
● Workers try to reduce interactions with the master by scheduling jobs locally
34. Toil - Robust to Failures
● Jobs are checkpointed upon completion, allowing for resumability after job failure (a resume sketch follows below)
● Toil’s job store can resume from any combination of leader/worker failure
● Toil currently supports job stores for:
  ○ Shared file systems
  ○ AWS (Amazon S3 + Amazon SimpleDB)
  ○ Experimental support for Azure / Google Cloud
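A minimal resume sketch, assuming the AWS job-store locator syntax ("aws:<region>:<name>") and the --restart flag from Toil's default argument parser; the region and store name are illustrative. The first run is launched without --restart to create the job store; re-running the same script with --restart after a failure picks up from the last checkpointed state instead of starting over.

from toil.job import Job

def work(job):
    job.log("Resumed from the state recorded in the job store")

if __name__ == "__main__":
    parser = Job.Runner.getDefaultArgumentParser()
    # On the first attempt, omit --restart; add it to resume after a failure.
    options = parser.parse_args(["aws:us-west-2:pcawg-jobstore", "--restart"])
    Job.Runner.startToil(Job.wrapJobFn(work), options)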
37. The Future
PCAWG showed the power of the cloud for large scientific analysis.
Current work on Redwood, Dockstore, and Toil formalizes the lessons learned and the methodologies.
Our future work focuses on establishing standards from this previous work and applying them to future, larger-scale efforts.
38. Tool Registry API
● Formalizing the standard with the GA4GH through the Containers and Workflows Task Team, implemented in Dockstore
● Basic read API with extended support for write and search
[Diagram: registered tool(s) — a Docker image plus a CWL/WDL descriptor — exposed through GET list, GET search, and POST register calls: an API standard to share tools.]
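A hedged sketch of the read side of this API in Python: the GET-list call is the one named above, while the base URL is an assumption patterned on Dockstore's public registry endpoint and should be checked against the published API documentation.

import requests

# List registered tools from a GA4GH Tool Registry API implementation.
# BASE is an assumed endpoint; point it at the registry you are querying.
BASE = "https://dockstore.org/api/api/ga4gh/v1"
tools = requests.get(BASE + "/tools", params={"limit": 10}).json()
for tool in tools:
    print(tool["id"])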
39. Emerging GA4GH API Standards
Further work of the Containers and Workflows Task Team
Workflow/Task Execution APIs — an API standard to execute tools:
● POST new task
● GET task status
● GET task stderr/stdout
[Diagram: a CWL/WDL workflow or tool (Docker image + JSON parameters) is handed to a cloud-specific implementation, which reports status, stderr/stdout, and output file(s).]
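A hypothetical sketch of the submit-and-poll flow named above; the host, paths, and JSON fields are illustrative assumptions (this execution API was still being defined when these slides were written), and the Docker image name is only an example.

import requests

# Submit a task (a Docker image plus a command) and poll its state.
BASE = "https://task-service.example.org/v1"   # assumed endpoint
task = {
    "name": "example-alignment-task",
    "executors": [{"image": "quay.io/pancancer/pcawg-bwa-mem-workflow",
                   "command": ["bwa", "mem"]}],
}
task_id = requests.post(BASE + "/tasks", json=task).json()["id"]
state = requests.get(BASE + "/tasks/" + task_id).json()["state"]
print(state)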
41. Acknowledgements
- GA4GH Containers & Workflows Task Team
- Broad Institute
- Cincinnati Children’s Hospital
- Curoverse
- European Bioinformatics Institute
- Intel
- Institute for Systems Biology
- Google, Microsoft, and Amazon
- Ontario Institute for Cancer Research
- Oregon Health and Science University
- Seven Bridges Genomics
- University of California Santa Cruz
● Lincoln Stein, Josh Stuart, Gad Getz, Peter Campbell, Jan Korbel - PCAWG
● Vincent Ferretti - Storage
● Denis Yuen - Dockstore
● Kyle Ellrott - Task API
● Peter Amstutz - Workflow API and Co-leader
● Jeff Gentry - Co-leader
● Hannes Schmidt, Frank Nothaft & the Toil Team
44. Enabling science
Scalable compute resources only when needed
Time to result was greatly reduced
Cost of analysis was greatly reduced
Data can be securely shared in place
Global community access
45. Open data as a platform
[Diagram: an open-data pipeline from data creation through data enrichment to sensemaking — data at rest in object storage is exposed through basic APIs, complex APIs, data catalogs, and consumer applications, supporting uses such as algorithmic policy, data-driven journalism, focused data dashboards, predictive modeling, and visualizations — lowering the cost of knowledge (efficiency).]
46. Open data as a platform
[The same open-data pipeline diagram, annotated with genomics data types at each stage (BAM, gVCF, Wig, GFF) and with question marks over the later stages.]
47. Amazon S3 for science
[Diagram: Amazon S3 as the data lake, feeding a data science sandbox and visualization / reporting.]
48. Public datasets on AWS
To enable more innovation, AWS hosts a selection of datasets that anyone can access for free. Data in our public datasets is available for rapid access from our flexible, low-cost computing resources.
Earth Science
• Landsat
• NEXRAD
• NASA NEX
Life Science
• TCGA & ICGC
• 1000 Genomes
• Genome in a Bottle
• Human Microbiome Project
• 3000 Rice Genome
Internet Science
• Common Crawl Corpus
• Google Books Ngrams
• Multimedia Commons
https://aws.amazon.com/public-datasets/
50. AWS Lambda
Serverless, event-driven compute service
No Servers to Manage
AWS Lambda automatically runs your code without requiring you to provision or manage servers. Just write the code and upload it to Lambda.
Continuous Scaling
AWS Lambda automatically scales your application by running code in response to each trigger. Your code runs in parallel and processes each trigger individually, scaling precisely with the size of the workload.
Subsecond Metering
With AWS Lambda, you are charged for every 100ms your code executes and the number of times your code is triggered. You don't pay anything when your code isn't running.
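A minimal handler sketch in Python; the event shape shown here (an S3-style "Records" list) is an assumption for illustration. Lambda invokes the handler once per trigger and runs as many copies in parallel as the workload requires.

def handler(event, context):
    # One invocation per trigger; Lambda supplies the event payload and a
    # context object, and the service handles servers and scaling.
    records = event.get("Records", [])
    return {"processed": len(records)}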
51. Key Scenarios
Data processing: stateless processing of discrete or streaming updates to your data store or message bus
App backend development: execute server-side backend logic in a cross-platform fashion
Control systems: customize responses and response workflows to state and data changes within AWS
52. Evented genome sequence processing
Nanocall*
* Matei David (Jared T. Simpson lab), doi:10.1093/bioinformatics/btw569
53. Data analysis using R, API Gateway, and Lambda
Station X’s GenePool platform enables real-time biomarker analysis and management of clinical genomic data at scale.
They use API Gateway to execute Lambda functions that bundle a statistical routine written in R, calculating the significance of the association between a gene’s expression level and patient survival for every gene in the genome (~20K genes).
This serverless architecture enabled them to scale dynamically without paying for idle compute, while leveraging robust error-handling capabilities.
It also exemplifies how researchers can leverage PHI data de-identification to use more resources on the AWS platform.
“The patient data has been de-identified…API Gateway and Lambda only receive the event, time-to-event, and expression values [which] ensures that we are able to use Lambda and API Gateway...while still complying with the AWS BAA and HIPAA.”