The PanCancer Analysis of Whole Genomes (PCAWG) project is a large-scale, highly distributed research collaboration designed to identify common patterns of mutations across 2,800 cancer genomes. The use of public and private clouds was instrumental in analyzing this dataset with current best-practice containerized pipelines. This session describes the technical infrastructure built for the project, how we leveraged cloud environments to perform the “core” analysis, and the lessons learned along the way.
3. PCAWG: A Cloud-Based, Distributed Collaboration
● International Cancer Genome Consortium (ICGC)
● ~5,800 Whole Genomes
  – ~2,800 Cancer Donors
  – ~1,300 with RNA-Seq data
  – Goal is to consistently analyze data
● 8 sites storing and sharing data via GNOS
  – 300TB -> 900TB
● 14 Cloud (and HPC) environments
  – 3 Commercial, 7 OpenStack, 4 HPC
  – ~630 VMs, ~15K cores, ~60TB of RAM
8. PCAWG Analysis Architecture & AWS
[Architecture diagram: sequencing projects deposit data into GNOS repositories; cloud orchestrators, backed by a metadata index, dispatch work orders to compute at academic compute centers and to the AWS cloud (spot instances, Amazon S3), alongside the DNAnexus and Seven Bridges platforms.]
Represents a major shift: ICGC data is now redistributed within Amazon’s cloud.
9. Lesson 2: Portable Tools
Containerized workflows for portability between sites
Core Workflows:
● Alignment: BWA-MEM
● Variant Calling: Broad, DKFZ/EMBL, and Sanger
https://github.com/ICGC-TCGA-PanCancer
10. Lesson 3: Fault-Tolerant Cloud Execution
Architecture 1.0
● cloud-based clusters
● Gluster distributed filesystem
● scheduling per cloud
Architecture 2.0
● single-node workers
● no distributed filesystem
● Ansible for setup
Architecture 3.0
● a complete rethink
11. Lesson 4: Cloud Costs
Workflow     Hardware (cores / machine)   Runtime                     Cost on AWS
BWA          8 cores (16 GB RAM)          5 days (± 5) per specimen   $11.16
Sanger       8 cores (32 GB RAM)          4 days (± 3) per donor      $17.22
DKFZ/EMBL    16 cores (64 GB RAM)         2 days (± 0.6) per donor    $12.80
Broad        32 cores (256 GB RAM)        2.6 days per donor          $20.48

Workflow     Storage required per donor
BWA          240 GB
Sanger       4 GB
DKFZ/EMBL    5 GB
Total        249 GB

Data analysis: Create a cloud commons, Nature 2015
$62/donor
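The per-donor figure is approximately the sum of the four per-workflow compute costs above: $11.16 + $17.22 + $12.80 + $20.48 = $61.66, which rounds to the quoted ~$62 per donor.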
12. ICGC PCAWG Legacy
Publications soon
AWS Public Datasets Program
~1,400 PCAWG Donors
– BAM files (~70% of ICGC donors)
– VCFs from all three pipelines
– more ICGC data uploaded regularly
https://dcc.icgc.org/icgc-in-the-cloud
13. The Present
Goal: to formalize lessons from PCAWG into reusable tools
Dockstore - Tool/Workflow Sharing
Toil - Workflow Execution
Redwood - File Storage
15. Redwood - Scalable Storage
Key features: based on the ICGC Storage Service; supports FUSE, BAM slicing, and highly parallel access; typically a WORM (write once, read many) usage pattern.
[Diagram: a client talks to authentication & storage services running on Amazon EC2 instances in the AWS cloud, with objects stored in Amazon S3.]
16. Redwood - Storage System Performance
The Redwood storage system (and the underlying S3) provided a stable and secure mechanism to store and use genomic data.
An example run of ~100 simultaneous downloads saw ~45-100 MB/s.
17. Dockstore.org - Sharing Tools & Workflows
Dockstore:
● Share tools and workflows
● Package tools with Docker, describe them with CWL/WDL
● PCAWG goal: provide our tools via Dockstore
http://dockstore.org and https://github.com/ga4gh
19. Dockstore 1.0 Release
Highlighted New Features
Support for the 1.0.0 GA4GH Tool Registry API
Support for displaying, sharing, and natively launching CWL 1.0 & WDL tools
Preliminary support for CWL/WDL workflows
Full list of updates since 0.4-beta.4 at https://github.com/ga4gh/dockstore/releases
New Content
ICGC PanCancer Analysis of Whole Genomes (PCAWG) tools
• BWA-MEM, Sanger, Delly, DKFZ
22. Running Dockstore Tools
Execution with the Dockstore Command Line Interface (CLI)
The goal was something simple, but we want the same process to be accessible via other execution systems!
Steps: provision input files -> pull Docker images -> execute the tool with its inputs using CWL -> provision output files somewhere.
Other execution systems: Seven Bridges, Curoverse, Galaxy, Consonance, etc.
Simple Dockstore Command Line
27. Coming Soon to Dockstore
● Workflow DAG view
● Testing with PCAWG test data
● “Launch With…”
  • Consonance
  • Commercial partner(s)
● Signed Docker images
● Cross-site indexing
See roadmap: https://goo.gl/4D9a8F
28. Toil - Efficient Compute on AWS
● A system for large-scale, efficient work on AWS
● Toil recently completed a 30K-core, 20K-sample recompute
● Per-job granularity allows for better efficiency and robustness
29. Toil - Dynamic DAGs
The job graph in Toil can be either statically or dynamically declared.
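A minimal sketch of dynamic declaration, closely patterned on Toil's dynamic job-creation example (the job-store path and function name here are illustrative): child jobs are added from inside a running job, so the DAG takes shape at runtime.

from toil.job import Job

def binaryStrings(job, depth, message=""):
    # Children are added while this job runs, so the shape of the DAG is
    # decided at runtime (here, a binary tree of jobs of the given depth).
    if depth > 0:
        job.addChildJobFn(binaryStrings, depth - 1, message + "0")
        job.addChildJobFn(binaryStrings, depth - 1, message + "1")
    else:
        job.log("Generated binary string: %s" % message)

if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("./toilJobStore")
    Job.Runner.startToil(Job.wrapJobFn(binaryStrings, 3), options)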
31. Toil - Accessible to New Developers
User scripts are written in pure Python:

from toil.job import Job

def helloWorld(message, memory="2G", cores=2, disk="3G"):
    # Per-job resource requirements (memory, cores, disk) are declared here.
    return "Hello, world!, here's a message: %s" % message

j = Job.wrapFn(helloWorld, "You did it!")

if __name__ == "__main__":
    parser = Job.Runner.getDefaultArgumentParser()
    options = parser.parse_args()
    print Job.Runner.startToil(j, options)  # Prints "Hello, world!, ..."
32. Toil - Portable
● Toil can be installed on any system with Python 2.7
● Built-in support for various batch systems – a few thanks in part to open-source community support!
  ○ Mesos
  ○ SGE (GridEngine)
  ○ UCSC’s Parasol
  ○ Single machine mode
  ○ LSF
  ○ SLURM
● All batch systems can be used interchangeably with any of the job stores (see the sketch below)
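A minimal portability sketch, assuming the jobStore positional argument and the --batchSystem flag exposed by Toil's default argument parser (the flag name and option values should be checked against Toil's command-line help of this era; the job-store path is illustrative). The workflow code itself does not change when the scheduler does.

from toil.job import Job

def hello(job):
    job.log("Running under whichever batch system was selected")

if __name__ == "__main__":
    # Only the batch-system option changes between schedulers, e.g.
    # "singleMachine" here versus "mesos" on a Mesos cluster.
    parser = Job.Runner.getDefaultArgumentParser()
    options = parser.parse_args(["./toilJobStore", "--batchSystem", "singleMachine"])
    Job.Runner.startToil(Job.wrapJobFn(hello), options)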
33. Toil - Scalable
● Cloud-based job stores are designed to handle many concurrent workers
● Mesos has been shown to scale to 50k simulated nodes in Amazon Elastic Compute Cloud (EC2)
● Workers try to reduce interactions with the master by scheduling jobs locally
34. Toil - Robust to Failures
● Jobs are checkpointed upon completion, allowing for resumability after job failure (a resume sketch follows below)
● Toil’s job store can resume from any combination of leader/worker failure
● Toil currently supports job stores for:
  ○ Shared file systems
  ○ AWS (Amazon S3 + Amazon SimpleDB)
  ○ Experimental support for Azure / Google Cloud
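A minimal resume sketch, assuming the AWS job-store locator syntax ("aws:<region>:<name>") and the --restart flag from Toil's default argument parser; the region and store name are illustrative. The first run is launched without --restart to create the job store; re-running the same script with --restart after a failure picks up from the last checkpointed state instead of starting over.

from toil.job import Job

def work(job):
    job.log("Resumed from the state recorded in the job store")

if __name__ == "__main__":
    parser = Job.Runner.getDefaultArgumentParser()
    # On the first attempt, omit --restart; add it to resume after a failure.
    options = parser.parse_args(["aws:us-west-2:pcawg-jobstore", "--restart"])
    Job.Runner.startToil(Job.wrapJobFn(work), options)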
37. The Future
PCAWG showed the power of the cloud for large scientific analysis.
Current work on Redwood, Dockstore, and Toil formalizes the lessons learned and the methodologies.
Our future work focuses on establishing standards from this previous work and applying them to future, larger-scale efforts.
38. Tool Registry API
● Formalizing the standard with the GA4GH through the Containers and Workflows Task Team, implemented in Dockstore
● Basic read API with extended support for write and search
[Diagram: registered tool(s) — a Docker image plus a CWL/WDL descriptor — exposed through GET list, GET search, and POST register calls: an API standard to share tools.]
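A hedged sketch of the read side of this API in Python: the GET-list call is the one named above, while the base URL is an assumption patterned on Dockstore's public registry endpoint and should be checked against the published API documentation.

import requests

# List registered tools from a GA4GH Tool Registry API implementation.
# BASE is an assumed endpoint; point it at the registry you are querying.
BASE = "https://dockstore.org/api/api/ga4gh/v1"
tools = requests.get(BASE + "/tools", params={"limit": 10}).json()
for tool in tools:
    print(tool["id"])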
39. Emerging GA4GH API Standards
Further work of the Containers and Workflows Task Team
Workflow/Task Execution APIs — an API standard to execute tools:
● POST new task
● GET task status
● GET task stderr/stdout
[Diagram: a CWL/WDL workflow or tool (Docker image + JSON parameters) is handed to a cloud-specific implementation, which reports status, stderr/stdout, and output file(s).]
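A hypothetical sketch of the submit-and-poll flow named above; the host, paths, and JSON fields are illustrative assumptions (this execution API was still being defined when these slides were written), and the Docker image name is only an example.

import requests

# Submit a task (a Docker image plus a command) and poll its state.
BASE = "https://task-service.example.org/v1"   # assumed endpoint
task = {
    "name": "example-alignment-task",
    "executors": [{"image": "quay.io/pancancer/pcawg-bwa-mem-workflow",
                   "command": ["bwa", "mem"]}],
}
task_id = requests.post(BASE + "/tasks", json=task).json()["id"]
state = requests.get(BASE + "/tasks/" + task_id).json()["state"]
print(state)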
41. Acknowledgements
- GA4GH Containers & Workflows Task Team
- Broad Institute
- Cincinnati Children’s Hospital
- Curoverse
- European Bioinformatics Institute
- Intel
- Institute for Systems Biology
- Google, Microsoft, and Amazon
- Ontario Institute for Cancer Research
- Oregon Health and Science University
- Seven Bridges Genomics
- University of California Santa Cruz
● Lincoln Stein, Josh Stuart, Gad Getz, Peter Campbell, Jan Korbel - PCAWG
● Vincent Ferretti - Storage
● Denis Yuen - Dockstore
● Kyle Ellrott - Task API
● Peter Amstutz - Workflow API and Co-leader
● Jeff Gentry - Co-leader
● Hannes Schmidt, Frank Nothaft & the Toil Team
44. Enabling science
Scalable compute resources only when needed
Time to result was greatly reduced
Cost of analysis was greatly reduced
Data can be securely shared in place
Global community access
45. Open data as a platform
[Diagram: an open-data pipeline from data creation through data enrichment to sensemaking — data at rest in object storage is exposed through basic APIs, complex APIs, data catalogs, and consumer applications, supporting uses such as algorithmic policy, data-driven journalism, focused data dashboards, predictive modeling, and visualizations — lowering the cost of knowledge (efficiency).]
46. Open data as a platform
[The same open-data pipeline diagram, annotated with genomics data types at each stage (BAM, gVCF, Wig, GFF) and with question marks over the later stages.]
47. Amazon S3 for science
[Diagram: Amazon S3 as the data lake, feeding a data science sandbox and visualization / reporting.]
48. Public datasets on AWS
To enable more innovation, AWS hosts a selection of datasets that anyone can access for free. Data in our public datasets is available for rapid access from our flexible, low-cost computing resources.
Earth Science
• Landsat
• NEXRAD
• NASA NEX
Life Science
• TCGA & ICGC
• 1000 Genomes
• Genome in a Bottle
• Human Microbiome Project
• 3000 Rice Genome
Internet Science
• Common Crawl Corpus
• Google Books Ngrams
• Multimedia Commons
https://aws.amazon.com/public-datasets/
50. AWS Lambda
Serverless, event-driven compute service
No Servers to Manage
AWS Lambda automatically runs your code without requiring you to provision or manage servers. Just write the code and upload it to Lambda.
Continuous Scaling
AWS Lambda automatically scales your application by running code in response to each trigger. Your code runs in parallel and processes each trigger individually, scaling precisely with the size of the workload.
Subsecond Metering
With AWS Lambda, you are charged for every 100ms your code executes and the number of times your code is triggered. You don't pay anything when your code isn't running.
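A minimal handler sketch in Python; the event shape shown here (an S3-style "Records" list) is an assumption for illustration. Lambda invokes the handler once per trigger and runs as many copies in parallel as the workload requires.

def handler(event, context):
    # One invocation per trigger; Lambda supplies the event payload and a
    # context object, and the service handles servers and scaling.
    records = event.get("Records", [])
    return {"processed": len(records)}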
51. Key Scenarios
Data processing: stateless processing of discrete or streaming updates to your data store or message bus
App backend development: execute server-side backend logic in a cross-platform fashion
Control systems: customize responses and response workflows to state and data changes within AWS
52. Evented genome sequence processing
Nanocall*
* Matei David (Jared T. Simpson lab), doi:10.1093/bioinformatics/btw569
53. Data analysis using R, API Gateway, and Lambda
Station X’s GenePool platform enables real-time biomarker analysis and management of clinical genomic data at scale.
They use API Gateway to execute Lambda functions that bundle a statistical routine written in R, calculating the significance of the association between a gene’s expression level and patient survival for every gene in the genome (~20K genes).
This serverless architecture enabled them to scale dynamically without paying for idle compute, while leveraging robust error-handling capabilities.
It also exemplifies how researchers can leverage PHI data de-identification to use more resources on the AWS platform.
“The patient data has been de-identified…API Gateway and Lambda only receive the event, time-to-event, and expression values [which] ensures that we are able to use Lambda and API Gateway...while still complying with the AWS BAA and HIPAA.”