© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Dr. Swaine Chen
Genome Institute of Singapore, National University of Singapore
223537
Accelerating Analytics for the Future of
Genomics
Accelerating Genomics
Research with the Cloud
Swaine Chen
Genome Institute of Singapore
National University of Singapore
WHAT IS GENOMICS
HUMAN
CELLS
NUCLEUS
CHROMOSOMES
DNA
What is DNA?
G, A, T, C
4 “bases”; 2 bits
Explicitly digital
A – T
C – G
T – A
G – C
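The "4 bases; 2 bits" point and the pairing rules above can be sketched in a few lines of Python. This is only an illustration: the particular bit assignment (A=00, C=01, G=10, T=11) and the function names are assumptions, not something taken from the slides.

```python
# Illustrative 2-bit packing of DNA bases; the bit assignment is an arbitrary choice.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = {bits: base for base, bits in ENCODE.items()}

# Base pairing as listed on the slide: A-T, C-G, T-A, G-C.
COMPLEMENT = {"A": "T", "C": "G", "T": "A", "G": "C"}


def pack(seq: str) -> int:
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | ENCODE[base]
    return value


def unpack(value: int, length: int) -> str:
    """Recover a DNA string of the given length from its packed form."""
    bases = []
    for _ in range(length):
        bases.append(DECODE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read in its own direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))
```

With this packing, a sequence of n bases needs 2n bits, so four bases fit in one byte.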
DNA sequencing technology
• Maxam-Gilbert: chemistry and radiation
• Capillary seq: miniaturization, parallelization
• Solexa (Illumina): higher density, imaging
• Oxford Nanopore: electric current detection
Timeline axis: 1970 to 2010
Miniaturization, Parallelization, Digitization
Genomics data is exploding (in the usual way)
Moore’s law
Genomics
Analytics is imploding (in an unusual way)
Moore’s law
Hyper-Moore
gap
The evolution of genomics compute
Driven by the data scale
Enabled by AWS
Our journey at GIS
GIS, Pre-AWS
Sites: on-site data center; office area (user workstations)
• Head node
• 128 nodes: 40-80 CPUs, 128-512 GB RAM
• “SMPs”: 96 CPUs, 1 TB RAM
• Network: 1 Gbps; 40-100 Gbps
• Archival storage: 10 PB
• Office, home storage: 3 PB
• Compute storage: 4 PB
GIS, Pre-AWS
Sites: on-site data center; office area (user workstations)
• Head node
• Cluster nodes (~500): 4-8 CPUs, 64-128 GB RAM
• “SMPs”: 128 CPUs, 1 TB RAM
• Network: 1 Gbps; 10-100 Gbps
• Archival storage: 3 PB
• Office, home storage: 1 PB
• Compute storage: 100 TB
Challenges
First-time command line users
Heterogeneous compute, storage, network
No/low experience
• Job management
• Optimization
• Software config/documentation
Spiky workloads
Self-inflicted denial of service
How did we first use AWS?
Phase 1
• Reimplement “SMPs”
• Users can’t DoS each other
• Infinite capacity (and potential for waste)
• Full complexity
[Diagram labels: GIS: individual user. AWS: single instance; EBS / compute storage; S3 / object storage.]
Our current efforts on AWS
Phase 2
• Nextflow + AWS Batch
• Totally new paradigm, enabled by cloud
• AWS for elastic provisioning
• Cluster is abstracted away
• Leverage this for software
[Diagram labels: GIS: individual user. AWS: S3 / object storage; AWS Batch.]
[Diagram labels, expanded view: job repo; job tasks; Docker repo (ECR); AWS Batch.]
Why is this complexity needed?
GATK Best Practices – a standard workflow in genomics
Capacity + Simplicity on AWS
The Impact at GIS
GIS Bacterial Projects: 100× in 4 years
Bacterial genomics: # genomes/paper tracks Moore’s Law
[Chart: number of genomes (log scale, 1 to 10,000) by year of publication (1995 to 2020), with GIS projects marked]
2013: 10-100 strains
2017: 10,000 strains
Higher resolution, more perspective
Capacity + Simplicity = Opportunity?
Does AWS fundamentally
change our thinking?
Genomics: Approaching IoT Transition
THEN / NOW
SINGAPORE’S DENGUE MONITORING INFRASTRUCTURE
Since 2006
All nonresidential buildings checked every 3 months
1 million inspections per year
Preparing for 1 Million Genomic Devices
Phase 3
• Serverless, event-driven model
• Massive scale
• No user intervention
• Fundamentally cloud-driven transformation of our problem solving
• Enables continuous monitoring
Preparing for 1 Million Genomic Devices
Reimplement variant calling: from 6 hours to 15 minutes
Auto scatter-gather, high parallelism
1,000 genomes, 25 million GB-s, no intervention
12 genomes on Lambda free tier!
[Chart: genomes per unit cost (log scale, 1 to 100,000), running own servers versus GIS + Lambda: 20×]
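The "auto scatter-gather, high parallelism" idea can be sketched in Python as follows. This is a toy under stated assumptions, not the GIS Lambda implementation: "variant calling" is reduced to reporting positions where a sample differs from a reference, local threads stand in for serverless function invocations, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(length: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split the range [0, length) into half-open regions of at most chunk_size."""
    return [(start, min(start + chunk_size, length))
            for start in range(0, length, chunk_size)]


def call_variants(reference: str, sample: str, region: tuple[int, int]):
    """Toy stand-in for a variant caller: list mismatches inside one region."""
    start, end = region
    return [(pos, reference[pos], sample[pos])
            for pos in range(start, end)
            if reference[pos] != sample[pos]]


def scatter_gather(reference: str, sample: str, chunk_size: int = 4, workers: int = 4):
    """Scatter regions to parallel workers, then gather results in region order."""
    regions = scatter(len(reference), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each region is independent, so the calls can run concurrently.
        per_region = list(pool.map(
            lambda region: call_variants(reference, sample, region), regions))
    # Gather: flatten the per-region lists into one list of calls.
    return [call for calls in per_region for call in calls]
```

Because each region is processed independently, the number of workers can grow with the number of regions, which is the property an event-driven, serverless model relies on.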
MANY SMART IDEAS
ONE SMART NATION
ENABLED BY GENOMICS
Maggie Leong
Vincent Quah
Adrian White
Julian Lau
Liew Jun Xian
Andreas Wilm
Shih Chih Chuan
Ng Huck Hui
Pauline Ng
Anders Skanderup
National Precision
Medicine Program
Ruchi Munshi
The Broad Institute
223537
Accelerating Analytics for the Future of
Genomics
An Introduction To Cromwell
Bioinformatics workflows at any
scale
The backdrop: data generation set to explode
Story begins here
Quarterly output (in TBases) of the Genomics Platform
The players in the trenches
GATK dev team
• Medical Population Genetics Platform
• Tasked with developing tools/BP pipelines
• Scope creep: run workflows for researchers
• Workflowing solution: GATK-Queue (Scala)
Picard / Ops team
• Genomics Platform
• Initial data processing -> Picard toolkit
• Took over workflows in production
• Workflowing solution: Zamboni (Scala)
Cancer Genome Analysis team
• Cancer Program
• Tasked with developing tools/BP workflows for somatic analysis
• Workflowing solution: Firehose self-service (python?)
The drama: low portability, silos, duplication of effort, looming bottlenecks
Sharing (securely) is caring
Traditional Way: Bring data to the researchers
Problems
Data sharing = data copying
Requires big infrastructure at each site
Largely fixed compute
Individual security implementations
Cloud Way: Bring researchers to the data
Solutions
True data sharing
Cloud provides the infrastructure
Elastic compute and storage
Centralized security implementation
Genome analysis pipeline throughput is “spiky”
• Solution: move to Cloud! Advantages over on-premises computing:
– No need to pay for compute power when we aren’t using it
– Can tolerate spikes without being forced to maintain a backlog of “things to
process once everything calms down”
Genome processing requests per day over a several month period
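The trade-off described above can be made concrete with a small Python sketch. Every number in it is hypothetical (the daily request counts, the capacity, and the assumption that fixed and elastic compute cost the same per genome), so it illustrates the shape of the argument rather than real costs.

```python
def fixed_cost(daily_requests: list[int], capacity_per_day: int, unit_cost: int) -> int:
    """Fixed capacity is paid for every day, whether or not it is used."""
    return capacity_per_day * len(daily_requests) * unit_cost


def elastic_cost(daily_requests: list[int], unit_cost: int) -> int:
    """Elastic capacity is paid for only when a request is actually processed."""
    return sum(daily_requests) * unit_cost


def max_backlog(daily_requests: list[int], capacity_per_day: int) -> int:
    """Largest queue of unprocessed requests when capacity is fixed."""
    backlog = worst = 0
    for requests in daily_requests:
        backlog = max(0, backlog + requests - capacity_per_day)
        worst = max(worst, backlog)
    return worst


# Hypothetical spiky workload: mostly quiet days with two large bursts.
REQUESTS = [10, 12, 8, 200, 15, 9, 150, 11]
```

For this made-up series, sizing a fixed cluster for the peak day costs 1,600 units against 415 for pay-per-use, while sizing it at 60 requests per day leaves a backlog that peaks at 140 requests.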
Use containers for portability & reproducibility
A container encapsulates all the software dependencies associated with running a program
Takes the guesswork out of running workflows on different platforms!
[Diagram labels: GATK 2.8, Java 7, R 2.5.0; GATK 3.8, Java 8, R 3.0.1; BWA; Picard]
Modified from https://www.docker.com/what-container
Meet Cromwell & WDL
Cromwell: execution engine that can
• Run on any platform (on-prem and on Cloud)
• Seamlessly scale based on workflow needs
• Provide maximal flexibility for all use cases
• https://github.com/broadinstitute/cromwell
WDL: workflow language that humans can read/write
• Methods developers and biomedical scientists at large
• https://github.com/openwdl/wdl/
Two main ways to run Cromwell
One-off
• Simple self-contained command
• Appropriate for independent analysts
java -jar cromwell.jar run hello.wdl --inputs hello_inputs.json
Server mode
• API endpoints
• More scalable
• Some devops needs
• Appropriate for production environments
• Call caching
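For server mode, a workflow is submitted over HTTP rather than on the command line. The following Python sketch builds (but does not send) such a request using only the standard library. It assumes a Cromwell server reachable at a base URL such as http://localhost:8000 and uses the documented `POST /api/workflows/v1` endpoint with `workflowSource` and `workflowInputs` form fields; the helper name `build_submission` is hypothetical.

```python
import urllib.request
import uuid


def build_submission(base_url: str, wdl_text: str, inputs_json: str) -> urllib.request.Request:
    """Build a multipart/form-data POST that submits one workflow to a Cromwell server."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, payload in (("workflowSource", wdl_text), ("workflowInputs", inputs_json)):
        # Each form field is sent as a small file part.
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"; filename="{name}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
            f"{payload}\r\n"
        )
    body = ("".join(parts) + f"--{boundary}--\r\n").encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/workflows/v1",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would perform the submission against a running server; the response is a JSON document containing the workflow id and its initial status.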
Use a workflow execution engine that runs anywhere*
Cromwell backends: Local, HPC, TES (Funnel), Google, AWS*, Alicloud, …
https://github.com/broadinstitute/cromwell
*in development
Enable local development of workflows, run on the cloud
[Diagram labels: direct CLI; REST API; persistent Cromwell server; AWS: managed compute environment, S3 data buckets]
Cromwell will submit jobs to AWS Batch Job Queues
Cromwell stages the inputs/outputs for your jobs
Workflow inputs:
GATK = gatk.jar
RefFasta = hg38.fasta
RefIndex = hg38.fai
RefDict = hg38.dict
sampleName = sample.name
inputBAM = sample.bam
bamIndex = sample.bai
[Diagram labels: workflow; Cromwell; AWS Batch; inputs; outputs]
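The input names on this slide would normally reach Cromwell as a JSON file whose keys are qualified by the workflow name. The Python sketch below generates such a file for one sample. The workflow name, bucket name, and S3 key layout are all hypothetical; only the input names and file names come from the slide.

```python
import json


def build_inputs(workflow_name: str, bucket: str, sample: str) -> dict[str, str]:
    """Map the slide's input names to S3 locations, keyed as '<workflow>.<input>'."""
    values = {
        "GATK": f"s3://{bucket}/tools/gatk.jar",
        "RefFasta": f"s3://{bucket}/reference/hg38.fasta",
        "RefIndex": f"s3://{bucket}/reference/hg38.fai",
        "RefDict": f"s3://{bucket}/reference/hg38.dict",
        "sampleName": sample,
        "inputBAM": f"s3://{bucket}/samples/{sample}.bam",
        "bamIndex": f"s3://{bucket}/samples/{sample}.bai",
    }
    return {f"{workflow_name}.{name}": value for name, value in values.items()}


def render_inputs(workflow_name: str, bucket: str, sample: str) -> str:
    """Serialize the inputs as the JSON text a workflow engine would read."""
    return json.dumps(build_inputs(workflow_name, bucket, sample), indent=2)
```

Generating the file per sample keeps the workflow definition itself unchanged while the engine takes care of staging each referenced object to and from the job.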
Being able to escalate jobs is nice!
URGENT!
Workflow Description Language (WDL)
WDL runtime parameters
resourcing
cost savings!
containers
Basic WDL plumbing options
LINEAR CHAINING
call stepA
call stepB { input: in=stepA.out }
call stepC { input: in=stepB.out }
MULTI-IN/OUT
call stepC { input:
  in1=stepB.out1,
  in2=stepB.out2 }
SCATTER-GATHER
Array[File] inputFiles
scatter(oneFile in inputFiles) {
  call stepA { input: in=oneFile }
}
call stepB { input: files=stepA.out }
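In all three patterns, the order of execution follows from which outputs each call consumes, not from the order in which calls are written. The Python sketch below shows that idea with a topological sort over references such as `stepA.out`. It is a simplified model using the standard library's `graphlib` (Python 3.9+), not Cromwell's actual scheduler, and `execution_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter


def execution_order(calls: dict[str, list[str]]) -> list[str]:
    """Order calls so that every call runs after the calls whose outputs it reads.

    `calls` maps a call name to the references it takes as input,
    written as '<call>.<output>' in the style of the snippets above.
    """
    dependencies = {
        name: {reference.split(".")[0] for reference in references}
        for name, references in calls.items()
    }
    return list(TopologicalSorter(dependencies).static_order())


# Multi-in/out example: stepC reads two outputs of stepB, which reads stepA.
# The calls are deliberately listed out of order.
MULTI_IN_OUT = {
    "stepC": ["stepB.out1", "stepB.out2"],
    "stepB": ["stepA.out"],
    "stepA": [],
}
```

Here `execution_order(MULTI_IN_OUT)` yields stepA, stepB, stepC. A scatter block fits the same model: each shard of the scattered call has no dependency on the others, and the call that gathers their outputs depends on all of them.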
OpenWDL: WDL meets open development
Randall Munroe, XKCD
https://www.xkcd.com/225/
But what about CWL?
Randall Munroe, XKCD
https://www.xkcd.com/1739/
Thanks to our Workflow Object Model (WOM), Cromwell now supports multiple versions of WDL as well as CWL 1.0!
Cromwell has been busy
Cromwell in production at Broad: processed 47.5 million jobs over the last two years
And this is just the tip of the iceberg!
Want to discuss further?
My Email:
rmunshi@broadinstitute.org
More Information:
Docs: http://cromwell.readthedocs.io/en/develop/
Github: https://www.github.com/broadinstitute/cromwell
WDL: http://www.openwdl.org
