Genome-scale Big Data Pipelines

Dr. Denis Bauer & Lynn Langit
Genomic-scale Data Pipelines

Team and collaborators:
Denis Bauer, PhD
Oscar Luo, PhD
Rob Dunne, PhD
Piotr Szul
Aidan O’Brien
Laurence Wilson, PhD
Adrian White
Andy Hindmarch
David Levy
Dan Andrews
Kaitao Lai, PhD
Arash Bayat
John Hildebrandt
Mia Chapman
Ian Blair
Kelly Williams
Jules Damji
Gaetan Burgio
Lynn Langit
Natalie Twine, PhD
Prabha Pillay
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Transformational Bioinformatics Team

Big Data in 2025…Petabytes?
[Bar chart, petabytes (axis 0 to 2500): Astronomy 1000, Twitter 17, YouTube 2000]

GENOMIC Big Data in 2025 - Exabytes
[Bar chart, exabytes (axis 0 to 25): Astronomy 1, Twitter 0.17, YouTube 2, Genomic 20]

Genome holds Blueprint for Every Cell

Affects Looks, Disease Risk, and Behavior

VCF Data
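
A VCF (Variant Call Format) file stores one variant per tab-separated line: eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), then FORMAT and one column per sample. As a minimal sketch, assuming diploid GT genotypes, the following turns one VCF line into per-sample alternate-allele counts (0, 1 or 2), a common numeric encoding for downstream ML. The helper name `alt_allele_counts` and the example record are invented for illustration and are not taken from the talk.

```python
from typing import List, Optional, Tuple


def alt_allele_counts(vcf_line: str) -> Tuple[str, List[Optional[int]]]:
    """Parse one VCF data line into (variant label, per-sample ALT allele counts).

    Hypothetical helper for illustration. Missing genotypes ("./.") become None.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    format_keys = fields[8].split(":")
    gt_index = format_keys.index("GT")  # GT is the genotype sub-field

    counts: List[Optional[int]] = []
    for sample in fields[9:]:
        gt = sample.split(":")[gt_index]
        # Alleles are separated by "/" (unphased) or "|" (phased).
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            counts.append(None)
        else:
            # Allele "0" is the reference; any other index counts as ALT.
            counts.append(sum(1 for a in alleles if a != "0"))
    return f"{chrom}:{pos}:{ref}>{alt}", counts


# Invented example record with three samples.
example = "1\t12345\trs0\tA\tG\t50\tPASS\t.\tGT:DP\t0/0:10\t0|1:12\t1/1:9"
variant, genotypes = alt_allele_counts(example)
print(variant, genotypes)
```

Each sample then contributes one small integer per variant, which is why a cohort of whole genomes becomes a very wide numeric matrix.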

Genomic Research Workflow
https://www.projectmine.com/about/
BigData Focus

Finding the Disease Gene(s)
Spot the letter that is…
• common amongst all affected
• absent in all unaffected*
* oversimplified
[Figure: cases vs. controls, Gene1 and Gene2]
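
The (oversimplified) rule on this slide can be sketched as a filter over genotype calls: keep a variant only if every case carries it and no control does. This is a toy illustration under stated assumptions, not the method used in the talk; the function name, the data layout (variant -> sample -> ALT allele count) and the sample data are all invented.

```python
from typing import Dict, List


def candidate_variants(
    genotypes: Dict[str, Dict[str, int]],
    cases: List[str],
    controls: List[str],
) -> List[str]:
    """Return variants present in all cases and absent in all controls.

    genotypes maps variant -> sample -> ALT allele count (0, 1 or 2).
    Samples without a call are treated as not carrying the variant.
    """
    hits = []
    for variant, calls in genotypes.items():
        in_all_cases = all(calls.get(s, 0) > 0 for s in cases)
        in_no_controls = all(calls.get(s, 0) == 0 for s in controls)
        if in_all_cases and in_no_controls:
            hits.append(variant)
    return hits


# Invented toy data: two variants, two cases, two controls.
toy_genotypes = {
    "Gene1:variantA": {"case1": 1, "case2": 2, "ctrl1": 0, "ctrl2": 0},
    "Gene2:variantB": {"case1": 1, "case2": 0, "ctrl1": 1, "ctrl2": 0},
}
print(candidate_variants(toy_genotypes, ["case1", "case2"], ["ctrl1", "ctrl2"]))
```

Real cohorts rarely separate this cleanly, which is where statistical and machine learning approaches come in.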

BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)
Cited: 4

Why Apache Spark?

Performance – Faster and More Accurate
VariantSpark is the only method to scale to 100% of the genome
[Figure: methods compared on two axes, Accuracy (low to high) and Speed (low to high)]

Cloud Data Pipeline Pattern
Business Problem -> Data Quality -> Candidate Technologies -> Build/Test MVPs -> Assemble Pipeline

Building a Cloud Data Pipeline
Candidate Technologies
• Ingest/Clean
• Analyze/Predict
• Visualize
Build MVPs
• Test
• Iterate
• Learn
Assemble Pipeline
• Combine pieces
• Validate sections
• Test at scale

Building a Cloud Data Pipeline
Spark
• IaaS, PaaS, SaaS Vendors
• AWS, Azure, GCP…

Visualizing Machine Learning Results

Solving Important Questions…
Cancer genomics?

DEMO: Who is a Bondi Hipster?

Supervised ML: Wide Random Forests

Scaling to 50 M variables and 10 K samples
100K trees: 5 – 50h, AWS: ~$215.50
100K trees: 200 – 2000h, AWS: ~$8620.00
• YARN cluster
• 12 workers
• 16 x Intel Xeon E5-2660 CPUs @ 2.20GHz
• 128 GB RAM
• Spark 1.6.1
• 128 executors
• 6GB / executor (0.75TB total)
• Synthetic dataset
[Chart annotations: Whole Genome Range, GWAS Range]
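
The figures quoted on this slide can be cross-checked with a little arithmetic. The sketch below assumes each AWS cost corresponds to the upper bound of its runtime range; that pairing, and the resulting hourly rate, are inferences and are not stated in the slides. It also checks the executor memory total.

```python
# Figures quoted on the slide; pairing each cost with the upper runtime
# bound is an assumption made for this check.
runs = [
    {"max_hours": 50, "aws_cost_usd": 215.50},
    {"max_hours": 2000, "aws_cost_usd": 8620.00},
]

# Implied cluster cost per hour for each run.
implied_rates = [round(r["aws_cost_usd"] / r["max_hours"], 2) for r in runs]

# Total executor memory: 128 executors x 6 GB each, expressed in TB (1024 GB).
executors = 128
gb_per_executor = 6
total_executor_memory_tb = executors * gb_per_executor / 1024

print(implied_rates, total_executor_memory_tb)
```

Under that assumption both runs imply the same hourly rate, so the 40x difference in cost simply tracks the 40x difference in runtime.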

Future Directions for VariantSpark RF
Mixed feature types
• Unordered categorical
• Continuous
Build Community
• Python API
• Non-Genomic demos
Implementation by

Try it out: VariantSpark Notebook
https://docs.databricks.com/spark/latest/training/variant-spark.html

Genome Editing can correct genetic diseases, e.g. hypertrophic cardiomyopathy
“Editing does not work every time, e.g. only 7 in 10 embryos were mutation free.”
Aim: Develop a computational guidance framework to enable edits the first time, every time
Ma et al. Nature 2017 *
* Controversy around the paper – stay tuned

Make Process Parallel and Scalable
SPEED
• Each search can be broken down into parallel tasks - each takes seconds
SCALE
• Researchers might want to search the target for one gene or 100,000
Scalability + Agility =
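
The idea of breaking a search into short, independent tasks can be sketched locally with a thread pool. This is only an illustration under stated assumptions: `search_gene` is a placeholder that counts a two-letter motif, not the actual GT-Scan2 search, and in a serverless deployment each task would be a separate function invocation rather than a thread.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Tuple


def search_gene(item: Tuple[str, str]) -> Tuple[str, int]:
    """Placeholder task: count start positions of the motif "GG" in one sequence."""
    name, sequence = item
    hits = sum(1 for i in range(len(sequence) - 1) if sequence[i:i + 2] == "GG")
    return name, hits


def search_all(genes: Dict[str, str], max_workers: int = 8) -> Dict[str, int]:
    """Fan out one independent task per gene and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(search_gene, genes.items()))


# Invented toy sequences; the same call works for one gene or many.
genes = {"geneA": "ATGGCGGA", "geneB": "TTTT", "geneC": "GGG"}
print(search_all(genes))
```

Because the tasks share no state, the number of workers can grow with the number of genes without changing the code.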

One of the first Serverless Applications in Research
Featured in

X-Ray Tracing Demo of GT-Scan2
• Find performance bottlenecks
• Fix and test
[Diagram components: Webapp, Lambda, Resources (S3, DynamoDB)]

GTScan2 X-Ray Analysis
[Bar chart: runtime (s) per function, by Type (base vs. old); axis ticks 25, 50, 75. Functions: getFastaSequence, createJob, targetScan, offtargetScanStarter, offtargetSearch, targetIntersects, targetTranscriptionIntersects, targetWuScorer, targetSgRNAScorer, OnTargetScorer, genomeCRISPR]

Results – 4x Faster (80% improvement)
2 min -> 30 sec

Considering Services for GT-Scan2
• Use AWS Step Functions
• Simplify workflow
• Simplify task timeouts
• Simplify task failures
• Must evaluate costs
• SNS vs. Step Functions
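
To make the timeout and failure points concrete, here is a minimal sketch of a Step Functions state machine definition in Amazon States Language, built as a Python dictionary. The state names, the two-step flow and the ARN placeholders are assumptions for illustration; the actual GT-Scan2 workflow is not described in the slides.

```python
import json
from typing import Any, Dict


def build_state_machine(scan_arn: str, score_arn: str) -> Dict[str, Any]:
    """Return an Amazon States Language definition with timeouts, retries and a failure path."""
    catch_all = [{"ErrorEquals": ["States.ALL"], "Next": "JobFailed"}]
    return {
        "Comment": "Illustrative two-step search workflow (hypothetical)",
        "StartAt": "TargetScan",
        "States": {
            "TargetScan": {
                "Type": "Task",
                "Resource": scan_arn,
                "TimeoutSeconds": 60,  # the service enforces the task timeout
                "Retry": [
                    {
                        "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
                        "IntervalSeconds": 2,
                        "MaxAttempts": 3,
                        "BackoffRate": 2.0,
                    }
                ],
                "Catch": catch_all,  # route anything unrecovered to JobFailed
                "Next": "ScoreTargets",
            },
            "ScoreTargets": {
                "Type": "Task",
                "Resource": score_arn,
                "TimeoutSeconds": 60,
                "Catch": catch_all,
                "End": True,
            },
            "JobFailed": {
                "Type": "Fail",
                "Error": "JobFailed",
                "Cause": "A task failed or timed out",
            },
        },
    }


# Placeholder ARNs; substitute real function ARNs before deploying.
definition = build_state_machine(
    "arn:aws:lambda:REGION:ACCOUNT_ID:function:targetScan",
    "arn:aws:lambda:REGION:ACCOUNT_ID:function:targetScorer",
)
print(json.dumps(definition, indent=2))
```

Retries, timeouts and the failure path live in the definition rather than in each function's code, which is the simplification the bullets refer to. The cost trade-off against SNS still has to be evaluated separately.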

Problem: Search (GTScan2)
Data: fastq, bed -> S3, NoSQL
Technologies and MVPs:
• Ingest: S3
• ETL, Analyze: Lambda
• Viz: Lambda/API Gateway
Pipeline: Serverless

Serverless Pipeline Pattern
[Diagram components: Users, API Gateway, Lambda functions 1-3, Step Functions, S3 buckets with objects, DynamoDB]
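
One stage of such a pipeline can be sketched as a Lambda handler that reacts to an S3 upload. This is a minimal, hedged example: it only parses the standard S3 event notification and returns job records. The `status` field and the response shape are invented, and a real stage would typically also write to DynamoDB and trigger the next function, which is omitted here to keep the sketch self-contained.

```python
from typing import Any, Dict, List
from urllib.parse import unquote_plus


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Turn each uploaded S3 object in the event into a queued job record."""
    jobs: List[Dict[str, str]] = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        jobs.append({"bucket": bucket, "key": key, "status": "queued"})
    return {"statusCode": 200, "jobs": jobs}


# Invented sample event with a single uploaded object.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "uploads/my+gene.fa"}}}
    ]
}
print(handler(sample_event, None))
```

Keeping each function this small is what lets the pieces be tested and timed independently before they are assembled into the pipeline.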

Problem: Analyze (GWAS)
Data: vcf -> S3/Spark
Technologies and MVPs:
• Ingest: S3 -> Databricks DBFS
• ETL: Apache Spark
• Analyze: Variant-Spark ML
• Viz: Notebook, SQL, R, Python
Pipeline: Spark Server Cluster

Spark Server Cluster Pipeline Pattern
Jupyter Notebook

Cloud Genomic-Scale Data Pipelines
• Problem #1 – ML on Large Data
• Solution: Spark-server cluster + custom machine learning
• Problem #2 – Burstable Search
• Solution: Serverless pipeline

