Genomic information is increasingly used in medical practice, giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. Here we introduce VariantSpark, which utilizes Hadoop/Spark along with its machine learning library, MLlib, providing the means of parallelisation for population-scale bioinformatics tasks. VariantSpark is the interface to the standard variant format (VCF), offers seamless genome-wide sampling of variants and provides a pipeline for visualising results.
To demonstrate the capabilities of VariantSpark, we clustered more than 3,000 individuals with 80 million variants each to determine the population structure in the dataset. VariantSpark is 80% faster than the Spark-based genome clustering approach, ADAM, the comparable implementation using Hadoop/Mahout, as well as Admixture, a commonly used tool for determining individual ancestries. It is over 90% faster than traditional implementations using R and Python. These benefits of speed, resource consumption and scalability enable VariantSpark to open up the usage of advanced, efficient machine learning algorithms to genomic data.
The package is written in Scala and available at https://github.com/BauerLab/VariantSpark.
VariantSpark: applying Spark-based machine learning methods to genomic information
1. Denis C. Bauer | Bioinformatics | @allPowerde
20 Nov 2015
CSIRO HEALTH & BIOSECURITY
VariantSpark: applying Spark-based machine
learning methods to genomic information
By Tim Cooper
2. Talk Overview
VariantSpark | Denis C. Bauer @allPowerde | Page 2
• Background: Why is genomics so important for medicine
• VariantSpark: Overview
• Whole Genome Analysis: Clustering samples by ethnicity
3. Genome sequencing improves diagnostics
Genomic sequencing can lead to a successful diagnosis in up to 50% of
cases where traditional genetic testing failed and is on average 96% cheaper
Oncology: Identifying tumours by their genome-wide mutation profiles
[Figure labels: Tandem duplications]
Rare genetic disorders: Identifying causative mutations by interrogating all abnormal variants
http://matt.might.net
Bauer et al. Trends Mol Med. 2014 PMID: 24801560
4. Generating data from 1 Million Americans
THE PRECISION MEDICINE INITIATIVE
WHAT IS IT?
Precision medicine is an emerging approach for disease
prevention and treatment that takes into account people’s
individual variations in genes, environment, and lifestyle.
The Precision Medicine Initiative will generate the
scientific evidence needed to move the concept of
precision medicine into clinical practice.
WHY NOW?
The time is right because of:
• Sequencing of the human genome
• Improved technologies for biomedical analysis
• New tools for using large datasets
NEAR TERM GOALS
Intensify efforts to apply precision medicine to cancer.
• Innovative clinical trials of targeted drugs for adult, pediatric cancers
• Use of combination therapies
• Knowledge to overcome drug resistance
LONGER TERM GOALS
Create a research cohort of > 1 million American volunteers who will
share genetic data, biological samples, and diet/lifestyle information, all
linked to their electronic health records if they choose.
Pioneer a new model for doing science that emphasizes engaged
participants, responsible data sharing, and privacy protection.
Research based upon the cohort data will:
• Advance pharmacogenomics, the right drug for the right patient at the
right dose
• Identify new targets for treatment and prevention
Australia: ~ 100 Million dedicated to clinical genomics
• $25 Million Australian Genomics Health Alliances (NHMRC Grant AI Denis);
• VIC and QLD Alliances ($25 Million each); NSW and ACT (undisclosed $$ to Garvan and JCSMR)
5. Genomics projects are getting bigger
• Human genome: ~1 sample
• The HapMap Project: 270 samples, 2002
• 1000 Genome Project: 1097 samples, 2012
• The Cancer Genome Atlas: 11,000 samples, 2015
• 100,000 Genomes Project: 70,000 individuals by 2017
• Project MinE: 15,000 people with ALS
• ASPREE: 4000 healthy 70+ year olds
Single samples are around 200GB in size
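To make the scale concrete, the sketch below multiplies the cohort sizes listed on this slide by the quoted figure of roughly 200GB per sample. It assumes every sample is the same size and uses decimal units (1 TB = 1000 GB); both are simplifications for illustration, and the helper name `storage_tb` is hypothetical.

```python
# Back-of-the-envelope storage estimate: cohort size x ~200 GB per sample.
# Assumption: every sample is about the same size, which will not hold for all projects.
GB_PER_SAMPLE = 200  # approximate figure quoted on the slide

# Cohort sizes as listed on the slide.
cohorts = {
    "HapMap": 270,
    "1000 Genome Project": 1097,
    "The Cancer Genome Atlas": 11_000,
    "100,000 Genomes Project (by 2017)": 70_000,
}


def storage_tb(samples, gb_per_sample=GB_PER_SAMPLE):
    """Return the estimated raw storage in terabytes (1 TB = 1000 GB)."""
    return samples * gb_per_sample / 1000


for name, n_samples in cohorts.items():
    print(f"{name}: about {storage_tb(n_samples):,.0f} TB")
```

Under these assumptions, 70,000 individuals come to about 14,000 TB, which is why single-machine analysis stops being practical.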
7. Data Analysis categories for genomics
• Basic Production Informatics: Produce raw sequence reads
• Advanced Production Informatics: Map to genome and generate raw genomic features (e.g. SNPs)
• Bioinformatics Research: Analyze the data; uncover the biological meaning
8. VariantSpark
VariantSpark is the interface enabling Spark’s MLlib* machine learning algorithms to be applied to genomics data
e.g. grouping samples by genomic profile
[Diagram: VCF, MLlib*, large-scale compute; column labels: Input, Genomics, Application, Result]
* VariantSpark also uses Spark.ML
10. Cluster individuals into ethnic groups based on their genomic profiles
• 1000 x 40 million variants matrix *
• Kmeans
• Predict super population
• 14 ethnic groups and 4 super populations
www.cloudaccess.eu
* VariantSpark can also process phase 3 data: 3000 individuals and 80 million variants
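The clustering step can be illustrated with a minimal, single-machine k-means sketch. This is not VariantSpark's code and it does not use Spark's MLlib; it is plain Python on a toy matrix. The 0/1/2 genotype encoding (number of non-reference alleles), the toy data, and the choice of starting centroids are all assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, iterations=10):
    """Plain Lloyd's k-means; returns the cluster index assigned to each point."""
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each sample goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Toy data: 6 samples x 6 variants, values are assumed alt-allele counts.
samples = [
    [0, 0, 1, 2, 2, 0],
    [0, 0, 1, 2, 2, 1],
    [0, 1, 1, 2, 2, 0],
    [2, 2, 0, 0, 0, 1],
    [2, 2, 0, 0, 1, 1],
    [2, 1, 0, 0, 0, 1],
]

# Hypothetical initialisation: first and last sample as starting centroids.
clusters = kmeans(samples, [samples[0], samples[-1]])
print(clusters)
```

Each row is one individual and each column one variant, so the real matrix has on the order of a thousand rows and tens of millions of columns, which is what motivates running the same two steps in parallel on Spark.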
11. Clustering result
• Adjusted Rand index (ARI) = 0.84, where 1 is a perfect match and values around 0 indicate chance-level agreement with the reference labels
• The majority of American (AMR) individuals are placed in the same group as Europeans (EUR), likely reflecting their migrational backgrounds.
• ADMIXTURE (state-of-the-art tool for population structure determination) returns a low ARI of 0.25
Admixture: Alexander, D.H., Novembre, J., Lange, K.: Fast model-based estimation of ancestry
in unrelated individuals. Genome Res 19(9), 1655–1664 (2009)
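For readers unfamiliar with the adjusted Rand index, the following is a small sketch of the standard pair-counting formula, comparing cluster assignments against known population labels. The labels in the example are made up for illustration and are not the talk's data; the function name is likewise hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on

    # Pairs of items that share a cluster in both, in the first, and in the second labeling.
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    in_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    in_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = in_true * in_pred / total_pairs  # agreement expected by chance
    maximum = (in_true + in_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (both - expected) / (maximum - expected)


# Illustrative labels only: cluster ids need not match the label names.
known = ["EUR", "EUR", "AFR", "AFR"]
predicted = [1, 1, 0, 0]
print(adjusted_rand_index(known, predicted))
```

The score depends only on which items are grouped together, so renaming or renumbering the clusters does not change it.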
12. Comparison to other implementations
• Preprocessing: converting location-centric VCF genotypes into sample-centric numerical vectors
• Clustering: Kmeans
• ADAM (BigData Genomics): Spark implementation with dense matrix
• Hadoop: MapReduce without in-memory caching
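The preprocessing step amounts to a transpose: a VCF file has one line per genomic location with one genotype column per sample, while clustering needs one numerical vector per sample. The sketch below shows the idea in plain Python on a toy VCF. Encoding each genotype as the count of non-reference alleles, and treating missing alleles as reference, are assumptions made here; VariantSpark's actual encoding and its Spark-based implementation may differ.

```python
def genotype_to_count(gt_field):
    """Map a VCF sample field such as '0|1' or '1/1:35' to its non-reference allele count."""
    genotype = gt_field.split(":")[0]  # GT is the first sub-field
    alleles = genotype.replace("|", "/").split("/")
    # Assumption: missing alleles ('.') are counted as reference.
    return sum(1 for allele in alleles if allele not in ("0", "."))


def vcf_to_sample_vectors(lines):
    """Turn location-centric VCF lines into {sample name: [count per variant]}."""
    sample_names = []
    vectors = {}
    for line in lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            sample_names = fields[9:]  # sample columns follow the 9 fixed columns
            vectors = {name: [] for name in sample_names}
            continue
        for name, gt_field in zip(sample_names, fields[9:]):
            vectors[name].append(genotype_to_count(gt_field))
    return vectors


# Toy VCF with two variants and three hypothetical samples.
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "22\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1",
    "22\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1|0\t0|0\t1|1",
]

sample_vectors = vcf_to_sample_vectors(toy_vcf)
print(sample_vectors)
```

Because each VCF line can be converted independently of the others, this step parallelises naturally across locations.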
[Figure: bar chart of time in seconds by method (Python, R, Hadoop, ADAM, ADMIXTURE, VariantSpark), split by task (binary-conversion, clustering, pre-processing). Bar labels: 103, 75, 29, 28, 18, 4 min.]
Chromosome 22; VM on Microsoft Azure with A7 Linux instance and 8 cores, 56GB memory running Ubuntu.
13. Scaling VariantSpark to the whole genome
• Pre-processing: scales seamlessly as processes are independent
• Clustering: memory consumption increases linearly with the number of variants (24GB) due to additional distance measurements between variants and k-means centroids
• As total memory was the limiting factor on our infrastructure, the number of simultaneously used nodes had to be reduced, increasing runtime.
[Figure: executors, memory and time versus number of variants (%), in two panels: pre-processing and clustering.]
CSIRO Spark Cluster: Whole genome; Hadoop 2.5.0, managed by cloudera’s CDH 5.
We use Spark 1.3.1. This 13 node cluster has a total of 416 cores and 1.22TB memory.
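As a rough illustration of why memory is the limiting factor, the sketch below estimates the size of a fully dense samples-by-variants matrix. The 8 bytes per value (a double-precision number) and the dense layout are assumptions made for this estimate, not a description of how VariantSpark stores genotypes; the function name is hypothetical.

```python
def dense_matrix_bytes(samples, variants, bytes_per_value=8):
    """Size in bytes of a dense samples x variants matrix (assumed 8-byte values)."""
    return samples * variants * bytes_per_value


CLUSTER_MEMORY_BYTES = 1.22e12  # 1.22TB total cluster memory, decimal units assumed

for label, n_samples, n_variants in [
    ("1000 x 40 million", 1000, 40_000_000),
    ("3000 x 80 million", 3000, 80_000_000),
]:
    size = dense_matrix_bytes(n_samples, n_variants)
    fits = size <= CLUSTER_MEMORY_BYTES
    print(f"{label}: {size / 1e12:.2f} TB dense, fits in cluster memory: {fits}")
```

Under these assumptions the 3000 x 80 million matrix alone would exceed the cluster's total memory if stored densely, which suggests why a more compact representation matters at whole-genome scale.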
14. Three things to remember
• VariantSpark is an interface bringing bigLearning tasks to genomics applications
• VariantSpark can cluster 3000 individuals and 80 million variants in under 30 hours using minimal memory (24GB), a task that is not possible in R/Python/ADMIXTURE due to memory limits.
• VariantSpark outperforms ADAM (Big Data Genomics) and the equivalent Hadoop implementation by almost an order of magnitude.
https://github.com/BauerLab/VariantSpark
15. HEALTH AND BIOSECURITY
Thank you
Health & Biosecurity
Denis C. Bauer
t +61 2 9123 4567
e Denis.Bauer@csiro.au
w aehrc.com/biomedical-informatics/transformational-bioinformatics/
More talks online: http://www.slideshare.net/allPowerde
Twitter: @allPowerde
Aidan O’Brien
Bill Wilson
Transformational Bioinformatics
Team, CSIRO
Former members
Firoz Anwar
Neil Saunders
Rodney Scott
Newcastle University
Funding:
National Health and Medical Research Council;
National Breast Cancer Foundation;
CSIRO's Transformational Capability Platform;
CSIRO’s IM&T;
Science and Industry Endowment Fund
Buske et al., Bioinformatics Jan 2014
O’Brien et al., BMC Genomics Dec 2015
Dunne et al., in preparation
FullySIC
Epistatic Gene Network
modelling
in preparation
Anwar et al., in preparation
Piotr Szul
Gi Guo
Robert Dunne
Data61 CSIRO, Australia
GOdistinct
GO Enrichment or genesets
with distinctive function