Denis C. Bauer | Bioinformatics | @allPowerde
20 Nov 2015
CSIRO HEALTH & BIOSECURITY
VariantSpark: applying Spark-based machine
learning methods to genomic information
By Tim Cooper
Talk Overview
VariantSpark| Denis C. Bauer @allPowerde2 |
• Background: Why is genomics so important for medicine
• VariantSpark: Overview
• Whole Genome Analysis: Clustering samples by ethnicity
Genome sequencing improves diagnostics
Genomic sequencing can lead to a successful diagnosis in up to 50% of
cases where traditional genetic testing failed and is on average 96% cheaper
Oncology
Identifying tumours by their genome-wide mutation profiles (figure label: tandem duplications)
Rare genetic disorders
Identifying causative mutations by interrogating all abnormal variants
http://matt.might.net
Bauer et al. Trends Mol Med. 2014 PMID: 24801560
Generating data from 1 Million Americans
THE PRECISION MEDICINE INITIATIVE
WHAT IS IT?
Precision medicine is an emerging approach for disease
prevention and treatment that takes into account people’s
individual variations in genes, environment, and lifestyle.
The Precision Medicine Initiative will generate the
scientific evidence needed to move the concept of
precision medicine into clinical practice.
WHY NOW?
The time is right because of:
• Sequencing of the human genome
• Improved technologies for biomedical analysis
• New tools for using large datasets
NEAR TERM GOALS
Intensify efforts to apply precision medicine to cancer.
• Innovative clinical trials of targeted drugs for adult, pediatric cancers
• Use of combination therapies
• Knowledge to overcome drug resistance
LONGER TERM GOALS
Create a research cohort of > 1 million American volunteers who will
share genetic data, biological samples, and diet/lifestyle information, all
linked to their electronic health records if they choose.
Pioneer a new model for doing science that emphasizes engaged
participants, responsible data sharing, and privacy protection.
Research based upon the cohort data will:
• Advance pharmacogenomics, the right drug for the right patient at the
right dose
• Identify new targets for treatment and prevention
Australia: ~ 100 Million dedicated to clinical genomics
• $25 Million Australian Genomics Health Alliances (NHMRC Grant AI Denis);
• VIC and QLD Alliances ($25 Million each); NSW and ACT (undisclosed $$ to Garvan and JCSMR)
Genomics projects are getting bigger
• Human genome: ~1 sample
• The HapMap Project: 270 samples (2002)
• 1000 Genomes Project: 1097 samples (2012)
• The Cancer Genome Atlas: 11,000 samples (2015)
• 100,000 Genomes Project: 70,000 individuals by 2017
• Project MinE: 15,000 people with ALS
• ASPREE: 4000 healthy 70+ year olds
Single samples are around 200GB in size
Data analysis categories for genomics
• Basic Production Informatics: produce raw sequence reads
• Advanced Production Informatics: map to genome and generate raw genomic features (e.g. SNPs)
• Bioinformatics Research: analyze the data; uncover the biological meaning
VariantSpark
VariantSpark is the interface enabling Spark’s MLlib* machine learning algorithms to be applied to genomics data
• Input: VCF
• Genomics application: VariantSpark on large-scale compute
• Result: e.g. grouping samples by genomic profile
* VariantSpark also uses Spark.ML
VariantSpark
Accepted BMC Genomics (IF=4)
Cluster individuals into ethnic groups based on
their genomic profiles
www.cloudaccess.eu
• Input: matrix of 1000 individuals x 40 million variants *
• Method: Kmeans
• Output: predicted super population
• Labels: 14 ethnic groups and 4 super populations
* VariantSpark can also process phase 3 data: 3000 individuals and 80 million variants
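The clustering step on this slide can be illustrated with a minimal k-means sketch in plain Python. This is not VariantSpark's code (the slides state that it relies on Spark's MLlib); the function names, the toy genotype matrix and the choice of `k=2` are all assumptions made for illustration only.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(samples, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns (cluster label per sample, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct samples chosen at random.
    centroids = [list(s) for s in rng.sample(samples, k)]
    assignments = [0] * len(samples)
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(s, centroids[c]))
            for s in samples
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [s for s, a in zip(samples, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Hypothetical toy data: 6 individuals x 5 variants, each value being a
# made-up count of alternate alleles (0, 1 or 2).
toy_genotypes = [
    [0, 0, 1, 2, 2],
    [0, 1, 1, 2, 2],
    [0, 0, 0, 2, 2],
    [2, 2, 1, 0, 0],
    [2, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
]

labels, centroids = kmeans(toy_genotypes, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; only the grouping of individuals matters, which is why the result is compared against known population labels with a label-independent score.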
Clustering result
• Adjusted Rand index (ARI) = 0.84, where 1 is a
perfect match and values near 0 indicate agreement
no better than independent (random) labeling
• Majority of American (AMR)
individuals being placed in the same
group as Europeans (EUR), likely
reflecting their migrational
backgrounds.
• ADMIXTURE (state-of-the-art tool for
population structure determination)
returns a low ARI of 0.25
Admixture: Alexander, D.H., Novembre, J., Lange, K.: Fast model-based estimation of ancestry
in unrelated individuals. Genome Res 19(9), 1655–1664 (2009)
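For readers unfamiliar with the adjusted Rand index quoted on this slide, the following is a small sketch of the standard pair-counting formula in plain Python. The function name and the example labels are assumptions for illustration; the slides do not say which implementation was used to obtain the reported values.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two labelings of the same length (at least 2 items)")
    # Pairs of items that share a group in both labelings, in the reference
    # labeling only, and in the predicted labeling only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    # Agreement expected by chance, and the maximum achievable agreement.
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single group
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical example: cluster numbers need not match the population names.
populations = ["EUR", "EUR", "AFR", "AFR"]
perfect = adjusted_rand_index(populations, [1, 1, 0, 0])
crossed = adjusted_rand_index(populations, [0, 1, 0, 1])
print(perfect, crossed)
```

Because the score only counts pairs of individuals, it is unaffected by how the clusters happen to be numbered.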
Comparison to other implementations
• Preprocessing: converting location-centric VCF genotypes into sample-centric numerical vectors
• Clustering: Kmeans
• ADAM (BigData Genomics): Spark implementation with dense matrix
• Hadoop: MapReduce without in-memory caching
[Figure: time in seconds (0 to 2000) by method (Python, R, Hadoop, ADAM, ADMIXTURE, VariantSpark), split by task (binary-conversion, clustering, pre-processing); bar labels: 103, 75, 29, 28, 18 and 4 min]
Chromosome 22; VM on Microsoft Azure with A7 Linux instance and
8 cores, 56GB memory running Ubuntu.
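The preprocessing step named in this comparison, turning location-centric VCF genotypes into sample-centric numerical vectors, can be sketched roughly as below. This is an illustrative assumption rather than VariantSpark's actual code: the encoding (count of non-reference alleles per genotype), the function names and the two toy records are all made up for the example.

```python
def genotype_to_number(genotype_field):
    """Count non-reference alleles in a GT value such as '0|1' or '1/1'."""
    gt = genotype_field.split(":")[0]          # GT is the first FORMAT sub-field here
    alleles = gt.replace("|", "/").split("/")  # treat phased and unphased alike
    return sum(1 for allele in alleles if allele not in ("0", "."))

def vcf_to_sample_vectors(vcf_lines):
    """Turn variant rows (one line per location) into one vector per sample."""
    sample_names = []
    vectors = {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample names follow the nine fixed VCF columns.
            sample_names = fields[9:]
            vectors = {name: [] for name in sample_names}
            continue
        for name, genotype in zip(sample_names, fields[9:]):
            vectors[name].append(genotype_to_number(genotype))
    return vectors

# Hypothetical two-variant, three-sample VCF (positions and samples are made up).
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "22\t16050075\t.\tA\tG\t100\tPASS\t.\tGT\t0|0\t0|1\t1|1",
    "22\t16050115\t.\tG\tA\t100\tPASS\t.\tGT\t0|1\t0|0\t1|0",
]

sample_vectors = vcf_to_sample_vectors(toy_vcf)
print(sample_vectors)
```

Each resulting vector has one entry per variant, so a clustering routine can treat every individual as a point in variant space.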
Scaling VariantSpark to the whole genome
• Pre-processing: scales seamlessly as processes are independent
• Clustering: memory consumption increases linearly with the number of variants (24GB) due to additional distance measurements between variants and k-means centroids
• As total memory was the limiting factor on our infrastructure, the number of simultaneously used nodes had to be reduced, increasing runtime.
[Figure: executors, memory and time plotted against number of variants (%), in two panels: pre-processing and clustering]
CSIRO Spark Cluster: Whole genome; Hadoop 2.5.0, managed by Cloudera’s CDH 5.
We use Spark 1.3.1. This 13 node cluster has a total of 416 cores and 1.22TB memory.
Three things to remember
• VariantSpark is an interface bringing bigLearning tasks to genomics applications
• VariantSpark can cluster 3000 individuals and 80 million variants in under 30 hours using minimal memory (24GB), a task that is not possible in R/Python/ADMIXTURE due to memory limits.
• VariantSpark outperforms ADAM (Big Data Genomics) and an equivalent Hadoop implementation by almost an order of magnitude.
https://github.com/BauerLab/VariantSpark
HEALTH AND BIOSECURITY
Thank you
Health & Biosecurity
Denis C. Bauer
t +61 2 9123 4567
e Denis.Bauer@csiro.au
w aehrc.com/biomedical-informatics/transformational-bioinformatics/
More talks online: http://www.slideshare.net/allPowerde
Twitter: @allPowerde
Aidan O’Brien
Bill Wilson
Transformational Bioinformatics Team, CSIRO
Former members
Firoz Anwar
Neil Saunders
Rodney Scott, Newcastle University
Funding: National Health and Medical Research Council; National Breast Cancer Foundation; CSIRO's Transformational Capability Platform; CSIRO’s IM&T; Science and Industry Endowment Fund
Buske et al., Bioinformatics Jan 2014
O’Brien et al., BMC Genomics Dec 2015
Dunne et al., in preparation
FullySIC
Epistatic Gene Network modelling, in preparation
Anwar et al., in preparation
Piotr Szul
Gi Guo
Robert Dunne
Data61 CSIRO, Australia
GOdistinct: GO Enrichment or genesets with distinctive function
Presentation title | Presenter name16 |

Más contenido relacionado

La actualidad más candente

A New molecular biology techniques for gene therapy
A New molecular biology techniques for gene therapyA New molecular biology techniques for gene therapy
A New molecular biology techniques for gene therapyVanessa Chappell
 
Pros and cons of Transgenic crops current scenario
Pros and cons of Transgenic crops current scenarioPros and cons of Transgenic crops current scenario
Pros and cons of Transgenic crops current scenarioManjunath R
 
Unveiling the Potential of your AAV Gene Therapy: Orthogonal methods to under...
Unveiling the Potential of your AAV Gene Therapy: Orthogonal methods to under...Unveiling the Potential of your AAV Gene Therapy: Orthogonal methods to under...
Unveiling the Potential of your AAV Gene Therapy: Orthogonal methods to under...Merck Life Sciences
 
Cloning Endangered Species
Cloning Endangered SpeciesCloning Endangered Species
Cloning Endangered SpeciesMorganScience
 
Ethical Issues of Xenotransplantation.ppt
Ethical Issues of Xenotransplantation.pptEthical Issues of Xenotransplantation.ppt
Ethical Issues of Xenotransplantation.pptsehikib
 
Stem Cells - Biology ppt slides
Stem Cells - Biology ppt slidesStem Cells - Biology ppt slides
Stem Cells - Biology ppt slidesnihattt
 
Status of GMOs crops in pakistan
Status of  GMOs crops in  pakistanStatus of  GMOs crops in  pakistan
Status of GMOs crops in pakistanMohsinMukhtar6
 
Protocol for systematic literature review
Protocol for systematic literature reviewProtocol for systematic literature review
Protocol for systematic literature reviewKhalid Mahmood
 
Biotechnology ppt by anila rani pullagura
Biotechnology ppt by anila rani pullaguraBiotechnology ppt by anila rani pullagura
Biotechnology ppt by anila rani pullaguraanilarani
 
AVENUES AND Careers IN BIOTECHNOLOGY
AVENUES AND Careers IN BIOTECHNOLOGYAVENUES AND Careers IN BIOTECHNOLOGY
AVENUES AND Careers IN BIOTECHNOLOGYMSCW Mysore
 
(New)introduction to biotechnology (1)
(New)introduction to biotechnology (1)(New)introduction to biotechnology (1)
(New)introduction to biotechnology (1)Sindhu Nathan
 
Biotechnology
Biotechnology Biotechnology
Biotechnology Shohrat266
 
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...VHIR Vall d’Hebron Institut de Recerca
 
Plant biotechnology and its impacts
Plant biotechnology and its impactsPlant biotechnology and its impacts
Plant biotechnology and its impactsdhabzzz
 
Synthetic biology
Synthetic biology Synthetic biology
Synthetic biology Elham Lasemi
 
Intellectual Property Rights for Kids
Intellectual Property Rights for KidsIntellectual Property Rights for Kids
Intellectual Property Rights for KidsJamil AlKhatib
 
Future biotechnology
Future biotechnologyFuture biotechnology
Future biotechnologyOmnia Mohamed
 
Introduction to biotechnology by qasim
Introduction to biotechnology by qasimIntroduction to biotechnology by qasim
Introduction to biotechnology by qasimqasim948
 

La actualidad más candente (20)

A New molecular biology techniques for gene therapy
A New molecular biology techniques for gene therapyA New molecular biology techniques for gene therapy
A New molecular biology techniques for gene therapy
 
Pros and cons of Transgenic crops current scenario
Pros and cons of Transgenic crops current scenarioPros and cons of Transgenic crops current scenario
Pros and cons of Transgenic crops current scenario
 
Animal biotechnology course
Animal biotechnology courseAnimal biotechnology course
Animal biotechnology course
 
Unveiling the Potential of your AAV Gene Therapy: Orthogonal methods to under...
Unveiling the Potential of your AAV Gene Therapy: Orthogonal methods to under...Unveiling the Potential of your AAV Gene Therapy: Orthogonal methods to under...
Unveiling the Potential of your AAV Gene Therapy: Orthogonal methods to under...
 
Cloning Endangered Species
Cloning Endangered SpeciesCloning Endangered Species
Cloning Endangered Species
 
Ethical Issues of Xenotransplantation.ppt
Ethical Issues of Xenotransplantation.pptEthical Issues of Xenotransplantation.ppt
Ethical Issues of Xenotransplantation.ppt
 
Stem Cells - Biology ppt slides
Stem Cells - Biology ppt slidesStem Cells - Biology ppt slides
Stem Cells - Biology ppt slides
 
Status of GMOs crops in pakistan
Status of  GMOs crops in  pakistanStatus of  GMOs crops in  pakistan
Status of GMOs crops in pakistan
 
Protocol for systematic literature review
Protocol for systematic literature reviewProtocol for systematic literature review
Protocol for systematic literature review
 
Biotechnology ppt by anila rani pullagura
Biotechnology ppt by anila rani pullaguraBiotechnology ppt by anila rani pullagura
Biotechnology ppt by anila rani pullagura
 
AVENUES AND Careers IN BIOTECHNOLOGY
AVENUES AND Careers IN BIOTECHNOLOGYAVENUES AND Careers IN BIOTECHNOLOGY
AVENUES AND Careers IN BIOTECHNOLOGY
 
(New)introduction to biotechnology (1)
(New)introduction to biotechnology (1)(New)introduction to biotechnology (1)
(New)introduction to biotechnology (1)
 
Biotechnology
Biotechnology Biotechnology
Biotechnology
 
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...
 
Plant biotechnology and its impacts
Plant biotechnology and its impactsPlant biotechnology and its impacts
Plant biotechnology and its impacts
 
Synthetic biology
Synthetic biology Synthetic biology
Synthetic biology
 
Intellectual Property Rights for Kids
Intellectual Property Rights for KidsIntellectual Property Rights for Kids
Intellectual Property Rights for Kids
 
Clinical Applications of Next Generation Sequencing
Clinical Applications of Next Generation SequencingClinical Applications of Next Generation Sequencing
Clinical Applications of Next Generation Sequencing
 
Future biotechnology
Future biotechnologyFuture biotechnology
Future biotechnology
 
Introduction to biotechnology by qasim
Introduction to biotechnology by qasimIntroduction to biotechnology by qasim
Introduction to biotechnology by qasim
 

Destacado

Qbi Centre for Brain genomics (Informatics side)
Qbi Centre for Brain genomics (Informatics side)Qbi Centre for Brain genomics (Informatics side)
Qbi Centre for Brain genomics (Informatics side)Denis C. Bauer
 
Towards Precision Medicine: Tute Genomics, a cloud-based application for anal...
Towards Precision Medicine: Tute Genomics, a cloud-based application for anal...Towards Precision Medicine: Tute Genomics, a cloud-based application for anal...
Towards Precision Medicine: Tute Genomics, a cloud-based application for anal...Reid Robison
 
Machine Learning in Modern Medicine with Erin LeDell at Stanford Med
Machine Learning in Modern Medicine with Erin LeDell at Stanford MedMachine Learning in Modern Medicine with Erin LeDell at Stanford Med
Machine Learning in Modern Medicine with Erin LeDell at Stanford MedSri Ambati
 
Deep Machine Learning for Making Sense of Biotech Data - From Clean Energy to...
Deep Machine Learning for Making Sense of Biotech Data - From Clean Energy to...Deep Machine Learning for Making Sense of Biotech Data - From Clean Energy to...
Deep Machine Learning for Making Sense of Biotech Data - From Clean Energy to...Wesley De Neve
 
The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian ...
The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian ...The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian ...
The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian ...The Hive
 

Destacado (7)

Qbi Centre for Brain genomics (Informatics side)
Qbi Centre for Brain genomics (Informatics side)Qbi Centre for Brain genomics (Informatics side)
Qbi Centre for Brain genomics (Informatics side)
 
Towards Precision Medicine: Tute Genomics, a cloud-based application for anal...
Towards Precision Medicine: Tute Genomics, a cloud-based application for anal...Towards Precision Medicine: Tute Genomics, a cloud-based application for anal...
Towards Precision Medicine: Tute Genomics, a cloud-based application for anal...
 
Machine Learning in Modern Medicine with Erin LeDell at Stanford Med
Machine Learning in Modern Medicine with Erin LeDell at Stanford MedMachine Learning in Modern Medicine with Erin LeDell at Stanford Med
Machine Learning in Modern Medicine with Erin LeDell at Stanford Med
 
Deep Machine Learning for Making Sense of Biotech Data - From Clean Energy to...
Deep Machine Learning for Making Sense of Biotech Data - From Clean Energy to...Deep Machine Learning for Making Sense of Biotech Data - From Clean Energy to...
Deep Machine Learning for Making Sense of Biotech Data - From Clean Energy to...
 
Gene Silencing
Gene SilencingGene Silencing
Gene Silencing
 
The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian ...
The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian ...The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian ...
The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian ...
 
Machine learning
Machine learningMachine learning
Machine learning
 

Similar a VariantSpark: applying Spark-based machine learning methods to genomic information

How novel compute technology transforms life science research
How novel compute technology transforms life science researchHow novel compute technology transforms life science research
How novel compute technology transforms life science researchDenis C. Bauer
 
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...Golden Helix Inc
 
Customer Case Study: How Novel Compute Technology Transforms Medical and Life...
Customer Case Study: How Novel Compute Technology Transforms Medical and Life...Customer Case Study: How Novel Compute Technology Transforms Medical and Life...
Customer Case Study: How Novel Compute Technology Transforms Medical and Life...Amazon Web Services
 
Pydata London January 2017
Pydata London January 2017Pydata London January 2017
Pydata London January 2017Edward Perello
 
PyData London January 2017
PyData London January 2017PyData London January 2017
PyData London January 2017Edward Perello
 
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...Amazon Web Services
 
Bioinformatics Data Pipelines built by CSIRO on AWS
Bioinformatics Data Pipelines built by CSIRO on AWSBioinformatics Data Pipelines built by CSIRO on AWS
Bioinformatics Data Pipelines built by CSIRO on AWSLynn Langit
 
VarSeq 2.6.0: Advancing Pharmacogenomics and Genomic Analysis
VarSeq 2.6.0: Advancing Pharmacogenomics and Genomic AnalysisVarSeq 2.6.0: Advancing Pharmacogenomics and Genomic Analysis
VarSeq 2.6.0: Advancing Pharmacogenomics and Genomic AnalysisGolden Helix
 
Platforms CIBERER and INB-ELIXIR-es
Platforms CIBERER and INB-ELIXIR-esPlatforms CIBERER and INB-ELIXIR-es
Platforms CIBERER and INB-ELIXIR-esJoaquin Dopazo
 
Jax bio dataworldcongress.ngs.20181128finalwithoutbu
Jax bio dataworldcongress.ngs.20181128finalwithoutbuJax bio dataworldcongress.ngs.20181128finalwithoutbu
Jax bio dataworldcongress.ngs.20181128finalwithoutbuAnne Deslattes Mays
 
Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009Ian Foster
 
Affymetrix OncoScan®* data analysis with Nexus Copy Number™
Affymetrix OncoScan®* data analysis with Nexus Copy Number™Affymetrix OncoScan®* data analysis with Nexus Copy Number™
Affymetrix OncoScan®* data analysis with Nexus Copy Number™Affymetrix
 
Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global communityExternalEvents
 
Big data solution for ngs data analysis
Big data solution for ngs data analysisBig data solution for ngs data analysis
Big data solution for ngs data analysisYun Lung Li
 
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...Larry Smarr
 
Open Source Networking Solving Molecular Analysis of Cancer
Open Source Networking Solving Molecular Analysis of CancerOpen Source Networking Solving Molecular Analysis of Cancer
Open Source Networking Solving Molecular Analysis of CancerOpen Networking Summit
 

Similar a VariantSpark: applying Spark-based machine learning methods to genomic information (20)

How novel compute technology transforms life science research
How novel compute technology transforms life science researchHow novel compute technology transforms life science research
How novel compute technology transforms life science research
 
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
 
2023 GIAB AMP Update
2023 GIAB AMP Update2023 GIAB AMP Update
2023 GIAB AMP Update
 
Customer Case Study: How Novel Compute Technology Transforms Medical and Life...
Customer Case Study: How Novel Compute Technology Transforms Medical and Life...Customer Case Study: How Novel Compute Technology Transforms Medical and Life...
Customer Case Study: How Novel Compute Technology Transforms Medical and Life...
 
Pydata London January 2017
Pydata London January 2017Pydata London January 2017
Pydata London January 2017
 
PyData London January 2017
PyData London January 2017PyData London January 2017
PyData London January 2017
 
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
 
Data sharing and analysis
Data sharing and analysisData sharing and analysis
Data sharing and analysis
 
Bioinformatics Data Pipelines built by CSIRO on AWS
Bioinformatics Data Pipelines built by CSIRO on AWSBioinformatics Data Pipelines built by CSIRO on AWS
Bioinformatics Data Pipelines built by CSIRO on AWS
 
VarSeq 2.6.0: Advancing Pharmacogenomics and Genomic Analysis
VarSeq 2.6.0: Advancing Pharmacogenomics and Genomic AnalysisVarSeq 2.6.0: Advancing Pharmacogenomics and Genomic Analysis
VarSeq 2.6.0: Advancing Pharmacogenomics and Genomic Analysis
 
Platforms CIBERER and INB-ELIXIR-es
Platforms CIBERER and INB-ELIXIR-esPlatforms CIBERER and INB-ELIXIR-es
Platforms CIBERER and INB-ELIXIR-es
 
Jax bio dataworldcongress.ngs.20181128finalwithoutbu
Jax bio dataworldcongress.ngs.20181128finalwithoutbuJax bio dataworldcongress.ngs.20181128finalwithoutbu
Jax bio dataworldcongress.ngs.20181128finalwithoutbu
 
Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009
 
Affymetrix OncoScan®* data analysis with Nexus Copy Number™
Affymetrix OncoScan®* data analysis with Nexus Copy Number™Affymetrix OncoScan®* data analysis with Nexus Copy Number™
Affymetrix OncoScan®* data analysis with Nexus Copy Number™
 
Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global community
 
Big data solution for ngs data analysis
Big data solution for ngs data analysisBig data solution for ngs data analysis
Big data solution for ngs data analysis
 
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...
 
Ashg presentation 2010
Ashg presentation 2010Ashg presentation 2010
Ashg presentation 2010
 
BioData World Basel 2018
BioData World Basel 2018BioData World Basel 2018
BioData World Basel 2018
 
Open Source Networking Solving Molecular Analysis of Cancer
Open Source Networking Solving Molecular Analysis of CancerOpen Source Networking Solving Molecular Analysis of Cancer
Open Source Networking Solving Molecular Analysis of Cancer
 

Más de Denis C. Bauer

Cloud-native machine learning - Transforming bioinformatics research
Cloud-native machine learning - Transforming bioinformatics research Cloud-native machine learning - Transforming bioinformatics research
Cloud-native machine learning - Transforming bioinformatics research Denis C. Bauer
 
Translating genomics into clinical practice - 2018 AWS summit keynote
Translating genomics into clinical practice - 2018 AWS summit keynoteTranslating genomics into clinical practice - 2018 AWS summit keynote
Translating genomics into clinical practice - 2018 AWS summit keynoteDenis C. Bauer
 
Going Server-less for Web-Services that need to Crunch Large Volumes of Data
Going Server-less for Web-Services that need to Crunch Large Volumes of DataGoing Server-less for Web-Services that need to Crunch Large Volumes of Data
Going Server-less for Web-Services that need to Crunch Large Volumes of DataDenis C. Bauer
 
How novel compute technology transforms life science research
How novel compute technology transforms life science researchHow novel compute technology transforms life science research
How novel compute technology transforms life science researchDenis C. Bauer
 
Population-scale high-throughput sequencing data analysis
Population-scale high-throughput sequencing data analysisPopulation-scale high-throughput sequencing data analysis
Population-scale high-throughput sequencing data analysisDenis C. Bauer
 
Allelic Imbalance for Pre-capture Whole Exome Sequencing
Allelic Imbalance for Pre-capture Whole Exome SequencingAllelic Imbalance for Pre-capture Whole Exome Sequencing
Allelic Imbalance for Pre-capture Whole Exome SequencingDenis C. Bauer
 
Centralizing sequence analysis
Centralizing sequence analysisCentralizing sequence analysis
Centralizing sequence analysisDenis C. Bauer
 
Differential gene expression
Differential gene expressionDifferential gene expression
Differential gene expressionDenis C. Bauer
 
Transcript detection in RNAseq
Transcript detection in RNAseqTranscript detection in RNAseq
Transcript detection in RNAseqDenis C. Bauer
 
Functionally annotate genomic variants
Functionally annotate genomic variantsFunctionally annotate genomic variants
Functionally annotate genomic variantsDenis C. Bauer
 
Variant (SNPs/Indels) calling in DNA sequences, Part 2
Variant (SNPs/Indels) calling in DNA sequences, Part 2Variant (SNPs/Indels) calling in DNA sequences, Part 2
Variant (SNPs/Indels) calling in DNA sequences, Part 2Denis C. Bauer
 
Variant (SNPs/Indels) calling in DNA sequences, Part 1
Variant (SNPs/Indels) calling in DNA sequences, Part 1 Variant (SNPs/Indels) calling in DNA sequences, Part 1
Variant (SNPs/Indels) calling in DNA sequences, Part 1 Denis C. Bauer
 
Introduction to second generation sequencing
Introduction to second generation sequencingIntroduction to second generation sequencing
Introduction to second generation sequencingDenis C. Bauer
 
Introduction to Bioinformatics
Introduction to BioinformaticsIntroduction to Bioinformatics
Introduction to BioinformaticsDenis C. Bauer
 
The missing data issue for HiSeq runs
The missing data issue for HiSeq runsThe missing data issue for HiSeq runs
The missing data issue for HiSeq runsDenis C. Bauer
 
Deciphering the regulatory code in the genome
Deciphering the regulatory code in the genomeDeciphering the regulatory code in the genome
Deciphering the regulatory code in the genomeDenis C. Bauer
 
STAR: Recombination site prediction
STAR: Recombination site predictionSTAR: Recombination site prediction
STAR: Recombination site predictionDenis C. Bauer
 
SUMOylation site prediction
SUMOylation site predictionSUMOylation site prediction
SUMOylation site predictionDenis C. Bauer
 

Más de Denis C. Bauer (20)

Cloud-native machine learning - Transforming bioinformatics research
Cloud-native machine learning - Transforming bioinformatics research Cloud-native machine learning - Transforming bioinformatics research
Cloud-native machine learning - Transforming bioinformatics research
 
Translating genomics into clinical practice - 2018 AWS summit keynote
Translating genomics into clinical practice - 2018 AWS summit keynoteTranslating genomics into clinical practice - 2018 AWS summit keynote
Translating genomics into clinical practice - 2018 AWS summit keynote
 
Going Server-less for Web-Services that need to Crunch Large Volumes of Data
Going Server-less for Web-Services that need to Crunch Large Volumes of DataGoing Server-less for Web-Services that need to Crunch Large Volumes of Data
Going Server-less for Web-Services that need to Crunch Large Volumes of Data
 
How novel compute technology transforms life science research
How novel compute technology transforms life science researchHow novel compute technology transforms life science research
How novel compute technology transforms life science research
 
Population-scale high-throughput sequencing data analysis
Population-scale high-throughput sequencing data analysisPopulation-scale high-throughput sequencing data analysis
VariantSpark: applying Spark-based machine learning methods to genomic information

  • 1. VariantSpark: applying Spark-based machine learning methods to genomic information. Denis C. Bauer | Bioinformatics | @allPowerde | 20 Nov 2015 | CSIRO Health & Biosecurity. By Tim Cooper.
  • 2. Talk Overview
      • Background: Why is genomics so important for medicine?
      • VariantSpark: Overview
      • Whole Genome Analysis: Clustering samples by ethnicity
  • 3. Genome sequencing improves diagnostics. Genomic sequencing can lead to a successful diagnosis in up to 50% of cases where traditional genetic testing failed, and is on average 96% cheaper.
      • Oncology: identifying tumours by their genome-wide mutation profiles (figure label: tandem duplications).
      • Rare genetic disorders: identifying causative mutations by interrogating all abnormal variants (http://matt.might.net).
      • Reference: Bauer et al. Trends Mol Med. 2014 PMID: 24801560
  • 4. Generating data from 1 Million Americans: The Precision Medicine Initiative
      • What is it? Precision medicine is an emerging approach for disease prevention and treatment that takes into account people’s individual variations in genes, environment, and lifestyle. The Precision Medicine Initiative will generate the scientific evidence needed to move the concept of precision medicine into clinical practice.
      • Why now? The time is right because of: sequencing of the human genome; improved technologies for biomedical analysis; new tools for using large datasets.
      • Near-term goals: Intensify efforts to apply precision medicine to cancer: innovative clinical trials of targeted drugs for adult, pediatric cancers; use of combination therapies; knowledge to overcome drug resistance.
      • Longer-term goals: Create a research cohort of > 1 million American volunteers who will share genetic data, biological samples, and diet/lifestyle information, all linked to their electronic health records if they choose. Pioneer a new model for doing science that emphasizes engaged participants, responsible data sharing, and privacy protection.
      • Research based upon the cohort data will: advance pharmacogenomics (the right drug for the right patient at the right dose); identify new targets for treatment and prevention.
      • Australia: ~100 Million dedicated to clinical genomics: $25 Million Australian Genomics Health Alliances (NHMRC Grant, AI Denis); VIC and QLD Alliances ($25 Million each); NSW and ACT (undisclosed $$ to Garvan and JCSMR).
  • 5. Genomics projects are getting bigger
      • Human genome: ~1 sample
      • The HapMap Project: 270 samples, 2002
      • 1000 Genome Project: 1097 samples, 2012
      • The Cancer Genome Atlas: 11,000 samples, 2015
      • 100,000 Genomes project: 70,000 individuals by 2017
      • Project MinE: 15,000 people with ALS
      • ASPREE: 4000 healthy 70+ year olds
      • Single samples are around 200GB in size.
  • 7. Data analysis categories for genomics
      • Basic Production Informatics: produce raw sequence reads.
      • Advanced Production Informatics: map to genome and generate raw genomic features (e.g. SNPs).
      • Bioinformatics Research: analyze the data; uncover the biological meaning.
  • 8. VariantSpark. VariantSpark is the interface enabling Spark’s MLlib* machine learning algorithms to be applied to genomics data, e.g. grouping samples by genomic profile.
      • Diagram labels: Input (VCF); VariantSpark; MLlib*; Large-scale compute; Genomics application; Result.
      • * VariantSpark also uses Spark.ML
  • 9. VariantSpark. Accepted: BMC Genomics (IF=4)
  • 10. Cluster individuals into ethnic groups based on their genomic profiles
      • Pipeline: 1000 x 40 Million variants matrix* → K-means → predict super population.
      • 14 ethnic groups and 4 super populations (image: www.cloudaccess.eu).
      • * VariantSpark can also process phase 3 data: 3000 individuals and 80 million variants
  • 11. Clustering result
      • Adjusted Rand index (ARI) = 0.84, with -1 (independent labeling) and 1 (perfect match).
      • The majority of American (AMR) individuals are placed in the same group as Europeans (EUR), likely reflecting their migrational backgrounds.
      • ADMIXTURE (state-of-the-art tool for population structure determination) returns a low ARI of 0.25.
      • ADMIXTURE reference: Alexander, D.H., Novembre, J., Lange, K.: Fast model-based estimation of ancestry in unrelated individuals. Genome Res 19(9), 1655–1664 (2009)
  • 12. Comparison to other implementations
      • Preprocessing: converting location-centric VCF genotypes into sample-centric numerical vectors
      • Clustering: K-means
      • ADAM (BigData Genomics): Spark implementation with dense matrix
      • Hadoop: MapReduce without in-memory caching
      • [Figure: bar chart of time in seconds per method (Python, R, Hadoop, ADAM, ADMIXTURE, VariantSpark), split by task (binary-conversion, clustering, pre-processing). Bar labels as extracted: 103, 75, 29, 28, 18, 4 min.]
      • Setup: Chromosome 22; VM on Microsoft Azure with A7 Linux instance and 8 cores, 56GB memory running Ubuntu.
  • 13. Scaling VariantSpark to the whole genome
      • Pre-processing: scales seamlessly as processes are independent.
      • Clustering: memory consumption increases linearly with the number of variants (24GB) due to additional distance measurements between variants and k-means centroids.
      • As total memory was the limiting factor on our infrastructure, the number of simultaneously used nodes had to be reduced, increasing runtime.
      • [Figure: two panels (pre-processing, clustering) plotting executors, memory and time against number of variants (%).]
      • Setup: CSIRO Spark Cluster, whole genome; Hadoop 2.5.0, managed by Cloudera’s CDH 5. We use Spark 1.3.1. This 13-node cluster has a total of 416 cores and 1.22TB memory.
  • 14. Three things to remember
      • VariantSpark is an interface bringing bigLearning tasks to genomics applications.
      • VariantSpark can cluster 3000 individuals and 80 million variants in under 30 hours using minimal memory (24GB), a task that is not possible in R/Python/ADMIXTURE due to memory limits.
      • VariantSpark outperforms ADAM (Big Data Genomics) and an equivalent Hadoop implementation by almost an order of magnitude.
      • https://github.com/BauerLab/VariantSpark
  • 15. Thank you (Health & Biosecurity)
      • Contact: Denis C. Bauer; t +61 2 9123 4567; e Denis.Bauer@csiro.au; w aehrc.com/biomedical-informatics/transformational-bioinformatics/
      • More talks online: http://www.slideshare.net/allPowerde; Twitter: @allPowerde
      • People and affiliations, in slide order: Aidan O’Brien, Bill Wilson; Transformational Bioinformatics Team, CSIRO; Former members; Firoz Anwar, Neil Saunders, Rodney Scott; Newcastle University; Piotr Szul, Gi Guo, Robert Dunne; Data61; CSIRO, Australia
      • Funding: National Health and Medical Research Council; National Breast Cancer Foundation; CSIRO’s Transformational Capability Platform; CSIRO’s IM&T; Science and Industry Endowment Fund
      • Publications and projects, in slide order: Buske et al., Bioinformatics Jan 2014; O’Brien et al., BMC Genomics Dec 2015; Dunne et al., in preparation; FullySIC Epistatic Gene Network modelling, in preparation; Anwar et al., in preparation; GOdistinct GO Enrichment or genesets with distinctive function
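Slides 10 and 12 describe the pipeline as two steps: turn location-centric VCF genotypes into sample-centric numerical vectors, then cluster the samples with k-means. The sketch below illustrates that idea on a toy scale in plain Python. It is not VariantSpark's code and does not use Spark: the genotype encoding (number of non-reference alleles), the function names, the toy data, and the deterministic farthest-first centroid initialisation are all assumptions made for illustration.

```python
def genotype_to_count(gt_field):
    """Encode a VCF sample field as its number of non-reference alleles (assumed encoding)."""
    alleles = gt_field.split(":")[0].replace("|", "/").split("/")
    return sum(1 for a in alleles if a not in ("0", "."))

def vcf_to_sample_vectors(lines):
    """Transpose variant-per-row VCF text into one numeric vector per sample."""
    samples, vectors = [], {}
    for line in lines:
        if line.startswith("##"):          # meta-information lines
            continue
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):      # header: sample names start at column 10
            samples = fields[9:]
            vectors = {s: [] for s in samples}
            continue
        for sample, gt in zip(samples, fields[9:]):
            vectors[sample].append(genotype_to_count(gt))
    return vectors

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Deterministic initialisation: start at the first point, then repeatedly
    add the point farthest from its nearest chosen centroid."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm: assign to nearest centroid, then recompute means."""
    centroids = farthest_first(points, k)
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy VCF with three variants and four hypothetical samples A-D.
HEADER = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
toy_vcf = [
    "##fileformat=VCFv4.1",
    "\t".join(HEADER + ["A", "B", "C", "D"]),
    "\t".join(["22", "100", ".", "A", "G", ".", "PASS", ".", "GT", "0/0", "0/0", "1/1", "1|1"]),
    "\t".join(["22", "200", ".", "C", "T", ".", "PASS", ".", "GT", "0/0", "0/1", "1/1", "0/1"]),
    "\t".join(["22", "300", ".", "G", "A", ".", "PASS", ".", "GT", "0/1", "0/0", "0/1", "1/1"]),
]

vectors = vcf_to_sample_vectors(toy_vcf)
names = list(vectors)
labels = kmeans([vectors[n] for n in names], k=2)
print(dict(zip(names, labels)))
```

On this toy input, samples A and B end up in one cluster and C and D in the other. At the scale quoted on the slides, the same two steps would be distributed across a cluster, and a sparse vector representation would matter far more than it does here.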

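Slide 11 scores the clustering with the adjusted Rand index (ARI). As a minimal sketch of how such a score can be computed, the function below follows the standard Hubert and Arabie pair-counting definition; the function name and the toy label lists are assumptions, and the slides do not say which implementation was used. Under this definition identical partitions score 1, random labelings score about 0 on average, and the value can go negative, so the lower end quoted on the slide is best read loosely.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from pair counts in the contingency table."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_rows = sum(comb(v, 2) for v in rows.values())
    sum_cols = sum(comb(v, 2) for v in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:                     # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Same grouping with renamed cluster ids: a perfect match.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Every true group split evenly across predicted groups: worse than chance.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

The score depends only on which samples are grouped together, not on the cluster ids themselves, which is why it suits comparing k-means output against population labels.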
Editor's notes

  1. http://www.nature.com/nature/journal/v462/n7276/fig_tab/nature08645_F1.html Bauer et al. Trends Mol Med. 2014 PMID: 24801560.
  2. http://www.nejm.org/doi/full/10.1056/NEJMp1500523?query=featured_home
  3. http://www.cloudaccess.eu/blog/wp-content/uploads/2014/06/genetic_roots.png