SlideShare a Scribd company logo
1 of 27
Download to read offline
FINDING NEEDLES IN GENOMIC
HAYSTACKS WITH “WIDE”
RANDOM FOREST
Piotr Szul
CSIRO Data61
Which needle is the right one?
All humans carry between 200 to
800 mutation that disrupt the
function of a gene.
The human genome is 3 billion letters long.
Finding genetic underpinnings fordiseases and phenotypic traits
What are the biologicalmechanism?
Who is at risk for a disease?
How to prevent and treat?
5319
talented
staff
$1billion+	
budget
Working
with over
2800+
industry
partners
55
sites	across	
Australia
Top 1%
of global
research
agencies
Each year
6 CSIRO
technologies
contribute
$5 billion to
the economy
Agenda
• Intro to Genome Wide Association Studies
• Variant Spark and “Cursed Forest”
• GWAS use-cases
Genome Wide Association Studies
imagecourtesyofPasieka
SciencePhotoLibrary
1000+ samples
Relatively common > 1%
~ 500,000 SNPs
Look at the data
Typical GWAS: 1M variants x 5K samples
Full genome: 80M variants x 2.5K samples
0 1 0 … 1
1 1 1 … 1
0 0 0 … 0
0 0 1 … 1
0 1 1 … 1
0 0 0 … 0
1 2 0 … 0
.........
.........
0 0 0 … 2
1 2 0 … 0
samples (103)
variants(106)
0 1 0 0 0 0 1 ... 0 1
1 1 0 0 1 0 2 ... 0 2
0 1 0 1 1 0 0 ... 0 0
.....................
1 1 0 1 1 0 0 ... 2 0
variants x samples
transpose
D
N
D
.
N
1 x samples
predictors response
associate
0
10,000
20,000
30,000
40,000
50,000
100,000 1,000,000 10,000,000 100,000,000
Studies 1000 Genomes
samples
variants
GWAS
0
2000
4000
6000
8000
10000
12000
2008 2009 2010 2011 2012 2013 2014 2015
GWAS Studies
Associations Studies
2713 studies
31183 associations
Hirota at al (2012) Genome-wide association study identifies eight new
susceptibility loci for atopic dermatitis in the Japanese population The NHGRI-EBI Catalog of published genome-
wide association studies
Missing Heritability
Manolio et al. (2009) Finding the missing heritability of complex diseases
… human height heritability is ~80% yet more
that 40 associated loci explain only about5%
of phenotypic variance …
“Dark matter” of
genomics
Epistasis
Traditional approach for interaction modeling ’squares’ the problem size
500,000 SNPs à ~100,000,000,000 pairs
Random Forest to the rescue
Lunetta et al. (2004) Screening large-scale association study data:
exploiting interactions using random forests
Breiman (2001) Random Forests.
Machine Learning
Random Forest in GWAS
• Non-parametric and arbitrarily expressive
• Insensitive to outliers and non-informative predictors
• Stable performance – no overfitting
• Easy to tune
• Built in error estimate (OOB error)
• Variable importance measures
• Ability to deal with heterogeneous data
• Easy to parallelize and scale on HPC
Sun (2010) Multigenic Modeling of Complex Disease by Random Forests
RF is an appropriate candidate to capture the genetic heterogeneity
underlying the trait because RF itself is an ensemble of many heterogeneous
trees built from uncorrelated subsamples of the original data
VariantSpark
0
1000
2000
P
ython
R
H
adoop
A
dam
A
D
M
IX
TU
R
E
VariantS
park
method
timeinseconds
task
binary−conversion
clustering
pre−processing
It	can	cluster	3000	individuals	and 80	million	
variants
O’Brien et al. (2015) VariantSpark: population scale clustering of genotype information
Natalie TwineDenis Bauer Oscar Luo Rob Dunne Piotr Szul
Transformational Bioinformatics Team
Aidan O’BrienLaurence Wilson
Software
Open source (MIT) @ https://github.com/csirobigdata/variant-spark
Random Forest SparkML
Firas Abuzaid (Spark Summit 2016) YGGDRASIL: Faster Decision Trees Column Partitioning in SPARK
• Failing for millions of variables
• Relatively slow
“Cursed Forest”
broadcast
aggregate
1
2,1 2,2
Executors
v1
v2
v3v3v3
vn
…
var, point
local best
split
var1, point1
var21, point21 var22, point22
global
best split
…
initial sample
split subsets
Driver
Partition data by variables
(columns)
• Columns are “small” –
easy partition
• An executor can find (an
exact) best split for many
variables
• Finding globalbest split
is efficient
Some implementation tricks
• ”Native” data shape
– VCF files are organized by ”variables”
• Building by levels and tree batching
– Minimize communication overhead and the number of stages
• Optimized split finding for ordered factors
– The most frequent operation
– Java implementation faster then Scala
• Choice of data representation
– byte representation for variant data
– with sparsity 0.75 a sparse vector 3x bigger than a byte array
How fast it is? 16 CPU cores 32GB RAM
local mode
Big data performance
• Yarn Cluster (12 workers)
– 16 x IntelXeon E5-2660@2.20GHzCPU
– 128 GB of RAM
• Spark 1.6.1 on YARN
– 128 executors
– 6GB / executor (0.75TB)
• Synthetic dataset(mtry = 0.25)
Typical
GWAS
Range
100K trees: 5 – 50h
AWS: ~$215.50
Whole
Genome
Range
100K trees: 200 – 2000h
AWS: ~ $ 8620.00
50M variable x 10k samples!
Other features
• Various input formats
– VCF, CSV, parquet
• A variety of RF (fine) tuning
parameters
– Sampling
– Depths
– Splitting
• Insight into RF model
– cumulative OOB error
– per tree variable importance
– per tree OOB predictions
Simulated Data Study
• Synthetic dataset of 2.5M variables and 5000 samples
• 5 informative variables with dichotomous response
• Compare RF importance ranking with the model
• Rank-biased overlap (RBO) – measure of ranking
overlap (with emphasis of highly ranked elements)
RBO
0
0.5
1
1.5
w_1 w_2 w_3 w_4 w_5
Bone Mineral Density Study
• Osteoporotic fracture is a leading cause of morbidity
and mortality particularly amongst the elderly.
• In 2004 ten millionAmericans were estimated to have
osteoporosis, resulting in 1.5 million fractures per
annum.
• Hip fracture is associated with a one year mortality
rate of 36% in men and 21% in women
Burden of disease of osteoporotic fractures overall is
similar to that of colorectal cancer and greater than that of
hypertension and breast cancer
Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel
Genes Affecting BoneMineral Density and Fracture Risk.
Bone Mineral Density Study
• 2036 samples & 288,768
SNPs
• Replicates 21 of 26 known
associated genes
• Identifies 2 novel loci (known
association with BMD)
• Provides strong evidence for
further 4 loci
Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel
Genes Affecting BoneMineral Density and Fracture Risk.
BMD - VariantSpark Results
Known BMD locations have
significantly higher ranking
(Mann-Whitney U, p = 1.3e-7)
A few novel highly ranked
locations with plausible
association with BMD: COLEC10,
PRODH
Not replicated DCDC5 ranked
9,667 out of 10,000
Future work and directions
Techchnical
Compare and
’merge’ with
yggdrasil
Deployment on
cloud platforms
Further
performance
improvements
Functional
Implementation
of cutting edge
research
Integration
within genomics
platforms
(GATK4)
More ML
algorithms
Research Applications
Data science
research Gradient Boosted
Trees
References
1. Aleesha Bates (2016) Practical aspects of GWASAssociation studies under statistical
genetics and GenABEL hands-on tutorial
2. Hirota at al (2012) Genome-wide association study identifies eight new susceptibility loci
for atopic dermatitis in the Japanese population
3. The NHGRI-EBI Catalog of published genome-wide association studies
4. Manolio et al. (2009) Finding the missing heritability of complex diseases
5. Lunetta et al. (2004) Screening large-scale association study data: exploiting interactions
using random forests
6. Breiman (2004) Random Forests. Machine Learning
7. Sun (2010) Multigenic Modeling of Complex Disease by Random Forests
8. Danecek et al. The Variant Call Format and VCFtools
9. O’Brien et al. (2015) VariantSpark: population scale clustering of genotype information
10. Firas Abuzaid (Spark Summit 2016) YGGDRASIL: Faster Decision Trees Column
Partitioning in SPARK
11. Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection
Identifies Novel Genes Affecting Bone Mineral Density and Fracture Risk.
Conclusions
Apache Spark is a feasible platform machine learning
in population scale genomics.
VariantSpark with CursedForest is a promising
alternative for traditional GWAS approaches.
Data shape, type, etc. matter – different optimizations
are needed.
Thank You
Email: piotr.szul@data61.csiro.au
Github: https://github.com/csirobigdata/variant-spark

More Related Content

What's hot

Managing Genomics Data at the Sanger Institute
Managing Genomics Data at the Sanger InstituteManaging Genomics Data at the Sanger Institute
Managing Genomics Data at the Sanger Instituteinside-BigData.com
 
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67...
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67...Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67...
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67...GigaScience, BGI Hong Kong
 
Analyzing Perturbed Co-Expression Networks in Cancer Using a Graph Database
Analyzing Perturbed Co-Expression Networks in Cancer Using a Graph DatabaseAnalyzing Perturbed Co-Expression Networks in Cancer Using a Graph Database
Analyzing Perturbed Co-Expression Networks in Cancer Using a Graph DatabaseNeo4j
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesGuy Coates
 
Next-generation sequencing: Data mangement
Next-generation sequencing: Data mangementNext-generation sequencing: Data mangement
Next-generation sequencing: Data mangementGuy Coates
 
Drug Repurposing using Deep Learning on Knowledge Graphs
Drug Repurposing using Deep Learning on Knowledge GraphsDrug Repurposing using Deep Learning on Knowledge Graphs
Drug Repurposing using Deep Learning on Knowledge GraphsDatabricks
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
Emerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsEmerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsmikaelhuss
 
Metagenomic Data Provenance and Management using the ISA infrastructure --- o...
Metagenomic Data Provenance and Management using the ISA infrastructure --- o...Metagenomic Data Provenance and Management using the ISA infrastructure --- o...
Metagenomic Data Provenance and Management using the ISA infrastructure --- o...Alejandra Gonzalez-Beltran
 
Life sciences big data use cases
Life sciences big data use casesLife sciences big data use cases
Life sciences big data use casesGuy Coates
 
Advanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven ResearchAdvanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven ResearchEuropean Bioinformatics Institute
 
AWS Customer Presentation- University of Maryland
AWS Customer Presentation- University of MarylandAWS Customer Presentation- University of Maryland
AWS Customer Presentation- University of MarylandAmazon Web Services
 
Fostering Serendipity through Big Linked Data
Fostering Serendipity through Big Linked DataFostering Serendipity through Big Linked Data
Fostering Serendipity through Big Linked DataMuhammad Saleem
 
Martin Ringwald, Mouse Gene Expression DB, fged_seattle_2013
Martin Ringwald, Mouse Gene Expression DB, fged_seattle_2013Martin Ringwald, Mouse Gene Expression DB, fged_seattle_2013
Martin Ringwald, Mouse Gene Expression DB, fged_seattle_2013Functional Genomics Data Society
 
HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017philippbayer
 

What's hot (20)

2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
Managing Genomics Data at the Sanger Institute
Managing Genomics Data at the Sanger InstituteManaging Genomics Data at the Sanger Institute
Managing Genomics Data at the Sanger Institute
 
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67...
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67...Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67...
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67...
 
Analyzing Perturbed Co-Expression Networks in Cancer Using a Graph Database
Analyzing Perturbed Co-Expression Networks in Cancer Using a Graph DatabaseAnalyzing Perturbed Co-Expression Networks in Cancer Using a Graph Database
Analyzing Perturbed Co-Expression Networks in Cancer Using a Graph Database
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciences
 
Next-generation sequencing: Data mangement
Next-generation sequencing: Data mangementNext-generation sequencing: Data mangement
Next-generation sequencing: Data mangement
 
Drug Repurposing using Deep Learning on Knowledge Graphs
Drug Repurposing using Deep Learning on Knowledge GraphsDrug Repurposing using Deep Learning on Knowledge Graphs
Drug Repurposing using Deep Learning on Knowledge Graphs
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
Michael Reich, GenomeSpace Workshop, fged_seattle_2013
Michael Reich, GenomeSpace Workshop, fged_seattle_2013Michael Reich, GenomeSpace Workshop, fged_seattle_2013
Michael Reich, GenomeSpace Workshop, fged_seattle_2013
 
Emerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsEmerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomics
 
Metagenomic Data Provenance and Management using the ISA infrastructure --- o...
Metagenomic Data Provenance and Management using the ISA infrastructure --- o...Metagenomic Data Provenance and Management using the ISA infrastructure --- o...
Metagenomic Data Provenance and Management using the ISA infrastructure --- o...
 
Life sciences big data use cases
Life sciences big data use casesLife sciences big data use cases
Life sciences big data use cases
 
Advanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven ResearchAdvanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven Research
 
AWS Customer Presentation- University of Maryland
AWS Customer Presentation- University of MarylandAWS Customer Presentation- University of Maryland
AWS Customer Presentation- University of Maryland
 
Cshl minseqe 2013_ouellette
Cshl minseqe 2013_ouelletteCshl minseqe 2013_ouellette
Cshl minseqe 2013_ouellette
 
Fostering Serendipity through Big Linked Data
Fostering Serendipity through Big Linked DataFostering Serendipity through Big Linked Data
Fostering Serendipity through Big Linked Data
 
Martin Ringwald, Mouse Gene Expression DB, fged_seattle_2013
Martin Ringwald, Mouse Gene Expression DB, fged_seattle_2013Martin Ringwald, Mouse Gene Expression DB, fged_seattle_2013
Martin Ringwald, Mouse Gene Expression DB, fged_seattle_2013
 
Folker Meyer: Metagenomic Data Annotation
Folker Meyer: Metagenomic Data AnnotationFolker Meyer: Metagenomic Data Annotation
Folker Meyer: Metagenomic Data Annotation
 
HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017
 
Introduction to Bayesian phylogenetics and BEAST
Introduction to Bayesian phylogenetics and BEASTIntroduction to Bayesian phylogenetics and BEAST
Introduction to Bayesian phylogenetics and BEAST
 

Similar to Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit East talk by Piotr Szul

Introduction to 16S rRNA gene multivariate analysis
Introduction to 16S rRNA gene multivariate analysisIntroduction to 16S rRNA gene multivariate analysis
Introduction to 16S rRNA gene multivariate analysisJosh Neufeld
 
Phylogenomic Supertrees. ORP Bininda-Emond
Phylogenomic Supertrees. ORP Bininda-EmondPhylogenomic Supertrees. ORP Bininda-Emond
Phylogenomic Supertrees. ORP Bininda-EmondRoderic Page
 
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...Golden Helix Inc
 
American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1Double Check ĆŐNSULTING
 
Large scale machine learning challenges for systems biology
Large scale machine learning challenges for systems biologyLarge scale machine learning challenges for systems biology
Large scale machine learning challenges for systems biologyMaté Ongenaert
 
AI & Scientific Discovery in Oncology: Opportunities, Challenges & Trends
AI & Scientific Discovery in Oncology: Opportunities, Challenges & TrendsAI & Scientific Discovery in Oncology: Opportunities, Challenges & Trends
AI & Scientific Discovery in Oncology: Opportunities, Challenges & TrendsAndre Freitas
 
Bioinformatics Data Pipelines built by CSIRO on AWS
Bioinformatics Data Pipelines built by CSIRO on AWSBioinformatics Data Pipelines built by CSIRO on AWS
Bioinformatics Data Pipelines built by CSIRO on AWSLynn Langit
 
Challenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchChallenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchFranciscoJAzuajeG
 
scRNA-Seq Workshop Presentation - Stem Cell Network 2018
scRNA-Seq Workshop Presentation - Stem Cell Network 2018scRNA-Seq Workshop Presentation - Stem Cell Network 2018
scRNA-Seq Workshop Presentation - Stem Cell Network 2018David Cook
 
OKC Grand Rounds 2009
OKC Grand Rounds 2009OKC Grand Rounds 2009
OKC Grand Rounds 2009Sean Davis
 
Paper presentation @DILS'07
Paper presentation @DILS'07Paper presentation @DILS'07
Paper presentation @DILS'07Paolo Missier
 
VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...Denis C. Bauer
 
Current advances to bridge the usability-expressivity gap in biomedical seman...
Current advances to bridge the usability-expressivity gap in biomedical seman...Current advances to bridge the usability-expressivity gap in biomedical seman...
Current advances to bridge the usability-expressivity gap in biomedical seman...Maulik Kamdar
 
Charleston Conference 2016
Charleston Conference 2016Charleston Conference 2016
Charleston Conference 2016Anita de Waard
 
Knowledge extraction and visualisation using rule-based machine learning
Knowledge extraction and visualisation using rule-based machine learningKnowledge extraction and visualisation using rule-based machine learning
Knowledge extraction and visualisation using rule-based machine learningjaumebp
 
Comprehensive Exam Slides 11/13/2013
Comprehensive Exam Slides 11/13/2013Comprehensive Exam Slides 11/13/2013
Comprehensive Exam Slides 11/13/2013Qingpeng "Q.P." Zhang
 
American Society for Mass Spectrometry Conference 2013
American Society for Mass Spectrometry Conference 2013American Society for Mass Spectrometry Conference 2013
American Society for Mass Spectrometry Conference 2013Dmitry Grapov
 
ISMB2014読み会 イントロ + Deep learning of the tissue-regulated splicing code
ISMB2014読み会 イントロ + Deep learning of the tissue-regulated splicing codeISMB2014読み会 イントロ + Deep learning of the tissue-regulated splicing code
ISMB2014読み会 イントロ + Deep learning of the tissue-regulated splicing codeKengo Sato
 
Cornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 NetsCornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 NetsMark Gerstein
 

Similar to Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit East talk by Piotr Szul (20)

Introduction to 16S rRNA gene multivariate analysis
Introduction to 16S rRNA gene multivariate analysisIntroduction to 16S rRNA gene multivariate analysis
Introduction to 16S rRNA gene multivariate analysis
 
Phylogenomic Supertrees. ORP Bininda-Emond
Phylogenomic Supertrees. ORP Bininda-EmondPhylogenomic Supertrees. ORP Bininda-Emond
Phylogenomic Supertrees. ORP Bininda-Emond
 
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1
 
Large scale machine learning challenges for systems biology
Large scale machine learning challenges for systems biologyLarge scale machine learning challenges for systems biology
Large scale machine learning challenges for systems biology
 
AI & Scientific Discovery in Oncology: Opportunities, Challenges & Trends
AI & Scientific Discovery in Oncology: Opportunities, Challenges & TrendsAI & Scientific Discovery in Oncology: Opportunities, Challenges & Trends
AI & Scientific Discovery in Oncology: Opportunities, Challenges & Trends
 
Bioinformatics Data Pipelines built by CSIRO on AWS
Bioinformatics Data Pipelines built by CSIRO on AWSBioinformatics Data Pipelines built by CSIRO on AWS
Bioinformatics Data Pipelines built by CSIRO on AWS
 
Challenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchChallenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical research
 
scRNA-Seq Workshop Presentation - Stem Cell Network 2018
scRNA-Seq Workshop Presentation - Stem Cell Network 2018scRNA-Seq Workshop Presentation - Stem Cell Network 2018
scRNA-Seq Workshop Presentation - Stem Cell Network 2018
 
OKC Grand Rounds 2009
OKC Grand Rounds 2009OKC Grand Rounds 2009
OKC Grand Rounds 2009
 
Paper presentation @DILS'07
Paper presentation @DILS'07Paper presentation @DILS'07
Paper presentation @DILS'07
 
VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...
 
Current advances to bridge the usability-expressivity gap in biomedical seman...
Current advances to bridge the usability-expressivity gap in biomedical seman...Current advances to bridge the usability-expressivity gap in biomedical seman...
Current advances to bridge the usability-expressivity gap in biomedical seman...
 
Charleston Conference 2016
Charleston Conference 2016Charleston Conference 2016
Charleston Conference 2016
 
Knowledge extraction and visualisation using rule-based machine learning
Knowledge extraction and visualisation using rule-based machine learningKnowledge extraction and visualisation using rule-based machine learning
Knowledge extraction and visualisation using rule-based machine learning
 
Comprehensive Exam Slides 11/13/2013
Comprehensive Exam Slides 11/13/2013Comprehensive Exam Slides 11/13/2013
Comprehensive Exam Slides 11/13/2013
 
American Society for Mass Spectrometry Conference 2013
American Society for Mass Spectrometry Conference 2013American Society for Mass Spectrometry Conference 2013
American Society for Mass Spectrometry Conference 2013
 
ISMB2014読み会 イントロ + Deep learning of the tissue-regulated splicing code
ISMB2014読み会 イントロ + Deep learning of the tissue-regulated splicing codeISMB2014読み会 イントロ + Deep learning of the tissue-regulated splicing code
ISMB2014読み会 イントロ + Deep learning of the tissue-regulated splicing code
 
Cornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 NetsCornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 Nets
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 

Recently uploaded (20)

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 

Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit East talk by Piotr Szul

  • 1. FINDING NEEDLES IN GENOMIC HAYSTACKS WITH “WIDE” RANDOM FOREST Piotr Szul CSIRO Data61
  • 2. Which needle is the right one? All humans carry between 200 to 800 mutation that disrupt the function of a gene. The human genome is 3 billion letters long. Finding genetic underpinnings fordiseases and phenotypic traits What are the biologicalmechanism? Who is at risk for a disease? How to prevent and treat?
  • 3. 5319 talented staff $1billion+ budget Working with over 2800+ industry partners 55 sites across Australia Top 1% of global research agencies Each year 6 CSIRO technologies contribute $5 billion to the economy
  • 4.
  • 5. Agenda • Intro to Genome Wide Association Studies • Variant Spark and “Cursed Forest” • GWAS use-cases
  • 6. Genome Wide Association Studies imagecourtesyofPasieka SciencePhotoLibrary 1000+ samples Relatively common > 1% ~ 500,000 SNPs
  • 7. Look at the data Typical GWAS: 1M variants x 5K samples Full genome: 80M variants x 2.5K samples 0 1 0 … 1 1 1 1 … 1 0 0 0 … 0 0 0 1 … 1 0 1 1 … 1 0 0 0 … 0 1 2 0 … 0 ......... ......... 0 0 0 … 2 1 2 0 … 0 samples (103) variants(106) 0 1 0 0 0 0 1 ... 0 1 1 1 0 0 1 0 2 ... 0 2 0 1 0 1 1 0 0 ... 0 0 ..................... 1 1 0 1 1 0 0 ... 2 0 variants x samples transpose D N D . N 1 x samples predictors response associate 0 10,000 20,000 30,000 40,000 50,000 100,000 1,000,000 10,000,000 100,000,000 Studies 1000 Genomes samples variants
  • 8. GWAS 0 2000 4000 6000 8000 10000 12000 2008 2009 2010 2011 2012 2013 2014 2015 GWAS Studies Associations Studies 2713 studies 31183 associations Hirota at al (2012) Genome-wide association study identifies eight new susceptibility loci for atopic dermatitis in the Japanese population The NHGRI-EBI Catalog of published genome- wide association studies
  • 9. Missing Heritability Manolio et al. (2009) Finding the missing heritability of complex diseases … human height heritability is ~80% yet more that 40 associated loci explain only about5% of phenotypic variance … “Dark matter” of genomics
  • 10. Epistasis Traditional approach for interaction modeling ’squares’ the problem size 500,000 SNPs à ~100,000,000,000 pairs
  • 11. Random Forest to the rescue Lunetta et al. (2004) Screening large-scale association study data: exploiting interactions using random forests Breiman (2001) Random Forests. Machine Learning
  • 12. Random Forest in GWAS • Non-parametric and arbitrarily expressive • Insensitive to outliers and non-informative predictors • Stable performance – no overfitting • Easy to tune • Built in error estimate (OOB error) • Variable importance measures • Ability to deal with heterogeneous data • Easy to parallelize and scale on HPC Sun (2010) Multigenic Modeling of Complex Disease by Random Forests RF is an appropriate candidate to capture the genetic heterogeneity underlying the trait because RF itself is an ensemble of many heterogeneous trees built from uncorrelated subsamples of the original data
  • 13. VariantSpark 0 1000 2000 P ython R H adoop A dam A D M IX TU R E VariantS park method timeinseconds task binary−conversion clustering pre−processing It can cluster 3000 individuals and 80 million variants O’Brien et al. (2015) VariantSpark: population scale clustering of genotype information Natalie TwineDenis Bauer Oscar Luo Rob Dunne Piotr Szul Transformational Bioinformatics Team Aidan O’BrienLaurence Wilson Software Open source (MIT) @ https://github.com/csirobigdata/variant-spark
  • 14. Random Forest SparkML Firas Abuzaid (Spark Summit 2016) YGGDRASIL: Faster Decision Trees Column Partitioning in SPARK • Failing for millions of variables • Relatively slow
  • 15. “Cursed Forest” broadcast aggregate 1 2,1 2,2 Executors v1 v2 v3v3v3 vn … var, point local best split var1, point1 var21, point21 var22, point22 global best split … initial sample split subsets Driver Partition data by variables (columns) • Columns are “small” – easy partition • An executor can find (an exact) best split for many variables • Finding globalbest split is efficient
  • 16. Some implementation tricks • ”Native” data shape – VCF files are organized by ”variables” • Building by levels and tree batching – Minimize communication overhead and the number of stages • Optimized split finding for ordered factors – The most frequent operation – Java implementation faster then Scala • Choice of data representation – byte representation for variant data – with sparsity 0.75 a sparse vector 3x bigger than a byte array
  • 17. How fast it is? 16 CPU cores 32GB RAM local mode
  • 18. Big data performance • Yarn Cluster (12 workers) – 16 x IntelXeon E5-2660@2.20GHzCPU – 128 GB of RAM • Spark 1.6.1 on YARN – 128 executors – 6GB / executor (0.75TB) • Synthetic dataset(mtry = 0.25) Typical GWAS Range 100K trees: 5 – 50h AWS: ~$215.50 Whole Genome Range 100K trees: 200 – 2000h AWS: ~ $ 8620.00 50M variable x 10k samples!
  • 19. Other features • Various input formats – VCF, CSV, parquet • A variety of RF (fine) tuning parameters – Sampling – Depths – Splitting • Insight into RF model – cumulative OOB error – per tree variable importance – per tree OOB predictions
  • 20. Simulated Data Study • Synthetic dataset of 2.5M variables and 5000 samples • 5 informative variables with dichotomous response • Compare RF importance ranking with the model • Rank-biased overlap (RBO) – measure of ranking overlap (with emphasis of highly ranked elements) RBO 0 0.5 1 1.5 w_1 w_2 w_3 w_4 w_5
  • 21. Bone Mineral Density Study • Osteoporotic fracture is a leading cause of morbidity and mortality particularly amongst the elderly. • In 2004 ten millionAmericans were estimated to have osteoporosis, resulting in 1.5 million fractures per annum. • Hip fracture is associated with a one year mortality rate of 36% in men and 21% in women Burden of disease of osteoporotic fractures overall is similar to that of colorectal cancer and greater than that of hypertension and breast cancer Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting BoneMineral Density and Fracture Risk.
  • 22. Bone Mineral Density Study • 2036 samples & 288,768 SNPs • Replicates 21 of 26 known associated genes • Identifies 2 novel loci (known association with BMD) • Provides strong evidence for further 4 loci Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting BoneMineral Density and Fracture Risk.
  • 23. BMD - VariantSpark Results Known BMD locations have significantly higher ranking (Mann-Whitney U, p = 1.3e-7) A few novel highly ranked locations with plausible association with BMD: COLEC10, PRODH Not replicated DCDC5 ranked 9,667 out of 10,000
  • 24. Future work and directions Techchnical Compare and ’merge’ with yggdrasil Deployment on cloud platforms Further performance improvements Functional Implementation of cutting edge research Integration within genomics platforms (GATK4) More ML algorithms Research Applications Data science research Gradient Boosted Trees
  • 25. References 1. Aleesha Bates (2016) Practical aspects of GWASAssociation studies under statistical genetics and GenABEL hands-on tutorial 2. Hirota at al (2012) Genome-wide association study identifies eight new susceptibility loci for atopic dermatitis in the Japanese population 3. The NHGRI-EBI Catalog of published genome-wide association studies 4. Manolio et al. (2009) Finding the missing heritability of complex diseases 5. Lunetta et al. (2004) Screening large-scale association study data: exploiting interactions using random forests 6. Breiman (2004) Random Forests. Machine Learning 7. Sun (2010) Multigenic Modeling of Complex Disease by Random Forests 8. Danecek et al. The Variant Call Format and VCFtools 9. O’Brien et al. (2015) VariantSpark: population scale clustering of genotype information 10. Firas Abuzaid (Spark Summit 2016) YGGDRASIL: Faster Decision Trees Column Partitioning in SPARK 11. Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting Bone Mineral Density and Fracture Risk.
  • 26. Conclusions Apache Spark is a feasible platform machine learning in population scale genomics. VariantSpark with CursedForest is a promising alternative for traditional GWAS approaches. Data shape, type, etc. matter – different optimizations are needed.
  • 27. Thank You Email: piotr.szul@data61.csiro.au Github: https://github.com/csirobigdata/variant-spark