ML in genomics
1. Introduction to machine learning in genomics
BRIAN SCHILDER
BIOINFORMATICIAN II
RAJ LAB 08/21/2020
[1] Nash Family Department of Neuroscience & Friedman Brain Institute
[2] Ronald M. Loeb Center for Alzheimer’s Disease
[3] Department of Genetics and Genomic Sciences & Icahn Institute for Data Science and Genomic Technology
[4] Estelle and Daniel Maggin Department of Neurology
2. Approaches to making predictions
L Breiman, Statistical modeling: The two cultures. Statistical Science. 16, 199–215 (2001).
Explicit modeling (your brain learns x~y)
“I will predict y from x by assuming relationships based on my knowledge/the literature.”
Pros
• Can utilize the prevailing wisdom.
• Highly interpretable models.
Cons
• Susceptible to bias/assumptions/arbitrary parameters.
• May not explain the variance very well.
Machine learning (your computer learns x~y)
“I will predict y from x by having an algorithm learn the relationships from data.”
Pros
• Less susceptible to (some forms of) human bias.
• Can make predictions from complex/multivariate data.
Cons
• Can be less interpretable.
• May not generalize to other data.
Science in a nutshell
• What’s the relationship between x and y?
• If you do something to x, what will happen to y?
3. What is machine learning?
Artificial Intelligence: the automation of tasks that normally require human intelligence.
Machine Learning: automated optimization of some function by learning directly from data (as opposed to following explicit rules).
[Diagram: an explicit rule-based decision tree. If x > 4, test y + z < 2 (True: go to Dr.; False: go to ER); if x ≤ 4, stay home.]
4. General ML framework
1. Training phase: input training data → output predictions → evaluate accuracy against the real answer → adjust model (repeat).
2. Testing phase: input testing data → output predictions.
Supervised learning example
• Input data: categorical or continuous, e.g. a cat photo (or a gene expression vector…).
• Transform the data and feed it to a model: logistic regression, linear regression, GLMM, SVM, neural network, genetic algorithm, etc.
• Output prediction: categorical or continuous, e.g. Dog (.04) vs. Cat (.96) → Correct! +1
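The training/testing loop above can be sketched in a few lines of pure Python: a minimal logistic-regression classifier trained by gradient descent. All data and parameter values here are hypothetical and purely illustrative, not from the slides.

```python
import math
import random

# Minimal supervised-learning loop: train a logistic-regression
# classifier by gradient descent, then test on held-out data.

random.seed(0)

def make_data(n):
    # Two continuous features; label is 1 when their sum exceeds 1.
    data = []
    for _ in range(n):
        x = (random.uniform(0, 1), random.uniform(0, 1))
        data.append((x, 1 if x[0] + x[1] > 1 else 0))
    return data

train, test = make_data(200), make_data(50)

def predict(w, b, x):
    # Sigmoid of a weighted sum: probability that the label is 1.
    return 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))

# 1. Training phase: predict, compare to the real answer, adjust model.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(200):
    for x, y in train:
        err = predict(w, b, x) - y     # evaluate against real answer
        w[0] -= lr * err * x[0]        # adjust model weights
        w[1] -= lr * err * x[1]
        b -= lr * err

# 2. Testing phase: output predictions on unseen data, measure accuracy.
accuracy = sum(
    (predict(w, b, x) > 0.5) == (y == 1) for x, y in test
) / len(test)
```

On this separable toy problem the held-out accuracy ends up high; with real genomic data the gap between training and testing performance is exactly what the testing phase is there to measure.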
5. ML vs. statistics: an increasingly blurry line
◦ Math and statistics were developed well before the advent of computers.
◦ Modern computers enable rapid iterative processes (optimization, distribution simulation)
◦ Linear regression, PCA and t-SNE are all technically AI/ML, but we often don’t think of them that way anymore.
https://towardsdatascience.com/introduction-to-linear-regression-and-polynomial-regression-f8adc96f31cb
https://ai.googleblog.com/2018/06/realtime-tsne-visualizations-with.html
[Figures: linear regression, PCA, t-SNE]
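To make the "blurry line" concrete, here is simple linear regression in closed form: the slope and intercept are learned directly from data, which is why even this classic statistical method is technically ML. A minimal pure-Python sketch on hypothetical data.

```python
# Closed-form simple linear regression (least squares).

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Least-squares estimates: slope = cov(x, y) / var(x).
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]              # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
```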
7. Which ML model do I use?
• Complex relationships? No → simpler model.
• Yes → need high interpretability? Yes → simpler model.
• No → lots of data? Yes → more complex model; No → simpler model.
https://towardsdatascience.com/the-balance-accuracy-vs-interpretability-1b3861408062
In practice, you try multiple models of
varying complexity and compare
performances.
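The "try multiple models and compare performances" advice can be sketched as follows: fit a mean-only baseline and a least-squares line on hypothetical training data, then compare mean squared error (MSE) on a held-out split.

```python
# Compare two models of different complexity on held-out data.
# All data are hypothetical, roughly y = 2x.

train_x = [1, 2, 3, 4, 5, 6]
train_y = [2.1, 4.0, 6.2, 7.9, 10.1, 12.0]
test_x, test_y = [7, 8], [14.1, 15.9]

def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

# Simpler model: always predict the training mean.
mean_y = sum(train_y) / len(train_y)
mse_mean = mse([mean_y] * len(test_y), test_y)

# More complex model: least-squares line.
mx = sum(train_x) / len(train_x)
slope = sum((x - mx) * (y - mean_y) for x, y in zip(train_x, train_y)) / \
        sum((x - mx) ** 2 for x in train_x)
intercept = mean_y - slope * mx
mse_line = mse([slope * x + intercept for x in test_x], test_y)
```

Here the linear model wins on held-out error; with different data the simpler baseline could win instead, which is exactly why the comparison is made on held-out data rather than the training set.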
9. What is deep learning?
Eraslan et al. 2019, Nature Genetics Review
• Given their sequences (input), how probable is it that these regions are binding motifs for TF A (output)?
• Given their sequences (input), how probable is it that these regions are binding motifs for TF A (output 1) or TF B (output 2)?
• Given their sequences (input 1) and chromatin accessibility profiles (input 2), how probable is it that these regions are binding motifs for TF A (output)?
[Diagram: a small feed-forward network. Layer 1 (input): nodes 1-1, 1-2. Layer 2 (hidden): nodes 2-1, 2-2, 2-3. Layer 3 (output): nodes 3-1, 3-2.]
Pros
• Extremely flexible framework.
• Highly parallelizable (GPUs).
• Able to learn complex, non-linear features.
Cons
• Challenging to interpret.
• Can require lots of compute.
• [Usually] requires lots of data.
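A single forward pass through a network of this shape (2 input nodes, 3 hidden nodes, 2 output nodes) can be written out directly. The weights below are arbitrary illustrative values, not a trained model.

```python
import math

# Forward pass: 2 input -> 3 hidden (ReLU) -> 2 output (softmax).

def layer(W, b, x):
    # Each node: weighted sum of the previous layer's outputs plus bias.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def relu(v):
    return [max(0.0, z) for z in v]

def softmax(v):
    exps = [math.exp(z) for z in v]
    total = sum(exps)
    return [e / total for e in exps]

x = [0.5, -1.0]                                 # Layer 1 (input)

W2 = [[0.4, -0.2], [0.3, 0.8], [-0.5, 0.1]]     # input -> hidden weights
b2 = [0.1, 0.0, -0.1]
h = relu(layer(W2, b2, x))                      # Layer 2 (hidden)

W3 = [[0.7, -0.3, 0.2], [-0.6, 0.5, 0.4]]       # hidden -> output weights
b3 = [0.0, 0.0]
y = softmax(layer(W3, b3, h))                   # Layer 3 (output)
```

The softmax output behaves like the class probabilities in the earlier Dog/Cat example: two non-negative numbers that sum to one.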
13. Predict [x] from DNA sequence
DNA sequence →
• Disease risk: primateAI, Deep Structured Phenotype Network (DSPN)
• Gene expression: xpresso
• Splicing: spliceAI
• TF motifs: equivariant networks
• Epigenomic impact: DeepSEA, DeepFIGV, Avocado
• In many cases, deep learning models performed far better than other approaches (e.g. heuristics, SVM).
• That said, rigorous testing on datasets sufficiently different from those used in training is key (but often difficult).
14. Why are ANN so useful for sequences?
◦ DNA sequences are really hard for humans to understand.
◦ Especially true when considering long sequences, or multi-scale non-linear interactions.
◦ Artificial neural networks (ANNs) excel at complex feature learning (e.g. image recognition).
◦ CNNs are great for learning hierarchical features: nostril < nose < face < cat.
◦ Humans can then interrogate and interpret these features.
Eraslan et al. 2019, Nature Genetics Review
[Diagram: encoded input → ANN model → prediction]
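The "encoded input" step typically means one-hot encoding: each base becomes a 4-element indicator vector before reaching the ANN. A minimal sketch (the column order A, C, G, T is a common convention, not mandated by any particular tool above):

```python
# One-hot encoding of a DNA sequence for use as ANN input.

BASES = "ACGT"

def one_hot(seq):
    # One row per base; exactly one 1 per row marks the base identity.
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]

encoded = one_hot("GATC")
```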
16. Denoise noisy data (e.g. scRNA-seq)
Deep count autoencoder (DCA)
◦ Eraslan et al. (2019), Nature Communications
Learning with AuToEncoder
◦ Badsha et al. (2020), Quantitative Biology
SAVER-X
◦ Wang et al. (preprint) bioRxiv
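The idea behind autoencoder-based denoising (as in DCA) in miniature: compress each noisy point through a narrow bottleneck and reconstruct it, so the model keeps the structure shared across points and discards per-point noise. This is a hypothetical linear toy, not the DCA architecture (which models scRNA-seq count noise explicitly).

```python
import random

# Linear autoencoder toy: 2-D points -> 1-D bottleneck -> 2-D output.

random.seed(1)

# Noisy observations of points that truly lie on the line y = x.
data = []
for _ in range(50):
    t = random.uniform(-1, 1)
    data.append([t + random.gauss(0, 0.1), t + random.gauss(0, 0.1)])

w = [0.5, 0.1]   # encoder weights: 2-D input -> 1-D bottleneck
v = [0.3, 0.9]   # decoder weights: 1-D bottleneck -> 2-D output

def loss():
    # Mean squared reconstruction error over the dataset.
    total = 0.0
    for x in data:
        h = w[0] * x[0] + w[1] * x[1]
        total += (x[0] - v[0] * h) ** 2 + (x[1] - v[1] * h) ** 2
    return total / len(data)

initial_loss = loss()

# Stochastic gradient descent on the reconstruction error.
lr = 0.05
for _ in range(200):
    for x in data:
        h = w[0] * x[0] + w[1] * x[1]
        r = [x[0] - v[0] * h, x[1] - v[1] * h]    # reconstruction error
        gh = -2 * (r[0] * v[0] + r[1] * v[1])     # dLoss/d(bottleneck)
        w = [w[0] - lr * gh * x[0], w[1] - lr * gh * x[1]]
        v = [v[0] + lr * 2 * r[0] * h, v[1] + lr * 2 * r[1] * h]

final_loss = loss()
```

After training, the bottleneck has learned the shared 1-D structure (the line), so reconstructions are denoised versions of the inputs; the residual error is dominated by the noise the bottleneck cannot represent.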
19. Disease prediction
“…we developed an interpretable deep-learning framework, the
Deep Structured Phenotype Network (DSPN) (21). This model
combines a Deep Boltzmann Machine architecture with
conditional and lateral connections derived from the regulatory
network (50).”
Improvement over baseline (50%)
• Logistic predictor: 2.4-fold
• DSPN: 6-fold
• Captures non-linear interactions
20. When does deep learning fail?
• When there’s not enough training/testing data: can contribute to overfitting; the model can’t translate to other datasets.
• When the data hasn’t been preprocessed properly, or has some other uncorrected confound: e.g. a white label on the bottom of every image from a disease-specialty hospital.
• When a high degree of interpretability and explainability is required: e.g. clinical decision support.
• When a simpler model can do just as well for less compute: always compare performance to other methods.
• When you’re asking the wrong question, or the fitness function is misspecified: avoiding this requires domain knowledge.
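The overfitting failure mode is easy to demonstrate: a degree-4 polynomial interpolates five noisy training points perfectly (zero training error) yet generalizes far worse than a plain least-squares line. All data below are hypothetical, roughly y = 2x plus noise.

```python
# Overfitting in miniature: exact interpolation vs. a simple line.

train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.0, 2.3, 3.8, 6.2, 7.9]
test_x, test_y = 5.0, 10.1

def interpolate(x):
    # Lagrange polynomial through every training point: a "complex"
    # model with as many effective parameters as data points.
    total = 0.0
    for i, (xi, yi) in enumerate(zip(train_x, train_y)):
        term = yi
        for j, xj in enumerate(train_x):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Simpler model: least-squares line.
n = len(train_x)
mx, my = sum(train_x) / n, sum(train_y) / n
slope = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y)) / \
        sum((x - mx) ** 2 for x in train_x)
intercept = my - slope * mx

train_err_complex = sum((interpolate(x) - y) ** 2
                        for x, y in zip(train_x, train_y))
test_err_complex = (interpolate(test_x) - test_y) ** 2
test_err_simple = (slope * test_x + intercept - test_y) ** 2
```

The interpolant's training error is zero by construction, but it fits the noise as well as the signal, so its held-out error is far larger than the line's.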
21. Deep learning references
Reviews
◦ J Zou et al., A primer on deep learning in genomics. Nature Genetics
(2018), doi:10.1038/s41588-018-0295-5.
◦ G Eraslan et al., Deep learning: new computational modelling
techniques for genomics. Nature Reviews Genetics (2019),
doi:10.1038/s41576-019-0122-6.
◦ RC Deo, Machine Learning in Medicine. Circulation. 132, 1920–1930 (2015).
◦ P Baldi, Deep Learning in Biomedical Data Science. Annual Review
of Biomedical Data Science. 1, 181–205 (2018).
◦ R Miotto et al., Deep learning for healthcare: review, opportunities
and challenges. Briefings in Bioinformatics. 19, 1236–1246 (2017).
◦ VI Jurtz et al., An introduction to deep learning on biological
sequence data: Examples and solutions. Bioinformatics. 33, 3685–
3690 (2017).
◦ MKK Leung et al., Machine Learning in Genomic Medicine: A
Review of Computational Problems and Data Sets. Proceedings of the
IEEE. 104, 176–197 (2016).
◦ DSW Ho et al., Machine learning SNP based prediction for
precision medicine. Frontiers in Genetics. 10, 1–10 (2019).
◦ A Taylor-Weiner et al., Scaling computational genomics to millions
of individuals with GPUs. Genome Biology. 20, 1–5 (2019).
◦ L Breiman, Statistical modeling: The two cultures. Statistical Science.
16, 199–215 (2001).
◦ S Ullman, Using neuroscience to develop artificial intelligence. Science. 363, 692–694 (2019).
◦ A Marblestone et al., Towards an integration of deep learning and neuroscience. Frontiers in Computational Neuroscience. 10, 1–41 (2016).
◦ Y Bengio et al., Towards Biologically Plausible Deep Learning. arXiv (2015).
◦ KM Chen et al., Selene: a PyTorch-based deep learning library for
sequence-level data. Nature Methods. 16, 315–318 (2019).
Genomics
◦ Disease risk
◦ D Wang et al., Comprehensive functional genomic resource and integrative model for the
adult brain. Science, 1266 (2018).
◦ L Sundaram et al., Predicting the clinical impact of human mutation with deep neural
networks. Nature Genetics. 50, 1161–1170 (2018).
◦ Y Ding et al., A deep learning model to predict a diagnosis of Alzheimer disease by using
18 F-FDG PET of the brain. Radiology. 290, 456–464 (2019).
◦ I Klyuzhin et al., Use of deep convolutional neural networks to predict Parkinson’s disease
progression from DaTscan SPECT images. Journal of Nuclear Medicine. 59, 29 (2018).
◦ KK Dey et al., Evaluating the informativeness of deep learning annotations for human
complex diseases. bioRxiv, 784439 (2019).
◦ A Romagnoni et al., Comparative performances of machine learning methods for
classifying Crohn Disease patients using genome-wide genotyping data. Scientific Reports. 9,
1–18 (2019).
◦ CAC Montañez et al., Deep Learning Classification of Polygenic Obesity using Genome
Wide Association Study SNPs. Proceedings of the International Joint Conference on Neural
Networks. 2018-July (2018), doi:10.1109/IJCNN.2018.8489048.
◦ Gene expression
◦ V Agarwal et al., Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks. Cell Reports. 31, 107663 (2020).
◦ X Li et al., The impact of rare variation on gene expression across tissues. Nature. 550,
239–243 (2017).
◦ JD Washburn et al., Evolutionarily informed deep learning methods for predicting relative
transcript abundance from DNA sequence. Proceedings of the National Academy of Sciences of
the United States of America. 116, 5542–5549 (2019).
◦ Y Zhang et al., Predicting Gene Expression from DNA Sequence using Residual Neural
Network. bioRxiv, in press, doi:10.1101/2020.06.21.163956.
◦ Epigenomics
◦ J Zhou et al., Predicting effects of noncoding variants with deep learning-based sequence
model. Nature Methods. 12, 931–934 (2015).
◦ GE Hoffman et al., Functional Interpretation of Genetic Variants Using Deep Learning
Predicts Impact on Epigenome. Nucleic Acids Research, 1–15 (2019).
◦ J Schreiber et al., Avocado: a multi-scale deep tensor factorization method learns a latent
representation of the human epigenome. Genome Biology. 21, 364976 (2018).
◦ Splicing
◦ K Jaganathan et al., Predicting Splicing from Primary Sequence with Deep Learning. Cell.
0, 535-548.e24 (2019).
◦ TF
◦ RC Brown et al., An equivariant Bayesian convolutional network predicts recombination
hotspots and accurately resolves binding motifs. Bioinformatics. 35, 2177–2184 (2019).
Transcriptomics
◦ G Eraslan et al., Single-cell RNA-seq denoising using a deep count
autoencoder. Nature Communications. 10, 1–14 (2019).
◦ M Lotfollahi et al., scGen predicts single-cell perturbation responses. Nature
Methods. 16, 715–721 (2019).
◦ M Colomé-Tatché et al., Statistical single cell multi-omics integration. Current
Opinion in Systems Biology. 7, 54–59 (2018).
◦ C Lin et al., Using neural networks for reducing the dimensions of single-cell
RNA-Seq data. Nucleic Acids Research. 45 (2017), doi:10.1093/nar/gkx681.
◦ GP Way et al., Bayesian deep learning for single-cell analysis. Nature Methods.
15, 1009–1010 (2018).
◦ R Lopez et al., Deep generative modeling for single-cell transcriptomics.
Nature Methods. 15, 1053–1058 (2018).
◦ J Wang et al., Data denoising with transfer learning in single-cell
transcriptomics. Nature Methods. 16 (2019), doi:10.1038/s41592-019-0537-1.
◦ OmicsMapNet: Transforming omics data to take advantage of Deep
Convolutional Neural Network for discovery.
Drug discovery
◦ R Gómez-Bombarelli et al., Automatic Chemical Design Using a Data-Driven Continuous
Representation of Molecules. ACS Central Science. 4, 268–276 (2018).
◦ CF Lipinski et al., Advances and Perspectives in Applying Deep Learning for Drug Design and
Discovery. 6, 1–6 (2019).
◦ L David et al., Applications of deep-learning in exploiting large-scale and heterogeneous
compound data in industrial pharmaceutical research. Frontiers in Pharmacology. 10, 1–16 (2019).
Imaging
◦ G Lee et al., Predicting Alzheimer’s disease progression using multi-modal
deep learning approach. Scientific Reports. 9, 1–12 (2019).
◦ H Chen et al., VoxResNet: Deep voxelwise residual networks for brain
segmentation from 3D MR images. NeuroImage. 170, 446–455 (2018).
◦ T Jo et al., Deep Learning in Alzheimer’s Disease: Diagnostic Classification
and Prognostic Prediction Using Neuroimaging Data. Frontiers in Aging
Neuroscience. 11 (2019), doi:10.3389/fnagi.2019.00220.
◦ A Iqbal et al., Developing a brain atlas through deep learning. Nature Machine
Intelligence. 1, 277–287 (2019).
◦ A Mahbod et al., Automatic brain segmentation using artificial neural
networks with shape context. Pattern Recognition Letters. 101, 74–79 (2018).
◦ P Kumar et al., U-SEGNET: Fully convolutional neural network based
automated brain tissue segmentation tool. arXiv (2018).
Editor’s notes
Explicit modeling ~ Breiman’s “data modeling” (predictive modeling); algorithmic modeling ~ machine learning.
In reality, there’s a lot of overlap between these approaches. For example, you can assume a normal distribution within a machine learning model.
Pseudotime on DCA bottleneck coordinates was highly correlated with pseudotime on PCA coordinates (recommended usage), suggesting that DCA captures the correct continuous feature (cellular differentiation).