Deep learning has enabled dramatic advances in image recognition performance. In this talk I will discuss using a deep convolutional neural network to detect genetic variation in aligned next-generation sequencing (NGS) read data from humans. Our method, called DeepVariant, both outperforms existing genotyping tools and generalizes across genome builds and even to other species. DeepVariant represents a significant step from expert-driven statistical modeling towards more automatic deep learning approaches for developing software to interpret data from biological instruments.
Deep learning in medicine: An introduction and applications to next-generation sequencing and disease diagnostics
1. Confidential + Proprietary
Allen Day, PhD, allenday@google.com, Twitter @allenday
4.
Observation: programming a computer to be clever is harder than
programming a computer to learn to be clever.
Intro to machine learning and deep learning
5.
Traditional machine learning vs. the new way

The old way: write a computer program with explicit rules to follow:

if email contains V!agrå then mark is-spam;
if email contains …
if email contains …

The new way: write a computer program to learn from examples:

try to classify some emails;
change self to reduce errors;
repeat;
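The contrast above can be made concrete with a toy bag-of-words spam classifier: a hand-written rule versus a perceptron-style loop that classifies, adjusts itself to reduce errors, and repeats. All data, tokens, and function names below are made up for illustration.

```python
# The old way: explicit, hand-written rules.
def rule_based(email):
    return "spam" if "v!agra" in email.lower() else "ham"

# The new way: learn word weights from labeled examples.
def train_perceptron(examples, epochs=10):
    weights = {}
    for _ in range(epochs):
        for text, label in examples:
            score = sum(weights.get(w, 0.0) for w in text.lower().split())
            pred = "spam" if score > 0 else "ham"
            if pred != label:  # change self to reduce errors
                delta = 1.0 if label == "spam" else -1.0
                for w in text.lower().split():
                    weights[w] = weights.get(w, 0.0) + delta
    return weights

def classify(weights, text):
    score = sum(weights.get(w, 0.0) for w in text.lower().split())
    return "spam" if score > 0 else "ham"

examples = [
    ("cheap pills buy now", "spam"),
    ("meeting moved to tuesday", "ham"),
    ("buy cheap pills", "spam"),
    ("lunch on tuesday", "ham"),
]
w = train_perceptron(examples)
```

The learned classifier generalizes to new word combinations without anyone writing a rule for them.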
9.
Key innovation: learns features from the data
A deep network builds up a feature hierarchy, from bottom to top:
● Input: raw data
● Primitive features: edges, blocks of color, etc.
● Parts of objects, more complex patterns
● High-level complex detectors
10.
Deep Learning Revolution
Modern reincarnation of artificial neural networks: a collection of simple trainable mathematical units, organized in layers, that work together to solve complicated tasks (e.g., mapping an image to the label “cat”).
Key benefit: learns features from raw, heterogeneous data; no explicit feature engineering required.
What’s new: layered network architectures, new training math, and, above all, *scale*.
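The definition above can be sketched in a few lines of NumPy: each unit is a weighted sum plus a simple nonlinearity, and stacking layers yields the feature hierarchy from raw input to high-level detectors. Shapes and random weights here are purely illustrative; a real network learns them by gradient descent at a much larger scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W, b):
    # one layer of simple units: linear map followed by a ReLU nonlinearity
    return np.maximum(0.0, W @ x + b)

# three stacked layers: raw input -> primitive features -> parts -> detectors
x = rng.normal(size=8)                        # input: raw data
W1, b1 = rng.normal(size=(16, 8)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 16)), np.zeros(16)
W3, b3 = rng.normal(size=(4, 16)), np.zeros(4)

h1 = layer(x, W1, b1)    # primitive features (edges, color blocks)
h2 = layer(h1, W2, b2)   # parts of objects, more complex patterns
logits = W3 @ h2 + b3    # high-level detectors
probs = np.exp(logits - logits.max())
probs /= probs.sum()     # softmax over the output classes
```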
14. Szegedy et al., 2014
GoogLeNet (aka “Inception”) architecture
[Diagram: stacked “Inception” modules with auxiliary classifiers and a main classifier outputting, e.g., Pr(dog)]
15.
Image understanding is getting better than human level
[Chart: ImageNet Challenge error rate (%) by year, falling below human performance. Task: given an image, predict one of 1000+ classes.]
* Human performance based on an analysis by Andrej Karpathy.
16.
Machine learning has transformed Google’s products
● Search: search ranking, speech recognition
● Gmail: Smart Reply, spam classification
● Photos: photos search
● Translate: text, graphic, and speech translations
● Android: keyboard & speech input
● Drive: intelligence in apps
● YouTube: video recommendations, better thumbnails
● Cardboard: smart stitching
● Play: app recommendations, game developer experience
● Ads: richer text ads, automated bidding
● Chrome: search by image
● Maps: Street View image parsing, local search
18.
Medical applications of deep learning technology
● Deep learning has remarkable efficacy
○ Amazing with images: photos, search, streetview, Android cameras, …
○ And with speech, language, data centers, …
● How and where can we apply this in medicine and biotechnology?
○ Medical imaging: ophthalmology, pathology, ...
○ Genomics
○ ...
19.
Diabetes causes blindness
● 5-10% of the population is diabetic and should be screened annually for diabetic retinopathy, the fastest-growing cause of blindness.
● # Diabetics >> qualified graders: 387M diabetics vs. ~200k ophthalmologists, and grading is highly technical.
● Poor adherence to the care plan: there are no symptoms, and treatment is preventive rather than curative; only 30-50% are screened in the US (10% in high-risk populations), and many are lost to follow-up.
20.
How DR is diagnosed: retinal fundus images
[Images: healthy vs. diseased fundus showing hemorrhages; grades: No DR, Mild DR, Moderate DR, Severe DR, Proliferative DR]
21.
Even when available, ophthalmologists are not consistent...
Consistency: intragrader agreement ~65%, intergrader ~60%
[Figure: grades assigned by multiple ophthalmologist graders across patient images]
22.
Adapt a deep neural network to read fundus images
● Model: 26-layer convolutional network
● Outputs: No DR, Mild DR, Moderate DR, Severe DR, Proliferative DR
● Training data: a labeling tool used by 54 ophthalmologists produced 880k diagnoses on 130k images
23.
F-score: algorithm 0.95 vs. ophthalmologist (median) 0.91

“The study by Gulshan and colleagues truly represents the brave new world in medicine.” (Dr. Andrew Beam and Dr. Isaac Kohane, Harvard Medical School)

“Google just published this paper in JAMA (impact factor 37) [...] It actually lives up to the hype.” (Dr. Luke Oakden-Rayner, University of Adelaide)
24.
Digital pathology
Example: breast cancer biopsies (JAMA. 2015;313(11):1122-1132)
● 1 in 12 breast cancer biopsies is misdiagnosed (population adjusted)
● Similar rates for other cancer types (prostate: 1 in 7, etc.)
[Figure: correct-diagnosis rates by category (48%-96%), with the remainder split between overdiagnosis and underdiagnosis]
25.
Detecting breast cancer metastases in lymph nodes
● Goal: train a deep learning model to identify cancerous cells in pathology slide images.
● Output: a map over the whole image, indicating the probability that each region harbors cancer cells.
● Training data: ~23M image patches extracted from gigapixel slide images of normal (n=127) and cancerous (n=88) tissues from the Camelyon16 dataset.
● A multi-scale model (detail ←→ context) resembles microscope magnifications.
26.
Metastatic cell detection results are encouraging
● Tumor localization score (FROC): 0.89 vs. 0.73 for a pathologist with unlimited time (92% sensitivity with 8 false positives per slide vs. 73% sensitivity with 0 false positives per slide)
● Slide-level classification: AUC of 0.96 (on par with a pathologist)
[Images: original slide, ground-truth mask, and predicted regions of cancer cells]
Read more at https://arxiv.org/abs/1703.02442
27.
Deep learning in genomics

New application area. Example papers: Alipanahi et al. (2015); Park & Kellis (2015); Xiong et al. (2015); Zhou & Troyanskaya (2015); Angermueller et al. (2016).

Variant calling is a key challenge in genomics due to the complex errors of NGS technologies. Current error rates vary from <1% for germline SNPs to >25% for somatic indels.

Deep learning to call variants. Goals: (1) replace the statistical machinery with a single deep learning model; (2) achieve state-of-the-art or better performance; (3) generalize to new technologies.

Start with the human germline: use the germline case to work out the deep learning data representation and models, then extend the approach to somatic mutations, non-human organisms, etc.
28.
Where should we get started applying deep learning to genetics and genomics problems?
Must-haves for deep learning:
● Lots of data: >50k examples; >1M is ideal.
● High-quality input data and labels for training.
● The mapping from data to label is unknown but certainly exists.
● High-quality previous efforts, so we know the problem is hard to solve with classical statistical/ML approaches.
Answer: SNP and indel calling from NGS data.
29.
Figuring out the true genome sequence from NGS data is a computational and statistical challenge

True genome sequence: 3 billion bases in 23 contiguous chunks (chromosomes):

.......... cttgggttga tattgtcttg gaacatggag gttgtgtcac cgtaatggca caggacaaac cgactgtcga catagagctg gttacaacaa cagtcagcaa catggcggag gtaagatcct actgctatga ggcatcaata tcagacatgg cttcggacag ..........

Actual sequencer output: ~1 billion DNA reads, each ~100 basepairs long (30x coverage):

Read1: cttgggttgatattgtcttggaacatggaggttgtgtcaccgtaatggcacaggacaaacc
Read2: gatattgtcttggaacatggaggttgtgtcaccgtaatggcacaggacaaaccgactgtcg
Read3: tggaacatggaggttgtgtcaccgtaatggcacaggacaaaccgactgtcgacatagagct
Read4: ggttgtgtcaccgtaatggcacaggacaaaccgactgtcgacatagagctggttactgtcg
....
Read 1,000,000,000: ....caactgtcgacatagagctggttactgtcgacatagagctggtt

Step 1: align the reads to a reference genome. Step 2: infer the true genomic sequence(s).

Reads aligned to a reference genome (most positions match the reference; consistent differences suggest variants):

Reference: ...ttgtcttggaacatggaggttgtgtcaccgtaatggcacaggacaaacc...
Read1:     ...ttgtcttggaacatggaggttgtgtgaccgtaatggcacaggacaaacc
Read2:     ...ttgtcttggaacatggaggttgtgtgaccgtaatggcacaggacaaacc...
Read3:              tggaacatggaggttgtgtgaccgtaatggcacaggacaaacc...
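Step 2 above can be illustrated with a deliberately naive caller: pile up the aligned bases at each reference position and report sites where a non-reference base dominates. The reads and threshold below are toy values for illustration; real callers such as GATK and DeepVariant must additionally model correlated sequencing errors.

```python
from collections import Counter

reference = "ttgtcttggaacatggaggttgtgtcaccg"
# each read: (offset into the reference, bases); toy data for illustration
reads = [
    (0, "ttgtcttggaacatggaggttgtgtgaccg"),
    (5, "ttggaacatggaggttgtgtgaccg"),
    (9, "aacatggaggttgtgtgaccg"),
]

def naive_calls(reference, reads, min_fraction=0.8):
    """Return (position, ref_base, alt_base) where reads disagree with the reference."""
    pile = [Counter() for _ in reference]
    for offset, bases in reads:          # count observed bases per position
        for i, base in enumerate(bases):
            pile[offset + i][base] += 1
    calls = []
    for pos, counts in enumerate(pile):
        if not counts:                   # no read coverage here
            continue
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / sum(counts.values()) >= min_fraction:
            calls.append((pos, reference[pos], base))
    return calls
```

Here all three reads carry a g where the reference has a c, so the single variant site is reported.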
30.
A complex error process makes it difficult to call variants accurately in NGS data

Errors come from many uncontrollable sources: the quality of the sample DNA, the protocol used to prepare the sample for the sequencer, the physical properties of the instrument itself, and data-processing artifacts. Worse, errors are correlated among the reads.

Existing statistical techniques work okay. The most accurate variant callers, such as the GATK, use multiple techniques, e.g. logistic regression, hidden Markov models, Bayesian inference, and Gaussian mixture models. All make approximations known to be invalid.

...but they have well-known drawbacks: they rely on hand-crafted features and hand-optimized parameters, require years of work by domain experts, are specialized to a specific prep, sequencer, and tool chain, and are hard to generalize to new technologies.
31.
DeepVariant: recasting variant calling for deep learning
Pipeline: (1) find candidate variants; (2) create pileup images; (3) evaluate each image with a CNN and call variants.
A pileup image encodes the reference, the read bases, their qualities, and other features around a candidate site:

Ref:   ACGTGCCCCAAACGTGATGATC
Reads: ACGTGCCCCAACC---------
       --GTGCCCCAAACGT-------
       ----GCCCCAAACGTGA-----
       -------CCAACCGTGATG---
       --------CAAACGTGATGATC
       ----------ACCGTGATGATC

At the candidate site the reads carry a mix of A and C against the reference A. The CNN outputs genotype likelihoods over {hom-ref, het, hom-alt}, here 0.01 / 0.95 / 0.04, producing a heterozygous variant call.
32.
Recasting variant calling for deep learning: encode reads and the reference genome as images
The encoding is roughly: red = {A,C,G,T}; green = {quality score}; blue = {read strand}; alpha = {matches ref genome}.
[Example pileup images: true SNPs, true indels, false variants]
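The channel recipe above can be sketched as a small encoder. The specific channel values, the quality cap of 40, and the function names below are guesses for illustration only; they are not DeepVariant's actual encoding.

```python
import numpy as np

# hypothetical per-base intensities for the red channel
BASE_VALUE = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}

def encode_pileup(reference, reads):
    """reads: list of (bases, qualities, is_reverse_strand); one image row per read."""
    height, width = len(reads), len(reference)
    img = np.zeros((height, width, 4), dtype=np.float32)
    for row, (bases, quals, reverse) in enumerate(reads):
        for col, (base, q) in enumerate(zip(bases, quals)):
            img[row, col, 0] = BASE_VALUE[base]               # red: base identity
            img[row, col, 1] = min(q, 40) / 40.0              # green: quality score
            img[row, col, 2] = 1.0 if reverse else 0.0        # blue: read strand
            img[row, col, 3] = float(base == reference[col])  # alpha: matches reference
    return img

ref = "ACGTG"
reads = [("ACGTG", [30, 30, 30, 30, 30], False),
         ("ACCTG", [30, 30, 12, 30, 30], True)]
img = encode_pileup(ref, reads)
```

The resulting tensor has shape (reads, positions, channels) and can be fed to an image-classification CNN.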
33.
Recasting variant calling for deep learning: use Inception-v3 to call the variant genotype (Szegedy et al. 2015, https://arxiv.org/abs/1512.00567).
34.
Genome in a Bottle provides ground-truth human variation
● Extensive sequencing of a single human (NA12878) by orthogonal methods
● Stringent criteria identify “callable genomic regions” and true variants
○ ~3.7M regions (covering 80% of the genome) identified as callable
○ ~2.8M single nucleotide polymorphisms
○ ~350k small insertions/deletions
● Train and test on biological replicates of NA12878
○ Each germline WGS dataset provides ~3.7M labeled training variants
○ 2.1M true heterozygous variants
○ 1.3M true homozygous variants
○ 215k false positive variants
Zook et al. 2014
35.
DeepVariant works well in our in-house evaluations
Methodology: train the model on training chromosomes, call variants, and evaluate on the held-out chromosomes.
Result: outperforms GATK on human data.
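The methodology above holds out whole chromosomes rather than random examples, so nearby variants never straddle the train/test boundary. A minimal sketch of such a split (the chromosome names and record shape are assumptions for illustration, not DeepVariant's actual code):

```python
TRAIN_CHROMS = {f"chr{i}" for i in range(1, 20)}   # chr1-chr19 for training
EVAL_CHROMS = {"chr20", "chr21", "chr22"}          # held out for evaluation

def split_examples(examples):
    """examples: iterable of (chrom, pos, label) records; split by chromosome."""
    train, evaluation = [], []
    for ex in examples:
        chrom = ex[0]
        if chrom in TRAIN_CHROMS:
            train.append(ex)
        elif chrom in EVAL_CHROMS:
            evaluation.append(ex)
        # records on other chromosomes (chrX, chrY, ...) are used in neither split
    return train, evaluation
```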
36.
DeepVariant learns an accurate model of the likelihood function P(genotype | reads)
[Plot: observed P(error) vs. estimated P(error), Phred-scaled (-10 log10 P(error)), for DeepVariant and GATK against the perfect-calibration line]
This is the calibration for heterozygous SNPs, but other variant types and genotype states are similar.
37.
DeepVariant learns an accurate model of the likelihood function P(genotype | reads)
● To be well calibrated, variants should be correct at their assigned confidence rate.
● Genotype likelihoods are the critical input to genomic analyses such as imputation, de novo mutation detection, and association studies.
● Most callers are overconfident in their likelihoods.
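The Phred scaling used on the calibration axes above is a simple log transform of the error probability. A minimal helper pair (the function names are my own) makes the conversion concrete: an error probability of 0.01 corresponds to Q20, and 0.001 to Q30.

```python
import math

def phred(p_error):
    # convert an error probability to Phred scale: Q = -10 * log10(p)
    return -10.0 * math.log10(p_error)

def p_error(q):
    # invert the transform: p = 10^(-Q/10)
    return 10.0 ** (-q / 10.0)
```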
38.
After lots of internal testing, we entered the public FDA-sponsored PrecisionFDA competition in April 2016
● Unblinded training sample
● Blinded evaluation sample
39.
DeepVariant won an award at the 2016 PrecisionFDA competition
[Chart: F-measures of 99.85 and 98.91 under two conditions: v2 => v3 truth set for the unblinded sample, and unblinded => blinded sample with the v3 truth set]
F-measure is the harmonic mean of precision and recall.
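Since the slide defines F-measure as the harmonic mean of precision and recall, a tiny helper (the naming is my own) makes the definition and a sanity check concrete: perfect recall with 50% precision only scores about 0.667.

```python
def f1(precision, recall):
    # harmonic mean of precision and recall; 0 when both are 0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```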
40.
A trained DeepVariant model encodes everything needed to call variants, enabling us to apply it in novel contexts

You can train on one genome build and call variants on another:

Training data   Evaluation data   F1
b37 chr1-19     b38 chr20-22      99.45%
b38 chr1-19     b38 chr20-22      99.47%

Calling variants on b38 using a model trained on either b37 or b38 gives effectively identical quality. This means we can call variants on a genome build without needing all of the metadata mapped to that build.

You can train on human data and call variants on mouse data:

Training data   Evaluation data   F1
Human chr1-19   Mouse chr18-19    98.29%
Mouse chr1-17   Mouse chr18-19    97.84%

The model is robust to protocol differences (human: 50x 2x148bp HiSeq 2500; mouse: 27x 2x100bp GAII). This lets us leverage the larger and better truth data available for humans (~5M labeled variants vs. ~700K in mouse) to call variants in other organisms.

F1 is the harmonic mean of precision and recall.
41.
DeepVariant can learn to call variants in many sequencing technologies

Dataset                    DeepVariant (F1)   Comparator (F1)   Comparator caller
10X Chromium 75x WGS       99.3%              98.2%             Long Ranger
Ion AmpliSeq exome         96.9%              97.3% (1)         TVC
PacBio raw reads 40x WGS   92.7%              56.1% (2)         samtools
SOLID SE 85x WGS           86.4%              78.8% (3)         GATK
Illumina TruSeq exome      96.1%              95.4%             ensemble

(1) Uses four lanes of data vs. one for DeepVariant. (2) No standard caller exists for this technology for human samples. (3) Old technology without any maintained variant callers.
42.
DeepVariant can learn to call variants at a range of input sequence depths
[Plots: sensitivity and precision vs. sequencing depth for GATK and for DeepVariant models trained on 35-45x, 15-25x, and 4-45x data]
43.
DeepVariant outperforms GATK on low-coverage samples
(Training on chromosomes 1-19; evaluation on chromosomes 20-22)
44.
DeepVariant conclusions
● Deep learning is a remarkably powerful and flexible technology.
● DeepVariant is an example of how to apply deep learning to a genomics problem.
● It achieves equivalent or better performance than current variant calling tools.
● It works for many (any?) sequencing technologies.
● Run it now at https://cloud.google.com/genomics/v1alpha2/deepvariant
● An open-sourced version is coming soon!
● Read more in our bioRxiv paper: https://doi.org/10.1101/092890
45. Google confidential │ Do not distribute
Google’s Data Research...
[Timeline 2002-2016: GFS, MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, Millwheel, PubSub, F1, TensorFlow]
46.
...are the technologies used in DeepVariant...
47.
... which are available to you today on GCP
[Timeline 2002-2016: Cloud Storage, BigQuery, BigTable, DataProc, DataFlow, DataStore, PubSub, ML]
48.
Sharing our tools with researchers and developers around the world
TensorFlow, released in Nov. 2015, is the #1 repository in the “machine learning” category on GitHub.