NIST program to develop genomic reference materials
1. NIST Program to Develop Genomic Reference Materials
Justin Zook and Marc Salit
2. Scope of NIST work
• Human Whole Genome RMs
• Synthetic DNA constructs
• Microbial Whole Genome RMs
3. RM Development Process
1. Select and procure materials
2. Characterize materials
3. Process and integrate data from multiple platforms
4. Confirm selected genotypes
5. Write Report of Analysis
6. Develop methods for end users to obtain performance metrics from the materials
5. Proposed Timeline for Synthetic Structures
[Gantt chart; columns: Title, Effort; year axis: 2011 2012 2013 2014 2015 2...]
1) Human RMs 535w
1.1) Select/Procure human DNA for RM 32w
1.2) **NIST receives packaged DNA for RM/SRM
1.3) Develop bioinformatics pipeline for data integration 97w
1.4) Human Primary Sequencing 147w
1.5) Human Homogeneity assessment 8w
1.6) Analyze homogeneity data and produce preliminary SNP calls for RM 10w
1.7) Write human RM Report of Analysis 10w
1.8) Process Human RM for release 24w
1.9) **Human RM officially released
1.10) Human Sequencing data integration 25w
1.11) Human Validation 20w
1.12) Human other characterization methods 48w
1.13) Analyze validation data and refine sequencing calls 12w
1.14) Develop pipeline for SVs and test 40w
1.15) Write Human SRM Report of Analysis 8w
1.16) Process Human SRM for release 24w
1.17) **Human SRM officially released
1.18) Procure local data storage 10w
1.19) Procure Bioinformatics data analysis tools 10w
1.20) Procure Automated sample prep instrumentation 10w
2) Microbial RMs 279w
2.1) Select/Procure microbial DNA for RMs 31w
2.2) Microbial Primary Sequencing 124w
2.3) Microbial Homogeneity assessment 6w
2.4) Microbial Sequencing data integration 40w
2.4.1) Mapping/Alignment 10w
2.4.2) Variant calling 12w
2.4.3) Form consensus variant calls 12w
6. Proposed Characterization Methods for Whole Genomes
Whole Genome Sequencing
• ABI 5500 (1kb, 6kb, and 10kb mate-pair libraries)
• Illumina
• Complete Genomics
• Upcoming technologies?
– Ion Proton?
– Oxford Nanopore?
Other
• Genotyping microarrays
• Array CGH
• Targeted sequencing
• Fosmid sequencing?
• Optical Mapping?
• 3x replication of sequencing (3 library preps)
[Figure: pedigree with members labeled Father, Mother, Husband, NA12878, Son, Daughter]
7. Integration of Existing Data to Form Consensus Genotype Calls
• Find all possible variant sites
• Find sites where all datasets agree
• Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias
• For each site, remove datasets with decreasingly atypical characteristics until all datasets agree
• Even if all datasets agree, identify them as uncertain if few have typical characteristics
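The integration steps above can be sketched in Python for a single candidate site. This is a minimal illustration, not the NIST pipeline: the `Call` record, the single `atypical_score` per dataset (a stand-in for whatever annotations signal sequencing, mapping, or alignment bias), and the thresholds `atypical_cutoff` and `min_typical` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    dataset: str           # hypothetical dataset name
    genotype: str          # e.g. "0/0", "0/1", "1/1"
    atypical_score: float  # hypothetical: higher means more atypical


def consensus_genotype(calls, atypical_cutoff=0.5, min_typical=2):
    """Return (genotype, status) for one candidate variant site."""
    if not calls:
        raise ValueError("need at least one call for the site")
    # Order datasets from most to least atypical, so that removals
    # proceed in order of decreasing atypicality.
    remaining = sorted(calls, key=lambda c: c.atypical_score, reverse=True)
    # Drop the most atypical remaining dataset until all remaining agree.
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop(0)
    genotype = remaining[0].genotype
    # Even when datasets agree, flag the site if few of them look typical.
    typical = [c for c in remaining if c.atypical_score < atypical_cutoff]
    status = "confident" if len(typical) >= min_typical else "uncertain"
    return genotype, status


# One discordant, highly atypical dataset is removed; the rest agree.
site_a = [Call("A", "0/1", 0.1), Call("B", "0/1", 0.2), Call("C", "1/1", 0.9)]
# All datasets agree, but only one has typical characteristics.
site_b = [Call("A", "0/1", 0.8), Call("B", "0/1", 0.7), Call("C", "0/1", 0.1)]

print(consensus_genotype(site_a))
print(consensus_genotype(site_b))
```

Running the function over the union of sites reported by any dataset would correspond to the first step, "find all possible variant sites".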
8. Consensus has lower FN rate than individual datasets
Rows: HiSeq – GATK calls. Columns: Illumina Omni SNP Array genotypes.
HiSeq – GATK | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain
Homozygous Reference/No Call | 1.45M | 7.24k (1.34%) | 5.28k (0.65%) | N/A
Heterozygous | 196 (0.03%) | 411k (60.7%) | 133 (0.02%) | N/A
Homozygous Variant | 154 (0.02%) | 150 (0.02%) | 249k (37.0%) | N/A
Rows: Integrated Consensus Genotypes. Columns: Illumina Omni SNP Array genotypes.
Integrated Consensus Genotypes | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain
Homozygous Reference/No Call | 1.45M | 613 (0.09%) | 977 (0.15%) | N/A
Heterozygous | 241 (0.04%) | 414k (61.5%) | 173 (0.03%) | N/A
Homozygous Variant | 152 (0.02%) | 61 (0.01%) | 249k (36.9%) | N/A
Uncertain | 5458 (0.81%) | 3421 (0.51%) | 4808 (0.71%) | N/A
“FNs”: sites in the Homozygous Reference/No Call row that the array genotypes as Heterozygous or Homozygous Variant.
“FPs*”: sites in the Heterozygous and Homozygous Variant rows that the array genotypes as Homozygous Reference.
* Note that most or all of the putative FPs seem to actually be FNs on the microarray
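The claim in the slide title can be checked with a little arithmetic on the counts above. The sketch below totals the sites where the microarray reports a variant but the sequencing-based call is Homozygous Reference/No Call, for each call set. The counts are copied from the slide; "7.24k" and "5.28k" are rounded in the source, so the HiSeq total is approximate. The variable names are illustrative.

```python
# Sites the microarray genotypes as variant while the sequencing-based
# call is Homozygous Reference/No Call (the "FNs" on the slide).
fn_counts = {
    "HiSeq-GATK": {"Heterozygous": 7240, "Homozygous Variant": 5280},  # rounded in source
    "Integrated consensus": {"Heterozygous": 613, "Homozygous Variant": 977},
}

# Total apparent false negatives per call set.
fn_totals = {name: sum(cells.values()) for name, cells in fn_counts.items()}

for name, total in fn_totals.items():
    print(f"{name}: {total} apparent FNs")

ratio = fn_totals["HiSeq-GATK"] / fn_totals["Integrated consensus"]
print(f"HiSeq-GATK has about {ratio:.1f}x as many apparent FNs as the consensus")
```

Note that the consensus also labels 3421 + 4808 array-variant sites as Uncertain rather than calling them, so the two totals are not measured over identical sets of confidently called sites.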
12. Challenges with assessing performance
• All variant types are not equal
• Nearby variants are often difficult to align
• All regions of the genome are not equal
– Homopolymers, STRs, duplications
– Can be similar or different in different genomes
• Labeling difficult variants as “uncertain” in the Reference Material leads to higher apparent accuracy when assessing performance
• Genotypes fall in 3+ categories (not positive/negative)
• It’s important to consider data from multiple platforms and library preparations when characterizing a Reference Material
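Two of the points above can be made concrete with a toy example: genotypes form a three-way (or larger) confusion table rather than a positive/negative one, and excluding sites labeled "uncertain" raises apparent accuracy. The data below are invented purely for illustration, and the function names are hypothetical; nothing here comes from NIST results.

```python
from collections import Counter

# Hypothetical toy sites: (reference genotype, test-method call, difficult?)
# "difficult" marks sites a Reference Material might label as uncertain.
sites = [
    ("hom_ref", "hom_ref", False),
    ("het", "het", False),
    ("hom_var", "hom_var", False),
    ("het", "hom_var", False),   # error at an easy site
    ("het", "hom_ref", True),    # error at a difficult site
    ("hom_var", "het", True),    # error at a difficult site
    ("het", "het", True),
    ("hom_ref", "hom_ref", False),
]


def confusion(sites):
    """Count (reference genotype, test call) pairs: a 3x3 table."""
    return Counter((ref, call) for ref, call, _ in sites)


def concordance(sites):
    """Fraction of sites where the test call equals the reference genotype."""
    return sum(ref == call for ref, call, _ in sites) / len(sites)


all_sites = concordance(sites)
confident_only = concordance([s for s in sites if not s[2]])

print(confusion(sites))
print(f"all sites: {all_sites:.3f}; excluding difficult sites: {confident_only:.3f}")
```

In this toy set the method looks more accurate once the difficult sites are excluded, even though its calls have not changed.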