NIST program to develop genomic reference materials

Justin Zook and Marc Salit

Scope of NIST work

•  Human Whole Genome RMs
•  Synthetic DNA constructs
•  Microbial Whole Genome RMs

RM Development Process

1.  Select and procure materials
2.  Characterize materials
3.  Process and integrate data from multiple platforms
4.  Confirm selected genotypes
5.  Write Report of Analysis
6.  Develop methods for end users to obtain performance metrics from the materials

Proposed Timeline for Human RMs

Proposed Timeline for Synthetic Structures

Gantt chart (timeline axis 2011 to 2015), listed as Title and Effort; entries marked ** are milestones:

1) Human RMs  535w
1.1) Select/Procure human DNA for RM  32w
1.2) **NIST receives packaged DNA for RM/SRM
1.3) Develop bioinformatics pipeline for data integration  97w
1.4) Human Primary Sequencing  147w
1.5) Human Homogeneity assessment  8w
1.6) Analyze homogeneity data and produce preliminary SNP calls for RM  10w
1.7) Write human RM Report of Analysis  10w
1.8) Process Human RM for release  24w
1.9) **Human RM officially released
1.10) Human Sequencing data integration  25w
1.11) Human Validation  20w
1.12) Human other characterization methods  48w
1.13) Analyze validation data and refine sequencing calls  12w
1.14) Develop pipeline for SVs and test  40w
1.15) Write Human SRM Report of Analysis  8w
1.16) Process Human SRM for release  24w
1.17) **Human SRM officially released
1.18) Procure local data storage  10w
1.19) Procure Bioinformatics data analysis tools  10w
1.20) Procure Automated sample prep instrumentation  10w
2) Microbial RMs  279w
2.1) Select/Procure microbial DNA for RMs  31w
2.2) Microbial Primary Sequencing  124w
2.3) Microbial Homogeneity assessment  6w
2.4) Microbial Sequencing data integration  40w
2.4.1) Mapping/Alignment  10w
2.4.2) Variant calling  12w
2.4.3) Form consensus variant calls  12w

Proposed Characterization Methods for Whole Genomes

Whole Genome Sequencing
•  ABI 5500 (1kb, 6kb, and 10kb mate-pair libraries)
•  Illumina
•  Complete Genomics
•  Upcoming technologies?
    –  Ion Proton?
    –  Oxford Nanopore?

Other
•  Genotyping microarrays
•  Array CGH
•  Targeted sequencing
•  Fosmid sequencing?
•  Optical Mapping?

•  3x replication of sequencing (3 library preps)

Pedigree figure: Father, Mother; Husband, NA12878; Son, Daughter

Integration of Existing Data to Form Consensus Genotype Calls

1.  Find all possible variant sites
2.  Find sites where all datasets agree
3.  Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias
4.  For each site, remove datasets with decreasingly atypical characteristics until all datasets agree
5.  Even if all datasets agree, identify them as uncertain if few have typical characteristics
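
The integration steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the `Call` record, the single numeric `atypicality` score per dataset and site, and the `atypical_cutoff` and `min_typical` thresholds are all hypothetical stand-ins for whatever bias annotations the real integration uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    dataset: str        # name of the sequencing dataset
    genotype: str       # e.g. "0/0", "0/1", "1/1"
    atypicality: float  # hypothetical score; higher means more atypical


def consensus_genotype(calls, atypical_cutoff=1.0, min_typical=2):
    """Return the agreed genotype for one site, or "uncertain"."""
    # Most typical calls first, so the most atypical call sits at the end.
    remaining = sorted(calls, key=lambda c: c.atypicality)
    # Drop the most atypical remaining dataset until all genotypes agree.
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()
    # Even when the datasets agree, require enough typical-looking support.
    typical = [c for c in remaining if c.atypicality < atypical_cutoff]
    if len(typical) < min_typical:
        return "uncertain"
    return remaining[0].genotype


def integrate(datasets, atypical_cutoff=1.0, min_typical=2):
    """datasets: {name: {site: (genotype, atypicality)}} -> {site: consensus}."""
    # Candidate sites are the union of sites reported by any dataset.
    sites = sorted(set().union(*(per_site.keys() for per_site in datasets.values())))
    consensus = {}
    for site in sites:
        calls = [Call(name, *per_site[site])
                 for name, per_site in datasets.items() if site in per_site]
        consensus[site] = consensus_genotype(calls, atypical_cutoff, min_typical)
    return consensus
```

In this sketch a dataset that reports nothing at a site simply does not vote there; a real pipeline would more likely genotype every dataset at every candidate site before comparing them.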

Consensus has lower FN rate than individual datasets

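Rows: HiSeq – GATK. Columns: Illumina Omni SNP Array.

                              Homozygous Reference  Heterozygous    Homozygous Variant  Uncertain
Homozygous Reference/No Call  1.45M                 7.24k (1.34%)   5.28k (0.65%)       N/A
Heterozygous                  196 (0.03%)           411k (60.7%)    133 (0.02%)         N/A
Homozygous Variant            154 (0.02%)           150 (0.02%)     249k (37.0%)        N/A

“FNs”: Homozygous Reference/No Call row under the Heterozygous and Homozygous Variant columns. “FPs*”: Heterozygous and Homozygous Variant rows under the Homozygous Reference column.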

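Rows: Integrated Consensus Genotypes. Columns: Illumina Omni SNP Array.

                              Homozygous Reference  Heterozygous    Homozygous Variant  Uncertain
Homozygous Reference/No Call  1.45M                 613 (0.09%)     977 (0.15%)         N/A
Heterozygous                  241 (0.04%)           414k (61.5%)    173 (0.03%)         N/A
Homozygous Variant            152 (0.02%)           61 (0.01%)      249k (36.9%)        N/A
Uncertain                     5458 (0.81%)          3421 (0.51%)    4808 (0.71%)        N/A

“FNs”: Homozygous Reference/No Call row under the Heterozygous and Homozygous Variant columns. “FPs*”: Heterozygous and Homozygous Variant rows under the Homozygous Reference column.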

* Note that most or all of the putative FPs seem to actually be FNs on the microarray
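
A cross-tabulation like the ones on this slide can be tallied with a short Python sketch. The category names (`"hom_ref"`, `"het"`, `"hom_var"`, `"hom_ref_or_no_call"`) and the function names are assumptions made for illustration, and the code only counts sites that appear on the array, which is one reason an array-based comparison cannot see errors elsewhere in the genome.

```python
from collections import Counter

NO_CALL = "hom_ref_or_no_call"  # assumed label for sites the sequencing call set omits


def cross_tabulate(seq_calls, array_calls):
    """Count (sequencing genotype, array genotype) pairs over the array's sites."""
    table = Counter()
    for site, array_gt in array_calls.items():
        # A site missing from the sequencing calls is homozygous reference or no call.
        seq_gt = seq_calls.get(site, NO_CALL)
        table[(seq_gt, array_gt)] += 1
    return table


def putative_errors(table):
    """Return (putative FNs, putative FPs) relative to the array genotypes."""
    variant = ("het", "hom_var")
    # Putative FN: the array sees a variant, sequencing reports reference or nothing.
    fn = sum(n for (seq_gt, array_gt), n in table.items()
             if seq_gt == NO_CALL and array_gt in variant)
    # Putative FP: sequencing sees a variant, the array reports homozygous reference.
    fp = sum(n for (seq_gt, array_gt), n in table.items()
             if seq_gt in variant and array_gt == "hom_ref")
    return fn, fp
```

The word "putative" matters here: as the footnote says, a disagreement counted as an FP for sequencing may in fact be an FN on the microarray.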

SNP arrays overestimate performance

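Rows: HiSeq – GATK. Columns: Illumina Omni SNP Array.

                              Homozygous Reference  Heterozygous    Homozygous Variant  Uncertain
Homozygous Reference/No Call  1.45M                 7.24k (1.34%)   5.28k (0.65%)       N/A
Heterozygous                  196 (0.03%)           411k (60.7%)    133 (0.02%)         N/A
Homozygous Variant            154 (0.02%)           150 (0.02%)     249k (37.0%)        N/A

“FNs”: Homozygous Reference/No Call row under the Heterozygous and Homozygous Variant columns. “FPs*”: Heterozygous and Homozygous Variant rows under the Homozygous Reference column.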

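Rows: HiSeq – GATK. Columns: Integrated Consensus Genotypes.

                              Homozygous Reference  Heterozygous    Homozygous Variant  Uncertain
Homozygous Reference/No Call  1.52M                 157k (4.68%)    30.3k (0.90%)       4.17M
Heterozygous                  47 (0.00%)            1.90M (56.4%)   34 (0.00%)          16.9k (0.50%)
Homozygous Variant            1 (0.00%)             298 (0.01%)     1.19M (35.3%)       73.3k (2.18%)

“FNs”: Homozygous Reference/No Call row under the Heterozygous and Homozygous Variant columns. “FPs”: Heterozygous and Homozygous Variant rows under the Homozygous Reference column.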

Samtools has higher FP and lower FN than GATK

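Rows: HiSeq – samtools. Columns: Integrated Consensus Genotypes.

                              Homozygous Reference  Heterozygous    Homozygous Variant  Uncertain
Homozygous Reference/No Call  1.51M                 49.6k (1.47%)   6.74k (0.20%)       3.93M
Heterozygous                  3141 (0.09%)          2.00M (59.6%)   74 (0.00%)          175k (5.19%)
Homozygous Variant            21 (0.00%)            777 (0.02%)     1.21M (36.0%)       192k (5.71%)

“FNs”: Homozygous Reference/No Call row under the Heterozygous and Homozygous Variant columns. “FPs”: Heterozygous and Homozygous Variant rows under the Homozygous Reference column.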

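Rows: HiSeq – GATK. Columns: Integrated Consensus Genotypes.

                              Homozygous Reference  Heterozygous    Homozygous Variant  Uncertain
Homozygous Reference/No Call  1.52M                 157k (4.68%)    30.3k (0.90%)       4.17M
Heterozygous                  47 (0.00%)            1.90M (56.4%)   34 (0.00%)          16.9k (0.50%)
Homozygous Variant            1 (0.00%)             298 (0.01%)     1.19M (35.3%)       73.3k (2.18%)

“FNs”: Homozygous Reference/No Call row under the Heterozygous and Homozygous Variant columns. “FPs”: Heterozygous and Homozygous Variant rows under the Homozygous Reference column.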

Performance Metrics: Characteristics of Mis-calls

Figure: panels arranged by Consensus Genotypes (Hom. Ref., Heterozygous, Hom. Variant, Uncertain) and HiSeq/GATK genotypes (Hom. Ref./No call, Heterozygous, Hom. Variant); characteristics shown: QUAL/Depth of Coverage, Strand Bias, ...

Challenges with assessing performance

•  All variant types are not equal
•  Nearby variants are often difficult to align
•  All regions of the genome are not equal
    –  Homopolymers, STRs, duplications
    –  Can be similar or different in different genomes
•  Labeling difficult variants as “uncertain” in the Reference Material leads to higher apparent accuracy when assessing performance
•  Genotypes fall in 3+ categories (not positive/negative)
•  It’s important to consider data from multiple platforms and library preparations when characterizing a Reference Material
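
The point about “uncertain” labels raising apparent accuracy can be made concrete with a small calculation. The sketch below is a hypothetical illustration: the `Stratum` record, the function name, and any counts passed to it are made up for the example and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stratum:
    name: str                # e.g. "easy" or "difficult" regions (hypothetical labels)
    sites: int               # number of sites assessed in this stratum
    correct: int             # sites where the test method matches the true genotype
    labeled_uncertain: bool  # whether the Reference Material marks these sites uncertain


def apparent_accuracy(strata, exclude_uncertain):
    """Fraction of correct calls, optionally skipping sites labeled uncertain."""
    kept = [s for s in strata if not (exclude_uncertain and s.labeled_uncertain)]
    total = sum(s.sites for s in kept)
    if total == 0:
        return float("nan")  # nothing left to assess
    return sum(s.correct for s in kept) / total
```

With made-up counts of 8991 correct out of 9000 easy sites and 900 correct out of 1000 difficult sites labeled uncertain, `apparent_accuracy(strata, exclude_uncertain=True)` gives 0.999 while `exclude_uncertain=False` gives 0.9891, so the method looks better when it is assessed only where the Reference Material is confident.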
