Aug2013 real time genomics trio pedigree analysis
- 1. NA12878 Trio/Pedigree Analysis
Francisco M. De La Vega, D.Sc., VP Genome Science
© 2013 Real Time Genomics, Inc.
- 2. Leveraging trio information
• GiaB has selected reference materials in the form of father,
mother, offspring trios
• The goal was to leverage the Mendelian inheritance patterns
to:
– Identify variant genotype errors that are inconsistent with
Mendelian inheritance
– Remove these errors from the reference baseline calls
• However, if variant identification methods do not use pedigree
information directly and jointly analyze the trio alignments,
an opportunity to improve the genotype calls is missed
• We focused on using the RTG Family caller to better leverage
the shared information in the trios and improve the call set,
while reducing Mendelian-inconsistent genotype errors
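The Mendelian consistency check behind these bullets can be sketched in a few lines. This is a minimal illustration, not the RTG implementation: the function name and the tuple representation of genotypes are assumptions, and it covers only autosomal diploid sites with no de novo events.

```python
from itertools import product

def is_mendelian_consistent(father, mother, child):
    """Return True if the child's unordered diploid genotype can be formed
    by taking one allele from the father and one from the mother.
    Hypothetical helper: autosomal sites only, de novo events ignored."""
    child_sorted = sorted(child)
    return any(sorted((f, m)) == child_sorted for f, m in product(father, mother))

# Genotypes are tuples of allele strings.
print(is_mendelian_consistent(("A", "A"), ("C", "C"), ("A", "C")))  # True
print(is_mendelian_consistent(("A", "A"), ("A", "A"), ("A", "C")))  # False
```

A call set filtered this way only removes errors after the fact; it does not use the family structure to improve the calls themselves, which is the point of the last two bullets.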
- 3. Variant calling can be improved by jointly analyzing related samples
[Figure: aligned reads and called genotypes for each member of a trio; annotation: Shared haplotypes]
- 4. Variant calling can be improved by jointly analyzing related samples
[Figure: aligned reads and called genotypes for each member of a trio; annotations: Mendelian variant segregation, Shared haplotypes]
- 5. Mendelian inconsistency
[Figure: aligned reads and called genotypes for a trio (A/A, C/C, and A/C shown), with one genotype flagged Low QV]
- 6. Joint trio analysis corrects Mendelian errors
[Figure: aligned reads and called genotypes for a trio (A/A, C/C, and A/C shown), with one genotype flagged Good QV]
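One way to see how joint analysis can rescue a low-quality call is a toy Bayesian trio caller. The sketch below is an illustration under stated assumptions, not the RTG Family caller: the likelihood values, the flat parental prior, and the `de_novo` floor are all made up for the example.

```python
from itertools import product

AA, AC, CC = ("A", "A"), ("A", "C"), ("C", "C")
GENOTYPES = [AA, AC, CC]  # unordered biallelic genotypes

def transmission_prob(father, mother, child, de_novo=1e-8):
    """P(child | father, mother) under Mendelian inheritance. The small
    floor (an assumed value) keeps inconsistent trios possible but rare."""
    hits = sum(sorted((f, m)) == sorted(child) for f, m in product(father, mother))
    return max(hits / 4.0, de_novo)

def joint_trio_call(lik_father, lik_mother, lik_child):
    """Return the (father, mother, child) genotypes maximizing the joint
    posterior, given per-sample likelihoods P(reads | genotype) and a
    flat prior on the parental genotypes."""
    best, best_score = None, -1.0
    for f, m, c in product(GENOTYPES, repeat=3):
        score = (transmission_prob(f, m, c)
                 * lik_father[f] * lik_mother[m] * lik_child[c])
        if score > best_score:
            best, best_score = (f, m, c), score
    return best

# Hypothetical likelihoods: confident parents, ambiguous child.
lik_father = {AA: 0.98, AC: 0.02, CC: 1e-6}
lik_mother = {AA: 1e-6, AC: 0.02, CC: 0.98}
lik_child = {AA: 0.55, AC: 0.44, CC: 0.01}

singleton_child = max(lik_child, key=lik_child.get)
joint = joint_trio_call(lik_father, lik_mother, lik_child)
print(singleton_child)  # ('A', 'A'): inconsistent with A/A x C/C parents
print(joint[2])         # ('A', 'C'): the Mendelian-consistent call
```

Called alone, the child's slightly better-supported A/A genotype produces a Mendelian error; called jointly, the parents' confident genotypes shift the child to A/C.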
- 7. NA12878 calls from trio calling
• Comparing offspring variants from singleton vs. pedigree calling
– Both showing good quality metrics
• Using family information, more good calls can be made and dubious calls are downgraded
Variant statistics
NA12878 call set | SNVs | Indels | MNPs | SNV Het/Hom | Ti/Tv | % dbSNP (r129)
RTG single | 3,329,797 | 558,242 | 31,070 | 1.55 | 2.11 | 90.8%
RTG trio | 3,363,619 | 595,030 | 33,686 | 1.57 | 2.11 | 90.4%
GATK/VQSR | 3,263,289 | 610,837 | N/A | 1.51 | 2.09 | 91.7%
Data: WGS 2x100bp >50X Illumina Platinum Genomes data (ENA Acc. No. ERP001960). RTG AVR score cut-off 0.15; GATK v1.7 & BWA 0.6.1.
[Figure: Venn diagram of Family vs. Singleton calls, labeled 142,848, 68,000, and 3,849,457; trio pedigree NA12878, NA12891, NA12892]
- 8. NA12878 vs reference datasets
NA12878 call set | 1kP OMNI Poly (TP%) | 1kP OMNI Mono (FP%) | Get-RM¶ (TP%) | GiaB (TP%) | GiaB-BED (TP%)
RTG single | 97.5% | 0.10% | 97.4% | N/A | N/A
RTG trio | 97.5% | 0.24% | 97.0% | 90.5% | 94.1%
GATK/VQSR | 97.8% | 0.17% | 87.8% | 88.4% | 92.5%
§ Relative to dbSNP 137; statistics for SNVs only.
¶ Get-RM consistent high-quality variants; n=498
– 1000 Genomes Illumina OMNI SNP array
• Polymorphic sites – TP proxy
• Monomorphic sites – FP proxy
– Get-RM high confidence call set
– GiaB high confidence calls in BED region
[Figure: trio pedigree NA12878, NA12891, NA12892]
- 9. ROC Trio calls vs. GiaB baseline (BED)
[Figure: ROC curves. RTG snpsimeval tool; SNV/indel/MNP; zygosity match]
- 10. ROC Trio calls vs. GiaB baseline
[Figure: ROC curves. RTG snpsimeval tool; SNV/indel/MNP; zygosity match]
- 11. ROC Trio calls vs. CGI baseline
[Figure: ROC curves. RTG snpsimeval tool; SNV/indel/MNP; zygosity match]
- 12. Mendelian inconsistency errors
RTG family caller reduces Mendelian Inheritance Errors (MIE) over 60X vs. RTG singleton calling (over 70X vs. GATK/VQSR)
[Figure: bar chart of MIE counts on a log scale (1 to 1,000,000): RTG single 335,625; RTG trio 4,870; GATK/VQSR 351,904]
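The fold reductions quoted on this slide can be checked from the three MIE counts. The short sketch below assumes the counts pair with the call-set labels in the order they appear on the chart (RTG single, RTG trio, GATK/VQSR).

```python
# MIE counts as read from the slide; pairing with labels is assumed from their order.
mie_counts = {"RTG single": 335_625, "RTG trio": 4_870, "GATK/VQSR": 351_904}

def fold_reduction(baseline, improved):
    """How many times fewer errors the improved call set has."""
    return baseline / improved

fold_vs_single = fold_reduction(mie_counts["RTG single"], mie_counts["RTG trio"])
fold_vs_gatk = fold_reduction(mie_counts["GATK/VQSR"], mie_counts["RTG trio"])
print(f"vs RTG single: {fold_vs_single:.1f}x")  # 68.9x
print(f"vs GATK/VQSR:  {fold_vs_gatk:.1f}x")    # 72.3x
```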
- 13. Pattern #1: Heterozygous variant
[Figure: pedigree labeled TrioCalling. NA12878 with parents NA12891 and NA12892; NA12877 with parents NA12889 and NA12890; offspring NA12879 to NA12888 and NA12893. Genotype labels shown: 0/1; 0/1 and 0/0; a mix of 0/0 and 0/1 among the offspring]
- 14. Segregation of heterozygous variants
[Figure: four histograms (SNV, MNP, indel, All Variants) of variant count against the number of offspring segregating the variant, 1 to 11]
Segregation of NA12878 heterozygous variants called as family, GQ>50, homozygous reference in other parent.
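The x-axis of these histograms can be reproduced by counting, for each variant, how many of the 11 offspring carry it. The sketch below assumes that "segregating" means carrying at least one non-reference allele and that genotypes are VCF-style strings; the helper name and the example genotypes are hypothetical.

```python
from collections import Counter

def count_offspring_with_alt(child_genotypes):
    """Count children whose VCF-style genotype (e.g. '0/1', '1|1') carries
    at least one non-reference allele; missing calls ('./.') are skipped."""
    count = 0
    for gt in child_genotypes:
        alleles = gt.replace("|", "/").split("/")
        if any(a not in ("0", ".") for a in alleles):
            count += 1
    return count

# Two hypothetical variants, each with genotypes for 11 offspring.
variant_a = ["0/0"] * 5 + ["0/1"] * 6
variant_b = ["0/1"] * 10 + ["./."]

histogram = Counter(count_offspring_with_alt(v) for v in (variant_a, variant_b))
print(sorted(histogram.items()))  # [(6, 1), (10, 1)]
```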
- 15. Pattern #2: Homozygous-alt variant
[Figure: pedigree labeled TrioCalling. NA12878 with parents NA12891 and NA12892; NA12877 with parents NA12889 and NA12890; offspring NA12879 to NA12888 and NA12893. Genotype labels shown: 0/1; 1/1 and 0/0; 0/1 for the offspring]
- 16. Segregation of homo-alt variants
[Figure: four histograms (SNV, MNP, indel, All Variants) of variant count against the number of offspring segregating the variant, 1 to 11]
Segregation of NA12878 homozygous alternative variants called as family, GQ>50, homozygous reference in other parent.
- 17. False positive estimate by segregation
GT Type | Class | All variants | SNV | MNP | indel
Het | TP (10-11) | 123672 | 110262 | 693 | 12717
Het | FP (1-8) | 1901 | 1000 | 47 | 854
Het | FP% | 1.40% | 0.88% | 1.42% | 5.67%
Homo-alt | TP (2-10) | 373260 | 329642 | 2258 | 41360
Homo-alt | FP (1,11) | 4457 | 3672 | 36 | 749
Homo-alt | FP% | 1.18% | 1.10% | 1.57% | 1.78%
Overall | TP | 496932 | 439904 | 2951 | 54077
Overall | FP | 6358 | 4672 | 83 | 1603
Overall | FP% | 1.26% | 1.05% | 2.74% | 2.88%
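The FP% rows for the Homo-alt and Overall groups are reproduced by FP / (TP + FP), as the sketch below shows. The formula is inferred from the numbers rather than stated on the slide. The Het row's percentages do not reproduce exactly from its listed counts under this formula, so its denominator may differ, and it is left out here.

```python
def fp_percent(tp, fp):
    """False positives as a percentage of all calls in the group."""
    return 100.0 * fp / (tp + fp)

# (TP, FP) counts copied from the slide.
homo_alt = {"All variants": (373260, 4457), "SNV": (329642, 3672),
            "MNP": (2258, 36), "indel": (41360, 749)}
overall = {"All variants": (496932, 6358), "SNV": (439904, 4672),
           "MNP": (2951, 83), "indel": (54077, 1603)}

for group_name, group in (("Homo-alt", homo_alt), ("Overall", overall)):
    for label, (tp, fp) in group.items():
        print(f"{group_name} {label}: {fp_percent(tp, fp):.2f}%")
```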
- 18. Data imputation by pedigree caller
• For genomes with no data, use population priors
– With care, the caller can iterate over the offspring and then
each of the parents independently
– This avoids exponential explosion, so the whole extended
family can be called in one step
- 19. Imputation of family members with no data
[Figure: simulated data; true positives and false positives for 1 offspring, 2 offspring, 4 offspring, and 4 offspring + father]
- 20. ROC vs NA12878 imputed baseline
[Figure: ROC curves. RTG snpsimeval tool; SNV/indel/MNP; zygosity match]
- 21. de novo mutation identification
de novo Mutation Accuracy (NA12878)
Call set | de novo candidates | de novo germline* | de novo somatic* | TP/FP
Singleton calls | 16,902 | 49 (100%) | 941 (99%) | 1:17
Trio calls | 2,205 | 49 (100%) | 941 (99%) | 1:2.2
*Sensitivity vs. Conrad et al. (2011) validated dataset of germline and somatic cell line de novo mutations.
– Uses the parental genomes to identify & score de novo mutations in offspring
– Greater than 7X improvement in precision to find de novo mutations vs. naïve methods
[Figure: trio pedigree NA12878, NA12891, NA12892]
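The "greater than 7X" precision figure follows from the candidate counts in the table. The sketch below assumes the recovered validated mutations (49 germline plus 941 somatic) are the only true positives among the candidates, which is a simplification: unvalidated candidates are not necessarily false.

```python
validated_tp = 49 + 941  # validated germline + somatic de novo mutations recovered
candidates = {"Singleton calls": 16_902, "Trio calls": 2_205}

# Precision here is validated true positives over all de novo candidates.
precision = {name: validated_tp / n for name, n in candidates.items()}
improvement = precision["Trio calls"] / precision["Singleton calls"]

for name, n in candidates.items():
    print(f"{name}: 1 validated call per {n / validated_tp:.1f} candidates")
print(f"Precision improvement: {improvement:.1f}x")  # 7.7x
```

The per-candidate ratios this prints (about 1 in 17.1 and 1 in 2.2) match the 1:17 and 1:2.2 values in the TP/FP column.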
- 22. Status
• Working through the complete trio datasets to
produce joint pedigree calls for the NA12878 trio
– Aiming for a trio call set and another that
includes full Platinum pedigree data
– There is disproportionately more data for
NA12878 than for her parents or offspring
• Comprehensive segregation analysis that
includes all Mendelian patterns
• Phasing analysis to identify variants that are
inconsistent with transmitted phases
- 23. Issues
• How to integrate pedigree calls with other data?
– Variants that segregate appropriately are
candidates for inclusion in the baseline
– Variants that don’t segregate appropriately are
candidates for removal from the baseline
– Improvement of baseline genotypes using
pedigree-based genotypes
• Use of the imputed NA12878 baseline
• Creation of a more inclusive baseline for ROC
curves to compare new methods and select
thresholds
- 24. Acknowledgements
• RTG team at Hamilton, New Zealand
– Led by John Cleary, CTO
• RTG team at San Bruno, CA
– Sahar Malakshah
– Minita Shah
– Brian Hilbush
• Michael Eberle, Illumina, Inc. – Platinum Data
• Justin Zook, NIST
• 1000 Genomes Project
© 2013 Real Time Genomics, Inc. All rights reserved. US Patent 7,640,256. Other patents pending.
For research use only. Not for diagnostic applications.