Using Highly Confident Genotype
Calls for NA12878 to understand
sequencing accuracy
Genome in a Bottle Consortium
Justin Zook, Ph.D., and Marc Salit, Ph.D.
National Institute of Standards and Technology
1
Why create a set of highly confident
genotypes for a genome?
• Current validation methods have limited purview or accuracy
• Sanger confirmation
– Limited by number of sites (and sometimes it’s wrong)
• High depth NGS confirmation
– May have same systematic errors
• Genotyping microarrays
– Limited to known (easier) variants
– Problems with neighboring variants, homopolymers, duplications
• Mendelian inheritance
– Can’t account for some systematic errors
• Simulated data
– Generally not very representative of errors in real data
• Ti/Tv
– Varies by region of genome, and only gives overall statistic
2
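As a small illustration of the Ti/Tv statistic mentioned above, the sketch below computes the transition/transversion ratio from a list of SNPs. The function name and the toy SNP list are assumptions made for this example, not part of the original slides. Because the result is a single number over all sites, it cannot say which individual calls are wrong.

```python
# Transitions swap purines (A<->G) or pyrimidines (C<->T); everything else is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def titv_ratio(snps):
    """Return the transition/transversion ratio for an iterable of (ref, alt) SNPs."""
    snps = list(snps)
    ti = sum(1 for ref, alt in snps if (ref, alt) in TRANSITIONS)
    tv = sum(1 for ref, alt in snps if ref != alt and (ref, alt) not in TRANSITIONS)
    if tv == 0:
        raise ValueError("no transversions; ratio is undefined")
    return ti / tv

# Hypothetical toy data: 3 transitions and 2 transversions.
example_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
print(titv_ratio(example_snps))
```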
Goals for Data Integration
• Carefully define highly confident regions of the
genome
– distinguish between Hom Ref and Uncertain
• ~0 false positive AND false negative calls in
confident regions
• Include as much of the genome as possible in the
confident regions (i.e., don’t just take the
intersection)
• Avoid bias towards any particular platform
• Avoid bias towards any particular bioinformatics
algorithms
3
Integrate 12 Datasets from 5 platforms
4
Integration of Data to
Form Highly Confident Genotype Calls
• Candidate variants: Find all possible variant sites
• Confident variants: Find highly confident sites across multiple datasets
• Find characteristics of bias: Identify sites with atypical characteristics signifying
sequencing, mapping, or alignment bias
• Arbitration: For each site, remove datasets with decreasingly atypical
characteristics until all datasets agree
• Confidence Level: Even if all datasets agree, identify them as uncertain if
few have typical characteristics, or if they fall in known
segmental duplications or long repeats
5
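The arbitration step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the consortium's implementation: the `DatasetCall` record, the single `atypicality` score per dataset, and the `min_supporting` threshold are all hypothetical simplifications of the multiple annotations described in the slides.

```python
from dataclasses import dataclass

@dataclass
class DatasetCall:
    dataset: str        # hypothetical dataset label
    genotype: str       # e.g. "0/0", "0/1", "1/1"
    atypicality: float  # hypothetical score: higher means more evidence of bias

def arbitrate(calls, min_supporting=2):
    """Drop the most atypical datasets one at a time until the rest agree.

    Returns (genotype, supporting_dataset_names), or ("uncertain", [])
    when fewer than min_supporting datasets remain in agreement.
    """
    remaining = sorted(calls, key=lambda c: c.atypicality)  # most typical first
    while remaining:
        genotypes = {c.genotype for c in remaining}
        if len(genotypes) == 1:
            if len(remaining) < min_supporting:
                break
            return genotypes.pop(), [c.dataset for c in remaining]
        remaining.pop()  # remove the most atypical remaining dataset
    return "uncertain", []

site_calls = [
    DatasetCall("A", "0/1", 0.1),
    DatasetCall("B", "0/1", 0.2),
    DatasetCall("C", "0/0", 0.9),  # disagrees and looks atypical
]
print(arbitrate(site_calls))
```

In this toy case dataset C is removed first because it has the highest atypicality score, after which A and B agree. A site where agreement is reached only after nearly every dataset has been removed is reported as uncertain rather than confident.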
Characteristics of Sequence
Data/Genotype associated with bias
• Systematic sequencing
errors
– Strand bias
– Base Quality Rank Sum
Test
• Local Alignment
problems
– Distance from end of
read
– Read Position Rank Sum
– HaplotypeScore
• Mapping problems
– Mapping Quality
– Higher (or lower) than
expected coverage –
CNV
– Length of aligned reads
• Abnormal allele balance
or Quality/Depth
– Allele Balance
– Quality/Depth
6
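To make one of these characteristics concrete, the sketch below scores strand bias with a two-sided Fisher's exact test on reference/alternate read counts per strand, then Phred-scales the p-value. The slides do not say how strand bias was computed, so treat this as one common approach rather than the authors' method; the function names and read counts are hypothetical.

```python
from math import comb, log10

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))

def strand_bias_phred(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Phred-scaled strand-bias score; larger values mean stronger bias."""
    p = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    return -10 * log10(max(p, 1e-300))

# Hypothetical counts: the alternate allele is seen only on reverse-strand reads.
print(strand_bias_phred(20, 20, 0, 15))
# Balanced counts give a p-value of 1 and a score near 0.
print(strand_bias_phred(10, 10, 10, 10))
```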
Regions excluded as uncertain
7
More recently, we also exclude homopolymers and long STRs, and 30 bp on each side of
uncertain heterozygous and homozygous variant positions
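The 30 bp exclusion around uncertain variant positions can be sketched as interval padding followed by merging. The function name, the 0-based half-open coordinate convention (as in BED files), and the example positions are assumptions for illustration only.

```python
def padded_exclusions(positions, pad=30):
    """Return merged half-open (start, end) intervals covering `pad` bp on
    each side of every 0-based position on a single chromosome."""
    intervals = sorted((max(0, p - pad), p + pad + 1) for p in positions)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent: extend the previous interval.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Hypothetical uncertain positions; the first two are close enough to merge.
print(padded_exclusions([100, 120, 500]))
```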
Example of Arbitration: SSE suspected
from strand bias
[Figure: read alignments from Platform A and Platform B at a homopolymer, showing strand bias (SNP overrepresented on reverse strands)]
8
Verification of “Highly Confident”
Genotype accuracy
• Sanger sequencing
– 100% accuracy but only 100s of sites
• X Prize Fosmid sequencing
– Artifacts at end of fosmids
• Microarrays
– Differences appear to be FP or FN in arrays
• Broad 250bp HaplotypeCaller
– Very highly concordant, except a few systematic errors and
homopolymers
• Platinum genomes pedigree SNPs
– Some systematic errors are inherited; different representations of
complex variants
• Real Time Genomics Trio SNPs and indels
– Some interesting sites are called by the RTG complex caller but have no
evidence in mapped reads
9
GCAT – Interactive Performance
Metrics
• NIST is working with
GCAT to use our highly
confident variant calls
• Assess performance of
many combinations of
mappers and variant
callers
• www.bioplanet.com/gcat
10
Why do calls differ from our highly
confident genotypes?
Calls not in Integration
• Platform-specific systematic
sequencing errors for SNPs
• Analysis-specific
• Difficult to map regions
• Indels in long
homopolymers
Calls specific to Integration
• Different complex variant
representation
• Some are incorrectly
filtered as suspected FPs
11
Illumina-specific Systematic Sequencing Errors
12
Complex variants have multiple correct
representations
[Figure: alignments from BWA, ssaha2, CGTools, and Novoalign against the reference (Ref), showing a T insertion and a TCTCT insertion]

                              FP SNPs       FP MNPs      FP indels
Traditional comparison        0.38% (610)   100% (915)   6.5% (733)
Comparison with realignment   0.15% (249)   4.2% (38)    2.6% (298)
13
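One way to see why different representations of a complex variant can both be correct is to apply each set of variants to the reference and compare the resulting haplotype sequences. The sketch below does this on toy sequences; the function names and example alleles are hypothetical, and this is not the realignment method behind the numbers in the table.

```python
def apply_variants(ref, variants):
    """Apply non-overlapping (pos, ref_allele, alt_allele) variants to `ref`.

    Positions are 0-based. Raises ValueError if a variant overlaps an earlier
    one or its reference allele does not match the reference sequence.
    """
    out, cursor = [], 0
    for pos, ref_allele, alt_allele in sorted(variants):
        if pos < cursor or ref[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError(f"variant at {pos} overlaps or mismatches the reference")
        out.append(ref[cursor:pos])
        out.append(alt_allele)
        cursor = pos + len(ref_allele)
    out.append(ref[cursor:])
    return "".join(out)

def same_haplotype(ref, variants_a, variants_b):
    """True if both variant sets produce the same sequence."""
    return apply_variants(ref, variants_a) == apply_variants(ref, variants_b)

# A CT insertion in a CT repeat can be placed at either end of the repeat.
print(same_haplotype("ACTCTCTG", [(0, "A", "ACT")], [(6, "T", "TCT")]))
# One MNP versus two adjacent SNPs.
print(same_haplotype("ACGT", [(1, "CG", "TA")], [(1, "C", "T"), (2, "G", "A")]))
```

A position-by-position comparison would count both pairs as mismatches even though the underlying sequences are identical, which is consistent with the drop in apparent false positives after realignment shown in the table.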
Uncertain variants: Difficult to map regions
14
Uncertain variants: Indels in long homopolymers
15
Uncertain variants: Regions with “decoy sequence”
16
Challenges with assessing
performance
• All variant types are not
equal
• Nearby variants are often
difficult to align
– Multiple representations
• All regions of the genome
are not equal
– Homopolymers, STRs, duplications
– Can be similar or different
in different genomes
• Labeling difficult variants
as uncertain leads to
higher apparent accuracy
when assessing
performance
• Genotypes fall in 3+
categories (not
positive/negative)
– standard diagnostic
accuracy measures not
well posed
17
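Because genotypes fall into more than two categories, a full concordance table can be more informative than a single true/false-positive count. The sketch below tallies (benchmark genotype, called genotype) pairs; the category labels, site keys, and the "no-call" placeholder are assumptions for this example.

```python
from collections import Counter

def genotype_concordance(truth, calls):
    """Count (truth_genotype, called_genotype) pairs over sites in `truth`.

    Sites absent from `calls` are tallied as "no-call".
    """
    return Counter((truth_gt, calls.get(site, "no-call"))
                   for site, truth_gt in truth.items())

# Hypothetical toy data keyed by site.
truth = {"chr1:100": "het", "chr1:200": "hom-var", "chr1:300": "hom-ref", "chr1:400": "het"}
calls = {"chr1:100": "het", "chr1:200": "het", "chr1:300": "hom-ref"}

table = genotype_concordance(truth, calls)
for (truth_gt, called_gt), count in sorted(table.items()):
    print(truth_gt, called_gt, count)
```

The hom-var site called as het is neither a clean false positive nor a clean false negative: the variant was detected but the genotype is wrong, which is why the standard two-category accuracy measures do not fit well.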
How to incorporate inheritance in
multi-platform integration
• Adding confidence
– Site follows expected
inheritance pattern (and
not all homozygous)
• Identifying errors
– Mendelian inheritance
errors
– Sites where all family
members are
heterozygous
– Some CNVs
• Limitations of
inheritance
– All homozygous sites can
still be systematic errors
– Some errors can follow
inheritance pattern (e.g.,
incorrect alignment
around indel, some
CNVs)
18
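A basic Mendelian consistency check for a trio might look like the sketch below. Genotypes are written as tuples of allele indices; the function names are hypothetical, and the check covers only diploid autosomal sites.

```python
from itertools import product

def is_mendelian_consistent(child, mother, father):
    """True if the child's genotype can be formed from one maternal and one
    paternal allele. Genotypes are tuples of allele indices, e.g. (0, 1)."""
    child = tuple(sorted(child))
    return any(tuple(sorted(pair)) == child for pair in product(mother, father))

def all_heterozygous(*genotypes):
    """True if every family member is heterozygous at the site."""
    return all(len(set(gt)) > 1 for gt in genotypes)

# Consistent: child inherits 0 from the mother and 1 from the father.
print(is_mendelian_consistent((0, 1), (0, 0), (1, 1)))
# Inconsistent: the mother has no allele 1 to pass on.
print(is_mendelian_consistent((1, 1), (0, 0), (0, 1)))
# Consistent with inheritance, yet suspicious because everyone is heterozygous.
trio = [(0, 1), (0, 1), (0, 1)]
print(is_mendelian_consistent(*trio), all_heterozygous(*trio))
```

The last case shows the limitation noted above: a systematic error that makes every family member look heterozygous still passes the inheritance check, so it needs a separate flag.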
Availability of data, genotype calls, and
methods
• Data for NA12878 is
available on NCBI GIAB
ftp site (see blogs on
genomeinabottle.org)
– mirrored to Amazon
today
• Highly confident
genotype calls and bed
files available on GIAB
ftp site
• Pre-print of manuscript
available on arxiv.org
• See
genomeinabottle.org
blog posts for more
information
19
Acknowledgements
• GCAT – David Mittelman and Jason Wang
• FDA HPC – Mike Mikailov, Brian Fitzgerald, et al.
• HSPH – Brad Chapman, Oliver Hofmann, Win
Hide
• Genome in a Bottle Consortium
– www.genomeinabottle.org
• newsletters, blogs, forums, announcements
– new partners welcome! Open to anyone
– targeting pilot reference material availability in early
2014
20

Aug2013 NIST highly confident genotype calls for NA12878


Editor's notes

  1. Meeting Notes (5/28/13 17:05): ask Heng for decoy