Variant (SNPs/Indels) calling in DNA sequences, Part 2

Variant calling for disease association (2/2): Searching the haystack. July 14, 2011. (Image: www.absolutefab.com)

Quick recap: DNA sequence read mapping
Sequencing -> FASTQ -> alignment to reference genome.
Resulting file type: BAM.
Visualized in a genome viewer: “What genomic regions were sequenced?”
Quality control.
[Figure labels: Projects, Fastq, Bam]

Production Informatics and Bioinformatics (per one-flowcell project)
Basic Production Informatics: produce raw sequence reads.
Advanced Production Informatics: map to genome and generate raw genomic features (e.g. SNPs).
Bioinformatics Research: analyze the data; uncover the biological meaning.

Good mapping is crucial
Mapping tools compromise accuracy for speed: approximate mapping.
Identifying exactly where the reads map is the foundation for all subsequent analyses.
The exact alignment of each read is especially important for variant calling.
(Image by neilalderney123)

Mapping challenges
Incorrect mapping: amongst 3 billion bp (human), a 100-mer can occur by chance.
Multi-mappers: the genome has non-unique regions (e.g. repeats), so one read can map to multiple sites.
Duplicates: PCR duplicates can introduce artifacts.
[Figure: Streptococcus suis (squares), Mus musculus (triangles)]
[Figure: reads aligned to the reference ACGATATTACACGTACACTCAAGTCGTTCGGAACCT: TTACACGTACA, TACACGTACAC, ACACGTACACT, and four identical copies of CACGTACTCTC]
Turner DJ, Keane TM, Sudbery I, Adams DJ. Next-generation sequencing of vertebrate experimental organisms. Mamm Genome. 2009 Jun;20(6):327-38. PMID: 19452216
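A minimal sketch of how PCR duplicates could be flagged, assuming that reads sharing chromosome, start position and strand are copies of one fragment. The dictionary keys (name, chrom, start, strand, mean_qual) and the rule of keeping the highest-quality read are illustrative assumptions, not the behaviour of any particular tool.

```python
from collections import defaultdict

def mark_duplicates(reads):
    """Keep one read per (chrom, start, strand); flag the rest as duplicates.

    `reads` is a list of dicts with hypothetical keys:
    name, chrom, start, strand, mean_qual.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["start"], read["strand"])].append(read)

    kept, duplicates = [], []
    for group in groups.values():
        # Keep the read with the highest mean base quality as representative.
        best = max(group, key=lambda r: r["mean_qual"])
        kept.append(best)
        duplicates.extend(r for r in group if r is not best)
    return kept, duplicates

# Toy input: three reads start at the same position on the same strand.
reads = [
    {"name": "r1", "chrom": "chr1", "start": 8, "strand": "+", "mean_qual": 30},
    {"name": "r2", "chrom": "chr1", "start": 9, "strand": "+", "mean_qual": 31},
    {"name": "r3", "chrom": "chr1", "start": 12, "strand": "+", "mean_qual": 28},
    {"name": "r4", "chrom": "chr1", "start": 12, "strand": "+", "mean_qual": 33},
    {"name": "r5", "chrom": "chr1", "start": 12, "strand": "+", "mean_qual": 29},
]
kept, duplicates = mark_duplicates(reads)
print([r["name"] for r in kept], [r["name"] for r in duplicates])
```

Collapsing the stack of identical reads to a single representative means a mismatch carried by all copies is counted once rather than several times.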

Methods for ensuring a good alignment
Biological: using paired-end reads to increase coverage.
Bioinformatic: local realignment; base pair quality score recalibration.
[Figure labels: ~200 bp, ?, Repeat region]

Local Realignment (GATK)
Local realignment of all reads at a specific location simultaneously to minimize mismatches to the reference genome.
Reduces erroneous SNPs.
Refines the location of indels.
[Figure: original vs. realigned reads, QBI data]
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889

Quality score recalibration (GATK)
PHRED scores are predicted.
Looking at all reads at a specific location allows a better estimate of the base pair quality score.
Exclude all known dbSNP sites and assume all other mismatches are sequencing errors.
Compute a new calibration table based on mismatch rates per position on the read.
Important for variant calling.
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
Thomas Keane, 9th European Conference on Computational Biology, 26th September 2010
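The calibration-table idea can be sketched as follows. This is a simplified illustration and not GATK's implementation: it uses read position (cycle) as the only covariate, and the input tuple layout and the add-one smoothing are assumptions made for the example.

```python
import math
from collections import defaultdict

def empirical_quality_by_cycle(observations, known_sites):
    """Estimate an empirical Phred score per read position (cycle).

    observations: iterable of (chrom, pos, cycle, is_mismatch) tuples
                  (hypothetical layout for this sketch).
    known_sites:  set of (chrom, pos) for known variant sites, e.g. dbSNP.
    """
    counts = defaultdict(lambda: [0, 0])  # cycle -> [mismatches, total]
    for chrom, pos, cycle, is_mismatch in observations:
        if (chrom, pos) in known_sites:
            continue  # a mismatch here may be a real variant, not an error
        counts[cycle][0] += int(is_mismatch)
        counts[cycle][1] += 1

    table = {}
    for cycle, (mismatches, total) in counts.items():
        # Add-one smoothing (an assumption) avoids log10(0) when no mismatch is seen.
        error_rate = (mismatches + 1) / (total + 2)
        table[cycle] = -10 * math.log10(error_rate)
    return table

# Toy data: cycle 1 is clean, cycle 100 has 9 mismatches in 98 bases.
obs = [("chr1", i, 1, False) for i in range(998)]
obs += [("chr1", 5000 + i, 100, i < 9) for i in range(98)]
obs.append(("chr1", 877664, 100, True))  # known site, excluded from the counts
known = {("chr1", 877664)}
table = empirical_quality_by_cycle(obs, known)
print({cycle: round(q, 1) for cycle, q in table.items()})
```

In this toy data the late cycle ends up with a much lower empirical quality than the first cycle, which is the kind of position-dependent correction the table is meant to capture.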

Recalibration of quality score
“All bases are called with Q25. In reality not all are that good: bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20” (GATK).
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
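The arithmetic behind the Q25 versus Q20 statement follows from the standard Phred definition, Q = -10 * log10(P_error). A short sketch, with function names chosen for this example:

```python
import math

def phred_to_error(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred quality score implied by an error probability."""
    return -10 * math.log10(p)

reported_q = 25
claimed_error = phred_to_error(reported_q)     # about 1 error in 316 bases
observed_mismatch_rate = 1 / 100               # rate quoted on the slide
empirical_q = error_to_phred(observed_mismatch_rate)
print(round(claimed_error, 5), round(empirical_q, 1))
```

A reported Q25 claims roughly one error in 316 bases, so an observed rate of one in 100 means the bases are about three times worse than stated.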

Variant calling methods
More than 15 different algorithms, in three categories:
Allele counting.
Probabilistic methods, e.g. a Bayesian model to quantify statistical uncertainty. Assign priors, e.g. by taking the observed allele frequency of multiple samples into account.
Incorporating linkage disequilibrium (LD). Specifically helpful for low coverage and common variants.
[Figure: variant SNP; Ref A, Ind1 G/G, Ind2 A/G]
Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011 Jun;12(6):443-51. PMID: 21587300.
http://seqanswers.com/wiki/Software/list
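A minimal sketch of the probabilistic category, assuming a simple binomial read model with one uniform per-base error rate and Hardy-Weinberg priors derived from a population allele frequency. The default values for error_rate and alt_freq are illustrative assumptions, and real callers use per-base qualities and richer models.

```python
import math

def genotype_posteriors(ref_count, alt_count, error_rate=0.01, alt_freq=0.001):
    """Posterior probability of 0/0, 0/1 and 1/1 given allele counts."""
    # Probability that a single read shows the ALT allele under each genotype.
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    # Hardy-Weinberg priors from the assumed ALT allele frequency.
    priors = {
        "0/0": (1.0 - alt_freq) ** 2,
        "0/1": 2.0 * alt_freq * (1.0 - alt_freq),
        "1/1": alt_freq ** 2,
    }
    log_post = {}
    for gt, p in p_alt.items():
        # The binomial coefficient is identical for all genotypes and cancels.
        log_lik = alt_count * math.log(p) + ref_count * math.log(1.0 - p)
        log_post[gt] = log_lik + math.log(priors[gt])
    top = max(log_post.values())
    unnorm = {gt: math.exp(lp - top) for gt, lp in log_post.items()}
    total = sum(unnorm.values())
    return {gt: v / total for gt, v in unnorm.items()}

def call_genotype(ref_count, alt_count, **kwargs):
    post = genotype_posteriors(ref_count, alt_count, **kwargs)
    return max(post, key=post.get)

# Same low-coverage evidence (3 REF reads, 1 ALT read), different priors.
rare = call_genotype(3, 1, alt_freq=0.001)
common = call_genotype(3, 1, alt_freq=0.3)
print(rare, common)
```

With only four reads the prior decides the call: the single ALT read is treated as an error when the allele is rare, but supports a heterozygous call when the allele is common in the population.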

VCF format
[HEADER LINES]
#CHROM  POS     ID          REF  ALT  QUAL     FILTER   INFO           FORMAT          NA12878
chr1    873762  .           T    G    5231.78  PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A    G    3931.66  PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C    T    71.77    PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974165  rs9442391   T    C    29.84    LowQual  [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255
Individual statistics (values from the first record):
GT: genotype: 0/1
AD: total number of REF/ALT alleles seen: 173 T, 141 G
DP: depth (MAPQ > 17): 282
GQ: genotype quality: 99
PL: genotype likelihoods: 0/0: 10^-25.5 = unlikely; 0/1: 10^0 = likely; 1/1: 10^-25.5 = unlikely
Location statistics, e.g. strand bias; how many reads have a deletion at this site.
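The per-sample column can be decoded by pairing the FORMAT keys with the sample values. The sketch below parses the first record from the slide (assumed here to be tab-separated, as VCF is) and converts the Phred-scaled PL values back to relative likelihoods; the helper names are chosen for this example.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys (e.g. GT:AD:DP:GQ:PL) with the sample's values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))

def pl_to_likelihoods(pl_field):
    """Convert Phred-scaled likelihoods to relative likelihoods: 10^(-PL/10)."""
    return [10 ** (-int(pl) / 10) for pl in pl_field.split(",")]

line = ("chr1\t873762\t.\tT\tG\t5231.78\tPASS\t[ANNOTATIONS]\t"
        "GT:AD:DP:GQ:PL\t0/1:173,141:282:99:255,0,255")
chrom, pos, var_id, ref, alt, qual, filt, info, fmt, sample_col = line.split("\t")

sample = parse_sample(fmt, sample_col)
ref_depth, alt_depth = (int(x) for x in sample["AD"].split(","))
# For a biallelic site the PL order is 0/0, 0/1, 1/1.
likelihoods = dict(zip(["0/0", "0/1", "1/1"], pl_to_likelihoods(sample["PL"])))

print(sample["GT"], f"{ref_depth} {ref}", f"{alt_depth} {alt}")
print(likelihoods)
```

A PL of 0 marks the most likely genotype, and every 10 units above it is a factor of ten less likely.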

When to call a variant?
Hom: REF 0%, ALT 100%.
Het: REF 50%, ALT 50%.
??: REF 77%, ALT 23%.
[Figures: QBI data]
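Whether a 77%/23% split is still compatible with a heterozygous site depends heavily on depth. As a rough sketch, assuming reads sample the two alleles of a true heterozygote independently with probability 0.5 each (ignoring sequencing error and reference mapping bias), the chance of seeing this few ALT reads is a binomial tail:

```python
from math import comb

def het_tail_probability(alt_count, depth):
    """P(X <= alt_count) for X ~ Binomial(depth, 0.5)."""
    return sum(comb(depth, k) for k in range(alt_count + 1)) / 2 ** depth

# About 23% ALT reads at two different depths.
low = het_tail_probability(3, 13)      # 3 of 13 reads
high = het_tail_probability(23, 100)   # 23 of 100 reads
print(f"depth 13: {low:.4f}   depth 100: {high:.2e}")
```

At depth 13 such an imbalance arises by chance in a few percent of true heterozygous sites, whereas at depth 100 it is very unlikely under this model and points to something else, such as an artifact or a mixed sample.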

Hard Filtering
Reducing false positives by e.g. requiring: sufficient depth; variant to be in > 30% of reads; high quality; strand balance; …
Subjective and dangerous in this high-dimensional search space.
[Figure: strand bias, QBI data]
Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
Wheeler, D.A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008).
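A hard filter of this kind can be sketched as a chain of threshold checks. Only the 30% ALT fraction comes from the slide; the other thresholds (min_depth, min_qual, min_strand_fraction) and the function name are hypothetical values picked for illustration.

```python
def passes_hard_filters(ref_fwd, ref_rev, alt_fwd, alt_rev, qual,
                        min_depth=10, min_alt_fraction=0.30,
                        min_qual=30.0, min_strand_fraction=0.10):
    """Return (passed, reason) for one candidate variant.

    Counts are reads supporting REF/ALT on the forward/reverse strand.
    All thresholds except min_alt_fraction are assumptions for this sketch.
    """
    depth = ref_fwd + ref_rev + alt_fwd + alt_rev
    alt = alt_fwd + alt_rev
    if depth < min_depth:
        return False, "low depth"
    if alt / depth <= min_alt_fraction:
        return False, "low alt fraction"
    if qual < min_qual:
        return False, "low quality"
    # Strand balance: the ALT allele should be seen on both strands.
    if min(alt_fwd, alt_rev) / alt < min_strand_fraction:
        return False, "strand bias"
    return True, "PASS"

print(passes_hard_filters(20, 20, 10, 10, 50.0))
print(passes_hard_filters(77, 0, 23, 0, 50.0))
print(passes_hard_filters(10, 10, 20, 0, 50.0))
```

Each threshold is a separate, hand-picked cut-off, which is why the slide calls this approach subjective: a true variant failing any single test is discarded regardless of how strong the other evidence is.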

Gaussian mixture model
Train on trusted variants and require the new variants to live in the same hyperspace.
Potential problem: overfitting and biasing to features of known SNPs!
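The idea can be illustrated with a deliberately small example: a two-component Gaussian mixture fitted by expectation-maximization to a single annotation of trusted variants, then used to score new candidates. This is a one-dimensional, pure-Python sketch with made-up feature values, not the model used by any production caller, which works on several annotations at once.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(values, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    overall_mean = sum(values) / len(values)
    overall_var = sum((v - overall_mean) ** 2 for v in values) / len(values)
    means = [min(values), max(values)]       # simple initialisation
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in values:
            p = [w * normal_pdf(v, m, s) for w, m, s in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean and variance per component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var_k = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var_k, 1e-6)  # floor keeps the density finite
            weights[k] = nk / len(values)
    return weights, means, variances

def mixture_density(x, model):
    weights, means, variances = model
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, variances))

# Hypothetical annotation values of trusted variants, forming two clusters.
trusted = [9.0, 9.5, 10.0, 10.5, 11.0, 19.0, 19.5, 20.0, 20.5, 21.0]
model = fit_gmm_1d(trusted)
inside = mixture_density(10.2, model)    # resembles the trusted variants
outside = mixture_density(15.0, model)   # falls between the clusters
print(model[1], inside > outside)
```

A candidate is trusted only if it falls where trusted variants are dense, which also shows the risk named on the slide: a real variant that looks unlike the training set scores poorly.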

Indel calling
The first local realignment might not be sufficient to confidently determine the beginning and end of indels.
Dindel algorithm: local realignment for every indel candidate.
Albers CA, Lunter G, Macarthur DG, McVean G, Ouwehand WH, Durbin R. Dindel: Accurate indel calls from short-read data. Genome Res. 2011 Jun;21(6):961-73. PMID: 20980555.

Recap
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889

Outcome: How many variants will I find?
HiSeq: whole genome; mean coverage 60; 101PE; (NA12878).
Exome: Agilent capture; mean coverage 20; 76/101PE; (NA12878).
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889

Three things to remember
Getting the mapping right is critical.
Variant calling is not merely counting the differences.
Just listing the variants does not tell you anything biologically relevant.
(Image by Яick Harris)
Fernald GH, Capriotti E, Daneshjou R, Karczewski KJ, Altman RB. Bioinformatics challenges for personalized medicine. Bioinformatics. 2011 Jul 1;27(13):1741-8. PMID: 21596790

Next week
Abstract: This seminar aims to answer the question of what to make of the identified variants, specifically how to evaluate their quality, prioritize them, and functionally annotate them.
