Abstract: This session covers the steps involved in identifying genomic variants once an initial mapping is available: improving the mapping, SNP and indel calling, and variant filtering/recalibration.
2. Quick recap: DNA sequence read mapping. July 14, 2011. Sequencing -> FASTQ -> alignment to reference genome. Resulting file type: BAM. Visualized in a genome viewer: “What genomic regions were sequenced?” Quality control. [Figure labels: Projects, Fastq, Bam]
3. Production Informatics and Bioinformatics. Basic Production Informatics: produce raw sequence reads. Advanced Production Informatics: map to genome and generate raw genomic features (e.g. SNPs). Bioinformatics Research: analyze the data; uncover the biological meaning. Per one-flowcell project.
4. Good mapping is crucial. Mapping tools trade accuracy for speed (approximate mapping). Identifying exactly where the reads map is the foundation for all subsequent analyses. The exact alignment of each read is especially important for variant calling. (Image by neilalderney123)
5. Mapping challenges. Incorrect mapping: among 3 billion bp (human), a 100-mer can occur by chance. Multi-mappers: the genome has non-unique regions (e.g. repeats), so one read can map to multiple sites. Duplicates: PCR duplicates can introduce artifacts. Example (reference, then aligned reads; the last four reads are identical): ACGATATTACACGTACACTCAAGTCGTTCGGAACCT; TTACACGTACA; TACACGTACAC; ACACGTACACT; CACGTACTCTC; CACGTACTCTC; CACGTACTCTC; CACGTACTCTC. [Figure legend: Streptococcus suis (squares), Mus musculus (triangles)] Turner DJ, Keane TM, Sudbery I, Adams DJ. Next-generation sequencing of vertebrate experimental organisms. Mamm Genome. 2009 Jun;20(6):327-38. PMID: 19452216
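The duplicate problem on this slide can be illustrated with the slide's own reads. The following is a toy sketch, not how production tools work: it places each read by ungapped best match (an assumption, since the slide's alignment offsets were lost in extraction) and collapses reads that share the same start and sequence. The function names `best_placement` and `collapse_duplicates` are hypothetical.

```python
from collections import Counter

# Reference and reads copied from the slide.
REFERENCE = "ACGATATTACACGTACACTCAAGTCGTTCGGAACCT"
READS = [
    "TTACACGTACA", "TACACGTACAC", "ACACGTACACT",
    "CACGTACTCTC", "CACGTACTCTC", "CACGTACTCTC", "CACGTACTCTC",
]


def best_placement(read, reference):
    """Return (start, mismatches) for the ungapped placement with fewest mismatches."""
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best[1]:
            best = (start, mismatches)
    return best


def collapse_duplicates(reads, reference):
    """Count reads per (start, sequence); identical entries are duplicate candidates."""
    placed = [(best_placement(read, reference)[0], read) for read in reads]
    return Counter(placed)


counts = collapse_duplicates(READS, REFERENCE)
for (start, seq), n in sorted(counts.items()):
    print(start, seq, "x", n)
```

The four identical reads all carry the same mismatch to the reference, so counting them as independent evidence would overstate support for a variant at that base. Collapsing them to one observation is the basic idea behind duplicate marking.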
6. Methods for ensuring a good alignment. Biological: using paired-end reads to increase coverage. Bioinformatic: local realignment; base quality score recalibration. [Figure labels: ~200 bp, ?, repeat region]
7. Local realignment (GATK). Realigns all reads at a specific location simultaneously to minimize mismatches to the reference genome. Reduces erroneous SNPs; refines the location of indels. [Figure: original vs. realigned, QBI data] DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
8. Quality score recalibration (GATK). PHRED scores are predicted. Looking at all reads at a specific location allows a better estimate of the base quality score. Excludes all known dbSNP sites; assumes all other mismatches are sequencing errors. Computes a new calibration table based on mismatch rates per position on the read. Important for variant calling. DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889. Thomas Keane, 9th European Conference on Computational Biology, 26th September 2010
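The recalibration idea on this slide can be sketched in a few lines. This is a simplified illustration, not GATK's implementation: it conditions only on the position within the read (the one covariate the slide names), whereas GATK uses additional covariates. The input layout, the function name `recalibration_table`, and the pseudocount smoothing are assumptions made for the example.

```python
import math
from collections import defaultdict


def phred(error_rate):
    """Convert an error probability to a Phred-scaled quality."""
    return -10 * math.log10(error_rate)


def recalibration_table(observations, known_sites):
    """Build {read_position: empirical quality}.

    observations: iterable of (reference_position, read_position, is_mismatch).
    known_sites: reference positions of known variants (e.g. dbSNP), which are skipped
    so that real variation is not counted as sequencing error.
    """
    counts = defaultdict(lambda: [0, 0])  # read_position -> [mismatches, total]
    for ref_pos, read_pos, is_mismatch in observations:
        if ref_pos in known_sites:
            continue
        counts[read_pos][0] += int(is_mismatch)
        counts[read_pos][1] += 1
    table = {}
    for read_pos, (mismatches, total) in counts.items():
        # Pseudocounts (an assumption) keep the rate away from exactly 0 or 1.
        rate = (mismatches + 1) / (total + 2)
        table[read_pos] = phred(rate)
    return table


# Toy data: 98 matching bases at read position 1, 10 mismatches in 98 at position 2,
# plus one mismatch at a known variant site that must be ignored.
toy = [(pos, 1, False) for pos in range(100, 198)]
toy += [(pos, 2, pos % 10 == 0) for pos in range(200, 298)]
toy.append((50, 1, True))
table = recalibration_table(toy, known_sites={50})
for read_pos in sorted(table):
    print(read_pos, round(table[read_pos], 1))
```

Read positions with a higher observed mismatch rate end up with a lower empirical quality, which is the correction the recalibration table encodes.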
9. Recalibration of quality score. “All bases are called with Q25. In reality not all are that good: bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20” (GATK). DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
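The Q25 versus Q20 comparison follows from the standard Phred definition, where p is the probability that a base call is wrong. A short worked version of the arithmetic:

```latex
Q = -10 \log_{10} p
\quad\Longleftrightarrow\quad
p = 10^{-Q/10}

p = \tfrac{1}{100} \;\Rightarrow\; Q = -10 \log_{10}\!\left(10^{-2}\right) = 20

Q = 25 \;\Rightarrow\; p = 10^{-2.5} \approx 0.0032 \approx \tfrac{1}{316}
```

So a reported Q25 claims roughly one error in 316 bases, while an observed mismatch rate of 1 in 100 corresponds to Q20.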
10. Variant calling methods. More than 15 different algorithms. Three categories: (1) allele counting; (2) probabilistic methods, e.g. a Bayesian model to quantify statistical uncertainty, assigning priors e.g. by taking the observed allele frequency of multiple samples into account; (3) incorporating linkage disequilibrium (LD), specifically helpful for low coverage and common variants. [Figure: variant SNP; Ref A; Ind1 G/G; Ind2 A/G] Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011 Jun;12(6):443-51. PMID: 21587300. http://seqanswers.com/wiki/Software/list
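To make the difference between allele counting and a probabilistic caller concrete, here is a minimal sketch of Bayesian genotype calling from allele counts. It is not any specific published caller: the binomial read model, the fixed per-base error rate, the Hardy-Weinberg prior, and the function name `genotype_posteriors` are all assumptions chosen for illustration.

```python
import math


def genotype_posteriors(ref_count, alt_count, error_rate=0.01, alt_freq=0.1):
    """Posterior probabilities for genotypes 0/0, 0/1, 1/1 at one diploid site."""
    # Probability that a single read shows the ALT allele, given the genotype.
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1 - error_rate}
    # Prior from an assumed population ALT allele frequency (Hardy-Weinberg).
    q = alt_freq
    prior = {"0/0": (1 - q) ** 2, "0/1": 2 * q * (1 - q), "1/1": q ** 2}
    log_post = {}
    for genotype, p in p_alt.items():
        # Binomial log-likelihood; the binomial coefficient is the same for all
        # genotypes and cancels in the normalization, so it is omitted.
        log_lik = alt_count * math.log(p) + ref_count * math.log(1 - p)
        log_post[genotype] = log_lik + math.log(prior[genotype])
    # Normalize in a numerically stable way.
    top = max(log_post.values())
    unnorm = {g: math.exp(v - top) for g, v in log_post.items()}
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}


def call(posteriors):
    """Return the genotype with the highest posterior probability."""
    return max(posteriors, key=posteriors.get)


# Low coverage: 3 REF reads and 1 ALT read, under a common and a rare prior.
print(call(genotype_posteriors(3, 1, alt_freq=0.1)))
print(call(genotype_posteriors(3, 1, alt_freq=0.001)))
```

With only four reads, the same counts lead to different calls depending on the prior, which is why priors from multiple samples matter most at low coverage and for common variants.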
11. VCF format
[HEADER LINES]
#CHROM  POS     ID          REF  ALT  QUAL     FILTER   INFO           FORMAT          NA12878
chr1    873762  .           T    G    5231.78  PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A    G    3931.66  PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C    T    71.77    PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974165  rs9442391   T    C    29.84    LowQual  [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255
Individual statistics (first record): GT, genotype: 0/1. AD, number of REF/ALT bases seen: 173 T, 141 G. DP, depth (MAPQ > 17): 282. GQ, genotype quality: 99. PL, genotype likelihoods: 0/0: 10^-25.5 = unlikely; 0/1: 10^0 = likely; 1/1: 10^-25.5 = unlikely.
Location statistics, e.g. strand bias; how many reads have a deletion at this site.
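The per-sample statistics above can be read directly from the FORMAT and sample columns. A minimal sketch using the first record from the slide; the helper names `parse_sample` and `pl_to_likelihoods` are hypothetical, and a real pipeline would more likely use an existing VCF parsing library.

```python
def parse_sample(format_field, sample_field):
    """Pair the colon-separated FORMAT keys with the sample's values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def pl_to_likelihoods(pl_field):
    """Convert Phred-scaled PL values (0/0, 0/1, 1/1) to relative likelihoods."""
    return [10 ** (-int(value) / 10) for value in pl_field.split(",")]


# First record from the slide: chr1:873762 T>G in NA12878.
record = parse_sample("GT:AD:DP:GQ:PL", "0/1:173,141:282:99:255,0,255")
ref_reads, alt_reads = (int(x) for x in record["AD"].split(","))
alt_fraction = alt_reads / (ref_reads + alt_reads)
likelihoods = pl_to_likelihoods(record["PL"])

print("genotype:", record["GT"])
print("ALT fraction:", round(alt_fraction, 3))
print("relative likelihoods (0/0, 0/1, 1/1):", likelihoods)
```

A PL of 0 marks the most likely genotype (relative likelihood 1), and a PL of 255 corresponds to 10^-25.5, matching the values listed for this record.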
12. When to call a variant? Hom: REF 0%, ALT 100%. Het: REF 50%, ALT 50%. ??: REF 77%, ALT 23%. (QBI data)
13. Hard filtering. Reducing false positives by e.g. requiring: sufficient depth; variant to be in >30% of reads; high quality; strand balance; … Subjective and dangerous in this high-dimensional search space. [Figure: strand bias, QBI data] Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008). Wheeler, D.A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008).
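A hard filter of this kind is just a set of fixed thresholds applied to each candidate. The sketch below is illustrative only: the 30% ALT-fraction cutoff comes from the slide, while the depth, quality, and strand thresholds, the parameter names, and the function `hard_filter` are assumptions picked for the example.

```python
def hard_filter(depth, alt_reads, qual, alt_forward, alt_reverse,
                min_depth=10, min_alt_fraction=0.30,
                min_qual=30.0, min_strand_fraction=0.10):
    """Return the names of the filters a candidate variant fails (empty list = pass)."""
    failed = []
    if depth < min_depth:
        failed.append("low_depth")
    # The slide asks for the variant to be in more than 30% of reads.
    if depth == 0 or alt_reads / depth <= min_alt_fraction:
        failed.append("low_alt_fraction")
    if qual < min_qual:
        failed.append("low_qual")
    # Strand balance: the rarer strand must carry a minimum share of ALT reads.
    strand_total = alt_forward + alt_reverse
    if strand_total == 0 or min(alt_forward, alt_reverse) / strand_total < min_strand_fraction:
        failed.append("strand_bias")
    return failed


# The ambiguous case from the previous slide: 23% ALT reads at depth 100.
print(hard_filter(depth=100, alt_reads=23, qual=50.0, alt_forward=12, alt_reverse=11))
# A balanced heterozygous call.
print(hard_filter(depth=100, alt_reads=50, qual=50.0, alt_forward=25, alt_reverse=25))
```

Every threshold here is a judgment call, and moving any one of them changes which variants survive. That is the subjectivity the slide warns about.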
14. Gaussian mixture model. Train on trusted variants and require new variants to live in the same hyperspace. Potential problem: overfitting and biasing to features of known SNPs!
15. Indel calling. The first local realignment might not be sufficient to confidently determine the beginning and end of indels. Dindel algorithm: local realignment for every indel candidate. Albers CA, Lunter G, Macarthur DG, McVean G, Ouwehand WH, Durbin R. Dindel: Accurate indel calls from short-read data. Genome Res. 2011 Jun;21(6):961-73. PMID: 20980555.
16. Recap. DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
17. Outcome: How many variants will I find? HiSeq: whole genome; mean coverage 60; 101PE; (NA12878). Exome: Agilent capture; mean coverage 20; 76/101PE; (NA12878). DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
18. Three things to remember. Getting the mapping right is critical. Variant calling is more than counting the differences. Just listing the variants does not tell you anything biologically relevant. (Image by Яick Harris) Fernald GH, Capriotti E, Daneshjou R, Karczewski KJ, Altman RB. Bioinformatics challenges for personalized medicine. Bioinformatics. 2011 Jul 1;27(13):1741-8. PMID: 21596790
19. Next week. Abstract: This seminar aims to answer the question of what to make of the identified variants, specifically how to evaluate their quality, prioritize them, and functionally annotate them.
Unmethylated ‘C’ bases (cytosines) are converted to ‘T’.
The proportion of unique sequence in the Streptococcus suis (squares) and Mus musculus (triangles) genomes for varying read lengths. This graph indicates that read length has a critical effect on the ability to place reads uniquely on the genome.