Variant calling for disease association (2/2): Searching the haystack. July 14, 2011. (Image: www.absolutefab.com)
Quick recap: DNA sequence read mapping
  • Sequencing -> FASTQ -> alignment to reference genome
  • Resulting file type: BAM
  • Visualized in a genome viewer: "What genomic regions were sequenced?"
  • [Diagram labels: Quality Control, Projects, FASTQ, BAM]
Production Informatics and Bioinformatics
  • Basic Production Informatics: produce raw sequence reads
  • Advanced Production Informatics: map to genome and generate raw genomic features (e.g. SNPs)
  • Bioinformatics Research: analyze the data; uncover the biological meaning
  • [Diagram label: per one-flowcell project]
Good mapping is crucial
  • Mapping tools trade accuracy for speed: the mapping is approximate.
  • Identifying exactly where the reads map is the foundation for all subsequent analyses.
  • The exact alignment of each read is especially important for variant calling.
  • Image credit: neilalderney123
Mapping challenges
  • Incorrect mapping: among 3 billion bp (human), a 100-mer can match by chance.
  • Multi-mappers: the genome has non-unique regions (e.g. repeats), so one read can map to multiple sites.
  • Duplicates: PCR duplicates can introduce artifacts.
  • [Figure legend: Streptococcus suis (squares), Mus musculus (triangles)]
  • Example alignment (reference on top; the last four reads are identical and differ from the reference at one base):
      ACGATATTACACGTACACTCAAGTCGTTCGGAACCT
            TTACACGTACA
             TACACGTACAC
              ACACGTACACT
               CACGTACTCTC
               CACGTACTCTC
               CACGTACTCTC
               CACGTACTCTC
  • Turner DJ, Keane TM, Sudbery I, Adams DJ. Next-generation sequencing of vertebrate experimental organisms. Mamm Genome. 2009 Jun;20(6):327-38. PMID: 19452216
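
The duplicate problem above can be illustrated with a minimal sketch. It assumes, as a simplification, that reads sharing chromosome, start position and strand are copies of one fragment, and it keeps the copy with the highest mean base quality. The dictionary keys (`name`, `chrom`, `start`, `strand`, `mean_qual`) are hypothetical and not taken from any particular tool.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as likely PCR duplicates.

    Reads sharing (chrom, start, strand) are grouped; the read with the
    highest mean base quality in each group is kept, the rest are flagged.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["start"], read["strand"])].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["mean_qual"])
        for read in group:
            if read is not best:
                duplicates.add(read["name"])
    return duplicates


# Toy reads mirroring the example alignment: three reads share one start position.
reads = [
    {"name": "r1", "chrom": "chr1", "start": 6, "strand": "+", "mean_qual": 30},
    {"name": "r2", "chrom": "chr1", "start": 7, "strand": "+", "mean_qual": 31},
    {"name": "r3", "chrom": "chr1", "start": 9, "strand": "+", "mean_qual": 28},
    {"name": "r4", "chrom": "chr1", "start": 9, "strand": "+", "mean_qual": 33},
    {"name": "r5", "chrom": "chr1", "start": 9, "strand": "+", "mean_qual": 25},
]
print(sorted(mark_duplicates(reads)))  # r3 and r5 are flagged, r4 is kept
```

Without such a step, the stacked copies would count as independent evidence for the mismatch they all carry.
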
Methods for ensuring a good alignment
  • Biological: use paired-end reads to increase coverage.
  • Bioinformatic: local realignment; base quality score recalibration.
  • [Figure labels: ~200 bp, "?", repeat region]
Local realignment (GATK)
  • Realign all reads at a specific location simultaneously to minimize mismatches to the reference genome.
  • Reduces erroneous SNPs.
  • Refines the location of indels.
  • [Figure: original vs. realigned reads; QBI data]
  • DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
Quality score recalibration (GATK)
  • Reported PHRED scores are only predictions of base-call accuracy.
  • Looking at all reads at a specific location allows a better estimate of the base quality score:
      - Exclude all known dbSNP sites.
      - Assume all other mismatches are sequencing errors.
      - Compute a new calibration table based on mismatch rates per position on the read.
  • Important for variant calling.
  • DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
  • Thomas Keane, 9th European Conference on Computational Biology, 26th September 2010
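
The three recalibration steps can be sketched roughly as follows. This is a simplified illustration, not GATK's implementation: it uses read position (cycle) as the only covariate, the input tuple layout is an assumption, and the add-one smoothing is a choice made here so that a cycle with zero mismatches does not get an infinite quality.

```python
import math
from collections import defaultdict


def recalibration_table(observations, known_sites):
    """Empirical Phred quality per read position (cycle).

    observations: iterable of (chrom, pos, cycle, is_mismatch), one per aligned base.
    known_sites: set of (chrom, pos) for known variants; these bases are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # cycle -> [mismatches, total bases]
    for chrom, pos, cycle, is_mismatch in observations:
        if (chrom, pos) in known_sites:
            continue  # a mismatch here may be real variation, not an error
        counts[cycle][1] += 1
        if is_mismatch:
            counts[cycle][0] += 1

    table = {}
    for cycle, (mismatches, total) in counts.items():
        error_rate = (mismatches + 1) / (total + 2)  # add-one smoothing
        table[cycle] = -10 * math.log10(error_rate)
    return table


# Toy data: 998 bases at cycle 1 with 9 mismatches, plus one mismatch at a known site.
observations = (
    [("chr1", 1000 + i, 1, False) for i in range(989)]
    + [("chr1", 5000 + i, 1, True) for i in range(9)]
    + [("chr1", 42, 1, True)]
)
known_sites = {("chr1", 42)}
table = recalibration_table(observations, known_sites)
print(round(table[1], 1))  # empirical quality for cycle 1, close to Q20
```
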
Recalibration of quality score
  • "All bases are called with Q25. In reality not all are that good: bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20" (GATK)
  • DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
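
The Q25 versus Q20 statement follows from the standard Phred definition, Q = -10 * log10(p), where p is the error probability. A small sketch of the conversion in both directions (the function names are made up for this example):

```python
import math


def phred_to_error(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred quality score implied by an error probability."""
    return -10 * math.log10(p)


print(phred_to_error(25))       # roughly 0.003, about 1 error in 316 bases
print(error_to_phred(1 / 100))  # the observed 1-in-100 mismatch rate gives Q20
```

A base reported as Q25 therefore claims about three times fewer errors than the Q20 actually observed.
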
Variant calling methods
  • More than 15 different algorithms
  • Three categories:
      - Allele counting
      - Probabilistic methods, e.g. a Bayesian model, to quantify statistical uncertainty; priors can be assigned e.g. by taking the observed allele frequency of multiple samples into account
      - Incorporating linkage disequilibrium (LD); specifically helpful for low coverage and common variants
  • [Figure: variant SNP; Ref A, Ind1 G/G, Ind2 A/G]
  • Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011 Jun;12(6):443-51. PMID: 21587300.
  • http://seqanswers.com/wiki/Software/list
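
To make the probabilistic category concrete, here is a minimal Bayesian genotype sketch. It is one possible model among many, not the method of any specific caller: it assumes a binomial read-sampling model with a single per-base error rate, and priors derived from an assumed population allele frequency under Hardy-Weinberg proportions. The default values for `alt_freq` and `error` are illustrative.

```python
from math import comb


def genotype_posteriors(ref_count, alt_count, alt_freq=0.01, error=0.01):
    """Posterior probability of genotypes 0/0, 0/1 and 1/1 given allele counts."""
    n = ref_count + alt_count
    # Probability that a single read shows the ALT allele under each genotype.
    p_alt = {"0/0": error, "0/1": 0.5, "1/1": 1 - error}
    # Hardy-Weinberg priors from the assumed ALT allele frequency.
    prior = {
        "0/0": (1 - alt_freq) ** 2,
        "0/1": 2 * alt_freq * (1 - alt_freq),
        "1/1": alt_freq ** 2,
    }
    unnormalized = {
        g: prior[g] * comb(n, alt_count) * p_alt[g] ** alt_count * (1 - p_alt[g]) ** ref_count
        for g in p_alt
    }
    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}


posteriors = genotype_posteriors(ref_count=10, alt_count=10)
print(max(posteriors, key=posteriors.get))  # the heterozygous genotype dominates
```

Unlike plain allele counting, the output is a probability for each genotype, which is what allows the uncertainty to be quantified and carried forward.
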
VCF format
  [HEADER LINES]
  #CHROM  POS     ID          REF  ALT  QUAL     FILTER   INFO           FORMAT          NA12878
  chr1    873762  .           T    G    5231.78  PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
  chr1    877664  rs3828047   A    G    3931.66  PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
  chr1    899282  rs28548431  C    T    71.77    PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
  chr1    974165  rs9442391   T    C    29.84    LowQual  [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255
  • Individual statistics (first record):
      - GT, genotype: 0/1
      - AD, number of REF/ALT reads seen: 173 T, 141 G
      - DP, depth of reads with MAPQ > 17: 282
      - GQ, genotype quality: 99
      - PL, genotype likelihoods: 0/0: 10^-25.5 (unlikely), 0/1: 10^0 (likely), 1/1: 10^-25.5 (unlikely)
  • Location statistics, e.g.:
      - Strand bias
      - How many reads have a deletion at this site
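
The per-sample column can be unpacked with a few lines of code. The sketch below handles only the five keys shown on the slide (GT, AD, DP, GQ, PL) and converts the Phred-scaled PL values back to relative likelihoods with 10^(-PL/10); the function names are invented for this example, and a real pipeline would normally use an established VCF parser instead.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with sample values and convert the numeric fields."""
    record = dict(zip(format_field.split(":"), sample_field.split(":")))
    record["AD"] = [int(x) for x in record["AD"].split(",")]
    record["DP"] = int(record["DP"])
    record["GQ"] = float(record["GQ"])
    record["PL"] = [int(x) for x in record["PL"].split(",")]
    return record


def pl_to_likelihoods(pl):
    """Relative genotype likelihoods (0/0, 0/1, 1/1) from Phred-scaled PL values."""
    return [10 ** (-value / 10) for value in pl]


sample = parse_sample("GT:AD:DP:GQ:PL", "0/1:173,141:282:99:255,0,255")
ref_reads, alt_reads = sample["AD"]
alt_fraction = alt_reads / (ref_reads + alt_reads)
print(sample["GT"], round(alt_fraction, 2))
print(pl_to_likelihoods(sample["PL"]))  # the middle (0/1) entry is the most likely
```
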
When to call a variant?
  • Hom: REF 0%, ALT 100%
  • Het: REF 50%, ALT 50%
  • ??: REF 77%, ALT 23%
  • [Figures: QBI data]
Hard filtering
  • Reduce false positives by e.g. requiring:
      - Sufficient depth
      - Variant to be in >30% of reads
      - High quality
      - Strand balance
      - …
  • Subjective and dangerous in this high-dimensional search space.
  • [Figure: strand bias; QBI data]
  • Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
  • Wheeler, D.A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008).
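
A hard filter of this kind might look like the sketch below. Only the 30% allele-fraction cut-off comes from the slide; the other thresholds, the strand-skew measure and the field names (`ref_count`, `alt_fwd`, `alt_rev`, `qual`) are assumptions chosen for illustration.

```python
def passes_hard_filter(call, min_depth=10, min_alt_fraction=0.30,
                       min_qual=30.0, max_strand_skew=0.9):
    """Apply fixed thresholds to one candidate variant call."""
    alt_total = call["alt_fwd"] + call["alt_rev"]
    depth = call["ref_count"] + alt_total
    if depth == 0 or depth < min_depth:
        return False  # not enough reads to judge
    if call["qual"] < min_qual:
        return False  # call quality too low
    if alt_total / depth <= min_alt_fraction:
        return False  # variant must be in more than 30% of reads
    # Strand balance: share of ALT reads that sit on the dominant strand.
    skew = max(call["alt_fwd"], call["alt_rev"]) / alt_total
    return skew <= max_strand_skew


balanced = {"ref_count": 12, "alt_fwd": 6, "alt_rev": 5, "qual": 80.0}
one_strand = {"ref_count": 12, "alt_fwd": 11, "alt_rev": 0, "qual": 80.0}
ambiguous = {"ref_count": 77, "alt_fwd": 12, "alt_rev": 11, "qual": 80.0}
print(passes_hard_filter(balanced), passes_hard_filter(one_strand), passes_hard_filter(ambiguous))
```

The third call has 23% ALT reads, like the "??" case, and is rejected outright. Every threshold here is a judgment call, which is the subjectivity the slide warns about.
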
Gaussian mixture model
  • Train on trusted variants and require new variants to fall in the same region of the feature space.
  • Potential problem: overfitting and biasing towards the features of known SNPs!
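
The scoring half of this idea can be sketched as follows. The example assumes the mixture has already been fitted on trusted variants (the fitting step, e.g. by expectation-maximization, is omitted), uses diagonal covariances for brevity, and works on two made-up features. All parameter values are hypothetical.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def mixture_log_density(x, components):
    """Log density of x under a mixture given as (weight, mean, variance) tuples."""
    logs = [math.log(w) + log_gaussian(x, m, v) for w, m, v in components]
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(value - top) for value in logs))


# Hypothetical model over two illustrative features per variant.
trusted_model = [
    (0.7, [20.0, 1.0], [16.0, 1.0]),
    (0.3, [12.0, 3.0], [9.0, 4.0]),
]
typical = mixture_log_density([19.0, 1.2], trusted_model)
outlier = mixture_log_density([2.0, 30.0], trusted_model)
print(typical > outlier)  # candidates resembling trusted variants score higher
```

A candidate is then kept or dropped by thresholding this score, so anything unlike the training set scores poorly even if it is real, which is the bias noted above.
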
Indel calling
  • A first local realignment might not be sufficient to confidently determine the beginning and end of indels.
  • Dindel algorithm: local realignment for every indel candidate.
  • Albers CA, Lunter G, Macarthur DG, McVean G, Ouwehand WH, Durbin R. Dindel: Accurate indel calls from short-read data. Genome Res. 2011 Jun;21(6):961-73. PMID: 20980555.
Recap
  • DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
Outcome: How many variants will I find?
  • HiSeq: whole genome; mean coverage 60; 101 bp paired-end; (NA12878)
  • Exome: Agilent capture; mean coverage 20; 76/101 bp paired-end; (NA12878)
  • DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
Three things to remember
  • Getting the mapping right is critical.
  • Variant calling is more than counting the differences.
  • Just listing the variants does not tell you anything biologically relevant.
  • Image credit: Яick Harris
  • Fernald GH, Capriotti E, Daneshjou R, Karczewski KJ, Altman RB. Bioinformatics challenges for personalized medicine. Bioinformatics. 2011 Jul 1;27(13):1741-8. PMID: 21596790
Next week
  • Abstract: This seminar aims to answer the question of what to make of the identified variants, specifically how to evaluate their quality, prioritize them, and functionally annotate them.
Walk-in clinic

Variant (SNPs/Indels) calling in DNA sequences, Part 2


Editor's Notes

  1. unmethylated ‘C’ bases, or cytosines, are converted to ‘T’
  2. The proportion of unique sequence in the Streptococcus suis (squares) and Mus musculus (triangles) genomes for varying read lengths. This graph indicates that read length has a critical effect on the ability to place reads uniquely in the genome.
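
The relationship between read length and uniqueness described in note 2 can be reproduced on a toy sequence. This sketch counts, for a given length k, the fraction of positions whose k-mer occurs exactly once; it ignores the reverse strand and mismatches, and the sequence is invented for illustration.

```python
from collections import Counter


def unique_fraction(genome, k):
    """Fraction of k-mer start positions whose k-mer occurs exactly once."""
    if k < 1 or k > len(genome):
        raise ValueError("k must be between 1 and the genome length")
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)


toy_genome = "ACACGT"
for k in (2, 3):
    print(k, unique_fraction(toy_genome, k))  # uniqueness rises with k
```
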