Variant calling for disease association (2/2): Searching the haystack
July 14, 2011
Image credit: www.absolutefab.com
Quick recap: DNA sequence read mapping
- Sequencing -> FASTQ -> alignment to reference genome
- Resulting file type: BAM
- Visualized in a genome viewer: "What genomic regions were sequenced?"
- Diagram labels: Quality Control, Projects, FASTQ, BAM
Production Informatics and Bioinformatics
- Basic Production Informatics: produce raw sequence reads
- Advanced Production Informatics: map to the genome and generate raw genomic features (e.g. SNPs)
- Bioinformatics Research: analyze the data; uncover the biological meaning
- Per one-flowcell project
Good mapping is crucial
- Mapping tools trade accuracy for speed: the mapping is approximate.
- Identifying exactly where the reads map is the foundation for all subsequent analyses.
- The exact alignment of each read is especially important for variant calling.
Image credit: neilalderney123
Mapping challenges
- Incorrect mapping: amongst 3 billion bp (human), a 100-mer can occur by chance.
- Multi-mappers: the genome has non-unique regions (e.g. repeats), so one read can map to multiple sites.
- Duplicates: PCR duplicates can introduce artifacts.

Example pileup (reference on top, reads below; the last read occurs four times):

ACGATATTACACGTACACTCAAGTCGTTCGGAACCT
      TTACACGTACA
       TACACGTACAC
        ACACGTACACT
         CACGTACTCTC
         CACGTACTCTC
         CACGTACTCTC
         CACGTACTCTC

Figure legend: Streptococcus suis (squares), Mus musculus (triangles)
Turner DJ, Keane TM, Sudbery I, Adams DJ. Next-generation sequencing of vertebrate experimental organisms. Mamm Genome. 2009 Jun;20(6):327-38. PMID: 19452216
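The effect of PCR duplicates on allele counts can be sketched with the reads from this slide. The start offsets below are not given in the transcript; they were inferred by matching each read against the reference, so treat them as an assumption. Collapsing reads with identical start and sequence is also a simplification of what duplicate-marking tools do.

```python
from collections import Counter

REFERENCE = "ACGATATTACACGTACACTCAAGTCGTTCGGAACCT"

# (0-based start on the reference, read sequence).
# Starts are inferred from the sequences, not stated on the slide.
READS = [
    (6, "TTACACGTACA"),
    (7, "TACACGTACAC"),
    (8, "ACACGTACACT"),
] + [(9, "CACGTACTCTC")] * 4  # four identical copies: likely PCR duplicates


def base_counts(reads, position):
    """Count the bases that the reads place at one reference position."""
    counts = Counter()
    for start, sequence in reads:
        offset = position - start
        if 0 <= offset < len(sequence):
            counts[sequence[offset]] += 1
    return counts


def drop_exact_duplicates(reads):
    """Keep one copy of each (start, sequence) pair, preserving order."""
    return list(dict.fromkeys(reads))


SITE = 16  # the only position where the duplicated read disagrees with the reference
raw = base_counts(READS, SITE)
deduplicated = base_counts(drop_exact_duplicates(READS), SITE)
print(REFERENCE[SITE], dict(raw), dict(deduplicated))
```

With duplicates kept, the mismatching base outnumbers the reference base at that site; after collapsing duplicates it is supported by a single read.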
Methods for ensuring a good alignment
- Biological: using paired-end reads to increase coverage (a uniquely mapped mate about 200 bp away can anchor a read that falls in a repeat region)
- Bioinformatic:
  - Local realignment
  - Base quality score recalibration
Figure labels: ~200 bp, ?, Repeat region
Local Realignment (GATK)
- Local realignment of all reads at a specific location simultaneously, to minimize mismatches to the reference genome
- Reduces erroneous SNPs
- Refines the location of indels
Figure (QBI data): original vs. realigned
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
Quality score recalibration (GATK)
- PHRED scores are predicted.
- Looking at all reads at a specific location allows a better estimate of the base quality score:
  - Exclude all known dbSNP sites
  - Assume all other mismatches are sequencing errors
  - Compute a new calibration table based on mismatch rates per position on the read
- Important for variant calling
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
Thomas Keane, 9th European Conference on Computational Biology, 26th September, 2010
Recalibration of quality score
"All bases are called with Q25. In reality not all are that good: bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20" (GATK)
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
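The Q25 versus Q20 example follows from the Phred definition Q = -10 * log10(p). The sketch below is a toy illustration of the recalibration idea, not GATK's implementation: the function names and the observation tuples are hypothetical, and the table is keyed only by reported quality and position on the read.

```python
import math
from collections import defaultdict


def phred_to_error(quality):
    """Error probability implied by a Phred score."""
    return 10 ** (-quality / 10)


def error_to_phred(probability):
    """Phred score implied by an error probability."""
    return -10 * math.log10(probability)


def recalibration_table(observations, known_variant_sites):
    """Empirical quality per (reported quality, read position).

    observations: iterable of (site, read_position, reported_q, is_mismatch).
    Known variant sites are skipped; every other mismatch counts as an error.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [mismatches, bases]
    for site, read_position, reported_q, is_mismatch in observations:
        if site in known_variant_sites:
            continue
        entry = totals[(reported_q, read_position)]
        entry[0] += int(is_mismatch)
        entry[1] += 1
    # Bins without any mismatch are left out: their empirical Q is unbounded.
    return {key: error_to_phred(m / n) for key, (m, n) in totals.items() if m > 0}


# Toy data: 100 bases reported as Q25 at read position 50, one of them a mismatch,
# plus one mismatch at a known variant site that must not count as an error.
observations = [(site, 50, 25, site == 0) for site in range(100)]
observations.append((1000, 50, 25, True))
table = recalibration_table(observations, known_variant_sites={1000})
print(table)
```

In this toy data the bin reported as Q25 has an empirical mismatch rate of 1 in 100, which corresponds to Q20, as in the quoted example.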
Variant calling methods
- More than 15 different algorithms
- Three categories:
  - Allele counting
  - Probabilistic methods, e.g. a Bayesian model, to quantify statistical uncertainty; priors can be assigned e.g. by taking the observed allele frequency of multiple samples into account
  - Incorporating linkage disequilibrium (LD): specifically helpful for low coverage and common variants
Figure: variant (SNP); Ref A; Ind1 G/G; Ind2 A/G
Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011 Jun;12(6):443-51. PMID: 21587300.
http://seqanswers.com/wiki/Software/list
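A probabilistic caller can be sketched as a three-genotype Bayesian model with a binomial read model. This is a minimal illustration rather than the model of any particular tool: the flat prior, the 1% error rate, and the function name are assumptions, and the counts of 77 reference and 23 alternate reads simply read the 77%/23% case from the "When to call a variant?" slide as 100 reads.

```python
import math

GENOTYPES = ("0/0", "0/1", "1/1")


def genotype_posteriors(ref_count, alt_count, error_rate=0.01,
                        priors=(1 / 3, 1 / 3, 1 / 3)):
    """Posterior probability of each diploid genotype given allele counts."""
    # Probability that a single read shows the ALT allele under each genotype.
    alt_probability = {"0/0": error_rate, "0/1": 0.5, "1/1": 1 - error_rate}
    log_scores = []
    for genotype, prior in zip(GENOTYPES, priors):
        p = alt_probability[genotype]
        # The binomial coefficient is identical for all genotypes and cancels.
        log_scores.append(math.log(prior)
                          + alt_count * math.log(p)
                          + ref_count * math.log(1 - p))
    peak = max(log_scores)  # subtract the maximum to avoid underflow
    weights = [math.exp(score - peak) for score in log_scores]
    total = sum(weights)
    return {g: w / total for g, w in zip(GENOTYPES, weights)}


posteriors = genotype_posteriors(ref_count=77, alt_count=23)
best = max(posteriors, key=posteriors.get)
print(best, posteriors)
```

Replacing the flat prior with genotype frequencies derived from a population allele frequency is one way to bring in the multi-sample information mentioned on the slide.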
VCF format

[HEADER LINES]
#CHROM  POS     ID          REF  ALT  QUAL     FILTER   INFO           FORMAT          NA12878
chr1    873762  .           T    G    5231.78  PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A    G    3931.66  PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C    T    71.77    PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974165  rs9442391   T    C    29.84    LowQual  [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255

Individual statistics (values from the first record):
- GT, genotype: 0/1
- AD, total number of REF/ALT reads seen: 173 T, 141 G
- DP, depth (MAPQ > 17): 282
- GQ, genotype quality: 99
- PL, genotype likelihoods: 0/0: 10^-25.5 = unlikely; 0/1: 10^0 = likely; 1/1: 10^-25.5 = unlikely

Location statistics, e.g.:
- Strand bias
- How many reads have a deletion at this site
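The per-sample column can be unpacked by pairing the FORMAT keys with the sample values. The sketch below uses the first record from the slide; the helper names are made up for illustration, and the INFO column is set to "." because the slide only shows a placeholder there.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys (GT:AD:...) with the values of one sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def pl_to_relative_likelihoods(pl_field):
    """Convert Phred-scaled PL values to relative likelihoods, 10^(-PL/10)."""
    return [10 ** (-int(value) / 10) for value in pl_field.split(",")]


# First record from the slide; INFO is "." because the slide elides the annotations.
record = "\t".join([
    "chr1", "873762", ".", "T", "G", "5231.78", "PASS", ".",
    "GT:AD:DP:GQ:PL", "0/1:173,141:282:99:255,0,255",
])

columns = record.split("\t")
sample = parse_sample(columns[8], columns[9])
ref_reads, alt_reads = (int(count) for count in sample["AD"].split(","))
alt_fraction = alt_reads / (ref_reads + alt_reads)
likelihoods = pl_to_relative_likelihoods(sample["PL"])  # order: 0/0, 0/1, 1/1

print(sample["GT"], ref_reads, alt_reads, round(alt_fraction, 3), likelihoods)
```

For real files a dedicated VCF parser is the safer choice, since this sketch ignores header lines, multi-allelic sites, and missing values.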
When to call a variant?
- Hom: REF 0%, ALT 100%
- Het: REF 50%, ALT 50%
- ??: REF 77%, ALT 23%
Figures: QBI data
Hard Filtering
Reducing false positives by e.g. requiring:
- Sufficient depth
- Variant to be in >30% of reads
- High quality
- Strand balance
- …
Subjective and dangerous in this high-dimensional search space.
Figure: strand bias (QBI data)
Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
Wheeler, D.A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008).
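The criteria above can be written as a small filter function. Only the 30% alternate-read threshold comes from the slide; the depth, quality, and strand-skew cutoffs, the `Call` class, and the strand counts in the examples are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Call:
    qual: float
    ref_reads: int
    alt_reads: int
    alt_forward: int  # ALT-supporting reads on the forward strand
    alt_reverse: int  # ALT-supporting reads on the reverse strand


def failed_filters(call, min_depth=10, min_alt_fraction=0.30,
                   min_qual=30.0, max_strand_skew=0.9):
    """Return the names of the hard filters a call fails (empty list = pass)."""
    failures = []
    depth = call.ref_reads + call.alt_reads
    if depth < min_depth:
        failures.append("LowDepth")
    if depth == 0 or call.alt_reads / depth <= min_alt_fraction:
        failures.append("LowAltFraction")
    if call.qual < min_qual:
        failures.append("LowQual")
    if call.alt_reads > 0:
        skew = max(call.alt_forward, call.alt_reverse) / call.alt_reads
        if skew > max_strand_skew:
            failures.append("StrandBias")
    return failures


# QUAL and AD values are taken from the VCF slide; strand counts are invented.
good = Call(qual=5231.78, ref_reads=173, alt_reads=141, alt_forward=70, alt_reverse=71)
doubtful = Call(qual=29.84, ref_reads=14, alt_reads=4, alt_forward=4, alt_reverse=0)
print(failed_filters(good), failed_filters(doubtful))
```

Every cutoff here is a separate judgment call, which is the subjectivity the slide warns about.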
Gaussian mixture model
- Train on trusted variants and require the new variants to live in the same hyperspace.
- Potential problem: overfitting and biasing to features of known SNPs!
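The idea of scoring new variants against a model trained on trusted ones can be sketched with a single diagonal Gaussian, which is a one-component stand-in for the mixture and not the method used by any specific tool. The two annotation columns and all values below are invented for illustration.

```python
import math
from statistics import mean, pvariance


def fit_diagonal_gaussian(rows):
    """Estimate a mean and variance per annotation column from trusted variants."""
    return [(mean(column), pvariance(column)) for column in zip(*rows)]


def log_density(parameters, row):
    """Log density of one variant's annotations under the fitted Gaussian."""
    total = 0.0
    for (mu, variance), value in zip(parameters, row):
        total += -0.5 * (math.log(2 * math.pi * variance)
                         + (value - mu) ** 2 / variance)
    return total


# Hypothetical annotations per trusted variant: (quality by depth, strand bias score).
trusted = [(20.0, 0.1), (22.0, -0.2), (19.0, 0.0), (21.0, 0.3)]
model = fit_diagonal_gaussian(trusted)

typical = log_density(model, (20.5, 0.05))  # resembles the trusted set
outlier = log_density(model, (3.0, 5.0))    # far from the trusted set
print(typical, outlier)
```

A variant that resembles the trusted set scores higher than one far from it, which also shows the risk named on the slide: novel variants with unusual but genuine annotations are penalized.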
Indel calling
- A first local realignment might not be sufficient to confidently determine the beginning and end of indels.
- Dindel algorithm: local realignment for every indel candidate
Albers CA, Lunter G, Macarthur DG, McVean G, Ouwehand WH, Durbin R. Dindel: Accurate indel calls from short-read data. Genome Res. 2011 Jun;21(6):961-73. PMID: 20980555.
Recap
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
Outcome: How many variants will I find?
- HiSeq: whole genome; mean coverage 60; 101PE; (NA12878)
- Exome: Agilent capture; mean coverage 20; 76/101PE; (NA12878)
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
Three things to remember
- Getting the mapping right is critical.
- Variant calling is not merely counting the differences.
- Just listing the variants does not tell you anything biologically relevant.
Image credit: Яick Harris
Fernald GH, Capriotti E, Daneshjou R, Karczewski KJ, Altman RB. Bioinformatics challenges for personalized medicine. Bioinformatics. 2011 Jul 1;27(13):1741-8. PMID: 21596790
Next week
Abstract: This seminar aims to answer the question of what to make of the identified variants, specifically how to evaluate their quality, prioritize them, and functionally annotate them.
Walk-in clinic

Editor's Notes

  1. unmethylated ‘C’ bases, or cytosines, are converted to ‘T’
  2. The proportion of unique sequence in the Streptococcus suis (squares) and Mus musculus (triangles) genomes for varying read lengths. This graph indicates that read length has a critical effect on the ability to place reads uniquely in the genome.