Statistical Significance of Alignments

Dr Avril Coghlan
alc@sanger.ac.uk

Note: this talk contains animations which can only be seen by
downloading and using ‘View Slide show’ in Powerpoint
Biological importance of alignments
• A sequence alignment represents a hypothesis about
  the homology of individual positions in different
  sequences:


                       Hypothesis: Y in seqs 1,2 is homologous to E in seqs 3,4
• Based on an alignment, we quantify similarity
• Sequence similarity suggests a shared evolutionary
  history
  Furthermore, proteins with very similar sequences probably have
  similar biological functions
• Once we have an alignment between 2 sequences,
      we can calculate their similarity over their lengths
       A measure of similarity is percent identity, ie. number of identical
             amino acids * 100 / length of the alignment
       eg. the alignment below is 39 amino acids long, & the human & fruitfly
             sequences differ at 1 position
       → Human & fruitfly sequences have a percent identity of (38*100/39 =)
             97% in this part of the Eyeless PAX domain


[Figure: alignment of positions 1 to 39 of the Eyeless PAX domain in human, mouse, cat, sea squirt and fruitfly]

                     Human and fruitfly Eyeless proteins differ at this position
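
The percent identity calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `percent_identity` is made up here, and the two toy sequences are placeholders that reproduce the counts from the slide (39 columns, 1 difference), not the real Eyeless residues.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity = identical positions * 100 / length of the alignment.

    Both inputs are rows of the same alignment, so they must be equally long.
    Gap characters ('-') are never counted as identical.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return identical * 100 / len(aligned_a)


# Placeholder rows: 39 columns, differing at exactly 1 position.
toy_human = "A" * 39
toy_fruitfly = "A" * 38 + "G"
print(round(percent_identity(toy_human, toy_fruitfly)))  # 97
```

Dividing by the alignment length (rather than by the length of either sequence) follows the definition given on the slide.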
Similarity versus homology
• Homologues are similar because they had a common
  ancestor eg. eyeless homologues
• After aligning two sequences, we can say they are
  99% similar, or 50% similar, etc.

  V I V A L A S V E G
  V I V A V A S V E G   90% similar
  Very similar sequences are probably homologues

• Any 2 random sequences are similar to some extent,
  so similarity doesn’t necessarily imply homology

  V I V A L A S V E G
  T S Y A V F G R T W   10% similar
  Sequences with very low similarity may be homologues
Similarity versus homology
• Two girls are either sisters or not
• Two sequences are either homologues or not

  V I V A L A S V E G
  V I V A V A S V E G   90% similar; calling this “90% homologous” is incorrect!

  V I V A L A S V E G
  T S Y A V F G R T W   10% similar; calling this “10% homologous” is incorrect!
A key question is:
• How does one interpret minimal similarity?
  Are the sequences actually related, or is the alignment by chance?



                     Q K G S Y Q E K G Y C
                     |     |             |
                     Q Q E S G P V R S T C
Statistical analysis of alignments
• We’ve calculated the score for the best alignment
  between 2 sequences A and B, but is it due to chance
  or biology?
• Sequences accumulate substitutions over millions of
  years, so it is sometimes hard to decide if 2
  sequences are homologous
• Unrelated sequences may be somewhat similar due
  to chance
In humans, mutations in the PTCH2 gene are a cause of brain tumours and
     skin cancers

     In the nematode Caenorhabditis elegans, the tra-2 gene functions in
     development to determine the sex of the embryo
     C. elegans adults can be male (make sperm) or hermaphrodite (make
     sperm & eggs)

Alignment of human PTCH2 & Caenorhabditis elegans TRA2 (score = 136):




Are human PTCH2 and C. elegans tra-2 homologues?
Statistical significance of the alignment
• To decide if two sequences are likely to be
  homologues (related), we calculate the statistical
  significance of the alignment score
• To do this, we first need a null model (background
  model), ie. a statistical model that will let us
  calculate what we expect
  There are many proteins in all the different species
  2 randomly chosen proteins are expected to be unrelated
  Our null model should therefore describe the alignment scores
  expected for pairs of unrelated sequences
• How can we know the alignment scores for pairs of
  unrelated protein sequences?
  We could generate random protein sequences, & calculate
  alignment scores for pairs of random protein sequences
  We can use a multinomial model to generate random protein sequences
  ie. make a roulette wheel with different fractions of the wheel labelled
        for each of the 20 amino acids
  Then spin the wheel n times to make a random protein sequence that
        is n amino acids long


                                       In this multinomial model,
                                       p(P)=0.14, p(A)=0.28,
                                       p(W)=0.14, p(H)=0.14, p(E)=0.28
                                       All the other amino acids have
                                       probabilities of zero here
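
The roulette wheel can be sketched in Python as weighted random sampling. This is a minimal illustration rather than the method used to make the slides: the function name `random_protein` and the seed are arbitrary choices, and the probabilities are the rounded values quoted above.

```python
import random


def random_protein(probabilities: dict[str, float], length: int,
                   rng: random.Random) -> str:
    """Spin the 'roulette wheel' `length` times and join the results."""
    residues = list(probabilities)
    weights = [probabilities[r] for r in residues]
    # random.choices treats the weights as relative, so the rounded values
    # from the slide (which sum to 0.98) do not need to be renormalised.
    return "".join(rng.choices(residues, weights=weights, k=length))


# The wheel from the slide; amino acids not listed have probability zero.
wheel = {"P": 0.14, "A": 0.28, "W": 0.14, "H": 0.14, "E": 0.28}

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
sequence = random_protein(wheel, 20, rng)
print(sequence)
```

Each position is drawn independently of its neighbours, which is what makes this a multinomial model.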
• A good multinomial model for random sequences
  should take into account the sequence composition
  eg. we could use a multinomial model to generate random sequences of
       the same composition as C. elegans TRA2

  ie. make a roulette wheel where the fraction of the circle labelled with
        each of the 20 amino acids is set equal to the % of that amino
        acid in the TRA2 sequence
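
As a sketch of how such a wheel could be built, the Python below counts each amino acid in a sequence and turns the counts into fractions. The function name `composition_model` is hypothetical, and the short input string is a toy stand-in rather than the TRA2 sequence.

```python
from collections import Counter


def composition_model(sequence: str) -> dict[str, float]:
    """Return each amino acid's fraction of the sequence (the wheel sizes)."""
    if not sequence:
        raise ValueError("sequence is empty")
    counts = Counter(sequence)
    total = len(sequence)
    return {residue: count / total for residue, count in counts.items()}


# Toy stand-in for a real protein: 7 residues, with A and E appearing twice.
model = composition_model("PAWHEAE")
for residue, fraction in sorted(model.items()):
    print(residue, f"{fraction:.2f}")
```

Amino acids that never occur in the input get no entry in the dictionary, which is the same as giving them a probability of zero.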
• One way to see if an alignment score is statistically
   significant is to compare it to the scores for
   alignments of random sequences
    We make a random sequence of the same length and amino acid
    composition as one of our original 2 sequences (eg. TRA2)
    ie. use our ‘TRA2’ multinomial model to make a sequence


Alignment of human PTCH2 & a random sequence generated using a multinomial
model (with the probabilities of amino acids set equal to their fractions in TRA2)
(score = 51):
• We can generate 200 random sequences using our
  TRA2-like multinomial model
  For each random sequence, we can calculate the best alignment score for
  the random sequence and human PTCH2
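
The procedure above can be sketched as follows. Several things here are assumptions, since the slides do not give them: the scoring scheme (a simple local alignment with match = 2, mismatch = -1, gap = -2, standing in for a real substitution matrix), the function names, and the two short toy sequences used in place of PTCH2 and TRA2.

```python
import random
from collections import Counter


def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score (Smith-Waterman with a linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def null_scores(fixed: str, template: str, n: int,
                rng: random.Random) -> list[int]:
    """Score `fixed` against n random sequences that share the length and
    amino acid composition (multinomial model) of `template`."""
    counts = Counter(template)
    residues = list(counts)
    weights = [counts[r] for r in residues]
    scores = []
    for _ in range(n):
        shuffled = "".join(rng.choices(residues, weights=weights,
                                       k=len(template)))
        scores.append(local_alignment_score(fixed, shuffled))
    return scores


# Toy stand-ins: 'fixed' plays the role of PTCH2, 'template' that of TRA2.
fixed, template = "HEAGAWGHEE", "PAWHEAE"
scores = null_scores(fixed, template, 200, random.Random(0))
print(len(scores), min(scores), max(scores))
```

The list `scores` is the sample from the null model; the observed score for the real pair is then compared against it.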
Compare the scores obtained with the score seen for PTCH2 & TRA2 eg.

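[Figure: histogram of the alignment scores for random sequences (x-axis: alignment score; y-axis: number of alignments of random sequences), with the alignment score for proteins PTCH2 & TRA2 marked and a region labelled ‘5% of scores for alignments of random sequences’]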

What % of the random sequences have a score equal to or higher than that
   for TRA2 & PTCH2? eg. 5% in the picture
This method can be used to estimate the significance of alignments in the
   form of P-values, eg. P=0.05 in the picture
We accept the alignment as significant (indicating probable homology) if the
   score is in the top 5% (or another chosen value) of the scores for random
   sequences, ie. if P ≤ 0.05
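
A minimal Python sketch of this P-value estimate is given below. The function name `empirical_p_value` is illustrative, and the list of random scores is made-up data chosen so that the arithmetic is easy to check by hand.

```python
def empirical_p_value(observed: float, random_scores: list[float]) -> float:
    """Fraction of random-sequence scores equal to or higher than `observed`."""
    if not random_scores:
        raise ValueError("need at least one random score")
    at_least_as_high = sum(1 for s in random_scores if s >= observed)
    return at_least_as_high / len(random_scores)


# Made-up null distribution: the 100 scores 1, 2, ..., 100.
random_scores = list(range(1, 101))

p_high = empirical_p_value(96, random_scores)  # 5 of 100 scores are >= 96
p_low = empirical_p_value(6, random_scores)    # 95 of 100 scores are >= 6

print(p_high, p_high <= 0.05)  # 0.05 True
print(p_low, p_low <= 0.05)    # 0.95 False
```

The 0.05 cut-off is the conventional choice mentioned above; any other chosen value can be substituted in the final comparison.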
eg. for human PTCH2 and C. elegans TRA2:
  The alignment score is 136
  When 200 random sequences (generated with a ‘TRA2’ multinomial
        model) were aligned to PTCH2, only 0.36% of the alignments had a
        score of ≥136
  Therefore, we estimate a P-value of P=0.0036
  ie. we estimate that the probability of getting a score of ≥136 for PTCH2
        and TRA2 due to chance is 0.0036 (36/10,000)

Alignment of human PTCH2 & C. elegans TRA2 (score = 136):




   Human PTCH2 and C. elegans tra-2 are probably homologues
In the example below, 0.95 of the random sequences have an alignment
    score equal to or higher than that for A & B, so P=0.95
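[Figure: histogram of the alignment scores for random sequences (x-axis: alignment score; y-axis: number of alignments of random sequences), with the alignment score for a different A & B marked; 95% of the scores for alignments of random sequences are equal to or higher than it]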
Alignment of fruitfly Eyeless & C. elegans TRA2 (score = 78):




P=1, so eyeless and tra-2 are probably not homologues
Further Reading
•   Chapter 3 in Introduction to Computational Genomics Cristianini & Hahn
•   Chapter 6 in Deonier et al book Computational Genome Analysis
•   Practical on alignment in R in the Little Book of R for Bioinformatics:
    https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html

Statistical significance of alignments

  • 1. Statistical Significance of Alignments Dr Avril Coghlan alc@sanger.ac.uk Note: this talk contains animations which can only be seen by downloading and using ‘View Slide show’ in Powerpoint
  • 2. Biological importance of alignments • A sequence alignment represents a hypothesis about the homology of individual positions in different sequences: Hypothesis: Y in seqs 1,2 is homologous to E in seqs 3,4 • Based on an alignment, we quantify similarity • Sequence similarity suggests a shared evolutionary history Furthermore, proteins with very similar sequences probably have similar biological functions
  • 3. • Once we have an alignment between 2 sequences, we can calculate their similarity over their lengths A measure of similarity is percent identity, ie. number of identical amino acids * 100 / length of the alignment eg. the alignment below is 39 amino acids long, & the human & fruitfly sequences differ at 1 position → Human & fruitfly sequences have a percent identity of (38*100/39 =) 97% in this part of the Eyeless PAX domain 12 14 16 18 20 22 24 26 28 30 32 34 36 38 1 2 3 4 5 6 7 8 9 10 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 Human Mouse Cat Sea squirt Fruitfly Human and fruitfly Eyeless proteins differ at this position
  • 4. Similarity versus homology • Homologues are similar because they had a common ancestor eg. eyeless homologues • After aligning two sequences, we can say they are 99% similar, or 50 similar, etc. Very similar sequences are V I V A L A S V E G 90% similar V I V A V A S V E G probably homologues • Any 2 random sequences are similar to some extent, so similarity doesn’t necessarily imply homology Sequences with very low V I V A L A S V E G 10% similar T S Y A V F G R T W similarity may be homologues
  • 5. Similarity versus homology • Two girls are either sisters or not • Two sequences are either homologues or not Incorrect! V I V A L A S V E G 90% similar “90% homologous” V I V A V A S V E G Incorrect! V I V A L A S V E G 10% similar “10% homologous” T S Y A V F G R T W
  • 6. A key question is: • How does one interpret minimal similarity? Are the sequences actually related, or is the alignment by chance? Q K G S Y Q E K G Y C | | | Q Q E S G P V R S T C
  • 7. Statistical analysis of alignments • We’ve calculated the score for the best alignment between 2 sequences A and B, but is it due to chance or biology? • Sequences accumulate substitutions over millions of years, so it is sometimes hard to decide if 2 sequences are homologous • Unrelated sequences may be somewhat similar due to chance
  • 8. In humans, mutations in the PTCH2 gene are a cause of brain tumours and skin cancers In the nematode Caenorhabditis elegans, the tra-2 gene functions in development to determine the sex of the embryo C. elegans adults can be male (make sperm) or hermaphrodite (make sperm & eggs) Alignment of human PTCH2 & Caenorhabditis elegans TRA2 (score = 136): Are human PTCH2 and C. elegans tra-2 homologues?
  • 9. Statistical significance of the alignment • To decide if two sequences are likely to be homologues (related), we calculate the statistical significance of the alignment score • To do this, we first need a null model (background model), ie. a statistical model that will let us calculate what we expect There are many proteins in all the different species 2 randomly chosen proteins are expected to be unrelated Our null model should therefore describe the alignment scores expected for pairs of unrelated sequences
  • 10. • How can we know the alignment scores for pairs of unrelated protein sequences? We could generate random protein sequences, & calculate alignment scores for pairs of random protein sequences We can use a multinomial model to generate random protein sequences ie. make a roulette wheel with different fractions of the wheel labelled for each of the 20 amino acids Then spin the wheel n times to make a random protein sequence that is n amino acids long In this multinomial model, p(P)=0.14, p(A)=0.28, p(W)=0.14, p(H)=0.14, p(E)=0.28 All the other amino acids have probabilities of zero here
  • 11. • A good multinomial model for random sequences should take into account the sequence composition eg. we could use a multinomial model to generate random sequences of the same composition as C. elegans TRA2 ie. make a roulette wheel where the fraction of the circle labelled with each of the 20 amino acids is set equal to the % of that amino acid in the TRA2 sequence
  • 12. • One way to see if an alignment score is statistically significant is to compare it to the scores for alignments of random sequences We make a random sequence of the same length & amino acid composition as one of our original 2 sequences (eg. TRA2) ie. use our ‘TRA2’ multinomial model to make a sequence Alignment of human PTCH2 & a random sequence generated using a multinomial model (with the probabilities of amino acids set equal to their fractions in TRA2) (score = 51):
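The roulette-wheel idea on the two slides above can be sketched in Python as follows. The function names are hypothetical, and the template sequence "PAWHEAE" is an assumption chosen only because its composition (P, W, H once each; A, E twice each, out of 7) matches the probabilities quoted on the slide (1/7 ≈ 0.14, 2/7 ≈ 0.28). For a real analysis the template would be the full TRA2 protein sequence.

```python
import random
from collections import Counter


def composition_model(sequence):
    """Multinomial model: each amino acid's probability is its fraction in `sequence`."""
    counts = Counter(sequence)
    total = len(sequence)
    return {aa: n / total for aa, n in counts.items()}


def random_sequence(model, length, rng):
    """Spin the 'roulette wheel' `length` times, one independent draw per position."""
    letters = list(model)
    weights = [model[aa] for aa in letters]
    return "".join(rng.choices(letters, weights=weights, k=length))


# Assumed toy template whose composition matches the slide's probabilities.
model = composition_model("PAWHEAE")
rng = random.Random(0)  # seeded only so that the example is repeatable
sample = random_sequence(model, 20, rng)
print(model)
print(sample)
```

Amino acids that do not occur in the template get no slice of the wheel, which corresponds to "all the other amino acids have probabilities of zero" on the slide.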
  • 13. • We can generate 200 random sequences using our TRA2-like multinomial model For each random sequence, we can calculate the best alignment score for the random sequence and human PTCH2
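As a rough sketch of this step, the code below scores one fixed sequence against many random sequences. It is not the method used to make the slides: the scoring scheme (match +2, mismatch -1, linear gap -2) is a simplifying assumption, whereas real protein comparisons normally use a substitution matrix and affine gap penalties. The function names and the short toy sequences are also assumptions, standing in for PTCH2 and TRA2.

```python
import random


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    previous = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if x == y else mismatch)
            up = previous[j] + gap
            left = current[j - 1] + gap
            cell = max(0, diagonal, up, left)  # local alignment: never below 0
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


def null_scores(target, template, n_random, rng):
    """Scores of `target` against random sequences with the length and composition of `template`."""
    letters = sorted(set(template))
    weights = [template.count(aa) for aa in letters]
    scores = []
    for _ in range(n_random):
        rand_seq = "".join(rng.choices(letters, weights=weights, k=len(template)))
        scores.append(local_alignment_score(target, rand_seq))
    return scores


# Toy stand-ins for the real proteins (assumed, not the actual sequences).
scores = null_scores("HEAGAWGHEE", "PAWHEAE", 200, random.Random(1))
print(min(scores), max(scores))
```

The list `scores` plays the role of the histogram of random-sequence scores against which the real alignment score is compared.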
  • 14. Compare the scores obtained with the score seen for PTCH2 & TRA2 eg. [Histogram: x-axis, alignment score; y-axis, number of alignments of random sequences; the alignment score for proteins PTCH2 & TRA2 is marked, with 5% of scores for alignments of random sequences at or above it] What % of the random sequences have a score equal to or higher than that for TRA2 & PTCH2? eg. 5% (0.05) in the picture This method can be used to estimate the significance of alignments in the form of P-values, eg. P=0.05 in the picture We accept the alignment as significant (indicating probable homology) if the score is in the top 5% (or another chosen value) of the scores for random sequences, ie. if P ≤ 0.05
  • 15. eg. for human PTCH2 and C. elegans TRA2: The alignment score is 136 When 200 random sequences (generated with a ‘TRA2’ multinomial model) were aligned to PTCH2, only 0.36% of alignments had a score of ≥136 Therefore, we estimate a P-value of P=0.0036 ie. we estimate that the probability of getting a score of at least 136 for PTCH2 and TRA2 due to chance is 0.0036 (36/10,000) Alignment of human PTCH2 & C. elegans TRA2 (score = 136): Human PTCH2 and C. elegans tra-2 are probably homologues
  • 16. In the example below, 0.95 of the random sequences have an alignment score equal to or higher than that for A & B, so P=0.95 [Histogram: x-axis, alignment score; y-axis, number of alignments of random sequences; the alignment score for a different A & B is marked, with 95% of scores for alignments of random sequences at or above it] Alignment of fruitfly Eyeless & C. elegans TRA2 (score = 78): P=1 eyeless and tra-2 are probably not homologues
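The P-value estimate described above is the fraction of random-sequence scores that are at least as high as the observed score. A minimal sketch follows, with a hypothetical function name and a made-up list of 10,000 random scores in which exactly 36 reach 136, chosen to match the 36/10,000 fraction quoted on the slide. These are not real alignment scores.

```python
def empirical_p_value(observed_score, random_scores):
    """Fraction of random-sequence scores that are >= the observed score."""
    if not random_scores:
        raise ValueError("need at least one random score")
    at_least = sum(1 for s in random_scores if s >= observed_score)
    return at_least / len(random_scores)


# Made-up null distribution: 36 of 10,000 scores are at or above 136.
random_scores = [140] * 36 + [60] * 9964
p = empirical_p_value(136, random_scores)
significant = p <= 0.05  # the 5% threshold used in the talk
print(round(p, 4), significant)  # 0.0036 True
```

The resolution of such an estimate depends on how many random sequences are used: with n random scores, the estimate can only take values that are multiples of 1/n, so small P-values need many random sequences.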
  • 17. Further Reading • Chapter 3 in Introduction to Computational Genomics by Cristianini & Hahn • Chapter 6 in Computational Genome Analysis by Deonier et al • Practical on alignment in R in the Little Book of R for Bioinformatics: https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html

Editor's Notes

  1. Image credit (Williams sisters): http://media-2.web.britannica.com/eb-media/24/79824-004-7C20393C.jpg Image credit (Marilyn Monroe): http://cm1.theinsider.com/media/0/81/12/Marilyn-Monroe-11.0.0.0x0.432x594.jpeg
  2. Note: made image by aligning Uniprot PTCH2_HUMAN to C. elegans TRA2 protein (CE23546 from WormBase) using Ssearch (Smith-Waterman algorithm) and viewing the alignment using Jalview. This alignment contains the Patched domain. Image credit (human): http://www.ensembl.org/img/species/pic_Homo_sapiens.png Image credit (C. elegans): http://www.ensembl.org/img/species/pic_Caenorhabditis_elegans.png
  3. Note: made image by aligning Uniprot PTCH2_HUMAN to C. elegans TRA2 protein (CE23546 from WormBase) using Ssearch (Smith-Waterman algorithm) and viewing the alignment using Jalview. Image credit (human): http://www.ensembl.org/img/species/pic_Homo_sapiens.png Image credit (C. elegans): http://www.ensembl.org/img/species/pic_Caenorhabditis_elegans.png