Statistical significance of alignments

Dr Avril Coghlan
alc@sanger.ac.uk

Biological importance of alignments
• A sequence alignment represents a hypothesis about
the homology of individual positions in different
sequences:

Hypothesis: Y in seqs 1,2 is homologous to E in seqs 3,4
• Based on an alignment, we quantify similarity
• Sequence similarity suggests a shared evolutionary
history
Furthermore, proteins with very similar sequences probably have
similar biological functions

• Once we have an alignment between 2 sequences,
we can calculate their similarity over their lengths
A measure of similarity is percent identity, ie. number of identical
amino acids * 100 / length of the alignment
eg. the alignment below is 39 amino acids long, & the human & fruitfly
sequences differ at 1 position
→ Human & fruitfly sequences have a percent identity of (38*100/39 =)
97% in this part of the Eyeless PAX domain

[Figure: alignment of positions 1-39 of the Eyeless PAX domain in human, mouse, cat, sea squirt and fruitfly. The single position at which the human and fruitfly Eyeless proteins differ is marked.]
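The percent identity calculation above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original talk: the function name `percent_identity` is an assumption, and the two 39-residue strings are placeholders with one mismatch, not the real Eyeless sequences.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity = number of identical positions * 100 / alignment length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # Count alignment columns where both sequences have the same residue.
    # A gap character ('-') is never counted as an identity.
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return identical * 100 / len(aligned_a)


# Placeholder alignment (NOT the real Eyeless sequences): 39 columns, 1 mismatch.
seq_1 = "A" * 39
seq_2 = "A" * 38 + "G"
identity = percent_identity(seq_1, seq_2)  # 38 * 100 / 39, about 97%
```

Dividing by the alignment length (rather than by the length of one sequence) follows the definition given on the slide.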

Similarity versus homology
• Homologues are similar because they had a common
ancestor eg. eyeless homologues
• After aligning two sequences, we can say they are
99% similar, or 50% similar, etc.
Very similar sequences are probably homologues, eg. 90% similar:
V I V A L A S V E G
V I V A V A S V E G

• Any 2 random sequences are similar to some extent,
so similarity doesn’t necessarily imply homology
Sequences with very low similarity may be homologues, eg. 10% similar:
V I V A L A S V E G
T S Y A V F G R T W

Similarity versus homology
• Two girls are either sisters or not
• Two sequences are either homologues or not

Incorrect! “90% homologous” (the sequences are 90% similar)
V I V A L A S V E G
V I V A V A S V E G

Incorrect! “10% homologous” (the sequences are 10% similar)
V I V A L A S V E G
T S Y A V F G R T W

A key question is:
• How does one interpret minimal similarity?
Are the sequences actually related, or is the alignment by chance?

Q K G S Y Q E K G Y C
|     |             |
Q Q E S G P V R S T C

Statistical analysis of alignments
• We’ve calculated the score for the best alignment
between 2 sequences A and B, but is it due to chance
or biology?
• Sequences accumulate substitutions over millions of
years, so it is sometimes hard to decide if 2
sequences are homologous
• Unrelated sequences may be somewhat similar due
to chance

In humans, mutations in the PTCH2 gene are a cause of brain tumours and
skin cancers

In the nematode Caenorhabditis elegans, the tra-2 gene functions in
development to determine the sex of the embryo
C. elegans adults can be male (make sperm) or hermaphrodite (make
sperm & eggs)

Alignment of human PTCH2 & Caenorhabditis elegans TRA2 (score = 136):

Are human PTCH2 and C. elegans tra-2 homologues?

Statistical significance of the
alignment
• To decide if two sequences are likely to be
homologues (related), we calculate the statistical
significance of the alignment score
• To do this, we first need a null model (background
model), ie. a statistical model that will let us
calculate what we expect
There are many proteins in all the different species
2 randomly chosen proteins are expected to be unrelated
Our null model should therefore describe the alignment scores
expected for pairs of unrelated sequences

• How can we know the alignment scores for pairs of
unrelated protein sequences?
We could generate random protein sequences, & calculate
alignment scores for pairs of random protein sequences
We can use a multinomial model to generate random protein sequences
ie. make a roulette wheel with a different fraction of the wheel labelled
for each of the 20 amino acids
Then spin the wheel n times to make a random protein sequence that
is n amino acids long

In this multinomial model,
p(P)=0.14, p(A)=0.28,
p(W)=0.14, p(H)=0.14, p(E)=0.28
All the other amino acids have
probabilities of zero here
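As a rough sketch, the roulette wheel can be simulated with Python's standard `random` module. The names `WHEEL` and `random_sequence` are assumptions made for illustration. The probabilities are the ones listed above; they sum to 0.98, presumably because of rounding, which does not matter here because `random.choices` treats them as relative weights.

```python
import random

# Example wheel from the slide: only 5 amino acids have non-zero probability.
# The values sum to 0.98; random.choices uses them as relative weights.
WHEEL = {"P": 0.14, "A": 0.28, "W": 0.14, "H": 0.14, "E": 0.28}


def random_sequence(probabilities, n, rng=None):
    """Spin the wheel n times and return a random sequence of length n."""
    if rng is None:
        rng = random.Random()
    letters = list(probabilities)
    weights = [probabilities[letter] for letter in letters]
    # Each spin is independent of the others (multinomial model).
    return "".join(rng.choices(letters, weights=weights, k=n))


# A seeded generator makes the example reproducible.
example = random_sequence(WHEEL, 30, rng=random.Random(42))
```

Amino acids that are absent from the dictionary have a probability of zero and can never be drawn.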

• A good multinomial model for random sequences
should take the sequence composition into account
eg. we could use a multinomial model to generate random sequences of
the same composition as C. elegans TRA2

ie. make a roulette wheel where the fraction of the circle labelled with
each of the 20 amino acids is set equal to the % of that amino
acid in the TRA2 sequence
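One possible way to set the wheel fractions from a sequence is to count its amino acids. This is a minimal sketch: `composition_model` is a hypothetical helper name, and the short string used below is a made-up placeholder, not the real TRA2 sequence.

```python
from collections import Counter


def composition_model(sequence):
    """Return {amino acid: fraction of the sequence made up of that amino acid}."""
    if not sequence:
        raise ValueError("sequence is empty")
    counts = Counter(sequence)
    total = len(sequence)
    # Each fraction is the share of the roulette wheel given to that amino acid.
    return {amino_acid: count / total for amino_acid, count in counts.items()}


# Made-up placeholder sequence (NOT the real TRA2 protein).
placeholder = "MKLLAEEAKL"
model = composition_model(placeholder)
```

Amino acids that never occur in the sequence get no entry, ie. a probability of zero in the model.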

• One way to see if an alignment score is statistically
significant is to compare it to the scores for
alignments of random sequences
We make a random sequence of the same length and amino acid
composition as one of our original 2 sequences (eg. TRA2)
ie. use our ‘TRA2’ multinomial model to make a sequence

Alignment of human PTCH2 & a random sequence generated using a multinomial
model (with the probabilities of amino acids set equal to their fractions in TRA2)
(score = 51):

• We can generate 200 random sequences using our
TRA2-like multinomial model
For each random sequence, we can calculate the best alignment score for
the random sequence and human PTCH2
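The procedure above could be sketched as follows. The slides do not say which alignment algorithm or scoring scheme was used, so everything specific here is an assumption: a Smith-Waterman local alignment score with a simple match/mismatch score and a linear gap penalty (a real analysis would normally use a substitution matrix such as BLOSUM), and the hypothetical function names `local_alignment_score` and `random_alignment_scores`.

```python
import random
from collections import Counter


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty).

    The scoring values are illustrative assumptions, not those of the talk.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # A local alignment may restart anywhere, so scores never go below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def random_alignment_scores(target, template, n_random=200, seed=None):
    """Align `target` to n_random random sequences that have the same length
    and (on average) the same composition as `template`."""
    rng = random.Random(seed)
    counts = Counter(template)
    letters = list(counts)
    weights = [counts[letter] for letter in letters]
    scores = []
    for _ in range(n_random):
        rand_seq = "".join(rng.choices(letters, weights=weights, k=len(template)))
        scores.append(local_alignment_score(target, rand_seq))
    return scores
```

Here `target` plays the role of human PTCH2 and `template` the role of TRA2; the resulting list of scores is the null distribution that the real alignment score is compared against.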

Compare the scores obtained with the score seen for PTCH2 & TRA2 eg.

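[Figure: histogram of the alignment scores for PTCH2 aligned to the random sequences (x-axis: alignment score; y-axis: number of alignments of random sequences). The alignment score for PTCH2 & TRA2 is marked; 5% of the scores for alignments of random sequences are equal to or higher than it.]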

What % of the random sequences have a score equal to or higher than that
for TRA2 & PTCH2? eg. 5% (a fraction of 0.05) in the picture
This method can be used to estimate the significance of alignments in the
form of P-values, eg. P=0.05 in the picture
We accept the alignment as significant (indicating probable homology) if the
score is in the top 5% (or another chosen value) of the scores for random
sequences, ie. if P ≤ 0.05
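The P-value estimate described above is simply the fraction of random-sequence scores that are equal to or higher than the observed score. A minimal sketch, assuming a hypothetical helper `empirical_p_value` and made-up scores:

```python
def empirical_p_value(observed_score, null_scores):
    """Fraction of random-sequence scores that are >= the observed score."""
    if not null_scores:
        raise ValueError("need at least one random-sequence score")
    at_least_as_high = sum(1 for s in null_scores if s >= observed_score)
    return at_least_as_high / len(null_scores)


# Made-up scores for illustration: 5 of 100 random scores reach the observed score.
null_scores = [50] * 95 + [140] * 5
p = empirical_p_value(136, null_scores)
significant = p <= 0.05  # threshold from the slide; another value may be chosen
```

With a finite number of random sequences the estimate cannot resolve P-values smaller than 1/n, so an estimate of 0 only means that P is probably below that limit.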

eg. for human PTCH2 and C. elegans TRA2:
The alignment score is 136
When 200 random sequences (generated with a ‘TRA2’ multinomial
model) were aligned to PTCH2, only 0.36% of the alignments had a score of
≥136
Therefore, we estimate a P-value of P=0.0036
ie. we estimate that the probability of getting a score of 136 or higher for
PTCH2 and TRA2 due to chance is 0.0036 (36/10,000)

Alignment of human PTCH2 & C. elegans TRA2 (score = 136):

Human PTCH2 and C. elegans tra-2 are probably homologues

In the example below, 0.95 of the random sequences have an alignment
score equal to or higher than that for A & B, so P=0.95

[Figure: histogram of alignment scores for random sequences (x-axis: alignment score; y-axis: number of alignments of random sequences). The alignment score for A & B is marked; 95% of the scores for alignments of random sequences are equal to or higher than it.]

Alignment of fruitfly Eyeless & C. elegans TRA2 (score = 78):

P=1, so eyeless and tra-2 are probably not homologues

Further Reading
• Chapter 3 in Introduction to Computational Genomics by Cristianini & Hahn
• Chapter 6 in Computational Genome Analysis by Deonier et al
• Practical on alignment in R in the Little Book of R for Bioinformatics:
https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html
