Correct identification of the Translation Initiation Start (TIS) in cDNA is an important issue for genome annotation. The aim of this work is to improve upon current methods and provide a performance-guaranteed prediction.
TIS prediction in human cDNAs with high accuracy
1. Translation initiation start prediction in human cDNAs with high accuracy
A. G. Hatzigeorgiou
Paper Presentation
Introduction to Bioinformatics
Anaxagoras Fotopoulos | Marina Adamou - Tzani
21/01/2014
2. Introduction
• The primary objective of the present research is to contribute to the definition of the coding part of a gene.
• The search is performed in cDNA sequences.
• Coding regions are surrounded by UnTranslated Regions (UTRs).
• The interest is focused on finding the Translation Initiation Start (TIS), which defines the start of the coding region.
cDNA: complementary DNA (cDNA) is DNA synthesized from a messenger RNA (mRNA) in a reaction catalyzed by the enzymes reverse transcriptase and DNA polymerase.
3. Previous Research
Salzberg, 1997: Positional Conditional Probability matrix.
Agarwal and Bafna, 1998a: Generalized Second Order Profiles.
• Implementation of the Ribosome Scanning Model (Kozak, 1996): the ribosome first attaches to a specific region at the 5' end of the mRNA and then scans the sequence for the first ATG.
• No significant differences were observed between the above methods and a weight matrix.
• The above methods are studied together because of their high rate of false positives.
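The scanning model described above can be sketched in a few lines. This is only an illustration of "scan from the 5' end for the first ATG"; the function name `first_atg` is hypothetical and not taken from any of the cited programs.

```python
def first_atg(mrna):
    """Return the 0-based index of the first ATG from the 5' end, or -1 if none.

    Accepts RNA (AUG) or cDNA (ATG) spelling; case is ignored.
    """
    seq = mrna.upper().replace("U", "T")
    return seq.find("ATG")
```

A plain scan like this always picks the first ATG it meets, regardless of context or reading frame.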
4. Previous Research
Pedersen and Nielsen, 1997: Usage of ANNs for the recognition of local context and statistical properties around the TIS. Large region of analysis: 100 bases before and 100 after the start codon.
Salamov et al., 1998: Six characteristics are applied for the analysis of the region around the TIS, including weight matrix and hexanucleotide difference.
Zien et al., 2000: Use of Support Vector Machines (SVMs) for TIS prediction.
All of the above methods give up to 85% correct predictions.
5. Methods – Suggested Model
[Diagram: 475 cDNAs from Swissprot (verified + checked) are split into a Training Gene Pool (parameter estimation; training set + evaluation set) and a Test Gene Pool (TIS prediction; test set). Two networks are used: a Consensus NN (conserved motif) and a Coding NN (coding/non-coding potential). Their outputs are combined by score multiplication.]
6. Consensus Neural Network
• 12-nucleotide-long window
• 325 positive + 325 negative examples
• Binarization of the input
• Selection of the appropriate feed-forward NN: feed-forward with shortcut connections and two hidden units, trained with the cascade correlation algorithm
7. Coding Neural Network
• 54-nucleotide-long window
• Apply codon usage statistic (count all non-overlapping codons in every window)
• The sequence window is rescaled to 64 units; every unit gives the normalized frequency of the codon in the window
• Use the Smith-Waterman algorithm for the elimination of homologies between training and test data
• 282 genes with less than 70% homology were used for training
• 700 positive + 700 negative sequence regions extracted for training
• 250 positive + 250 negative sequence regions extracted for testing
• The resilient backpropagation algorithm is applied to a feed-forward NN.
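The rescaling of a window to 64 units can be sketched as a codon-frequency vector. The sketch assumes that "normalized" means dividing each count by the number of codons in the window (18 for a 54-nucleotide window). The names `CODONS`, `INDEX` and `codon_frequencies` are illustrative.

```python
from itertools import product

# All 64 codons in a fixed order; the position in this list is the input unit.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
INDEX = {codon: i for i, codon in enumerate(CODONS)}

def codon_frequencies(window):
    """Return 64 normalized frequencies of the non-overlapping codons in window.

    Codons containing symbols other than A, C, G, T are skipped.
    """
    window = window.upper()
    counts = [0] * 64
    total = 0
    for i in range(0, len(window) - 2, 3):
        codon = window[i:i + 3]
        if codon in INDEX:
            counts[INDEX[codon]] += 1
            total += 1
    if total == 0:
        return [0.0] * 64
    return [c / total for c in counts]
```

The resulting vector has the same length for any window size, which is what allows a fixed 64-unit input layer.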
8. Integrated method
Analysis of full length mRNA sequences
1st stage
• Calculation of coding score for every nucleotide of the mRNA sequence
2nd stage
• Calculation of coding evidence of the coding region included in the longest ORF of the sequence
3rd stage
• For every in-frame ATG a consensus score is calculated
4th stage
• For the same in-frame ATG, a coding difference score is calculated
The final score is obtained by combining the output of the consensus ANN and the coding difference.
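The four stages can be sketched as follows. Two points are assumptions: the slides do not define the coding difference (it is taken here as the mean coding score downstream of the ATG minus the mean upstream), and the combination is taken to be a product, following the "Score Multiplication" box on the model slide. All function names are hypothetical, and `consensus_score` stands in for the consensus ANN.

```python
def in_frame_atgs(seq, frame):
    """0-based positions of ATG codons in the given reading frame (0, 1 or 2)."""
    seq = seq.upper()
    return [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] == "ATG"]

def _mean(values):
    return sum(values) / len(values) if values else 0.0

def coding_difference(coding_scores, pos, flank=54):
    """Assumed definition: mean coding score after pos minus mean before pos."""
    upstream = coding_scores[max(0, pos - flank):pos]
    downstream = coding_scores[pos:pos + flank]
    return _mean(downstream) - _mean(upstream)

def best_tis(seq, frame, coding_scores, consensus_score):
    """Return (position, final score) of the best in-frame ATG, or None.

    coding_scores: one coding score per nucleotide (stage 1).
    consensus_score: callable mapping an ATG position to a score (stage 3).
    """
    scored = [
        (p, consensus_score(p) * coding_difference(coding_scores, p))
        for p in in_frame_atgs(seq, frame)
    ]
    return max(scored, key=lambda t: t[1]) if scored else None
```

Returning a single best position matches the statement that the method gives only one prediction per ORF.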
9. Integrated method
Analysis of full length mRNA sequences
• This method provides only one prediction for every ORF
• According to the results of the test group:
• 94% of the TIS were correctly predicted
• 6% of the predictions were false positive
The use of the Las Vegas algorithm gives a confident decision. The incorporation of this algorithm leads to a highly accurate recognition of the TIS in human cDNAs for 60% of the cases.
Las Vegas: a Las Vegas algorithm provides a correct prediction in some cases and has a "no answer" option in the remaining cases. That is, it either produces the correct result or reports failure.
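The "answer or no answer" behaviour can be sketched as a prediction that abstains when it is not confident. The slides do not state how confidence is decided; a fixed score threshold is assumed here purely for illustration, and a threshold alone does not guarantee correctness. The name `confident_prediction` and the default value 0.5 are hypothetical.

```python
def confident_prediction(scored_atgs, threshold=0.5):
    """Return the top-scoring ATG position, or None for "no answer".

    scored_atgs: list of (position, final_score) pairs.
    threshold: assumed cut-off below which no prediction is made.
    """
    if not scored_atgs:
        return None
    position, score = max(scored_atgs, key=lambda t: t[1])
    return position if score >= threshold else None
```

Raising the threshold trades coverage for reliability: fewer sequences receive an answer, but the answers given are more trustworthy.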
10. Results – Score Combination 1/3
Nucleotide 255: cod 0.98 – local 0.2
A score combination of coding ANN and consensus ANN gives a low final score.
Cod line: Score of coding ANN
Local line: Score and position of consensus ANN for all ATGs in coding frame
11. Results – Score Combination 2/3
Nucleotide 270: cod 0.44 – local 0.4
A score combination of coding ANN and consensus ANN gives a low final score.
Cod line: Score of coding ANN
Local line: Score and position of consensus ANN for all ATGs in coding frame
12. Results – Score Combination 3/3
Nucleotide 148 (correct TIS): cod 0.95 – local 0.8
A score combination of coding ANN and consensus ANN gives a high final score.
Cod line: Score of coding ANN
Local line: Score and position of consensus ANN for all ATGs in coding frame
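As a worked example, the three score pairs from slides 1/3 to 3/3 can be combined numerically. The sketch assumes the combination is a simple product and treats the three nucleotides as candidates in the same sequence; the slides do not state either point explicitly.

```python
# (cod, local) scores read from the three slides, keyed by nucleotide position.
examples = {255: (0.98, 0.2), 270: (0.44, 0.4), 148: (0.95, 0.8)}

# Assumed combination rule: product of the two scores.
combined = {pos: cod * local for pos, (cod, local) in examples.items()}

# Candidate with the highest combined score.
best = max(combined, key=combined.get)
```

Under this assumption the products are roughly 0.20 (255), 0.18 (270) and 0.76 (148), so position 148, the correct TIS, ranks first even though position 255 has the highest coding score on its own.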
18. Results – Methods Comparison
Prediction Analysis
[Figure: correct TIS position: 471. Annotations: "High prediction score difference", "Did not find TIS", "Found TIS but other higher score exists".]
19. Results – Methods Comparison
Performance of the three programs for TIS prediction along the mRNA with signal peptide sequences
[Figure label: Correct TIS positions]
24. Results – Methods Comparison
Prediction example #1:
DIANA-TIS is able to distinguish between the TIS and other ATGs better than other ANN-based programs like NetStart:
• Two suitable ATGs are 12 nucleotides apart
• Coding/non-coding information is similar
• Consensus motif is completely different
25. Results – Methods Comparison
Prediction example #2:
A favorable prediction does not work for all examples:
• Consensus motif is completely different
• Combined score is much lower
In some signal peptide sequences the coding potential score is relatively low and can thus affect the combined score.
26. Results – Methods Comparison
TIS prediction program                  TIS prediction rate
DIANA-TIS (2001)                        94%
Agarwal & Bafna (1998)                  85%
ATGPred (Salamov et al., 1998)          79%
NetStart (Pedersen & Nielsen, 1997)     78%
These methods allow more than one prediction per gene.
Note: The results come from different datasets, and thus these numbers should not be directly compared.
27. Thank you!
Introduction to Bioinformatics
Information Technologies in Medicine and Biology
National & Kapodistrian University of Athens, Department of Informatics
Biomedical Research Foundation, Academy of Athens
Technological Education Institute of Athens, Department of Biomedical Engineering
Demokritos National Center for Scientific Research