TIS prediction in human cDNAs with high accuracy

Translation initiation start prediction
in human cDNAs with high accuracy
A. G. Hatzigeorgiou

Paper Presentation
Introduction to Bioinformatics
Anaxagoras Fotopoulos | Marina Adamou - Tzani

21/01/2014

Introduction
• The primary objective of this research is to contribute to the definition of the coding part of a gene.
• The search is performed in cDNA sequences.
• Coding regions are surrounded by UnTranslated Regions (UTRs).
• The interest is focused on finding the Translation Initiation Start (TIS), which defines the start of the coding region.

cDNA
Complementary DNA (cDNA) is DNA synthesized from a messenger RNA (mRNA) in a reaction catalyzed by the enzymes reverse transcriptase and DNA polymerase.

Previous Research
• Salzberg, 1997: Positional Conditional Probability matrix.
• Agarwal and Bafna, 1998a: Generalized Second Order Profiles.
• Implementation of the Ribosome Scanning Model (Kozak, 1996): the ribosome first attaches to a specific region at the 5' end of the mRNA and then scans the sequence for the first ATG.
• No significant differences were observed between the above methods and a weight matrix.
• The above methods are studied in common due to the high rate of false positives.
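
The ribosome scanning model mentioned above can be sketched as a left-to-right search for the first ATG. This is a minimal illustration only, not the implementation used in any of the cited works; the function name `first_atg` and the toy sequence are made up for the example.

```python
def first_atg(cdna: str) -> int:
    """Return the 0-based index of the first ATG from the 5' end, or -1 if none."""
    return cdna.upper().find("ATG")


# Hypothetical toy sequence: a short 5' UTR followed by two ATGs.
seq = "GCCGCCACCATGGCGATGTAA"
print(first_atg(seq))  # 9
```

A rule of this kind returns exactly one candidate per sequence and ignores the sequence context around the ATG.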

Previous Research
• Pedersen and Nielsen, 1997: Use of ANNs to recognize the local context and statistical properties around the TIS. Large region of analysis: 100 bases before and 100 bases after the start codon.
• Salamov et al., 1998: Six characteristics are applied to the analysis of the region around the TIS, including a weight matrix and hexanucleotide difference.
• Zien et al., 2000: Use of Support Vector Machines (SVMs) for TIS prediction.

All of the above methods give up to 85% correct predictions.

Methods – Suggested Model
[Flow diagram]
• Swissprot: 475 cDNAs (verified + checked)
• Training gene pool: parameter estimation (training set + evaluation set)
• Test gene pool: TIS prediction (test set)
• Conserved motif → Consensus NN
• Coding/non-coding potential → Coding NN
• Score multiplication of the two NN outputs

Consensus Neural Network
• 325 positive + 325 negative examples
• 12-nucleotide-long window
• Binarization of the input
• Selection of the appropriate feed-forward NN: feed-forward with shortcut connections and two hidden units, trained with the cascade correlation algorithm
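
One common way to binarize a nucleotide window for a neural network is one-hot encoding with four input bits per base. The sketch below assumes that encoding, which the slides do not specify; `one_hot_window` and the base order A, C, G, T are illustrative choices.

```python
BASES = "ACGT"  # assumed bit order; the slides do not specify one


def one_hot_window(window: str) -> list[int]:
    """Encode a nucleotide window as 4 bits per base (unknown bases -> all zeros)."""
    bits: list[int] = []
    for base in window.upper():
        bits.extend(1 if base == b else 0 for b in BASES)
    return bits


# A 12-nucleotide window yields 12 * 4 = 48 binary inputs.
encoded = one_hot_window("GCCACCATGGCG")
print(len(encoded))  # 48
```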

Coding Neural Network
• 54-nucleotide-long window
• Codon usage statistic applied: for every window, all non-overlapping codons are counted
• The sequence window is rescaled to 64 units; every unit gives the normalized frequency of the codon in the window
• Smith-Waterman algorithm used to eliminate homologies between training and test data
• 282 genes with less than 70% homology were used for training
• 700 positive + 700 negative sequence regions extracted for training
• 250 positive + 250 negative sequence regions extracted for testing
• Resilient backpropagation algorithm is applied to a feed-forward NN
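
The rescaling of a window to 64 units of normalized codon frequency can be sketched as follows. This is an illustration under assumptions: the alphabetical codon order, the function name `codon_frequencies`, and the toy window are not from the slides.

```python
from itertools import product

# All 64 codons in a fixed (alphabetical) order; the order is an assumption.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(window: str) -> list[float]:
    """Normalized frequency of each of the 64 codons over the
    non-overlapping codons of the window (read in frame from position 0)."""
    window = window.upper()
    codons = [window[i:i + 3] for i in range(0, len(window) - 2, 3)]
    counts = {c: 0 for c in CODONS}
    for codon in codons:
        if codon in counts:  # skip codons with ambiguous bases such as N
            counts[codon] += 1
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in CODONS]


window = "ATGGCG" * 9  # toy 54-nucleotide window (18 codons)
features = codon_frequencies(window)
print(len(features), features[CODONS.index("ATG")])  # 64 0.5
```

The resulting 64 values have the same length regardless of the window size, which is what lets a fixed-size network input represent a 54-nucleotide window.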

Integrated method
Analysis of full length mRNA sequences

1st stage
• Calculation of a coding score for every nucleotide of the mRNA sequence

2nd stage
• Calculation of the coding evidence of the coding region included in the longest ORF of the sequence

3rd stage
• For every in-frame ATG, a consensus score is calculated

4th stage
• For the same in-frame ATG, a coding difference score is calculated

The final score is obtained by combining the output of the consensus ANN and the coding difference.
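
The four stages above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the per-nucleotide coding scores and the consensus scorer are passed in as placeholders for the two ANNs, and the coding-difference definition (mean coding score downstream minus upstream of the ATG), the `flank` parameter, and the use of multiplication to combine the scores are assumptions.

```python
from typing import Callable, Optional, Sequence

STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> Optional[tuple[int, int]]:
    """Return (start, end) of the longest ATG...stop ORF (end exclusive), or None."""
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOPS and start is not None:
                if best is None or i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None
    return best


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def predict_tis(seq: str,
                coding_scores: Sequence[float],
                consensus_score: Callable[[str, int], float],
                flank: int = 54) -> Optional[tuple[int, float]]:
    """Return (position, final score) of the best in-frame ATG in the longest ORF."""
    orf = longest_orf(seq)
    if orf is None:
        return None
    start, end = orf
    best_pos, best_score = None, float("-inf")
    for pos in range(start, end - 2, 3):
        if seq[pos:pos + 3] != "ATG":
            continue
        # Assumed coding difference: downstream mean minus upstream mean.
        downstream = mean(coding_scores[pos:pos + flank])
        upstream = mean(coding_scores[max(0, pos - flank):pos])
        final = consensus_score(seq, pos) * (downstream - upstream)
        if final > best_score:
            best_pos, best_score = pos, final
    return best_pos, best_score


# Toy data: two in-frame ATGs (positions 3 and 9); coding signal starts at 9.
seq = "CCCATGAAAATGGGGTAA"
coding = [0.1] * 9 + [0.9] * 9
result = predict_tis(seq, coding, lambda s, p: 0.5, flank=6)
print(result[0])  # 9
```

Only one position is returned per ORF, which matches the one-prediction-per-ORF behavior described for the method.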

Integrated method
Analysis of full length mRNA sequences
• This method provides only one prediction for every ORF
• According to the results of the test group:
• 94% of the TIS were correctly predicted
• 6% of the predictions were false positive
The use of a Las Vegas algorithm gives a confident decision. A Las Vegas algorithm provides a correct prediction in some cases and has a "no answer" option in the remaining cases. That is, it always produces the correct result or it informs about the failure. The incorporation of this algorithm leads to a highly accurate recognition of the TIS in human cDNAs for 60% of the cases.
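
The "no answer" option can be illustrated with a simple decision rule: report the top-scoring ATG only when it clearly beats the runner-up, and abstain otherwise. The margin rule, the `margin` value, and the function name are assumptions made for this sketch; the slides do not state how the decision is made, and a threshold like this only approximates the guarantee of a true Las Vegas algorithm.

```python
from typing import Optional


def confident_prediction(scores: dict[int, float], margin: float = 0.3) -> Optional[int]:
    """Return the top-scoring position only if it beats the runner-up by
    at least `margin`; otherwise return None (the "no answer" option)."""
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= margin:
        return ranked[0][0]
    return None


# Hypothetical final scores keyed by ATG position.
print(confident_prediction({30: 0.9, 60: 0.2, 75: 0.1}))  # 30
print(confident_prediction({10: 0.5, 40: 0.45}))          # None
```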

Results – Score Combination
[Figure: coding ANN and consensus ANN scores along the sequence]
Cod line: Score of the coding ANN
Local line: Score and position of the consensus ANN for all ATGs in the coding frame
• Nucleotide 255 (cod 0.98, local 0.2): the score combination of coding ANN and consensus ANN gives a low final score.
• Nucleotide 270: the score combination of coding ANN and consensus ANN gives a low final score.
• Nucleotide 148 (correct TIS): the score combination of coding ANN and consensus ANN gives a high final score.
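
Assuming the two scores are combined by multiplication, as the "Score Multiplication" label in the model diagram suggests, the nucleotide 255 example works out as follows. Only the values 0.98 and 0.2 come from the slide; the function name is illustrative.

```python
def combined_score(coding: float, consensus: float) -> float:
    """Combine the two ANN outputs by multiplication (assumed rule)."""
    return coding * consensus


# Nucleotide 255 from the slide: cod 0.98, local 0.2
print(round(combined_score(0.98, 0.2), 3))  # 0.196
```

Under this rule a high coding score alone is not enough: a weak consensus score pulls the product down.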

Results – Methods Comparison
[Table, columns:]
• Correct TIS positions
• Prediction for the 3 TIS positions with the highest scores
• Consensus motif scores (only for DIANA-TIS)
• Final scores
• Correct predictions

Prediction Analysis
[Figure annotations:]
• High prediction score difference
• TIS correct position: 471
• Did not find the TIS
• Found the TIS, but another position has a higher score

Performance of the three programs for TIS prediction along the mRNA with signal peptide sequences
[Table, columns:]
• Correct TIS positions
• Length of signal peptide
• Prediction for the 2 TIS positions with the highest scores
• Consensus motif scores (only for DIANA-TIS)
• Final scores

Prediction example #1:
DIANA-TIS is able to distinguish between the TIS and other ATGs better than other ANN-based programs such as NetStart:
• Two suitable ATGs are 12 nucleotides apart
• Coding/non-coding information is similar
• Consensus motif is completely different

Prediction example #2:
A favorable prediction is not obtained for all examples:
• Consensus motif is completely different
• Combined score is much lower
• In some signal peptide sequences the coding potential score is relatively low, and can thus affect the combined score.

TIS prediction program                  TIS prediction rate
DIANA-TIS (2001)                        94%
Agarwal & Bafna (1998)                  85%
ATGPred (Salamov et al., 1998)          79%
NetStart (Pedersen & Nielsen, 1997)     78%

These methods allow more than one prediction per gene.

Note: The results come from different datasets, and thus these numbers should not be directly compared.

Thank you!

Introduction to Bioinformatics
Information Technologies in Medicine and Biology
• National & Kapodistrian University of Athens, Department of Informatics
• Biomedical Research Foundation, Academy of Athens
• Technological Education Institute of Athens, Department of Biomedical Engineering
• Demokritos National Center for Scientific Research
