BIM_2010_20_Bioinformatics_Project

Validation of Time Series Technique
for Prediction of Conformational
States of Amino Acids

Dr. Sangeeta Sawant , Bioinformatics Centre, UoP, Pune (Guide)

Dr. Mohan Kale, Dept. of Statistics, UoP, Pune (co-guide)

Concepts Used
Ramachandran Plot

Time series

AR,ARMA,ARIMA models

AIC criteria

Euclidean distance

Potential values for AA residues

Feynman Problem Solving Algorithm

Time Series
a sequence of data points or set of observations, measured
typically at successive time instants spaced at uniform time
intervals.

Patterns, variations

forecasting

Time Series Models (probability model)

Autoregressive (AR) models

Autoregressive-moving average (ARMA)

Autoregressive integrated moving average (ARIMA)
models

- depend linearly on previous data points

Materials & Methods
R

R-Studio, Tinn-R

bio3d,itsmr,forecast,tseries,timsac,wordcloud

ITSM_2000- Standalone

R Nabble
BioStars
stats.stackexchange

Methods

A) Calculation of Potential values for AA
residues

B)Forecasting of AA states

C) Clustering

Calculation of Potential values for AA residues

Dataset-I

3829 proteins selected from the PDB (Protein Data Bank) using the PDBSelect list (25% sequence similarity)

Experimental method: X-ray; R-factor: 0-0.25 (best-resolved structures)

Files scanned for chain breaks and for entries containing only CA atoms

Phi-Psi values computed with torsion.pdb() of “bio3d” and verified via PDBGoodies (IISc, Bangalore) and the Protein Angle Descriptor utility (IIT, Delhi)

Each amino-acid residue assigned conformational state 1, 2, or 3 according to the region (I, II, or III) of the Ramachandran plot in which its Phi-Psi values fall

Figure No. 2: Ramachandran plot (axes: Phi, Psi) showing the three conformational regions I, II and III

I - closely/tightly packed conformations, Phi -140 to 0, Psi -100 to 0
II - extended conformations, Phi -180 to 0, Psi 80 to 180
III - all remaining conformations
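
The region boundaries above can be turned into a small lookup. This is a minimal Python sketch for illustration only (the project itself worked in R); the function name and the choice to treat the boundaries as inclusive are assumptions, not taken from the slides.

```python
def conformational_state(phi: float, psi: float) -> int:
    """Return 1, 2 or 3 for regions I, II or III of the Ramachandran plot.

    Angles are in degrees. Inclusive boundaries are an assumption.
    """
    # Region I: Phi -140 to 0, Psi -100 to 0 (tightly packed conformations).
    if -140.0 <= phi <= 0.0 and -100.0 <= psi <= 0.0:
        return 1
    # Region II: Phi -180 to 0, Psi 80 to 180 (extended conformations).
    if -180.0 <= phi <= 0.0 and 80.0 <= psi <= 180.0:
        return 2
    # Region III: everything else.
    return 3
```

For example, a helix-like pair (Phi -60, Psi -45) falls in state 1 and a strand-like pair (Phi -120, Psi 130) in state 2.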

Frequencies of single residues in the three states were calculated and normalized following Kolaskar, A.S. & Sawant, S.V. (1996):

\[ P_{ik} = \frac{n_{ik} \, N}{\left( \sum_{i} n_{ik} \right) \left( \sum_{k} n_{ik} \right)} \]

n_ik - no. of times the AA of type (i) occurs in state k = 1-3;
N - total no. of residues
P_ik - potential value of the AA of type (i) in state k
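
As an illustration of the normalization, the following Python sketch computes potentials from a list of (amino acid, state) observations. It assumes the formula is read as n_ik times N divided by the product of the per-residue total and the per-state total; the function name and input layout are hypothetical.

```python
from collections import Counter


def state_potentials(observations):
    """Compute P_ik from (amino_acid, state) pairs, with state in {1, 2, 3}.

    Assumed reading of the normalization:
        P_ik = n_ik * N / (sum over k of n_ik * sum over i of n_ik)
    """
    obs = list(observations)
    n_ik = Counter(obs)                      # joint counts per (residue, state)
    n_i = Counter(aa for aa, _ in obs)       # residue totals (sum over states)
    n_k = Counter(k for _, k in obs)         # state totals (sum over residues)
    total = len(obs)                         # N, total number of residues
    return {
        (aa, k): n_ik[(aa, k)] * total / (n_i[aa] * n_k[k])
        for aa in n_i
        for k in n_k
    }
```

A value above 1 then means the residue type occurs in that state more often than expected from the overall composition.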

Potential values in pdf

ACF: stationary vs. non-stationary
[Figure: time series and ACF plots for a stationary and a non-stationary series]
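
The sample ACF behind such plots can be sketched in a few lines of Python. This is an illustration only: the slides do not say which stationarity check was applied, and the function name is hypothetical. A slowly decaying ACF is a common informal sign of non-stationarity.

```python
def sample_acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag (biased estimator)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev) / n         # lag-0 autocovariance
    return [
        sum(dev[t] * dev[t + h] for t in range(n - h)) / n / c0
        for h in range(max_lag + 1)
    ]
```

A series with a trend keeps a lag-1 autocorrelation close to 1, whereas a stationary series shows an ACF that dies out quickly.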

TS model building

AR(p), ARMA(p,q), ARIMA(p,d,q)

Best model selection

Among the fitted AR(p), ARMA(p,q) and ARIMA(p,d,q) models, the one with the minimum AIC is selected
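
Order selection by minimum AIC can be sketched for the pure AR case. The Python below is not the project's R workflow: it assumes Yule-Walker estimates obtained with the Levinson-Durbin recursion and a Gaussian AIC of the form n ln(variance) + 2p (up to an additive constant). All function names are hypothetical.

```python
import math


def autocovariances(x, max_lag):
    """Biased sample autocovariances of x for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    return [sum(dev[t] * dev[t + h] for t in range(n - h)) / n
            for h in range(max_lag + 1)]


def yule_walker_fits(x, max_p):
    """Return [(coefficients, innovation variance)] for AR orders 0..max_p."""
    g = autocovariances(x, max_p)
    phi, v = [], g[0]
    fits = [(phi, v)]
    for k in range(1, max_p + 1):
        # Reflection coefficient (partial autocorrelation at lag k).
        kappa = (g[k] - sum(phi[j] * g[k - 1 - j] for j in range(k - 1))) / v
        # Levinson-Durbin update of the AR coefficients and the variance.
        phi = [phi[j] - kappa * phi[k - 2 - j] for j in range(k - 1)] + [kappa]
        v *= 1.0 - kappa * kappa
        fits.append((phi, v))
    return fits


def select_ar_order(x, max_p):
    """Pick the AR order in 0..max_p with the smallest approximate AIC."""
    n = len(x)
    fits = yule_walker_fits(x, max_p)
    aic = [n * math.log(v) + 2 * p for p, (_, v) in enumerate(fits)]
    best_p = min(range(len(aic)), key=aic.__getitem__)
    return best_p, fits[best_p][0], aic
```

The same comparison extends to ARMA and ARIMA candidates once each model's likelihood is available; only the AR case is shown because it needs no numerical optimizer.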

Forecasting of AA states for best models

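e.g. for an AR(1) process,

\[ X_t = \phi X_{t-1} + Z_t, \quad t = 0, \pm 1, \ldots \]

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $|\phi| < 1$

The observed potential values of the AA residues are the data points and the residue index plays the role of t; given the 1st observed potential, prediction starts from the 2nd position and runs up to the last index, using forecast() of “itsmr”

Similarly for ARMA(1,1) / ARIMA(1,1):

\[ X_t = \phi X_{t-1} + Z_t + \theta Z_{t-1}, \quad \theta + \phi \neq 0 \]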
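
One possible reading of "prediction starts from the 2nd position" is a sequence of one-step-ahead forecasts, each using the previous observed value. The Python sketch below illustrates that reading for AR(1) only; it is not the forecast() routine of “itsmr”, and the function name and the mean adjustment are assumptions.

```python
def ar1_one_step_forecasts(x, phi, mean=0.0):
    """One-step forecasts of x[1], ..., x[-1] under a mean-adjusted AR(1).

    The forecast of x[t] is mean + phi * (x[t-1] - mean), because the
    white-noise term Z_t has expectation zero.
    """
    return [mean + phi * (prev - mean) for prev in x[:-1]]
```

For the series 1, 2, 4 with phi = 0.5 and zero mean, the forecasts for positions 2 and 3 are 0.5 and 1.0.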

Forecasting quality is measured by the coefficient of determination (R²):

\[ R^2 = 1 - \frac{\sum_i (Y_i - F_i)^2}{\sum_i (Y_i - \bar{Y})^2} \]

Y_i = true/observed value
F_i = forecasted/predicted value
\bar{Y} = mean of the observed values
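
The coefficient of determination can be computed directly from its definition. A minimal Python sketch (the function name is hypothetical):

```python
def r_squared(observed, forecast):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, forecast))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot
```

A perfect forecast gives 1, and a forecast that always returns the mean of the observed values gives 0.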

Clustering
Dataset-II

SCOP Domain specific PDB-style files(ATOM & HETATM records )
downloaded from

ASTRAL Compendium for Sequence and Structure Analysis -
release 1.75 (June 2009)

Files scanned for chain breaks and for the presence of CA atoms only; files with chain breaks set aside

Domain length: 100-110 AA residues, e.g. file 10gsa1_a_133_pot.txt

Potential values of each domain treated as a time series; domains divided into stationary (506) and non-stationary (1692) processes

Non-stationary data kept aside for further transformations

AR, ARMA & ARIMA models fitted

Best model chosen by the minimum AIC criterion

Best models: AR (22), ARMA (484), ARIMA (no model)

Distance matrix (Euclidean distance) computed for the AR(p) and ARMA(p,q) models

Dendrogram: neighbour-joining (PHYLIP package)

Dendrogram_TS –ARMA models-484

• Phylowidget link
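
The distance-matrix step can be sketched as follows. The slides do not say how models of different orders were compared, so this Python example assumes each model is represented by its coefficient vector, zero-padded to a common length; that representation and the function name are hypothetical.

```python
import math


def coefficient_distance_matrix(models):
    """Pairwise Euclidean distances between model coefficient vectors.

    models: list of coefficient lists, possibly of different lengths.
    Shorter vectors are zero-padded (an assumption, not from the slides).
    """
    width = max(len(m) for m in models)
    padded = [list(m) + [0.0] * (width - len(m)) for m in models]
    return [[math.dist(a, b) for b in padded] for a in padded]
```

The resulting symmetric matrix is the kind of input a neighbour-joining program expects once it is written out in that program's distance-file format.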

Results & Discussion

For each AA of every protein, the 3D
Cartesian coordinates were reduced to
2D information, i.e. the conformational state of
the AA. Potential values were then computed
and used to build a statistical model in which
the AA index plays the role of time, giving a time
series that can be used for forecasting.

AR values

Autoregressive order (p) → range 1-18

Short- and long-range dependence → variations
in protein structural arrangements

These variations demonstrate the diversity exhibited
through structural components

Table No. II: Forecasting results for AR models (44) out of the best
90 models (Note: for 46 models, class information was not found in the
SCOP database). All values are % accuracy. Classes with a single model
have a single value.

SCOP class (no. of models)      AA seq. (%)        States (%)
                                Max     Min        Max     Min
All α (a)-12                    26.82   2.41       55.68   21.77
All β (b)-5                     16.30   8.88       51.11   44.76
α/β (c)-9                       27.77   1.47       54.76   30.64
α+β (d)-13                      28.57   7.04       51.70   19.04
Small proteins (g)-1            19.51   -          48.78   -
Coiled-coil proteins (h)-3      22.5    5.88       26      15
Designed proteins (k)-1         29.03   -          26.88   -

Conformational states accuracy > AA residues accuracy due to low
resolution of potential values(forecasted values)

Table No. III: Forecasting results for ARMA models (557) out of the best 1239
models (Note: for 682 models, class information was not found in the SCOP
database). All values are % accuracy.

SCOP class (no. of models)                   AA seq. (%)        States (%)
                                             Max     Min        Max     Min
All α (a)-123                                32.55   2.63       65.77   8.06
All β (b)-146                                32.81   3.96       65.01   17.94
α/β (c)-120                                  43.47   5          62.89   8.97
α+β (d)-127                                  37.96   2.70       68.15   11.11
Multi-domain proteins (e)-13                 24.39   6.034      50      17.80
Membrane & cell surface proteins (f)-3       12.65   7.01       34.33   11.42
Small proteins (g)-17                        30.64   6.60       64.51   14.28

Because the dataset is non-representative and class information is incomplete,
it cannot be said for any particular class that i) prediction accuracy ↑ or ↓, or
ii) the series mostly follow an ARMA process

Discussion
TS graphs open a new door in the scientific visualization of proteins without 3D structure info, i.e.
a specific AA can be visualized on a line plot with a value proportional to its frequency of
occurrence in the allowed regions of the Ramachandran plot.

The potential value of each AA adds a new candidate feature for machine learning
techniques.

The order p of an AR model tells how the current value is linearly related to the past p values

Intra-dependency of AAs is shown by the TS models, e.g. AR(4), ARMA(1,3)

CONCLUSIONS
Found new way of looking at protein structure
prediction.

Application of TS technique for predicting conformational states based on the
conformational state potentials instead of secondary str. has been attempted.

Accuracy of prediction of conformational states for AA, using time series is
higher than that for prediction of AA residues.

To increase accuracy for prediction, multivariate time series concept may be
useful instead of uni-variate time series

Intra-fluctuations inside proteins, due to AA arrangement can be traced out
by stationary & non-stationary groups

FUTURE WORK
AR and MA orders of TS models could serve as a measure of genetic information (distances) to
predict evolutionary relationships between different proteins.

TS concept can be used to predict conformational states of missing residues
in PDB data files

Hierarchical clustering/classification of protein TS gives rise to a new concept
of time-dependent clustering (pseudo-clustering) & pseudo-phylogeny.

Development of synthetic proteins to combat seasonal diseases & to tackle
chemical warfare attacks.

TS fluctuations for specific class of proteins can be used as “Pattern” for data
analysis and pattern-dependent classification of proteins

References

Blundell, T.L., Sibanda, B.L., Sternberg, M.J., Thornton, J.M. (1987).
Knowledge-based prediction of protein structures and the design of novel
molecules. Nature 326(6111):347-52. Review.

Kolaskar, A.S., Sawant, S.V. (1996). Prediction of conformational
states of amino acids using a Ramachandran plot. Int. J. Peptide
Protein Res. 110-116.

Alessandro G., Romualdo B. (2000). Nonlinear Methods in the
Analysis of Protein Sequences: A Case Study in Rubredoxins.
Biophysical Journal 136-148.
