Khamis 2015 - Comparative Assessment of Machine-Learning Scoring Functions on PDBbind 2013

Comparative Assessment of Machine-Learning Scoring
Functions on PDBbind 2013 (Demo)
DATASET · JULY 2015
2 AUTHORS, INCLUDING:
Mohamed AbdElAziz Khamis
Egypt-Japan University of Science and Technology

Mohamed A. Khamis, Walid Gomaa, Comparative assessment of machine-learning
scoring functions on PDBbind 2013, Engineering Applications of Artificial Intelligence
(2015), http://dx.doi.org/10.1016/j.engappai.2015.06.021

Objective
• We present a comparative assessment of machine-learning scoring
functions on PDBbind 2013 in computational docking.
• Computational docking is the process of predicting the best pose
(orientation + conformation) of a small molecule (drug candidate)
when bound to a larger target receptor molecule (protein) to form
a stable complex.
• A scoring function is a mathematical predictive model that produces
a score estimating the binding free energy of a binding pose.
• The result of the docking process is a set of ligands ranked according
to their predicted binding scores.

Powers of Scoring Functions
• Scoring Power: Score the protein-ligand complex, i.e. predict
its binding affinity.
• Ranking Power: Rank different ligands bound to the
same target protein.
• Docking Power: Identify the native binding pose among
computer-generated decoys.
• Screening Power: Distinguish the true binders from the
non-binders (random molecules).

Powers of Scoring Functions - Measurements
• Scoring Power: Pearson linear correlation coefficient
between predicted and experimentally determined binding
affinities.
• Ranking Power: Ranking percentage (high-level ranking,
low-level ranking, Spearman rank correlation coefficient).
• Docking Power: Root-mean-square deviation (RMSD) value
between the native binding pose and the best-scored binding pose.
• Screening Power: Total number of true binders among the
1%, 5%, and 10% top-ranked ligands.
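
The three numeric measures named above (Pearson correlation, Spearman rank correlation, RMSD) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' evaluation code: the function names and the toy affinity values are hypothetical, and the RMSD here assumes both poses list the same atoms in the same order and are already in the same coordinate frame (no superposition or symmetry correction).

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def rmsd(pose_a, pose_b):
    """RMSD between two poses given as lists of (x, y, z) tuples.

    Assumption: atoms are matched by index and no alignment is needed.
    """
    squared = sum((a - b) ** 2
                  for p, q in zip(pose_a, pose_b)
                  for a, b in zip(p, q))
    return math.sqrt(squared / len(pose_a))

# Hypothetical toy data: experimental vs. predicted binding affinities.
experimental = [2.0, 4.5, 6.1, 7.3, 9.0]
predicted = [3.1, 4.0, 6.5, 6.0, 8.2]
scoring_power = pearson(experimental, predicted)
rank_correlation = spearman(experimental, predicted)
```

Pearson correlation rewards a linear relationship between predicted and measured affinities, while Spearman only checks that the ordering agrees, which is why the two powers can rank scoring functions differently.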

Molecular Features
• For the scoring and ranking powers, the proposed ML scoring functions
depend on a wide range of features that characterize the protein-ligand
complexes.
• These features include geometrical features of the RF-Score (Ballester and
Mitchell, 2010) (36 features), energy terms of the BALL software
(Hildebrandt et al., 2010) (5 features), energy terms of the X-Score
(Wang et al., 2002) (8 features), and pharmacophore features of the SLIDE
software (Zavodszky et al., 2002) (59 features).
• We perform dimensionality reduction using the principal component
analysis (PCA) technique.
• For the docking and screening powers, the proposed ML scoring functions
depend on the geometrical features of the RF-Score (Ballester and Mitchell,
2010) (36 features).
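
The PCA step can be illustrated with a short NumPy sketch. It is a sketch under stated assumptions, not the authors' pipeline: the feature matrix is random placeholder data, the 200-complex sample size and the `pca_reduce` helper are hypothetical, and standardizing each feature before PCA is an assumption the slides do not confirm. The 108 columns correspond to the feature counts listed above (36 + 5 + 8 + 59), and 17 components matches the number used in the scoring and ranking power slides.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components.

    Returns (scores, explained_variance_ratio).
    Assumption: features are standardized (zero mean, unit variance) first.
    """
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    Z = (X - X.mean(axis=0)) / std
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    variance = S ** 2
    return scores, variance[:n_components] / variance.sum()

# Placeholder data: 200 hypothetical complexes x 108 features.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 108))
components, explained = pca_reduce(features, 17)
```

The reduced matrix `components` would then be the input to each regressor in place of the raw feature vector; `explained` shows how much variance each retained component carries.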

Summary of the scoring functions evaluated in CASF-2013
http://dx.doi.org/10.1016/j.engappai.2015.06.021

Optimal parameter values of the 12 ML scoring functions
on the scoring, ranking, docking, and screening powers
• Random Forests (RF), Boosted Regression Trees (BRT), K-Nearest Neighbours
(kNN), Multivariate Adaptive Regression Splines (MARS), Neural Network (NN),
Partial Least Squares Regression (PLSR), Principal Component Regression
(PCR), Logistic Regression (LR), Multiple Linear Regression (MLR), Regression
with Regularization (RR), Support Vector Machines (SVM), Decision Tree (DT).

Performance of the 20 classical scoring functions versus the 12 ML scoring
functions with the most important 17 principal components (with @ML
suffix) in the scoring power test

Protein family dependent scoring power of
top ML scoring functions

Effect of changing the number of principal components
on the scoring power of the top 5 ML scoring functions

Effect of applying the PCA technique on the scoring power
of the top 10 ML scoring functions

Performance of the 20 classical scoring functions versus the 12 ML
scoring functions with the most important 17 principal components (with
@ML suffix) in the ranking power test

Protein family dependent ranking power of
the top ML scoring functions

Effect of changing the number of principal components on
the high-level ranking power of the top 5 ML scoring functions

Effect of changing the number of principal components on
the low-level ranking power of the top 5 ML scoring functions

Effect of applying the PCA technique on the ranking power
of the top 10 ML scoring functions

Success rates in the docking power test when one or more best-scored ligand binding poses
are considered. The cutoff of acceptance here is that the RMSD value between one
best-scored binding pose and the true binding pose is lower than 2.0 Å. The scoring functions are
ranked by their success rate when the top three best-scored ligand poses are considered.
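
The success-rate criterion in this caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the `docking_success_rate` helper and the toy poses are hypothetical, and the sketch assumes that a higher score means a better pose (some scoring functions use the opposite sign convention).

```python
def docking_success_rate(complexes, top_n=1, cutoff=2.0):
    """Fraction of complexes with a near-native pose among the top_n poses.

    complexes: list of pose lists; each pose is (score, rmsd_to_native_in_A).
    Assumption: higher score = better pose.
    A complex counts as a success if any of its top_n best-scored poses
    has an RMSD strictly lower than the cutoff (2.0 A in the caption).
    """
    hits = 0
    for poses in complexes:
        best = sorted(poses, key=lambda pose: pose[0], reverse=True)[:top_n]
        if any(pose_rmsd < cutoff for _, pose_rmsd in best):
            hits += 1
    return hits / len(complexes)

# Hypothetical toy data: four complexes with three scored poses each.
toy_complexes = [
    [(9.1, 0.8), (7.5, 3.2), (6.0, 5.0)],  # success at top 1
    [(8.0, 4.1), (7.9, 1.5), (5.2, 6.3)],  # success only from top 2
    [(6.6, 7.0), (6.1, 5.5), (5.9, 1.9)],  # success only from top 3
    [(7.2, 3.3), (6.8, 2.0), (4.0, 9.0)],  # never: 2.0 is not below the cutoff
]
rate_top1 = docking_success_rate(toy_complexes, top_n=1)
rate_top3 = docking_success_rate(toy_complexes, top_n=3)
```

Considering more top-scored poses can only raise the success rate, so results at top 1, top 2, and top 3 are typically reported side by side.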

Enrichment factors of all 20 scoring functions versus the 12 ML
scoring functions in the screening power test. The scoring functions are
ranked by their average enrichment factor obtained at the top 1% level.
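
For readers unfamiliar with enrichment factors, a minimal sketch follows. It assumes the commonly used definition, EF at x% = (true binders found among the top x% of ranked ligands) / (total true binders × x%); the slides do not spell out the formula, so treat this definition, the `enrichment_factor` helper, and the toy ranking as assumptions. It also assumes that a higher score means a stronger predicted binder.

```python
def enrichment_factor(scored_ligands, fraction):
    """Enrichment factor at the given top fraction (e.g. 0.01 for 1%).

    scored_ligands: list of (score, is_true_binder) for one target.
    Assumptions: higher score = better; at least one ligand is always
    selected, even when fraction * len(scored_ligands) rounds to zero.
    """
    ranked = sorted(scored_ligands, key=lambda item: item[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    total_binders = sum(1 for _, is_binder in ranked if is_binder)
    found = sum(1 for _, is_binder in ranked[:n_top] if is_binder)
    return found / (total_binders * fraction)

# Hypothetical toy target: 100 ligands, true binders at ranks 1, 3, 8, 40, 90.
binder_ranks = {1, 3, 8, 40, 90}
toy_target = [(100 - rank, rank in binder_ranks) for rank in range(1, 101)]
ef_1 = enrichment_factor(toy_target, 0.01)
ef_5 = enrichment_factor(toy_target, 0.05)
ef_10 = enrichment_factor(toy_target, 0.10)
```

An enrichment factor of 1 corresponds to random selection; the average reported in the caption would be the mean of this per-target value over all targets in the test.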

Success rates of finding the best ligand molecule of all 20 scoring functions versus the 12 ML
scoring functions in the screening power test. Scoring functions are ranked by their success
rates obtained at the top 1% level. Numbers in brackets are the number of successful cases,
for which the upper limit is 65 (for ML scoring functions the upper limit is 62).

Conclusion
• Machine-learning techniques make it possible to use as many
relevant molecular features (e.g., geometric features,
pharmacophore features) as possible.
• In particular, ensemble-based machine-learning approaches
(e.g., random forests, boosted regression trees) are
resilient to overfitting.
• For the docking and screening powers, machine-learning
techniques need to be more target-specific: train on a larger
number of known binders for each target protein, and use an SVM
classifier to discriminate actives from decoys instead of an
SVR regressor.

Acknowledgement
• This work is supported:
  • Mainly by the Information Technology Industry
  Development Agency (ITIDA) under ITAC Program
  grant number CFP#58
  • In part by an E-JUST Research Fellowship

Publications
• Mohamed A. Khamis, Walid Gomaa, Walaa A. Fathy,
Machine Learning in Computational Docking,
Artificial Intelligence in Medicine, Elsevier, Volume 63,
Feb 2015, Pages 135–152.
• Mohamed A. Khamis, Walid Gomaa, Basem Galal,
Deep Learning Competes Random Forest in
Computational Docking, Artificial Intelligence in
Medicine, Elsevier, 2015 (submitted).

Supplemental Material & Questions
• Supplemental Material:
  • Source code of machine learning techniques, feature
  extraction scripts, PDB IDs, and molecular features, etc.
  • https://www.researchgate.net/profile/Mohamed_Khamis4
• E-mail:
  • mohamed.khamis@ejust.edu.eg
  • mohamed.abdelaziz.khamis@gmail.com
