2. Mohamed A. Khamis, Walid Gomaa, Comparative assessment of machine-learning
scoring functions on PDBbind 2013, Engineering Applications of Artificial Intelligence
(2015), http://dx.doi.org/10.1016/j.engappai.2015.06.021
3. Objective
We present a comparative assessment of machine-learning scoring
functions on PDBbind 2013 in computational docking.
Computational docking is the process of predicting the best pose
(orientation + conformation) of a small molecule (drug candidate)
when bound to a larger target receptor molecule (protein) in order
to form a stable complex molecule.
A scoring function is a mathematical predictive model that produces
a score that represents the binding free energy of a binding pose.
The result of the docking process is a set of ligands ranked according
to their predicted binding scores.
4. Powers of Scoring Functions
Scoring Power: Score the protein-ligand complex.
Ranking Power: Rank different ligands bound to the
same target protein.
Docking Power: Identify the native binding pose among
computer-generated decoys.
Screening Power: Classify the true binders versus the
negative binders (random molecules).
5. Powers of Scoring Functions -
Measurements
Scoring Power: Pearson linear correlation coefficient
between predicted & experimentally determined binding
affinities.
Ranking Power: Ranking percentage (high-level ranking,
low-level ranking, Spearman rank correlation coefficient).
Docking Power: Root-mean-square-deviation (RMSD) value
between the native binding pose & best-scored binding pose.
Screening Power: Total number of true binders among the
1%, 5%, and 10% top-ranked ligands.
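As an illustration of the scoring-power and ranking-power measurements listed above, the following is a minimal sketch of the Pearson and Spearman correlation coefficients in plain Python. The affinity values are hypothetical placeholders, not data from the study, and the function names are chosen only for this example.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

def ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical binding affinities (e.g. in pKd units), for illustration only.
experimental = [4.2, 6.1, 7.8, 5.0, 9.3]
predicted = [4.9, 5.8, 7.1, 5.5, 8.6]

r = pearson(predicted, experimental)      # scoring power
rho = spearman(predicted, experimental)   # ranking power
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Pearson rewards predictions that are linearly related to the measured affinities, while Spearman only checks whether the ligands are ordered correctly, which is why the two are reported for different powers.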
6. Molecular Features
For the scoring and ranking powers, the proposed ML scoring functions
depend on a wide range of features that characterize the protein-ligand
complexes.
These features include geometrical features of the RF-Score (Ballester and
Mitchell, 2010) (36 features), energy terms of the BALL software
(Hildebrandt et al., 2010) (5 features) and energy terms of the X-Score
(Wang et al., 2002) (8 features), and pharmacophore features of the SLIDE
software (Zavodszky et al., 2002) (59 features).
We perform dimensionality reduction using the principal component
analysis (PCA) technique.
For the docking and screening powers, the proposed ML scoring functions
depend on the geometrical features of the RF-Score (Ballester and Mitchell,
2010) (36 features).
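The dimensionality-reduction step can be sketched as follows. This is a minimal PCA via singular value decomposition, assuming NumPy is available; the feature matrix is random placeholder data, and standardizing each feature before the decomposition is an assumption of this sketch, not something stated in the slides. The choice of 17 components follows the figure captions in this deck.

```python
import numpy as np

def pca_project(features, n_components):
    """Project the rows of `features` onto the leading principal components.

    Returns the projected data and the fraction of variance explained
    by each retained component.
    """
    centered = features - features.mean(axis=0)
    # Assumption of this sketch: scale each feature to unit variance,
    # guarding against constant (zero-variance) columns.
    std = centered.std(axis=0)
    std[std == 0] = 1.0
    scaled = centered / std
    # Rows of vt are the principal directions, ordered by decreasing
    # singular value.
    _, singular_values, vt = np.linalg.svd(scaled, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return scaled @ vt[:n_components].T, explained[:n_components]

# Feature counts quoted above: RF-Score, BALL, X-Score, SLIDE.
n_features = 36 + 5 + 8 + 59

# Placeholder data: 200 hypothetical complexes with random feature values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, n_features))

scores, explained = pca_project(X, n_components=17)
print(scores.shape, f"variance retained: {explained.sum():.2%}")
```

The projected matrix would then serve as input to a regression model in place of the raw features.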
7. Summary of the scoring functions
evaluated in CASF-2013
8. Optimal parameter values of the 12 ML scoring functions
on the scoring, ranking, docking, and screening powers
Random Forests (RF), Boosted Regression Trees (BRT), K-Nearest Neighbours
(kNN), Multivariate Adaptive Regression Splines (MARS), Neural Network (NN),
Partial Least Squares Regression (PLSR), Principal Component Regression
(PCR), Logistic Regression (LR), Multiple Linear Regression (MLR), Regression
with Regularization (RR), Support Vector Machines (SVM), Decision Tree (DT).
9. Performance of the 20 classical scoring functions versus the 12 ML scoring
functions with the most important 17 principal components (with @ML
suffix) in the scoring power test
10. Protein family dependent scoring power of
top ML scoring functions
11. Effect of changing the number of principal components
on the top 5 ML scoring functions scoring power
12. Effect of applying the PCA technique on the top 10 ML
scoring functions scoring power
13. Performance of the 20 classical scoring functions versus the 12 ML
scoring functions with the most important 17 principal components (with
@ML suffix) in the ranking power test
14. Protein family dependent ranking power of
the top ML scoring functions
15. Effect of changing the number of principal components on
the top 5 ML scoring functions high level ranking power
16. Effect of changing the number of principal components on
the top 5 ML scoring functions low level ranking power
17. Effect of applying the PCA technique on the top 10 ML
scoring functions ranking power
18. Success rates in the docking power test when one or more best-scored ligand binding poses
are considered. The cutoff of acceptance here is that the RMSD value between one best-scored
binding pose and the true binding pose is lower than 2.0 Å. The scoring functions are
ranked by their success rates when the top three best-scored ligand poses are considered.
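The success-rate criterion described in this caption can be sketched as below. This is an illustrative implementation under stated assumptions: the poses and scores are hypothetical, a higher score is taken to mean a better pose, and the RMSD helper assumes both poses list the same atoms in the same order with no alignment or symmetry correction.

```python
import math

RMSD_CUTOFF = 2.0  # angstroms, the acceptance cutoff quoted in the caption

def rmsd(coords_a, coords_b):
    """RMSD between two poses given as equal-length lists of (x, y, z)."""
    squared = sum(
        sum((p - q) ** 2 for p, q in zip(atom_a, atom_b))
        for atom_a, atom_b in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

def docking_success_rate(complexes, top_n=1, cutoff=RMSD_CUTOFF):
    """Fraction of complexes with a near-native pose among the top_n poses.

    `complexes` is a list of pose lists; each pose is a tuple
    (score, rmsd_to_native). Higher score = better (assumption).
    """
    successes = 0
    for poses in complexes:
        ranked = sorted(poses, key=lambda pose: pose[0], reverse=True)
        if any(pose_rmsd < cutoff for _, pose_rmsd in ranked[:top_n]):
            successes += 1
    return successes / len(complexes)

# Hypothetical (score, RMSD) pairs for three complexes.
complexes = [
    [(9.1, 0.8), (7.4, 3.5), (6.0, 5.2)],  # best-scored pose is near-native
    [(8.3, 4.1), (8.0, 1.6), (5.5, 6.7)],  # near-native pose is ranked second
    [(7.7, 3.9), (7.2, 4.4), (6.9, 2.8)],  # no pose under the cutoff
]

for n in (1, 2, 3):
    print(f"top {n}: {docking_success_rate(complexes, top_n=n):.2f}")
```

Widening the window from the top pose to the top three can only keep or raise the success rate, which is why results are usually reported for several window sizes.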
19. Enrichment factors of all 20 scoring functions versus the 12 ML
scoring functions in the screening power test. The scoring functions are
ranked by their average enrichment factor obtained at the top 1% level.
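For readers unfamiliar with the enrichment factor, a minimal sketch follows. It uses the common definition (the hit rate among the top-ranked fraction divided by the hit rate in the whole library); the ligand library is hypothetical, a higher score is assumed to mean a stronger predicted binder, and the function name is chosen only for this example.

```python
def enrichment_factor(scored_ligands, fraction):
    """Enrichment factor at the given top-ranked fraction.

    `scored_ligands` is a list of (score, is_true_binder) tuples.
    Higher score = stronger predicted binder (assumption).
    """
    ranked = sorted(scored_ligands, key=lambda item: item[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    total_binders = sum(1 for _, is_binder in ranked if is_binder)
    if total_binders == 0:
        raise ValueError("library contains no true binders")
    hits_top = sum(1 for _, is_binder in ranked[:n_top] if is_binder)
    # Hit rate in the selected subset relative to the hit rate overall.
    return (hits_top / n_top) / (total_binders / len(ranked))

# Hypothetical library: 200 ligands, 4 true binders at these rank positions
# (position 0 is the best-scored ligand).
binder_positions = {1, 7, 15, 120}
library = [(200 - i, i in binder_positions) for i in range(200)]

for level in (0.01, 0.05, 0.10):
    ef = enrichment_factor(library, level)
    print(f"EF at top {level:.0%}: {ef:.1f}")
```

An enrichment factor of 1 corresponds to random selection, so values well above 1 at the top 1% level indicate that true binders are concentrated at the head of the ranked list.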
20. Success rates of finding the best ligand molecule of all 20 scoring functions versus the 12 ML
scoring functions in the screening power test. Scoring functions are ranked by their success
rates obtained at the top 1% level. Numbers in brackets are the number of successful cases,
for which the upper limit is 65 (for ML scoring functions the upper limit is 62).
21. Conclusion
Machine learning techniques give the ability to utilize as many
relevant molecular features (e.g., geometric features,
pharmacophore features, etc.) as possible.
Particularly, ensemble-based machine learning approaches
(e.g., random forest, boosted regression trees, etc.) are
resilient to overfitting.
For docking & screening powers, machine learning
techniques need to be more target-specific: trained on a larger
number of known binders for each target protein, and using an SVM
classifier to discriminate actives from decoys instead of an
SVR regressor.
22. Acknowledgement
This work is supported:
Mainly by the Information Technology Industry
Development Agency (ITIDA) under ITAC Program
grant number CFP#58
In part by an E-JUST Research Fellowship
23. Publications
Mohamed A. Khamis, Walid Gomaa, Walaa A. Fathy,
Machine Learning in Computational Docking,
Artificial Intelligence in Medicine, Elsevier, Volume 63,
Feb 2015, Pages 135–152.
Mohamed A. Khamis, Walid Gomaa, Basem Galal,
Deep Learning Competes Random Forest in
Computational Docking, Artificial Intelligence in
Medicine, Elsevier, 2015 (submitted).
24. Supplemental Material & Questions
Supplemental Material:
Source code of machine learning techniques, feature
extraction scripts, PDB IDs, and molecular features, etc.
https://www.researchgate.net/profile/Mohamed_Khamis4
E-mail:
mohamed.khamis@ejust.edu.eg
mohamed.abdelaziz.khamis@gmail.com