Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a Chemogenomics Approach

Masahito Ohue Takuro Yamazaki Tomohiro Ban Yutaka Akiyama
Department of Computer Science, School of Computing,
Tokyo Institute of Technology, Japan.
Thirteenth International Conference on Intelligent Computing (ICIC2017)
R13: Protein and Gene Bioinformatics: Analysis, Algorithms and Applications, Aug 9, 2017.

• Drug discovery and development
– Takes more than 10 years and more than 2 billion US dollars
– Possibility to reduce costs by computational approaches
• molecular docking, QSAR/QSPR modeling, toxicity prediction,
compound-protein interaction prediction
2
Background
Aug 9th, 2017 ICIC2017 Masahito Ohue
Paul SM, et al. Nat Rev Drug Discov. 2010, 9(3):203.

• Compound-Protein Interaction (CPI) Prediction (2008-)
– a.k.a. Drug-Target Interaction Prediction
– Predict whether a query compound and protein interact (1) or not (0)
by using machine learning (ML) on compound, protein, and
interaction information.
• Called the “chemogenomics approach”
– It helps to find new targets, side effects, toxicity, etc.
3
Compound-Protein Interaction Prediction
[Figure: bipartite network linking compounds c1 to c6 with proteins p1 to p5; an unknown interaction is treated as a negative label (0)]

      p1  p2  p3  p4  p5
c1     1   1   0   0   0
c2     0   0   1   0   0
c3     0   1   1   0   0
c4     0   0   1   0   0
c5     0   0   0   0   0
c6     0   0   0   0   0
4
CPI Prediction Problem
Notation
• Interaction Matrix
• Feature Vector of Compounds (e.g. MACCS Key, PubChem fingerprint, etc.)
• Feature Vector of Proteins (e.g. PFAM fingerprint, amino acid k-mer, etc.)
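As a minimal sketch, the example interaction matrix on this slide can be written as a NumPy array. The variable names (`Y`, `density`, and so on) are chosen here for illustration and do not come from the slides.

```python
import numpy as np

# Example interaction matrix from the slide: rows are compounds c1..c6,
# columns are proteins p1..p5; 1 = known interaction, 0 = unknown.
Y = np.array([
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
])

n_compounds, n_proteins = Y.shape
n_interactions = int(Y.sum())

# Fraction of compound-protein pairs with a known interaction.
density = n_interactions / (n_compounds * n_proteins)
print(n_compounds, n_proteins, n_interactions, density)
```

Compounds c5 and c6 and proteins p4 and p5 have no known interaction in this toy matrix, which is the situation a predictor faces for novel compounds or proteins.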

• Basic Concept
“Similar compounds/proteins have similar interactions”
a) Kernel-based Machine Learning
b) Matrix Factorization
      p1  p2  p3  p4  p5
c1     1   1   0   0   0
c2     0   0   1   0   0
c3     0   1   1   0   0
c4     0   0   1   0   0
c5     0   0   0   0   0
c6     0   0   0   0   0
5
Major Approaches for CPI Prediction
→Kernel-based machine learning
[Figure: the interaction matrix (compounds c1 to c6, proteins p1 to p5) with the labels "decomposition", "feature vectors" and "kernel functions"]

6
Pairwise Kernel Method (PKM)
[Figure: PKM workflow. The compound-protein interaction network gives the interaction matrix (training data). The compound kernel and the protein kernel (similarity matrices) are combined into a pairwise kernel (kernel trick). Learning (SVM, etc.) yields the prediction model.]
Pairwise Kernel Method (PKM) (Jacob & Vert. Bioinformatics 2008)

7
Gaussian Interaction Profile (GIP)
[Figure: GIP workflow. The compound-protein interaction network gives the interaction matrix (training data). Kernels derived from the interaction matrix are integrated with the compound kernel and the protein kernel (similarity matrices) in a multiple kernel scheme. Learning (SVM, etc.) yields the prediction model.]
Gaussian Interaction Profile (GIP) (van Laarhoven, et al. Bioinformatics 2011)
Similar interaction patterns → Have similar interactions

8
Gaussian Interaction Profile (GIP)
[Figure: each row and each column of the interaction matrix is taken as an interaction profile. A compound GIP kernel and a protein GIP kernel (Gaussian kernels) are computed from these profiles.]
GIP method
• More accurate than using only
compound/protein similarities
Problem: ‘0’ and ‘1’ are treated almost the same
• A pair of all-‘0’ vectors obtains the
maximum kernel value, the same as
a pair of all-‘1’ vectors.
• ‘1’ is experimentally determined,
but ‘0’ is an unknown interaction.
• ‘1’-information should be
considered more reliable than ‘0’.
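The problem described above can be reproduced with a small sketch of a Gaussian kernel on interaction profiles. The function name `gip_kernel` and the fixed bandwidth `gamma=1.0` are assumptions for illustration; the original GIP paper normalizes the bandwidth by the average profile norm, which is omitted here.

```python
import numpy as np

def gip_kernel(profiles, gamma=1.0):
    """Gaussian kernel between binary interaction profiles (one per row).

    gamma is a free bandwidth parameter (an assumption of this sketch).
    """
    P = np.asarray(profiles, dtype=float)
    sq_norms = (P ** 2).sum(axis=1)
    # Squared Euclidean distance between every pair of profiles.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (P @ P.T)
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against rounding below zero
    return np.exp(-gamma * sq_dist)

# Toy interaction matrix: rows are compounds c1..c6, columns proteins p1..p5.
Y = np.array([
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
])

K = gip_kernel(Y)
k_unknown = K[4, 5]  # c5 vs c6: both all-'0', distance 0, kernel value 1
k_same = K[1, 3]     # c2 vs c4: identical known interactions, also 1
print(k_unknown, k_same)
```

Two compounds with no known interactions at all receive the same maximal similarity as two compounds that share an experimentally confirmed target, which is the behavior the slide criticizes.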

• Idea: ‘1’-information is more important
→ Network theory, graph mining, link mining
(data mining on the World Wide Web, social networks, biological networks, etc.)
• Link indicators used in the field of link mining were
applied to the CPI bipartite network.
9
Idea from Link Mining
[Figure: a general graph (network) of nodes and links (edges), next to a bipartite graph (network) whose nodes are split into Group A and Group B]

10
Link Indicators
[Figure: the interaction matrix is viewed as a bipartite network (CPIs), and a link indicator is calculated between interaction profiles]
3 link indicators were used in this study:
• Jaccard index
• Cosine similarity
• LHN
because these link indicators become positive definite kernels when used as kernels.
*
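The three link indicators can be sketched as follows, assuming their standard definitions on binary profiles: common neighbors divided by the size of the union (Jaccard), by the square root of the degree product (cosine), or by the degree product (LHN). The slides do not show the formulas, so these definitions, the function name, and the convention that a 0/0 ratio equals 0 are assumptions of this sketch.

```python
import numpy as np

def link_indicator_kernels(profiles):
    """Return Jaccard, cosine and LHN matrices for binary profiles (rows)."""
    P = np.asarray(profiles, dtype=float)
    common = P @ P.T                       # number of shared neighbors
    degree = P.sum(axis=1)                 # number of neighbors per node
    union = degree[:, None] + degree[None, :] - common
    prod = degree[:, None] * degree[None, :]

    def safe_divide(num, den):
        # Assumption: a 0/0 ratio (a node without any link) is defined as 0.
        out = np.zeros_like(num)
        np.divide(num, den, out=out, where=den > 0)
        return out

    jaccard = safe_divide(common, union)
    cosine = safe_divide(common, np.sqrt(prod))
    lhn = safe_divide(common, prod)
    return jaccard, cosine, lhn

# Toy interaction matrix: rows are compounds c1..c6, columns proteins p1..p5.
Y = np.array([
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
])

jaccard, cosine, lhn = link_indicator_kernels(Y)    # compound-side matrices
jac_p, cos_p, lhn_p = link_indicator_kernels(Y.T)   # protein-side matrices
print(jaccard[0, 2], cosine[0, 2], lhn[0, 2])       # c1 vs c3 share p2
print(jaccard[4, 5], cosine[4, 5], lhn[4, 5])       # c5 vs c6: no links
```

Under this convention, profiles without any ‘1’ get the value 0 with every other profile, so only observed interactions contribute to similarity.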

11
Proposed Method: Link Indicator Kernel (LIK)
[Figure: each row and each column of the interaction matrix is taken as an interaction profile. A compound Link Indicator Kernel and a protein Link Indicator Kernel (LIKs) are computed from these profiles.]
• All-‘0’ vectors obtain the
minimum value.
• ‘1’-information is considered
more important than ‘0’.
*

12
CPI Prediction Method Summary
[Figure: summary workflow. The compound-protein interaction network gives the interaction matrix and its interaction profiles. The compound kernel and protein kernel (similarity matrices) are used together with the compound kernel and protein kernel computed from the profiles (GIP/LIK). Learning (SVM, etc.) yields the prediction model.]

• Dataset
– General benchmark dataset by Yamanishi et al.
– Contains 4 CPI networks
– Similarity matrices (similarity kernels) are precomputed
• Prediction Methods
– PKM w/similarity kernels + GIP (conventional)
– PKM w/similarity kernels + LIK (Jac/cos/LHN) (proposed)
                Nuclear Receptor  GPCR   Ion Channel  Enzyme
#compounds      54                223    210          445
#proteins       26                95     204          664
#interactions   90                635    1476         2926
Density         6.41%             3.00%  3.45%        0.99%
13
Benchmarking
(Yamanishi, et al. Bioinformatics, 2008)
http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/
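The density row follows directly from the counts, as this short arithmetic check shows. It assumes density means #interactions / (#compounds × #proteins), which reproduces the reported percentages; the dictionary layout is chosen here for illustration.

```python
# (n_compounds, n_proteins, n_interactions) per dataset, from the table above.
datasets = {
    "Nuclear Receptor": (54, 26, 90),
    "GPCR": (223, 95, 635),
    "Ion Channel": (210, 204, 1476),
    "Enzyme": (445, 664, 2926),
}

# Density = known interactions / all possible compound-protein pairs.
density = {
    name: n_int / (n_c * n_p)
    for name, (n_c, n_p, n_int) in datasets.items()
}
mean_density = sum(density.values()) / len(density)

for name, d in density.items():
    print(f"{name}: {100 * d:.2f}%")
print(f"mean density: {mean_density:.3f}")
```

The mean over the four datasets is about 0.035, so roughly 96.5% of all pairs in these benchmarks carry the label 0.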

14
Evaluation (According to Ding’s benchmarking)
[Figure: the three CV types on a 6 × 6 example; "?" marks the held-out entries to be predicted]

Compound-wise CV
      p1  p2  p3  p4  p5  p6
c1     1   0   0   1   0   0
c2     0   1   0   0   1   1
c3     1   0   0   0   0   0
c4     1   0   1   1   0   0
c5     ?   ?   ?   ?   ?   ?
c6     ?   ?   ?   ?   ?   ?

Protein-wise CV
      p1  p2  p3  p4  p5  p6
c1     1   0   0   1   ?   ?
c2     0   1   0   0   ?   ?
c3     1   0   0   0   ?   ?
c4     1   0   1   1   ?   ?
c5     0   1   0   0   ?   ?
c6     0   0   1   1   ?   ?

Pairwise CV
      p1  p2  p3  p4  p5  p6
c1     ?   0   ?   1   0   ?
c2     0   ?   0   ?   1   1
c3     1   0   0   0   0   ?
c4     1   0   ?   ?   0   0
c5     ?   1   0   0   0   1
c6     ?   0   ?   1   ?   0
• 3 Types of Cross-Validations (CVs)
• AUROC and AUPR
– AUROC (TP rate vs. FP rate): perfect prediction → AUROC = 1; random prediction → AUROC = 0.5 (diagonal line)
– AUPR (precision vs. recall): perfect prediction → AUPR = 1; random prediction → AUPR = density (avg. AUPR ≈ 0.035)
* 10-fold CV was repeated 5 times with random splits, and the AUROC and AUPR values were averaged.
(Ding, et al. Brief Bioinform 2014)
CPIs have far fewer positives than negatives, so false positives (FPs) should be weighted more heavily.
AUPR penalizes FPs more than AUROC does. → AUPR is more important than AUROC.
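A small sketch of both metrics on a ranked list illustrates the difference. It assumes untied scores, computes AUROC with the trapezoidal rule, and uses average precision as the AUPR estimate; the function name and the toy labels and scores are hypothetical.

```python
import numpy as np

def roc_pr_auc(y_true, scores):
    """AUROC (trapezoidal rule) and AUPR (average precision).

    Assumes binary labels and no tied scores, for brevity.
    """
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)          # true positives among the top-k predictions
    fp = np.cumsum(1 - y)      # false positives among the top-k predictions
    n_pos, n_neg = tp[-1], fp[-1]
    tpr = np.concatenate([[0.0], tp / n_pos])
    fpr = np.concatenate([[0.0], fp / n_neg])
    auroc = float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2))
    precision = tp / (tp + fp)
    # Average precision: mean of the precision at the rank of each positive.
    aupr = float(np.sum(precision * y) / n_pos)
    return auroc, aupr

# Hypothetical ranking: one false positive is scored above a true positive.
y_true = [1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auroc, aupr = roc_pr_auc(y_true, scores)
print(auroc, aupr)
```

In this toy case a single highly ranked false positive lowers the AUPR estimate more than it lowers AUROC, which is in line with the argument on the slide.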

15
Prediction Accuracy (Cross-Validations)
[Figure: bar charts of AUPR (averaged over the 4 datasets) and AUROC (averaged over the 4 datasets) for the LIKs (Jaccard index, cosine similarity, LHN), GIP, and random prediction; higher is better]
→ LIKs are better than GIP, especially in
compound-wise and protein-wise CVs.

• Computational complexity
– #compound= , #protein=
– PKM Learning:
– Calc. LIKs:
– Total:
• Experimental computational time
16
Computational Time
                          Nuclear Receptor  GPCR  Ion Channel  Enzyme
Conventional (PKM) [sec]  0.0680            4.86  24.1         232
Proposed (LIK) [sec]      0.0850            5.17  24.8         239
Increase rate (%)         25%               6.4%  2.9%         3.3%
Datasets are ordered by size, from small (left) to large (right).
Almost the same calculation time as PKM

• We proposed the Link Indicator Kernel (LIK) for CPI predictions.
• Compared with GIP, the calculation time was almost the same and
the accuracy was improved.
– Predictions for novel compounds and novel proteins were especially good.
– Overall, LIK with cosine similarity was the most accurate.
– The different handling of 0 and 1 in LIK may account for the improvement.
• LIK can also be applied to methods derived from GIP, such as
WNNGIP, KronRLS-MKL, etc.
• Future Work
– Hyperparameter search becomes a bottleneck in the CPI problem,
but it can be accelerated by applying Bayesian optimization*.
– It may be better to treat an unknown interaction as an unknown label.
Exploring the applicability of positive-unlabeled learning** is of interest.
17
Conclusion
**Lan, et al. Predicting drug-target interaction using
positive-unlabeled learning. Neurocomputing 206, 2016.
*Ban, Ohue, Akiyama. (submitted)

Acknowledgements
18
This work was partially supported by the Japan Society for the Promotion of Science (JSPS)
KAKENHI (grant nos. 24240044 and 15K16081), and Core Research for Evolutional Science and
Technology (CREST) “Extreme Big Data” (grant no. JPMJCR1303) from the Japan Science and
Technology Agency (JST).
Akiyama Lab. members, Tokyo Tech
Takuro Yamazaki
Former student of our lab;
currently a graduate student at the University of Tokyo, Japan.

• Proof: Cosine similarity & LHN
• Proof: Jaccard index
– Previously proved to be positive definite by Bouchard et al.
20
LIKs are Positive Definite Kernels
Use the property of positive definite kernels:
Let $k(x, x')$ be a positive definite kernel and $f$ be an arbitrary real-valued function.
Then, the kernel $k'(x, x') = f(x)\,k(x, x')\,f(x')$ is also positive definite.
The inner product $\langle x, x' \rangle$ is positive definite.
Bouchard, et al. A proof for the positive definiteness of the Jaccard index matrix.
Int. J. Approx. Reason. 54: 615-626, 2013.
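A sketch of how this property covers cosine similarity and LHN, assuming binary interaction profiles $x, x'$ and the standard definitions of both indicators; the symbols $k_{\cos}$, $k_{\mathrm{LHN}}$ and $f$ are notation chosen here, not taken from the slides. For binary profiles, $\langle x, x' \rangle$ is the number of common neighbors and $\|x\|^2$ is the degree.

```latex
\begin{align*}
k_{\cos}(x, x') &= \frac{\langle x, x' \rangle}{\|x\|\,\|x'\|}
  = f(x)\,\langle x, x' \rangle\,f(x'),
  & f(x) &= \frac{1}{\|x\|}, \\
k_{\mathrm{LHN}}(x, x') &= \frac{\langle x, x' \rangle}{\|x\|^{2}\,\|x'\|^{2}}
  = f(x)\,\langle x, x' \rangle\,f(x'),
  & f(x) &= \frac{1}{\|x\|^{2}}.
\end{align*}
```

Both indicators are therefore the inner product rescaled by a function of each argument. For an all-‘0’ profile one can set $f(x) = 0$, which gives the kernel value 0 and keeps the argument valid because $f$ is arbitrary.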

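• Normally, a vector $\Phi(c, p)$ for a pair of compound $c$
and protein $p$ is required for ML scheme.
• In the PKM, $\Phi(c, p)$ is defined as the tensor product
of the map of compound $\Phi_c(c)$ and protein $\Phi_p(p)$:
$\Phi(c, p) = \Phi_c(c) \otimes \Phi_p(p)$
• Pairwise kernel is defined between two pairs of
proteins and compounds $(c, p)$ and $(c', p')$ as
$K((c, p), (c', p')) = K_c(c, c') \cdot K_p(p, p')$
21
Pairwise Kernel
Use only similarity matrices (kernels); do not use the feature vector $\Phi(c, p)$.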
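The kernel trick behind the pairwise kernel can be checked numerically: the product of a compound kernel and a protein kernel equals the inner product of tensor-product features. This is a minimal sketch with hypothetical toy feature vectors and linear kernels; the function name `pairwise_kernel` is chosen here.

```python
import numpy as np

def pairwise_kernel(K_c, K_p):
    """Kernel between all (compound, protein) pairs via a Kronecker product.

    With n_p proteins, entry [i * n_p + j, k * n_p + l] equals
    K_c[i, k] * K_p[j, l].
    """
    return np.kron(K_c, K_p)

# Toy feature vectors (hypothetical values) with linear kernels.
X_c = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])  # 3 compounds
X_p = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]])    # 2 proteins
K_c = X_c @ X_c.T
K_p = X_p @ X_p.T

K_pair = pairwise_kernel(K_c, K_p)                    # shape (6, 6)

# Explicit tensor-product features for every pair, in the same pair order.
Phi = np.array([np.kron(c, p) for c in X_c for p in X_p])
K_explicit = Phi @ Phi.T

print(np.allclose(K_pair, K_explicit))
```

Only `K_c` and `K_p` are needed to build `K_pair`, so precomputed similarity matrices are enough and the pair feature vectors never have to be formed. The full matrix has (n_c · n_p)² entries, so in practice one would typically compute only the entries for the training pairs.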

22
Observed Distribution of Link Indicator Frequency
[Figure: observed frequency distributions of link indicator values, annotated as "moderate distribution" and "immoderate distribution"]

23
Overall Prediction Accuracy
Overall prediction accuracy for each CPI prediction method in 10-fold CV
tests. The AUPR and AUROC values are averaged values of 3 CVs and 4
datasets (total average for 12 AUPR/AUROC values).
wk : multiple kernel weight
• Cosine similarity with wk=0.5 showed the best performance.
• Compared with GIP, LIK showed higher accuracy overall.

• SVM
– Cost parameter C = {0.1, 1, 10, 100}
– Multiple kernel weight wk = {0.1, 0.3, 0.5, 1}
• Cross validation (CV)
– 3 types: compound-wise, protein-wise, pairwise
– 10-fold CVs
– Random division of the dataset was repeated 5 times
– AUROC and AUPR
24
Settings for Learning
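The three CV types differ only in which entries of the interaction matrix are held out. A minimal sketch of the fold construction follows, assuming a random balanced assignment of rows, columns, or entries to folds; the function name, the seeds, and the exact splitting procedure are assumptions, not the authors' code.

```python
import numpy as np

def cv_test_masks(shape, mode, n_folds=10, seed=0):
    """Return one boolean test mask per fold for the given CV type."""
    n_c, n_p = shape
    rng = np.random.default_rng(seed)
    if mode == "compound":    # hide whole rows (novel compounds)
        fold_of = rng.permutation(n_c) % n_folds
        fold_grid = np.repeat(fold_of[:, None], n_p, axis=1)
    elif mode == "protein":   # hide whole columns (novel proteins)
        fold_of = rng.permutation(n_p) % n_folds
        fold_grid = np.repeat(fold_of[None, :], n_c, axis=0)
    elif mode == "pairwise":  # hide individual compound-protein pairs
        fold_grid = (rng.permutation(n_c * n_p) % n_folds).reshape(n_c, n_p)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [fold_grid == fold for fold in range(n_folds)]

# Nuclear Receptor sized example (54 compounds x 26 proteins),
# with the random division repeated 5 times using different seeds.
shape = (54, 26)
repeats = {
    mode: [cv_test_masks(shape, mode, seed=s) for s in range(5)]
    for mode in ("compound", "protein", "pairwise")
}
print({mode: (len(r), len(r[0])) for mode, r in repeats.items()})
```

In each fold, entries where the mask is `True` would be hidden during training and scored afterwards with AUROC and AUPR.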

25
History of the CPI Prediction Methods
[Timeline figure, 2008 to 2017; methods are marked as kernel-based or matrix factorization-based]
• KRM (Yamanishi et al., 2008)
• PKM (Jacob & Vert, 2008)
• BLM (Bleakley et al., 2008)
• LapRLS (Xia et al., 2010)
• GIP (van Laarhoven et al., 2011)
• KBMF2K (Gonen et al., 2012)
• WNNGIP (van Laarhoven et al., 2013)
• BLMNII (Mei et al., 2013)
• MSCMF (Zheng et al., 2013)
• REMAP (Lim et al., 2016)
• KronRLS-MKL (Nascimento et al., 2016)
• NRLMF (Liu et al., 2016)
