Thirteenth International Conference on Intelligent Computing (ICIC2017)
R13: Protein and Gene Bioinformatics: Analysis, Algorithms and Applications, Aug 9, 2017.
Masahito Ohue, Takuro Yamazaki, Tomohiro Ban, Yutaka Akiyama.
In Proceedings of the Thirteenth International Conference on Intelligent Computing (ICIC2017) (Lecture Notes in Computer Science), 10362, 549-558, Liverpool, UK, August 7-10, 2017.
https://link.springer.com/chapter/10.1007/978-3-319-63312-1_48
Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a Chemogenomics Approach
1. Link Mining for Kernel-based
Compound-Protein Interaction Predictions
Using a Chemogenomics Approach
Masahito Ohue, Takuro Yamazaki, Tomohiro Ban, Yutaka Akiyama
Department of Computer Science, School of Computing,
Tokyo Institute of Technology, Japan.
2. Background
• Drug discovery and development
– Takes more than 10 years and more than 2 billion US dollars
– Computational approaches may reduce these costs
• molecular docking, QSAR/QSPR modeling, toxicity prediction,
compound-protein interaction prediction
Paul SM, et al. Nat Rev Drug Discov. 2010, 9(3):203.
Aug 9th, 2017 ICIC2017 Masahito Ohue
3. Compound-Protein Interaction Prediction
• Compound-Protein Interaction (CPI) prediction (2008-)
– a.k.a. drug-target interaction prediction
– Predicts whether a query compound and protein interact (1) or not (0)
by using machine learning (ML) with compound, protein, and
interaction information.
• Called the “chemogenomics approach”
– It can lead to finding new targets, side effects, toxicity, etc.
[Figure: bipartite network linking compounds c1-c6 to proteins p1-p5; a pair without a link is an unknown interaction (negative label, 0).]
4. CPI Prediction Problem
Notation
• Interaction matrix (rows: compounds, columns: proteins)
     p1  p2  p3  p4  p5
c1    1   1   0   0   0
c2    0   0   1   0   0
c3    0   1   1   0   0
c4    0   0   1   0   0
c5    0   0   0   0   0
c6    0   0   0   0   0
• Feature vector of compounds: e.g. MACCS keys, PubChem fingerprint, etc.
• Feature vector of proteins: e.g. Pfam fingerprint, amino acid k-mers, etc.
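As a minimal sketch of the notation above, the example interaction matrix can be held as a nested list, with each row read as a compound's interaction profile and each column as a protein's. The helper names (`compound_profile`, `protein_profile`, `density`) are illustrative assumptions, not part of the original study.

```python
# Example interaction matrix from the slide:
# rows are compounds c1-c6, columns are proteins p1-p5.
# 1 = known interaction, 0 = unknown interaction.
Y = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]


def compound_profile(Y, i):
    """Interaction profile of compound i (row i of Y)."""
    return list(Y[i])


def protein_profile(Y, j):
    """Interaction profile of protein j (column j of Y)."""
    return [row[j] for row in Y]


def density(Y):
    """Fraction of compound-protein pairs labeled 1."""
    n_ones = sum(sum(row) for row in Y)
    return n_ones / (len(Y) * len(Y[0]))
```

In this toy matrix, 6 of the 30 pairs are labeled 1, so `density(Y)` is 0.2; the benchmark networks used later are far sparser.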
5. Major Approaches for CPI Prediction
• Basic concept
“Similar compounds/proteins have similar interactions”
a) Kernel-based machine learning (feature vectors → kernel functions)
b) Matrix factorization (decomposition of the interaction matrix)
→ This talk focuses on kernel-based machine learning.
[Figure: the example interaction matrix (compounds c1-c6 × proteins p1-p5) shown next to the two approaches.]
7. Gaussian Interaction Profile (GIP)
Gaussian Interaction Profile (GIP) (van Laarhoven, et al. Bioinformatics 2011)
• Idea: compounds/proteins with similar interaction patterns have similar interactions.
[Figure: workflow from the compound-protein interaction network and its interaction matrix (training data), via the compound kernel and protein kernel (similarity matrices) and their integration (multiple kernel scheme), to learning (SVM, etc.) and the prediction model.]
8. Gaussian Interaction Profile (GIP)
[Figure: rows and columns of the interaction matrix are taken as interaction profiles; the compound GIP kernel and the protein GIP kernel are computed from these profiles with a Gaussian kernel.]
GIP method
• More accurate than using only compound/protein similarities
Problem: ‘0’ and ‘1’ are treated almost the same
• A pair of all-‘0’ vectors obtains the maximum kernel value,
the same as a pair of all-‘1’ vectors.
• ‘1’ is experimentally determined,
but ‘0’ is an unknown interaction.
• ‘1’-information should be
considered more reliable than ‘0’.
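The problem described above can be reproduced with a minimal sketch of a Gaussian kernel on interaction profiles. The function name `gip_kernel` and the fixed bandwidth `gamma=1.0` are illustrative assumptions; they are not the settings used in the study.

```python
import math


def gip_kernel(u, v, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||u - v||^2) on two interaction profiles.

    gamma is a hypothetical fixed bandwidth chosen only for illustration.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


all_zero = [0, 0, 0, 0, 0]
all_one = [1, 1, 1, 1, 1]

# Both pairs have zero distance, so both reach the maximum value 1.0,
# although the all-zero pair shares no experimentally confirmed interaction.
k_zero = gip_kernel(all_zero, all_zero)
k_one = gip_kernel(all_one, all_one)
```

Because the Gaussian kernel depends only on the distance between profiles, it cannot distinguish agreement on unknown entries from agreement on confirmed interactions.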
9. Idea from Link Mining
• Idea: ‘1’-information is more important
→ Network theory, graph mining, link mining
(data mining on the World Wide Web, social networks, biological networks, etc.)
• Link indicators used in the field of link mining were
applied to the CPI bipartite network.
[Figure: a general graph (network) made of nodes and links (edges), and a bipartite graph (network) whose links connect Group A and Group B.]
10. Link Indicators
[Figure: a link indicator is calculated between interaction profiles taken from the interaction matrix of the bipartite network (CPIs).]
Three link indicators were used in this study:
• Jaccard index
• Cosine similarity
• LHN
These were chosen because they are positive definite when used as kernels.
11. Proposed Method: Link Indicator Kernel (LIK)
[Figure: rows and columns of the interaction matrix are taken as interaction profiles; the compound Link Indicator Kernel and the protein Link Indicator Kernel are computed from these profiles.]
Link Indicator Kernels (LIKs)
• A pair of all-‘0’ vectors obtains
the minimum value.
• ‘1’-information is considered
more important than ‘0’.
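The three link indicators can be sketched on binary interaction profiles as follows. The formulas are the standard textbook definitions (the slide's own formulas did not survive in this transcript), and returning 0 for an all-zero profile is an assumption made here to match the statement that all-‘0’ vectors obtain the minimum value.

```python
import math


def _dot(u, v):
    """Inner product; for binary profiles this counts shared interactions."""
    return sum(a * b for a, b in zip(u, v))


def jaccard(u, v):
    """Shared interactions divided by the size of the union."""
    inter = _dot(u, v)
    union = _dot(u, u) + _dot(v, v) - inter
    return inter / union if union else 0.0


def cosine(u, v):
    """Shared interactions divided by the product of the profile norms."""
    denom = math.sqrt(_dot(u, u) * _dot(v, v))
    return _dot(u, v) / denom if denom else 0.0


def lhn(u, v):
    """Leicht-Holme-Newman index: shared interactions over the product of degrees."""
    denom = _dot(u, u) * _dot(v, v)
    return _dot(u, v) / denom if denom else 0.0


c1 = [1, 1, 0, 0, 0]   # profile of compound c1 in the example matrix
c3 = [0, 1, 1, 0, 0]   # profile of compound c3
c5 = [0, 0, 0, 0, 0]   # compound with no known interaction
```

For c1 and c3, which share one protein, this gives a Jaccard index of 1/3, a cosine similarity of 0.5, and an LHN value of 0.25; any pair involving c5 scores 0, unlike the Gaussian kernel.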
13. Benchmarking
• Dataset
– General benchmark dataset by Yamanishi et al. (Yamanishi, et al. Bioinformatics, 2008)
– Contains 4 CPI networks
– Similarity matrices (similarity kernels) are precomputed
• Prediction methods
– PKM with similarity kernels + GIP (conventional)
– PKM with similarity kernels + LIK (Jaccard/cosine/LHN) (proposed)

                Nuclear Receptor   GPCR    Ion Channel   Enzyme
#compounds             54           223        210         445
#proteins              26            95        204         664
#interactions          90           635       1476        2926
Density              6.41%        3.00%      3.45%       0.99%

http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/
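The density row of the table follows directly from the other three rows: it is the number of known interactions divided by the number of compound-protein pairs. A short sketch of this arithmetic, with the dictionary layout and the name `density_percent` as illustrative choices:

```python
# (compounds, proteins, known interactions) for each benchmark network,
# copied from the table above.
DATASETS = {
    "Nuclear Receptor": (54, 26, 90),
    "GPCR": (223, 95, 635),
    "Ion Channel": (210, 204, 1476),
    "Enzyme": (445, 664, 2926),
}


def density_percent(n_compounds, n_proteins, n_interactions):
    """Share of all compound-protein pairs that are known interactions, in %."""
    return 100.0 * n_interactions / (n_compounds * n_proteins)


densities = {name: round(density_percent(*dims), 2)
             for name, dims in DATASETS.items()}
```

The low densities mean that positives are rare in every network, with the Enzyme network below 1%.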
14. Evaluation (According to Ding’s benchmarking)
• 3 types of cross-validation (CV); “?” marks the held-out test entries

compound-wise CV
     p1  p2  p3  p4  p5  p6
c1    1   0   0   1   0   0
c2    0   1   0   0   1   1
c3    1   0   0   0   0   0
c4    1   0   1   1   0   0
c5    ?   ?   ?   ?   ?   ?
c6    ?   ?   ?   ?   ?   ?

protein-wise CV
     p1  p2  p3  p4  p5  p6
c1    1   0   0   1   ?   ?
c2    0   1   0   0   ?   ?
c3    1   0   0   0   ?   ?
c4    1   0   1   1   ?   ?
c5    0   1   0   0   ?   ?
c6    0   0   1   1   ?   ?

pairwise CV
     p1  p2  p3  p4  p5  p6
c1    ?   0   ?   1   0   ?
c2    0   ?   0   ?   1   1
c3    1   0   0   0   0   ?
c4    1   0   ?   ?   0   0
c5    ?   1   0   0   0   1
c6    ?   0   ?   1   ?   0

• AUROC and AUPR
– AUROC (TP rate vs. FP rate): perfect prediction → AUROC = 1; random prediction → AUROC = 0.5 (diagonal line)
– AUPR (precision vs. recall): perfect prediction → AUPR = 1; random prediction → AUPR = density (avg. AUPR ≈ 0.035)
* 10-fold CV was repeated 5 times with random splits, and the accuracies (AUROC, AUPR) were averaged.
(Ding, et al. Brief Bioinform 2014)
CPIs have far fewer positives than negatives, so FPs should be weighted more heavily.
AUPR penalizes FPs more than AUROC does. → AUPR is more important than AUROC.
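The three cross-validation types differ only in what is held out: whole compounds (rows), whole proteins (columns), or individual compound-protein pairs. The following is a minimal sketch under that reading; the function names, the round-robin fold assignment, and the seed handling are assumptions for illustration, not the benchmarking code used in the study.

```python
import random


def make_folds(items, n_folds, seed=0):
    """Shuffle items and deal them into n_folds folds of near-equal size."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[k::n_folds] for k in range(n_folds)]


def cv_test_sets(n_compounds, n_proteins, mode, n_folds=10, seed=0):
    """Return, for each fold, the list of held-out (compound, protein) pairs."""
    if mode == "compound":
        # Hold out entire rows: all pairs of the selected compounds.
        folds = make_folds(range(n_compounds), n_folds, seed)
        return [[(i, j) for i in fold for j in range(n_proteins)]
                for fold in folds]
    if mode == "protein":
        # Hold out entire columns: all pairs of the selected proteins.
        folds = make_folds(range(n_proteins), n_folds, seed)
        return [[(i, j) for i in range(n_compounds) for j in fold]
                for fold in folds]
    if mode == "pairwise":
        # Hold out individual matrix entries.
        pairs = [(i, j) for i in range(n_compounds) for j in range(n_proteins)]
        return make_folds(pairs, n_folds, seed)
    raise ValueError("mode must be 'compound', 'protein', or 'pairwise'")
```

Repeating the split with several different seeds and averaging the scores would correspond to the repeated random 10-fold CV described above.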
15. Prediction Accuracy (Cross-Validations)
[Figure: two bar charts, AUPR and AUROC (each averaged over the 4 datasets), for compound-wise, protein-wise, and pairwise CV, comparing the LIKs (Jaccard index, cosine similarity, LHN) with GIP and random prediction; higher is better.]
→ LIKs are better than GIP, especially in
compound-wise and protein-wise CVs.
16. Computational Time
• Computational complexity
– #compounds = $n_c$, #proteins = $n_p$
– PKM learning: $O(n_c^3 n_p^3)$
– Calculating LIKs: $O(n_c n_p (n_c + n_p))$
– Total: $O(n_c^3 n_p^3)$
• Experimental computational time (datasets ordered from small to large)

                            Nuclear Receptor   GPCR   Ion Channel   Enzyme
Conventional (PKM) [sec]         0.0680        4.86      24.1         232
Proposed (LIK) [sec]             0.0850        5.17      24.8         239
Increase rate                     25%          6.4%      2.9%         3.3%

Almost the same calculation time as PKM
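To see why the kernel calculation barely changes the total, the two complexity terms can be evaluated for the benchmark sizes. This is only an order-of-magnitude sketch with constant factors ignored; the function names are illustrative.

```python
def pkm_learning_cost(n_c, n_p):
    """Order of PKM learning: (n_c * n_p)^3 = n_c^3 * n_p^3."""
    return (n_c * n_p) ** 3


def lik_cost(n_c, n_p):
    """Order of computing both LIK matrices.

    The compound kernel has n_c^2 entries of length-n_p profiles and the
    protein kernel has n_p^2 entries of length-n_c profiles, which sums to
    n_c * n_p * (n_c + n_p).
    """
    return n_c * n_p * (n_c + n_p)


# (compounds, proteins) of the four benchmark networks.
SIZES = {
    "Nuclear Receptor": (54, 26),
    "GPCR": (223, 95),
    "Ion Channel": (210, 204),
    "Enzyme": (445, 664),
}

ratios = {name: lik_cost(n_c, n_p) / pkm_learning_cost(n_c, n_p)
          for name, (n_c, n_p) in SIZES.items()}
```

Even for the smallest network the LIK term is several orders of magnitude below the learning term, which is consistent with the total remaining dominated by PKM learning.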
17. Conclusion
• We proposed the Link Indicator Kernel (LIK) for CPI prediction.
• Compared with GIP, the calculation time was almost the same and
the accuracy was improved.
– In particular, predictions for novel compounds and novel proteins were good.
– Overall, LIK with cosine similarity was the most accurate.
– LIK’s different handling of 0 and 1 may be the reason for this improvement.
• LIK can also be applied to methods derived from GIP, such as
WNNGIP, KronRLS-MKL, etc.
• Future work
– Hyperparameter search becomes a bottleneck in the CPI problem,
but it can be accelerated by applying Bayesian optimization*.
– It may be better to treat an unknown interaction as an unknown label.
Exploring the applicability of positive-unlabeled learning** is of interest.
*Ban, Ohue, Akiyama. (submitted)
**Lan, et al. Predicting drug-target interaction using
positive-unlabeled learning. Neurocomputing 206, 2016.
18. Acknowledgements
This work was partially supported by the Japan Society for the Promotion of Science (JSPS)
KAKENHI (grant nos. 24240044 and 15K16081), and Core Research for Evolutional Science and
Technology (CREST) “Extreme Big Data” (grant no. JPMJCR1303) from the Japan Science and
Technology Agency (JST).
Akiyama Lab. members, Tokyo Tech
Takuro Yamazaki
Former student of our lab.
He is currently a graduate student at the University of Tokyo, Japan.
20. LIKs are Positive Definite Kernels
• Proof: cosine similarity & LHN
Use the following property of positive definite kernels:
Let $k(\mathbf{x}, \mathbf{y})$ be a positive definite kernel and $f$ be an arbitrary real-valued function.
Then, the kernel $f(\mathbf{x})\,k(\mathbf{x}, \mathbf{y})\,f(\mathbf{y})$ is also positive definite.
The inner product is positive definite.
• Proof: Jaccard index
– Previously proved to be positive definite by Bouchard et al.
Bouchard, et al. A proof for the positive definiteness of the Jaccard index matrix.
Int. J. Approx. Reason. 54: 615-626, 2013.
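The argument for cosine similarity and LHN can be written out as a short derivation. The kernel formulas below are the standard definitions of the two indicators on profile vectors and are stated here as an assumption, since the slide's own equations did not survive in this transcript; $f$ is set to 0 on the zero vector.

```latex
% Property: if k is positive definite and f is any real-valued function,
% then k'(x, y) = f(x) k(x, y) f(y) is positive definite, because for any
% points x_1, ..., x_n and coefficients a_1, ..., a_n
\sum_{i,j} a_i a_j \, f(\mathbf{x}_i)\, k(\mathbf{x}_i, \mathbf{x}_j)\, f(\mathbf{x}_j)
  = \sum_{i,j} b_i b_j \, k(\mathbf{x}_i, \mathbf{x}_j) \ge 0,
  \qquad b_i = a_i f(\mathbf{x}_i).

% Cosine similarity: inner product with f(v) = 1 / \|v\|
k_{\cos}(\mathbf{v}, \mathbf{v}')
  = \frac{1}{\|\mathbf{v}\|} \, \mathbf{v}^{\top}\mathbf{v}' \, \frac{1}{\|\mathbf{v}'\|}

% LHN: inner product with f(v) = 1 / \|v\|^2
% (for a binary profile, \|v\|^2 is the number of interactions, i.e. the degree)
k_{\mathrm{LHN}}(\mathbf{v}, \mathbf{v}')
  = \frac{1}{\|\mathbf{v}\|^{2}} \, \mathbf{v}^{\top}\mathbf{v}' \, \frac{1}{\|\mathbf{v}'\|^{2}}
```

Since the inner product $\mathbf{v}^{\top}\mathbf{v}'$ is itself a positive definite kernel, both indicators inherit positive definiteness from the property in the first display.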
21. Pairwise Kernel
• Normally, a feature vector $\Phi(c, p)$ for a compound-protein pair $(c, p)$
is required for the ML scheme.
• In the PKM, $\Phi(c, p)$ is defined as the tensor product
of the feature maps of the compound, $\Phi_c(c)$, and the protein, $\Phi_p(p)$:
$\Phi(c, p) = \Phi_c(c) \otimes \Phi_p(p)$
• The pairwise kernel between two compound-protein pairs
$(c, p)$ and $(c', p')$ is defined as
$k((c, p), (c', p')) = k_c(c, c')\, k_p(p, p')$
Only the similarity matrices (kernels) are used; the feature vector of the pair, $\Phi(c, p)$, is not computed.
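A minimal sketch of this idea: the kernel value between two compound-protein pairs is obtained by looking up one entry of the compound kernel matrix and one entry of the protein kernel matrix and multiplying them, so no pair-level feature vector is ever built. The function names and the toy kernel matrices are illustrative assumptions.

```python
def pairwise_kernel(Kc, Kp, pair_a, pair_b):
    """Kernel between pairs (c, p) and (c', p'): Kc[c][c'] * Kp[p][p']."""
    (c_a, p_a), (c_b, p_b) = pair_a, pair_b
    return Kc[c_a][c_b] * Kp[p_a][p_b]


def pairwise_gram(Kc, Kp, pairs):
    """Gram matrix over a list of (compound index, protein index) pairs."""
    return [[pairwise_kernel(Kc, Kp, a, b) for b in pairs] for a in pairs]


# Toy 2x2 kernel matrices (made-up values for illustration only).
Kc = [[1.0, 0.5],
      [0.5, 1.0]]
Kp = [[1.0, 0.2],
      [0.2, 1.0]]

pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
G = pairwise_gram(Kc, Kp, pairs)
```

With all pairs enumerated in this order, `G` equals the Kronecker product of `Kc` and `Kp`; its side length is the number of pairs, which is why the learning cost grows with $n_c n_p$.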
22. Observed Distribution of Link Indicator Frequency
[Figure: observed frequency distributions of link indicator values, annotated as “moderate distribution” and “immoderate distribution”.]
23. Overall Prediction Accuracy
[Table: overall prediction accuracy for each CPI prediction method in 10-fold CV
tests. The AUPR and AUROC values are averaged over the 3 CVs and 4
datasets (total average of 12 AUPR/AUROC values).]
wk: multiple kernel weight
• Cosine similarity with wk = 0.5 showed the best performance.
• Compared with GIP, LIK showed higher accuracy overall.
24. Settings for Learning
• SVM
– Cost parameter C = {0.1, 1, 10, 100}
– Multiple kernel weight wk = {0.1, 0.3, 0.5, 1}
• Cross-validation (CV)
– 3 types: compound-wise, protein-wise, pairwise
– 10-fold CVs
– The random division of the dataset was repeated 5 times
– Evaluated by AUROC and AUPR
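The settings above define a 4 × 4 hyperparameter grid. The sketch below enumerates it and shows one plausible way the multiple kernel weight could enter: as a weighted average of a similarity kernel and an interaction-profile kernel. That combination rule, and the names `combine_kernels` and `hyperparameter_grid`, are assumptions for illustration; the transcript does not state how wk is applied.

```python
from itertools import product

C_VALUES = [0.1, 1, 10, 100]      # SVM cost parameter candidates
WK_VALUES = [0.1, 0.3, 0.5, 1]    # multiple kernel weight candidates


def combine_kernels(K_sim, K_lik, wk):
    """Assumed combination: (1 - wk) * K_sim + wk * K_lik, entry by entry."""
    return [[(1 - wk) * s + wk * l for s, l in zip(row_s, row_l)]
            for row_s, row_l in zip(K_sim, K_lik)]


def hyperparameter_grid():
    """All (C, wk) combinations to be evaluated by cross-validation."""
    return list(product(C_VALUES, WK_VALUES))


# Toy kernel matrices (made-up values for illustration only).
K_sim = [[1.0, 0.4], [0.4, 1.0]]
K_lik = [[1.0, 0.0], [0.0, 1.0]]
K_half = combine_kernels(K_sim, K_lik, 0.5)
```

Under this assumed rule, wk = 1 would use the interaction-profile kernel alone, and each of the 16 grid points would be scored with the three CV types.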
25. History of the CPI Prediction Methods
[Figure: timeline from 2008 to 2017; each method is marked as kernel-based or matrix factorization-based.]
KRM (Yamanishi et al., 2008)
PKM (Jacob & Vert, 2008)
BLM (Bleakley et al., 2008)
LapRLS (Xia et al., 2010)
GIP (van Laarhoven et al., 2011)
KBMF2K (Gonen et al., 2012)
WNNGIP (van Laarhoven et al., 2013)
BLMNII (Mei et al., 2013)
MSCMF (Zheng et al., 2013)
REMAP (Lim et al., 2016)
KronRLS-MKL (Nascimento et al., 2016)
NRLMF (Liu et al., 2016)
Editor's notes
(Note: first, extend the window frame to the left and upward. Set the cursor at the lower left to the laser pointer.)
Thank you for the introduction, Professor Gromiha.
My name is Masahito Ohue, from Tokyo Tech, Japan.
Today, I will talk about “Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a Chemogenomics Approach”.
This study is related to drug discovery.
The costs of drug discovery and development have been increasing recently.
It takes over 10 years and over 2 billion dollars.
Computational approaches that support drug discovery are expected to reduce these costs.
Examples include molecular docking, QSAR modeling, toxicity prediction, and compound-protein interaction prediction, which is the main topic of my talk.
Compound-protein interaction prediction, or CPI prediction, determines whether a query compound and protein interact or not by using machine learning techniques.
It is called the chemogenomics approach because it uses information on multiple compounds and multiple proteins.
An interaction is represented as 1, the positive label, and an unknown interaction is represented as 0, the negative label.
The CPIs are represented as the binary interaction matrix Y, where 1 means an interaction and 0 means an unknown interaction.
The feature vectors of compounds and proteins are calculated in several ways, for example from compound fingerprints, protein domain information, and amino acid sequence information.
These feature vectors and the interaction matrix are used in the machine learning scheme.
To predict compound-protein interactions, there is a basic concept: similar compounds or similar proteins have similar interactions.
Thus, machine learning approaches are mainly used for this problem.
There are two major approaches: one is the kernel-based machine learning method, and the other is the matrix factorization-based method.
In this talk, we focus on the kernel-based machine learning methods.
One of the excellent methods is the pairwise kernel method, PKM.
PKM uses the interaction matrix as training data, together with kernel matrices between compounds and between proteins.
Then, kernel-based learning using a support vector machine or another kernel method is performed, and a prediction model for compound-protein pairs is generated.
A key point is that PKM does not calculate the kernel of a compound-protein pair directly.
Instead, it is defined by multiplying the compound kernel and the protein kernel.
By doing this, calculations in the high-dimensional space are avoided.
The other one is the Gaussian interaction profile method, GIP.
GIP follows roughly the same scheme as PKM, except that it also uses the information in the interaction matrix.
The idea is that if compounds or proteins have similar interaction patterns, then they have similar interactions.
GIP treats the rows and columns of the interaction matrix as interaction profile vectors, and calculates interaction profile-based kernel matrices using Gaussian kernels.
GIP can make more accurate predictions than using only compound and protein similarities.
However, there is one problem: the Gaussian kernel treats a match between 0s and a match between 1s in the same way.
For example, the kernel function k_GIP takes its maximum value of 1 both for a pair of all-0 vectors and for a pair of all-1 vectors.
‘1’ means an experimentally determined interaction but ‘0’ means the interaction is unknown, so ‘1’ information should be considered more reliable than ‘0’.
An idea for solving this problem exists in the field of link mining, which is data mining on the World Wide Web, social networks, biological networks, and so on.
Link mining predicts the possibility of a new link in a network; that is, it focuses on ‘1’.
In this study, link indicators used in link mining were applied to the CPI bipartite network.
Link indicators are calculated from the interaction matrix.
In this study, 3 link indicators were used: the Jaccard index, cosine similarity, and LHN.
These link indicators are positive definite kernels, so they can be applied to kernel-based machine learning methods.
This slide shows our proposed method, called LIK.
It is almost the same as GIP, but the kernel matrices are calculated by applying link indicators to the interaction profiles.
The kernel function k_LIK obtains the maximum value of 1 for identical vectors.
If all-0 vectors are input, the kernel value is 0.
LIK makes ‘1’-information more important than ‘0’.
This is the whole picture of the proposed method and the conventional method.
We used LIKs, link indicator kernels, instead of GIP kernels.
To evaluate the prediction methods, we used the general benchmark dataset created by Yamanishi.
This dataset contains four CPI networks: nuclear receptor, GPCR, ion channel, and enzyme.
The similarity matrices were precomputed by Yamanishi and are available at this website.
We used GIP and three kinds of LIK methods in the computational experiment.
The prediction methods were evaluated by cross-validation.
Following the previous benchmarking work, we used compound-wise, protein-wise, and pairwise cross-validations.
Compound-wise CV verifies whether the interactions of a novel compound can be predicted.
Similarly, protein-wise CV verifies the interactions of a novel protein.
We used two accuracy measures, AUROC and AUPR.
However, CPIs have far fewer positives than negatives, so FPs should be weighted more heavily.
AUPR penalizes FPs more than AUROC does, so AUPR is more important in the CPI prediction problem.
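The difference between the two measures on imbalanced data can be illustrated with a small sketch. AUROC is computed here as the fraction of positive-negative pairs ranked correctly, and AUPR is approximated by average precision; both function names and the toy ranking are illustrative assumptions, and the code assumes at least one positive and one negative label.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive (AUPR estimate)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy ranking: 2 positives among 20 pairs, with one false positive ranked first.
labels = [0, 1, 1] + [0] * 17
scores = [20 - i for i in range(20)]   # strictly decreasing scores

roc = auroc(labels, scores)
ap = average_precision(labels, scores)
```

In this toy case a single false positive at the top leaves AUROC at 34/36 (about 0.94) but lowers the average precision to 7/12 (about 0.58), which illustrates why AUPR is the stricter measure when positives are rare.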
This figure shows the prediction accuracy results.
In AUROC, we can see that the LIKs are comparable to GIP.
On the other hand, in AUPR, the LIKs obtained better predictions than GIP.
In particular, LIK works well in compound-wise and protein-wise cross-validations.
These are the computational time results.
The computational complexity of PKM is of order nc cubed times np cubed, and that of the LIKs is of order nc np times (nc plus np).
Thus the total computational complexity is the same as that of PKM.
On the other hand, the actual calculation time depends on the size of the dataset.
But the difference from PKM remained small.
(If short on time → say “This slide is concluding remarks in my talk.” and move on.)
OK, let me summarize my talk.
We proposed the link indicator kernel, LIK, for compound-protein interaction prediction.
Our proposed method was compared with GIP: the calculation time was almost the same and the prediction accuracy was improved.
The different handling of 0 and 1 in link mining may be the reason for this success.
As for future work, hyperparameter search becomes a bottleneck in the CPI prediction problem.
It can be accelerated by applying a Bayesian optimization technique, which is our ongoing work.
It may be better to treat an unknown interaction as an unknown label.
Exploring the applicability of positive-unlabeled learning techniques is also interesting future work.
OK, that’s all. Thank you very much for your attention.