Machine learning methods to infer drug-target interaction network Yoshihiro Yamanishi Medical Institute of Bioregulation Kyushu University
Outline
- Background: drug-target interaction network; chemical, genomic, and pharmacological spaces
- Methods to predict drug-target interactions: chemogenomic approach; pharmacogenomic approach; algorithm
- Results
- Concluding remarks
Drug-target interaction
Identification of interactions between drugs and target proteins is crucial in drug development.
[Figure: a drug and its interactions with a primary target protein and an off-target protein; phenotypic effects (efficacy, side-effect) on the phenotypic scale]
Possible data sources
- Genomic space: target proteins, etc.
- Chemical space: drug chemical structures, etc.
- Pharmacological space: phenotypic effects, etc.
Objective
- Prediction of unknown drug-target interactions on a large scale from chemical, genomic, and pharmacological data
[Figure: bipartite graph linking numbered drugs and target proteins, showing known interactions and unknown interactions (to be predicted)]
Chemogenomic approach
[Figure: genomic space (target proteins, etc.), chemical space (drug chemical structures, etc.), and pharmacological space (phenotypic effects, etc.), with the links labeled chemogenomics and pharmacogenomics]
Examples of the data structures
- Drug: chemical graph structure
- Target protein: amino acid sequence, e.g.
  MAHAAQVGLQDATSPIMEELITFHDHALMIIFLICFLVLYALFLTLTTKLTNTNISDAQEMETVWTILPAIILVLIALPSLRILYMTDEVNDPSLTIKSIGHQWYWTYEYTDYGGLIFNSYMLPPLFLEPGDLRLLDVDNRVVLPIEAPIRMMITSQDVLHSWAVPTLGLKTDAIPGRLNQTTFTATRPGVYYGQCSEICGANHSFMPIVLELIPLKIFEMGPVFTL
Chemogenomic approach
Strategy: chemically similar drugs are predicted to interact with similar target proteins (Yamanishi et al., Bioinformatics, 2008; Faulon et al., Bioinformatics, 2008; Jacob and Vert, Bioinformatics, 2008; Yabuuchi et al., Mol Sys Bio, 2011).
- Chemical structure similarity for drugs is evaluated by a graph kernel (Mahe et al., J Chem Inf Model, 2005): $(K_x)_{ij} = k_x(x_i, x_j)$ for $i, j = 1, 2, \ldots, n_x$
- Genomic sequence similarity for target proteins is evaluated by a string kernel (Saigo et al., Bioinformatics, 2004): $(K_z)_{ij} = k_z(z_i, z_j)$ for $i, j = 1, 2, \ldots, n_z$
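To make the Gram matrix notation concrete, here is a minimal sketch of how a matrix $(K_z)_{ij} = k_z(z_i, z_j)$ could be assembled from sequences. It does not implement the string kernel of Saigo et al. cited above; a normalized k-mer spectrum kernel is swapped in purely as a simple stand-in, and the function names and toy sequences are hypothetical.

```python
from collections import Counter
from math import sqrt

def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(a, b, k=3):
    """Normalized k-mer spectrum kernel (a stand-in, not the cited string kernel)."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    dot = sum(ca[m] * cb[m] for m in ca if m in cb)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def gram_matrix(items, kernel):
    """(K)_ij = kernel(items[i], items[j])."""
    return [[kernel(a, b) for b in items] for a in items]

# Hypothetical toy sequences: the second differs from the first by one residue.
proteins = ["MAHAAQVGLQDATSPIM", "MAHAAQVGLQDATSPLM", "GGLIFNSYMLPPLFLEP"]
K_z = gram_matrix(proteins, spectrum_kernel)
```

The same `gram_matrix` helper would apply to drugs once a kernel on chemical graphs is available.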
Binary classification approach
- Classification of drug-target pairs into the interaction class or the non-interaction class
- Support Vector Machine (SVM) with pairwise kernels:
  Faulon et al., Bioinformatics, 24:225-233, 2008
  Jacob and Vert, Bioinformatics, 24:2149-2156, 2008
  Yabuuchi et al., Mol Sys Bio, 2011
SVM with pairwise kernels (pairwise SVM)
Ordinary SVM:
$$f(x') = \sum_{i=1}^{n} a_i \, k(x_i, x') + b$$
where $x$ is an object.
Pairwise SVM:
$$f(x', z') = \sum_{i=1}^{n_x n_z} a_i \, k((x, z)_i, (x', z')) + b$$
where $(x, z)$ is a compound-protein pair.
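The pairwise decision function can be sketched as follows. The slide does not say how the kernel on pairs is built; the product form $k((x,z),(x',z')) = k_x(x,x')\,k_z(z,z')$ used here is one common choice and is an assumption, as are the toy feature vectors, the linear kernels, and the fixed coefficients (which would normally come from SVM training).

```python
def pairwise_kernel(k_x, k_z):
    """Kernel on (compound, protein) pairs, assumed here to be the product k_x * k_z."""
    def k(pair_a, pair_b):
        (xa, za), (xb, zb) = pair_a, pair_b
        return k_x(xa, xb) * k_z(za, zb)
    return k

def decision_function(train_pairs, coeffs, bias, kernel, query_pair):
    """f(x', z') = sum_i a_i k((x, z)_i, (x', z')) + b."""
    return sum(a * kernel(p, query_pair) for a, p in zip(coeffs, train_pairs)) + bias

def linear(u, v):
    """Toy linear kernel on small feature vectors (illustration only)."""
    return sum(ui * vi for ui, vi in zip(u, v))

k_pair = pairwise_kernel(linear, linear)
train_pairs = [((1.0, 0.0), (0.0, 1.0)), ((0.0, 1.0), (1.0, 0.0))]
coeffs = [0.5, -0.5]  # hypothetical a_i, fixed by hand instead of being learned
bias = 0.1            # hypothetical b
score = decision_function(train_pairs, coeffs, bias, k_pair, ((1.0, 0.0), (0.0, 1.0)))
```

A positive score would assign the query pair to the interaction class, a negative one to the non-interaction class.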
Supervised bipartite graph inference
[Figure: chemical space, interaction space, and genomic space]
- Chemical space (known drugs): compounds with similar structures are close to each other
- Interaction space (known interactions): interacting drugs and targets are connected on the graph
- Genomic space (known targets): proteins with similar sequences are close to each other
Step 1: Embedding drugs and targets on the known graph into a unified feature space $\mathbb{R}^d$
[Figure: chemical space, interaction space, and genomic space]
- Chemical space (known drugs): compounds with similar structures are close to each other
- Interaction space (known interactions): interacting drugs and targets are close to each other
- Genomic space (known targets): proteins with similar sequences are close to each other
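Step 2: Learning a model between the chemical/genomic space and the interaction space
[Figure: chemical space, interaction space, and genomic space, with mappings $f_x$ (chemical space to interaction space) and $f_z$ (genomic space to interaction space)]
- Chemical space (known drugs): compounds with similar structures are close to each other
- Interaction space (known interactions): interacting drugs and targets are close to each other
- Genomic space (known targets): proteins with similar sequences are close to each other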
Step 3: Predicting unknown interactions involving new compounds/proteins after the projection
[Figure: chemical space, interaction space, and genomic space, with mappings $f_x$ and $f_z$]
- New compounds are mapped with $f_x$
- New proteins are mapped with $f_z$
- Predicted interaction: connect compound-protein pairs which are closer than a threshold
Mapping to a unified space
Let us consider two functions that map each compound $x$ and each protein $z$ onto a unified Euclidean space:
$$f_x(x) = (f_x^{(1)}(x), \ldots, f_x^{(d)}(x))^T \in \mathbb{R}^d, \qquad f_z(z) = (f_z^{(1)}(z), \ldots, f_z^{(d)}(z))^T \in \mathbb{R}^d$$
We find $f_x$ and $f_z$ which minimize
$$R(f_x, f_z) = \frac{\sum_{(x_i, z_j) \in E} (f_x(x_i) - f_z(z_j))^2 + \lambda_1 \|f_x\|^2 + \lambda_2 \|f_z\|^2}{\sum_{(x_i, z_j) \in V_x \times V_z} (f_x(x_i) - f_z(z_j))^2}$$
where $V_x$ (resp. $V_z$) is a set of drugs (resp. target proteins), $E$ is a set of interactions, and $\lambda_1$ and $\lambda_2$ are regularization parameters (Yamanishi, Adv Neural Inf Process Syst 21, 2009).
Extraction of multiple features
Successive features $f_x^{(q)}$ and $f_z^{(q)}$ ($q = 1, 2, \ldots, d$) are obtained by
$$\left(f_x^{(q)}, f_z^{(q)}\right) = \arg\min \frac{\sum_{(x_i, z_j) \in E} (f_x(x_i) - f_z(z_j))^2 + \lambda_1 \|f_x\|^2 + \lambda_2 \|f_z\|^2}{\sum_{(x_i, z_j) \in V_x \times V_z} (f_x(x_i) - f_z(z_j))^2}$$
under the following orthogonality constraints:
$$f_x \perp f_x^{(1)}, \ldots, f_x^{(q-1)}, \qquad f_z \perp f_z^{(1)}, \ldots, f_z^{(q-1)}$$
Prediction score for a given pair of compound $x'$ and protein $z'$:
$$g(x', z') = \sum_{q=1}^{d} f_x^{(q)}(x') \, f_z^{(q)}(z')$$
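Algorithm
By the representer theorem, features can be expanded as
$$f_x(x) = \sum_{j=1}^{n_x} \alpha_j \, k_x(x_j, x), \qquad f_z(z) = \sum_{j=1}^{n_z} \beta_j \, k_z(z_j, z)$$
Kernel Gram matrices:
$$(K_x)_{ij} = k_x(x_i, x_j), \; i, j = 1, 2, \ldots, n_x, \qquad (K_z)_{ij} = k_z(z_i, z_j), \; i, j = 1, 2, \ldots, n_z$$
Norms of features:
$$\|f_x\|^2 = \alpha^T K_x \alpha, \qquad \|f_z\|^2 = \beta^T K_z \beta$$
where $\alpha = (\alpha_1, \ldots, \alpha_{n_x})^T \in \mathbb{R}^{n_x}$ and $\beta = (\beta_1, \ldots, \beta_{n_z})^T \in \mathbb{R}^{n_z}$.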
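The solution can be obtained by finding $\alpha_q$ and $\beta_q$ which minimize
$$R(\alpha, \beta) = \frac{\begin{pmatrix} \alpha \\ \beta \end{pmatrix}^T \begin{pmatrix} K_x D_x K_x & -K_x A K_z \\ -K_z A^T K_x & K_z D_z K_z \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} + \lambda_1 \alpha^T K_x \alpha + \lambda_2 \beta^T K_z \beta}{\begin{pmatrix} \alpha \\ \beta \end{pmatrix}^T \begin{pmatrix} K_x^2 & 0 \\ 0 & K_z^2 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}}$$
under the following constraints:
$$\alpha^T K_x \alpha_1 = \cdots = \alpha^T K_x \alpha_{q-1} = 0, \qquad \beta^T K_z \beta_1 = \cdots = \beta^T K_z \beta_{q-1} = 0$$
where $D_x$ (resp. $D_z$) is the degree matrix of drugs (resp. target proteins) and $A$ is the adjacency matrix of drug-target interactions.
It is reduced to the generalized eigenvalue problem:
$$\begin{pmatrix} K_x D_x K_x + \lambda_1 K_x & -K_x A K_z \\ -K_z A^T K_x & K_z D_z K_z + \lambda_2 K_z \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \rho \begin{pmatrix} K_x^2 & 0 \\ 0 & K_z^2 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}$$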
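The generalized eigenvalue problem above can be sketched numerically with NumPy. This is a minimal illustration under stated assumptions, not the author's implementation: the function names, the toy Gram matrices, the small `jitter` term added to keep the right-hand matrix positive definite, and the default values of `lam1`, `lam2`, and `d` are all hypothetical. The sketch simply keeps the `d` generalized eigenvectors with the smallest eigenvalues and does not separately enforce the orthogonality constraints stated on the slide.

```python
import numpy as np

def fit_bipartite_embedding(K_x, K_z, A, lam1=0.1, lam2=0.1, d=2, jitter=1e-8):
    """Solve M v = rho B v and keep the d eigenvectors with the smallest rho."""
    n_x, n_z = A.shape
    D_x = np.diag(A.sum(axis=1))            # degree matrix of drugs
    D_z = np.diag(A.sum(axis=0))            # degree matrix of target proteins
    M = np.block([
        [K_x @ D_x @ K_x + lam1 * K_x, -K_x @ A @ K_z],
        [-K_z @ A.T @ K_x, K_z @ D_z @ K_z + lam2 * K_z],
    ])
    B = np.block([
        [K_x @ K_x, np.zeros((n_x, n_z))],
        [np.zeros((n_z, n_x)), K_z @ K_z],
    ]) + jitter * np.eye(n_x + n_z)         # assumption: jitter keeps B positive definite
    L = np.linalg.cholesky(B)               # B = L L^T
    L_inv = np.linalg.inv(L)
    C = L_inv @ M @ L_inv.T                 # equivalent standard symmetric problem
    rho, W = np.linalg.eigh((C + C.T) / 2)  # eigenvalues in ascending order
    V = L_inv.T @ W[:, :d]                  # back-transform to generalized eigenvectors
    return V[:n_x], V[n_x:], rho[:d]        # alpha (n_x x d), beta (n_z x d), rho

def prediction_scores(k_x_new, k_z_new, alpha, beta):
    """g(x', z') = sum_q f_x^(q)(x') f_z^(q)(z') for every new compound-protein pair."""
    F_x = k_x_new @ alpha                   # row i: features of new compound i
    F_z = k_z_new @ beta                    # row j: features of new protein j
    return F_x @ F_z.T

# Hypothetical toy data: 5 drugs, 4 targets, random positive definite Gram matrices.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Z = rng.normal(size=(4, 3))
K_x = X @ X.T + np.eye(5)
K_z = Z @ Z.T + np.eye(4)
A = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1]], dtype=float)
alpha, beta, rho = fit_bipartite_embedding(K_x, K_z, A)
scores = prediction_scores(K_x, K_z, alpha, beta)
```

Here `k_x_new` holds kernel values between new compounds (rows) and training drugs (columns), and likewise `k_z_new` for proteins; passing the training Gram matrices, as in the toy call, scores the training pairs themselves.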
Drug-target interaction data for human: gold standard data
Statistics (KEGG DRUG, December 2011):
- Number of drugs: 1874
- Number of target proteins: 436 (total in human genome: 23196)
- Number of drug-target interactions: 6769
Cross-validation (CV)
i) Pairwise CV (missing interaction detection)
ii) Blockwise CV I (new drug identification)
iii) Blockwise CV II (new target identification)
[Figure: drug × target protein interaction matrices for i), ii), and iii), each marking the training set and the test set (shown as "?")]
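The three schemes differ only in what is held out: individual matrix entries, whole drug rows, or whole target columns. A minimal sketch of the corresponding test-set splits follows; the function names, the number of folds, and the random seed are assumptions, since the slide does not state them.

```python
import random

def pairwise_cv_folds(n_drugs, n_targets, n_folds=5, seed=0):
    """Pairwise CV: individual (drug, target) entries are split at random."""
    pairs = [(i, j) for i in range(n_drugs) for j in range(n_targets)]
    random.Random(seed).shuffle(pairs)
    return [pairs[f::n_folds] for f in range(n_folds)]

def blockwise_cv_folds(n_drugs, n_targets, hold_out, n_folds=5, seed=0):
    """Blockwise CV: whole rows (hold_out='drug') or columns (hold_out='target') are held out."""
    n_held = n_drugs if hold_out == "drug" else n_targets
    ids = list(range(n_held))
    random.Random(seed).shuffle(ids)
    folds = []
    for f in range(n_folds):
        held = ids[f::n_folds]
        if hold_out == "drug":
            folds.append([(i, j) for i in held for j in range(n_targets)])
        else:
            folds.append([(i, j) for j in held for i in range(n_drugs)])
    return folds

# Each fold is the list of test pairs (drug index, target index); the rest is training data.
pair_folds = pairwise_cv_folds(6, 4, n_folds=3)
drug_folds = blockwise_cv_folds(6, 4, hold_out="drug", n_folds=3)
target_folds = blockwise_cv_folds(6, 4, hold_out="target", n_folds=3)
```

In the blockwise settings a held-out drug or target contributes no interactions to training, which is why they correspond to new drug and new target identification.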
Pharmacogenomic approach
[Figure: genomic space (target proteins, etc.), chemical space (drug chemical structures, etc.), and pharmacological space (phenotypic effects, etc.), with the links labeled chemogenomics and pharmacogenomics]
Pharmacogenomic approach
Strategy: phenotypically similar drugs are predicted to interact with similar target proteins (Campillos et al., Science, 2008; Yamanishi et al., Bioinformatics (ISMB2010), 2010; Atias et al., J Comp Bio (RECOMB2010), 2011).
Drug phenotypes can be represented by a profile of side-effects such as headache, hypertension, astriction, aortic stenosis, impotence, cardiac infarction, dyspnea, and many more.
Pharmacological similarity
Each drug is represented by a profile $y = (y_1, y_2, \ldots, y_S)^T$ in which the presence or absence of each of the 17109 side-effect terms in the JAPIC database is coded as 1 or 0, respectively ($S = 17109$).
Pharmacological similarity:
$$k_{phar}(y_i, y_j) = \frac{\sum_{k=1}^{S} w_k \, y_{ik} \, y_{jk}}{\sqrt{\sum_{k=1}^{S} w_k \, y_{ik}^2} \, \sqrt{\sum_{k=1}^{S} w_k \, y_{jk}^2}} \quad \text{for } i, j = 1, 2, \ldots, n_y$$
where $w_k = \exp(-d_k^2 / \sigma^2)$ and $d_k$ is the frequency of the $k$-th side-effect.
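The weighted similarity can be sketched in a few lines. The sketch assumes that $d_k$ is the number of drugs in the data set annotated with side-effect $k$ and that the denominator is the product of square roots (a weighted cosine); the profiles, the value of `sigma`, and the function names are hypothetical.

```python
from math import exp, sqrt

def side_effect_weights(profiles, sigma):
    """w_k = exp(-d_k^2 / sigma^2); d_k is assumed to be the count of drugs with term k."""
    n_terms = len(profiles[0])
    d = [sum(p[k] for p in profiles) for k in range(n_terms)]
    return [exp(-(dk ** 2) / sigma ** 2) for dk in d]

def k_phar(y_i, y_j, w):
    """Weighted cosine similarity between two binary side-effect profiles."""
    num = sum(wk * a * b for wk, a, b in zip(w, y_i, y_j))
    norm_i = sqrt(sum(wk * a * a for wk, a in zip(w, y_i)))
    norm_j = sqrt(sum(wk * b * b for wk, b in zip(w, y_j)))
    return num / (norm_i * norm_j) if norm_i and norm_j else 0.0

# Hypothetical profiles over four side-effect terms; the last term is shared by all drugs.
profiles = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
]
w = side_effect_weights(profiles, sigma=2.0)
sim_01 = k_phar(profiles[0], profiles[1], w)
sim_02 = k_phar(profiles[0], profiles[2], w)
```

Because frequent side-effects receive small weights, sharing only the ubiquitous last term (drugs 0 and 2) contributes little compared with also sharing a rarer term (drugs 0 and 1).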
Chemical similarity vs. pharmacological similarity
[Figure: scatter plot for drugs targeting enzymes; x-axis: chemical structure similarity (0.0 to 1.0), y-axis: pharmacological effect similarity (0.0 to 1.0)]
Chemical similarity vs. pharmacological similarity
[Figure: two panels, "Chemical similarity" (chemical structure similarity, 0.0 to 1.0) and "Pharmacological similarity" (pharmacological effect similarity, 0.0 to 1.0), each comparing interaction and non-interaction drug pairs]
Interaction: drug pairs share the same target.
Non-interaction: drug pairs do not share the same target.
Limitation of the pharmacogenomic approach
Problem: it is applicable only to marketed drugs for which side-effect information is available.
Proposed procedure:
1. Predict unknown pharmacological similarity from chemical structure similarity
2. Apply a bipartite graph inference method with pharmacological similarity for compounds and genomic sequence similarity for proteins
Step 1: Prediction of unknown pharmacological similarity from chemical structure similarity
Chemical similarity matrix:
$$C = \begin{pmatrix} C_{tt} & C_{pt}^T \\ C_{pt} & C_{pp} \end{pmatrix}$$
Pharmacological similarity matrix:
$$P = \begin{pmatrix} P_{tt} & ? \\ ? & ? \end{pmatrix}$$
where $t$: drugs with side-effect information, $p$: drugs with no side-effect information.
A variant of a regression model is used to predict the missing parts:
$$k_y(y, y') = f(k_x(x, x')) + \varepsilon = u(x)^T u(x') + \varepsilon$$
where $u(x) = (u^{(1)}(x), \ldots, u^{(m)}(x))^T$ are the underlying features.
Step 2: Bipartite graph inference to predict unknown interactions
A predictive model is learned based on
- pharmacological similarity for drugs
- genomic sequence similarity for target proteins
Prediction score for a given pair of compound $y'$ and protein $z'$:
$$g(y', z') = \sum_{q=1}^{d} f_y^{(q)}(y') \, f_z^{(q)}(z')$$
Comprehensive prediction of unknown drug-target interactions
- Test drugs: all compounds in KEGG LIGAND and all drugs in KEGG DRUG
- Test target proteins: all human proteins in KEGG GENES
- All gold standard interaction data are used in the training
Chemogenomic approach: 140 out of the top 1000 predictions (14.0%) were confirmed in the literature.
Pharmacogenomic approach: 223 out of the top 1000 predictions (22.3%) were confirmed in the literature.
Conclusion
- Drug-target interactions are more correlated with pharmacological similarity than with chemical structure similarity
- The proposed method can predict unknown drug-target interactions on a large scale, with no need for 3D structure information of the target proteins and with chemical, genomic, and pharmacological data used in an integrated framework
Machine learning in chemoinformatics Lodhi, H. and Yamanishi, Y., Chemoinformatics and Advanced Machine Learning Perspectives: Complex Computational Methods and Collaborative Techniques, IGI Global, 2010.
Acknowledgements
- Curie Institute, Inserm U900, Mines ParisTech: Jean-Philippe Vert, Véronique Stoven, Kevin Bleakley, Edouard Pauwels
- Kyoto University: Minoru Kanehisa, Susumu Goto, Michihiro Araki, Masaaki Kotera