Graph Kernels in Chemoinformatics


1 Graph Kernels in Chemoinformatics. Luc Brun. Co-authors: Benoit Gaüzère (MIVIA), Pierre-Anthony Grenier (GREYC), Alessia Saggese (MIVIA), Didier Villemin (LCMT). Bordeaux, November.

2 Plan: Cyclic similarity; Conclusion.

3 Plan: Cyclic similarity; Conclusion.

4 Motivation. Structural pattern recognition: rich description of objects, but the space of graphs has poor properties and does not allow to readily generalize or combine sets of graphs. Statistical pattern recognition: global description of objects, numerical spaces with many mathematical properties (metric, vector space, ...). Motivation: analyse large families of structural and numerical objects using a unified framework based on pairwise similarity.

5 Chemoinformatics. Uses of computer science for chemistry: prediction of molecular properties; prediction of biological activities (toxicity, carcinogenicity, ...); prediction of physico-chemical properties (boiling point, LogP, ability to capture CO2, ...). Screening: synthesis of molecules, then test and validation of the synthesized molecules. Virtual screening: selection of a limited set of candidate molecules, then synthesis and test of the selected molecules. Gain of time and money.

6 Molecular representation. Molecular graph: vertices encode atoms, edges encode atomic bonds. (Figure: example molecular graphs with Cl, CH3 and P atom labels.)

7 Molecular representation. Molecular graph: vertices encode atoms, edges encode atomic bonds. Similarity principle [Johnson and Maggiora, 1990]: similar molecules have similar properties. Hence the design of similarity measures between molecular graphs.
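
A minimal sketch of the molecular-graph encoding described above, assuming a networkx representation (not the authors' code): atoms become labeled vertices and bonds become labeled edges. The methanol example and all identifiers are hypothetical.

```python
# Minimal sketch (assumption: networkx-based encoding, illustrative only):
# a molecular graph with atom symbols on vertices and bond orders on edges.
import networkx as nx

def methanol() -> nx.Graph:
    """Build the molecular graph of methanol (CH3-OH), hydrogens kept explicit."""
    g = nx.Graph()
    atoms = {0: "C", 1: "O", 2: "H", 3: "H", 4: "H", 5: "H"}
    for idx, symbol in atoms.items():
        g.add_node(idx, label=symbol)            # vertex = atom
    for u, v, order in [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 1), (1, 5, 1)]:
        g.add_edge(u, v, label=order)            # edge = bond, labeled by its order
    return g

if __name__ == "__main__":
    g = methanol()
    print(g.number_of_nodes(), "atoms,", g.number_of_edges(), "bonds")
```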

8 Plan: Cyclic similarity; Conclusion.

9 Kernel Theory. A kernel is a function k : X × X → ℝ such that: Symmetric: k(x_i, x_j) = k(x_j, x_i). For any dataset D = {x_1, ..., x_n}, the Gram matrix K ∈ ℝ^(n×n) is defined by K_{i,j} = k(x_i, x_j). Positive semi-definite: ∀ α ∈ ℝ^n, α^T K α ≥ 0.

10 Kernel Theory. A kernel is a function k : X × X → ℝ such that: Symmetric: k(x_i, x_j) = k(x_j, x_i). For any dataset D = {x_1, ..., x_n}, the Gram matrix K ∈ ℝ^(n×n) is defined by K_{i,j} = k(x_i, x_j). Positive semi-definite: ∀ α ∈ ℝ^n, α^T K α ≥ 0. Connection with Hilbert spaces, reproducing kernel Hilbert spaces [Aronszajn, 1950]: relationship between scalar product and kernel, k(x, x′) = ⟨Φ(x), Φ(x′)⟩_H.
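
To make the definition above concrete, here is a small illustrative check (an assumption-laden sketch, not part of the talk): build a Gram matrix from any kernel function and verify symmetry and positive semi-definiteness numerically.

```python
# Minimal sketch: Gram matrix construction and check of the two kernel properties.
import numpy as np

def gram_matrix(kernel, data):
    """K[i, j] = k(x_i, x_j) for every pair of points of the dataset."""
    n = len(data)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(data[i], data[j])
    return K

def is_valid_kernel_matrix(K, tol=1e-10):
    symmetric = np.allclose(K, K.T, atol=tol)
    # PSD <=> every eigenvalue of the symmetrized matrix is >= 0 (up to tolerance).
    psd = np.all(np.linalg.eigvalsh((K + K.T) / 2) >= -tol)
    return symmetric and psd

if __name__ == "__main__":
    rbf = lambda x, y, sigma=1.0: np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
    X = [np.random.randn(3) for _ in range(5)]
    print(is_valid_kernel_matrix(gram_matrix(rbf, X)))   # expected: True
```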

11 Graph kernels in chemoinformatics. Graph kernel k : G × G → ℝ: similarity measure between molecular graphs; natural encoding of a molecule; implicit embedding in H; machine learning methods (SVM, KRR, ...).

12 Graph kernels in chemoinformatics. Graph kernel k : G × G → ℝ: similarity measure between molecular graphs; natural encoding of a molecule; implicit embedding in H; machine learning methods (SVM, KRR, ...). Natural connection between molecular graphs and machine learning methods through the definition of a similarity measure as a kernel.

13 Kernels based on bags of patterns. Graph kernels based on bags of patterns: (1) extraction of a set of patterns from each graph, (2) comparison between patterns, (3) comparison between bags of patterns.
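
The three steps above can be summarized by a generic sketch (hypothetical names, assumed toy pattern extractor; the actual patterns used in the talk are treelets, defined next):

```python
# Minimal sketch of a bag-of-patterns graph kernel:
# (1) extract patterns, (2) compare patterns via their counts, (3) sum over the shared bag.
from collections import Counter

def pattern_bag(graph, extract_patterns):
    """Step (1): map a graph to a multiset {pattern code -> number of occurrences}."""
    return Counter(extract_patterns(graph))

def bag_kernel(bag_g, bag_h, k_counts=lambda a, b: a * b):
    """Steps (2)-(3): sum a sub-kernel on occurrence counts over the shared patterns."""
    return sum(k_counts(bag_g[p], bag_h[p]) for p in bag_g.keys() & bag_h.keys())

if __name__ == "__main__":
    # Toy extractor (assumption): vertex labels used as 1-node patterns.
    extract = lambda graph: [label for _, label in graph]   # graph = list of (id, label)
    g = [(0, "C"), (1, "O"), (2, "C")]
    h = [(0, "C"), (1, "N")]
    print(bag_kernel(pattern_bag(g, extract), pattern_bag(h, extract)))   # 2 * 1 = 2
```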

14 Plan: Cyclic similarity; Conclusion.

15 Treelets. Set of labeled subtrees with at most 6 nodes: encode branching points; take labels into account; pertinent from a chemical point of view; limited set of substructures; linear complexity (in O(n d^5)).

16 Enumeration. Labelling: a code encoding the labels, defined on each of the 14 treelet structures, computed in constant time. (Figure: example treelets with C and CH3 labels.)

17 Enumeration. Labelling: a code encoding the labels, defined on each of the 14 treelet structures, computed in constant time. Canonical code: the type of the treelet together with its labels yields a canonical code such that t ≅ t′ ⇔ code(t) = code(t′).
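
As an illustration of the canonical code idea, here is a sketch restricted to linear treelets (labeled paths); the real construction covers all 14 treelet structures, and every name below is hypothetical:

```python
# Minimal sketch (assumption: linear treelets only). The canonical code pairs a
# structure identifier with a canonical ordering of the labels, so that two
# isomorphic treelets receive exactly the same code.
def path_treelet_code(node_labels, edge_labels):
    """Code of a labeled path: structure id + lexicographically smaller reading direction."""
    assert len(node_labels) == len(edge_labels) + 1
    def interleave(nodes, edges):
        parts = []
        for i, n in enumerate(nodes):
            parts.append(str(n))
            if i < len(edges):
                parts.append(str(edges[i]))
        return "-".join(parts)
    forward = interleave(node_labels, edge_labels)
    backward = interleave(node_labels[::-1], edge_labels[::-1])
    structure_id = f"P{len(node_labels)}"            # e.g. P3 = path on 3 vertices
    return structure_id + ":" + min(forward, backward)

if __name__ == "__main__":
    # The same path read in either direction yields a single canonical code.
    print(path_treelet_code(["C", "O", "C"], [1, 1]))   # P3:C-1-O-1-C
    print(path_treelet_code(["C", "C", "O"], [1, 1]))   # different treelet, different code
```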

18 Definition of the Treelet kernel. Counting treelet occurrences: f_t(G) is the number of occurrences of treelet t in G.

19 Definition of the Treelet kernel. Counting treelet occurrences: f_t(G) is the number of occurrences of treelet t in G. Treelet kernel: k_T(G, G′) = Σ_{t ∈ T(G) ∩ T(G′)} k(f_t(G), f_t(G′)), where T(G) is the set of treelets of G and k(·,·) is a kernel between real numbers.

20 Definition of the Treelet kernel. Counting treelet occurrences: f_t(G) is the number of occurrences of treelet t in G. Treelet kernel: k_T(G, G′) = Σ_{t ∈ T(G) ∩ T(G′)} k_t(G, G′), where T(G) is the set of treelets of G, k_t(G, G′) = k(f_t(G), f_t(G′)), and k_t(·,·) is the similarity according to t.
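
A sketch of the kernel just defined, assuming the treelet distributions f_t(G) are already available as dictionaries of occurrence counts and using a Gaussian sub-kernel between counts (one common choice; the slides do not fix k here):

```python
# Minimal sketch: k_T(G, G') = sum over shared treelets t of k(f_t(G), f_t(G')).
import math
from collections import Counter

def treelet_kernel(counts_g: Counter, counts_h: Counter, sigma: float = 1.0) -> float:
    """counts_g / counts_h: canonical treelet code -> number of occurrences f_t(G)."""
    k = lambda a, b: math.exp(-((a - b) ** 2) / (2 * sigma ** 2))   # sub-kernel on counts
    shared = counts_g.keys() & counts_h.keys()                      # t in T(G) ∩ T(G')
    return sum(k(counts_g[t], counts_h[t]) for t in shared)

if __name__ == "__main__":
    # Hypothetical treelet distributions of two small molecules.
    G = Counter({"P2:C-1-C": 3, "P3:C-1-C-1-O": 1})
    H = Counter({"P2:C-1-C": 2, "P2:C-1-N": 1})
    print(treelet_kernel(G, H))   # only the shared treelet "P2:C-1-C" contributes
```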

21 Pattern selection. Prediction problems: different properties, different causes.

22 Pattern selection. Prediction problems: different properties, different causes. Chemist approach: patterns relevant for a precise property; a priori selection by a chemical expert.

23 Pattern selection. Prediction problems: different properties, different causes. Chemist approach: patterns relevant for a precise property; a priori selection by a chemical expert. Machine learning approach: parsimonious set of relevant treelets; a posteriori selection induced by the data.

24 Multiple kernel learning. Combination of kernels [Rakotomamonjy et al., 2008]: k(G, G′) = Σ_{t ∈ T} k_t(G, G′).

25 Multiple kernel learning. Optimal combination of kernels [Rakotomamonjy et al., 2008]: k_W(G, G′) = Σ_{t ∈ T} w_t k_t(G, G′), where T is the set of treelets extracted from the learning dataset and w_t ∈ [0, 1] measures the influence of t. Weighting of each sub-kernel; selection of relevant treelets; weights computed for a given property.
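
A sketch of the weighted combination itself, assuming the per-treelet Gram matrices are already computed; learning the weights w_t (e.g. with SimpleMKL [Rakotomamonjy et al., 2008]) is not shown, only how the combined Gram matrix is assembled:

```python
# Minimal sketch: K_W = sum_t w_t K_t, a convex combination of sub-kernel Gram matrices.
import numpy as np

def combined_gram(gram_matrices, weights):
    """gram_matrices: list of (n, n) Gram matrices K_t; weights: the w_t in [0, 1]."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                    # normalize (assumption)
    K = np.zeros_like(gram_matrices[0], dtype=float)
    for K_t, w_t in zip(gram_matrices, weights):
        K += w_t * K_t
    return K

if __name__ == "__main__":
    K1 = np.array([[1.0, 0.2], [0.2, 1.0]])              # sub-kernel of a treelet t1
    K2 = np.array([[1.0, 0.8], [0.8, 1.0]])              # sub-kernel of a treelet t2
    print(combined_gram([K1, K2], weights=[0.25, 0.75]))
```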

26 Experiments. Prediction of the boiling point of acyclic molecules: 183 acyclic molecules; learning set: 80 %, validation set: 10 %, test set: 10 %, over 10 splits; selection of optimal parameters using a grid search; RMSE obtained on the 10 test sets.
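
A sketch of this protocol for one split, assuming a precomputed Gram matrix and kernel ridge regression as the predictor (one possible choice; the data, indices and alpha grid below are hypothetical):

```python
# Minimal sketch: grid search on the validation set, RMSE reported on the test set.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error

def evaluate_split(K, y, train_idx, val_idx, test_idx, alphas=(1e-3, 1e-2, 1e-1, 1.0)):
    """K: full (n, n) precomputed Gram matrix; y: property values (e.g. boiling points)."""
    def fit_predict(alpha, fit_idx, pred_idx):
        model = KernelRidge(alpha=alpha, kernel="precomputed")
        model.fit(K[np.ix_(fit_idx, fit_idx)], y[fit_idx])
        return model.predict(K[np.ix_(pred_idx, fit_idx)])
    # Grid search of the regularization parameter on the validation set.
    best_alpha = min(alphas, key=lambda a: mean_squared_error(
        y[val_idx], fit_predict(a, train_idx, val_idx)))
    pred = fit_predict(best_alpha, train_idx, test_idx)
    return mean_squared_error(y[test_idx], pred) ** 0.5   # RMSE on the test set

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 4))                # stand-in for treelet count vectors
    K, y = X @ X.T, rng.normal(size=30)         # toy linear Gram matrix and targets
    print(evaluate_split(K, y, np.arange(24), np.arange(24, 27), np.arange(27, 30)))
```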

27 Experiments. Prediction of the boiling point of acyclic molecules, RMSE per method:
Descriptor based approach [Cherqaoui et al., 1994]*: 5.10
Empirical embedding [Riesen, 2009]: 10.15
Gaussian kernel based on the graph edit distance [Neuhaus and Bunke, 2007]: 10.27
Kernel on paths [Ralaivola et al., 2005]: 12.24
Kernel on random walks [Kashima et al., 2003]: 18.72
Kernel on tree patterns [Mahé and Vert, 2009]: 11.02
Weisfeiler-Lehman kernel [Shervashidze, 2012]: 14.98
Treelet kernel: 6.45
Treelet kernel with MKL: 4.22
(* leave-one-out)

28 Plan: Cyclic similarity; Conclusion.

29 Cyclic molecular information. Cycles of a molecule: have a large influence on molecular properties; identify some molecular families. (Figure: (a) benzene, (b) indole, (c) cyclohexane.) Cyclic information must be taken into account.

30 Relevant cycles. Enumeration of all cycles is NP-complete. The cycles of a graph form a vector space over ℤ/2ℤ.

31 Relevant cycles. Enumeration of relevant cycles [Vismara, 1995]: union of all cycle bases of minimal size; unique for each graph; enumeration in polynomial time; kernel on relevant cycles [Horváth, 2005].
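
For illustration only (an assumption, not the algorithm of [Vismara, 1995]): networkx exposes minimum_cycle_basis, which returns one minimum cycle basis rather than the relevant cycles (the union of all minimum bases), but it is enough to sketch how a molecule's cyclic system can be enumerated:

```python
# Minimal sketch: one minimum cycle basis of a fused-ring skeleton.
import networkx as nx

def cycle_system(g: nx.Graph):
    """Return a minimum cycle basis of the molecular graph, as lists of vertices."""
    return nx.minimum_cycle_basis(g)

if __name__ == "__main__":
    # Naphthalene-like skeleton: two fused 6-cycles sharing the edge (0, 5).
    g = nx.Graph()
    g.add_edges_from([(i, (i + 1) % 6) for i in range(6)])       # first ring 0..5
    g.add_edges_from([(5, 6), (6, 7), (7, 8), (8, 9), (9, 0)])   # fused second ring
    for cycle in cycle_system(g):
        print(sorted(cycle))                                     # two cycles of size 6
```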

32 Relevant cycle graph. Graph of relevant cycles G_C(G) [Vismara, 1995]: representation of the cyclic system; vertices: relevant cycles; edges: intersection between relevant cycles.

33 Relevant cycle graph. Graph of relevant cycles G_C(G) [Vismara, 1995]: representation of the cyclic system; vertices: relevant cycles; edges: intersection between relevant cycles.

34 Relevant cycle graph. Graph of relevant cycles G_C(G) [Vismara, 1995]: representation of the cyclic system; vertices: relevant cycles; edges: intersection between relevant cycles.
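
A sketch of the construction just described, under the assumption that the relevant cycles are already available as vertex sets: each cycle becomes a vertex of the cycle graph, and two cycles are linked when they intersect.

```python
# Minimal sketch of a relevant-cycle-graph construction (illustrative, not [Vismara, 1995]'s code).
from itertools import combinations
import networkx as nx

def cycle_graph(cycles):
    """cycles: list of vertex sets, one per (relevant) cycle of the molecule."""
    cg = nx.Graph()
    for i, c in enumerate(cycles):
        cg.add_node(i, label=("cycle", len(c)))          # e.g. a cycle labeled by its size
    for i, j in combinations(range(len(cycles)), 2):
        if set(cycles[i]) & set(cycles[j]):              # the two cycles intersect
            cg.add_edge(i, j)
    return cg

if __name__ == "__main__":
    cycles = [{0, 1, 2, 3, 4, 5}, {0, 5, 6, 7, 8, 9}]    # two fused rings
    cg = cycle_graph(cycles)
    print(cg.nodes(data=True), list(cg.edges()))         # two vertices, one edge
```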

35 Relevant cycle graph. Similarity between relevant cycle graphs: enumeration of the relevant cycle graph's treelets; adjacency relationships between relevant cycles. Extracted patterns encode solely cyclic information.

36 Relevant cycle hypergraph. Connection between cycles and acyclic parts. Idea: a molecular representation that is global and carries explicit cyclic information; addition of the acyclic parts to the relevant cycle graph.

37 Relevant cycle hypergraph. Connection between cycles and acyclic parts. Idea: a molecular representation that is global and carries explicit cyclic information; addition of the acyclic parts to the relevant cycle graph.

38 Relevant cycle hypergraph. Problem: an atomic bond can connect an atom to two cycles.

39 Relevant cycle hypergraph. Connection between cycles and acyclic parts: hypergraph representation of the molecule; adjacency relationships encoded by oriented hyper-edges.

40 Relevant cycle hypergraph. Connection between cycles and acyclic parts: hypergraph representation of the molecule; adjacency relationships encoded by oriented hyper-edges. Hypergraph comparison.

41 Treelets on relevant cycle hypergraphs. Adaptation of the treelet kernel: (1) enumeration of the relevant cycle hypergraph's treelets without hyper-edges.

42 Treelets on relevant cycle hypergraphs. Adaptation of the treelet kernel: (1) enumeration of the relevant cycle hypergraph's treelets without hyper-edges.

43 Treelets on relevant cycle hypergraphs. Adaptation of the treelet kernel: (1) enumeration of the relevant cycle hypergraph's treelets without hyper-edges.

44 Treelets on relevant cycle hypergraphs. Adaptation of the treelet kernel: (1) enumeration of the relevant cycle hypergraph's treelets without hyper-edges. (2) Contraction of the cycles incident to each hyper-edge: construction of the reduced relevant cycle hypergraph G_R(G).

45 Treelets on relevant cycle hypergraphs. Adaptation of the treelet kernel: (1) enumeration of the relevant cycle hypergraph's treelets without hyper-edges. (2) Contraction of the cycles incident to each hyper-edge: construction of the reduced relevant cycle hypergraph G_R(G). (3) Extraction of treelets containing an initial hyper-edge.

46 Kernel on relevant cycle hypergraphs. Bag of patterns: T_R(G) = T_1 ∪ T_2, where T_1 is the set of treelets obtained from the hypergraph without hyper-edges and T_2 is the set of treelets extracted from G_R(G).

47 Kernel on relevant cycle hypergraphs. Bag of patterns: T_R(G) = T_1 ∪ T_2, where T_1 is the set of treelets obtained from the hypergraph without hyper-edges and T_2 is the set of treelets extracted from G_R(G). Kernel design: k_T(G, G′) = Σ_{t ∈ T_R(G) ∩ T_R(G′)} w_t k(f_t(G), f_t(G′)).
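
A sketch of this combined kernel, assuming each molecule is summarized by its two bags of treelet counts (T_1 from the hypergraph without hyper-edges, T_2 from G_R(G)); the weights default to 1 here and would normally be learned by MKL, and all pattern codes are hypothetical:

```python
# Minimal sketch: weighted sum over the union bag T_R(G) = T_1 ∪ T_2 of treelet counts.
import math
from collections import Counter

def hypergraph_treelet_kernel(bags_g, bags_h, weights=None, sigma=1.0):
    """bags_g / bags_h: list of Counters [T_1 counts, T_2 counts] for each molecule."""
    counts_g = sum(bags_g, Counter())                   # merge the two bags of patterns
    counts_h = sum(bags_h, Counter())
    weights = weights or {}
    k = lambda a, b: math.exp(-((a - b) ** 2) / (2 * sigma ** 2))
    shared = counts_g.keys() & counts_h.keys()
    return sum(weights.get(t, 1.0) * k(counts_g[t], counts_h[t]) for t in shared)

if __name__ == "__main__":
    # Hypothetical bags: plain treelets (T_1) and cycle-level treelets (T_2).
    g = [Counter({"P2:C-1-C": 4}), Counter({"cycle6-cycle6": 1})]
    h = [Counter({"P2:C-1-C": 3}), Counter({"cycle6-cycle5": 1})]
    print(hypergraph_treelet_kernel(g, h))
```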

48 Experiments. Predictive Toxicity Challenge: labelled molecular graphs composed of cycles; 340 molecules for each dataset; 4 classes of animals (MM, FM, MR, FR); learning set: about 310 molecules, test set: 34 molecules, over 10 splits; number of molecules correctly classified on the 10 test sets. Methods compared: Treelet kernel (TK); relevant cycle kernel; TK on the relevant cycle graph (TC); TK on the relevant cycle hypergraph (TCH).

49 Experiments. Predictive Toxicity Challenge, methods compared on the MM, FM, MR and FR classes: Treelet kernel (TK); relevant cycle kernel; TK on the relevant cycle graph (TC); TK on the relevant cycle hypergraph (TCH); TK + MKL; TC + MKL; TCH + MKL.

50 Experiments. Methods compared on the MM, FM, MR and FR classes: Treelet kernel (TK); relevant cycle kernel; TK on the relevant cycle graph (TC); TK on the relevant cycle hypergraph (TCH); TK + MKL; TCH + MKL; combination TK - TCH; Gaussian kernel on the graph edit distance [Neuhaus and Bunke, 2007]; Laplacian kernel; kernel on paths [Ralaivola et al., 2005]*; Weisfeiler-Lehman kernel [Shervashidze, 2012]. (* leave-one-out)

51 Plan: Cyclic similarity; Conclusion.

52 Conclusion. General problem: a kernel between molecular graphs including cyclic and acyclic patterns in order to predict molecular properties. Beyond cyclic information: encoding the relative positioning of atoms; taking into account stereoisomerism. (Figure: two stereoisomers with substituents R1, R2, R3.)

53 Plan: Cyclic similarity; Conclusion.

54 Kernel between shapes. Compute skeletons; encode skeletons by graphs; kernel between bags of treelets. (Figure: (a) the graph, (b) the induced segmentation.)

55 Kernel on strings. Decompose the scene into zones; compute a string encoding the sequence of traversed zones; define a kernel between strings; cluster the strings.

56 Bibliography I
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3).
Cherqaoui, D., Villemin, D., Mesbah, A., Cense, J. M., and Kvasnicka, V. (1994). Use of a Neural Network to Determine the Normal Boiling Points of Acyclic Ethers, Peroxides, Acetals and their Sulfur Analogues. J. Chem. Soc. Faraday Trans., 90.
Horváth, T. (2005). Cyclic pattern kernels revisited. In Proceedings of the 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining, volume 3518. Springer-Verlag.

57 Bibliography II
Johnson, M. A. and Maggiora, G. M., editors (1990). Concepts and Applications of Molecular Similarity. Wiley.
Kashima, H., Tsuda, K., and Inokuchi, A. (2003). Marginalized Kernels Between Labeled Graphs. Machine Learning.
Mahé, P. and Vert, J.-P. (2009). Graph kernels based on tree patterns for molecules. Machine Learning, 75(1):3-35.
Neuhaus, M. and Bunke, H. (2007). Bridging the Gap Between Graph Edit Distance and Kernel Machines, volume 68 of Series in Machine Perception and Artificial Intelligence. World Scientific Publishing.

58 Bibliography III
Rakotomamonjy, A., Bach, F., Canu, S., and Grandvalet, Y. (2008). SimpleMKL. Journal of Machine Learning Research, 9.
Ralaivola, L., Swamidass, S. J., Saigo, H., and Baldi, P. (2005). Graph kernels for chemical informatics. Neural Networks, 18(8).
Riesen, K. (2009). Classification and Clustering of Vector Space Embedded Graphs. PhD thesis, Institut für Informatik und angewandte Mathematik, Universität Bern.
Shervashidze, N. (2012). Scalable Graph Kernels. PhD thesis, Universität Tübingen.

59 Bibliography IV
Vismara, P. (1995). Reconnaissance et représentation d'éléments structuraux pour la description d'objets complexes. Application à l'élaboration de stratégies de synthèse en chimie organique. PhD thesis, Université Montpellier II.
