Data Mining in Bioinformatics Day 9: Graph Mining in Chemoinformatics

Size: px

Start display at page:

Download "Data Mining in Bioinformatics Day 9: Graph Mining in Chemoinformatics"

Dortha Phillips
5 years ago
Views:

1 Data Mining in Bioinformatics Day 9: Graph Mining in hemoinformatics hloé-agathe Azencott & Karsten Borgwardt February 18 to March 1, 2013 Machine Learning & omputational Biology Research Group Max Planck Institutes Tübingen and Eberhard Karls Universität Tübingen hloé-agathe Azencott: Data Mining in Bioinformatics, Page 1

2 Drug discovery Modern therapeutic research From serendipity to rationalized drug design Ancient Greeks treat infections with mould NH2 NH S HO H3 O N H3 O O HO Biapenem in PBP-1A hloé-agathe Azencott: Data Mining in Bioinformatics, Page 2

3 Drug discovery process 1. Find a target 2. Identify hits 3.Hit-to-lead: characterize hits 4. Lead optimization and synthesis 5. Assay Protein that we want to inhibit so as to interfer with a biological process ompounds likely to bind to the target an they be drugs? (ADME-Tox) - bioactivity - pharmacokinetics - synthetic pathway - in vitro - in vivo - clinical hloé-agathe Azencott: Data Mining in Bioinformatics, Page 3

4 Drug discovery process 52 months 90 months 1. Find a target 2. Identify hits 3.Hit-to-lead: characterize hits 4. Lead optimization and synthesis 5. Assay hloé-agathe Azencott: Data Mining in Bioinformatics, Page 4

5 Drug discovery process 52 months 90 months 1. Find a target 2. Identify hits 3.Hit-to-lead: characterize hits 4. Lead optimization and synthesis 5. Assay $500,000,000 to $2,000,000,000 hloé-agathe Azencott: Data Mining in Bioinformatics, Page 5

6 hemoinformatics How can computer science help? hemoinformatics!...the mixing of information resources to transform data into information, and information into knowledge, for the intended purpose of making better decisions faster in the arena of drug lead identification and optimisation. F. K. Brown... the application of informatics methods to solve chemical problems. J. Gasteiger and T. Engel hloé-agathe Azencott: Data Mining in Bioinformatics, Page 6

7 hemoinformatics hemoinformatics 1. Find a target 2. Identify hits 3.Hit-to-lead: characterize hits 4. Lead optimization and synthesis 5. Assay hloé-agathe Azencott: Data Mining in Bioinformatics, Page 7

hemoinformatics The chemical space 10 60 possible small organic molecules 10 22 stars in the observable

8 hemoinformatics The chemical space possible small organic molecules stars in the observable universe (Slide courtesy of Matthew A. Kayala) hloé-agathe Azencott: Data Mining in Bioinformatics, Page 8

9 Drug discovery process 1. Find a target 2. Identify hits 3.Hit-to-lead: characterize hits 4. Lead optimization and synthesis 5. Assay QSAR QSPR QSAR: Qualitative Structure-Activity Relationship i.e. classification QSPR: Quantititive Structure-Property Relationship i.e. regression hloé-agathe Azencott: Data Mining in Bioinformatics, Page 9

10 Representing chemicals in silico Expert knowledge molecular descriptors hard, potentially incomplete Molecules are... NH 2 HO O NH O N S H 3 H 3 HO O hloé-agathe Azencott: Data Mining in Bioinformatics, Page 10

11 Representing chemicals in silico Similar Property Principle Molecules having similar structures should exhibit similar activities. Structure-based representations ompare molecules by comparing substructures hloé-agathe Azencott: Data Mining in Bioinformatics, Page 11

12 Molecular graph O O d O d N O d N S O N Undirected labeled graph hloé-agathe Azencott: Data Mining in Bioinformatics, Page 12

13 Fingerprints Define feature vectors that record the presence/absence (or number of occurrences) of particular patterns in a given molecular graph where φ(a) = (φ s (A)) s substructure φ s (A) = { 1 if s occurs in A 0 otherwise Extension of traditional chemical fingerprints hloé-agathe Azencott: Data Mining in Bioinformatics, Page 13

14 Fingerprints Learning from fingerprints lassical machine learning and data mining techniques can be applied to these vectorial feature representations. Any distance / kernel can be used lassification Feature selection lustering hloé-agathe Azencott: Data Mining in Bioinformatics, Page 14

15 Fingerprints Fingerprints compression Systematic enumeration long, sparse vectors e.g. 50, 000 random compounds from hemdb 300, 000 paths of length up to non-zeros on average Naive ompression List the positions of the 1s 2 19 = 524, 288 average encoding: = 5, 700 bits hloé-agathe Azencott: Data Mining in Bioinformatics, Page 15

16 Fingerprints Fingerprints compression Modulo ompression (lossy) Elias-Gamma Monotone Encoding (lossless) [Baldi et al., 2007] index j log(j) 0 bits + binary encoding of j j i < j i+1 : log(j i+1 ) log(j i ) log(j i+1 ) average compressed size = 1, 800 bits hloé-agathe Azencott: Data Mining in Bioinformatics, Page 16

17 Frequent patterns fingerprints MOLFEA [Helma et al., 2004] P = positive (mutagenic) compounds N = negative compounds features: fragments (= patterns) f such that both freq(f, P) t and freq(f, N) t Limited to frequent linear patterns ML algorithm: SVM with linear or quadratic kernel hloé-agathe Azencott: Data Mining in Bioinformatics, Page 17

18 Frequent patterns fingerprints MOLFEA [Helma et al., 2004] PDB arcinogenic Potency DataBase 684 compounds classified in 341 mutagens and 343 nonmutagens according to Ames test on Salmonella 100 Mutagenicity prediction [Hema04] Linear kernel Quadratic kernel 90 ross-validated sensitivity % 3% 5% 10% Frequency threshold hloé-agathe Azencott: Data Mining in Bioinformatics, Page 18

19 Spectrum kernels φ(a) = (φ s (A)) s S K spectrum (A, A ) = k(φ(a), φ(a )) k R R (S) R (S) can be Dot product (linear kernel) RBF kernel Tanimoto kernel: k(a, B) = A B MinMax kernel: N i=1 min(a i,b i ) N i=1 max(a i,b i ) A B hloé-agathe Azencott: Data Mining in Bioinformatics, Page 19

20 Spectrum kernels Tanimoto and MinMax Both Tanimoto and Minmax are kernels. Proof for Tanimoto: J.. Gower A general coefficient of similarity and some of its properties. Biometrics Proof for MinMax: φ(x), φ(y) MinMax(x, y) = φ(x), φ(x) + φ(y), φ(y) φ(x), φ(y) with φ(x) of length: # patterns max count φ(x) i = 1 iff. the pattern indexed by i/q appears more than i mod q times in x hloé-agathe Azencott: Data Mining in Bioinformatics, Page 20

21 All patterns fingerprints Paths fingerprints Labeled sub-paths (walks) O O O d O d d ssdo N N NsssS S O N Some sub-paths of length 3 hloé-agathe Azencott: Data Mining in Bioinformatics, Page 21

22 All patterns fingerprints ircular fingerprints Labeled sub-trees - Extended-onnectivity (or ircular) features O O O d d N O d N S O N {s{sn s} sn{s} ss{s}} Example of a circular substructure of depth 2 hloé-agathe Azencott: Data Mining in Bioinformatics, Page 22

23 All patterns fingerprints 2D spectrum kernels [Azencott et al., 2007] Systematically extract paths / circular fingerprints, for various maximal depths SVM with Tanimoto / Minmax hloé-agathe Azencott: Data Mining in Bioinformatics, Page 23

24 All patterns fingerprints 2D spectrum kernels [Azencott et al., 2007] Mutagenicity (Mutag): 188 compounds Benzodiazepine receptor affinity (BZR): compounds yclooxygenase-2 ihibitors (OX2): compounds Estrogen receptor affinity (ER): compounds Data SVM Previous best Mutag 90.4% 85.2% (gboost) BZR 79.8% 76.4% OX2 70.1% 73.6% ER 82.1% 79.8% hloé-agathe Azencott: Data Mining in Bioinformatics, Page 24

25 Informative patterns Extract informative patterns while learning All patterns + sparsity regularization gboost hloé-agathe Azencott: Data Mining in Bioinformatics, Page 25

26 gboost [Saigo et al., 2009] Train data: {(G n, y n )} n=1...l Stump or hypothesis: h(x : t, ω) = ω(2x t 1) x t = 1 if x t G and 0 otherwise Decision function: f(x) = t T,ω { 1,+1} Equivalent to solving (LP-Boost) min λ,γ γ α tω h(x : t, ω) such that ln=1 λ ny n h(x n : t, ω) γ t, ω l n=1 λ n = 1 and 0 λ n D n hloé-agathe Azencott: Data Mining in Bioinformatics, Page 26

27 gboost [Saigo et al., 2009] Solve by column generation start with H = and λ n = 1/l n Iteratively: find (t, ω ) that maximizes add (t, ω ) to H update λ n, γ g(t, ω) = l λ n y n h(x n ; t, ω) n=1 Stop when (t, w ) such that g(t, w ) > γ + ɛ hloé-agathe Azencott: Data Mining in Bioinformatics, Page 27

28 gboost [Saigo et al., 2009] Finding (t, w ): DFS code tree Pruning condition (g optimal gain so far): if max 2 n:y n =1,t G n λ n l y n λ n, 2 n=1 then t : t t, w, g > g(t, w ) n:y n = 1,t G n λ n + l y n λ n < g n=1 hloé-agathe Azencott: Data Mining in Bioinformatics, Page 28

29 gboost [Saigo et al., 2009] Application to PDB Accuracy similar to Helma et al. (79%) Most discriminative patterns hloé-agathe Azencott: Data Mining in Bioinformatics, Page 29

30 Weisfeiler-Lehman kernel [Shervashidze et al., 2011] Goal: scalability ompute a sequence that captures topological and label information of graphs in a runtime linear in the number of edges sub-tree kernel hloé-agathe Azencott: Data Mining in Bioinformatics, Page 30

31 Weisfeiler-Lehman kernel [Shervashidze et al., 2011] hloé-agathe Azencott: Data Mining in Bioinformatics, Page 31

32 onvolution kernels a.k.a. decomposition kernels (x 1,..., x D ) is a tuple of parts of x, with x d X for each part d = 1,..., D k d R X d X d : a Mercer kernel K decomposition (x, x ) = x 1 x 2...x D =x x 1 x 2 x D =x k 1 (x 1, x 1)k 2 (x 2, x 2)... k D (x D, x D) Spectrum kernels are a particular case of convolution kernels hloé-agathe Azencott: Data Mining in Bioinformatics, Page 32

33 onvolution kernels Weighted Decomposition Kernel [Menchetti et al., 2005] Match atoms and weigh them according to a kernel between subgraphs that include these atoms K W DK (x, x ) = (a,σ D r (x)) (a,σ D r (x )) δ(a, a )K c (σ, σ ) r > 0 N D r (x): decompositions of the molecular graph of x in an atom a and a subpath σ of x including a and of depth at most r hloé-agathe Azencott: Data Mining in Bioinformatics, Page 33

34 onvolution kernels Weighted Decomposition Kernel [Menchetti et al., 2005] K c : contextual kernel, here: histogram intersection kernel K c (σ, σ ) = l L min(f σ (l), f σ (l)) L: possible labels for edges and vertices f σ (l): frequency of label l subgraph σ. hloé-agathe Azencott: Data Mining in Bioinformatics, Page 34

35 Optimal assignment kernels Try to best map x and x Not necessarily a kernel in practice: K K λ min I hloé-agathe Azencott: Data Mining in Bioinformatics, Page 35

36 Optimal assignment kernels The Local Atom Pair kernel [Hinselmann et al., 2010] M: pairwise intramolecular matrix of inter-atomic topological distances Local atom environment: l(i) = {(L(i), M ij, L(j)), j i} κ(i, j): dot product, Tanimoto or MinMax between l(i) and l(j) hloé-agathe Azencott: Data Mining in Bioinformatics, Page 36

37 Optimal assignment kernels LAP: performance [Hinselmann et al., 2010] hloé-agathe Azencott: Data Mining in Bioinformatics, Page 37

38 Introducing spatial information 3D Histograms [Azencott et al., 2007] Groups of k atoms Associated size: Pairwise distances (k = 2) diameter of the smallest sphere that contains all k atoms hloé-agathe Azencott: Data Mining in Bioinformatics, Page 38

39 Introducing spatial information 3D Histograms [Azencott et al., 2007] One histogram per class of k-tuple (e.g. --, --O) O O O O N N S O 6.3 N Frequency of N-O N-O distance (A) hloé-agathe Azencott: Data Mining in Bioinformatics, Page 39

40 Introducing spatial information 3D Histograms: performance [Azencott et al., 2007] Data 2D kernel Hist3D kernel Mutag 90.4% 88.8% BZR (loo) 82.0% 79.4% ER (loo) 87.0% 86.1% OX2 76.9% 78.6% hloé-agathe Azencott: Data Mining in Bioinformatics, Page 40

41 Introducing spatial information 3D Decomposition Kernels [eroni et al., 2007] Remember: K W DK (x, x ) = (a,σ D r (x)) (a,σ D r (x )) δ(a, a )K c (σ, σ ) K 3DDK (x, x ) = σ S r (x) σ S r (x ) K s(σ, σ ) S r (x): subgraphs of x composed of r distinct vertices K s (σ, σ ) = r(r 1)/2 i=1 δ(e i, e i )e γ(l i l i ) l i = length of edge e i in x (e 1, e 2,..., e r(r 1)/2 lexicographically ordered; γ R hloé-agathe Azencott: Data Mining in Bioinformatics, Page 41

42 Introducing spatial information 3DDK: Performance [eroni et al., 2007] Data 2D kernel Hist3D kernel 3DDK irc3ddk Mutag 90.4% 88.8% 86.7% 83.5% BZR (loo) 82.0% 79.4% 78.4% 81.4% ER (loo) 87.0% 86.1% 82.3% 82.1% OX2 76.9% 78.6% 75.6% 75.2% hloé-agathe Azencott: Data Mining in Bioinformatics, Page 42

43 Introducing spatial information The pharmacophore kernel [Mahé et al., 2006] pharmacophore p P(x): p = [(x 1, l 1 ), (x 2, l 2 ), (x 3, l 3 )] x i 3D coordinates of atom i of x; l i = label of atom i K(x, x ) = p P(x) p P(x ) K P (p, p ) K P (p, p ) = K dist (d 1, d 1)K dist (d 2, d 2)K dist (d 3, d 3)K feat (l 1, l 1)K feat (l 2, l 2)K feat (l 3, l 3) K dist : RBF Gaussian K dist (d, d ) = exp K feat : Dirac ( d d 2 2σ 2 ) hloé-agathe Azencott: Data Mining in Bioinformatics, Page 43

44 Introducing spatial information 3D LAP kernel [Hinselmann et al., 2010] M: pairwise intramolecular matrix of inter-atomic geometric distances hloé-agathe Azencott: Data Mining in Bioinformatics, Page 44

45 Introducing spatial information onclusion How relevant is 3D information? How good is 3D information? hloé-agathe Azencott: Data Mining in Bioinformatics, Page 45

46 Drug discovery process 1. Find a target 2. Identify hits 3.Hit-to-lead: characterize hits 4. Lead optimization and synthesis 5. Assay Docking Virtual High-Throughput Screening hloé-agathe Azencott: Data Mining in Bioinformatics, Page 46

47 High-throughput screening Assay a large library of potential drugs against their target Very costly docking virtual high-throughput screening (vhts) hloé-agathe Azencott: Data Mining in Bioinformatics, Page 47

48 Measuring performance Imbalanced data Typically, most compounds are inactive many more negative than positive examples E.g. DHFR data set: 99, 995 chemicals screened for activity against dihydrofolate reductase; < 0.2% active compounds Accuracy is not appropriate: predicting all compounds negative accuracy = 99.8% sensitivity= # True Positives # Positives # True Negatives specificity= # Negatives For many methods, the output is continuous accuracy, sensitivity and specificity depend on a threshold θ hloé-agathe Azencott: Data Mining in Bioinformatics, Page 48

49 Measuring performance Receiver-Operator haracteristic urves True Positive Rate For all possible values of θ, report sensitivity and 1 specificity AURO (Area under the RO urve) is a numerical measure of peformance AURO(random) = 0.5 and AURO(optimal) = 1 0 1/4 2/4 3/ x x x x 0.9 x 0.95 x 0.94 x random perfect Inf x real 0 1/6 1/3 1/2 2/3 5/6 1 False Positive Rate x x x label prediction hloé-agathe Azencott: Data Mining in Bioinformatics, Page 49

50 Measuring performance Inhibition of DHFR: RO urves [Azencott et al., 2007] method AU IRV 0.71 SVM 0.59 knn 0.59 MAX-SIM 0.54 TPR RANDOM IRV SVM MAXSIM FPR hloé-agathe Azencott: Data Mining in Bioinformatics, Page 50

51 Measuring performance Precision-recall curves Precision = # True Positives # Predicted Positives Recall = sensitivity Precision 0 1/5 2/5 3/5 4/ x 0.94 x 0.9 x 0 1/4 2/4 3/4 1 Recall 0.81 x 0.73 x 0.52 x 0.2 x perfect real 0.17 x x hloé-agathe Azencott: Data Mining in Bioinformatics, Page 51

52 Other applications Other applications of graph mining in chemoinformatics Database indexing and search Prediction of 3D structures of small compounds and proteins Reaction Prediction hloé-agathe Azencott: Data Mining in Bioinformatics, Page 52

53 References and further reading [Azencott et al., 2007] Azencott,.-A., Ksikes, A., Swamidass, S. J., hen, J. H., Ralaivola, L. and Baldi, P. (2007). One-to fourdimensional kernels for virtual screening and the prediction of physical, chemical, and biological properties. Journal of chemical information and modeling 47, , 24, 38, 39, 40, 50 [Baldi et al., 2007] Baldi, P., Benz, R. W., Hirschberg, D. S. and Swamidass, S. J. (2007). Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval. Journal of chemical information and modeling 47, [eroni et al., 2007] eroni, A., osta, F. and Frasconi, P. (2007). lassification of small molecules by two-and three-dimensional decomposition kernels. Bioinformatics 23, , 42 [Helma et al., 2004] Helma,., ramer, T., Kramer, S. and De Raedt, L. (2004). Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. Journal of chemical information and computer sciences 44, , 18 [Hinselmann et al., 2010] Hinselmann, G., Fechner, N., Jahn, A., Eckert, M. and Zell, A. (2010). Graph kernels for chemical compounds using topological and three-dimensional local atom pair environments. Neurocomputing 74, , 37, 44 [Mahé et al., 2006] Mahé, P., Ralaivola, L., Stoven, V. and Vert, J.-P. (2006). The pharmacophore kernel for virtual screening with support vector machines. Journal of chemical information and modeling 46, [Menchetti et al., 2005] Menchetti, S., osta, F. and Frasconi, P. (2005). Weighted Decomposition Kernels. In Proceedings of the 22nd International onference on Machine Learning pp , AM, Bonn, Germany. 33, 34 [Saigo et al., 2009] Saigo, H., Nowozin, S., Kadowaki, T., Kudo, T. and Tsuda, K. (2009). gboost: a mathematical programming approach to graph classification and regression. Machine Learning 75, , 27, 28, 29 [Shervashidze et al., 2011] Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K. and Borgwardt, K. M. (2011). Weisfeiler- Lehman graph kernels. Journal of Machine Learning Research 12, , 31 hloé-agathe Azencott: Data Mining in Bioinformatics, Page 53

54 The end Tomorrow: Projects Presentations By 9:45 AM on Friday, March 1, 2013, please submit the following by to Prof. Borgwardt: A short report on your project, that gives your answers to the questions in Section 2 (You can ignore Section 1 here) in your exercise sheet The code that you wrote as part of your project Your presentation slides as a PDF. hloé-agathe Azencott: Data Mining in Bioinformatics, Page 54

Machine learning for ligand-based virtual screening and chemogenomics!

Machine learning for ligand-based virtual screening and chemogenomics! Jean-Philippe Vert Institut Curie - INSERM U900 - Mines ParisTech In silico discovery of molecular probes and drug-like compounds: