Data Mining in Bioinformatics Day 9: Graph Mining in Chemoinformatics
|
|
- Dortha Phillips
- 5 years ago
- Views:
Transcription
1 Data Mining in Bioinformatics Day 9: Graph Mining in hemoinformatics hloé-agathe Azencott & Karsten Borgwardt February 18 to March 1, 2013 Machine Learning & omputational Biology Research Group Max Planck Institutes Tübingen and Eberhard Karls Universität Tübingen hloé-agathe Azencott: Data Mining in Bioinformatics, Page 1
2 Drug discovery Modern therapeutic research From serendipity to rationalized drug design Ancient Greeks treat infections with mould NH2 NH S HO H3 O N H3 O O HO Biapenem in PBP-1A hloé-agathe Azencott: Data Mining in Bioinformatics, Page 2
3 Drug discovery process 1. Find a target 2. Identify hits 3.Hit-to-lead: characterize hits 4. Lead optimization and synthesis 5. Assay Protein that we want to inhibit so as to interfer with a biological process ompounds likely to bind to the target an they be drugs? (ADME-Tox) - bioactivity - pharmacokinetics - synthetic pathway - in vitro - in vivo - clinical hloé-agathe Azencott: Data Mining in Bioinformatics, Page 3
4 Drug discovery process 52 months 90 months 1. Find a target 2. Identify hits 3.Hit-to-lead: characterize hits 4. Lead optimization and synthesis 5. Assay hloé-agathe Azencott: Data Mining in Bioinformatics, Page 4
5 Drug discovery process 52 months 90 months 1. Find a target 2. Identify hits 3.Hit-to-lead: characterize hits 4. Lead optimization and synthesis 5. Assay $500,000,000 to $2,000,000,000 hloé-agathe Azencott: Data Mining in Bioinformatics, Page 5
6 hemoinformatics How can computer science help? hemoinformatics!...the mixing of information resources to transform data into information, and information into knowledge, for the intended purpose of making better decisions faster in the arena of drug lead identification and optimisation. F. K. Brown... the application of informatics methods to solve chemical problems. J. Gasteiger and T. Engel hloé-agathe Azencott: Data Mining in Bioinformatics, Page 6
7 hemoinformatics hemoinformatics 1. Find a target 2. Identify hits 3.Hit-to-lead: characterize hits 4. Lead optimization and synthesis 5. Assay hloé-agathe Azencott: Data Mining in Bioinformatics, Page 7
8 hemoinformatics The chemical space possible small organic molecules stars in the observable universe (Slide courtesy of Matthew A. Kayala) hloé-agathe Azencott: Data Mining in Bioinformatics, Page 8
9 Drug discovery process 1. Find a target 2. Identify hits 3.Hit-to-lead: characterize hits 4. Lead optimization and synthesis 5. Assay QSAR QSPR QSAR: Qualitative Structure-Activity Relationship i.e. classification QSPR: Quantititive Structure-Property Relationship i.e. regression hloé-agathe Azencott: Data Mining in Bioinformatics, Page 9
10 Representing chemicals in silico Expert knowledge molecular descriptors hard, potentially incomplete Molecules are... NH 2 HO O NH O N S H 3 H 3 HO O hloé-agathe Azencott: Data Mining in Bioinformatics, Page 10
11 Representing chemicals in silico Similar Property Principle Molecules having similar structures should exhibit similar activities. Structure-based representations ompare molecules by comparing substructures hloé-agathe Azencott: Data Mining in Bioinformatics, Page 11
12 Molecular graph O O d O d N O d N S O N Undirected labeled graph hloé-agathe Azencott: Data Mining in Bioinformatics, Page 12
13 Fingerprints Define feature vectors that record the presence/absence (or number of occurrences) of particular patterns in a given molecular graph where φ(a) = (φ s (A)) s substructure φ s (A) = { 1 if s occurs in A 0 otherwise Extension of traditional chemical fingerprints hloé-agathe Azencott: Data Mining in Bioinformatics, Page 13
14 Fingerprints Learning from fingerprints lassical machine learning and data mining techniques can be applied to these vectorial feature representations. Any distance / kernel can be used lassification Feature selection lustering hloé-agathe Azencott: Data Mining in Bioinformatics, Page 14
15 Fingerprints Fingerprints compression Systematic enumeration long, sparse vectors e.g. 50, 000 random compounds from hemdb 300, 000 paths of length up to non-zeros on average Naive ompression List the positions of the 1s 2 19 = 524, 288 average encoding: = 5, 700 bits hloé-agathe Azencott: Data Mining in Bioinformatics, Page 15
16 Fingerprints Fingerprints compression Modulo ompression (lossy) Elias-Gamma Monotone Encoding (lossless) [Baldi et al., 2007] index j log(j) 0 bits + binary encoding of j j i < j i+1 : log(j i+1 ) log(j i ) log(j i+1 ) average compressed size = 1, 800 bits hloé-agathe Azencott: Data Mining in Bioinformatics, Page 16
17 Frequent patterns fingerprints MOLFEA [Helma et al., 2004] P = positive (mutagenic) compounds N = negative compounds features: fragments (= patterns) f such that both freq(f, P) t and freq(f, N) t Limited to frequent linear patterns ML algorithm: SVM with linear or quadratic kernel hloé-agathe Azencott: Data Mining in Bioinformatics, Page 17
18 Frequent patterns fingerprints MOLFEA [Helma et al., 2004] PDB arcinogenic Potency DataBase 684 compounds classified in 341 mutagens and 343 nonmutagens according to Ames test on Salmonella 100 Mutagenicity prediction [Hema04] Linear kernel Quadratic kernel 90 ross-validated sensitivity % 3% 5% 10% Frequency threshold hloé-agathe Azencott: Data Mining in Bioinformatics, Page 18
19 Spectrum kernels φ(a) = (φ s (A)) s S K spectrum (A, A ) = k(φ(a), φ(a )) k R R (S) R (S) can be Dot product (linear kernel) RBF kernel Tanimoto kernel: k(a, B) = A B MinMax kernel: N i=1 min(a i,b i ) N i=1 max(a i,b i ) A B hloé-agathe Azencott: Data Mining in Bioinformatics, Page 19
20 Spectrum kernels Tanimoto and MinMax Both Tanimoto and Minmax are kernels. Proof for Tanimoto: J.. Gower A general coefficient of similarity and some of its properties. Biometrics Proof for MinMax: φ(x), φ(y) MinMax(x, y) = φ(x), φ(x) + φ(y), φ(y) φ(x), φ(y) with φ(x) of length: # patterns max count φ(x) i = 1 iff. the pattern indexed by i/q appears more than i mod q times in x hloé-agathe Azencott: Data Mining in Bioinformatics, Page 20
21 All patterns fingerprints Paths fingerprints Labeled sub-paths (walks) O O O d O d d ssdo N N NsssS S O N Some sub-paths of length 3 hloé-agathe Azencott: Data Mining in Bioinformatics, Page 21
22 All patterns fingerprints ircular fingerprints Labeled sub-trees - Extended-onnectivity (or ircular) features O O O d d N O d N S O N {s{sn s} sn{s} ss{s}} Example of a circular substructure of depth 2 hloé-agathe Azencott: Data Mining in Bioinformatics, Page 22
23 All patterns fingerprints 2D spectrum kernels [Azencott et al., 2007] Systematically extract paths / circular fingerprints, for various maximal depths SVM with Tanimoto / Minmax hloé-agathe Azencott: Data Mining in Bioinformatics, Page 23
24 All patterns fingerprints 2D spectrum kernels [Azencott et al., 2007] Mutagenicity (Mutag): 188 compounds Benzodiazepine receptor affinity (BZR): compounds yclooxygenase-2 ihibitors (OX2): compounds Estrogen receptor affinity (ER): compounds Data SVM Previous best Mutag 90.4% 85.2% (gboost) BZR 79.8% 76.4% OX2 70.1% 73.6% ER 82.1% 79.8% hloé-agathe Azencott: Data Mining in Bioinformatics, Page 24
25 Informative patterns Extract informative patterns while learning All patterns + sparsity regularization gboost hloé-agathe Azencott: Data Mining in Bioinformatics, Page 25
26 gboost [Saigo et al., 2009] Train data: {(G n, y n )} n=1...l Stump or hypothesis: h(x : t, ω) = ω(2x t 1) x t = 1 if x t G and 0 otherwise Decision function: f(x) = t T,ω { 1,+1} Equivalent to solving (LP-Boost) min λ,γ γ α tω h(x : t, ω) such that ln=1 λ ny n h(x n : t, ω) γ t, ω l n=1 λ n = 1 and 0 λ n D n hloé-agathe Azencott: Data Mining in Bioinformatics, Page 26
27 gboost [Saigo et al., 2009] Solve by column generation start with H = and λ n = 1/l n Iteratively: find (t, ω ) that maximizes add (t, ω ) to H update λ n, γ g(t, ω) = l λ n y n h(x n ; t, ω) n=1 Stop when (t, w ) such that g(t, w ) > γ + ɛ hloé-agathe Azencott: Data Mining in Bioinformatics, Page 27
28 gboost [Saigo et al., 2009] Finding (t, w ): DFS code tree Pruning condition (g optimal gain so far): if max 2 n:y n =1,t G n λ n l y n λ n, 2 n=1 then t : t t, w, g > g(t, w ) n:y n = 1,t G n λ n + l y n λ n < g n=1 hloé-agathe Azencott: Data Mining in Bioinformatics, Page 28
29 gboost [Saigo et al., 2009] Application to PDB Accuracy similar to Helma et al. (79%) Most discriminative patterns hloé-agathe Azencott: Data Mining in Bioinformatics, Page 29
30 Weisfeiler-Lehman kernel [Shervashidze et al., 2011] Goal: scalability ompute a sequence that captures topological and label information of graphs in a runtime linear in the number of edges sub-tree kernel hloé-agathe Azencott: Data Mining in Bioinformatics, Page 30
31 Weisfeiler-Lehman kernel [Shervashidze et al., 2011] hloé-agathe Azencott: Data Mining in Bioinformatics, Page 31
32 onvolution kernels a.k.a. decomposition kernels (x 1,..., x D ) is a tuple of parts of x, with x d X for each part d = 1,..., D k d R X d X d : a Mercer kernel K decomposition (x, x ) = x 1 x 2...x D =x x 1 x 2 x D =x k 1 (x 1, x 1)k 2 (x 2, x 2)... k D (x D, x D) Spectrum kernels are a particular case of convolution kernels hloé-agathe Azencott: Data Mining in Bioinformatics, Page 32
33 onvolution kernels Weighted Decomposition Kernel [Menchetti et al., 2005] Match atoms and weigh them according to a kernel between subgraphs that include these atoms K W DK (x, x ) = (a,σ D r (x)) (a,σ D r (x )) δ(a, a )K c (σ, σ ) r > 0 N D r (x): decompositions of the molecular graph of x in an atom a and a subpath σ of x including a and of depth at most r hloé-agathe Azencott: Data Mining in Bioinformatics, Page 33
34 onvolution kernels Weighted Decomposition Kernel [Menchetti et al., 2005] K c : contextual kernel, here: histogram intersection kernel K c (σ, σ ) = l L min(f σ (l), f σ (l)) L: possible labels for edges and vertices f σ (l): frequency of label l subgraph σ. hloé-agathe Azencott: Data Mining in Bioinformatics, Page 34
35 Optimal assignment kernels Try to best map x and x Not necessarily a kernel in practice: K K λ min I hloé-agathe Azencott: Data Mining in Bioinformatics, Page 35
36 Optimal assignment kernels The Local Atom Pair kernel [Hinselmann et al., 2010] M: pairwise intramolecular matrix of inter-atomic topological distances Local atom environment: l(i) = {(L(i), M ij, L(j)), j i} κ(i, j): dot product, Tanimoto or MinMax between l(i) and l(j) hloé-agathe Azencott: Data Mining in Bioinformatics, Page 36
37 Optimal assignment kernels LAP: performance [Hinselmann et al., 2010] hloé-agathe Azencott: Data Mining in Bioinformatics, Page 37
38 Introducing spatial information 3D Histograms [Azencott et al., 2007] Groups of k atoms Associated size: Pairwise distances (k = 2) diameter of the smallest sphere that contains all k atoms hloé-agathe Azencott: Data Mining in Bioinformatics, Page 38
39 Introducing spatial information 3D Histograms [Azencott et al., 2007] One histogram per class of k-tuple (e.g. --, --O) O O O O N N S O 6.3 N Frequency of N-O N-O distance (A) hloé-agathe Azencott: Data Mining in Bioinformatics, Page 39
40 Introducing spatial information 3D Histograms: performance [Azencott et al., 2007] Data 2D kernel Hist3D kernel Mutag 90.4% 88.8% BZR (loo) 82.0% 79.4% ER (loo) 87.0% 86.1% OX2 76.9% 78.6% hloé-agathe Azencott: Data Mining in Bioinformatics, Page 40
41 Introducing spatial information 3D Decomposition Kernels [eroni et al., 2007] Remember: K W DK (x, x ) = (a,σ D r (x)) (a,σ D r (x )) δ(a, a )K c (σ, σ ) K 3DDK (x, x ) = σ S r (x) σ S r (x ) K s(σ, σ ) S r (x): subgraphs of x composed of r distinct vertices K s (σ, σ ) = r(r 1)/2 i=1 δ(e i, e i )e γ(l i l i ) l i = length of edge e i in x (e 1, e 2,..., e r(r 1)/2 lexicographically ordered; γ R hloé-agathe Azencott: Data Mining in Bioinformatics, Page 41
42 Introducing spatial information 3DDK: Performance [eroni et al., 2007] Data 2D kernel Hist3D kernel 3DDK irc3ddk Mutag 90.4% 88.8% 86.7% 83.5% BZR (loo) 82.0% 79.4% 78.4% 81.4% ER (loo) 87.0% 86.1% 82.3% 82.1% OX2 76.9% 78.6% 75.6% 75.2% hloé-agathe Azencott: Data Mining in Bioinformatics, Page 42
43 Introducing spatial information The pharmacophore kernel [Mahé et al., 2006] pharmacophore p P(x): p = [(x 1, l 1 ), (x 2, l 2 ), (x 3, l 3 )] x i 3D coordinates of atom i of x; l i = label of atom i K(x, x ) = p P(x) p P(x ) K P (p, p ) K P (p, p ) = K dist (d 1, d 1)K dist (d 2, d 2)K dist (d 3, d 3)K feat (l 1, l 1)K feat (l 2, l 2)K feat (l 3, l 3) K dist : RBF Gaussian K dist (d, d ) = exp K feat : Dirac ( d d 2 2σ 2 ) hloé-agathe Azencott: Data Mining in Bioinformatics, Page 43
44 Introducing spatial information 3D LAP kernel [Hinselmann et al., 2010] M: pairwise intramolecular matrix of inter-atomic geometric distances hloé-agathe Azencott: Data Mining in Bioinformatics, Page 44
45 Introducing spatial information onclusion How relevant is 3D information? How good is 3D information? hloé-agathe Azencott: Data Mining in Bioinformatics, Page 45
46 Drug discovery process 1. Find a target 2. Identify hits 3.Hit-to-lead: characterize hits 4. Lead optimization and synthesis 5. Assay Docking Virtual High-Throughput Screening hloé-agathe Azencott: Data Mining in Bioinformatics, Page 46
47 High-throughput screening Assay a large library of potential drugs against their target Very costly docking virtual high-throughput screening (vhts) hloé-agathe Azencott: Data Mining in Bioinformatics, Page 47
48 Measuring performance Imbalanced data Typically, most compounds are inactive many more negative than positive examples E.g. DHFR data set: 99, 995 chemicals screened for activity against dihydrofolate reductase; < 0.2% active compounds Accuracy is not appropriate: predicting all compounds negative accuracy = 99.8% sensitivity= # True Positives # Positives # True Negatives specificity= # Negatives For many methods, the output is continuous accuracy, sensitivity and specificity depend on a threshold θ hloé-agathe Azencott: Data Mining in Bioinformatics, Page 48
49 Measuring performance Receiver-Operator haracteristic urves True Positive Rate For all possible values of θ, report sensitivity and 1 specificity AURO (Area under the RO urve) is a numerical measure of peformance AURO(random) = 0.5 and AURO(optimal) = 1 0 1/4 2/4 3/ x x x x 0.9 x 0.95 x 0.94 x random perfect Inf x real 0 1/6 1/3 1/2 2/3 5/6 1 False Positive Rate x x x label prediction hloé-agathe Azencott: Data Mining in Bioinformatics, Page 49
50 Measuring performance Inhibition of DHFR: RO urves [Azencott et al., 2007] method AU IRV 0.71 SVM 0.59 knn 0.59 MAX-SIM 0.54 TPR RANDOM IRV SVM MAXSIM FPR hloé-agathe Azencott: Data Mining in Bioinformatics, Page 50
51 Measuring performance Precision-recall curves Precision = # True Positives # Predicted Positives Recall = sensitivity Precision 0 1/5 2/5 3/5 4/ x 0.94 x 0.9 x 0 1/4 2/4 3/4 1 Recall 0.81 x 0.73 x 0.52 x 0.2 x perfect real 0.17 x x hloé-agathe Azencott: Data Mining in Bioinformatics, Page 51
52 Other applications Other applications of graph mining in chemoinformatics Database indexing and search Prediction of 3D structures of small compounds and proteins Reaction Prediction hloé-agathe Azencott: Data Mining in Bioinformatics, Page 52
53 References and further reading [Azencott et al., 2007] Azencott,.-A., Ksikes, A., Swamidass, S. J., hen, J. H., Ralaivola, L. and Baldi, P. (2007). One-to fourdimensional kernels for virtual screening and the prediction of physical, chemical, and biological properties. Journal of chemical information and modeling 47, , 24, 38, 39, 40, 50 [Baldi et al., 2007] Baldi, P., Benz, R. W., Hirschberg, D. S. and Swamidass, S. J. (2007). Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval. Journal of chemical information and modeling 47, [eroni et al., 2007] eroni, A., osta, F. and Frasconi, P. (2007). lassification of small molecules by two-and three-dimensional decomposition kernels. Bioinformatics 23, , 42 [Helma et al., 2004] Helma,., ramer, T., Kramer, S. and De Raedt, L. (2004). Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. Journal of chemical information and computer sciences 44, , 18 [Hinselmann et al., 2010] Hinselmann, G., Fechner, N., Jahn, A., Eckert, M. and Zell, A. (2010). Graph kernels for chemical compounds using topological and three-dimensional local atom pair environments. Neurocomputing 74, , 37, 44 [Mahé et al., 2006] Mahé, P., Ralaivola, L., Stoven, V. and Vert, J.-P. (2006). The pharmacophore kernel for virtual screening with support vector machines. Journal of chemical information and modeling 46, [Menchetti et al., 2005] Menchetti, S., osta, F. and Frasconi, P. (2005). Weighted Decomposition Kernels. In Proceedings of the 22nd International onference on Machine Learning pp , AM, Bonn, Germany. 33, 34 [Saigo et al., 2009] Saigo, H., Nowozin, S., Kadowaki, T., Kudo, T. and Tsuda, K. (2009). gboost: a mathematical programming approach to graph classification and regression. Machine Learning 75, , 27, 28, 29 [Shervashidze et al., 2011] Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K. and Borgwardt, K. M. (2011). Weisfeiler- Lehman graph kernels. Journal of Machine Learning Research 12, , 31 hloé-agathe Azencott: Data Mining in Bioinformatics, Page 53
54 The end Tomorrow: Projects Presentations By 9:45 AM on Friday, March 1, 2013, please submit the following by to Prof. Borgwardt: A short report on your project, that gives your answers to the questions in Section 2 (You can ignore Section 1 here) in your exercise sheet The code that you wrote as part of your project Your presentation slides as a PDF. hloé-agathe Azencott: Data Mining in Bioinformatics, Page 54
Machine learning for ligand-based virtual screening and chemogenomics!
Machine learning for ligand-based virtual screening and chemogenomics! Jean-Philippe Vert Institut Curie - INSERM U900 - Mines ParisTech In silico discovery of molecular probes and drug-like compounds:
More informationPlan. Lecture: What is Chemoinformatics and Drug Design? Description of Support Vector Machine (SVM) and its used in Chemoinformatics.
Plan Lecture: What is Chemoinformatics and Drug Design? Description of Support Vector Machine (SVM) and its used in Chemoinformatics. Exercise: Example and exercise with herg potassium channel: Use of
More informationKernels for small molecules
Kernels for small molecules Günter Klambauer June 23, 2015 Contents 1 Citation and Reference 1 2 Graph kernels 2 2.1 Implementation............................ 2 2.2 The Spectrum Kernel........................
More informationMachine Learning Concepts in Chemoinformatics
Machine Learning Concepts in Chemoinformatics Martin Vogt B-IT Life Science Informatics Rheinische Friedrich-Wilhelms-Universität Bonn BigChem Winter School 2017 25. October Data Mining in Chemoinformatics
More informationA Linear Programming Approach for Molecular QSAR analysis
Linear Programming pproach for Molecular QSR analysis Hiroto Saigo 1, Tadashi Kadowaki 2, and Koji Tsuda 1 1 Max Planck Institute for iological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany {hiroto.saigo,koji.tsuda}@tuebingen.mpg.de,
More informationIn silico pharmacology for drug discovery
In silico pharmacology for drug discovery In silico drug design In silico methods can contribute to drug targets identification through application of bionformatics tools. Currently, the application of
More informationChemogenomic: Approaches to Rational Drug Design. Jonas Skjødt Møller
Chemogenomic: Approaches to Rational Drug Design Jonas Skjødt Møller Chemogenomic Chemistry Biology Chemical biology Medical chemistry Chemical genetics Chemoinformatics Bioinformatics Chemoproteomics
More informationResearch Article. Chemical compound classification based on improved Max-Min kernel
Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2014, 6(2):368-372 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 Chemical compound classification based on improved
More informationJCICS Major Research Areas
JCICS Major Research Areas Chemical Information Text Searching Structure and Substructure Searching Databases Patents George W.A. Milne C571 Lecture Fall 2002 1 JCICS Major Research Areas Chemical Computation
More informationIntroduction to Chemoinformatics and Drug Discovery
Introduction to Chemoinformatics and Drug Discovery Irene Kouskoumvekaki Associate Professor February 15 th, 2013 The Chemical Space There are atoms and space. Everything else is opinion. Democritus (ca.
More informationStructure-Activity Modeling - QSAR. Uwe Koch
Structure-Activity Modeling - QSAR Uwe Koch QSAR Assumption: QSAR attempts to quantify the relationship between activity and molecular strcucture by correlating descriptors with properties Biological activity
More informationCSCI-567: Machine Learning (Spring 2019)
CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March
More informationStatistical learning theory, Support vector machines, and Bioinformatics
1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.
More informationIn Silico Investigation of Off-Target Effects
PHARMA & LIFE SCIENCES WHITEPAPER In Silico Investigation of Off-Target Effects STREAMLINING IN SILICO PROFILING In silico techniques require exhaustive data and sophisticated, well-structured informatics
More informationAn Integrated Approach to in-silico
An Integrated Approach to in-silico Screening Joseph L. Durant Jr., Douglas. R. Henry, Maurizio Bronzetti, and David. A. Evans MDL Information Systems, Inc. 14600 Catalina St., San Leandro, CA 94577 Goals
More informationChemoinformatics and information management. Peter Willett, University of Sheffield, UK
Chemoinformatics and information management Peter Willett, University of Sheffield, UK verview What is chemoinformatics and why is it necessary Managing structural information Typical facilities in chemoinformatics
More informationRchemcpp Similarity measures for chemical compounds. Michael Mahr and Günter Klambauer. Institute of Bioinformatics, Johannes Kepler University Linz
Software Manual Institute of Bioinformatics, Johannes Kepler University Linz Rchemcpp Similarity measures for chemical compounds Michael Mahr and Günter Klambauer Institute of Bioinformatics, Johannes
More informationMachine Learning, Midterm Exam
10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have
More informationKernel-based Machine Learning for Virtual Screening
Kernel-based Machine Learning for Virtual Screening Dipl.-Inf. Matthias Rupp Beilstein Endowed Chair for Chemoinformatics Johann Wolfgang Goethe-University Frankfurt am Main, Germany 2008-04-11, Helmholtz
More informationMachine learning in molecular classification
Machine learning in molecular classification Hongyu Su Department of Computer Science PO Box 68, FI-00014 University of Helsinki, Finland firstname.lastname@cs.helsinki.fi Abstract. Chemical and biological
More informationSupport Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM
1 Support Vector Machines (SVM) in bioinformatics Day 1: Introduction to SVM Jean-Philippe Vert Bioinformatics Center, Kyoto University, Japan Jean-Philippe.Vert@mines.org Human Genome Center, University
More informationGraph Kernels in Chemoinformatics
in hemoinformatics Luc Brun o authors: Benoit Gaüzère(MIVIA), Pierre Anthony Grenier(GREY), Alessia aggese(mivia), Didier Villemin (LMT) Bordeaux, November, 20 2014 Plan yclic similarity onclusion 1 2
More informationPattern Recognition and Machine Learning. Perceptrons and Support Vector machines
Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lessons 6 10 Jan 2017 Outline Perceptrons and Support Vector machines Notation... 2 Perceptrons... 3 History...3
More informationCIS 520: Machine Learning Oct 09, Kernel Methods
CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed
More informationXia Ning,*, Huzefa Rangwala, and George Karypis
J. Chem. Inf. Model. XXXX, xxx, 000 A Multi-Assay-Based Structure-Activity Relationship Models: Improving Structure-Activity Relationship Models by Incorporating Activity Information from Related Targets
More informationSupport Vector Machine (SVM) and Kernel Methods
Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2014 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationKaggle.
Administrivia Mini-project 2 due April 7, in class implement multi-class reductions, naive bayes, kernel perceptron, multi-class logistic regression and two layer neural networks training set: Project
More informationComputational chemical biology to address non-traditional drug targets. John Karanicolas
Computational chemical biology to address non-traditional drug targets John Karanicolas Our computational toolbox Structure-based approaches Ligand-based approaches Detailed MD simulations 2D fingerprints
More informationSupport Vector Machines.
Support Vector Machines www.cs.wisc.edu/~dpage 1 Goals for the lecture you should understand the following concepts the margin slack variables the linear support vector machine nonlinear SVMs the kernel
More informationA NETFLOW DISTANCE BETWEEN LABELED GRAPHS: APPLICATIONS IN CHEMOINFORMATICS
1 A NETFLOW DISTANCE BETWEEN LABELED GRAPHS: APPLICATIONS IN CHEMOINFORMATICS Subhransu Maji Department of Computer Science, Indian Institue of Technology, Kanpur email: maji@cse.iitk.ac.in Shashank Mehta
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationLossless Compression of Chemical Fingerprints Using Integer Entropy Codes Improves Storage and Retrieval
Lossless Compression of Chemical Fingerprints Using Integer Entropy Codes Improves Storage and Retrieval Pierre Baldi,*,, Ryan W. Benz, Daniel S. Hirschberg, and S. Joshua Swamidass Institute for Genomics
More informationMachine Learning, Fall 2011: Homework 5
0-60 Machine Learning, Fall 0: Homework 5 Machine Learning Department Carnegie Mellon University Due:??? Instructions There are 3 questions on this assignment. Please submit your completed homework to
More informationCondensed Graph of Reaction: considering a chemical reaction as one single pseudo molecule
Condensed Graph of Reaction: considering a chemical reaction as one single pseudo molecule Frank Hoonakker 1,3, Nicolas Lachiche 2, Alexandre Varnek 3, and Alain Wagner 3,4 1 Chemoinformatics laboratory,
More informationIntroduction to Chemoinformatics
Introduction to Chemoinformatics Dr. Igor V. Tetko Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH) Institute of Bioinformatics & Systems Biology (HMGU) Kyiv, 10 August
More informationCS4495/6495 Introduction to Computer Vision. 8C-L3 Support Vector Machines
CS4495/6495 Introduction to Computer Vision 8C-L3 Support Vector Machines Discriminative classifiers Discriminative classifiers find a division (surface) in feature space that separates the classes Several
More informationSupport Vector Machine (SVM) and Kernel Methods
Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationComparison of Descriptor Spaces for Chemical Compound Retrieval and Classification
Knowledge and Information Systems (20XX) Vol. X: 1 29 c 20XX Springer-Verlag London Ltd. Comparison of Descriptor Spaces for Chemical Compound Retrieval and Classification Nikil Wale Department of Computer
More informationSupport Vector Machine (SVM) and Kernel Methods
Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2016 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationA Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery
AtomNet A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery Izhar Wallach, Michael Dzamba, Abraham Heifets Victor Storchan, Institute for Computational and
More information10701/15781 Machine Learning, Spring 2007: Homework 2
070/578 Machine Learning, Spring 2007: Homework 2 Due: Wednesday, February 2, beginning of the class Instructions There are 4 questions on this assignment The second question involves coding Do not attach
More informationSupport Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2
More informationMining Molecular Fragments: Finding Relevant Substructures of Molecules
Mining Molecular Fragments: Finding Relevant Substructures of Molecules Christian Borgelt, Michael R. Berthold Proc. IEEE International Conference on Data Mining, 2002. ICDM 2002. Lecturers: Carlo Cagli
More informationAcyclic Subgraph based Descriptor Spaces for Chemical Compound Retrieval and Classification. Technical Report
Acyclic Subgraph based Descriptor Spaces for Chemical Compound Retrieval and Classification Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building 200
More informationPackage Rchemcpp. August 14, 2018
Type Package Title Similarity measures for chemical compounds Version 2.19.0 Date 2015-06-23 Author Michael Mahr, Guenter Klambauer Package Rchemcpp August 14, 2018 Maintainer Guenter Klambauer
More informationA Tiered Screen Protocol for the Discovery of Structurally Diverse HIV Integrase Inhibitors
A Tiered Screen Protocol for the Discovery of Structurally Diverse HIV Integrase Inhibitors Rajarshi Guha, Debojyoti Dutta, Ting Chen and David J. Wild School of Informatics Indiana University and Dept.
More informationIntroduction to Support Vector Machines
Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear
More informationFINAL EXAM: FALL 2013 CS 6375 INSTRUCTOR: VIBHAV GOGATE
FINAL EXAM: FALL 2013 CS 6375 INSTRUCTOR: VIBHAV GOGATE You are allowed a two-page cheat sheet. You are also allowed to use a calculator. Answer the questions in the spaces provided on the question sheets.
More informationSupport Vector Machines
Support Vector Machines Hypothesis Space variable size deterministic continuous parameters Learning Algorithm linear and quadratic programming eager batch SVMs combine three important ideas Apply optimization
More informationNext Generation Computational Chemistry Tools to Predict Toxicity of CWAs
Next Generation Computational Chemistry Tools to Predict Toxicity of CWAs William (Bill) Welsh welshwj@umdnj.edu Prospective Funding by DTRA/JSTO-CBD CBIS Conference 1 A State-wide, Regional and National
More informationData Mining in the Chemical Industry. Overview of presentation
Data Mining in the Chemical Industry Glenn J. Myatt, Ph.D. Partner, Myatt & Johnson, Inc. glenn.myatt@gmail.com verview of presentation verview of the chemical industry Example of the pharmaceutical industry
More informationFINAL: CS 6375 (Machine Learning) Fall 2014
FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for
More informationDISCOVERING new drugs is an expensive and challenging
1036 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 17, NO. 8, AUGUST 2005 Frequent Substructure-Based Approaches for Classifying Chemical Compounds Mukund Deshpande, Michihiro Kuramochi, Nikil
More informationKernel Methods and Support Vector Machines
Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete
More informationAnalysis of a Large Structure/Biological Activity. Data Set Using Recursive Partitioning and. Simulated Annealing
Analysis of a Large Structure/Biological Activity Data Set Using Recursive Partitioning and Simulated Annealing Student: Ke Zhang MBMA Committee: Dr. Charles E. Smith (Chair) Dr. Jacqueline M. Hughes-Oliver
More informationhsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference
CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science
More informationPrediction in bioinformatics applications by conformal predictors
Prediction in bioinformatics applications by conformal predictors Alex Gammerman (joint work with Ilia Nuoretdinov and Paolo Toccaceli) Computer Learning Research Centre Royal Holloway, University of London
More informationTufts COMP 135: Introduction to Machine Learning
Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/ Logistic Regression Many slides attributable to: Prof. Mike Hughes Erik Sudderth (UCI) Finale Doshi-Velez (Harvard)
More information10-701/ Machine Learning - Midterm Exam, Fall 2010
10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationVirtual Libraries and Virtual Screening in Drug Discovery Processes using KNIME
Virtual Libraries and Virtual Screening in Drug Discovery Processes using KNIME Iván Solt Solutions for Cheminformatics Drug Discovery Strategies for known targets High-Throughput Screening (HTS) Cells
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationMidterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian
More informationDevelopment of a Structure Generator to Explore Target Areas on Chemical Space
Development of a Structure Generator to Explore Target Areas on Chemical Space Kimito Funatsu Department of Chemical System Engineering, This materials will be published on Molecular Informatics Drug Development
More informationTopology based deep learning for biomolecular data
Topology based deep learning for biomolecular data Guowei Wei Departments of Mathematics Michigan State University http://www.math.msu.edu/~wei American Institute of Mathematics July 23-28, 2017 Grant
More informationSupport Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Support Vector Machines CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 A Linearly Separable Problem Consider the binary classification
More informationModel Accuracy Measures
Model Accuracy Measures Master in Bioinformatics UPF 2017-2018 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Variables What we can measure (attributes) Hypotheses
More informationSupport'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan
Support'Vector'Machines Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan kasthuri.kannan@nyumc.org Overview Support Vector Machines for Classification Linear Discrimination Nonlinear Discrimination
More informationIntroduction to Machine Learning Midterm Exam
10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but
More informationReceptor Based Drug Design (1)
Induced Fit Model For more than 100 years, the behaviour of enzymes had been explained by the "lock-and-key" mechanism developed by pioneering German chemist Emil Fischer. Fischer thought that the chemicals
More informationSupport Vector Machine & Its Applications
Support Vector Machine & Its Applications A portion (1/3) of the slides are taken from Prof. Andrew Moore s SVM tutorial at http://www.cs.cmu.edu/~awm/tutorials Mingyue Tan The University of British Columbia
More informationCS446: Machine Learning Fall Final Exam. December 6 th, 2016
CS446: Machine Learning Fall 2016 Final Exam December 6 th, 2016 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains
More informationSimilarity Search. Uwe Koch
Similarity Search Uwe Koch Similarity Search The similar property principle: strurally similar molecules tend to have similar properties. However, structure property discontinuities occur frequently. Relevance
More informationStatistical concepts in QSAR.
Statistical concepts in QSAR. Computational chemistry represents molecular structures as a numerical models and simulates their behavior with the equations of quantum and classical physics. Available programs
More informationIntroduction. OntoChem
Introduction ntochem Providing drug discovery knowledge & small molecules... Supporting the task of medicinal chemistry Allows selecting best possible small molecule starting point From target to leads
More informationQSAR Modeling of ErbB1 Inhibitors Using Genetic Algorithm-Based Regression
APPLICATION NOTE QSAR Modeling of ErbB1 Inhibitors Using Genetic Algorithm-Based Regression GAINING EFFICIENCY IN QUANTITATIVE STRUCTURE ACTIVITY RELATIONSHIPS ErbB1 kinase is the cell-surface receptor
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationCS798: Selected topics in Machine Learning
CS798: Selected topics in Machine Learning Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Jakramate Bootkrajang CS798: Selected topics in Machine Learning
More informationPriority Setting of Endocrine Disruptors Using QSARs
Priority Setting of Endocrine Disruptors Using QSARs Weida Tong Manager of Computational Science Group, Logicon ROW Sciences, FDA s National Center for Toxicological Research (NCTR), U.S.A. Thanks for
More informationApplied Machine Learning Annalisa Marsico
Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 29 April, SoSe 2015 Support Vector Machines (SVMs) 1. One of
More informationEXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING
EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical
More informationc 4, < y 2, 1 0, otherwise,
Fundamentals of Big Data Analytics Univ.-Prof. Dr. rer. nat. Rudolf Mathar Problem. Probability theory: The outcome of an experiment is described by three events A, B and C. The probabilities Pr(A) =,
More informationRetrieving hits through in silico screening and expert assessment M. N. Drwal a,b and R. Griffith a
Retrieving hits through in silico screening and expert assessment M.. Drwal a,b and R. Griffith a a: School of Medical Sciences/Pharmacology, USW, Sydney, Australia b: Charité Berlin, Germany Abstract:
More informationAd Placement Strategies
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January
More informationDr. Sander B. Nabuurs. Computational Drug Discovery group Center for Molecular and Biomolecular Informatics Radboud University Medical Centre
Dr. Sander B. Nabuurs Computational Drug Discovery group Center for Molecular and Biomolecular Informatics Radboud University Medical Centre The road to new drugs. How to find new hits? High Throughput
More informationThe exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet.
CS 189 Spring 013 Introduction to Machine Learning Final You have 3 hours for the exam. The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. Please
More informationMehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013
Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013 A. Kernels 1. Let X be a finite set. Show that the kernel
More informationMidterm: CS 6375 Spring 2015 Solutions
Midterm: CS 6375 Spring 2015 Solutions The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for an
More informationSupport Vector Machines
Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized
More informationMachine Learning. Kernels. Fall (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang. (Chap. 12 of CIML)
Machine Learning Fall 2017 Kernels (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang (Chap. 12 of CIML) Nonlinear Features x4: -1 x1: +1 x3: +1 x2: -1 Concatenated (combined) features XOR:
More informationInformation Extraction from Text
Information Extraction from Text Jing Jiang Chapter 2 from Mining Text Data (2012) Presented by Andrew Landgraf, September 13, 2013 1 What is Information Extraction? Goal is to discover structured information
More informationReview: Support vector machines. Machine learning techniques and image analysis
Review: Support vector machines Review: Support vector machines Margin optimization min (w,w 0 ) 1 2 w 2 subject to y i (w 0 + w T x i ) 1 0, i = 1,..., n. Review: Support vector machines Margin optimization
More informationi=1 cosn (x 2 i y2 i ) over RN R N. cos y sin x
Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 November 16, 017 Due: Dec 01, 017 A. Kernels Show that the following kernels K are PDS: 1.
More informationIntroduction to Machine Learning Midterm, Tues April 8
Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationCS534 Machine Learning - Spring Final Exam
CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the
More informationClass 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio
Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant
More informationMachine learning methods to infer drug-target interaction network
Machine learning methods to infer drug-target interaction network Yoshihiro Yamanishi Medical Institute of Bioregulation Kyushu University Outline n Background Drug-target interaction network Chemical,
More informationSupport Vector Machines
Support Vector Machines Tobias Pohlen Selected Topics in Human Language Technology and Pattern Recognition February 10, 2014 Human Language Technology and Pattern Recognition Lehrstuhl für Informatik 6
More information