Data Mining in Bioinformatics Day 9: Graph Mining in Chemoinformatics


1 Data Mining in Bioinformatics Day 9: Graph Mining in Chemoinformatics
Chloé-Agathe Azencott & Karsten Borgwardt
February 18 to March 1, 2013
Machine Learning & Computational Biology Research Group, Max Planck Institutes Tübingen and Eberhard Karls Universität Tübingen
Chloé-Agathe Azencott: Data Mining in Bioinformatics, Page 1

2 Drug discovery
Modern therapeutic research: from serendipity to rationalized drug design
Ancient Greeks treated infections with mould
[Figure: Biapenem in PBP-1A]

3 Drug discovery process
1. Find a target: a protein that we want to inhibit so as to interfere with a biological process
2. Identify hits: compounds likely to bind to the target
3. Hit-to-lead: characterize hits. Can they be drugs? (ADME-Tox)
4. Lead optimization and synthesis
   - bioactivity
   - pharmacokinetics
   - synthetic pathway
5. Assay
   - in vitro
   - in vivo
   - clinical

4 Drug discovery process
Timeline: 52 months; 90 months
1. Find a target
2. Identify hits
3. Hit-to-lead: characterize hits
4. Lead optimization and synthesis
5. Assay

5 Drug discovery process
Timeline: 52 months; 90 months
1. Find a target
2. Identify hits
3. Hit-to-lead: characterize hits
4. Lead optimization and synthesis
5. Assay
Cost: $500,000,000 to $2,000,000,000

6 Chemoinformatics
How can computer science help? Chemoinformatics!
"...the mixing of information resources to transform data into information, and information into knowledge, for the intended purpose of making better decisions faster in the arena of drug lead identification and optimisation." (F. K. Brown)
"...the application of informatics methods to solve chemical problems." (J. Gasteiger and T. Engel)

7 Chemoinformatics
[Diagram: the drug discovery process, labeled "Chemoinformatics"]
1. Find a target
2. Identify hits
3. Hit-to-lead: characterize hits
4. Lead optimization and synthesis
5. Assay

8 Chemoinformatics
The chemical space: possible small organic molecules compared with stars in the observable universe
(Slide courtesy of Matthew A. Kayala)

9 Drug discovery process
1. Find a target
2. Identify hits
3. Hit-to-lead: characterize hits
4. Lead optimization and synthesis
5. Assay
QSAR / QSPR
QSAR: Qualitative Structure-Activity Relationship, i.e. classification
QSPR: Quantitative Structure-Property Relationship, i.e. regression

10 Representing chemicals in silico
Expert knowledge: molecular descriptors (hard, potentially incomplete)
Molecules are...
[Figure: chemical structure diagram]

11 Representing chemicals in silico
Similar Property Principle: molecules having similar structures should exhibit similar activities.
Structure-based representations: compare molecules by comparing substructures

12 Molecular graph
[Figure: example molecular graph with labeled atoms and bonds]
Undirected labeled graph

13 Fingerprints
Define feature vectors that record the presence/absence (or number of occurrences) of particular patterns in a given molecular graph $A$:
$\phi(A) = (\phi_s(A))_{s \text{ substructure}}$, where $\phi_s(A) = 1$ if $s$ occurs in $A$ and $0$ otherwise
Extension of traditional chemical fingerprints
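The definition of $\phi$ can be made concrete with a small sketch. It assumes a molecule has already been reduced to a list of substructure strings; the pattern names and the vocabulary below are hypothetical and only serve as an illustration.

```python
from collections import Counter


def binary_fingerprint(substructures, vocabulary):
    """Return [phi_s(A) for s in vocabulary], with phi_s(A) in {0, 1}."""
    present = set(substructures)
    return [1 if s in present else 0 for s in vocabulary]


def count_fingerprint(substructures, vocabulary):
    """Variant that records the number of occurrences of each pattern."""
    counts = Counter(substructures)
    return [counts[s] for s in vocabulary]


# Hypothetical pattern vocabulary and a toy molecule (not taken from the slides).
vocabulary = ["CsC", "CdO", "CsN", "NsS"]
molecule = ["CsC", "CdO", "CsC"]

print(binary_fingerprint(molecule, vocabulary))  # [1, 1, 0, 0]
print(count_fingerprint(molecule, vocabulary))   # [2, 1, 0, 0]
```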

14 Fingerprints
Learning from fingerprints
Classical machine learning and data mining techniques can be applied to these vectorial feature representations.
Any distance / kernel can be used
- Classification
- Feature selection
- Clustering

15 Fingerprints
Fingerprints compression
Systematic enumeration yields long, sparse vectors, e.g. for 50,000 random compounds from ChemDB: 300,000 paths of bounded length, with few non-zeros on average
Naive Compression: list the positions of the 1s
With $2^{19}$ = 524,288 possible positions, each position costs 19 bits; average encoding: 5,700 bits
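The arithmetic behind the naive encoding can be checked with a short sketch. The function name is hypothetical. The figure of 300 set bits is not quoted from the slide; it is inferred from the two numbers the slide does give, since 5,700 / 19 = 300.

```python
def naive_bits(n_positions, n_nonzero):
    """Bits needed to list the positions of the 1s with fixed-width indices."""
    bits_per_index = (n_positions - 1).bit_length()  # 19 for 2**19 positions
    return bits_per_index * n_nonzero


# 2**19 = 524,288 positions; 300 non-zeros is an inferred value (5,700 / 19).
print(naive_bits(2 ** 19, 300))  # 5700
```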

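16 Fingerprints
Fingerprints compression
Modulo Compression (lossy)
Elias-Gamma Monotone Encoding (lossless) [Baldi et al., 2007]
- Elias-Gamma: index $j$ is encoded as $\lfloor \log_2 j \rfloor$ 0 bits + binary encoding of $j$
- Monotone variant: since $j_i < j_{i+1}$, index $j_{i+1}$ needs only $\lfloor \log_2 j_{i+1} \rfloor - \lfloor \log_2 j_i \rfloor$ 0 bits (instead of $\lfloor \log_2 j_{i+1} \rfloor$) + binary encoding of $j_{i+1}$
Average compressed size = 1,800 bits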
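A minimal sketch of the two encodings follows, assuming the indices are strictly increasing positive integers. The function names are hypothetical and the code is an illustration of the scheme described on the slide, not the implementation of Baldi et al.

```python
def gamma_encode(j):
    """Elias-Gamma code: floor(log2 j) zero bits, then j in binary."""
    n = j.bit_length() - 1  # floor(log2 j) for j >= 1
    return "0" * n + format(j, "b")


def monotone_encode(indices):
    """Monotone variant: only the increase in floor(log2 j) is written as zeros."""
    bits, prev_n = [], 0
    for j in indices:
        n = j.bit_length() - 1
        bits.append("0" * (n - prev_n) + format(j, "b"))
        prev_n = n
    return "".join(bits)


def monotone_decode(bitstring):
    """Inverse of monotone_encode."""
    indices, pos, n = [], 0, 0
    while pos < len(bitstring):
        while bitstring[pos] == "0":  # each leading zero raises the bit width by one
            n += 1
            pos += 1
        indices.append(int(bitstring[pos:pos + n + 1], 2))
        pos += n + 1
    return indices


positions = [3, 5, 6, 20]
plain = "".join(gamma_encode(j) for j in positions)
packed = monotone_encode(positions)
print(len(plain), len(packed))  # 22 17
```

The monotone variant is shorter because sorted indices never decrease in bit width, so only the increase has to be signalled.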

17 Frequent patterns fingerprints
MOLFEA [Helma et al., 2004]
P = positive (mutagenic) compounds, N = negative compounds
Features: fragments (= patterns) f selected by comparing both freq(f, P) and freq(f, N) against a frequency threshold t
Limited to frequent linear patterns
ML algorithm: SVM with linear or quadratic kernel

18 Frequent patterns fingerprints
MOLFEA [Helma et al., 2004]
CPDB (Carcinogenic Potency DataBase): 684 compounds classified into 341 mutagens and 343 nonmutagens according to the Ames test on Salmonella
[Figure: Mutagenicity prediction [Helma et al., 2004]; cross-validated sensitivity (%) vs. frequency threshold, for linear and quadratic kernels]

19 Spectrum kernels
$\phi(A) = (\phi_s(A))_{s \in S}$
$K_{\text{spectrum}}(A, A') = k(\phi(A), \phi(A'))$
$k : \mathbb{R}^{|S|} \times \mathbb{R}^{|S|} \to \mathbb{R}$ can be:
- Dot product (linear kernel)
- RBF kernel
- Tanimoto kernel: $k(A, B) = \frac{|A \cap B|}{|A \cup B|}$
- MinMax kernel: $k(a, b) = \frac{\sum_{i=1}^{N} \min(a_i, b_i)}{\sum_{i=1}^{N} \max(a_i, b_i)}$
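The Tanimoto and MinMax formulas translate directly into code. This sketch assumes binary fingerprints are given as sets of pattern identifiers and count fingerprints as equal-length lists. Returning 0.0 for two empty inputs is a convention chosen here, not something stated on the slide.

```python
def tanimoto(a, b):
    """|A ∩ B| / |A ∪ B| for two sets of pattern identifiers."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def minmax(a, b):
    """sum_i min(a_i, b_i) / sum_i max(a_i, b_i) for two count vectors."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den if den else 0.0


print(tanimoto({1, 2, 3}, {2, 3, 4}))   # 0.5
print(minmax([1, 0, 2], [0, 1, 2]))     # 0.5
```

On 0/1 vectors the two coincide, since min and max then act as intersection and union.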

20 Spectrum kernels
Tanimoto and MinMax
Both Tanimoto and MinMax are kernels.
Proof for Tanimoto: J. C. Gower, A general coefficient of similarity and some of its properties. Biometrics.
Proof for MinMax:
$\mathrm{MinMax}(x, y) = \frac{\langle \phi(x), \phi(y) \rangle}{\langle \phi(x), \phi(x) \rangle + \langle \phi(y), \phi(y) \rangle - \langle \phi(x), \phi(y) \rangle}$
with $\phi(x)$ a binary vector of length (# patterns) × (max count $q$), where $\phi(x)_i = 1$ iff the pattern indexed by $\lfloor i/q \rfloor$ appears more than $i \bmod q$ times in $x$
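The proof idea can be checked numerically: expanding each count into $q$ binary entries turns MinMax on counts into Tanimoto on binary vectors. The sketch below takes $q$ to be the maximum count, which is how the slide's "max count" is read here. The helper names are hypothetical.

```python
def unary_expand(counts, q):
    """phi(x)_i = 1 iff pattern i // q appears more than i % q times."""
    return [1 if counts[i // q] > i % q else 0 for i in range(len(counts) * q)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def tanimoto_from_dots(u, v):
    """<u,v> / (<u,u> + <v,v> - <u,v>)."""
    uv = dot(u, v)
    return uv / (dot(u, u) + dot(v, v) - uv)


def minmax(a, b):
    return sum(map(min, a, b)) / sum(map(max, a, b))


x, y, q = [2, 0, 3], [1, 1, 3], 3
print(minmax(x, y))                                              # 4/6
print(tanimoto_from_dots(unary_expand(x, q), unary_expand(y, q)))  # same value
```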

21 All patterns fingerprints
Paths fingerprints
Labeled sub-paths (walks)
[Figure: molecular graph with some sub-paths of length 3 highlighted]
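Path enumeration can be sketched with a depth-first search over a labeled graph. Several choices below are assumptions made for the illustration: only simple paths (no repeated atom) are enumerated, each path is counted once per direction, and path labels are built by concatenating atom and bond labels. The toy molecule is hypothetical.

```python
from collections import Counter


def path_fingerprint(atoms, bonds, max_len):
    """Count labeled simple paths with at most max_len bonds.

    atoms: list of atom labels; bonds: dict {(u, v): bond_label}.
    """
    adj = {i: [] for i in range(len(atoms))}
    for (u, v), label in bonds.items():
        adj[u].append((v, label))
        adj[v].append((u, label))

    counts = Counter()

    def dfs(node, visited, label):
        counts[label] += 1
        if len(visited) - 1 == max_len:  # number of bonds used so far
            return
        for nxt, bond in adj[node]:
            if nxt not in visited:
                dfs(nxt, visited | {nxt}, label + bond + atoms[nxt])

    for start in range(len(atoms)):
        dfs(start, {start}, atoms[start])
    return counts


# Toy fragment C-C=O, with "s" for single and "d" for double bonds.
fp = path_fingerprint(["C", "C", "O"], {(0, 1): "s", (1, 2): "d"}, max_len=2)
print(sorted(fp.items()))
```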

22 All patterns fingerprints
Circular fingerprints
Labeled sub-trees: Extended-Connectivity (or Circular) features
[Figure: molecular graph with an example of a circular substructure of depth 2]

23 All patterns fingerprints
2D spectrum kernels [Azencott et al., 2007]
Systematically extract paths / circular fingerprints, for various maximal depths
SVM with Tanimoto / MinMax

24 All patterns fingerprints
2D spectrum kernels [Azencott et al., 2007]
Data sets:
- Mutagenicity (Mutag): 188 compounds
- Benzodiazepine receptor affinity (BZR)
- Cyclooxygenase-2 inhibitors (COX2)
- Estrogen receptor affinity (ER)

Data    SVM     Previous best
Mutag   90.4%   85.2% (gboost)
BZR     79.8%   76.4%
COX2    70.1%   73.6%
ER      82.1%   79.8%

25 Informative patterns
Extract informative patterns while learning
All patterns + sparsity regularization: gboost

26 gboost [Saigo et al., 2009]
Train data: $\{(G_n, y_n)\}_{n=1,\dots,l}$
Stump or hypothesis: $h(x; t, \omega) = \omega (2 x_t - 1)$, where $x_t = 1$ if pattern $t$ occurs in the graph ($t \subseteq G$) and $0$ otherwise
Decision function: $f(x) = \sum_{t \in T,\ \omega \in \{-1,+1\}} \alpha_{t\omega}\, h(x; t, \omega)$
Equivalent to solving (LP-Boost):
$\min_{\lambda, \gamma} \gamma$ such that $\sum_{n=1}^{l} \lambda_n y_n h(x_n; t, \omega) \le \gamma \ \ \forall t, \omega$, with $\sum_{n=1}^{l} \lambda_n = 1$ and $0 \le \lambda_n \le D \ \ \forall n$

27 gboost [Saigo et al., 2009]
Solve by column generation
Start with $H = \emptyset$ and $\lambda_n = 1/l \ \ \forall n$
Iteratively:
- find $(t^*, \omega^*)$ that maximizes $g(t, \omega) = \sum_{n=1}^{l} \lambda_n y_n h(x_n; t, \omega)$
- add $(t^*, \omega^*)$ to $H$
- update $\lambda_n$, $\gamma$
Stop when there is no $(t^*, \omega^*)$ such that $g(t^*, \omega^*) > \gamma + \epsilon$
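The gain maximization step can be illustrated on a fixed, tiny candidate set. This is only a sketch: the real method searches the space of subgraph patterns, whereas here each pattern's occurrence vector is assumed to be precomputed, and the pattern names and helper functions are hypothetical.

```python
def stump(occurs, omega):
    """h(x; t, omega) = omega * (2 * x_t - 1)."""
    return omega * (2 * int(occurs) - 1)


def gain(lams, ys, occurs, omega):
    """g(t, omega) = sum_n lambda_n * y_n * h(x_n; t, omega)."""
    return sum(l * y * stump(x, omega) for l, y, x in zip(lams, ys, occurs))


def best_stump(lams, ys, occurrences):
    """Exhaustive search over candidate (pattern, omega) pairs."""
    candidates = [(t, w) for t in occurrences for w in (-1, +1)]
    return max(candidates, key=lambda tw: gain(lams, ys, occurrences[tw[0]], tw[1]))


ys = [+1, +1, -1, -1]
lams = [0.25] * 4  # uniform start, lambda_n = 1/l
occurrences = {
    "a": [True, True, False, False],   # hypothetical pattern, separates the classes
    "b": [True, False, True, False],   # hypothetical pattern, uninformative
}
print(best_stump(lams, ys, occurrences))  # ('a', 1)
```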

28 gboost [Saigo et al., 2009]
Finding $(t^*, \omega^*)$: DFS code tree
Pruning condition ($g^*$: optimal gain so far): if
$\max\left\{ 2 \sum_{n:\, y_n = +1,\ t \subseteq G_n} \lambda_n - \sum_{n=1}^{l} y_n \lambda_n,\ \ 2 \sum_{n:\, y_n = -1,\ t \subseteq G_n} \lambda_n + \sum_{n=1}^{l} y_n \lambda_n \right\} < g^*$
then $\forall t' \supseteq t,\ \forall \omega'$: $g^* > g(t', \omega')$
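The pruning bound can be checked on toy data. A super-pattern $t' \supseteq t$ can only occur in graphs where $t$ occurs, so the sketch below enumerates every subset of the occurrences of $t$ and compares each gain with the bound. The weights and occurrence vector are made up for the illustration.

```python
from itertools import product


def gain(lams, ys, occurs, omega):
    """g(t, omega) = sum_n lambda_n * y_n * omega * (2 * x_t - 1)."""
    return sum(l * y * omega * (2 * int(x) - 1) for l, y, x in zip(lams, ys, occurs))


def gain_bound(lams, ys, occurs):
    """Upper bound on g(t', omega') over all super-patterns t' of t."""
    total = sum(l * y for l, y in zip(lams, ys))
    pos = sum(l for l, y, x in zip(lams, ys, occurs) if x and y == +1)
    neg = sum(l for l, y, x in zip(lams, ys, occurs) if x and y == -1)
    return max(2 * pos - total, 2 * neg + total)


lams = [0.4, 0.1, 0.3, 0.2]
ys = [+1, +1, -1, -1]
occurs_t = [True, True, True, False]
bound = gain_bound(lams, ys, occurs_t)

# Largest gain achieved by any super-pattern (subset of t's occurrences).
best = max(
    gain(lams, ys, [a and b for a, b in zip(occurs_t, mask)], omega)
    for mask in product([True, False], repeat=len(ys))
    for omega in (-1, +1)
)
print(bound, best)
```

Here the bound is attained by the super-pattern that occurs only in the two positive graphs, so no pruning would be possible unless the best gain found so far already exceeds it.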

29 gboost [Saigo et al., 2009]
Application to CPDB
Accuracy similar to Helma et al. (79%)
[Figure: most discriminative patterns]

30 Weisfeiler-Lehman kernel [Shervashidze et al., 2011]
Goal: scalability
Compute a sequence that captures topological and label information of graphs in a runtime linear in the number of edges: sub-tree kernel
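The relabeling idea behind the Weisfeiler-Lehman sub-tree kernel can be sketched as follows. This is a simplified illustration: it keeps the full label strings instead of compressing them with a hash, so it does not achieve the linear runtime mentioned above. The graph encoding (label list plus adjacency lists) and function names are assumptions.

```python
from collections import Counter


def wl_features(labels, adj, h):
    """Count node labels over h rounds of neighborhood relabeling."""
    feats = Counter(labels)
    current = list(labels)
    for _ in range(h):
        # New label = own label plus the sorted multiset of neighbor labels.
        current = [
            current[v] + "(" + ",".join(sorted(current[u] for u in adj[v])) + ")"
            for v in range(len(current))
        ]
        feats.update(current)
    return feats


def wl_kernel(g1, g2, h):
    """Dot product of the label-count vectors of two graphs."""
    f1 = wl_features(g1[0], g1[1], h)
    f2 = wl_features(g2[0], g2[1], h)
    return sum(f1[k] * f2[k] for k in f1)


g_a = (["C", "C", "O"], [[1], [0, 2], [1]])  # path C-C-O
g_b = (["C", "C"], [[1], [0]])               # path C-C
print(wl_kernel(g_a, g_a, 1), wl_kernel(g_a, g_b, 1))  # 8 6
```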

31 Weisfeiler-Lehman kernel [Shervashidze et al., 2011]

32 Convolution kernels
a.k.a. decomposition kernels
$(x_1, \dots, x_D)$ is a tuple of parts of $x$, with $x_d \in X_d$ for each part $d = 1, \dots, D$
$k_d : X_d \times X_d \to \mathbb{R}$: a Mercer kernel
$K_{\text{decomposition}}(x, x') = \sum_{x_1 x_2 \dots x_D = x}\ \sum_{x'_1 x'_2 \dots x'_D = x'} k_1(x_1, x'_1)\, k_2(x_2, x'_2) \cdots k_D(x_D, x'_D)$
Spectrum kernels are a particular case of convolution kernels

33 Convolution kernels
Weighted Decomposition Kernel [Menchetti et al., 2005]
Match atoms and weigh them according to a kernel between subgraphs that include these atoms
$K_{WDK}(x, x') = \sum_{(a, \sigma) \in D_r(x)}\ \sum_{(a', \sigma') \in D_r(x')} \delta(a, a')\, K_c(\sigma, \sigma')$
$r > 0$, $r \in \mathbb{N}$
$D_r(x)$: decompositions of the molecular graph of $x$ into an atom $a$ and a subpath $\sigma$ of $x$ including $a$ and of depth at most $r$

34 Convolution kernels
Weighted Decomposition Kernel [Menchetti et al., 2005]
$K_c$: contextual kernel, here the histogram intersection kernel
$K_c(\sigma, \sigma') = \sum_{l \in L} \min(f_\sigma(l), f_{\sigma'}(l))$
$L$: possible labels for edges and vertices
$f_\sigma(l)$: frequency of label $l$ in subgraph $\sigma$
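A minimal sketch of the Weighted Decomposition Kernel with a histogram intersection context kernel follows. It assumes the decomposition $D_r(x)$ has already been computed and is given as a list of (atom label, label histogram of the context) pairs. Raw counts are used as frequencies, and the toy decompositions are hypothetical.

```python
from collections import Counter


def histogram_intersection(f, g):
    """K_c(sigma, sigma') = sum_l min(f_sigma(l), f_sigma'(l))."""
    return sum(min(f[l], g[l]) for l in f.keys() & g.keys())


def wdk(decomp_x, decomp_y):
    """Sum K_c over all pairs of parts whose selector atoms match (Dirac delta)."""
    return sum(
        histogram_intersection(ctx_x, ctx_y)
        for atom_x, ctx_x in decomp_x
        for atom_y, ctx_y in decomp_y
        if atom_x == atom_y
    )


x = [("C", Counter({"C": 2, "O": 1})), ("O", Counter({"C": 1, "O": 1}))]
y = [("C", Counter({"C": 1, "O": 2}))]
print(wdk(x, y))  # 2
print(wdk(x, x))  # 5
```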

35 Optimal assignment kernels
Try to best map $x$ and $x'$
Not necessarily a kernel; in practice: $K \leftarrow K - \lambda_{\min} I$, where $\lambda_{\min}$ is the smallest eigenvalue of the kernel matrix $K$

36 Optimal assignment kernels
The Local Atom Pair kernel [Hinselmann et al., 2010]
$M$: pairwise intramolecular matrix of inter-atomic topological distances
Local atom environment: $l(i) = \{(L(i), M_{ij}, L(j)),\ j \ne i\}$
$\kappa(i, j)$: dot product, Tanimoto or MinMax between $l(i)$ and $l(j)$

37 Optimal assignment kernels
LAP: performance [Hinselmann et al., 2010]

38 Introducing spatial information
3D Histograms [Azencott et al., 2007]
Groups of k atoms
Associated size:
- Pairwise distances (k = 2)
- Diameter of the smallest sphere that contains all k atoms

39 Introducing spatial information
3D Histograms [Azencott et al., 2007]
One histogram per class of k-tuple (e.g. C-C, C-C-O)
[Figure: histogram of N-O distances; x-axis: N-O distance (Å), y-axis: frequency of N-O]
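For k = 2, the per-class histograms can be sketched as below. The coordinates, the bin width of 1 Å, and the function name are assumptions made for the illustration; the sketch does not reproduce the binning used in the cited paper.

```python
import math
from collections import defaultdict
from itertools import combinations


def pair_distance_histograms(atoms, bin_width=1.0):
    """One histogram of pairwise distances per unordered pair of atom labels.

    atoms: list of (label, (x, y, z)) tuples.
    """
    hists = defaultdict(lambda: defaultdict(int))
    for (label_a, pos_a), (label_b, pos_b) in combinations(atoms, 2):
        pair_class = "-".join(sorted((label_a, label_b)))
        bin_index = int(math.dist(pos_a, pos_b) // bin_width)
        hists[pair_class][bin_index] += 1
    return hists


# Hypothetical coordinates in Ångström.
atoms = [("N", (0.0, 0.0, 0.0)), ("O", (1.5, 0.0, 0.0)), ("O", (0.0, 2.5, 0.0))]
hists = pair_distance_histograms(atoms)
print(dict(hists["N-O"]))  # {1: 1, 2: 1}
print(dict(hists["O-O"]))  # {2: 1}
```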

40 Introducing spatial information
3D Histograms: performance [Azencott et al., 2007]

Data        2D kernel   Hist3D kernel
Mutag       90.4%       88.8%
BZR (loo)   82.0%       79.4%
ER (loo)    87.0%       86.1%
COX2        76.9%       78.6%

41 Introducing spatial information
3D Decomposition Kernels [Ceroni et al., 2007]
Remember: $K_{WDK}(x, x') = \sum_{(a, \sigma) \in D_r(x)}\ \sum_{(a', \sigma') \in D_r(x')} \delta(a, a')\, K_c(\sigma, \sigma')$
$K_{3DDK}(x, x') = \sum_{\sigma \in S_r(x)}\ \sum_{\sigma' \in S_r(x')} K_s(\sigma, \sigma')$
$S_r(x)$: subgraphs of $x$ composed of $r$ distinct vertices
$K_s(\sigma, \sigma') = \prod_{i=1}^{r(r-1)/2} \delta(e_i, e'_i)\, e^{-\gamma (l_i - l'_i)^2}$
$l_i$ = length of edge $e_i$ in $x$; $(e_1, e_2, \dots, e_{r(r-1)/2})$ lexicographically ordered; $\gamma \in \mathbb{R}$

42 Introducing spatial information
3DDK: performance [Ceroni et al., 2007]

Data        2D kernel   Hist3D kernel   3DDK    Circ3DDK
Mutag       90.4%       88.8%           86.7%   83.5%
BZR (loo)   82.0%       79.4%           78.4%   81.4%
ER (loo)    87.0%       86.1%           82.3%   82.1%
COX2        76.9%       78.6%           75.6%   75.2%

43 Introducing spatial information
The pharmacophore kernel [Mahé et al., 2006]
Pharmacophore $p \in P(x)$: $p = [(x_1, l_1), (x_2, l_2), (x_3, l_3)]$
$x_i$ = 3D coordinates of atom $i$ of $x$; $l_i$ = label of atom $i$
$K(x, x') = \sum_{p \in P(x)}\ \sum_{p' \in P(x')} K_P(p, p')$
$K_P(p, p') = K_{dist}(d_1, d'_1)\, K_{dist}(d_2, d'_2)\, K_{dist}(d_3, d'_3)\, K_{feat}(l_1, l'_1)\, K_{feat}(l_2, l'_2)\, K_{feat}(l_3, l'_3)$
$K_{dist}$: RBF Gaussian, $K_{dist}(d, d') = \exp\left(-\frac{\|d - d'\|^2}{2\sigma^2}\right)$
$K_{feat}$: Dirac
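The pharmacophore kernel can be sketched by brute force on very small molecules. Two points are assumptions of this sketch, since the slide does not define them: $P(x)$ is taken to be all ordered triplets of distinct atoms, and $d_i$ is taken to be the distance between point $i$ and the next point of the triplet (cyclically). The toy molecule is hypothetical.

```python
import math
from itertools import permutations


def k_pharmacophore(p, q, sigma):
    """K_P(p, p'): product of Gaussian kernels on distances and Dirac kernels on labels."""
    value = 1.0
    for i in range(3):
        (xi, li), (xj, _) = p[i], p[(i + 1) % 3]
        (yi, mi), (yj, _) = q[i], q[(i + 1) % 3]
        if li != mi:  # Dirac kernel on atom labels
            return 0.0
        d, e = math.dist(xi, xj), math.dist(yi, yj)
        value *= math.exp(-((d - e) ** 2) / (2 * sigma ** 2))
    return value


def pharmacophore_kernel(mol_a, mol_b, sigma=1.0):
    """Sum K_P over all pairs of ordered atom triplets (cubic cost in each molecule)."""
    return sum(
        k_pharmacophore(p, q, sigma)
        for p in permutations(mol_a, 3)
        for q in permutations(mol_b, 3)
    )


# Atoms are (coordinates, label) pairs, mirroring p = [(x_i, l_i)].
mol = [((0.0, 0.0, 0.0), "N"), ((1.0, 0.0, 0.0), "O"), ((0.0, 1.0, 0.0), "C")]
print(pharmacophore_kernel(mol, mol))  # 6.0
```

With three distinctly labeled atoms, only identical triplets match, which gives one unit contribution per ordering.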

44 Introducing spatial information
3D LAP kernel [Hinselmann et al., 2010]
$M$: pairwise intramolecular matrix of inter-atomic geometric distances

45 Introducing spatial information
Conclusion
How relevant is 3D information?
How good is 3D information?

46 Drug discovery process
1. Find a target
2. Identify hits
3. Hit-to-lead: characterize hits
4. Lead optimization and synthesis
5. Assay
Docking
Virtual High-Throughput Screening

47 High-throughput screening
Assay a large library of potential drugs against their target
Very costly, hence docking and virtual high-throughput screening (vHTS)

48 Measuring performance
Imbalanced data
Typically, most compounds are inactive: many more negative than positive examples
E.g. DHFR data set: 99,995 chemicals screened for activity against dihydrofolate reductase; < 0.2% active compounds
Accuracy is not appropriate: predicting all compounds negative gives accuracy = 99.8%
sensitivity = # True Positives / # Positives
specificity = # True Negatives / # Negatives
For many methods, the output is continuous: accuracy, sensitivity and specificity depend on a threshold θ
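The effect of class imbalance on accuracy can be reproduced with a toy example. The data below are made up (2 actives among 1,000 compounds, i.e. 0.2%), and the function assumes both classes are present and that a compound is predicted positive when its score is at least θ.

```python
def confusion_metrics(labels, scores, theta):
    """Return (accuracy, sensitivity, specificity) at threshold theta."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= theta
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(labels)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity


# A classifier that scores every compound 0.0 predicts "inactive" everywhere.
labels = [1] * 2 + [0] * 998
scores = [0.0] * 1000
print(confusion_metrics(labels, scores, theta=0.5))  # (0.998, 0.0, 1.0)
```

The accuracy looks excellent while the sensitivity is zero, which is why accuracy alone is misleading here.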

49 Measuring performance
Receiver-Operator Characteristic Curves
For all possible values of θ, report sensitivity and 1 − specificity
AUROC (Area under the ROC Curve) is a numerical measure of performance
AUROC(random) = 0.5 and AUROC(optimal) = 1
[Figure: ROC curves (True Positive Rate vs. False Positive Rate) for a random, a perfect and a real classifier, with an example table of labels and predictions]
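One way to compute the AUROC without drawing the curve is the pairwise-ranking formulation: the area equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The sketch below uses this formulation on made-up data; it is quadratic in the number of examples and meant only as an illustration.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparisons of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
print(auroc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # 1.0 (perfect ranking)
print(auroc([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5]))  # 0.5 (uninformative scores)
```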

50 Measuring performance
Inhibition of DHFR: ROC Curves [Azencott et al., 2007]

Method    AUC
IRV       0.71
SVM       0.59
kNN       0.59
MAX-SIM   0.54

[Figure: ROC curves (TPR vs. FPR) for RANDOM, IRV, SVM and MAXSIM]

51 Measuring performance
Precision-recall curves
Precision = # True Positives / # Predicted Positives
Recall = sensitivity
[Figure: precision-recall curves (Precision vs. Recall) for a perfect and a real classifier]
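Precision and recall at each threshold can be computed as in the sketch below. The data are made up, a compound is predicted positive when its score is at least θ, and reporting a precision of 1.0 when nothing is predicted positive is a convention chosen here.

```python
def precision_recall(labels, scores, theta):
    """Return (precision, recall) when predicting positive for scores >= theta."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= theta)
    predicted = sum(1 for s in scores if s >= theta)
    positives = sum(labels)
    precision = tp / predicted if predicted else 1.0
    recall = tp / positives
    return precision, recall


def pr_curve(labels, scores):
    """One (precision, recall) point per distinct score, from strictest threshold down."""
    return [precision_recall(labels, scores, t) for t in sorted(set(scores), reverse=True)]


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
for point in pr_curve(labels, scores):
    print(point)
```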

52 Other applications
Other applications of graph mining in chemoinformatics:
- Database indexing and search
- Prediction of 3D structures of small compounds and proteins
- Reaction prediction

53 References and further reading
[Azencott et al., 2007] Azencott, C.-A., Ksikes, A., Swamidass, S. J., Chen, J. H., Ralaivola, L. and Baldi, P. (2007). One- to four-dimensional kernels for virtual screening and the prediction of physical, chemical, and biological properties. Journal of Chemical Information and Modeling 47.
[Baldi et al., 2007] Baldi, P., Benz, R. W., Hirschberg, D. S. and Swamidass, S. J. (2007). Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval. Journal of Chemical Information and Modeling 47.
[Ceroni et al., 2007] Ceroni, A., Costa, F. and Frasconi, P. (2007). Classification of small molecules by two- and three-dimensional decomposition kernels. Bioinformatics 23.
[Helma et al., 2004] Helma, C., Cramer, T., Kramer, S. and De Raedt, L. (2004). Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. Journal of Chemical Information and Computer Sciences 44.
[Hinselmann et al., 2010] Hinselmann, G., Fechner, N., Jahn, A., Eckert, M. and Zell, A. (2010). Graph kernels for chemical compounds using topological and three-dimensional local atom pair environments. Neurocomputing 74.
[Mahé et al., 2006] Mahé, P., Ralaivola, L., Stoven, V. and Vert, J.-P. (2006). The pharmacophore kernel for virtual screening with support vector machines. Journal of Chemical Information and Modeling 46.
[Menchetti et al., 2005] Menchetti, S., Costa, F. and Frasconi, P. (2005). Weighted Decomposition Kernels. In Proceedings of the 22nd International Conference on Machine Learning, ACM, Bonn, Germany.
[Saigo et al., 2009] Saigo, H., Nowozin, S., Kadowaki, T., Kudo, T. and Tsuda, K. (2009). gboost: a mathematical programming approach to graph classification and regression. Machine Learning 75.
[Shervashidze et al., 2011] Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K. and Borgwardt, K. M. (2011). Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12.

54 The end
Tomorrow: Projects Presentations
By 9:45 AM on Friday, March 1, 2013, please submit the following to Prof. Borgwardt:
- A short report on your project that gives your answers to the questions in Section 2 of your exercise sheet (you can ignore Section 1 here)
- The code that you wrote as part of your project
- Your presentation slides as a PDF


More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear

More information

FINAL EXAM: FALL 2013 CS 6375 INSTRUCTOR: VIBHAV GOGATE

FINAL EXAM: FALL 2013 CS 6375 INSTRUCTOR: VIBHAV GOGATE FINAL EXAM: FALL 2013 CS 6375 INSTRUCTOR: VIBHAV GOGATE You are allowed a two-page cheat sheet. You are also allowed to use a calculator. Answer the questions in the spaces provided on the question sheets.

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Hypothesis Space variable size deterministic continuous parameters Learning Algorithm linear and quadratic programming eager batch SVMs combine three important ideas Apply optimization

More information

Next Generation Computational Chemistry Tools to Predict Toxicity of CWAs

Next Generation Computational Chemistry Tools to Predict Toxicity of CWAs Next Generation Computational Chemistry Tools to Predict Toxicity of CWAs William (Bill) Welsh welshwj@umdnj.edu Prospective Funding by DTRA/JSTO-CBD CBIS Conference 1 A State-wide, Regional and National

More information

Data Mining in the Chemical Industry. Overview of presentation

Data Mining in the Chemical Industry. Overview of presentation Data Mining in the Chemical Industry Glenn J. Myatt, Ph.D. Partner, Myatt & Johnson, Inc. glenn.myatt@gmail.com verview of presentation verview of the chemical industry Example of the pharmaceutical industry

More information

FINAL: CS 6375 (Machine Learning) Fall 2014

FINAL: CS 6375 (Machine Learning) Fall 2014 FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for

More information

DISCOVERING new drugs is an expensive and challenging

DISCOVERING new drugs is an expensive and challenging 1036 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 17, NO. 8, AUGUST 2005 Frequent Substructure-Based Approaches for Classifying Chemical Compounds Mukund Deshpande, Michihiro Kuramochi, Nikil

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete

More information

Analysis of a Large Structure/Biological Activity. Data Set Using Recursive Partitioning and. Simulated Annealing

Analysis of a Large Structure/Biological Activity. Data Set Using Recursive Partitioning and. Simulated Annealing Analysis of a Large Structure/Biological Activity Data Set Using Recursive Partitioning and Simulated Annealing Student: Ke Zhang MBMA Committee: Dr. Charles E. Smith (Chair) Dr. Jacqueline M. Hughes-Oliver

More information

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

More information
