Chapter 6
Interactive Feature Selection with TotalBoost^g_ν

We saw in the experimental section that the generalization performance of the corrective and totally corrective boosting algorithms is comparable. However, the totally corrective algorithms produce smaller final combinations of features, avoid redundant features, and are less prone to re-selecting the same features. All of these properties were crucial in the application discussed in this chapter: the design of an interactive feature selection algorithm for computational drug design. The final hypothesis is useful to the chemist only if it is based on a small number of features and is therefore interpretable.

We initially used corrective boosting algorithms for this purpose and ran into two problems: the constructed convex combination of features was too large, and the algorithms re-selected previously included features as weak hypotheses. Our collaborators at Telik were particularly confused by the second problem. They assumed that the algorithm was broken because it suggested a feature again even though they had selected it only a couple of iterations earlier. These problems led us to investigate totally corrective boosting algorithms and were the main motivation for this thesis.

This chapter is organized as follows. We begin with a summary of background knowledge about computational drug design. This is followed by a discussion of the data set and the descriptors we used for our experiments. After discussing the experimental setup for feature selection, we give two modes for producing a small convex combination of features. The first simply applies TotalBoost^g_ν without outside interference. In the second, the algorithm presents a list of features to the computational chemist, and the chemist selects from the list. Finally, we compare the results for both modes of operation.

6.1 Background of Computational Drug Design

Drug design involves identifying the drug target, searching for and designing small molecules that interact uniquely with the target (called drug leads), and optimizing the drug-like properties of the lead molecules. Typically, a drug target is a protein that is responsible for the manifestation of a therapeutically relevant disease or a toxic side effect. However, it is expensive to accurately test a large number of molecules against the drug target. Computational drug design develops methods of virtual screening, which greatly reduce this cost. Virtual screening builds computational models based on molecules with known activity and/or knowledge of the drug target. The model is then used to score large libraries of molecules, and only the few molecules with high predicted activity are synthesized and tested.

The initial drug leads typically do not have all the properties required of a drug, e.g. the ADME (absorption, distribution, metabolism, and excretion) properties. Computational drug design also develops methods for understanding the correlation between the ADME properties and the chemical structure of the molecules, so that new molecules whose structures differ slightly from the lead molecules can be designed to improve these properties.

6.2 Dataset and Descriptors

We used Dataset 1 (the COX-1 dataset) in our test. It contains 19 known nonsteroidal anti-inflammatory drugs (NSAIDs), which are assumed to be active (positive examples in the classification problem), and 106 compounds chosen from the Available Chemicals Directory (ACD), which are assumed to be inactive (negative examples). The putative inactives were chosen randomly, with the constraint that the distribution of their molecular weights approximately matches that of the 19 NSAIDs. In addition, to make the problem more difficult, all inactives were required to have a carboxylic acid, a feature present in most of the known NSAIDs.

We now explain the descriptors we used for this dataset. Descriptor vectors based on MDL's MACCS keys are standard in the industry. Each molecule is described by a bit vector indicating which features are present. However, we found that in lead optimization data sets the MACCS keys were often insufficient to discriminate between closely related chemical structures. To obtain a more sensitive feature set, we developed a new molecular descriptor based on a graph-theoretic approach to molecular representation. We call these descriptors Structural Path Profiles (SPPs), or simply path descriptors. The path descriptor of a molecule is the set of all unique paths (up to a certain length) in the molecule. A path is a sequence of atoms (nodes) connected by bonds (edges). A dictionary of all paths occurring in the molecules under consideration defines the feature space, and a bit vector encodes the presence or absence of each path in a given molecule (see Figure 6.1 for an example).

Note that the dimension of the path descriptor feature space is quite large: it is around 10^6 when the path length is at most 8. Compared to the MACCS keys, which have only 166 bits, the path descriptors are able to distinguish between molecules that are indistinguishable in the MACCS feature space, which makes them well suited to lead optimization.

6.3 Experiment Setup

We use a 60%/40% split of the COX-1 dataset into a training set and a test set, respectively. In our random split, the training set contains 11 actives and 64 inactives.
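To make the path descriptors of Section 6.2 concrete, the following is a minimal sketch of how such bit vectors could be built. It is not the implementation used in the thesis. The toy molecule representation (a list of atom symbols plus a list of bonds), the bond symbols, the canonical form (the lexicographically smaller of a path string and its reverse), and the names `path_strings` and `encode` are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy representation: atom symbols plus bonds (i, j, bond_symbol).
Molecule = Tuple[List[str], List[Tuple[int, int, str]]]


def path_strings(mol: Molecule, max_atoms: int) -> Set[str]:
    """Return all unique simple paths containing at most max_atoms atoms."""
    atoms, bonds = mol
    adj: Dict[int, List[Tuple[int, str]]] = {i: [] for i in range(len(atoms))}
    for i, j, b in bonds:
        adj[i].append((j, b))
        adj[j].append((i, b))

    found: Set[str] = set()

    def extend(tokens: List[str], visited: List[int]) -> None:
        # A path and its reverse are the same feature, so store one
        # canonical string (an assumption of this sketch).
        forward = "".join(tokens)
        backward = "".join(reversed(tokens))
        found.add(min(forward, backward))
        if len(visited) == max_atoms:
            return
        for nxt, b in adj[visited[-1]]:
            if nxt not in visited:  # simple paths only: no atom is revisited
                extend(tokens + [b, atoms[nxt]], visited + [nxt])

    for start in range(len(atoms)):
        extend([atoms[start]], [start])
    return found


def encode(mols: List[Molecule], max_atoms: int) -> Tuple[List[str], List[List[int]]]:
    """Build the path dictionary over all molecules and one bit vector per molecule."""
    per_mol = [path_strings(m, max_atoms) for m in mols]
    dictionary = sorted(set().union(*per_mol))
    bits = [[1 if p in s else 0 for p in dictionary] for s in per_mol]
    return dictionary, bits


# Two toy molecules: a C-C-O chain and a C-N fragment.
ethanol: Molecule = (["C", "C", "O"], [(0, 1, "-"), (1, 2, "-")])
methylamine: Molecule = (["C", "N"], [(0, 1, "-")])
dictionary, bits = encode([ethanol, methylamine], max_atoms=3)
```

The dictionary grows quickly with the maximum path length, which is consistent with the large feature space reported above. This sketch enumerates paths by plain depth-first search and makes no attempt at efficiency.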

Figure 6.1: Path descriptors: each unique path of the molecule corresponds to a binary descriptor.

In the data preprocessing phase, we eliminated the features that never appear in any active compound of the training set. This removes approximately 95% of the features and leaves about 3000 features. We did this because there is a large number of features that are irrelevant to the activity but may uniquely identify individual molecules in the dataset. When the classifier selects such features, the generalization performance of the final hypothesis is poor. Note that this is not cheating, since the pre-selection of features is based on the training set only. We observed in other experiments (not shown) that this pre-selection greatly improved the generalization performance.

The individual features in the representation of the molecule are the weak hypotheses used for boosting. We assume that the set of features is complementation closed: if h is a feature, then so is −h. In each iteration of boosting, the base learner simply selects the feature that maximizes the edge w.r.t. the current distribution d on the examples. Since we always select a feature with maximum edge, the guarantee of the resulting weak learner is the maximum margin ρ. If TotalBoost_ν is not given the value of the guarantee g = ρ, then the algorithm starts too slowly (see experimental section). Therefore we first pre-compute the maximum margin ρ of the training set and then use TotalBoost^g_ν with g = ρ in all our experiments.

6.4 The Final Hypothesis Produced by Non-interactive TotalBoost^g_ν

The five features with the highest weights in the final hypothesis of TotalBoost^g_ν are shown in Figure 6.2. The top feature is a path that corresponds to the arylpropionic acid group seen in several of the NSAIDs (e.g., ibuprofen). One feature with positive weight corresponds to aromatic rings. It is interesting to note that two of the features chosen by the algorithm have negative weight [1]. One corresponds to a carboxylate, which is a consequence of how the negative examples were chosen: they all possess this feature, so this result is an artifact of the data set selection. The C-N feature is interesting because the distribution of positive and negative examples is such that a carbon-nitrogen sigma bond is found more frequently in negative examples than in positive examples.

[1] Note that this is possible because the feature set is complementation closed.

Figure 6.2: The features with the highest weights in the final hypothesis of TotalBoost^g_ν.

In summary, TotalBoost^g_ν selects some good features and a few irrelevant features. Note that the test error for this model is 0.12 and the test AUC is [value missing in the source].

6.5 Interactive Feature Selection by TotalBoost^g_ν

As we saw in the previous section, although the final hypothesis produced by TotalBoost^g_ν uses a small number of features, TotalBoost^g_ν usually chooses some irrelevant features because the feature set is overly rich compared to the small number of training examples. These irrelevant features could easily be detected by a chemist. However, manually processing the feature set before the experiments is laborious because of its size. We want to use the domain knowledge of the expert to determine useful features. For this purpose, we designed an interactive feature selection procedure based on TotalBoost^g_ν. In this procedure, the chemist interacts with the program and decides which feature should be chosen next.

The interactive feature selection procedure works as follows. At each iteration, TotalBoost^g_ν calculates the weighted errors of all the features w.r.t. the current distribution on the examples and provides the chemist with a list of features ranked by weighted error [2]. The highly ranked features are preferable, since they make the algorithm converge faster. After the chemist chooses a feature, a new distribution based on the enlarged feature set is generated. With this method, incorporating the expert's knowledge becomes easy, since the chemist can interactively choose the feature at each iteration.

[2] Since the features are binary, we can use the error of the hypotheses instead of the edge.

We initially designed an interactive feature selection procedure based on AdaBoost. However, AdaBoost repeatedly suggested the same features in its list, and this confused our collaborators. To avoid this problem, we developed TotalBoost_ν.

We wrote a computer program that implements this interactive feature selection procedure. Figure 6.3 shows the first three iterations of the procedure with TotalBoost^g_ν for Dataset 1. The screen for each iteration is split into four parts:

The left part shows information about the top-ranked candidate features: the rank of the feature, followed by the feature name (path descriptor), the sign of the feature weight (- means the negation of the feature), the weighted error of the feature, and the number of compounds for which the feature is +1. The list of features is sorted by weighted error from low to high.

The center part shows the current hypothesis, sorted by feature weight.

The right part shows information about the compounds that are hard to classify with the current hypothesis: the compound's ID, its weight relative to the average over all compounds, its label, and its prediction score under the current hypothesis. The list of compounds is sorted by weight.

The bottom part is the user interaction area. Here the user enters the feature to be added to the hypothesis in this iteration.

We claim that irrelevant features can be avoided easily with our interactive feature selection algorithm. For example, at iterations 2 and 3 of the run shown in Figure 6.3, the chemist deliberately ignored the top-ranked feature C-N because C-N occurs in many molecules.

Figure 6.3: The first three iterations of interactive feature selection by TotalBoost^g_ν. Each screen lists the candidate features (columns No, Features, P., W.e), the current hypotheses (No, current hypotheses, Weight), and the hard-to-classify compounds (CompID, R.w, La, Score).
Iteration 1: Err.Rate 0.000, TruePos 0, FalsePos 0, TrueNeg 0. Command s4 selects C:C:C:C:C:C-C-C-O$.
Iteration 2: Err.Rate 0.053, TruePos 7, FalsePos 0, TrueNeg 64. Top-ranked candidate C-N. Command s2 selects C:C:C:N:C:C.
Iteration 3: Err.Rate 0.040, TruePos 9, FalsePos 1, TrueNeg 63. Top-ranked candidate C-N. Command s7 selects C-C-C:C:C:C:C:C.

6.6 The Final Hypothesis Produced by Interactive TotalBoost^g_ν

We compare the final hypothesis produced in the interactive mode of TotalBoost^g_ν with the one produced in the non-interactive mode. All experimental settings are identical. In Figure 6.4, we show the five features with the highest weights in the final hypothesis of interactive TotalBoost^g_ν.

Figure 6.4: The features with the highest weights in the final hypothesis of TotalBoost^g_ν with interactive feature selection.

Note that the test error for this model is 0.13 and the test AUC is [value missing in the source]. The previous numbers for non-interactive TotalBoost^g_ν were test error 0.12 and test AUC [value missing in the source]. Although the results are slightly worse than without interactive feature selection, our collaborators strongly preferred the hypothesis produced by the interactive procedure.

6.7 Summary

We applied TotalBoost_ν to a computational drug discovery problem. Our goal is to find structural features that determine the high activity of a molecule.

TotalBoost^g_ν uses a small set of hypotheses in the final hypothesis. However, we found that some features selected by TotalBoost^g_ν may be irrelevant to the activity of the molecule because of the nature of the data, i.e., an overly rich feature set relative to a small number of examples. Therefore we designed an interactive feature selection procedure based on TotalBoost^g_ν that easily incorporates the chemist's expert knowledge into real-time hypothesis building. The resulting hypotheses were considered more useful by the chemists.
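As an illustration of two steps described in this chapter, the sketch below shows the pre-selection of features that occur in at least one active training compound, and the ranking of candidate features by weighted error under a given distribution d. It is a minimal sketch under stated assumptions, not the program used in the experiments. It assumes that a present bit is read as the prediction +1 and an absent bit as −1, it takes d as given (the totally corrective update of d is not shown), and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def keep_features_seen_in_actives(X: Sequence[Sequence[int]],
                                  y: Sequence[int]) -> List[int]:
    """Indices of features present in at least one active (+1) training example."""
    return [j for j in range(len(X[0]))
            if any(row[j] == 1 and label == 1 for row, label in zip(X, y))]


def rank_candidates(X: Sequence[Sequence[int]], y: Sequence[int],
                    d: Sequence[float], columns: Sequence[int],
                    top_k: int = 10) -> List[Tuple[int, int, float]]:
    """Rank features by weighted error w.r.t. d, lowest error first.

    Returns (feature index, sign, weighted error). Because the feature set
    is complementation closed, a feature with error above 1/2 is offered
    as its negation (sign -1) with error 1 - err.
    """
    ranked = []
    for j in columns:
        # Assumption: bit 1 means prediction +1, bit 0 means prediction -1.
        err = sum(w for row, label, w in zip(X, y, d)
                  if (1 if row[j] == 1 else -1) != label)
        sign = 1
        if err > 0.5:
            err, sign = 1.0 - err, -1
        ranked.append((err, j, sign))
    ranked.sort()  # ties are broken by feature index
    return [(j, sign, err) for err, j, sign in ranked[:top_k]]


# Toy data: 4 training compounds, 3 binary features, uniform distribution.
X = [[1, 0, 1],
     [1, 0, 0],
     [0, 0, 1],
     [0, 1, 1]]
y = [1, 1, -1, -1]
d = [0.25, 0.25, 0.25, 0.25]

kept = keep_features_seen_in_actives(X, y)   # feature 1 never occurs in an active
candidates = rank_candidates(X, y, d, kept)  # list shown to the chemist
```

For a ±1-valued hypothesis, the edge under d equals 1 − 2·err, so sorting by weighted error from low to high is the same as sorting by edge from high to low. In the non-interactive mode the first entry of the list would be taken automatically. In the interactive mode the chemist may pick any entry.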


More information

Chemical library design

Chemical library design Chemical library design Pavel Polishchuk Institute of Molecular and Translational Medicine Palacky University pavlo.polishchuk@upol.cz Drug development workflow Vistoli G., et al., Drug Discovery Today,

More information

Online Passive-Aggressive Algorithms. Tirgul 11

Online Passive-Aggressive Algorithms. Tirgul 11 Online Passive-Aggressive Algorithms Tirgul 11 Multi-Label Classification 2 Multilabel Problem: Example Mapping Apps to smart folders: Assign an installed app to one or more folders Candy Crush Saga 3

More information

Machine Learning. Ensemble Methods. Manfred Huber

Machine Learning. Ensemble Methods. Manfred Huber Machine Learning Ensemble Methods Manfred Huber 2015 1 Bias, Variance, Noise Classification errors have different sources Choice of hypothesis space and algorithm Training set Noise in the data The expected

More information

Molecular Fragment Mining for Drug Discovery

Molecular Fragment Mining for Drug Discovery Molecular Fragment Mining for Drug Discovery Christian Borgelt 1, Michael R. Berthold 2, and David E. Patterson 2 1 School of Computer Science, tto-von-guericke-university of Magdeburg, Universitätsplatz

More information
