Machine Learning Concepts in Chemoinformatics


Machine Learning Concepts in Chemoinformatics
Martin Vogt, B-IT Life Science Informatics, Rheinische Friedrich-Wilhelms-Universität Bonn
BigChem Winter School 2017, 25 October

Data Mining in Chemoinformatics
Goal: construct models that enable the identification of relationships between chemical structure and activity.
Traditional QSAR techniques (multiple linear regression) are not generally applicable:
- data sets (e.g. HTS sets) are usually too large
- data sets are usually structurally diverse
Machine learning techniques are required.

Types of Machine Learning Algorithms
Machine learning divides into supervised learning (classification, regression) and unsupervised learning (clustering, associative learning).
- Supervised: annotated training sets are given as input/output pairs x / f(x); deduce the function f from the training data so as to produce the correct output f(x) for an input x.
- Unsupervised: only unlabeled training examples are given; determine how the data are organized / find patterns in the data.

Classification (supervised)
Prediction of a class based on classified examples: a structural pattern is learned from the training set.

Regression (supervised)
Prediction of a numerical property based on examples with specific values: structure-value relations are learned from the training set.

Clustering (unsupervised)
Organize data into groups of similar objects; structural patterns are found in the training set.

Machine learning steps
- Data
- Representation / distance metric
- Objective function
- Machine learning method
- Performance evaluation
- Model selection: parameter optimization

Data and representation
Select data for training.
Data representation:
- vector of features (features can be categorical or numerical)
- something else (any computer-readable representation)
Distance metric for the representation: assess similarity between objects.

Objective function
Mathematical formulation of what to learn, e.g.:
- classification: minimize the number of misclassifications
- regression: minimize the difference between correct and predicted quantity
- clustering: minimize the distance within clusters while maximizing the distance between clusters
The ML method should suit the chosen objective function, e.g.:
- SVM for minimizing classification errors
- linear regression for minimizing the sum of squared errors
- hierarchical clustering

Optimization method
ML methods perform a tradeoff between:
- variance: sensitivity to the training data; high variance -> overfitting
- bias: error due to (simplified) model assumptions; the model performs as well on test data as on training data; high bias -> underfitting
Regularization parameters:
- some methods have hyperparameters controlling the complexity of a model
- the higher the complexity, the better the performance on the training data
- simpler models might not perform as well on the training data, but might perform comparably on test data (see the sketch below)
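A minimal sketch of this tradeoff, assuming scikit-learn as the ML library (the slides do not name an implementation): training accuracy rises with tree depth while test accuracy can fall.

```python
# Sketch of the complexity/overfitting tradeoff; the synthetic dataset and
# the depth range are arbitrary illustrative choices.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 3, 5, 10, None):  # None = grow the tree fully
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    # Deeper (more complex) trees fit the training data better,
    # but may generalize worse to the held-out test data.
    print(depth, model.score(X_train, y_train), model.score(X_test, y_test))
```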

Classification: Naive Bayes
Models feature distributions for the different classes; assumes that features are distributed differently between classes. The distributions are modeled based on the training data, e.g.:
- normal distributions: $p_i(x_i \mid A) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}\right)$
- Bernoulli distributions: $P_i(v \mid A) = p_i^v (1 - p_i)^{1-v}$
Independence of the features is assumed, so the class likelihood factorizes: $L(A \mid x) = p(A) \prod_{i=1}^{n} p_i(x_i \mid A)$

Classification: Naive Bayes
Naive Bayesian classifiers are easy to use, but naive Bayes makes strong assumptions:
- continuous features are normally distributed
- feature distributions are conditionally independent
These assumptions can introduce a strong bias into the model.
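A hedged sketch of naive Bayes with normally distributed features, using scikit-learn's GaussianNB (the library choice and the toy data are assumptions, not from the slides):

```python
# Gaussian naive Bayes on a toy descriptor matrix X with class labels y.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.2, 0.3], [0.9, 0.5], [3.1, 2.2], [2.8, 1.9]])  # toy descriptors
y = np.array([0, 0, 1, 1])                                      # class labels

clf = GaussianNB().fit(X, y)  # fits a per-class normal distribution per feature
print(clf.predict([[1.0, 0.4]]))        # expected: class 0
print(clf.predict_proba([[1.0, 0.4]]))  # posterior class probabilities
```

For binary fingerprint features, scikit-learn's BernoulliNB would be the analogous choice.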

Classification: Decision Tree
Simple example: classification of oxygen-containing compounds into 4 categories.
- Is there a hydrogen atom connected to the oxygen?
  - yes: Is there a carbonyl group next to the oxygen?
    - yes: acid
    - no: alcohol
  - no: Is there a carbonyl group next to the oxygen?
    - yes: ester
    - no: ether

Classification: Decision Tree
Given a query object (e.g. a molecule):
- traverse the tree and test the attribute values of the object
- assign the class label of the respective leaf to the object
Example: for CH2COOH the answers are yes (hydrogen connected to the oxygen) and yes (carbonyl group next to the oxygen), so the assigned class label is acid.

Classification: Decision Trees
Decision trees are easy to use:
- handle different types of features: numerical and categorical
- no explicit metric required
- white box: the relevant features are observable
- but prone to overfitting (high variance)
Hyperparameters have to be set:
- depth of the tree
- number of features to consider
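A sketch of these hyperparameters, assuming scikit-learn's DecisionTreeClassifier; the synthetic data are stand-ins for descriptor vectors:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
clf = DecisionTreeClassifier(
    max_depth=3,      # depth of the tree: limits complexity, reduces variance
    max_features=4,   # number of features to consider at each split
    random_state=0,
).fit(X, y)
# "White box": the relevant features are observable.
print(clf.feature_importances_)
```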

Classification: Random Forest
A machine learning ensemble classifier:
- consisting of many decision trees
- trees built from subsamples of the training data
- tree decisions based on a subset of the features
- the output class labels of the individual trees are combined into one final output class label by consensus prediction (the class predicted by the majority of trees)
Ensemble models increase the bias of the individual models while decreasing the variance of the overall model.
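A sketch of consensus prediction with scikit-learn's RandomForestClassifier (an assumed library choice; the slide does not name an implementation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=16, random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,     # number of decision trees in the ensemble
    max_features="sqrt",  # each split considers a random subset of features
    bootstrap=True,       # each tree is built from a subsample of the data
    random_state=0,
).fit(X, y)
# predict() returns the majority (consensus) vote of the individual trees.
print(forest.predict(X[:3]))
```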

Classification: Support Vector Machines (SVM)
Supervised binary classification approach.
- Idea: derivation of a hyperplane (with normal vector w) separating the active from the inactive compounds
- Projection of test compounds onto w for classification / ranking
- Slack variables allow for the misclassification of some data during modeling

Classification: SVM Feature Space Transformation
A reasonable linear separation of the data is not always possible (even if limited classification errors are allowed). Projection of the data into a higher-dimensional feature space often permits a linear separation.

Classification: SVM Popular Kernel Functions
- Linear kernel (standard scalar product): $K_{\mathrm{Linear}}(x, x') = \langle x, x' \rangle$
- Gaussian radial basis function: $K_{\mathrm{Gaussian}}(x, x') = \exp\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right)$
- Polynomial kernel: $K_{\mathrm{Polynomial}}(x, x') = (\langle x, x' \rangle + 1)^d$
- Tanimoto kernel: $K_{\mathrm{Tanimoto}}(x, x') = \frac{\langle x, x' \rangle}{\langle x, x \rangle + \langle x', x' \rangle - \langle x, x' \rangle}$
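A sketch of the Tanimoto kernel plugged into an SVM via scikit-learn's precomputed-kernel interface; the fingerprints and labels are toy assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def tanimoto_kernel(A, B):
    """K(x, x') = <x,x'> / (<x,x> + <x',x'> - <x,x'>) for binary vectors."""
    inner = A @ B.T
    norm_a = (A * A).sum(axis=1)[:, None]
    norm_b = (B * B).sum(axis=1)[None, :]
    return inner / (norm_a + norm_b - inner)

fps = np.array([[0, 1, 1, 0, 1], [1, 1, 0, 0, 1],
                [0, 0, 1, 1, 0], [1, 0, 1, 1, 0]], dtype=float)
labels = np.array([1, 1, 0, 0])

svm = SVC(kernel="precomputed", C=1.0)
svm.fit(tanimoto_kernel(fps, fps), labels)         # Gram matrix of training data
print(svm.predict(tanimoto_kernel(fps[:2], fps)))  # kernel vs. the training set
```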

Classification: SVM
SVMs have hyperparameters that influence the complexity of a model:
- a coefficient controlling the sensitivity to errors
- some kernels, like the Gaussian or polynomial kernel, are themselves parameterized

Performance measures
Confusion matrix:

                           True class: Negative   True class: Positive
Predicted class: Negative  True negatives (TN)    False negatives (FN)
Predicted class: Positive  False positives (FP)   True positives (TP)

- Sensitivity (true positive rate): $TPR = \frac{TP}{TP + FN}$
- Specificity (true negative rate): $TNR = \frac{TN}{TN + FP}$
- Precision (positive predictive value): $PPV = \frac{TP}{TP + FP}$
- Accuracy: $Acc = \frac{TP + TN}{TP + FN + FP + TN}$
- Balanced accuracy: $Acc_B = 0.5\,(TPR + TNR)$
- F1-score (harmonic mean of PPV and TPR): $F_1 = \frac{2 \cdot PPV \cdot TPR}{PPV + TPR}$
- Matthews correlation coefficient: $MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TN+FP)(FN+TP)(TN+FN)(TP+FP)}}$
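These measures can be computed as follows; a sketch assuming scikit-learn, with toy labels:

```python
from sklearn.metrics import (confusion_matrix, recall_score, precision_score,
                             accuracy_score, balanced_accuracy_score,
                             f1_score, matthews_corrcoef)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TPR:", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("TNR:", tn / (tn + fp))                   # specificity
print("PPV:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Acc:", accuracy_score(y_true, y_pred))
print("Balanced Acc:", balanced_accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
```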

Receiver operating characteristic (ROC)
Some ML methods (can) yield scores or probabilities for a class; a variable threshold is then used for categorization.
ROC:
- vary the threshold
- plot FPR (x-axis) vs. TPR (y-axis)
Random classification corresponds to the diagonal; a curve above the diagonal indicates positive performance.
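A sketch of computing ROC points and the area under the curve (AUC) from classifier scores, assuming scikit-learn; the scores would come from e.g. predict_proba or decision_function:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.3]  # higher = more "positive"

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) per threshold
print(list(zip(fpr, tpr)))
print("AUC:", roc_auc_score(y_true, scores))      # 0.5 = random (the diagonal)
```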

Model evaluation
How can we tell whether a model is good? A number of metrics exist for measuring the performance of models:
- classification: (balanced) accuracy, precision, recall, ROC, correlation coefficient, ...
- regression: mean squared/absolute error
- clustering: silhouette coefficient, ...
A model might be good on the training data, but will it be good on test data?

Model evaluation
Cross validation:
- select specific model and parameter settings
- split the training data into n parts; repeatedly retain 1 part, train on the other n-1 parts, and measure performance on the retained part, which was not used for training
- assign the average performance
Cross validation properties:
- works on the internal training set
- performance is evaluated on data not used for training
- checks the generalization ability of the model
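A sketch of n-fold cross validation, assuming scikit-learn; the SVM and the synthetic data are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
# 5 folds: train on 4 parts, evaluate on the held-out part, repeat, average.
scores = cross_val_score(SVC(kernel="rbf", C=1.0), X, y, cv=5)
print(scores, scores.mean())
```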

Parameter optimization / Model selection
Cross validation can be used directly to compare different ML methods.
Many ML methods possess hyperparameters:
- they control the complexity of a model
- cross validation can and should be used to tune these parameters
A small number of hyperparameters can be tuned using grid search:
- systematically explore possible parameter values
- determine the performance with cross validation
- choose the best settings for the final model
Other strategies exist: random search, gradient-based optimization, ...
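A sketch of grid search over the SVM hyperparameters C and gamma, assuming scikit-learn; the grid values are arbitrary examples:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]},
    cv=5,  # performance of each setting is determined by cross validation
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)  # settings for the final model
```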

ML for Virtual Screening: Data, Representations and Metrics

Machine Learning in Chemoinformatics
Central theme: investigation of the relationships between similarity and properties of molecules.
Similarity may refer to: structure, shape, physicochemical properties, pharmacophoric features, ...
Properties may refer to: biological activity, target selectivity, oral availability, toxicity, ...

Machine Learning in Chemoinformatics
Central theme: investigation of the relationships between similarity and properties of molecules.
General rule: the similarity property principle. Similar structures show similar activities (e.g. two close structural analogs with pKi values of 9.08 and 9.37).

Data: Public Sources
ChEMBL:
- only activity annotations
- analog series bias
- for a specific target: not a representative sample of active chemical compounds
PubChem:
- HTS assay data, less biased
- but less confident annotations

Data: Normalization
Molecular representations can vary:
- protonation states
- salts
A consistent representation is required:
- washing: remove counter ions, generate consistent protonation states
- nitrogen, oxygen, sulfur: hydrogen-suppressed representation
Activity annotations might not be comparable:
- prefer pKi over IC50; IC50 depends on assay conditions such as enzyme and substrate concentrations
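A hedged sketch of "washing" with RDKit's SaltRemover, assuming RDKit is available (the slides do not prescribe a toolkit); the sodium acetate example is illustrative:

```python
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover

remover = SaltRemover()  # uses RDKit's default salt/counter-ion definitions
mol = Chem.MolFromSmiles("CC(=O)[O-].[Na+]")  # acetate as its sodium salt
washed = remover.StripMol(mol)
print(Chem.MolToSmiles(washed))  # counter ion removed, acid fragment remains
```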

Representation & Distance Metric
Representations:
- fingerprints
- descriptors
- 3D conformations
- graph structures
A representation limits what can be perceived: only information encoded in the representation is available. The distance metric depends on the representation.

Representation: Descriptors
Numerical property descriptors:
- physicochemical descriptors: logP(o/w), molecular weight, ...
- topological descriptors: connectivity indices, shape descriptors, ...
- count descriptors: number of nitrogen atoms, number of rings, ...
Together they form a numerical vector $(x_1, x_2, \ldots, x_n)$.
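A hedged sketch of assembling such a descriptor vector with RDKit (an assumed toolkit choice); phenol is an arbitrary example molecule:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("c1ccccc1O")  # phenol
x = [
    Descriptors.MolLogP(mol),    # physicochemical: logP(o/w) estimate
    Descriptors.MolWt(mol),      # physicochemical: molecular weight
    Descriptors.RingCount(mol),  # count: number of rings
    sum(a.GetAtomicNum() == 7 for a in mol.GetAtoms()),  # count: N atoms
]
print(x)
```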

Distance metric: Descriptors
Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
Important: normalization. The normalized Euclidean distance divides by the standard deviations $s_i$ of the sample data: $d(x, y) = \sqrt{\sum_i \frac{(x_i - y_i)^2}{s_i^2}}$
Kernel functions: radial basis function (RBF): $\varphi(r) = e^{-\gamma \lVert x - y \rVert^2}$
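A sketch of plain vs. normalized Euclidean distance with NumPy; the toy matrix mixes descriptors on very different scales (e.g. molecular weight vs. logP):

```python
import numpy as np

X = np.array([[300.1, 2.1], [450.7, 0.3], [210.5, 4.2]])  # e.g. MW, logP
s = X.std(axis=0)  # per-descriptor standard deviations of the sample data

def euclidean(x, y):
    return np.sqrt(((x - y) ** 2).sum())

def normalized_euclidean(x, y):
    return np.sqrt((((x - y) / s) ** 2).sum())

print(euclidean(X[0], X[1]))             # dominated by molecular weight
print(normalized_euclidean(X[0], X[1]))  # weighs each descriptor comparably
```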

Representation: Fingerprints
Structural features of a molecule are encoded either as a binary vector, e.g. (0,0,1,1,0,1,0,1), or as the corresponding feature set, e.g. {2,3,5,7} (the zero-based positions of the set bits).

Distance Metric: Fingerprints
Hamming/Manhattan distance: the number of differing features, $d(A, B) = |A \,\Delta\, B| = \sum_i |a_i - b_i| = \lVert a - b \rVert^2$ (for binary vectors)
Tanimoto coefficient: the ratio of the number of features two molecules have in common to the number of all occurring features, $Tc(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{\sum_i a_i b_i}{\sum_i a_i + \sum_i b_i - \sum_i a_i b_i}$
Kernel function (RBF): with $r = \lVert a - b \rVert$, Gaussian $\varphi(r) = e^{-\gamma r^2}$
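A hedged sketch computing Tanimoto similarity and Hamming distance on Morgan fingerprints with RDKit; the fingerprint type, parameters, and molecules are assumptions for illustration:

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

m1 = Chem.MolFromSmiles("CCO")  # ethanol
m2 = Chem.MolFromSmiles("CCN")  # ethylamine
fp1 = AllChem.GetMorganFingerprintAsBitVect(m1, 2, nBits=1024)
fp2 = AllChem.GetMorganFingerprintAsBitVect(m2, 2, nBits=1024)

# Tanimoto coefficient: |A n B| / |A u B|
print(DataStructs.TanimotoSimilarity(fp1, fp2))
# Hamming distance: number of differing bits
print(sum(fp1[i] != fp2[i] for i in range(fp1.GetNumBits())))
```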

Representation: Conformations
Volumetric shapes. Metric: shape superposition (e.g. ROCS).

Representation: Graph structures
2D graph representations. Metrics:
- maximum common substructure (MCS): ratio of the number of bonds in the MCS to the total number of bonds
- graph kernels: e.g. the random walk kernel
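A hedged sketch of an MCS-based similarity with RDKit's rdFMCS module; the bond-ratio variant used here (twice the MCS bond count over the bond total of both molecules) is one common reading of the definition above, not necessarily the slide's exact formula:

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

m1 = Chem.MolFromSmiles("c1ccccc1CCO")  # phenethyl alcohol
m2 = Chem.MolFromSmiles("c1ccccc1CCN")  # phenethylamine
mcs = rdFMCS.FindMCS([m1, m2])
# ratio of bonds in the MCS to the total number of bonds in both molecules
sim = 2 * mcs.numBonds / (m1.GetNumBonds() + m2.GetNumBonds())
print(mcs.smartsString, sim)
```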

SVM in Compound Space
Compounds as data points in a compound reference space described by fingerprint features. Positive class: active compounds. Negative class: inactive compounds.

SVM in Target-Ligand Space
Data points are target-ligand pairs. Positive class: active compounds paired with their true targets. Negative class: inactive compounds paired with pseudo targets.

Target-Ligand Kernel (TLK)
A kernel on the target space (e.g. protein sequences such as SSGADYPDELQCLDAPVLSQAKC and NVGKGQPSVLQVVNLPIVERPVC) is combined with a kernel on the ligand space by taking the product:
$K_{\mathrm{target\text{-}ligand}}((T_1, l_1), (T_2, l_2)) = K_{\mathrm{target}}(T_1, T_2) \cdot K_{\mathrm{ligand}}(l_1, l_2)$
Example from the slide: $K_{\mathrm{target}}(T_1, T_2) = 0.26$ and $K_{\mathrm{ligand}}(l_1, l_2) = 1$, giving $K_{\mathrm{target\text{-}ligand}} = 0.26 \cdot 1 = 0.26$.
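A minimal sketch of this product kernel; kernel_target and kernel_ligand are hypothetical placeholders for, e.g., a sequence kernel and a Tanimoto kernel:

```python
# Product kernel over (target, ligand) pairs, as defined on the slide.
def target_ligand_kernel(pair1, pair2, kernel_target, kernel_ligand):
    (t1, l1), (t2, l2) = pair1, pair2
    return kernel_target(t1, t2) * kernel_ligand(l1, l2)

# With K_target(T1, T2) = 0.26 and K_ligand(l1, l2) = 1 (the slide's values):
print(target_ligand_kernel(("T1", "l1"), ("T2", "l2"),
                           lambda a, b: 0.26, lambda a, b: 1.0))  # -> 0.26
```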