Modeling Mutagenicity Status of a Diverse Set of Chemical Compounds by Envelope Methods


Subho Majumdar, School of Statistics, University of Minnesota
Envelopes in Chemometrics, August 4, 2014

Motivation

Predictive analysis of data in chemistry: generating in silico models that predict the activities of chemical compounds, with application in drug development, where such models reduce the cost of manufacturing derivatives of chemicals.

Specific problem: binary class prediction in heterogeneous multivariate data (e.g. mutagen/non-mutagen status, curative effect of a drug), approached through dimension reduction.

Table of contents: 1. The data and the variables; 2. The models; 3. Results; 4. Conclusion.


The data

The data were taken from the CRC Handbook of Identified Carcinogens and Non-carcinogens [5]. The response variable is 0/1 mutagen status obtained from the Ames test of mutagenicity: a compound was classified as a mutagen (scored 1) if its Ames score exceeded a certain cutoff, and as a non-mutagen (scored 0) otherwise. There are 508 compounds in total: 256 mutagens and 252 non-mutagens. The dataset is diverse, meaning that the compounds belong to chemical classes that are fairly different from each other, such as alkanes and amines.


Description of variables

Four types of variables:
1. Topostructural (TS): define the molecular topology, i.e. the connectedness of atoms within a molecule (103 descriptors).
2. Topochemical (TC): carry information on atom and bond types (195 descriptors).
3. Three-dimensional (3D): define three-dimensional aspects of the overall molecular structure (3 descriptors).
4. Quantum-chemical (QC): describe electronic aspects of molecular structure (6 descriptors).

Previous work

Ridge regression has been used to build a predictive model of mutagenicity [2], with the 0/1 mutagenicity status as the response variable (1 corresponding to a higher Ames mutagenicity score, 0 to a lower one).

Variable selection on a larger set of predictors has been done by adapting a supervised clustering algorithm previously used on high-dimensional genetic data [4].


Envelope regression model

$$Y_i = \alpha + \beta X_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \Sigma), \qquad \Sigma = \Gamma \Omega \Gamma^T + \Gamma_0 \Omega_0 \Gamma_0^T, \qquad i = 1, 2, \ldots, n$$

Due to Cook, Li and Chiaromonte, 2010 [1].
- $Y_i \in \mathbb{R}^r$: multivariate response vector; $X_i \in \mathbb{R}^p$: non-stochastic predictors.
- $\alpha \in \mathbb{R}^r$: intercept; $\beta \in \mathbb{R}^{r \times p}$: matrix of regression coefficients; both unknown.
- $\Gamma \in \mathbb{R}^{r \times u}$, $\Gamma_0 \in \mathbb{R}^{r \times (r-u)}$: semi-orthogonal basis matrices of the envelope $\mathcal{E}_\Sigma(\mathcal{B})$ and its orthogonal complement, respectively, with $\mathcal{B} = \mathrm{span}(\beta)$ and $0 \le u \le r$ the dimension of the envelope.
- $\Omega = \Gamma^T \Sigma \Gamma$, $\Omega_0 = \Gamma_0^T \Sigma \Gamma_0$: coordinate matrices corresponding to $\Gamma$, $\Gamma_0$.
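Two facts implicit on this slide may help the reader (stated here following [1], not verbatim from the slide): the coefficient matrix lies in the envelope, and the envelope basis is estimated by a likelihood-based objective over semi-orthogonal matrices.

```latex
% Since span(beta) is contained in span(Gamma), beta factors through
% the envelope coordinates eta:
\beta = \Gamma \eta, \qquad \eta \in \mathbb{R}^{u \times p}.
% Likelihood-based estimator of the basis (Cook, Li and Chiaromonte, 2010):
% S_res = residual covariance of the OLS fit of Y on X,
% S_Y   = marginal sample covariance of Y.
\widehat{\Gamma} = \operatorname*{arg\,min}_{\Gamma^T \Gamma = I_u}
  \left[ \log \det\left( \Gamma^T S_{\mathrm{res}} \Gamma \right)
       + \log \det\left( \Gamma^T S_Y^{-1} \Gamma \right) \right].
```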

Graphical illustration of envelope model (Source: Stat 8932 class notes, R. Dennis Cook) Envelopes in Chemometrics August 4, 2014 11 / 23

Envelope regression model for our data: basics

Log-transformed data. The predictors are taken as the multivariate response and the 0/1 mutagenicity status as the single predictor; envelope regression models are then fitted.

Hierarchical approach to observe the effect of adding different classes of predictors: separate envelope models are fit on data with only TS, only TC, TS + TC, and the full set of predictors.

The data matrix is rank deficient, so PCA was performed first and the envelope model was built on the first few PCs that explained 90% (or 95%) of the total variation. A sketch of this preprocessing pipeline is given below.
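A minimal sketch of the preprocessing in Python (not the authors' code; names such as X_desc and y_status are hypothetical stand-ins, and the envelope fit itself would use specialized software):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-ins for the real data:
# X_desc: (508, p) matrix of molecular descriptors
# y_status: (508,) 0/1 mutagenicity status
rng = np.random.default_rng(0)
X_desc = np.exp(rng.normal(size=(508, 103)))
y_status = rng.integers(0, 2, size=508)

X_log = np.log(X_desc + 1e-8)        # log-transform, guarding against zeros

# Keep the smallest number of PCs explaining 90% of the total variation
pca = PCA(n_components=0.90)
Z = pca.fit_transform(X_log)         # (508, k) matrix of PC scores
print(Z.shape[1], "PCs retained")

# Envelope setup: the PC scores Z play the role of the multivariate
# *response*, and the 0/1 status is the single predictor. Estimating
# the envelope basis Gamma requires specialized code and is not shown.
```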

Supervised Singular Value Decomposition (SupSVD)

$$X = Y B V^T + F V^T + E$$

Due to Li et al., 2014 [3].
- Matrix of predictors $X \in \mathbb{R}^{n \times p}$; supervision data matrix $Y \in \mathbb{R}^{n \times r}$.
- $B \in \mathbb{R}^{r \times q}$: multivariate matrix of coefficients; $V \in \mathbb{R}^{p \times q}$: full-rank loading matrix; $q$: the dimension of the underlying latent space.
- The rows of $F$ are $N_q(0, \Sigma_f)$ and the rows of $E$ are $N_p(0, \sigma_e^2 I_p)$ random errors, so that $\Sigma = V \Sigma_f V^T + \sigma_e^2 I_p$.
- A modified EM algorithm is used to estimate the unknown parameters $\theta = (B, V, \Sigma_f, \sigma_e^2)$.

For our data, the vector of mutagenicity status is used as the supervision data matrix $Y$, while the data on the 307 predictors form the matrix $X$. A small simulation of this generative model follows.
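A short simulation illustrating the SupSVD generative model (a sketch under assumed dimensions, not the EM estimation algorithm of [3]):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, r, q = 508, 307, 1, 5           # assumed dimensions for illustration

Y = rng.integers(0, 2, size=(n, r)).astype(float)  # supervision: 0/1 status
B = rng.normal(size=(r, q))                        # supervision coefficients
V = np.linalg.qr(rng.normal(size=(p, q)))[0]       # orthonormal p x q loadings
Sigma_f = np.diag(rng.uniform(1.0, 4.0, size=q))   # latent covariance
sigma_e = 0.5                                      # isotropic noise level

F = rng.multivariate_normal(np.zeros(q), Sigma_f, size=n)  # latent scores
E = sigma_e * rng.normal(size=(n, p))                      # noise matrix

# SupSVD: supervised low-rank part + unsupervised latent part + noise
X = Y @ B @ V.T + F @ V.T + E
```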

Prediction through Linear Discriminant Analysis

Envelope model: estimate the envelope basis, say $\hat{\Gamma}$, reduce the matrix of predictors by multiplying it with the basis, and then apply Fisher's Linear Discriminant Analysis on $\hat{\Gamma}^T Y$.

SupSVD: here the notations are reversed and $X$ is our 508 × 307 data matrix. After obtaining the loading matrix $V$, we transform the data matrix as $U = XV$ and apply LDA on $U$, taking $Y$ as the 0/1 class variable.

Correct classification percentages are obtained through cross-validation on the full sample; a sketch of this step is given below.
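A sketch of the LDA-on-scores step, reusing the placeholder X, V, and Y from the simulation above (the original analysis used its own estimated loadings and cross-validation scheme):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

U = X @ V                  # project descriptors onto the SupSVD loadings
lda = LinearDiscriminantAnalysis()

# Correct-classification rate estimated by cross-validation; on the
# simulated placeholder data this number is of course meaningless.
scores = cross_val_score(lda, U, Y.ravel(), cv=2, scoring="accuracy")
print("CV accuracy: %.1f%%" % (100 * scores.mean()))
```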

Method of cross-validation: naïve CV vs. two-fold CV. (The comparison figure on this slide is not reproduced in the transcript; see the sketch below.)
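The slide's figure is not preserved. Assuming "naïve CV" means estimating the dimension reduction on the full sample before cross-validating the classifier, and "two-fold CV" means re-estimating the reduction inside each fold (an interpretation, not stated on the slide), the fold-honest scheme looks like this, reusing X_log and y_status from the earlier sketch:

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Re-fitting PCA inside every fold avoids leaking information from the
# held-out half into the dimension-reduction step.
pipe = make_pipeline(PCA(n_components=0.90), LinearDiscriminantAnalysis())
scores = cross_val_score(pipe, X_log, y_status, cv=2, scoring="accuracy")
print("Two-fold CV accuracy: %.1f%%" % (100 * scores.mean()))
```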


Results: Variance reduction by envelopes

All envelope models showed substantial reductions in estimation variance relative to OLS; the gains were especially large for the first two principal components.

Set of        No. of   Envelope        % variance explained      Envelope gain ratio
descriptors   PCs      dimension (u)   PC1     PC2     PC3       PC1     PC2     PC3
-------------------------------------------------------------------------------------
TS            7        3               70.43   10.35   2.60      25.91   36.17   2.10
TC            8        4               75.89    6.52   2.42      15.40   35.26   1.00
TS + TC       13       6               70.27    7.94   2.21      10.40   37.99   1.22
Full          15       11              58.19    7.60   5.98       1.00    1.00   1.00

Note: with the default objective- and gradient-function tolerances in env, the algorithm did not converge within 1000 iterations; the tolerances were therefore set to 1e-7 and 1e-4, respectively. Among the other PCs of the full model, PCs 9, 11, 13 and 15 gave 1.26-, 1.96-, 1.88- and 1.5-fold gains, respectively.
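For reference, an assumption about the terminology (the transcript does not define it): the "gain ratio" for a coordinate of the coefficient estimate is presumably the ratio of estimated standard errors of the OLS and envelope estimators,

```latex
\text{gain ratio for coordinate } j
  = \frac{\widehat{\mathrm{se}}\bigl(\hat{\beta}_{j,\mathrm{OLS}}\bigr)}
         {\widehat{\mathrm{se}}\bigl(\hat{\beta}_{j,\mathrm{env}}\bigr)},
\qquad \text{so a ratio of } 1 \text{ means no gain over OLS.}
```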

Performance of envelopes and SupSVD in prediction

Model description                           Predictors       No. of predictors   Correct classification %
                                            in model         (components)        Total   Mutagens   Non-mutagens
-----------------------------------------------------------------------------------------------------------------
Ridge regression [2]                        TS+TC            298                 76.97   83.98      69.84
Ridge regression [2]                        TS+TC+3D+QC      307                 77.17   84.38      69.84
Ridge regression after var. selection [4]   TS+TC+AP         203                 78.35   84.38      72.22
Envelope LDA                                TS               103                 57.09   65.63      48.41
Envelope LDA                                TC               195                 58.27   69.92      46.43
Envelope LDA                                TS+TC            298                 60.24   69.14      51.19
SupSVD LDA, 90% cutoff                      TS               103 (5)             59.45   70.31      48.41
SupSVD LDA, 90% cutoff                      TC               195 (37)            70.47   76.56      64.29
SupSVD LDA, 90% cutoff                      TS+TC            298 (32)            68.90   75.39      62.30
SupSVD LDA, 90% cutoff                      TS+TC+3D+QC      307 (34)            70.47   77.73      63.09
SupSVD LDA, 95% cutoff                      TS               103 (8)             60.04   67.58      52.38
SupSVD LDA, 95% cutoff                      TC               195 (51)            72.44   78.13      66.67
SupSVD LDA, 95% cutoff                      TS+TC            298 (48)            70.47   78.91      61.90
SupSVD LDA, 95% cutoff                      TS+TC+3D+QC      307 (51)            71.06   78.91      63.09

Table: Comparison of the predictive performance of the various models. Parenthesized counts are the number of SupSVD components retained at the given variance cutoff.


Discussion

For estimation, the envelope models performed very well in conjunction with PCA on the rank-deficient data, offering substantial gains over OLS for the major principal components.

Possible reasons for the poor performance in prediction:
- a high material-to-immaterial variation ratio,
- heteroskedasticity caused by the diverse chemical classes among the compounds,
- differences in scale between the different types of variables.

A possible extension is logistic envelope regression. SupSVD is a plausible alternative approach because of its general framework, its computational stability, and its applicability in the n << p scenario.

Acknowledgements

Prof. Dennis Cook, for his guidance and valuable inputs. Henry Zhang, for providing his code for logistic envelope regression. Greg Grunwald, UofM-Duluth, for providing the dataset.

References

[1] Cook, R. D., Li, B., and Chiaromonte, F. Envelope models for parsimonious and efficient multivariate linear regression. Statistica Sinica 20 (2010), 927-1010.
[2] Hawkins, D., Basak, S., and Mills, D. QSARs for chemical mutagens from structure: ridge regression fitting and diagnostics. Environmental Toxicology and Pharmacology 16 (2004), 37-44.
[3] Li, G., Yang, D., Shen, H., and Nobel, A. Supervised singular value decomposition and its asymptotic properties. Technometrics (submitted).
[4] Majumdar, S., Basak, S., and Grunwald, G. Adapting interrelated two-way clustering method for quantitative structure-activity relationship (QSAR) modeling of mutagenicity/non-mutagenicity of a diverse set of chemicals. Current Computer-Aided Drug Design 9 (2013), 463-471.
[5] Soderman, J. CRC Handbook of Identified Carcinogens and Noncarcinogens: Carcinogenicity-Mutagenicity Database. CRC Press, Boca Raton, FL, 1982.

THANK YOU!