Extensions to LDA and multinomial regression

Patrick Breheny, September 22, BST 764: Applied Statistical Modeling

Introduction

Linear discriminant analysis begins by assuming that $x \mid G$ follows a multivariate normal distribution with a variance that does not depend on the class $G$. Consider now doing away with this assumption and allowing the class-conditional distributions to have unequal variances; denote the variance given class $k$ by $\Sigma_k$.

Quadratic discriminant analysis

Without the equal-variance assumption, the convenient cancellations of LDA do not occur:

$$\delta_k(x) = \log \pi_k - \tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^T \Sigma_k^{-1}(x-\mu_k);$$

in particular, note that the quadratic terms remain. The discriminant functions and decision boundaries therefore become quadratic in x. This procedure is therefore called quadratic discriminant analysis (QDA).
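As a concrete illustration (a minimal R sketch, not from the slides; the function name and arguments are mine), the discriminant function above can be evaluated directly from plug-in estimates of pi_k, mu_k, and Sigma_k, and an observation is then assigned to the class with the largest value:

    # Evaluate delta_k(x) for one class from plug-in estimates (illustrative sketch)
    qda_discriminant <- function(x, pi_k, mu_k, Sigma_k) {
      d <- x - mu_k
      logdet <- as.numeric(determinant(Sigma_k, logarithm = TRUE)$modulus)  # log|Sigma_k|
      log(pi_k) - 0.5 * logdet - 0.5 * drop(t(d) %*% solve(Sigma_k, d))
    }

    # Example: delta for the 'setosa' class of the iris data at a new observation x0
    x0 <- colMeans(iris[, 1:4])
    mu <- colMeans(iris[iris$Species == "setosa", 1:4])
    S  <- cov(iris[iris$Species == "setosa", 1:4])
    qda_discriminant(x0, pi_k = 1/3, mu_k = mu, Sigma_k = S)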

Example 1 / Example 2

[Four figure-only slides illustrating these two examples; the plots are not included in this transcription]

R/SAS

QDA is implemented and easily run in both SAS and R.

In SAS:

    PROC DISCRIM DATA=Iris POOL=NO;
      CLASS Species;
    RUN;

In R (again requiring the MASS package):

    fit <- qda(Species ~ ., data = iris)
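For completeness, a short sketch of how predictions could then be obtained from the fitted object (standard MASS usage; the confusion-matrix step is my addition, not from the slides):

    library(MASS)
    fit  <- qda(Species ~ ., data = iris)
    pred <- predict(fit, iris)$class                   # predicted class for each observation
    table(Observed = iris$Species, Predicted = pred)   # training-data confusion matrix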

Car silhouette data

For the iris data set, whether we use LDA or QDA makes no difference: we end up with the same predictions either way (the probabilities differ, of course, but the class with the highest probability does not). Of course, this is not always the case. An example of a data set where QDA performs extremely well is the Vehicle Silhouettes data set, in which, based on computer images of vehicle silhouettes, the task is to distinguish among four vehicle types: a bus, a van, and two cars (a Saab and an Opel).

Vehicle silhouette results

Leave-one-out cross-validation results for LDA and QDA on the vehicle silhouette data (rows: observed class; columns: predicted class):

    LDA      bus  opel  saab  van
    bus      209     8    11    2
    opel       4   130    65    3
    saab       2    67   128    2
    van        3     7    13  192

    QDA      bus  opel  saab  van
    bus      214     0     2    1
    opel       0   160    54    2
    saab       0    43   154    0
    van        4     9     7  196

Overall CV error rates: 22.1% for LDA, 14.4% for QDA.
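One way these leave-one-out results could be reproduced (an assumption on my part, since the slides do not show the code) is with the CV = TRUE option of lda()/qda() in MASS, using the copy of the Vehicle data in the mlbench package:

    library(MASS)
    library(mlbench)
    data(Vehicle)                                         # vehicle silhouette data, 4 classes
    lda.cv <- lda(Class ~ ., data = Vehicle, CV = TRUE)   # CV = TRUE returns LOO predictions
    qda.cv <- qda(Class ~ ., data = Vehicle, CV = TRUE)
    table(Vehicle$Class, lda.cv$class)                    # LDA confusion matrix
    table(Vehicle$Class, qda.cv$class)                    # QDA confusion matrix
    mean(Vehicle$Class != lda.cv$class)                   # overall LOO error rate, LDA
    mean(Vehicle$Class != qda.cv$class)                   # overall LOO error rate, QDA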

QDA on the small-sample iris data

Of course, QDA does not always outperform LDA; in fact, it often does quite a bit worse. For example, let's compare LDA with QDA using the iris data in the same manner as our earlier comparison involving multinomial regression (randomly choose 5 observations per class as training data and use the rest to test the fit of the model). QDA has a test error rate of 24.7%, far worse than LDA's 5.2% and multinomial regression's 7.7%.
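A sketch of this comparison for a single random split (the seed and the resulting error rates are illustrative; the 24.7% and 5.2% reported above presumably average over many splits):

    library(MASS)
    set.seed(1)
    # 5 randomly chosen training observations per class; the other 145 are test data
    train <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species), sample, size = 5))
    lda.fit <- lda(Species ~ ., data = iris, subset = train)
    qda.fit <- qda(Species ~ ., data = iris, subset = train)
    mean(predict(lda.fit, iris[-train, ])$class != iris$Species[-train])  # LDA test error
    mean(predict(qda.fit, iris[-train, ])$class != iris$Species[-train])  # QDA test error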

Iris cross-validation comparison

[Figure: test error rate (y-axis, 0.05 to 0.25) versus number of training observations per class (x-axis, 10 to 40) for LDA and QDA; plot not included in this transcription]

Remarks

Why does QDA do so poorly with n = 5 per class? Consider the number of additional parameters that QDA has to estimate: LDA must estimate one 4 x 4 covariance matrix (with p(p+1)/2 = 10 distinct parameters), whereas QDA must estimate three such matrices. Thus, QDA must estimate 20 additional parameters (30 versus 10), which is a lot to ask with a total sample size of 15.

As usual, the tradeoff between LDA and QDA is one of bias and variance: LDA makes stronger assumptions and obtains estimates with lower variance, but has the potential for biased decision boundaries if heterogeneity is truly present.

Ridge/lasso penalized multinomial regression

As a final topic for this section of the course, we will take a brief tour of some of the ways in which penalization and regularization can be applied to the methods of multinomial regression and discriminant analysis. Extending ridge and lasso to logistic and multinomial regression is fairly straightforward; the lasso estimate $\beta$ would minimize

$$Q(\beta) = -\frac{1}{n}\sum_{k=1}^{K} \sum_{i:\, g_i = k} \log \pi_{ik} + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\beta_{kj}|,$$

where

$$\pi_{ik} = \frac{\exp(\eta_{ki})}{\sum_l \exp(\eta_{li})} \quad \text{and} \quad \eta_{ki} = x_i^T \beta_k.$$
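In practice, an objective of this form can be fit over a grid of λ values with the glmnet package (a minimal sketch; the slides do not say which software was used):

    library(glmnet)
    x <- as.matrix(iris[, 1:4])
    y <- iris$Species
    # alpha = 1 gives the lasso penalty; alpha = 0 gives the ridge version discussed next
    fit <- glmnet(x, y, family = "multinomial", alpha = 1)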

Penalization and identifiability

The ridge estimates would of course be defined similarly, albeit with $\beta_{kj}^2$ replacing $|\beta_{kj}|$. For the most part, the extensions are straightforward; the only noticeable change is with regard to the notion of a reference category. In traditional multinomial regression, one category must be chosen as a reference group (i.e., have $\beta_k$ set to 0), or else the problem is not identifiable.

Penalization and identifiability (cont'd)

With penalized regression, this restriction is not necessary. For example, suppose K = 2; then $\beta_{1j} = 1,\ \beta_{2j} = -1$ produces the exact same $\{\pi_{ik}\}$ as $\beta_{1j} = 2,\ \beta_{2j} = 0$. As it is impossible to tell the two models apart (along with an infinite range of other models), we cannot estimate $\{\beta_k\}$.
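To see why: shifting both classes' coefficients for variable j by the same constant c (here c = 1, moving from (1, -1) to (2, 0)) adds $c\,x_{ij}$ to every linear predictor, and this common term cancels from the numerator and denominator of the softmax:

$$\pi_{i1} = \frac{\exp(\eta_{1i})}{\exp(\eta_{1i}) + \exp(\eta_{2i})} = \frac{\exp(\eta_{1i} + c\,x_{ij})}{\exp(\eta_{1i} + c\,x_{ij}) + \exp(\eta_{2i} + c\,x_{ij})}.$$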

Penalization and identifiability (cont'd)

With, say, a ridge penalty, this is no longer the case, as $\sum_k \beta_{kj}^2 = 2$ in the first situation and 4 in the second; the proper estimate is clear. The same holds for the lasso penalty, although of course there is now the possibility of sparsity, perhaps with respect to multiple classes. For example, with $\lambda = 0.1$, the coefficient vector for petal width in the iris data is (0, 0, 1.57). In words, an increase of one cm in petal width increases the log odds of virginica by 1.57 with respect to both setosa and versicolor (those two species serving, in effect, as a common reference group).
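Continuing the glmnet sketch from above (a hedged illustration; the exact coefficient values depend on standardization and the λ grid, so treat the specific number 1.57 as coming from the slides rather than from this code):

    cf <- coef(fit, s = 0.1)                       # one sparse coefficient vector per class
    sapply(cf, function(b) b["Petal.Width", 1])    # petal-width coefficient for each class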

Regularized discriminant analysis

The notion of regularization can also be extended to discriminant analysis. For example, why commit ourselves to LDA or QDA when we could consider the range of intermediate estimates given by

$$\hat\Sigma_k(\alpha) = \alpha \hat\Sigma_k + (1-\alpha)\hat\Sigma?$$

With α = 0 we have LDA, and with α = 1 we have QDA; in between, we have estimates that allow for class-specific variances, but with shrinkage toward the common covariance matrix.
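A minimal sketch of this convex combination for the iris data (the helper name rda_cov is mine, not from any package):

    Sigma.k <- lapply(split(iris[, 1:4], iris$Species), cov)        # class-specific estimates
    Sigma.pooled <- Reduce(`+`, lapply(split(iris[, 1:4], iris$Species),
                           function(d) (nrow(d) - 1) * cov(d))) / (nrow(iris) - 3)
    rda_cov <- function(k, alpha) alpha * Sigma.k[[k]] + (1 - alpha) * Sigma.pooled
    rda_cov("virginica", alpha = 0.5)    # halfway between the QDA and LDA estimates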

Regularized discriminant analysis (cont'd)

Furthermore, one could consider regularizing $\hat\Sigma$ itself in a ridge-like manner:

$$\hat\Sigma(\gamma) = \gamma \hat\Sigma + (1-\gamma)I$$

Recall that the LDA/QDA estimates rely on $\hat\Sigma^{-1}$; thus, the addition of a ridge down the diagonal is perhaps a very good idea, as it stabilizes this inverse. Furthermore, it ensures that the inverse always exists.
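A corresponding one-liner continuing the sketch above (with γ < 1 the blended matrix is positive definite, so the inverse exists even if $\hat\Sigma$ itself is singular, e.g. when p > n):

    ridge_cov <- function(Sigma, gamma) gamma * Sigma + (1 - gamma) * diag(nrow(Sigma))
    solve(ridge_cov(Sigma.pooled, gamma = 0.9))    # stabilized inverse of the pooled estimate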

Shrunken centroids

As a final example of regularization, we could apply a lasso-type penalty to the means $\{\mu_k\}$. This would have the advantage of parsimony (a smaller number of variables would actually be involved in the model) as well as a reduction in variance, especially in settings where p is large. The means $\{\mu_k\}$ are sometimes called the centroids of the classes; the method described on this slide is therefore called nearest shrunken centroids.
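A rough sketch of the idea (the full method, as implemented in, e.g., the pamr package, also standardizes each difference by a within-class standard error; that detail is omitted here): soft-threshold the differences between each class centroid and the overall mean, so that some variables drop out of the classifier entirely.

    x <- scale(iris[, 1:4])
    overall   <- colMeans(x)
    centroids <- sapply(split(as.data.frame(x), iris$Species), colMeans)   # p x K matrix
    soft <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)         # lasso-type shrinkage
    shrunken <- overall + soft(centroids - overall, lambda = 0.3)          # shrunken class centroids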