Statistics 202: Data Mining
Linear Discriminant Analysis
Based in part on slides from the textbook and slides of Susan Holmes
November 9, 2012


Nearest centroid rule

Suppose we break up our data matrix by the labels, yielding (X_j), 1 ≤ j ≤ k, with sizes nrow(X_j) = n_j. A simple rule for classification is: assign a new observation with features x to

    f(x) = argmin_{1 ≤ j ≤ k} d(x, X_j)

What do we mean by distance here?

Nearest centroid rule

If we can assign a central point or centroid µ_j to each X_j, then we can define the distance above as the distance to the centroid µ_j. This yields the nearest centroid rule

    f(x) = argmin_{1 ≤ j ≤ k} d(x, µ_j)

Nearest centroid rule

This rule is described completely by the functions

    h_ij(x) = d(x, µ_j) − d(x, µ_i)

with f(x) being any i such that h_ij(x) ≥ 0 for all 1 ≤ j ≤ k.

Nearest centroid rule

If d(x, y) = ‖x − y‖_2, then the natural centroid is

    µ_j = (1/n_j) Σ_{l=1}^{n_j} X_{j,l}

This rule just classifies points to the nearest µ_j. But, if the covariance matrix of our data is not I, this rule ignores structure in the data...
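
As a concrete illustration, here is a minimal numpy sketch of the nearest centroid rule with Euclidean distance; the names (X, y, x_new, fit_centroids) are illustrative, not from the slides.

```python
import numpy as np

def fit_centroids(X, y):
    """Compute the centroid mu_j of each class label appearing in y."""
    labels = np.unique(y)
    centroids = np.vstack([X[y == j].mean(axis=0) for j in labels])
    return labels, centroids

def nearest_centroid(x_new, labels, centroids):
    """Assign x_new to the class whose centroid is closest in Euclidean distance."""
    dists = np.linalg.norm(centroids - x_new, axis=1)
    return labels[np.argmin(dists)]
```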

Why we should use covariance

[Figure: scatter plot of data whose covariance is far from I, so the Euclidean nearest centroid rule ignores its structure; axis ticks omitted.]

Choice of distance

Often, there is some background model for our data that is equivalent to a given procedure. For this nearest centroid rule, using the Euclidean distance effectively assumes that within each set of points X_j, the rows are multivariate Gaussian with covariance matrix proportional to I. It also implicitly assumes that the n_j's are roughly equal. For instance, if one n_j is very small, the fact that a data point is close to that µ_j doesn't necessarily mean we should conclude it has label j: there might be a huge number of points with label i near that µ_j.

Gaussian discriminant functions

Suppose each group with label j has its own mean µ_j and covariance matrix Σ_j, as well as proportion π_j. The Gaussian discriminant functions are defined as

    h_ij(x) = h_i(x) − h_j(x)
    h_i(x) = log π_i − log|Σ_i| / 2 − (x − µ_i)^T Σ_i^{−1} (x − µ_i) / 2

The first term weights the prior probability; the second two terms are log φ_{µ_i, Σ_i}(x) = log L(µ_i, Σ_i | x), up to an additive constant that does not depend on i. Ignoring the first two terms, the h_i's are essentially (negative) within-group Mahalanobis distances...

Gaussian discriminant functions

The classifier assigns x the label i if h_i(x) ≥ h_j(x) for all j. Or,

    f(x) = argmax_{1 ≤ i ≤ k} h_i(x)

This is equivalent to a Bayesian rule. We'll see more Bayesian rules when we talk about naïve Bayes... When all the Σ_i's and π_i's are identical, the classifier is just nearest centroid using Mahalanobis distance instead of Euclidean distance.
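
A short sketch of evaluating one Gaussian discriminant function h_i(x) given class parameters (π_i, µ_i, Σ_i); the function name is made up for illustration.

```python
import numpy as np

def gaussian_discriminant(x, pi_i, mu_i, Sigma_i):
    """h_i(x) = log pi_i - log|Sigma_i|/2 - (x - mu_i)^T Sigma_i^{-1} (x - mu_i)/2."""
    _, logdet = np.linalg.slogdet(Sigma_i)        # numerically stable log-determinant
    diff = x - mu_i
    maha = diff @ np.linalg.solve(Sigma_i, diff)  # squared Mahalanobis distance
    return np.log(pi_i) - 0.5 * logdet - 0.5 * maha

# The classifier is then f(x) = argmax over i of gaussian_discriminant(x, pi_i, mu_i, Sigma_i).
```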

Estimating discriminant functions

In practice, we will have to estimate π_j, µ_j, Σ_j. Obvious estimates:

    π̂_j = n_j / Σ_{j=1}^{k} n_j
    µ̂_j = (1/n_j) Σ_{l=1}^{n_j} X_{j,l}
    Σ̂_j = (1/(n_j − 1)) (X_j − 1 µ̂_j^T)^T (X_j − 1 µ̂_j^T)

where 1 µ̂_j^T subtracts the centroid from each row of X_j.
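
These plug-in estimates can be computed directly; a sketch (the helper name is illustrative):

```python
import numpy as np

def estimate_class_parameters(X, y):
    """Return the obvious estimates (pi_j, mu_j, Sigma_j) for each class label j."""
    labels = np.unique(y)
    n = len(y)
    params = {}
    for j in labels:
        Xj = X[y == j]
        n_j = Xj.shape[0]
        params[j] = {
            "pi": n_j / n,
            "mu": Xj.mean(axis=0),
            "Sigma": np.cov(Xj, rowvar=False, ddof=1),  # divides by n_j - 1
        }
    return params
```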

Estimating discriminant functions

If we assume that the covariance matrix is the same within all groups, then we might also form the pooled estimate

    Σ̂_P = Σ_{j=1}^{k} (n_j − 1) Σ̂_j / Σ_{j=1}^{k} (n_j − 1)

If we use the pooled estimate Σ̂_j = Σ̂_P and plug these into the Gaussian discriminants, the functions h_ij(x) are linear (or affine) functions of x. This is called Linear Discriminant Analysis (LDA). Not to be confused with the other LDA (Latent Dirichlet Allocation)...
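
In practice one would rarely code LDA by hand; here is a sketch using scikit-learn's LinearDiscriminantAnalysis on the iris features shown in the next two figures (assuming the standard iris column ordering).

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X = iris.data[:, [3, 2]]            # petal.width, petal.length
y = iris.target

lda = LinearDiscriminantAnalysis()  # uses a pooled covariance estimate
lda.fit(X, y)
print(lda.predict(X[:5]))           # predicted class labels
print(lda.predict_proba(X[:5]))     # estimated P(Y = j | x)
```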

Linear Discriminant Analysis using (petal.width, petal.length)

[Figure: LDA decision regions for the iris data in the (petal.width, petal.length) plane; axis ticks omitted.]

Linear Discriminant Analysis using (sepal.width, sepal.length)

[Figure: LDA decision regions for the iris data in the (sepal.width, sepal.length) plane; axis ticks omitted.]

Quadratic Discriminant Analysis

If we don't use the pooled estimate, and instead plug the individual estimates Σ̂_j into the Gaussian discriminants, the functions h_ij(x) are quadratic functions of x. This is called Quadratic Discriminant Analysis (QDA).
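
QDA differs only in that each class keeps its own covariance estimate; a scikit-learn sketch, reusing X and y from the LDA example above:

```python
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

qda = QuadraticDiscriminantAnalysis()  # one covariance matrix per class
qda.fit(X, y)
print(qda.predict(X[:5]))
```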

Quadratic Discriminant Analysis using (petal.width, petal.length)

[Figure: QDA decision regions for the iris data in the (petal.width, petal.length) plane; axis ticks omitted.]

Quadratic Discriminant Analysis using (sepal.width, sepal.length)

[Figure: QDA decision regions for the iris data in the (sepal.width, sepal.length) plane; axis ticks omitted.]

Motivation for Fisher's rule

[Figure: scatter plot (same axes as the earlier covariance example) motivating Fisher's choice of discriminant direction; axis ticks omitted.]

Fisher's discriminant function

Fisher proposed to classify using a linear rule. He first decomposed the total sum of squares into between-group and within-group parts,

    (X − 1 µ̄^T)^T (X − 1 µ̄^T) = Ŝ_B + Ŝ_W

where µ̄ is the overall mean. Then, he proposed

    v̂ = argmax_{v : v^T Ŝ_W v = 1} v^T Ŝ_B v

Having found v̂, form X_j v̂ and the centroid η̂_j = mean(X_j v̂). In the two-class problem (k = 2), this is the same as LDA.

Fisher's discriminant functions

The direction v̂_1 is an eigenvector of some matrix. There are others, up to k − 2 more. Suppose we find all k − 1 vectors and form X_j V^T, each one an n_j × (k − 1) matrix with centroid η̂_j ∈ R^{k−1}. The matrix V determines a map from R^p to R^{k−1}, so given a new data point x we can compute V x ∈ R^{k−1}. This gives rise to the classifier

    f(x) = nearest centroid(V x)

This is LDA (assuming π_j = 1/k)...
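
A sketch of computing the (up to) k − 1 Fisher directions as leading solutions of the generalized eigenproblem Ŝ_B v = λ Ŝ_W v; the function and variable names are illustrative, and Ŝ_W is assumed positive definite.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, y):
    """Return the (up to) k - 1 Fisher discriminant directions as columns."""
    labels = np.unique(y)
    p = X.shape[1]
    grand_mean = X.mean(axis=0)
    S_W = np.zeros((p, p))   # within-group scatter
    S_B = np.zeros((p, p))   # between-group scatter
    for j in labels:
        Xj = X[y == j]
        mu_j = Xj.mean(axis=0)
        S_W += (Xj - mu_j).T @ (Xj - mu_j)
        S_B += Xj.shape[0] * np.outer(mu_j - grand_mean, mu_j - grand_mean)
    # generalized symmetric eigenproblem S_B v = lambda S_W v
    eigvals, eigvecs = eigh(S_B, S_W)
    order = np.argsort(eigvals)[::-1][: len(labels) - 1]
    return eigvecs[:, order]
```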

Discriminant models in general

A discriminant model is generally a model that estimates

    P(Y = j | x),  1 ≤ j ≤ k

That is, given that the features I observe are x, the probability I think the label is j... LDA and QDA are actually generative models since they specify P(X = x | Y = j). There are lots of discriminant models...

Logistic regression

The logistic regression model is ubiquitous in binary classification (two-class) problems.

Model:

    P(Y = 1 | x) = e^{α + x^T β} / (1 + e^{α + x^T β}) = π(α, β, x)

Logistic regression

Software that fits a logistic regression model produces an estimate of β based on a data matrix X_{n×p} and binary labels Y_{n×1} ∈ {0, 1}^n. It fits the model by minimizing what we call the deviance

    DEV(α, β) = −2 Σ_{i=1}^{n} [ Y_i log π(α, β, X_i) + (1 − Y_i) log(1 − π(α, β, X_i)) ]

While not immediately obvious, this is a convex minimization problem, hence fairly easy to solve. Unlike trees, the convexity yields a globally optimal solution.
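
A sketch of fitting the model and computing the deviance with scikit-learn, using the setosa vs. rest split from the next figure; note that scikit-learn's LogisticRegression adds an L2 penalty by default, so this is only approximately the unpenalized maximum likelihood fit.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, [1, 0]]              # sepal.width, sepal.length
y = (iris.target == 0).astype(int)    # setosa vs. {versicolor, virginica}

clf = LogisticRegression()            # L2-regularized by default; see note above
clf.fit(X, y)

pi_hat = clf.predict_proba(X)[:, 1]   # estimated P(Y = 1 | x) for each row
deviance = -2 * np.sum(y * np.log(pi_hat) + (1 - y) * np.log(1 - pi_hat))
```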

Logistic regression, setosa vs. virginica, versicolor, using (sepal.width, sepal.length)

[Figure: logistic regression decision boundary for setosa vs. {virginica, versicolor} in the (sepal.width, sepal.length) plane; axis ticks omitted.]

Logistic regression, virginica vs. setosa, versicolor, using (sepal.width, sepal.length)

[Figure: logistic regression decision boundary for virginica vs. {setosa, versicolor} in the (sepal.width, sepal.length) plane; axis ticks omitted.]

Logistic regression, versicolor vs. setosa, virginica, using (sepal.width, sepal.length)

[Figure: logistic regression decision boundary for versicolor vs. {setosa, virginica} in the (sepal.width, sepal.length) plane; axis ticks omitted.]

Logistic regression

Logistic regression produces an estimate of P(Y = 1 | x) = π(x). Typically, we classify as 1 if π(x) > 0.5. This yields a 2 × 2 confusion matrix:

                 Predicted: 0    Predicted: 1
    Actual: 0        TN              FP
    Actual: 1        FN              TP

From the 2 × 2 confusion matrix, we can compute Sensitivity, Specificity, etc.
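
Continuing the logistic regression sketch above (reusing y and pi_hat), the confusion matrix at the 0.5 threshold and the two summary rates:

```python
import numpy as np

y_pred = (pi_hat > 0.5).astype(int)

TN = np.sum((y == 0) & (y_pred == 0))
FP = np.sum((y == 0) & (y_pred == 1))
FN = np.sum((y == 1) & (y_pred == 0))
TP = np.sum((y == 1) & (y_pred == 1))

sensitivity = TP / (TP + FN)   # true positive rate
specificity = TN / (TN + FP)   # true negative rate
```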

Logistic regression

However, we could choose the threshold differently, perhaps based on estimates of the prior probabilities of 0's and 1's. Each threshold 0 ≤ t ≤ 1 then yields a new 2 × 2 confusion matrix:

                 Predicted: 0    Predicted: 1
    Actual: 0       TN(t)           FP(t)
    Actual: 1       FN(t)           TP(t)

ROC curve

Generally speaking, we prefer classifiers that are both highly sensitive and highly specific. These confusion matrices can be summarized using an ROC (Receiver Operating Characteristic) curve. This is a plot of the curve

    (1 − Specificity(t), Sensitivity(t)),  0 ≤ t ≤ 1

Often, Specificity is referred to as TNR (True Negative Rate), and 1 − Specificity(t) as FPR (False Positive Rate). Often, Sensitivity is referred to as TPR (True Positive Rate), and 1 − Sensitivity(t) as FNR (False Negative Rate).
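
A sketch of tracing the ROC curve by sweeping the threshold t, again reusing y and pi_hat from the logistic regression sketch; scikit-learn's roc_curve computes the same thing more efficiently.

```python
import numpy as np

thresholds = np.linspace(0, 1, 101)
fpr = np.array([np.mean(pi_hat[y == 0] > t) for t in thresholds])  # 1 - Specificity(t)
tpr = np.array([np.mean(pi_hat[y == 1] > t) for t in thresholds])  # Sensitivity(t)

# equivalent, using scikit-learn:
# from sklearn.metrics import roc_curve
# fpr, tpr, thresholds = roc_curve(y, pi_hat)
```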

AUC: Area under ROC curve

[Figure: an ROC curve (True Positive Rate vs. False Positive Rate) together with the random-guessing diagonal and the point corresponding to the threshold t = 0.5.]

ROC curve

Points in the upper left of the ROC plot are good. Any point on the diagonal line represents a classifier that is guessing randomly. A tree classifier, as we've discussed it, seems to correspond to only one point in the ROC plot, but one can estimate probabilities from the class frequencies in its terminal nodes and then vary the threshold.

ROC curve

A common numeric summary of the ROC curve is

    AUC(ROC curve) = Area under the ROC curve

It can be interpreted as an estimate of the probability that the classifier will give a randomly chosen positive instance a higher score than a randomly chosen negative instance. Its maximum value is 1; for a random guesser, the AUC is 0.5.
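
Computing the AUC, plus a direct check of its probabilistic interpretation (a sketch, reusing y and pi_hat from the logistic regression sketch):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y, pi_hat)   # area under the ROC curve

# interpretation: probability a random positive scores higher than a random negative
pos, neg = pi_hat[y == 1], pi_hat[y == 0]
pairwise = (np.mean(pos[:, None] > neg[None, :])
            + 0.5 * np.mean(pos[:, None] == neg[None, :]))
# pairwise agrees with auc (ties counted as 1/2)
```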

AUC: Area under ROC curve

[Figure: ROC curve for the logistic regression classifier compared with random guessing (True Positive Rate vs. False Positive Rate).]

ROC curve: logistic, rpart, lda

[Figure: ROC curves comparing the logistic regression, rpart, and lda classifiers.]
