Classification: Linear Discriminant Analysis

Discriminant analysis uses sample information about individuals that are known to belong to one of several populations for the purposes of classification. Based on the variable values for these individuals (called the training data), whose population memberships are known, we determine a rule to classify new individuals (called the test data) whose population memberships are unknown. We can then collect variable values for a new individual and use this classification rule to classify it (i.e., predict which population it belongs to).

When there are two populations, Fisher's Linear Discriminant Function specifies the linear function of the q variables that best separates the sampled individuals into the two groups. The classification rule is then based on that linear combination of the variables.

Examples of Classification

Admissions officials in colleges attempt to use data on applicants (GPA, SAT scores, etc.) to classify them into two groups: those who successfully graduate and those who fail to graduate. They can use the data on previous years' applicants (who are known to have either graduated or not) to build a classification rule.

Archaeologists and zooarchaeologists attempt to classify animal remains into groups (such as male/female) based on measurements on bones.

Marketing experts can use data on past potential customers to classify future individuals as likely buyers or unlikely buyers.

Lifetime data can only be measured by destroying an object. We may want to classify an object as defective or not (without destroying it) using certain preliminary measurements, after obtaining a small sample of objects that have been seen to be defective or not.

Mathematical Details of Fisher's Linear Discriminant Analysis (LDA)

We assume that we are classifying individuals into one of two known groups based on their values of the variables $x_1, x_2, \ldots, x_q$. We have the data (on the q variables) for $n_1$ individuals that are known to belong to Group 1 and for $n_2$ individuals that are known to belong to Group 2.

We will use the linear combination of these variables

$$z = a_1 x_1 + a_2 x_2 + \cdots + a_q x_q$$

that maximizes the ratio of the between-group variance of $z$ to the within-group variance of $z$ for the observed sample of individuals whose groups are known. (See the geometric interpretation in $q = 2$ dimensions.)

Mathematical Details of LDA (continued)

That is, we choose $\mathbf{a} = (a_1, \ldots, a_q)'$ to maximize

$$V = \frac{\mathbf{a}'\mathbf{B}\mathbf{a}}{\mathbf{a}'\mathbf{S}\mathbf{a}},$$

where $\mathbf{S}$ is the pooled within-group sample covariance matrix and $\mathbf{B}$ is the covariance matrix of the group sample means (see p. 143 for formulas). For two groups, the choice of $\mathbf{a}$ that maximizes $V$ is

$$\mathbf{a} = \mathbf{S}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2).$$
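As a concrete illustration, the two-group direction can be computed directly from the formula above. The following is a minimal R sketch on simulated data (the matrices X1 and X2 and the sample sizes are hypothetical, not from these notes); the lda function discussed later performs the equivalent computation for you.

    set.seed(1)
    X1 <- matrix(rnorm(50 * 3, mean = 0), ncol = 3)   # 50 individuals from Group 1 (illustrative data)
    X2 <- matrix(rnorm(40 * 3, mean = 1), ncol = 3)   # 40 individuals from Group 2
    n1 <- nrow(X1); n2 <- nrow(X2)
    S <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)  # pooled within-group covariance
    a <- solve(S, colMeans(X1) - colMeans(X2))                      # a = S^{-1} (xbar_1 - xbar_2)
    zbar1 <- mean(X1 %*% a); zbar2 <- mean(X2 %*% a)                # group means of the discriminant scores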

The LDA Classification Rule

Let $z = \mathbf{a}'\mathbf{x} = a_1 x_1 + a_2 x_2 + \cdots + a_q x_q$ be the discriminant score for any individual. Also let $\bar{z}_1$ be the sample mean of the discriminant scores for the individuals known to come from Group 1, and let $\bar{z}_2$ be the mean discriminant score for the individuals known to come from Group 2.

Case I: Suppose $\bar{z}_1 > \bar{z}_2$. For a new individual with discriminant score $z$, classify this individual into Group 1 if $z > (\bar{z}_1 + \bar{z}_2)/2$; otherwise classify it into Group 2.

Case II: Suppose $\bar{z}_1 < \bar{z}_2$. For a new individual with discriminant score $z$, classify this individual into Group 1 if $z < (\bar{z}_1 + \bar{z}_2)/2$; otherwise classify it into Group 2.

The LDA method is implemented in R by the lda function in the MASS package.
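As a small illustration of the lda function, here is a sketch using two species from R's built-in iris data (an illustrative choice of data and variable names, not from these notes):

    library(MASS)
    train <- subset(iris, Species %in% c("versicolor", "virginica"))   # two known groups
    train$Species <- droplevels(train$Species)
    fit <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = train)
    fit$scaling                                     # the discriminant coefficients a_1, ..., a_q
    newx <- data.frame(Sepal.Length = 6.0, Sepal.Width = 3.0,
                       Petal.Length = 4.5, Petal.Width = 1.5)          # a new individual
    predict(fit, newdata = newx)$class              # predicted group for the new individual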

Remarks about the LDA Classification Rule

This LDA approach assumes the data come from one of two multivariate normal populations, each with the same covariance matrix. The LDA rule is equivalent to classifying an individual having data vector $\mathbf{x}$ into Group 1 if and only if $MVN(\mathbf{x}; \bar{\mathbf{x}}_1, \mathbf{S}) > MVN(\mathbf{x}; \bar{\mathbf{x}}_2, \mathbf{S})$, where $MVN$ represents the multivariate normal density function.

This rule is appropriate only if the prior probabilities of the individual being in each group are equal. If the prior probabilities $p_1$ and $p_2$ are not equal, then $z - (\bar{z}_1 + \bar{z}_2)/2$ would be compared to $\ln(p_2/p_1)$ rather than to 0 as before. The rule also assumes the cost of misclassification is the same whether an individual is misclassified into Group 1 or misclassified into Group 2.
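In R, unequal prior probabilities can be supplied to lda through its prior argument. Continuing the illustrative iris sketch above:

    fit_prior <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                     data = train, prior = c(0.8, 0.2))   # p_1 and p_2, in the order of the factor levels
    predict(fit_prior, newdata = newx)$posterior          # posterior probabilities reflect the unequal priors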

Generalizations to More than Two Groups

When we have three or more groups, the classification rule can be generalized. If there are $k$ (multivariate normal) populations, then we classify an individual having data vector $\mathbf{x}$ into Group $i$ if and only if $MVN(\mathbf{x}; \bar{\mathbf{x}}_i, \mathbf{S}) > MVN(\mathbf{x}; \bar{\mathbf{x}}_j, \mathbf{S})$ for all $j \neq i$, where $MVN$ represents the multivariate normal density function. For three groups, this is equivalent to basing the classification on a series of pairwise rules for choosing between Groups 1 and 2, between Groups 1 and 3, and between Groups 2 and 3 (see pp. 150-151 for details).

This rule can be further generalized to the case in which the prior probabilities of being in each group are not equal, and in which the populations have some known non-normal distributions: classify an individual having data vector $\mathbf{x}$ into Group $i$ if and only if $p_i f_i(\mathbf{x}) > p_j f_j(\mathbf{x})$ for all $j \neq i$, where $f_j(\cdot)$ is the $j$-th of the density functions.
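The lda function handles more than two groups directly. A brief illustrative example (not from these notes), using all three iris species and assuming MASS is already loaded:

    fit3 <- lda(Species ~ ., data = iris)           # three-group LDA
    table(predict(fit3)$class, iris$Species)        # cross-tabulate predicted vs. true species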

Other Types of Discriminant Analysis

When the covariance matrices for Populations 1 and 2 are not believed to be equal, quadratic discriminant analysis (QDA) is more appropriate than LDA. QDA is more flexible than LDA, but it can often overfit the observed data, creating a rule that classifies the known observations nearly perfectly but that does not generalize as well to future observations. QDA is implemented by the R function qda in the MASS package.

A compromise between LDA and QDA is regularized discriminant analysis. When the populations are not close to multivariate normal, an alternative to LDA is logistic discrimination.
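A QDA fit looks just like the LDA fit in R. Continuing the illustrative two-group iris example:

    qfit <- qda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = train)
    predict(qfit, newdata = newx)$class             # classification under group-specific covariance matrices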

Judging the Performance of the Discriminant Function

To judge how well the discriminant rule is classifying, we could calculate the plug-in misclassification rate, which is simply the proportion of the known individuals that would be misclassified if we used the rule to classify them. Since the rule has been derived using those known individuals, the plug-in rate is typically too optimistic: it underestimates the rate at which the rule would misclassify new individuals.

A better approach is the cross-validation (or "leave-one-out") method. This uses all but one of the known individuals to derive a classification rule and then, based on that rule, classifies the left-out individual. This is done (separately) for each of the known individuals, and the misclassification rate is the proportion of those classifications that are incorrect. The R function predict can help produce these classification rates.
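Both rates can be computed in a few lines. A sketch (not from these notes), continuing the two-group iris example; lda's CV = TRUE argument carries out the leave-one-out classifications automatically:

    plug_in <- mean(predict(fit)$class != train$Species)    # plug-in rate: typically too optimistic
    cv_fit <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                  data = train, CV = TRUE)                   # leave-one-out cross-validation
    loo_rate <- mean(cv_fit$class != train$Species)
    c(plug_in = plug_in, leave_one_out = loo_rate)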

Using Regression Methods for Classification

In classification problems, we use one or more numerical variables to predict the category of an individual with respect to some categorical variable. Since a major purpose of regression is prediction, could we use regression techniques to classify individuals?

One option: code the categories as numerical values (1, 2, 3, ...) and use linear regression to predict the category based on the numerical explanatory variables. Problem: we must specify an ordering of the categories, which may be arbitrary. It is easier with binary (two-category) response data, as the ordering is not as important.

Logistic Regression

A better regression approach is to use a method designed for categorical responses: logistic regression. This works best when there are only two categories: we can code them as $Y = 0$ or $Y = 1$. A logistic regression predicts the probability that $Y = 1$, given a value of some explanatory variable $X$:

$$P(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}.$$

This model (unlike linear regression) will always produce a predicted probability between 0 and 1. The parameters $\beta_0$ and $\beta_1$ are estimated from the training data using the maximum likelihood method. This can be done easily in R using the glm function.
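A minimal glm sketch (illustrative, not from these notes), continuing the two-group iris example; family = binomial gives the logistic model:

    train$Y <- as.numeric(train$Species == "virginica")          # code the two categories as 0/1
    logit_fit <- glm(Y ~ Petal.Length, data = train, family = binomial)
    predict(logit_fit, newdata = newx, type = "response")        # predicted probability that Y = 1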

Classifications Based on Logistic Regression

Assuming two groups, a simple classification rule is to predict that $Y = 1$ for any individual whose predicted probability that $Y = 1$ is greater than 0.5. If $P(Y = 1) \leq 0.5$, we would predict $Y = 0$ for that individual.

If the number of individuals in the population having $Y = 1$ is believed to differ from the number having $Y = 0$, then we could adjust our cutoff value away from 0.5. One option: predict $Y = 1$ if $P(Y = 1) > p$, where $p$ is chosen to minimize the misclassification rate for the individuals in the training set.
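One way to pick such a cutoff, as a sketch continuing the glm fit above (the grid of candidate cutoffs is an arbitrary illustrative choice):

    phat <- predict(logit_fit, type = "response")                 # fitted probabilities for the training set
    cutoffs <- seq(0.05, 0.95, by = 0.01)
    errs <- sapply(cutoffs, function(p) mean((phat > p) != train$Y))  # training misclassification rate at each cutoff
    cutoffs[which.min(errs)]                                      # cutoff with the smallest training error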

Multiple Logistic Regression

If we have several numerical variables measured on each individual, we can generalize our logistic regression model to

$$P(Y = 1 \mid X_1, \ldots, X_q) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_q X_q}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_q X_q}}.$$

Z-tests about the regression coefficients assess the importance of each individual explanatory variable to the classification. The logistic model can be extended to situations in which we are classifying into more than two categories, but it is more common to use other classification methods (like discriminant analysis) in those cases.
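A multiple logistic regression in R, continuing the same illustrative data; the coefficient table from summary contains the z-tests mentioned above:

    mlogit_fit <- glm(Y ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                      data = train, family = binomial)
    summary(mlogit_fit)$coefficients                              # estimates, standard errors, z values, p-values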