Spring 2006: Linear Discriminant Analysis, Etc.


Brian Junker
April 17, 2006

- Review: The Bayes Classifier
- Linear and Quadratic Discriminant Analysis and Friends
  - Linear regression of an indicator matrix
  - Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA)
  - (Polytomous) Logistic Regression

Review: The Bayes Classifier

For the discrete loss matrix $L = [L(k,l)]$, the expected prediction error is

$$\mathrm{EPE} = E[L(Y, f(X))] = E_X \sum_{k=1}^{K} L(k, f(X))\, P(Y = k \mid X);$$

we can minimize pointwise to obtain

$$f(x) = \arg\min_{l} \sum_{k=1}^{K} L(k,l)\, P(Y = k \mid X = x).$$

When $L(k,l) = 1_{\{k \neq l\}}$ (zero-one loss), we obtain

$$f(x) = \arg\min_{l} \sum_{k \neq l} P(Y = k \mid X = x) = \arg\min_{l} \{1 - P(Y = l \mid X = x)\} = \arg\max_{l} P(Y = l \mid X = x).$$

This is the posterior mode, or Bayes classifier. Its error rate is the Bayes rate.
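The pointwise minimization above can be sketched in a few lines of NumPy. This is only an illustration: the posterior vector and the asymmetric loss matrix are made-up numbers, and the function name `bayes_classify` is hypothetical. Classes are indexed from 0 rather than 1.

```python
import numpy as np

def bayes_classify(posterior, loss):
    """Return the class l minimizing sum_k L(k, l) * P(Y = k | X = x).

    posterior: shape (K,), P(Y = k | X = x) for k = 0..K-1
    loss:      shape (K, K), loss[k, l] = cost of predicting l when the truth is k
    """
    expected_loss = posterior @ loss  # entry l is sum_k posterior[k] * loss[k, l]
    return int(np.argmin(expected_loss))

# Made-up posterior probabilities at a single point x (assumption for illustration)
posterior = np.array([0.2, 0.5, 0.3])

# Zero-one loss: L(k, l) = 1 if k != l, else 0
zero_one = 1.0 - np.eye(3)
label_01 = bayes_classify(posterior, zero_one)

# A hypothetical asymmetric loss: misclassifying a true class 0 costs 10
asym = zero_one.copy()
asym[0, :] = [0.0, 10.0, 10.0]
label_asym = bayes_classify(posterior, asym)
```

Under zero-one loss the rule picks the posterior mode (class 1 here), while the asymmetric loss moves the decision to class 0 even though its posterior probability is the smallest.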

Linear and Quadratic Discriminant Analysis and Friends

All these methods produce classification functions $\delta_k(x)$ with which we classify $x$ into class $k_x = \arg\max_k \delta_k(x)$.

- Generalization of our earlier linear regression method: linear regression of an indicator matrix; $\delta_k(x)$ will be linear.
- Linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) are variations on a Gaussian mixture model that uses training data to define the clusters (hence no E-M needed).
  - LDA: common variance/covariance matrix $\Sigma$ within each cluster; $\delta_k(x)$ is linear in $x$.
  - QDA: different $\Sigma_k$ in each cluster $k$; $\delta_k(x)$ is quadratic in $x$.
- Logistic regression leads to similar classification rules ($\delta_k(x)$ is again linear) but treats the relationship between data and model differently.

Linear regression of an indicator matrix

Let $Y_{n \times K} = [y_{ij}]$ be a class-indicator matrix; that is, there are $n$ rows for the $n$ training observations, and each row has $K-1$ 0's and one 1 indicating the class of the observation.

Let $X_{n \times p}$ be the matrix of features (including a column of 1's). Then

$$\hat{B} = (X^T X)^{-1} X^T Y$$

is a matrix of linear regression coefficients, one column for each column of $Y$.

For a new row-vector observation $x$, the fitted values

$$\delta(x) = [\delta_1(x), \ldots, \delta_K(x)] = x \hat{B}$$

provide a classification rule:

$$\mathrm{Class}(x) = \arg\max_k \delta_k(x).$$
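A minimal sketch of the indicator-matrix regression just described, assuming NumPy and simulated data (three well-separated Gaussian clusters, chosen only for illustration). The function names are hypothetical, classes are coded 0..K-1, and `numpy.linalg.lstsq` is used in place of forming $(X^T X)^{-1}$ explicitly.

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Least-squares fit of the n x K indicator matrix Y on X (X includes a column of 1s)."""
    Y = np.eye(K)[g]                               # row i has a single 1 in column g[i]
    B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)  # minimizes ||Y - X B||^2 over B
    return B_hat

def classify(X_new, B_hat):
    """Class(x) = argmax_k delta_k(x), with delta(x) = x B_hat, for every row of X_new."""
    return np.argmax(X_new @ B_hat, axis=1)

# Simulated training data (an assumption, not data from the lecture)
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
g = np.repeat(np.arange(3), 50)                    # class labels 0, 1, 2
feats = means[g] + rng.normal(size=(150, 2))
X = np.column_stack([np.ones(len(g)), feats])      # prepend the column of 1s

B_hat = fit_indicator_regression(X, g, K=3)
train_error = np.mean(classify(X, B_hat) != g)
```

Because the rows of $Y$ sum to 1 and $X$ contains a column of 1's, the fitted values in each row of `X @ B_hat` also sum to 1, although individual entries can fall outside $[0, 1]$.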

Some interpretations:

- It is easy to verify that $\delta(x_i) = x_i \hat{B}$ for the regression coefficients $\hat{B}$ that minimize the sum of squares

$$\min_{B} \sum_{i=1}^{n} \| y_i - x_i B \|^2;$$

  and for a new observation $x$, classifying according to

$$\mathrm{Class}(x) = \arg\min_k \| \delta(x) - t_k \|^2,$$

  where $t_k$ is the $k$th row of $I_{K \times K}$ (identity matrix), is the same as classifying according to $\mathrm{Class}(x) = \arg\max_k \delta_k(x)$.
- Can also view the linear regression functions $\delta_k(x)$ here as crude approximations to $P(\text{obs } i \text{ belongs to class } k \mid x_i)$ in the Bayes classifier.

In either case, increasing the dimension of $x$ should improve the classification; e.g.:

- Polynomial functions of $X$;
- Projecting $X$ onto an appropriate orthogonal set of basis functions;
- etc.

Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA)

Here we start more explicitly from the Bayes classifier, which chooses according to

$$\mathrm{Class}(x) = \arg\max_k P[G_i = k \mid x_i = x],$$

where $G_i$ is the true classification of observation $i$.

If $\pi_k$ are the prior probabilities of the $K$ classes ($P[G_i = k] = \pi_k$), then clearly

$$P[G_i = k \mid x_i = x] = \frac{f_k(x)\, \pi_k}{\sum_{l=1}^{K} f_l(x)\, \pi_l}$$

and, as we said before, our main task is to model the within-class densities $f_k(x)$.

LDA, QDA and related methods begin with the Gaussian model

$$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right\}$$

(and so these are the classification analogues to Gaussian-mixture-model cluster analysis).
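The posterior probabilities under the Gaussian model can be computed directly from Bayes' rule. The sketch below assumes NumPy, known (not estimated) parameters, and hypothetical function names; it works on the log scale to avoid numerical underflow, which is an implementation choice rather than part of the model.

```python
import numpy as np

def log_gaussian_density(x, mu, Sigma):
    """log f_k(x) for a p-variate Gaussian with mean mu and covariance Sigma."""
    p = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)         # log |Sigma|
    maha = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return -0.5 * (p * np.log(2 * np.pi) + logdet + maha)

def posterior(x, priors, mus, Sigmas):
    """P[G = k | x] = f_k(x) pi_k / sum_l f_l(x) pi_l, computed on the log scale."""
    log_num = np.array([np.log(pi) + log_gaussian_density(x, mu, S)
                        for pi, mu, S in zip(priors, mus, Sigmas)])
    log_num -= log_num.max()                     # guard against underflow in exp
    w = np.exp(log_num)
    return w / w.sum()

# Two classes with made-up parameters (assumptions for illustration)
priors = [0.5, 0.5]
mus = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
Sigmas = [np.eye(2), np.eye(2)]

post_mid = posterior(np.array([1.0, 0.0]), priors, mus, Sigmas)    # halfway between the means
post_near0 = posterior(np.array([0.0, 0.0]), priors, mus, Sigmas)  # at the class-0 mean
```

With equal priors and equal covariances, a point halfway between the two means gets posterior probability 1/2 for each class, as the symmetry of the model requires.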

Linear discriminant analysis (LDA)

LDA arises in the special case that $\Sigma_k \equiv \Sigma$, $\forall k$. In comparing two classes $k$ and $l$ it is enough to look at the log-ratio of posterior probabilities

$$\log \frac{P[G = k \mid X = x]}{P[G = l \mid X = x]} = \log \frac{f_k(x)}{f_l(x)} + \log \frac{\pi_k}{\pi_l} = \log \frac{\pi_k}{\pi_l} - \frac{1}{2} (\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l) + x^T \Sigma^{-1} (\mu_k - \mu_l)$$

after some algebra and cancellation due to $\Sigma_k \equiv \Sigma$. This in turn leads to the linear discriminant functions

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k.$$

In practice we use unbiased estimates of the cluster parameters:

- $\hat{\pi}_k = N_k / N$, the fraction of training examples in class $k$;
- $\hat{\mu}_k = \frac{1}{N_k} \sum_{g_i = k} x_i$;
- $\hat{\Sigma} = \frac{1}{N - K} \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$.

Quadratic discriminant analysis (QDA)

With quadratic discriminant analysis we allow $\Sigma_k$ to vary between clusters. There are then fewer cancellations in the algebra above, and we are led to quadratic discriminant functions

$$\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k.$$

Some observations/comparisons:

- LDA and linear regression of the indicator matrix $Y$ correspond in the case of two classes: the least-squares coefficient vector is proportional to the LDA direction, and the two rules coincide exactly when the classes are of equal size.
- Both LDA and linear regression of the indicator matrix, with polynomial functions of $X$, perform similarly to QDA with simply linear $X$ (this might be expected!).
- When the assumption of Gaussian mixtures is correct, the above unbiased estimates of parameters lead to optimal (empirical Bayes) classification rules.
- When the Gaussian assumption is not correct, one can generally improve the classifications by adding biases $b_k$ to the discriminant functions, i.e., $\delta_k^{*}(x) = \delta_k(x) + b_k$, chosen to optimize classification loss in the training data.
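The plug-in estimates and the two families of discriminant functions can be sketched as follows, assuming NumPy and simulated two-class data. Function names are hypothetical, classes are coded 0..K-1, and `X` here holds the features only (no column of 1's). The per-class covariance estimate with divisor $N_k - 1$ used for QDA is an assumption; the slides give an explicit estimate only for the pooled $\hat{\Sigma}$.

```python
import numpy as np

def fit_gaussian_classes(X, g, K):
    """Plug-in estimates: priors N_k/N, class means, pooled and per-class covariances."""
    N, p = X.shape
    priors = np.array([np.mean(g == k) for k in range(K)])
    mus = np.array([X[g == k].mean(axis=0) for k in range(K)])
    scatters = []
    for k in range(K):
        R = X[g == k] - mus[k]
        scatters.append(R.T @ R)                 # sum of (x_i - mu_k)(x_i - mu_k)^T
    pooled = sum(scatters) / (N - K)             # common Sigma-hat used by LDA
    # Per-class covariances for QDA (divisor N_k - 1 is an assumption)
    Sigmas = [S / (np.sum(g == k) - 1) for k, S in enumerate(scatters)]
    return priors, mus, pooled, Sigmas

def lda_scores(X, priors, mus, pooled):
    """delta_k(x) = x^T S^{-1} mu_k - 0.5 mu_k^T S^{-1} mu_k + log pi_k, for each row x."""
    A = np.linalg.solve(pooled, mus.T)           # column k is S^{-1} mu_k
    return X @ A - 0.5 * np.sum(mus.T * A, axis=0) + np.log(priors)

def qda_scores(X, priors, mus, Sigmas):
    """delta_k(x) = -0.5 log|S_k| - 0.5 (x - mu_k)^T S_k^{-1} (x - mu_k) + log pi_k."""
    cols = []
    for pi, mu, S in zip(priors, mus, Sigmas):
        R = X - mu
        maha = np.sum(R * np.linalg.solve(S, R.T).T, axis=1)
        cols.append(-0.5 * np.linalg.slogdet(S)[1] - 0.5 * maha + np.log(pi))
    return np.column_stack(cols)

# Simulated data (an assumption): two unit-variance Gaussian clusters
rng = np.random.default_rng(1)
g = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 2)) + np.array([[0.0, 0.0], [4.0, 0.0]])[g]

priors, mus, pooled, Sigmas = fit_gaussian_classes(X, g, K=2)
scores_lda = lda_scores(X, priors, mus, pooled)
scores_qda = qda_scores(X, priors, mus, Sigmas)
err_lda = np.mean(np.argmax(scores_lda, axis=1) != g)
err_qda = np.mean(np.argmax(scores_qda, axis=1) != g)
```

The difference `scores_lda[:, k] - scores_lda[:, l]` reproduces the log-ratio of posterior probabilities given above, which is a convenient check on any implementation.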

(Polytomous) Logistic Regression

The (polytomous) logistic regression model asserts a linear model for the log-ratios of classification probabilities directly:

$$\begin{aligned}
\log \frac{P[G = 1 \mid X = x]}{P[G = K \mid X = x]} &= \beta_{10} + \beta_1^T x \\
\log \frac{P[G = 2 \mid X = x]}{P[G = K \mid X = x]} &= \beta_{20} + \beta_2^T x \\
&\;\;\vdots \\
\log \frac{P[G = K-1 \mid X = x]}{P[G = K \mid X = x]} &= \beta_{(K-1)0} + \beta_{K-1}^T x
\end{aligned}$$

from which we can easily calculate that, for $k = 1, \ldots, K-1$,

$$P[G = k \mid X = x] = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)},$$

with the remaining probability, $1 / \{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)\}$, assigned to class $K$.

Since LDA also yields a linear form for the log ratios,

$$\log \frac{P[G = k \mid X = x]}{P[G = K \mid X = x]} = \beta_{k0} + \beta_k^T x,$$

which is better/preferable? Let us write

$$P(X, G = k) = \Pr(X)\, P[G = k \mid X].$$

- Logistic regression estimates $P[G = k \mid X]$ conditional on $X$, ignoring $\Pr(X)$;
- LDA estimates the full model $P(X, G = k) = P(G = k)\, P[X \mid G = k]$, using a specific model for $P[X \mid G = k]$.

If the Gaussian model for $P[X \mid G = k]$ is correct, LDA will do better. In other settings, logistic regression is thought to provide a more robust classifier.

However, in most circumstances, LDA and logistic regression provide very similar answers.
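To make the step from log-ratios to probabilities concrete, here is a small NumPy sketch of the class probabilities with class $K$ as the reference category. The coefficients are made-up numbers, the function name is hypothetical, and no fitting is performed; the reference class is stored last.

```python
import numpy as np

def polytomous_probs(x, beta0, beta):
    """Class probabilities under the reference-category logit model.

    beta0: shape (K-1,), intercepts beta_{k0}
    beta:  shape (K-1, p), rows beta_k; class K is the reference (linear predictor 0)
    """
    eta = np.append(beta0 + beta @ x, 0.0)   # linear predictors, reference class last
    eta -= eta.max()                         # stabilizes exp without changing the ratios
    w = np.exp(eta)
    return w / w.sum()

# Hypothetical coefficients for K = 3 classes and p = 2 features
beta0 = np.array([0.5, -1.0])
beta = np.array([[1.0, 0.0], [0.0, 2.0]])
x = np.array([0.3, -0.2])

probs = polytomous_probs(x, beta0, beta)
log_ratios = np.log(probs[:-1] / probs[-1])  # should recover beta_{k0} + beta_k^T x
```

Taking logs of the ratios to the last entry recovers the linear predictors, so the probabilities are consistent with the model equations above.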

A nearly linearly-separable case...

[Figure: classification plots with an error rate per panel. Panels: linear regression of indicators; lda, orig features; lda, quadratic features; qda, orig features; lda with degree 1, 2, 3 and 4 polynomial features. Most of the per-panel error rates are missing; the surviving values are 0.03, 0.03 and 0.04.]

A quadratically-separable case...

[Figure: classification plots with an error rate per panel. Panels: linear regression of indicators; lda, orig features; lda, quadratic features; qda, orig features; lda with degree 1, 2, 3 and 4 polynomial features. The per-panel error rates are missing.]
