STATS306B Discriminant Analysis Jonathan Taylor Department of Statistics Stanford University June 3, 2010 Spring 2010
Classification Given K classes in $\mathbb{R}^p$, represented as densities $f_i(x)$, $1 \le i \le K$, classify $x \in \mathbb{R}^p$. In other words, partition $\mathbb{R}^p$ (or other sample space) into subsets $\Pi_i$, $1 \le i \le K$, based on the densities $f_i(x)$. Maximum likelihood rule: $x \in \Pi_i \iff i = \operatorname{argmax}_j f_j(x)$.
Example: multinomial Suppose the sample space is all p-tuples of integers that sum to n. Two classes: $f_1 = \mathrm{Multinom}(n, \alpha)$, $f_2 = \mathrm{Multinom}(n, \beta)$. The ML rule boils down to $x \in \Pi_1 \iff \sum_{i=1}^p x_i \log(\alpha_i/\beta_i) > 0$. The function $h_{12}(x) = \sum_{i=1}^p x_i \log(\alpha_i/\beta_i)$ is called a discriminant function between classes 1 & 2.
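As a quick illustration (not from the slides), here is a minimal numpy sketch of this discriminant function; the parameter values and the helper name multinomial_discriminant are hypothetical.

```python
import numpy as np

def multinomial_discriminant(x, alpha, beta):
    """h_12(x) = sum_i x_i * log(alpha_i / beta_i); classify to class 1 if positive."""
    x, alpha, beta = map(np.asarray, (x, alpha, beta))
    return float(np.sum(x * np.log(alpha / beta)))

# Example: counts over p = 3 categories from n = 10 trials.
alpha = np.array([0.5, 0.3, 0.2])
beta = np.array([0.2, 0.3, 0.5])
x = np.array([6, 3, 1])
print("class 1" if multinomial_discriminant(x, alpha, beta) > 0 else "class 2")
```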
Discriminant functions The ML rule can be summarized as $x \in \Pi_i \iff h_{ij}(x) > 0\ \forall j$, where $h_{ij}(x) = \log\frac{f_i(x)}{f_j(x)}$.
Bayesian rule If prior class probabilities $(\pi_1, \ldots, \pi_K)$ are available, a more sensible rule is $x \in \Pi_i \iff i = \operatorname{argmax}_j \pi_j f_j(x)$. Modified discriminant functions: $\tilde h_{ij}(x) = h_{ij}(x) + \log\frac{\pi_i}{\pi_j}$.
Example: Gaussian in R Let $f_1 = N(\mu_1, \sigma_1^2)$, $f_2 = N(\mu_2, \sigma_2^2)$. Discriminant function: $h_{12}(x) = -\frac{x^2}{2}\left(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}\right) + x\left(\frac{\mu_1}{\sigma_1^2} - \frac{\mu_2}{\sigma_2^2}\right) + \frac{1}{2}\left(\frac{\mu_2^2}{\sigma_2^2} - \frac{\mu_1^2}{\sigma_1^2}\right) + \log\frac{\sigma_2}{\sigma_1}$. Note: $h_{12}$ is quadratic in x, unless $\sigma_1 = \sigma_2$. LDA (Linear Discriminant Analysis): $\sigma_1 = \sigma_2$. QDA (Quadratic Discriminant Analysis): $\sigma_1 \ne \sigma_2$.
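A small numerical check (my addition, assuming scipy is available): the log-likelihood ratio can be evaluated directly from the two Gaussian log-densities, and printing it on a grid shows the linear-versus-quadratic behaviour.

```python
import numpy as np
from scipy.stats import norm

def h12(x, mu1, s1, mu2, s2):
    """Log-likelihood-ratio discriminant h_12(x) = log f_1(x) - log f_2(x)."""
    return norm.logpdf(x, loc=mu1, scale=s1) - norm.logpdf(x, loc=mu2, scale=s2)

x = np.linspace(-3.0, 3.0, 7)
print(h12(x, 0.0, 1.0, 1.0, 1.0))   # sigma_1 = sigma_2: linear in x (LDA)
print(h12(x, 0.0, 1.0, 1.0, 2.0))   # sigma_1 != sigma_2: quadratic in x (QDA)
```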
Example: Gaussian in R p In general, the ML rule classifies x by minimizing the Mahalanobis distance to each class mean, after adjusting for $\Sigma_j$: $x \in \Pi_i \iff i = \operatorname{argmin}_j\, d_{\Sigma_j}(x, \mu_j)^2 + \log\det(\Sigma_j)$, where $d_{\Sigma}(x, \mu)^2 = (x-\mu)^\top\Sigma^{-1}(x-\mu)$. If $\Sigma_i = \Sigma$ for all i, the ML (LDA) rule classifies by minimizing the Mahalanobis distance alone. The Bayesian rule (with $\Sigma_i = \Sigma$) classifies by $x \in \Pi_i \iff i = \operatorname{argmin}_j\, d_{\Sigma}(x, \mu_j)^2 - 2\log\pi_j$.
Sample ML and Bayesian rules For each class, estimate $(\hat\mu_i, \hat\Sigma_i, \hat\pi_i)$ with $\hat\pi_i = n_i/n$. QDA: classify according to $x \in \Pi_i \iff i = \operatorname{argmin}_j\, d_{\hat\Sigma_j}(x, \hat\mu_j)^2 + \log\det(\hat\Sigma_j) - 2\log\hat\pi_j$.
Sample ML and Bayesian rules LDA: estimate the pooled covariance matrix $\hat\Sigma = \frac{1}{n-K}\sum_{i=1}^K (n_i - 1)\hat\Sigma_i$ and classify according to $x \in \Pi_i \iff i = \operatorname{argmin}_j\, d_{\hat\Sigma}(x, \hat\mu_j)^2 - 2\log\hat\pi_j$.
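The sample LDA rule above is easy to implement directly. The sketch below (my own, assuming numpy and labels stored in a vector y) estimates the class means, priors, and pooled covariance, and classifies by the penalized Mahalanobis distance.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate class means, priors pi_i = n_i / n, and the pooled covariance."""
    y = np.asarray(y)
    classes = np.unique(y)
    n, p = X.shape
    means, priors = {}, {}
    pooled = np.zeros((p, p))
    for c in classes:
        Xc = X[y == c]
        means[c] = Xc.mean(axis=0)
        priors[c] = len(Xc) / n
        pooled += (len(Xc) - 1) * np.cov(Xc, rowvar=False)
    pooled /= n - len(classes)
    return classes, means, priors, pooled

def predict_lda(X, classes, means, priors, pooled):
    """Classify by argmin_j of Mahalanobis(x, mu_j)^2 - 2 log pi_j."""
    Sinv = np.linalg.inv(pooled)
    scores = np.stack(
        [np.einsum('ij,jk,ik->i', X - means[c], Sinv, X - means[c]) - 2 * np.log(priors[c])
         for c in classes], axis=1)
    return classes[np.argmin(scores, axis=1)]
```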
Gaussian in R p: $\Sigma_i = \Sigma$ Suppose that $p > K$, and let $L = \mu_1 + \operatorname{span}\{\mu_i - \mu_1,\ 2 \le i \le K\}$. It is clear that all of the action happens along the affine subspace $L \subset \mathbb{R}^p$ of dimension at most $K - 1$. Suggests we should reduce dimension...
Fisher's linear discriminant Assumption: $\Sigma_i = \Sigma$. Given a data matrix $X_{n\times p}$ and labels $L_l$, $1 \le l \le n$, consider a linear combination $Y^{(v)}_{n\times 1} = Xv$. The total sum of squares of $Y^{(v)}$ can be decomposed as
$\sum_{l=1}^n (Y^{(v)}_l - \bar Y^{(v)})^2 = \sum_{i=1}^K n_i (\bar Y^{(v)}_i - \bar Y^{(v)})^2 + \sum_{i=1}^K \sum_{j=1}^{n_i} (Y^{(v)}_{ij} - \bar Y^{(v)}_i)^2 = v^\top \hat\Sigma_B v + v^\top \hat\Sigma_W v$
Fisher's linear discriminant Fisher's suggestion: choose $\hat v = \operatorname{argmax}_{v:\, v^\top\hat\Sigma_W v = 1} v^\top\hat\Sigma_B v$, i.e. maximize the between-groups variance subject to a within-group variance of 1. This leads to the generalized eigenvalue problem $\hat\Sigma_B v = \lambda\hat\Sigma_W v$. We can construct up to $K - 1$ different directions (subject to $v_i^\top\hat\Sigma_W v_j = \delta_{ij}$).
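A compact way to obtain these directions (my sketch, not from the slides) is scipy's generalized symmetric eigensolver, which returns eigenvectors normalized so that $v^\top\hat\Sigma_W v = 1$. The within/between matrices below are raw sums of squares, so scaling conventions may differ from the slides by constant factors.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, y):
    """Solve Sigma_B v = lambda Sigma_W v and return up to K-1 leading directions."""
    y = np.asarray(y)
    classes = np.unique(y)
    n, p = X.shape
    xbar = X.mean(axis=0)
    SW = np.zeros((p, p))
    SB = np.zeros((p, p))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        SW += (Xc - mc).T @ (Xc - mc)                      # within-group sums of squares
        SB += len(Xc) * np.outer(mc - xbar, mc - xbar)     # between-group sums of squares
    # Generalized eigenproblem; eigh normalizes so that vecs.T @ SW @ vecs = I.
    vals, vecs = eigh(SB, SW)
    order = np.argsort(vals)[::-1]                         # largest between-group variance first
    return vecs[:, order[:len(classes) - 1]]
```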
Fisher's linear discriminant Define the Fisher discriminant scores $V_j = X\hat v_j$, $1 \le j \le K-1$, to form a new data matrix $V_{n\times(K-1)}$. The pooled covariance matrix of the $V_j$'s will be $I$, so LDA is just classifying to the nearest centroid $\bar V_i$ = mean of $V$ in class $i$.
Fisher's linear discriminant & CCA Consider the indicators $Y_{li} = 1_{\{L_l = i\}}$, $1 \le l \le n$, $1 \le i \le K-1$. Putting the data matrices $(Y, X)$ through CCA yields $K-1$ pairs $(\hat\alpha_i, \hat\beta_i)$, $1 \le i \le K-1$, of canonical directions. It turns out that $\hat\beta_i = \hat v_i$ (up to a scalar multiple)...
Reducing the rank Suppose some of the $\mu_i$'s are collinear, so that $\dim(L) < K - 1$. Then some of the Fisher scores will carry little information. We can discard some of the scores and then classify according to the nearest centroid in the reduced space.
QDA revisited Fisher's linear discriminant functions are dimension reduction tools. In the olive data, the groups have unequal variances, which suggests we could use QDA on the Fisher scores. Note: this is not the same as QDA on the whole vector unless the noise orthogonal to L has the same covariance...
QDA revisited LDA produces boundaries that are linear in X. If we transform $X \in \mathbb{R}^p$ to $f(X) \in \mathbb{R}^q$, LDA on $f(X)$ will produce boundaries that are linear in the components of $f(X)$. Suppose $p = 2$ and take $f(x) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2)$. This will produce discriminant functions $h_{ij}(x) = a_{ij,1}x_1 + a_{ij,2}x_2 + a_{ij,3}x_1^2 + a_{ij,4}x_2^2 + a_{ij,5}x_1x_2 + c_{ij}$: a cheap way to get quadratic boundaries.
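To make this concrete, here is a small sketch (my own; the simulated data and the use of scikit-learn's LinearDiscriminantAnalysis are illustrative choices, not part of the slides): expanding two features into the five quadratic monomials and running ordinary LDA on the expansion gives boundaries that are quadratic in the original coordinates.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def quadratic_expansion(X):
    """Map (x1, x2) -> (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),      # class 0: tight spherical cloud
               rng.normal(0, 3, (100, 2))])     # class 1: wider cloud, same center
y = np.repeat([0, 1], 100)

# Linear in the expanded features, hence quadratic in (x1, x2).
lda = LinearDiscriminantAnalysis().fit(quadratic_expansion(X), y)
print(lda.score(quadratic_expansion(X), y))
```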
More general expansions Why limit ourselves to quadratic? We could take a large basis $f(x) = (f_1(x), \ldots, f_m(x))$ and perform LDA on $f(X)$ with labels L. If we take all $K-1$ Fisher scores, the number of coefficients we need to estimate is $(K-1)\cdot m$, which grows quickly.
Penalized discriminant analysis Recall that Fisher's scores were constructed as $\max v^\top\hat\Sigma_B v$ s.t. $v^\top\hat\Sigma_W v = 1$. To regularize, we can insist instead that $v^\top(\hat\Sigma_W + \lambda\Omega)v = 1$. If $\Omega$ penalizes rough functions, this will produce smoother decision boundaries as $\lambda$ grows...
Penalized discriminant analysis Generalized eigenproblem: $\hat\Sigma_B v = \theta(\hat\Sigma_W + \lambda\Omega)v$, with $\hat\Sigma_B, \hat\Sigma_W$ the estimated covariance matrices of the derived variables $f(X)_{n\times m}$ and $\theta$ the generalized eigenvalue. Using the scores from this eigenproblem and classifying by nearest centroid corresponds to the rule $x \in \Pi_i \iff i = \operatorname{argmin}_j\, d_{\hat\Sigma_W + \lambda\Omega}(f(x), \hat\mu_{j,f})^2 - 2\log\hat\pi_j$, with $\hat\mu_{j,f}$ the sample mean of $f(X)$ in class j.
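The only change from the unpenalized case is the matrix on the right-hand side of the eigenproblem, as in this short sketch (mine; it assumes the penalized within-class matrix is positive definite):

```python
import numpy as np
from scipy.linalg import eigh

def penalized_scores(SB, SW, Omega, lam, n_scores):
    """Penalized Fisher directions: SB v = theta (SW + lam * Omega) v,
    with eigenvectors normalized so v' (SW + lam * Omega) v = 1."""
    vals, vecs = eigh(SB, SW + lam * Omega)
    order = np.argsort(vals)[::-1]
    return vecs[:, order[:n_scores]]
```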
Flexible discriminant analysis The previous penalized approach suggests the following strategy: 1. Find good scores... 2. Use nearest centroid classification on the scores... How do we find good scores?
Flexible discriminant analysis Connection with CCA: let $Y_{n\times K}$ be the matrix of indicators for the classes based on the labels $L_{n\times 1}$. Fisher's directions are (parallel to) the canonical directions for X:
$(\hat\alpha, \hat\beta) = \operatorname{argmax}_{\alpha,\beta} \widehat{\mathrm{Cor}}(\alpha^\top Y, \beta^\top X) = \operatorname{argmax}_{\alpha,\beta} \frac{1}{n-1}(Y\alpha)^\top(X\beta)$
subject to $\widehat{\mathrm{Var}}(\alpha^\top Y) = \alpha^\top\hat\Sigma_Y\alpha = \widehat{\mathrm{Var}}(\beta^\top X) = \beta^\top\hat\Sigma_X\beta = 1$ and $\hat E(\alpha^\top Y) = \hat E(\beta^\top X) = 0$.
Flexible discriminant analysis Under these constraints 1 n 1 (Y 1 α) (X β) = 1 2(n 1) Y α X β 2. Fixing α and maximizing Ĉor(α Y, β X ) is a regression of X onto Y α.
Flexible discriminant analysis The (unpenalized) problem is recast as $\min_{\theta,\beta}\sum_{i=1}^n (\theta(l_i) - X_i\beta)^2$, with $l_i$ the i-th label and $X_i$ the i-th row of X, subject to the constraints $\sum_{i=1}^n \theta(l_i) = 0$, $\sum_{i=1}^n \theta(l_i)^2 = 1$. (A re-expression of $\hat E(\alpha^\top Y) = 0$, $\widehat{\mathrm{Var}}(\alpha^\top Y) = 1$.) As in CCA, we obtain successive pairs $(\hat\theta_l, \hat\beta_l)$ solving this problem... The inner regression can be replaced with a more flexible model...
Flexible discriminant analysis (FDA): algorithm (Ch. 12 ESL)
1. Let $\hat Y = \eta^*(X)$ be a linear regression estimate of $E(Y \mid X)$, i.e. $\hat Y$ is $n\times K$ with i-th row $\eta^*(X_i)$.
2. Let $C_{K\times K} = \hat Y^\top\hat Y$.
3. Let $\Theta$ be the eigenvectors of C, normalized so that $\Theta^\top D_\pi\Theta = I$ where $D_\pi = \mathrm{diag}(\hat\pi_1, \ldots, \hat\pi_K)$. [Maximization over $\alpha$.]
4. Define $\eta(x) = \Theta^\top\eta^*(x)$. [Update the output of the regression to give the optimal scores from above.]
5. Compute $\eta(X)_{n\times K}$ and the centroids $\bar\eta_1, \ldots, \bar\eta_K$.
6. Classify a new observation based on $\eta(x)$ to the nearest centroid $\bar\eta_j$.
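Below is a minimal numpy/scipy sketch of these steps with a plain linear inner regression (my own reading of the algorithm; the handling of the trivial constant score, the nearest-centroid prediction without per-score weights, and all function names are my choices, not the slides').

```python
import numpy as np
from scipy.linalg import eigh

def fda_linear(X, y):
    """FDA via optimal scoring with a linear regression inner fit (a sketch)."""
    y = np.asarray(y)
    classes = np.unique(y)
    n, K = len(y), len(classes)
    Y = (y[:, None] == classes[None, :]).astype(float)        # n x K indicator matrix
    Xc = np.column_stack([np.ones(n), X])                      # add intercept
    B, *_ = np.linalg.lstsq(Xc, Y, rcond=None)                 # multivariate regression
    Yhat = Xc @ B                                              # eta*(X), n x K
    Dpi = np.diag(Y.mean(axis=0))                              # diag(pi_1, ..., pi_K)
    # Eigenvectors of C = Yhat' Yhat, normalized so Theta' Dpi Theta = I.
    vals, Theta = eigh(Yhat.T @ Yhat / n, Dpi)
    order = np.argsort(vals)[::-1][1:]                         # drop the trivial constant score
    Theta = Theta[:, order]
    eta = Yhat @ Theta                                         # optimal scores eta(X)
    centroids = np.stack([eta[y == c].mean(axis=0) for c in classes])
    return B, Theta, centroids, classes

def fda_predict(Xnew, B, Theta, centroids, classes):
    """Classify to the nearest centroid in score space."""
    eta_new = np.column_stack([np.ones(len(Xnew)), Xnew]) @ B @ Theta
    d2 = ((eta_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d2, axis=1)]
```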
Flexible discriminant analysis (FDA)
1. FDA tries to minimize $L_\lambda(\Theta, \eta) = \frac{1}{2}\mathrm{Tr}\big((Y\Theta - \eta(X))^\top(Y\Theta - \eta(X))\big)$ where $\eta = \eta(X, Y, \Theta, \lambda)$ is a multivariate regression method.
2. More precisely, we could minimize $L_\lambda(\Theta, \beta) = \frac{1}{2}\mathrm{Tr}\big((Y\Theta - X\beta)^\top(Y\Theta - X\beta)\big) + \lambda P(\beta)$ for some penalty $P$.
3. Examples
3.1 LASSO: $P(\beta) = \sum_{i=1}^p\sum_{j=1}^k |\beta_{ij}|$
Steps of the alternating algorithm
1. Choose some initial $\Theta_0$ such that $\Theta_0^\top(Y^\top Y)\Theta_0 = nI_{k\times k}$.
2. For $\Theta$ fixed, define $\eta = \eta(X, Y\Theta, \lambda)$, $\eta: \mathbb{R}^p \to \mathbb{R}^k$, to be the output of the regression method when $Y\Theta$ is regressed on X.
3. For $(\eta, \Theta)$ fixed, define $\hat U = \hat U(Y\Theta, \eta(X)) = \operatorname{argmin}_{U:\,U^\top U=I}\mathrm{Tr}\big((Y\Theta U - \eta(X))^\top(Y\Theta U - \eta(X))\big)$.
Procrustes problem The problem $\hat U = \hat U(Y\Theta, \eta(X)) = \operatorname{argmin}_{U:\,U^\top U=I}\mathrm{Tr}\big((Y\Theta U - \eta(X))^\top(Y\Theta U - \eta(X))\big)$ is called a Procrustes problem. The matrix $\hat U$ can be obtained via an SVD of $(Y\Theta)^\top\eta(X)$: let $(Y\Theta)^\top\eta(X) = U_1DU_2^\top$; then $\hat U = U_1U_2^\top$. Note: if $(Y\Theta)^\top\eta(X)$ is symmetric (and positive semi-definite), then $\hat U = I$ and D contains its eigenvalues. These singular values are used as weights for the different optimal scores.
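The Procrustes step is only a few lines of numpy (a sketch of mine; here A plays the role of $Y\Theta$ and B of $\eta(X)$, and the returned singular values are the weights mentioned above):

```python
import numpy as np

def procrustes(A, B):
    """argmin over orthogonal U of || A U - B ||_F, via the SVD of A' B."""
    U1, d, Vt = np.linalg.svd(A.T @ B)
    return U1 @ Vt, d
```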
Alternating algorithm for FDA Choose some initial $\Theta_0$ such that $\Theta_0^\top Y^\top Y\Theta_0 = nI$. For $i \ge 1$, until convergence is reached based on $L_\lambda(\Theta, \eta)$:
1. Find $\eta_i = \eta(X, Y\Theta_i, \lambda)$.
2. Compute $(Y\Theta_i)^\top\eta_i(X)$ and find its SVD: $U_1D_iU_2^\top$.
3. Update $\Theta_{i+1} = \Theta_iU_1U_2^\top$.
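Putting the pieces together, here is a sketch of the alternating loop with a plain linear inner regression (my own construction; the fixed iteration count stands in for a proper convergence check, and the initialization uses the fact that $Y^\top Y = \mathrm{diag}(n_i)$):

```python
import numpy as np

def linear_fit(X, T):
    """Fitted values from a multivariate linear regression of T on X (with intercept)."""
    Xc = np.column_stack([np.ones(len(X)), X])
    return Xc @ np.linalg.lstsq(Xc, T, rcond=None)[0]

def fda_alternating(X, Y, fit=linear_fit, n_iter=20):
    """Alternate the regression step and the Procrustes rotation of Theta."""
    n, K = Y.shape
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.normal(size=(K, K)))                    # random orthogonal K x K
    Theta = np.sqrt(n) * np.diag(1.0 / np.sqrt(Y.sum(axis=0))) @ Q  # Theta' Y'Y Theta = n I
    for _ in range(n_iter):
        eta = fit(X, Y @ Theta)                                     # inner regression step
        U1, d, Vt = np.linalg.svd((Y @ Theta).T @ eta)
        Theta = Theta @ U1 @ Vt                                     # Procrustes update
    return Theta, fit(X, Y @ Theta), d
```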
Alternating algorithm for FDA This will converge as long as each of the steps of finding $\eta_i$ and $\Theta_{i+1}$ decreases the loss. Use $\eta$ to compute the class centroids $(\bar\eta_j)_{1\le j\le k}$. Classify using nearest centroids, with weights $1/(\hat D(1-\hat D))$.
Digits examples First example: $P(\beta) = \mathrm{Tr}(\beta_{(0)}^\top L\beta_{(0)})$, where $\beta_{(0)}$ is $\beta$ without the intercept term $\beta_0$. The penalty L is the discrete Laplacian of the $16\times 16$ lattice, defined as $\mathrm{diag}(\mathrm{rowsum}(A)) - A$ where $A_{256\times 256}$ is the adjacency matrix of the lattice. Second example: the LASSO penalty $P(\beta) = \sum_{i,j}|\beta_{ij}|$.
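For reference, the lattice Laplacian used as the penalty above can be built directly from the 4-neighbour adjacency structure of the $16\times 16$ pixel grid (a dense-matrix sketch of my own; a sparse construction would be preferable at larger sizes):

```python
import numpy as np

def lattice_laplacian(nrow=16, ncol=16):
    """Discrete Laplacian diag(rowsum(A)) - A of the 4-neighbour lattice graph."""
    n = nrow * ncol
    A = np.zeros((n, n))
    def idx(r, c):
        return r * ncol + c
    for r in range(nrow):
        for c in range(ncol):
            for dr, dc in [(0, 1), (1, 0)]:            # right and down neighbours
                if r + dr < nrow and c + dc < ncol:
                    A[idx(r, c), idx(r + dr, c + dc)] = 1
                    A[idx(r + dr, c + dc), idx(r, c)] = 1
    return np.diag(A.sum(axis=1)) - A

L = lattice_laplacian()    # 256 x 256 penalty matrix for the 16 x 16 digit images
```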
Digits: ridge with discrete Laplacian [figure slides]
Digits: ridge with LASSO [figure slides]