STATS306B Discriminant Analysis Jonathan Taylor Department of Statistics Stanford University June 3, 2010 Spring 2010
Classification Given K classes in $\mathbb{R}^p$, represented as densities $f_i(x)$, $1 \le i \le K$, classify $x \in \mathbb{R}^p$. In other words, partition $\mathbb{R}^p$ (or other sample space) into subsets $\Pi_i$, $1 \le i \le K$, based on the densities $f_i(x)$. Maximum likelihood rule: $x \in \Pi_i \iff i = \operatorname{argmax}_j f_j(x)$.
Example: multinomial Suppose the sample space is all p-tuples of integers that sum to n. Two classes: $f_1 = \mathrm{Multinom}(n, \alpha)$, $f_2 = \mathrm{Multinom}(n, \beta)$. The ML rule boils down to $x \in \Pi_1 \iff \sum_{i=1}^p x_i \log(\alpha_i/\beta_i) > 0$. The function $h_{12}(x) = \sum_{i=1}^p x_i \log(\alpha_i/\beta_i)$ is called a discriminant function between classes 1 & 2.
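As a quick illustration (not from the slides), here is a minimal numpy sketch of this discriminant function; the parameter values and the helper name multinomial_discriminant are hypothetical.

```python
import numpy as np

def multinomial_discriminant(x, alpha, beta):
    """h_12(x) = sum_i x_i * log(alpha_i / beta_i); classify to class 1 if positive."""
    x, alpha, beta = map(np.asarray, (x, alpha, beta))
    return float(np.sum(x * np.log(alpha / beta)))

# Example: counts over p = 3 categories from n = 10 trials.
alpha = np.array([0.5, 0.3, 0.2])
beta = np.array([0.2, 0.3, 0.5])
x = np.array([6, 3, 1])
print("class 1" if multinomial_discriminant(x, alpha, beta) > 0 else "class 2")
```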
Discriminant functions The ML rule can be summarized as $x \in \Pi_i \iff h_{ij}(x) > 0\ \forall j$, where $h_{ij}(x) = \log\frac{f_i(x)}{f_j(x)}$.
Bayesian rule If prior class probabilities $(\pi_1, \ldots, \pi_K)$ are available, a more sensible rule is $x \in \Pi_i \iff i = \operatorname{argmax}_j \pi_j f_j(x)$. Modified discriminant functions: $\tilde h_{ij}(x) = h_{ij}(x) + \log\frac{\pi_i}{\pi_j}$.
Example: Gaussian in R Let $f_1 = N(\mu_1, \sigma_1^2)$, $f_2 = N(\mu_2, \sigma_2^2)$. Discriminant function: $h_{12}(x) = -\frac{x^2}{2}\left(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}\right) + x\left(\frac{\mu_1}{\sigma_1^2} - \frac{\mu_2}{\sigma_2^2}\right) + \frac{1}{2}\left(\frac{\mu_2^2}{\sigma_2^2} - \frac{\mu_1^2}{\sigma_1^2}\right) + \log\frac{\sigma_2}{\sigma_1}$. Note: $h_{12}$ is quadratic in x, unless $\sigma_1 = \sigma_2$. LDA (Linear Discriminant Analysis): $\sigma_1 = \sigma_2$. QDA (Quadratic Discriminant Analysis): $\sigma_1 \ne \sigma_2$.
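A small numerical check (my addition, assuming scipy is available): the log-likelihood ratio can be evaluated directly from the two Gaussian log-densities, and printing it on a grid shows the linear-versus-quadratic behaviour.

```python
import numpy as np
from scipy.stats import norm

def h12(x, mu1, s1, mu2, s2):
    """Log-likelihood-ratio discriminant h_12(x) = log f_1(x) - log f_2(x)."""
    return norm.logpdf(x, loc=mu1, scale=s1) - norm.logpdf(x, loc=mu2, scale=s2)

x = np.linspace(-3.0, 3.0, 7)
print(h12(x, 0.0, 1.0, 1.0, 1.0))   # sigma_1 = sigma_2: linear in x (LDA)
print(h12(x, 0.0, 1.0, 1.0, 2.0))   # sigma_1 != sigma_2: quadratic in x (QDA)
```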
Example: Gaussian in R p In general, the ML rule classifies x by minimizing the Mahalanobis distance to each class mean, after adjusting for $\Sigma_j$: $x \in \Pi_i \iff i = \operatorname{argmin}_j\, d_{\Sigma_j}(x, \mu_j)^2 + \log\det(\Sigma_j)$, where $d_{\Sigma}(x, \mu)^2 = (x-\mu)^\top\Sigma^{-1}(x-\mu)$. If $\Sigma_i = \Sigma$ for all i, the ML (LDA) rule classifies by minimizing the Mahalanobis distance alone. The Bayesian rule (with $\Sigma_i = \Sigma$) classifies by $x \in \Pi_i \iff i = \operatorname{argmin}_j\, d_{\Sigma}(x, \mu_j)^2 - 2\log\pi_j$.
Sample ML and Bayesian rules For each class, estimate $(\hat\mu_i, \hat\Sigma_i, \hat\pi_i)$ with $\hat\pi_i = n_i/n$. QDA: classify according to $x \in \Pi_i \iff i = \operatorname{argmin}_j\, d_{\hat\Sigma_j}(x, \hat\mu_j)^2 + \log\det(\hat\Sigma_j) - 2\log\hat\pi_j$.
Sample ML and Bayesian rules LDA: estimate the pooled covariance matrix $\hat\Sigma = \frac{1}{n-K}\sum_{i=1}^K (n_i - 1)\hat\Sigma_i$ and classify according to $x \in \Pi_i \iff i = \operatorname{argmin}_j\, d_{\hat\Sigma}(x, \hat\mu_j)^2 - 2\log\hat\pi_j$.
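The sample LDA rule above is easy to implement directly. The sketch below (my own, assuming numpy and labels stored in a vector y) estimates the class means, priors, and pooled covariance, and classifies by the penalized Mahalanobis distance.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate class means, priors pi_i = n_i / n, and the pooled covariance."""
    y = np.asarray(y)
    classes = np.unique(y)
    n, p = X.shape
    means, priors = {}, {}
    pooled = np.zeros((p, p))
    for c in classes:
        Xc = X[y == c]
        means[c] = Xc.mean(axis=0)
        priors[c] = len(Xc) / n
        pooled += (len(Xc) - 1) * np.cov(Xc, rowvar=False)
    pooled /= n - len(classes)
    return classes, means, priors, pooled

def predict_lda(X, classes, means, priors, pooled):
    """Classify by argmin_j of Mahalanobis(x, mu_j)^2 - 2 log pi_j."""
    Sinv = np.linalg.inv(pooled)
    scores = np.stack(
        [np.einsum('ij,jk,ik->i', X - means[c], Sinv, X - means[c]) - 2 * np.log(priors[c])
         for c in classes], axis=1)
    return classes[np.argmin(scores, axis=1)]
```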
Gaussian in R p: $\Sigma_i = \Sigma$ Suppose that $p > K$, and let $L = \mu_1 + \operatorname{span}\{\mu_i - \mu_1,\ 2 \le i \le K\}$. It is clear that all of the action happens along the affine subspace $L \subset \mathbb{R}^p$ of dimension at most $K - 1$. Suggests we should reduce dimension...
Fisher's linear discriminant Assumption: $\Sigma_i = \Sigma$. Given a data matrix $X_{n\times p}$ and labels $L_l$, $1 \le l \le n$, consider a linear combination $Y^{(v)}_{n\times 1} = Xv$. The total sum of squares of $Y^{(v)}$ can be decomposed as
$\sum_{l=1}^n (Y^{(v)}_l - \bar Y^{(v)})^2 = \sum_{i=1}^K n_i (\bar Y^{(v)}_i - \bar Y^{(v)})^2 + \sum_{i=1}^K \sum_{j=1}^{n_i} (Y^{(v)}_{ij} - \bar Y^{(v)}_i)^2 = v^\top \hat\Sigma_B v + v^\top \hat\Sigma_W v$
Fisher's linear discriminant Fisher's suggestion: choose $\hat v = \operatorname{argmax}_{v:\, v^\top\hat\Sigma_W v = 1} v^\top\hat\Sigma_B v$, i.e. maximize the between-groups variance subject to a within-group variance of 1. This leads to the generalized eigenvalue problem $\hat\Sigma_B v = \lambda\hat\Sigma_W v$. We can construct up to $K - 1$ different directions (subject to $v_i^\top\hat\Sigma_W v_j = \delta_{ij}$).
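A compact way to obtain these directions (my sketch, not from the slides) is scipy's generalized symmetric eigensolver, which returns eigenvectors normalized so that $v^\top\hat\Sigma_W v = 1$. The within/between matrices below are raw sums of squares, so scaling conventions may differ from the slides by constant factors.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, y):
    """Solve Sigma_B v = lambda Sigma_W v and return up to K-1 leading directions."""
    y = np.asarray(y)
    classes = np.unique(y)
    n, p = X.shape
    xbar = X.mean(axis=0)
    SW = np.zeros((p, p))
    SB = np.zeros((p, p))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        SW += (Xc - mc).T @ (Xc - mc)                      # within-group sums of squares
        SB += len(Xc) * np.outer(mc - xbar, mc - xbar)     # between-group sums of squares
    # Generalized eigenproblem; eigh normalizes so that vecs.T @ SW @ vecs = I.
    vals, vecs = eigh(SB, SW)
    order = np.argsort(vals)[::-1]                         # largest between-group variance first
    return vecs[:, order[:len(classes) - 1]]
```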
Fisher's linear discriminant Define the Fisher discriminant scores $V_j = X\hat v_j$, $1 \le j \le K-1$, to form a new data matrix $V_{n\times(K-1)}$. The pooled covariance matrix of the $V_j$'s will be $I$, so LDA is just classifying to the nearest centroid $\bar V_i$ = mean of $V$ in class $i$.
Fisher's linear discriminant & CCA Consider the indicators $Y_{li} = 1_{\{L_l = i\}}$, $1 \le l \le n$, $1 \le i \le K-1$. Putting the data matrices $(Y, X)$ through CCA yields $K-1$ pairs $(\hat\alpha_i, \hat\beta_i)$, $1 \le i \le K-1$, of canonical directions. It turns out that $\hat\beta_i = \hat v_i$ (up to a scalar multiple)...
Reducing the rank Suppose some of the $\mu_i$'s are collinear, so that $\dim(L) < K - 1$. Then some of the Fisher scores will carry little information. We can discard some of the scores and then classify according to the nearest centroid in the reduced space.
QDA revisited Fisher's linear discriminant functions are dimension reduction tools. In the olive data, the groups have unequal variances, which suggests we could use QDA on the Fisher scores. Note: this is not the same as QDA on the whole vector unless the noise orthogonal to L has the same covariance...
QDA revisited LDA produces boundaries that are linear in X. If we transform $X \in \mathbb{R}^p$ to $f(X) \in \mathbb{R}^q$, LDA on $f(X)$ will produce boundaries that are linear in the components of $f(X)$. Suppose $p = 2$ and take $f(x) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2)$. This will produce discriminant functions $h_{ij}(x) = a_{ij,1}x_1 + a_{ij,2}x_2 + a_{ij,3}x_1^2 + a_{ij,4}x_2^2 + a_{ij,5}x_1x_2 + c_{ij}$: a cheap way to get quadratic boundaries.
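To make this concrete, here is a small sketch (my own; the simulated data and the use of scikit-learn's LinearDiscriminantAnalysis are illustrative choices, not part of the slides): expanding two features into the five quadratic monomials and running ordinary LDA on the expansion gives boundaries that are quadratic in the original coordinates.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def quadratic_expansion(X):
    """Map (x1, x2) -> (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),      # class 0: tight spherical cloud
               rng.normal(0, 3, (100, 2))])     # class 1: wider cloud, same center
y = np.repeat([0, 1], 100)

# Linear in the expanded features, hence quadratic in (x1, x2).
lda = LinearDiscriminantAnalysis().fit(quadratic_expansion(X), y)
print(lda.score(quadratic_expansion(X), y))
```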
More general expansions Why limit ourselves to quadratic? We could take a large basis $f(x) = (f_1(x), \ldots, f_m(x))$ and perform LDA on $f(X)$ with labels L. If we take all $K-1$ Fisher scores, the number of coefficients we need to estimate is $(K-1)\cdot m$, which grows quickly.
Penalized discriminant analysis Recall that Fisher's scores were constructed as $\max v^\top\hat\Sigma_B v$ s.t. $v^\top\hat\Sigma_W v = 1$. To regularize, we can insist instead that $v^\top(\hat\Sigma_W + \lambda\Omega)v = 1$. If $\Omega$ penalizes rough functions, this will produce smoother decision boundaries as $\lambda$ grows...
Penalized discriminant analysis Generalized eigenproblem: $\hat\Sigma_B v = \theta(\hat\Sigma_W + \lambda\Omega)v$, with $\hat\Sigma_B, \hat\Sigma_W$ the estimated covariance matrices of the derived variables $f(X)_{n\times m}$ and $\theta$ the generalized eigenvalue. Using the scores from this eigenproblem and classifying by nearest centroid corresponds to the rule $x \in \Pi_i \iff i = \operatorname{argmin}_j\, d_{\hat\Sigma_W + \lambda\Omega}(f(x), \hat\mu_{j,f})^2 - 2\log\hat\pi_j$, with $\hat\mu_{j,f}$ the sample mean of $f(X)$ in class j.
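The only change from the unpenalized case is the matrix on the right-hand side of the eigenproblem, as in this short sketch (mine; it assumes the penalized within-class matrix is positive definite):

```python
import numpy as np
from scipy.linalg import eigh

def penalized_scores(SB, SW, Omega, lam, n_scores):
    """Penalized Fisher directions: SB v = theta (SW + lam * Omega) v,
    with eigenvectors normalized so v' (SW + lam * Omega) v = 1."""
    vals, vecs = eigh(SB, SW + lam * Omega)
    order = np.argsort(vals)[::-1]
    return vecs[:, order[:n_scores]]
```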
Flexible discriminant analysis The previous penalized approach suggests the following strategy: 1. Find good scores... 2. Use nearest centroid classification on the scores... How do we find good scores?
Flexible discriminant analysis Connection with CCA: let $Y_{n\times K}$ be the matrix of indicators for the classes based on the labels $L_{n\times 1}$. Fisher's directions are (parallel to) the canonical directions for X:
$(\hat\alpha, \hat\beta) = \operatorname{argmax}_{\alpha,\beta} \widehat{\mathrm{Cor}}(\alpha^\top Y, \beta^\top X) = \operatorname{argmax}_{\alpha,\beta} \frac{1}{n-1}(Y\alpha)^\top(X\beta)$
subject to $\widehat{\mathrm{Var}}(\alpha^\top Y) = \alpha^\top\hat\Sigma_Y\alpha = \widehat{\mathrm{Var}}(\beta^\top X) = \beta^\top\hat\Sigma_X\beta = 1$ and $\hat E(\alpha^\top Y) = \hat E(\beta^\top X) = 0$.
Flexible discriminant analysis Under these constraints 1 n 1 (Y 1 α) (X β) = 1 2(n 1) Y α X β 2. Fixing α and maximizing Ĉor(α Y, β X ) is a regression of X onto Y α.
Flexible discriminant analysis The (unpenalized) problem is recast as $\min_{\theta,\beta}\sum_{i=1}^n (\theta(l_i) - X_i\beta)^2$, with $l_i$ the i-th label and $X_i$ the i-th row of X, subject to the constraints $\sum_{i=1}^n \theta(l_i) = 0$, $\sum_{i=1}^n \theta(l_i)^2 = 1$. (A re-expression of $\hat E(\alpha^\top Y) = 0$, $\widehat{\mathrm{Var}}(\alpha^\top Y) = 1$.) As in CCA, we obtain successive pairs $(\hat\theta_l, \hat\beta_l)$ solving this problem... The inner regression can be replaced with a more flexible model...
Flexible discriminant analysis (FDA): algorithm (Ch. 12 ESL)
1. Let $\hat Y = \eta^*(X)$ be a linear regression estimate of $E(Y \mid X)$, i.e. $\hat Y$ is $n\times K$ with i-th row $\eta^*(X_i)$.
2. Let $C_{K\times K} = \hat Y^\top\hat Y$.
3. Let $\Theta$ be the eigenvectors of C, normalized so that $\Theta^\top D_\pi\Theta = I$ where $D_\pi = \mathrm{diag}(\hat\pi_1, \ldots, \hat\pi_K)$. [Maximization over $\alpha$.]
4. Define $\eta(x) = \Theta^\top\eta^*(x)$. [Update the output of the regression to give the optimal scores from above.]
5. Compute $\eta(X)_{n\times K}$ and the centroids $\bar\eta_1, \ldots, \bar\eta_K$.
6. Classify a new observation based on $\eta(x)$ to the nearest centroid $\bar\eta_j$.
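Below is a minimal numpy/scipy sketch of these steps with a plain linear inner regression (my own reading of the algorithm; the handling of the trivial constant score, the nearest-centroid prediction without per-score weights, and all function names are my choices, not the slides').

```python
import numpy as np
from scipy.linalg import eigh

def fda_linear(X, y):
    """FDA via optimal scoring with a linear regression inner fit (a sketch)."""
    y = np.asarray(y)
    classes = np.unique(y)
    n, K = len(y), len(classes)
    Y = (y[:, None] == classes[None, :]).astype(float)        # n x K indicator matrix
    Xc = np.column_stack([np.ones(n), X])                      # add intercept
    B, *_ = np.linalg.lstsq(Xc, Y, rcond=None)                 # multivariate regression
    Yhat = Xc @ B                                              # eta*(X), n x K
    Dpi = np.diag(Y.mean(axis=0))                              # diag(pi_1, ..., pi_K)
    # Eigenvectors of C = Yhat' Yhat, normalized so Theta' Dpi Theta = I.
    vals, Theta = eigh(Yhat.T @ Yhat / n, Dpi)
    order = np.argsort(vals)[::-1][1:]                         # drop the trivial constant score
    Theta = Theta[:, order]
    eta = Yhat @ Theta                                         # optimal scores eta(X)
    centroids = np.stack([eta[y == c].mean(axis=0) for c in classes])
    return B, Theta, centroids, classes

def fda_predict(Xnew, B, Theta, centroids, classes):
    """Classify to the nearest centroid in score space."""
    eta_new = np.column_stack([np.ones(len(Xnew)), Xnew]) @ B @ Theta
    d2 = ((eta_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d2, axis=1)]
```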
Flexible discriminant analysis (FDA)
1. FDA tries to minimize $L_\lambda(\Theta, \eta) = \frac{1}{2}\mathrm{Tr}\big((Y\Theta - \eta(X))^\top(Y\Theta - \eta(X))\big)$ where $\eta = \eta(X, Y, \Theta, \lambda)$ is a multivariate regression method.
2. More precisely, we could minimize $L_\lambda(\Theta, \beta) = \frac{1}{2}\mathrm{Tr}\big((Y\Theta - X\beta)^\top(Y\Theta - X\beta)\big) + \lambda P(\beta)$ for some penalty $P$.
3. Examples
3.1 LASSO: $P(\beta) = \sum_{i=1}^p\sum_{j=1}^k |\beta_{ij}|$
Steps of the alternating algorithm
1. Choose some initial $\Theta_0$ such that $\Theta_0^\top(Y^\top Y)\Theta_0 = nI_{k\times k}$.
2. For $\Theta$ fixed, define $\eta = \eta(X, Y\Theta, \lambda)$, $\eta: \mathbb{R}^p \to \mathbb{R}^k$, to be the output of the regression method when $Y\Theta$ is regressed on X.
3. For $(\eta, \Theta)$ fixed, define $\hat U = \hat U(Y\Theta, \eta(X)) = \operatorname{argmin}_{U:\,U^\top U=I}\mathrm{Tr}\big((Y\Theta U - \eta(X))^\top(Y\Theta U - \eta(X))\big)$.
Procrustes problem The problem $\hat U = \hat U(Y\Theta, \eta(X)) = \operatorname{argmin}_{U:\,U^\top U=I}\mathrm{Tr}\big((Y\Theta U - \eta(X))^\top(Y\Theta U - \eta(X))\big)$ is called a Procrustes problem. The matrix $\hat U$ can be obtained via an SVD of $(Y\Theta)^\top\eta(X)$: let $(Y\Theta)^\top\eta(X) = U_1DU_2^\top$; then $\hat U = U_1U_2^\top$. Note: if $(Y\Theta)^\top\eta(X)$ is symmetric (and positive semi-definite), then $\hat U = I$ and D contains its eigenvalues. These singular values are used as weights for the different optimal scores.
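The Procrustes step is only a few lines of numpy (a sketch of mine; here A plays the role of $Y\Theta$ and B of $\eta(X)$, and the returned singular values are the weights mentioned above):

```python
import numpy as np

def procrustes(A, B):
    """argmin over orthogonal U of || A U - B ||_F, via the SVD of A' B."""
    U1, d, Vt = np.linalg.svd(A.T @ B)
    return U1 @ Vt, d
```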
Alternating algorithm for FDA Choose some initial $\Theta_0$ such that $\Theta_0^\top Y^\top Y\Theta_0 = nI$. For $i \ge 1$, until convergence is reached based on $L_\lambda(\Theta, \eta)$:
1. Find $\eta_i = \eta(X, Y\Theta_i, \lambda)$.
2. Compute $(Y\Theta_i)^\top\eta_i(X)$ and find its SVD: $U_1D_iU_2^\top$.
3. Update $\Theta_{i+1} = \Theta_iU_1U_2^\top$.
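Putting the pieces together, here is a sketch of the alternating loop with a plain linear inner regression (my own construction; the fixed iteration count stands in for a proper convergence check, and the initialization uses the fact that $Y^\top Y = \mathrm{diag}(n_i)$):

```python
import numpy as np

def linear_fit(X, T):
    """Fitted values from a multivariate linear regression of T on X (with intercept)."""
    Xc = np.column_stack([np.ones(len(X)), X])
    return Xc @ np.linalg.lstsq(Xc, T, rcond=None)[0]

def fda_alternating(X, Y, fit=linear_fit, n_iter=20):
    """Alternate the regression step and the Procrustes rotation of Theta."""
    n, K = Y.shape
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.normal(size=(K, K)))                    # random orthogonal K x K
    Theta = np.sqrt(n) * np.diag(1.0 / np.sqrt(Y.sum(axis=0))) @ Q  # Theta' Y'Y Theta = n I
    for _ in range(n_iter):
        eta = fit(X, Y @ Theta)                                     # inner regression step
        U1, d, Vt = np.linalg.svd((Y @ Theta).T @ eta)
        Theta = Theta @ U1 @ Vt                                     # Procrustes update
    return Theta, fit(X, Y @ Theta), d
```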
Alternating algorithm for FDA This will converge as long as each of the steps of finding $\eta_i$ and $\Theta_{i+1}$ decreases the loss. Use $\eta$ to compute the class centroids $(\bar\eta_j)_{1\le j\le k}$. Classify using nearest centroids, with weights $1/(\hat D(1-\hat D))$.
Digits examples First example: $P(\beta) = \mathrm{Tr}(\beta_{(0)}^\top L\beta_{(0)})$, where $\beta_{(0)}$ is $\beta$ without the intercept term $\beta_0$. The penalty L is the discrete Laplacian of the $16\times 16$ lattice, defined as $\mathrm{diag}(\mathrm{rowsum}(A)) - A$ where $A_{256\times 256}$ is the adjacency matrix of the lattice. Second example: the LASSO penalty $P(\beta) = \sum_{i,j}|\beta_{ij}|$.
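For reference, the lattice Laplacian used as the penalty above can be built directly from the 4-neighbour adjacency structure of the $16\times 16$ pixel grid (a dense-matrix sketch of my own; a sparse construction would be preferable at larger sizes):

```python
import numpy as np

def lattice_laplacian(nrow=16, ncol=16):
    """Discrete Laplacian diag(rowsum(A)) - A of the 4-neighbour lattice graph."""
    n = nrow * ncol
    A = np.zeros((n, n))
    def idx(r, c):
        return r * ncol + c
    for r in range(nrow):
        for c in range(ncol):
            for dr, dc in [(0, 1), (1, 0)]:            # right and down neighbours
                if r + dr < nrow and c + dc < ncol:
                    A[idx(r, c), idx(r + dr, c + dc)] = 1
                    A[idx(r + dr, c + dc), idx(r, c)] = 1
    return np.diag(A.sum(axis=1)) - A

L = lattice_laplacian()    # 256 x 256 penalty matrix for the 16 x 16 digit images
```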
Digits: ridge with discrete Laplacian [figure slides]
Digits: ridge with LASSO [figure slides]