Data Mining and Analysis: Fundamental Concepts and Algorithms
dataminingbook.info

Mohammed J. Zaki (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA)
Wagner Meira Jr. (Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil)

Chap. 20: Linear Discriminant Analysis
Linear Discriminant Analysis

Given labeled data consisting of d-dimensional points x_i along with their classes y_i, the goal of linear discriminant analysis (LDA) is to find a vector w that maximizes the separation between the classes after projection onto w. The key difference between principal component analysis (PCA) and LDA is that the former deals with unlabeled data and tries to maximize variance, whereas the latter deals with labeled data and tries to maximize the discrimination between the classes.
Projection onto a Line

Let D_i denote the subset of points labeled with class c_i, i.e., D_i = {x_j | y_j = c_i}, and let |D_i| = n_i denote the number of points with class c_i. We assume that there are only k = 2 classes.

The projection of any d-dimensional point x_i onto a unit vector w is given as

    x'_i = ((w^T x_i) / (w^T w)) w = (w^T x_i) w = a_i w

where a_i specifies the offset or coordinate of x_i along the line w:

    a_i = w^T x_i

The set of n scalars {a_1, a_2, ..., a_n} represents the mapping from R^d to R, that is, from the original d-dimensional space to a 1-dimensional space (along w).
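As a concrete illustration, the offsets a_i = w^T x_i for all points can be computed in one matrix-vector product (a minimal NumPy sketch; the data, the direction, and the helper name `project` are made-up examples):

```python
import numpy as np

def project(X, w):
    """Return the offsets a_i = w^T x_i of each row of X along the unit vector w."""
    w = w / np.linalg.norm(w)   # ensure w is a unit vector
    return X @ w                # one scalar coordinate per point

# two made-up 2-dimensional points projected onto the x-axis direction
X = np.array([[1.0, 2.0], [3.0, 4.0]])
a = project(X, np.array([1.0, 0.0]))   # -> offsets [1.0, 3.0]
```

Because `project` normalizes w internally, only the direction of w matters, not its length.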
Projection onto w: Iris 2D Data

[Figure: Iris 2D data with iris-setosa as class c_1 (circles) and the other two Iris types as class c_2 (triangles), projected onto a line w.]
Optimal Linear Discriminant Direction

[Figure: Iris 2D data showing the optimal linear discriminant direction w.]
Optimal Linear Discriminant

The means of the projected points for the two classes are given as

    m_1 = w^T μ_1        m_2 = w^T μ_2

To maximize the separation between the classes, we maximize the difference between the projected means, |m_1 - m_2|. However, for good separation, the variance of the projected points for each class should also not be too large. LDA maximizes the separation by ensuring that the scatter s_i^2 of the projected points within each class is small, where the scatter is defined as

    s_i^2 = Σ_{x_j ∈ D_i} (a_j - m_i)^2 = n_i σ_i^2

where σ_i^2 is the variance for class c_i.
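The identity s_i^2 = n_i σ_i^2 is easy to check numerically (a small sketch with made-up projected coordinates; `np.var` computes the population variance, which is the σ_i^2 used here):

```python
import numpy as np

a = np.array([1.0, 2.0, 4.0, 5.0])            # made-up projected coordinates for one class
m = a.mean()                                   # projected class mean m_i
scatter = np.sum((a - m) ** 2)                 # s_i^2, the projected scatter
print(np.isclose(scatter, len(a) * a.var()))   # s_i^2 = n_i * sigma_i^2 -> True
```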
Linear Discriminant Analysis: Fisher Objective

We incorporate the two LDA criteria, namely, maximizing the distance between the projected means and minimizing the sum of the projected scatters, into a single maximization criterion called the Fisher LDA objective:

    max_w J(w) = (m_1 - m_2)^2 / (s_1^2 + s_2^2)

In matrix terms, we can rewrite (m_1 - m_2)^2 as follows:

    (m_1 - m_2)^2 = (w^T (μ_1 - μ_2))^2 = w^T B w

where B = (μ_1 - μ_2)(μ_1 - μ_2)^T is a d × d rank-one matrix called the between-class scatter matrix.

The projected scatter for class c_i is given as

    s_i^2 = Σ_{x_j ∈ D_i} (w^T x_j - w^T μ_i)^2 = w^T ( Σ_{x_j ∈ D_i} (x_j - μ_i)(x_j - μ_i)^T ) w = w^T S_i w

where S_i is the scatter matrix for D_i.
Linear Discriminant Analysis: Fisher Objective

The combined scatter for both classes is given as

    s_1^2 + s_2^2 = w^T S_1 w + w^T S_2 w = w^T (S_1 + S_2) w = w^T S w

where the symmetric positive semidefinite matrix S = S_1 + S_2 denotes the within-class scatter matrix for the pooled data. The LDA objective function in matrix form is

    max_w J(w) = (w^T B w) / (w^T S w)

To solve for the best direction w, we differentiate the objective function with respect to w; after simplification this yields the generalized eigenvalue problem

    B w = λ S w

where λ = J(w) is a generalized eigenvalue of B and S. To maximize the objective, λ should be chosen to be the largest generalized eigenvalue, and w the corresponding eigenvector.
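Since B is rank one, the generalized problem B w = λ S w has a single nonzero eigenvalue, which can be found from the ordinary eigendecomposition of S^{-1} B (a quick numerical check; the values of μ_1, μ_2, and S below are made up):

```python
import numpy as np

mu1, mu2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
B = np.outer(mu1 - mu2, mu1 - mu2)            # between-class scatter (rank one)
S = np.array([[2.0, 0.5], [0.5, 1.0]])        # a made-up positive definite within-class scatter
evals, evecs = np.linalg.eig(np.linalg.solve(S, B))   # eigendecomposition of S^{-1} B
k = np.argmax(evals.real)
lam, w = evals.real[k], evecs[:, k].real      # dominant eigenpair
print(np.allclose(B @ w, lam * S @ w))        # w solves B w = lambda S w -> True
```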
Linear Discriminant Algorithm

LINEARDISCRIMINANT (D = {(x_i, y_i)}_{i=1}^n):
1. D_i ← {x_j | y_j = c_i, j = 1, ..., n}, i = 1, 2   // class-specific subsets
2. μ_i ← mean(D_i), i = 1, 2                          // class means
3. B ← (μ_1 - μ_2)(μ_1 - μ_2)^T                       // between-class scatter matrix
4. Z_i ← D_i - 1_{n_i} μ_i^T, i = 1, 2                // center class matrices
5. S_i ← Z_i^T Z_i, i = 1, 2                          // class scatter matrices
6. S ← S_1 + S_2                                      // within-class scatter matrix
7. λ_1, w ← eigen(S^{-1} B)                           // compute dominant eigenvector
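The pseudocode above translates almost line for line into NumPy (a sketch, assuming two classes labeled 1 and 2; the function name `linear_discriminant` and the test data are made up):

```python
import numpy as np

def linear_discriminant(X, y):
    """Fisher LDA for two classes labeled 1 and 2: dominant eigenvector of S^{-1} B."""
    X1, X2 = X[y == 1], X[y == 2]                        # class-specific subsets
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)          # class means
    B = np.outer(mu1 - mu2, mu1 - mu2)                   # between-class scatter matrix
    Z1, Z2 = X1 - mu1, X2 - mu2                          # centered class matrices
    S = Z1.T @ Z1 + Z2.T @ Z2                            # within-class scatter matrix
    evals, evecs = np.linalg.eig(np.linalg.solve(S, B))  # eigen(S^{-1} B)
    w = evecs[:, np.argmax(evals.real)].real             # dominant eigenvector
    return w / np.linalg.norm(w)                         # normalize to a unit vector

# two well-separated made-up classes
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([1, 1, 1, 2, 2, 2])
w = linear_discriminant(X, y)
```

Note that the sign of w is arbitrary: both w and -w define the same discriminant line.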
Linear Discriminant Direction: Iris 2D Data

The between-class scatter matrix is

    B = [  1.587  -0.693 ]
        [ -0.693   0.303 ]

and the class and within-class scatter matrices are

    S_1 = [ 6.09  4.91 ]    S_2 = [ 43.5   12.09 ]    S = [ 49.58  17.01 ]
          [ 4.91  7.11 ]          [ 12.09  10.96 ]        [ 17.01  18.08 ]

The direction of most separation between c_1 and c_2 is the dominant eigenvector corresponding to the largest eigenvalue of the matrix S^{-1} B. The solution is

    J(w) = λ_1 = 0.11        w = (  0.551 )
                                 ( -0.834 )

[Figure: Iris 2D data with the optimal linear discriminant direction w.]
Linear Discriminant Analysis: Two Classes

For the two-class scenario, if S is nonsingular, we can directly solve for w without computing the eigenvalues and eigenvectors. The vector B w always points in the same direction as (μ_1 - μ_2), because

    B w = ((μ_1 - μ_2)(μ_1 - μ_2)^T) w = (μ_1 - μ_2)((μ_1 - μ_2)^T w) = b (μ_1 - μ_2)

where b = (μ_1 - μ_2)^T w is a scalar. The generalized eigenvector equation B w = λ S w can then be rewritten as

    w = (b/λ) S^{-1}(μ_1 - μ_2)

Because b/λ is just a scalar multiplier, we can solve for the best linear discriminant as

    w = S^{-1}(μ_1 - μ_2)

We can finally normalize w to be a unit vector.
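The closed-form two-class solution needs only a single linear solve (a sketch; `lda_direction` is a made-up helper name, and `np.linalg.solve` is used instead of forming S^{-1} explicitly, which is numerically preferable):

```python
import numpy as np

def lda_direction(X1, X2):
    """Closed-form two-class LDA direction w = S^{-1}(mu1 - mu2), as a unit vector."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    Z1, Z2 = X1 - mu1, X2 - mu2
    S = Z1.T @ Z1 + Z2.T @ Z2              # within-class scatter matrix
    w = np.linalg.solve(S, mu1 - mu2)      # solves S w = mu1 - mu2
    return w / np.linalg.norm(w)

# made-up classes; here mu1 - mu2 = (-3, -3), so w points along (-1, -1)
X1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X2 = np.array([[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
w = lda_direction(X1, X2)
```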
Kernel Discriminant Analysis

The goal of kernel LDA is to find the direction vector w in feature space that maximizes

    max_w J(w) = (m_1 - m_2)^2 / (s_1^2 + s_2^2)

It is well known that w can be expressed as a linear combination of the points in feature space:

    w = Σ_{j=1}^n a_j φ(x_j)

The mean for class c_i in feature space is given as

    μ_i^φ = (1/n_i) Σ_{x_j ∈ D_i} φ(x_j)

and the covariance matrix for class c_i in feature space is

    Σ_i^φ = (1/n_i) Σ_{x_j ∈ D_i} (φ(x_j) - μ_i^φ)(φ(x_j) - μ_i^φ)^T
Kernel Discriminant Analysis

The between-class scatter matrix in feature space is

    B_φ = (μ_1^φ - μ_2^φ)(μ_1^φ - μ_2^φ)^T

and the within-class scatter matrix in feature space is

    S_φ = n_1 Σ_1^φ + n_2 Σ_2^φ

S_φ is a d × d symmetric positive semidefinite matrix, where d is the dimensionality of the feature space. The best linear discriminant vector w in feature space is the dominant eigenvector, which satisfies

    (S_φ^{-1} B_φ) w = λ w

where we assume that S_φ is nonsingular.
LDA Objective via Kernel Matrix: Between-class Scatter

The projected mean for class c_i is given as

    m_i = w^T μ_i^φ = (1/n_i) Σ_{j=1}^n Σ_{x_k ∈ D_i} a_j K(x_j, x_k) = a^T m_i

where a = (a_1, a_2, ..., a_n)^T is the weight vector and

    m_i = (1/n_i) ( Σ_{x_k ∈ D_i} K(x_1, x_k), Σ_{x_k ∈ D_i} K(x_2, x_k), ..., Σ_{x_k ∈ D_i} K(x_n, x_k) )^T = (1/n_i) K^{c_i} 1_{n_i}

where K^{c_i} is the n × n_i subset of the kernel matrix restricted to the columns belonging to points only in D_i, and 1_{n_i} is the n_i-dimensional vector all of whose entries are one. The separation between the projected means is thus

    (m_1 - m_2)^2 = (a^T m_1 - a^T m_2)^2 = a^T M a

where M = (m_1 - m_2)(m_1 - m_2)^T is the between-class scatter matrix.
LDA Objective via Kernel Matrix: Within-class Scatter

We can compute the projected scatter for each class, s_1^2 and s_2^2, purely in terms of the kernel function, as follows:

    s_1^2 = Σ_{x_i ∈ D_1} (w^T φ(x_i) - w^T μ_1^φ)^2 = a^T ( Σ_{x_i ∈ D_1} K_i K_i^T - n_1 m_1 m_1^T ) a = a^T N_1 a

where K_i is the ith column of the kernel matrix and N_1 is the class scatter matrix for c_1. The sum of the projected scatter values is then given as

    s_1^2 + s_2^2 = a^T (N_1 + N_2) a = a^T N a

where N is the n × n within-class scatter matrix.
Kernel LDA

The kernel LDA maximization condition is

    max_w J(w) = max_a J(a) = (a^T M a) / (a^T N a)

The weight vector a is the eigenvector corresponding to the largest eigenvalue λ_1 of the generalized eigenvalue problem

    M a = λ_1 N a

When there are only two classes, a can be obtained directly:

    a = N^{-1}(m_1 - m_2)

To normalize w to be a unit vector, we scale a by 1/√(a^T K a). We can project any point x onto the discriminant direction as follows:

    w^T φ(x) = Σ_{j=1}^n a_j φ(x_j)^T φ(x) = Σ_{j=1}^n a_j K(x_j, x)
Kernel Discriminant Analysis Algorithm

KERNELDISCRIMINANT (D = {(x_i, y_i)}_{i=1}^n, K):
1. K ← {K(x_i, x_j)}_{i,j=1,...,n}                               // compute n × n kernel matrix
2. K^{c_i} ← {K(j, k) | y_k = c_i, 1 ≤ j, k ≤ n}, i = 1, 2       // class kernel matrices
3. m_i ← (1/n_i) K^{c_i} 1_{n_i}, i = 1, 2                       // class means
4. M ← (m_1 - m_2)(m_1 - m_2)^T                                  // between-class scatter matrix
5. N_i ← K^{c_i} (I_{n_i} - (1/n_i) 1_{n_i × n_i}) (K^{c_i})^T, i = 1, 2   // class scatter matrices
6. N ← N_1 + N_2                                                 // within-class scatter matrix
7. λ_1, a ← eigen(N^{-1} M)                                      // compute weight vector a
8. a ← a / √(a^T K a)                                            // normalize w to be a unit vector
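The kernel algorithm can be sketched in NumPy using the direct two-class solution a = N^{-1}(m_1 - m_2) (a sketch, assuming labels 1 and 2; the function name, the small ridge added to N to guard against singularity, and the 1-d test data are all assumptions, not part of the original algorithm):

```python
import numpy as np

def kernel_discriminant(K, y, ridge=1e-8):
    """Two-class kernel Fisher discriminant: weight vector a = N^{-1}(m1 - m2)."""
    n = K.shape[0]
    idx1, idx2 = np.flatnonzero(y == 1), np.flatnonzero(y == 2)
    K1, K2 = K[:, idx1], K[:, idx2]             # n x n_i class kernel matrices K^{c_i}
    m1, m2 = K1.mean(axis=1), K2.mean(axis=1)   # class mean vectors (1/n_i) K^{c_i} 1
    C1 = np.eye(len(idx1)) - 1.0 / len(idx1)    # centering matrix I - (1/n_1) 1 1^T
    C2 = np.eye(len(idx2)) - 1.0 / len(idx2)
    N = K1 @ C1 @ K1.T + K2 @ C2 @ K2.T         # within-class scatter matrix
    a = np.linalg.solve(N + ridge * np.eye(n), m1 - m2)   # ridge: N is often singular
    return a / np.sqrt(a @ K @ a)               # scale so that w is a unit vector

# made-up 1-d data; quadratic homogeneous kernel K(x, z) = (x z)^2 separates
# large-magnitude points (class 1) from small-magnitude points (class 2)
X = np.array([-1.0, -0.9, 1.0, 1.1, 0.1, -0.1])
y = np.array([1, 1, 1, 1, 2, 2])
K = np.outer(X, X) ** 2
a = kernel_discriminant(K, y)
proj = K @ a   # projections w^T phi(x_i); the two classes separate along this line
```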
Kernel Discriminant Analysis: Quadratic Homogeneous Kernel

Iris 2D data: c_1 (circles; iris-virginica) and c_2 (triangles; other two Iris types). Kernel function: K(x_i, x_j) = (x_i^T x_j)^2.

[Figure: contours of constant projection onto the optimal kernel discriminant.]
Kernel Discriminant Analysis: Quadratic Homogeneous Kernel

Iris 2D data: c_1 (circles; iris-virginica) and c_2 (triangles; other two Iris types). Kernel function: K(x_i, x_j) = (x_i^T x_j)^2.

[Figure: projection of the points onto the optimal kernel discriminant direction w.]
Kernel Feature Space and Optimal Discriminant

Iris 2D data: c_1 (circles; iris-virginica) and c_2 (triangles; other two Iris types). Kernel function: K(x_i, x_j) = (x_i^T x_j)^2.

[Figure: the points mapped into the feature space with axes X_1^2, X_2^2, and X_1 X_2, together with the optimal discriminant direction w.]