Linear Algebra Methods for Data Mining


Linear Discriminant Analysis
Saara Hyvönen, Spring 2007

Principal components analysis. Idea: look for the direction such that the data projected onto it has maximal variance. When found, continue by seeking the next direction, which is orthogonal to this one (i.e. uncorrelated), and which explains as much of the remaining variance in the data as possible. Ergo: we are seeking linear combinations of the original variables. If we are lucky, we can find a few such linear combinations, or directions, or (principal) components, which describe the data fairly accurately. The aim is to capture the intrinsic variability in the data.

[Figure: a data cloud with the 1st and 2nd principal component directions indicated.]

How to compute the PCA. Data matrix A, rows = data points, columns = variables (attributes, parameters).
1. Center the data by subtracting the mean of each column.
2. Compute the SVD of the centered matrix Â (or its k first singular values and vectors): Â = UΣV^T.
3. The principal components are the columns of V; the coordinates of the data in the basis defined by the principal components are UΣ.
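A minimal Matlab sketch of these steps (a sketch, not from the original slides; it assumes the data matrix A has rows = data points, as above):

[m, n] = size(A);
Ahat = A - repmat(mean(A, 1), m, 1);   % step 1: center each column
[U, S, V] = svd(Ahat, 'econ');         % step 2: thin SVD of the centered matrix
pcs = V;                               % step 3: principal components (columns of V)
coords = U * S;                        % coordinates of the data in the PC basis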

But the PCs are not always what we want! [Figure: a two-group point set for which the principal components do not separate the groups.]

Example: Atmospheric data. Data: 1500 days, and for each day we have the means and stds of around 30 measured variables (temperature, wind speed and direction, rainfall, UV-A radiation, concentration of CO2, etc.). Therefore, our data matrix has 1500 rows and about 60 columns. Visualizing things in a 60-dimensional space is challenging! Instead, do PCA, and project the days onto the plane defined by the first two principal components.

[Figure: days projected onto the plane defined by the first two principal components, colored by month; axes: 1st and 2nd principal component.]

But this is not really what we are interested in! Instead, we are interested in distinguishing the days when new particles spontaneously form from the days with no such formation. Principal components are not very good at this!

[Figure: days projected onto the plane defined by the first two principal components, colored according to particle formation; axes: 1st and 2nd principal component.]

What to do? Look instead for a direction which minimizes the within-group variance and maximizes the between-group variance. Project the data onto this direction: the groups should be well separated! This is what Linear Discriminant Analysis does.

[Figure: days projected onto the plane defined by the first two linear discriminants, colored according to particle formation; axes: 1st and 2nd linear discriminant.]

Linear Discriminant Analysis. We are given the data matrix together with class labels: each data point belongs to one of the classes 1, ..., k. Goal: map the original data into features that most effectively discriminate between the classes. In other words, reduce the dimension of the data in a way that best preserves its cluster structure.

Assume the columns of A ∈ R^{m×n} are grouped into k clusters: A = [A_1 A_2 ... A_k], A_i ∈ R^{m×n_i}, with Σ_{i=1}^k n_i = n. The centroid of each cluster is computed by taking the average of the columns in A_i: c_i = (1/n_i) A_i e_i, where e_i = (1, ..., 1)^T ∈ R^{n_i×1}, and the global centroid is defined as c = (1/n) A e, where e = (1, ..., 1)^T ∈ R^{n×1}.

Let N_i denote the set of column indices that belong to cluster A_i. Then the within-cluster, between-cluster and mixture (or total) scatter matrices are defined as follows:
S_w = Σ_{i=1}^k Σ_{j∈N_i} (a_j - c_i)(a_j - c_i)^T,
S_b = Σ_{i=1}^k Σ_{j∈N_i} (c_i - c)(c_i - c)^T = Σ_{i=1}^k n_i (c_i - c)(c_i - c)^T,
S_m = Σ_{j=1}^n (a_j - c)(a_j - c)^T.
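A minimal Matlab sketch of these definitions (an illustration only; it assumes the data points are the columns of A and that a vector labels with values 1, ..., k gives the cluster of each column):

[m, n] = size(A);
c = mean(A, 2);                        % global centroid
Sw = zeros(m); Sb = zeros(m);
for i = 1:k
    Ai = A(:, labels == i);            % columns belonging to cluster i
    ni = size(Ai, 2);
    ci = mean(Ai, 2);                  % centroid of cluster i
    D = Ai - repmat(ci, 1, ni);
    Sw = Sw + D*D';                    % within-cluster scatter
    Sb = Sb + ni*(ci - c)*(ci - c)';   % between-cluster scatter
end
Sm = Sw + Sb;                          % mixture (total) scatter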

Let w be the vector along which we shall project our data. Now we achieve our goal by maximizing the objective
J(w) = (w^T S_b w) / (w^T S_w w).
In doing so, we minimize the within-cluster scatter while maximizing the between-cluster scatter.

Note: one can show that S_m = S_w + S_b, so
J(w) = (w^T S_b w) / (w^T S_w w)
can be written as
J(w) = (w^T S_m w) / (w^T S_w w) - 1,
which means we are maximizing the total scatter while minimizing the within-cluster scatter.
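For completeness, a short derivation of S_m = S_w + S_b (not spelled out on the slides): split each deviation from the global centroid as a_j - c = (a_j - c_i) + (c_i - c) for j ∈ N_i. Then
S_m = Σ_{i=1}^k Σ_{j∈N_i} [(a_j - c_i) + (c_i - c)][(a_j - c_i) + (c_i - c)]^T = S_w + S_b + cross terms,
and the cross terms vanish because Σ_{j∈N_i} (a_j - c_i) = 0 by the definition of the centroid c_i.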

Note that the value of J(w) = (w^T S_b w) / (w^T S_w w) is the same regardless of how we scale w → αw. Before (when discussing PCA) we chose w^T w = 1. This time, let us require w to be such that w^T S_w w = 1.

Now our problem can be stated as follows: we wish to maximize J(w) = (w^T S_b w) / (w^T S_w w) subject to the constraint w^T S_w w = 1. Optimization problem: maximize
f = w^T S_b w - λ(w^T S_w w - 1),
where λ is the Lagrange multiplier.

Again we solve the optimization problem
max_w f = max_w ( w^T S_b w - λ(w^T S_w w - 1) )
by differentiating with respect to w; this yields
∂f/∂w = 2 S_b w - 2λ S_w w = 0.
This leads to the generalized eigenvalue problem S_b w = λ S_w w.

If S_w is invertible, the generalized eigenproblem S_b w = λ S_w w can be written as S_w^{-1} S_b w = λ w. The solutions of this are the eigenvalues and eigenvectors of the matrix S_w^{-1} S_b.
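A small Matlab check of this equivalence, using made-up symmetric test matrices (not from the slides):

m = 5;
R1 = randn(m); Sb = R1*R1';            % symmetric positive semidefinite stand-in for S_b
R2 = randn(m); Sw = R2*R2' + eye(m);   % symmetric positive definite stand-in for S_w
lam1 = sort(eig(Sb, Sw));              % generalized eigenvalues of the pair (S_b, S_w)
lam2 = sort(real(eig(Sw \ Sb)));       % eigenvalues of inv(S_w)*S_b
norm(lam1 - lam2)                      % should be of the order of rounding error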

Denote the eigenvalues and eigenvectors by λ_k and w_k. Remembering that S_b w_k = λ_k S_w w_k, insert these into J(w):
J(w_k) = (w_k^T S_b w_k) / (w_k^T S_w w_k) = (w_k^T λ_k S_w w_k) / (w_k^T S_w w_k) = λ_k.
So the direction w which maximizes the value of J(w) is the eigenvector corresponding to the largest eigenvalue of S_w^{-1} S_b. The largest eigenvalue tells how well the classes separate.

Generalized eigenvalue problems. Let A, B ∈ R^{n×n} and consider Ax = λBx, x ≠ 0. There are n generalized eigenvalues λ if and only if rank(B) = n. If rank(B) < n, then the number of λ may be zero, finite, or infinite:
A = [1 2; 0 3], B = [1 0; 0 0]: exactly one eigenvalue (λ = 1);
A = [1 2; 0 3], B = [0 1; 0 0]: no eigenvalues;
A = [1 2; 0 0], B = [1 0; 0 0]: every λ is an eigenvalue.

Symmetric-definite generalized eigenproblems. Let the matrices A, B ∈ R^{n×n} be such that A is symmetric and B is symmetric positive definite. Find λ and x ≠ 0 such that Ax = λBx.

Theorem. If A, B ∈ R^{n×n}, A symmetric, B symmetric positive definite, then there exists a nonsingular X = [x_1, ..., x_n] such that
X^T A X = diag(a_1, ..., a_n), X^T B X = diag(b_1, ..., b_n).
Moreover, A x_i = λ_i B x_i for i = 1, ..., n, where λ_i = a_i / b_i.
Note: the matrix X can be chosen in such a way that
X^T A X = diag(λ_1, ..., λ_n), X^T B X = diag(1, ..., 1).

Example. For a small numerical illustration of the theorem, see the sketch below.
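A minimal Matlab sketch with made-up 2-by-2 matrices (not the original slide's numbers), showing the simultaneous diagonalization:

A = [2 1; 1 2];          % symmetric
B = [4 0; 0 1];          % symmetric positive definite
[X, D] = eig(A, B);      % generalized eigenvectors x_i and eigenvalues lambda_i
X' * A * X               % diagonal (entries a_i)
X' * B * X               % diagonal (entries b_i), with a_i / b_i = lambda_i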

Our generalized eigenvalue problem was S_b w = λ S_w w. We know that both S_b and S_w are symmetric and positive semidefinite. Assuming S_w is invertible, it is also positive definite. So there is a matrix X such that
X^T S_b X = diag(λ_1, ..., λ_n), X^T S_w X = diag(1, ..., 1),
and S_b x_i = λ_i S_w x_i. Since S_b is positive semidefinite and x_i^T S_b x_i = λ_i, we see that λ_i ≥ 0 for all i.

So, the generalized eigenvalues of our problem S_b w = λ S_w w are all nonnegative. Moreover, only the largest r eigenvalues are nonzero, where r = rank(S_b). Remember that
S_b = Σ_{i=1}^k n_i (c_i - c)(c_i - c)^T,
which is a sum of k rank-1 matrices, so the rank of S_b is at most k.

Here we only considered looking for the first linear discriminant, which is the eigenvector corresponding to the largest eigenvalue of S_w^{-1} S_b. The l first linear discriminants are (of course!) the eigenvectors corresponding to the l largest eigenvalues.

So how did we get the linear discriminants?
Step 1: Compute the scatter matrices S_b and S_w.
Step 2: Solve the generalized eigenvalue problem S_b w = λ S_w w. (In Matlab you can use eigs!)
Step 3: Order the eigenvalues from largest to smallest, and the eigenvectors accordingly. These are your linear discriminants.
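A minimal Matlab sketch of Steps 2 and 3 (assuming Sw and Sb have been computed as above and that Sw is nonsingular):

[W, D] = eig(Sb, Sw);                  % Step 2: generalized eigenvalue problem Sb*w = lambda*Sw*w
[lam, p] = sort(diag(D), 'descend');   % Step 3: order the eigenvalues from largest to smallest
W = W(:, p);                           % reorder the eigenvectors accordingly
L = W(:, 1:2);                         % e.g. the first two linear discriminants
Y = L' * A;                            % data (columns of A) projected onto the discriminants

For large problems one could use eigs(Sb, Sw, l) to compute only the l leading discriminants, as the slide suggests.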

Step 4: In classification: use a training set to decide where the boundaries are, and use a test set to evaluate the performance. Note: in reality, one should use N-fold cross-validation: divide all the data into N parts, and use one part as the test set and the rest as the training set. Report the average performance across the test sets. Why? To get a more reliable estimate of the performance (avoiding the situation where the training set and the test set are "too good").
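A minimal N-fold cross-validation sketch in Matlab (the helpers train_lda and classify_lda are hypothetical placeholders standing in for the training and classification steps described above; A and labels are as in the earlier scatter-matrix sketch):

N = 10;
n = size(A, 2);                          % data points are the columns of A
folds = mod(randperm(n), N) + 1;         % assign each column to one of N folds at random
acc = zeros(N, 1);
for f = 1:N
    test = (folds == f);
    train = ~test;
    model = train_lda(A(:, train), labels(train));   % hypothetical: Steps 1-4 on the training part
    pred = classify_lda(model, A(:, test));          % hypothetical: column vector of predicted labels
    acc(f) = mean(pred == reshape(labels(test), [], 1));   % fraction correct in this fold
end
mean(acc)                                % average performance across the test sets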

Example: 2 classes in 2D (Matlab). With the two classes stored as the columns of A1 and A2:

n1 = size(A1, 2); n2 = size(A2, 2);
c1 = mean(A1, 2); c2 = mean(A2, 2);              % class centroids
A = [A1 A2]; c = mean(A, 2);                     % all data and the global centroid
Sb = n1*(c1-c)*(c1-c)' + n2*(c2-c)*(c2-c)';      % between-class scatter
tmp1 = A1 - repmat(c1, 1, n1);
tmp2 = A2 - repmat(c2, 1, n2);
Sw = tmp1*tmp1' + tmp2*tmp2';                    % within-class scatter
[V, D] = eig(Sb, Sw);                            % small dense problem: eig suffices (eigs(Sb, Sw, 1) for large, sparse problems)

[Figure: the two-class example data in the original coordinates.]

[Figure: the two-class example data in PC coordinates (1st vs 2nd principal component).]

[Figure: the two-class example data projected onto the plane defined by the first two linear discriminants.]

[Figure: the two-class example data with the 1st principal component (solid line) and the 1st linear discriminant (dashed line).]

Two-class case.
S_w = Σ_{j∈N_1} (a_j - c_1)(a_j - c_1)^T + Σ_{j∈N_2} (a_j - c_2)(a_j - c_2)^T = n_1 Σ_1 + n_2 Σ_2,
where Σ_i is the covariance matrix of class i. Also, we can use the fact that c = (1/n)(n_1 c_1 + n_2 c_2) to get
S_b = n_1 (c_1 - c)(c_1 - c)^T + n_2 (c_2 - c)(c_2 - c)^T = (n_1 n_2 / n)(c_2 - c_1)(c_2 - c_1)^T.
This is a rank-1 matrix!

We have, for nonsingular S_w,
S_w^{-1} S_b w_1 = S_w^{-1} (n_1 n_2 / n)(c_2 - c_1)(c_2 - c_1)^T w_1 = λ_1 w_1,
which yields (for some α)
w_1 = α S_w^{-1} (c_2 - c_1),
and
λ_1 = (n_1 n_2 / n)(c_2 - c_1)^T S_w^{-1} (c_2 - c_1)  ( = trace(S_w^{-1} S_b) ).
Good class separation?
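A Matlab sketch of this two-class shortcut (assuming Sw, Sb, c1, c2, n1, n2 from the two-class example above):

n = n1 + n2;
w1 = Sw \ (c2 - c1);                                % first discriminant direction, up to the scaling alpha
lam1 = (n1*n2/n) * (c2 - c1)' * (Sw \ (c2 - c1));   % the corresponding (and only nonzero) eigenvalue
% Sanity check: lam1 should equal trace(Sw \ Sb) for the rank-1 Sb of the two-class case.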

Remember: the largest eigenvalue tells how well the classes separate. Since
λ_1 = (n_1 n_2 / n)(c_2 - c_1)^T S_w^{-1} (c_2 - c_1),
we get better separation of the two classes if the difference of the class means, c_2 - c_1, is large relative to the weighted sum of the class covariance matrices, n_1 Σ_1 + n_2 Σ_2 = S_w.

Atmospheric data again. [Figure: days projected onto the plane defined by the first two linear discriminants, colored according to particle formation; axes: 1st and 2nd linear discriminant.]

Could we look at the weights of the variables in the first linear discriminant to see which variables are important in separating the red dots from the blue?

Fisher discriminant analysis (R. A. Fisher, The use of multiple measurements in taxonomic problems, 1936). Uses a slightly different criterion: maximize
J(w) = (w^T S_b w) / (w^T S_w w),
where the scatter matrices are S_b = (c_2 - c_1)(c_2 - c_1)^T and S_w = Σ_1 + Σ_2. If the classes are of equal size (n_1 = n_2), then this is the same as what we discussed above.

Other criteria. Several measures of cluster quality which involve the three scatter matrices have been suggested, including J = trace(S_w^{-1} S_b) and J = trace(S_w^{-1} S_m). For more discussion on these and others, see e.g. [2] and the references therein.

What if S_w is singular? Then this approach will not work, as it is based on finding the eigenvalues of S_w^{-1} S_b! This is typically the case in undersampled problems, where the number of samples is small compared to the dimension of the data points: for example, microarray data, text data, image data. Answer: instead of solving the generalized eigenproblem we can formulate the problem in terms of the generalized SVD.

References
[1] Lars Eldén: Matrix Methods in Data Mining and Pattern Recognition, SIAM.
[2] P. Howland and H. Park: Extension of Discriminant Analysis based on the Generalized Singular Value Decomposition.
[3] B. G. Ripley: Pattern Recognition and Neural Networks, Cambridge University Press.
