6-1. Canonical Correlation Analysis


Canonical correlation analysis focuses on the correlation between a linear combination of the variables in one set and a linear combination of the variables in another set. The key concepts are canonical variables and canonical correlations. Examples: relating arithmetic speed and arithmetic power to reading speed and reading power; governmental policy variables with economic goal variables; college performance variables with precollege achievement variables.

Canonical Variates and Canonical Correlations

Let $X^{(1)}$ be a $p \times 1$ random vector and $X^{(2)}$ a $q \times 1$ random vector with $p \le q$, and set $E(X^{(1)}) = \mu^{(1)}$, $\mathrm{Cov}(X^{(1)}) = \Sigma_{11}$, $E(X^{(2)}) = \mu^{(2)}$, $\mathrm{Cov}(X^{(2)}) = \Sigma_{22}$, and $\mathrm{Cov}(X^{(1)}, X^{(2)}) = \Sigma_{12} = \Sigma_{21}'$. For the linear combinations $U = a'X^{(1)}$ and $V = b'X^{(2)}$, we seek coefficient vectors $a$ and $b$ such that
$$\mathrm{Corr}(U, V) = \frac{a'\Sigma_{12} b}{\sqrt{a'\Sigma_{11} a}\,\sqrt{b'\Sigma_{22} b}}$$
is as large as possible.

The first pair of canonical variables, or first canonical variate pair, is the pair of linear combinations $U_1$ and $V_1$ having unit variances that maximizes the correlation $\mathrm{Corr}(U, V)$. The second pair of canonical variables, or second canonical variate pair, is the pair of linear combinations $U_2$ and $V_2$ having unit variances that maximizes the correlation $\mathrm{Corr}(U, V)$ among all choices that are uncorrelated with the first pair of canonical variables. In general, the $k$th pair of canonical variables, or $k$th canonical variate pair, is the pair of linear combinations $U_k$, $V_k$ having unit variances that maximizes the correlation $\mathrm{Corr}(U, V)$ among all choices uncorrelated with the previous $k-1$ canonical variable pairs. The correlation between the $k$th pair of canonical variables is called the $k$th canonical correlation.

Result 6-1.1: Suppose $p \le q$ and let the $p$-dimensional random vector $X^{(1)}$ and the $q$-dimensional random vector $X^{(2)}$ have $\mathrm{Cov}(X^{(1)}) = \Sigma_{11}$, $\mathrm{Cov}(X^{(2)}) = \Sigma_{22}$, and $\mathrm{Cov}(X^{(1)}, X^{(2)}) = \Sigma_{12}$, where $\Sigma_{12}$ has full rank. For coefficient vectors $a$ ($p \times 1$) and $b$ ($q \times 1$), form the linear combinations $U = a'X^{(1)}$ and $V = b'X^{(2)}$. Then
$$\max_{a,b} \mathrm{Corr}(U, V) = \rho_1,$$
attained by the linear combinations (the first canonical variate pair)
$$U_1 = e_1'\Sigma_{11}^{-1/2} X^{(1)} \quad \text{and} \quad V_1 = f_1'\Sigma_{22}^{-1/2} X^{(2)}.$$
The $k$th pair of canonical variates, $k = 2, 3, \ldots, p$,
$$U_k = e_k'\Sigma_{11}^{-1/2} X^{(1)} \quad \text{and} \quad V_k = f_k'\Sigma_{22}^{-1/2} X^{(2)},$$
maximizes $\mathrm{Corr}(U_k, V_k) = \rho_k$ among those linear combinations uncorrelated with the preceding $1, 2, \ldots, k-1$ canonical variate pairs.

Here $\rho_1^2 \ge \rho_2^2 \ge \cdots \ge \rho_p^2$ are the eigenvalues of $\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}$, and $e_1, e_2, \ldots, e_p$ are the associated $p \times 1$ eigenvectors. The quantities $\rho_1^2 \ge \rho_2^2 \ge \cdots \ge \rho_p^2$ are also the $p$ largest eigenvalues of the matrix $\Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1/2}$, with corresponding $q \times 1$ eigenvectors $f_1, f_2, \ldots, f_p$. Each $f_i$ is proportional to $\Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1/2} e_i$. The canonical variates have the properties $\mathrm{Var}(U_k) = \mathrm{Var}(V_k) = 1$ for $k = 1, \ldots, p$, and $\mathrm{Corr}(U_k, U_\ell) = \mathrm{Corr}(V_k, V_\ell) = \mathrm{Corr}(U_k, V_\ell) = 0$ for $k, \ell = 1, 2, \ldots, p$ and $k \ne \ell$.
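As an illustration (not part of the original notes), the following Python sketch computes the canonical correlations and coefficient vectors from given covariance blocks $\Sigma_{11}, \Sigma_{12}, \Sigma_{22}$ (or their sample counterparts), following Result 6-1.1; the function and argument names are hypothetical.

```python
# A sketch (not from the notes): canonical correlations and coefficient vectors
# from the covariance blocks, following Result 6-1.1.  Names are hypothetical.
import numpy as np
from scipy.linalg import fractional_matrix_power, eigh

def canonical_correlations(S11, S12, S22):
    """Return rho (descending) and matrices A, B whose rows are a_k', b_k'."""
    S11_inv_sqrt = fractional_matrix_power(S11, -0.5)           # Sigma11^{-1/2}
    # Sigma11^{-1/2} Sigma12 Sigma22^{-1} Sigma21 Sigma11^{-1/2}
    M = S11_inv_sqrt @ S12 @ np.linalg.solve(S22, S12.T) @ S11_inv_sqrt
    vals, vecs = eigh(M)                                        # ascending eigenvalues
    order = np.argsort(vals)[::-1]                              # rho_1^2 >= ... >= rho_p^2
    rho = np.sqrt(np.clip(vals[order], 0.0, None))
    A = vecs[:, order].T @ S11_inv_sqrt                         # rows a_k' = e_k' Sigma11^{-1/2}
    # b_k is proportional to Sigma22^{-1} Sigma21 a_k, rescaled so b_k' Sigma22 b_k = 1
    B_raw = (np.linalg.solve(S22, S12.T) @ A.T).T
    scale = np.sqrt(np.einsum('ij,jk,ik->i', B_raw, S22, B_raw))
    B = B_raw / np.where(scale > 0, scale, 1.0)[:, None]
    return rho, A, B
```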

Example: Suppose $Z^{(1)} = [Z_1^{(1)}, Z_2^{(1)}]'$ and $Z^{(2)} = [Z_1^{(2)}, Z_2^{(2)}]'$ are standardized variables. Let $Z = [Z^{(1)'}, Z^{(2)'}]'$ with covariance (correlation) matrix $\mathrm{Cov}(Z)$ as given. Calculate the canonical variates and canonical correlations for the standardized variables $Z^{(1)}$ and $Z^{(2)}$.

Example: Compute the correlations between the first pair of canonical variates and their component variables for the situation considered in the preceding example.

Example: Consider the covariance matrix of $[X_1^{(1)}, X_2^{(1)}, X_1^{(2)}, X_2^{(2)}]'$ as given. Calculate the canonical correlations between $[X_1^{(1)}, X_2^{(1)}]'$ and $[X_1^{(2)}, X_2^{(2)}]'$.

The Sample Canonical Variates and Sample Canonical Correlations

Let $\hat\rho_1^2 \ge \hat\rho_2^2 \ge \cdots \ge \hat\rho_p^2$ be the $p$ ordered eigenvalues of $S_{11}^{-1/2} S_{12} S_{22}^{-1} S_{21} S_{11}^{-1/2}$ with corresponding eigenvectors $\hat e_1, \hat e_2, \ldots, \hat e_p$, where $p \le q$, and let $\hat f_1, \ldots, \hat f_p$ be the eigenvectors of $S_{22}^{-1/2} S_{21} S_{11}^{-1} S_{12} S_{22}^{-1/2}$. Then the $k$th sample canonical variate pair is
$$\hat U_k = \hat e_k' S_{11}^{-1/2} x^{(1)}, \qquad \hat V_k = \hat f_k' S_{22}^{-1/2} x^{(2)},$$
where $x^{(1)}$ and $x^{(2)}$ are the values of the variables $X^{(1)}$ and $X^{(2)}$ for a particular experimental unit. Also, for the $k$th pair, $k = 1, \ldots, p$, $r_{\hat U_k, \hat V_k} = \hat\rho_k$. The quantities $\hat\rho_1, \ldots, \hat\rho_p$ are the sample canonical correlations.
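A companion sketch, assuming data matrices X1 ($n \times p$) and X2 ($n \times q$) with $p \le q$, forms the sample covariance blocks and reuses the hypothetical canonical_correlations() function above to obtain the sample canonical correlations and the scores of the sample canonical variates.

```python
# A sketch: sample canonical variates from data matrices X1 (n x p) and X2 (n x q),
# p <= q, reusing canonical_correlations() from the previous sketch.
import numpy as np

def sample_canonical_variates(X1, X2):
    p = X1.shape[1]
    S = np.cov(np.hstack([X1, X2]), rowvar=False)      # unbiased sample covariance
    S11, S12, S22 = S[:p, :p], S[:p, p:], S[p:, p:]
    rho_hat, A, B = canonical_correlations(S11, S12, S22)
    U = (X1 - X1.mean(axis=0)) @ A.T                   # columns U_hat_1, ..., U_hat_p
    V = (X2 - X2.mean(axis=0)) @ B.T                   # columns V_hat_1, ..., V_hat_p
    return rho_hat, U, V                               # corr(U[:, k], V[:, k]) ~ rho_hat[k]
```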

Large Sample Inference

Let
$$X_j = \begin{bmatrix} X_j^{(1)} \\ X_j^{(2)} \end{bmatrix}, \qquad j = 1, 2, \ldots, n,$$
be a random sample from an $N_{p+q}(\mu, \Sigma)$ population with
$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}.$$
Then the likelihood ratio test of $H_0: \Sigma_{12} = 0$ versus $H_1: \Sigma_{12} \ne 0$ rejects $H_0$ for large values of
$$-2\ln\Lambda = n \ln\frac{|S_{11}|\,|S_{22}|}{|S|} = -n \ln \prod_{i=1}^{p} (1 - \hat\rho_i^2),$$
where
$$S = \begin{bmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{bmatrix}$$
is the unbiased estimator of $\Sigma$. For large $n$, the test statistic $-2\ln\Lambda$ is approximately distributed as a chi-square random variable with $pq$ degrees of freedom.
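A minimal sketch of this large-sample test, assuming the sample canonical correlations rho_hat from the previous sketch; the helper name is hypothetical.

```python
# A sketch of the likelihood ratio test of H0: Sigma12 = 0 based on the sample
# canonical correlations; the chi-square approximation holds for large n.
import numpy as np
from scipy.stats import chi2

def test_no_association(rho_hat, n, p, q):
    stat = -n * np.sum(np.log(1.0 - rho_hat**2))   # -2 ln(Lambda)
    p_value = chi2.sf(stat, df=p * q)              # approximate chi-square with pq d.f.
    return stat, p_value
```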


6-2. Discrimination and Classification

Discrimination and classification are multivariate techniques concerned with separating distinct sets of objects (or observations) and with allocating new objects (observations) to previously defined groups.

Goal 1. To describe, either graphically (in three or fewer dimensions) or algebraically, the differential features of objects (observations) from several known collections (populations). We try to find discriminants whose numerical values are such that the collections are separated as much as possible.

Goal 2. To sort objects (observations) into two or more labeled classes. The emphasis is on deriving a rule that can be used to optimally assign new objects to the labeled classes.

Separation and Classification for Two Populations

Allocation or classification rules are usually developed from learning samples. Measured characteristics of randomly selected objects known to come from each of the two populations are examined for differences. Why might we know that some observations belong to a particular population but be unsure about others? Incomplete knowledge of future performance; perfect information may require destroying the object; information may be unavailable or expensive to obtain.


Let $f_1(x)$ and $f_2(x)$ be the probability density functions associated with the $p \times 1$ random vector $X$ for the populations $\pi_1$ and $\pi_2$, respectively. An object with associated measurements $x$ must be assigned to either $\pi_1$ or $\pi_2$. Let $\Omega$ be the complete sample space, let $R_1$ be the set of $x$ values for which we classify objects as $\pi_1$, and let $R_2 = \Omega - R_1$ be the remaining $x$ values, for which we classify objects as $\pi_2$. Thus $R_1 \cup R_2 = \Omega$ and $R_1 \cap R_2 = \emptyset$.

The conditional probability $P(2\,|\,1)$ of classifying an object as $\pi_2$ when, in fact, it is from $\pi_1$ is
$$P(2\,|\,1) = P(X \in R_2 \mid \pi_1) = \int_{R_2 = \Omega - R_1} f_1(x)\,dx.$$
Similarly, the conditional probability $P(1\,|\,2)$ of classifying an object as $\pi_1$ when, in fact, it is from $\pi_2$ is
$$P(1\,|\,2) = P(X \in R_1 \mid \pi_2) = \int_{R_1} f_2(x)\,dx.$$

Let $p_1$ be the prior probability of $\pi_1$ and $p_2$ the prior probability of $\pi_2$, where $p_1 + p_2 = 1$. Then

P(observation is correctly classified as $\pi_1$) $= P(X \in R_1 \mid \pi_1)P(\pi_1) = P(1\,|\,1)p_1$,
P(observation is misclassified as $\pi_1$) $= P(X \in R_1 \mid \pi_2)P(\pi_2) = P(1\,|\,2)p_2$,
P(observation is correctly classified as $\pi_2$) $= P(X \in R_2 \mid \pi_2)P(\pi_2) = P(2\,|\,2)p_2$,
P(observation is misclassified as $\pi_2$) $= P(X \in R_2 \mid \pi_1)P(\pi_1) = P(2\,|\,1)p_1$.

The costs of misclassification are defined by a cost matrix whose rows index the true population and whose columns index the population an observation is classified into:

true $\pi_1$: classify as $\pi_1$ costs $0$, classify as $\pi_2$ costs $c(2\,|\,1)$;
true $\pi_2$: classify as $\pi_1$ costs $c(1\,|\,2)$, classify as $\pi_2$ costs $0$.

The expected cost of misclassification (ECM) is
$$\mathrm{ECM} = c(2\,|\,1)P(2\,|\,1)p_1 + c(1\,|\,2)P(1\,|\,2)p_2.$$
A reasonable classification rule should have an ECM as small, or nearly as small, as possible.

Result: The regions $R_1$ and $R_2$ that minimize the ECM are defined by the values $x$ for which the following inequalities hold:
$$R_1: \quad \frac{f_1(x)}{f_2(x)} \;\ge\; \left(\frac{c(1\,|\,2)}{c(2\,|\,1)}\right)\left(\frac{p_2}{p_1}\right), \qquad \text{(density ratio)} \ge \text{(cost ratio)}\times\text{(prior probability ratio)},$$
$$R_2: \quad \frac{f_1(x)}{f_2(x)} \;<\; \left(\frac{c(1\,|\,2)}{c(2\,|\,1)}\right)\left(\frac{p_2}{p_1}\right), \qquad \text{(density ratio)} < \text{(cost ratio)}\times\text{(prior probability ratio)}.$$
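A sketch of this rule (all names hypothetical, with the densities supplied as callables f1 and f2): the density ratio is compared with the product of the cost ratio and the prior ratio.

```python
# A sketch of the minimum-ECM rule; f1 and f2 are density functions passed as
# callables, c12 = c(1|2) and c21 = c(2|1).
def classify_min_ecm(x, f1, f2, p1, p2, c12, c21):
    """Return 1 if x is allocated to pi_1, otherwise 2."""
    threshold = (c12 / c21) * (p2 / p1)
    return 1 if f1(x) / f2(x) >= threshold else 2
```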


Other classification procedures:

Choose $R_1$ and $R_2$ to minimize the total probability of misclassification (TPM):
$$\mathrm{TPM} = p_1 \int_{R_2} f_1(x)\,dx + p_2 \int_{R_1} f_2(x)\,dx.$$

Allocate a new observation $x_0$ to the population with the largest posterior probability $P(\pi_i \mid x_0)$:
$$P(\pi_1 \mid x_0) = \frac{p_1 f_1(x_0)}{p_1 f_1(x_0) + p_2 f_2(x_0)}, \qquad P(\pi_2 \mid x_0) = 1 - P(\pi_1 \mid x_0) = \frac{p_2 f_2(x_0)}{p_1 f_1(x_0) + p_2 f_2(x_0)}.$$

Classifying an observation $x_0$ as $\pi_1$ when $P(\pi_1 \mid x_0) > P(\pi_2 \mid x_0)$ is equivalent to using the rule above that minimizes the total probability of misclassification.

Classification with Two Multivariate Normal Populations

Assume $f_1(x)$ and $f_2(x)$ are multivariate normal densities, the first with mean vector $\mu_1$ and covariance matrix $\Sigma_1$ and the second with mean vector $\mu_2$ and covariance matrix $\Sigma_2$. Suppose first that $\Sigma_1 = \Sigma_2 = \Sigma$, so that the joint densities of $X = [X_1, X_2, \ldots, X_p]'$ for populations $\pi_1$ and $\pi_2$ are given by
$$f_i(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\left[-\tfrac{1}{2}(x - \mu_i)'\Sigma^{-1}(x - \mu_i)\right], \qquad i = 1, 2.$$

Result: Let the populations $\pi_1$ and $\pi_2$ be described by multivariate normal densities of the form above. Then the allocation rule that minimizes the ECM is as follows. Allocate $x_0$ to $\pi_1$ if
$$(\mu_1 - \mu_2)'\Sigma^{-1} x_0 - \tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 + \mu_2) \;\ge\; \ln\!\left[\left(\frac{c(1\,|\,2)}{c(2\,|\,1)}\right)\left(\frac{p_2}{p_1}\right)\right];$$
allocate $x_0$ to $\pi_2$ otherwise.

The Estimated Minimum ECM Rule for Two Normal Populations: Allocate $x_0$ to $\pi_1$ if
$$(\bar x_1 - \bar x_2)' S_{\mathrm{pooled}}^{-1} x_0 - \tfrac{1}{2}(\bar x_1 - \bar x_2)' S_{\mathrm{pooled}}^{-1}(\bar x_1 + \bar x_2) \;\ge\; \ln\!\left[\left(\frac{c(1\,|\,2)}{c(2\,|\,1)}\right)\left(\frac{p_2}{p_1}\right)\right];$$
allocate $x_0$ to $\pi_2$ otherwise.
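A sketch of the estimated minimum-ECM rule, assuming sample means xbar1, xbar2 and a pooled covariance matrix S_pooled computed from training data (names hypothetical):

```python
# A sketch of the estimated minimum-ECM rule for two normal populations with a
# common covariance matrix; xbar1, xbar2, S_pooled come from training data.
import numpy as np

def classify_linear(x0, xbar1, xbar2, S_pooled, p1, p2, c12, c21):
    a = np.linalg.solve(S_pooled, xbar1 - xbar2)     # S_pooled^{-1}(xbar1 - xbar2)
    lhs = a @ x0 - 0.5 * a @ (xbar1 + xbar2)
    rhs = np.log((c12 / c21) * (p2 / p1))
    return 1 if lhs >= rhs else 2
```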


Scaling

The coefficient vector $\hat a = S_{\mathrm{pooled}}^{-1}(\bar x_1 - \bar x_2)$ is unique only up to a multiplicative constant, so for $c \ne 0$ any vector $c\hat a$ will also serve as discriminant coefficients. The vector $\hat a$ is frequently scaled or normalized to ease the interpretation of its elements. Two common choices: (1) set $\hat a^* = \hat a / \sqrt{\hat a'\hat a}$; (2) set $\hat a^* = \hat a / \hat a_1$.

Fisher's Approach to Classification with Two Populations

Fisher's idea was to transform the multivariate observations $x$ to univariate observations $y$ such that the $y$'s derived from populations $\pi_1$ and $\pi_2$ are separated as much as possible. Fisher suggested taking linear combinations of $x$ to create the $y$'s because they are simple enough functions of $x$ to be handled easily. Fisher's approach does not assume that the populations are normal, but it does implicitly assume that the population covariance matrices are equal.

Result: The linear combination $\hat y = \hat a' x = (\bar x_1 - \bar x_2)' S_{\mathrm{pooled}}^{-1} x$ maximizes the ratio
$$\frac{(\text{squared distance between sample means of } y)}{(\text{sample variance of } y)} = \frac{(\bar y_1 - \bar y_2)^2}{s_y^2} = \frac{(\hat a'\bar x_1 - \hat a'\bar x_2)^2}{\hat a' S_{\mathrm{pooled}}\,\hat a} = \frac{(\hat a' d)^2}{\hat a' S_{\mathrm{pooled}}\,\hat a}$$
over all possible coefficient vectors $\hat a$, where $d = \bar x_1 - \bar x_2$. The maximum of the ratio is $D^2 = (\bar x_1 - \bar x_2)' S_{\mathrm{pooled}}^{-1}(\bar x_1 - \bar x_2)$.


An Allocation Rule Based on Fisher's Discriminant Function: Allocate $x_0$ to $\pi_1$ if
$$\hat y_0 = (\bar x_1 - \bar x_2)' S_{\mathrm{pooled}}^{-1} x_0 \;\ge\; \hat m = \tfrac{1}{2}(\bar x_1 - \bar x_2)' S_{\mathrm{pooled}}^{-1}(\bar x_1 + \bar x_2),$$
that is, if $\hat y_0 - \hat m \ge 0$. Allocate $x_0$ to $\pi_2$ if $\hat y_0 < \hat m$, that is, if $\hat y_0 - \hat m < 0$.
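With equal costs and equal priors this is Fisher's midpoint rule; a minimal sketch, assuming the same training summaries as in the earlier sketch:

```python
# A sketch of Fisher's midpoint rule (equal costs and priors), using the same
# training summaries xbar1, xbar2, S_pooled as above.
import numpy as np

def fisher_classify(x0, xbar1, xbar2, S_pooled):
    a_hat = np.linalg.solve(S_pooled, xbar1 - xbar2)   # discriminant coefficients
    y0 = a_hat @ x0
    m_hat = 0.5 * a_hat @ (xbar1 + xbar2)              # midpoint of ybar_1 and ybar_2
    return 1 if y0 >= m_hat else 2
```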


Classification of Normal Populations When $\Sigma_1 \ne \Sigma_2$

Result: Let the populations $\pi_1$ and $\pi_2$ be described by multivariate normal densities with mean vectors and covariance matrices $\mu_1, \Sigma_1$ and $\mu_2, \Sigma_2$, respectively. The allocation rule that minimizes the expected cost of misclassification is: Allocate $x_0$ to $\pi_1$ if
$$-\tfrac{1}{2} x_0'(\Sigma_1^{-1} - \Sigma_2^{-1})x_0 + (\mu_1'\Sigma_1^{-1} - \mu_2'\Sigma_2^{-1})x_0 - k \;\ge\; \ln\!\left[\left(\frac{c(1\,|\,2)}{c(2\,|\,1)}\right)\left(\frac{p_2}{p_1}\right)\right],$$
where
$$k = \tfrac{1}{2}\ln\!\left(\frac{|\Sigma_1|}{|\Sigma_2|}\right) + \tfrac{1}{2}\left(\mu_1'\Sigma_1^{-1}\mu_1 - \mu_2'\Sigma_2^{-1}\mu_2\right).$$
Allocate $x_0$ to $\pi_2$ otherwise.

Quadratic Classification Rule (Normal Populations with Unequal Covariance Matrices): Allocate $x_0$ to $\pi_1$ if
$$-\tfrac{1}{2} x_0'(S_1^{-1} - S_2^{-1})x_0 + (\bar x_1' S_1^{-1} - \bar x_2' S_2^{-1})x_0 - \hat k \;\ge\; \ln\!\left[\left(\frac{c(1\,|\,2)}{c(2\,|\,1)}\right)\left(\frac{p_2}{p_1}\right)\right],$$
where $\hat k$ is obtained from $k$ above by substituting $\bar x_i$ for $\mu_i$ and $S_i$ for $\Sigma_i$. Allocate $x_0$ to $\pi_2$ otherwise.
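A sketch of this quadratic rule, assuming sample means xbar1, xbar2 and covariance matrices S1, S2 from training data (names hypothetical):

```python
# A sketch of the quadratic rule; xbar1, xbar2, S1, S2 are training-sample
# estimates, and c12 = c(1|2), c21 = c(2|1).
import numpy as np

def classify_quadratic(x0, xbar1, xbar2, S1, S2, p1, p2, c12, c21):
    S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)
    k_hat = 0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2)) \
            + 0.5 * (xbar1 @ S1i @ xbar1 - xbar2 @ S2i @ xbar2)
    lhs = -0.5 * x0 @ (S1i - S2i) @ x0 + (xbar1 @ S1i - xbar2 @ S2i) @ x0 - k_hat
    rhs = np.log((c12 / c21) * (p2 / p1))
    return 1 if lhs >= rhs else 2
```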


Evaluating Classification Functions

The total probability of misclassification is
$$\mathrm{TPM} = p_1 \int_{R_2} f_1(x)\,dx + p_2 \int_{R_1} f_2(x)\,dx.$$
The optimum error rate (OER) is the smallest value of the TPM, obtained by a judicious choice of $R_1$ and $R_2$. The regions $R_1$ and $R_2$ attaining the OER are determined by the minimum expected cost rule with equal misclassification costs.


Actual error rate (AER):
$$\mathrm{AER} = p_1 \int_{\hat R_2} f_1(x)\,dx + p_2 \int_{\hat R_1} f_2(x)\,dx,$$
where $\hat R_1$ and $\hat R_2$ are the classification regions determined by the sample classification function. Apparent error rate (APER): the fraction of observations in the training sample that are misclassified by the sample classification function.
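A minimal sketch of the APER computation, assuming a hypothetical classify(x) rule (for example, one of the rules sketched earlier) applied back to the training observations it was built from:

```python
# A sketch of the apparent error rate: the fitted rule is evaluated on its own
# training observations, grouped by true population.
def apparent_error_rate(classify, X1_train, X2_train):
    n1_miss = sum(classify(x) != 1 for x in X1_train)
    n2_miss = sum(classify(x) != 2 for x in X2_train)
    return (n1_miss + n2_miss) / (len(X1_train) + len(X2_train))
```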


The APER tends to underestimate the AER, and the problem does not disappear unless the sample sizes $n_1$ and $n_2$ are very large. Essentially, this optimistic estimate occurs because the data used to build the classification function are also used to evaluate it. Error-rate estimates can be constructed that are better than the apparent error rate, remain relatively easy to calculate, and do not require distributional assumptions. One approach is to split the total sample into a training sample and a validation sample. Shortcomings: (1) it requires large samples; (2) valuable information may be lost. An alternative is Lachenbruch's holdout procedure.

Lachenbruch's holdout procedure:
1. Start with the $\pi_1$ group of observations. Omit one observation from this group, and develop a classification function based on the remaining $n_1 - 1$ and $n_2$ observations.
2. Classify the holdout observation using the function constructed in Step 1.
3. Repeat Steps 1 and 2 until all of the $\pi_1$ observations are classified. Let $n_{1M}^{(H)}$ be the number of holdout ($H$) observations misclassified in this group.
4. Repeat Steps 1 through 3 for the $\pi_2$ observations. Let $n_{2M}^{(H)}$ be the number of holdout observations misclassified in this group.

Then
$$\hat P(2\,|\,1) = \frac{n_{1M}^{(H)}}{n_1}, \qquad \hat P(1\,|\,2) = \frac{n_{2M}^{(H)}}{n_2}, \qquad \hat E(\mathrm{AER}) = \frac{n_{1M}^{(H)} + n_{2M}^{(H)}}{n_1 + n_2}.$$
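A sketch of the holdout estimate for the linear rule with equal costs and priors, assuming training matrices X1 ($n_1 \times p$) and X2 ($n_2 \times p$); the helper names are hypothetical and the rule is refit with one observation omitted at a time.

```python
# A sketch of Lachenbruch's holdout estimate for the linear rule with equal costs
# and priors; X1 and X2 are the training data for pi_1 and pi_2.
import numpy as np

def pooled_cov(X1, X2):
    n1, n2 = len(X1), len(X2)
    return ((n1 - 1) * np.cov(X1, rowvar=False) +
            (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)

def linear_rule(x0, X1, X2):
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    a = np.linalg.solve(pooled_cov(X1, X2), xbar1 - xbar2)
    return 1 if a @ x0 - 0.5 * a @ (xbar1 + xbar2) >= 0 else 2

def lachenbruch_holdout(X1, X2):
    n1_miss = sum(linear_rule(X1[j], np.delete(X1, j, axis=0), X2) != 1
                  for j in range(len(X1)))
    n2_miss = sum(linear_rule(X2[j], X1, np.delete(X2, j, axis=0)) != 2
                  for j in range(len(X2)))
    return (n1_miss + n2_miss) / (len(X1) + len(X2))   # E_hat(AER)
```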


Classification with Several Populations: The Minimum Expected Cost of Misclassification Method

Let $f_i(x)$ be the density associated with population $\pi_i$, $i = 1, 2, \ldots, g$. Let $p_i$ be the prior probability of population $\pi_i$, $i = 1, \ldots, g$, and let $c(k\,|\,i)$ be the cost of allocating an item to $\pi_k$ when, in fact, it belongs to $\pi_i$, for $k, i = 1, \ldots, g$. For $k = i$, $c(i\,|\,i) = 0$. Finally, let $R_k$ be the set of $x$'s classified as $\pi_k$ and
$$P(k\,|\,i) = \int_{R_k} f_i(x)\,dx \quad \text{for } k, i = 1, 2, \ldots, g, \qquad \text{with } P(i\,|\,i) = 1 - \sum_{k=1,\,k\ne i}^{g} P(k\,|\,i).$$
The conditional expected cost of misclassifying an $x$ from $\pi_1$ into $\pi_2, \pi_3, \ldots, \pi_g$ is
$$\mathrm{ECM}(1) = P(2\,|\,1)c(2\,|\,1) + \cdots + P(g\,|\,1)c(g\,|\,1) = \sum_{k=2}^{g} P(k\,|\,1)c(k\,|\,1).$$

Multiplying each conditional ECM by its prior probability and summing gives the overall ECM:
$$\mathrm{ECM} = p_1\,\mathrm{ECM}(1) + \cdots + p_g\,\mathrm{ECM}(g) = \sum_{i=1}^{g} p_i \sum_{k=1,\,k\ne i}^{g} P(k\,|\,i)c(k\,|\,i).$$
Result: The classification regions that minimize the overall ECM are defined by allocating $x$ to the population $\pi_k$, $k = 1, \ldots, g$, for which
$$\sum_{i=1,\,i\ne k}^{g} p_i f_i(x)\,c(k\,|\,i)$$
is smallest. If a tie occurs, $x$ can be assigned to any of the tied populations. If the misclassification costs $c(i\,|\,k)$, $i \ne k$, are all the same, allocate $x$ to $\pi_k$ if $p_k f_k(x) > p_i f_i(x)$ for all $i \ne k$.

Classification with Normal Populations

When
$$f_i(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_i|^{1/2}} \exp\!\left[-\tfrac{1}{2}(x - \mu_i)'\Sigma_i^{-1}(x - \mu_i)\right], \qquad i = 1, 2, \ldots, g,$$
and, further, the misclassification costs are all equal, $c(k\,|\,i) = 1$ for $k \ne i$, then allocate $x$ to $\pi_k$ if
$$\ln p_k f_k(x) = \ln p_k - \frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_k| - \frac{1}{2}(x - \mu_k)'\Sigma_k^{-1}(x - \mu_k) = \max_i \ln p_i f_i(x).$$

Define the sample quadratic discriminant score $\hat d_i^Q(x)$ as
$$\hat d_i^Q(x) = -\tfrac{1}{2}\ln|S_i| - \tfrac{1}{2}(x - \bar x_i)' S_i^{-1}(x - \bar x_i) + \ln p_i, \qquad i = 1, 2, \ldots, g.$$
Then allocate $x$ to $\pi_k$ if the quadratic score $\hat d_k^Q(x)$ is the largest of $\hat d_1^Q(x), \ldots, \hat d_g^Q(x)$.

If the population covariance matrices $\Sigma_i$ are equal, define the sample linear discriminant score as
$$\hat d_i(x) = \bar x_i' S_{\mathrm{pooled}}^{-1} x - \tfrac{1}{2}\bar x_i' S_{\mathrm{pooled}}^{-1}\bar x_i + \ln p_i, \qquad i = 1, 2, \ldots, g,$$
where
$$S_{\mathrm{pooled}} = \frac{1}{n_1 + n_2 + \cdots + n_g - g}\left[(n_1 - 1)S_1 + (n_2 - 1)S_2 + \cdots + (n_g - 1)S_g\right].$$
Then allocate $x$ to $\pi_k$ if the linear discriminant score $\hat d_k(x)$ is the largest of $\hat d_1(x), \ldots, \hat d_g(x)$.
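A sketch of allocation by the largest sample linear discriminant score, assuming a list of sample mean vectors xbars, a pooled covariance matrix S_pooled, and prior probabilities priors (names hypothetical):

```python
# A sketch of allocation by the largest sample linear discriminant score for
# g populations with a common covariance matrix.
import numpy as np

def classify_linear_score(x, xbars, S_pooled, priors):
    scores = [xbar @ np.linalg.solve(S_pooled, x)
              - 0.5 * xbar @ np.linalg.solve(S_pooled, xbar)
              + np.log(p_i)
              for xbar, p_i in zip(xbars, priors)]
    return int(np.argmax(scores)) + 1      # 1-based population label k
```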

Fisher's Method for Discriminating among Several Populations

The motivation behind Fisher's discriminant analysis is the need to obtain a reasonable representation of the populations that involves only a few linear combinations of the observations, such as $a_1'x$, $a_2'x$, and $a_3'x$. The approach has several advantages when one is interested in separating several populations for (1) visual inspection or (2) graphical descriptive purposes. Assume that the $p \times p$ population covariance matrices are equal and of full rank; that is, $\Sigma_1 = \Sigma_2 = \cdots = \Sigma_g = \Sigma$.

Define
$$\bar x_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}, \quad i = 1, \ldots, g, \qquad \bar x = \frac{1}{g}\sum_{i=1}^{g} \bar x_i,$$
$$B = \sum_{i=1}^{g} (\bar x_i - \bar x)(\bar x_i - \bar x)',$$
and
$$W = \sum_{i=1}^{g} (n_i - 1)S_i = \sum_{i=1}^{g}\sum_{j=1}^{n_i} (x_{ij} - \bar x_i)(x_{ij} - \bar x_i)'.$$

Fisher's Sample Linear Discriminants

Let $\hat\lambda_1 \ge \hat\lambda_2 \ge \cdots \ge \hat\lambda_s > 0$ denote the $s \le \min(g-1, p)$ nonzero eigenvalues of $W^{-1}B$ and let $\hat e_1, \ldots, \hat e_s$ be the corresponding eigenvectors (scaled so that $\hat e' S_{\mathrm{pooled}}\,\hat e = 1$). Then the vector of coefficients $\hat a$ that maximizes the ratio
$$\frac{\hat a' B \hat a}{\hat a' W \hat a} = \frac{\hat a'\left[\sum_{i=1}^{g} (\bar x_i - \bar x)(\bar x_i - \bar x)'\right]\hat a}{\hat a'\left[\sum_{i=1}^{g}\sum_{j=1}^{n_i} (x_{ij} - \bar x_i)(x_{ij} - \bar x_i)'\right]\hat a}$$
is given by $\hat a_1 = \hat e_1$. The linear combination $\hat a_1' x$ is called the sample first discriminant. The choice $\hat a_2 = \hat e_2$ produces the sample second discriminant, $\hat a_2' x$, and, continuing, we obtain $\hat a_k' x = \hat e_k' x$, the sample $k$th discriminant, $k \le s$.
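A sketch of Fisher's sample discriminants from the eigenvectors of $W^{-1}B$, assuming the input is a list of per-group data matrices; the scaling $\hat e' S_{\mathrm{pooled}}\,\hat e = 1$ is omitted here, so the returned coefficients are determined only up to a multiplicative constant.

```python
# A sketch of Fisher's sample discriminants; `groups` is a list of per-group
# data matrices, each n_i x p.
import numpy as np

def fisher_discriminants(groups):
    xbars = [X.mean(axis=0) for X in groups]
    xbar = np.mean(xbars, axis=0)                        # mean of the group means
    B = sum(np.outer(m - xbar, m - xbar) for m in xbars)
    W = sum((X - m).T @ (X - m) for X, m in zip(groups, xbars))
    vals, vecs = np.linalg.eig(np.linalg.solve(W, B))    # W^{-1}B is not symmetric
    order = np.argsort(vals.real)[::-1]
    s = min(len(groups) - 1, groups[0].shape[1])         # at most min(g-1, p) nonzero
    return vals.real[order][:s], vecs.real[:, order][:, :s]   # columns a_hat_1..a_hat_s
```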

Let
$$d_i(x) = \mu_i'\Sigma^{-1}x - \tfrac{1}{2}\mu_i'\Sigma^{-1}\mu_i + \ln p_i,$$
or, equivalently,
$$d_i(x) - \tfrac{1}{2}x'\Sigma^{-1}x = -\tfrac{1}{2}(x - \mu_i)'\Sigma^{-1}(x - \mu_i) + \ln p_i.$$

Result: Let $y_j = a_j'x$, where $a_j = \Sigma^{-1/2}e_j$ and $e_j$ is an eigenvector of $\Sigma^{-1/2}B_{\mu}\Sigma^{-1/2}$. Then
$$\sum_{j=1}^{p}(y_j - \mu_{iY_j})^2 = \sum_{j=1}^{p}\left[a_j'(x - \mu_i)\right]^2 = (x - \mu_i)'\Sigma^{-1}(x - \mu_i) = -2d_i(x) + x'\Sigma^{-1}x + 2\ln p_i.$$
If $\lambda_1 \ge \cdots \ge \lambda_s > 0 = \lambda_{s+1} = \cdots = \lambda_p$, then $\sum_{j=s+1}^{p}(y_j - \mu_{iY_j})^2$ is constant for all populations $i = 1, 2, \ldots, g$, so only the first $s$ discriminants $y_j$, or
$$\sum_{j=1}^{s}(y_j - \mu_{iY_j})^2,$$
contribute to the classification.

Fisher's Classification Procedure Based on Sample Discriminants: Allocate $x$ to $\pi_k$ if
$$\sum_{j=1}^{r}(\hat y_j - \bar y_{kj})^2 = \sum_{j=1}^{r}\left[\hat a_j'(x - \bar x_k)\right]^2 \;\le\; \sum_{j=1}^{r}\left[\hat a_j'(x - \bar x_i)\right]^2 \quad \text{for all } i \ne k,$$
where $\hat a_j$ is the $j$th eigenvector of $W^{-1}B$, $\bar y_{kj} = \hat a_j'\bar x_k$, and $r \le s$.
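A sketch of this allocation rule, assuming a matrix A whose columns are the discriminant coefficient vectors $\hat a_j$ (for example, from the previous sketch) and a list of group mean vectors xbars:

```python
# A sketch of Fisher's allocation rule for several populations using the first
# r <= s sample discriminants.
import numpy as np

def fisher_classify_several(x, A, xbars, r):
    Ar = A[:, :r]                                    # first r discriminant vectors
    dists = [np.sum((Ar.T @ (x - m))**2) for m in xbars]
    return int(np.argmin(dists)) + 1                 # 1-based population label
```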
