13 Principal Components Analysis
We now discuss an unsupervised learning algorithm, called Principal Components Analysis, or PCA. The method is unsupervised because we are learning a mapping without any examples of what the mapping looks like; all we see are the outputs, and we want to estimate both the mapping and the inputs.

PCA is primarily a tool for dealing with high-dimensional data. If our measurements are 17-dimensional, or 30-dimensional, or 10,000-dimensional, manipulating the data can be extremely difficult. Quite often, the actual data can be described by a much lower-dimensional representation that captures all of the structure of the data. PCA is perhaps the simplest approach for finding such a representation, and yet it is also very fast and effective, resulting in it being very widely used.

There are several ways in which PCA can help:

Visualization: PCA provides a way to visualize the data, by projecting the data down to two or three dimensions that you can plot, in order to get a better sense of the data. Furthermore, the principal component vectors sometimes provide insight as to the nature of the data as well.

Preprocessing: Learning complex models of high-dimensional data is often very slow, and also prone to overfitting: the number of parameters in a model is usually exponential in the number of dimensions, meaning that very large data sets are required for higher-dimensional models. This problem is generally called the curse of dimensionality. PCA can be used to first map the data to a low-dimensional representation before applying a more sophisticated algorithm to it. With PCA one can also whiten the representation, which rebalances the weights of the data to give better performance in some cases.

Modeling: PCA learns a representation that is sometimes used as an entire model, e.g., a prior distribution for new data.

Compression: PCA can be used to compress data, by replacing the data with its low-dimensional representation.

13.1 The model and learning

In PCA, we assume we are given N data vectors {y_i}, where each vector is D-dimensional: y_i ∈ R^D. Our goal is to replace these vectors with lower-dimensional vectors {x_i} of dimensionality C, where C < D. We assume that they are related by a linear transformation:

    y = W x + b = \sum_{j=1}^{C} w_j x_j + b                                    (1)

The matrix W can be viewed as containing a set of C basis vectors, W = [w_1, ..., w_C]. If we also assume Gaussian noise in the measurements, this model is the same as the linear regression model studied earlier, but now the x_i's are unknown in addition to the linear parameters.
To learn the model, we solve the following constrained least-squares problem:

    arg min_{W, b, {x_i}} \sum_i \| y_i - (W x_i + b) \|^2                      (2)

    subject to W^T W = I                                                        (3)

The constraint W^T W = I requires that we obtain an orthonormal mapping W; it is equivalent to saying that

    w_i^T w_j = 1 if i = j, and 0 if i ≠ j                                      (4)

This constraint is required to resolve an ambiguity in the mapping: if we did not require W to be orthonormal, then the objective function would be underconstrained (why?). Note that an ambiguity remains in the learning even with this constraint (which one?), but this ambiguity is not very important.

The x_i coordinates are often called latent coordinates.

The algorithm for minimizing this objective function is as follows:

1. Let b = (1/N) \sum_i y_i.
2. Let K = (1/N) \sum_i (y_i - b)(y_i - b)^T.
3. Let V Λ V^T = K be the eigenvector decomposition of K. Λ is a diagonal matrix of eigenvalues (Λ = diag(λ_1, ..., λ_D)). The matrix V contains the eigenvectors, V = [V_1, ..., V_D], and is orthonormal: V^T V = I.
4. Assume that the eigenvalues are sorted from largest to smallest (λ_i ≥ λ_{i+1}). If this is not the case, sort them (and their corresponding eigenvectors).
5. Let W be the matrix of the first C eigenvectors: W = [V_1, ..., V_C].
6. Let x_i = W^T (y_i - b), for all i.

13.2 Reconstruction

Suppose we have learned a PCA model, and are given a new value y_new; how do we estimate its corresponding x_new? This can be done by minimizing

    \| y_new - (W x_new + b) \|^2                                               (5)

This is a linear least-squares problem, and can be solved with standard methods (in MATLAB, implemented by the backslash operator). However, W is orthonormal, and thus its transpose is its pseudoinverse, so the solution is given simply by

    x_new = W^T (y_new - b)                                                     (6)
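To make the six steps concrete, here is a minimal NumPy sketch of the fitting algorithm and of the projection in Equation (6). The notes themselves give no code (only a passing mention of MATLAB's backslash operator), so the function names, the toy data, and the choice of np.linalg.eigh (appropriate since K is symmetric) are our own.

```python
import numpy as np

def pca_fit(Y, C):
    """Fit PCA to an N x D data matrix Y (rows are the y_i), keeping C components."""
    N = Y.shape[0]
    b = Y.mean(axis=0)                   # step 1: b = (1/N) sum_i y_i
    Yc = Y - b
    K = Yc.T @ Yc / N                    # step 2: sample covariance K
    lam, V = np.linalg.eigh(K)           # step 3: eigendecomposition of symmetric K
    order = np.argsort(lam)[::-1]        # step 4: sort eigenvalues, largest first
    lam, V = lam[order], V[:, order]
    W = V[:, :C]                         # step 5: first C eigenvectors as columns
    X = Yc @ W                           # step 6: rows are x_i = W^T (y_i - b)
    return W, b, X, lam

def pca_project(W, b, y_new):
    """Equation (6): latent coordinates of a new point."""
    return W.T @ (y_new - b)

# Toy usage: 500 points in R^5 lying near a 2-dimensional affine subspace.
rng = np.random.default_rng(0)
Y = (rng.normal(size=(500, 2)) @ rng.normal(size=(2, 5))
     + 10.0 + 0.01 * rng.normal(size=(500, 5)))
W, b, X, lam = pca_fit(Y, C=2)
y0_rec = W @ pca_project(W, b, Y[0]) + b   # reconstruct the first point
print(np.linalg.norm(Y[0] - y0_rec))       # small: Y is nearly 2-dimensional
```

Because W has orthonormal columns, projection and reconstruction are just two matrix products; no linear system needs to be solved.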
13.3 Properties of PCA

Mean-zero coefficients. One can show that the PCA coefficients that represent the training data, i.e., {x_i}_{i=1}^N, are mean zero:

    mean(x) ≡ (1/N) \sum_i x_i = (1/N) \sum_i W^T (y_i - b)                     (7)
            = (1/N) W^T ( \sum_i y_i - N b )                                    (8)
            = 0                                                                 (9)

Variance maximization. PCA can also be defined in the following way; in fact, this is the original definition of PCA, and the one that is often meant when people discuss PCA. However, this formulation is exactly equivalent to the one discussed above. In this view, we wish to find the first principal component w_1 so as to maximize the variance of the first coordinate of the data:

    var(x_1) = (1/N) \sum_i x_{1,i}^2 = (1/N) \sum_i ( w_1^T (y_i - b) )^2      (10)

such that \|w_1\|^2 = 1. Then, we choose the second principal component to be a unit vector orthogonal to the first component, while maximizing the variance of x_2. The remaining principal components are defined in the same recursive way, so that each component w_i is a unit vector orthogonal to all previous basis vectors.

Uncorrelated coefficients. It is straightforward to show that the covariance matrix of the PCA coefficients is just the upper-left C × C submatrix of Λ, i.e., the diagonal matrix containing the C leading eigenvalues of K:

    cov(x) ≡ (1/N) \sum_i ( W^T (y_i - b) )( W^T (y_i - b) )^T                  (11)
           = (1/N) W^T ( \sum_i (y_i - b)(y_i - b)^T ) W                        (12)
           = W^T K W                                                            (13)
           = W^T V Λ V^T W                                                      (14)
           = Λ̃                                                                 (15)

where Λ̃ is the diagonal matrix containing the C leading eigenvalues of Λ. This simple derivation also shows that the marginal variances of the PCA coefficients are given by the eigenvalues, i.e., var(x_j) = λ_j.
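All three properties are easy to confirm numerically. The following standalone sketch (the toy data and variable names are our own illustration, not from the notes) checks that the coefficients have zero mean, that their sample covariance is the diagonal matrix of leading eigenvalues, and that the marginal variances match those eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, C = 1000, 6, 3
Y = rng.normal(size=(N, C)) @ rng.normal(size=(C, D)) + rng.normal(size=D)

b = Y.mean(axis=0)
K = (Y - b).T @ (Y - b) / N
lam, V = np.linalg.eigh(K)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]
X = (Y - b) @ V[:, :C]                              # PCA coefficients

print(np.allclose(X.mean(axis=0), 0))               # mean-zero coefficients, Eqs. (7)-(9)
print(np.allclose(X.T @ X / N, np.diag(lam[:C])))   # cov(x) = leading eigenvalues, Eq. (15)
print(np.allclose(X.var(axis=0), lam[:C]))          # var(x_j) = lambda_j
```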
Out-of-subspace error. The total variance in the data is given by the sum of the eigenvalues of the sample covariance matrix K. The variance captured by the PCA subspace representation is the sum of the first C eigenvalues; the total amount of variance lost in the representation is the sum of the remaining eigenvalues. In fact, one can show that the least-squares error in the approximation to the original data provided by the optimal (ML) model parameters W*, {x_i*}, and b* is given by

    \sum_i \| y_i - (W* x_i* + b*) \|^2 = N \sum_{j=C+1}^{D} λ_j                (16)

When learning a PCA model it is common to use the ratio of the total LS error to the total variance in the training data (i.e., the sum of all eigenvalues). One needs to choose C to be large enough that this ratio is small (often 0.1 or less).

13.4 Whitening

Whitening is a preprocess that replaces the data with a representation that has zero mean and unit covariance, and is often useful as a data preprocessing step. Given measurements {y_i}, we replace them with {z_i} given by

    z_i = Λ̃^{-1/2} W^T (y_i - b) = Λ̃^{-1/2} x_i                               (17)

where Λ̃ is the diagonal matrix of the first C eigenvalues. Then the sample mean of the z_i's is zero:

    mean(z) = mean( Λ̃^{-1/2} x ) = Λ̃^{-1/2} mean(x) = 0                        (18)

To derive the sample covariance, we first compute the covariance of the untruncated values z̄_i ≡ Λ^{-1/2} V^T (y_i - b):

    cov(z̄) ≡ (1/N) \sum_i Λ^{-1/2} V^T (y_i - b)(y_i - b)^T V Λ^{-1/2}          (19)
            = Λ^{-1/2} V^T ( (1/N) \sum_i (y_i - b)(y_i - b)^T ) V Λ^{-1/2}     (20)
            = Λ^{-1/2} V^T K V Λ^{-1/2}                                         (21)
            = Λ^{-1/2} V^T V Λ V^T V Λ^{-1/2}                                   (22)
            = I                                                                 (23)

Since z_i is just the first C elements of z̄_i, z also has sample covariance I.
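Both facts, the out-of-subspace error of Equation (16) and the whitening identities, can be verified in a few lines. Again, the toy data below is our own sketch; note that with K defined using 1/N, the total squared reconstruction error is N times the sum of the trailing eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, C = 2000, 8, 3
Y = rng.normal(size=(N, C)) @ rng.normal(size=(C, D)) + 0.1 * rng.normal(size=(N, D))

b = Y.mean(axis=0)
Yc = Y - b
lam, V = np.linalg.eigh(Yc.T @ Yc / N)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]
W = V[:, :C]

# Out-of-subspace error, Equation (16): the summed squared residual
# equals N times the sum of the trailing eigenvalues.
Y_hat = (Yc @ W) @ W.T + b
err = np.sum((Y - Y_hat) ** 2)
print(np.isclose(err, N * lam[C:].sum()))

# Whitening, Equation (17): z_i = diag(lam_1..lam_C)^(-1/2) W^T (y_i - b).
Z = (Yc @ W) / np.sqrt(lam[:C])
print(np.allclose(Z.mean(axis=0), 0.0))         # zero mean, Eq. (18)
print(np.allclose(Z.T @ Z / N, np.eye(C)))      # identity covariance, Eqs. (19)-(23)
```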
13.5 Modeling

PCA is sometimes used to model data likelihood, e.g., we can use it as a form of prior. For example, suppose we have noisy measurements of some y values and wish to estimate their true values. If we parameterize the unknown y values by their corresponding x values instead, then we constrain the estimated values to lie in the low-dimensional subspace of the original data. However, this approach implies a uniform prior over x values, which may be inadequate, while being intolerant to deviations from the subspace. A better approach, with an inherently probabilistic model, is described below.

13.6 Probabilistic PCA

Probabilistic PCA is a way to estimate a probability distribution p(y); in fact, it is a form of Gaussian distribution. In particular, we assume the following probability distribution:

    x ∼ N(0, I)                                                                 (24)
    y = W x + b + n,   n ∼ N(0, σ² I)                                           (25)

where x and n are assumed to be statistically independent. The model says that the low-dimensional coordinates x (i.e., the underlying causes) come from a unit Gaussian distribution, and the y measurements are a linear function of these low-dimensional causes, plus Gaussian noise. Note that we no longer require W to be orthonormal (in part because we now constrain the magnitude of the x variables).

Since any linear transformation of a Gaussian variable is itself Gaussian, y must also be Gaussian. This distribution is

    p(y) = ∫ p(x, y) dx = ∫ p(y | x) p(x) dx = ∫ G(y; W x + b, σ² I) G(x; 0, I) dx   (26)

Evaluating this integral would give us p(y); however, there is a simpler way to obtain the Gaussian distribution. Since we know that y is Gaussian, all we need to do is derive its mean and covariance, which can be done as follows (using the fact that mathematical expectation is linear):

    mean(y) = E[y] = E[W x + b + n]                                             (27)
            = W E[x] + b + E[n]                                                 (28)
            = b                                                                 (29)

    cov(y) = E[(y - b)(y - b)^T]                                                (30)
           = E[(W x + b + n - b)(W x + b + n - b)^T]                            (31)
           = E[(W x + n)(W x + n)^T]                                            (32)
           = E[W x x^T W^T] + E[W x n^T] + E[n x^T W^T] + E[n n^T]              (33)
           = W E[x x^T] W^T + W E[x] E[n^T] + E[n] E[x^T] W^T + σ² I            (34)
           = W W^T + σ² I                                                       (35)

Hence,

    y ∼ N(b, W W^T + σ² I)                                                      (36)
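As a quick sanity check of this derivation, one can draw samples from the generative model (24)-(25) and compare their sample mean and covariance to b and W W^T + σ² I. The particular W, b, and σ below are arbitrary values chosen for illustration, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(3)
D, C, sigma = 4, 2, 0.5
W = rng.normal(size=(D, C))
b = rng.normal(size=D)

n_samples = 200_000
X = rng.normal(size=(n_samples, C))               # x ~ N(0, I), Eq. (24)
noise = sigma * rng.normal(size=(n_samples, D))   # n ~ N(0, sigma^2 I)
Y = X @ W.T + b + noise                           # y = Wx + b + n, Eq. (25)

print(np.allclose(Y.mean(axis=0), b, atol=0.02))  # mean(y) = b, Eq. (29)
emp_cov = np.cov(Y, rowvar=False)
print(np.allclose(emp_cov, W @ W.T + sigma**2 * np.eye(D), atol=0.05))  # Eq. (35)
```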
[Figure 1: Visualization of the PPCA mapping for a 1D-to-2D model. A Gaussian in 1D is mapped to a line, and then blurred with 2D noise. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)]

In other words, learning a PPCA model is equivalent to learning a particular form of Gaussian distribution. This is illustrated in Figure 1. The PPCA model is not as general as learning a full Gaussian model with a D × D covariance matrix; however, it uses fewer numbers to represent the Gaussian (CD + 1 versus D²/2 + D/2; why?). Because the representation is more compact, it can be estimated from smaller datasets, and requires less memory to store the model. These differences will be significant when D is large; e.g., if D = 100, the full covariance matrix would require 5050 parameters and thus hundreds of thousands of data points to estimate reliably. However, if the effective dimensionality is, say, 2 or 3, then the PPCA representation will have only a few hundred parameters and will require many fewer measurements.

Learning. The PPCA model can be learned by maximum likelihood, i.e., by minimizing

    L(W, b, σ²) = - ln \prod_{i=1}^{N} G(y_i; b, W W^T + σ² I)                  (37)
                = (1/2) \sum_i (y_i - b)^T (W W^T + σ² I)^{-1} (y_i - b)
                  + (N/2) ln( (2π)^D |W W^T + σ² I| )                           (38)

This can be optimized in closed form. The solution is very similar to the conventional PCA case:

1. Let b = (1/N) \sum_i y_i.
2. Let K = (1/N) \sum_i (y_i - b)(y_i - b)^T.
3. Let V Λ V^T = K be the eigenvector decomposition of K. Λ is a diagonal matrix of eigenvalues (Λ = diag(λ_1, ..., λ_D)). The matrix V contains the eigenvectors, V = [V_1, ..., V_D], and is orthonormal: V^T V = I.
4. Assume that the eigenvalues are sorted from largest to smallest (λ_i ≥ λ_{i+1}). If this is not the case, sort them (and their corresponding eigenvectors).
5. Let σ² = (1/(D - C)) \sum_{j=C+1}^{D} λ_j. In words, the estimated noise variance is equal to the average marginal data variance over all directions that are orthogonal to the C principal directions (i.e., this is the average variance, per dimension, of the data that is lost in the approximation of the data by the C-dimensional subspace).
6. Let Ṽ be the matrix comprising the first C eigenvectors, Ṽ = [V_1, ..., V_C], and let Λ̃ be the diagonal matrix with the C leading eigenvalues, Λ̃ = diag(λ_1, ..., λ_C).
7. Let W = Ṽ (Λ̃ - σ² I)^{1/2}.
8. Let x_i = W^T (y_i - b), for all i.

Note that this solution is similar to that in the conventional PCA case with whitening, except that (a) the noise variance is estimated, and (b) the noise is removed from the variances of the remaining eigenvalues.

An alternative optimization. In the above learning algorithm, we marginalized out x when estimating PPCA. In other words, we maximized

    p(y_{1:N} | W, b, σ²) = ∫ p(y_{1:N}, x_{1:N} | W, b, σ²) dx_{1:N}           (39)
                          = ∫ p(y_{1:N} | x_{1:N}, W, b, σ²) p(x_{1:N}) dx_{1:N}  (40)
                          = \prod_i ∫ p(y_i | x_i, W, b, σ²) p(x_i) dx_i        (41)

instead of maximizing

    p(y_{1:N}, x_{1:N} | W, b, σ²) = \prod_i p(y_i, x_i | W, b, σ²)             (42)
                                   = \prod_i p(y_i | x_i, W, b, σ²) p(x_i)      (43)

By integrating out x, we are estimating fewer parameters and thus can get better estimates. Loosely speaking, doing so might be viewed as being more Bayesian. Suppose we did instead try to
estimate the x_i's together with the model parameters:

    L(x_{1:N}, W, b, σ²) = - ln p(y_{1:N}, x_{1:N} | W, b, σ²)                  (44)
                         = (1/(2σ²)) \sum_i \| y_i - (W x_i + b) \|^2
                           + (1/2) \sum_i \| x_i \|^2 + (ND/2) ln σ² + ND ln 2π (45)

Now, suppose we are optimizing this objective function, and we have some estimates for W and x. We can always reduce the objective function by the replacement

    W ← 2 W                                                                     (46)
    x_i ← x_i / 2                                                               (47)

This leaves the reconstruction term W x_i unchanged while shrinking \sum_i \| x_i \|^2, and by applying the replacement arbitrarily many times we can make the values of x infinitesimally small. This indicates that the objective function is degenerate; using it will yield very poor results. Note, however, that this arises from the same model as before, but without marginalizing out x. This illustrates a general principle: the more parameters you estimate (instead of marginalizing out), the greater the danger of biased and/or degenerate solutions.
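To close the chapter, here is a minimal sketch of the closed-form maximum-likelihood solution (steps 1-8 above), applied to data sampled from a known PPCA model. The code and its names are our own illustration; since W is identifiable only up to a rotation of the latent space, we compare the implied covariance W W^T + σ² I rather than W itself.

```python
import numpy as np

def ppca_fit(Y, C):
    """Closed-form ML estimate of the PPCA parameters (steps 1-8)."""
    N = Y.shape[0]
    b = Y.mean(axis=0)                           # step 1
    Yc = Y - b
    K = Yc.T @ Yc / N                            # step 2
    lam, V = np.linalg.eigh(K)                   # step 3
    order = np.argsort(lam)[::-1]                # step 4
    lam, V = lam[order], V[:, order]
    sigma2 = lam[C:].mean()                      # step 5: average trailing eigenvalue
    V_C, lam_C = V[:, :C], lam[:C]               # step 6
    W = V_C @ np.diag(np.sqrt(lam_C - sigma2))   # step 7: W = V~ (L~ - sigma^2 I)^(1/2)
    X = Yc @ W                                   # step 8
    return W, b, sigma2, X

# Sample from a known PPCA model, then recover its parameters.
rng = np.random.default_rng(4)
D, C, sigma_true = 6, 2, 0.3
W_true = rng.normal(size=(D, C))
b_true = rng.normal(size=D)
N = 100_000
Y = (rng.normal(size=(N, C)) @ W_true.T + b_true
     + sigma_true * rng.normal(size=(N, D)))

W_hat, b_hat, sigma2_hat, _ = ppca_fit(Y, C)
print(sigma2_hat, sigma_true**2)                 # estimated noise variance, close to 0.09
print(np.allclose(W_hat @ W_hat.T + sigma2_hat * np.eye(D),
                  W_true @ W_true.T + sigma_true**2 * np.eye(D), atol=0.05))
```

With enough data the estimated noise variance approaches the true σ², and the implied Gaussian covariance approaches the true W W^T + σ² I, as Equation (36) predicts.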
Copyright © 2015 Aaron Hertzmann, David J. Fleet and Marcus Brubaker