Lecture 10: Factor Analysis and Principal Component Analysis. Sam Roweis. February 9, 2004

Continuous Latent Variables

In many models there are some underlying causes of the data. Mixture models use a discrete class variable: clustering. Sometimes it is more appropriate to think in terms of continuous factors which control the data we observe. Geometrically, this is equivalent to thinking of a data manifold or subspace. To generate data, first generate a point within the manifold, then add noise. The coordinates of the point are the components of the latent variable.

Factor Analysis

When we assume that the subspace is linear and that the underlying latent variable has a Gaussian distribution, we get a model known as factor analysis. With data y (p-dim) and latent variable x (k-dim):

  p(x) = \mathcal{N}(x \mid 0, I)
  p(y \mid x, \theta) = \mathcal{N}(y \mid \mu + \Lambda x, \Psi)

where μ is the mean vector, Λ is the p-by-k factor loading matrix, and Ψ is the sensor noise covariance (usually diagonal). Important: since the product of Gaussians is still Gaussian, the joint distribution p(x, y), the other marginal p(y), and the conditional p(x | y) are also Gaussian.

Marginal Data Distribution

Just as with discrete latent variables, we can compute the marginal density p(y | θ) by summing out x. But now the sum is an integral:

  p(y \mid \theta) = \int p(x)\, p(y \mid x, \theta)\, dx = \mathcal{N}(y \mid \mu, \Lambda \Lambda^\top + \Psi)

which can be done by completing the square in the exponent. However, since the marginal is Gaussian, we can also just compute its mean and covariance directly. (Assume the noise n is uncorrelated with the latent variables.)

  E[y] = E[\mu + \Lambda x + n] = \mu + \Lambda E[x] + E[n] = \mu + \Lambda \cdot 0 + 0 = \mu
  Cov[y] = E[(y - \mu)(y - \mu)^\top]
         = E[(\mu + \Lambda x + n - \mu)(\mu + \Lambda x + n - \mu)^\top]
         = E[(\Lambda x + n)(\Lambda x + n)^\top]
         = \Lambda E[x x^\top] \Lambda^\top + E[n n^\top] = \Lambda \Lambda^\top + \Psi
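As a quick numerical sanity check of the generative model and the marginal covariance derived above, here is a minimal NumPy sketch (my own, not part of the lecture); the dimensions and parameter values are arbitrary.

```python
# Sample from the FA generative model p(x) = N(0, I), p(y|x) = N(mu + Lambda x, Psi)
# and check that the sample covariance of y approaches Lambda Lambda^T + Psi.
import numpy as np

rng = np.random.default_rng(0)
p, k, N = 5, 2, 200_000                         # observed dim, latent dim, sample count

mu = rng.normal(size=p)                         # mean vector mu
Lam = rng.normal(size=(p, k))                   # p-by-k factor loading matrix Lambda
Psi = np.diag(rng.uniform(0.1, 1.0, size=p))    # diagonal sensor noise covariance Psi

x = rng.normal(size=(N, k))                     # latent variables ~ N(0, I)
noise = rng.normal(size=(N, p)) @ np.sqrt(Psi)  # sensor noise ~ N(0, Psi)
y = mu + x @ Lam.T + noise                      # observed data from the FA model

model_cov = Lam @ Lam.T + Psi                   # Lambda Lambda^T + Psi
sample_cov = np.cov(y, rowvar=False)
print(np.abs(sample_cov - model_cov).max())     # small for large N
```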

FA = Constrained Covariance Gaussian

Marginal density for factor analysis (y is p-dim, x is k-dim):

  p(y \mid \theta) = \mathcal{N}(y \mid \mu, \Lambda \Lambda^\top + \Psi)

So the effective covariance is the low-rank outer product of two long skinny matrices plus a diagonal matrix:

  Cov[y] = \Lambda \Lambda^\top + \Psi

In other words, factor analysis is just a constrained Gaussian model. (If Ψ were not diagonal then we could model any Gaussian and it would be pointless.)

Learning: how should we fit the ML parameters? It is easy to find μ: just take the mean of the data. From now on assume we have done this and re-centred y. What about the other parameters?

Likelihood Function

Since the FA data model is Gaussian, the likelihood function is simple:

  \ell(\theta; D) = -\frac{N}{2} \log|\Lambda \Lambda^\top + \Psi| - \frac{1}{2} \sum_n (y_n - \mu)^\top (\Lambda \Lambda^\top + \Psi)^{-1} (y_n - \mu)
                  = -\frac{N}{2} \log|V| - \frac{1}{2} \operatorname{trace}\Big[ V^{-1} \sum_n (y_n - \mu)(y_n - \mu)^\top \Big]
                  = -\frac{N}{2} \log|V| - \frac{N}{2} \operatorname{trace}[V^{-1} S]

Here V is the model covariance and S is the sample data covariance. In other words, we are trying to make the constrained model covariance as close as possible to the observed covariance, where "close" means the trace of the ratio. Thus the sufficient statistics are the same as for the Gaussian: the mean \sum_n y_n and the covariance \sum_n (y_n - \mu)(y_n - \mu)^\top.

EM for Factor Analysis

We will do maximum likelihood learning using (surprise, surprise) the EM algorithm.

  E-step: q^{t+1}_n = p(x_n \mid y_n, \theta^t)
  M-step: \theta^{t+1} = \arg\max_\theta \sum_n \int q^{t+1}_n(x \mid y_n) \log p(y_n, x \mid \theta)\, dx

For the E-step we need the conditional distribution (inference). For the M-step we need the expected log of the complete data.

  E-step: q^{t+1}_n = p(x_n \mid y_n, \theta^t) = \mathcal{N}(x_n \mid m_n, V_n)
  M-step: \Lambda^{t+1} = \arg\max_\Lambda \sum_n \langle \ell_c(x_n, y_n) \rangle_{q^{t+1}_n}
          \Psi^{t+1} = \arg\max_\Psi \sum_n \langle \ell_c(x_n, y_n) \rangle_{q^{t+1}_n}

From Joint Distribution to Conditional

To get the conditional p(x | y) we will start with the joint p(x, y) and apply Bayes' rule for Gaussian conditionals. Write down the joint distribution of x and y:

  p\!\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = \mathcal{N}\!\left(\begin{bmatrix} x \\ y \end{bmatrix} \;\middle|\; \begin{bmatrix} 0 \\ \mu \end{bmatrix}, \begin{bmatrix} I & \Lambda^\top \\ \Lambda & \Lambda \Lambda^\top + \Psi \end{bmatrix}\right)

where the corner elements Λᵀ and Λ come from Cov[x, y]:

  Cov[x, y] = E[(x - 0)(y - \mu)^\top] = E[x (\mu + \Lambda x + n - \mu)^\top] = E[x (\Lambda x + n)^\top] = \Lambda^\top

Assume the noise is uncorrelated with the data or the latent variables.
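The constrained-covariance likelihood above is easy to evaluate numerically. Here is a hedged sketch (mine, not from the slides); it assumes the data have already been centred and drops the constant term in 2π, as the slide does.

```python
# Evaluate l(theta; D) = -N/2 log|Lambda Lambda^T + Psi| - N/2 trace[(Lambda Lambda^T + Psi)^{-1} S]
import numpy as np

def fa_log_likelihood(Y, Lam, Psi):
    """Log-likelihood of centred data Y (N x p) under N(0, Lambda Lambda^T + Psi),
    dropping the constant -Np/2 log(2 pi) term."""
    N, p = Y.shape
    S = Y.T @ Y / N                      # sample covariance of the centred data
    V = Lam @ Lam.T + Psi                # constrained model covariance
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * N * logdet - 0.5 * N * np.trace(np.linalg.solve(V, S))
```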

E-step: Inference in Factor Analysis

Apply the Gaussian conditioning formulas to the joint distribution we derived above. This gives:

  p(x \mid y) = \mathcal{N}(x \mid m, V)
  V = I - \Lambda^\top (\Lambda \Lambda^\top + \Psi)^{-1} \Lambda
  m = \Lambda^\top (\Lambda \Lambda^\top + \Psi)^{-1} (y - \mu)

Now apply the matrix inversion lemma to get:

  p(x \mid y) = \mathcal{N}(x \mid m, V)
  V = (I + \Lambda^\top \Psi^{-1} \Lambda)^{-1}
  m = V \Lambda^\top \Psi^{-1} (y - \mu)

Inference is Linear

Note: inference just multiplies y by a matrix:

  p(x \mid y) = \mathcal{N}(x \mid m, V)
  V = I - \Lambda^\top (\Lambda \Lambda^\top + \Psi)^{-1} \Lambda = (I + \Lambda^\top \Psi^{-1} \Lambda)^{-1}
  m = \Lambda^\top (\Lambda \Lambda^\top + \Psi)^{-1} (y - \mu) = V \Lambda^\top \Psi^{-1} (y - \mu)

Note: inference of the posterior mean is just a linear operation, m = \beta (y - \mu), where β can be computed beforehand given the model parameters. Also: the posterior covariance does not depend on the observed data!

  \operatorname{cov}[x \mid y] = V = (I + \Lambda^\top \Psi^{-1} \Lambda)^{-1}

Complete Data Likelihood

We know the optimal μ is the data mean. Assume the mean has been subtracted off y from now on. The complete likelihood (ignoring the mean):

  \ell_c(\Lambda, \Psi) = \sum_n \log p(x_n, y_n) = \sum_n \log p(x_n) + \log p(y_n \mid x_n)
                        = -\frac{N}{2} \log|\Psi| - \frac{1}{2} \sum_n x_n^\top x_n - \frac{1}{2} \sum_n (y_n - \Lambda x_n)^\top \Psi^{-1} (y_n - \Lambda x_n)
  \ell_c(\Lambda, \Psi) = -\frac{N}{2} \log|\Psi| - \frac{N}{2} \operatorname{trace}[S \Psi^{-1}] + \text{const}, \qquad S = \frac{1}{N} \sum_n (y_n - \Lambda x_n)(y_n - \Lambda x_n)^\top

(The term in x_n^\top x_n does not involve Λ or Ψ and is absorbed into the constant.)

M-step: Optimize Parameters

Take the derivatives of the complete log likelihood with respect to the parameters:

  \partial \ell_c(\Lambda, \Psi) / \partial \Lambda = \Psi^{-1} \sum_n y_n x_n^\top - \Psi^{-1} \Lambda \sum_n x_n x_n^\top
  \partial \ell_c(\Lambda, \Psi) / \partial \Psi^{-1} = +\frac{N}{2} \Psi - \frac{N}{2} S

Take the expectation with respect to q^{t+1} from the E-step:

  \langle \partial \ell_c / \partial \Lambda \rangle = \Psi^{-1} \sum_n y_n m_n^\top - \Psi^{-1} \Lambda \sum_n V_n
  \langle \partial \ell_c / \partial \Psi^{-1} \rangle = +\frac{N}{2} \Psi - \frac{N}{2} \langle S \rangle

Finally, set the derivatives to zero to solve for the optimal parameters:

  \Lambda^{t+1} = \Big( \sum_n y_n m_n^\top \Big) \Big( \sum_n V_n \Big)^{-1}
  \Psi^{t+1} = \frac{1}{N} \operatorname{diag}\Big[ \sum_n y_n y_n^\top - \Lambda^{t+1} \sum_n m_n y_n^\top \Big]
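Because inference is linear and the posterior covariance is shared by all data points, the whole E-step is two small matrix computations. A minimal sketch, assuming a diagonal Ψ as in the model above (the function and variable names are my own):

```python
# E-step: V = (I + Lam^T Psi^{-1} Lam)^{-1}, m_n = V Lam^T Psi^{-1} (y_n - mu)
import numpy as np

def fa_e_step(Y, mu, Lam, Psi):
    """Posterior means M (N x k) and the shared posterior covariance V (k x k)."""
    k = Lam.shape[1]
    Psi_inv = np.diag(1.0 / np.diag(Psi))              # Psi is diagonal, so this is cheap
    V = np.linalg.inv(np.eye(k) + Lam.T @ Psi_inv @ Lam)
    beta = V @ Lam.T @ Psi_inv                         # inference is linear: m_n = beta (y_n - mu)
    M = (Y - mu) @ beta.T
    return M, V
```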

Final Algorithm: EM for Factor Analysis

First, set μ equal to the sample mean (1/N) \sum_n y_n, and subtract this mean from all the data. Now run the following iterations:

  E-step: q^{t+1}_n = p(x_n \mid y_n, \theta^t) = \mathcal{N}(x_n \mid m_n, V_n)
          V = (I + \Lambda^\top \Psi^{-1} \Lambda)^{-1}
          m_n = V \Lambda^\top \Psi^{-1} (y_n - \mu)
  M-step: \Lambda^{t+1} = \Big( \sum_n y_n m_n^\top \Big) \Big( \sum_n V_n \Big)^{-1}
          \Psi^{t+1} = \frac{1}{N} \operatorname{diag}\Big[ \sum_n y_n y_n^\top - \Lambda^{t+1} \sum_n m_n y_n^\top \Big]

Principal Component Analysis

In Factor Analysis, we can write the marginal density explicitly:

  p(y \mid \theta) = \int p(x)\, p(y \mid x, \theta)\, dx = \mathcal{N}(y \mid \mu, \Lambda \Lambda^\top + \Psi)

The noise Ψ must be restricted for the model to be interesting. (Why?) In Factor Analysis the restriction is that Ψ is diagonal (axis-aligned). What if we further restrict Ψ = σ²I (i.e. spherical)? We get the Probabilistic Principal Component Analysis (PPCA) model:

  p(x) = \mathcal{N}(x \mid 0, I)
  p(y \mid x, \theta) = \mathcal{N}(y \mid \mu + \Lambda x, \sigma^2 I)

where μ is the mean vector, the columns of Λ are the principal components (usually orthogonal), and σ² is the global sensor noise.

Likelihood Function

As with FA, the PPCA data model is Gaussian. Thus, the likelihood function is simple:

  \ell(\theta; D) = -\frac{N}{2} \log|\Lambda \Lambda^\top + \Psi| - \frac{1}{2} \sum_n (y_n - \mu)^\top (\Lambda \Lambda^\top + \Psi)^{-1} (y_n - \mu)
                  = -\frac{N}{2} \log|V| - \frac{1}{2} \operatorname{trace}\Big[ V^{-1} \sum_n (y_n - \mu)(y_n - \mu)^\top \Big]
                  = -\frac{N}{2} \log|V| - \frac{N}{2} \operatorname{trace}[V^{-1} S]

V is the model covariance; S is the sample data covariance. In other words, we are trying to make the constrained model covariance as close as possible to the observed covariance, where "close" means the trace of the ratio. Thus, the sufficient statistics are the same as for the Gaussian: the mean \sum_n y_n and the covariance \sum_n (y_n - \mu)(y_n - \mu)^\top.

Fitting the PPCA Model

The standard EM algorithm applies to PPCA also:

  E-step: q^{t+1}_n = p(x_n \mid y_n, \theta^t)
  M-step: \theta^{t+1} = \arg\max_\theta \sum_n \int q^{t+1}_n(x \mid y_n) \log p(y_n, x \mid \theta)\, dx

For this we need the conditional distribution (inference) and the expected log of the complete data. Results:

  E-step: q^{t+1}_n = p(x_n \mid y_n, \theta^t) = \mathcal{N}(x_n \mid m_n, V_n)
          V = (I + \sigma^{-2} \Lambda^\top \Lambda)^{-1}
          m_n = \sigma^{-2} V \Lambda^\top (y_n - \mu)
  M-step: \Lambda^{t+1} = \Big( \sum_n y_n m_n^\top \Big) \Big( \sum_n V_n \Big)^{-1}
          \sigma^{2,t+1} = \frac{1}{DN} \sum_i \Big[ \sum_n y_n y_n^\top - \Lambda^{t+1} \sum_n m_n y_n^\top \Big]_{ii}

(Here D is the dimensionality of y, i.e. D = p.)
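Putting the pieces together, here is a compact NumPy sketch of the EM iterations above (my own code, not from the lecture). Following the slide, \sum_n V_n is read as the summed posterior second moments \sum_n E[x_n x_n^\top \mid y_n] = N V + \sum_n m_n m_n^\top; the initialization and fixed iteration count are arbitrary choices. The same loop fits PPCA if Ψ is constrained to σ²I.

```python
import numpy as np

def fa_em(Y, k, n_iters=100, seed=0):
    """EM for factor analysis on data Y (N x p) with a k-dimensional latent space."""
    rng = np.random.default_rng(seed)
    N, p = Y.shape
    mu = Y.mean(axis=0)                         # set mu to the sample mean ...
    Yc = Y - mu                                 # ... and subtract it from all the data
    Lam = rng.normal(size=(p, k))               # arbitrary initialization (my choice)
    Psi = np.diag(Yc.var(axis=0))

    for _ in range(n_iters):
        # E-step: V = (I + Lam^T Psi^{-1} Lam)^{-1}, m_n = V Lam^T Psi^{-1} (y_n - mu)
        Psi_inv = np.diag(1.0 / np.diag(Psi))
        V = np.linalg.inv(np.eye(k) + Lam.T @ Psi_inv @ Lam)
        M = Yc @ (V @ Lam.T @ Psi_inv).T        # N x k matrix of posterior means
        Exx = N * V + M.T @ M                   # sum_n E[x_n x_n^T | y_n]

        # M-step: Lambda = (sum_n y_n m_n^T)(sum_n E[x_n x_n^T])^{-1}
        #         Psi    = (1/N) diag[sum_n y_n y_n^T - Lambda sum_n m_n y_n^T]
        YM = Yc.T @ M                           # sum_n y_n m_n^T  (p x k)
        Lam = YM @ np.linalg.inv(Exx)
        Psi = np.diag(np.diag(Yc.T @ Yc - Lam @ YM.T) / N)
    return mu, Lam, Psi
```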

PCA: Zero Noise Limit

The traditional PCA model is actually a limit as σ² → 0. The model we saw is actually called probabilistic PCA. However, the ML parameters Λ are the same. The only difference is the global sensor noise σ². In the zero noise limit, inference is easier: orthogonal projection.

  \lim_{\sigma^2 \to 0} \Lambda^\top (\Lambda \Lambda^\top + \sigma^2 I)^{-1} = (\Lambda^\top \Lambda)^{-1} \Lambda^\top

Direct Fitting

For FA the parameters are coupled in a way that makes it impossible to solve for the ML parameters directly. We must use EM or other nonlinear optimization techniques. But for (P)PCA, the ML parameters can be solved for directly: the kth column of Λ is the kth largest eigenvalue of the sample covariance S times the associated eigenvector; the global sensor noise σ² is the sum of all the eigenvalues smaller than the kth one. This technique is good for initializing FA also.

Actually PCA is the limit as the ratio of the noise variance on the output to the prior variance on the latent variables goes to zero. We can either achieve this with zero noise or with infinite-variance priors.

Scale Invariance in Factor Analysis

In FA the scale of the data is unimportant: we can multiply each y_i by α_i without changing anything:

  \mu_i \to \alpha_i \mu_i, \qquad \Lambda_{ij} \to \alpha_i \Lambda_{ij}, \qquad \Psi_i \to \alpha_i^2 \Psi_i

However, the rotation of the data is important. FA looks for directions of large correlation in the data, so it is not fooled by large-variance noise.

[Figure: FA vs. PCA fits]

Rotational Invariance in PCA

In PCA the rotation of the data is unimportant: we can multiply the data y by any rotation Q without changing anything:

  \mu \to Q\mu, \qquad \Lambda \to Q\Lambda, \qquad \Psi \text{ unchanged}

However, the scale of the data is important. PCA looks for directions of large variance, so it will chase big noise directions.

[Figure: FA vs. PCA fits]
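A sketch of the direct fit (mine, not from the slides). It follows the Tipping and Bishop closed form, which is one way to make the recipe above precise: σ² is taken as the average of the discarded eigenvalues, and each retained eigenvector is scaled by the square root of (its eigenvalue minus σ²). As noted above, the result is also a reasonable initializer for FA.

```python
import numpy as np

def ppca_direct_fit(Y, k):
    """Closed-form PPCA fit from the eigendecomposition of the sample covariance."""
    N, p = Y.shape
    mu = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)                      # sample covariance
    evals, evecs = np.linalg.eigh(S)                 # ascending eigenvalues
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]     # now sorted descending
    sigma2 = evals[k:].mean() if k < p else 0.0      # noise level from discarded eigenvalues
    Lam = evecs[:, :k] * np.sqrt(np.maximum(evals[:k] - sigma2, 0.0))
    return mu, Lam, sigma2
```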

Gaussians are Footballs in High-D

Recall the intuition that Gaussians are hyperellipsoids.
Mean == centre of the football.
Eigenvectors of the covariance matrix == axes of the football.
Eigenvalues == lengths of the axes.
In FA our football is an axis-aligned cigar. In PPCA our football is a sphere of radius σ².

[Figure: FA covariance (diagonal Ψ) vs. PPCA covariance (εI)]

Review: Gaussian Conditioning

Remember the formulas for conditional Gaussian distributions:

  p\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = \mathcal{N}\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \;\middle|\; \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}\right)
  p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid m_{1|2}, V_{1|2})
  m_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2)
  V_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}

Review: Matrix Inversion Lemma

There is a good trick for inverting matrices when they can be decomposed into the sum of an easily inverted matrix (D) and a low-rank outer product. It is called the matrix inversion lemma:

  (D - A B^{-1} A^\top)^{-1} = D^{-1} + D^{-1} A (B - A^\top D^{-1} A)^{-1} A^\top D^{-1}

Review: Matrix Derivatives

You often need these tricks to compute the M-step:

  \frac{\partial}{\partial A} \log|A| = (A^{-1})^\top
  \frac{\partial}{\partial A} \operatorname{trace}[B A^\top] = B
  \frac{\partial}{\partial A} \operatorname{trace}[B A^\top C A] = 2 C A B \quad \text{(for symmetric } B, C\text{)}
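A small numerical check of the matrix inversion lemma as stated above (my own, not from the slides): D is an easily inverted diagonal matrix and A B^{-1} A^T is a low-rank correction, which is exactly the structure of the FA posterior covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 6, 2
D = np.diag(rng.uniform(1.0, 2.0, size=p))     # easily inverted diagonal matrix
A = 0.1 * rng.normal(size=(p, k))              # tall skinny matrix (low-rank correction)
B = np.eye(k)                                  # small k-by-k matrix

lhs = np.linalg.inv(D - A @ np.linalg.inv(B) @ A.T)
D_inv = np.diag(1.0 / np.diag(D))
rhs = D_inv + D_inv @ A @ np.linalg.inv(B - A.T @ D_inv @ A) @ A.T @ D_inv
print(np.allclose(lhs, rhs))                   # True
```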

Review: Means, Variances and Covariances

Remember the definition of the mean and covariance of a vector random variable:

  E[x] = \int x\, p(x)\, dx = m
  Cov[x] = E[(x - m)(x - m)^\top] = \int (x - m)(x - m)^\top p(x)\, dx = V

which is the expected value of the outer product of the variable with itself, after subtracting the mean.

Also, the covariance between two variables:

  Cov[x, y] = E[(x - m_x)(y - m_y)^\top] = \int\!\!\int (x - m_x)(y - m_y)^\top p(x, y)\, dx\, dy = C_{xy}

which is the expected value of the outer product of one variable with another, after subtracting their means. Note: C_{xy} is not symmetric.
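A tiny numerical illustration of these definitions (mine, with made-up data): the sample cross-covariance is generally not symmetric, and C_yx is the transpose of C_xy.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100_000
x = rng.normal(size=(N, 2))
y = x @ np.array([[1.0, 0.5, 0.0],
                  [0.0, 1.0, 2.0]]) + rng.normal(size=(N, 3))

mx, my = x.mean(axis=0), y.mean(axis=0)
C_xy = (x - mx).T @ (y - my) / N     # 2 x 3 cross-covariance: not even square here
C_yx = (y - my).T @ (x - mx) / N     # 3 x 2
print(np.allclose(C_xy, C_yx.T))     # True: C_yx = C_xy^T
```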