Dimensionality Reduction vs. Clustering


Lecture 9: Continuous Latent Variable Models
Sam Roweis
November 4, 2003

Dimensionality Reduction vs. Clustering

Training such factor models (e.g. FA, PCA, ICA) is called dimensionality reduction. You can think of this as (non)linear regression with missing inputs.

Continuous causes can sometimes be much more efficient at representing information than discrete causes. For example, if there are two factors with about 256 settings each, we can describe the latent causes with two 8-bit numbers. If we tried to cluster, we would need 2^16 ≈ 65,536 clusters.

Continuous Latent Variable Models

Often there are some unknown underlying causes of the data. Mixture models use a discrete class variable: clustering. Sometimes it is more appropriate to think in terms of continuous factors which control the data we observe. Geometrically, this is equivalent to thinking of a data manifold or subspace.

[Figure: a two-dimensional plane, centred at µ and spanned by directions λ1 and λ2, embedded in the three-dimensional data space (y1, y2, y3).]

To generate data, first generate a point within the manifold, then add noise. The coordinates of the point are the components of the latent variable.

Factor Analysis

When we assume that the subspace is linear and that the underlying latent variable has a Gaussian distribution, we get a model known as factor analysis: data y (p-dimensional); latent variable z (k-dimensional).

p(z) = N(z | 0, I)
p(y | z, θ) = N(y | µ + Λz, Ψ)

where µ is the mean vector, Λ is the p-by-k factor loading matrix, and Ψ is the sensor noise covariance (usually diagonal).

Important: since the product of Gaussians is still Gaussian, the joint distribution p(z, y), the other marginal p(y) and the conditional p(z | y) are also Gaussian.
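The generative story above is easy to simulate. Below is a minimal NumPy sketch (not from the lecture) that draws samples from the factor analysis model; the sizes p, k, N and all parameter values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: p-dimensional observations, k-dimensional latent factors.
p, k, N = 5, 2, 1000

# Illustrative parameters (not taken from the lecture).
mu = rng.normal(size=p)                 # mean vector µ
Lam = rng.normal(size=(p, k))           # factor loading matrix Λ (p by k)
psi = rng.uniform(0.1, 0.5, size=p)     # diagonal of the sensor noise covariance Ψ

# Generative process: z ~ N(0, I); pick the point µ + Λz on the linear
# subspace spanned by the columns of Λ, then add axis-aligned noise.
z = rng.normal(size=(N, k))
noise = rng.normal(size=(N, p)) * np.sqrt(psi)
y = mu + z @ Lam.T + noise              # N samples of the observed variable y
```

The columns of Λ span the linear manifold; Ψ controls how far samples scatter away from it.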

Marginal Data Distribution

Just as with discrete latent variables, we can compute the marginal density p(y | θ) by summing out z. But now the sum is an integral:

p(y | θ) = ∫ p(z) p(y | z, θ) dz = N(y | µ, ΛΛ^T + Ψ)

which can be done by completing the square in the exponent. However, since the marginal is Gaussian, we can also just compute its mean and covariance. (Assume the noise n is uncorrelated with the latent variable.)

E[y] = E[µ + Λz + n] = µ + Λ E[z] + E[n] = µ + Λ·0 + 0 = µ

Cov[y] = E[(y − µ)(y − µ)^T]
       = E[(µ + Λz + n − µ)(µ + Λz + n − µ)^T]
       = E[(Λz + n)(Λz + n)^T]
       = Λ E[zz^T] Λ^T + E[nn^T] = ΛΛ^T + Ψ

Reminder: Gaussian Conditioning

Remember the formulas for marginalizing and conditioning Gaussian probability distribution functions.

Joint:        p([x1; x2]) = N([x1; x2] | [µ1; µ2], [Σ11 Σ12; Σ21 Σ22])
Marginals:    p(x1) = N(x1 | µ1, Σ11)
Conditionals: p(x1 | x2) = N(x1 | m_{1|2}, V_{1|2})
              m_{1|2} = µ1 + Σ12 Σ22^{-1} (x2 − µ2)
              V_{1|2} = Σ11 − Σ12 Σ22^{-1} Σ21

Reminder: Means, Variances and Covariances

Remember the definition of the mean and covariance of a vector random variable:

E[x] = ∫ x p(x) dx = m_x
Cov[x] = E[(x − m)(x − m)^T] = ∫ (x − m)(x − m)^T p(x) dx = V_x

which is the expected value of the outer product of the variable with itself, after subtracting the mean. It is symmetric. Also, the (cross-)covariance between two variables:

Cov[x, y] = E[(x − m_x)(y − m_y)^T] = ∫∫ (x − m_x)(y − m_y)^T p(x, y) dx dy = C_xy

which is the expected value of the outer product of one variable with another, after subtracting their means. Note: C_xy is not symmetric.

Constrained Covariance

Marginal density for factor analysis (y is p-dim, z is k-dim):

p(y | θ) = N(y | µ, ΛΛ^T + Ψ)

So the effective covariance is the low-rank outer product of two long skinny matrices plus a diagonal matrix:

Cov[y] = Λ Λ^T + Ψ

In other words, factor analysis is just a constrained Gaussian model. (If Ψ were not diagonal then we could model any Gaussian and it would be pointless.)

It is easy to find µ: just take the mean of the data. From now on assume we have done this and re-centred y.
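As a sanity check on the marginal derived above, one can sample from the generative model and compare the empirical mean and covariance of y against µ and ΛΛ^T + Ψ. A small sketch with made-up parameters and sample sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, N = 5, 2, 200_000                 # arbitrary sizes for the check

mu = rng.normal(size=p)
Lam = rng.normal(size=(p, k))
Psi = np.diag(rng.uniform(0.1, 0.5, size=p))

# Sample from the FA generative model and compare the empirical moments of y
# against the closed-form marginal N(µ, ΛΛ^T + Ψ).
z = rng.normal(size=(N, k))
y = mu + z @ Lam.T + rng.multivariate_normal(np.zeros(p), Psi, size=N)

print(np.allclose(y.mean(axis=0), mu, atol=0.05))                      # mean ≈ µ
print(np.allclose(np.cov(y, rowvar=False), Lam @ Lam.T + Psi, atol=0.1))  # cov ≈ ΛΛ^T + Ψ
```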

EM for Factor Analysis

We will do maximum likelihood learning using (surprise, surprise) the EM algorithm.

E-step: q^{t+1}_n = p(z | y_n, θ^t)
M-step: θ^{t+1} = argmax_θ Σ_n ∫ q^{t+1}(z | y_n) log p(y_n, z | θ) dz

For this we need the conditional distribution (inference) and the expected log of the complete data. Results:

E-step: q^{t+1}_n = p(z | y_n, θ^t) = N(z | m_n, V)
        V = (I + Λ^T Ψ^{-1} Λ)^{-1}
        m_n = V Λ^T Ψ^{-1} (y_n − µ)

M-step: Λ^{t+1} = (Σ_n y_n m_n^T)(N V + Σ_n m_n m_n^T)^{-1}
        Ψ^{t+1} = (1/N) diag[Σ_n y_n y_n^T − Λ^{t+1} Σ_n m_n y_n^T]

Inference in Factor Analysis

Apply the Gaussian conditioning formulas to the joint distribution we derived above. This gives:

p(z | y) = N(z | m, V)
V = I − Λ^T (ΛΛ^T + Ψ)^{-1} Λ
m = Λ^T (ΛΛ^T + Ψ)^{-1} (y − µ)

Now apply the matrix inversion lemma to get:

p(z | y) = N(z | m, V)
V = (I + Λ^T Ψ^{-1} Λ)^{-1}
m = V Λ^T Ψ^{-1} (y − µ)

Complete Data Likelihood

Write down the joint distribution of z and y:

p([z; y]) = N( [z; y] | [0; µ], [ I      Λ^T
                                  Λ   ΛΛ^T + Ψ ] )

where the corner elements Λ^T, Λ come from Cov[z, y]:

Cov[z, y] = E[(z − 0)(y − µ)^T] = E[z (µ + Λz + n − µ)^T] = E[z (Λz + n)^T] = Λ^T

This gives the complete data log likelihood (ignoring the mean):

ℓ_c(Λ, Ψ) = −(N/2) log|Ψ| − (1/2) Σ_n z_n^T z_n − (1/2) Σ_n (y_n − Λz_n)^T Ψ^{-1} (y_n − Λz_n)
          = −(N/2) log|Ψ| − (N/2) trace[S Ψ^{-1}]   (dropping terms that do not involve Λ or Ψ)

S = (1/N) Σ_n (y_n − Λz_n)(y_n − Λz_n)^T

Matrix Inversion Lemma

There is a good trick for inverting matrices when they can be decomposed into the sum of an easily inverted matrix (D) and a low-rank outer product. It is called the matrix inversion lemma:

(D − A B^{-1} A^T)^{-1} = D^{-1} + D^{-1} A (B − A^T D^{-1} A)^{-1} A^T D^{-1}
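Putting the E-step and M-step together gives a compact EM loop. The sketch below is one possible NumPy implementation of the updates as reconstructed above; the function name, initialization scheme, and variable names are my own, not the lecture's.

```python
import numpy as np

def fa_em(Y, k, n_iter=100, seed=0):
    """A sketch of EM for factor analysis (not Roweis' code).

    Y: (N, p) data matrix; k: latent dimension.
    Returns the loading matrix Lam (p, k) and the diagonal of the noise psi (p,).
    """
    rng = np.random.default_rng(seed)
    N, p = Y.shape
    mu = Y.mean(axis=0)
    Yc = Y - mu                                  # re-centre the data
    Lam = rng.normal(scale=0.1, size=(p, k))     # small random initial loadings
    psi = Yc.var(axis=0)                         # initial diagonal Ψ

    for _ in range(n_iter):
        # E-step: q(z|y_n) = N(m_n, V) with
        # V = (I + Λ^T Ψ^{-1} Λ)^{-1} and m_n = V Λ^T Ψ^{-1} (y_n − µ).
        LtPinv = Lam.T / psi                     # Λ^T Ψ^{-1}
        V = np.linalg.inv(np.eye(k) + LtPinv @ Lam)
        M = Yc @ LtPinv.T @ V                    # (N, k): row n holds m_n^T

        # M-step: Λ = (Σ_n y_n m_n^T)(N V + Σ_n m_n m_n^T)^{-1}
        #         Ψ = (1/N) diag[Σ_n y_n y_n^T − Λ Σ_n m_n y_n^T]
        Ezz = N * V + M.T @ M                    # Σ_n E[z_n z_n^T]
        Lam = (Yc.T @ M) @ np.linalg.inv(Ezz)
        psi = np.diag(Yc.T @ Yc - Lam @ (M.T @ Yc)) / N
    return Lam, psi
```

Running it on the samples drawn earlier should recover a loading matrix whose outer product Lam @ Lam.T plus diag(psi) is close to the true ΛΛ^T + Ψ (Λ itself is only identified up to rotation, as discussed below).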

Derivatives

You need these tricks to compute the M-step derivatives:

∂/∂A log|A| = A^{-T}
∂/∂A trace[B A] = B^T
∂/∂A trace[B A^T C A] = 2 C A B   (for symmetric B, C)

Gaussians are Footballs in High-D

Recall the intuition that Gaussians are hyperellipsoids.

Mean == centre of the football
Eigenvectors of the covariance matrix == axes of the football
Eigenvalues == lengths of the axes

In FA our noise football is an axis-aligned cigar (diagonal Ψ). In PCA our noise football is a sphere (covariance σ²I).

[Figure: the FA noise ellipsoid Ψ (axis-aligned) next to the PCA noise sphere σ²I.]

Principal Component Analysis

In Factor Analysis, we can write the marginal density explicitly:

p(y | θ) = ∫ p(z) p(y | z, θ) dz = N(y | µ, ΛΛ^T + Ψ)

The noise Ψ must be restricted for the model to be interesting. (Why?) In Factor Analysis the restriction is that Ψ is diagonal (axis-aligned).

What if we further restrict Ψ = σ²I (i.e. spherical)? We get the Principal Component Analysis (PCA) model:

p(z) = N(z | 0, I)
p(y | z, θ) = N(y | µ + Λz, σ²I)

where µ is the mean vector, the columns of Λ are the principal components (usually orthogonal), and σ² is the global sensor noise.

Likelihood Functions

For both FA and PCA, the data model is Gaussian. Thus, the likelihood function is simple:

ℓ(θ; D) = −(N/2) log|ΛΛ^T + Ψ| − (1/2) Σ_n (y_n − µ)^T (ΛΛ^T + Ψ)^{-1} (y_n − µ)
        = −(N/2) log|V| − (1/2) trace[V^{-1} Σ_n (y_n − µ)(y_n − µ)^T]
        = −(N/2) log|V| − (N/2) trace[V^{-1} S]

where V is the model covariance and S is the sample data covariance. In other words, we are trying to make the constrained model covariance as close as possible to the observed covariance, where "close" means the trace of the ratio.

Thus, the sufficient statistics are the same as for a Gaussian: the mean Σ_n y_n and the covariance Σ_n (y_n − µ)(y_n − µ)^T.
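The last form of the likelihood, ℓ = −(N/2)(log|V| + trace[V^{-1}S]), is easy to evaluate for any constrained covariance. A short sketch (assumed shapes and names, not lecture code):

```python
import numpy as np

def gaussian_fit_loglik(S, N, Lam, noise_diag):
    """Log-likelihood (up to the additive constant -Np/2 log 2π) of a re-centred
    Gaussian with constrained covariance V = ΛΛ^T + Ψ against sample covariance S.

    Implements ℓ = -N/2 (log|V| + trace[V^{-1} S]).
    S: (p, p) sample covariance, Lam: (p, k) loadings, noise_diag: diagonal of Ψ (p,).
    """
    V = Lam @ Lam.T + np.diag(noise_diag)
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * N * (logdet + np.trace(np.linalg.solve(V, S)))
```

For FA, noise_diag is a free p-vector; for PCA it is σ² times a vector of ones, so the same function scores both models.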

Fitting the PCA Model

The standard EM algorithm applies to PCA also:

E-step: q^{t+1}_n = p(z | y_n, θ^t)
M-step: θ^{t+1} = argmax_θ Σ_n ∫ q^{t+1}(z | y_n) log p(y_n, z | θ) dz

For this we need the conditional distribution (inference) and the expected log of the complete data. Results:

E-step: q^{t+1}_n = p(z | y_n, θ^t) = N(z | m_n, V)
        V = (I + σ^{-2} Λ^T Λ)^{-1}
        m_n = σ^{-2} V Λ^T (y_n − µ)

M-step: Λ^{t+1} = (Σ_n y_n m_n^T)(N V + Σ_n m_n m_n^T)^{-1}
        σ^{2,t+1} = (1/(N p)) Σ_i [Σ_n y_n y_n^T − Λ^{t+1} Σ_n m_n y_n^T]_{ii}

Inference is Linear

Recall the inference formulas for FA:

p(z | y) = N(z | m, V)
V = I − Λ^T (ΛΛ^T + Ψ)^{-1} Λ = (I + Λ^T Ψ^{-1} Λ)^{-1}
m = Λ^T (ΛΛ^T + Ψ)^{-1} (y − µ) = V Λ^T Ψ^{-1} (y − µ)

Note: inference of the posterior mean is just a linear operation!

m = β (y − µ)

where β can be computed beforehand given the model parameters. Also: the posterior covariance does not depend on the observed data!

Cov[z | y] = V = (I + Λ^T Ψ^{-1} Λ)^{-1}

Direct Fitting

For FA the parameters are coupled in a way that makes it impossible to solve for the ML parameters directly. We must use EM or other nonlinear optimization techniques.

But for PCA, the ML parameters can be solved for directly: the k-th column of Λ is determined by the k-th largest eigenvalue of the sample covariance S and its associated eigenvector, and the global sensor noise σ² is the average of the eigenvalues smaller than the k-th one. This technique is also good for initializing FA.

We can't leave the sensor noise unconstrained, or else we would always get a perfect fit!

Zero Noise Limit

The traditional PCA model is actually a limit as σ² → 0. The model we saw is actually called probabilistic PCA. However, the ML parameters Λ are the same; the only difference is the global sensor noise σ².

In the zero-noise limit, inference is easier: it becomes orthogonal projection.

lim_{σ² → 0} Λ^T (ΛΛ^T + σ²I)^{-1} = (Λ^T Λ)^{-1} Λ^T

[Figure: in the zero-noise limit, a data point y is projected orthogonally onto the plane spanned by the columns of Λ, centred at µ.]
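The zero-noise claim is easy to check numerically: as σ² shrinks, the posterior-mean operator Λ^T(ΛΛ^T + σ²I)^{-1} approaches the orthogonal projection (Λ^TΛ)^{-1}Λ^T. A quick sketch with an arbitrary Λ (all values made up):

```python
import numpy as np

rng = np.random.default_rng(2)
p, k = 6, 2                                   # made-up dimensions
Lam = rng.normal(size=(p, k))                 # arbitrary loading matrix Λ

# Orthogonal projection onto the column space of Λ: (Λ^TΛ)^{-1}Λ^T.
proj = np.linalg.solve(Lam.T @ Lam, Lam.T)

# Posterior-mean operator β = Λ^T (ΛΛ^T + σ²I)^{-1} for shrinking noise levels.
for sigma2 in [1.0, 1e-2, 1e-4, 1e-8]:
    beta = Lam.T @ np.linalg.inv(Lam @ Lam.T + sigma2 * np.eye(p))
    print(sigma2, np.max(np.abs(beta - proj)))   # difference shrinks with σ²
```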

Scale Invariance in Factor Analysis

In FA the scale of the data is unimportant: we can multiply each y_i by α_i without changing anything:

µ_i → α_i µ_i
Λ_ij → α_i Λ_ij    for all j
Ψ_i → α_i² Ψ_i

However, the rotation of the data is important. FA looks for directions of large correlation in the data, so it is not fooled by large-variance noise.

FA Model Invariance and Identifiability

There is degeneracy in the FA model. Since Λ only appears as the outer product ΛΛ^T, the model is invariant to rotations and axis flips of the latent space. We can replace Λ with ΛQ for any unitary matrix Q and the model remains the same: (ΛQ)(ΛQ)^T = Λ(QQ^T)Λ^T = ΛΛ^T.

This means that there is no one best setting of the parameters. An infinite number of parameter settings all achieve the ML score! Such models are called un-identifiable, since two people both fitting ML parameters to identical data will not be guaranteed to identify the same parameters.

Rotational Invariance in PCA

In PCA the rotation of the data is unimportant: we can multiply the data y by any rotation Q without changing anything:

µ → Qµ
Λ → QΛ
Ψ unchanged

However, the scale of the data is important. PCA looks for directions of large variance, so it will chase big noise directions.

Latent Covariance in Factor Analysis and PCA

What if we allow the latent variable z to have a covariance matrix of its own: p(z) = N(z | 0, P)? We can still compute the marginal probability:

p(y | θ) = ∫ p(z) p(y | z, θ) dz = N(y | µ, ΛPΛ^T + Ψ)

We can always absorb P into the loading matrix Λ by diagonalizing it: P = EDE^T and setting Λ' = ΛED^{1/2}. Thus, there is another degeneracy in FA, between P and Λ: we can set P to be the identity, to be diagonal, whatever we want. Traditionally we break this degeneracy by either:

setting the covariance P of the latent variable to be I (FA), or
forcing the columns of Λ to be orthonormal (PCA).
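Both degeneracies can be verified numerically: right-multiplying Λ by an orthogonal Q leaves ΛΛ^T unchanged, and a latent covariance P can be folded into Λ. A sketch with arbitrary matrices (not lecture code):

```python
import numpy as np

rng = np.random.default_rng(3)
p, k = 5, 3                                    # made-up dimensions
Lam = rng.normal(size=(p, k))

# Any orthogonal Q applied on the right leaves the model covariance ΛΛ^T unchanged.
Q, _ = np.linalg.qr(rng.normal(size=(k, k)))   # random orthogonal k x k matrix
print(np.allclose((Lam @ Q) @ (Lam @ Q).T, Lam @ Lam.T))       # True

# A latent covariance P can be absorbed into Λ: with P = E D E^T,
# Λ' = Λ E D^{1/2} satisfies Λ' Λ'^T = Λ P Λ^T.
A = rng.normal(size=(k, k))
P = A @ A.T + k * np.eye(k)                    # arbitrary SPD latent covariance
d, E = np.linalg.eigh(P)
Lam_prime = Lam @ E @ np.diag(np.sqrt(d))
print(np.allclose(Lam_prime @ Lam_prime.T, Lam @ P @ Lam.T))   # True
```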

Mixtures of Dimensionality Reducers

What's the next logical step? Try a model that has two kinds of latent variables: one discrete cluster indicator, and one vector of continuous causes. Such models simultaneously do clustering and, within each cluster, dimensionality reduction. Great idea!

Independent Components Analysis (ICA)

ICA is another continuous latent variable model, like FA, but it has a non-Gaussian and factorized prior on the latent variables.

This is good in situations where most of the factors are very small most of the time and they do not interact with each other. Example: mixtures of speech signals.

The learning problem is the same: find the weights from the factors to the outputs and infer the unknown factor values. In the case of ICA the factors are sometimes called sources, and the learning is sometimes called unmixing.

Mixtures of Factor Analyzers

The simplest version of this is the mixture of factor analyzers:

p(z) = N(z | 0, I)
p(k) = α_k
p(y | z, k, θ) = N(y | µ_k + Λ_k z, Ψ)
p(y | θ) = Σ_k ∫ p(k) p(z) p(y | z, k, θ) dz = Σ_k α_k N(y | µ_k, Λ_k Λ_k^T + Ψ)

which is a constrained mixture of Gaussians. This is like a mixture of linear experts, using a logistic regression gate, with missing inputs.

Fitting procedure? EM, of course! See ftp.cs.toronto.edu/pub/zoubin/tr-96-1.ps.gz

Geometric Intuition

Since the latent variables are assumed to be independent, we are trying to find a linear transformation of the data that recovers these independent causes. Often we use heavy-tailed source priors, e.g. p(z_i) ∝ 1/cosh(z_i). Geometric intuition: finding spikes in histograms.

[Figure: a scatter plot of two-dimensional data (x1, x2) with the learned basis vectors overlaid.]
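To make the mixture-of-factor-analyzers marginal above concrete, here is a sketch of the per-point cluster responsibilities p(k | y_n) for an already-fitted model: each component's Gaussian marginal N(µ_k, Λ_kΛ_k^T + Ψ) is evaluated and normalized. The argument names and the shared-Ψ assumption are illustrative, and this is only the cluster part of the E-step, not the full EM from the cited report.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mfa_responsibilities(Y, alphas, mus, Lams, Psi):
    """Posterior cluster responsibilities under a mixture of factor analyzers.

    Y: (N, p) data; alphas[k]: mixing weights; mus[k]: component means;
    Lams[k]: (p, d) loadings; Psi: shared diagonal noise covariance (p, p).
    Returns an (N, K) matrix whose rows sum to 1.
    """
    K = len(alphas)
    # Unnormalized log responsibilities: log α_k + log N(y | µ_k, Λ_kΛ_k^T + Ψ)
    log_r = np.stack([
        np.log(alphas[k]) + multivariate_normal.logpdf(Y, mus[k], Lams[k] @ Lams[k].T + Psi)
        for k in range(K)
    ], axis=1)
    log_r -= log_r.max(axis=1, keepdims=True)   # stabilize before exponentiating
    R = np.exp(log_r)
    return R / R.sum(axis=1, keepdims=True)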

ICA Model

The simplest form of ICA has as many outputs as sources (square) and no sensor noise on the outputs:

p(z) = Π_k p(z_k)
y = Vz

Learning in this case can be done with gradient descent (plus some covariant tricks to make the updates faster and more stable).

If you keep the square V and use isotropic Gaussian noise on the outputs, there is a simple EM algorithm, derived by Max Welling and Markus Weber.

Much more complex cases have been studied also: non-square, convolutional, time delays in mixing, etc. But for that, we need to know about time series...
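For the square, noiseless case, one common form of the gradient-descent recipe with a "covariant trick" is the natural-gradient update W ← W + η(I − E[tanh(z)z^T])W for the unmixing matrix W = V^{-1}, using the heavy-tailed 1/cosh source prior mentioned earlier. A sketch under those assumptions (variable names and step sizes are my own, not the lecture's):

```python
import numpy as np

def ica_natural_gradient(Y, n_iter=500, lr=0.01, seed=0):
    """Square, noiseless ICA by maximum likelihood with source prior
    p(z_i) ∝ 1/cosh(z_i), so that -d/dz log p(z) = tanh(z).

    Uses the natural-gradient ("covariant") update
        W <- W + lr * (I - E[tanh(z) z^T]) W,   where z = W y.
    Y: (N, d) observed mixtures; returns the unmixing matrix W (d x d).
    """
    rng = np.random.default_rng(seed)
    N, d = Y.shape
    W = np.eye(d) + 0.01 * rng.normal(size=(d, d))   # start near the identity
    for _ in range(n_iter):
        Z = Y @ W.T                                  # current source estimates z = W y
        grad = np.eye(d) - (np.tanh(Z).T @ Z) / N    # I - E[tanh(z) z^T]
        W += lr * grad @ W                           # natural-gradient step
    return W
```

At a stationary point E[tanh(z)z^T] = I, i.e. the recovered sources are decorrelated under the assumed nonlinearity, which is the unmixing condition this update is driving toward.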