
Lecture 7: Linear Gaussian Models
Kevin Murphy, 7 November 2004

Bayes nets with tabular CPDs

We have mostly focused on graphs where all latent nodes are discrete, and all CPDs/potentials are full tables.

[Figure: example Bayes net structures over discrete nodes X1, ..., X6.]

Mixtures of Gaussians

We have also considered discrete latent variables with continuous observed variables, e.g., in a mixture of Gaussians:

P(Z = i) = θ_i
p(X = x | Z = i) = N(x; µ_i, Σ_i)

This can be used for classification (supervised) and clustering / vector quantization (unsupervised).

Linear Gaussian model

We now consider the case where all CPDs are linear-Gaussian:

p(Z = z) = N(z; µ_z, Σ_z)
p(X = x | Z = z) = N(x; µ_x + Λz, Σ)

i.e., child = linear function of parent plus Gaussian noise:

X = ΛZ + noise,   noise ~ N(µ_x, Σ_x)

For the supervised case, this is just linear regression.
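As a concrete illustration of the generative process just described, here is a minimal sketch (not from the lecture; all dimensions and parameter values are arbitrary) of ancestral sampling in a linear Gaussian model: sample the parent Z, then sample the child X as a linear function of Z plus Gaussian noise.

```python
import numpy as np

# Minimal sketch of ancestral sampling in a linear Gaussian model (illustrative values).
rng = np.random.default_rng(0)
k, p = 2, 5                                   # latent and observed dimensions
mu_z, Sigma_z = np.zeros(k), np.eye(k)        # p(z) = N(mu_z, Sigma_z)
mu_x, Sigma_x = np.zeros(p), 0.1 * np.eye(p)  # noise covariance for the child
Lam = rng.standard_normal((p, k))             # linear map from parent to child

z = rng.multivariate_normal(mu_z, Sigma_z)            # sample the parent
x = rng.multivariate_normal(mu_x + Lam @ z, Sigma_x)  # child = linear function of parent + noise
print(z, x)
```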

Factor analysis (Jordan ch 14)

Unsupervised linear regression is called factor analysis:

p(x) = N(x; 0, I)
p(y | x) = N(y; µ + Λx, Ψ)

where Λ is the factor loading matrix and Ψ is diagonal.

[Figure: to generate data, first generate a point within the manifold spanned by the columns λ_1, λ_2 of Λ (centred at µ), then add noise; the coordinates of the point are the components of the latent variable.]

Review: Means, Variances and Covariances

Remember the definition of the mean and covariance of a vector random variable:

E[x] = ∫ x p(x) dx = m
Cov[x] = E[(x - m)(x - m)^T] = ∫ (x - m)(x - m)^T p(x) dx = V

which is the expected value of the outer product of the variable with itself, after subtracting the mean.

Also, the covariance between two variables:

Cov[x, y] = E[(x - m_x)(y - m_y)^T] = ∫∫ (x - m_x)(y - m_y)^T p(x, y) dx dy = C_xy

which is the expected value of the outer product of one variable with another, after subtracting their means. Note: C_xy is not symmetric.

Marginal Data Distribution

Since a marginal Gaussian times a conditional Gaussian is a joint Gaussian, we can compute the marginal density p(y | θ) by integrating out x:

p(y | θ) = ∫_x p(x) p(y | x, θ) dx = N(y; µ, ΛΛ^T + Ψ)

which can be done by completing the square in the exponent. However, since the marginal is Gaussian, we can also just compute its mean and covariance. (Assume the noise n is uncorrelated with the data.)

E[y] = E[µ + Λx + n] = µ + Λ E[x] + E[n] = µ + Λ·0 + 0 = µ
Cov[y] = E[(y - µ)(y - µ)^T] = E[(µ + Λx + n - µ)(µ + Λx + n - µ)^T]
       = E[(Λx + n)(Λx + n)^T] = Λ E[xx^T] Λ^T + E[nn^T] = ΛΛ^T + Ψ

FA = Constrained Covariance Gaussian

Marginal density for factor analysis (y is p-dimensional, x is k-dimensional):

p(y | θ) = N(y; µ, ΛΛ^T + Ψ)

So the effective covariance is the low-rank outer product of two long skinny matrices plus a diagonal matrix:

Cov[y] = ΛΛ^T + Ψ

In other words, factor analysis is just a constrained Gaussian model. (If Ψ were not diagonal then we could model any Gaussian and it would be pointless.)
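The mean and covariance calculation above can be checked empirically. The sketch below (my own illustration, not code from the lecture; all parameter values are arbitrary) samples from the FA generative model and verifies that the sample covariance of y approaches ΛΛ^T + Ψ.

```python
import numpy as np

# Monte Carlo check that the FA marginal covariance is Lambda Lambda^T + Psi.
rng = np.random.default_rng(1)
p, k, N = 4, 2, 200_000
Lam = rng.standard_normal((p, k))
Psi = np.diag(rng.uniform(0.1, 0.5, size=p))          # diagonal noise covariance
mu = rng.standard_normal(p)

X = rng.standard_normal((N, k))                       # x ~ N(0, I)
noise = rng.multivariate_normal(np.zeros(p), Psi, size=N)
Y = mu + X @ Lam.T + noise                            # y = mu + Lambda x + noise

print(np.allclose(np.cov(Y.T), Lam @ Lam.T + Psi, atol=0.05))  # True, up to Monte Carlo error
```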

Probabilistic Principal Component Analysis (PPCA)

In factor analysis, we can write the marginal density explicitly:

p(y | θ) = ∫_x p(x) p(y | x, θ) dx = N(y; µ, ΛΛ^T + Ψ)

The noise Ψ must be restricted for the model to be interesting. (Why?) In factor analysis the restriction is that Ψ is diagonal (axis-aligned).

What if we further restrict Ψ = σ^2 I (i.e., spherical)? We get the probabilistic principal component analysis (PPCA) model, where µ is the mean vector, the columns of Λ are the principal components (usually orthogonal), and σ^2 is the global sensor noise.

PCA: Zero Noise Limit

The traditional PCA model is actually the limit as σ^2 → 0. More precisely, PCA is the limit as the ratio of the noise variance on the output to the prior variance on the latent variables goes to zero. We can achieve this either with zero noise or with infinite-variance priors.

(P)PCA is very useful for dimensionality reduction, e.g., for visualizing high-dimensional data.

Difference between FA and PPCA

Recall the intuition that Gaussians are hyperellipsoids:
mean = centre of the football,
eigenvectors of the covariance matrix = axes of the football,
eigenvalues = lengths of the axes.

In FA, our football (the noise ellipsoid Ψ) is an axis-aligned cigar; in PPCA it is a sphere of radius σ^2.

In FA, the variance of the noise is independent for each dimension; in PPCA the noise is assumed to be the same in every direction. PCA looks for directions of large variance; conversely, FA looks for directions of large correlation.

Model Invariance and Identifiability

There is a degeneracy in the FA model. Since Λ only appears as the outer product ΛΛ^T, the model is invariant to rotations and axis flips of the latent space. We can replace Λ with ΛQ for any unitary matrix Q and the model remains the same:

(ΛQ)(ΛQ)^T = Λ(QQ^T)Λ^T = ΛΛ^T

This means that there is no single best setting of the parameters: an infinite number of parameter settings all achieve the ML score. Such models are called unidentifiable, since two people both fitting ML parameters to identical data are not guaranteed to identify the same parameters.
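The rotation invariance behind this unidentifiability is easy to confirm numerically; the following sketch (illustrative only, not from the lecture) draws a random orthogonal Q and checks that the marginal covariance is unchanged.

```python
import numpy as np

# (Lambda Q)(Lambda Q)^T = Lambda Lambda^T for any orthogonal Q.
rng = np.random.default_rng(2)
p, k = 5, 3
Lam = rng.standard_normal((p, k))
Q, _ = np.linalg.qr(rng.standard_normal((k, k)))   # a random orthogonal matrix
print(np.allclose((Lam @ Q) @ (Lam @ Q).T, Lam @ Lam.T))   # True
```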

Latent Covariance in Factor Analysis and PCA

What if we allow the latent variable x to have a covariance matrix of its own: p(x) = N(x; 0, P)? We can still compute the marginal probability:

p(y | θ) = ∫_x p(x) p(y | x, θ) dx = N(y; µ, ΛPΛ^T + Ψ)

We can always absorb P into the loading matrix Λ by diagonalizing it: P = EDE^T and setting Λ' = ΛED^{1/2}. Thus there is another degeneracy in FA, between P and Λ: we can set P to be the identity, to be diagonal, whatever we want.

Traditionally we break this degeneracy by either:
- setting the covariance P of the latent variable to be I (FA), or
- forcing the columns of Λ to be orthonormal (PCA).

Model FA joint distribution

p(x) = N(x; 0, I)
p(y | x, θ) = N(y; µ + Λx, Ψ)

Hence the joint distribution of x and y is

p([x; y]) = N( [x; y]; [0; µ], [ I, Λ^T; Λ, ΛΛ^T + Ψ ] )

where the corner elements Λ^T and Λ come from Cov[x, y]:

Cov[x, y] = E[(x - 0)(y - µ)^T] = E[x(µ + Λx + n - µ)^T] = E[x(Λx + n)^T] = Λ^T

(Assume the noise n is uncorrelated with the data and the latent variables.)

Review: Gaussian Conditioning (Jordan ch 13)

Remember the formulas for conditional Gaussian distributions:

p([x_1; x_2]) = N( [x_1; x_2]; [µ_1; µ_2], [ Σ_11, Σ_12; Σ_21, Σ_22 ] )
p(x_1 | x_2) = N(x_1; m_{1|2}, V_{1|2})
m_{1|2} = µ_1 + Σ_12 Σ_22^{-1} (x_2 - µ_2)
V_{1|2} = Σ_11 - Σ_12 Σ_22^{-1} Σ_21

Inference in Factor Analysis

Apply the Gaussian conditioning formulas to the joint distribution we derived above, where

Σ_11 = I,   Σ_12 = Λ^T,   Σ_22 = ΛΛ^T + Ψ

This gives

V_{1|2} = Σ_11 - Σ_12 Σ_22^{-1} Σ_21 = I - Λ^T (ΛΛ^T + Ψ)^{-1} Λ
m_{1|2} = µ_1 + Σ_12 Σ_22^{-1} (x_2 - µ_2) = Λ^T (ΛΛ^T + Ψ)^{-1} (y - µ)
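A small sketch of this conditioning step (my own illustration, with arbitrary parameters): build the joint covariance [ I, Λ^T; Λ, ΛΛ^T + Ψ ] and apply the conditioning formulas to obtain the posterior p(x | y).

```python
import numpy as np

# FA inference by Gaussian conditioning on the joint of (x, y).
rng = np.random.default_rng(3)
p, k = 4, 2
Lam = rng.standard_normal((p, k))
Psi = np.diag(rng.uniform(0.1, 0.5, size=p))
mu = rng.standard_normal(p)
y = rng.standard_normal(p)                      # an arbitrary observation

S11, S12 = np.eye(k), Lam.T                     # blocks of the joint covariance
S21, S22 = Lam, Lam @ Lam.T + Psi
S22_inv = np.linalg.inv(S22)

m = S12 @ S22_inv @ (y - mu)                    # posterior mean Lambda^T (LL^T + Psi)^-1 (y - mu)
V = S11 - S12 @ S22_inv @ S21                   # posterior covariance
print(m, V)
```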

Inference in Factor Analysis (continued)

The previous formula requires inverting a matrix of size p × p (the dimension of y):

p(x | y) = N(x; m, V)
V = I - Λ^T (ΛΛ^T + Ψ)^{-1} Λ
m = Λ^T (ΛΛ^T + Ψ)^{-1} (y - µ)

We now use a good trick for inverting a matrix when it can be decomposed into the sum of an easily inverted matrix (D) and a low-rank outer product. It is called the matrix inversion lemma:

(D - AB^{-1}A^T)^{-1} = D^{-1} + D^{-1} A (B - A^T D^{-1} A)^{-1} A^T D^{-1}

Applying the matrix inversion lemma, we get:

p(x | y) = N(x; m, V)
V = (I + Λ^T Ψ^{-1} Λ)^{-1}
m = V Λ^T Ψ^{-1} (y - µ)

which inverts a matrix of size k × k (the dimension of x).

Inference is a Linear Projection

The posterior is

p(x | y) = N(x; m, V)
V = I - Λ^T (ΛΛ^T + Ψ)^{-1} Λ = (I + Λ^T Ψ^{-1} Λ)^{-1}
m = Λ^T (ΛΛ^T + Ψ)^{-1} (y - µ) = V Λ^T Ψ^{-1} (y - µ)

The posterior covariance does not depend on the observed data y! (It can be precomputed.) Also, computing the posterior mean is just a linear operation: m = β(y - µ), where β (the "hat matrix") can be precomputed.

Incomplete Data Log Likelihood Function

Using the marginal density of p(y):

ℓ(θ; D) = -(N/2) log|ΛΛ^T + Ψ| - (1/2) Σ_n (y_n - µ)^T (ΛΛ^T + Ψ)^{-1} (y_n - µ)
        = -(N/2) log|V| - (1/2) trace[ V^{-1} Σ_n (y_n - µ)(y_n - µ)^T ]
        = -(N/2) log|V| - (N/2) trace[ V^{-1} S ]

where V here denotes the model covariance ΛΛ^T + Ψ and S is the sample data covariance. In other words, we are trying to make the constrained model covariance as close as possible to the observed covariance, where "close" means the trace of the ratio. Thus the sufficient statistics are the same as for a Gaussian: the mean Σ_n y_n and the covariance Σ_n (y_n - µ)(y_n - µ)^T.

EM for Factor Analysis

The parameters are coupled nonlinearly in the log-likelihood:

ℓ(θ; D) = -(N/2) log|ΛΛ^T + Ψ| - (N/2) trace[ (ΛΛ^T + Ψ)^{-1} S ]

We could use conjugate gradient methods, but EM is simpler. For the E-step we do inference (the linear projection above); for the M-step we do linear regression on the expected values. More precisely, we maximize the expected complete log likelihood:

E-step: q_n^{t+1} = p(x_n | y_n, θ^t) = N(x_n; m_n, V_n)
M-step: Λ^{t+1} = argmax_Λ Σ_n ⟨ℓ_c(x_n, y_n)⟩_{q_n^{t+1}}
        Ψ^{t+1} = argmax_Ψ Σ_n ⟨ℓ_c(x_n, y_n)⟩_{q_n^{t+1}}
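The equivalence of the two forms of the posterior derived above, one inverting a p × p matrix and one (via the matrix inversion lemma) a k × k matrix, can be verified numerically. A sketch, with arbitrary parameter values:

```python
import numpy as np

# Check that the two expressions for the FA posterior agree.
rng = np.random.default_rng(4)
p, k = 6, 2
Lam = rng.standard_normal((p, k))
Psi = np.diag(rng.uniform(0.2, 1.0, size=p))
y_minus_mu = rng.standard_normal(p)

big = np.linalg.inv(Lam @ Lam.T + Psi)                       # p x p inverse
V_big = np.eye(k) - Lam.T @ big @ Lam
m_big = Lam.T @ big @ y_minus_mu

Psi_inv = np.linalg.inv(Psi)
V_small = np.linalg.inv(np.eye(k) + Lam.T @ Psi_inv @ Lam)   # k x k inverse
m_small = V_small @ Lam.T @ Psi_inv @ y_minus_mu

print(np.allclose(V_big, V_small), np.allclose(m_big, m_small))  # True True
```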

Complete Data Likelihood

We know the optimal µ is the data mean. Assume the mean has been subtracted off y_n from now on. The complete log likelihood (ignoring the mean) is

ℓ_c(Λ, Ψ) = Σ_n log p(x_n, y_n) = Σ_n [ log p(x_n) + log p(y_n | x_n) ]
          = -(N/2) log|Ψ| - (1/2) Σ_n x_n^T x_n - (1/2) Σ_n (y_n - Λx_n)^T Ψ^{-1} (y_n - Λx_n)
          = -(N/2) log|Ψ| - (N/2) trace[ S Ψ^{-1} ] + const

where S = (1/N) Σ_n (y_n - Λx_n)(y_n - Λx_n)^T and the constant collects the term in Σ_n x_n^T x_n, which does not involve Λ or Ψ.

M-step: Optimize Parameters (Jordan ch 13)

Take the derivatives of the complete log likelihood with respect to the parameters. We will need these identities:

∂ log|A| / ∂A = (A^{-1})^T
∂ trace[B^T A] / ∂A = B
∂ trace[B A^T C A] / ∂A = 2CAB   (for symmetric B and C)

Hence

∂ℓ_c(Λ, Ψ) / ∂Λ = Ψ^{-1} Σ_n y_n x_n^T - Ψ^{-1} Λ Σ_n x_n x_n^T
∂ℓ_c(Λ, Ψ) / ∂Ψ^{-1} = +(N/2) Ψ - (N/2) S

Take the expectation of the gradient with respect to q^{t+1} from the E-step:

⟨∂ℓ_c / ∂Λ⟩ = Ψ^{-1} Σ_n y_n m_n^T - Ψ^{-1} Λ Σ_n V_n
⟨∂ℓ_c / ∂Ψ^{-1}⟩ = +(N/2) Ψ - (N/2) ⟨S⟩

where

m_n = E[X_n | y_n]
V_n = Var(X_n | y_n) + E[X_n | y_n] E[X_n | y_n]^T
⟨S⟩ = (1/N) Σ_n [ y_n y_n^T - y_n m_n^T Λ^T - Λ m_n y_n^T + Λ V_n Λ^T ]

Finally, set the derivatives to zero to solve for the optimal parameters:

Λ^{t+1} = ( Σ_n y_n m_n^T ) ( Σ_n V_n )^{-1}
Ψ^{t+1} = (1/N) diag[ Σ_n ( y_n y_n^T - Λ^{t+1} m_n y_n^T ) ]
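Putting the E-step and these M-step updates together gives the EM iteration summarized on the next slide. The sketch below is my own compact implementation of those updates (not code from the lecture); variable names and the synthetic data are illustrative.

```python
import numpy as np

def fa_em(Y, k, n_iters=100, seed=0):
    """EM for factor analysis on an N x p data matrix Y with k latent factors."""
    rng = np.random.default_rng(seed)
    N, p = Y.shape
    mu = Y.mean(axis=0)                       # the optimal mu is the sample mean
    Yc = Y - mu                               # work with centred data from here on
    Lam = rng.standard_normal((p, k))
    Psi = np.diag(np.var(Yc, axis=0))
    for _ in range(n_iters):
        # E-step: p(x_n | y_n) = N(m_n, V) with shared V = (I + Lam^T Psi^-1 Lam)^-1
        Psi_inv = np.linalg.inv(Psi)
        V = np.linalg.inv(np.eye(k) + Lam.T @ Psi_inv @ Lam)
        M = Yc @ Psi_inv @ Lam @ V            # row n is m_n^T
        # Expected sufficient statistics
        sum_ym = Yc.T @ M                     # sum_n y_n m_n^T        (p x k)
        sum_xx = N * V + M.T @ M              # sum_n E[x_n x_n^T]     (k x k)
        # M-step
        Lam = sum_ym @ np.linalg.inv(sum_xx)
        Psi = np.diag(np.diag(Yc.T @ Yc - Lam @ sum_ym.T)) / N
    return mu, Lam, Psi

# Usage on synthetic data: fit a 2-factor model to 6-dimensional observations.
rng = np.random.default_rng(1)
Y = rng.standard_normal((500, 2)) @ rng.standard_normal((2, 6)) + 0.1 * rng.standard_normal((500, 6))
mu, Lam, Psi = fa_em(Y, k=2)
print(np.round(np.diag(Psi), 3))   # small values, reflecting the low observation noise
```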

Final Algorithm: EM for Factor Analysis

First, set µ equal to the sample mean (1/N) Σ_n y_n, and subtract this mean from all the data. Now run the following iterations:

E-step: q_n^{t+1} = p(x_n | y_n, θ^t) = N(x_n; m_n, V)
        V = (I + Λ^T Ψ^{-1} Λ)^{-1}
        m_n = V Λ^T Ψ^{-1} (y_n - µ)

M-step: Λ^{t+1} = ( Σ_n y_n m_n^T ) ( Σ_n V_n )^{-1}
        Ψ^{t+1} = (1/N) diag[ Σ_n ( y_n y_n^T - Λ^{t+1} m_n y_n^T ) ]

where V_n = V + m_n m_n^T as on the previous slide.

Learning for PPCA

For FA the parameters are coupled in a way that makes it impossible to solve for the ML parameters directly. We must use EM or other nonlinear optimization techniques.

But for (P)PCA, the ML parameters can be solved for directly: the k-th column of Λ is the k-th largest eigenvalue of the sample covariance S times the associated eigenvector, and the global sensor noise σ^2 is the average of the eigenvalues smaller than the k-th one.

This technique is also good for initializing FA. Note: EM can be used as a fast way to compute eigenvectors!

Independent Components Analysis (ICA)

ICA is similar to FA, except it assumes the latent source has a non-Gaussian density. Hence ICA can extract higher-order moments (not just second-order ones). It is commonly used to solve blind source separation (the cocktail party problem).

Independent Factor Analysis (IFA) is an approximation to ICA in which we model the source using a mixture of Gaussians.

[Figure: graphical models for FA, mixture of FA, and IFA.]

Hidden Markov Models

You can think of an HMM as:
- a Markov chain with stochastic measurements, or
- a mixture of Gaussians with the latent variable changing over time (the "lily pad" model).

[Figure: HMM graphical model, a chain x_1 → x_2 → ... → x_T with an observation y_t attached to each state x_t.]
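For comparison with the EM procedure, here is a sketch of the closed-form (P)PCA fit described above under "Learning for PPCA". Note that I follow the exact ML form due to Tipping and Bishop, which is slightly more precise than the slide's shorthand: σ^2 is the average of the discarded eigenvalues and each retained eigenvector is scaled by sqrt(λ_j - σ^2).

```python
import numpy as np

def ppca_closed_form(Y, k):
    """Closed-form ML fit of PPCA (Tipping & Bishop form) to an N x p data matrix Y."""
    mu = Y.mean(axis=0)
    S = np.cov((Y - mu).T)                       # sample covariance (p x p)
    evals, evecs = np.linalg.eigh(S)             # ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]   # sort into descending order
    sigma2 = evals[k:].mean()                    # average of the discarded eigenvalues
    Lam = evecs[:, :k] * np.sqrt(np.maximum(evals[:k] - sigma2, 0.0))
    return mu, Lam, sigma2

# Usage: this also makes a good initializer for the EM algorithm above.
rng = np.random.default_rng(8)
Y = rng.standard_normal((1000, 2)) @ rng.standard_normal((2, 5)) + 0.2 * rng.standard_normal((1000, 5))
mu, Lam, sigma2 = ppca_closed_form(Y, k=2)
print(sigma2)   # close to the true noise variance 0.04
```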

State space models (SSMs)

An SSM is like an HMM, except the hidden state is continuous-valued. In general, we can write

x_t = f(x_{t-1}) + w_t
y_t = g(x_t) + v_t

where f is the dynamical (process) model and g is the observation model. w_t is the process noise, embedded inside P(X_t | X_{t-1}); v_t is the observation noise, embedded inside P(Y_t | X_t).

For a controlled Markov chain, we condition all quantities on the observed input signals u_t.

Linear dynamical systems (LDS) (Jordan ch 15)

An LDS is a special case of an SSM where f and g are linear functions and the noise terms are Gaussian:

x_t = A x_{t-1} + w_t,   w_t ~ N(0, Q)
y_t = B x_t + v_t,       v_t ~ N(0, R)

An LDS is like factor analysis through time.

Kalman filtering and smoothing: online vs offline inference

The Kalman filter is a way to perform exact online inference (sequential Bayesian updating) in an LDS. It is the Gaussian analog of the forwards algorithm for HMMs:

P(X_t = i | y_{1:t}) = α_t(i) ∝ p(y_t | X_t = i) Σ_j P(X_t = i | X_{t-1} = j) α_{t-1}(j)

The Rauch-Tung-Striebel smoother is a way to perform exact offline inference in an LDS. It is the Gaussian analog of the forwards-backwards algorithm:

P(X_t = i | y_{1:T}) ∝ α_t(i) β_t(i)
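A minimal sketch of simulating such an LDS (illustrative only; A, B, Q, R are arbitrary choices, with the state holding a position and velocity and only the position observed):

```python
import numpy as np

# Sample a trajectory from x_t = A x_{t-1} + w_t,  y_t = B x_t + v_t.
rng = np.random.default_rng(5)
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])                 # position/velocity dynamics
B = np.array([[1.0, 0.0]])                 # observe the position only
Q, R = 0.01 * np.eye(2), 0.25 * np.eye(1)  # process and observation noise covariances

T = 50
x = np.zeros(2)
xs, ys = [], []
for t in range(T):
    x = A @ x + rng.multivariate_normal(np.zeros(2), Q)   # hidden state evolves linearly
    y = B @ x + rng.multivariate_normal(np.zeros(1), R)   # noisy linear observation
    xs.append(x)
    ys.append(y)
```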

LDS for 2D tracking

Dynamics: new position = old position + Δ × velocity + noise (constant velocity model, Gaussian acceleration):

[ x^1_t; x^2_t; ẋ^1_t; ẋ^2_t ] = [ 1 0 Δ 0; 0 1 0 Δ; 0 0 1 0; 0 0 0 1 ] [ x^1_{t-1}; x^2_{t-1}; ẋ^1_{t-1}; ẋ^2_{t-1} ] + noise

Observation: project out the first two components (we observe the Cartesian position of the object - linear!):

[ y^1_t; y^2_t ] = [ 1 0 0 0; 0 1 0 0 ] [ x^1_t; x^2_t; ẋ^1_t; ẋ^2_t ] + noise

[Figure: (a) 2D tracking with Kalman filtering and (b) with Kalman smoothing; each panel plots the true, observed, and estimated trajectories in the X-Y plane.]

Kalman filtering in the brain?

[Figure: (a) the world: a hidden internal state r(t) with temporal dynamics (matrix V) is converted to the visible state, an image I(t), via a mapping U; (b) the estimator: an internal model of the world whose Kalman filter predicts the dynamics of the hidden state r and corrects the estimate using the observed images.]

Kalman filtering derivation

Since all CPDs are linear-Gaussian, the system defines a large multivariate Gaussian, so all marginals are Gaussian. Hence we can represent the belief state P(X_t | y_{1:t}) as a Gaussian with mean x̂_{t|t} and covariance P_{t|t}. It is common to work instead with the inverse covariance (precision) matrix P_{t|t}^{-1}; this is called information form.

Kalman filtering is a recursive procedure to update the belief state:

Predict step: compute P(X_{t+1} | y_{1:t}) from the prior belief P(X_t | y_{1:t}) and the dynamical model P(X_{t+1} | X_t).
Update step: compute the new belief P(X_{t+1} | y_{1:t+1}) from the prediction P(X_{t+1} | y_{1:t}), the observation y_{t+1}, and the observation model P(Y_{t+1} | X_{t+1}).
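In code, the constant-velocity model above corresponds to the following system matrices (a sketch; dt is my notation for the time step Δ). These can be fed directly into a Kalman filter such as the one sketched after the next section.

```python
import numpy as np

dt = 1.0   # time step (Delta on the slide); an arbitrary illustrative value
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)   # state = (x1, x2, xdot1, xdot2)
C = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)    # observe the Cartesian position only
```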

Predict step

Dynamical model: x_{t+1} = A x_t + G w_t, with w_t ~ N(0, Q).

One-step-ahead prediction of the state:

x̂_{t+1|t} = E[X_{t+1} | y_{1:t}] = A x̂_{t|t}
P_{t+1|t} = E[(X_{t+1} - x̂_{t+1|t})(X_{t+1} - x̂_{t+1|t})^T | y_{1:t}] = A P_{t|t} A^T + G Q G^T

Observation model: y_t = C x_t + v_t, with v_t ~ N(0, R).

One-step-ahead prediction of the observation:

ŷ_{t+1|t} = E[Y_{t+1} | y_{1:t}] = E[C X_{t+1} + v_{t+1} | y_{1:t}] = C x̂_{t+1|t}
E[(Y_{t+1} - ŷ_{t+1|t})(Y_{t+1} - ŷ_{t+1|t})^T | y_{1:t}] = C P_{t+1|t} C^T + R
E[(Y_{t+1} - ŷ_{t+1|t})(X_{t+1} - x̂_{t+1|t})^T | y_{1:t}] = C P_{t+1|t}

Update step

Summarizing the results from the previous slide, we have P(X_{t+1}, Y_{t+1} | y_{1:t}) = N(m_{t+1}, V_{t+1}), where

m_{t+1} = [ x̂_{t+1|t}; C x̂_{t+1|t} ],   V_{t+1} = [ P_{t+1|t}, P_{t+1|t} C^T; C P_{t+1|t}, C P_{t+1|t} C^T + R ]

Remember the formulas for conditional Gaussian distributions:

p([x_1; x_2]) = N( [x_1; x_2]; [µ_1; µ_2], [ Σ_11, Σ_12; Σ_21, Σ_22 ] )
p(x_1 | x_2) = N(x_1; m_{1|2}, V_{1|2})
m_{1|2} = µ_1 + Σ_12 Σ_22^{-1} (x_2 - µ_2)
V_{1|2} = Σ_11 - Σ_12 Σ_22^{-1} Σ_21

Kalman filter equations

Using the conditioning formulas, we get:

x̂_{t+1|t} = A x̂_{t|t}
P_{t+1|t} = A P_{t|t} A^T + G Q G^T
x̂_{t+1|t+1} = x̂_{t+1|t} + K_{t+1} (y_{t+1} - C x̂_{t+1|t})
P_{t+1|t+1} = P_{t+1|t} - K_{t+1} C P_{t+1|t}

where K_{t+1} is the Kalman gain matrix:

K_{t+1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{-1}

K_t can be precomputed (since it is independent of the data). It rapidly converges to a constant matrix (Riccati equation).

Example of a KF in 1D (Russell & Norvig 2e, p. 554)

Consider noisy observations of a 1D particle doing a random walk:

x_t = x_{t-1} + N(0, σ_x),   z_t = x_t + N(0, σ_z)

Hence

P_{t+1|t} = A P_{t|t} A^T + G Q G^T = σ_t + σ_x
K_{t+1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{-1} = (σ_t + σ_x) / (σ_t + σ_x + σ_z)
x̂_{t+1|t+1} = x̂_{t+1|t} + K_{t+1} (z_{t+1} - C x̂_{t+1|t}) = [ (σ_t + σ_x) z_{t+1} + σ_z µ_t ] / (σ_t + σ_x + σ_z)
P_{t+1|t+1} = P_{t+1|t} - K_{t+1} C P_{t+1|t} = (σ_t + σ_x) σ_z / (σ_t + σ_x + σ_z)

[Figure: the prior P(x_0), the prediction P(x_1), and the posterior P(x_1 | z_1 = 2.5), plotted against x position.]
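The predict and update equations above translate directly into a few lines of code. A sketch (my own, not the lecture's; G is taken to be the identity so the process noise enters with covariance Q):

```python
import numpy as np

def kf_step(xhat, P, y, A, C, Q, R):
    """One Kalman filter recursion: predict with (A, Q), then update with observation y."""
    # Predict step
    xhat_pred = A @ xhat
    P_pred = A @ P @ A.T + Q
    # Update step
    S = C @ P_pred @ C.T + R                        # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)             # Kalman gain
    xhat_new = xhat_pred + K @ (y - C @ xhat_pred)  # correct the prediction with the innovation
    P_new = P_pred - K @ C @ P_pred
    return xhat_new, P_new
```

With scalar A = C = G = 1 this reduces to the 1D random-walk example above.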

KF: intuition

The KF update of the mean is

x̂_{t+1|t+1} = A x̂_{t|t} + K_{t+1} (y_{t+1} - C A x̂_{t|t}) = [ (σ_t + σ_x) z_{t+1} + σ_z µ_t ] / (σ_t + σ_x + σ_z)

The term (y_{t+1} - C x̂_{t+1|t}) is called the innovation.

The new belief is a convex combination of the updates from the prior and the observation, weighted by the Kalman gain matrix:

K_{t+1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{-1}

If the observation is unreliable, σ_z is large, so K_{t+1} is small and we pay more attention to the prediction. If the old prior is unreliable (large σ_t) or the process is very unpredictable (large σ_x), we pay more attention to the observation.

KF, RLS and LMS

The KF update of the mean is

x̂_{t+1|t+1} = A x̂_{t|t} + K_{t+1} (y_{t+1} - C A x̂_{t|t})

Consider the special case where the hidden state is a constant, x_t = θ, but the observation matrix C is a time-varying vector, C_t = x_t^T (reusing x_t here to denote the t-th input vector). Hence y_t = x_t^T θ + v_t, as in linear regression. We can estimate θ recursively using the Kalman filter:

θ̂_{t+1} = θ̂_t + P_{t+1} R^{-1} (y_{t+1} - x_t^T θ̂_t) x_t

This is called the recursive least squares (RLS) algorithm.

We can approximate P_{t+1} R^{-1} ≈ η_{t+1} by a scalar constant; this is called the least mean squares (LMS) algorithm. We can adapt η_t online using stochastic approximation theory.
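A sketch of RLS viewed as this degenerate Kalman filter (identity dynamics, no process noise); sigma2 plays the role of the observation noise R and the large initial P encodes a vague prior on θ. These names and values are my own illustrative choices.

```python
import numpy as np

def rls(X, y, sigma2=1.0, p0=1e3):
    """Recursive least squares for y_t = x_t^T theta + noise, one observation at a time."""
    n, d = X.shape
    theta = np.zeros(d)
    P = p0 * np.eye(d)                            # large initial uncertainty about theta
    for t in range(n):
        x_t = X[t]
        K = P @ x_t / (x_t @ P @ x_t + sigma2)    # Kalman gain for a scalar observation
        theta = theta + K * (y[t] - x_t @ theta)  # innovation-weighted update
        P = P - np.outer(K, x_t) @ P              # posterior covariance of theta
    return theta

# Usage: recovers the least-squares solution on a toy regression problem.
rng = np.random.default_rng(6)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(200)
print(rls(X, y))   # close to [1.0, -2.0, 0.5]
```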
