Probabilistic Graphical Models
Brown University CSCI 2950-P, Spring 2013
Prof. Erik Sudderth
Lecture 13: Learning in Gaussian Graphical Models, Non-Gaussian Inference, Monte Carlo Methods
Some figures courtesy Michael Jordan's draft textbook, An Introduction to Probabilistic Graphical Models
Undirected Gaussian Graphical Models

N(x | 0, \Sigma) = \frac{1}{Z} \prod_{(s,t) \in E} \psi_{st}(x_s, x_t), \quad Z = ((2\pi)^N |\Sigma|)^{1/2}, \quad J = \Sigma^{-1}

Undirected Markov properties correspond to sparse inverse covariance (precision) matrices J
For connected Gaussian MRFs, the covariance \Sigma is usually dense (all pairs correlated)
Number of parameters, and thus learning complexity, reduced from O(N^2) to O(N)
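The sparsity/density contrast above is easy to see numerically. A minimal sketch (not from the lecture, all parameter values illustrative) for a chain-structured Gaussian MRF:

```python
# A chain-structured Gaussian MRF has a tridiagonal precision matrix J,
# yet its covariance J^{-1} is dense: every pair of nodes is correlated.
import numpy as np

N = 5
J = np.eye(N) * 2.0                      # node potentials (illustrative values)
for s in range(N - 1):                   # edge potentials along the chain
    J[s, s + 1] = J[s + 1, s] = -0.8

Sigma = np.linalg.inv(J)                 # covariance of the Gaussian MRF

# J is sparse: nonzeros only on the tridiagonal band (5 + 2*4 = 13 entries).
print(np.count_nonzero(J))
# Sigma is dense: all 25 entries are nonzero.
print(np.count_nonzero(np.abs(Sigma) > 1e-12))
```

The chain has O(N) parameters in J, while the implied Sigma looks like a fully parameterized O(N^2) covariance.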
Duality in Gaussian Distributions

Marginals: the marginal precision is a Schur complement, \hat{J}_{11} = J_{11} - J_{12} J_{22}^{-1} J_{21}
Conditionals: the conditional precision is simply J_{11}, with conditional mean linear in x_2

Ø Moment parameters: trivial marginalization, conditioning requires computation
Ø Canonical parameters: trivial conditioning, marginalization requires computation
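A small numerical check of this duality (a sketch with an arbitrary random covariance, not lecture code): marginalizing in moment form is just taking a submatrix, while in canonical form it requires the Schur complement, and the two answers agree.

```python
# Verify: inv(J11 - J12 inv(J22) J21) equals the covariance submatrix Sigma_11.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 4 * np.eye(4)          # a random valid covariance matrix
J = np.linalg.inv(Sigma)                 # canonical (precision) parameters

# Partition: block 1 = first two variables, block 2 = the remaining two.
S11 = Sigma[:2, :2]
J11, J12, J21, J22 = J[:2, :2], J[:2, 2:], J[2:, :2], J[2:, 2:]

# Moment form: marginal covariance is trivially the submatrix S11.
# Canonical form: marginal precision needs the Schur complement of J22.
J_marg = J11 - J12 @ np.linalg.inv(J22) @ J21
print(np.allclose(np.linalg.inv(J_marg), S11))   # True
```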
Linear Gaussian Systems

p(x) = N(x | \mu, \Lambda^{-1}), \quad p(y | x) = N(y | Ax + b, L^{-1})

Marginal Likelihood: p(y) = N(y | A\mu + b, L^{-1} + A \Lambda^{-1} A^T)
Posterior Distribution: p(x | y) = N(x | \hat{\Sigma}(A^T L (y - b) + \Lambda \mu), \hat{\Sigma}), \quad \hat{\Sigma} = (\Lambda + A^T L A)^{-1}
Directed Gaussian Graphical Models

N(x | 0, \Sigma) = \prod_{s \in V} N(x_s | A_s x_{\Gamma(s)}, R_s)

Sequence of locally normalized conditional distributions of each D-dimensional node, with regression matrices A_s and noise covariances R_s.

Linear state space model is a widely used special case:
x_{t+1} \sim N(A_t x_t, Q_t), \quad y_t \sim N(C_t x_t, R_t)

Dimensionality reduction: state smaller than observed vectors
Rich temporal dynamics: state larger than observed vectors
Probabilistic PCA & Factor Analysis

Both Models: data is a linear function of low-dimensional latent coordinates, plus Gaussian noise:

p(z_i) = N(z_i | 0, I)
p(x_i | z_i, \theta) = N(x_i | W z_i + \mu, \Psi)
p(x_i | \theta) = N(x_i | \mu, W W^T + \Psi)    (low-rank covariance parameterization)

Factor analysis: \Psi is a general diagonal matrix
Probabilistic PCA: \Psi is a multiple of the identity matrix, \Psi = \sigma^2 I

[Figure: latent prior p(z), linear mapping through W, and resulting marginal p(x) in two dimensions. C. Bishop, Pattern Recognition & Machine Learning]
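The low-rank covariance parameterization can be checked directly. A sketch (illustrative dimensions and parameters, not from the lecture): because W W^T has rank d, the PPCA marginal covariance has only d "structured" eigenvalues, with the remaining D - d pinned at the noise floor.

```python
# PPCA marginal covariance W W^T + sigma^2 I: low-rank plus isotropic noise.
import numpy as np

D, d, sigma2 = 10, 2, 0.5
rng = np.random.default_rng(1)
W = rng.standard_normal((D, d))          # illustrative factor loadings

cov = W @ W.T + sigma2 * np.eye(D)       # marginal covariance of x

# W W^T has rank d, so the D - d smallest eigenvalues of cov all sit
# exactly at the noise floor sigma^2.
evals = np.linalg.eigvalsh(cov)          # ascending order
print(evals[:D - d])                     # eight copies of 0.5 (up to fp error)
```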
Expectation Maximization (EM)

\ln p(x | \theta) \geq \int q(z) \ln p(x, z | \theta) dz - \int q(z) \ln q(z) dz \triangleq L(q, \theta)

Initialization: randomly select starting parameters \theta^{(0)}
E-Step: given parameters, find posterior of hidden data
q^{(t)} = \arg\max_q L(q, \theta^{(t-1)})
M-Step: given posterior distributions, find likely parameters
\theta^{(t)} = \arg\max_\theta L(q^{(t)}, \theta)
Iteration: alternate E-step & M-step until convergence
EM: Expectation Step

\ln p(x | \theta) \geq \int q(z) \ln p(x, z | \theta) dz - \int q(z) \ln q(z) dz \triangleq L(q, \theta)

q^{(t)} = \arg\max_q L(q, \theta^{(t-1)})

General solution, for any probabilistic model:
q^{(t)}(z) = p(z | x, \theta^{(t-1)}), the posterior distribution given current parameters

For factor analysis and probabilistic PCA these are Gaussian:
p(z | x, \theta) = \prod_{i=1}^N p(z_i | x_i, \theta), \quad \theta = \{W, \mu, \Psi\}
p(z_i | x_i, W, \mu, \Psi) = N(z_i | \hat{\Sigma} W^T \Psi^{-1} (x_i - \mu), \hat{\Sigma}), \quad \hat{\Sigma}^{-1} = I + W^T \Psi^{-1} W
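The E-step posterior above is a few lines of linear algebra. A minimal sketch for the PPCA case \Psi = \sigma^2 I (all parameter values illustrative, not from the lecture):

```python
# PPCA E-step: Gaussian posterior over latent coordinates z_i given x_i.
import numpy as np

D, d, sigma2 = 5, 2, 0.1
rng = np.random.default_rng(2)
W = rng.standard_normal((D, d))          # illustrative parameters
mu = rng.standard_normal(D)
x = rng.standard_normal(D)               # one observation

Psi_inv = np.eye(D) / sigma2             # Psi = sigma^2 I for PPCA
Sigma_hat = np.linalg.inv(np.eye(d) + W.T @ Psi_inv @ W)   # posterior cov
m_hat = Sigma_hat @ W.T @ Psi_inv @ (x - mu)               # posterior mean

# The posterior covariance is "shrunk" relative to the prior N(0, I):
print(np.linalg.eigvalsh(Sigma_hat))     # all eigenvalues strictly below 1
```

The shrinkage visible in the eigenvalues is exactly the bias towards the prior mean discussed on the next slide.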
PCA versus Probabilistic PCA

p(z_i | x_i, W, \mu, \Psi) = N(z_i | \hat{\Sigma} W^T \Psi^{-1} (x_i - \mu), \hat{\Sigma}), \quad \hat{\Sigma}^{-1} = I + W^T \Psi^{-1} W

[Figure: standard PCA (orthogonal projection) versus probabilistic PCA (projections shrunk towards the mean).]

Maximum likelihood estimates of the probabilistic PCA parameters recover the classic PCA eigenvector solution
For classical PCA, the optimal embedding is an orthogonal projection
For PPCA, the latent coordinates are biased (shrunk) towards the prior mean of zero
EM: Maximization Step

\ln p(x | \theta) \geq \int q(z) \ln p(x, z | \theta) dz - \int q(z) \ln q(z) dz \triangleq L(q, \theta)

\theta^{(t)} = \arg\max_\theta L(q^{(t)}, \theta) = \arg\max_\theta \int q(z) \ln p(x, z | \theta) dz

Unlike the E-step, there is no simplified general solution. For factor analysis and probabilistic PCA, these reduce to weighted linear regression problems. With \Psi = \sigma^2 I:

\ln p(x, z | \theta) = C - \frac{1}{2} \sum_{i=1}^N \left( \|z_i\|^2 + D \log \sigma^2 + \sigma^{-2} \|x_i - W z_i - \mu\|^2 \right)

\min_{W, \mu, \sigma} \frac{1}{2} \sum_{i=1}^N \left( D \log \sigma^2 + \sigma^{-2} E_q[\|x_i - W z_i - \mu\|^2] \right)
EM Algorithm for Probabilistic PCA

[Figure: six panels (a)-(f) on synthetic 2D data. (a) True principal components; (b) E-step #1 (projection); (c) M-step #1 (regression); (d) E-step #2 (projection); (e) M-step #2 (regression); (f) converged solution. C. Bishop, Pattern Recognition & Machine Learning]
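The alternation of projection (E-step) and regression (M-step) shown in the figure can be sketched in a few lines. This is an illustrative implementation of the standard PPCA EM updates, not code from the lecture; data dimensions, seed, and iteration count are arbitrary.

```python
# EM for probabilistic PCA: E-step posterior moments, M-step regression.
import numpy as np

rng = np.random.default_rng(3)
D, d, N = 5, 2, 500
W_true = rng.standard_normal((D, d))     # synthetic ground-truth loadings
X = rng.standard_normal((N, d)) @ W_true.T + 0.1 * rng.standard_normal((N, D))

mu = X.mean(axis=0)                      # ML estimate of the mean
Xc = X - mu
W = rng.standard_normal((D, d))          # random initialization
sigma2 = 1.0
for _ in range(100):
    # E-step (projection): Gaussian posterior moments of the latents.
    S = np.linalg.inv(np.eye(d) + W.T @ W / sigma2)   # shared posterior cov
    M = Xc @ W @ S / sigma2                           # posterior means, N x d
    Ezz = N * S + M.T @ M                             # sum_i E[z_i z_i^T]
    # M-step (regression): re-fit the loadings and noise variance.
    W = Xc.T @ M @ np.linalg.inv(Ezz)
    sigma2 = (np.sum(Xc ** 2) - 2 * np.sum((Xc @ W) * M)
              + np.trace(Ezz @ W.T @ W)) / (N * D)

print(sigma2)   # should approach the true noise variance, 0.01
```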
EM for State Space Models

Factor Analysis or PPCA:
Ø E-Step: independently find Gaussian posteriors for each observation
Ø M-Step: weighted linear regression to map embeddings to observations

Linear-Gaussian State Space Model:
Ø E-Step: determine posterior marginals via Gaussian BP (Kalman smoother)
Ø M-Step: observation regression identical to the factor analysis M-step, plus a separate dynamics regression
Linear State Space Models States & observations jointly Gaussian: Ø All marginals & conditionals Gaussian Ø Linear transformations remain Gaussian
Simple Linear Dynamics

[Figure: sample state trajectories over time for Brownian motion and constant-velocity dynamics.]
Constant Velocity Tracking Kalman Filter Kalman Smoother (K. Murphy, 1998)
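A constant-velocity Kalman filter like the one behind these plots can be sketched compactly. This is an illustrative 1D version (position plus velocity state, noisy position observations); the noise levels and track are invented for the example, not taken from the figure.

```python
# Constant-velocity Kalman filter: predict with the dynamics, update on data.
import numpy as np

dt = 1.0
A = np.array([[1.0, dt], [0.0, 1.0]])    # constant-velocity dynamics
C = np.array([[1.0, 0.0]])               # observe position only
Q = 0.01 * np.eye(2)                     # process noise (illustrative)
R = np.array([[0.25]])                   # observation noise (illustrative)

def kalman_filter(ys, m0, P0):
    m, P, means = m0, P0, []
    for y in ys:
        # Prediction: push the current posterior through the dynamics.
        m, P = A @ m, A @ P @ A.T + Q
        # Update: condition on the new observation.
        S = C @ P @ C.T + R
        K = P @ C.T @ np.linalg.inv(S)   # Kalman gain
        m = m + (K @ (y - C @ m)).ravel()
        P = (np.eye(2) - K @ C) @ P
        means.append(m.copy())
    return np.array(means)

# A noiseless unit-velocity track: the filter should lock onto velocity 1.
ys = [np.array([float(t)]) for t in range(1, 21)]
means = kalman_filter(ys, np.zeros(2), 10.0 * np.eye(2))
print(means[-1])   # filtered [position, velocity], near [20, 1]
```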
Nonlinear State Space Models

State dynamics and measurements given by potentially complex nonlinear functions; noise sampled from non-Gaussian distributions.
Examples of Nonlinear Models Dynamics implicitly determined by geophysical simulations Observed image is a complex function of the 3D pose, other nearby objects & clutter, lighting conditions, camera calibration, etc.
Nonlinear Filtering

Prediction: p(x_t | y_{1:t-1}) = \int p(x_t | x_{t-1}) p(x_{t-1} | y_{1:t-1}) dx_{t-1}
Update: p(x_t | y_{1:t}) \propto p(y_t | x_t) p(x_t | y_{1:t-1})
Approximate Nonlinear Filters

No direct representation of continuous functions, or closed form for the prediction integral.
Big literature on approximate filtering:
Ø Histogram filters
Ø Extended & unscented Kalman filters
Ø Particle filters
Ø 
Nonlinear Filtering Taxonomy

Histogram Filter:
Ø Evaluate on fixed discretization grid
Ø Only feasible in low dimensions
Ø Expensive or inaccurate

Extended/Unscented Kalman Filter:
Ø Approximate posterior as Gaussian via linearization, quadrature, etc.
Ø Inaccurate for multimodal posterior distributions

Particle Filter:
Ø Dynamically evaluate states with highest probability
Ø Monte Carlo approximation
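The particle filter entry in the taxonomy can be made concrete with a bootstrap filter sketch. The scalar model below (dynamics 0.5 x + sin(x), noisy direct observations) and all noise levels are invented for illustration, not from the lecture.

```python
# Bootstrap particle filter: propagate, weight by likelihood, resample.
import numpy as np

rng = np.random.default_rng(4)
L = 2000                                  # number of particles

def step(x):                              # illustrative nonlinear dynamics
    return 0.5 * x + np.sin(x)

# Simulate a ground-truth trajectory with noisy observations of the state.
x_true, ys = 1.0, []
for _ in range(10):
    x_true = step(x_true) + 0.2 * rng.standard_normal()
    ys.append(x_true + 0.2 * rng.standard_normal())

particles = rng.standard_normal(L)        # samples from the initial prior
for y in ys:
    particles = step(particles) + 0.2 * rng.standard_normal(L)  # propose
    w = np.exp(-0.5 * ((y - particles) / 0.2) ** 2)             # likelihood
    particles = rng.choice(particles, size=L, p=w / w.sum())    # resample

print(particles.mean())   # Monte Carlo posterior mean of the final state
```

The resampling step is what "dynamically evaluates states with highest probability": low-weight particles are discarded and high-weight ones duplicated.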
Monte Carlo Methods

E[f] = \int f(z) p(z) dz \approx \frac{1}{L} \sum_{\ell=1}^L f(z^{(\ell)}), \quad z^{(\ell)} \sim p(z)

Estimation of expected model properties via simulation. Provably good if L sufficiently large:
Ø Unbiased for any sample size
Ø Variance inversely proportional to sample size (and independent of dimension of space)
Ø Weak law of large numbers
Ø Strong law of large numbers

Problem: drawing samples from complex distributions. Alternatives for hard problems:
Ø Importance sampling
Ø Markov chain Monte Carlo (MCMC)
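The estimator above is one line of code. A sketch (test function and dimension chosen for illustration): for a 10-dimensional standard Gaussian and f(z) = ||z||^2, the exact expectation is 10, and the sample average lands close to it for large L.

```python
# Monte Carlo estimate of E[f] = integral of f(z) p(z) dz via sampling.
import numpy as np

rng = np.random.default_rng(5)

def mc_estimate(L):
    z = rng.standard_normal((L, 10))          # z^(l) ~ p(z), 10-dim Gaussian
    return np.mean(np.sum(z ** 2, axis=1))    # f(z) = ||z||^2, exact E[f] = 10

est = mc_estimate(100_000)
print(est)   # close to the exact value 10
```

The estimator's standard deviation shrinks as 1/sqrt(L), with no explicit dependence on the dimension of z.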