Lecture 6: April 19, 2002

EE596 Pat. Recog. II: Introduction to Graphical Models, Spring 2002
University of Washington, Dept. of Electrical Engineering
Lecturer: Jeff Bilmes
Scribes: Huaning Niu, Özgür Çetin

6.1 Factor Analysis

In the last lecture we described a projection model that tries to model the variance at each individual component of the data vector, including the noise. In this section we describe the Factor Analysis (FA) model, which instead tries to model the correlations among the individual components, to the extent that they are generated by underlying low-dimensional hidden variables, and whether or not such an underlying distribution exists. FA was originally developed by psychologists to find the contributions of different factors to mental abilities. For example, given the performance of a test subject in English, French and Spanish, we would like to find the underlying mental abilities: find the factor loadings $\lambda = [\lambda_1, \lambda_2, \lambda_3]^T$, where $\lambda_i = [\lambda_{i1}, \lambda_{i2}]^T$, the common factors $f = [f_1, f_2]^T$ and the disturbance terms $v = [v_1, v_2, v_3]^T$ such that

\[ y_i = \lambda_i^T f + v_i \qquad (6.1) \]

where

\[ Y = \begin{bmatrix} y_1\ (\text{French}) \\ y_2\ (\text{English}) \\ y_3\ (\text{Spanish}) \end{bmatrix} \quad \text{and} \quad f = \begin{bmatrix} f_1\ (\text{illiteracy}) \\ f_2\ (\text{intelligence}) \end{bmatrix}. \]

Since we are trying to model covariances instead of variances, FA is invariant to the level of noise at the individual components, but it is sensitive to the direction of the data.

6.1.1 The Factor Analysis Model

Let $Y$ be the $p$-component observation vector of variables $y_1, \dots, y_p$. The FA model states that the $y_i$ are linear combinations of a small set of common factors, plus random errors and a constant offset to account for the nonzero means:

\[ y_1 = \lambda_{11} f_1 + \lambda_{12} f_2 + \dots + \lambda_{1k} f_k + v_1 + \mu_1 \]
\[ y_2 = \lambda_{21} f_1 + \lambda_{22} f_2 + \dots + \lambda_{2k} f_k + v_2 + \mu_2 \qquad (6.2) \]
\[ \vdots \]
\[ y_p = \lambda_{p1} f_1 + \lambda_{p2} f_2 + \dots + \lambda_{pk} f_k + v_p + \mu_p \]

where the $f_j$ are the $k$ underlying common factors, the $v_i$ represent the combined effects of specific factors and error, the $\mu_i$ are the means of the individual variables, and the $\lambda_{ij}$ are the factor loadings giving the effect of the $j$th common factor on the $i$th variable. Usually $k \ll p$. Written in matrix form, the FA model becomes

\[ Y = \Lambda X + v + \mu \qquad (6.3) \]

where $\Lambda$ is called the factor loading matrix and we have replaced $f$ by $X$ (and collected the specific factors into $v$) to be consistent with the notation of the other sections.
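
To make the generative model (6.2)-(6.3) concrete, here is a minimal NumPy sketch; it is not part of the original notes, and the dimensions and parameter values are arbitrary choices for illustration. It samples $Y = \Lambda X + v + \mu$ and checks that the sample covariance approaches $\Lambda\Lambda^T + \Psi$, the model covariance derived just below.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, N = 3, 2, 50_000                      # observed dim, number of factors, sample size (arbitrary)

Lam = rng.normal(size=(p, k))               # factor loading matrix Lambda
Psi = np.diag(rng.uniform(0.1, 0.5, p))     # diagonal covariance of the specific factors
mu  = rng.normal(size=p)                    # constant offset

X = rng.normal(size=(N, k))                               # common factors, X ~ N(0, I)
v = rng.multivariate_normal(np.zeros(p), Psi, size=N)     # specific factors, v ~ N(0, Psi)
Y = X @ Lam.T + v + mu                                    # Y = Lambda X + v + mu, eq. (6.3)

print(np.round(np.cov(Y, rowvar=False), 2))               # empirical covariance of Y
print(np.round(Lam @ Lam.T + Psi, 2))                     # model covariance Lambda Lambda^T + Psi
```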

If we assume that the specific factors $v_i$ are uncorrelated with each other, the $Y_i$ become conditionally independent given the common factors $X$, i.e. the off-diagonal terms of the covariance matrix of $Y$ are due only to $\Lambda$. We assume that the factors have already been rotated and standardized so that the common factors are uncorrelated with each other, $E\{X\} = 0$ and $\mathrm{Cov}\{X\} = I$, and the specific factors are zero mean and uncorrelated with each other, $E\{v\} = 0$ and $\mathrm{Cov}\{v\} = \Psi$ with $\Psi$ diagonal. Adding normality we get

\[ X \sim N(0, I), \quad v \sim N(0, \Psi), \quad \mathrm{Cov}(X, v) = 0 \;\Longrightarrow\; Y \sim N(\mu, \Lambda\Lambda^T + \Psi), \]

where the factor loading matrix $\Lambda$ is not necessarily diagonal.

There is an identifiability problem: $\Lambda$ is not unique. By rotating both the common factors and the factor loading matrix in a suitable manner, we obtain the same set of observations:

\[ Y = (\Lambda G)\, G^T X + v + \mu \qquad (6.4) \]

where $G$ is any orthonormal matrix of appropriate size, i.e. $GG^T = G^TG = I$. To make the specification unique, we either pick one particular $G$ or add the (arbitrary) constraint that $\Lambda$ should satisfy

\[ \Lambda^T \Psi \Lambda = \text{diagonal matrix}. \qquad (6.5) \]

Having solved the uniqueness problem, there are two basic problems with FA:

learning: given data $Y$, find the model parameters $\Lambda$, $\Psi$ and $\mu$.

probabilistic inference: find the best $X$ given $Y$, e.g. $x^* = \arg\max_x p(x \mid Y)$, i.e. compute $p(X \mid Y)$ under different assumptions about where $Y$ came from.

6.2 Unifying View of These Models - Static Kalman Filter

Consider the following discrete-time linear dynamical system with continuous state,

\[ x_{t+1} = A x_t + w_t, \qquad w_t \sim N(0, Q) \qquad (6.6) \]
\[ y_t = C x_t + v_t, \qquad v_t \sim N(0, R), \]

where $x_t$ is the hidden state of the system and $y_t$ is its output at time $t$. The state evolves as a first-order dynamical system, while the output is generated from the current state by a linear observation process plus additive Gaussian noise. The two noise processes are white and independent of each other. This is the classical Kalman filter. However, if each state $x_t$ is produced independently of the others and identically distributed, then there is no temporal ordering to the data points, so we can drop the time index $t$; the state of the system is completely random, i.e. $A = 0$:

\[ X = w, \qquad w \sim N(0, Q) \qquad (6.7) \]
\[ Y = CX + v, \qquad v \sim N(\mu, R). \]

The observation at any time depends only on the state of the system at that time; this is why the term "static" is used. Since linear combinations of Gaussians are Gaussian and sums of independent Gaussians are Gaussian,

\[ E\{Y\} = \mu, \quad \mathrm{Cov}\{Y\} = CQC^T + R \;\Longrightarrow\; Y \sim N(\mu, CQC^T + R). \qquad (6.8) \]

Figure 6.1: The static Kalman filter: hidden state $X$ (with noise $w$) and observation $Y = CX + v$.
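
As a quick numerical check of the rotation non-uniqueness in (6.4) (the same interchange reappears between $C$ and $Q$ in (6.9) below), the following sketch, not from the lecture and with arbitrary sizes, rotates a loading matrix by an orthonormal $G$ and verifies that the implied observation covariance $\Lambda\Lambda^T + \Psi$ is unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 5, 2                                   # arbitrary sizes for the example
Lam = rng.normal(size=(p, k))                 # some loading matrix
Psi = np.diag(rng.uniform(0.1, 1.0, p))       # diagonal specific-factor covariance

G, _ = np.linalg.qr(rng.normal(size=(k, k)))  # a random orthonormal matrix, G G^T = I

cov_original = Lam @ Lam.T + Psi              # covariance implied by Lambda
cov_rotated  = (Lam @ G) @ (Lam @ G).T + Psi  # covariance implied by Lambda G, eq. (6.4)

print(np.allclose(cov_original, cov_rotated)) # True: the two parameterizations are indistinguishable
```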

We add some restrictions to the above model.

Identification. It is not possible to uniquely determine $C$ and $Q$: we can always interchange factors between them. Writing the eigendecomposition $Q = \Gamma\Lambda\Gamma^T$,

\[ CQC^T = C(\Gamma\Lambda\Gamma^T)C^T = (C\Gamma)\Lambda(C\Gamma)^T = (C\Gamma\Lambda^{1/2})(C\Gamma\Lambda^{1/2})^T = \tilde{C}\tilde{C}^T, \qquad (6.9) \]

so the model with loading $C\Gamma$ and diagonal state covariance $\Lambda$, or with loading $\tilde{C} = C\Gamma\Lambda^{1/2}$ and $Q = I$, produces the same observation covariance. To ensure uniqueness we constrain $Q = I$ without loss of generality.

Restriction on $R$. We must restrict $R$ so that maximum likelihood parameter estimation does not set $C = 0$ and let $R$ explain all of the variability, capturing no interesting or informative projections in $X$. Since $Y$ is now a single Gaussian without any time dependence, the ML estimate of $\mathrm{Cov}\{Y\}$ is the sample covariance, and nothing prevents $R$ from simply becoming the sample covariance matrix. We therefore constrain $R$ to be diagonal, i.e. the components of $v$ are uncorrelated and the correlations in $Y$ are due only to $C$. The key idea is that FA is scale invariant but not rotation invariant, while PCA is rotation invariant but not scale invariant.

6.2.1 Likelihood of the data

The inference problem is to determine the posterior probability of a particular state $X$ given a set of observations $\mathcal{Y} \equiv \{Y_1, \dots, Y_N\}$, i.e. $P(X \mid \mathcal{Y})$. For the static Kalman filter model this reduces to $p(X \mid Y)$, where $Y$ is the observation corresponding to state $X$, since all of the data points are generated independently. Using

\[ Y \sim N(\mu, CC^T + R) \qquad (6.10) \]

and

\[ X \sim N(0, I), \qquad Y = CX + v, \quad v \sim N(\mu, R), \qquad (6.11) \]

we easily get

\[ p(Y \mid X) = N(CX + \mu, R). \qquad (6.12) \]
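
Before moving on, note that the marginal density (6.10) is also the quantity that maximum likelihood estimation (Section 6.2.2 below) has to maximize. The following NumPy sketch, not from the notes, evaluates that log-likelihood directly; the function name static_kf_loglik and all parameter values are made up for the example.

```python
import numpy as np

def static_kf_loglik(Y, C, R, mu):
    """Total log-likelihood of the rows of Y under Y ~ N(mu, C C^T + R), eq. (6.10)."""
    p = Y.shape[1]
    Sigma = C @ C.T + R
    diff = Y - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.einsum('ni,ni->n', diff, np.linalg.solve(Sigma, diff.T).T)   # (Y-mu)^T Sigma^{-1} (Y-mu)
    return -0.5 * np.sum(quad + logdet + p * np.log(2.0 * np.pi))

# Toy usage with arbitrary parameters:
rng = np.random.default_rng(2)
C, R, mu = rng.normal(size=(4, 2)), np.diag(np.full(4, 0.3)), np.zeros(4)
Y = rng.normal(size=(1000, 2)) @ C.T + mu + rng.multivariate_normal(np.zeros(4), R, size=1000)
print(static_kf_loglik(Y, C, R, mu))
```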

To compute $P(X \mid Y)$, we need to know

\[ \mathrm{Cov}\{X, Y\} = E\{(X - \mu_X)(Y - \mu_Y)^T\} = E\{X(CX + v - \mu)^T\} = E\{XX^T\}C^T + E\{Xv^T\} = C^T \qquad (6.13) \]

since $X$ (equivalently $w$) is independent of $v$ and zero mean. $X$ and $Y$ are jointly Gaussian,

\[ \begin{bmatrix} X \\ Y \end{bmatrix} \sim N\!\left( \begin{bmatrix} 0 \\ \mu \end{bmatrix}, \begin{bmatrix} I & C^T \\ C & CC^T + R \end{bmatrix} \right). \qquad (6.14) \]

Using the following conditioning formula for a jointly Gaussian distribution,

\[ \begin{bmatrix} X \\ Y \end{bmatrix} \sim N\!\left( \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \begin{bmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{bmatrix} \right) \;\Longrightarrow\; p(X \mid Y) = N\!\left( \mu_X + \Sigma_{XY}\Sigma_{YY}^{-1}(Y - \mu_Y),\; \Sigma_{XX} - \Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX} \right), \]

we get

\[ p(X \mid Y) = N\!\left( C^T(CC^T + R)^{-1}(Y - \mu),\; I - C^T(CC^T + R)^{-1}C \right). \qquad (6.15) \]

6.2.2 ML Parameter Estimation

Given a set of observations $\mathcal{Y}$ we choose the parameters of our model, $\{C, R, \mu\}$, such that the likelihood of $\mathcal{Y}$ is maximized:

\[ \{C^*, \mu^*, R^*\} = \arg\max_{C,\mu,R}\, p(\mathcal{Y}, \mathcal{X}) = \arg\max_{C,\mu,R}\, p(\mathcal{Y} \mid \mathcal{X}) \qquad (6.16) \]

where $\mathcal{X} \equiv \{X_1, \dots, X_N\}$ is the set of corresponding states producing $\mathcal{Y}$. We prefer to maximize $p(\mathcal{Y} \mid \mathcal{X})$ instead of $p(\mathcal{Y})$, since the latter is difficult to differentiate with respect to $C$. $p(\mathcal{Y} \mid \mathcal{X})$ is easy to differentiate; however, only $\mathcal{Y}$ is observed and $\mathcal{X}$ is hidden. To maximize with hidden variables we use the Expectation-Maximization (EM) algorithm, which we describe in the next section. Before turning to learning with hidden variables, we show how PCA, FA and an econometric model can all be formulated with the static Kalman filter model.

PCA

\[ X = w, \qquad w \sim N(0, I) \qquad (6.17) \]
\[ Y = CX + v, \qquad v \sim N(\mu, 0), \]

i.e. $v$ is deterministic and equal to $\mu$. Then

\[ X = C^{-1}(Y - \mu) = C^T(Y - \mu) \qquad (6.18) \]

assuming $C$ is orthonormal. The learning problem is to find $C$.
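
Equation (6.15) is exactly the inference step the EM algorithm will need later. Below is a small sketch of that posterior computation, mine rather than the lecture's; posterior_x_given_y is a hypothetical helper name and the usage values are arbitrary.

```python
import numpy as np

def posterior_x_given_y(Y, C, R, mu):
    """Posterior p(X|Y) of eq. (6.15) for each row of Y:
    mean = C^T (C C^T + R)^{-1} (Y - mu),  cov = I - C^T (C C^T + R)^{-1} C."""
    k = C.shape[1]
    Sigma = C @ C.T + R
    beta = np.linalg.solve(Sigma, C).T        # beta = C^T (C C^T + R)^{-1}  (Sigma is symmetric)
    mean = (Y - mu) @ beta.T                  # one k-dimensional posterior mean per observation
    cov = np.eye(k) - beta @ C                # posterior covariance, shared by all observations
    return mean, cov

# Toy usage with arbitrary parameters:
rng = np.random.default_rng(3)
C, R, mu = rng.normal(size=(4, 2)), np.diag(np.full(4, 0.2)), np.zeros(4)
Y = rng.normal(size=(5, 4))
m, S = posterior_x_given_y(Y, C, R, mu)
print(m.shape, S.shape)                       # (5, 2) (2, 2)
```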

FA

\[ X = w, \qquad w \sim N(0, I) \qquad (6.19) \]
\[ Y = CX + v, \qquad v \sim N(\mu, R), \]

which can be thought of as data plus sensor noise. Here $C$ plays the role of the factor loading matrix (the same as $\Lambda$ in Section 4.3.3) and $R$ is a diagonal matrix: the covariance structure of $Y$ is in $C$ and the variance structure of $Y$ is in the diagonal matrix $R$.

Econometric model

\[ X_t = w_t, \qquad w_t \sim N(0, I) \qquad (6.20) \]
\[ Y_t = CX_t + v_t, \qquad v_t \sim N(\mu, R_t), \]

where $w_t$ is not necessarily white over time. The model is quite complicated because the volatility matrix $R_t$ is time dependent.

6.3 Learning with Hidden Variables

Hidden variables (unobserved variables, or latent variables) are often introduced into a model to simplify it. For example, given a set of dependent variables, instead of adding links between every pair of them, a top-down structure through hidden variables can simplify the model: given the hidden variables, the observed variables are independent (Figure 6.2).

Figure 6.2: Introducing hidden variables to simplify a densely connected graph.

Hidden variables can be discrete or continuous. If they are discrete, they represent the underlying classes generating the observations, or discrete states associated with a dynamical system, as in a Hidden Markov Model (HMM). For example, in a Gaussian mixture this would be the identity of the distribution from which an observation was produced (Figure 6.3).

Figure 6.3: Mixture models (discrete hidden variable $X$, observation $Y$).

If they are continuous, they represent the state of a static system, like $X$ in FA, or of a dynamic system, like $X_t$ in a Kalman filter. They parameterize low-dimensional spaces: in FA the underlying common factors have lower dimension than the observations, $k \ll p$, and in PCA a small number of principal components might account for most of the variance. They can also explain independent components of the data, as in ICA (Figure 6.4).
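
Returning briefly to the PCA special case (6.17)-(6.18) above: the short sketch below, not from the notes, takes $C$ to be the top-$k$ right singular vectors of the centred data (so that $C$ is orthonormal) and checks that $X = C^T(Y - \mu)$ reproduces the usual PCA scores. All sizes and values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
Y = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))   # arbitrary correlated data
mu = Y.mean(axis=0)
Yc = Y - mu

k = 2
U, S, Vt = np.linalg.svd(Yc, full_matrices=False)
C = Vt[:k].T                                  # p x k with orthonormal columns, C^T C = I

X = Yc @ C                                    # X = C^T (Y - mu), eq. (6.18), one row per sample
print(np.allclose(X, U[:, :k] * S[:k]))       # True: these are exactly the PCA scores
```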

Figure 6.4: Independent Component Analysis, one possible model.

Hidden variables can also be associated with an underlying physical model, as when diseases cause symptoms: even though only the symptoms are observed, hidden variables can be used to infer the diseases, i.e. having observed some symptoms, find the most probable disease (Figure 6.5).

Figure 6.5: The QMR network (diseases causing symptoms).

Although hidden variables simplify the model, training models with hidden variables is difficult because they are unobserved. We would still like to use ML estimation, since it is statistically well founded. However, only a subset of the model variables is observed, and marginalizing the joint distribution over the hidden variables couples the observed variables, so the resulting likelihood is usually mathematically intractable. For example, for a Gaussian mixture model

\[ \log p(\mathcal{Y}) = \sum_{i=1}^N \log P(Y_i) = \sum_{i=1}^N \log\!\left( \sum_{k=1}^M \pi_k\, N(Y_i; \mu_k, \Sigma_k) \right) \qquad (6.21) \]

and the sums inside the log are not amenable to closed-form maximization. The problem would be trivial if we knew which mixture component is responsible for each data point: taking relative class frequencies would give the $\pi_k$, and per-class sample means and covariances would give the mixture means and covariances. EM solves the problem with hidden variables.
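
A toy NumPy illustration of the two remarks above, not from the lecture and using scalar Gaussians for brevity: evaluating (6.21) requires a sum inside the logarithm, while with observed component labels $z$ the estimates would just be class frequencies, per-class means and per-class variances.

```python
import numpy as np

def gmm_loglik(y, pi, mu, var):
    """Incomplete-data log-likelihood (6.21) for a 1-D Gaussian mixture."""
    log_comp = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
                - 0.5 * (y[:, None] - mu) ** 2 / var)        # N x M: log pi_k + log N(y_i; mu_k, var_k)
    m = log_comp.max(axis=1, keepdims=True)                  # log-sum-exp: the sum sits inside the log
    return np.sum(m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1)))

# If the responsible component of each point were observed, estimation would be trivial:
rng = np.random.default_rng(5)
z = rng.integers(0, 2, size=2000)                            # hidden labels (known here, for illustration)
y = rng.normal(np.where(z == 0, -2.0, 3.0), 1.0)
pi_hat  = np.bincount(z, minlength=2) / z.size               # relative class frequencies -> pi_k
mu_hat  = np.array([y[z == c].mean() for c in range(2)])     # per-class sample means
var_hat = np.array([y[z == c].var() for c in range(2)])      # per-class sample variances
print(gmm_loglik(y, pi_hat, mu_hat, var_hat))
```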

6.3.1 EM Algorithm

Let $X$ be the observed and $Z$ the hidden data samples, $\{X, Z\} = \{(X_i, Z_i),\ i = 1, \dots, N\}$ (note that in this section $X$ denotes the observed data and $Z$ the hidden variables). We assume that there exists a complete-data probability model $p(X, Z \mid \theta)$ with some parameterization and structure. Usually this joint distribution factors nicely, as in our Gaussian mixture example, so that maximization of

\[ l_c(\theta) \triangleq \log p(X, Z \mid \theta), \qquad (6.22) \]

called the complete-data log likelihood, is usually easier:

\[ \theta^* = \arg\max_\theta\, l_c(\theta). \qquad (6.23) \]

However, when $Z$ is not observed the observations do not decouple; in this case we only have $p(X \mid \theta)$, and our goal is to find

\[ \theta^* = \arg\max_\theta\, p(X \mid \theta). \qquad (6.24) \]

This gives the log likelihood

\[ l(\theta) \triangleq \log p(X \mid \theta) = \log \sum_Z p(X, Z \mid \theta), \qquad (6.25) \]

as we showed for the Gaussian mixture. The problem is that the sum inside the log couples the variables together.

Suppose we have $Q(Z \mid X, \theta)$, an appropriate distribution over $Z$ that depends on the observed values. Observing that, for a directed graphical model (with $x_{\pi_i}$ denoting the parents of $x_i$),

\[ p(X, Z \mid \theta) = \prod_i p(x_i \mid x_{\pi_i}, \theta_{ix}) \prod_i p(z_i \mid z_{\pi_i}, \theta_{iz}) \qquad (6.26) \]

\[ \log p(X, Z \mid \theta) = \sum_i \log p(x_i \mid x_{\pi_i}, \theta_{ix}) + \sum_i \log p(z_i \mid z_{\pi_i}, \theta_{iz}), \qquad (6.27) \]

the variables $X$ and $Z$ are decoupled, which can lead to tractable inference. Define

\[ \langle l_c(\theta) \rangle_Q = \sum_Z Q(Z \mid X, \theta) \log p(X, Z \mid \theta) = E_{Q(Z \mid X, \theta)}\big[\log p(X, Z \mid \theta)\big], \qquad (6.28) \]

which is called the expected complete log likelihood. Note that if we had observed the $Z_i$, and $Q(Z \mid X, \theta)$ assigned probability 1 to the sequence consisting of the observed $Z_i$ and 0 to everything else, then the expected complete log likelihood would reduce to the complete log likelihood. Since we do not know these true values, we hope that maximizing the average of complete log likelihoods over different assignments to $Z$ under $Q(\cdot)$ will give an improvement towards the value maximizing (6.25). It is crucial that the averaging weight probable $Z$ sequences more heavily; $X$ is a source of information for this, since $Z$ and $X$ are related. An intuitive choice is $P(Z \mid X)$, since different sequences are then weighted by how likely they are to have produced $X$ under the assumed model. We will first show that $Q(Z \mid X)$ provides a lower bound on $l(\theta)$, and then describe an iterative procedure that raises this lower bound at every iteration. Manipulating $l(\theta)$,

\[ l(\theta) = \log \sum_Z p(X, Z \mid \theta) = \log \sum_Z Q(Z \mid X, \theta)\, \frac{p(X, Z \mid \theta)}{Q(Z \mid X, \theta)} \qquad (6.29) \]

\[ \geq \sum_Z Q(Z \mid X, \theta) \log \frac{p(X, Z \mid \theta)}{Q(Z \mid X, \theta)} \;\triangleq\; L(Q, \theta), \]

where we have used $E[f(X)] \leq f(E[X])$ for any concave $f$ (Jensen's inequality), applied here to the concave function $\log$. Hence, for a given $\theta$, maximizing $L(Q, \theta)$ raises this lower bound on $l(\theta)$.
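
A numerical sanity check of the bound (6.29), again not part of the notes, for a two-component 1-D mixture with some fixed $\theta$: any factorized $Q$ gives $L(Q, \theta) \leq l(\theta)$, and the choice $Q = P(Z \mid X, \theta)$ discussed next makes the bound tight.

```python
import numpy as np

rng = np.random.default_rng(6)
y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])               # observed data
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 2.0]), np.array([1.0, 1.0])   # some fixed theta

log_joint = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
             - 0.5 * (y[:, None] - mu) ** 2 / var)             # log p(y_i, z_i = k | theta), N x 2
log_lik = np.logaddexp(log_joint[:, 0], log_joint[:, 1]).sum() # l(theta), eq. (6.25)

def lower_bound(q):
    """L(Q, theta) of eq. (6.29) for a Q that factorizes over data points (q: N x 2, rows sum to 1)."""
    return np.sum(q * (log_joint - np.log(q)))

q_uniform = np.full((y.size, 2), 0.5)                                            # an arbitrary Q
q_post = np.exp(log_joint - np.logaddexp(log_joint[:, :1], log_joint[:, 1:]))    # Q = P(Z | X, theta)

print(lower_bound(q_uniform) <= log_lik)          # True: Jensen's inequality
print(np.isclose(lower_bound(q_post), log_lik))   # True: the bound is tight at the posterior
```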

Denote

\[ L(Q, \theta) = \sum_Z Q(Z \mid X) \log \frac{p(X, Z \mid \theta)}{Q(Z \mid X)}. \qquad (6.30) \]

EM is a coordinate ascent algorithm on $L(Q, \theta)$: we iteratively maximize with respect to $Q$ and $\theta$. At the $(t+1)$st iteration we first maximize $L(Q, \theta^{(t)})$ with respect to $Q$, and then maximize $L(Q^{(t+1)}, \theta)$ with respect to $\theta$:

\[ Q^{(t+1)} = \arg\max_Q\, L(Q, \theta^{(t)}) \qquad \text{(E-step)} \qquad (6.31) \]
\[ \theta^{(t+1)} = \arg\max_\theta\, L(Q^{(t+1)}, \theta) \qquad \text{(M-step)} \]

The E-step is easy to solve by setting $Q^{(t+1)}(Z \mid X) = P(Z \mid X, \theta^{(t)})$, since

\[ L\big(p(Z \mid X, \theta^{(t)}), \theta^{(t)}\big) = \sum_Z p(Z \mid X, \theta^{(t)}) \log \frac{p(X, Z \mid \theta^{(t)})}{p(Z \mid X, \theta^{(t)})} = \sum_Z p(Z \mid X, \theta^{(t)}) \log p(X \mid \theta^{(t)}) \qquad (6.32) \]

\[ = \log p(X \mid \theta^{(t)}) \sum_Z p(Z \mid X, \theta^{(t)}) = \log p(X \mid \theta^{(t)}) = l(\theta^{(t)}), \]

and using $L(Q, \theta) \leq l(\theta)$. Hence at the end of the E-step of the $(t+1)$st iteration we have $l(\theta^{(t)}) = L(Q^{(t+1)}, \theta^{(t)})$; increasing $L(Q^{(t+1)}, \theta)$ will then necessarily increase $l(\theta)$. The E-step can also be seen as follows:

\[ l(\theta) - L(Q, \theta) = \log p(X \mid \theta) - \sum_Z Q(Z \mid X) \log \frac{p(X, Z \mid \theta)}{Q(Z \mid X)} \qquad (6.33) \]

\[ = \sum_Z Q(Z \mid X) \log \frac{Q(Z \mid X)\, p(X \mid \theta)}{p(X, Z \mid \theta)} = \sum_Z Q(Z \mid X) \log \frac{Q(Z \mid X)}{p(Z \mid X, \theta)} = D\big(Q(Z \mid X) \,\|\, P(Z \mid X, \theta)\big) \geq 0, \]

with equality ($= 0$) only when $Q(Z \mid X) = P(Z \mid X, \theta)$.

The M-step maximizes the expected complete log likelihood, because

\[ L(Q, \theta) = \sum_Z Q(Z \mid X, \theta) \log \frac{p(X, Z \mid \theta)}{Q(Z \mid X, \theta)} = \sum_Z Q(Z \mid X, \theta) \log p(X, Z \mid \theta) - \sum_Z Q(Z \mid X, \theta) \log Q(Z \mid X, \theta) \qquad (6.34) \]

\[ = E_{Q(Z \mid X)}\big[\log p(X, Z \mid \theta)\big] + H(Q) = E_{p(Z \mid X, \theta^{(t)})}\big[\log p(X, Z \mid \theta)\big] + H(Q), \]

and $H(Q)$ does not depend on $\theta$ for a given $Q$.
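
Putting the two steps together for the same kind of toy 1-D mixture, here is a minimal EM sketch (mine, not the lecture's) whose loop is exactly the coordinate ascent (6.31): the E-step computes the responsibilities $P(Z \mid X, \theta^{(t)})$ and the M-step maximizes the expected complete log likelihood in closed form. The assertion checks the guaranteed monotone increase of $l(\theta)$ (cf. Figure 6.6); all data and initial values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 2, 700)])              # observed data
pi, mu, var = np.array([0.5, 0.5]), np.array([0.0, 1.0]), np.array([1.0, 1.0])   # initial theta

prev = -np.inf
for t in range(50):
    lj = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
          - 0.5 * (y[:, None] - mu) ** 2 / var)        # log p(y_i, z_i = k | theta^(t)), N x 2
    ll = np.log(np.exp(lj).sum(axis=1)).sum()          # l(theta^(t))
    assert ll >= prev - 1e-9                           # monotone ascent, as guaranteed by EM
    prev = ll
    # E-step (6.31): Q^(t+1)(Z|X) = P(Z|X, theta^(t)), i.e. the responsibilities
    q = np.exp(lj - np.log(np.exp(lj).sum(axis=1, keepdims=True)))
    # M-step: maximize the expected complete log likelihood in closed form
    Nk = q.sum(axis=0)
    pi = Nk / y.size
    mu = (q * y[:, None]).sum(axis=0) / Nk
    var = (q * (y[:, None] - mu) ** 2).sum(axis=0) / Nk

print(np.round(pi, 2), np.round(mu, 2), np.round(var, 2))   # close to the generating mixture
```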

By repeating these two steps, EM is guaranteed to converge to a maximum of $\theta$, which is not necessarily the global maximum.

Figure 6.6: EM climbing the hill: the lower bounds $L(Q^{(t)}, \theta)$ and $L(Q^{(t+1)}, \theta)$ and the log likelihood $l(\theta)$, with the iterates $\theta^{(t)}$ and $\theta^{(t+1)}$.

In Figure 6.6, at each iteration the curve $L(Q^{(t+1)}, \theta)$ is first constructed from the current value $\theta^{(t)}$, where it touches $l(\theta)$; this curve, for fixed $Q^{(t+1)}$, is then maximized with respect to $\theta$ to give $\theta^{(t+1)}$.

References

[Bilmes97] J.A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," ICSI Technical Report ICSI-TR-97-021, 1997.

[Bishop96] C.M. Bishop, "Learning with Latent Variables," in Learning in Graphical Models, M.I. Jordan (ed.), 1996.

[Dempster77] A.P. Dempster, N.M. Laird and D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," J. Royal Stat. Soc. (B), vol. 39, no. 1, pp. 1-38, 1977.

[Everitt84] B.S. Everitt, An Introduction to Latent Variable Models, Chapman and Hall, 1984.

[JB00] M.I. Jordan and C. Bishop, An Introduction to Graphical Models, to be published, 2000.

[Neal96] R.M. Neal and G.E. Hinton, "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants," in Learning in Graphical Models, M.I. Jordan (ed.), 1996.

[Roweis97] S. Roweis and Z. Ghahramani, "A Unifying View of Linear Gaussian Models," unpublished, 1997.

[Svensen98] J.H.M. Svensen, GTM: The Generative Topographic Mapping, PhD thesis, Aston University, 1998.