EE596 Pat. Recog. II: Introduction to Graphical Models, Spring 2002
Lecture 6: April 19, 2002
Lecturer: Jeff Bilmes    Scribe: Huaning Niu, Özgür Çetin
University of Washington, Dept. of Electrical Engineering

6.1 Factor Analysis

In the last lecture we described a projection model that tries to model the variance at each individual component of the data vector, including the noise. In this section we describe the Factor Analysis (FA) model, which instead tries to model the correlations among the individual components, to the extent that they are generated by underlying low-dimensional hidden variables, and to determine whether or not such an underlying distribution exists.

FA was originally developed by psychologists to find the contributions of different factors to mental abilities. For example, given the performance of a test subject in English, French and Spanish, we would like to find the underlying mental abilities. The task is to find the factor loadings $\lambda = [\lambda_1, \lambda_2, \lambda_3]^T$, where $\lambda_i = [\lambda_{i1}, \lambda_{i2}]$, the common factors $f = [f_1, f_2]^T$, and the disturbance terms $v = [v_1, v_2, v_3]^T$ such that

$$ y_i = \lambda_i^T f + v_i, \qquad (6.1) $$

where

$$ Y = \begin{bmatrix} y_1 \text{ (French)} \\ y_2 \text{ (English)} \\ y_3 \text{ (Spanish)} \end{bmatrix} \quad \text{and} \quad f = \begin{bmatrix} f_1 \text{ (illiteracy)} \\ f_2 \text{ (intelligence)} \end{bmatrix}. $$

Since we are trying to model covariances instead of variances, FA is invariant to the level of noise at the individual components, but it is sensitive to the direction of the data.

6.1.1 The Factor Analysis Model

Let $Y$ be the $p$-component observation vector of variables $y_1, \ldots, y_p$. The FA model states that the $y_i$'s are linear combinations of a small set of common factors $f_j$'s, random errors, and a constant offset to account for the nonzero means:

$$ \begin{aligned} y_1 &= \lambda_{11} f_1 + \lambda_{12} f_2 + \ldots + \lambda_{1k} f_k + v_1 + \mu_1 \\ y_2 &= \lambda_{21} f_1 + \lambda_{22} f_2 + \ldots + \lambda_{2k} f_k + v_2 + \mu_2 \\ &\;\;\vdots \\ y_p &= \lambda_{p1} f_1 + \lambda_{p2} f_2 + \ldots + \lambda_{pk} f_k + v_p + \mu_p \end{aligned} \qquad (6.2) $$

where the $f_j$'s represent the $k$ underlying common factors, the $v_i$'s represent the combined effects of specific factors and error, the $\mu_i$'s are the means of each variable, and the $\lambda_{ij}$'s are the factor loadings representing the effect of the $j$th common factor on the $i$th variable. Usually $k \ll p$. Written in matrix form, the FA model becomes

$$ Y = \Lambda X + v + \mu, \qquad (6.3) $$
where $\Lambda$ is called the factor loading matrix and we have replaced $f$ by $X$ to be consistent with the notation of other sections; $X$ and $v$ are independent. If we assume that the specific factors $v_i$ are uncorrelated with each other, the $y_i$'s become conditionally independent given the common factors $X$; i.e., the off-diagonal terms of the covariance matrix of $Y$ are due only to $\Lambda$. We assume that the factors have already been rotated and standardized so that the common factors are uncorrelated with each other, $E\{X\} = 0$ and $\mathrm{Cov}\{X\} = I$, and the specific factors are zero mean and uncorrelated with each other, $E\{v\} = 0$ and $\mathrm{Cov}\{v\} = \Psi$. Adding normality we get

$$ X \sim N(0, I), \quad v \sim N(0, \Psi), \quad \mathrm{Cov}(X, v) = 0 \;\Rightarrow\; Y \sim N(\mu, \Lambda\Lambda^T + \Psi), $$

where the factor loading matrix $\Lambda$ is not necessarily diagonal.

There is an identifiability problem: $\Lambda$ is not unique. Rotating both the common factors and the factor loading matrix in a suitable manner, we would get the same set of observations,

$$ Y = (\Lambda G) G^T X + v + \mu, \qquad (6.4) $$

where $G$ is any orthonormal matrix of appropriate size, i.e., $GG^T = G^TG = I$. To make the specification unique, we pick one particular $G$, or add an arbitrary constraint that $\Lambda$ should satisfy

$$ \Lambda^T \Psi^{-1} \Lambda = \text{diagonal matrix}. \qquad (6.5) $$

Having solved the uniqueness problem, there are two basic problems with FA:

learning Given data $Y$, find the model parameters $\Lambda$, $\Psi$ and $\mu$.

probabilistic inference Find the best $X$ given $Y$, i.e., $p(X|Y)$ or $x^* = \mathrm{argmax}_x\, p(x|Y)$, using different assumptions about where $Y$ came from.

6.2 Unifying View of These Models - Static Kalman Filter

Consider the following discrete-time linear dynamical system with continuous state,

$$ x_{t+1} = A x_t + w_t, \quad w_t \sim N(0, Q) \qquad (6.6) $$
$$ y_t = C x_t + v_t, \quad v_t \sim N(0, R), $$

where $x_t$ is the hidden state of the system and $y_t$ is the output of the system at time $t$. The state of the system evolves as a first-order dynamical system, while the output is generated from the current state by a linear observation process plus additive Gaussian noise. The two noise processes are white and independent of each other. This is the classical Kalman filter. However, if each state $x_t$ is produced independently of the others and identically distributed, then there is no temporal ordering to the data points, so we can drop the time index $t$, and the state of the system is completely random, i.e., $A = 0$:

$$ X = w, \quad w \sim N(0, Q) \qquad (6.7) $$
$$ Y = CX + v, \quad v \sim N(\mu, R). $$

The observation at any time depends only on the state of the system at that time; this is why the term static is used. Since linear combinations of Gaussians are Gaussian, and sums of independent Gaussians are also Gaussian,

$$ E\{Y\} = \mu, \quad \mathrm{Cov}\{Y\} = CQC^T + R \;\Rightarrow\; Y \sim N(\mu, CQC^T + R). \qquad (6.8) $$
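To make (6.7) and (6.8) concrete, here is a minimal numerical sketch, not from the lecture, in which all dimensions and parameter values are invented: it samples from the static model and checks that the sample covariance of $Y$ approaches $CQC^T + R$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, N = 5, 2, 100_000             # observation dim, state dim, number of samples

C = rng.standard_normal((p, k))     # observation matrix
Q = np.eye(k)                       # state covariance (constrained to I below)
R = np.diag(rng.uniform(0.1, 1.0, size=p))  # diagonal observation-noise covariance
mu = rng.standard_normal(p)         # constant offset

X = rng.multivariate_normal(np.zeros(k), Q, size=N)  # X = w ~ N(0, Q)
V = rng.multivariate_normal(mu, R, size=N)           # v ~ N(mu, R)
Y = X @ C.T + V                                      # Y = C X + v

# Sample covariance should be close to C Q C^T + R, as in (6.8)
print(np.max(np.abs(np.cov(Y.T) - (C @ Q @ C.T + R))))  # small, shrinking with N
```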
[Figure 6.1: Static Kalman Filter: hidden state X (driven by noise w) generates the observation Y through C, plus noise v]

We add some restrictions to the above model:

identification It is not possible to uniquely determine $C$ and $Q$; we can always interchange factors between $C$ and $Q$:

$$ \begin{aligned} CQC^T &= C(\Gamma\Lambda\Gamma^T)C^T && (\text{eigendecomposition } Q = \Gamma\Lambda\Gamma^T) \\ &= (C\Gamma)\Lambda(C\Gamma)^T && (\text{new } Q = \Lambda, \text{ diagonal}) \\ &= (C\Gamma\Lambda^{1/2})(C\Gamma\Lambda^{1/2})^T = C'C'^T && (\text{new } Q = I). \end{aligned} \qquad (6.9) $$

To ensure uniqueness we constrain $Q$ to be $I$, without loss of generality.

restriction on R We must restrict $R$ so that maximum likelihood parameter estimation does not set $C = 0$ and let $R$ explain all of the variability, without capturing any interesting and informative projections in $X$. Since $Y$ is now a single Gaussian without any time dependence, the ML estimate of $\mathrm{Cov}\{Y\}$ is the sample covariance, and nothing prevents $R$ from becoming the sample covariance matrix itself. We therefore constrain $R$ to be diagonal; i.e., the components of $v$ are uncorrelated, and the correlations in $Y$ are due only to $C$.

The key idea is that FA is scale invariant but not rotation invariant, while PCA is rotation invariant but not scale invariant.

6.2.1 Likelihood of the data

The inference problem is to determine the posterior probability of a particular state $X$ given a set of observations $\mathcal{Y} = \{Y_1, \ldots, Y_N\}$, i.e., $P(X|\mathcal{Y})$. For the static Kalman filter model this inference problem reduces to $p(X|Y)$, where $Y$ is the observation corresponding to state $X$, since all of the data points are generated independently. With $Q = I$,

$$ Y \sim N(\mu, CC^T + R). \qquad (6.10) $$

By

$$ X \sim N(0, I), \quad Y = CX + v + \mu, \quad v \sim N(0, R), \qquad (6.11) $$

we easily get

$$ P(Y|X) = N(CX + \mu, R). \qquad (6.12) $$
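As an aside before computing the posterior, the rotation argument in (6.9) is easy to verify numerically. The following sketch is ours, with invented dimensions: replacing $C$ by $C\Gamma$ for any orthonormal $\Gamma$ leaves the observable covariance unchanged when $Q = I$, so $C$ is not identifiable from data alone.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 5, 2
C = rng.standard_normal((p, k))

# Build an orthonormal Gamma via a QR decomposition: Gamma @ Gamma.T == I
Gamma, _ = np.linalg.qr(rng.standard_normal((k, k)))
C_rot = C @ Gamma                 # a rotated observation matrix

# With Q = I, both matrices give the same Cov{Y}, so C is not unique
print(np.allclose(C @ C.T, C_rot @ C_rot.T))  # True
```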
To compute $P(X|Y)$, we need to know

$$ \mathrm{Cov}\{X, Y\} = E\{(X - \mu_X)(Y - \mu_Y)^T\} = E\{X(CX + v)^T\} = E\{XX^T\}C^T + E\{Xv^T\} = C^T, \qquad (6.13) $$

since $\mathrm{Cov}\{X\} = I$ and $X \perp v$. $X$ and $Y$ are jointly Gaussian,

$$ \begin{bmatrix} X \\ Y \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ \mu \end{bmatrix}, \begin{bmatrix} I & C^T \\ C & CC^T + R \end{bmatrix} \right). \qquad (6.14) $$

Using the following conditioning formula for a jointly Gaussian distribution,

$$ \begin{bmatrix} X \\ Y \end{bmatrix} \sim N\left( \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \begin{bmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{bmatrix} \right) \;\Rightarrow\; p(X|Y) = N\left(\mu_X + \Sigma_{XY}\Sigma_{YY}^{-1}(Y - \mu_Y),\; \Sigma_{XX} - \Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\right), $$

we get

$$ p(X|Y) = N\left(C^T(CC^T + R)^{-1}(Y - \mu),\; I - C^T(CC^T + R)^{-1}C\right). \qquad (6.15) $$

6.2.2 ML Parameter Estimation

Given a set of observations $\mathcal{Y}$, we choose the parameters of our model, $\{C, R, \mu\}$, such that the likelihood of $\mathcal{Y}$ is maximized:

$$ \{C^*, \mu^*, R^*\} = \mathrm{argmax}_{C,\mu,R}\; p(\mathcal{Y}, \mathcal{X}) = \mathrm{argmax}_{C,\mu,R}\; p(\mathcal{Y}|\mathcal{X}), \qquad (6.16) $$

where $\mathcal{X} = \{X_1, \ldots, X_N\}$ is the set of corresponding states producing $\mathcal{Y}$; the second equality holds because $p(\mathcal{X})$ does not depend on the parameters. We prefer to work with $p(\mathcal{Y}|\mathcal{X})$ instead of $p(\mathcal{Y})$, since the latter is difficult to differentiate with respect to $C$. $p(\mathcal{Y}|\mathcal{X})$ is easy to differentiate; however, only $\mathcal{Y}$ is observed and $\mathcal{X}$ is hidden. To maximize with hidden variables we use the Expectation-Maximization (EM) algorithm, which we describe in the next section.
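The inference step (6.15) is straightforward to implement directly. Below is a minimal sketch of ours, not from the notes; the helper name and all parameter values are invented.

```python
import numpy as np

def posterior_x_given_y(Y, C, R, mu):
    """Posterior p(X|Y) of eq. (6.15) for the static model:
       mean = C^T (C C^T + R)^{-1} (Y - mu)
       cov  = I   - C^T (C C^T + R)^{-1} C
    """
    k = C.shape[1]
    beta = C.T @ np.linalg.inv(C @ C.T + R)
    return beta @ (Y - mu), np.eye(k) - beta @ C

# Example with arbitrary parameters:
rng = np.random.default_rng(2)
C = rng.standard_normal((4, 2))
R = np.diag(rng.uniform(0.1, 1.0, size=4))
mu = np.zeros(4)
Y = rng.standard_normal(4)
mean, cov = posterior_x_given_y(Y, C, R, mu)
print(mean)          # posterior mean of the hidden state
print(np.diag(cov))  # posterior variances, each below the prior variance 1
```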
Before going to learning with hidden variables, we show how PCA, FA and an econometric model can all be formulated as static Kalman filter models.

PCA

$$ X = w, \quad w \sim N(0, I) \qquad (6.17) $$
$$ Y = CX + v, \quad v \sim N(\mu, 0), $$

i.e., $v$ is deterministic and equal to $\mu$. Then

$$ X = C^{-1}(Y - \mu) = C^T(Y - \mu), \qquad (6.18) $$

assuming $C$ is orthonormal. The learning problem is to find $C$.

FA

$$ X = w, \quad w \sim N(0, I) \qquad (6.19) $$
$$ Y = CX + v, \quad v \sim N(\mu, R), $$

where $v$ can be thought of as data plus sensor noise. $C$ carries the correlation structure (it is the same as $\Lambda$ in section 4.3.3) and $R$ is a diagonal matrix: the covariance structure of $Y$ is in $C$, and the variance structure of $Y$ is in the diagonal matrix $R$.

Econometric

$$ X_t = w_t, \quad w_t \sim N(0, I) \qquad (6.20) $$
$$ Y_t = CX_t + v_t, \quad v_t \sim N(\mu, R_t), $$

where $w_t$ is not necessarily white over time. The model is quite complicated because the volatility matrix $R_t$ is time dependent.

6.3 Learning with Hidden Variables

Hidden variables (unobserved variables, or latent variables) are often introduced into a model to simplify it. For example, given a set of dependent variables, instead of adding links between each pair of them, a top-down structure through hidden variables can simplify the model: given the hidden variables, the observed variables are independent (figure 6.2).

[Figure 6.2: Introducing hidden variables to simplify a densely connected graph]

Hidden variables can be discrete or continuous. If they are discrete, they represent the underlying classes generating the observations, or the discrete states associated with a dynamical system such as a Hidden Markov Model (HMM). For example, in a Gaussian mixture this would be the identity of the distribution from which an observation was produced (figure 6.3).

[Figure 6.3: Mixture models: a discrete hidden variable X generates the observation Y]

If they are continuous, they represent the state of a static system, like $X$ in FA, or of a dynamic system, like $X_t$ in a Kalman filter. They parameterize low-dimensional spaces: in FA the underlying common factors have lower dimension than the observations, $k \ll p$, and in PCA a small number of principal components might account for most of the variance. They can also explain independent components of the data, as in ICA (figure 6.4).
[Figure 6.4: Independent Component Analysis, one possible model: independent hidden sources X generate the observations Y]

Hidden variables can also be associated with an underlying physical model, as when diseases cause symptoms. Even though only symptoms are observed, the hidden variables can be used to perform inference about diseases: having observed some symptoms, find the most probable disease (figure 6.5).

[Figure 6.5: QMR: hidden diseases cause observed symptoms]

Although hidden variables simplify the model, training models with hidden variables is difficult because they are unobserved. We would still like to use ML estimation, since it is statistically well founded. However, only a subset of the model variables is observed, marginalizing the joint distribution over the hidden variables couples the observed variables, and the resulting likelihood is usually mathematically intractable. For example, for a Gaussian mixture model,

$$ \log p(\mathcal{Y}) = \sum_{i=1}^N \log P(Y_i) = \sum_{i=1}^N \log\left(\sum_{k=1}^M \pi_k\, N(Y_i; \mu_k, \Sigma_k)\right), \qquad (6.21) $$

and the sums inside the logs are not amenable to closed-form maximization. The problem would be trivial if we knew which mixture component were responsible for each data point: the relative class frequencies would give the $\pi_k$'s, and the sample means and covariances would give the mixture means and variances. EM solves the problem with hidden variables.
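To make the coupling in (6.21) concrete, the following small sketch of ours, with invented data and parameters, evaluates the Gaussian mixture log likelihood; note the sum over components sitting inside the log, which is what blocks closed-form maximization.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(Y, pis, mus, Sigmas):
    """Incomplete-data log likelihood (6.21):
       sum_i log sum_k pi_k N(Y_i; mu_k, Sigma_k)."""
    dens = np.column_stack(
        [multivariate_normal.pdf(Y, mean=m, cov=S) for m, S in zip(mus, Sigmas)]
    )
    return np.sum(np.log(dens @ pis))  # the sum over k is inside the log

rng = np.random.default_rng(3)
Y = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
pis = np.array([0.5, 0.5])
mus = [np.zeros(2), 3 * np.ones(2)]
Sigmas = [np.eye(2), np.eye(2)]
print(gmm_log_likelihood(Y, pis, mus, Sigmas))
```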
6.3.1 EM Algorithm

Let $X$ be the observed and $Z$ the hidden data samples, $\{X, Z\} = \{(X_i, Z_i),\ i = 1, \ldots, N\}$. We assume that there exists a complete-data probability model $p(X, Z|\theta)$ under some assumed implementation and structure. Usually this joint distribution factors nicely, as in our Gaussian mixture example, so that maximization of the complete data log likelihood

$$ l_c(\theta) = \log p(X, Z|\theta) \qquad (6.22) $$

is usually easy:

$$ \theta^* = \mathrm{argmax}_\theta\; l_c(\theta). \qquad (6.23) $$

However, when $Z$ is not observed the observations do not decouple; in this case we have $p(X|\theta)$, and our goal is to find

$$ \theta^* = \mathrm{argmax}_\theta\; p(X|\theta). \qquad (6.24) $$

We get the log likelihood

$$ l(\theta) \triangleq \log p(X|\theta) = \log \sum_Z p(X, Z|\theta), \qquad (6.25) $$

as we showed for the Gaussian mixture; the problem is that the sum inside the log couples the variables together. Suppose we have $Q(Z|X, \theta)$, an appropriate distribution over $Z$ that depends on the observed values. Observing that

$$ p(X, Z|\theta) = \prod_i p(x_i | x_{\pi_i}, \theta_{ix}) \prod_i p(z_i | z_{\pi_i}, \theta_{iz}), \qquad (6.26) $$

$$ \log p(X, Z|\theta) = \sum_i \log p(x_i | x_{\pi_i}, \theta_{ix}) + \sum_i \log p(z_i | z_{\pi_i}, \theta_{iz}), \qquad (6.27) $$

the variables of $X$ and $Z$ are decoupled and can therefore lead to tractable inference. Define

$$ \langle l_c(\theta) \rangle_Q = \sum_Z Q(Z|X, \theta) \log p(X, Z|\theta) = E_Q[\log p(X, Z|\theta)], \qquad (6.28) $$

which is called the expected complete log likelihood. Note that if we had observed the $Z_i$'s, and $Q(Z|X, \theta)$ assigned probability 1 to the sequence consisting of the observed $Z_i$'s and 0 to everything else, then the expected complete log likelihood would reduce to the complete log likelihood. Since we do not know these true values, we hope that maximizing the average of complete likelihoods over different assignments to $Z$ by $Q(\cdot)$ will give an improvement towards the value maximizing (6.25). It is crucial that the averaging weight probable $Z$ sequences more, and $X$ is a source of information for this, since $Z$ and $X$ are related. An intuitive choice is $P(Z|X)$, since different sequences are then weighted by how likely they are to have produced $X$ under the assumed model. We first show that $Q(Z|X)$ provides a lower bound on $l(\theta)$, and then describe an iterative procedure which raises this lower bound at every iteration. Manipulating $l(\theta)$,

$$ \begin{aligned} l(\theta) &= \log \sum_Z p(X, Z|\theta) \\ &= \log \sum_Z Q(Z|X, \theta)\, \frac{p(X, Z|\theta)}{Q(Z|X, \theta)} \\ &\geq \sum_Z Q(Z|X, \theta) \log \frac{p(X, Z|\theta)}{Q(Z|X, \theta)} = L(Q, \theta), \end{aligned} \qquad (6.29) $$

where we have used Jensen's inequality, $E f(X) \leq f(E X)$ for any concave $f$. Hence for a given $\theta$, maximizing $L(Q, \theta)$ raises a lower bound on $l(\theta)$.
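The bound (6.29) can be checked numerically on a toy discrete model. In the sketch below, our own construction with invented numbers, every choice of $Q$ gives a value no larger than $l(\theta)$, and the posterior $Q(Z|X) = P(Z|X, \theta)$ attains it, anticipating the E-step that follows.

```python
import numpy as np

# Toy model with one observed x and three possible z values:
# l(theta) = log sum_z p(x, z | theta)
p_xz = np.array([0.10, 0.25, 0.05])          # joint p(x, z) over z (invented)
l = np.log(p_xz.sum())                       # log likelihood l(theta)

rng = np.random.default_rng(4)
for _ in range(5):
    Q = rng.dirichlet(np.ones(3))            # an arbitrary distribution Q(z|x)
    L = np.sum(Q * np.log(p_xz / Q))         # lower bound L(Q, theta) of (6.29)
    assert L <= l + 1e-12                    # Jensen: L(Q, theta) <= l(theta)

Qstar = p_xz / p_xz.sum()                    # Q = p(z|x), the posterior
print(np.isclose(np.sum(Qstar * np.log(p_xz / Qstar)), l))  # True: bound is tight
```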
Writing the lower bound as

$$ L(Q, \theta) = \sum_Z Q(Z|X) \log \frac{p(X, Z|\theta)}{Q(Z|X)}, \qquad (6.30) $$

EM is a coordinate ascent algorithm on $L(Q, \theta)$: we iteratively maximize with respect to $Q$ and with respect to $\theta$. At the $(t+1)$st iteration we first maximize $L(Q, \theta^{(t)})$ with respect to $Q$, and we then maximize $L(Q^{(t+1)}, \theta)$ with respect to $\theta$:

$$ Q^{(t+1)} = \mathrm{argmax}_Q\; L(Q, \theta^{(t)}) \quad \text{(E-step)} \qquad (6.31) $$
$$ \theta^{(t+1)} = \mathrm{argmax}_\theta\; L(Q^{(t+1)}, \theta) \quad \text{(M-step)} $$

The E-step is easy to solve by setting $Q^{(t+1)}(Z|X) = P(Z|X, \theta^{(t)})$, since

$$ \begin{aligned} L(p(Z|X, \theta^{(t)}), \theta^{(t)}) &= \sum_Z p(Z|X, \theta^{(t)}) \log \frac{p(X, Z|\theta^{(t)})}{p(Z|X, \theta^{(t)})} \\ &= \sum_Z p(Z|X, \theta^{(t)}) \log p(X|\theta^{(t)}) \\ &= \log p(X|\theta^{(t)}) \sum_Z p(Z|X, \theta^{(t)}) = \log p(X|\theta^{(t)}) = l(\theta^{(t)}), \end{aligned} \qquad (6.32) $$

and $L(Q, \theta) \leq l(\theta)$. Hence at the end of the E-step of the $(t+1)$st iteration we have $l(\theta^{(t)}) = L(Q^{(t+1)}, \theta^{(t)})$, so increasing $L(Q^{(t+1)}, \theta)$ will necessarily increase $l(\theta)$. The E-step can also be seen using the following:

$$ \begin{aligned} l(\theta) - L(Q, \theta) &= \log p(X|\theta) - \sum_Z Q(Z|X) \log \frac{p(X, Z|\theta)}{Q(Z|X)} \\ &= \sum_Z Q(Z|X) \log p(X|\theta) - \sum_Z Q(Z|X) \log \frac{p(Z|X, \theta)\, p(X|\theta)}{Q(Z|X)} \\ &= \sum_Z Q(Z|X) \log \frac{Q(Z|X)}{p(Z|X, \theta)} \\ &= D(Q(Z|X)\, \|\, P(Z|X, \theta)) \geq 0, \end{aligned} \qquad (6.33) $$

with equality only when $Q(Z|X) = P(Z|X, \theta)$.

The M-step maximizes the expected complete log likelihood, because

$$ \begin{aligned} L(Q, \theta) &= \sum_Z Q(Z|X, \theta) \log \frac{p(X, Z|\theta)}{Q(Z|X, \theta)} \\ &= \sum_Z Q(Z|X, \theta) \log p(X, Z|\theta) - \sum_Z Q(Z|X, \theta) \log Q(Z|X, \theta) \\ &= E_{Q(Z|X)}[\log p(X, Z|\theta)] + H(Q) = E_{p(Z|X, \theta^{(t)})}[\log p(X, Z|\theta)] + H(Q), \end{aligned} \qquad (6.34) $$

where $H(Q)$ does not depend on $\theta$ for a given $Q$.
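As a concrete illustration of the two steps, the following deliberately minimal sketch, ours rather than from the notes, runs EM for a one-dimensional two-component Gaussian mixture with invented data and initial values. The E-step computes the responsibilities $P(Z_i = k | Y_i, \theta^{(t)})$, and the M-step re-estimates $\pi_k$, $\mu_k$, $\sigma_k$ from them.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
Y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 100)])

pi = np.array([0.5, 0.5])       # initial guesses, chosen arbitrarily
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for t in range(50):
    # E-step: responsibilities q_ik = P(Z_i = k | Y_i, theta^(t))
    dens = np.column_stack([pi[k] * norm.pdf(Y, mu[k], sigma[k]) for k in range(2)])
    q = dens / dens.sum(axis=1, keepdims=True)

    # M-step: maximize the expected complete log likelihood (6.34)
    Nk = q.sum(axis=0)
    pi = Nk / len(Y)
    mu = (q * Y[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((q * (Y[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(pi, mu, sigma)  # approaches the generating values (2/3, 1/3), (-2, 3), (1, 0.5)
```

Each iteration cannot decrease $l(\theta)$, which is exactly the hill-climbing behavior illustrated in figure 6.6 below.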
By repeating these two steps, EM is guaranteed to converge to a maximum of $l(\theta)$, which is not necessarily the global maximum.

[Figure 6.6: EM climbing the hill: the bounds $L(Q^{(t)}, \theta)$ and $L(Q^{(t+1)}, \theta)$ touch $l(\theta)$ at $\theta^{(t-1)}$ and $\theta^{(t)}$, respectively]

In figure 6.6, at each step the curve $L(Q^{(t+1)}, \theta)$ is first found based on the current value $\theta^{(t)}$, where it touches $l(\theta)$; this curve, for constant $Q^{(t+1)}$, is then maximized with respect to $\theta$ to give $\theta^{(t+1)}$.