Cours 7 12th November 2014


Sum Product Algorithm and Hidden Markov Model 2014/2015
Cours 7, 12th November 2014
Enseignant: Francis Bach
Scribes: Pauline Luc, Mathieu Andreux

7.1 Sum Product Algorithm

7.1.1 Motivations

Inference, along with estimation and decoding, is one of the three key operations one must be able to perform efficiently in graphical models. Given a discrete Gibbs model of the form
$$p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),$$
where $\mathcal{C}$ is the set of cliques of the graph, inference enables:
- computation of the marginal $p(x_i)$, or more generally $p(x_C)$;
- computation of the partition function $Z$;
- computation of the conditional marginals $p(x_i \mid X_j = x_j, X_k = x_k)$;
and, as a consequence:
- computation of the gradient in an exponential family;
- computation of the expected value of the log-likelihood of an exponential family at the E step of the EM algorithm (for example for HMMs).

Example 1: Ising model. Let $X = (X_i)_{i \in V}$ be a vector of random variables taking values in $\{0,1\}^V$, whose distribution has the exponential form
$$p(x) = e^{-A(\eta)} \prod_{i \in V} e^{\eta_i x_i} \prod_{(i,j) \in E} e^{\eta_{i,j} x_i x_j}. \qquad (7.1)$$
The log-likelihood is then
$$\ell(\eta) = \sum_{i \in V} \eta_i x_i + \sum_{(i,j) \in E} \eta_{i,j} x_i x_j - A(\eta). \qquad (7.2)$$
We can therefore write the sufficient statistic
$$\phi(x) = \begin{pmatrix} (x_i)_{i \in V} \\ (x_i x_j)_{(i,j) \in E} \end{pmatrix}. \qquad (7.3)$$

But we have seen that for exponential families
$$\ell(\eta) = \phi(x)^T \eta - A(\eta), \qquad (7.4)$$
$$\nabla_\eta \ell(\eta) = \phi(x) - \underbrace{\nabla_\eta A(\eta)}_{\mathbb{E}_\eta[\phi(X)]}. \qquad (7.5)$$
We therefore need to compute $\mathbb{E}_\eta[\phi(X)]$. In the case of the Ising model, we get
$$\mathbb{E}_\eta[X_i] = \mathbb{P}_\eta[X_i = 1], \qquad (7.6)$$
$$\mathbb{E}_\eta[X_i X_j] = \mathbb{P}_\eta[X_i = 1, X_j = 1]. \qquad (7.7)$$
Equation (7.6) is one of the motivations for solving the inference problem: in order to compute the gradient of the log-likelihood, we need to know the marginal laws.

Example 2: Potts model. The $X_i$ are random variables taking values in $\{1, \dots, K_i\}$. We denote by $\delta_{ik}$ the random variable such that $\delta_{ik} = 1$ if and only if $X_i = k$. Then
$$p(\delta) = \exp\Big( \sum_{i \in V} \sum_{k=1}^{K_i} \eta_{i,k}\, \delta_{ik} + \sum_{(i,j) \in E} \sum_{k=1}^{K_i} \sum_{k'=1}^{K_j} \eta_{i,j,k,k'}\, \delta_{ik}\, \delta_{jk'} - A(\eta) \Big) \qquad (7.8)$$
and
$$\phi(\delta) = \begin{pmatrix} (\delta_{ik})_{i,k} \\ (\delta_{ik}\delta_{jk'})_{i,j,k,k'} \end{pmatrix}, \qquad (7.9)$$
$$\mathbb{E}_\eta[\delta_{ik}] = \mathbb{P}_\eta[X_i = k], \qquad (7.10)$$
$$\mathbb{E}_\eta[\delta_{ik}\delta_{jk'}] = \mathbb{P}_\eta[X_i = k, X_j = k']. \qquad (7.11)$$
These examples illustrate the need to perform inference.

Problem: in general, the inference problem is NP-hard. For trees, inference is efficient, as it is linear in $n$. For tree-like graphs we use the junction tree algorithm, which brings the situation back to that of a tree. In the general case we are forced to carry out approximate inference.
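To make the cost of naive inference concrete, here is a brute-force computation of a single-node marginal in a small pairwise Gibbs model, obtained by enumerating all $K^n$ configurations. This is only a toy sketch with hypothetical table names (psi_node, psi_edge), not something from the notes, and it is usable only for very small $n$.

```python
import itertools
import numpy as np

def brute_force_marginal(n, K, psi_node, psi_edge, edges, i):
    """Marginal p(x_i) of p(x) = (1/Z) prod_j psi_node[j][x_j] prod_{(a,b) in edges} psi_edge[(a,b)][x_a, x_b],
    computed by summing over all K**n configurations (complexity O(K^n))."""
    marg = np.zeros(K)
    for x in itertools.product(range(K), repeat=n):
        w = np.prod([psi_node[j][x[j]] for j in range(n)])
        w *= np.prod([psi_edge[(a, b)][x[a], x[b]] for (a, b) in edges])
        marg[x[i]] += w
    return marg / marg.sum()   # dividing by Z = marg.sum()
```

The sum-product algorithm described below reduces this exponential cost to a cost linear in $n$ on chains and trees.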

7.1.2 Inference on a chain

Let $X_i$, $i \in V = \{1, \dots, n\}$, be random variables taking values in $\{1, \dots, K\}$, with joint distribution
$$p(x) = \frac{1}{Z} \prod_{i=1}^{n} \psi_i(x_i) \prod_{i=2}^{n} \psi_{i-1,i}(x_{i-1}, x_i). \qquad (7.12)$$
We wish to compute $p(x_j)$ for a certain $j$. The naive solution would be to compute the marginal as
$$p(x_j) = \sum_{x_{V \setminus \{j\}}} p(x_1, \dots, x_n). \qquad (7.13)$$
Unfortunately, this type of calculation has complexity $O(K^n)$. We therefore develop the expression:
$$p(x_j) = \frac{1}{Z} \sum_{x_{V\setminus\{j\}}} \prod_{i=1}^n \psi_i(x_i) \prod_{i=2}^n \psi_{i-1,i}(x_{i-1},x_i) \qquad (7.14)$$
$$= \frac{1}{Z} \sum_{x_{V\setminus\{j\}}} \prod_{i=1}^{n-1} \psi_i(x_i) \prod_{i=2}^{n-1} \psi_{i-1,i}(x_{i-1},x_i)\; \psi_n(x_n)\,\psi_{n-1,n}(x_{n-1},x_n) \qquad (7.15)$$
$$= \frac{1}{Z} \sum_{x_{V\setminus\{j,n\}}} \prod_{i=1}^{n-1} \psi_i(x_i) \prod_{i=2}^{n-1} \psi_{i-1,i}(x_{i-1},x_i) \sum_{x_n} \psi_n(x_n)\,\psi_{n-1,n}(x_{n-1},x_n), \qquad (7.16)$$
which brings out the message passed by node $n$ to node $n-1$, $\mu_{n \to n-1}(x_{n-1})$. Continuing, we obtain
$$p(x_j) = \frac{1}{Z} \sum_{x_{V\setminus\{j,n\}}} \prod_{i=1}^{n-1} \psi_i(x_i) \prod_{i=2}^{n-1} \psi_{i-1,i}(x_{i-1},x_i)\; \underbrace{\sum_{x_n}\psi_n(x_n)\,\psi_{n-1,n}(x_{n-1},x_n)}_{\mu_{n\to n-1}(x_{n-1})} \qquad (7.17)$$
$$= \frac{1}{Z} \sum_{x_{V\setminus\{j,n,n-1\}}} \prod_{i=1}^{n-2} \psi_i(x_i) \prod_{i=2}^{n-2} \psi_{i-1,i}(x_{i-1},x_i)\; \underbrace{\sum_{x_{n-1}}\psi_{n-1}(x_{n-1})\,\psi_{n-2,n-1}(x_{n-2},x_{n-1})\,\mu_{n\to n-1}(x_{n-1})}_{\mu_{n-1\to n-2}(x_{n-2})} \qquad (7.18)$$
$$= \frac{1}{Z} \sum_{x_{V\setminus\{1,j,n,n-1\}}} \cdots\; \mu_{1\to 2}(x_2)\,\cdots\,\mu_{n-1\to n-2}(x_{n-2}), \qquad (7.19)$$
where messages are passed inwards towards $j$ from both ends of the chain.

In the above equations, we have implicitly used the following definitions for the descending and ascending messages:
$$\mu_{j \to j-1}(x_{j-1}) = \sum_{x_j} \psi_j(x_j)\,\psi_{j-1,j}(x_{j-1},x_j)\,\mu_{j+1 \to j}(x_j), \qquad (7.20)$$
$$\mu_{j \to j+1}(x_{j+1}) = \sum_{x_j} \psi_j(x_j)\,\psi_{j,j+1}(x_j,x_{j+1})\,\mu_{j-1 \to j}(x_j). \qquad (7.21)$$
Each of these messages is computed with complexity $O(K^2)$, so a full sweep of $n-1$ messages costs $O((n-1)K^2)$. Finally, we get
$$p(x_j) = \frac{1}{Z}\, \mu_{j-1 \to j}(x_j)\, \psi_j(x_j)\, \mu_{j+1 \to j}(x_j). \qquad (7.22)$$
With only $2(n-1)$ messages, we have calculated $p(x_j)$ for all $j \in V$. $Z$ is obtained by summing:
$$Z = \sum_{x_i} \mu_{i-1 \to i}(x_i)\, \psi_i(x_i)\, \mu_{i+1 \to i}(x_i). \qquad (7.23)$$
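The recursions (7.20)-(7.23) translate directly into code. Below is a minimal numpy sketch for a chain, assuming node potentials psi_node[i] (arrays of size K) and edge potentials psi_edge[i] (K x K tables for the pair (i, i+1)); these names are illustrative, not from the notes.

```python
import numpy as np

def chain_marginals(psi_node, psi_edge):
    """Exact marginals on a chain via the message recursions (7.20)-(7.23).

    psi_node: list of n arrays of shape (K,)     -- node potentials psi_i
    psi_edge: list of n-1 arrays of shape (K, K) -- edge potentials psi_{i,i+1}
    Returns (list of marginals p(x_i), partition function Z).
    """
    n, K = len(psi_node), len(psi_node[0])

    # Ascending messages mu_{i-1 -> i}, computed left to right (eq. 7.21).
    fwd = [np.ones(K) for _ in range(n)]   # fwd[i] = message into node i from the left
    for i in range(1, n):
        fwd[i] = psi_edge[i - 1].T @ (psi_node[i - 1] * fwd[i - 1])

    # Descending messages mu_{i+1 -> i}, computed right to left (eq. 7.20).
    bwd = [np.ones(K) for _ in range(n)]   # bwd[i] = message into node i from the right
    for i in range(n - 2, -1, -1):
        bwd[i] = psi_edge[i] @ (psi_node[i + 1] * bwd[i + 1])

    # Combine the two messages at each node (eq. 7.22) and read off Z (eq. 7.23).
    unnorm = [psi_node[i] * fwd[i] * bwd[i] for i in range(n)]
    Z = unnorm[0].sum()                    # the same value of Z at every node
    return [u / Z for u in unnorm], Z
```

With $2(n-1)$ matrix-vector products, this returns every marginal at once, matching the $O(nK^2)$ cost discussed above.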

Inference in undirected trees

We denote by $i$ the vertex whose marginal law $p(x_i)$ we want to compute, and we set $i$ to be the root of the tree. For every $j \in V$, we denote by $C(j)$ the set of children of $j$ and by $D(j)$ the set of descendants of $j$. The joint probability is
$$p(x) = \frac{1}{Z} \prod_{i\in V} \psi_i(x_i) \prod_{(i,j)\in E} \psi_{i,j}(x_i,x_j). \qquad (7.24)$$
For a tree with at least two vertices, we define by recursion
$$F(x_i, x_j, x_{D(j)}) \triangleq \psi_{i,j}(x_i,x_j)\,\psi_j(x_j) \prod_{k\in C(j)} F(x_j, x_k, x_{D(k)}). \qquad (7.25)$$
Then, by reformulating the marginal:
$$p(x_i) = \frac{1}{Z}\sum_{x_{V\setminus\{i\}}} \psi_i(x_i) \prod_{j\in C(i)} F(x_i, x_j, x_{D(j)}) \qquad (7.26)$$
$$= \frac{1}{Z}\,\psi_i(x_i) \prod_{j\in C(i)} \sum_{x_j, x_{D(j)}} F(x_i, x_j, x_{D(j)}) \qquad (7.27)$$
$$= \frac{1}{Z}\,\psi_i(x_i) \prod_{j\in C(i)} \sum_{x_j, x_{D(j)}} \psi_{i,j}(x_i,x_j)\,\psi_j(x_j) \prod_{k\in C(j)} F(x_j,x_k,x_{D(k)}) \qquad (7.28)$$
$$= \frac{1}{Z}\,\psi_i(x_i) \prod_{j\in C(i)} \sum_{x_j} \psi_{i,j}(x_i,x_j)\,\psi_j(x_j) \prod_{k\in C(j)} \underbrace{\sum_{x_k, x_{D(k)}} F(x_j,x_k,x_{D(k)})}_{\mu_{k\to j}(x_j)} \qquad (7.29)$$
$$= \frac{1}{Z}\,\psi_i(x_i) \prod_{j\in C(i)} \underbrace{\sum_{x_j} \psi_{i,j}(x_i,x_j)\,\psi_j(x_j) \prod_{k\in C(j)} \mu_{k\to j}(x_j)}_{\mu_{j\to i}(x_i)}. \qquad (7.30)$$
This leads us to the recurrence relation of the Sum Product Algorithm (SPA):
$$\mu_{j\to i}(x_i) = \sum_{x_j} \psi_{i,j}(x_i,x_j)\,\psi_j(x_j) \prod_{k\in C(j)} \mu_{k\to j}(x_j). \qquad (7.31)$$

Sum Product Algorithm (SPA)

Sequential SPA for a rooted tree

For a rooted tree with root $i$, the Sum Product Algorithm is written as follows (a code sketch is given after this list):
1. All the leaves $n$ send $\mu_{n\to\pi_n}(x_{\pi_n})$ to their parent $\pi_n$:
$$\mu_{n\to\pi_n}(x_{\pi_n}) = \sum_{x_n} \psi_n(x_n)\,\psi_{n,\pi_n}(x_n, x_{\pi_n}). \qquad (7.32)$$
2. Iteratively, at each step, all the nodes $k$ that have received messages from all their children send $\mu_{k\to\pi_k}(x_{\pi_k})$ to their parent.
3. At the root we have
$$p(x_i) = \frac{1}{Z}\,\psi_i(x_i) \prod_{j\in C(i)} \mu_{j\to i}(x_i). \qquad (7.33)$$
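Here is a compact recursive sketch of this collect phase, under assumed data structures (a children list for the rooted tree and dictionaries of potential tables); the names are illustrative only.

```python
import numpy as np

def collect_message(j, parent, children, psi_node, psi_edge):
    """Message mu_{j -> parent}(x_parent) of eq. (7.31), computed recursively.

    children[j]           : list of children of node j in the rooted tree
    psi_node[j]           : array (K,), node potential psi_j
    psi_edge[(parent, j)] : array (K, K), edge potential psi_{parent, j}(x_parent, x_j)
    """
    belief = psi_node[j].copy()
    for k in children[j]:
        belief *= collect_message(k, j, children, psi_node, psi_edge)
    # Sum over x_j; for a leaf this is exactly eq. (7.32).
    return psi_edge[(parent, j)] @ belief

def root_marginal(root, children, psi_node, psi_edge):
    """Marginal p(x_root) of eq. (7.33), obtained by collecting all messages at the root."""
    belief = psi_node[root].copy()
    for j in children[root]:
        belief *= collect_message(j, root, children, psi_node, psi_edge)
    return belief / belief.sum()   # belief.sum() is the partition function Z
```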

This algorithm only enables us to compute $p(x_i)$ at the root. To be able to compute all the marginals (as well as the conditional marginals), one must not only collect all the messages from the leaves to the root, but then also distribute them back from the root to the leaves. In fact, the algorithm can then be written independently of the choice of a root.

SPA for an undirected tree

The case of undirected trees is slightly different:
1. All the leaves $n$ send $\mu_{n\to\pi_n}(x_{\pi_n})$.
2. At each step, if a node $j$ has not yet sent a message to one of its neighbours, say $i$ (note: $i$ here is not necessarily the root), and if it has received messages from all its other neighbours $\mathcal N(j)\setminus\{i\}$, it sends to $i$ the message
$$\mu_{j\to i}(x_i) = \sum_{x_j} \psi_j(x_j)\,\psi_{j,i}(x_j, x_i) \prod_{k\in\mathcal N(j)\setminus\{i\}} \mu_{k\to j}(x_j). \qquad (7.34)$$

Parallel SPA (flooding)

1. Initialise the messages randomly.
2. At each step, each node sends a new message to each of its neighbours, using the messages received at the previous step.

Marginal laws

Once all messages have been passed, we can easily calculate all the marginal laws:
$$\forall i\in V,\quad p(x_i) = \frac{1}{Z}\,\psi_i(x_i) \prod_{k\in\mathcal N(i)} \mu_{k\to i}(x_i), \qquad (7.35)$$
$$\forall (i,j)\in E,\quad p(x_i, x_j) = \frac{1}{Z}\,\psi_i(x_i)\,\psi_j(x_j)\,\psi_{i,j}(x_i,x_j) \prod_{k\in\mathcal N(i)\setminus\{j\}} \mu_{k\to i}(x_i) \prod_{k\in\mathcal N(j)\setminus\{i\}} \mu_{k\to j}(x_j). \qquad (7.36)$$

Conditional probabilities

We can use a clever notation to calculate the conditional probabilities. Suppose that we want to compute
$$p(x_i \mid x_5 = 3, x_{10} = 2) \propto p(x_i, x_5 = 3, x_{10} = 2).$$
We can set
$$\tilde\psi_5(x_5) = \psi_5(x_5)\,\delta(x_5, 3).$$

Generally speaking, if we observe $X_j = x_j^0$ for $j \in J_{\mathrm{obs}}$, we can define the modified potentials
$$\tilde\psi_j(x_j) = \psi_j(x_j)\,\delta(x_j, x_j^0) \quad \text{for } j \in J_{\mathrm{obs}}, \qquad \tilde\psi_i = \psi_i \text{ otherwise},$$
such that
$$p(x \mid X_{J_{\mathrm{obs}}} = x_{J_{\mathrm{obs}}}^0) = \frac{1}{\tilde Z} \prod_{i\in V} \tilde\psi_i(x_i) \prod_{(i,j)\in E} \psi_{i,j}(x_i, x_j). \qquad (7.37)$$
Indeed, we have
$$p(x \mid X_{J_{\mathrm{obs}}} = x_{J_{\mathrm{obs}}}^0)\; p(X_{J_{\mathrm{obs}}} = x_{J_{\mathrm{obs}}}^0) = p(x) \prod_{j\in J_{\mathrm{obs}}} \delta(x_j, x_j^0), \qquad (7.38)$$
so that by dividing this equality by $p(X_{J_{\mathrm{obs}}} = x_{J_{\mathrm{obs}}}^0)$ we obtain the previous equation, with $\tilde Z = Z\, p(X_{J_{\mathrm{obs}}} = x_{J_{\mathrm{obs}}}^0)$. We then simply apply the SPA to these new potentials to compute the conditional marginal laws $p(x_i \mid X_{J_{\mathrm{obs}}} = x_{J_{\mathrm{obs}}}^0)$.

Remarks
- The SPA is also called belief propagation or message passing.
- On trees, it is an exact inference algorithm.
- If $G$ is not a tree, the algorithm does not in general converge to the right marginal laws, but it sometimes gives reasonable approximations. We then refer to loopy belief propagation, which is still often used in practice.
- The only property that we have used to construct the algorithm is the fact that $(\mathbb R, +, \times)$ is a semi-ring. It is interesting to notice that the same can therefore also be done with $(\mathbb R_+, \max, \times)$ and $(\mathbb R, \max, +)$.

Example. For $(\mathbb R_+, \max, \times)$ we obtain the max-product algorithm, also called the Viterbi algorithm, which enables us to solve the decoding problem, namely to compute the most probable configuration of the variables given fixed parameters, thanks to the messages
$$\mu_{j\to i}(x_i) = \max_{x_j} \Big[ \psi_{i,j}(x_i,x_j)\,\psi_j(x_j) \prod_{k\in C(j)} \mu_{k\to j}(x_j) \Big]. \qquad (7.39)$$
If we run the max-product algorithm with respect to a chosen root, the collection phase of the messages towards the root enables us to compute the maximal probability over all configurations; if at each computation of a message we have also kept the argmax, we can then perform a distribution phase which, instead of propagating the messages, recursively computes one of the configurations reaching the maximum. (A sketch on a chain is given below.)
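As an illustration, here is a minimal max-product (Viterbi) sketch on a chain, with the same assumed potential tables as before and back-pointers stored during the collection phase.

```python
import numpy as np

def viterbi_chain(psi_node, psi_edge):
    """Most probable configuration on a chain via max-product.

    psi_node: list of n arrays (K,); psi_edge: list of n-1 arrays (K, K).
    Returns (argmax configuration, unnormalized maximal potential).
    """
    n, K = len(psi_node), len(psi_node[0])
    # Collection phase towards node n-1: max-messages and argmax back-pointers.
    msg = np.ones(K)
    backptr = []
    for i in range(n - 1):
        scores = (psi_node[i] * msg)[:, None] * psi_edge[i]   # (K, K): rows x_i, columns x_{i+1}
        backptr.append(scores.argmax(axis=0))
        msg = scores.max(axis=0)
    # Distribution phase: pick the best value at the last node, then follow back-pointers.
    x = [0] * n
    x[n - 1] = int((psi_node[n - 1] * msg).argmax())
    best = float((psi_node[n - 1] * msg).max())
    for i in range(n - 2, -1, -1):
        x[i] = int(backptr[i][x[i + 1]])
    return x, best
```

Dividing the returned value by $Z$ would give the maximal probability itself; in practice one rather works in the $(\mathbb R, \max, +)$ semi-ring, i.e. with log-potentials, for the numerical reasons discussed next.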

In practice, we may be working with values so small that the computer returns numerical errors. For instance, for $k$ binary variables, the joint law $p(x_1, x_2, \dots, x_k)$, which sums to 1 over $2^k$ configurations, can take infinitesimally small values when $k$ is large. The solution is to work with logarithms: if $p = \sum_i p_i$, then by setting $a_i = \log(p_i)$ we have
$$\log(p) = \log\Big[\sum_i e^{a_i}\Big] = a^* + \log\Big[\sum_i e^{a_i - a^*}\Big], \qquad (7.40)$$
with $a^* = \max_i a_i$. Using logarithms in this way ensures numerical stability.
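Equation (7.40) is the usual log-sum-exp trick; a direct numpy transcription:

```python
import numpy as np

def logsumexp(a):
    """Compute log(sum_i exp(a_i)) stably, as in (7.40)."""
    a = np.asarray(a, dtype=float)
    a_star = a.max()                        # a* = max_i a_i
    return a_star + np.log(np.exp(a - a_star).sum())

# Example: log-probabilities around -1000 would underflow with a naive exp/sum/log.
log_p = np.array([-1000.0, -1000.5, -1001.0])
print(logsumexp(log_p))                     # finite, approximately -999.32
```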

Proof of the algorithm

We are going to prove that the SPA is correct by induction. In the case of two nodes, we have
$$p(x_1, x_2) = \frac{1}{Z}\,\psi_1(x_1)\,\psi_2(x_2)\,\psi_{1,2}(x_1, x_2).$$
Marginalizing, we obtain
$$p(x_1) = \frac{1}{Z}\,\psi_1(x_1) \underbrace{\sum_{x_2} \psi_{1,2}(x_1,x_2)\,\psi_2(x_2)}_{\mu_{2\to 1}(x_1)},$$
from which we deduce
$$p(x_1) = \frac{1}{Z}\,\psi_1(x_1)\,\mu_{2\to1}(x_1) \qquad\text{and}\qquad p(x_2) = \frac{1}{Z}\,\psi_2(x_2)\,\mu_{1\to2}(x_2).$$

We assume that the result is true for trees of size $n-1$, and we consider a tree $T$ of size $n$. Without loss of generality, we can assume that the nodes are numbered so that the $n$-th node is a leaf, and we call $\pi_n$ its parent (which is unique, the graph being a tree). The first message to be passed is
$$\mu_{n\to\pi_n}(x_{\pi_n}) = \sum_{x_n} \psi_n(x_n)\,\psi_{n,\pi_n}(x_n, x_{\pi_n}), \qquad (7.41)$$
and the last message to be passed is
$$\mu_{\pi_n\to n}(x_n) = \sum_{x_{\pi_n}} \psi_{\pi_n}(x_{\pi_n})\,\psi_{n,\pi_n}(x_n, x_{\pi_n}) \prod_{k\in\mathcal N(\pi_n)\setminus\{n\}} \mu_{k\to\pi_n}(x_{\pi_n}). \qquad (7.42)$$

We are going to construct a tree $\tilde T$ of size $n-1$, as well as a family of potentials, such that the $2(n-2)$ messages passed in $\tilde T$ (i.e., all the messages except for the first and the last) are equal to the corresponding $2(n-2)$ messages passed in $T$. We define the tree and the potentials as follows: $\tilde T = (\tilde V, \tilde E)$ with $\tilde V = \{1,\dots,n-1\}$ and $\tilde E = E\setminus\{\{n,\pi_n\}\}$ (i.e., it is the subtree corresponding to the first $n-1$ vertices). The potentials are all the same as those of $T$, except for the potential of $\pi_n$:
$$\tilde\psi_{\pi_n}(x_{\pi_n}) = \psi_{\pi_n}(x_{\pi_n})\,\mu_{n\to\pi_n}(x_{\pi_n}). \qquad (7.43)$$
The root is unchanged, and the topological order is also kept. We then obtain two important properties:

1) The product of the potentials of the tree of size $n-1$ is equal to
$$\frac{1}{Z}\,\tilde\psi_{\pi_n}(x_{\pi_n}) \prod_{i\neq n,\pi_n}\psi_i(x_i) \prod_{(i,j)\in E\setminus\{n,\pi_n\}}\psi_{i,j}(x_i,x_j)$$
$$= \frac{1}{Z} \prod_{i\neq n,\pi_n}\psi_i(x_i) \prod_{(i,j)\in E\setminus\{n,\pi_n\}}\psi_{i,j}(x_i,x_j) \sum_{x_n}\psi_n(x_n)\,\psi_{\pi_n}(x_{\pi_n})\,\psi_{n,\pi_n}(x_n,x_{\pi_n})$$
$$= \frac{1}{Z} \sum_{x_n} \prod_{i=1}^n \psi_i(x_i) \prod_{(i,j)\in E}\psi_{i,j}(x_i,x_j) = \sum_{x_n} p(x_1,\dots,x_{n-1},x_n) = p(x_1,\dots,x_{n-1}),$$
which shows that these new potentials define on $(X_1,\dots,X_{n-1})$ exactly the distribution induced by $p$ when marginalizing out $X_n$.

2) All of the messages passed in $\tilde T$ correspond to the messages passed in $T$ (except for the first and the last).

Now, with the induction hypothesis that the SPA is correct for trees of size $n-1$, we are going to show that it is correct for trees of size $n$. For nodes $i \neq n, \pi_n$, the result is immediate, as all the messages involved are the same:
$$\forall i\in V\setminus\{n,\pi_n\},\quad p(x_i) = \frac{1}{Z}\,\psi_i(x_i) \prod_{k\in\mathcal N(i)} \mu_{k\to i}(x_i). \qquad (7.44)$$
For the case $i = \pi_n$, we deduce:

$$p(x_{\pi_n}) = \frac{1}{Z}\,\tilde\psi_{\pi_n}(x_{\pi_n}) \prod_{k\in\tilde{\mathcal N}(\pi_n)} \mu_{k\to\pi_n}(x_{\pi_n}) \qquad (\text{product over the neighbours of } \pi_n \text{ in } \tilde T)$$
$$= \frac{1}{Z}\,\tilde\psi_{\pi_n}(x_{\pi_n}) \prod_{k\in\mathcal N(\pi_n)\setminus\{n\}} \mu_{k\to\pi_n}(x_{\pi_n})$$
$$= \frac{1}{Z}\,\psi_{\pi_n}(x_{\pi_n})\,\mu_{n\to\pi_n}(x_{\pi_n}) \prod_{k\in\mathcal N(\pi_n)\setminus\{n\}} \mu_{k\to\pi_n}(x_{\pi_n})$$
$$= \frac{1}{Z}\,\psi_{\pi_n}(x_{\pi_n}) \prod_{k\in\mathcal N(\pi_n)} \mu_{k\to\pi_n}(x_{\pi_n}).$$

For the case $i = n$, we have
$$p(x_n, x_{\pi_n}) = \sum_{x_{V\setminus\{n,\pi_n\}}} p(x) = \psi_n(x_n)\,\psi_{\pi_n}(x_{\pi_n})\,\psi_{n,\pi_n}(x_n,x_{\pi_n}) \underbrace{\sum_{x_{V\setminus\{n,\pi_n\}}} \frac{p(x)}{\psi_n(x_n)\,\psi_{\pi_n}(x_{\pi_n})\,\psi_{n,\pi_n}(x_n,x_{\pi_n})}}_{\alpha(x_{\pi_n})}.$$
Consequently:
$$p(x_n, x_{\pi_n}) = \psi_{\pi_n}(x_{\pi_n})\,\alpha(x_{\pi_n})\,\psi_n(x_n)\,\psi_{n,\pi_n}(x_n,x_{\pi_n}), \qquad (7.45)$$
$$p(x_{\pi_n}) = \psi_{\pi_n}(x_{\pi_n})\,\alpha(x_{\pi_n}) \underbrace{\sum_{x_n} \psi_n(x_n)\,\psi_{n,\pi_n}(x_n,x_{\pi_n})}_{\mu_{n\to\pi_n}(x_{\pi_n})}.$$
Hence:
$$\alpha(x_{\pi_n}) = \frac{p(x_{\pi_n})}{\psi_{\pi_n}(x_{\pi_n})\,\mu_{n\to\pi_n}(x_{\pi_n})}. \qquad (7.46)$$
By using (7.31), (7.32) and the previous result, we deduce that
$$p(x_n, x_{\pi_n}) = \psi_{\pi_n}(x_{\pi_n})\,\psi_n(x_n)\,\psi_{n,\pi_n}(x_n,x_{\pi_n})\, \frac{p(x_{\pi_n})}{\psi_{\pi_n}(x_{\pi_n})\,\mu_{n\to\pi_n}(x_{\pi_n})}$$
$$= \psi_{\pi_n}(x_{\pi_n})\,\psi_n(x_n)\,\psi_{n,\pi_n}(x_n,x_{\pi_n})\, \frac{\frac{1}{Z}\,\psi_{\pi_n}(x_{\pi_n}) \prod_{k\in\mathcal N(\pi_n)} \mu_{k\to\pi_n}(x_{\pi_n})}{\psi_{\pi_n}(x_{\pi_n})\,\mu_{n\to\pi_n}(x_{\pi_n})}$$
$$= \frac{1}{Z}\,\psi_{\pi_n}(x_{\pi_n})\,\psi_n(x_n)\,\psi_{n,\pi_n}(x_n,x_{\pi_n}) \prod_{k\in\mathcal N(\pi_n)\setminus\{n\}} \mu_{k\to\pi_n}(x_{\pi_n}).$$

By summing with respect to $x_{\pi_n}$, we get the result for $p(x_n)$:
$$p(x_n) = \sum_{x_{\pi_n}} p(x_n, x_{\pi_n}) = \frac{1}{Z}\,\psi_n(x_n)\,\mu_{\pi_n\to n}(x_n).$$

Proposition. Let $p \in \mathcal L(G)$, for $G = (V,E)$ a tree. Then we have
$$p(x_1,\dots,x_n) = \prod_{i\in V} p(x_i) \prod_{(i,j)\in E} \frac{p(x_i, x_j)}{p(x_i)\,p(x_j)}. \qquad (7.47)$$
Proof: we prove it by induction. The case $n = 1$ is trivial. Then, assuming that $n$ is a leaf, we can write $p(x_1,\dots,x_n) = p(x_1,\dots,x_{n-1})\, p(x_n \mid x_{\pi_n})$. But multiplying by
$$p(x_n \mid x_{\pi_n}) = \frac{p(x_n, x_{\pi_n})}{p(x_n)\,p(x_{\pi_n})}\; p(x_n)$$
boils down to adding the edge potential for $(n, \pi_n)$ and the node potential for the leaf $n$. The formula is hence verified by induction.

Junction tree

The junction tree algorithm is designed to tackle the problem of inference on general graphs. The idea is to look at a general graph from far away, so that it can be seen as a tree. By merging nodes, one will hopefully be able to build a tree. When this is not the case, one can also think of adding some edges to the graph (i.e., cast the present distribution into a larger set) to be able to build such a tree. The trap is that if one collapses too many nodes, the number of possible values per merged node will explode, and with it the complexity of the whole algorithm. The tree width is the smallest possible size of the largest merged clique (minus one, by the usual convention); for instance, for a 2D regular grid with $n$ points, the tree width is of order $\sqrt{n}$.

7.2 Hidden Markov Model (HMM)

The Hidden Markov Model ("modèle de Markov caché" in French) is one of the most widely used graphical models. We denote by $z_0, z_1, \dots, z_T$ the states corresponding to the latent variables, and by $y_0, y_1, \dots, y_T$ the states corresponding to the observed variables. We further assume that:
1. $\{z_0, \dots, z_T\}$ is a Markov chain (hence the name of the model);
2. $z_0, \dots, z_T$ take $K$ values;
3. $z_0$ follows a multinomial distribution: $p(z_0 = i) = (\pi_0)_i$ with $\sum_i (\pi_0)_i = 1$;

4. the transition probabilities are homogeneous: $p(z_t = i \mid z_{t-1} = j) = A_{ij}$, where $A$ satisfies $\sum_i A_{ij} = 1$;
5. the emission probabilities $p(y_t \mid z_t)$ are homogeneous, i.e. $p(y_t \mid z_t) = f(y_t, z_t)$;
6. the joint probability distribution function can be written as
$$p(z_0,\dots,z_T, y_0,\dots,y_T) = p(z_0) \prod_{t=0}^{T-1} p(z_{t+1}\mid z_t) \prod_{t=0}^{T} p(y_t\mid z_t).$$

There are different tasks that we want to perform on this model:
- Filtering: $p(z_t \mid y_0, \dots, y_t)$;
- Smoothing: $p(z_t \mid y_0, \dots, y_T)$;
- Decoding: $\max_{z_0,\dots,z_T} p(z_0,\dots,z_T \mid y_0,\dots,y_T)$.
All these tasks can be performed with a sum-product or max-product algorithm.

Sum-product

From now on, we denote the observations by $\bar y = \{\bar y_0, \dots, \bar y_T\}$. The distribution on $y_t$ simply becomes the delta function $\delta(y_t = \bar y_t)$. To use the sum-product algorithm, we define $z_T$ as the root, i.e. we send all forward messages towards $z_T$ and go back afterwards.

Forward:
$$\mu_{y_0\to z_0}(z_0) = \sum_{y_0} \delta(y_0 = \bar y_0)\,p(y_0\mid z_0) = p(\bar y_0\mid z_0),$$
$$\mu_{z_0\to z_1}(z_1) = \sum_{z_0} p(z_1\mid z_0)\,\mu_{y_0\to z_0}(z_0)\,\mu_{z_{-1}\to z_0}(z_0),$$
and more generally
$$\mu_{y_{t-1}\to z_{t-1}}(z_{t-1}) = \sum_{y_{t-1}} \delta(y_{t-1} = \bar y_{t-1})\,p(y_{t-1}\mid z_{t-1}) = p(\bar y_{t-1}\mid z_{t-1}),$$
$$\mu_{z_{t-1}\to z_t}(z_t) = \sum_{z_{t-1}} p(z_t\mid z_{t-1})\,\mu_{z_{t-2}\to z_{t-1}}(z_{t-1})\,\mu_{y_{t-1}\to z_{t-1}}(z_{t-1}).$$

Let us define the alpha-message $\alpha_t(z_t)$ as
$$\alpha_t(z_t) = \mu_{y_t\to z_t}(z_t)\,\mu_{z_{t-1}\to z_t}(z_t).$$
$\alpha_0(z_0)$ is initialized with the virtual message $\mu_{z_{-1}\to z_0}(z_0) = p(z_0)$, which also appears in the message $\mu_{z_0\to z_1}$ above.

Property 7.1
$$\alpha_t(z_t) = \mu_{y_t\to z_t}(z_t)\,\mu_{z_{t-1}\to z_t}(z_t) = p(z_t, \bar y_0, \dots, \bar y_t).$$
This is due to the definition of the messages: the product $\mu_{y_t\to z_t}(z_t)\,\mu_{z_{t-1}\to z_t}(z_t)$ represents a marginal of the distribution corresponding to the sub-HMM $\{z_0, \dots, z_t\}$.

Moreover, the following recursion formula for $\alpha_t(z_t)$ (called the alpha-recursion) holds:
$$\alpha_{t+1}(z_{t+1}) = p(\bar y_{t+1}\mid z_{t+1}) \sum_{z_t} p(z_{t+1}\mid z_t)\,\alpha_t(z_t).$$

Backward:
$$\mu_{z_{t+1}\to z_t}(z_t) = \sum_{z_{t+1}} p(z_{t+1}\mid z_t)\,\mu_{z_{t+2}\to z_{t+1}}(z_{t+1})\,p(\bar y_{t+1}\mid z_{t+1}).$$
We define the beta-message $\beta_t(z_t)$ as $\beta_t(z_t) = \mu_{z_{t+1}\to z_t}(z_t)$. As an initialization, we take $\beta_T(z_T) = 1$. The following recursion formula for $\beta_t(z_t)$ (called the beta-recursion) holds:
$$\beta_t(z_t) = \sum_{z_{t+1}} p(z_{t+1}\mid z_t)\,p(\bar y_{t+1}\mid z_{t+1})\,\beta_{t+1}(z_{t+1}).$$

Property
1. $p(z_t, \bar y_0, \dots, \bar y_T) = \alpha_t(z_t)\,\beta_t(z_t)$;
2. $p(\bar y_0, \dots, \bar y_T) = \sum_{z_t} \alpha_t(z_t)\,\beta_t(z_t)$;
3. $p(z_t \mid \bar y_0, \dots, \bar y_T) = \dfrac{p(z_t, \bar y_0,\dots,\bar y_T)}{p(\bar y_0,\dots,\bar y_T)} = \dfrac{\alpha_t(z_t)\,\beta_t(z_t)}{\sum_{z_t}\alpha_t(z_t)\,\beta_t(z_t)}$;
4. for all $t < T$, $p(z_t, z_{t+1} \mid \bar y_0,\dots,\bar y_T) = \dfrac{1}{p(\bar y_0,\dots,\bar y_T)}\, \alpha_t(z_t)\, \beta_{t+1}(z_{t+1})\, p(z_{t+1}\mid z_t)\, p(\bar y_{t+1}\mid z_{t+1})$.

Remark
1. The alpha-recursion and the beta-recursion are easy to implement (see the sketch below), but one needs to avoid errors in the indices!
2. The sums need to be coded using logs in order to prevent numerical errors.
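A minimal numpy sketch of the alpha/beta recursions, assuming a transition matrix A with A[i, j] = p(z_{t+1} = i | z_t = j), an initial distribution pi0, and a matrix obs_lik with obs_lik[t, i] = p(ȳ_t | z_t = i); these names are mine. For long sequences one would rescale or work in the log domain, as noted in the remark above.

```python
import numpy as np

def forward_backward(pi0, A, obs_lik):
    """Alpha/beta recursions for an HMM.

    pi0:     (K,)     initial distribution p(z_0)
    A:       (K, K)   transition matrix, A[i, j] = p(z_{t+1} = i | z_t = j)
    obs_lik: (T+1, K) observation likelihoods, obs_lik[t, i] = p(y_t | z_t = i)
    Returns (alpha, beta, gamma) with gamma[t] = p(z_t | y_0, ..., y_T).
    """
    T1, K = obs_lik.shape
    alpha = np.zeros((T1, K))
    beta = np.ones((T1, K))

    # Alpha-recursion: alpha_t(z_t) = p(z_t, y_0, ..., y_t).
    alpha[0] = pi0 * obs_lik[0]
    for t in range(T1 - 1):
        alpha[t + 1] = obs_lik[t + 1] * (A @ alpha[t])

    # Beta-recursion: beta_t(z_t) = p(y_{t+1}, ..., y_T | z_t), with beta_T = 1.
    for t in range(T1 - 2, -1, -1):
        beta[t] = A.T @ (obs_lik[t + 1] * beta[t + 1])

    likelihood = alpha[-1].sum()          # p(y_0, ..., y_T), property 2 at t = T
    gamma = alpha * beta / likelihood     # smoothing marginals, property 3
    return alpha, beta, gamma
```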

EM algorithm

With the previous notations and assumptions, we write the complete log-likelihood $\ell_c(\theta)$, where $\theta$ denotes the parameters of the model (containing $(\pi_0, A)$, but also the parameters of $f$):
$$\ell_c(\theta) = \log\Big( p(z_0) \prod_{t=0}^{T-1} p(z_{t+1}\mid z_t) \prod_{t=0}^{T} p(y_t\mid z_t) \Big)$$
$$= \log p(z_0) + \sum_{t=0}^{T-1} \log p(z_{t+1}\mid z_t) + \sum_{t=0}^{T} \log p(y_t\mid z_t)$$
$$= \sum_{i=1}^K \delta(z_0 = i)\,\log (\pi_0)_i + \sum_{t=0}^{T-1}\sum_{i,j=1}^K \delta(z_{t+1}=i, z_t=j)\,\log A_{i,j} + \sum_{t=0}^{T}\sum_{i=1}^K \delta(z_t=i)\,\log f(y_t, i).$$

When applying EM to estimate the parameters of this HMM, we use Jensen's inequality to obtain a lower bound on the log-likelihood:
$$\log p(y_0,\dots,y_T) \geq \mathbb E_q[\log p(z_0,\dots,z_T, y_0,\dots,y_T)] = \mathbb E_q[\ell_c(\theta)].$$
At the $k$-th expectation step, we use $q(z_0,\dots,z_T) = p(z_0,\dots,z_T \mid y_0,\dots,y_T; \theta_{k-1})$, and this boils down to applying the following rules:
$$\mathbb E[\delta(z_0 = i)\mid y] = p(z_0 = i\mid y; \theta_{k-1}),$$
$$\mathbb E[\delta(z_t = i)\mid y] = p(z_t = i\mid y; \theta_{k-1}),$$
$$\mathbb E[\delta(z_{t+1} = i, z_t = j)\mid y] = p(z_{t+1} = i, z_t = j\mid y; \theta_{k-1}).$$
Thus, in the former expression of the complete log-likelihood, we just have to replace $\delta(z_0 = i)$ by $p(z_0 = i\mid y;\theta_{k-1})$, and similarly for the other terms.

At the $k$-th maximization step, we maximize the resulting expression with respect to the parameters $\theta$ in the usual manner to obtain a new estimate $\theta_k$. The key point is that everything decouples, so the maximization is simple and can be done in closed form. (A sketch of the resulting updates is given below.)
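For concreteness, here is a minimal sketch of the closed-form M-step updates, assuming discrete emissions (the notes leave the emission family f generic); gamma and xi are the smoothing posteriors computed with the alpha/beta recursions above, and the variable names are mine.

```python
import numpy as np

def m_step(gamma, xi, y, n_symbols):
    """Closed-form M-step for an HMM, assuming discrete emissions.

    gamma: (T+1, K)    gamma[t, i] = p(z_t = i | y; theta_{k-1})
    xi:    (T, K, K)   xi[t, i, j] = p(z_{t+1} = i, z_t = j | y; theta_{k-1})
    y:     (T+1,) int  observed symbols in {0, ..., n_symbols - 1}
    Returns updated (pi0, A, B) with A[i, j] = p(z_{t+1} = i | z_t = j)
    and B[s, i] = p(y_t = s | z_t = i).
    """
    pi0 = gamma[0] / gamma[0].sum()
    # Transition update: expected counts of (j -> i), normalized per column j.
    A = xi.sum(axis=0)
    A /= A.sum(axis=0, keepdims=True)
    # Emission update: expected counts of symbol s emitted from state i.
    K = gamma.shape[1]
    B = np.zeros((n_symbols, K))
    for t, s in enumerate(y):
        B[s] += gamma[t]
    B /= gamma.sum(axis=0, keepdims=True)
    return pi0, A, B
```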

7.3 Principal Component Analysis (PCA)

Framework: $x_1, \dots, x_N \in \mathbb R^d$.
Goal: project the points onto the closest affine subspace.

Analysis view

Find $w \in \mathbb R^d$ such that $\mathrm{Var}(x^T w)$ is maximal, under the constraint $\|w\| = 1$.

With centered data, i.e. $\frac{1}{N}\sum_{n=1}^N x_n = 0$, the empirical variance is
$$\widehat{\mathrm{Var}}(x^T w) = \frac{1}{N}\sum_{n=1}^N (x_n^T w)^2 = \frac{1}{N}\, w^T (X^T X)\, w,$$
where $X \in \mathbb R^{N\times d}$ is the design matrix. In this case, $w$ is the eigenvector of $X^T X$ with the largest eigenvalue. It is not obvious a priori that this is the direction we care about.

If more than one direction is required, one can use deflation:
1. Find $w$.
2. Project the $x_n$ onto the orthogonal complement of $\mathrm{Vect}(w)$.
3. Start again.

Synthesis view

$$\min_{w}\; \sum_{n=1}^N d\big(x_n, \{x : w^T x = 0\}\big)^2 \quad\text{with } w\in\mathbb R^d,\ \|w\|=1.$$
Advantage: if one wants more than one dimension, it suffices to replace $\{x : w^T x = 0\}$ by any subspace. (A numerical sketch of the analysis view is given below.)
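A small numpy sketch of the analysis view: center the data and take the leading eigenvectors of $X^T X$, computed here through the SVD of the centered design matrix for numerical stability; keeping the top-$k$ right singular vectors plays the role of the deflation loop.

```python
import numpy as np

def pca_directions(X, k=1):
    """Top-k principal directions of the rows of X (N x d), analysis view."""
    Xc = X - X.mean(axis=0)                    # center the data
    # Right singular vectors of Xc = eigenvectors of X^T X with largest eigenvalues.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T                            # d x k matrix of unit-norm directions

# Usage: project centered data onto the first two principal directions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ np.diag([3.0, 1.0, 1.0, 0.5, 0.1])
W = pca_directions(X, k=2)
scores = (X - X.mean(axis=0)) @ W              # N x 2 coordinates
```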

Probabilistic approach: Factor Analysis

Model: $\Lambda = (\lambda_1, \dots, \lambda_k) \in \mathbb R^{d\times k}$,
$$X \in \mathbb R^k, \quad X \sim \mathcal N(0, I),$$
$$\epsilon \sim \mathcal N(0, \Psi), \quad \epsilon \in \mathbb R^d \text{ independent of } X, \text{ with } \Psi \text{ diagonal},$$
$$Y \in \mathbb R^d: \quad Y = \Lambda X + \mu + \epsilon.$$
We have $Y \mid X \sim \mathcal N(\Lambda X + \mu, \Psi)$. Problem: infer $X$ given $Y$.

$(X, Y)$ is a Gaussian vector on $\mathbb R^{k+d}$ which satisfies:
$$\mathbb E[X] = 0 = \mu_X, \qquad \mathbb E[Y] = \mathbb E[\Lambda X + \mu + \epsilon] = \mu = \mu_Y,$$
$$\Sigma_{XX} = I, \qquad \Sigma_{XY} = \mathrm{Cov}(X, \Lambda X + \mu + \epsilon) = \mathrm{Cov}(X, \Lambda X) = \Lambda^T,$$
$$\Sigma_{YY} = \mathrm{Var}(\Lambda X + \epsilon) = \mathrm{Var}(\Lambda X) + \mathrm{Var}(\epsilon) = \Lambda\Lambda^T + \Psi.$$
Thanks to the results we know on exponential families (Gaussian conditioning), we know how to compute $X \mid Y$:
$$\mathbb E[X\mid Y=y] = \mu_X + \Sigma_{XY}\Sigma_{YY}^{-1}(y - \mu_Y),$$
$$\mathrm{Cov}[X\mid Y=y] = \Sigma_{XX} - \Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}.$$
In our case, we therefore have
$$\mathbb E[X\mid Y=y] = \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}(y-\mu),$$
$$\mathrm{Cov}[X\mid Y=y] = I - \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}\Lambda,$$
(a numerical sketch of this posterior computation is given at the end of this section).

To apply EM, one needs to write down the complete log-likelihood:
$$\log p(X, Y) \propto -\tfrac12 X^T X - \tfrac12 (Y - \Lambda X - \mu)^T \Psi^{-1} (Y - \Lambda X - \mu) - \tfrac12 \log\det\Psi.$$
Trap: $\mathbb E[XX^T\mid Y] \neq \mathrm{Cov}(X\mid Y)$. Rather,
$$\mathbb E[XX^T\mid Y] = \mathrm{Cov}(X\mid Y) + \mathbb E[X\mid Y]\,\mathbb E[X\mid Y]^T.$$

Remark 7.4  Since $\mathrm{Cov}(Y) = \Lambda\Lambda^T + \Psi$, our parameters are not identifiable: $\Lambda \to \Lambda R$ with $R$ a rotation gives the same results (in other words, a subspace has different orthonormal bases).

Why do we care about this probabilistic formulation?
1. A probabilistic interpretation allows us to model the problem in a finer way.
2. It is very flexible and therefore allows us to combine multiple models.
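A small numpy sketch of the posterior computation above, assuming $\Lambda$, the diagonal of $\Psi$ and $\mu$ are given (illustrative names):

```python
import numpy as np

def factor_posterior(y, Lambda, Psi_diag, mu):
    """Posterior mean and covariance of X given Y = y in the factor analysis model.

    Lambda:   (d, k) loading matrix
    Psi_diag: (d,)   diagonal of the noise covariance Psi
    mu:       (d,)   mean of Y
    """
    Sigma_YY = Lambda @ Lambda.T + np.diag(Psi_diag)
    G = Lambda.T @ np.linalg.inv(Sigma_YY)          # Lambda^T (Lambda Lambda^T + Psi)^{-1}
    mean = G @ (y - mu)                             # E[X | Y = y]
    cov = np.eye(Lambda.shape[1]) - G @ Lambda      # Cov[X | Y = y]
    return mean, cov
```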
