Cours 7 12th November 2014


Sum Product Algorithm and Hidden Markov Model 2014/2015
Cours 7, 12th November 2014
Enseignant: Francis Bach
Scribes: Pauline Luc, Mathieu Andreux

7.1 Sum Product Algorithm

7.1.1 Motivations

Inference, along with estimation and decoding, is one of the three key operations one must be able to perform efficiently in graphical models. Given a discrete Gibbs model of the form
$$p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),$$
where $\mathcal{C}$ is the set of cliques of the graph, inference enables:
- computation of the marginal $p(x_i)$, or more generally $p(x_C)$;
- computation of the partition function $Z$;
- computation of the conditional marginals $p(x_i \mid X_j = x_j, X_k = x_k)$;
and, as a consequence:
- computation of the gradient in an exponential family;
- computation of the expected value of the log-likelihood of an exponential family at the E step of the EM algorithm (for example for HMMs).

Example 1: Ising model. Let $X = (X_i)_{i \in V}$ be a vector of random variables taking values in $\{0,1\}^V$, whose distribution has the exponential form
$$p(x) = e^{-A(\eta)} \prod_{i \in V} e^{\eta_i x_i} \prod_{(i,j) \in E} e^{\eta_{i,j} x_i x_j}. \qquad (7.1)$$
The log-likelihood is then
$$\ell(\eta) = \sum_{i \in V} \eta_i x_i + \sum_{(i,j) \in E} \eta_{i,j} x_i x_j - A(\eta). \qquad (7.2)$$
We can therefore write the sufficient statistic
$$\phi(x) = \begin{pmatrix} (x_i)_{i \in V} \\ (x_i x_j)_{(i,j) \in E} \end{pmatrix}. \qquad (7.3)$$

But we have seen that for exponential families
$$\ell(\eta) = \phi(x)^T \eta - A(\eta), \qquad (7.4)$$
$$\nabla_\eta \ell(\eta) = \phi(x) - \underbrace{\nabla_\eta A(\eta)}_{\mathbb{E}_\eta[\phi(X)]}. \qquad (7.5)$$
We therefore need to compute $\mathbb{E}_\eta[\phi(X)]$. In the case of the Ising model, we get
$$\mathbb{E}_\eta[X_i] = \mathbb{P}_\eta[X_i = 1], \qquad (7.6)$$
$$\mathbb{E}_\eta[X_i X_j] = \mathbb{P}_\eta[X_i = 1, X_j = 1]. \qquad (7.7)$$
Equation (7.6) is one of the motivations for solving the inference problem: in order to compute the gradient of the log-likelihood, we need to know the marginal laws.

Example 2: Potts model. The $X_i$ are random variables taking values in $\{1, \dots, K_i\}$. We denote by $\delta_{ik}$ the random variable such that $\delta_{ik} = 1$ if and only if $X_i = k$. Then
$$p(\delta) = \exp\Big( \sum_{i \in V} \sum_{k=1}^{K_i} \eta_{i,k}\, \delta_{ik} + \sum_{(i,j) \in E} \sum_{k=1}^{K_i} \sum_{k'=1}^{K_j} \eta_{i,j,k,k'}\, \delta_{ik}\, \delta_{jk'} - A(\eta) \Big) \qquad (7.8)$$
and
$$\phi(\delta) = \begin{pmatrix} (\delta_{ik})_{i,k} \\ (\delta_{ik}\delta_{jk'})_{i,j,k,k'} \end{pmatrix}, \qquad (7.9)$$
$$\mathbb{E}_\eta[\delta_{ik}] = \mathbb{P}_\eta[X_i = k], \qquad (7.10)$$
$$\mathbb{E}_\eta[\delta_{ik}\delta_{jk'}] = \mathbb{P}_\eta[X_i = k, X_j = k']. \qquad (7.11)$$
These examples illustrate the need to perform inference.

Problem: in general, the inference problem is NP-hard. For trees, inference is efficient, as it is linear in $n$. For tree-like graphs we use the junction tree algorithm, which brings the situation back to that of a tree. In the general case we are forced to carry out approximate inference.
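To make the cost of naive inference concrete, here is a brute-force computation of a single-node marginal in a small pairwise Gibbs model, obtained by enumerating all $K^n$ configurations. This is only a toy sketch with hypothetical table names (psi_node, psi_edge), not something from the notes, and it is usable only for very small $n$.

```python
import itertools
import numpy as np

def brute_force_marginal(n, K, psi_node, psi_edge, edges, i):
    """Marginal p(x_i) of p(x) = (1/Z) prod_j psi_node[j][x_j] prod_{(a,b) in edges} psi_edge[(a,b)][x_a, x_b],
    computed by summing over all K**n configurations (complexity O(K^n))."""
    marg = np.zeros(K)
    for x in itertools.product(range(K), repeat=n):
        w = np.prod([psi_node[j][x[j]] for j in range(n)])
        w *= np.prod([psi_edge[(a, b)][x[a], x[b]] for (a, b) in edges])
        marg[x[i]] += w
    return marg / marg.sum()   # dividing by Z = marg.sum()
```

The sum-product algorithm described below reduces this exponential cost to a cost linear in $n$ on chains and trees.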

7.1.2 Inference on a chain

Let $X_i$, $i \in V = \{1, \dots, n\}$, be random variables taking values in $\{1, \dots, K\}$, with joint distribution
$$p(x) = \frac{1}{Z} \prod_{i=1}^{n} \psi_i(x_i) \prod_{i=2}^{n} \psi_{i-1,i}(x_{i-1}, x_i). \qquad (7.12)$$
We wish to compute $p(x_j)$ for a certain $j$. The naive solution would be to compute the marginal as
$$p(x_j) = \sum_{x_{V \setminus \{j\}}} p(x_1, \dots, x_n). \qquad (7.13)$$
Unfortunately, this type of calculation has complexity $O(K^n)$. We therefore develop the expression:
$$p(x_j) = \frac{1}{Z} \sum_{x_{V\setminus\{j\}}} \prod_{i=1}^n \psi_i(x_i) \prod_{i=2}^n \psi_{i-1,i}(x_{i-1},x_i) \qquad (7.14)$$
$$= \frac{1}{Z} \sum_{x_{V\setminus\{j\}}} \prod_{i=1}^{n-1} \psi_i(x_i) \prod_{i=2}^{n-1} \psi_{i-1,i}(x_{i-1},x_i)\; \psi_n(x_n)\,\psi_{n-1,n}(x_{n-1},x_n) \qquad (7.15)$$
$$= \frac{1}{Z} \sum_{x_{V\setminus\{j,n\}}} \prod_{i=1}^{n-1} \psi_i(x_i) \prod_{i=2}^{n-1} \psi_{i-1,i}(x_{i-1},x_i) \sum_{x_n} \psi_n(x_n)\,\psi_{n-1,n}(x_{n-1},x_n), \qquad (7.16)$$
which brings out the message passed by node $n$ to node $n-1$, $\mu_{n \to n-1}(x_{n-1})$. Continuing, we obtain
$$p(x_j) = \frac{1}{Z} \sum_{x_{V\setminus\{j,n\}}} \prod_{i=1}^{n-1} \psi_i(x_i) \prod_{i=2}^{n-1} \psi_{i-1,i}(x_{i-1},x_i)\; \underbrace{\sum_{x_n}\psi_n(x_n)\,\psi_{n-1,n}(x_{n-1},x_n)}_{\mu_{n\to n-1}(x_{n-1})} \qquad (7.17)$$
$$= \frac{1}{Z} \sum_{x_{V\setminus\{j,n,n-1\}}} \prod_{i=1}^{n-2} \psi_i(x_i) \prod_{i=2}^{n-2} \psi_{i-1,i}(x_{i-1},x_i)\; \underbrace{\sum_{x_{n-1}}\psi_{n-1}(x_{n-1})\,\psi_{n-2,n-1}(x_{n-2},x_{n-1})\,\mu_{n\to n-1}(x_{n-1})}_{\mu_{n-1\to n-2}(x_{n-2})} \qquad (7.18)$$
$$= \frac{1}{Z} \sum_{x_{V\setminus\{1,j,n,n-1\}}} \cdots\; \mu_{1\to 2}(x_2)\,\cdots\,\mu_{n-1\to n-2}(x_{n-2}), \qquad (7.19)$$
where messages are passed inwards towards $j$ from both ends of the chain.

In the above equations, we have implicitly used the following definitions for the descending and ascending messages:
$$\mu_{j \to j-1}(x_{j-1}) = \sum_{x_j} \psi_j(x_j)\,\psi_{j-1,j}(x_{j-1},x_j)\,\mu_{j+1 \to j}(x_j), \qquad (7.20)$$
$$\mu_{j \to j+1}(x_{j+1}) = \sum_{x_j} \psi_j(x_j)\,\psi_{j,j+1}(x_j,x_{j+1})\,\mu_{j-1 \to j}(x_j). \qquad (7.21)$$
Each of these messages is computed with complexity $O(K^2)$, so a full sweep of $n-1$ messages costs $O((n-1)K^2)$. Finally, we get
$$p(x_j) = \frac{1}{Z}\, \mu_{j-1 \to j}(x_j)\, \psi_j(x_j)\, \mu_{j+1 \to j}(x_j). \qquad (7.22)$$
With only $2(n-1)$ messages, we have calculated $p(x_j)$ for all $j \in V$. $Z$ is obtained by summing:
$$Z = \sum_{x_i} \mu_{i-1 \to i}(x_i)\, \psi_i(x_i)\, \mu_{i+1 \to i}(x_i). \qquad (7.23)$$
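The recursions (7.20)-(7.23) translate directly into code. Below is a minimal numpy sketch for a chain, assuming node potentials psi_node[i] (arrays of size K) and edge potentials psi_edge[i] (K x K tables for the pair (i, i+1)); these names are illustrative, not from the notes.

```python
import numpy as np

def chain_marginals(psi_node, psi_edge):
    """Exact marginals on a chain via the message recursions (7.20)-(7.23).

    psi_node: list of n arrays of shape (K,)     -- node potentials psi_i
    psi_edge: list of n-1 arrays of shape (K, K) -- edge potentials psi_{i,i+1}
    Returns (list of marginals p(x_i), partition function Z).
    """
    n, K = len(psi_node), len(psi_node[0])

    # Ascending messages mu_{i-1 -> i}, computed left to right (eq. 7.21).
    fwd = [np.ones(K) for _ in range(n)]   # fwd[i] = message into node i from the left
    for i in range(1, n):
        fwd[i] = psi_edge[i - 1].T @ (psi_node[i - 1] * fwd[i - 1])

    # Descending messages mu_{i+1 -> i}, computed right to left (eq. 7.20).
    bwd = [np.ones(K) for _ in range(n)]   # bwd[i] = message into node i from the right
    for i in range(n - 2, -1, -1):
        bwd[i] = psi_edge[i] @ (psi_node[i + 1] * bwd[i + 1])

    # Combine the two messages at each node (eq. 7.22) and read off Z (eq. 7.23).
    unnorm = [psi_node[i] * fwd[i] * bwd[i] for i in range(n)]
    Z = unnorm[0].sum()                    # the same value of Z at every node
    return [u / Z for u in unnorm], Z
```

With $2(n-1)$ matrix-vector products, this returns every marginal at once, matching the $O(nK^2)$ cost discussed above.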

Inference in undirected trees

We denote by $i$ the vertex whose marginal law $p(x_i)$ we want to compute, and we set $i$ to be the root of the tree. For every $j \in V$, we denote by $C(j)$ the set of children of $j$ and by $D(j)$ the set of descendants of $j$. The joint probability is
$$p(x) = \frac{1}{Z} \prod_{i\in V} \psi_i(x_i) \prod_{(i,j)\in E} \psi_{i,j}(x_i,x_j). \qquad (7.24)$$
For a tree with at least two vertices, we define by recursion
$$F(x_i, x_j, x_{D(j)}) \triangleq \psi_{i,j}(x_i,x_j)\,\psi_j(x_j) \prod_{k\in C(j)} F(x_j, x_k, x_{D(k)}). \qquad (7.25)$$
Then, by reformulating the marginal:
$$p(x_i) = \frac{1}{Z}\sum_{x_{V\setminus\{i\}}} \psi_i(x_i) \prod_{j\in C(i)} F(x_i, x_j, x_{D(j)}) \qquad (7.26)$$
$$= \frac{1}{Z}\,\psi_i(x_i) \prod_{j\in C(i)} \sum_{x_j, x_{D(j)}} F(x_i, x_j, x_{D(j)}) \qquad (7.27)$$
$$= \frac{1}{Z}\,\psi_i(x_i) \prod_{j\in C(i)} \sum_{x_j, x_{D(j)}} \psi_{i,j}(x_i,x_j)\,\psi_j(x_j) \prod_{k\in C(j)} F(x_j,x_k,x_{D(k)}) \qquad (7.28)$$
$$= \frac{1}{Z}\,\psi_i(x_i) \prod_{j\in C(i)} \sum_{x_j} \psi_{i,j}(x_i,x_j)\,\psi_j(x_j) \prod_{k\in C(j)} \underbrace{\sum_{x_k, x_{D(k)}} F(x_j,x_k,x_{D(k)})}_{\mu_{k\to j}(x_j)} \qquad (7.29)$$
$$= \frac{1}{Z}\,\psi_i(x_i) \prod_{j\in C(i)} \underbrace{\sum_{x_j} \psi_{i,j}(x_i,x_j)\,\psi_j(x_j) \prod_{k\in C(j)} \mu_{k\to j}(x_j)}_{\mu_{j\to i}(x_i)}. \qquad (7.30)$$
This leads us to the recurrence relation of the Sum Product Algorithm (SPA):
$$\mu_{j\to i}(x_i) = \sum_{x_j} \psi_{i,j}(x_i,x_j)\,\psi_j(x_j) \prod_{k\in C(j)} \mu_{k\to j}(x_j). \qquad (7.31)$$

Sum Product Algorithm (SPA)

Sequential SPA for a rooted tree

For a rooted tree with root $i$, the Sum Product Algorithm is written as follows (a code sketch is given after this list):
1. All the leaves $n$ send $\mu_{n\to\pi_n}(x_{\pi_n})$ to their parent $\pi_n$:
$$\mu_{n\to\pi_n}(x_{\pi_n}) = \sum_{x_n} \psi_n(x_n)\,\psi_{n,\pi_n}(x_n, x_{\pi_n}). \qquad (7.32)$$
2. Iteratively, at each step, all the nodes $k$ that have received messages from all their children send $\mu_{k\to\pi_k}(x_{\pi_k})$ to their parent.
3. At the root we have
$$p(x_i) = \frac{1}{Z}\,\psi_i(x_i) \prod_{j\in C(i)} \mu_{j\to i}(x_i). \qquad (7.33)$$
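Here is a compact recursive sketch of this collect phase, under assumed data structures (a children list for the rooted tree and dictionaries of potential tables); the names are illustrative only.

```python
import numpy as np

def collect_message(j, parent, children, psi_node, psi_edge):
    """Message mu_{j -> parent}(x_parent) of eq. (7.31), computed recursively.

    children[j]           : list of children of node j in the rooted tree
    psi_node[j]           : array (K,), node potential psi_j
    psi_edge[(parent, j)] : array (K, K), edge potential psi_{parent, j}(x_parent, x_j)
    """
    belief = psi_node[j].copy()
    for k in children[j]:
        belief *= collect_message(k, j, children, psi_node, psi_edge)
    # Sum over x_j; for a leaf this is exactly eq. (7.32).
    return psi_edge[(parent, j)] @ belief

def root_marginal(root, children, psi_node, psi_edge):
    """Marginal p(x_root) of eq. (7.33), obtained by collecting all messages at the root."""
    belief = psi_node[root].copy()
    for j in children[root]:
        belief *= collect_message(j, root, children, psi_node, psi_edge)
    return belief / belief.sum()   # belief.sum() is the partition function Z
```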

This algorithm only enables us to compute $p(x_i)$ at the root. To be able to compute all the marginals (as well as the conditional marginals), one must not only collect all the messages from the leaves to the root, but then also distribute them back from the root to the leaves. In fact, the algorithm can then be written independently of the choice of a root.

SPA for an undirected tree

The case of undirected trees is slightly different:
1. All the leaves $n$ send $\mu_{n\to\pi_n}(x_{\pi_n})$.
2. At each step, if a node $j$ has not yet sent a message to one of its neighbours, say $i$ (note: $i$ here is not necessarily the root), and if it has received messages from all its other neighbours $\mathcal N(j)\setminus\{i\}$, it sends to $i$ the message
$$\mu_{j\to i}(x_i) = \sum_{x_j} \psi_j(x_j)\,\psi_{j,i}(x_j, x_i) \prod_{k\in\mathcal N(j)\setminus\{i\}} \mu_{k\to j}(x_j). \qquad (7.34)$$

Parallel SPA (flooding)

1. Initialise the messages randomly.
2. At each step, each node sends a new message to each of its neighbours, using the messages received at the previous step.

Marginal laws

Once all messages have been passed, we can easily calculate all the marginal laws:
$$\forall i\in V,\quad p(x_i) = \frac{1}{Z}\,\psi_i(x_i) \prod_{k\in\mathcal N(i)} \mu_{k\to i}(x_i), \qquad (7.35)$$
$$\forall (i,j)\in E,\quad p(x_i, x_j) = \frac{1}{Z}\,\psi_i(x_i)\,\psi_j(x_j)\,\psi_{i,j}(x_i,x_j) \prod_{k\in\mathcal N(i)\setminus\{j\}} \mu_{k\to i}(x_i) \prod_{k\in\mathcal N(j)\setminus\{i\}} \mu_{k\to j}(x_j). \qquad (7.36)$$

Conditional probabilities

We can use a clever notation to calculate the conditional probabilities. Suppose that we want to compute
$$p(x_i \mid x_5 = 3, x_{10} = 2) \propto p(x_i, x_5 = 3, x_{10} = 2).$$
We can set
$$\tilde\psi_5(x_5) = \psi_5(x_5)\,\delta(x_5, 3).$$

Generally speaking, if we observe $X_j = x_j^0$ for $j \in J_{\mathrm{obs}}$, we can define the modified potentials
$$\tilde\psi_j(x_j) = \psi_j(x_j)\,\delta(x_j, x_j^0) \quad \text{for } j \in J_{\mathrm{obs}}, \qquad \tilde\psi_i = \psi_i \text{ otherwise},$$
such that
$$p(x \mid X_{J_{\mathrm{obs}}} = x_{J_{\mathrm{obs}}}^0) = \frac{1}{\tilde Z} \prod_{i\in V} \tilde\psi_i(x_i) \prod_{(i,j)\in E} \psi_{i,j}(x_i, x_j). \qquad (7.37)$$
Indeed, we have
$$p(x \mid X_{J_{\mathrm{obs}}} = x_{J_{\mathrm{obs}}}^0)\; p(X_{J_{\mathrm{obs}}} = x_{J_{\mathrm{obs}}}^0) = p(x) \prod_{j\in J_{\mathrm{obs}}} \delta(x_j, x_j^0), \qquad (7.38)$$
so that by dividing this equality by $p(X_{J_{\mathrm{obs}}} = x_{J_{\mathrm{obs}}}^0)$ we obtain the previous equation, with $\tilde Z = Z\, p(X_{J_{\mathrm{obs}}} = x_{J_{\mathrm{obs}}}^0)$. We then simply apply the SPA to these new potentials to compute the conditional marginal laws $p(x_i \mid X_{J_{\mathrm{obs}}} = x_{J_{\mathrm{obs}}}^0)$.

Remarks
- The SPA is also called belief propagation or message passing.
- On trees, it is an exact inference algorithm.
- If $G$ is not a tree, the algorithm does not in general converge to the right marginal laws, but it sometimes gives reasonable approximations. We then refer to loopy belief propagation, which is still often used in practice.
- The only property that we have used to construct the algorithm is the fact that $(\mathbb R, +, \times)$ is a semi-ring. It is interesting to notice that the same can therefore also be done with $(\mathbb R_+, \max, \times)$ and $(\mathbb R, \max, +)$.

Example. For $(\mathbb R_+, \max, \times)$ we obtain the max-product algorithm, also called the Viterbi algorithm, which enables us to solve the decoding problem, namely to compute the most probable configuration of the variables given fixed parameters, thanks to the messages
$$\mu_{j\to i}(x_i) = \max_{x_j} \Big[ \psi_{i,j}(x_i,x_j)\,\psi_j(x_j) \prod_{k\in C(j)} \mu_{k\to j}(x_j) \Big]. \qquad (7.39)$$
If we run the max-product algorithm with respect to a chosen root, the collection phase of the messages towards the root enables us to compute the maximal probability over all configurations; if at each computation of a message we have also kept the argmax, we can then perform a distribution phase which, instead of propagating the messages, recursively computes one of the configurations reaching the maximum. (A sketch on a chain is given below.)
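As an illustration, here is a minimal max-product (Viterbi) sketch on a chain, with the same assumed potential tables as before and back-pointers stored during the collection phase.

```python
import numpy as np

def viterbi_chain(psi_node, psi_edge):
    """Most probable configuration on a chain via max-product.

    psi_node: list of n arrays (K,); psi_edge: list of n-1 arrays (K, K).
    Returns (argmax configuration, unnormalized maximal potential).
    """
    n, K = len(psi_node), len(psi_node[0])
    # Collection phase towards node n-1: max-messages and argmax back-pointers.
    msg = np.ones(K)
    backptr = []
    for i in range(n - 1):
        scores = (psi_node[i] * msg)[:, None] * psi_edge[i]   # (K, K): rows x_i, columns x_{i+1}
        backptr.append(scores.argmax(axis=0))
        msg = scores.max(axis=0)
    # Distribution phase: pick the best value at the last node, then follow back-pointers.
    x = [0] * n
    x[n - 1] = int((psi_node[n - 1] * msg).argmax())
    best = float((psi_node[n - 1] * msg).max())
    for i in range(n - 2, -1, -1):
        x[i] = int(backptr[i][x[i + 1]])
    return x, best
```

Dividing the returned value by $Z$ would give the maximal probability itself; in practice one rather works in the $(\mathbb R, \max, +)$ semi-ring, i.e. with log-potentials, for the numerical reasons discussed next.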

In practice, we may be working with values so small that the computer returns numerical errors. For instance, for $k$ binary variables, the joint law $p(x_1, x_2, \dots, x_k)$, which sums to 1 over $2^k$ configurations, can take infinitesimally small values when $k$ is large. The solution is to work with logarithms: if $p = \sum_i p_i$, then by setting $a_i = \log(p_i)$ we have
$$\log(p) = \log\Big[\sum_i e^{a_i}\Big] = a^* + \log\Big[\sum_i e^{a_i - a^*}\Big], \qquad (7.40)$$
with $a^* = \max_i a_i$. Using logarithms in this way ensures numerical stability.
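Equation (7.40) is the usual log-sum-exp trick; a direct numpy transcription:

```python
import numpy as np

def logsumexp(a):
    """Compute log(sum_i exp(a_i)) stably, as in (7.40)."""
    a = np.asarray(a, dtype=float)
    a_star = a.max()                        # a* = max_i a_i
    return a_star + np.log(np.exp(a - a_star).sum())

# Example: log-probabilities around -1000 would underflow with a naive exp/sum/log.
log_p = np.array([-1000.0, -1000.5, -1001.0])
print(logsumexp(log_p))                     # finite, approximately -999.32
```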

Proof of the algorithm

We are going to prove that the SPA is correct by induction. In the case of two nodes, we have
$$p(x_1, x_2) = \frac{1}{Z}\,\psi_1(x_1)\,\psi_2(x_2)\,\psi_{1,2}(x_1, x_2).$$
Marginalizing, we obtain
$$p(x_1) = \frac{1}{Z}\,\psi_1(x_1) \underbrace{\sum_{x_2} \psi_{1,2}(x_1,x_2)\,\psi_2(x_2)}_{\mu_{2\to 1}(x_1)},$$
from which we deduce
$$p(x_1) = \frac{1}{Z}\,\psi_1(x_1)\,\mu_{2\to1}(x_1) \qquad\text{and}\qquad p(x_2) = \frac{1}{Z}\,\psi_2(x_2)\,\mu_{1\to2}(x_2).$$

We assume that the result is true for trees of size $n-1$, and we consider a tree $T$ of size $n$. Without loss of generality, we can assume that the nodes are numbered so that the $n$-th node is a leaf, and we call $\pi_n$ its parent (which is unique, the graph being a tree). The first message to be passed is
$$\mu_{n\to\pi_n}(x_{\pi_n}) = \sum_{x_n} \psi_n(x_n)\,\psi_{n,\pi_n}(x_n, x_{\pi_n}), \qquad (7.41)$$
and the last message to be passed is
$$\mu_{\pi_n\to n}(x_n) = \sum_{x_{\pi_n}} \psi_{\pi_n}(x_{\pi_n})\,\psi_{n,\pi_n}(x_n, x_{\pi_n}) \prod_{k\in\mathcal N(\pi_n)\setminus\{n\}} \mu_{k\to\pi_n}(x_{\pi_n}). \qquad (7.42)$$

We are going to construct a tree $\tilde T$ of size $n-1$, as well as a family of potentials, such that the $2(n-2)$ messages passed in $\tilde T$ (i.e., all the messages except for the first and the last) are equal to the corresponding $2(n-2)$ messages passed in $T$. We define the tree and the potentials as follows: $\tilde T = (\tilde V, \tilde E)$ with $\tilde V = \{1,\dots,n-1\}$ and $\tilde E = E\setminus\{\{n,\pi_n\}\}$ (i.e., it is the subtree corresponding to the first $n-1$ vertices). The potentials are all the same as those of $T$, except for the potential of $\pi_n$:
$$\tilde\psi_{\pi_n}(x_{\pi_n}) = \psi_{\pi_n}(x_{\pi_n})\,\mu_{n\to\pi_n}(x_{\pi_n}). \qquad (7.43)$$
The root is unchanged, and the topological order is also kept. We then obtain two important properties:

1) The product of the potentials of the tree of size $n-1$ is equal to
$$\frac{1}{Z}\,\tilde\psi_{\pi_n}(x_{\pi_n}) \prod_{i\neq n,\pi_n}\psi_i(x_i) \prod_{(i,j)\in E\setminus\{n,\pi_n\}}\psi_{i,j}(x_i,x_j)$$
$$= \frac{1}{Z} \prod_{i\neq n,\pi_n}\psi_i(x_i) \prod_{(i,j)\in E\setminus\{n,\pi_n\}}\psi_{i,j}(x_i,x_j) \sum_{x_n}\psi_n(x_n)\,\psi_{\pi_n}(x_{\pi_n})\,\psi_{n,\pi_n}(x_n,x_{\pi_n})$$
$$= \frac{1}{Z} \sum_{x_n} \prod_{i=1}^n \psi_i(x_i) \prod_{(i,j)\in E}\psi_{i,j}(x_i,x_j) = \sum_{x_n} p(x_1,\dots,x_{n-1},x_n) = p(x_1,\dots,x_{n-1}),$$
which shows that these new potentials define on $(X_1,\dots,X_{n-1})$ exactly the distribution induced by $p$ when marginalizing out $X_n$.

2) All of the messages passed in $\tilde T$ correspond to the messages passed in $T$ (except for the first and the last).

Now, with the induction hypothesis that the SPA is correct for trees of size $n-1$, we are going to show that it is correct for trees of size $n$. For nodes $i \neq n, \pi_n$, the result is immediate, as all the messages involved are the same:
$$\forall i\in V\setminus\{n,\pi_n\},\quad p(x_i) = \frac{1}{Z}\,\psi_i(x_i) \prod_{k\in\mathcal N(i)} \mu_{k\to i}(x_i). \qquad (7.44)$$
For the case $i = \pi_n$, we deduce:

$$p(x_{\pi_n}) = \frac{1}{Z}\,\tilde\psi_{\pi_n}(x_{\pi_n}) \prod_{k\in\tilde{\mathcal N}(\pi_n)} \mu_{k\to\pi_n}(x_{\pi_n}) \qquad (\text{product over the neighbours of } \pi_n \text{ in } \tilde T)$$
$$= \frac{1}{Z}\,\tilde\psi_{\pi_n}(x_{\pi_n}) \prod_{k\in\mathcal N(\pi_n)\setminus\{n\}} \mu_{k\to\pi_n}(x_{\pi_n})$$
$$= \frac{1}{Z}\,\psi_{\pi_n}(x_{\pi_n})\,\mu_{n\to\pi_n}(x_{\pi_n}) \prod_{k\in\mathcal N(\pi_n)\setminus\{n\}} \mu_{k\to\pi_n}(x_{\pi_n})$$
$$= \frac{1}{Z}\,\psi_{\pi_n}(x_{\pi_n}) \prod_{k\in\mathcal N(\pi_n)} \mu_{k\to\pi_n}(x_{\pi_n}).$$

For the case $i = n$, we have
$$p(x_n, x_{\pi_n}) = \sum_{x_{V\setminus\{n,\pi_n\}}} p(x) = \psi_n(x_n)\,\psi_{\pi_n}(x_{\pi_n})\,\psi_{n,\pi_n}(x_n,x_{\pi_n}) \underbrace{\sum_{x_{V\setminus\{n,\pi_n\}}} \frac{p(x)}{\psi_n(x_n)\,\psi_{\pi_n}(x_{\pi_n})\,\psi_{n,\pi_n}(x_n,x_{\pi_n})}}_{\alpha(x_{\pi_n})}.$$
Consequently:
$$p(x_n, x_{\pi_n}) = \psi_{\pi_n}(x_{\pi_n})\,\alpha(x_{\pi_n})\,\psi_n(x_n)\,\psi_{n,\pi_n}(x_n,x_{\pi_n}), \qquad (7.45)$$
$$p(x_{\pi_n}) = \psi_{\pi_n}(x_{\pi_n})\,\alpha(x_{\pi_n}) \underbrace{\sum_{x_n} \psi_n(x_n)\,\psi_{n,\pi_n}(x_n,x_{\pi_n})}_{\mu_{n\to\pi_n}(x_{\pi_n})}.$$
Hence:
$$\alpha(x_{\pi_n}) = \frac{p(x_{\pi_n})}{\psi_{\pi_n}(x_{\pi_n})\,\mu_{n\to\pi_n}(x_{\pi_n})}. \qquad (7.46)$$
By using (7.31), (7.32) and the previous result, we deduce that
$$p(x_n, x_{\pi_n}) = \psi_{\pi_n}(x_{\pi_n})\,\psi_n(x_n)\,\psi_{n,\pi_n}(x_n,x_{\pi_n})\, \frac{p(x_{\pi_n})}{\psi_{\pi_n}(x_{\pi_n})\,\mu_{n\to\pi_n}(x_{\pi_n})}$$
$$= \psi_{\pi_n}(x_{\pi_n})\,\psi_n(x_n)\,\psi_{n,\pi_n}(x_n,x_{\pi_n})\, \frac{\frac{1}{Z}\,\psi_{\pi_n}(x_{\pi_n}) \prod_{k\in\mathcal N(\pi_n)} \mu_{k\to\pi_n}(x_{\pi_n})}{\psi_{\pi_n}(x_{\pi_n})\,\mu_{n\to\pi_n}(x_{\pi_n})}$$
$$= \frac{1}{Z}\,\psi_{\pi_n}(x_{\pi_n})\,\psi_n(x_n)\,\psi_{n,\pi_n}(x_n,x_{\pi_n}) \prod_{k\in\mathcal N(\pi_n)\setminus\{n\}} \mu_{k\to\pi_n}(x_{\pi_n}).$$

By summing with respect to $x_{\pi_n}$, we get the result for $p(x_n)$:
$$p(x_n) = \sum_{x_{\pi_n}} p(x_n, x_{\pi_n}) = \frac{1}{Z}\,\psi_n(x_n)\,\mu_{\pi_n\to n}(x_n).$$

Proposition. Let $p \in \mathcal L(G)$, for $G = (V,E)$ a tree. Then we have
$$p(x_1,\dots,x_n) = \prod_{i\in V} p(x_i) \prod_{(i,j)\in E} \frac{p(x_i, x_j)}{p(x_i)\,p(x_j)}. \qquad (7.47)$$
Proof: we prove it by induction. The case $n = 1$ is trivial. Then, assuming that $n$ is a leaf, we can write $p(x_1,\dots,x_n) = p(x_1,\dots,x_{n-1})\, p(x_n \mid x_{\pi_n})$. But multiplying by
$$p(x_n \mid x_{\pi_n}) = \frac{p(x_n, x_{\pi_n})}{p(x_n)\,p(x_{\pi_n})}\; p(x_n)$$
boils down to adding the edge potential for $(n, \pi_n)$ and the node potential for the leaf $n$. The formula is hence verified by induction.

Junction tree

The junction tree algorithm is designed to tackle the problem of inference on general graphs. The idea is to look at a general graph from far away, so that it can be seen as a tree. By merging nodes, one will hopefully be able to build a tree. When this is not the case, one can also think of adding some edges to the graph (i.e., cast the present distribution into a larger set) to be able to build such a tree. The trap is that if one collapses too many nodes, the number of possible values per merged node will explode, and with it the complexity of the whole algorithm. The tree width is the smallest possible size of the largest merged clique (minus one, by the usual convention); for instance, for a 2D regular grid with $n$ points, the tree width is of order $\sqrt{n}$.

7.2 Hidden Markov Model (HMM)

The Hidden Markov Model ("modèle de Markov caché" in French) is one of the most widely used graphical models. We denote by $z_0, z_1, \dots, z_T$ the states corresponding to the latent variables, and by $y_0, y_1, \dots, y_T$ the states corresponding to the observed variables. We further assume that:
1. $\{z_0, \dots, z_T\}$ is a Markov chain (hence the name of the model);
2. $z_0, \dots, z_T$ take $K$ values;
3. $z_0$ follows a multinomial distribution: $p(z_0 = i) = (\pi_0)_i$ with $\sum_i (\pi_0)_i = 1$;

4. the transition probabilities are homogeneous: $p(z_t = i \mid z_{t-1} = j) = A_{ij}$, where $A$ satisfies $\sum_i A_{ij} = 1$;
5. the emission probabilities $p(y_t \mid z_t)$ are homogeneous, i.e. $p(y_t \mid z_t) = f(y_t, z_t)$;
6. the joint probability distribution function can be written as
$$p(z_0,\dots,z_T, y_0,\dots,y_T) = p(z_0) \prod_{t=0}^{T-1} p(z_{t+1}\mid z_t) \prod_{t=0}^{T} p(y_t\mid z_t).$$

There are different tasks that we want to perform on this model:
- Filtering: $p(z_t \mid y_0, \dots, y_t)$;
- Smoothing: $p(z_t \mid y_0, \dots, y_T)$;
- Decoding: $\max_{z_0,\dots,z_T} p(z_0,\dots,z_T \mid y_0,\dots,y_T)$.
All these tasks can be performed with a sum-product or max-product algorithm.

Sum-product

From now on, we denote the observations by $\bar y = \{\bar y_0, \dots, \bar y_T\}$. The distribution on $y_t$ simply becomes the delta function $\delta(y_t = \bar y_t)$. To use the sum-product algorithm, we define $z_T$ as the root, i.e. we send all forward messages towards $z_T$ and go back afterwards.

Forward:
$$\mu_{y_0\to z_0}(z_0) = \sum_{y_0} \delta(y_0 = \bar y_0)\,p(y_0\mid z_0) = p(\bar y_0\mid z_0),$$
$$\mu_{z_0\to z_1}(z_1) = \sum_{z_0} p(z_1\mid z_0)\,\mu_{y_0\to z_0}(z_0)\,\mu_{z_{-1}\to z_0}(z_0),$$
and more generally
$$\mu_{y_{t-1}\to z_{t-1}}(z_{t-1}) = \sum_{y_{t-1}} \delta(y_{t-1} = \bar y_{t-1})\,p(y_{t-1}\mid z_{t-1}) = p(\bar y_{t-1}\mid z_{t-1}),$$
$$\mu_{z_{t-1}\to z_t}(z_t) = \sum_{z_{t-1}} p(z_t\mid z_{t-1})\,\mu_{z_{t-2}\to z_{t-1}}(z_{t-1})\,\mu_{y_{t-1}\to z_{t-1}}(z_{t-1}).$$

Let us define the alpha-message $\alpha_t(z_t)$ as
$$\alpha_t(z_t) = \mu_{y_t\to z_t}(z_t)\,\mu_{z_{t-1}\to z_t}(z_t).$$
$\alpha_0(z_0)$ is initialized with the virtual message $\mu_{z_{-1}\to z_0}(z_0) = p(z_0)$, which also appears in the message $\mu_{z_0\to z_1}$ above.

Property 7.1
$$\alpha_t(z_t) = \mu_{y_t\to z_t}(z_t)\,\mu_{z_{t-1}\to z_t}(z_t) = p(z_t, \bar y_0, \dots, \bar y_t).$$
This is due to the definition of the messages: the product $\mu_{y_t\to z_t}(z_t)\,\mu_{z_{t-1}\to z_t}(z_t)$ represents a marginal of the distribution corresponding to the sub-HMM $\{z_0, \dots, z_t\}$.

Moreover, the following recursion formula for $\alpha_t(z_t)$ (called the alpha-recursion) holds:
$$\alpha_{t+1}(z_{t+1}) = p(\bar y_{t+1}\mid z_{t+1}) \sum_{z_t} p(z_{t+1}\mid z_t)\,\alpha_t(z_t).$$

Backward:
$$\mu_{z_{t+1}\to z_t}(z_t) = \sum_{z_{t+1}} p(z_{t+1}\mid z_t)\,\mu_{z_{t+2}\to z_{t+1}}(z_{t+1})\,p(\bar y_{t+1}\mid z_{t+1}).$$
We define the beta-message $\beta_t(z_t)$ as $\beta_t(z_t) = \mu_{z_{t+1}\to z_t}(z_t)$. As an initialization, we take $\beta_T(z_T) = 1$. The following recursion formula for $\beta_t(z_t)$ (called the beta-recursion) holds:
$$\beta_t(z_t) = \sum_{z_{t+1}} p(z_{t+1}\mid z_t)\,p(\bar y_{t+1}\mid z_{t+1})\,\beta_{t+1}(z_{t+1}).$$

Property
1. $p(z_t, \bar y_0, \dots, \bar y_T) = \alpha_t(z_t)\,\beta_t(z_t)$;
2. $p(\bar y_0, \dots, \bar y_T) = \sum_{z_t} \alpha_t(z_t)\,\beta_t(z_t)$;
3. $p(z_t \mid \bar y_0, \dots, \bar y_T) = \dfrac{p(z_t, \bar y_0,\dots,\bar y_T)}{p(\bar y_0,\dots,\bar y_T)} = \dfrac{\alpha_t(z_t)\,\beta_t(z_t)}{\sum_{z_t}\alpha_t(z_t)\,\beta_t(z_t)}$;
4. for all $t < T$, $p(z_t, z_{t+1} \mid \bar y_0,\dots,\bar y_T) = \dfrac{1}{p(\bar y_0,\dots,\bar y_T)}\, \alpha_t(z_t)\, \beta_{t+1}(z_{t+1})\, p(z_{t+1}\mid z_t)\, p(\bar y_{t+1}\mid z_{t+1})$.

Remark
1. The alpha-recursion and the beta-recursion are easy to implement (see the sketch below), but one needs to avoid errors in the indices!
2. The sums need to be coded using logs in order to prevent numerical errors.
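A minimal numpy sketch of the alpha/beta recursions, assuming a transition matrix A with A[i, j] = p(z_{t+1} = i | z_t = j), an initial distribution pi0, and a matrix obs_lik with obs_lik[t, i] = p(ȳ_t | z_t = i); these names are mine. For long sequences one would rescale or work in the log domain, as noted in the remark above.

```python
import numpy as np

def forward_backward(pi0, A, obs_lik):
    """Alpha/beta recursions for an HMM.

    pi0:     (K,)     initial distribution p(z_0)
    A:       (K, K)   transition matrix, A[i, j] = p(z_{t+1} = i | z_t = j)
    obs_lik: (T+1, K) observation likelihoods, obs_lik[t, i] = p(y_t | z_t = i)
    Returns (alpha, beta, gamma) with gamma[t] = p(z_t | y_0, ..., y_T).
    """
    T1, K = obs_lik.shape
    alpha = np.zeros((T1, K))
    beta = np.ones((T1, K))

    # Alpha-recursion: alpha_t(z_t) = p(z_t, y_0, ..., y_t).
    alpha[0] = pi0 * obs_lik[0]
    for t in range(T1 - 1):
        alpha[t + 1] = obs_lik[t + 1] * (A @ alpha[t])

    # Beta-recursion: beta_t(z_t) = p(y_{t+1}, ..., y_T | z_t), with beta_T = 1.
    for t in range(T1 - 2, -1, -1):
        beta[t] = A.T @ (obs_lik[t + 1] * beta[t + 1])

    likelihood = alpha[-1].sum()          # p(y_0, ..., y_T), property 2 at t = T
    gamma = alpha * beta / likelihood     # smoothing marginals, property 3
    return alpha, beta, gamma
```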

EM algorithm

With the previous notations and assumptions, we write the complete log-likelihood $\ell_c(\theta)$, where $\theta$ denotes the parameters of the model (containing $(\pi_0, A)$, but also the parameters of $f$):
$$\ell_c(\theta) = \log\Big( p(z_0) \prod_{t=0}^{T-1} p(z_{t+1}\mid z_t) \prod_{t=0}^{T} p(y_t\mid z_t) \Big)$$
$$= \log p(z_0) + \sum_{t=0}^{T-1} \log p(z_{t+1}\mid z_t) + \sum_{t=0}^{T} \log p(y_t\mid z_t)$$
$$= \sum_{i=1}^K \delta(z_0 = i)\,\log (\pi_0)_i + \sum_{t=0}^{T-1}\sum_{i,j=1}^K \delta(z_{t+1}=i, z_t=j)\,\log A_{i,j} + \sum_{t=0}^{T}\sum_{i=1}^K \delta(z_t=i)\,\log f(y_t, i).$$

When applying EM to estimate the parameters of this HMM, we use Jensen's inequality to obtain a lower bound on the log-likelihood:
$$\log p(y_0,\dots,y_T) \geq \mathbb E_q[\log p(z_0,\dots,z_T, y_0,\dots,y_T)] = \mathbb E_q[\ell_c(\theta)].$$
At the $k$-th expectation step, we use $q(z_0,\dots,z_T) = p(z_0,\dots,z_T \mid y_0,\dots,y_T; \theta_{k-1})$, and this boils down to applying the following rules:
$$\mathbb E[\delta(z_0 = i)\mid y] = p(z_0 = i\mid y; \theta_{k-1}),$$
$$\mathbb E[\delta(z_t = i)\mid y] = p(z_t = i\mid y; \theta_{k-1}),$$
$$\mathbb E[\delta(z_{t+1} = i, z_t = j)\mid y] = p(z_{t+1} = i, z_t = j\mid y; \theta_{k-1}).$$
Thus, in the former expression of the complete log-likelihood, we just have to replace $\delta(z_0 = i)$ by $p(z_0 = i\mid y;\theta_{k-1})$, and similarly for the other terms.

At the $k$-th maximization step, we maximize the resulting expression with respect to the parameters $\theta$ in the usual manner to obtain a new estimate $\theta_k$. The key point is that everything decouples, so the maximization is simple and can be done in closed form. (A sketch of the resulting updates is given below.)
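For concreteness, here is a minimal sketch of the closed-form M-step updates, assuming discrete emissions (the notes leave the emission family f generic); gamma and xi are the smoothing posteriors computed with the alpha/beta recursions above, and the variable names are mine.

```python
import numpy as np

def m_step(gamma, xi, y, n_symbols):
    """Closed-form M-step for an HMM, assuming discrete emissions.

    gamma: (T+1, K)    gamma[t, i] = p(z_t = i | y; theta_{k-1})
    xi:    (T, K, K)   xi[t, i, j] = p(z_{t+1} = i, z_t = j | y; theta_{k-1})
    y:     (T+1,) int  observed symbols in {0, ..., n_symbols - 1}
    Returns updated (pi0, A, B) with A[i, j] = p(z_{t+1} = i | z_t = j)
    and B[s, i] = p(y_t = s | z_t = i).
    """
    pi0 = gamma[0] / gamma[0].sum()
    # Transition update: expected counts of (j -> i), normalized per column j.
    A = xi.sum(axis=0)
    A /= A.sum(axis=0, keepdims=True)
    # Emission update: expected counts of symbol s emitted from state i.
    K = gamma.shape[1]
    B = np.zeros((n_symbols, K))
    for t, s in enumerate(y):
        B[s] += gamma[t]
    B /= gamma.sum(axis=0, keepdims=True)
    return pi0, A, B
```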

7.3 Principal Component Analysis (PCA)

Framework: $x_1, \dots, x_N \in \mathbb R^d$.
Goal: project the points onto the closest affine subspace.

Analysis view

Find $w \in \mathbb R^d$ such that $\mathrm{Var}(x^T w)$ is maximal, under the constraint $\|w\| = 1$.

With centered data, i.e. $\frac{1}{N}\sum_{n=1}^N x_n = 0$, the empirical variance is
$$\widehat{\mathrm{Var}}(x^T w) = \frac{1}{N}\sum_{n=1}^N (x_n^T w)^2 = \frac{1}{N}\, w^T (X^T X)\, w,$$
where $X \in \mathbb R^{N\times d}$ is the design matrix. In this case, $w$ is the eigenvector of $X^T X$ with the largest eigenvalue. It is not obvious a priori that this is the direction we care about.

If more than one direction is required, one can use deflation:
1. Find $w$.
2. Project the $x_n$ onto the orthogonal complement of $\mathrm{Vect}(w)$.
3. Start again.

Synthesis view

$$\min_{w}\; \sum_{n=1}^N d\big(x_n, \{x : w^T x = 0\}\big)^2 \quad\text{with } w\in\mathbb R^d,\ \|w\|=1.$$
Advantage: if one wants more than one dimension, it suffices to replace $\{x : w^T x = 0\}$ by any subspace. (A numerical sketch of the analysis view is given below.)
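A small numpy sketch of the analysis view: center the data and take the leading eigenvectors of $X^T X$, computed here through the SVD of the centered design matrix for numerical stability; keeping the top-$k$ right singular vectors plays the role of the deflation loop.

```python
import numpy as np

def pca_directions(X, k=1):
    """Top-k principal directions of the rows of X (N x d), analysis view."""
    Xc = X - X.mean(axis=0)                    # center the data
    # Right singular vectors of Xc = eigenvectors of X^T X with largest eigenvalues.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T                            # d x k matrix of unit-norm directions

# Usage: project centered data onto the first two principal directions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ np.diag([3.0, 1.0, 1.0, 0.5, 0.1])
W = pca_directions(X, k=2)
scores = (X - X.mean(axis=0)) @ W              # N x 2 coordinates
```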

Probabilistic approach: Factor Analysis

Model: $\Lambda = (\lambda_1, \dots, \lambda_k) \in \mathbb R^{d\times k}$,
$$X \in \mathbb R^k, \quad X \sim \mathcal N(0, I),$$
$$\epsilon \sim \mathcal N(0, \Psi), \quad \epsilon \in \mathbb R^d \text{ independent of } X, \text{ with } \Psi \text{ diagonal},$$
$$Y \in \mathbb R^d: \quad Y = \Lambda X + \mu + \epsilon.$$
We have $Y \mid X \sim \mathcal N(\Lambda X + \mu, \Psi)$. Problem: infer $X$ given $Y$.

$(X, Y)$ is a Gaussian vector on $\mathbb R^{k+d}$ which satisfies:
$$\mathbb E[X] = 0 = \mu_X, \qquad \mathbb E[Y] = \mathbb E[\Lambda X + \mu + \epsilon] = \mu = \mu_Y,$$
$$\Sigma_{XX} = I, \qquad \Sigma_{XY} = \mathrm{Cov}(X, \Lambda X + \mu + \epsilon) = \mathrm{Cov}(X, \Lambda X) = \Lambda^T,$$
$$\Sigma_{YY} = \mathrm{Var}(\Lambda X + \epsilon) = \mathrm{Var}(\Lambda X) + \mathrm{Var}(\epsilon) = \Lambda\Lambda^T + \Psi.$$
Thanks to the results we know on exponential families (Gaussian conditioning), we know how to compute $X \mid Y$:
$$\mathbb E[X\mid Y=y] = \mu_X + \Sigma_{XY}\Sigma_{YY}^{-1}(y - \mu_Y),$$
$$\mathrm{Cov}[X\mid Y=y] = \Sigma_{XX} - \Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}.$$
In our case, we therefore have
$$\mathbb E[X\mid Y=y] = \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}(y-\mu),$$
$$\mathrm{Cov}[X\mid Y=y] = I - \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}\Lambda,$$
(a numerical sketch of this posterior computation is given at the end of this section).

To apply EM, one needs to write down the complete log-likelihood:
$$\log p(X, Y) \propto -\tfrac12 X^T X - \tfrac12 (Y - \Lambda X - \mu)^T \Psi^{-1} (Y - \Lambda X - \mu) - \tfrac12 \log\det\Psi.$$
Trap: $\mathbb E[XX^T\mid Y] \neq \mathrm{Cov}(X\mid Y)$. Rather,
$$\mathbb E[XX^T\mid Y] = \mathrm{Cov}(X\mid Y) + \mathbb E[X\mid Y]\,\mathbb E[X\mid Y]^T.$$

Remark 7.4  Since $\mathrm{Cov}(Y) = \Lambda\Lambda^T + \Psi$, our parameters are not identifiable: $\Lambda \to \Lambda R$ with $R$ a rotation gives the same results (in other words, a subspace has different orthonormal bases).

Why do we care about this probabilistic formulation?
1. A probabilistic interpretation allows us to model the problem in a finer way.
2. It is very flexible and therefore allows us to combine multiple models.
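A small numpy sketch of the posterior computation above, assuming $\Lambda$, the diagonal of $\Psi$ and $\mu$ are given (illustrative names):

```python
import numpy as np

def factor_posterior(y, Lambda, Psi_diag, mu):
    """Posterior mean and covariance of X given Y = y in the factor analysis model.

    Lambda:   (d, k) loading matrix
    Psi_diag: (d,)   diagonal of the noise covariance Psi
    mu:       (d,)   mean of Y
    """
    Sigma_YY = Lambda @ Lambda.T + np.diag(Psi_diag)
    G = Lambda.T @ np.linalg.inv(Sigma_YY)          # Lambda^T (Lambda Lambda^T + Psi)^{-1}
    mean = G @ (y - mu)                             # E[X | Y = y]
    cov = np.eye(Lambda.shape[1]) - G @ Lambda      # Cov[X | Y = y]
    return mean, cov
```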
