
Sum Product Algorithm and Hidden Markov Model
2014/2015, Lecture 7, 12 November 2014
Lecturer: Francis Bach        Scribes: Pauline Luc, Mathieu Andreux

7.1 Sum Product Algorithm

7.1.1 Motivations

Inference is, along with estimation and decoding, one of the three key operations one must be able to perform efficiently in graphical models. Given a discrete Gibbs model of the form

    p(x) = (1/Z) ∏_{C ∈ 𝒞} ψ_C(x_C),

where 𝒞 is the set of cliques of the graph, inference enables:

- computation of the marginal p(x_i), or more generally p(x_C);
- computation of the partition function Z;
- computation of the conditional marginal p(x_i | X_j = x_j, X_k = x_k);

and, as a consequence:

- computation of the gradient in an exponential family;
- computation of the expected value of the log-likelihood of an exponential family at the E step of the EM algorithm (for example for HMMs).

Example 1: Ising model. Let X = (X_i)_{i ∈ V} be a vector of random variables taking values in {0,1}^V, whose distribution has the exponential form

    p(x) = e^{-A(η)} ∏_{i ∈ V} e^{η_i x_i} ∏_{(i,j) ∈ E} e^{η_{i,j} x_i x_j}.    (7.1)

The log-likelihood is

    ℓ(η) = ∑_{i ∈ V} η_i x_i + ∑_{(i,j) ∈ E} η_{i,j} x_i x_j - A(η),    (7.2)

so we can write the sufficient statistic

    φ(x) = ( (x_i)_{i ∈ V}, (x_i x_j)_{(i,j) ∈ E} ).    (7.3)

But we have seen that for exponential families

    ℓ(η) = φ(x)^T η - A(η),    (7.4)
    ∇_η ℓ(η) = φ(x) - ∇_η A(η),   with ∇_η A(η) = E_η[φ(X)].    (7.5)

We therefore need to compute E_η[φ(X)]. In the case of the Ising model, we get

    E_η[X_i] = P_η[X_i = 1],    (7.6)
    E_η[X_i X_j] = P_η[X_i = 1, X_j = 1].    (7.7)

Equation (7.6) is one of the motivations for solving the inference problem: in order to compute the gradient of the log-likelihood, we need to know the marginal laws.

Example 2: Potts model. The X_i are random variables taking values in {1, ..., K_i}. We write δ_{ik} for the random variable such that δ_{ik} = 1 if and only if X_i = k. Then

    p(δ) = exp( ∑_{i ∈ V} ∑_{k=1}^{K_i} η_{i,k} δ_{ik} + ∑_{(i,j) ∈ E} ∑_{k=1}^{K_i} ∑_{k'=1}^{K_j} η_{i,j,k,k'} δ_{ik} δ_{jk'} - A(η) ),    (7.8)

and

    φ(δ) = ( (δ_{ik})_{i,k}, (δ_{ik} δ_{jk'})_{i,j,k,k'} ),    (7.9)
    E_η[δ_{ik}] = P_η[X_i = k],    (7.10)
    E_η[δ_{ik} δ_{jk'}] = P_η[X_i = k, X_j = k'].    (7.11)

These examples illustrate the need to perform inference.

Problem: in general, the inference problem is NP-hard. For trees, inference is efficient: its cost is linear in n. For tree-like graphs we use the junction tree algorithm, which brings the situation back to that of a tree. In the general case we are forced to carry out approximate inference.
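To make the cost of naive inference concrete, here is a minimal NumPy sketch (the chain graph, the random parameter values and all names are our own illustration, not part of the lecture) that computes Z, E_η[X_i] and E_η[X_i X_j] for a tiny Ising model by enumerating all 2^n configurations, which is precisely the exponential cost that the algorithms below avoid.

```python
import itertools
import numpy as np

# Brute-force moments for a tiny Ising model (illustrative only).
# eta_node[i] and eta_edge[(i, j)] are the natural parameters of (7.1).
n = 4
edges = [(0, 1), (1, 2), (2, 3)]                     # a small chain graph
rng = np.random.default_rng(0)
eta_node = rng.normal(size=n)
eta_edge = {e: rng.normal() for e in edges}

def unnormalized(x):
    """exp(sum_i eta_i x_i + sum_{(i,j)} eta_ij x_i x_j), cf. (7.1)."""
    s = sum(eta_node[i] * x[i] for i in range(n))
    s += sum(eta_edge[(i, j)] * x[i] * x[j] for (i, j) in edges)
    return np.exp(s)

# Enumerate the 2^n configurations: exponential cost, feasible only for tiny n.
configs = list(itertools.product([0, 1], repeat=n))
weights = np.array([unnormalized(x) for x in configs])
Z = weights.sum()
probs = weights / Z

# E[X_i] = P(X_i = 1) and E[X_i X_j] = P(X_i = 1, X_j = 1), cf. (7.6)-(7.7).
E_Xi = np.array([sum(p for p, x in zip(probs, configs) if x[i] == 1) for i in range(n)])
E_XiXj = {e: sum(p for p, x in zip(probs, configs) if x[e[0]] == 1 and x[e[1]] == 1)
          for e in edges}
print(Z, E_Xi, E_XiXj)
```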

7.1.2 Inference on a chain

Let X_i be random variables taking values in {1, ..., K}, for i ∈ V = {1, ..., n}, with joint distribution

    p(x) = (1/Z) ∏_{i=1}^{n} ψ_i(x_i) ∏_{i=2}^{n} ψ_{i-1,i}(x_{i-1}, x_i).    (7.12)

We wish to compute p(x_j) for some j. The naive solution would be to compute the marginal

    p(x_j) = ∑_{x_{V\{j}}} p(x_1, ..., x_n).    (7.13)

Unfortunately, this computation has complexity O(K^n). We therefore rewrite the expression:

    p(x_j) = (1/Z) ∑_{x_{V\{j}}} ∏_{i=1}^{n} ψ_i(x_i) ∏_{i=2}^{n} ψ_{i-1,i}(x_{i-1}, x_i)    (7.14)
           = (1/Z) ∑_{x_{V\{j}}} ∏_{i=1}^{n-1} ψ_i(x_i) ∏_{i=2}^{n-1} ψ_{i-1,i}(x_{i-1}, x_i) ψ_n(x_n) ψ_{n-1,n}(x_{n-1}, x_n)    (7.15)
           = (1/Z) ∑_{x_{V\{j,n}}} ∏_{i=1}^{n-1} ψ_i(x_i) ∏_{i=2}^{n-1} ψ_{i-1,i}(x_{i-1}, x_i) ∑_{x_n} ψ_n(x_n) ψ_{n-1,n}(x_{n-1}, x_n),    (7.16)

which brings out the message passed by node n to node n-1, µ_{n→n-1}(x_{n-1}) = ∑_{x_n} ψ_n(x_n) ψ_{n-1,n}(x_{n-1}, x_n). Continuing in the same way, we obtain

    p(x_j) = (1/Z) ∑_{x_{V\{j,n}}} ∏_{i=1}^{n-1} ψ_i(x_i) ∏_{i=2}^{n-1} ψ_{i-1,i}(x_{i-1}, x_i) µ_{n→n-1}(x_{n-1})    (7.17)
           = (1/Z) ∑_{x_{V\{j,n,n-1}}} ∏_{i=1}^{n-2} ψ_i(x_i) ∏_{i=2}^{n-2} ψ_{i-1,i}(x_{i-1}, x_i) ∑_{x_{n-1}} ψ_{n-1}(x_{n-1}) ψ_{n-2,n-1}(x_{n-2}, x_{n-1}) µ_{n→n-1}(x_{n-1})    (7.18)
           = (1/Z) ∑_{x_{V\{1,j,n,n-1}}} ( ... ) µ_{1→2}(x_2) ... µ_{n-1→n-2}(x_{n-2}),    (7.19)

eliminating one variable at a time from each end of the chain until only x_j remains.

In the above derivation, we have implicitly used the following definitions for the descending and ascending messages:

    µ_{j→j-1}(x_{j-1}) = ∑_{x_j} ψ_j(x_j) ψ_{j-1,j}(x_{j-1}, x_j) µ_{j+1→j}(x_j),    (7.20)
    µ_{j→j+1}(x_{j+1}) = ∑_{x_j} ψ_j(x_j) ψ_{j,j+1}(x_j, x_{j+1}) µ_{j-1→j}(x_j).    (7.21)

Each of these messages is computed with complexity O(K^2), so the whole pass costs O((n-1)K^2). Finally, we get

    p(x_j) = (1/Z) µ_{j-1→j}(x_j) ψ_j(x_j) µ_{j+1→j}(x_j).    (7.22)

With only 2(n-1) messages, we have computed p(x_j) for all j ∈ V. Z is obtained by summing, for any i,

    Z = ∑_{x_i} µ_{i-1→i}(x_i) ψ_i(x_i) µ_{i+1→i}(x_i).    (7.23)
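As an illustration, here is a minimal NumPy sketch of this two-pass message scheme (the function name chain_marginals and the storage conventions are our own: the edge potential ψ_{i,i+1} is a K x K matrix indexed by (x_i, x_{i+1})).

```python
import numpy as np

def chain_marginals(psi_node, psi_edge):
    """Sum-product on a chain, following (7.20)-(7.23).

    psi_node[i]: array of shape (K,), the potential psi_i(x_i), i = 0..n-1.
    psi_edge[i]: array of shape (K, K), psi_{i,i+1}(x_i, x_{i+1}), i = 0..n-2.
    Returns (marginals, Z) with marginals[j] = p(x_j).
    """
    n, K = len(psi_node), psi_node[0].shape[0]
    # Ascending messages: mu_fwd[i] = mu_{i -> i+1}, cf. (7.21).
    mu_fwd = [np.ones(K) for _ in range(n)]
    for i in range(n - 1):
        incoming = mu_fwd[i - 1] if i > 0 else np.ones(K)
        mu_fwd[i] = psi_edge[i].T @ (psi_node[i] * incoming)      # sum over x_i
    # Descending messages: mu_bwd[i] = mu_{i -> i-1}, cf. (7.20).
    mu_bwd = [np.ones(K) for _ in range(n)]
    for i in range(n - 1, 0, -1):
        incoming = mu_bwd[i + 1] if i < n - 1 else np.ones(K)
        mu_bwd[i] = psi_edge[i - 1] @ (psi_node[i] * incoming)    # sum over x_i
    marginals = []
    for j in range(n):
        left = mu_fwd[j - 1] if j > 0 else np.ones(K)
        right = mu_bwd[j + 1] if j < n - 1 else np.ones(K)
        unnorm = left * psi_node[j] * right                       # (7.22) up to 1/Z
        Z = unnorm.sum()                                          # (7.23): same for every j
        marginals.append(unnorm / Z)
    return marginals, Z

# Example: a chain of 5 variables over K = 2 states with random positive potentials.
rng = np.random.default_rng(0)
psi_node = [rng.random(2) + 0.1 for _ in range(5)]
psi_edge = [rng.random((2, 2)) + 0.1 for _ in range(4)]
marginals, Z = chain_marginals(psi_node, psi_edge)
```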

7.1.3 Inference in undirected trees

We denote by i the vertex whose marginal law p(x_i) we want to compute, and we take i to be the root of the tree. For every j ∈ V, we write C(j) for the set of children of j and D(j) for the set of descendants of j. The joint probability is

    p(x) = (1/Z) ∏_{i ∈ V} ψ_i(x_i) ∏_{(i,j) ∈ E} ψ_{i,j}(x_i, x_j).    (7.24)

For a tree with at least two vertices, we define by recurrence

    F(x_i, x_j, x_{D(j)}) = ψ_{i,j}(x_i, x_j) ψ_j(x_j) ∏_{k ∈ C(j)} F(x_j, x_k, x_{D(k)}).    (7.25)

Reformulating the marginal then gives

    p(x_i) = (1/Z) ∑_{x_{V\{i}}} ψ_i(x_i) ∏_{j ∈ C(i)} F(x_i, x_j, x_{D(j)})    (7.26)
           = (1/Z) ψ_i(x_i) ∏_{j ∈ C(i)} ∑_{x_j, x_{D(j)}} F(x_i, x_j, x_{D(j)})    (7.27)
           = (1/Z) ψ_i(x_i) ∏_{j ∈ C(i)} ∑_{x_j, x_{D(j)}} ψ_{i,j}(x_i, x_j) ψ_j(x_j) ∏_{k ∈ C(j)} F(x_j, x_k, x_{D(k)})    (7.28)
           = (1/Z) ψ_i(x_i) ∏_{j ∈ C(i)} ∑_{x_j} ψ_{i,j}(x_i, x_j) ψ_j(x_j) ∏_{k ∈ C(j)} [ ∑_{x_k, x_{D(k)}} F(x_j, x_k, x_{D(k)}) ]    (7.29)
           = (1/Z) ψ_i(x_i) ∏_{j ∈ C(i)} ∑_{x_j} ψ_{i,j}(x_i, x_j) ψ_j(x_j) ∏_{k ∈ C(j)} µ_{k→j}(x_j),    (7.30)

where in (7.29) the bracketed term defines the message µ_{k→j}(x_j), and the whole inner sum in (7.30) defines the message µ_{j→i}(x_i). This leads us to the recurrence relation of the Sum Product Algorithm (SPA):

    µ_{j→i}(x_i) = ∑_{x_j} ψ_{i,j}(x_i, x_j) ψ_j(x_j) ∏_{k ∈ C(j)} µ_{k→j}(x_j).    (7.31)

7.1.4 Sum Product Algorithm (SPA)

Sequential SPA for a rooted tree

For a rooted tree with root i, the Sum Product Algorithm proceeds as follows:

1. Every leaf n sends to its parent π_n the message

    µ_{n→π_n}(x_{π_n}) = ∑_{x_n} ψ_n(x_n) ψ_{n,π_n}(x_n, x_{π_n}).    (7.32)

2. Iteratively, at each step, every node k that has received messages from all its children sends µ_{k→π_k}(x_{π_k}) to its parent.

3. At the root we have

    p(x_i) = (1/Z) ψ_i(x_i) ∏_{j ∈ C(i)} µ_{j→i}(x_i).    (7.33)

This algorithm only lets us compute p(x_i) at the root. To compute all the marginals (as well as the conditional marginals), one must not only collect all the messages from the leaves to the root, but then also distribute them back to the leaves. The algorithm can in fact be written independently of the choice of a root.

SPA for an undirected tree

The case of undirected trees is slightly different:

1. Every leaf n sends µ_{n→π_n}(x_{π_n}) to its unique neighbour π_n.

2. At each step, if a node j has not yet sent a message to one of its neighbours, say i (here i need not be the root), and if it has received messages from all its other neighbours N(j)\{i}, it sends to i the message

    µ_{j→i}(x_i) = ∑_{x_j} ψ_j(x_j) ψ_{j,i}(x_j, x_i) ∏_{k ∈ N(j)\{i}} µ_{k→j}(x_j).    (7.34)

Parallel SPA (flooding)

1. Initialise the messages randomly.

2. At each step, each node sends a new message to each of its neighbours, using the messages received at the previous step.

Marginal laws

Once all messages have been passed, we can easily compute all the marginal laws:

    for all i ∈ V,       p(x_i) = (1/Z) ψ_i(x_i) ∏_{k ∈ N(i)} µ_{k→i}(x_i),    (7.35)

    for all (i,j) ∈ E,   p(x_i, x_j) = (1/Z) ψ_i(x_i) ψ_j(x_j) ψ_{i,j}(x_i, x_j) ∏_{k ∈ N(i)\{j}} µ_{k→i}(x_i) ∏_{k ∈ N(j)\{i}} µ_{k→j}(x_j).    (7.36)
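Below is a compact recursive sketch of this message passing on an undirected tree (our own illustration: the graph is given as an adjacency list, messages are cached so that each of the 2(n-1) directed messages is computed exactly once, and the name tree_marginals is hypothetical).

```python
import numpy as np

def tree_marginals(adj, psi_node, psi_edge):
    """Sum-product on an undirected tree: node marginals via (7.34)-(7.35).

    adj: dict node -> list of neighbours (must describe a tree).
    psi_node[i]: array of shape (K_i,).
    psi_edge[(i, j)]: array of shape (K_i, K_j); each edge stored in one orientation.
    """
    def edge_pot(j, i):
        # psi_{j,i}(x_j, x_i), whatever orientation the edge was stored in.
        return psi_edge[(j, i)] if (j, i) in psi_edge else psi_edge[(i, j)].T

    messages = {}                            # messages[(j, i)] = mu_{j -> i}(x_i)

    def send(j, i):
        """Message from j to i, using messages from all other neighbours of j, cf. (7.34)."""
        if (j, i) not in messages:
            prod = psi_node[j].copy()
            for k in adj[j]:
                if k != i:
                    prod *= send(k, j)       # recursion terminates because the graph is a tree
            messages[(j, i)] = edge_pot(j, i).T @ prod          # sum over x_j
        return messages[(j, i)]

    marginals = {}
    for i in adj:
        unnorm = psi_node[i].copy()
        for j in adj[i]:
            unnorm *= send(j, i)
        Z = unnorm.sum()                     # the same value for every node
        marginals[i] = unnorm / Z            # (7.35)
    return marginals, Z

# Example: a star-shaped tree 0-1, 0-2, 0-3 with binary variables.
rng = np.random.default_rng(1)
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
psi_node = {i: rng.random(2) + 0.1 for i in adj}
psi_edge = {(0, j): rng.random((2, 2)) + 0.1 for j in (1, 2, 3)}
marginals, Z = tree_marginals(adj, psi_node, psi_edge)
```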

Conditional probabilities

A convenient trick lets us compute conditional probabilities with the same algorithm. Suppose that we want to compute p(x_i | x_5 = 3, x_{10} = 2). We can work with p(x_i, x_5 = 3, x_{10} = 2) by replacing the potential of node 5 with

    ψ̃_5(x_5) = ψ_5(x_5) δ(x_5, 3),

and similarly for node 10. Generally speaking, if we observe X_j = x_{j,0} for j ∈ J_obs, we can define the modified potentials

    ψ̃_j(x_j) = ψ_j(x_j) δ(x_j, x_{j,0}),

with ψ̃_i = ψ_i for i ∉ J_obs, such that

    p(x | X_{J_obs} = x_{J_obs,0}) = (1/Z̃) ∏_{i ∈ V} ψ̃_i(x_i) ∏_{(i,j) ∈ E} ψ_{i,j}(x_i, x_j).    (7.37)

Indeed, we have

    p(x | X_{J_obs} = x_{J_obs,0}) p(X_{J_obs} = x_{J_obs,0}) = p(x) ∏_{j ∈ J_obs} δ(x_j, x_{j,0}),    (7.38)

so that, dividing this equality by p(X_{J_obs} = x_{J_obs,0}), we obtain the previous equation with Z̃ = Z p(X_{J_obs} = x_{J_obs,0}). We then simply apply the SPA to these new potentials to compute the marginal laws p(x_i | X_{J_obs} = x_{J_obs,0}).

7.1.5 Remarks

- The SPA is also called belief propagation or message passing.
- On trees, it is an exact inference algorithm. If G is not a tree, the algorithm does not converge in general to the right marginal laws, but it sometimes gives reasonable approximations; one then speaks of loopy belief propagation, which is still widely used in practice.
- The only property of (R, +, ×) that we have used to construct the algorithm is that it is a semi-ring. It is interesting to notice that the same construction therefore also works with (R_+, max, ×) and (R, max, +).

Example. With (R_+, max, ×) we obtain the Max-Product algorithm, also called the Viterbi algorithm, which solves the decoding problem, namely computing the most probable configuration of the variables given fixed parameters, thanks to the messages

    µ_{j→i}(x_i) = max_{x_j} [ ψ_{i,j}(x_i, x_j) ψ_j(x_j) ∏_{k ∈ C(j)} µ_{k→j}(x_j) ].    (7.39)

If we run the Max-Product algorithm with respect to a chosen root, the collection phase of the messages towards the root gives the maximal probability over all configurations; if, at each message computation, we have also kept the argmax, we can then perform a distribution phase which, instead of propagating messages, recursively reconstructs one of the configurations attaining the maximum.
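Here is a minimal sketch of this max-product decoding on a chain (our own example, in the same setting as the chain code above): the collection phase keeps, for each message, the argmax table, and the distribution phase follows these backpointers to recover one maximizing configuration.

```python
import numpy as np

def chain_map_decoding(psi_node, psi_edge):
    """Max-product (Viterbi) on a chain: returns one maximizing configuration.

    psi_node[i]: shape (K,); psi_edge[i]: shape (K, K) for the pair (i, i+1).
    Messages are as in (7.39), with sums replaced by maxima; the argmax is
    stored at each step so that a backward pass can recover a maximizer.
    """
    n, K = len(psi_node), psi_node[0].shape[0]
    msg = np.ones(K)          # msg[x_{i+1}] = max over x_0..x_i of the factors seen so far
    back = []                 # back[i][x_{i+1}] = best x_i given x_{i+1}
    for i in range(n - 1):
        scores = (psi_node[i] * msg)[:, None] * psi_edge[i]     # shape (K_i, K_{i+1})
        back.append(scores.argmax(axis=0))
        msg = scores.max(axis=0)
    # Root (last node): pick the best final state, then follow the backpointers.
    final = psi_node[n - 1] * msg
    x = [int(final.argmax())]
    for i in range(n - 2, -1, -1):
        x.append(int(back[i][x[-1]]))
    return x[::-1], final.max()

# Example with 6 variables over 3 states.
rng = np.random.default_rng(2)
node_pot = [rng.random(3) + 0.1 for _ in range(6)]
edge_pot = [rng.random((3, 3)) + 0.1 for _ in range(5)]
x_star, max_value = chain_map_decoding(node_pot, edge_pot)      # max_value = Z * max_x p(x)
```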

In practice, we may be working with such small values that the computer underflows. For instance, for n binary variables, the joint law p(x_1, ..., x_n) takes 2^n values which sum to 1, so that for large n most of them are infinitesimally small. The solution is to work with logarithms: if p = ∑_i p_i, then setting a_i = log(p_i) we have

    log(p) = log( ∑_i e^{a_i} ) = a* + log( ∑_i e^{a_i - a*} ),    (7.40)

with a* = max_i a_i. Using logarithms in this way ensures numerical stability.
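A direct transcription of (7.40), together with an illustration of the underflow it avoids (the numerical values are our own):

```python
import numpy as np

def logsumexp(a):
    """Stable computation of log(sum_i exp(a_i)), as in (7.40)."""
    a = np.asarray(a, dtype=float)
    a_star = a.max()
    return a_star + np.log(np.exp(a - a_star).sum())

# Example: the naive computation underflows, the stable one does not.
log_p = np.array([-1000.0, -1001.0, -1002.0])        # logs of three tiny probabilities
naive = np.log(np.exp(log_p).sum())                  # -inf (exp underflows to 0)
stable = logsumexp(log_p)                            # about -999.59
```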

7.1.6 Proof of the algorithm

We prove by recurrence (induction on the number of nodes) that the SPA is correct. In the case of two nodes, we have

    p(x_1, x_2) = (1/Z) ψ_1(x_1) ψ_2(x_2) ψ_{1,2}(x_1, x_2).

Marginalizing, we obtain

    p(x_1) = (1/Z) ψ_1(x_1) ∑_{x_2} ψ_{1,2}(x_1, x_2) ψ_2(x_2) = (1/Z) ψ_1(x_1) µ_{2→1}(x_1),

and symmetrically p(x_2) = (1/Z) ψ_2(x_2) µ_{1→2}(x_2), so the result holds.

We now assume that the result is true for trees of size n-1 and consider a tree T of size n. Without loss of generality, we can assume that the nodes are numbered so that node n is a leaf, and we call π_n its parent (which is unique, the graph being a tree). The first message to be passed is

    µ_{n→π_n}(x_{π_n}) = ∑_{x_n} ψ_n(x_n) ψ_{n,π_n}(x_n, x_{π_n}),    (7.41)

and the last message to be passed is

    µ_{π_n→n}(x_n) = ∑_{x_{π_n}} ψ_{π_n}(x_{π_n}) ψ_{n,π_n}(x_n, x_{π_n}) ∏_{k ∈ N(π_n)\{n}} µ_{k→π_n}(x_{π_n}).    (7.42)

We are going to construct a tree T̃ of size n-1, together with a family of potentials, such that the 2(n-2) messages passed in T̃ (i.e. all the messages except for the first and the last) are equal to the corresponding messages passed in T. We define the tree and the potentials as follows: T̃ = (Ṽ, Ẽ) with Ṽ = {1, ..., n-1} and Ẽ = E\{(n, π_n)} (i.e., it is the subtree spanned by the first n-1 vertices). The potentials are the same as those of T, except for the potential of π_n, which becomes

    ψ̃_{π_n}(x_{π_n}) = ψ_{π_n}(x_{π_n}) µ_{n→π_n}(x_{π_n}).    (7.43)

The root is unchanged, and the topological order is also kept. We then obtain two important properties.

1) The product of the potentials of the tree of size n-1 satisfies

    (1/Z) ψ̃_{π_n}(x_{π_n}) ∏_{i ≠ n, π_n} ψ_i(x_i) ∏_{(i,j) ∈ E\{(n,π_n)}} ψ_{i,j}(x_i, x_j)
        = (1/Z) ∏_{i ≠ n, π_n} ψ_i(x_i) ∏_{(i,j) ∈ E\{(n,π_n)}} ψ_{i,j}(x_i, x_j) ∑_{x_n} ψ_n(x_n) ψ_{π_n}(x_{π_n}) ψ_{n,π_n}(x_n, x_{π_n})
        = (1/Z) ∑_{x_n} ∏_{i=1}^{n} ψ_i(x_i) ∏_{(i,j) ∈ E} ψ_{i,j}(x_i, x_j)
        = ∑_{x_n} p(x_1, ..., x_{n-1}, x_n) = p(x_1, ..., x_{n-1}),

which shows that these new potentials define on (X_1, ..., X_{n-1}) exactly the distribution induced by p when marginalizing out X_n (with the same normalization constant Z).

2) All the messages passed in T̃ coincide with the corresponding messages passed in T (except for the first and the last).

Now, using the recurrence hypothesis that the SPA is correct for trees of size n-1, we show that it is correct for trees of size n. For the nodes i ≠ n, π_n, the result is immediate, as all the messages involved are the same:

    for all i ∈ V\{n, π_n},    p(x_i) = (1/Z) ψ_i(x_i) ∏_{k ∈ N(i)} µ_{k→i}(x_i).    (7.44)

For the case i = π_n, we deduce:

    p(x_{π_n}) = (1/Z) ψ̃_{π_n}(x_{π_n}) ∏_{k ∈ Ñ(π_n)} µ_{k→π_n}(x_{π_n})     (product over the neighbours of π_n in T̃)
               = (1/Z) ψ̃_{π_n}(x_{π_n}) ∏_{k ∈ N(π_n)\{n}} µ_{k→π_n}(x_{π_n})
               = (1/Z) ψ_{π_n}(x_{π_n}) µ_{n→π_n}(x_{π_n}) ∏_{k ∈ N(π_n)\{n}} µ_{k→π_n}(x_{π_n})
               = (1/Z) ψ_{π_n}(x_{π_n}) ∏_{k ∈ N(π_n)} µ_{k→π_n}(x_{π_n}).

For the case i = n, we have

    p(x_n, x_{π_n}) = ∑_{x_{V\{n,π_n}}} p(x) = ψ_n(x_n) ψ_{π_n}(x_{π_n}) ψ_{n,π_n}(x_n, x_{π_n}) ∑_{x_{V\{n,π_n}}} p(x) / [ ψ_n(x_n) ψ_{π_n}(x_{π_n}) ψ_{n,π_n}(x_n, x_{π_n}) ],

and the remaining sum, which we denote α(x_{π_n}), depends only on x_{π_n}. Consequently,

    p(x_n, x_{π_n}) = ψ_{π_n}(x_{π_n}) α(x_{π_n}) ψ_n(x_n) ψ_{n,π_n}(x_n, x_{π_n}),    (7.45)

    p(x_{π_n}) = ψ_{π_n}(x_{π_n}) α(x_{π_n}) ∑_{x_n} ψ_n(x_n) ψ_{n,π_n}(x_n, x_{π_n}) = ψ_{π_n}(x_{π_n}) α(x_{π_n}) µ_{n→π_n}(x_{π_n}).

Hence

    α(x_{π_n}) = p(x_{π_n}) / [ ψ_{π_n}(x_{π_n}) µ_{n→π_n}(x_{π_n}) ].    (7.46)

Using (7.45), (7.46) and the previous expression for p(x_{π_n}), we deduce that

    p(x_n, x_{π_n}) = ψ_{π_n}(x_{π_n}) ψ_n(x_n) ψ_{n,π_n}(x_n, x_{π_n}) p(x_{π_n}) / [ ψ_{π_n}(x_{π_n}) µ_{n→π_n}(x_{π_n}) ]
                    = ψ_{π_n}(x_{π_n}) ψ_n(x_n) ψ_{n,π_n}(x_n, x_{π_n}) (1/Z) ψ_{π_n}(x_{π_n}) ∏_{k ∈ N(π_n)} µ_{k→π_n}(x_{π_n}) / [ ψ_{π_n}(x_{π_n}) µ_{n→π_n}(x_{π_n}) ]
                    = (1/Z) ψ_{π_n}(x_{π_n}) ψ_n(x_n) ψ_{n,π_n}(x_n, x_{π_n}) ∏_{k ∈ N(π_n)\{n}} µ_{k→π_n}(x_{π_n}).

Summing with respect to x_{π_n}, we get the result for p(x_n):

    p(x_n) = ∑_{x_{π_n}} p(x_n, x_{π_n}) = (1/Z) ψ_n(x_n) µ_{π_n→n}(x_n).

This concludes the recurrence.

Proposition. Let p ∈ L(G), for G = (V, E) a tree. Then

    p(x_1, ..., x_n) = ∏_{i ∈ V} p(x_i) ∏_{(i,j) ∈ E} p(x_i, x_j) / [ p(x_i) p(x_j) ].    (7.47)

Proof: by recurrence. The case n = 1 is trivial. Then, assuming that node n is a leaf, we can write p(x_1, ..., x_n) = p(x_1, ..., x_{n-1}) p(x_n | x_{π_n}). But multiplying by p(x_n | x_{π_n}) = [ p(x_n, x_{π_n}) / (p(x_n) p(x_{π_n})) ] p(x_n) boils down to adding the edge potential for (n, π_n) and the node potential for the leaf n. The formula is hence verified by recurrence.

7.1.7 Junction tree

The junction tree algorithm is designed to tackle the problem of inference on general graphs. The idea is to look at a general graph from far away, where it can be seen as a tree: by merging nodes, one will hopefully be able to build a tree. When this is not possible, one can also add edges to the graph (i.e., cast the present distribution into a larger family) so as to be able to build such a tree. The trap is that if one collapses too many nodes, the number of possible values of the merged variables explodes, and with it the complexity of the whole algorithm. The tree width is the smallest maximal clique size that such a construction can achieve; for instance, for a 2D regular grid with n points, the tree width is of order √n.

7.2 Hidden Markov Model (HMM)

The Hidden Markov Model ("modèle de Markov caché" in French) is one of the most widely used graphical models. We denote by z_0, z_1, ..., z_T the states of the latent variables, and by y_0, y_1, ..., y_T the corresponding observed variables. We further assume that:

1. {z_0, ..., z_T} is a Markov chain (hence the name of the model);
2. z_0, ..., z_T take K values;
3. z_0 follows a multinomial distribution: p(z_0 = i) = (π_0)_i, with ∑_i (π_0)_i = 1;

4. the transition probabilities are homogeneous: p(z_t = i | z_{t-1} = j) = A_{ij}, where A satisfies ∑_i A_{ij} = 1;
5. the emission probabilities p(y_t | z_t) are homogeneous, i.e. p(y_t | z_t) = f(y_t, z_t);
6. the joint probability distribution can be written as

    p(z_0, ..., z_T, y_0, ..., y_T) = p(z_0) ∏_{t=0}^{T-1} p(z_{t+1} | z_t) ∏_{t=0}^{T} p(y_t | z_t).

There are different tasks that we want to perform on this model:

- Filtering: compute p(z_t | y_0, ..., y_t);
- Smoothing: compute p(z_t | y_0, ..., y_T);
- Decoding: compute max_{z_0,...,z_T} p(z_0, ..., z_T | y_0, ..., y_T).

All these tasks can be performed with a sum-product or max-product algorithm.

Sum-product

From now on, we denote the observations ȳ = {ȳ_0, ..., ȳ_T}. The distribution on y_t simply becomes the delta function δ(y_t = ȳ_t). To use the sum-product algorithm, we take z_T as the root, i.e. we send all forward messages towards z_T and go back afterwards.

Forward:

    µ_{y_0→z_0}(z_0) = ∑_{y_0} δ(y_0 = ȳ_0) p(y_0 | z_0) = p(ȳ_0 | z_0)
    µ_{z_0→z_1}(z_1) = ∑_{z_0} p(z_1 | z_0) µ_{y_0→z_0}(z_0)
    ...
    µ_{y_{t-1}→z_{t-1}}(z_{t-1}) = ∑_{y_{t-1}} δ(y_{t-1} = ȳ_{t-1}) p(y_{t-1} | z_{t-1}) = p(ȳ_{t-1} | z_{t-1})
    µ_{z_{t-1}→z_t}(z_t) = ∑_{z_{t-1}} p(z_t | z_{t-1}) µ_{z_{t-2}→z_{t-1}}(z_{t-1}) p(ȳ_{t-1} | z_{t-1})

Let us define the alpha-message α_t(z_t) as

    α_t(z_t) = µ_{y_t→z_t}(z_t) µ_{z_{t-1}→z_t}(z_t),

where α_0(z_0) is initialized with the virtual message µ_{z_{-1}→z_0}(z_0) = p(z_0).

Property 7.1

    α_t(z_t) = µ_{y_t→z_t}(z_t) µ_{z_{t-1}→z_t}(z_t) = p(z_t, ȳ_0, ..., ȳ_t).

This follows from the definition of the messages: the product µ_{y_t→z_t}(z_t) µ_{z_{t-1}→z_t}(z_t) represents a marginal of the distribution corresponding to the sub-HMM {z_0, ..., z_t}. Moreover, the following recursion for α_t(z_t) (the "alpha-recursion") holds:

    α_{t+1}(z_{t+1}) = p(ȳ_{t+1} | z_{t+1}) ∑_{z_t} p(z_{t+1} | z_t) α_t(z_t).

Backward:

    µ_{z_{t+1}→z_t}(z_t) = ∑_{z_{t+1}} p(z_{t+1} | z_t) µ_{z_{t+2}→z_{t+1}}(z_{t+1}) p(ȳ_{t+1} | z_{t+1}).

We define the beta-message β_t(z_t) as β_t(z_t) = µ_{z_{t+1}→z_t}(z_t). As an initialization, we take β_T(z_T) = 1. The following recursion for β_t(z_t) (the "beta-recursion") holds:

    β_t(z_t) = ∑_{z_{t+1}} p(z_{t+1} | z_t) p(ȳ_{t+1} | z_{t+1}) β_{t+1}(z_{t+1}).

Property 7.2

1. p(z_t, ȳ_0, ..., ȳ_T) = α_t(z_t) β_t(z_t)
2. p(ȳ_0, ..., ȳ_T) = ∑_{z_t} α_t(z_t) β_t(z_t)
3. p(z_t | ȳ_0, ..., ȳ_T) = p(z_t, ȳ_0, ..., ȳ_T) / p(ȳ_0, ..., ȳ_T) = α_t(z_t) β_t(z_t) / ∑_{z_t} α_t(z_t) β_t(z_t)
4. For all t < T,  p(z_t, z_{t+1} | ȳ_0, ..., ȳ_T) = [1 / p(ȳ_0, ..., ȳ_T)] α_t(z_t) β_{t+1}(z_{t+1}) p(z_{t+1} | z_t) p(ȳ_{t+1} | z_{t+1})

Remark 7.3

1. The alpha-recursion and beta-recursion are easy to implement, but one needs to avoid errors in the indices!
2. The sums need to be coded using logs in order to prevent numerical errors.
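Here is a compact, unnormalized NumPy sketch of the alpha/beta recursions (our own illustration: the name forward_backward and the array conventions are ours, A follows the lecture's convention A[i, j] = p(z_t = i | z_{t-1} = j), and a production implementation would work in the log domain or rescale the messages, as Remark 7.3 advises).

```python
import numpy as np

def forward_backward(pi0, A, emis):
    """Alpha/beta recursions for an HMM (unnormalized, for clarity).

    pi0:  shape (K,), initial distribution p(z_0).
    A:    shape (K, K), A[i, j] = p(z_t = i | z_{t-1} = j) (columns sum to 1).
    emis: shape (T+1, K), emis[t, i] = p(y_t | z_t = i) at the observed y_t.
    Returns alpha, beta, the smoothing posteriors gamma, and p(y_0, ..., y_T).
    """
    T1, K = emis.shape
    alpha = np.zeros((T1, K))
    beta = np.ones((T1, K))                           # beta_T = 1
    alpha[0] = emis[0] * pi0                          # virtual message p(z_0)
    for t in range(T1 - 1):                           # alpha-recursion
        alpha[t + 1] = emis[t + 1] * (A @ alpha[t])
    for t in range(T1 - 2, -1, -1):                   # beta-recursion
        beta[t] = A.T @ (emis[t + 1] * beta[t + 1])
    likelihood = (alpha[-1] * beta[-1]).sum()         # Property 7.2(2), any t works
    gamma = alpha * beta / likelihood                 # Property 7.2(3)
    return alpha, beta, gamma, likelihood

# Example with K = 2 hidden states and a 3-symbol observation alphabet.
rng = np.random.default_rng(3)
T = 10
pi0 = np.array([0.6, 0.4])
A = np.array([[0.9, 0.2], [0.1, 0.8]])                # columns sum to 1
B = np.array([[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]])    # B[y, i] = p(y_t = y | z_t = i)
obs = rng.integers(0, 3, size=T + 1)
emis = B[obs]                                         # shape (T+1, K)
alpha, beta, gamma, lik = forward_backward(pi0, A, emis)
```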

EM algorithm

With the previous notations and assumptions, we write the complete log-likelihood ℓ_c(θ), where θ denotes the parameters of the model (containing (π_0, A), but also the parameters of f):

    ℓ_c(θ) = log( p(z_0) ∏_{t=0}^{T-1} p(z_{t+1} | z_t) ∏_{t=0}^{T} p(y_t | z_t) )
           = log p(z_0) + ∑_{t=0}^{T-1} log p(z_{t+1} | z_t) + ∑_{t=0}^{T} log p(y_t | z_t)
           = ∑_{i=1}^{K} δ(z_0 = i) log (π_0)_i + ∑_{t=0}^{T-1} ∑_{i,j=1}^{K} δ(z_{t+1} = i, z_t = j) log A_{i,j} + ∑_{t=0}^{T} ∑_{i=1}^{K} δ(z_t = i) log f(y_t, i).

When applying EM to estimate the parameters of this HMM, we use Jensen's inequality to obtain a lower bound on the log-likelihood:

    log p(y_0, ..., y_T) ≥ E_q[log p(z_0, ..., z_T, y_0, ..., y_T)] = E_q[ℓ_c(θ)].

At the k-th expectation step, we use q(z_0, ..., z_T) = p(z_0, ..., z_T | y_0, ..., y_T; θ_{k-1}), which boils down to applying the following rules:

    E[δ(z_0 = i) | y] = p(z_0 = i | y; θ_{k-1})
    E[δ(z_t = i) | y] = p(z_t = i | y; θ_{k-1})
    E[δ(z_{t+1} = i, z_t = j) | y] = p(z_{t+1} = i, z_t = j | y; θ_{k-1})

Thus, in the expression of the complete log-likelihood above, we simply replace δ(z_0 = i) by p(z_0 = i | y; θ_{k-1}), and similarly for the other terms. At the k-th maximization step, we maximize the resulting expression with respect to the parameters θ in the usual way to obtain a new estimate θ_k. The key point is that everything decouples, so the maximization is simple and can be done in closed form.
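As an illustration of how the two steps fit together, here is a sketch of one EM (Baum-Welch) iteration for discrete emissions, reusing the forward_backward sketch above (the name hmm_em_step and the update layout are our own; the pairwise posteriors come from Property 7.2(4), and the closed-form M-step simply renormalizes the expected counts).

```python
import numpy as np

def hmm_em_step(pi0, A, B, obs):
    """One EM iteration for an HMM with discrete emissions.

    Conventions as before: A[i, j] = p(z_t = i | z_{t-1} = j),
    B[y, i] = p(y_t = y | z_t = i). Uses forward_backward from the sketch above.
    """
    emis = B[obs]
    alpha, beta, gamma, lik = forward_backward(pi0, A, emis)
    T1, K = gamma.shape

    # E-step: xi[t, i, j] = p(z_{t+1} = i, z_t = j | y), cf. Property 7.2(4).
    xi = np.zeros((T1 - 1, K, K))
    for t in range(T1 - 1):
        xi[t] = (emis[t + 1] * beta[t + 1])[:, None] * A * alpha[t][None, :] / lik

    # M-step: everything decouples and is available in closed form.
    pi0_new = gamma[0]
    A_new = xi.sum(axis=0)
    A_new /= A_new.sum(axis=0, keepdims=True)         # each column sums to 1 over i
    B_new = np.zeros_like(B)
    for y in range(B.shape[0]):
        B_new[y] = gamma[obs == y].sum(axis=0)         # expected counts of symbol y per state
    B_new /= B_new.sum(axis=0, keepdims=True)          # normalize over y for each state
    return pi0_new, A_new, B_new, lik

# One iteration, continuing the previous example (pi0, A, B, obs defined above).
pi0, A, B, lik = hmm_em_step(pi0, A, B, obs)
```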

7.3 Principal Component Analysis (PCA)

Framework: x_1, ..., x_N ∈ R^d.
Goal: project the points onto the closest affine subspace.

Analysis view: find w ∈ R^d such that Var(x^T w) is maximal, subject to ||w|| = 1. With centered data, i.e. (1/N) ∑_{n=1}^N x_n = 0, the empirical variance is

    (1/N) ∑_{n=1}^N (x_n^T w)^2 = (1/N) w^T (X^T X) w,

where X ∈ R^{N×d} is the design matrix. In this case, w is the eigenvector of X^T X with largest eigenvalue. It is not obvious a priori that this is the direction we care about. If more than one direction is required, one can use deflation:

1. Find w.
2. Project the x_n onto the orthogonal complement of Vect(w).
3. Start again.

Synthesis view:

    min_w ∑_{n=1}^N d(x_n, {x : w^T x = 0})^2,   with w ∈ R^d, ||w|| = 1.

Advantage: if one wants more than one dimension, it suffices to replace {x : w^T x = 0} by any subspace.
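A small NumPy sketch of the analysis view (our own example; pca_directions is a hypothetical helper): the leading eigenvectors of the empirical covariance give the principal directions, which yields the same subspace as the deflation scheme above.

```python
import numpy as np

def pca_directions(X, k):
    """Top-k principal directions of a data matrix X of shape (N, d)."""
    Xc = X - X.mean(axis=0)                           # center the data
    C = Xc.T @ Xc / X.shape[0]                        # empirical covariance (d x d)
    eigval, eigvec = np.linalg.eigh(C)                # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:k]              # keep the k largest
    return eigvec[:, order], eigval[order]

# Example: 200 points in R^5 with most variance along a random 2D subspace.
rng = np.random.default_rng(4)
basis = np.linalg.qr(rng.normal(size=(5, 2)))[0]      # orthonormal 2D basis in R^5
X = rng.normal(size=(200, 2)) @ basis.T * 3.0 + rng.normal(size=(200, 5)) * 0.1
W, variances = pca_directions(X, 2)                   # W: (5, 2) orthonormal directions
```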

Probabilistic approach: factor analysis

Model: Λ = (λ_1, ..., λ_k) ∈ R^{d×k},

    X ∈ R^k,  X ∼ N(0, I),
    ε ∼ N(0, Ψ),  ε ∈ R^d, independent from X, with Ψ diagonal,
    Y ∈ R^d,  Y = ΛX + µ + ε.

We have Y | X ∼ N(ΛX + µ, Ψ). Problem: infer X given Y. The pair (X, Y) is a Gaussian vector on R^{k+d} which satisfies:

    E[X] = 0 = µ_X
    E[Y] = E[ΛX + µ + ε] = µ = µ_Y
    Σ_XX = I
    Σ_XY = Cov(X, ΛX + µ + ε) = Cov(X, ΛX) = Λ^T
    Σ_YY = Var(ΛX + ε) = Var(ΛX) + Var(ε) = ΛΛ^T + Ψ

Using the standard formulas for conditioning a Gaussian vector, we know how to compute the law of X | Y:

    E[X | Y = y] = µ_X + Σ_XY Σ_YY^{-1} (y - µ_Y)
    Cov[X | Y = y] = Σ_XX - Σ_XY Σ_YY^{-1} Σ_YX

In our case, we therefore have

    E[X | Y = y] = Λ^T (ΛΛ^T + Ψ)^{-1} (y - µ)
    Cov[X | Y = y] = I - Λ^T (ΛΛ^T + Ψ)^{-1} Λ

To apply EM, one needs to write down the complete log-likelihood:

    log p(X, Y) ∝ -(1/2) X^T X - (1/2) (Y - ΛX - µ)^T Ψ^{-1} (Y - ΛX - µ) - (1/2) log det Ψ.

Trap: E[XX^T | Y] ≠ Cov(X | Y). Rather, E[XX^T | Y] = Cov(X | Y) + E[X | Y] E[X | Y]^T.

Remark 7.4  Cov(Y) = ΛΛ^T + Ψ: our parameters are not identifiable, since replacing Λ by ΛR with R a rotation gives the same model (in other words, a subspace has many different orthonormal bases).

Why do we care about this probabilistic approach?

1. A probabilistic interpretation allows us to model the problem in a finer way.
2. It is very flexible and therefore allows us to combine multiple models.
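The conditional expectation and covariance above translate directly into code; here is a minimal sketch (our own example, with the hypothetical helper fa_posterior), including the second moment E[XX^T | Y] needed in the E step and mentioned in the "trap" above.

```python
import numpy as np

def fa_posterior(y, Lam, mu, Psi_diag):
    """Posterior of the latent factor X given an observation Y = y in factor analysis.

    Implements E[X | Y = y] = Lam^T (Lam Lam^T + Psi)^{-1} (y - mu) and
    Cov[X | Y = y] = I - Lam^T (Lam Lam^T + Psi)^{-1} Lam.
    """
    d, k = Lam.shape
    Sigma_yy = Lam @ Lam.T + np.diag(Psi_diag)
    G = Lam.T @ np.linalg.inv(Sigma_yy)               # k x d
    mean = G @ (y - mu)
    cov = np.eye(k) - G @ Lam
    # E[X X^T | Y = y] = Cov(X | Y) + E[X | Y] E[X | Y]^T  (the "trap" above).
    second_moment = cov + np.outer(mean, mean)
    return mean, cov, second_moment

# Example in R^3 with a 2-dimensional latent factor.
rng = np.random.default_rng(5)
Lam = rng.normal(size=(3, 2))
mu = np.array([1.0, -0.5, 2.0])
Psi_diag = np.array([0.5, 0.3, 0.8])
y = rng.normal(size=3)
mean, cov, second_moment = fa_posterior(y, Lam, mu, Psi_diag)
```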
