Tutorial on Statistical Machine Learning

October 7th, 2005 - ICMI, Trento Samy Bengio Tutorial on Statistical Machine Learning 1 Tutorial on Statistical Machine Learning with Applications to Multimodal Processing Samy Bengio IDIAP Research Institute Martigny, Switzerland bengio@idiap.ch http://www.idiap.ch/~bengio

Samy Bengio Tutorial on Statistical Machine Learning 2 Outline of the Tutorial Outline Introduction Statistical Learning Theory EM and Gaussian Mixture Models Hidden Markov Models Advanced Models for Multimodal Processing Updated slides http://www.idiap.ch/~bengio/icmi2005.pdf

Samy Bengio Tutorial on Statistical Machine Learning 3 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications Part I Introduction

Samy Bengio Tutorial on Statistical Machine Learning 4 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications 1 What is Machine Learning? 2 Why Learning is Difficult? 3 Types of Problems 4 Applications

Samy Bengio Tutorial on Statistical Machine Learning 5 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications What is Machine Learning? (Graphical View)

Samy Bengio Tutorial on Statistical Machine Learning 6 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications What is Machine Learning? Learning is an essential human property Learning means changing in order to be better (according to a given criterion) when a similar situation arises Learning IS NOT learning by heart Any computer can learn by heart; the difficulty is to generalize a behavior to a novel situation

Samy Bengio Tutorial on Statistical Machine Learning 7 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications 1 What is Machine Learning? 2 Why Learning is Difficult? 3 Types of Problems 4 Applications

Samy Bengio Tutorial on Statistical Machine Learning 8 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications Why Learning is Difficult? Given a finite amount of training data, you have to derive a relation for an infinite domain In fact, there is an infinite number of such relations How should we draw the relation?

Samy Bengio Tutorial on Statistical Machine Learning 9 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications Why Learning is Difficult? (2) Given a finite amount of training data, you have to derive a relation for an infinite domain In fact, there is an infinite number of such relations Which relation is the most appropriate?

Samy Bengio Tutorial on Statistical Machine Learning 10 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications Why Learning is Difficult? (3) Given a finite amount of training data, you have to derive a relation for an infinite domain In fact, there is an infinite number of such relations... the hidden test points...

Samy Bengio Tutorial on Statistical Machine Learning 11 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications Occam's Razor Principle William of Occam: monk living in the 14th century Principle of Parsimony: One should not increase, beyond what is necessary, the number of entities required to explain anything When many solutions are available for a given problem, we should select the simplest one But what do we mean by simple? We will use prior knowledge of the problem to be solved to define what is a simple solution Example of a prior: smoothness

Samy Bengio Tutorial on Statistical Machine Learning 12 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications Learning as a Search Problem [figure: the set of solutions chosen a priori, the set of solutions compatible with the training set, an initial solution, and the optimal solution]

Samy Bengio Tutorial on Statistical Machine Learning 13 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications 1 What is Machine Learning? 2 Why Learning is Difficult? 3 Types of Problems 4 Applications

Samy Bengio Tutorial on Statistical Machine Learning 14 Types of Problems What is Machine Learning? Why Learning is Difficult? Types of Problems Applications There are 3 kinds of problems: regression

Samy Bengio Tutorial on Statistical Machine Learning 15 Types of Problems What is Machine Learning? Why Learning is Difficult? Types of Problems Applications There are 3 kinds of problems: regression, classification

Samy Bengio Tutorial on Statistical Machine Learning 16 Types of Problems What is Machine Learning? Why Learning is Difficult? Types of Problems Applications There are 3 kinds of problems: regression, classification, density estimation [figure: a density P(X) over X]

Samy Bengio Tutorial on Statistical Machine Learning 17 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications 1 What is Machine Learning? 2 Why Learning is Difficult? 3 Types of Problems 4 Applications

Samy Bengio Tutorial on Statistical Machine Learning 18 Applications What is Machine Learning? Why Learning is Difficult? Types of Problems Applications Vision Processing Face detection/verification Handwriting recognition Speech Processing Phoneme/Word/Sentence/Person recognition Others Finance: asset prediction, portfolio and risk management Telecom: traffic prediction Data mining: make use of huge datasets kept by large corporations Games: Backgammon, Go Control: robots... and plenty of others of course!

Samy Bengio Tutorial on Statistical Machine Learning 19 Data, Functions, Risk The Capacity Methodology Models Part II Statistical Learning Theory

Samy Bengio Tutorial on Statistical Machine Learning 20 Data, Functions, Risk The Capacity Methodology Models 5 Data, Functions, Risk 6 The Capacity 7 Methodology 8 Models

Samy Bengio Tutorial on Statistical Machine Learning 21 Data, Functions, Risk The Capacity Methodology Models The Data Available training data: let Z_1, Z_2, ..., Z_n be an n-tuple random sample of an unknown distribution with density p(z). All Z_i are independently and identically distributed (iid). Let D_n be a particular instance {z_1, z_2, ..., z_n}. Various forms of the data. Classification: Z = (X, Y) ∈ R^d × {−1, 1}; objective: given a new x, estimate P(Y | X = x). Regression: Z = (X, Y) ∈ R^d × R; objective: given a new x, estimate E[Y | X = x]. Density estimation: Z ∈ R^d; objective: given a new z, estimate p(z).

Samy Bengio Tutorial on Statistical Machine Learning 22 The Function Space Data, Functions, Risk The Capacity Methodology Models Learning: search for a good function in a function space F. Examples of functions f(·; θ) ∈ F. Regression: ŷ = f(x; a, b, c) = a·x² + b·x + c. Classification: ŷ = f(x; a, b, c) = sign(a·x² + b·x + c). Density estimation: p̂(z) = f(z; µ, Σ) = (2π)^(−d/2) |Σ|^(−1/2) exp(−½ (z − µ)ᵀ Σ⁻¹ (z − µ)), with d the dimension of z.

Samy Bengio Tutorial on Statistical Machine Learning 23 The Loss Function Data, Functions, Risk The Capacity Methodology Models Learning: search for a good function in a function space F. Examples of loss functions L : Z × F → R. Regression: L(z, f) = L((x, y), f) = (f(x) − y)². Classification: L(z, f) = L((x, y), f) = 0 if f(x) = y, 1 otherwise. Density estimation: L(z, f) = −log p̂(z).

Samy Bengio Tutorial on Statistical Machine Learning 24 Data, Functions, Risk The Capacity Methodology Models The Risk and the Empirical Risk Learning: search for a good function in a function space F. Minimize the Expected Risk on F, defined for a given f as R(f) = E_Z[L(Z, f)] = ∫ L(z, f) p(z) dz. Induction principle: select f* = arg min_{f ∈ F} R(f). Problem: p(z) is unknown!!! Empirical Risk: R̂(f, D_n) = (1/n) Σ_{i=1}^n L(z_i, f).
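
To make the empirical risk concrete, here is a minimal Python sketch (my own illustration, not from the tutorial) that computes R̂(f, D_n) for the squared loss of the quadratic regression function used as an example above; the toy data and the coefficients a, b, c are invented for the example.

```python
import numpy as np

def f(x, a, b, c):
    # Quadratic regression function y_hat = a*x^2 + b*x + c
    return a * x**2 + b * x + c

def empirical_risk(xs, ys, a, b, c):
    # R_hat(f, D_n) = (1/n) * sum_i (f(x_i) - y_i)^2  (squared loss)
    return np.mean((f(xs, a, b, c) - ys) ** 2)

# Toy dataset D_n = {(x_i, y_i)} drawn from a noisy quadratic relation
rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=100)
ys = 2.0 * xs**2 - 0.5 * xs + 0.1 + rng.normal(scale=0.05, size=xs.shape)

print(empirical_risk(xs, ys, a=2.0, b=-0.5, c=0.1))
```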

Samy Bengio Tutorial on Statistical Machine Learning 25 Data, Functions, Risk The Capacity Methodology Models The Risk and the Empirical Risk The empirical risk is an unbiased estimate of the risk: E[R̂(f, D_n)] = R(f). The principle of empirical risk minimization: f*(D_n) = arg min_{f ∈ F} R̂(f, D_n). Training error: R̂(f*(D_n), D_n) = min_{f ∈ F} R̂(f, D_n). Is the training error a biased estimate of the risk?

Samy Bengio Tutorial on Statistical Machine Learning 26 The Training Error Data, Functions, Risk The Capacity Methodology Models Is the training error a biased estimate of the risk? Yes: E[R(f*(D_n)) − R̂(f*(D_n), D_n)] ≥ 0. The solution f*(D_n) found by minimizing the training error is better on D_n than on any other set D'_n drawn from p(z). Can we bound the difference between the training error and the generalization error, R(f*(D_n)) − R̂(f*(D_n), D_n)? Answer: under certain conditions on F, yes. These conditions depend on the notion of capacity h of F.

Samy Bengio Tutorial on Statistical Machine Learning 27 Data, Functions, Risk The Capacity Methodology Models 5 Data, Functions, Risk 6 The Capacity 7 Methodology 8 Models

Samy Bengio Tutorial on Statistical Machine Learning 28 The Capacity Data, Functions, Risk The Capacity Methodology Models The capacity h(F) is a measure of its size, or complexity. Classification: the capacity h(F) is the largest n such that there exists a set of examples D_n such that one can always find an f ∈ F which gives the correct answer for all examples in D_n, for any possible labeling. Regression and density estimation: a notion of capacity also exists, but it is more complex to derive (for instance, we can always reduce a regression problem to a classification problem). Bound on the expected risk: let τ = sup L − inf L. We have P( sup_{f ∈ F} |R(f) − R̂(f, D_n)| ≤ 2τ √( (h (ln(2n/h) + 1) − ln(η/9)) / n ) ) ≥ 1 − η.

Samy Bengio Tutorial on Statistical Machine Learning 29 Theoretical Curves Data, Functions, Risk The Capacity Methodology Models [figure: bound on the expected risk = empirical risk + confidence interval, plotted as a function of the capacity h]

Samy Bengio Tutorial on Statistical Machine Learning 30 Theoretical Curves Data, Functions, Risk The Capacity Methodology Models [figure: bound on the expected risk and empirical risk as a function of the number of examples n, both converging toward inf R(f)]

Samy Bengio Tutorial on Statistical Machine Learning 31 Data, Functions, Risk The Capacity Methodology Models 5 Data, Functions, Risk 6 The Capacity 7 Methodology 8 Models

Samy Bengio Tutorial on Statistical Machine Learning 32 Methodology Data, Functions, Risk The Capacity Methodology Models First: identify the goal! It could be 1 to give the best model you can obtain given a training set, 2 to give the expected performance of a model obtained by empirical risk minimization given a training set, or 3 to give the best model and its expected performance that you can obtain given a training set. If the goal is (1), you need to do model selection. If the goal is (2), you need to estimate the risk. If the goal is (3), you need to do both! There are various methods that can be used for either risk estimation or model selection: simple validation, cross-validation (k-fold, leave-one-out).

Samy Bengio Tutorial on Statistical Machine Learning 33 Data, Functions, Risk The Capacity Methodology Models Model Selection - Validation Select a family of functions with hyper-parameter θ. Divide your training set D_n into two parts, D_tr = {z_1, z_2, ..., z_tr} and D_va = {z_tr+1, z_tr+2, ..., z_tr+va}, with tr + va = n. For each value θ_m of the hyper-parameter θ: select f*_θm(D_tr) = arg min_{f ∈ F_θm} R̂(f, D_tr); estimate R(f*_θm) with R̂(f*_θm, D_va) = (1/va) Σ_{z_i ∈ D_va} L(z_i, f*_θm(D_tr)). Select θ*_m = arg min_{θm} R̂(f*_θm, D_va). Return f*(D_n) = arg min_{f ∈ F_θ*m} R̂(f, D_n).

Samy Bengio Tutorial on Statistical Machine Learning 34 Data, Functions, Risk The Capacity Methodology Models Model Selection - Cross-validation Select a family of functions with hyper-parameter θ. Divide your training set D_n into K distinct and equal parts D_1, ..., D_k, ..., D_K. For each value θ_m of the hyper-parameter θ: for each part D_k (and its counterpart D̄_k), select f*_θm(D̄_k) = arg min_{f ∈ F_θm} R̂(f, D̄_k) and estimate R(f*_θm(D̄_k)) with R̂(f*_θm(D̄_k), D_k) = (1/|D_k|) Σ_{z_i ∈ D_k} L(z_i, f*_θm(D̄_k)); estimate R(f*_θm(D_n)) with (1/K) Σ_k R̂(f*_θm(D̄_k), D_k). Select θ*_m = arg min_{θm} R̂(f*_θm(D_n)). Return f*(D_n) = arg min_{f ∈ F_θ*m} R̂(f, D_n).

Samy Bengio Tutorial on Statistical Machine Learning 35 Data, Functions, Risk The Capacity Methodology Models Estimation of the Risk - Validation Divide your training set D_n into two parts, D_tr = {z_1, z_2, ..., z_tr} and D_te = {z_tr+1, z_tr+2, ..., z_tr+te}, with tr + te = n. Select f*(D_tr) = arg min_{f ∈ F} R̂(f, D_tr) (this optimization process could include model selection). Estimate R(f*(D_tr)) with R̂(f*(D_tr), D_te) = (1/te) Σ_{z_i ∈ D_te} L(z_i, f*(D_tr)).

Samy Bengio Tutorial on Statistical Machine Learning 36 Data, Functions, Risk The Capacity Methodology Models Estimation of the Risk - Cross-validation Divide your training set D_n into K distinct and equal parts D_1, ..., D_k, ..., D_K. For each part D_k: let D̄_k be the set of examples that are in D_n but not in D_k; select f*(D̄_k) = arg min_{f ∈ F} R̂(f, D̄_k) (this process could include model selection); estimate R(f*(D̄_k)) with R̂(f*(D̄_k), D_k) = (1/|D_k|) Σ_{z_i ∈ D_k} L(z_i, f*(D̄_k)). Estimate R(f*(D_n)) with (1/K) Σ_k R̂(f*(D̄_k), D_k). When K = n: leave-one-out cross-validation.
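
The K-fold estimate above can be sketched in a few lines of Python. This is an illustrative implementation under my own assumptions (a generic train/loss pair passed as callables, K roughly equal folds), not code from the tutorial.

```python
import numpy as np

def cross_validation_risk(data, train, loss, K=5, seed=0):
    """Estimate the risk of the learning procedure `train` by K-fold CV.

    data  : array of examples z_i (rows)
    train : callable mapping a training subset to a fitted function f
    loss  : callable L(z, f) for a single example
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    folds = np.array_split(idx, K)           # K distinct, (almost) equal parts
    fold_risks = []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        f_k = train(data[train_idx])          # f*(D_bar_k)
        fold_risks.append(np.mean([loss(z, f_k) for z in data[test_idx]]))
    return np.mean(fold_risks)                # (1/K) * sum_k R_hat(f*(D_bar_k), D_k)

# Toy usage: 1-D mean estimation with squared loss
data = np.random.default_rng(1).normal(size=200)
risk = cross_validation_risk(
    data,
    train=lambda d: (lambda z, m=d.mean(): m),   # "model" = training mean
    loss=lambda z, f: (z - f(z)) ** 2,
)
print(risk)
```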

Samy Bengio Tutorial on Statistical Machine Learning 37 Data, Functions, Risk The Capacity Methodology Models Estimation of the Risk and Model Selection When you want both the best model and its expected risk, you need to merge the methods already presented. For instance: train-validation-test: 3 separate data sets are necessary; cross-validation + test: cross-validate on the train set, then test on a separate set; double cross-validation: for each subset, do a second cross-validation with the K − 1 other subsets. Other important methodological aspects: compare your results with other methods!!!! use statistical tests to verify significance verify your model on more than one dataset

Samy Bengio Tutorial on Statistical Machine Learning 38 Data, Functions, Risk The Capacity Methodology Models Train - Validation - Test Select a family of functions with hyper-parameter θ. Divide your training set D_n into three parts D_tr, D_va, and D_te. For each value θ_m of the hyper-parameter θ: select f*_θm(D_tr) = arg min_{f ∈ F_θm} R̂(f, D_tr); let R̂(f*_θm(D_tr), D_va) = (1/va) Σ_{z_i ∈ D_va} L(z_i, f*_θm(D_tr)). Select θ*_m = arg min_{θm} R̂(f*_θm(D_tr), D_va). Select f*(D_tr ∪ D_va) = arg min_{f ∈ F_θ*m} R̂(f, D_tr ∪ D_va). Estimate R(f*(D_tr ∪ D_va)) with (1/te) Σ_{z_i ∈ D_te} L(z_i, f*(D_tr ∪ D_va)).

Samy Bengio Tutorial on Statistical Machine Learning 39 Data, Functions, Risk The Capacity Methodology Models Cross-validation + Test Select a family of functions with hyper-parameter θ. Divide your dataset D_n into two parts: a training set D_tr and a test set D_te. For each value θ_m of the hyper-parameter θ, estimate R(f*_θm(D_tr)) on D_tr using cross-validation. Select θ*_m = arg min_{θm} R̂(f*_θm(D_tr)). Retrain f*(D_tr) = arg min_{f ∈ F_θ*m} R̂(f, D_tr). Estimate R(f*(D_tr)) with (1/te) Σ_{z_i ∈ D_te} L(z_i, f*(D_tr)).

Samy Bengio Tutorial on Statistical Machine Learning 40 Data, Functions, Risk The Capacity Methodology Models 5 Data, Functions, Risk 6 The Capacity 7 Methodology 8 Models

Samy Bengio Tutorial on Statistical Machine Learning 41 Data, Functions, Risk The Capacity Methodology Models Examples of Known Models Multi-Layer Perceptrons (regression, classification) Radial Basis Functions (regression, classification) Support Vector Machines (classification, regression) Gaussian Mixture Models (density estimation, classification) Hidden Markov Models (density estimation, classification) Graphical Models (density estimation, classification) AdaBoost and Bagging (classification, regression, density estimation) Decision Trees (classification, regression)

Samy Bengio Tutorial on Statistical Machine Learning 42 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Part III EM and Gaussian Mixture Models

Samy Bengio Tutorial on Statistical Machine Learning 43 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Probabilities 9 Preliminaries 10 Gaussian Mixture Models 11 Expectation-Maximization 12 EM for GMMs

Samy Bengio Tutorial on Statistical Machine Learning 44 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Probabilities Reminder: Basics on Probabilities A few basic equalities that are often used: 1 (conditional probabilities) P(A, B) = P(A | B) P(B) 2 (Bayes rule) P(A | B) = P(B | A) P(A) / P(B) 3 If ∪_i B_i = Ω and ∀ i ≠ j, B_i ∩ B_j = ∅, then P(A) = Σ_i P(A, B_i)

Samy Bengio Tutorial on Statistical Machine Learning 45 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Gaussian Mixture Models 9 Preliminaries 10 Gaussian Mixture Models 11 Expectation-Maximization 12 EM for GMMs

Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs What is a Gaussian Mixture Model Gaussian Mixture Models A Gaussian Mixture Model (GMM) is a distribution. The likelihood given a Gaussian distribution is N(x; µ, Σ) = (2π)^(−d/2) |Σ|^(−1/2) exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ)), where µ is the mean, Σ is the covariance matrix of the Gaussian (often diagonal), and d is the dimension of x. The likelihood given a GMM is p(x) = Σ_{i=1}^N w_i N(x; µ_i, Σ_i), where N is the number of Gaussians and w_i is the weight of Gaussian i, with Σ_i w_i = 1 and ∀i: w_i ≥ 0. Samy Bengio Tutorial on Statistical Machine Learning 46
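
A minimal sketch of the GMM likelihood above, using scipy's multivariate normal density; the weights, means, and covariances below are invented illustration values.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_likelihood(x, weights, means, covs):
    # p(x) = sum_i w_i * N(x; mu_i, Sigma_i)
    return sum(w * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for w, mu, cov in zip(weights, means, covs))

weights = [0.4, 0.6]                                  # sum to 1, all >= 0
means   = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs    = [np.eye(2), np.diag([0.5, 2.0])]            # diagonal covariances

x = np.array([2.5, 2.0])
print(gmm_likelihood(x, weights, means, covs))
```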

Samy Bengio Tutorial on Statistical Machine Learning 47 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Gaussian Mixture Models Characteristics of a GMM While Multi-Layer Perceptrons are universal approximators of functions, GMMs are universal approximators of densities. (as long as there are enough Gaussians of course) Even diagonal GMMs are universal approximators. Full rank GMMs are not easy to handle: number of parameters is the square of the number of dimensions. GMMs can be trained by maximum likelihood using an efficient algorithm: Expectation-Maximization.

Samy Bengio Tutorial on Statistical Machine Learning 48 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Graphical View More Formally 9 Preliminaries 10 Gaussian Mixture Models 11 Expectation-Maximization 12 EM for GMMs

Samy Bengio Tutorial on Statistical Machine Learning 49 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Graphical View More Formally Basics of Expectation-Maximization Objective: maximize the likelihood p(X; θ) of the data X drawn from an unknown distribution, given the model parameterized by θ: θ* = arg max_θ p(X | θ) = arg max_θ Π_{p=1}^n p(x_p | θ). Basic ideas of EM: introduce a hidden variable such that its knowledge would simplify the maximization of p(X; θ). At each iteration of the algorithm: E-Step: estimate the distribution of the hidden variable given the data and the current value of the parameters. M-Step: modify the parameters in order to maximize the joint distribution of the data and the hidden variable.

Samy Bengio Tutorial on Statistical Machine Learning 50 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Graphical View More Formally EM for GMM (Graphical View, 1) Hidden variable: for each point, which Gaussian generated it? A B

Samy Bengio Tutorial on Statistical Machine Learning 51 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Graphical View More Formally EM for GMM (Graphical View, 2) E-Step: for each point, estimate the probability that each Gaussian generated it [figure: two Gaussians A and B; e.g. one point with P(A) = 0.6, P(B) = 0.4, another with P(A) = 0.2, P(B) = 0.8]

Samy Bengio Tutorial on Statistical Machine Learning 52 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Graphical View More Formally EM for GMM (Graphical View, 3) M-Step: modify the parameters according to the hidden variable to maximize the likelihood of the data (and the hidden variable) [figure: the two Gaussians A and B updated using the point posteriors, e.g. P(A) = 0.6, P(B) = 0.4 and P(A) = 0.2, P(B) = 0.8]

Samy Bengio Tutorial on Statistical Machine Learning 53 EM: More Formally Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Graphical View More Formally Let us call the hidden variable Q. Let us consider the following auxiliary function: A(θ, θ^s) = E_Q[log p(X, Q | θ) | X, θ^s]. It can be shown that maximizing A, θ^(s+1) = arg max_θ A(θ, θ^s), always increases the likelihood of the data p(X | θ^(s+1)), and a maximum of A corresponds to a maximum of the likelihood.

Samy Bengio Tutorial on Statistical Machine Learning 54 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Auxiliary Function Update Rules 9 Preliminaries 10 Gaussian Mixture Models 11 Expectation-Maximization 12 EM for GMMs

Samy Bengio Tutorial on Statistical Machine Learning 55 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs EM for GMM: Hidden Variable Auxiliary Function Update Rules For GMMs, the hidden variable Q will describe which Gaussian generated each example. If Q was observed, then it would be simple to maximize the likelihood of the data: simply estimate the parameters Gaussian by Gaussian. Moreover, we will see that we can easily estimate Q. Let us first write the mixture of Gaussians model for one x_i: p(x_i | θ) = Σ_{j=1}^N P(j | θ) p(x_i | j, θ). Let us now introduce the following indicator variable: q_{i,j} = 1 if Gaussian j emitted x_i, 0 otherwise.

Samy Bengio Tutorial on Statistical Machine Learning 56 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs EM for GMM: Auxiliary Function Auxiliary Function Update Rules We can now write the joint likelihood of all the X and Q: p(X, Q | θ) = Π_{i=1}^n Π_{j=1}^N P(j | θ)^q_{i,j} p(x_i | j, θ)^q_{i,j}, which in log gives log p(X, Q | θ) = Σ_{i=1}^n Σ_{j=1}^N q_{i,j} log P(j | θ) + q_{i,j} log p(x_i | j, θ).

Samy Bengio Tutorial on Statistical Machine Learning 57 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs EM for GMM: Auxiliary Function Auxiliary Function Update Rules Let us now write the corresponding auxiliary function: A(θ, θ^s) = E_Q[log p(X, Q | θ) | X, θ^s] = E_Q[ Σ_{i=1}^n Σ_{j=1}^N q_{i,j} log P(j | θ) + q_{i,j} log p(x_i | j, θ) | X, θ^s ] = Σ_{i=1}^n Σ_{j=1}^N E_Q[q_{i,j} | X, θ^s] log P(j | θ) + E_Q[q_{i,j} | X, θ^s] log p(x_i | j, θ).

Samy Bengio Tutorial on Statistical Machine Learning 58 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs EM for GMM: Update Rules Auxiliary Function Update Rules Means: µ̂_j = Σ_{i=1}^n x_i P(j | x_i, θ^s) / Σ_{i=1}^n P(j | x_i, θ^s). Variances: (σ̂_j)² = Σ_{i=1}^n (x_i − µ̂_j)² P(j | x_i, θ^s) / Σ_{i=1}^n P(j | x_i, θ^s). Weights: ŵ_j = (1/n) Σ_{i=1}^n P(j | x_i, θ^s).
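
These update rules translate directly into code. The sketch below (my own, for a one-dimensional GMM) runs one full EM iteration, with the posteriors P(j | x_i, θ^s) computed in the E-step; the variable names and toy data are assumptions made for the example.

```python
import numpy as np

def em_step_1d_gmm(x, w, mu, var):
    """One EM iteration for a 1-D GMM. x: (n,), w/mu/var: (N_gauss,)."""
    # E-step: responsibilities P(j | x_i, theta_s), shape (n, N_gauss)
    gauss = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    resp = w * gauss
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: closed-form updates from the slide
    nj = resp.sum(axis=0)                                   # sum_i P(j | x_i)
    mu_new = (resp * x[:, None]).sum(axis=0) / nj           # weighted means
    var_new = (resp * (x[:, None] - mu_new) ** 2).sum(axis=0) / nj
    w_new = nj / len(x)                                     # mixture weights
    return w_new, mu_new, var_new

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(20):
    w, mu, var = em_step_1d_gmm(x, w, mu, var)
print(w, mu, var)
```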

Samy Bengio Tutorial on Statistical Machine Learning 59 Initialization Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Auxiliary Function Update Rules EM is an iterative procedure that is very sensitive to initial conditions: start from trash, end up with trash! Hence, we need a good and fast initialization procedure. Often used: K-Means. Other options: hierarchical K-Means, Gaussian splitting.

Samy Bengio Tutorial on Statistical Machine Learning 60 Markovian Models Hidden Markov Models Speech Recognition Part IV Hidden Markov Models

Samy Bengio Tutorial on Statistical Machine Learning 61 Markovian Models Hidden Markov Models Speech Recognition Definition Graphical View 13 Markovian Models 14 Hidden Markov Models 15 Speech Recognition

Samy Bengio Tutorial on Statistical Machine Learning 62 Markovian Models Hidden Markov Models Speech Recognition Definition Graphical View Markov Models Stochastic process of a temporal sequence: the probability distribution of the variable q at time t depends on the variable q at times t − 1 down to 1: P(q_1, q_2, ..., q_T) = P(q_1^T) = P(q_1) Π_{t=2}^T P(q_t | q_1^(t−1)). First-order Markov process: P(q_t | q_1^(t−1)) = P(q_t | q_(t−1)). Markov Model: model of a Markovian process with discrete states. Hidden Markov Model: Markov Model whose state is not observed, but of which one can observe a manifestation (a variable x_t which depends only on q_t).

Samy Bengio Tutorial on Statistical Machine Learning 63 Markovian Models Hidden Markov Models Speech Recognition Definition Graphical View Markov Models (Graphical View) [figure: a 3-state Markov model, and the same Markov model unfolded in time over q(t−2), q(t−1), q(t), q(t+1)]

Samy Bengio Tutorial on Statistical Machine Learning 64 Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm 13 Markovian Models 14 Hidden Markov Models 15 Speech Recognition

Samy Bengio Tutorial on Statistical Machine Learning 65 Markovian Models Hidden Markov Models Speech Recognition Hidden Markov Models Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm [figure: a 3-state hidden Markov model, and the same model unfolded in time with hidden states q(t−2), q(t−1), q(t), q(t+1) and observations y(t−2), y(t−1), y(t), y(t+1)]

Samy Bengio Tutorial on Statistical Machine Learning 66 Elements of an HMM Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm A finite number of states N. Transition probabilities between states, which depend only on the previous state: P(q_t = i | q_(t−1) = j, θ). Emission probabilities, which depend only on the current state: p(x_t | q_t = i, θ) (where x_t is observed). Initial state probabilities: P(q_0 = i | θ). Each of these 3 sets of probabilities has parameters θ to estimate.

Samy Bengio Tutorial on Statistical Machine Learning 67 Markovian Models Hidden Markov Models Speech Recognition The 3 Problems of HMMs Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm The HMM model gives rise to 3 different problems. Given an HMM parameterized by θ, can we compute the likelihood of a sequence X = x_1^T = {x_1, x_2, ..., x_T}: p(x_1^T | θ)? Given an HMM parameterized by θ and a set of sequences D_n, can we select the parameters θ* such that θ* = arg max_θ Π_{p=1}^n p(X^(p) | θ)? Given an HMM parameterized by θ, can we compute the optimal path Q* through the state space given a sequence X: Q* = arg max_Q p(X, Q | θ)?

Samy Bengio Tutorial on Statistical Machine Learning 68 Markovian Models Hidden Markov Models Speech Recognition HMMs as Generative Processes Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm HMMs can be used to generate sequences: Let us define a set of starting states with initial probabilities P(q_0 = i). Let us also define a set of final states. Then for each sequence to generate: 1 Select an initial state j according to P(q_0). 2 Select the next state i according to P(q_t = i | q_(t−1) = j). 3 Emit an output according to the emission distribution p(x_t | q_t = i). 4 If i is a final state, then stop, otherwise loop to step 2.

Samy Bengio Tutorial on Statistical Machine Learning 69 Markovian Models Hidden Markov Models Speech Recognition Markovian Assumptions Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Emissions: the probability to emit x_t at time t in state q_t = i does not depend on anything else: p(x_t | q_t = i, q_1^(t−1), x_1^(t−1)) = p(x_t | q_t = i). Transitions: the probability to go from state j to state i at time t does not depend on anything else: P(q_t = i | q_(t−1) = j, q_1^(t−2), x_1^(t−1)) = P(q_t = i | q_(t−1) = j). Moreover, this probability does not depend on time t: P(q_t = i | q_(t−1) = j) is the same for all t; we say that such Markov models are homogeneous.

Samy Bengio Tutorial on Statistical Machine Learning 70 Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Derivation of the Forward Variable α, the probability of having generated the sequence x_1^t and being in state i at time t: α(i, t) := p(x_1^t, q_t = i) = p(x_t | x_1^(t−1), q_t = i) p(x_1^(t−1), q_t = i) = p(x_t | q_t = i) Σ_j p(x_1^(t−1), q_t = i, q_(t−1) = j) = p(x_t | q_t = i) Σ_j P(q_t = i | x_1^(t−1), q_(t−1) = j) p(x_1^(t−1), q_(t−1) = j) = p(x_t | q_t = i) Σ_j P(q_t = i | q_(t−1) = j) p(x_1^(t−1), q_(t−1) = j) = p(x_t | q_t = i) Σ_j P(q_t = i | q_(t−1) = j) α(j, t − 1).

Samy Bengio Tutorial on Statistical Machine Learning 71 Markovian Models Hidden Markov Models Speech Recognition From α to the Likelihood Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Reminder: α(i, t) := p(x_1^t, q_t = i). Initial condition: α(i, 0) = P(q_0 = i), the prior probability of each state i. Then let us compute α(i, t) for each state i and each time t of a given sequence x_1^T. Afterward, we can compute the likelihood as follows: p(x_1^T) = Σ_i p(x_1^T, q_T = i) = Σ_i α(i, T). Hence, to compute the likelihood p(x_1^T), we need O(N²T) operations, where N is the number of states.
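
The α recursion lends itself to a short implementation. Here is a sketch (mine, not from the slides) that computes p(x_1^T) for a discrete-emission HMM; a production implementation would scale the αs or work in the log domain to avoid underflow.

```python
import numpy as np

def forward_likelihood(prior, trans, emit, obs):
    """p(x_1^T) by the forward recursion.

    prior : (N,)    P(q_0 = i)
    trans : (N, N)  trans[j, i] = P(q_t = i | q_{t-1} = j)
    emit  : (N, K)  emit[i, k]  = p(x_t = k | q_t = i)
    obs   : sequence of observed symbols in {0, ..., K-1}
    """
    alpha = prior.copy()                       # alpha(i, 0) = P(q_0 = i)
    for x_t in obs:
        # alpha(i, t) = p(x_t | q_t=i) * sum_j P(q_t=i | q_{t-1}=j) * alpha(j, t-1)
        alpha = emit[:, x_t] * (alpha @ trans)
    return alpha.sum()                         # p(x_1^T) = sum_i alpha(i, T)

prior = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
emit  = np.array([[0.9, 0.1],
                  [0.3, 0.7]])
print(forward_likelihood(prior, trans, emit, obs=[0, 0, 1, 1, 1]))
```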

Samy Bengio Tutorial on Statistical Machine Learning 72 Markovian Models Hidden Markov Models Speech Recognition EM Training for HMM Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm For HMMs, the hidden variable Q will describe in which state the HMM was for each observation x_t of a sequence X. The joint likelihood of all sequences X^(l) and the hidden variable Q is then: p(X, Q | θ) = Π_{l=1}^n p(X^(l), Q | θ). Let us introduce the following indicator variable: q_{i,t} = 1 if q_t = i, 0 otherwise.

Samy Bengio Tutorial on Statistical Machine Learning 73 Joint Likelihood Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm p(X, Q | θ) = Π_{l=1}^n p(X^(l), Q | θ) = Π_{l=1}^n ( Π_{i=1}^N P(q_0 = i)^q_{i,0} ) Π_{t=1}^{T_l} ( Π_{i=1}^N p(x_t^(l) | q_t = i)^q_{i,t} Π_{j=1}^N P(q_t = i | q_(t−1) = j)^{q_{i,t} q_{j,t−1}} ).

Samy Bengio Tutorial on Statistical Machine Learning 74 Joint Log Likelihood Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm log p(X, Q | θ) = Σ_{l=1}^n Σ_{i=1}^N q_{i,0} log P(q_0 = i) + Σ_{l=1}^n Σ_{t=1}^{T_l} Σ_{i=1}^N q_{i,t} log p(x_t^(l) | q_t = i) + Σ_{l=1}^n Σ_{t=1}^{T_l} Σ_{i=1}^N Σ_{j=1}^N q_{i,t} q_{j,t−1} log P(q_t = i | q_(t−1) = j).

Samy Bengio Tutorial on Statistical Machine Learning 75 Auxiliary Function Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Let us now write the corresponding auxiliary function: A(θ, θ^s) = E_Q[log p(X, Q | θ) | X, θ^s] = Σ_{l=1}^n Σ_{i=1}^N E_Q[q_{i,0} | X, θ^s] log P(q_0 = i) + Σ_{l=1}^n Σ_{t=1}^{T_l} Σ_{i=1}^N E_Q[q_{i,t} | X, θ^s] log p(x_t^(l) | q_t = i) + Σ_{l=1}^n Σ_{t=1}^{T_l} Σ_{i=1}^N Σ_{j=1}^N E_Q[q_{i,t} q_{j,t−1} | X, θ^s] log P(q_t = i | q_(t−1) = j). From now on, let us forget about index l for simplification.

Samy Bengio Tutorial on Statistical Machine Learning 76 Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Derivation of the Backward Variable β, the probability of generating the rest of the sequence x_(t+1)^T given that we are in state i at time t: β(i, t) := p(x_(t+1)^T | q_t = i) = Σ_j p(x_(t+1)^T, q_(t+1) = j | q_t = i) = Σ_j p(x_(t+1) | x_(t+2)^T, q_(t+1) = j, q_t = i) p(x_(t+2)^T, q_(t+1) = j | q_t = i) = Σ_j p(x_(t+1) | q_(t+1) = j) p(x_(t+2)^T | q_(t+1) = j, q_t = i) P(q_(t+1) = j | q_t = i) = Σ_j p(x_(t+1) | q_(t+1) = j) p(x_(t+2)^T | q_(t+1) = j) P(q_(t+1) = j | q_t = i) = Σ_j p(x_(t+1) | q_(t+1) = j) β(j, t + 1) P(q_(t+1) = j | q_t = i).

Samy Bengio Tutorial on Statistical Machine Learning 77 Final Details About β Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Reminder: β(i, t) = p(x_(t+1)^T | q_t = i). Final condition: β(i, T) = 1 if i is a final state, 0 otherwise. Hence, to compute all the β variables, we need O(N²T) operations, where N is the number of states.
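
Analogously, the β recursion can be sketched as follows, reusing the same assumed discrete-emission setup and parameter layout as the forward sketch; again, one would normally scale or work in log probabilities.

```python
import numpy as np

def backward_variables(trans, emit, obs, final):
    """beta(i, t) = p(x_{t+1}^T | q_t = i) for t = 0..T.

    trans : (N, N)  trans[j, i] = P(q_t = i | q_{t-1} = j)
    emit  : (N, K)  emit[i, k]  = p(x_t = k | q_t = i)
    obs   : observed symbols x_1..x_T
    final : (N,) indicator of final states, i.e. beta(i, T)
    """
    T, N = len(obs), trans.shape[0]
    beta = np.zeros((T + 1, N))
    beta[T] = final                                   # final condition
    for t in range(T - 1, -1, -1):
        # beta(i, t) = sum_j p(x_{t+1} | q_{t+1}=j) * beta(j, t+1) * P(q_{t+1}=j | q_t=i)
        beta[t] = trans @ (emit[:, obs[t]] * beta[t + 1])
    return beta

trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
emit  = np.array([[0.9, 0.1],
                  [0.3, 0.7]])
print(backward_variables(trans, emit, obs=[0, 0, 1], final=np.ones(2)))
```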

Samy Bengio Tutorial on Statistical Machine Learning 78 E-Step for HMMs Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Posterior on emission distributions: E_Q[q_{i,t} | X, θ^s] = P(q_t = i | x_1^T, θ^s) = P(q_t = i | x_1^T) = p(x_1^T, q_t = i) / p(x_1^T) = p(x_(t+1)^T | q_t = i, x_1^t) p(x_1^t, q_t = i) / p(x_1^T) = p(x_(t+1)^T | q_t = i) p(x_1^t, q_t = i) / p(x_1^T) = β(i, t) α(i, t) / Σ_j α(j, T).

Samy Bengio Tutorial on Statistical Machine Learning 79 E-Step for HMMs Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Posterior on transition distributions: E_Q[q_{i,t} q_{j,t−1} | X, θ^s] = P(q_t = i, q_(t−1) = j | x_1^T, θ^s) = p(x_1^T, q_t = i, q_(t−1) = j) / p(x_1^T) = p(x_(t+1)^T | q_t = i) P(q_t = i | q_(t−1) = j) p(x_t | q_t = i) p(x_1^(t−1), q_(t−1) = j) / p(x_1^T) = β(i, t) P(q_t = i | q_(t−1) = j) p(x_t | q_t = i) α(j, t − 1) / Σ_j α(j, T).

Samy Bengio Tutorial on Statistical Machine Learning 80 E-Step for HMMs Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Posterior on initial state distribution: E_Q[q_{i,0} | X, θ^s] = P(q_0 = i | x_1^T, θ^s) = P(q_0 = i | x_1^T) = p(x_1^T, q_0 = i) / p(x_1^T) = p(x_1^T | q_0 = i) P(q_0 = i) / p(x_1^T) = β(i, 0) P(q_0 = i) / Σ_j α(j, T).

Samy Bengio Tutorial on Statistical Machine Learning 81 M-Step for HMMs Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Find the parameters θ that maximize A, hence search for ∂A/∂θ = 0. When transition distributions are represented as tables, using a Lagrange multiplier, we obtain: P(q_t = i | q_(t−1) = j) = Σ_{t=1}^T P(q_t = i, q_(t−1) = j | x_1^T, θ^s) / Σ_{t=1}^T P(q_(t−1) = j | x_1^T, θ^s). When emission distributions are implemented as GMMs, use the equations already given, weighted by the posterior on emissions P(q_t = i | x_1^T, θ^s).

Samy Bengio Tutorial on Statistical Machine Learning 82 Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm The Most Likely Path (Graphical View) The Viterbi algorithm finds the best state sequence: compute the partial paths, then backtrack in time. [figure: trellis over states q1, q2, q3 and time, showing the partial paths and the backtracking pass]

Samy Bengio Tutorial on Statistical Machine Learning 83 Markovian Models Hidden Markov Models Speech Recognition The Viterbi Algorithm for HMMs Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm The Viterbi algorithm finds the best state sequence. V(i, t) := max_{q_1^(t−1)} p(x_1^t, q_1^(t−1), q_t = i) = max_{q_1^(t−1)} p(x_t | x_1^(t−1), q_1^(t−1), q_t = i) p(x_1^(t−1), q_1^(t−1), q_t = i) = p(x_t | q_t = i) max_j max_{q_1^(t−2)} p(x_1^(t−1), q_1^(t−2), q_t = i, q_(t−1) = j) = p(x_t | q_t = i) max_j max_{q_1^(t−2)} P(q_t = i | q_(t−1) = j) p(x_1^(t−1), q_1^(t−2), q_(t−1) = j) = p(x_t | q_t = i) max_j P(q_t = i | q_(t−1) = j) max_{q_1^(t−2)} p(x_1^(t−1), q_1^(t−2), q_(t−1) = j) = p(x_t | q_t = i) max_j P(q_t = i | q_(t−1) = j) V(j, t − 1).

Samy Bengio Tutorial on Statistical Machine Learning 84 Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm From Viterbi to the State Sequence Reminder: V(i, t) = max_{q_1^(t−1)} p(x_1^t, q_1^(t−1), q_t = i). Let us compute V(i, t) for each state i and each time t of a given sequence x_1^T. Moreover, let us also keep for each V(i, t) the associated argmax previous state j. Then, starting from the state i* = arg max_j V(j, T), backtrack to decode the most probable state sequence. Hence, to compute all the V(i, t) variables, we need O(N²T) operations, where N is the number of states.
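
Here is a sketch of the Viterbi recursion with backtracking, in the log domain for numerical safety; the discrete-emission setup and parameter names mirror the forward sketch and are my own assumptions.

```python
import numpy as np

def viterbi(prior, trans, emit, obs):
    """Most probable state sequence for a discrete-emission HMM (log domain)."""
    T, N = len(obs), len(prior)
    log_trans, log_emit = np.log(trans), np.log(emit)

    V = np.log(prior)                      # V(i, 0) = log P(q_0 = i)
    back = np.zeros((T, N), dtype=int)
    for t, x_t in enumerate(obs):
        # V(i, t) = log p(x_t | q_t=i) + max_j [log P(q_t=i | q_{t-1}=j) + V(j, t-1)]
        scores = V[:, None] + log_trans    # scores[j, i]
        back[t] = scores.argmax(axis=0)
        V = scores.max(axis=0) + log_emit[:, x_t]

    path = [int(V.argmax())]               # best final state
    for t in range(T - 1, 0, -1):          # backtrack in time
        path.append(int(back[t, path[-1]]))
    return path[::-1]

prior = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.2, 0.8]])
emit  = np.array([[0.9, 0.1], [0.3, 0.7]])
print(viterbi(prior, trans, emit, obs=[0, 0, 1, 1, 1]))
```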

Samy Bengio Tutorial on Statistical Machine Learning 85 Markovian Models Hidden Markov Models Speech Recognition Embedded Training Decoding Measuring Error 13 Markovian Models 14 Hidden Markov Models 15 Speech Recognition

Samy Bengio Tutorial on Statistical Machine Learning 86 Markovian Models Hidden Markov Models Speech Recognition Embedded Training Decoding Measuring Error HMMs for Speech Recognition Application: continuous speech recognition: find a sequence of phonemes (or words) given an acoustic sequence Idea: use a phoneme model

Samy Bengio Tutorial on Statistical Machine Learning 87 Markovian Models Hidden Markov Models Speech Recognition Embedded Training Decoding Measuring Error Embedded Training of HMMs For each acoustic sequence in the training set, create a new HMM as the concatenation of the HMMs representing the underlying sequence of phonemes. Maximize the likelihood of the training sentences. [figure: the phoneme HMMs C, A, T concatenated to model the word CAT]

Samy Bengio Tutorial on Statistical Machine Learning 88 Markovian Models Hidden Markov Models Speech Recognition Embedded Training Decoding Measuring Error HMMs: Decoding a Sentence Decide what is the accepted vocabulary. Optionally add a language model: P(word sequence). Efficient algorithm to find the optimal path in the decoding HMM. [figure: a decoding HMM built from the word models D-O-G and C-A-T]

Samy Bengio Tutorial on Statistical Machine Learning 89 Markovian Models Hidden Markov Models Speech Recognition Embedded Training Decoding Measuring Error Measuring Error How do we measure the quality of a speech recognizer? Problem: the target solution is a sentence, the obtained solution is also a sentence, but they might have different lengths! Proposed solution: the edit distance: assuming you have access to the operators insert, delete, and substitute, what is the smallest number of such operations needed to go from the obtained to the desired sentence? An efficient algorithm exists to compute this. At the end, we measure the error as follows: WER = (#ins + #del + #subst) / #words. Note that the word error rate (WER) can be greater than 1...
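
The edit distance mentioned above is the classic dynamic program over insertions, deletions, and substitutions. The sketch below computes it over words and derives the WER; the helper names are mine, not from the tutorial.

```python
def edit_distance(hyp, ref):
    """Minimum number of insertions, deletions and substitutions to turn hyp into ref."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all remaining hypothesis words
    for j in range(n + 1):
        d[0][j] = j                       # insert all remaining reference words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
    return d[m][n]

def word_error_rate(hyp, ref):
    # WER = (#ins + #del + #subst) / #words in the reference
    return edit_distance(hyp.split(), ref.split()) / len(ref.split())

print(word_error_rate("the cat sat on mat", "the cat sat on the mat"))
```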

Samy Bengio Tutorial on Statistical Machine Learning 90 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Part V Advanced Models for Multimodal Processing

Samy Bengio Tutorial on Statistical Machine Learning 91 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings 16 Introduction to Graphical Models 17 A Zoo of Graphical Models 18 Applications to Meetings

Samy Bengio Tutorial on Statistical Machine Learning 92 Graphical Models Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings A tool to efficiently model complex joint distributions: p(X_1, X_2, ..., X_n) = Π_{i=1}^n p(X_i | parents(X_i)). Introduces and makes use of known conditional independences, e.g. p(X_1, X_2, ..., X_5) = p(X_1 | X_2, X_5) p(X_2) p(X_3) p(X_4 | X_2, X_3) p(X_5 | X_4). [figure: the corresponding directed graph over X_1, ..., X_5]
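
To make the factorization concrete, here is a tiny sketch that evaluates the example joint p(X_1, ..., X_5) from its factors for binary variables; all conditional probability tables are invented for illustration, and the sanity check verifies that the factorized joint sums to 1.

```python
# Each factor is a small table over binary variables (0/1).
p_x2 = {0: 0.7, 1: 0.3}
p_x3 = {0: 0.4, 1: 0.6}
p_x4_given_x2_x3 = {(x2, x3): {0: 0.5 + 0.1 * x2 - 0.2 * x3} for x2 in (0, 1) for x3 in (0, 1)}
p_x5_given_x4 = {0: {0: 0.9}, 1: {0: 0.2}}
p_x1_given_x2_x5 = {(x2, x5): {0: 0.3 + 0.2 * x2 + 0.1 * x5} for x2 in (0, 1) for x5 in (0, 1)}

def prob(table, value):
    # tables store P(var = 0); P(var = 1) is the complement
    p0 = table[0]
    return p0 if value == 0 else 1.0 - p0

def joint(x1, x2, x3, x4, x5):
    # p(X1..X5) = p(X1|X2,X5) p(X2) p(X3) p(X4|X2,X3) p(X5|X4)
    return (prob(p_x1_given_x2_x5[(x2, x5)], x1) * p_x2[x2] * p_x3[x3]
            * prob(p_x4_given_x2_x3[(x2, x3)], x4) * prob(p_x5_given_x4[x4], x5))

# Sanity check: the factorized joint sums to 1 over all assignments.
total = sum(joint(a, b, c, d, e)
            for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1) for e in (0, 1))
print(joint(1, 0, 1, 0, 1), total)
```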

Samy Bengio Tutorial on Statistical Machine Learning 93 Graphical Models Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Can handle an arbitrary number of random variables Junction Tree Algorithm (JTA): used to estimate joint probability (inference) Expectation-Maximization (EM): used to train Depending on the graph, EM and JTA are tractable or not Dynamic Bayes Nets: temporal extension of graphical models

Samy Bengio Tutorial on Statistical Machine Learning 94 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs 16 Introduction to Graphical Models 17 A Zoo of Graphical Models 18 Applications to Meetings

Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs A Zoo of Graphical Models for Multi-Channel Processing Samy Bengio Tutorial on Statistical Machine Learning 95

Samy Bengio Tutorial on Statistical Machine Learning 96 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Notation for Multimodal Data Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Notation for One Stream Let us have a training set of L observation sequences. The observation sequence l is O^l = (o^l_1, o^l_2, ..., o^l_{T_l}), with o^l_t the vector of multimodal features at time t for sequence l, of length T_l. Notation for Several Streams Let us consider N streams for each observation sequence. The stream n of observation sequence l is O^{l:n} = (o^{l:n}_1, o^{l:n}_2, ..., o^{l:n}_{T_l}), with o^{l:n}_t the vector of features of stream n at time t for sequence l, of length T_l.

Samy Bengio Tutorial on Statistical Machine Learning 97 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Goals for Multimodal Data Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Inference Likelihood: Π_{l=1}^L p({O^{l:n}}_{n=1}^N ; θ) Training θ* = arg max_θ Π_{l=1}^L p({O^{l:n}}_{n=1}^N ; θ)

Samy Bengio Tutorial on Statistical Machine Learning 98 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Sample all streams at the same frame rate Emission distributions p({o^{l:n}_t}_{n=1}^N | q_t = i) Transition distributions P(q_t = i | q_(t−1) = j)

Samy Bengio Tutorial on Statistical Machine Learning 99 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Multi-Stream HMMs Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Train a separate model for each stream n: θ*_n = arg max_θ Π_{l=1}^L p(O^{l:n} ; θ_n) Inference, for each state q_t = i: p({o^{l:n}_t}_{n=1}^N | q_t = i) = Π_{n=1}^N p(o^{l:n}_t | q_t = i; θ_n)

Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Multi-Stream HMMs Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Training and Inference Complexity Per iteration, per observation sequence l: O(N·S²·T_l), with S the number of states of each HMM stream, N the number of streams, T_l the length of the observation sequence. Additional Notes All stream HMMs need to be of the same topology. One can add stream weights ω_n as follows: p({o^{l:n}_t}_{n=1}^N | q_t = i) = Π_{n=1}^N p(o^{l:n}_t | q_t = i; θ_n)^{ω_n} Samy Bengio Tutorial on Statistical Machine Learning 100
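
The weighted multi-stream emission score above is a weighted sum of per-stream log-likelihoods. The sketch below illustrates it with assumed Gaussian emissions per stream and invented stream weights ω_n.

```python
import numpy as np
from scipy.stats import multivariate_normal

def multistream_log_emission(obs_per_stream, state_params, stream_weights):
    """log p({o_t^n} | q_t = i) = sum_n omega_n * log p(o_t^n | q_t = i; theta_n).

    obs_per_stream : list of feature vectors, one per stream
    state_params   : list of (mean, cov) per stream for the current state
    stream_weights : list of omega_n
    """
    return sum(w * multivariate_normal.logpdf(o, mean=mu, cov=cov)
               for o, (mu, cov), w in zip(obs_per_stream, state_params, stream_weights))

# Two streams (e.g. audio and visual features) for one state of the HMM
audio_obs, video_obs = np.array([0.1, -0.2]), np.array([1.0])
params = [(np.zeros(2), np.eye(2)), (np.array([0.8]), np.array([[0.3]]))]
print(multistream_log_emission([audio_obs, video_obs], params, stream_weights=[0.7, 0.3]))
```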

Samy Bengio Tutorial on Statistical Machine Learning 101 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Multi-Stream HMMs Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Pros Efficient training (linear in the number of streams) Robust to stream-dependent noise Easy to implement Cons Does not maximize the joint probability of the data Each stream needs to use the same HMM topology Assumes complete stream independence during training, and stream independence given the state during decoding.

Samy Bengio Tutorial on Statistical Machine Learning 102 Coupled HMMs Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Let q^n_t = i be the state at time t for stream n Stream emission distributions: p(o^{l:n}_t | q^n_t = i; θ_n) Coupled transition distributions: P(q^n_t | {q^m_(t−1)}_{m=1}^N ; θ_n)

Samy Bengio Tutorial on Statistical Machine Learning 103 Coupled HMMs Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Complexity Per iteration, per observation sequence l: O(S^(2N)·T_l), with S the number of states of each HMM stream, N the number of streams, T_l the length of the observation sequence. N-heads approximation: O(N²·S²·T_l)

Samy Bengio Tutorial on Statistical Machine Learning 104 Coupled HMMs Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Pros Considers correlations between streams during training and decoding. Cons The algorithm is intractable (unless using the approximation). Assumes synchrony between the streams.

Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Challenges in Multi-Channel Integration: Asynchrony [figure: stream 1 and stream 2 each go through states A, B, C; naive integration forces the joint state sequence A A A B B B B C C C] Samy Bengio Tutorial on Statistical Machine Learning 105

Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Some Evidence of Stream Asynchrony Audio-visual speech recognition with lip movements: lips do not move at the same time as we hear the corresponding sound. Speaking and pointing: pointing to a map and saying I want to go there. Gesticulating, looking at, and talking to someone during a conversation. In a news video, the delay between the moment when the newscaster says Bush and the moment when Bush's picture appears.... Samy Bengio Tutorial on Statistical Machine Learning 106

Samy Bengio Tutorial on Statistical Machine Learning 107 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Asynchrony Revisited Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs [figure: stream 1 with 3 d1-dimensional states A, B, C; stream 2 with 3 d2-dimensional states A, B, C; naive integration yields 5 (d1+d2)-dimensional states A A A B B B B C C C; joint/asynchronous modeling yields 3 (d1+d2)-dimensional states A A B B C C]

Samy Bengio Tutorial on Statistical Machine Learning 108 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Asynchronous HMMs Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Enables re-synchronization of streams. One HMM: maximizes the likelihood of all streams jointly. [figure: graphical model with observation streams O^1_{1:T1} and O^2_{1:T2}, joint emissions p(o^1_r, o^2_s | q_t), per-stream emissions p(o^1_r | q_t) and p(o^2_s | q_t), emission decisions p(emit o^1_r | q_t) and p(emit o^2_s | q_t), and transitions P(q_t | q_(t−1))]

Samy Bengio Tutorial on Statistical Machine Learning 109 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Training and Complexity Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Training The EM algorithm maximizes the joint probability p(x_1^{T_1}, y_1^{T_2}, ...). Complexity grows with the number of streams, but can be controlled. Exact complexity: O(S² Π_{i=1}^N T_i). Introduction of temporal constraints: O(S²·d·T), with d the maximum delay allowed between streams. Applications Easily adaptable to complex tasks such as speech recognition. Significant performance improvements in audio-visual speech and speaker recognition and in meeting analysis.

Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Alignments with Asynchronous HMMs [figure, left: word error rate (WER) as a function of the signal-to-noise ratio (0 dB, 5 dB, 10 dB) for audio HMM, audio+video HMM, audio+video AHMM, and video HMM; figure, right: AHMM and HMM alignments of audio and video for the French digits zero to neuf, with alignment bounds and segmentation] Samy Bengio Tutorial on Statistical Machine Learning 110

Samy Bengio Tutorial on Statistical Machine Learning 111 Layered HMMs Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Philosophy Divide and Conquer for complex tasks Idea: transform the data from a raw representation into a higher level abstraction Then, transform again the result into yet another and higher level of abstraction, and continue as needed For temporal sequences: go from a fine time granularity to a coarser one.

Samy Bengio Tutorial on Statistical Machine Learning 112 General Algorithm Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Layered Algorithm For i = 0 to M layers, do: 1 Let O^i_{1:T} be a sequence of observations of HMM model m_i. 2 Train m_i using EM in order to maximize p(O^i_{1:T} | m_i). 3 Compute, for each state s^i_j of m_i, the data posterior p(s^i_j | O^i_{1:T}, m_i). 4 Define O^{i+1}_{1:T} as the sequence of vectors of p(s^i_j | O^i_{1:T}, m_i).

Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Complexity and Additional Notes Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Complexity Training complexity: O(N·S²·M) for M layers. A Speech Example A phoneme layer (with phoneme constraints) A word layer (with word constraints) A sentence layer (with language constraints) Samy Bengio Tutorial on Statistical Machine Learning 113

Samy Bengio Tutorial on Statistical Machine Learning 114 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Complexity of Meetings A Layered Approach Experiments 16 Introduction to Graphical Models 17 A Zoo of Graphical Models 18 Applications to Meetings

Samy Bengio Tutorial on Statistical Machine Learning 115 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Complexity of Meetings A Layered Approach Experiments Complexity of the Meeting Scenario Modeling multimodal group-level human interactions in meetings. Multimodal nature: data collected from multiple sensors (cameras, microphones, projectors, white-board, etc). Group nature: involves multiple interacting persons at the same time.

Samy Bengio Tutorial on Statistical Machine Learning 116 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Complexity of Meetings A Layered Approach Experiments Meeting Analysis Structure a meeting as a sequence of group actions taken from an exhaustive set V of N possible actions: V = {v_1, v_2, v_3, ..., v_N}. [figure: a meeting timeline segmented into actions V1, V2, V3, V2, V4, V1] Recorded and annotated 30 training and 30 test meetings. Extract high-level audio and visual features. Try to recover the target action sequence of unseen meetings. I. McCowan, D. Gatica-Perez, S. Bengio, G. Lathoud, M. Barnard, and D. Zhang. Automatic Analysis of Multimodal Group Actions in Meetings. IEEE Transactions on PAMI, 27(3), 2005.

Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Complexity of Meetings A Layered Approach Experiments A One-Layer Approach [figure: cameras and microphones produce AV features for person 1 through person N, plus group AV features, all fed to a single G-HMM] Classical Approach A large vector of audio-visual features from each participant and group-level features are concatenated to define the observation space. A general HMM is trained using a set of labeled meetings. Samy Bengio Tutorial on Statistical Machine Learning 117

Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Complexity of Meetings A Layered Approach Experiments A Two-Layer Approach [figure: per-person AV features from cameras and microphones feed individual I-HMMs (I-HMM 1 through I-HMM N), whose outputs, together with group AV features, feed a G-HMM] Advantages Smaller observation space. I-HMMs share parameters, are person independent, and are trained on a simple task (write, talk, passive). The last layer is less sensitive to variations of low-level data. Samy Bengio Tutorial on Statistical Machine Learning 118