Tutorial on Statistical Machine Learning


1 October 7th, ICMI, Trento Samy Bengio Tutorial on Statistical Machine Learning 1 Tutorial on Statistical Machine Learning with Applications to Multimodal Processing Samy Bengio IDIAP Research Institute Martigny, Switzerland bengio@idiap.ch

2 Samy Bengio Tutorial on Statistical Machine Learning 2 Outline of the Tutorial Outline Introduction Statistical Learning Theory EM and Gaussian Mixture Models Hidden Markov Models Advanced Models for Multimodal Processing Updated slides

3 Samy Bengio Tutorial on Statistical Machine Learning 3 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications Part I Introduction

4 Samy Bengio Tutorial on Statistical Machine Learning 4 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications 1 What is Machine Learning? 2 Why Learning is Difficult? 3 Types of Problems 4 Applications

5 Samy Bengio Tutorial on Statistical Machine Learning 5 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications What is Machine Learning? (Graphical View)

6 Samy Bengio Tutorial on Statistical Machine Learning 6 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications What is Machine Learning? Learning is an essential human property Learning means changing in order to be better (according to a given criterion) when a similar situation arrives Learning IS NOT learning by heart Any computer can learn by heart, the difficulty is to generalize a behavior to a novel situation

7 Samy Bengio Tutorial on Statistical Machine Learning 7 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications 1 What is Machine Learning? 2 Why Learning is Difficult? 3 Types of Problems 4 Applications

8 Samy Bengio Tutorial on Statistical Machine Learning 8 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications Why Learning is Difficult? Given a finite amount of training data, you have to derive a relation for an infinite domain In fact, there is an infinite number of such relations How should we draw the relation?

9 Samy Bengio Tutorial on Statistical Machine Learning 9 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications Why Learning is Difficult? (2) Given a finite amount of training data, you have to derive a relation for an infinite domain In fact, there is an infinite number of such relations Which relation is the most appropriate?

10 Samy Bengio Tutorial on Statistical Machine Learning 10 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications Why Learning is Difficult? (3) Given a finite amount of training data, you have to derive a relation for an infinite domain In fact, there is an infinite number of such relations... the hidden test points...

11 Samy Bengio Tutorial on Statistical Machine Learning 11 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications Occam's Razor Principle William of Occam: a monk living in the 14th century Principle of Parsimony: One should not increase, beyond what is necessary, the number of entities required to explain anything When many solutions are available for a given problem, we should select the simplest one But what do we mean by simple? We will use prior knowledge of the problem to be solved to define what is a simple solution Example of a prior: smoothness

12 Samy Bengio Tutorial on Statistical Machine Learning 12 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications Learning as a Search Problem [Figure: the set of solutions chosen a priori, the set of solutions compatible with the training set, an initial solution, and the optimal solution]

13 Samy Bengio Tutorial on Statistical Machine Learning 13 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications 1 What is Machine Learning? 2 Why Learning is Difficult? 3 Types of Problems 4 Applications

14 Samy Bengio Tutorial on Statistical Machine Learning 14 Types of Problems What is Machine Learning? Why Learning is Difficult? Types of Problems Applications There are 3 kinds of problems: regression

15 Samy Bengio Tutorial on Statistical Machine Learning 15 Types of Problems What is Machine Learning? Why Learning is Difficult? Types of Problems Applications There are 3 kinds of problems: regression, classification

16 Samy Bengio Tutorial on Statistical Machine Learning 16 Types of Problems What is Machine Learning? Why Learning is Difficult? Types of Problems Applications There are 3 kinds of problems: regression, classification, density estimation [Figure: an estimated density P(X) plotted against X]

17 Samy Bengio Tutorial on Statistical Machine Learning 17 What is Machine Learning? Why Learning is Difficult? Types of Problems Applications 1 What is Machine Learning? 2 Why Learning is Difficult? 3 Types of Problems 4 Applications

18 Samy Bengio Tutorial on Statistical Machine Learning 18 Applications What is Machine Learning? Why Learning is Difficult? Types of Problems Applications Vision Processing Face detection/verification Handwriting recognition Speech Processing Phoneme/Word/Sentence/Person recognition Others Finance: asset prediction, portfolio and risk management Telecom: traffic prediction Data mining: make use of huge datasets kept by large corporations Games: backgammon, Go Control: robots... and plenty of others of course!

19 Samy Bengio Tutorial on Statistical Machine Learning 19 Data, Functions, Risk The Capacity Methodology Models Part II Statistical Learning Theory

20 Samy Bengio Tutorial on Statistical Machine Learning 20 Data, Functions, Risk The Capacity Methodology Models 5 Data, Functions, Risk 6 The Capacity 7 Methodology 8 Models

21 Samy Bengio Tutorial on Statistical Machine Learning 21 Data, Functions, Risk The Capacity Methodology Models The Data Available training data: Let $Z_1, Z_2, \dots, Z_n$ be an $n$-tuple random sample of an unknown distribution of density $p(z)$. All $Z_i$ are independently and identically distributed (iid). Let $D_n$ be a particular instance $\{z_1, z_2, \dots, z_n\}$. Various forms of the data: Classification: $Z = (X, Y) \in \mathbb{R}^d \times \{-1, 1\}$; objective: given a new $x$, estimate $P(Y \mid X = x)$. Regression: $Z = (X, Y) \in \mathbb{R}^d \times \mathbb{R}$; objective: given a new $x$, estimate $E[Y \mid X = x]$. Density estimation: $Z \in \mathbb{R}^d$; objective: given a new $z$, estimate $p(z)$.

22 Samy Bengio Tutorial on Statistical Machine Learning 22 The Function Space Data, Functions, Risk The Capacity Methodology Models Learning: search for a good function in a function space $\mathcal{F}$. Examples of functions $f(\cdot; \theta) \in \mathcal{F}$: Regression: $\hat y = f(x; a, b, c) = a x^2 + b x + c$. Classification: $\hat y = f(x; a, b, c) = \mathrm{sign}(a x^2 + b x + c)$. Density estimation: $\hat p(z) = f(z; \mu, \Sigma) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(z - \mu)^T \Sigma^{-1} (z - \mu)\right)$.

23 Samy Bengio Tutorial on Statistical Machine Learning 23 The Loss Function Data, Functions, Risk The Capacity Methodology Models Learning: search for a good function in a function space $\mathcal{F}$. Examples of loss functions $L : \mathcal{Z} \times \mathcal{F} \to \mathbb{R}$: Regression: $L(z, f) = L((x, y), f) = (f(x) - y)^2$. Classification: $L(z, f) = L((x, y), f) = 0$ if $f(x) = y$, $1$ otherwise. Density estimation: $L(z, f) = -\log f(z)$.

24 Samy Bengio Tutorial on Statistical Machine Learning 24 Data, Functions, Risk The Capacity Methodology Models The Risk and the Empirical Risk Learning: search for a good function in a function space $\mathcal{F}$. Minimize the Expected Risk on $\mathcal{F}$, defined for a given $f$ as $R(f) = E_Z[L(Z, f)] = \int_{\mathcal{Z}} L(z, f)\, p(z)\, dz$. Induction Principle: select $f^* = \arg\min_{f \in \mathcal{F}} R(f)$; problem: $p(z)$ is unknown!!! Empirical Risk: $\hat R(f, D_n) = \frac{1}{n} \sum_{i=1}^n L(z_i, f)$.
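
A minimal Python/NumPy sketch of the empirical risk for the squared loss, as defined above; it is not part of the original slides, and the predictor and the data points are hypothetical.

```python
import numpy as np

def empirical_risk(f, X, Y):
    """Empirical risk R_hat(f, D_n) = (1/n) * sum_i L(z_i, f) for the squared loss."""
    predictions = np.array([f(x) for x in X])
    losses = (predictions - Y) ** 2          # L((x, y), f) = (f(x) - y)^2
    return losses.mean()

# Example: a quadratic predictor f(x; a, b, c) = a*x^2 + b*x + c on a tiny dataset
f = lambda x, a=1.0, b=0.0, c=-1.0: a * x**2 + b * x + c
X = np.array([-1.0, 0.0, 1.0, 2.0])
Y = np.array([0.2, -1.1, 0.1, 2.9])
print(empirical_risk(f, X, Y))
```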

25 Samy Bengio Tutorial on Statistical Machine Learning 25 Data, Functions, Risk The Capacity Methodology Models The Risk and the Empirical Risk The empirical risk is an unbiased estimate of the risk: $E[\hat R(f, D)] = R(f)$. The principle of empirical risk minimization: $f^*(D_n) = \arg\min_{f \in \mathcal{F}} \hat R(f, D_n)$. Training error: $\hat R(f^*(D_n), D_n) = \min_{f \in \mathcal{F}} \hat R(f, D_n)$. Is the training error a biased estimate of the risk?

26 Samy Bengio Tutorial on Statistical Machine Learning 26 The Training Error Data, Functions, Risk The Capacity Methodology Models Is the training error a biased estimate of the risk? Yes: $E[R(f^*(D_n)) - \hat R(f^*(D_n), D_n)] \geq 0$. The solution $f^*(D_n)$ found by minimizing the training error is better on $D_n$ than on any other set $D'_n$ drawn from $p(z)$. Can we bound the difference between the training error and the generalization error, $R(f^*(D_n)) - \hat R(f^*(D_n), D_n)$? Answer: under certain conditions on $\mathcal{F}$, yes. These conditions depend on the notion of capacity $h$ of $\mathcal{F}$.

27 Samy Bengio Tutorial on Statistical Machine Learning 27 Data, Functions, Risk The Capacity Methodology Models 5 Data, Functions, Risk 6 The Capacity 7 Methodology 8 Models

28 Samy Bengio Tutorial on Statistical Machine Learning 28 The Capacity Data, Functions, Risk The Capacity Methodology Models The capacity $h(\mathcal{F})$ is a measure of its size, or complexity. Classification: the capacity $h(\mathcal{F})$ is the largest $n$ such that there exists a set of examples $D_n$ such that one can always find an $f \in \mathcal{F}$ which gives the correct answer for all examples in $D_n$, for any possible labeling. Regression and density estimation: a notion of capacity also exists, but is more complex to derive (for instance, we can always reduce a regression problem to a classification problem). Bound on the expected risk: let $\tau = \sup L - \inf L$; then with probability at least $1 - \eta$, $\sup_{f \in \mathcal{F}} \left| R(f) - \hat R(f, D_n) \right| \leq 2\tau \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{9}}{n}}$.

29 Samy Bengio Tutorial on Statistical Machine Learning 29 Theoretical Curves Data, Functions, Risk The Capacity Methodology Models [Figure: bound on the expected risk, confidence interval, and empirical risk plotted as a function of the capacity h]

30 Theoretical Curves Data, Functions, Risk The Capacity Methodology Models [Figure: bound on the expected risk, inf R(f), and empirical risk plotted as a function of the number of training examples n] Samy Bengio Tutorial on Statistical Machine Learning 30

31 Samy Bengio Tutorial on Statistical Machine Learning 31 Data, Functions, Risk The Capacity Methodology Models 5 Data, Functions, Risk 6 The Capacity 7 Methodology 8 Models

32 Samy Bengio Tutorial on Statistical Machine Learning 32 Methodology Data, Functions, Risk The Capacity Methodology Models First: identify the goal! It could be 1 to give the best model you can obtain given a training set, 2 to give the expected performance of a model obtained by empirical risk minimization given a training set, or 3 to give the best model and its expected performance that you can obtain given a training set. If the goal is (1): you need to do model selection. If the goal is (2): you need to estimate the risk. If the goal is (3): you need to do both! There are various methods that can be used for either risk estimation or model selection: simple validation, cross-validation (k-fold, leave-one-out).

33 Samy Bengio Tutorial on Statistical Machine Learning 33 Data, Functions, Risk The Capacity Methodology Models Model Selection - Validation Select a family of functions with hyper-parameter $\theta$. Divide your training set $D_n$ into two parts $D_{tr} = \{z_1, \dots, z_{tr}\}$ and $D_{va} = \{z_{tr+1}, \dots, z_{tr+va}\}$ with $tr + va = n$. For each value $\theta_m$ of the hyper-parameter $\theta$: select $f^*_{\theta_m}(D_{tr}) = \arg\min_{f \in \mathcal{F}_{\theta_m}} \hat R(f, D_{tr})$, then estimate $R(f^*_{\theta_m})$ with $\hat R(f^*_{\theta_m}, D_{va}) = \frac{1}{va} \sum_{z_i \in D_{va}} L(z_i, f^*_{\theta_m}(D_{tr}))$. Select $\theta^*_m = \arg\min_{\theta_m} \hat R(f^*_{\theta_m}, D_{va})$. Return $f^*(D_n) = \arg\min_{f \in \mathcal{F}_{\theta^*_m}} \hat R(f, D_n)$.

34 Samy Bengio Tutorial on Statistical Machine Learning 34 Data, Functions, Risk The Capacity Methodology Models Model Selection - Cross-validation Select a family of functions with hyper-parameter $\theta$. Divide your training set $D_n$ into $K$ distinct and equal parts $D_1, \dots, D_k, \dots, D_K$. For each value $\theta_m$ of the hyper-parameter $\theta$: for each part $D_k$ (and its counterpart $\bar D_k$), select $f^*_{\theta_m}(\bar D_k) = \arg\min_{f \in \mathcal{F}_{\theta_m}} \hat R(f, \bar D_k)$ and estimate $R(f^*_{\theta_m}(\bar D_k))$ with $\hat R(f^*_{\theta_m}(\bar D_k), D_k) = \frac{1}{|D_k|} \sum_{z_i \in D_k} L(z_i, f^*_{\theta_m}(\bar D_k))$; then estimate $R(f^*_{\theta_m}(D_n))$ with $\frac{1}{K} \sum_k R(f^*_{\theta_m}(\bar D_k))$. Select $\theta^*_m = \arg\min_{\theta_m} R(f^*_{\theta_m}(D_n))$. Return $f^*(D_n) = \arg\min_{f \in \mathcal{F}_{\theta^*_m}} \hat R(f, D_n)$.
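
A minimal sketch (not from the original slides) of the cross-validation procedure above, using the polynomial degree as the hyper-parameter $\theta$ and the squared loss; the function names and data are illustrative only.

```python
import numpy as np

def kfold_select_degree(X, Y, degrees, K=5, seed=0):
    """Select a polynomial degree (the hyper-parameter theta) by K-fold cross-validation
    of the squared loss, then refit on the full training set, as on the slide."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, K)
    cv_risk = {}
    for d in degrees:                                       # loop over values theta_m
        fold_risks = []
        for k in range(K):
            val = folds[k]                                  # D_k
            train = np.concatenate([folds[j] for j in range(K) if j != k])  # D_{-k}
            coeffs = np.polyfit(X[train], Y[train], d)      # empirical risk minimization on D_{-k}
            pred = np.polyval(coeffs, X[val])
            fold_risks.append(np.mean((pred - Y[val]) ** 2))  # R_hat on D_k
        cv_risk[d] = np.mean(fold_risks)                    # average over the K folds
    best = min(cv_risk, key=cv_risk.get)                    # theta*_m
    return np.polyfit(X, Y, best), best, cv_risk            # retrain on all of D_n

X = np.linspace(-2, 2, 40)
Y = X**2 - 1 + 0.3 * np.random.default_rng(1).normal(size=X.shape)
model, best_degree, risks = kfold_select_degree(X, Y, degrees=[1, 2, 5, 9])
print(best_degree, risks)
```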

35 Samy Bengio Tutorial on Statistical Machine Learning 35 Data, Functions, Risk The Capacity Methodology Models Estimation of the Risk - Validation Divide your training set $D_n$ into two parts $D_{tr} = \{z_1, \dots, z_{tr}\}$ and $D_{te} = \{z_{tr+1}, \dots, z_{tr+te}\}$ with $tr + te = n$. Select $f^*(D_{tr}) = \arg\min_{f \in \mathcal{F}} \hat R(f, D_{tr})$ (this optimization process could include model selection). Estimate $R(f^*(D_{tr}))$ with $\hat R(f^*(D_{tr}), D_{te}) = \frac{1}{te} \sum_{z_i \in D_{te}} L(z_i, f^*(D_{tr}))$.

36 Samy Bengio Tutorial on Statistical Machine Learning 36 Data, Functions, Risk The Capacity Methodology Models Estimation of the Risk - Cross-validation Divide your training set $D_n$ into $K$ distinct and equal parts $D_1, \dots, D_k, \dots, D_K$. For each part $D_k$: let $\bar D_k$ be the set of examples that are in $D_n$ but not in $D_k$; select $f^*(\bar D_k) = \arg\min_{f \in \mathcal{F}} \hat R(f, \bar D_k)$ (this process could include model selection); estimate $R(f^*(\bar D_k))$ with $\hat R(f^*(\bar D_k), D_k) = \frac{1}{|D_k|} \sum_{z_i \in D_k} L(z_i, f^*(\bar D_k))$. Estimate $R(f^*(D_n))$ with $\frac{1}{K} \sum_k R(f^*(\bar D_k))$. When $K = n$: leave-one-out cross-validation.

37 Samy Bengio Tutorial on Statistical Machine Learning 37 Data, Functions, Risk The Capacity Methodology Models Estimation of the Risk and Model Selection When you want both the best model and its expected risk, you need to merge the methods already presented. For instance: train-validation-test: 3 separate data sets are necessary; cross-validation + test: cross-validate on the training set, then test on a separate set; double cross-validation: for each subset, do a second cross-validation with the $K - 1$ other subsets. Other important methodological aspects: compare your results with other methods! Use statistical tests to verify significance. Verify your model on more than one dataset.

38 Samy Bengio Tutorial on Statistical Machine Learning 38 Data, Functions, Risk The Capacity Methodology Models Train - Validation - Test Select a family of functions with hyper-parameter $\theta$. Divide your training set $D_n$ into three parts $D_{tr}$, $D_{va}$, and $D_{te}$. For each value $\theta_m$ of the hyper-parameter $\theta$: select $f^*_{\theta_m}(D_{tr}) = \arg\min_{f \in \mathcal{F}_{\theta_m}} \hat R(f, D_{tr})$ and let $\hat R(f^*_{\theta_m}(D_{tr}), D_{va}) = \frac{1}{va} \sum_{z_i \in D_{va}} L(z_i, f^*_{\theta_m}(D_{tr}))$. Select $\theta^*_m = \arg\min_{\theta_m} \hat R(f^*_{\theta_m}(D_{tr}), D_{va})$. Select $f^*(D_{tr} \cup D_{va}) = \arg\min_{f \in \mathcal{F}_{\theta^*_m}} \hat R(f, D_{tr} \cup D_{va})$. Estimate $R(f^*(D_{tr} \cup D_{va}))$ with $\frac{1}{te} \sum_{z_i \in D_{te}} L(z_i, f^*(D_{tr} \cup D_{va}))$.

39 Samy Bengio Tutorial on Statistical Machine Learning 39 Data, Functions, Risk The Capacity Methodology Models Cross-validation + Test Select a family of functions with hyper-parameter $\theta$. Divide your dataset $D_n$ into two parts: a training set $D_{tr}$ and a test set $D_{te}$. For each value $\theta_m$ of the hyper-parameter $\theta$: estimate $R(f^*_{\theta_m}(D_{tr}))$ on $D_{tr}$ using cross-validation. Select $\theta^*_m = \arg\min_{\theta_m} R(f^*_{\theta_m}(D_{tr}))$. Retrain $f^*(D_{tr}) = \arg\min_{f \in \mathcal{F}_{\theta^*_m}} \hat R(f, D_{tr})$. Estimate $R(f^*(D_{tr}))$ with $\frac{1}{te} \sum_{z_i \in D_{te}} L(z_i, f^*(D_{tr}))$.

40 Samy Bengio Tutorial on Statistical Machine Learning 40 Data, Functions, Risk The Capacity Methodology Models 5 Data, Functions, Risk 6 The Capacity 7 Methodology 8 Models

41 Samy Bengio Tutorial on Statistical Machine Learning 41 Data, Functions, Risk The Capacity Methodology Models Examples of Known Models Multi-Layer Perceptrons (regression, classification) Radial Basis Functions (regression, classification) Support Vector Machines (classification, regression) Gaussian Mixture Models (density estimation, classification) Hidden Markov Models (density estimation, classification) Graphical Models (density estimation, classification) AdaBoost and Bagging (classification, regression, density estimation) Decision Trees (classification, regression)

42 Samy Bengio Tutorial on Statistical Machine Learning 42 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Part III EM and Gaussian Mixture Models

43 Samy Bengio Tutorial on Statistical Machine Learning 43 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Probabilities 9 Preliminaries 10 Gaussian Mixture Models 11 Expectation-Maximization 12 EM for GMMs

44 Samy Bengio Tutorial on Statistical Machine Learning 44 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Probabilities Reminder: Basics on Probabilities A few basic equalities that are often used: 1 (conditional probabilities) $P(A, B) = P(A \mid B)\, P(B)$; 2 (Bayes rule) $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$; 3 if $\bigcup_i B_i = \Omega$ and $B_i \cap B_j = \emptyset$ for all $i \neq j$, then $P(A) = \sum_i P(A, B_i)$.

45 Samy Bengio Tutorial on Statistical Machine Learning 45 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Gaussian Mixture Models 9 Preliminaries 10 Gaussian Mixture Models 11 Expectation-Maximization 12 EM for GMMs

46 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs What is a Gaussian Mixture Model Gaussian Mixture Models A Gaussian Mixture Model (GMM) is a distribution. The likelihood given a Gaussian distribution is $\mathcal{N}(x; \mu, \Sigma) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$ where $\mu$ is the mean and $\Sigma$ is the covariance matrix of the Gaussian. $\Sigma$ is often diagonal. The likelihood given a GMM is $p(x) = \sum_{i=1}^N w_i\, \mathcal{N}(x; \mu_i, \Sigma_i)$ where $N$ is the number of Gaussians and $w_i$ is the weight of Gaussian $i$, with $\sum_i w_i = 1$ and $\forall i : w_i \geq 0$. Samy Bengio Tutorial on Statistical Machine Learning 46
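
A minimal sketch (not from the original slides) of evaluating the log-likelihood of a diagonal-covariance GMM as defined above; the parameters and data points are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Total log-likelihood of points X under a diagonal-covariance GMM:
    p(x) = sum_i w_i * N(x; mu_i, diag(sigma_i^2))."""
    X = np.atleast_2d(X)                               # (n, d)
    log_comp = []
    for w, mu, var in zip(weights, means, variances):
        # log of one diagonal Gaussian component, plus its log mixing weight
        log_norm = -0.5 * np.sum(np.log(2 * np.pi * var))
        quad = -0.5 * np.sum((X - mu) ** 2 / var, axis=1)
        log_comp.append(np.log(w) + log_norm + quad)
    # log-sum-exp over components, for numerical stability
    log_comp = np.stack(log_comp, axis=1)              # (n, N)
    m = log_comp.max(axis=1, keepdims=True)
    return np.sum(m.squeeze(1) + np.log(np.exp(log_comp - m).sum(axis=1)))

weights = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
variances = [np.array([1.0, 1.0]), np.array([0.5, 2.0])]
X = np.array([[0.1, -0.2], [2.8, 3.1]])
print(gmm_log_likelihood(X, weights, means, variances))
```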

47 Samy Bengio Tutorial on Statistical Machine Learning 47 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Gaussian Mixture Models Characteristics of a GMM While Multi-Layer Perceptrons are universal approximators of functions, GMMs are universal approximators of densities. (as long as there are enough Gaussians of course) Even diagonal GMMs are universal approximators. Full rank GMMs are not easy to handle: number of parameters is the square of the number of dimensions. GMMs can be trained by maximum likelihood using an efficient algorithm: Expectation-Maximization.

48 Samy Bengio Tutorial on Statistical Machine Learning 48 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Graphical View More Formally 9 Preliminaries 10 Gaussian Mixture Models 11 Expectation-Maximization 12 EM for GMMs

49 Samy Bengio Tutorial on Statistical Machine Learning 49 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Graphical View More Formally Basics of Expectation-Maximization Objective: maximize the likelihood $p(X; \theta)$ of the data $X$ drawn from an unknown distribution, given the model parameterized by $\theta$: $\theta^* = \arg\max_\theta p(X \mid \theta) = \arg\max_\theta \prod_{p=1}^n p(x_p \mid \theta)$. Basic ideas of EM: introduce a hidden variable such that its knowledge would simplify the maximization of $p(X; \theta)$. At each iteration of the algorithm: E-Step: estimate the distribution of the hidden variable given the data and the current value of the parameters. M-Step: modify the parameters in order to maximize the joint distribution of the data and the hidden variable.

50 Samy Bengio Tutorial on Statistical Machine Learning 50 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Graphical View More Formally EM for GMM (Graphical View, 1) Hidden variable: for each point, which Gaussian generated it? A B

51 Samy Bengio Tutorial on Statistical Machine Learning 51 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Graphical View More Formally EM for GMM (Graphical View, 2) E-Step: for each point, estimate the probability that each Gaussian generated it A B P(A) = 0.6 P(B) = 0.4 P(A) = 0.2 P(B) = 0.8

52 Samy Bengio Tutorial on Statistical Machine Learning 52 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Graphical View More Formally EM for GMM (Graphical View, 3) M-Step: modify the parameters according to the hidden variable to maximize the likelihood of the data (and the hidden variable) A B P(A) = 0.6 P(B) = 0.4 P(A) = 0.2 P(B) = 0.8

53 Samy Bengio Tutorial on Statistical Machine Learning 53 EM: More Formally Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Graphical View More Formally Let us call the hidden variable Q. Let us consider the following auxiliary function: $A(\theta, \theta^s) = E_Q[\log p(X, Q \mid \theta) \mid X, \theta^s]$. It can be shown that maximizing $A$, i.e. $\theta^{s+1} = \arg\max_\theta A(\theta, \theta^s)$, always increases the likelihood of the data $p(X \mid \theta^{s+1})$, and a maximum of $A$ corresponds to a maximum of the likelihood.

54 Samy Bengio Tutorial on Statistical Machine Learning 54 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Auxiliary Function Update Rules 9 Preliminaries 10 Gaussian Mixture Models 11 Expectation-Maximization 12 EM for GMMs

55 Samy Bengio Tutorial on Statistical Machine Learning 55 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs EM for GMM: Hidden Variable Auxiliary Function Update Rules For GMM, the hidden variable Q will describe which Gaussian generated each example. If Q was observed, then it would be simple to maximize the likelihood of the data: simply estimate the parameters Gaussian by Gaussian. Moreover, we will see that we can easily estimate Q. Let us first write the mixture of Gaussians model for one $x_i$: $p(x_i \mid \theta) = \sum_{j=1}^N P(j \mid \theta)\, p(x_i \mid j, \theta)$. Let us now introduce the following indicator variable: $q_{i,j} = 1$ if Gaussian $j$ emitted $x_i$, and $0$ otherwise.

56 Samy Bengio Tutorial on Statistical Machine Learning 56 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs EM for GMM: Auxiliary Function Auxiliary Function Update Rules We can now write the joint likelihood of all the X and Q: $p(X, Q \mid \theta) = \prod_{i=1}^n \prod_{j=1}^N P(j \mid \theta)^{q_{i,j}}\, p(x_i \mid j, \theta)^{q_{i,j}}$, which in log gives $\log p(X, Q \mid \theta) = \sum_{i=1}^n \sum_{j=1}^N q_{i,j} \log P(j \mid \theta) + q_{i,j} \log p(x_i \mid j, \theta)$.

57 Samy Bengio Tutorial on Statistical Machine Learning 57 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs EM for GMM: Auxiliary Function Auxiliary Function Update Rules Let us now write the corresponding auxiliary function: $A(\theta, \theta^s) = E_Q[\log p(X, Q \mid \theta) \mid X, \theta^s] = E_Q\left[\sum_{i=1}^n \sum_{j=1}^N q_{i,j} \log P(j \mid \theta) + q_{i,j} \log p(x_i \mid j, \theta) \,\middle|\, X, \theta^s\right] = \sum_{i=1}^n \sum_{j=1}^N E_Q[q_{i,j} \mid X, \theta^s] \log P(j \mid \theta) + E_Q[q_{i,j} \mid X, \theta^s] \log p(x_i \mid j, \theta)$.

58 Samy Bengio Tutorial on Statistical Machine Learning 58 Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs EM for GMM: Update Rules Auxiliary Function Update Rules Means: $\hat\mu_j = \frac{\sum_{i=1}^n x_i\, P(j \mid x_i, \theta^s)}{\sum_{i=1}^n P(j \mid x_i, \theta^s)}$. Variances: $\hat\sigma_j^2 = \frac{\sum_{i=1}^n (x_i - \hat\mu_j)^2\, P(j \mid x_i, \theta^s)}{\sum_{i=1}^n P(j \mid x_i, \theta^s)}$. Weights: $\hat w_j = \frac{1}{n} \sum_{i=1}^n P(j \mid x_i, \theta^s)$.
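
A minimal sketch (not from the original slides) of one EM iteration for a one-dimensional GMM, following the update rules above; the data and starting parameters are hypothetical.

```python
import numpy as np

def em_step_gmm_1d(x, w, mu, var):
    """One EM iteration for a 1-D GMM.
    x: (n,) data; w, mu, var: (N,) current weights, means, variances."""
    # E-step: responsibilities P(j | x_i, theta_s)
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2 / var) / np.sqrt(2 * np.pi * var)
    resp = w * dens
    resp /= resp.sum(axis=1, keepdims=True)            # (n, N)

    # M-step: posterior-weighted means, variances and mixing weights
    nk = resp.sum(axis=0)                              # sum_i P(j | x_i, theta_s)
    new_mu = (resp * x[:, None]).sum(axis=0) / nk
    new_var = (resp * (x[:, None] - new_mu[None, :]) ** 2).sum(axis=0) / nk
    new_w = nk / len(x)
    return new_w, new_mu, new_var

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 0.5, 200)])
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(20):
    w, mu, var = em_step_gmm_1d(x, w, mu, var)
print(w, mu, var)
```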

59 Samy Bengio Tutorial on Statistical Machine Learning 59 Initialization Preliminaries Gaussian Mixture Models Expectation-Maximization EM for GMMs Auxiliary Function Update Rules EM is an iterative procedure that is very sensitive to initial conditions! Start from trash, end up with trash. Hence, we need a good and fast initialization procedure. Often used: K-Means. Other options: hierarchical K-Means, Gaussian splitting.

60 Samy Bengio Tutorial on Statistical Machine Learning 60 Markovian Models Hidden Markov Models Speech Recognition Part IV Hidden Markov Models

61 Samy Bengio Tutorial on Statistical Machine Learning 61 Markovian Models Hidden Markov Models Speech Recognition Definition Graphical View 13 Markovian Models 14 Hidden Markov Models 15 Speech Recognition

62 Samy Bengio Tutorial on Statistical Machine Learning 62 Markovian Models Hidden Markov Models Speech Recognition Definition Graphical View Markov Models Stochastic process of a temporal sequence: the probability distribution of the variable $q$ at time $t$ depends on the variable $q$ at times $t-1$ down to $1$: $P(q_1, q_2, \dots, q_T) = P(q_1^T) = P(q_1) \prod_{t=2}^T P(q_t \mid q_1^{t-1})$. First-order Markov process: $P(q_t \mid q_1^{t-1}) = P(q_t \mid q_{t-1})$. Markov Model: model of a Markovian process with discrete states. Hidden Markov Model: Markov model whose state is not observed, but of which one can observe a manifestation (a variable $x_t$ which depends only on $q_t$).

63 Samy Bengio Tutorial on Statistical Machine Learning 63 Markovian Models Hidden Markov Models Speech Recognition Definition Graphical View Markov Models (Graphical View) [Figure: a 3-state Markov model, and the same Markov model unfolded in time as q(t-2), q(t-1), q(t), q(t+1)]

64 Samy Bengio Tutorial on Statistical Machine Learning 64 Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm 13 Markovian Models 14 Hidden Markov Models 15 Speech Recognition

65 Samy Bengio Tutorial on Statistical Machine Learning 65 Markovian Models Hidden Markov Models Speech Recognition Hidden Markov Models Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm [Figure: a hidden Markov model, and the same model unfolded in time with hidden states q(t-2), q(t-1), q(t), q(t+1) and observations y(t-2), y(t-1), y(t), y(t+1)]

66 Samy Bengio Tutorial on Statistical Machine Learning 66 Elements of an HMM Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm A finite number of states $N$. Transition probabilities between states, which depend only on the previous state: $P(q_t = i \mid q_{t-1} = j, \theta)$. Emission probabilities, which depend only on the current state: $p(x_t \mid q_t = i, \theta)$ (where $x_t$ is observed). Initial state probabilities: $P(q_0 = i \mid \theta)$. Each of these 3 sets of probabilities has parameters $\theta$ to estimate.

67 Samy Bengio Tutorial on Statistical Machine Learning 67 Markovian Models Hidden Markov Models Speech Recognition The 3 Problems of HMMs Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm The HMM model gives rise to 3 different problems: Given an HMM parameterized by $\theta$, can we compute the likelihood of a sequence $X = x_1^T = \{x_1, x_2, \dots, x_T\}$: $p(x_1^T \mid \theta)$? Given an HMM parameterized by $\theta$ and a set of sequences $D_n$, can we select the parameters $\theta^*$ such that $\theta^* = \arg\max_\theta \prod_{p=1}^n p(X^{(p)} \mid \theta)$? Given an HMM parameterized by $\theta$, can we compute the optimal path $Q$ through the state space given a sequence $X$: $Q^* = \arg\max_Q p(X, Q \mid \theta)$?

68 Samy Bengio Tutorial on Statistical Machine Learning 68 Markovian Models Hidden Markov Models Speech Recognition HMMs as Generative Processes Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm HMMs can be used to generate sequences: Let us define a set of starting states with initial probabilities $P(q_0 = i)$. Let us also define a set of final states. Then for each sequence to generate: 1 Select an initial state $j$ according to $P(q_0)$. 2 Select the next state $i$ according to $P(q_t = i \mid q_{t-1} = j)$. 3 Emit an output according to the emission distribution $p(x_t \mid q_t = i)$. 4 If $i$ is a final state, then stop, otherwise loop to step 2.

69 Samy Bengio Tutorial on Statistical Machine Learning 69 Markovian Models Hidden Markov Models Speech Recognition Markovian Assumptions Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Emissions: the probability to emit $x_t$ at time $t$ in state $q_t = i$ does not depend on anything else: $p(x_t \mid q_t = i, q_1^{t-1}, x_1^{t-1}) = p(x_t \mid q_t = i)$. Transitions: the probability to go from state $j$ to state $i$ at time $t$ does not depend on anything else: $P(q_t = i \mid q_{t-1} = j, q_1^{t-2}, x_1^{t-1}) = P(q_t = i \mid q_{t-1} = j)$. Moreover, this probability does not depend on time $t$: $P(q_t = i \mid q_{t-1} = j)$ is the same for all $t$; we say that such Markov models are homogeneous.

70 Samy Bengio Tutorial on Statistical Machine Learning 70 Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Derivation of the Forward Variable $\alpha$: the probability of having generated the sequence $x_1^t$ and being in state $i$ at time $t$: $\alpha(i, t) \stackrel{\text{def}}{=} p(x_1^t, q_t = i) = p(x_t \mid x_1^{t-1}, q_t = i)\, p(x_1^{t-1}, q_t = i) = p(x_t \mid q_t = i) \sum_j p(x_1^{t-1}, q_t = i, q_{t-1} = j) = p(x_t \mid q_t = i) \sum_j P(q_t = i \mid x_1^{t-1}, q_{t-1} = j)\, p(x_1^{t-1}, q_{t-1} = j) = p(x_t \mid q_t = i) \sum_j P(q_t = i \mid q_{t-1} = j)\, p(x_1^{t-1}, q_{t-1} = j) = p(x_t \mid q_t = i) \sum_j P(q_t = i \mid q_{t-1} = j)\, \alpha(j, t-1)$.

71 Samy Bengio Tutorial on Statistical Machine Learning 71 Markovian Models Hidden Markov Models Speech Recognition From α to the Likelihood Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Reminder: $\alpha(i, t) \stackrel{\text{def}}{=} p(x_1^t, q_t = i)$. Initial condition: $\alpha(i, 0) = P(q_0 = i)$, the prior probability of each state $i$. Then let us compute $\alpha(i, t)$ for each state $i$ and each time $t$ of a given sequence $x_1^T$. Afterward, we can compute the likelihood as follows: $p(x_1^T) = \sum_i p(x_1^T, q_T = i) = \sum_i \alpha(i, T)$. Hence, to compute the likelihood $p(x_1^T)$, we need $O(N^2 T)$ operations, where $N$ is the number of states.
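
A minimal sketch (not from the original slides) of the forward recursion just described, assuming the emission likelihoods $p(x_t \mid q_t = i)$ have already been evaluated (e.g. by GMMs) and are passed as a matrix; the transition matrix and numbers are hypothetical.

```python
import numpy as np

def forward(init, trans, emit_lik):
    """Forward recursion following the slide's convention:
    alpha(i, 0) = P(q_0 = i) and, for t = 1..T,
    alpha(i, t) = p(x_t | q_t = i) * sum_j P(q_t = i | q_{t-1} = j) * alpha(j, t-1).
    init: (N,) prior P(q_0 = i); trans[j, i] = P(q_t = i | q_{t-1} = j);
    emit_lik: (T, N) with emit_lik[t-1, i] = p(x_t | q_t = i)."""
    T, N = emit_lik.shape
    alpha = np.zeros((T + 1, N))
    alpha[0] = init
    for t in range(1, T + 1):
        alpha[t] = emit_lik[t - 1] * (alpha[t - 1] @ trans)
    likelihood = alpha[T].sum()                  # p(x_1^T) = sum_i alpha(i, T)
    return alpha, likelihood

# Tiny 2-state example with already-computed emission likelihoods (hypothetical numbers)
init = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
emit_lik = np.array([[0.9, 0.1],
                     [0.4, 0.6],
                     [0.2, 0.8]])
alpha, lik = forward(init, trans, emit_lik)
print(lik)
```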

72 Samy Bengio Tutorial on Statistical Machine Learning 72 Markovian Models Hidden Markov Models Speech Recognition EM Training for HMM Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm For HMM, the hidden variable Q will describe in which state the HMM was for each observation $x_t$ of a sequence $X$. The joint likelihood of all sequences $X^{(l)}$ and the hidden variable Q is then: $p(X, Q \mid \theta) = \prod_{l=1}^n p(X^{(l)}, Q \mid \theta)$. Let us introduce the following indicator variable: $q_{i,t} = 1$ if $q_t = i$, and $0$ otherwise.

73 Samy Bengio Tutorial on Statistical Machine Learning 73 Joint Likelihood Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm $p(X, Q \mid \theta) = \prod_{l=1}^n p(X^{(l)}, Q \mid \theta) = \prod_{l=1}^n \left( \prod_{i=1}^N P(q_0 = i)^{q_{i,0}} \right) \prod_{t=1}^{T_l} \prod_{i=1}^N p(x_t^{(l)} \mid q_t = i)^{q_{i,t}} \prod_{j=1}^N P(q_t = i \mid q_{t-1} = j)^{q_{i,t}\, q_{j,t-1}}$.

74 Samy Bengio Tutorial on Statistical Machine Learning 74 Joint Log Likelihood Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm $\log p(X, Q \mid \theta) = \sum_{l=1}^n \sum_{i=1}^N q_{i,0} \log P(q_0 = i) + \sum_{l=1}^n \sum_{t=1}^{T_l} \sum_{i=1}^N q_{i,t} \log p(x_t^{(l)} \mid q_t = i) + \sum_{l=1}^n \sum_{t=1}^{T_l} \sum_{i=1}^N \sum_{j=1}^N q_{i,t}\, q_{j,t-1} \log P(q_t = i \mid q_{t-1} = j)$.

75 Samy Bengio Tutorial on Statistical Machine Learning 75 Auxiliary Function Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Let us now write the corresponding auxiliary function: $A(\theta, \theta^s) = E_Q[\log p(X, Q \mid \theta) \mid X, \theta^s] = \sum_{l=1}^n \sum_{i=1}^N E_Q[q_{i,0} \mid X, \theta^s] \log P(q_0 = i) + \sum_{l=1}^n \sum_{t=1}^{T_l} \sum_{i=1}^N E_Q[q_{i,t} \mid X, \theta^s] \log p(x_t^{(l)} \mid q_t = i) + \sum_{l=1}^n \sum_{t=1}^{T_l} \sum_{i=1}^N \sum_{j=1}^N E_Q[q_{i,t}\, q_{j,t-1} \mid X, \theta^s] \log P(q_t = i \mid q_{t-1} = j)$. From now on, let us forget about index $l$ for simplification.

76 Samy Bengio Tutorial on Statistical Machine Learning 76 Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Derivation of the Backward Variable $\beta$: the probability to generate the rest of the sequence $x_{t+1}^T$ given that we are in state $i$ at time $t$: $\beta(i, t) \stackrel{\text{def}}{=} p(x_{t+1}^T \mid q_t = i) = \sum_j p(x_{t+1}^T, q_{t+1} = j \mid q_t = i) = \sum_j p(x_{t+1} \mid x_{t+2}^T, q_{t+1} = j, q_t = i)\, p(x_{t+2}^T, q_{t+1} = j \mid q_t = i) = \sum_j p(x_{t+1} \mid q_{t+1} = j)\, p(x_{t+2}^T \mid q_{t+1} = j, q_t = i)\, P(q_{t+1} = j \mid q_t = i) = \sum_j p(x_{t+1} \mid q_{t+1} = j)\, p(x_{t+2}^T \mid q_{t+1} = j)\, P(q_{t+1} = j \mid q_t = i) = \sum_j p(x_{t+1} \mid q_{t+1} = j)\, \beta(j, t+1)\, P(q_{t+1} = j \mid q_t = i)$.

77 Samy Bengio Tutorial on Statistical Machine Learning 77 Final Details About β Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Reminder: $\beta(i, t) = p(x_{t+1}^T \mid q_t = i)$. Final condition: $\beta(i, T) = 1$ if $i$ is a final state, and $0$ otherwise. Hence, to compute all the $\beta$ variables, we need $O(N^2 T)$ operations, where $N$ is the number of states.
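
A minimal sketch (not from the original slides) of the backward recursion with the final condition above, using the same conventions and the same hypothetical 2-state example as the forward sketch.

```python
import numpy as np

def backward(trans, emit_lik, final_states):
    """Backward recursion beta(i, t) = p(x_{t+1}^T | q_t = i):
    beta(i, T) = 1 if i is a final state else 0, and
    beta(i, t) = sum_j p(x_{t+1} | q_{t+1} = j) * beta(j, t+1) * P(q_{t+1} = j | q_t = i).
    trans[i, j] = P(q_{t+1} = j | q_t = i); emit_lik[t-1, j] = p(x_t | q_t = j)."""
    T, N = emit_lik.shape
    beta = np.zeros((T + 1, N))
    beta[T] = np.asarray(final_states, dtype=float)   # final condition
    for t in range(T - 1, -1, -1):
        beta[t] = trans @ (emit_lik[t] * beta[t + 1])
    return beta

# Same hypothetical 2-state example as for the forward pass, with both states final
trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
emit_lik = np.array([[0.9, 0.1],
                     [0.4, 0.6],
                     [0.2, 0.8]])
print(backward(trans, emit_lik, final_states=[1, 1]))
```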

78 Samy Bengio Tutorial on Statistical Machine Learning 78 E-Step for HMMs Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Posterior on emission distributions: $E_Q[q_{i,t} \mid X, \theta^s] = P(q_t = i \mid x_1^T, \theta^s) = \frac{p(x_1^T, q_t = i)}{p(x_1^T)} = \frac{p(x_{t+1}^T \mid q_t = i, x_1^t)\, p(x_1^t, q_t = i)}{p(x_1^T)} = \frac{p(x_{t+1}^T \mid q_t = i)\, p(x_1^t, q_t = i)}{p(x_1^T)} = \frac{\beta(i, t)\, \alpha(i, t)}{\sum_j \alpha(j, T)}$.

79 Samy Bengio Tutorial on Statistical Machine Learning 79 E-Step for HMMs Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Posterior on transition distributions: $E_Q[q_{i,t}\, q_{j,t-1} \mid X, \theta^s] = P(q_t = i, q_{t-1} = j \mid x_1^T, \theta^s) = \frac{p(x_1^T, q_t = i, q_{t-1} = j)}{p(x_1^T)} = \frac{p(x_{t+1}^T \mid q_t = i)\, P(q_t = i \mid q_{t-1} = j)\, p(x_t \mid q_t = i)\, p(x_1^{t-1}, q_{t-1} = j)}{p(x_1^T)} = \frac{\beta(i, t)\, P(q_t = i \mid q_{t-1} = j)\, p(x_t \mid q_t = i)\, \alpha(j, t-1)}{\sum_j \alpha(j, T)}$.

80 Samy Bengio Tutorial on Statistical Machine Learning 80 E-Step for HMMs Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Posterior on initial state distribution: $E_Q[q_{i,0} \mid X, \theta^s] = P(q_0 = i \mid x_1^T, \theta^s) = \frac{p(x_1^T, q_0 = i)}{p(x_1^T)} = \frac{p(x_1^T \mid q_0 = i)\, P(q_0 = i)}{p(x_1^T)} = \frac{\beta(i, 0)\, P(q_0 = i)}{\sum_j \alpha(j, T)}$.

81 Samy Bengio Tutorial on Statistical Machine Learning 81 M-Step for HMMs Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm Find the parameters $\theta$ that maximize $A$, hence search for $\frac{\partial A}{\partial \theta} = 0$. When transition distributions are represented as tables, using a Lagrange multiplier we obtain: $P(q_t = i \mid q_{t-1} = j) = \frac{\sum_{t=1}^T P(q_t = i, q_{t-1} = j \mid x_1^T, \theta^s)}{\sum_{t=1}^T P(q_{t-1} = j \mid x_1^T, \theta^s)}$. When emission distributions are implemented as GMMs, use the update equations already given, weighted by the posterior on emissions $P(q_t = i \mid x_1^T, \theta^s)$.

82 Samy Bengio Tutorial on Statistical Machine Learning 82 Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm The Most Likely Path (Graphical View) The Viterbi algorithm finds the best state sequence. [Figure: a state-by-time trellis over states q1, q2, q3, showing first the computation of the partial paths and then the backtracking in time]

83 Samy Bengio Tutorial on Statistical Machine Learning 83 Markovian Models Hidden Markov Models Speech Recognition The Viterbi Algorithm for HMMs Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm The Viterbi algorithm finds the best state sequence: $V(i, t) \stackrel{\text{def}}{=} \max_{q_1^{t-1}} p(x_1^t, q_1^{t-1}, q_t = i) = \max_{q_1^{t-1}} p(x_t \mid x_1^{t-1}, q_1^{t-1}, q_t = i)\, p(x_1^{t-1}, q_1^{t-1}, q_t = i) = p(x_t \mid q_t = i) \max_j \max_{q_1^{t-2}} p(x_1^{t-1}, q_1^{t-2}, q_t = i, q_{t-1} = j) = p(x_t \mid q_t = i) \max_j \max_{q_1^{t-2}} P(q_t = i \mid q_{t-1} = j)\, p(x_1^{t-1}, q_1^{t-2}, q_{t-1} = j) = p(x_t \mid q_t = i) \max_j P(q_t = i \mid q_{t-1} = j) \max_{q_1^{t-2}} p(x_1^{t-1}, q_1^{t-2}, q_{t-1} = j) = p(x_t \mid q_t = i) \max_j P(q_t = i \mid q_{t-1} = j)\, V(j, t-1)$.

84 Samy Bengio Tutorial on Statistical Machine Learning 84 Markovian Models Hidden Markov Models Speech Recognition Definition Likelihood of an HMM EM for HMMs The Viterbi Algorithm From Viterbi to the State Sequence Reminder: $V(i, t) = \max_{q_1^{t-1}} p(x_1^t, q_1^{t-1}, q_t = i)$. Let us compute $V(i, t)$ for each state $i$ and each time $t$ of a given sequence $x_1^T$. Moreover, let us also keep for each $V(i, t)$ the associated argmax previous state $j$. Then, starting from the state $i^* = \arg\max_j V(j, T)$, backtrack to decode the most probable state sequence. Hence, to compute all the $V(i, t)$ variables, we need $O(N^2 T)$ operations, where $N$ is the number of states.
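
A minimal sketch (not from the original slides) of Viterbi decoding with backtracking as described above, worked in the log domain; the model and emission likelihoods are the same hypothetical numbers as in the forward sketch.

```python
import numpy as np

def viterbi(init, trans, emit_lik):
    """Viterbi decoding following the slide's recursion:
    V(i, 0) = P(q_0 = i) and
    V(i, t) = p(x_t | q_t = i) * max_j P(q_t = i | q_{t-1} = j) * V(j, t-1),
    keeping the argmax predecessor j for backtracking.
    init: (N,); trans[j, i] = P(q_t = i | q_{t-1} = j); emit_lik[t-1, i] = p(x_t | q_t = i)."""
    T, N = emit_lik.shape
    log_trans, log_emit = np.log(trans), np.log(emit_lik)
    V = np.zeros((T + 1, N))
    back = np.zeros((T + 1, N), dtype=int)
    V[0] = np.log(init)
    for t in range(1, T + 1):
        scores = V[t - 1][:, None] + log_trans          # scores[j, i]
        back[t] = scores.argmax(axis=0)                 # best predecessor j for each state i
        V[t] = log_emit[t - 1] + scores.max(axis=0)
    # Backtrack from the best final state
    path = [int(V[T].argmax())]
    for t in range(T, 1, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], V[T].max()                       # state sequence q_1..q_T, log p(x, q*)

init = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
emit_lik = np.array([[0.9, 0.1],
                     [0.4, 0.6],
                     [0.2, 0.8]])
print(viterbi(init, trans, emit_lik))
```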

85 Samy Bengio Tutorial on Statistical Machine Learning 85 Markovian Models Hidden Markov Models Speech Recognition Embedded Training Decoding Measuring Error 13 Markovian Models 14 Hidden Markov Models 15 Speech Recognition

86 Samy Bengio Tutorial on Statistical Machine Learning 86 Markovian Models Hidden Markov Models Speech Recognition Embedded Training Decoding Measuring Error HMMs for Speech Recognition Application: continuous speech recognition: find a sequence of phonemes (or words) given an acoustic sequence. Idea: use a phoneme model.

87 Samy Bengio Tutorial on Statistical Machine Learning 87 Markovian Models Hidden Markov Models Speech Recognition Embedded Training Decoding Measuring Error Embedded Training of HMMs For each acoustic sequence in the training set, create a new HMM as the concatenation of the HMMs representing the underlying sequence of phonemes. Maximize the likelihood of the training sentences. [Figure: the phoneme HMMs C, A, and T concatenated into a single HMM for the word CAT]

88 Samy Bengio Tutorial on Statistical Machine Learning 88 Markovian Models Hidden Markov Models Speech Recognition Embedded Training Decoding Measuring Error HMMs: Decoding a Sentence Decide what is the accepted vocabulary. Optionally add a language model: P(word sequence). Efficient algorithm to find the optimal path in the decoding HMM. [Figure: a decoding HMM built from the word models DOG and CAT]

89 Samy Bengio Tutorial on Statistical Machine Learning 89 Markovian Models Hidden Markov Models Speech Recognition Embedded Training Decoding Measuring Error Measuring Error How do we measure the quality of a speech recognizer? Problem: the target solution is a sentence, the obtained solution is also a sentence, but they might have different sizes! Proposed solution: the Edit Distance: assuming you have access to the operators insert, delete, and substitute, what is the smallest number of such operators we need to go from the obtained to the desired sentence? An efficient algorithm exists to compute this. At the end, we measure the error as follows: $\text{WER} = \frac{\#\text{ins} + \#\text{del} + \#\text{subst}}{\#\text{words}}$. Note that the word error rate (WER) can be greater than 1...
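
A minimal sketch (not from the original slides) of the edit-distance-based word error rate just defined, using the standard dynamic-programming recurrence; the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate via edit (Levenshtein) distance:
    WER = (#insertions + #deletions + #substitutions) / #reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: edit distance between the first i reference words and first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                       # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                       # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution (or match)
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat the the mad mat"))
```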

90 Samy Bengio Tutorial on Statistical Machine Learning 90 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Part V Advanced Models for Multimodal Processing

91 Samy Bengio Tutorial on Statistical Machine Learning 91 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings 16 Introduction to Graphical Models 17 A Zoo of Graphical Models 18 Applications to Meetings

92 Samy Bengio Tutorial on Statistical Machine Learning 92 Graphical Models Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings A tool to efficiently model complex joint distributions: $p(X_1, X_2, \dots, X_n) = \prod_{i=1}^n p(X_i \mid \text{parents}(X_i))$. Introduces and makes use of known conditional independences, e.g. $p(X_1, X_2, \dots, X_5) = p(X_1 \mid X_2, X_5)\, p(X_2)\, p(X_3)\, p(X_4 \mid X_2, X_3)\, p(X_5 \mid X_4)$. [Figure: the directed graph over X1, ..., X5 corresponding to this factorization]
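
A minimal sketch (not from the original slides) of the factorization on this slide for five binary variables; all conditional probability tables are hypothetical numbers, chosen only to show that the factored joint is a valid distribution.

```python
import itertools

# Joint distribution factored as on the slide:
# p(X1..X5) = p(X1|X2,X5) p(X2) p(X3) p(X4|X2,X3) p(X5|X4)
p_x2 = {0: 0.7, 1: 0.3}
p_x3 = {0: 0.4, 1: 0.6}
p_x4 = {(x2, x3): {0: 0.5 + 0.2 * x2 - 0.1 * x3} for x2 in (0, 1) for x3 in (0, 1)}
p_x5 = {x4: {0: 0.8 - 0.3 * x4} for x4 in (0, 1)}
p_x1 = {(x2, x5): {0: 0.6 - 0.2 * x2 + 0.1 * x5} for x2 in (0, 1) for x5 in (0, 1)}
for table in (p_x4, p_x5, p_x1):
    for key in table:
        table[key][1] = 1.0 - table[key][0]     # complete each conditional table

def joint(x1, x2, x3, x4, x5):
    return (p_x1[(x2, x5)][x1] * p_x2[x2] * p_x3[x3]
            * p_x4[(x2, x3)][x4] * p_x5[x4][x5])

# Sanity check: the factored joint sums to one over all 2^5 configurations
print(sum(joint(*cfg) for cfg in itertools.product((0, 1), repeat=5)))
```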

93 Samy Bengio Tutorial on Statistical Machine Learning 93 Graphical Models Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Can handle an arbitrary number of random variables Junction Tree Algorithm (JTA): used to estimate joint probability (inference) Expectation-Maximization (EM): used to train Depending on the graph, EM and JTA are tractable or not Dynamic Bayes Nets: temporal extension of graphical models

94 Samy Bengio Tutorial on Statistical Machine Learning 94 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs 16 Introduction to Graphical Models 17 A Zoo of Graphical Models 18 Applications to Meetings

95 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs A Zoo of Graphical Models for Multi-Channel Processing Samy Bengio Tutorial on Statistical Machine Learning 95

96 Samy Bengio Tutorial on Statistical Machine Learning 96 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Notation for Multimodal Data Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Notation for One Stream: let us have a training set of L observation sequences. The observation sequence $l$: $O^l = (o_1^l, o_2^l, \dots, o_{T_l}^l)$ with $o_t^l$ the vector of multimodal features at time $t$ for sequence $l$, of length $T_l$. Notation for Several Streams: let us consider N streams for each observation sequence. The stream $n$ of observation sequence $l$: $O^{l:n} = (o_1^{l:n}, o_2^{l:n}, \dots, o_{T_l}^{l:n})$ with $o_t^{l:n}$ the vector of features of stream $n$ at time $t$ for sequence $l$, of length $T_l$.

97 Samy Bengio Tutorial on Statistical Machine Learning 97 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Goals for Multimodal Data Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Inference: likelihood $\prod_{l=1}^L p\left(\{O^{l:n}\}_{n=1}^N; \theta\right)$. Training: $\theta^* = \arg\max_\theta \prod_{l=1}^L p\left(\{O^{l:n}\}_{n=1}^N; \theta\right)$.

98 Samy Bengio Tutorial on Statistical Machine Learning 98 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Sample all streams at the same frame rate. Emission distributions: $p\left(\{o_t^{l:n}\}_{n=1}^N \mid q_t = i\right)$. Transition distributions: $P(q_t = i \mid q_{t-1} = j)$.

99 Samy Bengio Tutorial on Statistical Machine Learning 99 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Multi-Stream HMMs Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Train a separate model for each stream $n$: $\theta_n^* = \arg\max_{\theta_n} \prod_{l=1}^L p\left(O^{l:n}; \theta_n\right)$. Inference, for each state $q_t = i$: $p\left(\{o_t^{l:n}\}_{n=1}^N \mid q_t = i\right) = \prod_{n=1}^N p\left(o_t^{l:n} \mid q_t = i; \theta_n\right)$.

100 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Multi-Stream HMMs Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Training and Inference Complexity: per iteration, per observation sequence $l$: $O(N\, S^2\, T_l)$ with $S$ the number of states of each HMM stream, $N$ the number of streams, $T_l$ the length of the observation sequence. Additional Notes: all stream HMMs need to be of the same topology. One can add stream weights $\omega_n$ as follows: $p\left(\{o_t^{l:n}\}_{n=1}^N \mid q_t = i\right) = \prod_{n=1}^N p\left(o_t^{l:n} \mid q_t = i; \theta_n\right)^{\omega_n}$. Samy Bengio Tutorial on Statistical Machine Learning 100
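
A minimal sketch (not from the original slides) of the weighted multi-stream emission score above, combined in the log domain; the per-stream likelihoods and weights are hypothetical.

```python
import numpy as np

def multi_stream_log_emission(stream_log_liks, stream_weights=None):
    """Combine per-stream emission log-likelihoods for a multi-stream HMM state:
    log p({o_t^n} | q_t = i) = sum_n w_n * log p(o_t^n | q_t = i; theta_n).
    stream_log_liks: (N_streams, N_states) log-likelihoods of the current frame
    under each stream model; stream_weights: optional (N_streams,) weights w_n."""
    stream_log_liks = np.asarray(stream_log_liks)
    if stream_weights is None:
        stream_weights = np.ones(stream_log_liks.shape[0])
    return np.asarray(stream_weights) @ stream_log_liks   # (N_states,)

# Hypothetical audio and video stream log-likelihoods for 3 HMM states at one frame,
# with the audio stream weighted more heavily than the video stream.
audio = np.log([0.5, 0.3, 0.2])
video = np.log([0.2, 0.5, 0.3])
print(multi_stream_log_emission([audio, video], stream_weights=[0.7, 0.3]))
```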

101 Samy Bengio Tutorial on Statistical Machine Learning 101 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Multi-Stream HMMs Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Pros Efficient training (linear in the number of streams) Robust to stream-dependent noise Easy to implement Cons Does not maximize the joint probability of the data Each stream needs to use the same HMM topology Assumes complete stream independence during training, and stream independence given the state during decoding.

102 Samy Bengio Tutorial on Statistical Machine Learning 102 Coupled HMMs Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Let $q_t^n = i$ be the state at time $t$ for stream $n$. Stream emission distributions: $p\left(o_t^{l:n} \mid q_t^n = i; \theta_n\right)$. Coupled transition distributions: $P\left(q_t^n \mid \{q_{t-1}^m\}_{m=1}^N; \theta_n\right)$.

103 Samy Bengio Tutorial on Statistical Machine Learning 103 Coupled HMMs Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Complexity: per iteration, per observation sequence $l$: $O(S^{2N}\, T_l)$ with $S$ the number of states of each HMM stream, $N$ the number of streams, $T_l$ the length of the observation sequence. N-Heads approximation: $O(N^2\, S^2\, T_l)$.

104 Samy Bengio Tutorial on Statistical Machine Learning 104 Coupled HMMs Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Pros Considers correlations between streams during training and decoding. Cons The algorithm is intractable (unless using the approximation). Assumes synchrony between the streams.

105 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Challenges in Multi Channel Integration: Asynchrony [Figure: Stream 1: A B C; Stream 2: A B C; Naive Integration: A A A B B B B C C C] Samy Bengio Tutorial on Statistical Machine Learning 105

106 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Some Evidence of Stream Asynchrony Audio-visual speech recognition with lip movements: lips do not move at the same time as we hear the corresponding sound. Speaking and pointing: pointing to a map and saying I want to go there. Gesticulating, looking at, and talking to someone during a conversation. In a news video, the delay between the moment when the newscaster says Bush and the moment when Bush s picture appears.... Samy Bengio Tutorial on Statistical Machine Learning 106

107 Samy Bengio Tutorial on Statistical Machine Learning 107 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Asynchrony Revisited Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs [Figure: Stream 1: A B C (3 states of dimension d1); Stream 2: A B C (3 states of dimension d2); Naive Integration: A A A B B B B C C C (5 states of dimension d1+d2); Joint / Asynchronous: A A B B C C (3 states of dimension d1+d2)]

108 Samy Bengio Tutorial on Statistical Machine Learning 108 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Asynchronous HMMs Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Enables re-synchronization of streams. One HMM: maximizes the likelihood of all streams jointly. [Figure: asynchronous HMM over two observation streams $O^1_{1:T_1}$ and $O^2_{1:T_2}$, with hidden states $q_{t-1}, q_t, q_{t+1}$, joint and per-stream emission distributions $p(o^1_r, o^2_s \mid q_t)$, $p(o^1_r \mid q_t)$, $p(o^2_s \mid q_t)$, emission decisions $p(\text{emit } o^1_r \mid q_t)$, $p(\text{emit } o^2_s \mid q_t)$, and transitions $P(q_t \mid q_{t-1})$]

109 Samy Bengio Tutorial on Statistical Machine Learning 109 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Training and Complexity Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Training: the EM algorithm maximizes the joint probability $p(x_1^{T_1}, y_1^{T_2}, \dots)$. Complexity grows with the number of streams, but can be controlled. Exact complexity: $O(S^2 \prod_{i=1}^N T_i)$. Introduction of temporal constraints: $O(S^2\, d\, T)$ with $d$ the maximum delay allowed between streams. Applications: easily adaptable to complex tasks such as speech recognition. Significant performance improvements in audio-visual speech and speaker recognition, and in meeting analysis.

110 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Alignments with Asynchronous HMMs [Figure: word error rate (WER) as a function of the signal-to-noise ratio (SNR, audio) for the audio HMM, audio+video HMM, audio+video AHMM, and video HMM at several noise levels (5db, 10db, ...); and alignments (AHMM alignment, HMM alignment, alignment bounds, segmentation) for the video of the French digits zero, un, deux, trois, quatre, cinq, six, sept, huit, neuf] Samy Bengio Tutorial on Statistical Machine Learning 110

111 Samy Bengio Tutorial on Statistical Machine Learning 111 Layered HMMs Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Philosophy Divide and Conquer for complex tasks Idea: transform the data from a raw representation into a higher level abstraction Then, transform again the result into yet another and higher level of abstraction, and continue as needed For temporal sequences: go from a fine time granularity to a coarser one.

112 Samy Bengio Tutorial on Statistical Machine Learning 112 General Algorithm Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Layered Algorithm: for $i = 0$ to $M$ layers, do: 1 Let $O^i_{1:T}$ be a sequence of observations of HMM model $m_i$. 2 Train $m_i$ using EM in order to maximise $p(O^i_{1:T} \mid m_i)$. 3 Compute, for each state $s^i_j$ of $m_i$, the data posterior $p(s^i_j \mid O^i_{1:T}, m_i)$. 4 Define $O^{i+1}_{1:T}$ as the sequence of vectors of $p(s^i_j \mid O^i_{1:T}, m_i)$.

113 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Complexity and Additional Notes Early-Integration HMMs Multi-Stream HMMs Coupled HMMs Asynchronous HMMs Layered HMMs Complexity: training complexity $O(N\, S^2\, M)$ for $M$ layers. A Speech Example: a phoneme layer (with phoneme constraints), a word layer (with word constraints), a sentence layer (with language constraints). Samy Bengio Tutorial on Statistical Machine Learning 113

114 Samy Bengio Tutorial on Statistical Machine Learning 114 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Complexity of Meetings A Layered Approach Experiments 16 Introduction to Graphical Models 17 A Zoo of Graphical Models 18 Applications to Meetings

115 Samy Bengio Tutorial on Statistical Machine Learning 115 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Complexity of Meetings A Layered Approach Experiments Complexity of the Meeting Scenario Modeling multimodal group-level human interactions in meetings. Multimodal nature: data collected from multiple sensors (cameras, microphones, projectors, white-board, etc). Group nature: involves multiple interacting persons at the same time.

116 Samy Bengio Tutorial on Statistical Machine Learning 116 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Complexity of Meetings A Layered Approach Experiments Meeting Analysis Structure a meeting as a sequence of group actions taken from an exhaustive set $V$ of $N$ possible actions: $V = \{v_1, v_2, v_3, \dots, v_N\}$. [Figure: a timeline of group actions V1, V2, V3, V2, V4, ..., V1] Recorded and annotated 30 training and 30 test meetings. Extract high-level audio and visual features. Try to recover the target action sequence of unseen meetings. I. McCowan, D. Gatica-Perez, S. Bengio, G. Lathoud, M. Barnard, and D. Zhang. Automatic Analysis of Multimodal Group Actions in Meetings. IEEE Transactions on PAMI, 27(3), 2005.

117 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Complexity of Meetings A Layered Approach Experiments A One-Layer Approach [Figure: audio-visual features from each person (Person 1 to Person N, from cameras and microphones) and group-level AV features feeding a single G-HMM] Classical Approach: a large vector of audio-visual features from each participant and group-level features are concatenated to define the observation space. A general HMM is trained using a set of labeled meetings. Samy Bengio Tutorial on Statistical Machine Learning 117

118 Introduction to Graphical Models A Zoo of Graphical Models Applications to Meetings Complexity of Meetings A Layered Approach Experiments A Two-Layer Approach [Figure: audio-visual features from each person (cameras and microphones) feeding individual I-HMMs (I-HMM 1 to I-HMM N), whose outputs, together with group AV features, feed a G-HMM] Advantages: smaller observation space; I-HMMs share parameters, are person-independent, and are trained on a simple task (write, talk, passive); the last layer is less sensitive to variations of the low-level data. Samy Bengio Tutorial on Statistical Machine Learning 118


More information

We Live in Exciting Times. CSCI-567: Machine Learning (Spring 2019) Outline. Outline. ACM (an international computing research society) has named

We Live in Exciting Times. CSCI-567: Machine Learning (Spring 2019) Outline. Outline. ACM (an international computing research society) has named We Live in Exciting Times ACM (an international computing research society) has named CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Apr. 2, 2019 Yoshua Bengio,

More information

Hidden Markov Modelling

Hidden Markov Modelling Hidden Markov Modelling Introduction Problem formulation Forward-Backward algorithm Viterbi search Baum-Welch parameter estimation Other considerations Multiple observation sequences Phone-based models

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Hidden Markov Models Part 2: Algorithms

Hidden Markov Models Part 2: Algorithms Hidden Markov Models Part 2: Algorithms CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Hidden Markov Model An HMM consists of:

More information

Hidden Markov models

Hidden Markov models Hidden Markov models Charles Elkan November 26, 2012 Important: These lecture notes are based on notes written by Lawrence Saul. Also, these typeset notes lack illustrations. See the classroom lectures

More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg Temporal Reasoning Kai Arras, University of Freiburg 1 Temporal Reasoning Contents Introduction Temporal Reasoning Hidden Markov Models Linear Dynamical Systems (LDS) Kalman Filter 2 Temporal Reasoning

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Lecture Notes Speech Communication 2, SS 2004 Erhard Rank/Franz Pernkopf Signal Processing and Speech Communication Laboratory Graz University of Technology Inffeldgasse 16c, A-8010

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models CI/CI(CS) UE, SS 2015 Christian Knoll Signal Processing and Speech Communication Laboratory Graz University of Technology June 23, 2015 CI/CI(CS) SS 2015 June 23, 2015 Slide 1/26 Content

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Dr Philip Jackson Centre for Vision, Speech & Signal Processing University of Surrey, UK 1 3 2 http://www.ee.surrey.ac.uk/personal/p.jackson/isspr/ Outline 1. Recognizing patterns

More information

CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm

CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm + September13, 2016 Professor Meteer CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm Thanks to Dan Jurafsky for these slides + ASR components n Feature

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

Bayesian Hidden Markov Models and Extensions

Bayesian Hidden Markov Models and Extensions Bayesian Hidden Markov Models and Extensions Zoubin Ghahramani Department of Engineering University of Cambridge joint work with Matt Beal, Jurgen van Gael, Yunus Saatci, Tom Stepleton, Yee Whye Teh Modeling

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check

More information

What s an HMM? Extraction with Finite State Machines e.g. Hidden Markov Models (HMMs) Hidden Markov Models (HMMs) for Information Extraction

What s an HMM? Extraction with Finite State Machines e.g. Hidden Markov Models (HMMs) Hidden Markov Models (HMMs) for Information Extraction Hidden Markov Models (HMMs) for Information Extraction Daniel S. Weld CSE 454 Extraction with Finite State Machines e.g. Hidden Markov Models (HMMs) standard sequence model in genomics, speech, NLP, What

More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly

More information

Hidden Markov Models

Hidden Markov Models CS769 Spring 2010 Advanced Natural Language Processing Hidden Markov Models Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Part-of-Speech Tagging The goal of Part-of-Speech (POS) tagging is to label each

More information

Hidden Markov Models Part 1: Introduction

Hidden Markov Models Part 1: Introduction Hidden Markov Models Part 1: Introduction CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Modeling Sequential Data Suppose that

More information

Plan for today. ! Part 1: (Hidden) Markov models. ! Part 2: String matching and read mapping

Plan for today. ! Part 1: (Hidden) Markov models. ! Part 2: String matching and read mapping Plan for today! Part 1: (Hidden) Markov models! Part 2: String matching and read mapping! 2.1 Exact algorithms! 2.2 Heuristic methods for approximate search (Hidden) Markov models Why consider probabilistics

More information

Doctoral Course in Speech Recognition. May 2007 Kjell Elenius

Doctoral Course in Speech Recognition. May 2007 Kjell Elenius Doctoral Course in Speech Recognition May 2007 Kjell Elenius CHAPTER 12 BASIC SEARCH ALGORITHMS State-based search paradigm Triplet S, O, G S, set of initial states O, set of operators applied on a state

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu October 19, 2014 Methods to Learn Matrix Data Set Data Sequence Data Time Series Graph & Network

More information

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K

More information

Statistical Sequence Recognition and Training: An Introduction to HMMs

Statistical Sequence Recognition and Training: An Introduction to HMMs Statistical Sequence Recognition and Training: An Introduction to HMMs EECS 225D Nikki Mirghafori nikki@icsi.berkeley.edu March 7, 2005 Credit: many of the HMM slides have been borrowed and adapted, with

More information

CS Machine Learning Qualifying Exam

CS Machine Learning Qualifying Exam CS Machine Learning Qualifying Exam Georgia Institute of Technology March 30, 2017 The exam is divided into four areas: Core, Statistical Methods and Models, Learning Theory, and Decision Processes. There

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence Page Hidden Markov models and multiple sequence alignment Russ B Altman BMI 4 CS 74 Some slides borrowed from Scott C Schmidler (BMI graduate student) References Bioinformatics Classic: Krogh et al (994)

More information

Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015

Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015 Sequence Modelling with Features: Linear-Chain Conditional Random Fields COMP-599 Oct 6, 2015 Announcement A2 is out. Due Oct 20 at 1pm. 2 Outline Hidden Markov models: shortcomings Generative vs. discriminative

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Hidden Markov Models. Aarti Singh Slides courtesy: Eric Xing. Machine Learning / Nov 8, 2010

Hidden Markov Models. Aarti Singh Slides courtesy: Eric Xing. Machine Learning / Nov 8, 2010 Hidden Markov Models Aarti Singh Slides courtesy: Eric Xing Machine Learning 10-701/15-781 Nov 8, 2010 i.i.d to sequential data So far we assumed independent, identically distributed data Sequential data

More information

A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement

A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement Simon Leglaive 1 Laurent Girin 1,2 Radu Horaud 1 1: Inria Grenoble Rhône-Alpes 2: Univ. Grenoble Alpes, Grenoble INP,

More information

More on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013

More on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013 More on HMMs and other sequence models Intro to NLP - ETHZ - 18/03/2013 Summary Parts of speech tagging HMMs: Unsupervised parameter estimation Forward Backward algorithm Bayesian variants Discriminative

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Expectation Maximization (EM) and Mixture Models Hamid R. Rabiee Jafar Muhammadi, Mohammad J. Hosseini Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2 Agenda Expectation-maximization

More information

Master 2 Informatique Probabilistic Learning and Data Analysis

Master 2 Informatique Probabilistic Learning and Data Analysis Master 2 Informatique Probabilistic Learning and Data Analysis Faicel Chamroukhi Maître de Conférences USTV, LSIS UMR CNRS 7296 email: chamroukhi@univ-tln.fr web: chamroukhi.univ-tln.fr 2013/2014 Faicel

More information

Lecture 4: Probabilistic Learning

Lecture 4: Probabilistic Learning DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods

More information

Hidden Markov Models The three basic HMM problems (note: change in notation) Mitch Marcus CSE 391

Hidden Markov Models The three basic HMM problems (note: change in notation) Mitch Marcus CSE 391 Hidden Markov Models The three basic HMM problems (note: change in notation) Mitch Marcus CSE 391 Parameters of an HMM States: A set of states S=s 1, s n Transition probabilities: A= a 1,1, a 1,2,, a n,n

More information

Introduction to Machine Learning Midterm, Tues April 8

Introduction to Machine Learning Midterm, Tues April 8 Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend

More information

A brief introduction to Conditional Random Fields

A brief introduction to Conditional Random Fields A brief introduction to Conditional Random Fields Mark Johnson Macquarie University April, 2005, updated October 2010 1 Talk outline Graphical models Maximum likelihood and maximum conditional likelihood

More information

HMM part 1. Dr Philip Jackson

HMM part 1. Dr Philip Jackson Centre for Vision Speech & Signal Processing University of Surrey, Guildford GU2 7XH. HMM part 1 Dr Philip Jackson Probability fundamentals Markov models State topology diagrams Hidden Markov models -

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Slides mostly from Mitch Marcus and Eric Fosler (with lots of modifications). Have you seen HMMs? Have you seen Kalman filters? Have you seen dynamic programming? HMMs are dynamic

More information

Hidden Markov Models. AIMA Chapter 15, Sections 1 5. AIMA Chapter 15, Sections 1 5 1

Hidden Markov Models. AIMA Chapter 15, Sections 1 5. AIMA Chapter 15, Sections 1 5 1 Hidden Markov Models AIMA Chapter 15, Sections 1 5 AIMA Chapter 15, Sections 1 5 1 Consider a target tracking problem Time and uncertainty X t = set of unobservable state variables at time t e.g., Position

More information

Hidden Markov Model. Ying Wu. Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208

Hidden Markov Model. Ying Wu. Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 Hidden Markov Model Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1/19 Outline Example: Hidden Coin Tossing Hidden

More information

Speech Recognition HMM

Speech Recognition HMM Speech Recognition HMM Jan Černocký, Valentina Hubeika {cernocky ihubeika}@fit.vutbr.cz FIT BUT Brno Speech Recognition HMM Jan Černocký, Valentina Hubeika, DCGM FIT BUT Brno 1/38 Agenda Recap variability

More information

Parametric Models Part III: Hidden Markov Models

Parametric Models Part III: Hidden Markov Models Parametric Models Part III: Hidden Markov Models Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2014 CS 551, Spring 2014 c 2014, Selim Aksoy (Bilkent

More information

HIDDEN MARKOV MODELS IN SPEECH RECOGNITION

HIDDEN MARKOV MODELS IN SPEECH RECOGNITION HIDDEN MARKOV MODELS IN SPEECH RECOGNITION Wayne Ward Carnegie Mellon University Pittsburgh, PA 1 Acknowledgements Much of this talk is derived from the paper "An Introduction to Hidden Markov Models",

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu November 3, 2015 Methods to Learn Matrix Data Text Data Set Data Sequence Data Time Series Graph

More information

Conditional Random Fields and beyond DANIEL KHASHABI CS 546 UIUC, 2013

Conditional Random Fields and beyond DANIEL KHASHABI CS 546 UIUC, 2013 Conditional Random Fields and beyond DANIEL KHASHABI CS 546 UIUC, 2013 Outline Modeling Inference Training Applications Outline Modeling Problem definition Discriminative vs. Generative Chain CRF General

More information

Estimating Covariance Using Factorial Hidden Markov Models

Estimating Covariance Using Factorial Hidden Markov Models Estimating Covariance Using Factorial Hidden Markov Models João Sedoc 1,2 with: Jordan Rodu 3, Lyle Ungar 1, Dean Foster 1 and Jean Gallier 1 1 University of Pennsylvania Philadelphia, PA joao@cis.upenn.edu

More information

Hidden Markov Models. By Parisa Abedi. Slides courtesy: Eric Xing

Hidden Markov Models. By Parisa Abedi. Slides courtesy: Eric Xing Hidden Markov Models By Parisa Abedi Slides courtesy: Eric Xing i.i.d to sequential data So far we assumed independent, identically distributed data Sequential (non i.i.d.) data Time-series data E.g. Speech

More information

Weighted Finite-State Transducers in Computational Biology

Weighted Finite-State Transducers in Computational Biology Weighted Finite-State Transducers in Computational Biology Mehryar Mohri Courant Institute of Mathematical Sciences mohri@cims.nyu.edu Joint work with Corinna Cortes (Google Research). 1 This Tutorial

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Sequence Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(21) Introduction Structured

More information

Expectation propagation for signal detection in flat-fading channels

Expectation propagation for signal detection in flat-fading channels Expectation propagation for signal detection in flat-fading channels Yuan Qi MIT Media Lab Cambridge, MA, 02139 USA yuanqi@media.mit.edu Thomas Minka CMU Statistics Department Pittsburgh, PA 15213 USA

More information

Hidden Markov Models

Hidden Markov Models 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 22 April 2, 2018 1 Reminders Homework

More information

Expectation-Maximization (EM) algorithm

Expectation-Maximization (EM) algorithm I529: Machine Learning in Bioinformatics (Spring 2017) Expectation-Maximization (EM) algorithm Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington Spring 2017 Contents Introduce

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm 1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable

More information

Lecture 3: Machine learning, classification, and generative models

Lecture 3: Machine learning, classification, and generative models EE E6820: Speech & Audio Processing & Recognition Lecture 3: Machine learning, classification, and generative models 1 Classification 2 Generative models 3 Gaussian models Michael Mandel

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Hidden Markov Models Barnabás Póczos & Aarti Singh Slides courtesy: Eric Xing i.i.d to sequential data So far we assumed independent, identically distributed

More information

GWAS V: Gaussian processes

GWAS V: Gaussian processes GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011

More information

Hidden Markov Models

Hidden Markov Models Andrea Passerini passerini@disi.unitn.it Statistical relational learning The aim Modeling temporal sequences Model signals which vary over time (e.g. speech) Two alternatives: deterministic models directly

More information

Template-Based Representations. Sargur Srihari

Template-Based Representations. Sargur Srihari Template-Based Representations Sargur srihari@cedar.buffalo.edu 1 Topics Variable-based vs Template-based Temporal Models Basic Assumptions Dynamic Bayesian Networks Hidden Markov Models Linear Dynamical

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information