Statistical Machine Learning from Data
Samy Bengio
IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
bengio@idiap.ch
http://www.idiap.ch/~bengio
December 2, 2005
Markov Models

Stochastic process over a temporal sequence: the probability distribution of the variable q at time t depends on the variable q at times t-1 down to 1:

$$P(q_1, q_2, \ldots, q_T) = P(q_1) \prod_{t=2}^{T} P(q_t \mid q_1^{t-1})$$

First order Markov process:

$$P(q_t \mid q_1^{t-1}) = P(q_t \mid q_{t-1})$$

Markov Model: model of a Markovian process with discrete states.
Markov Models (Graphical View)

Figure: a Markov model with states 1, 2, 3, and the same model unfolded in time over q(t-2), q(t-1), q(t), q(t+1).
Training Markov Models

A Markov model is represented by all its transition probabilities: $P(q_t = i \mid q_{t-1} = j)$ for all $i, j$.

Given a training set of sequences X, training means re-estimating these probabilities. Simply count them to obtain the maximum likelihood solution:

$$P(q_t = i \mid q_{t-1} = j) = \frac{\#(q_t = i \text{ and } q_{t-1} = j \in X)}{\#(q_{t-1} = j \in X)}$$

Example: observe the weather each day, assuming it depends only on the previous day.
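As a small illustration (not from the slides), here is a minimal Python sketch of this counting estimate; the function name, the integer encoding of states, and the toy weather data are assumptions made for the example:

```python
import numpy as np

def estimate_transitions(sequences, n_states):
    """Maximum-likelihood transition matrix A[j, i] = P(q_t = i | q_{t-1} = j),
    estimated by counting consecutive state pairs in the training sequences."""
    counts = np.zeros((n_states, n_states))
    for seq in sequences:
        for prev, cur in zip(seq[:-1], seq[1:]):
            counts[prev, cur] += 1
    # Normalize each row; rows with no observations fall back to uniform.
    totals = counts.sum(axis=1, keepdims=True)
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)

# Toy example: weather as states 0 = sunny, 1 = rainy, observed day by day.
weather = [[0, 0, 1, 1, 0], [1, 1, 0, 0, 0]]
print(estimate_transitions(weather, n_states=2))
```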
Hidden Markov Models (Graphical View)

Figure: a hidden Markov model with hidden states 1, 2, 3, and the same model unfolded in time: observations y(t-2), y(t-1), y(t), y(t+1) emitted from hidden states q(t-2), q(t-1), q(t), q(t+1).
Elements of an HMM

Hidden Markov Model: a Markov model whose state is not observed, but of which one can observe a manifestation (a variable $x_t$ which depends only on $q_t$).

A finite number of states N.
Transition probabilities between states, which depend only on the previous state: $P(q_t = i \mid q_{t-1} = j, \theta)$.
Emission probabilities, which depend only on the current state: $p(x_t \mid q_t = i, \theta)$ (where $x_t$ is observed).
Initial state probabilities: $P(q_0 = i \mid \theta)$.

Each of these 3 sets of probabilities has parameters $\theta$ to estimate.
The 3 Problems of HMMs

The HMM model gives rise to 3 different problems:

Given an HMM parameterized by $\theta$, can we compute the likelihood of a sequence $X = x_1^T = \{x_1, x_2, \ldots, x_T\}$:
$$p(x_1^T \mid \theta)$$

Given an HMM parameterized by $\theta$ and a set of $n$ sequences $D_n$, can we select the parameters $\theta^*$ such that:
$$\theta^* = \arg\max_{\theta} \prod_{p=1}^{n} p(X^{(p)} \mid \theta)$$

Given an HMM parameterized by $\theta$, can we compute the optimal path $Q^*$ through the state space given a sequence $X$:
$$Q^* = \arg\max_{Q} p(X, Q \mid \theta)$$
HMMs as Generative Processes

HMMs can be used to generate sequences:

Let us define a set of starting states with initial probabilities $P(q_0 = i)$. Let us also define a set of final states. Then, for each sequence to generate:

1. Select an initial state j according to $P(q_0)$.
2. Select the next state i according to $P(q_t = i \mid q_{t-1} = j)$.
3. Emit an output according to the emission distribution $p(x_t \mid q_t = i)$.
4. If i is a final state, then stop; otherwise loop to step 2.
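A minimal sketch of this generative loop for a discrete-emission HMM; the array names (initial, trans, emit, finals) and the use of NumPy are assumptions for illustration:

```python
import numpy as np

def sample_sequence(initial, trans, emit, finals, rng=np.random.default_rng(0)):
    """Generate one observation sequence from a discrete-emission HMM.
    initial[i] = P(q_0 = i), trans[j, i] = P(q_t = i | q_{t-1} = j),
    emit[i, k] = P(x_t = k | q_t = i), finals = set of final states."""
    outputs = []
    state = rng.choice(len(initial), p=initial)               # step 1: initial state
    while True:
        state = rng.choice(len(initial), p=trans[state])      # step 2: next state
        outputs.append(rng.choice(emit.shape[1], p=emit[state]))  # step 3: emit an output
        if state in finals:                                   # step 4: stop or loop to step 2
            return outputs
```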
Markovian Assumptions

Emissions: the probability to emit $x_t$ at time t in state $q_t = i$ does not depend on anything else:
$$p(x_t \mid q_t = i, q_1^{t-1}, x_1^{t-1}) = p(x_t \mid q_t = i)$$

Transitions: the probability to go from state j to state i at time t does not depend on anything else:
$$P(q_t = i \mid q_{t-1} = j, q_1^{t-2}, x_1^{t-1}) = P(q_t = i \mid q_{t-1} = j)$$

Moreover, this probability does not depend on time t: $P(q_t = i \mid q_{t-1} = j)$ is the same for all t; we say that such Markov models are homogeneous.
Derivation of the Forward Variable α

α is the probability of having generated the sequence $x_1^t$ and being in state i at time t:

$$
\begin{aligned}
\alpha(i, t) &\stackrel{\text{def}}{=} p(x_1^t, q_t = i) \\
&= p(x_t \mid x_1^{t-1}, q_t = i)\, p(x_1^{t-1}, q_t = i) \\
&= p(x_t \mid q_t = i) \sum_j p(x_1^{t-1}, q_t = i, q_{t-1} = j) \\
&= p(x_t \mid q_t = i) \sum_j P(q_t = i \mid x_1^{t-1}, q_{t-1} = j)\, p(x_1^{t-1}, q_{t-1} = j) \\
&= p(x_t \mid q_t = i) \sum_j P(q_t = i \mid q_{t-1} = j)\, p(x_1^{t-1}, q_{t-1} = j) \\
&= p(x_t \mid q_t = i) \sum_j P(q_t = i \mid q_{t-1} = j)\, \alpha(j, t-1)
\end{aligned}
$$
From α to the Likelihood

Reminder: $\alpha(i, t) \stackrel{\text{def}}{=} p(x_1^t, q_t = i)$

Initial condition: $\alpha(i, 0) = P(q_0 = i)$, the prior probability of each state i.

Then let us compute $\alpha(i, t)$ for each state i and each time t of a given sequence $x_1^T$.

Afterward, we can compute the likelihood as follows:
$$p(x_1^T) = \sum_i p(x_1^T, q_T = i) = \sum_i \alpha(i, T)$$

Hence, to compute the likelihood $p(x_1^T)$, we need $O(N^2 T)$ operations, where N is the number of states.
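A minimal sketch of the forward recursion and likelihood computation for a discrete-emission HMM, under the same assumed array conventions (initial, trans, emit) as above:

```python
import numpy as np

def forward_likelihood(obs, initial, trans, emit):
    """alpha[i, t] = p(x_1^t, q_t = i); returns (alpha, likelihood p(x_1^T)).
    initial[i] = P(q_0 = i), trans[j, i] = P(q_t = i | q_{t-1} = j),
    emit[i, k] = p(x_t = k | q_t = i)."""
    n_states, T = len(initial), len(obs)
    alpha = np.zeros((n_states, T + 1))
    alpha[:, 0] = initial                      # alpha(i, 0) = P(q_0 = i)
    for t in range(1, T + 1):
        x = obs[t - 1]
        # alpha(i, t) = p(x_t | q_t=i) * sum_j P(q_t=i | q_{t-1}=j) alpha(j, t-1)
        alpha[:, t] = emit[:, x] * (trans.T @ alpha[:, t - 1])
    return alpha, alpha[:, T].sum()            # p(x_1^T) = sum_i alpha(i, T)
```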
EM Training for HMMs

For HMMs, the hidden variable Q will describe in which state the HMM was for each observation $x_t$ of a sequence X.

The joint likelihood of all sequences $X^{(l)}$ and the hidden variable Q is then:
$$p(X, Q \mid \theta) = \prod_{l=1}^{n} p(X^{(l)}, Q \mid \theta)$$

Let us introduce the following indicator variable:
$$q_{i,t} = \begin{cases} 1 & \text{if } q_t = i \\ 0 & \text{otherwise} \end{cases}$$
Joint Likelihood

Let us now use our indicator variables q to instantiate Q:

$$
\begin{aligned}
p(X, Q \mid \theta) &= \prod_{l=1}^{n} p(X^{(l)}, Q \mid \theta) \\
&= \prod_{l=1}^{n} \left( \prod_{i=1}^{N} P(q_0 = i)^{q_{i,0}} \prod_{t=1}^{T_l} \prod_{i=1}^{N} p(x_t^{(l)} \mid q_t = i)^{q_{i,t}} \prod_{j=1}^{N} P(q_t = i \mid q_{t-1} = j)^{q_{i,t}\, q_{j,t-1}} \right)
\end{aligned}
$$
Joint Log Likelihood

$$
\begin{aligned}
\log p(X, Q \mid \theta) = &\sum_{l=1}^{n} \sum_{i=1}^{N} q_{i,0} \log P(q_0 = i) \\
+ &\sum_{l=1}^{n} \sum_{t=1}^{T_l} \sum_{i=1}^{N} q_{i,t} \log p(x_t^{(l)} \mid q_t = i) \\
+ &\sum_{l=1}^{n} \sum_{t=1}^{T_l} \sum_{i=1}^{N} \sum_{j=1}^{N} q_{i,t}\, q_{j,t-1} \log P(q_t = i \mid q_{t-1} = j)
\end{aligned}
$$
Auxiliary Function

Let us now write the corresponding auxiliary function:

$$
\begin{aligned}
A(\theta, \theta^s) &= E_Q[\log p(X, Q \mid \theta) \mid X, \theta^s] \\
&= \sum_{l=1}^{n} \sum_{i=1}^{N} E_Q[q_{i,0} \mid X, \theta^s] \log P(q_0 = i) \\
&+ \sum_{l=1}^{n} \sum_{t=1}^{T_l} \sum_{i=1}^{N} E_Q[q_{i,t} \mid X, \theta^s] \log p(x_t^{(l)} \mid q_t = i) \\
&+ \sum_{l=1}^{n} \sum_{t=1}^{T_l} \sum_{i=1}^{N} \sum_{j=1}^{N} E_Q[q_{i,t}\, q_{j,t-1} \mid X, \theta^s] \log P(q_t = i \mid q_{t-1} = j)
\end{aligned}
$$

From now on, let us drop the index l for simplicity.
Derivation of the Backward Variable β

β is the probability of generating the rest of the sequence $x_{t+1}^T$ given that we are in state i at time t:

$$
\begin{aligned}
\beta(i, t) &\stackrel{\text{def}}{=} p(x_{t+1}^T \mid q_t = i) \\
&= \sum_j p(x_{t+1}^T, q_{t+1} = j \mid q_t = i) \\
&= \sum_j p(x_{t+1} \mid x_{t+2}^T, q_{t+1} = j, q_t = i)\, p(x_{t+2}^T, q_{t+1} = j \mid q_t = i) \\
&= \sum_j p(x_{t+1} \mid q_{t+1} = j)\, p(x_{t+2}^T \mid q_{t+1} = j, q_t = i)\, P(q_{t+1} = j \mid q_t = i) \\
&= \sum_j p(x_{t+1} \mid q_{t+1} = j)\, p(x_{t+2}^T \mid q_{t+1} = j)\, P(q_{t+1} = j \mid q_t = i) \\
&= \sum_j p(x_{t+1} \mid q_{t+1} = j)\, \beta(j, t+1)\, P(q_{t+1} = j \mid q_t = i)
\end{aligned}
$$
Final Details About β

Reminder: $\beta(i, t) = p(x_{t+1}^T \mid q_t = i)$

Final condition:
$$\beta(i, T) = \begin{cases} 1 & \text{if } i \text{ is a final state} \\ 0 & \text{otherwise} \end{cases}$$

Hence, to compute all the β variables, we need $O(N^2 T)$ operations, where N is the number of states.
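The corresponding backward recursion, in the same sketch style and with the final-state convention above (array names are again assumptions):

```python
import numpy as np

def backward(obs, trans, emit, finals):
    """beta[i, t] = p(x_{t+1}^T | q_t = i), with beta(i, T) = 1 only for final states."""
    n_states, T = trans.shape[0], len(obs)
    beta = np.zeros((n_states, T + 1))
    beta[list(finals), T] = 1.0                # final condition
    for t in range(T - 1, -1, -1):
        x = obs[t]                             # obs[t] is x_{t+1} in the slides' 1-based notation
        # beta(i, t) = sum_j p(x_{t+1} | q_{t+1}=j) beta(j, t+1) P(q_{t+1}=j | q_t=i)
        beta[:, t] = trans @ (emit[:, x] * beta[:, t + 1])
    return beta
```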
E-Step for HMMs

Posterior on emission distributions:

$$
\begin{aligned}
E_Q[q_{i,t} \mid X, \theta^s] &= P(q_t = i \mid x_1^T, \theta^s) = P(q_t = i \mid x_1^T) \\
&= \frac{p(x_1^T, q_t = i)}{p(x_1^T)} \\
&= \frac{p(x_{t+1}^T \mid q_t = i, x_1^t)\, p(x_1^t, q_t = i)}{p(x_1^T)} \\
&= \frac{p(x_{t+1}^T \mid q_t = i)\, p(x_1^t, q_t = i)}{p(x_1^T)} \\
&= \frac{\beta(i, t)\, \alpha(i, t)}{\sum_j \alpha(j, T)}
\end{aligned}
$$
E-Step for HMMs

Posterior on transition distributions:

$$
\begin{aligned}
E_Q[q_{i,t}\, q_{j,t-1} \mid X, \theta^s] &= P(q_t = i, q_{t-1} = j \mid x_1^T, \theta^s) \\
&= \frac{p(x_1^T, q_t = i, q_{t-1} = j)}{p(x_1^T)} \\
&= \frac{p(x_{t+1}^T \mid q_t = i)\, P(q_t = i \mid q_{t-1} = j)\, p(x_t \mid q_t = i)\, p(x_1^{t-1}, q_{t-1} = j)}{p(x_1^T)} \\
&= \frac{\beta(i, t)\, P(q_t = i \mid q_{t-1} = j)\, p(x_t \mid q_t = i)\, \alpha(j, t-1)}{\sum_j \alpha(j, T)}
\end{aligned}
$$
E-Step for HMMs

Posterior on the initial state distribution:

$$
\begin{aligned}
E_Q[q_{i,0} \mid X, \theta^s] &= P(q_0 = i \mid x_1^T, \theta^s) = P(q_0 = i \mid x_1^T) \\
&= \frac{p(x_1^T, q_0 = i)}{p(x_1^T)} \\
&= \frac{p(x_1^T \mid q_0 = i)\, P(q_0 = i)}{p(x_1^T)} \\
&= \frac{\beta(i, 0)\, P(q_0 = i)}{\sum_j \alpha(j, T)}
\end{aligned}
$$
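Putting the three posteriors together, a minimal sketch computing them from the alpha and beta arrays of the earlier sketches (the names gamma and xi are assumptions, not from the slides):

```python
import numpy as np

def e_step_posteriors(obs, alpha, beta, trans, emit):
    """gamma[i, t] = P(q_t = i | x_1^T) and xi[j, i, t] = P(q_{t-1}=j, q_t=i | x_1^T)."""
    T = len(obs)
    likelihood = alpha[:, T].sum()             # p(x_1^T) = sum_j alpha(j, T)
    gamma = alpha * beta / likelihood          # emission (and, at t=0, initial-state) posteriors
    xi = np.zeros((trans.shape[0], trans.shape[0], T))
    for t in range(1, T + 1):
        x = obs[t - 1]
        # xi(j, i, t) = alpha(j, t-1) P(q_t=i|q_{t-1}=j) p(x_t|q_t=i) beta(i, t) / p(x_1^T)
        xi[:, :, t - 1] = np.outer(alpha[:, t - 1], emit[:, x] * beta[:, t]) * trans / likelihood
    return gamma, xi
```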
M-Step for HMMs

Find the parameters θ that maximize A, hence search for $\frac{\partial A}{\partial \theta} = 0$.

When transition distributions are represented as tables, using a Lagrange multiplier we obtain:

$$P(q_t = i \mid q_{t-1} = j) = \frac{\sum_{t=1}^{T} P(q_t = i, q_{t-1} = j \mid x_1^T, \theta^s)}{\sum_{t=1}^{T} P(q_{t-1} = j \mid x_1^T, \theta^s)}$$

When emission distributions are implemented as GMMs, use the equations already given, weighted by the posterior on emissions $P(q_t = i \mid x_1^T, \theta^s)$.
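With those posteriors, the table-based M-step for transitions reduces to a normalized sum over time; a minimal sketch under the same assumptions as the E-step sketch:

```python
import numpy as np

def m_step_transitions(xi, gamma):
    """New trans[j, i] = sum_t xi(j, i, t) / sum_t gamma(j, t-1)."""
    numerator = xi.sum(axis=2)                               # sum over t of joint posteriors
    denominator = gamma[:, :-1].sum(axis=1, keepdims=True)   # sum over t of P(q_{t-1}=j | x_1^T)
    return numerator / np.maximum(denominator, 1e-300)       # avoid division by zero
```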
The Most Likely Path (Graphical View)

The Viterbi algorithm finds the best state sequence.

Figure: left, compute the partial paths through the states (q1, q2, q3) over time; right, backtrack in time to recover the best path.
The Viterbi Algorithm for HMMs

The Viterbi algorithm finds the best state sequence:

$$
\begin{aligned}
V(i, t) &\stackrel{\text{def}}{=} \max_{q_1^{t-1}} p(x_1^t, q_1^{t-1}, q_t = i) \\
&= \max_{q_1^{t-1}} p(x_t \mid x_1^{t-1}, q_1^{t-1}, q_t = i)\, p(x_1^{t-1}, q_1^{t-1}, q_t = i) \\
&= p(x_t \mid q_t = i) \max_j \max_{q_1^{t-2}} p(x_1^{t-1}, q_1^{t-2}, q_t = i, q_{t-1} = j) \\
&= p(x_t \mid q_t = i) \max_j \max_{q_1^{t-2}} P(q_t = i \mid q_{t-1} = j)\, p(x_1^{t-1}, q_1^{t-2}, q_{t-1} = j) \\
&= p(x_t \mid q_t = i) \max_j P(q_t = i \mid q_{t-1} = j) \max_{q_1^{t-2}} p(x_1^{t-1}, q_1^{t-2}, q_{t-1} = j) \\
&= p(x_t \mid q_t = i) \max_j P(q_t = i \mid q_{t-1} = j)\, V(j, t-1)
\end{aligned}
$$
From Viterbi to the State Sequence

Reminder: $V(i, t) = \max_{q_1^{t-1}} p(x_1^t, q_1^{t-1}, q_t = i)$

Let us compute $V(i, t)$ for each state i and each time t of a given sequence $x_1^T$.

Moreover, let us also keep for each $V(i, t)$ the associated argmax previous state j.

Then, starting from the state $i = \arg\max_j V(j, T)$, backtrack to decode the most probable state sequence.

Hence, to compute all the $V(i, t)$ variables, we need $O(N^2 T)$ operations, where N is the number of states.
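A minimal sketch of the Viterbi recursion with backtracking, again for a discrete-emission HMM with the assumed array conventions used above:

```python
import numpy as np

def viterbi(obs, initial, trans, emit):
    """Return the most probable state sequence q_1..q_T for the observations."""
    n_states, T = len(initial), len(obs)
    V = np.zeros((n_states, T + 1))
    back = np.zeros((n_states, T + 1), dtype=int)
    V[:, 0] = initial
    for t in range(1, T + 1):
        x = obs[t - 1]
        # V(i, t) = p(x_t | q_t=i) * max_j P(q_t=i | q_{t-1}=j) V(j, t-1)
        scores = trans * V[:, t - 1][:, None]      # scores[j, i]
        back[:, t] = scores.argmax(axis=0)         # best previous state j for each i
        V[:, t] = emit[:, x] * scores.max(axis=0)
    # Backtrack from the best final state.
    path = [int(V[:, T].argmax())]
    for t in range(T, 1, -1):
        path.append(int(back[path[-1], t]))
    return path[::-1]
```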
Applications of HMMs

Classifying sequences such as...
  DNA sequences (which family)
  gesture sequences
  video sequences
  phoneme sequences
  etc.

Decoding sequences such as...
  continuous speech recognition
  handwriting recognition
  sequences of events (meetings, surveillance, games, etc.)
Application: continuous speech recognition: find a sequence of phonemes (or words) given an acoustic sequence.

Idea: use a phoneme model.
Embedded Training of HMMs

For each acoustic sequence in the training set, create a new HMM as the concatenation of the HMMs representing the underlying sequence of phonemes.

Maximize the likelihood of the training sentences.

Figure: phoneme HMMs for C, A, T concatenated into a word model.
HMMs: Decoding a Sentence

Decide what the accepted vocabulary is.

Optionally add a language model: P(word sequence).

Use an efficient algorithm to find the optimal path in the decoding HMM.

Figure: decoding HMM combining the word models D-O-G and C-A-T.
Measuring Error

How do we measure the quality of a speech recognizer?

Problem: the target solution is a sentence and the obtained solution is also a sentence, but they might have different lengths!

Proposed solution: the edit distance: assuming we have access to the operators insert, delete, and substitute, what is the smallest number of such operations needed to go from the obtained sentence to the desired one? An efficient algorithm exists to compute this.

At the end, we measure the error as follows:

$$\text{WER} = \frac{\#\text{ins} + \#\text{del} + \#\text{subst}}{\#\text{words}}$$

Note that the word error rate (WER) can be greater than 1.
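A minimal sketch of the edit-distance computation underlying the WER (the helper name and the toy sentences are assumptions for illustration):

```python
def word_error_counts(reference, hypothesis):
    """Minimum number of insertions + deletions + substitutions needed to turn
    the hypothesis word sequence into the reference (Levenshtein distance)."""
    R, H = len(reference), len(hypothesis)
    dist = [[0] * (H + 1) for _ in range(R + 1)]
    for r in range(R + 1):
        dist[r][0] = r                       # r deletions
    for h in range(H + 1):
        dist[0][h] = h                       # h insertions
    for r in range(1, R + 1):
        for h in range(1, H + 1):
            sub = 0 if reference[r - 1] == hypothesis[h - 1] else 1
            dist[r][h] = min(dist[r - 1][h] + 1,        # deletion
                             dist[r][h - 1] + 1,        # insertion
                             dist[r - 1][h - 1] + sub)  # substitution or match
    return dist[R][H]

# WER = edit distance divided by the number of reference words.
ref, hyp = "the cat sat".split(), "the cat sat down".split()
print(word_error_counts(ref, hyp) / len(ref))   # 1/3, about 0.33
```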
Maximum Mutual Information

Using the maximum likelihood criterion for a classification task might sometimes be worse than using a discriminative approach.

What about changing the criterion to be more discriminative?

Maximum Mutual Information (MMI) between the word sequence W and the acoustic sequence A:

$$
\begin{aligned}
I(A, W) &= \log \frac{P(A, W)}{P(A) P(W)} \\
&= \log P(A \mid W) P(W) - \log P(A) - \log P(W) \\
&= \log P(A \mid W) - \log P(A) \\
&= \log P(A \mid W) - \log \sum_w P(A \mid w) P(w)
\end{aligned}
$$

Apply gradient ascent: $\frac{\partial I(A, W)}{\partial \theta}$.
Practical Issues

Capacity is tuned by the following hyper-parameters:
  the number of states (or values the hidden variable can take)
  the non-zero transitions (fully connected, left-to-right, etc.)
  the capacity of the underlying emission models
  the number of training iterations

Initialization:
  if the training set is aligned, use this information
  otherwise, use uniform probabilities for transitions and K-Means for GMM-based emissions

Computational constraint: work in the logarithmic domain! (See the sketch below.)
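As an illustration of the log-domain trick (not from the slides), one forward step rewritten so the sum over previous states becomes a log-sum-exp:

```python
import numpy as np

def log_forward_step(log_alpha_prev, log_trans, log_emit_x):
    """One forward step in the log domain:
    log alpha(i, t) = log p(x_t|q_t=i) + logsumexp_j [log P(q_t=i|q_{t-1}=j) + log alpha(j, t-1)]."""
    scores = log_trans + log_alpha_prev[:, None]          # scores[j, i]
    m = scores.max(axis=0)                                # subtract the max for numerical stability
    return log_emit_x + m + np.log(np.exp(scores - m).sum(axis=0))
```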
Imbalance between Transitions and Emissions

A problem often seen in speech recognition...

Decoding with Viterbi:
$$V(i, t) = p(x_t \mid q_t = i) \max_j P(q_t = i \mid q_{t-1} = j)\, V(j, t-1)$$

Emissions are represented by GMMs: their densities depend on the number of dimensions of $x_t$.

Practical estimates on the Numbers 95 database (39 dimensions), comparing the variances of the log distributions during decoding:

  Variance of $\log P(q_t \mid q_{t-1})$: 9.8
  Variance of $\log p(x_t \mid q_t)$: 486.0