Statistical Machine Learning from Data


Statistical Machine Learning from Data

Samy Bengio
IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
bengio@idiap.ch
http://www.idiap.ch/~bengio

December 2, 2005



Markov Models

Stochastic process of a temporal sequence: the probability distribution of the variable q at time t depends on the variable q at times t-1 to 1:

P(q_1, q_2, ..., q_T) = P(q_1) ∏_{t=2}^{T} P(q_t | q_1^{t-1})

First Order Markov Process:

P(q_t | q_1^{t-1}) = P(q_t | q_{t-1})

Markov Model: model of a Markovian process with discrete states.

Markov Models (Graphical View)

[Figure: a Markov model with states 1, 2, 3, and the same model unfolded in time over q(t-2), q(t-1), q(t), q(t+1).]

Training Markov Models

A Markov model is represented by all its transition probabilities: P(q_t = i | q_{t-1} = j) for all i, j. Given a training set of sequences X, training means re-estimating these probabilities. Simply count them to obtain the maximum likelihood solution:

P(q_t = i | q_{t-1} = j) = #(q_t = i and q_{t-1} = j ∈ X) / #(q_{t-1} = j ∈ X)

Example: observe the weather today assuming it depends on the previous day.
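As an illustration of the counting estimate above, here is a minimal sketch (not part of the original slides) that estimates a transition table from integer-coded state sequences; the two-state weather encoding and the absence of smoothing are assumptions made purely for illustration.

    def estimate_transitions(sequences, num_states):
        """Maximum-likelihood transition table P(q_t = i | q_{t-1} = j)
        obtained by counting consecutive state pairs, as on the slide."""
        counts = [[0.0] * num_states for _ in range(num_states)]  # counts[j][i]
        for seq in sequences:
            for prev, cur in zip(seq[:-1], seq[1:]):
                counts[prev][cur] += 1.0
        trans = []
        for j in range(num_states):
            total = sum(counts[j])
            trans.append([c / total if total > 0 else 0.0 for c in counts[j]])
        return trans  # trans[j][i] = P(q_t = i | q_{t-1} = j)

    # Toy weather example: 0 = sunny, 1 = rainy (assumed labels).
    sequences = [[0, 0, 1, 1, 0], [1, 1, 0, 0, 0]]
    print(estimate_transitions(sequences, num_states=2))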


Hidden Markov Models (Graphical View)

[Figure: a hidden Markov model with states 1, 2, 3, and the same model unfolded in time: hidden states q(t-2), q(t-1), q(t), q(t+1), each emitting an observation y(t-2), y(t-1), y(t), y(t+1).]

Elements of an HMM

Hidden Markov Model: a Markov model whose state is not observed, but of which one can observe a manifestation (a variable x_t which depends only on q_t).

A finite number of states N.
Transition probabilities between states, which depend only on the previous state: P(q_t = i | q_{t-1} = j, θ).
Emission probabilities, which depend only on the current state: p(x_t | q_t = i, θ) (where x_t is observed).
Initial state probabilities: P(q_0 = i | θ).

Each of these 3 sets of probabilities has parameters θ to estimate.
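To make the later algorithm sketches concrete, here is a minimal, assumed representation of these three sets of probabilities for a discrete-emission HMM; the slides use GMM emissions, so the emission table (and all numeric values) are illustrative assumptions only.

    import numpy as np

    class DiscreteHMM:
        """Container for the three sets of HMM parameters listed above."""
        def __init__(self, init, trans, emit):
            self.init = np.asarray(init)    # init[i]     = P(q_0 = i)
            self.trans = np.asarray(trans)  # trans[j, i] = P(q_t = i | q_{t-1} = j)
            self.emit = np.asarray(emit)    # emit[i, k]  = p(x_t = k | q_t = i)

    # A toy 2-state, 3-symbol model (values are illustrative only).
    hmm = DiscreteHMM(init=[0.6, 0.4],
                      trans=[[0.7, 0.3], [0.4, 0.6]],
                      emit=[[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])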

The 3 Problems of HMMs

The HMM model gives rise to 3 different problems:

1. Given an HMM parameterized by θ, can we compute the likelihood of a sequence X = x_1^T = {x_1, x_2, ..., x_T}: p(x_1^T | θ)?
2. Given an HMM parameterized by θ and a set of sequences D_n, can we select the parameters θ* such that θ* = arg max_θ ∏_{p=1}^{n} p(X^{(p)} | θ)?
3. Given an HMM parameterized by θ, can we compute the optimal path Q* through the state space given a sequence X: Q* = arg max_Q p(X, Q | θ)?

HMMs as Generative Processes

HMMs can be used to generate sequences. Let us define a set of starting states with initial probabilities P(q_0 = i). Let us also define a set of final states. Then for each sequence to generate:

1. Select an initial state j according to P(q_0).
2. Select the next state i according to P(q_t = i | q_{t-1} = j).
3. Emit an output according to the emission distribution p(x_t | q_t = i).
4. If i is a final state, then stop; otherwise loop to step 2.
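A minimal sampling sketch of this generative loop, assuming a discrete-emission HMM and an explicit set of final states (both assumptions for illustration; the parameter values below are arbitrary):

    import random

    def generate(init, trans, emit, final_states, max_len=100):
        """Sample one output sequence following the four steps above:
        pick a start state, then repeatedly transition, emit, and stop
        once a final state has been reached."""
        def sample(probs):
            return random.choices(range(len(probs)), weights=probs)[0]

        outputs = []
        state = sample(init)                     # step 1
        for _ in range(max_len):
            state = sample(trans[state])         # step 2
            outputs.append(sample(emit[state]))  # step 3
            if state in final_states:            # step 4
                break
        return outputs

    init = [0.6, 0.4]
    trans = [[0.7, 0.3], [0.4, 0.6]]
    emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
    print(generate(init, trans, emit, final_states={1}))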

Markovian Assumptions

Emissions: the probability of emitting x_t at time t in state q_t = i does not depend on anything else:

p(x_t | q_t = i, q_1^{t-1}, x_1^{t-1}) = p(x_t | q_t = i)

Transitions: the probability of going from state j to state i at time t does not depend on anything else:

P(q_t = i | q_{t-1} = j, q_1^{t-2}, x_1^{t-1}) = P(q_t = i | q_{t-1} = j)

Moreover, this probability does not depend on time t: P(q_t = i | q_{t-1} = j) is the same for all t; we say that such Markov models are homogeneous.

Derivation of the Forward Variable

α: the probability of having generated the sequence x_1^t and being in state i at time t:

α(i, t) := p(x_1^t, q_t = i)
 = p(x_t | x_1^{t-1}, q_t = i) p(x_1^{t-1}, q_t = i)
 = p(x_t | q_t = i) Σ_j p(x_1^{t-1}, q_t = i, q_{t-1} = j)
 = p(x_t | q_t = i) Σ_j P(q_t = i | x_1^{t-1}, q_{t-1} = j) p(x_1^{t-1}, q_{t-1} = j)
 = p(x_t | q_t = i) Σ_j P(q_t = i | q_{t-1} = j) p(x_1^{t-1}, q_{t-1} = j)
 = p(x_t | q_t = i) Σ_j P(q_t = i | q_{t-1} = j) α(j, t-1)

From α to the Likelihood

Reminder: α(i, t) := p(x_1^t, q_t = i)
Initial condition: α(i, 0) = P(q_0 = i), the prior probability of each state i.
Then let us compute α(i, t) for each state i and each time t of a given sequence x_1^T. Afterward, we can compute the likelihood as follows:

p(x_1^T) = Σ_i p(x_1^T, q_T = i) = Σ_i α(i, T)

Hence, to compute the likelihood p(x_1^T), we need O(N^2 T) operations, where N is the number of states.
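A minimal forward-pass sketch, following the initial condition and recursion above for the assumed discrete-emission representation (the variable names and toy values are illustrative, not from the slides):

    import numpy as np

    def forward(init, trans, emit, obs):
        """alpha[t, i] = p(x_1..x_t, q_t = i), with alpha[0, i] = P(q_0 = i).
        init[i] = P(q_0 = i), trans[j, i] = P(q_t = i | q_{t-1} = j),
        emit[i, k] = p(x_t = k | q_t = i)."""
        T, N = len(obs), len(init)
        alpha = np.zeros((T + 1, N))
        alpha[0] = init                    # initial condition: state priors
        for t in range(1, T + 1):
            # alpha(i,t) = p(x_t|q_t=i) * sum_j P(q_t=i|q_{t-1}=j) * alpha(j,t-1)
            alpha[t] = emit[:, obs[t - 1]] * (alpha[t - 1] @ trans)
        return alpha, alpha[T].sum()       # likelihood p(x_1^T) = sum_i alpha(i,T)

    init = np.array([0.6, 0.4])
    trans = np.array([[0.7, 0.3], [0.4, 0.6]])
    emit = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    alpha, likelihood = forward(init, trans, emit, obs=[0, 2, 1])
    print(likelihood)

Each time step costs O(N^2) operations, which gives the O(N^2 T) total noted on the slide.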

EM Training for HMMs

For HMMs, the hidden variable Q will describe in which state the HMM was for each observation x_t of a sequence X. The joint likelihood of all sequences X^{(l)} and the hidden variable Q is then:

p(X, Q | θ) = ∏_{l=1}^{n} p(X^{(l)}, Q | θ)

Let us introduce the following indicator variable:

q_{i,t} = 1 if q_t = i, and 0 otherwise.

Joint Likelihood

Let us now use our indicator variables q to instantiate Q:

p(X, Q | θ) = ∏_{l=1}^{n} p(X^{(l)}, Q | θ)
 = ∏_{l=1}^{n} ( ∏_{i=1}^{N} P(q_0 = i)^{q_{i,0}} ) ∏_{t=1}^{T_l} ∏_{i=1}^{N} p(x_t^{(l)} | q_t = i)^{q_{i,t}} ∏_{j=1}^{N} P(q_t = i | q_{t-1} = j)^{q_{i,t} q_{j,t-1}}

Joint Log-Likelihood

log p(X, Q | θ) = Σ_{l=1}^{n} Σ_{i=1}^{N} q_{i,0} log P(q_0 = i)
 + Σ_{l=1}^{n} Σ_{t=1}^{T_l} Σ_{i=1}^{N} q_{i,t} log p(x_t^{(l)} | q_t = i)
 + Σ_{l=1}^{n} Σ_{t=1}^{T_l} Σ_{i=1}^{N} Σ_{j=1}^{N} q_{i,t} q_{j,t-1} log P(q_t = i | q_{t-1} = j)

Auxiliary Function

Let us now write the corresponding auxiliary function:

A(θ, θ^s) = E_Q[log p(X, Q | θ) | X, θ^s]
 = Σ_{l=1}^{n} Σ_{i=1}^{N} E_Q[q_{i,0} | X, θ^s] log P(q_0 = i)
 + Σ_{l=1}^{n} Σ_{t=1}^{T_l} Σ_{i=1}^{N} E_Q[q_{i,t} | X, θ^s] log p(x_t^{(l)} | q_t = i)
 + Σ_{l=1}^{n} Σ_{t=1}^{T_l} Σ_{i=1}^{N} Σ_{j=1}^{N} E_Q[q_{i,t} q_{j,t-1} | X, θ^s] log P(q_t = i | q_{t-1} = j)

From now on, let us drop the index l to simplify the notation.

Derivation of the Backward Variable

β: the probability of generating the rest of the sequence x_{t+1}^T given that we are in state i at time t:

β(i, t) := p(x_{t+1}^T | q_t = i)
 = Σ_j p(x_{t+1}^T, q_{t+1} = j | q_t = i)
 = Σ_j p(x_{t+1} | x_{t+2}^T, q_{t+1} = j, q_t = i) p(x_{t+2}^T, q_{t+1} = j | q_t = i)
 = Σ_j p(x_{t+1} | q_{t+1} = j) p(x_{t+2}^T | q_{t+1} = j, q_t = i) P(q_{t+1} = j | q_t = i)
 = Σ_j p(x_{t+1} | q_{t+1} = j) p(x_{t+2}^T | q_{t+1} = j) P(q_{t+1} = j | q_t = i)
 = Σ_j p(x_{t+1} | q_{t+1} = j) β(j, t+1) P(q_{t+1} = j | q_t = i)

Final Details About β

Reminder: β(i, t) = p(x_{t+1}^T | q_t = i)
Final condition: β(i, T) = 1 if i is a final state, and 0 otherwise.
Hence, to compute all the β variables, we need O(N^2 T) operations, where N is the number of states.
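A matching backward-pass sketch under the same assumed discrete-emission representation; by default it treats every state as final (β(i, T) = 1 everywhere), which is a simplification of the slide's final-state condition:

    import numpy as np

    def backward(trans, emit, obs, final_mask=None):
        """beta[t, i] = p(x_{t+1}..x_T | q_t = i), computed right to left."""
        T, N = len(obs), trans.shape[0]
        beta = np.zeros((T + 1, N))
        # Final condition: 1 for final states, 0 otherwise (all states final by default).
        beta[T] = np.ones(N) if final_mask is None else np.asarray(final_mask, dtype=float)
        for t in range(T - 1, -1, -1):
            # beta(i,t) = sum_j p(x_{t+1}|q_{t+1}=j) * beta(j,t+1) * P(q_{t+1}=j|q_t=i)
            beta[t] = trans @ (emit[:, obs[t]] * beta[t + 1])
        return beta

    trans = np.array([[0.7, 0.3], [0.4, 0.6]])
    emit = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    print(backward(trans, emit, obs=[0, 2, 1]))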

E-Step for HMMs

Posterior on emission distributions:

E_Q[q_{i,t} | X, θ^s] = P(q_t = i | x_1^T, θ^s) = P(q_t = i | x_1^T)
 = p(x_1^T, q_t = i) / p(x_1^T)
 = p(x_{t+1}^T | q_t = i, x_1^t) p(x_1^t, q_t = i) / p(x_1^T)
 = p(x_{t+1}^T | q_t = i) p(x_1^t, q_t = i) / p(x_1^T)
 = β(i, t) α(i, t) / Σ_j α(j, T)

E-Step for HMMs

Posterior on transition distributions:

E_Q[q_{i,t} q_{j,t-1} | X, θ^s] = P(q_t = i, q_{t-1} = j | x_1^T, θ^s)
 = p(x_1^T, q_t = i, q_{t-1} = j) / p(x_1^T)
 = p(x_{t+1}^T | q_t = i) P(q_t = i | q_{t-1} = j) p(x_t | q_t = i) p(x_1^{t-1}, q_{t-1} = j) / p(x_1^T)
 = β(i, t) P(q_t = i | q_{t-1} = j) p(x_t | q_t = i) α(j, t-1) / Σ_k α(k, T)

E-Step for HMMs

Posterior on the initial state distribution:

E_Q[q_{i,0} | X, θ^s] = P(q_0 = i | x_1^T, θ^s) = P(q_0 = i | x_1^T)
 = p(x_1^T, q_0 = i) / p(x_1^T)
 = p(x_1^T | q_0 = i) P(q_0 = i) / p(x_1^T)
 = β(i, 0) P(q_0 = i) / Σ_j α(j, T)
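Putting the three posteriors together, a compact E-step sketch for the assumed discrete-emission model; it inlines the forward and backward recursions from above, and treating all states as final (so that Σ_j α(j, T) equals the likelihood) is again an assumption:

    import numpy as np

    def e_step(init, trans, emit, obs):
        """Return gamma[t, i] = P(q_t = i | x_1^T) for t = 0..T and
        xi[k, j, i] = P(q_{k+1} = i, q_k = j | x_1^T) for k = 0..T-1."""
        T, N = len(obs), len(init)
        alpha = np.zeros((T + 1, N))
        alpha[0] = init
        for t in range(1, T + 1):
            alpha[t] = emit[:, obs[t - 1]] * (alpha[t - 1] @ trans)
        beta = np.zeros((T + 1, N))
        beta[T] = 1.0
        for t in range(T - 1, -1, -1):
            beta[t] = trans @ (emit[:, obs[t]] * beta[t + 1])
        lik = alpha[T].sum()                      # sum_j alpha(j, T)
        gamma = alpha * beta / lik                # state posteriors
        xi = np.zeros((T + 1, N, N))
        for t in range(1, T + 1):
            # xi[t, j, i] = beta(i,t) P(q_t=i|q_{t-1}=j) p(x_t|q_t=i) alpha(j,t-1) / lik
            xi[t] = np.outer(alpha[t - 1], beta[t] * emit[:, obs[t - 1]]) * trans / lik
        return gamma, xi[1:]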

M-Step for HMMs

Find the parameters θ that maximize A, hence search for ∂A/∂θ = 0.
When transition distributions are represented as tables, using a Lagrange multiplier, we obtain:

P(q_t = i | q_{t-1} = j) = Σ_{t=1}^{T} P(q_t = i, q_{t-1} = j | x_1^T, θ^s) / Σ_{t=1}^{T} P(q_{t-1} = j | x_1^T, θ^s)

When emission distributions are implemented as GMMs, use the previously given equations, weighted by the posterior on emissions P(q_t = i | x_1^T, θ^s).
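A sketch of this transition re-estimation from the E-step posteriors of a single sequence; the discrete-emission update shown is the table analogue of the GMM case and, like the variable layout, is an assumption for illustration. A full EM iteration would alternate the e_step and m_step sketches until the likelihood stops improving.

    import numpy as np

    def m_step(gamma, xi, obs, num_symbols):
        """Re-estimate HMM tables from the posteriors gamma[t, i] (t = 0..T)
        and xi[k, j, i] (k = 0..T-1) of a single training sequence."""
        init = gamma[0]                                   # P(q_0 = i)
        # P(q_t = i | q_{t-1} = j): sum_t xi / sum_t gamma(j, t-1)
        trans = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        # Discrete-emission analogue of the GMM update: posterior-weighted counts.
        N = gamma.shape[1]
        emit = np.zeros((N, num_symbols))
        for t, symbol in enumerate(obs, start=1):
            emit[:, symbol] += gamma[t]
        emit /= emit.sum(axis=1, keepdims=True)
        return init, trans, emit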

The Most Likely Path (Graphical View)

The Viterbi algorithm finds the best state sequence: compute the partial paths, then backtrack in time.
[Figure: two trellises over states q1, q2, q3 versus time, one showing the partial paths and the other the backtracking step.]

The Viterbi Algorithm for HMMs

The Viterbi algorithm finds the best state sequence.

V(i, t) := max_{q_1^{t-1}} p(x_1^t, q_1^{t-1}, q_t = i)
 = max_{q_1^{t-1}} p(x_t | x_1^{t-1}, q_1^{t-1}, q_t = i) p(x_1^{t-1}, q_1^{t-1}, q_t = i)
 = p(x_t | q_t = i) max_j max_{q_1^{t-2}} p(x_1^{t-1}, q_1^{t-2}, q_t = i, q_{t-1} = j)
 = p(x_t | q_t = i) max_j max_{q_1^{t-2}} P(q_t = i | q_{t-1} = j) p(x_1^{t-1}, q_1^{t-2}, q_{t-1} = j)
 = p(x_t | q_t = i) max_j P(q_t = i | q_{t-1} = j) max_{q_1^{t-2}} p(x_1^{t-1}, q_1^{t-2}, q_{t-1} = j)
 = p(x_t | q_t = i) max_j P(q_t = i | q_{t-1} = j) V(j, t-1)

From Viterbi to the State Sequence

Reminder: V(i, t) = max_{q_1^{t-1}} p(x_1^t, q_1^{t-1}, q_t = i)
Let us compute V(i, t) for each state i and each time t of a given sequence x_1^T.
Moreover, let us also keep for each V(i, t) the associated argmax previous state j.
Then, starting from the state i* = arg max_j V(j, T), backtrack to decode the most probable state sequence.
Hence, to compute all the V(i, t) variables, we need O(N^2 T) operations, where N is the number of states.
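A minimal Viterbi sketch with backtracking, again under the assumed discrete-emission representation; it works directly with probabilities for brevity, whereas a real implementation would work in the log domain (as noted later in these slides):

    import numpy as np

    def viterbi(init, trans, emit, obs):
        """Return the most probable state sequence q_1..q_T and its joint probability."""
        T, N = len(obs), len(init)
        V = np.zeros((T + 1, N))
        V[0] = init
        back = np.zeros((T + 1, N), dtype=int)
        for t in range(1, T + 1):
            # V(i,t) = p(x_t|q_t=i) * max_j P(q_t=i|q_{t-1}=j) * V(j,t-1)
            scores = V[t - 1][:, None] * trans   # scores[j, i]
            back[t] = scores.argmax(axis=0)      # best previous state j for each i
            V[t] = emit[:, obs[t - 1]] * scores.max(axis=0)
        # Backtrack from the best final state.
        path = [int(V[T].argmax())]
        for t in range(T, 1, -1):
            path.append(int(back[t][path[-1]]))
        return path[::-1], V[T].max()

    init = np.array([0.6, 0.4])
    trans = np.array([[0.7, 0.3], [0.4, 0.6]])
    emit = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    print(viterbi(init, trans, emit, obs=[0, 2, 1]))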

Applications of HMMs

Classifying sequences such as...
DNA sequences (which family), gesture sequences, video sequences, phoneme sequences, etc.
Decoding sequences such as...
continuous speech recognition, handwriting recognition, sequences of events (meetings, surveillance, games, etc.).


Application: continuous speech recognition.
Find a sequence of phonemes (or words) given an acoustic sequence.
Idea: use a phoneme model.

Embedded Training of HMMs

For each acoustic sequence in the training set, create a new HMM as the concatenation of the HMMs representing the underlying sequence of phonemes.
Maximize the likelihood of the training sentences.
[Figure: phoneme HMMs for C, A, and T concatenated into a single word-level HMM for "CAT".]
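One rough sketch of how such a concatenated model could be assembled; the per-phoneme 'trans'/'exit' layout and the routing of each exit probability to the next phoneme's first state are assumptions for illustration, not the slides' exact construction:

    import numpy as np

    def concatenate_hmms(phoneme_hmms):
        """Build one left-to-right HMM whose states are the states of the given
        phoneme HMMs in order, linking the exit of each phoneme to the entry of
        the next. Each phoneme is a dict with 'trans' (within-phoneme transition
        matrix) and 'exit' (probability of leaving from each state)."""
        sizes = [h['trans'].shape[0] for h in phoneme_hmms]
        trans = np.zeros((sum(sizes), sum(sizes)))
        offset = 0
        for k, h in enumerate(phoneme_hmms):
            n = sizes[k]
            trans[offset:offset + n, offset:offset + n] = h['trans']
            if k + 1 < len(phoneme_hmms):
                # Route exit probability mass to the first state of the next phoneme.
                trans[offset:offset + n, offset + n] = h['exit']
            offset += n
        return trans

    # Two toy 2-state phoneme models (illustrative values only).
    ph = {'trans': np.array([[0.6, 0.3], [0.0, 0.7]]), 'exit': np.array([0.1, 0.3])}
    print(concatenate_hmms([ph, ph]))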

HMMs: Decoding a Sentence

Decide what the accepted vocabulary is.
Optionally add a language model: P(word sequence).
Use an efficient algorithm to find the optimal path in the decoding HMM.
[Figure: a decoding HMM built from concatenated phoneme models for the words "DOG" and "CAT".]

Measuring Error

How do we measure the quality of a speech recognizer?
Problem: the target solution is a sentence and the obtained solution is also a sentence, but they might have different sizes!
Proposed solution: the edit distance. Assuming we have access to the operators insert, delete, and substitute, what is the smallest number of such operators needed to go from the obtained sentence to the desired one? An efficient algorithm exists to compute this.
At the end, we measure the error as follows:

WER = (#insertions + #deletions + #substitutions) / #words

Note that the word error rate (WER) can be greater than 1.
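A minimal sketch of the edit-distance computation and the WER formula above, using the standard dynamic program over word lists (the sample sentences are made up for illustration):

    def edit_distance(ref, hyp):
        """Minimum number of insertions, deletions, and substitutions
        needed to turn the hypothesis word list into the reference."""
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution (or match)
        return d[len(ref)][len(hyp)]

    def word_error_rate(ref_sentence, hyp_sentence):
        ref, hyp = ref_sentence.split(), hyp_sentence.split()
        return edit_distance(ref, hyp) / len(ref)

    print(word_error_rate("the cat sat on the mat", "the cat sat mat"))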

Maximum Mutual Information

Using the maximum likelihood criterion for a classification task might sometimes be worse than using a discriminative approach.
What about changing the criterion to be more discriminative?
Maximum Mutual Information (MMI) between word (W) and acoustic (A) sequences:

I(A, W) = log [ P(A, W) / (P(A) P(W)) ]
 = log P(A | W) P(W) − log P(A) − log P(W)
 = log P(A | W) − log P(A)
 = log P(A | W) − log Σ_w P(A | w) P(w)

Apply gradient ascent: ∂I(A, W)/∂θ.


Various

Capacity is tuned by the following hyper-parameters:
Number of states (or values the hidden variable can take).
Non-zero transitions (fully connected, left-to-right, etc.).
Capacity of the underlying emission models.
Number of training iterations.
Initialization:
If the training set is aligned, use this information.
Otherwise, use uniform probabilities for transitions and K-Means for GMM-based emissions.
Computational constraint: work in the logarithmic domain!
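The log-domain advice deserves a tiny illustration. The sketch below shows the standard log-sum-exp trick for combining log-probabilities (e.g. the sum over previous states in the forward recursion); the trick itself is implied by, but not spelled out on, the slide:

    import math

    def log_sum_exp(log_values):
        """Compute log(sum_i exp(v_i)) stably, for summing probabilities
        that are stored as log-probabilities."""
        m = max(log_values)
        if m == float('-inf'):          # all probabilities are zero
            return float('-inf')
        return m + math.log(sum(math.exp(v - m) for v in log_values))

    # Multiplying many small probabilities underflows; adding their logs does not.
    log_alpha = [math.log(1e-5)] * 300                   # 300 factors of 1e-5
    print(sum(log_alpha))                                # fine in the log domain
    print(log_sum_exp([math.log(0.2), math.log(0.3)]))   # log(0.5)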

Imbalance between Transitions and Emissions

A problem often seen in speech recognition...
Decoding with Viterbi:

V(i, t) = p(x_t | q_t = i) max_j P(q_t = i | q_{t-1} = j) V(j, t-1)

Emissions are represented by GMMs: densities depend on the number of dimensions of x_t.
Practical estimates on the Numbers 95 database (39 dimensions):

Variance of log P(q_t | q_{t-1}):   9.8
Variance of log p(x_t | q_t):     486.0

Comparison of the variances of the log distributions during decoding.