Speech Recognition HMM


Speech Recognition HMM
Jan Černocký, Valentina Hubeika {cernocky ihubeika}@fit.vutbr.cz, DCGM FIT BUT Brno

Agenda
Recap: variability in speech. Statistical approach. Model and its parameters. Probability of emitting matrix O by model M. Parameter estimation. Recognition. Viterbi and beer (token) passing.

Recap: what to do about variability?
Parameter space - a human never says one thing the same way twice; the parameter vectors always differ. Two options: 1. Measuring the distance between two vectors. 2. Statistical modeling.

[Figure: reference vectors and a test vector in the (o1, o2) parameter space; the closest reference vector is selected.]

[Figure: statistical modeling - probability distributions over the parameter space; the winning distribution is highlighted.]

Timing - people never say one thing with the same timing.

Timing, solution 1 - Dynamic Time Warping. [Figure: DTW alignment path between the test and reference sequences, D = 0.24094.]

Timing, solution 2 - Hidden Markov Models - state sequence. [Figure: Markov model M with states 1-6, self-loop and forward transition probabilities $a_{ij}$, and emission probabilities $b_j(o_t)$ assigned to the observation sequence $o_1 \dots o_6$.]

Hidden Markov Models - HMM
If words were represented by a scalar, we would model them with a Gaussian distribution... [Figure: 1-D Gaussian distributions; we want to determine the output probabilities for x = 0, and the winning distribution is selected.]

The value p(x) of a probability density function is not a real probability; only $\int_a^b p(x)\,dx$ is a proper probability. The value p(x) is nevertheless sometimes called the emission probability, as if a random process generated values x with probability p(x). Our process does not literally generate anything - we just call it the emission probability to distinguish it from the transition probabilities.

However, words are not represented by simple scalars but by vectors, so we use a multivariate Gaussian distribution. [Figure: 2-D Gaussian distributions; the winning distribution is selected.] We use n-dimensional Gaussians (e.g. 39-dimensional); in the examples we are going to work with 1-D, 2-D or 3-D.

Again, words are not represented by just one vector but by a sequence of vectors. Idea 1: one Gaussian per word - BAD IDEA (the words "paří" and "řípa", which contain the same sounds in a different order, would become indistinguishable???). Idea 2: each word is represented by a sequence of Gaussians - one Gaussian per frame? NO, every utterance has a different number of vectors!!! We need a model where Gaussians can be repeated!

Models where Gaussians can repeat are called HMMs - introduction on isolated words. Each word we want to recognize is represented by one model. [Figure: the unknown word is scored by model1 ... modelN, giving Viterbi probabilities $\Phi_1 \dots \Phi_N$; the winner (e.g. word2) is selected.]

Now some math. Everything is explained on one model; keep in mind that to recognize, we need several models, one per word.
Input vector sequence: $O = [\mathbf{o}(1), \mathbf{o}(2), \ldots, \mathbf{o}(T)]$. (1) [Figure: a sequence of feature vectors displayed as a matrix.]

Configuration of HMM
[Figure: Markov model M with states 1-6, transition probabilities $a_{ij}$, and emission probabilities $b_j(o_t)$ for the observation sequence $o_1 \dots o_6$.]

Transition probabilities $a_{ij}$
Here only three types: $a_{i,i}$ - the probability of remaining in the same state; $a_{i,i+1}$ - the probability of moving to the following state; $a_{i,i+2}$ - the probability of skipping the next state (not used here, only for short silence models).
$$A = \begin{bmatrix} 0 & a_{12} & 0 & 0 & 0 & 0 \\ 0 & a_{22} & a_{23} & a_{24} & 0 & 0 \\ 0 & 0 & a_{33} & a_{34} & a_{35} & 0 \\ 0 & 0 & 0 & a_{44} & a_{45} & 0 \\ 0 & 0 & 0 & 0 & a_{55} & a_{56} \\ 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \quad (2)$$
The matrix has nonzero items only on the diagonal and just above it (we either stay in a state or go to the following one - in DTW the path could not go back either).
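As a small illustration (assuming NumPy; not part of the original slides), a left-to-right transition matrix with the shape of eq. (2) can be built as follows; the 0.6/0.4 values are arbitrary placeholders:

```python
import numpy as np

N = 6                          # states 1 and 6 are non-emitting entry/exit states
A = np.zeros((N, N))
A[0, 1] = 1.0                  # a_12: the entry state always moves to state 2
for i in range(1, N - 1):      # emitting states 2..5 (0-based indices 1..4)
    A[i, i] = 0.6              # a_ii: stay in the same state
    A[i, i + 1] = 0.4          # a_i,i+1: move to the following state
print(A)                       # each emitting row sums to 1; the exit row is zero
```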

Emission probability density function (PDF)
The emission probability of vector o(t) in the i-th state of the model is $b_i[\mathbf{o}(t)]$. Discrete: vectors are pre-quantized by VQ into symbols, and the states contain tables with the emission probabilities of the symbols - we will not talk about this variant. Continuous: a continuous probability density function (continuous-density HMM, CDHMM); the function is defined as a Gaussian probability density function or a weighted sum of several of them.

If vector o contains only one element:
$$b_j[o(t)] = \mathcal{N}(o(t); \mu_j, \sigma_j) = \frac{1}{\sigma_j \sqrt{2\pi}}\, e^{-\frac{[o(t) - \mu_j]^2}{2\sigma_j^2}} \quad (3)$$
[Figure: 1-D Gaussian PDF p(x) with µ = 1, σ = 1.]

In reality, vector o(t) is P-dimensional (usually P = 39):
$$b_j[\mathbf{o}(t)] = \mathcal{N}(\mathbf{o}(t); \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) = \frac{1}{\sqrt{(2\pi)^P |\boldsymbol{\Sigma}_j|}}\, e^{-\frac{1}{2} (\mathbf{o}(t) - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{o}(t) - \boldsymbol{\mu}_j)} \quad (4)$$
[Figure: 2-D Gaussian PDF with µ = [1; 0.5], Σ = [1 0.5; 0.5 0.3].]
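A minimal numeric sketch of eq. (4) (assuming NumPy; not from the original slides), evaluated with the example parameters shown in the figure:

```python
import numpy as np

def gaussian_pdf(o, mu, Sigma):
    """Multivariate Gaussian density b_j[o(t)] from eq. (4)."""
    P = len(mu)
    diff = o - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** P * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([1.0, 0.5])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 0.3]])
print(gaussian_pdf(np.array([1.0, 0.5]), mu, Sigma))  # density at the mean
```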

Definition of a mixture of Gaussians (shown without a formula):
[Figure: 2-D Gaussian mixture with µ1 = [1; 0.5], Σ1 = [1 0.5; 0.5 0.3], w1 = 0.5 and µ2 = [2; 0], Σ2 = [0.5 0; 0 0.5], w2 = 0.5.]

Simplification of the probability distribution: if the parameters are not correlated (or we hope they are not), the covariance matrix is diagonal - instead of P x P covariance coefficients we estimate only P variances. This gives simpler models and enough data for good estimation, and the total value of the probability density is given by a simple product over the dimensions (a sum in the log domain) - no determinant, no matrix inversion:
$$b_j[\mathbf{o}(t)] = \prod_{i=1}^{P} \mathcal{N}(o_i(t); \mu_{ji}, \sigma_{ji}) = \prod_{i=1}^{P} \frac{1}{\sigma_{ji}\sqrt{2\pi}}\, e^{-\frac{[o_i(t) - \mu_{ji}]^2}{2\sigma_{ji}^2}} \quad (5)$$
Parameters or states can also be shared among models - fewer parameters, better estimation, less memory.
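A small illustrative sketch (assuming NumPy; not from the slides) of a diagonal-covariance Gaussian mixture evaluated in the log domain, which is how eq. (5) and the mixture from the previous slide are typically combined in practice:

```python
import numpy as np

def log_gauss_diag(o, mu, var):
    """Log of eq. (5): diagonal-covariance Gaussian, a sum over dimensions."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mu) ** 2 / var)

def log_gmm_diag(o, weights, mus, vars_):
    """Log density of a mixture: log sum_m w_m * N(o; mu_m, var_m),
    computed stably in the log domain."""
    comps = [np.log(w) + log_gauss_diag(o, m, v)
             for w, m, v in zip(weights, mus, vars_)]
    return np.logaddexp.reduce(comps)

# Example with the mixture parameters from the previous figure,
# keeping only the diagonals of the covariance matrices.
o = np.array([1.5, 0.2])
weights = [0.5, 0.5]
mus = [np.array([1.0, 0.5]), np.array([2.0, 0.0])]
vars_ = [np.array([1.0, 0.3]), np.array([0.5, 0.5])]
print(log_gmm_diag(o, weights, mus, vars_))
```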

Probability with which model M generates sequence O
State sequence: states are assigned to vectors, e.g. X = [1 2 2 3 4 4 5 6]. [Figure: trellis of states 1-6 (vertical axis) against time 1-7 (horizontal axis) showing one possible alignment.]

Probability of generating O over path X:
$$P(O, X | M) = a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(\mathbf{o}_t)\, a_{x(t)x(t+1)} \quad (6)$$
How do we define the probability of generating the whole sequence of vectors?
a) Baum-Welch: $$P(O|M) = \sum_{\{X\}} P(O, X|M), \quad (7)$$ where we take the sum over all state sequences of length T + 2.
b) Viterbi: the probability of the optimal path, $$P^{\star}(O|M) = \max_{\{X\}} P(O, X|M). \quad (8)$$
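To make eqs. (6)-(8) concrete, here is a tiny brute-force sketch (assuming NumPy; a hypothetical toy model, not the slides' model) that enumerates all state sequences for a short observation and compares the Baum-Welch sum with the Viterbi maximum:

```python
import itertools
import numpy as np

# Toy model: entry state 0, two emitting states 1-2 with 1-D Gaussian
# emissions, exit state 3.
A = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.6, 0.4, 0.0],
              [0.0, 0.0, 0.6, 0.4],
              [0.0, 0.0, 0.0, 0.0]])
mu, var = np.array([0.0, 3.0]), np.array([1.0, 1.0])

def b(j, o):                                   # emission pdf of state j, eq. (3)
    return np.exp(-(o - mu[j - 1]) ** 2 / (2 * var[j - 1])) / np.sqrt(2 * np.pi * var[j - 1])

def path_prob(X, O):                           # eq. (6): P(O, X | M) for one path
    p = A[X[0], X[1]]
    for t in range(1, len(O) + 1):
        p *= b(X[t], O[t - 1]) * A[X[t], X[t + 1]]
    return p

O = [0.1, 0.3, 2.8]                            # a short observation sequence
probs = [path_prob((0,) + X + (3,), O)         # all paths of length T + 2
         for X in itertools.product([1, 2], repeat=len(O))]
print("Baum-Welch P(O|M) =", sum(probs))       # eq. (7): sum over all paths
print("Viterbi   P*(O|M) =", max(probs))       # eq. (8): best path only
```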

Notes
1. In DTW we were looking for the minimal distance. Here we search for the maximal probability, sometimes also called the likelihood L.
2. There are fast algorithms for Baum-Welch and Viterbi: we do not have to evaluate all possible state sequences X.

Parameter Estimation (the parameters are not found in tables :-) ...)
[Figure: a training database with many labeled examples of word1 ... wordN is used to train model1 ... modelN.]

1. The parameters of a model are first roughly estimated:
$$\hat{\boldsymbol{\mu}} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{o}(t), \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{T} \sum_{t=1}^{T} (\mathbf{o}(t) - \boldsymbol{\mu})(\mathbf{o}(t) - \boldsymbol{\mu})^T \quad (9)$$
2. Vectors are aligned to states: hard, or soft using the state occupation function $L_j(t)$. [Figure: state occupation functions $L_j(t)$ for states 2, 3, 4 over time t = 0..50.]
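A small sketch (assuming NumPy; random data stands in for real features, not from the slides) of the rough estimate in eq. (9) - every state of a model starts from the global mean and covariance of the training vectors:

```python
import numpy as np

# O: training feature vectors for one word, shape (T, P).
rng = np.random.default_rng(0)
O = rng.normal(size=(200, 39))

mu_hat = O.mean(axis=0)                              # eq. (9), global mean
Sigma_hat = (O - mu_hat).T @ (O - mu_hat) / len(O)   # eq. (9), global covariance
# Flat start: assign the same (mu_hat, Sigma_hat) to every emitting state.
```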

3. New estimation of the parameters:
$$\hat{\boldsymbol{\mu}}_j = \frac{\sum_{t=1}^{T} L_j(t)\,\mathbf{o}(t)}{\sum_{t=1}^{T} L_j(t)}, \qquad \hat{\boldsymbol{\Sigma}}_j = \frac{\sum_{t=1}^{T} L_j(t)\,(\mathbf{o}(t) - \boldsymbol{\mu}_j)(\mathbf{o}(t) - \boldsymbol{\mu}_j)^T}{\sum_{t=1}^{T} L_j(t)} \quad (10)$$
... with similar equations for the transition probabilities $a_{ij}$. Steps 2) and 3) are repeated; the stopping criterion is either a fixed number of iterations or too little improvement in the probabilities. The algorithm is called EM (Expectation-Maximization): for each vector we compute $p(\text{vector} \mid \text{old parameters}) \cdot \log p(\text{vector} \mid \text{new parameters})$ and sum over the data, which is the expectation (average over the data) of the overall log-likelihood. This is the objective function that is maximized, and maximizing it leads to maximizing the likelihood. We maximize the likelihood that the data were generated by the model - LM (likelihood maximization) - which is a bit of a problem, as the models learn to represent the words as well as possible rather than to discriminate between them.
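A minimal sketch (assuming NumPy; hypothetical names, not from the slides) of the weighted re-estimation in eq. (10), given a soft alignment L[j, t]:

```python
import numpy as np

def reestimate(O, L):
    """Eq. (10): re-estimate state means and covariances from occupation
    weights L (shape: num_states x T); O holds the vectors, shape (T, P)."""
    mus, Sigmas = [], []
    for Lj in L:                                   # loop over emitting states j
        occ = Lj.sum()                             # sum_t L_j(t)
        mu_j = (Lj[:, None] * O).sum(axis=0) / occ
        diff = O - mu_j
        Sigma_j = (Lj[:, None] * diff).T @ diff / occ
        mus.append(mu_j)
        Sigmas.append(Sigma_j)
    return np.array(mus), np.array(Sigmas)

# Example: 3 emitting states, 50 frames of 2-D features, random soft alignment.
rng = np.random.default_rng(1)
O = rng.normal(size=(50, 2))
L = rng.random((3, 50)); L /= L.sum(axis=0)        # columns sum to 1, cf. eq. (11)
mus, Sigmas = reestimate(O, L)
```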

State occupation functions
The probability $L_j(t)$ of being in state j at time t can be calculated from the sum over all paths that pass through state j at time t: $P(O, x(t) = j \,|\, M)$. [Figure: trellis over times t-1, t, t+1 showing all paths through states 2..N-1 that pass through state j at time t.]

We need to ensure real probabilities, thus:
$$\sum_{j=1}^{N} L_j(t) = 1, \quad (11)$$
i.e. the contribution of a vector to all states has to sum up to 100%. We therefore normalize by the sum of the likelihoods of all paths at time t:
$$L_j(t) = \frac{P(O, x(t) = j \,|\, M)}{\sum_j P(O, x(t) = j \,|\, M)}. \quad (12)$$
The denominator covers all paths through the model, so we can use the Baum-Welch probability:
$$L_j(t) = \frac{P(O, x(t) = j \,|\, M)}{P(O|M)}. \quad (13)$$

How do we compute $P(O, x(t) = j \,|\, M)$?
1. Partial forward probabilities (starting at the first state and being in state j at time t):
$$\alpha_j(t) = P(\mathbf{o}(1) \ldots \mathbf{o}(t), x(t) = j \,|\, M) \quad (14)$$
2. Partial backward probabilities (computed starting from the last state; being in state j at time t):
$$\beta_j(t) = P(\mathbf{o}(t+1) \ldots \mathbf{o}(T) \,|\, x(t) = j, M) \quad (15)$$
3. The state occupation function is then given as:
$$L_j(t) = \frac{\alpha_j(t)\,\beta_j(t)}{P(O|M)} \quad (16)$$
More in the HTK Book.
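A compact sketch (assuming NumPy; a toy left-to-right layout consistent with the earlier sketches, not the slides' exact setup) of the forward-backward computation behind eqs. (14)-(16):

```python
import numpy as np

def forward_backward(A, B):
    """Forward-backward for a model with non-emitting entry state 0 and
    exit state N-1. B[j-1, t] holds b_j(o_t) for the emitting states
    j = 1..N-2. Plain probabilities are used for clarity; real systems
    work in the log domain."""
    N, T = A.shape[0], B.shape[1]
    emit = range(1, N - 1)
    alpha = np.zeros((N, T))
    beta = np.zeros((N, T))
    for j in emit:                           # initialisation
        alpha[j, 0] = A[0, j] * B[j - 1, 0]
        beta[j, T - 1] = A[j, N - 1]
    for t in range(1, T):                    # forward recursion, eq. (14)
        for j in emit:
            alpha[j, t] = sum(alpha[i, t - 1] * A[i, j] for i in emit) * B[j - 1, t]
    for t in range(T - 2, -1, -1):           # backward recursion, eq. (15)
        for i in emit:
            beta[i, t] = sum(A[i, j] * B[j - 1, t + 1] * beta[j, t + 1] for j in emit)
    P_O = sum(alpha[j, T - 1] * A[j, N - 1] for j in emit)   # Baum-Welch P(O|M)
    L = alpha[1:-1] * beta[1:-1] / P_O                       # eq. (16)
    return L, P_O
```

At every time t the occupation values L[:, t] sum to 1 over the emitting states, as required by eq. (11).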

Model training algorithm
1. Allocate an accumulator for each estimated vector/matrix.
2. Calculate the forward and backward probabilities $\alpha_j(t)$ and $\beta_j(t)$ for all times t and all states j; compute the state occupation functions $L_j(t)$.
3. To each accumulator, add the contribution of vector o(t) weighted by the corresponding $L_j(t)$.
4. Use the resulting accumulators to estimate the vectors/matrices using equation (10) - don't forget to normalize by $\sum_{t=1}^{T} L_j(t)$.
5. If the value of P(O|M) does not improve compared to the previous iteration, stop; otherwise go to 1.

Recognition - Viterbi algorithm
We have to recognize an unknown sequence O. The vocabulary contains Ň words $w_1 \ldots w_{\check{N}}$, and for each word we have a trained model $M_1 \ldots M_{\check{N}}$. The question is: which model generates sequence O with the highest likelihood?
$$i^{\star} = \arg\max_i \{P(O|M_i)\} \quad (17)$$
We will use the Viterbi probability of the most probable state sequence:
$$P^{\star}(O|M) = \max_{\{X\}} P(O, X|M), \quad (18)$$
thus:
$$i^{\star} = \arg\max_i \{P^{\star}(O|M_i)\}. \quad (19)$$

Viterbi probability computation: similar to Baum-Welch. The partial Viterbi probability for the j-th state and time t is
$$\Phi_j(t) = P^{\star}(\mathbf{o}(1) \ldots \mathbf{o}(t), x(t) = j \,|\, M). \quad (20)$$
The desired Viterbi probability is
$$P^{\star}(O|M) = \Phi_N(T). \quad (21)$$
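A small log-domain Viterbi sketch (assuming NumPy; the same hypothetical toy layout as the forward-backward sketch above, not HTK's actual code):

```python
import numpy as np

def viterbi(A, B):
    """Partial Viterbi probabilities Phi_j(t) (eq. 20) in the log domain for a
    model with non-emitting entry state 0 and exit state N-1; B[j-1, t] = b_j(o_t)."""
    N, T = A.shape[0], B.shape[1]
    with np.errstate(divide="ignore"):          # log(0) -> -inf is fine here
        logA, logB = np.log(A), np.log(B)
    phi = np.full((N, T), -np.inf)
    for j in range(1, N - 1):                   # initialisation, t = 1
        phi[j, 0] = logA[0, j] + logB[j - 1, 0]
    for t in range(1, T):                       # recursion: max instead of sum
        for j in range(1, N - 1):
            phi[j, t] = max(phi[i, t - 1] + logA[i, j]
                            for i in range(1, N - 1)) + logB[j - 1, t]
    # eq. (21): enter the exit state after the last frame
    return max(phi[j, T - 1] + logA[j, N - 1] for j in range(1, N - 1))
```

In recognition (eq. 19), this value would be computed for each word model $M_i$ and the word with the highest score selected.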

Isolated word recognition using HMMs
[Figure: the unknown word is scored by model1 ... modelN, giving Viterbi probabilities $\Phi_1 \dots \Phi_N$; the winner (e.g. word2) is selected.]

Viterbi: token passing
Each model state has a glass of beer; we work with log-probabilities, so we add (pour in) probabilities.
Initialization: insert an empty glass into each input state of the model (usually state 1).
Iteration: for times t = 1...T, in each state i that contains a glass, clone the glass and send the copy to all connected states j; on the way add $\log a_{ij} + \log b_j[\mathbf{o}(t)]$. If one state ends up with more than one glass, keep the one that contains more beer.
Finish: from all states i connected to the final state N that contain a glass, send the glasses to N and add $\log a_{iN}$. In the last state N, choose the fullest glass and discard the others. The level of beer in state N corresponds to the log Viterbi likelihood $\log P^{\star}(O|M)$.
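An illustrative token-passing sketch (assuming NumPy; a toy structure consistent with the earlier sketches, not the original lab code) - each "glass" is just a running log-score, and keeping the fullest glass in a state is the Viterbi max:

```python
import numpy as np

def token_passing(A, B):
    """Token passing over a left-to-right model with non-emitting entry
    state 0 and exit state N-1; B[j-1, t] = b_j(o_t). Returns log P*(O|M)."""
    N, T = A.shape[0], B.shape[1]
    with np.errstate(divide="ignore"):
        logA, logB = np.log(A), np.log(B)
    glasses = np.full(N, -np.inf)
    glasses[0] = 0.0                               # empty glass in the entry state
    for t in range(T):
        new = np.full(N, -np.inf)
        for i in range(N - 1):                     # every state holding a glass
            if glasses[i] == -np.inf:
                continue
            for j in range(1, N - 1):              # send copies to connected states
                if A[i, j] > 0:
                    score = glasses[i] + logA[i, j] + logB[j - 1, t]
                    new[j] = max(new[j], score)    # keep the fullest glass
        glasses = new
    # finish: pass the glasses to the exit state and keep the fullest one
    return max(glasses[i] + logA[i, N - 1] for i in range(1, N - 1))
```

For connected words (next slide), each glass would additionally carry the sequence of words it has passed through, so the best word string can be read off at the end.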

Beer (token) passing for connected words
Construct a mega-model from all the words: glue together the first and last states of two models. [Figure: word models such as ONE, TWO, ZERO, A, B connected into a network.]

Recognition is similar to isolated words, but we also need to remember the optimal sequence of words the glass was passed through.

Further reading
HTK Book - http://htk.eng.cam.ac.uk/
Gold and Morgan: available in the FIT library.
Computer labs on HMM-HTK.