Speech and Language Processing, Chapter 9: Automatic Speech Recognition (II)


Outline for ASR
ASR Architecture
The Noisy Channel Model
Five easy pieces of an ASR system: 1) Language Model, 2) Lexicon/Pronunciation Model (HMM), 3) Feature Extraction, 4) Acoustic Model, 5) Decoder
Training
Evaluation

Acoustic Modeling (= phone detection)
Given a 39-dimensional vector o_i corresponding to the observation of one frame, and given a phone q we want to detect, compute p(o_i | q).
Most popular method: GMMs (Gaussian mixture models). Other methods: neural nets, CRFs, SVMs, etc.

Problem: how to apply the HMM model to continuous observations?
We have assumed that the output alphabet V has a finite number of symbols, but spectral feature vectors are real-valued! How do we deal with real-valued features?
Decoding: given o_t, how do we compute P(o_t | q)?
Learning: how do we modify EM to deal with real-valued features?

Vector Quantization
Create a training set of feature vectors. Cluster them into a small number of classes. Represent each class by a discrete symbol. For each class v_k, we can compute the probability that it is generated by a given HMM state using Baum-Welch as above.

VQ
We'll define a codebook, which lists a prototype vector, or codeword, for each symbol. If we had 256 classes ("8-bit VQ"), the codebook would have 256 prototype vectors. Given an incoming feature vector, we compare it to each of the 256 prototype vectors, pick whichever one is closest (by some distance metric), and replace the input vector by the index of this prototype vector.

VQ (figure)

VQ requirements
A distance metric or distortion metric, which specifies how similar two vectors are. It is used to build clusters, to find the prototype vector for each cluster, and to compare incoming vectors to the prototypes.
A clustering algorithm: K-means, etc.

Distance metrics
Simplest: (square of) Euclidean distance, also called sum-squared error:
d^2(x, y) = \sum_{i=1}^{D} (x_i - y_i)^2

Distance metrics
More sophisticated: (square of) Mahalanobis distance. Assume that each dimension i of the feature vector has variance σ_i^2:
d^2(x, y) = \sum_{i=1}^{D} \frac{(x_i - y_i)^2}{\sigma_i^2}
The equation above assumes a diagonal covariance matrix; more on this later.
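To make the two metrics concrete, here is a minimal NumPy sketch (not from the slides); the vectors and variances are made-up illustrations, and the Mahalanobis version assumes the diagonal-covariance case above.

```python
import numpy as np

def euclidean_sq(x, y):
    """Squared Euclidean distance: sum_i (x_i - y_i)^2."""
    diff = x - y
    return float(np.dot(diff, diff))

def mahalanobis_sq_diag(x, y, var):
    """Squared Mahalanobis distance with a diagonal covariance:
    each dimension i is weighted by 1 / sigma_i^2."""
    diff = x - y
    return float(np.sum(diff * diff / var))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])
var = np.array([0.5, 4.0, 1.0])        # per-dimension variances sigma_i^2
print(euclidean_sq(x, y))              # 5.0
print(mahalanobis_sq_diag(x, y, var))  # 2.0 + 1.0 + 0.0 = 3.0
```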

Training a VQ system (generating the codebook): K-means clustering
1. Initialization: choose M vectors from the L training vectors (typically M = 2^B) as initial code words, chosen at random or by maximum distance.
2. Search: for each training vector, find the closest code word and assign the training vector to that cell.
3. Centroid update: for each cell, compute the centroid of that cell. The new code word is the centroid.
4. Repeat (2)-(3) until the average distance falls below a threshold (or there is no change).
(Slide from John-Paul Hosum, OHSU/OGI)
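A minimal NumPy sketch of this K-means codebook training, under a few stated assumptions: random initialization, squared Euclidean distance, and a fixed number of iterations instead of a distortion threshold. The function and variable names are mine, not from the slides.

```python
import numpy as np

def train_codebook(train_vecs, M=256, n_iter=20, seed=0):
    """train_vecs: (L, D) array of training feature vectors.
    Returns an (M, D) codebook of prototype vectors (codewords)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick M of the L training vectors as initial codewords.
    codebook = train_vecs[rng.choice(len(train_vecs), size=M, replace=False)].copy()
    for _ in range(n_iter):
        # 2. Search: assign each training vector to its closest codeword.
        d2 = ((train_vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        nearest = d2.argmin(axis=1)
        # 3. Centroid update: each codeword becomes the centroid of its cell.
        for k in range(M):
            cell = train_vecs[nearest == k]
            if len(cell):
                codebook[k] = cell.mean(axis=0)
    return codebook

def quantize(o_t, codebook):
    """VQ step at decode time: replace a vector by the index of its closest codeword."""
    return int(((codebook - o_t) ** 2).sum(axis=1).argmin())
```

With M = 256 this matches the "8-bit VQ" example above; for the 2-D example on the next slides, M would be 4.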

Vector Quantization Example (slide thanks to John-Paul Hosum, OHSU/OGI)
Given data points, split into 4 codebook vectors with initial values at (2,2), (4,6), (6,5), and (8,8). (Scatter plot of the data points and the initial codebook vectors.)

Vector Quantization Example (slide from John-Paul Hosum, OHSU/OGI)
Compute the centroid of each codebook cell, re-compute nearest neighbors, re-compute centroids... (Plots of the intermediate iterations.)

Vector Quantization Example (slide from John-Paul Hosum, OHSU/OGI)
Once there's no more change, the feature space will be partitioned into 4 regions. Any input feature can be classified as belonging to one of the 4 regions. The entire codebook can be specified by the 4 centroid points. (Plot of the final partition.)

Summary: VQ
To compute p(o_t | q_j): compute the distance between the feature vector o_t and each codeword (prototype vector) in a pre-clustered codebook, where distance is either Euclidean or Mahalanobis. Choose the codeword v_k of the prototype that is closest to o_t, and then look up the likelihood of v_k given HMM state j in the B matrix:
b_j(o_t) = b_j(v_k) such that v_k is the codeword of the prototype closest to o_t
(trained using Baum-Welch as above).

Computing b_j(v_k)
(Scatter plot of feature value 1 vs. feature value 2 for state j.)
b_j(v_k) = (number of vectors with codebook index k in state j) / (number of vectors in state j) = 14/56 = 1/4
(Slide from John-Paul Hosum, OHSU/OGI)

Summary: VQ
Training: do VQ and then use Baum-Welch to assign probabilities to each symbol.
Decoding: do VQ and then use the symbol probabilities in decoding.

Directly Modeling Continuous Observations
Gaussians: univariate Gaussians and Baum-Welch for univariate Gaussians; multivariate Gaussians and Baum-Welch for multivariate Gaussians; Gaussian Mixture Models (GMMs) and Baum-Welch for GMMs.

Better than VQ
VQ is insufficient for real ASR. Instead, assume the possible values of the observation feature vector o_t are normally distributed, and represent the observation likelihood function b_j(o_t) as a Gaussian with mean µ_j and variance σ_j^2:
f(x | \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

Gaussians are parameterized by mean and variance. (figure)

Reminder: means and variances
For a discrete random variable X, the mean is the expected value of X, a weighted sum over the values of X. The variance is the average squared deviation from the mean.

Gaussian as Probability Density Function (figure)

Gaussian PDFs
A Gaussian is a probability density function; probability is the area under the curve. To make it a probability, we constrain the area under the curve to equal 1. BUT we will be using "point estimates": the value of the Gaussian at a point. Technically these are not probabilities, since a pdf gives a probability over an interval and must be multiplied by dx. As we will see later, this is OK, since the same factor is omitted from all Gaussians, so the argmax is still correct.

Gaussians for Acoustic Modeling
A Gaussian is parameterized by a mean and a variance. (Figure: two Gaussians P(o|q) with different means; P(o|q) is highest at the mean and low for o far from the mean.)

Using a (univariate) Gaussian as an acoustic likelihood estimator
Let's suppose our observation was a single real-valued feature (instead of a 39-dimensional vector). Then if we had learned a Gaussian over the distribution of values of this feature, we could compute the likelihood of any given observation o_t as follows:

Training a Univariate Gaussian
A (single) Gaussian is characterized by a mean and a variance. Imagine that we had some training data in which each state was labeled; we could just compute the mean and variance from the data:
\mu_i = \frac{1}{T} \sum_{t=1}^{T} o_t \quad \text{s.t. } o_t \text{ is in state } i
\sigma_i^2 = \frac{1}{T} \sum_{t=1}^{T} (o_t - \mu_i)^2 \quad \text{s.t. } o_t \text{ is in state } i

Training Univariate Gaussians
But we don't know which observation was produced by which state! What we want: to assign each observation vector o_t to every possible state i, prorated by the probability that the HMM was in state i at time t. The probability of being in state i at time t is ξ_t(i)!
\mu_i = \frac{\sum_{t=1}^{T} \xi_t(i)\, o_t}{\sum_{t=1}^{T} \xi_t(i)}
\sigma_i^2 = \frac{\sum_{t=1}^{T} \xi_t(i)\,(o_t - \mu_i)^2}{\sum_{t=1}^{T} \xi_t(i)}
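As a sketch, these ξ-weighted updates can be written directly in NumPy; xi_i below stands for the per-frame occupancy probabilities ξ_t(i) produced by forward-backward, and the toy numbers are illustrative only.

```python
import numpy as np

def reestimate_univariate_gaussian(o, xi_i):
    """o: (T,) scalar observations; xi_i: (T,) probabilities of being in
    state i at time t.  Returns the re-estimated mean and variance."""
    w = xi_i / xi_i.sum()                   # normalized soft counts
    mu = float(np.sum(w * o))               # xi-weighted mean
    var = float(np.sum(w * (o - mu) ** 2))  # xi-weighted variance
    return mu, var

o = np.array([1.0, 1.2, 3.8, 4.1])
xi_i = np.array([0.9, 0.8, 0.1, 0.05])  # state i mostly explains the early frames
print(reestimate_univariate_gaussian(o, xi_i))
```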

Multivariate Gaussians
Instead of a single mean µ and variance σ:
f(x | \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
we have a vector of means µ and a covariance matrix Σ:
f(x | \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)

Multivariate Gaussians
Defining µ and Σ:
\mu = E[x]
\Sigma = E\left[(x - \mu)(x - \mu)^T\right]
So the (i, j)-th element of Σ is:
\Sigma_{ij} = E\left[(x_i - \mu_i)(x_j - \mu_j)\right]

Gaussian Intuitions: Size of Σ
µ = [0 0] in all three cases; Σ = I, Σ = 0.6I, Σ = 2I.
As Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, the Gaussian becomes more compressed.
(Text and figures from Andrew Ng's lecture notes for CS229.)

(Figure from Chen, Picheny et al. lecture slides.)

Different variances in different dimensions
Σ = [[1, 0], [0, 1]] vs. Σ = [[0.6, 0], [0, 2]] (figures)

Gaussian Intuitions: Off-diagonal
As we increase the off-diagonal entries, there is more correlation between the value of x and the value of y.
(Text and figures from Andrew Ng's lecture notes for CS229; two slides with different figures.)

Gaussian Intuitions: Off-diagonal and diagonal
Decreasing the off-diagonal entries (figures 1-2); increasing the variance of one dimension on the diagonal (figure 3).
(Text and figures from Andrew Ng's lecture notes for CS229.)

In two dimensions (figure from Chen, Picheny et al. lecture slides)

But: assume diagonal covariance
I.e., assume that the features in the feature vector are uncorrelated. This isn't true for FFT features, but it is true for MFCC features, as we will see. Computation and storage are much cheaper if the covariance is diagonal: only the diagonal entries are non-zero, and the diagonal contains the variance of each dimension, σ_ii^2. So this means we consider the variance of each acoustic feature (dimension) separately.

Diagonal covariance
The diagonal contains the variance of each dimension, σ_dd^2, so we consider the variance of each acoustic feature (dimension) separately:
f(x | \mu, \Sigma) = \prod_{d=1}^{D} \frac{1}{\sigma_d \sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x_d - \mu_d}{\sigma_d}\right)^2\right)
f(x | \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} \prod_{d=1}^{D} \sigma_d} \exp\left(-\frac{1}{2} \sum_{d=1}^{D} \frac{(x_d - \mu_d)^2}{\sigma_d^2}\right)

Baum-Welch re-estimation equations for multivariate Gaussians
A natural extension of the univariate case, where now µ_i is the mean vector for state i:
\mu_i = \frac{\sum_{t=1}^{T} \xi_t(i)\, o_t}{\sum_{t=1}^{T} \xi_t(i)}
\Sigma_i = \frac{\sum_{t=1}^{T} \xi_t(i)\,(o_t - \mu_i)(o_t - \mu_i)^T}{\sum_{t=1}^{T} \xi_t(i)}
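The multivariate version accumulates an outer product per frame for the covariance. A short NumPy sketch with assumed shapes (T frames, D dimensions); the example values are not from the slides.

```python
import numpy as np

def reestimate_multivariate_gaussian(O, xi_i):
    """O: (T, D) observation vectors; xi_i: (T,) occupancy probabilities for
    state i.  Returns the mean vector (D,) and full covariance matrix (D, D)."""
    w = xi_i / xi_i.sum()
    mu = w @ O                                    # weighted mean vector
    centered = O - mu                             # (T, D)
    sigma = (w[:, None] * centered).T @ centered  # sum_t w_t (o_t - mu)(o_t - mu)^T
    return mu, sigma

O = np.array([[0.0, 1.0], [0.2, 0.8], [2.0, -1.0]])
xi_i = np.array([0.7, 0.9, 0.1])
mu_i, Sigma_i = reestimate_multivariate_gaussian(O, xi_i)
```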

But we're not there yet
A single Gaussian may do a bad job of modeling the distribution in any dimension (figure from Chen, Picheny et al. slides). Solution: mixtures of Gaussians.

Mixtures of Gaussians
M mixtures of Gaussians:
f(x | \mu_{jk}, \Sigma_{jk}) = \sum_{k=1}^{M} c_{jk}\, N(x; \mu_{jk}, \Sigma_{jk})
b_j(o_t) = \sum_{k=1}^{M} c_{jk}\, N(o_t; \mu_{jk}, \Sigma_{jk})
For diagonal covariance:
b_j(o_t) = \sum_{k=1}^{M} \frac{c_{jk}}{(2\pi)^{D/2} \prod_{d=1}^{D} \sigma_{jkd}} \exp\left(-\frac{1}{2} \sum_{d=1}^{D} \frac{(o_{td} - \mu_{jkd})^2}{\sigma_{jkd}^2}\right)
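The diagonal-covariance mixture likelihood b_j(o_t) in the last equation can be sketched as follows; the parameter arrays (weights c, means mu, variances var) are illustrative placeholders for one state j, not values from the slides.

```python
import numpy as np

def gmm_likelihood_diag(o_t, c, mu, var):
    """b_j(o_t) for one HMM state with M diagonal-covariance mixture components.
    o_t: (D,); c: (M,) mixture weights summing to 1; mu, var: (M, D)."""
    D = o_t.shape[0]
    # log N(o_t; mu_k, var_k) for each component k, computed in log space
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.log(var).sum(axis=1))
    log_comp = log_norm - 0.5 * (((o_t - mu) ** 2) / var).sum(axis=1)
    return float(np.sum(c * np.exp(log_comp)))  # sum_k c_k N(o_t; mu_k, var_k)

o_t = np.array([0.2, -1.0])
c = np.array([0.6, 0.4])
mu = np.array([[0.0, 0.0], [1.0, -1.0]])
var = np.array([[1.0, 1.0], [0.5, 2.0]])
print(gmm_likelihood_diag(o_t, c, mu, var))
```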

GMMs
Summary: each state has a likelihood function parameterized by:
M mixture weights
M mean vectors of dimensionality D
Either M covariance matrices of size D×D, or (more likely) M diagonal covariance matrices of size D×D, which is equivalent to M variance vectors of dimensionality D.

Where we are
Given: a wave file. Goal: output a string of words.
What we know: the acoustic model. How to turn the wave file into a sequence of acoustic feature vectors, one every 10 ms. If we had a complete phonetic labeling of the training set, we know how to train a Gaussian phone detector for each phone. We also know how to represent each word as a sequence of phones.
What we knew from Chapter 4: the language model.
Next: seeing all this back in the context of HMMs. Search: how to combine the language model and the acoustic model to produce a sequence of words.

Decoding: in principle vs. in practice (equations on slide)

Why is ASR decoding hard? (figure)

HMMs for speech (figure)

HMM for digit recognition task (figure)

The Evaluation (forward) problem for speech
The observation sequence O is a series of MFCC vectors. The hidden states W are the phones and words. For a given phone/word string W, our job is to evaluate P(O | W). Intuition: how likely is the input to have been generated by just that word string W?

Evaluation for speech: summing over all different paths!
f ay ay ay ay v v v v
f f ay ay ay ay v v v
f f f f ay ay ay ay v
f f ay ay ay ay ay ay v
f f ay ay ay ay ay ay ay ay v
f f ay v v v v v v v
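This sum over all state paths is exactly what the forward algorithm computes with dynamic programming. Below is a generic sketch for a small left-to-right HMM; the transition and per-frame emission values are toy numbers I made up, not taken from the slides.

```python
import numpy as np

def forward(A, B, pi):
    """P(O | model).  A: (N, N) transition probabilities; pi: (N,) initial
    probabilities; B: (N, T) emission likelihoods b_j(o_t), already evaluated
    for every state j and frame t."""
    N, T = B.shape
    alpha = np.zeros((N, T))
    alpha[:, 0] = pi * B[:, 0]
    for t in range(1, T):
        # alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
        alpha[:, t] = (alpha[:, t - 1] @ A) * B[:, t]
    return float(alpha[:, -1].sum())

# Toy 3-state left-to-right HMM for the phones f, ay, v of "five"
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])
B = np.array([[0.8, 0.7, 0.1, 0.1, 0.1],   # b_f(o_t),  t = 1..5
              [0.1, 0.2, 0.8, 0.7, 0.2],   # b_ay(o_t)
              [0.1, 0.1, 0.1, 0.2, 0.7]])  # b_v(o_t)
print(forward(A, B, pi))                   # P(O | "five"), summed over all paths
```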

The forward lattice for "five" (figure)

The forward trellis for "five" (figure)

Viterbi trellis for "five" (figures, two slides)

Search space with bigrams (figure)

Viterbi trellis (figure)

Viterbi backtrace (figure)

Evaluation
How to evaluate the word string output by a speech recognizer?

Word Error Rate
Word Error Rate = 100 × (Insertions + Substitutions + Deletions) / (Total words in correct transcript)
Alignment example:
REF: portable **** PHONE UPSTAIRS last night so
HYP: portable FORM OF STORES last night so
Eval:         I    S     S
WER = 100 × (1 + 2 + 0) / 6 = 50%
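WER is computed from a minimum edit distance alignment between the reference and hypothesis word strings. A compact dynamic-programming sketch (plain word-level Levenshtein distance with unit costs, not the sclite implementation):

```python
def wer(ref, hyp):
    """Word error rate in percent: 100 * (S + D + I) / len(ref)."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = min edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                            # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                            # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitution / match
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return 100.0 * d[len(r)][len(h)] / len(r)

print(wer("portable phone upstairs last night so",
          "portable form of stores last night so"))   # 50.0
```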

Computing WER with sclite
NIST sctk-1.3 scoring software: http://www.nist.gov/speech/tools/
Sclite aligns a hypothesized text (HYP, from the recognizer) with a correct or reference text (REF, human-transcribed):
id: (2347-b-013)
Scores: (#C #S #D #I) 9 3 1 2
REF: was an engineer SO I i was always with **** **** MEN UM and they
HYP: was an engineer ** AND i was always with THEM THEY ALL THAT and they
Eval: D S I I S S

Sclite output for error analysis
CONFUSION PAIRS  Total (972)  With >= 1 occurances (972)
1: 6 -> (%hesitation) ==> on
2: 6 -> the ==> that
3: 5 -> but ==> that
4: 4 -> a ==> the
5: 4 -> four ==> for
6: 4 -> in ==> and
7: 4 -> there ==> that
8: 3 -> (%hesitation) ==> and
9: 3 -> (%hesitation) ==> the
10: 3 -> (a-) ==> i
11: 3 -> and ==> i
12: 3 -> and ==> in
13: 3 -> are ==> there
14: 3 -> as ==> is
15: 3 -> have ==> that
16: 3 -> is ==> this

Sclite output for error analysis (continued)
17: 3 -> it ==> that
18: 3 -> mouse ==> most
19: 3 -> was ==> is
20: 3 -> was ==> this
21: 3 -> you ==> we
22: 2 -> (%hesitation) ==> it
23: 2 -> (%hesitation) ==> that
24: 2 -> (%hesitation) ==> to
25: 2 -> (%hesitation) ==> yeah
26: 2 -> a ==> all
27: 2 -> a ==> know
28: 2 -> a ==> you
29: 2 -> along ==> well
30: 2 -> and ==> it
31: 2 -> and ==> we
32: 2 -> and ==> you
33: 2 -> are ==> i
34: 2 -> are ==> were

Better metrics than WER?
WER has been useful, but should we be more concerned with meaning ("semantic error rate")? A good idea, but hard to agree on. It has been applied in dialogue systems, where the desired semantic output is clearer.

Training

Reminder: Forward-Backward Algorithm
1) Initialize Φ = (A, B)
2) Compute α, β, ξ
3) Estimate new Φ' = (A, B)
4) Replace Φ with Φ'
5) If not converged, go to 2

The Learning Problem: it's not just Baum-Welch
The network structure of the HMM is always created by hand; no algorithm for double-induction of optimal structure and probabilities has been able to beat simple hand-built structures.
Always a Bakis network: links go forward in time. A subcase of the Bakis net is the beads-on-a-string net.
Baum-Welch is only guaranteed to return a local maximum, rather than the global optimum.

Complete Embedded Training
Setting all the parameters in an ASR system.
Given: a training set (wave files and word transcripts for each sentence) and a hand-built HMM lexicon.
Uses: the Baum-Welch algorithm.

Baum-Welch for Mixture Models
By analogy with ξ earlier, let's define the probability of being in state j at time t with the m-th mixture component accounting for o_t:
\xi_{tm}(j) = \frac{\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, c_{jm}\, b_{jm}(o_t)\, \beta_t(j)}{\alpha_F(T)}
Now:
\mu_{jm} = \frac{\sum_{t=1}^{T} \xi_{tm}(j)\, o_t}{\sum_{t=1}^{T} \sum_{k=1}^{M} \xi_{tk}(j)}
\qquad
c_{jm} = \frac{\sum_{t=1}^{T} \xi_{tm}(j)}{\sum_{t=1}^{T} \sum_{k=1}^{M} \xi_{tk}(j)}
\qquad
\Sigma_{jm} = \frac{\sum_{t=1}^{T} \xi_{tm}(j)\,(o_t - \mu_{jm})(o_t - \mu_{jm})^T}{\sum_{t=1}^{T} \sum_{k=1}^{M} \xi_{tk}(j)}

How to train mixtures?
Choose M (often 16; or M can be tuned based on the amount of training data). Then various splitting or clustering algorithms can be used. One simple splitting method:
1) Compute the global mean µ and global variance.
2) Split into two Gaussians, with means µ ± ε (sometimes ε is 0.2σ).
3) Run Forward-Backward to retrain.
4) Go to 2 until we have 16 mixtures.
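A sketch of the splitting step for a diagonal-covariance GMM stored as weight, mean, and variance arrays, with ε = 0.2σ as suggested above; the forward-backward retraining between splits is only indicated by a comment, and the names are mine.

```python
import numpy as np

def split_mixtures(c, mu, var, eps=0.2):
    """Double the number of mixtures by perturbing each mean by +/- eps*sigma.
    c: (M,) weights; mu, var: (M, D).  Returns arrays of size 2M."""
    sigma = np.sqrt(var)
    mu_plus, mu_minus = mu + eps * sigma, mu - eps * sigma
    c_new = np.concatenate([c / 2, c / 2])    # each child gets half the weight
    mu_new = np.concatenate([mu_plus, mu_minus])
    var_new = np.concatenate([var, var])      # children start with the parent variance
    return c_new, mu_new, var_new

# Start from 1 mixture (the global mean/variance), split until 16 mixtures.
c, mu, var = np.array([1.0]), np.zeros((1, 39)), np.ones((1, 39))
while len(c) < 16:
    c, mu, var = split_mixtures(c, mu, var)
    # ... run forward-backward here to retrain before the next split ...
print(len(c))   # 16
```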

Embedded Training
Components of a speech recognizer:
Feature extraction: not statistical.
Language model: word transition probabilities, trained on some other corpus.
Acoustic model: pronunciation lexicon (the HMM structure for each word, built by hand), observation likelihoods b_j(o_t), and transition probabilities a_ij.

Embedded training of the acoustic model
If we had hand-segmented and hand-labeled training data, with word and phone boundaries, we could just compute:
B: the means and variances of all our triphone Gaussians
A: the transition probabilities
and we'd be done! But we don't have word and phone boundaries, nor phone labeling.

Embedded training
Instead: we'll train each phone HMM embedded in an entire sentence, and do word/phone segmentation and alignment automatically as part of the training process.

Embedded Training (figure)

Initialization: flat start
Transition probabilities: set to zero any that you want to be structurally zero. (The γ probability computation includes the previous value of a_ij, so if it starts at zero it will never change.) Set the rest to identical values.
Likelihoods: initialize the µ and σ of each state to the global mean and variance of all the training data.

Embedded Training
Now we have estimates for A and B, so we just run the EM algorithm. During each iteration, we compute the forward and backward probabilities and use them to re-estimate A and B. Run EM until convergence.

Viterbi training
Baum-Welch training says: we need to know what state we were in to accumulate counts of a given output symbol o_t, so we compute ξ_t(i), the probability of being in state i at time t, by using forward-backward to sum over all possible paths that might have been in state i and output o_t.
Viterbi training says: instead of summing over all possible paths, just take the single most likely path. Use the Viterbi algorithm to compute this Viterbi path, via forced alignment.

Forced Alignment
Computing the Viterbi path over the training data is called forced alignment, because we know which word string to assign to each observation sequence; we just don't know the state sequence. So we use the a_ij to constrain the path to go through the correct words, and otherwise do normal Viterbi. Result: a state sequence!

Viterbi training equations (Viterbi analogues of Baum-Welch)
\hat{a}_{ij} = \frac{n_{ij}}{n_i} \quad \text{for all pairs of emitting states, } 1 \le i, j \le N
\hat{b}_j(v_k) = \frac{n_j(\text{s.t. } o_t = v_k)}{n_j}
where n_ij is the number of frames with a transition from i to j in the best path, and n_j is the number of frames where state j is occupied.

Viterbi Training
Much faster than Baum-Welch, but doesn't work quite as well. The tradeoff is often worth it.

Viterbi training (II)
Equations for non-mixture Gaussians:
\mu_i = \frac{1}{N_i} \sum_{t=1}^{T} o_t \quad \text{s.t. } q_t = i
\sigma_i^2 = \frac{1}{N_i} \sum_{t=1}^{T} (o_t - \mu_i)^2 \quad \text{s.t. } q_t = i
Viterbi training for mixture Gaussians is more complex; generally we just assign each observation to one mixture.
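Given the Viterbi state sequence from forced alignment, these estimates reduce to counting and averaging. A minimal sketch with scalar observations and made-up inputs; the function and variable names are mine.

```python
import numpy as np

def viterbi_reestimate(states, obs, N):
    """states: (T,) best-path state indices; obs: (T,) scalar observations;
    N: number of states.  Returns transition probs and per-state means/variances."""
    states, obs = np.asarray(states), np.asarray(obs, dtype=float)
    # Transition counts over the best path: a_ij = n_ij / n_i
    A = np.zeros((N, N))
    for i, j in zip(states[:-1], states[1:]):
        A[i, j] += 1
    A = A / np.maximum(A.sum(axis=1, keepdims=True), 1)
    # Per-state Gaussian statistics: mu_i, sigma_i^2 over frames with q_t = i
    mu = np.array([obs[states == i].mean() if (states == i).any() else 0.0
                   for i in range(N)])
    var = np.array([obs[states == i].var() if (states == i).any() else 1.0
                    for i in range(N)])
    return A, mu, var

A, mu, var = viterbi_reestimate([0, 0, 1, 1, 1, 2], [1.0, 1.2, 3.0, 3.4, 2.8, 7.0], N=3)
```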

Log domain
In practice, do all computation in the log domain. This avoids underflow: instead of multiplying lots of very small probabilities, we add numbers that are not so small.
For a single multivariate Gaussian (diagonal Σ) we compute, in log space:
\log b_j(o_t) = -\frac{1}{2} \sum_{d=1}^{D} \left[ \log(2\pi) + \log \sigma_{jd}^2 + \frac{(o_{td} - \mu_{jd})^2}{\sigma_{jd}^2} \right]

Log domain
Repeating, with some rearrangement of terms:
\log b_j(o_t) = C_j - \frac{1}{2} \sum_{d=1}^{D} \frac{(o_{td} - \mu_{jd})^2}{\sigma_{jd}^2}
where C_j = -\frac{1}{2} \sum_{d=1}^{D} \left[ \log(2\pi) + \log \sigma_{jd}^2 \right] can be precomputed.
Note that this looks like a weighted Mahalanobis distance! This also helps justify why these aren't really probabilities (they are point estimates); they are really just distances.
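A sketch of the log-domain computation for a single diagonal-covariance Gaussian: the constant C_j is precomputed once per state, and each frame only requires the weighted squared distance. Shapes and names are assumptions, not from the slides.

```python
import numpy as np

def precompute_log_const(var):
    """C_j = -0.5 * (D*log(2*pi) + sum_d log(sigma_d^2)); computed once per state."""
    D = var.shape[0]
    return -0.5 * (D * np.log(2 * np.pi) + np.log(var).sum())

def log_gaussian_diag(o_t, mu, var, log_const):
    """log b_j(o_t) = C_j - 0.5 * sum_d (o_td - mu_d)^2 / sigma_d^2,
    i.e. a precomputed constant minus a weighted (Mahalanobis-like) distance."""
    return log_const - 0.5 * np.sum((o_t - mu) ** 2 / var)

mu = np.zeros(39)
var = np.full(39, 0.5)
C = precompute_log_const(var)
print(log_gaussian_diag(np.full(39, 0.1), mu, var, C))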

Summary: Acoustic Modeling for LVCSR
Increasingly sophisticated models for each state: Gaussians, multivariate Gaussians, mixtures of multivariate Gaussians.
Where a state is progressively: a CI phone, a CI subphone (3-ish per phone), a CD phone (= triphones), state-tying of CD phones.
Forward-Backward training; Viterbi training.

Summary: ASR Architecture
Five easy pieces: the ASR noisy channel architecture.
1) Feature extraction: 39 MFCC features
2) Acoustic model: Gaussians for computing p(o|q)
3) Lexicon/pronunciation model: HMM specifying what phones can follow each other
4) Language model: N-grams for computing P(w_i | w_{i-1})
5) Decoder: the Viterbi algorithm, dynamic programming for combining all of these to get the word sequence from the speech!

Summary
ASR architecture; the noisy channel model; five easy pieces of an ASR system: 1) Language Model, 2) Lexicon/Pronunciation Model (HMM), 3) Feature Extraction, 4) Acoustic Model, 5) Decoder; Training; Evaluation.