Lecture 3: ASR: HMMs, Forward, Viterbi

Original slides by Dan Jurafsky. CS 224S / LINGUIST 285 Spoken Language Processing, Andrew Maas, Stanford University, Spring 2017. Lecture 3: ASR: HMMs, Forward, Viterbi

A fun, informative read on phonetics: The Art of Language Invention, David J. Peterson, 2015. http://www.artoflanguageinvention.com/books/

Outline for Today: ASR Architecture; Decoding with HMMs: Forward, Viterbi Decoding; How this fits into the ASR component of the course. On your own: N-grams and Language Modeling. Apr 12: Training, Advanced Decoding. Apr 17: Feature Extraction, GMM Acoustic Modeling. Apr 24: Neural Network Acoustic Models. May 1: End-to-end neural network speech recognition.

The Noisy Channel Model. Search through the space of all possible sentences. Pick the one that is most probable given the waveform. [Figure: noisy channel view of ASR — a source sentence passes through a noisy channel, and the decoder guesses the source by searching over candidate sentences.]

The Noisy Channel Model (II). What is the most likely sentence out of all sentences in the language L given some acoustic input O? Treat the acoustic input O as a sequence of individual observations O = o_1, o_2, o_3, ..., o_t. Define a sentence as a sequence of words: W = w_1, w_2, w_3, ..., w_n.

Noisy Channel Model (III). Probabilistic implication: pick the highest-probability sentence: Ŵ = argmax_{W ∈ L} P(W|O). We can use Bayes' rule to rewrite this: Ŵ = argmax_{W ∈ L} P(O|W) P(W) / P(O). Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax: Ŵ = argmax_{W ∈ L} P(O|W) P(W).
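To make the argmax concrete, here is a minimal sketch (not the decoder described in these slides) that scores a few candidate sentences by combining an acoustic log-likelihood log P(O|W) with a language-model log-prior log P(W) and picks the best; the candidate strings and all score values are hypothetical, for illustration only.

```python
# Hypothetical (made-up) scores: in a real recognizer, log_acoustic would come
# from the acoustic model P(O|W) and log_lm from an n-gram language model P(W).
candidates = {
    "if music be the food of love":  (-120.4, -18.2),
    "if music bee the foot of love": (-119.8, -31.5),
    "every happy family":            (-150.0, -16.0),
}

def score(log_acoustic, log_lm):
    # log P(O|W) + log P(W); P(O) is ignored since it is constant over W.
    return log_acoustic + log_lm

best = max(candidates, key=lambda w: score(*candidates[w]))
print(best)  # -> "if music be the food of love"
```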

Speech Recognition Architecture. [Figure: block diagram — cepstral feature extraction produces MFCC features; a Gaussian acoustic model produces phone likelihoods; an HMM lexicon and an N-gram language model are combined with these in a Viterbi decoder to produce the output word string.]

Noisy channel model: Ŵ = argmax_{W ∈ L} P(O|W) P(W), where P(O|W) is the likelihood (acoustic model) and P(W) is the prior (language model).

The noisy channel model. Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source). [Figure: the same noisy channel diagram as before.]

Speech Architecture meets Noisy Channel

Decoding Architecture: five easy pieces. Feature Extraction: 39 MFCC features. Acoustic Model: Gaussians for computing p(o|q). Lexicon/Pronunciation Model HMM: what phones can follow each other. Language Model: N-grams for computing p(w_i | w_{i-1}). Decoder: the Viterbi algorithm, dynamic programming for combining all of these to get the word sequence from speech.

Lexicon. A list of words, each one with a pronunciation in terms of phones. We get these from an on-line pronunciation dictionary. CMU dictionary: 127K words, http://www.speech.cs.cmu.edu/cgi-bin/cmudict. We'll represent the lexicon as an HMM.

HMMs for speech

Phones are not homogeneous! [Figure: waveform/spectrogram of the phones 'ay k', roughly 0.48 s to 0.94 s, showing that the acoustics change across the duration of a single phone.]

Each phone has 3 subphones. [Figure: HMM for one phone with beginning, middle, and end subphone states, each with a self-loop, plus left-to-right transitions from a start state to an end state.]

Resulting HMM word model for "six". [Figure: the subphone HMMs for the phones of "six" concatenated in sequence between a start and an end state.]

HMM for the digit recognition task

Markov chain for weather. [Figure: a Markov chain over weather states (e.g. HOT, COLD, WARM) with a start state, an end state, and transition probabilities a_ij on the arcs.]

Markov chain for words. [Figure: the same Markov chain topology, with words on the states instead of weather conditions.]

Markov chain = first-order observable Markov model. A set of states Q = q_1, q_2, ..., q_N; the state at time t is q_t. Transition probabilities: a set of probabilities A = a_01, a_02, ..., a_n1, ..., a_nn. Each a_ij represents the probability of transitioning from state i to state j; the set of these is the transition probability matrix A, with a_ij = P(q_t = j | q_{t-1} = i) for 1 ≤ i, j ≤ N and Σ_{j=1}^N a_ij = 1 for 1 ≤ i ≤ N. Distinguished start and end states.
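To make the definition concrete, here is a minimal sketch of a first-order Markov chain as a transition matrix; the state names and probability values are illustrative assumptions, not the numbers from the weather figure (which did not survive transcription).

```python
# States of a toy weather Markov chain (illustrative values only).
states = ["HOT", "COLD", "WARM"]

# A[i][j] = P(q_t = j | q_{t-1} = i); each row must sum to 1.
A = [
    [0.6, 0.1, 0.3],  # from HOT
    [0.1, 0.8, 0.1],  # from COLD
    [0.3, 0.1, 0.6],  # from WARM
]

for i, row in enumerate(A):
    assert abs(sum(row) - 1.0) < 1e-9, f"row {i} does not sum to 1"

# First-order property in action: the probability of a state sequence is just
# a product of one-step transitions, e.g. HOT -> HOT -> WARM given we start in HOT.
idx = {s: k for k, s in enumerate(states)}
p = A[idx["HOT"]][idx["HOT"]] * A[idx["HOT"]][idx["WARM"]]
print(p)  # 0.6 * 0.3 = 0.18
```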

Markov chain = first-order observable Markov model. The current state only depends on the previous state (Markov assumption): P(q_i | q_1 ... q_{i-1}) = P(q_i | q_{i-1}).

Another representation for the start state. Instead of a start state, use a special initial probability vector π, an initial distribution over start states: π_i = P(q_1 = i), 1 ≤ i ≤ N, with the constraint Σ_{j=1}^N π_j = 1.

The weather figure using pi

The weather figure: specific example

Markov chain for weather. What is the probability of 4 consecutive warm days? The sequence is warm-warm-warm-warm, i.e., the state sequence is 3-3-3-3. P(3,3,3,3) = π_3 · a_33 · a_33 · a_33 = 0.2 × (0.6)^3 = 0.0432.
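A quick check of that arithmetic, using the values that appear on the slide (π_3 = 0.2 and a_33 = 0.6):

```python
pi_3 = 0.2   # initial probability of starting in state 3 (warm)
a_33 = 0.6   # warm -> warm self-transition probability

# Four warm days = one initial probability and three warm->warm transitions.
p = pi_3 * a_33 ** 3
print(p)  # 0.0432 (up to floating-point rounding)
```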

How about hot-hot-hot-hot versus cold-hot-cold-hot? What does the difference in these probabilities tell you about the real-world weather information encoded in the figure?

HMM for Ice Cream. You are a climatologist in the year 2799, studying global warming. You can't find any records of the weather in Baltimore, MD for the summer of 2008, but you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer. Our job: figure out how hot it was.

Hidden Markov Model. For Markov chains, output symbols = state symbols: see hot weather, we're in state hot. But not in speech recognition. Output symbols: vectors of acoustics (cepstral features). Hidden states: phones. So we need an extension! A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states. This means we don't know which state we are in.

Hidden Markov Models

Assumptions. Markov assumption: P(q_i | q_1 ... q_{i-1}) = P(q_i | q_{i-1}). Output-independence assumption: P(o_t | o_1 ... o_{t-1}, q_1 ... q_t) = P(o_t | q_t).

Eisner task. Given the observed ice cream sequence 1,2,3,2,2,2,3, produce the hidden weather sequence H,C,H,H,H,C.

HMM for ice cream. [Figure: an HMM with hidden states HOT and COLD, a start state, transition probabilities between the states, and emission probabilities B giving P(1|state), P(2|state), and P(3|state) for the number of ice creams eaten.]

Different types of HMM structure Bakis = left-to-right Ergodic = fully-connected

The Three Basic Problems for HMMs (Jack Ferguson at IDA in the 1960s). Problem 1 (Evaluation): Given the observation sequence O = (o_1 o_2 ... o_T) and an HMM model λ = (A,B), how do we efficiently compute P(O|λ), the probability of the observation sequence given the model? Problem 2 (Decoding): Given the observation sequence O = (o_1 o_2 ... o_T) and an HMM model λ = (A,B), how do we choose a corresponding state sequence Q = (q_1 q_2 ... q_T) that is optimal in some sense (i.e., best explains the observations)? Problem 3 (Learning): How do we adjust the model parameters λ = (A,B) to maximize P(O|λ)?

Problem 1: computing the observation likelihood. Given the following HMM [Figure: the ice cream HMM with HOT and COLD states, transition probabilities, and emission probabilities], how likely is the sequence 3 1 3?

How to compute likelihood. For a Markov chain, we just follow the states 3 1 3 and multiply the probabilities. But for an HMM, we don't know what the states are! So let's start with a simpler situation: computing the observation likelihood for a given hidden state sequence. Suppose we knew the weather and wanted to predict how much ice cream Jason would eat, i.e., P(3 1 3 | H H C).

Computing likelihood of 3 1 3 given hidden state sequence

Computing joint probability of observation and state sequence
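The worked numbers on the two slides above did not survive transcription; here is a hedged sketch of the computations they describe. The parameter values below are assumptions chosen to be consistent with the one surviving trellis line (.32 × .14 + .02 × .08 = .0464), not verbatim from the slides.

```python
# Assumed ice-cream HMM parameters (illustrative).
start = {"H": 0.8, "C": 0.2}                      # P(first state | start)
trans = {"H": {"H": 0.7, "C": 0.3},
         "C": {"H": 0.4, "C": 0.6}}               # P(next state | current state)
emit  = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
         "C": {1: 0.5, 2: 0.4, 3: 0.1}}           # P(ice creams | state)

obs    = [3, 1, 3]
states = ["H", "H", "C"]

# Likelihood given this particular state sequence:
# P(3 1 3 | H H C) = P(3|H) * P(1|H) * P(3|C)
p_obs_given_states = 1.0
for o, s in zip(obs, states):
    p_obs_given_states *= emit[s][o]

# Probability of the state sequence itself:
# P(H H C) = P(H|start) * P(H|H) * P(C|H)
p_states = start[states[0]]
for prev, cur in zip(states, states[1:]):
    p_states *= trans[prev][cur]

print(p_obs_given_states)             # 0.008
print(p_obs_given_states * p_states)  # joint P(3 1 3, H H C) ≈ 0.001344
```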

Computing total likelihood of 3 1 3. We would need to sum over all hidden state sequences: hot hot cold, hot hot hot, hot cold hot, ... How many possible hidden state sequences are there for this sequence? How about in general, for an HMM with N hidden states and a sequence of T observations? N^T. So we can't just do a separate computation for each hidden state sequence.
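For intuition only, here is a brute-force version of that sum over all N^T hidden state sequences (feasible only for tiny examples), reusing the same assumed ice-cream parameters as above:

```python
from itertools import product

# Same assumed ice-cream HMM parameters as in the previous sketch.
start = {"H": 0.8, "C": 0.2}
trans = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit  = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def joint(obs, states):
    """P(observations, state sequence) under the HMM."""
    p = start[states[0]] * emit[states[0]][obs[0]]
    for prev, cur, o in zip(states, states[1:], obs[1:]):
        p *= trans[prev][cur] * emit[cur][o]
    return p

obs = [3, 1, 3]
# Sum over all N**T hidden state sequences -- exponential in T,
# which is exactly why we need the forward algorithm instead.
total = sum(joint(obs, seq) for seq in product("HC", repeat=len(obs)))
print(total)  # ~0.0263 with these assumed parameters
```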

Instead: the Forward algorithm. A dynamic programming algorithm, just like minimum edit distance or CKY parsing; it uses a table to store intermediate values. Idea: compute the likelihood of the observation sequence by summing over all possible hidden state sequences, but do this efficiently by folding all the sequences into a single trellis.

The forward algorithm. The goal of the forward algorithm is to compute P(o_1, o_2, ..., o_T, q_T = q_F | λ). We'll do this by recursion.

The forward algorithm. Each cell of the forward algorithm trellis, α_t(j), represents the probability of being in state j after seeing the first t observations, given the automaton. Each cell thus expresses the following probability: α_t(j) = P(o_1, o_2, ..., o_t, q_t = j | λ).

The Forward Recursion: α_t(j) = Σ_{i=1}^N α_{t-1}(i) · a_ij · b_j(o_t).

" The Forward Trellis < = '() '() '() '().32*.14+.02*.08=.0464!! "#$9102! # "#$9.102/1:37.;.1:2/1:8.9.1::5:8 < 2 % % *+%,%-./.*+3,%- 14./.12 % % < 3 < : &!"#$" *+%,&-./.*+3,%- 17./.12 *+%,!"#$"-/*+0,%- 18./.17 *+&,!"#$"-./.*+0,&- 12./.13!! "!$.9.1:2 & *+&,%-./.*+3,&- 10./.16 *+&,&-./.*+3,&- 15./.16! # "!$.9.102/136.;.1:2/10:.9.1:67!"#$"!"#$"!"#$" & & 0 3 0 > 3 > 2 > 0

We update each cell. [Figure: one trellis column at time t, showing α_t(j) computed from the previous column's α_{t-1}(i) values via the transition probabilities a_ij and the emission probability b_j(o_t).]

The Forward Algorithm
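The pseudocode on this slide did not survive transcription; below is a minimal sketch of the forward algorithm it describes, again using the assumed ice-cream parameters (illustrative, not verbatim from the slides).

```python
def forward(obs, states, start, trans, emit):
    """Total likelihood P(obs | HMM) via the forward algorithm.

    alpha[t][j] = P(o_1 ... o_t, q_t = j | lambda)
    """
    # Initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [{s: start[s] * emit[s][obs[0]] for s in states}]
    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * trans[i][j] for i in states) * emit[j][o]
                      for j in states})
    # Termination: sum over the final column.
    return sum(alpha[-1].values())

start = {"H": 0.8, "C": 0.2}
trans = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit  = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

print(forward([3, 1, 3], ["H", "C"], start, trans, emit))
# Same value as the brute-force sum above, but in O(N^2 * T) time.
```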

Decoding. Given an observation sequence (3 1 3) and an HMM, the task of the decoder is to find the best hidden state sequence. Given the observation sequence O = (o_1 o_2 ... o_T) and an HMM model λ = (A,B), how do we choose a corresponding state sequence Q = (q_1 q_2 ... q_T) that is optimal in some sense (i.e., best explains the observations)?

Decoding. One possibility: for each hidden state sequence Q (HHH, HHC, HCH, ...), compute P(O|Q) and pick the highest one. Why not? There are N^T such sequences. Instead: the Viterbi algorithm, again a dynamic programming algorithm, which uses a trellis similar to the Forward algorithm's.

Viterbi intuition. We want to compute the joint probability of the observation sequence together with the best state sequence: max_{q_0, q_1, ..., q_T} P(q_0, q_1, ..., q_T, o_1, o_2, ..., o_T, q_T = q_F | λ).

Viterbi Recursion: v_t(j) = max_{i=1..N} v_{t-1}(i) · a_ij · b_j(o_t).

" >? The Viterbi trellis '() '() '() '()! " #$%9102 /! $ #$%9.;#<+102/1:37=.1:2/1:8-.9.1:778 > 2 % % *+%,%-./.*+3,%- 14./.12 % % > 3 > : &!"#$" *+%,&-./.*+3,%- 17./.12 *+%,!"#$"-/*+0,%- 18./.17 *+&,!"#$"-./.*+0,&- 12./.13! " #"%.9.1:2 & *+&,%-./.*+3,&- 10./.16 *+&,&-./.*+3,&- 15./.16! $ #"%.9.;#<+102/136=.1:2/10:-.9.1:78!"#$"!"#$"!"#$" & & 0 3 0 @ 3 @ 2 @ 0

Viterbi intuition. Process the observation sequence left to right, filling out the trellis. Each cell: v_t(j) = max_i v_{t-1}(i) · a_ij · b_j(o_t), the probability of the most likely path that ends in state j after the first t observations.

Viterbi Algorithm
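As with the forward algorithm, the pseudocode for this slide is missing; here is a minimal Viterbi sketch with backpointers, using the same assumed ice-cream parameters as before.

```python
def viterbi(obs, states, start, trans, emit):
    """Most likely hidden state sequence for obs, found by dynamic programming."""
    # Initialization: v_1(j) = pi_j * b_j(o_1)
    v = [{s: start[s] * emit[s][obs[0]] for s in states}]
    backptr = [{}]
    # Recursion: v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t), remembering the best i.
    for o in obs[1:]:
        prev, col, bp = v[-1], {}, {}
        for j in states:
            best_i = max(states, key=lambda i: prev[i] * trans[i][j])
            col[j] = prev[best_i] * trans[best_i][j] * emit[j][o]
            bp[j] = best_i
        v.append(col)
        backptr.append(bp)
    # Termination: best final state, then follow backpointers to recover the path.
    last = max(states, key=lambda j: v[-1][j])
    path = [last]
    for bp in reversed(backptr[1:]):
        path.append(bp[path[-1]])
    return list(reversed(path)), v[-1][last]

start = {"H": 0.8, "C": 0.2}
trans = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit  = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

print(viterbi([3, 1, 3], ["H", "C"], start, trans, emit))
# e.g. (['H', 'H', 'H'], 0.012544) with these assumed parameters
```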

" >? Viterbi backtrace '() '() '() '()! " #$%9102 /! $ #$%9.;#<+102/1:37=.1:2/1:8-.9.1:778 > 2 % % *+&,%-./.*+3,&- 10./.16 *+%,%-./.*+3,%- 14./.12 % % > 3 > : &!"#$" *+%,&-./.*+3,%- 17./.12 *+%,!"#$"-/*+0,%- 18./.17 *+&,!"#$"-./.*+0,&- 12./.13! " #"%.9.1:2 & *+&,&-./.*+3,&- 15./.16! $ #"%.9.;#<+102/136=.1:2/10:-.9.1:78!"#$"!"#$"!"#$" & & 0 3 0 @ 3 @ 2 @ 0

HMMs for Speech. We haven't yet shown how to learn the A and B matrices for HMMs; we'll do that on Thursday with the Baum-Welch (forward-backward) algorithm. But let's return to thinking about speech.

Reminder: a word looks like this. [Figure: the HMM word model for "six" again — concatenated subphone HMMs between start and end states.]

HMM for digit recognition task

The Evaluation (forward) problem for speech. The observation sequence O is a series of MFCC vectors. The hidden states W are the phones and words. For a given phone/word string W, our job is to evaluate P(O|W). Intuition: how likely is the input to have been generated by just that word string W?

Evaluation for speech: summing over all different paths!
f ay ay ay ay v v v v
f f ay ay ay ay v v v
f f f f ay ay ay ay v
f f ay ay ay ay ay ay v
f f ay ay ay ay ay ay ay ay v
f f ay v v v v v v v

The forward lattice for five

The forward trellis for five

Viterbi trellis for five

Viterbi trellis for five

Search space with bigrams. [Figure: the decoding search space in which word HMMs are linked by bigram language-model transitions P(w_i | w_{i-1}) from word ends to word starts.]

Viterbi trellis

Viterbi backtrace

Summary: ASR Architecture. Five easy pieces: ASR noisy channel architecture. Feature Extraction: 39 MFCC features. Acoustic Model: Gaussians for computing p(o|q). Lexicon/Pronunciation Model HMM: what phones can follow each other. Language Model: N-grams for computing p(w_i | w_{i-1}). Decoder: the Viterbi algorithm, dynamic programming for combining all of these to get the word sequence from speech.