Temporal Modeling and Basic Speech Recognition

UNIVERSITY OF ILLINOIS @ URBANA-CHAMPAIGN CS 498PS Audio Computing Lab Temporal Modeling and Basic Speech Recognition Paris Smaragdis paris@illinois.edu paris.cs.illinois.edu

Today's lecture Recognizing time series Basic speech recognition Dynamic Time Warping Hidden Markov Models 2

A new problem? Are these the same utterances? 3

How do we recognize the spoken digit? How do we take time order into account? We could use a Gaussian model as before, but it can't tell the difference between ten, net, or ent How do we deal with timing differences? three and thhhrreeee should be seen as being the same 4

A motivating problem to solve How to recognize words E.g., yes/no, or spoken digits Build reliable features Must be invariant to irrelevant differences Build a classifier that can deal with time Invariant to temporal differences in inputs 5

Example data set Spoken digits Which is which? Can we automatically identify them? 6

Finding a good representation Small differences are not important E.g., the MSE of these two isn't useful Even though they are the same digit 7

Going to the frequency domain The magnitude Fourier transform is better The irrelevant waveform differences are hidden But so is the temporal structure [Plots: energy vs. frequency spectra of the two recordings] 8

Time/Frequency features A more robust representation Recordings look more similar [Plots: waveforms and spectrograms (frequency vs. time) of the two recordings] 9

A new problem What about relative time differences? There is no 1-to-1 correspondence in the spectral frames [Plots: waveforms and spectrograms of the two recordings with different timing] 10

Time warping To compare them we need to time-warp and align How do we find that warp? How do we get overall similarity? [Plots: the two spectrograms to be aligned] 11

Matching warped sequences Represent the warping with a path over node pairs of r(i), i = 1, 2, ..., 6 and t(j), j = 1, 2, ..., 5 [Grid of (i, j) nodes, i = 1..6, j = 1..5, with the warping path] 12

An example with text Comparing Mama with Mammaa [Alignment grid: the letters of Mama (rows) against Mammaa (columns), with cells marked as good match or bad match] 13

Another example Comparing Mama with Männaa There is still a path, although not as good as before [Alignment grid: Mama (rows) against Männaa (columns), with cells marked as good, bad, or meh match] 14

Finding the overall distance Visiting a node will have a cost, e.g., d(i, j) = |r(i) − t(j)| Overall cost of a path is D = Σ_k d(i_k, j_k) The optimal path results in the smallest final distance and can be considered as the distance between the two input sequences [Grid of (i, j) nodes with the path] 15

Finding the optimal path Huge space of possible paths How do we find the one that gives us the minimal distance? We need constraints to limit the search options We need some theory! Maybe we can skip a full search [Grid of (i, j) nodes] 16

Constraining the search space Global constraints: nodes we should not visit (marked nodes) Local constraints: allowable transitions (black lines) and the optimal path (blue line) [Grid of (i, j) nodes over steps k with the constraints and the optimal path marked] 17

Some theory help Bellman's optimality principle For an optimal path passing through (i, j): (i_0, j_0) →opt (i_T, j_T) = [(i_0, j_0) →opt (i, j)] followed by [(i, j) →opt (i_T, j_T)] i.e., both sub-paths of the optimal path are themselves optimal [Grid of (i, j) nodes with the point (i, j) on the optimal path marked] 18

Huh? 19

Finding an optimal path Optimal path to (i_k, j_k): D_min(i_k, j_k) = min over (i_{k-1}, j_{k-1}) of [ D_min(i_{k-1}, j_{k-1}) + d(i_k, j_k | i_{k-1}, j_{k-1}) ] Smaller search: the min() only searches over local transitions Vastly faster than otherwise [Grid of (i, j) nodes] 20
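
As a concrete reference, here is a minimal Python sketch of this recursion (not the lecture's code): it assumes a precomputed frame-distance matrix d and the simple local moves (i-1, j), (i, j-1), (i-1, j-1).

import numpy as np

def dtw_cost(d):
    # d[i, j] is the cost of matching frame i of one sequence to frame j of the other
    n, m = d.shape
    D = np.full((n, m), np.inf)
    D[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            candidates = []                       # only local transitions are searched
            if i > 0:
                candidates.append(D[i - 1, j])
            if j > 0:
                candidates.append(D[i, j - 1])
            if i > 0 and j > 0:
                candidates.append(D[i - 1, j - 1])
            D[i, j] = d[i, j] + min(candidates)
    return D                                      # D[-1, -1] is the overall match cost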

Making this work for speech Define a distance function Define local constraints Define global constraints 21

Distance function We can simply use the MSE on spectral frames: d(i, j) = ||f_1(i) − f_2(j)|| [Plots: energy vs. frequency spectra of the two frames being compared] 22
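
A possible NumPy sketch of this distance, under the assumption that the two recordings are given as magnitude spectrograms F1 and F2 with one frame per column; it pairs with the dtw_cost sketch above.

import numpy as np

def frame_distances(F1, F2):
    # Pairwise Euclidean distances between frames: d[i, j] = ||f1(i) - f2(j)||
    diff = F1.T[:, None, :] - F2.T[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Usage sketch: d = frame_distances(F1, F2); D = dtw_cost(d)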

Global constraints Define time ratios that make sense e.g., one sequence cannot be 2x faster than the other [Grid of (i, j) nodes, i, j = 1..7, with the allowed region marked] 23

Local constraints Monotonicity: i_{k-1} ≤ i_k, j_{k-1} ≤ j_k Repeat, but do not go back in time Enforcing time order Won't match cat to act [Grid showing non-allowable path examples] 24

Variations of local constraints Acceptable local subpaths Consider only these local movements Depends on application and scope 25

Example toy run [Figure: the local constraints, two short input sequences x1 and x2, their distance matrix, and the accumulated cost matrix] 26

Matching identical utterances [Figure: spectrograms of the two utterances, their distance matrix, and the cost matrix; accumulated DTW distance d = 0] 27

Matching similar utterances [Figure: spectrograms of the two utterances, their distance matrix, and the cost matrix; accumulated DTW distance d = 1.5736] 28

Matching different utterances [Figure: spectrograms of the two utterances, their distance matrix, and the cost matrix; accumulated DTW distance d = 3.0415] 29

A basic speech recognizer Collect templates T_i(t) for each word to detect e.g. yes/no, or digits from 0 to 9 Compute the DTW distance between the input x(t) and each T_i(t) The template with the smallest distance is the most similar word to the input [Diagram: x(t) compared by DTW against three templates T_i(t), giving d=4.1, d=0.3, d=5.5; x(t) is most similar to template 2] 30
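
Putting the pieces together, a hypothetical template-matching recognizer built on the sketches above; the length normalization and the dictionary interface are assumptions, not the lecture's exact recipe.

def dtw_distance(F_input, F_template):
    d = frame_distances(F_input, F_template)      # from the earlier sketch
    D = dtw_cost(d)                               # accumulated cost matrix
    return D[-1, -1] / (d.shape[0] + d.shape[1])  # one possible length normalization

def recognize(F_input, templates):
    # templates: dict mapping a word label to its template spectrogram T_i(t)
    scores = {word: dtw_distance(F_input, F_t) for word, F_t in templates.items()}
    return min(scores, key=scores.get)            # word with the smallest DTW distance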

On our small digit dataset DTW distances from digit data [Plots: distance of each input class (1-5) when compared with templates 1 through 5] 31

Another application of DTW Simplifying ADR (automated dialogue replacement) Costs time and money Automatically align audio tracks Good take, lousy audio Good audio, lousy sync Good take, good audio! 32

Taking this to the next level DTW is very convenient for small problems E.g., a small contacts list on your phone, a vocabulary-limited command set, digit recognition, ... It will not scale to full-blown speech recognition E.g. with thousands of templates For that we need more powerful tools 33

Starting from the GMM Gaussian Mixture Models explain data by using multiple Gaussians as opposed to just one 34

What if there is time order? We want to learn the dynamics of the sequence e.g., each cluster can be a phoneme 35

Representing time Hidden Markov Model (Gaussian variant for now) Each state is represented by a Gaussian distribution Transition probabilities between all states Only the previous state matters [Diagram: states A, B, C with all transition probabilities P(A|A), P(B|A), P(C|A), P(A|B), P(B|B), P(C|B), P(A|C), P(B|C), P(C|C)] Initial Probabilities: P(A), P(B), P(C) 36
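
Written out as arrays, the three-state picture on this slide might look like the following; the numbers are purely illustrative, and this convention (A[i, j] = P(state j next | state i now)) is reused by the sketches further below.

import numpy as np

pi = np.array([0.6, 0.3, 0.1])                # initial probabilities P(A), P(B), P(C)
A = np.array([[0.8, 0.1, 0.1],                # row i holds the transitions out of state i
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
means = np.array([[0.0], [2.0], [5.0]])       # one Gaussian mean per state (1-D here)
covs = np.array([[[1.0]], [[1.0]], [[0.5]]])  # one covariance per state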

What can we do with an HMM? Learning Given lots of inputs learn the model parameters e.g., learn an HMM for each word you want to recognize Evaluation For an input evaluate its likelihood given the model e.g., given word models find which explains the input best Decoding For an input and a model find the optimal state sequence e.g., given the model find what state you are in at any time 37

Evaluation Propagate likelihoods with the forward algorithm: α_i(1) = P_1(i) p(x_1 | i) (probability of the initial state times the likelihood of the data given the state) α_i(t+1) = p(x_{t+1} | i) Σ_j α_j(t) P(i | j) [Trellis diagram: states 1, 2, 3 unrolled over time t = 1, 2, 3] 38

Evaluation α_i(t) denotes the probability of being in state i at time t, given the past The terminal value of α provides the overall likelihood of the input sequence under the model: P(x) = Σ_i α_i(T) 39
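
A minimal sketch of the forward recursion, assuming the emission likelihoods are precomputed into a matrix B with B[t, i] = p(x_t | state i) (e.g., from each state's Gaussian), and pi, A as in the earlier sketch.

import numpy as np

def forward(pi, A, B):
    T, S = B.shape
    alpha = np.zeros((T, S))
    alpha[0] = pi * B[0]                      # alpha_i(1) = P_1(i) p(x_1 | i)
    for t in range(1, T):
        alpha[t] = B[t] * (alpha[t - 1] @ A)  # sum_j alpha_j(t) P(i | j)
    return alpha, alpha[-1].sum()             # P(x) = sum_i alpha_i(T)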

Decoding The forward pass provides us with state likelihoods With the backwards pass we can find the most likely state at any time Like the backwards pass in DTW The most likely state depends on context! [Trellis diagram over states and time] The Viterbi algorithm 40

The Viterbi algorithm Similar to the forward algorithm, but with hard decisions: v_i(1) = P_1(i) p(x_1 | i), v_i(t+1) = p(x_{t+1} | i) max_j [P(i | j) v_j(t)] (the probability of the most likely path up to time t+1 that ends in state i) For each step remember the most likely transition (as with DTW) Backwards pass: start with the most likely terminal state and work backwards through the most likely transitions [Trellis diagram over states and time, t = 1, 2, 3] 41
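
The same conventions give a short Viterbi sketch: replace the sum with a max, remember the best predecessor at every step, and backtrack from the most likely terminal state.

import numpy as np

def viterbi(pi, A, B):
    T, S = B.shape
    v = np.zeros((T, S))
    back = np.zeros((T, S), dtype=int)
    v[0] = pi * B[0]
    for t in range(1, T):
        trans = v[t - 1][:, None] * A       # trans[j, i] = best score ending in j, times P(i | j)
        back[t] = trans.argmax(axis=0)      # remember the most likely transition into each i
        v[t] = B[t] * trans.max(axis=0)
    path = np.zeros(T, dtype=int)
    path[-1] = v[-1].argmax()               # start with the most likely terminal state
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]  # work backwards through the best transitions
    return path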

Learning the parameters Simple algorithm: Viterbi learning Init: set parameters to random values Loop: while there is no change Find the most likely state sequence Estimate new model parameters using it i.e., estimate state i's Gaussian mean/covariance using all the time frames assigned to state i by the decoding The transition matrix is estimated by counting state transitions Heavy-duty version: Expectation-Maximization 42
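
A rough sketch of this Viterbi-learning loop, assuming the viterbi() sketch above, diagonal-covariance Gaussian states, and a single training sequence X of shape (frames, features); this only illustrates the recipe, it is not a robust trainer (and per the next slide, a real implementation should work in the log domain).

import numpy as np

def gaussian_loglik(X, means, variances):
    # Returns L[t, i] = log N(x_t; mean_i, diag(var_i))
    diff = X[:, None, :] - means[None, :, :]
    return -0.5 * (np.log(2 * np.pi * variances).sum(-1)
                   + (diff ** 2 / variances).sum(-1))

def viterbi_train(X, S, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    T, D = X.shape
    means = X[rng.choice(T, size=S, replace=False)].copy()   # random-ish init
    variances = np.tile(X.var(0) + 1e-6, (S, 1))
    pi = np.full(S, 1.0 / S)
    A = np.full((S, S), 1.0 / S)
    prev_path = None
    for _ in range(iters):
        L = gaussian_loglik(X, means, variances)
        B = np.exp(L - L.max(axis=1, keepdims=True))         # per-frame rescaling; best path unchanged
        path = viterbi(pi, A, B)                             # most likely state sequence
        if prev_path is not None and np.array_equal(path, prev_path):
            break                                            # "while there is no change"
        prev_path = path
        for i in range(S):                                   # re-estimate each state's Gaussian
            frames = X[path == i]
            if len(frames) > 0:
                means[i] = frames.mean(0)
                variances[i] = frames.var(0) + 1e-6
        counts = np.full((S, S), 1e-3)                       # count state transitions (smoothed)
        for t in range(1, T):
            counts[path[t - 1], path[t]] += 1
        A = counts / counts.sum(axis=1, keepdims=True)
    return pi, A, means, variances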

Things to note We can extend HMM learning to work with multiple training sequences e.g., learn an HMM for "one" using 100 recordings of it Use log probabilities! What is a state likelihood like after a million time points? Can you compute that with linear probabilities? You can use an arbitrary state model e.g., a Gaussian, a neural net, a GMM, ... 43
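
To make the log-probability point concrete, here is a log-domain version of the forward sketch; keeping everything as log values and combining them with logsumexp is what keeps a million-frame likelihood from underflowing. Same array conventions as the sketches above.

import numpy as np
from scipy.special import logsumexp

def forward_log(log_pi, log_A, log_B):
    T, S = log_B.shape
    log_alpha = np.zeros((T, S))
    log_alpha[0] = log_pi + log_B[0]
    for t in range(1, T):
        log_alpha[t] = log_B[t] + logsumexp(log_alpha[t - 1][:, None] + log_A, axis=0)
    return logsumexp(log_alpha[-1])   # log P(x), safe even for very long sequences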

Learning an HMM 44

Back to speech Learning an HMM for a word Each state should correspond to a static sub-part e.g., a syllable, a phoneme, etc. silence /w/ /ə/ /n/ silence 45

Fully-connected model 46

Left-to-right model 47
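
A left-to-right word model simply constrains the transition matrix: each state can only stay put or move forward. A small sketch with illustrative numbers:

import numpy as np

S = 5                          # e.g., silence /w/ /ə/ /n/ silence
A = np.zeros((S, S))
for i in range(S - 1):
    A[i, i] = 0.7              # self-loop: stay in the same sub-part
    A[i, i + 1] = 0.3          # or move on to the next sub-part
A[S - 1, S - 1] = 1.0          # final state absorbs
pi = np.eye(S)[0]              # always start in the first state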

Digit recognition results Model log likelihoods [Plots: log likelihood of each input class (1-5) under word models 1 through 5] 48

Speech recognition in a nutshell Small scale recognition Connect word HMMs based on a language model [Diagram: word HMMs for "the", "dog", "was", "here" linked according to the language model] Large systems are another story And lately they are solved differently using deep learning 49

Recap Learning time series Dynamic Time Warping Align and compare sequences Hidden Markov Models Learn sequence models and get likelihoods Basic speech recognition setup 50

Reading material DTW for keyword recognition http://www.irit.fr/~julien.pinquier/docs/tp_mabs/res/dtwsakoe-chiba78.pdf A tutorial on HMMs http://www.cs.ubc.ca/~murphyk/bayes/rabiner.pdf 51

Next lab Designing a simple speech recognizer! 52