UNIVERSITY OF ILLINOIS @ URBANA-CHAMPAIGN
CS 498PS Audio Computing Lab
Temporal Modeling and Basic Speech Recognition
Paris Smaragdis, paris@illinois.edu, paris.cs.illinois.edu
Today's lecture
Recognizing time series
Basic speech recognition
Dynamic Time Warping
Hidden Markov Models
A new problem?
Are these the same utterances?
How do we recognize the spoken digit?
How do we take time order into account?
We could use a Gaussian model as before, but it can't tell the difference between "ten", "net", or "ent"
How do we deal with timing differences?
"three" and "thhhrreeee" should be seen as being the same
A motivating problem to solve
How to recognize words, e.g., yes/no, or spoken digits
Build reliable features: must be invariant to irrelevant differences
Build a classifier that can deal with time: invariant to temporal differences in the inputs
Example data set
Spoken digits
Which is which? Can we automatically identify them?
Finding a good representation
Small differences are not important
E.g., the MSE of these two waveforms isn't useful, even though they are the same digit
Going to the frequency domain
The magnitude Fourier transform is better
The irrelevant waveform differences are hidden, but so is the temporal structure
[Figure: magnitude spectra of the two recordings, Energy vs. Frequency]
Time/Frequency features
A more robust representation: the recordings look more similar
[Figure: waveforms and spectrograms of the two recordings, Frequency (Hz) vs. Time (sec)]
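As an aside, here is a minimal sketch of this kind of feature extraction in Python, assuming a mono signal `x` sampled at `sr` Hz (both hypothetical names); the magnitude STFT hides the irrelevant waveform/phase differences while keeping the temporal structure:

```python
import numpy as np
from scipy.signal import stft

def spectrogram_features(x, sr, frame_len=1024, hop=256):
    """Magnitude spectrogram, laid out as frequencies x frames."""
    _, _, Z = stft(x, fs=sr, nperseg=frame_len, noverlap=frame_len - hop)
    return np.abs(Z)  # drop the phase: waveform detail goes away, timing stays
```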
A new problem
What about relative time differences?
There is no 1-to-1 correspondence in the spectral frames
[Figure: waveforms and spectrograms of two utterances of different lengths, Frequency (Hz) vs. Time (sec)]
Time warping
To compare them we need to time-warp and align
How do we find that warp? How do we get overall similarity?
[Figure: two spectrograms with corresponding regions connected across time]
Matching warped sequences
Represent the warping with a path through a grid of nodes $(i, j)$, matching $r(i),\ i = 1, 2, \ldots, 6$ against $t(j),\ j = 1, 2, \ldots, 5$
[Figure: 6 x 5 alignment grid with a path from (1, 1) to (6, 5)]
An example with text
Comparing "Mama" with "Mammaa"
[Figure: alignment grid with "Mama" on one axis and "Mammaa" on the other; path nodes marked as good or bad matches]
Another example
Comparing "Mama" with "Männaa"
There is still a path, although not as good as before
[Figure: alignment grid with "Mama" vs. "Männaa"; path nodes marked as good, meh, or bad matches]
Finding the overall distance
Visiting a node has a cost, e.g., $d(i, j) = \lvert r(i) - t(j) \rvert$
The overall cost of a path is $D = \sum_k d(i_k, j_k)$
The optimal path results in the smallest final distance, which can be considered the distance between the two input sequences
[Figure: alignment grid with the optimal path highlighted]
Finding the optimal path
Huge space of possible paths: how do we find the one that gives us the minimal distance?
We need constraints to limit the search options
We need some theory! Maybe we can skip a full search
[Figure: alignment grid with many possible paths]
Constraining the search space
Global constraints: nodes we should not visit
Local constraints: allowable transitions
Optimal path
[Figure: alignment grid showing excluded nodes, allowable transitions (black lines), and the optimal path (blue line)]
Some theory help
Bellman's optimality principle: if the optimal path from $(i_0, j_0)$ to $(i_T, j_T)$ passes through $(i, j)$, then it is the concatenation of the optimal path from $(i_0, j_0)$ to $(i, j)$ and the optimal path from $(i, j)$ to $(i_T, j_T)$:
$(i_0, j_0) \xrightarrow{\text{opt}} (i_T, j_T) = (i_0, j_0) \xrightarrow{\text{opt}} (i, j) \oplus (i, j) \xrightarrow{\text{opt}} (i_T, j_T)$
[Figure: alignment grid with an optimal path passing through (i, j)]
Huh?
Finding an optimal path
The optimal path to node $(i_k, j_k)$ satisfies:
$D_{\min}(i_k, j_k) = \min_{i_{k-1},\, j_{k-1}} \big[ D_{\min}(i_{k-1}, j_{k-1}) + d(i_k, j_k \mid i_{k-1}, j_{k-1}) \big]$
Smaller search: the min() only searches local transitions, which is vastly faster than an exhaustive search (see the sketch below)
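A minimal Python/NumPy sketch of this recursion, assuming a precomputed frame-to-frame distance matrix `d` and the three simplest local moves (diagonal, horizontal, vertical); real systems would use the specific local and global constraints discussed next:

```python
import numpy as np

def dtw_cost(d):
    """Accumulated minimal path cost over a distance matrix d of shape (I, J)."""
    I, J = d.shape
    D = np.full((I, J), np.inf)
    D[0, 0] = d[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            # min() only looks at the allowed local transitions into (i, j)
            best_prev = min(
                D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,  # diagonal move
                D[i - 1, j] if i > 0 else np.inf,                # advance sequence 1
                D[i, j - 1] if j > 0 else np.inf,                # advance sequence 2
            )
            D[i, j] = d[i, j] + best_prev
    return D[-1, -1]  # D_min at the terminal node = distance between the inputs
```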
Making this work for speech
Define a distance function
Define local constraints
Define global constraints
Distance function
We can simply use MSE on spectral frames: $d(i, j) = \lVert f_1(i) - f_2(j) \rVert$
[Figure: two magnitude spectra being compared, Energy vs. Frequency]
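A small sketch of computing this distance for every pair of frames at once, assuming the feature matrices are laid out as frequencies x frames (as in the earlier feature-extraction sketch):

```python
import numpy as np

def frame_distances(F1, F2):
    """Euclidean distance between every frame of F1 and every frame of F2.
    F1 is (freqs, I), F2 is (freqs, J); returns the (I, J) matrix d."""
    diff = F1[:, :, None] - F2[:, None, :]   # shape (freqs, I, J) by broadcasting
    return np.sqrt((diff ** 2).sum(axis=0))
```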
Global constraints
Define time ratios that make sense
E.g., one sequence cannot be 2x faster than the other
[Figure: alignment grid with a band around the diagonal marking the allowed region]
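One common way to encode such a global constraint is a band around the stretched diagonal, in the spirit of the Sakoe-Chiba band from the reading list; a hedged sketch, where `width` is a hypothetical knob rather than a value from the lecture:

```python
import numpy as np

def global_band(I, J, width=10):
    """True inside a band of `width` frames around the stretched diagonal."""
    i = np.arange(I)[:, None]
    j = np.arange(J)[None, :]
    # the "on time" alignment maps frame i of one input to frame i * J / I of the other
    return np.abs(j - i * J / I) <= width
```

Excluded nodes can simply be given infinite distance, e.g. `d[~global_band(*d.shape)] = np.inf`, so the accumulation above never visits them.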
Local constraints
Monotonicity: $i_{k-1} \le i_k$ and $j_{k-1} \le j_k$
Repeat, but do not go back in time
Enforcing time order: won't match "cat" to "act"
[Figure: alignment grid with examples of non-allowable backward moves]
Variations of local constraints
Acceptable local subpaths: consider only these local movements
Depends on the application and scope
[Figure: examples of allowable local step patterns]
Example toy run
[Figure: two short inputs plotted against time, their distance matrix, and the accumulated cost matrix under the chosen local constraints]
Matching identical utterances
[Figure: spectrograms of the two inputs, their distance matrix, and the cost matrix; DTW distance d = 0]
Matching similar utterances
[Figure: spectrograms of the two inputs, their distance matrix, and the cost matrix; DTW distance d = 1.5736]
Matching different utterances
[Figure: spectrograms of the two inputs, their distance matrix, and the cost matrix; DTW distance d = 3.0415]
A basic speech recognizer
Collect templates $T_i(t)$ for each word to detect, e.g., yes/no, or digits from 0 to 9
Compute the DTW distance between the input $x(t)$ and each $T_i(t)$
The template with the smallest distance is the most similar word to the input
[Figure: input x(t) compared by DTW against three templates, with d = 4.1, d = 0.3, d = 5.5; x(t) is most similar to template 2]
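Putting the earlier sketches together, such a template-matching recognizer is only a few lines; `templates` here is a hypothetical dict mapping word labels to their feature matrices:

```python
def recognize(x_features, templates):
    """Return the word whose template has the smallest DTW distance to the input.
    Uses frame_distances() and dtw_cost() from the earlier sketches."""
    scores = {word: dtw_cost(frame_distances(x_features, T))
              for word, T in templates.items()}
    return min(scores, key=scores.get)
```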
On our small digit dataset
[Figure: DTW distances from the digit data; five panels, "Comparison with template 1" through "Comparison with template 5", distance vs. input class 1 to 5]
Another application of DTW
Simplifying ADR, which costs time and money
Automatically align audio tracks:
Good take, lousy audio
Good audio, lousy sync
Good take, good audio!
Taking this to the next level
DTW is very convenient for small problems
E.g., a small contacts list on your phone, a vocabulary-limited command set, digit recognition, ...
It will not scale to full-blown speech recognition, e.g., with thousands of templates
For that we need more powerful tools
Starting from the GMM
Gaussian Mixture Models explain data by using multiple Gaussians, as opposed to just one
What if there is time order?
We want to learn the dynamics of the sequence
E.g., each cluster can be a phoneme
Representing time
Hidden Markov Model (Gaussian variant for now)
Each state is represented by a Gaussian distribution
Transition probabilities between all states; only the previous state matters
[Figure: three states A, B, C with transition probabilities P(A|A), P(B|A), P(A|B), P(B|B), P(C|B), P(B|C), P(C|A), P(A|C), P(C|C), and initial probabilities P(A), P(B), P(C)]
What can we do with an HMM?
Learning: given lots of inputs, learn the model parameters
E.g., learn an HMM for each word you want to recognize
Evaluation: for an input, evaluate its likelihood given the model
E.g., given word models, find which explains the input best
Decoding: for an input and a model, find the optimal state sequence
E.g., given the model, find what state you are in at any time
Evaluation
Propagate likelihoods using the forward algorithm:
$\alpha_i(1) = P_1(i)\, P(x_1 \mid i)$ (probability of the initial state times the likelihood of the data given that state)
$\alpha_i(t+1) = P(x_{t+1} \mid i) \sum_j \alpha_j(t)\, P(i \mid j)$
[Figure: trellis of states over time, t = 1, 2, 3, ...]
Evaluation α i (t) denotes the probability of being in state i at time t, given the past Terminal value of α provides the overall likelihood of the model given input sequence P(x)= i α i (T) 39
Decoding
The forward pass provides us with state likelihoods
With the backwards pass we can find the most likely state at any time
Like the backwards pass in DTW
The most likely state depends on context!
The Viterbi algorithm
[Figure: trellis of states over time with the decoded path]
The Viterbi algorithm
Similar to the forward algorithm, but with hard decisions:
$v_i(1) = P_1(i)\, P(x_1 \mid i)$
$v_i(t+1) = P(x_{t+1} \mid i) \max_j \big( P(i \mid j)\, v_j(t) \big)$ (probability of the most likely path up to time t+1 that ends in state i)
For each step remember the most likely transition (as with DTW)
Backwards pass: start with the most likely terminal state and work backwards with the most likely transitions
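The same recursion with max in place of sum, plus the back-pointers and the backwards pass; a sketch under the same layout conventions as the forward sketch above:

```python
import numpy as np

def viterbi(pi, A, B):
    """Most likely state sequence; A[i, j] = P(i | j), B[i, t] = P(x_t | i)."""
    S, T = B.shape
    v = pi * B[:, 0]                       # v_i(1) = P_1(i) P(x_1 | i)
    back = np.zeros((S, T), dtype=int)
    for t in range(1, T):
        trans = A * v[None, :]             # trans[i, j] = P(i | j) v_j(t)
        back[:, t] = trans.argmax(axis=1)  # remember the most likely transition
        v = B[:, t] * trans.max(axis=1)    # hard decision instead of a sum
    path = [int(v.argmax())]               # start with the most likely terminal state
    for t in range(T - 1, 0, -1):          # work backwards through the pointers
        path.append(int(back[path[-1], t]))
    return path[::-1]
```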
Learning the parameters
Simple algorithm: Viterbi learning
Init: set parameters to random values
Loop until there is no change:
Find the most likely state sequence
Estimate new model parameters using it
I.e., estimate state i's Gaussian mean/covariance using all the time frames assigned to state i by the decoding
The transition matrix is estimated by counting state transitions
Heavy-duty version: Expectation-Maximization
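A hedged sketch of this Viterbi-learning loop for a Gaussian HMM; `gaussian_likelihoods()` is a hypothetical helper that fills `B[i, t] = P(x_t | i)` from the per-state Gaussians, `viterbi()` is the sketch above, and a real implementation needs care with states that lose all their frames:

```python
import numpy as np

def viterbi_train(X, S, iters=20):
    """X is (dims, T); S is the number of states."""
    rng = np.random.default_rng(0)
    states = rng.integers(0, S, X.shape[1])          # init: random state sequence
    pi = np.full(S, 1.0 / S)
    for _ in range(iters):
        # state i's Gaussian from the frames currently assigned to state i
        mu = [X[:, states == s].mean(axis=1) for s in range(S)]
        cov = [np.cov(X[:, states == s]) + 1e-6 * np.eye(len(X)) for s in range(S)]
        # transition matrix by counting state transitions in the decoding
        A = np.full((S, S), 1e-3)                    # small floor avoids zero probabilities
        for t in range(1, len(states)):
            A[states[t], states[t - 1]] += 1
        A /= A.sum(axis=0, keepdims=True)
        B = gaussian_likelihoods(X, mu, cov)         # hypothetical helper
        new_states = np.asarray(viterbi(pi, A, B))   # re-decode with the new parameters
        if np.array_equal(new_states, states):       # stop when nothing changes
            break
        states = new_states
    return pi, A, mu, cov
```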
Things to note
We can extend HMM learning to work with multiple training sequences
E.g., learn an HMM for one word using 100 recordings of it
Use log probabilities! What is a state likelihood like after a million time points? Can you compute that with linear probabilities?
You can use an arbitrary state model, e.g., a Gaussian, a neural net, a GMM, ...
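On the log-probability point: after many frames a linear-domain alpha underflows to zero, so practical code keeps everything in logs and replaces sums with logsumexp; a sketch of the forward pass in the log domain, same conventions as before:

```python
import numpy as np
from scipy.special import logsumexp

def forward_loglikelihood(log_pi, log_A, log_B):
    """Log-domain version of the forward sketch: same recursion, no underflow."""
    S, T = log_B.shape
    log_alpha = log_pi + log_B[:, 0]
    for t in range(1, T):
        # log of sum_j alpha_j(t) P(i | j), computed stably with logsumexp
        log_alpha = log_B[:, t] + logsumexp(log_A + log_alpha[None, :], axis=1)
    return logsumexp(log_alpha)  # log P(x)
```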
Learning an HMM
Back to speech
Learning an HMM for a word
Each state should correspond to a static sub-part, e.g., a syllable, a phoneme, etc.
[Figure: state chain for the word "one": silence, /w/, /ə/, /n/, silence]
Fully-connected model
Left-to-right model
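A quick sketch of what a left-to-right transition matrix looks like under the earlier `A[i, j] = P(i | j)` convention: each state either repeats or advances to the next, never going back (the 0.9 self-loop probability is a hypothetical choice, not a value from the lecture):

```python
import numpy as np

def left_to_right_A(S, p_stay=0.9):
    """Left-to-right transition matrix with A[i, j] = P(i | j)."""
    A = np.zeros((S, S))
    for j in range(S - 1):
        A[j, j] = p_stay          # stay in the current state
        A[j + 1, j] = 1 - p_stay  # or advance to the next state
    A[S - 1, S - 1] = 1.0         # final state repeats until the word ends
    return A
```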
Digit recognition results
[Figure: model log likelihoods; five panels, "Likelihood 1" through "Likelihood 5", log likelihood vs. input class 1 to 5]
Speech recognition in a nutshell
Small-scale recognition: connect word HMMs based on a language model
[Figure: word HMMs for "the", "dog", "was", "here" connected according to a language model]
Large systems are another story, and lately they are solved differently, using deep learning
Recap
Learning time series
Dynamic Time Warping: align and compare sequences
Hidden Markov Models: learn sequence models and get likelihoods
Basic speech recognition setup
Reading material
DTW for keyword recognition: http://www.irit.fr/~julien.pinquier/docs/tp_mabs/res/dtwsakoe-chiba78.pdf
A tutorial on HMMs: http://www.cs.ubc.ca/~murphyk/bayes/rabiner.pdf
Next lab
Designing a simple speech recognizer!