Lecture 9: Speech Recognition
EE E6820: Speech & Audio Processing & Recognition
Lecture 9: Speech Recognition
1 Recognizing Speech
2 Feature Calculation
3 Sequence Recognition
4 Hidden Markov Models
Dan Ellis <dpwe@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering, Spring 2006
E6820 SAPR - Dan Ellis - L9: Speech Recognition

1 Recognizing Speech
[Figure: spectrogram (frequency vs. time) of the utterance "So, I thought about that and I think it's still possible"]
What kind of information might we want from the speech signal?
- words
- phrasing, speech acts (prosody)
- mood / emotion
- speaker identity
What kind of processing do we need to get at that information?
- time scale of feature extraction
- signal aspects to capture in features
- signal aspects to exclude from features
Speech recognition as transcription
Transcription = speech to text
- find a word string to match the utterance
Best suited to small-vocabulary tasks
- voice dialing, command & control, etc.
Gives a neat objective measure: word error rate (WER, %)
- can be a sensitive measure of performance
Three kinds of errors:
  Reference:   THE CAT SAT ON THE   MAT
  Recognized:      CAT SAT AN THE A MAT
              (deletion)  (substitution: ON -> AN)  (insertion: A)
  WER = (S + D + I) / N

Problems: within-speaker variability
Timing variation:
- word duration varies enormously
- fast speech "reduces" vowels
[Figure: spectrogram with phone-level alignment of "SO I THOUGHT ABOUT THAT AND I THINK IT'S STILL POSSIBLE"]
Speaking style variation:
- careful/casual articulation
- soft/loud speech
Contextual effects:
- speech sounds vary with context, role: "How do you do?"
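The WER bookkeeping above can be sketched with a standard Levenshtein alignment over words; `word_error_rate` is a hypothetical helper for illustration, not code from the lecture:

```python
# Word error rate via Levenshtein alignment: WER = (S + D + I) / N,
# where N is the number of reference words.

def word_error_rate(ref, hyp):
    """Return (WER, substitutions, deletions, insertions)."""
    ref, hyp = ref.split(), hyp.split()
    n, m = len(ref), len(hyp)
    # D[i][j] = minimum edit cost aligning ref[:i] to hyp[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # all deletions
    for j in range(m + 1):
        D[0][j] = j                      # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            D[i][j] = min(sub, D[i - 1][j] + 1, D[i][j - 1] + 1)
    # Backtrack along one optimal alignment to count error types
    i, j, S, Dl, I = n, m, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i-1][j-1] + (ref[i-1] != hyp[j-1]):
            S += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            Dl += 1
            i -= 1
        else:
            I += 1
            j -= 1
    return (S + Dl + I) / n, S, Dl, I

# The slide's example: 3 errors against a 6-word reference, WER = 0.5
wer, S, Dl, I = word_error_rate("THE CAT SAT ON THE MAT",
                                "CAT SAT AN THE A MAT")
```

Note that because the alignment minimizes total edits, ties can shift counts between the individual S/D/I categories, but their sum (and hence the WER) is unambiguous.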
Between-speaker variability
Accent variation
- regional / mother tongue
Voice quality variation
- gender, age, huskiness, nasality
Individual characteristics
- mannerisms, speed, prosody
[Figure: spectrograms (freq/Hz vs. time/s) of the same utterance by two speakers, "mbma" and "fjdm"]

Environment variability
Background noise
- fans, cars, doors, papers
Reverberation
- "boxiness" in recordings
Microphone/channel
- huge effect on relative spectral gain
[Figure: spectrograms of the same speech from a close mic and a tabletop mic]
How to recognize speech?
Cross-correlate templates?
- waveform?
- spectrogram?
- time-warp problems
Match short segments & handle time-warp later
- model with slices of ~10 ms
- pseudo-stationary model of words: sil g w eh n sil
- other sources of variation...
[Figure: spectrogram (freq/Hz vs. time/s) segmented into pseudo-stationary phone slices]

Probabilistic formulation
Probability that a segment label is correct gives the standard form of speech recognizers:
- feature calculation transforms the signal s[n] into an easily classified domain of feature vectors X_m
- an acoustic classifier calculates the probability of each mutually exclusive state q_i: p(q_i | X_m)
- a finite-state acceptor (i.e. HMM) finds the MAP match of an allowable sequence to those probabilities:
  Q^ = argmax_{q_0, q_1, ..., q_L} p(q_0, q_1, ..., q_L | X_0, X_1, ..., X_L)
Standard speech recognizer structure
  sound s[n] -> [feature calculation] -> feature vectors X_m
             -> [acoustic classifier] -> phone probabilities p(q_i | X_m)
             -> [HMM decoder]         -> phone & word labeling
  (the classifier uses trained network weights; the decoder uses word models and a language model)
Questions:
- what are the best features?
- how do we do the acoustic classification?
- how do we find/match the state sequence?

Outline
1 Recognizing Speech
2 Feature Calculation
- spectrogram, MFCCs & PLP
- improving robustness
3 Sequence Recognition
4 Hidden Markov Models
2 Feature Calculation
Goal: find a representational space most suitable for classification
- waveform: voluminous, redundant, variable
- spectrogram: better, still quite variable
- ...?
Pattern recognition: the representation is an upper bound on performance
- maybe we should use the waveform...
- or maybe the representation can do all the work
Feature calculation is intimately bound to the classifier
- pragmatic strengths and weaknesses
Features develop by slow evolution
- current choices are more historical than principled

Features (1): Spectrogram
Plain STFT as features:
  X_m[k] = S[mH, k] = Σ_n s[n + mH] w[n] e^{-j2πkn/N}
[Figure: spectrogram examples with a feature-vector slice marked]
Similarities between corresponding segments
- but still large differences
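The STFT feature calculation can be sketched in a few lines; the frame length N and hop H below are illustrative choices (25 ms / 10 ms at 16 kHz), not values fixed by the slide:

```python
import numpy as np

# STFT features as on the slide: X_m[k] = sum_n s[n + mH] w[n] e^{-j 2 pi k n / N},
# one log-magnitude spectral slice per hop-H frame.

def stft_features(s, N=400, H=160):
    """Log-magnitude STFT: rows are frames, columns are frequency bins."""
    w = np.hanning(N)                       # analysis window w[n]
    n_frames = 1 + (len(s) - N) // H
    frames = np.stack([s[m * H:m * H + N] * w for m in range(n_frames)])
    S = np.fft.rfft(frames, n=N, axis=1)    # X_m[k] for k = 0 .. N/2
    return np.log(np.abs(S) + 1e-10)        # log magnitude, floored

fs = 16000
t = np.arange(fs) / fs                      # 1 s test tone at 1 kHz
X = stft_features(np.sin(2 * np.pi * 1000 * t))
# Peak bin for 1 kHz: 1000 / (fs / N) = bin 25
```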
Features (2): Cepstrum
Idea: decorrelate and summarize the spectral slices:
  X_m[l] = IDFT{ log |S[mH, k]| }
- good for Gaussian models
- greatly reduces the feature dimension
[Figure: male and female example spectra and their cepstra]

Features (3): Frequency axis warp
A linear frequency axis gives equal space to 0-1 kHz and 3-4 kHz
- but their perceptual importance is very different
Warp the frequency axis closer to a perceptual axis
- mel, Bark, constant-Q:
  X[c] = Σ_{k=l_c}^{u_c} |S[k]|²
[Figure: male and female "audspec" (auditory spectrogram) examples]
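A minimal sketch of the cepstrum of one log-magnitude slice; the toy spectrum and the truncation length of 13 coefficients are illustrative choices, not values from the slide:

```python
import numpy as np

# Cepstrum as on the slide: X_m[l] = IDFT{ log |S[mH, k]| }.
# Truncating to the first few coefficients keeps the smooth spectral
# envelope and discards fine harmonic ripple.

def cepstrum(log_mag_spectrum, n_keep=13):
    """Real cepstrum from one log-magnitude spectrum; first n_keep coefs."""
    c = np.fft.irfft(log_mag_spectrum)      # IDFT of the log spectrum
    return c[:n_keep]

# Toy slice: a smooth downward envelope plus harmonic-like ripple
k = np.arange(257)
log_S = -k / 256.0 + 0.3 * np.cos(2 * np.pi * k / 16)
c = cepstrum(log_S)
```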
Features (4): Spectral smoothing
Generalizing across different speakers is helped by smoothing (i.e. blurring) the spectrum
Truncated cepstrum is one way:
- MSE approximation to log |S[k]|
LPC modeling is a little different:
- MSE approximation to |S[k]|², which prefers detail at peaks
[Figure: male audspec slice vs. PLP-smoothed version (level/dB vs. freq/channel)]

Features (5): Normalization along time
Idea: capture feature variations, not absolute level
Hence: calculate the average level and subtract it:
  Y[n, k] = X[n, k] - mean_n{ X[n, k] }
This factors out a fixed channel frequency response:
  s[n] = h_c * e[n]  =>  log |S[n, k]| = log |H_c[k]| + log |E[n, k]|
[Figure: male and female PLP features before and after mean normalization]
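Mean normalization is easy to demonstrate: adding a fixed per-channel offset (a channel response) to log-domain features and subtracting the per-channel mean recovers the clean features. The data below are synthetic:

```python
import numpy as np

# Mean normalization along time, as on the slide:
#   Y[n, k] = X[n, k] - mean_n{ X[n, k] }.
# In the log domain a fixed channel H_c[k] adds a constant per channel,
# so subtracting the per-channel mean removes it.

rng = np.random.default_rng(0)
X_clean = rng.normal(size=(100, 20))       # log-spectral features (frames x channels)
channel = np.linspace(-3.0, 3.0, 20)       # fixed channel response log|H_c[k]|
X_noisy = X_clean + channel                # channel adds a per-channel offset

Y_clean = X_clean - X_clean.mean(axis=0)
Y_noisy = X_noisy - X_noisy.mean(axis=0)
# After normalization the two versions coincide: the channel is factored out
```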
Delta features
Want each segment to have static feature values
- but some segments are intrinsically dynamic!
- so calculate their derivatives (maybe steadier?)
Append dx/dt (+ d²x/dt²) to the feature vectors
Relates to onset sensitivity in humans?
[Figure: male PLP features ((μ,σ)-normalized), their deltas and double-deltas]

Overall feature calculation
MFCCs and/or RASTA-PLP:
- MFCC: sound -> FFT |X[k]| -> mel-scale frequency warp -> log -> IFFT -> truncate cepstra -> subtract mean (CMN) -> MFCC features
- RASTA-PLP: sound -> FFT |X[k]| -> Bark-scale frequency warp -> log -> RASTA band-pass -> LPC smoothing -> cepstral recursion -> RASTA-PLP cepstral features
Key attributes:
- spectral, auditory scale
- decorrelation
- smoothing (of spectral detail)
- normalization of levels
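A sketch of delta features using a simple two-point difference via `numpy.gradient`; real front ends typically fit a regression line over a longer window:

```python
import numpy as np

# Delta features: append frame-to-frame derivatives dX/dt (and d2X/dt2)
# to each feature vector.

def add_deltas(X):
    """Stack [X, dX, ddX] along the feature axis."""
    d = np.gradient(X, axis=0)       # first difference over time (frames)
    dd = np.gradient(d, axis=0)      # second difference
    return np.concatenate([X, d, dd], axis=1)

# Toy features: a linear ramp has constant deltas and zero double-deltas
X = np.arange(20, dtype=float).reshape(10, 2)
F = add_deltas(X)
```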
Features summary
[Figure: male and female examples at successive processing stages: spectrum, audspec, rasta, deltas]
- normalize the same phones
- contrast different phones

Outline
1 Recognizing Speech
2 Feature Calculation
3 Sequence Recognition
- dynamic time warp
- probabilistic formulation
4 Hidden Markov Models
3 Sequence recognition: Dynamic Time Warp (DTW)
Framewise comparison with stored templates:
[Figure: frame-by-frame distance matrices between test frames and reference templates ONE, TWO, THREE, FOUR, FIVE (time in frames)]
- distance metric?
- comparison across templates?

Dynamic Time Warp (2)
Find the lowest-cost constrained path:
- matrix d(i,j) of distances between input frame f_i and reference frame r_j
- allowable predecessors and transition costs T_xy:
  D(i,j) = d(i,j) + min{ D(i-1,j) + T_10,
                         D(i,j-1) + T_01,
                         D(i-1,j-1) + T_11 }
  (local match cost plus the lowest cost to the best predecessor, including its transition cost)
Best path via traceback from the final state
- store the predecessor of each (i,j)
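The DTW recursion can be sketched directly; for simplicity the transition costs T are taken as zero, and the local distance is a plain absolute difference between scalar "features":

```python
import numpy as np

# DTW: D(i,j) = d(i,j) + min{ D(i-1,j), D(i,j-1), D(i-1,j-1) }
# (transition costs T set to zero here), with predecessors stored for traceback.

def dtw(test, ref):
    """Return (lowest path cost, alignment path) between two sequences."""
    n, m = len(test), len(ref)
    d = np.abs(test[:, None] - ref[None, :])        # local match costs d(i,j)
    D = np.full((n, m), np.inf)
    pred = np.zeros((n, m, 2), dtype=int)
    D[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0:
                cands.append((D[i - 1, j], (i - 1, j)))
            if j > 0:
                cands.append((D[i, j - 1], (i, j - 1)))
            if i > 0 and j > 0:
                cands.append((D[i - 1, j - 1], (i - 1, j - 1)))
            best, pred[i, j] = min(cands, key=lambda c: c[0])
            D[i, j] = d[i, j] + best
    # Traceback from the final state via stored predecessors
    path, ij = [], (n - 1, m - 1)
    while ij != (0, 0):
        path.append(ij)
        ij = tuple(pred[ij])
    path.append((0, 0))
    return D[n - 1, m - 1], path[::-1]

# A time-warped copy of the same contour aligns at zero cost
ref = np.array([0., 1., 2., 3., 2., 1.])
test = np.array([0., 0., 1., 2., 2., 3., 2., 1.])
cost, path = dtw(test, ref)
```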
DTW-based recognition
Reference templates for each possible word
For isolated words:
- mark the endpoints of the input word
- calculate scores through each template (+ prune)
- continuous speech: link together word ends
[Figure: input frames scored against concatenated templates ONE, TWO, THREE, FOUR]
Successfully handles timing variation
- recognizes speech at reasonable cost

Statistical sequence recognition
DTW is limited because it is hard to optimize
- interpretation of distance and transition costs?
Need a theoretical foundation: probability
Formulate recognition as a MAP choice among models:
  M* = argmax_{M_j} p(M_j | X, Θ)
- X = observed features
- M_j = word-sequence models
- Θ = all current parameters
Statistical formulation (2)
Can rearrange via Bayes' rule (and drop p(X)):
  M* = argmax_{M_j} p(M_j | X, Θ) = argmax_{M_j} p(X | M_j, Θ_A) · p(M_j | Θ_L)
- p(X | M_j, Θ_A) = likelihood of the observations under the model
- p(M_j | Θ_L) = prior probability of the model
- Θ_A = acoustics-related model parameters
- Θ_L = language-related model parameters
Questions:
- what form of model to use for p(X | M_j, Θ_A)?
- how to find Θ_A (training)?
- how to solve for M_j (decoding)?

State-based modeling
Assume a discrete-state model for the speech:
- the observations are divided up into time frames
- model states ↔ observations:
  states                    Q_k: q_1 q_2 q_3 q_4 q_5 q_6 ... q_N
  observed feature vectors  X:   x_1 x_2 x_3 x_4 x_5 x_6 ... x_N
The probability of the observations given the model is:
  p(X | M_j) = Σ_{all Q_k} p(X | Q_k, M_j) p(Q_k | M_j)
- a sum over all possible state sequences Q_k
How do observations depend on states?
How do state sequences depend on the model?
Outline
1 Recognizing Speech
2 Feature Calculation
3 Sequence Recognition
4 Hidden Markov Models (HMMs)
- generative Markov models
- hidden Markov models
- model-fit likelihood
- HMM examples

Markov models
A (first-order) Markov model is a finite-state system whose behavior depends only on the current state
E.g. a generative Markov model:
[Figure: state diagram over states {S, A, B, C, E} with its transition-probability table p(q_{n+1} | q_n); the emitting states A, B, C have strong (0.8) self-loops]
- a typical generated state sequence: S A A A A A A A A B B B B B B B B B C C C C B B B B B B C E
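A generative first-order Markov model is easy to simulate. The transition table below is illustrative (the slide's exact figures are not recoverable from the scrape), with the characteristic strong self-loops:

```python
import random

# A small generative (first-order) Markov model over states {S, A, B, C, E}.
# Transition probabilities are illustrative, not the slide's exact table;
# the 0.8 self-loops mirror the slide's dominant diagonal.

P = {
    "S": {"A": 1.0},
    "A": {"A": 0.8, "B": 0.2},
    "B": {"B": 0.8, "C": 0.1, "E": 0.1},
    "C": {"C": 0.8, "B": 0.1, "E": 0.1},
}

def generate(seed=0):
    """Walk from S until the end state E; return the emitted state sequence."""
    rng = random.Random(seed)
    q, seq = "S", []
    while q != "E":
        nxt, probs = zip(*P[q].items())
        q = rng.choices(nxt, weights=probs)[0]
        if q != "E":
            seq.append(q)
    return seq

seq = generate()
```

Runs of repeated states (A A A ... B B B ...) come directly from the self-loops, which is how the model absorbs duration variation.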
Hidden Markov models
A hidden Markov model is a Markov model whose state sequence Q = {q_n} is not directly observable (it is "hidden")
But the observations X do depend on Q:
- x_n is a random variable whose distribution depends on the current state: the emission distributions p(x | q)
[Figure: emission distributions p(x | q) for q = A, B, C; a hidden state sequence AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC; and the observation sequence x_n it generates]
- so we can still tell something about the state sequence...

(Generative) Markov models (2)
An HMM is specified by:
- states q_i
- transition probabilities a_ij = p(q_n = j | q_{n-1} = i)
- emission distributions b_i(x) = p(x | q_i)
- (initial state probabilities π_i = p(q_1 = i))
Markov models for sequence recognition
Independence of observations:
- observation x_n depends only on the current state q_n:
  p(X | Q) = p(x_1, x_2, ..., x_N | q_1, q_2, ..., q_N)
           = p(x_1 | q_1) p(x_2 | q_2) ... p(x_N | q_N)
           = Π_{n=1}^{N} p(x_n | q_n) = Π_{n=1}^{N} b_{q_n}(x_n)
Markov transitions:
- the transition to the next state depends only on the current state:
  p(Q | M) = p(q_1, q_2, ..., q_N | M)
           = p(q_N | q_{N-1}) p(q_{N-1} | q_{N-2}) ... p(q_2 | q_1) p(q_1)
           = p(q_1) Π_{n=2}^{N} p(q_n | q_{n-1}) = π_{q_1} Π_{n=2}^{N} a_{q_{n-1} q_n}

Model-fit calculation
From state-based modeling:
  p(X | M_j) = Σ_{all Q_k} p(X | Q_k, M_j) p(Q_k | M_j)
For HMMs:
  p(X | Q) = Π_{n=1}^{N} b_{q_n}(x_n),   p(Q | M) = π_{q_1} Π_{n=2}^{N} a_{q_{n-1} q_n}
Hence, solve for M*:
- calculate p(X | M_j), scaled by the prior p(M_j), for each available model
But how to sum over all Q_k???
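The two factorizations combine to give p(X, Q | M) for any single state path; the model numbers below are illustrative, not from the slides:

```python
import numpy as np

# Joint probability of an observation sequence and one state path:
#   p(X | Q) = prod_n b_{q_n}(x_n)               (observation independence)
#   p(Q | M) = pi_{q_1} prod_n a_{q_{n-1} q_n}   (Markov transitions)
# so p(X, Q | M) = p(Q | M) * p(X | Q).  Illustrative two-state model.

pi = np.array([1.0, 0.0])                 # always start in state 0
A = np.array([[0.7, 0.3],                 # a_ij = p(next = j | current = i)
              [0.0, 1.0]])
B = np.array([[0.9, 0.1],                 # b_i(x) for a binary observation x
              [0.2, 0.8]])

def joint_prob(X, Q):
    """p(X, Q | M) for one path Q and observations X."""
    p = pi[Q[0]] * B[Q[0], X[0]]
    for n in range(1, len(Q)):
        p *= A[Q[n - 1], Q[n]] * B[Q[n], X[n]]
    return p

X = [0, 0, 1]
Q = [0, 0, 1]
p = joint_prob(X, Q)     # 1.0*0.9 * 0.7*0.9 * 0.3*0.8 = 0.13608
```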
Summing over all paths
[Figure: a small model M with states S, A, B, E; observation likelihoods p(x | A) and p(x | B) for observations x_1, x_2, x_3; all possible 3-emission paths Q_k from S to E (S A A A E, S A A B E, S A B B E, S B B B E, ...); for each path, p(Q | M) = Π_n p(q_n | q_{n-1}) and p(X | Q, M) = Π_n p(x_n | q_n) are multiplied, and the products summed over paths give p(X | M)]

The forward recursion
A dynamic-programming-like technique to calculate the sum over all Q_k
Define α_n(i) as the probability of getting to state q_i at time step n (by any path):
  α_n(i) = p(x_1, x_2, ..., x_n, q_n = q_i | M)
Then α_{n+1}(j) can be calculated recursively:
  α_{n+1}(j) = [ Σ_i α_n(i) a_ij ] · b_j(x_{n+1})
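The forward recursion can be sketched and checked against a brute-force sum over every path; the two-state model here is illustrative, not from the slides:

```python
import numpy as np
from itertools import product

# Forward recursion: alpha_1(i) = pi_i b_i(x_1),
# alpha_{n+1}(j) = [sum_i alpha_n(i) a_ij] b_j(x_{n+1}),
# p(X | M) = sum_i alpha_N(i).

pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def forward(X):
    alpha = pi * B[:, X[0]]
    for x in X[1:]:
        alpha = (alpha @ A) * B[:, x]
    return alpha.sum()                  # p(X | M)

def brute_force(X):
    """Explicit sum of p(X, Q | M) over all 2^N state paths."""
    total = 0.0
    for Q in product(range(2), repeat=len(X)):
        p = pi[Q[0]] * B[Q[0], X[0]]
        for n in range(1, len(X)):
            p *= A[Q[n - 1], Q[n]] * B[Q[n], X[n]]
        total += p
    return total

X = [0, 0, 1]
pX = forward(X)
```

The recursion does in O(N·S²) what the brute force does in O(N·Sᴺ): both give the same p(X | M).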
Forward recursion (2)
Initialize:
  α_1(i) = π_i b_i(x_1)
Then the total probability is:
  p(X | M) = Σ_i α_N(i)
A practical way to solve for p(X | M_j) and hence perform recognition:
- for each model, compute p(X | M_j) p(M_j) from the observations X, and choose the best

Optimal path
May be interested in the actual q_n assignments
- which state was active at each time frame
- e.g. phone labelling (for training?)
The total probability is over all paths, but we can also solve for the single best path, the Viterbi state sequence
Probability along the best path to state q_j at step n+1:
  α*_{n+1}(j) = max_i { α*_n(i) a_ij } · b_j(x_{n+1})
- backtrack from the final state to get the best path
- the final probability is a product only (no sum), so the log-domain calculation is just a summation
The total probability is often dominated by the best path:
  p(X, Q* | M) ≈ p(X | M)
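Viterbi decoding follows the same shape as the forward recursion with max in place of sum, plus backtracking; the two-state model here is illustrative, not from the slides:

```python
import numpy as np

# Viterbi: alpha*_{n+1}(j) = max_i{ alpha*_n(i) a_ij } b_j(x_{n+1}),
# with backtracking from the most probable final state.  (In practice this
# is done in the log domain, where the products become sums.)

pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def viterbi(X):
    """Return (best state path Q*, its probability p(X, Q* | M))."""
    alpha = pi * B[:, X[0]]
    back = []
    for x in X[1:]:
        scores = alpha[:, None] * A          # scores[i, j] = alpha_i a_ij
        back.append(scores.argmax(axis=0))   # best predecessor of each j
        alpha = scores.max(axis=0) * B[:, x]
    # Backtrack from the most probable final state
    q = int(alpha.argmax())
    path = [q]
    for bp in reversed(back):
        q = int(bp[q])
        path.append(q)
    return path[::-1], alpha.max()

path, p_star = viterbi([0, 0, 1])
```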
Interpreting the Viterbi path
The Viterbi path assigns each x_n to a state q_i
- performing classification based on the b_i(x)
- ... at the same time as applying the transition constraints a_ij
[Figure: observations x_n with their inferred Viterbi labels AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC]
Can be used for segmentation
- train an HMM with "garbage" and "target" states
- decode on new data to find targets and their boundaries
Can be used for (heuristic) training
- e.g. train classifiers based on the Viterbi labels...

Recognition with HMMs
Isolated word: choose the best p(M | X) ∝ p(X | M) p(M)
- Model M_1 = w ah n ("one"):    p(X | M_1) p(M_1) = ...
- Model M_2 = t uw ("two"):      p(X | M_2) p(M_2) = ...
- Model M_3 = th r iy ("three"): p(X | M_3) p(M_3) = ...
Continuous speech:
- Viterbi decoding of one large HMM gives the words
[Figure: word models w ah n, t uw, th r iy joined through sil, with priors p(M_1), p(M_2), p(M_3)]
HMM examples: Different state sequences
[Figure: Model M_1 with state sequence S → K → A → T → E and Model M_2 with state sequence S → K → O → T → E, sharing emission distributions p(x | K), p(x | A), p(x | O), p(x | T)]

Model matching: emission probabilities
[Figure: an observation sequence x_n decoded against models M_1 and M_2; for each model the figure shows the Viterbi state alignment, log p(X | M), log p(X, Q* | M), the log transition probability log p(Q* | M), and the log observation likelihood log p(X | Q*, M); the model whose emission distributions better match the observations scores higher]
Model matching: transition probabilities
[Figure: two variants M_1' and M_2' with the same states and emission distributions but different transition probabilities; the same observation sequence gets the same log observation likelihood log p(X | Q*, M), but the log transition probabilities log p(Q* | M) differ, so the total scores differ]

Summary
The speech signal is highly variable
- need models that absorb variability
- hide what we can with robust features
Speech is modeled as a sequence of features
- need a temporal aspect to recognition
- best time-alignment of templates = DTW
Hidden Markov models are the rigorous solution
- self-loops allow temporal dilation
- exact, efficient likelihood calculations
Parting thought: how to set the HMM parameters? (training)
More informationLecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010
Hidden Lecture 4: Hidden : An Introduction to Dynamic Decision Making November 11, 2010 Special Meeting 1/26 Markov Model Hidden When a dynamical system is probabilistic it may be determined by the transition
More informationA gentle introduction to Hidden Markov Models
A gentle introduction to Hidden Markov Models Mark Johnson Brown University November 2009 1 / 27 Outline What is sequence labeling? Markov models Hidden Markov models Finding the most likely state sequence
More informationComputer Science March, Homework Assignment #3 Due: Thursday, 1 April, 2010 at 12 PM
Computer Science 401 8 March, 2010 St. George Campus University of Toronto Homework Assignment #3 Due: Thursday, 1 April, 2010 at 12 PM Speech TA: Frank Rudzicz 1 Introduction This assignment introduces
More informationAutomatic Phoneme Recognition. Segmental Hidden Markov Models
Automatic Phoneme Recognition with Segmental Hidden Markov Models Areg G. Baghdasaryan Thesis submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment
More informationDetection-Based Speech Recognition with Sparse Point Process Models
Detection-Based Speech Recognition with Sparse Point Process Models Aren Jansen Partha Niyogi Human Language Technology Center of Excellence Departments of Computer Science and Statistics ICASSP 2010 Dallas,
More informationChapter 9 Automatic Speech Recognition DRAFT
P R E L I M I N A R Y P R O O F S. Unpublished Work c 2008 by Pearson Education, Inc. To be published by Pearson Prentice Hall, Pearson Education, Inc., Upper Saddle River, New Jersey. All rights reserved.
More informationHidden Markov Models Part 2: Algorithms
Hidden Markov Models Part 2: Algorithms CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Hidden Markov Model An HMM consists of:
More informationHierarchical Multi-Stream Posterior Based Speech Recognition System
Hierarchical Multi-Stream Posterior Based Speech Recognition System Hamed Ketabdar 1,2, Hervé Bourlard 1,2 and Samy Bengio 1 1 IDIAP Research Institute, Martigny, Switzerland 2 Ecole Polytechnique Fédérale
More informationEstimation of Relative Operating Characteristics of Text Independent Speaker Verification
International Journal of Engineering Science Invention Volume 1 Issue 1 December. 2012 PP.18-23 Estimation of Relative Operating Characteristics of Text Independent Speaker Verification Palivela Hema 1,
More informationorder is number of previous outputs
Markov Models Lecture : Markov and Hidden Markov Models PSfrag Use past replacements as state. Next output depends on previous output(s): y t = f[y t, y t,...] order is number of previous outputs y t y
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More informationSegmental Recurrent Neural Networks for End-to-end Speech Recognition
Segmental Recurrent Neural Networks for End-to-end Speech Recognition Liang Lu, Lingpeng Kong, Chris Dyer, Noah Smith and Steve Renals TTI-Chicago, UoE, CMU and UW 9 September 2016 Background A new wave
More informationImproving the Multi-Stack Decoding Algorithm in a Segment-based Speech Recognizer
Improving the Multi-Stack Decoding Algorithm in a Segment-based Speech Recognizer Gábor Gosztolya, András Kocsor Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University
More informationPHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS
PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS Jinjin Ye jinjin.ye@mu.edu Michael T. Johnson mike.johnson@mu.edu Richard J. Povinelli richard.povinelli@mu.edu
More informationHuman-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg
Temporal Reasoning Kai Arras, University of Freiburg 1 Temporal Reasoning Contents Introduction Temporal Reasoning Hidden Markov Models Linear Dynamical Systems (LDS) Kalman Filter 2 Temporal Reasoning
More informationSequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them
HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated
More informationDiscriminative models for speech recognition
Discriminative models for speech recognition Anton Ragni Peterhouse University of Cambridge A thesis submitted for the degree of Doctor of Philosophy 2013 Declaration This dissertation is the result of
More informationA TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY. MengSun,HugoVanhamme
A TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY MengSun,HugoVanhamme Department of Electrical Engineering-ESAT, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, Bus
More informationCOMP90051 Statistical Machine Learning
COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 24. Hidden Markov Models & message passing Looking back Representation of joint distributions Conditional/marginal independence
More informationLecture 7: Pitch and Chord (2) HMM, pitch detection functions. Li Su 2016/03/31
Lecture 7: Pitch and Chord (2) HMM, pitch detection functions Li Su 2016/03/31 Chord progressions Chord progressions are not arbitrary Example 1: I-IV-I-V-I (C-F-C-G-C) Example 2: I-V-VI-III-IV-I-II-V
More informationSpeech and Language Processing. Chapter 9 of SLP Automatic Speech Recognition (II)
Speech and Language Processing Chapter 9 of SLP Automatic Speech Recognition (II) Outline for ASR ASR Architecture The Noisy Channel Model Five easy pieces of an ASR system 1) Language Model 2) Lexicon/Pronunciation
More informationRobust Speech Recognition in the Presence of Additive Noise. Svein Gunnar Storebakken Pettersen
Robust Speech Recognition in the Presence of Additive Noise Svein Gunnar Storebakken Pettersen A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of PHILOSOPHIAE DOCTOR
More informationLecture 13: Structured Prediction
Lecture 13: Structured Prediction Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/nlp16 CS6501: NLP 1 Quiz 2 v Lectures 9-13 v Lecture 12: before page
More informationProbabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov
Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly
More informationText-to-speech synthesizer based on combination of composite wavelet and hidden Markov models
8th ISCA Speech Synthesis Workshop August 31 September 2, 2013 Barcelona, Spain Text-to-speech synthesizer based on combination of composite wavelet and hidden Markov models Nobukatsu Hojo 1, Kota Yoshizato
More informationTopic 6. Timbre Representations
Topic 6 Timbre Representations We often say that singer s voice is magnetic the violin sounds bright this French horn sounds solid that drum sounds dull What aspect(s) of sound are these words describing?
More informationComputational Genomics and Molecular Biology, Fall
Computational Genomics and Molecular Biology, Fall 2014 1 HMM Lecture Notes Dannie Durand and Rose Hoberman November 6th Introduction In the last few lectures, we have focused on three problems related
More informationArtificial Intelligence Markov Chains
Artificial Intelligence Markov Chains Stephan Dreiseitl FH Hagenberg Software Engineering & Interactive Media Stephan Dreiseitl (Hagenberg/SE/IM) Lecture 12: Markov Chains Artificial Intelligence SS2010
More informationSupport Vector Machines using GMM Supervectors for Speaker Verification
1 Support Vector Machines using GMM Supervectors for Speaker Verification W. M. Campbell, D. E. Sturim, D. A. Reynolds MIT Lincoln Laboratory 244 Wood Street Lexington, MA 02420 Corresponding author e-mail:
More informationSPEECH RECOGNITION USING TIME DOMAIN FEATURES FROM PHASE SPACE RECONSTRUCTIONS
SPEECH RECOGNITION USING TIME DOMAIN FEATURES FROM PHASE SPACE RECONSTRUCTIONS by Jinjin Ye, B.S. A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL IN PARTIAL FULFILLMENT OF THE REQUIREMENTS for
More informationASR using Hidden Markov Model : A tutorial
ASR using Hidden Markov Model : A tutorial Samudravijaya K Workshop on ASR @BAMU; 14-OCT-11 samudravijaya@gmail.com Tata Institute of Fundamental Research Samudravijaya K Workshop on ASR @BAMU; 14-OCT-11
More informationSoundex distance metric
Text Algorithms (4AP) Lecture: Time warping and sound Jaak Vilo 008 fall Jaak Vilo MTAT.03.90 Text Algorithms Soundex distance metric Soundex is a coarse phonetic indexing scheme, widely used in genealogy.
More informationModel-based Approaches to Robust Speech Recognition in Diverse Environments
Model-based Approaches to Robust Speech Recognition in Diverse Environments Yongqiang Wang Darwin College Engineering Department Cambridge University October 2015 This dissertation is submitted to the
More information