Automatic Speech Recognition: From Theory to Practice


Automatic Speech Recognition: From Theory to Practice
http://www.cis.hut.fi/opinnot//
September 20, 2004
Prof. Bryan Pellom, Department of Computer Science, Center for Spoken Language Research, University of Colorado
pellom@cslr.colorado.edu

Today
- Introduction to Speech Production and Phonetics
- Short Video on Spectrogram Reading
- Quick Review of Probability and Statistics
- Formulation of the Speech Recognition Problem
- Homework #1

Speech Production and Phonetics
Reference: Peter Ladefoged, A Course in Phonetics, Harcourt Brace Jovanovich, ISBN 0-15-500173-6. An excellent introductory reference for this material.

Speech Production Anatomy
[Figure from Spoken Language Processing (Huang, Acero, Hon)]

Speech Production Anatomy
- Vocal tract: consists of the pharyngeal and oral cavities.
- Articulators: components of the vocal tract that move to produce various speech sounds. These include the vocal folds, velum, lips, tongue, and teeth.

Source-Filter Representation of Speech Production
- Speech production is viewed as an acoustic filtering operation.
- The larynx and lungs provide the input, or source excitation.
- The vocal and nasal tracts act as a filter, shaping the spectrum of the resulting signal.

Describing Sounds
- Phonetics is the study of speech sounds and their production, classification, and transcription.
- A phoneme is an abstract unit that can be used for writing a language down in a systematic and unambiguous way.
- Sub-classifications of phonemes:
  - Vowels: air passes freely through the resonators.
  - Consonants: air is partially or totally obstructed in one or more places as it passes through the resonators.

Time-Domain Waveform Example
"two plus seven is less than ten"
[Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003]

Wide-Band Spectrogram
"two plus seven is less than ten" (frequency vs. time)
[Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003]

Phonetic Alphabets
- Allow us to describe the primitive sounds that make up a language.
- Each language has its own set of phonemes.
- Useful for speech recognition, since words can be represented as sequences of phonemes described by a phonetic alphabet.

International Phonetic Alphabet (IPA)
- A phonetic representation standard that describes the sounds of most (if not all) of the world's languages.
- The IPA was last published in 1993 and updated in 1996.
- Issue: the character set is difficult to manipulate on a computer.
- http://www2.arts.gla.ac.uk/ipa/ipa.html

IPA Consonants
[IPA consonant chart]

IPA Vowels
[IPA vowel chart]

American English Phonemes (IPA)
[Table from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003]

Alternative Phonetic Alphabets
- ARPAbet: English only; an ASCII representation with phoneme units represented by 1-2 letters. A similar representation is used by the CMU Sphinx-II recognizer, http://www.speech.cs.cmu.edu/cgi-bin/cmudict
- SAMPA (Speech Assessment Methods Phonetic Alphabet): a computer-readable representation that maps the symbols of the IPA onto ASCII codes.

CMU Sphinx-II Phonetic Symbols (phone: example word)
SIL: (silence)   OY: toy     ER: hurt    ZH: seizure  OW: oat     EH: ed
Z: zee           NG: ping    DX: matter  Y: yield     N: note     DH: thee
W: we            M: me       DD: dud     V: vee       L: lee      D: dee
UW: two          KD: lick    CH: cheese  UH: hood     K: key      BD: Dub
TS: bits         JH: gee     B: be       TH: theta    IY: eat     AY: hide
TD: lit          IX: acid    AXR: user   T: tea       IH: it      AX: abide
SH: she          HH: he      AW: cow     S: sea       GD: bag     AO: ought
R: read          G: green    AH: hut     PD: lip      F: fee      AE: at
P: pee           EY: ate     AA: odd

Example Words and Corresponding CMU Dictionary Transcriptions
basement    B EY S M AX N TD
Bryan       B R AY AX N
perfect     P AXR F EH KD TD
speech      S P IY CH
recognize   R EH K AX G N AY Z
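A pronunciation lexicon like the one above is, in practice, just a table lookup. Below is a minimal sketch: the `LEXICON` dict is a toy five-word dictionary built from the slide's examples (not the real CMU dictionary), and `transcribe` is an illustrative helper name.

```python
# Toy pronunciation lexicon: word -> CMU-style phone sequence.
# Entries copied from the slide above; this is NOT the full CMU dictionary.
LEXICON = {
    "basement":  ["B", "EY", "S", "M", "AX", "N", "TD"],
    "bryan":     ["B", "R", "AY", "AX", "N"],
    "perfect":   ["P", "AXR", "F", "EH", "KD", "TD"],
    "speech":    ["S", "P", "IY", "CH"],
    "recognize": ["R", "EH", "K", "AX", "G", "N", "AY", "Z"],
}

def transcribe(words):
    """Map a word sequence to its phone sequence via lexicon lookup."""
    phones = []
    for w in words:
        phones.extend(LEXICON[w.lower()])
    return phones

print(" ".join(transcribe(["perfect", "speech"])))
# P AXR F EH KD TD S P IY CH
```

Real systems also handle multiple pronunciations per word and out-of-vocabulary words, which this sketch ignores.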

Classifications of Speech Sounds
- Voiced vs. voiceless: voiced if the vocal cords vibrate.
- Nasal vs. oral: nasal if air travels through the nasal cavity with the oral cavity closed.
- Consonant vs. vowel: consonants involve an obstruction of the air stream above the glottis (the glottis is the space between the vocal cords).
- Lateral vs. non-lateral: non-lateral if the air stream passes through the middle of the oral cavity (rather than along the sides of the oral cavity).

Consonants and Vowels
Consonants are characterized by:
- Place of articulation
- Manner of articulation
- Voicing
Vowels are characterized by:
- Lip position
- Tongue height
- Tongue advancement

Places of Articulation
[Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003]

Places of Articulation
- Bilabial: made with the two lips {P, B, M}
- Labio-dental: lower lip and upper front teeth {F, V}
- Dental: tongue tip or blade and upper front teeth {TH, DH}
- Alveolar: tongue tip or blade and alveolar ridge {T, D, N, ...}
- Retroflex: tongue tip and back of the alveolar ridge {R}
- Palato-alveolar: tongue blade and back of the alveolar ridge {SH}
- Palatal: front of the tongue and hard palate {Y, ZH}
- Velar: back of the tongue and soft palate {K, G, NG}

Manners of Articulation
- Stop: complete obstruction with a sudden (explosive) release.
- Nasal: airflow stopped in the oral cavity; with the soft palate lowered, airflow passes through the nasal tract.
- Fricative: articulators brought close together, producing turbulent airflow.

Manners of Articulation
- Retroflex (liquid): the tip of the tongue is curled back slightly (/r/).
- Lateral (liquid): obstruction of the air stream at a point along the center of the oral tract, with incomplete closure between one or both sides of the tongue and the roof of the mouth (/l/).
- Glide: vowel-like, but in initial position within a syllable (/y/, /w/).

American English Consonants by Place and Manner of Articulation
[Table of consonants arranged by place and manner of articulation]

American English Unvoiced Fricatives
[Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003]

Voiced vs. Unvoiced Fricatives
Voiced fricatives tend to be shorter than unvoiced fricatives.
[Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003]

American English Unvoiced Stops
[Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003]

Voiced vs. Unvoiced Stops
[Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003]

Stop Consonant Formant Transitions
[Figure from http://hitchcock.dlt.asu.edu/media5/a_spanias/speech-recognition/real-lectures/pdf/lect04-5.pdf]

Nasal vs. Oral Articulation
[Diagram: the velum is lowered for the nasals /M/, /N/, /NG/]

Spectrogram of Nasals
[Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003]

American English Semivowels
[Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003]

Semivowels
[Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003]

Affricates
[Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003]

Describing Vowels
- Velum position: nasal vs. non-nasal
- Lip shape: rounded vs. unrounded
- Tongue height: high, mid, low; correlated with the first formant position
- Tongue advancement: front, central, back; correlated with the second formant position

American English Vowels
heed (IY), hid (IH), head (EH), had (AE), hod (AA), hawed (AO), hood (UH), who'd (UW)

Vowel Chart
[Chart of vowels arranged by tongue height (high, mid, low) and tongue advancement (front, central, back); vowels shown include IY, IH, EY, EH, AE, AY, AX, OY, UH, AW, OW, AA, AO, UW. Diphthongs shown in green.]

The Vowel Space
[Plot of vowels in the plane of first formant frequency F1 (Hz) vs. second formant frequency F2 (Hz). Vowels shown: 1. IY, 2. IH, 3. EH, 4. AE, 5. UH, 6. AA, 7. AO, 8. UW, 9. (non-English u), 10. AX]

Coarticulation
Notice the position of the second-formant onset in these words: "fie, thigh, sigh, shy" (F AY, TH AY, S AY, SH AY).
http://hctv.humnet.ucla.edu/departments/linguistics/vowelsandconsonants

Sound Classification Summary
- Vowels
- Semivowels
- Consonants: nasals, stops, affricates, fricatives

Spectrogram Reading Video
- "Speech as Eyes See It" (12-minute video), made 1977-1978 by Ron Cole and Victor Zue.
- After 2000-3000 hours of training, phonemes and words can be transcribed from a spectrogram alone, with 80-90% agreement on segments.
- Provided insight into the speech recognition problem during the 1970s.

Review: Probability & Statistics for Speech Recognition

Relative Frequency and Probability
Suppose an experiment is performed N times, with four possible outcomes A, B, C, D. Let N_A be the number of times event A occurs, so that N_A + N_B + N_C + N_D = N. The relative frequency of A is
    r(A) = N_A / N
The probability of event A, P(A), is defined as the relative frequency with which A would occur if the experiment were repeated many times.

Probability Example (cont'd)
    N_A + N_B + N_C + N_D = N
    r(A) + r(B) + r(C) + r(D) = 1
    P(A) = lim_{N→∞} r(A)
    P(A) + P(B) + P(C) + P(D) = 1
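The limit above can be illustrated with a quick simulation: the relative frequency r(A) = N_A/N approaches P(A) as N grows. The four-outcome experiment used here (a fair draw from {A, B, C, D}, so P(A) = 0.25) is an illustrative choice, not from the slides.

```python
import random

# Estimate P(A) as the relative frequency N_A / N of event A over N trials.
random.seed(0)

def relative_frequency(event, trials):
    outcomes = [random.choice("ABCD") for _ in range(trials)]
    return outcomes.count(event) / trials

# r(A) converges toward the true P(A) = 0.25 as N grows.
for n in (100, 10_000, 100_000):
    print(n, relative_frequency("A", n))
```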

Mutually Exclusive Events
For mutually exclusive events A, B, C, ..., M:
    0 ≤ P(A) ≤ 1
    P(A) + P(B) + P(C) + ... + P(M) = 1
P(A) = 0 represents an impossible event; P(A) = 1 represents a certain event.

Joint and Conditional Probability
P(AB) is the joint probability that events A and B both occur:
    P(AB) = lim_{N→∞} N_AB / N
P(A|B) is the conditional probability of event A given that event B has occurred:
    P(A|B) = P(AB) / P(B) = lim_{N→∞} (N_AB / N) / (N_B / N) = lim_{N→∞} N_AB / N_B

Marginal Probability
The probability of an event B occurring across all conditions A_1, ..., A_n (which partition the sample space):
    P(B) = Σ_{k=1}^{n} P(A_k B) = Σ_{k=1}^{n} P(B|A_k) P(A_k)
[Diagram: event B overlapping a partition A_1 ... A_5 of the sample space]

Bayes' Theorem
    P(AB) = P(B|A) P(A)
    P(A_i|B) = P(B|A_i) P(A_i) / P(B)
where
    P(B) = Σ_{k=1}^{n} P(A_k B) = Σ_{k=1}^{n} P(B|A_k) P(A_k)

Bayes' Rule
Bayes' rule allows us to update prior beliefs with new evidence:
    P(W_i|O) = P(O|W_i) P(W_i) / Σ_j P(O|W_j) P(W_j)
Here P(W_i|O) is the posterior probability, P(O|W_i) the likelihood, P(W_i) the prior probability, and the denominator is the evidence P(O).
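A small numeric sketch of Bayes' rule with two word hypotheses. The priors P(W_i) and likelihoods P(O|W_i) below are made-up numbers chosen only to exercise the formula.

```python
# Posterior P(W_i|O) = P(O|W_i) P(W_i) / sum_j P(O|W_j) P(W_j)
priors      = {"W1": 0.7, "W2": 0.3}   # P(W_i), invented
likelihoods = {"W1": 0.2, "W2": 0.6}   # P(O|W_i), invented

evidence = sum(likelihoods[w] * priors[w] for w in priors)      # P(O)
posteriors = {w: likelihoods[w] * priors[w] / evidence for w in priors}

print(posteriors)   # the posteriors always sum to 1
```

Note how the evidence is strong enough here to overturn the prior: W2 ends up more probable than W1 despite its smaller prior.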

Statistically Independent Events
The occurrence of event A does not influence the occurrence of event B:
    P(AB) = P(A) P(B)
    P(A|B) = P(A)
    P(B|A) = P(B)

Random Variables
- Used to describe events in which the number of possible outcomes is infinite.
- The values of the outcomes cannot be predicted with certainty.
- The distribution of the outcome values is known.

Probability Distribution Functions
The probability of the event that the random variable X is less than or equal to the allowed value x:
    F_X(x) = P(X ≤ x)
[Plot: F_X(x) rising from 0 to 1 as x increases]

Probability Distribution Functions
Properties:
    0 ≤ F_X(x) ≤ 1, for −∞ < x < ∞
    F_X(−∞) = 0 and F_X(∞) = 1
    F_X(x) is nondecreasing as x increases
    P(x_1 < X ≤ x_2) = F_X(x_2) − F_X(x_1)

Probability Density Functions (PDF)
The derivative of the probability distribution function:
    f_X(x) = dF_X(x)/dx
Interpretation: f_X(x) dx = P(x < X ≤ x + dx)

Probability Density Functions (PDF)
Properties:
    f_X(x) ≥ 0, for −∞ < x < ∞
    ∫_{−∞}^{∞} f_X(x) dx = 1
    F_X(x) = ∫_{−∞}^{x} f_X(u) du
    ∫_{x_1}^{x_2} f_X(x) dx = P(x_1 < X ≤ x_2)
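These properties can be checked numerically. The sketch below uses the standard Gaussian density (introduced formally a few slides later) and a simple midpoint-rule integrator, an illustrative choice: the density integrates to 1, and F_X(0) = 0.5 by symmetry.

```python
import math

# Standard Gaussian density f_X(x) with mean 0 and variance 1.
def f(x, mu=0.0, var=1.0):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Midpoint-rule numerical integration of f over [lo, hi].
def integrate(lo, hi, steps=100_000):
    dx = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * dx) for i in range(steps)) * dx

total = integrate(-10, 10)   # ≈ 1 (mass beyond ±10 is negligible)
half  = integrate(-10, 0)    # ≈ F_X(0) = 0.5
print(round(total, 6), round(half, 6))
```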

Mean and Variance
The expectation, or mean, of a random variable X:
    µ = E(X) = ∫_{−∞}^{∞} x f_X(x) dx
The variance of a random variable X:
    Var(X) = σ² = E[(X − µ)²] = E(X²) − [E(X)]²

Mean and Variance
Variance properties (if X and Y are independent):
    Var(X + Y) = Var(X) + Var(Y)
    Var(aX + b) = a² Var(X)
    Var(a_1 X_1 + ... + a_n X_n) = a_1² Var(X_1) + ... + a_n² Var(X_n)
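The scaling property Var(aX + b) = a² Var(X) can be verified exactly on a small discrete distribution; the fair six-sided die used here is an arbitrary example, not from the slides.

```python
from statistics import pvariance

# Population variance of a fair die, before and after the affine map aX + b.
X = [1, 2, 3, 4, 5, 6]
a, b = 3, 2
Y = [a * x + b for x in X]

# Var(aX + b) = a^2 Var(X): the shift b has no effect, the scale enters squared.
print(pvariance(X), pvariance(Y))
```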

Covariance and Correlation
The covariance of X and Y:
    Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)],  with Cov(X, Y) = Cov(Y, X)
The correlation of X and Y:
    ρ_XY = Cov(X, Y) / (σ_X σ_Y),  with −1 ≤ ρ_XY ≤ 1
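A direct implementation of the (population) covariance and correlation formulas above, applied to two small made-up samples; `cov` and `corr` are illustrative names.

```python
import math

def cov(xs, ys):
    """Population covariance E[(X - mu_X)(Y - mu_Y)]."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def corr(xs, ys):
    """Correlation rho_XY = Cov(X, Y) / (sigma_X * sigma_Y)."""
    sx = math.sqrt(cov(xs, xs))   # sigma_X
    sy = math.sqrt(cov(ys, ys))   # sigma_Y
    return cov(xs, ys) / (sx * sy)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # a perfect linear function of xs
print(corr(xs, ys))         # rho = 1 for a perfect linear relationship
```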

Gaussian (Normal) Distribution
    f_X(x | µ, σ²) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²))
The parameters can be estimated from N samples:
    µ = E[X] ≈ (1/N) Σ_{n=1}^{N} x_n
    σ² = E[X²] − [E[X]]² ≈ (1/N) Σ_{n=1}^{N} x_n² − [(1/N) Σ_{n=1}^{N} x_n]²
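The estimation formulas above can be exercised on synthetic data: draw samples from a known Gaussian and recover its parameters. The true values (µ = 5, σ = 2) and sample count are arbitrary choices for illustration.

```python
import random

# Draw samples from a known Gaussian, then estimate mu and sigma^2 using
# the sample-average formulas from the slide.
random.seed(1)
samples = [random.gauss(5.0, 2.0) for _ in range(200_000)]   # true mu=5, var=4

N = len(samples)
mu_hat = sum(samples) / N                                    # E[X]
var_hat = sum(x * x for x in samples) / N - mu_hat ** 2      # E[X^2] - E[X]^2

print(round(mu_hat, 2), round(var_hat, 2))   # close to 5 and 4
```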

Gaussian (Normal) Distribution
[Plot of f(X = x | µ, σ²) with µ = 0, σ² = 1, for x from −3 to 3]

Multivariate Distributions
- Characterize more than one random variable at a time.
- Example: features for speech recognition. In speech recognition we typically compute 39-dimensional feature vectors (more about that later), at 100 feature vectors per second of audio.
- We often want to compute the likelihood of the observed features given a known (estimated) distribution that is being used to model some part of a phoneme (more about that later!).

Multivariate Gaussian Distribution
    f(X = x | µ, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp(−(1/2) (x − µ)^T Σ^{−1} (x − µ))
where x is the observed vector of random variables (features), µ is the distribution mean vector, Σ is the distribution covariance matrix, and |Σ| is the determinant of the covariance matrix.

Diagonal Covariance Assumption
Most speech recognition systems assume diagonal covariance matrices (a data-sparseness issue: there is rarely enough data to estimate full covariances reliably). For example, in four dimensions:
    Σ = diag(σ²_11, σ²_22, σ²_33, σ²_44)
    |Σ| = Π_{n=1}^{d} σ²_nn

Diagonal Covariance Assumption
Inverting a diagonal matrix involves simply inverting the elements along the diagonal:
    Σ^{−1} = diag(1/σ²_11, 1/σ²_22, 1/σ²_33, 1/σ²_44)

Multivariate Gaussians with Diagonal Covariance Matrices
Assuming a diagonal covariance matrix (with σ²[d] = σ²_dd),
    f(X = x | µ, Σ) = C exp(−(1/2) Σ_{d=1}^{n} (x[d] − µ[d])² / σ²[d])
where
    C = 1 / ((2π)^{n/2} (Π_{d=1}^{n} σ²[d])^{1/2})
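The diagonal-covariance density above is cheap to evaluate, which is why recognizers use it; in practice it is computed in the log domain to avoid underflow. Below is a minimal sketch under those assumptions; `diag_gaussian_loglik` is an illustrative name, not a real ASR toolkit API.

```python
import math

def diag_gaussian_loglik(x, mu, var):
    """log f(x | mu, diag(var)) for a feature vector x (lists of floats)."""
    n = len(x)
    # log of the normalizing constant C from the slide
    log_c = -0.5 * (n * math.log(2 * math.pi) + sum(math.log(v) for v in var))
    # quadratic term: -(1/2) sum_d (x[d] - mu[d])^2 / var[d]
    return log_c - 0.5 * sum((xi - mi) ** 2 / vi
                             for xi, mi, vi in zip(x, mu, var))

# 1-D sanity check: at x = mu with var = 1, the density is 1/sqrt(2*pi).
ll = diag_gaussian_loglik([0.0], [0.0], [1.0])
print(round(math.exp(ll), 4))  # 0.3989
```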

Multivariate Mixture Gaussians
The distribution is governed by several Gaussian density functions: a sum of Gaussians (with w_m the mixture weight),
    f(x) = Σ_{m=1}^{M} w_m N(x; µ_m, Σ_m)
         = Σ_{m=1}^{M} (w_m / ((2π)^{n/2} |Σ_m|^{1/2})) exp(−(1/2) (x − µ_m)^T Σ_m^{−1} (x − µ_m))

Multiple Mixtures (1-D Case)
[Plot of f(x) for x from −7.5 to 6.5: three mixture components used to model the underlying random process of three Gaussians]
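A 1-D mixture like the one plotted above can be evaluated directly from the formula f(x) = Σ_m w_m N(x; µ_m, σ²_m). The three components below are made-up numbers for illustration.

```python
import math

def gaussian(x, mu, var):
    """Univariate Gaussian density N(x; mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_pdf(x, weights, mus, vars_):
    """Mixture density: weighted sum of component Gaussians."""
    return sum(w * gaussian(x, m, v) for w, m, v in zip(weights, mus, vars_))

weights = [0.3, 0.5, 0.2]    # mixture weights w_m, must sum to 1
mus     = [-3.0, 0.5, 4.0]   # component means
vars_   = [1.0, 0.5, 2.0]    # component variances

print(gmm_pdf(0.5, weights, mus, vars_))
```

Because each component integrates to 1 and the weights sum to 1, the mixture is itself a valid density, yet it can be multimodal in a way no single Gaussian can.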

Speech Recognition Problem Formulation

Problem Description
Given a sequence of observations (evidence) from an audio signal,
    O = o_1 o_2 ... o_T
determine the underlying word sequence,
    W = w_1 w_2 ... w_m
The number of words (m) is unknown, and the observation sequence is of variable length (T).

Problem Formulation
Goal: minimize the classification error rate.
Solution: maximize the posterior probability,
    Ŵ = argmax_W P(W|O)
The solution requires optimization over all possible word strings!

Problem Formulation
Using Bayes' rule,
    P(W|O) = P(O|W) P(W) / P(O)
Since P(O) does not impact the optimization,
    Ŵ = argmax_W P(W|O) = argmax_W P(O|W) P(W)
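In practice the maximization is carried out over log probabilities. Here is a toy version of Ŵ = argmax_W P(O|W) P(W) that scores two hypotheses (the classic "recognize speech" / "wreck a nice beach" confusion pair); all the scores are invented numbers.

```python
# Toy decoder: pick the hypothesis maximizing log P(O|W) + log P(W).
hypotheses = {
    "recognize speech":   {"log_acoustic": -12.0, "log_lm": -2.0},  # invented
    "wreck a nice beach": {"log_acoustic": -11.5, "log_lm": -6.0},  # invented
}

def score(h):
    s = hypotheses[h]
    return s["log_acoustic"] + s["log_lm"]   # log P(O|W) + log P(W)

best = max(hypotheses, key=score)
print(best)
```

Note the role of the language model: the acoustically slightly better hypothesis loses because its word string is far less probable a priori.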

Problem Formulation
Let's assume words can be represented by a sequence of states, S:
    Ŵ = argmax_W P(O, S | W) P(W) = argmax_W P(O|S) P(S|W) P(W)
Words → phonemes → states; states represent smaller pieces of phonemes.

Problem Formulation
Optimize:
    Ŵ = argmax_W Σ_S P(O|S) P(S|W) P(W)
Practical realization: given the observation (feature) sequence O,
- P(O|S) is the acoustic model
- P(S|W) is the lexicon / pronunciation model
- P(W) is the language model

Problem Formulation
- The optimization seeks the most likely word sequence given the observations (evidence).
- We cannot evaluate all possible word/state sequences (there are too many possibilities!).
- We need:
  - A representation for modeling states (HMMs)
  - A means of approximately searching for the best word/state sequence given the evidence (the Viterbi algorithm)
  - And a few other tricks up our sleeves to make it FASTER!

Hidden Markov Models (HMMs)
- Observation vectors are assumed to be generated by a Markov model.
- An HMM is a finite-state machine: at each time t that a state j is entered, an observation o_t is emitted with probability density b_j(o_t).
- A transition from state i to state j is modeled with probability a_ij.
[Diagram: a left-to-right HMM with states S_0, S_1, S_2, self-loops a_00, a_11, a_22, transitions a_01, a_12, emitting observations o_1, o_2, ..., o_T]
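The probability of an observation sequence under such a model is computed by summing over all hidden state paths (the forward algorithm, a companion of the Viterbi algorithm mentioned earlier). Below is a minimal sketch for a left-to-right HMM like the one drawn above; for readability the emissions b_j(o) are discrete lookup tables rather than the Gaussian densities real recognizers use, and all the numbers are invented.

```python
# Left-to-right HMM with 3 states; start in state 0.
A = {                       # transition probabilities a_ij
    0: {0: 0.6, 1: 0.4},
    1: {1: 0.7, 2: 0.3},
    2: {2: 1.0},
}
B = [                       # b_j(o): emission probability of symbol o in state j
    {"x": 0.8, "y": 0.2},
    {"x": 0.3, "y": 0.7},
    {"x": 0.5, "y": 0.5},
]

def forward(obs):
    """P(obs | model), summing over all hidden state sequences."""
    alpha = {0: B[0][obs[0]]}                 # alpha_1(j): start in state 0
    for o in obs[1:]:
        new = {}
        for i, a_i in alpha.items():
            for j, a_ij in A[i].items():
                # alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
                new[j] = new.get(j, 0.0) + a_i * a_ij * B[j][o]
        alpha = new
    return sum(alpha.values())

print(forward(["x", "y", "y"]))
```

Replacing the sum over predecessor states with a max turns this into the Viterbi algorithm, which finds the single best state path instead of the total probability.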

Beads-on-a-String HMM Representation
The word SPEECH is modeled as the concatenated phoneme models S P IY CH, which together emit the observation sequence o_1 ... o_7.

Modeled Probability Distribution
[Diagram: states S_0, S_1, S_2 with self-loops a_00, a_11, a_22, transitions a_01, a_12, and output distributions b_0(o_t), b_1(o_t), b_2(o_t)]

HMMs with Mixtures of Gaussians
Each state's output density is a multivariate mixture-Gaussian distribution:
    b_j(o_t) = Σ_{m=1}^{M} (w_m / ((2π)^{n/2} |Σ_jm|^{1/2})) exp(−(1/2) (o_t − µ_jm)^T Σ_jm^{−1} (o_t − µ_jm))
The model parameters are (1) the means, (2) the variances, and (3) the mixture weights. A sum of Gaussians can model complex distributions.

Hidden Markov Model Based ASR
- The observation sequence is assumed to be known.
- The probability of a particular state sequence can be computed.
- The underlying state sequence is unknown: it is hidden.
[Diagram: states S_0, S_1, S_2 with transitions, emitting observations o_1 ... o_T]

Components of a Speech Recognizer
[Block diagram: Speech → Feature Extraction → Optimization / Search → Text + Timing + Confidence, with the Acoustic Model, Lexicon, and Language Model feeding the search]

Topics for Next Time
- An introduction to hearing
- Speech detection
- Frame-based speech analysis
- Feature extraction methods for speech recognition