Cochlear modeling and its role in human speech recognition

Allen/IPAM February 1, 2005 p. 1/3 Cochlear modeling and its role in human speech recognition Miller Nicely confusions and the articulation index Jont Allen Univ. of IL, Beckman Inst., Urbana IL

Allen/IPAM February 1, 2005 p. 2/3 odel of human speech recognition (HSR The research goal is to identify elemental HSR events An event is defined as a perceptual feature Recognition level Cochlea Event Phones Syllables Words Layer Layer s(t) Filters Layer Layer k e k s S W Analog objects "Front end"??? Discrete objects "Back end"

Allen/IPAM February 1, 2005 p. 3/3 Modeling MaxEnt HSR Definition of MaxEnt (a.k.a. nonsense ) syllables: A fixed set of { C,V } sounds are drawn from the language of interest A uniform distribution for each C and V is required, to minimize syllable context ( MaxEnt) MaxEnt CV or CVC syllable score: S cv = s 2, S cvc = s 3 MaxEnt syllables first described in Fletcher, 1929 A set of meaningful words is not MaxEnt Modeling non-maxent syllables require the context models of Boothroyd, 1968; Bronkhorst et al., 1993 Fletcher s 1921 band independence model: s() 1 e = 1 e 1 e 2 e 3... e 20 = 1 e min (1) accurately models MaxEnt HSR (Allen, 1994)

Allen/IPAM February 1, 2005 p. 4/3 Probabilistic measures of recognition Recognition measures (MaxEnt Maximum Entropy): k th band articulation index: k log(snr k ) Band recognition error: e k = e k/k min with K = 20 MaxEnt phone score: s = 1 e 1 e 2... e K = 1 e k min MaxEnt syllable score: S cv = s 2, S cvc = s 3 Recognition level Cochlea Event Phones Syllables Words Layer Layer s(t) Filters Layer Layer k e k s S W Analog objects "Front end"??? Discrete objects "Back end"

Allen/IPAM February 1, 2005 p. 5/3 What is an elemental event? Miller-Nicely s 1955 articulation matrix A measured at [-18, -12, -6 shown, 0, 6, 12] db SNR STIMULUS UNVOICED RESPONSE Confusion groups formed in A(snr) VOICED NASAL

Allen/IPAM February 1, 2005 p. 6/3 Case of /pa/, /ta/, /ka/ with /ta/ spoken Plot of A i,j (snr) for row i =2 and column j =2 10 0 AM 2,h (SNR): /ta/ spoken 10 0 AM s,2 (SNR): /ta/ heard P h s=2 (SNR) P s h=2 (SNR) 20 10 0 10 SNR [db] Symmetric: /ta/ 10 0 20 10 0 10 SNR [db] Skew symmetric: /ta/ 10 0 S 2,j (SNR) A 2,j (SNR) 20 10 0 10 SNR [db] 20 10 0 10 SNR [db] Solid red curve is total error e 2 1 A 2,2 = j 2 A 2,j

Allen/IPAM February 1, 2005 p. 7/3 The case of /ma/ vs. /na/ Plots of S(snr) 1 2 (A + At ), /ma/, /na/ spoken Solid red curve is total error e i 1 S i,i = j i S i,j 10 0 Symmetric: /ma/ S 15 (SNR) 10 0 Symmetric: /na/ S 16 (SNR) S 15,j (SNR) 1 S 15 (SNR) S 16,j (SNR) 1 S 16 (SNR) 20 10 0 10 SNR [db] 20 10 0 10 SNR [db] This 2-group of sounds is closed since S /ma/,/ma/ (snr) + S /ma/,/na/ (snr) 1

Allen/IPAM February 1, 2005 p. 8/3 Definition of the Let k be the k th band cochlear channel capacity k 10 30 log 10(1 + c 2 snr 2 k ), with c = 2, k 1, then 1 K k (Allen, 1994; Allen, 2004) Recognition level Cochlea Event Phones Syllables Words Layer Layer s(t) Filters Layer Layer k e k s S W Analog objects "Front end"??? Discrete objects "Back end"

Allen/IPAM February 1, 2005 p. 9/3 Long-term spectrum of female speech Speech spectrum for female speech Dunn and White Dashed red line shows the approximation Slope 29 db/decade

Allen/IPAM February 1, 2005 p. 10/3 Conversion from SNR to Spectra for SNRs of [-18, -12, -6, 0, 6, 12] db Speech: 18 to +12, Noise: 0 [db] Spectral level [RMS] 10 1 10 0 10 0 10 1 Frequency (khz) snr k is determined for K = 20 cochlear critical bands

Allen/IPAM February 1, 2005 p. 11/3 (SNR) for Miller Nicely s data (SNR) computed from the Dunn and White spectrum Spectral level [RMS] 10 1 10 0 Speech: 18 to +12, Noise: 0 [db] 10 0 10 1 Frequency (khz) (SNR) Female speech and white noise 0.5 0.4 0.3 0.2 0.1 0 20 10 0 10 Wideband SNR [db] Conversion from SNR k to [Allen 2005, JASA]: k = 1 60 20 k=1 log 10 (1 + c 2 snr 2 k ), c = 2, k 1

Allen/IPAM February 1, 2005 p. 12/3 MN16 and the model P (i) c () for the i th consonant and P c () 1 16 P c () = 1 (1 1 16 )e min i P c (i) () is the chance corrected model 1 0.8 (i) P correct for i=1...16 & mean (Tables I VI) 1 0.8 P c (SNR) 0.6 0.4 mean i (P (i) c ) P (i) 0.2 c (SNR) (SNR+20)/30 0 20 10 0 10 SNR [db] P c () 0.6 0.4 0.2 mean i (P c (i) ) P c (i) () 1 (1 1/16) e min 0 0 0.5 1 Band independence seems a perfect fit to MN16!!

Allen/IPAM February 1, 2005 p. 13/3 Log-error probability Band-independence, corrected for chance, P e (, H = 4) 1 P c () = (1 2 H )e min This suggests linear log-error plots vs. : MN16 CVs log(p (i) e ) = log(1 1 16 ) + log(e min) This relation is of the form Y = a + bx Plots of log(p e )() vs. provide a nice test of Fletcher s band independence model Individual sounds P e (i) () from MN16 may be tested for band independence.

Allen/IPAM February 1, 2005 p. 14/3 Testing the band-independence model CV log-error model: ( ) log () P (i) e = log ( 1 2 4) + log ( ) e (i) min 10 0 Sounds 1,3,5,9,10,12 10 0 Sounds: 2,6,7,13,14 (i) 1 s (), 1 mean (P ), emin i i c (i) 1 s i (), 1 mean i (P c ), emin 0 0.2 0.4 0.6 0 0.2 0.4 0.6 10 0 Sounds: 4,8,11 10 0 Sounds: 15,16 (i) 1 s (), 1 mean (P ), emin i i c 0 0.2 0.4 0.6 (i) 1 s i (), 1 mean i (P c ), emin 0 0.2 0.4 0.6 Model; MN16 P c (); - - -, - - -: CV sounds

Allen/IPAM February 1, 2005 p. 15/3 Testing the band-independence model Sounds: [1, 2, 3, 5, 8, 9, 10, 12, 13, 14] pass Sounds: [4, 8, 11, 15, 16] fail (nonlinear wrt ) 10 0 Sounds 1,3,5,9,10,12 10 0 Sounds: 2,6,7,13,14 (i) 1 s i (), 1 mean i (P c ), emin (i) 1 s (), 1 mean (P ), emin i i c 0 0.2 0.4 0.6 0 0.2 0.4 0.6 10 0 Sounds: 4,8,11 10 0 Sounds: 15,16 (i) 1 s i (), 1 mean i (P c ), emin (i) 1 s (), 1 mean (P ), emin i i c 0 0.2 0.4 0.6 0 0.2 0.4 0.6

Allen/IPAM February 1, 2005 p. 16/3 /ma/ and /na/ vs. The multichannel model is valid for the nasals S i,j () δ i,j + ( 1) δ i,j e min (i) 1 S i,i =1 H g = 1 [bit] (i.e., a 2-group) (1 2 H g ) (e (i) min ) for > g 10 0 Symmetric: /ma/ S 15 () 10 0 Symmetric: /na/ S 16 () S 15,j () ^g 1 S 15 () S 16,j () ^g 1 S 16 () 0 0.1 0.2 0.3 0.4 0 0.1 0.2 0.3 0.4

Allen/IPAM February 1, 2005 p. 17/3 Case of /pa/, /ta/, /ka/ vs. Fletcher s multichannel model is valid for i = 1, 2, 3: ( ) S i,j () δ i,j + ( 1) δ i,j (1 2 H g ) e (i,j) min for > g 10 0 S 1,j () /pa/ 10 0 S 2,j () 10 0 S 3,j () /ka/ /ta/ /ta/ 0 0.5 0 0.5 0 0.5 Band independence holds for the [/pa/, /ta/, /ka/] group: H g = log 2 (3), g = 0.15 and e min (i,j) depends on i, j.

Allen/IPAM February 1, 2005 p. 18/3 Conclusions I The predicts above chance performance near -20 db HSR performance saturates near 0 db SNR No overlap in ASR vs. HSR ASR chance performance near 0 db ASR performance saturates near +20 db SNR

Allen/IPAM February 1, 2005 p. 19/3 Conclusions II Fletcher s theory is based on band independence e(snr) = e 1 K 1 min e 1 K 2 min e 1 K 3 min... e 1 K K min = e min (2) Recognition of MaxEnt phones satisfy 2 MaxEnt close set tests scores may be predicted by P c (, H) = 1 (1 2 H )e min The first sign of > chance performance is grouping The sound groups depend on the noise spectrum

Allen/IPAM February 1, 2005 p. 20/3 Conclusions III The average over the 16 Miller Nicely consonant confusions P c (i) (snr) is accurately predicted by the Articulation Index: P (i) c = 10 30 P c (snr) 1 16 16 i=1 P c (i) (snr) = 1 ( 1 1 ) e min 16 is the probability of correct identification of i th CV k log 10(1 + 4snr 2 k ) where k is the band index Chance error is given by (1 1/16) e min 0.005 1 P c() (1 1/16) =1

Allen/IPAM February 1, 2005 p. 21/3 Conclusions IV The model, corrected for 1/16 chance guessing, predicts the average Miller Nicely P c quite well There are no free parameters in this model of P c (snr) For the MN experiment, the 0.5 The (snr) is not linear in snr, due to the shape of the snr(f) The individual curves remain to be analyzed and modeled (if possible) How can we explain the variance across sounds?

Allen/IPAM February 1, 2005 p. 22/3 Conclusions V The MN data is very close to symmetric There are a few exceptions, but even for these, the asymmetric part is very small

Allen/IPAM February 1, 2005 p. 23/3 Conclusions VI The off-diagonal PI() functions linearly decompose error for i th sound P c (i, ), since j P ij = 1 P e (i, ) 1 P i,i () = j i P (j i, ) In the previous figure the red curve is identically the sum of all the blue curves The green curve is the model ( P c (, H = 4) = 1 1 ) 16 e min

Allen/IPAM February 1, 2005 p. 24/3 MN16 vs. MN64 Results of the MN16 and the MN64 CV experiments, plotted as a function of both the SNR and the. Error (1 P c (SNR)) [%] 10 0 MN16 55 MN64 04 20 10 0 10 Wideband SNR [db] Error (1 P c (SNR)) [%] 10 0 MN16 55 MN64 04 0 0.5 1

Allen/IPAM February 1, 2005 p. 25/3 Human vs. Γ-tone filters h(t, f 0 ) = t 3 e 2.22 π ERB(f 0) t cos(2πf 0 t) ERB(f 0 ) = 24.7(4.37f 0 /1000 + 1) (Γ-tone filter) cilia displmt [solid] Γ tone [dashed] 10 0 Model neural tuning curves vs. Gamma tone filters 10 4 10 2 10 3 10 4 Frequency [khz] High frequency slope problem Low-frequency tails are wrong Wrong bandwidth at higher frequencies Missing middle ear high-pass response

Allen/IPAM February 1, 2005 p. 26/3 Narayan et al. data Narayan et al. 2000 (Gerbil)

Allen/IPAM February 1, 2005 p. 27/3 Kiang and Moxon 1979 cochlear USM Nonlinear upward spread of masking

Allen/IPAM February 1, 2005 p. 28/3 Neely model for Cat 1986 Active model of the cochlea and cilia, based on the resonant TM model. Cat data from Liberman and Delgutte.

Allen/IPAM February 1, 2005 p. 29/3 Allen model 1980 Solid: Model; dashed Cat FTC (Liberman and Delgutte) 1 2 MAGNITUDE 3 4 5 10 2 10 3 10 4 FREQUENCY (Hz)

Allen/IPAM February 1, 2005 p. 30/3 Symmetric form S = (A + A t )/2 10 0 /pa/ 10 0 /ta/ 10 0 /ka/ 10 0 /fa/ S r,c () 0 0.5 10 0 /Θ/ 0 0.5 10 0 /saw/ 0 0.5 10 0 / / 0 0.5 10 0 /ba/ 0 0.5 10 0 /da/ 0 0.5 10 0 /ga/ 0 0.5 10 0 /va/ 0 0.5 10 0 / / 0 0.5 10 0 /za/ 0 0.5 10 0 /3/ 0 0.5 10 0 /ma/ 0 0.5 10 0 /na/ 0 0.5 0 0.5 0 0.5 0 0.5

Allen/IPAM February 1, 2005 p. 31/3 References Allen, J. B. (1994). How do humans process and recognize speech? IEEE Transactions on speech and audio, 2(4):567 577. Allen, J. B. (2004). The articulation index is a Shannon channel capacity. In Pressnitzer, D., de Cheveigné, A., McAdams, S., and Collet, L., editors, Auditory signal processing: physiology, psychoacoustics, and models, chapter Speech, pages 314 320. Springer Verlag, New York, NY. Boothroyd, A. (1968). Statistical theory of the speech discrimination score. J. Acoust. Soc. Am., 43(2):362 367. Bronkhorst, A. W., Bosman, A. J., and Smoorenburg, G. F. (1993). A model for context effects in speech recognition. J. Acoust. Soc. Am., 93(1):499 509. Fletcher, H. (1929). Speech and Hearing. D. Van Nostrand Company, Inc., New York. Miller, G. A. and Nicely, P. E. (1955). An analysis of perceptual confsions among some english consonants. J. Acoust. Soc. Am., 27(2):338 352.

Allen/IPAM February 1, 2005 p. 32/3 title Latest from Sandeep 100 Fraction of utterances above the Error Threshold 90 80 Error Threshold [%] 70 60 50 40 30 20 10 0 10 0 10 1 10 2 Fraction of Utterances [%]