Cochlear modeling and its role in human speech recognition
Slide 1 (Allen/IPAM, February 1, 2005)
Cochlear modeling and its role in human speech recognition: Miller-Nicely confusions and the articulation index
Jont Allen, Univ. of IL, Beckman Inst., Urbana IL
Slide 2: A model of human speech recognition (HSR)
The research goal is to identify elemental HSR events, where an event is defined as a perceptual feature.
[Block diagram of recognition levels: the signal s(t) passes through cochlear filter layers to events ($\mathrm{AI}_k$, $e_k$), phones ($s$), syllables ($S$), and words ($W$). The analog objects form the "front end"; the discrete objects form the "back end".]
Slide 3: Modeling MaxEnt HSR
Definition of MaxEnt (a.k.a. "nonsense") syllables:
- A fixed set of {C, V} sounds is drawn from the language of interest.
- A uniform distribution over each C and V is required, to minimize syllable context (hence MaxEnt, maximum entropy).
- MaxEnt CV or CVC syllable scores: $S_{cv} = s^2$, $S_{cvc} = s^3$.
- MaxEnt syllables were first described in Fletcher (1929).
- A set of meaningful words is not MaxEnt; modeling non-MaxEnt syllables requires the context models of Boothroyd (1968) and Bronkhorst et al. (1993).
Fletcher's 1921 band-independence model,
$$s(\mathrm{AI}) \equiv 1 - e = 1 - e_1 e_2 e_3 \cdots e_{20} = 1 - e_{\min}^{\mathrm{AI}}, \quad (1)$$
accurately models MaxEnt HSR (Allen, 1994).
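The band-independence score above is simple to compute. Below is a minimal Python sketch of Eq. (1) and the MaxEnt syllable scores; the value of `e_min` is only an illustrative assumption (the slides leave its value to the data):

```python
def phone_score(ai, e_min=0.015):
    """Fletcher band-independence model: s(AI) = 1 - e_min**AI.

    e_min, the best-case phone error at AI = 1, is an illustrative
    assumption here, not a value given on the slides.
    """
    return 1.0 - e_min ** ai

def syllable_scores(ai, e_min=0.015):
    """MaxEnt syllable scores from the phone score s: S_cv = s**2, S_cvc = s**3."""
    s = phone_score(ai, e_min)
    return s ** 2, s ** 3
```

The syllable scores follow because a MaxEnt syllable is correct only when every one of its phones is independently correct, so the CV and CVC scores are $s^2$ and $s^3$.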
Slide 4: Probabilistic measures of recognition
Recognition measures (MaxEnt = maximum entropy):
- $k$-th band articulation index: $\mathrm{AI}_k \propto \log(\mathrm{snr}_k)$
- Band recognition error: $e_k = e_{\min}^{\mathrm{AI}_k/K}$, with $K = 20$
- MaxEnt phone score: $s = 1 - e_1 e_2 \cdots e_K = 1 - e_{\min}^{\mathrm{AI}}$
- MaxEnt syllable scores: $S_{cv} = s^2$, $S_{cvc} = s^3$
Slide 5: What is an elemental event?
Miller and Nicely's 1955 articulation (confusion) matrix A, measured at [-18, -12, -6, 0, +6, +12] dB SNR (the -6 dB matrix is shown).
[Figure: stimulus vs. response confusion matrix; confusion groups form in A(snr) among the unvoiced, voiced, and nasal consonants.]
Slide 6: The case of /pa/, /ta/, /ka/ with /ta/ spoken
Plot of $A_{i,j}(\mathrm{snr})$ for row $i = 2$ and column $j = 2$: /ta/ spoken ($P_{h|s=2}(\mathrm{SNR})$) and /ta/ heard ($P_{s|h=2}(\mathrm{SNR})$), together with the symmetric part $S_{2,j}(\mathrm{SNR})$ and the skew-symmetric part $A_{2,j}(\mathrm{SNR})$, all vs. SNR [dB].
The solid red curve is the total error $e_2 \equiv 1 - A_{2,2} = \sum_{j \neq 2} A_{2,j}$.
Slide 7: The case of /ma/ vs. /na/
Plots of $S(\mathrm{snr}) \equiv \tfrac{1}{2}(A + A^t)$ with /ma/ and /na/ spoken (rows $i = 15, 16$). The solid red curve is the total error $e_i \equiv 1 - S_{i,i} = \sum_{j \neq i} S_{i,j}$.
[Figure: $S_{15,j}(\mathrm{SNR})$ and $S_{16,j}(\mathrm{SNR})$ vs. SNR [dB].]
This 2-group of sounds is closed, since $S_{/ma/,/ma/}(\mathrm{snr}) + S_{/ma/,/na/}(\mathrm{snr}) \approx 1$.
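The symmetric/skew-symmetric split used throughout these slides is a matrix identity, $A = S + T$ with $S = (A + A^t)/2$ and $T = (A - A^t)/2$. A small NumPy sketch with a toy (not Miller-Nicely) confusion matrix, chosen doubly stochastic so that the rows of S also sum to 1:

```python
import numpy as np

# Toy 3x3 articulation matrix (rows = spoken, columns = heard).
# The numbers are illustrative only; the matrix is doubly stochastic
# so the decomposition e_i = 1 - S_ii = sum_{j!=i} S_ij holds exactly.
A = np.array([[0.80, 0.15, 0.05],
              [0.12, 0.80, 0.08],
              [0.08, 0.05, 0.87]])

S = 0.5 * (A + A.T)      # symmetric part, as plotted on the slides
T = 0.5 * (A - A.T)      # skew-symmetric (asymmetric) part
e = 1.0 - np.diag(S)     # total error e_i = 1 - S_ii
```

Note that A recombines exactly as S + T, and for the MN data the skew part T is small (Conclusions V), which is why the slides work mostly with S.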
Slide 8: Definition of the AI
Let $\mathrm{AI}_k$ be the $k$-th band cochlear channel capacity,
$$\mathrm{AI}_k \propto \log_{10}(1 + c^2\,\mathrm{snr}_k^2), \quad c = 2, \quad \mathrm{AI}_k \le 1,$$
then $\mathrm{AI} = \frac{1}{K}\sum_k \mathrm{AI}_k$ (Allen, 1994; Allen, 2004).
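A sketch of the AI computation from per-band SNRs. The slide gives only a proportionality for $\mathrm{AI}_k$; the 1/3 normalization below (so a band saturates at $\mathrm{AI}_k = 1$ near +30 dB band SNR) is an assumption for illustration, not the slide's constant:

```python
import numpy as np

def articulation_index(snr_bands, c=2.0):
    """AI = (1/K) * sum_k AI_k, with AI_k = min(1, (1/3) log10(1 + c**2 snr_k**2)).

    snr_bands holds the per-band (amplitude) SNRs for the K critical
    bands.  The 1/3 scale factor is an assumed normalization so that a
    band's contribution saturates at 1; the slide states only AI_k <= 1.
    """
    snr = np.asarray(snr_bands, dtype=float)
    ai_k = np.minimum(1.0, np.log10(1.0 + c ** 2 * snr ** 2) / 3.0)
    return float(ai_k.mean())
```

With $K = 20$ bands of zero SNR the AI is 0, and with very large band SNRs every band clips at 1, so the AI saturates at 1.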
Slide 9: Long-term spectrum of female speech
Speech spectrum for female speech, after Dunn and White. The dashed red line shows the approximation, with a slope of -29 dB/decade.
Slide 10: Conversion from SNR to AI
Spectra for wideband SNRs of [-18, -12, -6, 0, +6, +12] dB (speech varied from -18 to +12 dB, noise fixed at 0 dB).
[Figure: spectral level (RMS) vs. frequency (kHz).]
From these spectra, $\mathrm{snr}_k$ is determined for each of the $K = 20$ cochlear critical bands.
Slide 11: AI(SNR) for Miller-Nicely's data
AI(SNR) computed from the Dunn and White spectrum, for female speech in white noise.
[Figure: spectral levels for speech from -18 to +12 dB against noise at 0 dB (left); AI vs. wideband SNR [dB] (right).]
Conversion from $\mathrm{snr}_k$ to AI (Allen, 2005, JASA):
$$\mathrm{AI} = \frac{1}{K}\sum_{k=1}^{K} \mathrm{AI}_k, \quad \mathrm{AI}_k \propto \log_{10}(1 + c^2\,\mathrm{snr}_k^2), \quad c = 2, \quad \mathrm{AI}_k \le 1.$$
Slide 12: MN16 and the AI model
$P_c^{(i)}(\mathrm{AI})$ is the probability of correct identification of the $i$-th consonant, and
$$P_c(\mathrm{AI}) \equiv \frac{1}{16}\sum_i P_c^{(i)}(\mathrm{AI}) = 1 - \left(1 - \tfrac{1}{16}\right) e_{\min}^{\mathrm{AI}}$$
is the chance-corrected model of the mean.
[Figure: the individual $P_c^{(i)}(\mathrm{SNR})$ curves (Tables I-VI), their mean $\mathrm{mean}_i(P_c^{(i)})$, and the model $1 - (1 - 1/16)\,e_{\min}^{\mathrm{AI}}$.]
Band independence seems a near-perfect fit to MN16!
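The chance-corrected mean score is a one-liner. A hedged Python sketch, with `e_min` chosen arbitrarily for illustration:

```python
def p_correct(ai, e_min, n=16):
    """Chance-corrected band-independence model for an n-alternative
    closed-set test: P_c(AI) = 1 - (1 - 1/n) * e_min**AI.

    n = 16 for the MN16 consonant set; e_min is illustrative.
    """
    return 1.0 - (1.0 - 1.0 / n) * e_min ** ai

# At AI = 0 the model degenerates to pure guessing over 16 alternatives.
chance = p_correct(0.0, e_min=0.02)   # -> 0.0625, i.e. 1/16
```

The chance correction matters only at low AI; as AI grows, $e_{\min}^{\mathrm{AI}}$ shrinks and $P_c$ approaches 1 regardless of the set size.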
Slide 13: Log-error probability
Band independence, corrected for chance ($H = 4$ bits):
$$P_e(\mathrm{AI}, H = 4) \equiv 1 - P_c(\mathrm{AI}) = (1 - 2^{-H})\,e_{\min}^{\mathrm{AI}}.$$
This suggests linear log-error plots vs. AI for the MN16 CVs:
$$\log\left(P_e^{(i)}\right) = \log\left(1 - 2^{-H}\right) + \mathrm{AI}\,\log\left(e_{\min}^{(i)}\right).$$
This relation is of the form $Y = a + bX$, so plots of $\log(P_e)$ vs. AI provide a direct test of Fletcher's band-independence model. The individual sounds $P_e^{(i)}(\mathrm{AI})$ from MN16 may each be tested for band independence.
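The linearity test above amounts to an ordinary least-squares line fit in $(\mathrm{AI}, \log P_e)$ coordinates. The sketch below uses synthetic data that obeys the model exactly, standing in for the real MN16 curves:

```python
import numpy as np

# Synthetic P_e(AI) obeying the model exactly (H = 4 bits, e_min = 0.01);
# with the real MN16 data these arrays would come from the measurements.
H, e_min = 4, 0.01
ai = np.linspace(0.05, 0.5, 10)
p_e = (1.0 - 2.0 ** -H) * e_min ** ai

# Fit log(P_e) = a + b*AI (natural log here; any base works the same way).
# Band independence predicts intercept a = log(1 - 2**-H) and
# slope b = log(e_min).
b, a = np.polyfit(ai, np.log(p_e), 1)
```

A sound "passes" the test when its measured curve is well fit by such a line; curvature in $\log P_e$ vs. AI (as for sounds 4, 8, 11, 15, 16 on Slide 15) signals a failure of band independence.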
Slide 14: Testing the band-independence model
CV log-error model:
$$\log\left(P_e^{(i)}(\mathrm{AI})\right) = \log\left(1 - 2^{-4}\right) + \mathrm{AI}\,\log\left(e_{\min}^{(i)}\right).$$
[Figure: four panels of $1 - s_i(\mathrm{AI})$, $1 - \mathrm{mean}_i(P_c^{(i)})$, and $e_{\min}^{(i)}$ for the sound groups {1, 3, 5, 9, 10, ...}, {2, 6, 7, 13, 14}, {4, 8, ...}, and {15, 16}. Solid: model; dashed: MN16 CV sounds $P_c^{(i)}(\mathrm{AI})$.]
Slide 15: Testing the band-independence model (continued)
Sounds [1, 2, 3, 5, 8, 9, 10, 12, 13, 14] pass; sounds [4, 8, 11, 15, 16] fail (nonlinear with respect to AI).
[Figure: the same four panels as Slide 14.]
Slide 16: /ma/ and /na/ vs. AI
The multichannel model is valid for the nasals:
$$S_{i,j}(\mathrm{AI}) \approx \delta_{i,j} + (-1)^{\delta_{i,j}}\,(1 - 2^{-H_g})\,\left(e_{\min}^{(i)}\right)^{\mathrm{AI}} \quad \text{for } \mathrm{AI} > \mathrm{AI}_g,$$
with total error $e_i \equiv 1 - S_{i,i}$ and group entropy $H_g = 1$ bit (i.e., a 2-group).
[Figure: symmetric $S_{15,j}(\mathrm{AI})$ for /ma/ and $S_{16,j}(\mathrm{AI})$ for /na/, with $\mathrm{AI}_g$ marked.]
Slide 17: The case of /pa/, /ta/, /ka/ vs. AI
Fletcher's multichannel model is valid for $i = 1, 2, 3$:
$$S_{i,j}(\mathrm{AI}) \approx \delta_{i,j} + (-1)^{\delta_{i,j}}\,(1 - 2^{-H_g})\,\left(e_{\min}^{(i,j)}\right)^{\mathrm{AI}} \quad \text{for } \mathrm{AI} > \mathrm{AI}_g.$$
[Figure: $S_{1,j}(\mathrm{AI})$ for /pa/, $S_{2,j}(\mathrm{AI})$ for /ta/, and $S_{3,j}(\mathrm{AI})$ for /ka/.]
Band independence holds for the [/pa/, /ta/, /ka/] group: $H_g = \log_2(3)$, $\mathrm{AI}_g = 0.15$, and $e_{\min}^{(i,j)}$ depends on $i, j$.
Slide 18: Conclusions I
- The AI predicts above-chance performance near -20 dB SNR.
- HSR performance saturates near 0 dB SNR.
- There is no overlap between ASR and HSR: ASR is at chance near 0 dB SNR, and ASR performance saturates near +20 dB SNR.
Slide 19: Conclusions II
Fletcher's theory is based on band independence:
$$e(\mathrm{snr}) = e_{\min}^{\mathrm{AI}_1/K}\, e_{\min}^{\mathrm{AI}_2/K}\, e_{\min}^{\mathrm{AI}_3/K} \cdots e_{\min}^{\mathrm{AI}_K/K} = e_{\min}^{\mathrm{AI}}. \quad (2)$$
- Recognition of MaxEnt phones satisfies Eq. (2).
- MaxEnt closed-set test scores may be predicted by $P_c(\mathrm{AI}, H) = 1 - (1 - 2^{-H})\,e_{\min}^{\mathrm{AI}}$.
- The first sign of above-chance performance is grouping.
- The sound groups depend on the noise spectrum.
Slide 20: Conclusions III
The average over the 16 Miller-Nicely consonant confusions, where $P_c^{(i)}(\mathrm{snr})$ is the probability of correct identification of the $i$-th CV, is accurately predicted by the Articulation Index:
$$P_c(\mathrm{snr}) \equiv \frac{1}{16}\sum_{i=1}^{16} P_c^{(i)}(\mathrm{snr}) = 1 - \left(1 - \tfrac{1}{16}\right) e_{\min}^{\mathrm{AI}},$$
with $\mathrm{AI}_k \propto \log_{10}(1 + 4\,\mathrm{snr}_k^2)$, where $k$ is the band index. The chance-corrected error is $(1 - 1/16)\,e_{\min}^{\mathrm{AI}}$; at $\mathrm{AI} = 0$ the model reduces to chance, $P_c(0) = 1/16$.
Slide 21: Conclusions IV
- The AI model, corrected for 1/16 chance guessing, predicts the average Miller-Nicely $P_c$ quite well.
- There are no free parameters in this model of $P_c(\mathrm{snr})$.
- For the MN experiment, the AI stays at or below about 0.5.
- The AI(snr) is not linear in snr, due to the shape of the speech-to-noise spectrum snr(f).
- The individual curves remain to be analyzed and modeled (if possible). How can we explain the variance across sounds?
Slide 22: Conclusions V
The MN data are very close to symmetric. There are a few exceptions, but even for these the asymmetric part is very small.
Slide 23: Conclusions VI
The off-diagonal $P(j|i, \mathrm{AI})$ functions linearly decompose the error for the $i$-th sound, since $\sum_j P_{ij} = 1$:
$$P_e(i, \mathrm{AI}) \equiv 1 - P_{i,i}(\mathrm{AI}) = \sum_{j \neq i} P(j|i, \mathrm{AI}).$$
In the previous figure the red curve is identically the sum of all the blue curves, and the green curve is the model
$$P_c(\mathrm{AI}, H = 4) = 1 - \left(1 - \tfrac{1}{16}\right) e_{\min}^{\mathrm{AI}}.$$
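The row decomposition is a direct consequence of each confusion-matrix row summing to 1. A toy NumPy check (illustrative numbers, not the MN data):

```python
import numpy as np

# Toy row i of a 16x16 confusion matrix P(j|i): the off-diagonal entries
# are illustrative, and the diagonal entry is set so the row sums to 1.
i = 3
row = np.full(16, 0.02)
row[i] = 1.0 - 15 * 0.02

# The "red curve" (total error) equals the sum of the "blue curves"
# (off-diagonal confusion probabilities) at every AI.
p_e = 1.0 - row[i]
blue_sum = row.sum() - row[i]
```

Because this identity holds row by row, plotting the off-diagonal curves shows exactly where the error of each sound goes, i.e. which confusion group absorbs it.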
Slide 24: MN16 vs. MN64
Results of the MN16 (1955) and MN64 CV experiments, plotted as the error $1 - P_c(\mathrm{SNR})$ [%] as a function of both the wideband SNR [dB] and the AI.
Slide 25: Human vs. Γ-tone filters
$$h(t, f_0) = t^3\, e^{-2.22\,\pi\,\mathrm{ERB}(f_0)\,t}\,\cos(2\pi f_0 t) \quad (\Gamma\text{-tone filter}),$$
$$\mathrm{ERB}(f_0) = 24.7\,(4.37 f_0/1000 + 1).$$
[Figure: model neural tuning curves (cilia displacement, solid) vs. gamma-tone filters (dashed), magnitude vs. frequency (kHz).]
Problems with the Γ-tone filter:
- high-frequency slope problem;
- low-frequency tails are wrong;
- wrong bandwidth at higher frequencies;
- missing middle-ear high-pass response.
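The Γ-tone impulse response is easy to sample directly. In the sketch below, the ERB constants $24.7\,(4.37 f_0/1000 + 1)$ follow the standard Glasberg-Moore form, assumed here because the slide's formula is truncated; the sample rate and duration are arbitrary illustrative choices:

```python
import numpy as np

def erb(f0):
    """Equivalent rectangular bandwidth in Hz for f0 in Hz.

    The slide's formula is truncated; the standard Glasberg-Moore
    constants 24.7 * (4.37 * f0/1000 + 1) are assumed here.
    """
    return 24.7 * (4.37 * f0 / 1000.0 + 1.0)

def gammatone(f0, fs=16000.0, dur=0.05):
    """Gamma-tone impulse response from the slide,
    h(t) = t**3 * exp(-2.22*pi*ERB(f0)*t) * cos(2*pi*f0*t),
    sampled at rate fs for dur seconds and normalized to unit peak."""
    t = np.arange(int(dur * fs)) / fs
    h = t ** 3 * np.exp(-2.22 * np.pi * erb(f0) * t) * np.cos(2.0 * np.pi * f0 * t)
    return h / np.abs(h).max()

h = gammatone(1000.0)   # impulse response of a 1 kHz channel
```

The $t^3$ envelope makes this a fourth-order gamma-tone; the slide's criticisms (slopes, tails, bandwidth) concern how this shape's magnitude response compares with neural tuning curves, not the sampling details above.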
Slide 26: Narayan et al. data
[Figure: Narayan et al. (gerbil) data.]
Slide 27: Kiang and Moxon 1979 cochlear USM
Nonlinear upward spread of masking.
Slide 28: Neely model for cat, 1986
An active model of the cochlea and cilia, based on the resonant-TM model. Cat data from Liberman and Delgutte.
Slide 29: Allen model, 1980
[Figure: magnitude vs. frequency (Hz). Solid: model; dashed: cat FTC (Liberman and Delgutte).]
Slide 30: Symmetric form S = (A + Aᵗ)/2
[Figure: $S_{r,c}(\mathrm{AI})$ for each of the 16 MN consonants: /pa/, /ta/, /ka/, /fa/, /θa/, /sa/, /ʃa/, /ba/, /da/, /ga/, /va/, /ða/, /za/, /ʒa/, /ma/, /na/.]
Slide 31: References
Allen, J. B. (1994). How do humans process and recognize speech? IEEE Transactions on Speech and Audio Processing, 2(4).
Allen, J. B. (2004). The articulation index is a Shannon channel capacity. In Pressnitzer, D., de Cheveigné, A., McAdams, S., and Collet, L., editors, Auditory Signal Processing: Physiology, Psychoacoustics, and Models. Springer Verlag, New York, NY.
Boothroyd, A. (1968). Statistical theory of the speech discrimination score. J. Acoust. Soc. Am., 43(2).
Bronkhorst, A. W., Bosman, A. J., and Smoorenburg, G. F. (1993). A model for context effects in speech recognition. J. Acoust. Soc. Am., 93(1).
Fletcher, H. (1929). Speech and Hearing. D. Van Nostrand Company, Inc., New York.
Miller, G. A. and Nicely, P. E. (1955). An analysis of perceptual confusions among some English consonants. J. Acoust. Soc. Am., 27(2).
Slide 32: Latest from Sandeep
[Figure: fraction of utterances (%) above the error threshold vs. error threshold (%).]
More informationGaussian Mixture Model Based Coding of Speech and Audio
Gaussian Mixture Model Based Coding of Speech and Audio Sam Vakil Department of Electrical & Computer Engineering McGill University Montreal, Canada October 2004 A thesis submitted to McGill University
More informationText Independent Speaker Identification Using Imfcc Integrated With Ica
IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) e-issn: 2278-2834,p- ISSN: 2278-8735. Volume 7, Issue 5 (Sep. - Oct. 2013), PP 22-27 ext Independent Speaker Identification Using Imfcc
More informationNew Insights Into the Stereophonic Acoustic Echo Cancellation Problem and an Adaptive Nonlinearity Solution
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL 10, NO 5, JULY 2002 257 New Insights Into the Stereophonic Acoustic Echo Cancellation Problem and an Adaptive Nonlinearity Solution Tomas Gänsler,
More informationSpeech Spectra and Spectrograms
ACOUSTICS TOPICS ACOUSTICS SOFTWARE SPH301 SLP801 RESOURCE INDEX HELP PAGES Back to Main "Speech Spectra and Spectrograms" Page Speech Spectra and Spectrograms Robert Mannell 6. Some consonant spectra
More informationTime-domain representations
Time-domain representations Speech Processing Tom Bäckström Aalto University Fall 2016 Basics of Signal Processing in the Time-domain Time-domain signals Before we can describe speech signals or modelling
More informationMDS codec evaluation based on perceptual sound attributes
MDS codec evaluation based on perceptual sound attributes Marcelo Herrera Martínez * Edwar Jacinto Gómez ** Edilberto Carlos Vivas G. *** submitted date: March 03 received date: April 03 accepted date:
More information200Pa 10million. Overview. Acoustics of Speech and Hearing. Loudness. Terms to describe sound. Matching Pressure to Loudness. Loudness vs.
Overview Acoustics of Speech and Hearing Lecture 1-2 How is sound pressure and loudness related? How can we measure the size (quantity) of a sound? The scale Logarithmic scales in general Decibel scales
More informationCharacterization of phonemes by means of correlation dimension
Characterization of phonemes by means of correlation dimension PACS REFERENCE: 43.25.TS (nonlinear acoustical and dinamical systems) Martínez, F.; Guillamón, A.; Alcaraz, J.C. Departamento de Matemática
More informationEstimating Correlation Coefficient Between Two Complex Signals Without Phase Observation
Estimating Correlation Coefficient Between Two Complex Signals Without Phase Observation Shigeki Miyabe 1B, Notubaka Ono 2, and Shoji Makino 1 1 University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki
More informationUpper Bound Kullback-Leibler Divergence for Hidden Markov Models with Application as Discrimination Measure for Speech Recognition
Upper Bound Kullback-Leibler Divergence for Hidden Markov Models with Application as Discrimination Measure for Speech Recognition Jorge Silva and Shrikanth Narayanan Speech Analysis and Interpretation
More informationAcoustics 08 Paris 6013
Resolution, spectral weighting, and integration of information across tonotopically remote cochlear regions: hearing-sensitivity, sensation level, and training effects B. Espinoza-Varas CommunicationSciences
More informationR E S E A R C H R E P O R T Entropy-based multi-stream combination Hemant Misra a Hervé Bourlard a b Vivek Tyagi a IDIAP RR 02-24 IDIAP Dalle Molle Institute for Perceptual Artificial Intelligence ffl
More informationTemporal Modeling and Basic Speech Recognition
UNIVERSITY ILLINOIS @ URBANA-CHAMPAIGN OF CS 498PS Audio Computing Lab Temporal Modeling and Basic Speech Recognition Paris Smaragdis paris@illinois.edu paris.cs.illinois.edu Today s lecture Recognizing
More informationHidden Markov Model and Speech Recognition
1 Dec,2006 Outline Introduction 1 Introduction 2 3 4 5 Introduction What is Speech Recognition? Understanding what is being said Mapping speech data to textual information Speech Recognition is indeed
More informationShort-Time ICA for Blind Separation of Noisy Speech
Short-Time ICA for Blind Separation of Noisy Speech Jing Zhang, P.C. Ching Department of Electronic Engineering The Chinese University of Hong Kong, Hong Kong jzhang@ee.cuhk.edu.hk, pcching@ee.cuhk.edu.hk
More informationChirp Transform for FFT
Chirp Transform for FFT Since the FFT is an implementation of the DFT, it provides a frequency resolution of 2π/N, where N is the length of the input sequence. If this resolution is not sufficient in a
More informationA Generalized Subspace Approach for Enhancing Speech Corrupted by Colored Noise
334 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL 11, NO 4, JULY 2003 A Generalized Subspace Approach for Enhancing Speech Corrupted by Colored Noise Yi Hu, Student Member, IEEE, and Philipos C
More information