Cochlear modeling and its role in human speech recognition


Slide 1: Cochlear modeling and its role in human speech recognition
Miller-Nicely confusions and the articulation index
Jont Allen, Univ. of Illinois, Beckman Institute, Urbana, IL
(IPAM, February 1, 2005)

Slide 2: A model of human speech recognition (HSR)
The research goal is to identify elemental HSR events.
An event is defined as a perceptual feature.
[Block diagram: recognition levels Cochlea -> Event -> Phones -> Syllables -> Words, with signals s(t) -> filter layer -> AI_k, e_k -> s -> S -> W. The analog objects form the "front end", the discrete objects the "back end", and the interface between them is marked "???".]

Slide 3: Modeling MaxEnt HSR
Definition of MaxEnt (a.k.a. "nonsense") syllables:
- A fixed set of {C, V} sounds is drawn from the language of interest.
- A uniform distribution over each C and V is required, to minimize syllable context (MaxEnt).
MaxEnt CV or CVC syllable scores: $S_{cv} = s^2$, $S_{cvc} = s^3$.
MaxEnt syllables were first described in Fletcher (1929).
A set of meaningful words is not MaxEnt; modeling non-MaxEnt syllables requires the context models of Boothroyd (1968) and Bronkhorst et al. (1993).
Fletcher's 1921 band-independence model,
  $s(AI) \equiv 1 - e = 1 - e_1 e_2 e_3 \cdots e_{20} = 1 - e_{\min}^{AI}$,   (1)
accurately models MaxEnt HSR (Allen, 1994).
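The slides contain no code, but Eq. (1) and the MaxEnt syllable scores are easy to sanity-check numerically. Below is a minimal Python sketch; the choice of K = 20 bands follows the slide, while the uniform per-band error of 0.7 in the example is purely illustrative.

```python
import numpy as np

def phone_score(band_errors):
    """Fletcher's band-independence model, Eq. (1): a phone is missed
    only if it is missed in every band, so s = 1 - e_1 e_2 ... e_K."""
    return 1.0 - np.prod(band_errors)

def syllable_scores(s):
    """MaxEnt syllable scores: every phone must be heard correctly,
    so S_cv = s**2 and S_cvc = s**3."""
    return s**2, s**3

# Illustrative example: K = 20 bands, each missing the phone 70% of the time.
e_k = np.full(20, 0.7)
s = phone_score(e_k)                 # 1 - 0.7**20, about 0.9992
S_cv, S_cvc = syllable_scores(s)
print(f"s = {s:.4f}, S_cv = {S_cv:.4f}, S_cvc = {S_cvc:.4f}")
```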

Slide 4: Probabilistic measures of recognition
Recognition measures (MaxEnt = Maximum Entropy):
- $k$-th band articulation index: $AI_k \propto \log(\mathrm{snr}_k)$
- Band recognition error: $e_k = e_{\min}^{AI_k/K}$, with $K = 20$
- MaxEnt phone score: $s = 1 - e_1 e_2 \cdots e_K = 1 - e_{\min}^{AI}$
- MaxEnt syllable scores: $S_{cv} = s^2$, $S_{cvc} = s^3$
[Block diagram as on slide 2.]

Slide 5: What is an elemental event?
Miller and Nicely's 1955 articulation matrix $A$, measured at [-18, -12, -6, 0, 6, 12] dB SNR (the -6 dB matrix is shown).
[Figure: stimulus vs. response confusion matrix; the confusion groups formed in $A(\mathrm{snr})$ fall into unvoiced, voiced, and nasal blocks.]

Slide 6: Case of /pa/, /ta/, /ka/ with /ta/ spoken
Plot of $A_{i,j}(\mathrm{snr})$ for row $i = 2$ and column $j = 2$.
[Figure, four panels vs. SNR (dB): $A_{2,h}(\mathrm{SNR})$, /ta/ spoken, i.e. $P_{h|s=2}(\mathrm{SNR})$; $A_{s,2}(\mathrm{SNR})$, /ta/ heard, i.e. $P_{s|h=2}(\mathrm{SNR})$; the symmetric part $S_{2,j}(\mathrm{SNR})$; and the skew-symmetric part of $A_{2,j}(\mathrm{SNR})$.]
The solid red curve is the total error $e_2 \equiv 1 - A_{2,2} = \sum_{j \neq 2} A_{2,j}$.

Slide 7: The case of /ma/ vs. /na/
Plots of $S(\mathrm{snr}) \equiv \frac{1}{2}(A + A^t)$ with /ma/ and /na/ spoken.
The solid red curve is the total error $e_i \equiv 1 - S_{i,i} = \sum_{j \neq i} S_{i,j}$.
[Figure, two panels vs. SNR (dB): the symmetric $S_{15,j}(\mathrm{SNR})$ for /ma/ and $S_{16,j}(\mathrm{SNR})$ for /na/, with $1 - S_{15}(\mathrm{SNR})$ and $1 - S_{16}(\mathrm{SNR})$ overlaid.]
This 2-group of sounds is closed, since $S_{/ma/,/ma/}(\mathrm{snr}) + S_{/ma/,/na/}(\mathrm{snr}) \approx 1$.
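A small sketch of the matrix bookkeeping behind slides 6-7, assuming a row-normalized confusion matrix; the 3x3 numbers and the closure tolerance are invented for illustration.

```python
import numpy as np

def symmetrize(A):
    """Symmetric form S = (A + A^t)/2 of a row-normalized confusion matrix."""
    return 0.5 * (A + A.T)

def total_error(S, i):
    """Total error for sound i: e_i = 1 - S_ii, which equals the sum of
    the off-diagonal entries of row i when the row sums to 1."""
    return 1.0 - S[i, i]

def is_closed_group(S, group, tol=0.05):
    """A group is 'closed' when, for each member, essentially all response
    probability stays inside the group: sum over j in group of S_ij ~ 1."""
    return all(abs(S[i, group].sum() - 1.0) < tol for i in group)

# Toy /pa/, /ta/, /ka/ confusion matrix at one SNR (rows sum to 1):
A = np.array([[0.60, 0.30, 0.10],
              [0.25, 0.65, 0.10],
              [0.15, 0.05, 0.80]])
S = symmetrize(A)
print(total_error(S, 0))              # e_/pa/ = 0.40
print(is_closed_group(S, [0, 1, 2]))  # True: the 3-group is closed
```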

Slide 8: Definition of the AI
Let $AI_k$ be the $k$-th band cochlear channel capacity:
  $AI_k \equiv \frac{10}{30}\log_{10}(1 + c^2\,\mathrm{snr}_k^2)$, with $c = 2$ and $AI_k \le 1$;
then $AI \equiv \frac{1}{K}\sum_k AI_k$ (Allen, 1994; Allen, 2004).
[Block diagram as on slide 2.]

Slide 9: Long-term spectrum of female speech
Speech spectrum for female speech (Dunn and White).
[Figure: the dashed red line shows the approximation, with a slope of 29 dB/decade.]

Slide 10: Conversion from SNR to AI
Spectra for wideband SNRs of [-18, -12, -6, 0, 6, 12] dB.
[Figure: RMS spectral level vs. frequency (kHz); speech from -18 to +12 dB, noise at 0 dB.]
$\mathrm{snr}_k$ is determined for $K = 20$ cochlear critical bands.

Slide 11: AI(SNR) for Miller and Nicely's data
AI(SNR) computed from the Dunn and White spectrum.
[Figure, two panels: RMS spectral level vs. frequency (kHz) for speech from -18 to +12 dB over noise at 0 dB; and AI vs. wideband SNR (dB) for female speech in white noise, rising from 0 to about 0.5.]
Conversion from $\mathrm{snr}_k$ to AI (Allen, 2005, JASA):
  $AI = \frac{1}{60}\sum_{k=1}^{20}\log_{10}(1 + c^2\,\mathrm{snr}_k^2)$, with $c = 2$ and $AI_k \le 1$.
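The snr-to-AI conversion above is direct to implement. A minimal sketch follows; the flat per-band amplitude SNR in the example is made up, while the per-band cap $AI_k \le 1$ and the constant $c = 2$ follow the slide.

```python
import numpy as np

def articulation_index(snr_k, c=2.0):
    """AI from per-band amplitude SNRs (Allen 2005, JASA):
    AI_k = (10/30) log10(1 + c^2 snr_k^2), capped at 1 per band,
    then averaged over the K cochlear critical bands."""
    snr_k = np.asarray(snr_k, dtype=float)
    ai_k = np.minimum((10.0 / 30.0) * np.log10(1.0 + c**2 * snr_k**2), 1.0)
    return ai_k.mean()

# Example: a flat 0 dB amplitude SNR (snr_k = 1) in all K = 20 bands:
print(articulation_index(np.ones(20)))   # about 0.233
```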

Slide 12: MN16 and the AI model
$P_c^{(i)}(AI)$ for the $i$-th consonant, and the mean $\overline{P_c}(AI) \equiv \frac{1}{16}\sum_i P_c^{(i)}(AI)$.
$P_c(AI) = 1 - (1 - \frac{1}{16})\,e_{\min}^{AI}$ is the chance-corrected model.
[Figure, two panels: $P_c^{(i)}$ for $i = 1 \ldots 16$ and their mean (Tables I-VI), plotted vs. SNR with the line $(\mathrm{SNR}+20)/30$ for reference, and plotted vs. AI with the model $1 - (1 - 1/16)\,e_{\min}^{AI}$ overlaid.]
Band independence seems a perfect fit to MN16!

Slide 13: Log-error probability
Band independence, corrected for chance, $P_e(AI, H = 4)$:
  $1 - P_c(AI) = (1 - 2^{-H})\,e_{\min}^{AI}$.
This suggests linear log-error plots vs. AI for the MN16 CVs:
  $\log(P_e^{(i)}) = \log(1 - \frac{1}{16}) + AI\,\log(e_{\min})$.
This relation is of the form $Y = a + bX$, so plots of $\log P_e(AI)$ vs. AI provide a nice test of Fletcher's band-independence model.
The individual sounds $P_e^{(i)}(AI)$ from MN16 may each be tested for band independence.
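This $Y = a + bX$ form suggests a simple regression test, sketched below on synthetic data: fit $\log P_e$ against AI and compare the intercept with $\log(1 - 2^{-H})$. The AI grid and $e_{\min} = 0.01$ are invented for the demonstration.

```python
import numpy as np

def fit_log_error(ai, pe):
    """Fit log10(Pe) = a + b*AI.  Under chance-corrected band independence,
    a ~ log10(1 - 2**-H) and 10**b estimates e_min for that sound."""
    b, a = np.polyfit(ai, np.log10(pe), 1)   # polyfit returns (slope, intercept)
    return a, b

# Synthetic per-sound errors obeying the model with H = 4, e_min = 0.01:
ai = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
pe = (1.0 - 2.0**-4) * 0.01**ai
a, b = fit_log_error(ai, pe)
print(a, np.log10(1.0 - 2.0**-4))   # intercepts agree
print(10.0**b)                      # recovers e_min ~ 0.01
```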

Slide 14: Testing the band-independence model
CV log-error model: $\log(P_e^{(i)}(AI)) = \log(1 - 2^{-4}) + AI\,\log(e_{\min}^{(i)})$.
[Figure, four log-error panels vs. AI, grouping the sounds as 1, 3, 5, 9, 10, 12; 2, 6, 7, 13, 14; 4, 8, 11; and 15, 16. Each panel shows $1 - s_i(AI)$ for the CV sounds (dashed), the MN16 mean $1 - \overline{P_c}(AI)$, and the $e_{\min}$ model.]

Slide 15: Testing the band-independence model
Sounds [1, 2, 3, 5, 6, 7, 9, 10, 12, 13, 14] pass; sounds [4, 8, 11, 15, 16] fail (nonlinear with respect to AI).
[Figure: the same four log-error panels as on slide 14.]

Slide 16: /ma/ and /na/ vs. AI
The multichannel model is valid for the nasals:
  $S_{i,j}(AI) \approx \delta_{i,j} + (-1)^{\delta_{i,j}}\,(e_{\min}^{(i)})^{AI}$
  $1 - S_{i,i} = (1 - 2^{-H_g})\,(e_{\min}^{(i)})^{AI}$ for $AI > AI_g$, with $H_g = 1$ bit (i.e., a 2-group).
[Figure, two panels vs. AI: the symmetric $S_{15,j}(AI)$ for /ma/ and $S_{16,j}(AI)$ for /na/, with $1 - S_{15}(AI)$, $1 - S_{16}(AI)$, and the threshold $\hat{AI}_g$ marked.]

Slide 17: Case of /pa/, /ta/, /ka/ vs. AI
Fletcher's multichannel model is valid for $i = 1, 2, 3$:
  $S_{i,j}(AI) \approx \delta_{i,j} + (-1)^{\delta_{i,j}}\,(1 - 2^{-H_g})\,(e_{\min}^{(i,j)})^{AI}$ for $AI > AI_g$.
[Figure, three panels vs. AI: $S_{1,j}(AI)$ for /pa/, $S_{2,j}(AI)$ for /ta/, and $S_{3,j}(AI)$ for /ka/.]
Band independence holds for the [/pa/, /ta/, /ka/] group: $H_g = \log_2(3)$, $AI_g = 0.15$, and $e_{\min}^{(i,j)}$ depends on $i, j$.
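A sketch of the within-group model as written on slides 16-17; the $e_{\min} = 0.01$ is illustrative. For a 2-group ($H_g = 1$ bit) the model is closed by construction, which is easy to verify:

```python
import numpy as np

def S_model(ai, i, j, e_min_ij, H_g):
    """Within-group confusion model:
    S_ij(AI) ~ delta_ij + (-1)**delta_ij (1 - 2**-H_g) e_min_ij**AI,
    valid above the group threshold AI_g."""
    delta = 1.0 if i == j else 0.0
    return delta + (-1.0)**delta * (1.0 - 2.0**-H_g) * e_min_ij**ai

# /ma/-/na/ 2-group, H_g = 1 bit: S_mm + S_mn = 1 at every AI.
ai = np.linspace(0.1, 0.4, 4)
print(S_model(ai, 0, 0, 0.01, 1.0) + S_model(ai, 0, 1, 0.01, 1.0))  # all ones
```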

Slide 18: Conclusions I
- The AI predicts above-chance performance near -20 dB SNR.
- HSR performance saturates near 0 dB SNR.
- There is no overlap between ASR and HSR: ASR chance performance is near 0 dB, and ASR performance saturates near +20 dB SNR.

Slide 19: Conclusions II
Fletcher's theory is based on band independence:
  $e(\mathrm{snr}) = e_{\min}^{AI_1/K}\,e_{\min}^{AI_2/K}\,e_{\min}^{AI_3/K} \cdots e_{\min}^{AI_K/K} = e_{\min}^{AI}$.   (2)
Recognition of MaxEnt phones satisfies Eq. (2).
MaxEnt closed-set test scores may be predicted by $P_c(AI, H) = 1 - (1 - 2^{-H})\,e_{\min}^{AI}$.
The first sign of above-chance performance is grouping.
The sound groups depend on the noise spectrum.

Slide 20: Conclusions III
The average over the 16 Miller-Nicely consonant confusions $P_c^{(i)}(\mathrm{snr})$ is accurately predicted by the Articulation Index:
  $\overline{P_c}(\mathrm{snr}) \equiv \frac{1}{16}\sum_{i=1}^{16} P_c^{(i)}(\mathrm{snr}) = 1 - (1 - \frac{1}{16})\,e_{\min}^{AI}$,
where $P_c^{(i)}$ is the probability of correct identification of the $i$-th CV, and $AI_k \equiv \frac{10}{30}\log_{10}(1 + 4\,\mathrm{snr}_k^2)$ with $k$ the band index.
Chance corresponds to $1 - P_c(AI)|_{AI=0} = 1 - 1/16$; at $AI = 1$ the residual error is $(1 - 1/16)\,e_{\min} \approx 0.005$.

Slide 21: Conclusions IV
- The AI model, corrected for 1/16 chance guessing, predicts the average Miller-Nicely $P_c$ quite well.
- There are no free parameters in this model of $P_c(\mathrm{snr})$.
- For the MN experiment the AI stays below about 0.5.
- The AI(snr) is not linear in snr, due to the shape of snr(f).
- The individual curves remain to be analyzed and modeled (if possible). How can we explain the variance across sounds?

Slide 22: Conclusions V
The MN data are very close to symmetric.
There are a few exceptions, but even for these the asymmetric part is very small.

Slide 23: Conclusions VI
The off-diagonal PI(AI) functions linearly decompose the error for the $i$-th sound, since $\sum_j P_{ij} = 1$:
  $P_e(i, AI) \equiv 1 - P_{i,i}(AI) = \sum_{j \neq i} P(j|i, AI)$.
In the previous figures the red curve is identically the sum of all the blue curves.
The green curve is the model $P_c(AI, H = 4) = 1 - (1 - \frac{1}{16})\,e_{\min}^{AI}$.

Slide 24: MN16 vs. MN64
Results of the MN16-55 and MN64-04 CV experiments, plotted as a function of both the SNR and the AI.
[Figure, two panels: error $1 - P_c(\mathrm{SNR})$ [%] vs. wideband SNR (dB), and the same error vs. AI, for MN16-55 and MN64-04.]

Slide 25: Human vs. Γ-tone filters
  $h(t, f_0) = t^3\,e^{-2.22\,\pi\,\mathrm{ERB}(f_0)\,t}\,\cos(2\pi f_0 t)$ (Γ-tone filter)
  $\mathrm{ERB}(f_0) = 24.7\,(4.37\,f_0/1000 + 1)$
[Figure: model neural tuning curves (cilia displacement, solid) vs. Γ-tone filters (dashed), magnitude vs. frequency.]
Problems with the Γ-tone fit:
- High-frequency slope problem
- Low-frequency tails are wrong
- Wrong bandwidth at higher frequencies
- Missing middle-ear high-pass response
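A minimal sketch of the Γ-tone filter defined above; the sampling rate, duration, FFT length, and peak normalization are arbitrary choices for plotting, not part of the slide.

```python
import numpy as np

def erb(f0):
    """Equivalent rectangular bandwidth (Hz) at center frequency f0 (Hz)."""
    return 24.7 * (4.37 * f0 / 1000.0 + 1.0)

def gammatone(f0, fs=16000, dur=0.05):
    """Impulse response h(t) = t^3 exp(-2.22 pi ERB(f0) t) cos(2 pi f0 t),
    sampled at fs for dur seconds."""
    t = np.arange(int(dur * fs)) / fs
    h = t**3 * np.exp(-2.22 * np.pi * erb(f0) * t) * np.cos(2.0 * np.pi * f0 * t)
    return h / np.max(np.abs(h))      # peak-normalize for comparison plots

# Magnitude response at f0 = 1 kHz, for comparison with neural tuning curves:
h = gammatone(1000.0)
H_db = 20.0 * np.log10(np.abs(np.fft.rfft(h, 8192)) + 1e-12)
```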

Slide 26: Narayan et al. data
Narayan et al. (2000), measured in gerbil.

Slide 27: Kiang and Moxon 1979: cochlear USM
Nonlinear upward spread of masking.

Slide 28: Neely model for cat, 1986
An active model of the cochlea and cilia, based on the resonant TM (tectorial membrane) model. Cat data from Liberman and Delgutte.

Slide 29: Allen model 1980
[Figure: magnitude vs. frequency (Hz), 100 Hz to 10 kHz. Solid: model; dashed: cat FTC (Liberman and Delgutte).]

Slide 30: Symmetric form $S = (A + A^t)/2$
[Figure: 16 panels of $S_{r,c}(AI)$ vs. AI, one per MN16 consonant: /pa/, /ta/, /ka/, /fa/, /θa/, /sa/, /ʃa/, /ba/, /da/, /ga/, /va/, /ða/, /za/, /ʒa/, /ma/, /na/.]

Slide 31: References
Allen, J. B. (1994). How do humans process and recognize speech? IEEE Transactions on Speech and Audio Processing, 2(4):567-577.
Allen, J. B. (2004). The articulation index is a Shannon channel capacity. In Pressnitzer, D., de Cheveigné, A., McAdams, S., and Collet, L., editors, Auditory Signal Processing: Physiology, Psychoacoustics, and Models, pages 314-320. Springer Verlag, New York, NY.
Boothroyd, A. (1968). Statistical theory of the speech discrimination score. J. Acoust. Soc. Am., 43(2):362-367.
Bronkhorst, A. W., Bosman, A. J., and Smoorenburg, G. F. (1993). A model for context effects in speech recognition. J. Acoust. Soc. Am., 93(1):499-509.
Fletcher, H. (1929). Speech and Hearing. D. Van Nostrand Company, Inc., New York.
Miller, G. A. and Nicely, P. E. (1955). An analysis of perceptual confusions among some English consonants. J. Acoust. Soc. Am., 27(2):338-352.

Slide 32: Latest from Sandeep
[Figure: error threshold [%] vs. fraction of utterances [%] (log scale), showing the fraction of utterances above the error threshold.]