Lecture 5: Jan 19, 2005

EE516 Computer Speech Processing, Winter 2005
University of Washington, Dept. of Electrical Engineering
Lecture 5: Jan 19, 2005
Lecturer: Prof. J. Bilmes
Scribe: Kevin Duh

5.1 Overview

In the last lecture, we studied the anatomy of the hearing organ (outer, middle, and inner ear) and two views of frequency encoding in the hair cells (Place Theory and Timing Theory). In this lecture, we look at two broad topics:

1. How sound and speech are perceived
2. How speech is produced

In the first topic, we shall discover that sound and speech perception is not as simple as a bunch of dancing hair cells. Instead, psychoacoustic experiments have shown complex phenomena in sound perception, such as hearing thresholds, a nonlinear relationship between intensity and loudness, and temporal and simultaneous masking. We also briefly discuss the difficulty of speech perception research and survey some notable experiments on the intelligibility of spectrally filtered speech and Gaussian-scaled speech.

In the second topic, we will develop a simple mathematical model for speech production based on the source-filter model. Specifically, we will model the vocal tract as a uniform lossless tube and derive the dynamics of acoustic pressure waves within it. As we shall see, this simple model elegantly explains the formant structure of the vowel schwa, and with some extensions, we can also model the vocal tract for other vowels. Finally, we derive the 1-D wave equations that serve as the basis for the lossless tube derivation.

5.2 Perception of Sound and Speech

5.2.1 Sound Perception

5.2.1.1 Thresholds of Hearing and Feeling

The threshold of hearing is the minimum intensity at which one can perceive a sound; the threshold of feeling is the point where any increase in sound intensity begins to cause physical pain. As shown in Figure 5.1, these thresholds vary across frequencies. The intensity of sound is measured in terms of sound pressure level (SPL) in units of decibels (dB).

Note that the threshold of hearing decreases sharply as we progress from 20 Hz to 2000 Hz, but increases rapidly thereafter. This bandpass phenomenon is due to the filtering of the outer and middle ear, and to the sensitivity of hair cells to different frequencies. From this curve, we can note the interesting fact that it is easier to hear female speakers at low volumes, since females have a pitch range corresponding to a lower threshold of hearing. (Recall that the F0 of female speakers is typically higher than that of male speakers.)
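The thresholds above are quoted in dB SPL, so it may help to see how a pressure maps to that scale. The following is a minimal sketch, assuming the standard reference pressure of 20 micropascals for SPL in air (the notes do not state the reference); the function names `spl_db` and `pressure_from_spl` are illustrative only.

```python
import math

# Assumption: standard SPL reference pressure in air, 20 micropascals (RMS).
P_REF = 20e-6


def spl_db(pressure_rms_pa: float) -> float:
    """Return the sound pressure level in dB SPL for an RMS pressure in pascals."""
    if pressure_rms_pa <= 0:
        raise ValueError("pressure must be positive")
    return 20.0 * math.log10(pressure_rms_pa / P_REF)


def pressure_from_spl(level_db: float) -> float:
    """Inverse mapping: RMS pressure in pascals for a level in dB SPL."""
    return P_REF * 10.0 ** (level_db / 20.0)


if __name__ == "__main__":
    # Each factor of 10 in pressure adds 20 dB.
    for p in (20e-6, 2e-3, 20.0):
        print(f"{p:g} Pa -> {spl_db(p):.1f} dB SPL")
```

Because the scale is logarithmic, the wide range between the threshold of hearing and the threshold of feeling compresses into a range of roughly a hundred decibels.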

The plot also shows the frequency range of speech, over which the threshold of hearing is relatively constant. The threshold of hearing is higher at 7000 Hz, but only fricatives contain significant energy at these frequencies, so this has little effect on speech intelligibility and naturalness. Some believe that evolution has adapted speech production to fit this region of auditory perception; others believe that the auditory system, which basically functions only as a hearing organ, is under less selective pressure than the vocal system, which simultaneously serves speaking, eating, and smelling.

Figure 5.1: Hearing thresholds across different frequencies and frequency range of speech. [O87]

5.2.1.2 Loudness Intensity

The perception of loudness is not the same as the raw intensity of sound pressure waves. Fletcher and Munson developed a plot (Figure 5.2) which shows that our perception of loudness varies across frequencies. Sound pressure waves at different intensities may be perceived as equally loud, depending on their frequencies. The points perceived as equally loud form an equal loudness curve.

Since the intensity of sound pressure waves does not correspond to the perception of loudness, the phon is defined as a unit of loudness. As shown in Figure 5.3, the equal loudness curves define the different levels of phons. Speech typically lies between 20 and 80 phons.

Figure 5.2: Fletcher and Munson curves [O87]

Figure 5.3: Phons and Equal Loudness Curves [O87]

5.2.1.3 Masking

Masking occurs when the perception of one sound is obscured by the presence of another sound. This happens when one sound raises the threshold of hearing for the other sound. The two types of masking are:

- Simultaneous masking: two sounds occur at once, and one masks the other. The lower-frequency sound masks the higher-frequency sound, and this typically occurs when both sounds fall within the same neural critical band.
- Temporal masking: two sounds occur in sequence, and one masks the other. If the earlier sound masks the later sound, this is known as forward temporal masking. The opposite case is called backward temporal masking. Forward masking can be explained by neural fatigue, where the neuron needs some time to recharge after firing at the first sound. Backward masking very likely involves a blocking phenomenon where the processing of the earlier sound at a higher auditory level is interrupted by a later, louder sound.

Masking is widely exploited in audio coding. Sounds that are masked do not need to be encoded because the listener will not be able to perceive them anyway. (However, a select few people with "golden ears" may be able to hear the additive noise created by such a lossy encoder.)

5.2.1.4 Neural Tuning Curves

A neural tuning curve shows the threshold of hearing for a single neuron. These curves are helpful in determining the critical band, the range of frequencies over which a neuron is active. They can be determined by probing a single neuron and measuring its response, in terms of spiking rate, to a given stimulus. The neural tuning curves obtained by probing an anesthetized cat are shown in Figure 5.4. These curves show the threshold where neurons begin to respond above their spontaneous firing rate. From the figure, we observe that (1) neural tuning curves exhibit roughly constant Q, and (2) different neurons have different tuning curves and critical frequencies. This is one piece of evidence that supports von Békésy's Place Theory of frequency encoding.

The psychophysical tuning curve for humans can be obtained by using masking (Figure 5.5). One such procedure uses a fixed-level tone in narrowband noise (the masker). The tone is fixed at a low level (about 10 dB SPL) to ensure that only one auditory filter responds. Then, the masker frequency and intensity are adjusted to the points where the listener loses perception of the tone (due to the masking effect). These points form the upside-down critical frequency curve of that auditory filter.
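To make the "roughly constant Q" observation concrete: Q is the ratio of a filter's center frequency to its bandwidth, so constant Q means the bandwidth grows in proportion to the center frequency. The sketch below is illustrative only; the value `ASSUMED_Q = 8.0` is a hypothetical number chosen for the example, not a measurement from the notes or from [O87].

```python
def bandwidth_hz(center_hz: float, q: float) -> float:
    """Bandwidth of a filter with quality factor q centered at center_hz (Q = fc / bandwidth)."""
    if q <= 0:
        raise ValueError("q must be positive")
    return center_hz / q


# Hypothetical quality factor, assumed purely for illustration.
ASSUMED_Q = 8.0

if __name__ == "__main__":
    for fc in (500.0, 1000.0, 2000.0, 4000.0):
        bw = bandwidth_hz(fc, ASSUMED_Q)
        print(f"center {fc:6.0f} Hz -> bandwidth {bw:6.1f} Hz")
```

Under this assumption, doubling the center frequency doubles the bandwidth, which is why such tuning curves look similarly shaped when plotted on a logarithmic frequency axis.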

Figure 5.4: Neural tuning curves taken from anesthetized cats.

Figure 5.5: Psychophysical tuning curves of humans.

5.2.2 Speech Perception

5.2.2.1 Challenges in Speech Perception Research

Speech perception occurs at a much higher level in the auditory cortex or brain.

This is an active research area, as not much is yet understood. The goal is to find invariant acoustic cues for different speech sounds. For example, what discriminates between phonemes, phones, syllables, words, phrases, or sentences? These cues are difficult to identify because the acoustic cues for a sound change depending on its context. For example, Figure 5.6 shows the formants of [di] and [du]. Although the phone [d] is the same in both cases, the vowel that follows it changes the F2 transition drastically. Therefore, humans must be using more information than formants when perceiving speech.

Figure 5.6: Formant transition is different depending on context. Example of [di], [du]. [LB88]

5.2.2.2 Spectral Regions of Speech Perception

The frequencies from 200 Hz to 5500 Hz are the most important for speech perception. Experiments in speech perception are usually performed by filtering out various spectral regions in a speech signal, then measuring intelligibility with test subjects. For example, after filtering out frequencies below 1000 Hz, the discriminability of voicing and manner of articulation decreases (/b/ vs. /p/ vs. /v/). After filtering out frequencies above 1200 Hz, the discriminability of place of articulation drops (/p/ vs. /t/).

Note that the limited bandwidth of telephone channels is good enough for speech intelligibility. Although the E-set (/p/, /d/, /e/, /g/, /c/) and fricatives like /f/ and /s/ may be difficult to discriminate, intelligibility is usually not a problem when words are spoken in context.

Experiments with Gaussian-scaled speech test intelligibility by zeroing out windows of speech at different locations in time. The surprising finding is that if the Gaussian windows are short enough in duration, no particular zeroed location on the time/frequency axis contains critical information whose loss makes the speech incomprehensible. Apparently we can infer what is missing from the other sound segments.

The bottom line is that humans are remarkably good at perceiving speech. Even in the presence of noise or missing sounds, we automatically use a variety of cues and context to determine what was spoken.

5.3 Speech Production: A Mathematical Model of the Vocal Tract

A mathematical model of the speech production process is important because it sheds light on the characteristics of speech and serves as the basis for speech analysis and synthesis. With a good model, we should be able to build good speech synthesizers. Further, if we can produce an inverse model that goes from the acoustic signal to model parameters, we should be able to do better speech recognition as well.

To begin our derivation of a mathematical model, we start with a joke:

JOKE: A group of wealthy investors wanted to predict the outcome of horse races so they could become even richer. Therefore, they hired a group of eminent physicists from around the world to research the issue. After a year, the physicists returned with the promise of a solution. They said they had developed a method that would accurately predict the outcome of any race without fail! The investors were very eager to hear about this great discovery. So the head physicist reported, "Well, first we need to simplify the problem. Assume the horse is a perfect sphere..."

What we will do in the following derivation is to assume that the vocal tract is a uniform lossless tube excited by

a source. Comparing Figure 5.7 to Figure 5.8, we see that we have vastly simplified the situation. In the next lecture, we will extend the uniform tube derived here into a concatenation of tubes of varying sizes. It turns out that despite the simplifying assumptions, the models do shed light on the speech production process and in some cases perform adequately in practice.

Figure 5.7: Schematic of Human Vocal Tract [O87]

Figure 5.8: Uniform Lossless Tube Vocal Tract Model

5.3.1 Simplified Production Model: Overview

Figure 5.9: Speech production as Source-Filter model. The source e(t) is the glottal pulse; the filter impulse response v(t) defines the resonances.

The simplified production model we will adopt is a time-varying system (the vocal tract) that is excited by a periodic source (the glottis), as shown in Figure 5.9. The glottal source generates a sawtooth-like wave which, unlike square and triangle waves, contains all the harmonics. The vocal tract is a time-varying filter that shapes the excitation signal with different formants. In other words, we have:

$$S(j\Omega) = V(j\Omega)\, I(j\Omega)\, G(j\Omega)$$

where $V(j\Omega)$ is the vocal tract response, $G(j\Omega)$ is the glottal pulse spectral shape, $I(j\Omega)$ is a periodic pulse train, and $S(j\Omega)$ is the output speech spectrum.

Each of these components can be approximated separately. However, doing so assumes independence between glottal excitation and vocal tract coloring, which is mostly but not entirely true. For example, in Lombard speech, the presence of noise affects speaker effort in a nonlinear fashion. Also, in the extremely fast speech of Steve Woodmore (637 words/minute), there is little pitch variation, which suggests a coupling of the source and filter.

5.3.2 Glottal Excitation Model

The glottal excitation e(t) is the convolution of a glottal pulse g(t), which has a specific spectral shape, with a periodic impulse train i(t). As seen in Figure 5.10, the glottal pulse looks like a half-wave rectified sine wave, where the closing phase occurs more rapidly than the opening phase. This is because the Bernoulli force closes the glottis rapidly while the air pressure from the lungs opens the glottis gradually. As a result, we often model the glottal pulse as a sawtooth wave.

Figure 5.10: Glottal pulse and mouth sound pressure [O87]

The spectrum of a glottal pulse $G(j\Omega)$ is a low-pass signal with a cut-off frequency around 500 Hz. The fall-off is about 12 dB/octave.
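The glottal excitation described above, e(t) = g(t) * i(t), can be sketched in discrete time. This is a minimal illustration under stated assumptions, not the model used in [O87]: the pulse is a hypothetical piecewise-linear shape (slow opening, fast closing), and the period of 80 samples and the pulse lengths are arbitrary example values.

```python
def impulse_train(length: int, period: int) -> list:
    """Unit impulses every `period` samples, zeros elsewhere: the sequence i[n]."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def glottal_pulse(open_len: int, close_len: int) -> list:
    """Hypothetical asymmetric pulse g[n]: slow linear opening, fast linear closing."""
    rise = [n / open_len for n in range(open_len)]           # 0 up to just below 1
    fall = [1.0 - n / close_len for n in range(close_len)]   # 1 down to just above 0
    return rise + fall


def convolve(x: list, h: list) -> list:
    """Direct-form linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        if xi == 0.0:
            continue  # skip zero samples of the sparse impulse train
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


# Assumed example values: 80-sample pitch period, 40-sample opening, 10-sample closing.
PERIOD = 80
g = glottal_pulse(open_len=40, close_len=10)
i_t = impulse_train(length=4 * PERIOD, period=PERIOD)
e = convolve(i_t, g)  # excitation: one copy of g at every pitch period

if __name__ == "__main__":
    print("pulse length:", len(g), "excitation length:", len(e))
    print("first period peak at sample", e[:PERIOD].index(max(e[:PERIOD])))
```

Convolving with the impulse train simply places a copy of the pulse at each pitch period. In the frequency domain this corresponds to the product $I(j\Omega)\,G(j\Omega)$: harmonics at multiples of the pitch frequency, weighted by the spectral shape of the pulse.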

5.3.3 Uniform Lossless Tube Vocal Tract Model

After modeling the glottal source, we will now model the vocal tract filter. In the uniform lossless tube model of the vocal tract (Figure 5.8), we assume the following:

- The vocal tract is a single tube (and therefore time-variation of the vocal tract shape is not captured).
- The tube is not curved (which actually does not have a large effect).
- Losses due to heat conduction, viscous friction at the vocal tract walls, softness of the vocal tract walls, etc. are ignored.
- Radiation of sound occurs only at the lips. There is no nasal coupling.

Using the laws of conservation of mass, energy, and momentum, it can be shown that the acoustic waves in a uniform lossless tube satisfy the following 1-D wave equations:

$$\frac{\partial^2 p}{\partial x^2} = \frac{1}{c^2}\,\frac{\partial^2 p}{\partial t^2} \tag{5.1}$$

$$\frac{\partial^2 u}{\partial x^2} = \frac{1}{c^2}\,\frac{\partial^2 u}{\partial t^2} \tag{5.2}$$

where

- $p = p(x, t)$: pressure as a function of position and time
- $u = u(x, t)$: volume velocity of particles in the tube
- $t$: time
- $x$: particle position along the tube of length $l$
- $c$: speed of sound (approximately 340 m/s)

Note: The wave equations are one-dimensional because we can assume planar wave propagation along the length of the tube when the wavelengths are sufficiently long compared to the diameter of the tube. (This is true in typical human vocal tracts for frequencies less than 4000 Hz.)

The general solution to these partial differential equations has the form:

$$p(x, t) = p^{+}(t - x/c) + p^{-}(t + x/c) \tag{5.3}$$

$$u(x, t) = u^{+}(t - x/c) + u^{-}(t + x/c) \tag{5.4}$$

where $p^{+}$ and $u^{+}$ are the forward-propagating waves, $p^{-}$ and $u^{-}$ are the backward-propagating waves, and all of them are arbitrary functions with continuous first and second derivatives. One can imagine the waves as propagating along the length of the tube through time without changing shape.

Now we will find the vocal tract transfer function $V(j\Omega)$ to get a frequency-domain understanding of the system. We begin by assuming the boundary condition at $x = 0$ is an excitation by a complex exponential:

$$u(0, t) = u_G(t) = U_G(\Omega)\, e^{j\Omega t}$$

where $U_G(\Omega)$ is the complex amplitude. As a result, the forward and backward waves have the form:

$$u^{+}(t - x/c) = K^{+} e^{j\Omega (t - x/c)} \tag{5.5}$$

$$u^{-}(t + x/c) = K^{-} e^{j\Omega (t + x/c)} \tag{5.6}$$

Substituting these into Eqs. 5.3 and 5.4 and applying the boundary condition that the pressure at the lips is zero (the lips are open), $p(l, t) = 0$, we solve for the unknown constants $K^{+}$ and $K^{-}$ and arrive at the steady-state solutions:

$$p(x, t) = j Z_0\, \frac{\sin(\Omega (l - x)/c)}{\cos(\Omega l/c)}\, U_G(\Omega)\, e^{j\Omega t} \tag{5.7}$$

$$u(x, t) = \frac{\cos(\Omega (l - x)/c)}{\cos(\Omega l/c)}\, U_G(\Omega)\, e^{j\Omega t} \tag{5.8}$$

where $Z_0 = \rho c / A$ is the characteristic impedance of the tube. The volume velocity at the lips, which occurs at $x = l$, is therefore:

$$u(l, t) = U(l, \Omega)\, e^{j\Omega t} = \frac{U_G(\Omega)\, e^{j\Omega t}}{\cos(\Omega l/c)} \tag{5.9}$$

Finally, we can get the vocal tract transfer function:

$$V(\Omega) = \frac{U(l, \Omega)}{U_G(\Omega)} = \frac{1}{\cos(\Omega l/c)} \tag{5.10}$$

Figure 5.11: Resonant frequencies of a uniform tube

Figure 5.11 shows a plot of $V(\Omega)$. This is an all-pole model, and the peaks occur when $\Omega l/c = (2n + 1)\pi/2$. Thus, for a uniform tube model, the resonances are evenly spaced at

$$\Omega_n = \frac{(2n+1)\pi c}{2l}, \quad \text{i.e.} \quad f_n = \frac{(2n+1)\, c}{4l}, \qquad n = 0, \pm 1, \pm 2, \ldots$$

The spacing decreases as the vocal tract length $l$ increases. The resonant frequencies are analogous to the formants we see in speech, and the single tube model actually models the vowel schwa relatively well. As we shall see in the next lecture, differently spaced resonant frequencies corresponding to different speech sounds can be modeled by concatenating uniform tubes of various sizes to approximate a varying vocal tract.
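As a numeric check of the resonance condition $\Omega l/c = (2n+1)\pi/2$, which gives $f_n = (2n+1)c/(4l)$, the sketch below evaluates the first few resonances and the magnitude of $V(\Omega) = 1/\cos(\Omega l/c)$. The tube length of 0.17 m is an assumed, typical adult vocal tract length (it is not given in this section); $c = 340$ m/s is the value used in the notes.

```python
import math

C_SOUND = 340.0  # speed of sound in m/s, as used in the notes


def resonances_hz(length_m: float, count: int, c: float = C_SOUND) -> list:
    """First `count` resonances of a uniform lossless tube: f_n = (2n + 1) c / (4 l)."""
    return [(2 * n + 1) * c / (4.0 * length_m) for n in range(count)]


def transfer_magnitude(freq_hz: float, length_m: float, c: float = C_SOUND) -> float:
    """|V(Omega)| = |1 / cos(Omega l / c)| with Omega = 2 pi f."""
    omega = 2.0 * math.pi * freq_hz
    return abs(1.0 / math.cos(omega * length_m / c))


# Assumption: 0.17 m is a commonly quoted adult vocal tract length, used here for illustration.
ASSUMED_LENGTH_M = 0.17

if __name__ == "__main__":
    for n, f in enumerate(resonances_hz(ASSUMED_LENGTH_M, 3)):
        print(f"n = {n}: resonance near {f:.0f} Hz")
    for f in (100.0, 499.0, 1000.0):
        print(f"|V| at {f:.0f} Hz = {transfer_magnitude(f, ASSUMED_LENGTH_M):.2f}")
```

With these assumed values the resonances fall near 500, 1500, and 2500 Hz, spaced $c/(2l) = 1000$ Hz apart, which is close to the formant pattern usually quoted for a neutral vowel. The magnitude is 1 midway between resonances and grows without bound at them, since the lossless model has no damping.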

5.3.4 Derivation of 1-D Wave Equations

The 1-D wave equations (Eqs. 5.1, 5.2) we used for the single lossless tube model are not difficult to derive. The outline of our derivation is as follows:

1. Approximate particles in the tube as infinitesimally small boxes and use Newton's Second Law to model their motion. The result is $\frac{\partial p}{\partial x} = -\rho_0 \frac{\partial u}{\partial t}$.
2. Use the Adiabatic Ideal Gas Law to model the change of volume of the box. The result is $\frac{\partial^2 p}{\partial x^2} = \frac{1}{c^2} \frac{\partial^2 p}{\partial t^2}$.
3. Do the same to get $\frac{\partial^2 u}{\partial x^2} = \frac{1}{c^2} \frac{\partial^2 u}{\partial t^2}$.

In the vocal tract, particles move when pressure is applied in some direction. To model this, we will consider an infinitesimally small cube of volume $V = \Delta x\, \Delta y\, \Delta z$, as shown in Figure 5.12. The glottis produces air pressure from the left end of the tube, so there is a force that pushes the box to the right. This force can be described by Newton's Second Law of Motion, $F = ma$. This model works well for sounds in the audible range (e.g., not more than 110 dB SPL).

Figure 5.12: Box in tube model. Pressure comes from the x = 0 end.

Since pressure is defined as force per unit area ($p = f/A$), the force causing acceleration in the negative x direction on the cube is $f_l = \Delta p\, A = \left(\frac{\partial p}{\partial x} \Delta x\right)(\Delta y\, \Delta z)$. The force causing a positive x acceleration is therefore the negative:

$$f = -\left(\frac{\partial p}{\partial x}\, \Delta x\right)(\Delta y\, \Delta z) \tag{5.11}$$

Using Newton's Second Law, the force per unit volume is:

$$\frac{f}{V} = \frac{M a}{V} = \frac{M}{V}\,\frac{\partial u}{\partial t} = \rho_0\, \frac{\partial u}{\partial t} \tag{5.12}$$

where $M$ is the mass of the box, $u$ is the particle velocity of the gas, and $\rho_0 = M/V$ is the density of the box. (In a uniform tube the volume velocity is the particle velocity times the constant cross-sectional area, so both obey the same wave equation.) From Eq. 5.11, we have $\frac{\partial p}{\partial x} = -\frac{f}{\Delta x\, \Delta y\, \Delta z} = -\frac{f}{V}$. Combining this with Eq. 5.12, we get our desired result for step 1 of the derivation:

$$\frac{\partial p}{\partial x} = -\rho_0\, \frac{\partial u}{\partial t} \tag{5.13}$$

Next we proceed to step 2, where we apply the Ideal Gas Law to derive the second-derivative equation for pressure. Recall that the Charles-Boyle Ideal Gas Law states that $PV = nRT$, where

- $P$: pressure
- $V$: volume
- $n$: number of moles in the volume
- $R$: universal gas constant
- $T$: temperature in Kelvin

In general, this law describes the relationship between pressure and volume, depending on how temperature changes. There are two types of thermal conditions:

- Isothermal: The gas remains at constant temperature, which occurs if the change in pressure/volume is slow (quasi-static). The result is $PV = \text{constant}$.
- Adiabatic: No heat flows in or out of the system. This occurs if (1) the system is well insulated, or (2) the pressure/volume change happens so fast that heat has no time to flow in or out.

Whether the second condition holds depends on the wavelength of a typical sound. The wavelength of an example speech sound at 1000 Hz is $\lambda = c/f = (340\ \text{m/s})/(1000\ \text{s}^{-1}) = 0.34$ m, so the length of a half-cycle is 0.17 m. Heat diffusion occurs at a rate of approximately 0.5 m/s. A half-cycle at 1000 Hz lasts 0.5 ms, so heat travels only $2.5 \times 10^{-4}$ m in that time. This is much smaller than the half-wavelength of the speech sound, so the process in the vocal tract is essentially adiabatic.

The Adiabatic Gas Law is defined as

$$P V^{\gamma} = \text{constant} \tag{5.14}$$

where the heat capacity ratio $\gamma$ is the specific heat capacity at constant pressure divided by the specific heat capacity at constant volume, which is 1.4 in air. We want the differential form of the adiabatic gas law, so we first take the logarithm to get:

$$\ln(P V^{\gamma}) = \text{constant} \quad\Longrightarrow\quad \ln P + \gamma \ln V = \text{constant}$$

Then we take the derivative and rearrange to get the differential form:

$$\frac{dP}{P} = -\gamma\, \frac{dV}{V}$$

which can be approximated as

$$\frac{p}{P_0} = -\gamma\, \frac{\Delta V}{V_0} \tag{5.15}$$

Here, $P_0$ is the undisturbed pressure and $p$ is the incremental pressure from the wave; $V_0$ is the undisturbed volume and $\Delta V$ is the incremental volume from the wave. Renaming $\Delta V$ to $\tau$ for notational simplicity and taking the derivative with respect to time, we arrive at:

$$\frac{1}{P_0}\, \frac{\partial p}{\partial t} = -\frac{\gamma}{V_0}\, \frac{\partial \tau}{\partial t} \tag{5.16}$$

Figure 5.13: Left: before deformation; Right: after deformation. Note the incremental volume.

Now we look at the incremental volume of the box in the tube. We assume that the mass in the box remains constant (conservation of mass) while the side of the box extends due to the pressure change. As shown in Figure 5.13, the incremental volume is related to $\xi_x$, the displacement/extension of the box at one side. Subtracting the volumes of the boxes in Figure 5.13, $\left(\Delta x + \frac{\partial \xi_x}{\partial x} \Delta x\right) \Delta y\, \Delta z$ and $\Delta x\, \Delta y\, \Delta z$, the incremental volume is simply

$$\tau = \frac{\partial \xi_x}{\partial x}\, \Delta x\, \Delta y\, \Delta z = V_0\, \frac{\partial \xi_x}{\partial x}$$

Taking the derivative with respect to time, we have

$$\frac{\partial \tau}{\partial t} = V_0\, \frac{\partial}{\partial t}\!\left(\frac{\partial \xi_x}{\partial x}\right) = V_0\, \frac{\partial}{\partial x}\!\left(\frac{\partial \xi_x}{\partial t}\right) = V_0\, \frac{\partial u}{\partial x} \tag{5.17}$$

Note that the last step in Eq. 5.17 follows because $u$, the instantaneous particle velocity, must be equal to the instantaneous velocity of the box's edge ($\partial \xi_x / \partial t$) due to conservation of mass.

Finally, we combine the equations we derived previously (Eqs. 5.13, 5.16, 5.17) for the final result. Substituting Eq. 5.17 into Eq. 5.16 and taking the derivative with respect to time, we have:

$$\frac{\partial p}{\partial t} = -\gamma P_0\, \frac{\partial u}{\partial x}, \qquad \frac{\partial^2 p}{\partial t^2} = -\gamma P_0\, \frac{\partial^2 u}{\partial x\, \partial t} \tag{5.18}$$

Taking the derivative of Eq. 5.13 with respect to $x$, then substituting in Eq. 5.18, we get:

$$\frac{\partial^2 p}{\partial x^2} = -\rho_0\, \frac{\partial^2 u}{\partial x\, \partial t} = \frac{\rho_0}{\gamma P_0}\, \frac{\partial^2 p}{\partial t^2}$$

Finally, observing that $\gamma P_0 / \rho_0 = c^2$, where $c$ is the speed of sound in the gas, we have our 1-D wave equation for pressure:

$$\frac{\partial^2 p}{\partial x^2} = \frac{1}{c^2}\, \frac{\partial^2 p}{\partial t^2}$$

To find the wave equation for volume velocity, we take the derivative of the first relation in Eq. 5.18 with respect to $x$ and of Eq. 5.13 with respect to $t$, equate the mixed derivatives of $p$, and proceed similarly. In the end, we have two 1-D wave equations describing the dynamics of sound pressure and volume velocity inside a uniform lossless tube:

$$\frac{\partial^2 p}{\partial x^2} = \frac{1}{c^2}\, \frac{\partial^2 p}{\partial t^2}, \qquad \frac{\partial^2 u}{\partial x^2} = \frac{1}{c^2}\, \frac{\partial^2 u}{\partial t^2}$$

References

[DHP99] J. R. Deller, J. H. L. Hansen, and J. G. Proakis, Discrete-Time Processing of Speech Signals, Wiley-IEEE Press.
[HAH01] X. Huang, A. Acero, and H. Hon, Spoken Language Processing, Prentice Hall PTR, 2001.
[LB88] P. Lieberman and S. Blumstein, Speech Physiology, Speech Perception, and Acoustic Phonetics, Cambridge University Press, 1988.
[O87] D. O'Shaughnessy, Speech Communications: Human and Machine, Addison-Wesley, 1987.
[RS78] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice Hall, New Jersey, 1978.
