General Synthesis Model. Time Domain Methods in Speech Processing. General Analysis Model. Overview. Basics. Fundamental Assumptions

Digital Speech Processing, Lectures 7-8: Time Domain Methods in Speech Processing

General Synthesis Model
[Figure: speech synthesis model with voiced and unvoiced sound amplitudes, pitch period markers (T), and a radiation model $R(z) = 1 - \alpha z^{-1}$.]
- Vocal tract parameters: log areas, reflection coefficients, formants, vocal tract polynomial, articulatory parameters.
- Estimation problems: pitch detection, voiced/unvoiced/silence detection, gain estimation, vocal tract parameter estimation, glottal pulse shape, radiation model.

General Analysis Model
- The speech analysis model takes s[n] as input and estimates: pitch period, T[n]; glottal pulse shape, g[n]; voiced amplitude, A_V[n]; V/U/S[n] switch; unvoiced amplitude, A_U[n]; vocal tract impulse response, v[n]; radiation characteristic, r[n].
- All analysis parameters are time-varying at rates commensurate with the information in the parameters.
- We need algorithms for estimating the analysis parameters and their variations over time.

Overview
- Signal processing maps speech, x[n], to a representation of speech: speech or music; A(x,t); formants; reflection coefficients; voiced-unvoiced-silence; pitch; sounds of language; speaker identification; emotions.
- time domain processing => direct operations on the speech waveform
- frequency domain processing => direct operations on a spectral representation of the signal
- Simple processing of x[n] (zero crossing rate, level crossing rate, energy, autocorrelation) enables various types of feature estimation.

Basics
- 8 kHz sampled speech (bandwidth < 4 kHz)
- properties of speech change with time
- excitation goes from voiced to unvoiced
- peak amplitude varies with the sound being produced
- pitch varies within and across voiced sounds
- periods of silence where background signals are seen
- the key issue is whether we can create simple time-domain processing methods that enable us to measure/estimate speech representations reliably and accurately
[Figure: speech waveform with pitch periods (P) and voiced (V) and unvoiced/silence (U/S) regions marked.]

Fundamental Assumptions
- properties of the speech signal change relatively slowly with time (5-10 sounds per second)
- over very short (5-20 msec) intervals => uncertainty due to small amount of data, varying pitch, varying amplitude
- over medium length (20-100 msec) intervals => uncertainty due to changes in sound quality, transitions between sounds, rapid transients in speech
- over long (100-500 msec) intervals => uncertainty due to large amount of sound changes
- there is always uncertainty in short-time measurements and estimates from speech signals

Compromise Solution
- short-time processing methods => short segments of the speech signal are isolated and processed as if they were short segments from a sustained sound with fixed (non-time-varying) properties
- this short-time processing is periodically repeated for the duration of the waveform
- these short analysis segments, or analysis frames, almost always overlap one another
- the results of short-time processing can be a single number (e.g., an estimate of the pitch period within the frame), or a set of numbers (an estimate of the formant frequencies for the analysis frame)
- the end result of the processing is a new, time-varying sequence that serves as a new representation of the speech signal

Frame-by-Frame Processing in Successive Windows (75% frame overlap)
- 75% frame overlap => frame length = L, frame shift = R = L/4
- Frame 1 = {x[0], x[1], ..., x[L-1]}
- Frame 2 = {x[R], x[R+1], ..., x[R+L-1]}
- Frame 3 = {x[2R], x[2R+1], ..., x[2R+L-1]}
[Figure: Frames 1 to 5 laid over the waveform.]

Frame-by-Frame Processing in Successive Windows (50% frame overlap)
- Frame 1: samples 0, 1, ..., L-1
- Frame 2: samples R, R+1, ..., R+L-1
- Frame 3: samples 2R, 2R+1, ..., 2R+L-1
- Frame 4: samples 3R, 3R+1, ..., 3R+L-1
- Speech is processed frame-by-frame in overlapping intervals until the entire region of speech is covered by at least one such frame.
- Results of the analysis of individual frames are used to derive model parameters in some manner.
- The representation goes from time samples x[n], n = ..., 0, 1, 2, ... to parameter vectors f[m], m = 0, 1, 2, ..., where n is the time index and m is the frame index.

Frames and Windows
- F_S = 16,000 samples/second
- L = 640 samples (equivalent to 40 msec frame (window) length)
- R = 240 samples (equivalent to 15 msec frame (window) shift)
- Frame rate of 66.7 frames/second

Short-time Processing
- speech waveform, x[n] -> short-time processing -> speech representation, f[m]
- x[n] = samples at 8000/sec rate (e.g., 2 seconds of 4 kHz bandlimited speech, x[n], 0 <= n <= 16000)
- f[m] = {f_1[m], f_2[m], ..., f_L[m]} = vectors at 100/sec rate, 1 <= m <= 200
- here L is the size of the analysis vector (e.g., 1 for a pitch period estimate, a larger number for autocorrelation estimates, etc.)
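The frame indexing above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function name `frame_signal` and the choice to drop an incomplete trailing frame are assumptions.

```python
def frame_signal(x, L, R):
    """Return the complete frames of length L taken every R samples from x.

    Frame i covers samples i*R, i*R + 1, ..., i*R + L - 1. Trailing samples
    that do not fill a whole frame are dropped (an assumption of this sketch).
    """
    if L <= 0 or R <= 0:
        raise ValueError("L and R must be positive")
    return [list(x[s:s + L]) for s in range(0, len(x) - L + 1, R)]


# 50% overlap: frame shift R = L/2.
frames = frame_signal(list(range(10)), L=4, R=2)

# Frame rate for the 16 kHz example: Fs / R frames per second.
frame_rate = 16000 / 240
```

With L = 4 and R = 2, consecutive frames share two samples, which is the 50% overlap case; the frame rate depends only on the sampling rate and the shift R, not on L.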

Generic Short-time Processing
[Block diagram: x[n] -> T( ) -> T(x[n]) -> window $\tilde{w}[n]$ -> $Q_{\hat{n}}$.]
$$Q_{\hat{n}} = \left.\sum_{m=-\infty}^{\infty} T(x[m])\,\tilde{w}[n-m]\right|_{n=\hat{n}}$$
- T( ) is a linear or non-linear transformation
- $\tilde{w}[n]$ is the window sequence (usually finite length)
- $Q_{\hat{n}}$ is a sequence of local weighted average values of the sequence T(x[n]) at time $n = \hat{n}$

Short-time Energy
$$E = \sum_{m=-\infty}^{\infty} x^2[m]$$
- this is the long-term definition of signal energy
- there is little or no utility of this definition for time-varying signals
$$E_{\hat{n}} = \sum_{m=\hat{n}-L+1}^{\hat{n}} x^2[m] = x^2[\hat{n}-L+1] + \dots + x^2[\hat{n}]$$
- this is the short-time energy in the vicinity of time $\hat{n}$
- here $T(x) = x^2$ and $\tilde{w}[n] = 1$ for $0 \le n \le L-1$, $\tilde{w}[n] = 0$ otherwise

Computation of Short-time Energy
- the window jumps/slides across the sequence of squared values, selecting an interval for processing
- what happens to $E_{\hat{n}}$ as the window jumps by 2, 4, 8, ..., L samples? ($E_{\hat{n}}$ is a lowpass function, so it can be decimated without loss of information; why is $E_{\hat{n}}$ lowpass?)
- the effects of decimation depend on L; if L is small, then $E_{\hat{n}}$ is a lot more variable than if L is large (window bandwidth changes with L!)

Effects of Window
$$Q_{\hat{n}} = \left. T(x[n]) * \tilde{w}[n] \right|_{n=\hat{n}} = \left. x'[n] * \tilde{w}[n] \right|_{n=\hat{n}}$$
- $\tilde{w}[n]$ serves as a lowpass filter on T(x[n]), which often has a lot of high frequencies (most non-linearities introduce significant high frequency energy; think of what $x[n] \cdot x[n]$ does in frequency)
- often we extend the definition of $Q_{\hat{n}}$ to include a pre-filtering term so that x[n] itself is filtered to a region of interest
[Block diagram: input -> linear filter -> x[n] -> T( ) -> T(x[n]) -> $\tilde{w}[n]$ -> $Q_n$, evaluated at $n = \hat{n}$.]

Short-time Energy
- serves to differentiate voiced and unvoiced sounds in speech from silence (background signal)
- the natural definition of the energy of the weighted signal is
$$E_{\hat{n}} = \sum_{m=-\infty}^{\infty} \left[ x[m]\,\tilde{w}[\hat{n}-m] \right]^2$$
  (sum of squares of a portion of the signal), which concentrates the measurement at sample $\hat{n}$, using weighting $\tilde{w}[\hat{n}-m]$
$$E_{\hat{n}} = \sum_{m=-\infty}^{\infty} x^2[m]\,\tilde{w}^2[\hat{n}-m] = \sum_{m=-\infty}^{\infty} x^2[m]\,h[\hat{n}-m], \qquad h[n] = \tilde{w}^2[n]$$
[Block diagram: x[n] (rate F_S) -> ( )^2 -> x^2[n] (rate F_S) -> h[n] -> E_n -> downsample by R -> $E_{\hat{n}}$ (rate F_S/R).]

Short-time Energy Properties
- depends on the choice of h[n], or equivalently, the window $\tilde{w}[n]$
- if $\tilde{w}[n]$ has very long duration and constant amplitude ($\tilde{w}[n] = 1$, n = 0, 1, ..., L-1), $E_n$ would not change much over time, and would not reflect the short-time amplitudes of the sounds of the speech
- very long duration windows correspond to narrowband lowpass filters
- we want $E_n$ to change at a rate comparable to the changing sounds of the speech => this is the essential conflict in all speech processing, namely we need a short duration window to be responsive to rapid sound changes, but short windows will not provide sufficient averaging to give a smooth and reliable energy function
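The short-time energy definition can be sketched directly as a windowed sum of squares. This is a minimal sketch rather than the lecture's own code: the function name `short_time_energy` and the convention of starting at the first sample where the whole window fits are assumptions.

```python
def short_time_energy(x, h, R=1):
    """E at n_hat = L-1, L-1+R, ...: sum over m of x[m]^2 * h[n_hat - m].

    h is the effective window (h[n] = w[n]^2), nonzero for 0 <= n <= L-1.
    """
    L = len(h)
    return [
        sum(x[n_hat - j] ** 2 * h[j] for j in range(L))
        for n_hat in range(L - 1, len(x), R)
    ]


x = [1, 2, 3, 4]
rect = [1, 1]                                        # rectangular window, L = 2
energy = short_time_energy(x, rect)                  # evaluated at every sample
energy_decimated = short_time_energy(x, rect, R=2)   # evaluated every 2nd sample
```

Setting R > 1 computes the same values at a lower rate, which is the decimation discussed above.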

Windows
- consider two windows, $\tilde{w}[n]$
- rectangular window: h[n] = 1 for $0 \le n \le L-1$ and 0 otherwise
- Hamming window (raised cosine window): $h[n] = 0.54 - 0.46\cos(2\pi n/(L-1))$ for $0 \le n \le L-1$ and 0 otherwise
- the rectangular window gives equal weight to all L samples in the window (n, ..., n-L+1)
- the Hamming window gives most weight to the middle samples and tapers off strongly at the beginning and the end of the window
[Figure: rectangular and Hamming windows.]

Window Frequency Responses
- rectangular window:
$$H(e^{j\Omega T}) = \frac{\sin(\Omega L T/2)}{\sin(\Omega T/2)}\, e^{-j\Omega T (L-1)/2}$$
- the first zero occurs at $f = F_S/L = 1/(LT)$ (or $\Omega = 2\pi/(LT)$) => nominal cutoff frequency of the equivalent lowpass filter
- Hamming window:
$$\tilde{w}_H[n] = 0.54\,\tilde{w}_R[n] - 0.46\cos(2\pi n/(L-1))\,\tilde{w}_R[n]$$
- the Hamming window frequency response can be decomposed into a combination of three terms

RW and HW Frequency Responses
[Figure: log magnitude responses of the rectangular window (RW) and Hamming window (HW).]
- the bandwidth of the HW is approximately twice the bandwidth of the RW
- attenuation of more than 40 dB for the HW outside the passband, versus about 14 dB for the RW
- stopband attenuation is essentially independent of L, the window duration => increasing L simply decreases the window bandwidth
- L needs to be larger than a pitch period (or severe fluctuations will occur in $E_n$), but smaller than a sound duration (or $E_n$ will not adequately reflect the changes in the speech signal)
- There is no perfect value of L, since a pitch period can be as short as 20 samples (500 Hz at a 10 kHz sampling rate) for a high pitch child or female, and up to 250 samples (40 Hz pitch at a 10 kHz sampling rate) for a low pitch male; a compromise value of L on the order of 100-200 samples for a 10 kHz sampling rate is often used in practice

Window Frequency Responses
[Figure: frequency responses of rectangular and Hamming windows for several window lengths L.]

Short-time Energy
- short-time energy computation:
$$E_{\hat{n}} = \sum_{m=-\infty}^{\infty} \left( x[m]\,w[\hat{n}-m] \right)^2 = \sum_{m=\hat{n}-L+1}^{\hat{n}} (x[m])^2\,\tilde{w}[\hat{n}-m]$$
  where, on this slide, $\tilde{w}[n] = w^2[n]$ is the effective window (the role played by h[n] earlier)
- for an L-point rectangular window, $\tilde{w}[m] = 1$, m = 0, 1, ..., L-1, giving
$$E_{\hat{n}} = \sum_{m=\hat{n}-L+1}^{\hat{n}} (x[m])^2$$
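The claim that the rectangular window has its first zero at F_S/L, while the Hamming window's main lobe is roughly twice as wide, can be checked numerically. The sketch below is illustrative only: the helper names and the values L = 21 and F_S = 10 kHz are assumptions chosen for the example.

```python
import cmath
import math


def rectangular(L):
    return [1.0] * L


def hamming(L):
    """0.54 - 0.46 cos(2 pi n / (L - 1)) for 0 <= n <= L - 1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (L - 1)) for n in range(L)]


def response_magnitude(w, f, fs):
    """|sum_n w[n] exp(-j 2 pi f n / fs)|: window frequency response at f Hz."""
    return abs(sum(wn * cmath.exp(-2j * math.pi * f * n / fs)
                   for n, wn in enumerate(w)))


L, fs = 21, 10000.0
first_zero = fs / L                     # predicted first zero of the RW
rw_at_zero = response_magnitude(rectangular(L), first_zero, fs)
hw_at_zero = response_magnitude(hamming(L), first_zero, fs)
hw_at_dc = response_magnitude(hamming(L), 0.0, fs)
```

At f = F_S/L the rectangular response vanishes (up to rounding), while the Hamming response is still a sizeable fraction of its DC value, consistent with its wider passband.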

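Short-time Energy using RW/HW
[Figure: short-time energy contours computed with rectangular and Hamming windows for four window lengths L.]
- as L increases, the plots tend to converge (however, you are smoothing sound energies)
- short-time energy provides the basis for distinguishing voiced from unvoiced speech regions, and for medium-to-high SNR recordings, can even be used to find regions of silence/background signal

Short-time Energy for AGC
- can use an IIR filter to define short-time energy, e.g., the time-dependent energy definition
$$\sigma^2[n] = \frac{\sum_{m=-\infty}^{\infty} x^2[m]\,h[n-m]}{\sum_{m=0}^{\infty} h[m]}$$
- consider the impulse response of a filter of the form
$$h[n] = \alpha^{n-1} u[n-1] = \begin{cases} \alpha^{n-1}, & n \ge 1 \\ 0, & n < 1 \end{cases}$$
$$\sigma^2[n] = (1-\alpha) \sum_{m=-\infty}^{\infty} x^2[m]\,\alpha^{n-m-1}\,u[n-m-1]$$

Recursive Short-time Energy
- $u[n-m-1]$ implies the condition $n-m-1 \ge 0$, or $m \le n-1$, giving
$$\sigma^2[n] = (1-\alpha) \sum_{m=-\infty}^{n-1} x^2[m]\,\alpha^{n-m-1} = (1-\alpha)\left( x^2[n-1] + \alpha x^2[n-2] + \dots \right)$$
- for the index n-1 we have
$$\sigma^2[n-1] = (1-\alpha) \sum_{m=-\infty}^{n-2} x^2[m]\,\alpha^{n-m-2} = (1-\alpha)\left( x^2[n-2] + \alpha x^2[n-3] + \dots \right)$$
- thus giving the relationship
$$\sigma^2[n] = \alpha\,\sigma^2[n-1] + x^2[n-1]\,(1-\alpha)$$
- this defines an Automatic Gain Control (AGC) of the form
$$G[n] = \frac{G_0}{\sigma[n]}$$

Recursive Short-time Energy (z-transform view)
$$\sigma^2[n] = x^2[n] * h[n], \qquad h[n] = (1-\alpha)\,\alpha^{n-1} u[n-1], \qquad \sigma^2(z) = X^2(z)\,H(z)$$
$$H(z) = \sum_{n=-\infty}^{\infty} h[n]\,z^{-n} = \sum_{n=1}^{\infty} (1-\alpha)\,\alpha^{n-1} z^{-n}$$
- substituting m = n-1:
$$H(z) = \sum_{m=0}^{\infty} (1-\alpha)\,\alpha^{m} z^{-(m+1)} = (1-\alpha)\,z^{-1} \sum_{m=0}^{\infty} \alpha^{m} z^{-m} = \frac{(1-\alpha)\,z^{-1}}{1-\alpha z^{-1}} = \frac{\sigma^2(z)}{X^2(z)}$$
$$\sigma^2[n] = \alpha\,\sigma^2[n-1] + (1-\alpha)\,x^2[n-1]$$
[Block diagram: x[n] -> ( )^2 -> x^2[n] -> z^{-1} -> gain (1-\alpha) -> adder -> $\sigma^2[n]$, with feedback through z^{-1} and gain $\alpha$.]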
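The recursion can be checked against the direct sum it was derived from. The sketch below assumes a zero initial state and x[m] = 0 for m < 0; the function names and the small floor in the gain computation are illustrative choices, not part of the lecture.

```python
import math


def recursive_energy(x, alpha):
    """sigma2[n] = alpha*sigma2[n-1] + (1-alpha)*x[n-1]^2, zero initial state."""
    sigma2 = [0.0] * len(x)
    for n in range(1, len(x)):
        sigma2[n] = alpha * sigma2[n - 1] + (1 - alpha) * x[n - 1] ** 2
    return sigma2


def direct_energy(x, alpha, n):
    """(1-alpha) * sum_{m <= n-1} x[m]^2 alpha^(n-m-1), with x[m] = 0 for m < 0."""
    return (1 - alpha) * sum(x[m] ** 2 * alpha ** (n - m - 1) for m in range(n))


def agc_gain(sigma2, g0=1.0, floor=1e-12):
    """G[n] = G0 / sigma[n]; the floor avoiding division by zero is an assumption."""
    return [g0 / math.sqrt(max(s, floor)) for s in sigma2]


x = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5]
sigma2 = recursive_energy(x, alpha=0.9)
gain = agc_gain(sigma2)
```

For a constant input of amplitude 1 the recursion gives $\sigma^2[n] = 1 - \alpha^n$, so a value of $\alpha$ closer to 1 makes the energy estimate, and hence the gain, respond more slowly.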

Use of Short-time Energy for AGC
[Figures: AGC based on recursive short-time energy, shown for $\alpha = 0.9$ and $\alpha = 0.99$.]

Short-time Magnitude
- short-time energy is very sensitive to large signal levels due to the $x^2[n]$ terms
- consider a new definition of pseudo-energy based on average signal magnitude (rather than energy)
$$M_{\hat{n}} = \sum_{m=-\infty}^{\infty} |x[m]|\,\tilde{w}[\hat{n}-m]$$
- this is a weighted sum of magnitudes, rather than a weighted sum of squares
- the computation avoids multiplications of the signal with itself (the squared term)
[Block diagram: x[n] (rate F_S) -> | | -> |x[n]| (rate F_S) -> $\tilde{w}[n]$ -> M_n -> downsample by R -> $M_{\hat{n}}$ (rate F_S/R).]

Short-time Magnitudes
[Figure: short-time magnitude contours for four window lengths L.]
- differences between $E_n$ and $M_n$ are noticeable in unvoiced regions
- dynamic range of $M_n$ ~ square root (dynamic range of $E_n$) => level differences between voiced and unvoiced segments are smaller
- $E_n$ and $M_n$ can be sampled at a rate of 100/sec for window durations of 20 msec or so => efficient representation of signal energy/magnitude

Short Time Energy and Magnitude
[Figures: $E_n$ and $M_n$ for several window lengths L, computed with a rectangular window and with a Hamming window.]
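The square-root relationship between the dynamic ranges of magnitude and energy can be seen on a toy signal. This sketch uses a rectangular window and constant-amplitude segments, which is an idealization; the function names and the test signal are assumptions.

```python
def short_time_magnitude(x, w, R=1):
    """M at n_hat = L-1, L-1+R, ...: sum over m of |x[m]| * w[n_hat - m]."""
    L = len(w)
    return [
        sum(abs(x[n_hat - j]) * w[j] for j in range(L))
        for n_hat in range(L - 1, len(x), R)
    ]


def short_time_energy(x, w, R=1):
    """Same form with x[m]^2 in place of |x[m]| (w used as the effective window)."""
    L = len(w)
    return [
        sum(x[n_hat - j] ** 2 * w[j] for j in range(L))
        for n_hat in range(L - 1, len(x), R)
    ]


# Toy signal: a "loud" segment of amplitude 10 followed by a "soft" one of amplitude 1.
x = [10, -10, 10, -10, 1, -1, 1, -1]
w = [1, 1, 1, 1]                         # rectangular window, L = 4
M = short_time_magnitude(x, w, R=4)      # one value per segment
E = short_time_energy(x, w, R=4)
```

The loud-to-soft ratio is 10 for the magnitude and 100 for the energy, so level differences between segments are compressed in the magnitude measure.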

Other Lowpass Windows
- can replace the RW or HW with any lowpass filter
- the window should be positive since this guarantees $E_n$ and $M_n$ will be positive
- FIR windows are efficient computationally since they can slide by R samples with no loss of information (what should R be?)
- can even use an infinite duration window if its z-transform is a rational function, i.e.,
$$h[n] = \begin{cases} a^n, & n \ge 0,\ 0 < a < 1 \\ 0, & n < 0 \end{cases} \qquad H(z) = \frac{1}{1 - a z^{-1}}, \quad |z| > |a|$$

Other Lowpass Windows
- this simple lowpass filter can be used to implement $E_n$ and $M_n$ recursively as:
$$E_n = a\,E_{n-1} + (1-a)\,x^2[n] \quad \text{(short-time energy)}$$
$$M_n = a\,M_{n-1} + (1-a)\,|x[n]| \quad \text{(short-time magnitude)}$$
- need to compute $E_n$ or $M_n$ every sample and then down-sample to a 100/sec rate
- the recursive computation has a non-linear phase, so the delay cannot be compensated exactly

Short-time Average ZC Rate
- zero crossing => successive samples have different algebraic signs
- the zero crossing rate is a simple measure of the frequency content of a signal, especially for narrowband signals (e.g., sinusoids)
- a sinusoid at frequency $F_0$ with sampling rate $F_S$ has $F_S/F_0$ samples per cycle with two zero crossings per cycle, giving an average zero crossing rate of
  $z_1 = (2)$ crossings/cycle x $(F_0/F_S)$ cycles/sample
  $z_1 = 2F_0/F_S$ crossings/sample (i.e., $z_1$ is proportional to $F_0$)
  $z_M = M\,(2F_0/F_S)$ crossings/(M samples)

Sinusoid Zero Crossing Rates
Assume the sampling rate is $F_S$ = 10,000 Hz.
1. An $F_0$ = 100 Hz sinusoid has $F_S/F_0$ = 10,000/100 = 100 samples/cycle; or $z_1$ = 2/100 crossings/sample, or $z_{100}$ = 2/100 * 100 = 2 crossings/10 msec interval
2. An $F_0$ = 1000 Hz sinusoid has $F_S/F_0$ = 10,000/1000 = 10 samples/cycle; or $z_1$ = 2/10 crossings/sample, or $z_{100}$ = 2/10 * 100 = 20 crossings/10 msec interval
3. An $F_0$ = 5000 Hz sinusoid has $F_S/F_0$ = 10,000/5000 = 2 samples/cycle; or $z_1$ = 2/2 crossings/sample, or $z_{100}$ = 2/2 * 100 = 100 crossings/10 msec interval

Zero Crossings for Sinusoids / Zero Crossings for Noise
[Figures: a sinewave and random Gaussian noise, each shown without and with a DC offset of 0.75, annotated with zero crossing (ZC) counts; the counts differ between the offset and offset-free versions of each signal.]
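The sinusoid examples can be reproduced by counting sign changes in a sampled sinusoid and comparing with $z_M = M(2F_0/F_S)$. This is a minimal sketch: the function names are hypothetical, and the small phase offset is an assumption made so that no sample falls exactly on zero.

```python
import math


def sgn(v):
    return 1 if v >= 0 else -1


def crossings_per_interval(x, M):
    """Average number of sign changes per M-sample interval of x."""
    changes = sum(1 for m in range(1, len(x)) if sgn(x[m]) != sgn(x[m - 1]))
    return M * changes / (len(x) - 1)


def sinusoid(f0, fs, n_samples, phase=0.1):
    # The small phase offset (an assumption) keeps samples away from exact zeros.
    return [math.sin(2 * math.pi * f0 * n / fs + phase) for n in range(n_samples)]


FS = 10000            # samples/second
M = 100               # 10 msec at 10 kHz
measured = {f0: crossings_per_interval(sinusoid(f0, FS, FS), M)
            for f0 in (100, 1000)}
predicted = {f0: M * 2 * f0 / FS for f0 in (100, 1000)}
```

Over one second of signal the measured rates agree with the predicted 2 and 20 crossings per 10 msec to within a small edge effect from the finite record length.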

ZC Rate Definitions
$$Z_{\hat{n}} = \frac{1}{2L_{\text{eff}}} \sum_{m=\hat{n}-L+1}^{\hat{n}} \left| \operatorname{sgn}(x[m]) - \operatorname{sgn}(x[m-1]) \right| \tilde{w}[\hat{n}-m]$$
$$\operatorname{sgn}(x[n]) = \begin{cases} 1, & x[n] \ge 0 \\ -1, & x[n] < 0 \end{cases}$$
- simple rectangular window: $\tilde{w}[n] = 1$ for $0 \le n \le L-1$, 0 otherwise, so $L_{\text{eff}} = L$
- same form for $Z_{\hat{n}}$ as for $E_{\hat{n}}$ or $M_{\hat{n}}$

ZC Normalization
The formal definition of $Z_{\hat{n}}$ is:
$$Z_{\hat{n}} = z_1 = \frac{1}{2L} \sum_{m=\hat{n}-L+1}^{\hat{n}} \left| \operatorname{sgn}(x[m]) - \operatorname{sgn}(x[m-1]) \right|$$
which is interpreted as the number of zero crossings per sample. For most practical applications, we need the rate of zero crossings per fixed interval of M samples, which is
$$z_M = z_1 \cdot M = \text{rate of zero crossings per } M \text{ sample interval}$$
Thus, for an interval of $\tau$ sec, corresponding to M samples, we get $z_M = z_1 \cdot M$ with $M = \tau F_S = \tau/T$:
- $F_S$ = 10,000 Hz; T = 100 μsec; τ = 10 msec; M = 100 samples
- $F_S$ = 8,000 Hz; T = 125 μsec; τ = 10 msec; M = 80 samples
- $F_S$ = 16,000 Hz; T = 62.5 μsec; τ = 10 msec; M = 160 samples

ZC Normalization: zero crossings/10 msec interval as a function of sampling rate
For a 1000 Hz sinewave as input, using a 40 msec window length (L), with various values of sampling rate ($F_S$), we get the following:

    F_S (Hz)   L (samples)   z_1    M (samples)   z_M
    8000       320           1/4    80            20
    10000      400           1/5    100           20
    16000      640           1/8    160           20

Thus we see that the normalized (per interval) zero crossing rate, $z_M$, is independent of the sampling rate and can be used as a measure of the dominant energy in a band.

ZC and Energy Computation
[Figures: zero crossing and energy contours computed with Hamming windows of about 200 samples (12.5 msec at Fs = 16 kHz) and about 400 samples (25 msec at Fs = 16 kHz).]

ZC Rate Distributions
[Figure: distributions of ZC rate for voiced and unvoiced speech, with a frequency axis from 1 kHz to 4 kHz.]
- Unvoiced speech: the dominant energy component is at about 2.5 kHz
- Voiced speech: the dominant energy component is at about 700 Hz
- for voiced speech, energy is mainly below 1.5 kHz
- for unvoiced speech, energy is mainly above 1.5 kHz
- mean ZC rate for unvoiced speech is 49 per 10 msec interval
- mean ZC rate for voiced speech is 14 per 10 msec interval

ZC Rates for Speech
[Figure: zero crossing rate contour for a speech utterance.]
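The normalization table follows from $z_1 = 2F_0/F_S$ and $M = \tau F_S$, and the formal definition can be evaluated directly for a rectangular window. The sketch below uses exact fractions; the function names and the error handling are illustrative assumptions.

```python
from fractions import Fraction


def sgn(v):
    return 1 if v >= 0 else -1


def zc_per_sample(x, n_hat, L):
    """Z = (1/(2L)) * sum_{m=n_hat-L+1}^{n_hat} |sgn(x[m]) - sgn(x[m-1])|."""
    start = n_hat - L + 1
    if start < 1 or n_hat >= len(x):
        raise ValueError("window needs samples x[start-1] ... x[n_hat]")
    total = sum(abs(sgn(x[m]) - sgn(x[m - 1])) for m in range(start, n_hat + 1))
    return Fraction(total, 2 * L)


def normalization_row(fs, f0=1000, tau_ms=10, window_ms=40):
    """Row (Fs, L, z1, M, zM) for an ideal sinusoid, using z1 = 2*F0/Fs."""
    L = fs * window_ms // 1000
    z1 = Fraction(2 * f0, fs)
    M = fs * tau_ms // 1000
    return fs, L, z1, M, z1 * M


table = [normalization_row(fs) for fs in (8000, 10000, 16000)]
alternating = zc_per_sample([1, -1, 1, -1, 1], n_hat=4, L=4)
```

Each sign change contributes 2 to the sum, which is why the definition divides by 2L; a signal that alternates sign on every sample therefore gives exactly one crossing per sample.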

Short-time Energy, Magnitude, ZC
[Figure: short-time energy, magnitude, and zero crossing contours for a speech utterance.]

Issues in ZC Rate Computation
- for the zero crossing rate to be accurate, need zero DC in the signal => need to remove offsets, hum, noise => use a bandpass filter to eliminate DC and hum
- can quantize the signal to 1 bit for computation of the ZC rate
- can apply the concept of ZC rate to bandpass filtered speech to give a crude spectral estimate in narrow bands of speech (it gives a rough estimate of the strongest frequency in each narrow band of speech)

Summary of Simple Time Domain Measures
[Block diagram: s(n) -> linear filter -> x(n) -> T[ ] -> T(x[n]) -> $\tilde{w}[n]$ -> $Q_{\hat{n}}$.]
$$Q_{\hat{n}} = \sum_{m=-\infty}^{\infty} T(x[m])\,\tilde{w}[\hat{n}-m]$$
1. Energy:
$$E_{\hat{n}} = \sum_{m=\hat{n}-L+1}^{\hat{n}} x^2[m]\,\tilde{w}[\hat{n}-m]$$
   can downsample $E_{\hat{n}}$ at a rate commensurate with the window bandwidth
2. Magnitude:
$$M_{\hat{n}} = \sum_{m=\hat{n}-L+1}^{\hat{n}} |x[m]|\,\tilde{w}[\hat{n}-m]$$
3. Zero Crossing Rate:
$$Z_{\hat{n}} = z_1 = \frac{1}{2L} \sum_{m=\hat{n}-L+1}^{\hat{n}} \left| \operatorname{sgn}(x[m]) - \operatorname{sgn}(x[m-1]) \right| \tilde{w}[\hat{n}-m]$$
   where $\operatorname{sgn}(x[m]) = 1$ for $x[m] \ge 0$ and $-1$ for $x[m] < 0$

Short-time Autocorrelation
- for a deterministic signal, the autocorrelation function is defined as:
$$\Phi[k] = \sum_{m=-\infty}^{\infty} x[m]\,x[m+k]$$
- for a random or periodic signal, the autocorrelation function is:
$$\Phi[k] = \lim_{L \to \infty} \frac{1}{2L+1} \sum_{m=-L}^{L} x[m]\,x[m+k]$$
- if $x[n] = x[n+P]$, then $\Phi[k] = \Phi[k+P]$ => the autocorrelation function preserves periodicity
- properties of $\Phi[k]$:
  1. $\Phi[k]$ is even, $\Phi[k] = \Phi[-k]$
  2. $\Phi[k]$ is maximum at k = 0, $|\Phi[k]| \le \Phi[0]$ for all k
  3. $\Phi[0]$ is the signal energy or power (for random signals)

Periodic Signals
- for a periodic signal we have (at least in theory) $\Phi[P] = \Phi[0]$, so the period of a periodic signal can be estimated as the location of the first maximum of $\Phi[k]$ at a non-zero lag
- this means that the autocorrelation function is a good candidate for speech pitch detection algorithms
- it also means that we need a good way of measuring the short-time autocorrelation function for speech signals

Short-time Autocorrelation
- a reasonable definition for the short-time autocorrelation is:
$$R_{\hat{n}}[k] = \sum_{m=-\infty}^{\infty} x[m]\,\tilde{w}[\hat{n}-m]\; x[m+k]\,\tilde{w}[\hat{n}-k-m]$$
  1. select a segment of speech by windowing
  2. compute the deterministic autocorrelation of the windowed speech
- by symmetry, $R_{\hat{n}}[k] = R_{\hat{n}}[-k]$, so
$$R_{\hat{n}}[k] = \sum_{m=-\infty}^{\infty} x[m]\,x[m-k] \left[ \tilde{w}[\hat{n}-m]\,\tilde{w}[\hat{n}+k-m] \right]$$
- define a filter of the form
$$\tilde{w}_k[\hat{n}] = \tilde{w}[\hat{n}]\,\tilde{w}[\hat{n}+k]$$
- this enables us to write the short-time autocorrelation in the form:
$$R_{\hat{n}}[k] = \sum_{m=-\infty}^{\infty} x[m]\,x[m-k]\,\tilde{w}_k[\hat{n}-m]$$
- the value of $R_{\hat{n}}[k]$ at time $\hat{n}$ for the k-th lag is obtained by filtering the sequence $x[\hat{n}]\,x[\hat{n}-k]$ with a filter with impulse response $\tilde{w}_k[\hat{n}]$
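For a rectangular window the short-time autocorrelation reduces to a sum of L - k products, and its peak location gives a period estimate. The sketch below is illustrative: the synthetic decaying-pulse signal, the function names, and the lag search range are assumptions, not the lecture's pitch detector.

```python
def short_time_autocorr(x, n_hat, L, k):
    """R[k] with a rectangular window ending at n_hat: L - k products."""
    return sum(x[m] * x[m + k] for m in range(n_hat - L + 1, n_hat - k + 1))


def estimate_period(x, n_hat, L, k_min, k_max):
    """Lag in [k_min, k_max] with the largest R[k]; the search range is an assumption."""
    return max(range(k_min, k_max + 1),
               key=lambda k: short_time_autocorr(x, n_hat, L, k))


# Synthetic "voiced" signal: a decaying pulse repeated every P samples.
P, L = 50, 400
x = [0.9 ** (n % P) for n in range(L)]
n_hat = L - 1
r0 = short_time_autocorr(x, n_hat, L, 0)
rP = short_time_autocorr(x, n_hat, L, P)
period = estimate_period(x, n_hat, L, k_min=20, k_max=80)
```

Even for this exactly periodic signal, R[P] is smaller than R[0] because only L - P products enter the sum at lag P; the peak at k = P is still the largest in the search range.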

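Short-time Autocorrelation
$$R_{\hat{n}}[k] = \sum_{m=-\infty}^{\infty} \left[ x[m]\,\tilde{w}[\hat{n}-m] \right] \left[ x[m+k]\,\tilde{w}[\hat{n}-k-m] \right]$$
[Figure: the two overlapping windowed segments, one starting at $\hat{n}-L+1$ and one starting at $\hat{n}+k-L+1$.]
- L points are used to compute $R_{\hat{n}}[0]$
- L - k points are used to compute $R_{\hat{n}}[k]$

Examples of Autocorrelations
[Figures: short-time autocorrelations of voiced speech computed with rectangular and Hamming windows.]
- autocorrelation peaks occur at k = 72, 144, ... => 140 Hz pitch
- $\Phi(P) < \Phi(0)$ since windowed speech is not perfectly periodic
- over a 401 sample window (40 msec of signal), pitch period changes occur, so P is not perfectly defined
- with the Hamming window, estimates of periodicity are much less clear, since the HW tapers the signal so strongly that it looks like a non-periodic signal
- there is no strong peak for unvoiced speech

[Figures: waveform and spectrum examples, with pitch period $T_0$ marked in time and harmonics $F_0 = 1/T_0$, $3F_0 = 3/T_0$ marked on a frequency axis normalized as $F/F_s$: voiced (female) speech shown as magnitude and as log magnitude, and voiced (male) speech.]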

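[Figures: the corresponding examples for unvoiced speech.]

Effects of Window Size
[Figure: autocorrelations for three window lengths L.]
- choice of L, the window duration:
  - small L so that the pitch period is almost constant in the window
  - large L so that clear periodicity is seen in the window
- as k increases, the number of window points decreases, reducing the accuracy and size of $R_n(k)$ for large k => have a taper of the type $R(k) = 1 - k/L$, $|k| < L$, shaping the autocorrelation (this is the autocorrelation of a size L rectangular window)
- allow L to vary with detected pitch periods (so that at least two full periods are included)

Modified Autocorrelation
- want to solve the problem of a differing number of samples for each different k term in $R_{\hat{n}}[k]$, so modify the definition as follows:
$$\hat{R}_{\hat{n}}[k] = \sum_{m=-\infty}^{\infty} x[m]\,\tilde{w}_1[\hat{n}-m]\; x[m+k]\,\tilde{w}_2[\hat{n}-m-k]$$
- where $\tilde{w}_1$ is the standard L-point window, and $\tilde{w}_2$ is an extended window of duration L+K samples, where K is the largest lag of interest
- we can rewrite the modified autocorrelation as:
$$\hat{R}_{\hat{n}}[k] = \sum_{m=-\infty}^{\infty} x[\hat{n}+m]\,\hat{w}_1[m]\; x[\hat{n}+m+k]\,\hat{w}_2[m+k]$$
- where $\hat{w}_1[m] = \tilde{w}_1[-m]$ and $\hat{w}_2[m] = \tilde{w}_2[-m]$
- for rectangular windows we choose the following:
$$\hat{w}_1[m] = 1,\ 0 \le m \le L-1; \qquad \hat{w}_2[m] = 1,\ 0 \le m \le L-1+K$$
- giving
$$\hat{R}_{\hat{n}}[k] = \sum_{m=0}^{L-1} x[\hat{n}+m]\,x[\hat{n}+m+k], \quad 0 \le k \le K$$
- always use L samples in the computation of $\hat{R}_{\hat{n}}[k]$, for all k

Examples of Modified Autocorrelation
[Figure: the L-sample segment (ending at L-1) and the extended segment (ending at L+K-1) used in the modified autocorrelation.]
- $\hat{R}_{\hat{n}}[k]$ is a cross-correlation, not an auto-correlation
- $\hat{R}_{\hat{n}}[k] \ne \hat{R}_{\hat{n}}[-k]$
- $\hat{R}_{\hat{n}}[k]$ will have a strong peak at k = P for periodic signals and will not fall off for large k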
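The difference between the two definitions shows up clearly on an exactly periodic signal: the modified form keeps its full height at multiples of the period, while the unmodified form tapers. In this sketch both functions index the frame from its first sample $\hat{n}$ so that they use the same data; the names and the synthetic signal are assumptions.

```python
def modified_autocorr(x, n_hat, L, k):
    """R_hat[k] = sum_{m=0}^{L-1} x[n_hat+m] * x[n_hat+m+k]: always L products."""
    return sum(x[n_hat + m] * x[n_hat + m + k] for m in range(L))


def short_time_autocorr(x, n_hat, L, k):
    """Unmodified version on the same L samples (frame starts at n_hat): L - k products."""
    return sum(x[n_hat + m] * x[n_hat + m + k] for m in range(L - k))


P, L, K = 50, 400, 150                   # K: largest lag of interest
x = [0.9 ** (n % P) for n in range(L + K)]
n_hat = 0
mod = [modified_autocorr(x, n_hat, L, k) for k in range(K + 1)]
std = [short_time_autocorr(x, n_hat, L, k) for k in range(K + 1)]
```

Note that the modified form needs L + K samples of signal, which is the price paid for using L products at every lag.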

Examples of Modified AC
[Figures: modified autocorrelations for a fixed window length L, and for three different values of L.]

Waterfall Examples
[Figures: waterfall plots of short-time autocorrelation functions over successive frames.]

Short-time AMDF
- belief that for periodic signals of period P, the difference function
$$d[n] = x[n] - x[n-k]$$
  will be approximately zero for $k = 0, \pm P, \pm 2P, \dots$ For realistic speech signals, d[n] will be small at k = P, but not zero. Based on this reasoning, the short-time Average Magnitude Difference Function (AMDF) is defined as:
$$\gamma_{\hat{n}}[k] = \sum_{m=-\infty}^{\infty} \left| x[\hat{n}+m]\,\tilde{w}_1[m] - x[\hat{n}+m-k]\,\tilde{w}_2[m-k] \right|$$
- with $\tilde{w}_1[m]$ and $\tilde{w}_2[m]$ being rectangular windows. If both are the same length, then $\gamma_{\hat{n}}[k]$ is similar to the short-time autocorrelation, whereas if $\tilde{w}_2[m]$ is longer than $\tilde{w}_1[m]$, then $\gamma_{\hat{n}}[k]$ is similar to the modified short-time autocorrelation (or covariance) function. In fact it can be shown that
$$\gamma_{\hat{n}}[k] \approx \sqrt{2}\,\beta[k] \left[ \hat{R}_{\hat{n}}[0] - \hat{R}_{\hat{n}}[k] \right]^{1/2}$$
- where $\beta[k]$ varies between 0.6 and 1.0 for different segments of speech.

AMDF for Speech Segments
[Figure: AMDF contours for speech segments.]
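With rectangular windows and the second window long enough to cover every lag, the AMDF is a sum of L absolute differences, and the period shows up as a minimum rather than a peak. This is a minimal sketch under those assumptions; the function names, the synthetic signal, and the lag search range are hypothetical.

```python
def amdf(x, n_hat, L, k):
    """gamma[k] = sum_{m=0}^{L-1} |x[n_hat+m] - x[n_hat+m-k]| (rectangular windows)."""
    if n_hat - k < 0:
        raise ValueError("need k samples of history before n_hat")
    return sum(abs(x[n_hat + m] - x[n_hat + m - k]) for m in range(L))


def estimate_period_amdf(x, n_hat, L, k_min, k_max):
    """Lag in [k_min, k_max] with the smallest AMDF; the search range is an assumption."""
    return min(range(k_min, k_max + 1), key=lambda k: amdf(x, n_hat, L, k))


P, L, K = 50, 200, 80
x = [0.9 ** (n % P) for n in range(K + L)]
n_hat = K                                # leave K samples of history
gamma_P = amdf(x, n_hat, L, P)
period = estimate_period_amdf(x, n_hat, L, k_min=20, k_max=80)
```

The AMDF needs only subtractions and absolute values, with no multiplications. For this exactly periodic test signal the minimum at k = P is exactly zero, whereas for real speech it would be small but non-zero.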

Summary
Short-time parameters in the time domain:
- short-time energy
- short-time average magnitude
- short-time zero crossing rate
- short-time autocorrelation
- modified short-time autocorrelation
- short-time average magnitude difference function