convenient means to determine response to a sum of clear evidence of signal properties that are obscured in the original signal

Digital Speech Processing Lecture 9 Short-Time Fourier Analysis Methods- Introduction 1

General Discrete-Time Model of Speech Production Voiced Speech: A V P(z)G(z)V(z)R(z) Unvoiced Speech: A N N(z)V(z)R(z) 2

Short-Time Fourier Analysis represent signal by sum of sinusoids or complex exponentials as it leads to convenient solutions to problems (formant estimation, pitch period estimation, analysis-by-synthesis methods), and insight into the signal itself such Fourier representations provide convenient means to determine response to a sum of sinusoids for linear systems clear evidence of signal properties that are obscured in the original signal 3

Why STFT for Speech Signals steady state sounds, like vowels, are produced by periodic excitation of a linear system => speech spectrum is the product of the excitation spectrum and the vocal tract frequency response speech is a time-varying signal => need more sophisticated analysis to reflect time varying properties changes occur at syllabic rates (~10 times/sec) over fixed time intervals of 10-30 msec, properties of most speech signals are relatively constant (when is this not the case) 4

Overview of Lecture define time-varying Fourier transform (STFT) analysis method define synthesis method from time-varying FT (filter-bank summation, overlap addition) show how time-varying FT can be viewed in terms of a bank of filters model computation methods based on using FFT application to vocoders, spectrum displays, format estimation, pitch period estimation 5

Frequency Domain Processing Coding: transform, subband, homomorphic, channel vocoders Restoration/Enhancement/Modification: noise and reverberation removal, helium restoration, time-scale modifications (speed-up and slow-down of speech) 6

Frequency and the DTFT sinusoids 0 xn n e e jω0n jω0n ( ) = cos( ω ) = ( + )/ 0 where ω is the frequency (in radians) of the sinusoid the Discrete-Time Fourier Transform ( DTFT ) jω ω j n Xe ( ) = xne ( ) = x (n) n n= π 2 DTFT { } 1 j ω j ω n -1 ω ω DTFT { j } xn ( ) = Xe ( ) e d = Xe ( ) 2π π jω where ω is the frequency variable of Xe ( ) 7

DTFT and DFT of Speech i The DTFT and the DFT for the infinite duration signal could be calculated (the DTFT) and approximated (the DFT) by the following: jω jωm X ( e ) = x ( m ) e ( DTFT ) m= N 1 Xk ( ) = xmwme ( ) ( ) = m=00 i using a value of ω= (2 πk / N) j(2 π / N) km jω Xe ( ) ( DFT), k = 0,1,..., N 1 N=25000 we get the following plot 8

25000-Point DFT of Speech 9 Log Magnitude (db) Magnitude

Short-Time Fourier Transform (STFT) 10

Short-Time Fourier Transform speech is not a stationary signal, i.e., it has properties that change with time thus a single representation based on all the samples of a speech utterance, for the most part, has no meaning instead, we define a time-dependent Fourier transform (TDFT or STFT) of speech that changes periodically as the speech properties change over time 11

Definition of STFT j ˆ ω nˆ = ˆ ˆ ω ˆ ˆ m= X ( e ) x ( m ) w ( n m ) e both n and ω are variables wn ( ˆ m) is a real window which determines the portion of xn ( ˆ) j ˆ ω that is used in the computation of Xn ˆ ( e ) 12

Short Time Fourier Transform STFT is a function of two variables, the time index, n, which is discrete, and the frequency variable, ω, which is continuous ˆ ˆ ( ω n ) = ( ) ( ˆ ) ˆ ω m= X e x m w n m e = DTFT ( xmwn ( ) ( ˆ m)) nˆ fixed, ˆ ω variable 13

Short-Time Fourier Transform alternative form of STFT (based on change of variables) is nˆ j ˆ ω j ˆ( ω n ˆ m ) X ( e ) = w( m) x( nˆ m) e if we define m= j ˆ ωnˆ j ˆ ωm = e x( nˆ m) w( m) e m= ˆ ˆ ( ω n ) = ( ˆ ) ( ) j ˆ ω m= j Xn ˆ e ˆ ω j ˆ ω j ˆ ωnˆ nˆ n j ˆ ω j ˆ ωnˆ X e x n m w m e then ( ) can be expressed as (using m = m) X ( e ) = e X ( e ) = e DTFT x( nˆ + m) w( m) 14

STFT-Different Time Origins the STFT can be viewed as having two different time origins 1. time origin tied to signal xn ( ) ˆ ω nˆ( ) = ( ) ( ˆ ) ˆ ω m= X e x m w n m e = DTFT xmwn ( ) ( ˆ m), nˆ fixed, ˆ ω variable 2. timeoriginti ed to window signal w ( m ) nˆ j ˆ ω j ˆ ω n ˆ j ˆ ω m X ( e ) = e x( nˆ + m) w( m) e m= j ˆ ωnˆ ˆ ω = j e X( e ) j ˆ ωnˆ = e DTFT w( m) x( nˆ + m), nˆ fixed, ˆ ω variable 15

Time Origin for STFT m= nˆ x[0] Time origin tied to window w[ m] x[ nˆ + m] 16

Interpretations of STFT j ˆ ω there are 2 distinct interpretations of Xn ˆ ( e ) j ˆ ω 1. assume nˆ is fixed, then X ( e ) is simply the normal Fourier j ˆ ω fixed n, Xnˆ ( e ) has the nˆ transform of the sequence wn ( ˆ mxm ) ( ), < m< => for transform nˆ X e same properties as a normal Fourier X e nˆ j ˆ ω 2. consider nˆ ( ) as a function of the time index with fixed. Then X ( e ) is in the form of a convolution of the signal x( nˆ ) e nˆ with the window j ˆ ω j ˆ ωnˆ wn ( ˆ). This leads to an interpretation in the form of ˆ ˆ ˆ linear filtering of the frequency modulated signal ( ) j ωn xne by wn ( ˆ ). ˆ ω - we will now consider each of these interpretations of the STFT in a lot more detail 17

Fourier Transform Interpretation j ˆ ω consider Xn ˆ ( e ) as the normal Fourier transform of the sequence wn ( ˆ mxm ) ( ), < m< for fixed nˆ. the window wn ( ˆ m) slides along the sequence xm ( ) and defines a new STFT for every value of nˆ what are the conditions for the existence of the STFT the sequence wn ( ˆ mxm ) ( ) must be absolutely summable for all values of nˆ - since xn ( ˆ) L(32767 for 16-bit sampling) - since wn ( ˆ) 1 (normalized window levels) - since window duration is usually finite wn ( ˆ mxm ) ( ) is absolutely summable for all nˆ 18

Frequencies for STFT the STFT is periodic in ω with period 2 π, i.e., j ˆ ω j ( ˆ ω+ 2πk X ) nˆ( e ) = Xn ˆ( e ), k can use any of several frequency variables to express STFT, including ˆ ω =ΩT ˆ (where T is the sampling period for x( m)) to represent nˆ ˆ ω = 2πfˆ or ˆ ω = 2πFT ˆ to represent normalized frequency (0 fˆ 1) or analog frequency (0 Fˆ F = / T), giving X ( e ) or X ( e ) jωˆ T analog radian frequency, giving X ( e ) j 2π fˆ j 2π Fˆ T s 1 nˆ nˆ 19

Signal Recovery from STFT j ˆ ω since for a given value of nx ˆ, nˆ ( e ) has the same properties as a normal Fourier transform, we can recover the input sequence exactly j ˆ ω since Xn ˆ ( e ) is the normal Fourier transform of the windowed sequence wn ( ˆ mxm ) ( ), then 1 j ˆ ω j ˆ ωm wn ( ˆ mxm ) ( ) = ˆ ( ) ˆ ω 2 Xn e e d π π π assuming the window satisfies the property that w( 0) 0 ( a trivial requirement), then by evaluating the inverse Fourier transform when m = nˆ, we obtain π 1 j ˆ ω jωnˆ xn ( ˆ) = ˆ ( ) ˆ ω 2π 0 Xn e e d w( ) π 20

Signal Recovery from STFT π 1 j ˆ ω jωnˆ xn ( ˆ) = ˆ ( ) ˆ ω 2π ( 0) Xn e e d w π with the requirement that w( 0) 0, the sequence x ( nˆ ) j ˆ ω j ˆ ω can be recovered exactly from Xn ˆ( e ), if Xn ˆ( e ) is known for all values of ˆ ω over one complete period - sample-by-sample recovery process j ˆ ω - Xnˆ ( e ) must be known for every value of n and for all i can also recover sequence wn ( ˆ mxm ) ( ) but can't guarantee that xm ( ) can be recovered since w( nˆ m) can equal 0 ˆ ˆ ω 21

Properties of STFT X ( e ) [ w( n m) x( m)] n fixed, ω variable relation to short-time power density function S ( e ) = X ( e ) = X ( e ) X ( e ) = DTFT R ( k) nˆ fixed j ˆ ω ˆ ˆ ˆ = ˆ n DTFT j ˆ ω j ˆ ω 2 j ˆ ω j ˆ ω nˆ nˆ nˆ nˆ nˆ nˆ [ ] R ( k) = w( nˆ m) x( m) w( nˆ m k) x( m+ k) S m= j ˆ ω Relation to regular Xe ( ) (assuming it exists) ( j ˆ ω ) = DTFT [ ( )] = ( ) j ˆ ωm m= π j ˆ ω 1 jθ j ( ˆ ω θ ) jθnˆ nˆ ( ) = ( ) ( ) 2π π Xe xm xme X e W e X e e dθ ˆ wn m xm We e Xe jθ jθnˆ jθ ( ) i ( ) ( ) ( ) nˆ j ˆ ω ( e ) 22

Properties of STFT j ˆ ω assume Xe ( ) exists s ( j ˆ ω ) = DTFT [ ( )] = ( ) j ˆ ωm m= Xe xm xme π ˆ ω 1 θ ( ˆ ω θ) θ ˆ X ˆ ( j ) = ( ) ( ) θ 2π j j j n n e W e X e e d π limiting case j ˆ ω wn ( ˆ) = 1 < nˆ < We ( ) = 2πδ ( ˆ ω) π ˆ ω 1 ( ˆ ω θ ) θ ˆ ˆ ω X ˆ ( j ) = 2πδ( θ) ( ) θ ( ) 2π j j n = j n e X e e d X e π i.e., we get the same thing no matter where the window is shifted 23

Alternative Forms of STFT jωˆ Alternative forms of Xnˆ ( e ) 1. real and imaginary i parts j ˆ ω j ˆ ω j ˆ ω X ˆ( ) Re ˆ( ) Im ˆ( ) n e = Xn e + j Xn e = a ( ˆ ω ) jb ( ˆ ω ) n ˆ n ˆ j ˆ ω aˆ( ˆ ω) Re ˆ( ) n = Xn e j ˆ ω bˆ ˆ ω = Im ˆ n( ) X n( e ) when xm ( ) and wn ( ˆ m) are both real (usually the case) can show that a ( ˆ ω) is symmetric in ˆ ω, and b ( ˆ ω) is anti-symmetric in ˆ ω 2. magnitude and phase nˆ X e X e e j ˆ ω nˆ ( ) nˆ ( ) j ˆ ω jθnˆ = ( ˆ ω) j ˆ ω can relate X ˆ( ) and θˆ( ˆ n e n ω) to aˆ( ˆ ω) and bˆ( ˆ ω) n nˆ n 24

Role of Window in STFT i The window wn ( ˆ m) does the following: 1. chooses portion of xm ( ) to be analyzed i jωˆ 2. window shape determines the nature of Xn ˆ( e ) X e nˆ f wn ( ˆ mxm ) ( ), j ˆ ω Since nˆ ( ) (for fixed ) is the normal FT o then if we consider the normal FT's of both xn ( ) and wn ( ) individually, we get ˆ ˆ ( j ω ω ) xme ( ) j m Xe We = m= ˆ ˆ ( j ω ω ) wme ( ) j m = m= 25

Role of Window in STFT then for fixed n, ˆ, the normal Fourier transform of the product wn ( ˆ mxm ) ( ) is the convolution of the transforms of wn ( ˆ m ) and xm ( ) nˆ w nˆ m W e e j ˆ ω j ˆ ωnˆ for fixed, the FT of ( ) is ( ) --thus π j ˆ ω 1 j θ j θ nˆ j ( ˆ ω θ ) X ˆ( ) = n e W ( e ) e X ( e ) d θ 2π π and replacing θ by θ gives π j ˆ ω 1 ˆ ( ˆ ) ˆ ( ) = θ θ ω+ θ ( ) ( ) θ 2π j j n j Xn e W e e X e d π 26

Interpretation of Role of Window j ˆ ω j ˆ ω Xn ˆ ( e ) is the convolution of X ( e ) with the FT of the shifted j ˆ ω j ˆ ωnˆ window sequence We ( ) e j ˆ ω X( e ) really doesn't have meaning since x( nˆ ) varies with time; consider xn ( ˆ) defined for window duration and extended for all time Xe that t reflect the sound within the window (can also consider xn ( ˆ ) = 0 j ˆ ω to have the same properties then ( ) does exist with properties outside the window another case) jωˆ and define Xe ( ) appropriately--but this is jωˆ Bottom Line: Xn ˆ( e ) is a smoothed version of the FT of the part of xn ( ˆ) that is within the window w. 27

Windows in STFT j ˆ ω for X ˆ nˆ ( e ) to represent the short-time spectral properties of x( n) jθ inside the window => We ( ) should be much narrower in frequency j ˆ ω than significant spectral regions of Xe ( )--i.e., almost a in frequency n impulse consider rectangular and Hamming windows, where width of the main spectral lobe is inversely proportional to window length, and side lobe levels are essentially independent of window length Rectangular Window: flat window of length L samples; first zero in frequency response occurs at F S /L, with sidelobe levels of -14 db or lower Hamming Window: raised cosine window of length L samples; first zero in frequency response occurs at 2F S /L, with sidelobe levels of -40 db or lower 28

Windows L=2M+1-point Hamming window and its corresponding DTFT 29

Frequency Responses of Windows 30

Effect of Window Length-HW 31

Effect of Window Length-HW 32

Effect of Window Length-RW 33

Effect of Window Length-HW 34

Relation to Short-Time Autocorrelation i X e w nˆ m x m j ˆ ω nˆ ( ) is the discrete-timetime Fourier transform of [ ] [ ] for each value of nˆ, then it is seen that S ( e ) = X ( e ) = X ( e ) X ( e ) j ˆ ω j ˆ ω 2 j ˆ ω * j ˆ ω nˆ nˆ nˆ nˆ is the Fourier transform of Rˆ () = [ ˆ n l w n m] x[ mwn ] [ ˆ l mxm ] [ + l] m= which is the short-time autocorrelation function of the previous chapter. Thus the above equations relate the short-time spectrum to the short-time autocorrelation, 35

Short-Time Autocorrelation and STFT 36

Summary of FT view of STFT jω interpret X ( e ) as the normal Fourier transform of the sequence nˆ wn ( ˆ mxm ) ( ), < m< properties of this Fourier transform depend on the window j ω frequency resolution of Xn ˆ( e ) varies inversely with the length of the window => want long windows for high resolution want x ( n ) to be relatively stationary y( (non-time-varying) during duration of window for most stable spectrum => want short windows as usual in speech processing, there needs to be a compromise between good temporal resolution (short windows) and good frequency resolution (long windows) 37

Linear Filtering Interpretation of STFT 38

Linear Filtering Interpretation 1. modulation-lowpass filter form ( n rather than nˆ ) j ˆ ω j ˆ ωm X ( e ) = x( m) e w( n m) n m= j ˆ ωn ( ) = wn ( ) xne ( ), nvariable, ˆ ω fixed = 1 2π π jθ j ( θ+ ˆ ω ( ) ( ) jθn We Xe ) e d π 2. bandpass filter-demodulation X n j ˆ ω j ˆ( ω n m ) ( e ) = w( m) x( n m) e m= j ˆ ωn j ˆ ωm = θ e ( w( m) e ) x( n m) m= e [( w( n) e ) x( n)], n variable, ˆ ω fixed j ˆ ωn j ˆ ωn = 39

Linear Filtering Interpretation 1. modulation-lowpass filter form: j ˆ ω n( ) = ( ) j ˆ ωm ( ) m= X ( e ) x( m) e w( n m), n variable, ˆ ω fixed jωn ( xne ) = ( ) wn ( ) ( ˆ ω ) j ( ˆ ω ) = xn ( )cos( n) wn ( ) j xn ( )sin( n) wn ( ) = a ( ˆ ω) jb ( ˆ ω) n n 40

Linear Filtering Interpretation 2. bandpass filter-demodulation form ( ) X e = e w n e x n n j ˆ ω j ˆ ωn j ˆ ωn n( ) ( ) ( ), variable, ω fixed ˆ complex bandpass filter output modulated by signal e jωn jθ if We ( ) is lowpass, then filter is bandpass around θ = ω all real computation for lower half structure 41

Linear Filtering Interpretation Lowpass filter frequency response Bandpass filter frequency response 42

Linear Filtering Interpretation assume normal FT of x( n) exists jθ xn ( ) Xe ( ) (recall that ˆ ω is a particular frequency) j ˆ ωn j ( θ+ ˆ ω xne ( ) Xe ( ) ) => spectrum of xn ( ) at frequency ˆ ω is shifted to zero frequency; since the STFT is a convolution, the FT of the STFT is the product of the individual FT's, i.e., Xe We j ( θ+ ˆ ω ( ) jθ ) ( ) jθ jθ if We ( ) resembles a narrow band lowpass filter, i.e., We ( ) = 1 for small θ and is 0 otherwise, then Xe We Xe ( ˆ) ˆ ( j θ+ ω ) ( j θ ω ) ( j ) 43

STFT Magnitude Only for many applications you only need the magnitude of the STFT(not the phase) in such cases, the bandpass filter implementation is less complex, since j ˆ ω 2 2 Xn( e ) an( ω) bn( ω) 12 / = ˆ + ˆ = X e a ˆ b j ˆ ω 2 2 n( ) = n( ω) + n( ω) ˆ 12 / 44

Sampling Rates of STFT 45

Sampling Rates of STFT need to sample STFT in both time and frequency to produce an unaliased representation from which x(n) can be exactly recovered sampling rates lower than the theoretical minimum rate can be used, in either time or frequency, and x(n) can still be exactly recovered from the aliased (under- sampled) short-time time transform this is useful for spectral estimation, pitch estimation, formant estimation, speech spectrograms, vocoders for applications where the signal is modified, e.g., speech enhancement, cannot undersample STFT and still recover modified signal exactly 46

Sampling Rate in Time to determine the sampling rate in time, we take a linear filtering view j ˆ ω 1. Xn( e ) is the output of a filter with impulse response w( n) jω 2. We ( ) is a lowpass response with effective bandwidth of B Hertz => j ˆ ω j ˆ ω thus the effective bandwidth of Xn( e ) is B Hertz Xn( e ) has to be sampled at a rate of 2B samples/second to avoid aliasing Example: Hamming Window wn ( ) = 054. 046. cos( 2π n/( L 1)) 0 n L 1 = 0 otherwise 2Fs B (Hz); for L = 100, Fs = 10, 000 Hz => B = 200 Hz => need L rate of 400/sec (every 25 samples) for sampling rate in time 47

Sampling Rate in Frequency j ˆ ω since Xn ˆ ( e ) is periodic in ω with period 2π, it is only necessary to sample over an interval of length 2 π need to determine an appropriate finite set of frequencies, ωk = 2π k / N, k = 01,,..., N 1 j ˆ ω at which Xn ˆ( e ) must be specified to exactly recover x( n) j ˆ ω use the Fourier transform interpretation of X ( e ) nˆ j ˆ ω 1. if the window wn ( ) is time-limited, then the inverse transform of Xn ˆ( e ) jωˆ Xnˆ e is time-limited 2. the sampling theorem requres that we sample ( ) in the frequency dimension at a rate of at least twice its ('symmetric') "time width" j ˆ ω 3. since the inverse Fourier transform of X ˆ nˆ ( e ) is the signal x( m) w( n m) and this signal is of duration L samples (the duration of w( n)), then according to the sampling theorem j ˆ ω Xn ˆ ( e ) must be sampled (in frequency) at the set of frequencies 2π k ω k =, k = 01,,..., L 1 (where L / 2 is the effective width of the window) L jωk in order to exactly recover xn ( ) from X( e ) nˆ thus for a Hamming window of duration L=100 samples, we require that the STFT be evaluated at at least 100 uniformly spaced frequencies around the unit circle 48

Total Sampling Rate of STFT the total sampling rate for the STFT is the product of the sampling rates in time and frequency, i.e., SR = SR(time) x SR(frequency) = 2B x L samples/sec B = frequency bandwidth of window (Hz) L = time width of window (samples) for most windows of interest, B is a multiple of F S /L, i.e., B = C F S /L (Hz), C=1 for Rectangular Window C=2 2 for Hamming Window SR = 2C F S samples/second can define an oversampling rate of SR/ F S = 2C = oversampling rate of STFT as compared to conventional sampling representation of x(n) for RW, 2C=2; for HW 2C=4 => range of oversampling is 2-4 this oversampling gives a very flexible representation of the speech signal 49

Mathematical Basis s for Sampling the STFT assume sample in time at n = n = rr, < r < 2π and in frequency at ˆ ω = ωk = k, k = 01,,..., N 1 N sample values j 2π k 2π N j N km X ( e ) = w[ rr m] x[ m] e rr 2π j k N m= 2π 2π j krr j k N N rr = e X ( e ) r j km N X ( e ) = x [ rr + mw ] ( m ) e ( set m = rr+ m ; m = m ) rr m= define DFT-type notation 2π 2π = j N k = j N krr r r R r X ( k) X ( e ) e X ( k) 2π 50

Sampling the STFT 51

Sampling the STFT DFT Notation 2π 2π j k j krr N N = = r r R r X ( k) X ( e ) e X ( k) let w[ m] 0 for 0 m L 1(finite duration window with no zero-valued samples) L 1 X [ k] = x[ rr+ m] w[ m] e r m= 0 2π j km N ( r fixed, 0 k N 1) if L N then (DFT defined with no aliasing can recover sequence exactly using inverse DFT) 1 x [ rr+ mw ] [ m ] = X [ k ] e N N 1 r k = 0 2π j km N ( r fixed, 0 m N 1) if R L (IDFT defined with no aliasing), then all samples ca n be recovered from X [ k] ( R > L gaps in sequence) r 52

What We Have Learned So Far j j ˆ ω m ˆ ω 1. X ( e ) = x( m) w( nˆ m) e nˆ m= i function of nˆ = n for sampled ˆ ω (looks like a time sequence) i function of ˆ ω = ω for sampled nˆ (looks like a transform) X e d for n ˆ = 123,,,...; 0 ˆ ω π jωˆ nˆ( ) (no sampling rate reduction) define j ˆ ω 2. X ˆ ˆ ˆ( ) DTFT ( ) ( ) fixed, ˆ n e = x m w n m n ω variable with time origin tied to x( nˆ ) j ˆ ω j ˆ ωnˆ X ˆ ˆ ˆ( ) DTFT ( ) ( ) fixed, ˆ n e = e x n+ m w m n ω variable with time origin tied to w( m) j ˆ ω 3. Interpretations of X ( e ) nˆ j ˆ ω 1. nˆ fixed, ˆ ω = ω variable; X ˆ nˆ ( e ) = DTFT x( m) w( n m) DFT View j ˆ ω j ˆ ωn 2. nˆ = n variable, ˆ ω fixed; X ( e ) = x( n) e w( n) Linear Filtering view filter bank implementation nˆ 53

What We Have Learned So Far 4. Signal Recovery from STFT π 1 j ˆ ω j ˆ ωm ˆ 2π n π π j ˆ ω j ˆ ωnˆ nˆ ω π xmwn ( ) ( ˆ m ) = X ( e ) e d ω 1 xn ( ˆ) = X( e ) e d 2πw ( 0 ) 5. Linear Filtering Interpretation j ˆ ω j ˆ ωn 1. modulation-lowpass filter Xn( e ) = w( n) x( n) e, j ˆ ω 1 jθ j ( θ+ ˆ ω ˆ variable, ˆ fixed ( ) ( ) ( ) jθn n = n ω Xn e = ) θ 2π W e X e e d ( ) j ˆ ω j ˆ ωn j ˆ ωn 2. bandpass filer-demodulation X n( e ) e w( n) e x( n), nˆ = n variable, ˆ ω fixed π π = 54

What We Have Learned So Far 6. Sampling Rates in Time and Frequency jω 1. time: We ( ) has bandwidth of BHertz 2B samples/sec rate 2FS Hamming Window: B = (Hz) L jω 2. frequency: wn ( ) is time limited to Lsamples inverse of X ( e ) is also time limited need to sample in frequency at twice the (effective) time width of the time-limited sequence L frequency samples 3. total Sampling Rate: 2BL samples/sec - B = frequency bandwidth of the window (Hz) - L = effective time width of the window (samples) B = C FS / L(Hz) Sampling Rate = 2B L = 2CF S samples/second - for Rectangular Window, C = 1 - for Hamming Window, C = 2 n 55

Spectrographic Displays 56

Spectrographic Displays Sound Spectrograph-one of the earliest embodiments of the time- dependent spectrum analysis techniques 2-second utterance repeatedly modulates a variable frequency oscillator, then bandpass filtered, and the average energy at a given time and frequency is measured and used as a crude measure of the STFT thus energy is recorded by an ingenious electro-mechanical system on special electrostatic paper called teledeltos paper result is a two-dimensional representation of the time-dependent spectrum-with vertical intensity being spectrum level at a given frequency, and horizontal intensity being spectral level at a given timewith spectrum magnitude being represented by the darkness of the marking wide bandpass filters (300 Hz bandwidth) provide good temporal resolution and poor frequency resolution (resolve pitch pulses in time but not in frequency) called wideband spectrogram narrow bandpass filters (45 Hz bandwidth) provide good frequency resolution and poor time resolution (resolve pitch pulses in frequency, but not in time) called narrowband spectrogram 57

Conventional Spectrogram (Every salt breeze comes from the sea) 58

Digital Speech Spectrograms wideband spectrogram follows broad spectral peaks (formants) over time resolves most individual pitch periods as vertical striations since the IR of the analyzing filter is comparable in duration to a pitch period what happens for low pitch males high pitch females for unvoiced speech there are no vertical pitch striations narrowband spectrogram individual harmonics are resolved in voiced regions formant frequencies are still in evidence usually can see fundamental frequency unvoiced regions show no strong structure 59

Digital Speech Spectrograms Speech Parameters ( This is a test ): sampling rate: 16 khz speech duration: 1.406 seconds speaker: male Wideband d Spectrogram Parameters: analysis window: Hamming window analysis window duration: 6 msec (96 samples) analysis window shift: 0.625 msec (10 samples) FFT size: 512 dynamic range of spectral log magnitudes: 40 db Narrowband Spectrogram Parameters: analysis window: Hamming window analysis window duration: 60 msec (960 samples) analysis window shift: 6 msec (96 samples) FFT size: 1024 dynamic range of spectral log magnitudes: 40 db 60

Digital Speech Spectrograms Top Panel: 3 msec (48 samples) window Second Panel: 6 msec (96 samples) window Third Panel: 9 msec (144 sample) window Fourth Panel: 30 msec (480 sample) window 61

Spectrogram Comparisons 62

Spectrogram - Male nfft = 1024, L = 80, Overlap = 75 63

Spectrogram - Female nfft = 1024, L = 80, Overlap = 75 64

Overlap Addition (OLA) Method 65

Overlap Addition (OLA) Method based on normal FT interpretation of short-time spectrum nˆ DFT / IDFT jωk X ( e ) y ( m) = x( m) w( nˆ m) k can reconstruct x(m) by computing IDFT of X ( e )and nˆ dividing out the window (assumed non-zero for all samples) this process gives L signal values of x(m) for each window => window can be moved by L samples and the process repeated since X ( e ) is "undersampled" in time, it is highly susceptible nˆ jω k to aliasing errors => need more robust synthesis procedure nˆ jω 66

Overlap Addition (OLA) Method jω ω k j kn yn ( ) = Xm( e ) e m k summation is for overlapping analysis sections jωk for each value of m where Xm( e ) is measured, do an inverse FT to give y ( n ) = Lx ( n ) w ( m n ) (where L is the size of the FT) m y ( n) = y ( n) = Lx( n) w( m n) a basic property of the window is m m N 1 j 0 jωk ( ) = ( ) ω = 0 = ( ) k n= 0 We We wn m since any set of samples of the window are equivalent (by sampling arguments), then if wn ( ) is sampled often enough we get (independent of n) m wm n = We j 0 ( ) ( ) j 0 yn ( ) = Lx ( nw ) ( e ) using overlap-added d sections 67

Overlap Addition of Bartlett and Hann Windows 68

Overlap Addition of Hamming Window L=128 69

Window Spectra DTFTof Bartlett (triangular), Hann and Hamming windows 70

Hamming Window Spectra DTFTs of even-length, odd-length and modified odd-to-even length Hamming windows; zeros spaced at 2π/R give perfect reconstruction using OLA (even-length window) 71

Overlap Addition (OLA) Method 72

Overlap Addition (OLA) Method w(n) is an L-point Hamming window with R=L/4 assume x(n)=0 for n<0 time overlap of 4:1 for HW first analysis section begins at n=l/4 73

Overlap Addition (OLA) Method 4-overlapping sections contribute to each interval N-point FFT s done using L speech samples, with N-L zeros padded at end to allow modifications without significant aliasing effects for a given value of n y(n)=x(n)w(r-n)+x(n)w(2r- ( ) ( ) ( ) ( n)+x(n)w(3r-n)+x(n)w(4rn)=x(n)[w(r-n)+w(2r-n)+w(3rn)+w(4r-n)]=x(n) W(e j0 )/R 74

Filter Bank Summation (FBS) 75

Filter Bank Summation the filter bank interpretation of the STFT shows that for jω k any frequency ωk, Xn( e ) is a lowpass representation of the signal in a band centered at ω ( n = nˆ for FBS) jω jω n k k X ( e ) = e x( n m) w ( m) e n m= k k jω m where wk( m) is the lowpass window used at frequency ωk (we have generalized the structure to allow a different lowpass window at each frequency ω k ). k 76

Filter Bank Summation define a bandpass filter and substitute it in the equation to give j k h ( n ) = w ( n ) e ω k k jωk jωkn Xn( e ) = e x( n m) hk( m) n m= 77

Filter Bank Interpretation of STFT (case: w k [n]=w[n]) wn [ ] We ( jω ) -ω 0 ω 0 ω jωkn jω jω jωkn hn [ ] = wne [ ] He ( ) = W( e ) FT( e ) jω n jω n jωn j( ω ω ) n k k k FT( e ) = e e = e = δ ( ω ω ) n= n= He We We jω jω j ( ω ωk ) ( ) = ( ) δω ( ωk ) = ( ) [ ] [ ] ( j ωk ) j ωk v n = X e = e n x[ n] h[ n] k n ( j ω ) ( j ω ) ( j ω V ) j ωkn k e = H e X e FT( e ) jω jω = He ( ) Xe ( ) δω ( + ωk ) = Xe + j ω j ( ω ω k ) ( ) We ( ) δ ( ω ωk ) j ( ω+ ωk ) jω = Xe ( ) We ( ) lowpass k single-sided, bandpass ω k -ω 0 ω k ω k + ω 0 ω 78

Filter Bank Summation jω k thus X ( e ) is obtained by bandpass filtering x( n) n followed by modulation with the complex exponential j kn e ω. We can express this in the form k n jω jω n k m= k k y ( n) = X ( e ) e = x( n m) h ( m) thus y k ( n ) is the output of a bandpass filter with impulse response h ( n) k 79

Filter Bank Summation 80

Filter Bank Summation a practical method for reconstructing xn ( ) from the STFT is as follows jω k 1. assume we know X ( e ) for a set of N frequencies { ω }, k = 01,,..., N 1 n 2. assume we have a set of N bandpass filters with impulse responses jω n k h ( n) = w ( n) e, k = 01,,..., N 1 k k 3. assume w ( n ) is an ideal lowpass filter with cutoff frequency ω k k pk - the frequency response of the bandpass filter is j j( k ) H ( e ω ) = W ( e ω ω ) k k 81

Filter Bank Summation consider a set of N bandpass filters, uniformly spaced, so that the entire frequency band is covered 2π k ω k =, k = 01,,..., N 1 N also assume window the same for all channels, i.e., wk ( n) = w( n), k = 01,,..., N 1 if we add together all the bandpass outputs, the composite response is jω N 1 N 1 jω k k= 0 k= 0 jω j( ω ωk ) He ( ) = H ( e ) = We ( ) k if We ( ) is properly sampled in frequency ( N L), where Lis the window duration, then it can be shown that 1 N 1 N k = 0 j ( ω ω ) k W ( e ) = w ( 0 ) ω FBS Formula 82

Filter Bank Summation 1/N 83

Filter Bank Summation derivation of FBS formula FT / IFT jω wn ( ) We ( ) jω if We ( ) is sampled in frequency at N uniformly spaced points, the inverse discrete Fourier transform of the sampled version of We ( ) is (recall that sampling multiplication convolution aliasing) 1 N N 1 jω jω n k= 0 r= an aliased version of jω k k k We ( ) e = wn ( + rn) wn ( ) is obtained. 84

Filter Bank Summation If wn ( ) is of duration Lsamples, then wn ( ) = 0, n< 0, n L and no aliasing occurs due to sampling in frequency j ω of W ( e ). In this case if we evaluate the aliased formula for n = 0, we get 1 N N 1 1 k = 0 jω k We ( ) = w( 0) thefbsformulaisseentobeequivalenttotheformula seen to equivalent to above, since (according to the sampling theorem) any set of N uniformly spaced samples of W ( e ) is adequate jω 85

Filter Bank Summation the impulse response of the composite filter bank system is N 1 N 1 jωkn hn ( ) h( n) wne ( ) Nw( 0) δ ( n) = = = k k= 0 k= 0 thus the composite output is yn ( ) = xn ( ) hn ( ) = Nw( 0) xn ( ) thus for FBS method, the reconstructed signal is N 1 N 1 k n jωk jωkn k= 0 k= 0 yn ( ) = y ( n) = X ( e ) e = Nw( 0) xn ( ) jω k if Xn( e ) is sampled properly in frequency, and is independent of the shape of wn ( ) 86

Filter Bank Summation 1/N 87

Filter Bank Summation N 1 N 1 2π 2π j k j kn N N yn ( ) = y ( n ) = X ( e ) e k k= 0 k= 0 N 1 = xmwn ( ) ( me ) e k= 0 m n N 1 = xmwn ( ) ( m) e m k= 0 2π 2π j km j kn N N 2π j k( n m) N = xmwn ( ) ( m ) N δ ( n m rn ) m r= y( n) = N w( rn) x( n rn) r = wn ( ) 0 for 0 n L 1 if N Lthen need only r = 0 term y(n) = Nw( 0)x(n) if N < L then in order for y( n) = x( n) you need the condition w( rn) = 0, r = ± 1, ± 2,... 'undersampled' d' representation ti can still work--at least in theory 88

Summary of FBS Method j perfect reconstruction of x(n) from X ( e ω ) is possible using FBS under the following conditions: 1. w(n) is a finite duration filter/window j 2. X ( e ω ) is sampled properly in both time and frequency n perfect reconstruction of x(n) from Xn( e k ) is also possible using FBS under the following condition: jω We ( ) is perfectly bandlimited j e ω n jω k To avoid time aliasing, Xn( ) must be evaluated at at least L uniformly spaced frequencies, where L is the window duration -since window of length L samples has frequency bandwidth of from 2π / L (for RW) to 4π / L (for HW), the bandpass filters in FBS overlap in frequency since the analysis frequencies are 2π k/ L, k = 01,,..., L 1 j k e ω there is a way (at least theoretically) for Xn( ) to be evaluated in non-overlapping bands and for which x(n) can still be exactly recovered 89

FBS Reconstruction in Non- Overlapping Bands assume window length for all bands is L samples assume the same window is used for N equally spaced frequency bands with analysis frequencies 2 π k ω k =, k = 01,,..., N 1 N where N can be less than L assume w(n) is an ideal lowpass filter with cutoff frequency π ωp = N example with N=6 equally spaced ideal filters 90

FBS Reconstruction in Non- Overlapping Bands the composite impulse response for the FBS system is N 1 N 1 jω n jω n k k hn ( ) = wne ( ) = wn ( ) e k= 0 k= 0 defining a composite of the terms being summed as N 1 N 1 jω n j2π kn/ N k pn ( ) = e = e k= 0 k= 0 we get for h ( n) h ( n) = wn ( ) pn ( ) it is easy to show (Prob. 6.7) that p(n) is a periodic train of impulses of the form p ( n ) = N δ ( n rn ) r = giving for hn ( ) the expression hn ( ) = N wrn ( ) δ ( n rn ) r = thus the composite impulse response is the window sequence sampled at intervals of N samples 91

FBS Reconstruction in Non-Overlapping Bands impulse response of ideal lowpass filter with cutoff frequency π/n for ideal LPF we have πsin( n/ N) sin( r) 1 δwn ( ) = π, wrn ( ) = = ( r) n rn Nπ h ( n) = ( n) πgiving other cases where perfect reconstruction is obtained 1. wn ( ) is of finite length L< Nδand causal (no images) 2. wn ( ) has length > N and has the property wn ( ) = 1/ N, for n= rn 0 = 0 for n = rn ( r r0, r = 0, ± 1, ± 2,...) giving hn ( ) = pnwn ( ) ( ) = ( n rn 0 ) j rnδ0 ( ) = j r N He e yn ( ) = xn ( rnω) ω0 92

Summary of FBS Reconstruction for perfect reconstruction using FBS methods 1. w(n) does not need to be either time-limited or frequency-limited to exactly reconstruct jω k x(n) from Xn( e ) 2. w(n) just needs equally spaced zeros, spaced N samples apart for theoretically ti perfect reconstruction ti exact reconstruction of the input is possible with a number of frequency channels less than that required by the sampling theorem key issue is how to design digital filters that match these criteria 93

Practical Implementation of FBS 94

FBS and OLA Comparisons 95

FBS and OLA Comparisons duals filter bank summation method overlap addition method -- one depends on sampling relation in frequency -- one depends on sampling relation in time FBS requires sampling in frequency be such that the window jω transform We ( ) obeys the relation 1 N N 1 k = 0 j( ω ω ) k We ( ) = w( 0) any ω OLA requires that t sampling in time be such hthat tthe window obeys the relation r = j 0 wrr ( n ) = We ( )/ R any n the key to Short-Time Fourier Analysis is the ability to modify the short-time spectrum (via quantization, noise enhancement, signal enhancement, speed-up/slow-down,etc) and recover an "unaliased" modified signal 96

Overlap Addition (OLA) Method jω k assume X ( e ) sampled with period R samples in time n jω jω k k Y ( e ) = X ( e ), r integer, 0 k N 1 r rr the Overlap Add Method is based on the summation N 1 1 jωk jωkn yn ( ) = Yr ( e ) e OLA Method r= N k= 0 k for each value of r, compute the inverse transform of Yr ( e ) giving the sequences yr ( m) = x( m) w( rr m), < m < the signal at time n is obtained by summing the values at time n of all the sequences, y ( m) that overlap at time n, giving yn ( ) = y( n) = xn ( ) wrr ( n) r r= r = r if w(n) has a bandlimited FT and if Xn( e ) is properly sampled in time (i.e., R small enough to avoid aliasing) then jω k r = j 0 wrr ( n) We ( )/ R -- independent of n, and j 0 yn ( ) = xnwe ( ) ( )/ R-- exact reconstruction of x(n) jω 97