Processing of Non-Stationary Audio Signals


Processing of Non-Stationary Audio Signals

A dissertation submitted to the University of Cambridge for the degree of Master of Philosophy

Michael Hazas, Hughes Hall
3 August 1999

Signal Processing and Communications Laboratory, Department of Engineering, University of Cambridge

For my grandfathers, both of whom live in worlds very different from everyone else's.

Acknowledgements

My thanks go to Tanya Reeves and Peter de Rivaz for their patient and enlightening discussions regarding general wavelet theory and denoising. Thanks to Richard Cook for help in wrestling with LaTeX. Credit is also given to the creators of WaveLab, an extensive wavelet package for Matlab. However, I owe the most to my supervisor, Peter Rayner, for his guidance, support, and attention. This material is based upon work supported under a United States National Science Foundation Graduate Fellowship.

Contents

1 Introduction
2 State of the Art of Audio Denoising
  2.1 Current Methods for Removing Broad-Band Continuous Noise
    2.1.1 Overview of Spectral Domain Procedures
    2.1.2 The Musical Noise Phenomenon
  2.2 Current Methods for Removing Clicks
    2.2.1 Click Detection
    2.2.2 Click Correction
3 Wavelet and Local Trigonometric Theory
  3.1 The Wavelet Transform
    3.1.1 Continuous Wavelet Theory
    3.1.2 Multiresolution Filter Banks
  3.2 Local Trigonometric Transforms
    3.2.1 Cosine Transforms
    3.2.2 Smooth Local Cosine Bases
    3.2.3 Fast Implementations
  3.3 Wavelet and Local Cosine Packet Decomposition
    3.3.1 Wavelet Packets
    3.3.2 Local Cosine Packets
    3.3.3 Comparison between Wavelet and Local Cosine Packets
  3.4 Entropy Algorithms and Selecting a Best Basis
    3.4.1 Properties of an Orthonormal Tree
    3.4.2 Calculation of Entropy
    3.4.3 Pruning the Tree
4 The Denoising Algorithm
  4.1 Windowing the Time-Domain Signal
  4.2 Best Basis Selection from a Library
  4.3 Selecting Coherent Coefficients
    4.3.1 Entropy Selection
    4.3.2 Adaptive Selection
  4.4 Performing Thresholding
    4.4.1 Soft Thresholding
    4.4.2 Hysteresis
  4.5 Reconstruction in the Time Domain
  4.6 Other Considerations and Strategies
    4.6.1 Number of Levels of Decomposition
    4.6.2 Iterative Processing of Each Window
    4.6.3 Time and Frequency Shifting
5 Results of the Denoising Algorithm
  5.1 Evaluation of Algorithm Strategies Using "Tom's Diner"
    5.1.1 Decimated and Non-Decimated Wavelet Packets
    5.1.2 Wavelet and Local Cosine Packets
    5.1.3 Packet Bases of Frequency Representations
    5.1.4 Number of Decomposition Levels
    5.1.5 Window Length
    5.1.6 Coherent Coefficient Selection
    5.1.7 Hysteresis Thresholding
    5.1.8 Processing Residues
    5.1.9 Time and Frequency Shifting
    5.1.10 Summary
  5.2 Aria Performed by Enrico Caruso
  5.3 Comparison with State of the Art Denoising Methods
    5.3.1 Beethoven (Piano)
    5.3.2 Mussorgsky (Orchestral)
    5.3.3 King Oliver (Jazz)
6 Conclusions
  6.1 Future Directions
A Notational Conventions
B Track List for Accompanying CD

1 Introduction

In the past, methods of denoising audio signals have assumed that the signal is stationary over a specified interval of time. The entire time-domain signal to be processed is divided into a series of these time intervals (often referred to as "windows"), and the DSP algorithm processes each interval separately. However, audio signals (both speech and music) are generally not stationary, and cannot always be assumed to be stationary over each of these equal, arbitrarily-set windows. It is therefore the objective of this thesis to provide an alternative to the more traditional spectral domain or model-based approaches to denoising, by investigating whether wavelet packet and local trigonometric packet bases can be used to successfully decompose the signal for processing. The rationale is that adaptive wavelet and trigonometric transforms inherently tend to pick out the significant portions of a signal. (Examples of significant portions might be chords in music or phonemes in speech.) Given a portion of a degraded time-domain audio signal, the wavelet/trigonometric transform will ideally break it up into intervals of frequency or time where the signal is likely to be relatively stationary. The transform coefficients from this decomposition are then evaluated, and the ones likely to correspond to parts of the underlying signal are identified. The remaining coefficients correspond to noise, and can be adjusted with the aim of reducing the noise content upon transformation back to the time domain.

The next section gives the current state of research in audio denoising. Different classes of audio degradation are outlined, and the most popular denoising strategies are presented for two types of degradation in particular: broad-band noise and clicks. Section 3 reviews the mathematical foundation for wavelet and trigonometric transforms.
It gives fast algorithms to accomplish these transforms, and then extends them to packet decompositions, discussing algorithms to find a best basis representation of a signal. These algorithms are then applied in section 4, where a procedure for denoising audio signals using wavelet and trigonometric packet bases is presented and discussed in detail. Alternate strategies and variations on the algorithm are also given. The material in section 4 owes largely to the work of Berger, Coifman, and Goldberg [2] and Ramarapu and Maher [2]. Section 5 then presents the results of processing several different signals (involving continuous broad-band noise and/or click degradation) with the wavelet and local trigonometric denoising algorithm. The results are compared to signals denoised with spectral domain and model-based processing. Issues such as preservation of audio fidelity and naturalness of the denoised audio signal are discussed. An audio compact disc which accompanies this thesis contains the original test signals as well as the results of the majority of the tests detailed in section 5, so that the reader may make their own judgements about the perceptual audio quality of the denoised signals. The test results also appear on a data session of the CD as wav files. Section 6 concludes the thesis and provides commentary about the application of wavelet and trigonometric bases to audio denoising. Directions for future research in the field are identified.

2 State of the Art of Audio Denoising

The current methodology used in denoising audio signals is to process the corrupted signal using several different algorithms, each algorithm treating a different class of degradation. According to Godsill and Rayner [10, pages 6-7], there are a total of five classes of signal degradation, which can be divided into two categories. One category is global degradation, which affects all samples of the waveform. Global degradation is manifested in the following forms:

1. broad-band continuous noise: commonly known as "hiss"
2. wow and flutter: pitch variation resulting in modulation of all frequency components
3. distortion: any other non-linear defect which has an effect across the entire waveform; clipping is one example

The remaining two classes of degradation belong to the category of localised degradation. These are discontinuities in the waveform which affect relatively few samples.

4. clicks: described by Godsill and Rayner as "short bursts of interference random in time and amplitude" [10, page 6]
5. low-frequency noise transients: large defects in the media which come across as audible "thumps" during playback

Of the above five classes of degradation, two in particular are of interest for the purposes of this thesis: broad-band continuous noise (a global degradation) and clicks (a local degradation). The following sections explain the current methods of denoising audio signals exhibiting these two classes of degradation.

2.1 Current Methods for Removing Broad-Band Continuous Noise

Based on the assumption that music and speech signals are comprised mainly of fundamental tones and their harmonics, it has been popular practice to remove broad-band continuous noise by processing short windows of the corrupted signal in the spectral domain [10, page 36]. This section explains this particular strategy, and then discusses the phenomenon of "musical noise", an adverse side-effect of the spectral domain processing procedure.

[Figure 1: Spectral Domain Procedure for Removing Broad-Band Noise. Block diagram: the degraded time-domain signal is split into overlapping windows; each window y is multiplied by a pre-windowing function and transformed to the spectral domain (Y); the spectral components are processed (X = f(Y)) and inverse transformed (x); each window is multiplied by a post-windowing function, the windows are re-assembled, and the result is multiplied by a gain compensation function to give the corrected time-domain signal.]

2.1.1 Overview of Spectral Domain Procedures

Figure 1 shows the basic procedure for spectral domain removal of broad-band continuous noise. The time-domain signal is first broken up into a series of overlapping windows. Each window is then processed individually in the spectral domain. Before performing the spectral transform, the window is multiplied by a smooth pre-windowing function (such as a Hamming or Hanning window) in order to reduce spectral artifacts caused by the discontinuities at the edges of the window. Once in the spectral domain, the discrete spectral coefficients (also called "frequency bins") are adjusted using some function. (Examples of these functions are given below.) The modified spectral components are then transformed back into the time domain, and multiplied by a post-windowing function. This post-windowing function once again ensures that no discontinuities are introduced at the edges of the window. All of the overlapping windows are then added back together, and multiplied by a gain compensation function which corrects for the variations in signal amplitude introduced by the pre- and post-windowing functions.

There are several functions which are commonly used to adjust the spectral components. One is based upon the frequency-domain Wiener filter, which minimises the mean-square error of the reconstruction in the time domain [10, page 39]:

f(Y(m)) = \frac{S_X(m)}{S_X(m) + S_N(m)}\, Y(m)    (2.1)

where f is the spectral processing function, Y(m) is the spectral transform of

the noisy signal, S_X(m) is the power spectrum of the signal, and S_N(m) is the power spectrum of the noise. (Note that m denotes the frequency in the spectral domain.) In order to implement equation 2.1, the power spectrum of the noise and the power spectrum of the signal must be known. The power spectrum of the noise can be estimated from portions of the signal where there is no musical content. However, estimating the power spectrum of the signal is a bit trickier. Where S_Y(m) is the power spectrum of the noisy signal,

S_Y(m) = S_X(m) + S_N(m).

By making the assumption

S_Y(m) \approx |Y(m)|^2,

we arrive at the approximation for the power spectrum of the signal:

S_X(m) = \begin{cases} |Y(m)|^2 - S_N(m) & \text{if } |Y(m)|^2 > S_N(m), \\ 0 & \text{otherwise.} \end{cases}

Thus, equation 2.1 becomes

f(Y(m)) = \begin{cases} \dfrac{|Y(m)|^2 - S_N(m)}{|Y(m)|^2}\, Y(m) & \text{if } |Y(m)|^2 > S_N(m), \\ 0 & \text{otherwise.} \end{cases}    (2.2)

Other similar functions for processing the spectral components include spectral subtraction [3]

f(Y(m)) = \begin{cases} \dfrac{|Y(m)| - \sqrt{S_N(m)}}{|Y(m)|}\, Y(m) & \text{if } |Y(m)|^2 > S_N(m), \\ 0 & \text{otherwise,} \end{cases}    (2.3)

and power subtraction [7]

f(Y(m)) = \begin{cases} \sqrt{\dfrac{|Y(m)|^2 - S_N(m)}{|Y(m)|^2}}\, Y(m) & \text{if } |Y(m)|^2 > S_N(m), \\ 0 & \text{otherwise.} \end{cases}    (2.4)

Note that the above spectral processing functions (equations 2.2, 2.3, and 2.4) can be thought of as thresholding algorithms. If the spectral content of a given frequency bin is equal to or below the expected broad-band noise content for that bin, then it is assumed that the original signal had no significant spectral content in that bin, and it is set to zero. However, if the spectral content is greater than the expected noise content for a given bin, then in each of the above spectral processing functions, the bin is scaled according to some function of the estimated noise power and measured signal spectrum.
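To make the procedure concrete, the following is a minimal Python/NumPy sketch of the pipeline of Figure 1 combined with the thresholding rules of equations 2.2-2.4. It is an illustrative reconstruction, not code from this thesis (which used Matlab/WaveLab); the function names, the 2048-sample Hann windows, and the half-window overlap are assumptions of the sketch.

```python
import numpy as np

def wiener_gate(Y, S_N):
    """Eq. 2.2: zero any bin whose power is at or below the noise
    estimate S_N, otherwise scale it down."""
    P = np.abs(Y) ** 2
    return np.where(P > S_N, (P - S_N) / np.maximum(P, 1e-12) * Y, 0.0)

def spectral_subtraction(Y, S_N):
    """Eq. 2.3: subtract the estimated noise magnitude from each bin."""
    mag = np.abs(Y)
    return np.where(mag ** 2 > S_N,
                    (mag - np.sqrt(S_N)) / np.maximum(mag, 1e-12) * Y, 0.0)

def power_subtraction(Y, S_N):
    """Eq. 2.4: subtract the estimated noise power from each bin's power."""
    P = np.abs(Y) ** 2
    return np.where(P > S_N,
                    np.sqrt(np.maximum(P - S_N, 0.0) / np.maximum(P, 1e-12)) * Y,
                    0.0)

def spectral_denoise(x, gain_fn, S_N, n_win=2048, hop=1024):
    """Overlap-add spectral processing (Figure 1): pre-window, transform,
    adjust bins with gain_fn, inverse transform, post-window, re-assemble,
    and compensate for the windowing gain."""
    w = np.hanning(n_win)
    y = np.zeros(len(x))
    gain = np.zeros(len(x))
    for start in range(0, len(x) - n_win + 1, hop):
        Y = np.fft.rfft(x[start:start + n_win] * w)   # pre-window + transform
        seg = np.fft.irfft(gain_fn(Y, S_N), n_win)    # adjust bins + invert
        y[start:start + n_win] += seg * w             # post-window, overlap-add
        gain[start:start + n_win] += w * w
    nz = gain > 1e-8
    y[nz] /= gain[nz]                                 # gain compensation
    return y
```

With `gain_fn = lambda Y, S_N: Y` the pipeline is an identity away from the signal edges, which is a useful sanity check before enabling any of the three thresholding rules.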

2.1.2 The Musical Noise Phenomenon

Without any special measures being taken, the above three spectral processing functions all have the undesirable side-effect of producing "musical noise". The spectral processing algorithms scale any frequency bins whose magnitude squared is greater than the estimated noise power for that bin. Inevitably, in each frame there will be some spectral components due to noise whose power is greater than the estimated noise power. Ideally, these spectral components should be set to zero, but they are instead merely scaled down, leaving a spectral component in the denoised signal which was not part of the original music or speech. This is often referred to as an artifact or residual of the spectral domain processing.

[Figure 2: Residual Effect of Spectral Domain Processing. (a) Noise Spectrum; (b) Signal Spectrum after Spectral Subtraction. Spectral magnitude plotted against frequency bin.]

The best way to demonstrate the residual effect is to show what results when a window containing broad-band noise only (no musical content) is denoised. Figure 2(a) shows the spectrum of an interval of broad-band noise. Figure 2(b) shows the corrected signal spectrum after spectral subtraction processing. The residual peaks are clearly visible. (The plots in Figure 2 are based upon ones that appear in [10, page 46].)

Because of the random nature of noise, these residual spectral components will

appear in different frequency bins for each time-domain window of the signal. Thus, as the processed signal is played back, different tones will be heard for each window, changing in very quick succession. The durations of the residuals depend upon the window size used and the sampling rate. For a piece of music or speech sampled at 44.1 kHz and denoised using window sizes of 2048 samples, the residuals change approximately every 45 ms. This leads to the "musical noise" effect. Audio in the presence of musical noise is often more unpleasant and more unnatural-sounding to the ear than the original noisy signal.

2.2 Current Methods for Removing Clicks

Many of the current successful methods for the detection and removal of click degradation are based upon auto-regressive (AR) modelling techniques. Section 2.2.1 discusses the detection of clicks, and section 2.2.2 treats the correction of the clicks once their location has been ascertained.

2.2.1 Click Detection

Simple click detection systems utilise high-pass filtering techniques, under the assumption that most signals contain relatively low-amplitude high-frequency components, whereas the clicks contained in the audio will have a fairly strong, short-duration component across all frequencies [10, page 28]. Similarly, wavelet and multiresolution methods have been used to detect clicks by decomposing the signal at a fine scale. In both of these methods, the filtered signal is compared to a threshold. Any samples exceeding that threshold are said to be clicks, and a correction algorithm is then applied.

Godsill and Rayner give a review of an AR model-based approach. Assuming that a short-duration window x_n of a signal is a stationary process, it can be modelled with an AR parameterisation [10, page 30]:

x_n = \sum_{i=1}^{P} a_i x_{n-i} + e_n,    (2.5)

where a_i are the AR coefficients, and e_n is the prediction error for sample n.
Provided the AR coefficients can be estimated, the inverse AR filter can be applied to a signal y_n which has been corrupted by impulsive noise (clicks), and a detection indicator can be expressed as:

\epsilon_n = y_n - \sum_{i=1}^{P} a_i y_{n-i}.    (2.6)

If the detection indicator \epsilon_n exceeds the neighbourhood of the expected prediction error e_n, then a click corruption may exist at that sample n. Godsill

and Rayner suggest that one possible criterion for click detection would be: if \epsilon_n > k\sigma_e, then a click could be said to exist (where \sigma_e is the standard deviation of the prediction error e_n). One might set k = 3 if using the common 3\sigma assumption for Gaussian processes. In general, the choice of threshold will always depend upon the AR model being used, the variance of the prediction error, and the amplitude of the impulsive noise [10, page 31].

The application of the basic AR model as a click detector has its difficulties, however, especially where several clicks are present in close vicinity. In addition, a single click affects more than just a few samples; it is often between five and one hundred samples in length. While the AR detector can distinguish very well the beginning of a click sequence, it has much more difficulty in finding the end of it. Alterations to the general AR model have been used in the past to overcome these problems. Some of these alterations have been manifested as matched filters, the auto-regressive moving average (ARMA) model, as well as a sinusoid plus AR residual model [10, pages 32-33].
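As a sketch of how equations 2.5 and 2.6 translate into code, the following Python/NumPy fragment fits AR coefficients by least squares, inverse-filters the corrupted signal, and applies the k\sigma_e detection rule. It is illustrative only: the function names and the plain least-squares fit are assumptions of the sketch, not the estimators discussed by Godsill and Rayner.

```python
import numpy as np

def fit_ar(x, P):
    """Estimate AR coefficients a_1..a_P of eq. 2.5 by least squares,
    so that x[n] is approximated by sum_i a_i x[n-i]."""
    rows = np.array([x[n - P:n][::-1] for n in range(P, len(x))])
    a, *_ = np.linalg.lstsq(rows, x[P:], rcond=None)
    return a

def detect_clicks(y, a, k=3.0):
    """Inverse-filter y with the AR model (eq. 2.6) and flag any sample
    whose prediction error exceeds k standard deviations."""
    P = len(a)
    eps = np.array([y[n] - a @ y[n - P:n][::-1] for n in range(P, len(y))])
    return np.flatnonzero(np.abs(eps) > k * eps.std()) + P  # indices into y
```

Note that \sigma_e is estimated here from the corrupted signal itself, so very dense clicks would inflate the threshold; the robust model variants mentioned above address exactly this weakness.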
If one considers a signal vector x drawn from an AR process with coefficients a, then the excitation (or error) vector e is defined as: e = Ax. (2.7) Suppose then, that the signal can be rearranged into two vectors: a vector of the known samples x (i) and a vector of the unknown samples x (i) which have been presumably corrupted by a click noise process. In matrix notation, this can be expressed as: x = Ux (i) + Kx (i), where U and K are the matrices which rearrange the known and unknown signal elements into the signal x. Equation 2.7 can then be rewritten as e = A ( Ux (i) + Kx (i) ). 8

Defining A_{-(i)} = AU and A_{(i)} = AK,

e = A_{-(i)} x_{-(i)} + A_{(i)} x_{(i)}.

The sum of the squares of the prediction errors is defined as E = e^T e. The objective is to interpolate the samples such that the LS error of the entire sequence is minimised:

x_{(i)}^{LS} = \arg\min_{x_{(i)}} \{E\}.

The sequence x_{(i)}^{LS} can be found by computing the minimum of E and solving for x_{(i)}:

\frac{\partial E}{\partial x_{(i)}} = 2 \left( \frac{\partial e}{\partial x_{(i)}} \right)^T e = 2 A_{(i)}^T \left( A_{-(i)} x_{-(i)} + A_{(i)} x_{(i)} \right) = 0

x_{(i)}^{LS} = - \left( A_{(i)}^T A_{(i)} \right)^{-1} A_{(i)}^T A_{-(i)} x_{-(i)}    (2.8)

A maximum a posteriori (MAP) interpolator can also be formed using the AR model [10, pages 107-108]. The LS interpolator (equation 2.8) is actually a special case of the MAP interpolator, when the first P samples of the sequence are uncorrupted (and thus part of x_{-(i)}). This special case can be applied to most click-degraded audio segments, as long as they are chosen such that they contain no clicks near the beginning of the interval.

There are several variants on the AR-based interpolator. One involves taking into account the pitch period of the signal. Where the pitch period is denoted by T,

x_n = \sum_{i=1}^{P} a_i x_{n-i} + \sum_{j=-Q}^{Q} b_j x_{n-T-j} + e_n.    (2.9)

This pitch-based extension to the AR model takes advantage of signals which have a relatively long-term correlation. A few examples would be the voiced portions of speech and singing, and music of a periodic nature [10]. Another similar variant on the AR interpolator adds a basis function to an AR residual. The form of this model is

x_n = \sum_{i=1}^{Q} c_i \phi_i[n] + \sum_{i=1}^{P} a_i r_{n-i} + e_n,    (2.10)

where \phi_i[n] is a vector of elements from the basis function \phi, and r_n denotes the residual of the signal (that is, the part not approximated by the basis expansion). The basis function \phi could be any number of things: a DC offset, a polynomial, a sinusoid, or even a set of wavelet bases [10]. As another alternative, there is the ARMA model, which incorporates the weighted sum of an excitation:

x_n = \sum_{i=1}^{P} a_i x_{n-i} + \sum_{j=0}^{Q} b_j e_{n-j}, \quad \text{where } b_0 = 1.    (2.11)

Godsill and Rayner develop the ARMA interpolator and present results of the algorithm on music signals [10, pages 122-126].
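The least-squares interpolator of equation 2.8 is compact enough to sketch directly. The following Python/NumPy fragment builds the prediction-error matrix A, partitions its columns into known and unknown samples, and solves for the missing values. The function names, and the toy AR(2) model used to exercise it, are assumptions of the sketch rather than anything prescribed by the derivation above.

```python
import numpy as np

def ar_interpolate(y, a, missing):
    """Least-squares AR interpolation (eq. 2.8): rebuild the samples listed
    in `missing` so that the AR prediction error of the whole block is
    minimised.  A is the (N-P) x N prediction-error matrix whose row for
    sample n holds 1 at column n and -a_i at column n-i."""
    N, P = len(y), len(a)
    A = np.zeros((N - P, N))
    for r in range(N - P):
        A[r, r:r + P] = -a[::-1]   # -a_P ... -a_1
        A[r, r + P] = 1.0
    known = np.setdiff1d(np.arange(N), missing)
    Ai = A[:, missing]             # columns acting on unknown samples
    Ak = A[:, known]               # columns acting on known samples
    x_ls = -np.linalg.solve(Ai.T @ Ai, Ai.T @ (Ak @ y[known]))
    out = y.copy()
    out[missing] = x_ls
    return out
```

Because the solution is the exact minimiser of the quadratic error E, any perturbation of the interpolated samples can only increase the prediction error of the block, which gives a simple way to check an implementation.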

3 Wavelet and Local Trigonometric Theory

Wavelet bases and local trigonometric bases are two classes of representations which may give us better insight into time-frequency behaviour than the short-time Fourier transform (STFT). This section explains the underlying mathematics of wavelet and local trigonometric transforms, and details how these are exploited to give best basis representations. Sections 3.1 and 3.2 address wavelet and local cosine transforms, respectively. For each kind of transform, the transform equations are developed, and then fast implementations are discussed. Section 3.3 shows how redundant wavelet and local trigonometric decompositions of signals can be structured into a binary tree of bases, known as packet representations, for both types of transforms. Methods of choosing the best complete and non-redundant representations of signals from these binary trees are then presented in section 3.4.

3.1 The Wavelet Transform

Wavelet transforms and multiresolution are topics that have been studied by mathematicians, scientists, and engineers. However, it was not until the early 1980s that connections were drawn across the fields. A full mathematical exploration of the continuous wavelet transform and its inverse is beyond the scope of this thesis. However, section 3.1.1 outlines some of the main ideas behind the continuous wavelet transform, which will hopefully aid understanding of the discrete wavelet transform and multiresolution filter banks, discussed in section 3.1.2.

3.1.1 Continuous Wavelet Theory

Most of the following information concerning continuous wavelet theory appears in [5, pages 76-85]. A wavelet [6] is a function \psi(t) \in L^2(\mathbb{R}) that satisfies the following three conditions.

1. It has an average of zero:
   \int_{-\infty}^{+\infty} \psi(t)\, dt = 0.

2. It is normalised such that \|\psi(t)\| = 1.

3. It is centred in the neighbourhood of t = 0.

The wavelet function \psi(t) can be scaled by s and translated by u such that:

\psi_{u,s}(t) = \frac{1}{\sqrt{s}}\, \psi\!\left(\frac{t-u}{s}\right).    (3.1)

Thus, changes in u cause the wavelet to slide to different points along the time axis, and changes in s stretch or squash the wavelet in the time domain. This stretching and squashing is often called dilating or scaling, because it allows varying resolution at different frequency and time scales. (Note that the factor of 1/\sqrt{s} on the right-hand side of equation 3.1 ensures that the dilated wavelet satisfies the normalisation condition \|\psi_{u,s}(t)\| = 1.) Essentially, s and u determine the time and frequency support of the wavelet function. Figure 3 depicts the time-frequency properties of a wavelet due to translation and dilation. \eta is the centre frequency of the wavelet when s = 1, and is defined as

\eta = \frac{1}{2\pi} \int_{0}^{+\infty} \omega\, |\hat{\psi}(\omega)|^2\, d\omega.    (3.2)

When the wavelet is squashed (s is made smaller), then the following things happen:

- Its centre frequency is shifted higher.
- Its time support decreases, increasing resolution in the time domain.
- Its frequency support increases, decreasing resolution in the frequency domain.

Conversely, when the wavelet is stretched (s is made larger), then the opposite occurs:

- Its centre frequency is shifted lower.
- Its time support increases, decreasing resolution in the time domain.
- Its frequency support decreases, increasing resolution in the frequency domain.

The wavelet transform Wf of a function f is the inner product of \psi_{u,s} and the function f:

Wf(u, s) = \langle f, \psi_{u,s} \rangle = \int_{-\infty}^{+\infty} f(t)\, \frac{1}{\sqrt{s}}\, \psi^*\!\left(\frac{t-u}{s}\right) dt.    (3.3)

This allows some frequency components of a portion of f to be examined, depending upon the time-frequency spread of the dilated and translated wavelet \psi_{u,s}. This transform is the equivalent of a convolution operation, or linear filtering. (As described below in section 3.1.2, linear filtering is exactly what takes place when wavelet transforms are implemented in the discrete domain on an interval.) Convolution is defined as

\int_{-\infty}^{+\infty} g(x)\, h(y-x)\, dx = \int_{-\infty}^{+\infty} h(x)\, g(y-x)\, dx = h * g(y).    (3.4)

[Figure 3: Time-Frequency Effects of Dilating and Translating Wavelets [5, page 83].]

Now, if

\bar{\psi}_s(v) = \psi_s^*(-v) = \frac{1}{\sqrt{s}}\, \psi^*\!\left(\frac{-v}{s}\right),

then equation 3.3 may be written as

Wf(u, s) = \int_{-\infty}^{+\infty} f(t)\, \frac{1}{\sqrt{s}}\, \psi^*\!\left(\frac{t-u}{s}\right) dt = \int_{-\infty}^{+\infty} f(t)\, \bar{\psi}_s(u-t)\, dt = f * \bar{\psi}_s(u).    (3.5)

However, a function's wavelet transform alone is an incomplete representation of the original signal. What is needed is a complementary transform, which takes some sort of average, much as the wavelet transform examines the differences, or details. The function which accomplishes this transform is the scaling function, \phi(t). The average, or low-frequency, approximation of a function f is as follows:

Vf(u, s) = \langle f, \phi_{u,s} \rangle = \int_{-\infty}^{+\infty} f(t)\, \frac{1}{\sqrt{s}}\, \phi^*\!\left(\frac{t-u}{s}\right) dt.    (3.6)

Following the logic of equation 3.5, this averaging transform can also be shown to be a convolution. If

\bar{\phi}_s(v) = \phi_s^*(-v) = \frac{1}{\sqrt{s}}\, \phi^*\!\left(\frac{-v}{s}\right),

then

    Vf(u, s) = ∫_{−∞}^{+∞} f(t) (1/√s) φ((t − u)/s) dt
             = ∫_{−∞}^{+∞} f(t) φ̄_s(u − t) dt    (3.7)
             = f ⋆ φ̄_s(u).

Mallat proves [5, pages 77–80] that a function can be fully reconstructed using the wavelet transforms up to and including a certain level s₀, and the low-frequency approximation from the same level s₀:

    f(t) = (1/C_ψ) ∫₀^{s₀} Wf(·, s) ⋆ ψ_s(t) ds/s² + (1/(C_ψ s₀)) Vf(·, s₀) ⋆ φ_{s₀}(t),    (3.8)

where

    C_ψ = ∫₀^{+∞} (|ψ̂(ω)|²/ω) dω,    (3.9)

and ψ̂(ω) is the Fourier transform of ψ(t). Equation 3.8 implies that a function's space is simply the sum of the spaces occupied by its wavelet transform and low-frequency transform:

    { f(t) } = { Vf(·, s) } ⊕ { Wf(·, s) }.    (3.10)

Figure 4 shows an example of a wavelet. The Mexican hat wavelet is the second-order derivative of a Gaussian, and is often used in computer vision ([5, page 77] and [25, page 79]).

It is also important to mention that some wavelets have the useful property of orthogonality. If a wavelet has this property, then the wavelet at one scale will be orthogonal to itself at all other scales, provided the scales used cover different frequency intervals. That is, the wavelet at each scale occupies a space that does not intersect with the spaces of the wavelet at any other scale. In set notation, this is expressed as

    { ψ_{u,s₀}(t) } ⊥ { ψ_{u,s≠s₀}(t) }.    (3.11)

Spaces occupied by orthogonal wavelet transforms of a function at different levels can also be shown to be orthogonal. This is expressed as

    { Wf(u, s₀) } ⊥ { Wf(u, s≠s₀) }.    (3.12)

Figure 4: Example Wavelet: The Mexican Hat

3.1.2 Multiresolution Filter Banks

This section describes implementing the wavelet transform in the discrete domain in the form of a filter bank. Wherever possible, connections will be drawn to the continuous-domain theory given in the previous section. The method of explaining wavelet transforms in terms of signal subspaces presented here is based on Strang's discussion of multiresolution [25, chapter 6].

Throughout this section, the variable j is used to refer to the level of the wavelet transform. The highest scale is denoted as J, and refers to the original signal with all of its details. The dilation parameter s associated with level j is simply s_j. It is important to note, however, that the scale level does not directly indicate the value of the dilation parameter. When the scale level j decreases, the actual value of the dilation parameter s_j increases. Conversely, when the scale increases, the value of the dilation parameter decreases. This is merely due to the way the dilation parameter is defined within the wavelet equation 3.3.

Suppose there is a signal f(t) that exists in the space V_J. The wavelet transform at the highest scale is Wf(u, s_{J−1}), and will exist in W_{J−1}, which is a subspace of V_J. The highest-scale low-frequency transform Vf(u, s_{J−1}) will then exist in V_{J−1}, which is also a subspace of V_J. According to equation 3.10, the signal space is the sum of the two subspaces formed by the transforms:

    V_J = V_{J−1} ⊕ W_{J−1}.    (3.13)

Another way of looking at it is that the space V_{J−1} contains the low-frequency portion of f(t), while W_{J−1} contains the high-frequency portion.
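In the discrete setting developed below, this split of V_J into a low-frequency part and a high-frequency part is computed by filtering and downsampling. A minimal sketch, assuming numpy and, purely for illustration, the Haar filter pair (the convolution/downsampling phase convention varies between implementations):

```python
import numpy as np

# Haar filter pair (an assumed, minimal example): c averages, d differences.
c = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass  -> V_{J-1} part
d = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass -> W_{J-1} part

def analysis_level(x):
    """One filtering-and-downsampling stage: keep every other convolution output."""
    v = np.convolve(x, c)[1::2]  # low-frequency (approximation) coefficients
    w = np.convolve(x, d)[1::2]  # wavelet (detail) coefficients
    return v, w

x = np.array([1.0, 1.0, 2.0, 2.0])
v, w = analysis_level(x)
```

For this smooth input the detail part w is zero, and the two parts together preserve the signal's energy, reflecting that V_J is fully covered by V_{J−1} and W_{J−1}.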

Suppose it is desired to compute the wavelet transform at the next scale down, J−2. The obvious way to do this would be to simply take the wavelet transform Wf(u, s_{J−2}). However, if the wavelet being used to accomplish the transforms happens to be orthogonal, then it is known that the wavelet transforms of the function f(t) at different scales occupy completely different spaces (equation 3.12). Also, the space V_{J−1} always contains the remainder of the signal, whatever is not represented by W_{J−1}. Thus, the J−2 wavelet transform of the signal f(t) is equivalent to the J−2 wavelet transform of the J−1 low-frequency portion of f(t):

    Wf(u, s_{J−2}) = Wf_{J−1}(u, s_{J−2}),    (3.14)

where

    f_{J−1}(t) = Vf(u, s_{J−1}).

The same can be said for the wavelet transforms at scales s_{J−3}, s_{J−4}, and so on. This makes the decomposition of a signal f(t) into wavelet and low-frequency spaces at different levels recursive in nature. Figure 5 depicts this recursive decomposition as a binary tree of low-frequency spaces V_j and wavelet spaces W_j. The binary tree representation follows equation 3.8; a signal space is fully represented by the sum of the wavelet spaces at all levels and the low-frequency space of the lowest level. Thus, the sum of all these spaces is said to be a complete basis for the signal:

    V_J = V₀ ⊕ W_{J−1} ⊕ W_{J−2} ⊕ W_{J−3} ⊕ ⋯ ⊕ W₀
        = V₀ ⊕ ⨁_{j=0}^{J−1} W_j.    (3.15)

In the digital domain, this recursive feature proves to be rather convenient. As implied by equations 3.5 and 3.7, it is possible to perform the wavelet and low-frequency transforms via convolution, or linear filtering in the digital domain. Let c[k] and d[k] be the digital filters associated with φ(t) and ψ(t), respectively; c[k] is used to perform the low-frequency transform, and d[k] is used to perform the wavelet (high-frequency) transform. Strang gives the relation between the digital filters and the continuous-time functions [25, pages 82 and 84]:

    φ(t) = Σ_k c[k] φ(2t − k)
    ψ(t) = Σ_k d[k] φ(2t − k).    (3.16)

According to these equations, the filter coefficients depend on the wavelet and scaling functions at a given time resolution t and the wavelet and scaling

Figure 5: Binary Tree of Wavelet Decomposition Spaces

functions at the next highest time resolution 2t. This implies that the filter coefficients would have to be re-calculated for each level of the wavelet decomposition. Fortunately, in the digital domain, there are ways of avoiding this.

One way is to downsample the signal at each level. Downsampling, also known as decimation, involves removing every other sample, so that the resulting vector is one half the length of the original. Downsampling the signal at each level decreases the scale of the signal, which is the equivalent of decreasing the scale of the filters. This allows the same filters c[k] and d[k] to be used at each level.

The implementation of these recursive filtering and downsampling operations is known as a filter bank. The filter bank shown in figure 6 corresponds to the binary tree of spaces shown earlier in figure 5. The downsampling operation is shown as ↓2. The vectors v̌_j and w̌_j indicate the results of the downsampled low-frequency and wavelet transforms at level j, and are known simply as the wavelet coefficients. This kind of filter bank is used in digital signal processing systems as a fast implementation of the wavelet transform, and is known as the fast biorthogonal wavelet transform. It is a recursive process of low- and high-pass filtering downsampled signals, and the entire process requires O(N) operations, where N is the length of the signal [5, page 267]. The fast biorthogonal wavelet transform generates N wavelet and low-frequency coefficients, so that the wavelet representation of a discrete signal occupies no more space in computer memory than the

Figure 6: Analysis Filter Bank

original signal.

Figure 7 shows the result of a fast biorthogonal wavelet transform on a real-world signal. It is the sung word "sitting", an excerpt from Suzanne Vega's "Tom's Diner". One can see that the upper-level wavelet coefficients indicate high-frequency activity, such as the "s" sound at the beginning of the word "sitting". Conversely, the lower-level wavelet coefficients respond more to the voiced parts in the two syllables of the word.

Figure 8 depicts the tiling of the time-frequency plane generally created by the fast biorthogonal wavelet transform. The highest-level wavelet coefficients w̌_{J−1} represent the entire upper half of the frequency band, but have a very compact time support. As the wavelet coefficient level decreases, the frequency representation becomes more precise, but at the expense of increased time support. Figure 9 shows this tiling of the time-frequency plane resulting from computing the wavelet coefficients of the sung word "sitting". Dark values correspond to high-amplitude wavelet coefficients, and light values correspond to low-amplitude coefficients.

Provided the filters c and d satisfy conditions for perfect reconstruction (detailed below), the original signal can be recovered from the wavelet coefficients. This is accomplished by first upsampling and then low- and high-pass filtering the coefficients at each level. Upsampling a vector is done by inserting a zero after every element, making the vector twice as long. Figure 10 shows a synthesis filter bank which would reconstruct the signal decomposed by the analysis filter bank of figure 6. g and h are low- and high-pass filters, respectively, and ↑2 denotes upsampling. In order to recover the signal perfectly, however, the filters c, d, g, and h must be specially designed.
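One such design can be checked numerically. The sketch below (assuming numpy) takes the Daubechies-4 low-pass filter as an example c of length four, builds g, d, and h by the flip/alternate-signs construction described below, and verifies that the analysis-plus-synthesis system reduces to a pure three-sample delay:

```python
import numpy as np

# Example c: the orthogonal Daubechies-4 low-pass filter (an assumed choice).
s3 = np.sqrt(3.0)
c = np.array([1.0 + s3, 3.0 + s3, 3.0 - s3, 1.0 - s3]) / (4.0 * np.sqrt(2.0))

# Conjugate mirror construction (flip, and alternate signs):
g = c[::-1].copy()                               # flip                   -> synthesis low-pass
d = c[::-1] * np.array([1.0, -1.0, 1.0, -1.0])   # flip + alternate signs -> analysis high-pass
h = c * np.array([-1.0, 1.0, -1.0, 1.0])         # alternate signs        -> synthesis high-pass

# Perfect reconstruction requires G(z)C(z) + H(z)D(z) to be a pure delay, 2 z^{-3}.
# Multiplying polynomials in z^{-1} is convolving their coefficient vectors.
pr = np.convolve(g, c) + np.convolve(h, d)
```

The resulting coefficient vector `pr` has a single non-zero entry of 2 at the delay index, confirming the perfect-reconstruction property for this filter set.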
Strang [25, page 105] outlines the requirements as

    G(z)C(z) + H(z)D(z) = 2z^{−l}
    G(z)C(−z) + H(z)D(−z) = 0,    (3.17)

where l is the overall delay of the system, and C(z), D(z), G(z), and H(z) are the representations of the filters c, d, g,

Figure 7: The Sung Word "Sitting" (Excerpt from "Tom's Diner"): (a) Time-Domain Signal; (b) Six Highest Levels of Wavelet Coefficients w̌_{J−6}, …, w̌_{J−1}; (c) Corresponding Lowest Level of Low-Frequency Coefficients v̌_{J−6}

Figure 8: Dyadic Wavelet Time-Frequency Tiling [5]

and h in the Z-domain. A convenient method of determining the filters easily is by first calculating the low-frequency filter c from the scaling function φ(t), and then using c to determine d, g, and h. This is done according to the method shown in figure 11 [25]. Filters constructed in this way are known as conjugate mirror filters, and they satisfy the conditions for perfect reconstruction (equation 3.17).

Unfortunately, the fast biorthogonal wavelet transform of figure 6 is not translation invariant. In other words, the representation in the wavelet domain may change for different translations of the signal in the time domain. This is due to the fact that downsampling occurs in the filter bank. Translation invariance is a desirable property in applications such as pattern recognition and image or audio denoising. One way to maintain translation invariance is to use the algorithme à trous, or "hole" algorithm [5]. It involves upsampling the filters (adding "holes") at each level instead of downsampling the signal. Upsampling the filters is accomplished by inserting 2^{J−j} zeros (where J is the highest level of decomposition, and j is the current level of decomposition) between each filter coefficient. Calculating a wavelet representation using the algorithme à trous requires O(N log₂ N) operations, and results in up to N log₂ N coefficients. So, although it has the advantage of being translation invariant, it requires more computations than the fast biorthogonal wavelet transform and occupies more space in computer memory than the original signal.

Additionally, Kingsbury has developed a complex discrete wavelet transform

Figure 9: Time-Frequency Wavelet Tiling for "Sitting"

Figure 10: Synthesis Filter Bank

Figure 11: Conjugate Mirror Filters of Length Four: starting from c = (c[0], c[1], c[2], c[3]), flipping gives g = (c[3], c[2], c[1], c[0]); flipping and alternating signs gives d = (c[3], −c[2], c[1], −c[0]); alternating signs alone gives h = (−c[0], c[1], −c[2], c[3])

(CDWT), which he has shown is translation invariant []. The CDWT uses a filter bank structure with decimation. The filters, however, are complex, and produce complex coefficients which contain both magnitude and phase information.

This section was intended to give an overview of how filter banks are used to implement wavelet transforms of discrete signals, and to help the reader obtain an intuitive feel for how wavelet coefficients represent time-domain signals. However, the description of filter banks given in this section is far from complete, and much of the mathematics which proves that discrete wavelet representations are valid has not been discussed. Daubechies [6] gives a thorough comparison of the STFT and the wavelet transform. Vetterli and Herley [27] also give a good comparison, and explore different methods of filter design for perfect reconstruction. The Strang and Nguyen textbook [25] gives a good introduction to the topic of wavelets, using primarily a filter-banks/signal-processing approach. In particular, they provide a complete explanation of the requirements for perfect reconstruction [25, chapter 4], and describe methods for designing filters for perfect reconstruction [25, chapter 5]. Mallat's text provides some elaboration on conjugate mirror filter theory [5, pages 228–236]. Lastly, both textbooks discuss tactics used to deal with boundaries of discrete signals ([5, pages 280–291] and [25, pages 263–276]).

3.2 Local Trigonometric Transforms

Like the wavelet transform, local trigonometric transforms provide a way of decomposing a time-domain signal to give a time-frequency representation, and are often used as an alternative to the STFT. They address issues of discontinuities at window edges and high-frequency coefficient decay.
There are many different kinds of trigonometric transforms, such as the sine I, cosine I, and cosine IV transforms. This section focuses on two trigonometric transforms in particular: the cosine I (which exhibits fast coefficient decay at high frequencies) and the cosine IV (which leads to the local cosine basis). Section 3.2.1 explains the theory behind the local cosine transform (LCT), and section 3.2.2 discusses improvements to the LCT which make its high-frequency information more reliable. Finally, section 3.2.3 describes fast, discrete algorithms to implement these transforms.

3.2.1 Cosine Transforms

An inherent problem of breaking a time-domain signal up into intervals is that it makes the signal discontinuous at the edges of the interval. When standard Fourier analysis is performed on the windowed signal, the high-frequency coefficients are large, even if the windowed signal is smooth [5, page 342]. One way to solve this problem is by symmetrically extending the signal. Suppose there is a windowed function f(t) that exists on the interval t ∈ [0, 1]. Figure 12 shows a signal f̃(t) that is a version of f(t) which has been symmetrically extended about 0.

Figure 12: Folding Method for Cosine I Transform [5, page 343]

In the discrete domain, the 2N-point discrete Fourier transform of f̃[n] is defined as

    f̂̃[k] = Σ_{n=0}^{2N−1} f̃[n] e^{−j2πkn/(2N)}.

Introducing a phase shift of e^{−jπk/(2N)} (natural because f̃ is symmetric about n = −1/2),

    f̂̃[k] = Σ_{n=0}^{2N−1} f̃[n] e^{−j2πk(n+1/2)/(2N)},

and using the relation e^{−jz} = cos(z) − j sin(z), we arrive at

    f̂̃[k] = Σ_{n=0}^{2N−1} f̃[n] cos[2πk(n+1/2)/(2N)] − j Σ_{n=0}^{2N−1} f̃[n] sin[2πk(n+1/2)/(2N)],

which can also be written, splitting each sum into its first and second halves and substituting n → 2N − (n + 1) in the second half, as

    f̂̃[k] = Σ_{n=0}^{N−1} f̃[n] cos[2πk(n+1/2)/(2N)] + Σ_{n=0}^{N−1} f̃[2N−(n+1)] cos[2πk(2N−(n+1)+1/2)/(2N)]
          − j Σ_{n=0}^{N−1} f̃[n] sin[2πk(n+1/2)/(2N)] − j Σ_{n=0}^{N−1} f̃[2N−(n+1)] sin[2πk(2N−(n+1)+1/2)/(2N)].

Because f̃ is symmetric about n = −1/2 and has a period of 2N, f̃[n] = f̃[2N−(n+1)]. This allows us to write

    f̂̃[k] = Σ_{n=0}^{N−1} f̃[n] ( cos[2πk(n+1/2)/(2N)] + cos[2πk − 2πk(n+1/2)/(2N)] )
          − j Σ_{n=0}^{N−1} f̃[n] ( sin[2πk(n+1/2)/(2N)] + sin[2πk − 2πk(n+1/2)/(2N)] ).

The following are basic trigonometric identities (k is any integer):

    sin(2kπ + z) = sin(z)
    cos(2kπ + z) = cos(z)
    sin(−z) = −sin(z)
    cos(−z) = cos(z).

Using those identities, the sine terms cancel and the cosine terms double, leaving

    f̂̃[k] = Σ_{n=0}^{N−1} 2 f̃[n] cos[2πk(n+1/2)/(2N)].

Finally, noting that f̃[n] = f[n] for n = 0 … N−1, we arrive at

    f̂̃[k] = Σ_{n=0}^{N−1} 2 f[n] cos[(πk/N)(n + 1/2)].

This process of mirroring a discrete, N-periodic signal about n = −1/2 and representing it in the frequency domain is known as the discrete cosine I transform (DCT-I), and is generally expressed as follows [5, page 346]:

    f̂_I[k] = λ_k Σ_{n=0}^{N−1} f[n] cos[(kπ/N)(n + 1/2)],    (3.18)

where

    λ_k = 1/√2 if k = 0, and 1 otherwise.

The upper-frequency coefficients from this cosine I decomposition have a faster decay than those of a block-windowed Fourier basis [5, page 343]. The discrete inverse cosine I transform is

    f[n] = (2/N) Σ_{k=0}^{N−1} f̂_I[k] λ_k cos[(kπ/N)(n + 1/2)].    (3.19)

Figure 13 depicts another kind of cosine decomposition, which involves symmetrically extending the signal about t = 0 and antisymmetrically about t = 1

and t = −1:

    f̃(t) = f(t)       if t ∈ [0, 1],
           f(−t)       if t ∈ (−1, 0),
           −f(2 − t)   if t ∈ [1, 2),
           −f(2 + t)   if t ∈ [−2, −1).    (3.20)

Figure 13: Folding Method for Cosine IV Transform [5, page 344]

This is the cosine IV method of extension, and a derivation similar to that of the cosine I transform (equation 3.18) leads to the cosine IV transform. Expanding the Fourier transform of the 4N-periodic f̃ (shown in figure 13), one discovers that the sine terms cancel, as do the cosine terms for even frequencies. The discrete cosine IV transform (DCT-IV) is then defined as [5, page 346]

    f̂_IV[k] = Σ_{n=0}^{N−1} f[n] cos[(π/N)(k + 1/2)(n + 1/2)],    (3.21)

with its inverse

    f[n] = (2/N) Σ_{k=0}^{N−1} f̂_IV[k] cos[(π/N)(k + 1/2)(n + 1/2)].    (3.22)

Although the coefficients f̂_IV[k] do not have a fast decay in the upper frequencies, the cosine IV transform is instrumental in constructing smooth local cosine bases, discussed in section 3.2.2, and in a fast implementation of the cosine I transform, which appears in section 3.2.3 [5, page 344].

Figure 14 shows a comparison of the upper-frequency coefficients of a 50 Hz sinusoid sampled at 48 kHz. For each transform, the coefficients have been normalised so that they can be compared. Note that the DCT-I coefficients have the fastest decay, followed by the DCT-IV coefficients and then finally the FFT coefficients.
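The two transform pairs above can be verified directly with O(N²) implementations. The sketch below assumes numpy; the function names are illustrative, and the conventions follow the definitions given in this section:

```python
import numpy as np

def dct_i(f):
    """DCT-I as defined above (direct O(N^2) form)."""
    N = len(f)
    n = np.arange(N)
    lam = np.where(n == 0, 1.0 / np.sqrt(2.0), 1.0)
    return lam * (np.cos(np.pi / N * n[:, None] * (n[None, :] + 0.5)) @ f)

def idct_i(F):
    """Inverse DCT-I as defined above."""
    N = len(F)
    n = np.arange(N)
    lam = np.where(n == 0, 1.0 / np.sqrt(2.0), 1.0)
    return (2.0 / N) * (np.cos(np.pi / N * (n[:, None] + 0.5) * n[None, :]) @ (lam * F))

def dct_iv(f):
    """DCT-IV as defined above (direct O(N^2) form)."""
    N = len(f)
    n = np.arange(N)
    return np.cos(np.pi / N * (n[:, None] + 0.5) * (n[None, :] + 0.5)) @ f

rng = np.random.default_rng(0)
f = rng.standard_normal(16)
f_i = idct_i(dct_i(f))                     # DCT-I round trip
f_iv = (2.0 / len(f)) * dct_iv(dct_iv(f))  # DCT-IV applied twice, scaled by 2/N
```

Both round trips recover the input exactly (to floating-point precision); in particular, applying the DCT-IV twice and scaling by 2/N is the identity, since the DCT-IV basis matrix is symmetric and orthogonal up to that factor.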

Figure 14: FFT, DCT-IV, and DCT-I Coefficients of a Sinusoid: (a) Time-Domain Signal (50 Hz Sinusoid Sampled at 48 kHz); (b) Upper-Frequency Coefficients

3.2.2 Smooth Local Cosine Bases

Another way to alleviate the effect of slow decay in high-frequency coefficients is to use overlapping, smooth windowing functions instead of characteristic (square), adjacent windowing functions. Mallat proves that a series of overlapping windowing functions can be an orthogonal representation of a space [5, pages 354–357]. Figure 15 depicts the arrangement of these overlapping windowing functions. Each window g_p which lies on an interval I_p has a central interval C_p and shares the intervals O_p and O_{p+1} with the neighbouring windows g_{p−1} and g_{p+1}, respectively.

Figure 15: Overlapping Windows

The windowing function g_p is defined as

    g_p(t) = 0                        if t ∉ I_p,
             β((t − a_p)/η_p)         if t ∈ O_p,
             1                        if t ∈ C_p,
             β((a_{p+1} − t)/η_{p+1}) if t ∈ O_{p+1},    (3.23)

where a p, a p+, η p and η p+ are as shown in figure 5, and β(t) is a rising function centred about t = 0 which satisfies the constraints: β 2 (t) + β 2 ( t) = for t [, ], and { 0 if t <, β(t) = if t >. Mallat proves that the window g p can be combined with an orthogonal basis to form a windowed, orthogonal basis g p,k [5, pages 356 357]. A function f can then be decomposed by computing the inner products f, g p,k. One popular windowed, orthonormal basis is the local cosine basis, which is mathematically similar to the cosine IV basis. The local cosine basis is defined as follows: { g p,k (t) = g p (t) 2 l p cos [ π l p ( k + ) ( )] } t a p 2 k,p (3.24) where the window length is l p = a p+ a p. In the discrete domain, the inner product is then computed as f, g p,k = a p+ +η p+ 2 n=a p η p+ 2 [ 2 π f[n]g p [n] cos l p l p ( k + ) )] (n a p. (3.25) 2 Just as with the wavelet transform (figure 6), this local cosine transform (equation 3.25) as well as the cosine I and cosine IV transforms (equations 3.8 and 3.2, respectively) are not translation invariant. This is because their coefficients contain no phase information. Thus, the local cosine coefficients computed from a signal may vary widely depending upon its translation in time [5, page 360]. 3.2.3 Fast Implementations This section describes some efficient algorithms for computing the cosine transforms of discrete signals. A fast DCT-IV will be presented, and then fast DCT-I and local cosine transforms will be shown which incorporate the fast DCT-IV algorithm. Duhamel, Mahieux, and Petit [9] proved that the DCT-IV coefficients can be efficiently calculated using the DFT coefficients of a vector z[n]: } ˆf IV [2k] = Re {e j πk N ẑ[k] } (3.26) ˆf IV [N 2k ] = Im {e j πk N ẑ[k], 29

where

    z[n] = (f[2n] + j f[N − 1 − 2n]) e^{−jπ(n + 1/4)/N} for n = 0 … N/2 − 1.

If an FFT is used to compute ẑ[k], then calculation of the DCT-IV coefficients f̂_IV[k] requires O(N log₂ N) operations [5, page 348]. The DCT-IV is its own inverse, up to the scale factor 2/N of equation 3.22.

Wang [28] developed a fast algorithm to calculate DCT-I coefficients using a recursive scheme involving the fast DCT-IV transform. Equation 3.18 can be rewritten as

    f̂_I[2k] = λ_k Σ_{n=0}^{N/2−1} (f[n] + f[N−1−n]) cos[(πk/(N/2))(n + 1/2)]
    f̂_I[2k+1] = Σ_{n=0}^{N/2−1} (f[n] − f[N−1−n]) cos[(π/(N/2))(k + 1/2)(n + 1/2)].

Defining the following signals for n = 0 … N/2 − 1:

    f⁺[n] = f[n] + f[N−1−n]
    f⁻[n] = f[n] − f[N−1−n],

the above equations become

    f̂_I[2k] = f̂⁺_I[k]
    f̂_I[2k+1] = f̂⁻_IV[k].    (3.27)

In other words, the even DCT-I coefficients are simply the DCT-I coefficients of the length-N/2 signal f⁺[n], and the odd DCT-I coefficients are simply the DCT-IV coefficients of the length-N/2 signal f⁻[n]. Implementing the recursive DCT-I algorithm described by equation 3.27 requires O(N log₂ N) operations [5, page 349].

The inner products of the local cosine basis (equation 3.25) can be calculated efficiently using a folding method and the fast DCT-IV algorithm (equation 3.26). Where h_p[n] is a signal f[n] multiplied by a smooth window and folded at its edges [5, page 363],

    h_p[n] = g_p[n]f[n] + g_p[2a_p − n]f[2a_p − n]         if n ∈ O_p,
             f[n]                                          if n ∈ C_p,
             g_p[n]f[n] − g_p[2a_{p+1} − n]f[2a_{p+1} − n] if n ∈ O_{p+1},    (3.28)
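Both fast schemes above can be sanity-checked against direct O(N²) implementations. The sketch below assumes numpy; the function names are mine, and the exponent signs are chosen to match numpy's forward-FFT convention e^{−j2πkn/N}:

```python
import numpy as np

def dct_iv_direct(f):
    """Direct O(N^2) DCT-IV, used as a reference."""
    N = len(f)
    n = np.arange(N)
    return np.cos(np.pi / N * (n[:, None] + 0.5) * (n[None, :] + 0.5)) @ f

def dct_iv_fast(f):
    """Fast DCT-IV via a half-length complex FFT, following the identity above."""
    N = len(f)
    M = N // 2
    n = np.arange(M)
    z = (f[2 * n] + 1j * f[N - 1 - 2 * n]) * np.exp(-1j * np.pi * (n + 0.25) / N)
    w = np.exp(-1j * np.pi * n / N) * np.fft.fft(z)
    out = np.empty(N)
    out[2 * n] = w.real           # even-indexed coefficients
    out[N - 1 - 2 * n] = -w.imag  # odd-indexed coefficients, in reversed order
    return out

def dct_i_direct(f):
    """Direct O(N^2) DCT-I, used as a reference."""
    N = len(f)
    n = np.arange(N)
    lam = np.where(n == 0, 1.0 / np.sqrt(2.0), 1.0)
    return lam * (np.cos(np.pi / N * n[:, None] * (n[None, :] + 0.5)) @ f)

rng = np.random.default_rng(1)
f = rng.standard_normal(16)

# Wang's split: the even DCT-I coefficients come from the DCT-I of f_plus,
# the odd ones from the DCT-IV of f_minus, both of length N/2.
f_plus = f[:8] + f[8:][::-1]
f_minus = f[:8] - f[8:][::-1]
F = dct_i_direct(f)
```

The fast DCT-IV agrees with the direct form, and the even/odd halves of the DCT-I coefficients agree with the half-length DCT-I and DCT-IV, respectively.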

equation 3.25 becomes

    ⟨f, g_{p,k}⟩ = Σ_{n=a_p}^{a_{p+1}} h_p[n] √(2/l_p) cos[(π/l_p)(k + 1/2)(n − a_p)],

which is simply

    ⟨f, g_{p,k}⟩ = ĥ_{p,IV}[k].    (3.29)

ĥ_{p,IV}[k] is the DCT-IV transform (defined by equation 3.21) of h_p[n], implemented efficiently by equation 3.26. Transforming from the local cosine basis back into the time domain is a similar procedure. The fast inverse DCT-IV transform algorithm (again equation 3.26) is run on the coefficients ĥ_{p,IV}[k], and f[n] can then be reconstructed. Where O⁻_p = [a_p − η_p, a_p] and O⁺_p = [a_p, a_p + η_p],

    f[n] = g_p[n]h_p[n] − g_p[2a_p − n]h_{p−1}[2a_p − n]         if n ∈ O⁺_p,
           h_p[n]                                                if n ∈ C_p,
           g_p[n]h_p[n] + g_p[2a_{p+1} − n]h_{p+1}[2a_{p+1} − n] if n ∈ O⁻_{p+1}.    (3.30)

Both the forward and inverse transforms involve folding the signal at its edges and using the fast DCT-IV algorithm. This requires O(l_p log₂ l_p) operations for each interval [a_p, a_{p+1}] of length l_p.

3.3 Wavelet and Local Cosine Packet Decomposition

The wavelet and local cosine transforms described in the previous sections can be expanded to create redundant representations of a signal. These redundant representations can be diagrammed in binary tree structures, and each node of the binary tree is known as a packet. Sections 3.3.1 and 3.3.2 discuss packet representations of wavelet and local cosine bases, respectively. Some comparisons of the two kinds of trees are given in section 3.3.3.

3.3.1 Wavelet Packets

The wavelet transform described in section 3.1 effectively decomposes the signal into a set of wavelet (or high-frequency) bases and one low-pass basis, as shown by figure 5. Each of these bases is orthogonal to the others, and together they form a complete, non-redundant representation of the signal. In the wavelet transform algorithm, a space V_J is decomposed into a wavelet space W_{J−1} and a low-frequency space V_{J−1}. The low-frequency space V_{J−1} is in turn decomposed into another wavelet space W_{J−2} and low-frequency space V_{J−2}, and so on.
Essentially, in the basis tree shown in figure 5, the low-frequency branches V_j are decomposed, while the wavelet branches W_j are not.
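This "split only the low-frequency branch" recursion can be sketched in a few lines, assuming numpy and, purely for illustration, Haar filters:

```python
import numpy as np

# Haar filter pair (an assumed, minimal example).
c = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass
d = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass

def wavelet_decompose(x, levels):
    """Return [w_{J-1}, w_{J-2}, ..., w_{J-levels}, v_{J-levels}], as in figure 5."""
    coeffs = []
    v = np.asarray(x, dtype=float)
    for _ in range(levels):
        coeffs.append(np.convolve(v, d)[1::2])  # detail branch: never split again
        v = np.convolve(v, c)[1::2]             # low-frequency branch: split again
    coeffs.append(v)                            # lowest-level low-frequency part
    return coeffs

x = np.arange(8, dtype=float)
coeffs = wavelet_decompose(x, levels=3)
```

The total number of coefficients equals the signal length, and for these orthonormal filters the decomposition preserves the signal's energy, consistent with it being a complete, non-redundant representation.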