Audio Denoising by Time-Frequency Block Thresholding

Guoshen Yu, Stéphane Mallat and Emmanuel Bacry
CMAP, Ecole Polytechnique, Palaiseau, France
March 27

Abstract

For audio denoising, diagonal thresholding estimators of spectrogram coefficients produce a musical noise that degrades audio perception. We introduce a block thresholding which produces hardly any musical noise and improves the SNR compared to diagonal thresholdings or Ephraim and Malah estimators. Spectrogram coefficients are grouped into blocks to compute attenuation factors. This block grouping regularizes the estimation, which removes musical noise. The block size is adapted to the signal properties by minimizing a Stein unbiased estimator of the block thresholding risk.

Index Terms: Audio denoising, Block thresholding, Diagonal thresholding, Ephraim and Malah, SURE.

1 Introduction

Audio signals are often contaminated by background environment noise and buzzing or humming noise from audio equipment. Audio denoising aims at attenuating the noise while retaining the underlying signals. Applications such as music and speech restoration are numerous.

Thresholding estimators [11] remove noise by thresholding to zero small coefficients in an appropriate sparse signal representation. Image denoising by thresholding wavelet coefficients is particularly effective to suppress noise from images, and these estimators are used in many applications. For audio signals, despite interesting work on such thresholding estimators [8, 21, 24], the results are less convincing. Indeed, thresholding the spectrogram or the wavelet coefficients of a noisy audio signal produces a musical noise [6, 26]. This noise is a sum of localized time-frequency structures corresponding to isolated spectrogram or wavelet coefficients above the threshold. This superposition of musical noise contaminates the denoised sound and degrades the audio perception.
Currently, the audio denoising method most often used is the Ephraim and Malah noise suppression rule [12, 13] and their variants [25] applied to spectrograms. This technique introduces little musical noise and maintains a small amplitude residual noise that masks this musical noise. This paper introduces a block thresholding estimator that produces hardly any musical noise with no residual noise, by grouping spectrogram coefficients in time-frequency blocks [26]. A block thresholding restores spectrograms that are more regular without isolated coefficients responsible for musical noise. Taking advantage of the time-frequency regularity of audio sounds, it also improves the resulting SNR. Comparisons are made with Ephraim and Malah estimators. Block thresholding estimators were first introduced by Cai and Silverman [3, 4, 5] to improve noise removal in orthonormal wavelet bases. Mathematical studies [15, 16, 17] proved the minimax optimality of wavelet block thresholding for certain classes of signals. For audio denoising, the grouping of spectrogram coefficients in blocks can be automatically adjusted to the signal content, by minimizing the resulting risk calculated with the Stein estimator [23]. We begin by reviewing conventional diagonal thresholding estimators and explain why they

produce musical noise for audio signals. Section 3 introduces the block thresholding estimators of Cai and Silverman [4] in the general context of orthogonal bases and frames. Block thresholding of spectrogram coefficients is studied for audio denoising, and comparisons are made with Ephraim and Malah methods. To adjust the size of blocks that group spectrogram coefficients, Section 4 explains how to compute the Stein unbiased risk estimate [23] of a block thresholding algorithm, and how to adjust the block size to minimize the estimated risk. A post-processing with an empirical Wiener shrinkage [14] is presented in Section 5 to further improve the estimation.

2 Diagonal Thresholding

Section 2.1 describes the properties of diagonal thresholding estimators both in orthogonal bases and in frames, and Section 2.2 explains why they produce musical noise when applied to audio spectrograms.

2.1 Properties of Diagonal Thresholding Estimators

Let $y$ be a noisy signal that is the sum of a clean signal $f$ and a zero-mean noise $\epsilon$:

$$y[n] = f[n] + \epsilon[n], \quad n = 0, 1, \ldots, N-1. \tag{1}$$

Thresholding estimators decompose noisy signals in a basis or in a frame and set to zero small amplitude coefficients. Let $F = \{g_m\}_{1 \le m \le N}$ be a family of vectors that define an orthonormal basis of $\mathbb{R}^N$. Decomposing $y$ in $F$ yields

$$y_F[m] = f_F[m] + \epsilon_F[m], \quad 1 \le m \le N, \tag{2}$$

with $y_F[m] = \langle y, g_m \rangle$, $f_F[m] = \langle f, g_m \rangle$ and $\epsilon_F[m] = \langle \epsilon, g_m \rangle$. A diagonal estimator in this basis modifies the amplitude of each coefficient $y_F[m]$ with a factor $a[m]$ and reconstructs

$$\hat f = \sum_{m=1}^{N} D_m(y_F[m])\, g_m = \sum_{m=1}^{N} a[m]\, y_F[m]\, g_m. \tag{3}$$

To reduce the quadratic risk $E\{\|f - \hat f\|^2\}$ one can verify that the attenuation factor should satisfy $0 \le a[m] \le 1$. The estimator is said to be diagonal if $a[m]$ depends only upon $y_F[m]$.
For diagonal estimators, one can verify [11] that a lower bound of the quadratic risk $E\{\|f - \hat f\|^2\}$ is obtained by choosing

$$a[m] = \frac{|f_F[m]|^2}{|f_F[m]|^2 + \sigma^2[m]} \tag{4}$$

where $\sigma^2[m] = E\{|\epsilon_F[m]|^2\}$ is the variance of each noise coefficient. The resulting lower bound risk is

$$R_o = \sum_{m=1}^{N} \frac{|f_F[m]|^2\, \sigma^2[m]}{|f_F[m]|^2 + \sigma^2[m]}. \tag{5}$$

This lower bound cannot be reached because the oracle attenuation factor (4) depends upon $f_F[m]$, which is unknown. A simple diagonal estimator is the empirical Wiener estimator [2] defined by

$$D_m(x) = x \left(1 - \frac{\sigma^2[m]}{|x|^2}\right)_+ \tag{6}$$

where we write $(z)_+ = \max(z, 0)$. Donoho and Johnstone [11] have introduced better thresholding estimators that can produce a risk close to the oracle lower bound. A hard thresholding keeps coefficients above a threshold $T_m = \lambda \sigma[m]$:

$$D_m(x) = x\, \mathbf{1}_{\{|x| > \lambda \sigma[m]\}} \tag{7}$$

in which case the attenuation factor $a[m]$ is 0 or 1. A soft thresholding reduces the amplitude of all coefficients:

$$D_m(x) = x \left(1 - \frac{\lambda \sigma[m]}{|x|}\right)_+. \tag{8}$$

To minimize the risk, Donoho and Johnstone proved that the threshold $T_m$ should be proportional to the noise standard deviation and depends upon the signal size. Asymptotically, an optimal choice is

$$T_m = \sigma[m] \sqrt{2 \log_e N}. \tag{9}$$

When the noise $\epsilon$ is Gaussian and white, and hence $\sigma[m] = \sigma$ for all $1 \le m \le N$, Donoho and Johnstone [11] proved that for $N \ge 4$ the hard and soft thresholding risk is close to the minimum oracle risk:

$$R_o \le E\{\|f - \hat f\|^2\} \le (2 \log_e N + 2.4)\,(\sigma^2 + R_o). \tag{10}$$

A frame is a family of $M \ge N$ vectors $F = \{g_m\}_{m \in \Gamma}$ that defines a redundant signal representation $f_F[m] = \langle f, g_m \rangle$. A tight frame satisfies an energy conservation like an orthogonal basis,

$$\|f\|^2 = \frac{1}{A} \sum_{m \in \Gamma} |f_F[m]|^2,$$

and as a result one can prove that [19]

$$f = \frac{1}{A} \sum_{m \in \Gamma} f_F[m]\, g_m,$$

where $A$ is the frame bound. A thresholding estimator in a tight frame behaves similarly to an averaging of thresholding estimators in several orthonormal bases, which often improves the resulting SNR [9]. The thresholding risk in a frame can also be related to an oracle risk with an upper bound similar to (10). In numerical applications, thresholding estimators in tight frames are thus preferred to thresholding estimators in a single orthogonal basis.
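The three diagonal rules (6), (7) and (8) can be compared side by side with a short NumPy sketch. This is only an illustration under stated assumptions: the function names are hypothetical (they do not come from the paper), the coefficients are passed as a plain array, and a coefficient equal to zero is given a zero gain by convention.

```python
import numpy as np

def _noise_to_coeff_ratio(x, sigma, power):
    """Return (sigma / |x|)**power, with +inf where x == 0 so the gain becomes 0."""
    mag = np.abs(np.asarray(x, dtype=complex))
    ratio = np.full(mag.shape, np.inf)
    np.divide(sigma, mag, out=ratio, where=mag > 0)
    return ratio ** power

def empirical_wiener(x, sigma):
    """Empirical Wiener rule (6): x * (1 - sigma^2 / |x|^2)_+."""
    x = np.asarray(x)
    return x * np.maximum(1.0 - _noise_to_coeff_ratio(x, sigma, 2), 0.0)

def hard_threshold(x, sigma, lam):
    """Hard thresholding (7): keep x when |x| > lam * sigma, else 0."""
    x = np.asarray(x)
    return x * (np.abs(x) > lam * sigma)

def soft_threshold(x, sigma, lam):
    """Soft thresholding (8): x * (1 - lam * sigma / |x|)_+, with lam > 0."""
    x = np.asarray(x)
    return x * np.maximum(1.0 - lam * _noise_to_coeff_ratio(x, sigma, 1), 0.0)

def universal_lambda(n_samples):
    """Factor lam such that T = lam * sigma matches the asymptotic choice (9)."""
    return np.sqrt(2.0 * np.log(n_samples))

coeffs = np.array([0.0, 0.5, -3.0, 4.0])
print(hard_threshold(coeffs, sigma=1.0, lam=2.0))
print(soft_threshold(coeffs, sigma=1.0, lam=2.0))
print(empirical_wiener(coeffs, sigma=1.0))
```

In all three rules the gain applied to a coefficient depends only on that coefficient, which is what makes the estimators diagonal.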

2.2 Audio Denoising by Diagonal Thresholding

Audio signal denoising can be implemented with a thresholding in a windowed Fourier frame. It amounts to a simple thresholding of the resulting spectrogram, but it produces a musical noise corresponding to isolated coefficients above the threshold.

Let $w[n]$ be a window of size $R$ normalized to $\|w\|^2 = 1$. A windowed Fourier frame is defined by

$$F = \{g_{l,r}[n]\} = \left\{ w[n - lu]\, \exp\left(\frac{i 2\pi r n}{R}\right) \right\}_{1 \le l \le N/u,\ 1 \le r \le R}$$

where $u$ is the window shifting step, and $l$, $r$ are respectively the time and frequency indices. The resulting windowed Fourier coefficients are computed with an FFT for each translated window,

$$f_F[l, r] = \langle f, g_{l,r} \rangle = \sum_{n=1}^{N} f[n]\, w[n - lu]\, \exp\left(-\frac{i 2\pi r n}{R}\right),$$

and $\{|f_F[l, r]|^2\}_{1 \le l \le N/u,\ 1 \le r \le R}$ is the spectrogram. Thresholding windowed Fourier coefficients thus amounts to thresholding a spectrogram. If the window $w[n]$ is chosen so that

$$\sum_{l} |w[n - lu]|^2 = \frac{A}{R}, \quad \forall n, \tag{11}$$

then one can prove [1] that the windowed Fourier frame is a tight frame with frame bound $A$. In the following, we use half-overlapping windows with $u = R/2$ and a window $w$ that is the square root of a Hanning window, which satisfies (11). If the noise is stationary then the noise variance $\sigma^2[l, r] = E\{|\epsilon_F[l, r]|^2\}$ depends only upon the frequency index $r$, and if it is white then it has a constant value $\sigma^2$.

For an empirical Wiener diagonal estimator (6), the attenuation factor is

$$a[l, r] = \left(1 - \frac{\sigma^2[l, r]}{|y_F[l, r]|^2}\right)_+,$$

which coincides with the square of the suppression rule for the method of power subtraction [1, 2, 18], and is known to produce musical noise.

To illustrate the musical noise produced by a spectrogram thresholding, Fig. 1 shows the denoising of a short recording of the Mozart oboe concerto with a white Gaussian noise. Fig. 1(a) and 1(b) show respectively the log-spectrograms $\log |f_F[l, r]|$ and $\log |y_F[l, r]|$ of the original signal $f$ and its noisy version $y$. Thresholding $y_F[l, r]$ amounts to multiplying it by attenuation factors $a[l, r]$ equal to 0 or 1.
Fig. 1(c) shows this attenuation map, with black points corresponding to $a[l, r] = 1$. As can be observed in the zoom in Fig. 1(c′), this attenuation map includes many isolated black points. In the reconstruction process, these isolated coefficients restore isolated windowed Fourier vectors $g_{l,r}[n]$ that are perceived as a musical noise. A soft thresholding produces a similar phenomenon because each coefficient is also thresholded independently from its neighbors. To remove this musical noise, the next section uses a block thresholding estimator that takes into account the fact that large spectrogram coefficients of most audio sounds are aggregated together in the time-frequency plane.
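The tight-frame condition (11) for the square root of a Hanning window with half overlap can be checked numerically. The sketch below is a hedged illustration: it assumes a periodic Hann definition (one of several common conventions), and the helper names `sqrt_hann_window` and `overlap_energy` are hypothetical.

```python
import numpy as np

def sqrt_hann_window(R):
    """Square root of a periodic Hann window of size R, normalized to ||w||^2 = 1."""
    n = np.arange(R)
    hann = 0.5 * (1.0 - np.cos(2.0 * np.pi * n / R))
    w = np.sqrt(hann)
    return w / np.linalg.norm(w)

def overlap_energy(w, u, n_shifts=8):
    """Sum over l of |w[n - l*u]|^2, returned only where all R/u windows overlap."""
    R = len(w)
    total = np.zeros(R + (n_shifts - 1) * u)
    for l in range(n_shifts):
        total[l * u : l * u + R] += np.abs(w) ** 2
    return total[R - u : n_shifts * u]

R = 64
u = R // 2                      # half-overlapping windows
w = sqrt_hann_window(R)
energy = overlap_energy(w, u)
frame_bound = R * energy[0]     # A in (11)
print(energy.min(), energy.max(), frame_bound)
```

Under these assumptions the sum is constant in $n$ and equal to $1/u$, so the frame bound is $A = R/u = 2$ for half-overlapping windows.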

Figure 1: Log-spectrogram of original and noisy Mozart and attenuation coefficients of hard thresholding and block thresholding. Panels: (a) log-spectrogram of original Mozart; (b) log-spectrogram of noisy Mozart; (c) hard thresholding; (d) adaptive block thresholding. (a′), (b′), (c′) and (d′) are respectively zooms of the marked regions in (a), (b), (c) and (d). Values of attenuation coefficients range from 1 (black) to 0 (white).

3 Time-Frequency Block Thresholding

The block thresholding algorithm of Cai and Silverman [3, 4] regularizes diagonal thresholding estimations by grouping coefficients in blocks and computing a single attenuation factor for all coefficients in each block. We present this estimator in a general context of orthogonal bases and frames before applying it to spectrograms for audio denoising. By regularizing the thresholding estimation over blocks of coefficients, the musical noise is almost completely removed and the SNR is improved.

3.1 Block Thresholding in Bases and Frames

Let $F = \{g_m\}_{m \in \Gamma}$ be an orthonormal basis or a frame of $\mathbb{R}^N$. The set $\Gamma$ of all indices $m$ is segmented in $K$ blocks $B_k$ in which indices are grouped together. If $F$ is a windowed Fourier frame then the time-frequency indices $m = (l, r)$ are grouped in time-frequency blocks $B_k$ whose shape may a priori be chosen arbitrarily. A block thresholding estimator multiplies all coefficients within $B_k$ by the same attenuation factor $a_k$:

$$\hat f = \sum_{k=1}^{K} \sum_{m \in B_k} a_k\, y_F[m]\, g_m. \tag{12}$$

This estimator is not diagonal because the value of each $a_k$ may depend upon all coefficients $y_F[m]$ within $B_k$.

A lower bound of the risk $E\{\|\hat f - f\|^2\}$ is obtained with an oracle attenuation. Let $B_k^\#$ be the number of coefficients within a block $B_k$. The average signal and noise energy in this block are

$$\bar f_{F,k}^2 = \frac{1}{B_k^\#} \sum_{m \in B_k} |f_F[m]|^2 \quad \text{and} \quad \bar\sigma_k^2 = \frac{1}{B_k^\#} \sum_{m \in B_k} \sigma^2[m].$$

Similarly to the oracle attenuation factor (4), one can verify that a minimum risk is obtained by choosing

$$a_k = \frac{\bar f_{F,k}^2}{\bar f_{F,k}^2 + \bar\sigma_k^2} = 1 - \frac{\bar\sigma_k^2}{\bar f_{F,k}^2 + \bar\sigma_k^2}, \tag{13}$$

and the resulting oracle block risk is

$$R_{bo} = \sum_{k=1}^{K} B_k^\#\, \frac{\bar f_{F,k}^2\, \bar\sigma_k^2}{\bar f_{F,k}^2 + \bar\sigma_k^2}. \tag{14}$$

Clearly the oracle block attenuation factor $a_k$ in (13) cannot be calculated since it depends upon the values of $f_F[m]$. The goal is to find a block estimator whose risk $E\{\|\hat f - f\|^2\}$ is as close as possible to the lower bound $R_{bo}$.

Observe that the oracle risk with blocks $R_{bo}$ in (14) is always larger than the oracle risk $R_o$ in (5) without blocks, because it is obtained through the same minimization but with fewer parameters, as attenuation factors remain constant over each block. Reducing the number of attenuation parameters with a block technique increases the oracle risk lower bound, but it regularizes the estimation when attenuation factors are computed from empirical coefficients.
A direct calculation shows that

$$R_{bo} - R_o = \sum_{k=1}^{K} \sum_{m \in B_k} \frac{\xi_{F,k}\, \xi_F[m]\, (\bar\sigma_k^2 - \sigma^2[m]) + (\bar f_{F,k}^2 - |f_F[m]|^2)}{(\xi_{F,k} + 1)(\xi_F[m] + 1)}, \tag{15}$$

where $\xi_{F,k} = \bar f_{F,k}^2 / \bar\sigma_k^2$ is the average SNR in block $B_k$ and $\xi_F[m] = |f_F[m]|^2 / \sigma^2[m]$ is the SNR of the coefficient corresponding to the index $m$. Equation (15) indicates that $R_{bo}$ is close to $R_o$ if both the noise

and the signal coefficients have little variation in each block. Consequently the risk of the block thresholding estimator is reduced by choosing the blocks so that in each block $B_k$ either (i) $f_F[m]$ and $\sigma^2[m]$ vary little; or (ii) $\xi_{F,k} \gg 1$, $\xi_F[m] \gg 1$ and $\sigma^2[m]$ varies little; or (iii) $\xi_{F,k} \ll 1$, $\xi_F[m] \ll 1$ and $f_F[m]$ varies little.

Cai and Silverman block thresholding operators [3, 4] use the James-Stein shrinkage rule [22]. We cannot compute the original signal energy in the block, but we can calculate the noisy signal energy

$$\bar y_{F,k}^2 = \frac{1}{B_k^\#} \sum_{m \in B_k} |y_F[m]|^2$$

and observe that

$$E\{\bar y_{F,k}^2\} = \bar f_{F,k}^2 + \bar\sigma_k^2. \tag{16}$$

The James-Stein shrinkage rule [22] is similar to the oracle formula (13) where $\bar f_{F,k}^2 + \bar\sigma_k^2$ is replaced by $\bar y_{F,k}^2$:

$$a_k = \left(1 - \frac{\lambda\, \bar\sigma_k^2}{\bar y_{F,k}^2}\right)_+, \tag{17}$$

with a thresholding parameter $\lambda \ge 1$. For blocks of size 1, if $\lambda = 1$ then this shrinkage rule corresponds to the empirical diagonal Wiener estimator defined in (6).

If the noise $\epsilon$ is a Gaussian white noise, then, like in the case of diagonal thresholding estimators, the resulting risk $E\{\|\hat f - f\|^2\}$ can be shown to be close to the oracle risk (14). The average noise energy over a block $B_k$,

$$\bar\epsilon_{F,k}^2 = \frac{1}{B_k^\#} \sum_{m \in B_k} |\epsilon_F[m]|^2, \tag{18}$$

follows, up to the normalization factor $\sigma^2 / B_k^\#$, a $\chi^2$ distribution with $B_k^\#$ degrees of freedom because each noise coefficient $\epsilon_F[m]$ is a Gaussian random variable of variance $\sigma^2$. If all blocks $B_k$ have the same size $B^\#$, then Cai [3] proved that

$$R_{bo} \le E\{\|\hat f - f\|^2\} \le 2\lambda R_{bo} + 4N\sigma^2\, \mathrm{Prob}\{\bar\epsilon_F^2 > \lambda\sigma^2\}, \tag{19}$$

where $\mathrm{Prob}\{\cdot\}$ is the probability measure and $\bar\epsilon_F^2$ is the average noise energy over a block of size $B^\#$. The second term $4N\sigma^2\, \mathrm{Prob}\{\bar\epsilon_F^2 > \lambda\sigma^2\}$ in the risk upper bound (19) is a variance term corresponding to a probability of keeping pure noise coefficients, i.e., $f$ is zero ($y = \epsilon$) and $a_k \ne 0$ (c.f. (17)). $\mathrm{Prob}\{\bar\epsilon_F^2 > \lambda\sigma^2\}$ is the probability to keep a residual noise. The oracle risk and the variance terms in (19) are competing.
When $\lambda$ increases the first term increases and the variance term decreases. Similarly, when the block size $B_k^\#$ increases the oracle risk $R_{bo}$ increases whereas the variance decreases. Adjusting $\lambda$ and the block sizes $B_k^\#$ can be interpreted as an optimization between the bias and the variance of our block thresholding estimator. The parameters $\lambda$ and $B_k^\#$ are set by adjusting the residual noise probability

$$\mathrm{Prob}\{\bar\epsilon_F^2 > \lambda\sigma^2\} = \delta \tag{20}$$

where $\delta$ is the residual noise probability that one tolerates.
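The shrinkage rule (17) applied over rectangular blocks can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the coefficients are assumed to be stored in a 2-D array indexed by (time, frequency), the noise is assumed white with a known variance `sigma2`, and the function name and argument names are hypothetical.

```python
import numpy as np

def block_attenuation(Y, sigma2, block_len, block_width, lam):
    """Attenuation map a_k of (17), constant over disjoint rectangular blocks.

    Y           : 2-D array of noisy coefficients, shape (time, frequency)
    sigma2      : noise variance, assumed identical for all coefficients
    block_len   : block length in time (number of rows per block)
    block_width : block width in frequency (number of columns per block)
    lam         : thresholding parameter lambda >= 1
    """
    n_time, n_freq = Y.shape
    gains = np.zeros((n_time, n_freq))
    for t0 in range(0, n_time, block_len):
        for f0 in range(0, n_freq, block_width):
            block = Y[t0:t0 + block_len, f0:f0 + block_width]
            mean_energy = np.mean(np.abs(block) ** 2)      # noisy energy of the block
            if mean_energy > 0:
                a_k = max(1.0 - lam * sigma2 / mean_energy, 0.0)
            else:
                a_k = 0.0                                  # empty block: nothing to keep
            gains[t0:t0 + block_len, f0:f0 + block_width] = a_k
    return gains

Y = np.zeros((4, 4))
Y[:2, :2] = 10.0                      # one energetic 2 x 2 block
gains = block_attenuation(Y, sigma2=1.0, block_len=2, block_width=2, lam=2.0)
denoised = gains * Y
print(gains)
```

Because one gain is shared by every coefficient of a block, an isolated large noise coefficient is averaged with its neighbors before the decision is made, which is the regularization effect discussed above.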

Cai [3] shows that choosing $B^\# = \log_e N$ and $\lambda = 4.505$ turns (19) into the following block oracle inequality:

$$R_{ba} \le 2\lambda R_{bo} + 2\sigma^2, \tag{21}$$

where $R_{ba} = E\{\|\hat f - f\|^2\}$ is the risk of the block thresholding estimator.

A tight frame is similar to a union of several orthonormal bases, and the risk of a block thresholding estimator in a tight frame behaves similarly to the sum of the risks in several orthonormal bases. However, even if the noise is Gaussian white, because of the redundancy between frame vectors, the average noise energy $\bar\epsilon_F^2$ over a block of size $B^\#$ no longer follows a $\chi^2_{B^\#}$ distribution.

3.2 Block Thresholding in Short-Time Fourier Frames

The time-frequency block thresholding can be applied directly with short-time Fourier frames. The choice of parameters is discussed below.

Choice of Block. We group time-frequency contiguous short-time Fourier coefficients in disjoint rectangular blocks. The block size is $B_k^\# = L_k \times W_k$, where $L_k$ and $W_k$ are respectively the block length in time and the block width in frequency. For simplicity, dyadic lengths $L_k = 8, 4, 2$ and widths $W_k = 16, 8, 4, 2, 1$ will be used (the unit being the time-frequency index in the spectrogram). In this section, a fixed block length and width are assigned to all the blocks, i.e., $L_k = L$, $W_k = W$ and $B_k^\# = B^\# = L \times W$, $\forall k$.

Choice of Thresholding Level λ. Given a choice of block size and the residual noise probability level $\delta$ that one tolerates, the thresholding level $\lambda$ is defined by (20). For each block width and length, $\lambda$ is estimated using a Monte Carlo simulation of $\bar\epsilon_F^2$. Table 1 shows the resulting $\lambda$ with $\delta$ = .1%. Let us remark that for a block width $W > 1$, blocks that contain the same number of coefficients $B^\# = L \times W$ have close $\lambda$ values.

Table 1: Thresholding level λ calculated with different block sizes $B^\# = L \times W$ and with δ = .1%. Columns: W = 16, W = 8, W = 4, W = 2, W = 1; one row per block length L.
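The Monte Carlo calibration of $\lambda$ from (20) can be sketched in a simplified setting. The sketch assumes independent real Gaussian noise coefficients of unit variance, as in an orthonormal basis; it therefore does not reproduce the correlated noise of a redundant short-time Fourier frame, and the values it returns are not those of Table 1. The function name, the number of trials and the seed are illustrative choices.

```python
import numpy as np

def threshold_level(block_size, delta, n_trials=20_000, seed=0):
    """Estimate lambda such that Prob{mean noise energy > lambda * sigma^2} = delta.

    Simplifying assumption: the block_size noise coefficients of a block are
    independent real N(0, 1) variables (sigma^2 = 1), so lambda is the
    (1 - delta) quantile of the simulated average block energy.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_trials, block_size))
    mean_energy = np.mean(noise ** 2, axis=1)
    return float(np.quantile(mean_energy, 1.0 - delta))

for block_size in (1, 4, 16):
    print(block_size, threshold_level(block_size, delta=0.05, n_trials=100_000))
```

For a fixed $\delta$ the estimated $\lambda$ decreases as the block grows, since averaging over more coefficients concentrates the block energy around $\sigma^2$.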
3.3 Block Thresholding and Ephraim and Malah

In the Ephraim and Malah methods [12, 13, 6] and their variants [7, 25], two factors contribute essentially to the elimination of musical noise: the recursive decision-directed a priori SNR estimator that induces a temporal regularization in the estimator, and the suppression rules that retain a uniform noise which masks efficiently the musical noise in denoised signals. We discuss a connection between the block thresholding estimation and the decision-directed a priori SNR

estimator. The masking noise technique is incorporated in the block thresholding estimator.

Ephraim and Malah Methods. Estimating the a priori SNR $\xi[l, r] = |f_F[l, r]|^2 / \sigma^2[l, r]$ is an important step of most noise suppression rules. In their milestone paper [12], Ephraim and Malah proposed a decision-directed estimator of the a priori SNR with a recursive procedure

$$\hat\xi[l, r] = \alpha\, \frac{|\hat f_F[l-1, r]|^2}{\sigma^2[l-1, r]} + (1 - \alpha) \left(\frac{|y_F[l, r]|^2}{\sigma^2[l, r]} - 1\right)_+, \tag{22}$$

where $\alpha \in [0, 1]$ is a weighting parameter. In the first term, $\hat f_F[l-1, r]$ is the previously computed estimate of $f_F[l-1, r]$. The second term is a maximum likelihood estimate of the SNR of the current coefficient. The decision-directed SNR estimator is recursive and induces a temporal regularization on $\hat\xi[l, r]$ with a causal, exponentially decreasing smoothing window.

Based on an independent Gaussian distribution assumption of signal coefficients $f_F[l, r]$, Ephraim and Malah proposed the noise suppression rule

$$\hat f_F[l, r] = a[l, r]\, y_F[l, r] \tag{23}$$

with

$$a[l, r] = \frac{\sqrt{\pi}}{2}\, \frac{\sqrt{v[l, r]}}{\gamma[l, r]}\, \exp\left(-\frac{v[l, r]}{2}\right) \left[(1 + v[l, r])\, I_0\left(\frac{v[l, r]}{2}\right) + v[l, r]\, I_1\left(\frac{v[l, r]}{2}\right)\right] \tag{24}$$

where $\gamma[l, r] = |y_F[l, r]|^2 / \sigma^2[l, r]$ is called the a posteriori SNR of $f_F[l, r]$, $v[l, r]$ is defined by $v[l, r] = \frac{\xi[l, r]}{\xi[l, r] + 1}\, \gamma[l, r]$, and $I_0(\cdot)$ and $I_1(\cdot)$ denote respectively the modified Bessel functions of zero and first order. Fig. 2(b) shows the value of $a[l, r]$ as a function of $\xi[l, r]$ in dB for different values of $\gamma[l, r]$. Note that the curve corresponding to $\gamma - 1 = \xi$ is close to the average case, since $E\{\gamma\} - 1 = \xi$.

The Ephraim and Malah suppression rule, compared with block thresholding in Fig. 2(a), performs less severe attenuation when the a priori SNR $\xi[l, r]$ is very small; moreover, the attenuation decreases when the a posteriori SNR $\gamma[l, r]$ increases. As a result, the Ephraim and Malah suppression rule is able to retain some residual masking noise.
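The recursion (22) can be illustrated with a short sketch. To keep it self-contained, the gain applied at each step is a plain Wiener gain $\hat\xi / (1 + \hat\xi)$, which is a stand-in and not the suppression rule (24). The initialization of the first frame (previous term set to zero), the white-noise assumption and the function name are also assumptions of this sketch.

```python
import numpy as np

def decision_directed_snr(Y, sigma2, alpha):
    """A priori SNR estimate following the recursion (22), frame by frame.

    Y      : 2-D array of noisy coefficients, shape (time, frequency)
    sigma2 : noise variance, assumed constant over time and frequency
    alpha  : weighting parameter in [0, 1]
    Returns the estimated a priori SNR and the denoised coefficients.
    """
    n_time, n_freq = Y.shape
    xi = np.zeros((n_time, n_freq))
    f_hat = np.zeros_like(Y)
    prev_snr = np.zeros(n_freq)        # |f_hat[l-1]|^2 / sigma2, zero before the first frame
    for l in range(n_time):
        ml_snr = np.maximum(np.abs(Y[l]) ** 2 / sigma2 - 1.0, 0.0)
        xi[l] = alpha * prev_snr + (1.0 - alpha) * ml_snr
        gain = xi[l] / (1.0 + xi[l])   # Wiener gain used as a stand-in for (24)
        f_hat[l] = gain * Y[l]
        prev_snr = np.abs(f_hat[l]) ** 2 / sigma2
    return xi, f_hat

Y = np.full((3, 1), 3.0)
xi, f_hat = decision_directed_snr(Y, sigma2=1.0, alpha=0.5)
print(xi.ravel())
```

The closer $\alpha$ is to 1, the more each estimate relies on the previous frame, which is the temporal regularization described above.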
Block Thresholding. A block thresholding estimation (17) also depends upon an estimated a priori SNR calculated on each block:

$$a_k = \left(1 - \frac{\lambda\, \bar\sigma_k^2}{\bar y_{F,k}^2}\right)_+ = \left(1 - \frac{\lambda}{\hat\xi_k + 1}\right)_+, \tag{25}$$

where

$$\hat\xi_k = \frac{\bar y_{F,k}^2}{\bar\sigma_k^2} - 1 \tag{26}$$

is an unbiased estimate of the a priori SNR $\xi[l, r]$ computed by averaging the coefficient energy in a block.

To retain a low-amplitude masking noise, a non-zero attenuation floor value is kept by modifying (25):

$$a_k = \max\left(1 - \frac{\lambda\, \bar\sigma_k^2}{\bar y_{F,k}^2},\ a_0\right) = \max\left(1 - \frac{\lambda}{\hat\xi_k + 1},\ a_0\right) \tag{27}$$

where $0 < a_0 \le 1$ is a masking noise attenuation factor. The experiments show that with $a_0$ around .5, the small residual noise masks completely the remaining very weak musical noise.

Fig. 2(a) plots the attenuation factor (27) of the block thresholding as a function of $\hat\xi_k$ for different $\lambda$ and $a_0$. Note that the curve with $\lambda = 1$ corresponds to the attenuation with oracle. The block thresholding makes a stronger attenuation than the Ephraim and Malah suppression rule when the a priori SNR is weak. This explains why the block thresholding is better at eliminating the noise (if $a_0$ is small) than the Ephraim and Malah suppression rule.

Figure 2: Attenuation factor (gain in dB) versus a priori SNR ξ (in dB). (a) Block thresholding (27) for different thresholding parameters λ and masking noise attenuation factors $a_0$. (b) Ephraim and Malah suppression rule (24) for different a posteriori SNR γ.

3.4 Experiments and Results

The experiments presented below have been performed on various types of signals: Piano is a simple example that contains a single clear clavier stroke; Mozart and Centuria are musical excerpts that contain respectively quick notes played by a solo oboe and by some drums; Tête is a speech signal (in French). Centuria is sampled at 44 kHz and all the other signals are sampled at 11 kHz. They were corrupted by white Gaussian noise of different amplitudes. Short-time Fourier transforms with half-overlapping windows were used in the experiments. These windows are square roots of Hanning windows of size 5 ms for Piano and Mozart, 3 ms for Centuria and 2 ms for Tête (footnote 1).

Footnote 1: The audio denoising examples are available online at ?????.

3.4.1 Performance Comparison

Table 2 compares the performance in terms of SNR for block thresholding (block lengths and widths are discussed in the next section), the Ephraim and Malah suppression rule equipped with the decision-directed SNR estimator [12], and hard thresholding. Two levels of noise removal have been used for the block thresholding and the Ephraim and Malah method. For the partial noise removal level (P), both methods were calibrated to retain a residual noise of similar energy: we chose $a_0 \approx$ .5 in (27) for block thresholding and $\alpha \approx$ .98 in (22) for the Ephraim and Malah method. To achieve the maximum noise removal level (M), we chose $a_0 = 0$ and $\alpha \approx$ .999. For hard thresholding, the threshold was set equal to $3\sigma$, where $\sigma^2$ is the noise variance.

Table 2: Performance comparison. Top: Mozart with different SNR. Bottom: Piano, Centuria and Tête with 1 dB SNR. Columns, from left to right: hard thresholding, block thresholding (with partial (P) and maximum (M) noise removal), Ephraim and Malah suppression rule equipped with the decision-directed SNR estimator (with partial (P) and maximum (M) noise removal levels).

With the partial noise removal level (P), in both methods, the residual noise masks the musical noise; however, block thresholding introduces less signal distortion, as reflected by the systematic 2 dB SNR improvement. With the maximum noise removal level (M), the musical noise cannot be masked by the residual noise since there is nearly no residual noise left. Whereas block thresholding hardly produces any musical noise, the Ephraim and Malah method results in noticeable musical noise, especially when the SNR of the noisy signal is small (Mozart at 0 dB and 5 dB).
Note that the Ephraim and Malah method sometimes produces a resonance artifact, as if the sound was coming from far away. Such artifacts are especially strong for speech signals when α in the decision directed SNR estimator (22) is close to 1, which leads to a temporal window decreasing very slowly. Block thresholding does not create such artifact. Table 2 shows that a hard thresholding produces a smaller SNR than block thresholding (for both level (P) and (M)). Actually, it also produces a very strong musical noise. Fig. 3 displays

the different attenuation coefficient maps for the Tête signal. It shows that block thresholding coefficients (Fig. 3(c)) are closer to the oracle coefficients (Fig. 3(f)) than the hard thresholding coefficients (Fig. 3(b)). Moreover the block thresholding coefficients map is much more regular than the hard thresholding one. This gives a visual confirmation that block thresholding produces less signal distortion than hard thresholding.

Note that the block thresholding scheme can also be implemented with half-overlapping blocks to further regularize the estimator. It is equivalent to computing 4 block thresholding estimators with blocks shifted by $L/2$ in time and/or by $W/2$ in frequency and then averaging the 4 signal estimations. It leads to a .2 dB SNR improvement over the standard block thresholding with non-overlapping blocks, which is not much given the significant increase in computational complexity.

Figure 3: (a) Log-spectrogram of noisy Tête. Attenuation coefficients of hard thresholding in (b), block thresholding in (c), adaptive block thresholding in (d), adaptive block thresholding with the empirical Wiener shrinkage as a post-processing in (e) and attenuation with oracle in (f). Values of attenuation coefficients range from 1 (black) to 0 (white).

3.4.2 Block Sizes in Block Thresholding

The block thresholding results presented in Table 2 are obtained with optimal block sizes that maximize the SNR among block lengths $L = 8, 4, 2$ in time and block widths $W = 16, 8, 4, 2, 1$ in frequency. Optimal block sizes are respectively $(L, W) = (4, 1)$ for Piano, $(L, W) = (8, 1)$ for Mozart, $(L, W) = (8, 16)$ for Centuria and $(L, W) = (4, 8)$ for Tête. Since the noise is white and thus uniform in time and frequency, (15) shows that the optimal block size and shape depend upon the time-frequency spread of the signal components. Within the block size family previously mentioned, there is a difference of more than 2 dB SNR between the best and worst block sizes.

Block sizes could also be adapted to different signal parts. Fig. 4 zooms on the onset of the Mozart signal whose log-spectrogram is illustrated in Fig. 1(b). As shown in Figs. 4(a) and (b), at the beginning of the harmonics, blocks of large attenuation factors spread beyond the onset of the signal. Fig. 4(b′) illustrates the horizontal blocks at the onsets marked in Figs. 4(a) and (b). This produces a pre-echo artifact (footnote 2) in the denoised signal. In the time interval where the blocks exceed the signal onset, little attenuation is performed and the noise is not eliminated; consequently a sound is heard before the very beginning of the original signal. A smaller block size would reduce this time interval and thus reduce this pre-echo artifact.

Figure 4: Zoom on the onset of Mozart. (a) Log-spectrogram. Attenuation coefficients of block thresholding in (b) and adaptive block thresholding in (c). Values of attenuation coefficients range from 1 (black) to 0 (white). (b′) and (c′) illustrate respectively the block partition with block thresholding and adaptive block thresholding at the onset marked in (b) and (c).
4 Adaptive Block Thresholding

An adaptive block thresholding adapts block sizes to the time-frequency signal property by minimizing an estimation of the risk. Appropriate block sizes reduce pre-echo artifacts (as described in Section 3.4.2) and improve the SNR.

Footnote 2: We call this artifact pre-echo though, originally, pre-echo corresponds to a psychoacoustic phenomenon where an unusually noticeable artifact is heard in a sound recording from the energy of time domain transients smeared backwards in time after processing in the frequency domain due to the Gibbs phenomenon.

4.1 SURE of Block Thresholding Estimator

The best choice of block sizes minimizes the estimation risk $E\{\|\hat f - f\|^2\}$. This risk cannot be calculated since $f$ is unknown, but it can be estimated with a Stein Unbiased Risk Estimate (SURE) [23]. Best block sizes are computed by minimizing this estimated risk.

SURE is an estimate of the risk of an arbitrary estimator $\hat Y$ of the mean value vector $Y$ of a multivariate normal random vector $X$ having an identity covariance matrix. Since it is unbiased, $E\{\mathrm{SURE}\} = E\{\|\hat Y - Y\|^2\}$.

Theorem (Stein Unbiased Risk Estimate, SURE). Let $X = (x_1, \ldots, x_p)$ be a multivariate normal random vector of dimension $p$ with mean $Y$ and having an identity covariance matrix. Let $X + h(X)$ be an estimate of $Y$, where $h = (h_1, \ldots, h_p) : \mathbb{R}^p \to \mathbb{R}^p$ is almost differentiable ($h_i : \mathbb{R}^p \to \mathbb{R}$, $\forall i$). Define $\nabla \cdot h = \sum_{i=1}^{p} \frac{\partial h_i}{\partial x_i}$. If $E\left\{\sum_{i=1}^{p} \left|\frac{\partial h_i}{\partial x_i}(X)\right|\right\} < \infty$, then

$$E\{\|X + h(X) - Y\|^2\} = p + E\{\|h(X)\|^2 + 2 \nabla \cdot h(X)\}. \tag{28}$$

So

$$\mathrm{SURE} := p + \|h(X)\|^2 + 2 \nabla \cdot h(X) \tag{29}$$

is an unbiased estimate of the risk of $X + h(X)$, called the Stein Unbiased Risk Estimate (SURE) [23].

The proof of (28) is essentially based on the fact that $\phi'(y) = -y \phi(y)$, where $\phi(y)$ is the standard normal density [23].

Following the approach of Cai [3, 5], one can apply the SURE estimator to compute the risk of a block thresholding estimator. The Gaussian noise coefficients are uncorrelated and hence independent. Let us normalize the observed data, $z_F[m] = y_F[m] / \sigma[m]$, $\forall m \in \Gamma$, so that the normalized noise has an identity covariance matrix. Applying the SURE to the block thresholding estimator (17) on a block $B_k$ of size $p = B_k^\#$, one has

$$h_m(X) = -\frac{\lambda}{\bar z_{F,k}^2}\, z_F[m]\, \mathbf{1}_{\bar z_{F,k}^2 > \lambda} - z_F[m]\, \mathbf{1}_{\bar z_{F,k}^2 \le \lambda}, \quad \forall m \in B_k, \tag{30}$$

where $\bar z_{F,k}^2 = \frac{1}{B_k^\#} \sum_{m \in B_k} |z_F[m]|^2$. Applying (29), one gets $\mathrm{SURE}_{B_k}$ for a block thresholding estimator:

$$\mathrm{SURE}_{B_k} = B_k^\# + \frac{\lambda^2 B_k^\# - 2\lambda (B_k^\# - 2)}{\bar z_{F,k}^2}\, \mathbf{1}_{\bar z_{F,k}^2 > \lambda} + B_k^\# (\bar z_{F,k}^2 - 2)\, \mathbf{1}_{\bar z_{F,k}^2 \le \lambda}. \tag{31}$$

Since SURE is unbiased, $E\{\mathrm{SURE}_{B_k}\} = E\left\{\sum_{m \in B_k} |f_F[m] - \hat f_F[m]|^2\right\}$, with coefficients expressed in the normalized units of $z_F$.
When the noise is Gaussian white, coefficients in an orthogonal basis are independent. For a tight frame this hypothesis is not valid, but (31) still applies approximately because a tight frame behaves similarly to a union of orthogonal bases.

One can verify that the variance of $\frac{1}{B_k^\#} \mathrm{SURE}_{B_k}$ is approximately proportional to $\frac{1}{B_k^\#}$. When the blocks are small it is necessary to reduce this variance by averaging over several blocks $B_k$ inside a macroblock $M$: $\mathrm{SURE}_M = \sum_{k \in M} \mathrm{SURE}_{B_k}$. Let $M^\#$ be the number of coefficients in all the blocks included in $M$; then $\frac{1}{M^\#} \mathrm{SURE}_M$ has a variance proportional to $\frac{1}{M^\#}$.
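Formula (31) translates directly into a few lines of code. The sketch below assumes that the block has already been normalized by the noise standard deviation, as $z_F[m]$ above, and the function name `sure_block` is hypothetical.

```python
import numpy as np

def sure_block(z, lam):
    """Stein unbiased risk estimate (31) for one block of normalized coefficients.

    z   : array of normalized coefficients z_F[m] = y_F[m] / sigma[m] in the block
    lam : thresholding parameter lambda used in the shrinkage rule (17)
    """
    z = np.asarray(z)
    size = z.size                              # B_k^#
    mean_energy = np.mean(np.abs(z) ** 2)      # average normalized energy of the block
    if mean_energy > lam:
        # Block kept and shrunk by 1 - lam / mean_energy
        return size + (lam ** 2 * size - 2.0 * lam * (size - 2)) / mean_energy
    # Block set to zero
    return size + size * (mean_energy - 2.0)

print(sure_block(np.ones(4), lam=2.0))         # zeroed block
print(sure_block(2.0 * np.ones(4), lam=2.0))   # shrunk block
```

When the block is set to zero, the estimate reduces to the block energy minus the expected noise energy, which is the usual unbiased estimate of the discarded signal energy.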

The adaptive block thresholding groups coefficients in blocks whose sizes are adjusted to minimize SURE, and it attenuates the coefficients in those blocks. The blocks $B_k$ are sets of coefficients that are not necessarily connected or rectangular. In the following, by block size we mean a choice of block shape and size among a collection of possibilities.

In this adaptive grouping procedure, neighboring coefficients $y_F[m]$ are grouped in disjoint macroblocks $M_j$, $j = 1, 2, \ldots, J$. A macroblock $M_j$ can be segmented into blocks $B_k$ of the same size $B^\#(j)$. Several such segmentations are possible, and we want to choose the one that leads to the smallest risk estimated with SURE. The optimal block size $B^\#(j)$ for the blocks $B_k$ in $M_j$ is calculated by minimizing the SURE in $M_j$, i.e.,

$$ B^\#(j) = \arg\min_{B^\#} \mathrm{SURE}_{M_j} = \arg\min_{B^\#} \sum_{B_k \subset M_j} \mathrm{SURE}_{B_k}, \quad j = 1, 2, \ldots, J. \tag{32} $$

To reduce its variance, SURE is calculated over blocks of identical size imposed in each macroblock. The macroblock size should not be too large, in order to maintain enough adaptivity in the evolution of the block sizes. Once the block sizes are computed, the coefficients in each $B_k$ are attenuated with (17), where $\lambda$ is calculated with (2).

4.2 Adaptive Block Thresholding in Short-Time Fourier Frames

The time-frequency adaptive block thresholding is applied directly to short-time Fourier frames. In numerical experiments each macroblock is segmented with 15 possible block sizes $B^\# = L \times W$, obtained by combining a block length $L = 8, 4, 2$ with a block width $W = 16, 8, 4, 2, 1$. The thresholding parameter $\lambda$ is calculated with (2). The size of macroblocks is set equal to the maximum block size $B^\#_{\max} = 8 \times 16$. Fig. 5 illustrates different segmentations of these macroblocks into time-frequency blocks of the same size.
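The selection rule (32) can be sketched as an exhaustive search over the 15 candidate sizes. This is a minimal illustration under stated assumptions, not the authors' code: the macroblock is taken to be an 8 x 16 array of normalized coefficients with time along rows and frequency along columns, the function names are hypothetical, and `lam_for_size` is a hypothetical callable standing in for the rule that gives the threshold for each block size, which is not reproduced here.

```python
import numpy as np

BLOCK_LENGTHS = (8, 4, 2)         # L values quoted in the text
BLOCK_WIDTHS = (16, 8, 4, 2, 1)   # W values quoted in the text


def sure_block(z, lam):
    """SURE of one block of normalized coefficients, following (31)."""
    size = z.size
    mean_energy = np.mean(np.abs(z) ** 2)
    if mean_energy > lam:
        return size + (lam ** 2 * size - 2.0 * lam * (size - 2)) / mean_energy
    return size + size * (mean_energy - 2.0)


def split_into_blocks(macroblock, length, width):
    """Tile a 2-D macroblock into disjoint length x width blocks, one block per row."""
    rows, cols = macroblock.shape
    tiles = macroblock.reshape(rows // length, length, cols // width, width)
    return tiles.swapaxes(1, 2).reshape(-1, length * width)


def best_block_size(macroblock, lam_for_size):
    """Return ((L, W), risk) minimizing the summed SURE over the macroblock."""
    best_size, best_risk = None, np.inf
    for length in BLOCK_LENGTHS:
        for width in BLOCK_WIDTHS:
            lam = lam_for_size(length, width)
            blocks = split_into_blocks(macroblock, length, width)
            risk = sum(sure_block(block, lam) for block in blocks)
            if risk < best_risk:           # ties keep the first (largest) candidate
                best_size, best_risk = (length, width), risk
    return best_size, best_risk


# Illustrative macroblock: unit-variance noise plus one sustained frequency bin.
rng = np.random.default_rng(0)
macroblock = rng.standard_normal((8, 16))
macroblock[:, 5] += 6.0
print(best_block_size(macroblock, lambda length, width: 2.5))  # constant lam is a placeholder
```

Restricting every candidate to tile the macroblock exactly keeps the number of coefficients in each sum identical, so the 15 risk estimates are directly comparable.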
Experiments have been performed on the same audio signals as in Subsection 3.4, with 1 dB SNR, with the same short-time Fourier frames and with the maximum noise removal level (M), i.e., with the parameter a in (27) set accordingly. The first two columns of Table 3 compare the SNR obtained with the adaptive block thresholding and with the block thresholding using an optimal fixed block size obtained with an oracle. For three out of the four signals, the adaptive block thresholding improves the SNR relative to the optimal fixed-size block thresholding. With Piano the SNR improvement is as high as .5 dB. With Mozart, the result is the second best among the 15 block size candidates and .25 dB below the result obtained with the optimal block size.

As shown in Figs. 4(c)(c'), compared with Figs. 4(b)(b'), in the first part of Mozart the adaptive block method chooses blocks of shorter length L that hardly exceed the onset of the signal. This considerably reduces the pre-echo artifact discussed previously. After the onset, the adaptive block method chooses narrow horizontal blocks, of the same width as the non-adaptive method, that are able to capture the harmonic structure of the signal.

Figure 5: Partition of macroblocks into blocks of different sizes.

Table 3: Performance comparison between the block thresholding with the optimal fixed block size, the adaptive block thresholding and the adaptive block thresholding with the empirical Wiener shrinkage as a post-processing. Columns: Block Thresholding with Optimal Fixed Size; Adaptive Block Thresholding; Adaptive Block Thresholding with Empirical Wiener Shrinkage as Post-processing. Rows: Piano, Mozart, Centuria, Tête.

5 Post-processing: Empirical Wiener Shrinkage

As a post-processing, an empirical Wiener shrinkage [14] is cascaded after the adaptive block thresholding. It allows a more flexible and accurate attenuation decision while inheriting the time-frequency regularization of the estimate from the adaptive block thresholding. The basic idea is to use the denoised signal as if it were the clean signal. Let us denote by $\tilde f$ the denoised signal obtained by the adaptive block thresholding algorithm and $\tilde f_F[m] = \langle \tilde f, g_m \rangle$. An empirical Wiener shrinkage is a diagonal thresholding with attenuation coefficients defined as in (4):

$$ a[m] = \frac{|\tilde f_F[m]|^2}{|\tilde f_F[m]|^2 + \sigma^2}. \tag{33} $$

Table 3 shows that the empirical Wiener shrinkage used as a post-processing brings an improvement of .25 dB SNR on average and .5 dB on Mozart. The audio improvement due to the post-processing includes less distortion of the underlying signals and further removal of the musical noise.
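The attenuation rule (33) is simple enough to sketch directly. The snippet below is an illustration under assumptions rather than the authors' code: `pilot_coeffs` is a hypothetical name for the frame coefficients of the signal already denoised by the adaptive block thresholding, the attenuation is assumed to multiply the noisy coefficients as in any diagonal estimator, and `sigma2` is assumed to be a known, strictly positive noise variance.

```python
import numpy as np


def empirical_wiener(noisy_coeffs, pilot_coeffs, sigma2):
    """Apply a[m] = |pilot|^2 / (|pilot|^2 + sigma^2) to the noisy coefficients.

    Returns the attenuated coefficients and the attenuation map a[m].
    """
    noisy = np.asarray(noisy_coeffs)
    pilot_energy = np.abs(np.asarray(pilot_coeffs)) ** 2
    attenuation = pilot_energy / (pilot_energy + sigma2)   # values in [0, 1)
    return attenuation * noisy, attenuation


# Illustrative values only.
noisy = np.array([2.0, 2.0, 2.0])
pilot = np.array([0.0, 1.0, 3.0])      # stand-in for the block-thresholded estimate
denoised, attenuation = empirical_wiener(noisy, pilot, sigma2=1.0)
print(attenuation)
```

Coefficients that the first pass set to zero stay at zero, while strong coefficients are attenuated by a factor close to one, which is how the second pass keeps the regularity of the first estimate.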

Fig. 3(e) displays the attenuation coefficient map of the empirical Wiener shrinkage. It maintains the same time-frequency regularity as the adaptive block thresholding (Fig. 3(d)), and its coefficients are closer to the oracle coefficients (Fig. 3(f)).

6 Conclusion

A diagonal thresholding of spectrogram coefficients is unsuitable for audio signal denoising because it produces too much musical noise. This paper describes a time-frequency block thresholding which produces hardly any musical noise and improves the SNR relative to state-of-the-art methods such as Ephraim and Malah estimators. A block thresholding groups time-frequency signal coefficients in blocks and then attenuates the coefficients in each block. This block grouping regularizes the estimation and contributes to the elimination of the musical noise. The block size can also be adapted to the signal properties by minimizing a SURE estimator of the block thresholding risk. For audio signals this adaptation reduces distortions such as pre-echo artifacts.

References

[1] M. Berouti, R. Schwartz and J. Makhoul, Enhancement of speech corrupted by acoustic noise, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4.
[2] S. Boll, Suppression of acoustic noise in speech using spectral subtraction, IEEE Trans. Acoust., Speech, Signal Process., ASSP-27.
[3] T. Cai, Adaptive wavelet estimation: a block thresholding and oracle inequality approach, Ann. Statist., 27.
[4] T. Cai and B.W. Silverman, Incorporating information on neighboring coefficients into wavelet estimation, Sankhya, 63, 2001.
[5] T. Cai and H. Zhou, A data-driven block thresholding approach to wavelet estimation, Technical Report, Statistics Department, University of Pennsylvania, 2005.
[6] O. Cappé, Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor, IEEE Trans. Speech and Audio Processing, vol. 2, Apr.
[7] I. Cohen, Speech enhancement using a noncausal a priori SNR estimator, IEEE Signal Processing Letters, vol. 11, issue 9, Sept. 2004.
[8] I. Cohen, Enhancement of speech using Bark-scaled wavelet packet decomposition, Eurospeech 2001, Scandinavia.
[9] R.R. Coifman and D.L. Donoho, Translation-invariant de-noising.
[10] I. Daubechies, A. Grossmann and Y. Meyer, Painless nonorthogonal expansions, J. Math. Phys., vol. 27, no. 5, 1986.

[11] D. Donoho and I. Johnstone, Ideal spatial adaptation by wavelet shrinkage, Biometrika, vol. 81.
[12] Y. Ephraim and D. Malah, Speech enhancement using a minimum mean square error short-time spectral amplitude estimator, IEEE Trans. Acoust., Speech, Signal Process., 32 (6), Dec.
[13] Y. Ephraim and D. Malah, Speech enhancement using a minimum mean square error log-spectral amplitude estimator, IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-33, Apr.
[14] S. Ghael, A. Sayeed and R. Baraniuk, Improved wavelet denoising via empirical Wiener filtering, Proceedings of SPIE, Mathematical Imaging, San Diego, July.
[15] P. Hall, G. Kerkyacharian and D. Picard, A note on the wavelet oracle, Statistics and Probability Letters, 43.
[16] P. Hall, G. Kerkyacharian and D. Picard, Block threshold rules for curve estimation using kernel and wavelet methods, Ann. Statist., 26.
[17] P. Hall, G. Kerkyacharian and D. Picard, On the minimax optimality of block thresholded wavelet estimators, Statistica Sinica, 9, 33-50.
[18] J.S. Lim and A.V. Oppenheim, Enhancement and bandwidth compression of noisy speech, Proc. of the IEEE, vol. 67, Dec.
[19] S. Mallat, A Wavelet Tour of Signal Processing, 2nd edition, New York: Academic.
[20] R.J. McAulay and M.L. Malpass, Speech enhancement using a soft decision noise suppression filter, IEEE Trans. Acoust., Speech, Signal Process., ASSP-28, 1980.
[21] H. Sheikhzadeh and H.R. Abutalebi, An improved wavelet-based speech enhancement system, Eurospeech 2001.
[22] C. Stein and W. James, Estimation with quadratic loss, Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability 1 (Berkeley, University of California Press).
[23] C. Stein, Estimation of the mean of a multivariate normal distribution, Ann. Statist., 1981.
[24] J.S. Walker, Denoising Gabor transforms, submitted.
[25] P.J. Wolfe and S.J. Godsill, Simple alternatives to the Ephraim and Malah suppression rule for speech enhancement, IEEE Workshop on Statistical Signal Processing, Aug. 2001.
[26] G. Yu, E. Bacry and S. Mallat, Audio signal denoising with complex wavelets and adaptive block attenuation, to appear in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hawaii, 2007.

Optimal Speech Enhancement Under Signal Presence Uncertainty Using Log-Spectral Amplitude Estimator

Israel Cohen, Lamar Signal Processing Ltd., P.O. Box 573, Yokneam Ilit 20692, Israel. E-mail: icohen@lamar.co.il