Overview of Single Channel Noise Suppression Algorithms

Matías Zañartu Salas
Post Doctoral Research Associate, Purdue University
October 4, 2010

General notes

Only single-channel speech enhancement schemes are reviewed in the present outline. A complete discussion of single and multichannel algorithms will be included in the first trimester report.

Signal model and notation: $y[n] = x[n] + v[n]$, which in vector form $\mathbf{y}_k = \mathbf{x}_k + \mathbf{v}_k$ allows computing the STFT $Y(\omega) = X(\omega) + V(\omega)$, where $\omega$ is the DFT frequency bin and $k$ is the frame time index (sometimes the notation $\omega_k$ is used). For each frequency, the amplitudes of the signal spectra are denoted $Y_k$, $X_k$, and $V_k$.

Noise and speech are assumed to be zero mean and uncorrelated.

Frequency bins are generally considered independent, yet assigning joint distributions to them is possible.

No effects of reverberation are considered in this initial review.

Phase is generally not recovered and is assumed to follow that of the corrupted speech. Perceptually, this is OK for SNR > 5 dB [10, 1]. Effects of phase distortion in ASR remain unknown.

Most techniques use non-overlapping frames of short duration: win = 4-40 ms (most use […] ms). Thus, for fs = […] Hz a window length in samples is m = […] samples.

The following classification of single channel noise suppression algorithms is an extension of that proposed in [1].

Notes: algorithm proposed for acoustic evaluation, no MATLAB code available from [10].

A Filtering schemes

All of the following algorithms can be represented by a linear (but possibly non-causal) transfer function $H(\omega)$, normally referred to as the gain function, such that $\hat{X}(\omega) = H(\omega) Y(\omega)$. These methods are highly dependent on the ability to estimate the time-varying noise component and on speech presence estimation.
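
The gain-function view above can be sketched in a few lines. The following is a minimal illustration, assuming non-overlapping rectangular frames and a noise magnitude estimate that is already known; the function names `apply_gain` and `spectral_subtraction_gain`, the sampling rate, and the frame duration are hypothetical choices made for this example and are not taken from the outline or from any reference implementation.

```python
import numpy as np

def apply_gain(y, m, gain_fn):
    """Filter y frame by frame with X_hat = H * Y, reusing the noisy phase."""
    y = np.asarray(y, dtype=float)
    n_frames = len(y) // m                       # an incomplete last frame is dropped
    frames = y[: n_frames * m].reshape(n_frames, m)
    Y = np.fft.rfft(frames, axis=1)              # one spectrum per frame
    H = gain_fn(np.abs(Y))                       # real, non-negative gain per bin
    X_hat = H * Y                                # a real gain leaves the phase of Y unchanged
    return np.fft.irfft(X_hat, n=m, axis=1).reshape(-1)

def spectral_subtraction_gain(noise_mag):
    """Build gain_fn for H = max(1 - |V_hat| / |Y|, 0) (basic magnitude subtraction)."""
    def gain_fn(Y_mag):
        return np.maximum(1.0 - noise_mag / np.maximum(Y_mag, 1e-12), 0.0)
    return gain_fn

# Illustrative values only (assumptions, not the values used in the outline).
fs = 8000                        # sampling rate in Hz
m = int(round(0.020 * fs))       # 20 ms frame -> 160 samples

rng = np.random.default_rng(0)
y = rng.standard_normal(10 * m)  # stand-in for a noisy observation
x_hat = apply_gain(y, m, spectral_subtraction_gain(noise_mag=5.0))
```

Any of the gain functions discussed in this section could be passed as `gain_fn`; only the rule that maps $|Y(\omega)|$ (and the noise estimate) to $H(\omega)$ changes.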

1. Spectral subtraction: Based on heuristic principles, such as $|\hat{X}(\omega)| = |Y(\omega)| - |\hat{V}(\omega)|$ if this difference is positive and $|\hat{X}(\omega)| = 0$ otherwise. Thus, $H(\omega) = 1 - |\hat{V}(\omega)| / |Y(\omega)|$. Overall complexity: $O(m \log m)$. The subtraction is normally made frequency by frequency. Imposing $\hat{X}(\omega) = 0$, together with sharp frame-to-frame differences in the frequency estimates, creates musical noise. Some techniques to reduce the musical noise are:

(a) Alternative forms: $|\hat{X}(\omega)|^p = |Y(\omega)|^p - |\hat{V}(\omega)|^p$. Cross terms are ignored, which introduces low-frequency distortions. Schemes to retrieve cross-terms normally require further statistical assumptions and yield other types of distortions.

(b) Undersubtraction: provides a more gradual attenuation by using $|\hat{X}(\omega)| = \tfrac{1}{2}|Y(\omega)| + \tfrac{1}{2}\left(|Y(\omega)|^2 - |\hat{V}(\omega)|^2\right)^{1/2}$. Less noise removal overall but minor distortion.

(c) Smoothing: Averages estimates of $\hat{X}$ from neighboring frames to reduce the spectral changes between them. In spectral subtraction, this method can introduce delays. Thus, it is preferred in the context of the Wiener filter, where such delays are better handled.

(d) Oversubtraction: Increases the amount of noise to be subtracted while defining a noise floor larger than 0. That is, $|\hat{X}(\omega)|^2 = |Y(\omega)|^2 - \alpha|\hat{V}(\omega)|^2$ if $|Y(\omega)|^2 > (\alpha + \beta)|\hat{V}(\omega)|^2$ and $|\hat{X}(\omega)|^2 = \beta|\hat{V}(\omega)|^2$ otherwise, where $\alpha \geq 1$ and $0 < \beta \ll 1$. Less noise removal during silent portions but higher suppression during voiced ones.

(e) Nonlinear: Similar to oversubtraction but using frequency-dependent terms for $\alpha(\omega)$ and smoothed estimates of noise and speech, thus improving the subtraction of colored noise.

(f) Multiband: Makes use of bands instead of single frequencies to reduce sharp spectral variations between frames. The filter bank can be set linearly or on a mel frequency scale.

2. Wiener filtering: Based on an optimal solution that minimizes the MSE and yields maximum reduction in terms of the noise reduction factor.
This could be considered the theoretical gold standard of noise suppression, provided the noise and SNR estimates are perfect. In the time domain (assuming full rank), $\mathbf{h}_0 = \mathbf{R}_y^{-1}\mathbf{r}_{yy} - \mathbf{R}_y^{-1}\mathbf{r}_{vv} = [1\ 0\ \cdots\ 0]^T - \mathbf{R}_y^{-1}\mathbf{r}_{vv} = \mathbf{h}_1 - \mathbf{R}_y^{-1}\mathbf{r}_{vv}$. When applied, this filter yields the unaffected observation minus the ideal noise. In the frequency domain, $H_0(\omega) = P_{xy}(\omega)/P_{yy}(\omega) = P_{xx}(\omega)/(P_{xx}(\omega) + P_{vv}(\omega))$, where both $P_{xx}(\omega)$ and $P_{vv}(\omega)$ need to be estimated. Alternatively, it can be noted that $H_0(\omega) = 1/(1 + 1/\xi_k(\omega))$, where $\xi_k(\omega) = P_{xx}(\omega)/P_{vv}(\omega)$ is the a priori SNR. Overall complexity: $O(m^2)$ in the time domain and $O(m \log m)$ in the frequency domain. Musical noise is also present, as in spectral subtraction, and the following methods are used to reduce it:

(a) 1-frame smoothing: The goal is to smooth sharp frame transitions that affect $P_{xx}(\omega)$. Thus, the estimate for the $k$-th frame is $\hat{P}_{xx}(\omega,k) = (1 - \alpha) P_{xx}(\omega,k) + \alpha \hat{P}_{xx}(\omega,k-1)$, where the raw estimate $P_{xx}(\omega,k)$ can be obtained from spectral subtraction or from the noisy observation, and $\alpha \in [0,1]$. The process can be repeated iteratively if needed. Tradeoff: musical noise vs. onset distortion.

(b) Gain-adaptive smoothing: As in the 1-frame smoothing but using a time-varying $\alpha$ that follows spectral transients. Underlying assumption: noise in regions of rapid spectral changes is easily masked. Thus, the goal is to apply less smoothing for transients and significant smoothing for stationary regions. Then, $\alpha_k = f(1 - 2(\hat{Y}_k - \bar{Y}_k))$, where
$f(x) = 1$ if $x \geq 1$, $f(x) = 0$ if $x \leq 0$, and $f(x) = x$ if $0 < x < 1$. $\hat{Y}_k$ is a mean spectral distortion measure, i.e., $\hat{Y}_k = \left[\frac{1}{\pi}\int_0^{\pi} \left|\,|Y_k(\omega)|^2 - |Y_{k-1}(\omega)|^2\right| d\omega\right]^{1/2}$. $\bar{Y}_k$ is the mean of $\hat{Y}_k$ in a noisy segment. Best results were obtained using a short frame size (4 ms) to better capture the desired transients [11].

(c) Suboptimal design: Simple tradeoff between suppression and distortion with a single coefficient $\alpha$. Better understood in the time domain: $\mathbf{h}_0 = \mathbf{h}_1 - \alpha\mathbf{R}_y^{-1}\mathbf{r}_{vv}$, with $\alpha \in [0,1]$. That is, the unaffected observation minus a non-ideal noise.

(d) Mean adaptive: Based on an adaptive scheme used in image processing where the signal is modeled as a Gaussian random process represented by the sum of its mean and variance. Both quantities are computed and updated online. The technique is widely used in image processing but seldom used in speech. Thus, little information is available regarding its performance.

(e) Multiband: see the description for spectral subtraction.

3. ETSI scheme: Two-stage Wiener filter scheme combining schemes described above. The filter is estimated in the frequency domain, where a 1-frame smoothing is used along with a mel-scaled multiband approach. The filter magnitude is decomposed with the log-scale filter banks and its linear coefficients are retrieved using a mel-warped inverse discrete cosine transform. These coefficients are used to filter the signal. After filtering, the process is repeated once (second stage), but using an additional adaptive gain that is increased if the input contains noise only (over-suppression). The ETSI standard contains additional features for ASR (e.g., cepstral coefficient computation and enhancement) which, taken together, provide a notable WER improvement (10 % lower than the best possible solution [4]). However, its noise suppression module has not been evaluated separately. The scheme handles 4 frames of 5 ms at a time for smoothing processing. Overall complexity: $O(m \log m)$.
Complete details for implementation can be found in [7].

4. Subspace methods: Similar to the Wiener filter but with a constrained MMSE optimization, i.e., since $\hat{\mathbf{x}}_k = \mathbf{H}\mathbf{y}_k$, then $\mathbf{e}_k = \hat{\mathbf{x}}_k - \mathbf{x}_k = (\mathbf{H} - \mathbf{I})\mathbf{x}_k + \mathbf{H}\mathbf{v}_k = \mathbf{e}_x + \mathbf{e}_v$. Thus, each error term is treated independently. Wiener: minimizing $\mathbf{e}_k$. Subspace: minimizing $\mathbf{e}_x$ while limiting the noise residual $\mathbf{e}_v$. The constrained optimization yields a solution that uses SVD, i.e., $\mathbf{H}_0 = \mathbf{B}^{-T}\boldsymbol{\Lambda}(\boldsymbol{\Lambda} + \mu\mathbf{I})^{-1}\mathbf{B}^{T}$, where $\mathbf{B}$ is an invertible matrix such that $\mathbf{R}_v = \mathbf{B}^{-T}\mathbf{B}^{-1}$, $\mathbf{R}_y = \mathbf{B}^{-T}(\boldsymbol{\Lambda} + \mathbf{I})\mathbf{B}^{-1}$, and $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_L)$. Overall complexity: $O(m^2)$, but it can be reduced with iterative implementations. This approach has been gaining attention in recent years (particularly in cochlear implant applications). Its performance is variable, but can be considered comparable with that of the MMSE-LSA estimator.

B Statistical spectral estimation schemes

These algorithms generally estimate the spectral amplitude and assume that the phase follows that of the noisy observation (which was shown to be an optimal estimate [5, 1]). The estimation is performed in the frequency domain for each frequency bin. All schemes assume probability density functions (pdfs) for the speech and noise and search for estimates that minimize a certain distortion measure via some optimization algorithm. The selection of the distortion measure, pdfs, and optimization algorithm constitutes the main difference between these algorithms. Unfortunately, pdf
assumptions can vary from case to case and simplifying assumptions are needed to obtain tractable mathematical expressions. Under simplifying assumptions, all schemes in this section yield (generally nonlinear) filter gains. The overall asymptotic complexity of these schemes is $O(m \log m)$, yet their implementation may not always be trivial due to the presence of nonlinear terms.

1. MMSE estimator: This approach estimates the real and imaginary components of the signal spectrum (which would allow estimating both phase and amplitude). Thus, $X(\omega) = X_R(\omega) + iX_I(\omega)$, where $\hat{X}_{MMSE} = E[X(\omega) \mid Y(\omega)] = E[X_R(\omega) \mid Y(\omega)] + iE[X_I(\omega) \mid Y(\omega)]$. Assuming Gaussian pdfs for both noise and speech, the MMSE estimate becomes the Wiener filter. The Gaussian assumption is generally not OK for short-term speech signals (<40 ms), for which either Laplacian or Gamma distributions are used. An expression for $H_{MMSE}(\omega)$ can be found in all cases, being nonlinear for the latter pdfs. This method is not widely used since estimating only the amplitude is more efficient and yields the same results.

2. MMSE-SA estimator [5]: Assuming that $Y(\omega_k) = Y_k e^{i\theta_{Y_k}}$ and $X(\omega_k) = X_k e^{i\theta_{X_k}}$, then $\hat{X}_{k,MMSE} = \int_0^{\infty} X_k\, p[X_k \mid Y(\omega_k)]\, dX_k$. In other terms, $\hat{X}_{k,MMSE}$ minimizes the distortion measure $d = E[(X_k - \hat{X}_k)^2]$ given the noisy observation $y[n]$. Assuming Gaussian pdfs, it is shown that $\hat{\theta}_{X_k} = \theta_{Y_k}$ and that the amplitude estimator is a function of the a priori SNR ($\xi_k = \sigma_x^2/\sigma_v^2 = P_{xx}(\omega_k)/P_{vv}(\omega_k)$) and the a posteriori SNR ($\gamma_k = Y_k^2/\sigma_v^2$). This feature allows the filter to increase its noise suppression in terms of the instantaneous SNR, which suppresses more residual noise. Even under the Gaussian pdf assumption, the expression $H_{MMSE\text{-}SA}(\xi_k, \gamma_k)$ is highly nonlinear. However, under high SNR conditions, the MMSE-SA estimator converges to the Wiener filter.

3. MMSE-LSA estimator (a.k.a. Log-MMSE) [6]:
This estimator is almost identical in nature to the MMSE-SA, but uses a different distortion measure, such that $\hat{X}_{k,MMSE\text{-}LSA}$ minimizes the distortion measure $d = E[(\log X_k - \log \hat{X}_k)^2]$ given the noisy observation $y[n]$. This is shown to be $\hat{X}_{k,MMSE\text{-}LSA} = \exp(E[\ln X_k \mid Y_k])$. This estimator can only be solved assuming Gaussian pdfs, and yields an $H_{MMSE\text{-}LSA}$ that also depends on the a priori and a posteriori SNRs. However, $H_{MMSE\text{-}LSA}$ provides further reduction, particularly when the instantaneous SNR is low. This yields a much lower noise residual than the MMSE-SA estimator with minor speech distortion.

4. Optimally-modified-MMSE-LSA estimator: It follows the same principles as MMSE-LSA but uses smoothing techniques for noise estimation and includes the speech presence probability in subbands.

5. ML-A estimator: The maximum likelihood estimator is attractive due to its asymptotic optimality property. $\hat{X}_{k,MLA} = \arg\max_{X_k} \{\ln p[Y(\omega_k) \mid X_k]\}$. Assuming Gaussian pdfs, this estimate has a simple gain function $H_{MLA}(\omega_k) = \left(1 + \{(Y_k^2 - \sigma_v^2)/Y_k^2\}^{1/2}\right)/2$. The simplicity of the filter is the advantage of this approach. It has seen little testing, though.

6. MAP-A estimator: The maximum a posteriori estimator is similar to ML-A. $\hat{X}_{k,MAPA} = \arg\max_{X_k} \{\ln p[X_k \mid Y(\omega_k)]\}$. Assuming Gaussian pdfs, this estimate yields a simple gain function that depends on the a priori ($\xi_k$) and a posteriori ($\gamma_k$) SNRs. Thus, $H_{MAPA}(\omega_k) = \left(\xi_k + \{\xi_k^2 + (1 + \xi_k)\xi_k/\gamma_k\}^{1/2}\right)/\{2(1 + \xi_k)\}$. The simple, closed form of the filter and its dependency on the a priori and a posteriori SNRs are the advantages of this approach.
The performance of this scheme has been shown to be almost the same as that of MMSE-SA [10].

7. Perceptually-motivated Bayesian estimators: These schemes modify the distortion measure to introduce perceptually-based ideas. The distortion measures that emphasize the spectral valleys more than the spectral peaks were the ones that outperformed the MMSE-SA and MMSE-LSA, in terms of better noise residual and less speech distortion [10]. These algorithms assume Gaussian pdfs for simplicity. The selected distortion measures are:

(a) Weighted Euclidean: The proposed distortion measure is $d_{WE} = X_k^p (X_k - \hat{X}_k)^2$. The gain function depends on the a priori and a posteriori SNRs and is highly nonlinear. Best results were observed when $p = -1$ [10].

(b) Weighted Cosh: The proposed distortion measure is $d_{WCOSH} = [X_k/\hat{X}_k + \hat{X}_k/X_k - 1]X_k^p$. The gain function depends on the a priori and a posteriori SNRs and is highly nonlinear. Best results were observed when $p = -0.5$, outperforming those from MMSE-SA, MMSE-LSA, and the weighted Euclidean distortion measure [10].

C Model-based schemes

1. Harmonic: Retrieves the harmonic structure of voiced speech by using a comb filter such as $h_{COMB}[n] = \sum_{i=0}^{N} h_i \delta(n - Ti)$. Challenge: estimate $f_0$, the spectral slope ($h_i$), and the number of harmonics. Even when performed correctly, it generally introduces distortion in unvoiced portions that is considered worse than musical noise [2].

2. Linear prediction: Aims to retrieve the AR coefficients. Two main approaches are used, both based on an ML estimate obtained via iterative EM algorithms. Both potentially converge to an optimal estimate in the MMSE sense.

(a) Wiener-EM: Assumes Gaussian distributions. E-step: uses a Wiener filter constructed using AR coefficients to estimate the signal. M-step: uses a MAP algorithm based on the previous coefficient and clean speech estimates. Overall complexity: $O(pm)$, where $p$ is the AR order.

(b) Kalman-EM: E-step: uses a Kalman filter from noisy observations.
M-step: Solves the Yule-Walker equations but using the previous AR coefficients instead of correlation coefficients. This algorithm was evaluated for ASR in terms of WER and was found to outperform Log-MMSE, Wiener-EM, and HMM, but not Optimally-modified-Log-MMSE. An initial discussion is presented in [9], with further details in [8]. This scheme estimates not only the AR coefficients but also the complete enhanced speech signal and the (possibly colored) background noise. This algorithm is the natural extension, for a single-channel implementation, of the modified Kalman filter we studied last year. Overall complexity: $O(pm)$, where $p$ is the highest AR order between the speech and noise models.

3. HMM: Statistical model that makes use of a finite number of states and state transitions to estimate the desired signals. It uses the same approach as the statistical spectral estimation schemes but estimates specific pdfs from training data. Noise reduction: two HMMs are required,
one for noise and one for speech, both needing training. It uses an EM algorithm during training and an iterative MAP algorithm via an AR-Wiener filter during estimation. A different HMM enhancement scheme combined ideas of the HMM and the harmonic model in [3]. Noise reduction was achieved by applying an HMM-based MMSE estimator to find the harmonic sinusoidal model parameters of clean speech from speech corrupted by additive noise. The model is considered to outperform the traditional HMM-based enhancement. HMM overall complexity: $O(mK)$, where $K$ is the larger of the total number of Gaussian distributions and the codebook size, a count that is generally larger than $O(m^3)$. Thus, the scheme is computationally expensive and requires training, which does not appear compatible with a pre-processing, front-end noise suppression scheme for low-power applications. Furthermore, it has been outperformed in SNR and WER by simpler methods (i.e., Kalman-EM) [8].

Key building blocks

1. Voice activity detection (VAD): An important component of noise suppression algorithms, since the processing can differ when the speech signal is absent. It is also important for beamforming algorithms. It can also be used in silence compression schemes in speech coding (e.g., in a two-way conference each participant utters speech about 35% of the time [10]). Overall complexity: between $O(m)$ when no FFT is computed and $O(m \log m)$ otherwise.

(a) Heuristic approach: The main trend is based on thresholding of the log-energy combined with the zero-crossing count. The underlying assumption is that voiced segments have more energy and periodicity; thus, the scheme has some problems with low-energy unvoiced consonants. The scheme can be made iterative and adaptive to improve this. Similar heuristic methods follow the same principles but make use of cepstral coefficients and other spectral distance measures.
(b) Bayesian VUS (voiced-unvoiced-silence): Statistical scheme using a multivariate Gaussian distribution over a feature vector containing five key features (short-time log energy, zero-crossing count, normalized autocorrelation coefficient at unit sample delay, first predictor of a $p$th-order LPC, and normalized energy of a $p$th-order LPC). The approach requires training, but this could be skipped, as parameter sets can be taken from a study for English speakers [12].

(c) Model-based VAD: A thresholding scheme based on the likelihood ratio. Speech presence ($H_1^k$) is decided when $\frac{1}{N}\sum_{k=1}^{N-1} \log\lambda_k$ is above the threshold $\delta$ (otherwise speech absence, $H_0^k$, is decided). The likelihood ratio is computed assuming a Gaussian distribution and simplifies to $\lambda_k = \frac{1}{1 + \xi_k}\exp\{\gamma_k\xi_k/(1 + \xi_k)\}$. Performance is higher than that of most methods at low SNR conditions.

(d) Speech presence probability estimation: Designed to work with statistical spectral estimation schemes, where an additional speech presence probability $P(H_1^k \mid Y_k)$ is used to multiply the gain functions. Expressions depend on each spectral estimation scheme and they are generally a function of the a priori SNR ($\xi_k$) and the a posteriori SNR ($\gamma_k$) for each frequency bin. Although less common, they can also be used in a thresholding scheme to define a VAD. However, since they are generally based on SNRs (i.e., energy based), they can have problems with low-energy unvoiced speech portions.
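
As an illustration of the model-based VAD in (c), the likelihood-ratio test can be sketched as follows. This is only a sketch under stated assumptions: the noise variance per bin is taken as known, the a priori SNR is approximated by $\max(\gamma_k - 1, 0)$, the log-likelihood ratio is averaged over all supplied bins, and the threshold value is arbitrary. The function names `log_likelihood_ratio` and `vad_decision` and the default `delta` are hypothetical.

```python
import numpy as np

def log_likelihood_ratio(xi, gamma):
    """Per-bin log of lambda_k = exp(gamma*xi/(1+xi)) / (1+xi) (Gaussian model)."""
    return gamma * xi / (1.0 + xi) - np.log1p(xi)

def vad_decision(Y_mag, noise_var, delta=0.15):
    """Return (speech_present, mean_log_ratio) for one frame.

    Y_mag     : magnitude spectrum of the noisy frame
    noise_var : noise variance per bin (assumed known here)
    delta     : decision threshold (arbitrary illustrative value)
    """
    gamma = Y_mag**2 / noise_var            # a posteriori SNR
    xi = np.maximum(gamma - 1.0, 0.0)       # simple a priori SNR estimate (assumption)
    mean_llr = float(np.mean(log_likelihood_ratio(xi, gamma)))
    return mean_llr > delta, mean_llr

# A noise-only frame (gamma = 1 in every bin) and a frame 10 dB above the noise.
noise_var = np.full(81, 2.0)
quiet, llr_quiet = vad_decision(np.sqrt(noise_var), noise_var)
loud, llr_loud = vad_decision(np.sqrt(10.0 * noise_var), noise_var)
```

With this a priori SNR estimate the per-bin log ratio reduces to $\gamma_k - \log\gamma_k - 1$, which is zero for a noise-only bin and grows with the a posteriori SNR.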

2. Noise estimator: An estimate of $\sigma_v^2 = P_{vv}(\omega)$ is fundamental for noise suppression algorithms. The basic approach makes use of a VAD to identify silent portions (with only noise) and may use an averaging or recursive scheme to update the estimate. However, this approach does not work for nonstationary noises, where continuous estimation is needed. The assumptions behind continuous noise estimation schemes require using segments long enough to include speech pauses and low-energy portions, but short enough so that the noise is still more stationary than the speech. These competing assumptions yield a tradeoff between stationarity and temporal resolution as a function of the segment size. Most techniques use short overlapping windows of […] ms along with longer non-overlapping segments of […] s. Interestingly, preliminary tests by [10] do not show much improvement in objective measures (only tested with some of these estimates) with respect to a simple VAD-based scheme. Overall complexity: $O(m \log m)$.

(a) Spectral minimum tracking: The power of the noisy speech decays to the power of the noise. Thus, tracking the minimum levels for each frequency band yields a (biased) estimate of the noise. A simple single-frame smoothing is applied to the periodogram of the noisy observation to enhance the estimates, such as $P_{yy}(\omega,k) = \alpha P_{yy}(\omega,k-1) + (1 - \alpha)|Y_k(\omega)|^2$. A bias correction (increasing the noise floor for each band) can be considered by assuming a Gaussian distribution and observing the variance of the noisy speech. A modified version makes use of a single-sample recursive method (instead of a long segment) and a nonlinear smoothing scheme. It uses the same initial smoothing with $\alpha$, but considers $P_{min}(\omega,k) = \gamma P_{min}(\omega,k-1) + \frac{1 - \gamma}{1 - \beta}\{P_{yy}(\omega,k) - \beta P_{yy}(\omega,k-1)\}$ when $P_{yy}(\omega,k) > P_{min}(\omega,k-1)$, and $P_{min}(\omega,k) = P_{yy}(\omega,k)$ otherwise. Typical values for the smoothing parameters are $\alpha = 0.7$, $\beta = 0.96$, and $\gamma$ = […]. This latter algorithm yields a good performance in an MSE sense when estimating the true background noise [10].
(b) Histogram-based: The most frequent level for each band (within a frame) corresponds to the noise level in that band. It is obtained from the histogram of a smoothed noisy observation $P_{yy}(\omega)$, and its noise estimate is smoothed using $\hat{\sigma}_v^2(\omega,k) = \alpha_m\hat{\sigma}_v^2(\omega,k-1) + (1 - \alpha_m)h_{max}(\omega,k)$, where $h_{max}$ is the peak of the histogram distribution for the frequency bin $\omega$ during the $k$-th frame, and $\alpha_m$ is a smoothing constant. This algorithm yields a consistently good performance in an MSE sense when estimating the true background noise [10].

(c) Time-recursive, SNR dependent: The noise spectrum can be estimated for each frequency with good precision when the a posteriori SNR ($\gamma_k$) is low. That means that each frequency band can be updated as a function of the SNR. This leads to a recursive structure given by $\hat{\sigma}_v^2(\omega,k) = \alpha(\omega,k)\hat{\sigma}_v^2(\omega,k-1) + (1 - \alpha(\omega,k))|Y_k(\omega)|^2$ (**). All the subsequent algorithms in this section (d-f) have the same type of recursion, but propose different methods to compute $\alpha(\omega,k)$. In the SNR dependent scheme, $\alpha(\omega,k) = 1 - \min(1/\gamma_k^p, 1)$, among other options.

(d) Time-recursive, weighted spectral averaging: Based on the same principles as the SNR dependent case, but uses a hard threshold $\beta$ on $\gamma_k$ to define whether the estimate needs to be updated with a fixed $\alpha$ (using the same recursion as in the SNR case). Otherwise, $\hat{\sigma}_v^2(\omega,k) = \hat{\sigma}_v^2(\omega,k-1)$. Typical values are $\beta = 2.5$ and $\alpha = 0.9$. Updates on this technique represent $\beta$ as a function of the variance of the noisy observation. This algorithm yields one of the best performances in an MSE sense when estimating the true background noise [10].

(e) Signal presence uncertainty, likelihood ratio: Uses the same principles and recursion as the time-recursive scheme. However, the estimation problem can be regarded as updating individual frequency bands of the noise estimate whenever the probability of speech being present is low. Thus, it can be shown that $\alpha = P(H_1 \mid Y(\omega,k))$. Assuming a Gaussian distribution and using the likelihood ratio method, this probability is given by $\alpha = r\lambda/(1 + r\lambda)$, where $\lambda = P(Y(\omega,k) \mid H_1)/P(Y(\omega,k) \mid H_0)$ is the likelihood ratio, which can be computed, for instance, as $\log(\lambda_G) = \frac{1}{L}\sum_{k=0}^{L-1}(\gamma_k - \log(\gamma_k) - 1)$. This value of $\alpha$ can then be used in (**) as described in (c).

(f) Minima-controlled recursive averaging (MCRA): As in the previous case, this method is based on time-recursive averaging controlled by the signal presence probability. However, it combines it with minimum tracking in the following fashion: the minimum of a smoothed version of $P_{yy}(\omega)$ is used to obtain a normalized periodogram of the noisy observation, $P_{norm}(\omega) = P_{yy}(\omega)/P_{min}(\omega)$. A threshold is applied to this normalized periodogram to obtain an estimate $\hat{p} = P(H_1 \mid Y_k(\omega))$, a probability that is then smoothed to obtain the final parameter $\alpha(\omega,k)$ to be used in (**), as described in (c). Modifications have been proposed to the way the minimum is tracked and the probability $\hat{p}$ is computed. However, multiple smoothing parameters are still needed. Note that this method is available as a function in the Intel IPP (Integrated Performance Primitives) library.

3. A priori SNR estimation: The a priori SNR ($\xi_k = \mathrm{var}(X_k)/\sigma_v^2$) is more difficult to estimate than its counterpart, the a posteriori SNR ($\gamma_k = Y_k^2/\sigma_v^2$). For the latter, a noise estimation scheme can be used along with the periodogram of the noisy observation for each frequency. Overall complexity: $O(m \log m)$.

(a) Spectral subtraction, optimal ML: From power spectral subtraction, $\hat{X}_k^2 = Y_k^2 - \sigma_v^2$, and dividing by $\sigma_v^2$, then $\hat{\xi}_k = \gamma_k - 1$.
This is also the optimal ML estimate assuming Gaussian distributions. In practice, use $\hat{\xi}_k = \max(\bar{\gamma}_k - 1, 0)$, where $\bar{\gamma}_k$ is a smoothed version of $\gamma_k$ obtained using a one-frame running average.

(b) Decision-directed approach: Originally proposed for the MMSE-SA estimator [5]. It makes use of the estimate $\hat{X}_k$ provided by any desired algorithm. Since $\xi_k = E[X_k^2]/\sigma_v^2 = E[\gamma_k] - 1$, the two expressions can be combined into a smoothed estimate such as $\hat{\xi}_k = a\hat{X}_{k-1}^2/\sigma_{v,k-1}^2 + (1 - a)\max(\gamma_k - 1, 0)$. A common value for the smoothing constant is $a$ = […]. Improvements of this scheme considered limiting $\hat{\xi}_k$ to $\hat{\xi}_{min} = -15$ dB, and making the smoothing constant $a = a(\omega,k)$ time and frequency dependent.

References

[1] J. Chen, J. Benesty, Y. Huang, and E. J. Diethorn. Handbook of Speech Processing, chapter Fundamentals of Noise Reduction, pages […]. Springer-Verlag, Berlin Heidelberg, 1st edition, […].

[2] I. Cohen and S. Gannot. Handbook of Speech Processing, chapter Spectral Enhancement Methods, pages […]. Springer-Verlag, Berlin Heidelberg, 1st edition, […].

[3] Michael E. Deisher and Andreas S. Spanias. Speech enhancement using state-based estimation and sinusoidal modeling. J. Acoust. Soc. Am., 102(2):[…], […].

[4] J. Droppo and A. Acero. Handbook of Speech Processing, chapter Environmental Robustness, pages […]. Springer-Verlag, Berlin Heidelberg, 1st edition, […].

[5] Y. Ephraim and D. Malah. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Sig. Process., 32(6):[…], […].

[6] Y. Ephraim and D. Malah. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Sig. Process., 33(2):[…], […].

[7] European Telecommunications Standards Institute. Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Advanced front-end feature extraction algorithm; Compression algorithms. ETSI ES […] V1.1.5 ([…]). European Telecommunications Standards Institute, Sophia Antipolis, France, […].

[8] S. Gannot. Speech Enhancement, chapter Speech Enhancement Application of the Kalman Filter in the Estimate-Maximize Framework, pages […]. Springer-Verlag, Berlin Heidelberg, 1st edition, […].

[9] S. Gannot and A. Yeredor. Handbook of Speech Processing, chapter The Kalman Filter, pages […]. Springer-Verlag, Berlin Heidelberg, 1st edition, […].

[10] Philipos C. Loizou. Speech enhancement: theory and practice. CRC Press, Boca Raton, FL, 1st edition, […].

[11] T. F. Quatieri. Discrete-time speech signal processing: Principles and practice. Prentice-Hall signal processing series. Prentice Hall, Upper Saddle River, NJ, […].

[12] L. R. Rabiner and R. W. Schafer. Theory and applications of digital speech processing. Prentice Hall, Upper Saddle River, NJ, […].
