RECTIFIED LINEAR UNIT CAN ASSIST GRIFFIN LIM PHASE RECOVERY. Kohei Yatabe, Yoshiki Masuyama and Yasuhiro Oikawa


Department of Intermedia Art and Science, Waseda University, Tokyo, Japan

ABSTRACT

Phase recovery is an essential process for reconstructing a time-domain signal from the corresponding spectrogram when its phase is contaminated or unavailable. Recently, a phase recovery method using a deep neural network (DNN) was proposed, which interested us because the inverse short-time Fourier transform (inverse STFT) was utilized within the network. This inverse STFT converts a spectrogram into its time-domain counterpart, and then an activation function, the leaky rectified linear unit (ReLU), is applied. Such a nonlinear operation in the time domain resembles the speech enhancement method called harmonic regeneration noise reduction (HRNR). In HRNR, a time-domain nonlinearity, typically ReLU, is applied to assist in enhancing the higher-order harmonics. From this point of view, one question arose in our minds: can a time-domain ReLU alone assist phase recovery? Inspired by this curious connection between the recent DNN-based phase recovery method and HRNR in speech enhancement, the ReLU-assisted Griffin-Lim algorithm is proposed in this paper to investigate the above question. Through an experiment on speech denoising with the oracle Wiener filter, some positive effect of the time-domain nonlinearity is confirmed in terms of the scores of the short-time objective intelligibility (STOI) measure.

Index Terms: Spectrogram, redundancy, consistency, time-domain nonlinearity, harmonic regeneration.

1. INTRODUCTION

An important recent trend in signal processing and speech enhancement is phase recovery of audio signals.
Many popular acoustical processing methods are formulated in the time-frequency domain, obtained through the short-time Fourier transform (STFT), where the processing is usually implemented as a procedure that modifies the amplitude at each time-frequency bin. Although spectrograms are parametrized by both amplitude and phase, as they are expressed as collections of complex numbers, phase had been ignored for several decades until pioneering works demonstrated its importance. Recently, so-called phase-aware signal processing has gained considerable attention in the community, and a number of methodologies have been proposed [1-3]. This paper focuses on its branch called phase recovery, which aims to obtain a better phase spectrogram under a given amplitude (together with noisy phase in some applications such as speech denoising).

As usual in signal processing, phase recovery methods can be categorized by the amount of imposed prior knowledge. One of the most general algorithms is the Griffin-Lim algorithm [4-6], which retrieves the phase based only on the redundancy of the time-frequency representation. In the algorithm, the phase is modified only by the linear transformations between the time and time-frequency domains (STFT and its inverse), and no assumption is made on the structure of the data. Therefore, even though the Griffin-Lim algorithm might not achieve good performance due to the insufficiency of prior knowledge, it is utilized in a wide variety of applications. On the other hand, there are several phase recovery methods based on the structure of the data. For example, the harmonic structure of speech signals has been considered in model-based phase recovery [7-10], which can obtain a better result at the price of narrowing the range of applications. Very recently, a phase recovery method based on a deep neural network (DNN) was proposed [11], following the extraordinary successes of DNNs in the last decade.
Although it might not seem to make assumptions on the data, a DNN heavily relies on extremely rich prior knowledge, automatically learned from the training dataset, when it is applied as a signal processor. One thing about the DNN-based phase recovery in [11] that interested the authors is the use of the inverse STFT to obtain a time-domain signal within the network. As a DNN is a composition of affine and nonlinear functions [12], the time-domain signal obtained by the inverse-STFT layer was fed into nonlinear functions [11]. Such nonlinearity in the time domain reminds us of a speech enhancement method called harmonic regeneration noise reduction (HRNR) [13-15], which utilizes a time-domain nonlinear function, together with the inverse STFT, to recover the harmonic structure (especially in the high-frequency range) of speech signals. A typical choice of the nonlinear function in HRNR is the half-wave rectifier [13-15], which is equivalent to the quite popular activation function called the rectified linear unit (ReLU) in the DNN literature [16, 17]. Indeed, the DNN-based phase recovery method in [11] utilized a variant of ReLU in the time domain, namely the leaky ReLU.

This curious connection between DNN-based phase recovery and HRNR suggested one possibility: a time-domain nonlinearity may by itself contribute to phase recovery, without a network. To investigate this conjecture, a combination of a time-domain nonlinearity and a phase recovery algorithm is proposed, and its performance for speech enhancement is experimentally investigated in this paper. The Griffin-Lim algorithm is chosen as the baseline method because it is the standard phase recovery algorithm without any assumption on the structure of the data. ReLU is incorporated within its procedure, after the inverse STFT as in the DNN-based method, in order to artificially generate harmonic components as in HRNR.
This modified Griffin-Lim algorithm with the time-domain ReLU is compared to that without ReLU to see the effect of the time-domain nonlinearity. An experiment on speech denoising using the oracle Wiener filter is conducted with 200 speech signals obtained from the TIMIT database, and its result supports the above conjecture.

2. PHASE RECOVERY OF SPECTROGRAM

In this section, the standard time-frequency domain representation (spectrogram) of speech signals is briefly reviewed. The ordinary Griffin-Lim algorithm is also introduced here so that the proposed modification in the subsequent section becomes apparent.

2.1. Time-frequency representation of audio signals

Let the STFT of a signal x with a window w be defined as

(F_w x)[m, n] = \sum_{l=0}^{L-1} x[l + an] \, \overline{w[l]} \, e^{-2\pi i b m l / L},   (1)

where \overline{z} is the complex conjugate of z, i = \sqrt{-1} is the imaginary unit, L is the window length, n and m are the time and frequency indices, and a and b are the time and frequency shifting steps, respectively. By denoting the inverse STFT by F_{\tilde{w}}^{\dagger}, the reconstruction formula of the STFT can be represented as x = F_{\tilde{w}}^{\dagger} F_w x, where \tilde{w} is a suitable synthesis window associated with w, i.e., the dual window [18-21] of w. For the sake of simplicity, only the Parseval tight case is considered in this paper, i.e., the window is self-dual, \tilde{w} = w (the same window can be used in both analysis and synthesis to reconstruct the signal, x = F_w^{\dagger} F_w x). The spectrogram corresponding to x will be denoted by X[m, n] (= (F_w x)[m, n]) for convenience.

2.2. Speech enhancement based on amplitude restoration

One of the most popular strategies for enhancing audio signals is filtering in the time-frequency domain. By multiplying each bin of the spectrogram X[m, n] by a scalar, the so-called time-frequency mask M[m, n], and taking the inverse STFT, F_w^{\dagger}(M \odot X), a nonstationary filter can be approximately realized, where \odot represents element-wise multiplication. Ordinarily, in acoustical applications, this bin-wise scalar M[m, n] (which may also be called a spectral gain or Gabor multiplier) is treated as a nonnegative real number; that is, only the amplitude of the spectrogram is modified. This practice stems from multiple reasons, including optimality in the sense of minimum mean square error estimation [1]. However, every spectrogram consists of not only amplitude but also phase, which is essential for recovering the time-domain signal.
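As a minimal, self-contained sketch of this oracle time-frequency masking (not the authors' implementation; the test tone, noise level, and scipy-based STFT are illustrative assumptions, though the 32 ms Hann window with 16 ms hop matches the experiment described later):

```python
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 220 * t)              # hypothetical 220 Hz test tone
noisy = clean + 0.3 * rng.standard_normal(fs)    # additive Gaussian noise

kw = dict(fs=fs, window="hann", nperseg=512, noverlap=256)  # 32 ms window, 16 ms hop

_, _, X = stft(noisy, **kw)    # F_w x: spectrogram of the noisy signal
_, _, S = stft(clean, **kw)    # oracle condition: the clean spectrogram is known
N = X - S                      # the STFT is linear, so this is the noise spectrogram

# Oracle Wiener-style mask M[m, n] built from the known signal and noise powers
M = np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-12)

# F_w^†(M ⊙ X): inverse STFT of the masked spectrogram. Only the amplitude is
# restored; the noisy phase is reused as-is (the gap that phase recovery targets).
_, enhanced = istft(M * X, **kw)
enhanced = enhanced[: len(noisy)]

err_noisy = np.mean((noisy - clean) ** 2)
err_enhanced = np.mean((enhanced - clean) ** 2)
```

Even with this perfect-amplitude mask, the output carries the observed noisy phase, which motivates the phase recovery discussed next.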
Amplitude-only restoration of a spectrogram results in a contaminated signal, even when the recovered amplitude is perfect, owing to the reconstruction of the time-domain signal by the inverse STFT using the noisy phase.

2.3. Phase recovery by the Griffin-Lim algorithm

Recently, the importance of restoring the phase spectrogram has gained considerable attention through the pioneering studies [1-3], from which the field of phase-aware signal processing and modeling of complex spectrograms has emerged [22-24]. One of the most famous algorithms for obtaining a better phase spectrogram from the corresponding amplitude is the Griffin-Lim algorithm [4]. This algorithm imposes two expectations upon the target spectrogram: the resulting spectrogram should (1) maintain the given amplitude and (2) have the minimum norm among all possible spectrograms corresponding to its time-domain counterpart. The latter condition is often called consistency, and therefore a method based on it is categorized as consistency-based phase recovery [4-6]. The Griffin-Lim algorithm implements the above expectations by alternating projections, with the hope of acquiring a better phase¹:

X^{[k+1]} = P_A(P_C(X^{[k]})),   (2)

¹ Note that, in general, those two expectations cannot be met simultaneously, and therefore the Griffin-Lim algorithm does not ensure optimality in those senses. Indeed, Eq. (2) can be interpreted as a projected gradient algorithm with a relaxed consistency criterion, and thus consistency is not guaranteed to be satisfied. In this paper, such details of the algorithm are omitted because the objective of the paper is to demonstrate the possibility of considering the time-domain nonlinearity in phase recovery, not to propose a new algorithm. For some examples of algorithmic investigation, see [10, 25].
where P_S is the metric projection onto a set S [26],

P_S(X) = \arg\min_{Y \in S} \|X - Y\|,   (3)

\|\cdot\| is the Euclidean norm, k is the iteration index, A is the set of spectrograms X whose amplitudes are equal to a given nonnegative value a[m, n] \ge 0, i.e., |X[m, n]| = a[m, n], and C is the set of consistent spectrograms, X = F_w F_w^{\dagger} X (the set of fixed points of F_w F_w^{\dagger}). The projections onto the sets C and A are given by

P_C(X) = F_w F_w^{\dagger} X,   (4)
P_A(X) = a \odot X \oslash |X|,   (5)

where |\cdot|, \odot and \oslash represent element-wise absolute value, multiplication and division, respectively, and the result of the division is replaced by zero when X[m, n] = 0.

While the Griffin-Lim algorithm has been successfully applied in a number of applications, its poor adaptability to specific situations might have restricted its practical performance. As the algorithm pays attention to consistency only, and no application-specific structure is considered in the projections, it is presumed that incorporating some data-specific structure can contribute to improving the quality of the estimated phase. In the next section, the time-domain ReLU is introduced to take the harmonic structure of audio signals into account within the Griffin-Lim algorithm.

3. GRIFFIN-LIM ALGORITHM ASSISTED BY RECTIFIED LINEAR UNIT

Some audio signals, including speech, have a specific harmonic structure. In this section, a combination of the Griffin-Lim algorithm and the time-domain ReLU is proposed with the hope of capturing a structure similar to that of speech signals.

3.1. Nonlinear harmonic regeneration

Spectrograms of speech and audio signals are often comprised of harmonic components whose frequencies are integer multiples of the fundamental frequency. This well-known structure, the harmonic structure, has been utilized in many signal processing methods, especially in speech enhancement.
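As a minimal numerical illustration of how a time-domain nonlinearity interacts with this harmonic structure (a hypothetical 100 Hz tone, not an experiment from the paper), half-wave rectifying a pure tone creates new spectral components at its harmonics:

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs                   # 1 s of signal, so DFT bins are 1 Hz apart
tone = np.sin(2 * np.pi * 100 * t)       # pure 100 Hz tone: a single spectral line
rect = np.maximum(tone, 0.0)             # half-wave rectification (ReLU)

mag_tone = np.abs(np.fft.rfft(tone)) / fs
mag_rect = np.abs(np.fft.rfft(rect)) / fs
# The pure tone has essentially no energy at 200 Hz or 400 Hz, while the
# rectified tone gains components at the even harmonics (and at DC).
```

This matches the Fourier series of a half-wave rectified sine, which contains a DC term, the fundamental, and even harmonics with amplitudes decaying as 1/(n² − 1); the phases of these new components are tied to the fundamental, which is the alignment effect exploited below.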
One notable use of this structure is the harmonic regeneration technique [13-15], which artificially generates harmonics from an enhanced speech signal to obtain a better estimate of the a priori signal-to-noise ratio (SNR). To generate the harmonics artificially, a nonlinear function is applied in the time domain. The half-wave rectifier, namely ReLU, is a typical choice for such a time-domain nonlinearity [13-15]:

ReLU(x) = \max\{x, 0\},   (6)

where the maximum operator is evaluated element-wise. By clipping the negative components, harmonics are generated as illustrated in Fig. 1, where the horizontal green and light blue bands in the middle row represent the magnitudes of the generated harmonics. The important observation is that the phase spectrogram of the rectified sinusoid (bottom row of Fig. 1) has a certain structured pattern, corresponding to the generated harmonics, which does not exist in the original signal. That is, the phases of the harmonics can be aligned with the fundamental-frequency component through the time-domain nonlinear operation. Although the relationship between the phase of natural audio signals and this artificially generated pattern is not entirely clear, it might be possible to improve phase recovery because the fundamental-frequency component is often the largest component, which should contain better information for the recovery. The following question is then raised naturally: can an element-wise nonlinear operation in the time domain help a phase recovery algorithm to improve its performance? To investigate this question experimentally, a combination of the Griffin-Lim algorithm and ReLU is proposed.

Fig. 1. Illustration of the amplitude/phase spectrograms corresponding to a sinusoid and its rectified counterpart (from top to bottom: time-domain signal, amplitude spectrogram, and phase spectrogram).

3.2. Proposed ReLU-assisted Griffin-Lim algorithm

Here, the time-domain ReLU is incorporated into the procedure of the Griffin-Lim algorithm. To do so, the following projection onto the set related to time-domain rectified signals is introduced:

P_N(X) = F_w \, \mathrm{ReLU}(F_w^{\dagger} X),   (7)

where N is the set of consistent spectrograms whose time-domain counterparts are nonnegative. By replacing the projection corresponding to consistency, P_C in Eq. (2), with this ReLU-combined variation P_N, the ReLU-assisted Griffin-Lim algorithm (ReLU-GLA) is proposed as the following procedure:

X^{[k+1]} = P_A(P_N(X^{[k]})),   (8)

where the only difference from the original algorithm in Eq. (2) is the additional ReLU in Eq. (7), which does not exist in Eq. (4). This slight modification, which does not increase the computational complexity thanks to the extremely cheap evaluation of ReLU, can contribute to the quality of the recovered phase to some extent, as shown in the next section. Note that the additional nonlinear distortion imposed by this operation does not remain in each intermediate result X^{[k+1]} because the projection onto the given amplitude spectrogram, P_A, completely removes such distortion by replacing the amplitude with the predetermined values. That is, the generated harmonics contribute only to the phase, and therefore it is safe to choose any nonlinear operation in the time domain.
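The two iterations, Eq. (2) and Eq. (8), can be sketched side by side in a self-contained way. This is not the authors' implementation: the hand-rolled tight-window STFT (self-dual sqrt-Hann with 50% overlap), the synthetic nonnegative target, and the window length, hop, and iteration count are all illustrative assumptions.

```python
import numpy as np

L, hop = 256, 128
# sqrt-Hann with 50% overlap is self-dual (Parseval tight): the squared
# windows overlap-add to a constant, so analysis and synthesis share one window
w = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * np.arange(L) / L))

def stft_(x):
    n = (len(x) - L) // hop + 1
    frames = np.stack([x[i * hop : i * hop + L] * w for i in range(n)])
    return np.fft.rfft(frames, axis=1).T           # shape (frequency, time)

def istft_(X, length):
    frames = np.fft.irfft(X.T, n=L, axis=1) * w    # synthesis with the same window
    x = np.zeros(length)
    for i, fr in enumerate(frames):
        x[i * hop : i * hop + L] += fr             # overlap-add
    return x

def gla(amp, length, use_relu, iters=50, seed=0):
    # Eq. (2): X <- P_A(P_C(X));  Eq. (8): X <- P_A(P_N(X))
    rng = np.random.default_rng(seed)
    X = amp * np.exp(2j * np.pi * rng.random(amp.shape))   # random initial phase
    for _ in range(iters):
        x = istft_(X, length)
        if use_relu:
            x = np.maximum(x, 0.0)                 # time-domain ReLU, Eq. (7)
        Y = stft_(x)                               # back to the T-F domain
        X = amp * Y / np.maximum(np.abs(Y), 1e-12) # P_A, Eq. (5): keep phase only
    return X

# Target: a nonnegative (rectified) tone, so both variants have a reachable goal
sig_len = 4096
target = np.maximum(np.sin(2 * np.pi * np.arange(sig_len) * 16 / L), 0.0)
amp = np.abs(stft_(target))                        # the given amplitude a[m, n]

def mismatch(X):
    # how far |STFT(inverse STFT(X))| is from the prescribed amplitude
    Y = stft_(istft_(X, sig_len))
    return np.linalg.norm(np.abs(Y) - amp) / np.linalg.norm(amp)

X0 = amp * np.exp(2j * np.pi * np.random.default_rng(0).random(amp.shape))
err0 = mismatch(X0)
err_gla = mismatch(gla(amp, sig_len, use_relu=False))
err_relu = mismatch(gla(amp, sig_len, use_relu=True))
```

Note that, as stated above, the ReLU never distorts the amplitude of the returned iterate: P_A restores the prescribed amplitude at every step, so the rectification influences only the phase.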
Here, ReLU was chosen merely as a representative example, and any other nonlinearity can be incorporated in exactly the same manner.

4. EXPERIMENT

In order to investigate the question raised in Section 3.1, a numerical experiment was performed. Test signals consisting of 100 male and 100 female speech signals, obtained from the TIMIT database [27], were corrupted by additive Gaussian noise whose amplitude was adjusted so that the SNRs of the simulated signals became 5, 10, 15 or 20 dB. The Gaussian noise was generated 10 times for each speech signal; thus, in total, 2000 noisy signals were utilized for each SNR. These noisy signals were enhanced by Wiener filters constructed in the oracle condition (both signal and noise power at each time-frequency bin were known), where the STFT was calculated with the canonical tight variant of the 32 ms Hann window shifted by 16 ms. The iterations of the Griffin-Lim and the proposed algorithms were started from the observed noisy phase, and the projection P_A enforces the amplitude spectrogram to be the Wiener-filtered one. For the evaluation, scores of the short-time objective intelligibility (STOI) measure [28] were calculated as a perceptual measure of the enhanced speech signals.²

Fig. 2. Average scores of STOI for each iteration. The blue lines indicate the scores of the ordinary Griffin-Lim algorithm, while the red lines are those of the proposed ReLU-GLA. The SNRs of the input signals corresponding to each line are written within the figure.

The experimental results, summarized by the average STOI scores of the 2000 noisy speech signals for each SNR, are illustrated in Fig. 2, where the blue lines indicate the conventional GLA and the red ones correspond to the proposed ReLU-GLA. For all four SNRs, the proposed ReLU-GLA attained better scores at the first iteration, and the difference between the scores of the two methods seems to decrease as the iteration number increases.
Although this result indicates that incorporating the time-domain ReLU into the phase recovery algorithm has some positive effect, the effect on each individual speech signal cannot be confirmed from this figure because each line represents the average over the 2000 trials. Therefore, the results are further illustrated by histograms to contrast the individual effects. Since the difference between the scores diminished for large iteration numbers, the results of the first and the 10th iteration are utilized to construct the histograms.

² STOI was chosen in this paper because the performance of phase recovery cannot be measured by a quantity sensitive to a constant phase difference, such as SNR. Other popular measures, including PESQ, were not calculated because, unfortunately, the first author had only extremely limited time before the deadlines of the initial and revised submissions.

Fig. 3. Histograms of the difference of STOI improvement. The STOI of the ordinary Griffin-Lim algorithm was subtracted from that of the proposed ReLU-assisted algorithm. Both algorithms were iterated once from the initial values, i.e., these results were obtained by the single-shot projections P_A(P_C(X_Wiener)) and P_A(P_N(X_Wiener)). The vertical red lines represent the position of 0, and therefore the bars on the right side of these red lines indicate the results where the proposed method was better than the conventional one.

Fig. 4. Histograms of the difference of STOI improvement. The algorithms were iterated 10 times from the initial values.

If the above discussion is correct, then one can consider a more sophisticated nonlinearity which shapes the waveform closer to the target signals, possibly by learning from a dataset, to obtain a better phase recovery method. The reason for the diminishing effect of the nonlinearity should be that the Griffin-Lim algorithm does not consider the observed phase within its procedure. As in Eq. (2), and also in Eq. (8), the phase spectrogram is modified without considering the observed phase. That is, the phase is close to the observed one only in the first few iterations, where the effect of the initial value remains, and the resulting phase after a number of iterations is not directly related to the observation. A phase recovery method that considers data fidelity to the phase, unlike the Griffin-Lim algorithm, might receive more benefit from the time-domain ReLU, or any other nonlinearity; seeking such an algorithm, together with an effective time-domain nonlinear function for harmonic regeneration, should be the next direction of this research.

The histograms of the difference of STOI scores are shown in Figs. 3 and 4.
For each signal, the score of the conventional GLA was subtracted from that of the proposed algorithm to clarify the difference between them. Therefore, the center of the horizontal axis (represented by the vertical red line) means that the STOI improvements achieved by both algorithms were the same. A positive value on the horizontal axis (right side of the red line) indicates that the proposed ReLU-GLA was better than the conventional one, and a negative value indicates the opposite situation. From Fig. 3, it can be confirmed that, at the first iteration, the proposed algorithm improved most of the test samples more than the conventional GLA did. That is, the single-shot projection P_A(P_N(X_Wiener)) improved STOI more than the conventional projection P_A(P_C(X_Wiener)), where X_Wiener represents the noisy spectrogram whose amplitude was enhanced by the Wiener filter. This result is important because projecting the Wiener-filtered data once may improve the intelligibility without the cost of iteration. Indeed, the STOI scores of all 8000 samples (2000 per SNR) were improved from those of the initial values X_Wiener with the observed phase. Although the effect of the time-domain nonlinearity diminished after some iterations, its positive effect can also be seen at the 10th iteration, as shown in Fig. 4. These results indicate that the time-domain ReLU can assist the Griffin-Lim algorithm in terms of STOI at the beginning of the iteration, and its effect remains for some iterations.

The reason for this positive effect of ReLU might be the pulse-train-like waveform of rectified signals, as in Fig. 1. As considered in the source-filter model, speech signals consist of a sequence of pulses. An appropriate phase for a speech signal should then recover such sequential pulses, while an inappropriate one may not correspond to pulses. The time-domain ReLU might have the power to align the phases of the harmonics so that the waveform in the time domain becomes more pulse-like, as in Fig. 1.
5. CONCLUSIONS

In this paper, inspired by DNN-based phase recovery and the harmonic regeneration technique for speech enhancement, a variant of the well-known Griffin-Lim algorithm combined with a time-domain ReLU was proposed. The effectiveness of the time-domain nonlinearity for speech denoising in terms of STOI was experimentally confirmed. The experimental results shed light on the possibility of utilizing such a time-domain nonlinear function within a signal reconstruction process (or, in DNN terms, utilizing an inverse-STFT layer within the network). Both ReLU and the Griffin-Lim algorithm are just examples of the possibilities, and searching for a better combination, as well as a DNN model containing a time-domain representation within the network, remains as future work.

6. ACKNOWLEDGMENT

The first author would like to thank Dr. Ryoichi Miyazaki for his support on prior works and helpful comments on the time-domain nonlinear operation in speech enhancement.

7. REFERENCES

[1] P. Mowlaee, J. Kulmer, J. Stahl, and F. Mayer, Single Channel Phase-Aware Signal Processing in Speech Communication: Theory and Practice, Wiley.
[2] T. Gerkmann, M. Krawczyk-Becker, and J. Le Roux, "Phase processing for single-channel speech enhancement: History and recent advances," IEEE Signal Process. Mag., vol. 32, no. 2, Mar. 2015.
[3] P. Mowlaee, R. Saeidi, and Y. Stylianou, "Advances in phase-aware signal processing in speech communication," Speech Commun., vol. 81, pp. 1-29, 2016.
[4] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, Apr. 1984.
[5] J. Le Roux, H. Kameoka, N. Ono, and S. Sagayama, "Fast signal reconstruction from magnitude STFT spectrogram based on spectrogram consistency," in Int. Conf. Digital Audio Effects (DAFx-10), Sep. 2010.
[6] N. Perraudin, P. Balazs, and P. L. Søndergaard, "A fast Griffin-Lim algorithm," in IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), Oct. 2013.
[7] M. Krawczyk and T. Gerkmann, "STFT phase reconstruction in voiced speech for an improved single-channel speech enhancement," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, Dec. 2014.
[8] Y. Wakabayashi, T. Fukumori, M. Nakayama, T. Nishiura, and Y. Yamashita, "Single-channel speech enhancement with phase reconstruction based on phase distortion averaging," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 9, Sep. 2018.
[9] P. Magron, R. Badeau, and B. David, "Model-based STFT phase recovery for audio source separation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 6, June 2018.
[10] Y. Masuyama, K. Yatabe, and Y. Oikawa, "Model-based phase recovery of spectrograms via optimization on Riemannian manifolds," in Int. Workshop Acoust. Signal Enhance. (IWAENC), Sep. 2018.
[11] K. Oyamada, H. Kameoka, T. Kaneko, K. Tanaka, N. Hojo, and H. Ando, "Generative adversarial network-based approach to signal reconstruction from magnitude spectrograms," arXiv preprint, Sep.
[12] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
[13] C. Plapous, C. Marro, and P. Scalart, "Improved signal-to-noise ratio estimation for speech enhancement," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 6, Nov. 2006.
[14] M. Une and R. Miyazaki, "Evaluation of sound quality and speech recognition performance using harmonic regeneration for various noise reduction techniques," in RISP Int. Workshop Nonlinear Circuits, Commun. Signal Process. (NCSP), Mar. 2017.
[15] M. Une and R. Miyazaki, "Musical-noise-free speech enhancement with low speech distortion by biased harmonic regeneration technique," in Int. Workshop Acoust. Signal Enhance. (IWAENC), Sep. 2018.
[16] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proc. Fourteenth Int. Conf. Artif. Intell. Stat., Apr. 2011, vol. 15.
[17] S. Sonoda and N. Murata, "Neural network with unbounded activation functions is universal approximator," Appl. Comput. Harmon. Anal., vol. 43, no. 2, 2017.
[18] H. G. Feichtinger and T. Strohmer, Eds., Gabor Analysis and Algorithms: Theory and Applications, Birkhäuser Boston, Boston, MA.
[19] K. Gröchenig, Foundations of Time-Frequency Analysis, Birkhäuser Boston, Boston, MA, 2001.
[20] P. L. Søndergaard, "Gabor frames by sampling and periodization," Adv. Comput. Math., vol. 27, no. 4, 2007.
[21] O. Christensen, Frames and Bases: An Introductory Course, Birkhäuser.
[22] K. Yatabe and Y. Oikawa, "Phase corrected total variation for audio signals," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2018.
[23] K. Yatabe and D. Kitamura, "Determined blind source separation via proximal splitting algorithm," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2018.
[24] Y. Masuyama, K. Yatabe, and Y. Oikawa, "Low-rankness of complex-valued spectrogram and its application to phase-aware audio processing," (submitted).
[25] Y. Masuyama, K. Yatabe, and Y. Oikawa, "Griffin-Lim like phase recovery via alternating direction method of multipliers," (submitted).
[26] A. Cegielski, Iterative Methods for Fixed Point Problems in Hilbert Spaces, Springer, 2012.
[27] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CDROM, NIST, 1993.
[28] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, Sep. 2011.


A NEURAL NETWORK ALTERNATIVE TO NON-NEGATIVE AUDIO MODELS. University of Illinois at Urbana-Champaign Adobe Research A NEURAL NETWORK ALTERNATIVE TO NON-NEGATIVE AUDIO MODELS Paris Smaragdis, Shrikant Venkataramani University of Illinois at Urbana-Champaign Adobe Research ABSTRACT We present a neural network that can

More information

On Spectral Basis Selection for Single Channel Polyphonic Music Separation

On Spectral Basis Selection for Single Channel Polyphonic Music Separation On Spectral Basis Selection for Single Channel Polyphonic Music Separation Minje Kim and Seungjin Choi Department of Computer Science Pohang University of Science and Technology San 31 Hyoja-dong, Nam-gu

More information

Consistent Wiener Filtering for Audio Source Separation

Consistent Wiener Filtering for Audio Source Separation MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Consistent Wiener Filtering for Audio Source Separation Le Roux, J.; Vincent, E. TR2012-090 October 2012 Abstract Wiener filtering is one of

More information

A SUBSPACE METHOD FOR SPEECH ENHANCEMENT IN THE MODULATION DOMAIN. Yu Wang and Mike Brookes

A SUBSPACE METHOD FOR SPEECH ENHANCEMENT IN THE MODULATION DOMAIN. Yu Wang and Mike Brookes A SUBSPACE METHOD FOR SPEECH ENHANCEMENT IN THE MODULATION DOMAIN Yu ang and Mike Brookes Department of Electrical and Electronic Engineering, Exhibition Road, Imperial College London, UK Email: {yw09,

More information

arxiv: v1 [stat.ml] 31 Oct 2016

arxiv: v1 [stat.ml] 31 Oct 2016 Full-Capacity Unitary Recurrent Neural Networks arxiv:1611.00035v1 [stat.ml] 31 Oct 2016 Scott Wisdom 1, Thomas Powers 1, John R. Hershey 2, Jonathan Le Roux 2, and Les Atlas 1 1 Department of Electrical

More information

A SPEECH PRESENCE PROBABILITY ESTIMATOR BASED ON FIXED PRIORS AND A HEAVY-TAILED SPEECH MODEL

A SPEECH PRESENCE PROBABILITY ESTIMATOR BASED ON FIXED PRIORS AND A HEAVY-TAILED SPEECH MODEL A SPEECH PRESENCE PROBABILITY ESTIMATOR BASED ON FIXED PRIORS AND A HEAVY-TAILED SPEECH MODEL Balázs Fodor Institute for Communications Technology Technische Universität Braunschweig 386 Braunschweig,

More information

Joint Filtering and Factorization for Recovering Latent Structure from Noisy Speech Data

Joint Filtering and Factorization for Recovering Latent Structure from Noisy Speech Data Joint Filtering and Factorization for Recovering Latent Structure from Noisy Speech Data Colin Vaz, Vikram Ramanarayanan, and Shrikanth Narayanan Ming Hsieh Department of Electrical Engineering University

More information

Nonnegative Matrix Factorization with Markov-Chained Bases for Modeling Time-Varying Patterns in Music Spectrograms

Nonnegative Matrix Factorization with Markov-Chained Bases for Modeling Time-Varying Patterns in Music Spectrograms Nonnegative Matrix Factorization with Markov-Chained Bases for Modeling Time-Varying Patterns in Music Spectrograms Masahiro Nakano 1, Jonathan Le Roux 2, Hirokazu Kameoka 2,YuKitano 1, Nobutaka Ono 1,

More information

Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement

Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement Patrick J. Wolfe Department of Engineering University of Cambridge Cambridge CB2 1PZ, UK pjw47@eng.cam.ac.uk Simon J. Godsill

More information

Non-Stationary Noise Power Spectral Density Estimation Based on Regional Statistics

Non-Stationary Noise Power Spectral Density Estimation Based on Regional Statistics Non-Stationary Noise Power Spectral Density Estimation Based on Regional Statistics Xiaofei Li, Laurent Girin, Sharon Gannot, Radu Horaud To cite this version: Xiaofei Li, Laurent Girin, Sharon Gannot,

More information

Optimal Speech Enhancement Under Signal Presence Uncertainty Using Log-Spectral Amplitude Estimator

Optimal Speech Enhancement Under Signal Presence Uncertainty Using Log-Spectral Amplitude Estimator 1 Optimal Speech Enhancement Under Signal Presence Uncertainty Using Log-Spectral Amplitude Estimator Israel Cohen Lamar Signal Processing Ltd. P.O.Box 573, Yokneam Ilit 20692, Israel E-mail: icohen@lamar.co.il

More information

Fast Angular Synchronization for Phase Retrieval via Incomplete Information

Fast Angular Synchronization for Phase Retrieval via Incomplete Information Fast Angular Synchronization for Phase Retrieval via Incomplete Information Aditya Viswanathan a and Mark Iwen b a Department of Mathematics, Michigan State University; b Department of Mathematics & Department

More information

Probabilistic Inference of Speech Signals from Phaseless Spectrograms

Probabilistic Inference of Speech Signals from Phaseless Spectrograms Probabilistic Inference of Speech Signals from Phaseless Spectrograms Kannan Achan, Sam T. Roweis, Brendan J. Frey Machine Learning Group University of Toronto Abstract Many techniques for complex speech

More information

Non-Negative Matrix Factorization And Its Application to Audio. Tuomas Virtanen Tampere University of Technology

Non-Negative Matrix Factorization And Its Application to Audio. Tuomas Virtanen Tampere University of Technology Non-Negative Matrix Factorization And Its Application to Audio Tuomas Virtanen Tampere University of Technology tuomas.virtanen@tut.fi 2 Contents Introduction to audio signals Spectrogram representation

More information

Audible sound field visualization by using Schlieren technique

Audible sound field visualization by using Schlieren technique Audible sound field visualization by using Schlieren technique Nachanant Chitanont, Kohei Yatabe and Yuhiro Oikawa Department of Intermedia Art and Science, Weda University, Tokyo, Japan Paper Number:

More information

Tensor-Train Long Short-Term Memory for Monaural Speech Enhancement

Tensor-Train Long Short-Term Memory for Monaural Speech Enhancement 1 Tensor-Train Long Short-Term Memory for Monaural Speech Enhancement Suman Samui, Indrajit Chakrabarti, and Soumya K. Ghosh, arxiv:1812.10095v1 [cs.sd] 25 Dec 2018 Abstract In recent years, Long Short-Term

More information

Gaussian Processes for Audio Feature Extraction

Gaussian Processes for Audio Feature Extraction Gaussian Processes for Audio Feature Extraction Dr. Richard E. Turner (ret26@cam.ac.uk) Computational and Biological Learning Lab Department of Engineering University of Cambridge Machine hearing pipeline

More information

NMF WITH SPECTRAL AND TEMPORAL CONTINUITY CRITERIA FOR MONAURAL SOUND SOURCE SEPARATION. Julian M. Becker, Christian Sohn and Christian Rohlfing

NMF WITH SPECTRAL AND TEMPORAL CONTINUITY CRITERIA FOR MONAURAL SOUND SOURCE SEPARATION. Julian M. Becker, Christian Sohn and Christian Rohlfing NMF WITH SPECTRAL AND TEMPORAL CONTINUITY CRITERIA FOR MONAURAL SOUND SOURCE SEPARATION Julian M. ecker, Christian Sohn Christian Rohlfing Institut für Nachrichtentechnik RWTH Aachen University D-52056

More information

Estimating Correlation Coefficient Between Two Complex Signals Without Phase Observation

Estimating Correlation Coefficient Between Two Complex Signals Without Phase Observation Estimating Correlation Coefficient Between Two Complex Signals Without Phase Observation Shigeki Miyabe 1B, Notubaka Ono 2, and Shoji Makino 1 1 University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki

More information

Sound Recognition in Mixtures

Sound Recognition in Mixtures Sound Recognition in Mixtures Juhan Nam, Gautham J. Mysore 2, and Paris Smaragdis 2,3 Center for Computer Research in Music and Acoustics, Stanford University, 2 Advanced Technology Labs, Adobe Systems

More information

Single Channel Signal Separation Using MAP-based Subspace Decomposition

Single Channel Signal Separation Using MAP-based Subspace Decomposition Single Channel Signal Separation Using MAP-based Subspace Decomposition Gil-Jin Jang, Te-Won Lee, and Yung-Hwan Oh 1 Spoken Language Laboratory, Department of Computer Science, KAIST 373-1 Gusong-dong,

More information

STRUCTURE-AWARE DICTIONARY LEARNING WITH HARMONIC ATOMS

STRUCTURE-AWARE DICTIONARY LEARNING WITH HARMONIC ATOMS 19th European Signal Processing Conference (EUSIPCO 2011) Barcelona, Spain, August 29 - September 2, 2011 STRUCTURE-AWARE DICTIONARY LEARNING WITH HARMONIC ATOMS Ken O Hanlon and Mark D.Plumbley Queen

More information

Modifying Voice Activity Detection in Low SNR by correction factors

Modifying Voice Activity Detection in Low SNR by correction factors Modifying Voice Activity Detection in Low SNR by correction factors H. Farsi, M. A. Mozaffarian, H.Rahmani Department of Electrical Engineering University of Birjand P.O. Box: +98-9775-376 IRAN hfarsi@birjand.ac.ir

More information

Estimation Error Bounds for Frame Denoising

Estimation Error Bounds for Frame Denoising Estimation Error Bounds for Frame Denoising Alyson K. Fletcher and Kannan Ramchandran {alyson,kannanr}@eecs.berkeley.edu Berkeley Audio-Visual Signal Processing and Communication Systems group Department

More information

A POSTERIORI SPEECH PRESENCE PROBABILITY ESTIMATION BASED ON AVERAGED OBSERVATIONS AND A SUPER-GAUSSIAN SPEECH MODEL

A POSTERIORI SPEECH PRESENCE PROBABILITY ESTIMATION BASED ON AVERAGED OBSERVATIONS AND A SUPER-GAUSSIAN SPEECH MODEL A POSTERIORI SPEECH PRESENCE PROBABILITY ESTIMATION BASED ON AVERAGED OBSERVATIONS AND A SUPER-GAUSSIAN SPEECH MODEL Balázs Fodor Institute for Communications Technology Technische Universität Braunschweig

More information

Estimation of the Optimum Rotational Parameter for the Fractional Fourier Transform Using Domain Decomposition

Estimation of the Optimum Rotational Parameter for the Fractional Fourier Transform Using Domain Decomposition Estimation of the Optimum Rotational Parameter for the Fractional Fourier Transform Using Domain Decomposition Seema Sud 1 1 The Aerospace Corporation, 4851 Stonecroft Blvd. Chantilly, VA 20151 Abstract

More information

Recovery of Compactly Supported Functions from Spectrogram Measurements via Lifting

Recovery of Compactly Supported Functions from Spectrogram Measurements via Lifting Recovery of Compactly Supported Functions from Spectrogram Measurements via Lifting Mark Iwen markiwen@math.msu.edu 2017 Friday, July 7 th, 2017 Joint work with... Sami Merhi (Michigan State University)

More information

EUSIPCO

EUSIPCO EUSIPCO 013 1569746769 SUBSET PURSUIT FOR ANALYSIS DICTIONARY LEARNING Ye Zhang 1,, Haolong Wang 1, Tenglong Yu 1, Wenwu Wang 1 Department of Electronic and Information Engineering, Nanchang University,

More information

A Log-Frequency Approach to the Identification of the Wiener-Hammerstein Model

A Log-Frequency Approach to the Identification of the Wiener-Hammerstein Model A Log-Frequency Approach to the Identification of the Wiener-Hammerstein Model The MIT Faculty has made this article openly available Please share how this access benefits you Your story matters Citation

More information

Spatially adaptive alpha-rooting in BM3D sharpening

Spatially adaptive alpha-rooting in BM3D sharpening Spatially adaptive alpha-rooting in BM3D sharpening Markku Mäkitalo and Alessandro Foi Department of Signal Processing, Tampere University of Technology, P.O. Box FIN-553, 33101, Tampere, Finland e-mail:

More information

Decompositions of frames and a new frame identity

Decompositions of frames and a new frame identity Decompositions of frames and a new frame identity Radu Balan a, Peter G. Casazza b, Dan Edidin c and Gitta Kutyniok d a Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540, USA; b Department

More information

Robust Sound Event Detection in Continuous Audio Environments

Robust Sound Event Detection in Continuous Audio Environments Robust Sound Event Detection in Continuous Audio Environments Haomin Zhang 1, Ian McLoughlin 2,1, Yan Song 1 1 National Engineering Laboratory of Speech and Language Information Processing The University

More information

IMPROVED MULTI-MICROPHONE NOISE REDUCTION PRESERVING BINAURAL CUES

IMPROVED MULTI-MICROPHONE NOISE REDUCTION PRESERVING BINAURAL CUES IMPROVED MULTI-MICROPHONE NOISE REDUCTION PRESERVING BINAURAL CUES Andreas I. Koutrouvelis Richard C. Hendriks Jesper Jensen Richard Heusdens Circuits and Systems (CAS) Group, Delft University of Technology,

More information

Deep Learning: Approximation of Functions by Composition

Deep Learning: Approximation of Functions by Composition Deep Learning: Approximation of Functions by Composition Zuowei Shen Department of Mathematics National University of Singapore Outline 1 A brief introduction of approximation theory 2 Deep learning: approximation

More information

SEPARATION OF ACOUSTIC SIGNALS USING SELF-ORGANIZING NEURAL NETWORKS. Temujin Gautama & Marc M. Van Hulle

SEPARATION OF ACOUSTIC SIGNALS USING SELF-ORGANIZING NEURAL NETWORKS. Temujin Gautama & Marc M. Van Hulle SEPARATION OF ACOUSTIC SIGNALS USING SELF-ORGANIZING NEURAL NETWORKS Temujin Gautama & Marc M. Van Hulle K.U.Leuven, Laboratorium voor Neuro- en Psychofysiologie Campus Gasthuisberg, Herestraat 49, B-3000

More information

EE 367 / CS 448I Computational Imaging and Display Notes: Image Deconvolution (lecture 6)

EE 367 / CS 448I Computational Imaging and Display Notes: Image Deconvolution (lecture 6) EE 367 / CS 448I Computational Imaging and Display Notes: Image Deconvolution (lecture 6) Gordon Wetzstein gordon.wetzstein@stanford.edu This document serves as a supplement to the material discussed in

More information

Noise Robust Isolated Words Recognition Problem Solving Based on Simultaneous Perturbation Stochastic Approximation Algorithm

Noise Robust Isolated Words Recognition Problem Solving Based on Simultaneous Perturbation Stochastic Approximation Algorithm EngOpt 2008 - International Conference on Engineering Optimization Rio de Janeiro, Brazil, 0-05 June 2008. Noise Robust Isolated Words Recognition Problem Solving Based on Simultaneous Perturbation Stochastic

More information

Image Denoising using Uniform Curvelet Transform and Complex Gaussian Scale Mixture

Image Denoising using Uniform Curvelet Transform and Complex Gaussian Scale Mixture EE 5359 Multimedia Processing Project Report Image Denoising using Uniform Curvelet Transform and Complex Gaussian Scale Mixture By An Vo ISTRUCTOR: Dr. K. R. Rao Summer 008 Image Denoising using Uniform

More information

A Probability Model for Interaural Phase Difference

A Probability Model for Interaural Phase Difference A Probability Model for Interaural Phase Difference Michael I. Mandel, Daniel P.W. Ellis Department of Electrical Engineering Columbia University, New York, New York {mim,dpwe}@ee.columbia.edu Abstract

More information

Scalable audio separation with light Kernel Additive Modelling

Scalable audio separation with light Kernel Additive Modelling Scalable audio separation with light Kernel Additive Modelling Antoine Liutkus 1, Derry Fitzgerald 2, Zafar Rafii 3 1 Inria, Université de Lorraine, LORIA, UMR 7503, France 2 NIMBUS Centre, Cork Institute

More information

JOINT ACOUSTIC AND SPECTRAL MODELING FOR SPEECH DEREVERBERATION USING NON-NEGATIVE REPRESENTATIONS

JOINT ACOUSTIC AND SPECTRAL MODELING FOR SPEECH DEREVERBERATION USING NON-NEGATIVE REPRESENTATIONS JOINT ACOUSTIC AND SPECTRAL MODELING FOR SPEECH DEREVERBERATION USING NON-NEGATIVE REPRESENTATIONS Nasser Mohammadiha Paris Smaragdis Simon Doclo Dept. of Medical Physics and Acoustics and Cluster of Excellence

More information

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent voices Nonparametric likelihood

More information

Approximately dual frames in Hilbert spaces and applications to Gabor frames

Approximately dual frames in Hilbert spaces and applications to Gabor frames Approximately dual frames in Hilbert spaces and applications to Gabor frames Ole Christensen and Richard S. Laugesen October 22, 200 Abstract Approximately dual frames are studied in the Hilbert space

More information

Over-enhancement Reduction in Local Histogram Equalization using its Degrees of Freedom. Alireza Avanaki

Over-enhancement Reduction in Local Histogram Equalization using its Degrees of Freedom. Alireza Avanaki Over-enhancement Reduction in Local Histogram Equalization using its Degrees of Freedom Alireza Avanaki ABSTRACT A well-known issue of local (adaptive) histogram equalization (LHE) is over-enhancement

More information

Application of the Tuned Kalman Filter in Speech Enhancement

Application of the Tuned Kalman Filter in Speech Enhancement Application of the Tuned Kalman Filter in Speech Enhancement Orchisama Das, Bhaswati Goswami and Ratna Ghosh Department of Instrumentation and Electronics Engineering Jadavpur University Kolkata, India

More information

Non-negative Matrix Factor Deconvolution; Extraction of Multiple Sound Sources from Monophonic Inputs

Non-negative Matrix Factor Deconvolution; Extraction of Multiple Sound Sources from Monophonic Inputs MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Non-negative Matrix Factor Deconvolution; Extraction of Multiple Sound Sources from Monophonic Inputs Paris Smaragdis TR2004-104 September

More information

Analysis of Communication Systems Using Iterative Methods Based on Banach s Contraction Principle

Analysis of Communication Systems Using Iterative Methods Based on Banach s Contraction Principle Analysis of Communication Systems Using Iterative Methods Based on Banach s Contraction Principle H. Azari Soufiani, M. J. Saberian, M. A. Akhaee, R. Nasiri Mahallati, F. Marvasti Multimedia Signal, Sound

More information

CONVOLUTIVE NON-NEGATIVE MATRIX FACTORISATION WITH SPARSENESS CONSTRAINT

CONVOLUTIVE NON-NEGATIVE MATRIX FACTORISATION WITH SPARSENESS CONSTRAINT CONOLUTIE NON-NEGATIE MATRIX FACTORISATION WITH SPARSENESS CONSTRAINT Paul D. O Grady Barak A. Pearlmutter Hamilton Institute National University of Ireland, Maynooth Co. Kildare, Ireland. ABSTRACT Discovering

More information

EMPLOYING PHASE INFORMATION FOR AUDIO DENOISING. İlker Bayram. Istanbul Technical University, Istanbul, Turkey

EMPLOYING PHASE INFORMATION FOR AUDIO DENOISING. İlker Bayram. Istanbul Technical University, Istanbul, Turkey EMPLOYING PHASE INFORMATION FOR AUDIO DENOISING İlker Bayram Istanbul Technical University, Istanbul, Turkey ABSTRACT Spectral audio denoising methods usually make use of the magnitudes of a time-frequency

More information

Denoising Gabor Transforms

Denoising Gabor Transforms 1 Denoising Gabor Transforms James S. Walker Abstract We describe denoising one-dimensional signals by thresholding Blackman windowed Gabor transforms. This method is compared with Gauss-windowed Gabor

More information

IMPROVEMENTS IN MODAL PARAMETER EXTRACTION THROUGH POST-PROCESSING FREQUENCY RESPONSE FUNCTION ESTIMATES

IMPROVEMENTS IN MODAL PARAMETER EXTRACTION THROUGH POST-PROCESSING FREQUENCY RESPONSE FUNCTION ESTIMATES IMPROVEMENTS IN MODAL PARAMETER EXTRACTION THROUGH POST-PROCESSING FREQUENCY RESPONSE FUNCTION ESTIMATES Bere M. Gur Prof. Christopher Niezreci Prof. Peter Avitabile Structural Dynamics and Acoustic Systems

More information

Independent Component Analysis and Unsupervised Learning

Independent Component Analysis and Unsupervised Learning Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien National Cheng Kung University TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent

More information

Classification of Hand-Written Digits Using Scattering Convolutional Network

Classification of Hand-Written Digits Using Scattering Convolutional Network Mid-year Progress Report Classification of Hand-Written Digits Using Scattering Convolutional Network Dongmian Zou Advisor: Professor Radu Balan Co-Advisor: Dr. Maneesh Singh (SRI) Background Overview

More information

Dept. Electronics and Electrical Engineering, Keio University, Japan. NTT Communication Science Laboratories, NTT Corporation, Japan.

Dept. Electronics and Electrical Engineering, Keio University, Japan. NTT Communication Science Laboratories, NTT Corporation, Japan. JOINT SEPARATION AND DEREVERBERATION OF REVERBERANT MIXTURES WITH DETERMINED MULTICHANNEL NON-NEGATIVE MATRIX FACTORIZATION Hideaki Kagami, Hirokazu Kameoka, Masahiro Yukawa Dept. Electronics and Electrical

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

NONNEGATIVE MATRIX FACTORIZATION WITH TRANSFORM LEARNING. Dylan Fagot, Herwig Wendt and Cédric Févotte

NONNEGATIVE MATRIX FACTORIZATION WITH TRANSFORM LEARNING. Dylan Fagot, Herwig Wendt and Cédric Févotte NONNEGATIVE MATRIX FACTORIZATION WITH TRANSFORM LEARNING Dylan Fagot, Herwig Wendt and Cédric Févotte IRIT, Université de Toulouse, CNRS, Toulouse, France firstname.lastname@irit.fr ABSTRACT Traditional

More information

Phase-dependent anisotropic Gaussian model for audio source separation

Phase-dependent anisotropic Gaussian model for audio source separation Phase-dependent anisotropic Gaussian model for audio source separation Paul Magron, Roland Badeau, Bertrand David To cite this version: Paul Magron, Roland Badeau, Bertrand David. Phase-dependent anisotropic

More information

Finite Frame Quantization

Finite Frame Quantization Finite Frame Quantization Liam Fowl University of Maryland August 21, 2018 1 / 38 Overview 1 Motivation 2 Background 3 PCM 4 First order Σ quantization 5 Higher order Σ quantization 6 Alternative Dual

More information

arxiv: v1 [physics.optics] 5 Mar 2012

arxiv: v1 [physics.optics] 5 Mar 2012 Designing and using prior knowledge for phase retrieval Eliyahu Osherovich, Michael Zibulevsky, and Irad Yavneh arxiv:1203.0879v1 [physics.optics] 5 Mar 2012 Computer Science Department, Technion Israel

More information

Phoneme segmentation based on spectral metrics

Phoneme segmentation based on spectral metrics Phoneme segmentation based on spectral metrics Xianhua Jiang, Johan Karlsson, and Tryphon T. Georgiou Introduction We consider the classic problem of segmenting speech signals into individual phonemes.

More information

COMPLEX WAVELET TRANSFORM IN SIGNAL AND IMAGE ANALYSIS

COMPLEX WAVELET TRANSFORM IN SIGNAL AND IMAGE ANALYSIS COMPLEX WAVELET TRANSFORM IN SIGNAL AND IMAGE ANALYSIS MUSOKO VICTOR, PROCHÁZKA ALEŠ Institute of Chemical Technology, Department of Computing and Control Engineering Technická 905, 66 8 Prague 6, Cech

More information

THE task of identifying the environment in which a sound

THE task of identifying the environment in which a sound 1 Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification Victor Bisot, Romain Serizel, Slim Essid, and Gaël Richard Abstract In this paper, we study the usefulness of various

More information

MULTIPITCH ESTIMATION AND INSTRUMENT RECOGNITION BY EXEMPLAR-BASED SPARSE REPRESENTATION. Ikuo Degawa, Kei Sato, Masaaki Ikehara

MULTIPITCH ESTIMATION AND INSTRUMENT RECOGNITION BY EXEMPLAR-BASED SPARSE REPRESENTATION. Ikuo Degawa, Kei Sato, Masaaki Ikehara MULTIPITCH ESTIMATION AND INSTRUMENT RECOGNITION BY EXEMPLAR-BASED SPARSE REPRESENTATION Ikuo Degawa, Kei Sato, Masaaki Ikehara EEE Dept. Keio University Yokohama, Kanagawa 223-8522 Japan E-mail:{degawa,

More information

Real-Time Spectrogram Inversion Using Phase Gradient Heap Integration

Real-Time Spectrogram Inversion Using Phase Gradient Heap Integration Real-Time Spectrogram Inversion Using Phase Gradient Heap Integration Zdeněk Průša 1 and Peter L. Søndergaard 2 1,2 Acoustics Research Institute, Austrian Academy of Sciences, Vienna, Austria 2 Oticon

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information

Spectral Domain Speech Enhancement using HMM State-Dependent Super-Gaussian Priors

Spectral Domain Speech Enhancement using HMM State-Dependent Super-Gaussian Priors IEEE SIGNAL PROCESSING LETTERS 1 Spectral Domain Speech Enhancement using HMM State-Dependent Super-Gaussian Priors Nasser Mohammadiha, Student Member, IEEE, Rainer Martin, Fellow, IEEE, and Arne Leijon,

More information

SIMULTANEOUS NOISE CLASSIFICATION AND REDUCTION USING A PRIORI LEARNED MODELS

SIMULTANEOUS NOISE CLASSIFICATION AND REDUCTION USING A PRIORI LEARNED MODELS TO APPEAR IN IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 22 25, 23, UK SIMULTANEOUS NOISE CLASSIFICATION AND REDUCTION USING A PRIORI LEARNED MODELS Nasser Mohammadiha

More information

A State-Space Approach to Dynamic Nonnegative Matrix Factorization

A State-Space Approach to Dynamic Nonnegative Matrix Factorization 1 A State-Space Approach to Dynamic Nonnegative Matrix Factorization Nasser Mohammadiha, Paris Smaragdis, Ghazaleh Panahandeh, Simon Doclo arxiv:179.5v1 [cs.lg] 31 Aug 17 Abstract Nonnegative matrix factorization

More information

Improved Method for Epoch Extraction in High Pass Filtered Speech

Improved Method for Epoch Extraction in High Pass Filtered Speech Improved Method for Epoch Extraction in High Pass Filtered Speech D. Govind Center for Computational Engineering & Networking Amrita Vishwa Vidyapeetham (University) Coimbatore, Tamilnadu 642 Email: d

More information

Adaptive Corrected Procedure for TVL1 Image Deblurring under Impulsive Noise

Adaptive Corrected Procedure for TVL1 Image Deblurring under Impulsive Noise Adaptive Corrected Procedure for TVL1 Image Deblurring under Impulsive Noise Minru Bai(x T) College of Mathematics and Econometrics Hunan University Joint work with Xiongjun Zhang, Qianqian Shao June 30,

More information

Sparse Time-Frequency Transforms and Applications.

Sparse Time-Frequency Transforms and Applications. Sparse Time-Frequency Transforms and Applications. Bruno Torrésani http://www.cmi.univ-mrs.fr/~torresan LATP, Université de Provence, Marseille DAFx, Montreal, September 2006 B. Torrésani (LATP Marseille)

More information

Detection of Overlapping Acoustic Events Based on NMF with Shared Basis Vectors

Detection of Overlapping Acoustic Events Based on NMF with Shared Basis Vectors Detection of Overlapping Acoustic Events Based on NMF with Shared Basis Vectors Kazumasa Yamamoto Department of Computer Science Chubu University Kasugai, Aichi, Japan Email: yamamoto@cs.chubu.ac.jp Chikara

More information

Single Channel Music Sound Separation Based on Spectrogram Decomposition and Note Classification

Single Channel Music Sound Separation Based on Spectrogram Decomposition and Note Classification Single Channel Music Sound Separation Based on Spectrogram Decomposition and Note Classification Hafiz Mustafa and Wenwu Wang Centre for Vision, Speech and Signal Processing (CVSSP) University of Surrey,

More information

ALTERNATIVE OBJECTIVE FUNCTIONS FOR DEEP CLUSTERING

ALTERNATIVE OBJECTIVE FUNCTIONS FOR DEEP CLUSTERING ALTERNATIVE OBJECTIVE FUNCTIONS FOR DEEP CLUSTERING Zhong-Qiu Wang,2, Jonathan Le Roux, John R. Hershey Mitsubishi Electric Research Laboratories (MERL), USA 2 Department of Computer Science and Engineering,

More information

2 Regularized Image Reconstruction for Compressive Imaging and Beyond

2 Regularized Image Reconstruction for Compressive Imaging and Beyond EE 367 / CS 448I Computational Imaging and Display Notes: Compressive Imaging and Regularized Image Reconstruction (lecture ) Gordon Wetzstein gordon.wetzstein@stanford.edu This document serves as a supplement

More information

Improved noise power spectral density tracking by a MAP-based postprocessor

Improved noise power spectral density tracking by a MAP-based postprocessor Improved noise power spectral density tracking by a MAP-based postprocessor Aleksej Chinaev, Alexander Krueger, Dang Hai Tran Vu, Reinhold Haeb-Umbach University of Paderborn, Germany March 8th, 01 Computer

More information

Design Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation

Design Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation CENTER FOR COMPUTER RESEARCH IN MUSIC AND ACOUSTICS DEPARTMENT OF MUSIC, STANFORD UNIVERSITY REPORT NO. STAN-M-4 Design Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation

More information

Characterization of Gradient Dominance and Regularity Conditions for Neural Networks

Characterization of Gradient Dominance and Regularity Conditions for Neural Networks Characterization of Gradient Dominance and Regularity Conditions for Neural Networks Yi Zhou Ohio State University Yingbin Liang Ohio State University Abstract zhou.1172@osu.edu liang.889@osu.edu The past

More information

Chirp Transform for FFT

Chirp Transform for FFT Chirp Transform for FFT Since the FFT is an implementation of the DFT, it provides a frequency resolution of 2π/N, where N is the length of the input sequence. If this resolution is not sufficient in a

More information

LECTURE NOTES IN AUDIO ANALYSIS: PITCH ESTIMATION FOR DUMMIES

LECTURE NOTES IN AUDIO ANALYSIS: PITCH ESTIMATION FOR DUMMIES LECTURE NOTES IN AUDIO ANALYSIS: PITCH ESTIMATION FOR DUMMIES Abstract March, 3 Mads Græsbøll Christensen Audio Analysis Lab, AD:MT Aalborg University This document contains a brief introduction to pitch

More information

ORTHOGONALITY-REGULARIZED MASKED NMF FOR LEARNING ON WEAKLY LABELED AUDIO DATA. Iwona Sobieraj, Lucas Rencker, Mark D. Plumbley

ORTHOGONALITY-REGULARIZED MASKED NMF FOR LEARNING ON WEAKLY LABELED AUDIO DATA. Iwona Sobieraj, Lucas Rencker, Mark D. Plumbley ORTHOGONALITY-REGULARIZED MASKED NMF FOR LEARNING ON WEAKLY LABELED AUDIO DATA Iwona Sobieraj, Lucas Rencker, Mark D. Plumbley University of Surrey Centre for Vision Speech and Signal Processing Guildford,

More information

Fast algorithms for informed source separation

Fast algorithms for informed source separation Fast algorithms for informed source separation Augustin Lefèvre augustin.lefevre@uclouvain.be September, 10th 2013 Source separation in 5 minutes Recover source estimates from a mixed signal We consider

More information

NOISE ROBUST RELATIVE TRANSFER FUNCTION ESTIMATION. M. Schwab, P. Noll, and T. Sikora. Technical University Berlin, Germany Communication System Group

NOISE ROBUST RELATIVE TRANSFER FUNCTION ESTIMATION. M. Schwab, P. Noll, and T. Sikora. Technical University Berlin, Germany Communication System Group NOISE ROBUST RELATIVE TRANSFER FUNCTION ESTIMATION M. Schwab, P. Noll, and T. Sikora Technical University Berlin, Germany Communication System Group Einsteinufer 17, 1557 Berlin (Germany) {schwab noll

More information

Sparse linear models

Sparse linear models Sparse linear models Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 2/22/2016 Introduction Linear transforms Frequency representation Short-time

More information