A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement
1 A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement
Simon Leglaive 1, Laurent Girin 1,2, Radu Horaud 1
1: Inria Grenoble Rhône-Alpes; 2: Univ. Grenoble Alpes, Grenoble INP, GIPSA-lab
IEEE International Workshop on Machine Learning for Signal Processing (MLSP), September 18, 2018, Aalborg, Denmark
2 Introduction
3 Speech enhancement
noisy mixture signal → speech enhancement → clean speech signal
A preprocessing step for various speech information retrieval tasks (e.g. automatic speech recognition, voice activity detection).
4 Speech enhancement as a source separation problem
In the short-term Fourier transform (STFT) domain, for all $(f, n) \in \mathcal{B} = \{0, \dots, F-1\} \times \{0, \dots, N-1\}$, we observe:
$$x_{fn} = s_{fn} + b_{fn}, \qquad (1)$$
where $s_{fn} \in \mathbb{C}$ is the clean speech signal, $b_{fn} \in \mathbb{C}$ is the noise signal, $f$ is the frequency index and $n$ the time-frame index.
Objective: separate the speech and noise signals from the observed mixture signal (an under-determined problem).
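Because the STFT is linear, the additive time-domain mixture carries over to (1) bin by bin. A small sketch checking this with scipy on synthetic stand-in signals (the sine "speech", noise level, and STFT parameters are all illustrative):

```python
import numpy as np
from scipy.signal import stft

fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 440 * t)                 # stand-in for a clean speech signal
noise = 0.1 * np.random.default_rng(0).standard_normal(fs)

_, _, X = stft(speech + noise, fs=fs, nperseg=1024)  # mixture STFT x_fn, shape (F, N)
_, _, S = stft(speech, fs=fs, nperseg=1024)
_, _, B = stft(noise, fs=fs, nperseg=1024)
assert np.allclose(X, S + B)                         # the STFT is linear: x_fn = s_fn + b_fn
```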
5 Variance modeling with non-negative matrix factorization (NMF)
From [1], independently for all $(f, n) \in \mathcal{B}$:
$$s_{fn} \sim \mathcal{N}_c\big(0, (\mathbf{W}_s \mathbf{H}_s)_{f,n}\big), \qquad (2)$$
where $\mathbf{W}_s \in \mathbb{R}_+^{F \times K_s}$ is a dictionary matrix of spectral templates, $\mathbf{H}_s \in \mathbb{R}_+^{K_s \times N}$ is the activation matrix, and $K_s$ is the rank of the factorization (usually $K_s(F + N) \ll FN$).
[1] C. Févotte, N. Bertin and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis," Neural Computation, 2009.
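For concreteness, here is a minimal sketch of the standard multiplicative updates that fit this variance model by minimizing the Itakura-Saito divergence between the observed power spectrogram and $\mathbf{W}_s \mathbf{H}_s$ [1]; the initialization and iteration count are illustrative choices, not the authors' settings:

```python
import numpy as np

def is_nmf(V, K, n_iter=200, eps=1e-10, seed=0):
    """Fit V ~= W @ H by minimizing the Itakura-Saito divergence with the
    standard multiplicative updates [1]. V is an (F, N) power spectrogram."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + eps   # dictionary of spectral templates
    H = rng.random((K, N)) + eps   # activations
    for _ in range(n_iter):
        V_hat = W @ H + eps
        H *= (W.T @ (V * V_hat**-2)) / (W.T @ V_hat**-1)
        V_hat = W @ H + eps
        W *= ((V * V_hat**-2) @ H.T) / (V_hat**-1 @ H.T)
    return W, H
```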
6 Supervised NMF
Supervised setting: $\mathbf{W}_s$ is learned on a dataset of clean speech signals; $\mathbf{H}_s$ is estimated from the noisy mixture signal.
Pros and cons:
- Easy to interpret.
- Linear variance model: $\mathbb{E}[|s_{fn}|^2] = (\mathbf{W}_s \mathbf{H}_s)_{f,n} = \mathbf{w}_{s,f}^\top \mathbf{h}_{s,n}$.
- Limited number of trainable parameters.
7 In this work, we explore the use of neural networks as an alternative to this supervised NMF-based variance model.
8 Model
9 Speech variance modeling with neural networks
From [2, 3], independently for all $(f, n) \in \mathcal{B}$:
$$\mathbf{z}_n \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_L); \qquad (3)$$
$$s_{fn} \mid \mathbf{z}_n \sim \mathcal{N}_c\big(0, \sigma_f^2(\mathbf{z}_n)\big), \qquad (4)$$
where $\mathbf{z}_n \in \mathbb{R}^L$ is a latent random vector with $L \ll F$, and $\sigma_f^2 : \mathbb{R}^L \to \mathbb{R}_+$ is a non-linear function parametrized by $\theta_s$.
[2] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," Proc. of ICLR, 2014.
[3] Y. Bando et al., "Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization," Proc. of IEEE ICASSP, 2018.
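A minimal PyTorch sketch of such a generative (decoder) network: the slide specifies only that $\sigma_f^2$ is a non-linear function of $\mathbf{z}_n$, so the layer sizes, tanh activation, and exponential output are illustrative assumptions, not the authors' architecture:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """sigma_f^2(z_n): maps a latent vector z_n in R^L to one variance per
    frequency bin. Layer sizes and activation are illustrative."""
    def __init__(self, L=16, F=513, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(L, hidden), nn.Tanh(), nn.Linear(hidden, F))

    def forward(self, z):
        return torch.exp(self.net(z))  # exp() keeps the variance strictly positive

# Sampling from the generative model (3)-(4):
decoder = Decoder()
z = torch.randn(5, 16)                                # z_n ~ N(0, I_L) for 5 frames
var = decoder(z)                                      # (5, F) speech variances
s = torch.sqrt(var / 2) * (torch.randn(5, 513) + 1j * torch.randn(5, 513))
# s_fn | z_n ~ N_c(0, sigma_f^2(z_n)): circular complex Gaussian samples
```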
10 Noise and mixture models
Unsupervised noise model: independently for all $(f, n) \in \mathcal{B}$,
$$b_{fn} \sim \mathcal{N}_c\big(0, (\mathbf{W}_b \mathbf{H}_b)_{f,n}\big), \qquad (5)$$
where $\mathbf{W}_b \in \mathbb{R}_+^{F \times K_b}$ and $\mathbf{H}_b \in \mathbb{R}_+^{K_b \times N}$.
Mixture model: for all $(f, n) \in \mathcal{B}$,
$$x_{fn} = \sqrt{g_n}\, s_{fn} + b_{fn}, \qquad (6)$$
where $g_n \in \mathbb{R}_+$ is a gain parameter.
Conditional mixture distribution:
$$x_{fn} \mid \mathbf{z}_n \sim \mathcal{N}_c\big(0,\ g_n \sigma_f^2(\mathbf{z}_n) + (\mathbf{W}_b \mathbf{H}_b)_{f,n}\big). \qquad (7)$$
13 Inference
14 Parameter estimation
For now, we assume that the speech parameters $\theta_s$ have been learned during a training phase.
Unsupervised model parameters: $\theta_u = \{\mathbf{W}_b \in \mathbb{R}_+^{F \times K_b},\ \mathbf{H}_b \in \mathbb{R}_+^{K_b \times N},\ \mathbf{g} = [g_0, \dots, g_{N-1}]^\top \in \mathbb{R}_+^N\}$.
Observed data: $\mathbf{x} = \{x_{fn} \in \mathbb{C}\}_{(f,n) \in \mathcal{B}}$.
Direct maximum likelihood estimation is intractable.
Latent data: $\mathbf{z} = \{\mathbf{z}_n \in \mathbb{R}^L\}_{n=0}^{N-1}$ → expectation-maximization (EM) algorithm.
16 Monte Carlo EM algorithm
E-step: from the current value $\theta_u^\star$ of the parameters, compute:
$$Q(\theta_u; \theta_u^\star) = \mathbb{E}_{p(\mathbf{z} \mid \mathbf{x}; \theta_s, \theta_u^\star)}[\ln p(\mathbf{x}, \mathbf{z}; \theta_s, \theta_u)] \approx \frac{1}{R} \sum_{r=1}^{R} \ln p\big(\mathbf{x}, \mathbf{z}^{(r)}; \theta_s, \theta_u\big), \qquad (8)$$
where the samples $\{\mathbf{z}^{(r)}\}_{r=1,\dots,R}$ are asymptotically drawn from $p(\mathbf{z} \mid \mathbf{x}; \theta_s, \theta_u^\star)$ using a Markov chain Monte Carlo method.
M-step:
$$\theta_u \leftarrow \arg\max_{\theta_u} Q(\theta_u; \theta_u^\star), \qquad (9)$$
with $\theta_u = \{\mathbf{H}_b \in \mathbb{R}_+^{K_b \times N},\ \mathbf{W}_b \in \mathbb{R}_+^{F \times K_b},\ \mathbf{g} \in \mathbb{R}_+^N\}$.
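The slide does not specify the sampler beyond "a Markov chain Monte Carlo method". Below is a minimal sketch of one way the samples $\mathbf{z}^{(r)}$ could be drawn, using a random-walk Metropolis-Hastings kernel; the kernel choice, step size, tensor layout, and the `decoder` interface (the hypothetical network sketched earlier) are all assumptions:

```python
import torch

def log_joint(x_pow, z, g, Vb, decoder):
    """log p(x_n, z_n; theta) per frame, up to an additive constant.
    x_pow: (N, F) |x_fn|^2; z: (N, L); g: (N, 1) gains; Vb: (N, F) noise variances."""
    v = g * decoder(z) + Vb                       # conditional mixture variance (7)
    log_lik = (-torch.log(v) - x_pow / v).sum(dim=1)
    log_prior = -0.5 * (z ** 2).sum(dim=1)        # z_n ~ N(0, I_L)
    return log_lik + log_prior

def e_step_metropolis(x_pow, z0, g, Vb, decoder, R=50, step=0.01):
    """Random-walk Metropolis-Hastings: R samples per frame from p(z | x; theta)."""
    z, samples = z0, []
    lp = log_joint(x_pow, z, g, Vb, decoder)
    for _ in range(R):
        z_prop = z + step ** 0.5 * torch.randn_like(z)
        lp_prop = log_joint(x_pow, z_prop, g, Vb, decoder)
        accept = torch.log(torch.rand(z.shape[0])) < lp_prop - lp
        z = torch.where(accept.unsqueeze(1), z_prop, z)
        lp = torch.where(accept, lp_prop, lp)
        samples.append(z)
    return torch.stack(samples)                   # (R, N, L)
```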
19 Speech estimation
Let $\tilde{s}_{fn} = \sqrt{g_n}\, s_{fn}$ be the scaled speech STFT coefficients.
Posterior mean estimation: for all $(f, n) \in \mathcal{B}$,
$$\hat{\tilde{s}}_{fn} = \mathbb{E}_{p(\tilde{s}_{fn} \mid x_{fn}; \theta_s, \theta_u)}[\tilde{s}_{fn}] = \mathbb{E}_{p(\mathbf{z}_n \mid \mathbf{x}_n; \theta_s, \theta_u)}\left[\frac{g_n \sigma_f^2(\mathbf{z}_n)}{g_n \sigma_f^2(\mathbf{z}_n) + (\mathbf{W}_b \mathbf{H}_b)_{f,n}}\right] x_{fn}. \qquad (10)$$
The expectation is intractable → Markov chain Monte Carlo.
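Given such samples, (10) can be approximated by averaging the Wiener-like gain over them. A sketch reusing the hypothetical `decoder` and the `samples` tensor from the E-step sketch above:

```python
def posterior_mean_speech(X, samples, g, Vb, decoder):
    """Approximates (10): averages the Wiener-like gain over the MCMC samples.
    X: (F, N) complex mixture STFT; samples: (R, N, L) from the E-step above."""
    masks = []
    for z in samples:
        vs = g * decoder(z)                       # g_n sigma_f^2(z_n), shape (N, F)
        masks.append(vs / (vs + Vb))
    mask = torch.stack(masks).mean(dim=0)         # (N, F) averaged gain
    return mask.T * X                             # estimate of the scaled speech STFT
```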
20 Training the generative model with variational autoencoders
21 Problem setting
Training dataset of STFT speech time frames: $\mathbf{s} = \{\mathbf{s}_n \in \mathbb{C}^F\}_{n=0}^{N_{tr}-1}$.
Generative model (reminder): independently for all $(f, n) \in \mathcal{B}$:
$$\mathbf{z}_n \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_L); \qquad s_{fn} \mid \mathbf{z}_n \sim \mathcal{N}_c\big(0, \sigma_f^2(\mathbf{z}_n)\big),$$
where $\mathbf{z}_n \in \mathbb{R}^L$ and, in the following, $\mathbf{z} = \{\mathbf{z}_n\}_{n=0}^{N_{tr}-1}$.
Problem: learn the parameters $\theta_s$ of this generative model (weights and biases of the neural network).
Maximum likelihood is intractable → variational autoencoders [2].
[2] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," Proc. of ICLR, 2014.
22 Variational inference
Find $q(\mathbf{z} \mid \mathbf{s}; \phi)$ which approximates $p(\mathbf{z} \mid \mathbf{s}; \theta_s)$.
Kullback-Leibler divergence as a measure of fit:
$$D_{KL}\big(q(\mathbf{z} \mid \mathbf{s}; \phi) \,\|\, p(\mathbf{z} \mid \mathbf{s}; \theta_s)\big) = \ln p(\mathbf{s}; \theta_s) - \mathcal{L}(\phi, \theta_s), \qquad (11)$$
where
$$\mathcal{L}(\phi, \theta_s) = \underbrace{\mathbb{E}_{q(\mathbf{z} \mid \mathbf{s}; \phi)}[\ln p(\mathbf{s} \mid \mathbf{z}; \theta_s)]}_{\text{reconstruction accuracy}} - \underbrace{D_{KL}\big(q(\mathbf{z} \mid \mathbf{s}; \phi) \,\|\, p(\mathbf{z})\big)}_{\text{regularization}}. \qquad (12)$$
We would like to maximize $\mathcal{L}(\phi, \theta_s)$ with respect to both $\phi$ and $\theta_s$. We need to define $q(\mathbf{z} \mid \mathbf{s}; \phi)$.
24 Variational distribution
Independently for all $n \in \{0, \dots, N_{tr}-1\}$ and $l \in \{0, \dots, L-1\}$:
$$(\mathbf{z}_n)_l \mid \mathbf{s}_n \sim \mathcal{N}\Big(\mu_l\big(|\mathbf{s}_n|^2\big),\ \sigma_l^2\big(|\mathbf{s}_n|^2\big)\Big), \qquad (13)$$
where $|\cdot|^2$ denotes the element-wise squared modulus, and $\mu_l : \mathbb{R}_+^F \to \mathbb{R}$ and $\sigma_l^2 : \mathbb{R}_+^F \to \mathbb{R}_+$ are non-linear functions parametrized by $\phi$.
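A matching sketch of the inference (encoder) network implied by (13); as with the decoder sketch, the architecture details are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps the speech power spectrum |s_n|^2 to the mean and log-variance of
    q(z_n | s_n; phi) in (13). Layer sizes and activation are illustrative."""
    def __init__(self, F=513, L=16, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(F, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, L)        # mu_l(|s_n|^2)
        self.logvar = nn.Linear(hidden, L)    # ln sigma_l^2(|s_n|^2)

    def forward(self, s_pow):
        h = self.body(s_pow)
        return self.mu(h), self.logvar(h)
```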
25 Variational free energy
$$\mathcal{L}(\theta_s, \phi) \overset{c}{=} -\sum_{f=0}^{F-1} \sum_{n=0}^{N_{tr}-1} \mathbb{E}_{q(\mathbf{z}_n \mid \mathbf{s}_n; \phi)}\Big[d_{IS}\big(|s_{fn}|^2;\ \sigma_f^2(\mathbf{z}_n)\big)\Big] + \frac{1}{2} \sum_{l=1}^{L} \sum_{n=0}^{N_{tr}-1} \Big[\ln \sigma_l^2\big(|\mathbf{s}_n|^2\big) - \mu_l\big(|\mathbf{s}_n|^2\big)^2 - \sigma_l^2\big(|\mathbf{s}_n|^2\big)\Big], \qquad (14)$$
where $d_{IS}(x; y) = x/y - \ln(x/y) - 1$ is the Itakura-Saito (IS) divergence.
- The intractable expectation is approximated by a sample average ("reparametrization trick").
- Differentiable with respect to both $\theta_s$ and $\phi$ (backpropagation).
- Optimized using a gradient-ascent-based algorithm.
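Putting the encoder and decoder sketches together, here is a minimal sketch of the resulting training objective, i.e. the negative of (14) with a single reparametrized sample per frame; the single-sample approximation and the Adam optimizer in the usage note are assumptions, not the authors' stated settings:

```python
import torch

def neg_free_energy(s_pow, encoder, decoder, eps=1e-10):
    """-L(theta_s, phi) from (14), with the expectation replaced by a single
    reparametrized sample per frame; minimize this with gradient descent."""
    mu, logvar = encoder(s_pow)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparametrization trick
    ratio = s_pow / (decoder(z) + eps)
    recon = (ratio - torch.log(ratio + eps) - 1).sum()        # sum of d_IS terms
    kl = -0.5 * (logvar - mu**2 - torch.exp(logvar)).sum()    # KL term, up to a constant
    return recon + kl

# Typical training step (optimizer choice is an assumption):
# opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
# loss = neg_free_energy(s_pow_batch, encoder, decoder); loss.backward(); opt.step()
```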
26 Experiments
27 Dataset
Clean speech signals: TIMIT database. Noise signals: DEMAND database (domestic environment, nature, office, indoor public spaces, street, and transportation).
Training: training set of TIMIT; 4 hours of speech; 462 speakers.
Testing: 168 noisy mixtures at 0 dB signal-to-noise ratio; 1 sentence per speaker in the TIMIT test set.
28 Semi-supervised NMF baseline
Independently for all $(f, n) \in \mathcal{B}$: $s_{fn} \sim \mathcal{N}_c\big(0, (\mathbf{W}_s \mathbf{H}_s)_{f,n}\big)$ and $b_{fn} \sim \mathcal{N}_c\big(0, (\mathbf{W}_b \mathbf{H}_b)_{f,n}\big)$.
Training: from the observed clean speech signals,
$$\min_{\mathbf{W}_s \in \mathbb{R}_+^{F \times K_s},\ \mathbf{H}_s \in \mathbb{R}_+^{K_s \times N}}\ \sum_{(f,n) \in \mathcal{B}} d_{IS}\big(|s_{fn}|^2;\ (\mathbf{W}_s \mathbf{H}_s)_{f,n}\big).$$
Inference: from the observed mixture signal $x_{fn} = s_{fn} + b_{fn}$,
$$\min_{\mathbf{H}_s \in \mathbb{R}_+^{K_s \times N},\ \mathbf{W}_b \in \mathbb{R}_+^{F \times K_b},\ \mathbf{H}_b \in \mathbb{R}_+^{K_b \times N}}\ \sum_{(f,n) \in \mathcal{B}} d_{IS}\big(|x_{fn}|^2;\ (\mathbf{W}_s \mathbf{H}_s + \mathbf{W}_b \mathbf{H}_b)_{f,n}\big).$$
Speech reconstruction (Wiener filtering):
$$\hat{s}_{fn} = \frac{(\mathbf{W}_s \mathbf{H}_s)_{f,n}}{(\mathbf{W}_s \mathbf{H}_s + \mathbf{W}_b \mathbf{H}_b)_{f,n}}\, x_{fn}.$$
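A minimal sketch of this baseline's inference step, reusing the IS-NMF multiplicative updates from the earlier sketch but holding $\mathbf{W}_s$ fixed; initialization and iteration count are again illustrative:

```python
import numpy as np

def separate_semi_supervised(X_pow, Ws, Kb, n_iter=200, eps=1e-10, seed=0):
    """Inference step of the baseline: W_s is kept fixed (learned on clean speech),
    while H_s, W_b, H_b are fitted to the mixture power spectrogram X_pow (F, N)
    with IS-divergence multiplicative updates; returns the Wiener mask."""
    rng = np.random.default_rng(seed)
    F, N = X_pow.shape
    Hs = rng.random((Ws.shape[1], N)) + eps
    Wb = rng.random((F, Kb)) + eps
    Hb = rng.random((Kb, N)) + eps
    for _ in range(n_iter):
        V = Ws @ Hs + Wb @ Hb + eps
        Hs *= (Ws.T @ (X_pow * V**-2)) / (Ws.T @ V**-1)
        V = Ws @ Hs + Wb @ Hb + eps
        Wb *= ((X_pow * V**-2) @ Hb.T) / (V**-1 @ Hb.T)
        V = Ws @ Hs + Wb @ Hb + eps
        Hb *= (Wb.T @ (X_pow * V**-2)) / (Wb.T @ V**-1)
    return (Ws @ Hs) / (Ws @ Hs + Wb @ Hb + eps)  # multiply with the mixture STFT
```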
29 Fully-supervised deep-learning reference method
The fully-supervised deep-learning approach proposed in [4]: a deep neural network is trained to map noisy speech log-power spectrograms to clean speech log-power spectrograms.
Trained with more than 100 different noise types → effective in handling unseen noise types.
[4] Y. Xu et al., "A regression approach to speech enhancement based on deep neural networks," IEEE Transactions on Audio, Speech and Language Processing, 2015.
30 Experiments
The enhanced speech quality is evaluated in terms of:
- Signal-to-distortion ratio (SDR), in decibels (dB).
- Perceptual evaluation of speech quality (PESQ) measure, between −0.5 and 4.5.
The higher, the better. Different values are tested for the latent dimension $L$ and the speech NMF rank $K_s$: 8, 16, 32, 64, or …
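The slide does not say which implementations compute these scores; as an illustration, here is a sketch using the commonly used mir_eval and pesq Python packages (the tooling choice is an assumption about how one could reproduce the evaluation, not the authors' stated setup):

```python
import numpy as np
from mir_eval.separation import bss_eval_sources
from pesq import pesq

def evaluate(clean, estimate, fs=16000):
    """SDR (dB) and wide-band PESQ for one time-domain signal pair."""
    sdr, _, _, _ = bss_eval_sources(clean[None, :], estimate[None, :])
    return sdr[0], pesq(fs, clean, estimate, 'wb')
```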
31 Experimental results (SDR)
[Boxplots of SDR for each method; the median value is indicated above each boxplot.]
32 Experimental results (PESQ)
[Boxplots of PESQ for each method; the median value is indicated above each boxplot.]
33 Musical audio example
All models have been trained on speech (not singing voice).
[Spectrograms (frequency in kHz vs. time in s) of the mixture, the original voice, and the estimates from the fully-supervised DNN, the proposed method, and semi-supervised NMF.]
34 Conclusion
35 Conclusion
Variational autoencoders are an interesting alternative to supervised NMF models.
Some perspectives:
- Monte Carlo EM is slow → variational inference;
- temporal model on the latent variables;
- multi-microphone extension;
- uncertainty propagation for speech information retrieval.
36 Thank you
Audio examples and code are available online.