A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement


A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement
Simon Leglaive (1), Laurent Girin (1, 2), Radu Horaud (1)
(1) Inria Grenoble Rhône-Alpes; (2) Univ. Grenoble Alpes, Grenoble INP, GIPSA-lab
IEEE International Workshop on Machine Learning for Signal Processing (MLSP), September 18, 2018, Aalborg, Denmark

Introduction

Speech enhancement: estimate a clean speech signal from a noisy mixture signal. It is a preprocessing step for various speech information retrieval tasks (e.g. automatic speech recognition, voice activity detection).

Speech enhancement as a source separation problem
In the short-time Fourier transform (STFT) domain, for all $(f, n) \in \mathcal{B} = \{0, \dots, F-1\} \times \{0, \dots, N-1\}$, we observe:
$x_{fn} = s_{fn} + b_{fn}$,   (1)
where $s_{fn} \in \mathbb{C}$ is the clean speech signal, $b_{fn} \in \mathbb{C}$ is the noise signal, $f$ is the frequency index and $n$ the time-frame index.
Objective: separate the speech and noise signals from the observed mixture signal (an under-determined problem).

Variance modeling with non-negative matrix factorization (NMF)
From [1], independently for all $(f, n) \in \mathcal{B}$:
$s_{fn} \sim \mathcal{N}_c\big(0, (\mathbf{W}_s \mathbf{H}_s)_{f,n}\big)$,   (2)
where $\mathbf{W}_s \in \mathbb{R}_+^{F \times K_s}$ is a dictionary matrix of spectral templates, $\mathbf{H}_s \in \mathbb{R}_+^{K_s \times N}$ is the activation matrix, and $K_s$ is the rank of the factorization (usually $K_s (F + N) \ll FN$).
[1] C. Févotte, N. Bertin and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis," Neural Computation, 2009.
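To make the variance model concrete, here is a minimal numpy sketch that draws complex circular Gaussian STFT coefficients from the low-rank NMF variance of Eq. (2); the dimensions are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative dimensions (assumptions of this sketch, not the paper's settings).
F_bins, N_frames, K_s = 513, 100, 32
W_s = np.abs(np.random.randn(F_bins, K_s))    # nonnegative spectral templates
H_s = np.abs(np.random.randn(K_s, N_frames))  # nonnegative activations
var = W_s @ H_s                               # E[|s_fn|^2] = (W_s H_s)_{f,n}

# N_c(0, var): independent real and imaginary parts, each of variance var/2.
s = np.sqrt(var / 2) * (np.random.randn(F_bins, N_frames)
                        + 1j * np.random.randn(F_bins, N_frames))
```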

Supervised NMF
Supervised setting: $\mathbf{W}_s$ is learned on a dataset of clean speech signals; $\mathbf{H}_s$ is estimated from the noisy mixture signal.
Pros and cons: easy to interpret; linear variance model $\mathbb{E}[|s_{fn}|^2] = (\mathbf{W}_s \mathbf{H}_s)_{f,n} = \mathbf{w}_{s,f} \mathbf{h}_{s,n}$, with $\mathbf{w}_{s,f}$ the $f$-th row of $\mathbf{W}_s$ and $\mathbf{h}_{s,n}$ the $n$-th column of $\mathbf{H}_s$; limited number of trainable parameters.

In this work...
... we explore the use of neural networks as an alternative to this supervised NMF-based variance model.

Model

Speech variance modeling with neural networks
From [2, 3], independently for all $(f, n) \in \mathcal{B}$:
$z_n \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_L)$;   (3)
$s_{fn} \mid z_n \sim \mathcal{N}_c\big(0, \sigma_f^2(z_n)\big)$,   (4)
where $z_n \in \mathbb{R}^L$ is a latent random vector with $L \ll F$, and $\sigma_f^2 : \mathbb{R}^L \to \mathbb{R}_+$ is a non-linear function parametrized by $\theta_s$.
[2] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," Proc. of ICLR, 2014.
[3] Y. Bando et al., "Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization," Proc. of IEEE ICASSP, 2018.
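As an illustration, below is a minimal PyTorch sketch of such a generative network mapping $z \in \mathbb{R}^L$ to the $F$ nonnegative variances $\sigma_f^2(z)$; the single hidden layer, tanh activation, and layer sizes are assumptions of this sketch, not necessarily the paper's architecture.

```python
import torch
import torch.nn as nn

class SpeechVarianceDecoder(nn.Module):
    """Maps a latent vector z to the variances sigma_f^2(z), f = 0..F-1."""
    def __init__(self, latent_dim=64, n_freq=513, hidden=128):
        super().__init__()
        self.hidden = nn.Linear(latent_dim, hidden)
        self.log_var = nn.Linear(hidden, n_freq)  # outputs log sigma_f^2(z)

    def forward(self, z):
        h = torch.tanh(self.hidden(z))
        return torch.exp(self.log_var(h))         # exp ensures positivity

# Generative model (3)-(4): z_n ~ N(0, I_L), then s_fn | z_n ~ N_c(0, sigma_f^2(z_n)).
decoder = SpeechVarianceDecoder()
var = decoder(torch.randn(1, 64))                 # shape (1, 513)
```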

Noise and mixture models
Unsupervised noise model: independently for all $(f, n) \in \mathcal{B}$,
$b_{fn} \sim \mathcal{N}_c\big(0, (\mathbf{W}_b \mathbf{H}_b)_{f,n}\big)$,   (5)
where $\mathbf{W}_b \in \mathbb{R}_+^{F \times K_b}$ and $\mathbf{H}_b \in \mathbb{R}_+^{K_b \times N}$.
Mixture model: for all $(f, n) \in \mathcal{B}$,
$x_{fn} = \sqrt{g_n}\, s_{fn} + b_{fn}$,   (6)
where $g_n \in \mathbb{R}_+$ is a gain parameter.
Conditional mixture distribution:
$x_{fn} \mid z_n \sim \mathcal{N}_c\big(0,\, g_n \sigma_f^2(z_n) + (\mathbf{W}_b \mathbf{H}_b)_{f,n}\big)$.   (7)

Inference

Parameter estimation
For now, we assume that the speech parameters $\theta_s$ have been learned during a training phase.
Unsupervised model parameters: $\theta_u = \{\mathbf{W}_b \in \mathbb{R}_+^{F \times K_b},\ \mathbf{H}_b \in \mathbb{R}_+^{K_b \times N},\ \mathbf{g} = [g_0, \dots, g_{N-1}] \in \mathbb{R}_+^N\}$.
Observed data: $x = \{x_{fn} \in \mathbb{C}\}_{(f,n) \in \mathcal{B}}$. Direct maximum likelihood estimation is intractable.
Latent data: $z = \{z_n \in \mathbb{R}^L\}_{n=0}^{N-1}$ → expectation-maximization (EM) algorithm.

Monte Carlo EM algorithm
E-step: from the current value $\theta_u^\star$ of the parameters, compute
$Q(\theta_u; \theta_u^\star) = \mathbb{E}_{p(z|x; \theta_s, \theta_u^\star)}\left[\ln p(x, z; \theta_s, \theta_u)\right] \approx \frac{1}{R} \sum_{r=1}^{R} \ln p\big(x, z^{(r)}; \theta_s, \theta_u\big)$,   (8)
where the samples $\{z^{(r)}\}_{r=1,\dots,R}$ are asymptotically drawn from $p(z|x; \theta_s, \theta_u^\star)$ using a Markov chain Monte Carlo method.
M-step:
$\theta_u^\star \leftarrow \operatorname{arg\,max}_{\theta_u} Q(\theta_u; \theta_u^\star)$,   (9)
with $\theta_u = \{\mathbf{H}_b \in \mathbb{R}_+^{K_b \times N}, \mathbf{W}_b \in \mathbb{R}_+^{F \times K_b}, \mathbf{g} \in \mathbb{R}_+^N\}$.
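For illustration, here is a minimal Python sketch of the Monte Carlo E-step for a single time frame, using a random-walk Metropolis-Hastings sampler targeting $p(z_n | x_n)$; the `decoder_var` callable stands in for the learned variance network $\sigma_f^2(\cdot)$, and the proposal scale and sample counts are arbitrary assumptions.

```python
import numpy as np

def log_lik(x_n, z_n, g_n, noise_var_n, decoder_var):
    """Complex Gaussian log-likelihood ln p(x_n | z_n), Eq. (7), up to a constant."""
    var = g_n * decoder_var(z_n) + noise_var_n
    return np.sum(-np.log(var) - np.abs(x_n) ** 2 / var)

def e_step_mh(x_n, g_n, noise_var_n, decoder_var, L=16, R=50, step=0.01):
    """Asymptotically draws R samples from p(z_n | x_n) for one frame."""
    z = np.zeros(L)
    ll = log_lik(x_n, z, g_n, noise_var_n, decoder_var)
    samples = []
    for _ in range(R):
        z_prop = z + np.sqrt(step) * np.random.randn(L)  # random-walk proposal
        ll_prop = log_lik(x_n, z_prop, g_n, noise_var_n, decoder_var)
        # Standard Gaussian prior on z: include the prior ratio in the MH ratio.
        log_ratio = ll_prop - ll + 0.5 * (z @ z - z_prop @ z_prop)
        if np.log(np.random.rand()) < log_ratio:
            z, ll = z_prop, ll_prop
        samples.append(z.copy())
    return np.stack(samples)                             # shape (R, L)
```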

Speech estimation
Let $\tilde{s}_{fn} = \sqrt{g_n}\, s_{fn}$ denote the scaled speech STFT coefficients.
Posterior mean estimation: for all $(f, n) \in \mathcal{B}$,
$\hat{\tilde{s}}_{fn} = \mathbb{E}_{p(\tilde{s}_{fn} | x_{fn}; \theta_s, \theta_u)}[\tilde{s}_{fn}] = \mathbb{E}_{p(z_n | x_n; \theta_s, \theta_u)}\left[ \dfrac{g_n \sigma_f^2(z_n)}{g_n \sigma_f^2(z_n) + (\mathbf{W}_b \mathbf{H}_b)_{f,n}} \right] x_{fn}$.   (10)
Intractable expectation → Markov chain Monte Carlo.
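Reusing the sampler above, a minimal sketch of the posterior mean estimator (10) for one frame could look as follows; this is an illustrative approximation, not the authors' implementation.

```python
import numpy as np

def enhance_frame(x_n, g_n, noise_var_n, decoder_var, **mh_kwargs):
    """Approximates Eq. (10) with Markov chain samples from e_step_mh."""
    zs = e_step_mh(x_n, g_n, noise_var_n, decoder_var, **mh_kwargs)
    speech_var = g_n * np.stack([decoder_var(z) for z in zs])    # (R, F)
    gain = np.mean(speech_var / (speech_var + noise_var_n), axis=0)
    return gain * x_n        # estimate of the scaled speech STFT frame
```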

Training the generative model with variational autoencoders

Problem setting
Training dataset of STFT speech time frames: $s = \{s_n \in \mathbb{C}^F\}_{n=0}^{N_{tr}-1}$.
Generative model (reminder): independently for all $(f, n) \in \mathcal{B}$:
$z_n \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_L)$; $s_{fn} \mid z_n \sim \mathcal{N}_c\big(0, \sigma_f^2(z_n)\big)$,
where $z_n \in \mathbb{R}^L$ and, in the following, $z = \{z_n\}_{n=0}^{N_{tr}-1}$.
Problem: learn the parameters $\theta_s$ of this generative model (weights and biases of the neural network). Maximum likelihood is intractable → variational autoencoders [2].
[2] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," Proc. of ICLR, 2014.

Variational inference
Find $q(z | s; \phi)$ which approximates $p(z | s; \theta_s)$.
Kullback-Leibler divergence as a measure of fit:
$D_{\mathrm{KL}}\big(q(z | s; \phi) \,\|\, p(z | s; \theta_s)\big) = \ln p(s; \theta_s) - \mathcal{L}(\phi, \theta_s)$,   (11)
where
$\mathcal{L}(\phi, \theta_s) = \underbrace{\mathbb{E}_{q(z|s;\phi)}\left[\ln p(s | z; \theta_s)\right]}_{\text{reconstruction accuracy}} - \underbrace{D_{\mathrm{KL}}\big(q(z | s; \phi) \,\|\, p(z)\big)}_{\text{regularization}}$.   (12)
We would like to maximize $\mathcal{L}(\phi, \theta_s)$ with respect to both $\phi$ and $\theta_s$. We need to define $q(z | s; \phi)$.

Variational distribution
Independently for all $n \in \{0, \dots, N_{tr}-1\}$ and $l \in \{0, \dots, L-1\}$:
$(z_n)_l \mid s_n \sim \mathcal{N}\big(\mu_l(|s_n|^2),\, \sigma_l^2(|s_n|^2)\big)$,   (13)
where $|s_n|^2$ denotes element-wise squared modulus, and $\mu_l : \mathbb{R}_+^F \to \mathbb{R}$ and $\sigma_l^2 : \mathbb{R}_+^F \to \mathbb{R}_+$ are non-linear functions parametrized by $\phi$.
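A minimal PyTorch sketch of such an inference ("encoder") network is shown below; feeding the log of the power spectrum to the network and the layer sizes are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Maps |s_n|^2 to the mean and log-variance of q(z_n | s_n), Eq. (13)."""
    def __init__(self, n_freq=513, latent_dim=64, hidden=128):
        super().__init__()
        self.hidden = nn.Linear(n_freq, hidden)
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, power_spec):
        h = torch.tanh(self.hidden(torch.log(power_spec + 1e-10)))
        return self.mu(h), self.logvar(h)
```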

Variational free energy
$\mathcal{L}(\theta_s, \phi) \overset{c}{=} - \sum_{f=0}^{F-1} \sum_{n=0}^{N_{tr}-1} \mathbb{E}_{q(z_n | s_n; \phi)}\Big[ d_{\mathrm{IS}}\big( |s_{fn}|^2 ; \sigma_f^2(z_n) \big) \Big] + \frac{1}{2} \sum_{l=1}^{L} \sum_{n=0}^{N_{tr}-1} \Big[ \ln \sigma_l^2(|s_n|^2) - \mu_l(|s_n|^2)^2 - \sigma_l^2(|s_n|^2) \Big]$,   (14)
where $d_{\mathrm{IS}}(x; y) = x/y - \ln(x/y) - 1$ is the Itakura-Saito (IS) divergence.
The intractable expectation is approximated by a sample average ("reparametrization trick"). The objective is differentiable with respect to both $\theta_s$ and $\phi$ (backpropagation) and is optimized using a gradient-ascent-based algorithm.
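Putting the pieces together, here is a minimal PyTorch sketch of the negative free energy (14) with a single reparametrized sample per frame; `encoder` and `decoder` refer to the illustrative networks sketched above, and the whole function is a sketch under those assumptions rather than the paper's training code.

```python
import torch

def neg_free_energy(power_spec, encoder, decoder):
    """power_spec: (N, F) tensor of |s_fn|^2. Returns -L(theta_s, phi), Eq. (14)."""
    mu, logvar = encoder(power_spec)
    # Reparametrization trick: z = mu + sigma * eps with eps ~ N(0, I).
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    var = decoder(z)                                    # sigma_f^2(z), (N, F)
    # Itakura-Saito reconstruction term d_IS(|s|^2; sigma_f^2(z)).
    ratio = power_spec / var
    recon = torch.sum(ratio - torch.log(ratio) - 1.0)
    # KL term w.r.t. the standard Gaussian prior (up to an additive constant).
    kl = -0.5 * torch.sum(logvar - mu ** 2 - torch.exp(logvar))
    return recon + kl   # minimize with any gradient-based optimizer
```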

Experiments

Dataset
Clean speech signals: TIMIT database. Noise signals: DEMAND database (domestic environment, nature, office, indoor public spaces, street and transportation).
Training: training set of the TIMIT database; 4 hours of speech; 462 speakers.
Testing: 168 noisy mixtures at 0 dB signal-to-noise ratio; 1 sentence per speaker in the TIMIT test set.

Semi-supervised NMF baseline
Independently for all $(f, n) \in \mathcal{B}$: $s_{fn} \sim \mathcal{N}_c\big(0, (\mathbf{W}_s \mathbf{H}_s)_{f,n}\big)$ and $b_{fn} \sim \mathcal{N}_c\big(0, (\mathbf{W}_b \mathbf{H}_b)_{f,n}\big)$.
Training: from the observed clean speech signals,
$\min_{\mathbf{W}_s \in \mathbb{R}_+^{F \times K_s},\, \mathbf{H}_s \in \mathbb{R}_+^{K_s \times N}} \sum_{(f,n) \in \mathcal{B}} d_{\mathrm{IS}}\big( |s_{fn}|^2 ; (\mathbf{W}_s \mathbf{H}_s)_{f,n} \big)$.
Inference: from the observed mixture signal $x_{fn} = s_{fn} + b_{fn}$,
$\min_{\mathbf{H}_s \in \mathbb{R}_+^{K_s \times N},\, \mathbf{W}_b \in \mathbb{R}_+^{F \times K_b},\, \mathbf{H}_b \in \mathbb{R}_+^{K_b \times N}} \sum_{(f,n) \in \mathcal{B}} d_{\mathrm{IS}}\big( |x_{fn}|^2 ; (\mathbf{W}_s \mathbf{H}_s + \mathbf{W}_b \mathbf{H}_b)_{f,n} \big)$.
Speech reconstruction: $\hat{s}_{fn} = \dfrac{(\mathbf{W}_s \mathbf{H}_s)_{f,n}}{(\mathbf{W}_s \mathbf{H}_s + \mathbf{W}_b \mathbf{H}_b)_{f,n}}\, x_{fn}$.
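For reference, below is a minimal numpy sketch of the baseline's inference step: the standard multiplicative updates for Itakura-Saito NMF [1] with $\mathbf{W}_s$ held fixed, followed by the Wiener-like reconstruction. The random initialization, iteration count, and `eps` guard are assumptions of this sketch.

```python
import numpy as np

def is_nmf_inference(X_pow, W_s, K_b, n_iter=100, eps=1e-10):
    """X_pow: (F, N) mixture power spectrogram |x_fn|^2; W_s is kept fixed."""
    F_bins, N = X_pow.shape
    H_s = np.abs(np.random.randn(W_s.shape[1], N))
    W_b = np.abs(np.random.randn(F_bins, K_b))
    H_b = np.abs(np.random.randn(K_b, N))
    for _ in range(n_iter):
        # Multiplicative IS-NMF updates: factor <- factor * (num / den).
        V = W_s @ H_s + W_b @ H_b + eps
        H_s *= (W_s.T @ (X_pow / V ** 2)) / (W_s.T @ (1 / V) + eps)
        V = W_s @ H_s + W_b @ H_b + eps
        W_b *= ((X_pow / V ** 2) @ H_b.T) / ((1 / V) @ H_b.T + eps)
        V = W_s @ H_s + W_b @ H_b + eps
        H_b *= (W_b.T @ (X_pow / V ** 2)) / (W_b.T @ (1 / V) + eps)
    return H_s, W_b, H_b

def wiener_reconstruct(X, W_s, H_s, W_b, H_b, eps=1e-10):
    """Speech reconstruction from the estimated variance parameters."""
    S_var, B_var = W_s @ H_s, W_b @ H_b
    return S_var / (S_var + B_var + eps) * X
```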

Fully-supervised deep-learning reference method
Fully-supervised deep-learning approach proposed in [4]: a deep neural network is trained to map noisy speech log-power spectrograms to clean speech log-power spectrograms. Trained with more than 100 different noise types → effective in handling unseen noise types.
[4] Y. Xu et al., "A regression approach to speech enhancement based on deep neural networks," IEEE Transactions on Audio, Speech and Language Processing, 2015.

Experiments
The enhanced speech quality is evaluated in terms of:
Signal-to-distortion ratio (SDR) in decibels (dB).
Perceptual evaluation of speech quality (PESQ) measure, between 0.5 and 4.5. The higher, the better.
Different values for the latent dimension $L$ and speech NMF rank $K_s$: 8, 16, 32, 64 or 128.
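As a reminder of the first metric, a naive SDR computation is sketched below; this simplified form (treating the whole estimation error as distortion, without the BSS Eval projections) is an assumption of the sketch, not necessarily the evaluation code used for the paper.

```python
import numpy as np

def sdr_db(reference, estimate):
    """Naive signal-to-distortion ratio (dB) between time-domain signals."""
    distortion = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(distortion ** 2))
```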

Experimental results (SDR). [Boxplots of SDR per method and model size; median value indicated above each boxplot.]

Experimental results (PESQ). [Boxplots of PESQ per method and model size; median value indicated above each boxplot.]

Musical audio example. All models have been trained on speech (not singing voice). [Spectrograms, frequency (kHz) vs. time (s): mixture, original voice, fully-supervised DNN, semi-supervised NMF, and proposed method.]

Conclusion

Conclusion
Variational autoencoders are an interesting alternative to supervised NMF models.
Some perspectives: Monte Carlo EM is slow → variational inference; temporal model on the latent variables; multi-microphone extension; uncertainty propagation for speech information retrieval.

Thank you
Audio examples and code available online: https://sleglaive.github.io