Singing Voice Separation using Generative Adversarial Networks

Hyeong-seok Choi, Kyogu Lee
Music and Audio Research Group
Graduate School of Convergence Science and Technology
Seoul National University
{kekepa15, kglee}@snu.ac.kr

Ju-heon Lee
College of Liberal Studies
Seoul National University
juheon@snu.ac.kr

Abstract

In this paper, we propose a novel approach that extends Wasserstein generative adversarial networks (GANs) [3] to separate the singing voice from a mixture signal. We use the mixture signal as a condition to generate singing voices and apply a U-net style network for stable training of the model. Experiments on the DSD100 dataset show promising results and the potential of GANs for music source separation.

1 Introduction

Music source separation is the process of separating a specific source from a music signal. Separating a source from the mixture signal can be interpreted as maximizing the likelihood of the source given the mixture. Our task is to do this using GANs [1], which maximize the likelihood through an implicit density. GANs are usually used to produce samples from noise, but recent research [7, 8] aims to better tailor the generated sample to a given constraint. In this paper, we aim to generate singing voice signals using mixture signals as a condition.

2 Background

GANs are generative models that learn a generator function G_θ mapping noise samples z ~ p(z) into the real data space. Training GANs is often described as a mini-max game between two players, the discriminator (D) and the generator (G) [1]. The input of D is either a real sample x ~ P_r or a fake sample x̃ ~ P_g, and the task of D is to classify x̃ as fake and x as real. Many improved GAN models have been proposed [2, 3, 5]; one notable study that provides both a theoretical background and practical results is the Wasserstein GAN. It tries to reduce the Wasserstein distance between the data distribution P_r and the generated sample distribution P_g. Using the Wasserstein distance, GAN training can be formulated as follows, where x̃ = G(z), z ~ p(z), and 𝒟 is the set of functions satisfying the 1-Lipschitz condition:

\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})]   (1)

To enforce the 1-Lipschitz condition on D, [4] suggests regularizing the objective function with a gradient penalty term. Here P_x̂ is the distribution obtained by sampling uniformly along the straight line between x ~ P_r and x̃ ~ P_g, that is, x̂ = εx + (1 - ε)x̃ with 0 ≤ ε ≤ 1, and λ_g is the gradient penalty coefficient.

L = \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathbb{E}_{x \sim P_r}[D(x)] + \lambda_g \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\big]   (2)

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
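To make the critic objective concrete, the following is a minimal PyTorch sketch of the WGAN-GP loss in Eq. (2). The function and variable names (critic, real, fake, lambda_gp) and the default penalty coefficient are illustrative assumptions, not code from this work.

```python
# Minimal sketch of the WGAN-GP critic loss (Eq. 2), assuming a PyTorch critic
# that maps a batch of samples to one score per sample.
import torch

def wgan_gp_critic_loss(critic, real, fake, lambda_gp=10.0):
    # Wasserstein term: E[D(fake)] - E[D(real)]
    loss = critic(fake).mean() - critic(real).mean()

    # Gradient penalty on points sampled along straight lines between real and fake
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return loss + lambda_gp * ((grad_norm - 1.0) ** 2).mean()
```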

3 Model setup

3.1 Objective function

We define x_m, x_s, and x̃_s as the mixture, the real source paired with the mixture, and the fake (generated) source paired with the mixture, respectively. In our setting, the goal of G is to transform x_m into a x̃_s that is as similar as possible to x_s, and the goal of D is to distinguish the real source x_s from the fake source x̃_s conditioned on x_m. To formulate this, we change the aforementioned objective (2) into the conditional GAN fashion [7, 8]. Thus, the input of D becomes the concatenation of either (x_m, x_s) or (x_m, x̃_s). For the gradient penalty term, we uniformly sample x̂_ms ~ P_x̂ms from the straight line between the concatenations (x_m, x_s) and (x_m, x̃_s) [4].

L = \mathbb{E}_{x_m \sim P_{data},\, \tilde{x}_s \sim P_g}[D(x_m, \tilde{x}_s)] - \mathbb{E}_{(x_m, x_s) \sim P_{data}}[D(x_m, x_s)] + \lambda_g \, \mathbb{E}_{(x_m, x_s) \sim P_{data},\, \tilde{x}_s \sim P_g,\, \hat{x}_{ms} \sim P_{\hat{x}_{ms}}}\big[(\|\nabla_{\hat{x}_{ms}} D(x_m, \hat{x}_{ms})\|_2 - 1)^2\big]   (3)

As a final objective for the generator, we add an l1 loss term to examine the effect of a more conventional loss, and experiment with three cases: an objective containing only the l1 loss, only the generative adversarial loss, and one that adds both terms together. Our final objectives for the generator (L_G) and the discriminator (L_D) are as follows, where the coefficients for the adversarial loss, the gradient penalty loss, and the l1 loss are denoted as λ_D, λ_g, and λ_l1.

L_G = -\lambda_D \, \mathbb{E}_{x_m \sim P_{data},\, \tilde{x}_s \sim P_g}[D(x_m, \tilde{x}_s)] + \lambda_{l1} \, \mathbb{E}_{x_s \sim P_r,\, \tilde{x}_s \sim P_g}[\|x_s - \tilde{x}_s\|_1]   (4)

L_D = \lambda_D \big(\mathbb{E}_{x_m \sim P_{data},\, \tilde{x}_s \sim P_g}[D(x_m, \tilde{x}_s)] - \mathbb{E}_{(x_m, x_s) \sim P_{data}}[D(x_m, x_s)]\big) + \lambda_g \, \mathbb{E}_{(x_m, x_s) \sim P_{data},\, \tilde{x}_s \sim P_g,\, \hat{x}_{ms} \sim P_{\hat{x}_{ms}}}\big[(\|\nabla_{\hat{x}_{ms}} D(x_m, \hat{x}_{ms})\|_2 - 1)^2\big]   (5)

3.2 Network structure for generator

Our generator model is constructed as follows. As the deep neural network we adopt the U-net structure [6]. The U-net consists of an encoding stage and a decoding stage, and the layers in each stage are convolutional layers. In the encoding stage, the input is encoded with convolutional layers followed by batch normalization until it becomes a vector with a length of 2048. Then, using a fully connected (FC) layer, we encode it into a vector with a length of 512. In the decoding stage, the input of each layer is concatenated along the channel axis with the corresponding encoding layer through a skip connection. The concatenated layers are then decoded by deconvolutional layers followed by batch normalization. For the non-linearity of each convolutional and deconvolutional layer we use leaky ReLU, except for the last layer, which uses ReLU. Further details are given in Figure 1.

3.3 Network structure for discriminator

Our discriminator model is constructed as follows. The input, either (x_m, x_s) or (x_m, x̃_s), is concatenated along the channel axis. We use 5 convolutional layers without batch normalization, since batch normalization is not valid in the gradient penalty setting [4]. After each convolutional layer we use leaky ReLU as the non-linearity, except for the last layer, which uses no non-linearity. Further details are given in Figure 2. One noticeable aspect of our discriminator model is that we intentionally make the output a two-dimensional grid of values rather than a single scalar, as in [7]. This allows each pixel value of the output to contribute equally to the Wasserstein distance, which we compute by simply taking the mean of the output pixel values. In this way, each output pixel corresponds to a different receptive region of the input, all with the same receptive size of 115 × 31. Intuitively, we assume this is a better choice than a receptive field covering the full input size (512 × 128): since the receptive size is roughly a quarter of the input size, each pixel makes a decision over a different time-frequency region of the input. In practice, we also found that training is not only more time consuming but also fails when the receptive size grows as the discriminator becomes deeper.
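A minimal PyTorch sketch of the conditional objectives in Eqs. (3)-(5) is given below, assuming spectrograms of shape (batch, 1, freq, time) and a critic D whose patch output is averaged to a scalar. The names G, D, lambda_d, lambda_gp, lambda_l1 and their default values are illustrative assumptions, not the authors' code.

```python
# Sketch of the conditional WGAN-GP losses with an added l1 term (Eqs. 3-5).
import torch
import torch.nn.functional as F

def gradient_penalty(D, x_m, x_s, x_s_fake):
    """Penalty on interpolates between (x_m, x_s) and (x_m, x_s_fake); the
    gradient is taken with respect to the interpolated source only."""
    eps = torch.rand(x_s.size(0), 1, 1, 1, device=x_s.device)
    x_hat = (eps * x_s + (1.0 - eps) * x_s_fake).requires_grad_(True)
    d_hat = D(torch.cat([x_m, x_hat], dim=1))        # condition concatenated on channel axis
    grads = torch.autograd.grad(d_hat.sum(), x_hat, create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()

def discriminator_loss(D, x_m, x_s, x_s_fake, lambda_d=1.0, lambda_gp=10.0):
    x_s_fake = x_s_fake.detach()                     # no gradients into G when updating D
    d_fake = D(torch.cat([x_m, x_s_fake], dim=1)).mean()
    d_real = D(torch.cat([x_m, x_s], dim=1)).mean()  # patch output averaged to a scalar
    return lambda_d * (d_fake - d_real) + lambda_gp * gradient_penalty(D, x_m, x_s, x_s_fake)

def generator_loss(D, x_m, x_s, x_s_fake, lambda_d=1.0, lambda_l1=100.0):
    adv = -D(torch.cat([x_m, x_s_fake], dim=1)).mean()  # maximize critic score on the fake source
    l1 = F.l1_loss(x_s_fake, x_s)
    return lambda_d * adv + lambda_l1 * l1
```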

Figure 1: Network structure for the generator. It consists of two stages, encoding and decoding, with skip connections from the encoding layers. F denotes the filter size, S_h the stride over height, S_w the stride over width, and C the output channel count for the next layer.

Figure 2: Network structure for the discriminator.

4 Preliminary experiments

4.1 Dataset

The DSD100 dataset was used for model training. DSD100 consists of 50 songs as a development set and 50 songs as a test set, each provided as a mixture and four sources (vocals, bass, drums, and others). All recordings are digitized with a sampling frequency of 44,100 Hz.

4.2 Mini-batch composition

To train our conditional GAN model, we compose each mini-batch of two parts, a condition part and a target source part. It might seem natural to include only mixtures in the condition part, but we compose the condition part from mixtures together with some proportion of singing voice sources. We do this because common popular music includes intros, interludes, and outros consisting only of accompaniment, so the term "mixture" by itself does not guarantee that a signal contains both singing voice and accompaniment. Because of this, the target source often turns out to be a zero matrix, which is not good for training the model. Moreover, in real-world music there is also a chance of the singing voice appearing by itself (e.g., a cappella), so we thought there is also a need to prepare for this situation.

In most experiments, the ratio between the mixtures and the singing voices in the condition part was adjusted to 7:1. This is illustrated in Figure 3.

Figure 3: Composition of the mini-batch used in training. The mixture, the true vocal, and the fake vocal from the generator are denoted as x_m, x_s, and x̃_s, respectively.

4.3 Pre- & post-processing

As pre-processing, the songs in the dataset are split into audio segments of fixed length with an overlap of 1 second between adjacent segments. We then convert each stereo segment to mono by taking the mean of the two channels, down-sample it to a lower sampling rate, and perform a short-time Fourier transform on the waveform with a window size of 1024 and a hop length of 256. This turns each audio segment into a matrix of size 512 × 128. As post-processing, to convert the extracted vocal spectrogram back into a waveform, we simply apply the inverse short-time Fourier transform using the phase of the input mixture spectrogram.

4.4 Results

In Figure 4, we show log-magnitude spectrograms to compare the effect of the generative adversarial loss and the l1 loss. We found that with the generative adversarial loss, the network removes the accompaniment more aggressively than when only the l1 loss is used. We therefore assume that one of the keys to training this model is adjusting the coefficients of the l1 loss (λ_l1) and the generative adversarial loss (λ_D). We have not yet evaluated our algorithm with the common metrics of the music source separation task, namely SDR (Source to Distortion Ratio), SIR (Source to Interference Ratio), and SAR (Source to Artifact Ratio). However, for a fair quantitative evaluation, we plan to compare our model with the algorithm evaluation results of the Signal Separation Evaluation Campaign (SiSEC). Generated vocal samples from our model are available on the demo website: https://kekepa15.github.io/

Figure 4: Log-magnitude spectrograms of (a) the mixture, (b) the true vocal, (c) the estimated vocal using the generative adversarial loss only, (d) the estimated vocal using the l1 loss only, and (e) the estimated vocal using both the generative adversarial loss and the l1 loss.
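The pre- and post-processing of Section 4.3 can be sketched as follows with librosa. The window size, hop length, and sampling rate used here are illustrative assumptions, not values confirmed by the paper.

```python
# Sketch of the pre-processing (mono mixdown, STFT magnitude) and the
# post-processing (reconstruction with the mixture phase).
import numpy as np
import librosa

SR = 22050      # assumed sampling rate after down-sampling
N_FFT = 1024    # assumed STFT window size
HOP = 256       # assumed hop length

def to_magnitude_spectrogram(path):
    y, _ = librosa.load(path, sr=SR, mono=True)   # stereo is averaged to mono
    spec = librosa.stft(y, n_fft=N_FFT, hop_length=HOP)
    return np.abs(spec), np.angle(spec)           # magnitude for the network, phase kept for later

def reconstruct_vocal(vocal_magnitude, mixture_phase):
    # Combine the estimated vocal magnitude with the phase of the input mixture
    vocal_spec = vocal_magnitude * np.exp(1j * mixture_phase)
    return librosa.istft(vocal_spec, hop_length=HOP)
```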

References

[1] Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, 2672-2680.

[2] Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., & Smolley, S. P. (2016). Least squares generative adversarial networks. arXiv preprint arXiv:1611.04076.

[3] Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. In International Conference on Machine Learning, 214-223.

[4] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017). Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028.

[5] Kodali, N., Abernethy, J., Hays, J., & Kira, Z. (2017). How to train your DRAGAN. arXiv preprint arXiv:1705.07215.

[6] Ronneberger, O. (2017). Invited talk: U-Net convolutional networks for biomedical image segmentation. Informatik aktuell - Bildverarbeitung für die Medizin 2017.

[7] Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2016). Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004.

[8] Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.