Singing Voice Separation using Generative Adversarial Networks

Hyeong-seok Choi, Kyogu Lee
Music and Audio Research Group
Graduate School of Convergence Science and Technology
Seoul National University
{kekepa15, kglee}@snu.ac.kr

Ju-heon Lee
College of Liberal Studies
Seoul National University
juheon@snu.ac.kr

Abstract

In this paper, we propose a novel approach that extends Wasserstein generative adversarial networks (GANs) [3] to separate the singing voice from a mixture signal. We use the mixture signal as a condition to generate singing voices and adopt a U-net style network for stable training of the model. Experiments on the DSD100 dataset show promising results and demonstrate the potential of GANs for music source separation.

1 Introduction

Music source separation is the task of separating a specific source from a music signal. Separating a source from the mixture signal can be interpreted as maximizing the likelihood of the source given the mixture. We approach this task with GANs [1], which can be viewed as a method that maximizes the likelihood through an implicit density. GANs are usually used to produce samples from noise, but recent work [7, 8] has explored tailoring the generated sample to a given constraint. In this paper, we aim to generate singing voice signals using mixture signals as the condition.

2 Background

GANs are generative models that learn a generator function G_θ mapping noise samples z ∼ p(z) into the real data space. Training GANs is often described as a mini-max game between two players, the discriminator (D) and the generator (G) [1]. The input of D is either a real sample x ∼ P_r or a fake sample x̃ ∼ P_g, and the task of D is to classify x̃ as fake and x as real. Many improved GAN models have been proposed [2, 3, 5], and one of the notable GAN studies that provides both theoretical background and practical results is the Wasserstein GAN.
It is a model that tries to reduce the Wasserstein distance between the data distribution P_r and the generated sample distribution P_g. Using the Wasserstein distance, GAN training can be formulated as follows. Note that x̃ = G(z), z ∼ p(z), and \mathcal{D} is the set of functions satisfying the 1-Lipschitz condition.

\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})]   (1)

In order to enforce that D satisfies the 1-Lipschitz condition, [4] suggests regularizing the objective function with a gradient penalty term. Note that P_{\hat{x}} is a sampling distribution that samples from the straight line between x ∼ P_r and x̃ ∼ P_g, that is, x̂ = εx + (1 − ε)x̃ with 0 ≤ ε ≤ 1, and λ_g is the gradient penalty coefficient.

L = \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathbb{E}_{x \sim P_r}[D(x)] + \lambda_g \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]   (2)

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
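As a concrete illustration, the interpolation and penalty of the objective above can be sketched with a toy linear critic D(x) = w·x, whose input gradient is w in closed form, so no autograd is needed. The critic weights, data, and coefficient below are illustrative assumptions, not the model used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear critic D(x) = w . x; its gradient w.r.t. the input is simply w,
# so the gradient penalty can be evaluated without automatic differentiation.
w = np.array([3.0, 4.0])        # ||w||_2 = 5, i.e. D is 5-Lipschitz
lambda_g = 10.0                 # gradient penalty coefficient (assumed value)

x_real = rng.normal(size=(8, 2))   # stand-in samples from P_r
x_fake = rng.normal(size=(8, 2))   # stand-in samples from P_g

# Sample x_hat uniformly on the straight line between real/fake pairs.
eps = rng.uniform(size=(8, 1))
x_hat = eps * x_real + (1.0 - eps) * x_fake

# For a linear critic, grad_x D(x_hat) = w at every x_hat.
grad_norms = np.full(len(x_hat), np.linalg.norm(w))
penalty = lambda_g * np.mean((grad_norms - 1.0) ** 2)

# Critic loss as in the objective: E[D(fake)] - E[D(real)] + penalty.
critic_loss = np.mean(x_fake @ w) - np.mean(x_real @ w) + penalty
print(penalty)  # 10 * (5 - 1)^2 = 160.0
```

In a real network the gradient norm would be computed by backpropagating through D at each x̂; the linear critic simply makes that gradient constant and the penalty exact.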
3 Model setup

3.1 Objective function

We define x_m, x_s, and x̃_s as the mixture, the real source paired with the mixture, and the fake (generated) source paired with the mixture, respectively. In our setting, the goal of G is to transform x_m into a x̃_s as similar as possible to x_s, and the goal of D is to distinguish the real source x_s from the fake source x̃_s conditioned on x_m. To formulate this, we change the aforementioned objective (2) into the conditional GAN form [7, 8]. Thus, the input of D becomes the concatenation of either (x_m, x_s) or (x_m, x̃_s). For the gradient penalty term, we uniformly sample x̂_ms ∼ P_{x̂_ms} from the straight line between the concatenations (x_m, x_s) and (x_m, x̃_s) [4].

L = \mathbb{E}_{x_m \sim P_{data}, \tilde{x}_s \sim P_g}[D(x_m, \tilde{x}_s)] - \mathbb{E}_{(x_m, x_s) \sim P_{data}}[D(x_m, x_s)] + \lambda_g \, \mathbb{E}_{(x_m, x_s) \sim P_{data}, \tilde{x}_s \sim P_g, \hat{x}_{ms} \sim P_{\hat{x}_{ms}}}[(\|\nabla_{\hat{x}_{ms}} D(x_m, \hat{x}_{ms})\|_2 - 1)^2]   (3)

As the final objective for the generator, we add an l1 loss term to examine the effect of a more conventional loss, and experiment with three cases: an objective containing only the l1 loss, only the generative adversarial loss, and both terms together. Our final objectives for the generator (L_G) and the discriminator (L_D) are therefore as follows, where λ_D, λ_g, and λ_l1 denote the coefficients of the adversarial loss, the gradient penalty, and the l1 loss.

L_G = -\lambda_D \, \mathbb{E}_{x_m \sim P_{data}, \tilde{x}_s \sim P_g}[D(x_m, \tilde{x}_s)] + \lambda_{l1} \, \mathbb{E}_{x_s \sim P_r, \tilde{x}_s \sim P_g}[\|x_s - \tilde{x}_s\|_1]   (4)

L_D = \lambda_D \, (\mathbb{E}_{x_m \sim P_{data}, \tilde{x}_s \sim P_g}[D(x_m, \tilde{x}_s)] - \mathbb{E}_{(x_m, x_s) \sim P_{data}}[D(x_m, x_s)]) + \lambda_g \, \mathbb{E}_{(x_m, x_s) \sim P_{data}, \tilde{x}_s \sim P_g, \hat{x}_{ms} \sim P_{\hat{x}_{ms}}}[(\|\nabla_{\hat{x}_{ms}} D(x_m, \hat{x}_{ms})\|_2 - 1)^2]   (5)

3.2 Network structure for generator

Our generator model is constructed as follows. First, as the deep neural network we adopt the U-net structure [6]. U-net consists of encoding and decoding stages, and the layers in each stage are convolutional. In the encoding stage, inputs are encoded by convolutional layers followed by batch normalization, until the input becomes a vector of length 2048.
Then, using a fully connected (fc) layer, we encode it into a vector of length 512. In the following decoding stage, the input of each layer is concatenated along the channel axis with the corresponding layer of the encoding stage via a skip connection. The concatenated inputs are then decoded by deconvolutional layers followed by batch normalization. As the non-linearity of each convolutional and deconvolutional layer we use leaky ReLU, except for the last layer, which uses ReLU. More details are given in Figure 1.

3.3 Network structure for discriminator

Our discriminator model is constructed as follows. First, the input, either (x_m, x_s) or (x_m, x̃_s), is concatenated along the channel axis. We use 5 convolutional layers without batch normalization, since batch normalization is invalid in the gradient penalty setting [4]. After each convolutional layer we use leaky ReLU, except for the last layer, which uses no non-linearity. More details are given in Figure 2. One noticeable aspect of our discriminator is that we intentionally make the output a patch of critic values rather than a single scalar [7]. This allows each pixel of the output to contribute equally to the Wasserstein distance, which we compute by simply taking the mean of the output pixel values. In this way, each output pixel corresponds to a different receptive region of the same size, 115 × 31. Intuitively, we assume this is better than a full receptive field covering the entire input (512 × 128): since the receptive size is roughly a quarter of the input size, each pixel can make a decision over a different time-frequency region of the input. In practice, we also found that training not only becomes time-consuming but also fails when the receptive field grows as the discriminator becomes deeper.
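The per-pixel receptive field of such a patch discriminator can be computed recursively from the kernel sizes and strides of its layers. The 5-layer configuration below is a hypothetical example (the exact filter sizes appear only in Figure 2), used to show the recursion.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit of a stack of
    convolutions, given as a list of (kernel_size, stride) pairs."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k-1)*jump
        jump *= stride             # stride compounds the spacing between taps
    return rf

# Hypothetical 5-layer critic with 5x5 kernels and stride 2 in every layer
# (an assumption, not the paper's exact configuration).
print(receptive_field([(5, 2)] * 5))  # -> 125
```

Under these assumed hyperparameters each output pixel sees a 125 × 125 input patch, illustrating how a handful of strided layers already covers a region comparable to the 115-pixel receptive height quoted above.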
Figure 1: Network structure for generator. It consists of two stages, encoding and decoding, with skip connections from the encoding layers. F denotes the filter size, S_h the stride over height, S_w the stride over width, and C the output channel of the next layer.

Figure 2: Network structure for discriminator.

4 Preliminary experiments

4.1 Dataset

The DSD100 dataset was used for model training. DSD100 consists of 50 songs as a development set and 50 songs as a test set, each consisting of the mixture and four sources (vocals, bass, drums, and others). All recordings are digitized at a sampling frequency of 44,100 Hz.

4.2 Mini-batch composition

To train our conditional GAN model, we compose each mini-batch of two parts: a condition part and a target source part. It might seem natural to include only mixtures in the condition part, but we include some proportion of singing voice sources as well as the mixtures. This is motivated by the nature of common popular music, which includes intros, interludes, and outros consisting only of accompaniment; hence the term "mixture" by itself says little about whether the signal contains both singing voice and accompaniment. As a result, the target source often turns out to be a zero matrix, which harms the training of the model. Moreover, in real-world music the singing voice can also appear on its own (e.g., a cappella). Therefore we thought
there is also a need to prepare for this situation. In most experiments, the ratio between mixtures and singing voices in the condition part was set to 7:1. This is illustrated in Figure 3.

Figure 3: Composition of the mini-batch used in training. The mixture, true vocal, and fake vocal from the generator are denoted as x_m, x_s, and x̃_s, respectively.

4.3 Pre- & post-processing

As preprocessing, the songs in the dataset are split into short audio segments with an overlap of 1 second between consecutive segments. Each stereo segment is then converted to mono by taking the mean of the two channels. Next, we down-sample the segment and perform a short-time Fourier transform with a window size of 1024 samples and a hop length of 256 samples. This setting turns each audio segment into a matrix of size 512 × 128. As post-processing, to turn the extracted vocal spectrogram back into a waveform, we simply apply the inverse short-time Fourier transform using the phase of the input mixture spectrogram.

4.4 Results

In Figure 4, we show log-magnitude spectrograms that compare the effect of the generative adversarial loss and the l1 loss. We found that with the generative adversarial loss, the network removes the accompaniment more aggressively than when only the l1 loss is used. Thus, we assume that one key to training this model is adjusting the coefficients of the l1 loss (λ_l1) and the adversarial loss (λ_D). Still, we have not yet evaluated our algorithm with the common metrics of the music source separation task: SDR (Source to Distortion Ratio), SIR (Source to Interference Ratio), and SAR (Source to Artifact Ratio).
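For reference, the simplest BSS-eval-style definition of SDR projects the estimate onto the true source and measures the ratio of target energy to residual energy. The toy signals below (a sinusoid as a stand-in "vocal" plus low-frequency interference) are illustrative assumptions, not data from our experiments.

```python
import numpy as np

def sdr(reference, estimate):
    """Source-to-Distortion Ratio in dB (BSS-eval style): project the
    estimate onto the reference and treat the remainder as distortion."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    # Scaling-invariant target component of the estimate.
    s_target = (estimate @ reference) / (reference @ reference) * reference
    distortion = estimate - s_target
    return 10.0 * np.log10((s_target @ s_target) / (distortion @ distortion))

t = np.linspace(0.0, 1.0, 8000)
clean = np.sin(2 * np.pi * 440.0 * t)               # stand-in "vocal" source
noisy = clean + 0.1 * np.cos(2 * np.pi * 50.0 * t)  # estimate with interference
print(sdr(clean, noisy))  # roughly 20 dB for this 10:1 amplitude ratio
```

Full BSS-eval additionally decomposes the residual into interference and artifact components to obtain SIR and SAR; the projection step is the same.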
However, for a fair quantitative evaluation, we plan to compare our model with the algorithm evaluation results of the Signal Separation Evaluation Campaign (SiSEC). The generated vocal samples of our model are available on the demo website.¹

Figure 4: Log-magnitude spectrograms of (a) the mixture, (b) the true vocal, (c) the estimated vocal using the generative adversarial loss only, (d) the estimated vocal using the l1 loss only, and (e) the estimated vocal using both the generative adversarial loss and the l1 loss.

¹ Demo audio samples for our model are available on the website: https://kekepa15.github.io/
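The post-processing step described in Section 4.3, pairing an estimated magnitude spectrogram with the mixture's phase before inversion, can be sketched as follows. The signal, sample rate, and STFT parameters here are illustrative assumptions, not the paper's exact configuration; for demonstration the "estimated" magnitude is simply the mixture magnitude itself.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 8192                                   # illustrative sample rate
t = np.arange(fs) / fs
mixture = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 55 * t)

# Analysis STFT of the mixture (window and hop are assumed values).
_, _, Z_mix = stft(mixture, fs=fs, nperseg=1024, noverlap=768)

# Stand-in for the generator output: an estimated magnitude spectrogram.
est_magnitude = np.abs(Z_mix)

# Post-processing: combine the estimated magnitude with the mixture phase
# and invert with the inverse STFT to obtain a waveform.
Z_est = est_magnitude * np.exp(1j * np.angle(Z_mix))
_, waveform = istft(Z_est, fs=fs, nperseg=1024, noverlap=768)
```

Because the stand-in magnitude equals the mixture magnitude, the inverse STFT here reconstructs the mixture almost exactly; with a real vocal estimate, the mixture phase serves only as an approximation of the unknown vocal phase.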
References

[1] Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, 2672-2680.

[2] Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., & Smolley, S. P. (2017). Least squares generative adversarial networks. arXiv preprint arXiv:1611.04076.

[3] Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. In International Conference on Machine Learning, 214-223.

[4] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017). Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028.

[5] Kodali, N., Abernethy, J., Hays, J., & Kira, Z. (2017). How to Train Your DRAGAN. arXiv preprint arXiv:1705.07215.

[6] Ronneberger, O. (2017). Invited Talk: U-Net Convolutional Networks for Biomedical Image Segmentation. Informatik aktuell: Bildverarbeitung für die Medizin 2017.

[7] Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004.

[8] Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.