WaveNet: A Generative Model for Raw Audio

1 WaveNet: A Generative Model for Raw Audio Ido Guy & Daniel Brodeski Deep Learning Seminar 2017 TAU

2 Outline Introduction WaveNet Experiments

3 Introduction WaveNet is a deep generative model of raw audio waveforms. WaveNets can generate speech that mimics any human voice and sounds more natural than the best existing text-to-speech (TTS) systems, reducing the gap with human performance by over 50%. WaveNets can also be used to synthesize other audio signals such as music.

4 Introduction - DeepMind WaveNet was introduced by Google DeepMind. Human-level control through Deep Reinforcement Learning: a single program (general-purpose artificial agent) that taught itself how to play and win at 49 completely different Atari titles, with just raw pixels and score as inputs. AlphaGo: the first computer program to defeat a professional human Go player, the first program to defeat a Go world champion, and arguably the strongest Go player in history. Go is one of the most complex and intuitive games ever devised, with more positions than there are atoms in the universe.

5 Introduction - TTS A text-to-speech (TTS) system converts normal language text into speech. There are two main approaches in TTS today. Concatenative TTS: a very large database of short speech fragments is recorded from a single speaker, and the fragments are then recombined to form complete utterances. Parametric TTS: all the information required to generate the data is stored in the parameters of the model, and the contents and characteristics of the speech can be controlled via the inputs to the model. Existing parametric models typically generate audio signals by passing their outputs through signal processing algorithms known as vocoders.

6 Introduction - TTS Each of the current TTS approaches has its pros and cons. Concatenative TTS: sounds more natural than parametric TTS; large footprint; difficult to modify the voice characteristics (for example switching to a different speaker, or altering the emphasis or emotion of their speech) without recording a whole new database. Parametric TTS: small footprint; flexibility to change voice characteristics; sounds less natural than concatenative TTS.

7 WaveNet WaveNet is an autoregressive model, in which the prediction for each sample is influenced by all the previous ones. The joint probability of a waveform $x = \{x_1, \dots, x_T\}$ is factorized as a product of conditional probabilities: $p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$. Each audio sample $x_t$ is therefore conditioned on the samples at all previous timesteps. The conditional probability is modelled by a stack of convolutional layers. There are no pooling layers in the network, and the output of the network has the same time dimensionality as the input.
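
The factorization can be illustrated with a toy sketch. The function `toy_conditional` below is a hypothetical stand-in for the network (it is not WaveNet, and the probabilities are made up); it only shows how the joint log-probability is a sum of per-sample conditional terms and how generation feeds each sample back in as history.

```python
import math
import random

def toy_conditional(history):
    """Hypothetical stand-in for the network: p(x_t | x_1..x_{t-1}) over {0, 1}.

    Favors repeating the previous sample; uniform when there is no history.
    """
    if not history:
        return [0.5, 0.5]
    return [0.8, 0.2] if history[-1] == 0 else [0.2, 0.8]

def joint_log_prob(x, conditional):
    """log p(x) = sum over t of log p(x_t | x_1, ..., x_{t-1})."""
    return sum(math.log(conditional(x[:t])[x[t]]) for t in range(len(x)))

def sample(length, conditional, rng):
    """Draw samples one at a time, feeding each one back in as history."""
    x = []
    for _ in range(length):
        probs = conditional(x)
        x.append(rng.choices(range(len(probs)), weights=probs)[0])
    return x

waveform = [0, 0, 1, 1]
log_p = joint_log_prob(waveform, toy_conditional)  # log(0.5 * 0.8 * 0.2 * 0.8)
generated = sample(8, toy_conditional, random.Random(0))
```

Generation is sequential by construction: sample t cannot be drawn before samples 1 to t-1 exist.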

8 WaveNet - PixelCNN Pixel Recurrent Neural networks

9 WaveNet - PixelCNN Conditional Image Generation with PixelCNN Decoders

10 WaveNet Causal Convolutions The main ingredient of WaveNet is the 1D causal convolution. By using causal convolutions, we make sure the model cannot violate the ordering in which we model the data: the prediction $p(x_{t+1} \mid x_1, \dots, x_t)$ emitted by the model at timestep $t$ cannot depend on any of the future timesteps $x_{t+1}, x_{t+2}, \dots, x_T$. One of the problems of causal convolutions is that they require many layers, or large filters, to increase the receptive field.
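
A minimal sketch of a 1D causal convolution in plain Python, assuming zero left-padding and a hypothetical 2-tap averaging filter (neither is specified on the slide). The output at position t reads only the input at t and earlier, so changing a later sample leaves earlier outputs unchanged.

```python
def causal_conv1d(x, w):
    """1D causal convolution: y[t] = sum_j w[j] * x[t - (K - 1) + j].

    The input is left-padded with K - 1 zeros, so y has the same length
    as x and y[t] never reads x[t + 1] or later.
    """
    k = len(w)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(w[j] * padded[t + j] for j in range(k)) for t in range(len(x))]

x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, 0.5]  # hypothetical filter: average of previous and current sample
y = causal_conv1d(x, w)

# Changing a future sample must leave earlier outputs untouched.
x_future_changed = [1.0, 2.0, 3.0, 100.0]
y_changed = causal_conv1d(x_future_changed, w)
```

With a filter of length K, a stack of n such layers sees only n(K - 1) + 1 samples, which is the receptive-field problem the slide mentions.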

11 WaveNet Dilated Convolutions A dilated convolution (also called à trous, or convolution with holes) is a convolution where the filter is applied over an area larger than its length by skipping input values with a certain step. It is equivalent to a convolution with a larger filter derived from the original filter by dilating it with zeros, but is significantly more efficient.
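
The equivalence stated above can be checked numerically. This is a sketch with a hypothetical 2-tap filter and dilation 2: skipping input values gives the same output as an ordinary convolution with the zero-stuffed filter, while multiplying by fewer taps.

```python
def dilated_causal_conv1d(x, w, dilation):
    """Causal convolution whose taps are `dilation` samples apart.

    y[t] = sum_j w[j] * x[t - (K - 1 - j) * dilation], zeros before the start.
    """
    k = len(w)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j in range(k):
            idx = t - (k - 1 - j) * dilation
            if idx >= 0:
                acc += w[j] * x[idx]
        out.append(acc)
    return out

def dilate_filter(w, dilation):
    """Insert (dilation - 1) zeros between neighbouring taps."""
    dilated = []
    for j, tap in enumerate(w):
        dilated.append(tap)
        if j < len(w) - 1:
            dilated.extend([0.0] * (dilation - 1))
    return dilated

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w = [1.0, -1.0]  # hypothetical 2-tap filter

skipping = dilated_causal_conv1d(x, w, dilation=2)
zero_stuffed = dilated_causal_conv1d(x, dilate_filter(w, 2), dilation=1)
```

Here `skipping` and `zero_stuffed` hold the same values; the dilated version uses 2 multiplications per output instead of 3, and the gap widens as the dilation grows.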

12 WaveNet Dilated Convolutions Stacked dilated convolutions enable networks to have very large receptive fields with just a few layers, while preserving the input resolution throughout the network as well as computational efficiency. In this paper, the dilation is doubled for every layer up to a limit and then repeated: e.g. 1, 2, 4, ..., 512, 1, 2, 4, ..., 512, 1, 2, 4, ..., 512. Exponentially increasing the dilation factor results in exponential receptive field growth with depth. Each 1, 2, 4, ..., 512 block has a receptive field of size 1024, and can be seen as a more efficient and discriminative (non-linear) counterpart of a 1×1024 convolution.
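
The receptive-field figure can be reproduced with a few lines of arithmetic. The sketch assumes a filter size of 2, which the slide does not state but which is consistent with the quoted size of 1024 per block.

```python
def receptive_field(dilations, filter_size=2):
    """Receptive field of stacked dilated causal convolutions.

    Each layer with dilation d extends the field by (filter_size - 1) * d.
    """
    return 1 + sum((filter_size - 1) * d for d in dilations)

block = [2 ** i for i in range(10)]      # 1, 2, 4, ..., 512
one_block = receptive_field(block)       # 1024
three_blocks = receptive_field(block * 3)  # 3070
undilated = receptive_field([1] * 10)    # 11: ten ordinary causal layers
```

Ten dilated layers cover 1024 samples, whereas ten undilated layers with the same filter size cover only 11.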

13 WaveNet SoftMax Distributions One approach to modeling the conditional distributions $p(x_t \mid x_1, \dots, x_{t-1})$ over the individual audio samples would be to use a mixture model: $p(t \mid x) = \sum_{i=1}^{m} \alpha_i \, \phi_i(t \mid x)$, with mixing coefficients $\alpha_i$ and component densities $\phi_i$. In PixelRNN and PixelCNN we have seen that a softmax (multinomial) distribution tends to work better, even when the data is implicitly continuous (as is the case for image pixel intensities or audio sample values). One of the reasons is that a categorical distribution is more flexible and can more easily model arbitrary distributions because it makes no assumptions (no prior) about their shape.

14 WaveNet SoftMax Distributions Raw audio is typically stored as a sequence of 16-bit integer values (one per timestep), meaning the softmax layer would need to output 65,536 probabilities per timestep to model all possible values. To make this more tractable, we first apply a μ-law companding transformation (ITU-T, 1988) to the data, and then quantize it to 256 possible values: $f(x_t) = \operatorname{sign}(x_t) \, \frac{\ln(1 + \mu |x_t|)}{\ln(1 + \mu)}$, where $-1 < x_t < 1$ and $\mu = 255$.
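
A sketch of the companding and quantization step, using the formula above with μ = 255. The mapping from the companded value to an integer code (shift, scale, round) is one common choice and an assumption here; the slide does not specify it. The decoder is the algebraic inverse of the formula.

```python
import math

MU = 255

def mu_law_encode(x, mu=MU):
    """Compand x in (-1, 1) and quantize to an integer in [0, mu]."""
    f = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Map f from [-1, 1] to {0, ..., mu}; this rounding scheme is an assumption.
    return int((f + 1) / 2 * mu + 0.5)

def mu_law_decode(q, mu=MU):
    """Approximately invert mu_law_encode."""
    f = 2 * q / mu - 1
    return math.copysign(math.expm1(abs(f) * math.log1p(mu)) / mu, f)

codes = [mu_law_encode(v) for v in (-0.999, -0.01, 0.0, 0.01, 0.999)]
roundtrip_error = abs(mu_law_decode(mu_law_encode(0.3)) - 0.3)
```

Because the transformation is logarithmic, amplitudes near zero are spread over many codes while loud samples share coarser bins, which suits speech better than uniform 8-bit quantization.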

15 WaveNet Gated Activation Units One of the potential advantages that PixelRNN has over PixelCNN is that it contains multiplicative units (in the form of the LSTM gates), which may help it to model more complex interactions. To amend this, DeepMind replaced the rectified linear units between the masked convolutions in the original PixelCNN with the following gated activation unit: $y = \tanh(W_{k,f} * x) \odot \sigma(W_{k,g} * x)$, where $*$ denotes a convolution operator, $\odot$ denotes an elementwise multiplication operator, $\sigma(\cdot)$ is a sigmoid function, $k$ is the layer index, $f$ and $g$ denote filter and gate, respectively, and $W$ is a learnable convolution filter.
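
A small sketch of the gated activation unit applied elementwise. The two input lists stand for the convolution outputs of the filter and gate branches, which are assumed to have been computed elsewhere; the numeric values are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gated_activation(filter_out, gate_out):
    """Elementwise tanh(filter) * sigmoid(gate).

    filter_out and gate_out stand for the two convolution outputs
    W_{k,f} * x and W_{k,g} * x, computed elsewhere.
    """
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_out, gate_out)]

filter_out = [2.0, 2.0, -2.0]   # hypothetical pre-activations
gate_out = [10.0, -10.0, 0.0]   # gate open, gate closed, gate half open
y = gated_activation(filter_out, gate_out)
```

The sigmoid branch acts as a soft switch in (0, 1) on the tanh branch, which is the multiplicative interaction the slide refers to.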

16 WaveNet Residual And Skip Connections Both residual and parameterized skip connections are used throughout the network, to speed up convergence and enable training of much deeper models.
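
A simplified sketch of how residual and skip connections route information through a stack. Each `layer` here is a hypothetical stand-in for a full block (dilated convolution, gated activation, 1×1 convolution), and using the same layer output for both the residual and the skip path is a simplification made for brevity.

```python
def run_stack(x, layers):
    """Apply layers with residual connections and sum their skip outputs.

    Each layer maps a list of floats to a list of the same length.
    """
    skip_total = [0.0] * len(x)
    for layer in layers:
        h = layer(x)
        # Residual path: the layer output is added to its own input.
        x = [xi + hi for xi, hi in zip(x, h)]
        # Skip path: the layer output bypasses all later layers.
        skip_total = [s + hi for s, hi in zip(skip_total, h)]
    return x, skip_total

# Two hypothetical layers that simply scale their input.
layers = [lambda v: [0.5 * a for a in v], lambda v: [-0.5 * a for a in v]]
residual_out, skip_out = run_stack([1.0, 2.0], layers)
```

Both paths give gradients a short route to early layers, which is why they help when training deeper stacks.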

17 WaveNet The first step is audio preprocessing: after the input waveform is quantized to a fixed integer range, the integer amplitudes are one-hot encoded to produce a tensor of shape (num_samples, num_channels). A convolutional layer that only accesses the current and previous inputs then reduces the channel dimension. The core of the network is constructed as a stack of causal dilated layers, each of which is a dilated convolution (convolution with holes) that only accesses the current and past audio samples. The outputs of all layers are combined and extended back to the original number of channels by a series of dense postprocessing layers, followed by a softmax function to transform the outputs into a categorical distribution. The loss function is the cross-entropy between the output for each timestep and the input at the next timestep.
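
The first and last steps of this pipeline can be sketched without the network itself. The example uses 4 channels instead of 256 for brevity, and `uniform_logits` is a hypothetical placeholder for the network output; only the one-hot encoding and the one-step shift in the loss are the point.

```python
import math

def one_hot(samples, num_channels):
    """Encode integer amplitudes as rows of length num_channels."""
    return [[1.0 if c == s else 0.0 for c in range(num_channels)] for s in samples]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_step_cross_entropy(logits_per_step, samples):
    """Mean cross-entropy between the output at step t and the input at step t + 1."""
    losses = []
    for t in range(len(samples) - 1):
        probs = softmax(logits_per_step[t])
        losses.append(-math.log(probs[samples[t + 1]]))
    return sum(losses) / len(losses)

samples = [0, 2, 1, 3]                         # quantized amplitudes
encoded = one_hot(samples, num_channels=4)     # shape (num_samples, num_channels)
uniform_logits = [[0.0] * 4 for _ in samples]  # hypothetical untrained output
loss = next_step_cross_entropy(uniform_logits, samples)
```

With uniform predictions over 4 classes the loss equals ln 4; the last output has no target, since there is no next input to compare against.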

18 Conditional WaveNet Given an additional input $h$, WaveNets can model the conditional distribution $p(x \mid h)$ of the audio given this input: $p(x \mid h) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}, h)$. By conditioning the model on other input variables, we can guide WaveNet's generation to produce audio with the required characteristics. For example, in a multi-speaker setting we can choose the speaker by feeding the speaker identity to the model as an extra input. Similarly, for TTS we need to feed information about the text as an extra input.

19 Conditional WaveNet We condition the model on other inputs in two different ways: global conditioning and local conditioning. Global conditioning is characterized by a single latent representation $h$ that influences the output distribution across all timesteps, e.g. a speaker embedding in a TTS model. The activation function now becomes: $z = \tanh(W_{k,f} * x + V_{f,k}^{T} h) \odot \sigma(W_{k,g} * x + V_{g,k}^{T} h)$, where $V_{*,k}$ is a learnable linear projection, and the vector $V_{f,k}^{T} h$ is broadcast over the time dimension.
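
A sketch of global conditioning for a single output channel, so that each projection reduces to a dot product. The embeddings, projection vectors, and pre-activations are all hypothetical values chosen to make the effect visible.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def globally_conditioned_gate(filter_out, gate_out, h, v_f, v_g):
    """z_t = tanh(filter_t + v_f . h) * sigmoid(gate_t + v_g . h).

    The two dot products are computed once and added at every timestep,
    which is the broadcast over the time dimension.
    """
    bias_f = sum(a * b for a, b in zip(v_f, h))
    bias_g = sum(a * b for a, b in zip(v_g, h))
    return [math.tanh(f + bias_f) * sigmoid(g + bias_g)
            for f, g in zip(filter_out, gate_out)]

filter_out = [0.0, 0.0, 0.0]   # hypothetical convolution outputs, one channel
gate_out = [0.0, 0.0, 0.0]
speaker_a = [1.0, 0.0]         # hypothetical speaker embeddings
speaker_b = [0.0, 1.0]
v_f, v_g = [1.0, -1.0], [0.0, 0.0]

z_a = globally_conditioned_gate(filter_out, gate_out, speaker_a, v_f, v_g)
z_b = globally_conditioned_gate(filter_out, gate_out, speaker_b, v_f, v_g)
```

The same audio input produces different activations for the two embeddings, and the shift is identical at every timestep.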

20 Conditional WaveNet For local conditioning we have a second time series $h_t$, possibly with a lower sampling frequency than the audio signal, e.g. linguistic features in a TTS model. We first transform this time series using a transposed convolutional network (learned upsampling) that maps it to a new time series $y = f(h)$ with the same resolution as the audio signal, which is then used in the activation unit as follows: $z = \tanh(W_{k,f} * x + V_{f,k} * y) \odot \sigma(W_{k,g} * x + V_{g,k} * y)$, where $V_{f,k} * y$ is now a $1 \times 1$ convolution. As an alternative to the transposed convolutional network, it is also possible to use $V_{f,k} * h$ and repeat these values across time. This worked slightly worse in the experiments.
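
The simpler alternative, repeating conditioning values across time, is easy to sketch. The frame values and the upsampling factor below are hypothetical; the learned transposed-convolution upsampling is not shown.

```python
def repeat_upsample(h, factor):
    """Repeat each conditioning frame `factor` times along the time axis."""
    return [frame for frame in h for _ in range(factor)]

h = [0.1, 0.9]                    # hypothetical conditioning values, 2 frames
y = repeat_upsample(h, factor=3)  # one value per audio sample, 6 samples
```

After upsampling, `y` has one entry per audio sample and can be added inside the tanh and sigmoid terms in the same way as the global bias, except that it now varies over time.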

21 Experiments - TTS To evaluate the performance of WaveNets for the TTS task, subjective paired comparison tests and mean opinion score (MOS) tests were conducted. In the paired comparison tests, after listening to each pair of samples, the subjects were asked to choose which they preferred, though they could choose neutral if they did not have any preference. In the MOS tests, after listening to each stimulus, the subjects were asked to rate the naturalness of the stimulus on a five-point Likert scale (1: Bad, 2: Poor, 3: Fair, 4: Good, 5: Excellent).

22 Experiments - TTS L+F stands for WaveNets trained with conditioning on the logarithmic fundamental frequency $\log(f_0)$ values in addition to the linguistic features.

23 Experiments - TTS

24 Experiments - TTS

25 Experiments - TTS ES English Mandarin Chinese

26 Experiments Audio Generation Knowing what to say Speaker identity

27 Experiments Audio Generation Making Music

28 Thank You!
