
PixelSNAIL: An Improved Autoregressive Generative Model

arXiv:1712.09763v1 [cs.LG] 28 Dec 2017

Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, Pieter Abbeel
Embodied Intelligence
UC Berkeley, Department of Electrical Engineering and Computer Sciences

Abstract

Autoregressive generative models consistently achieve the best results in density estimation tasks involving high dimensional data, such as images or audio. They pose density estimation as a sequence modeling task, where a recurrent neural network (RNN) models the conditional distribution over the next element conditioned on all previous elements. In this paradigm, the bottleneck is the extent to which the RNN can model long-range dependencies, and the most successful approaches rely on causal convolutions, which offer better access to earlier parts of the sequence than conventional RNNs. Taking inspiration from recent work in meta reinforcement learning, where dealing with long-range dependencies is also essential, we introduce a new generative model architecture that combines causal convolutions with self attention. In this note, we describe the resulting model and present state-of-the-art log-likelihood results on CIFAR-10 (2.85 bits per dim) and 32×32 ImageNet (3.80 bits per dim). Our implementation is available at https://github.com/neocxi/pixelsnail-public.

1 Introduction

Autoregressive generative models over high-dimensional data $x = (x_1, \ldots, x_n)$ factor the joint distribution as a product of conditionals:

$$p(x) = p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})$$

A recurrent neural network (RNN) is then trained to model $p(x_i \mid x_{1:i-1})$. Optionally, the model can be conditioned on additional global information $h$ (such as a class label, when applied to images), in which case it models $p(x_i \mid x_{1:i-1}, h)$. Such methods are highly expressive and allow modeling complex dependencies.

Compared to GANs [3], neural autoregressive models offer tractable likelihood computation and ease of training, and have been shown to outperform latent variable models [13, 12, 11]. The main design consideration is the neural network architecture used to implement the RNN, as it must be able to easily refer to earlier parts of the sequence. A number of possibilities exist:

Traditional RNNs, such as GRUs or LSTMs: these propagate information by keeping it in their hidden state from one timestep to the next. This temporally-linear dependency significantly inhibits the extent to which they can model long-range relationships in the data.

Causal convolutions [12, 11]: these apply convolutions over the sequence (masked or shifted so that the current prediction is only influenced by previous elements). They offer high-bandwidth access to the earlier parts of the sequence. However, their receptive field has a finite size, and they still experience noticeable attenuation with regard to elements far away in the sequence.

Self-attention [14]: these models turn the sequence into an unordered key-value store that can be queried based on content. They feature an unbounded receptive field and allow undeteriorated access to information far away in the sequence. However, they only offer pinpoint access to small amounts of information, and require an additional mechanism to incorporate positional information.

Causal convolutions and self-attention demonstrate complementary strengths and weaknesses: the former allow high-bandwidth access over a finite context size, and the latter allow access over an infinitely large context. Interleaving the two thus offers the best of both worlds, where the model can have high-bandwidth access without constraints on the amount of information it can effectively use. The convolutions can be seen as aggregating information to build the context over which to perform an attentive lookup. Using this approach (dubbed SNAIL), Mishra et al. [6] demonstrated significant performance improvements on a number of tasks in the meta-learning setup, where the challenge of long-term temporal dependencies is also prevalent, as an agent should be able to adapt its behavior based on past experience.

In this note, we simply apply the same idea to the task of autoregressive generative modeling, as the fundamental bottleneck of access to past information is the same. Building off the current state-of-the-art in generative models, a class of convolution-based architectures known as PixelCNNs (van den Oord et al. [12] and Salimans et al. [11]), we present a new architecture, PixelSNAIL, that incorporates ideas from [6] to obtain state-of-the-art results on the CIFAR-10 and ImageNet 32×32 datasets.
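The contrast between a finite convolutional receptive field and an unbounded attention context can be made concrete with a toy 1D calculation. The following is a minimal sketch, not the paper's code: it assumes undilated causal convolutions, counts the current position as visible for simplicity (a real model would shift or mask it out), and all function names are hypothetical.

```python
def causal_conv_receptive_field(num_layers, kernel_size):
    """Positions (current one included) visible to a stack of undilated
    causal 1D convolutions. Assumption: every layer uses the same kernel size."""
    return num_layers * (kernel_size - 1) + 1


def conv_stack_visibility(n, num_layers, kernel_size):
    """visible[i][j] is True if output i can depend on input j after stacking
    `num_layers` causal convolutions over a length-n sequence."""
    # Before any layer, each position only sees itself.
    visible = [[i == j for j in range(n)] for i in range(n)]
    for _ in range(num_layers):
        new = [[False] * n for _ in range(n)]
        for i in range(n):
            # One causal layer: position i reads positions i-(k-1) .. i.
            for m in range(max(0, i - kernel_size + 1), i + 1):
                for j in range(n):
                    if visible[m][j]:
                        new[i][j] = True
        visible = new
    return visible


def causal_attention_visibility(n):
    """A single causal attention layer lets position i read every j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


conv_vis = conv_stack_visibility(12, num_layers=3, kernel_size=3)
attn_vis = causal_attention_visibility(12)
print("conv sees", sum(conv_vis[11]), "positions; attention sees", sum(attn_vis[11]))
```

In this toy setting the convolution stack reaches only a fixed window behind the last position, while one attention layer reaches the whole prefix, which is the complementarity the text describes.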
Figure 1: The modular components that make up PixelSNAIL: (a) a residual block (D filters, R repeats), and (b) an attention block (key size K, value size V). Both blocks take inputs of shape [B, H, W, D]; the residual block produces outputs of shape [B, H, W, D] and the attention block produces outputs of shape [B, H, W, V]. For both datasets, we used residual blocks with 256 filters and 4 repeats, and attention blocks with key size 16 and value size 128.

2 Model Architecture

In this section, we describe the PixelSNAIL architecture. It is primarily composed of two building blocks, which are illustrated in Figure 1 and described below:

A residual block applies several 2D convolutions to its input, each with residual connections. To make them causal, the convolutions are masked so that the current pixel can only access pixels to the left of and above it. We use a gated activation function similar to [12, 7]. Throughout the model, we used 4 convolutions per block and 256 filters in each convolution.

An attention block performs a single key-value lookup. It projects the input to a lower dimensionality to produce the keys and values and then uses softmax attention as in [14, 6] (masked so that the current pixel can only attend over previously generated pixels). We used keys of size 16 and values of size 128.

Figure 2: The entire PixelSNAIL model architecture (M blocks, K mixture components), using the building blocks from Figure 1. Inputs have shape [B, H, W, 3] and outputs have shape [B, H, W, 10*K]. We used 12 blocks for both datasets, with 10 mixture components for CIFAR-10 and 32 for ImageNet.

Figure 2 illustrates the full PixelSNAIL architecture, which interleaves the residual blocks and attention blocks depicted in Figure 1. In the CIFAR-10 model only, we applied dropout of 0.5 after the first convolution in every residual block, to prevent overfitting. We did not use any dropout for ImageNet, as the dataset is much larger.

On both datasets, we use Polyak averaging [10] (following [11]) over the training parameters. We used an exponential moving average weight of 0.9995 for CIFAR-10 and 0.9997 for ImageNet. As the output distribution, we use the discretized mixture of logistics introduced by Salimans et al. [11], with 10 mixture components for CIFAR-10 and 32 for ImageNet. To predict the subpixel (red, green, blue) values, we used the same linear-autoregressive parametrization as Salimans et al. [11].

Our code is publicly available, and can be found at: https://github.com/neocxi/pixelsnail-public.
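To illustrate the masked key-value lookup performed by the attention block, here is a minimal pure-Python sketch. It is not the authors' implementation (see the repository linked above for that) and rests on several assumptions: pixels are flattened into raster-scan order, each 1x1 convolution is written as a per-pixel matrix multiply, the first pixel (which has no predecessors) receives a zero vector, the scores are not rescaled, and the weights are random. The input features are assumed to be causal already, as they would be after the masked convolutions. All names are hypothetical.

```python
import math
import random


def matvec(x, w):
    """Per-pixel linear map; this is what a 1x1 convolution computes."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def causal_attention(pixels, w_q, w_k, w_v):
    """pixels: list of n feature vectors in raster-scan order.
    Returns n value-sized vectors; output i mixes values of pixels j < i only."""
    queries = [matvec(p, w_q) for p in pixels]
    keys = [matvec(p, w_k) for p in pixels]
    values = [matvec(p, w_v) for p in pixels]
    v_dim = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        if i == 0:
            # Assumption: nothing to attend to yet, so emit zeros.
            outputs.append([0.0] * v_dim)
            continue
        # Masking: only positions strictly before i are scored.
        scores = [sum(a * b for a, b in zip(q, keys[j])) for j in range(i)]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[j][d] for j, w in enumerate(weights)) / total
            for d in range(v_dim)
        ])
    return outputs


def random_matrix(rows, cols, rng):
    """Hypothetical helper: small Gaussian weights for the toy example."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D, K, V, n = 6, 3, 4, 5  # toy sizes; the paper uses D=256, K=16, V=128
w_q, w_k, w_v = random_matrix(D, K, rng), random_matrix(D, K, rng), random_matrix(D, V, rng)
pixels = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(n)]
out = causal_attention(pixels, w_q, w_k, w_v)
```

Restricting the score computation to `range(i)` has the same effect as a masked softmax over the full score matrix: changing a later pixel cannot alter any earlier output.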
3 Experiments

In Table 1, we provide negative log-likelihood results (in bits per dim) for PixelSNAIL on both CIFAR-10 and ImageNet 32×32. We compare PixelSNAIL's performance to a number of autoregressive models. These include: (i) PixelRNN [8], which uses LSTMs, (ii) PixelCNN [12] and PixelCNN++ [11], which only use causal convolutions, and (iii) Image Transformer [1], an attention-only architecture inspired by Vaswani et al. [14]. PixelSNAIL outperforms all of these approaches, which suggests that both causal convolutions and attention are essential components of the architecture.
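For readers unfamiliar with the unit, "bits per dim" is conventionally the negative log-likelihood of one image, expressed in bits and divided by the number of dimensions (32 × 32 × 3 subpixels for these datasets). The sketch below shows the conversion under that standard convention; the paper does not spell it out, and the helper names are hypothetical.

```python
import math


def nats_to_bits_per_dim(nll_nats, num_dims):
    """Convert a per-image negative log-likelihood in nats to bits per dim."""
    return nll_nats / (num_dims * math.log(2.0))


def bits_per_dim_to_nats(bits_per_dim, num_dims):
    """Inverse conversion: per-image negative log-likelihood in nats."""
    return bits_per_dim * num_dims * math.log(2.0)


IMAGE_DIMS = 32 * 32 * 3  # subpixels in a 32x32 RGB image

# Per-image NLL implied by the reported 2.85 bits per dim on CIFAR-10.
cifar_nats = bits_per_dim_to_nats(2.85, IMAGE_DIMS)

# Reference point: a uniform distribution over 256 intensity values
# costs exactly 8 bits per dim.
uniform_bpd = nats_to_bits_per_dim(IMAGE_DIMS * math.log(256.0), IMAGE_DIMS)
print(round(cifar_nats, 1), uniform_bpd)
```

Seen this way, a difference of 0.05 bits per dim corresponds to roughly 150 bits per 32×32 image.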

Method                         CIFAR-10   ImageNet 32×32
Conv DRAW [4]                  3.5        4.40
Real NVP [2]                   3.49       4.28
VAE with IAF [5]               3.11       -
PixelRNN [8]                   3.00       3.86
PixelCNN [12]                  3.03       3.83
Image Transformer [1]          2.98       3.81
PixelCNN++ [11]                2.92       -
Block Sparse PixelCNN++ [9]    2.90       -
PixelSNAIL (ours)              2.85       3.80

Table 1: Average negative log-likelihoods on CIFAR-10 and ImageNet 32×32, in bits per dim. PixelSNAIL outperforms other autoregressive models, which rely on either causal convolutions or self-attention but not both.

Figure 3: Samples from our CIFAR-10 model.

4 Conclusion

We introduced PixelSNAIL, a class of autoregressive generative models that combine causal convolutions with self-attention. We demonstrate state-of-the-art density estimation performance on CIFAR-10 and ImageNet 32×32, with a publicly available implementation at https://github.com/neocxi/pixelsnail-public.

Despite their tractable likelihood and strong empirical performance, one notable drawback of autoregressive generative models is that sampling is slow, because each pixel must be sampled sequentially. PixelSNAIL's sampling speed is comparable to that of existing autoregressive models; the design of models that allow faster sampling without losing performance remains an open problem.
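The sampling cost follows directly from the autoregressive factorization: each element can only be drawn after all earlier ones are known. The toy sketch below illustrates this on binary sequences. The conditional rule is a hypothetical stand-in for the network, not anything from PixelSNAIL, and the function names are assumptions.

```python
import math
import random


def conditional_prob_one(prefix):
    """Toy stand-in for the network: p(x_i = 1 | x_1..x_{i-1}).
    Hypothetical rule: more ones so far makes another one more likely."""
    frac_ones = sum(prefix) / len(prefix) if prefix else 0.0
    logit = -0.5 + 2.0 * frac_ones
    return 1.0 / (1.0 + math.exp(-logit))


def sample(n, rng):
    """Draw x_1..x_n one element at a time; step i needs the draws before it,
    so the n model evaluations cannot be run in parallel."""
    x = []
    for _ in range(n):
        p_one = conditional_prob_one(x)
        x.append(1 if rng.random() < p_one else 0)
    return x


def log_likelihood(x):
    """log p(x) = sum_i log p(x_i | x_1..x_{i-1}), in nats.
    Every prefix is known up front, so these terms are independent of each other."""
    total = 0.0
    for i, xi in enumerate(x):
        p_one = conditional_prob_one(x[:i])
        total += math.log(p_one if xi == 1 else 1.0 - p_one)
    return total


x = sample(10, random.Random(0))
print(x, log_likelihood(x))
```

Likelihood evaluation has no such sequential dependency, since all prefixes of an observed sequence are available at once; this is why training is efficient while generation is not.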

Figure 4: Samples from our ImageNet model.

References

[1] Anonymous. Image transformer. Under review at the International Conference on Learning Representations (ICLR), 2018. URL https://openreview.net/forum?id=r16vyf-0-.
[2] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
[3] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[4] Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. arXiv preprint arXiv:1604.08772, 2016.
[5] Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improving variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, 2016.
[6] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In NIPS 2017 Workshop on Meta-Learning, 2017.
[7] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[8] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International Conference on Machine Learning (ICML), 2016.
[9] OpenAI. Block-sparse GPU kernels, Dec 2017. URL https://blog.openai.com/block-sparse-gpu-kernels/.
[10] Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[11] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
[12] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems (NIPS), 2016.
[13] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International Conference on Machine Learning (ICML), 2016.
[14] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.