CSC321 Lecture 20: Reversible and Autoregressive Models


Roger Grosse

Overview

Four modern approaches to generative modeling:
Generative adversarial networks (last lecture)
Reversible architectures (today)
Autoregressive models (today)
Variational autoencoders (CSC412)

All four approaches have different pros and cons.

Overview

Remember that the GAN generator network represents a distribution by sampling from a simple distribution p_Z(z) over code vectors z. (I'll use p_Z here to emphasize that it's a distribution on z.)

A GAN is an implicit generative model: we can only generate samples, not evaluate the log-likelihood, so we can't tell if it's missing modes, memorizing the training data, etc.

Reversible architectures are an elegant kind of generator network with tractable log-likelihood.

Change of Variables Formula

Let f denote a differentiable, bijective mapping from space Z to space X. (I.e., it must be 1-to-1 and cover all of X.)

Since f defines a one-to-one correspondence between values z ∈ Z and x ∈ X, we can think of it as a change-of-variables transformation.

Change-of-Variables Formula from probability theory: if x = f(z), then

p_X(x) = p_Z(z) \left| \det \frac{\partial x}{\partial z} \right|^{-1}

Intuition for the Jacobian term: the determinant measures how much f locally stretches or compresses volume around z, and the density must be divided by this factor so that total probability mass is conserved.

Change of Variables Formula

Suppose we have a generator network which computes the function f. It's tempting to apply the change-of-variables formula in order to compute the density p_X(x): compute z = f^{-1}(x), then

p_X(x) = p_Z(z) \left| \det \frac{\partial x}{\partial z} \right|^{-1}

Problems?
The mapping f needs to be invertible, with an easy-to-compute inverse.
It needs to be differentiable, so that the Jacobian \partial x / \partial z is defined.
We need to be able to compute the (log) determinant.

The GAN generator may be differentiable, but it doesn't satisfy the other two properties.
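To make the formula concrete, here is a minimal NumPy sketch (my own illustration, not from the lecture) that applies the change-of-variables formula to a simple invertible map, an elementwise affine transformation x = a * z + b, whose Jacobian is diagonal so its determinant is just the product of the scales.

```python
import numpy as np

# Hypothetical example: x = f(z) = a * z + b (elementwise), with a standard
# Gaussian prior p_Z. The Jacobian dx/dz is diag(a), so |det| = prod(|a|).

a = np.array([2.0, 0.5, 1.5])   # scales (must be nonzero for invertibility)
b = np.array([1.0, -1.0, 0.0])  # shifts

def f(z):
    return a * z + b

def f_inv(x):
    return (x - b) / a

def log_p_Z(z):
    # log density of a standard Gaussian prior
    return -0.5 * np.sum(z**2 + np.log(2 * np.pi))

def log_p_X(x):
    # change of variables: log p_X(x) = log p_Z(f^{-1}(x)) - log |det dx/dz|
    z = f_inv(x)
    log_abs_det = np.sum(np.log(np.abs(a)))
    return log_p_Z(z) - log_abs_det

x = f(np.array([0.1, -0.3, 0.7]))
print(log_p_X(x))
```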

Reversible Blocks

Now let's define a reversible block which is invertible and has a tractable determinant. Such blocks can be composed: f = f_k \circ \cdots \circ f_1.

Inversion: f^{-1} = f_1^{-1} \circ \cdots \circ f_k^{-1}

Determinants: \frac{\partial x_k}{\partial z} = \frac{\partial x_k}{\partial x_{k-1}} \cdots \frac{\partial x_2}{\partial x_1} \frac{\partial x_1}{\partial z}, so the determinant of the whole mapping is the product of the determinants of the individual blocks.

Reversible Blocks

Recall the residual blocks: y = x + F(x)

Reversible blocks are a variant of residual blocks. Divide the units into two groups, x_1 and x_2:

y_1 = x_1 + F(x_2)
y_2 = x_2

Inverting a reversible block:

x_2 = y_2
x_1 = y_1 - F(x_2)
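As a concrete illustration (not from the lecture), here is a minimal NumPy sketch of this additive coupling block, with an arbitrary fixed nonlinearity standing in for F; it checks that inverting the block recovers the inputs exactly.

```python
import numpy as np

def F(x2):
    # Residual function: can be any function of x2 (it need not be invertible),
    # e.g. a small MLP. A fixed nonlinearity is used here for illustration.
    return np.tanh(x2) * 3.0

def reversible_forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2
    return y1, y2

def reversible_inverse(y1, y2):
    x2 = y2
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.random.randn(4), np.random.randn(4)
y1, y2 = reversible_forward(x1, x2)
x1_rec, x2_rec = reversible_inverse(y1, y2)
assert np.allclose(x1, x1_rec) and np.allclose(x2, x2_rec)
```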

Reversible Blocks

Composition of two reversible blocks, but with x_1 and x_2 swapped:

Forward:
y_1 = x_1 + F(x_2)
y_2 = x_2 + G(y_1)

Backward:
x_2 = y_2 - G(y_1)
x_1 = y_1 - F(x_2)

Volume Preservation

It remains to compute the log determinant of the Jacobian. The Jacobian of the reversible block

y_1 = x_1 + F(x_2)
y_2 = x_2

is

\frac{\partial y}{\partial x} = \begin{pmatrix} I & \frac{\partial F}{\partial x_2} \\ 0 & I \end{pmatrix}

This is an upper triangular matrix. The determinant of an upper triangular matrix is the product of the diagonal entries, or in this case, 1. Since the determinant is 1, the mapping is said to be volume preserving.
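As a quick numerical sanity check (my own illustration, not part of the lecture), the sketch below builds the Jacobian of the coupling block by finite differences and confirms that its determinant is 1, regardless of what F is.

```python
import numpy as np

def F(x2):
    # Arbitrary residual function for illustration.
    return np.tanh(x2) * 3.0

def block(x):
    # x is the concatenation [x1, x2]; the block maps it to [y1, y2].
    d = len(x) // 2
    x1, x2 = x[:d], x[d:]
    return np.concatenate([x1 + F(x2), x2])

def numerical_jacobian(f, x, eps=1e-6):
    # Central finite differences, one column per input dimension.
    n = len(x)
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

x = np.random.randn(6)
J = numerical_jacobian(block, x)
print(np.linalg.det(J))  # ~1.0: the coupling block is volume preserving
```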

Nonlinear Independent Components Estimation

We've just defined the reversible block:
Easy to invert by subtracting rather than adding the residual function.
The determinant of the Jacobian is 1.

Nonlinear Independent Components Estimation (NICE) trains a generator network which is a composition of lots of reversible blocks.

We can compute the likelihood function using the change-of-variables formula:

p_X(x) = p_Z(z) \left| \det \frac{\partial x}{\partial z} \right|^{-1} = p_Z(z)

We can train this model using maximum likelihood. I.e., given a dataset \{x^{(1)}, \ldots, x^{(N)}\}, we maximize the likelihood

\prod_{i=1}^N p_X(x^{(i)}) = \prod_{i=1}^N p_Z(f^{-1}(x^{(i)}))

Nonlinear Independent Components Estimation

Likelihood:

p_X(x) = p_Z(z) = p_Z(f^{-1}(x))

Remember, p_Z is a simple, fixed distribution (e.g. independent Gaussians).

Intuition: train the network such that f^{-1} maps each data point to a high-density region of the code vector space Z. Without constraints on f, it could map everything to 0, and this likelihood objective would make no sense. But it can't do this because it's volume preserving.
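A minimal sketch of the objective (my own illustration under simplifying assumptions, not NICE's actual code): with a volume-preserving f built from coupling blocks and a standard Gaussian prior, the log-likelihood of a data point is just the Gaussian log-density of its code f^{-1}(x).

```python
import numpy as np

# Two additive coupling blocks, alternating which half is updated. The F
# functions are fixed nonlinearities here; in NICE they would be learned
# MLPs and this quantity would be maximized by gradient-based training.

def F1(h):
    return np.tanh(h)

def F2(h):
    return np.sin(h)

def f_inverse(x):
    d = len(x) // 2
    y1, y2 = x[:d], x[d:]
    # Invert the second block (which updated the second half):
    x2 = y2 - F2(y1)
    # Invert the first block (which updated the first half):
    x1 = y1 - F1(x2)
    return np.concatenate([x1, x2])

def log_likelihood(x):
    # Volume preserving, so log p_X(x) = log p_Z(f^{-1}(x)), no Jacobian term.
    z = f_inverse(x)
    return -0.5 * np.sum(z**2 + np.log(2 * np.pi))

x = np.random.randn(6)
print(log_likelihood(x))
```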

Nonlinear Independent Components Estimation

Dinh et al., 2016. Density estimation using RealNVP.

Nonlinear Independent Components Estimation

Samples produced by RealNVP, a model based on NICE.

Dinh et al., 2016. Density estimation using RealNVP.

RevNets (optional)

A side benefit of reversible blocks: you don't need to store the activations in memory to do backprop, since you can reverse the computation. I.e., compute the activations as you need them, moving backwards through the computation graph.

Notice that reversible blocks look a lot like residual blocks. We recently designed a reversible residual network (RevNet) architecture which is like a ResNet, but with reversible blocks instead of residual blocks. It matches state-of-the-art performance on ImageNet, but without the memory cost of storing activations!

Gomez et al., NIPS 2017. The reversible residual network: backprop without storing activations.
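To illustrate the memory saving (a simplified sketch, not the RevNet implementation), the code below runs a chain of coupling blocks forward while keeping only the final output, then reconstructs each block's input on the fly by inverting the blocks in reverse order, which is what lets backprop proceed without stored activations.

```python
import numpy as np

def make_F(seed):
    # Hypothetical per-block residual function (a fixed random linear map
    # followed by tanh), standing in for a learned subnetwork.
    rng = np.random.RandomState(seed)
    W = rng.randn(3, 3) * 0.1
    return lambda h: np.tanh(h @ W)

blocks = [make_F(s) for s in range(4)]

def forward(x1, x2):
    for i, F in enumerate(blocks):
        if i % 2 == 0:
            x1 = x1 + F(x2)
        else:
            x2 = x2 + F(x1)
    return x1, x2  # only the final activations are kept

def reconstruct_inputs(y1, y2):
    # Walk the blocks in reverse, inverting each one to recover its input.
    for i in reversed(range(len(blocks))):
        F = blocks[i]
        if i % 2 == 0:
            y1 = y1 - F(y2)
        else:
            y2 = y2 - F(y1)
    return y1, y2

x1, x2 = np.random.randn(3), np.random.randn(3)
y1, y2 = forward(x1, x2)
r1, r2 = reconstruct_inputs(y1, y2)
assert np.allclose(x1, r1) and np.allclose(x2, r2)
```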

Overview

Four modern approaches to generative modeling:
Generative adversarial networks (last lecture)
Reversible architectures (today)
Autoregressive models (today)
Variational autoencoders (CSC412)

All four approaches have different pros and cons.

Autoregressive Models

We've already looked at autoregressive models in this course:
Neural language models
RNN language models (and decoders)

We can push this further, and generate very long sequences.

Problem: training an RNN to generate these sequences requires a for loop over > 10,000 time steps.

Causal Convolution

Idea 1: causal convolution

For RNN language models, we used the training sequence as both the inputs and the outputs to the RNN. We made sure the model was causal: each prediction depended only on inputs earlier in the sequence.

We can do the same thing using a convolutional architecture. No for loops! Processing each input sequence just requires a series of convolution operations.
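A minimal NumPy sketch (my own illustration, not lecture code) of a causal 1D convolution: the input is padded on the left by kernel_size - 1, so each output position depends only on the current and earlier inputs.

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1D convolution: output[t] depends only on x[t], x[t-1], ..."""
    k = len(w)
    x_padded = np.concatenate([np.zeros(k - 1), x])  # pad on the left only
    return np.array([np.dot(w, x_padded[t:t + k]) for t in range(len(x))])

x = np.arange(8, dtype=float)
w = np.array([0.5, 0.3, 0.2])  # w[-1] multiplies the current time step
print(causal_conv1d(x, w))
```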

Causal Convolution

Causal convolution for images: impose an ordering on the pixels (e.g. raster scan order) and mask the convolution so that each pixel's prediction depends only on the pixels that come before it.

CNN vs. RNN

We can turn a causal CNN into an RNN by adding recurrent connections. Is this a good idea?

The RNN has a memory, so it can use information from all past time steps. The CNN has a limited context.

But training the RNN is very expensive, since it requires a for loop over time steps. The CNN only requires a series of convolutions.

Generating from both models is very expensive, since it requires a for loop. (Whereas generating from a GAN or a reversible model is very fast.)

PixelCNN and PixelRNN

Van den Oord et al., ICML 2016. Pixel recurrent neural networks.

This paper introduced two autoregressive models of images: the PixelRNN and the PixelCNN. Both generated amazingly good high-resolution images.

The output is a softmax over 256 possible pixel intensities.

Completing an image using a PixelCNN: [figure]
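As an illustration of the masking idea (a sketch based on the general description of masked convolutions, not the authors' code), the snippet below builds a binary mask for a k x k kernel so that a pixel's prediction only sees pixels above it, or to its left in the same row. With include_center=False this corresponds to the first layer, which must not see the pixel it is predicting.

```python
import numpy as np

def causal_mask(k, include_center=False):
    """Binary mask for a k x k kernel: allow rows above the center, and
    pixels to the left of the center in the center row (raster-scan order)."""
    mask = np.zeros((k, k))
    c = k // 2
    mask[:c, :] = 1           # all rows strictly above the center
    mask[c, :c] = 1           # pixels left of center in the center row
    if include_center:        # later layers may also use the current position,
        mask[c, c] = 1        # since its features come from previous layers
    return mask

print(causal_mask(5))
# A masked convolution multiplies the kernel weights elementwise by this mask,
# so information never flows from "future" pixels (below, or to the right in
# the same row) into a prediction.
```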

PixelCNN and PixelRNN

Samples from a PixelRNN trained on ImageNet: [figure]

Dilated Convolution

Idea 2: dilated convolution

The advantage of RNNs over CNNs is that their memory lets them learn arbitrary long-distance dependencies. But we can dramatically increase a CNN's receptive field using dilated convolution. You did this in Programming Assignment 2.
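A minimal sketch (my own illustration) of a dilated causal convolution: the kernel taps are spaced `dilation` steps apart, so stacking layers with dilations 1, 2, 4, ... makes the receptive field grow exponentially with depth while each layer keeps the same small kernel.

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """Causal convolution whose kernel taps are spaced `dilation` steps apart."""
    k = len(w)
    pad = (k - 1) * dilation
    x_padded = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(w[j] * x_padded[t + j * dilation] for j in range(k))
        for t in range(len(x))
    ])

x = np.random.randn(16)
h = x
for d in [1, 2, 4, 8]:   # receptive field: 1 + (2 - 1) * (1 + 2 + 4 + 8) = 16
    h = dilated_causal_conv1d(h, np.array([0.5, 0.5]), d)
print(h.shape)
```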

WaveNet

WaveNet is an autoregressive model for raw audio based on causal dilated convolutions.

van den Oord et al., 2016. WaveNet: a generative model for raw audio.

Audio needs to be sampled at at least 16k frames per second for good quality, so the sequences are very long.

WaveNet uses dilations of 1, 2, ..., 512, so each unit at the end of this block has a receptive field of length 1024, or 64 milliseconds. It stacks several of these blocks, so the total context length is about 300 milliseconds.

https://deepmind.com/blog/wavenet-generative-model-raw-audio/
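To check the arithmetic (a small calculation of my own, assuming kernel size 2 so each layer with dilation d extends the receptive field by d samples): the receptive field of one dilation stack is 1 plus the sum of the dilations, and at 16,000 samples per second that is 64 ms.

```python
# Receptive field of one WaveNet-style dilation stack with kernel size 2.
dilations = [2**i for i in range(10)]         # 1, 2, 4, ..., 512
receptive_field = 1 + sum(dilations)          # 1024 samples
sample_rate = 16_000                          # samples per second
print(receptive_field)                        # 1024
print(1000 * receptive_field / sample_rate)   # 64.0 milliseconds
```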