Variational Autoencoders. Presented by Alex Beatson. Materials from Yann LeCun, Jaan Altosaar, Shakir Mohamed.


Contents
1. Why unsupervised learning, and why generative models? (Selected slides from Yann LeCun's keynote at NIPS 2016)
2. What is a variational autoencoder? (Jaan Altosaar's blog post)
3. A simple derivation of the VAE objective from importance sampling (Shakir Mohamed's slides)
Sections 2 and 3 were done as a chalk talk in the presentation.

1. Why unsupervised learning, and why generative models? Selected slides from Yann LeCun's keynote at NIPS 2016

Supervised Learning
We can train a machine on lots of examples of tables, chairs, dogs, cars, and people. But will it recognize tables, chairs, dogs, cars, and people it has never seen before?

Obstacles to Progress in AI
Machines need to learn/understand how the world works: the physical world, the digital world, people, ... They need to acquire some level of common sense. They need to learn a very large amount of background knowledge, through observation and action.
Machines need to perceive the state of the world, so as to make accurate predictions and plans.
Machines need to update and remember estimates of the state of the world: paying attention to important events, remembering relevant events.
Machines need to reason and plan: predict which sequence of actions will lead to a desired state of the world.
Intelligence & Common Sense = Perception + Predictive Model + Memory + Reasoning & Planning

What is Common Sense?
"The trophy doesn't fit in the suitcase because it's too large/small" (Winograd schema). "Tom picked up his bag and left the room." We have common sense because we know how the world works. How do we get machines to learn that?

Common Sense is the ability to fill in the blanks
Infer the state of the world from partial information. Infer the future from the past and present. Infer past events from the present state. Filling in the visual field at the retinal blind spot. Filling in occluded images. Filling in missing segments in text, missing words in speech. Predicting the consequences of our actions. Predicting the sequence of actions leading to a result. Predicting any part of the past, present or future percepts from whatever information is available.
That's what predictive learning is. But really, that's what many people mean by unsupervised learning.

The Necessity of Unsupervised Learning / Predictive Learning
The number of samples required to train a large learning machine (for any task) depends on the amount of information that we ask it to predict. The more you ask of the machine, the larger it can be.
"The brain has about 10^14 synapses and we only live for about 10^9 seconds. So we have a lot more parameters than data. This motivates the idea that we must do a lot of unsupervised learning, since the perceptual input (including proprioception) is the only place we can get 10^5 dimensions of constraint per second." (Geoffrey Hinton, in his 2014 AMA on Reddit; but he has been saying that since the late 1970s.)
Predicting human-provided labels is not enough. Predicting a value function is not enough.

How Much Information Does the Machine Need to Predict?
Pure reinforcement learning (the cherry): the machine predicts a scalar reward given once in a while; a few bits for some samples.
Supervised learning (the icing): the machine predicts a category or a few numbers for each input; predicting human-supplied data, 10 to 10,000 bits per sample.
Unsupervised/predictive learning (the cake): the machine predicts any part of its input for any observed part, e.g. future frames in videos; millions of bits per sample.
(Yes, I know, this picture is slightly offensive to RL folks. But I'll make it up.)

Classical Model-Based Optimal Control
Simulate the world (the plant) with an initial control sequence. Adjust the control sequence to optimize the objective through gradient descent. Backprop through time was invented by control theorists in the late 1950s; it's sometimes called the adjoint state method [Athans & Falb 1966, Bryson & Ho 1969].
[Slide diagram: the plant simulator unrolled over time, with a command feeding each step and an objective evaluated at each step.]
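To make the simulate-and-adjust loop concrete, here is a minimal sketch in PyTorch (my own illustration, not from the slides): a toy linear point-mass plant, an initial control sequence, and gradient descent on the commands by backpropagating through the unrolled simulator. The plant dynamics, horizon and cost are all made-up choices.

import torch

# Toy "plant": state is [position, velocity]; the command u is an acceleration.
def plant_step(state, u, dt=0.1):
    pos, vel = state[0], state[1]
    new_vel = vel + u * dt
    new_pos = pos + new_vel * dt
    return torch.stack([new_pos, new_vel])

T = 50                                           # horizon (number of steps)
commands = torch.zeros(T, requires_grad=True)    # initial control sequence
optimizer = torch.optim.Adam([commands], lr=0.1)
target = torch.tensor([1.0, 0.0])                # reach position 1 with zero velocity

for it in range(200):
    optimizer.zero_grad()
    state = torch.tensor([0.0, 0.0])
    cost = torch.tensor(0.0)
    for t in range(T):                               # unroll the simulator
        state = plant_step(state, commands[t])
        cost = cost + 1e-3 * commands[t] ** 2        # control-effort penalty
    cost = cost + ((state - target) ** 2).sum()      # terminal cost
    cost.backward()                                  # backprop through time (adjoint method)
    optimizer.step()                                 # adjust the control sequence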

The Architecture Of an Intelligent System

AI System: Learning Agent + Immutable Objective
The agent gets percepts from the world; the agent acts on the world. The agent tries to minimize the long-term expected cost.
[Slide diagram: the World exchanges percepts/observations and actions/outputs with an Agent (with internal state), whose behaviour is scored by an Objective producing a cost.]

AI System: Predicting + Planning = Reasoning
The essence of intelligence is the ability to predict. To plan ahead, we simulate the world. The action taken minimizes the predicted cost.
[Slide diagram: inside the agent, a world simulator turns action proposals and the inferred world state into predicted percepts; an actor proposes actions and a critic maps predicted percepts to a predicted cost via the objective.]

What We Need is Model-Based Reinforcement Learning
The essence of intelligence is the ability to predict. To plan ahead, we must simulate the world, so as to minimize the predicted value of some objective function.
[Slide diagram: the world simulator unrolled over several time steps inside the agent, with perception feeding the first step and an actor and critic at each step.]

Unsupervised Learning

Energy-Based Unsupervised Learning
Learning an energy function (or contrast function) that takes low values on the data manifold and higher values everywhere else.
[Slide figure: an energy surface over (Y1, Y2).]

Capturing Dependencies Between Variables with an Energy Function
The energy surface is a contrast function that takes low values on the data manifold, and higher values everywhere else. Special case: energy = negative log density. Example: the samples live on the manifold Y2 = (Y1)^2.
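As a toy illustration of such an energy (my own sketch, not from the slides): a function that is zero exactly on the manifold Y2 = (Y1)^2 and grows as points move off it.

def energy(y1, y2):
    # Zero on the manifold Y2 = (Y1)^2, higher everywhere else.
    return (y2 - y1 ** 2) ** 2

print(energy(2.0, 4.0))   # 0.0  (on the parabola: low energy)
print(energy(2.0, 1.0))   # 9.0  (off the parabola: higher energy)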

Energy-Based Unsupervised Learning
Energy function: takes low values on the data manifold, higher values everywhere else. Push down on the energy of desired outputs; push up on everything else. But how do we choose where to push up?
[Slide figure: examples labelled "plausible futures (low energy)" and "implausible futures (high energy)".]

Learning the Energy Function
Parameterized energy function E(Y, W). Make the energy low on the samples; make the energy higher everywhere else. Making the energy low on the samples is easy. But how do we make it higher everywhere else?

Seven Strategies to Shape the Energy Function
1. Build the machine so that the volume of low-energy stuff is constant: PCA, K-means, GMM, square ICA.
2. Push down on the energy of data points, push up everywhere else: max likelihood (needs a tractable partition function).
3. Push down on the energy of data points, push up on chosen locations: contrastive divergence, ratio matching, noise contrastive estimation, minimum probability flow.
4. Minimize the gradient and maximize the curvature around data points: score matching.
5. Train a dynamical system so that the dynamics goes to the manifold: denoising auto-encoder.
6. Use a regularizer that limits the volume of space that has low energy: sparse coding, sparse auto-encoder, PSD.
7. If E(Y) = ||Y - G(Y)||^2, make G(Y) as "constant" as possible: contracting auto-encoder, saturating auto-encoder.

#1: Constant volume of low energy
Energy surfaces for PCA and K-means. Build the machine so that the volume of low-energy stuff is constant: PCA, K-means, GMM, square ICA, ...
PCA: E(Y) = ||W^T W Y - Y||^2
K-means (Z constrained to a 1-of-K code): E(Y) = min_Z Σ_i ||Y - W_i Z_i||^2
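A small NumPy sketch of the two energies above (my own illustration; the data point, subspace and centroids are made up): the PCA energy is the squared reconstruction error under the projection W^T W, and with Z restricted to a 1-of-K code the K-means energy reduces to the squared distance to the nearest centroid.

import numpy as np

def pca_energy(y, W):
    # W: (k, d) with orthonormal rows; W.T @ W @ y projects y onto the subspace.
    return np.sum((W.T @ W @ y - y) ** 2)

def kmeans_energy(y, centroids):
    # The minimum over 1-of-K codes Z is the distance to the closest centroid.
    return np.min(np.sum((centroids - y) ** 2, axis=1))

rng = np.random.default_rng(0)
y = rng.normal(size=5)                        # a data point in R^5
W, _ = np.linalg.qr(rng.normal(size=(5, 2)))  # orthonormal basis of a 2-D subspace
W = W.T                                       # shape (2, 5): rows are basis vectors
centroids = rng.normal(size=(3, 5))           # K = 3 centroids

print(pca_energy(y, W))             # low if y lies near the subspace
print(kmeans_energy(y, centroids))  # low if y lies near some centroid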

#6: Use a regularizer that limits the volume of space that has low energy
Sparse coding, sparse auto-encoder, Predictive Sparse Decomposition.

Why generative models: take-aways
Any energy-based unsupervised learning method can be seen as a probabilistic model by estimating the partition function. I claim that any unsupervised learning method can be seen as energy-based, and can thus be transformed into a generative or probabilistic model.
Explicit probabilistic models are useful because, once we have one, we can use it out of the box for any of a variety of common-sense tasks, with no extra training or special procedures required: anomaly detection, denoising, filling in the blanks / super-resolution, compression / representation (inferring latent variables), scoring the realism of samples, generating samples, ...

2. What is a variational autoencoder? Tutorial by Jaan Altosaar: https://jaan.io/what-is-variational-autoencoder-vae-tutorial/

What is a VAE: take-aways
DL interpretation: a VAE can be seen as a denoising compressive autoencoder. Denoising = we inject noise into one of the layers. Compressive = the middle layers have lower capacity than the outer layers.
Probabilistic interpretation: the decoder of the VAE can be seen as a deep (high representational power) probabilistic model that can give us explicit likelihoods. The encoder of the VAE can be seen as a variational distribution used to help train the decoder.
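A minimal sketch of this reading (my own PyTorch illustration; the layer sizes and the Bernoulli likelihood are assumptions, not from the tutorial): the encoder compresses the input to a narrow middle layer, noise is injected there via the reparameterization trick, and training maximizes the reconstruction term minus the KL penalty derived in the next section.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.enc_mu = nn.Linear(h_dim, z_dim)      # q(z|x) mean
        self.enc_logvar = nn.Linear(h_dim, z_dim)  # q(z|x) log-variance
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # inject noise (reparameterization)
        logits = self.dec(z)                                     # decoder = likelihood model p(x|z)
        # Reconstruction term: E_q[log p(x|z)] for a Bernoulli likelihood
        rec = -F.binary_cross_entropy_with_logits(logits, x, reduction='sum')
        # Penalty term: KL[q(z|x) || p(z)] with a standard normal prior
        kl = -0.5 * torch.sum(1 + logvar - mu ** 2 - torch.exp(logvar))
        return rec - kl   # the variational lower bound (to be maximized)

model = VAE()
x = torch.rand(32, 784)   # fake batch of data in [0, 1]
loss = -model(x)          # minimize the negative lower bound
loss.backward()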

3. From importance sampling to VAEs. Selected slides from Shakir Mohamed's talk at the Deep Learning Summer School 2016

Importance Sampling
[Slide figure: the densities p(z), q(z) and a function f(z) plotted over z.]
Notation: always think of q(z|x), but we will often write q(z) for simplicity.
Conditions: q(z|x) > 0 wherever f(z)p(z) ≠ 0, and q(z) is easy to sample from.
Integral problem: p(x) = ∫ p(x|z) p(z) dz
Proposal: p(x) = ∫ [p(x|z) p(z) / q(z)] q(z) dz
Importance weight: p(x) = ∫ p(x|z) [p(z)/q(z)] q(z) dz, with w^(s) = p(z^(s)) / q(z^(s))
Monte Carlo: p(x) ≈ (1/S) Σ_s w^(s) p(x|z^(s)), where z^(s) ~ q(z)
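A small NumPy/SciPy sketch of the Monte Carlo estimate above (my own illustration): a made-up one-dimensional Gaussian latent-variable model is used so that the true p(x) is available in closed form and the estimate can be checked.

import numpy as np
from scipy.stats import norm

# Made-up model: p(z) = N(0, 1), p(x|z) = N(x; z, 0.5^2), so p(x) = N(x; 0, 1 + 0.5^2).
x, sigma = 1.3, 0.5

rng = np.random.default_rng(0)
S = 100_000
z = rng.normal(loc=1.0, scale=1.0, size=S)         # samples from the proposal q(z) = N(1, 1)
w = norm.pdf(z, 0.0, 1.0) / norm.pdf(z, 1.0, 1.0)  # importance weights p(z)/q(z)
lik = norm.pdf(x, z, sigma)                        # p(x|z) for each sample
p_x_estimate = np.mean(w * lik)                    # (1/S) sum_s w^(s) p(x|z^(s))

p_x_exact = norm.pdf(x, 0.0, np.sqrt(1 + sigma ** 2))
print(p_x_estimate, p_x_exact)   # the two should be close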

Importance Sampling to Variational Inference
Integral problem: p(x) = ∫ p(x|z) p(z) dz
Proposal: p(x) = ∫ [p(x|z) p(z) / q(z)] q(z) dz
Importance weight: p(x) = ∫ p(x|z) [p(z)/q(z)] q(z) dz
Jensen's inequality: log ∫ p(x) g(x) dx ≥ ∫ p(x) log g(x) dx
Applying it to the importance-sampled integral:
log p(x) = log ∫ q(z) [p(x|z) p(z) / q(z)] dz ≥ ∫ q(z) log [p(x|z) p(z) / q(z)] dz = ∫ q(z) log p(x|z) dz - ∫ q(z) log [q(z) / p(z)] dz
Variational lower bound: E_q(z)[log p(x|z)] - KL[q(z) || p(z)]

Variational Free Energy
F(x, q) = E_q(z)[log p(x|z)] - KL[q(z) || p(z)]
Interpreting the bound:
Approximate posterior distribution q(z|x): the best match to the true posterior p(z|x), one of the unknown inferential quantities of interest to us.
Reconstruction cost: the expected log-likelihood measures how well samples from q(z|x) are able to explain the data x.
Penalty: ensures that the explanation of the data q(z|x) doesn't deviate too far from your beliefs p(z). A mechanism for realising Ockham's razor.
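Continuing the toy Gaussian model from the importance-sampling sketch (again my own illustration), the free energy for an arbitrary Gaussian q(z|x) can be estimated by Monte Carlo and compared against the exact log p(x), confirming it is a lower bound.

import numpy as np
from scipy.stats import norm

# Same made-up model: p(z) = N(0, 1), p(x|z) = N(x; z, 0.5^2), so p(x) = N(x; 0, 1.25).
x, sigma = 1.3, 0.5
q_mu, q_sd = 0.8, 0.6          # an arbitrary Gaussian approximate posterior q(z|x)

rng = np.random.default_rng(0)
z = rng.normal(q_mu, q_sd, size=100_000)            # z ~ q(z|x)
reconstruction = np.mean(norm.logpdf(x, z, sigma))  # E_q[log p(x|z)]
# Closed-form KL between two univariate Gaussians, KL[q || p] with p = N(0, 1):
kl = np.log(1.0 / q_sd) + (q_sd ** 2 + q_mu ** 2) / 2.0 - 0.5

free_energy = reconstruction - kl
log_px = norm.logpdf(x, 0.0, np.sqrt(1 + sigma ** 2))
print(free_energy, log_px)     # free_energy <= log_px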

Other Families of Variational Bounds
Variational free energy: F(x, q) = E_q(z)[log p(x|z)] - KL[q(z) || p(z)]
Multi-sample variational objective: F(x, q) = E_q(z)[ log (1/S) Σ_s [p(z^(s))/q(z^(s))] p(x|z^(s)) ]
Rényi variational objective: F(x, q) = 1/(1-α) E_q(z)[ log (1/S) Σ_s ([p(z^(s))/q(z^(s))] p(x|z^(s)))^(1-α) ]
Other generalised families exist. The optimal solution is the same for all objectives.
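A quick numerical check of the multi-sample objective on the same toy model (my own sketch): averaging S importance weights inside the log gives a bound that tightens toward log p(x) as S grows.

import numpy as np
from scipy.stats import norm

# Same made-up model: p(z) = N(0, 1), p(x|z) = N(x; z, 0.5^2), so p(x) = N(x; 0, 1.25).
x, sigma = 1.3, 0.5
q_mu, q_sd = 0.8, 0.6
rng = np.random.default_rng(0)

def multi_sample_bound(S, repeats=2000):
    z = rng.normal(q_mu, q_sd, size=(repeats, S))            # z^(s) ~ q(z|x)
    log_w = (norm.logpdf(z, 0.0, 1.0) - norm.logpdf(z, q_mu, q_sd)
             + norm.logpdf(x, z, sigma))                      # log [p(z) p(x|z) / q(z)]
    # log (1/S) sum_s exp(log_w), averaged over the outer repeats (the E_q of the inner log)
    return np.mean(np.logaddexp.reduce(log_w, axis=1) - np.log(S))

for S in (1, 5, 50):
    print(S, multi_sample_bound(S))
print("log p(x) =", norm.logpdf(x, 0.0, np.sqrt(1 + sigma ** 2)))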

From importance sampling to VAE: take-aways
The VAE objective function can be derived in a way that I think is pretty unobjectionable to Bayesians and frequentists alike. Treat the decoder as a likelihood model we wish to train with maximum likelihood. We want to use importance sampling because p(x|z) is low for most z. The encoder is a trainable importance sampling distribution, and the VAE objective is a lower bound on the likelihood by Jensen's inequality.