On Autoencoders and Score Matching for Energy Based Models. Kevin Swersky, Marc'Aurelio Ranzato, David Buchman, Benjamin M. Marlin, Nando de Freitas

On Autoencoders and Score Matching for Energy Based Models
Kevin Swersky, Marc'Aurelio Ranzato, David Buchman, Benjamin M. Marlin, Nando de Freitas
Toronto Machine Learning Group Meeting, 2011

Goal: Unsupervised Feature Learning
- Automatically learn features from data
- Useful for modeling images, speech, text, etc. without requiring lots of prior knowledge
- Many models and algorithms; we will focus on two commonly used ones

Motivation
- Autoencoders and energy based models represent the state of the art for feature learning in several domains
- We provide some unifying theory for these model families
- This will allow us to derive new autoencoder architectures from energy based models
- It will also give us an interesting way to regularize these autoencoders

Latent Energy Based Models
- A probability distribution over data vectors v, with latent variables h and parameters θ
- Defined using an energy function E_θ(v,h) as:
      P_θ(v) = ∫ exp(−E_θ(v,h)) dh / Z(θ)
- Normalized by the partition function Z(θ) = ∫∫ exp(−E_θ(v,h)) dh dv
- Z(θ) is usually not analytical
- Features given by h
[Figure: graphical models of a Restricted Boltzmann Machine and a factored EBM (e.g. mPoT)]
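
The partition function can only be computed by brute force for very small models; the sketch below (my own illustration, not from the talk) enumerates all configurations of a tiny binary RBM with random weights to normalize P_θ(v) exactly.

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    n_v, n_h = 4, 3                      # tiny sizes so enumeration is feasible
    W = rng.normal(scale=0.1, size=(n_v, n_h))

    def energy(v, h):
        # binary RBM energy without bias terms: E(v, h) = -v^T W h
        return -v @ W @ h

    def configs(n):
        return [np.array(c) for c in itertools.product([0, 1], repeat=n)]

    # Z(theta): sum over every joint (v, h) configuration
    Z = sum(np.exp(-energy(v, h)) for v in configs(n_v) for h in configs(n_h))

    # P_theta(v): sum over hidden configurations only, then normalize
    v = np.array([1, 0, 1, 0])
    P_v = sum(np.exp(-energy(v, h)) for h in configs(n_h)) / Z
    print(P_v)

For realistic models, the outer sum over v (or the integral, for continuous v) is exactly the part that becomes intractable.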

Example: Gaussian-Bernoulli RBM
A Gaussian-Bernoulli restricted Boltzmann machine is an energy based model with visible (observed) variables v ∈ R^D, hidden (latent) variables h ∈ {0,1}^K, and parameters θ = {W}. The energy is given by:
    E_θ(v,h) = ½ vᵀv − vᵀWh
The free energy is obtained by marginalizing over the hidden variables:
    F_θ(v) = ½ vᵀv − Σ_{j=1}^K log(1 + exp(vᵀW_j))
where W_j denotes the j-th column of W.
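
As a quick numerical check of the free-energy formula (my own sketch, with random weights and a small K so that brute-force marginalization over h ∈ {0,1}^K is feasible):

    import itertools
    import numpy as np

    rng = np.random.default_rng(1)
    D, K = 5, 3
    W = rng.normal(scale=0.1, size=(D, K))
    v = rng.normal(size=D)

    def energy(v, h):
        # GBRBM energy: E(v, h) = 0.5 v^T v - v^T W h
        return 0.5 * v @ v - v @ W @ h

    # free energy by brute-force marginalization: F(v) = -log sum_h exp(-E(v, h))
    F_brute = -np.log(sum(np.exp(-energy(v, np.array(h)))
                          for h in itertools.product([0, 1], repeat=K)))

    # closed form from the slide: F(v) = 0.5 v^T v - sum_j log(1 + exp(v^T W_j))
    F_closed = 0.5 * v @ v - np.sum(np.log1p(np.exp(v @ W)))

    print(F_brute, F_closed)   # agree up to floating-point error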

Autoencoders
- A feed-forward neural network designed to reconstruct its input
- Encoding function: g_θ(v)
- Reconstruction function: f_θ(g_θ(v))
- Minimizes reconstruction error: θ̂ = argmin_θ ||f_θ(g_θ(v)) − v||²
- Features given by g_θ(v)
[Figure: autoencoder diagram mapping v through g_θ(v) to the reconstruction f_θ(g_θ(v))]

Example: One-Layer Autoencoder
A commonly used autoencoder architecture maps inputs v ∈ R^D to outputs v̂ ∈ R^D through a deterministic hidden layer ĥ ∈ [0,1]^K using parameters θ = {W}, by the following system:
    v̂ = f_θ(g_θ(v)) = W ĥ
    ĥ = g_θ(v) = σ(Wᵀv)
where σ(x) = 1 / (1 + exp(−x)). We train this by minimizing the reconstruction error:
    l(θ) = Σ_{i=1}^D ½ (v_i − v̂_i)²
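
A minimal numpy rendering of this tied-weight architecture (illustrative only; as on the slide, bias terms are omitted):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def one_layer_autoencoder_loss(W, v):
        h_hat = sigmoid(W.T @ v)                  # encoder: h_hat = sigma(W^T v)
        v_hat = W @ h_hat                         # decoder: v_hat = W h_hat
        return 0.5 * np.sum((v - v_hat) ** 2)     # reconstruction error l(theta)

    rng = np.random.default_rng(2)
    W = rng.normal(scale=0.1, size=(8, 4))        # D = 8 visible, K = 4 hidden
    v = rng.normal(size=8)
    print(one_layer_autoencoder_loss(W, v))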

Comparison
Energy based models: undirected, probabilistic, slow inference, intractable objective function
Autoencoders: directed, deterministic, fast inference, tractable objective function
Unifying these model families lets us leverage the advantages of each

Maximum Likelihood for EBMs
Minimize the KL divergence between the empirical data distribution P̃(v) and the model distribution P_θ(v):
    θ̂ = argmin_θ E_{P̃(v)}[ log P̃(v) − log P_θ(v) ]
[Figure: curves of P̃(v) and P_θ(v) plotted as log(p(v)) against v]
- Requires that we know Z(θ)
- Idea: approximate Z(θ) with samples from P_θ(v)
- Uses MCMC: contrastive divergence (CD), persistent contrastive divergence (PCD), fast PCD (FPCD)
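
For concreteness, a single CD-1 update for the Gaussian-Bernoulli RBM from earlier might look like the sketch below (my own code; the unit-variance Gaussian reconstruction, the learning rate, and the use of hidden probabilities in the gradient are standard choices I am assuming, not details given on the slide):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(W, v0, lr=1e-3, rng=np.random.default_rng(3)):
        # positive phase: hidden units driven by the data
        h0_prob = sigmoid(W.T @ v0)
        h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
        # negative phase: one Gibbs step (Gaussian visibles, unit variance)
        v1 = W @ h0 + rng.normal(size=v0.shape)
        h1_prob = sigmoid(W.T @ v1)
        # approximate likelihood gradient: <v h>_data - <v h>_model
        grad = np.outer(v0, h0_prob) - np.outer(v1, h1_prob)
        return W + lr * grad

PCD keeps a persistent Markov chain for the negative phase instead of restarting at the data; FPCD additionally uses a set of fast weights to speed up mixing.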

Score Matching (Hyvärinen, 2006)
- Intuition: if two functions have the same derivatives everywhere, then they are the same functions up to a constant
- We match ∂/∂v_i log P̃(v) and ∂/∂v_i log P_θ(v)
- The constraint ∫ P(v) dv = 1 means that the distributions must be equal if their derivatives are perfectly matched

Score Matching (2)
Minimize the distance between the score functions ∇_v log P̃(v) and ∇_v log P_θ(v):
    θ̂ = argmin_θ J(θ) = E_{P̃(v)}[ ½ ||∇_v log P̃(v) − ∇_v log P_θ(v)||² ]
Equivalent to (Hyvärinen, 2006):
    J(θ) = E_{P̃(v)}[ Σ_{i=1}^D ( ½ (∂ log P_θ(v)/∂v_i)² + ∂² log P_θ(v)/∂v_i² ) ] + const
Does not depend on Z(θ), since ∇_v Z(θ) = 0
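
To make the "no Z(θ) required" point concrete, here is a small self-contained illustration (mine, not from the talk): the model is an unnormalized 1-D Gaussian, the score and its derivative come from the unnormalized log-density alone, and a grid search on J recovers parameters close to those of the data.

    import numpy as np

    rng = np.random.default_rng(4)
    data = rng.normal(loc=2.0, scale=1.5, size=10_000)

    def sm_objective(mu, sigma, v):
        # unnormalized model: log p_theta(v) = -(v - mu)^2 / (2 sigma^2) + const
        score = -(v - mu) / sigma ** 2          # d/dv log p_theta(v)
        dscore = -1.0 / sigma ** 2              # d^2/dv^2 log p_theta(v)
        return np.mean(0.5 * score ** 2 + dscore)

    grid = [(m, s) for m in np.linspace(0, 4, 41) for s in np.linspace(0.5, 3.0, 26)]
    best = min(grid, key=lambda p: sm_objective(p[0], p[1], data))
    print(best)   # close to (2.0, 1.5); Z(theta) was never computed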

Score Matching (3)
- Score matching is not as statistically efficient as maximum likelihood
- It is in fact an approximation to pseudolikelihood (Hyvärinen, 2006)
- Statistical efficiency is only beneficial if you can actually find good parameters
- We trade efficiency for the ability to find better parameters with more powerful optimization

Denoising Score Matching (Vincent, 2011)
Rough idea:
1. Apply a kernel q_β(v|v') to P̃(v') to get a smoothed distribution q_β(v) = ∫ q_β(v|v') P̃(v') dv'
2. Apply score matching to q_β(v)
- Faster than ordinary score matching (no 2nd derivatives)
- Includes denoising autoencoders (for continuous data) as a special case
[Figure: likelihood over the data for the original distribution and for the smoothed distribution q_β(v)]
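
A sketch of the denoising objective (my own, following Vincent's formulation and assuming a Gaussian corruption kernel, which the slide does not specify): the target score of q_β(ṽ|v) = N(v, σ²I) is (v − ṽ)/σ², so only first derivatives of the model are ever needed.

    import numpy as np

    def dsm_loss(score_fn, v, sigma, rng=np.random.default_rng(5)):
        # corrupt the data with the smoothing kernel q(v_tilde | v) = N(v, sigma^2 I)
        v_tilde = v + sigma * rng.normal(size=v.shape)
        # target: grad_{v_tilde} log q(v_tilde | v) = (v - v_tilde) / sigma^2
        target = (v - v_tilde) / sigma ** 2
        # match the model score to the target (no second derivatives involved)
        diff = score_fn(v_tilde) - target
        return 0.5 * np.mean(np.sum(diff ** 2, axis=1))

    rng = np.random.default_rng(6)
    model_score = lambda V: -V        # score of a standard normal model, as a stand-in
    print(dsm_loss(model_score, rng.normal(size=(100, 8)), sigma=0.3))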

Score Matching Recipe
1. Write down the energy function E_θ(v,h)
2. Marginalize over the latent variables to form F_θ(v)
3. Derive the score matching objective J(θ) (in practice, use Theano to do this step!)
    E_θ(v,h) → F_θ(v) → J(θ)

Example: RBM with All Gaussian Units
If both v and h are Gaussian-distributed, we have the following energy and free energy:
    E_θ(v,h) = ½ vᵀv + ½ hᵀh − vᵀWh
    F_θ(v) = ½ vᵀ(I − WWᵀ)v
The score matching objective is:
    J(θ) = Σ_{i=1}^D [ ½ ( v_i − Σ_{j=1}^K W_ij (W_jᵀ v) )² + Σ_{j=1}^K W_ij² ]
This is a linear autoencoder with weight-decay regularization
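
To see that the reconstructed objective really is the score matching objective of this model, here is a small check of my own: computing Hyvärinen's form directly from F_θ(v) = ½ vᵀ(I − WWᵀ)v gives the same value as the slide's expression, up to the constant −D that the slide drops.

    import numpy as np

    rng = np.random.default_rng(8)
    D, K = 6, 3
    W = rng.normal(scale=0.1, size=(D, K))
    v = rng.normal(size=D)

    # closed form from the slide (additive constant dropped)
    resid = v - W @ (W.T @ v)
    J_closed = np.sum(0.5 * resid ** 2) + np.sum(W ** 2)

    # Hyvarinen's form computed from F(v) = 0.5 v^T (I - W W^T) v
    A = np.eye(D) - W @ W.T
    score = -A @ v                                           # d/dv log p_theta(v)
    J_hyvarinen = np.sum(0.5 * score ** 2) + np.trace(-A)    # adds sum_i d^2/dv_i^2 log p

    print(J_closed, J_hyvarinen + D)   # equal: the slide drops the constant -D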

Example: RBM with All Gaussian Units (2)
With denoising score matching:
    J(θ) = Σ_{i=1}^D E_{x ∼ N(v, σ²I)}[ ½ ( v_i − Σ_{j=1}^K W_ij (W_jᵀ x) )² ]
Adding noise to the input is roughly equivalent to weight-decay (Bishop, 1994)

Example: Gaussian-Bernoulli RBMs
We apply score matching to the GBRBM model:
    J(θ) = const + Σ_{i=1}^D [ ½ (v_i − v̂_i)² + Σ_{j=1}^K W_ij² ĥ_j (1 − ĥ_j) ]
- This is also a regularized autoencoder
- The regularizer is a sum of weighted variances of the Bernoulli random variables h conditioned on v
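
Written out in numpy (a sketch under the same assumptions as the slide: unit-variance Gaussian visibles, no biases), the objective is literally the one-layer autoencoder loss from before plus a variance penalty:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gbrbm_score_matching_objective(W, v):
        h_hat = sigmoid(W.T @ v)                   # E[h | v] (Bernoulli means)
        v_hat = W @ h_hat                          # autoencoder reconstruction
        recon = 0.5 * np.sum((v - v_hat) ** 2)     # squared error term
        # regularizer: W_ij^2 weighted by the conditional Bernoulli variances
        reg = np.sum((W ** 2) * (h_hat * (1.0 - h_hat)))
        return recon + reg                         # J(theta) up to an additive constant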

Example: mPoT (Warning: lots of equations!)
- The GBRBM is bad at modelling heavy-tailed distributions and at capturing covariance structure in v
- We can use more sophisticated energy based models to overcome these limitations
- The mPoT is an energy based model with Gaussian-distributed visible units, Gamma-distributed hidden units h^c ∈ R_+^K, and Bernoulli-distributed hidden units h^m ∈ {0,1}^M, with parameters θ = {C, W}:
    E_θ(v, h^m, h^c) = Σ_{k=1}^K [ h^c_k (1 + ½ (C_kᵀ v)²) + (1 − γ) log h^c_k ] + ½ vᵀv − vᵀ W h^m
    F_θ(v) = γ Σ_{k=1}^K log(1 + ½ (φ^c_k)²) − Σ_{j=1}^M log(1 + e^{φ^m_j}) + ½ vᵀv
    where φ^c_k = C_kᵀ v and φ^m_j = W_jᵀ v

Example: mPoT (2) (Warning: lots of equations!)
    J(θ) = Σ_{i=1}^D [ ½ ψ_i(p_θ(v))² + Σ_{k=1}^K ( ρ(ĥ^c_k) D_ik² − ĥ^c_k K_ik ) + Σ_{j=1}^M ĥ^m_j (1 − ĥ^m_j) W_ij² − 1 ]
    ψ_i(p_θ(v)) = Σ_{k=1}^K ĥ^c_k D_ik + Σ_{j=1}^M ĥ^m_j W_ij − v_i
    D_ik = −(C_kᵀ v) C_ik,   K_ik = C_ik²
    ĥ^c_k = γ / (1 + ½ (φ^c_k)²),   ĥ^m_j = sigm(φ^m_j),   ρ(x) = x²

Score Matching for General Latent EBMs
- Score matching can produce complicated and difficult-to-interpret objective functions
- Our solution is to apply score matching to latent EBMs without directly specifying E_θ(v,h)

Score Matching for General Latent EBMs (2)
Theorem: The score matching objective for a latent EBM can be expressed succinctly in terms of expectations of the energy with respect to the conditional distribution P_θ(h|v):
    J(θ) = E_{P̃(v)}[ Σ_{i=1}^{n_v} ( ½ ( E_{p_θ(h|v)}[∂E_θ(v,h)/∂v_i] )² + var_{p_θ(h|v)}[∂E_θ(v,h)/∂v_i] − E_{p_θ(h|v)}[∂²E_θ(v,h)/∂v_i²] ) ]
The score matching objective for a latent EBM always consists of 3 terms:
1. Squared error
2. Variance of error
3. Curvature
Variance and curvature can be seen as regularizers
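
The theorem follows from two standard identities for the free energy F_θ(v) = −log ∫ exp(−E_θ(v,h)) dh, written here in LaTeX as a condensed version of the argument (my paraphrase, not the slides' own proof):

    \frac{\partial F_\theta(v)}{\partial v_i}
      = \mathbb{E}_{p_\theta(h \mid v)}\!\left[\frac{\partial E_\theta(v,h)}{\partial v_i}\right],
    \qquad
    \frac{\partial^2 F_\theta(v)}{\partial v_i^2}
      = \mathbb{E}_{p_\theta(h \mid v)}\!\left[\frac{\partial^2 E_\theta(v,h)}{\partial v_i^2}\right]
      - \mathrm{var}_{p_\theta(h \mid v)}\!\left[\frac{\partial E_\theta(v,h)}{\partial v_i}\right].

Since log P_θ(v) = −F_θ(v) − log Z(θ), substituting these into Hyvärinen's objective J(θ) = E_{P̃(v)}[ Σ_i ½ (∂ log P_θ(v)/∂v_i)² + ∂² log P_θ(v)/∂v_i² ] yields exactly the squared error, variance, and curvature terms above.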

Likelihood Visualization
[Figure: model probability (likelihood) over the data before learning]

Likelihood Visualization (2)
[Figure: model probability before learning, and after the squared error term centers probability mass on the data]

Likelihood Visualization (3)
[Figure: model probability before learning; squared error centers mass on the data; variance and curvature shape the distribution]

Gradient Steps on the Free Energy
Consider taking a gradient step along the free energy with step size ε:
    v_i^{(t+1)} = v_i^{(t)} − ε ∂F_θ(v)/∂v_i
When the free energy decomposes as F_θ(v) = F̃_θ(v) + ½ vᵀv, we get:
    v_i^{(t+1)} = v_i^{(t)} − ε ( v_i^{(t)} + ∂F̃_θ(v)/∂v_i )
Setting ε to 1:
    v_i^{(t+1)} = −∂F̃_θ(v)/∂v_i
An autoencoder can be thought of as performing a reconstruction by taking a gradient step along its free energy with a step size of 1
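
A quick numeric sanity check of this view (my own sketch): for the GBRBM, F̃_θ(v) = −Σ_j log(1 + exp(vᵀW_j)), and a unit step lands exactly on the autoencoder reconstruction W σ(Wᵀv).

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(7)
    D, K = 6, 3
    W = rng.normal(scale=0.1, size=(D, K))
    v = rng.normal(size=D)

    def F_tilde(v):
        # free energy without the 0.5 v^T v term
        return -np.sum(np.log1p(np.exp(v @ W)))

    # gradient of F_tilde by central finite differences
    eps = 1e-6
    grad = np.array([(F_tilde(v + eps * e) - F_tilde(v - eps * e)) / (2 * eps)
                     for e in np.eye(D)])

    step = -grad                      # v^(t+1) with step size 1
    recon = W @ sigmoid(W.T @ v)      # autoencoder reconstruction
    print(np.max(np.abs(step - recon)))   # approximately zero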

Gaussian Energy Functions
If the energy function forms a Gaussian distribution in v:
    E_θ(v,h) = ½ (v − µ_h)ᵀ Ω_h (v − µ_h) + const
then the squared error will take the form:
    ½ ( E_{p_θ(h|v)}[ Ω_h (v − µ_h) ] )²
- This corresponds to a more general autoencoder reconstruction error
- For a GBRBM: Ω_h = I
- If Ω_h ≠ I, then the autoencoder will also model covariance in v

Objectives
- Can we learn good features with richer autoencoder models?
- How do score matching estimators compare to maximum likelihood estimators?
- What are the effects of regularization in a score matching-based autoencoder?

Learned Features
[Figure: learned covariance filters]

Learned Features (2)
[Figure: learned mean filters]

Denoising
- Reconstruct an image from a blurry version
- Finds a nearby image that has a high likelihood
[Figure: denoising MSE (roughly 0.35 to 0.7) versus training epoch (0 to 100) for FPCD, PCD, CD, SM, and SMD]

Classification on CIFAR-10
    CD      PCD     FPCD    SM      SMD     AE
    64.6%   64.7%   65.5%   65.0%   64.7%   57.6%
- Most methods are comparable
- The mPoT autoencoder without regularization (AE) does significantly worse

Effect of Regularizers on mPoT Autoencoder
[Figure: objective-term values (Total, Recon, Var, Curvature) versus training epoch for the score matching mPoT autoencoder (values ×10⁻⁴) and for the autoencoder with no regularization]
- Total (blue curve) is the sum of the other curves
- The other curves represent the 3 score matching terms

Quick Mention: Memisevic, 2011
- Roland Memisevic, a recent U of T grad, published a paper in this year's ICCV
- He derived the same models through the idea of relating two different inputs using an autoencoder

Conclusions
- Score matching provides a unifying theory for autoencoders and latent energy based models
- It reveals novel regularizers for some models
- We can derive novel autoencoders that model richer structure
- We provide an interpretation of the score matching objective for all latent EBMs
- Properly shaping the implied energy of an autoencoder is one key to learning good features
- This is done implicitly when noise is added to the input
- Perhaps it is also done implicitly with sparsity penalties, etc.

Future Work
- We can model other kinds of data with higher-order features by simply changing the loss function
- Extend higher-order models to deeper layers
- Apply Hessian-free optimization to EBMs and related autoencoders