Lecture 16 Deep Neural Generative Models


CMSC 35246: Deep Learning. Shubhendu Trivedi & Risi Kondor, University of Chicago. May 22, 2017.

Approach so far: We have considered simple models and then constructed their deep, non-linear variants.
- Example: PCA (and the linear autoencoder) to nonlinear PCA (non-linear, possibly deep, autoencoders)
- Example: Sparse coding (a sparse autoencoder with linear decoding) to deep sparse autoencoders
- All the models we have considered so far are completely deterministic
- The encoders and decoders have no stochasticity
- We don't construct a probabilistic model of the data
- So we can't sample from the model

Representations
[Figure: Ruslan Salakhutdinov]

To motivate Deep Neural Generative models, like before, let's seek inspiration from simple linear models first.

Linear Factor Model
- We want to build a probabilistic model of the input, $P(x)$
- Like before, we are interested in latent factors $h$ that explain $x$
- We then care about the marginal: $P(x) = \mathbb{E}_h[P(x \mid h)]$
- $h$ is a representation of the data

Linear Factor Model
- The latent factors $h$ are an encoding of the data
- Simplest decoding model: obtain $x$ by a linear transformation of $h$, plus some noise
- Formally: suppose we sample the latent factors from a distribution $h \sim P(h)$
- Then: $x = Wh + b + \epsilon$ (sampled concretely in the sketch below)
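To make the generative story concrete, here is a minimal sketch of ancestral sampling in such a model. It is not from the slides; the prior, dimensions, and noise scale are illustrative assumptions.

```python
# Minimal sketch (illustrative, not from the slides): sample h ~ P(h) from a
# unit-Gaussian factorial prior, then decode x = W h + b + eps with Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 5                          # latent and observed dimensions (assumed)
W = rng.normal(size=(d, k))          # decoder weights
b = np.zeros(d)                      # offset

h = rng.normal(size=k)               # h ~ P(h)
eps = 0.1 * rng.normal(size=d)       # observation noise
x = W @ h + b + eps                  # one sample from the model
print(x)
```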

Linear Factor Model
- $P(h)$ is a factorial distribution
[Figure: directed model with latent units $h_1, h_2, h_3$ over observed units $x_1, \ldots, x_5$; $x = Wh + b + \epsilon$]
- How do we learn in such a model? Let's look at a simple example

Probabilistic PCA
- Suppose the underlying latent factor has a Gaussian distribution: $h \sim \mathcal{N}(h; 0, I)$
- For the noise model, assume $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$
- Then: $P(x \mid h) = \mathcal{N}(x; Wh + b, \sigma^2 I)$
- We care about the marginal $P(x)$ (the predictive distribution): $P(x) = \mathcal{N}(x; b, WW^T + \sigma^2 I)$

Probabilistic PCA
- $P(x) = \mathcal{N}(x; b, WW^T + \sigma^2 I)$
- How do we learn the parameters? (EM, ML estimation)
- Let's look at ML estimation. Let $C = WW^T + \sigma^2 I$
- We want to maximize $\ell(\theta; X) = \sum_i \log P(x_i \mid \theta)$

Probabilistic PCA: ML Estimation
$$\ell(\theta; X) = \sum_i \log P(x_i \mid \theta) = -\frac{N}{2}\log|C| - \frac{1}{2}\sum_i (x_i - b)^T C^{-1} (x_i - b) = -\frac{N}{2}\log|C| - \frac{1}{2}\,\mathrm{Tr}\Big[C^{-1}\sum_i (x_i - b)(x_i - b)^T\Big] = -\frac{N}{2}\log|C| - \frac{1}{2}\,\mathrm{Tr}[C^{-1}S]$$
- Now fit the parameters $\theta = \{W, b, \sigma\}$ to maximize the log-likelihood (a numerical sketch follows below)
- Can also use EM
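As a numerical companion, here is a small sketch (not from the lecture) of evaluating this log-likelihood; `ppca_log_likelihood` is a hypothetical helper and the data are synthetic.

```python
# Sketch: l(theta; X) = -N/2 log|C| - 1/2 Tr[C^{-1} S], with C = W W^T + sigma^2 I
# and S the scatter matrix of the centered data.
import numpy as np

def ppca_log_likelihood(X, W, b, sigma2):
    """X: (N, d) data; W: (d, k) loadings; b: (d,) offset; sigma2: noise variance."""
    N, d = X.shape
    C = W @ W.T + sigma2 * np.eye(d)       # marginal covariance
    S = (X - b).T @ (X - b)                # sum_i (x_i - b)(x_i - b)^T
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * N * logdet - 0.5 * np.trace(np.linalg.solve(C, S))

rng = np.random.default_rng(0)
W_true = rng.normal(size=(5, 2))
H = rng.normal(size=(100, 2))
X = H @ W_true.T + 0.1 * rng.normal(size=(100, 5))   # x = W h + b + eps, with b = 0
print(ppca_log_likelihood(X, W_true, np.zeros(5), 0.01))
```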

Factor Analysis
- Fix the latent factor prior to be the unit Gaussian as before: $h \sim \mathcal{N}(h; 0, I)$
- Noise is sampled from a Gaussian with diagonal covariance: $\Psi = \mathrm{diag}([\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2])$
- Still assume a linear relationship between latent and observed variables, so the marginal is $P(x) = \mathcal{N}(x; b, WW^T + \Psi)$

Factor Analysis
- On deriving the posterior $P(h \mid x) = \mathcal{N}(h; \mu, \Lambda)$, we get (see the sketch below):
  $\mu = W^T (WW^T + \Psi)^{-1} (x - b)$
  $\Lambda = I - W^T (WW^T + \Psi)^{-1} W$
- The parameters are coupled, which makes ML estimation difficult
- Need to employ EM (or non-linear optimization)
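A quick sketch of computing the posterior moments, again not from the slides; `fa_posterior` is a hypothetical helper and the dimensions are illustrative.

```python
# Sketch: mu = W^T (W W^T + Psi)^{-1} (x - b), Lambda = I - W^T (W W^T + Psi)^{-1} W.
import numpy as np

def fa_posterior(x, W, b, psi):
    """W: (d, k); x, b: (d,); psi: (d,) diagonal noise variances."""
    d, k = W.shape
    G = np.linalg.inv(W @ W.T + np.diag(psi))   # (W W^T + Psi)^{-1}
    mu = W.T @ G @ (x - b)
    Lam = np.eye(k) - W.T @ G @ W
    return mu, Lam

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))
mu, Lam = fa_posterior(rng.normal(size=5), W, np.zeros(5), 0.1 * np.ones(5))
print(mu, Lam)
```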

More General Models
- Suppose $P(h)$ cannot be assumed to have a nice Gaussian form
- The decoding of the input from the latent states can be a complicated non-linear function
- Estimation and inference can get complicated!

Earlier we had:
[Figure: directed model with latent units $h_1, h_2, h_3$ over observed units $x_1, \ldots, x_5$]

Quick Review
- Generative models can be represented as directed graphical models
- The nodes represent random variables and arcs indicate dependency
- Some of the random variables are observed, others are hidden

Sigmoid Belief Networks............ h 3 h 2 h 1 Just like a feedfoward network, but with arrows reversed. x

Sigmoid Belief Networks
- Let $x = h^0$. Consider binary activations; then: $P(h^k_i = 1 \mid h^{k+1}) = \mathrm{sigm}\big(b^k_i + \sum_j W^{k+1}_{i,j} h^{k+1}_j\big)$
- The joint probability factorizes as:
$$P(x, h^1, \ldots, h^l) = P(h^l) \left(\prod_{k=1}^{l-1} P(h^k \mid h^{k+1})\right) P(x \mid h^1)$$
- Marginalization yields $P(x)$, which is intractable in practice except for very small models

Sigmoid Belief Networks
$$P(x, h^1, \ldots, h^l) = P(h^l) \left(\prod_{k=1}^{l-1} P(h^k \mid h^{k+1})\right) P(x \mid h^1)$$
- The top-level prior is chosen to be factorizable: $P(h^l) = \prod_i P(h^l_i)$
- A single (Bernoulli) parameter is needed for each $h^l_i$ in the case of binary units
- Deep Belief Networks are like Sigmoid Belief Networks, except for the top two layers (sampling from this model is sketched below)
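Since the model is a directed network, sampling is a single top-down (ancestral) pass. A minimal sketch, not from the slides, with illustrative layer sizes and a uniform Bernoulli top prior:

```python
# Sketch: ancestral sampling in a binary sigmoid belief network, using
# P(h^k_i = 1 | h^{k+1}) = sigm(b^k_i + sum_j W^{k+1}_{ij} h^{k+1}_j).
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_sbn(Ws, bs, top_probs, rng):
    """Ws, bs are listed top-down: Ws[0] maps the top layer h^l to h^{l-1}, etc."""
    h = (rng.random(top_probs.shape) < top_probs).astype(float)  # h^l ~ factorial prior
    for W, b in zip(Ws, bs):                                     # top-down pass
        p = sigm(b + W @ h)
        h = (rng.random(p.shape) < p).astype(float)
    return h                                                     # bottom layer is x = h^0

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(6, 4)), rng.normal(size=(8, 6))]   # h^2 -> h^1 -> x
bs = [np.zeros(6), np.zeros(8)]
print(sample_sbn(Ws, bs, top_probs=0.5 * np.ones(4), rng=rng))
```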

Sigmoid Belief Networks
- The general case of such models is called a Helmholtz Machine
- Two key references:
  G. E. Hinton, P. Dayan, B. J. Frey, R. M. Neal: The Wake-Sleep Algorithm for Unsupervised Neural Networks, Science, 1995
  R. M. Neal: Connectionist Learning of Belief Networks, Artificial Intelligence, 1992

Deep Belief Networks............ h 3 h 2 h 1 The top two layers now have undirected edges x

Deep Belief Networks
- The joint probability changes to:
$$P(x, h^1, \ldots, h^l) = P(h^l, h^{l-1}) \left(\prod_{k=1}^{l-2} P(h^k \mid h^{k+1})\right) P(x \mid h^1)$$

Deep Belief Networks
[Figure: the top two layers, $h^l$ and $h^{l-1}$, with undirected connections]
- The top two layers form a Restricted Boltzmann Machine
- An RBM has the joint distribution: $P(h^l, h^{l-1}) \propto \exp\big(b^\top h^{l-1} + c^\top h^l + (h^l)^\top W h^{l-1}\big)$
- We will return to RBMs and their training procedures in a while, but first we look at the mathematical machinery that will make our task easier

Energy-Based Models
- Energy-based models assign a scalar energy to every configuration of the variables under consideration
- Learning: change the energy function so that its final shape has some desirable properties
- We can define a probability distribution through an energy: $P(x) = \frac{\exp(-\mathrm{Energy}(x))}{Z}$
- Energies live in the log-probability domain: $\mathrm{Energy}(x) = \log \frac{1}{Z\,P(x)}$

Energy-Based Models
- $P(x) = \frac{\exp(-\mathrm{Energy}(x))}{Z}$
- $Z$ is a normalizing factor called the partition function: $Z = \sum_x \exp(-\mathrm{Energy}(x))$ (enumerated for a toy model in the sketch below)
- How do we specify the energy function?
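For intuition about why $Z$ is the bottleneck, here is a brute-force sketch (not from the slides) that enumerates all configurations of a tiny binary model; the quadratic energy is an arbitrary illustrative choice.

```python
# Sketch: P(x) = exp(-Energy(x)) / Z with Z computed by enumeration.
# Enumeration costs 2^d terms and is only feasible for toy d.
import itertools
import numpy as np

def energy(x, A, b):
    return -x @ A @ x - b @ x       # an arbitrary toy energy

d = 4
rng = np.random.default_rng(0)
A, b = 0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=d)

configs = [np.array(c, dtype=float) for c in itertools.product([0, 1], repeat=d)]
Z = sum(np.exp(-energy(x, A, b)) for x in configs)           # partition function
P = {tuple(x): np.exp(-energy(x, A, b)) / Z for x in configs}
print(sum(P.values()))                                       # sanity check: 1.0
```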

Product of Experts Formulation
- In this formulation, the energy function is a sum: $\mathrm{Energy}(x) = \sum_i f_i(x)$
- Therefore: $P(x) = \frac{\exp(-\sum_i f_i(x))}{Z}$
- We have a product of experts: $P(x) \propto \prod_i P_i(x) \propto \prod_i \exp(-f_i(x))$

Product of Experts Formulation
- $P(x) \propto \prod_i P_i(x) \propto \prod_i \exp(-f_i(x))$
- Every expert $f_i$ can be seen as enforcing a constraint on $x$
- If $f_i$ is large, then $P_i(x)$ is small, i.e. the expert thinks $x$ is implausible (constraint violated)
- If $f_i$ is small, then $P_i(x)$ is large, i.e. the expert thinks $x$ is plausible (constraint satisfied)
- Contrast this with mixture models (a toy example of the product follows below)
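The veto behaviour of the product is easy to see numerically. A toy sketch, not from the slides, with two made-up experts:

```python
# Sketch: unnormalized product of experts, P(x) prop. to prod_i exp(-f_i(x)).
# The product is small whenever ANY single expert's f_i is large.
import numpy as np

experts = [
    lambda x: (x[0] - x[1]) ** 2,    # expert 1: wants x0 close to x1
    lambda x: np.abs(x).sum(),       # expert 2: wants x near zero
]

def unnormalized_p(x):
    return np.exp(-sum(f(x) for f in experts))

print(unnormalized_p(np.array([0.1, 0.1])))    # both satisfied: relatively large
print(unnormalized_p(np.array([3.0, -3.0])))   # both violated: near zero
```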

Latent Variables
- $x$ is observed; let's say $h$ are hidden factors that explain $x$
- The probability then becomes: $P(x, h) = \frac{\exp(-\mathrm{Energy}(x, h))}{Z}$
- We only care about the marginal: $P(x) = \sum_h \frac{\exp(-\mathrm{Energy}(x, h))}{Z}$

Latent Variables
- $P(x) = \sum_h \frac{\exp(-\mathrm{Energy}(x, h))}{Z}$
- We introduce another term, in analogy with statistical physics, the free energy: $P(x) = \frac{\exp(-\mathrm{FreeEnergy}(x))}{Z}$
- Free energy is just a marginalization of energies in the log-domain (see the sketch below): $\mathrm{FreeEnergy}(x) = -\log \sum_h \exp(-\mathrm{Energy}(x, h))$
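A brute-force sketch of the free energy for a tiny model with binary hidden units, not from the slides; the bilinear energy anticipates the RBM form used later.

```python
# Sketch: FreeEnergy(x) = -log sum_h exp(-Energy(x, h)), enumerated over all
# binary h (2^|h| terms), with a log-sum-exp for numerical stability.
import itertools
import numpy as np

def energy(x, h, W, b, c):
    return -(b @ x + c @ h + h @ W @ x)

def logsumexp(a):
    m = np.max(a)
    return m + np.log(np.sum(np.exp(np.array(a) - m)))

def free_energy(x, W, b, c):
    hs = [np.array(h, dtype=float) for h in itertools.product([0, 1], repeat=len(c))]
    return -logsumexp([-energy(x, h, W, b, c) for h in hs])

rng = np.random.default_rng(0)
W, b, c = rng.normal(size=(3, 4)), np.zeros(4), np.zeros(3)
print(free_energy(np.array([1.0, 0.0, 1.0, 1.0]), W, b, c))
```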

Latent Variables
- $P(x) = \frac{\exp(-\mathrm{FreeEnergy}(x))}{Z}$
- Likewise, the partition function: $Z = \sum_x \exp(-\mathrm{FreeEnergy}(x))$
- We now have an expression for $P(x)$ (and hence for the data log-likelihood). Let us see what the gradient looks like

Data Log-Likelihood Gradient
- $P(x) = \frac{\exp(-\mathrm{FreeEnergy}(x))}{Z}$
- Working from the above, the gradient is:
$$\frac{\partial \log P(x)}{\partial \theta} = -\frac{\partial\,\mathrm{FreeEnergy}(x)}{\partial \theta} + \frac{1}{Z} \sum_{\tilde{x}} \exp(-\mathrm{FreeEnergy}(\tilde{x}))\,\frac{\partial\,\mathrm{FreeEnergy}(\tilde{x})}{\partial \theta}$$
- Note that $P(\tilde{x}) = \frac{\exp(-\mathrm{FreeEnergy}(\tilde{x}))}{Z}$

Data Log-Likelihood Gradient
- The expected log-likelihood gradient over the training set has the following form:
$$\mathbb{E}_{\hat{P}}\left[\frac{\partial \log P(x)}{\partial \theta}\right] = -\mathbb{E}_{\hat{P}}\left[\frac{\partial\,\mathrm{FreeEnergy}(x)}{\partial \theta}\right] + \mathbb{E}_{P}\left[\frac{\partial\,\mathrm{FreeEnergy}(x)}{\partial \theta}\right]$$
- $\hat{P}$ is the empirical training distribution. Easy to compute!
- $P$ is the model distribution (exponentially many configurations!). Usually very hard to compute!
- Resort to Markov Chain Monte Carlo to get a stochastic estimator of the gradient

A Special Case
- Suppose the energy has the following form: $\mathrm{Energy}(x, h) = -\beta(x) + \sum_i \gamma_i(x, h_i)$
- The free energy, and the numerator of the likelihood, can then be computed tractably!
- What is $P(x)$? What is $\mathrm{FreeEnergy}(x)$?

A Special Case
- The likelihood term: $P(x) = \frac{e^{\beta(x)}}{Z} \prod_i \sum_{h_i} e^{-\gamma_i(x, h_i)}$
- The free energy term: $\mathrm{FreeEnergy}(x) = -\log P(x) - \log Z = -\beta(x) - \sum_i \log \sum_{h_i} e^{-\gamma_i(x, h_i)}$

Restricted Boltzmann Machines... Recall the form of energy:... h l+1 h l

Restricted Boltzmann Machines...... h l+1 h l Recall the form of energy: Energy(x, h) = b T x c T h h T W x

Restricted Boltzmann Machines... h l+1... Recall the form of energy: Energy(x, h) = b T x c T h h T W x h l Takes the earlier nice form with β(x) = b T x and γ i (x, h i ) = h i (c i + W i x) Originally proposed by Smolensky (1987) who called them Harmoniums as a special case of Boltzmann Machines

Restricted Boltzmann Machines... h l+1... Recall the form of energy: Energy(x, h) = b T x c T h h T W x h l Takes the earlier nice form with β(x) = b T x and γ i (x, h i ) = h i (c i + W i x) Originally proposed by Smolensky (1987) who called them Harmoniums as a special case of Boltzmann Machines

Restricted Boltzmann Machines
- As seen before, the free energy can be computed efficiently: $\mathrm{FreeEnergy}(x) = -b^T x - \sum_i \log \sum_{h_i} e^{h_i (c_i + W_i x)}$
- The conditional probability factorizes: $P(h \mid x) = \frac{\exp(b^T x + c^T h + h^T W x)}{\sum_{\tilde{h}} \exp(b^T x + c^T \tilde{h} + \tilde{h}^T W x)} = \prod_i P(h_i \mid x)$

Restricted Boltzmann Machines... x and h play symmetric roles:... h l+1 h l

Restricted Boltzmann Machines...... h l+1 h l x and h play symmetric roles: P (x h) = i P (x i h)

Restricted Boltzmann Machines...... h l+1 h l x and h play symmetric roles: P (x h) = i P (x i h) The common transfer (for the binary case):

Restricted Boltzmann Machines...... h l+1 h l x and h play symmetric roles: P (x h) = i P (x i h) The common transfer (for the binary case): P (h i = 1 x) = σ(c i + W i x)

Restricted Boltzmann Machines...... h l+1 h l x and h play symmetric roles: P (x h) = i P (x i h) The common transfer (for the binary case): P (h i = 1 x) = σ(c i + W i x) P (x j = 1 h) = σ(b j + W T :,jh)

Restricted Boltzmann Machines...... h l+1 h l x and h play symmetric roles: P (x h) = i P (x i h) The common transfer (for the binary case): P (h i = 1 x) = σ(c i + W i x) P (x j = 1 h) = σ(b j + W T :,jh)
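Putting the pieces together, here is a minimal sketch (not from the slides) of the binary-RBM conditionals and the closed-form free energy; summing each $h_i$ over $\{0, 1\}$ turns the inner sum into $1 + e^{c_i + W_i x}$.

```python
# Sketch: FreeEnergy(x) = -b^T x - sum_i log(1 + exp(c_i + W_i x)) for a binary RBM,
# plus the factorized conditionals P(h_i = 1 | x) and P(x_j = 1 | h).
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_h_given_x(x, W, c):
    return sigma(c + W @ x)          # one probability per hidden unit

def p_x_given_h(h, W, b):
    return sigma(b + W.T @ h)        # one probability per visible unit

def free_energy(x, W, b, c):
    return -b @ x - np.sum(np.log1p(np.exp(c + W @ x)))

rng = np.random.default_rng(0)
W, b, c = rng.normal(size=(3, 4)), np.zeros(4), np.zeros(3)
x = np.array([1.0, 0.0, 1.0, 1.0])
print(p_h_given_x(x, W, c), free_energy(x, W, b, c))
```

This closed form should agree with the brute-force enumeration sketched earlier, which makes a handy sanity check.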

Approximate Learning and Gibbs Sampling
$$\mathbb{E}_{\hat{P}}\left[\frac{\partial \log P(x)}{\partial \theta}\right] = -\mathbb{E}_{\hat{P}}\left[\frac{\partial\,\mathrm{FreeEnergy}(x)}{\partial \theta}\right] + \mathbb{E}_{P}\left[\frac{\partial\,\mathrm{FreeEnergy}(x)}{\partial \theta}\right]$$
- We saw the expression for the free energy of an RBM, but the second term above is still intractable. How do we learn in this case?
- Replace the average over all possible input configurations by samples
- Run Markov Chain Monte Carlo (Gibbs sampling): first sample $x_1 \sim \hat{P}(x)$, then $h_1 \sim P(h \mid x_1)$, then $x_2 \sim P(x \mid h_1)$, then $h_2 \sim P(h \mid x_2)$, and so on until $x_{k+1}$

Approximate Learning, Alternating Gibbs Sampling
- We have already seen: $P(x \mid h) = \prod_i P(x_i \mid h)$ and $P(h \mid x) = \prod_i P(h_i \mid x)$
- With: $P(h_i = 1 \mid x) = \sigma(c_i + W_i x)$ and $P(x_j = 1 \mid h) = \sigma(b_j + W_{:,j}^T h)$, so the chain alternates between sampling all of $h$ and all of $x$ in parallel (see the sketch below)
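A compact sketch of the alternating chain, not from the slides; the parameters and chain length are illustrative.

```python
# Sketch: k steps of alternating (block) Gibbs sampling in a binary RBM.
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_chain(x0, W, b, c, k, rng):
    """Alternate h ~ P(h|x) and x ~ P(x|h), starting from a data vector x0."""
    x = x0.copy()
    for _ in range(k):
        ph = sigma(c + W @ x)
        h = (rng.random(ph.shape) < ph).astype(float)   # all h_i sampled in parallel
        px = sigma(b + W.T @ h)
        x = (rng.random(px.shape) < px).astype(float)   # all x_j sampled in parallel
    return x

rng = np.random.default_rng(0)
W, b, c = rng.normal(size=(3, 4)), np.zeros(4), np.zeros(3)
print(gibbs_chain(np.array([1.0, 0.0, 1.0, 1.0]), W, b, c, k=10, rng=rng))
```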

Training an RBM: The Contrastive Divergence Algorithm
- Start with a training example on the visible units
- Update all the hidden units in parallel
- Update all the visible units in parallel to obtain a reconstruction
- Update all the hidden units again
- Update the model parameters (the update is sketched below)
- Aside: it is easy to extend RBMs (and contrastive divergence) to the continuous case
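The steps above amount to the following CD-1 update, sketched here (not from the slides) for a single training vector; the learning rate and initialization are illustrative.

```python
# Sketch: one CD-1 update. Positive statistics come from the data; negative
# statistics from a one-step reconstruction, standing in for the model expectation.
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(x, W, b, c, lr, rng):
    ph0 = sigma(c + W @ x)                              # hidden probs on the data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    px1 = sigma(b + W.T @ h0)                           # reconstruction probs
    x1 = (rng.random(px1.shape) < px1).astype(float)
    ph1 = sigma(c + W @ x1)                             # hidden probs on reconstruction
    W += lr * (np.outer(ph0, x) - np.outer(ph1, x1))    # positive minus negative phase
    b += lr * (x - x1)
    c += lr * (ph0 - ph1)

rng = np.random.default_rng(0)
W, b, c = 0.01 * rng.normal(size=(3, 4)), np.zeros(4), np.zeros(3)
cd1_update(np.array([1.0, 0.0, 1.0, 1.0]), W, b, c, lr=0.1, rng=rng)
print(W)
```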

Boltzmann Machines
- A model in which the energy has the form: $\mathrm{Energy}(x, h) = -b^T x - c^T h - h^T W x - x^T U x - h^T V h$
- Originally proposed by Hinton and Sejnowski (1983)
- Important historically, but very difficult to train (why?)

Back to Deep Belief Networks
[Figure: DBN with layers $h^3, h^2, h^1$ above $x$; the top two layers undirected]
$$P(x, h^1, \ldots, h^l) = P(h^l, h^{l-1}) \left(\prod_{k=1}^{l-2} P(h^k \mid h^{k+1})\right) P(x \mid h^1)$$

Greedy Layer-wise Training of DBNs
- Reference: G. E. Hinton, S. Osindero and Y-W. Teh: A Fast Learning Algorithm for Deep Belief Networks, Neural Computation, 2006
- First step: construct an RBM with input $x$ and a hidden layer $h$; train the RBM
- Stack another layer on top of the RBM to form a new RBM: fix $W^1$, sample $h^1 \sim P(h^1 \mid x)$, and train $W^2$ as an RBM
- Continue until $k$ layers (a stacking sketch follows below)
- This implicitly defines $P(x)$ and $P(h)$ (a variational bound justifies the layerwise training)
- The network can then be discriminatively fine-tuned using backpropagation
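A minimal sketch of the stacking loop, not from the slides; it reuses the CD-1 step from the previous sketch, and all sizes, epochs, and rates are illustrative.

```python
# Sketch: greedy layer-wise pretraining. Train an RBM, freeze it, push samples of
# P(h|x) upward, and train the next RBM on those samples.
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(x, W, b, c, lr, rng):
    ph0 = sigma(c + W @ x)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    px1 = sigma(b + W.T @ h0)
    x1 = (rng.random(px1.shape) < px1).astype(float)
    ph1 = sigma(c + W @ x1)
    W += lr * (np.outer(ph0, x) - np.outer(ph1, x1))
    b += lr * (x - x1)
    c += lr * (ph0 - ph1)

def greedy_pretrain(data, layer_sizes, rng, epochs=5, lr=0.1):
    params, layer_input = [], data
    for n_hidden in layer_sizes:
        W = 0.01 * rng.normal(size=(n_hidden, layer_input.shape[1]))
        b, c = np.zeros(layer_input.shape[1]), np.zeros(n_hidden)
        for _ in range(epochs):
            for x in layer_input:
                cd1_update(x, W, b, c, lr, rng)        # train this layer as an RBM
        params.append((W, b, c))
        ph = sigma(c + layer_input @ W.T)              # P(h|x) under the frozen layer
        layer_input = (rng.random(ph.shape) < ph).astype(float)  # input to next RBM
    return params

rng = np.random.default_rng(0)
data = (rng.random((50, 8)) < 0.5).astype(float)       # toy binary data
print([W.shape for W, _, _ in greedy_pretrain(data, [6, 4], rng)])
```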

Deep Autoencoders (2006)
- From last time: it was hard to train deep networks from scratch in 2006!
- G. E. Hinton and R. R. Salakhutdinov: Reducing the Dimensionality of Data with Neural Networks, Science, 2006

Semantic Hashing
- G. Hinton and R. Salakhutdinov: Semantic Hashing, 2006

Why does Unsupervised Pre-training Work?
- Regularization: feature representations that are good for $P(x)$ are good for $P(y \mid x)$
- Optimization: unsupervised pre-training leads to better regions of parameter space, i.e. better than random initialization

Effect of Unsupervised Pre-training
[Figures: effect of unsupervised pre-training]

Important topics we didn't talk about in detail, or at all:
- Joint unsupervised training of all layers (the Wake-Sleep algorithm)
- Deep Boltzmann Machines
- Variational bounds justifying greedy layerwise training
- Conditional RBMs, multimodal RBMs, temporal RBMs, etc.

Next Time
- Some applications of the methods we just considered
- Generative Adversarial Networks