Representational Power of Restricted Boltzmann Machines and Deep Belief Networks. Nicolas Le Roux and Yoshua Bengio. Presented by Colin Graber.


Introduction: The representational ability of functions with some sort of compositional structure is a well-studied problem for models such as neural networks, kernel machines, and digital circuits. Two-level architectures of some of these models have been shown to be able to represent any function, and the efficiency of representation has been shown to improve as depth increases. What about Restricted Boltzmann Machines and Deep Belief Networks?

Questions Addressed by the Paper: 1. What sorts of distributions can be represented by Restricted Boltzmann Machines? 2. What benefits does adding additional layers to RBMs (thus creating Deep Belief Networks) give us?

Summary of Main Results. Restricted Boltzmann Machines: increasing the number of hidden units improves representational ability; with an unbounded number of units, any distribution over {0,1}^n can be approximated arbitrarily well. Deep Belief Networks: adding additional layers using greedy contrastive divergence training does not provide additional benefit; there remain open questions about the benefits additional layers add.

Recap: Restricted Boltzmann Machines. [Figure: bipartite graph with hidden units h1-h4 connected to visible units v1-v5.] RBMs are bipartite graphs consisting of visible units (v) and hidden units (h). The joint distribution has the following form, where E(v, h) = -b^T v - c^T h - h^T W v is called the energy of the state (v, h): p(v, h) = exp(-E(v, h)) / Z, with Z the partition function obtained by summing exp(-E(v, h)) over all states.
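To make these formulas concrete, here is a minimal NumPy sketch (with toy parameters of my own choosing, not from the paper) of the energy function and of the exact joint and visible marginal, computed by brute-force enumeration, which is only feasible for very small models:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 4, 3                                   # tiny sizes so all states can be enumerated
W = rng.normal(scale=0.5, size=(n_h, n_v))        # hidden-to-visible weights (toy values)
b = rng.normal(scale=0.1, size=n_v)               # visible biases
c = rng.normal(scale=0.1, size=n_h)               # hidden biases

def energy(v, h):
    """E(v, h) = -b.v - c.h - h^T W v for binary v and h."""
    return -b @ v - c @ h - h @ W @ v

# Enumerate every (v, h) pair to compute the partition function Z exactly.
all_v = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n_v)]
all_h = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n_h)]
Z = sum(np.exp(-energy(v, h)) for v in all_v for h in all_h)

def joint(v, h):
    """p(v, h) = exp(-E(v, h)) / Z."""
    return np.exp(-energy(v, h)) / Z

def marginal_v(v):
    """p(v) = sum_h p(v, h): the marginal the RBM represents over the visible units."""
    return sum(joint(v, h) for h in all_h)

print(sum(marginal_v(v) for v in all_v))          # sanity check: sums to ~1.0
```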

Recap: Restricted Boltzmann Machines (2). The bipartite structure of the model simplifies the computation of certain values: the conditionals factorize, with p(h_j = 1 | v) = sigmoid(c_j + W_j v) and p(v_i = 1 | h) = sigmoid(b_i + (W^T h)_i). In this paper, all units have values in {0, 1}.
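The factorized conditionals can be written down directly; the following sketch (again with toy parameters and helper names of my own) shows them together with one step of alternating Gibbs sampling, which is what the bipartite structure buys us:

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 5, 4
W = rng.normal(scale=0.5, size=(n_h, n_v))        # toy RBM parameters
b = rng.normal(scale=0.1, size=n_v)
c = rng.normal(scale=0.1, size=n_h)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v):
    """p(h_j = 1 | v) = sigmoid(c_j + W_j . v); factorizes over the hidden units."""
    return sigmoid(c + W @ v)

def p_v_given_h(h):
    """p(v_i = 1 | h) = sigmoid(b_i + (W^T h)_i); factorizes over the visible units."""
    return sigmoid(b + W.T @ h)

def gibbs_step(v):
    """One alternating update v -> h -> v', sampling each layer given the other."""
    h = (rng.random(n_h) < p_h_given_v(v)).astype(float)
    v_new = (rng.random(n_v) < p_v_given_h(h)).astype(float)
    return v_new, h

v = np.zeros(n_v)
v, h = gibbs_step(v)
```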

Recap: Deep Belief Networks. DBNs are essentially RBMs with additional layers of hidden units. The joint distribution is written in the following way: p(v, h^(1), ..., h^(l)) = p(h^(l-1), h^(l)) prod_{k=1}^{l-1} p(h^(k-1) | h^(k)), where p(h^(l-1), h^(l)) (i.e., the marginal distribution over the top two layers) is an RBM. Note on notation: we can define h^(0) to equal v.

Recap: Deep Belief Networks (2). As for RBMs, the structure of the model makes certain computations simpler: the top-down conditionals p(h^(k-1) | h^(k)) factorize over units, each unit being a sigmoid of the corresponding weights and biases.
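As an illustration of how these conditionals are used, here is a hedged sketch of ancestral sampling from a DBN: run Gibbs sampling in the top-level RBM, then make a single top-down pass through the remaining (directed) layers. The layout of Ws and bs is my own convention, not notation from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_dbn(Ws, bs, n_gibbs=200, rng=None):
    """Draw one visible sample from a DBN whose top two layers form an RBM.

    Ws[k]: weights between layer k and layer k+1, shape (size of layer k+1, size of layer k).
    bs[k]: biases of layer k, with layer 0 being the visible units v = h^(0).
    """
    rng = rng or np.random.default_rng()
    W_top, b_low, b_high = Ws[-1], bs[-2], bs[-1]
    # Gibbs sampling in the top RBM over (h^(l-1), h^(l)).
    h_low = (rng.random(len(b_low)) < 0.5).astype(float)
    for _ in range(n_gibbs):
        h_high = (rng.random(len(b_high)) < sigmoid(b_high + W_top @ h_low)).astype(float)
        h_low = (rng.random(len(b_low)) < sigmoid(b_low + W_top.T @ h_high)).astype(float)
    # Single top-down pass through the remaining directed layers, down to the visible units.
    h = h_low
    for W, b in zip(reversed(Ws[:-1]), reversed(bs[:-2])):
        h = (rng.random(len(b)) < sigmoid(b + W.T @ h)).astype(float)
    return h                                        # a sample of v = h^(0)

# Example: a DBN with layer sizes 6 -> 5 -> 4 (random toy parameters).
rng = np.random.default_rng(0)
sizes = [6, 5, 4]
Ws = [rng.normal(scale=0.3, size=(sizes[k + 1], sizes[k])) for k in range(len(sizes) - 1)]
bs = [np.zeros(s) for s in sizes]
print(sample_dbn(Ws, bs, rng=rng))
```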

Summary of Main Results. Restricted Boltzmann Machines: increasing the number of hidden units improves representational ability; with an unbounded number of units, any distribution over {0,1}^n can be approximated arbitrarily well. Deep Belief Networks: adding additional layers using greedy contrastive divergence training does not provide additional benefit; there remain open questions about the benefits additional layers add.

What do we mean by Representational Ability? We have some empirical distribution p0(v) defined by the observed data, and an RBM represents a marginal distribution p(v) over the visible units. The quality of representation is measured by the KL divergence between p0 and p (which we want to be small): KL(p0 || p) = sum_v p0(v) log(p0(v) / p(v)). Decreasing the KL divergence is equivalent to increasing the log-likelihood of the data, since KL(p0 || p) = -H(p0) - sum_v p0(v) log p(v) and H(p0) is fixed by the data.
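A small numeric check of that equivalence, over an enumerated set of visible configurations (toy distributions; the helper names are mine):

```python
import numpy as np

def kl(p0, p, eps=1e-12):
    """KL(p0 || p) = sum_v p0(v) log(p0(v) / p(v)); terms with p0(v) = 0 contribute nothing."""
    m = p0 > 0
    return float(np.sum(p0[m] * (np.log(p0[m]) - np.log(p[m] + eps))))

def expected_log_likelihood(p0, p, eps=1e-12):
    """E_{p0}[log p(v)], i.e. the per-example log-likelihood of data drawn from p0."""
    return float(np.sum(p0 * np.log(p + eps)))

p0 = np.array([0.5, 0.25, 0.25, 0.0])   # toy empirical distribution over 4 configurations
p  = np.array([0.4, 0.3, 0.2, 0.1])     # toy model marginal over the same configurations
H0 = -np.sum(p0[p0 > 0] * np.log(p0[p0 > 0]))

# KL(p0 || p) = -H(p0) - E_{p0}[log p], and H(p0) is fixed by the data,
# so decreasing the KL divergence is exactly increasing the log-likelihood.
assert np.isclose(kl(p0, p), -H0 - expected_log_likelihood(p0, p))
```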

Some Notation. Rp: the set of RBMs having marginal distribution p(v). Rpw,c: the set of RBMs obtained by adding a hidden unit with weight w and bias c to an RBM from Rp. pw,c: the marginal distribution over the visible units for any RBM in Rpw,c.

Equivalence Classes of RBMs Lemma 2.1. Let Rp be the equivalence class containing the RBMs whose associated marginal distribution over the visible units is p. The operation of adding a hidden unit to an RBM of Rp preserves the equivalence class. Thus, the set of RBMs composed of an RBM of Rp and an additional hidden unit is also an equivalence class (meaning that all the RBMs of this set have the same marginal distribution over visible units). Takeaway - the results we are about to prove are independent of the exact form of the RBM

Effect of adding a unit with infinite negative bias. Lemma 2.2. Let p be the distribution over binary vectors v in {0,1}^d obtained with an RBM Rp, and let pw,c be the distribution obtained when adding a hidden unit with weights w and bias c to Rp. Then for all p and all w in R^d, p = pw,-∞ (i.e., as the bias of the added unit tends to -∞, the marginal distribution over the visible units is unchanged).

Effect of adding hidden units. Theorem 2.3. Let p0 be an arbitrary distribution over {0,1}^n and let Rp be an RBM with marginal distribution p over the visible units such that KL(p0 || p) > 0. Then there exists an RBM Rpw,c, composed of Rp and an additional hidden unit with parameters (w, c), whose marginal distribution pw,c over the visible units achieves KL(p0 || pw,c) < KL(p0 || p).

Proof Sketch (Theorem 2.3). 1. Write down the definition of KL(p0 || pw,c). 2. Rearrange to get an expression of the form KL(p0 || pw,c) - KL(p0 || p) = Z. 3. Show that Z is negative.

Proof of Theorem 2.3. Step 1: write KL(p0 || pw,c) in terms of KL(p0 || p). We start with the definition of the KL divergence: KL(p0 || pw,c) = sum_v p0(v) log p0(v) - sum_v p0(v) log pw,c(v). Let's expand the second term.

Proof of Theorem 2.3 (2). We can push in the sum corresponding to the new hidden unit: summing over its two possible values multiplies each unnormalized visible probability by (1 + exp(w^T v + c)), so pw,c(v) is proportional to p(v)(1 + exp(w^T v + c)).

Proof of Theorem 2.3 (3). Substituting this into our earlier equation gives KL(p0 || pw,c) - KL(p0 || p) = -sum_v p0(v) log(1 + exp(w^T v + c)) + log sum_v p(v)(1 + exp(w^T v + c)).

Proof of Theorem 2.3 (4). We want to simplify log(1 + exp(w^T v + c)). Consider the Taylor series expansion of the natural logarithm: log(1 + x) = x - x^2/2 + x^3/3 - ... If we assume that w^T v + c is a large negative value for all v (which we can, since we set the parameter values), then exp(w^T v + c) is small and we can use the approximation log(1 + x) = x + o_{x->0}(x).

Proof of Theorem 2.3 (5) Second term:

Proof of Theorem 2.3 (6) Last term:

Proof of Theorem 2.3 (7) More simplification:

Proof of Theorem 2.3 (8). Finally, we can substitute in everything we just derived, which gives us the expression we wanted: KL(p0 || pw,c) - KL(p0 || p) = sum_v (p(v) - p0(v)) exp(w^T v + c), up to terms that vanish as w^T v + c becomes very negative.

Proof of Theorem 2.3 (9). Now we want to show that there exists a w such that this sum is negative. Since p ≠ p0, there exists an input ṽ such that p(ṽ) < p0(ṽ). Using this fact, we will now prove that a positive scalar a exists such that defining w in terms of a and ṽ (with e = [1, ..., 1]^T) gives us the condition above.

Proof of Theorem 2.3 (10). We can decompose the target sum in the following way: the term for v = ṽ, namely (p(ṽ) - p0(ṽ)) exp(w^T ṽ + c), which is negative, plus the sum over all v ≠ ṽ. What about that remaining sum?

Proof of Theorem 2.3 (11). For the terms with v ≠ ṽ, we have:

Proof of Theorem 2.3 (12). Let's look at this term: no matter which of the four possible assignments you give to (v_i, ṽ_i), the terms of the sum are less than or equal to zero. Thus, as a approaches infinity, this term approaches zero.

Proof of Theorem 2.3 (13). Thus, going back to the expression we want to prove is negative, we know the following: the ṽ term is strictly negative, and the remaining terms vanish as a grows. Hence, an a exists which makes the difference in divergences negative.
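The proof shows that adding a hidden unit with parameters (w, c) rescales each visible configuration's probability by a factor (1 + exp(w^T v + c)). Here is a numerical illustration in the spirit of that construction; the particular choice w = a(2ṽ - e) and the constant offset in c are my own concrete instantiation, not necessarily the paper's exact constants:

```python
import itertools
import numpy as np

n = 3
vs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)  # all 2^n configurations

p  = np.full(len(vs), 1.0 / len(vs))            # current RBM marginal (uniform, as a toy example)
p0 = np.full(len(vs), 0.5 / (len(vs) - 1))      # target distribution, peaked on one vector
p0[0] = 0.5                                     # v~ = (0, 0, 0) is under-represented by p

def kl(p0, p):
    m = p0 > 0
    return float(np.sum(p0[m] * np.log(p0[m] / p[m])))

def add_hidden_unit(p, w, c):
    """Marginal after adding one hidden unit: p_{w,c}(v) is proportional to p(v)(1 + exp(w.v + c))."""
    unnorm = p * (1.0 + np.exp(vs @ w + c))
    return unnorm / unnorm.sum()

v_tilde = vs[0]
a = 20.0
w = a * (2.0 * v_tilde - 1.0)                   # w.v is uniquely maximized at v = v~
c = -a * v_tilde.sum() - 1.0                    # so w.v + c = -1 at v~ and is << 0 elsewhere
p_new = add_hidden_unit(p, w, c)

print(kl(p0, p), ">", kl(p0, p_new))            # the KL divergence strictly decreases
```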

Summary of Main Results. Restricted Boltzmann Machines: increasing the number of hidden units improves representational ability; with an unbounded number of units, any distribution over {0,1}^n can be approximated arbitrarily well. Deep Belief Networks: adding additional layers using greedy contrastive divergence training does not provide additional benefit; there remain open questions about the benefits additional layers add.

RBMs are Universal Approximators. Theorem 2.4. Any distribution over {0,1}^n can be approximated arbitrarily well (in the sense of the KL divergence) with an RBM with k + 1 hidden units, where k is the number of input vectors whose probability is not 0.

Theorem 2.4: Proof Sketch. We construct an RBM in the following way. Each hidden unit is assigned one of the possible input vectors vi such that, when vi is the input, all other hidden units have probability zero of being on and the corresponding hidden unit has probability sigmoid(λi) of being on. Values for the weights and the λ parameters are chosen such that λi is tied to p(vi); when all hidden units except unit i are off, p(vi | h) = 1; and when all of the hidden units are off (which happens with probability 1 - sigmoid(λi)), the visible units follow the distribution of the RBM without the added unit.

Proof of Theorem 2.4 In the previous proof, we had: Let ṽ be an arbitrary input vector. Define a weight vector in the same way we did during the last proof; namely,

Proof of Theorem 2.4 (2). Define another parameter, which we will use next, with λ ∈ R. Note the following fact:

Proof of Theorem 2.4 (3). Using this fact, we get that, for v ≠ ṽ:

Proof of Theorem 2.4 (4). Using the same derivation, we get the corresponding expression for v = ṽ. What did we learn? We have figured out a way of adding a hidden unit to an RBM that increases the probability of a single input vector and uniformly decreases the probabilities of every other input vector. Additionally, if p(ṽ) = 0, then pw,λ(ṽ) = 0 for all λ.

Proof of Theorem 2.4 (5). Let's build an RBM. First, index the input vectors so that v1, ..., vk are the ones with nonzero probability under p0. We'll use pi to represent the distribution of an RBM with i added hidden units. We start with an RBM with weights and biases all set to zero; this induces a uniform marginal distribution over the visible units.

Proof of Theorem 2.4 (6) Next, we add a hidden unit with the following parameters: As mentioned previously, this gives us:

Proof of Theorem 2.4 (7). Now we can add another hidden node, this time based on the second input vector, thus giving us p2. We set λ2 such that the appropriate ratio holds; we can do this since we can increase p(v2) arbitrarily. This ratio will continue to hold as additional hidden nodes are added, since the probabilities of all vectors (besides the vector under consideration) are multiplied by the same factor at each step.

Proof of Theorem 2.4 (8). After adding k hidden nodes, the following equations are true: ... These imply that ..., and additionally ..., which implies ...

Proof of Theorem 2.4 (9). A tiny bit of derivation gives us the following results. Using the expansion of log(1 + x) around 0 again, we can then show that this RBM has the behavior we wanted:
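A rough numerical illustration of the constructive idea behind Theorem 2.4 (not the paper's exact schedule for the weights and the λ values): starting from the uniform marginal of a zero-weight RBM, repeatedly apply the one-hidden-unit update to whichever configuration is most under-weighted, each time choosing the bias so the update moves that configuration's probability almost exactly onto its target value:

```python
import itertools
import numpy as np

n = 3
vs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def kl(p0, p):
    m = p0 > 0
    return float(np.sum(p0[m] * np.log(p0[m] / p[m])))

def add_hidden_unit(p, w, c):
    """p_{w,c}(v) is proportional to p(v)(1 + exp(w.v + c))."""
    unnorm = p * (1.0 + np.exp(vs @ w + c))
    return unnorm / unnorm.sum()

rng = np.random.default_rng(1)
p0 = rng.dirichlet(np.ones(len(vs)))            # an arbitrary target distribution over {0,1}^n
p = np.full(len(vs), 1.0 / len(vs))             # zero-weight RBM: uniform marginal
a = 30.0

kl_before = kl(p0, p)
for _ in range(2000):                           # one hidden unit added per iteration
    i = int(np.argmax(p0 - p))                  # most under-weighted configuration
    if p0[i] - p[i] < 1e-9:
        break
    # Pick the bias so the unit's factor moves p(v_i) (almost exactly) onto p0(v_i).
    factor = p0[i] * (1.0 - p[i]) / (p[i] * (1.0 - p0[i]))
    w = a * (2.0 * vs[i] - 1.0)
    c = -a * vs[i].sum() + np.log(factor - 1.0)
    p = add_hidden_unit(p, w, c)

print(kl_before, "->", kl(p0, p))               # KL(p0 || p) shrinks toward 0 as units are added
```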

Summary of Main Results. Restricted Boltzmann Machines: increasing the number of hidden units improves representational ability; with an unbounded number of units, any distribution over {0,1}^n can be approximated arbitrarily well. Deep Belief Networks: adding additional layers using greedy contrastive divergence training does not provide additional benefit; there remain open questions about the benefits additional layers add.

Recap: Greedy Training Method for DBNs. Proposed in (Hinton et al., 2006). Training proceeds by adding one layer at a time: when training a new layer, the weights of all previous layers are fixed and the top layer is trained as an RBM (a compact sketch follows below). Problems with this method: the training procedure does not take into account the fact that additional layers will be added in the future, which unnecessarily restricts the forms of the distributions that intermediate layers may learn.
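For reference, here is a compact sketch of that greedy procedure under the usual assumptions (binary units, CD-1 updates, mean-field propagation of the data to the next layer); the hyperparameters and helper names are mine, not the paper's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hidden, lr=0.05, epochs=50, rng=None):
    """Train a binary RBM on `data` (shape: n_samples x n_visible) with CD-1 updates."""
    rng = rng or np.random.default_rng(0)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_hidden, n_visible))
    b = np.zeros(n_visible)                      # visible biases
    c = np.zeros(n_hidden)                       # hidden biases
    for _ in range(epochs):
        for v0 in rng.permutation(data):
            ph0 = sigmoid(c + W @ v0)            # positive phase
            h0 = (rng.random(n_hidden) < ph0).astype(float)
            pv1 = sigmoid(b + W.T @ h0)          # negative phase: one Gibbs step
            v1 = (rng.random(n_visible) < pv1).astype(float)
            ph1 = sigmoid(c + W @ v1)
            W += lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
            b += lr * (v0 - v1)
            c += lr * (ph0 - ph1)
    return W, b, c

def greedy_pretrain_dbn(data, layer_sizes, rng=None):
    """Greedy layer-wise training: train a layer as an RBM, freeze it, re-represent the
    data through it, and train the next layer as an RBM on those representations."""
    rng = rng or np.random.default_rng(0)
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm_cd1(x, n_hidden, rng=rng)
        layers.append((W, b, c))
        x = sigmoid(x @ W.T + c)                 # mean-field upward pass feeds the next layer
    return layers
```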

DBN Greedy Training Objective. In the greedy training procedure, a lower bound on the likelihood (the variational bound) is maximized. For example, for the second layer of a two-layer DBN, we have log p(v) >= sum_{h^(1)} Q(h^(1) | v) [log p(h^(1)) + log p(v | h^(1))] + H(Q(h^(1) | v)), where Q(h^(1) | v) is the first-layer RBM's posterior. Since we fix the weights of the first layer, the only term in this expression that we can optimize is p(h^(1)).

DBN Greedy Training Objective (2). It turns out that there's an analytic formulation for the optimal solution: p*(h^(1)) = sum_v p0(v) Q(h^(1) | v). From Theorem 2.4, we know this can be approximated arbitrarily well by some RBM. If we use an RBM that approximates this distribution, what distribution is being modeled by our DBN?

Marginal Distribution achieved by this DBN. Proposition 3.1. In a 2-layer DBN whose second-layer RBM achieves the optimal p*(h^(1)) above, the model distribution p is equal to p1, where p1 is the distribution obtained by starting with p0 clamped to the visible units v, sampling h^(1) given v, and then sampling v given h^(1).

Proof of Proposition 3.1. We start with the analytic formulation of the marginal distribution for the top RBM: p*(h^(1)) = sum_{v'} p0(v') p(h^(1) | v'). Now substitute this into the expression for the marginal distribution over the bottom layer: p(v) = sum_{h^(1)} p(v | h^(1)) p*(h^(1)) = sum_{v'} p0(v') sum_{h^(1)} p(h^(1) | v') p(v | h^(1)), which is exactly p1(v).
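For a tiny first-layer RBM, p1 can be computed exactly by enumerating the hidden configurations. A sketch (toy parameters; the distributions are stored as vectors indexed by the enumerated visible configurations):

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bernoulli_prob(probs, x):
    """Probability of the binary vector x under independent Bernoulli(probs) units."""
    return float(np.prod(np.where(x == 1, probs, 1.0 - probs)))

rng = np.random.default_rng(0)
n_v, n_h = 3, 2
W = rng.normal(size=(n_h, n_v))                  # toy first-layer RBM parameters
b = rng.normal(scale=0.1, size=n_v)
c = rng.normal(scale=0.1, size=n_h)

vis = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n_v)]
hid = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n_h)]
p0 = rng.dirichlet(np.ones(len(vis)))            # toy empirical distribution over v

# p1(v) = sum_{v'} p0(v') sum_{h} p(h | v') p(v | h):
# clamp p0 to the visible units, go up once, come back down once.
p1 = np.zeros(len(vis))
for j, v_clamped in enumerate(vis):
    up = sigmoid(c + W @ v_clamped)              # p(h | v') of the first-layer RBM
    for h in hid:
        down = sigmoid(b + W.T @ h)              # p(v | h) of the first-layer RBM
        for i, v in enumerate(vis):
            p1[i] += p0[j] * bernoulli_prob(up, h) * bernoulli_prob(down, v)

print(p1.sum())   # ~1.0; by Proposition 3.1 this p1 is the marginal the 2-layer DBN models
```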

Takeaways from Proposition 3.1. Using this training procedure, the best KL divergence we can achieve is KL(p0 || p1). Achieving KL(p0 || p1) = 0 requires that p0 = p1, in which case we wouldn't need an extra layer! Does this mean there's no benefit to depth for RBMs? Not necessarily - we may be able to do better by optimizing some bound other than the variational bound. Also, the two-layer network achieves its given approximation of p0 using only one up-down step, while the one-layer network achieves this only after an infinite number of steps.

Proposed Alternative Training Criterion. Rather than using the previous training objective for intermediate layers, could it be better to optimize KL(p0 || p1) directly? Experiment: a toy dataset was generated consisting of 60 bit vectors of length 10, with either one, two, or three consecutive bits turned on (a sketch of such a dataset follows below). Two two-layer DBNs with the same number of nodes per layer were trained: the first used the contrastive divergence objective for the intermediate layer, while the second minimized KL(p0 || p1) using gradient descent. The second layer of both was trained using contrastive divergence.
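A sketch of how such a toy dataset might be generated; the sampling scheme is not spelled out here, so the assumption below is that the 60 training vectors are drawn uniformly (with repetition) from the length-10 patterns that have a single run of one, two, or three consecutive ones:

```python
import numpy as np

def consecutive_bit_patterns(length=10, run_lengths=(1, 2, 3)):
    """All binary vectors of `length` bits containing exactly one run of consecutive ones."""
    patterns = []
    for run in run_lengths:
        for start in range(length - run + 1):
            v = np.zeros(length, dtype=int)
            v[start:start + run] = 1
            patterns.append(v)
    return np.array(patterns)

rng = np.random.default_rng(0)
patterns = consecutive_bit_patterns()                       # 10 + 9 + 8 = 27 distinct patterns
dataset = patterns[rng.integers(len(patterns), size=60)]    # 60 training vectors of length 10
print(dataset.shape)                                        # (60, 10)
```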

Experimental Results. [Two result plots: one for the model trained with the contrastive divergence objective, one for the model trained with the KL(p0 || p1) objective.]

Summary of Main Results. Restricted Boltzmann Machines: increasing the number of hidden units improves representational ability; with an unbounded number of units, any distribution over {0,1}^n can be approximated arbitrarily well. Deep Belief Networks: adding additional layers using greedy contrastive divergence training does not provide additional benefit; there remain open questions about the benefits additional layers add.

Open Questions. Using an argument from an earlier paper, we know that every distribution that can be represented by an l-layer DBN with n units per layer can also be represented by an (l+1)-layer DBN with n units per layer. This raises the following questions: Are there distributions that can be represented by the latter but not the former? What distributions can be represented using an unbounded number of layers?

Summary of Main Results. Restricted Boltzmann Machines: increasing the number of hidden units improves representational ability; with an unbounded number of units, any distribution over {0,1}^n can be approximated arbitrarily well. Deep Belief Networks: adding additional layers using greedy contrastive divergence training does not provide additional benefit; there remain open questions about the benefits additional layers add.