Deep Belief Networks are Compact Universal Approximators


Franck Olivier Ndjakou Njeunje
Applied Mathematics and Scientific Computation
May 16, 2016

Outline
1. Introduction
2. Preliminaries: universal approximation theorem; definitions; deep belief networks
3. Sutskever and Hinton's method: transfer of probability
4. Le Roux and Bengio's method: Gray codes; simultaneous transfer of probability
5. Conclusion

Introduction
Machine learning: term coined by Arthur Samuel (1959).
Neural networks: a set of transformations; each unit applies the logistic sigmoid

    \sigma(x) = \frac{1}{1 + \exp(-x)}    (1)
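
As a quick illustration (my own sketch, not part of the slides), here is the sigmoid of Eq. (1) used as the activation probability of a stochastic binary unit, the basic building block of the networks discussed below:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, Eq. (1): sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sample_binary(p, rng):
    """Sample binary activations that are 1 with probability p (elementwise)."""
    return (rng.random(np.shape(p)) < p).astype(np.int8)

rng = np.random.default_rng(0)
x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))                        # approx. [0.119, 0.5, 0.881]
print(sample_binary(sigmoid(x), rng))    # e.g. [0 1 1]
```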

Universal approximation theorem [Cybe89]
Let \varphi(\cdot) be a non-constant, bounded, and monotonically increasing continuous function. Let I_m denote the m-dimensional unit hypercube [0,1]^m, and let C(I_m) denote the space of continuous functions on I_m. Then, given any function f \in C(I_m) and \varepsilon > 0, there exist an integer N, real constants v_i, b_i \in \mathbb{R}, and real vectors w_i \in \mathbb{R}^m, i = 1, \ldots, N, such that

    F(x) = \sum_{i=1}^{N} v_i \, \varphi(w_i^T x + b_i)    (2)

is an approximate realization of the function f; that is,

    |F(x) - f(x)| < \varepsilon    (3)

for all x \in I_m. In other words, functions of the form F(x) are dense in C(I_m).
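
The following sketch (mine, not from the slides) illustrates the form of F(x) in Eq. (2) on the unit interval: hidden sigmoid units with fixed, randomly chosen w_i and b_i, and output weights v_i fitted by least squares to a target function. The target function and all constants are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

f = lambda x: np.sin(2 * np.pi * x)     # an arbitrary continuous target on [0, 1]

# N hidden units: F(x) = sum_i v_i * sigmoid(w_i * x + b_i), as in Eq. (2).
N = 200
w = rng.normal(scale=20.0, size=N)      # fixed random input weights w_i
b = rng.uniform(-20.0, 20.0, size=N)    # fixed random biases b_i

x = np.linspace(0.0, 1.0, 500)
Phi = sigmoid(np.outer(x, w) + b)       # Phi[j, i] = sigmoid(w_i * x_j + b_i)

# Fit the output weights v_i by least squares so that F approximates f.
v, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)
print("max |F - f| on the grid:", np.max(np.abs(Phi @ v - f(x))))
```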

Definition
Generative probabilistic model (GPM): applications include recognition, classification, and generation.

Restricted Boltzmann machine (RBM) [LeBe08]: a two-layer GPM.

Deep belief network (DBN) [HiOT06]: a multilayer GPM whose top two layers form an RBM.

Deep Belief Network
Let h^i denote the vector of hidden variables at layer i. The model is parametrized as follows:

    P(h^0, h^1, h^2, \ldots, h^l) = P(h^0 \mid h^1) P(h^1 \mid h^2) \cdots P(h^{l-2} \mid h^{l-1}) P(h^{l-1}, h^l).    (4)

The hidden layer h^i is a binary random vector with elements h^i_j, and

    P(h^i \mid h^{i+1}) = \prod_{j=1}^{n_i} P(h^i_j \mid h^{i+1}).    (5)

The element h^i_j is a stochastic neuron whose binary activation is 1 with probability

    P(h^i_j = 1 \mid h^{i+1}) = \sigma\Big( b^i_j + \sum_{k=1}^{n_{i+1}} W^i_{jk} h^{i+1}_k \Big).    (6)
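
As a small, self-contained sketch (not from the slides; the weights are arbitrary), this is how a sample would be propagated top-down through the directed layers of a DBN using Eqs. (5)-(6), assuming a configuration of the top layer is already given:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def sample_layer(h_above, W, b, rng):
    """One top-down step of Eqs. (5)-(6): each h^i_j is an independent Bernoulli
    with mean sigma(b^i_j + sum_k W^i_{jk} h^{i+1}_k)."""
    p = sigmoid(b + W @ h_above)
    return (rng.random(p.shape) < p).astype(np.int8)

# Toy stack of three directed layers of size n (arbitrary weights and biases).
n = 4
Ws = [rng.normal(size=(n, n)) for _ in range(3)]
bs = [rng.normal(size=n) for _ in range(3)]

h = rng.integers(0, 2, size=n)          # stand-in for a sample from the top RBM
for W, b in zip(Ws, bs):                # propagate h^l -> ... -> h^0
    h = sample_layer(h, W, b, rng)
print("sampled h^0:", h)
```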

Universal approximator (DBN)
Let p^* be an arbitrary distribution over binary vectors of n bits. A deep belief network that has p^* as its marginal distribution over h^0 is said to be a universal approximator. This means that for any binary vector x of n bits and any \varepsilon > 0, there exist weights and biases such that

    |P(h^0 = x) - p^*(x)| < \varepsilon.    (7)

Sutskever and Hinton's Method (2008) [SuHi08]
Define an arbitrary sequence (a_i)_{1 \le i \le 2^n} of binary vectors in \{0,1\}^n. The goal is to find weights and biases such that the marginal distribution over the outputs of the DBN equals the target distribution p^* over the vectors (a_i)_{1 \le i \le 2^n}. The next few slides use the following example for n = 4:

    a_1 = 1011,  p^*(a_1) = 0.1     (8)
    a_2 = 1000,  p^*(a_2) = 0.05    (9)
    a_3 = 1001,  p^*(a_3) = 0.01    (10)
    a_4 = 1111,  p^*(a_4) = 0.02    (11)
    ...                             (12)

Sutskever and Hinton's Method (2008) [SuHi08]
Consider two consecutive layers h and v of size n, with W_{ij} the weight linking unit v_i to unit h_j, b_i the bias of unit v_i, and w a positive scalar. For every positive scalar \varepsilon (0 < \varepsilon < 1), there are a weight vector W_{i,:} and a real b_i such that P(v_i = h_i \mid h) = 1 - \varepsilon. Indeed, setting

    W_{ii} = 2w,   W_{ij} = 0 for i \ne j,   b_i = -w

yields a total input to unit v_i of

    I(v_i, h) = 2 w h_i - w.    (13)

Therefore, if w = \sigma^{-1}(1 - \varepsilon), we have P(v_i = h_i \mid h) = 1 - \varepsilon.

Sutskever and Hinton's Method (2008) [SuHi08]
With W_{ii} = 2w, W_{ij} = 0 for i \ne j, b_i = -w, so that I(v_i, h) = 2 w h_i - w, and with w = \sigma^{-1}(1 - \varepsilon):

    P(v_i = 1 \mid h_i = 1) = \sigma(w) = 1 - \varepsilon.    (14)
    P(v_i = 0 \mid h_i = 0) = 1 - P(v_i = 1 \mid h_i = 0)     (15)
                            = 1 - \sigma(-w)                  (16)
                            = \sigma(w) = 1 - \varepsilon.    (17)
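
A small numerical check of this construction (my own, not from the slides): with W = 2w I, b_i = -w and w = \sigma^{-1}(1 - \varepsilon), every unit of v copies the corresponding unit of h with probability exactly 1 - \varepsilon.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
logit = lambda q: np.log(q / (1.0 - q))       # sigma^{-1}

n, eps = 4, 0.01
w = logit(1.0 - eps)                          # w = sigma^{-1}(1 - eps)
W = 2.0 * w * np.eye(n)                       # W_ii = 2w, W_ij = 0 for i != j
b = -w * np.ones(n)                           # b_i = -w

h = np.array([1, 0, 1, 1])
total_input = W @ h + b                       # Eq. (13): 2w h_i - w
p_on = sigmoid(total_input)                   # P(v_i = 1 | h)
p_copy = np.where(h == 1, p_on, 1.0 - p_on)   # P(v_i = h_i | h)
print(p_copy)                                 # every entry equals 1 - eps = 0.99
```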

Transfer of Probability
This is what Sutskever and Hinton call transfer of probability.

Transfer of Probability
Number of parameters: we need 3(n+1)^2 2^n parameters, i.e. 3 \cdot 2^n layers.

Le Roux and Bengio's Method (2010) [LeBe10]: Gray Codes
Gray codes [Gray53] are sequences (a_k)_{1 \le k \le 2^n} of vectors in \{0,1\}^n such that:

    \bigcup_k \{a_k\} = \{0,1\}^n, and
    for all k such that 2 \le k \le 2^n, \|a_k - a_{k-1}\|_H = 1, where \|\cdot\|_H is the Hamming distance.

Example for n = 4:

    a_1 = 0000    a_5 = 0110    a_9  = 1100    a_13 = 1010
    a_2 = 0001    a_6 = 0111    a_10 = 1101    a_14 = 1011
    a_3 = 0011    a_7 = 0101    a_11 = 1111    a_15 = 1001
    a_4 = 0010    a_8 = 0100    a_12 = 1110    a_16 = 1000
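
A short sketch (mine, not from the slides) that builds the standard reflected Gray code and checks both defining properties; for n = 4 it reproduces the sequence shown above:

```python
def gray_code(n):
    """Reflected binary Gray code over {0,1}^n, returned as n-bit strings."""
    codes = ["0", "1"]
    for _ in range(n - 1):
        codes = ["0" + c for c in codes] + ["1" + c for c in reversed(codes)]
    return codes

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

seq = gray_code(4)
print(seq[:4])                               # ['0000', '0001', '0011', '0010']
assert len(set(seq)) == 2 ** 4               # every 4-bit vector appears exactly once
assert all(hamming(a, b) == 1 for a, b in zip(seq, seq[1:]))  # consecutive vectors differ in 1 bit
```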

Theorem 1
Let a_t be an arbitrary binary vector in \{0,1\}^n whose last bit is 0, and let p be a scalar. For every positive scalar \varepsilon (0 < \varepsilon < 1), there are a weight vector W_{n,:} and a real b_n such that:

    if the binary vector h is not equal to a_t, the last bit remains unchanged with probability greater than or equal to 1 - \varepsilon, that is, P(v_n = h_n \mid h \ne a_t) \ge 1 - \varepsilon;
    if the binary vector h is equal to a_t, its last bit is switched from 0 to 1 with probability \sigma(p).

Parameters for Theorem 1
The result in Theorem 1 is achievable with the following weights and biases, where k is the number of ones in a_t, assumed (without loss of generality) to occupy its first k positions:

    W_{nj} = w,    1 \le j \le k
    W_{nj} = -w,   k+1 \le j \le n-1
    W_{nn} = n w
    b_n = -k w + p

Number of parameters: transferring probability to one vector at a time along a Gray code, we need n^2 2^n parameters, i.e. 2^n layers.
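
The following check (my own sketch; the signs of the weights follow my reading of the construction above) evaluates the total input to unit v_n for a_t = 1100 (so k = 2) and a few other vectors, confirming the two cases of Theorem 1:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n, k = 4, 2
a_t = [1, 1, 0, 0]                 # first k bits are 1, last bit is 0
w, p = 20.0, 0.3                   # large w; sigma(p) is the desired switch probability

W_n = np.array([w] * k + [-w] * (n - 1 - k) + [n * w])   # weight vector W_{n,:}
b_n = -k * w + p

for h in ([1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1]):
    prob_on = sigmoid(W_n @ np.array(h) + b_n)           # P(v_n = 1 | h)
    print(h, round(float(prob_on), 4))
# h = a_t               -> sigma(p) ~ 0.5744 (last bit switched from 0 to 1)
# h != a_t with h_n = 0 -> ~ 0.0    (last bit stays 0)
# h != a_t with h_n = 1 -> ~ 1.0    (last bit stays 1)
```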

Simultaneous Transfer of Probability
Instead of transferring probability to 1 vector at a time, transfer it to n vectors at a time. This brings the cost down to n 2^n parameters, i.e. 2^n/n layers.

Arrangement of Gray Codes
Let n = 2^t. There exist n sequences of vectors of n bits, S_i, 0 \le i \le n-1, each composed of vectors S_{i,k}, 1 \le k \le 2^n/n, satisfying the following conditions:

    1. \{S_0, \ldots, S_{n-1}\} is a partition of the set of all vectors of n bits.
    2. Every sub-sequence S_i satisfies the second property of Gray codes: the Hamming distance between S_{i,k} and S_{i,k+1} is 1.
    3. For any two sub-sequences S_i and S_j, the bit switched between consecutive vectors (S_{i,k} to S_{i,k+1}, and S_{j,k} to S_{j,k+1}) is different unless the Hamming distance between S_{i,k} and S_{j,k} is 1.

Example for n = 4:

    S_0     S_1     S_2     S_3
    0000    0100    1000    1100
    0001    0110    1001    1110
    0011    0111    1011    1111
    0010    0101    1010    1101
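
A quick script (mine, not from the slides) that checks the three conditions on the n = 4 example above:

```python
S = [["0000", "0001", "0011", "0010"],
     ["0100", "0110", "0111", "0101"],
     ["1000", "1001", "1011", "1010"],
     ["1100", "1110", "1111", "1101"]]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def flipped_bit(a, b):
    return next(j for j in range(len(a)) if a[j] != b[j])

# Condition 1: the four sequences partition {0,1}^4.
assert sorted(sum(S, [])) == sorted(format(v, "04b") for v in range(16))

# Condition 2: within each sequence, consecutive vectors differ in exactly one bit.
assert all(hamming(s[k], s[k + 1]) == 1 for s in S for k in range(3))

# Condition 3: two sequences switch the same bit at step k only if their
# k-th vectors are at Hamming distance 1.
for k in range(3):
    for i in range(4):
        for j in range(i + 1, 4):
            same_bit = flipped_bit(S[i][k], S[i][k + 1]) == flipped_bit(S[j][k], S[j][k + 1])
            assert (not same_bit) or hamming(S[i][k], S[j][k]) == 1
print("all three conditions hold for the n = 4 example")
```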

End Game
Can we retain the universal approximation property of a DBN while transferring probability to n vectors at a time? That is, for any binary vector x of length n, can we still find weights and biases such that P(h^0 = x) = p^*(x)?

Lemma
Let p^* be an arbitrary distribution over vectors of n bits, where n is again a power of two. A DBN with 2^n/n + 1 layers such that:

    1. for each i, 0 \le i \le n-1, the top RBM between layers h^{2^n/n} and h^{2^n/n - 1} assigns probability \sum_k p^*(S_{i,k}) to S_{i,1};

    2. for each i, 0 \le i \le n-1, and each k, 1 \le k \le 2^n/n - 1, we have

        P(h^{2^n/n - (k+1)} = S_{i,k+1} \mid h^{2^n/n - k} = S_{i,k}) = \frac{\sum_{t=k+1}^{2^n/n} p^*(S_{i,t})}{\sum_{t=k}^{2^n/n} p^*(S_{i,t})}    (18)

        P(h^{2^n/n - (k+1)} = S_{i,k} \mid h^{2^n/n - k} = S_{i,k}) = \frac{p^*(S_{i,k})}{\sum_{t=k}^{2^n/n} p^*(S_{i,t})}    (19)

    3. for each k, 1 \le k \le 2^n/n - 1, we have

        P(h^{2^n/n - (k+1)} = a \mid h^{2^n/n - k} = a) = 1  if a \notin \bigcup_i \{S_{i,k}\}    (20)

has p^* as its marginal distribution over h^0.

Proof of the Lemma
Let x be an arbitrary binary vector of n bits; since the S_i partition \{0,1\}^n, there is a unique pair (i, k) such that x = S_{i,k}. We need to show that

    P(h^0 = S_{i,k}) = p^*(S_{i,k}).    (21)

Example for n = 4: if x = S_{2,2}, we must show that P(h^0 = S_{2,2}) = p^*(S_{2,2}).    (22)

Proof of the Lemma
The marginal probability of h^0 = S_{i,k} is therefore equal to

    P(h^0 = S_{i,k}) = P(h^{2^n/n - 1} = S_{i,1})    (23)
        \times \prod_{t=1}^{k-1} P(h^{2^n/n - (t+1)} = S_{i,t+1} \mid h^{2^n/n - t} = S_{i,t})    (24)
        \times P(h^{2^n/n - (k+1)} = S_{i,k} \mid h^{2^n/n - k} = S_{i,k})    (25)
        \times \prod_{t=k+1}^{2^n/n - 1} P(h^{2^n/n - (t+1)} = S_{i,k} \mid h^{2^n/n - t} = S_{i,k}).    (26)

Proof of the Lemma
Replacing each of these probabilities by the values given in the Lemma, we get

    P(h^0 = S_{i,k}) = \Big( \sum_{u=1}^{2^n/n} p^*(S_{i,u}) \Big)    (27)
        \times \prod_{t=1}^{k-1} \frac{\sum_{u=t+1}^{2^n/n} p^*(S_{i,u})}{\sum_{u=t}^{2^n/n} p^*(S_{i,u})}    (28)
        \times \frac{p^*(S_{i,k})}{\sum_{u=k}^{2^n/n} p^*(S_{i,u})}    (29)
        \times 1^{2^n/n - 1 - k}    (30)
    = p^*(S_{i,k}).    (31)

The last equality follows from the cancellation of consecutive terms in the product. This concludes the proof.
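
The telescoping cancellation can be checked numerically. The sketch below (my own, not from the slides) draws a random target distribution p* over 4-bit vectors grouped by the sub-sequences S_i of the earlier example (2^n/n = 4 vectors each), assembles P(h^0 = S_{i,k}) from the factors (27)-(30), and verifies that it equals p*(S_{i,k}):

```python
import numpy as np

rng = np.random.default_rng(0)

# p[i, k] = p*(S_{i,k}) for a random target distribution over the 16 vectors.
p = rng.random((4, 4))
p /= p.sum()

def marginal(i, k):
    """P(h^0 = S_{i,k}) assembled from the Lemma's factors (0-based k)."""
    tail = lambda t: p[i, t:].sum()      # sum of p*(S_{i,u}) over u >= t (0-based)
    prob = tail(0)                       # top RBM puts the whole sequence mass on S_{i,1}
    for t in range(k):                   # advance along the Gray code, factor (28)
        prob *= tail(t + 1) / tail(t)
    prob *= p[i, k] / tail(k)            # stop at S_{i,k}, factor (29)
    return prob                          # remaining layers copy with probability 1, factor (30)

for i in range(4):
    for k in range(4):
        assert np.isclose(marginal(i, k), p[i, k])
print("P(h^0 = S_{i,k}) = p*(S_{i,k}) for every (i, k)")
```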

Theorem 4
If n = 2^t, a DBN composed of 2^n/n + 1 layers of size n is a universal approximator of distributions over vectors of size n.

Proof of Theorem 4: Using the Lemma, we now show that it is possible to construct such a DBN. First, Le Roux and Bengio (2008) showed that an RBM with n hidden units can model any distribution which assigns a non-zero probability to at most n vectors. Property 1 of the Lemma can therefore be achieved.

Proof of Theorem 4
All the subsequent layers are as follows:

    At each layer, the first t bits of h^{k+1} are copied to the first t bits of h^k with probability arbitrarily close to 1. This is possible, as proven earlier.
    At each layer, n/2 of the remaining n - t bits are potentially changed, to move from one vector of a Gray-code sequence to the next with the correct probability (as defined in the Lemma).
    The remaining n/2 - t bits are copied from h^{k+1} to h^k with probability arbitrarily close to 1.

Such layers are arbitrarily close to fulfilling the requirements of the second property of the Lemma. This concludes the proof.

Conclusion
Deep belief networks are compact universal approximators.

    Sutskever and Hinton's method (2008): transfer of probability; needs 3(n+1)^2 2^n parameters, i.e. 3 \cdot 2^n layers.
    Le Roux and Bengio's improvement (2010): Gray codes and simultaneous transfer of probability; needs n 2^n parameters, i.e. 2^n/n layers (given that n is a power of 2).

References

[LeBe10] Le Roux, N. and Bengio, Y. (2010). Deep belief networks are compact universal approximators. Neural Computation, 22(8), 2192-2207.
[LeBe08] Le Roux, N. and Bengio, Y. (2008). Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6), 1631-1649.
[SuHi08] Sutskever, I. and Hinton, G. E. (2008). Deep, narrow sigmoid belief networks are universal approximators. Neural Computation, 20(11), 2629-2636.
[HiOT06] Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527-1554.
[Gray53] Gray, F. (1953). Pulse code communication. U.S. Patent 2,632,058.
[Cybe89] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303-314.