Greedy Layer-Wise Training of Deep Networks

Transcription:

Greedy Layer-Wise Training of Deep Networks Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle NIPS 2007 Presented by Ahmed Hefny

Story so far: Deep neural nets are more expressive: they can learn wider classes of functions with fewer hidden units (parameters) and fewer training examples. Unfortunately, they are not easy to train with randomly initialized gradient-based methods.

Story so far: Hinton et al. (2006) proposed greedy unsupervised layer-wise training. Greedy layer-wise: train layers sequentially, starting from the bottom (input) layer. Unsupervised: each layer learns a higher-level representation of the layer below; the training criterion does not depend on the labels. Each layer is trained as a Restricted Boltzmann Machine (the RBM is the building block of Deep Belief Networks). The trained model can be fine-tuned using a supervised method.

This paper extends the concept to: continuous variables, uncooperative input distributions, and simultaneous layer training. It also explores variations to better understand the training method: What if we use greedy supervised layer-wise training? What if we replace RBMs with auto-encoders?

Outline Review Restricted Boltzmann Machines Deep Belief Networks Greedy Layer-wise Training Supervised Fine-tuning Extensions Continuous Inputs Uncooperative Input Distributions Simultaneous Training Analysis Experiments

Restricted Boltzmann Machine: An undirected bipartite graphical model with connections between visible nodes v and hidden nodes h. It corresponds to the joint probability distribution

P(v, h) = \frac{1}{Z} \exp(-\mathrm{energy}(v, h)) = \frac{1}{Z} \exp(h^\top W v + b^\top v + c^\top h)

with factorized conditionals

Q(h \mid v) = \prod_j P(h_j \mid v), \quad Q(h_j = 1 \mid v) = \mathrm{sigm}\Big(c_j + \sum_k W_{jk} v_k\Big)

P(v \mid h) = \prod_k P(v_k \mid h), \quad P(v_k = 1 \mid h) = \mathrm{sigm}\Big(b_k + \sum_j W_{jk} h_j\Big)
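
As a concrete illustration of these factorized conditionals, here is a minimal NumPy sketch (not from the slides); it assumes W has shape (n_hidden, n_visible), with b the visible bias and c the hidden bias, and the function names are invented for this example.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_given_visible(v, W, c):
    # Q(h_j = 1 | v) = sigm(c_j + sum_k W_jk v_k), computed for all j at once
    return sigm(c + W @ v)

def visible_given_hidden(h, W, b):
    # P(v_k = 1 | h) = sigm(b_k + sum_j W_jk h_j), computed for all k at once
    return sigm(b + W.T @ h)
```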

Restricted Boltzmann Machine (Training): Given input vectors v^0, adjust \theta = (W, b, c) to increase \log P(v^0):

\log P(v^0) = \log \sum_h \exp(-\mathrm{energy}(v^0, h)) - \log \sum_{v,h} \exp(-\mathrm{energy}(v, h))

\frac{\partial \log P(v^0)}{\partial \theta} = -\sum_h Q(h \mid v^0)\, \frac{\partial\, \mathrm{energy}(v^0, h)}{\partial \theta} + \sum_{v,h} P(v, h)\, \frac{\partial\, \mathrm{energy}(v, h)}{\partial \theta}

Equivalently, since Q(h \mid v) = P(h \mid v) for an RBM, for each parameter \theta_k:

\frac{\partial \log P(v^0)}{\partial \theta_k} = -\sum_h Q(h \mid v^0)\, \frac{\partial\, \mathrm{energy}(v^0, h)}{\partial \theta_k} + \sum_v P(v) \sum_h Q(h \mid v)\, \frac{\partial\, \mathrm{energy}(v, h)}{\partial \theta_k}

Contrastive divergence approximates the model expectation: sample h^0 given v^0, then sample v^1 and h^1 using Gibbs sampling.

Restricted Boltzmann Machine (Training): Now we can perform stochastic gradient ascent on the data log-likelihood. Stop based on some criterion (e.g., the reconstruction error -\log P(v^1 = x \mid v^0 = x)).
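
The contrastive-divergence update described above could be sketched as follows; this is a rough single-example CD-1 step for a binary RBM, with an invented function name, learning rate, and a mean-squared reconstruction error used as a stand-in stopping signal.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.01):
    """One CD-1 step for a binary RBM, updating (W, b, c) in place."""
    # Positive phase: hidden probabilities and a sample given the data v0
    q_h0 = sigm(c + W @ v0)
    h0 = (rng.random(q_h0.shape) < q_h0).astype(float)
    # Negative phase: one Gibbs step back to the visibles and up again
    p_v1 = sigm(b + W.T @ h0)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    q_h1 = sigm(c + W @ v1)
    # Approximate log-likelihood gradient and parameter update
    W += lr * (np.outer(q_h0, v0) - np.outer(q_h1, v1))
    b += lr * (v0 - v1)
    c += lr * (q_h0 - q_h1)
    # Mean squared reconstruction error, a common rough stopping signal
    return np.mean((v0 - p_v1) ** 2)
```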

Deep Belief Network: A DBN is a model of the form

P(x, g^1, g^2, \ldots, g^\ell) = P(x \mid g^1)\, P(g^1 \mid g^2) \cdots P(g^{\ell-2} \mid g^{\ell-1})\, P(g^{\ell-1}, g^\ell)

where x = g^0 denotes the input variables and g^1, \ldots, g^\ell denote hidden layers of causal variables. The top-level factor P(g^{\ell-1}, g^\ell) is an RBM, and the remaining conditionals factorize:

P(g^i \mid g^{i+1}) = \prod_j P(g^i_j \mid g^{i+1}), \quad P(g^i_j = 1 \mid g^{i+1}) = \mathrm{sigm}\Big(b^i_j + \sum_{k=1}^{n^{i+1}} W^i_{kj}\, g^{i+1}_k\Big)

An RBM is equivalent to an infinitely deep network with tied weights.
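
To make the generative structure concrete, here is a hedged sketch of ancestral sampling from such a DBN: Gibbs sampling in the top RBM followed by downward propagation through the sigmoid conditionals. The list layout rbms, the parameter shapes, and the number of Gibbs steps are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_from_dbn(rbms, gibbs_steps=100):
    """Ancestral sampling sketch: `rbms` is a bottom-up list of (W, b, c)
    triples; the last one is the top-level RBM P(g^{l-1}, g^l), the others
    provide the downward conditionals P(g^i | g^{i+1})."""
    W, b, c = rbms[-1]
    # Gibbs chain in the top RBM to obtain a sample of g^{l-1}
    g = (rng.random(b.shape) < 0.5).astype(float)
    for _ in range(gibbs_steps):
        h = (rng.random(c.shape) < sigm(c + W @ g)).astype(float)
        g = (rng.random(b.shape) < sigm(b + W.T @ h)).astype(float)
    # Propagate down through the directed sigmoid layers P(g^i | g^{i+1})
    for W, b, _ in reversed(rbms[:-1]):
        g = (rng.random(b.shape) < sigm(b + W.T @ g)).astype(float)
    return g  # a sample of x = g^0
```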

Greedy layer-wise training: P(g^1 \mid g^0) is intractable. Approximate it with Q(g^1 \mid g^0): treat the bottom two layers as an RBM and fit its parameters using contrastive divergence. That gives an approximate \widehat{P}(g^1), which we then need to match with P(g^1), modeled by the layers above.

Greedy layer-wise training: Approximate P(g^\ell \mid g^{\ell-1}) \approx Q(g^\ell \mid g^{\ell-1}): treat layers \ell-1 and \ell as an RBM and fit its parameters using contrastive divergence, sampling the training inputs g^{\ell-1}_0 recursively using Q(g^i \mid g^{i-1}) starting from g^0.
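
A possible shape for this greedy loop, reusing the cd1_update sketch above; layer_sizes, the epoch count, and the per-example update are illustrative choices, not the paper's exact training schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def propagate_up(x, trained_rbms):
    """Sample a layer input by applying Q(g^i | g^{i-1}) recursively from g^0 = x."""
    g = x
    for W, _, c in trained_rbms:
        q = sigm(c + W @ g)
        g = (rng.random(q.shape) < q).astype(float)
    return g

def greedy_pretrain(data, layer_sizes, epochs=10, lr=0.01):
    """Train one RBM per layer, bottom-up; each new RBM sees inputs sampled
    through the already-trained layers (uses cd1_update from the earlier sketch)."""
    rbms = []
    n_in = data.shape[1]
    for n_hid in layer_sizes:
        W = 0.01 * rng.standard_normal((n_hid, n_in))
        b, c = np.zeros(n_in), np.zeros(n_hid)
        for _ in range(epochs):
            for x in data:
                cd1_update(propagate_up(x, rbms), W, b, c, lr)
        rbms.append((W, b, c))
        n_in = n_hid
    return rbms
```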

Outline Review Restricted Boltzmann Machines Deep Belief Networks Greedy Layer-wise Training Supervised Fine-tuning Extensions Continuous Inputs Uncooperative Input Distributions Simultaneous Training Analysis Experiments

Supervised Fine-Tuning (in this paper): Use greedy layer-wise training to initialize the weights of all layers except the output layer. For fine-tuning, use stochastic gradient descent on a cost function of the outputs, where the conditional expected values of the hidden nodes are approximated using mean-field:

E[g^i \mid g^{i-1} = \mu^{i-1}] = \mu^i = \mathrm{sigm}(b^i + W^i \mu^{i-1})
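
A minimal sketch of the mean-field forward pass, assuming each pre-trained layer is stored as an RBM triple (W, b, c) with c playing the role of b^i in the formula above; the output layer and the backpropagation step are omitted.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_forward(x, rbms):
    """mu^i = sigm(b^i + W^i mu^{i-1}) with mu^0 = x; here the RBM hidden
    bias c of layer i plays the role of b^i in the slide's formula."""
    mu = x
    activations = [mu]
    for W, _, c in rbms:
        mu = sigm(c + W @ mu)
        activations.append(mu)
    # activations[-1] would feed a supervised output layer; its cost is then
    # minimized by stochastic gradient descent through these mean-field values
    return activations
```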

Supervised Fine-Tuning (in this paper): Use greedy layer-wise training to initialize the weights of all layers except the output layer, then fine-tune with standard backpropagation (the alternative to the mean-field approach above).

Outline Review Restricted Boltzmann Machines Deep Belief Networks Greedy Layer-wise Training Supervised Fine-tuning Extensions Continuous Inputs Uncooperative Input Distributions Simultaneous Training Analysis Experiments

Continuous Inputs: Recall that for an RBM,

Q(h_j \mid v) \propto Q(h_j, v) \propto \exp(h_j\, w^\top v + b_j h_j) = \exp\big((w^\top v + b_j)\, h_j\big) = \exp(a(v)\, h_j)

If we restrict h_j to I = \{0, 1\}, normalization gives a binomial unit with p given by the sigmoid. If instead I = [0, \infty), we get an exponential density; if I is a closed interval, we get a truncated exponential.

Continuous Inputs (case of the truncated exponential on [0, 1]): Sampling: for the truncated exponential, the inverse CDF can be used,

h_j = F^{-1}(U) = \frac{\log\big(1 - U\,(1 - \exp(a(v)))\big)}{a(v)}

where U is sampled uniformly from [0, 1]. Conditional expectation:

E[h_j \mid v] = \frac{1}{1 - \exp(-a(v))} - \frac{1}{a(v)}
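
These two formulas translate directly into code; a small sketch follows, with a(v) passed in as a precomputed scalar a.

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_exp_sample(a):
    """Inverse-CDF sample of a unit with density proportional to exp(a*h) on [0, 1]."""
    u = rng.random()
    return np.log(1.0 - u * (1.0 - np.exp(a))) / a

def truncated_exp_mean(a):
    """E[h | v] = 1/(1 - exp(-a)) - 1/a; tends to 1/2 as a -> 0 (the uniform case)."""
    return 1.0 / (1.0 - np.exp(-a)) - 1.0 / a
```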

Continuous Inputs: To handle Gaussian units, we need to augment the energy function with a term quadratic in h. For a diagonal covariance matrix,

P(h_j \mid v) \propto \exp\big(a(v)\, h_j - d_j^2\, h_j^2\big)

giving E[h_j \mid v] = a(v) / (2 d_j^2).

Continuous Hidden Nodes? Truncated exponential: E[h_j \mid v] = \frac{1}{1 - \exp(-a(v))} - \frac{1}{a(v)}. Gaussian: E[h_j \mid v] = a(v) / (2 d_j^2).

Uncooperative Input Distributions: Setting: x \sim p(x), y = f(x) + \text{noise}, with no particular relation between p and f (e.g., p Gaussian and f a sine function). Problem: unsupervised pre-training may not help prediction.

Outline Review Restricted Boltzmann Machines Deep Belief Networks Greedy Layer-wise Training Supervised Fine-tuning Extensions Analysis Experiments

Uncooperative Input Distributions: Proposal: mix unsupervised and supervised training for each layer. A temporary output layer is attached to the layer being trained, and the combined update adds the stochastic gradient of the input log-likelihood (estimated by contrastive divergence) to the stochastic gradient of the prediction error.
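
One way the combined update could look in code, under several assumptions not stated on the slide: a linear temporary output layer (V, d), a squared prediction error, and equal weighting of the supervised and unsupervised gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def partially_supervised_update(v, y, W, b, c, V, d, lr=0.01):
    """Combined update for one layer: CD-1 gradient on the input log-likelihood
    plus the gradient of a squared prediction error through a temporary
    linear output layer (V, d)."""
    q_h = sigm(c + W @ v)
    # Supervised part: backprop the prediction residual through the sigmoid
    err = (V @ q_h + d) - y
    dh = (V.T @ err) * q_h * (1.0 - q_h)
    # Unsupervised part: CD-1 statistics, as in the earlier RBM sketch
    h0 = (rng.random(q_h.shape) < q_h).astype(float)
    p_v1 = sigm(b + W.T @ h0)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    q_h1 = sigm(c + W @ v1)
    # Combined update: ascend the likelihood term, descend the error term
    W += lr * ((np.outer(q_h, v) - np.outer(q_h1, v1)) - np.outer(dh, v))
    b += lr * (v - v1)
    c += lr * ((q_h - q_h1) - dh)
    # The temporary output layer is trained on the prediction error only
    V -= lr * np.outer(err, q_h)
    d -= lr * err
```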

Simultaneous Layer Training: Greedy layer-wise training: for each layer, repeat until a criterion is met: sample the layer input (by recursively applying the trained layers to the data) and update the parameters using contrastive divergence.

Simultaneous Layer Training: Simultaneous training: repeat until a criterion is met: sample the input to all layers and update the parameters of all layers using contrastive divergence. Simpler (one criterion for the entire network), but takes more time.
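
For contrast with the greedy loop sketched earlier, a simultaneous-training pass might look like this (reusing propagate_up and cd1_update from the earlier sketches; rbms is a pre-initialized list of (W, b, c) triples).

```python
def simultaneous_pretrain(data, rbms, epochs=10, lr=0.01):
    """One criterion for the whole stack: on every pass, each layer gets a
    CD-1 update on an input sampled through the layers below it."""
    for _ in range(epochs):
        for x in data:
            for i, (W, b, c) in enumerate(rbms):
                v = propagate_up(x, rbms[:i])
                cd1_update(v, W, b, c, lr)
    return rbms
```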

Outline Review Restricted Boltzmann Machines Deep Belief Networks Greedy Layer-wise Training Supervised Fine-tuning Extensions Continuous Inputs Uncooperative Input Distributions Simultaneous Training Analysis Experiments

Experiments Does greedy unsupervised pre-training help? What if we replace RBM with auto-encoders? What if we do greedy supervised pre-training? Does continuous variable modeling help? Does partially supervised pre-training help?

Experiment 1 Does greedy unsupervised pre-training help? What if we replace RBM with auto-encoders? What if we do greedy supervised pre-training? Does continuous variable modeling help? Does partially supervised pre-training help?

Experiment 1 (MSE and training errors): partially supervised < unsupervised pre-training < no pre-training; Gaussian units < binomial units (lower error is better).

Experiment 2 Does greedy unsupervised pre-training help? What if we replace RBM with auto-encoders? What if we do greedy supervised pre-training? Does continuous variable modeling help? Does partially supervised pre-training help?

Experiment 2: Auto-encoders learn a compact representation from which to reconstruct x:

p(x) = \mathrm{sigm}\big(c + W\, \mathrm{sigm}(b + W^\top x)\big)

trained to minimize the reconstruction cross-entropy:

R = -\sum_i \big[ x_i \log p(x_i) + (1 - x_i) \log(1 - p(x_i)) \big]
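
A small sketch of this reconstruction criterion, assuming tied weights (the decoder uses W^T) and a numerical guard on the logs; names and the tying choice are illustrative.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def autoencoder_reconstruction_loss(x, W, b, c):
    """Reconstruction cross-entropy of a tied-weight auto-encoder:
    h = sigm(b + W x), reconstruction p = sigm(c + W^T h)."""
    h = sigm(b + W @ x)
    p = sigm(c + W.T @ h)
    eps = 1e-12  # guard against log(0)
    return -np.sum(x * np.log(p + eps) + (1.0 - x) * np.log(1.0 - p + eps))
```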

Experiment 2: layer width 500-1000 units; 20 nodes in the last two layers.

Experiment 2: Auto-encoder pre-training outperforms supervised pre-training but is still outperformed by RBM pre-training. Without pre-training, deep nets do not generalize well, but they can still fit the data if the output layers are wide enough.

Conclusions: Unsupervised pre-training is important for deep networks. Partial supervision further enhances results, especially when the input distribution and the function to be estimated are not closely related. Explicitly modeling continuous-valued inputs is better than using binomial units.

Thanks