
WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY, WITH IMPLICATIONS FOR TRAINING

Sanjeev Arora, Yingyu Liang & Tengyu Ma
Department of Computer Science, Princeton University, Princeton, NJ 08540, USA
{arora,yingyul,tengyu}@cs.princeton.edu

ABSTRACT

Generative models for deep learning are promising both for improving understanding of the model and for yielding training methods that require fewer labeled samples. Recent works use generative-model approaches to produce the deep net's input given the value of a hidden layer several levels above. However, there is no accompanying proof of correctness for the generative model, showing that the feedforward deep net is the correct inference method for recovering the hidden layer given the input. Furthermore, these models are complicated.

The current paper takes a more theoretical tack. It presents a very simple generative model for ReLU deep nets, with the following characteristics: (i) The generative model is just the reverse of the feedforward net: if the forward transformation at a layer is A, then the reverse transformation is A^T. (This can be seen as an explanation of the old weight-tying idea for denoising autoencoders.) (ii) Its correctness can be proven under a clean theoretical assumption: the edge weights in real-life deep nets behave like random numbers. Under this assumption, which is experimentally tested on real-life nets like AlexNet, it is formally proved that the feedforward net is a correct inference method for recovering the hidden layer.

The generative model suggests a simple modification for training: use the generative model to produce synthetic data with labels and include it in the training set. Experiments are presented that support this theory of random-like deep nets and show that the synthetic data helps training. This extended abstract provides a succinct description of our results; the full paper is available on arXiv.

1 INTRODUCTION

Discriminative/generative pairs of models for classification tasks are an old theme in machine learning (Ng & Jordan, 2001). Generative-model analogs for deep learning may not only cast new light on the discriminative backpropagation algorithm, but also allow learning with fewer labeled samples. A seeming obstacle in this quest is that deep nets are successful in a variety of domains, and it is unlikely that problem inputs in these domains share common families of generative models. Some generic (i.e., not tied to a specific domain) approaches to defining such models include Restricted Boltzmann Machines (Freund & Haussler, 1994; Hinton & Salakhutdinov, 2006) and Denoising Autoencoders (Bengio et al., 2006; Vincent et al., 2008). Surprisingly, these suggest that deep nets are reversible: the generative model is essentially the feedforward net run in reverse. Further refinements include Stacked Denoising Autoencoders (Vincent et al., 2010), Generalized Denoising Auto-Encoders (Bengio et al., 2013b), and Deep Generative Stochastic Networks (Bengio et al., 2013a).

In the case of image recognition it is possible to work harder, using a custom deep net to invert the feedforward net, and reproduce the input very well from the values of hidden layers much higher up, and in fact to generate images very different from any that were used to train the net (e.g., Mahendran & Vedaldi (2015)).

To explain the contribution of this paper and contrast it with past work, we need to formally define the problem. Let x denote the data/input to the deep net and z denote the hidden representation (or the output labels). The generative model has to satisfy the following:

Property (a): Specify a joint distribution of x and z, or at least p(x | z).

Property (b): A proof that the deep net itself is a method of computing the (most likely) z given x.

Past work usually fails to satisfy one of (a) and (b). The current paper introduces a simple mathematical explanation for why such a model should exist for deep nets with fully connected layers. We propose the random-like nets hypothesis, which says that real-life deep nets, even those obtained from standard supervised learning, are random-like, meaning their edge weights behave like random numbers. Notice that this is distinct from saying that the edge weights actually are randomly generated or uncorrelated. Instead we mean that the weighted graph has bulk properties similar to those of random weighted graphs. To give an example, matrices in a host of settings are known to display properties (specifically, eigenvalue distributions) similar to those of matrices with Gaussian entries; this so-called universality phenomenon is a matrix analog of the Law of Large Numbers. The random-like properties of deep nets needed in this paper involve a generalized eigenvalue-like property, which we empirically verified on real-world neural nets.

If a deep net is random-like, we can show mathematically that it has an associated simple generative model p(x | z) (Property (a)) that we call the shadow distribution, and for which Property (b) also automatically holds in an approximate sense. Our generative model makes essential use of dropout noise and ReLUs, and can be seen as providing (yet another) theoretical explanation for the efficacy of these two in modern deep nets. Note that Properties (a) and (b) hold even for random (and hence untrained/useless) deep nets. Empirically, supervised training seems to improve the shadow distribution, and at the end the synthetic images are somewhat reasonable, albeit cruder compared to, say, Mahendran & Vedaldi (2015).
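To make the universality claim concrete, one simple bulk-spectrum check of the kind reported in Section 3 compares the singular values of a trained weight matrix, suitably rescaled, with the quarter-circle law followed by square Gaussian matrices. The following is a minimal NumPy sketch under that assumption; `W_trained` is a Gaussian stand-in for an actual trained fully connected layer, and the function names are illustrative rather than from the paper.

```python
import numpy as np

def scaled_singular_values(W):
    """Singular values of W rescaled so that an n x n matrix with i.i.d.
    N(0, sigma^2) entries has its spectrum supported (asymptotically) on [0, 2]."""
    n = W.shape[1]
    sigma = W.std()
    return np.linalg.svd(W / (sigma * np.sqrt(n)), compute_uv=False)

def quarter_circle_density(s):
    """Density (1/pi) * sqrt(4 - s^2) of the quarter-circle law on [0, 2]."""
    s = np.asarray(s, dtype=float)
    return np.where(s <= 2.0, np.sqrt(np.maximum(4.0 - s**2, 0.0)) / np.pi, 0.0)

# Illustrative usage: a trained layer's weight matrix would replace this stand-in.
rng = np.random.default_rng(0)
W_trained = rng.normal(size=(1024, 1024))
sv = scaled_singular_values(W_trained)
hist, edges = np.histogram(sv, bins=20, range=(0.0, 2.5), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
# For a random-like matrix the empirical histogram should be close to the law.
print(np.round(hist, 3))
print(np.round(quarter_circle_density(centers), 3))
```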
2 GENERATIVE MODEL AND PROVABLE GUARANTEES

Let x denote the input and h denote the hidden variable computed by the neural network. When h has fewer nonzero coordinates than x, this has to be a many-to-one function, and prior work on generative models has tried to define a probabilistic inverse of this function. Sometimes, e.g., in the context of denoising autoencoders, such an inverse is called a reconstruction, if one thinks of all inverses as being similar. Here we abandon the idea of reconstruction and focus on defining the many inverses x of h. We define a shadow distribution p(x | h) such that a random sample x from this distribution satisfies Property (b), i.e., the feedforward network computes a hidden variable that is close to h. For example, a one-layer network with weights W and bias b outputs $\mathrm{ReLU}(W^T x + b)$, which we show is close to h. To understand the considerations in defining such an inverse, one must keep in mind that the ultimate goal is to extend the notion to multi-level nets. Thus a generated inverse x has to look like the output of a one-level net in the layer below. As mentioned, this is where previous attempts such as DBNs or denoising autoencoders run into theoretical difficulties.

One layer model. For simplicity we start with a single-layer neural net. Given $h \in \mathbb{R}^m$, the model p(x | h) generates $x \in \mathbb{R}^n$ as follows: first compute $r(\alpha W h)$, where $\alpha$ is a scaling scalar and r is the ReLU; then randomly zero out each coordinate with probability $1 - \rho$. Formally, let $\odot$ denote the entry-wise product of two vectors. Then p(x | h) is defined as

$x = r(\alpha W h) \odot n_{\mathrm{drop}}$,   (1)

where $\alpha = 2/(\rho n)$ is a scaling factor and $n_{\mathrm{drop}} \in \{0, 1\}^n$ is a binary random vector satisfying

$\Pr[n_{\mathrm{drop}}] = \rho^{\|n_{\mathrm{drop}}\|_0} (1 - \rho)^{n - \|n_{\mathrm{drop}}\|_0}$,   (2)

where $\|n_{\mathrm{drop}}\|_0$ denotes the number of non-zeros of $n_{\mathrm{drop}}$. We refer to this noise model as dropout noise, and note that $\rho$ can be reduced to make x as sparse as needed; typically $\rho$ will be small. Let $s_t(\cdot)$ be the random function that drops coordinates with probability $1 - \rho$, that is, $s_t(z) = z \odot n_{\mathrm{drop}}$. Then (1) simplifies to

$x = s_{\rho n}(r(\alpha W h))$.   (3)
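For concreteness, here is a minimal NumPy sketch of sampling from the one-layer shadow distribution (1)-(3). Shapes follow the text (W is n x m, h in R^m, x in R^n); the function name and interface are illustrative, not taken from the paper.

```python
import numpy as np

def shadow_sample_one_layer(W, h, rho, rng=None):
    """Draw x ~ p(x | h) as in Eqs. (1)-(3): x = s_{rho n}( r(alpha * W h) ),
    where r is the ReLU, alpha = 2 / (rho * n), and the dropout noise keeps
    each coordinate independently with probability rho."""
    rng = np.random.default_rng() if rng is None else rng
    n = W.shape[0]
    alpha = 2.0 / (rho * n)
    pre = np.maximum(alpha * (W @ h), 0.0)   # r(alpha * W h)
    n_drop = rng.random(n) < rho             # binary dropout mask, Eq. (2)
    return pre * n_drop                      # entry-wise product, Eq. (1)

# Illustrative usage with random-like weights and a sparse hidden vector.
rng = np.random.default_rng(0)
n, m, rho = 512, 128, 0.1
W = rng.normal(size=(n, m))
h = np.zeros(m)
h[:10] = 1.0                                 # a 10-sparse hidden vector
x = shadow_sample_one_layer(W, h, rho, rng)
print(np.count_nonzero(x), "nonzero coordinates out of", n)
```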

h (l) h (l) = r(w T l 1 h (l 1) + b l ) W l 1 W T l 1 h (l 1) = s kl 1 (r(α l 1 W l 1 h l )) h (l 1) h (1) h (1) = r(w T 0 x + b 1 ) W 0 W T 0 x = s k0 (r(α 0 W 0 h (1) )) (observed layer) (a) Generative model x (observed layer) (b) Feedforward NN Figure 1: Generative-discriminative pair: a) defines the conditional distribution Pr[x h (l) ]; b) defines the feed-forward function h (l) = NN(x). random function that drops coordinates with probability 1 ρ, that is, s t (z) = z n drop. Then (1) is simplified to x = s ρn (r(αw h)). (3) Multiple layer model. We describe the multilayer generative model for the shadow distribution associated with an l-layer deep net with ReLUs. The feedforward net is shown in Figure 2 (b). The j-th layer has n j nodes, while the observable layer has n 0 nodes. The corresponding generative model is in Figure 2 (a). The number of variables at each layer, and the edge weights match exactly in the two models, but the generative model runs from top to down. The generative model starts with the top layer h (l), which is from some arbitrary distribution D l over the set of k l -sparse vectors in R n l. Then it generates the hidden variable h (l 1) below using the same stochastic process as described for one layer: compute r(α l 1 W l 1 h (l) ), and then apply a random sampling function s kl 1 ( ) on the vector, where k l 1 is the target sparsity of h (l 1), and α l 1 = 2/k l 1 is a scaling constant. Formally, the generative model is x = s k0 (r(α 0 W 0 s k1 (r(α 1 W 1 ))). (4) Probable Guarantees We now present our formal guarantees for 2 and 3 layers that the feedforward net inverts the generative model, i.e., Property (b). We assume the random-like matrices W j s to have standard gaussian prior: W j has i.i.d entries from N (0, 1) (5) We also assume that the distribution D l produces k l -sparse vectors with not too large entries almost surely: ( ) h (l) R n l 0, h(l) 0 k l, and h (l) O log N/(kl ) h a.s. (6) where N j n j is the total number of nodes in the architecture. Under this mathematical setup, we prove the following reversibility and dropout robustness results for 2-layer networks. Theorem 2.1 (2-Layer Reversibility and Dropout Robustness). For l = 2, and k 2 < k 1 < k 0 < k2, 2 for 0.9 measure of the weights (W 0, W 1 ), the following holds: There exists constant offset vector b 0, b 1 such that when h (2) D 2 and Pr[x h (2) ] is specified as model (4), then network has reversibility and dropout robustness in the sense that the feedforward calculation (defined in Figure 2(b)) gives h (2) satisfying [ i [n 2 ], E h (2) i h (2) i 2] ɛτ 2 (7) 3
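The sketch below instantiates the multilayer generative model (4) and the feedforward net of Figure 1(b) with random Gaussian weights, so the quantity bounded in (7) can be inspected numerically for small sizes. It is an illustration under assumptions (5)-(6), not the paper's experimental code; in particular the offset vectors are taken to be zero for simplicity, and all names are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def s_sparsify(z, k, rng):
    """s_k: keep each coordinate independently with probability k / len(z)."""
    return z * (rng.random(z.shape[0]) < k / z.shape[0])

def shadow_sample(Ws, ks, h_top, rng):
    """Sample x via Eq. (4): going down, h^(j) = s_{k_j}( r(alpha_j W_j h^(j+1)) )
    with alpha_j = 2 / k_j; Ws = [W_0, ..., W_{l-1}], ks = [k_0, ..., k_{l-1}]."""
    h = h_top
    for j in reversed(range(len(Ws))):
        h = s_sparsify(relu((2.0 / ks[j]) * (Ws[j] @ h)), ks[j], rng)
    return h

def feedforward(Ws, x):
    """Figure 1(b) with zero biases: h^(j+1) = r(W_j^T h^(j)), h^(0) = x."""
    h = x
    for W in Ws:
        h = relu(W.T @ h)
    return h

# Illustrative 2-layer instance with k_2 < k_1 < k_0.
rng = np.random.default_rng(0)
n0, n1, n2 = 4000, 1000, 200          # layer widths
k0, k1, k2 = 400, 50, 5               # target sparsities
W0 = rng.normal(size=(n0, n1))
W1 = rng.normal(size=(n1, n2))
h2 = np.zeros(n2)
h2[rng.choice(n2, size=k2, replace=False)] = 1.0
x = shadow_sample([W0, W1], [k0, k1], h2, rng)
h2_hat = feedforward([W0, W1], x)
tau = h2[h2 > 0].mean()
print("mean entry-wise squared error:", np.mean((h2_hat - h2) ** 2))
print("tau^2 (Eq. (7) bounds the error by eps * tau^2):", tau ** 2)
```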

To parse the theorem, note that when $k_2 \ll k_1$ and $k_1 \ll k_0$, the expected entry-wise difference between $\hat{h}^{(2)}$ and $h^{(2)}$ is small compared to the average signal strength of $h^{(2)}$. The magnitude of the error in the theorem is on the order of the ratio of the sparsities of the two layers.

3 SUMMARY OF OTHER RESULTS

Theorem 2.1 extends to three layers under stronger assumptions on the sparsity of the top layers. We additionally assume that $k_3 k_2 < k_0$, which says that the top two layers are significantly sparser than the bottom layer of sparsity $k_0$. This assumption is still reasonable, since in most practical situations the top layer consists of the labels and is therefore indeed much sparser.

Theorem 3.1 (3-Layer Reversibility and Dropout Robustness, informally stated). For $l = 3$, when $k_3 < k_2 < k_1 < k_0 < k_2^2$ and $k_3 k_2 < k_0$, the 3-layer generative model has the same type of reversibility and dropout robustness properties as in Theorem 2.1.

We also present experimental results that support our theory. First, the random-like nets hypothesis was verified on the fully connected layers of several trained networks, such as those in AlexNet and in multilayer nets that we trained on different data sets. The edge weights fit a Gaussian distribution, the biases in the ReLU gates are essentially constant (in accord with the theorems), and the distribution of the singular values of the weight matrices is close to the quarter-circle law of random Gaussian matrices.

Second, the knowledge that the deep net being sought is random-like can be used to improve training. Namely, take a labeled data point x and use the current feedforward net to compute its label z. Then use the shadow distribution p(x | z) to generate a synthetic data point $\tilde{x}$, label it with z, and add it to the training set for the next iteration. Experiments show that this addition yields measurable improvements over backpropagation + dropout for training fully connected layers. Furthermore, throughout training, the prediction error on synthetic data closely tracks that on the real data, as predicted by the theory.
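As a sketch of the training modification just described: each real example contributes one synthetic labeled example per iteration, produced by the current net and the shadow distribution. The helper below is illustrative only (the paper does not prescribe this interface); `net_forward` and `shadow_sample` stand in for the current feedforward net and the sampler of Section 2.

```python
import numpy as np

def augment_with_shadow(X, net_forward, shadow_sample, rng=None):
    """For each real example x: compute z = net_forward(x) with the current net,
    draw x_tilde ~ p(x | z) from the shadow distribution, and label it with z.
    Returns (X_syn, Z_syn) to append to the training set for the next iteration."""
    rng = np.random.default_rng() if rng is None else rng
    X_syn, Z_syn = [], []
    for x in X:
        z = net_forward(x)               # current net's top-layer output for x
        x_tilde = shadow_sample(z, rng)  # synthetic input sampled from p(x | z)
        X_syn.append(x_tilde)
        Z_syn.append(z)
    return np.stack(X_syn), np.stack(Z_syn)

# Illustrative usage with a one-layer net and the one-layer shadow model.
rng = np.random.default_rng(1)
n, m, rho = 256, 64, 0.2
W = rng.normal(size=(n, m))
net_forward = lambda x: np.maximum(W.T @ x, 0.0)
shadow_sample = lambda z, r: np.maximum((2.0 / (rho * n)) * (W @ z), 0.0) * (r.random(n) < rho)
X_real = rng.random((10, n))
X_syn, Z_syn = augment_with_shadow(X_real, net_forward, shadow_sample, rng)
print(X_syn.shape, Z_syn.shape)   # (10, 256) (10, 64)
```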

REFERENCES

Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In NIPS, pp. 153-160, 2006.

Yoshua Bengio, Eric Thibodeau-Laufer, Guillaume Alain, and Jason Yosinski. Deep generative stochastic networks trainable by backprop. arXiv preprint arXiv:1306.1091, 2013a.

Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems, pp. 899-907, 2013b.

Yoav Freund and David Haussler. Unsupervised learning of distributions on binary vectors using two layer networks. Technical report, 1994.

Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.

Andrew Y. Ng and Michael I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems 14 (NIPS 2001), pp. 841-848, 2001.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, pp. 1096-1103, 2008.

Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371-3408, 2010.