Deep Learning and Information Theory

Bhumesh Kumar (13D070060)    Alankar Kotwal (12D070010)

November 21, 2016

Abstract

The machine learning revolution has recently led to the development of a new flurry of techniques called deep learning. These have been extremely successful in a variety of applications, often surpassing their competition and winning top machine learning challenges. These algorithms, however, are often not as well-founded as classical machine learning techniques: most research involves finding the right network architecture for a particular application. We attempt to find a framework that makes the design of deep neural networks concrete; the successive refinement of features provided by the Information Bottleneck principle seems to fit this criterion well. We try to connect the information flow in a neural network to sufficient statistics as introduced by the information bottleneck, and justify why this could potentially be used to design optimal architectures for deep learning.

1 Introduction

Deep learning has, in recent years, received tremendous attention due to the ability of its models to consistently beat the state of the art on various problems of interest in computer vision, linguistics and artificial intelligence. Recent successes include DeepMind's AlphaGo beating Lee Sedol on a full-sized 19×19 board, as well as automatic machine translation and content-based image retrieval. Deep learning has been used in applications as varied as automatic driving, handwriting generation, bioinformatics and genome sequencing, owing to its ability to produce highly expressive models that efficiently capture high-level information with excellent generalization properties.

However, the theoretical underpinnings of deep neural networks are not well understood. Deep learning has, indeed, been described as "the triumph of empiricism" [2]. Though the Universal Approximation Theorem for multilayer neural nets [1] provides a bound on the point-wise error between the target function and the approximation produced by the optimal network, it does not say anything about that network's architecture. If there exist design principles for deep networks, we do not understand them. Most research on deep nets, therefore, involves tweaking the network architecture for performance on a particular application.

The information bottleneck [4], introduced in 1999, cast the classical statistical concept of sufficient statistics in an information-theoretic framework. It tries to find the best, smallest representation $\hat{X}$ of the relevant information in a random variable $X$ about a random variable $Y$. Such a representation is found by formulating a variational principle for efficient representation. This leads to a set of Blahut-Arimoto-like equations which, when iterated, yield the probabilistic mapping between $X$ and $\hat{X}$. We attempt to connect the information bottleneck and deep learning because of their fundamental similarity: both refine representations of their input random variables to learn high-level, non-redundant representations of $Y$ (the relevance variable for the bottleneck, the truth for a neural net).

2 Preliminaries

We now introduce the two concepts separately, and devote the next section to their integration.

2.1 Deep Neural Networks

A neural network is an interconnection of multi-input, single-output artificial neurons, each of which returns a function of the signals fed into it. A typical neuron linearly combines the input features and squashes the linear combination using an activation function. This enables a typical neural network to capture rich, nonlinear features within the data and mimic nonlinear functions. A neural network is a feed-forward neural network if it can be organized into layers as shown in Fig. 1. We deal with feed-forward networks only, because of their ease of training with backpropagation. A multi-layered neural network is usually called a deep network.

Figure 1: A neural network with two hidden layers. Taken from Neural Networks and Deep Learning.

The input to a neural network, therefore, is a set of features, and the output is the approximation of the target function we want to estimate. The data usually comes from a truth $Y \sim p(y)$, which generates $X \sim p(x \mid y)$. In a supervised learning setting, the random variable $X$ is revealed to us, and we need to model the joint $p(x, y)$. The model extracted from the data, $\hat{p}(x, y)$, is then used to predict the truth as $\hat{Y}$.

2.2 The Information Bottleneck

Consider the problem of finding the best, smallest representation $\hat{X}$ of the information about $Y$ contained in a random variable $X$, with the Markov relation $Y \to X \to \hat{X}$. For the moment, we assume that mutual information is a good measure of the relevance between two random variables. We then have a trade-off between two quantities: $I(\hat{X}; X)$, which represents how much of the information in $X$ is retained in $\hat{X}$, and $I(\hat{X}; Y)$, which represents the information about $Y$ retained by $\hat{X}$. For the optimal representation, we need to minimize the former, keeping as little of the information in $X$ as possible so that $\hat{X}$ is the most concise summary of $Y$, while maximizing the latter to keep most of the relevant information about $Y$ in $\hat{X}$.
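To make this trade-off concrete, here is a minimal sketch (our own illustration, not part of [4]; the joint distribution p_xy and the encoder p_xhat_given_x below are made-up toy values) that computes the two competing quantities $I(\hat{X}; X)$ and $I(\hat{X}; Y)$ for discrete variables:

```python
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in nats for a joint distribution given as a 2-D array p(a, b)."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / (p_a @ p_b)[mask])))

# A made-up joint p(x, y) over |X| = 4, |Y| = 2 (rows: x, columns: y).
p_xy = np.array([[0.30, 0.05],
                 [0.25, 0.05],
                 [0.05, 0.15],
                 [0.05, 0.10]])

# An arbitrary stochastic encoder p(xhat | x) with |Xhat| = 2 (rows: x, columns: xhat).
p_xhat_given_x = np.array([[0.9, 0.1],
                           [0.8, 0.2],
                           [0.2, 0.8],
                           [0.1, 0.9]])

p_x = p_xy.sum(axis=1)                               # p(x)
p_x_xhat = p_x[:, None] * p_xhat_given_x             # joint p(x, xhat)
p_y_xhat = p_xy.T @ p_xhat_given_x                   # joint p(y, xhat) via the Markov chain Y - X - Xhat

print("I(Xhat; X) =", mutual_information(p_x_xhat))  # compression cost: smaller is better
print("I(Xhat; Y) =", mutual_information(p_y_xhat))  # relevance: larger is better
```

A more deterministic encoder drives $I(\hat{X}; X)$ up; a coarser one drives both quantities down. The bottleneck objective below resolves exactly this tension.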

The optimization we then propose is the following:

$\min_{p(\hat{x} \mid x)} I(\hat{X}; X) - \beta I(\hat{X}; Y)$ (1)

As [4] shows, the optimal $p(\hat{x} \mid x)$ for this Lagrangian has the form

$p(\hat{x} \mid x) = \frac{p(\hat{x})}{Z(x, \beta)} \exp\left[ -\beta \sum_y p(y \mid x) \log \frac{p(y \mid x)}{p(y \mid \hat{x})} \right]$ (2)

where $Z(x, \beta)$ is a partition function. [4] goes further to derive a set of self-consistent equations whose iteration yields this optimal conditional distribution.

We now state two important properties [3] that relate to why the proposed objective is the right one to optimize.

Theorem 2.1. If $T$ is a random mapping of $X$, then $T$ is a sufficient statistic for $Y$ iff

$I(Y; T) = \max_{T' \in \mathcal{F}(X)} I(Y; T')$ (3)

where $\mathcal{F}(X)$ is the set of random mappings of $X$.

Theorem 2.2. If $T$ is a sufficient statistic for $Y$, then $T$ is a minimal sufficient statistic iff

$I(X; T) = \min_{T' \in \mathcal{S}(Y)} I(X; T')$ (4)

where $\mathcal{S}(Y)$ is the set of sufficient statistics for $Y$.
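For discrete variables the self-consistent equations can be iterated directly. The sketch below is a minimal Blahut-Arimoto-like iteration written for illustration (the function name, random initialization and fixed iteration count are our choices, not prescriptions from [4]): it alternates the update of eq. (2) with the induced updates of $p(\hat{x})$ and $p(y \mid \hat{x})$.

```python
import numpy as np

def information_bottleneck(p_xy, n_xhat, beta, n_iter=200, seed=0, eps=1e-12):
    """Iterate the IB self-consistent equations for a discrete joint p(x, y).

    Returns the encoder p(xhat | x) as an array of shape (|X|, |Xhat|).
    """
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)                      # p(x)
    p_y_given_x = p_xy / (p_x[:, None] + eps)   # p(y | x)

    # Random initial encoder p(xhat | x), rows normalized.
    enc = rng.random((n_x, n_xhat))
    enc /= enc.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_xhat = p_x @ enc                                  # p(xhat)
        p_y_given_xhat = (p_xy.T @ enc) / (p_xhat + eps)    # p(y | xhat), shape (|Y|, |Xhat|)

        # KL[p(y|x) || p(y|xhat)] for every pair (x, xhat).
        kl = np.sum(
            p_y_given_x[:, :, None]
            * (np.log(p_y_given_x[:, :, None] + eps) - np.log(p_y_given_xhat[None, :, :] + eps)),
            axis=1,
        )

        # Eq. (2): p(xhat|x) proportional to p(xhat) * exp(-beta * KL); Z is the row normalizer.
        enc = p_xhat[None, :] * np.exp(-beta * kl)
        enc /= enc.sum(axis=1, keepdims=True)

    return enc

# Example: the toy joint from the previous sketch, compressed to |Xhat| = 2.
p_xy = np.array([[0.30, 0.05],
                 [0.25, 0.05],
                 [0.05, 0.15],
                 [0.05, 0.10]])
print(information_bottleneck(p_xy, n_xhat=2, beta=5.0))
```

Larger values of $\beta$ push the encoder toward preserving $I(\hat{X}; Y)$ at the cost of compression; smaller values collapse $\hat{X}$ toward a trivial summary.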

Figure 2: Information flow in a neural network, $Y \to X \to h_1 \to h_2 \to \cdots \to h_m \to \hat{Y}$. Figure from [5].

A sufficient statistic $T$ for $Y$ derived from $X$ retains all the information needed to estimate $Y$, and a minimal sufficient statistic refines $X$ at least as well as any other statistic. Our goal is to build a minimal sufficient statistic, so the implications of these theorems are clear: we need to keep $T$ a sufficient statistic, and therefore maximize $I(Y; T)$, while maintaining minimality, and therefore minimize $I(X; T)$.

3 The Information Bottleneck in Deep Neural Nets

We now connect the two concepts introduced in the Preliminaries. To start off, we notice that the layered structure of the network generates a Markov chain of intermediate refinements. For instance, in Fig. 2, the input feature vector maps (possibly deterministically) to the random variable $h_1$ on the second layer. The second layer, in turn, maps to the third layer, and so on, until the $m$-th layer maps to the refined random variable $\hat{Y}$. For $i \le j$, $h_i$ is a sufficient statistic for $h_j$, and hence, as information flows through the network, it gets successively refined towards the minimal sufficient statistic if the network parameters and configuration are just right. [5], indeed, postulates that training a neural network should accomplish exactly this. The problem of designing a network architecture, therefore, boils down to finding a path to the optimal mapping between $Y$ and $\hat{Y}$ through the intermediate variables $h_i$.
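This successive-refinement picture can be probed numerically. The following is a toy experiment of our own (not from [5]; the data model, layer widths and sign-based discretization are arbitrary choices): samples from a simple joint $p(x, y)$ are pushed through a small randomly initialized feed-forward network, each hidden layer is coarsely discretized, and $I(Y; h_i)$ is estimated from counts. By the data-processing inequality the true $I(Y; h_i)$ cannot increase with depth, and the plug-in estimates typically show the same trend.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: binary truth Y, observations X drawn around a class-dependent mean.
n, d = 20000, 6
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=2.0 * y[:, None] - 1.0, scale=1.5, size=(n, d))

# A small feed-forward chain X -> h1 -> h2 -> h3 with random weights and tanh activations.
widths = [d, 5, 4, 3]
hs, h = [], x
for w_in, w_out in zip(widths[:-1], widths[1:]):
    W = rng.normal(scale=1.0 / np.sqrt(w_in), size=(w_in, w_out))
    b = rng.normal(scale=0.1, size=w_out)
    h = np.tanh(h @ W + b)      # each layer is a deterministic map of the previous one
    hs.append(h)

def empirical_mi(y, codes):
    """Plug-in estimate of I(Y; T) in nats, where T is a discrete code per sample."""
    n = len(y)
    joint = {}
    for yi, ci in zip(y, codes):
        joint[(int(yi), ci)] = joint.get((int(yi), ci), 0) + 1
    p_y, p_c = {}, {}
    for (yi, ci), cnt in joint.items():
        p_y[yi] = p_y.get(yi, 0.0) + cnt / n
        p_c[ci] = p_c.get(ci, 0.0) + cnt / n
    return sum((cnt / n) * np.log((cnt / n) / (p_y[yi] * p_c[ci]))
               for (yi, ci), cnt in joint.items())

for i, h in enumerate(hs, start=1):
    codes = [tuple(row) for row in (h > 0).astype(int)]   # coarse sign-based discretization
    print(f"I(Y; h{i}) ≈ {empirical_mi(y, codes):.3f} nats")
```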

4 Conclusion

We showed a possible path to formalizing the design of neural networks with the information bottleneck. The concept, taken from the information bottleneck framework, of refining a variable so that it represents the information about a target variable was used for this purpose. We showed the relation between sufficient statistics and the information bottleneck, and argued why the bottleneck objective is a good quantity to optimize.

References

[1] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

[2] Zachary Chase Lipton. Deep learning and the triumph of empiricism. Link, 2015. [Online; accessed 18-November-2015].

[3] Ohad Shamir, Sivan Sabato, and Naftali Tishby. Learning and Generalization with the Information Bottleneck, pages 92–107. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.

[4] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. ArXiv Physics e-prints, April 2000.

[5] N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5, April 2015.