Reading Group on Deep Learning Session 1

Similar documents
Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks

Neural Network Training

Artificial Neural Networks. MGS Lecture 2

NEURAL NETWORKS

y(x n, w) t n 2. (1)

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Machine Learning. 7. Logistic and Linear Regression

Machine Learning Lecture 5

Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications

Feed-forward Networks Network Training Error Backpropagation Applications. Neural Networks. Oliver Schulte - CMPT 726. Bishop PRML Ch.

Feed-forward Network Functions

Linear Models for Classification

Introduction to Neural Networks

Multiclass Logistic Regression

Ch 4. Linear Models for Classification

Reading Group on Deep Learning Session 2

Artificial Neural Networks

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Artificial Neural Networks

Introduction Neural Networks - Architecture Network Training Small Example - ZIP Codes Summary. Neural Networks - I. Henrik I Christensen

CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!!

Neural Networks and Deep Learning

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Pattern Recognition and Machine Learning

Regularization in Neural Networks

Neural Networks (Part 1) Goals for the lecture

Neural Networks and the Back-propagation Algorithm

Machine Learning Lecture 7

LINEAR MODELS FOR CLASSIFICATION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception

April 9, Depto. de Ing. de Sistemas e Industrial Universidad Nacional de Colombia, Bogotá. Linear Classification Models. Fabio A. González Ph.D.

Linear & nonlinear classifiers

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

An Introduction to Statistical and Probabilistic Linear Models

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Lecture 5: Logistic Regression. Neural Networks

Overfitting, Bias / Variance Analysis

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Logistic Regression & Neural Networks

Machine Learning Basics III

Introduction to Neural Networks

Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods

Multilayer Perceptron

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Machine Learning Lecture 10

Course 395: Machine Learning - Lectures

Iterative Reweighted Least Squares

Multi-layer Neural Networks

Logistic Regression. COMP 527 Danushka Bollegala

4. Multilayer Perceptrons

Statistical Machine Learning from Data

Machine Learning Basics Lecture 7: Multiclass Classification. Princeton University COS 495 Instructor: Yingyu Liang

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Bayesian Machine Learning

CSC 578 Neural Networks and Deep Learning

Lecture 12. Neural Networks Bastian Leibe RWTH Aachen

Error Backpropagation

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 6: Multi-Layer Perceptrons I

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

From perceptrons to word embeddings. Simon Šuster University of Groningen

STA 4273H: Statistical Machine Learning

Neural Networks (and Gradient Ascent Again)

Neural Networks, Computation Graphs. CMSC 470 Marine Carpuat

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

Introduction to Machine Learning

Introduction to Convolutional Neural Networks (CNNs)

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

Gaussian and Linear Discriminant Analysis; Multiclass Classification

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

Logistic Regression. Sargur N. Srihari. University at Buffalo, State University of New York USA

Machine Learning Lecture 12

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks

Linear Models for Regression

Artificial Intelligence

Linear Models for Regression

Stochastic gradient descent; Classification

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr

Introduction to Neural Networks

Deep Neural Networks (1) Hidden layers; Back-propagation

Deep Feedforward Networks

CSCI567 Machine Learning (Fall 2018)

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Introduction to Machine Learning

1 Machine Learning Concepts (16 points)

Nonparametric Bayesian Methods (Gaussian Processes)

Bayesian Linear Regression. Sargur Srihari

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Linear Models for Regression. Sargur Srihari

The Perceptron Algorithm

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Deep Feedforward Networks

Artificial Neural Networks

Linear Models for Classification

Transcription:

Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31

Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular and successful techniques known as Deep Learning. Sessions 1 and 2: discussion and analysis of the Chapter 5 about Neural Networks from the book Pattern Recognition and Machine Learning by Christopher M. Bishop (Springer, 2006) Session 1 (2nd of June): 5.1, 5.2, 5.3. and 5.4. Session 2 (10th of June): 5.5, 5.6 and 5.7. The importance of neural networks is that they offer a very powerful and very general framework for representing non-linear mappings from several input variables to several output variables, where the form of the mapping is governed by a number of adjustable parameters. Christopher Bishop 2/31

Chapter Structure Introduction. 5.1. Feed-forward Network Functions. 5.2. Network Training. 5.3. Error Backpropagation. 5.4. The Hessian Matrix. 5.5. Regularization in Neural Networks. 5.6. Mixture Density Networks. 5.7. Bayesian Neural Networks. 3/31

Chapter Structure Introduction. 5.1. Feed-forward Network Functions. 5.2. Network Training. 5.3. Error Backpropagation. 5.4. The Hessian Matrix. 5.5. Regularization in Neural Networks. 5.6. Mixture Density Networks. 5.7. Bayesian Neural Networks. 3/31

Introduction Given a training data set comprising N observations {x n }, where n = 1,..., N, together with corresponding target values {t n }, the goal is to predict the value of t for a new value of x: t = y(x, w) + ɛ (3.7) y(x, w) is the model (w are the parameters). ɛ is the residual error. More generally, from a probabilistic perspective, we aim to model the predictive distribution p(t x) because this expresses our uncertainty about the value of t for each value of x. 4/31

Introduction: the perceptron (Rosenblatt, 1957) Linear discriminant model (Section 4.1.7.): y(x) = f(w T φ(x)) (4.52) φ(x) is a feature vector obtained using a fixed nonlinear transformation of input vector x. w = (w 0,..., w M ) are the model coefficients, or weights. f( ) is a nonlinear activation function (sign function) { +1, a 0 f(a) = 1, a < 0 5/31

Introduction: the perceptron (Rosenblatt, 1957) We are seeking w s.t. x n in class C 1 will have w T φ(x n ) > 0, whereas x n in class C 2 have w T φ(x n ) < 0. Using t { 1, +1}, the perceptron criterion is E P (w) = n M w T φ(x n )t n (4.54) where M denotes the set of all misclassified patterns. We now apply the stochastic gradient descent algorithm to minimize this error function. The change in the weight vector w is given by w (τ+1) = w (τ) η E P (w) = w (τ) + ηφ(x n )t n (4.55) 6/31

Introduction: the perceptron (Rosenblatt, 1957) Perceptron convergence theorem: if the training data set is linearly separable, then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Problems: Learning algorithm: Linearly separable: there may be many solutions, and which one is found will depend on the parameters initialization and on the order of presentation of the data points. Not linearly separable: the algorithm will never converge. Does not provide probabilistic outputs. Does not generalize readily to K > 2 classes. It is based on linear combinations of fixed basis functions. A closely related system called the adaline ( adaptive linear element ) was also presented by Widrow and Hoff (1960). 7/31

Chapter Structure Introduction. 5.1. Feed-forward Network Functions. 5.2. Network Training. 5.3. Error Backpropagation. 5.4. The Hessian Matrix. 5.5. Regularization in Neural Networks. 5.6. Mixture Density Networks. 5.7. Bayesian Neural Networks. 8/31

Chapter Structure Introduction. 5.1. Feed-forward Network Functions. 5.2. Network Training. 5.3. Error Backpropagation. 5.4. The Hessian Matrix. 5.5. Regularization in Neural Networks. 5.6. Mixture Density Networks. 5.7. Bayesian Neural Networks. 8/31

5.1. Activations The basic neural network model can be described as a series of functional transformations. Construct M linear combinations of the inputs x 1,..., x D : a (1) j = D i=1 w (1) ji x i + w (1) j0 (5.2) a (1) j are the layer one activations, j = 1,..., M. w (1) ji are the layer one weights, i = 1... D. w (1) j0 the data. are the layer one biases, that allow for any fixed offset in Each linear combination a (1) j is transformed by a (nonlinear, differentiable) activation function: z j = h(a (1) j ) (5.3) 9/31

5.1. Activations The nonlinear functions h( ) are generally chosen to be sigmoidal functions such as the logistic sigmoid or the tanh. In modern neural networks, the default recommendation is to use the rectified linear unit or ReLU [as activation function] (Chapter 6, Deep Learning by Goodfellow et al, 2016) Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units. ( ImageNet Classification with Deep Convolutional Neural Networks by Krizhevsky et al, NIPS 2012) 10/31

5.1. Output Activations The hidden outputs z j = h(a j ) are linearly combined in layer two: a (2) k = M j=1 w (2) kj z j + w (2) k0 (5.4) a (2) k are the layer two output activations, k = 1,..., K. w (2) kj are the layer two weights, j = 1... D. w (2) k0 are the layer two biases. The output activations a (2) k are transformed by the output activation function: y k = σ(a (2) k ) (5.5) y k are the final outputs. σ( ) can be, like h( ), a logistic sigmoid function. 11/31

5.1. The Complete Two-Layer Model The model y k = σ(a k ) is, after substituting the definitions of a j and a k : Regression problems, the activation function σ( ) is the identity so that y k = a k. Classification problems, each output unit activation maps to a posterior probability (e.g. using a logistic sigmoid function). Evaluation of (5.9) is called forward propagation. 12/31

5.1. Network Diagram hidden units x D w (1) MD z M w (2) KM y K inputs outputs y 1 x 1 x 0 z 1 w (2) 10 z 0 Figure: 5.1 Nodes are input, hidden and output units. Links are corresponding weights. Fig. 5.1 displays a two-layer network: w (1) is the first layer and w (2) is the second one. Information propagates forwards from the explanatory variable x to the estimated response y k (x, w). 13/31

5.1. Properties & Generalizations If all hidden units have linear h( ) then we can always find an equivalent network without hidden units. Individual units do not need to be fully connected to the next layer and their links may skip over one or more subsequent layers. Networks with two or more layers are universal approximators: Any continuous function can be uniformly approximated to arbitrary accuracy, given enough hidden units. This is true for many definitions of h( ), but excluding polynomials. The key problem is how to find suitable parameter values given a set of training data. There may be symmetries in the weight space, meaning that different choices of w may define the same mapping from input to output. 14/31

Chapter Structure Introduction. 5.1. Feed-forward Network Functions. 5.2. Network Training. 5.3. Error Backpropagation. 5.4. The Hessian Matrix. 5.5. Regularization in Neural Networks. 5.6. Mixture Density Networks. 5.7. Bayesian Neural Networks. 15/31

Chapter Structure Introduction. 5.1. Feed-forward Network Functions. 5.2. Network Training. 5.3. Error Backpropagation. 5.4. The Hessian Matrix. 5.5. Regularization in Neural Networks. 5.6. Mixture Density Networks. 5.7. Bayesian Neural Networks. 15/31

5.2. Network Training Given a training set comprising a set of input vectors {x n }, where n = 1,..., N, together with a corresponding set of target vectors {t n }, we minimize the error function E(w) = 1 2 N 2 y(x n, w) t n n=1 (5.11) The aim is to minimize the residual error between y(x n, w) and t n. We can provide a much more general view of network training by first giving a probabilistic interpretation to the network outputs. 16/31

5.2. Network Training A single target variable t has a Gaussian distribution with an x-dependent mean, given by the output of the neural network: p(t x, w) = N ( t y(x, w), β 1) (5.12) where β is the precision (inverse variance) of the Gaussian noise. Given N independent and identically distributed observations, we can construct the corresponding likelihood function ( ) N p {t 1,..., t N } {x 1,..., x N }, w, β = p(t n x n, w, β) n=1 Taking the negative logarithm, we obtain the error function β 2 N { } 2 N y(x n, w) t n 2 log β + N log 2π (5.13) 2 n=1 which can be used to learn the parameters w and β. 17/31

5.2. Maximum Likelihood w A widely used frequentist estimator is maximum likelihood, in which w is set to the value that maximizes the likelihood function p(d w): p(w D) = p(d w)p(w) p(d) posterior likelihood prior (1.43) Maximizing the likelihood function is equivalent to minimizing the sum-of-squares error function given by E(w) = 1 2 N { } 2 y(x n, w) t n (5.14) n=1 where we have discarded additive and multiplicative constants. The maximum-likelihood estimate of w can be obtained by minimizing E(w): w ML = min E(w) w 18/31

5.2. Maximum Likelihood β Having obtained the ML parameter estimate w ML, the precision, β can also be estimated. E.g. if the N observations are IID, then their joint probability is ( ) N p {t 1,..., t N } {x 1,..., x N }, w, β = p(t n x n, w, β) The negative log-likelihood, in this case, is n=1 log p = βe(w ML ) N 2 log β + N 2 log 2π (5.13) The derivative d/dβ is E(w ML ) N 2β and so 1 β ML = 1 N 2E(w ML) = 1 N N { } 2 y(x n, w ML ) t n (5.15) n=1 And 1/β ML = 1 NK 2E(w ML) for K target variables. 19/31

5.2. Pairing of the error function and output units h( ) There is a natural choice of both the output unit activation function and matching error function, according to the type of problem being solved. Regression p(t x, w) = N ( t ) y(x, w), β 1 Output activation function: identity Error function: sum-of-squares error (see Eq. 5.14) Binary Classification p(t x, w) = y(x, w) t {1 y(x, w)} 1 t Output activation function: logistic sigmoid, with 0 y(x, w) 1. Error function: cross-entropy error N E(w) = {t n ln y n + (1 t n ) ln(1 y n )} (5.21) n=1 where y n denotes y(x n, w) 20/31

5.2. Pairing of the error function and output units h( ) Multiclass Classification K p(t x, w) = k=1 y k(x, w) t k {1 y k (x, w)} 1 t k Output activation function: softmax (or normalized exponential) y k (x, w) = exp(a(2) k (x, w)) j exp(a(2) j (x, w)) (5.25) which satisfies 0 y k 1 and k y k = 1. Error function: multiclass cross-entropy error N K E(w) = t nk ln y k (x n, w) (5.24) n=1 k=1 21/31

5.2. Error Surface The residual error E(w) can be visualized as a surface in the weight-space: E(w) w A w B wc w 1 w 2 E Figure: 5.5 The error will, in practice, be highly multimodal. There will be inequivalent minima (local minima), determined by the particular data and model, as well as equivalent minima, corresponding to weight-space symmetries. 22/31

5.2. Parameter Optimization Our goal: So we want to solve: w = argmin(e(w)). E(w) = 0 (5.26) Iterative search for a local minimum of the error: w (τ+1) = w (τ) + w (τ) (5.27) τ is the time-step. w (τ) is the weight-vector update. vector E(w) points in the direction of greatest rate of increase of the error function. 23/31

5.2. Gradient Descent The simplest approach is to update w by a displacement in the negative gradient direction. w (τ+1) = w (τ) η E ( w (τ)) (5.41) where η > 0 is the learning rate. This is a batch method, as evaluation of E involves the entire training data set. Stochastic methods can be used. Only a part of the training set is used at each iteration. Conjugate gradients or quasi-newton methods may be preferred because they have the property that the error function always decreases at each iteration. Many initializations can be tested. 24/31

5.2. Optimization Scheme Each iteration of the descent algorithm has two stages: I. Evaluate derivatives of error with respect to weights. An efficient method for the evaluation of E(w) is needed. It involves backpropagation of error though the network. II. Use derivatives to compute adjustments of the weights (e.g. steepest descent). Backpropagation is a general principle, which can be applied to many types of network and error function. 25/31

5.3. Simple Backpropagation The error function is, typically, a sum over the data points E(w) = N n=1 E n(w). For example, consider a linear model (1 layer). y k = w ki x i (5.45) i The error function, for an individual input x n, is E n = 1 (y nk t nk ) 2, where y nk = y k (x n, w). (5.46) 2 k The gradient with respect to a weight w ji is E n w ji = (y nj t nj ) x ni (5.47) 26/31

5.3. General Backpropagation Recall that, in general, each unit computes a weighted sum: a j = i w ji z i with activation z j = h(a j ). (5.48,5.49) For each error-term: E n w ji = E n a j }{{} δ j a j w ji }{{} z i = δ j z i (5.50,5.53) In the network: δ j E n a j = k E n a k a k a j where j {k} (5.55) Algorithm: δ j = k δ k w kj h (a j ) as a k a j = a k z j z j a j (5.56) 27/31

5.3. Backpropagation Algorithm Forward pass: Apply input x, and forward propagate to find all the a i and z i Back propagate the δ s to obtain a δ j for each hidden unit. To do so, we first evaluate δ k directly for the output units and then propagate them with δ j = h (a j ) k w kjδ k. z i w ji δ j wkj δ k Evaluate the derivatives En w ji = δ j z i. Update w by a displacement in the negative gradient direction: w (τ+1) = w (τ) η E ( w (τ)) (5.41) z j δ 1 28/31

5.3. Computational Efficiency The back-propagation algorithm is computationally more efficient than standard numerical minimization of E n. Suppose that W is the total number of weights and biases in the network. Backpropagation: The evaluation is O(W ) for large W, as there are many more weights than units. Standard approach: Perturb each weight, and compute E n ( w ) E ( w + (0,..., 0, wij, 0,..., 0, ) ). This requires W O(W ) computations, so the total complexity is O(W 2 ). 29/31

5.3. Jacobian Matrix The properties of the network can be investigated via the Jacobian J ki = y k x i (5.70) For example, (small) errors can be propagated through the trained network: y k y k x i x i (5.72) This is useful, but costly, as J ki itself depends on x. However, note that y k = y k a j = y k w ji (5.74) x i a j j x i a j j }{{} δ kj As before, we can show δ kj = h (a j ) l w lj δkl 30/31

5.4. Hessian Matrix ( ) 2 E H = w ji w lk (5.78) Backpropagation principle can be used to get: Diagonal approximation of H in O(W ) Approximation of the Hessian in O(W 2 ) Exact computation of the Hessian in O(W 2 ) Compute v T H in O(W ) 31/31