Neural Networks (and Gradient Ascent Again)

Frank Wood, April 27, 2010

Generalized Regression

Until now we have focused on linear regression techniques. We generalized linear regression to include nonlinear functions of the inputs; we called these features. The resulting regression model remained linear in the parameters, i.e.

    y(x, w) = f\left( \sum_{j=1}^{M} w_j \phi_j(x) \right)

where f() is the identity or is invertible, such that a transform of the target vector t = [t_1, ..., t_n] can be employed. (y is the unknown function, t are the observed targets, \phi_j() is a feature.) Our goal has been to learn w. We've done this using least squares, or penalized least squares in the case of MAP estimation.

Fancy f()'s

What if f() is not invertible? Then what? We can't use transformations of t. Today (to start):

    \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}

tanh regression (like logistic regression)

For pedagogical purposes assume that tanh() can't be inverted, or that we observe targets t_n \in \{-1, +1\} (note: not continuous valued!). Let's consider a regression(/classification) function

    y(x_n, w) = \tanh(x_n w)

where w is a parameter vector and x_n is a vector of inputs (potentially features). For each input x_n we have an observed output t_n which is either minus one or plus one. We are interested in the general case of how to learn parameters for such models.

tanh regression (like logistic regression)

Further, we will use the error that you are familiar with, namely the squared error. So, given a matrix of inputs X = [x_1 ... x_n] and a collection of output labels t = [t_1 ... t_n], we consider the following squared error function

    E(X, t, w) = \frac{1}{2} \sum_n (t_n - y(x_n, w))^2

We are interested in minimizing the error of our regressor/classifier. How do we do this?

Error minimization

If we want to minimize

    E(X, t, w) = \frac{1}{2} \sum_n (t_n - y(x_n, w))^2

w.r.t. w, we should start by deriving gradients and trying to find places where they vanish.

(Figure: the error surface E(w) over weight space (w_1, w_2), with stationary points w_A, w_B, w_C; taken from PRML, Bishop 2006.)

Error gradient w.r.t. w

The gradient is

    \nabla_w E(X, t, w) = \frac{1}{2} \nabla_w \sum_n (t_n - y(x_n, w))^2 = -\sum_n (t_n - y(x_n, w)) \nabla_w y(x_n, w)

A useful fact to know about tanh() is that

    \frac{d \tanh(a)}{db} = (1 - \tanh(a)^2) \frac{da}{db}

which makes it straightforward to complete the last line of the gradient computation for the choice y(x_n, w) = \tanh(x_n w), namely

    \nabla_w E(X, t, w) = -\sum_n (t_n - y(x_n, w)) (1 - \tanh(x_n w)^2) x_n
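As a sanity check, the gradient above fits in a few lines of code. Here is a minimal Matlab/Octave sketch, assuming X is an n-by-d matrix with one input per row, t an n-by-1 vector of +/-1 labels, and w a d-by-1 weight vector (all names are illustrative):

    y     = tanh(X*w);                      % n-by-1 predictions y(x_n, w)
    E     = 0.5 * sum((t - y).^2);          % squared error E(X, t, w)
    gradE = -X' * ((t - y) .* (1 - y.^2));  % d-by-1 gradient from the formula above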

Solving

It is clear that algebraically solving

    \nabla_w E(X, t, w) = -\sum_n (t_n - y(x_n, w)) (1 - \tanh(x_n w)^2) x_n = 0

for all the entries of w will be troublesome, if not impossible. This is OK, however, because we don't always have to find an analytic solution that directly gives us the value of w. We can arrive at its value numerically.

Calculus 101

Even simpler: consider numerically minimizing the function y = (x - 3)^2 + 2. How do you do this? Hint: start at some value x_0, say x_0 = -3, and use the gradient to walk towards the minimum.

Calculus 101

The gradient of y = (x - 3)^2 + 2 (or derivative w.r.t. x) is \nabla_x y = 2(x - 3). Consider the sequence

    x_n = x_{n-1} - \lambda \nabla_{x_{n-1}} y

It is clear that if \lambda is small enough, this sequence will converge: \lim_{n \to \infty} x_n = 3. There are several important caveats worth mentioning here:

- If \lambda (called the learning rate) is set too high, this sequence might oscillate.
- Worse yet, the sequence might diverge.
- If the function has multiple minima (and/or saddles), this procedure is not guaranteed to converge to the minimum value.
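To make this concrete, here is a minimal Matlab/Octave sketch of the descent on y = (x - 3)^2 + 2 (the starting point, learning rate, and iteration count are illustrative choices):

    x = -3;                           % starting guess x_0
    lambda = 0.1;                     % learning rate
    for n = 1:100
        x = x - lambda * 2*(x - 3);   % x_n = x_{n-1} - lambda * dy/dx
    end
    disp(x)                           % very close to the minimizer, 3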

Arbitrary error gradients

This is true for any function that one would like to minimize. For instance, we are interested in minimizing the prediction error E(X, t, w) in our logistic-regression-like example, where the gradient we computed is

    \nabla_w E(X, t, w) = -\sum_n (t_n - y(x_n, w)) (1 - \tanh(x_n w)^2) x_n

So, starting at some value of the weights w_0, we can construct and follow a sequence of guesses until convergence:

    w_n = w_{n-1} - \lambda \nabla_{w_{n-1}} E(X, t, w)

Arbitrary error gradients

Convergence of a procedure like

    w_n = w_{n-1} - \lambda \nabla_{w_{n-1}} E(X, t, w)

can be assessed in multiple ways:

- The norm of the gradient becomes sufficiently small.
- The function value changes sufficiently little from one step to the next.
- Etc.
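Putting the update rule and these stopping criteria together, a batch gradient-descent loop for the tanh regression model might look like the following minimal Matlab/Octave sketch (the learning rate, tolerance, and iteration cap are illustrative assumptions):

    w = zeros(size(X,2), 1);                    % initial guess w_0
    lambda = 0.01;  tol = 1e-6;  Eold = Inf;
    for iter = 1:10000
        y = tanh(X*w);
        E = 0.5 * sum((t - y).^2);
        gradE = -X' * ((t - y) .* (1 - y.^2));
        w = w - lambda * gradE;                 % w_n = w_{n-1} - lambda * grad
        if norm(gradE) < tol || abs(Eold - E) < tol
            break;                              % either stopping rule fires
        end
        Eold = E;
    end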

Gradient Min(Max)imization

There are several other important points worth mentioning here, and avenues for further study:

- If the objective function is convex, such learning strategies are guaranteed to converge to the global optimum. Special techniques for convex optimization exist (e.g. Boyd and Vandenberghe, http://www.stanford.edu/~boyd/cvxbook/).
- If the objective function is not convex, multiple restarts of the learning procedure should be performed to ensure reasonable coverage of the parameter space. Even if the objective is not convex, it might be worth the computational cost of restarting multiple times to arrive at a good set of parameters.
- The sum-over-observations nature of the gradient calculation makes online learning feasible.
- More (much more) sophisticated gradient search algorithms exist, particularly ones that make use of the curvature of the underlying function.

Example - Data for tanh regression/classification

Figure: Data in {+1, -1}. Generative model:

    n = 100;
    x = [rand(n,1) rand(n,1)]*20;
    y = x*[-2; 4] > 2;
    y = y + (y==0)*-1;

Example - Result from Learning

Figure: learned regression surface. Run logistic regression/tanh regression.m.

Two more hints

1. Even analytic gradients are not required!
2. (Good) software exists to let you minimize whatever function you want to minimize (Matlab: fminunc).

For both, note the following. The definition of a derivative (gradient) is

    \frac{df(x)}{dx} = \lim_{\delta \to 0} \frac{f(x + \delta) - f(x)}{\delta}

but it can be approximated quite well by a fixed small choice of \delta, i.e.

    \frac{df(x)}{dx} \approx \frac{f(x + 0.00000001) - f(x)}{0.00000001}

This means that learning algorithms can be implemented on a computer given nothing but the objective function to minimize!
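A minimal Matlab/Octave sketch of that finite-difference trick, approximating the gradient of an arbitrary objective f at w one coordinate at a time (the function name and default step size are illustrative assumptions):

    function g = numgrad(f, w, delta)
        % Forward-difference approximation to the gradient of f at w.
        if nargin < 3, delta = 1e-8; end
        g  = zeros(size(w));
        f0 = f(w);
        for i = 1:numel(w)
            wp = w;  wp(i) = wp(i) + delta;
            g(i) = (f(wp) - f0) / delta;   % (f(w + delta*e_i) - f(w)) / delta
        end
    end

Pairing numgrad with the descent loop above, or handing the objective directly to fminunc, yields a learner that needs nothing but the error function.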

Neural Networks

It is from this perspective that we will approach neural networks. A general two-layer feedforward neural network is given by:

    y_k(x, w) = \sigma\left( \sum_{j=0}^{M} w^{(2)}_{kj} h\left( \sum_{i=0}^{D} w^{(1)}_{ji} x_i \right) \right)

Given what we have just covered, if given a set of targets t = [t_1 ... t_n] and a set of inputs X = [x_1 ... x_n], one should straightforwardly be able to learn w (the set of all weights w_{kj} and w_{ji} for all combinations kj and ji) for any choice of \sigma() and h().

Neural Networks

Neural networks arose from trying to create mathematical simplifications or representations of the kind of processing units used in our brains. We will not consider their biological feasibility; instead we will focus on a particular class of neural network, the multi-layer perceptron, which has proven to be of great practical value in both regression and classification settings.

Neural Networks

To start, here is a list of important features and caveats:

1. Neural networks are universal approximators, meaning that a two-layer network with linear outputs can uniformly approximate any continuous function on a compact input domain to arbitrary accuracy, provided that the network has a sufficiently large number of hidden units [Bishop, PRML, 2006].
2. but... How many hidden units?
3. Generally the error surface, as a function of the weights, is non-convex, leading to a difficult and tricky optimization problem.
4. The internal mechanism by which the network represents the regression relationship is not usually examinable or testable in the way that linear regression models are. I.e., what's the meaning of a statement like "the 95% confidence interval for the i-th hidden unit weight is [0.2, 0.4]"?

Neural network architecture

(Figure: a two-layer feed-forward network with inputs x_0, ..., x_D, hidden units z_0, ..., z_M, outputs y_1, ..., y_K, and weights w^{(1)}_{MD}, w^{(2)}_{KM}, w^{(2)}_{10}; taken from PRML, Bishop 2006.)

Neural Networks

The specific neural network we will consider is a univariate regression network where there is one output node, the output nonlinearity is set to the identity \sigma(x) = x, and the only remaining nonlinearity is the hidden-layer h(a), which we will choose to be h(a) = \tanh(a). So

    y_k(x, w) = \sigma\left( \sum_{j=0}^{M} w^{(2)}_{kj} h\left( \sum_{i=0}^{D} w^{(1)}_{ji} x_i \right) \right)

simplifies to

    y(x, w) = \sum_{j=0}^{M} w^{(2)}_j h\left( \sum_{i=0}^{D} w^{(1)}_{ji} x_i \right)

Note that the bias nodes x_0 = 1 and z_0 = 1 are included in this notation.
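As a sketch, the forward pass of this univariate network is only a handful of Matlab/Octave lines (function and variable names are illustrative; W1 is M-by-(D+1), w2 is (M+1)-by-1, with biases handled via x_0 = z_0 = 1 as above):

    function [y, z] = forward(x, W1, w2)
        % x: D-by-1 input; W1: first-layer weights; w2: second-layer weights.
        a = W1 * [1; x];     % first-layer pre-activations a_j
        z = tanh(a);         % hidden activations z_j = h(a_j)
        y = w2' * [1; z];    % identity output nonlinearity
    end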

Representational Power

(Figure: four regression functions learned using a linear/tanh neural network with three hidden units; hidden unit activations shown in the background colors. Taken from PRML, Bishop 2006.)

Neural Network Training

Given a set of input vectors \{x_n\}, n = 1, ..., N, and a set of target vectors \{t_n\} (taken here to be univariate), we wish to minimize the error function

    E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2

In this example we will assume that t \in \mathbb{R} and that t_n is Gaussian distributed with mean a function of x:

    P(t_n | x, w, \beta) = \mathcal{N}(t_n | y(x_n, w), \beta^{-1})

which means that the targets are jointly distributed according to

    P(t | X, w, \beta) = \prod_{n=1}^{N} P(t_n | x_n, w, \beta)

Neural Network Training and Prediction

If we take the negative logarithm of the likelihood

    P(t | X, w, \beta) = \prod_{n=1}^{N} P(t_n | x_n, w, \beta)

we arrive at

    -\ln P(t | X, w, \beta) = \frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln(2\pi)

which we can minimize by first minimizing w.r.t. w and then \beta. Given trained values w_{ML} and \beta_{ML}, prediction is straightforward.
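For completeness, the \beta step is available in closed form once w_{ML} is fixed; setting the derivative of the negative log likelihood w.r.t. \beta to zero gives the standard maximum-likelihood variance estimate (cf. PRML, Bishop 2006):

    \frac{\partial}{\partial \beta}\left[ \frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 - \frac{N}{2} \ln \beta \right] = 0
    \quad \Longrightarrow \quad
    \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n, w_{ML}) - t_n \}^2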

Neural Network Training, Gradient Ascent

We therefore are interested in minimizing (in the case of continuous valued, univariate, neural network regression)

    E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2

where

    y(x, w) = \sum_{j=0}^{M} w^{(2)}_j h\left( \sum_{i=0}^{D} w^{(1)}_{ji} x_i \right)

which we can perform numerically if we have gradient information:

    w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E(w^{(\tau)})

where \eta is a learning rate and \nabla E(w^{(\tau)}) is the gradient of the error function.

Back Propagation

While numeric gradient computation can be used to estimate the gradient and thereby adjust the weights of the neural net, doing so is not very efficient. A more efficient, if slightly more confusing, method of computing the gradient is backpropagation. Back propagation is a fancy term for a dynamic-programming-like way of computing the gradient by running computations backwards on the network.

Back Propagation

To perform back propagation we need to identify several intermediate variables, the first of which is

    a_j = \sum_i w_{ji} z_i

where a_j is a weighted sum of the inputs to a particular unit (hidden or otherwise). The activation z_j of a unit is given by

    z_j = h(a_j)

where in our example h(a) = \tanh(a). Here j could be an output.

Back Propagation

What we are interested in computing efficiently is \frac{dE}{dw_{ji}} for

    E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2

where we will focus on the individual contribution from a single training input/output pair,

    \frac{dE_n}{dw_{ji}}

realizing that the final gradient is the sum of all of the individual gradients. Note: stochastic gradient descent approximates the gradient using a single point (or a small group of points) at a time.
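That per-example decomposition is what makes stochastic updates possible: pick one pair, compute its gradient contribution, and step. A minimal Matlab/Octave sketch for the earlier tanh regressor (the learning rate lambda is an illustrative assumption, as before):

    n  = randi(size(X,1));                       % pick a random training pair
    yn = tanh(X(n,:)*w);
    gn = -(t(n) - yn) * (1 - yn^2) * X(n,:)';    % gradient of E_n alone
    w  = w - lambda * gn;                        % one stochastic step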

Back Propagation: Reuse of computation

Our goal is to reuse computation as much as possible. We will do this by constructing a backward-chaining set of partial derivatives that use computation closer to the output nodes in the calculation of the gradients of the error w.r.t. weights closer to the input nodes. To start, note that E_n depends on the weights w_{ji} only through the input a_j to unit j (here the unit is again an arbitrary unit). For this reason we can apply the chain rule to give

    \frac{dE_n}{dw_{ji}} = \frac{dE_n}{da_j} \frac{da_j}{dw_{ji}}

We will denote \frac{dE_n}{da_j} = \delta_j. The \delta_j's are going to be the quantities we use for dynamic programming.

Back Propagation: Reuse of computation

If we remember that a_j = \sum_i w_{ji} z_i and use our new notation \frac{dE_n}{da_j} = \delta_j, we can re-write

    \frac{dE_n}{dw_{ji}} = \frac{dE_n}{da_j} \frac{da_j}{dw_{ji}}

as

    \frac{dE_n}{dw_{ji}} = \delta_j z_i

This is almost it!

Back Propagation: Reuse of computation

For the output layer we have

    \frac{dE_n}{da_k} = \frac{d}{da_k} \frac{1}{2} (a_k - t_k)^2 = y_k - t_k

From this we can compute the gradient with respect to all of the weights leading into the output layer, simply using

    \frac{dE_n}{dw_{ki}} = \delta_k z_i

where i ranges over the hidden layer closest to the output layer and the z_i are the activations of that layer. What remains is to figure out how to use the precomputed \delta's to compute the gradients at all remaining hidden layers, back to the input nodes. For this we need the chain rule from calculus again.

Back Propagation: Reuse of computation

We want a way of computing \delta_j, the error term for an arbitrary hidden unit, as a function of the weights and the error terms already computed closer to the output node(s). The definition of \delta_j is

    \delta_j = \frac{dE_n}{da_j}

When node j is connected to nodes k, k = 1, ..., K, the following is true: the error is a function of the activations at all K nodes, and the activation at each of these nodes is a function of the activation at node j. That means the following is true (by the chain rule for differentiation):

    \frac{dE_n}{da_j} = \sum_k \frac{dE_n}{da_k} \frac{da_k}{da_j}

but we have already computed \delta_k = \frac{dE_n}{da_k} for the nodes closer to the output!

Back Propagation: Reuse of computation

To summarize, we know

    \frac{dE_n}{da_j} = \sum_k \delta_k \frac{da_k}{da_j}

We also know a_k = \sum_j w_{kj} z_j and z_j = h(a_j). This means that

    \delta_j = \frac{dE_n}{da_j} = \sum_k \delta_k h'(a_j) w_{kj} = h'(a_j) \sum_k \delta_k w_{kj}

So we can compute the \delta's backwards, using only information local to each unit. Further, we know that

    \frac{dE_n}{dw_{ji}} = \delta_j z_i

which is the error at the output side times the activation on the input side. Since the activations are computed on the forward pass and the errors are computed on the backwards pass, we are done!

Back Propagation: Full procedure

Error Backpropagation. Repeat for all input/output pairs:

1. Propagate activations forward through the network for an input x_n.
2. Compute the \delta's for all the units, starting at the output layer and proceeding backwards through the network.
3. Compute the contribution to the gradient for that single input (and sum it into the global gradient computation).
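Following those three steps, a hedged end-to-end sketch for the univariate tanh network, reusing the forward routine sketched earlier (shapes as before; one training pair (x, t)):

    [y, z] = forward(x, W1, w2);                        % step 1: forward pass
    delta_out = y - t;                                  % step 2: output delta dE_n/da_k
    delta_hid = (1 - z.^2) .* (w2(2:end) * delta_out);  %         hidden deltas h'(a_j) * sum_k w_kj delta_k
    gW2 = delta_out * [1; z];                           % step 3: dE_n/dw, second layer
    gW1 = delta_hid * [1; x]';                          %         dE_n/dw, first layer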

Conclusion

Neural networks are a powerful tool for regression analysis. They are not without (significant) downsides: they lack interpretability, they can be difficult to learn, and the model selection issues that arise in any regression problem don't go away. Further treatments of neural networks include different activation functions, multivalued outputs, classification, and Bayesian neural networks. Simple trailing question: what would MAP estimation of a neural network look like (with standard weight-decay regularization)?