Vasil Khalidov & Miles Hansard. C.M. Bishop's PRML: Chapter 5; Neural Networks


Introduction

The aim is, as before, to find useful decompositions of the target variable:

    t(x) = y(x, w) + \epsilon(x)    (3.7)

t(x_n) and x_n are the observations, n = 1, ..., N. \epsilon(x) is the residual error.

Linear Models

For example, recall the (Generalized) Linear Model:

    y(x, w) = f( \sum_{j=0}^{M} w_j \phi_j(x) )    (5.1)

\phi = (\phi_0, ..., \phi_M) is the fixed model basis.
w = (w_0, ..., w_M) are the model coefficients.
For regression: f(·) is the identity.
For classification: f(·) maps to a posterior probability.

Feed-Forward Networks

Feed-forward neural networks generalize the linear model

    y(x, w) = f( \sum_{j=0}^{M} w_j \phi_j(x) )    (5.1 again)

The basis itself, as well as the coefficients w_j, will be adapted. Roughly: the principle of (5.1) will be used twice; once to define the basis, and once to obtain the output.

Activations

Construct M linear combinations of the inputs x_1, ..., x_D:

    a_j = \sum_{i=0}^{D} w^{(1)}_{ji} x_i    (5.2)

a_j are the activations, j = 1, ..., M.
w^{(1)}_{ji} are the layer-one weights, i = 1, ..., D.
w^{(1)}_{j0} are the layer-one biases.

Each linear combination a_j is transformed by a (nonlinear, differentiable) activation function:

    z_j = h(a_j)    (5.3)

Output Activations

The hidden outputs z_j = h(a_j) are linearly combined in layer two:

    a_k = \sum_{j=0}^{M} w^{(2)}_{kj} z_j    (5.4)

a_k are the output activations, k = 1, ..., K.
w^{(2)}_{kj} are the layer-two weights, j = 1, ..., M.
w^{(2)}_{k0} are the layer-two biases.

The output activations a_k are transformed by the output activation function:

    y_k = \sigma(a_k)    (5.5)

y_k are the final outputs. \sigma(·) is, like h(·), a sigmoidal function.

The Complete Two-Layer Model

The model y_k = \sigma(a_k) is, after substituting the definitions of a_j and a_k:

    y_k(x, w) = \sigma( \sum_{j=0}^{M} w^{(2)}_{kj} h( \sum_{i=0}^{D} w^{(1)}_{ji} x_i ) )    (5.9)

h(·) and \sigma(·) are sigmoidal functions, e.g. the logistic function

    s(a) = \frac{1}{1 + \exp(-a)},    s(a) \in [0, 1].

If \sigma(·) is the identity, then a regression model is obtained. Evaluation of (5.9) is called forward propagation.
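To make the forward pass concrete, here is a minimal NumPy sketch of (5.9). The bias-in-column-0 weight layout (so that x_0 = z_0 = 1) and the choice of the logistic function for both h(·) and σ(·) are assumptions made for this illustration, not part of the slides.

```python
import numpy as np

def logistic(a):
    """Logistic sigmoid s(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, W2, h=logistic, sigma=logistic):
    """Forward propagation through the two-layer model (5.9).

    x  : input vector of length D
    W1 : (M, D+1) layer-one weights, column 0 holding the biases w_{j0}
    W2 : (K, M+1) layer-two weights, column 0 holding the biases w_{k0}
    """
    x_tilde = np.concatenate(([1.0], x))   # prepend x_0 = 1 for the bias
    a = W1 @ x_tilde                       # activations a_j           (5.2)
    z = np.concatenate(([1.0], h(a)))      # hidden units z_j, z_0 = 1 (5.3)
    a_out = W2 @ z                         # output activations a_k    (5.4)
    return sigma(a_out)                    # outputs y_k               (5.5)

# Example: D = 3 inputs, M = 4 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 4))               # (M, D+1)
W2 = rng.normal(size=(2, 5))               # (K, M+1)
y = forward(rng.normal(size=3), W1, W2)
```

Passing sigma=lambda a: a gives the identity output, i.e. the regression variant mentioned above.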

Network Diagram

The approximation process can be represented by a network (Figure 5.1): the inputs x_0, ..., x_D feed the hidden units z_0, ..., z_M via the weights w^{(1)}, and the hidden units feed the outputs y_1, ..., y_K via the weights w^{(2)}.

Nodes are input, hidden and output units. Links are the corresponding weights. Information propagates forwards from the explanatory variable x to the estimated response y_k(x, w).

Properties & Generalizations

Typically K ≤ D ≤ M, which means that the network is redundant if all h(·) are linear.
There may be more than one layer of hidden units.
Individual units need not be fully connected to the next layer.
Individual links may skip over one or more subsequent layers.
Networks with two (cf. 5.9) or more layers are universal approximators: any continuous function can be uniformly approximated to arbitrary accuracy, given enough hidden units. This is true for many definitions of h(·), but excluding polynomials.
There may be symmetries in the weight space, meaning that different choices of w may define the same mapping from input to output.

Maximum Likelihood Parameters

The aim is to minimize the residual error between y(x_n, w) and t_n. Suppose that the target is a scalar-valued function, which is normally distributed around the estimate:

    p(t | x, w) = N(t | y(x, w), \beta^{-1})    (5.12)

Then it will be appropriate to consider the sum-of-squares error

    E(w) = \frac{1}{2} \sum_{n=1}^{N} ( y(x_n, w) - t_n )^2    (5.14)

The maximum-likelihood estimate of w can be obtained by (numerical) minimization:

    w_{ML} = \arg\min_w E(w)
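A direct translation of (5.14) into NumPy; the generic y_fn signature is an assumption used to keep the sketch model-agnostic.

```python
import numpy as np

def sum_of_squares_error(y_fn, w, X, t):
    """Sum-of-squares error E(w), eq. (5.14).

    y_fn : model, y_fn(x, w) -> prediction for one input x
    X    : (N, D) array of inputs x_n
    t    : (N,) array of targets t_n
    """
    residuals = np.array([y_fn(x, w) - t_n for x, t_n in zip(X, t)])
    return 0.5 * np.sum(residuals ** 2)
```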

Maximum Likelihood Precision

Having obtained the ML parameter estimate w_{ML}, the precision \beta can also be estimated. E.g. if the N observations are IID, then their joint probability is

    p( \{t_1, ..., t_N\} | \{x_1, ..., x_N\}, w, \beta ) = \prod_{n=1}^{N} p(t_n | x_n, w, \beta)

The negative log-likelihood, in this case, is

    -\log p = \beta E(w_{ML}) - \frac{N}{2} \log \beta + \frac{N}{2} \log 2\pi    (5.13)

The derivative with respect to \beta is E(w_{ML}) - \frac{N}{2\beta}, and setting it to zero gives

    \frac{1}{\beta_{ML}} = \frac{2}{N} E(w_{ML})    (5.15)

and \frac{1}{\beta_{ML}} = \frac{2}{NK} E(w_{ML}) for K target variables.
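A small sketch of (5.15), assuming E(w_ML) has already been evaluated; the function name and arguments are illustrative only.

```python
def precision_ml(E_wml, N, K=1):
    """Maximum-likelihood noise precision (5.15): 1/beta = 2 E(w_ML) / (N K)."""
    return N * K / (2.0 * E_wml)

# The implied residual variance of the fit is then 1.0 / precision_ml(E_wml, N).
```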

Error Surface

The residual error E(w) can be visualized as a surface in the weight space (Figure 5.5).

The error will, in practice, be highly nonlinear, with many minima, maxima and saddle points. There will be inequivalent minima, determined by the particular data and model, as well as equivalent minima, corresponding to weight-space symmetries.

Parameter Optimization

Iterative search for a local minimum of the error:

    w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}    (5.27)

\nabla E will be zero at a minimum of the error.
\tau is the time step.
\Delta w^{(\tau)} is the weight-vector update.
The definition of the update depends on the choice of algorithm.

Local Quadratic Approximation

The truncated Taylor expansion of E(w) around a weight point \hat{w} is

    E(w) \approx E(\hat{w}) + (w - \hat{w})^T b + \frac{1}{2} (w - \hat{w})^T H (w - \hat{w})    (5.28)

b = \nabla E |_{w=\hat{w}} is the gradient at \hat{w}.
(H)_{ij} = \frac{\partial E}{\partial w_i \partial w_j} |_{w=\hat{w}} is the Hessian of E at \hat{w}.

The gradient of E can be approximated by the gradient of the quadratic model (5.28); if w \approx \hat{w} then

    \nabla E \approx b + H (w - \hat{w})    (5.31)

where \frac{1}{2} ( (H + H^T) w - H \hat{w} - H^T \hat{w} ) = H (w - \hat{w}), as H = H^T.

Approximation at a Minimum

Suppose that w^* is at a minimum of E, so \nabla E |_{w=w^*} is zero, and H = \nabla\nabla E |_{w=w^*} is the Hessian:

    E(w) \approx E(w^*) + \frac{1}{2} (w - w^*)^T H (w - w^*)    (5.32)

The eigenvectors of H, defined by H u_i = \lambda_i u_i, are orthonormal. (w - w^*) can be represented in H-coordinates as \sum_i \alpha_i u_i. Hence the second term of (5.32) can be written

    \frac{1}{2} (w - w^*)^T H (w - w^*) = \frac{1}{2} ( \sum_i \lambda_i \alpha_i u_i )^T ( \sum_j \alpha_j u_j )

so that

    E(w) \approx E(w^*) + \frac{1}{2} \sum_i \lambda_i \alpha_i^2    (5.36)

Characterization of a Minimum

The eigenvalues \lambda_i of H characterize the stationary point w^*.

If all \lambda_i > 0, then H is positive definite (v^T H v > 0 for all v \neq 0). This is analogous to the scalar condition \frac{\partial^2 E}{\partial w^2} |_{w^*} > 0. Zero gradient and positive principal curvatures mean that E(w^*) is a minimum.

Figure 5.6: contours of constant error around w^*, with principal axes along the eigenvectors u_1, u_2 and lengths proportional to \lambda_1^{-1/2}, \lambda_2^{-1/2}.
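As a sketch of this test, the following hypothetical helper classifies a stationary point from the eigenvalues of a supplied Hessian, assuming H is symmetric.

```python
import numpy as np

def classify_stationary_point(H, tol=1e-10):
    """Classify a stationary point w* via the Hessian eigenvalues (cf. 5.36)."""
    eigvals = np.linalg.eigvalsh(H)        # eigenvalues of the symmetric H
    if np.all(eigvals > tol):
        return "minimum"                   # positive definite: all curvatures > 0
    if np.all(eigvals < -tol):
        return "maximum"
    return "saddle point (or degenerate)"

# Example: a positive-definite Hessian
H = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(classify_stationary_point(H))        # -> "minimum"
```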

Gradient Descent

The simplest approach is to update w by a displacement in the negative gradient direction:

    w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E( w^{(\tau)} )    (5.41)

This is a steepest-descent algorithm. \eta is the learning rate.
This is a batch method, as evaluation of \nabla E involves the entire data set.
Conjugate-gradient or quasi-Newton methods may, in practice, be preferred.
A range of starting points { w^{(0)} } may be needed, in order to find a satisfactory minimum.
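A minimal batch steepest-descent loop implementing (5.41); the stopping rule and the quadratic test function are assumptions for the example.

```python
import numpy as np

def gradient_descent(grad_E, w0, eta=0.1, n_steps=1000, tol=1e-6):
    """Batch steepest descent: w <- w - eta * grad E(w), eq. (5.41)."""
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        g = grad_E(w)                      # gradient over the whole data set
        if np.linalg.norm(g) < tol:        # stop near a stationary point
            break
        w -= eta * g
    return w

# Example: minimize E(w) = 0.5 * w^T A w, whose gradient is A w
A = np.array([[3.0, 0.2],
              [0.2, 1.0]])
w_min = gradient_descent(lambda w: A @ w, w0=[1.0, -2.0])
```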

Optimization Scheme

An efficient method for the evaluation of \nabla E(w) is needed. Each iteration of the descent algorithm has two stages:

I. Evaluate derivatives of the error with respect to the weights (involving backpropagation of error through the network).
II. Use the derivatives to compute adjustments of the weights (e.g. steepest descent).

Backpropagation is a general principle, which can be applied to many types of network and error function.

Simple Backpropagation

The error function is, typically, a sum over the data points: E(w) = \sum_{n=1}^{N} E_n(w). For example, consider a linear model

    y_k = \sum_i w_{ki} x_i    (5.45)

The error function, for an individual input x_n, is

    E_n = \frac{1}{2} \sum_k (y_{nk} - t_{nk})^2,  where  y_{nk} = y_k(x_n, w).    (5.46)

The gradient with respect to a weight w_{ji} is

    \frac{\partial E_n}{\partial w_{ji}} = (y_{nj} - t_{nj}) x_{ni}    (5.47)

w_{ji} is a particular link (x_i to y_j).
x_{ni} is the input to the link (the i-th component of x_n).
(y_{nj} - t_{nj}) is the error output by the link.
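The per-pattern gradient (5.47) for this linear model can be written as an outer product; the matrix layout of the weights is an assumption of the sketch.

```python
import numpy as np

def grad_En_linear(W, x_n, t_n):
    """Gradient of E_n for the linear model y = W x, eq. (5.47).

    W   : (K, D) weights, y_k = sum_i w_ki x_i   (5.45)
    x_n : (D,) input;  t_n : (K,) target
    Returns the (K, D) matrix with entries (y_nj - t_nj) * x_ni.
    """
    y_n = W @ x_n                          # outputs y_nk
    return np.outer(y_n - t_n, x_n)        # error times input, per link
```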

General Backpropagation

Recall that, in general, each unit computes a weighted sum

    a_j = \sum_i w_{ji} z_i,  with activation  z_j = h(a_j).    (5.48, 5.49)

For each error term,

    \frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}},  where  \delta_j \equiv \frac{\partial E_n}{\partial a_j}.    (5.50)

So, from (5.48):

    \frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i    (5.53)

In the network,

    \delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j},  where the sum runs over the units k to which unit j sends connections.    (5.55)

This gives the backpropagation formula

    \delta_j = h'(a_j) \sum_k w_{kj} \delta_k,  since  \frac{\partial a_k}{\partial a_j} = \frac{\partial a_k}{\partial z_j} \frac{\partial z_j}{\partial a_j}.    (5.56)

Backpropagation Algorithm

The formula for the update of a given unit depends only on the later (i.e. closer to the output) layers, as illustrated in Figure 5.7: unit j receives z_i through weight w_{ji}, and the \delta_k of the units k it feeds through weights w_{kj} are passed back to form \delta_j.

Hence the backpropagation algorithm is:

1. Apply input x, and forward propagate to find the hidden and output activations.
2. Evaluate \delta_k directly for the output units.
3. Backpropagate the \delta's to obtain a \delta_j for each hidden unit.
4. Evaluate the derivatives \frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i.
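The four steps, assembled into a minimal NumPy sketch for the two-layer network; the tanh hidden units, identity (regression) outputs and sum-of-squares error E_n are assumptions chosen so that \delta_k = y_k - t_k.

```python
import numpy as np

def backprop(x, t, W1, W2):
    """Per-pattern gradients dE_n/dW1, dE_n/dW2 for a two-layer network.

    tanh hidden units, identity outputs, E_n = 0.5 * ||y - t||^2.
    W1 : (M, D+1), W2 : (K, M+1); biases stored in column 0.
    """
    # 1. Forward propagate
    x_tilde = np.concatenate(([1.0], x))
    a = W1 @ x_tilde
    z = np.concatenate(([1.0], np.tanh(a)))
    y = W2 @ z                                      # identity outputs

    # 2. Output deltas (sum-of-squares error with identity outputs)
    delta_k = y - t

    # 3. Backpropagate: delta_j = h'(a_j) * sum_k w_kj delta_k    (5.56)
    delta_j = (1.0 - np.tanh(a) ** 2) * (W2[:, 1:].T @ delta_k)

    # 4. Derivatives: dE_n/dw_ji = delta_j * z_i                  (5.53)
    grad_W2 = np.outer(delta_k, z)
    grad_W1 = np.outer(delta_j, x_tilde)
    return grad_W1, grad_W2
```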

Computational Efficiency

The backpropagation algorithm is computationally more efficient than standard numerical differentiation of E_n. Suppose that W is the total number of weights and biases in the network.

Backpropagation: the evaluation is O(W) for large W, as there are many more weights than units.

Standard approach: perturb each weight in turn, and forward propagate to compute the change in E_n. This requires W separate O(W) forward propagations, so the total complexity is O(W^2).
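For comparison, a sketch of the O(W^2) alternative: a central-difference gradient that needs a separate evaluation of E_n for every weight. The flattened weight-vector interface is an assumption of the example.

```python
import numpy as np

def finite_difference_grad(E_n, w, eps=1e-6):
    """Central-difference gradient of E_n with respect to the flattened weights w.

    Each of the W components needs its own perturbed evaluation of E_n,
    so the total cost is O(W^2) for a network with W weights.
    """
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        grad[i] = (E_n(w + e) - E_n(w - e)) / (2.0 * eps)
    return grad
```

In practice such a routine is mainly useful as a check on backpropagated gradients, not as a replacement for them.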

Jacobian Matrix

The properties of the network can be investigated via the Jacobian

    J_{ki} = \frac{\partial y_k}{\partial x_i}    (5.70)

For example, (small) errors can be propagated through the trained network:

    \Delta y_k \approx \sum_i \frac{\partial y_k}{\partial x_i} \Delta x_i    (5.72)

This is useful, but costly, as J_{ki} itself depends on x. However, note that

    \frac{\partial y_k}{\partial x_i} = \sum_j \frac{\partial y_k}{\partial a_j} \frac{\partial a_j}{\partial x_i} = \sum_j w_{ji} \frac{\partial y_k}{\partial a_j}    (5.74)

The required derivatives \partial y_k / \partial a_j can be efficiently computed by backpropagation.
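For the two-layer network of the earlier sketches (tanh hidden units, identity outputs, biases in column 0, all of which are assumptions of the example), the Jacobian (5.70) can be written down directly with the chain rule:

```python
import numpy as np

def network_jacobian(x, W1, W2):
    """Jacobian J_ki = dy_k / dx_i, eq. (5.70), for a two-layer network
    with tanh hidden units and identity outputs."""
    x_tilde = np.concatenate(([1.0], x))
    a = W1 @ x_tilde                        # hidden activations a_j
    h_prime = 1.0 - np.tanh(a) ** 2         # h'(a_j) for tanh
    # dy_k/dx_i = sum_j w2_kj * h'(a_j) * w1_ji   (chain rule through z_j)
    return W2[:, 1:] @ (h_prime[:, None] * W1[:, 1:])

# Error propagation, eq. (5.72): delta_y ≈ J @ delta_x
```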