
Neuro-Fuzzy Comp. Ch. 4, June 1, 2005                                                        A.P. Papliński

4 Feedforward Multilayer Neural Networks, part I

Feedforward multilayer neural networks (introduced in sec. 7) with supervised error-correcting learning are used to approximate (synthesise) a non-linear input-output mapping from a set of training patterns. Consider the following mapping f(x) from a p-dimensional domain X into an m-dimensional output space D:

Figure 4.1: Mapping from a p-dimensional domain X ⊂ R^p into an m-dimensional output space D ⊂ R^m, approximated by y = F(W, x).

A function

    f : X → D ,   or   d = f(x) ;   x ∈ X ⊂ R^p ,  d ∈ D ⊂ R^m

is assumed to be unknown, but it is specified by a set of training examples, {X; D}. This function is approximated by a fixed, parameterised function (a neural network)

    F : R^p × R^M → R^m ,   or   y = F(W, x) ;   x ∈ R^p ,  y ∈ R^m ,  W ∈ R^M

Approximation is performed in such a way that some performance index, J, typically a function of the errors between D and Y,

    J = J(W, D − Y)

is minimised.

Basic types of NETWORKS for APPROXIMATION:

• Linear Networks (Adaline):

    F(W, x) = W x

• Classical approximation schemes. Consider a set of suitable basis functions {φ_i}_{i=1}^{n}; then

    F(W, x) = Σ_{i=1}^{n} w_i φ_i(x)

  Popular examples: power series, trigonometric series, splines, Radial Basis Functions. The Radial Basis Functions guarantee, under certain conditions, an optimal solution of the approximation problem. A special case, the Gaussian Radial Basis Functions:

    φ_i(x) = exp( −(1/2) (x − t_i)^T Σ_i^{−1} (x − t_i) )

  where t_i and Σ_i are the centre and covariance matrix of the i-th RBF, representing adjustable parameters (weights) of the network.

• Multilayer Perceptrons (feedforward neural networks). Each layer of the network is characterised by its matrix of parameters, and the network performs a composition of nonlinear operations of the form

    F(W, x) = σ( W_l σ( ⋯ σ( W_1 x ) ⋯ ) )

  A feedforward neural network with two layers (one hidden and one output) is very commonly used to approximate unknown mappings. If the output layer is linear, such a network may have a structure similar to an RBF network.

4.1 Multilayer perceptrons (MLPs)

Multilayer perceptrons are commonly used to approximate complex nonlinear mappings. In general, it is possible to show that two layers are sufficient to approximate any nonlinear function. Therefore, we restrict our considerations to such two-layer networks. The structure of the decoding part of the two-layer back-propagation network is presented in Figure 4.2.

Figure 4.2: A block-diagram of a single-hidden-layer feedforward neural network: the p-component input x drives the hidden layer (weight matrix W^h, activation ψ, L hidden signals h), which drives the output layer (weight matrix W^y, activation σ, m output signals y).

The structure of each layer has been discussed in sec. 6. The nonlinear functions used in the hidden layer and in the output layer can be different; the output function can be linear. There are two weight matrices: an L×p matrix W^h in the hidden layer, and an m×L matrix W^y in the output layer.

The working of the network can be described in the following way:

    u(n) = W^h x(n) ;  h(n) = ψ(u(n))    (hidden signals)
    v(n) = W^y h(n) ;  y(n) = σ(v(n))    (output signals)

or simply as

    y(n) = σ( W^y ψ( W^h x(n) ) )                                                  (4.1)

Typically, sigmoidal functions (hyperbolic tangents) are used, but other choices are also possible. The important condition from the point of view of the learning law is for the function to be differentiable.

Typical non-linear functions and their derivatives used in multi-layer perceptrons:

Sigmoidal unipolar:

    y = σ(v) = 1/(1 + e^{−βv}) = (tanh(βv/2) + 1)/2

The derivative of the unipolar sigmoidal function:

    y' = dy/dv = β e^{−βv}/(1 + e^{−βv})^2 = β y (1 − y)
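For illustration, the forward pass of eqn (4.1) and the "derivative obtained from the output" identities can be sketched in MATLAB as follows (a minimal sketch only; the sizes p, L, m, the weights and the input are arbitrary, and tanh output units are assumed):

  p = 3; L = 5; m = 2;              % assumed sizes: inputs, hidden signals, outputs
  Wh = randn(L, p);                 % hidden-layer weight matrix, L by p
  Wy = randn(m, L);                 % output-layer weight matrix, m by L
  beta = 1;                         % slope of the unipolar sigmoid
  x  = randn(p, 1);                 % a single input pattern
  u  = Wh*x;                        % hidden activation potentials
  h  = 1./(1 + exp(-beta*u));       % hidden signals, h = psi(u)
  hp = beta*h.*(1 - h);             % psi'(u) computed directly from h
  v  = Wy*h;                        % output activation potentials
  y  = tanh(v);                     % output signals, y = sigma(v)
  yp = 1 - y.^2;                    % sigma'(v) computed directly from y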

Sigmoidal bipolar:

    y = σ(v) = tanh(βv)

The derivative of the bipolar sigmoidal function:

    y' = dy/dv = 4β e^{2βv}/(e^{2βv} + 1)^2 = β (1 − y^2)

Note that:

• derivatives of the sigmoidal functions are always non-negative,
• derivatives can be calculated directly from the output signals using simple arithmetic operations,
• in saturation, for big values of the activation potential v, the derivatives are close to zero,
• derivatives of σ are used in the error-correction learning law.

Comments on multi-layer linear networks

Multi-layer feedforward linear neural networks can always be replaced by an equivalent single-layer network. Consider a linear network consisting of two layers, with an L×p hidden weight matrix W^h and an m×L output weight matrix W^y. The hidden and output signals in the network can be calculated as follows:

    h = W^h x ,   y = W^y h

After substitution we have:

    y = W^y W^h x = W x ,   where   W = W^y W^h

which is equivalent to a single-layer network described by the m×p weight matrix W.
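This collapse of a linear two-layer network into a single layer can be confirmed numerically; the following MATLAB sketch (arbitrary sizes and random weights) is only an illustration of the identity above:

  p = 4; L = 3; m = 2;
  Wh = randn(L, p);  Wy = randn(m, L);
  x  = randn(p, 1);
  y_two = Wy*(Wh*x);            % two-layer linear network
  W     = Wy*Wh;                % equivalent single-layer weight matrix, m by p
  y_one = W*x;                  % single-layer network
  max(abs(y_two - y_one))       % zero up to rounding error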

4.2 Detailed structure of a Two-Layer Perceptron, the most commonly used feedforward neural network

Figure 4.3 gives two representations of the two-layer perceptron: a signal-flow diagram and a dendritic diagram. In both, the signals are formed as

    u_j = W^h_{j:} x ;   h_j = ψ(u_j) ;   v_k = W^y_{k:} h ;   y_k = σ(v_k)

where W^h is L×p and W^y is m×L. We can have x_p = 1 and possibly h_L = 1, so that the last input and the last hidden signal act as bias inputs to their respective layers.

Figure 4.3: Various representations of a Two-Layer Perceptron.

4.3 Example of function approximation with a two-layer perceptron

Consider a single-variable function approximated by a two-layer perceptron with three tanh hidden units (each with a weight and a bias), a bias-augmented input x, and a linear output layer (see the plot "Function approximation with a two-layer perceptron"). The neural net generates the following function:

    y = W^y h = W^y tanh( W^h x )
      = w^y_1 h_1 + w^y_2 h_2 + w^y_3 h_3
      = w^y_1 tanh(w^h_{11} x + w^h_{12}) + w^y_2 tanh(w^h_{21} x + w^h_{22}) + w^y_3 tanh(w^h_{31} x + w^h_{32})

where the particular numerical values of the weights are those used to generate the plotted approximation.
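Such a one-input approximator is easy to evaluate and plot; the MATLAB sketch below uses illustrative weight values only (not those behind the figure above), simply to show the shape of a weighted sum of three tanh units:

  Wh = [2 -1; 4 -6; 1 -3];           % [slope bias] of each tanh hidden unit (illustrative)
  wy = [1 0.8 -0.5];                 % output weights w^y_j (illustrative)
  x  = 0:0.05:6;                     % input samples
  H  = tanh(Wh*[x; ones(size(x))]);  % hidden signals, 3 by length(x)
  y  = wy*H;                         % network output
  plot(x, y), grid on, xlabel('x'), ylabel('y')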

4.4 Structure of a Gaussian Radial Basis Function (RBF) Neural Network

An RBF neural network is similar to a two-layer perceptron with a linear output layer: each hidden unit computes a Gaussian basis function

    exp( −(1/2) (x − t_l)^T Σ_l^{−1} (x − t_l) ) ,   l = 1, …, L

of the p-dimensional input x, and the output is a weighted sum of these signals with weights w_1, …, w_L.

Let us consider a single output case, that is, approximation of a single function of p variables:

    y = F(x; t_1, …, t_L, Σ_1, …, Σ_L, w) = Σ_{l=1}^{L} w_l exp( −(1/2) (x − t_l)^T Σ_l^{−1} (x − t_l) )

where t_l, p-element vectors, are the centres of the Gaussian functions, and Σ_l, p×p covariance matrices, specify the shape of the Gaussian functions. All parameters (t_1, …, t_L, Σ_1, …, Σ_L, w) are adjusted during the learning procedure.

When the covariance matrix is diagonal, the axes of the Gaussian shape are aligned with the coordinate system axes. If, in addition, all diagonal elements are identical, the Gaussian function is symmetrical in all directions.

4.5 Example of function approximation with a Gaussian RBF network

Consider a single-variable function approximated by a Gaussian RBF network with three basis functions (see the plot "Function approximation with a Gaussian RBF"). The neural net generates the following function:

    y = w h = w_1 h_1 + w_2 h_2 + w_3 h_3
      = w_1 exp( −(x − t_1)^2/(2σ_1^2) ) + w_2 exp( −(x − t_2)^2/(2σ_2^2) ) + w_3 exp( −(x − t_3)^2/(2σ_3^2) )
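A MATLAB sketch of evaluating such a 1-D Gaussian RBF network follows; the centres, widths and weights are arbitrary example values, not those used in the figure:

  t     = [1 2.5 4];            % centres t_l
  sigma = [0.5 0.8 0.6];        % widths sigma_l
  w     = [1.5 -1 0.8];         % output weights w_l
  x = 0:0.02:6;                 % input samples
  y = zeros(size(x));
  for l = 1:3                   % sum of L = 3 Gaussian basis functions
      y = y + w(l)*exp(-(x - t(l)).^2/(2*sigma(l)^2));
  end
  plot(x, y), grid on, xlabel('x'), ylabel('y')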

4.6 Error-Correcting Learning Algorithms for Feedforward Neural Networks

Error-correcting learning algorithms are supervised training algorithms that modify the parameters of the network in such a way as to minimise the error between the desired and actual outputs. The network output y = F(x; w) is compared with the desired output d, and the resulting error ε drives the weight update w ← w + Δw. The vector w represents all the adjustable parameters of the network.

Training data consists of N p-dimensional input vectors x(n) and N m-dimensional desired output vectors d(n), which are organised in two matrices:

    X = [ x(1) … x(n) … x(N) ]   is a p×N matrix,
    D = [ d(1) … d(n) … d(N) ]   is an m×N matrix.

For each input vector x(n), the network calculates the actual output vector y(n) as

    y(n) = F( x(n); w(n) )

The output vector is compared with the desired output d(n) and the error is calculated:

    ε(n) = [ ε_1(n) … ε_m(n) ]^T = d(n) − y(n)   is an m×1 vector.                 (4.2)

In a pattern training algorithm, at each step the weight vector is updated so that the total error is minimised:

    w(n+1) = w(n) + Δw(n)

More specifically, we try to minimise a performance index, typically the mean-squared error (mse), J(w), specified as an averaged sum of instantaneous squared errors at the network output:

    J(w) = (1/(m N)) Σ_{n=1}^{N} E(w(n))                                           (4.3)

where the total instantaneous squared error (tise) at step n, E(w(n)), is defined as

    E(w(n)) = (1/2) Σ_{k=1}^{m} ε_k^2(n) = (1/2) ε^T(n) ε(n)                       (4.4)

and ε is the m×1 vector of instantaneous errors as in eqn (4.2).

To consider possible minimisation algorithms we can expand J(w) into the Taylor power series:

    J(w(n+1)) = J(w + Δw) = J(w) + Δw ∇J(w) + (1/2) Δw ∇²J(w) Δw^T + …
              = J(w) + Δw g(w) + (1/2) Δw H(w) Δw^T + …                            (4.5)

((n) has been omitted for brevity), where ∇J(w) = g is the gradient vector of the performance index and ∇²J(w) = H is the Hessian matrix (matrix of second derivatives).
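The error measures of eqns (4.2)-(4.4) are straightforward to compute; the MATLAB sketch below (with made-up D and Y matrices) is only meant to fix the shapes and the averaging:

  m = 2; N = 5;
  D = randn(m, N);  Y = randn(m, N);   % example desired and actual outputs
  Err = D - Y;                         % errors eps(n) for all patterns, m by N, eqn (4.2)
  En  = 0.5*sum(Err.^2, 1);            % tise E(w(n)) for each pattern, 1 by N, eqn (4.4)
  J   = sum(En)/(m*N);                 % mean-squared error performance index, eqn (4.3)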

As for the Adaline, we infer that in order for the performance index to be reduced, that is,

    J(w + Δw) < J(w)

the following condition must be satisfied:

    Δw ∇J(w) < 0

where the higher order terms in the expansion (4.5) have been ignored. This condition describes the steepest descent method, in which the weight vector is modified in the direction opposite to the gradient vector:

    Δw = −η ∇J(w)                                                                  (4.6)

However, the gradient of the total error is equal to the sum of its components:

    ∇J(w) = (1/(m N)) Σ_{n=1}^{N} ∇E(w(n))                                         (4.7)

Therefore, in the pattern training, at each step the weight vector can be modified in the direction opposite to the gradient of the total instantaneous squared error:

    Δw(n) = −η ∇E(w(n))                                                            (4.8)

The gradient of the total instantaneous squared error (tise) is a K-component vector, where K is the total number of weights, of all partial derivatives of E(w) with respect to w ((n) has been omitted for brevity):

    ∇E(w) = ∂E/∂w = [ ∂E/∂w_1 … ∂E/∂w_K ]

Using eqns (4.4) and (4.2) we have

    ∇E(w) = ∂E/∂w = ε^T ∂ε/∂w = −ε^T ∂y/∂w                                         (4.9)

where ∂y/∂w is an m×K matrix of all derivatives ∂y_j/∂w_k (the Jacobian matrix).

Details of the calculation of the gradient of the total instantaneous squared error (gradient of tise), in particular the Jacobian matrix, will be different for a specific type of neural net, that is, for a specific mapping function y = F(x; w).
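The steepest descent rule (4.6) is generic; as an aside, the following MATLAB sketch applies it to an assumed quadratic performance index J(w) = (1/2) w^T A w − b^T w (so that ∇J = A w − b), simply to show the iteration Δw = −η ∇J converging towards the minimiser:

  A = [3 1; 1 2];  b = [1; -1];     % an example quadratic performance index
  w = [2; 2];                       % initial weight vector
  eta = 0.1;                        % training gain
  for n = 1:50
      g = A*w - b;                  % gradient of J at the current w
      w = w - eta*g;                % steepest descent step, eqn (4.6)
  end
  w_opt = A\b;                      % exact minimiser, for comparison with w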

4.7 Steepest Descent Backpropagation Learning Algorithm for a Multi-Layer Perceptron

The steepest descent backpropagation learning algorithm is the simplest of the error-correcting algorithms for a multi-layer perceptron. For simplicity we will derive it for a two-layer perceptron as in Figure 4.3, where the vector of output signals is calculated as:

    y = σ(v) ;   v = W^y h ;   h = ψ( W^h x )                                      (4.10)

We start with re-shaping the two weight matrices into one weight vector of the following structure:

    w = scan(W^h, W^y) = [ w^h_{11} … w^h_{1p} … w^h_{Lp}   w^y_{11} … w^y_{1L} … w^y_{mL} ] = [ w^h  w^y ]
                           (hidden weights)                  (output weights)

The size of W^h is L×p and of W^y is m×L, the total number of weights being equal to K = L(p + m).

Now, the gradient of tise has components associated with the hidden-layer weights, W^h, and the output-layer weights, W^y, which are arranged in the following way:

    ∇E(w) = [ ∂E/∂w^h  ∂E/∂w^y ] = [ ∂E/∂w^h_{11} … ∂E/∂w^h_{ji} … ∂E/∂w^h_{Lp}   ∂E/∂w^y_{11} … ∂E/∂w^y_{kj} … ∂E/∂w^y_{mL} ]

Using eqns (4.9) and (4.10) the gradient of tise can be further expressed as:

    ∇E(w) = ∂E/∂w = −ε^T ∂y/∂w = −ε^T (∂y/∂v)(∂v/∂w)      (the chain rule)         (4.11)

4.7.1 Output layer

The vector of the output signals is calculated as

    y = σ(v) ;   v = W^y h ;   ε = d − y

The first Jacobian matrix of eqn (4.11) can be evaluated as follows:

    ∂y/∂v = diag(σ') = diag( σ'_1 … σ'_m ) ,   where   σ'_k = ∂y_k/∂v_k ,  v_k = W^y_{k:} h     (4.12)

which is a diagonal matrix of derivatives of the activation function. The products of the errors ε_k and the derivatives of the activation function σ'_k are known as the delta errors:

    δ^T = ε^T diag(σ') = [ ε_1 σ'_1 … ε_m σ'_m ]                                   (4.13)

(δ is an m×1 vector of delta errors). Substituting eqns (4.13) and (4.12) into eqn (4.11), the gradient of tise can be expressed as:

    ∇E(w) = ∂E/∂w = −δ^T ∂v/∂w                                                     (4.14)
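Eqn (4.13) says that the delta errors are just an element-wise product of the errors and the activation-function derivatives; the small MATLAB check below (random values, tanh units assumed) illustrates this:

  m = 3;
  err = randn(m, 1);               % output errors
  y   = tanh(randn(m, 1));         % output signals of tanh units
  yp  = 1 - y.^2;                  % derivatives sigma'_k, computed from the outputs
  delta1 = (err'*diag(yp))';       % delta errors via the diagonal matrix, eqn (4.13)
  delta2 = err.*yp;                % the same as an element-wise product
  max(abs(delta1 - delta2))        % zero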

Now we will separately calculate the output and hidden sections of the gradient of tise, namely ∂E/∂w^y and ∂E/∂w^h. From eqn (4.14) the output section of the gradient can be written as:

    ∂E/∂w^y = −δ^T ∂v/∂w^y                                                         (4.15)

In order to evaluate the Jacobian ∂v/∂w^y, note first that since v = W^y h, then ∂v_k/∂W^y_{k:} = h^T. Therefore, after scanning W^y row-wise into w^y we have

    ∂v/∂w^y = I ⊗ h^T                                                              (4.16)

where ⊗ denotes the Kronecker product and I is an m×m identity matrix. Combining eqns (4.15) and (4.16), we finally have:

    ∂E/∂w^y = −δ^T (I ⊗ h^T) ,   or equivalently   ∂E/∂W^y = −δ h^T                (4.17)

Hence, if we reshape the weight vector w^y back into the weight matrix W^y, the gradient of the total instantaneous squared error can be expressed as an outer product of the δ and h vectors.

Using eqn (4.8), the steepest descent pattern training algorithm for minimisation of the performance index with respect to the output weight matrix becomes:

    ΔW^y(n) = −η ∂E(n)/∂W^y(n) = η δ(n) h^T(n) ;   W^y(n+1) = W^y(n) + ΔW^y(n)     (4.18)

In the batch training mode, the gradient of the total performance index, J(W^h, W^y), related to the output weight matrix, W^y, can be obtained by summing the gradients of the total instantaneous squared errors:

    ∂J/∂W^y = (1/(m N)) Σ_{n=1}^{N} ∂E(n)/∂W^y(n) = −(1/(m N)) Σ_{n=1}^{N} δ(n) h^T(n)     (4.19)

If we take into account that the sum of outer products can be replaced by a product of matrices collecting the contributing vectors, then the gradient can be written as:

    ∂J/∂W^y = −(1/(m N)) S^y H^T                                                   (4.20)

where S^y is the m×N matrix of output delta errors:

    S^y = [ δ(1) … δ(N) ]                                                          (4.21)

and H is the L×N matrix of the hidden signals: H = ψ(W^h X). Therefore, in the batch training LMS algorithm, the weight update after the k-th epoch can be written as:

    ΔW^y(k) = −η ∂J/∂W^y = η S^y(k) H^T(k) ;   W^y(k+1) = W^y(k) + ΔW^y(k)         (4.22)
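The step from eqn (4.19) to eqn (4.20), in which a sum of outer products is replaced by a single matrix product, is easy to verify numerically; the MATLAB sketch below (arbitrary sizes, tanh output units assumed) also shows the resulting batch update of eqn (4.22):

  m = 2; L = 4; N = 10; eta = 0.1;
  Wy = randn(m, L);  H = rand(L, N);  D = randn(m, N);
  Y = tanh(Wy*H);                      % outputs for all N patterns
  S = (D - Y).*(1 - Y.^2);             % m by N matrix of output delta errors, eqn (4.21)
  G = zeros(m, L);
  for n = 1:N
      G = G + S(:,n)*H(:,n)';          % sum of outer products delta(n)*h(n)'
  end
  max(max(abs(G - S*H')))              % the two forms agree up to rounding error
  Wy = Wy + eta*S*H';                  % batch update of the output weights, eqn (4.22)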

4.7.2 Hidden layer

As for the output layer, we start with eqn (4.14), which for the hidden weights can be re-written in a form analogous to eqn (4.15); then, using the chain rule of differentiation, we can write:

    ∂E/∂w^h = −δ^T ∂v/∂w^h = −δ^T (∂v/∂h)(∂h/∂w^h)                                 (4.23)

Since v = W^y h, we have ∂v/∂h = W^y, and we can define the back-propagated error as:

    ε^h = (W^y)^T δ                                                                (4.24)

The gradient of tise with respect to the hidden weights, with u = W^h x and h = ψ(u), can now be written as:

    ∂E/∂w^h = −(ε^h)^T ∂h/∂w^h = −(ε^h)^T (∂h/∂u)(∂u/∂w^h)                         (4.25)

which is structurally identical to eqn (4.9), with w^y, ε and y being replaced by w^h, ε^h and h, respectively. Following eqn (4.13) we can define the back-propagated delta errors:

    (δ^h)^T = (ε^h)^T ∂h/∂u = (ε^h)^T diag(ψ') = [ ε^h_1 ψ'_1 … ε^h_L ψ'_L ] ;   ψ'_j = ∂h_j/∂u_j     (4.26)

Now, the gradient of tise can be expressed in a form analogous to eqn (4.15):

    ∂E/∂w^h = −(δ^h)^T ∂u/∂w^h                                                     (4.27)

Finally we have:

    ∂E/∂w^h = −(δ^h)^T (I ⊗ x^T) ,   or   ∂E/∂W^h = −δ^h x^T                       (4.28)

Hence, if we reshape the weight vector w^h back into the weight matrix W^h, the gradient of the total instantaneous squared error can be expressed as an outer product of the δ^h and x vectors. Therefore, for the pattern training algorithm, the update of the hidden weight matrix for the n-th training pattern now becomes:

    ΔW^h(n) = −η^h ∂E(n)/∂W^h(n) = η^h δ^h(n) x^T(n) ;   W^h(n+1) = W^h(n) + ΔW^h(n)     (4.29)

It is interesting to note that the weight update rule is identical in its form for both the output and hidden layers.
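For a single pattern, the whole hidden-layer calculation of eqns (4.24)-(4.29) amounts to a few lines of MATLAB; the sketch below assumes unipolar sigmoid hidden units and tanh output units, with arbitrary sizes and random data:

  p = 3; L = 4; m = 2; etah = 0.1;
  Wh = randn(L, p);  Wy = randn(m, L);
  x = randn(p, 1);  d = randn(m, 1);
  u = Wh*x;  h = 1./(1 + exp(-u));  hp = h.*(1 - h);   % forward pass, psi and psi'
  y = tanh(Wy*h);  yp = 1 - y.^2;
  delta  = (d - y).*yp;            % output delta errors, eqn (4.13)
  epsh   = Wy'*delta;              % back-propagated errors, eqn (4.24)
  deltah = epsh.*hp;               % hidden delta errors, eqn (4.26)
  dWh    = deltah*x';              % -dE/dWh = deltah*x', eqn (4.28)
  Wh = Wh + etah*dWh;              % pattern-mode update, eqn (4.29)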

For the batch training, the gradient of the total performance index, J(W^h, W^y), related to the hidden weight matrix, W^h, can be obtained by summing the instantaneous gradients:

    ∂J/∂W^h = (1/(m N)) Σ_{n=1}^{N} ∂E(n)/∂W^h(n) = −(1/(m N)) Σ_{n=1}^{N} δ^h(n) x^T(n)     (4.30)

If we take into account that the sum of outer products can be replaced by a product of matrices collecting the contributing vectors, then we finally have

    ∂J/∂W^h = −(1/(m N)) S^h X^T                                                   (4.31)

where S^h is the L×N matrix of hidden delta errors:

    S^h = [ δ^h(1) … δ^h(N) ]                                                      (4.32)

and X is the p×N matrix of the input signals. In the batch training steepest descent algorithm, the weight update after the k-th epoch can be written as:

    ΔW^h(k) = −η^h ∂J/∂W^h = η^h S^h(k) X^T ;   W^h(k+1) = W^h(k) + ΔW^h(k)        (4.33)

4.7.3 Alternative derivation

The basic signal flow referred to in the above calculations runs from the inputs x_1, …, x_i, …, x_p through the hidden weights w^h_{ji} and hidden signals h_1, …, h_j, …, h_L, then through the output weights w^y_{kj} and activation potentials v_k to the outputs y_1, …, y_k, …, y_m, which are compared with the desired values d_1, …, d_k, …, d_m to give the errors ε_1, …, ε_k, …, ε_m.

Good to remember:

    E = (1/2) Σ_{k=1}^{m} ε_k^2 ,   ε_k = d_k − y_k
    y_k = σ( W^y_{k:} h ) = σ( Σ_{j=1}^{L} w^y_{kj} h_j )
    h_j = ψ( W^h_{j:} x ) = ψ( Σ_{i=1}^{p} w^h_{ji} x_i )

The gradient component related to a synaptic weight w^y_{kj} for the n-th training vector can be calculated as follows:

    ∂E(n)/∂w^y_{kj} = −ε_k ∂y_k/∂w^y_{kj} = −ε_k (∂y_k/∂v_k)(∂v_k/∂w^y_{kj}) = −ε_k σ'_k h_j = −δ_k h_j

where

    E = (1/2)( … + ε_k^2 + … ) ,   y_k = σ(v_k) ,   v_k = W^y_{k:} h = … + w^y_{kj} h_j + … ,   σ'_k = ∂y_k/∂v_k

and where the delta error, δ_k = ε_k σ'_k, is the output error, ε_k, modified with the derivative of the activation function, σ'_k.

Alternatively, the gradient components related to the complete weight vector W^y_{k:} of the k-th output neuron can be calculated as:

    ∂E(n)/∂W^y_{k:} = −ε_k ∂y_k/∂W^y_{k:} = −ε_k (∂y_k/∂v_k)(∂v_k/∂W^y_{k:}) = −ε_k σ'_k h^T = −δ_k h^T ,
    with   y_k = σ(v_k) ,   v_k = W^y_{k:} h

In the above expression, each component of the gradient is a function of the delta error for the k-th output, δ_k, and the respective output signal from the hidden layer, h^T = [ h_1 … h_L ].

Finally, the gradient components related to the complete weight matrix of the output layer, W^y, can be collected in an m×L matrix, as follows:

    ∂E(n)/∂W^y = −δ(n) h^T(n)                                                      (4.34)

where

    δ = [ δ_1 … δ_m ]^T = [ ε_1 σ'_1 … ε_m σ'_m ]^T = σ' ⊙ ε                       (4.35)

and ⊙ denotes an element-by-element multiplication, and the hidden signals are calculated as follows:

    h(n) = ψ( W^h(n) x(n) )

Gradient components related to the weight matrix of the hidden layer:

    ∂E(n)/∂w^h_{ji} = −Σ_{k=1}^{m} ε_k ∂y_k/∂w^h_{ji}
                    = −Σ_{k=1}^{m} ε_k (∂y_k/∂v_k)(∂v_k/∂w^h_{ji})
                    = −( Σ_{k=1}^{m} δ_k w^y_{kj} ) ∂h_j/∂w^h_{ji}
                    = −(W^y_{:j})^T δ · ∂h_j/∂w^h_{ji}

where

    y_k = σ(v_k) ,   v_k = … + w^y_{kj} h_j + … ,   δ_k = ε_k σ'_k ,
    (W^y_{:j})^T δ = δ^T W^y_{:j} = Σ_{k=1}^{m} δ_k w^y_{kj} = ε^h_j

Note that the j-th column of the output weight matrix, W^y_{:j}, is used to modify the delta errors to create the equivalent hidden-layer errors, known as back-propagated errors, which are specified as follows:

    ε^h_j = (W^y_{:j})^T δ ,   for j = 1, …, L

Using the back-propagated error, we can now repeat the steps performed for the output layer, with ε^h_j and h_j replacing ε_k and y_k, respectively:

    ∂E(n)/∂w^h_{ji} = −ε^h_j ∂h_j/∂w^h_{ji} = −ε^h_j ψ'_j x_i = −δ^h_j x_i ,
    where   ψ'_j = ∂h_j/∂u_j ,   δ^h_j = ε^h_j ψ'_j ,   h_j = ψ(u_j) ,   u_j = W^h_{j:} x

and the back-propagated error has been used to generate the delta error for the hidden layer, δ^h_j. All gradient components related to the hidden weight matrix, W^h, can now be calculated in a way similar to that for the output layer as in eqn (4.34):

    ∂E(n)/∂W^h = −δ^h x^T ,   where   δ^h = [ ε^h_1 ψ'_1 … ε^h_L ψ'_L ]^T = ψ' ⊙ ε^h   and   ε^h = (W^y)^T δ     (4.36)

4.7.4 The structure of the two-layer back-propagation network with learning

The figure referred to here shows the structure of the decoding and encoding parts of the two-layer back-propagation network: the decoding (forward) part computes u = W^h x, h = ψ(u), v = W^y h and y = σ(v), while the encoding (learning) part computes the derivatives ψ' = dψ/du and σ' = dσ/dv, the error ε = d − y, the delta signals δ = σ' ⊙ ε and δ^h = ψ' ⊙ ε^h with ε^h = (W^y)^T δ, and the weight updates ΔW^y = η δ h^T and ΔW^h = η^h δ^h x^T.

Note the decoding and encoding parts, and the blocks which calculate derivatives, delta signals and the weight-update signals. The process of computing the signals (pattern mode) during each time step consists of:

• the forward pass, in which the signals of the decoding part are determined, starting from x, through u, h, ψ', v, to y, and
• the backward pass, in which the signals of the learning part are determined, starting from d, through ε, δ, ΔW^y, ε^h, δ^h and ΔW^h.

From this figure and the relevant equations note that, in general, the weight update is proportional to the synaptic input signals (x, or h) and the delta signals (δ^h, or δ). The delta signals, in turn, are proportional to the derivatives of the activation functions, ψ' or σ'.
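Putting the forward and backward passes together, a complete pattern-mode training loop can be sketched in MATLAB as below; this is only an illustration under assumed choices (unipolar sigmoid hidden units, tanh output units, an arbitrary example data set, and arbitrary sizes and gains), not the program used later in these notes:

  p = 2; L = 6; m = 1; N = 100;
  X = rand(p, N);  D = sin(2*pi*X(1,:)).*X(2,:);   % an arbitrary example data set
  Wh = 0.5*randn(L, p);  Wy = 0.5*randn(m, L);     % small random initial weights
  eta = 0.05;  etah = 0.05;                        % training gains
  for epoch = 1:500
      for n = randperm(N)                          % pattern mode, randomised order
          x = X(:,n);  d = D(:,n);
          h = 1./(1 + exp(-Wh*x));                 % forward pass
          y = tanh(Wy*h);
          delta  = (d - y).*(1 - y.^2);            % backward pass
          deltah = (Wy'*delta).*h.*(1 - h);
          Wy = Wy + eta*delta*h';                  % weight updates
          Wh = Wh + etah*deltah*x';
      end
  end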

Comments on Learning Algorithms for Multi-Layer Perceptrons

The process of training a neural network is monitored by observing the value of the performance index, J(W(n)), typically the mean-squared error as defined in eqns (4.3) and (4.4). In order to reduce the value of this error function, it is typically necessary to go through the set of training patterns (epochs) a number of times, as discussed on page 3-2.

There are two basic modes of updating the weights:

• the pattern mode, in which weights are updated after the presentation of a single training pattern,
• the batch mode, in which weights are updated after each epoch.

For the basic steepest descent backpropagation algorithm the relevant equations are:

pattern mode (n is the pattern index):

    W^y(n+1) = W^y(n) + η δ(n) h^T(n)
    W^h(n+1) = W^h(n) + η^h δ^h(n) x^T(n)

batch mode (k is the epoch counter):

    W^y(k+1) = W^y(k) + η S^y(k) H^T(k)
    W^h(k+1) = W^h(k) + η^h S^h(k) X^T

Definitions of the other variables have already been given.

Weight initialisation. The weights are initialised in one of the following ways:

• using prior information, if available; the Nguyen-Widrow algorithm presented in sec. 5.4 is a good example of such an initialisation,
• to small, uniformly distributed random numbers.

Incorrectly initialised weights cause the activation potentials to become large, which saturates the neurons. In saturation, the derivatives σ' ≈ 0 and no learning takes place. A good initialisation can significantly speed up the learning process.

Randomisation. For the pattern training it might be a good practice to randomise the order of presentation of training examples between epochs.

Validation. In order to validate the process of learning, the available data is randomly partitioned into a training set, which is used for training, and a test set, which is used for validation of the obtained data model. A short sketch of these last two practices follows.
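A possible MATLAB realisation of the randomisation and validation steps is sketched below; the data matrices and the 80/20 split ratio are placeholders, not prescriptions from these notes:

  p = 3; m = 2; N = 200;
  X = randn(p, N);  D = randn(m, N);       % placeholder training data; X is p by N, D is m by N
  idx  = randperm(N);                      % random partition of the available data
  Ntrn = round(0.8*N);
  Xtrn = X(:, idx(1:Ntrn));    Dtrn = D(:, idx(1:Ntrn));      % training set
  Xtst = X(:, idx(Ntrn+1:N));  Dtst = D(:, idx(Ntrn+1:N));    % test (validation) set
  order = randperm(Ntrn);                  % re-randomised presentation order for one epoch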

4.8 Example of function approximation (fap2dm)

In this MATLAB example we approximate two functions of two variables using a two-layer perceptron:

    y = f(x) ,   or   y_1 = f_1(x_1, x_2) ,  y_2 = f_2(x_1, x_2) ,   y = σ( W^y ψ( W^h x ) )

The weights of the perceptron, W^h and W^y, are trained using the basic back-propagation algorithm in the batch mode, as discussed in the previous section.

Specification of the neural network (fap2dm):

  p = 3;   % Number of inputs plus the bias input
  L = 12;  % Number of hidden signals (with bias)
  m = 2;   % Number of outputs

(The block-diagram of the network shows the 3-component input x = [x_1 x_2 x_3], the hidden layer W^h producing the hidden signals h, with a bias signal appended to the hidden bus, and the output layer W^y producing the two outputs y_1 = σ(v_1) and y_2 = σ(v_2).)

The two functions to be approximated by the two-layer perceptron are as follows:

    y_1 = x_1 e^{−ρ^2} ,   y_2 = sin(2ρ^2)/(4ρ^2) ,   where   ρ^2 = x_1^2 + x_2^2

The domain of the functions is a square x_1, x_2 ∈ [−2, 2). In order to form the training set, the functions are sampled on a regular 16×16 grid. The relevant MATLAB code to form the matrices X and D follows:

  na = 16;  N = na^2;  nn = 0:na-1;   % Number of training cases
  % Specification of the domain of functions:
  x1 = nn*4/na - 2;                   % na points from -2, step (4/na)=0.25, to (2-4/na)=1.75
  [X1, X2] = meshgrid(x1);            % coordinates of the grid vertices; X1 and X2 are na by na
  R = X1.^2 + X2.^2 + 1e-5;           % R (rho^2): squares of distances of the grid vertices from the origin
  D1 = X1.*exp(-R);  D1 = (D1(:))';   % D1 is na by na, then reshaped to 1 by N
  D2 = 0.25*sin(2*R)./R;
  D  = [D1 ; (D2(:))'];               % D2 is na by na; D is a 2 by N matrix of 2-D target vectors

The domain sampling points stored in X1 and X2 range from −2 to 1.75 in steps of 0.25 along each coordinate.

Scanning X1 and X2 column-wise and appending the bias inputs, we obtain the input matrix X, which is p×N:

  X = [X1(:)' ; X2(:)' ; ones(1,N)];

The first few training exemplars (columns of X and D) can be inspected by displaying the two matrices. The two 2-D target functions are plotted side-by-side, which distorts the domain, which in reality is the same for both functions, namely x_1, x_2 ∈ [−2, 2):

  surfc([X1-2 X1+2], [X2 X2], [D1 D2])

Random initialisation of the weight matrices:

  Wh = randn(L-1,p)/p;   % the hidden-layer weight matrix, (L-1) by p
  W  = randn(m,L)/L;     % the output-layer weight matrix, m by L
  C  = 200;              % maximum number of training epochs
  J  = zeros(m,C);       % memory allocation for the error function
  eta  = 0.003;          % training gain for the output layer
  etah = 0.001;          % training gain for the hidden layer

  for c = 1:C            % The main loop (fap2dm)

The forward pass:

    H  = ones(L-1,N)./(1+exp(-Wh*X));  % Hidden signals (L-1 by N)
    Hp = H.*(1-H);                     % Derivatives of the hidden signals
    H  = [H ; ones(1,N)];              % bias signal appended
    Y  = tanh(W*H);                    % Output signals (m by N)
    Yp = 1 - Y.^2;                     % Derivatives of the output signals

The backward pass:

    E  = D - Y;                        % The output errors (m by N)
    JJ = (sum((E.*E)'))';              % The total error after one epoch (the performance function, m by 1)
    delY = E.*Yp;                      % Output delta signals (m by N)
    dW   = delY*H';                    % Update of the output matrix; dW is m by L
    Eh   = W(:,1:L-1)'*delY;           % The back-propagated hidden error; Eh is (L-1) by N
    delH = Eh.*Hp;                     % Hidden delta signals ((L-1) by N)
    dWh  = delH*X';                    % Update of the hidden matrix; dWh is (L-1) by p
    W  = W  + eta*dW;                  % The batch update of the weights
    Wh = Wh + etah*dWh;

The two 2-D approximated functions are plotted after each epoch:

    D1(:) = Y(1,:)';  D2(:) = Y(2,:)';
    surfc([X1-2 X1+2], [X2 X2], [D1 D2])
    J(:,c) = JJ;
  end   % of the epoch loop

(The figure shows the approximation after a number of epochs; its title gives the current epoch number, the approximation error and the training gains.)

The sum of squared errors at the end of each training epoch is collected in a 2×C matrix. (The final figure plots the approximation error for each function at the end of each epoch against the number of training epochs.)

From this example you will note that the backpropagation algorithm is:

• painfully slow,
• sensitive to the weight initialisation,
• sensitive to the training gains.

We will address these problems in the subsequent sections. It is good to know that the best training algorithms can be two orders of magnitude faster than a basic backpropagation algorithm.