# Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training (Mark Gales)

## Transcription

University of Cambridge, Engineering Part IIB & EIST Part II, Paper I0: Advanced Pattern Processing

Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training

Mark Gales, October 200

[Figure: multi-layer perceptron with inputs $x_1, \ldots, x_d$, a first layer, a second layer, and an output layer producing $y_1(\mathbf{x}), \ldots, y_K(\mathbf{x})$.]

### Multi-Layer Perceptron

From the previous lecture, we need a multi-layer perceptron to handle the XOR problem. More generally, multi-layer perceptrons allow a neural network to perform arbitrary mappings.

[Figure: network with inputs $x_1, \ldots, x_d$, a first layer, a second layer, and an output layer producing $y_1(\mathbf{x}), \ldots, y_K(\mathbf{x})$.]

A 2-hidden-layer neural network is shown in the figure. The aim is to map an input vector $\mathbf{x}$ into an output $\mathbf{y}(\mathbf{x})$. The layers may be described as:

- Input layer: accepts the data vector or pattern.
- Hidden layers: one or more layers. Each accepts the outputs of the previous layer, weights them, and passes the result through a (normally non-linear) activation function.
- Output layer: takes the outputs of the final hidden layer, weights them, and possibly passes them through an output non-linearity, to produce the target values.

### Possible Decision Boundaries

The nature of the decision boundaries that may be produced varies with the network topology. Here only threshold activation functions (see the single layer perceptron) are used.

[Figure: example decision boundaries for cases (1), (2) and (3).]

There are three situations to consider.

1. Single layer: this is able to position a hyperplane in the input space.
2. Two layers (one hidden layer): this is able to describe a decision boundary which surrounds a single convex region of the input space.
3. Three layers (two hidden layers): this is able to generate arbitrary decision boundaries.

Note: any decision boundary can be approximated arbitrarily closely by a two layer network having sigmoidal activation functions.

### Number of Hidden Units

From the previous slide we can see that the number of hidden layers determines the decision boundaries that can be generated. In choosing the number of layers the following considerations are made.

- Multi-layer networks are harder to train than single layer networks.
- A two layer network (one hidden layer) with sigmoidal activation functions can model any decision boundary.

Two layer networks are most commonly used in pattern recognition (the hidden layer having sigmoidal activation functions).

How many units to have in each layer?

- The number of output units is determined by the number of output classes.
- The number of inputs is determined by the number of input dimensions.
- The number of hidden units is a design issue. The problems are:
  - too few, and the network will not model complex decision boundaries;
  - too many, and the network will have poor generalisation.

### Hidden Layer Perceptron

The form of the hidden, and the output, layer perceptron is a generalisation of the single layer perceptron from the previous lecture. Now the weighted input is passed to a general activation function, rather than a threshold function.

Consider a single perceptron. Assume that there are $n$ units at the previous level.

[Figure: a single perceptron with inputs $x_1, \ldots, x_n$, weights $w_{i1}, \ldots, w_{in}$ and bias $w_{i0}$, summed to give $z_i$ and passed through the activation function to give $y_i$.]

The output from the perceptron, $y_i$, may be written as

$$y_i = \phi(z_i) = \phi\left(w_{i0} + \sum_{j=1}^{n} w_{ij} x_j\right)$$

where $\phi()$ is the activation function.

We have already seen one example of an activation function, the threshold function. Other forms are also used in multi-layer perceptrons.

Note: the activation function is not necessarily non-linear. However, if linear activation functions are used, much of the power of the network is lost.

### Activation Functions

There are a variety of non-linear activation functions that may be used. Consider the general form

$$y_j = \phi(z_j)$$

where there are $n$ units (perceptrons) at the current level.

- Heaviside (or step) function:

  $$y_j = \begin{cases} 0, & z_j < 0 \\ 1, & z_j \geq 0 \end{cases}$$

  These are sometimes used in threshold units; the output is binary.

- Sigmoid (or logistic regression) function:

  $$y_j = \frac{1}{1 + \exp(-z_j)}$$

  The output is continuous, $0 \leq y_j \leq 1$.

- Softmax (or normalised exponential or generalised logistic) function:

  $$y_j = \frac{\exp(z_j)}{\sum_{i=1}^{n} \exp(z_i)}$$

  The output is positive and the sum of all the outputs at the current level is 1, $0 \leq y_j \leq 1$.

- Hyperbolic tan (or tanh) function:

  $$y_j = \frac{\exp(z_j) - \exp(-z_j)}{\exp(z_j) + \exp(-z_j)}$$

  The output is continuous, $-1 \leq y_j \leq 1$.
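The four activation functions listed on this slide can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names are chosen here for illustration, and the overflow guards in `sigmoid` and `softmax` are implementation choices that are not part of the handout.

```python
import math


def heaviside(z):
    """Step function: 0 for z < 0, 1 for z >= 0."""
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), arranged to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(zs):
    """Normalised exponential over all the units of one layer."""
    m = max(zs)  # subtracting the maximum does not change the result
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def tanh(z):
    """Hyperbolic tangent, with output between -1 and 1."""
    return math.tanh(z)
```

Note that softmax is the only one of the four whose output for unit $j$ depends on the activations of the other units in the layer, which is why it takes the whole list of $z$ values.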

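### Notation Used

Consider a multi-layer perceptron with:

- $d$-dimensional input data;
- $L$ hidden layers ($L + 1$ layers including the output layer);
- $N^{(k)}$ units in the $k$-th level;
- $K$-dimensional output.

[Figure: a single unit of layer $k$ with inputs $x_1, \ldots, x_{N^{(k-1)}}$, weights $w_{i1}, \ldots, w_{iN^{(k-1)}}$ and bias $w_{i0}$, summed to give $z_i$ and passed through the activation function to give $y_i$.]

The following notation will be used:

- $\mathbf{x}^{(k)}$ is the input to the $k$-th layer.
- $\tilde{\mathbf{x}}^{(k)}$ is the extended input to the $k$-th layer,

  $$\tilde{\mathbf{x}}^{(k)} = \begin{bmatrix} 1 \\ \mathbf{x}^{(k)} \end{bmatrix}$$

- $\mathbf{W}^{(k)}$ is the weight matrix of the $k$-th layer. By definition this is an $N^{(k)} \times N^{(k-1)}$ matrix.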

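### Notation (cont)

- $\tilde{\mathbf{W}}^{(k)}$ is the weight matrix, including the bias weight, of the $k$-th layer. By definition this is an $N^{(k)} \times (N^{(k-1)} + 1)$ matrix,

  $$\tilde{\mathbf{W}}^{(k)} = \begin{bmatrix} \mathbf{w}_0^{(k)} & \mathbf{W}^{(k)} \end{bmatrix}$$

- $\mathbf{z}^{(k)}$ is the $N^{(k)}$-dimensional vector defined as

  $$z_j^{(k)} = \tilde{\mathbf{w}}_j^{(k)\top} \tilde{\mathbf{x}}^{(k)}$$

  where $\tilde{\mathbf{w}}_j^{(k)\top}$ is the $j$-th row of $\tilde{\mathbf{W}}^{(k)}$.

- $\mathbf{y}^{(k)}$ is the output from the $k$-th layer, so

  $$y_j^{(k)} = \phi(z_j^{(k)})$$

All the hidden layer activation functions are assumed to be the same, $\phi()$. Initially we shall also assume that the output activation function is $\phi()$.

The following matrix notation feed-forward equations may then be used for a multi-layer perceptron with input $\mathbf{x}$ and output $\mathbf{y}(\mathbf{x})$:

$$\mathbf{x}^{(1)} = \mathbf{x}$$

$$\mathbf{x}^{(k)} = \mathbf{y}^{(k-1)}$$

$$\mathbf{z}^{(k)} = \tilde{\mathbf{W}}^{(k)} \tilde{\mathbf{x}}^{(k)}$$

$$\mathbf{y}^{(k)} = \phi(\mathbf{z}^{(k)})$$

$$\mathbf{y}(\mathbf{x}) = \mathbf{y}^{(L+1)}$$

where $1 \leq k \leq L + 1$.

The target values for the training of the networks will be denoted as $\mathbf{t}(\mathbf{x})$ for training example $\mathbf{x}$.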
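The feed-forward equations can be sketched directly in Python. This is a minimal illustration: each layer is stored as its extended weight matrix (a list of rows whose first entry is the bias weight), sigmoid activations are assumed throughout, and the `xor_weights` values are hand-picked for illustration rather than taken from the handout.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, weights, phi=sigmoid):
    """Propagate input x through the layers and return the network output.

    weights is a list of extended weight matrices, one per layer;
    row j of a matrix is [bias, w_j1, ..., w_jN] for unit j.
    """
    y = list(x)
    for W in weights:
        x_ext = [1.0] + y                                  # extended input
        z = [sum(w * v for w, v in zip(row, x_ext)) for row in W]
        y = [phi(zj) for zj in z]                          # layer output
    return y


# Hypothetical hand-picked weights: one hidden layer of two units that
# behave roughly like OR and NAND, combined by an AND-like output unit.
xor_weights = [
    [[-10.0, 20.0, 20.0], [30.0, -20.0, -20.0]],
    [[-30.0, 20.0, 20.0]],
]

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(forward([a, b], xor_weights)[0]))
```

With these weights the rounded output follows the XOR truth table, which is the mapping a single layer perceptron cannot represent.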

### Training Criteria

A variety of training criteria may be used. Assuming we have supervised training examples

$$\{\{\mathbf{x}_1, \mathbf{t}(\mathbf{x}_1)\}, \ldots, \{\mathbf{x}_n, \mathbf{t}(\mathbf{x}_n)\}\}$$

some standard examples are:

- Least squares error: one of the most common training criteria.

  $$E = \frac{1}{2} \sum_{p=1}^{n} \|\mathbf{y}(\mathbf{x}_p) - \mathbf{t}(\mathbf{x}_p)\|^2 = \frac{1}{2} \sum_{p=1}^{n} \sum_{i=1}^{K} \left(y_i(\mathbf{x}_p) - t_i(\mathbf{x}_p)\right)^2$$

  This may be derived from considering the targets as being corrupted by zero-mean Gaussian distributed noise.

- Cross-entropy for two classes: consider the case when $t(\mathbf{x})$ is binary (and softmax output). The expression is

  $$E = -\sum_{p=1}^{n} \left( t(\mathbf{x}_p) \log(y(\mathbf{x}_p)) + (1 - t(\mathbf{x}_p)) \log(1 - y(\mathbf{x}_p)) \right)$$

  This expression goes to zero with the perfect mapping.

- Cross-entropy for multiple classes: the above expression becomes (again softmax output)

  $$E = -\sum_{p=1}^{n} \sum_{i=1}^{K} t_i(\mathbf{x}_p) \log(y_i(\mathbf{x}_p))$$

  The minimum value is now non-zero; it represents the entropy of the target values.
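As a rough illustration of these criteria, the sketch below evaluates the least squares and multi-class cross-entropy costs for lists of network outputs and targets. The function names are illustrative, and skipping terms with zero target (so that $\log 0$ is never evaluated) is an implementation choice.

```python
import math


def least_squares(outputs, targets):
    """E = 1/2 * sum over examples and output units of (y - t)^2."""
    return 0.5 * sum(
        (y - t) ** 2
        for y_vec, t_vec in zip(outputs, targets)
        for y, t in zip(y_vec, t_vec)
    )


def cross_entropy(outputs, targets):
    """E = -sum over examples and classes of t * log(y)."""
    return -sum(
        t * math.log(y)
        for y_vec, t_vec in zip(outputs, targets)
        for y, t in zip(y_vec, t_vec)
        if t > 0  # terms with zero target contribute nothing
    )


ys = [[0.5, 0.5]]   # one example, two outputs
ts = [[1.0, 0.0]]   # 1-out-of-K target
print(least_squares(ys, ts), cross_entropy(ys, ts))
```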

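### Network Interpretation

We would like to be able to interpret the output of the network. Consider the case where a least squares error criterion is used. The training criterion is

$$E = \frac{1}{2} \sum_{p=1}^{n} \sum_{i=1}^{K} \left(y_i(\mathbf{x}_p) - t_i(\mathbf{x}_p)\right)^2$$

In the case of an infinite amount of training data, $n \to \infty$,

$$E = \frac{1}{2} \sum_{i=1}^{K} \iint \left(y_i(\mathbf{x}) - t_i(\mathbf{x})\right)^2 p(t_i(\mathbf{x}), \mathbf{x}) \, dt_i(\mathbf{x}) \, d\mathbf{x}$$

$$= \frac{1}{2} \sum_{i=1}^{K} \int \left[ \int \left(y_i(\mathbf{x}) - t_i(\mathbf{x})\right)^2 p(t_i(\mathbf{x}) | \mathbf{x}) \, dt_i(\mathbf{x}) \right] p(\mathbf{x}) \, d\mathbf{x}$$

Examining the term inside the square brackets:

$$\int \left(y_i(\mathbf{x}) - t_i(\mathbf{x})\right)^2 p(t_i(\mathbf{x}) | \mathbf{x}) \, dt_i(\mathbf{x})$$

$$= \int \left(y_i(\mathbf{x}) - E\{t_i(\mathbf{x}) | \mathbf{x}\} + E\{t_i(\mathbf{x}) | \mathbf{x}\} - t_i(\mathbf{x})\right)^2 p(t_i(\mathbf{x}) | \mathbf{x}) \, dt_i(\mathbf{x})$$

$$= \int \left[ \left(y_i(\mathbf{x}) - E\{t_i(\mathbf{x}) | \mathbf{x}\}\right)^2 + \left(E\{t_i(\mathbf{x}) | \mathbf{x}\} - t_i(\mathbf{x})\right)^2 \right] p(t_i(\mathbf{x}) | \mathbf{x}) \, dt_i(\mathbf{x})$$

$$\quad + \int 2 \left(y_i(\mathbf{x}) - E\{t_i(\mathbf{x}) | \mathbf{x}\}\right) \left(E\{t_i(\mathbf{x}) | \mathbf{x}\} - t_i(\mathbf{x})\right) p(t_i(\mathbf{x}) | \mathbf{x}) \, dt_i(\mathbf{x})$$

where

$$E\{t_i(\mathbf{x}) | \mathbf{x}\} = \int t_i(\mathbf{x}) \, p(t_i(\mathbf{x}) | \mathbf{x}) \, dt_i(\mathbf{x})$$

The cross term integrates to zero, since $y_i(\mathbf{x}) - E\{t_i(\mathbf{x}) | \mathbf{x}\}$ does not depend on $t_i(\mathbf{x})$ and the expected value of $E\{t_i(\mathbf{x}) | \mathbf{x}\} - t_i(\mathbf{x})$ is zero. We can therefore write the cost function as

$$E = \frac{1}{2} \sum_{i=1}^{K} \int \left(y_i(\mathbf{x}) - E\{t_i(\mathbf{x}) | \mathbf{x}\}\right)^2 p(\mathbf{x}) \, d\mathbf{x} + \frac{1}{2} \sum_{i=1}^{K} \int \left( E\{t_i(\mathbf{x})^2 | \mathbf{x}\} - \left(E\{t_i(\mathbf{x}) | \mathbf{x}\}\right)^2 \right) p(\mathbf{x}) \, d\mathbf{x}$$

The second term is not dependent on the weights, so is not affected by the optimisation scheme.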

### Network Interpretation (cont)

The first term in the previous expression is minimised when it equates to zero. This occurs when

$$y_i(\mathbf{x}) = E\{t_i(\mathbf{x}) | \mathbf{x}\}$$

The output of the network is the conditional average of the target data. This is the regression of $t_i(\mathbf{x})$ conditioned on $\mathbf{x}$.

[Figure: targets $t(x)$ and network output $y(x)$ plotted against $x$, with $y(x_i)$ and $p(t(x_i))$ marked at a point $x_i$.]

So the network models the posterior independent of the topology, but in practice this requires:

- an infinite amount of training data, or knowledge of the correct distribution for $\mathbf{x}$ (i.e. $p(\mathbf{x})$ is known or derivable from the training data);
- the topology of the network is complex enough that the final error is small;
- the training algorithm used to optimise the network is good, i.e. it finds the global minimum.

### Posterior Probabilities

Consider the multi-class classification training problem with:

- $d$-dimensional feature vectors: $\mathbf{x}$;
- $K$-dimensional output from the network: $\mathbf{y}(\mathbf{x})$;
- $K$-dimensional target: $\mathbf{t}(\mathbf{x})$.

We would like the output of the network, $\mathbf{y}(\mathbf{x})$, to approximate the posterior distribution of the set of $K$ classes. So

$$y_i(\mathbf{x}) \approx P(\omega_i | \mathbf{x})$$

Consider training a network with:

- mean squared error estimation;
- 1-out-of-$K$ coding, i.e.

  $$t_i(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{x} \in \omega_i \\ 0 & \text{if } \mathbf{x} \notin \omega_i \end{cases}$$

The network will act as a $d$-dimensional to $K$-dimensional mapping. Can we interpret the output of the network?

### Posterior Probabilities (cont)

From the previous regression network interpretation we know that

$$y_i(\mathbf{x}) = E\{t_i(\mathbf{x}) | \mathbf{x}\} = \int t_i(\mathbf{x}) \, p(t_i(\mathbf{x}) | \mathbf{x}) \, dt_i(\mathbf{x})$$

As we are using the 1-out-of-$K$ coding,

$$p(t_i(\mathbf{x}) | \mathbf{x}) = \sum_{j=1}^{K} \delta(t_i(\mathbf{x}) - \delta_{ij}) P(\omega_j | \mathbf{x})$$

where

$$\delta_{ij} = \begin{cases} 1, & (i = j) \\ 0, & \text{otherwise} \end{cases}$$

This results in

$$y_i(\mathbf{x}) = P(\omega_i | \mathbf{x})$$

as required.

The same limitations are placed on this proof as on the interpretation of the network for regression.

### Compensating for Different Priors

The standard approach described at the start of the course was to use Bayes' law to obtain the posterior probability

$$P(\omega_j | \mathbf{x}) = \frac{p(\mathbf{x} | \omega_j) P(\omega_j)}{p(\mathbf{x})}$$

where the class priors, $P(\omega_j)$, and class conditional densities, $p(\mathbf{x} | \omega_j)$, are estimated separately. For some tasks the two use different training data (for example for speech recognition, the language model and the acoustic model).

How can this difference in priors between the training and the test conditions be built into the neural network framework, where the posterior probability is directly calculated? Again using Bayes' law

$$p(\mathbf{x} | \omega_j) \propto \frac{P(\omega_j | \mathbf{x})}{P(\omega_j)}$$

Thus if the posterior is divided by the training data prior, a value proportional to the class-conditional probability can be obtained. The standard form of Bayes' rule may now be used with the appropriate, different, prior.
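The prior compensation described here amounts to a rescale followed by a renormalisation. The sketch below is one way it might be written; the function name and the example numbers are illustrative assumptions, not values from the handout.

```python
def compensate_priors(posteriors, train_priors, test_priors):
    """Re-weight network posteriors for a different set of class priors.

    Dividing by the training prior gives a value proportional to the
    class-conditional density p(x | class); multiplying by the new prior
    and normalising applies Bayes' rule again.
    """
    scaled = [
        post / p_train * p_test
        for post, p_train, p_test in zip(posteriors, train_priors, test_priors)
    ]
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical example: the network was trained on balanced classes,
# but class 2 is far more common under test conditions.
print(compensate_priors([0.8, 0.2], [0.5, 0.5], [0.1, 0.9]))
```

When the training and test priors are identical the posteriors are returned unchanged, which is a useful sanity check.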

### Error Back Propagation

Interest in multi-layer perceptrons (MLPs) resurfaced with the development of the error back propagation algorithm. This allows multi-layer perceptrons to be simply trained.

[Figure: single hidden layer network with inputs $x_1, \ldots, x_d$, a hidden layer, and an output layer producing $y_1(\mathbf{x}), \ldots, y_K(\mathbf{x})$.]

A single hidden layer network is shown in the figure. As previously mentioned, with sigmoidal activation functions arbitrary decision boundaries may be obtained with this network topology.

The error back propagation algorithm is based on gradient descent. Hence the activation function must be differentiable. Thus threshold and step units will not be considered. We need to be able to compute the derivative of the error function with respect to the weights of all layers.

All gradients in the next few slides are evaluated at the current model parameters.

### Single Layer Perceptron

Rather than examine the multi-layer case immediately, consider the following single layer perceptron.

[Figure: single layer perceptron with inputs $x_1, \ldots, x_d$, weights $w_1, \ldots, w_d$ and bias $w_0$, summed to give $z$ and passed through the activation function to give $y(\mathbf{x})$.]

We would like to minimise (for example) the square error between the target of the output, $t(\mathbf{x}_p)$, and the current output value $y(\mathbf{x}_p)$. Assume that the activation function is known to be a sigmoid function. The cost function may be written as

$$E = \frac{1}{2} \sum_{p=1}^{n} \left(y(\mathbf{x}_p) - t(\mathbf{x}_p)\right)^{\top} \left(y(\mathbf{x}_p) - t(\mathbf{x}_p)\right) = \sum_{p=1}^{n} E^{(p)}$$

To simplify notation, we will only consider a single observation $\mathbf{x}$ with associated target value $t(\mathbf{x})$ and current output from the network $y(\mathbf{x})$. The error with this single observation is denoted $E$.

The first question is how the error changes as we alter $y(\mathbf{x})$:

$$\frac{\partial E}{\partial y(\mathbf{x})} = y(\mathbf{x}) - t(\mathbf{x})$$

But we are not interested in $y(\mathbf{x})$ itself. How do we find the effect of varying the weights?

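### SLP Training (cont)

We can calculate the effect that a change in $z$ has on the error using the chain rule

$$\frac{\partial E}{\partial z} = \frac{\partial E}{\partial y(\mathbf{x})} \frac{\partial y(\mathbf{x})}{\partial z}$$

However, what we really want is the rate of change of the error with the weights (the parameters that we want to learn):

$$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial z} \frac{\partial z}{\partial w_i}$$

The error function therefore depends on the weight as

$$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial y(\mathbf{x})} \frac{\partial y(\mathbf{x})}{\partial z} \frac{\partial z}{\partial w_i}$$

All these expressions are known, so we can write

$$\frac{\partial E}{\partial w_i} = \left(y(\mathbf{x}) - t(\mathbf{x})\right) y(\mathbf{x}) \left(1 - y(\mathbf{x})\right) x_i$$

This has been computed for a single observation. We are interested in the complete training set. We know that the total error is the sum of the individual errors, so

$$\nabla E = \sum_{p=1}^{n} \left(y(\mathbf{x}_p) - t(\mathbf{x}_p)\right) y(\mathbf{x}_p) \left(1 - y(\mathbf{x}_p)\right) \tilde{\mathbf{x}}_p$$

where $\tilde{\mathbf{x}}_p$ is the extended input (a leading 1 followed by $\mathbf{x}_p$), so that the bias weight $w_0$ is included.

So for a single layer we can use gradient descent schemes to find the best weight values. However, we want to train multi-layer perceptrons!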
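The single layer perceptron gradient can be checked numerically. The sketch below assumes a sigmoid output unit and the least squares cost, stores the bias as `w[0]`, and uses made-up data; a central finite difference of the cost should agree with the analytic gradient.

```python
import math


def slp_output(w, x):
    """Sigmoid output of a single perceptron; w[0] is the bias weight."""
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def slp_error(w, data):
    """Least squares cost summed over (x, t) pairs."""
    return 0.5 * sum((slp_output(w, x) - t) ** 2 for x, t in data)


def slp_gradient(w, data):
    """Sum over examples of (y - t) * y * (1 - y) * extended input."""
    grad = [0.0] * len(w)
    for x, t in data:
        y = slp_output(w, x)
        common = (y - t) * y * (1.0 - y)
        x_ext = [1.0] + list(x)          # leading 1 handles the bias
        for i, xi in enumerate(x_ext):
            grad[i] += common * xi
    return grad


# Illustrative data and weights (not from the handout).
data = [([0.5, -1.0], 1.0), ([1.5, 2.0], 0.0)]
w = [0.1, -0.2, 0.3]
print(slp_gradient(w, data))
```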

### Error Back Propagation Algorithm

Now consider a particular node, $i$, of hidden layer $k$. Using the previously defined notation, the input to the node is $\tilde{\mathbf{x}}^{(k)}$ and the output is $y_i^{(k)}$.

[Figure: node $i$ of layer $k$ with inputs $x_1, \ldots, x_{N^{(k-1)}}$, weights $w_{i1}, \ldots, w_{iN^{(k-1)}}$ and bias $w_{i0}$, summed to give $z_i$ and passed through the activation function to give $y_i$.]

From the previous section we can simply derive the rate of change of the error function with the weights of the output layer. We now need to examine the rate of change with the $k$-th hidden layer weights.

A general error criterion, $E$, will be used. Furthermore we will not assume that $y_j^{(k)}$ only depends on $z_j^{(k)}$. For example, a softmax function may be used.

In terms of the derivations given, the output layer will be considered as the $(L + 1)$-th layer.

The training observations are assumed independent and so

$$E = \sum_{p=1}^{n} E^{(p)}$$

where $E^{(p)}$ is the error cost for the $p$-th observation and the observations are $\mathbf{x}_1, \ldots, \mathbf{x}_n$.

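### Error Back Propagation Algorithm (cont)

We are required to calculate $\partial E / \partial \tilde{w}_{ij}^{(k)}$ for all layers, $k$, and all rows and columns of $\tilde{\mathbf{W}}^{(k)}$. Applying the chain rule

$$\frac{\partial E}{\partial \tilde{w}_{ij}^{(k)}} = \frac{\partial E}{\partial z_i^{(k)}} \frac{\partial z_i^{(k)}}{\partial \tilde{w}_{ij}^{(k)}} = \delta_i^{(k)} \tilde{x}_j^{(k)}$$

In matrix notation we can write

$$\frac{\partial E}{\partial \tilde{\mathbf{W}}^{(k)}} = \boldsymbol{\delta}^{(k)} \tilde{\mathbf{x}}^{(k)\top}$$

We need to find a recursion for $\boldsymbol{\delta}^{(k)}$. Using the convention that element $(i, j)$ of $\partial \mathbf{a} / \partial \mathbf{b}$ is $\partial a_j / \partial b_i$,

$$\boldsymbol{\delta}^{(k)} = \frac{\partial E}{\partial \mathbf{z}^{(k)}} = \frac{\partial \mathbf{z}^{(k+1)}}{\partial \mathbf{z}^{(k)}} \frac{\partial E}{\partial \mathbf{z}^{(k+1)}} = \frac{\partial \mathbf{y}^{(k)}}{\partial \mathbf{z}^{(k)}} \frac{\partial \mathbf{z}^{(k+1)}}{\partial \mathbf{y}^{(k)}} \boldsymbol{\delta}^{(k+1)}$$

But we know from the forward recursions

$$\frac{\partial \mathbf{z}^{(k+1)}}{\partial \mathbf{y}^{(k)}} = \frac{\partial \mathbf{z}^{(k+1)}}{\partial \mathbf{x}^{(k+1)}} = \mathbf{W}^{(k+1)\top}$$

This yields the recursion

$$\boldsymbol{\delta}^{(k)} = \boldsymbol{\Lambda}^{(k)} \mathbf{W}^{(k+1)\top} \boldsymbol{\delta}^{(k+1)}$$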

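### Backward Error Recursion

where we define the activation derivative matrix for layer $k$ as

$$\boldsymbol{\Lambda}^{(k)} = \frac{\partial \mathbf{y}^{(k)}}{\partial \mathbf{z}^{(k)}} = \begin{bmatrix} \dfrac{\partial y_1^{(k)}}{\partial z_1^{(k)}} & \dfrac{\partial y_2^{(k)}}{\partial z_1^{(k)}} & \cdots & \dfrac{\partial y_{N^{(k)}}^{(k)}}{\partial z_1^{(k)}} \\ \dfrac{\partial y_1^{(k)}}{\partial z_2^{(k)}} & \dfrac{\partial y_2^{(k)}}{\partial z_2^{(k)}} & \cdots & \dfrac{\partial y_{N^{(k)}}^{(k)}}{\partial z_2^{(k)}} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial y_1^{(k)}}{\partial z_{N^{(k)}}^{(k)}} & \dfrac{\partial y_2^{(k)}}{\partial z_{N^{(k)}}^{(k)}} & \cdots & \dfrac{\partial y_{N^{(k)}}^{(k)}}{\partial z_{N^{(k)}}^{(k)}} \end{bmatrix}$$

This has given a matrix form of the backward recursion for the error back propagation algorithm.

We need to have an initialisation of the backward recursion. This will be from the output layer (layer $L + 1$):

$$\boldsymbol{\delta}^{(L+1)} = \frac{\partial E}{\partial \mathbf{z}^{(L+1)}} = \frac{\partial \mathbf{y}^{(L+1)}}{\partial \mathbf{z}^{(L+1)}} \frac{\partial E}{\partial \mathbf{y}^{(L+1)}} = \boldsymbol{\Lambda}^{(L+1)} \frac{\partial E}{\partial \mathbf{y}(\mathbf{x})}$$

$\boldsymbol{\Lambda}^{(L+1)}$ is the activation derivative matrix for the output layer.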

### Error Back Propagation

To calculate $\nabla E^{(p)} |_{\boldsymbol{\theta}[\tau]}$ (where $\boldsymbol{\theta}[\tau]$ is the set of current, training epoch $\tau$, values of the weights) we use the following algorithm.

1. Apply the input vector $\mathbf{x}_p$ to the network and use the feed-forward matrix equations to propagate the input forward through the network. For all layers this yields $\mathbf{y}^{(k)}$ and $\mathbf{z}^{(k)}$.
2. Compute the set of derivative matrices $\boldsymbol{\Lambda}^{(k)}$ for all layers.
3. Compute $\left. \dfrac{\partial E}{\partial \mathbf{y}(\mathbf{x})} \right|_{\boldsymbol{\theta}[\tau]}$ (the gradient at the output layer).
4. Using the back-propagation formulae, back propagate the derivatives through the network.

Having obtained the derivatives of the error function with respect to the weights of the network, we need a scheme to optimise the value of the weights.

The obvious choice is gradient descent.
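The four steps can be sketched for one training example as follows. This is a minimal illustration under stated assumptions: every layer uses a sigmoid activation (so each derivative matrix is diagonal with entries $y_j(1 - y_j)$ and is applied element-wise), the cost is least squares, and each layer is stored as an extended weight matrix whose first column holds the bias weights. The function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, weights):
    """Step 1: return the extended inputs and outputs of every layer."""
    inputs, outputs = [], []
    y = list(x)
    for W in weights:
        x_ext = [1.0] + y
        z = [sum(w * v for w, v in zip(row, x_ext)) for row in W]
        y = [sigmoid(zj) for zj in z]
        inputs.append(x_ext)
        outputs.append(y)
    return inputs, outputs


def error(x, t, weights):
    """Least squares cost for a single example."""
    y = forward(x, weights)[1][-1]
    return 0.5 * sum((yj - tj) ** 2 for yj, tj in zip(y, t))


def backprop(x, t, weights):
    """Return dE/dW for every layer (same shapes as weights)."""
    inputs, outputs = forward(x, weights)
    # Step 3: gradient of the cost with respect to the network output.
    dE_dy = [yj - tj for yj, tj in zip(outputs[-1], t)]
    grads = [None] * len(weights)
    for k in reversed(range(len(weights))):
        # Step 2: for sigmoid units Lambda is diagonal, y * (1 - y).
        delta = [yj * (1.0 - yj) * g for yj, g in zip(outputs[k], dE_dy)]
        # Weight gradient: outer product of delta and the extended input.
        grads[k] = [[d * xj for xj in inputs[k]] for d in delta]
        # Step 4: pass W' delta back (the bias column 0 is skipped).
        n_prev = len(inputs[k]) - 1
        dE_dy = [
            sum(weights[k][i][j + 1] * delta[i] for i in range(len(delta)))
            for j in range(n_prev)
        ]
    return grads
```

For a softmax output layer the derivative matrix is no longer diagonal, so the element-wise product used here would have to be replaced by a full matrix-vector product.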

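### Gradient Descent

Having found an expression for the gradient, gradient descent may be used to find the values of the weights.

Initially consider a batch update rule. Here

$$\tilde{\mathbf{w}}_i^{(k)}[\tau + 1] = \tilde{\mathbf{w}}_i^{(k)}[\tau] - \eta \left. \frac{\partial E}{\partial \tilde{\mathbf{w}}_i^{(k)}} \right|_{\boldsymbol{\theta}[\tau]}$$

where $\boldsymbol{\theta}[\tau] = \{\tilde{\mathbf{W}}^{(1)}[\tau], \ldots, \tilde{\mathbf{W}}^{(L+1)}[\tau]\}$, $\tilde{\mathbf{w}}_i^{(k)}[\tau]$ is the $i$-th row of $\tilde{\mathbf{W}}^{(k)}$ at training epoch $\tau$, and

$$\left. \frac{\partial E}{\partial \tilde{\mathbf{w}}_i^{(k)}} \right|_{\boldsymbol{\theta}[\tau]} = \sum_{p=1}^{n} \left. \frac{\partial E^{(p)}}{\partial \tilde{\mathbf{w}}_i^{(k)}} \right|_{\boldsymbol{\theta}[\tau]}$$

If the total number of weights in the system is $N$, then all $N$ derivatives may be calculated in $O(N)$ operations with memory requirements $O(N)$.

However, in common with other gradient descent schemes, there are problems:

- we need a value of $\eta$ that achieves a stable, fast descent;
- the error surface may have local minima, maxima and saddle points.

This has led to refinements of gradient descent.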

### Training Schemes

On the previous slide the weights were updated after all $n$ training examples had been seen. This is not the only scheme that may be used.

- Batch update: the weights are updated after all the training examples have been seen. Thus

  $$\tilde{\mathbf{w}}_i^{(k)}[\tau + 1] = \tilde{\mathbf{w}}_i^{(k)}[\tau] - \eta \sum_{p=1}^{n} \left. \frac{\partial E^{(p)}}{\partial \tilde{\mathbf{w}}_i^{(k)}} \right|_{\boldsymbol{\theta}[\tau]}$$

- Sequential update: the weights are updated after every sample. Now

  $$\tilde{\mathbf{w}}_i^{(k)}[\tau + 1] = \tilde{\mathbf{w}}_i^{(k)}[\tau] - \eta \left. \frac{\partial E^{(p)}}{\partial \tilde{\mathbf{w}}_i^{(k)}} \right|_{\boldsymbol{\theta}[\tau]}$$

  and we cycle around the training vectors.

There are some advantages to the sequential form of update.

- It is not necessary to store the whole training database. Samples may be used only once if desired.
- It may be used for online learning.
- In dynamic systems the values of the weights can be updated to track the system.

In practice, forms of batch training are often used.
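The difference between the two schemes is easiest to see on a toy problem. The sketch below assumes a one-weight linear model $y = wx$ with a least squares cost, which is far simpler than an MLP and is used only to contrast the update order; the data and learning rate are made up.

```python
def batch_epoch(w, data, eta):
    """One batch update: sum the per-example gradients, then step once."""
    grad = sum((w * x - t) * x for x, t in data)
    return w - eta * grad


def sequential_epoch(w, data, eta):
    """One pass of sequential updates: step after every example."""
    for x, t in data:
        w = w - eta * (w * x - t) * x
    return w


# Illustrative data generated by t = 2x, so the best weight is 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w_batch = w_seq = 0.0
for _ in range(50):
    w_batch = batch_epoch(w_batch, data, eta=0.05)
    w_seq = sequential_epoch(w_seq, data, eta=0.05)
print(w_batch, w_seq)
```

Both schemes approach the same weight here; the sequential version never needs the whole data set at once, which is the advantage noted in the list above.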

### Refining Gradient Descent

There are some simple techniques to refine standard gradient descent. First consider the learning rate $\eta$. We can make this vary with each iteration. One of the simplest rules is to use

$$\eta[\tau + 1] = \begin{cases} 1.1 \, \eta[\tau]; & \text{if } E(\boldsymbol{\theta}[\tau]) < E(\boldsymbol{\theta}[\tau - 1]) \\ 0.5 \, \eta[\tau]; & \text{if } E(\boldsymbol{\theta}[\tau]) > E(\boldsymbol{\theta}[\tau - 1]) \end{cases}$$

In words: if the previous value of $\eta[\tau]$ decreased the value of the cost function, then increase $\eta[\tau]$. If the previous value of $\eta[\tau]$ increased the cost function ($\eta[\tau]$ too large), then decrease $\eta[\tau]$.

It is also possible to add a momentum term to the optimisation (common in MLP estimation). The update formula is

$$\tilde{\mathbf{w}}_i^{(k)}[\tau + 1] = \tilde{\mathbf{w}}_i^{(k)}[\tau] + \Delta \tilde{\mathbf{w}}_i^{(k)}[\tau]$$

where

$$\Delta \tilde{\mathbf{w}}_i^{(k)}[\tau] = -\eta[\tau + 1] \left. \frac{\partial E}{\partial \tilde{\mathbf{w}}_i^{(k)}} \right|_{\boldsymbol{\theta}[\tau]} + \alpha[\tau] \, \Delta \tilde{\mathbf{w}}_i^{(k)}[\tau - 1]$$

The use of the momentum term, $\alpha[\tau]$:

- smooths successive updates;
- helps avoid small local minima.

Unfortunately it introduces an additional tunable parameter to set. Also, if we are lucky and hit the minimum solution, we will overshoot.
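The adaptive learning rate rule and the momentum update can each be written as a small function. This is a sketch with illustrative names; the factors 1.1 and 0.5 follow the rule on this slide, while the quadratic test cost and starting values in the loop are assumptions made only to show how the two pieces fit together.

```python
def adapt_rate(eta, e_new, e_old, up=1.1, down=0.5):
    """Grow eta if the cost fell, shrink it if the cost rose."""
    if e_new < e_old:
        return up * eta
    if e_new > e_old:
        return down * eta
    return eta


def momentum_step(w, grad, dw_prev, eta, alpha):
    """Apply dw = -eta * grad + alpha * dw_prev and return (w_new, dw)."""
    dw = [-eta * g + alpha * d for g, d in zip(grad, dw_prev)]
    w_new = [wi + di for wi, di in zip(w, dw)]
    return w_new, dw


# Hypothetical cost E(w) = 0.5 * (w1^2 + 10 * w2^2) used for illustration.
def cost(w):
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)


def cost_grad(w):
    return [w[0], 10.0 * w[1]]


w, dw, eta = [1.0, 1.0], [0.0, 0.0], 0.05
e_old = cost(w)
for _ in range(20):
    w, dw = momentum_step(w, cost_grad(w), dw, eta, alpha=0.5)
    e_new = cost(w)
    eta = adapt_rate(eta, e_new, e_old)
    e_old = e_new
print(w, eta)
```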

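### Quadratic Approximation

Gradient descent makes use of first-order terms of the error function. What about higher order techniques? Consider the vector form of the Taylor series

$$E(\boldsymbol{\theta}) = E(\boldsymbol{\theta}[\tau]) + (\boldsymbol{\theta} - \boldsymbol{\theta}[\tau])^{\top} \mathbf{g} + \frac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}[\tau])^{\top} \mathbf{H} (\boldsymbol{\theta} - \boldsymbol{\theta}[\tau]) + O(3)$$

where

$$\mathbf{g} = \left. \nabla E(\boldsymbol{\theta}) \right|_{\boldsymbol{\theta}[\tau]}$$

and

$$(\mathbf{H})_{ij} = h_{ij} = \left. \frac{\partial^2 E(\boldsymbol{\theta})}{\partial w_i \partial w_j} \right|_{\boldsymbol{\theta}[\tau]}$$

Ignoring higher order terms we find

$$\nabla E(\boldsymbol{\theta}) = \mathbf{g} + \mathbf{H} (\boldsymbol{\theta} - \boldsymbol{\theta}[\tau])$$

Equating this to zero, we find that the value of $\boldsymbol{\theta}$ at this point, $\boldsymbol{\theta}[\tau + 1]$, is

$$\boldsymbol{\theta}[\tau + 1] = \boldsymbol{\theta}[\tau] - \mathbf{H}^{-1} \mathbf{g}$$

This gives us a simple update rule. The direction $\mathbf{H}^{-1} \mathbf{g}$ is known as the Newton direction.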
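For an exactly quadratic cost the Newton update reaches the minimum in a single step, which the sketch below illustrates in two dimensions. The cost $E(\boldsymbol{\theta}) = \frac{1}{2} \boldsymbol{\theta}^{\top} \mathbf{A} \boldsymbol{\theta} - \mathbf{b}^{\top} \boldsymbol{\theta}$ and the values of `A` and `b` are assumptions chosen for illustration; the 2x2 inverse is written out by hand to keep the example free of external libraries.

```python
def gradient(theta, A, b):
    """Gradient of 0.5 * theta' A theta - b' theta, i.e. A theta - b."""
    return [
        A[0][0] * theta[0] + A[0][1] * theta[1] - b[0],
        A[1][0] * theta[0] + A[1][1] * theta[1] - b[1],
    ]


def newton_step(theta, g, H):
    """Return theta - H^{-1} g for a 2x2 Hessian H."""
    (a, b), (c, d) = H
    det = a * d - b * c
    step = [(d * g[0] - b * g[1]) / det, (-c * g[0] + a * g[1]) / det]
    return [theta[0] - step[0], theta[1] - step[1]]


A = [[3.0, 1.0], [1.0, 2.0]]   # Hessian of the quadratic cost
b = [1.0, 1.0]
theta = [5.0, -3.0]            # arbitrary starting point

theta_new = newton_step(theta, gradient(theta, A, b), A)
print(theta_new, gradient(theta_new, A, b))
```

The gradient at the new point is zero up to rounding, whatever the starting point; for a non-quadratic error surface this is only approximately true near a minimum.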

### Problems with the Hessian

In practice the use of the Hessian is limited.

1. The evaluation of the Hessian may be computationally expensive, as $O(N^2)$ parameters must be accumulated for each of the $n$ training samples.
2. The Hessian must be inverted to find the direction, $O(N^3)$. This gets very expensive as $N$ gets large.
3. The direction given need not head towards a minimum; it could head towards a maximum or saddle point. This occurs if the Hessian is not positive definite, where positive definite means that $\mathbf{v}^{\top} \mathbf{H} \mathbf{v} > 0$ for all $\mathbf{v} \neq \mathbf{0}$.
4. If the surface is highly non-quadratic, the step sizes may be too large and the optimisation becomes unstable.

Approximations to the Hessian are commonly used. The simplest approximation is to assume that the Hessian is diagonal. This makes the Hessian simple to invert (provided the diagonal elements are non-zero) and only requires $N$ parameters.

The Hessian may be made positive definite using

$$\tilde{\mathbf{H}} = \mathbf{H} + \lambda \mathbf{I}$$

If $\lambda$ is large enough then $\tilde{\mathbf{H}}$ is positive definite.

### Improved Learning Rates

Rather than having a single learning rate for all weights in the system, weight-specific rates may be used without using the Hessian. All schemes will make use of

$$g_{ij}[\tau] = \left. \frac{\partial E}{\partial w_{ij}} \right|_{\boldsymbol{\theta}[\tau]}$$

- Delta-delta: we might want to increase the learning rate when two consecutive gradients have the same sign. This may be implemented as

  $$\Delta \eta_{ij}[\tau] = \gamma \, g_{ij}[\tau] \, g_{ij}[\tau - 1]$$

  where $\gamma > 0$. Unfortunately this can take the learning rate negative (depending on the value of $\gamma$)!

- Delta-bar-delta: refines delta-delta so that

  $$\Delta \eta_{ij}[\tau] = \begin{cases} \kappa, & \text{if } \bar{g}_{ij}[\tau - 1] \, g_{ij}[\tau] > 0 \\ -\gamma \, \eta_{ij}[\tau - 1], & \text{if } \bar{g}_{ij}[\tau - 1] \, g_{ij}[\tau] < 0 \end{cases}$$

  where

  $$\bar{g}_{ij}[\tau] = (1 - \beta) \, g_{ij}[\tau] + \beta \, \bar{g}_{ij}[\tau - 1]$$

  One of the drawbacks with this scheme is that three parameters, $\gamma$, $\kappa$ and $\beta$, must be selected.

- Quickprop: here

  $$\Delta w_{ij}[\tau + 1] = \frac{g_{ij}[\tau]}{g_{ij}[\tau - 1] - g_{ij}[\tau]} \, \Delta w_{ij}[\tau]$$
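Two of these schemes can be sketched for a single weight. The code below assumes the standard forms of the delta-bar-delta and quickprop rules (an additive increase $\kappa$, a multiplicative decrease by $\gamma$, and an exponentially smoothed gradient); the function names and the numbers in the usage example are illustrative.

```python
def delta_bar_delta(eta, g, g_bar_prev, kappa, gamma, beta):
    """Update one weight's learning rate; return (eta_new, g_bar)."""
    product = g_bar_prev * g
    if product > 0:
        eta = eta + kappa            # consistent sign: additive increase
    elif product < 0:
        eta = eta - gamma * eta      # sign change: multiplicative decrease
    g_bar = (1.0 - beta) * g + beta * g_bar_prev
    return eta, g_bar


def quickprop_step(g, g_prev, dw):
    """Next weight change: g / (g_prev - g) times the last change."""
    return g / (g_prev - g) * dw


# Illustrative 1-D quadratic E(w) = (w - 3)^2 with gradient 2 * (w - 3).
def grad(w):
    return 2.0 * (w - 3.0)


w_prev, w = 0.0, 1.0
dw_next = quickprop_step(grad(w), grad(w_prev), w - w_prev)
print(w + dw_next)
```

On a one-dimensional quadratic the quickprop step lands on the minimum directly, since the rule fits a parabola through the two most recent gradients. The rule is undefined when the two gradients are equal, a case this sketch does not guard against.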

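### Conjugate Directions

Assume that we have optimised in one direction, $\mathbf{d}[\tau]$. We know that

$$\frac{\partial}{\partial \lambda} E(\boldsymbol{\theta}[\tau] + \lambda \mathbf{d}[\tau]) = 0$$

What direction should we now optimise in?

If we work out the gradient at this new point $\boldsymbol{\theta}[\tau + 1]$, we know that

$$\nabla E(\boldsymbol{\theta}[\tau + 1])^{\top} \mathbf{d}[\tau] = 0$$

Is this the best direction? No.

What we really want is that, as we move off in the new direction $\mathbf{d}[\tau + 1]$, the gradient in the previous direction, $\mathbf{d}[\tau]$, remains zero. In other words

$$\nabla E(\boldsymbol{\theta}[\tau + 1] + \lambda \mathbf{d}[\tau + 1])^{\top} \mathbf{d}[\tau] = 0$$

Using a first order Taylor series expansion

$$\left( \nabla E(\boldsymbol{\theta}[\tau + 1]) + \lambda \, \mathbf{H} \, \mathbf{d}[\tau + 1] \right)^{\top} \mathbf{d}[\tau] = 0$$

where $\mathbf{H}$ is the (symmetric) Hessian evaluated at $\boldsymbol{\theta}[\tau + 1]$. Since the first term is already zero, the following constraint is satisfied for a conjugate gradient

$$\mathbf{d}[\tau + 1]^{\top} \mathbf{H} \, \mathbf{d}[\tau] = 0$$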
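The conjugacy constraint can be checked numerically. The sketch below builds a new direction from the negative gradient plus a multiple of the previous direction, choosing the multiple so that the constraint holds. It uses the Hessian explicitly, which is an assumption made purely for illustration and is not how the practical algorithm works; the matrix and vectors are made up.

```python
def mat_vec(H, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conjugate_direction(g, d_prev, H):
    """Return d = -g + beta * d_prev with d' H d_prev = 0."""
    Hd = mat_vec(H, d_prev)
    beta = dot(g, Hd) / dot(d_prev, Hd)
    return [-gi + beta * di for gi, di in zip(g, d_prev)]


H = [[3.0, 1.0], [1.0, 2.0]]   # illustrative symmetric Hessian
d_prev = [1.0, 0.0]            # previous search direction
g = [1.0, 1.0]                 # gradient at the new point

d_new = conjugate_direction(g, d_prev, H)
print(d_new, dot(d_new, mat_vec(H, d_prev)))
```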

### Conjugate Gradient (cont)

[Figure: successive search directions $\mathbf{d}[\tau]$ and $\mathbf{d}[\tau + 1]$, moving from $\boldsymbol{\theta}[\tau]$ to $\boldsymbol{\theta}[\tau + 1]$.]

Fortunately the conjugate direction can be calculated without explicitly computing the Hessian. This leads to the conjugate gradient descent algorithm (see the book by Bishop for details).

### Input Transformations

If the inputs to the network are not normalised, the training time may become very large. The data is therefore normalised. Here the data is transformed to

$$\tilde{x}_{pi} = \frac{x_{pi} - \mu_i}{\sigma_i}$$

where

$$\mu_i = \frac{1}{n} \sum_{p=1}^{n} x_{pi}$$

and

$$\sigma_i^2 = \frac{1}{n} \sum_{p=1}^{n} (x_{pi} - \mu_i)^2$$

The transformed data has zero mean and unit variance.

This transformation may be generalised to whitening. Here the covariance matrix of the original data is calculated. The data is then decorrelated and the mean subtracted. This results in data with zero mean and an identity covariance matrix.
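The per-dimension normalisation can be sketched as below. The function name and data are illustrative, and the code assumes every input dimension has non-zero variance (a constant dimension would cause a division by zero).

```python
import math


def standardise(data):
    """Return a copy of data with zero mean and unit variance per dimension.

    data is a list of n examples, each a list of d input values.
    """
    n, d = len(data), len(data[0])
    means = [sum(row[i] for row in data) / n for i in range(d)]
    stds = [
        math.sqrt(sum((row[i] - means[i]) ** 2 for row in data) / n)
        for i in range(d)
    ]
    return [[(row[i] - means[i]) / stds[i] for i in range(d)] for row in data]


# Illustrative data with very different scales in the two dimensions.
raw = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
print(standardise(raw))
```

In use, the means and standard deviations would be computed on the training data and then applied unchanged to test data.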

### Generalisation

In the vast majority of pattern processing tasks, the aim is not to get good performance on the training data (in supervised training we know the classes!), but to get good performance on some previously unseen test data.

[Figure: error rate against number of parameters, for the training set and for future data.]

Typically the performance behaves as in the figure. As the number of model parameters, $N$, increases, the training data likelihood is guaranteed to increase (provided the parameter optimisation scheme is sensible). The associated classification error rate usually decreases.

However, the test (future) set performance has a maximum in the likelihood and an associated minimum in error rate. Beyond this point the likelihood decreases and the error rate increases.

The objective in any pattern classification task is to have the minimum test set (future data) error rate.

### Regularisation

One of the major issues with training neural networks is how to ensure generalisation. One commonly used technique is weight decay. A regulariser may be used. Here

$$\Omega = \frac{1}{2} \sum_{i=1}^{N} w_i^2$$

where $N$ is the total number of weights in the network. A new error function is defined

$$\tilde{E} = E + \nu \Omega$$

Using gradient descent on this gives

$$\nabla \tilde{E} = \nabla E + \nu \mathbf{w}$$

The effect of the regularisation term $\Omega$ is to penalise very large weights. Empirically this has resulted in improved performance.

Rather than using an explicit regularisation term, the complexity of the network can be controlled by training with noise.

For batch training we replicate each of the samples multiple times and add a different noise vector to each of the samples. If we use least squares training with a zero mean noise source (equal variance $\nu$ in all the dimensions) the error function may be shown to have the form

$$\tilde{E} = E + \nu \Omega$$

This is another form of regularisation.
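Weight decay changes the gradient by a single extra term. The sketch below shows the penalty and the regularised gradient for a flat list of weights; the function names and numbers are illustrative, and in practice `grad_e` would come from back propagation.

```python
def weight_decay_penalty(weights):
    """Omega = 1/2 * sum of squared weights."""
    return 0.5 * sum(w * w for w in weights)


def regularised_gradient(grad_e, weights, nu):
    """Gradient of E + nu * Omega, i.e. grad E + nu * w."""
    return [g + nu * w for g, w in zip(grad_e, weights)]


weights = [0.5, -1.0]
grad_e = [1.0, 2.0]     # illustrative gradient of the unregularised cost
print(weight_decay_penalty(weights))
print(regularised_gradient(grad_e, weights, nu=0.1))
```

The extra term pulls every weight towards zero at each update, by an amount proportional to the weight itself.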

### Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 6: Multi-Layer Perceptrons I

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 6: Multi-Layer Perceptrons I Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 2012 Engineering Part IIB: Module 4F10 Introduction In

### Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers

Engineering Part IIB: Module 4F0 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 202 Engineering Part IIB:

### Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

### Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks

C.M. Bishop s PRML: Chapter 5; Neural Networks Introduction The aim is, as before, to find useful decompositions of the target variable; t(x) = y(x, w) + ɛ(x) (3.7) t(x n ) and x n are the observations,

### Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

### Machine Learning Lecture 5

Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

### Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V

### Neural Network Training

Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

### Multi-layer Neural Networks

Multi-layer Neural Networks Steve Renals Informatics 2B Learning and Data Lecture 13 8 March 2011 Informatics 2B: Learning and Data Lecture 13 Multi-layer Neural Networks 1 Overview Multi-layer neural

### Linear & nonlinear classifiers

Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

### Artificial Neural Networks. MGS Lecture 2

Artificial Neural Networks MGS 2018 - Lecture 2 OVERVIEW Biological Neural Networks Cell Topology: Input, Output, and Hidden Layers Functional description Cost functions Training ANNs Back-Propagation

### Multilayer Perceptron

Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Single Perceptron 3 Boolean Function Learning 4

### Computational statistics

Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

### 4. Multilayer Perceptrons

4. Multilayer Perceptrons This is a supervised error-correction learning algorithm. 1 4.1 Introduction A multilayer feedforward network consists of an input layer, one or more hidden layers, and an output

### Introduction to Natural Computation. Lecture 9. Multilayer Perceptrons and Backpropagation. Peter Lewis

Introduction to Natural Computation Lecture 9 Multilayer Perceptrons and Backpropagation Peter Lewis 1 / 25 Overview of the Lecture Why multilayer perceptrons? Some applications of multilayer perceptrons.

### Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

### Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

### Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

### Linear & nonlinear classifiers

Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

### Introduction to Neural Networks

Introduction to Neural Networks Steve Renals Automatic Speech Recognition ASR Lecture 10 24 February 2014 ASR Lecture 10 Introduction to Neural Networks 1 Neural networks for speech recognition Introduction

### 1 What a Neural Network Computes

Neural Networks 1 What a Neural Network Computes To begin with, we will discuss fully connected feed-forward neural networks, also known as multilayer perceptrons. A feedforward neural network consists

### Neural Networks and the Back-propagation Algorithm

Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely

### Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class. Classification: Problem Statement 3 In regression, we are modeling the relationship

### NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

### Machine Learning Lecture 7

Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

### LINEAR MODELS FOR CLASSIFICATION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception

LINEAR MODELS FOR CLASSIFICATION Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification,

### University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians

Engineering Part IIB: Module 4F10 Statistical Pattern Processing University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians. Generative Model Decision

### Ch 4. Linear Models for Classification

Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro,

### MLPR: Logistic Regression and Neural Networks

MLPR: Logistic Regression and Neural Networks Machine Learning and Pattern Recognition Amos Storkey Amos Storkey MLPR: Logistic Regression and Neural Networks 1/28 Outline 1 Logistic Regression 2 Multi-layer

### Outline. MLPR: Logistic Regression and Neural Networks Machine Learning and Pattern Recognition. Which is the correct model? Recap.

Outline MLPR: and Neural Networks Machine Learning and Pattern Recognition 2 Amos Storkey Amos Storkey MLPR: and Neural Networks /28 Recap Amos Storkey MLPR: and Neural Networks 2/28 Which is the correct

### Linear Models for Classification

Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

### Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

### y(x n, w) t n 2. (1)

Network training: Training a neural network involves determining the weight parameter vector w that minimizes a cost function. Given a training set comprising a set of input vector {x n }, n = 1,...N,

### Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:

### ECE521 Lectures 9 Fully Connected Neural Networks

ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance

### Lecture 10. Neural networks and optimization. Machine Learning and Data Mining November Nando de Freitas UBC. Nonlinear Supervised Learning

Lecture 10 Neural networks and optimization Machine Learning and Data Mining November 2009 UBC Gradient Searching for a good solution can be interpreted as looking for a minimum of some error (loss) function

### NEURAL NETWORKS

5 Neural Networks In Chapters 3 and 4 we considered models for regression and classification that comprised linear combinations of fixed basis functions. We saw that such models have useful analytical

### Neural Networks and Deep Learning

Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

### Logistic Regression. COMP 527 Danushka Bollegala

Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will

### Feed-forward Networks Network Training Error Backpropagation Applications. Neural Networks. Oliver Schulte - CMPT 726. Bishop PRML Ch.

Neural Networks Oliver Schulte - CMPT 726 Bishop PRML Ch. 5 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of biological plausibility We will

### Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications

Neural Networks Bishop PRML Ch. 5 Alireza Ghane Neural Networks Alireza Ghane / Greg Mori 1 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of

### Advanced statistical methods for data analysis Lecture 2

Advanced statistical methods for data analysis Lecture 2 RHUL Physics www.pp.rhul.ac.uk/~cowan Universität Mainz Klausurtagung des GK Eichtheorien exp. Tests... Bullay/Mosel 15-17 September, 2008 1 Outline

### Artificial Neural Networks (ANN)

Artificial Neural Networks (ANN) Edmondo Trentin April 17, 2013 ANN: Definition The definition of ANN is given in 3.1 points. Indeed, an ANN is a machine that is completely specified once we define its:

### University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians

University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians Mark Gales mjfg@eng.cam.ac.uk Michaelmas 2 2 Engineering

### Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable

### Artificial Neural Networks

Artificial Neural Networks Stephan Dreiseitl University of Applied Sciences Upper Austria at Hagenberg Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support Knowledge

### Machine Learning. 7. Logistic and Linear Regression

Sapienza University of Rome, Italy - Machine Learning (27/28) University of Rome La Sapienza Master in Artificial Intelligence and Robotics Machine Learning 7. Logistic and Linear Regression Luca Iocchi,

### Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

### Statistical Machine Learning from Data

January 17, 2006 Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Multi-Layer Perceptrons Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole

### Neural Networks Lecture 4: Radial Bases Function Networks

Neural Networks Lecture 4: Radial Bases Function Networks H.A Talebi Farzaneh Abdollahi Department of Electrical Engineering Amirkabir University of Technology Winter 2011 H. A. Talebi, Farzaneh Abdollahi

### Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Jan Drchal Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science Topics covered

### Deep Neural Networks

Deep Neural Networks DT2118 Speech and Speaker Recognition Giampiero Salvi KTH/CSC/TMH giampi@kth.se VT 2015 1 / 45 Outline State-to-Output Probability Model Artificial Neural Networks Perceptron Multi

### Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Linear Classification CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Example of Linear Classification Red points: patterns belonging

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics rsalakhu@utstat.toronto.edu http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

### Feed-forward Network Functions

Feed-forward Network Functions Sargur Srihari Topics 1. Extension of linear models 2. Feed-forward Network Functions 3. Weight-space symmetries 2 Recap of Linear Models Linear Models for Regression, Classification

### Computational Intelligence Winter Term 2017/18

Computational Intelligence Winter Term 2017/18 Prof. Dr. Günter Rudolph Lehrstuhl für Algorithm Engineering (LS ) Fakultät für Informatik TU Dortmund Plan for Today Single-Layer Perceptron Accelerated Learning

### Deep Feedforward Networks

Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

### Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

### Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 305 Part VII

### Linear discriminant functions

Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative

### Regression with Numerical Optimization. Logistic

CSG220 Machine Learning Fall 2008 Regression with Numerical Optimization. Logistic regression Regression with Numerical Optimization. Logistic regression based on a document by Andrew Ng October 3, 204

### Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters

### Lecture 7 Artificial neural networks: Supervised learning

Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in

### Lecture 5 Neural models for NLP

CS546: Machine Learning in NLP (Spring 2018) http://courses.engr.illinois.edu/cs546/ Lecture 5 Neural models for NLP Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Office hours: Tue/Thu 2pm-3pm

### Neural Networks, Computation Graphs. CMSC 470 Marine Carpuat

Neural Networks, Computation Graphs CMSC 470 Marine Carpuat Binary Classification with a Multi-layer Perceptron φ A = 1 φ site = 1 φ located = 1 φ Maizuru = 1 φ, = 2 φ in = 1 φ Kyoto = 1 φ priest = 0 φ

### Day 3 Lecture 3. Optimizing deep networks

Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α ∈ [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient

### Introduction to Machine Learning

Introduction to Machine Learning Neural Networks Varun Chandola x x 5 Input Outline Contents February 2, 207 Extending Perceptrons 2 Multi Layered Perceptrons 2 2. Generalizing to Multiple Labels.................

### A summary of Deep Learning without Poor Local Minima

A summary of Deep Learning without Poor Local Minima by Kenji Kawaguchi MIT oral presentation at NIPS 2016 Learning Supervised (or Predictive) learning Learn a mapping from inputs x to outputs y, given

### Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks. Cannot approximate (learn) non-linear functions

BACK-PROPAGATION NETWORKS Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks Cannot approximate (learn) non-linear functions Difficult (if not impossible) to design

### Introduction to Neural Networks

CUONG TUAN NGUYEN SEIJI HOTTA MASAKI NAKAGAWA Tokyo University of Agriculture and Technology Copyright by Nguyen, Hotta and Nakagawa 1 Pattern classification Which category of an input? Example: Character

### Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods

### Week 5: Logistic Regression & Neural Networks

Week 5: Logistic Regression & Neural Networks Instructor: Sergey Levine 1 Summary: Logistic Regression In the previous lecture, we covered logistic regression. To recap, logistic regression models and

### MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

### Multilayer Perceptrons and Backpropagation

Multilayer Perceptrons and Backpropagation Informatics 1 CG: Lecture 7 Chris Lucas School of Informatics University of Edinburgh January 31, 2017 (Slides adapted from Mirella Lapata's.) 1 / 33 Reading:

### Lecture 17: Neural Networks and Deep Learning

UVA CS 6316 / CS 4501-004 Machine Learning Fall 2016 Lecture 17: Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi 1 Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions

### Deep Feedforward Networks. Seung-Hoon Na Chonbuk National University

Deep Feedforward Networks Seung-Hoon Na Chonbuk National University Neural Network: Types Feedforward neural networks (FNN) = Deep feedforward networks = multilayer perceptrons (MLP) No feedback connections

### Computational Intelligence

Plan for Today Single-Layer Perceptron Computational Intelligence Winter Term 00/ Prof. Dr. Günter Rudolph Lehrstuhl für Algorithm Engineering (LS ) Fakultät für Informatik TU Dortmund Accelerated Learning

### CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression

CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html

### Neural Networks: Backpropagation

Neural Networks: Backpropagation Seung-Hoon Na 1 1 Department of Computer Science Chonbuk National University 2018.10.25 Seung-Hoon Na (Chonbuk National University) Neural Networks: Backpropagation 2018.10.25

### AN INTRODUCTION TO NEURAL NETWORKS. Scott Kuindersma November 12, 2009

AN INTRODUCTION TO NEURAL NETWORKS Scott Kuindersma November 12, 2009 SUPERVISED LEARNING We are given some training data: We must learn a function If y is discrete, we call it classification If it is

Stochastic gradient descent; Classification Steve Renals Machine Learning Practical MLP Lecture 2 28 September 2016 MLP Lecture 2 Stochastic gradient descent; Classification 1 Single Layer Networks MLP

### Computer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression

Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step:

### Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems

Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems Thore Graepel and Nicol N. Schraudolph Institute of Computational Science ETH Zürich, Switzerland {graepel,schraudo}@inf.ethz.ch

### What Do Neural Networks Do? MLP Lecture 3 Multi-layer networks 1

What Do Neural Networks Do? MLP Lecture 3 Multi-layer networks 1 Multi-layer networks Steve Renals Machine Learning Practical MLP Lecture 3 7 October 2015 MLP Lecture 3 Multi-layer networks 2 What Do Single

### CSC321 Lecture 5: Multilayer Perceptrons

CSC321 Lecture 5: Multilayer Perceptrons Roger Grosse Roger Grosse CSC321 Lecture 5: Multilayer Perceptrons 1 / 21 Overview Recall the simple neuron-like unit: y output output bias i'th weight w 1 w2 w3

### Linear Models for Regression

Linear Models for Regression CSE 4309 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 The Regression Problem Training data: A set of input-output

### Lecture 4: Perceptrons and Multilayer Perceptrons

Lecture 4: Perceptrons and Multilayer Perceptrons Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning Perceptrons, Artificial Neuronal Networks Lecture 4: Perceptrons

### Neural Networks (Part 1) Goals for the lecture

Neural Networks (Part 1) Mark Craven and David Page Computer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Some of the slides in these lectures have been adapted/borrowed from materials developed

### Learning Neural Networks

Learning Neural Networks Neural Networks can represent complex decision boundaries Variable size. Any boolean function can be represented. Hidden units can be interpreted as new features Deterministic

### Artificial Neural Networks

Artificial Neural Networks 鮑興國 Ph.D. National Taiwan University of Science and Technology Outline Perceptrons Gradient descent Multi-layer networks Backpropagation Hidden layer representations Examples

### Gaussian and Linear Discriminant Analysis; Multiclass Classification

Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015

### Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression

Group Prof. Daniel Cremers 4. Gaussian Processes - Regression Definition (Rep.) Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

### Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and ŷ be the predicted

### Optimization for neural networks

0 - : Optimization for neural networks Prof. J.C. Kao, UCLA Optimization for neural networks We previously introduced the principle of gradient descent. Now we will discuss specific modifications we make

### Pattern Recognition and Machine Learning

Christopher M. Bishop Pattern Recognition and Machine Learning Springer Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability