
University of Cambridge Engineering Part IIB & EIST Part II
Paper 10: Advanced Pattern Processing
Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training

[Figure: a multi-layer network mapping inputs x_1, ..., x_d through first and second hidden layers and an output layer to outputs y_1(x), ..., y_K(x)]

Mark Gales, October 200

## Multi-Layer Perceptron

From the previous lecture we saw that a multi-layer perceptron is needed to handle the XOR problem. More generally, multi-layer perceptrons allow a neural network to perform arbitrary mappings.

[Figure: a 2-hidden-layer network mapping inputs x_1, ..., x_d through first and second hidden layers and an output layer to outputs y_1(x), ..., y_K(x)]

A 2-hidden-layer neural network is shown above. The aim is to map an input vector x into an output y(x). The layers may be described as:

- Input layer: accepts the data vector or pattern.
- Hidden layers: one or more layers. They accept the output from the previous layer, weight it, and pass it through a (normally non-linear) activation function.
- Output layer: takes the output from the final hidden layer, weights it, and possibly passes it through an output non-linearity, to produce the target values.

## Possible Decision Boundaries

The nature of the decision boundaries that may be produced varies with the network topology. Here only threshold (see the single layer perceptron) activation functions are used. There are three situations to consider:

1. Single layer: this is able to position a hyperplane in the input space.
2. Two layers (one hidden layer): this is able to describe a decision boundary which surrounds a single convex region of the input space.
3. Three layers (two hidden layers): this is able to generate arbitrary decision boundaries.

Note: any decision boundary can be approximated arbitrarily closely by a two layer network having sigmoidal activation functions.

## Number of Hidden Units

From the previous slide we can see that the number of hidden layers determines the decision boundaries that can be generated. In choosing the number of layers the following considerations are made:

- Multi-layer networks are harder to train than single layer networks.
- A two layer network (one hidden layer) with sigmoidal activation functions can model any decision boundary.

Two layer networks are most commonly used in pattern recognition (the hidden layer having sigmoidal activation functions).

How many units should there be in each layer?

- The number of output units is determined by the number of output classes.
- The number of input units is determined by the number of input dimensions.
- The number of hidden units is a design issue. The problems are:
  - too few, and the network will not model complex decision boundaries;
  - too many, and the network will have poor generalisation.

## Hidden Layer Perceptron

The form of the hidden, and the output, layer perceptron is a generalisation of the single layer perceptron from the previous lecture. Now the weighted input is passed to a general activation function, rather than a threshold function.

Consider a single perceptron, and assume that there are n units at the previous level.

[Figure: a single perceptron with inputs x_1, ..., x_n, weights w_{i1}, ..., w_{in}, bias w_{i0}, a summation producing z_i, and an activation function producing output y_i]

The output from the perceptron, y_i, may be written as
$$ y_i = \phi(z_i) = \phi\left( w_{i0} + \sum_{j=1}^{d} w_{ij} x_j \right) $$
where $\phi(\cdot)$ is the activation function.

We have already seen one example of an activation function, the threshold function. Other forms are also used in multi-layer perceptrons.

Note: the activation function is not necessarily non-linear. However, if linear activation functions are used, much of the power of the network is lost.

## Activation Functions

There are a variety of non-linear activation functions that may be used. Consider the general form $y_j = \phi(z_j)$, where there are n units (perceptrons) in the current level.

Heaviside (or step) function:
$$ y_j = \begin{cases} 0, & z_j < 0 \\ 1, & z_j \ge 0 \end{cases} $$
These are sometimes used in threshold units; the output is binary.

Sigmoid (or logistic regression) function:
$$ y_j = \frac{1}{1 + \exp(-z_j)} $$
The output is continuous, $0 \le y_j \le 1$.

Softmax (or normalised exponential, or generalised logistic) function:
$$ y_j = \frac{\exp(z_j)}{\sum_{i=1}^{n} \exp(z_i)} $$
The output is positive and the sum of all the outputs at the current level is 1, $0 \le y_j \le 1$.

Hyperbolic tan (or tanh) function:
$$ y_j = \frac{\exp(z_j) - \exp(-z_j)}{\exp(z_j) + \exp(-z_j)} $$
The output is continuous, $-1 \le y_j \le 1$.
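The activation functions above can be sketched in pure Python (a rough illustration; the function names are ours, not the handout's):

```python
import math

def heaviside(z):
    # Step function: binary output, 1 for z >= 0
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    # Logistic function: continuous output in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    # Normalised exponential over all n units in the current level;
    # outputs are positive and sum to one
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def tanh_act(z):
    # Hyperbolic tangent: continuous output in (-1, 1)
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))
```

Note that, unlike the others, softmax couples all the units in a level: each output depends on every z_i.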

## Notation Used

Consider a multi-layer perceptron with:

- d-dimensional input data;
- L hidden layers (L+1 layers including the output layer);
- $N^{(k)}$ units in the k-th level;
- K-dimensional output.

[Figure: a single perceptron in layer k with inputs x_1, ..., x_{N^{(k-1)}}, weights w_{i1}, ..., w_{iN^{(k-1)}}, bias w_{i0}, a summation producing z_i, and an activation function producing y_i]

The following notation will be used:

- $\mathbf{x}^{(k)}$ is the input to the k-th layer;
- $\tilde{\mathbf{x}}^{(k)}$ is the extended input to the k-th layer,
$$ \tilde{\mathbf{x}}^{(k)} = \begin{bmatrix} 1 \\ \mathbf{x}^{(k)} \end{bmatrix} $$
- $\mathbf{W}^{(k)}$ is the weight matrix of the k-th layer. By definition this is an $N^{(k)} \times N^{(k-1)}$ matrix.

## Notation (cont)

- $\tilde{\mathbf{W}}^{(k)}$ is the weight matrix including the bias weight of the k-th layer. By definition this is an $N^{(k)} \times (N^{(k-1)} + 1)$ matrix,
$$ \tilde{\mathbf{W}}^{(k)} = \begin{bmatrix} \mathbf{w}_0^{(k)} & \mathbf{W}^{(k)} \end{bmatrix} $$
- $\mathbf{z}^{(k)}$ is the $N^{(k)}$-dimensional vector defined as
$$ z_j^{(k)} = \tilde{\mathbf{w}}_j^{(k)\prime} \tilde{\mathbf{x}}^{(k)} $$
- $\mathbf{y}^{(k)}$ is the output from the k-th layer, so
$$ y_j^{(k)} = \phi(z_j^{(k)}) $$

All the hidden layer activation functions are assumed to be the same, $\phi(\cdot)$. Initially we shall also assume that the output activation function is also $\phi(\cdot)$.

The following matrix-notation feed-forward equations may then be used for a multi-layer perceptron with input x and output y(x):
$$ \mathbf{x}^{(1)} = \mathbf{x}, \quad \mathbf{x}^{(k)} = \mathbf{y}^{(k-1)}, \quad \mathbf{z}^{(k)} = \tilde{\mathbf{W}}^{(k)} \tilde{\mathbf{x}}^{(k)}, \quad \mathbf{y}^{(k)} = \phi(\mathbf{z}^{(k)}), \quad \mathbf{y}(\mathbf{x}) = \mathbf{y}^{(L+1)} $$
where $1 \le k \le L+1$.

The target values for the training of the networks will be denoted t(x) for training example x.
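The feed-forward equations can be sketched directly, here with a sigmoid activation at every layer (a minimal pure-Python illustration; the weight-list layout is our own convention):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights):
    """Feed-forward pass. `weights` is a list of layer matrices, one row per
    unit; each row is [w_0 (bias), w_1, ..., w_{N^(k-1)}], i.e. W-tilde."""
    y = x
    for W in weights:
        x_ext = [1.0] + y                       # extended input (bias term)
        z = [sum(w * xi for w, xi in zip(row, x_ext)) for row in W]
        y = [sigmoid(zj) for zj in z]           # same activation at every layer
    return y
```

For example, a 2-input, 2-hidden-unit, 1-output network is a list of two matrices, and each call evaluates y(x) = y^(L+1).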

## Training Criteria

A variety of training criteria may be used. Assume we have supervised training examples
$$ \{\{\mathbf{x}_1, \mathbf{t}(\mathbf{x}_1)\}, \ldots, \{\mathbf{x}_n, \mathbf{t}(\mathbf{x}_n)\}\} $$
Some standard examples are:

Least squares error: one of the most common training criteria.
$$ E = \frac{1}{2} \sum_{p=1}^{n} \| \mathbf{y}(\mathbf{x}_p) - \mathbf{t}(\mathbf{x}_p) \|^2 = \frac{1}{2} \sum_{p=1}^{n} \sum_{i=1}^{K} (y_i(\mathbf{x}_p) - t_i(\mathbf{x}_p))^2 $$
This may be derived from considering the targets as being corrupted by zero-mean Gaussian distributed noise.

Cross-entropy for two classes: consider the case when t(x) is binary (and softmax output). The expression is
$$ E = -\sum_{p=1}^{n} \left( t(\mathbf{x}_p) \log(y(\mathbf{x}_p)) + (1 - t(\mathbf{x}_p)) \log(1 - y(\mathbf{x}_p)) \right) $$
This expression goes to zero with the perfect mapping.

Cross-entropy for multiple classes: the above expression becomes (again with softmax output)
$$ E = -\sum_{p=1}^{n} \sum_{i=1}^{K} t_i(\mathbf{x}_p) \log(y_i(\mathbf{x}_p)) $$
The minimum value is now non-zero; it represents the entropy of the target values.
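The least squares and multi-class cross-entropy criteria can be written out as follows (a sketch; `outputs` and `targets` are lists of K-dimensional vectors, one per training example):

```python
import math

def least_squares(outputs, targets):
    # E = 1/2 * sum_p sum_i (y_i(x_p) - t_i(x_p))^2
    return 0.5 * sum((yi - ti) ** 2
                     for y, t in zip(outputs, targets)
                     for yi, ti in zip(y, t))

def cross_entropy(outputs, targets):
    # E = -sum_p sum_i t_i(x_p) * log(y_i(x_p))   (multi-class form)
    return -sum(ti * math.log(yi)
                for y, t in zip(outputs, targets)
                for yi, ti in zip(y, t))
```

With a perfect mapping onto 1-out-of-K targets the cross-entropy term goes to zero, matching the statement above.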

## Network Interpretation

We would like to be able to interpret the output of the network. Consider the case where a least squares error criterion is used. The training criterion is
$$ E = \frac{1}{2} \sum_{p=1}^{n} \sum_{i=1}^{K} (y_i(\mathbf{x}_p) - t_i(\mathbf{x}_p))^2 $$
In the case of an infinite amount of training data, $n \to \infty$,
$$ E = \frac{1}{2} \sum_{i=1}^{K} \int\!\!\int (y_i(\mathbf{x}) - t_i(\mathbf{x}))^2 \, p(t_i(\mathbf{x}), \mathbf{x}) \, dt_i(\mathbf{x}) \, d\mathbf{x} = \frac{1}{2} \sum_{i=1}^{K} \int \left[ \int (y_i(\mathbf{x}) - t_i(\mathbf{x}))^2 \, p(t_i(\mathbf{x})|\mathbf{x}) \, dt_i(\mathbf{x}) \right] p(\mathbf{x}) \, d\mathbf{x} $$
Examining the term inside the square braces:
$$ \int (y_i(\mathbf{x}) - t_i(\mathbf{x}))^2 \, p(t_i(\mathbf{x})|\mathbf{x}) \, dt_i(\mathbf{x}) = \int (y_i(\mathbf{x}) - E\{t_i(\mathbf{x})|\mathbf{x}\} + E\{t_i(\mathbf{x})|\mathbf{x}\} - t_i(\mathbf{x}))^2 \, p(t_i(\mathbf{x})|\mathbf{x}) \, dt_i(\mathbf{x}) $$
$$ = (y_i(\mathbf{x}) - E\{t_i(\mathbf{x})|\mathbf{x}\})^2 + \int (E\{t_i(\mathbf{x})|\mathbf{x}\} - t_i(\mathbf{x}))^2 \, p(t_i(\mathbf{x})|\mathbf{x}) \, dt_i(\mathbf{x}) $$
$$ \quad + \; 2 (y_i(\mathbf{x}) - E\{t_i(\mathbf{x})|\mathbf{x}\}) \int (E\{t_i(\mathbf{x})|\mathbf{x}\} - t_i(\mathbf{x})) \, p(t_i(\mathbf{x})|\mathbf{x}) \, dt_i(\mathbf{x}) $$
where
$$ E\{t_i(\mathbf{x})|\mathbf{x}\} = \int t_i(\mathbf{x}) \, p(t_i(\mathbf{x})|\mathbf{x}) \, dt_i(\mathbf{x}) $$
By this definition of the conditional expectation, the final (cross) term integrates to zero. We can therefore write the cost function as
$$ E = \frac{1}{2} \sum_{i=1}^{K} \int (y_i(\mathbf{x}) - E\{t_i(\mathbf{x})|\mathbf{x}\})^2 \, p(\mathbf{x}) \, d\mathbf{x} + \frac{1}{2} \sum_{i=1}^{K} \int \left( E\{t_i(\mathbf{x})^2|\mathbf{x}\} - (E\{t_i(\mathbf{x})|\mathbf{x}\})^2 \right) p(\mathbf{x}) \, d\mathbf{x} $$
The second term is not dependent on the weights, so it is not affected by the optimisation scheme.

## Network Interpretation (cont)

The first term in the previous expression is minimised when it equates to zero. This occurs when
$$ y_i(\mathbf{x}) = E\{t_i(\mathbf{x})|\mathbf{x}\} $$
The output of the network is the conditional average of the target data: this is the regression of $t_i(\mathbf{x})$ conditioned on $\mathbf{x}$.

[Figure: the network output y(x) tracing the conditional mean of the target distribution p(t(x_i)) at each input x_i]

So the network models the posterior independent of the topology, but in practice this requires:

- an infinite amount of training data, or knowledge of the correct distribution for x (i.e. p(x) is known or derivable from the training data);
- a network topology complex enough that the final error is small;
- a good training algorithm to optimise the network, one that finds the global minimum.

## Posterior Probabilities

Consider the multi-class classification training problem with:

- d-dimensional feature vectors: x;
- K-dimensional output from the network: y(x);
- K-dimensional target: t(x).

We would like the output of the network, y(x), to approximate the posterior distribution of the set of K classes, so
$$ y_i(\mathbf{x}) \approx P(\omega_i|\mathbf{x}) $$
Consider training a network with:

- mean squared error estimation;
- 1-out-of-K coding, i.e.
$$ t_i(\mathbf{x}) = \begin{cases} 1, & \mathbf{x} \in \omega_i \\ 0, & \mathbf{x} \notin \omega_i \end{cases} $$

The network will act as a d-dimensional to K-dimensional mapping. Can we interpret the output of the network?

## Posterior Probabilities (cont)

From the previous regression network interpretation we know that
$$ y_i(\mathbf{x}) = E\{t_i(\mathbf{x})|\mathbf{x}\} = \int t_i(\mathbf{x}) \, p(t_i(\mathbf{x})|\mathbf{x}) \, dt_i(\mathbf{x}) $$
As we are using the 1-out-of-K coding,
$$ p(t_i(\mathbf{x})|\mathbf{x}) = \sum_{j=1}^{K} \delta(t_i(\mathbf{x}) - \delta_{ij}) P(\omega_j|\mathbf{x}) $$
where
$$ \delta_{ij} = \begin{cases} 1, & i = j \\ 0, & \text{otherwise} \end{cases} $$
This results in
$$ y_i(\mathbf{x}) = P(\omega_i|\mathbf{x}) $$
as required. The same limitations are placed on this proof as on the interpretation of the network for regression.

## Compensating for Different Priors

The standard approach described at the start of the course was to use Bayes' law to obtain the posterior probability
$$ P(\omega_j|\mathbf{x}) = \frac{p(\mathbf{x}|\omega_j) P(\omega_j)}{p(\mathbf{x})} $$
where the class priors, $P(\omega_j)$, and class-conditional densities, $p(\mathbf{x}|\omega_j)$, are estimated separately. For some tasks the two use different training data (for example, in speech recognition, the language model and the acoustic model).

How can this difference in priors between the training and the test conditions be built into the neural network framework, where the posterior probability is directly calculated? Again using Bayes' law,
$$ p(\mathbf{x}|\omega_j) \propto \frac{P(\omega_j|\mathbf{x})}{P(\omega_j)} $$
Thus if the posterior is divided by the training-data prior, a value proportional to the class-conditional probability can be obtained. The standard form of Bayes' rule may now be used with the appropriate, different, prior.

## Error Back Propagation

Interest in multi-layer perceptrons (MLPs) resurfaced with the development of the error back propagation algorithm. This allows multi-layer perceptrons to be simply trained.

[Figure: a single-hidden-layer network mapping inputs x_1, ..., x_d through a hidden layer and an output layer to outputs y_1(x), ..., y_K(x)]

A single hidden layer network is shown above. As previously mentioned, with sigmoidal activation functions arbitrary decision boundaries may be obtained with this network topology.

The error back propagation algorithm is based on gradient descent. Hence the activation function must be differentiable, so threshold and step units will not be considered. We need to be able to compute the derivative of the error function with respect to the weights of all layers. All gradients in the next few slides are evaluated at the current model parameters.

## Single Layer Perceptron

Rather than examining the multi-layer case immediately, consider the following single layer perceptron.

[Figure: a single perceptron with inputs x_1, ..., x_d, weights w_1, ..., w_d, bias w_0, a summation producing z, and an activation function producing output y(x)]

We would like to minimise (for example) the square error between the target of the output, t(x_p), and the current output value y(x_p). Assume that the activation function is known to be a sigmoid function. The cost function may be written as
$$ E = \frac{1}{2} \sum_{p=1}^{n} (y(\mathbf{x}_p) - t(\mathbf{x}_p))^2 = \sum_{p=1}^{n} E^{(p)} $$
To simplify notation, we will only consider a single observation x with associated target value t(x) and current output from the network y(x). The error with this single observation is denoted E.

The first question is how the error changes as we alter y(x):
$$ \frac{\partial E}{\partial y(\mathbf{x})} = y(\mathbf{x}) - t(\mathbf{x}) $$
But we are not interested in y(x) itself: how do we find the effect of varying the weights?

## SLP Training (cont)

We can calculate the effect that a change in z has on the error using the chain rule:
$$ \frac{\partial E}{\partial z} = \frac{\partial E}{\partial y(\mathbf{x})} \frac{\partial y(\mathbf{x})}{\partial z} $$
However, what we really want is the change of the error with the weights (the parameters that we want to learn):
$$ \frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial z} \frac{\partial z}{\partial w_i} $$
The error function therefore depends on the weight as
$$ \frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial y(\mathbf{x})} \frac{\partial y(\mathbf{x})}{\partial z} \frac{\partial z}{\partial w_i} $$
All these expressions are known, so we can write
$$ \frac{\partial E}{\partial w_i} = (y(\mathbf{x}) - t(\mathbf{x})) \, y(\mathbf{x}) (1 - y(\mathbf{x})) \, x_i $$
This has been computed for a single observation. We are interested in the complete training set. We know that the total error is the sum of the individual errors, so
$$ \frac{\partial E}{\partial \mathbf{w}} = \sum_{p=1}^{n} (y(\mathbf{x}_p) - t(\mathbf{x}_p)) \, y(\mathbf{x}_p) (1 - y(\mathbf{x}_p)) \, \tilde{\mathbf{x}}_p $$
So for a single layer we can use gradient descent schemes to find the best weight values. However, we want to train multi-layer perceptrons!
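The chain-rule expression above can be sketched for a sigmoid single layer perceptron (pure Python; the data layout and names are our own):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def slp_gradient(data, w):
    """Gradient of the squared error for a sigmoid single layer perceptron.
    `data` is a list of (x, t) pairs; `w` is [w_0 (bias), w_1, ..., w_d]."""
    grad = [0.0] * len(w)
    for x, t in data:
        x_ext = [1.0] + x                        # extended input
        y = sigmoid(sum(wi * xi for wi, xi in zip(w, x_ext)))
        # dE/dw_i = (y - t) * y * (1 - y) * x_i, summed over observations
        for i, xi in enumerate(x_ext):
            grad[i] += (y - t) * y * (1.0 - y) * xi
    return grad
```

A finite-difference check against the cost E = 1/2 sum (y - t)^2 confirms the analytic gradient.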

## Error Back Propagation Algorithm

Now consider a particular node, i, of hidden layer k. Using the previously defined notation, the input to the node is $\tilde{\mathbf{x}}^{(k)}$ and the output is $y_i^{(k)}$.

[Figure: node i of layer k with inputs x_1, ..., x_{N^(k-1)}, weights w_{i1}, ..., w_{iN^(k-1)}, bias w_{i0}, a summation producing z_i, and an activation function producing y_i]

From the previous section we can simply derive the rate of change of the error function with the weights of the output layer. We now need to examine the rate of change with the k-th hidden layer weights. A general error criterion, E, will be used. Furthermore, we will not assume that $y_j$ depends only on $z_j$; for example, a softmax function may be used.

In terms of the derivations given, the output layer will be considered as the (L+1)-th layer. The training observations are assumed independent and so
$$ E = \sum_{p=1}^{n} E^{(p)} $$
where $E^{(p)}$ is the error cost for the p-th observation and the observations are $\mathbf{x}_1, \ldots, \mathbf{x}_n$.

## Error Back Propagation Algorithm (cont)

We are required to calculate
$$ \frac{\partial E}{\partial \tilde{\mathbf{W}}^{(k)}} $$
for all layers, k, and all rows and columns of $\tilde{\mathbf{W}}^{(k)}$. Applying the chain rule,
$$ \frac{\partial E}{\partial w_{ij}^{(k)}} = \frac{\partial E}{\partial z_i^{(k)}} \frac{\partial z_i^{(k)}}{\partial w_{ij}^{(k)}} = \delta_i^{(k)} \tilde{x}_j^{(k)} $$
In matrix notation we can write
$$ \frac{\partial E}{\partial \tilde{\mathbf{W}}^{(k)}} = \boldsymbol{\delta}^{(k)} \tilde{\mathbf{x}}^{(k)\prime} $$
We need to find a recursion for $\boldsymbol{\delta}^{(k)}$:
$$ \boldsymbol{\delta}^{(k)} = \frac{\partial E}{\partial \mathbf{z}^{(k)}} = \frac{\partial \mathbf{y}^{(k)}}{\partial \mathbf{z}^{(k)}} \, \frac{\partial \mathbf{z}^{(k+1)}}{\partial \mathbf{y}^{(k)}}^{\!\prime} \, \frac{\partial E}{\partial \mathbf{z}^{(k+1)}} = \frac{\partial \mathbf{y}^{(k)}}{\partial \mathbf{z}^{(k)}} \, \frac{\partial \mathbf{z}^{(k+1)}}{\partial \mathbf{y}^{(k)}}^{\!\prime} \, \boldsymbol{\delta}^{(k+1)} $$
But we know from the forward recursions, $\mathbf{z}^{(k+1)} = \tilde{\mathbf{W}}^{(k+1)} \tilde{\mathbf{x}}^{(k+1)}$, that
$$ \frac{\partial \mathbf{z}^{(k+1)}}{\partial \mathbf{y}^{(k)}} = \mathbf{W}^{(k+1)} $$
This yields the recursion
$$ \boldsymbol{\delta}^{(k)} = \boldsymbol{\Lambda}^{(k)} \mathbf{W}^{(k+1)\prime} \boldsymbol{\delta}^{(k+1)} $$

## Backward Error Recursion

Here we define the activation derivative matrix for layer k as
$$ \boldsymbol{\Lambda}^{(k)} = \frac{\partial \mathbf{y}^{(k)}}{\partial \mathbf{z}^{(k)}} = \begin{bmatrix} \frac{\partial y_1}{\partial z_1} & \frac{\partial y_2}{\partial z_1} & \cdots & \frac{\partial y_{N^{(k)}}}{\partial z_1} \\ \frac{\partial y_1}{\partial z_2} & \frac{\partial y_2}{\partial z_2} & \cdots & \frac{\partial y_{N^{(k)}}}{\partial z_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial z_{N^{(k)}}} & \frac{\partial y_2}{\partial z_{N^{(k)}}} & \cdots & \frac{\partial y_{N^{(k)}}}{\partial z_{N^{(k)}}} \end{bmatrix} $$
This has given a matrix form of the backward recursion for the error back propagation algorithm.

We need an initialisation of the backward recursion. This will be from the output layer (layer L+1):
$$ \boldsymbol{\delta}^{(L+1)} = \frac{\partial E}{\partial \mathbf{z}^{(L+1)}} = \frac{\partial \mathbf{y}^{(L+1)}}{\partial \mathbf{z}^{(L+1)}} \, \frac{\partial E}{\partial \mathbf{y}^{(L+1)}} = \boldsymbol{\Lambda}^{(L+1)} \frac{\partial E}{\partial \mathbf{y}(\mathbf{x})} $$
where $\boldsymbol{\Lambda}^{(L+1)}$ is the activation derivative matrix for the output layer.

## Error Back Propagation

To calculate the gradient of $E^{(p)}$ with respect to $\theta[\tau]$ (the set of current weight values at training epoch $\tau$) we use the following algorithm:

1. Apply the input vector $\mathbf{x}_p$ to the network and use the feed-forward matrix equations to propagate the input forward through the network. For all layers this yields $\mathbf{y}^{(k)}$ and $\mathbf{z}^{(k)}$.
2. Compute the set of derivative matrices $\boldsymbol{\Lambda}^{(k)}$ for all layers.
3. Compute the gradient at the output layer, $\partial E^{(p)} / \partial \mathbf{y}(\mathbf{x})$, evaluated at $\theta[\tau]$.
4. Using the back-propagation formulae, back propagate the derivatives through the network.

Having obtained the derivatives of the error function with respect to the weights of the network, we need a scheme to optimise the values of the weights. The obvious choice is gradient descent.
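The four steps above can be sketched for the special case of sigmoid activations and a squared-error criterion, where the Lambda matrices are diagonal with entries y(1-y) (a minimal pure-Python illustration, our own layout, verified below against a finite difference):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights):
    # Step 1: propagate the input forward, storing the extended
    # inputs and outputs of every layer for the backward pass.
    x_exts, ys = [], []
    y = x
    for W in weights:
        x_ext = [1.0] + y
        z = [sum(w * xi for w, xi in zip(row, x_ext)) for row in W]
        y = [sigmoid(zj) for zj in z]
        x_exts.append(x_ext)
        ys.append(y)
    return x_exts, ys

def backprop(x, t, weights):
    x_exts, ys = forward(x, weights)
    # Steps 2-3: for a sigmoid, Lambda is diagonal with entries y(1-y);
    # for squared error the output-layer gradient is y - t.
    delta = [(yi - ti) * yi * (1.0 - yi) for yi, ti in zip(ys[-1], t)]
    grads = [None] * len(weights)
    for k in range(len(weights) - 1, -1, -1):
        # dE/dW~(k) = delta(k) x~(k)'
        grads[k] = [[d * xi for xi in x_exts[k]] for d in delta]
        if k > 0:
            # Step 4: back propagate, delta(k-1) = Lambda(k-1) W(k)' delta(k)
            # (column 0 of W~(k) is the bias and does not connect backwards)
            delta = [ys[k - 1][j] * (1.0 - ys[k - 1][j]) *
                     sum(weights[k][i][j + 1] * delta[i]
                         for i in range(len(delta)))
                     for j in range(len(ys[k - 1]))]
    return grads
```

One forward pass plus one backward pass produces all the weight gradients for a single observation.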

## Gradient Descent

Having found an expression for the gradient, gradient descent may be used to find the values of the weights. Initially consider a batch update rule. Here
$$ \mathbf{w}_i^{(k)}[\tau+1] = \mathbf{w}_i^{(k)}[\tau] - \eta \left. \frac{\partial E}{\partial \mathbf{w}_i^{(k)}} \right|_{\theta[\tau]} $$
where $\theta[\tau] = \{ \tilde{\mathbf{W}}^{(1)}[\tau], \ldots, \tilde{\mathbf{W}}^{(L+1)}[\tau] \}$, $\mathbf{w}_i^{(k)}[\tau]$ is the i-th row of $\tilde{\mathbf{W}}^{(k)}$ at training epoch $\tau$, and
$$ \left. \frac{\partial E}{\partial \mathbf{w}_i^{(k)}} \right|_{\theta[\tau]} = \sum_{p=1}^{n} \left. \frac{\partial E^{(p)}}{\partial \mathbf{w}_i^{(k)}} \right|_{\theta[\tau]} $$
If the total number of weights in the system is N, then all N derivatives may be calculated in O(N) operations with memory requirements O(N).

However, in common with other gradient descent schemes, there are problems:

- we need a value of $\eta$ that achieves a stable, fast descent;
- the error surface may have local minima, maxima and saddle points.

This has led to refinements of gradient descent.

## Training Schemes

On the previous slide the weights were updated after all n training examples had been seen. This is not the only scheme that may be used.

Batch update: the weights are updated after all the training examples have been seen. Thus
$$ \mathbf{w}_i^{(k)}[\tau+1] = \mathbf{w}_i^{(k)}[\tau] - \eta \sum_{p=1}^{n} \left. \frac{\partial E^{(p)}}{\partial \mathbf{w}_i^{(k)}} \right|_{\theta[\tau]} $$
Sequential update: the weights are updated after every sample. Now
$$ \mathbf{w}_i^{(k)}[\tau+1] = \mathbf{w}_i^{(k)}[\tau] - \eta \left. \frac{\partial E^{(p)}}{\partial \mathbf{w}_i^{(k)}} \right|_{\theta[\tau]} $$
and we cycle around the training vectors. There are some advantages to this form of update:

- It is not necessary to store the whole training database. Samples may be used only once if desired, so they may be used for online learning.
- In dynamic systems the values of the weights can be updated to track the system.

In practice, forms of batch training are often used.
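The difference between the two schemes can be shown on a toy one-parameter problem, E = 1/2 sum_p (w - a_p)^2, whose minimiser is the sample mean (an illustrative sketch, not from the handout):

```python
def batch_update(w, samples, eta):
    # Batch: sum the per-sample gradients, then one update per epoch.
    grad = sum(w - a for a in samples)       # dE/dw for E = 1/2 sum (w - a_p)^2
    return w - eta * grad

def sequential_update(w, samples, eta):
    # Sequential: update after every sample, cycling through the data.
    for a in samples:
        w = w - eta * (w - a)
    return w
```

Batch descent converges to the exact minimiser; the sequential scheme with a fixed rate settles into a small cycle around it.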

## Refining Gradient Descent

There are some simple techniques to refine standard gradient descent. First consider the learning rate $\eta$. We can make this vary with each iteration. One of the simplest rules is to use
$$ \eta[\tau+1] = \begin{cases} 1.1\,\eta[\tau], & \text{if } E(\theta[\tau]) < E(\theta[\tau-1]) \\ 0.5\,\eta[\tau], & \text{if } E(\theta[\tau]) > E(\theta[\tau-1]) \end{cases} $$
In words: if the previous value of $\eta[\tau]$ decreased the value of the cost function, then increase $\eta[\tau]$; if the previous value of $\eta[\tau]$ increased the cost function ($\eta[\tau]$ too large), then decrease $\eta[\tau]$.

It is also possible to add a momentum term to the optimisation (common in MLP estimation). The update formula is
$$ \mathbf{w}_i^{(k)}[\tau+1] = \mathbf{w}_i^{(k)}[\tau] + \Delta\mathbf{w}_i^{(k)}[\tau] $$
where
$$ \Delta\mathbf{w}_i^{(k)}[\tau] = -\eta[\tau+1] \left. \frac{\partial E}{\partial \mathbf{w}_i^{(k)}} \right|_{\theta[\tau]} + \alpha[\tau] \, \Delta\mathbf{w}_i^{(k)}[\tau-1] $$
The use of the momentum term, $\alpha[\tau]$:

- smooths successive updates;
- helps avoid small local minima.

Unfortunately it introduces an additional tunable parameter to set. Also, if we are lucky and hit the minimum solution, we will overshoot.

## Quadratic Approximation

Gradient descent makes use of first-order terms of the error function. What about higher-order techniques? Consider the vector form of the Taylor series:
$$ E(\theta) = E(\theta[\tau]) + (\theta - \theta[\tau])' \mathbf{g} + \frac{1}{2} (\theta - \theta[\tau])' \mathbf{H} (\theta - \theta[\tau]) + O(3) $$
where
$$ \mathbf{g} = \left. \nabla E(\theta) \right|_{\theta[\tau]}, \qquad (\mathbf{H})_{ij} = h_{ij} = \left. \frac{\partial^2 E(\theta)}{\partial w_i \partial w_j} \right|_{\theta[\tau]} $$
Ignoring higher-order terms, we find
$$ \nabla E(\theta) = \mathbf{g} + \mathbf{H} (\theta - \theta[\tau]) $$
Equating this to zero, the value of $\theta$ at this point, $\theta[\tau+1]$, is
$$ \theta[\tau+1] = \theta[\tau] - \mathbf{H}^{-1} \mathbf{g} $$
This gives us a simple update rule. The direction $-\mathbf{H}^{-1}\mathbf{g}$ is known as the Newton direction.
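On an exactly quadratic surface a single Newton step lands on the stationary point. This can be checked on a two-parameter example, inverting the 2x2 Hessian by hand (an illustrative sketch):

```python
def newton_step(theta, g, H):
    """One Newton update theta - H^{-1} g for a 2-parameter problem,
    inverting the 2x2 Hessian directly."""
    (a, b), (c, d) = H
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return [theta[i] - sum(inv[i][j] * g[j] for j in range(2))
            for i in range(2)]
```

For E(theta) = (theta_0 - 1)^2 + (theta_1 + 2)^2, one step from any starting point reaches the minimum (1, -2).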

## Problems with the Hessian

In practice the use of the Hessian is limited:

1. The evaluation of the Hessian may be computationally expensive, as O(N²) parameters must be accumulated for each of the n training samples.
2. The Hessian must be inverted to find the direction, O(N³). This gets very expensive as N gets large.
3. The direction given need not head towards a minimum; it could head towards a maximum or saddle point. This occurs if the Hessian is not positive definite, i.e. if
$$ \mathbf{v}' \mathbf{H} \mathbf{v} > 0 $$
does not hold for all v.
4. If the surface is highly non-quadratic the step sizes may be too large and the optimisation becomes unstable.

Approximations to the Hessian are commonly used. The simplest approximation is to assume that the Hessian is diagonal. This ensures that the Hessian is invertible and only requires N parameters. The Hessian may be made positive definite using
$$ \tilde{\mathbf{H}} = \mathbf{H} + \lambda \mathbf{I} $$
If $\lambda$ is large enough then $\tilde{\mathbf{H}}$ is positive definite.

## Improved Learning Rates

Rather than having a single learning rate for all weights in the system, weight-specific rates may be used without using the Hessian. All schemes will make use of
$$ g_{ij}[\tau] = \left. \frac{\partial E}{\partial w_{ij}} \right|_{\theta[\tau]} $$
Delta-delta: we might want to increase the learning rate when two consecutive gradients have the same sign. This may be implemented as
$$ \Delta\eta_{ij}[\tau] = \gamma \, g_{ij}[\tau] \, g_{ij}[\tau-1] $$
where $\gamma > 0$. Unfortunately this can take the learning rate negative (depending on the value of $\gamma$)!

Delta-bar-delta: refines delta-delta so that
$$ \Delta\eta_{ij}[\tau] = \begin{cases} \kappa, & \text{if } \bar{g}_{ij}[\tau-1] \, g_{ij}[\tau] > 0 \\ -\gamma \, \eta_{ij}[\tau-1], & \text{if } \bar{g}_{ij}[\tau-1] \, g_{ij}[\tau] < 0 \end{cases} $$
where
$$ \bar{g}_{ij}[\tau] = (1 - \beta) \, g_{ij}[\tau] + \beta \, \bar{g}_{ij}[\tau-1] $$
One of the drawbacks of this scheme is that three parameters, $\gamma$, $\kappa$ and $\beta$, must be selected.

Quickprop:
$$ \Delta w_{ij}[\tau+1] = \frac{g_{ij}[\tau]}{g_{ij}[\tau-1] - g_{ij}[\tau]} \, \Delta w_{ij}[\tau] $$
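The delta-bar-delta rule can be sketched as a single update function (the default parameter values here are our own illustrative choices, not from the handout):

```python
def delta_bar_delta(eta, g_bar_prev, g_now, kappa=0.05, gamma=0.3, beta=0.7):
    # Grow eta additively when the smoothed and current gradients agree,
    # shrink it multiplicatively when they disagree.
    if g_bar_prev * g_now > 0:
        eta = eta + kappa
    elif g_bar_prev * g_now < 0:
        eta = eta - gamma * eta
    # Exponentially smoothed gradient: g_bar = (1-beta) g + beta g_bar_prev
    g_bar = (1 - beta) * g_now + beta * g_bar_prev
    return eta, g_bar
```

The additive growth / multiplicative decay keeps the rate positive, avoiding the sign problem of plain delta-delta.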

## Conjugate Directions

Assume that we have optimised in one direction, $\mathbf{d}[\tau]$. We know that
$$ \frac{\partial}{\partial \lambda} E(\theta[\tau] + \lambda \mathbf{d}[\tau]) = 0 $$
What direction should we now optimise in? If we work out the gradient at this new point, $\theta[\tau+1]$, we know that
$$ \nabla E(\theta[\tau+1])' \, \mathbf{d}[\tau] = 0 $$
Is this the best direction? No. What we really want is that as we move off in the new direction, $\mathbf{d}[\tau+1]$, we maintain the gradient in the previous direction, $\mathbf{d}[\tau]$, at zero. In other words,
$$ \nabla E(\theta[\tau+1] + \lambda \mathbf{d}[\tau+1])' \, \mathbf{d}[\tau] = 0 $$
Using a first-order Taylor series expansion,
$$ \left( \nabla E(\theta[\tau+1]) + \lambda \mathbf{H} \mathbf{d}[\tau+1] \right)' \mathbf{d}[\tau] = 0 $$
Hence the following constraint is satisfied for a conjugate gradient:
$$ \mathbf{d}[\tau+1]' \, \mathbf{H} \, \mathbf{d}[\tau] = 0 $$

## Conjugate Gradient (cont)

[Figure: successive search directions d[τ] and d[τ+1] on the error surface, moving from θ[τ] to θ[τ+1]]

Fortunately the conjugate direction can be calculated without explicitly computing the Hessian. This leads to the conjugate gradient descent algorithm (see the book by Bishop for details).

## Input Transformations

If the inputs to the network are not normalised, the training time may become very large. The data is therefore normalised: it is transformed to
$$ \tilde{x}_{pi} = \frac{x_{pi} - \mu_i}{\sigma_i} $$
where
$$ \mu_i = \frac{1}{n} \sum_{p=1}^{n} x_{pi}, \qquad \sigma_i^2 = \frac{1}{n} \sum_{p=1}^{n} (x_{pi} - \mu_i)^2 $$
The transformed data has zero mean and unit variance.

This transformation may be generalised to whitening. Here the covariance matrix of the original data is calculated; the data is then decorrelated and the mean subtracted. This results in data with zero mean and an identity covariance matrix.
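The per-dimension normalisation above is a short function (a sketch; `data` is a list of d-dimensional samples):

```python
def normalise(data):
    """Transform each dimension to zero mean and unit variance,
    using the per-dimension sample mean and standard deviation."""
    n, d = len(data), len(data[0])
    mu = [sum(x[i] for x in data) / n for i in range(d)]
    var = [sum((x[i] - mu[i]) ** 2 for x in data) / n for i in range(d)]
    sigma = [v ** 0.5 for v in var]
    return [[(x[i] - mu[i]) / sigma[i] for i in range(d)] for x in data]
```

Full whitening would additionally rotate the data with the eigenvectors of the covariance matrix, which is omitted here.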

## Generalisation

In the vast majority of pattern processing tasks, the aim is not to get good performance on the training data (in supervised training we know the classes!), but to get good performance on some previously unseen test data.

[Figure: error rate against number of parameters; the training-set error rate decreases as parameters are added, while the future-data error rate reaches a minimum and then rises]

Typically the performance goes as above. As the number of model parameters, N, increases, the training data likelihood is guaranteed to increase (provided the parameter optimisation scheme is sensible). The associated classification error rate usually decreases. However, the test (future) set performance has a maximum in the likelihood and an associated minimum in error rate; beyond that point the likelihood decreases and the error rate increases.

The objective in any pattern classification task is to have the minimum test set (future data) error rate.

## Regularisation

One of the major issues with training neural networks is how to ensure generalisation. One commonly used technique is weight decay, where a regulariser is added. Here
$$ \Omega = \frac{1}{2} \sum_{i=1}^{N} w_i^2 $$
where N is the total number of weights in the network. A new error function is defined:
$$ \tilde{E} = E + \nu \Omega $$
Using gradient descent on this gives
$$ \nabla \tilde{E} = \nabla E + \nu \mathbf{w} $$
The effect of this regularisation term $\Omega$ is to penalise very large weights. Empirical results show that this has led to improved performance.

Rather than using an explicit regularisation term, the complexity of the network can be controlled by training with noise. For batch training we replicate each of the samples multiple times and add a different noise vector to each of the samples. If we use least squares training with a zero-mean noise source (equal variance $\nu$ in all dimensions), the error function may be shown to have the form
$$ \tilde{E} = E + \nu \Omega $$
This is another form of regularisation.
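The weight-decay gradient above amounts to adding nu * w_i to each component of the plain error gradient (a one-line sketch):

```python
def weight_decay_gradient(grad, w, nu):
    # Gradient of E~ = E + nu * (1/2) * sum_i w_i^2  is  dE/dw_i + nu * w_i,
    # so each weight is pulled back towards zero in proportion to its size.
    return [g + nu * wi for g, wi in zip(grad, w)]
```

Note that large weights receive the largest decay term, which is exactly the penalising effect described above.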


### Lecture 3 Feedforward Networks and Backpropagation

Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression

### DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION Alexandre Iline, Harri Valpola and Erkki Oja Laboratory of Computer and Information Science Helsinki University of Technology P.O.Box

Advanced Machine Learning Lecture 4: Deep Learning Essentials Pierre Geurts, Gilles Louppe, Louis Wehenkel 1 / 52 Outline Goal: explain and motivate the basic constructs of neural networks. From linear

### COM336: Neural Computing

COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk

### Links between Perceptrons, MLPs and SVMs

Links between Perceptrons, MLPs and SVMs Ronan Collobert Samy Bengio IDIAP, Rue du Simplon, 19 Martigny, Switzerland Abstract We propose to study links between three important classification algorithms:

### Machine Learning (CS 567) Lecture 5

Machine Learning (CS 567) Lecture 5 Time: T-Th 5:00pm - 6:20pm Location: GFS 118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol

### The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural

1 2 The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural networks. First we will look at the algorithm itself

### Last update: October 26, Neural networks. CMSC 421: Section Dana Nau

Last update: October 26, 207 Neural networks CMSC 42: Section 8.7 Dana Nau Outline Applications of neural networks Brains Neural network units Perceptrons Multilayer perceptrons 2 Example Applications

### Machine Learning Lecture 12

Machine Learning Lecture 12 Neural Networks 30.11.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory Probability

### LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

LINEAR CLASSIFIERS Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification, the input

### Deep Feedforward Networks. Lecture slides for Chapter 6 of Deep Learning Ian Goodfellow Last updated

Deep Feedforward Networks Lecture slides for Chapter 6 of Deep Learning www.deeplearningbook.org Ian Goodfellow Last updated 2016-10-04 Roadmap Example: Learning XOR Gradient-Based Learning Hidden Units

### Learning strategies for neuronal nets - the backpropagation algorithm

Learning strategies for neuronal nets - the backpropagation algorithm In contrast to the NNs with thresholds we handled until now NNs are the NNs with non-linear activation functions f(x). The most common

### Lab 5: 16 th April Exercises on Neural Networks

Lab 5: 16 th April 01 Exercises on Neural Networks 1. What are the values of weights w 0, w 1, and w for the perceptron whose decision surface is illustrated in the figure? Assume the surface crosses the

### Automatic Differentiation and Neural Networks

Statistical Machine Learning Notes 7 Automatic Differentiation and Neural Networks Instructor: Justin Domke 1 Introduction The name neural network is sometimes used to refer to many things (e.g. Hopfield

### Machine Learning 2017

Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

### Relevance Vector Machines

LUT February 21, 2011 Support Vector Machines Model / Regression Marginal Likelihood Regression Relevance vector machines Exercise Support Vector Machines The relevance vector machine (RVM) is a bayesian

### CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes

CS 6501: Deep Learning for Computer Graphics Basics of Neural Networks Connelly Barnes Overview Simple neural networks Perceptron Feedforward neural networks Multilayer perceptron and properties Autoencoders

Classification with Perceptrons Reading: Chapters 1-3 of Michael Nielsen's online book on neural networks covers the basics of perceptrons and multilayer neural networks We will cover material in Chapters

### Radial Basis Function Networks. Ravi Kaushik Project 1 CSC Neural Networks and Pattern Recognition

Radial Basis Function Networks Ravi Kaushik Project 1 CSC 84010 Neural Networks and Pattern Recognition History Radial Basis Function (RBF) emerged in late 1980 s as a variant of artificial neural network.

### ECE521 Lecture7. Logistic Regression

ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard

### Machine Learning: Logistic Regression. Lecture 04

Machine Learning: Logistic Regression Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Supervised Learning Task = learn an (unkon function t : X T that maps input

### 9.2 Support Vector Machines 159

9.2 Support Vector Machines 159 9.2.3 Kernel Methods We have all the tools together now to make an exciting step. Let us summarize our findings. We are interested in regularized estimation problems of

### Reservoir Computing and Echo State Networks

An Introduction to: Reservoir Computing and Echo State Networks Claudio Gallicchio gallicch@di.unipi.it Outline Focus: Supervised learning in domain of sequences Recurrent Neural networks for supervised

### Midterm: CS 6375 Spring 2015 Solutions

Midterm: CS 6375 Spring 2015 Solutions The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for an

### σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

### COMP 551 Applied Machine Learning Lecture 14: Neural Networks

COMP 551 Applied Machine Learning Lecture 14: Neural Networks Instructor: Ryan Lowe (ryan.lowe@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted,

### Neural networks and support vector machines

Neural netorks and support vector machines Perceptron Input x 1 Weights 1 x 2 x 3... x D 2 3 D Output: sgn( x + b) Can incorporate bias as component of the eight vector by alays including a feature ith

### Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses

### Statistical NLP for the Web

Statistical NLP for the Web Neural Networks, Deep Belief Networks Sameer Maskey Week 8, October 24, 2012 *some slides from Andrew Rosenberg Announcements Please ask HW2 related questions in courseworks

### Kalman filtering and friends: Inference in time series models. Herke van Hoof slides mostly by Michael Rubinstein

Kalman filtering and friends: Inference in time series models Herke van Hoof slides mostly by Michael Rubinstein Problem overview Goal Estimate most probable state at time k using measurement up to time

### Chapter ML:VI (continued)

Chapter ML:VI (continued) VI Neural Networks Perceptron Learning Gradient Descent Multilayer Perceptron Radial asis Functions ML:VI-64 Neural Networks STEIN 2005-2018 Definition 1 (Linear Separability)

### Conjugate gradient algorithm for training neural networks

. Introduction Recall that in the steepest-descent neural network training algorithm, consecutive line-search directions are orthogonal, such that, where, gwt [ ( + ) ] denotes E[ w( t + ) ], the gradient

### Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine

Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Mike Tipping Gaussian prior Marginal prior: single α Independent α Cambridge, UK Lecture 3: Overview

### Introduction to Logistic Regression and Support Vector Machine

Introduction to Logistic Regression and Support Vector Machine guest lecturer: Ming-Wei Chang CS 446 Fall, 2009 () / 25 Fall, 2009 / 25 Before we start () 2 / 25 Fall, 2009 2 / 25 Before we start Feel

### The XOR problem. Machine learning for vision. The XOR problem. The XOR problem. x 1 x 2. x 2. x 1. Fall Roland Memisevic

The XOR problem Fall 2013 x 2 Lecture 9, February 25, 2015 x 1 The XOR problem The XOR problem x 1 x 2 x 2 x 1 (picture adapted from Bishop 2006) It s the features, stupid It s the features, stupid The

### Chapter ML:VI (continued)

Chapter ML:VI (continued) VI. Neural Networks Perceptron Learning Gradient Descent Multilayer Perceptron Radial asis Functions ML:VI-56 Neural Networks STEIN 2005-2013 Definition 1 (Linear Separability)

Optimization and Gradient Descent INFO-4604, Applied Machine Learning University of Colorado Boulder September 12, 2017 Prof. Michael Paul Prediction Functions Remember: a prediction function is the function

### Error Backpropagation

Error Backpropagation Sargur Srihari 1 Topics in Error Backpropagation Terminology of backpropagation 1. Evaluation of Error function derivatives 2. Error Backpropagation algorithm 3. A simple example

### Machine Learning and Data Mining. Multi-layer Perceptrons & Neural Networks: Basics. Prof. Alexander Ihler

+ Machine Learning and Data Mining Multi-layer Perceptrons & Neural Networks: Basics Prof. Alexander Ihler Linear Classifiers (Perceptrons) Linear Classifiers a linear classifier is a mapping which partitions

### Linear Regression (continued)

Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression

### Introduction to feedforward neural networks

. Problem statement and historical context A. Learning framework Figure below illustrates the basic framework that we will see in artificial neural network learning. We assume that we want to learn a classification

### Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify

### Neural Networks and Deep Learning.

Neural Networks and Deep Learning www.cs.wisc.edu/~dpage/cs760/ 1 Goals for the lecture you should understand the following concepts perceptrons the perceptron training rule linear separability hidden

### Machine Learning Basics Lecture 2: Linear Classification. Princeton University COS 495 Instructor: Yingyu Liang

Machine Learning Basics Lecture 2: Linear Classification Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d.

### Lecture 1: Supervised Learning

Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)

### STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks Delivered by Mark Ebden With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable

### The classifier. Theorem. where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know

The Bayes classifier Theorem The classifier satisfies where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know Alternatively, since the maximum it is

### Artificial Neural Networks Examination, June 2005

Artificial Neural Networks Examination, June 2005 Instructions There are SIXTY questions. (The pass mark is 30 out of 60). For each question, please select a maximum of ONE of the given answers (either

### Machine Learning Lecture 14

Machine Learning Lecture 14 Tricks of the Trade 07.12.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory Probability

### Eigen Vector Descent and Line Search for Multilayer Perceptron

igen Vector Descent and Line Search for Multilayer Perceptron Seiya Satoh and Ryohei Nakano Abstract As learning methods of a multilayer perceptron (MLP), we have the BP algorithm, Newton s method, quasi-

### Multivariate Analysis Techniques in HEP

Multivariate Analysis Techniques in HEP Jan Therhaag IKTP Institutsseminar, Dresden, January 31 st 2013 Multivariate analysis in a nutshell Neural networks: Defeating the black box Boosted Decision Trees:

### L20: MLPs, RBFs and SPR Bayes discriminants and MLPs The role of MLP hidden units Bayes discriminants and RBFs Comparison between MLPs and RBFs

L0: MLPs, RBFs and SPR Bayes discriminants and MLPs The role of MLP hidden units Bayes discriminants and RBFs Comparison between MLPs and RBFs CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU

### Feedforward Neural Networks

Feedforward Neural Networks Michael Collins 1 Introduction In the previous notes, we introduced an important class of models, log-linear models. In this note, we describe feedforward neural networks, which

### Unsupervised Classifiers, Mutual Information and 'Phantom Targets'

Unsupervised Classifiers, Mutual Information and 'Phantom Targets' John s. Bridle David J.e. MacKay Anthony J.R. Heading California Institute of Technology 139-74 Defence Research Agency Pasadena CA 91125

### CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS Learning Neural Networks Classifier Short Presentation INPUT: classification data, i.e. it contains an classification (class) attribute.

### Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward

### Deep Learning for NLP

Deep Learning for NLP CS224N Christopher Manning (Many slides borrowed from ACL 2012/NAACL 2013 Tutorials by me, Richard Socher and Yoshua Bengio) Machine Learning and NLP NER WordNet Usually machine learning

### Statistical Methods for NLP

Statistical Methods for NLP Information Extraction, Hidden Markov Models Sameer Maskey Week 5, Oct 3, 2012 *many slides provided by Bhuvana Ramabhadran, Stanley Chen, Michael Picheny Speech Recognition

### Midterm, Fall 2003

5-78 Midterm, Fall 2003 YOUR ANDREW USERID IN CAPITAL LETTERS: YOUR NAME: There are 9 questions. The ninth may be more time-consuming and is worth only three points, so do not attempt 9 unless you are

### 11 More Regression; Newton s Method; ROC Curves

More Regression; Newton s Method; ROC Curves 59 11 More Regression; Newton s Method; ROC Curves LEAST-SQUARES POLYNOMIAL REGRESSION Replace each X i with feature vector e.g. (X i ) = [X 2 i1 X i1 X i2

### Neural Networks Task Sheet 2. Due date: May

Neural Networks 2007 Task Sheet 2 1/6 University of Zurich Prof. Dr. Rolf Pfeifer, pfeifer@ifi.unizh.ch Department of Informatics, AI Lab Matej Hoffmann, hoffmann@ifi.unizh.ch Andreasstrasse 15 Marc Ziegler,

### CENG 793. On Machine Learning and Optimization. Sinan Kalkan

CENG 793 On Machine Learning and Optimization Sinan Kalkan 2 Now Introduction to ML Problem definition Classes of approaches K-NN Support Vector Machines Softmax classification / logistic regression Parzen

### 488 Solutions to the XOR Problem

488 Solutions to the XOR Problem Frans M. Coetzee * eoetzee@eee.emu.edu Department of Electrical Engineering Carnegie Mellon University Pittsburgh, PA 15213 Virginia L. Stonick ginny@eee.emu.edu Department

### COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16

COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS6 Lecture 3: Classification with Logistic Regression Advanced optimization techniques Underfitting & Overfitting Model selection (Training-