Multilayer Neural Networks


Introduction

Goal: classify objects by learning a nonlinearity. There are many problems for which linear discriminants are insufficient for minimum error. In the previous methods, the central difficulty was the choice of the appropriate nonlinear functions. A brute-force approach might be to select a complete basis set, such as all polynomials; but such a classifier would require far too many parameters to be determined from a limited number of training samples.

There is no automatic method for determining the nonlinearities when no information is provided to the classifier. In multilayer neural networks, the form of the nonlinearity is learned from the training data.

Feedforward Operation and Classification

A three-layer neural network consists of an input layer, a hidden layer, and an output layer, interconnected by modifiable weights represented by links between layers.


A single bias unit is connected to each unit other than the input units. The net activation of hidden unit j is

$net_j = \sum_{i=1}^{d} w_{ji} x_i + w_{j0} = \sum_{i=0}^{d} w_{ji} x_i \equiv \mathbf{w}_j^t \mathbf{x}$

where the subscript i indexes units in the input layer and j indexes units in the hidden layer; $w_{ji}$ denotes the input-to-hidden weight at hidden unit j. Each hidden unit emits an output that is a nonlinear function of its activation:

$y_j = f(net_j) = f(\mathbf{w}_j^t \mathbf{x} + w_{j0})$

(Each hidden unit defines an oriented ridge: the function is constant along directions perpendicular to $\mathbf{w}_j$.)


Multi-layer perceptrons: sums of ledge functions. Each nonlinear function is one-dimensional, mapped along the direction of its weight vector.

Figure 6.1 shows a simple threshold (sign) activation:

$f(net) = \mathrm{sgn}(net) = \begin{cases} +1 & \text{if } net \ge 0 \\ -1 & \text{if } net < 0 \end{cases}$

The function f(.) is also called the activation function or nonlinearity of a unit. Each output unit similarly computes its net activation based on the hidden-unit signals:

$net_k = \sum_{j=1}^{n_H} w_{kj} y_j + w_{k0} = \sum_{j=0}^{n_H} w_{kj} y_j \equiv \mathbf{w}_k^t \mathbf{y} = \mathbf{w}_k^t f(W_1 \mathbf{x})$

where the subscript k indexes units in the output layer, $n_H$ denotes the number of hidden units, and $W_1$ collects the input-to-hidden weights.

When there is more than one output unit, the outputs are denoted $z_k$. An output unit computes the nonlinear function of its net activation, emitting $z_k = f(net_k)$. In the case of c outputs (classes), we can view the network as computing c discriminant functions $z_k = g_k(\mathbf{x})$ and classify the input $\mathbf{x}$ according to the largest discriminant function $g_k(\mathbf{x})$, $k = 1, \dots, c$. The three-layer network with the weights listed in Fig. 6.1 solves the XOR problem.

The hidden unit $y_1$ computes the boundary $x_1 + x_2 + 0.5 = 0$: it emits $y_1 = +1$ if its net activation is $\ge 0$ and $y_1 = -1$ otherwise.

The hidden unit $y_2$ computes the boundary $x_1 + x_2 - 1.5 = 0$: it emits $y_2 = +1$ if its net activation is $\ge 0$ and $y_2 = -1$ otherwise.

The final output unit emits $z_1 = +1$ if and only if $y_1 = +1$ and $y_2 = -1$, that is

$z_1 = y_1 \text{ AND NOT } y_2 = (x_1 \text{ OR } x_2) \text{ AND NOT } (x_1 \text{ AND } x_2) = x_1 \text{ XOR } x_2$

which provides the nonlinear decision of Fig. 6.1.
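As a concrete check, this XOR network can be traced in a few lines of code. The sketch below is illustrative only: it assumes the ±1 input encoding of Fig. 6.1 and a sign activation, and the output-unit weights (+1, -1) with bias -0.5 are one choice consistent with $z_1 = y_1$ AND NOT $y_2$, not values taken from the figure.

```python
def sgn(net):
    """Threshold activation: +1 if net >= 0, else -1."""
    return 1 if net >= 0 else -1

def xor_net(x1, x2):
    # Hidden unit y1: boundary x1 + x2 + 0.5 = 0 (computes x1 OR x2)
    y1 = sgn(x1 + x2 + 0.5)
    # Hidden unit y2: boundary x1 + x2 - 1.5 = 0 (computes x1 AND x2)
    y2 = sgn(x1 + x2 - 1.5)
    # Output unit: z1 = y1 AND NOT y2; the weights (+1, -1) and bias -0.5
    # are an illustrative choice, not the exact values of Fig. 6.1
    return sgn(1.0 * y1 - 1.0 * y2 - 0.5)

for x in [(-1, -1), (-1, +1), (+1, -1), (+1, +1)]:
    print(x, "->", xor_net(*x))   # -1, +1, +1, -1
```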

General Feedforward Operation: case of c output units

$g_k(\mathbf{x}) \equiv z_k = f\left( \sum_{j=1}^{n_H} w_{kj}\, f\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right), \qquad k = 1, \dots, c \qquad (1)$

Hidden units enable us to express more complicated nonlinear functions and thus extend the range of possible classifications. The activation function does not have to be a sign function; it is often required to be continuous and differentiable. We can allow the activation function in the output layer to differ from that in the hidden layer, or use a different activation for each individual unit. For now we assume that all activation functions are identical.
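Equation (1) maps directly onto a short feedforward routine. The following sketch is a minimal illustration, assuming the bias weights are stored in column 0 of each weight matrix and using a sigmoid as the (interchangeable) activation; the function name forward and the array shapes are conventions chosen here, not notation from the text.

```python
import numpy as np

def f(net):
    # Example activation: a sigmoid; any continuous, differentiable
    # nonlinearity could be substituted here.
    return 1.0 / (1.0 + np.exp(-net))

def forward(x, W_hidden, W_output):
    """Evaluate equation (1) for all c output units.

    x        : input vector, shape (d,)
    W_hidden : input-to-hidden weights w_ji, shape (n_H, d + 1), column 0 = w_j0
    W_output : hidden-to-output weights w_kj, shape (c, n_H + 1), column 0 = w_k0
    """
    x_aug = np.concatenate(([1.0], x))   # prepend the bias unit
    y = f(W_hidden @ x_aug)              # hidden outputs y_j = f(net_j)
    y_aug = np.concatenate(([1.0], y))   # bias unit for the output layer
    return f(W_output @ y_aug)           # outputs z_k = g_k(x)

# Tiny example with d = 3 inputs, n_H = 4 hidden units, c = 2 outputs
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(4, 4))       # 4 x (3 + 1)
W_output = rng.normal(size=(2, 5))       # 2 x (4 + 1)
z = forward(np.array([0.2, -1.0, 0.5]), W_hidden, W_output)
print("discriminants g_k(x):", z, "-> predicted class:", int(np.argmax(z)))
```

Classifying the input by the largest discriminant is then just the argmax over the c outputs, as described above.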

Expressive Power of Multilayer Networks

Q: Can every decision be implemented by a three-layer network described by equation (1)?

A: Yes (A. Kolmogorov). Any continuous function from input to output can be implemented in a three-layer net, given a sufficient number of hidden units $n_H$, proper nonlinearities, and weights:

$g(\mathbf{x}) = \sum_{j=1}^{2n+1} \Xi_j\left( \sum_{i=1}^{n} \psi_{ij}(x_i) \right) \qquad \forall \mathbf{x} \in I^n \;\; (I = [0,1];\; n \ge 2)$

for properly chosen functions $\Xi_j$ and $\psi_{ij}$. We can always scale the input region of interest to lie in a hypercube, so this condition on the feature space is not limiting.


Each of the 2n+1 hidden units $\Xi_j$ takes as input a sum of n nonlinear functions, one for each input feature $x_i$. Unfortunately, Kolmogorov's theorem tells us very little about how to find the nonlinear functions from data; this is the central problem in network-based pattern recognition. Moreover, any specific network structure specializes the above formulation, and hence most networks must be viewed as approximating the target function rather than representing it exactly.


Backpropagation Algorithm

Any function from input to output can be approximated by a three-layer neural network. How do we learn the approximation?


Our goal now is to set the interconnection weights based on the training patterns and the desired outputs. In a three-layer network, it is a straightforward matter to understand how the output, and thus the error, depends on the hidden-to-output layer weights. The power of backpropagation is that it also lets us compute an effective error for each hidden unit, and thus derive a learning rule for the input-to-hidden weights; this solves what is known as the credit assignment problem.

A network has two modes of operation:

Feedforward: the feedforward operation consists of presenting a pattern to the input units and passing (or feeding) the signals through the network in order to produce outputs at the output units (no cycles!).

Learning: supervised learning consists of presenting an input pattern and modifying the network parameters (weights) so as to reduce the distance between the computed output and the desired output.


Network Learning

Let $t_k$ be the k-th target (desired) output and $z_k$ the k-th computed output of the network of equation (1), with $k = 1, \dots, c$, and let $\mathbf{w}$ represent all the weights of the network. The training error is

$J(\mathbf{w}) = \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 = \frac{1}{2} \|\mathbf{t} - \mathbf{z}\|^2$

The backpropagation learning rule is based on gradient descent. The weights are initialized with pseudo-random values and are changed in a direction that will reduce the error:

$\Delta \mathbf{w} = -\eta \, \frac{\partial J}{\partial \mathbf{w}}$

where $\eta$ is the learning rate, which sets the relative size of the change in the weights:

$\mathbf{w}(m+1) = \mathbf{w}(m) + \Delta \mathbf{w}(m)$

where m indexes the m-th pattern presented.

Error on the hidden-to-output weights:

$\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k} \cdot \frac{\partial net_k}{\partial w_{kj}} = -\delta_k \, \frac{\partial net_k}{\partial w_{kj}}$

where the sensitivity of output unit k is defined as $\delta_k = -\partial J / \partial net_k$ and describes how the overall error changes with the unit's net activation:

$\delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial z_k} \cdot \frac{\partial z_k}{\partial net_k} = (t_k - z_k)\, f'(net_k)$

Since $net_k = \mathbf{w}_k^t \mathbf{y}$, we have $\partial net_k / \partial w_{kj} = y_j$.

Conclusion: the weight update (learning rule) for the hidden-to-output weights is

$\Delta w_{kj} = \eta \, \delta_k \, y_j = \eta \, (t_k - z_k)\, f'(net_k)\, y_j$

Error on the input-to-hidden weights:

$\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \cdot \frac{\partial y_j}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ji}}$

However,

$\frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j}\left[ \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 \right] = -\sum_{k=1}^{c} (t_k - z_k)\, \frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k)\, \frac{\partial z_k}{\partial net_k} \cdot \frac{\partial net_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k)\, f'(net_k)\, w_{kj}$

As in the preceding case, we define the sensitivity of a hidden unit:

$\delta_j \equiv f'(net_j) \sum_{k=1}^{c} w_{kj}\, \delta_k$

which means that the sensitivity at a hidden unit is simply the sum of the individual sensitivities at the output units weighted by the hidden-to-output weights $w_{kj}$, all multiplied by $f'(net_j)$.

Conclusion: the learning rule for the input-to-hidden weights is

$\Delta w_{ji} = \eta \, x_i \, \delta_j = \eta \left[ \sum_{k=1}^{c} w_{kj} \delta_k \right] f'(net_j)\, x_i$

Starting with a pseudo-random weight configuration, the stochastic backpropagation algorithm can be written as:

begin initialize $n_H$, $\mathbf{w}$, criterion $\theta$, learning rate $\eta$, $m \leftarrow 0$
    do $m \leftarrow m + 1$
        $\mathbf{x}^m \leftarrow$ randomly chosen pattern
        $w_{ji} \leftarrow w_{ji} + \eta \, \delta_j x_i$ ;  $w_{kj} \leftarrow w_{kj} + \eta \, \delta_k y_j$
    until $J(\mathbf{w}) < \theta$
    return $\mathbf{w}$
end
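The update rules derived above translate almost line for line into code. The sketch below is a minimal illustration of stochastic backpropagation for a d → n_H → c network; it assumes a tanh activation (so $f'(net) = 1 - \tanh^2(net)$) and runs for a fixed number of random pattern presentations rather than testing the $\theta$ criterion. The array layout and names follow the feedforward sketch earlier and are assumptions, not notation from the text.

```python
import numpy as np

def f(net):        # tanh activation (smooth, differentiable)
    return np.tanh(net)

def f_prime(net):  # its derivative
    return 1.0 - np.tanh(net) ** 2

def train_backprop(X, T, n_H=3, eta=0.1, n_steps=5000, seed=0):
    """Stochastic backpropagation for a d -> n_H -> c network.

    X : training patterns, shape (n, d);  T : target outputs, shape (n, c)
    """
    n, d = X.shape
    c = T.shape[1]
    rng = np.random.default_rng(seed)
    W_hid = rng.normal(scale=0.5, size=(n_H, d + 1))  # w_ji, column 0 = bias
    W_out = rng.normal(scale=0.5, size=(c, n_H + 1))  # w_kj, column 0 = bias

    for _ in range(n_steps):
        m = rng.integers(n)                    # randomly chosen pattern x^m
        x = np.concatenate(([1.0], X[m]))      # bias unit + inputs
        net_j = W_hid @ x
        y = np.concatenate(([1.0], f(net_j)))  # bias unit + hidden outputs
        net_k = W_out @ y
        z = f(net_k)

        # Output sensitivities: delta_k = (t_k - z_k) f'(net_k)
        delta_k = (T[m] - z) * f_prime(net_k)
        # Hidden sensitivities: delta_j = f'(net_j) sum_k w_kj delta_k
        delta_j = f_prime(net_j) * (W_out[:, 1:].T @ delta_k)

        # Weight updates: Dw_kj = eta delta_k y_j ;  Dw_ji = eta delta_j x_i
        W_out += eta * np.outer(delta_k, y)
        W_hid += eta * np.outer(delta_j, x)
    return W_hid, W_out

# Usage: the XOR problem with +/-1 encoding (may need more steps or a
# different seed to converge; stochastic updates are not guaranteed to)
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
T = np.array([[-1], [1], [1], [-1]], dtype=float)
W_hid, W_out = train_backprop(X, T, n_H=3, eta=0.2, n_steps=20000)
```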

Stopping criterion

The algorithm terminates when the change in the criterion function J(w) is smaller than some preset value $\theta$. There are other stopping criteria that lead to better performance than this one. So far we have considered the error on a single pattern, but we really want an error defined over the entire set of training patterns. The total training error is the sum over the errors on the n individual patterns:

$J = \sum_{p=1}^{n} J_p \qquad (2)$

Stopping criterion (cont.)

A weight update may reduce the error on the single pattern being presented, but can increase the error on the full training set. However, given a large number of such individual updates, the total error of equation (2) decreases.

Learning Curves

Before training starts, the error on the training set is high; through the learning process, the error becomes smaller. The error per pattern depends on the amount of training data and on the expressive power of the network (such as the number of weights). The average error on an independent test set is always higher than on the training set, and it can decrease as well as increase during training. A validation set is used to decide when to stop training: we do not want to overfit the network and degrade its generalization ability, so we stop training at a minimum of the error on the validation set.
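Stopping at the minimum of the validation error amounts to tracking the best weights seen so far and halting when the error stops improving. The sketch below is generic and hypothetical: train_step and validation_error stand in for whatever per-epoch trainer and error measure are in use, and patience controls how long to wait after the last improvement.

```python
import copy

def train_with_early_stopping(train_step, validation_error,
                              max_epochs=1000, patience=20):
    """Stop training at a minimum of the validation-set error.

    train_step()        : runs one epoch of weight updates, returns the model
    validation_error(m) : error of model m on a held-out validation set
    Both callables are placeholders for the trainer actually in use.
    """
    best_model, best_err, since_best = None, float("inf"), 0
    for epoch in range(max_epochs):
        model = train_step()
        err = validation_error(model)
        if err < best_err:                 # new minimum of the validation error
            best_model = copy.deepcopy(model)
            best_err, since_best = err, 0
        else:
            since_best += 1
            if since_best >= patience:     # validation error stopped improving
                break                      # avoid overfitting the training set
    return best_model, best_err
```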
