Pattern Classification


Pattern Classification. All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.

Chapter 6: Multilayer Neural Networks (Sections 6.1-6.3)

Introduction
Feedforward Operation and Classification
Backpropagation Algorithm

Introduction

We have already seen NNs in previous chapters: the generic multicategory classifier from Chapter 2.

The Probabilistic Neural Network and the RCE network in Chapter 4.

The linear classifier schema in Chapter 5.

Goal: classify objects by learning a nonlinearity. There are many problems for which linear discriminants are insufficient for minimum error. In previous methods, the central difficulty was the choice of the appropriate nonlinear functions. A brute-force approach might be to select a complete basis set, such as all polynomials; but such a classifier would require too many parameters to be determined from a limited number of training samples.

There is no automatic method for determining the nonlinearities when no information is provided to the classifier. With multilayer neural networks, the form of the nonlinearity is learned from the training data.

Feedforward Operation and Classification

A three-layer neural network consists of an input layer, a hidden layer and an output layer, interconnected by modifiable weights represented by links between layers.


A single bias unit is connected to each unit other than the input units. Net activation:

$$\mathrm{net}_j = \sum_{i=1}^{d} x_i w_{ji} + w_{j0} = \sum_{i=0}^{d} x_i w_{ji} \equiv \mathbf{w}_j^t\,\mathbf{x},$$

where the subscript i indexes units in the input layer and j indexes units in the hidden layer; w_ji denotes the input-to-hidden layer weight at hidden unit j. (In neurobiology such weights or connections are called "synapses".) Each hidden unit emits an output that is a nonlinear function of its activation, that is: y_j = f(net_j).
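
As a concrete illustration, here is a minimal NumPy sketch (not from the slides; the function name, weight layout and activation choice are assumptions) of this hidden-layer computation, with the bias absorbed into an augmented input x_0 = 1:

```python
# Minimal sketch of the hidden-layer net activation net_j = sum_i x_i w_ji + w_j0.
import numpy as np

def hidden_activations(x, W_hidden, f=np.tanh):
    """x: (d,) input; W_hidden: (n_H, d+1) weights, column 0 holds the biases w_j0."""
    x_aug = np.concatenate(([1.0], x))   # x_0 = 1 absorbs the bias term
    net = W_hidden @ x_aug               # net_j = w_j . x
    return f(net)                        # y_j = f(net_j)

# Example: 2 inputs, 3 hidden units with arbitrary weights
y = hidden_activations(np.array([0.5, -1.0]), np.random.randn(3, 3))
```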

Figure 6.1 shows a simple threshold (sign) activation function:

$$f(\mathrm{net}) = \mathrm{sgn}(\mathrm{net}) \equiv \begin{cases} +1 & \text{if } \mathrm{net} \ge 0 \\ -1 & \text{if } \mathrm{net} < 0 \end{cases}$$

The function f(.) is also called the activation function or nonlinearity of a unit. There are more general activation functions with desirable properties. Each output unit similarly computes its net activation based on the hidden unit signals:

$$\mathrm{net}_k = \sum_{j=1}^{n_H} y_j w_{kj} + w_{k0} = \sum_{j=0}^{n_H} y_j w_{kj} \equiv \mathbf{w}_k^t\,\mathbf{y},$$

where the subscript k indexes units in the output layer and n_H denotes the number of hidden units.

When there is more than one output, the outputs are referred to as z_k. An output unit computes the nonlinear function of its net activation, emitting z_k = f(net_k). In the case of c outputs (classes), we can view the network as computing c discriminant functions z_k = g_k(x) and classify the input x according to the largest discriminant function g_k(x), k = 1, ..., c.

The three-layer network with the weights listed in fig. 6.1 solves the XOR problem.

The hidden unit y_1 computes the boundary x_1 + x_2 + 0.5 = 0: it emits y_1 = +1 if x_1 + x_2 + 0.5 ≥ 0 and y_1 = -1 otherwise. The hidden unit y_2 computes the boundary x_1 + x_2 - 1.5 = 0: it emits y_2 = +1 if x_1 + x_2 - 1.5 ≥ 0 and y_2 = -1 otherwise. The final output unit emits z_1 = +1 if and only if y_1 = +1 and y_2 = -1, so that z = y_1 AND NOT y_2 = (x_1 OR x_2) AND NOT (x_1 AND x_2) = x_1 XOR x_2, which provides the nonlinear decision of fig. 6.1.
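
To make the example concrete, the following sketch wires up this XOR network with ±1 threshold units. The hidden boundaries are the ones given above; the output weights are one illustrative choice (an assumption, not read off fig. 6.1) that realizes z = y_1 AND NOT y_2:

```python
# Sketch of the fig. 6.1 XOR network with +/-1 threshold units.
def sgn(net):
    return 1.0 if net >= 0 else -1.0

def xor_net(x1, x2):
    y1 = sgn(x1 + x2 + 0.5)   # fires for x1 OR x2
    y2 = sgn(x1 + x2 - 1.5)   # fires for x1 AND x2
    z = sgn(y1 - y2 - 1.0)    # y1 AND NOT y2  ->  x1 XOR x2
    return z

for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print((x1, x2), xor_net(x1, x2))   # -1, +1, +1, -1
```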

General Feedforward Operation: the case of c output units

$$g_k(\mathbf{x}) \equiv z_k = f\!\left(\sum_{j=1}^{n_H} w_{kj}\, f\!\left(\sum_{i=1}^{d} w_{ji}\, x_i + w_{j0}\right) + w_{k0}\right), \qquad k = 1,\dots,c \tag{1}$$

Hidden units enable us to express more complicated nonlinear functions and thus extend the classification capability. The activation function does not have to be a sign function; it is often required to be continuous and differentiable. We can allow the activation function in the output layer to differ from that in the hidden layer, or even use a different activation function for each individual unit. We assume for now that all activation functions are identical.
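
A minimal sketch of equation (1) as a forward pass, using augmented weight matrices whose first column holds the bias terms (the matrix names and shapes are assumptions, not notation from the slides):

```python
# Forward pass z_k = g_k(x) of a three-layer network, equation (1).
import numpy as np

def forward(x, W_hidden, W_out, f=np.tanh):
    """W_hidden: (n_H, d+1), W_out: (c, n_H+1). Returns the c discriminants z_k."""
    y = f(W_hidden @ np.concatenate(([1.0], x)))   # hidden outputs y_j
    z = f(W_out @ np.concatenate(([1.0], y)))      # output discriminants z_k
    return z

# Classify the input x according to the largest discriminant function
x = np.array([0.2, -0.7])
z = forward(x, np.random.randn(4, 3), np.random.randn(3, 5))
predicted_class = int(np.argmax(z))
```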

Expressive Power of Multilayer Networks

Question: Can every decision be implemented by a three-layer network described by equation (1)?

Answer: Yes (due to A. Kolmogorov). Any continuous function from input to output can be implemented in a three-layer net, given a sufficient number of hidden units n_H, proper nonlinearities, and weights:

$$g(\mathbf{x}) = \sum_{j=1}^{2n+1} \Xi_j\!\left(\sum_{i=1}^{d} \psi_{ij}(x_i)\right), \qquad \mathbf{x} \in I^n\ (I = [0,1];\ n \ge 2)$$

for properly chosen functions Ξ_j and ψ_ij.

Each of the 2n+1 hidden units j takes as input a sum of d nonlinear functions, one for each input feature x_i. Each hidden unit emits a nonlinear function Ξ_j of its total input. The output unit emits the sum of the contributions of the hidden units. Unfortunately, Kolmogorov's theorem tells us very little about how to find the nonlinear functions Ξ_j and ψ_ij based on data; this is the central problem in network-based pattern recognition.


Any function from input to output can be implemented as a three-layer neural network. These results are of greater theoretical than practical interest, since the construction of such a network requires the nonlinear functions and the weight values, which are unknown!


Backpropagation Algorithm

Our goal now is to set the interconnection weights based on the training patterns and the desired outputs. In a three-layer network, it is a straightforward matter to understand how the output, and thus the error, depends on the hidden-to-output layer weights. The power of backpropagation is that it enables us to compute an effective error for each hidden unit, and thus derive a learning rule for the input-to-hidden weights; this is known as the credit assignment problem.

Networks have two modes of operation:

Feedforward: the feedforward operation consists of presenting a pattern to the input units and passing (or feeding) the signals through the network in order to obtain outputs at the output units (no cycles!).

Learning: supervised learning consists of presenting an input pattern and modifying the network parameters (weights) to reduce the distance between the computed output and the desired output.


Network Learning

Let t_k be the k-th target (or desired) output and z_k be the k-th computed output, with k = 1, ..., c, and let w represent all the weights of the network. The training error is:

$$J(\mathbf{w}) = \frac{1}{2}\sum_{k=1}^{c}(t_k - z_k)^2 = \frac{1}{2}\,\lVert \mathbf{t} - \mathbf{z} \rVert^2$$

The backpropagation learning rule is based on gradient descent. The weights are initialized with pseudo-random values and are changed in a direction that will reduce the error:

$$\Delta\mathbf{w} = -\eta\,\frac{\partial J}{\partial \mathbf{w}}$$
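
A tiny sketch (with assumed function names) of this criterion and of a single gradient-descent step:

```python
# Squared-error criterion and one gradient-descent step Delta w = -eta * dJ/dw.
import numpy as np

def training_error(t, z):
    """J(w) = 1/2 * ||t - z||^2 for targets t and network outputs z."""
    return 0.5 * np.sum((t - z) ** 2)

def gradient_step(w, grad_J, eta=0.1):
    """Move the weight vector w against the gradient of J."""
    return w - eta * grad_J
```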

where η is the learning rate, which indicates the relative size of the change in weights:

$$\mathbf{w}(m+1) = \mathbf{w}(m) + \Delta\mathbf{w}(m)$$

at iteration m (m also indexes the pattern presented).

Error on the hidden-to-output weights:

$$\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial \mathrm{net}_k}\cdot\frac{\partial \mathrm{net}_k}{\partial w_{kj}} = -\delta_k\,\frac{\partial \mathrm{net}_k}{\partial w_{kj}},$$

where the sensitivity of unit k is defined as

$$\delta_k \equiv -\frac{\partial J}{\partial \mathrm{net}_k} = -\frac{\partial J}{\partial z_k}\cdot\frac{\partial z_k}{\partial \mathrm{net}_k} = (t_k - z_k)\, f'(\mathrm{net}_k)$$

and describes how the overall error changes with the unit's net activation.

Since net_k = w_k^t · y, we have

$$\frac{\partial \mathrm{net}_k}{\partial w_{kj}} = y_j.$$

Conclusion: the weight update (or learning rule) for the hidden-to-output weights is

$$\Delta w_{kj} = \eta\,\delta_k\, y_j = \eta\,(t_k - z_k)\, f'(\mathrm{net}_k)\, y_j.$$

Error on the input-to-hidden weights:

$$\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j}\cdot\frac{\partial y_j}{\partial \mathrm{net}_j}\cdot\frac{\partial \mathrm{net}_j}{\partial w_{ji}}$$

However,

$$\frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j}\left[\frac{1}{2}\sum_{k=1}^{c}(t_k - z_k)^2\right] = -\sum_{k=1}^{c}(t_k - z_k)\,\frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c}(t_k - z_k)\, f'(\mathrm{net}_k)\, w_{kj}.$$

Similarly to the preceding case, we define the sensitivity of a hidden unit:

$$\delta_j \equiv f'(\mathrm{net}_j)\sum_{k=1}^{c} w_{kj}\,\delta_k,$$

which means that the sensitivity at a hidden unit is simply the sum of the individual sensitivities at the output units weighted by the hidden-to-output weights w_kj, all multiplied by f'(net_j).

Conclusion: the learning rule for the input-to-hidden weights is

$$\Delta w_{ji} = \eta\, x_i\,\delta_j = \eta\,\underbrace{\left[\sum_{k=1}^{c} w_{kj}\,\delta_k\right] f'(\mathrm{net}_j)}_{\delta_j}\, x_i.$$
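
Putting the two rules together, here is a sketch of the per-pattern sensitivities and weight updates, assuming f = tanh so that f'(net) = 1 - f(net)^2 (the function and variable names are assumptions, not notation from the slides):

```python
# Per-pattern backpropagation updates for a three-layer network with f = tanh.
import numpy as np

def backprop_updates(x, t, W_hidden, W_out, eta=0.1):
    """W_hidden: (n_H, d+1), W_out: (c, n_H+1); first columns hold the biases."""
    x_aug = np.concatenate(([1.0], x))
    y = np.tanh(W_hidden @ x_aug)                          # hidden outputs y_j
    y_aug = np.concatenate(([1.0], y))
    z = np.tanh(W_out @ y_aug)                             # network outputs z_k

    delta_k = (t - z) * (1.0 - z**2)                       # (t_k - z_k) f'(net_k)
    delta_j = (1.0 - y**2) * (W_out[:, 1:].T @ delta_k)    # f'(net_j) sum_k w_kj delta_k

    dW_out = eta * np.outer(delta_k, y_aug)                # Delta w_kj = eta delta_k y_j
    dW_hidden = eta * np.outer(delta_j, x_aug)             # Delta w_ji = eta delta_j x_i
    return dW_hidden, dW_out
```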

Starting with a pseudo-random weight configuration, the stochastic backpropagation algorithm can be written as:

Begin initialize n_H, w, criterion θ, η, m ← 0
  do m ← m + 1
       x^m ← randomly chosen pattern
       w_ji ← w_ji + η δ_j x_i ;  w_kj ← w_kj + η δ_k y_j
  until J(w) < θ
  return w
End
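
A rough Python rendering of this stochastic loop, reusing the assumed forward and backprop_updates helpers from the earlier sketches; it is a sketch of the idea, not a reference implementation:

```python
# Stochastic backpropagation: update the weights after each randomly chosen pattern.
import numpy as np

def train_stochastic(X, T, n_H, eta=0.1, theta=1e-3, max_iter=100000, seed=0):
    rng = np.random.default_rng(seed)
    d, c = X.shape[1], T.shape[1]
    W_hidden = 0.1 * rng.standard_normal((n_H, d + 1))   # pseudo-random initialization
    W_out = 0.1 * rng.standard_normal((c, n_H + 1))
    for m in range(max_iter):
        i = rng.integers(len(X))                          # randomly chosen pattern x^m
        dW_h, dW_o = backprop_updates(X[i], T[i], W_hidden, W_out, eta)
        W_hidden += dW_h
        W_out += dW_o
        # For the sketch we recompute the full criterion each step; in practice
        # one checks it only periodically.
        J = 0.5 * np.sum((T - np.array([forward(x, W_hidden, W_out) for x in X]))**2)
        if J < theta:
            break
    return W_hidden, W_out
```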

Batch backpropagation:

Begin initialize n_H, w, criterion θ, η, r ← 0
  do r ← r + 1  (epoch counter)
       m ← 0 ;  Δw_ji ← 0 ;  Δw_kj ← 0
       do m ← m + 1
            x^m ← select pattern
            Δw_ji ← Δw_ji + η δ_j x_i ;  Δw_kj ← Δw_kj + η δ_k y_j
       until m = n
       w_ji ← w_ji + Δw_ji ;  w_kj ← w_kj + Δw_kj
  until J(w) < θ
  return w
End
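
And a corresponding sketch of the batch variant, which accumulates the per-pattern updates over one epoch before applying them (again reusing the assumed helpers above):

```python
# Batch backpropagation: one weight update per epoch over all n patterns.
import numpy as np

def train_batch(X, T, W_hidden, W_out, eta=0.1, theta=1e-3, max_epochs=10000):
    for r in range(max_epochs):                          # epoch counter
        dW_h_total = np.zeros_like(W_hidden)
        dW_o_total = np.zeros_like(W_out)
        for x, t in zip(X, T):                           # loop over all n patterns
            dW_h, dW_o = backprop_updates(x, t, W_hidden, W_out, eta)
            dW_h_total += dW_h
            dW_o_total += dW_o
        W_hidden += dW_h_total                           # apply the accumulated updates
        W_out += dW_o_total
        J = 0.5 * sum(np.sum((t - forward(x, W_hidden, W_out))**2) for x, t in zip(X, T))
        if J < theta:
            break
    return W_hidden, W_out
```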

Stopping criterion

The algorithm terminates when the change in the criterion function J(w) is smaller than some preset value θ. There are other stopping criteria that lead to better performance than this one. So far we have considered the error on a single pattern, but we want an error defined over the entire set of training patterns. The total training error is the sum of the errors over the n individual patterns:

$$J = \sum_{p=1}^{n} J_p \tag{2}$$

Stopping criterion (cont.)

A weight update may reduce the error on the single pattern being presented but can increase the error on the full training set. However, given a large number of such individual updates, the total error of equation (2) decreases.

Learning Curves

Before training starts, the error on the training set is high; through the learning process it becomes smaller. The error per pattern depends on the amount of training data and on the expressive power of the network (such as the number of weights). The average error on an independent test set is always higher than on the training set, and it can decrease as well as increase. A validation set is used to decide when to stop training; we do not want to overfit the network and degrade its generalization power, so we stop training at a minimum of the error on the validation set.
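
One simple way to implement this validation-based rule is sketched below; the helper names and the patience mechanism are assumptions, not part of the slides:

```python
# Early stopping: halt training near the minimum of the validation error.
import numpy as np

def train_with_early_stopping(train_step, validation_error, max_epochs=200, patience=10):
    """train_step(): runs one epoch of training in place.
       validation_error(): returns the current error on the held-out validation set."""
    best_err, best_epoch, epochs_since_best = np.inf, 0, 0
    for epoch in range(max_epochs):
        train_step()
        err = validation_error()
        if err < best_err:                       # new minimum of the validation error
            best_err, best_epoch = err, epoch
            epochs_since_best = 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:        # stop at (near) the validation minimum
            break
    return best_epoch, best_err
```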


EXERCISES

Exercise #1. Explain why an MLP (multilayer perceptron) does not learn if the initial weights and biases are all zero.

Exercise #2. (#2, p. 344)