Multilayer Neural Networks


Pattern Recognition
Lecture 4: Multilayer Neural Networks
Prof. Daniel Yeung
School of Computer Science and Engineering
South China University of Technology

Outline
- Introduction
- Artificial Neural Network
- Multiple Layer Perceptron NN
- Back-Propagation Algorithm (6.3)
- Regularization
- Relating NN and Bayes Theory (6.6)
- Practical Techniques (6.8)

Introduction
Recall the linear discriminant functions:
[Figure: a single-layer network; inputs x1..x5 feed through weights w into two summation units producing the outputs g1 and g2 (input layer, output layer).]
- Limited generalization capability
- Cannot handle non-linearly separable problems

Solution 1: Mapping Function φ(x)
[Figure: the inputs x1..x5 are first passed through mapping functions φ before the weighted summation units (input layer, output layer).]
- Pro: simple structure (still using a linear discriminant function)
- Cons: the selection of φ(x) and of its parameters
- Already discussed in Lecture 03

Solution 2: Multi-Layer Neural Network
[Figure: inputs x1..x5 feed through hidden layers of weighted units into two summation units producing g1 and g2 (input layer, hidden layers, output layer).]
- A multi-layer neural network (multilayer perceptron) contains one or more hidden layers, which serve as the mapping function
- Will be introduced in this lecture
- No need to choose the nonlinear mapping φ(x), and no need to have any prior knowledge relevant to the classification problem

Artificial Neural Network (ANN)
- A very simplified model of the brain
- Composed of neurons which cooperate together
- Basically a function approximator: it transforms inputs into outputs to the best of its ability
[Figure: a single neuron; inputs I1..Id arrive over synapses with weights w1..wd, are combined, and pass through an activation function f with threshold θ to produce the output.]

Artificial Neural Network (ANN)
How does a neuron work?
- The output of a neuron is a function of the weighted sum of the inputs, plus an optional bias unit that always emits a value of 1 or -1 and acts as a threshold:
  output = f(w1*I1 + w2*I2 + ... + wd*Id + bias)
- The function f is called the activation function

Activation Function
Examples:
- Linear function: f(x) = x. The output is the same as the input; differentiable.
- Sign function: f(x) = 1 if x > 0, -1 if x < 0. Used for decision making; not differentiable.
- Sigmoid function: f(x) = 1/(1 + e^(-x)). Smooth, continuous, and monotonically increasing; differentiable.
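
To make the neuron model concrete, here is a minimal Python sketch; the input values, weights, and bias are made-up numbers for illustration:

    import numpy as np

    def sign(x):
        # Sign activation: +1 for positive net input, -1 otherwise; not differentiable.
        return np.where(x > 0, 1.0, -1.0)

    def sigmoid(x):
        # Sigmoid activation: smooth, continuous, monotonically increasing.
        return 1.0 / (1.0 + np.exp(-x))

    def neuron_output(inputs, weights, bias, f):
        # Output = f(w1*I1 + ... + wd*Id + bias), exactly as on the slide.
        return f(np.dot(weights, inputs) + bias)

    I = np.array([0.5, -1.0, 2.0])   # example inputs (made up)
    w = np.array([0.2, 0.4, -0.1])   # example weights (made up)
    print(neuron_output(I, w, bias=0.1, f=sigmoid))
    print(neuron_output(I, w, bias=0.1, f=sign))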

Artificial Neural Network (ANN): XOR Example
A two-input network with two hidden units and one output unit, all using the sign activation sgn, can implement XOR:
- Hidden unit 1: y1 = sgn(x1 + x2 + 0.5), which implements an OR gate
- Hidden unit 2: y2 = sgn(x1 + x2 - 1.5), which implements an AND gate
- The final output unit implements an AND NOT gate: (x1 OR x2) AND NOT (x1 AND x2), which is exactly x1 XOR x2

Evaluating all four inputs (using the ±1 encoding):
  x1 = +1, x2 = +1:  y1 = sgn(+2.5) = +1,  y2 = sgn(+0.5) = +1,  z = -1
  x1 = +1, x2 = -1:  y1 = sgn(+0.5) = +1,  y2 = sgn(-1.5) = -1,  z = +1
  x1 = -1, x2 = +1:  y1 = sgn(+0.5) = +1,  y2 = sgn(-1.5) = -1,  z = +1
  x1 = -1, x2 = -1:  y1 = sgn(-1.5) = -1,  y2 = sgn(-3.5) = -1,  z = -1
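
A quick check of this construction in Python. The hidden-unit weights follow the slide; the output-unit weights (1, -1) and bias -0.5 are one illustrative choice that realizes "y1 AND NOT y2", since the slide's exact output-unit values are not recoverable:

    import numpy as np

    def sgn(v):
        return np.where(v > 0, 1, -1)

    def xor_net(x1, x2):
        y1 = sgn(x1 + x2 + 0.5)      # hidden unit 1: OR gate
        y2 = sgn(x1 + x2 - 1.5)      # hidden unit 2: AND gate
        # Output unit: y1 AND NOT y2 (weights 1, -1, bias -0.5 are illustrative).
        return sgn(y1 - y2 - 0.5)

    for x1 in (-1, 1):
        for x2 in (-1, 1):
            print(x1, x2, "->", xor_net(x1, x2))   # +1 exactly when x1 != x2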

Artificial Neural Network (ANN): Structure
Illustrative example: a simple three-layer neural network.
[Figure: inputs x1, x2 feed through weights w into three hidden units, whose outputs feed into two output units g1, g2 (input layer, hidden layer, output layer).]
- Input layer: 2 input units. x1: length of salmon/seabass; x2: lightness of salmon/seabass
- Hidden layer: 3 hidden units. Top hidden neuron: a length discriminant function; middle hidden neuron: a combination of the length and lightness discriminant functions; bottom hidden neuron: a lightness discriminant function
- Output layer: 2 output units, g1 and g2: the final outputs
- The weights w_i assign the importance of each input from a neuron

Forward operation:
- Each hidden unit j computes net_j = Σ_{m=1}^{d} w_jm x_m and emits y_j = f(net_j)
- Each output unit k computes net'_k = Σ_{j=1}^{n_H} w'_kj y_j and emits g_k = f(net'_k), so overall
  g_k(x) = f( Σ_j w'_kj f( Σ_m w_jm x_m ) )
- A two-layer network classifier can only implement a linear decision boundary
- Three-, four- and higher-layer networks can implement arbitrary decision boundaries; the decision regions need not be convex, nor simply connected
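
A vectorized sketch of the forward operation above (bias terms omitted for brevity; the random weights are placeholders):

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def forward(x, W_ih, W_ho, f=sigmoid):
        # Hidden layer: net_j = sum_m w_jm x_m, then y_j = f(net_j).
        y = f(W_ih @ x)
        # Output layer: net'_k = sum_j w'_kj y_j, then g_k = f(net'_k).
        return f(W_ho @ y)

    rng = np.random.default_rng(0)
    W_ih = rng.normal(size=(3, 2))   # 3 hidden units, d = 2 inputs
    W_ho = rng.normal(size=(2, 3))   # 2 output units g1, g2
    print(forward(np.array([1.0, 2.0]), W_ih, W_ho))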

Multiple Layer Perceptron NN (MLPNN)
- The most common NN: more than one layer, with the sigmoid used as activation function
- A general function approximator, not limited to linear problems
[Figure: an example MLPNN with sigmoid activation functions between the input layer and the output layer.]

Training: Weight Determination
- Weights can be determined by training: reduce the error between the desired outputs and the NN outputs on the training samples
- The back-propagation algorithm is the most widely used method for determining the weights; it is a natural extension of the LMS algorithm
- Pros: a simple and general method
- Cons: slow, and can become trapped at local minima

Back-Propagation (BP) Algorithm
- The calculation of the derivatives flows backwards through the network; hence it is called back-propagation
- These derivatives point in the direction of the maximum increase of the error function (find out where the maximum error is being made, and go back to try to decrease this error)
- A small step (the learning rate) in the opposite direction will result in the maximum decrease of the (local) error function: w' = w - α ∂E/∂w, where α is the learning rate and E the error function
- The most common measure of error is the mean squared error: J(w) = (1/2) Σ_k (target_k - output_k)²
- The update rule for a weight is w(k+1) = w(k) - η ∂J/∂w, where η is the learning rate, which controls the size of each step
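
The update rule is plain gradient descent. A minimal sketch on a made-up one-dimensional error function:

    def gradient_descent(grad, w0, eta=0.1, steps=100):
        # Repeatedly step against the gradient: w(k+1) = w(k) - eta * dJ/dw.
        w = w0
        for _ in range(steps):
            w -= eta * grad(w)
        return w

    # Toy error function J(w) = (w - 3)^2, with gradient dJ/dw = 2 (w - 3).
    print(gradient_descent(lambda w: 2 * (w - 3), w0=0.0))  # converges near 3.0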

Back-Propagation (BP) Algorithm: 3-Layer NN
The next slides show BP for a 3-layer NN, which has two types of weights: input-to-hidden weights w_ji and hidden-to-output weights w_kj.

The learning rule for the hidden-to-output units:
  Δw_kj = η δ_k y_j,   where δ_k = (target_k - output_k) f'(net_k)

The learning rule for the input-to-hidden units:
  Δw_ji = η δ_j x_i,   where δ_j = f'(net_j) Σ_k w_kj δ_k
The hidden-unit sensitivity δ_j is the sum of the output sensitivities δ_k, propagated back through the hidden-to-output weights w_kj.

Summary:
- Hidden-to-output weight update: Δw_kj = η (target_k - output_k) f'(net_k) y_j
- Input-to-hidden weight update: Δw_ji = η [Σ_k w_kj (target_k - output_k) f'(net_k)] f'(net_j) x_i
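
A sketch of one BP step for such a 3-layer network, assuming sigmoid units (for which f'(net) = f(net)(1 - f(net))) and omitting bias terms; the training pattern is made up:

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def bp_step(x, target, W_ih, W_ho, eta=0.5):
        # Forward pass.
        y = sigmoid(W_ih @ x)                      # hidden outputs y_j
        out = sigmoid(W_ho @ y)                    # network outputs
        # Output sensitivity: delta_k = (target_k - output_k) f'(net_k).
        delta_k = (target - out) * out * (1.0 - out)
        # Hidden sensitivity: delta_j = f'(net_j) * sum_k w_kj delta_k.
        delta_j = y * (1.0 - y) * (W_ho.T @ delta_k)
        # Updates: Delta w_kj = eta delta_k y_j ; Delta w_ji = eta delta_j x_i.
        W_ho += eta * np.outer(delta_k, y)
        W_ih += eta * np.outer(delta_j, x)
        return W_ih, W_ho

    # Train on a single made-up pattern and watch the output approach the target.
    rng = np.random.default_rng(0)
    W_ih, W_ho = rng.normal(size=(3, 2)), rng.normal(size=(2, 3))
    for _ in range(200):
        W_ih, W_ho = bp_step(np.array([1.0, -1.0]), np.array([1.0, 0.0]), W_ih, W_ho)
    print(sigmoid(W_ho @ sigmoid(W_ih @ np.array([1.0, -1.0]))))  # approaches [1, 0]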

Back-Propagation (BP) Algorithm: Training Algorithm
- For the same training set, the weights of the NN can be updated differently by presenting the training samples in different sequences. There are two popular methods:
- Stochastic training: patterns are chosen randomly from the training set, and the network weights are updated after each single presentation
- Batch training: all patterns are presented to the network before learning (a weight update) takes place
A sketch of both presentation orders follows.
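
A sketch of the two presentation orders; update_fn, grad_fn, and apply_update are hypothetical callables standing in for whatever per-pattern update or gradient computation the network uses:

    import numpy as np

    def stochastic_epoch(X, T, update_fn, rng):
        # Stochastic training: visit the patterns in a random order and update
        # the weights immediately after each single presentation.
        for i in rng.permutation(len(X)):
            update_fn(X[i], T[i])

    def batch_epoch(X, T, grad_fn, apply_update):
        # Batch training: present all patterns first, accumulate the gradient,
        # then perform one weight update for the whole epoch.
        total = sum(grad_fn(x, t) for x, t in zip(X, T))
        apply_update(total)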

Is a classifier with a smaller training error better?
- In most cases, NO! We have discussed this issue in a previous lecture: minimizing only the empirical risk R_emp (the training error) leads to the overfitting problem.

Regularization
- In Lecture 03 we mentioned that in most cases the solution (the discriminant function) is not unique: an ill-posed problem
- Which one is the best? Is it enough to minimize the training error? Is the classifier too complex? Does it have good generalization ability? (Stop training when the testing error reaches a minimum; test on unseen samples.)
- Regularization is one of the methods to handle this problem: add a regularization term ψ(f), measuring the smoothness of the decision surface, to the objective function
- A tradeoff parameter λ controls the relative importance of training accuracy and the regularization term
- Seek a smooth classifier with good performance on the training set; this may sacrifice training error for the simplicity of the classifier if necessary
- Minimize: R_emp + λ ψ(f)

The effect of λ (the regularization parameter; ψ is the regularization function):
- λ → 0: similar to the traditional training objective function; the regularization term has no effect
- λ → ∞: dominated by the regularization term; the smoothest classifier is found
- If we can find a suitable λ, we may find an f with a good generalization ability

Weight Decay
- A well-known regularization example: the regularization term measures the magnitude of the weights, ψ(f) = ||w||²; the smaller ||w||, the smoother f
- The objective function becomes: minimize R_emp + λ ||w||²
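
A sketch of the weight-decay objective; the second helper shows the equivalent practical form in which each weight is shrunk slightly after every update (the decay factor eps is an illustrative choice):

    import numpy as np

    def weight_decay_objective(R_emp, weight_matrices, lam):
        # Minimize R_emp + lambda * ||w||^2: the penalty grows with the
        # squared magnitude of all weights, favoring smoother classifiers.
        psi = sum(np.sum(W ** 2) for W in weight_matrices)
        return R_emp + lam * psi

    def decay_weights(W, eps=1e-4):
        # Equivalent practical form: w_new = (1 - eps) * w after each update.
        return (1.0 - eps) * W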

NN and Bayes Theory
Recall the Bayes formula:
  P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x),   where p(x) = Σ_{i=1}^{c} p(x | ω_i) P(ω_i)
Suppose a network is trained using the following target output setting:
  target_k(x) = 1 if x ∈ ω_k, and 0 otherwise

When the number of training samples tends to infinity (see p. 304 in the text), using the mean squared error for J:
  lim_{n→∞} (1/n) J(w)
    = P(ω_k) ∫ [g_k(x; w) - 1]² p(x | ω_k) dx + P(ω_{i≠k}) ∫ g_k²(x; w) p(x | ω_{i≠k}) dx
    = ∫ g_k²(x; w) p(x) dx - 2 ∫ g_k(x; w) p(x, ω_k) dx + ∫ p(x, ω_k) dx
    = ∫ [g_k(x; w) - P(ω_k | x)]² p(x) dx  +  ∫ P(ω_k | x) [1 - P(ω_k | x)] p(x) dx
      (first term: dependent on w)             (second term: independent of w)
When we minimize J(w) with respect to w, the first term ∫ [g_k(x; w) - P(ω_k | x)]² p(x) dx is minimized, so the trained network approximates the posterior probability:
  g_k(x; w) ≈ P(ω_k | x)

Thus, when MLPNNs are trained via back-propagation on a sum-squared-error criterion, they provide a least-squares fit to the Bayes discriminant function, i.e. g_k(x; w) ≈ P(ω_k | x).
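
A small numerical illustration of this result (not from the slides): with 0/1 targets, the least-squares fit in each small bin of x is just the local mean of the targets, which converges to the posterior P(ω2 | x). The two unit-variance Gaussian class-conditional densities are an assumption made for the demo:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200000
    # Class labels with equal priors; t = 1 means class w2.
    labels = rng.integers(0, 2, n)
    # Assumed class-conditional densities: p(x|w1) = N(-1, 1), p(x|w2) = N(+1, 1).
    x = rng.normal(np.where(labels == 1, 1.0, -1.0), 1.0)

    # Least-squares fit to the 0/1 targets, computed independently in each small
    # bin of x: the minimizer is the local mean target, an estimate of P(w2|x).
    bins = np.linspace(-4, 4, 41)
    idx = np.digitize(x, bins)
    ls_fit = np.array([labels[idx == i].mean() for i in range(1, len(bins))])

    # True posterior from the Bayes formula for these two Gaussians.
    centers = (bins[:-1] + bins[1:]) / 2
    true_post = 1.0 / (1.0 + np.exp(-2.0 * centers))
    print(np.max(np.abs(ls_fit - true_post)))  # small: the fit tracks P(w2|x)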

How to design a MLPNN to handle a given classification problem?
Given a training set {(x_1, y_1), ..., (x_n, y_n)}, the following issues must be considered: scaling input, target values, number of hidden layers, number of hidden units, initializing weights, learning rates, momentum, weight decay, stochastic vs. batch training, and stopped training.

Scaling Input
- Features of different natures will have different properties (e.g. range, mean)
- For example, fish mass (grams) and length (meters): normally the value of the mass will be orders of magnitude larger than that of the length
- During training, the network would adjust the weights from the mass input unit far more than those from the length input, and the error would hardly depend upon the tiny length values; the situation would be reversed with mass in kilograms and length in millimeters
- How to reduce this influence? Normalization (standardization): standardize the training samples to have the same range (e.g. 0 to 1, or -1 to 1), the same variance (e.g. 1), and the same average (e.g. 0)

Target Values
- Usually a one-of-c representation is used for the target vector
- For a four-class problem, four outputs are used:
  ω1: (1, -1, -1, -1) or (1, 0, 0, 0)
  ω2: (-1, 1, -1, -1) or (0, 1, 0, 0)
  ω3: (-1, -1, 1, -1) or (0, 0, 1, 0)
  ω4: (-1, -1, -1, 1) or (0, 0, 0, 1)
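
Sketches of both preprocessing steps (the fish measurements are made-up numbers):

    import numpy as np

    def standardize(X):
        # Give every feature zero mean and unit variance, so that e.g. mass in
        # grams cannot dominate length in meters during training.
        return (X - X.mean(axis=0)) / X.std(axis=0)

    def one_of_c(labels, c):
        # One-of-c target encoding: class k maps to the unit vector with a 1
        # at position k and 0 elsewhere.
        T = np.zeros((len(labels), c))
        T[np.arange(len(labels)), labels] = 1.0
        return T

    X = np.array([[12000.0, 0.5], [9000.0, 0.7], [15000.0, 0.4]])  # mass (g), length (m)
    print(standardize(X))
    print(one_of_c(np.array([0, 2, 1]), c=4))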

Number of Hidden Layers
- The BP algorithm works well for NNs with many hidden layers, as long as the units are differentiable
- How many hidden layers are enough? NNs with more hidden layers learn translations more easily, and some functions can be implemented more efficiently; however, they have more undesirable local minima and are more complex
- Since an arbitrary function can be approximated by an MLP with one hidden layer, a 3-layer NN is usually recommended; special problem conditions or requirements may justify the use of more than 3 layers

Number of Hidden Units
- Governs the expressive power of the NN (e.g. for facial recognition, neurons for the mouth, nose, ear, eye, face shape, etc.)
- Well separated or linearly separable samples need few hidden units; complicated problems need more
- One study shows the minimum error occurs for NNs in the range of 4-5 hidden units
[Figure: test error plotted against the number of weights n_w and the number of hidden units n_H, with a minimum at intermediate network sizes.]
- n_H determines the total number of weights in the net, and we should not have more weights than the total number of training points n
- Without further information, n_H cannot be determined before training
- Experimentally: choose n_H such that the total number of weights in the net is roughly n/10 (a sketch of this rule follows)
- Alternatively, adjust the complexity of the network in response to the training data: for example, start with a large value of n_H, then prune or eliminate weights
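
A sketch of the n/10 rule of thumb; counting the weights as (d + 1) n_H + (n_H + 1) c assumes bias units, which is an assumption on top of the slide:

    import math

    def suggest_hidden_units(n, d, c):
        # Pick n_H so that the total number of weights, roughly
        # (d + 1) * n_H + (n_H + 1) * c with biases, is about n / 10.
        return max(1, math.floor((n / 10 - c) / (d + 1 + c)))

    print(suggest_hidden_units(n=1000, d=5, c=2))  # -> 12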

Initializing Weights
- In setting the weights in a given layer, we choose them randomly from a single distribution, to help ensure uniform learning
- If we set w = 0 initially, learning can never start; since we standardize the data, we choose both positive and negative weights
- If w is initially too small, the net activation of each hidden unit stays in the sigmoid's linear range, so only a linear model is implemented
- If w is initially too large, a hidden unit may saturate (the sigmoid output is always near 0 or 1) even before learning begins
- We therefore set w such that the net activation at a hidden unit is in the range -1 < net_j < +1, since net_j = ±1 are the limits of the sigmoid's linear range:
  Input-to-hidden (d inputs): -1/√d < w_ji < +1/√d
  Hidden-to-output (the fan-in is n_H): -1/√n_H < w_kj < +1/√n_H

Learning Rates
- In principle, a small learning rate ensures convergence; its value then determines only the learning speed, not the final weight values themselves
- In practice, however, because networks are rarely fully trained to a training-error minimum, the learning rate can affect the quality of the final network
- The optimal learning rate is the one which leads to the local error minimum in one learning step: η_opt = (∂²J/∂w²)^(-1)
- Behavior of gradient descent for different rates:
  η < η_opt: slower convergence
  η = η_opt: convergence in one step
  η_opt < η < 2 η_opt: oscillation, but slow convergence
  η > 2 η_opt: divergence
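
Two small sketches: drawing initial weights in the ranges above, and counting gradient-descent steps in each learning-rate regime on a toy quadratic error J(w) = w²/2 (whose curvature is 1, so η_opt = 1; the toy function is an assumption for illustration):

    import numpy as np

    def init_weights(d, n_H, c, rng):
        # Input-to-hidden weights uniform in (-1/sqrt(d), +1/sqrt(d));
        # hidden-to-output weights uniform in (-1/sqrt(n_H), +1/sqrt(n_H)),
        # so each unit's initial net activation stays in the sigmoid's linear range.
        W_ih = rng.uniform(-1.0 / np.sqrt(d), 1.0 / np.sqrt(d), size=(n_H, d))
        W_ho = rng.uniform(-1.0 / np.sqrt(n_H), 1.0 / np.sqrt(n_H), size=(c, n_H))
        return W_ih, W_ho

    def steps_to_converge(eta):
        # Gradient descent on J(w) = w^2 / 2: the update w <- w - eta * dJ/dw
        # scales w by (1 - eta); with curvature 1, eta is in units of eta_opt.
        w, steps = 1.0, 0
        while abs(w) > 1e-6 and steps < 1000:
            w *= (1.0 - eta)
            steps += 1
        return steps

    W_ih, W_ho = init_weights(d=5, n_H=3, c=2, rng=np.random.default_rng(0))
    for eta in (0.5, 1.0, 1.5, 2.5):
        # slower convergence / one step / oscillating convergence / divergence
        print(eta, steps_to_converge(eta))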

Momentum
- What is momentum? In physics, moving objects tend to keep moving unless acted upon by outside forces (e.g. two balls of different mass and speed can carry the same momentum)
- In the BP algorithm, the approach is to alter the learning rule to include some fraction α of the previous weight update:
  w(m+1) = w(m) + (1 - α) Δw_bp(m) + α Δw(m-1)
  where Δw_bp(m) is the current BP update and Δw(m-1) is the previous update
- Using momentum reduces the variation in the overall gradient direction and increases the speed of learning
[Figure: error-surface trajectories with and without momentum; with momentum the trajectory is smoother and reaches the minimum faster.]

Stochastic and Batch Training
Each training algorithm has strengths and drawbacks:
- Batch learning is typically slower than stochastic learning
- Stochastic training is preferred for large, redundant training sets

Stopped Training
- Stopping the training before gradient descent is complete can help avoid overfitting
- A far more effective method is to stop training when the error on a separate validation set reaches a minimum
- Algorithm:
  1. Separate the original training set into two sets: a new training set and a validation set
  2. Use the new training set to train the classifier
  3. Evaluate the classifier on the validation set at the end of each epoch, and stop when the validation error reaches its minimum
[Figure: the training error decreases monotonically with the number of epochs, while the validation (generalization) error first decreases and then rises; training stops at the validation minimum.]
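
Sketches of the momentum update and of stopped training; train_one_epoch and validation_error are hypothetical callables, and the patience criterion is a common practical stand-in for "the validation error has reached its minimum":

    def momentum_update(w, delta_bp, prev_delta, alpha=0.9):
        # w(m+1) = w(m) + (1 - alpha) * delta_w_bp(m) + alpha * delta_w(m-1)
        delta = (1.0 - alpha) * delta_bp + alpha * prev_delta
        return w + delta, delta

    def train_with_stopping(train_one_epoch, validation_error, max_epochs=500, patience=10):
        # Stop when the validation error has not improved for `patience` epochs.
        best_err, best_epoch = float("inf"), 0
        for epoch in range(max_epochs):
            train_one_epoch()
            err = validation_error()
            if err < best_err:
                best_err, best_epoch = err, epoch
            elif epoch - best_epoch >= patience:
                break
        return best_err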
