Multilayer Neural Networks


Pattern Recognition, Lecture 4: Multilayer Neural Networks
Prof. Daniel Yeung
School of Computer Science and Engineering, South China University of Technology

Outline
- Introduction
- Artificial Neural Network
- Multiple Layer Perceptron NN
- Back-Propagation Algorithm (6.3)
- Regularization
- Relating NN and Bayes Theory (6.6)
- Practical Techniques (6.8)

Introduction

Recall linear discriminant functions: the inputs feed weighted sums Σ that directly produce the outputs g1(x), g2(x) (an input layer and an output layer only). They have limited generalization capability and cannot handle non-linearly separable problems.

Solution 1: a mapping function φ. Pass the inputs through nonlinear functions φ(x), then apply the linear discriminants to the mapped features.
- Pro: simple structure, still using an LDF
- Con: the selection of φ and its parameters (already discussed in Lecture 03)

Solution 2: a multi-layer neural network, with one or more hidden layers between the input layer and the output layer. There is no need to choose the nonlinear mapping φ, and no need for any prior knowledge relevant to the classification problem.

Multi-Layer Neural Network (Multilayer Perceptron)
- Has one or more hidden layers
- The hidden layers serve as a mapping function
- Will be introduced in this lecture

Artificial Neural Network (ANN)

A very simplified model of the brain: like the human brain, it maps inputs to outputs. It is basically a function approximator, transforming inputs into outputs to the best of its ability.

An ANN is composed of neurons which cooperate together. The inputs I1, I2, ..., Id reach each neuron through weighted connections (synapses, with weights w1, ..., wd), and the neuron produces an output through an activation function f with threshold θ.

How does a neuron work? The output of a neuron is a function of the weighted sum of the inputs plus a bias (an optional unit which always emits a value of 1 or -1, acting as a threshold):

output = f(w1 I1 + w2 I2 + ... + wd Id + bias)

where f is the activation function.
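
As a minimal sketch of this computation (the function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def neuron(inputs, weights, bias, f):
    """Output of one neuron: activation of the weighted input sum plus a bias."""
    return f(np.dot(weights, inputs) + bias)

# Example with a sign activation:
sign = lambda net: 1 if net > 0 else -1
print(neuron(np.array([1.0, -1.0]), np.array([0.5, 0.3]), 0.1, sign))  # -> 1
```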

Activation Function

The function f is the activation function. Examples:
- Linear function: the output is the same as the input, f(x) = x. Differentiable.
- Sign function: used for decision making; f(x) = 1 for x > 0 and f(x) = -1 for x < 0. Not differentiable.
- Sigmoid function: smooth, continuous, and monotonically increasing (differentiable); f(x) = 1 / (1 + e^-x).
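
A minimal sketch of these three activation functions (NumPy-based; the function names are mine):

```python
import numpy as np

def linear(x):
    return x                          # output equals input; differentiable

def sign(x):
    return np.where(x > 0, 1, -1)     # hard decision; not differentiable

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # smooth, monotonic, differentiable
```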

XOR Example

A three-layer network with sign activations can implement XOR. The hidden units compute y1 = sgn(x1 + x2 + 0.5) and y2 = sgn(x1 + x2 - 1.5); the output unit computes z = sgn(0.7 y1 - 0.4 y2 - 1).

For x1 = 1, x2 = 1:
- y1 = sgn(1 + 1 + 0.5) = sgn(2.5) = 1
- y2 = sgn(1 + 1 - 1.5) = sgn(0.5) = 1
- z = sgn(0.7 - 0.4 - 1) = sgn(-0.7) = -1

For x1 = -1, x2 = -1:
- y1 = sgn(-1 - 1 + 0.5) = sgn(-1.5) = -1
- y2 = sgn(-1 - 1 - 1.5) = sgn(-3.5) = -1
- z = sgn(-0.7 + 0.4 - 1) = sgn(-1.3) = -1

The first hidden unit (y1) implements an OR gate, the second hidden unit (y2) implements an AND gate, and the final output unit implements an AND NOT gate:

z = y1 AND NOT y2 = (x1 OR x2) AND NOT (x1 AND x2) = x1 XOR x2
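
The worked example above can be checked with a short script (a sketch; sgn here maps non-positive values to -1):

```python
sgn = lambda v: 1 if v > 0 else -1

def xor_net(x1, x2):
    y1 = sgn(x1 + x2 + 0.5)                # OR gate for +/-1 inputs
    y2 = sgn(x1 + x2 - 1.5)                # AND gate for +/-1 inputs
    return sgn(0.7 * y1 - 0.4 * y2 - 1.0)  # y1 AND NOT y2 -> XOR

for x in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    print(x, '->', xor_net(*x))            # -1, 1, 1, -1
```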

Structure of an ANN

A simple three-layer neural network:
- Input layer: the input units (here x1, x2)
- Hidden layer: 3 hidden units
- Output layer: the output units (here g1, g2)

Illustrative example (salmon/sea bass):
- x1 = length of the salmon/sea bass, x2 = lightness of the salmon/sea bass
- The weights w_i assign an importance to each input of a neuron
- Top hidden neuron: a length discriminant function
- Middle hidden neuron: a combined length-and-lightness discriminant function
- Bottom hidden neuron: a lightness discriminant function
- g1, g2: the final outputs

Forward operation. Each hidden unit j computes a net activation from the inputs and emits y_j; each output unit k computes a net activation from the hidden outputs and emits g_k:

net_j = Σ_{m=1..d} w_jm x_m,      y_j = f(net_j)
net_k = Σ_{j=1..n_H} w_kj y_j,    g_k = f(net_k)

A two-layer network classifier can only implement a linear decision boundary. Three-, four- and higher-layer networks can implement arbitrary decision boundaries; the decision regions need not be convex, nor simply connected.
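
A minimal sketch of this forward pass (biases omitted for brevity; the shapes and names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W_ih, W_ho, f=sigmoid):
    """Three-layer forward pass. Shapes: x (d,), W_ih (n_H, d), W_ho (c, n_H)."""
    y = f(W_ih @ x)   # hidden outputs  y_j = f(net_j)
    g = f(W_ho @ y)   # final outputs   g_k = f(net_k)
    return g
```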

Multiple Layer Perceptron NN (MLPNN)
- The most common NN
- More than one layer
- The sigmoid is used as the activation function
- A general function approximator, not limited to linear problems

Training: Weight Determination

Weights can be determined by training: reduce the error between the desired outputs and the NN outputs over the training samples. The back-propagation algorithm is the most widely used method for determining the weights; it is a natural extension of the LMS algorithm.
- Pros: a simple and general method
- Cons: slow, and can be trapped at local minima

Back-Propagation (BP) Algorithm

The calculation of the derivatives flows backwards through the network; hence it is called back-propagation. These derivatives point in the direction of the maximum increase of the error function: find out where the largest errors are being made, and go back to try to decrease them. A small step (the learning rate) in the opposite direction will result in the maximum decrease of the local error function:

w' = w - α ∂E/∂w

where α is the learning rate and E the error function.

The most common measure of error is the mean square error:

J = (target - output)² / 2

The update rule for a weight w is:

Δw = -η ∂J/∂w

where η is the learning rate, which controls the size of each step.

The next slides show BP for a 3-layer NN. There are two types of weights:
- hidden-to-output weights w_kj
- input-to-hidden weights w_ji

BP for a 3-layer NN. The learning rule for the hidden-to-output weights follows from the chain rule:

∂J/∂w_kj = (∂J/∂output_k) (∂output_k/∂net_k) (∂net_k/∂w_kj)

with

∂J/∂output_k = -(target_k - output_k)
∂output_k/∂net_k = f'(net_k),   where net_k = Σ_{j=1..n_H} w_kj y_j
∂net_k/∂w_kj = y_j

The learning rule for the input-to-hidden weights:

∂J/∂w_ji = (∂J/∂y_j) (∂y_j/∂net_j) (∂net_j/∂w_ji)

with

∂J/∂y_j = -Σ_{k=1..c} (target_k - output_k) f'(net_k) w_kj
∂y_j/∂net_j = f'(net_j),   where net_j = Σ_{m=1..d} w_jm x_m
∂net_j/∂w_ji = x_i
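
A minimal sketch of these two update rules for sigmoid units (biases omitted; the names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bp_step(x, target, W_ih, W_ho, eta=0.1):
    """One BP update. Shapes: x (d,), target (c,), W_ih (n_H, d), W_ho (c, n_H)."""
    # Forward pass
    y = sigmoid(W_ih @ x)      # hidden outputs
    out = sigmoid(W_ho @ y)    # network outputs

    # Sensitivities; for the sigmoid, f'(net) = out * (1 - out)
    delta_k = (target - out) * out * (1.0 - out)   # hidden-to-output
    delta_j = y * (1.0 - y) * (W_ho.T @ delta_k)   # input-to-hidden

    # Weight updates:  Δw = η * delta * (input to that layer)
    W_ho += eta * np.outer(delta_k, y)
    W_ih += eta * np.outer(delta_j, x)
    return W_ih, W_ho
```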

Summary of the BP learning rules for a 3-layer NN:
- Hidden-to-output weights: Δw_kj = η (target_k - output_k) f'(net_k) y_j
- Input-to-hidden weights: Δw_ji = η [Σ_{k=1..c} (target_k - output_k) f'(net_k) w_kj] f'(net_j) x_i

Training Algorithm. For the same training set, the weights of the NN can be updated differently by presenting the training samples in different sequences. There are two popular methods:
- Stochastic training
- Batch training

Stochastic training: patterns are chosen randomly from the training set, and the network weights are updated after each presentation.

Batch training: all patterns are presented to the network before a weight update takes place.

Is a classifier with a smaller training error better? In most cases, NO! We have discussed this issue in an earlier lecture. For example: stop training when the error on unseen test samples reaches a minimum.

Regularization

In Lecture 03 we mentioned that in most cases the solution (discriminant function) is not unique: an ill-posed problem. Which one is the best?
- Is it enough to minimize the training error?
- Is the classifier too complex?
- Does it have good generalization ability?

Minimizing only R_emp (the empirical risk, i.e. the training error) leads to the overfitting problem.

Regularization is one of the methods to handle this problem:
- Add a regularization term ψ(f) to the objective function, measuring the smoothness of the decision plane
- A tradeoff parameter λ controls the relative importance of the training accuracy and the regularization term
- Seek a smooth classifier with good performance on the training set; training error may be sacrificed for the simplicity of the classifier if necessary

Minimize: R_emp + λ ψ(f)
(training error + tradeoff × regularization term)

where λ is the regularization parameter and ψ the regularization function:
- λ = 0: identical to the traditional training objective function; the regularization term has no effect
- ∞ > λ > 0: if we can find a suitable λ, we may find an f with good generalization ability
- λ → ∞: dominated by the regularization term; the smoothest classifier is found

Weight Decay

Weight decay is a well-known regularization example. The regularization term measures the size of the weights:

ψ(f) = Σ w²

Smaller weights give a smoother classifier. The objective function becomes:

Minimize: R_emp + λ Σ w²
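
A minimal sketch of one gradient step on this objective (the parameter names are illustrative; grad_Remp stands for the gradient of the training error):

```python
import numpy as np

def weight_decay_step(w, grad_Remp, eta=0.1, lam=1e-3):
    """Gradient descent on R_emp + lam * sum(w**2); the extra
    2*lam*w term decays each weight toward zero."""
    return w - eta * (grad_Remp + 2.0 * lam * w)
```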

NN and Bayes Theory

Recall Bayes formula:

P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x),   where p(x) = Σ_{i=1..c} p(x | ω_i) P(ω_i)

Suppose a network is trained using the following target output setting:

target_k(x) = 1 if x ∈ ω_k, and 0 otherwise.

When the number of training samples tends to infinity (see p.304 in the text), minimizing the mean square error J is equivalent to minimizing:

lim_{n→∞} (1/n) J(w)
  = ∫ [g_k(x; w) - 1]² p(x, ω_k) dx + ∫ [g_k(x; w)]² p(x, ω_{i≠k}) dx
  = ∫ [g_k(x; w)]² p(x) dx - 2 ∫ g_k(x; w) p(x, ω_k) dx + ∫ p(x, ω_k) dx
  = ∫ [g_k(x; w) - P(ω_k | x)]² p(x) dx + ∫ P(ω_k | x) P(ω_{i≠k} | x) p(x) dx

The second term is independent of w, so minimizing J minimizes the first term, and the trained network approximates the posterior probability:

g_k(x; w) ≈ P(ω_k | x)

Thus when MLPNNs are trained via back-propagation on a sum-squared error criterion, they provide a least-squares fit to the Bayes discriminant function.

Practical Techniques

How do we design an MLPNN to handle a given classification problem, given a training set (x_1, y_1), ..., (x_n, y_n)? The following issues must be considered:
- Scaling input
- Target values
- Number of hidden layers
- Number of hidden units
- Initializing weights
- Learning rates
- Momentum
- Weight decay
- Stochastic and batch training
- Stopped training

Scaling Input

Features of different natures have different properties (e.g. range, mean). For example, fish mass (grams) and length (meters): normally the value of the mass will be orders of magnitude larger than that of the length. During training, the network will adjust the weights from the mass input unit far more than those from the length input, and the error will hardly depend upon the tiny length values. The situation is reversed when mass is in kilograms and length in millimeters.

How to reduce this influence? Normalization/standardization: transform the training samples so that the features have
- the same range (e.g. 0 to 1, or -1 to 1)
- the same variance (e.g. 1)
- the same average (e.g. 0)
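
A minimal sketch of the standardization step (zero mean, unit variance per feature; the data values are made up):

```python
import numpy as np

def standardize(X):
    """Column-wise zero mean and unit variance; X has shape (n_samples, n_features)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# e.g. fish data: mass in grams, length in meters
X = np.array([[900.0, 0.30], [1200.0, 0.40], [700.0, 0.25]])
print(standardize(X))
```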

Target Values

Usually a one-of-c representation is used for the target vector. For a four-class problem, four outputs are used (see the encoding sketch after the next subsection):
- ω1 = (1, -1, -1, -1) or (1, 0, 0, 0)
- ω2 = (-1, 1, -1, -1) or (0, 1, 0, 0)
- ω3 = (-1, -1, 1, -1) or (0, 0, 1, 0)
- ω4 = (-1, -1, -1, 1) or (0, 0, 0, 1)

Number of Hidden Layers

The BP algorithm works well for NNs with many hidden layers, as long as the units are differentiable. How many hidden layers are enough?
- NNs with more hidden layers learn translations more easily, and some functions can be implemented more efficiently
- However, they have more undesirable local minima and are more complex
- Since any arbitrary function can be approximated by an MLP with one hidden layer, a 3-layer NN is usually recommended. Special problem conditions or requirements may justify the use of more than 3 layers.
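
A minimal sketch of the one-of-c target encoding described above (0/1 variant; the names are mine):

```python
import numpy as np

def one_of_c(label, c):
    """One-of-c target vector: class k of c -> 1 in position k, 0 elsewhere."""
    t = np.zeros(c)
    t[label - 1] = 1.0
    return t

print(one_of_c(3, 4))   # class omega_3 -> [0. 0. 1. 0.]
```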

Number of Hidden Units

The number of hidden units governs the expressive power of the NN (for facial recognition, say, neurons for the mouth, nose, ear, eye, face shape, etc.):
- Well separated or linearly separable samples: few hidden units
- Complicated problems: more hidden units

One study shows that the minimum error occurs for NNs in the range of 4-5 hidden units (figure: error plotted against the number of weights n and the number of hidden units n_H).

How to determine the number of hidden units n_H?
- n_H determines the total number of weights in the net, so we should not have more weights than the total number of training points n
- Without further information, n_H cannot be determined before training
- Experimentally, choose n_H such that the total number of weights in the net is roughly n/10 (see the sketch below)
- Adjust the complexity of the network in response to the training data, for example: start with a large value of n_H, then prune or eliminate weights
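
A minimal sketch of the n/10 rule of thumb, counting bias weights as well (the rearranged formula is my own reading of the heuristic):

```python
def hidden_units_rule_of_thumb(n_samples, d_inputs, c_outputs):
    """Choose n_H so that (d+1)*n_H + (n_H+1)*c is roughly n/10."""
    target_weights = n_samples / 10.0
    n_h = (target_weights - c_outputs) / (d_inputs + 1 + c_outputs)
    return max(1, round(n_h))

print(hidden_units_rule_of_thumb(n_samples=2000, d_inputs=10, c_outputs=2))  # -> 15
```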

Initializing Weights

In setting the weights in a given layer, we choose the weights randomly from a single distribution, to help ensure uniform learning. If all weights are set to 0 initially, learning can never start. Since we standardize the data, we choose both positive and negative weights.
- If w is initially too small, the net activation of a hidden unit will be small and the linear model will be implemented (the sigmoid is nearly linear around 0)
- If w is initially too large, the hidden unit may saturate (the sigmoid output stays at 0 or 1) even before learning begins

Recall net_j = Σ_{m=1..d} w_jm x_m and y_j = f(net_j), with the sigmoid f(x) = 1 / (1 + e^-x): linear near 0, saturated at the extremes.

We therefore set w such that the net activation at a hidden unit is in the range -1 < net < +1, since net = ±1 marks the limits of the sigmoid's linear range:
- Input-to-hidden weights (d inputs): -1/√d < w < +1/√d
- Hidden-to-output weights (the fan-in is n_H): -1/√n_H < w < +1/√n_H
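
A minimal sketch of this fan-in-based initialization (the names are illustrative):

```python
import numpy as np

def init_weights(fan_in, fan_out, rng=np.random.default_rng()):
    """Uniform weights in (-1/sqrt(fan_in), +1/sqrt(fan_in)), so net
    activations start inside the sigmoid's linear range."""
    bound = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))

W_ih = init_weights(fan_in=10, fan_out=5)   # d = 10 inputs, n_H = 5
W_ho = init_weights(fan_in=5, fan_out=2)    # n_H = 5, c = 2 outputs
```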

Learning Rates

In principle, a small learning rate ensures convergence; its value then determines only the learning speed, not the final weight values themselves. However, in practice, because networks are rarely fully trained to a training-error minimum, the learning rate can affect the quality of the final network.

The optimal learning rate is the one which leads to the local error minimum in one learning step; for a locally quadratic error J, the optimal rate is:

η_opt = (∂²J/∂w²)^(-1)

Behavior for different learning rates (figure):
- η < η_opt: slower convergence
- η = η_opt: converges in one step
- η_opt < η < 2η_opt: oscillates but slowly converges
- η > 2η_opt: diverges

Momentum

What is momentum? In physics it means that moving objects tend to keep moving unless acted upon by outside forces; for example, two different balls can carry the same momentum. In the BP algorithm, the approach is to alter the learning rule to include some fraction α of the previous weight update:

w(m+1) = w(m) + (1 - α) Δw_bp(m) + α Δw(m-1)

where Δw_bp(m) is the current BP delta and Δw(m-1) the previous delta.

Using momentum (figure: gradient-descent trajectories with and without momentum):
- reduces the variation in the overall gradient directions
- increases the speed of learning
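
A minimal sketch of the momentum update above (the names are illustrative):

```python
def momentum_step(w, grad_J, prev_delta, eta=0.1, alpha=0.9):
    """Blend the current BP step with the previous weight change."""
    delta = (1.0 - alpha) * (-eta * grad_J) + alpha * prev_delta
    return w + delta, delta   # new weights, and the delta to reuse next step
```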

Stochastic and Batch Training

Each training method has strengths and drawbacks:
- Batch learning is typically slower than stochastic learning
- Stochastic training is preferred for large, redundant training sets

Stopped Training

Stopping the training before gradient descent is complete can help avoid overfitting. A far more effective method is to stop training when the error on a separate validation set reaches a minimum.

Algorithm:
1. Separate the original training set into two sets: a new training set and a validation set
2. Use the new training set to train the classifier
3. Evaluate the classifier on the validation set at the end of each epoch, and stop at the minimum of the validation error

(figure: validation error and training error over epochs; the gap between them illustrates the generalization error)
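
A minimal sketch of this stopped-training loop (train_one_epoch and error are hypothetical methods, assumed to exist on the classifier object):

```python
import copy

def train_with_early_stopping(net, train_set, val_set, max_epochs=1000):
    """Keep the weights from the epoch with the lowest validation error."""
    best_err, best_net = float('inf'), copy.deepcopy(net)
    for _ in range(max_epochs):
        net.train_one_epoch(train_set)   # hypothetical method
        err = net.error(val_set)         # hypothetical method
        if err < best_err:
            best_err, best_net = err, copy.deepcopy(net)
    return best_net
```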

Multilayer Neural Networks

Multilayer Neural Networks Pattern Recognition Multilaer Neural Networs Lecture 4 Prof. Daniel Yeung School of Computer Science and Engineering South China Universit of Technolog Outline Introduction (6.) Artificial Neural Networ

More information

Multilayer Feedforward Networks. Berlin Chen, 2002

Multilayer Feedforward Networks. Berlin Chen, 2002 Multilayer Feedforard Netors Berlin Chen, 00 Introduction The single-layer perceptron classifiers discussed previously can only deal ith linearly separable sets of patterns The multilayer netors to be

More information

Part 8: Neural Networks

Part 8: Neural Networks METU Informatics Institute Min720 Pattern Classification ith Bio-Medical Applications Part 8: Neural Netors - INTRODUCTION: BIOLOGICAL VS. ARTIFICIAL Biological Neural Netors A Neuron: - A nerve cell as

More information

Networks of McCulloch-Pitts Neurons

Networks of McCulloch-Pitts Neurons s Lecture 4 Netorks of McCulloch-Pitts Neurons The McCulloch and Pitts (M_P) Neuron x x sgn x n Netorks of M-P Neurons One neuron can t do much on its on, but a net of these neurons x i x i i sgn i ij

More information

Pattern Classification

Pattern Classification Pattern Classification All materials in these slides were taen from Pattern Classification (2nd ed) by R. O. Duda,, P. E. Hart and D. G. Stor, John Wiley & Sons, 2000 with the permission of the authors

More information

ECE 471/571 - Lecture 17. Types of NN. History. Back Propagation. Recurrent (feedback during operation) Feedforward

ECE 471/571 - Lecture 17. Types of NN. History. Back Propagation. Recurrent (feedback during operation) Feedforward ECE 47/57 - Lecture 7 Back Propagation Types of NN Recurrent (feedback during operation) n Hopfield n Kohonen n Associative memory Feedforward n No feedback during operation or testing (only during determination

More information

Multilayer Perceptron

Multilayer Perceptron Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Single Perceptron 3 Boolean Function Learning 4

More information

Neural networks and support vector machines

Neural networks and support vector machines Neural netorks and support vector machines Perceptron Input x 1 Weights 1 x 2 x 3... x D 2 3 D Output: sgn( x + b) Can incorporate bias as component of the eight vector by alays including a feature ith

More information

Introduction to Natural Computation. Lecture 9. Multilayer Perceptrons and Backpropagation. Peter Lewis

Introduction to Natural Computation. Lecture 9. Multilayer Perceptrons and Backpropagation. Peter Lewis Introduction to Natural Computation Lecture 9 Multilayer Perceptrons and Backpropagation Peter Lewis 1 / 25 Overview of the Lecture Why multilayer perceptrons? Some applications of multilayer perceptrons.

More information

Lecture 4: Perceptrons and Multilayer Perceptrons

Lecture 4: Perceptrons and Multilayer Perceptrons Lecture 4: Perceptrons and Multilayer Perceptrons Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning Perceptrons, Artificial Neuronal Networks Lecture 4: Perceptrons

More information

Neural Networks. Chapter 18, Section 7. TB Artificial Intelligence. Slides from AIMA 1/ 21

Neural Networks. Chapter 18, Section 7. TB Artificial Intelligence. Slides from AIMA   1/ 21 Neural Networks Chapter 8, Section 7 TB Artificial Intelligence Slides from AIMA http://aima.cs.berkeley.edu / 2 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural

More information

Multilayer Perceptron = FeedForward Neural Network

Multilayer Perceptron = FeedForward Neural Network Multilayer Perceptron = FeedForward Neural Networ History Definition Classification = feedforward operation Learning = bacpropagation = local optimization in the space of weights Pattern Classification

More information

Introduction to Neural Networks

Introduction to Neural Networks CUONG TUAN NGUYEN SEIJI HOTTA MASAKI NAKAGAWA Tokyo University of Agriculture and Technology Copyright by Nguyen, Hotta and Nakagawa 1 Pattern classification Which category of an input? Example: Character

More information

Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Neural Networks. Nicholas Ruozzi University of Texas at Dallas Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify

More information

Neural networks. Chapter 19, Sections 1 5 1

Neural networks. Chapter 19, Sections 1 5 1 Neural networks Chapter 19, Sections 1 5 Chapter 19, Sections 1 5 1 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural networks Chapter 19, Sections 1 5 2 Brains 10

More information

Linear Discriminant Functions

Linear Discriminant Functions Linear Discriminant Functions Linear discriminant functions and decision surfaces Definition It is a function that is a linear combination of the components of g() = t + 0 () here is the eight vector and

More information

Computational statistics

Computational statistics Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

More information

Artificial Neural Networks. Part 2

Artificial Neural Networks. Part 2 Artificial Neural Netorks Part Artificial Neuron Model Folloing simplified model of real neurons is also knon as a Threshold Logic Unit x McCullouch-Pitts neuron (943) x x n n Body of neuron f out Biological

More information

Unit III. A Survey of Neural Network Model

Unit III. A Survey of Neural Network Model Unit III A Survey of Neural Network Model 1 Single Layer Perceptron Perceptron the first adaptive network architecture was invented by Frank Rosenblatt in 1957. It can be used for the classification of

More information

Artifical Neural Networks

Artifical Neural Networks Neural Networks Artifical Neural Networks Neural Networks Biological Neural Networks.................................. Artificial Neural Networks................................... 3 ANN Structure...........................................

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Neural networks. Chapter 20, Section 5 1

Neural networks. Chapter 20, Section 5 1 Neural networks Chapter 20, Section 5 Chapter 20, Section 5 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural networks Chapter 20, Section 5 2 Brains 0 neurons of

More information

Multilayer Perceptron

Multilayer Perceptron Aprendizagem Automática Multilayer Perceptron Ludwig Krippahl Aprendizagem Automática Summary Perceptron and linear discrimination Multilayer Perceptron, nonlinear discrimination Backpropagation and training

More information

LECTURE # - NEURAL COMPUTATION, Feb 04, Linear Regression. x 1 θ 1 output... θ M x M. Assumes a functional form

LECTURE # - NEURAL COMPUTATION, Feb 04, Linear Regression. x 1 θ 1 output... θ M x M. Assumes a functional form LECTURE # - EURAL COPUTATIO, Feb 4, 4 Linear Regression Assumes a functional form f (, θ) = θ θ θ K θ (Eq) where = (,, ) are the attributes and θ = (θ, θ, θ ) are the function parameters Eample: f (, θ)

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Logistic Regression. Machine Learning Fall 2018

Logistic Regression. Machine Learning Fall 2018 Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks 鮑興國 Ph.D. National Taiwan University of Science and Technology Outline Perceptrons Gradient descent Multi-layer networks Backpropagation Hidden layer representations Examples

More information

4. Multilayer Perceptrons

4. Multilayer Perceptrons 4. Multilayer Perceptrons This is a supervised error-correction learning algorithm. 1 4.1 Introduction A multilayer feedforward network consists of an input layer, one or more hidden layers, and an output

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward

More information

Multilayer Neural Networks

Multilayer Neural Networks Multilayer Neural Networks Multilayer Neural Networks Discriminant function flexibility NON-Linear But with sets of linear parameters at each layer Provably general function approximators for sufficient

More information

Linear models: the perceptron and closest centroid algorithms. D = {(x i,y i )} n i=1. x i 2 R d 9/3/13. Preliminaries. Chapter 1, 7.

Linear models: the perceptron and closest centroid algorithms. D = {(x i,y i )} n i=1. x i 2 R d 9/3/13. Preliminaries. Chapter 1, 7. Preliminaries Linear models: the perceptron and closest centroid algorithms Chapter 1, 7 Definition: The Euclidean dot product beteen to vectors is the expression d T x = i x i The dot product is also

More information

Neural networks. Chapter 20. Chapter 20 1

Neural networks. Chapter 20. Chapter 20 1 Neural networks Chapter 20 Chapter 20 1 Outline Brains Neural networks Perceptrons Multilayer networks Applications of neural networks Chapter 20 2 Brains 10 11 neurons of > 20 types, 10 14 synapses, 1ms

More information

ARTIFICIAL INTELLIGENCE. Artificial Neural Networks

ARTIFICIAL INTELLIGENCE. Artificial Neural Networks INFOB2KI 2017-2018 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Artificial Neural Networks Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided

More information

AI Programming CS F-20 Neural Networks

AI Programming CS F-20 Neural Networks AI Programming CS662-2008F-20 Neural Networks David Galles Department of Computer Science University of San Francisco 20-0: Symbolic AI Most of this class has been focused on Symbolic AI Focus or symbols

More information

CS:4420 Artificial Intelligence

CS:4420 Artificial Intelligence CS:4420 Artificial Intelligence Spring 2018 Neural Networks Cesare Tinelli The University of Iowa Copyright 2004 18, Cesare Tinelli and Stuart Russell a a These notes were originally developed by Stuart

More information

Simple Neural Nets For Pattern Classification

Simple Neural Nets For Pattern Classification CHAPTER 2 Simple Neural Nets For Pattern Classification Neural Networks General Discussion One of the simplest tasks that neural nets can be trained to perform is pattern classification. In pattern classification

More information

Linear discriminant functions

Linear discriminant functions Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative

More information

Multilayer Perceptrons and Backpropagation

Multilayer Perceptrons and Backpropagation Multilayer Perceptrons and Backpropagation Informatics 1 CG: Lecture 7 Chris Lucas School of Informatics University of Edinburgh January 31, 2017 (Slides adapted from Mirella Lapata s.) 1 / 33 Reading:

More information

Neural Networks (Part 1) Goals for the lecture

Neural Networks (Part 1) Goals for the lecture Neural Networks (Part ) Mark Craven and David Page Computer Sciences 760 Spring 208 www.biostat.wisc.edu/~craven/cs760/ Some of the slides in these lectures have been adapted/borrowed from materials developed

More information

Enhancing Generalization Capability of SVM Classifiers with Feature Weight Adjustment

Enhancing Generalization Capability of SVM Classifiers with Feature Weight Adjustment Enhancing Generalization Capability of SVM Classifiers ith Feature Weight Adjustment Xizhao Wang and Qiang He College of Mathematics and Computer Science, Hebei University, Baoding 07002, Hebei, China

More information

Neural Networks DWML, /25

Neural Networks DWML, /25 DWML, 2007 /25 Neural networks: Biological and artificial Consider humans: Neuron switching time 0.00 second Number of neurons 0 0 Connections per neuron 0 4-0 5 Scene recognition time 0. sec 00 inference

More information

Introduction to Neural Networks

Introduction to Neural Networks Introduction to Neural Networks What are (Artificial) Neural Networks? Models of the brain and nervous system Highly parallel Process information much more like the brain than a serial computer Learning

More information

Neural Networks: Basics. Darrell Whitley Colorado State University

Neural Networks: Basics. Darrell Whitley Colorado State University Neural Networks: Basics Darrell Whitley Colorado State University In the Beginning: The Perceptron X1 W W 1,1 1,2 X2 W W 2,1 2,2 W source, destination In the Beginning: The Perceptron The Perceptron Learning

More information

Artificial Neuron (Perceptron)

Artificial Neuron (Perceptron) 9/6/208 Gradient Descent (GD) Hantao Zhang Deep Learning with Python Reading: https://en.wikipedia.org/wiki/gradient_descent Artificial Neuron (Perceptron) = w T = w 0 0 + + w 2 2 + + w d d where

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.5. Spring 2010 Instructor: Dr. Masoud Yaghini Outline How the Brain Works Artificial Neural Networks Simple Computing Elements Feed-Forward Networks Perceptrons (Single-layer,

More information

Machine Learning

Machine Learning Machine Learning 10-315 Maria Florina Balcan Machine Learning Department Carnegie Mellon University 03/29/2019 Today: Artificial neural networks Backpropagation Reading: Mitchell: Chapter 4 Bishop: Chapter

More information

Artificial neural networks

Artificial neural networks Artificial neural networks Chapter 8, Section 7 Artificial Intelligence, spring 203, Peter Ljunglöf; based on AIMA Slides c Stuart Russel and Peter Norvig, 2004 Chapter 8, Section 7 Outline Brains Neural

More information

An artificial neural networks (ANNs) model is a functional abstraction of the

An artificial neural networks (ANNs) model is a functional abstraction of the CHAPER 3 3. Introduction An artificial neural networs (ANNs) model is a functional abstraction of the biological neural structures of the central nervous system. hey are composed of many simple and highly

More information

Neural Networks biological neuron artificial neuron 1

Neural Networks biological neuron artificial neuron 1 Neural Networks biological neuron artificial neuron 1 A two-layer neural network Output layer (activation represents classification) Weighted connections Hidden layer ( internal representation ) Input

More information

Neural Networks. Learning and Computer Vision Prof. Olga Veksler CS9840. Lecture 10

Neural Networks. Learning and Computer Vision Prof. Olga Veksler CS9840. Lecture 10 CS9840 Learning and Computer Vision Prof. Olga Veksler Lecture 0 Neural Networks Many slides are from Andrew NG, Yann LeCun, Geoffry Hinton, Abin - Roozgard Outline Short Intro Perceptron ( layer NN) Multilayer

More information

Neural Networks Lecture 3:Multi-Layer Perceptron

Neural Networks Lecture 3:Multi-Layer Perceptron Neural Networks Lecture 3:Multi-Layer Perceptron H.A Talebi Farzaneh Abdollahi Department of Electrical Engineering Amirkabir University of Technology Winter 2011 H. A. Talebi, Farzaneh Abdollahi Neural

More information

Single layer NN. Neuron Model

Single layer NN. Neuron Model Single layer NN We consider the simple architecture consisting of just one neuron. Generalization to a single layer with more neurons as illustrated below is easy because: M M The output units are independent

More information

Machine Learning

Machine Learning Machine Learning 10-601 Maria Florina Balcan Machine Learning Department Carnegie Mellon University 02/10/2016 Today: Artificial neural networks Backpropagation Reading: Mitchell: Chapter 4 Bishop: Chapter

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Neural Networks Varun Chandola x x 5 Input Outline Contents February 2, 207 Extending Perceptrons 2 Multi Layered Perceptrons 2 2. Generalizing to Multiple Labels.................

More information

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs) Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w x + w 2 x 2 + w 0 = 0 Feature x 2 = w w 2 x w 0 w 2 Feature 2 A perceptron can separate

More information

Artificial Neural Networks Examination, June 2005

Artificial Neural Networks Examination, June 2005 Artificial Neural Networks Examination, June 2005 Instructions There are SIXTY questions. (The pass mark is 30 out of 60). For each question, please select a maximum of ONE of the given answers (either

More information

Last update: October 26, Neural networks. CMSC 421: Section Dana Nau

Last update: October 26, Neural networks. CMSC 421: Section Dana Nau Last update: October 26, 207 Neural networks CMSC 42: Section 8.7 Dana Nau Outline Applications of neural networks Brains Neural network units Perceptrons Multilayer perceptrons 2 Example Applications

More information

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs) Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w 1 x 1 + w 2 x 2 + w 0 = 0 Feature 1 x 2 = w 1 w 2 x 1 w 0 w 2 Feature 2 A perceptron

More information

Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks. Cannot approximate (learn) non-linear functions

Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks. Cannot approximate (learn) non-linear functions BACK-PROPAGATION NETWORKS Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks Cannot approximate (learn) non-linear functions Difficult (if not impossible) to design

More information

Multilayer Perceptrons (MLPs)

Multilayer Perceptrons (MLPs) CSE 5526: Introduction to Neural Networks Multilayer Perceptrons (MLPs) 1 Motivation Multilayer networks are more powerful than singlelayer nets Example: XOR problem x 2 1 AND x o x 1 x 2 +1-1 o x x 1-1

More information

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses

More information

Multilayer Neural Networks

Multilayer Neural Networks Multilayer Neural Networks Introduction Goal: Classify objects by learning nonlinearity There are many problems for which linear discriminants are insufficient for minimum error In previous methods, the

More information

ECE521 Lectures 9 Fully Connected Neural Networks

ECE521 Lectures 9 Fully Connected Neural Networks ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance

More information

Sections 18.6 and 18.7 Artificial Neural Networks

Sections 18.6 and 18.7 Artificial Neural Networks Sections 18.6 and 18.7 Artificial Neural Networks CS4811 - Artificial Intelligence Nilufer Onder Department of Computer Science Michigan Technological University Outline The brain vs. artifical neural

More information

Multi-layer Neural Networks

Multi-layer Neural Networks Multi-layer Neural Networks Steve Renals Informatics 2B Learning and Data Lecture 13 8 March 2011 Informatics 2B: Learning and Data Lecture 13 Multi-layer Neural Networks 1 Overview Multi-layer neural

More information

Neural Networks and the Back-propagation Algorithm

Neural Networks and the Back-propagation Algorithm Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely

More information

Lecture 16: Introduction to Neural Networks

Lecture 16: Introduction to Neural Networks Lecture 16: Introduction to Neural Networs Instructor: Aditya Bhasara Scribe: Philippe David CS 5966/6966: Theory of Machine Learning March 20 th, 2017 Abstract In this lecture, we consider Bacpropagation,

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward

More information

18.6 Regression and Classification with Linear Models

18.6 Regression and Classification with Linear Models 18.6 Regression and Classification with Linear Models 352 The hypothesis space of linear functions of continuous-valued inputs has been used for hundreds of years A univariate linear function (a straight

More information

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS Learning Neural Networks Classifier Short Presentation INPUT: classification data, i.e. it contains an classification (class) attribute.

More information

10-701/ Machine Learning, Fall

10-701/ Machine Learning, Fall 0-70/5-78 Machine Learning, Fall 2003 Homework 2 Solution If you have questions, please contact Jiayong Zhang .. (Error Function) The sum-of-squares error is the most common training

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Neural Networks Task Sheet 2. Due date: May

Neural Networks Task Sheet 2. Due date: May Neural Networks 2007 Task Sheet 2 1/6 University of Zurich Prof. Dr. Rolf Pfeifer, pfeifer@ifi.unizh.ch Department of Informatics, AI Lab Matej Hoffmann, hoffmann@ifi.unizh.ch Andreasstrasse 15 Marc Ziegler,

More information

COMP 551 Applied Machine Learning Lecture 14: Neural Networks

COMP 551 Applied Machine Learning Lecture 14: Neural Networks COMP 551 Applied Machine Learning Lecture 14: Neural Networks Instructor: Ryan Lowe (ryan.lowe@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted,

More information

CS4442/9542b Artificial Intelligence II prof. Olga Veksler. Lecture 5 Machine Learning. Neural Networks. Many presentation Ideas are due to Andrew NG

CS4442/9542b Artificial Intelligence II prof. Olga Veksler. Lecture 5 Machine Learning. Neural Networks. Many presentation Ideas are due to Andrew NG CS4442/9542b Artificial Intelligence II prof. Olga Vesler Lecture 5 Machine Learning Neural Networs Many presentation Ideas are due to Andrew NG Outline Motivation Non linear discriminant functions Introduction

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks Threshold units Gradient descent Multilayer networks Backpropagation Hidden layer representations Example: Face Recognition Advanced topics 1 Connectionist Models Consider humans:

More information

Artificial Neural Networks. Edward Gatt

Artificial Neural Networks. Edward Gatt Artificial Neural Networks Edward Gatt What are Neural Networks? Models of the brain and nervous system Highly parallel Process information much more like the brain than a serial computer Learning Very

More information

Sections 18.6 and 18.7 Analysis of Artificial Neural Networks

Sections 18.6 and 18.7 Analysis of Artificial Neural Networks Sections 18.6 and 18.7 Analysis of Artificial Neural Networks CS4811 - Artificial Intelligence Nilufer Onder Department of Computer Science Michigan Technological University Outline Univariate regression

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) Human Brain Neurons Input-Output Transformation Input Spikes Output Spike Spike (= a brief pulse) (Excitatory Post-Synaptic Potential)

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Advanced statistical methods for data analysis Lecture 2

Advanced statistical methods for data analysis Lecture 2 Advanced statistical methods for data analysis Lecture 2 RHUL Physics www.pp.rhul.ac.uk/~cowan Universität Mainz Klausurtagung des GK Eichtheorien exp. Tests... Bullay/Mosel 15 17 September, 2008 1 Outline

More information

Artificial Neural Network : Training

Artificial Neural Network : Training Artificial Neural Networ : Training Debasis Samanta IIT Kharagpur debasis.samanta.iitgp@gmail.com 06.04.2018 Debasis Samanta (IIT Kharagpur) Soft Computing Applications 06.04.2018 1 / 49 Learning of neural

More information

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann (Feed-Forward) Neural Networks 2016-12-06 Dr. Hajira Jabeen, Prof. Jens Lehmann Outline In the previous lectures we have learned about tensors and factorization methods. RESCAL is a bilinear model for

More information

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1 Neural Networks Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1 Brains as Computational Devices Brains advantages with respect to digital computers: Massively parallel Fault-tolerant Reliable

More information

Introduction to Artificial Neural Network - theory, application and practice using WEKA- Anto Satriyo Nugroho, Dr.Eng

Introduction to Artificial Neural Network - theory, application and practice using WEKA- Anto Satriyo Nugroho, Dr.Eng Introduction to Artificial Neural Netor - theory, application and practice using WEKA- Anto Satriyo Nugroho, Dr.Eng Center for Information & Communication Technology, Agency for the Assessment & Application

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Sections 18.6 and 18.7 Artificial Neural Networks

Sections 18.6 and 18.7 Artificial Neural Networks Sections 18.6 and 18.7 Artificial Neural Networks CS4811 - Artificial Intelligence Nilufer Onder Department of Computer Science Michigan Technological University Outline The brain vs artifical neural networks

More information

Machine Learning and Data Mining. Multi-layer Perceptrons & Neural Networks: Basics. Prof. Alexander Ihler

Machine Learning and Data Mining. Multi-layer Perceptrons & Neural Networks: Basics. Prof. Alexander Ihler + Machine Learning and Data Mining Multi-layer Perceptrons & Neural Networks: Basics Prof. Alexander Ihler Linear Classifiers (Perceptrons) Linear Classifiers a linear classifier is a mapping which partitions

More information

Neural Networks. Associative memory 12/30/2015. Associative memories. Associative memories

Neural Networks. Associative memory 12/30/2015. Associative memories. Associative memories //5 Neural Netors Associative memory Lecture Associative memories Associative memories The massively parallel models of associative or content associative memory have been developed. Some of these models

More information

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 BIOLOGICAL INSPIRATIONS Some numbers The human brain contains about 10 billion nerve cells (neurons) Each neuron is connected to the others through 10000

More information

Artificial Neural Networks. Historical description

Artificial Neural Networks. Historical description Artificial Neural Networks Historical description Victor G. Lopez 1 / 23 Artificial Neural Networks (ANN) An artificial neural network is a computational model that attempts to emulate the functions of

More information

Speaker Representation and Verification Part II. by Vasileios Vasilakakis

Speaker Representation and Verification Part II. by Vasileios Vasilakakis Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation

More information

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer. University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x

More information

epochs epochs

epochs epochs Neural Network Experiments To illustrate practical techniques, I chose to use the glass dataset. This dataset has 214 examples and 6 classes. Here are 4 examples from the original dataset. The last values

More information

Machine Learning. Neural Networks. Le Song. CSE6740/CS7641/ISYE6740, Fall Lecture 7, September 11, 2012 Based on slides from Eric Xing, CMU

Machine Learning. Neural Networks. Le Song. CSE6740/CS7641/ISYE6740, Fall Lecture 7, September 11, 2012 Based on slides from Eric Xing, CMU Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Neural Networks Le Song Lecture 7, September 11, 2012 Based on slides from Eric Xing, CMU Reading: Chap. 5 CB Learning highly non-linear functions f:

More information

COMP9444 Neural Networks and Deep Learning 2. Perceptrons. COMP9444 c Alan Blair, 2017

COMP9444 Neural Networks and Deep Learning 2. Perceptrons. COMP9444 c Alan Blair, 2017 COMP9444 Neural Networks and Deep Learning 2. Perceptrons COMP9444 17s2 Perceptrons 1 Outline Neurons Biological and Artificial Perceptron Learning Linear Separability Multi-Layer Networks COMP9444 17s2

More information

y(x n, w) t n 2. (1)

y(x n, w) t n 2. (1) Network training: Training a neural network involves determining the weight parameter vector w that minimizes a cost function. Given a training set comprising a set of input vector {x n }, n = 1,...N,

More information

Supervised Learning in Neural Networks

Supervised Learning in Neural Networks The Norwegian University of Science and Technology (NTNU Trondheim, Norway keithd@idi.ntnu.no March 7, 2011 Supervised Learning Constant feedback from an instructor, indicating not only right/wrong, but

More information