ECE 471/571 - Lecture 7: Back Propagation

Types of NN
Recurrent (feedback during operation)
- Hopfield
- Kohonen
- Associative memory
Feedforward
- No feedback during operation or testing (feedback only during determination of weights, i.e., training)
- Perceptron
- Backpropagation

History
- In the 1980s, NN became fashionable again, after the dark age of the 1970s
- One of the reasons is the paper by Rumelhart, Hinton, and McClelland, which made the BP algorithm famous
- The algorithm was first discovered in the 1960s, but didn't draw much attention
- BP is most famous for its application to layered feedforward networks, or multilayer perceptrons
Limitations of Perceptron
- The output only has two values (1 or 0)
- Can only classify samples which are linearly separable (by a straight line or a plane)
- Single layer: can only train AND, OR, NOT
- Can't train a network for functions like XOR

XOR (3-layer NN)
[Figure: a three-layer network solving XOR. Units 1 and 2 are identity functions; units 3, 4, 5 are sigmoids; the inputs take on only +1 and -1. Weights w13, w14, w23, w24, w35, w45 and a bias -b connect the layers.]

BP - 3-Layer Network
[Figure: network with inputs $x_i$, hidden units $h_j$ (weights $\omega_{ij}$), and output $y$ (weights $\omega_j$).]
- Error between the expected (target) output and the actual output: $E = \frac{1}{2}(T - y)^2$
- Choose a set of initial weights $\omega_{st}$, then update by $\omega_{st}(k+1) = \omega_{st}(k) - c(k)\,\frac{\partial E(k)}{\partial \omega_{st}(k)}$
- $\omega_{st}$ is the weight connecting input s at neuron t
- The problem is essentially how to choose the weights $\omega$ to minimize the error between the expected output and the actual output
- The basic idea behind BP is gradient descent
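The exact weight values on the XOR slide did not survive extraction, so the sketch below uses its own illustrative weights and a hard threshold in place of the sigmoid units; it only demonstrates that one hidden layer suffices to realize XOR on +/-1 inputs.

```python
import numpy as np

def sgn(v):
    """Hard threshold used here in place of the slide's sigmoid units."""
    return np.where(v >= 0, 1, -1)

def xor_net(x1, x2):
    # Hidden unit 3 acts like OR, hidden unit 4 like AND (inputs are +1/-1).
    h3 = sgn(x1 + x2 + 1.0)   # OR
    h4 = sgn(x1 + x2 - 1.0)   # AND
    # Output unit 5 fires when OR is true but AND is false -> XOR.
    return sgn(h3 - h4 - 1.0)

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, "->", xor_net(x1, x2))
# -1 -1 -> -1,  -1 1 -> 1,  1 -1 -> 1,  1 1 -> -1
```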
Exercise
Given $y = \sigma\!\left(\sum_j \omega_j h_j\right)$ and $h_j = \sigma\!\left(\sum_i \omega_{ij} x_i\right)$, derive $\frac{\partial y}{\partial \omega_j}$ and $\frac{\partial h_j}{\partial \omega_{ij}}$.

*The Derivative - Chain Rule
$$\Delta\omega_j \;\propto\; -\frac{\partial E}{\partial \omega_j} = -\frac{\partial E}{\partial y}\,\frac{\partial y}{\partial \omega_j} = (T - y)\,\sigma'\!\Bigl(\textstyle\sum_j \omega_j h_j\Bigr)\, h_j$$
$$\Delta\omega_{ij} \;\propto\; -\frac{\partial E}{\partial \omega_{ij}} = -\frac{\partial E}{\partial y}\,\frac{\partial y}{\partial h_j}\,\frac{\partial h_j}{\partial \omega_{ij}} = (T - y)\,\sigma'\!\Bigl(\textstyle\sum_j \omega_j h_j\Bigr)\,\omega_j\,\sigma'\!\Bigl(\textstyle\sum_i \omega_{ij} x_i\Bigr)\, x_i$$

Threshold Function
- The traditional threshold function, as proposed by McCulloch-Pitts, is a binary function
- The importance of being differentiable
- A threshold-like but differentiable form for $\sigma$: the sigmoid
  $$\sigma(x) = \frac{1}{1 + \exp(-x)}$$
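A minimal numpy sketch of these update rules for a single training pattern, assuming a scalar output and $E = \frac{1}{2}(T - y)^2$ as above; the function and variable names (bp_step, W_in, w_out) are illustrative, not from the slides.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def bp_step(x, T, W_in, w_out, c=0.1):
    """One gradient-descent step for the 3-layer net on a single pattern.

    x     : input vector, shape (d,)
    T     : scalar target
    W_in  : input-to-hidden weights omega_ij, shape (n_H, d)
    w_out : hidden-to-output weights omega_j, shape (n_H,)
    c     : learning rate
    """
    # Forward pass: h_j = sigma(sum_i omega_ij x_i), y = sigma(sum_j omega_j h_j)
    h = sigmoid(W_in @ x)
    y = sigmoid(w_out @ h)

    # Backward pass, using sigma'(net) = sigma(net) * (1 - sigma(net)) for the sigmoid.
    err = (T - y) * y * (1.0 - y)                            # (T - y) * sigma'(sum_j omega_j h_j)
    delta_w_out = err * h                                    # Delta omega_j
    delta_W_in = np.outer(err * w_out * h * (1.0 - h), x)    # Delta omega_ij

    # Gradient descent: omega(k+1) = omega(k) - c * dE/domega = omega(k) + c * Delta omega
    return W_in + c * delta_W_in, w_out + c * delta_w_out
```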
*BP vs. MPP
For output unit k with target $T_k$ (1 for samples from class $\omega_k$, 0 otherwise), the training error is
$$E(w) = \sum_x \bigl[g_k(x;w) - T_k\bigr]^2 = \sum_{x\in\omega_k}\bigl[g_k(x;w)-1\bigr]^2 + \sum_{x\notin\omega_k}\bigl[g_k(x;w)-0\bigr]^2$$
$$= n\left\{\frac{n_k}{n}\,\frac{1}{n_k}\sum_{x\in\omega_k}\bigl[g_k(x;w)-1\bigr]^2 + \frac{n-n_k}{n}\,\frac{1}{n-n_k}\sum_{x\notin\omega_k}\bigl[g_k(x;w)\bigr]^2\right\}$$
$$\lim_{n\to\infty}\frac{1}{n}E(w) = P(\omega_k)\int\bigl[g_k(x;w)-1\bigr]^2 p(x\mid\omega_k)\,dx + P(\omega_{i\neq k})\int g_k^2(x;w)\,p(x\mid\omega_{i\neq k})\,dx$$
$$= \int\bigl[g_k^2(x;w) - 2g_k(x;w) + 1\bigr]\,p(x,\omega_k)\,dx + \int g_k^2(x;w)\,p(x,\omega_{i\neq k})\,dx$$
$$= \int g_k^2(x;w)\,p(x)\,dx - 2\int g_k(x;w)\,p(x,\omega_k)\,dx + \int p(x,\omega_k)\,dx$$
$$= \int \bigl[g_k(x;w) - P(\omega_k\mid x)\bigr]^2 p(x)\,dx + C$$
So, in the limit of infinite training data, minimizing the squared error also minimizes the mean-squared distance between the network output $g_k(x;w)$ and the posterior $P(\omega_k\mid x)$: the outputs of a BP-trained network approximate the MPP (maximum posterior probability) decision.

Practical Improvements to Backpropagation

Activation (Threshold) Function
- The signum function
  $$\sigma(x) = \operatorname{signum}(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ -1 & \text{if } x < 0 \end{cases}$$
- The sigmoid function
  $$\sigma(x) = \operatorname{sigmoid}(x) = \frac{1}{1 + \exp(-x)}$$
  - Nonlinear
  - Saturates
  - Continuity and smoothness
  - Monotonicity (so that $\sigma'(x) > 0$)
- Improved sigmoid
  $$\sigma(x) = \frac{2a}{1 + \exp(-bx)} - a$$
  - Centered at zero
  - Antisymmetric (odd), which leads to faster learning
  - With $a = 1.716$ and $b = 2/3$: $\sigma'(0) \approx 1$, the linear range is $-1 < x < 1$, and the extrema of $\sigma''(x)$ occur roughly at $x \approx \pm 2$
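As a quick check of the improved activation, a short sketch of the zero-centered sigmoid with the a and b values as reconstructed above; the function name is illustrative.

```python
import numpy as np

# Parameter values as reconstructed from the slide.
a, b = 1.716, 2.0 / 3.0

def improved_sigmoid(x):
    """Zero-centered, antisymmetric sigmoid: 2a / (1 + exp(-b*x)) - a."""
    return 2.0 * a / (1.0 + np.exp(-b * x)) - a

x = np.linspace(-5, 5, 11)
print(improved_sigmoid(x))      # odd function, saturating toward +/- a = +/- 1.716
print(improved_sigmoid(0.0))    # 0.0 (centered at zero)
```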
Data Standardization
- Problem with the units of the inputs
  - Features in different units differ in magnitude
  - Even features in the same units can differ greatly in magnitude
- Standardization - scaling the input
  - Shift the input pattern so that the average of each feature over the training set is zero
  - Scale the full data set so that each feature component has the same variance (around 1.0)

Target Values (output)
- Instead of one-of-c (c is the number of classes), use +1/-1
  - +1 indicates the target category
  - -1 indicates a non-target category
- For faster convergence
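A minimal sketch of the standardization and +/-1 target encoding described above, assuming samples are stored row-wise in numpy arrays; the function names are illustrative.

```python
import numpy as np

def standardize(X_train, X_test):
    """Shift each feature to zero mean and scale to unit variance,
    using statistics estimated on the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0                   # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma

def targets_plus_minus_one(labels, target_class):
    """Encode the target class as +1 and every other class as -1."""
    return np.where(labels == target_class, 1.0, -1.0)
```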
Number of Hidden Layers
- The number of hidden layers governs the expressive power of the network, and also the complexity of the decision boundary
- More hidden layers -> higher expressive power -> better tuned to the particular training set -> poorer performance on the testing set
- Rule of thumb
  - Choose the number of weights to be roughly n/10, where n is the total number of samples in the training set
  - Start with a large number of hidden units, and decay, prune, or eliminate weights

[Figure slide: Number of Hidden Layers - illustration not recoverable]

Initializing Weights
- Can't start with zero
- Fast and uniform learning
  - All weights reach their final equilibrium values at about the same time
- Choose weights randomly from a uniform distribution to help ensure uniform learning
  - Equal negative and positive weights
  - Set the weights such that the integration (net activation) value at a hidden unit is in the range of -1 and +1
  - Input-to-hidden weights: (-1/sqrt(d), 1/sqrt(d))
  - Hidden-to-output weights: (-1/sqrt(n_H), 1/sqrt(n_H)), where n_H is the number of connected units
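A sketch of the uniform initialization ranges suggested above, assuming the inputs have been standardized so that the net activation at a hidden unit stays roughly within (-1, +1); the function name and random seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(d, n_H):
    """Uniform initialization per the slide:
    input-to-hidden in (-1/sqrt(d), 1/sqrt(d)),
    hidden-to-output in (-1/sqrt(n_H), 1/sqrt(n_H))."""
    W_in = rng.uniform(-1.0 / np.sqrt(d), 1.0 / np.sqrt(d), size=(n_H, d))
    w_out = rng.uniform(-1.0 / np.sqrt(n_H), 1.0 / np.sqrt(n_H), size=n_H)
    return W_in, w_out
```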
Learning Rate
- The optimal learning rate: $c_{opt} = \left(\dfrac{\partial^2 E}{\partial \omega^2}\right)^{-1}$
  - Calculate the 2nd derivative of the objective function with respect to each weight
  - Set the optimal learning rate separately for each weight
- A learning rate of 0.1 is often adequate
- The maximum learning rate is $c_{max} < 2\,c_{opt}$
- When $c_{opt} < c < 2\,c_{opt}$, the convergence is slow

Plateaus or Flat Surfaces in the Error Surface
- Plateaus
  - Regions where the derivative $\frac{\partial E}{\partial \omega_{st}}$ is very small
  - Occur when the sigmoid function saturates
- Momentum
  - Allows the network to learn more quickly when plateaus in the error surface exist
  - Plain gradient descent: $\omega_{st}(k+1) = \omega_{st}(k) - c(k)\,\frac{\partial E(k)}{\partial \omega_{st}(k)}$
  - With momentum: $\omega_{st}(k+1) = \omega_{st}(k) + (1-\alpha)\,\Delta\omega_{bp}(k) + \alpha\,\bigl(\omega_{st}(k) - \omega_{st}(k-1)\bigr)$, where $\Delta\omega_{bp}(k)$ is the plain backpropagation change and $\alpha$ is the momentum parameter
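A sketch of one momentum update as reconstructed above; alpha is the momentum parameter (0.9 is an illustrative value, not from the slide), and grad stands for the current gradient dE/domega.

```python
import numpy as np

def momentum_step(w, w_prev, grad, c=0.1, alpha=0.9):
    """One weight update with momentum:
    w(k+1) = w(k) + (1 - alpha) * delta_bp + alpha * (w(k) - w(k-1)),
    where delta_bp = -c * grad is the plain backpropagation change."""
    delta_bp = -c * grad
    w_next = w + (1.0 - alpha) * delta_bp + alpha * (w - w_prev)
    return w_next, w   # new weights, and the now-previous weights for the next step
```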
Weight Decay
- Should almost always lead to improved performance
- $\omega_{new} = \omega_{old}\,(1 - \varepsilon)$

Batch Training vs. On-line Training
- Batch training
  - Add up the weight changes for all the training patterns and apply them in one go
  - True gradient descent (GD)
- On-line training
  - Update all the weights immediately after processing each training pattern
  - Not true GD, but a faster learning rate

Other Improvements
- Other error functions (e.g., the Minkowski error)
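A sketch contrasting the two training modes, with weight decay applied after each batch epoch; grad_fn is an assumed callback returning dE/domega for one pattern, and all names are illustrative.

```python
import numpy as np

def batch_epoch(w, X, T, grad_fn, c=0.1, eps=0.01):
    """One epoch of batch training with weight decay: accumulate the gradient
    over all patterns, apply it once, then shrink the weights."""
    g = np.zeros_like(w)
    for x, t in zip(X, T):
        g += grad_fn(w, x, t)
    w = w - c * g
    return w * (1.0 - eps)             # weight decay: w_new = w_old * (1 - eps)

def online_epoch(w, X, T, grad_fn, c=0.1):
    """One epoch of on-line training: update immediately after every pattern."""
    for x, t in zip(X, T):
        w = w - c * grad_fn(w, x, t)
    return w
```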
Further Discussions
- How to draw the decision boundary of a BPNN?
- How to set the range of valid outputs?
  - 0-0.5 and 0.5-1?
  - 0-0.2 and 0.8-1?
  - 0.1-0.2 and 0.8-0.9?
- The importance of having symmetric initial input