BACK-PROPAGATION NETWORKS

Serious limitations of (single-layer) perceptrons:
- Cannot learn non-linearly separable tasks
- Cannot approximate (learn) non-linear functions
- Difficult (if not impossible) to design learning algorithms for multi-layer networks of perceptrons

Solution:
- Use multi-layer networks for non-linearly separable tasks
- Use continuous differentiable non-linear activation functions
- Solve the credit assignment problem
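
Multi-layer Perceptrons

[Figure: a network with an input layer, a hidden layer, and an output layer with outputs $o_1$, $o_2$. Edges into the hidden layer carry weights $w^{(1,0)}_{j,i}$ (for example $w^{(1,0)}_{1,1}$, $w^{(1,0)}_{2,3}$); edges into the output layer carry weights $w^{(2,1)}_{k,j}$ (for example $w^{(2,1)}_{1,1}$, $w^{(2,1)}_{2,2}$).]

- Activation function: sigmoid
- Error-correction learning
- Least mean square and gradient descent learning:
  - Forward pass
  - Backward pass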

Preliminaries

Training set: $\{(x_p, d_p) : 1 \le p \le P\}$
- $P$ = number of input patterns
- $x_p = (x_{p,0}, \ldots, x_{p,n})$
- $d_p = (d_{p,1}, \ldots, d_{p,K})$ = desired output for $x_p$
- $n$ = input space dimension
- $K$ = number of output neurons
- $o_p = (o_{p,1}, \ldots, o_{p,K})$ = actual output

Objective: minimize the cumulative error

$$\mathrm{Error} = \sum_{p=1}^{P} \mathrm{Err}(d_p, o_p)$$

Err should be a metric (distance measure):
- $\mathrm{Err}(x, y) \ge 0$
- $\mathrm{Err}(x, y) = 0$ if $x = y$
- $\mathrm{Err}(x, y) = \mathrm{Err}(y, x)$
- $\mathrm{Err}(x, y) + \mathrm{Err}(y, z) \ge \mathrm{Err}(x, z)$
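
Preliminaries (continued)

Popular choices are based on norms of $d_p - o_p$:

$$e_{p,j} = |d_{p,j} - o_{p,j}|$$

$$E_p = \mathrm{Err}(d_p, o_p) = \left(e_{p,1}^{u} + \cdots + e_{p,K}^{u}\right)^{1/u}, \quad u > 0$$

- $u = 1$: Manhattan distance
- $u = 2$: Euclidean distance

Let $u = 2$; then the cumulative error to minimize is one of:

Sum of Squared Error
$$\mathrm{SSE} = \sum_{p=1}^{P} \sum_{j=1}^{K} e_{p,j}^{2}$$

Mean Squared Error
$$\mathrm{MSE} = \frac{1}{P} \frac{1}{K} \sum_{p=1}^{P} \sum_{j=1}^{K} e_{p,j}^{2}$$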
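
The error measures on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the toy data (two patterns, two outputs) are assumptions, not part of the slides.

```python
def minkowski_error(d, o, u=2):
    """Err(d, o) = (sum_j |d_j - o_j|**u) ** (1/u); u=1 is Manhattan, u=2 Euclidean."""
    return sum(abs(dj - oj) ** u for dj, oj in zip(d, o)) ** (1.0 / u)


def sse(desired, actual):
    """Sum of squared errors over all P patterns and K outputs."""
    return sum((dj - oj) ** 2
               for d, o in zip(desired, actual)
               for dj, oj in zip(d, o))


def mse(desired, actual):
    """SSE divided by P * K."""
    P, K = len(desired), len(desired[0])
    return sse(desired, actual) / (P * K)


# Hypothetical data: P = 2 patterns, K = 2 outputs.
desired = [[1.0, 0.0], [0.0, 1.0]]
actual = [[0.5, 0.5], [0.0, 1.0]]
manhattan = minkowski_error(desired[0], actual[0], u=1)
euclidean = minkowski_error(desired[0], actual[0], u=2)
```

Note that SSE and MSE sum the squared component errors directly, without the outer root that appears in $E_p$ for $u = 2$.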

Back-Propagation Algorithm

Scenario for One-Hidden-Layer Networks

1. $x_{p,i}$: value in the $i$-th input node
2. $net^{(1)}_{j} = \sum_{i=0}^{n} w^{(1,0)}_{j,i}\, x_{p,i}$: net input of the $j$-th node in the hidden layer
3. $x^{(1)}_{p,j} = \sigma\left(\sum_{i=0}^{n} w^{(1,0)}_{j,i}\, x_{p,i}\right)$: output of the $j$-th node in the hidden layer
4. $net^{(2)}_{k} = \sum_{j} w^{(2,1)}_{k,j}\, x^{(1)}_{p,j}$: net input of the $k$-th node in the output layer
5. $o_{p,k} = \sigma\left(\sum_{j} w^{(2,1)}_{k,j}\, x^{(1)}_{p,j}\right)$: output of the $k$-th node in the output layer
6. $e^{2}_{p,k} = |d_{p,k} - o_{p,k}|^{2}$: squared error at the $k$-th output node
7. $w^{(i+1,i)}_{k,j}$: weight from the $j$-th node in the $i$-th layer to the $k$-th node in the $(i+1)$-th layer
8. $x^{(i)}_{p,j}$: output of the $j$-th node of the $i$-th layer for pattern $x_p$
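
Steps 1 to 5 above amount to one forward pass. A minimal sketch follows, assuming weights are stored as nested lists (`w_hidden[j][i]`, `w_output[k][j]`) and that `x[0]` is a constant bias input; both conventions are assumptions made for the example.

```python
import math


def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))


def forward(x, w_hidden, w_output):
    """Forward pass through one hidden layer.

    x        : input values x[0..n] (x[0] assumed to be a constant bias input)
    w_hidden : w_hidden[j][i] is the weight from input i to hidden node j
    w_output : w_output[k][j] is the weight from hidden node j to output node k
    """
    net1 = [sum(w * xi for w, xi in zip(row, x)) for row in w_hidden]
    hidden = [sigmoid(v) for v in net1]
    net2 = [sum(w * hj for w, hj in zip(row, hidden)) for row in w_output]
    out = [sigmoid(v) for v in net2]
    return net1, hidden, net2, out


# Hypothetical weights: two hidden nodes, one output node.
net1, hidden, net2, out = forward([1.0, 2.0, -1.0],
                                  [[0.0, 1.0, 2.0], [0.5, 0.0, 0.5]],
                                  [[2.0, 0.0]])
```

In this example the first hidden node has net input 0, so its output is 0.5 and the output node receives a net input of 1.0.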
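
Back-Propagation Algorithm

$E_p = \sum_{k} e^{2}_{p,k}$: error for pattern $x_p$. For simplicity write

$$E = \sum_{k=1}^{K} e_k^{2}$$

(for now, we drop the suffix $p$ for convenience).

MSE is minimal when $E$ is minimal for each pattern $x_p$.

Since $o_k$ depends on the network weights $w$, $E$ is also a function of $w$.

According to gradient descent, the weight update is

$$\Delta w \propto -\frac{\partial E}{\partial w}$$

That is,

$$\Delta w^{(2,1)}_{k,j} = -\eta\, \frac{\partial E}{\partial w^{(2,1)}_{k,j}} \quad \text{and} \quad \Delta w^{(1,0)}_{j,i} = -\eta\, \frac{\partial E}{\partial w^{(1,0)}_{j,i}}$$

How does $E$ depend on the weights?

1. $E \leftarrow o_k \leftarrow net^{(2)}_{k} \leftarrow w^{(2,1)}_{k,j}$
2. $E \leftarrow o_k \leftarrow net^{(2)}_{k} \leftarrow x^{(1)}_{j} \leftarrow net^{(1)}_{j} \leftarrow w^{(1,0)}_{j,i}$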
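
Back-Propagation Algorithm

By the chain rule, we have

1. $$\frac{\partial E}{\partial w^{(2,1)}_{k,j}} = \frac{\partial E}{\partial o_k}\, \frac{\partial o_k}{\partial net^{(2)}_{k}}\, \frac{\partial net^{(2)}_{k}}{\partial w^{(2,1)}_{k,j}}$$
2. $$\frac{\partial E}{\partial w^{(1,0)}_{j,i}} = \sum_{k=1}^{K} \frac{\partial E}{\partial o_k}\, \frac{\partial o_k}{\partial net^{(2)}_{k}}\, \frac{\partial net^{(2)}_{k}}{\partial x^{(1)}_{j}}\, \frac{\partial x^{(1)}_{j}}{\partial net^{(1)}_{j}}\, \frac{\partial net^{(1)}_{j}}{\partial w^{(1,0)}_{j,i}}$$

After substitutions we obtain

$$\frac{\partial E}{\partial w^{(2,1)}_{k,j}} = -2 (d_k - o_k)\, \sigma'(net^{(2)}_{k})\, x^{(1)}_{j}$$

$$\frac{\partial E}{\partial w^{(1,0)}_{j,i}} = -\sum_{k=1}^{K} 2 (d_k - o_k)\, \sigma'(net^{(2)}_{k})\, w^{(2,1)}_{k,j}\, \sigma'(net^{(1)}_{j})\, x_i$$

Applying the gradient descent rule (with the constant factor 2 absorbed into $\eta$) we have

1. $\Delta w^{(2,1)}_{k,j} = \eta\, \delta_k\, x^{(1)}_{j}$
2. $\Delta w^{(1,0)}_{j,i} = \eta\, \mu_j\, x_i$

where

$$\delta_k = (d_k - o_k)\, \sigma'(net^{(2)}_{k})$$

$$\mu_j = \left(\sum_{k} \delta_k\, w^{(2,1)}_{k,j}\right) \sigma'(net^{(1)}_{j})$$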
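
One way to gain confidence in the two partial derivatives above is to compare them with finite differences. The sketch below does this for a tiny network; the weight values, the input, and the helper names are hypothetical, and the logistic sigmoid (with derivative $\sigma(1-\sigma)$) is assumed as the activation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, W2):
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    o = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in W2]
    return h, o


def error(x, d, W1, W2):
    """E = sum_k (d_k - o_k)^2 for one pattern."""
    _, o = forward(x, W1, W2)
    return sum((dk - ok) ** 2 for dk, ok in zip(d, o))


def analytic_gradients(x, d, W1, W2):
    h, o = forward(x, W1, W2)
    # dE/dnet2_k = -2 (d_k - o_k) * sigma'(net2_k), with sigma' = o (1 - o)
    g_out = [-2.0 * (dk - ok) * ok * (1.0 - ok) for dk, ok in zip(d, o)]
    grad_W2 = [[gk * hj for hj in h] for gk in g_out]
    grad_W1 = []
    for j, hj in enumerate(h):
        back = sum(g_out[k] * W2[k][j] for k in range(len(W2)))
        grad_W1.append([back * hj * (1.0 - hj) * xi for xi in x])
    return grad_W1, grad_W2


def numeric_gradient(x, d, W1, W2, W, r, c, eps=1e-6):
    """Central difference of E with respect to the single weight W[r][c]."""
    old = W[r][c]
    W[r][c] = old + eps
    e_plus = error(x, d, W1, W2)
    W[r][c] = old - eps
    e_minus = error(x, d, W1, W2)
    W[r][c] = old
    return (e_plus - e_minus) / (2.0 * eps)


# Hypothetical pattern and weights: 3 inputs, 2 hidden nodes, 2 outputs.
x = [1.0, 0.5, -0.3]
d = [1.0, 0.0]
W1 = [[0.1, -0.2, 0.3], [0.4, 0.1, -0.5]]
W2 = [[0.2, -0.4], [-0.3, 0.6]]
gW1, gW2 = analytic_gradients(x, d, W1, W2)
```

If the analytic and numeric values disagree beyond rounding error, the usual suspects are a dropped minus sign or a mixed-up index in the sum over $k$.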
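
Back-Propagation Algorithm

Changes in weights for the two layers are similar:

1. $\delta_k$ is proportional to the actual error $(d_k - o_k)$ multiplied by the derivative of the output node with respect to the net input to that node
2. $\mu_j$ is proportional to the weighted sum of errors coming to the hidden node from the nodes in the upper layer

We made no assumption about the activation function $\sigma$ except that it should be differentiable.

For the logistic sigmoid function we have

$$\sigma(net) = \frac{1}{1 + e^{-\alpha\, net}}, \qquad \sigma'(net) = \alpha\, \sigma(net)\, (1 - \sigma(net))$$

which for $\alpha = 1$ reduces to $\sigma'(net) = \sigma(net)\,(1 - \sigma(net))$.

Therefore (taking $\alpha = 1$)

1. $\delta_k = (d_k - o_k)\, o_k\, (1 - o_k)$
2. $\mu_j = \left(\sum_{k} \delta_k\, w^{(2,1)}_{k,j}\right) x^{(1)}_{j}\, (1 - x^{(1)}_{j})$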
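
The derivative identity for the logistic sigmoid can be checked numerically. The sketch below is only a sanity check under the assumption of a slope parameter `alpha`; for general `alpha` the chain rule contributes an extra factor `alpha`, and `alpha = 1` gives the familiar $\sigma(1 - \sigma)$ form.

```python
import math


def sigmoid(net, alpha=1.0):
    return 1.0 / (1.0 + math.exp(-alpha * net))


def sigmoid_prime(net, alpha=1.0):
    """Closed form: alpha * sigma(net) * (1 - sigma(net))."""
    s = sigmoid(net, alpha)
    return alpha * s * (1.0 - s)


def central_difference(f, z, h=1e-5):
    """Numerical derivative of f at z."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


points = (-2.0, -0.5, 0.0, 0.7, 3.0)
# Largest gap between closed form and numerical derivative, alpha = 1.
max_gap = max(abs(sigmoid_prime(z) - central_difference(sigmoid, z))
              for z in points)
# Same check for a steeper sigmoid, alpha = 2.
max_gap_alpha2 = max(
    abs(sigmoid_prime(z, 2.0) - central_difference(lambda t: sigmoid(t, 2.0), z))
    for z in points)
```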

Back-Propagation Algorithm

Start with initial random $w$;
Repeat
    For each input pattern $x_p$ do
        {Propagate $x_p$ (forward pass), that is:}
        From first hidden layer to output layer do
            Compute hidden node activations: $net^{(1)}_{p,j}$;
            Compute hidden node outputs: $x^{(1)}_{p,j}$;
            Compute output node activations: $net^{(2)}_{p,k}$;
            Compute network outputs: $o_{p,k}$;
            Compute network's error $e_{p,k} = |d_{p,k} - o_{p,k}|$;
        {Back-propagate $e_p$ (backward pass), that is:}
        From output layer to first hidden layer do
            For the output layer do
                Update output layer weights:
                    $\delta_{p,k} = (d_{p,k} - o_{p,k})\, o_{p,k}\, (1 - o_{p,k})$;
                    $\Delta w^{(2,1)}_{k,j} = \eta\, \delta_{p,k}\, x^{(1)}_{p,j}$;
            For a hidden layer do
                Update hidden layer weights:
                    $\mu_{p,j} = \left(\sum_{k} \delta_{p,k}\, w^{(2,1)}_{k,j}\right) x^{(1)}_{p,j}\, (1 - x^{(1)}_{p,j})$;
                    $\Delta w^{(1,0)}_{j,i} = \eta\, \mu_{p,j}\, x_{p,i}$;
Until MSE($w$) is minimal;
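
The algorithm above can be sketched in plain Python as follows. Several choices here are assumptions made for the example and are not taken from the slides: a fixed number of epochs stands in for "until MSE is minimal", `x[0] = 1.0` acts as a bias input, both $\delta$ and $\mu$ are computed before any weight is changed, and the toy task (logical OR) and all names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, W2):
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    o = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in W2]
    return h, o


def train_pattern(x, d, W1, W2, eta):
    """One forward and one backward pass for a single pattern (in place)."""
    h, o = forward(x, W1, W2)
    delta = [(dk - ok) * ok * (1.0 - ok) for dk, ok in zip(d, o)]
    # mu uses the output weights as they were before this update.
    mu = [sum(delta[k] * W2[k][j] for k in range(len(W2))) * hj * (1.0 - hj)
          for j, hj in enumerate(h)]
    for k, row in enumerate(W2):
        for j in range(len(row)):
            row[j] += eta * delta[k] * h[j]
    for j, row in enumerate(W1):
        for i in range(len(row)):
            row[i] += eta * mu[j] * x[i]


def mse(patterns, W1, W2):
    total = 0.0
    for x, d in patterns:
        _, o = forward(x, W1, W2)
        total += sum((dk - ok) ** 2 for dk, ok in zip(d, o))
    return total / (len(patterns) * len(patterns[0][1]))


def train(patterns, W1, W2, eta=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    order = list(patterns)
    for _ in range(epochs):
        rng.shuffle(order)          # new presentation order each epoch
        for x, d in order:
            train_pattern(x, d, W1, W2, eta)
    return mse(patterns, W1, W2)


def random_weights(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical toy task: logical OR, with x[0] = 1.0 as a bias input.
patterns = [([1.0, 0.0, 0.0], [0.0]), ([1.0, 0.0, 1.0], [1.0]),
            ([1.0, 1.0, 0.0], [1.0]), ([1.0, 1.0, 1.0], [1.0])]
rng = random.Random(1)
W1 = random_weights(3, 3, rng)      # 3 hidden nodes, 3 inputs (bias included)
W2 = random_weights(1, 3, rng)      # 1 output node
mse_before = mse(patterns, W1, W2)
mse_after = train(patterns, W1, W2)
```

Comparing `mse_before` with `mse_after` shows whether training reduced the error; a real stopping rule would monitor that value instead of running a fixed number of epochs.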

General Multiclass Classification Problem

- $K$ classes: $C_1, \ldots, C_K$
- $n_k$ examples of class $C_k$
- $T_k = \{(x^{k}_{p}, d^{k}_{p}) : 1 \le p \le n_k\}$, $1 \le k \le K$
- Training set: $T = T_1 \cup \cdots \cup T_K$

Output representation:

1. $\log_2 K$ output nodes: (bad)
   Training is difficult since many output nodes must be high simultaneously.
   Cross-talk phenomenon: different training examples require conflicting changes to be made to the same weight.
2. $K$ output nodes: (better)
   One output node is assigned to each class.
   Each output node focuses only on learning one class rather than performing multiple duties.
3. Error-correcting output code: (best)
   Minimizes cross-talk.
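
The first two output representations can be compared with a small sketch. The function names are hypothetical, class indices are taken as 0-based for convenience, and the bit count is rounded up when $K$ is not a power of two; none of this is prescribed by the slides.

```python
def binary_code(k, K):
    """Class index k (0-based) as a bit vector with about log2(K) entries."""
    bits = max(1, (K - 1).bit_length())
    return [(k >> b) & 1 for b in reversed(range(bits))]


def one_hot(k, K):
    """One output node per class: node k is 1, all others are 0."""
    return [1 if j == k else 0 for j in range(K)]


K = 4
compact = [binary_code(k, K) for k in range(K)]
per_class = [one_hot(k, K) for k in range(K)]
```

With the compact code, classes 1 and 3 both need the last output node to be high, so a single node serves several classes; with one node per class no output is shared.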
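
General Multiclass Classification Problem

Desired output representation:

1. Outputs of exactly 0 or 1 require high weight magnitudes
2. $\sigma'(net) \to 0$ when $net \to \pm\infty$
3. Desired output $d^{k} = (\varepsilon, \ldots, \varepsilon, 1 - \varepsilon, \varepsilon, \ldots, \varepsilon)$, where $\varepsilon > 0$ (typical choices are 0.01 and 0.1)

Perfect classification is possible even if the error $|d_{p,j} - o_{p,j}| \ne 0$ (in absolute value):

1. If $d_{p,j} = (1 - \varepsilon)$ and $o_{p,j} \ge d_{p,j}$ then $e_{p,j} = 0$
2. If $d_{p,j} = \varepsilon$ and $o_{p,j} \le d_{p,j}$ then $e_{p,j} = 0$
3. Otherwise $e_{p,j} = |d_{p,j} - o_{p,j}|$

Class membership of $x_p$ after training:

1. Assign $x_p$ to the class $k$ for which $\|d^{k} - o_p\| \le \|d^{j} - o_p\|$ for all $j \ne k$, where $1 \le j \le K$
2. Assign $x_p$ to class $k$ if $o_{k,p} > o_{j,p}$ for all $j \ne k$, where $1 \le j \le K$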
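
A minimal sketch of the $\varepsilon$-based targets, the tolerant error, and the two class-assignment rules is given below. The function names, the 0-based class indices, the use of Euclidean distance for the first rule, and the sample output vector are assumptions made for the example.

```python
def make_target(k, K, eps=0.1):
    """Desired output for class k: 1 - eps at position k, eps elsewhere."""
    return [1.0 - eps if j == k else eps for j in range(K)]


def tolerant_errors(d, o):
    """Error is zero when an output overshoots its target in the right direction."""
    errors = []
    for dj, oj in zip(d, o):
        high = dj > 0.5                       # target is 1 - eps
        if (high and oj >= dj) or (not high and oj <= dj):
            errors.append(0.0)
        else:
            errors.append(abs(dj - oj))
    return errors


def classify_nearest_target(o, eps=0.1):
    """Rule 1: class whose target vector is closest to the output."""
    K = len(o)

    def dist2(k):
        return sum((dj - oj) ** 2 for dj, oj in zip(make_target(k, K, eps), o))

    return min(range(K), key=dist2)


def classify_largest_output(o):
    """Rule 2: class whose output node has the largest value."""
    return max(range(len(o)), key=lambda k: o[k])


d = make_target(1, 3)
o = [0.05, 0.95, 0.30]                        # hypothetical network output
errors = tolerant_errors(d, o)
```

Here the first two outputs lie beyond their targets in the right direction and contribute no error, while the third output is charged its distance from $\varepsilon$.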

Heuristic Modifications of Back-Propagation

Frequency of weight updates

1. Sequential learning
   - Weights are updated after each example presentation
   - Slower but uses less memory
   - Easy to implement
   - Local minimization: less ability to escape a local optimum
2. Batch learning: $\Delta w = \sum_{p=1}^{P} \Delta w_p$
   - Weights are updated after each epoch
   - Faster but uses more memory
   - Can be parallelized
   - Global minimization: better ability to escape a local optimum

It is good practice to randomize the order of presentation of training examples from one epoch to the next.
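
The difference between the two update schedules can be seen on a deliberately small example. To keep it short, the sketch uses a single linear unit trained with the LMS rule rather than a full back-propagation network; that substitution, the function names, and the data are assumptions for illustration only.

```python
import random


def lms_delta(w, x, d, eta):
    """Per-pattern change for one linear unit: eta * (d - w.x) * x."""
    err = d - sum(wi * xi for wi, xi in zip(w, x))
    return [eta * err * xi for xi in x]


def sequential_epoch(w, patterns, eta, rng=None):
    """Update the weights after every pattern; shuffle if an rng is given."""
    order = list(patterns)
    if rng is not None:
        rng.shuffle(order)
    for x, d in order:
        dw = lms_delta(w, x, d, eta)
        w = [wi + dwi for wi, dwi in zip(w, dw)]
    return w


def batch_epoch(w, patterns, eta):
    """Accumulate all per-pattern changes, then update once per epoch."""
    total = [0.0] * len(w)
    for x, d in patterns:
        dw = lms_delta(w, x, d, eta)          # every delta uses the old w
        total = [t + dwi for t, dwi in zip(total, dw)]
    return [wi + t for wi, t in zip(w, total)]


patterns = [([1.0, 1.0], 1.0), ([1.0, 0.0], 0.0)]
w_seq = sequential_epoch([0.0, 0.0], patterns, eta=0.5)
w_batch = batch_epoch([0.0, 0.0], patterns, eta=0.5)
w_shuffled = sequential_epoch([0.0, 0.0], patterns, eta=0.5,
                              rng=random.Random(0))
```

In the sequential run the second pattern already sees the weights changed by the first, so the two schedules end one epoch at different weight vectors.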

Heuristic Modifications of Back-Propagation

Maximizing information content: select examples that have the largest possible information content for the learning problem

1. Use samples that result in the largest training error
2. Use radically different samples in training
3. Use random presentation
4. Use an emphasizing scheme

Activation function

1. Should be antisymmetric: $\sigma(-y) = -\sigma(y)$
   Example: hyperbolic tangent $\sigma(y) = a \tanh(b y)$
2. Bipolar representation $(-1, 1)$ vs. binary $(0, 1)$
   (a) Weights are always updated
   (b) Noise representation
3. Faster learning and better generalizability
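
The activation-function points can be illustrated with a short sketch. The constants `a = 1.7159` and `b = 2/3` are a commonly quoted choice and are an assumption here, as are the function names; the reading of item (a) shown below follows from the hidden-layer rule $\Delta w = \eta\, \mu\, x_i$, which gives no change whenever $x_i = 0$.

```python
import math


def tanh_activation(y, a=1.7159, b=2.0 / 3.0):
    """Antisymmetric activation a * tanh(b * y); a and b are assumed values."""
    return a * math.tanh(b * y)


def to_bipolar(bits):
    """Map binary inputs (0, 1) to bipolar inputs (-1, 1)."""
    return [2 * bit - 1 for bit in bits]


def weight_change(eta, mu, x):
    """Hidden-layer rule: eta * mu * x_i for each input x_i."""
    return [eta * mu * xi for xi in x]


binary_input = [0, 1, 0]
dw_binary = weight_change(0.5, 0.2, binary_input)
dw_bipolar = weight_change(0.5, 0.2, to_bipolar(binary_input))
```

With the binary input, the weights attached to the two zero inputs do not move at all, whereas the bipolar encoding produces a non-zero change for every weight.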

Heuristic Modifications of Back-Propagation

Normalizing the inputs

1. Weights should be updated at approximately the same speed
2. Prevents network bias toward particular inputs

Initialization of weights

1. Random values: $-1 \le w_i \le +1$
2. Normalized
3. Averaged:

$$w^{(1,0)}_{j,i} = \pm \frac{1}{P} \sum_{p=1}^{P} \frac{1}{x_i} \quad \text{or} \quad w^{(2,1)}_{k,j} = \pm \frac{1}{P} \sum_{p=1}^{P} \frac{1}{\sigma(net^{(1)}_{j})}$$

$$w_{j,i}(\text{new}) = w_{j,i}(\text{old}) - \bar{w}_{j}(\text{old})$$

where $\bar{w}_{j}(\text{old})$ denotes the average weight, computed over all values of $i$.
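
A few of these heuristics can be sketched as follows. The slide does not say how inputs are normalized, so rescaling each input component to zero mean and unit variance is an assumption, as are the function names and the sample data; any constant bias component should be left out of the rescaling.

```python
import random


def standardize_columns(patterns):
    """Rescale each input component to zero mean and unit variance."""
    P, n = len(patterns), len(patterns[0])
    means = [sum(x[i] for x in patterns) / P for i in range(n)]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by 0.
    stds = [(sum((x[i] - means[i]) ** 2 for x in patterns) / P) ** 0.5 or 1.0
            for i in range(n)]
    return [[(x[i] - means[i]) / stds[i] for i in range(n)] for x in patterns]


def random_weights(rows, cols, rng):
    """Random initial weights with -1 <= w <= +1."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def center_rows(W):
    """w_new[j][i] = w_old[j][i] - (average of w_old[j][i] over all i)."""
    return [[w - sum(row) / len(row) for w in row] for row in W]


# Hypothetical inputs whose components live on very different scales.
inputs = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
scaled = standardize_columns(inputs)
W = random_weights(2, 3, random.Random(0))
W_centered = center_rows(W)
```

After rescaling, both input components vary over a comparable range, which is the condition under which their weights tend to change at similar speeds.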

Heuristic Modifications of Back-Propagation

Choice of learning rate:

1. Constant rate $\eta$
2. A rate $\eta_i$ for each $w_i$
3. Start with large $\eta$ (or $\eta_i$) and decrease steadily
4. At each iteration where
   (a) Performance improves: increase $\eta$ (or $\eta_i$)
   (b) Performance worsens: decrease $\eta$ (or $\eta_i$)
5. Double $\eta$ (or $\eta_i$) until performance worsens

       i := 0; w_new := w - E'(w) ε;
       While E(w_new) < E(w) do
           i := i + 1; w := w_new;
           w_new := w - E'(w) 2^i ε;
       End-While
       η (or η_i) := 2^(i-1) ε;

   Searches for a large enough rate ($2^{i} \varepsilon$) at which the network's error no longer decreases. (We assume $\varepsilon > 0$ is such that $E(w - E'(w)\,\varepsilon) < E(w)$.)
6. Make use of the second derivative of MSE
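
The doubling search in item 5 can be sketched as below. The function names, the quadratic test error $E(w) = \sum_j w_j^2$, and the `max_doublings` safety cap are assumptions added for the example; the loop itself follows the pseudocode, including re-evaluating the gradient at each accepted point.

```python
def doubling_rate_search(E, grad, w, eps=0.01, max_doublings=60):
    """Double the step size until the error stops decreasing.

    Returns (eta, w) where eta = 2**(i-1) * eps is the last rate that still
    reduced the error and w is the last accepted weight vector.
    """
    i = 0
    w_new = [wj - gj * eps for wj, gj in zip(w, grad(w))]
    while E(w_new) < E(w) and i < max_doublings:   # cap is an added safeguard
        i += 1
        w = w_new
        step = (2 ** i) * eps
        w_new = [wj - gj * step for wj, gj in zip(w, grad(w))]
    return 2 ** (i - 1) * eps, w


def E(w):
    """Hypothetical error surface: a simple quadratic bowl."""
    return sum(wj ** 2 for wj in w)


def grad(w):
    return [2.0 * wj for wj in w]


eta, w_last = doubling_rate_search(E, grad, [1.0])
```

On this quadratic, a step of size $s$ multiplies $w$ by $(1 - 2s)$, so the error keeps falling until the step exceeds 1; the search therefore stops when $2^{i} \varepsilon = 1.28$ and returns the previous rate.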

Heuristic Modifications of Back-Propagation

Momentum

1. $\Delta w$ grows in magnitude when $\frac{\partial E}{\partial w}$ has the same sign on consecutive iterations: this accelerates descent for faster learning
2. $\Delta w$ shrinks in magnitude when $\frac{\partial E}{\partial w}$ has opposite signs on consecutive iterations: this stabilizes learning and avoids oscillation
3. Items 1 and 2 can be achieved by adding a momentum term to the learning rule:

$$\Delta w(t + 1) = \eta\, \delta_k\, x + \alpha\, \Delta w(t)$$

   where the momentum $\alpha$ satisfies $0 < \alpha \le 1$
4. The momentum term has an averaging effect: weights move in the general (or "average") direction of motion.
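
The effect described in items 1 and 2 can be reproduced with a scalar sketch of the momentum rule. Here `gradient_term` stands for the product $\delta_k\, x$; the function names and the values $\eta = 0.1$, $\alpha = 0.5$ are assumptions chosen for the example.

```python
def momentum_delta(gradient_term, previous_delta, eta=0.1, alpha=0.5):
    """delta_w(t+1) = eta * gradient_term + alpha * delta_w(t)."""
    return eta * gradient_term + alpha * previous_delta


def run(gradient_terms, eta=0.1, alpha=0.5):
    """Apply the rule to a sequence of gradient terms, starting from 0."""
    delta, history = 0.0, []
    for g in gradient_terms:
        delta = momentum_delta(g, delta, eta, alpha)
        history.append(delta)
    return history


same_sign = run([1.0] * 50)             # gradient keeps the same sign
alternating = run([1.0, -1.0] * 25)     # gradient flips sign every step
```

With a constant gradient term the step grows from $\eta$ toward $\eta / (1 - \alpha)$, while with alternating signs every step after the first is smaller in magnitude than the plain step $\eta$.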

Heuristic Modifications of Back-Propagation

Generalizability

1. Neural networks are useless if they don't generalize
2. Design of neural network systems:
   (a) Training and testing
   (b) The choice of training and testing sets is very important
   (c) The NN should give good training and test results
3. When to stop training?
   - When the testing error starts to worsen
   - Otherwise overfitting occurs: the NN only memorizes the input, it doesn't generalize
   - A small training error is acceptable when sampling is good
4. Small networks achieve better generalizability
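
The stopping rule in item 3 can be sketched as a function over a recorded sequence of test errors. The function name, the `patience` parameter (how many non-improving epochs to tolerate), and the error values are hypothetical; the slide itself only says to stop when the testing error starts to worsen.

```python
def stopping_epoch(test_errors, patience=1):
    """Return the epoch with the lowest test error seen before stopping.

    Training is considered stopped once the test error has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch, best_err, bad_epochs = 0, float("inf"), 0
    for epoch, err in enumerate(test_errors):
        if err < best_err:
            best_epoch, best_err, bad_epochs = epoch, err, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch


# Hypothetical test-error curve: improves, then starts to worsen.
errors = [0.50, 0.40, 0.30, 0.35, 0.40]
stop_at = stopping_epoch(errors)
```

In practice the weights from the returned epoch would be the ones to keep, since later epochs fit the training set better only at the expense of the test set.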