Intelligent Control. Module I- Neural Networks Lecture 7 Adaptive Learning Rate. Laxmidhar Behera

1 Intelligent Control
Module I: Neural Networks, Lecture 7: Adaptive Learning Rate
Laxmidhar Behera
Department of Electrical Engineering, Indian Institute of Technology, Kanpur

2 Subjects to be covered
Motivation for adaptive learning rate
Lyapunov Stability Theory
Training Algorithm based on Lyapunov Stability Theory
Simulations and discussion
Conclusion

3 Training of a Feed-Forward Network
Figure 1: A feed-forward network (inputs $x_1, x_2$, weight layers $W$, output $y$)
Here, $W \in \mathbb{R}^M$ is the weight vector. The training data consist of, say, $N$ patterns $\{x_p, y_p\}$, $p = 1, 2, \ldots, N$.
Weight update law:
$$W(t+1) = W(t) - \eta \frac{\partial E}{\partial W}, \qquad \eta: \text{learning rate}$$
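To make the update law concrete, here is a minimal sketch (not part of the lecture; the linear model, data, and learning rate are illustrative assumptions) that applies the fixed-rate rule $W(t+1) = W(t) - \eta\, \partial E / \partial W$ to a least-squares fit.

```python
import numpy as np

# Illustrative data: N patterns (x_p, y_p) and a linear model y_hat = X @ W,
# used only to exercise the update law W(t+1) = W(t) - eta * dE/dW.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                     # N x M inputs
y = X @ np.array([1.5, -0.7]) + 0.1 * rng.normal(size=100)

W = np.zeros(2)                                   # weight vector W in R^M
eta = 0.05                                        # fixed learning rate

for t in range(500):
    y_hat = X @ W                                 # network output for all patterns
    E = 0.5 * np.sum((y - y_hat) ** 2)            # quadratic cost E
    dE_dW = -X.T @ (y - y_hat)                    # gradient dE/dW of the linear model
    W = W - eta * dE_dW / len(X)                  # fixed-learning-rate update
print("fitted W:", W, " final cost:", E)
```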

4 Motivation for adaptive learning rate
Figure 2: Convergence to the global minimum (actual vs. adaptive learning-rate behaviour on a cost $f(x)$)
With an adaptive learning rate, one can employ a higher learning rate when the error is far from the global minimum and a smaller learning rate when it is near to it.

5 Adaptive Learning Rate
The objective is to achieve global convergence for a non-quadratic, non-convex nonlinear cost function without increasing the computational complexity.
In gradient descent (GD), the learning rate is fixed. If one can use a larger learning rate at points far from the global minimum and a smaller learning rate at points close to it, then it becomes possible to avoid local minima and ensure global convergence. This necessitates an adaptive learning rate.

6 Lyapunov Stability Theory
Used extensively in control system problems. If we choose a Lyapunov function candidate $V(x(t), t)$ such that
$V(x(t), t)$ is positive definite, and
$\dot{V}(x(t), t)$ is negative definite,
then the system is asymptotically stable.
Local Invariant Set Theorem (La Salle). Consider an autonomous system of the form $\dot{x} = f(x)$ with $f$ continuous, and let $V(x)$ be a scalar function with continuous partial derivatives. Assume that
* for some $l > 0$, the region $\Omega_l$ defined by $V(x) < l$ is bounded;

7 Lyapunov stability theory: contd...
* $\dot{V}(x) \le 0$ for all $x$ in $\Omega_l$.
Let $R$ be the set of all points within $\Omega_l$ where $\dot{V}(x) = 0$, and let $M$ be the largest invariant set in $R$. Then every solution $x(t)$ originating in $\Omega_l$ tends to $M$ as $t \to \infty$.
The problem lies in choosing a proper Lyapunov function candidate.

8 Weight update law using a Lyapunov-based approach
The network output is given by
$$\hat{y}_p = f(W, x_p), \qquad p = 1, 2, \ldots, N \qquad (1)$$
The usual quadratic cost function is
$$E = \frac{1}{2} \sum_{p=1}^{N} (y_p - \hat{y}_p)^2 \qquad (2)$$
Let us choose a Lyapunov function candidate for the system as
$$V = \frac{1}{2} \tilde{y}^T \tilde{y} \qquad (3)$$
where $\tilde{y} = [\,y_1 - \hat{y}_1, \ldots, y_p - \hat{y}_p, \ldots, y_N - \hat{y}_N\,]^T$.
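The quantities that drive everything below are the error vector $\tilde{y}$ and the Jacobian $J = \partial \hat{y}/\partial W$. The following sketch shows one way to assemble them; the forward-difference Jacobian and the toy single-tanh model named `f` are assumptions made purely for illustration (in practice $J$ would come from backpropagation).

```python
import numpy as np

def residual_and_jacobian(f, W, X, y, h=1e-6):
    """Return ytilde = y - f(W, X) and J = d f / d W (N x M), the latter by
    forward differences; a backprop-computed Jacobian would normally be used."""
    y_hat = f(W, X)
    ytilde = y - y_hat                      # error vector ytilde
    J = np.zeros((len(y), len(W)))
    for k in range(len(W)):
        Wk = W.copy()
        Wk[k] += h
        J[:, k] = (f(Wk, X) - y_hat) / h    # k-th column of d y_hat / d W
    return ytilde, J

# Toy model assumed only for illustration: a single tanh unit.
def f(W, X):
    return np.tanh(X @ W)

X = np.random.default_rng(1).normal(size=(5, 3))
y = np.ones(5)
ytilde, J = residual_and_jacobian(f, np.full(3, 0.1), X, y)
V = 0.5 * ytilde @ ytilde                   # Lyapunov candidate V = 1/2 ytilde^T ytilde
```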

9 LF I Algorithm
The time derivative of the Lyapunov function $V$ is
$$\dot{V} = -\tilde{y}^T \frac{\partial \hat{y}}{\partial W} \dot{W} = -\tilde{y}^T J \dot{W} \qquad (4)$$
where $J = \frac{\partial \hat{y}}{\partial W}$, $J \in \mathbb{R}^{N \times M}$.
Theorem 1. If an arbitrary initial weight $W(0)$ is updated by
$$W(t') = W(0) + \int_0^{t'} \dot{W}\, dt \qquad (5)$$
where
$$\dot{W} = \frac{\|\tilde{y}\|^2}{\|J^T \tilde{y}\|^2 + \epsilon}\, J^T \tilde{y} \qquad (6)$$
and $\epsilon$ is a small positive constant, then $\tilde{y}$ converges to zero under the condition that $\dot{W}$ exists along the convergence trajectory.
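A minimal sketch of the batch LF-I velocity of Eq. (6), given a Jacobian `J` and error vector `ytilde` such as those built above (the function name and the value of `eps` are assumptions):

```python
import numpy as np

def lf1_direction(J, ytilde, eps=1e-8):
    """Batch LF-I weight velocity: Wdot = ||ytilde||^2 / (||J^T ytilde||^2 + eps) * J^T ytilde.
    Substituted into Eq. (4) this gives Vdot = -||ytilde||^2 ||J^T ytilde||^2 / (||J^T ytilde||^2 + eps) <= 0."""
    g = J.T @ ytilde                                   # J^T ytilde
    return (ytilde @ ytilde) / (g @ g + eps) * g
```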

10 Proof of LF-I Algorithm
Proof. Substitution of Eq. (6) into Eq. (4) yields
$$\dot{V}_1 = -\frac{\|\tilde{y}\|^2}{\|J^T \tilde{y}\|^2 + \epsilon}\, \|J^T \tilde{y}\|^2 \le 0 \qquad (7)$$
where $\dot{V}_1 < 0$ for all $\tilde{y} \ne 0$. If $\dot{V}_1$ is uniformly continuous and bounded, then, according to Barbalat's lemma, as $t \to \infty$, $\dot{V}_1 \to 0$ and $\tilde{y} \to 0$.

11 LF-I Algorithm: contd...
The weight update law above is a batch update law. The instantaneous LF I learning algorithm can be derived as
$$\dot{W} = \frac{\tilde{y}^2}{\|J_i^T \tilde{y}\|^2}\, J_i^T \tilde{y} \qquad (8)$$
where $\tilde{y} = y_p - \hat{y}_p \in \mathbb{R}$ and $J_i = \frac{\partial \hat{y}_p}{\partial W} \in \mathbb{R}^{1 \times M}$.
The difference-equation form of the weight update is
$$\hat{W}(t+1) = \hat{W}(t) + \mu \dot{W}(t) \qquad (9)$$
Here $\mu$ is a constant.
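An illustrative per-pattern version of Eqs. (8) and (9) follows; this is a sketch, and the small `guard` term in the denominator is a numerical safeguard added here, not part of Eq. (8).

```python
import numpy as np

def lf1_step(W, J_i, e, mu, guard=1e-12):
    """One instantaneous LF-I update for a single pattern p.
    J_i : (M,) row of the Jacobian d y_hat_p / d W;  e : scalar error y_p - y_hat_p.
    Implements W(t+1) = W(t) + mu * (e^2 / ||J_i^T e||^2) * J_i^T e."""
    g = J_i * e                                        # J_i^T ytilde (ytilde is scalar here)
    return W + mu * (e ** 2) / (g @ g + guard) * g
```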

12 Comparison with the BP Algorithm
In the gradient-descent method we have
$$\Delta W = -\eta \frac{\partial E}{\partial W} = \eta J_i^T \tilde{y}, \qquad \hat{W}(t+1) = \hat{W}(t) + \eta J_i^T \tilde{y} \qquad (10)$$
The update equation for the LF-I algorithm:
$$\hat{W}(t+1) = \hat{W}(t) + \left( \mu \frac{\tilde{y}^2}{\|J_i^T \tilde{y}\|^2} \right) J_i^T \tilde{y}$$
Comparing the two equations, we find that the fixed learning rate $\eta$ of the BP algorithm is replaced by its adaptive version $\eta_a$:
$$\eta_a = \mu \frac{\tilde{y}^2}{\|J_i^T \tilde{y}\|^2} \qquad (11)$$
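The identification can be checked numerically: the LF-I increment equals a BP-shaped step whose learning rate is the $\eta_a$ of Eq. (11). The numbers below are arbitrary illustrative values.

```python
import numpy as np

J_i = np.array([0.3, -1.2, 0.5])          # pattern Jacobian (illustrative values)
e, mu = 0.8, 0.55                         # scalar error y_p - y_hat_p, step constant

g = J_i * e                               # J_i^T ytilde
eta_a = mu * e**2 / (g @ g)               # adaptive learning rate, Eq. (11)
delta_W_bp_form = eta_a * g               # BP-shaped step using eta_a
delta_W_lf1 = mu * e**2 / (g @ g) * g     # LF-I increment from Eqs. (8)-(9)
print(np.allclose(delta_W_bp_form, delta_W_lf1))   # True: the two updates coincide
```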

13 Adaptive Learning Rate of LF-I
Figure: Adaptive learning rate of LF-I for the XOR problem, plotted against the number of iterations (4 x number of epochs).
The learning rate is not fixed, unlike in the BP algorithm.
The learning rate goes to zero as the error goes to zero.

14 Convergence of LF-I
The theorem states that global convergence of LF-I is guaranteed provided $\dot{W}$ exists along the convergence trajectory. This, in turn, requires $\frac{\partial V_1}{\partial W} = -J^T \tilde{y} \ne 0$.
$\frac{\partial V_1}{\partial W} = 0$ indicates a local minimum of the error function.
Thus, the theorem only says that the global minimum is reached when local minima are avoided during training. Since the instantaneous update rule introduces noise, it may be possible to reach the global minimum in some cases; however, global convergence is not guaranteed.

15 LF II Algorithm
We consider the following Lyapunov function:
$$V_2 = \frac{1}{2}\left( \tilde{y}^T \tilde{y} + \lambda \dot{W}^T \dot{W} \right) = V_1 + \frac{\lambda}{2} \dot{W}^T \dot{W} \qquad (12)$$
where $\lambda$ is a positive constant. Its time derivative is
$$\dot{V}_2 = -\tilde{y}^T \frac{\partial \hat{y}}{\partial W} \dot{W} + \lambda \ddot{W}^T \dot{W} = -\tilde{y}^T (J - D) \dot{W} \qquad (13)$$
where $J = \frac{\partial \hat{y}}{\partial W} \in \mathbb{R}^{N \times m}$ is the Jacobian matrix and $D = \lambda \frac{1}{\|\tilde{y}\|^2}\, \tilde{y}\, \ddot{W}^T \in \mathbb{R}^{N \times m}$.

16 LF II Algorithm: contd...
Theorem 2. If the update law for the weight vector $W$ follows the dynamics given by the nonlinear differential equation
$$\dot{W} = \alpha(W) J^T \tilde{y} - \lambda\, \alpha(W) \ddot{W} \qquad (14)$$
where $\alpha(W) = \frac{\|\tilde{y}\|^2}{\|J^T \tilde{y}\|^2 + \epsilon}$ is a scalar function of the weight vector $W$ and $\epsilon$ is a small positive constant, then $\tilde{y}$ converges to zero under the condition that $(J - D)^T \tilde{y}$ is non-zero along the convergence trajectory.

17 Proof of LF II algorithm
Proof. The update law $\dot{W} = \alpha(W) J^T \tilde{y} - \lambda\,\alpha(W) \ddot{W}$ may be rewritten as
$$\dot{W} = \frac{\|\tilde{y}\|^2}{\|J^T \tilde{y}\|^2 + \epsilon}\, (J - D)^T \tilde{y} \qquad (15)$$
Substituting this expression for $\dot{W}$ into $\dot{V}_2 = -\tilde{y}^T (J - D) \dot{W}$, we get
$$\dot{V}_2 = -\frac{\|\tilde{y}\|^2}{\|J^T \tilde{y}\|^2 + \epsilon}\, \|(J - D)^T \tilde{y}\|^2 \le 0 \qquad (16)$$
Since $(J - D)^T \tilde{y}$ is non-zero, $\dot{V}_2 < 0$ for all $\tilde{y} \ne 0$ and $\dot{V}_2 = 0$ iff $\tilde{y} = 0$. If $\dot{V}_2$ is uniformly continuous and bounded, then, according to Barbalat's lemma, as $t \to \infty$, $\dot{V}_2 \to 0$ and $\tilde{y} \to 0$.

18 Proof of LF II algorithm: contd...
The instantaneous weight update equation for the LF II algorithm can finally be expressed in difference-equation form as
$$W(t+1) = W(t) + \mu \frac{\tilde{y}^2}{\|J^{pT} \tilde{y}\|^2 + \epsilon}\, (J^p - D)^T \tilde{y}
        = W(t) + \mu \frac{\tilde{y}^2}{\|J^{pT} \tilde{y}\|^2 + \epsilon}\, J^{pT} \tilde{y} - \frac{\mu_1}{\|J^{pT} \tilde{y}\|^2 + \epsilon}\, \ddot{W}(t) \qquad (17)$$
where $\mu_1 = \mu \lambda$ and the acceleration $\ddot{W}(t)$ is computed as
$$\ddot{W}(t) = \frac{1}{(\Delta t)^2} \left[ W(t) - 2 W(t-1) + W(t-2) \right]$$
with $\Delta t$ taken to be one time unit in the simulations.
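A sketch of one instantaneous LF-II step per Eq. (17), keeping the last three weight vectors so that the acceleration $\ddot{W}(t)$ can be formed by finite differences; the function and variable names are assumptions, and $\Delta t = 1$ as in the lecture's simulations.

```python
import numpy as np

def lf2_step(W_hist, J_p, e, mu, lam, eps=1e-8):
    """One instantaneous LF-II update, following Eq. (17).
    W_hist : [W(t-2), W(t-1), W(t)];  J_p : (M,) pattern Jacobian;  e : scalar error."""
    W_tm2, W_tm1, W_t = W_hist
    Wddot = W_t - 2.0 * W_tm1 + W_tm2              # acceleration with (Delta t)^2 = 1
    g = J_p * e                                    # J_p^T ytilde
    denom = g @ g + eps
    return W_t + mu * (e ** 2) / denom * g - (mu * lam) / denom * Wddot

# Usage: roll a three-element weight history forward each step.
W_hist = [np.zeros(3), np.zeros(3), np.zeros(3)]
J_p, e = np.array([0.3, -1.2, 0.5]), 0.8
W_next = lf2_step(W_hist, J_p, e, mu=0.65, lam=0.015)
W_hist = [W_hist[1], W_hist[2], W_next]
```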

19 Comparison with the BP Algorithm
Applying gradient descent to $V_2 = V_1 + \frac{\lambda}{2} \dot{W}^T \dot{W}$:
$$\Delta W = -\eta \left( \frac{\partial V_2}{\partial W} \right)^T
          = -\eta \left( \frac{\partial V_1}{\partial W} \right)^T - \eta \left[ \frac{d}{dW} \left( \frac{\lambda}{2} \dot{W}^T \dot{W} \right) \right]^T
          = \eta \left( \frac{\partial \hat{y}}{\partial W} \right)^T \tilde{y} - \eta \lambda \ddot{W}$$
Thus, the weight update equation for the gradient-descent method may be written as
$$W(t+1) = W(t) + \eta J^{pT} \tilde{y} - \mu \ddot{W} \qquad (18)$$
where the last term is the acceleration term.

20 Adaptive learning rate and adaptive acceleration
Comparing the two update laws, the adaptive learning rate in this case is given by
$$\eta_a = \mu \frac{\tilde{y}^2}{\|J^{pT} \tilde{y}\|^2 + \epsilon} \qquad (19)$$
and the adaptive acceleration rate is given by
$$\mu_a = \frac{\lambda}{\|J^{pT} \tilde{y}\|^2 + \epsilon} \qquad (20)$$

21 Convergence of LF II
The global minimum of $V_2$ is given by $\tilde{y} = 0$, $\dot{W} = 0$ ($\tilde{y} \in \mathbb{R}^n$, $W \in \mathbb{R}^m$).
The global minimum can be reached provided $\dot{W}$ does not vanish along the convergence trajectory.
Analyzing the local-minima conditions, $\dot{W}$ vanishes under the following conditions.
1. First condition: $J = D$ ($J, D \in \mathbb{R}^{n \times m}$). In the case of neural networks it is very unlikely that each element of $J$ would be equal to the corresponding element of $D$, so this possibility can easily be ruled out for a multi-layer perceptron network.

22 Convergence of LF II: contd...
2. Second condition: $\dot{W}$ vanishes whenever $(J - D)^T \tilde{y} = 0$. Assuming $J \ne D$, rank $\rho(J - D) = n$ ensures global convergence.
3. Third condition: $J^T \tilde{y} = D^T \tilde{y} = \lambda \ddot{W}$. Solutions of this equation represent local minima. A solution exists for every vector $\ddot{W} \in \mathbb{R}^m$ whenever rank $\rho(J) = m$.

23 Convergence of LF II: contd...
For a neural network, $n \le m$ and $\rho(J) \le n$. Hence there are at least $m - n$ vectors $\ddot{W} \in \mathbb{R}^m$ for which solutions do not exist, and hence local minima do not occur.
Thus, by increasing the number of hidden layers or hidden neurons (i.e., increasing $m$), the chance of encountering local minima can be reduced.
Increasing the number of output neurons increases both $m$ and $n$, as well as $n/m$. Thus, for MIMO systems there are more local minima (for a fixed number of weights) than for single-output systems.

24 Avoiding local minima
Figure: The error $V_1$ plotted against $W$, showing a local minimum and the global minimum; the points C, B, A, D mark the weights at times $t-2$, $t-1$, $t$, $t+1$ (i.e. up to $W(t+1)$), with point A at the local minimum.

25 Avoiding local minima: contd...
Rewrite the update law for LF-II as
$$W(t+1) = W(t) + \Delta W(t+1) = W(t) - \eta \frac{\partial V_1}{\partial W}(t) - \mu \ddot{W}(t)$$
Consider point B (at time $t-1$). The weight update for the interval $(t-1, t]$, computed at this instant, is $\Delta W(t) = \Delta W_1(t-1) + \Delta W_2(t-1)$, where
$$\Delta W_1(t-1) = -\eta \frac{\partial V_1}{\partial W}(t-1) > 0$$
$$\Delta W_2(t-1) = -\mu \ddot{W}(t-1) = -\mu\left( \Delta W(t-1) - \Delta W(t-2) \right) > 0$$
It is to be noted that $\Delta W(t-1) < \Delta W(t-2)$, since the velocity decreases as the weight approaches the local minimum. Hence $\Delta W(t) > 0$ and the speed increases.

26 Avoiding local minima: contd...
Consider point A (at time $t$). The weight increments are
$$\Delta W_1(t) = -\eta \frac{\partial V_1}{\partial W}(t) = 0$$
$$\Delta W_2(t) = -\mu \ddot{W}(t) = -\mu\left( \Delta W(t) - \Delta W(t-1) \right) > 0 \quad \text{since } \Delta W(t) < \Delta W(t-1)$$
$$\Delta W(t+1) = \Delta W_1(t) + \Delta W_2(t) > 0$$
This helps in avoiding the local minimum.

27 Avoiding local minima: contd...
Consider point D (at instant $t+1$). The weight contributions are
$$\Delta W_1(t+1) = -\eta \frac{\partial V_1}{\partial W}(t+1) < 0$$
$$\Delta W_2(t+1) = -\mu \ddot{W}(t+1) = -\mu\left( \Delta W(t+1) - \Delta W(t) \right) > 0$$
The contribution of the BP term becomes negative because the slope $\frac{\partial V_1}{\partial W} > 0$ on the right-hand side of the local minimum, while $\Delta W(t+1) < \Delta W(t)$.
$$\Delta W(t+2) = \Delta W_1(t+1) + \Delta W_2(t+1) > 0 \quad \text{if } \Delta W_2(t+1) > |\Delta W_1(t+1)|$$
Thus it is possible to avoid local minima by properly choosing $\mu$.
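A one-dimensional toy sketch of the mechanism above: it iterates $W(t+1) = W(t) - \eta\, \partial V_1/\partial W(t) - \mu \ddot{W}(t)$ on an assumed double-well cost. The cost, starting point, and parameter values are all illustrative; whether the acceleration term actually carries the iterate past the local minimum depends on the choice of $\mu$, which is exactly the point of the argument above.

```python
import numpy as np

def V1(w):  return 0.25 * w**4 - 0.5 * w**2 + 0.2 * w   # toy cost: local min near w ~ 0.88, global min near w ~ -1.09
def dV1(w): return w**3 - w + 0.2

def run(eta, mu, w0=2.0, steps=300):
    """Iterate W(t+1) = W(t) - eta*dV1(W(t)) - mu*Wddot(t), with Wddot(t) = dW(t) - dW(t-1)."""
    w_prev2 = w_prev1 = w = w0
    for _ in range(steps):
        wddot = (w - w_prev1) - (w_prev1 - w_prev2)     # finite-difference acceleration
        w_new = w - eta * dV1(w) - mu * wddot
        w_prev2, w_prev1, w = w_prev1, w, w_new
    return w

print("plain GD (mu=0) ends at w =", run(eta=0.05, mu=0.0))   # settles in the nearer minimum
print("with -mu*Wddot  ends at w =", run(eta=0.05, mu=0.3))   # behaviour depends on mu; try other values
```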

28 Simulation results - LF-I vs LF-II: XOR
Figure 3: Performance comparison for XOR (training epochs vs. runs); LF I (λ = 0.0, µ = 0.55), LF II (λ = 0.015, µ = 0.65).
Observation: LF II provides a tangible improvement over LF I, both in convergence time and in training epochs.

29 LF I vs LF II: 3-bit parity
Figure 4: Performance comparison for 3-bit parity (training epochs vs. runs); LF I (λ = 0.0, µ = 0.47), LF II (λ = 0.03, µ = 0.47).
Observation: LF II performs better than LF I, both in computation time and in training epochs.

30 LF I vs LF II: 8-3 Encoder
Figure 5: Performance comparison for the 8-3 encoder (training epochs vs. runs); LF I (λ = 0.0, µ = 0.46), LF II (λ = 0.01, µ = 0.465).
Observation: LF II takes the minimum number of epochs in most of the runs.

31 LF I vs LF II: 2D Gabor function
Figure 6: Performance comparison for the 2D Gabor function (rms training error vs. iterations over the training data points); LF I (µ = 0.8, λ = 0.0), LF II (µ = 0.8, λ = 0.6).
Observation: With increasing iterations, the performance of LF II improves relative to LF I.

32 Simulation Results - Comparison: contd...
XOR
Algorithm   epochs   time (sec)   parameters
BP          -        -            η = 0.5
BP          -        -            η = 0.95
EKF         -        -            λ = 0.9
LF-I        -        -            µ = 0.55
LF-II       -        -            µ = 0.65, λ = 0.01

33 Comparison among BP, EKF and LF-II
Figure: Convergence time (seconds) versus run for BP, EKF and LF-II.
Observation: LF takes almost the same time for any arbitrary initial condition.

34 Comparison among BP, EKF and LF: contd...
3-bit Parity
Algorithm   epochs   time (sec)   parameters
BP          -        -            η = 0.5
BP          -        -            η = 0.95
EKF         -        -            λ = 0.9
LF-I        -        -            µ = 0.47
LF-II       -        -            µ = 0.47, λ = 0.03

35 Comparison among BP, EKF and LF: contd...
8-3 Encoder
Algorithm   epochs   time (sec)   parameters
BP          -        -            η = 0.7
BP          -        -            η = 0.9
LF-I        -        -            µ = 0.46
LF-II       -        -            µ = 0.465, λ = 0.01

36 Comparison among BP, EKF and LF: contd...
2D Gabor function
Algorithm   No. of centers   rms error/run   parameters
BP          -                -               η_{1,2} = 0.2
BP          -                -               η_{1,2} = 0.2
LF-I        -                -               µ = 0.8
LF-II       -                -               µ = 0.8, λ = 0.3

37 Discussion
Global convergence of Lyapunov-based learning algorithms. Consider the following Lyapunov function candidate:
$$V_2 = \mu V_1 + \frac{\sigma}{2} \left\| \frac{\partial V_1}{\partial W} \right\|^2, \qquad \text{where } V_1 = \frac{1}{2} \tilde{y}^T \tilde{y} \qquad (21)$$
The objective is to select a weight update law $\dot{W}$ such that the global minimum ($V_1 = 0$ and $\frac{\partial V_1}{\partial W} = 0$) is reached.
The time derivative of the Lyapunov function $V_2$ is
$$\dot{V}_2 = \frac{\partial V_1}{\partial W} \left[ \mu I + \sigma \frac{\partial^2 V_1}{\partial W \partial W^T} \right] \dot{W} \qquad (22)$$

38 If the weight update law is selected as
$$\dot{W} = -\left[ \mu I + \sigma \frac{\partial^2 V_1}{\partial W \partial W^T} \right]^{-1} \frac{\left( \frac{\partial V_1}{\partial W} \right)^T}{\left\| \frac{\partial V_1}{\partial W} \right\|^2} \left( \zeta \left\| \frac{\partial V_1}{\partial W} \right\|^2 + \eta V_1^2 \right) \qquad (23)$$
with $\zeta > 0$ and $\eta > 0$, then
$$\dot{V}_2 = -\zeta \left\| \frac{\partial V_1}{\partial W} \right\|^2 - \eta V_1^2 \qquad (24)$$
which is negative definite with respect to $V_1$ and $\frac{\partial V_1}{\partial W}$. Thus $V_2$ will finally converge to its equilibrium point, given by $V_1 = 0$ and $\left( \frac{\partial V_1}{\partial W} \right)^T = 0$.

39 But the implementation of this weight update algorithm becomes very difficult due to the presence of the Hessian term $\frac{\partial^2 V_1}{\partial W \partial W^T}$. Thus, the above algorithm is of theoretical interest.
The above weight update algorithm is similar to the BP learning algorithm with a fixed learning rate.

40 Conclusion
LF algorithms perform better than both the EKF and BP algorithms in terms of speed and accuracy.
LF II avoids local minima to a greater extent than LF I.
It is seen that, by choosing a proper network architecture, it is possible to reach the global minimum.
The LF-I algorithm has an interesting parallel with the conventional BP algorithm, in which the fixed learning rate of BP is replaced by an adaptive learning rate.
