4 damped (modified) Newton methods

4.1 damped Newton method

Exercise 4.1 Determine with the damped Newton method the unique real zero x* of the real-valued function of one variable f(x) = x^3 + x - 2, using the step size rule t_k = 1 + |x_k - 1|^{1/2} and the given starting point x_0 (a small numerical sketch follows after Experiment 4.1).

1. Show that the sequence of iterates x_k satisfies, for some C_2 > C_1 > 0, some ε > 0 and all x_k with |x_k - 1| ≤ ε, the estimate

  C_1 |x_k - 1|^{3/2} ≤ |x_{k+1} - 1| ≤ C_2 |x_k - 1|^{3/2}

(Q-convergence of order 3/2: faster than linear, slower than quadratic).

2. How is the step size to be chosen in order to achieve Q-convergence of order r = 1 + 1/n?

Exercise 4.2 To the strictly convex quadratic function of n variables f(x) = (1/2) x^T Q x + q^T x, Q ∈ SPD_n, a damped Newton method is applied with scaled Armijo rule step size and secant parameter α ∈ (0, 1/2) (because of the uniform convexity of f the negative gradient direction need not be used). The starting scaled test step size is given by

  t̄ = max{ 1, -∇f(x^k)^T d^k / ((d^k)^T d^k) },

where d^k satisfies the Newton equation Q d^k = -∇f(x^k).

1. Show that this test step size satisfies the Armijo condition at x^k w.r.t. the direction d^k whenever 1 ≤ t̄ < 2(1 - α) holds.

2. Show that the above damped Newton method is only Q-linearly convergent in the case 1 < t̄ < 2(1 - α), with convergence rate C = t̄ - 1.

3. Determine for the function no. 4 of two variables, f(x) = (1/4)(x_1 - 5)^2 + (x_2 - 6)^2, all starting points for which the damped Newton method converges in one step, i.e. for which the above parameter t̄ = 1.

Exercise 4.3 Prove: Suppose that the function f ∈ C^2(D, R) with open set D and W := N_f(f(x^0)) ⊂ D satisfies on this level set W the (m, M) condition

  m ||d||^2 ≤ d^T H_f(x) d ≤ M ||d||^2   for all x ∈ W and d ∈ R^n.

Then for each sequence of points x^k ∈ W the assigned sequence of Newton directions d^k := -H_f(x^k)^{-1} ∇f(x^k) is strictly gradient-like (sgl).

Experiment 4.1 Start the file modnewton01. It is a long-time experiment for the damped Newton method (method no. 3) using the linesearches from 1.0 to 8.1. For 20 stochastic starting points it needs about 5 minutes. You find the result file at C:\temp\<user>\results\modNewton01.txt
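The following minimal MATLAB sketch (not part of EdOptLab) implements the iteration of Exercise 4.1; the starting point x_0 = 2 and the number of iterations are chosen here only for illustration. The printed ratios |x_{k+1} - 1| / |x_k - 1|^{3/2} should stay bounded between two positive constants, which makes the Q-order 3/2 visible.

  % damped Newton method for f(x) = x^3 + x - 2 with step size rule t_k = 1 + sqrt(|x_k - 1|)
  f  = @(x) x.^3 + x - 2;          % unique real zero at x* = 1
  df = @(x) 3*x.^2 + 1;            % derivative f'
  x  = 2;                          % illustrative starting point (assumption)
  for k = 1:20
      e_old = abs(x - 1);          % current error |x_k - 1|
      t = 1 + sqrt(e_old);         % step size rule of Exercise 4.1
      x = x - t*f(x)/df(x);        % damped Newton step
      e_new = abs(x - 1);
      if e_old > 0 && e_new > 0
          fprintf('k=%2d  error=%.3e  ratio=%.3f\n', k, e_new, e_new/e_old^1.5);
      end
  end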

Questions on Experiment 4.1:

1. What kind of linesearches is bad (see also Exercise 4.2 for the reasons)?

2. What kind of linesearches should be preferred (see also the convergence theorem and the subsequent statement)?

3. When is the damped Newton method only linearly convergent?

Experiment 4.2 Start modnewton0201.m

  {'linsmode',1.0,'tol',1e-8,'x0',[7,11]});

It gives an answer to question 1 of Experiment 4.1 for the strictly convex quadratic function no. 4. Which theoretical length t̄ does the starting step size have in this case? See item 1 of Exercise 4.2. Is the numerical value for t_k in the iteration table the same? We used α = 10^{-4} (default value). Repeat the experiment (second command) with the value α = 0.2 by using a second coordinate of linsmode:

  {'linsmode',[1.0,0.2],'tol',1e-8,'x0',[7,11]});

Now the condition in item 1 of Exercise 4.2 is not satisfied. In spite of this the method needs more than one step, why?

Experiment 4.3 Start modnewton0202.m. linsmode=4.1 (Armijo backtracking with t=1 preference) uses t = 1 as the first test step. However, if d^k is too large, the linesearch algorithm reduces its length. Since by the theory d^k → 0, this can happen only finitely many times. If the length is not reduced in the first trial, the minimum is attained in this full Newton step.

  {'linsmode',4.1,'tol',1e-8,'x0',[6,11]});

linsmode=5.1 (Powell/Wolfe backtracking with t=1 preference) applies the same reduction and therefore shows the same effect.

  {'linsmode',5.1,'tol',1e-8,'x0',[6,11]});

Experiment 4.4 Start modnewton0203.m.

  {'linsmode',6.0,'tol',1e-8,'x0',[6,11]});

Golden section (GS) can be used to approximate the perfect step size with a given relative error; we use the default relative error. The first interval [0, λ̄] for the application of GS is found by the Armijo rule (which gives λ) and some subsequent extension to a λ̄ ≥ λ such that φ(0) > φ(λ) and φ(λ̄) ≥ φ(λ) hold (otherwise φ has no local minimum on the positive real axis); see goldsection.m. Although the step length differs (very little!!) from 1, we need more than one step.
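The golden section principle used by linsmode=6.0 can be sketched as follows (a generic routine on φ(t) = f(x^k + t d^k), assuming a bracket [0, λ̄] as described above has already been found; this is only an illustration, not the EdOptLab routine goldsection.m). Saved e.g. as goldsec.m, it could be called as t = goldsec(@(t) fun(xk + t*dk), 0, lambar, 1e-2), where fun, xk, dk and lambar are hypothetical variables.

  % golden section search for a minimizer of phi on [a,b] up to a relative error reltol
  function t = goldsec(phi, a, b, reltol)
      rho = (sqrt(5) - 1)/2;                 % golden ratio constant, about 0.618
      c = b - rho*(b - a);  d = a + rho*(b - a);
      fc = phi(c);  fd = phi(d);
      while (b - a) > reltol*max(1, abs(b))  % stop at the prescribed relative accuracy
          if fc < fd                         % a minimizer lies in [a, d]
              b = d;  d = c;  fd = fc;
              c = b - rho*(b - a);  fc = phi(c);
          else                               % a minimizer lies in [c, b]
              a = c;  c = d;  fc = fd;
              d = a + rho*(b - a);  fd = phi(d);
          end
      end
      t = (a + b)/2;                         % approximate perfect step size
  end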

Moreover, since the step length does not converge to one, we have only Q-linear convergence. We repeat the experiment with fminbnd (a combination of GS and quadratic interpolation, linsmode=[6.0,1e-2,1], i.e. the 3rd coordinate is set) instead of pure GS. The first quadratic interpolation already yields the perfect step length for a quadratic function, hence the damped Newton method needs only one step.

  {'linsmode',[6.0,1e-2,1],'tol',1e-8,'x0',[6,11]});

If we choose the relative error smaller than the tolerance tol of the damped Newton method, then again the minimum is achieved, as expected, in one step. But it takes a lot of function value computations. Attention: do not use a 3rd coordinate of linsmode here!!

  {'linsmode',[6.0,1e-10],'tol',1e-8,'x0',[6,11]});

Experiment 4.5 Start modnewton0204.m. The backtracking linesearch (linsmode=4.1) can attain the lower backtracking safeguard 0.1 over a lot of steps, namely whenever t = 1 does not satisfy the Armijo rule (AR), the first quadratic interpolation step is smaller than 0.1, but 0.1 satisfies the AR (a generic sketch of this mechanism follows after Experiment 4.7). Hence the damped Newton method can behave over a long time like an algorithm with constant step length smaller than one. Here is an example with the 2-dimensional banana function.

  his=psolve(1,3,{'picinfo',12,'holdon',1},{'linsmode',4.1,'tol',1e-8});

The backtracking Powell/Wolfe linesearch (linsmode=5.1) shows this effect only seldom because of the additional lower bound from the twisted tangent condition. Hence it is faster.

  his=psolve(1,3,{'picinfo',12,'holdon',1},{'linsmode',5.1,'tol',1e-8});

Experiment 4.6 Start modnewton0205.m. The best result (for this example) for the damped Newton method is generated by a linesearch with final interpolation and use of the step with the smallest function value. This is done by linsmode = 3.0 (scaled) or linsmode = 3.1 (starting value 1). Similar but more complicated linesearches of this kind can be found in Spellucci. Sometimes LS 3.1 can, because of the interpolation, also produce longer steps in the starting phase of the damped Newton method, which yields an acceleration.

  his=psolve(1,3,{'picinfo',12,'holdon',1},{'linsmode',3.0,'tol',1e-8});
  his=psolve(1,3,{'picinfo',12,'holdon',1},{'linsmode',3.1,'tol',1e-8});

Experiment 4.7 Start modnewton0206.m. In case of a non-regular Hessian at the final point the linesearch linsmode=3.0 has advantages, and the damped Newton method can be faster than the (local) Newton method. Because of step lengths longer than 1 the damped Newton method avoids the eigenvector direction for the eigenvalue zero, which is present for the Murphy function no. 9. We compare both methods.

  his=psolve(9,[1,3],{'picinfo',12,'holdon',1},{'linsmode',3.0,'tol',1e-8});
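A generic sketch of the safeguarded backtracking discussed in Experiment 4.5: if the full step t = 1 fails the Armijo test, a quadratic interpolation proposal is computed and clipped to a safeguard interval. The lower safeguard factor 0.1 is taken from the text above; the upper factor 0.5 and the details of the loop are assumptions, so the EdOptLab routine may differ. Here phi(t) = f(x^k + t d^k) and dphi0 = ∇f(x^k)^T d^k < 0.

  % Armijo backtracking with safeguarded quadratic interpolation (sketch)
  function t = armijo_backtrack(phi, dphi0, alpha)
      t = 1;  phi0 = phi(0);  phit = phi(t);     % try the full Newton step first
      while phit > phi0 + alpha*t*dphi0          % Armijo condition violated
          % minimizer of the quadratic model through phi(0), phi'(0) = dphi0, phi(t)
          tq = -dphi0*t^2 / (2*(phit - phi0 - dphi0*t));
          t  = min(max(tq, 0.1*t), 0.5*t);       % safeguard: keep the new t in [0.1*t, 0.5*t]
          phit = phi(t);
      end
  end

If the interpolated proposal tq repeatedly falls below the lower safeguard while the safeguarded value already satisfies the Armijo rule, the accepted step length stays at 0.1 over many outer iterations, which is exactly the constant-step behavior described in Experiment 4.5.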

Experiment 4.8 The damped Newton method can run as steepest descent over a large number of steps whenever a point with a negative eigenvalue of the Hessian is reached (please compute the eigenvalues of the Hessian at this point). This may happen for all implemented linesearches, less often for linesearches with final interpolation. We study the banana function in higher dimensions. Start modnewton03.m. We use commands of the kind (only short info, no picture, for 2.-9.)

  his=psolve(50,[1,3],{'picinfo',0,'txtinfo',1},...
    {'linsmode',linsmode,'tol',1e-8,'maxit',maxit},{dim});

with dim as below and maxit large enough.

1. Banana, dim = 20: steepest descent steps among the 250 iterations.

2. Banana, dim = 20: steepest descent steps among the 39 iterations (by chance faster (?) than the local Newton method, but study the costs and the CPU time!).

3. Banana, dim = 10: steepest descent steps among the 803 iterations.

The (local) Newton method needs for the same starting points in case of dim = 20 and dim = 10 only 46 resp. 34 iterations. The replacement of the negative gradient direction by a stochastic cone direction (nearly in the half space of descent directions, set via the option tau0, 5 attempts) yields no significantly better results, here for linsmode = 1.1 and dim = 10. We use the command

  his=psolve(50,[1,3],{'picinfo',0,'txtinfo',1},...
    {'linsmode',1.1,'tol',1e-8,'maxit',1000,'tau0',[pi/2.1]},{10});

4.2 damped modified Newton method

Exercise 4.4 Prove: Let Q ∈ R^{n×n} be a symmetric indefinite matrix and D ∈ R^{n×n} a positive definite diagonal matrix. Then there is µ_0 > 0 such that the matrix Q + µ_0 D is positive semidefinite (one eigenvalue is zero) and Q + µD ∈ SPD_n for all µ > µ_0 (a small numerical illustration is given below).

Exercise 4.5 Prove: Suppose that the function f ∈ C^2(D, R) with open set D. Then for each bounded sequence of points x^k ∈ D with assigned sequence of regularized Hessians H_k the arising sequence of modified Newton directions d^k := -H_k^{-1} ∇f(x^k) is strictly gradient-like (sgl).

Experiment 4.9 Start modnewton04, be aware that m = 10 (cf. row 10). We shorten a long-time experiment (originally 250 stochastically chosen starting points) and compare iterations and function value computations for the damped modified Newton method applied to the banana function of dimensions 10, 15, 20. Its behavior is much more comfortable than that of the formerly considered damped version (compare with the number of iterations for the 2 (!!!) dimensional banana function in Experiment 4.1).
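As a numerical illustration of Exercise 4.4 (the matrices below are arbitrary assumptions, not taken from EdOptLab): the critical value µ_0 equals minus the smallest generalized eigenvalue of the pencil (Q, D), and Q + µD is positive definite for every µ > µ_0.

  % illustration of Exercise 4.4 with an arbitrary indefinite Q and a diagonal D > 0
  Q  = [2 3 0; 3 -1 1; 0 1 4];          % symmetric indefinite matrix (assumption)
  D  = diag([1 2 3]);                   % positive definite diagonal matrix (assumption)
  mu0 = -min(eig(Q, D));                % Q + mu*D is singular exactly at mu = mu0
  disp(eig(Q + mu0*D)')                 % smallest eigenvalue is (numerically) zero
  disp(eig(Q + (mu0 + 0.1)*D)')         % all eigenvalues are positive for mu > mu0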

The calls of Experiment 4.9 are executed according to the command

  hist=psolve(50,4,{'picinfo',0,'txtinfo',0},...
    {'linsmode',LS(i),'x0',xstart(j,:),'maxit',maxit,'tol',tol},{dim});

where LS=[1.0,1.1,2.0,2.1,3.0,3.1,4.1,5.1,6.0,7.0,7.1,8.0,8.1], dim=[10,15,20], j=1:m, and maxit = 1e+4 is far away from the really needed number of iterations. The result file can be studied in C:\temp\<user>\results\modNewton04a.txt

Experiment 4.10 Start modnewton06.m (for Matlab under a 32 bit processor) or modnewton06a.m (for Matlab under a 64 bit processor), which is based on a mat-file generated by a former run of modnewton05 (determination of points in which the Newton method is divergent). We consider the 20th starting point with divergence of the Newton method and study the damped modified Newton method with the call

  load modnewton0501 x0div   % load the x0div variable of the mat-file
  his=psolve(50,4,{'picinfo',2,'txtinfo',2,'holdon',1},...
    {'linsmode',3.1,'x0',x0div{3}(20,:),'tol',tol,'maxit',500},{20});

We consider it for tol = 1e-3, 1e-5, 1e-8. Up to tol = 1e-6 the Armijo (AR) condition can be correctly interpreted; for tol = 1e-3 it still behaves like a smooth function also in floating point arithmetic, and the method stops regularly with the gradient condition. For tol = 1e-8 the method stops with too small a step size, caused by the Armijo rule in LS 3.1, since the Armijo condition can no longer be correctly interpreted by the numerics. The following steps show that step length 1 is possible and yields a final point with a considerably better gradient norm of about 1e-13 (see the command window). Figure 2 illustrates the Armijo condition (left-hand and right-hand side) of a smooth function in floating point arithmetic under cancellation (loss of significant digits). Conclusion: see the lecture. Finally we run the local Newton method to show that it is not a substantial alternative.

4.3 nonmonotonic damped (modified) Newton method

Experiment 4.11 Start modnewton07: We use the nonmonotonic Armijo linesearch for the damped (modified) Newton method (method no. 3 (4)). The main command for this LS 9 is LS = 9.0xxs, where m = xx ∈ {1,...,99} determines the maximal number of steps to go back in the function value history and s ∈ {0,...,9} is the number of starting pure Armijo steps (a generic sketch of this acceptance test is given after the comparison commands below). We start, for the 50-dimensional banana function and m = 1:20 (i.e. 20 times), method 4 twice (modified damped Newton with nonmonotone AR and with the mixed linesearch) and method 3 once (damped Newton with nonmonotone AR), with the call

  his=psolve(50,method,{'picinfo',0,'txtinfo',0},...
    {'linsmode',LS,'tol',1e-8,'maxit',500},{50});

For comparison we start methods 3 and 4 with LS 3.1

  his=psolve(50,method,{'picinfo',0,'txtinfo',0},...
    {'linsmode',3.1,'tol',1e-8,'maxit',500},{50});

and the local Newton method (1)

  his=psolve(50,1,{'picinfo',0,'txtinfo',0},...
    {'tol',1e-8,'maxit',500},{50});
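A generic sketch of the nonmonotone Armijo acceptance test of LS 9: the trial value φ(t) is compared with the maximum over the last (at most) m stored function values instead of with f(x^k) alone. The backtracking factor 1/2 and the bookkeeping are assumptions; the EdOptLab implementation may differ in detail. Here again phi(t) = f(x^k + t d^k) and dphi0 = ∇f(x^k)^T d^k < 0.

  % nonmonotone Armijo backtracking; fhist holds the last (at most m) values f(x^j)
  function t = nonmon_armijo(phi, dphi0, fhist, alpha)
      fref = max(fhist);                     % reference value of the nonmonotone test
      t = 1;                                 % start with the full (modified) Newton step
      while phi(t) > fref + alpha*t*dphi0    % nonmonotone Armijo condition violated
          t = t/2;                           % simple backtracking (assumed factor 1/2)
      end
  end

After the accepted step, f(x^{k+1}) would be appended to the history and the oldest entry dropped once more than m values are stored, e.g. fhist = [fhist(max(1,end-m+2):end), f_new].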

The three runs give the following results:

1. damped modified Newton method (= 4), xx = 01:20, LS = 9.0xx1; result in c:\temp\<user>\results\modnewtond07_50_50_20_1nmd.txt

2. damped modified Newton method (= 4), xx = 01:20, LS = [9.0xx1,0,0,1] mixed with LS 3.1 during modifying; result in c:\temp\<user>\results\modnewtond07_50_50_20_1_lsmixednmd.txt

3. damped Newton method (= 3), xx = 01:20, LS = 9.0xx1; result in c:\temp\<user>\results\modnewtond07_50_50_20_1nd.txt

The best results (up to 40% fewer iterations) are achieved in case (1.), the damped modified Newton method with LS 9.0xx1. The mixed LS in case (2.), damped modified Newton (up to 10% fewer iterations), is not so successful in this example. The damped Newton method with the negative gradient as alternative (3.) is similar to (2.), but 30 steepest descent steps destroy the performance. Please repeat the experiment for the Wood function no. 15, the 2-dimensional banana function no. 1, and functions no. 2 and no. 14. For function no. 2 the nonmonotonic strategy is unexpectedly counterproductive.

Experiment 4.12 Start modnewton09: We investigate with problem no. 50 in higher dimension (dim = 50) how many steps with pure Armijo search at the beginning can be recommended. The literature says 5. Is this true for our implementation? The main command is

  his=psolve(50,4,{'picinfo',0,'txtinfo',0},...
    {'linsmode',9+m/1000+j/10000,'tol',1e-8,'maxit',250},{dim});

We use the information directly from the output his to construct the table and set m = 15 (maximal number of updates); j = 1:10 designates the number of pure Armijo steps at the beginning.

Experiment 4.13 Start modnewton10: We demonstrate that the nonmonotonic strategy can lead to stagnation for the damped regularized Newton method, as already known from the damped Newton method. In modnewton10a we investigate the reason for this behavior. The main call is

  his=psolve(50,method,{'picinfo',12,'txtinfo',-12},...
    {'linsmode',LS,'tol',1e-8,'maxit',250},{4});

with method = 3, 4 and LS = 9.0101, LS = 3.1. The IO-parameter option 'txtinfo',-12 causes all iterations to be printed only into the file C:/temp/<user>/results/NEWTONMD-LS P050-0.txt and not on the screen. Here in this example the mixed linesearch LS = [9.0101,0,0,1] (available only for method 4) overcomes this deteriorated situation, which is not always the case. Try to find an example. The pure linesearch LS = 3.1 for method = 4 runs to another stationary point. We check its kind.

Experiment 4.14 Start modnewton10a: The regularization takes place in an area where the gradient is very flat and the Hessian is indefinite. Eigenvalue 1 is about -1.5 and the other three are about 40. The regularization routine modelhess (Dennis/Schnabel) that is used overestimates the regularization parameter µ, hence all eigenvalues of the modified Hessian are larger than 40. Since the gradient coordinates with respect to the eigenvector basis are of order 1e-2, the direction d calculated with the modified Hessian is very small (even with step length λ = 1).
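A toy computation illustrating this effect, working directly in the eigenvector basis and using the identity instead of the scaling matrix of modelhess for simplicity; the eigenvalues (-1.5 and three times 40) and the gradient magnitude 1e-2 are the orders quoted above, while the concrete values of the two regularization parameters are assumptions.

  % effect of an overestimated regularization parameter mu on d = -(H + mu*I)^{-1} g
  H  = diag([-1.5 40 40 40]);        % Hessian in its eigenvector basis (magnitudes from the text)
  g  = 1e-2*ones(4,1);               % gradient coordinates of order 1e-2
  mu_over = 45;                      % overestimated parameter (assumption): all eigenvalues > 40
  mu_min  = 1.6;                     % barely sufficient regularization (assumption)
  d_over = -(H + mu_over*eye(4))\g;  % step of norm only a few times 1e-4
  d_min  = -(H + mu_min*eye(4))\g;   % step of norm about 0.1, dominated by the negative curvature direction
  fprintf('||d_over|| = %.1e,  ||d_min|| = %.1e\n', norm(d_over), norm(d_min));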

Since we use the same regularization program, the input data of this program changes very slowly. Hence the area with these bad properties cannot be left fast enough, and we get such small steps over a lot of iterations. Remark: The local Newton method (also with step length λ = 1!!) makes 30 times longer steps in the direction of the first eigenvector, but with ascent!

4.4 Newton method, Hessian with forward differences

Experiment 4.15 Start modnewton08a: The modification of the Hessian by use of second-order forward differences is counterproductive and destroys the superlinearity in examples with badly conditioned Hessians. The stopping tolerance should not be chosen too small in such a case, as is done in this example. Thereafter, study the main call with tol = 1e-8 and second-order forward differences:

  his=psolve(50,1,{'picinfo',2,'txtinfo',2,'holdon',1,'diffmode',3},...
    {'tol',tol,'maxit',50,'x0',0.9*ones(1,20)},{20});

What happens? modnewton08b: What is the numerical reason that the method cannot reach the stopping condition? We study the x^k and the d^k at the iterations k = 8, 9, 10, 11, 12 (in EdOptLab use his.x(k+1,:)). Attention, the directions d^k have to be computed externally. We use the call

  his=psolve(50,1,{'picinfo',2,'txtinfo',2,'holdon',1,'diffmode',diffmode},...
    {'tol',1e-13,'maxit',15,'x0',0.9*ones(1,20)},{20});

for diffmode = 3, 2, 1 and study the last 4 essential iterations. Then we compare the behavior with the other two diffmodes 2 and 1.

4.5 damped Gauss-Newton method

Continuation of the introduction: Because of

  (J(x^k)^T r(x^k))^T d^k = -(J(x^k)^T r(x^k))^T (J(x^k)^T J(x^k))^{-1} J(x^k)^T r(x^k) < 0

in case of rank J(x^k) = n and ∇F(x^k) = J(x^k)^T r(x^k) ≠ 0 outside of a stationary point, the Gauss-Newton direction

  d^k := -(J(x^k)^T J(x^k))^{-1} J(x^k)^T r(x^k)

is always a descent direction. The handicap of the Newton method in case of nonconvex problem functions cannot occur. In the following experiments we repeat the former experiments with the damped Gauss-Newton method (method = 2) and compare it in the respective experiments with the damped (modified) Newton method (method = 3 (4)). Use suitable linesearches according to the possibilities (e.g. linsmode = 1.0, 1.1, 2.0, 2.1, 3.0, 3.1, 4, 5, 7.0, 7.1, 8.0, 8.1).

Experiments:

Experiment 4.16 Start gsn1d.m: Consider problems 213 to 215, where the 3 points (t,y) = (1,2), (2,4), (3,y) should be approximated by the ansatz function h(x,t) = e^{xt}. We have y = 8, 1, 8 for problems 213, 214 and 215, respectively.
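A minimal damped Gauss-Newton sketch for this one-parameter fit (data of problem 213 as quoted above, with y = 8 for the third point; the starting value, the Armijo backtracking and the stopping test are assumptions, and this is not the EdOptLab routine gaussnewton):

  % damped Gauss-Newton for min_x F(x) = 0.5*||r(x)||^2 with r_i(x) = exp(x*t_i) - y_i
  t = [1; 2; 3];  y = [2; 4; 8];                 % data of problem 213 (third y-value taken as 8)
  r = @(x) exp(x*t) - y;                         % residual vector
  J = @(x) t.*exp(x*t);                          % Jacobian (a column vector, since n = 1)
  F = @(x) 0.5*(r(x)'*r(x));
  x = 1;  alpha = 1e-4;                          % starting point and Armijo parameter (assumptions)
  for k = 1:30
      g = J(x)'*r(x);                            % gradient of F
      if abs(g) < 1e-8, break; end               % gradient-based stopping test
      d = -(J(x)'*J(x))\g;                       % Gauss-Newton direction, always a descent direction
      lam = 1;
      while F(x + lam*d) > F(x) + alpha*lam*g*d  % Armijo backtracking
          lam = lam/2;
      end
      x = x + lam*d;
  end
  fprintf('x = %.6f,  F(x) = %.3e\n', x, F(x));

For problems 214 and 215 only the third data value y changes.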

Observe that the problem has only one dimension. Use the command (damped Gauss-Newton method)

  his = gaussnewton(213,2,{'picinfo',12,'txtinfo',3},...
    {'linsmode',3.1,'tol',1e-8,'maxit',200});

under EdOptLab and study the convergence properties for several line searches. Which line search is the best? For comparison, calculate the solution also with a simple Newton method for the necessary optimality condition.

Experiment 4.17 Start gsn1d.m: Repeat Experiment 4.16, except for the elementary calculation, for problems 216 to 218, where the points should be approximated by the ansatz function h(x,t) = x_1 e^{x_2 t}. This problem is two-dimensional (a sketch of the corresponding residual and Jacobian is given at the end of this section). Problem 216 uses the point set (t,y) with t ∈ {1,2,4,5,8} and y ∈ {3,4,6,11,20}. Problem 217 has the additional point (4.1, 46), and problem 218 has, w.r.t. 216, slightly disturbed y-values. Which line search is the best?

Experiment 4.18 Start gsn3d.m: Compare the performance of the damped Gauss-Newton method

  his = gaussnewton(201,2,{'picinfo',2,'txtinfo',3},...
    {'linsmode',3.1,'tol',1e-12})

for problem 201 and the damped (modified) Newton method

  his = PSOLVE(1,[3,4],{'picinfo',12,'txtinfo',3},...
    {'linsmode',3.1,'tol',1e-12})

for problem 1 using the standard starting point. Discuss the results for several linesearches. Also here we have problems with the expensive computing time of AD. It seems that for many components in the function vector the AD needs linearly more CPU time. Therefore make comparisons with symbolic differentiation (diffmode = 4 for Newton, diffmode = 5 for Gauss-Newton). However, here we have a long initialization time for the construction of the m-files.

Experiment 4.19 Start gsn4d.m: Compare the performance of the damped Gauss-Newton method

  his = gaussnewton(250,2,{'picinfo',2,'txtinfo',3},...
    {'linsmode',3.1,'tol',1e-12},{dim})

for problem 250 and the damped (modified) Newton method

  his = PSOLVE(50,[3,4],{'picinfo',12,'txtinfo',3},...
    {'linsmode',3.1,'tol',1e-12},{dim})

for problem 50 using the standard starting point. Discuss the results for several dimensions, say dim = 5, 10, 20, 50, 100, and for several linesearches. Here the time problem for AD becomes more serious with increasing dimension.
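For the two-parameter model of Experiment 4.17, the residual and the Jacobian could be set up as follows (a sketch only: it assumes the t- and y-values listed above are paired elementwise and uses an arbitrary starting point, which may differ from the actual EdOptLab problem definition). A damped loop as in the sketch after Experiment 4.16 could then be wrapped around this direction.

  % residual and Jacobian for h(x,t) = x1*exp(x2*t) fitted to paired data (t_i, y_i)
  t = [1; 2; 4; 5; 8];  y = [3; 4; 6; 11; 20];    % data as listed for problem 216 (pairing assumed)
  r = @(x) x(1)*exp(x(2)*t) - y;                  % residual vector r(x)
  J = @(x) [exp(x(2)*t), x(1)*t.*exp(x(2)*t)];    % Jacobian columns: dr/dx1 and dr/dx2
  x = [1; 0.3];                                   % illustrative starting point (assumption)
  d = -(J(x)'*J(x))\(J(x)'*r(x));                 % one Gauss-Newton direction at x
  disp(d')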
