Truncated Newton Method


- approximate Newton methods
- truncated Newton methods
- truncated Newton interior-point methods

EE364b, Stanford University

Newton's method
- minimize convex $f : \mathbf{R}^n \to \mathbf{R}$
- Newton step $\Delta x_{\mathrm{nt}}$ found from (SPD) Newton system
  $\nabla^2 f(x)\, \Delta x_{\mathrm{nt}} = -\nabla f(x)$
  using Cholesky factorization
- backtracking line search on function value $f(x)$ or norm of gradient $\|\nabla f(x)\|$
- stopping criterion based on Newton decrement $\lambda^2/2$, where $\lambda^2 = -\nabla f(x)^T \Delta x_{\mathrm{nt}}$, or on norm of gradient $\|\nabla f(x)\|$
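As a small illustration of the Newton system and the Newton decrement, the sketch below solves $\nabla^2 f(x)\,\Delta x_{\mathrm{nt}} = -\nabla f(x)$ with a hand-written dense Cholesky factorization. The function names `cholesky` and `newton_step` and the 2x2 data are assumptions made for this example; plain Python lists are used only to keep it dependency-free, and a large problem would call for a sparse factorization instead.

```python
import math

def cholesky(H):
    """Return lower-triangular L with H = L L^T (H assumed SPD)."""
    n = len(H)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = H[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def newton_step(H, g):
    """Solve H dx = -g via Cholesky; also return lambda^2 = -g^T dx."""
    n = len(g)
    L = cholesky(H)
    # Forward substitution: L y = -g
    y = [0.0] * n
    for i in range(n):
        y[i] = (-g[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    # Back substitution: L^T dx = y
    dx = [0.0] * n
    for i in reversed(range(n)):
        dx[i] = (y[i] - sum(L[k][i] * dx[k] for k in range(i + 1, n))) / L[i][i]
    lam2 = -sum(gi * di for gi, di in zip(g, dx))
    return dx, lam2

# Hypothetical Hessian and gradient at the current point
H = [[4.0, 1.0], [1.0, 3.0]]
g = [1.0, 2.0]
dx_nt, lam2 = newton_step(H, g)
```

For this data the exact step is $(-1/11, -7/11)$ and $\lambda^2 = 15/11$, which can be checked by hand from $H^{-1} = \frac{1}{11}\begin{bmatrix} 3 & -1 \\ -1 & 4 \end{bmatrix}$.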

Approximate or inexact Newton methods
- use as search direction an approximate solution $\Delta x$ of Newton system
- idea: no need to compute $\Delta x_{\mathrm{nt}}$ exactly; only need a good enough search direction
- number of iterations may increase, but if effort per iteration is smaller than for Newton, we win
- examples:
  - solve $\hat H \Delta x = -\nabla f(x)$, where $\hat H$ is diagonal or band of $\nabla^2 f(x)$
  - factor $\nabla^2 f(x)$ every $k$ iterations and use most recent factorization
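The first example above (replace the Hessian by its diagonal) can be sketched as follows. The test function $f(x) = \tfrac12 x^T P x - q^T x + \tfrac14 \sum_i x_i^4$, the data `P` and `q`, and the name `diagonal_newton` are all assumptions chosen for illustration; the line search here backtracks on the function value.

```python
import math

P = [[4.0, 1.0], [1.0, 3.0]]   # hypothetical SPD matrix
q = [1.0, 2.0]                 # hypothetical linear term

def f(x):
    n = len(x)
    quad = 0.5 * sum(x[i] * P[i][j] * x[j] for i in range(n) for j in range(n))
    return quad - sum(qi * xi for qi, xi in zip(q, x)) + sum(xi ** 4 for xi in x) / 4.0

def grad(x):
    n = len(x)
    return [sum(P[i][j] * x[j] for j in range(n)) - q[i] + x[i] ** 3 for i in range(n)]

def hess_diag(x):
    # Diagonal of the true Hessian P + 3 diag(x^2)
    return [P[i][i] + 3.0 * x[i] ** 2 for i in range(len(x))]

def diagonal_newton(x, tol=1e-6, max_iter=200, alpha=0.25, beta=0.5):
    for k in range(max_iter):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) <= tol:
            return x, k
        # Approximate Newton direction: solve diag(H) dx = -g
        dx = [-gi / hi for gi, hi in zip(g, hess_diag(x))]
        slope = sum(gi * di for gi, di in zip(g, dx))
        t, fx = 1.0, f(x)
        # Backtracking line search on the function value
        while f([xi + t * di for xi, di in zip(x, dx)]) > fx + alpha * t * slope:
            t *= beta
        x = [xi + t * di for xi, di in zip(x, dx)]
    return x, max_iter

x_diag, iters_diag = diagonal_newton([0.0, 0.0])
```

Each iteration costs only a gradient and a diagonal, at the price of linear rather than quadratic convergence; whether that trade pays off depends on how well the diagonal captures the Hessian.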

Truncated Newton methods
- approximately solve Newton system using CG or PCG, terminating (sometimes way) early
- also called Newton-iterative methods; related to limited memory Newton (or BFGS)
- total effort is measured by cumulative sum of CG steps done
- for good performance, need to tune CG stopping criterion, to use just enough steps to get a good enough search direction
- less reliable than Newton's method, but (with good tuning, good preconditioner, fast $z \to \nabla^2 f(x) z$ method, and some luck) can handle very large problems
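A minimal sketch of the overall scheme, under stated assumptions: CG is started from zero, stopped after at most `n_max` steps or at relative residual `eps_cg`, and the outer loop backtracks on the function value. The test function, the data `P` and `q`, and all function names are hypothetical; the only access to the Hessian is through the product `hess_vec(x, v)`.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg(hess_vec, g, n_max, eps):
    """At most n_max CG steps on H dx = -g, starting from dx = 0."""
    dx = [0.0] * len(g)
    r = [-gi for gi in g]            # residual -(H dx + g) at dx = 0
    p = list(r)
    rr = dot(r, r)
    gnorm = math.sqrt(dot(g, g))
    steps = 0
    while steps < n_max and math.sqrt(rr) > eps * gnorm:
        Hp = hess_vec(p)
        a = rr / dot(p, Hp)
        dx = [d + a * pi for d, pi in zip(dx, p)]
        r = [ri - a * hi for ri, hi in zip(r, Hp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
        steps += 1
    return dx, steps

def truncated_newton(f, grad, hess_vec, x, n_max, eps_cg,
                     tol=1e-6, max_iter=100, alpha=0.25, beta=0.5):
    total_cg = 0                     # cumulative CG steps: the effort measure
    for it in range(max_iter):
        g = grad(x)
        if math.sqrt(dot(g, g)) <= tol:
            return x, it, total_cg
        dx, steps = cg(lambda v: hess_vec(x, v), g, n_max, eps_cg)
        total_cg += steps
        slope = dot(g, dx)           # negative: CG iterates from 0 are descent directions
        t, fx = 1.0, f(x)
        while f([xi + t * di for xi, di in zip(x, dx)]) > fx + alpha * t * slope:
            t *= beta
        x = [xi + t * di for xi, di in zip(x, dx)]
    return x, max_iter, total_cg

# Hypothetical test problem: f(x) = 0.5 x^T P x - q^T x + (1/4) sum x_i^4
P = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
q = [1.0, 2.0, 3.0]

def f(x):
    return 0.5 * dot(x, [dot(row, x) for row in P]) - dot(q, x) + sum(xi ** 4 for xi in x) / 4.0

def grad(x):
    return [dot(row, x) - qi + xi ** 3 for row, qi, xi in zip(P, q, x)]

def hess_vec(x, v):
    return [dot(row, v) + 3.0 * xi ** 2 * vi for row, xi, vi in zip(P, x, v)]

x_tn, iters_tn, total_cg = truncated_newton(f, grad, hess_vec, [0.0, 0.0, 0.0],
                                            n_max=2, eps_cg=1e-4)
```

Here `n_max=2` truncates CG below the problem dimension, so every search direction is inexact; the returned `total_cg` is the quantity the comparison of methods should be based on.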

Truncated Newton method
- backtracking line search on $\|\nabla f(x)\|$
- typical CG termination rule: stop after $N_{\max}$ steps or when
  $\eta = \dfrac{\|\nabla^2 f(x)\Delta x + \nabla f(x)\|}{\|\nabla f(x)\|} \le \epsilon_{\mathrm{pcg}}$
- with simple rules, $N_{\max}$, $\epsilon_{\mathrm{pcg}}$ are constant
- more sophisticated rules adapt $N_{\max}$ or $\epsilon_{\mathrm{pcg}}$ as algorithm proceeds (based on, e.g., value of $\|\nabla f(x)\|$, or progress in reducing $\|\nabla f(x)\|$)
- requiring $\eta \le \min(0.1, \|\nabla f(x)\|^{1/2})$ guarantees (with large $N_{\max}$) superlinear convergence
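The termination rule can be sketched as a CG routine that stops after `n_max` steps or once the relative residual $\eta$ falls below a tolerance, together with the adaptive tolerance $\min(0.1, \|\nabla f(x)\|^{1/2})$. The names `truncated_cg` and `forcing_term` and the 3x3 data are assumptions for this example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def forcing_term(grad_norm):
    """Adaptive CG tolerance min(0.1, ||grad f||^(1/2))."""
    return min(0.1, math.sqrt(grad_norm))

def truncated_cg(hess_vec, g, n_max, eps):
    """CG on H dx = -g from dx = 0; returns (dx, steps, eta)."""
    gnorm = math.sqrt(dot(g, g))
    dx = [0.0] * len(g)
    if gnorm == 0.0:                 # already stationary: nothing to solve
        return dx, 0, 0.0
    r = [-gi for gi in g]            # r = -(H dx + g)
    p = list(r)
    rr = dot(r, r)
    steps = 0
    # Stop after n_max steps, or when eta = ||H dx + g|| / ||g|| <= eps
    while steps < n_max and math.sqrt(rr) > eps * gnorm:
        Hp = hess_vec(p)
        a = rr / dot(p, Hp)
        dx = [d + a * pi for d, pi in zip(dx, p)]
        r = [ri - a * hi for ri, hi in zip(r, Hp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
        steps += 1
    return dx, steps, math.sqrt(rr) / gnorm

# Hypothetical Hessian (applied only through products) and gradient
H = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
hess_vec = lambda v: [dot(row, v) for row in H]
g = [1.0, -2.0, 0.5]

dx1, steps1, eta1 = truncated_cg(hess_vec, g, n_max=1, eps=1e-10)    # truncated hard
dxf, stepsf, etaf = truncated_cg(hess_vec, g, n_max=10, eps=1e-10)   # run to tolerance
```

With `n_max=1` the returned direction is a positive multiple of $-\nabla f(x)$; with a generous `n_max` the relative residual drives the stop instead.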

CG initialization
- we use CG to approximately solve $\nabla^2 f(x)\Delta x + \nabla f(x) = 0$
- if we initialize CG with $\Delta x = 0$
  - after one CG step, $\Delta x$ points in direction of negative gradient (so, $N_{\max} = 1$ results in gradient method)
  - all CG iterates are descent directions for $f$
- another choice: initialize with $\Delta x = \Delta x_{\mathrm{prev}}$, the previous search step
  - initial CG iterates need not be descent directions
  - but can give advantage when $N_{\max}$ is small

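- simple scheme: if $\Delta x_{\mathrm{prev}}$ is a descent direction ($\Delta x_{\mathrm{prev}}^T \nabla f(x) < 0$), start CG from
  $\Delta x = \dfrac{-\Delta x_{\mathrm{prev}}^T \nabla f(x)}{\Delta x_{\mathrm{prev}}^T \nabla^2 f(x)\, \Delta x_{\mathrm{prev}}}\; \Delta x_{\mathrm{prev}}$
- otherwise start CG from $\Delta x = 0$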
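The scaling in this scheme is the exact minimizer of the quadratic model $\nabla f(x)^T \Delta x + \tfrac12 \Delta x^T \nabla^2 f(x) \Delta x$ along $\Delta x_{\mathrm{prev}}$. A minimal sketch, with the function name `cg_start` and the 2x2 data being assumptions:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg_start(dx_prev, g, hess_vec):
    """Starting point for CG: scaled previous step if it is a descent direction."""
    slope = dot(dx_prev, g)
    if slope < 0:
        curvature = dot(dx_prev, hess_vec(dx_prev))
        scale = -slope / curvature   # minimizes the quadratic model along dx_prev
        return [scale * d for d in dx_prev]
    return [0.0] * len(g)            # fall back to the zero start

# Hypothetical Hessian and gradient at the current point
H = [[4.0, 1.0], [1.0, 3.0]]
hess_vec = lambda v: [dot(row, v) for row in H]
g = [1.0, 2.0]

start_descent = cg_start([-1.0, -1.0], g, hess_vec)   # descent: g^T dx_prev = -3
start_ascent = cg_start([1.0, 1.0], g, hess_vec)      # not descent: zero start
```

For this data the scale factor is $3/9 = 1/3$, so the start is $(-1/3, -1/3)$, and the model's directional derivative along $\Delta x_{\mathrm{prev}}$ vanishes there.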

Example: $\ell_2$-regularized logistic regression

    minimize  $f(w) = (1/m) \sum_{i=1}^m \log\left(1 + \exp(-b_i x_i^T w)\right) + \sum_{i=1}^n \lambda_i w_i^2$

- variable is $w \in \mathbf{R}^n$
- problem data are $x_i \in \mathbf{R}^n$, $b_i \in \{-1, 1\}$, $i = 1, \ldots, m$, and regularization parameter $\lambda \in \mathbf{R}^n_+$
- $n$ is number of features; $m$ is number of samples/observations

Hessian and gradient

    $\nabla^2 f(w) = A^T D A + 2\Lambda, \qquad \nabla f(w) = A^T g + 2\Lambda w$

where
    $A = [b_1 x_1 \; \cdots \; b_m x_m]^T, \qquad D = \mathbf{diag}(h), \qquad \Lambda = \mathbf{diag}(\lambda)$
    $g_i = -(1/m) / (1 + \exp(Aw)_i)$
    $h_i = (1/m) \exp(Aw)_i / (1 + \exp(Aw)_i)^2$

- we never form $\nabla^2 f(w)$; we carry out multiplication $z \to \nabla^2 f(w) z$ as
  $\nabla^2 f(w) z = (A^T D A + 2\Lambda) z = A^T (D (A z)) + 2\Lambda z$
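The matrix-free product can be sketched directly from these formulas. The helper names (`matvec`, `rmatvec`, `logreg_grad`, `logreg_hess_vec`) and the tiny dense data set are assumptions; a real instance would store $A$ in a sparse format, and `math.exp` would need guarding against overflow for large $(Aw)_i$.

```python
import math

def matvec(A, w):                    # A w, with A stored as a list of rows
    return [sum(a * x for a, x in zip(row, w)) for row in A]

def rmatvec(A, u):                   # A^T u
    return [sum(A[i][j] * u[i] for i in range(len(A))) for j in range(len(A[0]))]

def logreg_grad(A, lam, w):
    """Gradient A^T g + 2 Lambda w, with g_i = -(1/m) / (1 + exp((Aw)_i))."""
    m = len(A)
    g = [-(1.0 / m) / (1.0 + math.exp(ui)) for ui in matvec(A, w)]
    return [a + 2.0 * l * wj for a, l, wj in zip(rmatvec(A, g), lam, w)]

def logreg_hess_vec(A, lam, w, z):
    """Hessian-vector product A^T (D (A z)) + 2 Lambda z, never forming the Hessian."""
    m = len(A)
    h = [(1.0 / m) * math.exp(ui) / (1.0 + math.exp(ui)) ** 2 for ui in matvec(A, w)]
    Az = matvec(A, z)
    AtDAz = rmatvec(A, [hi * v for hi, v in zip(h, Az)])
    return [a + 2.0 * l * zj for a, l, zj in zip(AtDAz, lam, z)]

# Hypothetical data: rows of X are samples x_i, b holds labels in {-1, +1}
X = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [1.0, 1.0, 0.0], [-1.0, 2.0, 1.0]]
b = [1.0, -1.0, 1.0, -1.0]
A = [[bi * xij for xij in xi] for bi, xi in zip(b, X)]   # rows b_i x_i^T
lam = [0.1, 0.2, 0.3]
w = [0.1, -0.2, 0.3]
z = [1.0, 0.5, -1.0]

Hz = logreg_hess_vec(A, lam, w, z)
```

The cost is two products with $A$ (one with $A$, one with $A^T$) plus $O(m + n)$ work, which is what makes each CG step cheap when $A$ is sparse.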

Problem instance
- $n$ = [...] features, $m = 20000$ samples (10000 each with $b_i = \pm 1$)
- $x_i$ have random sparsity pattern, with around 10 nonzero entries
- nonzero entries in $x_i$ drawn from $\mathcal{N}(b_i, 1)$
- $\lambda_i = 10^{-8}$
- around [...] nonzeros in $\nabla^2 f$, and 30M nonzeros in Cholesky factor

Methods
- Newton (using Cholesky factorization of $\nabla^2 f(w)$)
- truncated Newton with $\epsilon_{\mathrm{cg}} = 10^{-4}$, $N_{\max} = 10$
- truncated Newton with $\epsilon_{\mathrm{cg}} = 10^{-4}$, $N_{\max} = 50$
- truncated Newton with $\epsilon_{\mathrm{cg}} = 10^{-4}$, $N_{\max} = 250$

Convergence versus iterations
[Figure: $\|\nabla f\|$ (log scale) versus iteration $k$ for cg 10, cg 50, cg 250, and Newton.]

Convergence versus cumulative CG steps
[Figure: $\|\nabla f\|$ (log scale) versus cumulative CG iterations for cg 10, cg 50, and cg 250.]

- convergence of exact Newton and of truncated Newton methods with $N_{\max} = 50$ and $250$ is essentially the same, in terms of iterations
- in terms of elapsed time (and memory!), truncated Newton methods are far better than Newton
- truncated Newton with $N_{\max} = 10$ seems to jam near $\|\nabla f(w)\| \approx 10^{-6}$
- times (on AMD270 2GHz, 12GB, Linux) in sec:

    method    $\|\nabla f(w)\| \le 10^{-5}$    $\|\nabla f(w)\| \le 10^{-8}$
    Newton    [...]                            [...]
    cg 10     4                                [...]
    cg 50     [...]                            [...]
    cg 250    [...]                            [...]

Truncated PCG Newton method
- approximate search direction found via diagonally preconditioned PCG
[Figure: $\|\nabla f\|$ (log scale) versus cumulative CG iterations for cg 10, cg 50, cg 250, pcg 10, pcg 50, and pcg 250.]

- diagonal preconditioning allows $N_{\max} = 10$ to achieve high accuracy; speeds up other truncated Newton methods
- times:

    method    $\|\nabla f(w)\| \le 10^{-5}$    $\|\nabla f(w)\| \le 10^{-8}$
    Newton    [...]                            [...]
    cg 10     4                                [...]
    cg 50     [...]                            [...]
    cg 250    [...]                            [...]
    pcg 10    [...]                            [...]
    pcg 50    [...]                            [...]
    pcg 250   [...]                            [...]

- speedups of 1600:3, 2600:5 are not bad (and we really didn't do much tuning...)
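A minimal sketch of diagonally preconditioned CG for the Newton system, assuming the preconditioner is $M = \mathbf{diag}(\nabla^2 f)$ and the stopping test uses the unpreconditioned relative residual. The name `pcg` and the badly scaled test matrix are assumptions; the matrix is deliberately diagonal so that the effect of the preconditioner is exact and easy to verify.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcg(hess_vec, g, diag, n_max, eps):
    """Approximately solve H dx = -g by CG preconditioned with M = diag(diag)."""
    dx = [0.0] * len(g)
    r = [-gi for gi in g]                          # r = -(H dx + g)
    z = [ri / di for ri, di in zip(r, diag)]       # z = M^{-1} r
    p = list(z)
    rz = dot(r, z)
    gnorm = math.sqrt(dot(g, g))
    steps = 0
    while steps < n_max and math.sqrt(dot(r, r)) > eps * gnorm:
        Hp = hess_vec(p)
        a = rz / dot(p, Hp)
        dx = [d + a * pi for d, pi in zip(dx, p)]
        r = [ri - a * hi for ri, hi in zip(r, Hp)]
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
        steps += 1
    return dx, steps

# Hypothetical badly scaled Hessian, applied only through products
d = [1.0, 10.0, 100.0, 1000.0]
hess_vec = lambda v: [di * vi for di, vi in zip(d, v)]
g = [1.0, 1.0, 1.0, 1.0]

dx_pcg, steps_pcg = pcg(hess_vec, g, d, n_max=50, eps=1e-10)          # diagonal preconditioner
dx_cg, steps_cg = pcg(hess_vec, g, [1.0] * 4, n_max=50, eps=1e-10)    # M = I, i.e. plain CG
```

Passing a vector of ones as `diag` recovers plain CG, so the same routine gives both sides of the comparison; on this example the preconditioned run needs a single step while plain CG needs several. For the logistic regression Hessian $A^T D A + 2\Lambda$, the diagonal would be $\sum_i h_i A_{ij}^2 + 2\lambda_j$, which is cheap to form from $A$.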

Extensions
- can extend to (infeasible start) Newton's method with equality constraints
- since we don't use exact Newton step, equality constraints not guaranteed to hold after finite number of steps (but $\|r_p\| \to 0$)
- can use for barrier, primal-dual methods

Truncated Newton interior-point methods
- use truncated Newton method to compute search direction in interior-point method
- tuning PCG parameters for optimal performance on a given problem class is tricky, since linear systems in interior-point methods often become ill-conditioned as algorithm proceeds
- but can work well (with luck, good preconditioner)

Network rate control
rate control problem

    minimize    $-U(f) = -\sum_{j=1}^n \log f_j$
    subject to  $Rf \preceq c$

with variable $f$
- $f \in \mathbf{R}^n_{++}$ is vector of flow rates
- $U(f) = \sum_{j=1}^n \log f_j$ is flow utility
- $R \in \mathbf{R}^{m \times n}$ is route matrix ($R_{ij} \in \{0, 1\}$)
- $c \in \mathbf{R}^m$ is vector of link capacities

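Dual rate control problem
dual problem

    maximize    $g(\lambda) = n - c^T \lambda + \sum_{j=1}^n \log(r_j^T \lambda)$
    subject to  $\lambda \succeq 0$

with variable $\lambda \in \mathbf{R}^m$, where $r_j \in \mathbf{R}^m$ is the $j$th column of $R$ (so the sum has one term per flow)

duality gap

    $\eta = -U(f) - g(\lambda) = -\sum_{j=1}^n \log f_j - n + c^T \lambda - \sum_{j=1}^n \log(r_j^T \lambda)$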
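The dual function and duality gap can be evaluated as in the sketch below. It assumes the log terms in $g(\lambda)$ are taken over the $n$ columns $r_j$ of $R$ (one per flow), which is what minimizing the Lagrangian $-\sum_j \log f_j + \lambda^T (Rf - c)$ over $f$ gives, with minimizer $f_j = 1/(r_j^T \lambda)$. The function names and the two-link, three-flow network are hypothetical.

```python
import math

def utility(f):
    return sum(math.log(fj) for fj in f)

def rt_lambda(R, lam):
    """R^T lam: entry j is r_j^T lam, the total price along flow j's route."""
    return [sum(R[i][j] * lam[i] for i in range(len(R))) for j in range(len(R[0]))]

def dual_value(R, c, lam):
    n = len(R[0])
    return (n - sum(ci * li for ci, li in zip(c, lam))
            + sum(math.log(v) for v in rt_lambda(R, lam)))

def duality_gap(R, c, f, lam):
    """eta = -U(f) - g(lam); nonnegative when Rf <= c and lam >= 0."""
    return -utility(f) - dual_value(R, c, lam)

# Hypothetical network: 2 links, 3 flows; flow 2 uses both links
R = [[1.0, 1.0, 0.0],
     [0.0, 1.0, 1.0]]
c = [1.0, 2.0]
f = [0.4, 0.4, 1.0]        # feasible: Rf = (0.8, 1.4)
lam = [1.0, 0.5]

eta = duality_gap(R, c, f, lam)
```

Because $g(\lambda)$ is a lower bound on the optimal value for any $\lambda \succ 0$ with positive route prices, `eta` bounds the suboptimality of the feasible `f` without knowing the optimum.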

Primal-dual search direction (BV 11.7)
primal-dual search direction $\Delta f$, $\Delta\lambda$ given by

    $(D_1 + R^T D_2 R)\, \Delta f = g_1 - (1/t) R^T g_2, \qquad \Delta\lambda = D_2 R\, \Delta f - \lambda + (1/t) g_2$

where $s = c - Rf$,

    $D_1 = \mathbf{diag}(1/f_1^2, \ldots, 1/f_n^2), \qquad D_2 = \mathbf{diag}(\lambda_1/s_1, \ldots, \lambda_m/s_m)$
    $g_1 = (1/f_1, \ldots, 1/f_n), \qquad g_2 = (1/s_1, \ldots, 1/s_m)$
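For readers who want the missing step, the reduced system follows from linearizing the primal-dual residual and eliminating $\Delta\lambda$. The derivation below is a sketch that assumes the residual $r = (-g_1 + R^T\lambda,\ \mathbf{diag}(\lambda)s - (1/t)\mathbf{1})$ with $s = c - Rf$, so that $\Delta s = -R\,\Delta f$.

```latex
% Linearized residual equations: dual residual, then centrality residual
\[
\begin{aligned}
D_1 \Delta f + R^T \Delta\lambda &= g_1 - R^T \lambda, \\
\operatorname{diag}(s)\,\Delta\lambda - \operatorname{diag}(\lambda)\,R\,\Delta f
  &= (1/t)\mathbf{1} - \operatorname{diag}(\lambda)\,s .
\end{aligned}
\]
% Divide row i of the second equation by s_i to solve for \Delta\lambda
\[
\Delta\lambda = D_2 R\,\Delta f - \lambda + (1/t)\,g_2 .
\]
% Substitute into the first equation; the R^T \lambda terms cancel
\[
(D_1 + R^T D_2 R)\,\Delta f = g_1 - (1/t)\,R^T g_2 .
\]
```

The reduced matrix $D_1 + R^T D_2 R$ is symmetric positive definite whenever $f \succ 0$, $s \succ 0$, and $\lambda \succeq 0$, which is what makes CG or PCG applicable.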

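Truncated Newton primal-dual algorithm
primal-dual residual:

    $r = (r_{\mathrm{dual}}, r_{\mathrm{cent}}) = \left(-g_1 + R^T \lambda,\ \mathbf{diag}(\lambda) s - (1/t)\mathbf{1}\right)$

given $f$ with $Rf \prec c$; $\lambda \succ 0$
while $\eta / g(\lambda) > \epsilon$
    $t := \mu m / \eta$
    compute $\Delta f$ using PCG as approximate solution of $(D_1 + R^T D_2 R)\, \Delta f = g_1 - (1/t) R^T g_2$
    $\Delta\lambda := D_2 R\, \Delta f - \lambda + (1/t) g_2$
    carry out line search on $\|r\|_2$, and update: $f := f + \gamma \Delta f$, $\lambda := \lambda + \gamma \Delta\lambda$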
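The direction computation inside the loop can be sketched as below; the line search and the outer loop are left out. Everything here is an assumption made for illustration: the function names, the tiny two-link network, the tight PCG tolerance, and the use of the columns of $R$ in the dual function when forming the gap $\eta$. The reduced matrix is applied as $v \mapsto D_1 v + R^T (D_2 (R v))$ and never formed.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(R, x):                    # R x, with R stored as a list of rows
    return [dot(row, x) for row in R]

def rmatvec(R, y):                   # R^T y
    return [sum(R[i][j] * y[i] for i in range(len(R))) for j in range(len(R[0]))]

def pcg(apply_H, b, diag, n_max, eps):
    """Diagonally preconditioned CG for H x = b, started from x = 0."""
    x = [0.0] * len(b)
    r = list(b)
    z = [ri / di for ri, di in zip(r, diag)]
    p = list(z)
    rz = dot(r, z)
    bnorm = math.sqrt(dot(b, b))
    steps = 0
    while steps < n_max and math.sqrt(dot(r, r)) > eps * bnorm:
        Hp = apply_H(p)
        a = rz / dot(p, Hp)
        x = [xi + a * pi for xi, pi in zip(x, p)]
        r = [ri - a * hi for ri, hi in zip(r, Hp)]
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
        steps += 1
    return x, steps

def search_direction(R, c, f, lam, t, n_max=200, eps=1e-10):
    m, n = len(R), len(f)
    s = [ci - v for ci, v in zip(c, matvec(R, f))]        # slacks s = c - R f
    d1 = [1.0 / fj ** 2 for fj in f]
    d2 = [li / si for li, si in zip(lam, s)]
    g1 = [1.0 / fj for fj in f]
    g2 = [1.0 / si for si in s]

    def apply_H(v):                                       # (D1 + R^T D2 R) v
        RtD2Rv = rmatvec(R, [d * u for d, u in zip(d2, matvec(R, v))])
        return [a * vj + w for a, vj, w in zip(d1, v, RtD2Rv)]

    rhs = [a - w / t for a, w in zip(g1, rmatvec(R, g2))]
    diag = [d1[j] + sum(d2[i] * R[i][j] ** 2 for i in range(m)) for j in range(n)]
    df, steps = pcg(apply_H, rhs, diag, n_max, eps)
    dlam = [d * u - l + gi / t for d, u, l, gi in zip(d2, matvec(R, df), lam, g2)]
    return df, dlam, steps

# Hypothetical network: 2 links, 3 flows
R = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
c = [1.0, 2.0]
f = [0.4, 0.4, 1.0]
lam = [1.0, 0.5]
mu = 2.0

# Duality gap eta = -U(f) - g(lam), then t := mu * m / eta
g_lam = len(f) - dot(c, lam) + sum(math.log(v) for v in rmatvec(R, lam))
eta = -sum(math.log(fj) for fj in f) - g_lam
t = mu * len(c) / eta
df, dlam, steps = search_direction(R, c, f, lam, t)
```

With a looser `eps` or a small `n_max` the same routine returns a truncated direction; the tight tolerance here is only so that the result can be checked against the linearized residual equations.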

- problem instance
  - $m$ = [...] links, $n$ = [...] flows
  - average of 12 links per flow, 6 flows per link
  - capacities random, uniform on $[0.1, 1]$
- algorithm parameters
  - truncated Newton with $\epsilon_{\mathrm{cg}} = \min(0.1, \eta / g(\lambda))$, $N_{\max} = 200$ ($N_{\max}$ never reached)
  - diagonal preconditioner
  - warm start
  - $\mu = 2$
  - $\epsilon = 0.001$ (i.e., solve to guaranteed 0.1% suboptimality)

Primal and dual objective evolution
[Figure: $-U(f)$ and $g(\lambda)$ (axis scale $\times 10^5$) versus cumulative PCG iterations.]

Relative duality gap evolution
[Figure: relative duality gap (log scale) versus cumulative PCG iterations.]

Primal and dual objective evolution ($n = 10^6$)
[Figure: $-U(f)$ and $g(\lambda)$ (axis scale $\times 10^6$) versus cumulative PCG iterations.]

Relative duality gap evolution ($n = 10^6$)
[Figure: relative duality gap (log scale) versus cumulative PCG iterations.]
