Sequential Convex Programming

- sequential convex programming
- alternating convex optimization
- convex-concave procedure

Prof. S. Boyd, EE364b, Stanford University
Methods for nonconvex optimization problems

- convex optimization methods are (roughly) always global, always fast
- for general nonconvex problems, we have to give up one:
  - local optimization methods are fast, but need not find the global solution
    (and even when they do, cannot certify it)
  - global optimization methods find the global solution (and certify it), but
    are not always fast (indeed, are often slow)
- this lecture: local optimization methods that are based on solving a
  sequence of convex problems
Sequential convex programming (SCP)

- a local optimization method for nonconvex problems that leverages convex
  optimization: convex portions of a problem are handled exactly and
  efficiently
- SCP is a heuristic: it can fail to find an optimal (or even feasible) point
- results can (and often do) depend on the starting point (can run the
  algorithm from many initial points and take the best result)
- SCP often works well, i.e., finds a feasible point with good, if not
  optimal, objective value
Problem

we consider the nonconvex problem

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,  i = 1,...,m
                h_i(x) = 0,  i = 1,...,p

with variable x ∈ R^n; f_0 and f_i are (possibly) nonconvex; h_i are
(possibly) non-affine
Basic idea of SCP

- maintain an estimate of the solution x^(k), and a convex trust region
  T^(k) ⊆ R^n
- form convex approximation f̂_i of f_i over the trust region T^(k)
- form affine approximation ĥ_i of h_i over the trust region T^(k)
- x^(k+1) is an optimal point for the approximate convex problem

    minimize    f̂_0(x)
    subject to  f̂_i(x) ≤ 0,  i = 1,...,m
                ĥ_i(x) = 0,  i = 1,...,p
                x ∈ T^(k)
Trust region

- typical trust region is a box around the current point:

    T^(k) = {x | |x_i − x_i^(k)| ≤ ρ_i, i = 1,...,n}

- if x_i appears only in convex inequalities and affine equalities, can take
  ρ_i = ∞
Affine and convex approximations via Taylor expansions

- (affine) first-order Taylor expansion:

    f̂(x) = f(x^(k)) + ∇f(x^(k))^T (x − x^(k))

- (convex part of) second-order Taylor expansion:

    f̂(x) = f(x^(k)) + ∇f(x^(k))^T (x − x^(k))
            + (1/2)(x − x^(k))^T P (x − x^(k))

  with P = (∇²f(x^(k)))_+, the PSD part of the Hessian
- these give local approximations, which don't depend on the trust region
  radii ρ_i
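The PSD part of a Hessian can be computed by eigendecomposition, clipping the negative eigenvalues to zero. A minimal numpy sketch (the matrix H below is a made-up example):

```python
import numpy as np

def psd_part(H):
    """PSD part of a symmetric matrix: eigendecompose, clip negative
    eigenvalues to zero, and reassemble."""
    w, V = np.linalg.eigh(H)
    return V @ np.diag(np.maximum(w, 0.0)) @ V.T

# indefinite example with eigenvalues 2 and -1
H = np.array([[2.0, 0.0], [0.0, -1.0]])
P = psd_part(H)   # equals diag(2, 0)
```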
Particle method

- particle method: choose points z_1,...,z_K ∈ T^(k) (e.g., all vertices,
  some vertices, grid, random, ...)
- evaluate y_i = f(z_i)
- fit the data (z_i, y_i) with a convex (affine) function (using convex
  optimization)
- advantages: handles nondifferentiable functions, or functions for which
  evaluating derivatives is difficult
- gives regional models, which depend on the current point and the trust
  region radii ρ_i
Fitting affine or quadratic functions to data

fit a convex quadratic function to the data (z_i, y_i):

    minimize    Σ_{i=1}^K ((z_i − x^(k))^T P (z_i − x^(k))
                           + q^T (z_i − x^(k)) + r − y_i)²
    subject to  P ⪰ 0

with variables P ∈ S^n, q ∈ R^n, r ∈ R

- can use other objectives, add other convex constraints
- no need to solve exactly
- this problem is solved for each nonconvex constraint, at each SCP step
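The simplest instance of this fit is the affine case (P = 0), which reduces to ordinary least squares; the PSD-constrained quadratic fit additionally needs an SDP solver. A numpy sketch of the affine special case, on made-up particle data:

```python
import numpy as np

# particles z_i (rows of Z) and values y_i = f(z_i); here f is itself affine,
# so the least-squares fit should recover it exactly
rng = np.random.default_rng(0)
Z = rng.standard_normal((20, 3))
y = Z @ np.array([1.0, -2.0, 0.5]) + 3.0

# fit f_hat(z) = a^T z + b by least squares over rows [z_i^T, 1]
A = np.hstack([Z, np.ones((Z.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b = coef[:-1], coef[-1]
```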
Quasi-linearization

- a cheap and simple method for affine approximation
- write h(x) as A(x)x + b(x) (many ways to do this)
- use ĥ(x) = A(x^(k))x + b(x^(k))

example: h(x) = (1/2)x^T P x + q^T x + r = ((1/2)Px + q)^T x + r

    ĥ_ql(x)  = ((1/2)P x^(k) + q)^T x + r
    ĥ_tay(x) = (P x^(k) + q)^T (x − x^(k)) + h(x^(k))
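Both affine approximations agree with h at the expansion point but generally differ elsewhere; a quick numpy check on a small made-up instance:

```python
import numpy as np

P = np.array([[2.0, 1.0], [1.0, 3.0]])
q = np.array([1.0, -1.0])
r = 0.5
xk = np.array([0.5, -0.5])            # expansion point x^(k)

h     = lambda x: 0.5 * x @ P @ x + q @ x + r
h_ql  = lambda x: (0.5 * P @ xk + q) @ x + r          # quasi-linearization
h_tay = lambda x: (P @ xk + q) @ (x - xk) + h(xk)     # first-order Taylor

# both are exact at x^(k); away from x^(k) they differ
```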
Example

nonconvex QP:

    minimize    f(x) = (1/2)x^T P x + q^T x
    subject to  ‖x‖_∞ ≤ 1

with P symmetric but not PSD; use the approximation

    f(x^(k)) + (P x^(k) + q)^T (x − x^(k))
        + (1/2)(x − x^(k))^T P_+ (x − x^(k))
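A self-contained sketch of SCP on this problem, with randomly generated data; the convex subproblem (box intersected with an l_inf trust region) is solved here by projected gradient descent, a stand-in for a real QP solver. Because the convexified model majorizes f, the objective is nonincreasing across SCP steps:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
A = rng.standard_normal((n, n))
P = A + A.T                          # symmetric, indefinite in general
q = rng.standard_normal(n)
f = lambda x: 0.5 * x @ P @ x + q @ x

# PSD part of P for the convexified model
w, V = np.linalg.eigh(P)
P_plus = V @ np.diag(np.maximum(w, 0.0)) @ V.T

rho = 0.2                            # trust region radius
x = np.zeros(n)
for k in range(50):                  # SCP iterations
    lo = np.maximum(x - rho, -1.0)   # box cap trust region, lower bounds
    hi = np.minimum(x + rho, 1.0)    # ... and upper bounds
    L = max(np.max(np.maximum(w, 0.0)), 1.0)  # Lipschitz const. of model grad
    z = x.copy()
    for _ in range(200):             # projected gradient on the convex model
        g = P @ x + q + P_plus @ (z - x)      # gradient of convexified model
        z = np.clip(z - g / L, lo, hi)
    x = z
```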
- example with x ∈ R^20
- SCP with ρ = 0.2, started from 10 different points

[Figure: f(x^(k)) versus iteration k for the 10 runs.]

- runs typically converge to points between −60 and −50
- dashed line shows the lower bound on the optimal value, −66.5
Lower bound via Lagrange dual

write the constraints as x_i² ≤ 1 and form the Lagrangian

    L(x, λ) = (1/2)x^T P x + q^T x + Σ_{i=1}^n λ_i (x_i² − 1)
            = (1/2)x^T (P + diag(2λ)) x + q^T x − 1^T λ

so g(λ) = −(1/2)q^T (P + diag(2λ))^{−1} q − 1^T λ; need P + diag(2λ) ⪰ 0

solve the dual problem to get the best lower bound:

    maximize    −(1/2)q^T (P + diag(2λ))^{−1} q − 1^T λ
    subject to  λ ⪰ 0
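By weak duality, any dual-feasible λ already gives a lower bound, even without solving the dual problem. A numpy sketch that evaluates the dual function at one hand-picked λ (not the dual optimum) and compares it with the primal objective at a feasible point; the data are random:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))
P = A + A.T                      # symmetric, generally not PSD
q = rng.standard_normal(n)

# choose lambda so that P + diag(2*lambda) is positive definite
lam = np.full(n, max(0.0, -np.linalg.eigvalsh(P).min()) / 2 + 0.5)
Pd = P + np.diag(2 * lam)

# dual function value: minimize the Lagrangian over x analytically
g_lam = -0.5 * q @ np.linalg.solve(Pd, q) - lam.sum()

# primal objective at some feasible point (clipped into the box)
x = np.clip(-np.linalg.solve(Pd, q), -1.0, 1.0)
f_x = 0.5 * x @ P @ x + q @ x
# weak duality guarantees g_lam <= f_x
```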
Some (related) issues

- the approximate convex problem can be infeasible
- how do we evaluate progress when x^(k) isn't feasible? need to take into
  account:
  - objective f_0(x^(k))
  - inequality constraint violations f_i(x^(k))_+
  - equality constraint violations |h_i(x^(k))|
- controlling the trust region size:
  - ρ too large: approximations are poor, leading to a bad choice of x^(k+1)
  - ρ too small: approximations are good, but progress is slow
Exact penalty formulation

instead of the original problem, we solve the unconstrained problem

    minimize  φ(x) = f_0(x) + λ(Σ_{i=1}^m f_i(x)_+ + Σ_{i=1}^p |h_i(x)|)

where λ > 0

- for λ large enough, the minimizer of φ is a solution of the original problem
- for SCP, use the convex approximation

    φ̂(x) = f̂_0(x) + λ(Σ_{i=1}^m f̂_i(x)_+ + Σ_{i=1}^p |ĥ_i(x)|)

- the approximate problem is always feasible
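A tiny numerical illustration of exactness, on a hypothetical one-dimensional problem (minimize x² subject to x − 1 = 0; the optimal dual multiplier has magnitude 2, so any λ > 2 makes the penalty exact):

```python
import numpy as np

lam = 10.0                       # penalty weight, larger than |nu*| = 2
f0 = lambda x: x**2
h  = lambda x: x - 1.0           # equality constraint h(x) = 0
phi = lambda x: f0(x) + lam * np.abs(h(x))

# crude grid minimization of phi; the minimizer matches the constrained
# solution x* = 1 because lam is large enough
xs = np.linspace(-2.0, 2.0, 4001)
x_star = xs[np.argmin(phi(xs))]
```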
Trust region update

- judge algorithm progress by the decrease in φ, using the solution x̃ of the
  approximate problem
- decrease with approximate objective: δ̂ = φ(x^(k)) − φ̂(x̃) (called the
  predicted decrease)
- decrease with exact objective: δ = φ(x^(k)) − φ(x̃)
- if δ ≥ αδ̂: ρ^(k+1) = β_succ ρ^(k), x^(k+1) = x̃
  (α ∈ (0,1), β_succ ≥ 1; typical values α = 0.1, β_succ = 1.1)
- if δ < αδ̂: ρ^(k+1) = β_fail ρ^(k), x^(k+1) = x^(k)
  (β_fail ∈ (0,1); typical value β_fail = 0.5)
- interpretation: if the actual decrease is more (less) than the fraction α of
  the predicted decrease, then increase (decrease) the trust region size
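The update rule as a small pure-Python function (the function and argument names are ours, not from the slides):

```python
def trust_region_update(delta, delta_hat, rho,
                        alpha=0.1, beta_succ=1.1, beta_fail=0.5):
    """Return (new rho, whether the candidate point x_tilde is accepted).

    Accept the step and grow the trust region when the actual decrease
    delta is at least the fraction alpha of the predicted decrease
    delta_hat; otherwise reject the step and shrink the trust region.
    """
    if delta >= alpha * delta_hat:
        return beta_succ * rho, True    # x^(k+1) = x_tilde
    return beta_fail * rho, False       # x^(k+1) = x^(k)
```

For example, with predicted decrease 2 and actual decrease 1, the step is accepted and ρ grows by 10%.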
Nonlinear optimal control

[Figure: 2-link system with link lengths l_1, l_2, masses m_1, m_2, joint
angles θ_1, θ_2, and applied torques τ_1, τ_2.]

2-link system, controlled by torques τ_1 and τ_2 (no gravity)
dynamics are given by M(θ)θ̈ + W(θ, θ̇)θ̇ = τ, with

    M(θ) = [ (m_1 + m_2)l_1²                   m_2 l_1 l_2 (s_1 s_2 + c_1 c_2) ]
           [ m_2 l_1 l_2 (s_1 s_2 + c_1 c_2)   m_2 l_2²                        ]

    W(θ, θ̇) = [ 0                                      m_2 l_1 l_2 (s_1 c_2 − c_1 s_2) θ̇_2 ]
              [ −m_2 l_1 l_2 (s_1 c_2 − c_1 s_2) θ̇_1   0                                   ]

where s_i = sin θ_i, c_i = cos θ_i

nonlinear optimal control problem:

    minimize    J = ∫_0^T ‖τ(t)‖_2² dt
    subject to  θ(0) = θ_init,  θ̇(0) = 0,  θ(T) = θ_final,  θ̇(T) = 0
                ‖τ(t)‖_∞ ≤ τ_max,  0 ≤ t ≤ T
Discretization

- discretize with time interval h = T/N
- J ≈ h Σ_{i=1}^N ‖τ_i‖_2², with τ_i = τ(ih)
- approximate the derivatives as

    θ̇(ih) ≈ (θ_{i+1} − θ_{i−1}) / (2h),
    θ̈(ih) ≈ (θ_{i+1} − 2θ_i + θ_{i−1}) / h²

- approximate the dynamics as a set of nonlinear equality constraints:

    M(θ_i) (θ_{i+1} − 2θ_i + θ_{i−1}) / h²
        + W(θ_i, (θ_{i+1} − θ_{i−1}) / (2h)) (θ_{i+1} − θ_{i−1}) / (2h) = τ_i

- θ_0 = θ_1 = θ_init;  θ_N = θ_{N+1} = θ_final
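The central differences above are second-order accurate; a quick numpy check against θ(t) = sin t:

```python
import numpy as np

h = 0.01
t = 1.0
theta = np.sin

d1 = (theta(t + h) - theta(t - h)) / (2 * h)              # approx theta'(t)
d2 = (theta(t + h) - 2 * theta(t) + theta(t - h)) / h**2  # approx theta''(t)

# exact values are cos(1) and -sin(1); the errors are O(h^2)
```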
discretized nonlinear optimal control problem:

    minimize    h Σ_{i=1}^N ‖τ_i‖_2²
    subject to  θ_0 = θ_1 = θ_init,  θ_N = θ_{N+1} = θ_final
                ‖τ_i‖_∞ ≤ τ_max,  i = 1,...,N
                M(θ_i) (θ_{i+1} − 2θ_i + θ_{i−1}) / h²
                    + W(θ_i, (θ_{i+1} − θ_{i−1}) / (2h)) (θ_{i+1} − θ_{i−1}) / (2h) = τ_i

- replace the equality constraints with the quasilinearized versions

    M(θ_i^(k)) (θ_{i+1} − 2θ_i + θ_{i−1}) / h²
        + W(θ_i^(k), (θ_{i+1}^(k) − θ_{i−1}^(k)) / (2h)) (θ_{i+1} − θ_{i−1}) / (2h) = τ_i

- trust region: only on θ_i
- initialize with θ_i = θ_init + ((i − 1)/(N − 1))(θ_final − θ_init),
  i = 1,...,N
Numerical example

- m_1 = 1, m_2 = 5, l_1 = 1, l_2 = 1
- N = 40, T = 10
- θ_init = (0, −2.9), θ_final = (3, 2.9)
- τ_max = 1.1
- α = 0.1, β_succ = 1.1, β_fail = 0.5, ρ^(1) = 90
- λ = 2
SCP progress

[Figure: penalty objective φ(x^(k)) versus iteration k.]
Convergence of J and torque residuals

[Figure: left, objective J^(k) versus k; right, sum of torque residuals
versus k (log scale).]
Predicted and actual decreases in φ

[Figure: left, predicted decrease δ̂ (dotted) and actual decrease δ (solid)
versus k; right, trust region size ρ^(k) versus k (log scale).]
Trajectory plan

[Figure: planned torques τ_1, τ_2 and joint angles θ_1, θ_2 versus t.]
Difference of convex programming

express the problem as

    minimize    f_0(x) − g_0(x)
    subject to  f_i(x) − g_i(x) ≤ 0,  i = 1,...,m

where f_i and g_i are convex

- f_i − g_i are called difference of convex functions
- the problem is sometimes called a difference of convex program
Convex-concave procedure

- obvious convexification at x^(k): replace f(x) − g(x) with

    f̂(x) = f(x) − g(x^(k)) − ∇g(x^(k))^T (x − x^(k))

- since f̂(x) ≥ f(x) − g(x) for all x, no trust region is needed:
  - the true objective at x̃ is better than the convexified objective
  - the true feasible set contains the feasible set of the convexified problem
- this form of SCP is sometimes called the convex-concave procedure
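The key inequality f̂ ≥ f − g holds because linearizing the convex g gives a global under-estimator of g, hence an over-estimator of f − g. This can be checked numerically on a toy one-dimensional DC function (f and g below are made-up examples):

```python
import numpy as np

xk = 0.7                          # expansion point
f  = lambda x: x**4               # convex part
g  = lambda x: x**2               # convex part being subtracted
dg = lambda x: 2 * x              # derivative of g

# convexified objective: linearize g at xk
f_hat = lambda x: f(x) - g(xk) - dg(xk) * (x - xk)

xs = np.linspace(-2.0, 2.0, 401)
gap = f_hat(xs) - (f(xs) - g(xs))   # >= 0 everywhere, zero at xk
```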
Example (BV §7.1)

- given samples y_1,...,y_N ∈ R^n from N(0, Σ_true)
- negative log-likelihood function is

    f(Σ) = log det Σ + Tr(Σ^{−1} Y),    Y = (1/N) Σ_{i=1}^N y_i y_i^T

  (dropping a constant and a positive scale factor)
- ML estimate of Σ, with prior knowledge Σ_ij ≥ 0:

    minimize    f(Σ) = log det Σ + Tr(Σ^{−1} Y)
    subject to  Σ_ij ≥ 0,  i, j = 1,...,n

  with variable Σ (the constraint Σ ≻ 0 is implicit)
- the first term in f is concave; the second term is convex
- linearize the first term in the objective to get

    f̂(Σ) = log det Σ^(k) + Tr((Σ^(k))^{−1} (Σ − Σ^(k))) + Tr(Σ^{−1} Y)
Numerical example

convergence of a problem instance with n = 10, N = 15

[Figure: f(Σ^(k)) versus iteration k.]
Alternating convex optimization

- given a nonconvex problem with variable (x_1,...,x_n) ∈ R^n
- I_1,...,I_k ⊆ {1,...,n} are index subsets with ∪_j I_j = {1,...,n}
- suppose the problem is convex in the subset of variables x_i, i ∈ I_j, when
  x_i, i ∉ I_j, are fixed
- alternating convex optimization method: cycle through j, in each step
  optimizing over the variables x_i, i ∈ I_j
- special case: bi-convex problem
  - x = (u, v); the problem is convex in u (v) with v (u) fixed
  - alternate optimizing over u and v
Nonnegative matrix factorization

NMF problem:

    minimize    ‖A − XY‖_F
    subject to  X_ij ≥ 0,  Y_ij ≥ 0

- variables X ∈ R^{m×k}, Y ∈ R^{k×n}; data A ∈ R^{m×n}
- a difficult problem, except for a few special cases (e.g., k = 1)
- alternating convex optimization: solve QPs to optimize over X, then Y,
  then X, ...
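A numpy sketch of the alternating scheme: with Y fixed the problem in X is a nonnegative least-squares problem (and vice versa). Here each subproblem is only approximately solved, by a few projected-gradient steps, as a stand-in for the QP solves on the slide; the data are a random exactly rank-k nonnegative matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 20, 15, 3
A = rng.random((m, k)) @ rng.random((k, n))   # nonnegative, rank k
X = rng.random((m, k))
Y = rng.random((k, n))

def nnls_pg_step(B, C, D, iters=50):
    """Approximately minimize ||D - B C||_F over C >= 0 by projected
    gradient descent with step 1/L, L the gradient Lipschitz constant."""
    L = np.linalg.norm(B.T @ B, 2) + 1e-12
    for _ in range(iters):
        C = np.maximum(C - (B.T @ (B @ C - D)) / L, 0.0)
    return C

res0 = np.linalg.norm(A - X @ Y)
for _ in range(30):
    Y = nnls_pg_step(X, Y, A)                 # optimize over Y, X fixed
    X = nnls_pg_step(Y.T, X.T, A.T).T         # optimize over X, Y fixed
res = np.linalg.norm(A - X @ Y)               # nonincreasing by construction
```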
Example

convergence for an example with m = n = 50, k = 5 (five starting points)

[Figure: ‖A − XY‖_F versus iteration k for the five runs.]