Convex Optimization, Lecture 16

Today:
  Projected Gradient Descent
  Conditional Gradient Descent
  Stochastic Gradient Descent
  Random Coordinate Descent
Recall: Gradient Descent (Steepest Descent w.r.t. the Euclidean Norm)

Gradient descent algorithm:
  Descent direction: $\Delta x = -\nabla f(x^{(k)})$
  Init: $x^{(0)} \in \mathrm{dom}(f)$
  Iterate: $x^{(k+1)} = x^{(k)} - t^{(k)} \nabla f(x^{(k)})$

Note: $\Delta x$ is not normalized (i.e., we don't require $\|\Delta x\|_2 = 1$); this just changes the meaning of $t$. How do we choose the stepsize $t^{(k)}$?

Reminder: we are blurring here the distinction between the primal space and the dual (gradient) space; we implicitly link them by matching representations w.r.t. a chosen basis.

Some upper bounds (#iterations for $\epsilon$-optimality):
  $L$-Lipschitz: $\frac{L^2\|x^\star\|^2}{\epsilon^2}$
  $M$-smooth: $\frac{M\|x^\star\|^2}{\epsilon}$
  $M$-smooth, $\mu$-strongly convex: $\kappa\log\frac{1}{\epsilon}$ for GD and $\sqrt{\kappa}\log\frac{1}{\epsilon}$ for AGD, with $\kappa = M/\mu$; oracle/ops per iteration: $\nabla f(x)$, $O(n)$
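A minimal Python sketch of the update above (the function names, the constant stepsize, and the least-squares test problem are illustrative assumptions, not part of the slides):

```python
import numpy as np

def gradient_descent(grad_f, x0, stepsize, num_iters):
    """Plain gradient descent: x_{k+1} = x_k - t * grad f(x_k).

    grad_f   : callable returning the gradient at a point
    x0       : starting point in dom(f)
    stepsize : constant step t (e.g., 1/M for an M-smooth f)
    """
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        x = x - stepsize * grad_f(x)
    return x

# Example: minimize f(x) = 0.5 ||Ax - b||^2, whose gradient is A^T (Ax - b).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_hat = gradient_descent(lambda x: A.T @ (A @ x - b), np.zeros(2),
                         stepsize=1.0 / np.linalg.norm(A.T @ A, 2), num_iters=500)
```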
Smoothness and Strong Convexity

Def: $f$ is $\mu$-strongly convex if
$$f(x + \Delta x) \ge f(x) + \langle \nabla f(x), \Delta x\rangle + \frac{\mu}{2}\|\Delta x\|^2$$

Def: $f$ is $M$-smooth if
$$f(x + \Delta x) \le f(x) + \langle \nabla f(x), \Delta x\rangle + \frac{M}{2}\|\Delta x\|^2$$

Can be viewed as a condition on the directional 2nd derivatives:
$$\mu \;\le\; \frac{\partial^2}{\partial t^2} f(x + tv) \;=\; v^\top \nabla^2 f(x)\, v \;\le\; M \qquad (\text{for } \|v\| = 1)$$

Together they sandwich $f$ between two quadratics:
$$f(x) + \langle\nabla f(x),\Delta x\rangle + \frac{\mu}{2}\|\Delta x\|^2 \;\le\; f(x + \Delta x) \;\le\; f(x) + \langle\nabla f(x),\Delta x\rangle + \frac{M}{2}\|\Delta x\|^2$$
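A quick worked instance of these definitions (not on the original slide): for a quadratic, $\mu$ and $M$ are the extreme eigenvalues of the Hessian.
$$f(x) = \tfrac{1}{2}x^\top Q x - v^\top x,\quad Q \succeq 0 \;\implies\; \nabla^2 f(x) = Q,$$
so for $\|v\| = 1$ we have $v^\top \nabla^2 f(x)\, v = v^\top Q v \in [\lambda_{\min}(Q), \lambda_{\max}(Q)]$; hence $f$ is $\mu$-strongly convex and $M$-smooth with $\mu = \lambda_{\min}(Q)$, $M = \lambda_{\max}(Q)$, and condition number $\kappa = \lambda_{\max}(Q)/\lambda_{\min}(Q)$.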
What about constraints?

$$\min_x\ f(x) \quad \text{s.t. } x \in X$$

where $X$ is convex.
Projected Gradient Descent

Idea: make sure that points are feasible by projecting onto $X$.

Algorithm:
$$y^{(k+1)} = x^{(k)} - t^{(k)} g^{(k)}, \quad g^{(k)} \in \partial f(x^{(k)}) \qquad \text{(gradient step)}$$
$$x^{(k+1)} = \Pi_X\big(y^{(k+1)}\big) \qquad \text{(projection)}$$

The projection operator $\Pi_X$ onto $X$:
$$\Pi_X(x) = \arg\min_{z \in X} \|x - z\|$$

Notice: subgradient instead of gradient (even for differentiable functions).
Projected gradient descent convergence rate (#iterations for $\epsilon$-optimality):

        | $\mu$-s.c., $M$-smooth | $M$-smooth | $L$-Lipschitz | $L$-Lipschitz, $\mu$-s.c.
PGD     | $\kappa\log\frac{1}{\epsilon}$ | $\frac{M\|x^\star\|^2 + (f(x^{(1)}) - f(x^\star))}{\epsilon}$ | $\frac{L^2\|x^\star\|^2}{\epsilon^2}$ | $\frac{L^2}{\mu\epsilon}$

Same as the unconstrained case!

But it requires a projection... how expensive is that?

Examples (a code sketch for the first one follows below):
  Euclidean ball
  PSD constraints
  Linear constraints $Ax \le b$

Sometimes as expensive as solving the original optimization problem!
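Projection onto the Euclidean ball has a closed form (rescale any point outside the ball), so a minimal Python sketch of projected gradient descent is easy to write; the names and the test problem below are illustrative assumptions:

```python
import numpy as np

def project_onto_ball(y, radius=1.0):
    """Projection onto the Euclidean ball {z : ||z|| <= radius}: rescale if outside."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

def projected_gradient_descent(grad_f, project, x0, stepsize, num_iters):
    """x_{k+1} = Pi_X( x_k - t * g_k ), with g_k a (sub)gradient of f at x_k."""
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        y = x - stepsize * grad_f(x)   # gradient step
        x = project(y)                 # projection step
    return x

# Example: minimize ||x - c||^2 over the unit ball; the solution is c / max(1, ||c||).
c = np.array([2.0, 0.5])
x_hat = projected_gradient_descent(lambda x: 2 * (x - c), project_onto_ball,
                                   np.zeros(2), stepsize=0.1, num_iters=200)
```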
Conditional Gradient Descent

A projection-free algorithm! Introduced for QP by Marguerite Frank and Philip Wolfe (1956).

Algorithm:
  Initialize: $x^{(0)} \in X$
  Iterate:
    $s^{(k)} = \arg\min_{s \in X}\ \langle \nabla f(x^{(k)}),\, s \rangle$
    $x^{(k+1)} = x^{(k)} + t^{(k)}\big(s^{(k)} - x^{(k)}\big)$
Notice:
  $f$ assumed $M$-smooth
  $X$ assumed bounded
  First-order oracle
  Linear optimization (in place of projection)
  Sparse iterates (e.g., for polytope constraints)

Convergence rate: for $M$-smooth functions with step size $t^{(k)} = \frac{2}{k+1}$,
$$f(x^{(k)}) - f(x^\star) \le \frac{2MR^2}{k+1}, \qquad \text{where } R = \sup_{x,y \in X}\|x - y\|.$$

# iterations required for $\epsilon$-optimality: $\frac{MR^2}{\epsilon}$
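A minimal Python sketch of the algorithm, using the probability simplex as $X$ so that the linear subproblem reduces to picking the coordinate with the smallest partial derivative; the objective and names are illustrative assumptions:

```python
import numpy as np

def frank_wolfe(grad_f, linear_min_oracle, x0, num_iters):
    """Conditional gradient descent with the t_k = 2/(k+1) step size from the slide."""
    x = np.array(x0, dtype=float)
    for k in range(num_iters):
        s = linear_min_oracle(grad_f(x))   # s_k = argmin_{s in X} <grad f(x_k), s>
        t = 2.0 / (k + 2)                  # t_k = 2/(k+1), with the slide's k starting at 1
        x = x + t * (s - x)                # convex combination, so x stays inside X
    return x

def simplex_lmo(gradient):
    """Linear minimization over the probability simplex: the minimizing vertex is e_i
    for the coordinate i with the smallest gradient entry."""
    s = np.zeros_like(gradient)
    s[np.argmin(gradient)] = 1.0
    return s

# Example: minimize ||x - c||^2 over the simplex (here c already lies in the simplex).
c = np.array([0.1, 0.7, 0.2])
x_hat = frank_wolfe(lambda x: 2 * (x - c), simplex_lmo, np.ones(3) / 3, num_iters=300)
```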
Proof:
$$f(x^{(k+1)}) \le f(x^{(k)}) + \langle \nabla f(x^{(k)}),\, x^{(k+1)} - x^{(k)}\rangle + \frac{M}{2}\|x^{(k+1)} - x^{(k)}\|^2 \qquad \text{[smoothness]}$$
$$= f(x^{(k)}) + t^{(k)}\langle \nabla f(x^{(k)}),\, s^{(k)} - x^{(k)}\rangle + \frac{M}{2}(t^{(k)})^2\|s^{(k)} - x^{(k)}\|^2 \qquad \text{[update]}$$
$$\le f(x^{(k)}) + t^{(k)}\langle \nabla f(x^{(k)}),\, x^\star - x^{(k)}\rangle + \frac{M}{2}(t^{(k)})^2 R^2 \qquad \text{[definition of } s^{(k)},\ x^\star \in X,\ \|s^{(k)} - x^{(k)}\| \le R\text{]}$$
$$\le f(x^{(k)}) + t^{(k)}\big(f(x^\star) - f(x^{(k)})\big) + \frac{M}{2}(t^{(k)})^2 R^2 \qquad \text{[convexity]}$$

Define $\delta^{(k)} = f(x^{(k)}) - f(x^\star)$; we have
$$\delta^{(k+1)} \le \big(1 - t^{(k)}\big)\delta^{(k)} + \frac{M(t^{(k)})^2 R^2}{2}.$$

A simple induction shows that for $t^{(k)} = \frac{2}{k+1}$:
$$\delta^{(k)} \le \frac{2MR^2}{k+1}.$$

Same rate as projected gradient descent, but without projection! (It does need linear optimization.)
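The induction step invoked above, spelled out (a routine verification, not on the original slide): assuming $\delta^{(k)} \le \frac{2MR^2}{k+1}$ and using $t^{(k)} = \frac{2}{k+1}$,
$$\delta^{(k+1)} \le \Big(1 - \frac{2}{k+1}\Big)\frac{2MR^2}{k+1} + \frac{M}{2}\Big(\frac{2}{k+1}\Big)^2 R^2 = \frac{2MR^2(k-1)}{(k+1)^2} + \frac{2MR^2}{(k+1)^2} = \frac{2MR^2\,k}{(k+1)^2} \le \frac{2MR^2}{k+2},$$
where the last inequality holds because $k(k+2) \le (k+1)^2$.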
What about strong convexity? Not helpful! Does not give linear rate (κ log(1/ɛ)) Active research
Randomness in Convex Optimization

Insight: first-order methods are robust; inexact gradients are sufficient.
As long as the gradients are correct on average, the error will vanish.
Long history (Robbins & Monro, 1951).
Stochastic Gradient Descent

Motivation: many machine learning problems have the form of empirical risk minimization
$$\min_{x \in \mathbb{R}^n}\ \sum_{i=1}^m f_i(x) + \lambda\,\omega(x)$$
where the $f_i$ are convex and $\lambda$ is the regularization constant.

  Classification: SVM, logistic regression
  Regression: least-squares, ridge regression, LASSO

Cost of computing the gradient? $m \cdot n$
What if $m$ is VERY large? We want cheaper iterations.
Idea: use a stochastic first-order oracle: for each point $x \in \mathrm{dom}(f)$ it returns a stochastic gradient $\tilde g(x)$ such that
$$\mathbb{E}[\tilde g(x)] \in \partial f(x).$$
That is, $\tilde g$ is an unbiased estimator of the subgradient.

Example:
$$\min_{x \in \mathbb{R}^n}\ \frac{1}{m}\sum_{i=1}^m \underbrace{\big(f_i(x) + \lambda\,\omega(x)\big)}_{F_i(x)}$$
For this objective, select $j \in \{1,\dots,m\}$ uniformly at random and return $\nabla F_j(x)$. Then
$$\mathbb{E}[\tilde g(x)] = \frac{1}{m}\sum_i \nabla F_i(x) = \nabla f(x).$$
SGD iterates:
$$x^{(k+1)} = x^{(k)} - t^{(k)}\,\tilde g(x^{(k)})$$

How to choose the step size $t^{(k)}$?
  Lipschitz case: $t^{(k)} \propto \frac{1}{\sqrt{k}}$
  $\mu$-strongly-convex case: $t^{(k)} \propto \frac{1}{\mu k}$
Note: decaying step size!

Stochastic vs. deterministic methods: minimizing $g(\theta) = \frac{1}{n}\sum_{i=1}^n f_i(\theta)$ with $f_i(\theta) = \ell\big(y_i, \langle\theta, \Phi(x_i)\rangle\big) + \mu\Omega(\theta)$:
  Batch gradient descent: $\theta_t = \theta_{t-1} - \gamma_t\, g'(\theta_{t-1}) = \theta_{t-1} - \frac{\gamma_t}{n}\sum_{i=1}^n f_i'(\theta_{t-1})$
  Stochastic gradient descent: $\theta_t = \theta_{t-1} - \gamma_t\, f_{i(t)}'(\theta_{t-1})$
Goal = best of both worlds: a linear rate with $O(1)$ iteration cost.
[Figure: log(excess cost) vs. time for stochastic and deterministic methods; figures borrowed from Francis Bach's slides.]
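A minimal Python sketch of SGD with the decaying $1/\sqrt{k}$ step size and averaged iterates, on an illustrative finite-sum least-squares problem (all names and data below are assumptions for illustration, not from the slides):

```python
import numpy as np

def sgd(stochastic_grad, x0, num_iters, c=1.0):
    """SGD with decaying step size t_k = c / sqrt(k) (Lipschitz case); returns the average iterate."""
    x = np.array(x0, dtype=float)
    avg = np.zeros_like(x)
    for k in range(1, num_iters + 1):
        x = x - (c / np.sqrt(k)) * stochastic_grad(x)
        avg += (x - avg) / k          # running average of the iterates (used in the guarantees)
    return avg

# Example: min_x (1/m) sum_i 0.5 * (a_i^T x - b_i)^2; one random term gives an unbiased gradient.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(1000, 5)), rng.normal(size=1000)

def stochastic_grad(x):
    i = rng.integers(len(b))             # pick one example uniformly at random
    return (A[i] @ x - b[i]) * A[i]      # gradient of the single selected term

x_hat = sgd(stochastic_grad, np.zeros(5), num_iters=20000, c=0.1)
```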
Convergence rates (#iterations for $\epsilon$-optimality):

        | $\mu$-s.c., $M$-smooth | $M$-smooth | $L$-Lipschitz | $L$-Lipschitz, $\mu$-s.c.
GD      | $\kappa\log\frac{1}{\epsilon}$ | $\frac{M\|x^\star\|^2}{\epsilon}$ | $\frac{L^2\|x^\star\|^2}{\epsilon^2}$ | $\frac{L^2}{\mu\epsilon}$
SGD     | ? | ? | $\frac{B^2\|x^\star\|^2}{\epsilon^2}$ | $\frac{B^2}{\mu\epsilon}$

Additional assumption: $\mathbb{E}[\|\tilde g(x)\|^2] \le B^2$ for all $x \in \mathrm{dom}(f)$.

Comment: the bounds hold in expectation, for the averaged iterates:
$$\mathbb{E}\left[f\Big(\frac{1}{K}\sum_{k=1}^K x^{(k)}\Big)\right] - f(x^\star) \le \dots$$

Similar rates as with exact gradients!
        | $\mu$-s.c., $M$-smooth | $M$-smooth | $L$-Lipschitz | $L$-Lipschitz, $\mu$-s.c.
GD      | $\kappa\log\frac{1}{\epsilon}$ | $\frac{M\|x^\star\|^2}{\epsilon}$ | $\frac{L^2\|x^\star\|^2}{\epsilon^2}$ | $\frac{L^2}{\mu\epsilon}$
AGD     | $\sqrt{\kappa}\log\frac{1}{\epsilon}$ | $\sqrt{\frac{M\|x^\star\|^2}{\epsilon}}$ | |
SGD     | ? | $\left(\frac{\|x^\star\|\sigma}{\epsilon}\right)^2 + \sqrt{\frac{M\|x^\star\|^2}{\epsilon}}$ | $\frac{B^2\|x^\star\|^2}{\epsilon^2}$ | $\frac{B^2}{\mu\epsilon}$

where $\mathbb{E}[\|\nabla f(x) - \tilde g(x)\|^2] \le \sigma^2$.

Smoothness? Not helpful! (same rate as in the non-smooth case). Lower bounds (Nemirovski & Yudin, 1983). Active research.
Acceleration? Cannot be easily accelerated! Mini-batch acceleration. Active research.
Random Coordinate Descent

Recall: cost of computing the exact GD update: $m \cdot n$.
What if $n$ is VERY large? We want cheaper iterations.

Random coordinate descent algorithm:
  Initialize: $x^{(0)} \in \mathrm{dom}(f)$
  Iterate: pick $i(k) \in \{1,\dots,n\}$ randomly,
  $$x^{(k+1)} = x^{(k)} - t^{(k)}\, \nabla_{i(k)} f(x^{(k)})\, e_{i(k)}$$
  where we denote $\nabla_i f(x) = \frac{\partial f}{\partial x_i}(x)$.

Assumption: $f$ is convex and differentiable.
What if $f$ is not differentiable? Does coordinate descent still reach a minimizer?
A: No! Look at the counterexample in the figure.
[Figure: contour plot over $(x_1, x_2)$ of a non-differentiable convex $f$ for which coordinate-wise descent gets stuck at a non-optimal point; figures borrowed from Ryan Tibshirani's slides.]
Q: Same question again, but now $f(x) = g(x) + \sum_{i=1}^n h_i(x_i)$, with ...
Iteration cost? $\nabla_i f(x)$: $O(1)$. Compare to $\nabla f(x)$: $O(n)$ for GD.

Example: quadratic
$$f(x) = \frac{1}{2}x^\top Q x - v^\top x, \qquad \nabla f(x) = Qx - v, \qquad \nabla_i f(x) = q_i^\top x - v_i$$
(where $q_i$ is the $i$-th row of $Q$).

Can view CD as SGD with the oracle $\tilde g(x) = n\,\nabla_i f(x)\, e_i$ for a uniformly random $i$. Clearly,
$$\mathbb{E}[\tilde g(x)] = \frac{1}{n}\sum_i n\,\nabla_i f(x)\, e_i = \nabla f(x).$$

Can replace individual coordinates with blocks of coordinates.
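A minimal Python sketch of random coordinate descent on the quadratic example above; each step touches only one row of $Q$, so an iteration costs $O(n)$ rather than the $O(n^2)$ of a full gradient (the problem instance and names are illustrative assumptions):

```python
import numpy as np

def random_coordinate_descent(Q, v, x0, stepsize, num_iters, rng=None):
    """Minimize f(x) = 0.5 x^T Q x - v^T x by random coordinate descent.

    Each iteration picks a coordinate i uniformly at random and updates
    x_i <- x_i - t * (q_i^T x - v_i), i.e. only the i-th partial derivative is used."""
    rng = rng or np.random.default_rng()
    x = np.array(x0, dtype=float)
    n = len(x)
    for _ in range(num_iters):
        i = rng.integers(n)
        grad_i = Q[i] @ x - v[i]      # i-th partial derivative, O(n) work
        x[i] -= stepsize * grad_i
    return x

# Example: a small strongly convex quadratic; the minimizer solves Qx = v.
Q = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
v = np.array([1.0, 0.0, -1.0])
x_hat = random_coordinate_descent(Q, v, np.zeros(3), stepsize=0.2, num_iters=5000)
```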
Example: SVM

Primal:
$$\min_w\ \frac{\lambda}{2}\|w\|^2 + \sum_i \max\big(1 - y_i\, w^\top z_i,\ 0\big)$$

Dual:
$$\min_\alpha\ \frac{1}{2}\alpha^\top Q\alpha - \mathbf{1}^\top\alpha \quad \text{s.t. } 0 \le \alpha_i \le 1/\lambda$$
where $Q_{ij} = y_i y_j\, z_i^\top z_j$.

[Figures: comparing the primal sub-optimality of SDCA, SDCA-Perm, and SGD for the non-smooth hinge loss, on several datasets and values of $\lambda$ (Shalev-Shwartz & Zhang, 2013).]
Convergence rate

Directional smoothness for $f$: there exist $M_1,\dots,M_n$ such that for any $i \in \{1,\dots,n\}$, $x \in \mathbb{R}^n$, and $u \in \mathbb{R}$,
$$|\nabla_i f(x + u e_i) - \nabla_i f(x)| \le M_i\, |u|.$$
Note: this implies $f$ is $M$-smooth with $M \le \sum_i M_i$.

Consider the update:
$$x^{(k+1)} = x^{(k)} - \frac{1}{M_{i(k)}}\, \nabla_{i(k)} f(x^{(k)})\, e_{i(k)}$$
No need to know the $M_i$'s; they can be adjusted dynamically.
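A quick check of the directional smoothness constants on the quadratic example from the earlier slide (not on the original slide): for $f(x) = \frac{1}{2}x^\top Q x - v^\top x$ we have $\nabla_i f(x) = q_i^\top x - v_i$, so
$$|\nabla_i f(x + u e_i) - \nabla_i f(x)| = |q_i^\top (u e_i)| = Q_{ii}\,|u|,$$
i.e. $M_i = Q_{ii}$, while the global smoothness constant is $M = \lambda_{\max}(Q)$; this is consistent with $M \le \sum_i M_i$ since $\lambda_{\max}(Q) \le \mathrm{tr}(Q)$ for $Q \succeq 0$.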
Rates (Nesterov, 2012), #iterations for $\epsilon$-optimality:

        | $\mu$-s.c., $M$-smooth | $M$-smooth | $L$-Lipschitz | $L$-Lipschitz, $\mu$-s.c.
GD      | $\kappa\log\frac{1}{\epsilon}$ | $\frac{M\|x^\star\|^2}{\epsilon}$ | $\frac{L^2\|x^\star\|^2}{\epsilon^2}$ | $\frac{L^2}{\mu\epsilon}$
CD      | $n\bar\kappa\log\frac{1}{\epsilon}$, with $\bar\kappa = \frac{\sum_i M_i}{n\mu}$ | $\frac{\|x^\star\|^2 \sum_i M_i}{\epsilon}$ | |

Same total cost as GD, but with much cheaper iterations.

Comment: holds in expectation:
$$\mathbb{E}\big[f(x^{(k)})\big] - f^\star \le \dots$$

Acceleration? Yes! Active research.