Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds; Constrained Optimization

Lower Bounds

Some upper bounds (number of iterations to reach accuracy ε; here κ = M/μ and R = ‖x⁽¹⁾ − x*‖):

       | M-smooth, μ-strongly convex | M-smooth  | L-Lipschitz, μ-strongly convex | L-Lipschitz | Oracle/ops per iteration
GD     | κ log(1/ε)                  | MR²/ε     | L²/(με)                        | L²R²/ε²     | ∇f + O(n)
A-GD   | √κ log(1/ε)                 | √(MR²/ε)  |                                |             | ∇f + O(n)

Is this the best we can do? What if we allow O(n²) ops per iteration? YES, this is the best possible when using only a first-order oracle (gradients). History: Nemirovski & Yudin (1983); Nesterov (2004).

Lower Bounds

Black-box procedure: we ignore runtime and count only the number of oracle accesses. The method is a mapping from the past information $x^{(1)}, \nabla f(x^{(1)}), \ldots, x^{(s)}, \nabla f(x^{(s)})$ (with $\nabla f(x^{(s)}) \in \partial f(x^{(s)})$) to the next point $x^{(s+1)}$. This model is not satisfied by, e.g., Newton's method (which uses $\nabla^2 f$).

Lower Bounds

Theorem: For $t \le n$ there exists a $\mu$-strongly convex and $L$-Lipschitz function $f$ such that for any algorithm with access only to a first-order oracle,
$$\min_{1 \le s \le t} f(x^{(s)}) - \min_{x:\,\|x\| \le \frac{L}{2\mu}} f(x) \;\ge\; \frac{L^2}{8\mu t}.$$
This implies $\Omega(1/\varepsilon)$ oracle calls.

Lower Bounds

Proof intuition: Play a game — you (the method) pick a point $x^{(s)}$ to query the oracle, and I (the adversary) provide the answer. I can make sure that there always exists an $f$ consistent with all previous answers (gradients), but whose $\varepsilon$-optimal solutions lie in a region you have not queried yet. Analogous examples: the "20 questions"-style number-guessing game, and the $\Omega(n \log n)$ lower bound for sorting — choose the answers on the fly, but consistently.

Lower Bounds

Proof: Consider the $\mu$-strongly convex function
$$f(x) = \gamma \max_{1 \le i \le t} x_i + \frac{\mu}{2}\|x\|^2.$$
The subdifferential is then
$$\partial f(x) = \mu x + \gamma\,\mathrm{conv}\{e_i :\, x_i = \max_{1 \le j \le t} x_j\}.$$
The first-order oracle returns $\mu x + \gamma e_i$, where $i$ is the first coordinate for which $x_i = \max_{1 \le j \le t} x_j$.
Assume w.l.o.g. that $x^{(1)} = 0$ and that for all $s \ge 0$ we have $x^{(s+1)} \in \mathrm{Span}\{\nabla f(x^{(1)}), \ldots, \nabla f(x^{(s)})\}$.

Lower Bounds

Proof (continued): At $x^{(1)} = 0$ the oracle returns $\gamma e_1$, so $x^{(2)}$ must lie on the line spanned by $e_1$. By induction, $x^{(s)}$ lies in the linear span of $e_1, \ldots, e_{s-1}$ (the oracle's tie-breaking keeps its answers consistent with all previous choices). Therefore $x^{(s)}_i = 0$ for all $i \ge s$, and it follows that $f(x^{(s)}) \ge 0$.

We next construct a vector $y$ with $f(y) = -\frac{\gamma^2}{2\mu t}$. Consider
$$y = \Big(\underbrace{-\tfrac{\gamma}{\mu t}, \ldots, -\tfrac{\gamma}{\mu t}}_{\text{coordinates } 1,\ldots,t},\; \underbrace{0, \ldots, 0}_{\text{coordinates } t+1,\ldots,n}\Big).$$
Notice that $0 \in \partial f(y)$ (take the uniform average of $e_1, \ldots, e_t$ in the subdifferential). Thus $y$ is a (global) minimizer, with value
$$f(y) = \gamma\Big(-\frac{\gamma}{\mu t}\Big) + \frac{\mu}{2}\, t \Big(\frac{\gamma}{\mu t}\Big)^2 = -\frac{\gamma^2}{\mu t} + \frac{\gamma^2}{2\mu t} = -\frac{\gamma^2}{2\mu t}.$$
Thus $f(x^{(s)}) - f(y) \ge \frac{\gamma^2}{2\mu t}$, and taking $\gamma = L/2$ completes the proof.
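To make the construction concrete, here is a minimal Python sketch (not from the lecture) that runs subgradient descent against this hard instance with the first-maximizer oracle; the dimension, constants, and step size are made-up choices for illustration:

```python
import numpy as np

n, mu, L = 200, 1.0, 1.0
gamma = L / 2.0          # the choice of gamma from the proof
t = 100                  # number of oracle calls; requires t <= n

def f(x):
    return gamma * np.max(x[:t]) + 0.5 * mu * np.dot(x, x)

def oracle(x):
    """Subgradient mu*x + gamma*e_i, i = first maximizing coordinate."""
    i = int(np.argmax(x[:t]))        # np.argmax returns the *first* argmax
    g = mu * x.copy()
    g[i] += gamma
    return g

x = np.zeros(n)                      # x^(1) = 0
best = np.inf
for s in range(1, t + 1):
    best = min(best, f(x))           # track the min over queried points
    g = oracle(x)
    x = x - (2.0 / (mu * (s + 1))) * g   # standard strongly convex step size

f_star = -gamma**2 / (2 * mu * t)    # value of the minimizer y from the proof
print(best - f_star, L**2 / (8 * mu * t))   # observed gap vs. the lower bound
```

Since subgradient descent satisfies the span assumption, the printed gap stays above $L^2/(8\mu t)$, exactly as the theorem guarantees.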

Lower Bounds

Proof without the assumptions (on the span condition and on $x^{(1)} = 0$): Consider the $\mu$-strongly convex function
$$f(x) = \gamma \max_i \langle x, v_i \rangle + \frac{\mu}{2}\|x - x_0\|^2.$$
The adversary decides $x_0$ and the $v_i$ on the fly, but in time to answer each oracle query: $x_0$ is set to $x^{(1)}$ (the first point queried), and $v_s$ is set to any vector orthogonal to $x^{(1)}, \ldots, x^{(s-1)}$. If the space is large ($n \gg t$), such a vector always exists. The minimizer $y$ in this case is $y = -\frac{\gamma}{\mu t} \sum_i v_i$.

Lower Bounds

We showed bounds for non-smooth $\mu$-strongly convex functions; similar bounds exist for the smooth and/or merely convex cases. Summary of lower bounds with a first-order oracle in high dimension (κ = M/μ, R = ‖x⁽¹⁾ − x*‖):

            | M-smooth, μ-strongly convex | M-smooth    | L-Lipschitz, μ-strongly convex | L-Lipschitz
GD          | κ log(1/ε)                  | MR²/ε       | L²/(με)                        | L²R²/ε²
A-GD        | √κ log(1/ε)                 | √(MR²/ε)    |                                |
Lower bound | Ω(√κ log(1/ε))              | Ω(√(MR²/ε)) | Ω(L²/(με))                     | Ω(L²R²/ε²)

Matching the upper bounds (up to constants)! Subgradient descent is optimal for the non-smooth cases, and accelerated GD for the smooth cases (optimality is not attained by plain GD).

Lower Bounds

What about low dimension — do the bounds still hold? NO! For example, the center-of-mass method needs only $O(n \log \frac{1}{\varepsilon})$ oracle calls, and it too uses just gradients; a similar construction shows a matching lower bound of $\Omega(n \log \frac{1}{\varepsilon})$. BUT its cost per iteration is much higher: oracle complexity ≠ runtime complexity, and understanding runtime is much harder (as in general complexity theory).

Why lower bounds are important:
- They tell us what we shouldn't try.
- We know when we are doing the best possible.
- They tell us what we can hope for (the motivation for A-GD).
- They improve our understanding of the problem.

Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds; Constrained Optimization

Constrained Optimization Problems

$$\text{(P)} \quad \min_{x \in \mathbb{R}^n} f_0(x) \quad \text{s.t.} \quad f_i(x) \le 0, \;\; i = 1, \ldots, m$$
where $f_0, f_1, \ldots, f_m : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ are convex.
We can easily introduce equality constraints: $h_j(x) = 0$, $j = 1, \ldots, p$ is equivalent to the pair of inequalities $h_j(x) \le 0$ and $-h_j(x) \le 0$. Since we require convexity of both, $h_j$ must be affine, so the equality constraints take the form $Ax = b$ with $A \in \mathbb{R}^{p \times n}$, $b \in \mathbb{R}^p$.

Example: Linear Programming

$$\min_x \; c^\top x \quad \text{s.t.} \quad Gx \le h, \;\; Ax = b$$
The feasible set is a polyhedron. The problem can be unbounded only if the feasible set is unbounded (but an unbounded feasible set does not always make the problem unbounded).

Example: Linear Programming — Max-flow

Vertices $1, \ldots, n$; capacity $C_{ij}$ on each edge $i \to j$; source node $1$, sink node $n$.
$$\max \sum_i X_{1i} \quad \text{s.t.} \quad 0 \le X_{ij} \le C_{ij} \;\; \forall\, ij, \qquad \sum_j X_{ji} = \sum_j X_{ij}, \;\; i = 2, \ldots, n-1$$
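As a sanity check (not part of the slides), this LP can be handed directly to an off-the-shelf solver; here is a sketch using scipy.optimize.linprog on a small made-up 4-node graph:

```python
import numpy as np
from scipy.optimize import linprog

n = 4
# capacities C[i][j] on edges i -> j (0-indexed: source = 0, sink = n-1)
edges = {(0, 1): 3.0, (0, 2): 2.0, (1, 2): 1.0, (1, 3): 2.0, (2, 3): 3.0}
idx = {e: k for k, e in enumerate(edges)}        # edge -> variable index

# objective: maximize flow out of the source = minimize its negation
c = np.zeros(len(edges))
for (i, j), k in idx.items():
    if i == 0:
        c[k] = -1.0

# flow conservation at internal nodes i = 1, ..., n-2: inflow - outflow = 0
A_eq = np.zeros((n - 2, len(edges)))
for (i, j), k in idx.items():
    if 1 <= j <= n - 2:
        A_eq[j - 1, k] += 1.0                    # inflow to node j
    if 1 <= i <= n - 2:
        A_eq[i - 1, k] -= 1.0                    # outflow from node i
b_eq = np.zeros(n - 2)

# capacity constraints as variable bounds: 0 <= X_ij <= C_ij
bounds = [(0.0, edges[e]) for e in idx]

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print("max flow:", -res.fun)                     # 5.0 for this graph
```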

Example: Linear Programming

Piecewise-linear minimization:
$$\min_x \; \max_{1 \le i \le m} \big(a_i^\top x + b_i\big)$$
Equivalent LP:
$$\min_{x,\,t} \; t \quad \text{s.t.} \quad a_i^\top x + b_i \le t \;\; \forall i$$
Non-smooth unconstrained => smooth constrained.
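A minimal sketch of this epigraph reformulation (not from the slides), with made-up data a, b and scipy.optimize.linprog as the solver:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 5, 2
a, b = rng.normal(size=(m, n)), rng.normal(size=m)

# variables z = (x, t); minimize t
c = np.zeros(n + 1)
c[-1] = 1.0

# a_i^T x + b_i <= t   <=>   a_i^T x - t <= -b_i
A_ub = np.hstack([a, -np.ones((m, 1))])
res = linprog(c, A_ub=A_ub, b_ub=-b, bounds=[(None, None)] * (n + 1))
x_opt, t_opt = res.x[:n], res.x[-1]
print(t_opt, np.max(a @ x_opt + b))   # the two values agree at the optimum
```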

Example: Linear Programming

$\ell_1$-norm minimization:
$$\min_x \; \sum_{i=1}^m \big|a_i^\top x - b_i\big|$$
Equivalent LP ($2m$ constraints):
$$\min_{x,\,t} \; \sum_{i=1}^m t_i \quad \text{s.t.} \quad -t_i \le a_i^\top x - b_i \le t_i, \;\; i = 1, \ldots, m$$
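The same pattern for the $\ell_1$ problem — a sketch with made-up data (not from the slides); note the $2m$ rows of A_ub matching the $2m$ constraints:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 8, 3
A, b = rng.normal(size=(m, n)), rng.normal(size=m)

c = np.hstack([np.zeros(n), np.ones(m)])          # minimize sum_i t_i
I = np.eye(m)
A_ub = np.vstack([np.hstack([A, -I]),             #  a_i^T x - b_i <= t_i
                  np.hstack([-A, -I])])           # -(a_i^T x - b_i) <= t_i
b_ub = np.hstack([b, -b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + m))
x_opt = res.x[:n]
print(res.fun, np.abs(A @ x_opt - b).sum())       # optimal l1 residual, twice
```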

Example: Linear Programming

Feasibility problem: find $x$ s.t. $f_i(x) \le 0$, $i = 1, \ldots, m$ and $h_j(x) = 0$, $j = 1, \ldots, p$. Equivalent program with zero objective:
$$\min_x \; 0 \quad \text{s.t.} \quad f_i(x) \le 0, \; i = 1, \ldots, m, \qquad h_j(x) = 0, \; j = 1, \ldots, p$$
Example — linear separation: find $w$ s.t. $w^\top x_i \ge 1$ for $x_i \in \chi_+$ and $w^\top x_i \le -1$ for $x_i \in \chi_-$.

Example: Quadratic Programming

Quadratic program:
$$\min_x \; \tfrac{1}{2} x^\top P x + q^\top x \quad \text{s.t.} \quad Gx \le h, \;\; Ax = b$$
$P$ is PSD => the problem is convex.
Example — least squares with linear constraints:
$$\min_x \; \|Ax - b\|_2^2 \quad \text{s.t.} \quad l \le x \le u$$
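For the box-constrained least-squares instance, SciPy ships a dedicated solver; a minimal sketch with made-up A, b, l, u (scipy.optimize.lsq_linear reports cost = ½‖Ax − b‖₂²):

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)
A, b = rng.normal(size=(10, 4)), rng.normal(size=10)
l, u = -0.5 * np.ones(4), 0.5 * np.ones(4)

res = lsq_linear(A, b, bounds=(l, u))   # solves min ||Ax-b||^2 s.t. l <= x <= u
print(res.x, res.cost)                  # res.cost = (1/2) ||A res.x - b||^2
```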

Optimality Condition

$x$ is optimal for (P) iff it is feasible and
$$\nabla f_0(x)^\top (y - x) \ge 0 \quad \text{for all feasible } y.$$
Two options: either $x$ is in the interior of the feasible set, so $\nabla f_0(x) = 0$; or $x$ is on the boundary, in which case $\nabla f_0(x) \ne 0$ and $\{y : \nabla f_0(x)^\top (y - x) = 0\}$ is a supporting hyperplane of the feasible set. This generalizes the unconstrained optimality condition $\nabla f_0(x) = 0$. BUT: we want a more local optimality condition, one that does not require checking all feasible $y$.
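A quick sanity check of the boundary case (not on the slides): take $f_0(x) = x^2$ with feasible set $[1, 2]$.

```latex
\nabla f_0(1) = 2 \neq 0,
\qquad
\nabla f_0(1)\,(y - 1) = 2(y - 1) \ge 0 \quad \text{for all feasible } y \in [1, 2],
```

so $x = 1$ is optimal even though $\nabla f_0(x) = 0$ fails; the set $\{y : 2(y-1) = 0\}$ is the supporting hyperplane at $x = 1$.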

Lagrange Multipliers

$$\text{(P)} \quad \min_{x \in \mathbb{R}^n} f_0(x) \quad \text{s.t.} \quad f_i(x) \le 0, \; i = 1, \ldots, m; \qquad h_j(x) = 0, \; j = 1, \ldots, p$$
Claim:
$$p^\star = \inf_{x \in \mathbb{R}^n} \; \sup_{\lambda \ge 0,\, \nu} \; \Big[ f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{j=1}^p \nu_j h_j(x) \Big]$$
WHY?

Lagrange Multipliers

$$p^\star = \inf_{x \in \mathbb{R}^n} \; \sup_{\lambda \ge 0,\, \nu} \; \underbrace{\Big[ f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{j=1}^p \nu_j h_j(x) \Big]}_{L(x, \lambda, \nu) \text{ — the Lagrangian}}$$
If $x$ is feasible, the sup is attained at $\lambda = 0$ (and any $\nu$, since all $h_j(x) = 0$), and we get $f_0(x)$. If $x$ is infeasible, then: if $f_i(x) > 0$, take $\lambda_i \to \infty$; if $h_j(x) > 0$, take $\nu_j \to \infty$; if $h_j(x) < 0$, take $\nu_j \to -\infty$. Therefore
$$\sup_{\lambda \ge 0,\, \nu} L(x, \lambda, \nu) = \begin{cases} f_0(x) & x \text{ is feasible} \\ \infty & \text{otherwise.} \end{cases}$$
$\lambda, \nu$ are called Lagrange multipliers.

Lagrange Duality

Claim (without any assumptions):
$$\sup_{\lambda \ge 0,\, \nu} \; \inf_x \; L(x, \lambda, \nu) \;\le\; \inf_x \; \sup_{\lambda \ge 0,\, \nu} \; L(x, \lambda, \nu) = p^\star$$
Proof: for all $\lambda \ge 0$ and $\nu$ we have
$$\inf_x L(x, \lambda, \nu) \;\le\; \inf_{x \text{ feasible}} \Big[ f_0(x) + \underbrace{\sum_i \lambda_i f_i(x)}_{\le 0} + \underbrace{\sum_j \nu_j h_j(x)}_{= 0} \Big] \;\le\; \inf_{x \text{ feasible}} f_0(x) = p^\star.$$
Since this holds for all $\lambda \ge 0, \nu$, we get $\sup_{\lambda \ge 0,\, \nu} \inf_x L(x, \lambda, \nu) \le p^\star$.

The Dual

Define the dual objective function:
$$g(\lambda, \nu) = \inf_x L(x, \lambda, \nu)$$
The (Lagrange) dual problem is
$$\text{(D)} \quad \max_{\lambda,\, \nu} \; g(\lambda, \nu) \quad \text{s.t.} \quad \lambda_i \ge 0, \; i = 1, \ldots, m$$
We denote by $d^\star$ its optimal value, and by $(\lambda^\star, \nu^\star)$ a dual optimum (if it exists).
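As a worked example (not on the slides), plugging the earlier LP $\min_x c^\top x$ s.t. $Gx \le h$, $Ax = b$ into this definition gives a closed-form dual objective:

```latex
L(x, \lambda, \nu)
  = c^\top x + \lambda^\top (Gx - h) + \nu^\top (Ax - b)
  = \big(c + G^\top \lambda + A^\top \nu\big)^\top x - \lambda^\top h - \nu^\top b
\quad\Longrightarrow\quad
g(\lambda, \nu) =
\begin{cases}
  -\lambda^\top h - \nu^\top b, & c + G^\top \lambda + A^\top \nu = 0, \\
  -\infty, & \text{otherwise,}
\end{cases}
```

since a nonzero linear function of $x$ has infimum $-\infty$; the dual (D) is then itself an LP.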

Weak Duality

Theorem (weak duality): $d^\star \le p^\star$. This holds even if $f_0, f_i, h_j$ are not convex!
Strong duality: $d^\star = p^\star$, that is,
$$d^\star = \sup_{\lambda \ge 0,\, \nu} \; \inf_x \; L(x, \lambda, \nu) \;=\; \inf_x \; \sup_{\lambda \ge 0,\, \nu} \; L(x, \lambda, \nu) = p^\star,$$
which means we can swap the sup and the inf. WHEN does this hold?