How hard is this function to optimize?

1 How hard is this function to optimize?
John Duchi. Based on joint work with Sabyasachi Chatterjee, John Lafferty, Yuancheng Zhu.
Stanford University. West Coast Optimization Rumble, October 2016.

3 Problem
minimize $f(x) := \mathbb{E}[F(x;\xi)]$ subject to $x \in X$, (1)
where $x \mapsto F(x;\xi)$ is convex and $X$ is a closed convex set.
Two questions:
1. How hard is it to solve problem (1) for that particular $f$?
2. Can an algorithm (or algorithms) do as well as possible for each $f$?

4 Outline
Part 0: Complexity of problems
Part I: Complexity lower bounds (general lower bounds, super-efficiency)
Part II: Toward achievability?
Part III: Problem geometry and dimensionality

6 Problem Complexity
Large literature on guarantees of optimality:
Wald 1939, Contributions to the theory of statistical estimation and testing hypotheses (minimax complexity)
Nemirovski and Yudin 1983, Problem Complexity and Method Efficiency in Optimization (information-based complexity)
Three main considerations:
i. Information oracle (how we get information on the problem)
ii. Problem class (what problems the algorithm must solve)
iii. How to measure error

8 Minimax error and oracles
Information oracle: how the algorithm/procedure receives information about the problem.
Optimization: function value $f(x)$ (zero-order), gradient $\nabla f(x)$ (first-order), Hessian $\nabla^2 f(x)$ (second-order)
Statistics: observations $\xi_i$ from probability distribution $P$
Minimax principle: develop an algorithm that has the best worst-case performance (risk) for problem class $\mathcal{F}$:
$R_N(\mathcal{F}) := \inf_{A \in \mathcal{A}_N} \sup_{f \in \mathcal{F}} \mathrm{error}(A, f)$,
where $\mathcal{A}_N$ is the set of algorithms with $N$ queries and $\mathcal{F}$ is the problem class.

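11 Example
$\mathcal{F}$ consists of 1-Lipschitz convex functions on $\mathbb{R}^d$.
Oracle returns $f(x) + \varepsilon$, where $\mathbb{E}[\varepsilon] = 0$ and $\|\varepsilon\| \le 1$.
Domain $X$ contains an $\ell_\infty$-ball.
Minimax rate: let $\hat{x}_N$ be the output of algorithm $A$ after $N$ steps. Then
$\sqrt{d/N} \lesssim \inf_{A \in \mathcal{A}_N} \sup_{f \in \mathcal{F}} \mathbb{E}[f(\hat{x}_N) - f(x^*)]$
(Agarwal et al. 12, Nemirovski & Yudin 83).
More general optimality guarantee: for $L$-Lipschitz convex functions on sets $X$ with diameter $D$, the minimax rate is $LD/\sqrt{N}$.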

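14 An optimal? algorithm
Algorithm: at iteration $t$,
choose random $\xi_i$, set $g_t = \nabla F(x_t; \xi_i)$;
update $x_{t+1} = x_t - \alpha_t g_t$.
Theorem (Russians / Hungarians): Let $\bar{x}_N = \frac{1}{N} \sum_{t=1}^N x_t$ and assume $D \ge \|x^* - x_1\|_2$, $L^2 \ge \mathbb{E}[\|g_t\|_2^2]$. Then
$\mathbb{E}[f(\bar{x}_N) - f(x^*)] \le \frac{LD}{\sqrt{N}}$.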

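17 Stochastic gradient descent
Let's use stochastic gradient descent to solve
minimize over $x$: $f(x) = \frac{1}{2} x^2$ subject to $x \in [-1, 1]$.
[Figure: stochastic gradient descent with $1/\sqrt{t}$-steps and with $1/t$-steps; legend entries $1/\sqrt{t}$ and $1/t$]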
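The comparison on this slide can be reproduced roughly with a short simulation. The sketch below is not the code behind the slide's figure: the Gaussian gradient noise with `sigma=1.0`, the starting point, the use of the last iterate, and the names `projected_sgd` and `mean_risk` are all assumptions made for illustration.

```python
import numpy as np

def projected_sgd(stepsize, n_steps, sigma=1.0, x0=1.0, seed=0):
    """Projected SGD on f(x) = x^2 / 2 over [-1, 1]; returns the last iterate."""
    # Assumed noise model: i.i.d. Gaussian noise added to the true gradient.
    noise = sigma * np.random.default_rng(seed).standard_normal(n_steps)
    x = x0
    for t in range(1, n_steps + 1):
        g = x + noise[t - 1]                          # noisy gradient, since f'(x) = x
        x = min(max(x - stepsize(t) * g, -1.0), 1.0)  # gradient step, then project
    return float(x)

def mean_risk(stepsize, n_steps, n_trials=100):
    """Average of f(x_N) - f(x*) = x_N^2 / 2 over independent runs."""
    finals = [projected_sgd(stepsize, n_steps, seed=s) for s in range(n_trials)]
    return float(np.mean(0.5 * np.square(finals)))

risk_sqrt = mean_risk(lambda t: t ** -0.5, 1000)  # 1/sqrt(t) stepsizes
risk_inv = mean_risk(lambda t: 1.0 / t, 1000)     # 1/t stepsizes
print(risk_sqrt, risk_inv)
```

Because this particular $f$ is strongly convex, one would expect the $1/t$ schedule to reach a noticeably lower risk than the $1/\sqrt{t}$ schedule in this setup, even though $1/\sqrt{t}$ stepsizes are the ones that match the worst-case $LD/\sqrt{N}$ guarantee.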

19 A local notion of complexity
Minimax complexity:
$R_N(\mathcal{F}) := \inf_{A \in \mathcal{A}_N} \sup_{f \in \mathcal{F}} \mathrm{error}(A, f)$
Local minimax complexity: fix a function $f$, and look for the hardest local alternative:
$R_N(f; \mathcal{F}) := \sup_{g \in \mathcal{F}} \inf_{A \in \mathcal{A}_N} \max\{\mathrm{error}(A, f), \mathrm{error}(A, g)\}$
(Related ideas in statistics: Donoho & Liu 1987, 1991, Cai & Low 2015)
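To see how the two definitions differ, it can help to evaluate them on a tiny finite problem. The sketch below uses a made-up error table with three hypothetical algorithms and three hypothetical functions; the numbers and the names `errors`, `minimax`, and `local_minimax` are illustrative assumptions, not values from the talk.

```python
import numpy as np

# Hypothetical error table: rows are algorithms A, columns are functions in F.
errors = np.array([
    [0.1, 0.9, 0.5],
    [0.6, 0.2, 0.5],
    [0.3, 0.7, 0.3],
])

def minimax(err):
    """inf over algorithms of the sup over functions of error(A, f)."""
    return float(err.max(axis=1).min())

def local_minimax(err, f):
    """sup over g of the inf over A of max(error(A, f), error(A, g))."""
    pairwise = np.maximum(err[:, [f]], err)  # entry (A, g) = max(err(A,f), err(A,g))
    return float(pairwise.min(axis=0).max())

global_risk = minimax(errors)
local_risks = [local_minimax(errors, f) for f in range(errors.shape[1])]
print(global_risk, local_risks)
```

The local quantity never exceeds the global one, since the supremum over a single alternative $g$ sits outside the infimum over algorithms; in this table the third function is strictly easier locally than the worst case over the whole class.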

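21 Local minimax complexity for stochastic optimization
Noisy subgradient oracle: $\xi \sim N(0, \sigma^2 I_{d \times d})$, return $f'(x) + \xi$
Function class $\mathcal{F}$: convex functions (can restrict)
Algorithms $\mathcal{A}_N$: all algorithms with $N$ noisy subgradient queries
Error metric $\mathrm{err} : X \times \mathcal{F} \to \mathbb{R}$
Local minimax complexity ($\hat{x}_A$ is the output of algorithm $A$):
$R_N(f; \mathcal{F}) := \sup_{g \in \mathcal{F}} \inf_{A \in \mathcal{A}_N} \max\{\mathbb{E}_f[\mathrm{err}(\hat{x}_A, f)], \mathbb{E}_g[\mathrm{err}(\hat{x}_A, g)]\}$.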

22 Distances on functions and moduli of continuity
Distance for solutions: the error must satisfy the exclusion inequality
$\mathrm{err}(x, f) < d(f, g)$ implies $\mathrm{err}(x, g) \ge d(f, g)$.
Example: how well one or the other can be optimized, with error $\mathrm{err}(x, f) = f(x) - f^*$ and
$d(f_0, f_1) := \sup\{\delta \ge 0 : f_1(x) \le f_1^* + \delta \text{ implies } f_0(x) \ge f_0^* + \delta, \text{ and } f_0(x) \le f_0^* + \delta \text{ implies } f_1(x) \ge f_1^* + \delta\}$.

23 Separation
[Figure: functions $f_0$ and $f_1$, with $\delta = d(f_0, f_1)$ and the set $\{x : f_1(x) \le f_1^* + \delta\}$ marked]

24 Other error metrics
Distance for solutions: the error must satisfy the exclusion inequality
$\mathrm{err}(x, f) < d(f, g)$ implies $\mathrm{err}(x, g) \ge d(f, g)$.
Example: distance to optimality. With $X_f^* := \arg\min_{x \in X} f(x)$ and $X_g^* := \arg\min_{x \in X} g(x)$,
$d(f, g) = \mathrm{dist}(X_f^*, X_g^*)$.

25 Distances on functions
We study first-order methods, so define
$\kappa(f, g) := \sup_{x \in X} \|f'(x) - g'(x)\|$.

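29 Modulus of continuity
Given solution metric $d$ and function metric $\kappa$, the modulus of continuity of $d$ with respect to $\kappa$ at $f$ is
$\omega_f(\epsilon) := \sup_g \{d(f, g) : \kappa(f, g) \le \epsilon\}$.
Examples, with $d(f, g) = |x_f^* - x_g^*|$:
Quadratic $f(x) = \frac{1}{2} x^2$: $\omega_f(\epsilon) = \epsilon$
Power $f(x) = \frac{1}{k} |x|^k$: $\omega_f(\epsilon) = \epsilon^{\frac{1}{k-1}}$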
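The power-function example can be checked by hand. The short derivation below is a sketch under assumptions not spelled out on the slide: $X = \mathbb{R}$, $k > 1$, and the alternatives $g$ are convex and differentiable. It needs the `amsmath` package to typeset.

```latex
% Assumptions (not from the slide): X = \mathbb{R}, k > 1, g convex and differentiable.
\begin{align*}
f(x) &= \tfrac{1}{k}|x|^k, \qquad f'(x) = \operatorname{sign}(x)\,|x|^{k-1}, \qquad x_f^* = 0. \\
\text{Upper bound: } & g'(x_g^*) = 0 \text{ and } \kappa(f,g) \le \epsilon
  \;\Longrightarrow\; |x_g^*|^{k-1} = |f'(x_g^*) - g'(x_g^*)| \le \epsilon
  \;\Longrightarrow\; |x_g^*| \le \epsilon^{\frac{1}{k-1}}. \\
\text{Attained by: } & g(x) = f(x) - \epsilon x, \quad \kappa(f,g) = \epsilon, \quad
  g'(x) = 0 \iff x = \epsilon^{\frac{1}{k-1}}. \\
\text{Hence } & \omega_f(\epsilon) = \epsilon^{\frac{1}{k-1}},
  \text{ and } k = 2 \text{ recovers the quadratic case } \omega_f(\epsilon) = \epsilon.
\end{align*}
```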

30 Illustration of modulus of continuity
If $X \subset \mathbb{R}$,
$\omega_f(\epsilon) = \sup_x \{|x - x_f^*| : |f'(x)| \le \epsilon\}$.
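The one-dimensional formula is easy to evaluate numerically. The sketch below approximates it on a grid for the power functions $f(x) = \frac{1}{k}|x|^k$ and compares against $\epsilon^{1/(k-1)}$; the grid range, the value of `eps`, and the helper name `modulus_1d` are illustrative assumptions.

```python
import numpy as np

def modulus_1d(fprime, x_star, eps, grid):
    """Approximate sup{|x - x_star| : |f'(x)| <= eps} over a grid of points."""
    flat = grid[np.abs(fprime(grid)) <= eps]   # grid points in the flat set
    return float(np.max(np.abs(flat - x_star))) if flat.size else 0.0

grid = np.linspace(-2.0, 2.0, 400001)  # spacing 1e-5
eps = 0.01
results = {}
for k in (2, 3, 4):
    # Derivative of f(x) = |x|^k / k, whose minimizer is x* = 0.
    fprime = lambda x, k=k: np.sign(x) * np.abs(x) ** (k - 1)
    results[k] = modulus_1d(fprime, 0.0, eps, grid)
    print(k, results[k], eps ** (1.0 / (k - 1)))
```

The flatter the function is near its minimizer (larger $k$), the wider the flat set and the larger the modulus for the same $\epsilon$.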

32 Main Theorem I
Theorem (Chatterjee, D., Lafferty, Zhu): Under Gaussian $\sigma^2$-noise, for (almost) any $f$,
$c\, \omega_f(\sigma/\sqrt{N}) \le R_N(f; \mathcal{F}) \le C\, \omega_f(\sigma/\sqrt{N})$.
The modulus of continuity precisely determines the rate of convergence.

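33 Proof intuition
If $P_f$ = distribution when $f$ is true and $P_g$ = distribution when $g$ is true, then
$\max\{\mathbb{E}_f[\mathrm{err}(\hat{x}, f)], \mathbb{E}_g[\mathrm{err}(\hat{x}, g)]\} \ge \frac{1}{2} \mathbb{E}_f[\mathrm{err}(\hat{x}, f)] + \frac{1}{2} \mathbb{E}_g[\mathrm{err}(\hat{x}, g)]$.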

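34 Proof intuition
If $P_f$ = distribution when $f$ is true and $P_g$ = distribution when $g$ is true, then
$\max\{\mathbb{E}_f[\mathrm{err}(\hat{x}, f)], \mathbb{E}_g[\mathrm{err}(\hat{x}, g)]\} \ge \frac{d(f, g)}{2} \left[1 + P_g(\mathrm{err}(\hat{x}, f) < d(f, g)) - P_f(\mathrm{err}(\hat{x}, g) \ge d(f, g))\right]$.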

36 Is this real?
$c\, \omega_f(\sigma/\sqrt{N}) \le R_N(f; \mathcal{F}) \le C\, \omega_f(\sigma/\sqrt{N})$.
Is it possible to have a faster algorithm? (Were we too adversarial?)
Is it too easy? (Were we not adversarial enough?)

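37 You probably cannot be faster
Theorem (Chatterjee, D., Lafferty, Zhu): Let an algorithm $\hat{x}_N$ satisfy
$\mathbb{E}_f[\mathrm{err}(\hat{x}_N, f)] \le \delta\, \omega_f(\sigma/\sqrt{N})$.
Then for $\epsilon_N = \sigma \sqrt{\log(1/\delta)/N}$ and $g$ one of $g_{-1}(x) = f(x) - \epsilon_N x$, $g_1(x) = f(x) + \epsilon_N x$,
$\mathbb{E}_g[\mathrm{err}(\hat{x}_N, g)] \ge c\, \omega_g\left(\sigma \sqrt{\log(1/\delta)/N}\right)$.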

38 Is the local problem too easy? Not in 1-dimension!
Sign-based binary search: from interval $[a_0, b_0]$, for $k = 1, 2, \ldots$
Query $T$ times at the point $x_k = \frac{1}{2}(a + b)$ for gradients $G_1, \ldots, G_T$
If $\sum_{t=1}^T G_t > 0$, set $b = x_k$; otherwise set $a = x_k$
Proposition (Chatterjee, D., Lafferty, Zhu): After $k$ epochs, define
$I_k := \{x : |f'(x)| \le \sigma \sqrt{\log(k/\delta)/T}\}$.
Then with probability at least $1 - \delta$,
$\mathrm{dist}(x_k, I_k) \le 2^{-k}(b_0 - a_0)$, giving a rate of $\tilde{O}(1)\, \omega_f(\sigma/\sqrt{N})$.
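The search procedure on this slide is short enough to sketch directly. The version below is a minimal illustration, assuming a Gaussian noisy-gradient oracle and returning the midpoint of the final interval; the function name `sign_binary_search`, its parameters, and the test function $f(x) = \frac{1}{2}(x - 0.3)^2$ are assumptions made here, not details from the talk.

```python
import numpy as np

def sign_binary_search(grad, a, b, n_epochs, T, sigma=1.0, seed=0):
    """Bisect [a, b] using the sign of T summed noisy gradients at the midpoint."""
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        x = 0.5 * (a + b)
        # Assumed oracle: true derivative plus i.i.d. N(0, sigma^2) noise, T queries.
        G = grad(x) + sigma * rng.standard_normal(T)
        if G.sum() > 0:
            b = x   # derivative looks positive: the minimizer lies to the left
        else:
            a = x   # otherwise keep the right half
    return 0.5 * (a + b)

# Hypothetical example: f(x) = (x - 0.3)^2 / 2 on [-1, 1], so f'(x) = x - 0.3.
x_hat = sign_binary_search(lambda x: x - 0.3, a=-1.0, b=1.0, n_epochs=12, T=2000)
print(x_hat)
```

The total number of oracle queries here is `n_epochs * T`. Once the interval has shrunk into the flat set where $|f'(x)|$ is below the noise level of the averaged gradient, the sign decisions are essentially coin flips, which is why the guarantee is stated as a distance to $I_k$ rather than to the minimizer.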

39 Intuition
Recall the flat set $\{x : |f'(x)| \le \epsilon\}$.

40 Simulations
[Figure: risk plots (vertical axis: risk, ticks from 5e-06 to 5e-03); recoverable panel labels: $k_l = 1.5, k_r = 2$; $k_l = 1.5, k_r = 3$; $k_l = 2, k_r$ (truncated)]

44 More achievability
Problem: in higher dimensions, we have some issues.
Note: an epoch-based stochastic gradient descent procedure of Juditsky and Nesterov (2014) is nearly adaptive (result coming soon).
Intuition: for $k = 1, 2, \ldots$,
perform stochastic gradient descent for $T_k \approx 2^k$ iterations with constant stepsize $\alpha_k$;
halve the stepsize, double the iteration length $T_{k+1} = 2T_k$, and run again.
One of these will be the correct stepsize, and everything later will not allow any movement anyway.
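The epoch scheme in the intuition above can be sketched in a few lines. This is only an illustration of the halving and doubling schedule, not the Juditsky and Nesterov procedure itself: the one-dimensional setting, the Gaussian noise, the projection onto `[lo, hi]`, and restarting each epoch from the previous epoch's average are all assumptions, as is the name `epoch_sgd`.

```python
import numpy as np

def epoch_sgd(grad, x0, alpha0, T0, n_epochs, sigma=1.0, lo=-1.0, hi=1.0, seed=0):
    """Epoch-based SGD: constant stepsize per epoch, halved as the epoch length doubles."""
    rng = np.random.default_rng(seed)
    x, alpha, T = x0, alpha0, T0
    for _ in range(n_epochs):
        noise = sigma * rng.standard_normal(T)   # assumed Gaussian gradient noise
        running_sum = 0.0
        for t in range(T):
            x = min(max(x - alpha * (grad(x) + noise[t]), lo), hi)  # step, then project
            running_sum += x
        x = running_sum / T             # assumption: restart from the epoch average
        alpha, T = alpha / 2.0, 2 * T   # halve the stepsize, double the epoch length
    return float(x)

# Hypothetical example: f(x) = x^2 / 2 on [-1, 1], so f'(x) = x.
x_final = epoch_sgd(lambda x: x, x0=1.0, alpha0=1.0, T0=2, n_epochs=12)
print(x_final)
```

With `T0=2` the epoch lengths are $2, 4, 8, \ldots$, matching $T_k \approx 2^k$. Because the stepsize shrinks geometrically, late epochs can only move the iterate by a small amount, which is the "no movement anyway" behavior the slide describes.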

45 Part III: Problem Geometry Coming soon... just a handwritten teaser if time.

46 Summary
Local notions of minimax risk
Worst single alternative
Shrinking neighborhoods (suitably or poorly) defined
Some optimal algorithms, but work to be done!
Some here: more soon...

Local Minimax Complexity of Stochastic Convex Optimization Yuancheng Zhu Sabyasachi Chatterjee John Duchi and John Lafferty arxiv:1605.07596v3 [stat.ml] 6 May 016 Department of Statistics Department of