Minimax rates for Batched Stochastic Optimization

1 Minimax rates for Batched Stochastic Optimization John Duchi based on joint work with Feng Ruan and Chulhee Yun Stanford University

2-3 Tradeoffs Major problem in theoretical statistics: how do we characterize statistical optimality for problems with constraints? Examples of constraints: Computational [Berthet & Rigollet 13, Ma & Wu 15, Brennan et al. 18, Feldman et al. 18]; Privacy [Dwork et al. 06, Hardt & Talwar 09, Duchi et al. 13]; Robustness [Huber 81, Hardt & Moitra 13, Diakonikolas et al. 16]; Memory / communication [Duchi et al. 14, Braverman et al. 15, Steinhardt & Duchi 16].

4-8 Tradeoffs [Figure: statistical performance as a function of data set size $n$, traded off against other resources: computation, energy use, disclosure risk (privacy).]

9-10 Problem Setting minimize $f(x)$ where $f$ is convex, given mean-zero noisy gradient information $g = \nabla f(x) + \xi$. What is the computational complexity for these problems?

11-13 Stochastic Gradient methods Iterate (for $k = 1, 2, \ldots$): $g_k = \nabla f(x_k) + \xi_k$, $x_{k+1} = x_k - \alpha_k g_k$. [Diagram: a fully sequential chain in which each iterate $x_{k+1}$ depends on the noisy gradient $g_k$ observed at $x_k$.]
Theorem (Nemirovski & Yudin 83; Nemirovski et al. 09; Agarwal et al. 11). After $k$ iterations, we have the (optimal) convergence $\mathbb{E}[f(x_k)] - f^\star \lesssim \frac{1}{\sqrt{k}}$.
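
To make the iteration concrete, here is a minimal Python sketch (not from the talk) of stochastic gradient descent on a noisy quadratic; the objective, noise level, and $1/k$ step size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 5, 1.0                       # dimension and gradient-noise level (illustrative)
x_star = rng.normal(size=d)             # unknown minimizer of f(x) = 0.5 * ||x - x_star||^2

def noisy_gradient(x):
    """Unbiased gradient oracle: grad f(x) plus mean-zero Gaussian noise."""
    return (x - x_star) + sigma * rng.normal(size=d)

x = np.zeros(d)
for k in range(1, 10_001):
    g = noisy_gradient(x)
    x = x - g / k                       # step size alpha_k = 1/k for this strongly convex example

print("f(x) - f* =", 0.5 * np.linalg.norm(x - x_star) ** 2)
```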

14-15 Parallelization and interactivity? Stochastic gradient descent requires many iterations and lots of interaction, and admits no parallelism. [Diagram: the sequential chain $x_1, \ldots, x_9$ with one gradient per iterate, versus the same nine gradients grouped into three batches feeding only the iterates $x_1, x_2, x_3$.] Can we trade off breadth for depth?

16-18 Batched optimization? Medical trials [Perchet et al. 16, Hardwick & Stout 02, Stein 45]. Ideal: get a patient, give a treatment, observe the outcome, then adapt before the next patient; but each such round takes a year, so a fully sequential trial would take 100s of years. Reality: patients must be treated in a small number of large batches.

19 Local Differential Privacy

20-25 Problem Statement Problem: given $M$ rounds of adaptation and $n$ computations, what is the optimal error in optimization? Let $\mathcal{A}_{M,n}$ denote the algorithms with $M$ rounds of computation and $n$ (noisy) gradient computations, and let $\mathcal{F}$ be the function class of interest. Study the minimax optimization error
$$\mathcal{M}_{M,n}(\mathcal{F}) := \inf_{\hat x \in \mathcal{A}_{M,n}}\ \sup_{f \in \mathcal{F}}\ \big\{\mathbb{E}[f(\hat x)] - f^\star\big\}.$$
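
To make the class $\mathcal{A}_{M,n}$ concrete, the interaction protocol can be written schematically as below; splitting the $n$ gradient evaluations evenly across the $M$ rounds is an assumption of this sketch, not something the slides specify.

```latex
\text{For rounds } t = 1, \ldots, M:
\quad \text{choose query points } x^{(t)}_1, \ldots, x^{(t)}_{n/M}
      \text{ using only observations from rounds } 1, \ldots, t-1;
\quad \text{observe } g^{(t)}_i = \nabla f\big(x^{(t)}_i\big) + \xi^{(t)}_i,
      \;\; \mathbb{E}\big[\xi^{(t)}_i\big] = 0.
\qquad
\text{After round } M, \text{ output } \hat{x} \text{ as a function of all } n \text{ observations.}
```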

26-29 Background and hopes
[Perchet, RCS 18] Batched bandits: for the 2-armed bandit, optimal regret is achievable with $M = O(\log\log n)$ batches.
[Nemirovski et al. 09, Ghadimi & Lan 12] Stochastic strongly convex optimization: $f(\hat x) - f^\star \lesssim \|x_0 - x^\star\|^2 \exp\big(-M/\sqrt{\mathrm{Cond}(f)}\big) + \frac{\mathrm{Var}(\xi)}{n}$, i.e. $M \gtrsim \sqrt{\mathrm{Cond}(f)}\, \log n$ rounds suffice.
[Smith, TU 17] To solve convex optimization, need $M \gtrsim \log(1/\epsilon)$ rounds.

30-32 Main Results Define $\mathcal{F}_{H,\lambda} := \{\lambda\text{-strongly convex}, H\text{-smooth } f\}$.
Theorem (D., Ruan, Yun 18). $\mathcal{M}_{M,n}(\mathcal{F}_{H,\lambda}) \geq C(d, n)\, n^{-\left(1 - (\frac{d}{d+2})^{M}\right)}$, where the prefactor $C(d, n)$ is sub-polynomial in $n$.
Theorem (D., Ruan, Yun 18). If $M \leq \frac{d}{2}\log\log n$, then there is an algorithm such that $P\Big(f(\hat x) - f^\star \geq C\, n^{-\left(1 - (\frac{d}{d+2})^{M}\right)} \log n\Big) \to 0$.
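
The exponent $1 - (d/(d+2))^M$ interpolates between the one-round rate and the fully sequential rate $1/n$; the short check below (with illustrative values of $d$, $M$, and $n$) evaluates it and the number of rounds suggested by the second theorem.

```python
import math

def batched_exponent(d: int, M: int) -> float:
    """Exponent of n in the batched minimax rate n^{-(1 - (d/(d+2))^M)}."""
    return 1.0 - (d / (d + 2)) ** M

n = 10**6
for d in (2, 10, 50):
    # rounds needed so that the batched rate is within a constant of the sequential 1/n rate
    rounds = math.ceil(math.log(math.log(n)) / math.log(1 + 2 / d))
    print(f"d={d:3d}: exponent with M=3 is {batched_exponent(d, 3):.3f}, "
          f"~{rounds} rounds reach the sequential rate")
```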

33-34 Achievability For convex functions $f$, $x^\star \in \{y : \langle \nabla f(x), y - x\rangle \leq 0\}$: every gradient evaluation at $x$ rules out the open halfspace $\{y : \langle \nabla f(x), y - x\rangle > 0\}$ as a location for the minimizer.

35-39 Achievability Maintain a feasible box $B_t = c_t + [-r_t, r_t]^d$ with center $c_t$. At round $t$, take points $x_i \in B_t$, $i = 1, \ldots, m$, and get parallel (noisy) gradient estimates $\hat\nabla f(x_1), \ldots, \hat\nabla f(x_m)$. Then, w.h.p., $x^\star \in \{y : \langle \hat\nabla f(x_i), y - x_i\rangle \leq \varepsilon\, \|y - x_i\|\}$ for each $i$, where $\varepsilon$ accounts for the gradient-estimation error. [Figure: a box with query points $x_1, \ldots, x_4$ and the resulting approximate halfspace cuts.]
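
A minimal sketch of one such round, assuming a quadratic objective, Gaussian gradient noise, and a brute-force candidate grid for applying the cuts; this only illustrates the feasibility-set idea and is not the algorithm analyzed in the talk.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
d, sigma = 2, 0.5
x_star = np.array([0.3, -0.2])          # unknown minimizer (illustrative)

def noisy_grad(x, repeats):
    """Average of `repeats` noisy gradients of f(x) = 0.5 * ||x - x_star||^2."""
    noise = sigma * rng.normal(size=(repeats, d)).mean(axis=0)
    return (x - x_star) + noise

def one_round(center, r, m_per_side=3, repeats=50, eps=0.2):
    """One batched round: query a grid in the box, keep the candidate points that
    satisfy every (noisy) gradient cut, and return the bounding box of survivors."""
    side = np.linspace(-r, r, m_per_side)
    queries = [center + np.array(p) for p in itertools.product(side, repeat=d)]
    grads = [noisy_grad(x, repeats) for x in queries]

    cand_side = np.linspace(-r, r, 41)  # dense candidate grid over the current box
    candidates = np.array([center + np.array(p)
                           for p in itertools.product(cand_side, repeat=d)])
    keep = np.ones(len(candidates), dtype=bool)
    for x, g in zip(queries, grads):
        diff = candidates - x
        keep &= (diff @ g <= eps * np.linalg.norm(diff, axis=1))

    survivors = candidates[keep]
    if len(survivors) == 0:             # noise too large this round: keep the old box
        return center, r
    new_center = survivors.mean(axis=0)
    new_r = float(np.abs(survivors - new_center).max())
    return new_center, new_r

center, r = np.zeros(d), 1.0
for t in range(4):
    center, r = one_round(center, r)
    print(f"round {t + 1}: center={np.round(center, 3)}, radius={r:.3f}")
```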

40-45 Recursive shrinking The box radius decreases as $r_t \leq \delta\, r_{t-1}^{\gamma}$ for constants $\delta < 1$ and $\gamma < 1$ determined below; or, recursively,
$$r_t \leq \delta\, r_{t-1}^{\gamma} \leq \delta^{1+\gamma}\, r_{t-2}^{\gamma^2} \leq \cdots \leq \delta^{\sum_{j=0}^{t-1}\gamma^j}\, r_0^{\gamma^t}.$$
[Figure: nested boxes of radii $r_1 > r_2 > r_3$.]

46-49 Recursive shrinking For us, in dimension $d$: $\gamma = \frac{d}{d+2}$ and $\delta = n^{-\frac{1}{d+2}}$, so that (taking $r_0 \leq 1$) $r_t \leq \delta^{\sum_{j<t}\gamma^j} = n^{-\frac{1}{2}(1-\gamma^t)}$. The box is small enough that the optimality gap (which smoothness bounds by a multiple of $r_t^2$) is of order $\frac{1}{n}$ iff $\gamma^t \lesssim \frac{1}{\log n}$. Solution: $t \gtrsim \frac{\log\log n}{\log(1/\gamma)} = \frac{\log\log n}{\log(1 + 2/d)} \approx \frac{d}{2}\log\log n$ rounds.
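
A quick numerical check of this recursion (with illustrative $n$ and $d$, ignoring constants): iterate $r_t = \delta r_{t-1}^{\gamma}$ and count rounds until the gap proxy $r_t^2$ is within a constant of $1/n$.

```python
import math

def rounds_until_sequential_rate(n: int, d: int) -> int:
    """Iterate r_t = delta * r_{t-1}^gamma with gamma = d/(d+2), delta = n^{-1/(d+2)},
    starting from r_0 = 1, until the gap proxy r_t^2 is within a factor 2 of 1/n."""
    gamma, delta = d / (d + 2), n ** (-1.0 / (d + 2))
    r, t = 1.0, 0
    while r * r > 2.0 / n:
        r = delta * r ** gamma
        t += 1
    return t

n, d = 10**8, 10
print("rounds needed   :", rounds_until_sequential_rate(n, d))
print("(d/2) log log n ~", round(0.5 * d * math.log(math.log(n)), 1))
```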

50 Main Results (recap) Restating the two theorems above: over $\mathcal{F}_{H,\lambda}$, the batched minimax rate is $n^{-\left(1 - (\frac{d}{d+2})^{M}\right)}$ up to lower-order factors, so the fully sequential rate $1/n$ is attained (up to logarithms) once $M \gtrsim \frac{d}{2}\log\log n$.

51-55 How do we prove lower bounds? 1. Construct two functions where optimizing one means not optimizing the other [Agarwal BRW 13, Duchi 17]. 2. Make the information available to the algorithm to distinguish them small. [Figure: $f_0$ and $f_1$ separated by $\mathrm{dist}(f_0, f_1)$.] Together these give the two-point bound
$$\inf_{\hat x}\, \max_{v \in \{0,1\}} \mathbb{E}\big[f_v(\hat x) - f_v^\star\big] \;\geq\; \frac{\mathrm{dist}(f_0, f_1)}{2}\; \underbrace{\inf_{\mathrm{Alg}\ A} P\big(A \text{ fails to distinguish } f_0, f_1\big)}_{=\,1 - \|P_0 - P_1\|_{\mathrm{TV}}}.$$
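
For completeness, the standard reduction behind this display, written out as a sketch; defining $\mathrm{dist}(f_0, f_1)$ as the unavoidable combined suboptimality of a single point is one common convention and is an assumption here.

```latex
% Le Cam-style two-point reduction (sketch).
% Define dist(f_0, f_1) := inf_x [ (f_0(x) - f_0^\star) + (f_1(x) - f_1^\star) ],
% the combined suboptimality that any single point x must incur.
\begin{aligned}
\max_{v \in \{0,1\}} \mathbb{E}_v\bigl[f_v(\hat x) - f_v^\star\bigr]
  &\ge \tfrac{1}{2}\Bigl(\mathbb{E}_0\bigl[f_0(\hat x) - f_0^\star\bigr]
        + \mathbb{E}_1\bigl[f_1(\hat x) - f_1^\star\bigr]\Bigr) \\
  &\ge \tfrac{1}{2}\,\mathrm{dist}(f_0, f_1)\,
       \bigl(1 - \|P_0 - P_1\|_{\mathrm{TV}}\bigr),
\end{aligned}
% since the two expectations can be lower bounded on the event where the
% observation distributions P_0 and P_1 overlap, which has mass 1 - TV.
```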

56-60 Lower bound: recursive packing [Figure: a depth-$M$ tree of nested balls.] Let $U^{(1)}$ be a $\delta_1$-packing of the initial set, and for each $u$ let $U^{(t)}_u$ be a $2\delta_t$-packing of the ball centered at $u$. Idea: 1. define functions recursively on the balls; 2. optimizing well means identifying the correct ball at each level.
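
As background (a standard volume argument, not a slide from the talk), and assuming the ball around $u$ at level $t-1$ has radius of order $\delta_{t-1}$, such packings can be taken of size

```latex
% A Euclidean ball of radius R in R^d contains a 2\delta-packing of cardinality
% at least (R / (2\delta))^d (volume comparison), so
\#\, U^{(t)}_u \;\gtrsim\; \Bigl(\frac{\delta_{t-1}}{2\delta_t}\Bigr)^{d}
\;\asymp\; \Bigl(\frac{\delta_{t-1}}{\delta_t}\Bigr)^{d},
% which is the "# in packing" count that reappears in the information recursion below.
```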

61-64 Function constructions Index functions by a path down the tree, $u_{1:M} = (u_1, \ldots, u_M)$. The construction satisfies
$$f^{(1)}_{u_{1:M}}(x) = f^{(0)}_{u_{1:M}}(x) \ \text{ for } x \notin u_M + \delta_M \mathbb{B}, \qquad f^{(\pm 1)}_{u_{1:M}}(x) = f^{(\pm 1)}_{u_{1:t}, \tilde u_{t+1:M}}(x) \ \text{ for } x \notin u_t + \delta_t \mathbb{B};$$
that is, functions whose index paths agree through level $t$ are identical outside the level-$t$ ball. Optimizing well means identifying the sequence $u_1, u_2, \ldots, u_M$ defining the function.

65 Function construction Recurse: $f_{u_1}(x) = \frac{1}{2}\|x - u_1\|^2$ and $f_{u_{1:t}}(x) = \mathrm{SmoothMax}\big\{f_{u_{1:t-1}}(x),\ \|x - u_t\|^2 + b_t\big\}$. [Figure: two quadratics $f_1, f_2$ and a smooth approximation to $\max\{f_1(x), f_2(x)\}$.]
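
The slides do not define SmoothMax here; one standard smooth surrogate for the maximum is the log-sum-exp (softmax) function, sketched below for intuition only (the construction in the paper may use a different smoothing).

```python
import numpy as np

def smooth_max(values, tau=0.05):
    """Log-sum-exp approximation to max(values); overestimates by at most tau * log(len(values)).
    Smaller tau means closer to the hard max but larger derivatives of the smoothing."""
    values = np.asarray(values, dtype=float)
    m = values.max()                    # subtract the max for numerical stability
    return m + tau * np.log(np.exp((values - m) / tau).sum())

# compare with the hard max of two simple quadratics f1, f2 on a grid
xs = np.linspace(-2.0, 2.0, 9)
f1 = 0.5 * (xs - 1.0) ** 2
f2 = 0.5 * (xs + 1.0) ** 2 + 0.1
hard = np.maximum(f1, f2)
soft = np.array([smooth_max([a, b]) for a, b in zip(f1, f2)])
print(np.round(hard - soft, 4))         # differences lie in [-tau*log(2), 0]
```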

66-69 Information recursion At each round $t$ (of $M$):
$$D_{\mathrm{kl}}\big(\nabla f_{u_{1:M}}(x) + \xi \,\big\|\, \nabla f_{u_{1:t}, \tilde u_{t+1:M}}(x) + \xi\big) \;\lesssim\; \delta_t^2.$$
Choose the radius so that each round carries only a constant amount of information relative to the size of the packing:
$$\delta_t^2\, n\, \big(\#\text{ in packing}\big)^{-1} \lesssim 1, \qquad \text{i.e.} \qquad \delta_t^2\, n\, \delta_t^{d}\, \delta_{t-1}^{-d} \lesssim 1.$$
Solution for the lower bound: $\delta_M = n^{-\frac{1}{d+2}\left(1 + \frac{d}{d+2} + \cdots + (\frac{d}{d+2})^{M-1}\right)} = n^{-\frac{1}{2}\left(1 - (\frac{d}{d+2})^{M}\right)}$, so the resulting optimization-error lower bound is of order $\delta_M^2 = n^{-\left(1 - (\frac{d}{d+2})^{M}\right)}$, matching the upper bound.
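
Solving that recursion explicitly (algebra only, taking $\delta_0 \asymp 1$ and treating the constraints as equalities; constants ignored):

```latex
\delta_t^{2}\, n \asymp \Bigl(\frac{\delta_{t-1}}{\delta_t}\Bigr)^{d}
\;\Longrightarrow\;
\delta_t^{\,d+2}\, n \asymp \delta_{t-1}^{\,d}
\;\Longrightarrow\;
\delta_t \asymp \delta_{t-1}^{\,\frac{d}{d+2}}\, n^{-\frac{1}{d+2}} .
% Unrolling with \gamma = d/(d+2) and \delta_0 \asymp 1:
\delta_M \asymp n^{-\frac{1}{d+2}\,\sum_{j=0}^{M-1}\gamma^{j}}
        = n^{-\frac{1-\gamma^{M}}{(d+2)(1-\gamma)}}
        = n^{-\frac{1}{2}\,(1-\gamma^{M})},
\qquad
\delta_M^{2} = n^{-(1-\gamma^{M})} .
```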
