ORIE 4741: Learning with Big Messy Data. Proximal Gradient Method
ORIE 4741: Learning with Big Messy Data — Proximal Gradient Method
Professor Udell
Operations Research and Information Engineering, Cornell
November 13
Announcements

Be a TA for CS/ORIE 1380: Data Science for All!
  syllabus: roster/sp18/class/orie/1380
  apply:
Regularized empirical risk minimization

choose model by solving

    minimize  ∑_{i=1}^n l(x_i, y_i; w) + r(w)

with variable w ∈ R^d
  - parameter vector w ∈ R^d
  - loss function l : X × Y × R^d → R
  - regularizer r : R^d → R

why?
  - want to minimize the risk E_{(x,y)∼P} l(x, y; w)
  - approximate it by the empirical risk ∑_{i=1}^n l(x_i, y_i; w)
  - add a regularizer to help the model generalize
Solving regularized risk minimization

how should we fit these models?
  - with a different software package for each model?
  - with a different algorithm for each model?
  - with a general purpose optimization solver?

desiderata:
  - fast
  - flexible

we'll use the proximal gradient method
Proximal operator

define the proximal operator of the function r : R^d → R, prox_r : R^d → R^d,

    prox_r(z) = argmin_w ( r(w) + (1/2) ‖w − z‖² )

  - generalized projection: if 1_C is the indicator of the set C,
        prox_{1_C}(w) = Π_C(w)
  - implicit gradient step: if w = prox_r(z) and r is smooth,
        ∇r(w) + w − z = 0   ⟺   w = z − ∇r(w)
  - simple to evaluate: closed form solutions for many functions

more info: [Parikh Boyd 2013]
Maps from functions to functions

there is no consistent notation for maps from functions to functions. for a function f : R^d → R,
  - prox maps f to a new function prox_f : R^d → R^d;
        prox_f(x) evaluates this function at the point x
  - ∇ maps f to a new function ∇f : R^d → R^d;
        ∇f(x) evaluates this function at the point x
  - ∂/∂x maps f to a new function ∂f/∂x : R^d → R^d;
        ∂f/∂x |_{x = x̄} evaluates this function at the point x̄
  - this last one has the most confusing notation of all...
Maps from functions to functions

in a nice programming language, you can write something like

    prox(f)(x)    or    prox(f, x)
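To make the prox(f)(x) idea concrete, here is a minimal sketch in Python (my own illustration, not the course's code): a generic prox built as a higher-order function on top of a numerical solver, assuming f is smooth enough for scipy's default method to handle.

    import numpy as np
    from scipy.optimize import minimize

    def prox(f):
        """Return the function z -> argmin_w f(w) + (1/2)||w - z||^2 (computed numerically)."""
        def prox_f(z):
            z = np.asarray(z, dtype=float)
            obj = lambda w: f(w) + 0.5 * np.sum((w - z) ** 2)
            return minimize(obj, x0=z).x  # generic solver; fine for smooth f
        return prox_f

    # usage: prox of r(w) = ||w||_2^2 at z, which should equal z / 3
    z = np.array([3.0, -6.0])
    print(prox(lambda w: np.sum(w ** 2))(z))   # approximately [1., -2.]

For the specific functions on the next slide, the minimization has a closed form and no solver is needed.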
Let's evaluate some proximal operators!

define the proximal operator of the function r : R^d → R

    prox_r(z) = argmin_w ( r(w) + (1/2) ‖w − z‖² )

  - r(w) = 0                                  (identity)
  - r(w) = ∑_{i=1}^d r_i(w_i)                 (separable)
  - r(w) = ‖w‖²₂                              (shrinkage)
  - r(w) = ‖w‖₁                               (soft-thresholding)
  - r(w) = 1(w ≥ 0)                           (projection)
  - r(w) = ∑_{i=1}^{d−1} (w_{i+1} − w_i)²     (smoothing)
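Several of these have closed forms. A possible sketch of a few of them in Python (the function names are mine, and I assume the "projection" entry means the nonnegative orthant, as in the NNLS example below):

    import numpy as np

    def prox_zero(z):
        # r(w) = 0: the prox is the identity
        return z

    def prox_sq_l2(z, mu=1.0):
        # r(w) = mu*||w||_2^2: shrinkage toward zero, prox(z) = z / (1 + 2*mu)
        return z / (1.0 + 2.0 * mu)

    def prox_l1(z, mu=1.0):
        # r(w) = mu*||w||_1: soft-thresholding, applied elementwise
        return np.sign(z) * np.maximum(np.abs(z) - mu, 0.0)

    def prox_nonneg(z):
        # r(w) = 1(w >= 0): projection onto the nonnegative orthant
        return np.maximum(z, 0.0)

Note that a separable r(w) = ∑_i r_i(w_i) proxes coordinate-wise, which is why all of these act elementwise.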
Proximal gradient method

want to solve

    minimize  l(w) + r(w)

  - l : R^d → R smooth
  - r : R^d → R with a fast prox operator

proximal gradient method. pick a step size sequence {α_t} and w_0 ∈ R^d. repeat

    w_{t+1} = prox_{α_t r}( w_t − α_t ∇l(w_t) )

complexity:
  - O(nd) to evaluate the gradient of the loss function
  - O(d) to prox and to update
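As a sketch of how the generic method might look in code (my own illustration; the gradient, prox, step size, and iteration count are whatever the problem supplies):

    import numpy as np

    def proximal_gradient(grad_l, prox_r, w0, alpha=1e-2, iters=500):
        """Minimize l(w) + r(w), given grad_l(w) and prox_r(z, t) = prox_{t*r}(z)."""
        w = np.array(w0, dtype=float)
        for t in range(iters):
            w = prox_r(w - alpha * grad_l(w), alpha)   # gradient step on l, prox step on r
        return w

The examples that follow (NNLS, lasso) are instances of this loop with specific choices of grad_l and prox_r.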
Example: NNLS

want to solve

    minimize  (1/2) ‖y − Xw‖² + 1(w ≥ 0)

recall
  - ∇( (1/2) ‖y − Xw‖² ) = −X^T (y − Xw)
  - prox_{1(· ≥ 0)}(w) = max(0, w)

proximal gradient method. pick a step size sequence {α_t} and w_0 ∈ R^d. for t = 0, 1, ...

    w_{t+1} = max(0, w_t + α_t X^T (y − Xw_t))

note: this is not the same as max(0, (X^T X)^{−1} X^T y)
Example: NNLS

proximal gradient method for NNLS. pick a step size sequence {α_t} and w_0 ∈ R^d. for t = 0, 1, ...
  - compute g_t = X^T (y − Xw_t)          (O(nd) flops)
  - w_{t+1} = max(0, w_t + α_t g_t)       (O(d) flops)

O(nd) flops per iteration
Example: NNLS

option: do work up front to reduce per-iteration complexity

proximal gradient method for NNLS. pick a step size sequence {α_t} and w_0 ∈ R^d.
form b = X^T y (O(dn) flops) and G = X^T X (O(nd²) flops). for t = 0, 1, ...

    w_{t+1} = max(0, w_t + α_t (b − Gw_t))      (O(d²) flops)

O(nd²) flops to begin, O(d²) flops per iteration

note: can compute b and G in parallel
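A sketch of this precomputed variant (illustrative, not the course's reference code); here I assume a fixed step size α = 1/‖X‖², which is 1/L for the (1/2)-scaled loss, cf. the Lipschitz slide below:

    import numpy as np

    def nnls_prox_grad(X, y, iters=1000):
        """Nonnegative least squares, (1/2)||y - Xw||^2 + 1(w >= 0), by proximal gradient."""
        n, d = X.shape
        b = X.T @ y                                # O(nd) flops, once
        G = X.T @ X                                # O(n d^2) flops, once
        alpha = 1.0 / np.linalg.norm(X, 2) ** 2    # 1/sigma_max(X)^2, a safe fixed step size
        w = np.zeros(d)
        for t in range(iters):
            w = np.maximum(0.0, w + alpha * (b - G @ w))   # O(d^2) flops per iteration
        return w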
Example: Lasso

want to solve

    minimize  (1/2) ‖y − Xw‖² + λ ‖w‖₁

recall
  - ∇( (1/2) ‖y − Xw‖² ) = −X^T (y − Xw)
  - prox_{µ‖·‖₁}(w) = s_µ(w), the soft-thresholding operator, where

        (s_µ(w))_i =  w_i − µ    if w_i ≥ µ
                      0          if |w_i| ≤ µ
                      w_i + µ    if w_i ≤ −µ

proximal gradient method. pick a step size sequence {α_t} and w_0 ∈ R^d.
form b = X^T y (O(dn) flops) and G = X^T X (O(nd²) flops). for t = 0, 1, ...

    w_{t+1} = s_{α_t λ}( w_t + α_t (b − Gw_t) )      (O(d²) flops)

notice: the hard part (O(d²)) is computing the gradient...!
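In the same style, a possible sketch of the lasso iteration with the precomputed Gram matrix (my illustration; names and the fixed step size are my choices):

    import numpy as np

    def soft_threshold(z, mu):
        # prox of mu*||.||_1, applied elementwise
        return np.sign(z) * np.maximum(np.abs(z) - mu, 0.0)

    def lasso_prox_grad(X, y, lam, iters=1000):
        """Minimize (1/2)||y - Xw||^2 + lam*||w||_1 by proximal gradient."""
        n, d = X.shape
        b, G = X.T @ y, X.T @ X
        alpha = 1.0 / np.linalg.norm(X, 2) ** 2
        w = np.zeros(d)
        for t in range(iters):
            w = soft_threshold(w + alpha * (b - G @ w), alpha * lam)
        return w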
Convergence

two questions to ask:
  - will the iteration ever stop?
  - what kind of point will it stop at?

if the iteration stops, we say it has converged
Convergence: what kind of point will it stop at?

let's suppose r is differentiable.¹ if we find w so that

    w = prox_{α_t r}( w − α_t ∇l(w) )

then

    w = argmin_{w'} ( α_t r(w') + (1/2) ‖w' − (w − α_t ∇l(w))‖² )
    0 = α_t ∇r(w) + w − w + α_t ∇l(w) = α_t ( ∇r(w) + ∇l(w) )

so the gradient of the objective is 0. if l and r are convex, that means w minimizes l + r.

¹ take Convex Optimization for the proof for non-differentiable r
Convergence: will it stop?

definitions:
  - p⋆ = inf_w l(w) + r(w)

assumptions:
  - the loss function is continuously differentiable with L-Lipschitz gradient:

        l(w') ≤ l(w) + ⟨∇l(w), w' − w⟩ + (L/2) ‖w' − w‖²

  - for simplicity, consider a constant step size α_t = α
Proximal point method converges

prove it for l = 0 first (aka the proximal point method). for any t = 0, 1, ...,

    w_{t+1} = argmin_w ( α r(w) + (1/2) ‖w − w_t‖² )

so in particular,

    α r(w_{t+1}) + (1/2) ‖w_{t+1} − w_t‖²  ≤  α r(w_t) + (1/2) ‖w_t − w_t‖²
    (1/(2α)) ‖w_{t+1} − w_t‖²  ≤  r(w_t) − r(w_{t+1})

now add up these inequalities for t = 0, 1, ..., T:

    (1/(2α)) ∑_{t=0}^T ‖w_{t+1} − w_t‖²  ≤  ∑_{t=0}^T ( r(w_t) − r(w_{t+1}) )  =  r(w_0) − r(w_{T+1})  ≤  r(w_0) − p⋆

it converges!
Proximal gradient method converges (I)

now prove it for l ≠ 0. for any t = 0, 1, ...,

    w_{t+1} = argmin_w ( α r(w) + (1/2) ‖w − (w_t − α ∇l(w_t))‖² )

so in particular,

    r(w_{t+1}) + (1/(2α)) ‖w_{t+1} − (w_t − α ∇l(w_t))‖²  ≤  r(w_t) + (1/(2α)) ‖w_t − (w_t − α ∇l(w_t))‖²
    r(w_{t+1}) + (1/(2α)) ‖w_{t+1} − w_t‖² + (α/2) ‖∇l(w_t)‖² + ⟨∇l(w_t), w_{t+1} − w_t⟩  ≤  r(w_t) + (α/2) ‖∇l(w_t)‖²
    r(w_{t+1}) + (1/(2α)) ‖w_{t+1} − w_t‖² + ⟨∇l(w_t), w_{t+1} − w_t⟩  ≤  r(w_t)
Proximal gradient method converges (II)

now use

    l(w') ≤ l(w) + ⟨∇l(w), w' − w⟩ + (L/2) ‖w' − w‖²

with w' = w_{t+1}, w = w_t:

    l(w_{t+1}) + r(w_{t+1}) + (1/(2α)) ‖w_{t+1} − w_t‖² + ⟨∇l(w_t), w_{t+1} − w_t⟩
        ≤ l(w_t) + r(w_t) + ⟨∇l(w_t), w_{t+1} − w_t⟩ + (L/2) ‖w_{t+1} − w_t‖²
    l(w_{t+1}) + r(w_{t+1}) + (1/(2α)) ‖w_{t+1} − w_t‖²  ≤  l(w_t) + r(w_t) + (L/2) ‖w_{t+1} − w_t‖²
    ( 1/(2α) − L/2 ) ‖w_{t+1} − w_t‖²  ≤  l(w_t) + r(w_t) − ( l(w_{t+1}) + r(w_{t+1}) )
Proximal gradient method converges (III)

now add up these inequalities for t = 0, 1, ..., T:

    ( 1/(2α) − L/2 ) ∑_{t=0}^T ‖w_{t+1} − w_t‖²
        ≤ ∑_{t=0}^T ( l(w_t) + r(w_t) − ( l(w_{t+1}) + r(w_{t+1}) ) )
        ≤ l(w_0) + r(w_0) − p⋆

if 1/(2α) − L/2 ≥ 0, i.e., α ≤ 1/L, it converges!
Lipschitz?

the loss function is differentiable with L-Lipschitz gradient:

    l(w') ≤ l(w) + ⟨∇l(w), w' − w⟩ + (L/2) ‖w' − w‖²

what is L? for l(w) = ‖Xw − y‖²,

    l(w') = ‖Xw' − y‖²
          = ‖X(w' − w) + Xw − y‖²
          = ‖Xw − y‖² + 2 (Xw − y)^T X (w' − w) + ‖X(w' − w)‖²
          = l(w) + ⟨∇l(w), w' − w⟩ + ‖X(w' − w)‖²
          ≤ l(w) + ⟨∇l(w), w' − w⟩ + ‖X‖² ‖w' − w‖²

so L = 2‖X‖², where ‖X‖ is the maximum singular value of X
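A quick way to compute this constant and the corresponding safe step size, assuming numpy (for the (1/2)-scaled loss used in the NNLS and lasso examples, the factor of 2 drops out):

    import numpy as np

    X = np.random.randn(100, 10)
    L = 2 * np.linalg.norm(X, 2) ** 2   # 2 * sigma_max(X)^2, for l(w) = ||Xw - y||^2
    alpha = 1.0 / L                     # a step size guaranteed to converge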
Speeding it up

recall: computing the gradient is slow

idea: approximate the gradient!

a stochastic gradient ∇̃l(w) is a random variable with

    E[∇̃l(w)] = ∇l(w)
Stochastic gradient: examples

a stochastic gradient obeys E[∇̃l(w)] = ∇l(w).

examples: for l(w) = ∑_{i=1}^n (y_i − w^T x_i)²,

  - single stochastic gradient. pick a random example i. set

        ∇̃l(w) = ∇( n (y_i − w^T x_i)² ) = −2n (y_i − w^T x_i) x_i

  - minibatch stochastic gradient. pick a random set of examples S. set

        ∇̃l(w) = ∇( (n/|S|) ∑_{i∈S} (y_i − w^T x_i)² ) = −(2n/|S|) ∑_{i∈S} (y_i − w^T x_i) x_i

    (here |S| is the number of elements in the set S.)
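As a quick sanity check of the unbiasedness claim (a sketch of my own, not from the slides), one can average many single-example stochastic gradients and compare against the full gradient:

    import numpy as np

    np.random.seed(0)
    n, d = 50, 5
    X, y, w = np.random.randn(n, d), np.random.randn(n), np.random.randn(d)

    full_grad = -2 * X.T @ (y - X @ w)   # gradient of sum_i (y_i - w^T x_i)^2

    idx = np.random.randint(n, size=100000)                       # sampled example indices
    stoch = -2 * n * (y[idx] - X[idx] @ w)[:, None] * X[idx]      # one stochastic gradient per row
    approx = stoch.mean(axis=0)
    print(np.linalg.norm(approx - full_grad) / np.linalg.norm(full_grad))  # small (average approaches the full gradient)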
Stochastic proximal gradient

stochastic proximal gradient method. pick a step size sequence {α_t} and w_0 ∈ R^d. repeat:
  - pick i ∈ {1, ..., n} uniformly at random
  - w_{t+1} = prox_{α_t r}( w_t + α_t 2n (y_i − w_t^T x_i) x_i )      (O(d) flops)

per-iteration complexity: O(d)!
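A sketch of the full loop for the lasso, as one concrete instance (my own illustration; I use the (1/2)-scaled loss from the lasso slide, so the factor is n rather than 2n, and a decreasing step size, per the convergence discussion below):

    import numpy as np

    def stochastic_prox_grad_lasso(X, y, lam, iters=10000, alpha0=1e-3):
        """Minimize (1/2)||y - Xw||^2 + lam*||w||_1 with single-example stochastic gradients."""
        n, d = X.shape
        w = np.zeros(d)
        for t in range(1, iters + 1):
            alpha = alpha0 / t                        # decreasing step size
            i = np.random.randint(n)                  # pick an example uniformly at random
            g = -n * (y[i] - X[i] @ w) * X[i]         # unbiased estimate of the full gradient
            z = w - alpha * g
            w = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)   # soft-thresholding prox
        return w

Each iteration touches a single row of X, so the cost is O(d) rather than O(nd).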
Extensions

next lecture, we'll see loss functions that are not differentiable
  - can still use the proximal gradient method!
  - need to define the subgradient
Subgradient

define the subgradient of a convex function f : R^d → R at x:

    ∂f(x) = { g : ∀y, f(y) ≥ f(x) + ⟨g, y − x⟩ }

  - the subgradient ∂f maps points to sets!
  - it follows the chain rule: if f = h ∘ g, h : R → R, and g : R^d → R is differentiable,

        ∂f(x) = ∂h(g(x)) ∇g(x)
Subgradient

for f : R → R convex, here's a simpler equivalent condition:
  - if f is differentiable at x, ∂f(x) = { f'(x) }
  - if f is not differentiable at x, it will still be differentiable just to the left and the right of x,² so let

        g₊ = lim_{ε↓0} f'(x + ε)        g₋ = lim_{ε↓0} f'(x − ε)

    ∂f(x) is any convex combination (i.e., any weighted average) of those gradients:

        ∂f(x) = { α g₊ + (1 − α) g₋ : α ∈ [0, 1] }

² (because a convex function is differentiable almost everywhere)
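For example, applying this recipe to the absolute value (a standard example, not on the slide): for f(x) = |x| at x = 0,

    g₊ = lim_{ε↓0} f'(0 + ε) = 1,    g₋ = lim_{ε↓0} f'(0 − ε) = −1,
    so ∂f(0) = { α·1 + (1 − α)·(−1) : α ∈ [0, 1] } = [−1, 1]

and at any x ≠ 0, ∂f(x) = { sign(x) }.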
Proximal subgradient method

proximal subgradient method. pick a step size sequence {α_t} and w_0 ∈ R^d. repeat:
  - pick any g ∈ ∂l(w_t)
  - w_{t+1} = prox_{α_t r}( w_t − α_t g )
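A possible sketch for one concrete nonsmooth loss (robust regression with an ℓ1 loss and ℓ1 regularizer; this particular problem and the step size schedule are my own illustrative choices, not from the slides):

    import numpy as np

    def prox_subgradient(X, y, lam, iters=5000, alpha0=1e-2):
        """Minimize ||Xw - y||_1 + lam*||w||_1 with the proximal subgradient method."""
        n, d = X.shape
        w = np.zeros(d)
        for t in range(1, iters + 1):
            alpha = alpha0 / np.sqrt(t)
            g = X.T @ np.sign(X @ w - y)                                 # a subgradient of ||Xw - y||_1
            z = w - alpha * g
            w = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)    # prox of alpha*lam*||.||_1
        return w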
Convergence for stochastic proximal (sub)gradient

pick your poison:
  - stochastic (sub)gradient, fixed step size α_t = α: iterates converge quickly, then wander within a small ball
  - stochastic (sub)gradient, decreasing step size α_t = 1/t: iterates converge slowly to the solution
  - minibatch stochastic (sub)gradient with increasing minibatch size, fixed step size α_t = α: iterates converge quickly to the solution, but later iterations take (much) longer

proofs: [Bertsekas, 2010]
conditions: l is convex, subdifferentiable, and Lipschitz continuous, and r is convex and Lipschitz continuous where it is finite, or all iterates are bounded
Proximal subgradient method

Q: Why can't we just use gradient descent to solve all our problems?
A: Because some regularizers and loss functions aren't differentiable!

Q: Why can't we just use subgradient descent to solve all our problems?
A: Because some of our regularizers don't even have subgradients defined everywhere. (e.g., the indicator 1(· ≥ 0))
References

  - Vandenberghe: lecture on the proximal gradient method. lectures/proxgrad.pdf
  - Yin: lecture on the proximal method. slides/lec05_proximaloperatordual.pdf
  - Parikh and Boyd: paper on proximal algorithms. https://stanford.edu/~boyd/papers/pdf/prox_algs.pdf
  - Bertsekas: convergence proofs for every proximal-gradient-style method you can dream of.