Randomized Smoothing for Stochastic Optimization

1 Randomized Smoothing for Stochastic Optimization
John Duchi, Peter Bartlett, Martin Wainwright. University of California, Berkeley.
NIPS Big Learn Workshop, December 2011.

2 Problem Statement
Goal: solve the following problem: minimize $f(w)$ subject to $w \in W$, where $f(w) := \frac{1}{n}\sum_{i=1}^n F(w; x_i)$ or $f(w) := \mathbb{E}[F(w; x)]$.

3 Problem Statement (continued)
Examples: $F(w; \{x, y\}) = \log(1 + \exp(-y\langle w, x\rangle))$ [logistic regression]; $F(w; \{x, y\}) = [1 - y\langle w, x\rangle]_+$ [SVM].

4 Review: Stochastic Gradient Descent
Repeat: at iteration $t$, receive a stochastic gradient $g_t$ with $\mathbb{E}[g_t \mid g_1, \ldots, g_{t-1}] = \nabla f(w_t)$, then update $w_{t+1} = w_t - \alpha_t g_t$.

5 Review: Stochastic Gradient Descent (continued)
Example: when $f(w) = \frac{1}{n}\sum_{i=1}^n F(w; x_i)$, choose $i$ uniformly at random and set $g_t = \nabla F(w_t; x_i)$.
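
For concreteness, a minimal NumPy sketch of this loop (my own illustration, not code from the talk); the least-squares loss, identity projection, and step-size schedule $\alpha_t = \alpha_0/\sqrt{t}$ are stand-ins:

    import numpy as np

    def sgd(grad_F, project, X, y, w0, steps=2000, alpha0=0.5):
        """Projected stochastic (sub)gradient descent on f(w) = (1/n) sum_i F(w; x_i, y_i)."""
        w = w0.copy()
        n = X.shape[0]
        rng = np.random.default_rng(0)
        for t in range(1, steps + 1):
            i = rng.integers(n)                       # choose i uniformly at random
            g = grad_F(w, X[i], y[i])                 # stochastic gradient: E[g | past] = grad f(w)
            w = project(w - alpha0 / np.sqrt(t) * g)  # update with step size alpha_t = alpha0 / sqrt(t)
        return w

    # Tiny demonstration: least squares F(w; x, y) = (1/2)(<w, x> - y)^2 with W = R^2.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 2))
    w_true = np.array([1.0, -2.0])
    y = X @ w_true
    w_hat = sgd(lambda w, x, yy: (w @ x - yy) * x, lambda w: w, X, y, np.zeros(2))
    print(w_hat)   # approximately [1.0, -2.0]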

7 What everyone knows we should do
Obviously: get a lower-variance estimate of the gradient. Sample $g_{j,t}$ with $\mathbb{E}[g_{j,t}] = \nabla f(w_t)$ and use $g_t = \frac{1}{m}\sum_{j=1}^m g_{j,t}$.
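
A quick numerical sanity check of the averaging idea (again my own illustration, with synthetic Gaussian noise standing in for gradient noise): averaging $m$ independent unbiased estimates shrinks the variance by roughly a factor of $m$.

    import numpy as np

    rng = np.random.default_rng(0)
    true_grad = np.array([1.0, -2.0])

    def noisy_grad():
        return true_grad + rng.normal(size=2)     # unbiased: mean equals the true gradient

    for m in [1, 4, 16, 64]:
        batch_means = np.array([np.mean([noisy_grad() for _ in range(m)], axis=0)
                                for _ in range(3000)])
        print(m, batch_means.var(axis=0).mean())  # empirical variance shrinks like 1/m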

11 Theoretical Justification
Normal stochastic gradient rate: $f(w_T) - f(w^\star) = O\left(\frac{1}{\sqrt{T}}\right)$.

12 Theoretical Justification (continued)
Now, if we use $m$ gradient samples and the function $f$ is suitably smooth, $f(w_T) - f(w^\star) = O\left(\frac{1}{T} + \frac{1}{\sqrt{Tm}}\right)$ (Juditsky et al. 2008, Lan 2010, Dekel et al. 2010).

13 Theoretical Justification (continued)
Problem: suitably smooth functions.

14 Non-smooth problems we care about
SVM: $F(w; \{x, y\}) = [1 - y\langle w, x\rangle]_+$.
Robust regression: $F(w; \{x, y\}) = |y - \langle w, x\rangle|$.
Structured prediction: $F(w; \{x, y\}) = \max_{\hat{y} \in \mathcal{Y}} \left[ L(y, \hat{y}) + \langle w, \Phi(x, \hat{y}) - \Phi(x, y)\rangle \right]$.
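
As a concrete reference, here are sketch subgradient oracles for the first two losses (my own illustration; the structured-prediction case additionally needs a loss-augmented argmax over $\mathcal{Y}$, which is problem specific and omitted). At kink points any valid subgradient may be returned:

    import numpy as np

    def svm_subgrad(w, x, y):
        """A subgradient of the hinge loss F(w; {x, y}) = [1 - y <w, x>]_+."""
        return -y * x if 1.0 - y * np.dot(w, x) > 0 else np.zeros_like(w)

    def robust_subgrad(w, x, y):
        """A subgradient of the robust-regression loss F(w; {x, y}) = |y - <w, x>|."""
        return np.sign(np.dot(w, x) - y) * x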

15 Difficulties of non-smooth problems
Intuition: a subgradient is a poor indicator of global structure.

16 Better global estimators
Idea: ask for subgradients from multiple points.

18 The algorithm
Normal approach: sample $x$ at random, $g_{j,t} \in \partial F(w_t; x)$.
Our approach: add noise to $w$: $g_{j,t} \in \partial F(w_t + \mu_t Z_j; x)$. Decrease the magnitude $\mu_t$ over time.
(Figure: sampling from a ball of radius $\mu_t$ around the iterate $w_t$.)
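
A sketch of this estimator (my own rendering, with Gaussian perturbations; the talk also covers noise uniform on $\ell_p$ balls, and subgrad_F and sample_datum below are hypothetical placeholders):

    import numpy as np

    def smoothed_gradient(subgrad_F, w, sample_datum, mu, m, rng):
        """g_t = (1/m) sum_j g_{j,t}, where g_{j,t} is a subgradient of F(. ; x_j) at w + mu * Z_j."""
        g = np.zeros_like(w)
        for _ in range(m):
            x, y = sample_datum()                  # draw a data point (x_j, y_j)
            z = rng.normal(size=w.shape)           # perturbation Z_j ~ N(0, I)
            g += subgrad_F(w + mu * z, x, y)       # subgradient at the perturbed point
        return g / m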

21 Algorithm
Generalization of accelerated gradient methods (Tseng 2008, Lan 2010). Keep a query point and an exploration point: $u_t = (1 - \theta_t) w_t + \theta_t v_t$.
Sample $x_{j,t}$ and $Z_{j,t}$, and compute the gradient approximation $g_t = \frac{1}{m}\sum_{j=1}^m g_{j,t}$ with $g_{j,t} \in \partial F(u_t + \mu_t Z_{j,t}; x_{j,t})$.
Solve for the exploration point $v_{t+1} = \operatorname{argmin}_{w \in W} \left\{ \sum_{\tau=0}^{t} \frac{1}{\theta_\tau} \langle g_\tau, w\rangle + \frac{\|w\|_2^2}{2\alpha_t} \right\}$, where the linear sum approximates $f$ and the quadratic term regularizes.
Interpolate: $w_{t+1} = (1 - \theta_t) w_t + \theta_t v_{t+1}$.
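
The pieces combine into the following sketch, under simplifying assumptions of my own: $W = \mathbb{R}^d$ (so the regularized minimization has the closed form $v_{t+1} = -\alpha_t \sum_{\tau \le t} g_\tau/\theta_\tau$), a fixed step parameter $\alpha_t = \alpha$, the standard choice $\theta_t = 2/(t+2)$, and $\mu_t = \mu_0/(t+1)$; subgrad_F and sample_datum are placeholders as before.

    import numpy as np

    def accelerated_smoothed(subgrad_F, sample_datum, d, T=500, m=10,
                             alpha=0.1, mu0=1.0, seed=0):
        """Accelerated dual-averaging loop with randomized smoothing (unconstrained W = R^d)."""
        rng = np.random.default_rng(seed)
        w = np.zeros(d)
        v = np.zeros(d)
        dual_sum = np.zeros(d)                    # running sum of g_tau / theta_tau
        for t in range(T):
            theta = 2.0 / (t + 2)                 # a standard choice of theta_t
            mu = mu0 / (t + 1)                    # decrease the smoothing magnitude over time
            u = (1 - theta) * w + theta * v       # query point
            g = np.zeros(d)                       # averaged subgradient of the smoothed objective at u
            for _ in range(m):
                x, y = sample_datum()
                z = rng.normal(size=d)
                g += subgrad_F(u + mu * z, x, y)
            g /= m
            dual_sum += g / theta
            v = -alpha * dual_sum                 # exploration point: closed-form argmin when W = R^d
            w = (1 - theta) * w + theta * v       # interpolate
        return w

With a constraint set $W$ and this Euclidean regularizer, the exploration step instead becomes the Euclidean projection of $-\alpha \sum_{\tau \le t} g_\tau/\theta_\tau$ onto $W$.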

22 Theoretical Results
Objective: minimize $f(w)$ over $w \in W$, where $f(w) = \mathbb{E}[F(w; x)]$.

23 Theoretical Results (continued)
Non-strongly convex objectives: $f(w_T) - f(w^\star) = O\left(\frac{1}{T} + \frac{1}{\sqrt{Tm}}\right)$.

24 Theoretical Results (continued)
$\lambda$-strongly convex objectives: $f(w_T) - f(w^\star) = O\left(\frac{C}{T} + \frac{1}{\lambda T m}\right)$.

25 A few remarks on distributing
Convergence rate: $f(w_T) - f(w^\star) = O\left(\frac{1}{T} + \frac{1}{\sqrt{Tm}}\right)$. If communication is expensive, use larger batch sizes $m$: (a) communication cost is $c$; (b) $n$ computers with batch size $m$; (c) $S$ total update steps.

26 A few remarks on distributing (continued)
Backsolve: after $T = S(m + c)$ units of time, the error is $O\left(\frac{m+c}{T} + \frac{1}{\sqrt{Tn}}\sqrt{\frac{m+c}{m}}\right)$.
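
Filling in the algebra behind the backsolve (a one-line check, assuming each of the $S$ steps averages $m$ gradients on each of the $n$ machines, i.e. an effective batch of $nm$, and costs $m + c$ time units):

$$ S = \frac{T}{m + c}, \qquad O\left(\frac{1}{S} + \frac{1}{\sqrt{S\,nm}}\right) = O\left(\frac{m+c}{T} + \frac{1}{\sqrt{Tn}}\sqrt{\frac{m+c}{m}}\right). $$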

27 Experimental results

28 Iteration complexity simulations
Define $T(\epsilon, m) = \min\{t \in \mathbb{N} : f(w_t) - f(w^\star) \le \epsilon\}$ and solve the robust regression problem $f(w) = \frac{1}{n}\sum_{i=1}^n |\langle x_i, w\rangle - y_i| = \frac{1}{n}\|Xw - y\|_1$.
(Figures: iterations to $\epsilon$-optimality, actual $T(\epsilon, m)$ versus predicted $T(\epsilon, m)$, as a function of the number $m$ of gradient samples.)

29 Robustness to stepsize and smoothing
Two parameters: smoothing parameter $\mu$ and stepsize $\eta$. Plot: optimality gap after 2000 iterations on a synthetic SVM problem, $f(w) + \varphi(w) := \frac{1}{n}\sum_{i=1}^n [1 - y_i\langle x_i, w\rangle]_+ + \frac{\lambda}{2}\|w\|_2^2$, as a function of $\mu$ and $\eta$.

30 Metric learning: no need to distribute
Data $x_i \in \mathbb{R}^d$ with measures $y_{ij} \ge 0$ of similarity between $x_i$ and $x_j$. Goal: learn $W \succeq 0$ such that $(x_i - x_j)^\top W (x_i - x_j) \approx y_{ij}$:
minimize, over $W \succeq 0$ with $\operatorname{tr}(W) \le C$, the objective $f(W) = \binom{n}{2}^{-1} \sum_{i < j} \left| \operatorname{tr}\big(W (x_i - x_j)(x_i - x_j)^\top\big) - y_{ij} \right|$.
(Figure: $f(x_t) - f(x^\star)$ versus time in seconds for $m = 1, 2, 4, 8, 16, 32, 64, 128$.)

31 Support vector machines
Reuters RCV1 dataset, time to an $\epsilon$-optimal solution for $\frac{1}{n}\sum_{i=1}^n [1 - y_i\langle x_i, w\rangle]_+ + \frac{\lambda}{2}\|w\|_2^2$.
(Figures: mean time in seconds to a fixed optimality gap versus the number of worker threads, for batch sizes 10 and 20.)

32 Support vector machines (continued)
Reuters RCV1 dataset, optimization speed for minimizing $\frac{1}{n}\sum_{i=1}^n [1 - y_i\langle x_i, w\rangle]_+ + \frac{\lambda}{2}\|w\|_2^2$.
(Figure: optimality gap versus time in seconds for the accelerated method with 1, 2, 3, 4, 6, and 8 workers, compared with Pegasos.)

33 Parsing experiments
Penn Treebank dataset, learning PCFG weights for a hypergraph parser (here $x$ is a sentence and $y \in \mathcal{Y}$ is a parse tree): $\frac{1}{n}\sum_{i=1}^n \max_{\hat{y} \in \mathcal{Y}} \left[ L(y_i, \hat{y}) + \langle w, \Phi(x_i, \hat{y}) - \Phi(x_i, y_i)\rangle \right] + \frac{\lambda}{2}\|w\|_2^2$.
(Figures: optimality gap versus iterations and versus time in seconds, compared with RDA.)

34 Acknowledgments
Collaborators: Martin Wainwright and Peter Bartlett. Slav Petrov and Sasha Rush for help with the NLP experiments. Yoram Singer, Mike Jordan.

35 Thanks!

37 Is smoothing necessary?
Solve the multiple-median problem $f(w) = \frac{1}{n}\sum_{i=1}^n \|w - x_i\|_1$, with $x_i \in \{-1, 1\}^d$. Compare with standard stochastic gradient.
(Figure: iterations to $\epsilon$-optimality versus the number $m$ of gradient samples, smoothed versus unsmoothed.)
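
For reference, a sketch of the two estimators being compared (my own illustration; the dimension, sample size, and $\mu$ are arbitrary choices): the unsmoothed estimator averages subgradients $\operatorname{sign}(w - x_i)$ at $w$ itself, while the smoothed one perturbs $w$ before each subgradient evaluation.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 10, 200
    Xpts = rng.choice([-1.0, 1.0], size=(n, d))               # data points x_i in {-1, +1}^d

    def unsmoothed_grad(w, m):
        idx = rng.integers(n, size=m)
        return np.mean(np.sign(w - Xpts[idx]), axis=0)        # subgradients of ||w - x_i||_1 at w

    def smoothed_grad(w, m, mu=0.1):
        idx = rng.integers(n, size=m)
        Z = rng.normal(size=(m, d))
        return np.mean(np.sign(w + mu * Z - Xpts[idx]), axis=0)   # subgradients at perturbed points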

38 Analysis ideas
Two main parts: (1) understand accelerated gradient methods; (2) show that the perturbed function is smooth and uniformly close to the original function.

39 Accelerated Gradient Methods
Know that for smooth $f$, the method
$u_t = (1 - \theta_t) w_t + \theta_t v_t$,
$v_{t+1} = \operatorname{argmin}_{w \in W} \left\{ \sum_{\tau=0}^{t} \frac{1}{\theta_\tau} \langle g_\tau, w\rangle + \frac{L}{2}\|w\|^2 \right\}$,
$w_{t+1} = (1 - \theta_t) w_t + \theta_t v_{t+1}$,
where $g_\tau = \nabla f(u_\tau)$, yields $f(w_t) - f(w^\star) = O\left(\frac{L}{t^2}\right)$ (Nesterov, 1983; Tseng, 2008).

40 Accelerated Gradient Methods (continued)
For smooth $f$, the same method with
$v_{t+1} = \operatorname{argmin}_{w \in W} \left\{ \sum_{\tau=0}^{t} \frac{1}{\theta_\tau} \langle g_\tau, w\rangle + \frac{L + \eta}{2}\|w\|^2 \right\}$,
where $\sigma^2 \ge \mathbb{E}[\|g_\tau - \nabla f(u_\tau)\|_2^2]$, yields $f(w_t) - f(w^\star) = O\left(\frac{L}{t^2} + \frac{\eta}{t} + \frac{\sigma^2}{\eta}\right)$ (Nesterov, 1983; Tseng, 2008; Lan, 2010).

41 Perturbation
Idea: perturbation is convolution, which smooths. We have to trade off smoothness of $f_\mu(w) := \mathbb{E}[f(w + \mu Z)]$ against uniform approximation $\sup_w |f_\mu(w) - f(w)|$.

42 Perturbation (continued)
We show, for $Z$ normal or uniform on $\ell_p$ balls, that $\|\nabla f_\mu(w) - \nabla f_\mu(v)\| = O\left(\frac{1}{\mu}\|w - v\|\right)$ and $\sup_{w \in W} |f_\mu(w) - f(w)| = O(\mu)$.
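
A small numerical check of the second property on a toy example of my own ($f(w) = |w|$ in one dimension, Gaussian $Z$): the Monte Carlo estimate of $f_\mu$ stays within a constant multiple of $\mu$ of $f$.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(w):
        return np.abs(w)                          # toy non-smooth function

    def f_mu(w, mu, k=100000):
        z = rng.normal(size=k)
        return np.mean(f(w + mu * z))             # Monte Carlo estimate of f_mu(w) = E[f(w + mu Z)]

    for mu in [0.5, 0.1, 0.02]:
        gap = max(abs(f_mu(w, mu) - f(w)) for w in np.linspace(-1.0, 1.0, 41))
        print(f"mu = {mu}: sup-gap ~ {gap:.3f}")  # roughly 0.8 * mu, i.e. O(mu)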

43 Concluding derivation
(1) From accelerated gradient applied to the $L/\mu$-smooth $f_\mu$: $f_\mu(w_t) - f_\mu(w^\star) = O\left(\frac{1}{\mu t^2} + \frac{\eta}{t} + \frac{\sigma^2}{\eta}\right)$.
(2) Add in the uniform approximation: $f(w_t) - f(w^\star) = f_\mu(w_t) - f_\mu(w^\star) + O(\mu) = O\left(\frac{1}{\mu t^2} + \frac{\eta}{t} + \frac{\sigma^2}{\eta} + \mu\right)$.
(3) Take $\mu = 1/t$ and $\eta = \sigma\sqrt{t}$: $f(w_t) - f(w^\star) = O\left(\frac{1}{t} + \frac{\sigma}{\sqrt{t}}\right)$.
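
Spelling out step (3) term by term, with $\mu = 1/t$ and $\eta = \sigma\sqrt{t}$:

$$ \frac{1}{\mu t^2} = \frac{1}{t}, \qquad \frac{\eta}{t} = \frac{\sigma}{\sqrt{t}}, \qquad \frac{\sigma^2}{\eta} = \frac{\sigma}{\sqrt{t}}, \qquad \mu = \frac{1}{t}, $$

so the four terms sum to $O\left(\frac{1}{t} + \frac{\sigma}{\sqrt{t}}\right)$, as claimed.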
