Randomized Smoothing Techniques in Optimization


1 Randomized Smoothing Techniques in Optimization John Duchi Based on joint work with Peter Bartlett, Michael Jordan, Martin Wainwright, Andre Wibisono Stanford University Information Systems Laboratory Seminar October 2014

2 Problem Statement Goal: solve the following problem minimize f(x) subject to x ∈ X

3 Problem Statement Goal: solve the following problem minimize f(x) subject to x ∈ X Often, we will assume f(x) := (1/n) Σ_{i=1}^n F(x; ξ_i) or f(x) := E[F(x; ξ)]

4 Gradient Descent Goal: solve minimize f(x) Technique: go down the slope, x_{t+1} = x_t − α ∇f(x_t) [figure: f with its linearization f(y) + ⟨∇f(y), x − y⟩ at the point (y, f(y))]

5 When is optimization easy? Easy problem: function is convex, nice and smooth

6 When is optimization easy? Easy problem: function is convex, nice and smooth Not so easy problem: function is non-smooth

7 When is optimization easy? Easy problem: function is convex, nice and smooth Not so easy problem: function is non-smooth Even harder problems: We cannot compute gradients ∇f(x) Function f is non-convex and non-smooth

8 Example 1: Robust regression Data in pairs ξ_i = (a_i, b_i) ∈ R^d × R Want to estimate b_i ≈ a_i^⊤ x

9 Example 1: Robust regression Data in pairs ξ_i = (a_i, b_i) ∈ R^d × R Want to estimate b_i ≈ a_i^⊤ x To avoid outliers, minimize f(x) = (1/n) Σ_{i=1}^n |a_i^⊤ x − b_i| = (1/n) ‖Ax − b‖_1
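For concreteness, here is a minimal NumPy sketch of this robust-regression objective and one of its subgradients; the toy data and the sign-based subgradient rule are illustrative choices, not taken from the talk.

```python
import numpy as np

def robust_loss(x, A, b):
    """f(x) = (1/n) * ||A x - b||_1."""
    return np.abs(A @ x - b).mean()

def robust_subgradient(x, A, b):
    """One subgradient of f at x: (1/n) * A^T sign(A x - b)."""
    n = A.shape[0]
    return A.T @ np.sign(A @ x - b) / n

# toy usage with synthetic data
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = A @ np.ones(5) + rng.standard_normal(100)
x = np.zeros(5)
print(robust_loss(x, A, b), robust_subgradient(x, A, b))
```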

10 Example 2: Protein Structure Prediction [figure: structure for the protein sequence RSCCPCYWGGCPWGQNCYPEGCSGPKV (Grace et al., PNAS 2004)]

11 Protein Structure Prediction Featurize edges e in graph: vector ξ_e. Labels ν are matchings in the graph; the set V is all matchings. [figure: graph with edge weights x^⊤ ξ_e]

12 Protein Structure Prediction Featurize edges e in graph: vector ξ_e. Labels ν are matchings in the graph; the set V is all matchings. [figure: graph with edge weights x^⊤ ξ_e] Goal: Learn weights x so that argmax_{ν̂ ∈ V} Σ_{e ∈ ν̂} ξ_e^⊤ x = ν, i.e. learn x so that the maximum matching in the graph with edge weights x^⊤ ξ_e is the correct one

13 Protein Structure Prediction [figure: graph with edge weights x^⊤ ξ_e] Goal: Learn weights x so that argmax_{ν̂ ∈ V} Σ_{e ∈ ν̂} ξ_e^⊤ x = ν, i.e. learn x so that the maximum matching in the graph with edge weights x^⊤ ξ_e is the correct one

14 Protein Structure Prediction [figure: graph with edge weights x^⊤ ξ_e] Goal: Learn weights x so that argmax_{ν̂ ∈ V} Σ_{e ∈ ν̂} ξ_e^⊤ x = ν, i.e. learn x so that the maximum matching in the graph with edge weights x^⊤ ξ_e is the correct one Loss function: L(ν, ν̂) is the number of disagreements between matchings, and F(x; {ξ, ν}) := max_{ν̂ ∈ V} ( L(ν, ν̂) + x^⊤ Σ_{e ∈ ν̂} ξ_e − x^⊤ Σ_{e ∈ ν} ξ_e ).
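To make this loss concrete, the following hedged Python sketch evaluates F(x; {ξ, ν}) and one subgradient, assuming a problem-specific loss-augmented decoder is available; the `loss_augmented_decode` function and its interface are hypothetical placeholders, not part of the talk.

```python
import numpy as np

def structured_hinge(x, edge_features, true_matching, loss_augmented_decode):
    """Structured loss F(x; {xi, nu}) = max_nuhat [L(nu, nuhat) + x.phi(nuhat)] - x.phi(nu).

    `edge_features[e]` is the vector xi_e; `true_matching` is the set of edges in nu.
    `loss_augmented_decode(x, edge_features, true_matching)` is assumed to return the
    matching nuhat maximizing L(nu, nuhat) + sum_{e in nuhat} x^T xi_e together with
    the value L(nu, nuhat); implementing it is a combinatorial (max-matching) problem.
    """
    nuhat, loss = loss_augmented_decode(x, edge_features, true_matching)
    score = lambda matching: sum(float(x @ edge_features[e]) for e in matching)
    value = loss + score(nuhat) - score(true_matching)
    # One subgradient: features of the loss-augmented matching minus those of the truth.
    subgrad = (sum(edge_features[e] for e in nuhat)
               - sum(edge_features[e] for e in true_matching))
    return value, subgrad
```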

15 When is optimization easy? Easy problem: function is convex, nice and smooth Not so easy problem: function is non-smooth Even harder problems: We cannot compute gradients ∇f(x) Function f is non-convex and non-smooth

16 One technique to address many of these Instead of only using f(x) and ∇f(x) to solve minimize f(x), get more global information

17 One technique to address many of these Instead of only using f(x) and ∇f(x) to solve minimize f(x), get more global information Let Z be a random variable and, for small u > 0, look at f at the nearby points x + uZ

18 One technique to address many of these Instead of only using f(x) and ∇f(x) to solve minimize f(x), get more global information Let Z be a random variable and, for small u > 0, look at f at the nearby points x + uZ

19 One technique to address many of these Instead of only using f(x) and ∇f(x) to solve minimize f(x), get more global information Let Z be a random variable and, for small u > 0, look at f at the nearby points x + uZ

20 Three instances I Solving previously unsolvable problems [Burke, Lewis, Overton 2005] Non-smooth, non-convex problems

21 Three instances I Solving previously unsolvable problems [Burke, Lewis, Overton 2005] Non-smooth, non-convex problems II Optimal convergence guarantees for problems with existing algorithms [D., Jordan, Wainwright, Wibisono 2014] Smooth and non-smooth zero order stochastic and non-stochastic optimization problems

22 Three instances I Solving previously unsolvable problems [Burke, Lewis, Overton 2005] Non-smooth, non-convex problems II Optimal convergence guarantees for problems with existing algorithms [D., Jordan, Wainwright, Wibisono 2014] Smooth and non-smooth zero order stochastic and non-stochastic optimization problems III Parallelism: really fast solutions for large scale problems [D., Bartlett, Wainwright 2013] Smooth and non-smooth stochastic optimization problems

23 Instance I: Gradient Sampling Algorithm Problem: Solve minimize_{x ∈ X} f(x) where f is potentially non-smooth and non-convex (but assume it is continuous and a.e. differentiable) [Burke, Lewis, Overton 2005]

24 Instance I: Gradient Sampling Algorithm Problem: Solve minimize_{x ∈ X} f(x) where f is potentially non-smooth and non-convex (but assume it is continuous and a.e. differentiable) [Burke, Lewis, Overton 2005]
At each iteration t:
Draw Z_1, ..., Z_m i.i.d. with ‖Z_i‖_2 ≤ 1
Set g_t^i = ∇f(x_t + u Z_i)
Set the gradient g_t as g_t = argmin_g { ‖g‖_2^2 : g = Σ_i λ_i g_t^i, λ ≥ 0, Σ_i λ_i = 1 }
Update x_{t+1} = x_t − α g_t, where α > 0 is chosen by line search
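A rough sketch of one gradient-sampling iteration follows; `grad_f` is an assumed oracle for ∇f where it exists, the min-norm convex combination is found with a small SciPy QP over the simplex, and the fixed step in place of a line search is a simplification.

```python
import numpy as np
from scipy.optimize import minimize

def gradient_sampling_step(x, grad_f, u=0.1, m=20, alpha=0.5, rng=np.random.default_rng()):
    """One iteration: sample gradients in a radius-u ball around x, combine, step."""
    d = x.size
    # Z_i roughly uniform in the unit ball: random direction times a scaled radius.
    Z = rng.standard_normal((m, d))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)
    Z *= rng.uniform(size=(m, 1)) ** (1.0 / d)
    G = np.array([grad_f(x + u * z) for z in Z])        # rows are g_t^i

    # g_t = argmin { ||g||_2^2 : g = sum_i lambda_i g_t^i, lambda >= 0, sum_i lambda_i = 1 }
    objective = lambda lam: float(np.sum((G.T @ lam) ** 2))
    constraints = ({'type': 'eq', 'fun': lambda lam: lam.sum() - 1.0},)
    res = minimize(objective, np.full(m, 1.0 / m),
                   bounds=[(0.0, None)] * m, constraints=constraints)
    g = G.T @ res.x
    return x - alpha * g   # the actual method chooses alpha by line search
```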

25 Instance I: Gradient Sampling Algorithm Problem: Solve minimize_{x ∈ X} f(x) where f is potentially non-smooth and non-convex (but assume it is continuous and a.e. differentiable) [Burke, Lewis, Overton 2005] Define the set G_u(x) := Conv{ ∇f(x′) : ‖x′ − x‖_2 ≤ u, ∇f(x′) exists } Proposition (Burke, Lewis, Overton): There exist cluster points x̄ of the sequence x_t, and for any such cluster point, 0 ∈ G_u(x̄)

26 Instance II: Zero Order Optimization Problem: We want to solve minimize_{x ∈ X} f(x) = E[F(x; ξ)] but we are only allowed to observe function values f(x) (or F(x; ξ))

27 Instance II: Zero Order Optimization Problem: We want to solve minimize_{x ∈ X} f(x) = E[F(x; ξ)] but we are only allowed to observe function values f(x) (or F(x; ξ)) Idea: Approximate the gradient by function differences, f′(y) ≈ g_u := (f(y + u) − f(y))/u [figure: f with secant approximations f(y) + ⟨g_{u_0}, x − y⟩ and f(y) + ⟨g_{u_1}, x − y⟩ for u_1 < u_0]

28 Instance II: Zero Order Optimization Problem: We want to solve minimize_{x ∈ X} f(x) = E[F(x; ξ)] but we are only allowed to observe function values f(x) (or F(x; ξ)) Idea: Approximate the gradient by function differences, f′(y) ≈ g_u := (f(y + u) − f(y))/u Long history in optimization: Kiefer-Wolfowitz, Spall, Robbins-Monro [figure: f with secant approximations f(y) + ⟨g_{u_0}, x − y⟩ and f(y) + ⟨g_{u_1}, x − y⟩ for u_1 < u_0]

29 Instance II: Zero Order Optimization Problem: We want to solve minimize_{x ∈ X} f(x) = E[F(x; ξ)] but we are only allowed to observe function values f(x) (or F(x; ξ)) Idea: Approximate the gradient by function differences, f′(y) ≈ g_u := (f(y + u) − f(y))/u Long history in optimization: Kiefer-Wolfowitz, Spall, Robbins-Monro Can randomized perturbations give insights? [figure: f with secant approximations f(y) + ⟨g_{u_0}, x − y⟩ and f(y) + ⟨g_{u_1}, x − y⟩ for u_1 < u_0]

30 Stochastic Gradient Descent Algorithm: At iteration t, choose a random ξ, set g_t = ∇F(x_t; ξ), and update x_{t+1} = x_t − α g_t

31 Stochastic Gradient Descent Algorithm: At iteration t, choose a random ξ, set g_t = ∇F(x_t; ξ), and update x_{t+1} = x_t − α g_t

32 Stochastic Gradient Descent Algorithm: At iteration t, choose a random ξ, set g_t = ∇F(x_t; ξ), and update x_{t+1} = x_t − α g_t
Theorem (Russians): Let x̄_T = (1/T) Σ_{t=1}^T x_t and assume R ≥ ‖x⋆ − x_1‖_2 and G^2 ≥ E[‖g_t‖_2^2]. Then E[f(x̄_T) − f(x⋆)] ≤ RG/√T

33 Stochastic Gradient Descent Algorithm: At iteration t, choose a random ξ, set g_t = ∇F(x_t; ξ), and update x_{t+1} = x_t − α g_t
Theorem (Russians): Let x̄_T = (1/T) Σ_{t=1}^T x_t and assume R ≥ ‖x⋆ − x_1‖_2 and G^2 ≥ E[‖g_t‖_2^2]. Then E[f(x̄_T) − f(x⋆)] ≤ RG/√T
Note: Dependence on G important
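As a baseline for what follows, here is a minimal sketch of this stochastic gradient method with iterate averaging; `stoch_grad` is an assumed oracle returning an unbiased stochastic (sub)gradient, and projection onto X is omitted.

```python
import numpy as np

def sgd_average(stoch_grad, x0, alpha, T, rng=np.random.default_rng()):
    """Plain SGD, x_{t+1} = x_t - alpha * g_t, returning the averaged iterate xbar_T."""
    x = np.array(x0, dtype=float)
    xbar = np.zeros_like(x)
    for t in range(1, T + 1):
        g = stoch_grad(x, rng)          # E[g] = grad f(x)
        x = x - alpha * g
        xbar += (x - xbar) / t          # running average (1/T) sum_t x_t
    return xbar
```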

34 Derivative-free gradient descent E[f(x̄_T) − f(x⋆)] ≤ RG · (1/√T) Question: How well can we estimate the gradient ∇f using only function differences? And how small is the norm of this estimate?

35 Derivative-free gradient descent E[f(x̄_T) − f(x⋆)] ≤ RG · (1/√T) Question: How well can we estimate the gradient ∇f using only function differences? And how small is the norm of this estimate? First idea for a gradient estimator: Sample Z ∼ µ satisfying E_µ[ZZ^⊤] = I_{d×d} Gradient estimator at x: g = (f(x + uZ) − f(x))/u · Z Perform gradient descent using these g

36 Two-point gradient estimates At any point x and any direction z, for small u > 0, (f(x + uz) − f(x))/u ≈ f′(x, z) := lim_{h→0} (f(x + hz) − f(x))/h If ∇f(x) exists, f′(x, z) = ⟨∇f(x), z⟩ If E[ZZ^⊤] = I, then E[f′(x, Z)Z] = E[ZZ^⊤ ∇f(x)] = ∇f(x)

37 Two-point gradient estimates At any point x and any direction z, for small u > 0, (f(x + uz) − f(x))/u ≈ f′(x, z) := lim_{h→0} (f(x + hz) − f(x))/h If ∇f(x) exists, f′(x, z) = ⟨∇f(x), z⟩ If E[ZZ^⊤] = I, then E[f′(x, Z)Z] = E[ZZ^⊤ ∇f(x)] = ∇f(x) [figures: random estimates f′(x, Z)Z and their average, ∇f]

38 Two-point stochastic gradient: differentiable functions To solve the d-dimensional problem minimize_{x ∈ X ⊆ R^d} f(x) := E[F(x; ξ)]
Algorithm: Iterate
Draw ξ according to its distribution, draw Z ∼ µ with Cov(Z) = I
Set u_t = u/t and g_t = (F(x_t + u_t Z; ξ) − F(x_t; ξ))/u_t · Z
Update x_{t+1} = x_t − α g_t

39 Two-point stochastic gradient: differentiable functions To solve the d-dimensional problem minimize_{x ∈ X ⊆ R^d} f(x) := E[F(x; ξ)]
Algorithm: Iterate
Draw ξ according to its distribution, draw Z ∼ µ with Cov(Z) = I
Set u_t = u/t and g_t = (F(x_t + u_t Z; ξ) − F(x_t; ξ))/u_t · Z
Update x_{t+1} = x_t − α g_t
Theorem (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x⋆ − x_1‖_2 and E[‖∇F(x; ξ)‖_2^2] ≤ G^2 for all x, then E[f(x̄_T) − f(x⋆)] ≤ RG √(d/T) + O(u^2 log T / T).
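A minimal sketch of this two-point zero-order method, assuming an oracle `F(x, rng)` that returns F(x; ξ) for a sample ξ drawn from `rng`; reusing the seed ensures both evaluations use the same ξ, and projection onto X is again omitted.

```python
import numpy as np

def zero_order_sgd(F, x0, alpha, u, T, rng=np.random.default_rng()):
    """Two-point stochastic gradient descent using only function evaluations."""
    d = len(x0)
    x = np.array(x0, dtype=float)
    xbar = np.zeros_like(x)
    for t in range(1, T + 1):
        u_t = u / t
        Z = rng.standard_normal(d)                    # Cov(Z) = I
        seed = int(rng.integers(1 << 31))             # reuse the same xi in both calls
        f_plus = F(x + u_t * Z, np.random.default_rng(seed))
        f_zero = F(x, np.random.default_rng(seed))
        g = (f_plus - f_zero) / u_t * Z               # two-point gradient estimate
        x = x - alpha * g
        xbar += (x - xbar) / t
    return xbar
```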

40 Comparisons to knowing gradient Convergence rate scaling: RG · (1/√T) versus RG · √(d/T)

41 Comparisons to knowing gradient Convergence rate scaling: RG · (1/√T) versus RG · √(d/T) To achieve ǫ-accuracy: 1/ǫ^2 versus d/ǫ^2 iterations

42 Comparisons to knowing gradient Convergence rate scaling: RG · (1/√T) versus RG · √(d/T) To achieve ǫ-accuracy: 1/ǫ^2 versus d/ǫ^2 iterations [figure: optimality gap versus iterations for SGD and the zero-order method]

43 Comparisons to knowing gradient Convergence rate scaling: RG · (1/√T) versus RG · √(d/T) To achieve ǫ-accuracy: 1/ǫ^2 versus d/ǫ^2 iterations [figures: optimality gap versus iterations, and versus rescaled iterations, for SGD and the zero-order method]

44 Two-point stochastic gradient: non-differentiable functions Problem: If f is non-differentiable, kinks make estimates too large

45 Two-point stochastic gradient: non-differentiable functions Problem: If f is non-differentiable, kinks make estimates too large

46 Two-point stochastic gradient: non-differentiable functions Problem: If f is non-differentiable, kinks make estimates too large Example: Let f(x) = ‖x‖_2. Then if E[ZZ^⊤] = I_{d×d}, at x = 0, E[‖(f(uZ) − 0)Z/u‖_2^2] = E[‖Z‖_2^2 · ‖Z‖_2^2] ≈ d^2

47 Two-point stochastic gradient: non-differentiable functions Problem: If f is non-differentiable, kinks make estimates too large Example: Let f(x) = ‖x‖_2. Then if E[ZZ^⊤] = I_{d×d}, at x = 0, E[‖(f(uZ) − 0)Z/u‖_2^2] = E[‖Z‖_2^2 · ‖Z‖_2^2] ≈ d^2 Idea: More randomization! [figure: two perturbation radii u_1 and u_2]

48 Two-point stochastic gradient: non-differentiable functions Problem: If f is non-differentiable, kinks make estimates too large Proposition (D., Jordan, Wainwright, Wibisono): If Z_1, Z_2 are N(0, I_{d×d}) or uniform on ‖z‖_2 ≤ √d, then g := (f(x + u_1 Z_1 + u_2 Z_2) − f(x + u_1 Z_1))/u_2 · Z_2 satisfies E[g] = ∇f_{u_1}(x) + O(u_2/u_1) and E[‖g‖_2^2] ≲ d ((u_2/u_1) d + log(2d)).

49 Two-point stochastic gradient: non-differentiable functions Problem: If f is non-differentiable, kinks make estimates too large Proposition (D., Jordan, Wainwright, Wibisono): If Z_1, Z_2 are N(0, I_{d×d}) or uniform on ‖z‖_2 ≤ √d, then g := (f(x + u_1 Z_1 + u_2 Z_2) − f(x + u_1 Z_1))/u_2 · Z_2 satisfies E[g] = ∇f_{u_1}(x) + O(u_2/u_1) and E[‖g‖_2^2] ≲ d ((u_2/u_1) d + log(2d)). Note: If u_2/u_1 → 0, scaling linear in d

50 Two-point sub-gradient: non-differentiable functions To solve the d-dimensional problem minimize f(x) subject to x ∈ X ⊆ R^d
Algorithm: Iterate
Draw Z_1 ∼ µ and Z_2 ∼ µ with Cov(Z) = I
Set u_{t,1} = u/t, u_{t,2} = u/t^2, and g_t = (f(x_t + u_{t,1} Z_1 + u_{t,2} Z_2) − f(x_t + u_{t,1} Z_1))/u_{t,2} · Z_2
Update x_{t+1} = x_t − α g_t

51 Two-point sub-gradient: non-differentiable functions To solve the d-dimensional problem minimize f(x) subject to x ∈ X ⊆ R^d
Algorithm: Iterate
Draw Z_1 ∼ µ and Z_2 ∼ µ with Cov(Z) = I
Set u_{t,1} = u/t, u_{t,2} = u/t^2, and g_t = (f(x_t + u_{t,1} Z_1 + u_{t,2} Z_2) − f(x_t + u_{t,1} Z_1))/u_{t,2} · Z_2
Update x_{t+1} = x_t − α g_t
Theorem (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x⋆ − x_1‖_2 and ‖∂f(x)‖_2^2 ≤ G^2 for all x, then E[f(x̄_T) − f(x⋆)] ≤ RG √(d log d / T) + O(u log T / T).

52 Two-point sub-gradient: non-differentiable functions To solve the d-dimensional problem minimize_{x ∈ X ⊆ R^d} f(x) := E[F(x; ξ)]
Algorithm: Iterate
Draw ξ according to its distribution, draw Z_1 ∼ µ and Z_2 ∼ µ with Cov(Z) = I
Set u_{t,1} = u/t, u_{t,2} = u/t^2, and g_t = (F(x_t + u_{t,1} Z_1 + u_{t,2} Z_2; ξ) − F(x_t + u_{t,1} Z_1; ξ))/u_{t,2} · Z_2
Update x_{t+1} = x_t − α g_t

53 Two-point sub-gradient: non-differentiable functions To solve the d-dimensional problem minimize_{x ∈ X ⊆ R^d} f(x) := E[F(x; ξ)]
Algorithm: Iterate
Draw ξ according to its distribution, draw Z_1 ∼ µ and Z_2 ∼ µ with Cov(Z) = I
Set u_{t,1} = u/t, u_{t,2} = u/t^2, and g_t = (F(x_t + u_{t,1} Z_1 + u_{t,2} Z_2; ξ) − F(x_t + u_{t,1} Z_1; ξ))/u_{t,2} · Z_2
Update x_{t+1} = x_t − α g_t
Corollary (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x⋆ − x_1‖_2 and E[‖∂F(x; ξ)‖_2^2] ≤ G^2 for all x, then E[f(x̄_T) − f(x⋆)] ≤ RG √(d log d / T) + O(u log T / T).
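For the non-smooth case, a sketch of the double-perturbation estimator from the proposition is below; `F_xi` is an assumed deterministic map x ↦ F(x; ξ) for one fixed sample ξ, and the Gaussian choice of Z_1, Z_2 follows the proposition.

```python
import numpy as np

def double_smoothing_gradient(F_xi, x, u1, u2, rng=np.random.default_rng()):
    """g = (F(x + u1 Z1 + u2 Z2; xi) - F(x + u1 Z1; xi)) / u2 * Z2 with Z1, Z2 ~ N(0, I)."""
    d = x.size
    Z1 = rng.standard_normal(d)
    Z2 = rng.standard_normal(d)
    return (F_xi(x + u1 * Z1 + u2 * Z2) - F_xi(x + u1 * Z1)) / u2 * Z2

# Used inside an SGD loop with the schedules from the slide, e.g.
#   u_t1, u_t2 = u / t, u / t**2
#   g_t = double_smoothing_gradient(lambda y: F(y, xi_t), x_t, u_t1, u_t2)
#   then update x_{t+1} = x_t - alpha * g_t
```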

54 Wrapping up zero-order gradient methods If gradients are available, convergence rates of 1/√T If only zero-order information is available, in the smooth and non-smooth cases, convergence rates of √(d/T) Time to ǫ-accuracy: 1/ǫ^2 → d/ǫ^2

55 Wrapping up zero-order gradient methods If gradients are available, convergence rates of 1/√T If only zero-order information is available, in the smooth and non-smooth cases, convergence rates of √(d/T) Time to ǫ-accuracy: 1/ǫ^2 → d/ǫ^2 Sharpness: In the stochastic case, no algorithm can do better than those we have provided; that is, there is a lower bound for all zero-order algorithms of RG √(d/T).

56 Wrapping up zero-order gradient methods If gradients are available, convergence rates of 1/√T If only zero-order information is available, in the smooth and non-smooth cases, convergence rates of √(d/T) Time to ǫ-accuracy: 1/ǫ^2 → d/ǫ^2 Sharpness: In the stochastic case, no algorithm can do better than those we have provided; that is, there is a lower bound for all zero-order algorithms of RG √(d/T). Open question: Non-stochastic lower bounds? (Sebastian Pokutta, next week.)

57 Instance III: Parallelization and fast algorithms Goal: solve the following problem minimize f(x) subject to x ∈ X where f(x) := (1/n) Σ_{i=1}^n F(x; ξ_i) or f(x) := E[F(x; ξ)]

58 Stochastic Gradient Descent Problem: Tough to compute ∇f(x) = (1/n) Σ_{i=1}^n ∇F(x; ξ_i). Instead: At iteration t, choose a random ξ_i, set g_t = ∇F(x_t; ξ_i), and update x_{t+1} = x_t − α g_t

59 Stochastic Gradient Descent Problem: Tough to compute ∇f(x) = (1/n) Σ_{i=1}^n ∇F(x; ξ_i). Instead: At iteration t, choose a random ξ_i, set g_t = ∇F(x_t; ξ_i), and update x_{t+1} = x_t − α g_t

60 What everyone knows we should do Obviously: get a lower-variance estimate of the gradient. Sample g_{j,t} with E[g_{j,t}] = ∇f(x_t) and use g_t = (1/m) Σ_{j=1}^m g_{j,t}

61 What everyone knows we should do Obviously: get a lower-variance estimate of the gradient. Sample g_{j,t} with E[g_{j,t}] = ∇f(x_t) and use g_t = (1/m) Σ_{j=1}^m g_{j,t}

62 What everyone knows we should do Obviously: get a lower-variance estimate of the gradient. Sample g_{j,t} with E[g_{j,t}] = ∇f(x_t) and use g_t = (1/m) Σ_{j=1}^m g_{j,t}

63 What everyone knows we should do Obviously: get a lower-variance estimate of the gradient. Sample g_{j,t} with E[g_{j,t}] = ∇f(x_t) and use g_t = (1/m) Σ_{j=1}^m g_{j,t}
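In code, the minibatch estimate is just an average of independent single-sample gradients, which keeps the mean and divides the variance by m; `stoch_grad` is an assumed one-sample gradient oracle.

```python
import numpy as np

def minibatch_gradient(stoch_grad, x, m, rng=np.random.default_rng()):
    """g_t = (1/m) sum_{j=1}^m g_{j,t}, each g_{j,t} an independent unbiased estimate."""
    return np.mean([stoch_grad(x, rng) for _ in range(m)], axis=0)
```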

64 Problem: only works for smooth functions.

65 Non-smooth problems we care about: Classification F(x; ξ) = F(x; (a, b)) = [1 − b⟨x, a⟩]_+ Robust regression F(x; (a, b)) = |b − ⟨x, a⟩|

66 Non-smooth problems we care about: Classification F(x; ξ) = F(x; (a, b)) = [1 − b⟨x, a⟩]_+ Robust regression F(x; (a, b)) = |b − ⟨x, a⟩| Structured prediction (ranking, parsing, learning matchings) F(x; {ξ, ν}) = max_{ν̂ ∈ V} [ L(ν, ν̂) + x^⊤ Φ(ξ, ν̂) ] − x^⊤ Φ(ξ, ν)

67 Difficulties of non-smooth Intuition: Gradient is poor indicator of global structure

68 Better global estimators Idea: Ask for subgradients from multiple points

69 Better global estimators Idea: Ask for subgradients from multiple points

70 The algorithm Normal approach: sample ξ at random, g_{j,t} ∈ ∂F(x_t; ξ). Our approach: add noise to x, g_{j,t} ∈ ∂F(x_t + u_t Z_j; ξ), and decrease the magnitude u_t over time [figure: sampling around the point x_t]

71 The algorithm Normal approach: sample ξ at random, g_{j,t} ∈ ∂F(x_t; ξ). Our approach: add noise to x, g_{j,t} ∈ ∂F(x_t + u_t Z_j; ξ), and decrease the magnitude u_t over time [figure: ball of radius u_t around x_t]

72 The algorithm Normal approach: sample ξ at random, g_{j,t} ∈ ∂F(x_t; ξ). Our approach: add noise to x, g_{j,t} ∈ ∂F(x_t + u_t Z_j; ξ), and decrease the magnitude u_t over time [figure: ball of radius u_t around x_t]
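A sketch of this perturbed-point minibatch step, assuming an oracle `subgrad_F(y, rng)` that returns a subgradient of F(·; ξ) at y for a fresh sample ξ; the resulting average is an unbiased estimate of the gradient of the smoothed function f_{u_t}.

```python
import numpy as np

def smoothed_minibatch_gradient(subgrad_F, x, u_t, m, rng=np.random.default_rng()):
    """Average of subgradients taken at independently perturbed copies of x."""
    d = x.size
    grads = [subgrad_F(x + u_t * rng.standard_normal(d), rng)   # g_{j,t} in dF(x + u_t Z_j; xi_j)
             for _ in range(m)]
    return np.mean(grads, axis=0)
```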

73 Algorithm Generalization of accelerated gradient methods (Nesterov 1983, Tseng 2008, Lan 2010). Have query point and exploration point

74 Algorithm Generalization of accelerated gradient methods (Nesterov 1983, Tseng 2008, Lan 2010). Have a query point and an exploration point.
I. Get query point and gradients: y_t = (1 − θ_t) x_t + θ_t z_t
Sample ξ_{j,t} and Z_{j,t}, compute the gradient approximation g_t = (1/m) Σ_{j=1}^m g_{j,t}, with g_{j,t} ∈ ∂F(y_t + u_t Z_{j,t}; ξ_{j,t})

75 Algorithm Generalization of accelerated gradient methods (Nesterov 1983, Tseng 2008, Lan 2010). Have a query point and an exploration point.
I. Get query point and gradients: y_t = (1 − θ_t) x_t + θ_t z_t
Sample ξ_{j,t} and Z_{j,t}, compute the gradient approximation g_t = (1/m) Σ_{j=1}^m g_{j,t}, with g_{j,t} ∈ ∂F(y_t + u_t Z_{j,t}; ξ_{j,t})
II. Solve for the exploration point: z_{t+1} = argmin_{x ∈ X} { Σ_{τ=0}^t (1/θ_τ) ⟨g_τ, x⟩ + ‖x‖_2^2/(2α_t) } (the sum approximates f; the quadratic regularizes)

76 Algorithm Generalization of accelerated gradient methods (Nesterov 1983, Tseng 2008, Lan 2010). Have a query point and an exploration point.
I. Get query point and gradients: y_t = (1 − θ_t) x_t + θ_t z_t
Sample ξ_{j,t} and Z_{j,t}, compute the gradient approximation g_t = (1/m) Σ_{j=1}^m g_{j,t}, with g_{j,t} ∈ ∂F(y_t + u_t Z_{j,t}; ξ_{j,t})
II. Solve for the exploration point: z_{t+1} = argmin_{x ∈ X} { Σ_{τ=0}^t (1/θ_τ) ⟨g_τ, x⟩ + ‖x‖_2^2/(2α_t) } (the sum approximates f; the quadratic regularizes)
III. Interpolate: x_{t+1} = (1 − θ_t) x_t + θ_t z_{t+1}
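Putting the three steps together, here is a hedged, unconstrained (X = R^d) sketch of the scheme; the schedules θ_t = 2/(t + 2) and u_t = u/(t + 1) and the fixed α are illustrative choices, and with X = R^d the dual-averaging step II has the closed form z = −α · Σ_τ g_τ/θ_τ.

```python
import numpy as np

def accelerated_randomized_smoothing(subgrad_F, x0, alpha, u, m, T,
                                     rng=np.random.default_rng()):
    """Query point / exploration point / interpolation loop, unconstrained sketch."""
    d = len(x0)
    x = np.array(x0, dtype=float)
    z = x.copy()
    s = np.zeros(d)                                  # running sum of g_tau / theta_tau
    for t in range(T):
        theta = 2.0 / (t + 2)
        u_t = u / (t + 1)
        y = (1 - theta) * x + theta * z              # I. query point
        grads = [subgrad_F(y + u_t * rng.standard_normal(d), rng) for _ in range(m)]
        g = np.mean(grads, axis=0)
        s += g / theta
        z = -alpha * s                               # II. exploration point (X = R^d case)
        x = (1 - theta) * x + theta * z              # III. interpolate
    return x
```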

77 Theoretical Results Objective: minimize_{x ∈ X} f(x) where f(x) = E[F(x; ξ)], using m stochastic gradient samples per iteration.

78 Theoretical Results Objective: minimize_{x ∈ X} f(x) where f(x) = E[F(x; ξ)], using m stochastic gradient samples per iteration. Non-strongly convex objectives: f(x_T) − f(x⋆) = O(C/T + 1/√(Tm))

79 Theoretical Results Objective: minimize_{x ∈ X} f(x) where f(x) = E[F(x; ξ)], using m stochastic gradient samples per iteration. Non-strongly convex objectives: f(x_T) − f(x⋆) = O(C/T + 1/√(Tm)) λ-strongly convex objectives: f(x_T) − f(x⋆) = O(C/T^2 + 1/(λTm))

80 A few remarks on distributing Convergence rate: f(x_T) − f(x⋆) = O(1/T + 1/√(Tm)) If communication is expensive, use larger batch sizes m: (a) Communication cost is c (b) n computers with batch size m (c) S total update steps

81 A few remarks on distributing Convergence rate: f(x_T) − f(x⋆) = O(1/T + 1/√(Tm)) If communication is expensive, use larger batch sizes m: (a) Communication cost is c (b) n computers with batch size m (c) S total update steps Backsolve: after T = S(m + c) units of time, the error is O((m + c)/T + √((m + c)/(Tnm)))
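A quick numeric sketch of this tradeoff: evaluating the predicted error (m + c)/T + √((m + c)/(Tnm)) over a range of batch sizes m shows the sweet spot between communication overhead and variance reduction; the constants below are made up for illustration.

```python
import numpy as np

def predicted_error(m, T=1e5, n=8, c=50.0):
    """Predicted error after T time units with n workers, batch size m, comm. cost c."""
    return (m + c) / T + np.sqrt((m + c) / (T * n * m))

for m in (1, 10, 50, 100, 500, 1000):
    print(m, predicted_error(m))
```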

82 Experimental results

83 Iteration complexity simulations Define T(ǫ, m) = min{t ∈ N : f(x_t) − f(x⋆) ≤ ǫ}, and solve the robust regression problem f(x) = (1/n) Σ_{i=1}^n |a_i^⊤ x − b_i| = (1/n) ‖Ax − b‖_1 [figures: iterations to ǫ-optimality, actual versus predicted T(ǫ, m), as a function of the number m of gradient samples]

84 Robustness to stepsize and smoothing Two parameters: smoothing parameter u, stepsize η Plot: optimality gap f(x̂) + ϕ(x̂) − f(x⋆) − ϕ(x⋆) after 2000 iterations on a synthetic SVM problem, f(x) + ϕ(x) := (1/n) Σ_{i=1}^n [1 − ⟨ξ_i, x⟩]_+ + (λ/2) ‖x‖_2^2 [figure: optimality gap as a function of u and η]

85 Text Classification Reuters RCV1 dataset, time to ǫ-optimal solution for (1/n) Σ_{i=1}^n [1 − ⟨ξ_i, x⟩]_+ + (λ/2) ‖x‖_2^2 [figures: mean time (seconds) to a fixed optimality gap versus number of worker threads, for batch sizes 10 and 20]

86 Text Classification Reuters RCV1 dataset, optimization speed for minimizing (1/n) Σ_{i=1}^n [1 − ⟨ξ_i, x⟩]_+ + (λ/2) ‖x‖_2^2 [figure: optimality gap versus time (s) for accelerated variants Acc(1)–Acc(8) and Pegasos]

87 Parsing Penn Treebank dataset, learning weights for a hypergraph parser (here ξ is a sentence, ν ∈ V is a parse tree): (1/n) Σ_{i=1}^n max_{ν̂ ∈ V} [ L(ν_i, ν̂) + x^⊤ (Φ(ξ_i, ν̂) − Φ(ξ_i, ν_i)) ] + (λ/2) ‖x‖_2^2 [figures: optimality gap f(x) + ϕ(x) − f(x⋆) − ϕ(x⋆) versus iterations and versus time (seconds), compared with RDA]

88 Is smoothing necessary? Solve the multiple-median problem f(x) = (1/n) Σ_{i=1}^n ‖x − ξ_i‖_1, ξ_i ∈ {−1, 1}^d. Compare with standard stochastic gradient: [figure: iterations to ǫ-optimality, smoothed versus unsmoothed, as a function of the number m of gradient samples]

89 Discussion Randomized smoothing allows Stationary points of non-convex non-smooth problems Optimal solutions in zero-order problems (including non-smooth) Parallelization for non-smooth problems

90 Discussion Randomized smoothing allows Stationary points of non-convex non-smooth problems Optimal solutions in zero-order problems (including non-smooth) Parallelization for non-smooth problems Current experiments in consultation with Google for large-scale parsing/translation tasks Open questions: non-stochastic optimality guarantees? True zero-order optimization?

91 Discussion Randomized smoothing allows Stationary points of non-convex non-smooth problems Optimal solutions in zero-order problems (including non-smooth) Parallelization for non-smooth problems Current experiments in consultation with Google for large-scale parsing/translation tasks Open questions: non-stochastic optimality guarantees? True zero-order optimization? Thanks!

92 Discussion Randomized smoothing allows Stationary points of non-convex non-smooth problems Optimal solutions in zero-order problems (including non-smooth) Parallelization for non-smooth problems Current experiments in consultation with Google for large-scale parsing/translation tasks Open questions: non-stochastic optimality guarantees? True zero-order optimization? References: Randomized Smoothing for Stochastic Optimization (D., Bartlett, Wainwright). SIAM Journal on Optimization, 22(2). Optimal rates for zero-order convex optimization: the power of two function evaluations (D., Jordan, Wainwright, Wibisono). arXiv [math.OC]. Available on my webpage.

