Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos

1 Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos

2 Contents. (1) Stochastic Variance Reduction for Nonconvex Optimization: joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and Alex Smola. (2) GD is slower than SGD: joint work with Simon Du, Chi Jin, Aarti Singh, Michael Jordan, and Jason Lee.

3 Finite Sum Minimization in ML. Arises naturally in ML: Empirical Risk Minimization (ERM), M-estimators.

4 Empirical Risk Minimization in ML
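The formula on this slide did not survive extraction. For reference, the finite-sum problem behind ERM is usually written as below. The symbols $f_i$, $\ell$, and the training pairs $(a_i, b_i)$ are notational assumptions chosen to match the rest of the deck, not a transcription of the slide.

```latex
\min_{x \in \mathbb{R}^d} \; f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x),
\qquad f_i(x) = \ell\bigl(x;\, a_i, b_i\bigr)
```

Here $n$ is the number of training examples and each $f_i$ is the loss on example $i$.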

5 Empirical Risk Minimization in ML. Example data with columns Age and Country: (24, US), (45, UK), (32, India).

6 Empirical Risk Minimization in ML. Clustering

7 Empirical Risk Minimization in ML. Example: Problems in Deep Learning. $\hat{b}_i = \mathrm{DNN}(w; a_i)$ for a training pair $(a_i, b_i)$; $w$ denotes the weight parameters of the neural network.

8 Empirical Risk Minimization in ML. Example: Max-likelihood estimation, Gaussian Mixture Models

9 Focus of our work. First-order methods for nonconvex finite sum minimization. Given the dominance of stochastic gradient methods (SGD) in optimizing neural nets and other large nonconvex models, theoretical investigation of faster nonconvex first-order methods is much needed.

10 Smoothness Assumptions. We will use first-order methods for nonconvex finite sum minimization. Assumptions:

11 L-smooth functions
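The definition on this slide was lost in extraction. The standard notion of $L$-smoothness (Lipschitz-continuous gradients) is stated below as an assumption about what the slide showed; applying it to every component $f_i$ follows the usual finite-sum convention.

```latex
\|\nabla f_i(x) - \nabla f_i(y)\| \;\le\; L \,\|x - y\|
\qquad \text{for all } x, y \in \mathbb{R}^d \text{ and all } i \in [n]
```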

12 Problem Setup: Black Box Oracle. Incremental First-order Oracle (IFO) (Agarwal & Bottou 2014). The algorithm sends a query (iterate x, index i) to the IFO.
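An IFO takes an index $i$ and a point $x$ and returns the pair $(f_i(x), \nabla f_i(x))$. The following is a minimal sketch of such an oracle. The class name `IFO`, the method `query`, the call counter, and the least-squares components are all hypothetical choices made for illustration.

```python
import numpy as np

class IFO:
    """Hypothetical incremental first-order oracle (illustration only).

    Component i is f_i(x) = 0.5 * (a_i . x - b_i)^2, an assumed toy loss.
    """

    def __init__(self, A, b):
        self.A = np.asarray(A, dtype=float)
        self.b = np.asarray(b, dtype=float)
        self.calls = 0  # number of IFO queries made so far

    def query(self, x, i):
        """Return (f_i(x), grad f_i(x)) and count one IFO call."""
        self.calls += 1
        r = self.A[i] @ x - self.b[i]      # residual of component i
        return 0.5 * r * r, r * self.A[i]

oracle = IFO(A=[[1.0, 0.0], [0.0, 2.0]], b=[1.0, 2.0])
value, grad = oracle.query(np.zeros(2), 1)
```

Counting calls inside the oracle makes it easy to compare algorithms by the number of IFO queries they spend.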

13 Central question. We provide an affirmative answer to this question by showing that a careful selection of parameters in SVRG leads to faster convergence than both SGD and GD. To our knowledge, ours is the first work to improve convergence rates of SGD and GD for IFO-based nonconvex optimization.

14 Strongly Convex Functions

15 Problem Setup: Measuring Efficiency. Measure the number of IFO calls to reach a solution.
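The slide does not say what counts as a solution. In nonconvex analyses of this kind, the usual criterion is approximate stationarity, stated below as an assumption; for a stochastic algorithm the bound is taken in expectation.

```latex
x \text{ is } \epsilon\text{-accurate if } \|\nabla f(x)\|^2 \le \epsilon,
\qquad \text{or, for a randomized method, } \mathbb{E}\bigl[\|\nabla f(x)\|^2\bigr] \le \epsilon
```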

16 Problem Setup: Measuring Efficiency

17 Gradient Descent. Stochastic methods to the rescue:

18 History of Stochastic Methods. Stochastic methods to the rescue: [Robbins & Monro 1951] (SGD for stochastic programming); [Widrow & Hoff 1960] (Widrow-Hoff least mean square); [Litvakov 1966] (nonsmooth extension of least square); [Luo 1991] (for feedback neural networks); [Nedic et al. 2001] (distributed and asynchronous versions); [Blatt et al. 2008] (linear convergence for quadratic problems); [Bertsekas 2010] (beautiful survey of incremental methods for convex optimization).

19 History of Stochastic Methods. [Schmidt et al. 2012, Johnson & Zhang 2013, Defazio 2014, Gurbuzbalaban et al. 2015] (linear convergence for general strongly convex problems); [Solodov 1997] (nonconvex differentiable); [Sra 2012] (proximal nonconvex); [Ghadimi & Lan 2013] (convergence rates for SGD for general nonconvex stochastic programming); [Ghadimi et al. 2013] (proximal extension of SGD for nonconvex stochastic composite programming).

20 Stochastic Gradient Descent

21 σ-Bounded Gradient
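The definition on this slide was lost. The standard $\sigma$-bounded gradient condition used in SGD analyses is given below as an assumption about the missing formula.

```latex
\|\nabla f_i(x)\| \;\le\; \sigma
\qquad \text{for all } i \in [n] \text{ and all } x \in \mathbb{R}^d
```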

22 Stochastic Gradient Descent

23 Stochastic Gradient Descent

24 Stochastic Gradient Descent
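The SGD update itself is not recoverable from these slides. As a reminder of the basic method, here is a minimal sketch of SGD on a finite sum with a constant step size. The function name `sgd`, its parameters, and the one-dimensional quadratic toy problem are assumptions for illustration; the toy problem is convex and is only meant to show the mechanics.

```python
import numpy as np

def sgd(grad_i, x0, n, step, iters, seed=0):
    """Plain SGD: x <- x - step * grad f_i(x), with i uniform in {0, ..., n-1}."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        i = rng.integers(n)            # one IFO call per iteration
        x -= step * grad_i(x, i)
    return x

# Toy finite sum: f_i(x) = 0.5 * (x - c_i)^2, so f is minimized at mean(c) = 3.
c = np.array([1.0, 2.0, 3.0, 6.0])
x_final = sgd(lambda x, i: x - c[i], x0=[0.0], n=len(c), step=0.01, iters=2000)
```

With a constant step size the iterate does not settle exactly at the minimizer; it keeps fluctuating around it because each stochastic gradient is noisy.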

25 SGD Summary

26 GD vs SGD. Gradient Descent: $O(n/\epsilon)$ IFO calls; strong dependence on $n$, weak dependence on $\epsilon$. SGD: $O(1/\epsilon^2)$ IFO calls; no dependence on $n$, strong dependence on $\epsilon$.

27 Main Research Question. For nonconvex functions, can one obtain convergence rates faster than those of SGD and Gradient Descent using only an IFO?

28 How to improve SGD? SGD: variance in the stochastic gradients slows down convergence, and this variance is usually large.

29 How to improve SGD? [Figure: f(x)]

30 How to improve SGD? [Figure: f(x) with component gradients $\nabla f_1(x)$, $\nabla f_2(x)$, $\nabla f_3(x)$]

31 How to improve SGD? [Figure: f(x) with component gradients $\nabla f_1(x)$, $\nabla f_2(x)$, $\nabla f_3(x)$] Unbiased, but variance hinders convergence.

32 How to improve SGD?

33 Nonconvex SVRG. Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, Alex Smola. Stochastic Variance Reduction for Nonconvex Optimization. ICML 2016.

34 Nonconvex SVRG

35 Nonconvex SVRG Algorithm
For s = 0 to S-1 [S epochs]
    For t = 0 to m-1 [length of an epoch is m]
        Uniformly randomly pick $i_t \in [n]$
    end
end
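The update lines inside the two loops were images and did not survive. The sketch below fills them in with the standard SVRG update: a full gradient at a snapshot point once per epoch, then variance-reduced steps $v = \nabla f_{i_t}(x) - \nabla f_{i_t}(\tilde{x}) + \nabla f(\tilde{x})$. The function name `svrg`, its parameters, and the toy least-squares problem are assumptions. For simplicity the sketch returns the last snapshot, whereas the analysis in Reddi et al. returns an iterate chosen uniformly at random.

```python
import numpy as np

def svrg(grad_i, x0, n, step, m, S, seed=0):
    """Minimal SVRG sketch: S epochs, each of m variance-reduced steps."""
    rng = np.random.default_rng(seed)
    x_tilde = np.array(x0, dtype=float)           # snapshot point
    for s in range(S):
        # Full gradient at the snapshot: n IFO calls per epoch.
        g_tilde = sum(grad_i(x_tilde, i) for i in range(n)) / n
        x = x_tilde.copy()
        for t in range(m):
            i = rng.integers(n)                   # uniformly random index
            # Unbiased gradient estimate with reduced variance.
            v = grad_i(x, i) - grad_i(x_tilde, i) + g_tilde
            x = x - step * v
        x_tilde = x                               # next snapshot: last iterate
    return x_tilde

# Toy problem (assumed): f_i(x) = 0.5 * (a_i * x - b_i)^2, minimizer 15/14.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 3.0])
grad = lambda x, i: a[i] * (a[i] * x - b[i])
x_hat = svrg(grad, x0=[0.0], n=3, step=0.02, m=50, S=30)
```

The correction term `grad_i(x, i) - grad_i(x_tilde, i)` shrinks as `x` and `x_tilde` approach each other, which is why a constant step size can be used without the noise floor seen in plain SGD.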

44 Nonconvex SVRG

45 SVRG Properties

46 SVRG Properties

47 Theoretical analysis of SVRG in the nonconvex setting

48 Main Theorem

49 Number of IFO Calls. Key result:

50 Main Theorem Summary. Theorem (Reddi et al., ICML 2016): Nonconvex SVRG requires $O(n + n^{2/3}/\epsilon)$ iterations to converge to an $\epsilon$-accurate solution. The interplay between epoch length ($m$), number of functions ($n$), smoothness ($L$), and step size ($\eta$) is important and subtle.
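To get a feel for these rates, the short sketch below plugs numbers into the leading-order expressions quoted in the slides: $n/\epsilon$ for GD, $1/\epsilon^2$ for SGD, and $n + n^{2/3}/\epsilon$ for SVRG. Constants and the dependence on $L$ and $\sigma$ are dropped, so the outputs are orders of magnitude only; the function name `ifo_calls` and the chosen values of $n$ and $\epsilon$ are assumptions.

```python
def ifo_calls(n, eps):
    """Leading-order cost expressions from the slides, constants dropped."""
    return {
        "GD": n / eps,
        "SGD": 1 / eps**2,
        "SVRG": n + n ** (2 / 3) / eps,
    }

# Example regime: a million components and a fairly tight accuracy target.
counts = ifo_calls(n=10**6, eps=1e-5)
for name, value in counts.items():
    print(f"{name}: about {value:.1e}")
```

In this regime SVRG needs fewer calls than SGD, which in turn needs fewer than GD. The comparison with SGD depends on the regime: from the formulas alone, SVRG's $n^{2/3}/\epsilon$ term is smaller than $1/\epsilon^2$ only when $\epsilon < n^{-2/3}$.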

51 Gradient Dominated Functions

52 Gradient Dominated Functions

53 Gradient Dominated Functions
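The definition on these slides was lost. The usual statement of $\tau$-gradient domination is given below as an assumption about the missing formula; $x^*$ denotes a global minimizer of $f$.

```latex
f(x) - f(x^*) \;\le\; \tau \,\|\nabla f(x)\|^2
\qquad \text{for all } x \in \mathbb{R}^d
```

Under this condition every stationary point is a global minimizer, even though $f$ need not be convex.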

54 Theoretical analysis

55 GD-SVRG Linear Rate

56 GD vs SVRG

57 Strongly Convex Functions. Similar (but not the same) gains can be seen for SVRG for strongly convex functions.

58 Results Overview

59 Minibatch Setting. Minibatch variant of SVRG for the nonconvex setting: $O(n^{2/3}/(b\,\epsilon^2))$ iterations for an $\epsilon$-accurate solution. Faster by a factor of $b$.

60 Experiments. Multiclass classification with a neural network with 3 hidden layers and softmax output nodes. SGD (with best chosen step size, decreasing and constant) vs SVRG (with constant step size). Comparison criteria: training loss, stationarity gap, and test error.

61 Experiments. Results on CIFAR-10: training loss (left), stationarity gap (center), test error (right).

62 Proof of SGD Rate

63

64

65

66

67 Main Contributions

68 GD vs PGD

69 GD vs PGD

70 GD

71 Stationary points. A stationary point can be a local minimizer, saddle point, or local maximizer. In recent years, there has been an increasing focus on conditions under which it is possible to escape saddle points (specifically, strict saddle points) and converge to a local minimizer. Stronger statements can be made when the following two key properties hold: 1) all local minima are global minima, and 2) all saddle points are strict. For these problems, any algorithm that is capable of escaping strict saddle points will converge to a global minimizer from an arbitrary initialization point.
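One way to make these categories concrete is to look at the eigenvalues of the Hessian at a stationary point. The sketch below assumes the common convention that a saddle point is strict when the Hessian has a strictly negative eigenvalue. The function name `classify_stationary_point` and the tolerance `tol` are hypothetical.

```python
import numpy as np

def classify_stationary_point(hessian, tol=1e-8):
    """Classify a stationary point from the eigenvalues of its (symmetric) Hessian."""
    eig = np.linalg.eigvalsh(np.asarray(hessian, dtype=float))  # ascending order
    if eig[0] > tol:
        return "local minimizer"      # all eigenvalues strictly positive
    if eig[-1] < -tol:
        return "local maximizer"      # all eigenvalues strictly negative
    if eig[0] < -tol:
        return "strict saddle"        # at least one strictly negative eigenvalue
    return "degenerate"               # second-order information is inconclusive

# f(x, y) = x^2 - y^2 has a stationary point at the origin with Hessian diag(2, -2).
kind = classify_stationary_point([[2.0, 0.0], [0.0, -2.0]])
```

The "degenerate" case covers points where the smallest eigenvalue is zero, so the Hessian alone cannot tell a minimizer from a non-strict saddle.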

72 It has been shown that when perturbations are incorporated into GD at each step, the resulting algorithm can escape strict saddle points in polynomial time [Ge et al., 2015]. It has also been shown that episodic perturbations suffice; in particular, Jin et al. [2017] analyzed an algorithm that occasionally adds a perturbation to GD, and proved that the algorithm escapes strict saddle points in polynomial time.

73 This leaves open the question as to whether such perturbations are in fact necessary. If not, we might prefer to avoid the perturbations if possible, as they involve additional hyperparameters. The major existing result is provided by Lee et al. [2016], who show that gradient descent, with any reasonable random initialization, will always escape strict saddle points eventually but without any guarantee on the number of steps required. This motivates the following question: Does randomly initialized gradient descent generally escape saddle points in polynomial time?

74 Does randomly initialized gradient descent generally escape saddle points in polynomial time? We give a strong negative answer to this question. We show that even under a fairly natural initialization scheme (e.g., uniform initialization over a unit cube, or Gaussian initialization), GD can take exponentially long time to escape strict saddle points and reach local minima, while perturbed GD only needs polynomial time. This result shows that GD is fundamentally slower in escaping saddle points than its perturbed variant, and justifies the necessity of adding perturbations for efficient non-convex optimization.

75

76 Notation

77 Notation

78

79

80

81

82 Perturbed Gradient Descent
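The algorithm box on this slide was lost. The following is a simplified sketch in the spirit of the perturbed GD of Jin et al. [2017]: run gradient descent, and when the gradient is small and no perturbation was added recently, add noise drawn uniformly from a small ball. The function name `perturbed_gd`, the parameter names, the toy function, and all numeric values are assumptions; the termination test of the original algorithm is omitted.

```python
import numpy as np

def perturbed_gd(grad, x0, step, iters, g_thres, radius, t_thres, seed=0):
    """Gradient descent with occasional uniform-ball perturbations (sketch)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    last_perturb = -t_thres - 1
    for t in range(iters):
        small_grad = np.linalg.norm(grad(x)) <= g_thres
        if small_grad and t - last_perturb > t_thres:
            # Uniform sample from a ball: random direction, radius scaled by U^(1/d).
            u = rng.normal(size=x.shape)
            u *= radius * rng.uniform() ** (1.0 / x.size) / np.linalg.norm(u)
            x = x + u
            last_perturb = t
        x = x - step * grad(x)
    return x

# Toy function (assumed): f(x, y) = 0.5 x^2 - 0.5 y^2 + 0.25 y^4,
# with a strict saddle at (0, 0) and minima at (0, +1) and (0, -1).
grad = lambda z: np.array([z[0], -z[1] + z[1] ** 3])
kwargs = dict(step=0.1, iters=500, g_thres=1e-3, t_thres=50)
x_gd = perturbed_gd(grad, [0.0, 0.0], radius=0.0, **kwargs)    # radius 0: plain GD
x_pgd = perturbed_gd(grad, [0.0, 0.0], radius=0.01, **kwargs)  # perturbed GD
```

Started exactly at the saddle, plain GD never moves because the gradient is zero there, while the perturbed variant is pushed off the saddle and then descends to one of the two minima.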

83 GD Escapes Strict Saddle Points. The following theorem shows that if GD with random initialization converges, then it will converge to a second-order stationary point almost surely.

84 Perturbed GD Escapes Strict Saddle Points in Polynomial Time. The previous theorem only provides limiting behavior without specifying the convergence rate. On the other hand, if we are willing to add perturbations, the following theorem provides a polynomial convergence rate:

85 This theorem states that with a proper choice of hyperparameters, perturbed gradient descent can consistently escape strict saddle points and converge to a second-order stationary point in a polynomial number of iterations.

86 Warmup: Examples with "Un-natural" Initialization. In this section, we discuss two very simple and intuitive counterexamples for which gradient descent with random initialization requires an exponential number of steps to escape strict saddle points. However, these examples are unnatural and pathological in certain ways, and thus unlikely to arise in practice. A more sophisticated counterexample with natural initialization and non-pathological behavior will be given later.

87 Initialize uniformly within an extremely thin band

88

89 Initialize far away

90 Consider a two-dimensional function with a strict saddle point at (0,0). Instead of initializing in an extremely thin band, we construct a very long slope so that a relatively large initialization region necessarily converges to this extremely thin band.

91

92

93 Main Result. In the previous section we have shown that gradient descent takes exponential time to escape saddle points under "un-natural" initialization schemes. Is it possible for the same statement to hold even under natural initialization schemes and non-pathological functions? The following theorem confirms this.

94 Main Theorem

95

96 Extension of the Main Theorem. All of the above results can be generalized to other initializations as well, e.g. Gaussian.

97 Proof Sketch. We will show that GD needs an exponential number of steps to escape.

98 Proof Sketch

99

100

101 We call the union of the regions a tube.

102

103 [Figures: tube defined in 2D; trajectory of gradient descent in the tube for d = 3.] The blue points are saddle points and the red point is the minimum. The pink line is the trajectory of gradient descent.

104 [Figure labels: Optimum, Buffer, Saddle, Buffer, Saddle]

105

106

107 Thanks for your Attention!
