Towards stability and optimality in stochastic gradient descent


1 Towards stability and optimality in stochastic gradient descent

Panos Toulis, Dustin Tran and Edoardo M. Airoldi
August 26, 2016
Discussion by Ikenna Odinaka, Duke University

2 Outline

1 Introduction

3 Stochastic Optimization Problem

Consider a random variable $\xi \in \Xi$, a convex and compact parameter space $\Theta$, and an almost-surely differentiable loss function $L : \Theta \times \Xi \to \mathbb{R}$.

We want to solve the stochastic optimization problem

$$\theta^* = \arg\min_{\theta \in \Theta} l(\theta), \tag{1}$$

where $l(\theta) = E(L(\theta, \xi))$ and the expectation is with respect to $\xi$.

4 Classic SGD and Averaged SGD

Classic stochastic gradient descent (SGD) solves problem (1) using

$$\theta_n = \theta_{n-1} - \gamma_n \nabla L(\theta_{n-1}, \xi_n), \quad \theta_0 \in \Theta, \tag{2}$$

where $\{\xi_n\}$ are i.i.d. realizations of $\xi$ and the learning rate $\{\gamma_n\}$ is a non-increasing sequence of positive real numbers.

To achieve statistical efficiency, classic SGD is combined with iterate averaging in averaged SGD (ASGD):

$$\theta_n = \theta_{n-1} - \gamma_n \nabla L(\theta_{n-1}, \xi_n), \quad \theta_0 \in \Theta, \qquad \bar{\theta}_n = \frac{1}{n} \sum_{i=1}^{n} \theta_i. \tag{3}$$
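The two procedures above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the squared loss $(y - x^T\theta)^2$, the synthetic data, and the values `gamma1 = 0.1` and `gamma = 0.6` are all assumptions chosen for the example.

```python
import numpy as np

def sgd_with_averaging(X, y, gamma1=0.1, gamma=0.6):
    """Run classic SGD (update 2) and return the last iterate and the average (3)."""
    n_obs, p = X.shape
    theta = np.zeros(p)          # theta_0
    theta_bar = np.zeros(p)      # running average of theta_1, ..., theta_n
    for n in range(1, n_obs + 1):
        x_n, y_n = X[n - 1], y[n - 1]
        gamma_n = gamma1 * n ** (-gamma)            # non-increasing learning rate
        grad = -2.0 * (y_n - x_n @ theta) * x_n     # gradient of (y - x^T theta)^2
        theta = theta - gamma_n * grad              # explicit update (2)
        theta_bar += (theta - theta_bar) / n        # incremental form of the average in (3)
    return theta, theta_bar

# Hypothetical synthetic data for illustration only.
rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(20000, 3))
y = X @ theta_star + rng.normal(size=20000)
theta_last, theta_avg = sgd_with_averaging(X, y)
```

The average is updated incrementally so that no iterate history needs to be stored.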

5 Authors' Contribution

- Implicit SGD (improves stability) and averaging of iterates (improves statistical efficiency).
- Both aspects have been shown in a previous 2015 arXiv publication titled "Implicit stochastic gradient descent".
- Theorem 2 is new to this AISTATS 2016 submission.

Main idea: solve problem (1) using

$$\theta_n = \theta_{n-1} - \gamma_n \nabla L(\theta_n, \xi_n), \quad \theta_0 \in \Theta, \tag{4}$$

$$\bar{\theta}_n = \frac{1}{n} \sum_{i=1}^{n} \theta_i. \tag{5}$$

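6 Interpretations of Implicit SGD

Implicit update (4) is equivalent to a sequence of improved classic SGD procedures:

$$\theta_n^{(1)} = \theta_{n-1} - \gamma_n \nabla L(\theta_{n-1}, \xi_n),$$
$$\theta_n^{(2)} = \theta_{n-1} - \gamma_n \nabla L(\theta_n^{(1)}, \xi_n),$$
$$\theta_n^{(3)} = \theta_{n-1} - \gamma_n \nabla L(\theta_n^{(2)}, \xi_n),$$
$$\vdots$$
$$\theta_n^{(\infty)} = \theta_{n-1} - \gamma_n \nabla L(\theta_n^{(\infty)}, \xi_n).$$

Implicit update (4) is also equivalent to the proximal update

$$\theta_n = \arg\min_{\theta \in \Theta} \left\{ \frac{1}{2\gamma_n} \|\theta - \theta_{n-1}\|^2 + L(\theta, \xi_n) \right\}. \tag{6}$$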
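Both interpretations can be checked numerically on a toy case. The sketch below assumes the loss $L(\theta, \xi) = \tfrac{1}{2}(y - x^T\theta)^2$, for which update (4) has the closed-form solution $\theta_n = \theta_{n-1} + \frac{\gamma_n}{1 + \gamma_n \|x\|^2}(y - x^T\theta_{n-1})\,x$. The function names and the numbers are hypothetical, and the repeated-SGD iteration only converges here because $\gamma_n \|x\|^2 < 1$.

```python
import numpy as np

def grad_loss(theta, x, y):
    """Gradient of the assumed loss L(theta) = 0.5 * (y - x^T theta)^2."""
    return -(y - x @ theta) * x

def implicit_step_closed_form(theta_prev, x, y, gamma_n):
    """Exact solution of theta = theta_prev - gamma_n * grad_loss(theta, x, y)."""
    shrink = gamma_n / (1.0 + gamma_n * (x @ x))
    return theta_prev + shrink * (y - x @ theta_prev) * x

def implicit_step_fixed_point(theta_prev, x, y, gamma_n, n_iter=100):
    """Iterate theta^(k+1) = theta_prev - gamma_n * grad_loss(theta^(k), x, y)."""
    theta = theta_prev.copy()      # theta^(0) = theta_{n-1}; theta^(1) is the classic SGD step
    for _ in range(n_iter):
        theta = theta_prev - gamma_n * grad_loss(theta, x, y)
    return theta

theta_prev = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, -1.0])
y = 3.0
gamma_n = 0.1                      # gamma_n * ||x||^2 = 0.6 < 1, so the iteration converges

theta_closed = implicit_step_closed_form(theta_prev, x, y, gamma_n)
theta_iter = implicit_step_fixed_point(theta_prev, x, y, gamma_n)

# Stationarity condition of the proximal objective (6): should be numerically zero.
prox_residual = (theta_closed - theta_prev) / gamma_n + grad_loss(theta_closed, x, y)
```

For larger $\gamma_n \|x\|^2$ the naive iteration diverges even though the implicit update itself is still well defined, which is one reason to solve (4) directly rather than by repeated SGD steps.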

7 (De)Merits of Implicit SGD

Merit: implicit SGD is stable. Assuming $L$ is $\mu$-strongly convex almost surely, $\|\theta_n - \theta^*\|$ is contracting almost surely.

Demerit: solving the multidimensional fixed-point equation (4) for $\theta_n$ is difficult for general models. It is easy when $L$ is a function of the natural parameter $x^T\theta$.
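The stability claim can be illustrated with a small simulation. This is a sketch under assumptions that are not in the slides: the loss is $\tfrac{1}{2}(y - x^T\theta)^2$ (so the implicit step has the closed form $\gamma_n / (1 + \gamma_n\|x_n\|^2)$ times the residual), the learning rate is $\gamma_n = \gamma_1 / n$, and $\gamma_1 = 20$ is deliberately far too large.

```python
import numpy as np

def run_sgd(X, y, theta_star, gamma1, implicit):
    """SGD on the assumed loss 0.5 * (y - x^T theta)^2 with gamma_n = gamma1 / n.

    Returns the final iterate and the largest error norm seen along the path.
    """
    theta = np.zeros(X.shape[1])
    max_err = 0.0
    for n, (x_n, y_n) in enumerate(zip(X, y), start=1):
        gamma_n = gamma1 / n
        resid = y_n - x_n @ theta
        # The implicit step solves (4) exactly; the explicit step is update (2).
        step = gamma_n / (1.0 + gamma_n * (x_n @ x_n)) if implicit else gamma_n
        theta = theta + step * resid * x_n
        max_err = max(max_err, float(np.linalg.norm(theta - theta_star)))
    return theta, max_err

# Hypothetical synthetic data for illustration only.
rng = np.random.default_rng(0)
p, n_obs = 5, 5000
theta_star = np.ones(p)
X = rng.normal(size=(n_obs, p))
y = X @ theta_star + rng.normal(size=n_obs)

gamma1 = 20.0                      # misspecified on purpose
theta_exp, max_err_exp = run_sgd(X, y, theta_star, gamma1, implicit=False)
theta_imp, max_err_imp = run_sgd(X, y, theta_star, gamma1, implicit=True)
```

The implicit step size is bounded by $1/\|x_n\|^2$ no matter how large $\gamma_n$ is, so each update shrinks the current residual instead of overshooting it. The explicit iterates blow up during the early iterations in this setting.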

8 Notation

- $\|\cdot\|$ denotes the $L_2$ norm.
- $x \triangleq y$ defines $x$ as equal to the known variable $y$.
- $x \overset{\text{def}}{=} y$ denotes that the value of $x$ is equal to the value of $y$, by definition.
- $a_n \downarrow 0$ means $a_n > 0$, $a_n$ is monotonically non-increasing, and $a_n \to 0$.
- Given sequences $a_n > 0$, $b_n > 0$: $b_n = O(a_n)$ means $\exists\, c > 0$ such that $b_n \le c\, a_n$ for all $n$; $b_n = o(a_n)$ means $b_n / a_n \to 0$.
- For a sequence of vectors or matrices $X_n$, $X_n = O(a_n)$ and $X_n = o(a_n)$ denote the corresponding statement for the scalar norm sequence $\|X_n\|$.
- Given two matrices $A$ and $B$, $A \preceq B$ denotes that $B - A$ is positive semidefinite.
- $\mathrm{tr}(A)$ denotes the trace of $A$.

9 Main Assumptions

Assumption 1. The loss function $L(\theta, \xi)$ is almost-surely differentiable. The random vector $\xi$ can be decomposed as $\xi = (x, y)$, $x \in \mathbb{R}^p$, $y \in \mathbb{R}^d$, such that

$$L(\theta, \xi) \equiv L(x^T\theta, y). \tag{7}$$

Assumption 2. The learning rate sequence $\{\gamma_n\}$ is defined as $\gamma_n = \gamma_1 n^{-\gamma}$, where $\gamma_1 > 0$ and $\gamma \in (1/2, 1]$.

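10 Main Assumptions Contd

Assumption 3 (Lipschitz conditions). For all $\theta_1, \theta_2 \in \Theta$, a combination of the following conditions is satisfied almost-surely:

(a) The loss function $L$ is Lipschitz with parameter $\lambda_0$, i.e., $\|L(\theta_1, \xi) - L(\theta_2, \xi)\| \le \lambda_0 \|\theta_1 - \theta_2\|$.
(b) The map $\nabla L$ is Lipschitz with parameter $\lambda_1$, i.e., $\|\nabla L(\theta_1, \xi) - \nabla L(\theta_2, \xi)\| \le \lambda_1 \|\theta_1 - \theta_2\|$.
(c) The map $\nabla^2 L$ is Lipschitz with parameter $\lambda_2$, i.e., $\|\nabla^2 L(\theta_1, \xi) - \nabla^2 L(\theta_2, \xi)\| \le \lambda_2 \|\theta_1 - \theta_2\|$.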

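11 Main Assumptions Contd

Assumption 4. The observed Fisher information matrix, $\hat{I}(\theta) \triangleq \nabla^2 L(\theta, \xi)$, has non-vanishing trace, i.e., there exists $\varphi > 0$ such that $\mathrm{tr}(\hat{I}(\theta)) \ge \varphi$, almost-surely, for all $\theta \in \Theta$. The expected Fisher information matrix, $I(\theta) \triangleq E(\hat{I}(\theta))$, has minimum eigenvalue $0 < \lambda_f \le \varphi$, for all $\theta \in \Theta$.

Assumption 5. The zero-mean random variable $W_\theta \triangleq \nabla L(\theta, \xi) - \nabla l(\theta)$ is square-integrable, such that, for a fixed positive-definite $\Sigma$, $E\left(W_{\theta^*} W_{\theta^*}^T\right) \preceq \Sigma$.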

12 Remarks on Assumptions

- Assumptions 2 and 5 are standard in the stochastic approximation literature.
- Assumption 1 restricts application to models where $L$ depends on $\theta$ through $x^T\theta$. This excludes models with regularization.
  - The authors claim they can easily incorporate regularization, as shown in the supplementary material; however, the supplementary material contains no such extension.
  - The authors also claim there is no need for regularization, since the proximal operator (equation 6) regularizes $\theta_n$ towards $\theta_{n-1}$.
- Assumptions 3(b) and 3(c) are used to simplify the non-asymptotic analysis. These assumptions have been relaxed in classical stochastic approximation theory.
- Assumption 3(a) is not standard in the stochastic approximation literature. It only shows up in the implicit SGD literature.
  - Open problem: establish Theorem 1 without Assumption 3(a).
- Assumption 4 has two requirements:
  - $\mathrm{tr}(\hat{I}(\theta)) \ge \varphi > 0$ is a relaxed form of strong convexity on $L(\theta, \xi)$.
  - The minimum-eigenvalue condition on $I(\theta)$ amounts to strong convexity of $l(\theta)$.

13 Main Technical Challenge

- The main technical challenge in analyzing implicit SGD (equation 4) is that $\xi_n$ is not conditionally independent of $\theta_n$.
- This implies $E\left(\nabla L(\theta_n, \xi_n) \mid \theta_n\right) \ne \nabla l(\theta_n)$.
- So, convexity properties of $l$ cannot be used to analyze $E\left(\|\theta_n - \theta^*\|^2\right)$, as is common in the classic SGD literature.
- In addition to Assumption 3(a), other authors make strict assumptions of almost-surely bounded gradients $\nabla L(\theta, \xi)$ for the implicit procedure.

14 Computational Efficiency

Solving the fixed-point equation (4) at every iteration can be expensive. For a special class of problems, the multidimensional equation can be reduced to a single-variable equation.

Definition 1. Suppose that Assumption 1 holds. For observation $\xi = (x, y)$, the first derivative with respect to the natural parameter $x^T\theta$ is denoted by $L'(\theta, \xi)$, and is defined as

$$L'(\theta, \xi) \triangleq \frac{\partial L(\theta, \xi)}{\partial (x^T\theta)} \overset{\text{def}}{=} \frac{\partial L(x^T\theta, y)}{\partial (x^T\theta)}. \tag{8}$$

Similarly, $L''(\theta, \xi) \triangleq \dfrac{\partial L'(\theta, \xi)}{\partial (x^T\theta)}$.

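15 Computational Efficiency Contd

Lemma 1. Suppose that Assumption 1 holds, and consider the functions $L'$, $L''$ from Definition 1. Then, almost-surely,

$$\nabla L(\theta_n, \xi_n) = s_n \nabla L(\theta_{n-1}, \xi_n); \tag{9}$$

the scalar $s_n$ satisfies the fixed-point equation

$$s_n \kappa_{n-1} = L'\left(\theta_{n-1} - s_n \gamma_n \kappa_{n-1} x_n,\ \xi_n\right), \tag{10}$$

where $\kappa_{n-1} \triangleq L'(\theta_{n-1}, \xi_n)$. Moreover, if $L''(\theta, \xi) \ge 0$ almost-surely for all $\theta \in \Theta$, then

$$s_n \kappa_{n-1} \in \begin{cases} [\kappa_{n-1}, 0) & \text{if } \kappa_{n-1} < 0, \\ [0, \kappa_{n-1}] & \text{otherwise.} \end{cases}$$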
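As an illustration of how Lemma 1 reduces the implicit update to a one-dimensional search, the sketch below applies it to logistic regression. The logistic loss is an assumption made for this example (it is not stated in the slides): with $\eta = x^T\theta$ and $y \in \{0, 1\}$, $L'(\eta) = \sigma(\eta) - y$ and $L'' \ge 0$. The function name and the use of plain bisection are also illustrative choices; the unknown is $u = s_n \kappa_{n-1}$, bracketed by the interval given in the lemma.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def implicit_logistic_step(theta_prev, x, y, gamma_n, tol=1e-12):
    """One implicit SGD step for the (assumed) logistic loss via equation (10).

    Returns the new iterate and the scalar s_n from Lemma 1.
    """
    eta = x @ theta_prev
    kappa = sigmoid(eta) - y                 # kappa_{n-1} = L'(theta_{n-1}, xi_n)
    if kappa == 0.0:
        return theta_prev.copy(), 1.0        # zero gradient: nothing to update
    c = gamma_n * (x @ x)

    def g(u):
        # Root of g is u = s_n * kappa_{n-1}; g is strictly increasing since L'' >= 0.
        return u - (sigmoid(eta - c * u) - y)

    # Search bounds from Lemma 1.
    lo, hi = (kappa, 0.0) if kappa < 0 else (0.0, kappa)
    while hi - lo > tol:                     # plain bisection on a bracketed root
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    u = 0.5 * (lo + hi)
    return theta_prev - gamma_n * u * x, u / kappa

theta_prev = np.array([0.2, -0.4, 1.0])
x = np.array([1.0, 0.5, -2.0])
y = 1.0
gamma_n = 0.5
theta_new, s_n = implicit_logistic_step(theta_prev, x, y, gamma_n)

# Residual of the implicit equation (4) at the computed iterate.
fixed_point_residual = theta_new - theta_prev + gamma_n * (sigmoid(x @ theta_new) - y) * x
```

Because the search interval has length at most $|\kappa_{n-1}|$, the cost per iteration is a handful of scalar function evaluations on top of the usual $O(p)$ work.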

16 Non-asymptotic Analysis: Implicit SGD

Theorem 1. Suppose that Assumptions 1, 2, 3(a), and 4 hold. Define $\delta_n \triangleq E\left(\|\theta_n - \theta^*\|^2\right)$, and constants $\Gamma^2 = 4\lambda_0^2 \sum_i \gamma_i^2 < \infty$, $\epsilon = (1 + \gamma_1(\varphi - \lambda_f))^{-1}$, and $\lambda = 1 + \gamma_1 \lambda_f \epsilon$. Also let $\rho_\gamma(n) = n^{1-\gamma}$ if $\gamma \ne 1$ and $\rho_\gamma(n) = \log n$ if $\gamma = 1$. Then, there exists a constant $n_0 > 0$ such that, for all $n > 0$,

$$\delta_n \le (8\lambda_0^2 \gamma_1 \lambda / \lambda_f \epsilon)\, n^{-\gamma} + e^{-\log\lambda \cdot \rho_\gamma(n)} \left[\delta_0 + \lambda^{n_0} \Gamma^2\right].$$

Remarks on convergence rate and numerical stability:

- The convergence rate of the implicit SGD iterates $\theta_n$ is $O(n^{-\gamma})$, the same as for classic SGD.
- The initial condition $\delta_0$ is discounted at an exponential rate, irrespective of the learning rate.
- In contrast, in classic SGD there is an exponential term $\exp(\lambda_1^2 \gamma_1^2 n^{1-2\gamma})$ multiplying the initial conditions.
- In classic SGD, if $\gamma_1$ is misspecified, the iterates can diverge.

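17 Non-asymptotic Analysis: Averaged Implicit SGD

Theorem 2. Consider the procedure (5), and suppose Assumptions 1, 2, 3(a), 3(c), 4, and 5 hold with $\gamma < 1$. Then,

$$\left(E\left(\|\bar{\theta}_n - \theta^*\|^2\right)\right)^{1/2} \le \left(\mathrm{tr}\left(\nabla^2 l(\theta^*)^{-1}\, \Sigma\, \nabla^2 l(\theta^*)^{-1}\right) / n\right)^{1/2} + O\left(n^{-1+\gamma/2}\right) + O\left(n^{-\gamma}\right) + O\left(\exp\left(-\log\lambda \cdot n^{1-\gamma}/2\right)\right).$$

Remarks:

- The averaged iterates $\bar{\theta}_n$ attain the Cramér-Rao lower bound, CRLB (statistical perspective), and the rate $O(1/n)$, which is optimal for first-order methods (optimization perspective).
- This is the same as in averaged classic SGD.
- The optimal choice is $\gamma = 2/3$.
- The procedure inherits its stability properties from implicit SGD.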
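One way to see where $\gamma = 2/3$ comes from is to balance the two polynomial remainder terms in the bound of Theorem 2. The following is a sketch of that argument, treating the exponential term as negligible:

```latex
\begin{align*}
O\!\left(n^{-1+\gamma/2}\right) + O\!\left(n^{-\gamma}\right)
  &= O\!\left(n^{-\min\{1-\gamma/2,\ \gamma\}}\right), \\
1 - \gamma/2 = \gamma
  &\iff \gamma = 2/3, \\
\text{so at } \gamma = 2/3:\quad
O\!\left(n^{-1+\gamma/2}\right) + O\!\left(n^{-\gamma}\right)
  &= O\!\left(n^{-2/3}\right) = o\!\left(n^{-1/2}\right).
\end{align*}
```

Since $1 - \gamma/2$ decreases in $\gamma$ while $\gamma$ increases, the smaller of the two exponents is largest where they meet. At $\gamma = 2/3$ the remainder therefore decays fastest, and the leading $n^{-1/2}$ term dominates.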

18 Linear Regression: Description

To demonstrate the statistical efficiency and stability of averaged implicit SGD:

- Let $N = 10^6$ be the number of observations.
- Let $p = 20$ be the number of features.
- Let $\theta^* = (0, 0, \ldots, 0)^T$ be the ground truth.
- Let $H$ be a randomly generated symmetric matrix with eigenvalues $1/k$, for $k = 1, \ldots, p$.
- Let $\xi_n = (x_n, y_n)$, where $x_1, \ldots, x_N \sim N_p(0, H)$ are i.i.d. normal random vectors and $y_n \mid x_n \sim N(x_n^T\theta^*, 1)$, for $n = 1, \ldots, N$.
- Let $L(\theta, \xi_n) = (y_n - x_n^T\theta)^2$. Thus, up to an additive constant (the unit noise variance), $l(\theta) = E(L(\theta, \xi)) = (\theta - \theta^*)^T H (\theta - \theta^*)$.
- Constant learning rate $\gamma_n \equiv \gamma_1 \propto 1/R^2$, where $R^2 = \mathrm{tr}(H)$ is the average radius of the data.
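A scaled-down version of this setup can be sketched as follows. Several choices here are assumptions made for the example rather than details from the slides: $N$ is reduced to 20000 to keep the run short, the random orthogonal basis for $H$ comes from a QR decomposition, the starting point is $\theta_0 = (1, \ldots, 1)^T$, and the constant learning rate is taken to be exactly $1/R^2$. For the loss $(y - x^T\theta)^2$, the implicit update (4) has the closed form used in the loop.

```python
import numpy as np

def simulate_data(n_obs, p, rng):
    """Draw x_n ~ N_p(0, H) and y_n | x_n ~ N(x_n^T theta_star, 1) with theta_star = 0."""
    eigvals = 1.0 / np.arange(1, p + 1)                 # eigenvalues 1/k, k = 1..p
    Q, _ = np.linalg.qr(rng.normal(size=(p, p)))        # random orthogonal basis (assumption)
    H = (Q * eigvals) @ Q.T                             # H = Q diag(eigvals) Q^T
    X = (rng.normal(size=(n_obs, p)) * np.sqrt(eigvals)) @ Q.T
    y = rng.normal(size=n_obs)                          # theta_star = 0, so y is pure noise
    return X, y, H

def averaged_implicit_sgd(X, y, gamma, theta0):
    """Implicit SGD (4) with constant rate gamma, plus the running average (5)."""
    theta = theta0.copy()
    theta_bar = np.zeros_like(theta0)
    for n, (x_n, y_n) in enumerate(zip(X, y), start=1):
        resid = y_n - x_n @ theta
        # Closed-form solution of (4) for the loss (y - x^T theta)^2.
        step = 2.0 * gamma / (1.0 + 2.0 * gamma * (x_n @ x_n))
        theta = theta + step * resid * x_n
        theta_bar += (theta - theta_bar) / n
    return theta_bar

rng = np.random.default_rng(1)
p, n_obs = 20, 20000                                    # N reduced from 10^6 for speed
X, y, H = simulate_data(n_obs, p, rng)
gamma = 1.0 / np.trace(H)                               # 1 / R^2
theta0 = np.ones(p)                                     # hypothetical starting point
theta_bar = averaged_implicit_sgd(X, y, gamma, theta0)

# Excess risk (theta - theta_star)^T H (theta - theta_star) with theta_star = 0.
excess_init = theta0 @ H @ theta0
excess_avg = theta_bar @ H @ theta_bar
```

Comparing `excess_avg` with `excess_init` gives a quick sanity check that the averaged iterate has moved towards $\theta^*$.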

19 Linear Regression: Results

20 Classification: Description

- Compare to other stochastic optimization approaches using standard benchmarks.
- Benchmarks include COVTYPE (forest cover classification), DELTA (synthetic data from the PASCAL large scale challenge), RCV1 (document classification), and MNIST (handwritten digit classification).
- Averaged implicit SGD and ASGD use the learning rate schedule $\gamma_n = \eta_0 (1 + \eta_0 n)^{-3/4}$, where $\eta_0$ is obtained by preprocessing on a small subset of the data.
- For the other methods, grid search is used for hyperparameter tuning.
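The learning rate schedule quoted on this slide is simple to write down. The function name and the value of `eta0` below are hypothetical; how $\eta_0$ is tuned on the data subset is not specified here.

```python
def learning_rate(n, eta0):
    """Schedule gamma_n = eta0 * (1 + eta0 * n)^(-3/4)."""
    return eta0 * (1.0 + eta0 * n) ** (-0.75)

eta0 = 0.5                                   # placeholder value, not from the slides
rates = [learning_rate(n, eta0) for n in range(0, 5)]
```

The schedule starts at $\eta_0$ for $n = 0$ and decays like $n^{-3/4}$ for large $n$, so it corresponds to an exponent of $3/4$, which lies in the range $(1/2, 1]$ required by Assumption 2.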

21 Classification: Results

22 Sensitivity Analysis

- Unfair sensitivity comparison: regularization parameters versus algorithmic parameters.
- The authors first discuss varying learning rates, but later show and discuss regularization parameters.
- The trends described in the text are reversed.
- I'm pretty sure there is a mistake somewhere in this section.

23 Sensitivity Analysis Contd
