Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Size: px

Start display at page:

Download "Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure"

Geraldine McDaniel
5 years ago
Views:

Bietti Julien Mairal Inria Grenoble (Thoth) March 21,

1 Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21, / 20

2 Stochastic optimization in machine learning Stochastic approximation: min x E ζ D [f (x, ζ)] Infinite datasets (expected risk, D: data distribution), or single pass SGD, stochastic mirror descent, FOBOS, RDA O(1/ɛ) complexity 1 Incremental methods with variance reduction: min ni=1 x n f i (x) Finite datasets (empirical risk): f i (x) = l(y i, x ξ i ) + (µ/2) x 2 SAG, SDCA, SVRG, SAGA, MISO, etc. O(log 1/ɛ) complexity Alberto Bietti Stochastic MISO March 21, / 20

Data perturbations in machine learning Perturbations of data useful for regularization, stable feature selection, privacy aware learning 2 Data Augmentation via Le vy Processes We focus on data

) Dropout: set coordinates of feature vectors to 0 with probability δ.

Bryggen Hanseatic Wharf will give you a sense of the local culture take some time to snap photos of the Hanseatic commercial buildings, which look like scenery from a movie set.

3 Data perturbations in machine learning Perturbations of data useful for regularization, stable feature selection, privacy aware learning 2 Data Augmentation via Le vy Processes We focus on data augmentation of a finite training set, for regularization purposes (better performance on test data), e.g.: I I Image data augmentation: add random transformations of each image in the training set (crop, scale, rotate, brightness, contrast, etc.) Dropout: set coordinates of feature vectors to 0 with probability δ. Invariant SVM using Selective Sampling (a) Gaussian noise The colorful Norwegian city of Bergen is also a gateway to majestic fjords. Bryggen Hanseatic Wharf will give you a sense of the local culture take some time to snap photos of the Hanseatic commercial buildings, which look like scenery from a movie set. The colorful of gateway to fjords. Hanseatic Wharf will sense the culture take some to snap photos the commercial buildings, which look scenery a (b) Dropout noise Figure: Data augmentation on MNIST digit (left), Dropout on text (right). Figure 1.1: Two examples of transforming an original input X into a noisy, less e The new inputs clearly have the same label but contain less informative input X. information and thus are harder to classify. is most likely still about travel. In both cases, the expert knowledge amounts to a belief that a certain transform of the features should generally not a ect an March 21, 2017 example s label. MISO Figure 6: This figure shows 16 Bietti variations of a digit with all the transformations cited here. Alberto Stochastic 3 / 20

4 Optimization objective with perturbations min x R p f i (x)=e ρ Γ [ f i (x, ρ)] { F (x) = 1 } n E ρ Γ [ f i (x, ρ)] + h(x) n i=1 ρ: perturbation f i (, ρ) is convex with L-Lipschitz gradients F is µ-strongly convex h: convex, possibly non-smooth, penalty, e.g. l 1 norm Alberto Bietti Stochastic MISO March 21, / 20

5 Can we do better than SGD? min x R p { f (x) = 1 } n E ρ Γ [ f i (x, ρ)] n i=1 SGD is a natural choice Sample index it, perturbation ρ t Γ Update: x t = x t 1 η t f it (x t 1, ρ t ) O(σ 2 tot/µt) convergence, with σ 2 tot := E i,ρ [ f i (x, ρ) 2 ] Key observation: variance from perturbations only is small compared to variance across all examples Contribution: improve convergence of SGD by exploiting the finite-sum structure using variance reduction. Yields O(σ 2 /µt) convergence with E ρ [ f i (x, ρ) f i (x ) 2] σ 2 σ 2 tot Alberto Bietti Stochastic MISO March 21, / 20

6 Background: MISO algorithm (Mairal, 2015) Finite sum problem: min x f (x) = 1 n ni=1 f i (x) Maintains a quadratic lower bound model di t(x) = µ 2 x zt i 2 + ci t on each f i di t is updated using a strong convexity lower bound on f i : f i (x) f i (x t 1 ) + f i (x t 1 ), x x t 1 + µ 2 x x t 1 2 =: l t i (x) Two steps: { Select i t, update: di t(x) = (1 α)d t 1 i (x) + αli t(x), if i = i t d t 1 i (x), otherwise Minimize the model: xt = arg min x {D t (x) = 1 n n i=1 d i t(x)} Alberto Bietti Stochastic MISO March 21, / 20

7 MISO algorithm (Mairal, 2015) Final algorithm: at iteration t, choose index i t at random and update: z t i = x t = 1 n { (1 α)z t 1 i + α(x t 1 1 µ f i(x t 1 )), if i = i t z t 1 i, otherwise. n zi t i=1 Complexity O((n + L/µ) log 1/ɛ), typical of variance reduction Similar to SDCA without duality (Shalev-Shwartz, 2016) Alberto Bietti Stochastic MISO March 21, / 20

8 Stochastic MISO min x R p { f (x) = 1 } n E ρ Γ [ f i (x, ρ)] n i=1 With perturbations, we cannot compute exact strong convexity lower bounds on f i = E ρ [ f i (, ρ)] Instead, use approximate lower bounds using stochastic gradient estimates f it (x t 1, ρ t ) Allow decreasing step-sizes α t in order to guarantee convergence as in stochastic approximation Alberto Bietti Stochastic MISO March 21, / 20

9 Stochastic MISO: algorithm Input: step-size sequence (α t ) t 1 ; for t = 1,... do Sample i t uniformly at random, ρ t Γ, and update: { (1 zi t αt )zi t 1 + α t (x t 1 1 µ = f it (x t 1, ρ t )), if i = i t zi t 1, otherwise. x t = 1 n zi t = x t n n (zt i t zi t 1 t ). i=1 end for Note: reduces to MISO for σ 2 = 0, α t = α, and to SGD for n = 1. Alberto Bietti Stochastic MISO March 21, / 20

10 Stochastic MISO: convergence analysis Define the Lyapunov function (with z i := x 1 µ f i(x )) C t = 1 2 x t x 2 + α t n 2 n zi t zi 2. i=1 Theorem (Recursion on C t, smooth case) If (α t ) t 1 are positive, non-increasing step-sizes with { } 1 α 1 min 2, n, 2(2κ 1) with κ = L/µ, then C t obeys the recursion ( E[C t ] 1 α t n ) E[C t 1 ] + 2 ( ) 2 αt σ 2 n µ 2. Note: Similar recursion for SGD with σ 2 tot instead of σ 2. Alberto Bietti Stochastic MISO March 21, / 20

11 Stochastic MISO: convergence with decreasing step-sizes Similar to SGD (Bottou et al., 2016). Theorem (Convergence of Lyapunov function) Let the sequence of step-sizes (α t ) t 1 be defined by α t = 2n γ + t { } 1 for γ 0 s.t. α 1 min 2, n. 2(2κ 1) For t 0, where E[C t ] ν := max { 8σ 2 ν γ + t + 1, µ 2, (γ + 1)C 0 }. Q: How can we get rid of the dependence on C 0? Alberto Bietti Stochastic MISO March 21, / 20

12 Practical step-size strategy Following Bottou et al. (2016), we keep the step-size constant for a few epochs in order to quickly forget the initial condition C 0 Using a constant step-size ᾱ, we can converge linearly near a constant error C = 2ᾱσ2 nµ 2 (in practice: a few epochs) We then start decreasing step-sizes with γ large enough s.t. α 1 = 2n/(γ + 1) ᾱ, no more C 0 in the convergence rate! Overall, complexity for reaching E[ x t x 2 ] ɛ: ( O (n + L/µ) log C 0 ɛ ) ( ) σ 2 + O µ 2. ɛ For E[f (x t ) f (x )] ɛ, the second term becomes O(Lσ 2 /µ 2 ɛ) via smoothness. Iterate averaging brings this down to O(σ 2 /µɛ). Alberto Bietti Stochastic MISO March 21, / 20

13 Extensions Composite objectives (h 0, e.g., l 1 penalty) MISO extends to this case by adding h to lower bound model (Lin et al., 2015) Different Lyapunov function ( xt x 2 replaced by an upper bound) Similar to Regularized Dual Averaging when n = 1 Non-uniform sampling Smoothness constants Li of each f i can vary a lot in heterogeneous datasets Sampling difficult examples more often can improve dependence in L from L max to L average Same convergence results apply (same Lyapunov recursion, decreasing step-sizes, iterate averaging) Alberto Bietti Stochastic MISO March 21, / 20

14 Experiments: dropout Dropout rate δ controls the variance of the perturbations. f - f* gene dropout, δ = 0.30 S-MISO η = 0. 1 S-MISO η = 1. 0 SGD η = 0. 1 SGD η = 1. 0 N-SAGA η = 0. 1 N-SAGA η = epochs f - f* gene dropout, 10 0 δ = epochs epochs f - f* gene dropout, δ = 0.01 Alberto Bietti Stochastic MISO March 21, / 20

15 Experiments: image data augmentation Random image crops and scalings, encoding with an unsupervised deep convolutional network. Different conditioning, controlled by µ. f - f* STL-10 ckn, µ = 10 3 S-MISO η = 0. 1 S-MISO η = 1. 0 SGD η = 0. 1 SGD η = 1. 0 N-SAGA η = epochs f - f* STL-10 ckn, µ = epochs f - f* STL-10 ckn, µ = epochs Alberto Bietti Stochastic MISO March 21, / 20

16 Conclusion Exploit underlying finite-sum structures in stochastic optimization problems using variance reduction Bring SGD variance term down to the variance induced by perturbations only Useful for data augmentation (e.g. random image transformations, Dropout) Future work: application to stable feature selection? C++/Eigen library with Cython extension available: Alberto Bietti Stochastic MISO March 21, / 20

17 References L. Bottou, F. E. Curtis, and J. Nocedal. Optimization Methods for Large-Scale Machine Learning. arxiv: , S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arxiv: , H. Lin, J. Mairal, and Z. Harchaoui. A Universal Catalyst for First-Order Optimization. In Advances in Neural Information Processing Systems (NIPS), J. Mairal. Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning. SIAM Journal on Optimization, 25(2): , S. Shalev-Shwartz. SDCA without Duality, Regularization, and Individual Convexity. In International Conference on Machine Learning (ICML), Alberto Bietti Stochastic MISO March 21, / 20

18 Acceleration by iterate averaging For function values, averaging helps bring the complexity term O(Lσ 2 /µ 2 ɛ) down to O(σ 2 /µɛ) Similar technique to Lacoste-Julien et al. (2012), but allows small initial step-sizes Theorem (Convergence under iterate averaging) Let the step-size sequence (α t ) t 1 be defined by We have α t = 2n γ + t { } 1 for γ 1 s.t. α 1 min 2, n. 4(2κ 1) E[f ( x T ) f (x )] 2µγ(γ 1)C 0 T (2γ + T 1) + 16σ 2 µ(2γ + T 1), 2 where x T := T 1 T (2γ+T 1) t=0 (γ + t)x t. Alberto Bietti Stochastic MISO March 21, / 20

19 Stochastic MISO (composite, non-uniform sampling) Input: step-sizes (α t ) t 1, sampling distribution q; for t = 1,... do Sample an index i t q, a perturbation ρ t Γ, and update: z t i = { (1 α t q i n )zt 1 i + αt q i n (x t 1 1 µ f it (x t 1, ρ t )), if i = i t zi t 1, otherwise z t = 1 n zi t n i=1 x t = prox h/µ ( z t ). = z t n (zt i t z t 1 i t ) end for Note: Similar to RDA for n = 1 when α t = 1/t. Alberto Bietti Stochastic MISO March 21, / 20

20 General S-MISO: analysis Lyapunov function C q t = F (x ) D t (x t ) + µα t n 2 n i=1 1 q i n zt i z i 2. Bound on the iterates Recursion with σ 2 q = 1 n µ 2 E[ x t x 2 ] E[F (x ) D t (x t )]. ( E[Ct q ] 1 α t n i σ 2 i q i n. ) ( ) 2 E[C q t 1 ] + 2 αt σq 2 n µ, Alberto Bietti Stochastic MISO March 21, / 20

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine