Stochastic gradient descent
IRKM Lab
April 22, 2010
Introduction
Very large amounts of data are being generated more quickly than we know what to do with (2008 stats):
- NYSE generates 1 terabyte/day of new trade data
- Facebook hosts about 1 billion photos (2.5 petabytes)
- The Large Hadron Collider produces 15 petabytes/year
- The Internet Archive grows 20 terabytes/month
Introduction
The dynamics of learning change with scale (Banko & Brill, 2001).
Dealing with large scale: challenges
- Data will most likely not fit in RAM
- Disk transfer speed is slow relative to disk capacity: about 3 hours to read 1 terabyte from disk at 100 MB/s
- A 1 Gbit/s ethernet card (125 MB/s) reads about 450 GB/hour
Dealing with large scale: approaches
- Subsample from the available data
- Reduce data dimensionality (compress, project to a smaller subspace)
- Implicit kernel representation (kernel trick, up to infinite-dimensional spaces)
- Use a smarter (quicker) algorithm (batch vs. conjugate gradient descent)
- Parallelize the algorithm (map-reduce, GPU, MPI, SAN-based HPC)
- Use an incremental/stream-based algorithm (online learning, stochastic gradient descent)
- Mix and match!
Review: batch gradient descent
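For reference, the standard batch formulation (notation assumed here, not taken from the slides): given training examples (x_i, y_i), i = 1..n, minimize the average loss and step against its full gradient,

J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(f_\theta(x_i), y_i), \qquad
\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)
= \theta - \frac{\eta}{n} \sum_{i=1}^{n} \nabla_\theta \, \ell(f_\theta(x_i), y_i),

so every single parameter update requires a pass over all n examples.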
Gradient descent
- Algorithm: Stochastic gradient descent
- Algorithm: Batch gradient descent
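A minimal runnable sketch contrasting the two update loops on a squared-error loss (Python; function and variable names here are illustrative assumptions, not from the slides):

import numpy as np

def grad_example(theta, x, y):
    # gradient of the one-example squared-error loss 0.5 * (x . theta - y)^2
    return (x @ theta - y) * x

def batch_gradient_descent(X, Y, theta, eta=0.01, epochs=100):
    # each update averages the gradient over the entire dataset
    for _ in range(epochs):
        grad = np.mean([grad_example(theta, x, y) for x, y in zip(X, Y)], axis=0)
        theta = theta - eta * grad
    return theta

def stochastic_gradient_descent(X, Y, theta, eta=0.01, epochs=100):
    # each update uses a single example, visited in random order
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):
            theta = theta - eta * grad_example(theta, X[i], Y[i])
    return theta

The batch loop pays for a full pass over the data per parameter update, while the stochastic loop updates the parameters after every example.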
Stochastic gradient descent: what does it look like?
Stochastic gradient descent: why does it work?
- Foundational work in stochastic approximation methods by Robbins and Monro in the 1950s
- These algorithms have proven convergence in the limit, as the number of data points goes to infinity, provided a few conditions hold (sketched below)
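The usual Robbins-Monro requirements on the step sizes \eta_t (a standard statement, stated here for completeness rather than reproduced from the slides):

\sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^{2} < \infty,

together with suitable regularity assumptions on the objective and unbiased stochastic gradient estimates.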
Stochastic gradient descent: why should we care?
- In practice, stochastic gradient descent will often get close to the minimum much faster than batch gradient descent
- In addition, with an unlimited supply of data, stochastic gradient descent is the obvious candidate
- Bottou and Le Cun (2003) show that the best generalization error is asymptotically achieved by the learning algorithm that uses the most examples within the allowed time
Small-scale vs. large-scale (Bottou & Bousquet 07)
Problem: choose F, n, ρ to make the error as small as possible, subject to constraints:
- Large scale: the constraint is computation time
- Small scale: the constraint is the number of examples

Small-scale vs. large-scale (Bottou & Bousquet 07)
Small scale:
- To reduce the estimation error, use as many examples as possible
- To reduce the optimization error, take ρ = 0
- Adjust the richness of F
Large scale:
- More complicated: computing time depends on all three parameters (F, n, ρ)
- Example: if we choose ρ small, we decrease the optimization error, but we may then also have to decrease F and/or n, with adverse effects on the estimation and approximation errors
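Bottou & Bousquet analyze the expected excess error as a sum of three terms (the standard decomposition from their paper, written in the usual notation):

\mathcal{E} = \mathcal{E}_{\mathrm{app}}(\mathcal{F}) + \mathcal{E}_{\mathrm{est}}(\mathcal{F}, n) + \mathcal{E}_{\mathrm{opt}}(\rho),

where the approximation error depends on the richness of F, the estimation error on the number of examples n, and the optimization error on the tolerance ρ at which optimization is stopped.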
Algorithmic complexity (Bottou 08)
Representative results (Bottou 08)
Stochastic gradient descent: usage
- Least squares regression (a minimal sketch follows after this list)
- Generalized linear models
- Multilayer neural networks
- Singular value decomposition
- Principal component analysis
- Support vector machines: Pegasos: Primal Estimated sub-gradient SOlver for SVM (Shalev-Shwartz et al. 07)
- Conditional random fields: Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods (Vishwanathan et al. 06)
- Code and results for the last two @ http://leon.bottou.org/projects/sgd
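As one concrete instance from the list above, a minimal SGD least-squares regression sketch with a decreasing step size (Python; the schedule and names are illustrative assumptions, not the implementation from the cited papers):

import numpy as np

def sgd_least_squares(X, Y, epochs=10, eta0=0.1):
    # linear model y ~ x . theta, trained with one-example-at-a-time updates
    n, d = X.shape
    theta = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            t += 1
            eta = eta0 / (1.0 + eta0 * t)     # decreasing step size (Robbins-Monro style)
            residual = X[i] @ theta - Y[i]
            theta -= eta * residual * X[i]    # gradient of 0.5 * residual**2 w.r.t. theta
    return theta

For example, on data generated as Y = X @ theta_true + small noise, a few epochs of this loop approximately recover theta_true without ever forming the full normal equations.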
Stochastic gradient descent: how to speed/scale it up?
- Second-order stochastic gradient
- Mini-batches (sketch after this list)
- Parallelize over examples
- Parallelize over features
- Vowpal Wabbit @ http://hunch.net/~vw/
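A sketch of the mini-batch variant for the same least-squares setting (Python; batch_size and the function names are assumptions for illustration):

import numpy as np

def minibatch_sgd(X, Y, theta, eta=0.01, batch_size=32, epochs=10):
    # average the gradient over a small batch per update: lower variance than
    # pure SGD, much cheaper than a full batch, and the per-batch work
    # vectorizes (or parallelizes) naturally
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residuals = X[idx] @ theta - Y[idx]
            grad = X[idx].T @ residuals / len(idx)   # mean gradient over the batch
            theta = theta - eta * grad
    return theta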
Conclusion
Stochastic gradient descent is:
- Simple
- Fast
- Scalable
Don't underestimate it!