Stochastic gradient descent
IRKM Lab
April 22, 2010
Introduction
Very large amounts of data are being generated more quickly than we know what to do with (2008 stats):
- NYSE generates 1 terabyte/day of new trade data
- Facebook hosts about 1 billion photos (2.5 petabytes)
- The Large Hadron Collider produces 15 petabytes/year
- The Internet Archive grows 20 terabytes/month
Introduction
The dynamics of learning change with scale (Banko & Brill, 2001).
Dealing with large scale: challenges
- Data will most likely not fit in RAM
- Disk transfer speed is slow relative to disk capacity: about 3 hours to read 1 terabyte from disk at 100 MB/s
- A 1 Gbit/s ethernet card (125 MB/s) reads about 450 GB/hour
Dealing with large scale: approaches
- Subsample from the available data
- Reduce data dimensionality (compress, project to a smaller subspace)
- Implicit kernel representation (kernel trick, up to infinite-dimensional spaces)
- Use a smarter (quicker) algorithm (batch vs. conjugate gradient descent)
- Parallelize the algorithm (map-reduce, GPU, MPI, SAN-based HPC)
- Use an incremental/stream-based algorithm (online learning, stochastic gradient descent)
- Mix and match!
Review: batch gradient descent
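For reference, the standard batch formulation (notation assumed here, not taken from the slides): given training examples (x_i, y_i), i = 1..n, minimize the average loss and step against its full gradient,

J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(f_\theta(x_i), y_i), \qquad
\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)
= \theta - \frac{\eta}{n} \sum_{i=1}^{n} \nabla_\theta \, \ell(f_\theta(x_i), y_i),

so every single parameter update requires a pass over all n examples.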
Gradient descent
- Algorithm: Stochastic gradient descent
- Algorithm: Batch gradient descent
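A minimal runnable sketch contrasting the two update loops on a squared-error loss (Python; function and variable names here are illustrative assumptions, not from the slides):

import numpy as np

def grad_example(theta, x, y):
    # gradient of the one-example squared-error loss 0.5 * (x . theta - y)^2
    return (x @ theta - y) * x

def batch_gradient_descent(X, Y, theta, eta=0.01, epochs=100):
    # each update averages the gradient over the entire dataset
    for _ in range(epochs):
        grad = np.mean([grad_example(theta, x, y) for x, y in zip(X, Y)], axis=0)
        theta = theta - eta * grad
    return theta

def stochastic_gradient_descent(X, Y, theta, eta=0.01, epochs=100):
    # each update uses a single example, visited in random order
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):
            theta = theta - eta * grad_example(theta, X[i], Y[i])
    return theta

The batch loop pays for a full pass over the data per parameter update, while the stochastic loop updates the parameters after every example.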
Stochastic gradient descent: what does it look like?
Stochastic gradient descent: why does it work?
- Foundational work in stochastic approximation methods by Robbins and Monro in the 1950s
- These algorithms have proven convergence in the limit, as the number of data points goes to infinity, provided a few conditions hold (sketched below)
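The usual Robbins-Monro requirements on the step sizes \eta_t (a standard statement, stated here for completeness rather than reproduced from the slides):

\sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^{2} < \infty,

together with suitable regularity assumptions on the objective and unbiased stochastic gradient estimates.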
Stochastic gradient descent: why should we care?
- In practice, stochastic gradient descent will often get close to the minimum much faster than batch gradient descent
- In addition, with an unlimited supply of data, stochastic gradient descent is the obvious candidate
- Bottou and Le Cun (2003) show that the best generalization error is asymptotically achieved by the learning algorithm that uses the most examples within the allowed time
Small-scale vs. large-scale (Bottou & Bousquet 07)
Problem: choose F, n, ρ to make the error as small as possible, subject to constraints:
- Large scale: the constraint is computation time
- Small scale: the constraint is the number of examples

Small-scale vs. large-scale (Bottou & Bousquet 07)
Small scale:
- To reduce the estimation error, use as many examples as possible
- To reduce the optimization error, take ρ = 0
- Adjust the richness of F
Large scale:
- More complicated: computing time depends on all three parameters (F, n, ρ)
- Example: if we choose ρ small, we decrease the optimization error, but we may then also have to decrease F and/or n, with adverse effects on the estimation and approximation errors
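Bottou & Bousquet analyze the expected excess error as a sum of three terms (the standard decomposition from their paper, written in the usual notation):

\mathcal{E} = \mathcal{E}_{\mathrm{app}}(\mathcal{F}) + \mathcal{E}_{\mathrm{est}}(\mathcal{F}, n) + \mathcal{E}_{\mathrm{opt}}(\rho),

where the approximation error depends on the richness of F, the estimation error on the number of examples n, and the optimization error on the tolerance ρ at which optimization is stopped.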
Algorithmic complexity (Bottou 08)
Representative results (Bottou 08)
Stochastic gradient descent: usage
- Least squares regression (a minimal sketch follows after this list)
- Generalized linear models
- Multilayer neural networks
- Singular value decomposition
- Principal component analysis
- Support vector machines: Pegasos: Primal Estimated sub-gradient SOlver for SVM (Shalev-Shwartz et al. 07)
- Conditional random fields: Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods (Vishwanathan et al. 06)
- Code and results for the last two @ http://leon.bottou.org/projects/sgd
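As one concrete instance from the list above, a minimal SGD least-squares regression sketch with a decreasing step size (Python; the schedule and names are illustrative assumptions, not the implementation from the cited papers):

import numpy as np

def sgd_least_squares(X, Y, epochs=10, eta0=0.1):
    # linear model y ~ x . theta, trained with one-example-at-a-time updates
    n, d = X.shape
    theta = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            t += 1
            eta = eta0 / (1.0 + eta0 * t)     # decreasing step size (Robbins-Monro style)
            residual = X[i] @ theta - Y[i]
            theta -= eta * residual * X[i]    # gradient of 0.5 * residual**2 w.r.t. theta
    return theta

For example, on data generated as Y = X @ theta_true + small noise, a few epochs of this loop approximately recover theta_true without ever forming the full normal equations.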
Stochastic gradient descent: how to speed/scale it up?
- Second-order stochastic gradient
- Mini-batches (sketch after this list)
- Parallelize over examples
- Parallelize over features
- Vowpal Wabbit @ http://hunch.net/~vw/
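A sketch of the mini-batch variant for the same least-squares setting (Python; batch_size and the function names are assumptions for illustration):

import numpy as np

def minibatch_sgd(X, Y, theta, eta=0.01, batch_size=32, epochs=10):
    # average the gradient over a small batch per update: lower variance than
    # pure SGD, much cheaper than a full batch, and the per-batch work
    # vectorizes (or parallelizes) naturally
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residuals = X[idx] @ theta - Y[idx]
            grad = X[idx].T @ residuals / len(idx)   # mean gradient over the batch
            theta = theta - eta * grad
    return theta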
Conclusion
Stochastic gradient descent is:
- Simple
- Fast
- Scalable
Don't underestimate it!