Large-scale machine learning: Stochastic gradient descent

Stochastic gradient descent IRKM Lab April 22, 2010

Introduction
Very large amounts of data are being generated, quicker than we know what to do with it ('08 stats):
- NYSE generates 1 terabyte/day of new trade data
- Facebook has about 1 billion photos (2.5 petabytes)
- Large Hadron Collider: 15 petabytes/year
- Internet Archive grows 20 terabytes/month

Introduction
The dynamics of learning change with scale (Banko & Brill '01).

Dealing with large scale: challenges
- Data will most likely not fit in RAM
- Disk transfer speed is slow relative to disk size: about 3 hours to read 1 terabyte from disk @ 100 MB/s
- A 1 Gbit/s ethernet card (125 MB/s) reads about 450 GB/hour
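As a quick sanity check on those throughput figures (my arithmetic, not part of the original slide):

$$\frac{1\ \text{TB}}{100\ \text{MB/s}} = \frac{10^{6}\ \text{MB}}{100\ \text{MB/s}} = 10^{4}\ \text{s} \approx 2.8\ \text{hours}, \qquad 125\ \text{MB/s} \times 3600\ \text{s/h} \approx 450\ \text{GB/hour}.$$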

Dealing with large scale: approaches
- Subsample from the data available
- Reduce data dimensionality (compress, project to a smaller subspace)
- Use an implicit kernel representation (kernel trick, up to infinite-dimensional spaces)
- Use a smarter (quicker) algorithm (e.g., conjugate gradient descent instead of plain batch gradient descent)
- Parallelize the algorithm (map-reduce, GPU, MPI, SAN-based HPC)
- Use an incremental/stream-based algorithm (online learning, stochastic gradient descent)
- Mix and match!

Review: batch gradient descent

Gradient descent
Algorithm: batch gradient descent
Algorithm: stochastic gradient descent
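The update rules themselves are not reproduced in the transcription; in standard notation, for an objective $f(w) = \frac{1}{n}\sum_{i=1}^{n} L(w; x_i, y_i)$ with step size $\eta_t$, the two updates are typically written as:

$$\text{Batch:}\quad w_{t+1} = w_t - \eta_t \cdot \frac{1}{n}\sum_{i=1}^{n} \nabla_w L(w_t; x_i, y_i)$$

$$\text{Stochastic:}\quad w_{t+1} = w_t - \eta_t \cdot \nabla_w L(w_t; x_{i_t}, y_{i_t}), \quad i_t \text{ a single (typically randomly chosen) example}$$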

Stochastic gradient descent: what does it look like?
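To make the contrast concrete, here is a minimal sketch (not the speaker's code) of plain SGD for least-squares regression, assuming NumPy and a squared-error loss per example:

```python
import numpy as np

def sgd_least_squares(X, y, lr=0.01, epochs=10, seed=0):
    """Plain SGD for least squares: update on one example at a time."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):       # visit examples in random order
            residual = X[i] @ w - y[i]     # error on this single example
            w -= lr * residual * X[i]      # gradient of 0.5 * (x_i . w - y_i)^2
    return w

# Tiny usage example on synthetic data
X = np.random.randn(1000, 5)
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.01 * np.random.randn(1000)
w_hat = sgd_least_squares(X, y, lr=0.01, epochs=20)
```

Each update touches a single example, so one pass over the data already makes n parameter updates, which is why SGD can make rapid early progress on large datasets.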

Stochastic gradient descent: why does it work?
Foundational work on stochastic approximation methods by Robbins and Monro in the 1950s. These algorithms have proven convergence in the limit, as the number of data points goes to infinity, provided a few conditions hold:
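The conditions themselves are not in the transcription; the standard Robbins-Monro requirements are an unbiased gradient estimate with bounded variance and step sizes $\eta_t$ satisfying:

$$\sum_{t=1}^{\infty} \eta_t = \infty \qquad \text{and} \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty.$$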

Stochastic gradient descent: why should we care?
- In practice, stochastic gradient descent will often get close to the minimum much faster than batch gradient descent
- In addition, with an unlimited supply of data, stochastic gradient descent is the obvious candidate
- Bottou and Le Cun (2003) show that the best generalization error is asymptotically achieved by the learning algorithm that uses the most examples within the allowed time

Small-scale vs. large-scale (Bottou & Bousquet '07)
Problem: choose F, n, ρ to make the error as small as possible, subject to constraints:
- Small scale: the constraint is the number of examples
- Large scale: the constraint is computation time

Small scale:
- To reduce estimation error, use as many examples as possible
- To reduce optimization error, take ρ = 0
- Adjust the richness of F

Large scale:
- More complicated: computing time depends on all three parameters (F, n, ρ)
- Example: if we choose ρ small, we decrease the optimization error, but we may then also have to decrease F and/or n, with adverse effects on the estimation and approximation errors
Algorithmic complexity (Bottou '08)

Representative results (Bottou '08)

Stochastic gradient descent: usage
- Least squares regression
- Generalized linear models
- Multilayer neural networks
- Singular value decomposition
- Principal component analysis
- Support vector machines: Pegasos: Primal Estimated sub-GrAdient SOlver for SVM (Shalev-Shwartz et al. '07)
- Conditional random fields: Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods (Vishwanathan et al. '06)
Code and results for the last two @ http://leon.bottou.org/projects/sgd
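As an illustration of the SVM case, here is a minimal Pegasos-style sketch (my own, not the authors' released code): SGD on the L2-regularized hinge loss with the 1/(λt) step size from the paper, assuming labels in {-1, +1}:

```python
import numpy as np

def pegasos_style_svm(X, y, lam=0.01, epochs=10, seed=0):
    """SGD on the L2-regularized hinge loss (Pegasos-style step size 1/(lam*t))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)                        # decreasing step size
            if y[i] * (X[i] @ w) < 1:                    # hinge loss is active
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                                        # only the regularizer acts
                w = (1 - eta * lam) * w
    return w
```

The full Pegasos algorithm also includes an optional projection onto a ball of radius 1/sqrt(λ) and a mini-batch variant; the sketch above keeps only the core update.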

Stochastic gradient descent: how to speed it up / scale it up?
- Second-order stochastic gradient
- Mini-batches
- Parallelize over examples
- Parallelize over features
Vowpal Wabbit @ http://hunch.net/~vw/
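Of these, mini-batching is the easiest to sketch: average the gradient over a small batch per update, which smooths the gradient noise and maps well onto vectorized or parallel hardware. A minimal, generic version (my illustration; grad_fn is a placeholder for whatever per-batch gradient your model defines):

```python
import numpy as np

def minibatch_sgd(X, y, grad_fn, lr=0.01, batch_size=32, epochs=10, seed=0):
    """Mini-batch SGD: one update per batch, gradient averaged over the batch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)                         # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w -= lr * grad_fn(w, X[batch], y[batch])     # averaged-gradient step
    return w

# Example batch gradient for least squares
lsq_grad = lambda w, Xb, yb: Xb.T @ (Xb @ w - yb) / len(yb)
```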

Conclusion
Stochastic gradient descent:
- Simple
- Fast
- Scalable
Don't underestimate it!