Simple Optimization, Bigger Models, and Faster Learning
Niao He
Big Data Symposium, UIUC, 2016

Big Data, Big Picture
- Data: pairs (x, y), with the goal of predicting y ≈ f(x)
- Model:
  - Linear model: f(x) = w^T x
  - Nonlinear model: f(x) = w^T φ(x)
  - Multi-layer network model: f(x) = W_3^T g_2(W_2^T g_1(W_1^T x))
- Optimization: min_{f ∈ F} L(f), e.g., L(f) = (1/n) Σ_{i=1}^n l(f(x_i), y_i)
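
To make the setup concrete, here is a minimal sketch in Python/NumPy of the empirical loss L(f) for a linear model; the synthetic data and the squared loss are illustrative assumptions, not part of the slide.

    import numpy as np

    # Synthetic data: n points in d dimensions (illustrative only)
    rng = np.random.default_rng(0)
    n, d = 1000, 10
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    def empirical_risk(w, X, y):
        """L(f) = (1/n) * sum_i l(f(x_i), y_i) with f(x) = w^T x and squared loss."""
        residuals = X @ w - y
        return np.mean(residuals ** 2)

    print(empirical_risk(np.zeros(d), X, y))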

Big Data, Big Opportunity
- Most problems are convex! A local optimum is also a global optimum; fundamentally tractable.
- Some problems are non-convex but admit nice structure: local optima behave like global optima.
- We have a dedicated library of efficient first-order optimization algorithms:
  - Gradient Descent, its acceleration, and cousins
  - Conditional Gradient (a.k.a. the Frank-Wolfe algorithm)
  - Coordinate Descent and its randomized variants
  - Primal-dual algorithms, ADMM
  - Quasi-Newton methods

Big Data, Big Challenges
- Too many data points (n large): simpler algorithms are needed
  - Stochastic gradient descent (SGD, a.k.a. stochastic approximation) type algorithms become the only methods of choice
  - Cheap iteration cost and (at least) sublinear convergence guarantees
- Too many features (d large): bigger models are needed. However,
  - Kernel methods are usually not scalable
  - Neural network models break convexity

Revisit: Stochastic Optimization and SGD

SGD Overview
(Stochastic) convex optimization problem:
  min_{θ ∈ Θ} φ(θ) = E_ξ[F(θ, ξ)]   or   (1/n) Σ_{i=1}^n F(θ, ξ_i)
- Stochastic Gradient Descent [Robbins-Monro, 1951]:
  θ_{t+1} = Π_Θ(θ_t − γ_t ∇F(θ_t, ξ_t)), where Π_Θ(η) = argmin_{θ ∈ Θ} { (1/2)‖θ − η‖²_2 }.
- Stochastic Mirror Descent [Nemirovski, 1979]:
  θ_{t+1} = P_{θ_t}(γ_t ∇F(θ_t, ξ_t)), where P_{θ_t}(η) = argmin_{θ ∈ Θ} { D_ω(θ, θ_t) + ⟨η, θ⟩ }.
- Inexact Stochastic Mirror Descent:
  θ_{t+1} ∈ P^{ε_t}_{θ_t}(γ_t ∇F(θ_t, ξ_t)), where P^{ε_t}_{θ_t}(η) = ε_t-argmin_{θ ∈ Θ} { D_ω(θ, θ_t) + ⟨η, θ⟩ }.
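
A minimal sketch of the Euclidean special case (projected SGD with iterate averaging); the l2-ball constraint, the least-squares objective, and the stepsize schedule below are illustrative assumptions, not from the slide.

    import numpy as np

    def project_l2_ball(eta, radius=1.0):
        """Pi_Theta for Theta = {theta : ||theta||_2 <= radius}."""
        norm = np.linalg.norm(eta)
        return eta if norm <= radius else eta * (radius / norm)

    def sgd(stoch_grad, theta0, stepsize, n_steps):
        """theta_{t+1} = Pi_Theta(theta_t - gamma_t * grad F(theta_t, xi_t)); returns the averaged iterate."""
        theta = theta0.copy()
        iterates = [theta.copy()]
        for t in range(n_steps):
            g = stoch_grad(theta)                         # stochastic gradient; draws its own xi_t
            theta = project_l2_ball(theta - stepsize(t) * g)
            iterates.append(theta.copy())
        return np.mean(iterates, axis=0)

    # Example: stochastic least squares, F(theta, xi_i) = 0.5 * (x_i^T theta - y_i)^2.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    w_true = rng.normal(size=5)
    w_true /= np.linalg.norm(w_true)       # keep the target inside the unit ball Theta
    y = X @ w_true

    def stoch_grad(theta):
        i = rng.integers(len(y))
        return (X[i] @ theta - y[i]) * X[i]

    theta_bar = sgd(stoch_grad, np.zeros(5), stepsize=lambda t: 0.5 / np.sqrt(t + 1), n_steps=5000)
    print(np.linalg.norm(theta_bar - w_true))   # typically small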

SGD Typical Results
  min_{θ ∈ Θ} φ(θ) = E_ξ[F(θ, ξ)]
The inexact Stochastic Mirror Descent algorithm guarantees that
  E[ φ( Σ_{τ=1}^t γ_τ θ_τ / Σ_{τ=1}^t γ_τ ) ] − φ(θ*) ≤ [ M² Σ_{τ=1}^t γ_τ² + D_ω(θ*, θ_1) + Σ_{τ=1}^t ε_τ ] / Σ_{τ=1}^t γ_τ
where M² = max_{θ ∈ Θ} E[‖∇F(θ, ξ)‖²].
Unbiased gradient + bounded variance + proper stepsize + well-controlled error + good averaging scheme gives:
- O(1/t) convergence rate in the strongly convex case
- O(1/√t) convergence rate in the general convex case
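
As a quick numerical check of the bound, the snippet below evaluates its right-hand side for γ_τ = c/√τ and ε_τ = 0; the constants are arbitrary choices, and the point is only to illustrate the O(1/√t) behavior up to a log factor.

    import numpy as np

    # Right-hand side: (M^2 * sum(gamma^2) + D + sum(eps)) / sum(gamma), with eps = 0.
    M2, D, c = 1.0, 1.0, 1.0
    for t in (10**2, 10**4, 10**6):
        tau = np.arange(1, t + 1)
        gamma = c / np.sqrt(tau)
        bound = (M2 * np.sum(gamma ** 2) + D) / np.sum(gamma)
        print(t, bound, bound * np.sqrt(t))   # bound * sqrt(t) grows only logarithmically in t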

SGD Practical Performance
- Full gradient descent: converges faster but with expensive per-iteration cost
- Stochastic gradient descent: converges more slowly but with cheap per-iteration cost
[Figure from Bach, 2013]

SGD Beyond
Lots of recent algorithmic development for supervised learning:
- SGD for convex-concave saddle point problems
- SGD with adaptive learning rates / preconditioning (AdaGrad, etc.)
- SGD with importance / stratified sampling (Iprox-SMD, etc.)
- SGD with second-order information (SQN, stochastic BFGS, etc.)
- Variance-reduced algorithms (SAG, SAGA, SVRG, PRDG, etc.)
- Parallel and asynchronous SGD (Hogwild!, Downpour SGD, etc.)
However, for several fundamental machine learning tasks, SGD and all of the above adaptations are still not enough.

Three Variants of Stochastic Gradient Descent
- Supervised learning: Doubly Stochastic Gradient Descent (Doubly SGD) [with Dai, Xie, Liang, Balcan, Song, NIPS 14]
- Bayesian inference: Particle Mirror Descent (PMD) [with Dai^2 and Song, AISTATS 15]
- Reinforcement learning: Embedding Stochastic Gradient Descent (Embedding-SGD) [with Dai, Pan, and Song, 2016]

Doubly SGD: scaling up big nonlinear models

Learning in Hilbert Space
  min_{f ∈ H} (1/n) Σ_{i=1}^n l(f(x_i), y_i) + ν‖f‖²_H
or
  min_{f ∈ H} L(f) := E_{(x,y)~P(x,y)}[l(f(x), y)]  (expected loss)  + (ν/2)‖f‖²_H  (regularizer)
with the domain H being a reproducing kernel Hilbert space:
- generators: k(x, ·), x ∈ X
- reproducing property: ⟨f, k(x, ·)⟩_H = f(x)

Previous Work
- Dual approach: e.g., for the square loss,
    min_{α ∈ R^n} α^T K α + λ α^T α − 2 α^T y
  Impractical to store/compute the kernel matrix K = (k(x_i, x_j)).
- Stochastic Gradient Descent / Dual Coordinate Ascent [Kivinen et al., 2004; Shalev-Shwartz & Zhang, 2013]
  At step t, f_t(·) = Σ_{i=1}^t α_i k(x_i, ·); requires high memory to retrieve the support vectors.
- Low-rank approximation / random feature approximation [Williams & Seeger, 2001; Rahimi & Recht, 2008]
  Low memory, but does not generalize well.
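
For reference, a minimal sketch of the dual / closed-form solve that requires forming the full n × n kernel matrix; the Gaussian kernel and the synthetic data are illustrative assumptions.

    import numpy as np

    def gaussian_kernel_matrix(X, Z, sigma=1.0):
        """K[i, j] = exp(-||x_i - z_j||^2 / (2 * sigma^2))."""
        sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))

    rng = np.random.default_rng(0)
    n, d = 500, 5
    X = rng.normal(size=(n, d))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

    lam = 1e-2
    K = gaussian_kernel_matrix(X, X)                   # n x n kernel matrix: the O(n^2) bottleneck
    alpha = np.linalg.solve(K + lam * np.eye(n), y)    # minimizer of a^T K a + lam a^T a - 2 a^T y

    X_test = rng.normal(size=(10, d))
    preds = gaussian_kernel_matrix(X_test, X) @ alpha  # f(x) = sum_i alpha_i k(x_i, x)
    print(preds)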

Duality Between Kernels and Random Processes
Theorem (Bochner). A continuous kernel k(x, x') = k(x − x') on R^d is positive definite if and only if k(x − x') is the Fourier transform of a non-negative measure P(ω):
  k(x, x') = ∫_{R^d} e^{iω^T(x − x')} dP(ω) = E_ω[φ_ω(x) φ_ω(x')].
Examples:
  Kernel      k(x, x')                           p(ω)
  Gaussian    exp(−‖x − x'‖²_2 / 2)              (2π)^(−d/2) exp(−‖ω‖²_2 / 2)
  Laplacian   exp(−‖x − x'‖_1)                   Π_{i=1}^d 1 / (π(1 + ω_i²))
  Cauchy      Π_{i=1}^d 2 / (1 + (x_i − x'_i)²)  exp(−‖ω‖_1)
and many other kernels (dot product, polynomial, Hellinger's, χ², arc-cosine).
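
A minimal sketch of this duality in action: random Fourier features for the Gaussian kernel in the spirit of Rahimi & Recht. The real-valued cosine feature map and the Monte Carlo sample size are standard choices assumed here, not taken from the slide.

    import numpy as np

    def random_fourier_features(X, n_features=2000, sigma=1.0, seed=0):
        """phi(x) with phi(x)^T phi(x') approx exp(-||x - x'||^2 / (2 * sigma^2))."""
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        omega = rng.normal(scale=1.0 / sigma, size=(d, n_features))  # omega ~ P(omega) for the Gaussian kernel
        b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        return np.sqrt(2.0 / n_features) * np.cos(X @ omega + b)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(5, 3))
    Phi = random_fourier_features(X)
    K_exact = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2.0)  # Gaussian kernel, sigma = 1
    print(np.max(np.abs(Phi @ Phi.T - K_exact)))                       # small Monte Carlo error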

Doubly SGD: Basic Idea
  min_{f ∈ H} E_{(x,y)~P(x,y)}[l(f(x), y)] + ν‖f‖²_H
- First randomly sample (x, y) ~ P(x, y), giving the stochastic gradient
    g(·) = l'(f(x), y) k(x, ·) + ν f(·)
- Then randomly sample ω ~ P(ω), giving the doubly stochastic gradient
    ĝ(·) = l'(f(x), y) φ_ω(x) φ_ω(·) + ν f(·)
- Double sources of randomness, yet unbiased: E_{x,y,ω}[ĝ(·)] = ∇R(f)
- Observation: f_t(·) = Σ_{i=1}^t β_i k(x_i, ·)  becomes  f_t(·) = Σ_{i=1}^t α_i φ_{ω_i}(·)
- Memory significantly reduced from O(td) to O(t)
- Caveat: the iterates are no longer in H

Doubly SGD: Theoretical Complexity
Assumption: the loss function is smooth and the kernel is bounded.
Theorem. When γ_t = θ/t with θ > 0 such that θν ∈ Z₊, then for all x ∈ X,
  |f_{t+1}(x) − f*(x)|² ≤ Õ(1/t)   with high probability.
High-level proof idea: decompose the error into two terms,
  |f_{t+1}(x) − f*(x)|² ≤ 2 |f_{t+1}(x) − h_{t+1}(x)|²  (error due to random features)  + 2κ ‖h_{t+1} − f*‖²_H  (error due to random data)

Doubly SGD: Key Features

Doubly SGD (returns {α_i}_{i=1}^t)
Input: P(ω), φ_ω(x), l(f(x), y), ν.
for i = 1, ..., t do
  Sample (x_i, y_i) ~ P(x, y).
  Sample ω_i ~ P(ω) with seed i.
  f(x_i) = Predict(x_i, {α_j}_{j=1}^{i−1}).
  α_i = −γ_i l'(f(x_i), y_i) φ_{ω_i}(x_i).
  α_j = (1 − γ_i ν) α_j, for j = 1, ..., i − 1.
end for

- Simple algorithm; flexible and nonparametric
- Low memory cost: O(t) for Doubly SGD, vs. O(n²) for the kernel matrix and O(td) for vanilla SGD
- Cheap computation cost: O(td) at each iteration
- Theoretically grounded: O(1/t) convergence w.h.p.
- Strong empirical results: competes with neural nets
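
A minimal runnable sketch of this loop in Python/NumPy. The Gaussian-kernel random features, the logistic loss, the stepsize θ = 1/ν, and the regenerate-by-seed Predict helper are illustrative choices consistent with the earlier slides, not a faithful reimplementation of the paper's code.

    import numpy as np

    def phi(x, seed, sigma=1.0):
        """Random Fourier feature for the Gaussian kernel, regenerated from its seed (omega_i is never stored)."""
        frng = np.random.default_rng(seed)
        omega = frng.normal(scale=1.0 / sigma, size=x.shape[0])
        b = frng.uniform(0.0, 2.0 * np.pi)
        return np.sqrt(2.0) * np.cos(omega @ x + b)

    def predict(x, alphas):
        """f(x) = sum_j alpha_j * phi_{omega_j}(x), with omega_j recovered from seed j."""
        return sum(a * phi(x, seed=j) for j, a in enumerate(alphas))

    def doubly_sgd(sample_xy, loss_grad, nu, t):
        """Doubly SGD with gamma_i = theta / i and theta = 1/nu (so theta * nu = 1, as in the theorem)."""
        alphas = []
        for i in range(1, t + 1):
            x, y = sample_xy()                                  # sample (x_i, y_i) ~ P(x, y)
            gamma = 1.0 / (nu * i)                              # gamma_i = theta / i
            f_xi = predict(x, alphas)                           # Predict(x_i, {alpha_j}_{j < i})
            alphas = [(1.0 - gamma * nu) * a for a in alphas]   # alpha_j <- (1 - gamma_i * nu) * alpha_j
            alphas.append(-gamma * loss_grad(f_xi, y) * phi(x, seed=i - 1))  # new alpha_i
        return alphas

    # Example: binary classification with the logistic loss l(f, y) = log(1 + exp(-y f)),
    # whose derivative in f is -y / (1 + exp(y f)).
    rng = np.random.default_rng(0)

    def sample_xy():
        x = rng.uniform(-3.0, 3.0, size=2)
        return x, 1.0 if x[0] > 0 else -1.0

    alphas = doubly_sgd(sample_xy,
                        loss_grad=lambda f, y: -y / (1.0 + np.exp(y * f)),
                        nu=0.1, t=500)
    # Predictions should typically be positive / negative at these two test points.
    print(predict(np.array([2.0, 0.0]), alphas), predict(np.array([-2.0, 0.0]), alphas))

Because each ω_i can be regenerated from its seed, only the scalar coefficients {α_i} need to be stored, which is where the O(t) memory figure on the slide comes from.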

Toy Example
- Model: kernelized ridge regression
- Dataset: 2D synthetic dataset
[Figure: the optimal function surface, and the L2 distance ‖f_t − f*‖²_2 vs. number of iterations, exhibiting the 1/t rate]

Handwritten Digit Recognition
- Model: support vector machines
- Dataset: 1.6 million MNIST images of digits 6 and 8
- Input dimension: each data point is of size 784
[Figure: test error (%) vs. training time (sec) for doubly SGD, SGD, SDCA, n-SDCA, ...]

ImageNet Classification
[Figure: network architecture. Red layers are convolutions with max-pooling layers; blue layers are fully connected layers; the green layer is the output layer, a multiclass logistic regression model.]

ImageNet Classification
- Model: logistic regression
- Dataset: 1.3 million color images, 1000 classes
- Input dimension: each data point is of size 9216
[Figure: test error (%) vs. number of training samples for the jointly trained neural net, the fixed neural net, and doubly SGD]
- Platform: 16 AMD 2.4 GHz Opteron CPUs and 200 GB memory

Extending to Min-Max Saddle Point Problems
Optimizing saddle point problems over RKHSs:
  min_{f ∈ H_k} max_{g ∈ G_k} E_{x,y}[ f(x) g(x) − l(g(x), y) ] + (ν_1/2)‖f‖²_H − (ν_2/2)‖g‖²_G
The doubly SGD trick still applies. Under mild conditions (smooth loss and kernels), we have
  E[ |f_t(x) − f*(x)|² + |g_t(x) − g*(x)|² ] ≤ Õ(1/t)
This has recently been applied to solve reinforcement learning problems.

Further Implication
Solving two-level stochastic optimization problems:
  min_{θ ∈ Θ} E_ξ[ f_ξ( E_η[ G_η(θ, ξ) ] ) ]
The algorithm and analysis can be easily extended to address general stochastic problems involving two levels of expectations.

for i = 1, ..., t do
  Sample (ξ_i, η_i) ~ P(ξ, η).
  ĝ = (1/i) Σ_{j=1}^i G_{η_j}(θ_i, ξ_i)
  θ_{i+1} = P_{θ_i}( γ_i ∇G_{η_i}(θ_i, ξ_i)^T ∇f_{ξ_i}(ĝ) )
end for

When ξ ⊥ η, f is Lipschitz smooth, G is Lipschitz continuous, and the overall function is strongly convex, we obtain the optimal O(1/t) rate of convergence.
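
A minimal sketch of this two-level scheme on a toy instance; the linear inner map G, the quadratic outer function f, the unconstrained Euclidean prox step, and the damped 1/i stepsizes are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 5
    theta_star = rng.normal(size=d)

    def sample_xi():
        """xi = (a, b): a random linear measurement a and its target b = a^T theta_star."""
        a = rng.normal(size=d)
        return a, a @ theta_star

    def sample_eta():
        """eta: zero-mean noise entering the inner map G_eta(theta, xi) = a^T theta + eta."""
        return rng.normal(scale=0.5)

    # Outer function f_xi(u) = 0.5 * (u - b)^2, so grad f_xi(u) = u - b and grad_theta G = a.
    theta = np.zeros(d)
    eta_history = []                 # past eta samples, reused to average out the inner randomness
    t = 5000
    for i in range(1, t + 1):
        (a, b), eta = sample_xi(), sample_eta()
        eta_history.append(eta)
        g_hat = a @ theta + np.mean(eta_history)   # (1/i) * sum_j G_{eta_j}(theta_i, xi_i)
        grad = a * (g_hat - b)                     # grad G^T * grad f_xi(g_hat)
        theta = theta - grad / (i + 10.0)          # unconstrained Euclidean step, gamma_i ~ 1/i (damped)

    print(np.linalg.norm(theta - theta_star))      # typically small after t iterations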

Summary
- Optimization lies at the heart of Big Data analytics.
- Stochastic gradient descent is powerful, but has limitations.
- Simple optimization techniques allow us to learn bigger and faster.