Presentation in Convex Optimization


Dec 22, 2014

Introduction

Sample size selection in optimization methods for machine learning

Main results: the paper presents a methodology for using varying sample sizes in batch-type optimization methods for large-scale machine learning problems: dynamic sample selection in the evaluation of the function and gradient, and a practical Newton method that uses a smaller sample to compute Hessian-vector products.

The dynamic sample size gradient method

Problem: determine the values of the parameters $\omega \in \mathbb{R}^m$ of a prediction function $f(\omega; x)$, where we assume the linear form

$f(\omega; x) = \omega^T x$   (1)

Common approach: minimize the empirical loss function

$J(\omega) = \frac{1}{N} \sum_{i=1}^{N} l(f(\omega; x_i), y_i)$   (2)

where $l(\hat{y}, y)$ is a convex loss function.
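To make (1) and (2) concrete, here is a minimal NumPy sketch that evaluates the empirical loss and its gradient for the linear prediction function. The squared loss is an illustrative assumption (the slides only require $l$ to be convex), and all function names and the toy data are hypothetical.

```python
import numpy as np

def predict(w, X):
    # Linear prediction f(w; x) = w^T x of (1), applied row-wise to X.
    return X @ w

def empirical_loss_and_grad(w, X, y):
    """Empirical loss J(w) from (2) and its gradient, assuming the
    squared loss l(yhat, y) = 0.5 * (yhat - y)^2 for illustration."""
    N = X.shape[0]
    residual = predict(w, X) - y           # yhat_i - y_i for every example
    J = 0.5 * np.mean(residual ** 2)       # J(w) = (1/N) sum_i l(f(w; x_i), y_i)
    grad = X.T @ residual / N              # gradient of J(w)
    return J, grad

# Toy usage on random data
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))
y = X @ rng.standard_normal(20) + 0.1 * rng.standard_normal(1000)
J, g = empirical_loss_and_grad(np.zeros(20), X, y)
```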

The dynamic sample size gradient method

Assumption: the size $N$ of the data set is extremely large, on the order of millions or billions, so that evaluating $J(\omega)$ is very expensive.

A gradient-based mini-batch optimization algorithm: at every iteration, choose a subset $S \subset \{1, 2, \dots, N\}$ of the training set and apply one step of an optimization algorithm to the objective function

$J_S(\omega) = \frac{1}{|S|} \sum_{i \in S} l(f(\omega; x_i), y_i)$   (3)
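A minimal sketch of one iteration of such a mini-batch method, under the same squared-loss assumption as above: draw a subset $S$, evaluate the gradient of $J_S$ from (3), and take a plain gradient step. The batch size and step length are arbitrary illustrative values, not choices from the paper.

```python
import numpy as np

def subset_grad(w, X, y, S):
    # Gradient of J_S(w) from (3): average per-example loss gradient over S
    # (squared loss assumed, as above).
    Xs, ys = X[S], y[S]
    return Xs.T @ (Xs @ w - ys) / len(S)

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 20))
y = X @ rng.standard_normal(20)
w = np.zeros(20)

batch_size, step = 64, 0.1                      # illustrative choices only
S = rng.choice(len(y), size=batch_size, replace=False)
w -= step * subset_grad(w, X, y, S)             # one mini-batch gradient step
```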

The dynamic sample size gradient method

Measure of the quality of the sample $S$: the variance in the gradient $\nabla J_S$.

It is easy to verify that the vector $d = -\nabla J_S(\omega)$ is a descent direction for $J$ at $\omega$ if

$\delta_S(\omega) \equiv \|\nabla J_S(\omega) - \nabla J(\omega)\|_2 \le \theta \, \|\nabla J_S(\omega)\|_2$   (4)

where $\theta \in [0, 1)$. Note that

$E[\delta_S(\omega)^2] = E[\|\nabla J_S(\omega) - \nabla J(\omega)\|_2^2] = \|\mathrm{Var}(\nabla J_S(\omega))\|_1$   (5)

By simple calculations, we have

$\mathrm{Var}(\nabla J_S(\omega)) = \frac{\mathrm{Var}_{i \in S}(\nabla l(\omega; i))}{|S|} \cdot \frac{N - |S|}{N - 1}$   (6)

where $\frac{N - |S|}{N - 1} \approx 1$. Then we can rewrite the condition in the following form:

$|S| \ge \frac{\|\mathrm{Var}_{i \in S}(\nabla l(\omega; i))\|_1}{\theta^2 \, \|\nabla J_S(\omega)\|_2^2}$   (7)

This is also the criterion that we use to determine the dynamic sample size.
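The sample-size test can be sketched as follows, again assuming the squared loss: form the per-example gradients over $S$, estimate their component-wise sample variance as in (6), and check the criterion (7), returning the sample size it suggests when the test fails. Function names, the choice of $\theta$, and the small tolerance guard are illustrative.

```python
import numpy as np

def sample_size_test(w, X, y, S, theta=0.5):
    """Check condition (7); return (passed, suggested_size).

    Assumes the squared loss, so the per-example gradient of l is
    (x_i^T w - y_i) * x_i.
    """
    Xs, ys = X[S], y[S]
    residual = Xs @ w - ys
    per_example_grads = Xs * residual[:, None]          # one row per i in S
    grad_S = per_example_grads.mean(axis=0)             # gradient of J_S(w)
    # 1-norm of the component-wise sample variance, as in (6)
    var_1norm = per_example_grads.var(axis=0, ddof=1).sum()
    lhs = var_1norm / len(S)
    rhs = theta ** 2 * np.dot(grad_S, grad_S)
    suggested = int(np.ceil(var_1norm / max(rhs, 1e-12)))   # right-hand side of (7)
    return lhs <= rhs, suggested

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 20))
y = X @ rng.standard_normal(20) + 0.5 * rng.standard_normal(10_000)
S = rng.choice(10_000, size=100, replace=False)
ok, new_size = sample_size_test(np.zeros(20), X, y, S)
# If the test fails, the sample is enlarged so that |S| satisfies (7)
# at the next iteration.
```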

A Newton-CG method with dynamic sampling

At each iteration, the subsampled Newton-CG method chooses samples $S_k$ and $H_k$ such that $H_k \subset S_k$, and defines the search direction $d_k$ as an approximate solution of the linear system

$\nabla^2 J_{H_k}(\omega_k) \, d = -\nabla J_{S_k}(\omega_k)$   (8)

We now turn to an automatic criterion for deciding the accuracy of the solution of (8).
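A minimal sketch of solving the subsampled system (8) with conjugate gradients, under the same squared-loss assumption: the Hessian of $J_{H_k}$ is $X_{H}^T X_{H} / |H_k|$, so Hessian-vector products can be formed without ever building the matrix. Using scipy.sparse.linalg.cg with a fixed iteration cap is an illustrative shortcut; the adaptive stopping rule of the following slides is sketched further below. All sample sizes and names are hypothetical.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 20))
y = X @ rng.standard_normal(20)
w = np.zeros(20)

# S_k: larger sample for the gradient; H_k: smaller sample for the Hessian.
S = rng.choice(10_000, size=2000, replace=False)
H = rng.choice(S, size=200, replace=False)

grad_S = X[S].T @ (X[S] @ w - y[S]) / len(S)      # right-hand side of (8)

def hessian_vec(v):
    # Hessian-vector product of J_{H_k} for the squared loss:
    # (1/|H|) X_H^T (X_H v); no explicit Hessian is formed.
    return X[H].T @ (X[H] @ v) / len(H)

A = LinearOperator((20, 20), matvec=hessian_vec)
d, info = cg(A, -grad_S, maxiter=10)               # approximate solution of (8)
```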

A Newton-CG method with dynamic sampling

Define the residual of the subsampled system:

$r_k \equiv \nabla^2 J_{H_k}(\omega_k) \, d + \nabla J_{S_k}(\omega_k)$   (9)

Then we can write the residual of the standard Newton iteration as

$\nabla^2 J_{S_k}(\omega_k) \, d + \nabla J_{S_k}(\omega_k) = r_k + [\nabla^2 J_{S_k}(\omega_k) - \nabla^2 J_{H_k}(\omega_k)] \, d$   (10)

If we define

$\Delta_{H_k}(\omega_k; d) \equiv [\nabla^2 J_{S_k}(\omega_k) - \nabla^2 J_{H_k}(\omega_k)] \, d$   (11)

then we can make the approximation

$E[\|\Delta_{H_k}(\omega_k; d)\|^2] \approx \frac{\|\mathrm{Var}_{i \in H_k}(\nabla^2 l(\omega_k; i) \, d)\|_1}{|H_k|}$   (12)

A Newton-CG method with dynamic sampling

To avoid recomputing the variance at every CG iteration, we initialize the CG iteration at the zero vector. From (9), the initial CG search direction is then given by $p_0 = -r_0 = -\nabla J_{S_k}(\omega_k)$. We compute (12) at the beginning of the CG iteration, for $d = p_0$.

The stopping test for the $(j+1)$-st CG iteration is then set as

$\|r_{j+1}\|_2^2 \le \Psi \equiv \left( \frac{\|\mathrm{Var}_{i \in H_k}(\nabla^2 l(\omega_k; i) \, p_0)\|_1}{|H_k|} \right) \frac{\|d_j\|_2^2}{\|p_0\|_2^2}$   (13)

where $d_j$ is the $j$-th trial candidate for the solution of (8) generated by the CG process, and the last ratio accounts for the length of the CG solution.
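Below is a sketch of a CG loop that starts from the zero vector and stops via (13): the variance term of (12) is computed once for $d = p_0$ and rescaled by $\|d_j\|_2^2 / \|p_0\|_2^2$ at every iteration. As in the earlier sketches, the squared-loss Hessian-vector products, the sample sizes, and all names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def newton_cg_direction(w, X, y, S, H, max_iter=10):
    """Approximately solve (8) with CG, stopping via the test (13).
    Assumes the squared loss, so the per-example Hessian-vector product
    is x_i (x_i^T v)."""
    grad_S = X[S].T @ (X[S] @ w - y[S]) / len(S)
    XH = X[H]

    def hvp(v):                                # Hessian of J_{H_k} times v
        return XH.T @ (XH @ v) / len(H)

    d = np.zeros_like(w)                       # start CG at the zero vector
    r = grad_S.copy()                          # r_0 = grad J_{S_k}, see (9)
    p = -r                                     # p_0 = -r_0

    # Variance term of (12), computed once for d = p_0:
    per_example_hvp = XH * (XH @ p)[:, None]   # rows: Hessian_i(w) p_0
    var_term = per_example_hvp.var(axis=0, ddof=1).sum() / len(H)
    p0_norm2 = p @ p

    for _ in range(max_iter):
        Ap = hvp(p)
        alpha = (r @ r) / (p @ Ap)
        d = d + alpha * p
        r_new = r + alpha * Ap
        psi = var_term * (d @ d) / p0_norm2    # right-hand side of (13)
        if r_new @ r_new <= psi:               # stop test ||r_{j+1}||^2 <= Psi
            break
        beta = (r_new @ r_new) / (r @ r)
        p = -r_new + beta * p
        r = r_new
    return d

# Toy usage on random data
rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 20))
y = X @ rng.standard_normal(20)
S = rng.choice(10_000, size=2000, replace=False)
H = rng.choice(S, size=200, replace=False)
d_k = newton_cg_direction(np.zeros(20), X, y, S, H)
```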