Presentation in Convex Optimization


Dec 22, 2014

Introduction

Sample size selection in optimization methods for machine learning

Main results: the paper presents a methodology for using varying sample sizes in batch-type optimization methods for large-scale machine learning problems: dynamic sample selection in the evaluation of the function and gradient, and a practical Newton method that uses a smaller sample to compute Hessian-vector products.

The dynamic sample size gradient method

Problem: determine the values of the parameters $\omega \in \mathbb{R}^m$ of a prediction function $f(\omega; x)$, where we assume the linear form

$f(\omega; x) = \omega^T x$   (1)

Common approach: minimize the empirical loss function

$J(\omega) = \frac{1}{N} \sum_{i=1}^{N} l(f(\omega; x_i), y_i)$   (2)

where $l(\hat{y}, y)$ is a convex loss function.
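To make (1) and (2) concrete, here is a minimal NumPy sketch that evaluates the empirical loss and its gradient for the linear prediction function. The squared loss is an illustrative assumption (the slides only require $l$ to be convex), and all function names and the toy data are hypothetical.

```python
import numpy as np

def predict(w, X):
    # Linear prediction f(w; x) = w^T x of (1), applied row-wise to X.
    return X @ w

def empirical_loss_and_grad(w, X, y):
    """Empirical loss J(w) from (2) and its gradient, assuming the
    squared loss l(yhat, y) = 0.5 * (yhat - y)^2 for illustration."""
    N = X.shape[0]
    residual = predict(w, X) - y           # yhat_i - y_i for every example
    J = 0.5 * np.mean(residual ** 2)       # J(w) = (1/N) sum_i l(f(w; x_i), y_i)
    grad = X.T @ residual / N              # gradient of J(w)
    return J, grad

# Toy usage on random data
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))
y = X @ rng.standard_normal(20) + 0.1 * rng.standard_normal(1000)
J, g = empirical_loss_and_grad(np.zeros(20), X, y)
```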

The dynamic sample size gradient method

Assumption: the size $N$ of the data set is extremely large, on the order of millions or billions, so that evaluating $J(\omega)$ is very expensive.

A gradient-based mini-batch optimization algorithm: at every iteration, choose a subset $S \subset \{1, 2, \dots, N\}$ of the training set and apply one step of an optimization algorithm to the objective function

$J_S(\omega) = \frac{1}{|S|} \sum_{i \in S} l(f(\omega; x_i), y_i)$   (3)
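A minimal sketch of one iteration of such a mini-batch method, under the same squared-loss assumption as above: draw a subset $S$, evaluate the gradient of $J_S$ from (3), and take a plain gradient step. The batch size and step length are arbitrary illustrative values, not choices from the paper.

```python
import numpy as np

def subset_grad(w, X, y, S):
    # Gradient of J_S(w) from (3): average per-example loss gradient over S
    # (squared loss assumed, as above).
    Xs, ys = X[S], y[S]
    return Xs.T @ (Xs @ w - ys) / len(S)

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 20))
y = X @ rng.standard_normal(20)
w = np.zeros(20)

batch_size, step = 64, 0.1                      # illustrative choices only
S = rng.choice(len(y), size=batch_size, replace=False)
w -= step * subset_grad(w, X, y, S)             # one mini-batch gradient step
```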

The dynamic sample size gradient method

Measure of the quality of the sample $S$: the variance in the gradient $\nabla J_S$.

It is easy to verify that the vector $d = -\nabla J_S(\omega)$ is a descent direction for $J$ at $\omega$ if

$\delta_S(\omega) \equiv \|\nabla J_S(\omega) - \nabla J(\omega)\|_2 \le \theta \, \|\nabla J_S(\omega)\|_2$   (4)

where $\theta \in [0, 1)$. Note that

$E[\delta_S(\omega)^2] = E[\|\nabla J_S(\omega) - \nabla J(\omega)\|_2^2] = \|\mathrm{Var}(\nabla J_S(\omega))\|_1$   (5)

By simple calculations, we have

$\mathrm{Var}(\nabla J_S(\omega)) = \frac{\mathrm{Var}_{i \in S}(\nabla l(\omega; i))}{|S|} \cdot \frac{N - |S|}{N - 1}$   (6)

where $\frac{N - |S|}{N - 1} \approx 1$. Then we can rewrite the condition in the following form:

$|S| \ge \frac{\|\mathrm{Var}_{i \in S}(\nabla l(\omega; i))\|_1}{\theta^2 \, \|\nabla J_S(\omega)\|_2^2}$   (7)

This is also the criterion that we use to determine the dynamic sample size.
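The sample-size test can be sketched as follows, again assuming the squared loss: form the per-example gradients over $S$, estimate their component-wise sample variance as in (6), and check the criterion (7), returning the sample size it suggests when the test fails. Function names, the choice of $\theta$, and the small tolerance guard are illustrative.

```python
import numpy as np

def sample_size_test(w, X, y, S, theta=0.5):
    """Check condition (7); return (passed, suggested_size).

    Assumes the squared loss, so the per-example gradient of l is
    (x_i^T w - y_i) * x_i.
    """
    Xs, ys = X[S], y[S]
    residual = Xs @ w - ys
    per_example_grads = Xs * residual[:, None]          # one row per i in S
    grad_S = per_example_grads.mean(axis=0)             # gradient of J_S(w)
    # 1-norm of the component-wise sample variance, as in (6)
    var_1norm = per_example_grads.var(axis=0, ddof=1).sum()
    lhs = var_1norm / len(S)
    rhs = theta ** 2 * np.dot(grad_S, grad_S)
    suggested = int(np.ceil(var_1norm / max(rhs, 1e-12)))   # right-hand side of (7)
    return lhs <= rhs, suggested

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 20))
y = X @ rng.standard_normal(20) + 0.5 * rng.standard_normal(10_000)
S = rng.choice(10_000, size=100, replace=False)
ok, new_size = sample_size_test(np.zeros(20), X, y, S)
# If the test fails, the sample is enlarged so that |S| satisfies (7)
# at the next iteration.
```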

A Newton-CG method with dynamic sampling

At each iteration, the subsampled Newton-CG method chooses samples $S_k$ and $H_k$ such that $H_k \subset S_k$, and defines the search direction $d_k$ as an approximate solution of the linear system

$\nabla^2 J_{H_k}(\omega_k) \, d = -\nabla J_{S_k}(\omega_k)$   (8)

We now turn to an automatic criterion for deciding the accuracy of the solution of (8).
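A minimal sketch of solving the subsampled system (8) with conjugate gradients, under the same squared-loss assumption: the Hessian of $J_{H_k}$ is $X_{H}^T X_{H} / |H_k|$, so Hessian-vector products can be formed without ever building the matrix. Using scipy.sparse.linalg.cg with a fixed iteration cap is an illustrative shortcut; the adaptive stopping rule of the following slides is sketched further below. All sample sizes and names are hypothetical.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 20))
y = X @ rng.standard_normal(20)
w = np.zeros(20)

# S_k: larger sample for the gradient; H_k: smaller sample for the Hessian.
S = rng.choice(10_000, size=2000, replace=False)
H = rng.choice(S, size=200, replace=False)

grad_S = X[S].T @ (X[S] @ w - y[S]) / len(S)      # right-hand side of (8)

def hessian_vec(v):
    # Hessian-vector product of J_{H_k} for the squared loss:
    # (1/|H|) X_H^T (X_H v); no explicit Hessian is formed.
    return X[H].T @ (X[H] @ v) / len(H)

A = LinearOperator((20, 20), matvec=hessian_vec)
d, info = cg(A, -grad_S, maxiter=10)               # approximate solution of (8)
```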

A Newton-CG method with dynamic sampling

Define the residual of the subsampled system:

$r_k \equiv \nabla^2 J_{H_k}(\omega_k) \, d + \nabla J_{S_k}(\omega_k)$   (9)

Then we can write the residual of the standard Newton iteration as

$\nabla^2 J_{S_k}(\omega_k) \, d + \nabla J_{S_k}(\omega_k) = r_k + [\nabla^2 J_{S_k}(\omega_k) - \nabla^2 J_{H_k}(\omega_k)] \, d$   (10)

If we define

$\Delta_{H_k}(\omega_k; d) \equiv [\nabla^2 J_{S_k}(\omega_k) - \nabla^2 J_{H_k}(\omega_k)] \, d$   (11)

then we can make the approximation

$E[\|\Delta_{H_k}(\omega_k; d)\|^2] \approx \frac{\|\mathrm{Var}_{i \in H_k}(\nabla^2 l(\omega_k; i) \, d)\|_1}{|H_k|}$   (12)

A Newton-CG method with dynamic sampling

To avoid recomputing the variance at every CG iteration, we initialize the CG iteration at the zero vector. From (9), the initial CG search direction is then given by $p_0 = -r_0 = -\nabla J_{S_k}(\omega_k)$. We compute (12) at the beginning of the CG iteration, for $d = p_0$.

The stopping test for the $(j+1)$-st CG iteration is then set as

$\|r_{j+1}\|_2^2 \le \Psi \equiv \left( \frac{\|\mathrm{Var}_{i \in H_k}(\nabla^2 l(\omega_k; i) \, p_0)\|_1}{|H_k|} \right) \frac{\|d_j\|_2^2}{\|p_0\|_2^2}$   (13)

where $d_j$ is the $j$-th trial candidate for the solution of (8) generated by the CG process, and the last ratio accounts for the length of the CG solution.
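Below is a sketch of a CG loop that starts from the zero vector and stops via (13): the variance term of (12) is computed once for $d = p_0$ and rescaled by $\|d_j\|_2^2 / \|p_0\|_2^2$ at every iteration. As in the earlier sketches, the squared-loss Hessian-vector products, the sample sizes, and all names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def newton_cg_direction(w, X, y, S, H, max_iter=10):
    """Approximately solve (8) with CG, stopping via the test (13).
    Assumes the squared loss, so the per-example Hessian-vector product
    is x_i (x_i^T v)."""
    grad_S = X[S].T @ (X[S] @ w - y[S]) / len(S)
    XH = X[H]

    def hvp(v):                                # Hessian of J_{H_k} times v
        return XH.T @ (XH @ v) / len(H)

    d = np.zeros_like(w)                       # start CG at the zero vector
    r = grad_S.copy()                          # r_0 = grad J_{S_k}, see (9)
    p = -r                                     # p_0 = -r_0

    # Variance term of (12), computed once for d = p_0:
    per_example_hvp = XH * (XH @ p)[:, None]   # rows: Hessian_i(w) p_0
    var_term = per_example_hvp.var(axis=0, ddof=1).sum() / len(H)
    p0_norm2 = p @ p

    for _ in range(max_iter):
        Ap = hvp(p)
        alpha = (r @ r) / (p @ Ap)
        d = d + alpha * p
        r_new = r + alpha * Ap
        psi = var_term * (d @ d) / p0_norm2    # right-hand side of (13)
        if r_new @ r_new <= psi:               # stop test ||r_{j+1}||^2 <= Psi
            break
        beta = (r_new @ r_new) / (r @ r)
        p = -r_new + beta * p
        r = r_new
    return d

# Toy usage on random data
rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 20))
y = X @ rng.standard_normal(20)
S = rng.choice(10_000, size=2000, replace=False)
H = rng.choice(S, size=200, replace=False)
d_k = newton_cg_direction(np.zeros(20), X, y, S, H)
```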