An Algorithm for Unconstrained Quadratically Penalized Convex Optimization (post conference version)


7/28/10, UseR! The R User Conference 2010
An Algorithm for Unconstrained Quadratically Penalized Convex Optimization (post conference version)
Steven P. Ellis, New York State Psychiatric Institute at Columbia University

PROBLEM Minimize functions of the form F(h) = V(h) + Q(h), h ∈ R^d.
1. (R = reals; d = positive integer.)
2. V is non-negative and convex.
3. V is computationally expensive.
4. Q is known, strictly convex, and quadratic.
5. (Unconstrained optimization problem.)
6. The gradient, but not necessarily the Hessian, is available.

NONPARAMETRIC FUNCTION ESTIMATION Need to minimize F(h) = V(h) + λ h^T Q h, where λ > 0 is the complexity parameter.

WAR STORY Work on a kernel-based survival analysis algorithm led me to this optimization problem. At first I used BFGS (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970), but it was very slow. Once I waited 19 hours for it to converge! Finding no software for unconstrained convex optimization (see below), I invented my own.

SOFTWARE FOR UNCONSTRAINED CONVEX OPTIMIZATION I didn't find such software. CVX ( http://cvxr.com/cvx/ ) is a Matlab-based modeling system for convex optimization, but a developer, Michael Grant, says that CVX wasn't designed for problems such as my survival analysis problem.

QQMM I developed the algorithm QQMM ("quasi-quadratic minimization with memory"; Q^2M^2) to solve problems of this type. Implemented in R. Posted on STATLIB.

ITERATIVE DESCENT METHOD An iteration: if h1 ∈ R^d has the smallest F value found so far, compute one or more trial minimizers h2 until sufficient decrease is achieved. Assign h1 ← h2 to finish the iteration. Repeat until the evaluation limit is exceeded or the stopping criteria are met.
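The iteration above can be sketched as a generic loop. This is a minimal Python sketch, not the author's R implementation; `propose_trial` is a hypothetical placeholder for Q^2M^2's trial-minimizer step.

```python
# Sketch of the generic iterative-descent loop described above.
# `propose_trial` stands in for Q^2M^2's trial-minimizer step and is
# a hypothetical placeholder, not the talk's actual method.

def descent_loop(F, h0, propose_trial, max_evals=100, decrease=1e-8):
    """Keep the best point found; replace it whenever a trial point
    achieves sufficient decrease in F."""
    h1, f1 = h0, F(h0)
    evals = 1
    while evals < max_evals:
        h2 = propose_trial(h1)          # one trial minimizer
        f2 = F(h2)
        evals += 1
        if f1 - f2 >= decrease:         # sufficient decrease: accept
            h1, f1 = h2, f2
        else:                           # otherwise stop (stands in for
            break                       # backtracking / stopping criteria)
    return h1, f1
```

For instance, with F(x) = x^2 and a trial step that halves x, the loop walks the iterate toward 0 until the decrease threshold is hit.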

[Plot: trial points numbered by evaluation in the (h1, h2) plane, alongside F values and lower bounds versus evaluation number.]

CONVEX GLOBAL UNDERESTIMATORS For h ∈ R^d, define a "quasi-quadratic" function: q_h(g) = max{ V(h) + ∇V(h)^T (g - h), 0 } + Q(g), for g ∈ R^d.


q_h is a convex global underestimator of F: q_h ≤ F. A possible trial minimizer of F is the point h2 where q_h attains its minimum, but that doesn't work very well.
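A quick 1-D illustration of the underestimator property, with toy choices V(h) = |h| and Q(h) = h^2 standing in for the talk's expensive V and known penalty Q:

```python
import numpy as np

# 1-D illustration of the quasi-quadratic underestimator
# q_h(g) = max{ V(h) + V'(h)(g - h), 0 } + Q(g).
# V and Q below are toy choices (V convex and non-negative,
# Q strictly convex quadratic), not from the talk.

V  = lambda h: abs(h)          # convex, non-negative, "expensive" stand-in
dV = lambda h: np.sign(h)      # a subgradient of V
Q  = lambda h: h * h           # known strictly convex quadratic
F  = lambda h: V(h) + Q(h)

def q(h, g):
    """Convex global underestimator of F anchored at h."""
    return np.maximum(V(h) + dV(h) * (g - h), 0.0) + Q(g)

grid = np.linspace(-2.0, 2.0, 401)
for h in (-1.5, -0.3, 0.7):
    assert np.all(q(h, grid) <= F(grid) + 1e-12)   # q_h <= F everywhere
    assert np.isclose(q(h, h), F(h))               # and q_h touches F at h
```

The two assertions are exactly why q_h is useful: it never overestimates F, yet agrees with F at the anchor point h.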

L.U.B.'S If h^(1), ..., h^(n) ∈ R^d are the points visited by the algorithm so far, the least upper bound (l.u.b.) of q_{h^(1)}, ..., q_{h^(n)} is their pointwise maximum: F_n(h) = max{ q_{h^(1)}(h), ..., q_{h^(n)}(h) }. F_n is also a convex global underestimator of F, no smaller than any q_{h^(i)}. The point h2 where F_n attains its minimum is probably a good trial minimizer. But minimizing F_n may be at least as hard as minimizing F!

As a compromise, proceed as follows. Let h1 = h^(n) be the best trial minimizer found so far and let h^(1), ..., h^(n) ∈ R^d be the points visited by the algorithm so far. For i = 1, 2, ..., n - 1, let q_{h^(i),h1} be the l.u.b. of q_{h^(i)} and q_{h1}. Each such "double" q is a convex global underestimator of F and is easy to minimize in closed form.

Let j be the index in {1, 2, ..., n - 1} such that the minimum value of q_{h^(j),h1} is largest, i.e., no smaller than the minimum value of any q_{h^(i),h1} (i = 1, ..., n - 1). So q_{h^(j),h1} has a maximin property. Let h2 be the vector at which q_{h^(j),h1} attains its minimum. (The actual method is slightly more careful than this.) If h2 isn't better than the current position h1, backtrack.
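The maximin selection rule can be sketched as follows. The pairwise minima are supplied directly here as made-up numbers; in Q^2M^2 they would come from the closed-form minimization of each q_{h^(i),h1}.

```python
# Maximin selection of the trial point, as described above.
# Each entry of `pair_minima` is a hypothetical (min_value, argmin)
# for one pairwise l.u.b. q_{h^(i),h1}; the selection rule itself
# is the point of this sketch.

def select_trial(pair_minima):
    """pair_minima: list of (min_value, argmin) for i = 1..n-1.
    Return (argmin, min_value) for the entry whose minimum is largest:
    that minimum is the tightest available lower bound on min F, so
    its location is the most informative trial point."""
    j = max(range(len(pair_minima)), key=lambda i: pair_minima[i][0])
    return pair_minima[j][1], pair_minima[j][0]

# Toy usage: three pairwise underestimators with known minima.
h2, lower_bound = select_trial([(0.3, -1.0), (0.9, 0.2), (0.5, 1.4)])
# h2 == 0.2, lower_bound == 0.9
```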

Minimizing q_{h^(i),h1} requires matrix operations. This limits the size of problems for which Q^2M^2 can be used to no more than, say, 4 or 5 thousand variables.

STOPPING RULE Trial values h2 are minima of non-negative convex global underestimators of F, so the values of these underestimators at the corresponding h2's are lower bounds on min F. Store the cumulative maximum of these lower bounds, and let L denote its current value. L is a lower bound on min F.

If h1 is the current best trial minimizer, the relative difference between F(h1) and L exceeds the relative difference between F(h1) and min F:

(F(h1) - L) / L ≥ (F(h1) - min F) / min F.

I.e., we can explicitly bound the relative error in F(h1) as an estimate of min F!

Choose a small ε > 0 (I often take ε = 0.01). When the upper bound on the relative error first falls below the threshold ε, STOP. Call this "convergence." Upon convergence you're guaranteed to be within relative error ε of the bottom.
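The stopping rule can be sketched as a small check; the numeric values in the usage lines are arbitrary illustrations, not from the talk.

```python
# Sketch of the Q^2M^2-style stopping rule described above:
# stop once the certified relative-error bound (F(h1) - L)/L
# falls below a chosen epsilon. L is the running maximum of the
# lower bounds produced by successive underestimator minima.

def converged(F_h1, lower_bounds, eps=0.01):
    """lower_bounds: lower bounds on min F collected so far."""
    L = max(lower_bounds)          # cumulative maximum, L <= min F
    return (F_h1 - L) / L <= eps   # guaranteed relative error <= eps

# Toy usage: the bound tightens as the run proceeds.
assert not converged(5.0, [3.0, 3.5])      # gap (5 - 3.5)/3.5 ~ 43%
assert converged(5.0, [3.0, 3.5, 4.96])    # gap (5 - 4.96)/4.96 < 1%
```

Because L never exceeds min F, passing this test really does certify that F(h1) is within relative error ε of min F, which is what makes the stopping rule trustworthy.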

[Plot: trial points numbered by evaluation in the (h1, h2) plane, alongside F values and lower bounds versus evaluation number.]

Gives good control over stopping. That is important because...

STOPPING EARLY MAKES SENSE IN STATISTICAL ESTIMATION In statistical estimation, the function F(h) = V(h) + Q(h) depends, through V, on noisy data. So in statistical estimation there's no point in taking time to achieve great accuracy in optimization.

Q^2M^2 IS SOMETIMES SLOW Q^2M^2 tends to close in on the minimum rapidly, but it is sometimes very slow to converge, e.g., when Q is nearly singular or when the complexity parameter λ is small. The distribution of the number of evaluations needed for convergence has a long right-hand tail as you vary over optimization problems.

SIMULATIONS: PHILOSOPHY If F is computationally expensive, then simulations are unworkable: short computation times for the optimizations are needed. But when F is computationally expensive, computation time is roughly proportional to the number of function evaluations. So simulate with computationally cheap F's, and track the number of evaluations, not computation time.

COMPARE Q^2M^2 AND BFGS Why BFGS? BFGS is widely used (a default method). Like Q^2M^2, BFGS uses the gradient and employs vector-matrix operations. An optimization maven at NYU (Michael Overton) suggested it as the comparator!

(Specifically, the BFGS option of the R function optim was used. John C. Nash (personal communication) pointed out at the conference that other algorithms called "BFGS" are faster than the BFGS in optim.)

SIMULATION STRUCTURE I chose several relevant estimation problems. For each of a variety of choices of the complexity parameter λ, use both Q^2M^2 and BFGS to fit the model to a randomly generated training sample and test data: either simulated data, or real data randomly split into two subsamples. Repeat 1000 times for each choice of λ. Gather the numbers of evaluations required and other statistics describing the simulations.

Use Gibbons, Olkin, and Sobel (1999), Selecting and Ordering Populations: A New Statistical Methodology, to select the range of λ values that, with 95% confidence, contains the λ with the lowest mean test error. It is conservative to use the largest λ in the selected group.

SIMULATIONS: SUMMARY L^{3/2} kernel-based regression: for the largest selected λ values, BFGS required 3 times as many evaluations as Q^2M^2. Penalized logistic regression (Wisconsin Breast Cancer data, University of California, Irvine Machine Learning Repository): for the largest selected λ values, BFGS required 2.7 times as many evaluations as Q^2M^2.

Penalized logistic regression (R's Titanic data): for the largest selected λ values, Q^2M^2 required nearly twice as many evaluations as BFGS. I.e., this time BFGS was better. Hardly any penalization was required: the selected λ's were small.

SURVIVAL ANALYSIS AGAIN On a real data set with a good choice of λ, Q^2M^2 optimized my survival analysis penalized risk function in 23 minutes. BFGS took 3.4 times longer with an oracle telling it when to stop, and 6.5 times longer without the oracle. ("Oracle" means using information from the no-oracle BFGS run and the Q^2M^2 run to select the number of iterations at which BFGS achieves the same level of accuracy as Q^2M^2.)

CONCLUSIONS QQMM (Q^2M^2) is an algorithm for minimizing convex functions of the form F(h) = V(h) + Q(h), h ∈ R^d, where V is convex and non-negative and Q is known, quadratic, and strictly convex. Q^2M^2 is especially appropriate when V is expensive to compute. It allows good control of stopping.

It needs the (sub)gradient and utilizes matrix algebra, which limits the maximum size of problems to no more than 4 or 5 thousand variables. Q^2M^2 is often quite fast, but can be slow if Q is nearly singular.