Selected Topics in Optimization
Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/

Overview
Optimization problems are almost everywhere in statistics and machine learning.

[Diagram: an idea/model is posed as an optimization problem $\min_x f(x)$; its solution $x$ gives the inference/fitted model mapping input to output.]

Example
In a regression model, we want the model to minimize the deviation of its predictions from the dependent variable. In a classification model, we want the model to minimize the classification error. In a generative model, we want to maximize the likelihood of producing the observed data.

Gradient descent
Consider unconstrained, smooth convex optimization

$$\min_x f(x)$$

i.e., $f$ is convex and differentiable with $\mathrm{dom}(f) = \mathbb{R}^n$. Denote the optimal criterion value by $f^\star = \min_x f(x)$, and a solution by $x^\star$.

Gradient descent: choose an initial point $x^{(0)} \in \mathbb{R}^n$ and repeat:

$$x^{(k)} = x^{(k-1)} - t_k \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$

Stop at some point.
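A minimal sketch of this update in Python (the function names, fixed step size, and gradient-norm stopping test are our own illustrative choices, not from the slides):

    import numpy as np

    def gradient_descent(grad_f, x0, t=0.1, max_iter=1000, tol=1e-8):
        # Repeat x = x - t * grad_f(x), stopping once the gradient is small.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad_f(x)
            if np.linalg.norm(g) < tol:
                break
            x = x - t * g
        return x

For example, gradient_descent(lambda x: 2 * x, x0=[5.0]) converges to 0, the minimizer of $f(x) = x^2$.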

Gradient descent interpretation
At each iteration, consider the expansion

$$f(y) \approx f(x) + \nabla f(x)^T (y - x) + \frac{1}{2t} \|y - x\|_2^2$$

This is a quadratic approximation, replacing the usual Hessian $\nabla^2 f(x)$ by $\frac{1}{t} I$: the term $f(x) + \nabla f(x)^T (y - x)$ is the linear approximation to $f$, and $\frac{1}{2t} \|y - x\|_2^2$ is a proximity term to $x$, with weight $1/(2t)$.

Choose the next point $y = x^+$ to minimize the quadratic approximation:

$$x^+ = x - t \nabla f(x)$$

$$x^+ = \operatorname*{argmin}_y \; f(x) + \nabla f(x)^T (y - x) + \frac{1}{2t} \|y - x\|_2^2$$

[Figure: the quadratic approximation around $x$; the blue point is $x$, the red point is $x^+$.]
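To fill in the minimization step (a standard one-line derivation, not spelled out on the slide): the approximation is convex in $y$, so set its gradient to zero:

$$\nabla_y \left[ f(x) + \nabla f(x)^T (y - x) + \frac{1}{2t} \|y - x\|_2^2 \right] = \nabla f(x) + \frac{1}{t} (y - x) = 0 \implies x^+ = x - t \nabla f(x)$$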

Fixed step size
Simply take $t_k = t$ for all $k = 1, 2, 3, \ldots$; this can diverge if $t$ is too big. Consider $f(x) = (10 x_1^2 + x_2^2)/2$, gradient descent after 8 steps:

[Figure: iterates zig-zag outward across the contours of $f$, moving away from the optimum.]

Can be slow if $t$ is too small. Same example, gradient descent after 100 steps:

[Figure: iterates inch along the contours, still far from the optimum.]

Converges nicely when $t$ is "just right". Same example, gradient descent after 40 steps:

[Figure: iterates reach the optimum.]

Convergence analysis later will give us a precise idea of "just right".
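The three regimes are easy to reproduce on this quadratic; a small sketch (the starting point is our own choice; the step counts follow the figures):

    import numpy as np

    # f(x) = (10 x1^2 + x2^2) / 2, with gradient (10 x1, x2)
    grad = lambda x: np.array([10.0 * x[0], x[1]])

    def run(t, steps, x0=(10.0, 10.0)):
        x = np.array(x0)
        for _ in range(steps):
            x = x - t * grad(x)
        return x

    print(run(t=0.21, steps=8))    # too big: |1 - 10t| > 1, so x1 diverges
    print(run(t=0.01, steps=100))  # too small: still far from the optimum (0, 0)
    print(run(t=0.10, steps=40))   # just right: essentially at the optimum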

Backtracking line search
One way to adaptively choose the step size is to use backtracking line search: first fix parameters $0 < \beta < 1$ and $0 < \alpha \le 1/2$. At each iteration, start with $t = t_{\mathrm{init}}$, and while

$$f(x - t \nabla f(x)) > f(x) - \alpha t \|\nabla f(x)\|_2^2$$

shrink $t = \beta t$; else perform the gradient descent update $x^+ = x - t \nabla f(x)$.

Simple, and tends to work well in practice (further simplification: just take $\alpha = 1/2$).
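A direct transcription of this loop into Python (the default values of t_init, alpha, and beta are our own choices within the stated ranges):

    import numpy as np

    def backtracking_step(f, grad_f, x, t_init=1.0, alpha=0.5, beta=0.8):
        # Shrink t until the sufficient-decrease condition above holds.
        g = grad_f(x)
        t = t_init
        while f(x - t * g) > f(x) - alpha * t * np.dot(g, g):
            t = beta * t
        return x - t * g  # gradient descent update with the accepted t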

Backtracking interpretation

[Figure: $f(x + t \Delta x)$ plotted against $t$, together with the lines $f(x) + t \nabla f(x)^T \Delta x$ and $f(x) + \alpha t \nabla f(x)^T \Delta x$; backtracking accepts the first $t \in (0, t_0)$ at which the curve falls below the upper line.]

For us, $\Delta x = -\nabla f(x)$.

Backtracking picks up roughly the right step size (12 outer steps, 40 steps total):

[Figure: backtracking iterates converge to the optimum on the same contours.]

Here $\alpha = \beta = 0.5$.

Practicalities
Stopping rule: stop when $\|\nabla f(x)\|_2$ is small. Recall that $\nabla f(x^\star) = 0$ at a solution $x^\star$. If $f$ is strongly convex with parameter $m$, then

$$\|\nabla f(x)\|_2 \le \sqrt{2 m \epsilon} \implies f(x) - f^\star \le \epsilon$$

Pros and cons of gradient descent:
Pro: simple idea, and each iteration is cheap (usually).
Pro: fast for well-conditioned, strongly convex problems.
Con: can often be slow, because many interesting problems aren't strongly convex or well-conditioned.
Con: can't handle nondifferentiable functions.
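Why the threshold $\sqrt{2 m \epsilon}$ works (a standard argument, not spelled out on the slide): strong convexity gives, for all $y$,

$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2} \|y - x\|_2^2$$

Minimizing both sides over $y$ (the right-hand side is minimized at $y = x - \nabla f(x)/m$) yields

$$f^\star \ge f(x) - \frac{1}{2m} \|\nabla f(x)\|_2^2, \qquad \text{so } \|\nabla f(x)\|_2 \le \sqrt{2 m \epsilon} \implies f(x) - f^\star \le \epsilon$$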

Stochastic gradient descent
Consider minimizing a sum of functions

$$\min_x \sum_{i=1}^m f_i(x)$$

Since $\nabla \sum_{i=1}^m f_i(x) = \sum_{i=1}^m \nabla f_i(x)$, gradient descent would repeat:

$$x^{(k)} = x^{(k-1)} - t_k \sum_{i=1}^m \nabla f_i(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$

In comparison, stochastic gradient descent or SGD (or incremental gradient descent) repeats:

$$x^{(k)} = x^{(k-1)} - t_k \nabla f_{i_k}(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$

where $i_k \in \{1, \ldots, m\}$ is some index chosen at iteration $k$.

Two rules for choosing the index $i_k$ at iteration $k$:

Cyclic rule: choose $i_k = 1, 2, \ldots, m, 1, 2, \ldots, m, \ldots$
Randomized rule: choose $i_k \in \{1, \ldots, m\}$ uniformly at random.

The randomized rule is more common in practice.

What's the difference between stochastic and usual (called batch) methods? Computationally, $m$ stochastic steps $\approx$ one batch step. But what about progress?

Cyclic rule, $m$ steps: $x^{(k+m)} = x^{(k)} - t \sum_{i=1}^m \nabla f_i(x^{(k+i-1)})$
Batch method, one step: $x^{(k+1)} = x^{(k)} - t \sum_{i=1}^m \nabla f_i(x^{(k)})$

The difference in direction is $\sum_{i=1}^m \left[ \nabla f_i(x^{(k+i-1)}) - \nabla f_i(x^{(k)}) \right]$. So SGD should converge if each $\nabla f_i(x)$ doesn't vary wildly with $x$.

Rule of thumb: SGD thrives far from the optimum, struggles close to the optimum... (we'll revisit this in just a few lectures).
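A sketch of SGD with the randomized rule on a toy least-squares sum $f_i(x) = \frac{1}{2} (a_i^T x - b_i)^2$ (the problem, the constant step size, and the iteration count are our own illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    def sgd(grad_fi, m, x0, t=0.01, n_iter=20000):
        # Randomized rule: i_k drawn uniformly from {0, ..., m-1}.
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iter):
            i = rng.integers(m)
            x = x - t * grad_fi(i, x)  # step along one component gradient
        return x

    A, b = rng.standard_normal((100, 5)), rng.standard_normal(100)
    grad_fi = lambda i, x: (A[i] @ x - b[i]) * A[i]
    x_hat = sgd(grad_fi, m=100, x0=np.zeros(5))

With a constant step size the iterates hover in a neighborhood of the least-squares solution rather than converging exactly, matching the rule of thumb above.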

References and further reading
D. Bertsekas (2010), "Incremental gradient, subgradient, and proximal methods for convex optimization: a survey"
S. Boyd and L. Vandenberghe (2004), "Convex optimization", Chapter 9
T. Hastie, R. Tibshirani and J. Friedman (2009), "The elements of statistical learning", Chapters 10 and 16
Y. Nesterov (1998), "Introductory lectures on convex optimization: a basic course", Chapter 2
L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012

Convex sets and functions
Convex set: $C \subseteq \mathbb{R}^n$ such that

$$x, y \in C \implies tx + (1-t)y \in C \text{ for all } 0 \le t \le 1$$

Convex function: $f : \mathbb{R}^n \to \mathbb{R}$ such that $\mathrm{dom}(f) \subseteq \mathbb{R}^n$ is convex, and

$$f(tx + (1-t)y) \le t f(x) + (1-t) f(y) \text{ for } 0 \le t \le 1 \text{ and all } x, y \in \mathrm{dom}(f)$$

[Figure: the chord between $(x, f(x))$ and $(y, f(y))$ lies above the graph of $f$.]
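As a one-line check of the definition (our own example, not on the slide), take $f(x) = x^2$ on $\mathbb{R}$:

$$t f(x) + (1-t) f(y) - f(tx + (1-t)y) = t x^2 + (1-t) y^2 - (tx + (1-t)y)^2 = t(1-t)(x - y)^2 \ge 0$$

so the chord always lies above the graph.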

Convex optimization problems
Optimization problem:

$$\min_{x \in D} f(x) \quad \text{subject to } g_i(x) \le 0, \; i = 1, \ldots, m, \quad h_j(x) = 0, \; j = 1, \ldots, p$$

Here $D = \mathrm{dom}(f) \cap \bigcap_{i=1}^m \mathrm{dom}(g_i) \cap \bigcap_{j=1}^p \mathrm{dom}(h_j)$, the common domain of all the functions. This is a convex optimization problem provided the functions $f$ and $g_i$, $i = 1, \ldots, m$ are convex, and $h_j$, $j = 1, \ldots, p$ are affine:

$$h_j(x) = a_j^T x + b_j, \quad j = 1, \ldots, p$$

Local minima are global minima
For convex optimization problems, local minima are global minima. Formally, if $x$ is feasible ($x \in D$, and satisfies all constraints) and minimizes $f$ in a local neighborhood,

$$f(x) \le f(y) \text{ for all feasible } y \text{ with } \|x - y\|_2 \le \rho,$$

then

$$f(x) \le f(y) \text{ for all feasible } y.$$

This is a very useful fact, and will save us a lot of trouble!

[Figure: a convex function with a single global minimum versus a nonconvex function with several local minima.]
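A short proof sketch (the standard argument, not shown on the slide): suppose some feasible $y$ had $f(y) < f(x)$, and consider $z_\theta = \theta y + (1 - \theta) x$ for small $\theta \in (0, 1]$. Since the feasible set is convex, $z_\theta$ is feasible; since $\|z_\theta - x\|_2 = \theta \|y - x\|_2 \le \rho$ for $\theta$ small enough, $z_\theta$ lies in the local neighborhood; and by convexity

$$f(z_\theta) \le \theta f(y) + (1 - \theta) f(x) < f(x),$$

contradicting the local optimality of $x$.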

Nonconvex problems
Convex problem: convex objective function, convex constraints, convex domain.
Nonconvex problem: not all of the above conditions are met. We usually settle for an approximation or a local optimum.

Summary
GD/SGD: both are simple to implement.
SGD: needs fewer passes over the whole dataset, so it is fast especially when the dataset is large; it is also better able to escape local optima in nonconvex problems.
GD: less tricky step size tuning.
Second-order methods (e.g., Newton's method, L-BFGS): simple step size tuning, and they get closer to the optimum in nonconvex problems, but at a higher memory cost.