Why should you care about the solution strategies?


Optimization

Why should you care about the solution strategies? Understanding the optimization approaches behind the algorithms lets you choose which algorithm to run more effectively. It also lets you formalize your problem more effectively: otherwise you might pose a very hard optimization problem when, with minor modifications, you could significantly simplify it for the solvers without significantly changing the properties of the solution. And it matters whenever you want to do something outside the given packages or solvers (which is often the case). Also, it's fun!

Thought questions. Many questions were about the existence of optimal solutions and how to find them. For example: What if the maximum likelihood estimate of a parameter does not exist? Do we always assume convex objectives? How can we find the global solution, and not get stuck in local minima or saddle points? Are local minima good enough? How do we pick starting points?

Optimality. We will not deal only with convex functions; we simply have so far, and if we *can* make our optimization convex, then that is better. That is, if you have two options (convex and non-convex) and it is not clear that one is better than the other, you may as well pick the convex one. The field of optimization deals with finding optimal solutions for non-convex problems; this is sometimes possible and sometimes not. One strategy is random restarts; another is smart initialization. How do we pick a good starting point for gradient descent? Is a local minimum good enough?

How do we pick model types, such as distributions and priors? For most ML problems, we will pick generic distributions that match the type of the target. Where do priors come from? They are either general purpose (e.g., regularizers, sparsity) or specified by an expert. For example, imagine modelling the distribution over images of trees, with a feature vector containing height, leaf size, age, etc. An expert might know ranges of and general relationships among these variables, which narrows the choice of distributions. Suggested in the TQs: use some data to estimate a prior, then use that prior with new data. Would this be better than simply doing maximum likelihood from the start?

TQ: Isn't the world deterministic? So aren't we just making it harder on ourselves by assuming things are probabilistic? Even if the world is deterministic (which might be a fun assumption, actually), it is definitely partially observable, and partial observability makes the world look stochastic. Example: the weather tomorrow seems random, because we cannot measure all the relevant variables. It is beneficial to deal with partial observability by explicitly modelling this uncertainty, with probabilities.

Reducible and irreducible error. Recall the decomposition of the expected cost:

E[C] = \int_X \int_Y (f(x) - y)^2 \, p(x,y) \, dy \, dx
     = \int_X (f(x) - E[Y|x])^2 \, p(x) \, dx + \int_X \int_Y (E[Y|x] - y)^2 \, p(x,y) \, dy \, dx.

The first term is the reducible error: the distance between the trained model f(x) and the optimal model E[Y|x]. The second term is the irreducible error. Do we ever trade off these two errors? Is this related to the bias-variance trade-off?
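As a quick numerical illustration (my own addition, not from the slides): on a synthetic problem where E[Y|x] = x^2 is known exactly, both terms can be estimated by Monte Carlo and checked to sum to the total expected cost. The model f, data distribution, and noise level below are all assumed for illustration.

import numpy as np

rng = np.random.default_rng(6)
# Synthetic problem where the optimal model E[Y|x] = x^2 is known exactly.
x = rng.uniform(-1, 1, size=200000)
y = x ** 2 + 0.3 * rng.normal(size=x.size)

f = lambda x: 0.5 * x                      # some (deliberately poor) trained model
total = np.mean((f(x) - y) ** 2)
reducible = np.mean((f(x) - x ** 2) ** 2)  # distance to E[Y|x]
irreducible = np.mean((x ** 2 - y) ** 2)   # noise variance, about 0.09
print(total, reducible + irreducible)      # match, up to Monte Carlo error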

Where does gradient descent come from? The goal is to find a stationary point, but we cannot get a closed-form solution for gradient = 0. Instead, use a Taylor series expansion: truncating at first order gives gradient descent; truncating at second order gives the Newton-Raphson method (also called second-order gradient descent).

Taylor series expansion. A function f(x) in the neighborhood of a point x_0 can be approximated using the Taylor series as

f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(x_0)}{n!} (x - x_0)^n,

where f^{(n)}(x_0) is the n-th derivative of f evaluated at x_0, and f is assumed to be infinitely differentiable. For practical reasons, we truncate the series, e.g. at second order:

f(x) \approx f(x_0) + (x - x_0) f'(x_0) + \frac{1}{2} (x - x_0)^2 f''(x_0).
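A quick numeric sanity check (my addition) of the second-order approximation, using f(x) = e^x around x_0 = 0, where every derivative equals 1:

import numpy as np

x0 = 0.0
f = lambda x: np.exp(x)
# Second-order Taylor approximation of e^x around x0 = 0.
taylor2 = lambda x: 1.0 + (x - x0) + 0.5 * (x - x0) ** 2

for x in [0.1, 0.5, 1.0]:
    print(x, f(x), taylor2(x))
# The approximation is accurate near x0 and degrades as |x - x0| grows.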

Taylor series expansion. f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(x_0)}{n!} (x - x_0)^n. [Figure: Taylor polynomial approximations of degrees 1, 3, 5, 7, 9, 11 and 13. From Wikipedia.]

Taylor series expansion. f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(x_0)}{n!} (x - x_0)^n. [Figure: Taylor series approximation example. From Wikipedia.]

Whiteboard: first-order and second-order gradient descent; big-O costs for these methods; understanding the Hessian and step-size selection.

An example of convergence rates. Let \gamma \in (0,1), with n the iteration number. The sequence \{\gamma^n\} converges linearly to zero, but not superlinearly; \{\gamma^{n^2}\} converges superlinearly to zero, but not quadratically; \{\gamma^{2^n}\} converges quadratically to zero. Superlinear convergence is much faster than linear convergence, but quadratic convergence is much, much faster than superlinear convergence. For example, \gamma = 1/2 gives \gamma^n = 2^{-n}, \gamma^{n^2} = 2^{-n^2}, and \gamma^{2^n} = 2^{-2^n}. *See https://sites.math.washington.edu/~burke/crs/408/lectures/l10-rates-of-conv-newton.pdf
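A few lines (my addition) reproduce the gamma = 1/2 example; the three columns shrink at visibly different speeds:

# Columns: n, linear 2^{-n}, superlinear 2^{-n^2}, quadratic 2^{-2^n}.
gamma = 0.5
for n in range(1, 6):
    print(n, gamma ** n, gamma ** (n * n), gamma ** (2 ** n))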

Second-order. Minimize f(x) := x^2 + e^x with Newton's method, x_{k+1} = x_k - f'(x_k) / f''(x_k):

x             f'(x)
 1            4.7182818
 0            1
-1/3          0.0498646
-0.3516893    0.00012
-0.3517337    0.00000000064

In addition, one more iteration gives |f'(x_5)| <= 10^{-20}.

First-order. Gradient descent needs many more iterations:

k    x_k          f(x_k)       f'(x_k)
0     1           3.7182818     4.7182818
1     0           1             1
2    -0.5         0.8565307    -0.3934693
3    -0.25        0.8413008     0.2788008
4    -0.375       0.8279143    -0.0627107
5    -0.34075     0.8273473     0.0297367
6    -0.356375    0.8272131    -0.01254
7    -0.3485625   0.8271976     0.0085768
8    -0.3524688   0.8271848    -0.001987
9    -0.3514922   0.8271841     0.0006528
10   -0.3517364   0.827184     -0.0000072

Compared to Newton's method:

x             f'(x)
 1            4.7182818
 0            1
-1/3          0.0498646
-0.3516893    0.00012
-0.3517337    0.00000000064
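As a sketch (my addition), Newton's method on this example is a few lines of Python that reproduce the second-order table:

import numpy as np

# Newton's method on f(x) = x^2 + e^x: x_{k+1} = x_k - f'(x_k) / f''(x_k).
df = lambda x: 2 * x + np.exp(x)   # f'(x)
d2f = lambda x: 2 + np.exp(x)      # f''(x)

x = 1.0
for k in range(5):
    print(k, x, df(x))             # prints the (x, f'(x)) table above
    x = x - df(x) / d2f(x)         # Newton update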

Gradient descent

Algorithm 1: Batch Gradient Descent(Err, X, y)
1: // A non-optimized, basic implementation of batch gradient descent
2: w <- random vector in R^d
3: err <- infinity
4: tolerance <- 10e-4
5: eta <- 0.1
6: while |Err(w) - err| > tolerance do
7:     err <- Err(w)
8:     // The step-size eta should be chosen by line search
9:     w <- w - eta ∇Err(w) = w - eta X^T (Xw - y)
10: return w

Recall: for the error function Err(w) (here ||Xw - y||^2), the goal is to solve ∇Err(w) = 0.
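A minimal runnable translation of Algorithm 1 to Python (a sketch under assumptions: synthetic data, and a fixed step size set from the spectral norm of X instead of the line search the pseudocode recommends):

import numpy as np

def batch_gradient_descent(X, y, eta, tolerance=1e-4, max_iters=100000):
    # Follows Algorithm 1: iterate until the error stops changing.
    w = np.random.randn(X.shape[1])
    err = np.inf
    for _ in range(max_iters):
        new_err = np.sum((X @ w - y) ** 2)   # Err(w) = ||Xw - y||^2
        if abs(err - new_err) < tolerance:
            break
        err = new_err
        w = w - eta * X.T @ (X @ w - y)      # gradient step
    return w

# Illustrative usage on synthetic data (an assumption, not from the slides).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
eta = 1.0 / np.linalg.norm(X, 2) ** 2        # a safe fixed step size
print(batch_gradient_descent(X, y, eta))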

Line search. We want a step-size eta such that eta = argmin_eta E(w - eta ∇E(w)). Backtracking line search:
1. Start with a relatively large eta (say eta = 1)
2. Check if E(w - eta ∇E(w)) < E(w)
3. If yes, use that eta
4. Otherwise, decrease eta (e.g., eta <- eta/2), and check again
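A sketch of this backtracking rule (my addition; it uses the simple decrease check from the slide, whereas practical implementations often use an Armijo-style sufficient-decrease condition):

import numpy as np

def backtracking_step(E, grad_E, w, eta0=1.0, max_halvings=50):
    # Start with a large step and halve until the error decreases.
    g = grad_E(w)
    eta = eta0
    for _ in range(max_halvings):
        if E(w - eta * g) < E(w):   # the simple decrease check from the slide
            return w - eta * g
        eta = eta / 2.0
    return w                        # no decrease found; keep w unchanged

# Illustrative usage on E(w) = ||Xw - y||^2 with synthetic data (assumed).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = rng.normal(size=50)
E = lambda w: np.sum((X @ w - y) ** 2)
grad_E = lambda w: 2 * X.T @ (X @ w - y)
w = np.zeros(2)
for _ in range(20):
    w = backtracking_step(E, grad_E, w)
print(w, E(w))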

What is the second-order gradient descent update?

Algorithm 1: Batch Gradient Descent(Err, X, y)
1: // A non-optimized, basic implementation of batch gradient descent
2: w <- random vector in R^d
3: err <- infinity
4: tolerance <- 10e-4
5: eta <- 0.1
6: while |Err(w) - err| > tolerance do
7:     err <- Err(w)
8:     // The step-size eta should be chosen by line search
9:     w <- w - eta ∇Err(w) = w - eta X^T (Xw - y)
10: return w
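For linear regression the answer has a clean closed form (the standard derivation, sketched here; the slide leaves it for the whiteboard): the Hessian of Err(w) = ||Xw - y||^2 is constant, so a single Newton step w <- w - (X^T X)^{-1} X^T (Xw - y) reaches the least-squares solution from any initialization:

import numpy as np

# For Err(w) = ||Xw - y||^2: gradient X^T(Xw - y), Hessian H = X^T X
# (dropping the factor of 2 in both, as the slides do; Newton's step is
# invariant to that scaling).
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = rng.normal(size=3)             # arbitrary initialization
g = X.T @ (X @ w - y)              # gradient
H = X.T @ X                        # Hessian
w = w - np.linalg.solve(H, g)      # a single Newton step
print(w)                           # recovers [1, -2, 0.5]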

Intuition for first and second order. Locally approximate the function at the current point. For first order, locally approximate it as linear, and step in the direction of the minimum of that linear approximation. For second order, locally approximate it as quadratic, and step to the minimum of that quadratic approximation; a quadratic approximation is more accurate. What happens if the true function is quadratic? The second-order (Newton) update is

x^{(i+1)} = x^{(i)} - H_f(x^{(i)})^{-1} \nabla f(x^{(i)}).

[Figure: linear vs. quadratic local approximations, with the Newton step shown in red.]

Quasi-second-order methods. These approximate the inverse Hessian, and can be much more efficient. Imagine you only kept the diagonal of the inverse Hessian: how expensive would that be? Examples: L-BFGS, low-rank approximations, AdaGrad, AdaDelta, Adam.
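To make the diagonal idea concrete, here is a minimal AdaGrad-style sketch (my own illustration): it keeps one running statistic per parameter, O(d) memory, rather than a d-by-d inverse-Hessian approximation:

import numpy as np

def adagrad(grad, w0, eta=0.5, eps=1e-8, iters=200):
    # One accumulator per parameter: a diagonal, O(d)-memory curvature
    # proxy, instead of a full inverse-Hessian approximation.
    w = w0.astype(float).copy()
    G = np.zeros_like(w)                   # running sum of squared gradients
    for _ in range(iters):
        g = grad(w)
        G += g ** 2
        w -= eta * g / (np.sqrt(G) + eps)  # per-coordinate step sizes
    return w

# Illustrative usage: a quadratic with very different curvature per coordinate.
grad = lambda w: np.array([100.0 * w[0], w[1]])
print(adagrad(grad, np.array([1.0, 1.0])))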

Batch optimization. What are some issues with batch gradient descent? When might it be slow? Recall: it costs O(dn) per step.

Stochastic gradient descent

Algorithm 2: Stochastic Gradient Descent(E, X, y)
1: w <- random vector in R^d
2: for t = 1, ..., n do
3:     // For some settings, we need the step-size eta_t to decrease with time
4:     w <- w - eta_t ∇E_t(w) = w - eta_t (x_t^T w - y_t) x_t
5: end for
6: return w

For the batch error \hat{E}(w) = \sum_{t=1}^n E_t(w) with, e.g., E_t(w) = (x_t^\top w - y_t)^2, we have \hat{E}(w) = \|Xw - y\|_2^2 and \nabla \hat{E}(w) = \sum_{t=1}^n \nabla E_t(w). The true error is

E(w) = \int_X \int_Y f(x,y) (x^\top w - y)^2 \, dy \, dx.

Stochastic gradient descent (stochastic approximation) minimizes this using an unbiased sample of the gradient: E[\nabla E_t(w)] = \nabla E(w).
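A minimal runnable translation of Algorithm 2 (a sketch; the synthetic data and the eta_t = eta0/t decay are assumptions, with 1/t taken from the step-size slide below; decay rates need tuning in practice):

import numpy as np

def sgd(X, y, eta0=0.1, epochs=20):
    # Follows Algorithm 2, run for several passes with eta_t = eta0 / t.
    n, d = X.shape
    w = np.random.randn(d)
    t = 1
    for _ in range(epochs):
        for i in np.random.permutation(n):               # one sample at a time
            w = w - (eta0 / t) * (X[i] @ w - y[i]) * X[i]
            t += 1
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
print(sgd(X, y))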

The batch gradient is an unbiased sample of the true gradient:

E\Big[\frac{1}{n} \nabla \hat{E}(w)\Big] = E\Big[\frac{1}{n} \sum_{i=1}^n \nabla E_i(w)\Big] = \frac{1}{n} \sum_{i=1}^n E[\nabla E_i(w)] = \frac{1}{n} \sum_{i=1}^n \nabla E(w) = \nabla E(w),

where, e.g., E[\nabla E_i(w)] = E[(X_i^\top w - Y_i) X_i].
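A quick empirical check (my addition) that the per-sample gradients average exactly to the scaled batch gradient, so uniform sampling gives an unbiased estimate:

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)
w = rng.normal(size=3)

# The average of the n per-sample gradients equals (1/n) of the batch gradient.
per_sample = np.array([(X[i] @ w - y[i]) * X[i] for i in range(100)])
full = X.T @ (X @ w - y) / 100
print(np.allclose(per_sample.mean(axis=0), full))   # True

# Sampling an index uniformly therefore matches the batch gradient in
# expectation; the same argument over p(x, y) gives E[grad E_t(w)] = grad E(w).
idx = rng.integers(0, 100, size=50000)
print(full, per_sample[idx].mean(axis=0))           # close, up to sampling noise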

Stochastic gradient descent. We can also approximate the gradient with more than one sample (e.g., a mini-batch), as long as E[\nabla E_t(w)] = \nabla E(w). For a proof of convergence and the conditions on the step-size, see Robbins and Monro, "A Stochastic Approximation Method", 1951. This has been a big focus in recent years in the machine learning community, with many new approaches for improving convergence rates, reducing variance, etc.

How do we pick the step-size? This is less clear than for batch gradient descent. In the basic algorithm, the step-sizes must decrease with time, but remain non-negligible in magnitude (e.g., eta_t = 1/t), satisfying

\sum_{t=1}^{\infty} \eta_t^2 < \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \eta_t = \infty.

There have been recent further insights into improving the selection of step-sizes and reducing variance (e.g., SAGA, SVRG). Note: look up "stochastic approximation" as an alternative name.
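A small numeric illustration (my addition) of why eta_t = 1/t satisfies both conditions: its partial sums diverge while the partial sums of its squares converge:

import numpy as np

# For eta_t = 1/t: sum eta_t grows without bound (like log T),
# while sum eta_t^2 converges (to pi^2 / 6, about 1.645).
for T in [10, 1000, 100000]:
    t = np.arange(1, T + 1)
    print(T, np.sum(1.0 / t), np.sum(1.0 / t ** 2))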

What are the benefits of SGD? For batch gradient descent, to get w such that f(w) - f(w*) < epsilon, we need O(ln(1/epsilon)) iterations, with conditions on f (convex, gradient Lipschitz continuous); e.g., ln(1/0.001) is approximately 7 iterations. One iteration of GD for linear regression costs O(dn):

w \leftarrow w - \eta_t X^\top (Xw - y) = w - \eta_t \sum_{i=1}^n (x_i^\top w - y_i) x_i.

For stochastic gradient descent, to get w such that f(w) - f(w*) < epsilon, we need O(1/epsilon) iterations, with conditions on the f_i (strongly convex, gradient Lipschitz continuous); e.g., 1/0.001 = 1000 iterations. One iteration of SGD for linear regression costs only O(d):

w \leftarrow w - \eta_t (x_t^\top w - y_t) x_t.

Alternative optimization strategies. What about non-gradient-based optimization methods? An example is the cross-entropy method (sketched below). What are the pros and cons?
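A minimal sketch of the cross-entropy method for continuous minimization (my own illustration, under standard assumptions: a Gaussian sampling distribution refit to an elite fraction; only function evaluations are used, no gradients):

import numpy as np

def cross_entropy_method(f, dim, iters=50, pop=100, elite_frac=0.2, seed=0):
    # Sample candidates, keep the best ("elite") fraction, refit the
    # sampling distribution to the elites, and repeat.
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = int(pop * elite_frac)
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, dim))
        order = np.argsort([f(s) for s in samples])
        elites = samples[order[:n_elite]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu

# Illustrative usage on a simple non-convex function (assumed example).
f = lambda w: np.sum(w ** 2 + np.sin(5 * w))
print(cross_entropy_method(f, dim=2))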

Whiteboard exercise: derive an algorithm to compute the solution to l1-regularized linear regression (i.e., MAP estimation with a Gaussian likelihood p(y|x, w) and a Laplace prior). First write down the Laplace density, then write down the MAP optimization, then determine how to solve this optimization. Next: generalized linear models.
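One standard way the exercise's last step can go (a sketch, not the whiteboard derivation): the l1 penalty is non-differentiable at zero, so a common solver is proximal gradient descent (ISTA), which alternates a gradient step on the squared error with soft-thresholding:

import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1: shrink each coordinate toward zero.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, lam=0.1, iters=1000):
    # Proximal gradient (ISTA) for min_w 0.5 ||Xw - y||^2 + lam ||w||_1,
    # i.e., MAP with a Gaussian likelihood and a Laplace prior.
    eta = 1.0 / np.linalg.norm(X, 2) ** 2    # step size 1/L for this objective
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = soft_threshold(w - eta * X.T @ (X @ w - y), eta * lam)
    return w

# Illustrative usage with a sparse ground truth (assumed synthetic data).
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.1 * rng.normal(size=100)
print(lasso_ista(X, y))                      # approximately sparse estimate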