Optimization
Why should you care about solution strategies?
- Understanding the optimization approaches behind the algorithms helps you choose which algorithm to run more effectively.
- It also helps you formalize your problem more effectively; otherwise you might formalize a very hard optimization problem. Sometimes minor modifications can significantly simplify the problem for the solvers, without significantly impacting the properties of the solution.
- It matters when you want to do something outside the given packages or solvers (which is often true).
- Also, it's fun!
Thought questions
Many questions concern the existence and finding of optimal solutions, e.g.:
- What if the maximum likelihood estimate of a parameter does not exist?
- Do we always assume convex objectives?
- How can we find the global solution, and not get stuck in local minima or saddle points?
- Are local minima good enough?
- How do we pick starting points?
Optimality
We will not deal only with convex functions; we just have so far. If we *can* make our optimization convex, then this is better: if you have two options (convex and non-convex), and it's not clear one is better than the other, you may as well pick the convex one.
The field of optimization deals with finding optimal solutions for non-convex problems. Sometimes this is possible, sometimes not.
- One strategy: random restarts
- Another strategy: smart initialization approaches
How do we pick a good starting point for gradient descent? Is a local minimum good enough?
How do we pick model types, such as distributions and priors?
For most ML problems, we will pick generic distributions that match the type of the target.
Where do priors come from? They are either general purpose (e.g., regularizers, sparsity) or specified by an expert. For example, imagine modelling the distribution over images of trees, with a feature vector containing height, leaf size, age, etc. An expert might know ranges and general relationships for these variables, which narrows the choice of distributions.
Suggested in the TQs: use some data to estimate a prior, then use that prior with new data. Would this be better than simply doing maximum likelihood to start?
TQ: Isn't the world deterministic? So aren't we just making it harder on ourselves by assuming things are probabilistic?
Even if the world is deterministic (which might be a fun assumption, actually), it is definitely partially observable. Partial observability makes the world look stochastic. Example: the weather tomorrow seems random, because we cannot measure all the relevant variables. It is beneficial to deal with partial observability by explicitly modelling this uncertainty (with probabilities).
Reducible and Irreducible Error
Recall the decomposition of the expected cost into reducible and irreducible error:
E[C] = ∫_X ∫_Y (f(x) - y)^2 p(x, y) dy dx
     = ∫_X (f(x) - E[Y|x])^2 p(x) dx                      (reducible error)
       + ∫_X ∫_Y (E[Y|x] - y)^2 p(x, y) dy dx             (irreducible error)
The first term is the distance between the trained model f(x) and the optimal predictor E[Y|x].
Do we ever trade off these two errors? Is it related to the bias-variance trade-off?
Where does gradient descent come from?
The goal is to find a stationary point, but we cannot get a closed-form solution for gradient = 0. So we use a Taylor series expansion:
- First order gives gradient descent
- Second order gives the Newton-Raphson method (also called second-order gradient descent)
Taylor series expansion
A function f(x) in the neighborhood of a point x_0 can be approximated using the Taylor series as
f(x) = Σ_{n=0}^∞ f^(n)(x_0)/n! · (x - x_0)^n,
where f^(n)(x_0) is the n-th derivative of f evaluated at x_0. Also, f is considered to be infinitely differentiable. For practical reasons, we truncate the expansion, e.g. to second order:
f(x) ≈ f(x_0) + (x - x_0) f'(x_0) + (1/2)(x - x_0)^2 f''(x_0).
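The truncated expansion can be checked numerically. A minimal sketch, using e^x (whose derivatives are all e^x) as the example function; the helper name `taylor2` is just for illustration:

```python
import math

def taylor2(f, df, d2f, x0, x):
    """Second-order Taylor approximation of f around x0, evaluated at x."""
    return f(x0) + (x - x0) * df(x0) + 0.5 * (x - x0) ** 2 * d2f(x0)

# For f(x) = e^x, every derivative is also e^x, so we pass math.exp three times.
approx = taylor2(math.exp, math.exp, math.exp, 0.0, 0.1)
exact = math.exp(0.1)
print(approx, exact)  # the approximation is accurate near x0
```

Moving x further from x_0 makes the approximation degrade, which is why gradient methods only trust the local model for a small step.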
Taylor series expansion
f(x) = Σ_{n=0}^∞ f^(n)(x_0)/n! · (x - x_0)^n
[Figure: Taylor polynomial approximations of degree 1, 3, 5, 7, 9, 11 and 13. From Wikipedia.]
Taylor series expansion
f(x) = Σ_{n=0}^∞ f^(n)(x_0)/n! · (x - x_0)^n
[Figure: Taylor series approximation example. From Wikipedia.]
Whiteboard
First-order and second-order gradient descent
Big-O for these methods
Understanding the Hessian and stepsize selection
An example of convergence rates
Let λ ∈ (0, 1), with n the iteration number (think of λ as the initial squared distance to the true weights).
- {λ^n} converges linearly to zero, but not superlinearly.
- {λ^(n^2)} converges superlinearly to 0, but not quadratically.
- {λ^(2^n)} converges quadratically to zero.
Superlinear convergence is much faster than linear convergence, but quadratic convergence is much, much faster than superlinear convergence. λ = 1/2 gives λ^n = 2^(-n), λ^(n^2) = 2^(-n^2), λ^(2^n) = 2^(-2^n).
*see https://sites.math.washington.edu/~burke/crs/408/lectures/l10-rates-of-conv-newton.pdf
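The gap between these rates shows up after only a few iterations. A minimal sketch printing the three sequences for λ = 1/2:

```python
# Compare linear, superlinear and quadratic convergence for lam = 1/2,
# printing lam^n, lam^(n^2) and lam^(2^n) for the first few iterations.
lam = 0.5
for n in range(1, 6):
    linear = lam ** n
    superlinear = lam ** (n ** 2)
    quadratic = lam ** (2 ** n)
    print(n, linear, superlinear, quadratic)
```

By n = 5 the quadratic sequence is already around 2^(-32), far below the linear sequence's 2^(-5).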
Second-order
min f(x) := x^2 + e^x, with Newton update x_{k+1} = x_k - f'(x_k) / f''(x_k):

x_k           f'(x_k)
 1            4.7182818
 0            1
-1/3          0.0498646
-0.3516893    0.00012
-0.3517337    0.00000000064

In addition, one more iteration gives f'(x_5) ≤ 10^(-20).
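The table above can be reproduced with a minimal Newton-Raphson sketch (the helper name `newton` is just for illustration); for f(x) = x^2 + e^x we have f'(x) = 2x + e^x and f''(x) = 2 + e^x:

```python
import math

def newton(df, d2f, x0, iters):
    """Newton-Raphson on a 1-d function: x <- x - f'(x)/f''(x)."""
    x = x0
    for _ in range(iters):
        x = x - df(x) / d2f(x)
    return x

# f(x) = x^2 + e^x, so f'(x) = 2x + e^x and f''(x) = 2 + e^x
df = lambda x: 2 * x + math.exp(x)
d2f = lambda x: 2 + math.exp(x)

x = newton(df, d2f, 1.0, 4)
print(x, df(x))  # x is approximately -0.3517337, with a tiny gradient
```

Starting from x_0 = 1, the iterates match the table: 1, 0, -1/3, -0.3516893, -0.3517337.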
First-order
Many more iterations are needed. Gradient descent on the same function f(x) = x^2 + e^x:

k    x_k          f(x_k)      f'(x_k)
0     1           3.7182818    4.7182818
1     0           1            1
2    -0.5         0.8565307   -0.3934693
3    -0.25        0.8413008    0.2788008
4    -0.375       0.8279143   -0.0627107
5    -0.34075     0.8273473    0.0297367
6    -0.356375    0.8272131   -0.01254
7    -0.3485625   0.8271976    0.0085768
8    -0.3524688   0.8271848   -0.001987
9    -0.3514922   0.8271841    0.0006528
10   -0.3517364   0.827184    -0.0000072

Compared to second order:

x_k           f'(x_k)
 1            4.7182818
 0            1
-1/3          0.0498646
-0.3516893    0.00012
-0.3517337    0.00000000064
Gradient descent
Recall: for an error function Err(w), the goal is to solve ∇Err(w) = 0.
Algorithm 1: Batch Gradient Descent(Err, X, y)
1: // A non-optimized, basic implementation of batch gradient descent
2: w ← random vector in R^d
3: err ← ∞
4: tolerance ← 10^(-4)
5: η ← 0.1
6: while |Err(w) - err| > tolerance do
7:   err ← Err(w)
8:   // The step-size η should be chosen by line search
9:   w ← w - η ∇Err(w) = w - η X^T (Xw - y)
10: return w
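A minimal NumPy sketch of Algorithm 1 for linear regression, with the step-size fixed rather than chosen by line search (the function name and default values are just for illustration):

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.01, tolerance=1e-4, seed=0):
    """Basic batch gradient descent for the squared error ||Xw - y||^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.standard_normal(d)          # random initial vector in R^d
    err = np.inf
    while abs(np.sum((X @ w - y) ** 2) - err) > tolerance:
        err = np.sum((X @ w - y) ** 2)
        # gradient of ||Xw - y||^2 is 2 X^T (Xw - y); the constant is folded into eta
        w = w - eta * X.T @ (X @ w - y)
    return w

# Tiny example where y is exactly linear in X, so the optimum is [2, -1]
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = X @ np.array([2.0, -1.0])
w = batch_gradient_descent(X, y)
print(w)  # close to [2, -1]
```

Each iteration costs O(dn) for the matrix-vector products, which is the per-step cost discussed later for batch optimization.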
Line search
Want a step-size η such that η = argmin_η E(w - η ∇E(w)).
Backtracking line search:
1. Start with a relatively large η (say η = 1)
2. Check if E(w - η ∇E(w)) < E(w)
3. If yes, use that η
4. Otherwise, decrease η (e.g., η ← η/2), and check again
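The four steps above can be sketched directly, using the same example function f(x) = x^2 + e^x as the earlier slides (the helper name `backtracking_step` and the cap on halvings are illustrative choices):

```python
import math

def backtracking_step(f, grad, x, eta0=1.0, shrink=0.5, max_halvings=50):
    """Backtracking line search: halve eta until the step decreases f."""
    eta = eta0
    g = grad(x)
    while f(x - eta * g) >= f(x) and max_halvings > 0:
        eta *= shrink          # step 4: decrease eta and check again
        max_halvings -= 1      # guard against an infinite loop near a stationary point
    return x - eta * g

# Same example as the slides: f(x) = x^2 + e^x
f = lambda x: x ** 2 + math.exp(x)
grad = lambda x: 2 * x + math.exp(x)

x = 1.0
for _ in range(20):
    x = backtracking_step(f, grad, x)
print(x, grad(x))  # x approaches the minimizer near -0.3517
```

This simple acceptance test (any decrease in f) is the version on the slide; practical implementations usually require a sufficient decrease (Armijo condition) instead.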
What is the second-order gradient descent update?
(Algorithm 1, batch gradient descent, shown again.) How would you change line 9, which currently uses the first-order update w ← w - η ∇Err(w)?
Intuition for first and second order
Locally approximate the function at the current point.
- For first order, locally approximate as linear and step in the direction of the minimum of that linear function.
- For second order, locally approximate as quadratic and step to the minimum of that quadratic function; a quadratic approximation is more accurate.
x^(i+1) = x^(i) - H_f(x^(i))^(-1) ∇f(x^(i))
What happens if the true function is quadratic?
[Figure: linear and quadratic local approximations, with the Newton step in red.]
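To answer the question on the slide: if the true function is quadratic, the second-order "approximation" is exact, so one Newton step lands on the minimum. A minimal sketch with an assumed quadratic f(x) = ½ x^T A x - b^T x:

```python
import numpy as np

# Assumed example quadratic: f(x) = 0.5 x^T A x - b^T x, with A positive definite.
# The Hessian is A everywhere, so one Newton step reaches the minimizer A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

x = np.array([10.0, -7.0])            # arbitrary starting point
grad = A @ x - b                      # gradient of the quadratic at x
x_new = x - np.linalg.solve(A, grad)  # Newton step: x - H^{-1} grad

print(x_new, np.linalg.solve(A, b))   # both equal the minimizer A^{-1} b
```

Note the solve of a linear system rather than an explicit inverse; this is the standard way to apply H^{-1} ∇f.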
Quasi-second order methods
Approximate the inverse Hessian; this can be much more efficient. Imagine you only kept the diagonal of the inverse Hessian: how expensive would this be?
Examples: LBFGS, low-rank approximations, Adagrad, Adadelta, Adam
Batch optimization
What are some issues with batch gradient descent? When might it be slow?
Recall: O(d n) per step
Stochastic gradient descent
Algorithm 2: Stochastic Gradient Descent(E, X, y)
1: w ← random vector in R^d
2: for t = 1, ..., n do
3:   // For some settings, we need the step-size η_t to decrease with time
4:   w ← w - η_t ∇E_t(w) = w - η_t (x_t^T w - y_t) x_t
5: end for
6: return w
For batch error Ê(w) = Σ_{t=1}^n E_t(w), e.g., E_t(w) = (x_t^T w - y_t)^2, giving Ê(w) = ||Xw - y||_2^2 and ∇Ê(w) = Σ_{t=1}^n ∇E_t(w).
The true objective is E(w) = ∫_X ∫_Y f(x, y)(x^T w - y)^2 dy dx. Stochastic gradient descent (stochastic approximation) minimizes it with an unbiased sample of the gradient: E[∇E_t(w)] = ∇E(w).
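A minimal NumPy sketch of Algorithm 2 for linear regression. One simplification relative to the algorithm: the data below is exactly realizable (zero noise at the optimum), so a constant step-size converges; in general the step-size η_t must decrease with t, as the following slides discuss:

```python
import numpy as np

def sgd_linear_regression(X, y, eta=0.1, epochs=200, seed=0):
    """Stochastic gradient descent for linear regression, one sample at a time."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.standard_normal(d)
    for _ in range(epochs):
        for i in rng.permutation(n):  # one pass over the shuffled samples
            # per-sample gradient of (x_i^T w - y_i)^2, constant folded into eta
            w = w - eta * (X[i] @ w - y[i]) * X[i]
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = X @ np.array([2.0, -1.0])  # noiseless targets, optimum is [2, -1]
w = sgd_linear_regression(X, y)
print(w)  # approaches [2, -1]
```

Each update touches a single sample, so the per-step cost is O(d) rather than the O(dn) of a batch step.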
Batch gradient is an unbiased sample of the true gradient:
E[(1/n) ∇Ê(w)] = E[(1/n) Σ_{i=1}^n ∇E_i(w)]
               = (1/n) Σ_{i=1}^n E[∇E_i(w)]      e.g., E[∇E_i(w)] = E[(X_i^T w - Y_i) X_i]
               = (1/n) Σ_{i=1}^n ∇E(w)
               = ∇E(w)
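The expectation above is over the data distribution, but the finite-sample identity behind it is easy to verify: the average of the per-sample gradients equals the full-batch gradient. A minimal sketch on assumed random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))   # assumed synthetic data, n = 100, d = 3
y = rng.standard_normal(100)
w = rng.standard_normal(3)

# Average of the per-sample gradients (x_i^T w - y_i) x_i ...
per_sample = np.mean([(X[i] @ w - y[i]) * X[i] for i in range(100)], axis=0)
# ... equals the full-batch gradient (1/n) X^T (Xw - y)
full = X.T @ (X @ w - y) / 100

print(np.allclose(per_sample, full))  # True
```

The same identity holds for any mini-batch average, which is why mini-batch gradients are also unbiased.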
Stochastic gradient descent
We can also approximate the gradient with more than one sample (e.g., a mini-batch), as long as E[∇E_t(w)] = ∇E(w).
Proof of convergence and conditions on the step-size: Robbins-Monro ("A Stochastic Approximation Method", Robbins and Monro, 1951).
A big focus in recent years in the machine learning community; many new approaches for improving convergence rate, reducing variance, etc.
How do we pick the stepsize?
Less clear than for batch gradient descent. In the basic algorithm, the step-sizes must decrease with time, but remain non-negligible in magnitude (e.g., η_t = 1/t):
Σ_{t=1}^∞ η_t^2 < ∞ and Σ_{t=1}^∞ η_t = ∞
There are recent further insights into improving the selection of stepsizes, and reducing variance (e.g., SAGA, SVRG).
Note: look up stochastic approximation as an alternative name.
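A quick numeric check that η_t = 1/t satisfies both conditions: the sum of squares converges (to π²/6) while the plain sum keeps growing without bound (like log T):

```python
import math

# Partial sums for eta_t = 1/t: the squared sum converges, the plain sum diverges.
for T in (10, 1000, 100000):
    sum_sq = sum(1.0 / t ** 2 for t in range(1, T + 1))
    sum_eta = sum(1.0 / t for t in range(1, T + 1))
    print(T, sum_sq, sum_eta)
```

Intuitively, the divergent sum lets the iterate travel arbitrarily far if needed, while the summable squares make the injected noise die out.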
What are the benefits of SGD?
For batch gradient descent: to get w such that f(w) - f(w*) < ε, we need O(ln(1/ε)) iterations, with conditions on f (convex, gradient Lipschitz continuous). e.g., ln(1/0.001) ≈ 7 iterations.
1 iteration of GD for linear regression: w ← w - η_t X^T(Xw - y) = w - η_t Σ_{i=1}^n (x_i^T w - y_i) x_i
For stochastic gradient descent: to get w such that f(w) - f(w*) < ε, we need O(1/ε) iterations, with conditions on f_i (strongly convex, gradient Lipschitz continuous). e.g., 1/0.001 = 1000 iterations.
1 iteration of SGD for linear regression: w ← w - η_t (x_t^T w - y_t) x_t
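The iteration counts alone favour batch GD, but the total work does not: a batch step touches all n samples (O(nd)) while an SGD step touches one (O(d)). A minimal back-of-the-envelope sketch, ignoring the hidden constants in the O(·) bounds:

```python
import math

# Total work to reach accuracy eps, ignoring constants:
# batch GD: O(nd) per step, O(ln(1/eps)) steps; SGD: O(d) per step, O(1/eps) steps.
def batch_gd_work(n, d, eps):
    return n * d * math.log(1.0 / eps)

def sgd_work(n, d, eps):
    return d * (1.0 / eps)

n, d, eps = 1_000_000, 100, 1e-3
print(batch_gd_work(n, d, eps), sgd_work(n, d, eps))
```

For large n and moderate accuracy, SGD can need far less total computation despite needing far more iterations.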
Alternative optimization strategies
What about non-gradient-based optimization methods? Example: the cross-entropy method. What are the pros and cons?
Whiteboard
Exercise: derive an algorithm to compute the solution to l1-regularized linear regression (i.e., MAP estimation with a Gaussian likelihood p(y | x, w) and a Laplace prior).
1. First write down the Laplace prior
2. Then write down the MAP optimization
3. Then determine how to solve this optimization
Next: Generalized linear models
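One standard way to solve the optimization in this exercise, min_w ½||Xw - y||² + λ||w||₁, is proximal gradient descent (ISTA) with soft-thresholding. This is a sketch of that solver, not necessarily the derivation intended for the whiteboard; the data below is assumed synthetic:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(X, y, lam, iters=2000):
    """Proximal gradient (ISTA) for min_w 0.5 ||Xw - y||^2 + lam ||w||_1."""
    eta = 1.0 / np.linalg.norm(X, 2) ** 2  # step-size 1/L, L = largest eigenvalue of X^T X
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        # gradient step on the smooth part, then prox step on the l1 part
        w = soft_threshold(w - eta * X.T @ (X @ w - y), eta * lam)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + 0.01 * rng.standard_normal(50)
w = ista(X, y, lam=1.0)
print(w)  # sparse: coordinates 1, 2 and 4 are (near) zero
```

The soft-threshold step is exactly what makes the Laplace prior produce sparse MAP estimates, in contrast to the shrink-but-never-zero effect of a Gaussian prior.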