Statistical Computing (36-350)
Lecture 19: Optimization III: Constrained and Stochastic Optimization
Cosma Shalizi
30 October 2013

Agenda
- Constraints and penalties
- Stochastic optimization methods

READING (big picture): Francis Spufford, Red Plenty
OPTIONAL READING (big picture): Herbert Simon, The Sciences of the Artificial
OPTIONAL READING (close up): Bottou and Bousquet, "The Tradeoffs of Large Scale Learning"

Maximizing a multinomial likelihood

I roll dice n times; n_1, ..., n_6 count the outcomes. Likelihood and log-likelihood:

L(θ_1, ..., θ_6) = (n! / (n_1! n_2! n_3! n_4! n_5! n_6!)) ∏_{i=1}^{6} θ_i^{n_i}

ℓ(θ_1, ..., θ_6) = log (n! / (n_1! n_2! n_3! n_4! n_5! n_6!)) + Σ_{i=1}^{6} n_i log θ_i

Optimize by taking the derivative and setting it to zero:

∂ℓ/∂θ_1 = n_1/θ_1 = 0  ⟹  θ_1 = ∞ or n_1 = 0

We forgot that Σ_{i=1}^{6} θ_i = 1

We could use the constraint to eliminate one of the variables,

θ_6 = 1 − Σ_{i=1}^{5} θ_i

and then solve the equations

∂ℓ/∂θ_i = n_i/θ_i − n_6/(1 − Σ_{j=1}^{5} θ_j) = 0

BUT eliminating a variable with the constraint is usually messy

Constraints and Penalties

Suppose we want to minimize f(θ) subject to an equality constraint g(θ) = c, i.e., g(θ) − c = 0.

Lagrangian:

ℒ(θ, λ) = f(θ) − λ(g(θ) − c)

so ℒ = f when the constraint is satisfied. Now do unconstrained minimization over θ and λ:

∇_θ ℒ |_{θ*, λ*} = ∇f(θ*) − λ* ∇g(θ*) = 0

∂ℒ/∂λ |_{θ*, λ*} = −(g(θ*) − c) = 0

Optimizing over the Lagrange multiplier λ enforces the constraint. More constraints, more multipliers.

Try the dice again:

ℒ(θ, λ) = log (n! / (n_1! ··· n_6!)) + Σ_{i=1}^{6} n_i log θ_i − λ(Σ_{i=1}^{6} θ_i − 1)

∂ℒ/∂θ_i |_{θ_i = θ_i*} = n_i/θ_i* − λ = 0  ⟹  θ_i* = n_i/λ

Imposing Σ_{i=1}^{6} θ_i* = 1 gives λ = Σ_{i=1}^{6} n_i, so

θ_i* = n_i / Σ_{i=1}^{6} n_i
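
As a quick numerical sanity check (our own sketch, not from the lecture), we can compare the closed-form answer to what a numerical optimizer finds, here using the eliminate-one-variable trick from the earlier slide, which is workable because the constraint is so simple; the counts are made up for illustration:

n.counts <- c(3, 7, 10, 18, 25, 37)   # hypothetical n_1, ..., n_6
theta.mle <- n.counts/sum(n.counts)   # closed form: theta_i* = n_i/sum(n_i)
# Negative log-likelihood over theta_1..theta_5, with theta_6 eliminated
# via the constraint; infeasible points get +Inf
neg.loglike <- function(theta5) {
  theta <- c(theta5, 1 - sum(theta5))
  if (any(theta <= 0)) return(Inf)
  return(-sum(n.counts*log(theta)))
}
fit <- optim(par=rep(1/6, 5), fn=neg.loglike)
rbind(numeric=c(fit$par, 1 - sum(fit$par)), closed.form=theta.mle)

The two rows should agree to within the optimizer's tolerance.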

Thinking About the Multipliers

The constrained minimum value is generally higher than the unconstrained one. Changing the constraint level c changes θ*, f(θ*):

∂f(θ*)/∂c = ∂ℒ(θ*, λ*)/∂c
          = [∇f(θ*) − λ* ∇g(θ*)] ∂θ*/∂c − [g(θ*) − c] ∂λ*/∂c + λ*
          = λ*

(both bracketed terms vanish at the optimum)

λ* = rate of change in the optimal value as the constraint is relaxed

λ* = "shadow price": how much would you pay for a minute change in the level of the constraint?
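
To make the shadow-price interpretation concrete, here is a tiny numerical check on a made-up problem (our own example, not from the lecture): minimize θ_1² + θ_2² subject to θ_1 + θ_2 = c, for which the multiplier works out analytically to λ* = c:

f.star <- function(c.level) {  # constrained optimal value, eliminating theta_2
  g <- function(th1) { th1^2 + (c.level - th1)^2 }
  return(optimize(g, interval=c(-10, 10))$objective)
}
c0 <- 3; eps <- 1e-4
(f.star(c0 + eps) - f.star(c0))/eps  # ~ 3, matching lambda* = c0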

Inequality Constraints

What about an inequality constraint, h(θ) ≤ d, i.e., h(θ) − d ≤ 0? The region where the constraint is satisfied is the feasible set.

Roughly two cases:

1. The unconstrained optimum is inside the feasible set: the constraint is inactive
2. The unconstrained optimum is outside the feasible set: the constraint is active ("binds" or "bites"), and the constrained optimum is usually on the boundary

Add a Lagrange multiplier; λ ≠ 0 ⟺ the constraint binds

Mathematical Programming

Older than computer programming...

Optimize f(θ) subject to g(θ) = c and h(θ) ≤ d: "Give us the best deal on f, keeping in mind that we've only got d to spend, and the books have to balance."

Linear programming (Kantorovich, 1938): f and h both linear in θ ⟹ θ* is always at a corner of the feasible set

Back to the Factory

Constraints:

40(cars) + 60(trucks) ≤ 1600   (labor)
1(cars) + 3(trucks) ≤ 70       (steel)

Revenue: $13k/car, $27k/truck

[Figure: the feasible region in the (cars, trucks) plane, cut out by the labor and steel constraints]

Barrier Methods

(a.k.a. interior point, central path, etc.)

Having constraints switch on and off abruptly is annoying, especially with gradient methods. Fix µ > 0 and try minimizing

f(θ) − µ log(d − h(θ))

The log term pushes away from the barrier, more and more weakly as µ → 0.

1. Start with θ in the feasible set and an initial µ
2. While ((not too tired) and (making adequate progress)):
   1. Minimize f(θ) − µ log(d − h(θ))
   2. Reduce µ
3. Return final θ
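
Here is a minimal sketch of that loop in R, using optim() for the inner minimization (the function and argument names are our own, not from the lecture):

barrier.min <- function(f, h, d, theta, mu=1, shrink=0.5, n.outer=20) {
  for (k in 1:n.outer) {
    obj <- function(th) {
      slack <- d - h(th)
      if (slack <= 0) return(Inf)    # outside the feasible set
      return(f(th) - mu*log(slack))  # barrier pushes away from the boundary
    }
    theta <- optim(par=theta, fn=obj)$par  # inner minimization
    mu <- mu*shrink                        # weaken the barrier each round
  }
  return(theta)
}

# Minimize (theta_1-5)^2 + (theta_2-5)^2 subject to theta_1 + theta_2 <= 4;
# the constraint binds, so the optimum is on the boundary, at c(2,2)
barrier.min(f=function(th) sum((th-5)^2), h=sum, d=4, theta=c(0,0))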

Constraints vs. Penalties

argmin_{θ: h(θ) ≤ d} f(θ)  ↔  argmin_{θ, λ} f(θ) − λ(h(θ) − d)

d doesn't matter for doing the second minimization over θ, so we can trade one formulation for the other:

Constrained optimization: constraint level d
Penalized optimization:   penalty factor λ

Statistical Applications of Penalization

Minimize MSE of the linear function β·x: ordinary least squares regression

+ penalty on the squared length of the coefficient vector, Σ_j β_j²: ridge regression (stabilizes the estimate when data are collinear, p > n, or just noisy)

+ penalty on the sum of absolute coefficients, Σ_j |β_j|: the lasso (stability, plus driving small coefficients to 0, i.e., "sparsity")

Minimize MSE of a function + penalty on curvature: spline (fits smooth regressions w/o assuming a specific form)

Smoothing over time, space, other relations, e.g., social or genetic ties

Usually decide on the penalty factor/constraint level by trying to predict out of sample
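
To make the ridge case concrete, here is a sketch that writes the penalized objective out explicitly and hands it to optim() (our own illustration; in practice you would use a dedicated routine such as MASS::lm.ridge or the glmnet package):

ridge.fit <- function(x, y, lambda) {
  mse.plus.penalty <- function(beta) {
    mean((y - x %*% beta)^2) + lambda*sum(beta^2)  # MSE + ridge penalty
  }
  return(optim(par=rep(0, ncol(x)), fn=mse.plus.penalty, method="BFGS")$par)
}

# Made-up, nearly-collinear data to show the stabilizing effect
set.seed(42)
x1 <- rnorm(50); x2 <- x1 + rnorm(50, sd=0.01)
x <- cbind(x1, x2); y <- x1 + rnorm(50)
ridge.fit(x, y, lambda=0)    # typically large, offsetting coefficients
ridge.fit(x, y, lambda=0.1)  # shrunk toward something sensible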

R implementation

In penalty form, just choose λ and modify your objective function.

constrOptim implements the barrier method. Try this:

factory <- matrix(c(40,1,60,3), nrow=2,
  dimnames=list(c("labor","steel"), c("car","truck")))
available <- c(1600,70); names(available) <- rownames(factory)
prices <- c(car=13, truck=27)
revenue <- function(output) { return(-output %*% prices) }  # negated: we minimize
plan <- constrOptim(theta=c(5,5), f=revenue, grad=NULL,
  ui=-factory, ci=-available, method="Nelder-Mead")
plan$par

constrOptim only works with constraints of the form uθ ≥ c, hence the minus signs.

Problems with Big Data

A typical statistical objective function is the mean-squared error:

f(θ) = (1/n) Σ_{i=1}^{n} (y_i − m(x_i, θ))²

Getting a value of f is O(n), ∇f is O(np), H is O(np²); worse still if m slows down with n.

Not bad when n = 100 or even n = 10^4, but if n = 10^9 or n = 10^12 we don't even know which way to move.

Sampling, the Alternative to Sarcastic Gradient Descent

Pick one data point I at random (uniform on 1:n). The loss there, (y_I − m(x_I, θ))², is random, but

E[(y_I − m(x_I, θ))²] = f(θ)

Generally, if f(θ) = n⁻¹ Σ_{i=1}^{n} f_i(θ) and the f_i are well-behaved,

E[f_I(θ)] = f(θ)    E[∇f_I(θ)] = ∇f(θ)    E[∇²f_I(θ)] = H(θ)

⟹ Don't optimize with all the data, optimize with random samples

Stochastic Gradient Descent

Draw lots of one-point samples, and let their noise cancel out:

1. Start with initial guess θ, learning rate η
2. While ((not too tired) and (making adequate progress)):
   1. At the t-th iteration, pick random I uniformly
   2. Set θ ← θ − t⁻¹ η ∇f_I(θ)
3. Return final θ

Shrinking the step-size by 1/t ensures the noise in each gradient dies down.

(Variants: put the points in some random order, only check progress after going over each point once, adjust the 1/t rate, average a couple of random data points, etc.)

stoch.grad.descent <- function(f, theta, df, max.iter=1e6, rate=1e-6) {
  stopifnot(require(numDeriv))   # for grad()
  for (t in 1:max.iter) {
    g <- stoch.grad(f, theta, df)
    theta <- theta - (rate/t)*g  # step size shrinks as 1/t
  }
  return(theta)
}

stoch.grad <- function(f, theta, df) {
  i <- sample(1:nrow(df), size=1)  # pick one random data point
  noisy.f <- function(theta) { return(f(theta, data=df[i,])) }
  stoch.grad <- grad(noisy.f, theta)  # numerical gradient of the one-point loss
  return(stoch.grad)
}
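
A hypothetical usage example (ours, not from the lecture): fit a simple linear model by stochastic gradient descent on simulated data, with the squared error on a single row as the one-point loss:

set.seed(1)
df <- data.frame(x=rnorm(1000))
df$y <- 2 + 3*df$x + rnorm(1000)
squared.error <- function(theta, data) {
  (data$y - (theta[1] + theta[2]*data$x))^2
}
stoch.grad.descent(squared.error, theta=c(0,0), df=df,
                   max.iter=1e4, rate=0.1)
# drifts toward c(2, 3); with the aggressive 1/t schedule, the initial
# rate needs tuning or the estimate will stop short of the optimum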

Stochastic Newton's Method

a.k.a. 2nd-order stochastic gradient descent

1. Start with initial guess θ
2. While ((not too tired) and (making adequate progress)):
   1. At the t-th iteration, pick uniformly-random I
   2. θ ← θ − t⁻¹ H_I⁻¹(θ) ∇f_I(θ)
3. Return final θ

Plus all the Newton-ish tricks to avoid having to recompute the Hessian.
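
A rough sketch of one such update, following the conventions of stoch.grad.descent above; the small batch and the tiny ridge term are our own safeguards (a one-point Hessian is often singular), not part of the algorithm as stated:

stoch.newton.step <- function(f, theta, df, t, batch=10, ridge=1e-4) {
  i <- sample(1:nrow(df), size=batch)  # small random batch
  noisy.f <- function(th) { mean(f(th, data=df[i,])) }
  g <- numDeriv::grad(noisy.f, theta)
  H <- numDeriv::hessian(noisy.f, theta)
  # damp H so the solve is well-posed, then take a Newton step scaled by 1/t
  return(theta - (1/t)*solve(H + ridge*diag(length(theta)), g))
}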

Stochastic Gradient Methods

Pros:
- Each iteration is fast (and constant in n)
- Never need to hold all the data in memory
- Does converge eventually

Cons:
- Noise does reduce precision: more iterations are needed to get within ε of the optimum than with non-stochastic gradient descent or Newton's method

Often there is a low computational cost to get within statistical error of the optimum.

Simulated Annealing

Use Metropolis to sample from a density ∝ e^{−f(θ)/T}: samples will tend to be near small values of f. Keep lowering T as we go along ("cooling", "annealing").

1. Set initial θ, T > 0
2. While ((not too tired) and (making adequate progress)):
   1. Proposal: Z ~ r(·|θ) (e.g., Gaussian noise)
   2. Draw U ~ Unif(0,1)
   3. Acceptance: if U < e^{−(f(Z) − f(θ))/T} then θ ← Z
   4. Reduce T a little
3. Return final θ

Always moves to lower values of f, sometimes moves to higher. No derivatives, works for discrete problems, few guarantees.
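
A direct transcription of the pseudocode into R (a sketch; the proposal scale, cooling rate, iteration count, and test function are all arbitrary choices of ours):

sim.anneal <- function(f, theta, temp=10, cooling=0.999, n.iter=1e4) {
  for (t in 1:n.iter) {
    z <- theta + rnorm(length(theta), sd=0.5)  # Gaussian proposal
    if (runif(1) < exp(-(f(z) - f(theta))/temp)) {
      theta <- z  # always accepted downhill, sometimes uphill
    }
    temp <- temp*cooling  # reduce T a little
  }
  return(theta)
}

# A bumpy function with many local minima and its global minimum at the origin
bumpy <- function(th) { sum(th^2) + 3*sum(1 - cos(3*th)) }
sim.anneal(bumpy, theta=c(4,-4))  # should usually end up near c(0, 0)

Base R's optim(method="SANN") implements a variant of the same idea.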

Summary

- Constraints are usually part of optimization
- The constrained optimum is generally at the boundary of the feasible set
- Lagrange multipliers turn constrained problems into unconstrained ones
- Multipliers are prices: the trade-off between tightening the constraint and worsening the optimal value
- Stochastic optimization methods use probability in the search
- Stochastic gradient descent samples the data, giving up precision for speed
- Simulated annealing randomly moves against the objective function, escaping local minima