Chapter 2. Optimization. Gradients, convexity, and ALS


Contents: Background · Gradient descent · Stochastic gradient descent · Newton's method · Alternating least squares · KKT conditions

Motivation. We can solve basic least-squares linear systems using the SVD. But what if we have missing values in the data, extra constraints for feasible solutions, more complex optimization problems (e.g. regularizers), etc.?

Gradients, Hessians, and convexity

Derivatives and local optima. The derivative of a function $f\colon \mathbb{R} \to \mathbb{R}$, denoted $f'$, explains its rate of change. If it exists,
$$f'(x) = \lim_{h \to 0^+} \frac{f(x + h) - f(x)}{h}.$$
The second derivative $f''$ is the change of the rate of change.

Derivatives and local optima. A stationary point of a differentiable $f$ is an $x$ s.t. $f'(x) = 0$. $f$ achieves its extremes at stationary points, at points where the derivative doesn't exist, or at infinities (Fermat's theorem). Whether a stationary point is a (local) maximum or minimum can be seen from the second derivative (if it exists).

Partial derivative. If $f$ is multivariate (e.g. $f\colon \mathbb{R}^3 \to \mathbb{R}$), we can consider it as a family of functions. E.g. $f(x, y) = x^2 + y$ has the functions $f_x(y) = x^2 + y$ (with $x$ fixed) and $f_y(x) = x^2 + y$ (with $y$ fixed). The partial derivative w.r.t. one variable keeps the other variables constant:
$$\frac{\partial f}{\partial x}(x, y) = f_y'(x) = 2x.$$

Gradient. The gradient is the derivative for multivariate functions $f\colon \mathbb{R}^n \to \mathbb{R}$:
$$\nabla f = \Big(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\Big).$$
Here (and later), we assume that the derivatives exist. The gradient is a function $\nabla f\colon \mathbb{R}^n \to \mathbb{R}^n$; $\nabla f(x)$ points in the direction of steepest ascent of $f$ at the point $x$.

Gradient. [Figure: illustration of the gradient of a function.]

Hessian. The Hessian is the square matrix of all second-order partial derivatives of a function $f\colon \mathbb{R}^n \to \mathbb{R}$:
$$H(f) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}$$
As usual, we assume the derivatives exist.

Jacobian matrix. If $f\colon \mathbb{R}^m \to \mathbb{R}^n$, then its Jacobian (matrix) is the $n \times m$ matrix of partial derivatives of the form
$$J = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_m} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_m} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_n}{\partial x_1} & \frac{\partial f_n}{\partial x_2} & \cdots & \frac{\partial f_n}{\partial x_m} \end{pmatrix}$$
The Jacobian is the best linear approximation of $f$. The Hessian is the transposed Jacobian of the gradient: $H(f)(x) = J(\nabla f(x))^T$.

Examples. Function $f(x, y) = x^2 + 2xy + y$. Partial derivatives: $f_x(x, y) = 2x + 2y$ and $f_y(x, y) = 2x + 1$. Gradient: $\nabla f = (2x + 2y,\ 2x + 1)$. Hessian:
$$H(f) = \begin{pmatrix} 2 & 2 \\ 2 & 0 \end{pmatrix}$$
Function $f(x, y) = (x^2 y,\ 5x + \sin y)$. Jacobian:
$$J(f) = \begin{pmatrix} 2xy & x^2 \\ 5 & \cos y \end{pmatrix}$$
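
As a sanity check, hand-computed gradients like these can be verified numerically with finite differences. A minimal sketch in Python for the first example above (the function names are ours):

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 + 2*x*y + y                 # f(x, y) = x^2 + 2xy + y

def grad_f(v):
    x, y = v
    return np.array([2*x + 2*y, 2*x + 1])   # the gradient computed above

def numeric_grad(f, v, h=1e-6):
    """Central finite differences: (f(v + h e_i) - f(v - h e_i)) / (2h)."""
    g = np.zeros(len(v))
    for i in range(len(v)):
        e = np.zeros(len(v)); e[i] = h
        g[i] = (f(v + e) - f(v - e)) / (2 * h)
    return g

v = np.array([1.0, -2.0])
print(grad_f(v), numeric_grad(f, v))        # both print approx. [-2.  3.]
```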

Gradient's properties. Linearity: $\nabla(\alpha f + \beta g)(x) = \alpha \nabla f(x) + \beta \nabla g(x)$. Product rule: $\nabla(fg)(x) = f(x)\nabla g(x) + g(x)\nabla f(x)$. Chain rule (IMPORTANT!): if $f\colon \mathbb{R}^n \to \mathbb{R}$ and $g\colon \mathbb{R}^m \to \mathbb{R}^n$, then $\nabla(f \circ g)(x) = J(g(x))^T\, \nabla f(y)$ where $y = g(x)$. If $f$ is as above and $h\colon \mathbb{R} \to \mathbb{R}$, then $\nabla(h \circ f)(x) = h'(f(x))\, \nabla f(x)$.

Convexity. A function is convex if every line segment between two points on its graph lies above or on the graph. A univariate $f$ is convex if $f''(x) \geq 0$ for all $x$; a multivariate $f$ is convex if its Hessian is positive semidefinite, i.e. $z^T H z \geq 0$ for any $z$. A convex function's local minimum is its global minimum.

Preserving convexity. If $f$ is convex and $\lambda > 0$, then $\lambda f$ is convex. If $f$ and $g$ are convex, then $f + g$ is convex. If $f$ is convex and $g$ is affine (i.e. $g(x) = Ax + b$), then $f \circ g$ is convex (N.B. $(f \circ g)(x) = f(Ax + b)$). Let $f(x) = (h \circ g)(x)$ with $g\colon \mathbb{R}^n \to \mathbb{R}$ and $h\colon \mathbb{R} \to \mathbb{R}$; $f$ is convex if either $g$ is convex and $h$ is non-decreasing and convex, or $g$ is concave and $h$ is non-increasing and convex.

Gradient descent

Idea. If $f$ is convex, we should find its minimum by following its negative gradient. But the gradient at $x$ points towards the minimum only locally at $x$. Hence, we need to descend slowly along the negative gradient.

Example. [Figure: contour plots of the objective function with gradient-descent iterates, and plots of the error $q(t) - q^*$ as a function of $t$.]

Gradient descent. Start from a random point $x_0$. At step $n$, update $x_n \leftarrow x_{n-1} - \gamma \nabla f(x_{n-1})$, where $\gamma$ is some small step size. Often $\gamma$ depends on the iteration: $x_n \leftarrow x_{n-1} - \gamma_n \nabla f(x_{n-1})$. With suitable $f$ and step size, this will converge to a local minimum.
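
A minimal sketch of this update rule in Python (names and the fixed step size are our choices, not from the slides):

```python
import numpy as np

def gradient_descent(grad, x0, gamma=0.1, steps=1000, tol=1e-8):
    """Follow the negative gradient with a fixed step size gamma."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # (numerically) stationary point reached
            break
        x = x - gamma * g             # x_n <- x_{n-1} - gamma * grad f(x_{n-1})
    return x

# Example: f(x) = ||x||^2 / 2 has gradient x; the minimum is at the origin
x_min = gradient_descent(lambda x: x, np.array([3.0, -4.0]))
```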

Example: least squares. Given $A \in \mathbb{R}^{n \times m}$ and $b \in \mathbb{R}^n$, find $x \in \mathbb{R}^m$ s.t. $\|Ax - b\|^2/2$ is minimized. This can be solved using the SVD. Alternatively: calculate the gradient of $f_{A,b}(x) = \|Ax - b\|^2/2$ and employ the gradient descent approach. In this case, the step size can be calculated analytically.

Example: the gradient. Let's write it out:
$$\frac{1}{2}\|Ax - b\|^2 = \frac{1}{2}\sum_{i=1}^{n}\big((Ax)_i - b_i\big)^2 = \frac{1}{2}\sum_{i=1}^{n}\Big(\sum_{j=1}^{m} a_{ij}x_j - b_i\Big)^2 = \frac{1}{2}\sum_{i=1}^{n}\Big(\Big(\sum_{j=1}^{m} a_{ij}x_j\Big)^2 - 2b_i\sum_{j=1}^{m} a_{ij}x_j + b_i^2\Big) = \frac{1}{2}\sum_{i=1}^{n}\Big(\sum_{j=1}^{m} a_{ij}x_j\Big)^2 - \sum_{i=1}^{n} b_i \sum_{j=1}^{m} a_{ij}x_j + \frac{1}{2}\sum_{i=1}^{n} b_i^2.$$

Example: the gradient. The partial derivative w.r.t. $x_j$:
$$\frac{\partial}{\partial x_j}\Big(\frac{1}{2}\|Ax - b\|^2\Big) = \frac{\partial}{\partial x_j}\Big(\frac{1}{2}\sum_{i=1}^{n}\Big(\sum_{k=1}^{m} a_{ik}x_k\Big)^2 - \sum_{i=1}^{n} b_i \sum_{k=1}^{m} a_{ik}x_k + \frac{1}{2}\sum_{i=1}^{n} b_i^2\Big).$$
By linearity this equals
$$\frac{1}{2}\sum_{i=1}^{n}\frac{\partial}{\partial x_j}\Big(\sum_{k=1}^{m} a_{ik}x_k\Big)^2 - \sum_{i=1}^{n} b_i \frac{\partial}{\partial x_j}\sum_{k=1}^{m} a_{ik}x_k + \frac{\partial}{\partial x_j}\frac{1}{2}\sum_{i=1}^{n} b_i^2,$$
where the last term is $0$. By the chain rule, and because $\partial(a_{ik}x_k)/\partial x_j = 0$ if $k \neq j$, this becomes
$$\sum_{i=1}^{n}\Big(\sum_{k=1}^{m} a_{ik}x_k\Big)a_{ij} - \sum_{i=1}^{n} b_i a_{ij}.$$

Example: the gradient. Collecting terms:
$$\frac{\partial}{\partial x_j}\Big(\frac{1}{2}\|Ax - b\|^2\Big) = \sum_{i=1}^{n} a_{ij}\Big(\sum_{k=1}^{m} a_{ik}x_k - b_i\Big) = \sum_{i=1}^{n} a_{ij}\big((Ax)_i - b_i\big) \quad \text{(matrix product)} = \big(A^T(Ax - b)\big)_j \quad \text{(another matrix product)}.$$
Hence we have $\nabla\big(\frac{1}{2}\|Ax - b\|^2\big) = A^T(Ax - b)$.

Example: the gradient. The other way: use the chain rule.
$$\nabla\Big(\frac{1}{2}\|Ax - b\|^2\Big) = J(Ax - b)^T\, \nabla\Big(\frac{1}{2}\|y\|^2\Big)\Big|_{y = Ax - b} = A^T(Ax - b).$$
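
Putting the derivation to use: a sketch of gradient descent for least squares in Python. The slides note that the step size can be computed analytically; for this quadratic objective, exact line search along $-g$ gives $\gamma = \|g\|^2 / \|Ag\|^2$ (this closed form is our addition), which the sketch uses:

```python
import numpy as np

def lsq_gradient_descent(A, b, steps=500):
    """Minimize ||Ax - b||^2 / 2 with gradient descent and exact line search."""
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        g = A.T @ (A @ x - b)    # the gradient A^T (Ax - b) derived above
        Ag = A @ g
        denom = Ag @ Ag
        if denom < 1e-15:        # gradient is (numerically) zero: stop
            break
        gamma = (g @ g) / denom  # exact line search for a quadratic objective
        x = x - gamma * g
    return x

rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 10)), rng.normal(size=50)
x = lsq_gradient_descent(A, b)   # compare: np.linalg.lstsq(A, b, rcond=None)[0]
```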

Gradient descent & matrices. How about: given $A$, find small $B$ and $C$ s.t. $\|A - BC\|_F$ is minimized? This is not convex in $B$ and $C$ jointly. Instead: fix some $B$ and solve for $C = \operatorname{argmin}_X \|A - BX\|_F$, then use the found $C$ and solve for $B$, and repeat until convergence.

How to solve for C? $C = \operatorname{argmin}_X \|A - BX\|_F$ still needs some work. Write the (squared) norm as a sum of column-wise errors: $\|A - BX\|_F^2 = \sum_j \|a_j - Bx_j\|_2^2$. Now the problem is a series of standard least-squares problems, each of which can be solved independently.
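
In NumPy the column-wise problems need not even be solved in an explicit loop, since np.linalg.lstsq accepts a matrix of right-hand sides; a minimal sketch (the function name is ours):

```python
import numpy as np

def solve_C(A, B):
    """argmin_X ||A - BX||_F: each column x_j of X solves min ||a_j - B x_j||_2;
    np.linalg.lstsq solves all of them at once when given the columns of A
    as multiple right-hand sides."""
    C, *_ = np.linalg.lstsq(B, A, rcond=None)
    return C
```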

How to select the step size? Recall: $x_n \leftarrow x_{n-1} - \gamma_n \nabla f(x_{n-1})$. Selecting the correct $\gamma_n$ for each $n$ is crucial. Methods for finding the optimal step size are often slow (e.g. line search). A wrong step size can lead to non-convergence.

Stochastic gradient descent

Basic idea. With gradient descent, we need to calculate the gradient $\nabla_c \|a - Bc\|^2$ many times for different columns $a$ in each iteration. Instead, we can fix one element $a_{ij}$ and update the $i$th row of $B$ and the $j$th column of $C$ accordingly. When we choose $a_{ij}$ randomly, this is stochastic gradient descent (SGD).

Local gradient. With fixed $a_{ij}$, the residual is $a_{ij} - (BC)_{ij} = a_{ij} - \sum_k b_{ik}c_{kj}$. The local gradient of the squared residual w.r.t. $b_{ik}$ is $-2c_{kj}\big(a_{ij} - (BC)_{ij}\big)$, and similarly for $c_{kj}$. This allows us to update the factors by computing only one local gradient. The gradient needs to be sufficiently scaled.

SGD process. Initialize with random $B$ and $C$; repeat: pick a random element $(i, j)$, and update a row of $B$ and a column of $C$ using the local gradients w.r.t. $a_{ij}$. [Figure: contour plot of the objective with the path of the SGD iterates, and the error $q(t) - q^*$ over time.]
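
A minimal sketch of this SGD process in Python (names, step size, and initialization scale are our choices; for simplicity it assumes a fully observed $A$, although SGD also works with partial observations):

```python
import numpy as np

def sgd_factorize(A, k, gamma=0.01, iters=200_000, seed=0):
    """One random entry a_ij per step; update row i of B and column j of C."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    B = rng.normal(scale=0.1, size=(n, k))
    C = rng.normal(scale=0.1, size=(k, m))
    for _ in range(iters):
        i, j = rng.integers(n), rng.integers(m)
        r = A[i, j] - B[i, :] @ C[:, j]      # residual at the chosen entry
        b_old = B[i, :].copy()               # use the pre-update row for C's step
        B[i, :] += gamma * 2 * r * C[:, j]   # step along -grad of r^2 w.r.t. B
        C[:, j] += gamma * 2 * r * b_old     # ... and w.r.t. C
    return B, C
```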

SGD pros and cons. Each iteration is faster to compute, but a single step can increase the error. SGD does not need to know all elements of the input data, which helps scalability and allows partially observed matrices (e.g. collaborative filtering). The step size still needs to be chosen carefully.

Newton's method

Basic idea. Iterative update rule: $x_{n+1} \leftarrow x_n - [H(f)(x_n)]^{-1} \nabla f(x_n)$, assuming the Hessian exists and is invertible. This takes curvature information into account.
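
A minimal sketch of this update in Python (names ours); note that it solves the linear system $H d = \nabla f$ rather than explicitly inverting the Hessian:

```python
import numpy as np

def newton(grad, hess, x0, steps=50, tol=1e-10):
    """x_{n+1} <- x_n - H(x_n)^{-1} grad(x_n), via a linear solve."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)  # solve H d = g instead of inverting H
    return x
```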

Pros and cons. Much faster convergence. But the Hessian is slow to compute and takes lots of memory. Quasi-Newton methods (e.g. L-BFGS) approximate the Hessian indirectly. Often a step size other than 1 is still needed.

Alternating least squares

Basic idea. Given $A$ and $B$, we can find the $C$ that minimizes $\|A - BC\|_F$. In gradient descent, we move only slightly towards this $C$; in alternating least squares (ALS), we replace $C$ with the new one.

Basic ALS algorithm. Given $A$, sample a random $B$. Repeat until convergence: $C \leftarrow \operatorname{argmin}_X \|A - BX\|_F$; $B \leftarrow \operatorname{argmin}_X \|A - XC\|_F$.
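
A minimal sketch of this loop in Python (names and the fixed iteration count are ours; a real implementation would test for convergence instead):

```python
import numpy as np

def als(A, k, iters=100, seed=0):
    """Alternate between the two least-squares subproblems."""
    rng = np.random.default_rng(seed)
    B = rng.normal(size=(A.shape[0], k))
    C = np.zeros((k, A.shape[1]))
    for _ in range(iters):
        C, *_ = np.linalg.lstsq(B, A, rcond=None)       # C <- argmin ||A - BX||_F
        Bt, *_ = np.linalg.lstsq(C.T, A.T, rcond=None)  # B <- argmin ||A - XC||_F
        B = Bt.T                                        # (solved via transposes)
    return B, C
```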

ALS pros and cons. Can have faster convergence than gradient descent (or SGD). The update is slower to compute than in SGD, and about as fast as in gradient descent. Requires fully observed matrices.

Adding constraints

The problem setting. So far, we have done unconstrained optimization. What if we have constraints on the optimal solution? E.g. all matrices must be nonnegative. In general, the above approaches won't admit these constraints.

General case. Minimize $f(x)$ subject to $g_i(x) \leq 0$, $i = 1, \ldots, m$, and $h_j(x) = 0$, $j = 1, \ldots, k$. Assuming certain regularity conditions, there exist multipliers $\mu_i$ ($i = 1, \ldots, m$) and $\lambda_j$ ($j = 1, \ldots, k$) that satisfy the Karush-Kuhn-Tucker (KKT) conditions.

KKT conditions. Let $x^*$ be the optimal solution. Stationarity: $\nabla f(x^*) + \sum_i \mu_i \nabla g_i(x^*) + \sum_j \lambda_j \nabla h_j(x^*) = 0$. Primal feasibility: $g_i(x^*) \leq 0$ for all $i = 1, \ldots, m$ and $h_j(x^*) = 0$ for all $j = 1, \ldots, k$. Dual feasibility: $\mu_i \geq 0$ for all $i = 1, \ldots, m$. Complementary slackness: $\mu_i g_i(x^*) = 0$ for all $i = 1, \ldots, m$.
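
A small worked example (ours, not from the slides): minimize $f(x) = x^2$ subject to $x \geq 1$, i.e. $g(x) = 1 - x \leq 0$. Stationarity gives $2x^* - \mu = 0$; complementary slackness gives $\mu(1 - x^*) = 0$. If $\mu = 0$, then $x^* = 0$, which violates primal feasibility, so $1 - x^* = 0$, giving $x^* = 1$ and $\mu = 2 \geq 0$ (dual feasibility holds). Hence the constrained minimum is at $x^* = 1$.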

When do the KKT conditions hold? The KKT conditions hold under certain regularity conditions. E.g. all $g_i$ and $h_j$ are affine; or $f$ is convex and there exists an $x$ s.t. $h_j(x) = 0$ and $g_i(x) < 0$ for all $i, j$ (Slater's condition). Nonnegativity is an example of a linear (hence affine) constraint.

What to do with the KKT conditions? $\mu$ and $\lambda$ are new unknown variables that must be optimized together with $x$. The conditions appear in the optimization, e.g. in the gradient. The KKT conditions are rarely solved directly.

Summary. There are many methods for optimization; we only scratched the surface. The methods are often based on gradients, which can lead to ugly equations. Next week: applying these techniques to finding nonnegative factorizations. Stay tuned!