Chapter 2: Optimization
Gradients, convexity, and ALS
Contents
- Background
- Gradient descent
- Stochastic gradient descent
- Newton's method
- Alternating least squares
- KKT conditions
Motivation
- We can solve basic least-squares linear systems using the SVD
- But what if we have
  - missing values in the data
  - extra constraints for feasible solutions
  - more complex optimization problems (e.g. regularizers)
  - etc.
Gradients, Hessians, and convexity
Derivatives and local optima
- The derivative of a function f: ℝ → ℝ, denoted f′, describes its rate of change
- If it exists,
    f′(x) = lim_{h→0} [f(x + h) − f(x)] / h
- The second derivative f″ is the rate of change of the rate of change
Derivatives and local optima
- A stationary point of a differentiable f is a point x s.t. f′(x) = 0
- f achieves its extreme values at stationary points, at points where the derivative doesn't exist, or at infinity (Fermat's theorem)
- Whether a stationary point is a (local) maximum or minimum can be seen from the second derivative (if it exists)
Partial derivative
- If f is multivariate (e.g. f: ℝ² → ℝ), we can consider it as a family of univariate functions
- E.g. f(x, y) = x² + y gives the families f_x(y) = x² + y (x fixed) and f_y(x) = x² + y (y fixed)
- A partial derivative w.r.t. one variable keeps the other variables constant:
    ∂f/∂x (x, y) = f_y′(x) = 2x
Gradient
- The gradient is the derivative for multivariate functions f: ℝⁿ → ℝ:
    ∇f = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ)
- Here (and later), we assume that the derivatives exist
- The gradient is a function ∇f: ℝⁿ → ℝⁿ
- ∇f(x) points in the direction of steepest ascent of f at point x
Gradient
[Figure illustrating the gradient of a function]
Hessian
- The Hessian is the square matrix of all second-order partial derivatives of a function f: ℝⁿ → ℝ, i.e. the n × n matrix with entries
    H(f)ᵢⱼ = ∂²f/∂xᵢ∂xⱼ
- As usual, we assume the derivatives exist
Jacobian matrix
- If f: ℝᵐ → ℝⁿ, then its Jacobian (matrix) J is the n × m matrix of partial derivatives with entries
    Jᵢⱼ = ∂fᵢ/∂xⱼ
- The Jacobian gives the best linear approximation of f
- The Hessian is the Jacobian of the gradient: H(f)(x) = J(∇f)(x)ᵀ
Examples
- Function f(x, y) = x² + 2xy + y
  - Partial derivatives: ∂f/∂x = 2x + 2y, ∂f/∂y = 2x + 1
  - Gradient: ∇f = (2x + 2y, 2x + 1)
  - Hessian: H(f) = [2 2; 2 0]
- Function f(x, y) = (x²y, 5x + sin y)
  - Jacobian: J(f) = [2xy x²; 5 cos y]
Gradient's properties
- Linearity: ∇(αf + βg)(x) = α∇f(x) + β∇g(x)
- Product rule: ∇(fg)(x) = f(x)∇g(x) + g(x)∇f(x)
- Chain rule: IMPORTANT!
  - If f: ℝⁿ → ℝ and g: ℝᵐ → ℝⁿ, then ∇(f ∘ g)(x) = J(g)(x)ᵀ ∇f(y), where y = g(x)
  - If f is as above and h: ℝ → ℝ, then ∇(h ∘ f)(x) = h′(f(x)) ∇f(x)
Convexity
- A function is convex if every line segment between two points on its graph lies on or above the graph
- A univariate f is convex if f″(x) ≥ 0 for all x
- A multivariate f is convex if its Hessian is positive semidefinite, i.e. zᵀHz ≥ 0 for all z
- A convex function's local minimum is its global minimum
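The multivariate condition can be checked numerically. A minimal sketch (assuming NumPy) that tests positive semidefiniteness via the eigenvalues of the Hessian, using the example Hessian computed earlier:

```python
import numpy as np

def is_psd(H, tol=1e-10):
    """A symmetric matrix is PSD iff all of its eigenvalues are >= 0."""
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

# Hessian of f(x, y) = x^2 + 2xy + y from the earlier example
H = np.array([[2.0, 2.0],
              [2.0, 0.0]])
print(is_psd(H))   # eigenvalues 1 +- sqrt(5): one is negative, so f is not convex

# Hessian of g(x, y) = x^2 + y^2, which is convex
print(is_psd(np.array([[2.0, 0.0],
                       [0.0, 2.0]])))
```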
Preserving convexity
- If f is convex and λ ≥ 0, then λf is convex
- If f and g are convex, then f + g is convex
- If f is convex and g is affine (i.e. g(x) = Ax + b), then f ∘ g is convex (N.B. (f ∘ g)(x) = f(Ax + b))
- Let f(x) = (h ∘ g)(x) with g: ℝⁿ → ℝ and h: ℝ → ℝ; f is convex if
  - g is convex and h is non-decreasing and convex, or
  - g is concave and h is non-increasing and convex
Gradient descent
Idea
- If f is convex, we can find its minimum by following its negative gradient
- But the gradient at x is only local information: it points in the direction of steepest descent only near x
- Hence, we need to descend slowly along the negative gradient
Example
[Figure: contour plots of two objective functions with gradient-descent iterates, and the error ‖q(t) − q*‖ plotted as a function of t for each]
Gradient descent
- Start from a random point x₀
- At step n, update xₙ ← xₙ₋₁ − γ∇f(xₙ₋₁)
  - γ is some small step size
- Often, γ depends on the iteration: xₙ ← xₙ₋₁ − γₙ∇f(xₙ₋₁)
- With suitable f and step sizes, this converges to a local minimum
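The update rule can be sketched in a few lines of NumPy (the test function and step size below are illustrative choices, not from the slides):

```python
import numpy as np

def gradient_descent(grad, x0, gamma=0.1, n_iter=1000):
    """Iterate x_n <- x_{n-1} - gamma * grad(x_{n-1}) with a fixed step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - gamma * grad(x)
    return x

# Example: f(x) = (x - 3)^2 has gradient 2(x - 3) and minimum at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=[0.0])
print(x_min)   # converges to approximately [3.0]
```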
Example: least squares
- Given A ∈ ℝⁿˣᵐ and b ∈ ℝⁿ, find x ∈ ℝᵐ s.t. ‖Ax − b‖²/2 is minimized
  - Can be solved using the SVD
- Calculate the gradient of f_{A,b}(x) = ‖Ax − b‖²/2
- Employ the gradient descent approach
- In this case, the step size can be calculated analytically
Example: the gradient
- Let's write the norm out (sums over i = 1, …, n and j = 1, …, m):
    (1/2)‖Ax − b‖² = (1/2) Σᵢ ((Ax)ᵢ − bᵢ)²
                   = (1/2) Σᵢ (Σⱼ aᵢⱼxⱼ − bᵢ)²
                   = (1/2) Σᵢ ((Σⱼ aᵢⱼxⱼ)² − 2bᵢ Σⱼ aᵢⱼxⱼ + bᵢ²)
                   = (1/2) Σᵢ (Σⱼ aᵢⱼxⱼ)² − Σᵢ bᵢ Σⱼ aᵢⱼxⱼ + (1/2) Σᵢ bᵢ²

Example: the gradient
- The partial derivative w.r.t. xⱼ:
    ∂/∂xⱼ [(1/2)‖Ax − b‖²]
      = ∂/∂xⱼ [(1/2) Σᵢ (Σₖ aᵢₖxₖ)² − Σᵢ bᵢ Σₖ aᵢₖxₖ + (1/2) Σᵢ bᵢ²]
      = (1/2) Σᵢ ∂/∂xⱼ (Σₖ aᵢₖxₖ)² − Σᵢ bᵢ ∂/∂xⱼ Σₖ aᵢₖxₖ + 0          (linearity)
      = Σᵢ (Σₖ aᵢₖxₖ) aᵢⱼ − Σᵢ bᵢ aᵢⱼ                                  (chain rule; ∂xₖ/∂xⱼ = 0 if k ≠ j)

Example: the gradient
- Collecting terms:
    ∂/∂xⱼ [(1/2)‖Ax − b‖²] = Σᵢ aᵢⱼ (Σₖ aᵢₖxₖ − bᵢ)
                            = Σᵢ aᵢⱼ ((Ax)ᵢ − bᵢ)      (matrix product)
                            = (Aᵀ(Ax − b))ⱼ            (another matrix product)
- Hence we have: ∇[(1/2)‖Ax − b‖²] = Aᵀ(Ax − b)
Example: the gradient
- The other way: use the chain rule
    ∇[(1/2)‖Ax − b‖²] = J(Ax − b)ᵀ ∇[(1/2)‖y‖²] |_{y = Ax − b} = Aᵀ(Ax − b)
  (since J(Ax − b) = A and ∇[(1/2)‖y‖²] = y)
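Putting the pieces together: a sketch (assuming NumPy; the matrix sizes are arbitrary) of gradient descent for least squares. For this particular f, the step size is available in closed form: minimizing f(x − γg) over γ gives γ = ‖g‖²/‖Ag‖², where g = ∇f(x).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
b = rng.standard_normal(50)

def grad(x):
    """Gradient of f(x) = ||Ax - b||^2 / 2, i.e. A^T (Ax - b)."""
    return A.T @ (A @ x - b)

x = np.zeros(5)
for _ in range(200):
    g = grad(x)
    gamma = (g @ g) / (g @ (A.T @ (A @ g)))   # exact line search for this f
    x = x - gamma * g

# Agrees with the SVD-based least-squares solution
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x, x_ls, atol=1e-6))
```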
Gradient descent & matrices
- How about: given A, find small B and C s.t. ‖A − BC‖_F is minimized?
- Not convex in B and C jointly
- Fix some B and solve for C: C = argmin_X ‖A − BX‖_F
- Use the found C and solve for B, and repeat until convergence
How to solve for C?
- C = argmin_X ‖A − BX‖_F still needs some work
- Write the squared norm as a sum of column-wise errors: ‖A − BX‖²_F = Σⱼ ‖aⱼ − Bxⱼ‖²
- Now the problem is a series of standard least-squares problems
- Each can be solved independently
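A sketch of the column-wise solve (assuming NumPy; the sizes are arbitrary). `np.linalg.lstsq` solves each standard least-squares problem, and in fact handles all columns in one call:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 15))
B = rng.standard_normal((20, 4))

# One standard least-squares problem per column: x_j = argmin_x ||a_j - B x||
C_cols = np.column_stack([np.linalg.lstsq(B, A[:, j], rcond=None)[0]
                          for j in range(A.shape[1])])

# Since the columns are independent, lstsq can solve them all at once
C_all, *_ = np.linalg.lstsq(B, A, rcond=None)
print(np.allclose(C_cols, C_all))   # the two approaches agree
```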
How to select the step size?
- Recall: xₙ ← xₙ₋₁ − γₙ∇f(xₙ₋₁)
- Selecting the correct γₙ for each n is crucial
- Methods for finding the optimal step size are often slow (e.g. line search)
- A wrong step size can lead to non-convergence
Stochastic gradient descent
Basic idea
- With gradient descent, we need to calculate the gradient of ‖aⱼ − Bcⱼ‖² many times, for each column j, in every iteration
- Instead we can fix one element aᵢⱼ and update the ith row of B and the jth column of C accordingly
- When we choose aᵢⱼ randomly, this is stochastic gradient descent (SGD)
Local gradient
- With fixed aᵢⱼ, the error is aᵢⱼ − (BC)ᵢⱼ = aᵢⱼ − Σₖ bᵢₖcₖⱼ
- The local gradient w.r.t. bᵢₖ is −2cₖⱼ(aᵢⱼ − (BC)ᵢⱼ)
  - Similarly for cₖⱼ
- This allows us to update the factors by computing only one local gradient
- The gradient needs to be sufficiently scaled (step size)
SGD process
- Initialize with random B and C
- repeat:
  - Pick a random element (i, j)
  - Update the ith row of B and the jth column of C using the local gradients w.r.t. aᵢⱼ
[Figure: contour plot of SGD iterates]
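A sketch of the process (assuming NumPy; the rank, step size, and iteration count are illustrative, and the data is synthetic low-rank so the factorization can fit it):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, k = 30, 20, 3
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, m))  # rank-k target

B = 0.1 * rng.standard_normal((n, k))
C = 0.1 * rng.standard_normal((k, m))
gamma = 0.01                                   # fixed step size

err0 = np.linalg.norm(A - B @ C) / np.linalg.norm(A)
for _ in range(200_000):
    i, j = rng.integers(n), rng.integers(m)    # pick a random element (i, j)
    e = A[i, j] - B[i, :] @ C[:, j]            # error at that element
    # Step against the local gradients (-2*e*C[:, j] for row i of B, etc.)
    B[i, :], C[:, j] = (B[i, :] + gamma * e * C[:, j],
                        C[:, j] + gamma * e * B[i, :])

err = np.linalg.norm(A - B @ C) / np.linalg.norm(A)
print(err < err0)   # the relative error decreased
```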
SGD pros and cons
- Each iteration is faster to compute
  - But a single update can increase the error
- Does not need to know all elements of the input data
  - Scalability
  - Partially observed matrices (e.g. collaborative filtering)
- The step size still needs to be chosen carefully
Newton's method
Basic idea
- Iterative update rule: xₙ₊₁ ← xₙ − [H(f)(xₙ)]⁻¹ ∇f(xₙ)
  - Assuming the Hessian exists and is invertible
- Takes curvature information into account
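A sketch of the update (assuming NumPy; the test function is an illustrative smooth convex choice):

```python
import numpy as np

def newton(grad, hess, x0, n_iter=20):
    """Newton's method: x_{n+1} <- x_n - H(f)(x_n)^{-1} grad f(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - np.linalg.solve(hess(x), grad(x))   # solve, don't invert
    return x

# f(x, y) = exp(x) + exp(-x) + y^2 is convex with minimum at (0, 0)
grad = lambda v: np.array([np.exp(v[0]) - np.exp(-v[0]), 2.0 * v[1]])
hess = lambda v: np.array([[np.exp(v[0]) + np.exp(-v[0]), 0.0],
                           [0.0, 2.0]])
x_min = newton(grad, hess, [1.0, 1.0])
print(x_min)   # approximately [0, 0] after a few iterations
```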
Pros and cons
- Much faster convergence
- But the Hessian is slow to compute and takes lots of memory
  - Quasi-Newton methods (e.g. L-BFGS) approximate the Hessian indirectly
- Often still needs a step size other than 1
Alternating least squares
Basic idea
- Given A and B, we can find the C that minimizes ‖A − BC‖_F
- In gradient descent, we move slightly towards this C
- In alternating least squares (ALS), we replace C with the new one
Basic ALS algorithm
- Given A, sample a random B
- repeat until convergence:
  - C ← argmin_X ‖A − BX‖_F
  - B ← argmin_X ‖A − XC‖_F
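A sketch of the algorithm (assuming NumPy; synthetic exact-rank data so ALS can reach a near-zero residual, and a fixed iteration count in place of a convergence check):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, k = 40, 25, 5
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, m))  # rank-k target

B = rng.standard_normal((n, k))                # sample a random B
for _ in range(30):                            # "until convergence", fixed here
    # C <- argmin_X ||A - BX||_F
    C = np.linalg.lstsq(B, A, rcond=None)[0]
    # B <- argmin_X ||A - XC||_F, via the transposed problem ||A^T - C^T X^T||_F
    B = np.linalg.lstsq(C.T, A.T, rcond=None)[0].T

err = np.linalg.norm(A - B @ C) / np.linalg.norm(A)
print(err < 1e-8)   # exact-rank data: the residual is essentially zero
```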
ALS pros and cons
- Can have faster convergence than gradient descent (or SGD)
- The update is slower to compute than in SGD
  - About as fast as in gradient descent
- Requires fully-observed matrices
Adding constraints
The problem setting
- So far, we have done unconstrained optimization
- What if we have constraints on the optimal solution?
  - E.g. all matrices must be nonnegative
- In general, the above approaches won't handle these constraints
General case
- Minimize f(x)
  subject to gᵢ(x) ≤ 0, i = 1, …, m
             hⱼ(x) = 0, j = 1, …, k
- Assuming certain regularity conditions, there exist multipliers μᵢ (i = 1, …, m) and λⱼ (j = 1, …, k) that satisfy the Karush–Kuhn–Tucker (KKT) conditions
KKT conditions
- Let x* be the optimal solution
- Stationarity: ∇f(x*) + Σᵢ μᵢ∇gᵢ(x*) + Σⱼ λⱼ∇hⱼ(x*) = 0
- Primal feasibility:
  - gᵢ(x*) ≤ 0 for all i = 1, …, m
  - hⱼ(x*) = 0 for all j = 1, …, k
- Dual feasibility: μᵢ ≥ 0 for all i = 1, …, m
- Complementary slackness: μᵢgᵢ(x*) = 0 for all i = 1, …, m
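A tiny worked example (hypothetical, not from the slides): minimize f(x) = x² subject to x ≥ 1, written as g(x) = 1 − x ≤ 0. The optimum is x* = 1, and stationarity in the form ∇f + μ∇g = 0 determines the multiplier μ = 2; the assertions check all four conditions.

```python
# Minimize f(x) = x^2 subject to g(x) = 1 - x <= 0 (i.e. x >= 1)
x_star = 1.0          # the constrained optimum
grad_f = 2 * x_star   # f'(x) = 2x
grad_g = -1.0         # g'(x) = -1
mu = 2.0              # stationarity grad_f + mu * grad_g = 0 forces mu = 2

assert abs(grad_f + mu * grad_g) < 1e-12   # stationarity
assert 1 - x_star <= 0                     # primal feasibility: g(x*) <= 0
assert mu >= 0                             # dual feasibility
assert abs(mu * (1 - x_star)) < 1e-12      # complementary slackness
print("all KKT conditions hold")
```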
When do the KKT conditions hold?
- The KKT conditions hold under certain regularity conditions
  - E.g. all gᵢ and hⱼ are affine
  - Or f and the gᵢ are convex, the hⱼ are affine, and there exists x s.t. h(x) = 0 and gᵢ(x) < 0 (Slater's condition)
- Nonnegativity is an example of a linear (hence, affine) constraint
What to do with the KKT conditions?
- μ and λ are new unknown variables
  - Must be optimized together with x
- The conditions appear in the optimization
  - E.g. in the gradient
- The KKT conditions are rarely solved directly
Summary
- There are many methods for optimization
  - We only scratched the surface
- Methods are often based on gradients
  - Can lead to ugly equations
- Next week: applying these techniques to finding nonnegative factorizations
  - Stay tuned!