Convex Optimization
Lecture 15 - Gradient Descent in Machine Learning
Instructor: Yuanzhang Xiao
University of Hawaii at Manoa
Fall 2017
Today's Lecture
1. Motivation
2. Subgradient Method
3. Stochastic Subgradient Method
Outline
1. Motivation
2. Subgradient Method
3. Stochastic Subgradient Method
Why Gradient Descent in Machine Learning?
- in theory: Newton methods are much faster than gradient descent
- in practice, especially in machine learning with big data, gradient descent (and its variations) is what gets used
- why and how to use gradient descent in machine learning tasks?
Example: Least Squares
- least squares: minimize ||Ax − b||_2^2
- assuming A^T A is invertible, we have the analytical solution: x = (A^T A)^{-1} A^T b
- this is the building block of Newton's method (i.e., the inverse of ∇^2 f(x))
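A minimal NumPy sketch of this analytical solution on a small, made-up instance (the sizes, data, and variable names are illustrative, not from the slides):

```python
import numpy as np

# small illustrative instance (not the big-data sizes discussed next)
rng = np.random.default_rng(0)
m, n = 200, 50
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# analytical solution x = (A^T A)^{-1} A^T b, assuming A^T A is invertible;
# solve the normal equations rather than forming the inverse explicitly
x_star = np.linalg.solve(A.T @ A, A.T @ b)

# sanity check against NumPy's built-in least-squares solver
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_star, x_ref))        # expected: True
```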
Example: Least Squares
- with big data, we can have A ∈ R^{m×n} where m ≈ 10^6 and n ≈ 10^5
- (A^T A)^{-1} is then 10^5 × 10^5, which takes about 74 GB of memory
- sparsity of A does not help, because (A^T A)^{-1} can be dense
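The 74 GB figure is just the cost of storing a dense 10^5 × 10^5 matrix of double-precision entries:

```latex
\underbrace{10^{5} \times 10^{5}}_{\text{entries of }(A^TA)^{-1}}
\times \underbrace{8\ \text{bytes}}_{\text{double precision}}
= 8 \times 10^{10}\ \text{bytes} \approx 74.5\ \text{GiB}.
```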
Example: Least Squares
- in gradient descent, we do x^(k+1) = x^(k) − t^(k) ∇f(x^(k)) = x^(k) − t^(k) (A^T A x^(k) − A^T b)
- need to store:
  A ∈ R^{m×n}: ≈ 740 MB of memory if the density of A is 10^{-3}
  A^T b ∈ R^n: ≈ 0.7 MB of memory
  A x^(k) ∈ R^m: ≈ 7 MB of memory
  A^T (A x^(k)) ∈ R^n: ≈ 0.7 MB of memory
- memory needed in gradient descent: ≈ 740 MB, compared to 74 GB for the analytical solution or Newton's method
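A sketch of this iteration with a sparse A using scipy.sparse, scaled down from the slide's sizes; the fixed step size below is an illustrative (conservative) choice, not a rule from the lecture:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import norm as sparse_norm

rng = np.random.default_rng(0)
m, n = 10_000, 1_000                          # scaled-down stand-ins for 10^6 and 10^5
A = sp.random(m, n, density=1e-3, format="csr", random_state=0)
b = rng.standard_normal(m)

Atb = A.T @ b                                 # A^T b, computed once and reused
x = np.zeros(n)
t = 1.0 / sparse_norm(A) ** 2                 # illustrative fixed step size

for k in range(500):
    grad = A.T @ (A @ x) - Atb                # A^T A is never formed explicitly
    x = x - t * grad
```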
Example: Least Squares
- least squares: minimize ||Ax − b||_2^2
- with m = 500,000, n = 400,000, and the density of A equal to 10^{-3}
Outline
1. Motivation
2. Subgradient Method
3. Stochastic Subgradient Method
Motivation: Objective is Not Differentiable
- least squares with ℓ1 regularization: minimize ||Ax − b||_2^2 + ||x||_1
- do not want to introduce additional variables (see the reformulation sketched below): with n = 10^5, that means another 10^5 variables
- do not want to use Newton's method
- can we use variations of gradient descent?
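For reference, the "additional variables" refer to the standard reformulation that removes the nondifferentiable ℓ1 term by introducing t ∈ R^n (a textbook trick, not spelled out on the slide):

```latex
\begin{array}{ll}
\text{minimize}   & \|Ax-b\|_2^2 + \mathbf{1}^T t \\
\text{subject to} & -t_i \le x_i \le t_i, \quad i = 1,\dots,n,
\end{array}
```

which is smooth apart from linear constraints, but adds another n = 10^5 variables — exactly what the slide wants to avoid.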
Subgradient
- g is a subgradient of f at x if f(y) ≥ f(x) + g^T (y − x) for all y
- the affine function f(x) + g^T (y − x) is a global underestimator of f
- (figure: g_1 is a subgradient at x_1; g_2 and g_3 are subgradients at x_2)
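A one-dimensional example (not the one drawn in the slide's figure): for f(x) = |x|, the set of subgradients is

```latex
\partial f(x) =
\begin{cases}
\{1\},   & x > 0,\\
[-1,1],  & x = 0,\\
\{-1\},  & x < 0,
\end{cases}
\qquad \text{since } |y| \ge 0 + g\,(y - 0) \ \text{for all } y \ \text{whenever } g \in [-1,1].
```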
Descent Directions
- the negative subgradient −g may not be a descent direction for nondifferentiable f (see the worked check below)
- example: f(x) = |x_1| + 2 |x_2|
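A worked check at the illustrative point x = (1, 0) (not necessarily the point used in the lecture figure): g = (1, 2) is a subgradient there, since ∂f(1, 0) = {1} × [−2, 2], yet stepping along −g increases f:

```latex
f\big((1,0) - t\,(1,2)\big) = |1 - t| + 2\,|{-2t}| = 1 + 3t > 1 = f(1,0)
\qquad \text{for } 0 < t < 1,
```

so −g is not a descent direction even though g is a valid subgradient.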
Subgradient Method
- subgradient method: x^(k+1) = x^(k) − α_k g^(k)
- g^(k) is any subgradient of f at x^(k)
- not a descent method; need to keep track of the best point: f_best^(k) = min_{i=1,...,k} f(x^(i))
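A minimal sketch of the method with best-point tracking; `f`, `subgrad`, and `step` are placeholder callables supplied by the caller, not names from the lecture:

```python
import numpy as np

def subgradient_method(f, subgrad, x0, step, num_iters=1000):
    """Run x^(k+1) = x^(k) - alpha_k g^(k), tracking the best point seen.

    f       : callable, returns f(x)
    subgrad : callable, returns any subgradient of f at x
    step    : callable, k -> alpha_k (k starts at 1)
    """
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    history = [f_best]
    for k in range(1, num_iters + 1):
        x = x - step(k) * subgrad(x)
        fx = f(x)
        if fx < f_best:                   # not a descent method: f(x) may go up
            x_best, f_best = x.copy(), fx
        history.append(f_best)            # f_best^(k) = min_{i<=k} f(x^(i))
    return x_best, f_best, history
```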
Subgradient Method: Step Size Rules
- not a descent method, so no backtracking line search; step sizes are fixed ahead of run time
- constant step size: α_k = α
- square summable but not summable: ∑_{k=1}^∞ α_k^2 < ∞, ∑_{k=1}^∞ α_k = ∞; example: α_k = 1/k
- diminishing, not summable: lim_{k→∞} α_k = 0, ∑_{k=1}^∞ α_k = ∞; example: α_k = 1/√k
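The three rules, written as `step` arguments for the sketch above (the constant 0.01 is an arbitrary illustrative value):

```python
constant        = lambda k: 0.01                # alpha_k = alpha
square_summable = lambda k: 1.0 / k             # sum alpha_k^2 < inf, sum alpha_k = inf
diminishing     = lambda k: 1.0 / np.sqrt(k)    # alpha_k -> 0, sum alpha_k = inf
```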
Example: Piecewise Linear Minimization
- piecewise linear minimization with m = 100, n = 20: minimize max_{i=1,...,m} (a_i^T x + b_i)
- constant step size: (plot omitted)
Example: Piecewise Linear Minimization
- piecewise linear minimization with m = 100, n = 20: minimize max_{i=1,...,m} (a_i^T x + b_i)
- diminishing step size: (plot omitted)
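For f(x) = max_i (a_i^T x + b_i), a subgradient at x is a_j for any index j attaining the max, so the method can be run as below; the random data and step-size choice are illustrative, not the instance behind the lecture's plots:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 20
A = rng.standard_normal((m, n))              # row i holds a_i^T
b = rng.standard_normal(m)

f = lambda x: np.max(A @ x + b)

def subgrad(x):
    j = np.argmax(A @ x + b)                 # an index attaining the max
    return A[j]                              # a_j is a subgradient of f at x

# reuses subgradient_method from the sketch after the step-size slide
x_best, f_best, hist = subgradient_method(
    f, subgrad, x0=np.zeros(n), step=lambda k: 1.0 / np.sqrt(k), num_iters=3000
)
```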
Outline
1. Motivation
2. Subgradient Method
3. Stochastic Subgradient Method
Stochastic Subgradient Method
- least squares reformulation: minimize ∑_{i=1}^m (a_i^T x − b_i)^2
- more generally, consider the problem: minimize ∑_{i=1}^m f_i(x)
- stochastic subgradient method: x^(k+1) = x^(k) − α_k g_i^(k)
- g_i^(k) is any subgradient of a randomly chosen f_i at x^(k)
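A sketch for the least-squares sum form: each term f_i(x) = (a_i^T x − b_i)^2 has gradient 2 a_i (a_i^T x − b_i), so each update touches only one randomly chosen row (the sizes and the 1/k step are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10_000, 100
A = rng.standard_normal((m, n))          # row i holds a_i^T
b = rng.standard_normal(m)

x = np.zeros(n)
for k in range(1, 50_001):
    i = rng.integers(m)                  # choose f_i uniformly at random
    g = 2.0 * (A[i] @ x - b[i]) * A[i]   # gradient of f_i(x) = (a_i^T x - b_i)^2
    x = x - (1.0 / k) * g                # diminishing step size alpha_k = 1/k
```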
Advantages of Stochastic Subgradient Method
- even less memory: in the LS example, we only need to store a_i ∈ R^{100,000} (≈ 0.7 MB)
- sometimes data arrive sequentially: update x after each data point a_i arrives
Example: Piecewise Linear Minimization
- piecewise linear minimization with m = 100, n = 20: minimize max_{i=1,...,m} (a_i^T x + b_i)
- convergence: (plot omitted)
Example: Piecewise Linear Minimization
- piecewise linear minimization with m = 100, n = 20: minimize max_{i=1,...,m} (a_i^T x + b_i)
- empirical distribution of f_best^(k) − f*: (plot omitted)