Optimization for Learning and Big Data


1 Optimization for Learning and Big Data Donald Goldfarb Department of IEOR Columbia University Department of Mathematics Distinguished Lecture Series May 17-19, 2016.

2 Lecture 1. First-Order Methods for Convex Optimization Lecture 2. Stochastic Quasi-Newton Methods for Machine Learning Lecture 3. Optimization for Tensor Models

3 In whatever happens in the world, one can find the concept of maximum or minimum; hence there is no doubt that all phenomena in nature can be explained via the maximum and minimum method... Euler, Leonhard (1744)

4 Lec 1. Unsupervised Learning Background - Foreground Separation in Videos

5 Lec 2. Supervised Learning Classification

6 Lec 3 Unsupervised Learning Topic Model

7 First-Order Methods for Convex Optimization Convex Functions - Basic Definitions Proximal Algorithms Augmented Lagrangian Method (of Multipliers) Alternating Direction Method of Multipliers (ADMM) Conditional Gradient (Frank-Wolfe) Method

8 Convex Functions
f : R^n → R ∪ {+∞}, dom(f) = {x ∈ R^n : f(x) < +∞}
f is proper: dom(f) ≠ ∅
f is (strictly) convex if f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y) for all λ ∈ [0, 1] (with strict inequality for λ ∈ (0, 1))
f is µ-strongly convex if for every λ ∈ [0, 1]
    f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y) − (µ/2) λ(1−λ) ‖x − y‖²₂
If f ∈ C² and µI ≼ ∇²f(x) ≼ LI, then f is µ-strongly convex and f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (µ/2) ‖y − x‖²

9 Non-smooth Convex Functions
For convex functions, subgradients take the place of gradients.
v is a subgradient of f at x if f(y) ≥ f(x) + vᵀ(y − x)
Recall for f ∈ C¹, f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)
Subdifferential: ∂f(x) = {all subgradients of f at x}
[Figure: a convex f lying above its supporting line f(x) + ⟨v, y − x⟩ of slope v at x]

10 Optimality for Non-smooth Convex Functions
∂f is a set-valued function.
Example: f(x) = x² if x < 0, and f(x) = x if x ≥ 0; then
    ∂f(x) = {2x} if x < 0, [0, 1] if x = 0, {1} if x > 0
x* minimizes f(x) ⟺ 0 ∈ ∂f(x*).

11 Moreau Proximal Envelopes
History: Moreau and Yosida (1960s)
Moreau Envelope: f_γ(x) = min_y { f(y) + (1/2γ) ‖y − x‖² }
f_γ(x) ≤ f(x); f_γ(x) is a regularized version of f
f_γ(x) has the same set of minimizers as f(x)
[Figure: Moreau envelope illustration for a fixed γ]
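
As a quick numerical illustration of these properties (an example I added, not from the slides), the sketch below evaluates the Moreau envelope of f(x) = |x| by minimizing over a fine grid and checks it against the known closed form, the Huber function, as well as the bound f_γ ≤ f:

    import numpy as np

    gamma = 0.5
    xs = np.linspace(-3.0, 3.0, 13)          # query points
    ys = np.linspace(-5.0, 5.0, 200001)      # fine grid for the inner minimization

    def moreau_env_abs(x, gamma):
        # f_gamma(x) = min_y |y| + (1/(2*gamma)) * (y - x)^2, approximated on a grid
        return np.min(np.abs(ys) + (ys - x) ** 2 / (2 * gamma))

    def huber(x, gamma):
        # known closed form of the Moreau envelope of |.|
        return np.where(np.abs(x) <= gamma, x ** 2 / (2 * gamma), np.abs(x) - gamma / 2)

    for x in xs:
        env = moreau_env_abs(x, gamma)
        assert abs(env - huber(x, gamma)) < 1e-4   # envelope matches the Huber function
        assert env <= abs(x) + 1e-12               # f_gamma(x) <= f(x)
    print("Moreau envelope of |x| agrees with the Huber function")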

12 Moreau Proximity Operator
Proximity Operator: prox_γf : R^n → R^n of γf, where γ > 0 is a scale factor in
    prox_γf(x) = argmin_y { f(y) + (1/2γ) ‖y − x‖² }    (1)
The function in {·} in (1) is strongly convex and hence has a unique minimizer for every x.
prox_γf(·) is closer to the minimizers of f(·) (and f_γ(·)) than x.
If f is replaced by its linearization at x, f(y) ≈ f(x) + ∇f(x)ᵀ(y − x), then prox_γf(x) = x − γ∇f(x), i.e., a gradient descent step with step size γ.

13 Proximity Operators: Examples
f = I_C(x), the indicator function of the convex set C ⊆ R^n:
    I_C(x) = 0 if x ∈ C, +∞ if x ∉ C
    prox_f(x) = argmin_{y ∈ C} ‖y − x‖²  (projection of x onto C)
f = ‖x‖₁:
    prox_γf(x) = soft(x, γ) = sgn(x) max(|x| − γ, 0)  (componentwise soft thresholding)
Nuclear (trace) norm ‖X‖_* = sum of singular values of X:
    Let the SVD of X be UΛVᵀ; then prox_{γ‖·‖_*}(X) = U Λ̄ Vᵀ with Λ̄_ii = soft(Λ_ii, γ)
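
A minimal numerical sketch of these three proximity operators (an illustration I added, using numpy; the function and variable names are mine):

    import numpy as np

    def prox_l1(x, gamma):
        # soft thresholding: prox of gamma*||x||_1
        return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

    def project_box(x, lo, hi):
        # prox of the indicator of the box C = {lo <= x <= hi}: Euclidean projection onto C
        return np.clip(x, lo, hi)

    def prox_nuclear(X, gamma):
        # prox of gamma*||X||_* : soft-threshold the singular values
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return U @ np.diag(np.maximum(s - gamma, 0.0)) @ Vt

    x = np.array([1.5, -0.2, 0.7])
    print(prox_l1(x, 0.5))            # -> [1., -0., 0.2]
    print(project_box(x, 0.0, 1.0))   # -> [1., 0., 0.7]
    print(prox_nuclear(np.outer(x, x), 0.5).round(3))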

14 Proximal Minimization
    x^{k+1} ← prox_γf(x^k)    (2)
A minimizer x* of γf is a fixed point of prox_γf, i.e. x* = prox_γf(x*).
prox_γf(x) = x − γ∇f_γ(x), i.e. a steepest descent step with step length γ for minimizing the Moreau envelope.
W.r.t. f, if f ∈ C¹, prox_γf is equivalent to an implicit gradient (backward Euler) step.
Iteration (2) converges to the set of minimizers of f.

15 Proximal Gradient Method
Consider: minimize f(x) + g(x), where f : R^n → R with f ∈ C¹ and g : R^n → R ∪ {+∞} are both closed and proper convex functions.
Proximal gradient method:
    x^{k+1} ← prox_{α_k g}(x^k − α_k ∇f(x^k))
Re-discovered in optimization, convex analysis, machine learning, signal processing, PDE, etc.:
- Fixed-Point Continuation (FPC)
- Iterative Shrinkage Thresholding (IST)
- Forward-Backward Splitting (FBS)
If f(x) is replaced by its linearization f(x^k) + ∇f(x^k)ᵀ(x − x^k), then
    x^{k+1} ← prox_{α_k g}(prox_{α_k f}(x^k))
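
An illustrative sketch of the iteration (not from the slides; grad_f, prox_g, and the fixed step size alpha are caller-supplied assumptions), applied to a small lasso instance:

    import numpy as np

    def proximal_gradient(grad_f, prox_g, x0, alpha, iters=500):
        # x_{k+1} = prox_{alpha*g}(x_k - alpha*grad_f(x_k))
        x = x0.copy()
        for _ in range(iters):
            x = prox_g(x - alpha * grad_f(x), alpha)
        return x

    # example: lasso, f(x) = 0.5*||Dx - b||^2, g(x) = lam*||x||_1
    rng = np.random.default_rng(0)
    D, b, lam = rng.standard_normal((20, 50)), rng.standard_normal(20), 0.1
    soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
    x_hat = proximal_gradient(grad_f=lambda x: D.T @ (D @ x - b),
                              prox_g=lambda v, a: soft(v, a * lam),
                              x0=np.zeros(50),
                              alpha=1.0 / np.linalg.norm(D, 2) ** 2)
    print("nonzeros:", np.count_nonzero(np.round(x_hat, 6)))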

16 Unsupervised Learning: Proximal Gradient Method Recommendation Systems: Netflix problem 17,000 movies, 500,000 customers, 100,000,000 ratings objective function value: $1,000,000

17 Unsupervised Learning: Proximal Gradient Method
Netflix Problem (Matrix Completion):
    min_X { rank(X) : P_Ω(X − M) = 0 }
Convex relaxation (Candes and Recht, 2009; Candes and Tao, 2009) and prox gradient method (Ma, G, Chen, 2009):
    min_X  µ‖X‖_* + (1/2)‖P_Ω(X − M)‖²_F
    Y^k ← X^k − τ g(X^k),   X^{k+1} ← S_{τµ}(Y^k)
where g(X) := gradient of (1/2)‖P_Ω(X − M)‖²_F and S_ν(Y) := matrix shrinkage operator
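
An illustrative sketch of this iteration on a small random low-rank matrix (my example, not from the slides; a full SVD is used in the shrinkage step, whereas a large-scale code would use a partial SVD):

    import numpy as np

    rng = np.random.default_rng(0)
    n, r = 60, 3
    M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # rank-3 ground truth
    Omega = rng.random((n, n)) < 0.5                                # observe ~50% of entries

    def svd_shrink(Y, t):
        # S_t(Y): soft-threshold the singular values (prox of t*||.||_*)
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

    X, tau, mu = np.zeros((n, n)), 1.0, 1.0
    for _ in range(300):
        G = np.where(Omega, X - M, 0.0)      # gradient of 0.5*||P_Omega(X - M)||_F^2
        X = svd_shrink(X - tau * G, tau * mu)
    print("relative error:", np.linalg.norm(X - M) / np.linalg.norm(M))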

18 Augmented Lagrangian Methods
Consider the linearly constrained problem
    minimize f(x)  subject to Ax = b
where f is a proper, lower semi-continuous, convex function.
Augmented Lagrangian with penalty parameter ρ > 0:
    L(x, λ; ρ) := f(x) + λᵀ(Ax − b)  [Lagrangian]  + (ρ/2)‖Ax − b‖²₂  [augmentation]
Augmented Lagrangian method (method of multipliers) (Hestenes, Powell):
    x^k = argmin_x L(x, λ^{k−1}; ρ),
    λ^k = λ^{k−1} + ρ(Ax^k − b).
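
A compact sketch of the method of multipliers (an illustration I added) on a toy problem where the x-subproblem has a closed form, namely f(x) = (1/2)‖x − c‖² subject to Ax = b:

    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 5, 12
    A, b, c = rng.standard_normal((m, n)), rng.standard_normal(m), rng.standard_normal(n)

    rho, lam = 10.0, np.zeros(m)
    for _ in range(100):
        # x-step: argmin_x 0.5*||x - c||^2 + lam'(Ax - b) + (rho/2)*||Ax - b||^2
        x = np.linalg.solve(np.eye(n) + rho * A.T @ A, c - A.T @ lam + rho * A.T @ b)
        # multiplier step
        lam = lam + rho * (A @ x - b)
    print("constraint violation:", np.linalg.norm(A @ x - b))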

19 A Non-standard Derivation
    min_x f(x) s.t. Ax = b   ⟺   min_x max_λ { f(x) + λᵀ(Ax − b) }
To smooth max_λ { f(x) + λᵀ(Ax − b) }, add a proximal term given an estimate λ̄:
    φ̂(x) := max_λ { f(x) + λᵀ(Ax − b) − (1/2ρ)‖λ − λ̄‖² }
Maximizing w.r.t. λ yields λ̂ = λ̄ + ρ(Ax − b) and
    min_x { f(x) + λ̄ᵀ(Ax − b) + (ρ/2)‖Ax − b‖² } = min_x L(x, λ̄; ρ).
Extends immediately to nonlinear constraints c(x) = 0 or c(x) ≤ 0, and explicit constraints min_{x ∈ Ω} L(x, λ, ρ).

20 Another Non-standard Derivation
Consider a penalty method approach:
    min_x f(x) + (ρ/2)‖Ax − b‖²₂
The Bregman distance for convex f(·) between points u and v, with p ∈ ∂f(v), is
    D_f^p(u, v) := f(u) − f(v) − ⟨p, u − v⟩

21 Another Non-standard Derivation (Cont.) (ρ = 1)
Bregman iteration: set x^0 ← 0, p^0 ← 0
    x^{k+1} ← argmin_x D_f^{p^k}(x, x^k) + (1/2)‖Ax − b‖²₂
    p^{k+1} ← p^k − Aᵀ(Ax^{k+1} − b)
Augmented Lagrangian method: set x^0 ← 0, λ^0 ← 0
    x^{k+1} ← argmin_x f(x) + ⟨λ^k, Ax − b⟩ + (1/2)‖Ax − b‖²₂
    λ^{k+1} ← λ^k + Ax^{k+1} − b
Augmented Lagrangian ≡ Bregman, with {p^k = −Aᵀλ^k}

22 Alternating Direction Method of Multipliers (ADMM)
Long history: goes back to Gabay and Mercier, Glowinski and Marrocco, Lions and Mercier, Passty, and others.
- Variational problems in partial differential equations
- Maximal monotone operators
- Variational inequalities
- Nonlinear convex optimization
- Linear programming
- Nonsmooth ℓ₁-minimization, compressive sensing
Split-Bregman (Goldstein & Osher, 2009): 2139 citations; (Gabay & Mercier, 1976): 970 citations

23 Alternating Direction Method of Multipliers (ADMM)
Consider problems with a separable objective of the form
    min_{(x,z)} f(x) + h(z)  s.t. Ax + Bz = c.
The standard augmented Lagrangian method minimizes
    L(x, z, λ; ρ) := f(x) + h(z) + λᵀ(Ax + Bz − c) + (ρ/2)‖Ax + Bz − c‖²₂
w.r.t. (x, z) jointly. In ADMM, minimize over x and z separately and sequentially:
    x^k = argmin_x L(x, z^{k−1}, λ^{k−1}; ρ_k);
    z^k = argmin_z L(x^k, z, λ^{k−1}; ρ_k);
    λ^k = λ^{k−1} + ρ_k(Ax^k + Bz^k − c).

24 ADMM: A Simpler Form
Consider the simpler problem min_x f(x) + h(Ax), i.e., min_{(x,z)} f(x) + h(z) s.t. Ax = z.
In this case, the ADMM can be written as
    x^k = argmin_x f(x) + (ρ/2)‖Ax − z^{k−1} − d^{k−1}‖²
    z^k = argmin_z h(z) + (ρ/2)‖Ax^k − z − d^{k−1}‖²
    d^k = d^{k−1} − (Ax^k − z^k)
sometimes called the scaled version of ADMM.
Note z^k = prox_{h/ρ}(Ax^k − d^{k−1}) and is usually easy to compute.
Updating x^k may be hard: if f is not quadratic, it may be as hard as the original problem.
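
An illustrative sketch of this scaled form on the lasso (my example, not from the slides; here A = I, f(x) = (1/2)‖Dx − b‖² and h = λ‖·‖₁, so the x-update is a linear solve and the z-update is soft thresholding):

    import numpy as np

    rng = np.random.default_rng(0)
    D, b, lam, rho = rng.standard_normal((30, 80)), rng.standard_normal(30), 0.1, 1.0
    n = D.shape[1]
    soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    x, z, d = np.zeros(n), np.zeros(n), np.zeros(n)
    DtD, Dtb = D.T @ D, D.T @ b
    for _ in range(300):
        # x-step: argmin_x 0.5*||Dx - b||^2 + (rho/2)*||x - z - d||^2
        x = np.linalg.solve(DtD + rho * np.eye(n), Dtb + rho * (z + d))
        # z-step: prox_{h/rho}(x - d) = soft(x - d, lam/rho)
        z = soft(x - d, lam / rho)
        # scaled multiplier step
        d = d - (x - z)
    print("||x - z|| =", np.linalg.norm(x - z))

Since D is fixed, a practical code would factor DtD + rho*I once and reuse it in every x-step.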

25 Examples
    min F(x) ≡ f(x) + g(x)
Compressed sensing (Lasso): min ρ‖x‖₁ + (1/2)‖Ax − b‖²₂
Matrix Rank Min: min ρ‖X‖_* + (1/2)‖A(X) − b‖²₂
Robust PCA: min_{X,Y} ‖X‖_* + ρ‖Y‖₁ : X + Y = M
Sparse Inverse Covariance Selection: min −log det(X) + ⟨Σ, X⟩ + ρ‖X‖₁
Group Lasso: min ρ‖x‖_{1,2} + (1/2)‖Ax − b‖²₂

26 Variable Splitting
    min f(x) + g(x)   ⟺   min f(x) + g(y) s.t. x = y
Augmented Lagrangian function:
    L(x, y; λ) := f(x) + g(y) − ⟨λ, x − y⟩ + (1/2µ)‖x − y‖²
ADMM:
    x^{k+1} := argmin_x L(x, y^k; λ^k)
    y^{k+1} := argmin_y L(x^{k+1}, y; λ^k)
    λ^{k+1} := λ^k − (x^{k+1} − y^{k+1})/µ

27 Symmetric ADMM (Alternating Linearization Method)
Symmetric version:
    x^{k+1} := argmin_x L(x, y^k; λ^k)
    λ^{k+1/2} := λ^k − (x^{k+1} − y^k)/µ
    y^{k+1} := argmin_y L(x^{k+1}, y; λ^{k+1/2})
    λ^{k+1} := λ^{k+1/2} − (x^{k+1} − y^{k+1})/µ
Optimality conditions lead to (assuming f and g are smooth)
    λ^{k+1/2} = ∇f(x^{k+1}),   λ^{k+1} = −∇g(y^{k+1})
Alternating Linearization Method (ALM):
    x^{k+1} = argmin_x f(x) + g(y^k) + ⟨∇g(y^k), x − y^k⟩ + (1/2µ)‖x − y^k‖²
    y^{k+1} = argmin_y f(x^{k+1}) + ⟨∇f(x^{k+1}), y − x^{k+1}⟩ + (1/2µ)‖x^{k+1} − y‖² + g(y)
A Gauss-Seidel like algorithm.
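
Read this way, each ALM half-step is a prox-gradient step on one function using the gradient of the other. A generic sketch under that reading (an illustration I added; grad_f, prox_f, grad_g, prox_g, and the toy quadratic instance are my assumptions):

    import numpy as np

    def alternating_linearization(grad_f, prox_f, grad_g, prox_g, x0, mu, iters=200):
        x = y = x0.copy()
        for _ in range(iters):
            # x-step: argmin_x f(x) + <grad g(y), x - y> + ||x - y||^2/(2*mu)
            x = prox_f(y - mu * grad_g(y), mu)
            # y-step: argmin_y <grad f(x), y - x> + ||y - x||^2/(2*mu) + g(y)
            y = prox_g(x - mu * grad_f(x), mu)
        return y

    # toy instance: f(x) = 0.5*||x - a||^2, g(x) = 0.5*||x - b||^2 (minimizer is (a + b)/2)
    a, b = np.array([1.0, 0.0]), np.array([0.0, 2.0])
    y = alternating_linearization(grad_f=lambda x: x - a,
                                  prox_f=lambda v, m: (v + m * a) / (1 + m),
                                  grad_g=lambda x: x - b,
                                  prox_g=lambda v, m: (v + m * b) / (1 + m),
                                  x0=np.zeros(2), mu=0.5)
    print(y)   # close to [0.5, 1.0]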

28 Complexity Bound for ALM
Theorem (G, Ma and Scheinberg, 2013). Assume ∇f and ∇g are Lipschitz continuous with constants L(f) and L(g). For µ ≤ 1/max{L(f), L(g)}, ALM satisfies
    F(y^k) − F(x*) ≤ ‖x^0 − x*‖² / (4µk)
i.e., O(1/ɛ) iterations for an ɛ-optimal solution (F(x) − F(x*) ≤ ɛ).
Can we improve the complexity? Can we extend this result to ADMM?

29 Optimal Gradient Methods
Lipschitz continuous ∇f.
Classical gradient method, complexity O(1/ɛ):
    x^k = x^{k−1} − τ_k ∇f(x^{k−1})
Nesterov's acceleration technique (1983), complexity O(1/√ɛ):
    x^k := y^{k−1} − τ_k ∇f(y^{k−1})
    y^k := x^k + ((k − 1)/(k + 2)) (x^k − x^{k−1})
Optimal first-order method; the best one can get.

30 ISTA and FISTA (Beck and Teboulle, 2009)
    min F(x) ≡ f(x) + g(x),   assume g is smooth
ISTA (proximal gradient method), complexity O(1/ɛ):
    x^{k+1} := argmin_x Q_g(x, x^k)
or equivalently
    x^{k+1} := argmin_x τ f(x) + (1/2)‖x − (x^k − τ∇g(x^k))‖²
Never minimizes g.
Fast ISTA (FISTA), complexity O(1/√ɛ):
    x^k := argmin_x τ f(x) + (1/2)‖x − (y^k − τ∇g(y^k))‖²
    t_{k+1} := (1 + √(1 + 4t_k²))/2
    y^{k+1} := x^k + ((t_k − 1)/t_{k+1}) (x^k − x^{k−1})
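
An illustrative FISTA sketch on the lasso (not from the slides; following the slide's notation, g is the smooth least-squares term, f = λ‖·‖₁, and τ = 1/L(g)):

    import numpy as np

    rng = np.random.default_rng(0)
    D, b, lam = rng.standard_normal((40, 100)), rng.standard_normal(40), 0.1
    soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
    tau = 1.0 / np.linalg.norm(D, 2) ** 2      # 1 / Lipschitz constant of grad g

    x_prev = x = y = np.zeros(100)
    t = 1.0
    for _ in range(300):
        # x-step: prox of tau*f at the gradient step on g
        x = soft(y - tau * D.T @ (D @ y - b), tau * lam)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    print("objective:", 0.5 * np.linalg.norm(D @ x - b) ** 2 + lam * np.abs(x).sum())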

31 Fast Alternating Linearization Method (FALM)
ALM (symmetric ADMM):
    x^{k+1} := argmin_x Q_g(x, y^k)
    y^{k+1} := argmin_y Q_f(x^{k+1}, y)
Accelerate ALM in the same way as FISTA.
Fast Alternating Linearization Method (FALM):
    x^k := argmin_x Q_g(x, z^k)
    y^k := argmin_y Q_f(x^k, y)
    w^k := (x^k + y^k)/2
    t_{k+1} := (1 + √(1 + 4t_k²))/2
    z^{k+1} := w^k + (1/t_{k+1}) (t_k(y^k − w^{k−1}) − (w^k − w^{k−1}))
Computational effort at each iteration is almost unchanged.
Both f and g must be smooth; however, both are minimized.

32 FALM (cont.)
Theorem (G, Ma and Scheinberg, 2013). Assume ∇f and ∇g are Lipschitz continuous with constants L(f) and L(g). For µ ≤ 1/max{L(f), L(g)}, FALM satisfies
    F(y^k) − F(x*) ≤ ‖x^0 − x*‖² / (µ(k + 1)²)
i.e., O(1/√ɛ) iterations for an ɛ-optimal solution; hence, an optimal first-order method.
Applied to Total Variation denoising, it outperforms split Bregman (Qin, G, Ma, 2013).

33 ALM with skipping steps
At the k-th iteration of ALM-S:
    x^{k+1} := argmin_x L_µ(x, y^k; λ^k)
    If F(x^{k+1}) > L_µ(x^{k+1}, y^k; λ^k), then x^{k+1} := y^k
    y^{k+1} := argmin_y Q_f(y, x^{k+1})
    λ^{k+1} := ∇f(x^{k+1}) − (x^{k+1} − y^{k+1})/µ
Note that only f is required to be smooth.
If µ ≤ 1/L(f), complexity O(1/ɛ); if L(f) is not known, use a backtracking line search (Scheinberg, G, Bai 2014).
The FALM version has complexity O(1/√ɛ).
Applied to solve Sparse Inverse Covariance Selection (Scheinberg, Ma, G, 2010) and Group Lasso (structured sparsity for breast cancer gene expression) (Qin, G, 2012).

34 Multiple Splitting Algorithm (MSA)
Generalization from 2 to K convex functions is possible, but non-convergence of ADMM for K ≥ 3 has been shown.
Consider min F(x) ≡ f(x) + g(x) + h(x), and define
    Q_gh(u, v, w) := f(u) + g(v) + ⟨∇g(v), u − v⟩ + ‖u − v‖²/2µ + h(w) + ⟨∇h(w), u − w⟩ + ‖u − w‖²/2µ.
ALM (symmetric ADMM):
    x^{k+1} := argmin Q_gh(x, y^k, z^k)
    y^{k+1} := argmin Q_fh(x^{k+1}, y, z^k)
    z^{k+1} := argmin Q_fg(x^{k+1}, y^{k+1}, z)
A Gauss-Seidel like algorithm! Convergence?

35 Multiple Splitting Algorithm (MSA) (cont.)
Jacobi type algorithm:
    x^{k+1} := argmin Q_gh(x, w^k, w^k)
    y^{k+1} := argmin Q_fh(w^k, y, w^k)
    z^{k+1} := argmin Q_fg(w^k, w^k, z)
    w^{k+1} := (x^{k+1} + y^{k+1} + z^{k+1})/3
Convergent, with complexity O(1/ɛ) (G and Ma, 2012).

36 Fast Multiple Splitting Algorithm (FaMSA)
O(1/√ɛ) complexity (G and Ma, 2012):
    x^k := argmin Q_gh(x, w_x^k, w_x^k)
    y^k := argmin Q_fh(w_y^k, y, w_y^k)
    z^k := argmin Q_fg(w_z^k, w_z^k, z)
    w^k := (x^k + y^k + z^k)/3
    t_{k+1} := (1 + √(1 + 4t_k²))/2
    w_x^{k+1} := w^k + (1/t_{k+1}) [t_k(x^k − w^k) − (w^k − w^{k−1})]
    w_y^{k+1} := w^k + (1/t_{k+1}) [t_k(y^k − w^k) − (w^k − w^{k−1})]
    w_z^{k+1} := w^k + (1/t_{k+1}) [t_k(z^k − w^k) − (w^k − w^{k−1})]

37 The Frank-Wolfe Algorithm
Discovered in 1956, the Frank-Wolfe (also known as conditional gradient) algorithm is the earliest algorithm to solve
    minimize f(x) subject to x ∈ D
where f(x) is a convex function and D ⊆ R^p is a compact and convex set.

38 The Frank-Wolfe Algorithm
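
A minimal sketch of the classic conditional gradient iteration referred to here (an illustration I added; the linear minimization oracle lmo over D is a caller-supplied assumption), with a least-squares-over-ℓ₁-ball example:

    import numpy as np

    def frank_wolfe(grad_f, lmo, x0, iters=200):
        # s_k = argmin_{s in D} <grad f(x_k), s>;  x_{k+1} = x_k + gamma_k (s_k - x_k)
        x = x0.copy()
        for k in range(iters):
            s = lmo(grad_f(x))
            gamma = 2.0 / (k + 2.0)              # standard open-loop step size
            x = x + gamma * (s - x)
        return x

    # example: least squares over the l1 ball of radius beta
    rng = np.random.default_rng(0)
    D, b, beta = rng.standard_normal((30, 60)), rng.standard_normal(30), 5.0
    def lmo_l1(g):
        i = np.argmax(np.abs(g))
        s = np.zeros_like(g)
        s[i] = -beta * np.sign(g[i])             # vertex of the l1 ball minimizing <g, s>
        return s
    x_hat = frank_wolfe(lambda x: D.T @ (D @ x - b), lmo_l1, np.zeros(60))
    print("||x_hat||_1 =", np.abs(x_hat).sum())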

39 Application: Signal Processing
Recover a sparse signal x from noisy measurements b.
Convex relaxation: exact recovery with high probability (Candes, Romberg and Tao, 2006; Donoho, 2006).
Consider min ‖Ax − b‖² s.t. ‖x‖₁ ≤ 1.
Frank-Wolfe (may select the same vertex at each step) ⟷ Matching Pursuit
Fully corrective Frank-Wolfe ⟷ Orthogonal Matching Pursuit (Tropp & Gilbert, 2007)

40 Application: Robust and Stable Principal Component Pursuit (RPCP and SPCP)
    M = L_0 + S_0 + N_0,   L_0 low-rank, S_0 sparse, N_0 small, dense noise
Given M, approximately and efficiently recover L_0 and S_0.
Convex approach:
    SPCP: min_{L,S} ‖L‖_* + λ‖S‖₁  s.t. ‖L + S − M‖_F ≤ δ
    RPCP: min_{L,S} ‖L‖_* + λ‖S‖₁  s.t. L + S = M

41 Algorithms for RPCP and SPCP
Many first-order methods have been developed.
Most exploit the closed-form expression for the proximal operator of the nuclear norm, i.e. matrix shrinkage
    min_L (1/2)‖L − Z‖² + λ‖L‖_*
using a full or partial SVD, thus limiting their applicability to large-scale problems.
They also use the closed-form expression for the proximal operator of the ℓ₁-norm, i.e. vector shrinkage, to compute S.

42 Frank-Wolfe for Norm-Constrained SPCP
Solve
    min_{L,S} (1/2)‖P_Ω(L + S − M)‖²_F  s.t. ‖L‖_* ≤ β₁, ‖S‖₁ ≤ β₂
Frank-Wolfe algorithm for SPCP: [algorithm shown on slide]
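
For this constraint set the linear minimization oracle splits over (L, S): over the nuclear-norm ball it returns a rank-one matrix built from the top singular pair of the gradient, and over the ℓ₁ ball a single-entry matrix. A sketch of both oracles and one Frank-Wolfe step (an illustration I added; a full SVD is used for clarity, while a large-scale code would compute only the leading singular pair):

    import numpy as np

    def lmo_nuclear_ball(G, beta1):
        # argmin_{||L||_* <= beta1} <G, L> = -beta1 * u1 v1^T, (u1, v1) top singular pair of G
        U, s, Vt = np.linalg.svd(G, full_matrices=False)
        return -beta1 * np.outer(U[:, 0], Vt[0, :])

    def lmo_l1_ball(G, beta2):
        # argmin_{||S||_1 <= beta2} <G, S>: put all the mass on the largest-magnitude entry of G
        i, j = np.unravel_index(np.argmax(np.abs(G)), G.shape)
        S = np.zeros_like(G)
        S[i, j] = -beta2 * np.sign(G[i, j])
        return S

    # one Frank-Wolfe step for the SPCP objective above
    rng = np.random.default_rng(0)
    M = rng.standard_normal((40, 40))
    Omega = rng.random((40, 40)) < 0.8
    L = S = np.zeros_like(M)
    G = np.where(Omega, L + S - M, 0.0)          # gradient w.r.t. both L and S
    gamma = 2.0 / (0 + 2.0)
    L = L + gamma * (lmo_nuclear_ball(G, 10.0) - L)
    S = S + gamma * (lmo_l1_ball(G, 5.0) - S)
    print("rank(L) =", np.linalg.matrix_rank(L), " nnz(S) =", np.count_nonzero(S))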

43 Inefficiency of the FW algorithm
[Figure: slow convergence on synthetic data]
Inefficient in updating S:
    S^{k+1} = (k/(k+2)) S^k ± (2β₂/(k+2)) e_i^k (e_j^k)ᵀ   ⟹   ‖S^{k+1}‖₀ ≤ ‖S^k‖₀ + 1

44 Frank-Wolfe/Prox Gradient (FW-P) Algorithm Key idea: Add a prox gradient step to update S after each F-W step

45 FW-P Algorithm for SPCP
Solve
    min_{L,S} (1/2)‖P_Ω[L + S − M]‖²_F + λ₁‖L‖_* + λ₂‖S‖₁
Domain unbounded ⟹ use an epigraph formulation!

46 FW-P Algorithm for SPCP
[Figure: convergence on synthetic data (Red: F-W, Blue: UFA)]
Theorem (Mu, Wright, G. 14). For {(L^k, S^k)} produced by the FW-P method, we have
    f(L^k, S^k) − f(L*, S*) ≤ 16(β₁² + β₂²)/(k + 2)

47 FW-P Algorithm for SPCP Comparison with other algorithms

48 FW-P for Matrix SPCP
Background and foreground extraction from greyscale surveillance videos M; 96 seconds using a laptop!

49 FW-P for Tensor SPCP
Convex program:
    min_{L,S} (1/2)‖P_Ω[X − L − S]‖²_F + λ₁‖L‖_* + λ₂‖S‖₁

50 FW-P for Tensor SPCP
Background segmentation for color videos: background modelling (50% missing entries).
Data size: 18.4M entries; running time: 34 secs.
