Coordinate Descent Methods on Huge-Scale Optimization Problems

1 Coordinate Descent Methods on Huge-Scale Optimization Problems Zhimin Peng Optimization Group Meeting

6 Warm up exercise?
Q: Why do mathematicians, after a dinner at a Chinese restaurant, always insist on taking the leftovers home?
A: Because they know the Chinese remainder theorem!
Q: What does the zero say to the eight?
A: Nice belt!

10 Motivation
Consider the optimization problem: $\min_{x \in \mathbb{R}^N} f(x)$
Why coordinate descent (CD) methods?
CD based on the maximal absolute value of the gradient:
1. Choose $i_k = \arg\max_{1 \le i \le n} |\nabla_i f(x_k)|$
2. Update $x_{k+1} = x_k - \alpha \nabla_{i_k} f(x_k) e_{i_k}$
What's the problem with it?
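The greedy rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming a quadratic test function $f(x) = \frac{1}{2}x^T A x - b^T x$ and the step size $\alpha = 1/A_{ii}$; the function name `greedy_cd` and the test data are hypothetical and not from the talk. The sketch makes one likely answer to the closing question visible: picking the arg max needs the full gradient in every iteration.

```python
import numpy as np

def greedy_cd(A, b, x0, n_iters=200):
    """Greedy CD on f(x) = 0.5 x^T A x - b^T x, with A symmetric positive definite."""
    x = x0.astype(float).copy()
    for _ in range(n_iters):
        grad = A @ x - b                   # full gradient, needed only to pick i_k
        i = int(np.argmax(np.abs(grad)))   # step 1: coordinate with largest |gradient|
        x[i] -= grad[i] / A[i, i]          # step 2: assumed step size alpha = 1 / A_ii
    return x

# Hypothetical 2x2 test problem.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_greedy = greedy_cd(A, b, np.zeros(2))
```

Each iteration changes a single coordinate but still pays for a full matrix-vector product, which is the cost that grows with the problem size.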

14 Huge-scale problems?
Sources: Internet, telecommunication; finite element schemes, weather prediction
Features: expensive function evaluation; huge data
Conclusion: We need CD methods!

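18 Unconstrained Optimization
$\min_{x \in \mathbb{R}^N} f(x)$
Notations:
Decomposition of $\mathbb{R}^N$: $\mathbb{R}^N = \bigotimes_{i=1}^n \mathbb{R}^{n_i}$
Partition of the unit matrix: $I_N = (U_1, U_2, \dots, U_n) \in \mathbb{R}^{N \times N}$, $U_i \in \mathbb{R}^{N \times n_i}$
$x = (x^{(1)}, x^{(2)}, \dots, x^{(n)})^T \in \mathbb{R}^N$ can be represented as $x = \sum_{i=1}^n U_i x^{(i)}$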
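To make the block notation concrete, the following sketch builds the matrices $U_i$ as column slices of the identity and checks the representation of $x$ as a sum of block contributions. The block sizes and the vector are made-up illustration values.

```python
import numpy as np

block_sizes = [2, 1, 3]                  # hypothetical n_1, n_2, n_3, so N = 6
N = sum(block_sizes)
offsets = np.cumsum([0] + block_sizes)   # start index of each block
I_N = np.eye(N)

# U_i collects the columns of the identity that belong to block i.
U = [I_N[:, offsets[i]:offsets[i + 1]] for i in range(len(block_sizes))]

x = np.arange(1.0, N + 1)
x_blocks = [U_i.T @ x for U_i in U]      # x^(i) = U_i^T x
x_rebuilt = sum(U_i @ x_i for U_i, x_i in zip(U, x_blocks))  # sum_i U_i x^(i)
```

In practice the $U_i$ are never formed explicitly; index slices do the same job, and the matrices are only a notational device.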

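22 More notations...
Partial gradient of $f(x)$: $f'_i(x) = U_i^T \nabla f(x) \in \mathbb{R}^{n_i}$
Assume that the gradient of function $f$ is coordinatewise Lipschitz continuous:
$\|f'_i(x + U_i h_i) - f'_i(x)\|_{(i)}^* \le L_i \|h_i\|_{(i)}$, where $\|s\|^* = \max_{\|x\| = 1} \langle s, x \rangle$ is the dual norm
Optimal coordinate steps: $T_i(x) = x - \frac{1}{L_i} U_i f'_i(x)^\#$, where $s^\# \in \arg\max_x [\langle s, x \rangle - \frac{1}{2}\|x\|^2]$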
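A small sketch of the coordinate step may help here. It assumes Euclidean norms on every block, in which case $s^\# = s$, and a quadratic test function $f(x) = \frac{1}{2}x^T A x - b^T x$, for which the block Lipschitz constant is the largest eigenvalue of the diagonal block of $A$. The names `coordinate_step` and `f_quad` and the data are hypothetical.

```python
import numpy as np

def f_quad(A, b, x):
    return 0.5 * x @ A @ x - b @ x

def coordinate_step(A, b, x, block):
    """T_i(x) for the quadratic above; `block` holds the indices of block i."""
    partial_grad = A[block] @ x - b[block]                    # f'_i(x) = U_i^T grad f(x)
    L_i = np.linalg.eigvalsh(A[np.ix_(block, block)]).max()   # Lipschitz constant of block i
    x_new = x.copy()
    x_new[block] -= partial_grad / L_i                        # x - (1/L_i) U_i f'_i(x)
    return x_new, partial_grad, L_i

# Hypothetical 3x3 problem with block i = coordinates {0, 1}.
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x = np.zeros(3)
x_new, g, L_i = coordinate_step(A, b, x, np.array([0, 1]))
decrease = f_quad(A, b, x) - f_quad(A, b, x_new)
```

Under these assumptions the step guarantees a decrease of at least $\|f'_i(x)\|^2 / (2L_i)$, which follows from the Lipschitz condition and can be checked numerically on `decrease`.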

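23 More notations...
A new norm: $\|x\|_{[\alpha]} = [\sum_{i=1}^n L_i^\alpha \|x^{(i)}\|_{(i)}^2]^{1/2}$, where $\|\cdot\|_{(i)}$ is some fixed norm.
Random counter $A_\alpha$, $\alpha \in \mathbb{R}$, which generates a random number $i \in \{1, \dots, n\}$ with probability $p_\alpha^{(i)} = L_i^\alpha / \sum_{j=1}^n L_j^\alpha$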
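Both definitions translate directly into code. The sketch below assumes Euclidean norms on the blocks; the function names and the sample values are hypothetical, and indices are 0-based rather than the 1-based indices of the slides.

```python
import numpy as np

def alpha_norm(x_blocks, L, alpha):
    """||x||_[alpha] = sqrt(sum_i L_i^alpha * ||x^(i)||^2), Euclidean block norms assumed."""
    L = np.asarray(L, dtype=float)
    sq_norms = np.array([np.dot(xb, xb) for xb in x_blocks])
    return float(np.sqrt(np.sum(L**alpha * sq_norms)))

def counter_probabilities(L, alpha):
    """p_alpha^(i) = L_i^alpha / sum_j L_j^alpha."""
    w = np.asarray(L, dtype=float) ** alpha
    return w / w.sum()

def random_counter(L, alpha, rng):
    """Draw one (0-based) block index from the counter A_alpha."""
    p = counter_probabilities(L, alpha)
    return int(rng.choice(len(p), p=p))

p_uniform = counter_probabilities([1.0, 4.0, 9.0], alpha=0.0)
p_weighted = counter_probabilities([1.0, 4.0, 9.0], alpha=1.0)
norm_value = alpha_norm([np.array([3.0, 4.0]), np.array([1.0])], [1.0, 4.0], alpha=1.0)
i = random_counter([1.0, 4.0, 9.0], 1.0, np.random.default_rng(0))
```

With $\alpha = 0$ the counter is uniform; with $\alpha = 1$ blocks with larger Lipschitz constants are sampled more often.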

25 Method RCDM($\alpha$, $x_0$)
Algorithm:
1. Choose $i_k = A_\alpha$
2. Update $x_{k+1} = T_{i_k}(x_k)$
Theorem: For any $k \ge 0$, we have
$E[f(x_k)] - f^* \le \frac{2}{k+4} [\sum_{j=1}^n L_j^\alpha] R_{1-\alpha}^2(x_0)$
where $R_\beta(x_0) = \max_x \{ \max_{x^* \in X^*} \|x - x^*\|_{[\beta]} : f(x) \le f(x_0) \}$
Comments: $R_\beta(x_0)$ measures the distance between the initial point $x_0$ and the optimal set $X^*$. In fact, $R_\beta(x_0)$ is positively correlated to the distance between $x_0$ and $X^*$.
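The two steps of RCDM can be sketched as follows. This is an illustration under stated assumptions, not the speaker's implementation: the objective is the quadratic $f(x) = \frac{1}{2}x^T A x - b^T x$, every block is a single coordinate so that $L_i = A_{ii}$, and the name `rcdm` and the test data are hypothetical.

```python
import numpy as np

def rcdm(A, b, x0, alpha=0.0, n_iters=2000, seed=0):
    """RCDM(alpha, x0) on f(x) = 0.5 x^T A x - b^T x with scalar blocks."""
    rng = np.random.default_rng(seed)
    L = np.diag(A).astype(float)        # L_i = A_ii for single coordinates
    p = L**alpha / np.sum(L**alpha)     # probabilities of the random counter A_alpha
    x = x0.astype(float).copy()
    for _ in range(n_iters):
        i = rng.choice(len(p), p=p)     # step 1: i_k = A_alpha
        g_i = A[i] @ x - b[i]           # partial gradient from one row of A
        x[i] -= g_i / L[i]              # step 2: x_{k+1} = T_{i_k}(x_k)
    return x

# Hypothetical 3x3 test problem.
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x_rcdm = rcdm(A, b, np.zeros(3), alpha=1.0)
```

Unlike a greedy rule, each iteration touches one row of $A$ only, so its cost does not require a full gradient evaluation.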

27 Proof
Key inequality 1: [equation not preserved in this transcript] This inequality is given by the Lipschitz gradient inequality.
Key inequality 2: [equation not preserved in this transcript]

28 Combining the previous key inequalities, we have

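31 Convergence of strongly convex functions
Strongly convex functions: $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2}\sigma(f)\|y - x\|^2$, where $\sigma = \sigma(f)$ is the convexity parameter
Theorem: Let function $f(x)$ be strongly convex with respect to the norm $\|\cdot\|_{[1-\alpha]}$ with convexity parameter $\sigma_{1-\alpha} = \sigma_{1-\alpha}(f) > 0$. Then, for the sequence $\{x_k\}$ generated by RCDM we have
$E[f(x_k)] - f^* \le (1 - \frac{\sigma_{1-\alpha}(f)}{S_\alpha(f)})^k (f(x_0) - f^*)$, where $S_\alpha(f) = \sum_{j=1}^n L_j^\alpha$
Proof: [not preserved in this transcript]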
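Two special cases and the resulting iteration count make the rate easier to read. The worked lines below read $S_\alpha(f)$ as the sum $\sum_j L_j^\alpha$ that appears in the RCDM bound, and use only the elementary estimate $(1 - t)^k \le e^{-tk}$; the target accuracy $\varepsilon$ is a symbol introduced here for illustration.

```latex
% alpha = 0: uniform sampling, S_0(f) = n
E[f(x_k)] - f^* \le \Bigl(1 - \frac{\sigma_1(f)}{n}\Bigr)^k \bigl(f(x_0) - f^*\bigr)

% alpha = 1: sampling proportional to L_i, S_1(f) = \sum_j L_j
E[f(x_k)] - f^* \le \Bigl(1 - \frac{\sigma_0(f)}{\sum_{j=1}^n L_j}\Bigr)^k \bigl(f(x_0) - f^*\bigr)

% iterations sufficient for expected accuracy epsilon, from (1 - t)^k \le e^{-tk}
k \ge \frac{S_\alpha(f)}{\sigma_{1-\alpha}(f)} \ln \frac{f(x_0) - f^*}{\varepsilon}
\quad\Longrightarrow\quad
E[f(x_k)] - f^* \le \varepsilon
```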

35 Expected quality is good! How about the result of a single run?
Define function $f_\mu(x)$ by: $f_\mu(x) = f(x) + \frac{\mu}{2}\|x - x_0\|_{[1]}^2$
$f_\mu(x)$ is strongly convex with respect to $\|\cdot\|_{[1]}$
$f_\mu(x)$ has convexity parameter $\mu$
Theorem: Let us define $\mu = \frac{\varepsilon}{4R_1^2(x_0)}$ and choose $k \ge 1 + \frac{2n}{\mu} \ln \frac{n}{2\mu(1-\beta)}$. If the random point $x_k$ is generated by RCDM($0$, $x_0$) as applied to function $f_\mu$, then $\mathrm{Prob}(f(x_k) - f^* \le \varepsilon) \ge \beta$
Comments: The second inequality is derived from the property of strongly convex functions.

36 Accelerated Coordinate Descent
Consider the following scheme applied to a strongly convex function with a given convexity parameter $\sigma$:

37 Convergence Based on the previous accelerated algorithm, we have the following convergence theorem:

39 Constrained optimization
Consider the constrained minimization problem $\min_{x \in Q} f(x)$
$Q = \bigotimes_{i=1}^n Q_i$, where $Q_i \subseteq \mathbb{R}^{n_i}$ are closed and convex
$f(x)$ is convex and satisfies the smoothness assumption: $\|f'_i(x + U_i h_i) - f'_i(x)\|_{(i)}^* \le L_i \|h_i\|_{(i)}$
Algorithm:
(1) Choose $i$ randomly from the uniform distribution on $\{1, \dots, n\}$
(2) $u^{(i)} = \arg\min_{u^{(i)} \in Q_i} [\langle f'_i(x_k), u^{(i)} - x_k^{(i)} \rangle + \frac{L_i}{2}\|u^{(i)} - x_k^{(i)}\|_{(i)}^2]$
(3) Update $x_{k+1} = x_k + U_i (u^{(i)} - x_k^{(i)})$
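For box constraints the three steps become very simple. The sketch below assumes scalar blocks, Euclidean norms, and intervals $Q_i = [lo_i, hi_i]$, in which case the minimization in step (2) reduces to clipping the unconstrained coordinate step to the interval. The quadratic objective, the name `box_rcd`, and the data are hypothetical.

```python
import numpy as np

def box_rcd(A, b, lo, hi, x0, n_iters=3000, seed=0):
    """Uniform random CD on f(x) = 0.5 x^T A x - b^T x subject to lo <= x <= hi."""
    rng = np.random.default_rng(seed)
    L = np.diag(A).astype(float)                 # L_i = A_ii for scalar blocks
    x = np.clip(x0.astype(float), lo, hi)        # start from a feasible point
    n = len(x)
    for _ in range(n_iters):
        i = rng.integers(n)                      # (1) uniform choice of i
        g_i = A[i] @ x - b[i]                    # partial gradient f'_i(x_k)
        u_i = np.clip(x[i] - g_i / L[i], lo[i], hi[i])  # (2) minimizer over the interval Q_i
        x[i] = u_i                               # (3) x_{k+1} = x_k + U_i (u_i - x_k^(i))
    return x

# Hypothetical problem: the unconstrained minimizer (1/11, 7/11) violates x_2 <= 0.5.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
lo, hi = np.zeros(2), np.full(2, 0.5)
x_box = box_rcd(A, b, lo, hi, np.zeros(2))
```

Because $Q$ is a product of the sets $Q_i$, updating one block never breaks feasibility of the others, which is what makes the coordinate-wise scheme well defined.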

40 Theorem: For any $k \ge 0$ we have
$\phi_k - f^* \le \frac{n}{n+k} (\frac{1}{2}R_1^2(x_0) + f(x_0) - f^*)$
If $f$ is strongly convex in $\|\cdot\|_{[1]}$ with constant $\sigma$, then
$\phi_k - f^* \le (1 - \frac{2\sigma}{n(1+\sigma)})^k (\frac{1}{2}R_1^2(x_0) + f(x_0) - f^*)$

41 Implementation

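43 Google problem
Let $E \in \mathbb{R}^{n \times n}$ be an incidence matrix of a graph; $\bar{E} = E \cdot \mathrm{diag}(E^T e)^{-1}$;
Google problem: $\min \frac{1}{2}\|\bar{E}x - x\|^2 + \frac{\gamma}{2}[\langle e, x \rangle - 1]^2$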
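The objective and one partial derivative can be written down directly. The sketch below assumes that $e$ is the all-ones vector and that every column of $E$ has a nonzero sum; it uses a dense matrix for readability, whereas a real instance would be sparse. The function names and the 3-node example graph are hypothetical.

```python
import numpy as np

def normalize_columns(E):
    """E_bar = E diag(E^T e)^{-1}: divide each column by its sum (assumed nonzero)."""
    e = np.ones(E.shape[0])
    return E / (E.T @ e)

def google_objective(E, x, gamma):
    e = np.ones(E.shape[0])
    r = normalize_columns(E) @ x - x               # residual E_bar x - x
    return 0.5 * r @ r + 0.5 * gamma * (e @ x - 1.0) ** 2

def google_partial_grad(E, x, gamma, i):
    """Derivative of the objective with respect to coordinate i (0-based)."""
    e = np.ones(E.shape[0])
    E_bar = normalize_columns(E)
    r = E_bar @ x - x
    a_i = E_bar[:, i].copy()
    a_i[i] -= 1.0                                  # column i of (E_bar - I)
    return a_i @ r + gamma * (e @ x - 1.0)

# Hypothetical 3-node graph in which every node links to the other two.
E = np.array([[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
value_at_uniform = google_objective(E, np.full(3, 1.0 / 3.0), gamma=1.0)
```

For this symmetric example the uniform vector is a fixed point of $\bar{E}$ and sums to one, so the objective vanishes there.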
