Sparse Linear Programming via Primal and Dual Augmented Coordinate Descent

1 Sparse Linear Programming via Primal and Dual Augmented Coordinate Descent

Presenter:
Joint work with Kai Zhong, Cho-Jui Hsieh, Pradeep Ravikumar and Inderjit Dhillon.

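2 Sparse Linear Program

Given vectors $c \in \mathbb{R}^n$, $b \in \mathbb{R}^m$ and an $m \times n$ matrix

$$A = \begin{bmatrix} A_I \\ A_E \end{bmatrix} = \begin{bmatrix} A_b & A_f \end{bmatrix},$$

the primal and dual forms of a Linear Program (LP) are

$$\min_{x \in \mathbb{R}^n} \; c^T x \quad \text{s.t.} \quad A_I x \le b_I, \quad A_E x = b_E, \quad x_j \ge 0,\; j \in [n_b]$$

$$\min_{y \in \mathbb{R}^m} \; b^T y \quad \text{s.t.} \quad -A_b^T y \le c_b, \quad -A_f^T y = c_f, \quad y_i \ge 0,\; i \in [m_I].$$

We say an LP is sparse in the sense that
(i) $\mathrm{nnz}(A) \ll mn$.
(ii) $\mathrm{nnz}(y^*) \ll m$ (dual sparsity).
(iii) $\mathrm{nnz}(x^*) \ll n$ (primal sparsity).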
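As a small illustration of the primal and dual forms, the sketch below checks a hand-solved toy LP. The data, the helper names `dot` and `nnz`, and the two optimal points are hypothetical and chosen only for this example. It assumes every constraint is an inequality and every variable is bounded, and that the dual is written as min b^T y s.t. -A^T y <= c, y >= 0, under which the two optimal values sum to zero.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def nnz(v, tol=1e-12):
    """Number of entries whose magnitude exceeds tol."""
    return sum(1 for a in v if abs(a) > tol)

# Toy LP (hypothetical data): min c^T x  s.t.  A x <= b,  x >= 0.
A = [[1.0, 1.0]]          # one inequality row, so A = A_I and A_E is empty
b = [1.0]
c = [-1.0, -2.0]

x_star = [0.0, 1.0]       # hand-solved primal optimum
y_star = [2.0]            # hand-solved dual optimum

# Primal feasibility: A x <= b and x >= 0.
primal_ok = all(dot(row, x_star) <= bi for row, bi in zip(A, b)) and min(x_star) >= 0
# Dual feasibility under the assumed convention: -A^T y <= c and y >= 0.
At_y = [sum(A[i][j] * y_star[i] for i in range(len(b))) for j in range(len(c))]
dual_ok = all(-t <= cj for t, cj in zip(At_y, c)) and min(y_star) >= 0
# With this sign convention the optimal primal and dual values sum to zero.
gap = dot(c, x_star) + dot(b, y_star)

print(primal_ok, dual_ok, gap, nnz(x_star), nnz(y_star))  # prints: True True 0.0 1 1
```

Here only one of the two primal variables is nonzero at the optimum, which is the kind of primal sparsity that condition (iii) refers to, although on a problem this small the effect is only illustrative.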

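3 Sparse Linear Program: Examples

L1-regularized Multiclass SVM

$$\min_{w_m,\, \xi_i} \; \lambda \sum_{m=1}^{k} \|w_m\|_1 + \sum_{i=1}^{l} \xi_i \quad \text{s.t.} \quad w_{y_i}^T x_i - w_m^T x_i \ge e_i^m - \xi_i, \;\; \forall (i, m)$$

Sparse Inverse Covariance Estimation

$$\min_{\Omega \in \mathbb{R}^{d \times d}} \; \|\Omega\|_1 \quad \text{s.t.} \quad \|S\Omega - I_d\|_{\max} \le \lambda$$

MAP Inference on Factor Graph

$$\max_{p_i,\, i \in V;\;\; q_j,\, j \in F} \; \sum_{j \in F} \theta_j^T q_j \quad \text{s.t.} \quad M_{i,j}\, q_j = p_i, \;\; \forall (i,j) \in E, \qquad q_j \in \Delta_j.$$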
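The L1 terms in the first two examples are not linear as written. One standard way to bring them into LP form, shown here as a sketch and not necessarily the construction used in the talk, is to split each vector into nonnegative parts. The symbols $w_m^+$, $w_m^-$ and $\mathbf{1}$ are introduced only for this illustration.

```latex
\begin{align*}
w_m &= w_m^+ - w_m^-, \qquad w_m^+ \ge 0, \quad w_m^- \ge 0, \\
\|w_m\|_1 &\;\longrightarrow\; \mathbf{1}^T (w_m^+ + w_m^-).
\end{align*}
```

Because $\lambda > 0$, at an optimum at most one of each pair $(w_{m,r}^+, w_{m,r}^-)$ is nonzero, so the linear objective equals the L1 norm there. The nonnegativity constraints on the split variables play the role of $x_j \ge 0$, $j \in [n_b]$, in the general primal form.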

4 Algorithms for Linear Program

Simplex Method:
  Moving between corner points requires solving two $m \times m$ linear systems ($O(m^3)$ for $m$ iterations).
  #iterations is exponential in the worst case, but typically between $2m$ and $3m$.
Interior Point Method (IPM):
  Reduces the LP to a series of (asymptotically ill-conditioned) unconstrained problems via barrier functions.
  Newton's method requires solving an $m \times m$ linear system ($O(m^3)$).
  #iterations = $O(\log(1/\epsilon))$ for $\epsilon$ sub-optimality.

5 Algorithms for Linear Program

Subgradient-based Methods:
  Evaluating a subgradient takes only $O(\mathrm{nnz}(A))$, but $O(1/\epsilon^2)$ iterations are required. (Even finding a feasible solution is hard.)
Augmented Lagrangian Method (ALM):
  Reduces the LP to a series of bound-constrained quadratic problems.
  #iterations = $O(\log(1/\epsilon))$. (Cost of solving each sub-problem?)
This paper:
  Randomized Coordinate Descent (RCD) with ALM gives $O(\mathrm{nnz}(A) \log^2(1/\epsilon))$ overall complexity.
  More efficient for primal-sparse and dual-sparse problems via an Active-Set strategy.

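6 Augmented Lagrangian Method (ALM)

Let $g(y)$ denote the objective of the dual LP (taking $\infty$ for infeasible points). Then a primal ALM is equivalent to the dual proximal-point iterates

$$y^{t+1} = \arg\min_y \; g(y) + \frac{1}{2\eta_t} \|y - y^t\|^2, \qquad (1)$$

where we find $y^{t+1}$ by solving the dual of (1),

$$\min_{x,\, \xi} \; c^T x + \frac{\eta_t}{2} \left\| \begin{bmatrix} A_I x - b_I + \xi \\ A_E x - b_E \end{bmatrix} + \frac{1}{\eta_t} \begin{bmatrix} y_I^t \\ y_E^t \end{bmatrix} \right\|^2 \quad \text{s.t.} \quad x_b \ge 0, \;\; \xi \ge 0,$$

to obtain

$$y^{t+1} = y(x^*, \xi^*) = \eta_t \begin{bmatrix} A_I x^* - b_I + \xi^* \\ A_E x^* - b_E \end{bmatrix} + \begin{bmatrix} y_I^t \\ y_E^t \end{bmatrix}.$$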
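The slack $\xi$ can be minimized out in closed form, which is the step that leads to the quadratic program on the next slide. The short derivation below is a sketch of that step. It writes $w = A_I x - b_I + y_I^t/\eta_t$, the same quantity the next slide calls $w(x)$.

```latex
\begin{align*}
\min_{\xi_i \ge 0} \; (w_i + \xi_i)^2
  &\;\Longrightarrow\; \xi_i^* = [-w_i]_+, \qquad w_i + \xi_i^* = [w_i]_+, \\
y_I^{t+1} &= \eta_t \, [\, A_I x^* - b_I + y_I^t/\eta_t \,]_+
           = [\, y_I^t + \eta_t (A_I x^* - b_I) \,]_+, \\
y_E^{t+1} &= y_E^t + \eta_t (A_E x^* - b_E).
\end{align*}
```

The inequality multipliers therefore stay nonnegative automatically, and only $x$ remains as an optimization variable in the sub-problem.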

7 ALM with Coordinate Descent (AL-CD)

The quadratic program (with $\xi$ eliminated)

$$\min_x \; F(x) = c^T x + \frac{\eta_t}{2} \left\| \begin{bmatrix} [A_I x - b_I + y_I^t/\eta_t]_+ \\ A_E x - b_E + y_E^t/\eta_t \end{bmatrix} \right\|^2 \quad \text{s.t.} \quad x_b \ge 0$$

has gradient

$$\nabla F(x) = c + \eta_t A_I^T [w(x)]_+ + \eta_t A_E^T v(x),$$

where

$$w(x) = A_I x - b_I + y_I^t/\eta_t, \qquad v(x) = A_E x - b_E + y_E^t/\eta_t.$$

Given $w(x)$, $v(x)$, the gradient $\nabla_j F(x)$ of coordinate $j$ can be evaluated in $O(\mathrm{nnz}(a_j))$.
Maintaining $w(x)$, $v(x)$ after each update of coordinate $j$ needs $O(\mathrm{nnz}(a_j))$.
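The two $O(\mathrm{nnz}(a_j))$ claims can be made concrete with a column-sparse layout. The following is a minimal sketch, assuming each column of $A_I$ and $A_E$ is stored as a dict from row index to value. The function names `grad_j` and `apply_update` and the toy data are hypothetical.

```python
def grad_j(j, c, cols_I, cols_E, w, v, eta):
    """Partial derivative of F with respect to x_j; touches only the nonzeros of column j."""
    g = c[j]
    g += eta * sum(a * max(w[i], 0.0) for i, a in cols_I[j].items())
    g += eta * sum(a * v[i] for i, a in cols_E[j].items())
    return g

def apply_update(j, delta, x, cols_I, cols_E, w, v):
    """Apply x_j += delta and keep w(x), v(x) consistent, again in O(nnz(a_j))."""
    x[j] += delta
    for i, a in cols_I[j].items():
        w[i] += a * delta
    for i, a in cols_E[j].items():
        v[i] += a * delta

# Hypothetical toy data: A_I = [[1, 1]], b_I = [1], A_E = [[1, -1]], b_E = [0].
c = [-1.0, -2.0]
cols_I = [{0: 1.0}, {0: 1.0}]
cols_E = [{0: 1.0}, {0: -1.0}]
eta = 1.0
x = [1.0, 1.0]
w = [1.0]   # A_I x - b_I + y_I/eta with y_I = 0
v = [0.0]   # A_E x - b_E + y_E/eta with y_E = 0

g_before = [grad_j(j, c, cols_I, cols_E, w, v, eta) for j in range(2)]
apply_update(1, 0.5, x, cols_I, cols_E, w, v)
g_after = [grad_j(j, c, cols_I, cols_E, w, v, eta) for j in range(2)]
print(g_before, g_after)
```

Neither function ever forms $Ax$ from scratch, which is why a full pass over all coordinates costs on the order of $\mathrm{nnz}(A)$ operations.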

8 ALM with Coordinate Descent (AL-CD)

The quadratic program (with $\xi$ eliminated): $\min_x F(x)$ s.t. $x_b \ge 0$.

RCD algorithm: For each randomly picked $j \in [n]$:
1. Solve single-variable QP: $d_j = [x_j - \nabla_j F(x)/\nabla_j^2 F(x)]_{j,+} - x_j$. Cost: $O(\mathrm{nnz}(a_j))$.
2. Line search to obtain step size $\beta$ and $x_j \leftarrow x_j + \beta d_j$.
3. Maintain $w(x)$ and $v(x)$. Cost: $O(\mathrm{nnz}(a_j))$.
Here $[\cdot]_{j,+}$ denotes a truncation to 0 if $j \in [n_b]$.
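The three steps above, combined with the multiplier update from the ALM slide, can be sketched in pure Python as follows. This is an illustrative toy and not the authors' implementation. It merges $w(x)$ and $v(x)$ into one residual list `r` with a per-row flag `is_ineq`, uses a plain backtracking (Armijo) line search with constant 0.01, and falls back to a curvature upper bound when the generalized second derivative is zero. All of these choices, along with the function names and the toy data, are assumptions made for the example.

```python
import random

def phi(r, ineq):
    """Penalized residual: [r]_+ on an inequality row, r on an equality row."""
    return max(r, 0.0) if ineq else r

def rcd_subproblem(c, cols, b, is_ineq, bounded, y, eta, x, iters, seed=0):
    """Randomized coordinate descent on
    F(x) = c^T x + (eta/2) * sum_i phi(r_i)^2,  r = A x - b + y/eta,
    with x_j >= 0 for j in `bounded`. cols[j] maps row i -> A[i][j]."""
    rng = random.Random(seed)
    r = [y[i] / eta - b[i] for i in range(len(b))]
    for j, col in enumerate(cols):
        for i, a in col.items():
            r[i] += a * x[j]
    for _ in range(iters):
        j = rng.randrange(len(c))
        col = cols[j]
        g = c[j] + eta * sum(a * phi(r[i], is_ineq[i]) for i, a in col.items())
        # Generalized second derivative: only rows that are currently penalized.
        h = eta * sum(a * a for i, a in col.items() if r[i] > 0.0 or not is_ineq[i])
        if h == 0.0:                       # flat piece: use the curvature upper bound
            h = eta * sum(a * a for a in col.values())
        if h == 0.0:
            continue                       # empty column, nothing to do
        target = x[j] - g / h
        if j in bounded:
            target = max(target, 0.0)      # the [.]_{j,+} truncation
        d = target - x[j]
        if d == 0.0:
            continue
        beta, accepted = 1.0, False
        for attempt in range(30):          # backtracking line search
            step = beta * d
            delta_f = c[j] * step + 0.5 * eta * sum(
                phi(r[i] + a * step, is_ineq[i]) ** 2 - phi(r[i], is_ineq[i]) ** 2
                for i, a in col.items())
            if delta_f <= 0.01 * g * step:
                accepted = True
                break
            beta *= 0.5
        if accepted:
            x[j] += step
            for i, a in col.items():       # maintain the residuals in O(nnz(a_j))
                r[i] += a * step
    return x, r

def al_cd(c, cols, b, is_ineq, bounded, eta=1.0, outer=10, inner=200):
    """Outer ALM loop: approximately solve the sub-problem, then update the multipliers."""
    x, y = [0.0] * len(c), [0.0] * len(b)
    for t in range(outer):
        x, r = rcd_subproblem(c, cols, b, is_ineq, bounded, y, eta, x, inner, seed=t)
        y = [eta * phi(r[i], is_ineq[i]) for i in range(len(b))]
    return x, y

# Toy LP (hypothetical): min -x1 - 2*x2  s.t.  x1 + x2 <= 1,  x >= 0.
x, y = al_cd([-1.0, -2.0], [{0: 1.0}, {0: 1.0}], [1.0], [True], {0, 1})
print(x, y)
```

On this toy problem the iterates should settle at $x = (0, 1)$ with multiplier $y = 2$, which matches the optimum obtained by hand. The fixed iteration counts stand in for a proper stopping rule.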

9 ALM with Coordinate Descent (AL-CD)

The quadratic program (with $\xi$ eliminated): $\min_x F(x)$ s.t. $x_b \ge 0$.

RCD with Active Set: For each randomly picked $j \in \mathcal{A}^k$:
1. Solve single-variable QP: $d_j = [x_j - \nabla_j F(x)/\nabla_j^2 F(x)]_{j,+} - x_j$.
2. Line search to obtain step size $\beta$ and $x_j \leftarrow x_j + \beta d_j$.
3. Maintain $w(x)$ and $v(x)$.
4. Remove coordinate $j$ from $\mathcal{A}^k$ if $\nabla_j F(x) > \epsilon_{\mathcal{A}}$ and $j \in [n_b]$.
Here $[\cdot]_{j,+}$ denotes a truncation to 0 if $j \in [n_b]$.
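A minimal sketch of the active-set variant on a single sub-problem is given below, assuming all rows are inequalities. Two details are assumptions of this sketch rather than statements from the slide: the removal test also requires $x_j = 0$ (so that a positive coordinate is shrunk before it is dropped), and the test runs at the start of a visit instead of after the update. The function name `rcd_active_set`, the threshold `eps_active` and the toy data are hypothetical.

```python
import random

def rcd_active_set(c, cols, b, y, eta, bounded, iters=200, eps_active=1e-3, seed=0):
    """RCD on F(x) = c^T x + (eta/2) * ||[A x - b + y/eta]_+||^2 with x_j >= 0 for
    j in `bounded`, sampling coordinates only from a shrinking active set."""
    rng = random.Random(seed)
    n = len(c)
    x = [0.0] * n
    w = [y[i] / eta - b[i] for i in range(len(b))]   # w(x) at x = 0
    active = list(range(n))
    for _ in range(iters):
        if not active:
            break
        j = rng.choice(active)
        col = cols[j]
        g = c[j] + eta * sum(a * max(w[i], 0.0) for i, a in col.items())
        # Step 4 (shrinking), with the extra x_j == 0 condition assumed here.
        if j in bounded and x[j] == 0.0 and g > eps_active:
            active.remove(j)
            continue
        h = eta * sum(a * a for i, a in col.items() if w[i] > 0.0)
        if h == 0.0:                       # flat piece: use the curvature upper bound
            h = eta * sum(a * a for a in col.values())
        if h == 0.0:
            continue
        target = x[j] - g / h
        if j in bounded:
            target = max(target, 0.0)      # the [.]_{j,+} truncation
        d = target - x[j]
        beta, accepted = 1.0, False
        while d != 0.0 and beta > 1e-8 and not accepted:   # backtracking line search
            step = beta * d
            delta_f = c[j] * step + 0.5 * eta * sum(
                max(w[i] + a * step, 0.0) ** 2 - max(w[i], 0.0) ** 2
                for i, a in col.items())
            accepted = delta_f <= 0.01 * g * step
            beta *= 0.5
        if accepted:
            x[j] += step
            for i, a in col.items():       # maintain w(x) in O(nnz(a_j))
                w[i] += a * step
    return x, sorted(active)

# Sub-problem of the toy LP min -x1 - 2*x2 s.t. x1 + x2 <= 1, x >= 0, with y = 0, eta = 1.
x, active = rcd_active_set([-1.0, -2.0], [{0: 1.0}, {0: 1.0}], [1.0], [0.0], 1.0, {0, 1})
print(x, active)
```

In this example the first coordinate ends at zero with a positive partial derivative, so it is expected to leave the active set, after which only the second coordinate is sampled. A removed coordinate is never re-admitted in this sketch, which is one reason the strategy is described as heuristic.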

10 Convergence Analysis

The subproblem is strongly convex when restricted to $N(\bar{A})$ (Constant Nullspace Strong Convexity, CNSC).
We show RCD converges to $F(x) - F(x^*) \le \epsilon$ in $\gamma n \log(1/\epsilon)$ iterations with high probability, so the cost per sub-problem is $O(\mathrm{nnz}(A) \log(1/\epsilon))$.
The ALM does not amplify error when subproblems are solved inexactly: if each subproblem is solved to accuracy $\epsilon$, then after $t$ iterations the approximation error is at most $t\epsilon$.
This needs $O(tn \log(t/\epsilon)) = O(n \log^2(1/\epsilon))$ CD iterations overall.
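The step from the per-subproblem bound to the overall complexity can be spelled out as below. This is a sketch that assumes $t = O(\log(1/\epsilon))$ outer ALM iterations (the rate quoted for ALM earlier), that each sub-problem is solved to accuracy $\epsilon/t$ so the accumulated error stays below $\epsilon$, and that coordinates are sampled uniformly so one CD step costs $\mathrm{nnz}(A)/n$ in expectation.

```latex
\begin{align*}
\text{CD iterations}
  &= t \cdot \gamma n \log(t/\epsilon)
   = O\!\big( n \log(1/\epsilon) \, [\log(1/\epsilon) + \log\log(1/\epsilon)] \big)
   = O\!\big( n \log^2(1/\epsilon) \big), \\
\text{expected arithmetic cost}
  &= O\!\big( n \log^2(1/\epsilon) \big) \cdot \frac{\mathrm{nnz}(A)}{n}
   = O\!\big( \mathrm{nnz}(A) \log^2(1/\epsilon) \big).
\end{align*}
```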

11 Implementation Details

Two-Phase strategy:
  Phase-I: Spend constant cost on each sub-problem. ($|\mathcal{A}|$ is large)
  Phase-II: Solve each sub-problem to precision $\epsilon_t$, with $\eta_t$, $\epsilon_t$ dynamically adjusted. ($|\mathcal{A}|$ is small)
For ill-conditioned problems, use Projected-Newton-CG when the active set becomes stable.

12 Experiments

L1-regularized Multiclass SVM: $\mathrm{nnz}(A) \ll mn$, $\mathrm{nnz}(y^*) \ll m$, and $\mathrm{nnz}(x^*) \ll n$.

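13 Experiments

Sparse Inverse Covariance Estimation:

$$\min_{\Omega \in \mathbb{R}^{d \times d}} \; \|\Omega\|_1 \quad \text{s.t.} \quad \|S\Omega - I_d\|_{\max} \le \lambda$$

$S = Z^T Z$, $Z$ is $n \times d \ll d^2$.

Transform to:

$$\min_{\Omega \in \mathbb{R}^{d \times d},\; Y \in \mathbb{R}^{n \times d}} \; \|\Omega\|_1 \quad \text{s.t.} \quad \|Z^T Y - I_d\|_{\max} \le \lambda, \quad Y = Z\Omega$$

$\mathrm{nnz}(A) \ll mn$ and $\mathrm{nnz}(x^*) \ll n$.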
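The transformed problem has the same feasible $\Omega$ as the original one, which follows from a one-line substitution. The identity below is a sketch of that argument and uses only the definitions $S = Z^T Z$ and $Y = Z\Omega$ given above.

```latex
\begin{align*}
S\Omega = Z^T Z \,\Omega = Z^T (Z\Omega) = Z^T Y
\quad\Longrightarrow\quad
\|S\Omega - I_d\|_{\max} \le \lambda
\;\Longleftrightarrow\;
\|Z^T Y - I_d\|_{\max} \le \lambda, \;\; Y = Z\Omega .
\end{align*}
```

The reformulation trades the dense $d \times d$ matrix $S$ for the $n \times d$ factor $Z$ at the price of the extra variables $Y$ and the equality constraints $Y = Z\Omega$.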

14 Experiments

NMF: $\mathrm{nnz}(A) \ll mn$.

15 Future Works

Utilize the active set in a non-heuristic way. Can we avoid going through all variables in each ALM iteration?
Exploit primal and dual sparsity simultaneously.

Sparse Linear Programming via Primal and Dual Augmented Coordinate Descent
Ian E.H. Yen, Kai Zhong, Cho-Jui Hsieh, Pradeep Ravikumar, Inderjit S. Dhillon
University of Texas at Austin, University of California