Accelerated Block-Coordinate Relaxation for Regularized Optimization

1 Accelerated Block-Coordinate Relaxation for Regularized Optimization
Stephen J. Wright, Computer Sciences, University of Wisconsin, Madison. October 09, 2012

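2 Problem descriptions
Consider
$$\min_x \; \phi_\tau(x) := f(x) + \tau P(x),$$
where $f$ is smooth (Lipschitz continuously differentiable) and $\tau$ is a parameter. The regularizer $P$ has the form
$$P(x) = \sum_{q \in \mathcal{Q}} P_q(x_{[q]}),$$
where each $P_q$ is a closed, proper, extended-valued, convex function; $[q] \subset \{1, 2, \dots, n\}$; and $\{x_{[q]} \mid q \in \mathcal{Q}\}$ is a partition of $x$.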

3 Examples
Compressed sensing: $f(x) = \frac{1}{2}\|Ax - y\|_2^2$ and $P(x) = \|x\|_1$;
regularized least squares: $f(x) = \frac{1}{2}\|Ax - y\|_2^2$, and $P(x)$ is $\|x\|_1$ or a group-$\ell_2$ or group-$\ell_\infty$ regularizer;
regularized logistic regression: $f$ is a log-likelihood function, and $P$ is either $\|x\|_1$ or a group-$\ell_2$ regularizer;
low-rank matrix recovery: $f = \frac{1}{2}\|\mathcal{A}(LR^\top) - b\|_2^2$ and $P = \|L\|_F^2 + \|R\|_F^2$.
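The examples above share the form $\phi_\tau(x) = f(x) + \tau P(x)$ with a block-separable $P$. The following is a minimal sketch of how the compressed-sensing objective could be evaluated with either an $\ell_1$ or a group-$\ell_2$ regularizer. The function names and the representation of groups as lists of indices are illustrative assumptions, not part of the slides.

```python
import numpy as np

def l1_norm(x):
    """P(x) = ||x||_1: every coordinate is its own block."""
    return float(np.sum(np.abs(x)))

def group_l2_norm(x, groups):
    """P(x) = sum over groups q of ||x_[q]||_2.

    `groups` is assumed to be a list of index lists that partition
    range(len(x)); this layout is a choice made for the sketch.
    """
    return float(sum(np.linalg.norm(x[g]) for g in groups))

def phi(x, A, y, tau, regularizer):
    """phi_tau(x) = 0.5 * ||A x - y||_2^2 + tau * P(x)."""
    r = A @ x - y
    return 0.5 * float(r @ r) + tau * regularizer(x)

# Tiny usage example with made-up data.
A = np.eye(2)
y = np.zeros(2)
x = np.array([3.0, -4.0])
print(phi(x, A, y, 0.1, l1_norm))
print(phi(x, A, y, 0.1, lambda v: group_l2_norm(v, [[0, 1]])))
```

Passing the regularizer as a callable keeps the smooth part and the regularizer separate, which mirrors the split between $f$ and $P$ in the problem statement.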

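4 Block-coordinate prox-linear update (basic version)
At the current iterate $x$, select $Q \subseteq \mathcal{Q}$ and solve
$$\min_d \; \sum_{q \in Q} \Big[ \nabla_{[q]} f(x)^T d_{[q]} + \tau P_q(x_{[q]} + d_{[q]}) \Big] + \frac{\mu}{2}\|d\|_2^2, \tag{1.3}$$
where $\mu \ge \mu_{\min} > 0$ and $\nabla_{[q]} f$ denotes the gradient subvector for block $[q]$. Clearly, $d_{[q]} = 0$ for $q \notin Q$. If the solution of (1.3) produces a sufficient decrease in the objective $\phi_\tau$, the step is accepted. Otherwise, $\mu$ is increased and (1.3) is re-solved with the same relaxation set $Q$. Denote $Q$ at iteration $k$ by $Q_k$. For some integer $T \ge 1$ and all $k \ge T - 1$, the relaxation sets are required to satisfy
$$Q_k \cup Q_{k-1} \cup \dots \cup Q_{k-T+1} = \mathcal{Q},$$
so that every block is relaxed at least once in any $T$ consecutive iterations.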
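For the special case $P(x) = \|x\|_1$ with scalar blocks, the prox-linear subproblem has a closed-form solution by soft-thresholding, so the update described above can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: the function names are hypothetical, the sufficient-decrease test uses the cubic condition with $\gamma$ quoted on the implementation slide, and treating $\eta = 2$ as the growth factor for $\mu$ is an assumption.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def block_prox_linear_step(x, f, grad_f, tau, Q, mu,
                           eta=2.0, gamma=1e-3, max_tries=50):
    """One prox-linear step on the coordinates in Q for phi = f + tau*||.||_1.

    Returns (d, mu): the accepted step (zero outside Q) and the value of mu
    at which it was accepted. eta and gamma are assumed roles, see lead-in.
    """
    def phi(v):
        return f(v) + tau * np.sum(np.abs(v))

    # Gradient subvector. A practical code would compute only these entries.
    g_Q = grad_f(x)[Q]
    phi_x = phi(x)
    for _ in range(max_tries):
        d = np.zeros_like(x)
        # Minimizer of g^T d + (mu/2)||d||^2 + tau*||x + d||_1 over block Q.
        d[Q] = soft_threshold(x[Q] - g_Q / mu, tau / mu) - x[Q]
        if phi(x + d) <= phi_x - gamma * np.linalg.norm(d) ** 3:
            return d, mu
        mu *= eta  # insufficient decrease: increase mu, keep the same Q
    return np.zeros_like(x), mu

# Usage on a toy quadratic f(x) = 0.5 * ||x - c||^2 (made-up data).
c = np.array([3.0, -0.5, 2.0])
f = lambda v: 0.5 * np.sum((v - c) ** 2)
grad_f = lambda v: v - c
d, mu = block_prox_linear_step(np.zeros(3), f, grad_f, tau=1.0, Q=[0, 1], mu=1.0)
print(d, mu)
```

Coordinates outside `Q` are left untouched, which matches the observation that $d_{[q]} = 0$ for blocks that are not relaxed.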

5 Algorithm

6 Main idea for acceleration
1. Detect a manifold (some structure of the solution, e.g., support detection in sparse vector recovery);
2. Obtain a subspace from the manifold and transfer to a subspace optimization problem (e.g., restrict to the detected support);
3. Apply a Newton-like step to update the solution.

7 Manifold

8 Property of Manifold
Form the lower-dimensional problem $\psi_\tau(y) = \phi_\tau(G(y))$, for $y$ near $\bar{y}$.

9 Transfer to subspace problem
A point $z^*$ is a strong local minimizer of a function $h$ with modulus $c > 0$ if
$$h(z) \ge h(z^*) + c\,\|z - z^*\|_2^2 \quad \text{for all } z \text{ near } z^*.$$

10 Newton acceleration
At iteration $k$, do the following:
1. Compute $\tilde d_k$ satisfying $\phi_\tau(x_k) - \phi_\tau(x_k + \tilde d_k) \ge \|\tilde d_k\|_2^3$ and find a manifold $M_k$ containing $x_k + \tilde d_k$ (if no such $M_k$ can be conveniently identified, set $d_k = \tilde d_k$ and skip the acceleration);
2. Identify a mapping $G_k$ that parametrizes $M_k$ in the sense of Lemma 2.1, and $y_k$ such that $G_k(y_k) = x_k + \tilde d_k$ and $G_k(y) \in M_k$ for all $y$ close to $y_k$;
3. Define $\psi_\tau^k(y) = \phi_\tau(G_k(y))$ and compute a Newton step $w_k$ for $\psi_\tau^k$ from $y_k$;
4. Take $d_k = G_k(y_k + w_k) - x_k$ if this step satisfies the acceptance conditions and $x_k + d_k \in M_k$.
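The control flow of the acceleration step can be sketched as follows. This is a skeleton under assumptions rather than the procedure from the talk: `accelerated_step`, `newton_step`, and the other argument names are hypothetical, the acceptance test reuses the cubic decrease condition with $\gamma$ from the implementation slide, and falling back to the first-order step when the Newton step is rejected is an assumption about what happens in that case.

```python
import numpy as np

def accelerated_step(x, d_tilde, phi, G, y, newton_step, gamma=1e-3):
    """Try a Newton step on the reduced problem, else keep the first-order step.

    x           : current iterate x_k
    d_tilde     : first-order step already known to give sufficient decrease
    phi         : full objective phi_tau
    G           : parametrization of the manifold, with G(y) = x + d_tilde
    y           : reduced coordinates of x + d_tilde
    newton_step : callable returning a Newton(-like) step w for psi at y
    """
    w = newton_step(y)
    d = G(y + w) - x  # candidate step; it stays on the manifold by construction
    if phi(x) - phi(x + d) >= gamma * np.linalg.norm(d) ** 3:
        return d
    return d_tilde  # assumed fallback when the accelerated step is rejected

# Toy usage: manifold {x : x[0] = 0}, psi(y) = 0.5 * (y - 3)^2.
phi = lambda v: 0.5 * (v[1] - 3.0) ** 2 + abs(v[0])
G = lambda y: np.array([0.0, y[0]])
x = np.array([0.0, 1.0])
d_tilde = np.array([0.0, 0.5])
y = np.array([1.5])
exact_newton = lambda y: 3.0 - y  # exact Newton step for this quadratic psi
print(accelerated_step(x, d_tilde, phi, G, y, exact_newton))
```

Keeping the first-order step as a fallback means the accelerated iteration never does worse than the basic method in terms of the decrease condition.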

11 Global convergence

12 Identification of manifold

13 Superlinear convergence

14 $\ell_1$-regularized logistic regression
Given a dataset $\{x_i\}_{i=1}^m$ with $x_i \in \mathbb{R}^n$, where each $x_i$ carries a label $b_i \in \{-1, +1\}$, predict the chance that a given feature vector $x$ has label $+1$. Assume
$$p(x; z) = \frac{1}{1 + e^{z^T x}}.$$
With $L(z)$ denoting the log-likelihood function, consider
$$\min_z \; \phi_\tau(z) = \frac{1}{m} L(z) + \tau \|z\|_1.$$
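A minimal sketch of this objective and the gradient of its smooth part is given below. The explicit formula for $L(z)$ did not survive in the slide text, so the code makes two loudly stated assumptions: that $p(x; z) = 1/(1 + e^{z^T x})$ is the probability of label $+1$, and that $L(z)$ is the negative log-likelihood, which under that model equals $\sum_i \log(1 + e^{b_i z^T x_i})$. All function names are illustrative.

```python
import numpy as np

def neg_log_likelihood(z, X, b):
    """Assumed L(z) = sum_i log(1 + exp(b_i * x_i^T z)).

    X holds one sample per row; b holds labels in {-1, +1}.
    np.logaddexp(0, t) evaluates log(1 + e^t) without overflow.
    """
    return float(np.sum(np.logaddexp(0.0, b * (X @ z))))

def neg_log_likelihood_grad(z, X, b):
    """Gradient of the assumed L(z): sum_i b_i * sigma(b_i x_i^T z) * x_i."""
    t = b * (X @ z)
    sigma = np.exp(-np.logaddexp(0.0, -t))  # 1 / (1 + e^{-t}), computed stably
    return X.T @ (b * sigma)

def phi(z, X, b, tau):
    """phi_tau(z) = (1/m) * L(z) + tau * ||z||_1."""
    m = X.shape[0]
    return neg_log_likelihood(z, X, b) / m + tau * float(np.sum(np.abs(z)))

# Made-up data: two samples, two features.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])
z = np.zeros(2)
print(phi(z, X, b, tau=0.5))
print(neg_log_likelihood_grad(z, X, b) / X.shape[0])
```

If the intended sign convention in the exponent is the opposite one, only the sign of `b * (X @ z)` changes; the structure of the computation stays the same.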

15 Identify manifold and form subspace problem
1. For the current iterate $z$, find a descent direction $d$;
2. Define $Q_0 = \{i : z_i + d_i = 0\}$, $Q_- = \{i : z_i + d_i < 0\}$, $Q_+ = \{i : z_i + d_i > 0\}$ and let $M = \{z : z_i = 0, \; i \in Q_0\}$. Then $F(z) = E^T z$ and $G(y) = Y y$ parametrize $M$, where the columns of $E$ are those of the $n \times n$ identity corresponding to indices in $Q_0$, and $Y$ is its complement;
3. The function $\psi_\tau(y)$ is then (note $y \in \mathbb{R}^{n - |Q_0|}$)
$$\psi_\tau(y) = \frac{1}{m} L(Y y) - \tau \sum_{i \in Q_-} y_i + \tau \sum_{i \in Q_+} y_i.$$
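The construction above can be sketched with index arrays instead of explicit matrices $E$ and $Y$. The names `identify_manifold` and `make_reduced_problem` are hypothetical, `smooth` stands in for the smooth term $\frac{1}{m}L$, and the tolerance used to decide which entries count as zero is an assumption added for floating-point data.

```python
import numpy as np

def identify_manifold(z, d, tol=0.0):
    """Split indices by the sign of z + d.

    Returns (Q0, free, signs): the indices fixed at zero, the remaining
    indices (the columns of Y), and the sign of each free entry, which is
    -1 on Q_- and +1 on Q_+.
    """
    s = z + d
    Q0 = np.flatnonzero(np.abs(s) <= tol)
    free = np.flatnonzero(np.abs(s) > tol)
    return Q0, free, np.sign(s[free])

def make_reduced_problem(smooth, n, free, signs, tau):
    """Build G(y) = Y y and psi(y) = smooth(G(y)) + tau * signs^T y."""
    def G(y):
        x = np.zeros(n)
        x[free] = y
        return x

    def psi(y):
        # tau * signs^T y equals -tau * sum_{Q_-} y_i + tau * sum_{Q_+} y_i.
        return smooth(G(y)) + tau * float(signs @ y)

    return G, psi

# Made-up example with a quadratic standing in for the smooth term.
smooth = lambda x: 0.5 * float(np.sum(x ** 2))
z = np.array([0.0, 1.0, -2.0, 0.0])
d = np.array([0.0, 0.5, 0.5, 0.0])
Q0, free, signs = identify_manifold(z, d)
G, psi = make_reduced_problem(smooth, len(z), free, signs, tau=0.1)
y = (z + d)[free]
print(Q0, free, signs, psi(y))
```

On the manifold and with unchanged signs, `psi(y)` coincides with the full objective at `G(y)`, and it is smooth in `y`, which is what makes a Newton step applicable.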

16 Implementation Details
1. Selection of $Q_k$ (set of blocks). Unbiased: some fixed fraction of the indices in $\{1, 2, \dots, n\}$; biased: all nonzeros of $z_k$ plus some fixed fraction of the zeros.
2. Choice of $\mu_k$: $\mu_k = \max(\mu_{\min}, 0.8\mu_{k-1})$, with $\mu_{\min} = 10^{-3}$ and $\eta = 2$.
3. Reduced Newton-like step: select a subset $S$ of samples to form the Hessian; add a damping term $\lambda_k I$ to the reduced Hessian; do a line search so that $z_k + \alpha d_k$ has the same signs as the first-order point $z_k + \tilde d_k$; $\gamma = 10^{-3}$ (in $\phi_\tau(z_k + d_k) \le \phi_\tau(z_k) - \gamma \|d_k\|^3$).
4. Termination: measured by $\min\{\|v\|_2 : v \in \partial\phi_\tau\}$.
5. Continuation in $\tau$: $\tau_0 = \tau_{\max} = \|\nabla f(0)\|_\infty$, $\tau_{\text{final}} = 0.25\tau_0$, with 10 continuation steps.
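Item 3 combines a damped linear solve with a step-length rule that keeps the signs fixed. A minimal sketch of those two pieces is shown below. The function names are hypothetical, and halving the step length until the signs agree is an assumed strategy, since the slides do not say how the line search is carried out.

```python
import numpy as np

def damped_newton_direction(grad, hess, lam):
    """Solve (H + lam * I) w = -grad for the reduced Newton-like direction."""
    return np.linalg.solve(hess + lam * np.eye(len(grad)), -grad)

def sign_preserving_step(y, w, shrink=0.5, max_halvings=30):
    """Largest alpha in {1, shrink, shrink^2, ...} with sign(y + alpha*w) == sign(y).

    y is the reference point on the manifold (all entries nonzero).
    Returns 0.0 if no tested alpha preserves the signs.
    """
    alpha = 1.0
    for _ in range(max_halvings):
        if np.all(np.sign(y + alpha * w) == np.sign(y)):
            return alpha
        alpha *= shrink
    return 0.0

# Made-up reduced gradient and Hessian.
grad = np.array([2.0, -4.0])
hess = np.diag([1.0, 3.0])
w = damped_newton_direction(grad, hess, lam=1.0)
y = np.array([0.5, -0.3])
alpha = sign_preserving_step(y, w)
print(w, alpha, y + alpha * w)
```

Preserving the signs keeps the iterate in the region where the reduced objective is smooth, so the quadratic model behind the Newton direction remains valid.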

17 Result 1: $n = 20{,}000$, $m = 4{,}000$
$S \subseteq \{1, 2, \dots, m\}$; $G \subseteq \{1, 2, \dots, n\}$; "none" means no acceleration; 10 nonzero components drawn from $N(0, 1)$.

18 Result 2: $n = 1{,}000$, $m = 100{,}000$
$S \subseteq \{1, 2, \dots, m\}$; $G \subseteq \{1, 2, \dots, n\}$; "none" means no acceleration; 10 nonzero components drawn from $N(0, 1)$.

19 Result 3: $n = 1{,}000$, $m = 50{,}000$
$S \subseteq \{1, 2, \dots, m\}$; $G \subseteq \{1, 2, \dots, n\}$; "none" means no acceleration; 100 nonzero components of the form 10 ξ with ξ drawn from $N(0, 1)$.

20 Observations
1. Reduced Newton acceleration gives a large improvement, although the first-order method without acceleration is sometimes competitive;
2. Partial gradient evaluations give some performance benefit;
3. Full gradient evaluation with no acceleration gives the poorest performance;
4. The reduced Hessian gives no significant improvement;
5. The biased selection technique shows no clear benefit.

21 References
Stephen J. Wright, Accelerated block-coordinate relaxation for regularized optimization, SIAM J. Optimization, 22(1), 2012.
