Applications with $\ell_1$-Norm Objective Terms


1 Applications with $\ell_1$-Norm Objective Terms
Stephen Wright, University of Wisconsin-Madison. Huatulco, January 2007.

2 (1) Formulation. (2) Least-Squares with $\ell_1$: Applications, Algorithms, Results. (3) Logistic Regression: Application, Algorithms, Results.
Based on joint work with Weiliang Shi, Grace Wahba, Rob Nowak, Mario Figueiredo.

3 Formulation
We describe two classes of applications for problems of the form
$$\min_x \; f(x) + \lambda \|x\|_1,$$
where $x \in \mathbb{R}^n$; $f$ is convex, smooth, possibly nonlinear; $\lambda > 0$ is a regularization parameter. A special case of particular interest:
$$\min_x \; \tfrac12 \|Ax - y\|_2^2 + \lambda \|x\|_1.$$
$n$ may be very large (hence, storage and computational limitations); the $\ell_1$ norm may apply to only a subvector of $x$; we may wish to solve for a number of $\lambda$ values. Use well-known optimization techniques, tailored to the structure and characteristics of the applications.

4 Related Formulations
Several related formulations are also of interest in applications:
$$\min_x \; \|x\|_1 \quad \text{subject to} \quad \|Ax - y\|_2 \le t,$$
for parameter $t \ge 0$;
$$\min_x \; \|x\|_1 \quad \text{subject to} \quad Ax = y;$$
and
$$\min_x \; \|Ax - y\|_2^2 \quad \text{subject to} \quad \|x\|_1 \le t,$$
for $t > 0$.

5 Least Squares with $\ell_1$ Regularization: Applications
LASSO; Wavelet-based Signal Reconstruction; Compressed Sensing.

6 LASSO
The LASSO technique of Tibshirani (1996) works with the formulation
$$\min_x \; \|Ax - y\|_2^2 \quad \text{subject to} \quad \|x\|_1 \le t,$$
for some parameter $t \ge 0$. When $t = 0$, the solution is $x = 0$. When $t \ge \|x_{LS}\|_1$, where $x_{LS}$ is the (unconstrained) least-squares solution, we have $x = x_{LS}$.
The motive is variable selection: seek sparse approximate solutions of $Ax = y$, for which $x$ has relatively few nonzeros. (In general, smaller $t$ implies fewer nonzeros.) Once these variables are identified, solve a reduced least-squares problem in which only these variables are allowed to be nonzero.

7 Often want to find the path of solutions, for $t \in [0, \|x_{LS}\|_1]$, or at least the solutions for a sample of $t$ values along this path.

8 Wavelet-based Signal Reconstruction
Problem has the form $Ax \approx y$, where $A = RW$: $x$ is the vector of coefficients for the unknown image or signal; $W$ is a wavelet basis (multiplication by $W$ performs a wavelet transform); $R$ is the observation operator (e.g. convolution of the signal/image with a blur operator, or a tomographic projection); $y$ is the vector of observations, possibly containing errors/noise.
Dimensions are large, and the matrix representation of $W$ is dense in general. It is impractical to store or factor it, or to multiply it by $R$. However, multiplications by $R$, $R^T$, $W$, $W^T$ can be performed economically.
Motivation: want to reconstruct a signal from the transmitted encoding $y$, given prior knowledge that $x$ is sparse.

9 Specific Problems in this Class
$W$ represents an orthogonal wavelet basis of dimension $n$: $W$ is $n \times n$; multiplication by $W$ or $W^T$ costs $O(n)$, using the fast wavelet transform (FWT).
$W$ represents a redundant, translation-invariant wavelet system of dimension $n$: $W$ is $n \times n(\log_2 n + 1)$; multiplication by $W$ or $W^T$ costs $O(n \log n)$, again using the FWT.
$R$ can be a $k \times n$ random sampling matrix (consisting of zeros and a few ones, or a random mix of $\pm 1$): Compressed Sensing.
Linear code: $W = I$ and the columns of $R$ are codewords.

10 Compressed Sensing
Recent theory shows that, if $x$ is known to be sparse, then it can be reconstructed from $y \approx Ax$, where $A$ is $k \times n$ random with $k \ll n$, under certain conditions on $A$.
A Representative Result. (Candès, Romberg, Tao, 2005) Given $A$, define $\delta_S$ to be the smallest quantity for which
$$(1 - \delta_S)\|c\|_2^2 \le \|A_T c\|_2^2 \le (1 + \delta_S)\|c\|_2^2 \quad \text{for all } c,$$
where $A_T$ is a column submatrix of $A$ defined by $T \subset \{1, 2, \dots, n\}$ with $|T| \le S$. (Ensures that $A_T$ is close to orthonormal.)
If $\delta_{3S} + 3\delta_{4S} < 2$, then for any signal $x_0$ with at most $S$ nonzeros and any vector $y$ such that $\|y - A x_0\|_2 \le \epsilon$, the solution $x^*$ of
$$\min_x \; \|x\|_1 \quad \text{subject to} \quad \|Ax - y\|_2 \le \epsilon$$
satisfies $\|x^* - x_0\|_2 \le C_S \, \epsilon$, where $C_S$ is a constant depending only on $\delta_{4S}$.

11 Algorithms: Least Squares with $\ell_1$
$$\min_x \; \tfrac12 \|Ax - y\|_2^2 + \lambda \|x\|_1.$$
Formulate as bound-constrained least squares by splitting $x = u - v$, $(u, v) \ge 0$, and writing
$$\min_{u \ge 0,\, v \ge 0} \; \tfrac12 \|A(u - v) - y\|_2^2 + \lambda \mathbf{1}^T u + \lambda \mathbf{1}^T v.$$
For signal processing applications, we've had good success with gradient-projection algorithms. We'll describe these first, then discuss alternatives and indicate why they are less suitable.
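The splitting on this slide can be checked numerically. The following is a minimal sketch in Python with NumPy, assuming a small explicit matrix `A`; the function names are illustrative and not from the talk. It takes $u = \max(x, 0)$ and $v = \max(-x, 0)$ and evaluates both objectives.

```python
import numpy as np

def split_pos_neg(x):
    """Return (u, v) with u >= 0, v >= 0 and x = u - v (hypothetical helper)."""
    return np.maximum(x, 0.0), np.maximum(-x, 0.0)

def l1_objective(A, y, x, lam):
    """0.5 * ||Ax - y||_2^2 + lam * ||x||_1."""
    r = A @ x - y
    return 0.5 * (r @ r) + lam * np.abs(x).sum()

def split_objective(A, y, u, v, lam):
    """0.5 * ||A(u - v) - y||_2^2 + lam * 1^T u + lam * 1^T v."""
    r = A @ (u - v) - y
    return 0.5 * (r @ r) + lam * (u.sum() + v.sum())
```

With this particular choice of $(u, v)$ the two objectives agree, because $\mathbf{1}^T u + \mathbf{1}^T v = \|x\|_1$ when $u$ and $v$ have disjoint supports.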

12 Basic Gradient Projection
Writing the objective as $F(u, v)$, the problem is $\min_{u \ge 0,\, v \ge 0} F(u, v)$. We have
$$\nabla_{u,v} F(u, v) = \begin{bmatrix} A^T A(u - v) - A^T y + \lambda \mathbf{1} \\ -A^T A(u - v) + A^T y + \lambda \mathbf{1} \end{bmatrix}.$$
Main costs: evaluation of $F(u, v)$: one multiplication by $A$; evaluation of $\nabla_{u,v} F(u, v)$: one additional multiplication by $A^T$.
Choose search direction $(\delta_u, \delta_v) = -(\nabla_u F, \nabla_v F)$. Look along the path $(u + \alpha \delta_u, v + \alpha \delta_v)_+$, $\alpha > 0$, where $(\cdot)_+$ denotes projection onto the nonnegative orthant.

13 First try $\alpha = \alpha_0$, the unconstrained minimizer of $F$ along this direction, which is given by
$$\alpha_0 = \frac{\|(\delta_u, \delta_v)\|_2^2}{\|A(\delta_u - \delta_v)\|_2^2}.$$
(Cost: one multiplication by $A$.)
Armijo backtracking: choose the first $\alpha$ in the sequence $\alpha_0, \beta\alpha_0, \beta^2\alpha_0, \dots$ satisfying a sufficient decrease condition:
$$F((u + \alpha\delta_u, v + \alpha\delta_v)_+) \le F(u, v) - .001\,(\alpha/\alpha_0)\, \nabla F(u, v)^T \left[(u, v) - (u + \alpha_0\delta_u, v + \alpha_0\delta_v)_+\right].$$
Set $(u, v) \leftarrow (u + \alpha\delta_u, v + \alpha\delta_v)_+$.
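The two slides on basic gradient projection can be put together as follows. This is a minimal sketch in Python with NumPy, assuming an explicit matrix `A` (the talk only needs products with $A$ and $A^T$), a zero starting point, a fixed iteration count, and $\beta = 0.5$; the function names and the cap of 50 backtracking steps are assumptions, not details from the talk.

```python
import numpy as np

def F(A, y, u, v, lam):
    """Split objective 0.5 * ||A(u - v) - y||^2 + lam * (1^T u + 1^T v)."""
    r = A @ (u - v) - y
    return 0.5 * (r @ r) + lam * (u.sum() + v.sum())

def grad_F(A, y, u, v, lam):
    """Gradient blocks of F with respect to u and v."""
    g = A.T @ (A @ (u - v) - y)
    return g + lam, -g + lam

def gp_basic(A, y, lam, iters=200, beta=0.5, mu=1e-3):
    n = A.shape[1]
    u, v = np.zeros(n), np.zeros(n)
    for _ in range(iters):
        gu, gv = grad_F(A, y, u, v, lam)
        du, dv = -gu, -gv                      # steepest-descent direction
        Ad = A @ (du - dv)
        denom = Ad @ Ad
        if denom == 0.0:                       # simplification: stop if no curvature
            break
        alpha0 = (du @ du + dv @ dv) / denom   # unconstrained minimizer along d
        u0 = np.maximum(u + alpha0 * du, 0.0)
        v0 = np.maximum(v + alpha0 * dv, 0.0)
        ref = gu @ (u - u0) + gv @ (v - v0)    # reference decrease at alpha0
        F_old = F(A, y, u, v, lam)
        alpha = alpha0
        for _ in range(50):                    # Armijo backtracking along the projected path
            un = np.maximum(u + alpha * du, 0.0)
            vn = np.maximum(v + alpha * dv, 0.0)
            if F(A, y, un, vn, lam) <= F_old - mu * (alpha / alpha0) * ref:
                break
            alpha *= beta
        u, v = un, vn
    return u - v
```

The sketch recomputes $A(u - v)$ in `F` and `grad_F` for readability; an implementation that cares about the per-iteration cost quoted on the slide would cache that product.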

14 Termination
It's often not practical to iterate until the optimal active set is identified. However, since our main interest is in approximately identifying the correct active set, we use a criterion based on this. Specifically, terminate when the relative change to
$$I_k \stackrel{\text{def}}{=} \{ i : u_i^k > 0 \text{ or } v_i^k > 0 \}$$
falls below tola. (We use tola = .02.)
Tried GPCG (Moré and Toraldo, 1991), which alternates GP steps with CG exploration of a fixed working set, but it doesn't work well, as the restriction of the objective to the current working set usually has a singular Hessian.
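The slide does not spell out how the "relative change" to $I_k$ is measured. One plausible reading, assumed here, is the number of indices that enter or leave the set divided by the size of the current set. A minimal Python sketch with hypothetical function names:

```python
import numpy as np

def active_set(u, v):
    """Boolean mask of I_k = {i : u_i > 0 or v_i > 0}."""
    return (u > 0) | (v > 0)

def relative_change(prev_mask, curr_mask):
    """Assumed measure: |symmetric difference| / |current set| (at least 1)."""
    changed = np.count_nonzero(prev_mask != curr_mask)
    size = max(np.count_nonzero(curr_mask), 1)
    return changed / size

def should_stop(prev_mask, curr_mask, tola=0.02):
    """Stop when the relative change falls below tola (.02 on the slide)."""
    return relative_change(prev_mask, curr_mask) < tola
```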

15 Debiasing
After finding an approximate solution, we follow with a debiasing step, in which the zero elements of $x = u - v$ are discarded and we perform an unconstrained minimization of $\|Ax - y\|_2^2$ over the nonzero elements, using conjugate gradient. Terminate this phase after decreasing the gradient $A^T(Ax - y)$ by factor of
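A debiasing pass of this kind can be sketched as conjugate gradient on the least-squares problem restricted to the current support. The Python sketch below assumes an explicit matrix `A` so that columns can be sliced; the reduction factor `grad_reduction` is a placeholder parameter, since the value used in the talk is not given here, and the function name is illustrative.

```python
import numpy as np

def debias(A, y, x, grad_reduction=1e-4, max_iter=None):
    """CG on min ||A_S z - y||^2 over the support S of x (hypothetical helper)."""
    S = np.flatnonzero(x)
    out = np.zeros(len(x), dtype=float)
    if S.size == 0:
        return out
    AS = A[:, S]
    z = x[S].astype(float)
    r = AS.T @ (y - AS @ z)          # negative gradient of 0.5 * ||AS z - y||^2
    p = r.copy()
    rs = r @ r
    target = grad_reduction ** 2 * rs  # stop once ||gradient|| shrinks by grad_reduction
    for _ in range(max_iter or 10 * S.size):
        if rs <= target:
            break
        Ap = AS @ p
        curv = Ap @ Ap               # p^T (AS^T AS) p
        if curv == 0.0:
            break
        a = rs / curv
        z += a * p
        r -= a * (AS.T @ Ap)
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    out[S] = z
    return out
```

Each CG iteration uses one product with $A_S$ and one with $A_S^T$, so the same loop works when only operator products are available.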

16 Barzilai-Borwein Variants
Consider a modification of GP based on the method of Barzilai & Borwein (1988) for unconstrained nonlinear minimization, modified by Dai & Fletcher (2005) for box-constrained QP. The approach is non-monotone.
Motivation in terms of $\min_x f(x)$: let $s$ and $y$ be the changes in $x$ and $\nabla f$ over the last step:
$$s = x_k - x_{k-1}, \qquad y = \nabla f(x_k) - \nabla f(x_{k-1}),$$
choose $\alpha$ so that $\alpha s \approx y$ in the least-squares sense (so that $\nabla^2 f \approx \alpha I$ in some sense) and set
$$x_{k+1} = x_k - \frac{1}{\alpha} \nabla f(x_k).$$
Can compute $\alpha$ trivially and obtain
$$x_{k+1} = x_k - \frac{s^T s}{s^T y} \nabla f(x_k).$$
Obtain a variant by choosing $\alpha$ so that $s \approx \alpha y$ and setting $x_{k+1} = x_k - \alpha \nabla f(x_k)$.
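For the unconstrained case, the first Barzilai-Borwein step above can be sketched as follows. This is a minimal Python illustration, assuming a user-supplied gradient function `grad`; the small fixed first step `step0`, the tolerance, and the function name are assumptions added to make the sketch runnable.

```python
import numpy as np

def bb_minimize(grad, x0, iters=100, step0=1e-3, tol=1e-10):
    """Gradient method with step s^T s / s^T y (hypothetical helper)."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad(x_prev)
    x = x_prev - step0 * g_prev          # plain gradient step to get started
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        s = x - x_prev                   # change in x over the last step
        yv = g - g_prev                  # change in the gradient over the last step
        sy = s @ yv
        if sy <= 0.0:                    # no positive curvature information; stop
            break
        step = (s @ s) / sy              # 1/alpha with alpha = s^T y / s^T s
        x_prev, g_prev = x, g
        x = x - step * g                 # no line search: the method is non-monotone
    return x
```

On a strictly convex quadratic $f(x) = \tfrac12 x^T Q x - b^T x$ one would call it with `grad = lambda x: Q @ x - b`.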

17 Modifications for QP
See Dai & Fletcher, modified for a least-squares objective and with an additional line search.
Compute step $(\delta_u, \delta_v) = (u - \alpha \nabla_u F, v - \alpha \nabla_v F)_+ - (u, v)$.
Perform an exact line search to minimize $F$ along the line segment $(u, v) + \lambda(\delta_u, \delta_v)$, $\lambda \in [0, 1]$ (here $\lambda$ denotes the step fraction along the segment, not the regularization parameter).
Set
$$\alpha = \frac{\|(\delta_u, \delta_v)\|_2^2}{\|A(\delta_u - \delta_v)\|_2^2}$$
for the next iteration.
Cheap! Total of two multiplications by $A$ or $A^T$ at each iteration.
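The three steps on this slide can be sketched as one loop. The Python code below is an illustration under stated assumptions, not the authors' implementation: it uses an explicit matrix `A`, starts from zero with an initial `alpha = 1.0`, runs a fixed number of iterations, and keeps the residual $A(u - v) - y$ up to date so that each iteration needs one product with $A^T$ and one with $A$.

```python
import numpy as np

def gp_bb(A, y, lam, iters=200, alpha=1.0):
    """Gradient projection with Barzilai-Borwein step and exact line search."""
    n = A.shape[1]
    u, v = np.zeros(n), np.zeros(n)
    r = -np.asarray(y, dtype=float)          # residual A(u - v) - y at u = v = 0
    for _ in range(iters):
        g = A.T @ r                          # one multiplication by A^T
        gu, gv = g + lam, -g + lam
        du = np.maximum(u - alpha * gu, 0.0) - u
        dv = np.maximum(v - alpha * gv, 0.0) - v
        slope = gu @ du + gv @ dv            # directional derivative along (du, dv)
        if slope >= 0.0:                     # projected step is zero: stationary
            break
        Ad = A @ (du - dv)                   # one multiplication by A
        curv = Ad @ Ad
        # exact minimizer of the quadratic along the segment, clipped to [0, 1]
        t = 1.0 if curv == 0.0 else min(1.0, -slope / curv)
        u, v, r = u + t * du, v + t * dv, r + t * Ad
        if curv > 0.0:
            alpha = (du @ du + dv @ dv) / curv   # step length for the next iteration
    return u - v
```

When $A = I$ the first projected step already lands on the soft-thresholded vector, which gives a convenient sanity check.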

18 In some applications, theory suggests an appropriate value of the regularization parameter $\lambda$. But we often need to search around this value and solve for a range of $\lambda$ values. Since gradient projection approaches benefit from a good starting point, we can simply use the approximate solution for one $\lambda$ as the starting point for a nearby $\lambda$. Solve for a preassigned sequence of $\lambda$ values, in increasing order. (The set of nonzero $x$ components shrinks as $\lambda$ increases.)

19 Alternative Approach: Active Set/Pivoting
Solve the equivalent problem with objective $\|Ax - y\|_2^2$ and constraint $\|x\|_1 \le t$ by starting with $t = 0$ (solution $x = 0$) and proceeding to $t = \|x_{LS}\|_1$ (solution $x = x_{LS}$). Determine breakpoints: values of $\lambda$ at which a component of $x$ changes from zero to nonzero or vice versa. Use pivoting operations to update $x$ at these values for the new active set. See Osborne et al (2000), Efron et al (2003).
Need to be able to factor $A$ and submatrices of $A$, hence unsuitable for problems where $A$ is not known explicitly.

20 Alternative Approach: Interior-Point
Could apply a primal-dual method to the bound-constrained formulation, solving the linear system at each IP iteration by CG or some variant. Each inner iteration requires multiplications by $A$ and $A^T$, but not explicit knowledge of $A$.
Application of this approach to our problem is one of the approaches described in the basis pursuit paper of Chen, Donoho, Saunders (1998). See also Saunders (2002) and his PDCO / SolveBP code.
Not very good at solving for multiple $\lambda$ values due to the usual difficulty of warm-starting interior-point methods (though probably easier here because of the simplicity of the constraints).

21 Alternative Approach: Bound Optimization
Solve an approximate QP in which the Hessian $A^T A$ is replaced by a diagonal approximation $D$, for which $A^T A \preceq D$:
$$\min_x \; \tfrac12 (x - x_k)^T D (x - x_k) + (x - x_k)^T A^T (A x_k - y) + \lambda \|x\|_1,$$
where $x_k$ is the previous iterate. Can solve in closed form to get the new iterate $x_{k+1}$. (If $x_k = x^*$ is optimal, the solution is $x = x^*$.)
For applications of interest, the price paid by ignoring off-diagonal information is too high, and convergence is slow. (Due to Figueiredo, Nowak and others.)
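Because $D$ is diagonal, the subproblem separates by component, and each scalar problem is solved by soft thresholding: $x_{k+1,i} = \operatorname{soft}(x_{k,i} - g_i / D_{ii},\ \lambda / D_{ii})$ with $g = A^T(A x_k - y)$. The Python sketch below illustrates one such step; the function names are illustrative, and the choice $D = \|A\|_2^2 I$ mentioned afterwards is one simple way to satisfy $A^T A \preceq D$, not necessarily the choice made in the cited work.

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def bound_opt_step(A, y, x, lam, d):
    """One closed-form step; d holds the diagonal of D (hypothetical helper)."""
    g = A.T @ (A @ x - y)            # gradient of the least-squares term at x_k
    return soft(x - g / d, lam / d)
```

With `d = np.full(n, np.linalg.norm(A, 2) ** 2)` the quadratic model majorizes the true objective, so repeated steps never increase it.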

22 Alternative Approach: Second-Order Cone
Applies to the formulation
$$\min_x \; \|x\|_1 \quad \text{subject to} \quad \|Ax - y\|_2 \le t,$$
for parameter $t \ge 0$.
In the l1magic code (Candès, Romberg), recast as a second-order cone program and solved by a primal log-barrier / Newton / CG approach using the usual barrier term
$$-\mu \log(t^2 - \|Ax - y\|_2^2)$$
for the constraint. Again not good at solving for multiple $t$ values.

23 Results: Compressed Sensing
Small, simple, explicit problem to evaluate different algorithms:
$$\min_x \; \tfrac12 \|y - Rx\|_2^2 + \lambda \|x\|_1,$$
where $R$ is $k \times n$, dense, with elements chosen independently from $N(0, 1)$, then rows are normalized. Choose $x_i = 0$ with prob .99, $x_i \sim \text{Uniform}[-1, 1]$ with prob .01. Choose $y = Rx + e$ where $e_i \sim N(0, .005)$. Solve for just a single value of the regularization parameter (labeled tau in the plots).
Compare several algorithms: Gradient Projection (Basic and Barzilai-Borwein); l1-magic (SOCP formulation); SparseLab (basis pursuit / interior-point); a bound optimization algorithm.
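A test problem of this kind can be generated in a few lines. The Python sketch below makes several assumptions that the slide leaves open: rows of $R$ are scaled to unit Euclidean norm, .005 is treated as the noise standard deviation, and the default sizes `k=512`, `n=4096` are taken from the figure details on the following slide. The function name and the `seed` argument are illustrative.

```python
import numpy as np

def make_problem(k=512, n=4096, p_nonzero=0.01, sigma=0.005, seed=0):
    """Random compressed-sensing instance (R, x, y) under the stated assumptions."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((k, n))                    # i.i.d. N(0, 1) entries
    R /= np.linalg.norm(R, axis=1, keepdims=True)      # assumed: unit 2-norm rows
    mask = rng.random(n) < p_nonzero                   # nonzero with prob .01
    x = np.where(mask, rng.uniform(-1.0, 1.0, n), 0.0)
    e = sigma * rng.standard_normal(k)                 # assumed: sigma is a std dev
    return R, x, R @ x + e
```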

24 GP reconstructs the signal well (compared to least-squares solution). (Note some attenuation due to the $\|x\|_1$ term.)
[Figure: three panels: Original; Reconstruction (n = 4096, k = 512, sigma = 0.005; tau and MSE values not recoverable); Pseudo-solution (MSE value not recoverable).]

25 Barzilai-Borwein version is faster than basic GP.
[Figure: objective function versus CPU time for GPD-Basic and GPD-BB.]

26 As problem size increases, BB beats the competition.
[Figure: CPU time (seconds) versus n for GPD-BB, SparseLab, l1-magic, and BOA.]

27 Debiasing removes the attenuation due to the $\|x\|_1$ term.
[Figure: three signal panels; panel titles not recoverable.]

28 Application: Logistic Regression
Have $n$ subjects with attribute vectors $x(i)$, $i = 1, 2, \dots, n$, and labels $(y_1(i), y_2(i), \dots, y_K(i))$, where $K$ = number of classes. $y_r(i) = 1$ if subject $i$ is in class $r$ and $y_r(i) = 0$ otherwise.
Express the probability that some $x = x(i)$ is in class $k$ by means of functions $p_k(x)$ and $f_k(x)$, related by
$$p_k(x) = \frac{\exp f_k(x)}{\sum_{j=1}^K \exp f_j(x)}.$$
Express $f_k$ in terms of basis functions:
$$f_k(x) = \sum_{l=0}^N c_{kl} B_l(x), \quad k = 1, 2, \dots, K,$$
with coefficients $c_{kl}$, $k = 1, 2, \dots, K$ and $l = 0, 1, 2, \dots, N$, to be determined from the optimization.
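The relation between $c_{kl}$, $f_k$, and $p_k$ can be written in matrix form. The Python sketch below assumes the basis values are collected in an array `B` of shape $(n, N+1)$ with `B[i, l]` equal to $B_l(x(i))$, and the coefficients in `C` of shape $(K, N+1)$; these array layouts and the function name are illustrative choices.

```python
import numpy as np

def class_probabilities(B, C):
    """Return P with P[i, k] = p_k(x(i)) (hypothetical helper)."""
    F = B @ C.T                             # F[i, k] = f_k(x(i)) = sum_l c_kl B_l(x(i))
    F = F - F.max(axis=1, keepdims=True)    # shift per row; leaves the ratios unchanged
    E = np.exp(F)
    return E / E.sum(axis=1, keepdims=True)
```

Subtracting the row maximum before exponentiating only guards against overflow; the probabilities are unchanged because the same shift cancels in numerator and denominator.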

29 Typical dimensions: $K \in [2, 10]$, $n$ possibly , $N$ possibly exponential in the number of features (dimension of each $x(i)$).
Seek solutions for which only a small fraction of the $c_{kl}$ are nonzero: these identify the most significant basis functions $B_l$.
Log-likelihood function: given $x(1), x(2), \dots, x(n)$ and labels $y(1), y(2), \dots, y(n)$, find the optimal functions $f_k$, $k = 1, 2, \dots, K$, by minimizing
$$L(c) = -\frac{1}{n} \sum_{i=1}^n \sum_{k=1}^K y_k(i) \log p_k(x(i)).$$
(Recall: the $p_k$ depend on $f_k$, which depend on the $c_{kl}$.)
Introduce LASSO regularization term:
$$J(c) = \sum_{k=1}^K \sum_{l=1}^N |c_{kl}|.$$
(NB: no penalty for $c_{k0}$, which corresponds to basis function $B_0(x) \equiv 1$.)
Minimize the function $T_\lambda(c) = L(c) + \lambda J(c)$.

30 Two-Category Variant
When $K = 2$ the formulation simplifies. Can WLOG set $f_1(x) \equiv 0$, i.e. $c_{1,l} = 0$, $l = 0, 1, \dots, N$ (since if we apply the same shift to $c_{1,l}$ and $c_{2,l}$, the functions $p_1(x)$ and $p_2(x)$ do not change). Define $c_l \stackrel{\text{def}}{=} c_{2,l}$, $l = 0, 1, 2, \dots, N$, and obtain
$$L(c) = \frac{1}{n} \sum_{i=1}^n \left[ -y_2(i) \sum_{l=0}^N c_l B_l(x(i)) + \log\left(1 + \exp\left(\sum_{l=0}^N c_l B_l(x(i))\right)\right) \right]$$
and
$$J(c) = \sum_{l=1}^N |c_l|.$$
Minimize $T_\lambda(c) = L(c) + \lambda J(c)$ for a number of different $\lambda$ values; choose the most suitable $\lambda$ in an outer loop: Generalized Approximate Cross-Validation (GACV).
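The two-category objective is short to write down in code. The Python sketch below assumes `B` has shape $(n, N+1)$ with a leading column of ones for $B_0$, `y2` is the 0/1 vector of class-2 labels, and `c` has length $N+1$; the function names are illustrative. The gradient formula $\nabla L(c) = \frac{1}{n} B^T (p_2 - y_2)$ follows by differentiating $L$ term by term.

```python
import numpy as np

def logistic_loss(c, B, y2):
    """L(c) = mean over i of [-y2(i) f(i) + log(1 + exp f(i))], with f = B c."""
    f = B @ c
    return np.mean(-y2 * f + np.logaddexp(0.0, f))   # logaddexp(0, f) = log(1 + e^f)

def logistic_grad(c, B, y2):
    """Gradient of L: B^T (p2 - y2) / n, where p2 = e^f / (1 + e^f)."""
    f = B @ c
    p2 = 1.0 / (1.0 + np.exp(-f))
    return B.T @ (p2 - y2) / len(y2)

def penalized_objective(c, B, y2, lam):
    """T_lambda(c) = L(c) + lam * J(c); c[0] (constant basis) is not penalized."""
    return logistic_loss(c, B, y2) + lam * np.abs(c[1:]).sum()
```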

31 PatternSearch: Beaver Dam Data
$n = 876$ = number of persons in study. Each $x(i)$ is a zero-one vector of 7 features/risk factors:
Risk factor | 0 | 1
sex | female | male
income | > $30,000 | ≤ $30,000
juvenile myopia | myopic after age 21 | myopic before age 21
cataract severity | 1,2,3 | 4,5
smoking pack-years | ≤ 30 | > 30
aspirin | no | yes
vitamins | no | yes
Find combinations of factors, as well as individual factors, that predict progression of myopia. Define $N = 2^7$ and basis functions
$$B_{i_1, i_2, \dots, i_7}(x) = \prod_{j : i_j = 1} x_j.$$
This function is 1 if $x_j = 1$ for all $j$ with $i_j = 1$, and zero otherwise.
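The pattern basis functions can be enumerated directly for a small number of binary features. The Python sketch below represents each index vector $(i_1, \dots, i_7)$ by the tuple of positions where $i_j = 1$ and treats the empty pattern as the constant function 1; the function name and this ordering of patterns are illustrative choices.

```python
from itertools import combinations

import numpy as np

def pattern_basis(X):
    """Return (patterns, B) with B[i, col] = product of X[i, j] over j in patterns[col]."""
    X = np.asarray(X)
    n, m = X.shape
    # every subset of the m features, ordered by size; the empty subset comes first
    patterns = [s for r in range(m + 1) for s in combinations(range(m), r)]
    B = np.ones((n, len(patterns)))
    for col, s in enumerate(patterns):
        if s:
            B[:, col] = X[:, list(s)].prod(axis=1)
    return patterns, B
```

For 7 features this yields $2^7 = 128$ columns, which is practical to enumerate; the exponential growth is the reason the later slides resort to prescreening for larger feature sets.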

32 PatternSearch: Rheumatoid Arthritis and SNPs
Want to predict the likelihood that an individual is susceptible to rheumatoid arthritis based on genetic variations, plus some environmental factors.
SNP: variation in a single nucleotide in the genome sequence, e.g. AAGGC changes to ATGGC. This version has two alleles, A and T. The less frequent nucleotides are called minor alleles. It's observed that rheumatoid arthritis is associated with SNPs on chromosome 6.
Include 9787 nucleotides, mostly on chromosome 6, in the feature vector $x$, where the relevant component of $x$ is coded as 0, 1, 2 according to whether it contains the most common nucleotide or one of the minor alleles. $x$ also contains coding of a variation of DR type at the HLA locus of chromosome 6. $x$ also contains variables for gender (female = 1), smoking (yes = 1), and age (older than 55 = 1).
Total of 9792 $x$ components. Only two categories ($K = 2$).

33 Might like to examine all possible interactions but this would involve solving a problem with unknowns. Instead, do multiple rounds of prescreening. 795 of the 9792 individual variables survived the first round. (Actually, 880 variables survive, as for some variables more than one level was of interest.) After screening for interactions between pairs of these 795 variables, obtained 1679 interactions of possible interest. Then solve a max-likelihood problem with 2559 = 880 + 1679 variables.

34 Algorithm
Use variable splitting again ($c = c^+ - c^-$), and write the problem as
$$\min_{c^+ \ge 0,\, c^- \ge 0} \; T_\lambda(c^+ - c^-) = L(c^+ - c^-) + \lambda \mathbf{1}^T c^+ + \lambda \mathbf{1}^T c^-.$$
(Ignore the unconstrained component of $c$ for simplicity.) Recall that $L(c)$ has the form
$$L(c) = \frac{1}{n} \sum_{i=1}^n \left[ -y_2(i) \sum_{l=0}^N c_l B_l(x(i)) + \log\left(1 + F(x(i); c)\right) \right],$$
where
$$F(x; c) = \exp\left( \sum_{l=0}^N c_l B_l(x) \right).$$
It's relatively expensive to evaluate $F(x(i); c)$ for $i = 1, 2, \dots, n$. However, once these quantities are known, the gradient $\nabla L(c)$ is cheap and the Hessian is uncomplicated (though dense).

35 Two-Metric Gradient Projection
Use gradient projection again, with two-metric scaling in the free components (Bertsekas, 1982). At each iterate $x^k$ for $\min_{x \ge 0} f(x)$:
Calculate $\nabla f(x^k)$ and use it to form an estimate of the free variable set $I_k$. Exclude $i$ from $I_k$ if $x_i^k$ is close to zero and $\partial f / \partial x_i > 0$.
Calculate the partial Hessian corresponding to $I_k$:
$$H_{I_k} = \left[ \frac{\partial^2 f(x^k)}{\partial x_i \, \partial x_j} \right]_{i \in I_k,\, j \in I_k}.$$
Form search direction $p^k$ by
$$p_i^k = -\tau_k \frac{\partial f}{\partial x_i}, \quad i \notin I_k,$$
for some scale factor $\tau_k$, and
$$p^k_{I_k} = -(H_{I_k} + \epsilon_k I)^{-1} \nabla_{I_k} f(x^k).$$
Do Armijo backtracking line search along $(x^k + \alpha p^k)_+$, $\alpha = 1, \tfrac12, \tfrac14, \tfrac18, \dots$.
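The iteration on this slide can be sketched as below. This Python code is an illustration under several assumptions: the caller supplies `f`, `grad`, and a function `hess_sub(x, idx)` returning only the Hessian rows and columns in `idx`; the thresholds `eps_active`, `damping`, `tau`, and `sigma` are placeholder values; and the sufficient-decrease test $f(x(\alpha)) \le f(x) - \sigma \nabla f(x)^T (x - x(\alpha))$ is a simplified stand-in for the condition in Bertsekas's method.

```python
import numpy as np

def two_metric_gp(f, grad, hess_sub, x0, iters=50,
                  eps_active=1e-8, damping=1e-8, tau=1.0, sigma=1e-4):
    """Two-metric gradient projection for min f(x) subject to x >= 0 (sketch)."""
    x = np.maximum(np.asarray(x0, dtype=float), 0.0)
    for _ in range(iters):
        g = grad(x)
        # free set: drop i when x_i is (near) zero and the gradient pushes it below zero
        free = ~((x <= eps_active) & (g > 0))
        p = -tau * g                              # scaled gradient step off the free set
        idx = np.flatnonzero(free)
        if idx.size:
            H = hess_sub(x, idx)                  # only the free-free block is needed
            p[idx] = -np.linalg.solve(H + damping * np.eye(idx.size), g[idx])
        fx = f(x)
        alpha = 1.0
        for _ in range(30):                       # backtracking: alpha = 1, 1/2, 1/4, ...
            x_new = np.maximum(x + alpha * p, 0.0)
            if f(x_new) <= fx - sigma * (g @ (x - x_new)):
                break
            alpha *= 0.5
        done = np.linalg.norm(x_new - x) <= 1e-12
        x = x_new
        if done:
            break
    return x
```

Passing the index set to `hess_sub` reflects the point made on the next slide: when few components are free, only a small Hessian block has to be formed and factored.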

36 Why 2MGP?
For interesting $\lambda$ values, there are very few nonzero components of $c$ at the solution, so we have to compute only a small submatrix of the Hessian (and it's cheap).
Use of the reduced Hessian (with damping) accelerates the method greatly over plain gradient projection.
Strategies that need more of the Hessian are not practical because it's dense and very large.
Strategies that use CG on the reduced Hessian are unnecessary because it's small and easily calculated.
It's faster than Matlab's fmincon!

37 Results: Myopia Progression
Choose $\lambda$ by GACV. For the selected $\lambda$, 13 of the 128 possible combined risk factors are selected. These are subjected to further analysis ("Step 2") and 5 factors survive. Coefficients for $f_2$:
pattern | coefficient
constant |
cataract | 2.42
smoking, no vitamins | 1.18
male, low income, juv. myopia, no aspirin | 1.84
male, low income, cataract, no aspirin | 1.08

38 Thanks! Thanks to the Organizers! Thanks for listening! Thanks to Dramamine!
