Bias-free Sparse Regression with Guaranteed Consistency
1 Bias-free Sparse Regression with Guaranteed Consistency
Wotao Yin (UCLA Math), joint with Stanley Osher, Ming Yan (UCLA) and Feng Ruan, Jiechao Xiong, Yuan Yao (Peking U).
UC Riverside, STATS Department, March 10, 2015
2 Background
Goal: recover a sparse $x^* \in \mathbb{R}^n$ from the noisy linear observation $b := Ax^* + \varepsilon$, where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$ are given and $\varepsilon$ is zero-mean unknown noise.
Our focus: the under-determined case, $m \ll n$.
LASSO is a common approach, but its solution is biased. Fan and Li (2001): to avoid bias, minimization must use a nonconvex prior.
Our approach keeps the convex prior but replaces minimization.
3 This talk
- Reviews LASSO and explains its solution bias.
- A new regularization path: the solution of an ordinary differential inclusion. It uses a convex prior, is free of bias, has the oracle property, and has sign/$\ell_2$ consistency.
- How to compute the exact path, as well as its fast approximations.
- How to try it by making a 2-line change to your existing code.
5 LASSO and its bias
Minimization form: $x^{\mathrm{lasso}} \in \arg\min_x \|x\|_1 + \frac{t}{2m}\|Ax - b\|_2^2$.
Variational form (optimality condition): $0 = p + \frac{t}{m}A^T(Ax^{\mathrm{lasso}} - b)$ and $p \in \partial\|x^{\mathrm{lasso}}\|_1$.
Suppose $S := \mathrm{supp}(x^*)$, that is, $x^* = [x^*_S; 0]$, and LASSO recovers the exact support, $S = \mathrm{supp}(x^{\mathrm{lasso}})$. Then
$$x^{\mathrm{lasso}}_S = \underbrace{x^*_S + \tfrac{1}{m}(A_S^T A_S)^{-1} A_S^T \varepsilon}_{\text{oracle estimate, } \mathbb{E}(\cdots) = x^*_S} - \underbrace{\tfrac{1}{t}(A_S^T A_S)^{-1}\,\mathrm{sign}(x^{\mathrm{lasso}}_S)}_{\text{bias}}.$$
7 Toy example 1
Consider $b > 0$ and the all-scalar problem $b = x^* + \varepsilon$.
Oracle estimate: $x^{\mathrm{oracle}} = b$.
LASSO: $x^{\mathrm{lasso}} \in \arg\min_x |x| + \frac{t}{2}(x - b)^2$.
LASSO solution:
$$x^{\mathrm{lasso}} = \begin{cases} 0, & 0 \le t \le \tfrac{1}{b}, \\ b - \tfrac{1}{t}, & \tfrac{1}{b} < t < \infty. \end{cases}$$
LASSO reduces the signal magnitude.
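The two-branch formula above is easy to sanity-check numerically. A minimal sketch (my own illustration, not from the slides): compare the closed form against a brute-force grid search of the scalar objective.

```python
import numpy as np

def scalar_lasso(b, t):
    """Closed-form minimizer of |x| + (t/2)*(x - b)**2 for b > 0:
    zero until t exceeds 1/b, then the shrunken value b - 1/t."""
    return 0.0 if t <= 1.0 / b else b - 1.0 / t

b = 2.0
grid = np.linspace(-1.0, 3.0, 4001)  # brute-force search grid, spacing 0.001
for t in (0.25, 0.5, 1.0, 4.0):
    obj = np.abs(grid) + 0.5 * t * (grid - b) ** 2
    x_closed = scalar_lasso(b, t)
    # closed form agrees with the grid minimizer up to grid resolution
    assert abs(x_closed - grid[np.argmin(obj)]) < 1e-3
    # LASSO never exceeds the oracle estimate x_oracle = b
    assert x_closed <= b
```

For `t = 4.0` this returns `1.75`, visibly short of the oracle value `2.0`: the magnitude reduction the slide describes.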
9 Toy example 2
Suppose sorted $a_1 > a_2 \ge \dots \ge a_n > 0$ and the given measurement is $b = a^T x^* + \varepsilon \in \mathbb{R}$, where $x^* \in \mathbb{R}^n$.
LASSO solution: $x^{\mathrm{lasso}}_2 = \dots = x^{\mathrm{lasso}}_n = 0$ for all $t \ge 0$, and
$$x^{\mathrm{lasso}}_1 = \begin{cases} 0, & 0 \le t \le \tfrac{1}{a_1 b}, \\ \tfrac{b}{a_1} - \tfrac{1}{t a_1^2}, & \tfrac{1}{a_1 b} < t < \infty. \end{cases}$$
LASSO selects $a_1$ but reduces the signal magnitude.
10 A more realistic example
Setup: $n = 256$, $m = 25$, Gaussian noise $\varepsilon$.
[Figure: true signal vs. BPDN (LASSO) recovery, with $t$ hand-tuned.]
The LASSO solution:
- selects large signals but reduces their magnitudes
- misses several moderate-sized signals (false negatives)
- includes small false signals (false positives)
12 LASSO post-debiasing
Goal: restore the reduced magnitudes. Let $S := \mathrm{supp}(x^{\mathrm{lasso}})$.
Common approach: solve $\min_x \|Ax - b\|^2$ subject to $\mathrm{supp}(x) = S$ (the solution and $x^{\mathrm{lasso}}$ may have different signs).
Another approach: remove $\frac{1}{t}(A_S^T A_S)^{-1}\mathrm{sign}(x^{\mathrm{lasso}}_S)$ from $x^{\mathrm{lasso}}_S$.
Issues:
- extra computation of a matrix inversion
- cannot correct false positives or false negatives in $x^{\mathrm{lasso}}$
- cannot work with continuous support (e.g., low-rank matrix recovery)
13 Proposed: inverse scale space (ISS) dynamic
The name comes from image processing.
Idea: instead of minimizing prior + fitting, evolve prior and fitting along their (sub)gradients. Get the solution path $\{x(t), p(t)\}_{t \ge 0}$ by evolving from the initial $x(0) = p(0) = 0$:
$$\underbrace{\dot p(t) = \tfrac{1}{m}A^T(b - Ax(t))}_{\text{fitting}}, \qquad \underbrace{p(t) \in \partial\|x(t)\|_1}_{\text{prior}}.$$
The ISS path is well-defined under assumptions: $p(t)$ is right-continuously differentiable and $x(t)$ is right-continuous.
14 Compare LASSO and ISS
Apply LASSO and ISS to the same example shown before.
[Figures: true signal vs. BPDN (LASSO) recovery, shown previously; true signal vs. ISS (Bregman) recovery.]
Compared to LASSO:
- ISS does not reduce signal magnitudes
- ISS has fewer false positives
- ISS has fewer false negatives; it recovers the moderate-sized signals
16 Under the hood: removing LASSO bias at its origin
Recall that in LASSO, we have $p \in \partial\|x^{\mathrm{lasso}}\|_1$ and $p = \frac{t}{m}A^T(b - Ax^{\mathrm{lasso}})$.
Differentiating the equation w.r.t. $t$ gives
$$\dot p = \tfrac{1}{m}A^T\big(b - A(t\dot x^{\mathrm{lasso}} + x^{\mathrm{lasso}})\big).$$
In fact, $t\dot x^{\mathrm{lasso}} + x^{\mathrm{lasso}}$ is LASSO's post-debiasing solution!
Replacing $t\dot x^{\mathrm{lasso}} + x^{\mathrm{lasso}}$ by $x$ removes the bias, yielding the ISS dynamic
$$\dot p = \tfrac{1}{m}A^T(b - Ax).$$
ISS works better than (LASSO + post-debiasing).
17 Numerical result: prostate tumor size
The first example from Hastie-Tibshirani-Friedman.
Problem: given 8 clinical features, select predictors for prostate tumor size.
Data: 67 training cases + 30 testing cases; parameters picked by cross validation.
[Table: estimated coefficients of LS, Subset Selection, LASSO, and ISS for the predictors Intercept, lcavol, lweight, age, lbph, svi, lcp, gleason, and pgg45, followed by the number of selected predictors and the test error of each method.]
ISS uses the fewest predictors and achieves the best test error!
19 Theory: consistency guarantees for ISS
Question: is there a $\bar t$ so that $x(\bar t)$ has the following properties?
- sign consistency: $\mathrm{sign}(x^*) = \mathrm{sign}(x(\bar t))$
- no false positive: if the true $x^*_i = 0$, then $x_i(\bar t) = 0$
- no false negative: if the true $x^*_i \ne 0$, then $x_i(\bar t) \ne 0$
Theorem. Make the assumptions: Gaussian noise $\omega \sim N(0, \sigma^2 I)$, normalized columns $\frac{1}{n}\max_j \|A_j\|^2 \le 1$, and the irrepresentable and strong-signal conditions. Then, with high probability, the ISS point $x(\bar t)$ has sign consistency and gives an unbiased estimate of $x^*$. (There is an explicit formula for $\bar t$.)
Proof is based on the next two lemmas.
20 No false positive
Define the true support $S := \mathrm{supp}(x^*)$, and let $T := S^c$.
Lemma. Under the assumptions, if $A_S$ has full column rank and
$$\max_{j \in T} \|A_j^T A_S (A_S^T A_S)^{-1}\|_1 \le 1 - \eta$$
for some $\eta \in (0, 1)$, then with high probability
$$\mathrm{supp}(x(s)) \subseteq S, \quad \forall s \le \bar t := O\Big(\frac{\eta\sqrt{m}}{\sigma\sqrt{\log n}}\Big).$$
Proof uses: (i) a concentration inequality, and (ii) if $\mathrm{supp}(x(s)) \subseteq S$ for all $s \le t$, then
$$p_T(s) = A_T^T A_S (A_S^T A_S)^{-1} p_S(s) + \tfrac{s}{m} A_T^T P^{\perp}_{A_S}\, \omega, \quad \forall s \le t.$$
21 No false negative / sign consistency
Lemma. Under the assumptions, if $A_S^T A_S \succeq \gamma I$ and
$$x^*_{\min} \ge \max\Big\{ O\Big(\frac{\sigma\sqrt{\log|S|}}{\gamma\sqrt{m}}\Big),\; O\Big(\frac{\sigma\sqrt{\log|S|\,\log n}}{\eta\gamma\sqrt{m}}\Big) \Big\},$$
then there exists $\bar t$ (which can be given explicitly) so that, with high probability,
$$x(\bar t) = x^* + (A_S^T A_S)^{-1} A_S^T \omega \quad \text{obeys} \quad \mathrm{sign}(x(\bar t)) = \mathrm{sign}(x^*)$$
and $\|x(\bar t) - x^*\|_\infty \le x^*_{\min}/2$.
- The first term in the max ensures $\|(A_S^T A_S)^{-1} A_S^T \omega\|_\infty \le x^*_{\min}/2$.
- The second term ensures $\inf\{t : \mathrm{sign}(x_S(t)) = \mathrm{sign}(x^*_S)\} \le \bar t$.
24 Compute the ISS path
Theorem. The solution path of $\dot p(t) = \frac{1}{m}A^T(b - Ax(t))$ and $p(t) \in \partial\|x(t)\|_1$, with initial $t_0 = 0$, $p(0) = 0$, $x(0) = 0$, is given piecewise by the following iteration: for $k = 1, 2, \dots, K$ compute
- $p(t)$, which is piecewise linear:
$$p(t) = p(t_{k-1}) + \frac{t - t_{k-1}}{m} A^T\big(b - Ax(t_{k-1})\big), \quad t \in [t_{k-1}, t_k],$$
where $t_k := \sup\{t > t_{k-1} : p(t) \in \partial\|x(t_{k-1})\|_1\}$;
- $x(t)$, which for $t \in [t_{k-1}, t_k)$ is constantly equal to $x(t_{k-1})$; if $t_k < \infty$, the next point is
$$x(t_k) = \arg\min_u \|Au - b\|_2^2 \quad \text{subject to} \quad \begin{cases} u_i \ge 0, & p_i(t_k) = 1, \\ u_i = 0, & p_i(t_k) \in (-1, 1), \\ u_i \le 0, & p_i(t_k) = -1. \end{cases}$$
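The theorem above can be sketched in a few lines of NumPy. This is my own illustration, not the authors' code, and it simplifies the theorem in one way: the least-squares step at each break point drops the sign constraints on $u$, which is harmless only when the unconstrained solution on the active set already matches $\mathrm{sign}(p)$ (as it does in the small example below).

```python
import numpy as np

def iss_path(A, b, max_breaks=10):
    """Trace the piecewise ISS path: p(t) is piecewise linear, x(t) is
    piecewise constant, with break points where a new |p_i| reaches 1."""
    m, n = A.shape
    p, x, t = np.zeros(n), np.zeros(n), 0.0
    path = [(t, x.copy())]
    for _ in range(max_breaks):
        v = A.T @ (b - A @ x) / m                 # dp/dt on this segment
        inactive = np.abs(p) < 1 - 1e-9
        # time for each inactive p_i to reach the boundary +1 or -1
        dts = np.full(n, np.inf)
        pos, neg = v > 1e-12, v < -1e-12
        dts[pos] = (1 - p[pos]) / v[pos]
        dts[neg] = (-1 - p[neg]) / v[neg]
        dts[~inactive] = np.inf
        if not np.isfinite(dts).any():
            break                                  # path has stalled
        dt = dts.min()
        t += dt
        p = np.clip(p + dt * v, -1.0, 1.0)
        S = np.abs(p) >= 1 - 1e-9                  # active set at break point
        u, *_ = np.linalg.lstsq(A[:, S], b, rcond=None)
        x = np.zeros(n)
        x[S] = u                                   # sign constraints omitted
        path.append((t, x.copy()))
    return path

# With A = I, the path picks up each coordinate of b in order of magnitude,
# at its exact (unbiased) value.
path = iss_path(np.eye(2), np.array([3.0, 1.0]))
```

Here the break points occur at $t = 2/3$ and $t = 2$, and the final point is exactly $(3, 1)$: no shrinkage.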
25 ISS computation
ISS is fast on moderately sized problems:
- evolve from $t = 0$ through (finitely many) break points
- each break point poses a constrained least-squares subproblem (since it is similar to the one at the previous break point, it can be solved by maintaining a QR decomposition)
How to evolve ISS for huge problems with many break points? Answer: fast discrete approximations.
- Bregman iteration: LASSO subproblem + add-back-the-residual
- Linearized Bregman iteration: closed-form iteration, parallelizable
28 Discrete ISS = Bregman iteration
Apply forward Euler to $\dot p = \frac{1}{m}A^T(b - Ax)$ while keeping $p \in \partial\|x\|_1$:
$$p^{k+1} = p^k + \frac{\delta}{m}A^T(b - Ax^k),$$
which is the first-order optimality condition of
$$x^{k+1} \in \arg\min_x\; \underbrace{\|x\|_1 - \|x^k\|_1 - \langle p^k, x - x^k\rangle}_{\text{Bregman distance of } \ell_1} + \frac{\delta}{2m}\|Ax - b\|^2.$$
By a change of variable, obtain the equivalent iteration:
$$x^{k+1} \in \arg\min_x \|x\|_1 + \frac{\delta}{2m}\|Ax - b^k\|^2, \qquad b^{k+1} \leftarrow b^k + (b - Ax^{k+1}) \quad \text{(add back the residual)}.$$
Keep your LASSO solver, use a small $\delta$, and just add back the residual.
Important: the derivation still holds with $\|\cdot\|_1$ replaced by any convex $r(\cdot)$.
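In code, "add back the residual" is the promised 2-line change wrapped around any LASSO solver. A minimal sketch (my illustration): a plain ISTA loop stands in for "your existing solver"; only the two marked lines are the Bregman outer loop.

```python
import numpy as np

def shrink(z, lam):
    """Soft-thresholding, the proximal operator of lam*||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_solver(A, b, delta, iters=2000):
    """Stand-in LASSO solver (ISTA) for min ||x||_1 + (delta/2m)||Ax-b||^2."""
    m, n = A.shape
    x = np.zeros(n)
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L for the quadratic term
    for _ in range(iters):
        x = shrink(x - step * (A.T @ (A @ x - b)), step * m / delta)
    return x

def bregman(A, b, delta, outer=5):
    """Bregman iteration: keep the LASSO solver, just add back the residual."""
    bk = b.copy()
    for _ in range(outer):
        x = lasso_solver(A, bk, delta)   # line 1: unchanged LASSO solve
        bk = bk + (b - A @ x)            # line 2: add back the residual
    return x
```

With $A = I_3$ and $b = (2, 0.5, 0)$ at $\delta = 1$, five outer iterations return $(2, 0, 0)$: the large entry is recovered at its exact magnitude (no shrinkage) while the small one is still screened out; more outer iterations would eventually pick it up as well, tracing a regularization path.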
31 Faster alternative: linearized Bregman ISS
Add the damping term $\frac{1}{\kappa}\dot x(t)$ to ISS:
$$\dot p(t) + \frac{1}{\kappa}\dot x(t) = \frac{1}{m}A^T(b - Ax(t)), \qquad p(t) \in \partial\|x(t)\|_1.$$
The solution is piecewise smooth, and every piece has a closed form. It converges to the ISS solution exponentially fast in $\kappa$.
By $z(t) = p(t) + \frac{1}{\kappa}x(t)$, it reduces to an ODE:
$$\dot z(t) = \frac{1}{m}A^T\big(b - \kappa A\,\mathrm{shrink}(z(t))\big).$$
Insight: given $z(t)$, uniquely recover
$$x(t) = \kappa\,\mathrm{shrink}(z(t)), \qquad p(t) = z(t) - \frac{1}{\kappa}x(t).$$
33 Discrete linearized Bregman iteration
ODE from the last slide: $\dot z = \frac{1}{m}A^T(b - \kappa A\,\mathrm{shrink}(z(t)))$.
Forward Euler:
$$z^{k+1} = z^k + \frac{\alpha_k}{m} A^T\big(b - A\underbrace{(\kappa\,\mathrm{shrink}(z^k))}_{x^k}\big).$$
Easy to parallelize for very large datasets. For example, $A = [A_1\; A_2\; \cdots\; A_L]$, where each $A_l$ is distributed.
Distributed implementation: for $l = 1, \dots, L$ in parallel:
$$z_l^{k+1} = z_l^k + \frac{\alpha_k}{m} A_l^T (b - w^k), \qquad w_l^{k+1} = \kappa A_l\,\mathrm{shrink}(z_l^{k+1});$$
all-reduce sum: $w^{k+1} = \sum_{l=1}^L w_l^{k+1}$.
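A serial sketch of the forward-Euler iteration (my illustration; the distributed version above partitions this same update across column blocks of $A$):

```python
import numpy as np

def shrink(z):
    """Soft-thresholding with unit threshold."""
    return np.sign(z) * np.maximum(np.abs(z) - 1.0, 0.0)

def linearized_bregman(A, b, kappa, alpha, iters=500):
    """z^{k+1} = z^k + (alpha/m) A^T (b - A x^k), x^k = kappa*shrink(z^k).
    The intermediate x^k stay sparse, tracing a regularization path."""
    m, n = A.shape
    z = np.zeros(n)
    for _ in range(iters):
        x = kappa * shrink(z)
        z = z + (alpha / m) * (A.T @ (b - A @ x))
    return kappa * shrink(z)
```

For a consistent system the iterates converge to the minimizer of $\|x\|_1 + \frac{1}{2\kappa}\|x\|^2$ subject to $Ax = b$; with $A = I$ that feasible set is a single point, so the iteration recovers $b$ exactly.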
36 Comparison to ISTA
ISTA iteration: $x^{k+1} = \mathrm{shrink}\big(x^k - \frac{\alpha_k}{m}A^T(Ax^k - b),\; \frac{1}{t}\big)$.
Discrete linearized Bregman (LBreg) iteration: $z^{k+1} = z^k - \frac{\alpha_k}{m}A^T\big(A(\kappa\,\mathrm{shrink}(z^k)) - b\big)$.
Comparison:
- ISTA solves LASSO as $k \to \infty$; the intermediate $x^k$ are dense.
- LBreg's intermediate $x^k$ are sparse (useful as a regularization path); as $k \to \infty$, it solves
$$\text{minimize}_x\; \|x\|_1 + \frac{1}{2\kappa}\|x\|^2 \quad \text{subject to} \quad Ax = b,$$
with the exact penalty property: a sufficiently large $\kappa$ gives an $\ell_1$ minimizer.
38 Comparison to orthogonal matching pursuit (OMP) [1]
OMP: start with the index set $S = \emptyset$ and the vector $x = 0$; iterate:
1. compute the residual correlation $A^T(b - Ax)$ and add the index of its largest entry to $S$;
2. set $x \leftarrow \arg\min \|b - Ax\|_2^2$ subject to $x_i = 0$ for all $i \notin S$.
Differences:
- OMP increases the index set $S$ (OMP variants evolve $S$ in other ways).
- ISS evolves $p \in \partial\|x\|_1$, encoding how likely a current zero becomes nonzero.
[1] Mallat-Zhang '93, Tropp-Gilbert '07
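The two OMP steps above fit in a short function. A minimal sketch (my illustration, not from the slides):

```python
import numpy as np

def omp(A, b, steps):
    """Orthogonal matching pursuit: greedily add the column most correlated
    with the residual, then re-fit by least squares on the support."""
    m, n = A.shape
    S, x = [], np.zeros(n)
    for _ in range(steps):
        corr = A.T @ (b - A @ x)             # residual correlations
        j = int(np.argmax(np.abs(corr)))     # index of the largest entry
        if j not in S:
            S.append(j)
        u, *_ = np.linalg.lstsq(A[:, S], b, rcond=None)
        x = np.zeros(n)
        x[S] = u
    return x
```

Note the contrast with ISS: OMP's state is the discrete set $S$, while ISS carries the continuous dual variable $p$, whose entries drift toward $\pm 1$ before a coordinate activates.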
41 Generalization
Consider any convex regression model, parameterized by $t$:
$$\text{minimize}_x\; r(x) + t f(x). \tag{1}$$
Fan and Li (2001): a convex $r$ causes bias; their solution is to make $r$ nonconvex.
Our solution: time differentiation,
$$\dot p(t) = -\nabla f(x(t)), \qquad p(t) \in \partial r(x(t)).$$
Applications:
- prior $r$: weighted $\ell_1$, $\ell_{1,2}$, nuclear norm; can incorporate nonnegativity or box constraints as indicator functions
- fitting $f$: square loss, logistic loss, etc.
You can keep your existing solver for (1) and try iteratively adding back the residual. In fact, there is even a simple way to make $r$ nonconvex.
42 Related work in optimization / image processing
Discrete:
- Bregman iteration for imaging and compressed sensing: Osher-Burger-Goldfarb-Xu-Yin '06, Yin-Osher-Goldfarb-Darbon '08
- Linearized Bregman on $\ell_1$: Yin-Osher-Goldfarb-Darbon '08, Yin '10, Lai-Yin '13
- Matrix completion SVT on $X$: Cai-Candès-Shen '10
- Extension and analysis: Zhang '13, Zhang '14
Continuous:
- Inverse scale space (ISS) on TV: Burger-Gilboa-Osher-Xu '06
- Adaptive ISS on $\ell_1$: Burger-Möller-Benning-Osher '11
- Greedy ISS on $\ell_1$: Möller-Zhang '13
43 Summary
Instead of minimizing $r(x) + t f(x)$, try solving
$$\dot p(t) = -\nabla f(x(t)), \qquad p(t) \in \partial r(x(t)).$$
The solution will have the structure you seek, with no or less bias, and it often has simple and fast approximation algorithms.
Reference: S. Osher, F. Ruan, J. Xiong, Y. Yao and W. Yin, "Sparse Recovery via Differential Inclusions," UCLA CAM Report, July 2014.
More informationMaster 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique
Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some
More informationESL Chap3. Some extensions of lasso
ESL Chap3 Some extensions of lasso 1 Outline Consistency of lasso for model selection Adaptive lasso Elastic net Group lasso 2 Consistency of lasso for model selection A number of authors have studied
More informationCompressive Sensing (CS)
Compressive Sensing (CS) Luminita Vese & Ming Yan lvese@math.ucla.edu yanm@math.ucla.edu Department of Mathematics University of California, Los Angeles The UCLA Advanced Neuroimaging Summer Program (2014)
More informationComputing Sparse Representation in a Highly Coherent Dictionary Based on Difference of L 1 and L 2
Computing Sparse Representation in a Highly Coherent Dictionary Based on Difference of L and L 2 Yifei Lou, Penghang Yin, Qi He and Jack Xin Abstract We study analytical and numerical properties of the
More informationNear Ideal Behavior of a Modified Elastic Net Algorithm in Compressed Sensing
Near Ideal Behavior of a Modified Elastic Net Algorithm in Compressed Sensing M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas M.Vidyasagar@utdallas.edu www.utdallas.edu/ m.vidyasagar
More informationAn iterative hard thresholding estimator for low rank matrix recovery
An iterative hard thresholding estimator for low rank matrix recovery Alexandra Carpentier - based on a joint work with Arlene K.Y. Kim Statistical Laboratory, Department of Pure Mathematics and Mathematical
More informationregression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered,
L penalized LAD estimator for high dimensional linear regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered, where the overall number of variables
More informationA Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models
A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los
More informationLeast Absolute Gradient Selector: variable selection via Pseudo-Hard Thresholding
arxiv:204.2353v4 [stat.ml] 9 Oct 202 Least Absolute Gradient Selector: variable selection via Pseudo-Hard Thresholding Kun Yang September 2, 208 Abstract In this paper, we propose a new approach, called
More information1 Regression with High Dimensional Data
6.883 Learning with Combinatorial Structure ote for Lecture 11 Instructor: Prof. Stefanie Jegelka Scribe: Xuhong Zhang 1 Regression with High Dimensional Data Consider the following regression problem:
More informationNew Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit
New Coherence and RIP Analysis for Wea 1 Orthogonal Matching Pursuit Mingrui Yang, Member, IEEE, and Fran de Hoog arxiv:1405.3354v1 [cs.it] 14 May 2014 Abstract In this paper we define a new coherence
More informationECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis
ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Lecture 3: Sparse signal recovery: A RIPless analysis of l 1 minimization Yuejie Chi The Ohio State University Page 1 Outline
More informationSCIENCE CHINA Information Sciences. Received December 22, 2008; accepted February 26, 2009; published online May 8, 2010
. RESEARCH PAPERS. SCIENCE CHINA Information Sciences June 2010 Vol. 53 No. 6: 1159 1169 doi: 10.1007/s11432-010-0090-0 L 1/2 regularization XU ZongBen 1, ZHANG Hai 1,2, WANG Yao 1, CHANG XiangYu 1 & LIANG
More informationCOMPARATIVE ANALYSIS OF ORTHOGONAL MATCHING PURSUIT AND LEAST ANGLE REGRESSION
COMPARATIVE ANALYSIS OF ORTHOGONAL MATCHING PURSUIT AND LEAST ANGLE REGRESSION By Mazin Abdulrasool Hameed A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for
More informationGREEDY SIGNAL RECOVERY REVIEW
GREEDY SIGNAL RECOVERY REVIEW DEANNA NEEDELL, JOEL A. TROPP, ROMAN VERSHYNIN Abstract. The two major approaches to sparse recovery are L 1-minimization and greedy methods. Recently, Needell and Vershynin
More informationSubgradient Method. Ryan Tibshirani Convex Optimization
Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial
More informationRegression Shrinkage and Selection via the Lasso
Regression Shrinkage and Selection via the Lasso ROBERT TIBSHIRANI, 1996 Presenter: Guiyun Feng April 27 () 1 / 20 Motivation Estimation in Linear Models: y = β T x + ɛ. data (x i, y i ), i = 1, 2,...,
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationContraction Methods for Convex Optimization and monotone variational inequalities No.12
XII - 1 Contraction Methods for Convex Optimization and monotone variational inequalities No.12 Linearized alternating direction methods of multipliers for separable convex programming Bingsheng He Department
More informationGradient Descent. Ryan Tibshirani Convex Optimization /36-725
Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like
More informationA New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables
A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of
More informationSparse analysis Lecture III: Dictionary geometry and greedy algorithms
Sparse analysis Lecture III: Dictionary geometry and greedy algorithms Anna C. Gilbert Department of Mathematics University of Michigan Intuition from ONB Key step in algorithm: r, ϕ j = x c i ϕ i, ϕ j
More informationSparse Optimization Lecture: Sparse Recovery Guarantees
Those who complete this lecture will know Sparse Optimization Lecture: Sparse Recovery Guarantees Sparse Optimization Lecture: Sparse Recovery Guarantees Instructor: Wotao Yin Department of Mathematics,
More informationSimultaneous Sparsity
Simultaneous Sparsity Joel A. Tropp Anna C. Gilbert Martin J. Strauss {jtropp annacg martinjs}@umich.edu Department of Mathematics The University of Michigan 1 Simple Sparse Approximation Work in the d-dimensional,
More informationA Modern Look at Classical Multivariate Techniques
A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico
More informationSummary and discussion of: Controlling the False Discovery Rate via Knockoffs
Summary and discussion of: Controlling the False Discovery Rate via Knockoffs Statistics Journal Club, 36-825 Sangwon Justin Hyun and William Willie Neiswanger 1 Paper Summary 1.1 Quick intuitive summary
More informationLinear Regression with Strongly Correlated Designs Using Ordered Weigthed l 1
Linear Regression with Strongly Correlated Designs Using Ordered Weigthed l 1 ( OWL ) Regularization Mário A. T. Figueiredo Instituto de Telecomunicações and Instituto Superior Técnico, Universidade de
More information