
ECE598: Information-theoretic methods in high-dimensional statistics            Spring 2016

Lecture 19: Minimax rates for sparse linear regression

Lecturer: Yihong Wu        Scribe: Subhadeep Paul, April 13/14, 2016

In the last lecture we analyzed the $k$-sparse Gaussian location model in high dimension (the denoising problem) and proved the minimax rate for estimating the location parameter. In this lecture we extend those ideas to sparse linear regression in high dimension. We prove a minimax lower bound and then obtain upper bounds on the risk of a few procedures.

19.1 Problem setup: Sparse linear regression

The sparse linear regression model is

    Y_{n \times 1} = X_{n \times p}\, \theta_{p \times 1} + Z, \qquad Z \sim N(0, I_n),        (19.1)

where $X \in \mathbb{R}^{n \times p}$ is the design matrix and $\theta \in \mathbb{R}^p$ is an unknown $k$-sparse parameter vector. In this lecture we are concerned with the case where $n \ll p$ (but $n \gtrsim k$), i.e., we have more predictors in the design matrix than we have samples.

Interpretation: $Y$ is a noisy linear combination of the columns of the design matrix $X$. The goal is to estimate $\theta$, given $Y$ and $X$. Note that the system has more unknowns than equations and hence is underdetermined even without the noise. Estimation is made possible by the $k$-sparsity structure.

Note: We consider here random design matrices only. More precisely,

    X_{ij} \overset{\text{i.i.d.}}{\sim} N(0, 1/n),

so that the columns have roughly unit norm.
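As a concrete illustration of this setup, here is a minimal Python sketch (not part of the original notes; the helper name make_sparse_instance and the chosen values of n, p, k are only illustrative) that generates one instance of model (19.1) with the random design $X_{ij} \sim N(0, 1/n)$:

# Illustrative sketch, not from the original notes.
import numpy as np

def make_sparse_instance(n, p, k, rng):
    """Draw (X, theta, Y) from model (19.1) with X_ij ~ N(0, 1/n)."""
    X = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, p))   # columns have roughly unit norm
    theta = np.zeros(p)
    support = rng.choice(p, size=k, replace=False)        # k-sparse parameter
    theta[support] = rng.normal(0.0, 1.0, size=k)
    Z = rng.normal(0.0, 1.0, size=n)                      # standard Gaussian noise
    Y = X @ theta + Z
    return X, theta, Y

rng = np.random.default_rng(0)
X, theta, Y = make_sparse_instance(n=100, p=1000, k=5, rng=rng)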

The next theorem gives a lower bound on the minimax risk for estimating $\theta$ in the $k$-sparse regression model.

Theorem 19.1. The minimax risk for estimating $\theta$ in the model defined by (19.1) is lower bounded by

    R^* = R^*(p, k, n) = \inf_{\hat\theta} \sup_{\|\theta\|_0 \le k} \mathbb{E}\, \|\hat\theta - \theta\|_2^2 \gtrsim k \log \frac{ep}{k}, \qquad \text{for all } n.

Proof. Note that the pairs $(X_i, Y_i)$, where $X_i$ denotes the $i$-th row of $X$, are i.i.d. samples from a distribution $P_\theta$. We show that the KL divergence between two parameter values is exactly the same as in the $p$-dimensional Gaussian location model, namely $D(P_{\theta_0} \| P_{\theta_1}) = \frac{1}{2}\|\theta_0 - \theta_1\|_2^2$. Indeed, for a single pair,

    D(P_{X_i, Y_i | \theta_0} \| P_{X_i, Y_i | \theta_1})
      = \mathbb{E}_{X_i}\big[ D(P_{Y_i | X_i, \theta_0} \| P_{Y_i | X_i, \theta_1}) \big]
      = \mathbb{E}_{X_i}\big[ D(N(\langle X_i, \theta_0 \rangle, 1) \| N(\langle X_i, \theta_1 \rangle, 1)) \big]
      = \mathbb{E}_{X_i}\big[ \tfrac{1}{2} \langle X_i, \theta_0 - \theta_1 \rangle^2 \big]
      = \tfrac{1}{2}\, \mathbb{E}_{X_i}\big[ (\theta_0 - \theta_1)^\top X_i X_i^\top (\theta_0 - \theta_1) \big]
      = \tfrac{1}{2}\, (\theta_0 - \theta_1)^\top \mathbb{E}[X_i X_i^\top] (\theta_0 - \theta_1)
      = \tfrac{1}{2n} \|\theta_0 - \theta_1\|_2^2,

since $\mathbb{E}[X_i X_i^\top] = \frac{1}{n} I_p$. Hence, tensorizing over the $n$ i.i.d. pairs,

    D(P_{\theta_0} \| P_{\theta_1}) = D\big(P_{X_i, Y_i | \theta_0}^{\otimes n} \,\big\|\, P_{X_i, Y_i | \theta_1}^{\otimes n}\big) = \tfrac{1}{2} \|\theta_0 - \theta_1\|_2^2.

The result then follows from the analysis of the Gaussian location model in the previous lecture.

From the previous discussion we have $R^* \gtrsim k \log\frac{ep}{k}$ for any $n$, which is identical to the minimax rate of the denoising problem. This is reasonable: even with full observation $n \asymp p$, which roughly corresponds to the denoising problem, we cannot beat this rate. Surprisingly, as long as $n \gtrsim k \log\frac{ep}{k}$, the denoising rate is attainable and we have $R^* \asymp k \log\frac{ep}{k}$, achieved by, e.g., the maximum likelihood estimator (MLE). This is proved in Section 19.2. Note that the MLE is computationally expensive. A computable alternative is the Dantzig selector [CT07]. As analyzed in Section 19.3, it is guaranteed to achieve the rate $R \lesssim k \log p$ as long as $n \gtrsim k \log\frac{ep}{k}$. This falls slightly short of the optimal rate. However, unlike the MLE, the procedure is completely adaptive and can be cast as a linear program. More recently a procedure called SLOPE [SC15] has been proposed which achieves the optimal rate; in particular its risk coincides with the minimax rate with sharp constant, namely $R^* = (2 + o(1))\, k \log\frac{p}{k}$ as $p \to \infty$ and $k = o(p)$, provided that $n \gg k \log\frac{ep}{k}$.

19.2 Analysis of the MLE

The MLE in this case is defined as a solution (which may or may not be unique) of the optimization problem

    \hat\theta_{\mathrm{MLE}} \in \arg\min_{\|\theta\|_0 \le k} \|Y - X\theta\|_2^2.        (19.2)

Unfortunately this optimization problem can in general only be solved by exhaustive search over supports, which is NP-hard in the worst case.
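To make the exhaustive-search nature of (19.2) concrete, here is a brute-force sketch (my own illustration, not from the notes; it is only feasible for small $p$ and $k$) that enumerates all supports of size $k$ and solves a least-squares problem on each:

# Illustrative sketch, not from the original notes.
from itertools import combinations
import numpy as np

def mle_exhaustive(X, Y, k):
    """Brute-force solver of (19.2): minimize ||Y - X theta||_2^2 over k-sparse theta."""
    n, p = X.shape
    best_rss, best_theta = np.inf, np.zeros(p)
    for support in combinations(range(p), k):
        S = list(support)
        # Unrestricted least squares on the candidate support S.
        coef, *_ = np.linalg.lstsq(X[:, S], Y, rcond=None)
        rss = np.sum((Y - X[:, S] @ coef) ** 2)
        if rss < best_rss:
            best_rss = rss
            best_theta = np.zeros(p)
            best_theta[S] = coef
    return best_theta

# Tiny example; the search over C(p, k) supports is exponential in general.
rng = np.random.default_rng(1)
n, p, k = 30, 12, 2
X = rng.normal(0, 1 / np.sqrt(n), (n, p))
theta = np.zeros(p); theta[[1, 7]] = [3.0, -2.0]
Y = X @ theta + rng.normal(size=n)
theta_hat = mle_exhaustive(X, Y, k)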

Theorem 19.2. Whenever $n \ge C k \log\frac{ep}{k}$ for a sufficiently large constant $C$, for every $\theta \in B_0(k)$,

    \|\hat\theta_{\mathrm{MLE}} - \theta\|_2^2 \lesssim k \log\frac{ep}{k},        (19.3)

    \|X(\hat\theta_{\mathrm{MLE}} - \theta)\|_2^2 \lesssim k \log\frac{ep}{k},        (19.4)

hold with high probability.

Proof. Since $\hat\theta_{\mathrm{MLE}}$ is a minimizer, we have $\|Y - X\hat\theta_{\mathrm{MLE}}\|_2^2 \le \|Y - X\theta\|_2^2 = \|Z\|_2^2$. On the left-hand side,

    \|Y - X\hat\theta_{\mathrm{MLE}}\|_2^2 = \|Y - X\theta + X\theta - X\hat\theta_{\mathrm{MLE}}\|_2^2 = \|Z - Xh\|_2^2,

where $h = \hat\theta_{\mathrm{MLE}} - \theta$. Hence $\|Z - Xh\|_2^2 \le \|Z\|_2^2$, which leads to the basic inequality

    \|Xh\|_2^2 \le 2\langle Z, Xh \rangle = 2 \|h\|_2 \Big\langle Z, X \frac{h}{\|h\|_2} \Big\rangle \le 2 \|h\|_2 \sup_{u \in S^{p-1} \cap B_0(2k)} \langle Z, Xu \rangle,

where we used that $h$ is $2k$-sparse. Note that the left-hand side is not the estimation error but the prediction error $\|Xh\|_2^2 = \|X\hat\theta_{\mathrm{MLE}} - X\theta\|_2^2$. Hence, to conclude both (19.3) and (19.4) from the basic inequality, it suffices to show that, with high probability,

(a) $\|h\|_2 \lesssim \|Xh\|_2$ for all $h \in B_0(2k)$ (a restricted isometry property);

(b) $\sup_{u \in S^{p-1} \cap B_0(2k)} \langle Z, Xu \rangle \lesssim \sqrt{k \log\frac{ep}{k}}$.

We first prove (b). Since $Z$ is independent of $X$, conditionally on $Z$ we have $X^\top Z \sim N\big(0, \frac{\|Z\|_2^2}{n} I_p\big)$, and therefore

    \sup_{u \in S^{p-1} \cap B_0(2k)} \langle Z, Xu \rangle = \sup_{u \in S^{p-1} \cap B_0(2k)} \langle X^\top Z, u \rangle \overset{d}{=} \frac{\|Z\|_2}{\sqrt{n}} \sup_{u \in S^{p-1} \cap B_0(2k)} \langle g, u \rangle, \qquad g \sim N(0, I_p),

which concentrates around $\frac{\|Z\|_2}{\sqrt{n}}\, w(G)$, where $w(G)$ is the Gaussian width of the set $S^{p-1} \cap B_0(2k)$. From the last lecture we know $w(G) \lesssim \sqrt{k \log\frac{ep}{k}}$, and $\|Z\|_2 \lesssim \sqrt{n}$ with high probability, which proves (b).

For (a) we will show that

    \inf_{h \ne 0,\; h \in B_0(2k)} \frac{\|Xh\|_2}{\|h\|_2} \ge c \qquad \text{if } n \gtrsim k \log\frac{ep}{k},

where $c$ is a constant. First note that for any matrix $A$,

    \inf_{u \ne 0} \frac{\|Au\|_2}{\|u\|_2} = \sigma_{\min}(A).

For any feasible $h$, $Xh = X_J h_J$, where $J = \mathrm{supp}(h)$ (so $|J| \le 2k$) and $X_J$ is the $n \times |J|$ matrix whose columns are the columns of $X$ indexed by $J$. Then

    \inf_{h \ne 0,\; h \in B_0(2k)} \frac{\|Xh\|_2}{\|h\|_2} = \min_{|J| \le 2k} \sigma_{\min}(X_J).

For a fixed $J$, $\sigma_{\min}(X_J)$ concentrates around 1. Hence a union bound gives

    \mathbb{P}\Big[ \min_{|J| \le 2k} \sigma_{\min}(X_J) < t \Big] \le \binom{p}{2k} \mathbb{P}\big[ \sigma_{\min}(X_{[2k]}) < t \big].

Using the tail bound

    \mathbb{P}\Big[ \sigma_{\min}(X_{[2k]}) < 1 - \sqrt{\tfrac{2k}{n}} - \tfrac{t}{\sqrt{n}} \Big] \le \exp(-t^2/2),

and choosing $t = \sqrt{4 k \log\frac{ep}{k}}$ and $n = 100\, k \log\frac{ep}{k}$, we have $1 - \sqrt{2k/n} - t/\sqrt{n} \ge 0.5$ and consequently $\mathbb{P}[\min_J \sigma_{\min}(X_J) < 0.5] \to 0$. This completes the proof.
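The singular-value concentration used in part (a) is easy to probe numerically. The following sketch (my own illustration; the sizes and number of trials are arbitrary) estimates $\sigma_{\min}(X_J)$ for random $n \times 2k$ blocks with $N(0, 1/n)$ entries and compares it with the $1 - \sqrt{2k/n}$ prediction:

# Illustrative sketch, not from the original notes.
import numpy as np

rng = np.random.default_rng(2)
n, k, trials = 2000, 10, 200

sigmas = []
for _ in range(trials):
    X_J = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, 2 * k))   # one n x 2k block of X
    sigmas.append(np.linalg.svd(X_J, compute_uv=False)[-1])    # smallest singular value

print("smallest sigma_min over trials:", min(sigmas))
print("1 - sqrt(2k/n)                :", 1 - np.sqrt(2 * k / n))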

19.3 Dantzig selector

The Dantzig selector can be written as the following optimization problem:

    \min \|\theta\|_1 \quad \text{s.t.} \quad \|X^\top (Y - X\theta)\|_\infty \le \tau.        (19.5)

This problem can be solved efficiently as a linear program. Another computationally tractable procedure for $k$-sparse regression is the Lasso, which solves

    \min_{\theta \in \mathbb{R}^p} \|Y - X\theta\|_2^2 + \lambda \|\theta\|_1.        (19.6)

Note that $X^\top$ appears in the constraint of the Dantzig selector (19.5) to make the solution rotation-invariant: if $U \in O(n)$ is an $n \times n$ orthogonal matrix, then $UY = UX\theta + UZ$ is a model of the same form, and $\hat\theta(X, Y) = \hat\theta(UX, UY)$.
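The reduction of (19.5) to a linear program is standard: introduce auxiliary variables $u$ with $-u \le \theta \le u$ coordinatewise, minimize $\sum_j u_j$, and write the $\ell_\infty$ constraint as two-sided linear inequalities. Here is a minimal sketch using scipy (my own illustration; the helper name dantzig_selector and the problem sizes are not from the notes):

# Illustrative sketch, not from the original notes.
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, Y, tau):
    """Solve (19.5) as an LP over z = [theta, u]:
    min sum(u)  s.t.  -u <= theta <= u,  |X^T (Y - X theta)| <= tau coordinatewise."""
    n, p = X.shape
    G = X.T @ X
    XtY = X.T @ Y
    c = np.concatenate([np.zeros(p), np.ones(p)])
    A_ub = np.block([
        [ np.eye(p), -np.eye(p)],           #  theta - u <= 0
        [-np.eye(p), -np.eye(p)],           # -theta - u <= 0
        [ G,          np.zeros((p, p))],    #  X^T X theta <= X^T Y + tau
        [-G,          np.zeros((p, p))],    # -X^T X theta <= tau - X^T Y
    ])
    b_ub = np.concatenate([np.zeros(2 * p), XtY + tau, tau - XtY])
    bounds = [(None, None)] * p + [(0, None)] * p   # theta free, u >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p]

# Example with tau = 2*sqrt(log p), the choice made in Step 1 of the proof below.
rng = np.random.default_rng(3)
n, p, k = 200, 400, 5
X = rng.normal(0, 1 / np.sqrt(n), (n, p))
theta = np.zeros(p); theta[:k] = 5.0
Y = X @ theta + rng.normal(size=n)
theta_ds = dantzig_selector(X, Y, tau=2 * np.sqrt(np.log(p)))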

Theorem 19.3. Let $\hat\theta_{\mathrm{DS}}$ denote a minimizer of (19.5). Then

    \|\hat\theta_{\mathrm{DS}} - \theta\|_2^2 \lesssim k \log p

with high probability, as long as $n \ge C k \log\frac{ep}{k}$ for a sufficiently large constant $C$.

Proof. As in the proof of the corresponding theorem for the denoising problem, the proof is divided into three steps.

Step 1: set $\tau$ to guarantee that $\theta$ is feasible. We choose

    \tau = 2\sqrt{\log p},

so that $\|X^\top Z\|_\infty \le \tau$ with high probability, and hence the ground truth $\theta$ is feasible.

Step 2: structure of the error $h = \hat\theta_{\mathrm{DS}} - \theta$. Let $J$ be the support of $\theta$. Define the cone

    C_J \triangleq \{ h : \|h_{J^c}\|_1 \le \|h_J\|_1 \}.        (19.7)

Since $\|\hat\theta_{\mathrm{DS}}\|_1 \le \|\theta\|_1$, we have $h \in C_J$.

Claim: $\|X^\top X h\|_\infty \le 2\tau$. To see this, note that

    \|X^\top X h\|_\infty = \|X^\top X (\hat\theta_{\mathrm{DS}} - \theta)\|_\infty
      = \|X^\top (Y - X\theta) - X^\top (Y - X\hat\theta_{\mathrm{DS}})\|_\infty
      \le \|X^\top (Y - X\theta)\|_\infty + \|X^\top (Y - X\hat\theta_{\mathrm{DS}})\|_\infty \le 2\tau.

Step 3: the risk. We have

    \|Xh\|_2^2 = \langle Xh, Xh \rangle = \langle X^\top X h, h \rangle
      \le \|X^\top X h\|_\infty \|h\|_1            (Hölder)
      \le 2\tau \cdot 2\|h_J\|_1                    (since h \in C_J)
      \le 4\tau \sqrt{k}\, \|h_J\|_2                (Cauchy–Schwarz)
      \le 4\tau \sqrt{k}\, \|h\|_2.

It remains to show one last thing to complete the proof.

Claim: $\|h\|_2^2 \lesssim \|Xh\|_2^2$ with high probability, simultaneously for all $h \in C_J$. Given this claim, the display above yields $\|h\|_2^2 \lesssim \tau \sqrt{k}\, \|h\|_2$, hence $\|h\|_2^2 \lesssim \tau^2 k \asymp k \log p$, which is the desired bound.

To prove the claim we use the special feature of the cone $C_J$ defined in (19.7): for any $h \in C_J$, at least half of the energy of $h$ lies on $2k$ coordinates, i.e., $h$ is almost $2k$-sparse. Order the coordinates of $h$ as follows: the block of length $k$ corresponding to $J = \mathrm{supp}(\theta)$, namely $h_J$, comes first; the remaining coordinates are ordered by decreasing magnitude. Divide the coordinates after the first block into blocks of size $k$, named $K_1, K_2, \ldots$, and write $h^i = h_{K_i}$. Let $a = h_{J \cup K_1} = h_J + h^1$. By construction $a$ carries at least half of the energy, i.e., $\|a\|_2^2 \ge \frac{1}{2}\|h\|_2^2$. Define $b \triangleq h_{(J \cup K_1)^c} = \sum_{i \ge 2} h^i$. Then

    \|Xh\|_2^2 = \|Xa + Xb\|_2^2 \ge \|Xa\|_2^2 - 2 |\langle Xa, Xb \rangle|.        (19.8)

Since $X$ satisfies the restricted isometry property, for $n \ge c\, k \log\frac{ep}{k}$ there exists $c_1 = c_1(c)$ with $c_1 \to 1$ as $c \to \infty$, such that

    \|Xa\|_2^2 \ge c_1 \|a\|_2^2 \ge \frac{c_1}{2} \|h\|_2^2.

It remains to show that the cross term $\langle Xa, Xb \rangle$ is small in magnitude. For this we use the following restricted decorrelation lemma.

Lemma 19.1. Let $n \ge c\, k \log\frac{ep}{k}$. Then, with high probability, for all $u, v \in B_0(2k)$ with disjoint supports,

    |\langle Xu, Xv \rangle| \le c_2 \|u\|_2 \|v\|_2,

where $c_2 = c_2(c)$ and $c_2 \to 0$ as $c \to \infty$.

Then we have

    |\langle Xa, Xb \rangle| = \Big| \Big\langle Xa, \sum_{i \ge 2} X h^i \Big\rangle \Big|
      \le \sum_{i \ge 2} |\langle Xa, X h^i \rangle|
      \le c_2 \sum_{i \ge 2} \|a\|_2 \|h^i\|_2                              (Lemma 19.1)
      \le c_2 \|h\|_2 \sum_{i \ge 2} \frac{\|h^{i-1}\|_1}{\sqrt{k}}          (by the ordering, \|h^i\|_2 \le \sqrt{k}\,\|h^i\|_\infty \le \|h^{i-1}\|_1/\sqrt{k})
      \le \frac{c_2}{\sqrt{k}} \|h\|_2 \|h_{J^c}\|_1
      \le \frac{c_2}{\sqrt{k}} \|h\|_2 \|h_J\|_1                             (property of the cone)
      \le c_2 \|h\|_2 \|h_J\|_2 \le c_2 \|h\|_2^2.

Plugging back into (19.8), we obtain

    \|Xh\|_2^2 \ge \Big( \frac{c_1}{2} - 2 c_2 \Big) \|h\|_2^2 \gtrsim \|h\|_2^2

for $c$ large enough. This completes the proof.
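The structural fact used above, that any $h \in C_J$ puts at least half of its squared $\ell_2$ norm on the $2k$ coordinates $J \cup K_1$, can be checked numerically. The sketch below (my own illustration, not from the notes) rescales random vectors so that they satisfy the cone constraint and reports the worst observed energy ratio:

# Illustrative sketch, not from the original notes.
import numpy as np

rng = np.random.default_rng(4)
p, k, trials = 1000, 10, 500

worst_ratio = 1.0
for _ in range(trials):
    h = rng.standard_normal(p)
    J = np.arange(k)                                   # support of theta (first k coords, WLOG)
    # Rescale h_{J^c} so that ||h_{J^c}||_1 <= ||h_J||_1, i.e. h lies in the cone C_J.
    scale = np.sum(np.abs(h[J])) / np.sum(np.abs(h[k:]))
    h[k:] *= min(1.0, scale)
    # K_1: the k largest-magnitude coordinates outside J.
    K1 = k + np.argsort(-np.abs(h[k:]))[:k]
    top_energy = np.sum(h[J] ** 2) + np.sum(h[K1] ** 2)
    worst_ratio = min(worst_ratio, top_energy / np.sum(h ** 2))

print("worst ||h_{J u K1}||^2 / ||h||^2 over trials:", worst_ratio)   # theory: >= 0.5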

Remark 19.1 (Adaptivity issues). Note that the Dantzig selector is adaptive to $k$, but not to $\sigma$. To see this, consider the high-dimensional $k$-sparse regression problem

    Y = X\theta + Z, \qquad Z \sim N(0, \sigma^2 I_n).

If $\sigma$ is known then we can set $\tau = 2\sigma\sqrt{\log p}$, but typically $\sigma$ is not known. A similar problem arises with the Lasso: in (19.6), if $\sigma$ is known then the optimal choice is $\lambda = 2\sigma\sqrt{\log p}$, but if $\sigma$ is unknown then $\lambda$ becomes a tuning parameter. As a remedy, another procedure called the square-root Lasso was proposed, which solves

    \min_{\theta \in \mathbb{R}^p} \|Y - X\theta\|_2 + \lambda \|\theta\|_1.

Its optimal choice $\lambda \asymp \sqrt{\log p}$ does not depend on $\sigma$, even when $\sigma$ is unknown. However, the downside is that this optimization problem is not as easy to solve.
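As a rough illustration of the square-root Lasso objective, here is a sketch assuming the cvxpy package is available (my own illustration; the problem sizes and the pivotal choice lam = sqrt(2 log p) are illustrative, not from the notes). Note that lam is set without any knowledge of sigma:

# Illustrative sketch, not from the original notes.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(5)
n, p, k, sigma = 200, 500, 5, 3.0                     # sigma is unknown to the estimator
X = rng.normal(0, 1 / np.sqrt(n), (n, p))
theta = np.zeros(p); theta[:k] = 5.0
Y = X @ theta + sigma * rng.normal(size=n)

lam = np.sqrt(2 * np.log(p))                          # tuning parameter: does not involve sigma
theta_var = cp.Variable(p)
objective = cp.Minimize(cp.norm(Y - X @ theta_var, 2) + lam * cp.norm(theta_var, 1))
cp.Problem(objective).solve()
theta_sqrt_lasso = theta_var.value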

References

[CT07] Emmanuel Candès and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313-2351, 2007.

[SC15] Weijie Su and Emmanuel Candès. SLOPE is adaptive to unknown sparsity and asymptotically minimax. arXiv preprint arXiv:1503.08393, 2015.
