Stats 300C: Theory of Statistics    Spring 2018
Lecture 24: May 3, 2018
Prof. Emmanuel Candès    Scribe: Martin J. Zhang, Jun Yan, Can Wang, and E. Candès

1 Outline

Agenda: High-dimensional statistical estimation
1. Lasso
2. $\ell_0$–$\ell_1$ equivalence
3. Oracle inequalities for the Lasso?

2 Canonical Selection Procedure

Consider the usual linear model and the canonical $\ell_0$ selection procedure
\[
y = X\beta + z, \qquad \min_{\hat\beta}\; \|y - X\hat\beta\|^2 + \lambda^2\sigma^2\|\hat\beta\|_0 .
\]
We have seen several choices of the parameter $\lambda$:
- AIC: $\lambda^2 = 2$
- BIC: $\lambda^2 = \log n$
- RIC: $\lambda^2 \approx 2\log p$

The critical problem shared by these methods is that finding the minimum requires an exhaustive search over all possible models. This search is completely intractable for even moderate values of $p$.

3 Lasso

Consider instead solving the $\ell_1$-regularized problem
\[
\min_{\hat\beta}\; \tfrac12\|y - X\hat\beta\|_2^2 + \lambda\sigma\|\hat\beta\|_1 .
\]
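As an illustration that is not part of the original notes, here is a minimal Python sketch of solving this Lasso program with scikit-learn; the design, sparsity level, and seed are arbitrary choices. Note that scikit-learn's Lasso estimator minimizes $\|y - Xw\|_2^2/(2n) + \alpha\|w\|_1$, so $\alpha = \lambda\sigma/n$ corresponds to the objective above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, sigma = 100, 200, 1.0

X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)           # unit-norm columns
beta = np.zeros(p)
beta[:5] = 10.0                          # 5-sparse ground truth
y = X @ beta + sigma * rng.standard_normal(n)

lam = np.sqrt(2 * np.log(p))             # RIC-style choice, lambda ~ sqrt(2 log p)
# scikit-learn's Lasso minimizes ||y - Xw||^2 / (2n) + alpha * ||w||_1,
# so alpha = lam * sigma / n matches (1/2)||y - Xb||^2 + lam * sigma * ||b||_1.
fit = Lasso(alpha=lam * sigma / n, fit_intercept=False, max_iter=10_000).fit(X, y)
print(np.flatnonzero(fit.coef_))         # indices selected by the Lasso
```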
The Lasso problem is a quadratic program that can be solved efficiently. $\ell_1$ regularization has a long history in statistics and the applied sciences. The idea appeared early on in reflection seismology: to image the earth with sound waves, one records the reflected waves as a noisy, low-frequency time series, and to detect where the jumps (reflecting layers) in the earth are, people came up with the idea of using the $\ell_1$ norm. Some important references on the use of the $\ell_1$ norm:
- Logan (1965)
- Claerbout & Muir (1973)
- Taylor, Banks & McCoy (1979)
- Santosa & Symes (1983, 1986)
- Rudin, Osher & Fatemi (1992)
- Donoho et al. (1990s)
- Tibshirani (1996)

3.1 Why $\ell_1$?

To motivate the $\ell_1$ approach, consider first the noiseless case
\[
y = X\beta,
\]
which is an underdetermined system of equations when there are more variables than observations, i.e., $p > n$. In this setting, there is no unique solution to the least-squares problem. However, by assuming that $\beta$ is sparse, we can restrict our search to sparse solutions; that is, we wish to find the sparsest solution to $y = X\beta$ by solving
\[
\min \|\hat\beta\|_0 \quad \text{subject to} \quad y = X\hat\beta .
\]
Under some conditions, the minimizer is unique, i.e., $\hat\beta = \beta$. Even so, this does not address the issue of computational tractability: the optimization problem as stated remains intractable. This motivates the convex relaxation
\[
\min \|\hat\beta\|_1 \quad \text{subject to} \quad y = X\hat\beta .
\]
Under broad conditions, it can be shown that
- the minimizer is unique;
- the $\ell_1$ solution is equal to the $\ell_0$ solution.

Figure 1: Geometry of the $\ell_1$ optimization problem in the noiseless setting. Panel (a): $\hat\beta = \beta$, good ($\ell_1$ works). Panel (b): $\hat\beta \neq \beta$, bad ($\ell_1$ may not always work).

To understand the picture, suppose $\hat\beta$ is a feasible point ($X\hat\beta = X\beta = y$). Consider descent vectors $u$ such that $\|\hat\beta + u\|_1 < \|\hat\beta\|_1$. It is easy to see that $\hat\beta$ is the solution to the $\ell_1$ optimization problem if $Xu \neq 0$ for all descent vectors $u$, i.e., if all vectors with smaller $\ell_1$ norm are infeasible.
(a) The solution here is the true $\beta$, because all other feasible vectors have larger $\ell_1$ norm.
(b) Here $\hat\beta \neq \beta$: starting at $\beta$, we can move along the descent direction $u$ shown in the figure; the point $\beta + u$ remains feasible and has $\ell_1$ norm strictly smaller than that of $\beta$.

Rigorous results on the formal equivalence between the $\ell_1$ and $\ell_0$ problems can be found in
- Candès & Tao (2004)
- Donoho (2004)
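As a concrete companion to this equivalence (an illustrative sketch, not from the notes), the $\ell_1$ problem $\min \|\beta\|_1$ subject to $X\beta = y$ can be solved as a linear program by splitting $\beta = u - v$ with $u, v \ge 0$. The dimensions and the random Gaussian design below are assumptions chosen so that $\ell_1$ recovery typically succeeds.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, p, k = 60, 120, 5                       # k-sparse signal, underdetermined system
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[rng.choice(p, k, replace=False)] = rng.standard_normal(k)
y = X @ beta                               # noiseless observations

# min ||b||_1 s.t. Xb = y, with b = u - v and u, v >= 0:
#   minimize 1'u + 1'v  subject to  [X, -X][u; v] = y
c = np.ones(2 * p)
res = linprog(c, A_eq=np.hstack([X, -X]), b_eq=y, bounds=(0, None), method="highs")
beta_hat = res.x[:p] - res.x[p:]
print(np.max(np.abs(beta_hat - beta)))     # ~ 0: the l1 solution matches the sparse beta
```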
Figure 2: The $\ell_1$ norm is the closest convex approximation to the $\ell_0$ quasi-norm. Panels: (a) $\ell_0$ quasi-norm, (b) $\ell_1$ norm, (c) $\ell_2$ norm. The $\ell_1$ ball is the smallest convex body containing the points $\pm e_i = (0, \dots, 0, \pm 1, 0, \dots, 0)$.

3.2 $\ell_1$ as a convex relaxation of $\ell_0$

The $\ell_1$ norm is the tightest convex relaxation of the $\ell_0$ quasi-norm. One way of seeing this is to look at the model
\[
\mu = \sum_{i=1}^{p} \beta_i X_i ,
\]
where $\beta$ is sparse. The building blocks of sparse solutions are vectors of the form
\[
(0, \dots, 0, \pm 1, 0, \dots, 0),
\]
and the $\ell_1$ ball is the smallest convex body that contains these building blocks: it is their convex hull. Another interpretation is this: the $\ell_1$ norm is the best convex lower bound to the cardinality functional over the unit cube. That is to say, if we search for a convex function $f$ such that $f(\beta) \le \|\beta\|_0$ for all $\beta \in [-1, 1]^p$, then the tightest such lower bound is the $\ell_1$ norm; see Figure 2.

4 Lasso and Oracles?

Recall from last time:

Theorem 1. The CSP (canonical selection procedure) with $\lambda_p^2 = A \log p$ obeys
\[
\mathbb{E}\,\|X\hat\beta - X\beta\|^2 = O(\log p)\,\big[\sigma^2 + R_I(\mu)\big],
\]
where $R_I(\mu)$ is the ideal risk.

As we will illustrate, we cannot hope to achieve comparable optimality results for the Lasso.

Case study 1: Take $X$ to be the $n \times (n+1)$ matrix
\[
X = \begin{pmatrix}
1 &        &        &   & \varepsilon \\
  & 1      &        &   & \varepsilon \\
  &        & \ddots &   & \vdots      \\
  &        &        & 1 & 1
\end{pmatrix},
\]
that is, $X = [\, I_n \;\; x_{n+1} \,]$ with $x_{n+1} = (\varepsilon, \dots, \varepsilon, 1)^T$.
In this case, $p = n + 1$, so the system is underdetermined. Suppose that $\mu$ is constructed with
\[
\beta = (0, \dots, 0, -1/\varepsilon, 1/\varepsilon)^T \qquad \text{so that} \qquad \mu = X\beta = (1, \dots, 1, 0)^T .
\]
In the noiseless setting:
- The $\ell_0$ solution is exactly $\beta$.
- The $\ell_1$ solution is $\hat\beta = (1, \dots, 1, 0, 0)^T \neq \beta$ whenever $\varepsilon < 2/(n-1)$, since then $\|\hat\beta\|_1 = n - 1 < 2/\varepsilon = \|\beta\|_1$. So $\ell_1$ does not recover the right solution.

Next we consider the noisy setting. With no noise, the Lasso does not find the sparse solution, so with noise there is little doubt that it will not be able to find the correct variables either. Now this may not be a problem as long as we only care about prediction error: $\hat\beta$ may be a very poor estimate of $\beta$ while $X\hat\beta$ is still a good estimate of $X\beta$. This is, however, not the case here. Suppose $\varepsilon$ is small as above, and pick a noise level $\sigma = 1/10$, say. Hence, we observe
\[
y_i = 1\{i < n\} + 0.1\, z_i, \qquad i = 1, \dots, n .
\]
It can be shown that for all reasonable values of $\lambda$, the Lasso solution is given by soft-thresholding $y$, i.e.,
\[
\hat\beta_i = (y_i - \lambda\sigma)_+, \quad i = 1, \dots, n-1, \qquad \hat\beta_i = 0, \quad i = n, n+1,
\]
so that
\[
\hat\mu_i = (y_i - \lambda\sigma)_+, \quad i = 1, \dots, n-1, \qquad \hat\mu_n = 0 .
\]
Consequently, the Lasso has a squared bias of roughly $(n-1)\lambda^2\sigma^2$ and a variance of $(n-1)\sigma^2$. We would be better off just taking $\hat\mu = y$ and paying only the variance. Empirical results in the noisy setting are shown in Figure 3.
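The following small simulation sketch of this construction (my own illustration, not part of the original notes; exact numbers vary with the seed and the solver tolerance) shows the gap between the Lasso fit and the ideal two-predictor risk of about $2\sigma^2$.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, eps, sigma = 1000, 1e-3, 0.1

x_last = np.append(np.full(n - 1, eps), 1.0)     # last column, nearly equal to e_n
X = np.hstack([np.eye(n), x_last[:, None]])      # n x (n+1) design
beta = np.zeros(n + 1)
beta[n - 1], beta[n] = -1 / eps, 1 / eps         # 2-sparse truth
mu = X @ beta                                    # = (1, ..., 1, 0)

y = mu + sigma * rng.standard_normal(n)
lam = np.sqrt(2 * np.log(n + 1))
fit = Lasso(alpha=lam * sigma / n, fit_intercept=False, max_iter=50_000).fit(X, y)

lasso_mse = np.sum((X @ fit.coef_ - mu) ** 2)    # roughly (n-1) * sigma^2 * (1 + lam^2)
ideal_mse = 2 * sigma ** 2                       # expected risk of the two-predictor fit
print(lasso_mse / ideal_mse)                     # ratio on the order of several thousand
```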
Figure 3: Graphical representation of the example above with $n = 1{,}000$, $p = 1{,}001$, and $\sigma = 0.1$. Left panel: Lasso estimated mean response; right panel: ideal estimated mean response; each shows the true mean and the estimated mean. The Lasso estimate fits $X\hat\beta = \hat\mu = (y - 0.1\lambda)_+$, which results in a large variance and a large bias. The ideal estimate uses only two predictors. The ratio between the Lasso MSE and the ideal MSE is over $6{,}500 \approx 6.5n \approx 6.5p$.

Summary: This is an instance where the $\ell_1$ solution is irreparably worse than the $\ell_0$ solution. Indeed, with $\ell_0$ we achieve
\[
\mathbb{E}\,\|X\hat\beta - X\beta\|^2 \approx 2\sigma^2 ,
\]
while for the Lasso (no matter the choice of $\lambda$),
\[
\mathbb{E}\,\|X\hat\beta - X\beta\|^2 \gtrsim n\sigma^2 .
\]

Take-away message: Even in the noiseless case, correlation between the predictors causes severe problems for the Lasso.

Next, we show another case where the Lasso does not work well even though the correlation between the columns of $X$ is very low.

Case study 2: $X = [\, I_n \;\; F_{2:n} \,]$, where $F_{2:n}$ is the discrete cosine transform matrix excluding the first column (the DC component). Specifically,
\[
F_{2:n} = \begin{pmatrix}
\varphi_1(1) & \varphi_2(1) & \cdots & \varphi_{n-1}(1) \\
\varphi_1(2) & \varphi_2(2) & \cdots & \varphi_{n-1}(2) \\
\vdots       & \vdots       &        & \vdots           \\
\varphi_1(n) & \varphi_2(n) & \cdots & \varphi_{n-1}(n)
\end{pmatrix},
\]
where
\[
\varphi_{2k-1}(t) = \sqrt{2/n}\,\cos(2\pi k t/n), \qquad \varphi_{2k}(t) = \sqrt{2/n}\,\sin(2\pi k t/n), \qquad k = 1, 2, \dots, n/2 - 1,
\]
and $\varphi_{n-1}(t) = (-1)^t/\sqrt{n}$.
In this case, $p = 2n - 1$. The maximum correlation between columns of $X$ is $\mu(X) = \sqrt{2/n}$, which shows that the columns of $X$ are extremely incoherent; moreover, each column has unit norm, $\|x_i\|_2 = 1$. We can then construct $f = X\beta$, each coordinate of which equals $1$, with a sparse $\beta$ (shown in Figure 4).

Figure 4: Graphical representation of $f = X\beta$ and $\beta$ in case study 2 with $n = 256$, $p = 511$.

In the noisy setting, the observations $y = X\beta + z$, the Lasso estimate $\hat\beta$, and $X\hat\beta$ are shown in Figure 5.

Figure 5: Graphical representation of the results in case study 2 with $n = 256$, $p = 511$. The Lasso estimate $\hat\beta$ has far more non-zero entries than $\beta$ and results in a large MSE.

Empirically and provably,
\[
X\hat\beta = y - \lambda\sigma\,\mathbf{1}, \qquad
\hat\beta_i = \begin{cases} y_i - \lambda\sigma, & i \in \{1, \dots, n\}, \\ 0, & i \in \{n+1, \dots, 2n-1\}. \end{cases}
\]
The Lasso MSE,
\[
\|\hat f - f\|^2 \approx n\sigma^2(1 + \lambda^2),
\]
is horrible. Moreover, the $\ell_0$ norm of $\hat\beta$, $|\mathrm{supp}(\hat\beta)| = n$, is much higher than that of $\beta$, $|\mathrm{supp}(\beta)| = \tfrac32\sqrt{n}$.
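To make case study 2 concrete, here is a numpy sketch (my own illustration of the construction described above; the particular $\beta$ below is one explicit sparse representation) that builds $X = [\, I_n \;\; F_{2:n} \,]$ for $n = 256$, checks that the coherence is $\sqrt{2/n}$, and verifies that a $\beta$ with $\tfrac32\sqrt{n} = 24$ nonzero entries reproduces the all-ones vector $f$.

```python
import numpy as np

n, m = 256, 16                                   # n = m^2
t = np.arange(1, n + 1)

# Real Fourier columns phi_1, ..., phi_{n-1} (DC component excluded), as in the notes
cols = []
for k in range(1, n // 2):
    cols.append(np.sqrt(2 / n) * np.cos(2 * np.pi * k * t / n))   # phi_{2k-1}
    cols.append(np.sqrt(2 / n) * np.sin(2 * np.pi * k * t / n))   # phi_{2k}
cols.append((-1.0) ** t / np.sqrt(n))                             # phi_{n-1}, Nyquist
F = np.column_stack(cols)                        # n x (n-1)
X = np.hstack([np.eye(n), F])                    # n x (2n-1)

G = X.T @ X
print(np.max(np.abs(G - np.diag(np.diag(G)))))   # coherence mu(X) = sqrt(2/n) ~ 0.088

# Sparse representation of the all-ones vector: m spikes, m/2 - 1 cosines, Nyquist column
beta = np.zeros(2 * n - 1)
beta[np.arange(m - 1, n, m)] = m                 # spikes at t = 16, 32, ..., 256
for j in range(1, m // 2):                       # cosine columns at frequencies k = 16 j
    beta[n + 2 * (m * j) - 2] = -np.sqrt(2 * n)
beta[2 * n - 2] = -np.sqrt(n)                    # Nyquist column

print(np.count_nonzero(beta))                    # 24 = (3/2) * sqrt(n)
print(np.allclose(X @ beta, np.ones(n)))         # True: X beta is the all-ones vector
```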