Stats 300C: Theory of Statistics                                        Spring 2018

Lecture 24: May 30, 2018

Prof. Emmanuel Candès        Scribe: Martin J. Zhang, Jun Yan, Can Wang, and E. Candès

1 Outline

Agenda: high-dimensional statistical estimation.

1. Lasso
2. ℓ0-ℓ1 equivalence
3. Oracle inequalities for the Lasso?

2 Canonical Selection Procedure

Consider the usual linear model

    y = Xβ + z

and the canonical ℓ0 selection procedure

    min_ˆβ  ‖y − Xˆβ‖² + λ²σ²‖ˆβ‖₀.

We have seen several choices of the parameter λ:

- AIC: λ² = 2
- BIC: λ² = log n
- RIC: λ² ≈ 2 log p

The critical problem shared by these methods is that finding the minimum requires an exhaustive search over all possible models. This search is completely intractable for even moderate values of p.

3 Lasso

Consider instead solving the ℓ1-regularized problem

    min_ˆβ  (1/2)‖y − Xˆβ‖₂² + λσ‖ˆβ‖₁.

This is a quadratic program which can be solved efficiently.
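For concreteness, here is a minimal proximal-gradient (ISTA) sketch for this objective. It is an illustration added to these notes, not the method used in the original lecture; the step size, iteration count, and the random test problem are arbitrary choices.

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, sigma=1.0, n_iter=500):
    """Minimize 0.5 * ||y - X b||_2^2 + lam * sigma * ||b||_1 by proximal gradient."""
    L = np.linalg.norm(X, 2) ** 2           # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)            # gradient of the smooth part
        b = soft_threshold(b - grad / L, lam * sigma / L)
    return b

# Tiny usage example on random data (purely illustrative)
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))
beta = np.zeros(100); beta[:5] = 3.0
y = X @ beta + 0.5 * rng.standard_normal(50)
b_hat = lasso_ista(X, y, lam=2 * np.sqrt(2 * np.log(100)), sigma=0.5)
print("nonzeros:", np.sum(np.abs(b_hat) > 1e-6))
```

The key ingredient is that the proximal operator of the ℓ1 penalty is entrywise soft-thresholding, the same closed form that reappears in the case studies below.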

ℓ1 regularization has a long history in statistics and applied science. The idea appeared early on in reflection seismology: one images the earth with sound waves, and the reflected waves give a noisy, low-frequency time series; in order to detect where the jumps (reflecting layers) in the earth are, people came up with the idea of using the ℓ1 norm. Some important references on the use of the ℓ1 norm:

- Logan (1965)
- Claerbout & Muir (1973)
- Taylor, Banks & McCoy (1979)
- Santosa & Symes (1983, 1986)
- Rudin, Osher & Fatemi (1992)
- Donoho et al. (1990s)
- Tibshirani (1996)

3.1 Why ℓ1?

To motivate the ℓ1 approach, consider first the noiseless case

    y = Xβ,

which is an underdetermined system of equations when there are more variables than observations, i.e., p > n. In this setting, there is no unique solution to the least squares problem. However, by assuming that β is sparse, we can reduce our search to sparse solutions; that is, we wish to find the sparsest solution to y = Xβ by solving

    min ‖ˆβ‖₀   subject to   y = Xˆβ.

Under some conditions the minimizer is unique, i.e., ˆβ = β. Even so, this does not address the issue of computational tractability: the optimization problem as stated remains intractable. This motivates the convex relaxation

    min ‖ˆβ‖₁   subject to   y = Xˆβ.

Under broad conditions, it can be shown that

- the minimizer is unique, and
- the ℓ1 solution is equal to the ℓ0 solution.

Figure 1: Geometry of the ℓ1 optimization problem in the noiseless setting. (a) "Why does ℓ1 work?": ˆβ = β, good! (b) "Why ℓ1 may not always work": ˆβ ≠ β, bad!

To understand the picture, suppose ˆβ is a feasible point (Xˆβ = Xβ = y) and consider descent vectors u such that ‖ˆβ + u‖₁ < ‖ˆβ‖₁. It is easy to see that ˆβ is the solution to the ℓ1 optimization problem if Xu ≠ 0 for all descent vectors u, i.e., if all vectors with smaller ℓ1 norm are non-feasible. (a) The solution here is the true β, because all other feasible vectors have larger ℓ1 norm. (b) Here ˆβ ≠ β: starting at β, we can move along the descent direction u shown; the point β + u remains feasible and has ℓ1 norm strictly smaller than that of β.

Rigorous results pertaining to the formal equivalence between the ℓ0 and ℓ1 problems can be found in Candès & Tao (2004) and Donoho (2004).
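The noiseless ℓ1 problem (basis pursuit) is a linear program. Below is a minimal sketch, not from the notes, using scipy's linprog with the standard split ˆβ = b_plus − b_minus; the random Gaussian design in the usage example is purely illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """Solve min ||b||_1 subject to X b = y via the LP reformulation
    b = b_plus - b_minus with b_plus, b_minus >= 0."""
    n, p = X.shape
    c = np.ones(2 * p)                       # objective: sum(b_plus) + sum(b_minus)
    A_eq = np.hstack([X, -X])                # X b_plus - X b_minus = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]

# Usage: exact recovery of a sparse vector from random measurements
rng = np.random.default_rng(1)
n, p, k = 40, 100, 5
X = rng.standard_normal((n, p)) / np.sqrt(n)
beta = np.zeros(p)
beta[rng.choice(p, k, replace=False)] = rng.standard_normal(k)
b_hat = basis_pursuit(X, X @ beta)
print("max error:", np.max(np.abs(b_hat - beta)))    # typically near machine precision here
```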

3.2 ℓ1 as a convex relaxation of ℓ0

The ℓ1 norm is the tightest convex relaxation of the ℓ0 quasi-norm. One way of seeing this is to look at the model

    µ = Σ_{i=1}^p β_i X_i

where β is sparse. The building blocks of sparse solutions are vectors of the form

    ±e_i = (0, ..., 0, ±1, 0, ..., 0),

and the ℓ1 ball is the smallest convex body that contains these building blocks: it is the convex hull of these points. Another interpretation is this: the ℓ1 norm is the best convex lower bound to the cardinality functional over the unit cube. That is to say, if we search for a convex function f such that f(β) ≤ ‖β‖₀ for all β ∈ [−1, 1]^p, then the tightest lower bound is the ℓ1 norm; see Figure 2.

Figure 2: The ℓ1 norm is the closest convex approximation to the ℓ0 quasi-norm: (a) ℓ0 quasi-norm, (b) ℓ1 norm, (c) ℓ2 norm.
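A quick numerical illustration of the lower-bound statement (an addition, not from the notes): on the cube [−1, 1]^p the ℓ1 norm never exceeds the number of nonzero entries, with equality at the building blocks ±e_i.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 8

# Random points of the cube [-1, 1]^p: ||b||_1 <= ||b||_0 always holds there.
for _ in range(10000):
    b = rng.uniform(-1, 1, size=p)
    b[rng.random(p) < 0.5] = 0.0             # make some coordinates exactly zero
    assert np.sum(np.abs(b)) <= np.count_nonzero(b) + 1e-12

# Equality at the building blocks +/- e_i (the extreme points of the l1 ball).
for i in range(p):
    for s in (+1.0, -1.0):
        e = np.zeros(p); e[i] = s
        assert np.sum(np.abs(e)) == np.count_nonzero(e) == 1

print("l1 <= l0 on the unit cube, with equality at +/- e_i")
```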

4 Lasso and Oracles?

Recall from last time:

Theorem 1. CSP (Canonical Selection Procedures) with λ_p = A·√(log p) achieves

    E‖Xˆβ − Xβ‖² = O(log p)·[σ² + R_I(µ)],

where R_I(µ) is the ideal risk.

As we will illustrate, we cannot hope to achieve comparable optimality results for the Lasso.

Case study 1: Take X to be the n × (n+1) matrix

    X = [ I_n   v ],   v = (ε, ε, ..., ε, 1)ᵀ ∈ ℝⁿ.

In this case, p = n + 1, so the system is underdetermined. Suppose that µ is constructed with

    β = (1/ε)·(0, ..., 0, −1, 1)ᵀ,   so that   µ = Xβ = (1, 1, ..., 1, 0)ᵀ.

In the noiseless setting:

- The ℓ0 solution is exactly β.
- The ℓ1 solution is (1, ..., 1, 0, 0)ᵀ, i.e., ˆβ ≠ β, in the case where ε < 2/(n−1). So ℓ1 does not recover the right solution.

Next we consider the noisy setting. With no noise, the Lasso does not find the sparse solution, so with noise there is little doubt that it will not be able to find the correct variables either. Now, this may not be a problem as long as we care only about the prediction error: ˆβ may be a very poor estimate of β while Xˆβ is still a good estimate of Xβ. This is, however, not the case here. Suppose ε is small as above, and pick a noise level σ = 1/10, say. Hence, we observe

    y_i = 1{i < n} + 0.1·z_i,   i = 1, ..., n.

It can be shown that, for all reasonable values of λ, the Lasso solution is given by soft-thresholding y, i.e.,

    ˆβ_i = (y_i − λσ)_+, i = 1, ..., n−1,   and   ˆβ_i = 0, i = n, n+1,

so that

    ˆµ_i = (y_i − λσ)_+, i = 1, ..., n−1,   and   ˆµ_n = 0.

Consequently, the Lasso has a squared bias of roughly (n−1)λ²σ² and a variance of (n−1)σ². We would be better off just plugging in ˆµ = y and paying only the variance. The empirical results in the noisy setting are shown in Figure 3.

Figure 3: Graphical representation of the example above with n = 1,000, p = 1,001, and σ = 0.1. Left panel: Lasso estimated mean response (true mean and estimated mean); right panel: ideal estimated mean response. The Lasso estimate fits Xˆβ = ˆµ = (y − 0.1λ)_+, which results in a large variance and a large bias. The ideal estimate uses only two predictors. The ratio between the Lasso MSE and the ideal MSE is over 6,500 ≈ 6.5n ≈ 6.5p.
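A small simulation sketch of this case study (an addition to the notes). The choices λ = √(2 log p), ε = 10⁻³, and the random seed are assumptions, and the Lasso fit uses the closed-form soft-thresholding expression quoted above rather than a numerical solver.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, eps = 1000, 0.1, 1e-3           # eps < 2/(n-1), so l1 prefers the dense representation
lam = np.sqrt(2 * np.log(n + 1))

# Design: identity columns plus one extra column v = (eps, ..., eps, 1).
v = np.full(n, eps); v[-1] = 1.0
X = np.hstack([np.eye(n), v[:, None]])
mu = np.ones(n); mu[-1] = 0.0             # mu = X beta with beta = (1/eps)*(0,...,0,-1,1)
y = mu + sigma * rng.standard_normal(n)

# Lasso fit via the closed form quoted in the notes: soft-threshold the first n-1 coordinates.
mu_lasso = np.maximum(y - lam * sigma, 0.0)
mu_lasso[-1] = 0.0

# "Ideal" fit: least squares on the two relevant predictors (columns n-1 and n, 0-indexed).
A = X[:, [n - 1, n]]
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
mu_ideal = A @ coef

print("Lasso MSE:", np.sum((mu_lasso - mu) ** 2))
print("Ideal MSE:", np.sum((mu_ideal - mu) ** 2))   # the ratio is typically in the thousands
```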

Summary: This is an instance where the ℓ1 solution is irreparably worse than the ℓ0 solution. Indeed, with ℓ0 we achieve

    E‖Xˆβ − Xβ‖² ≈ 2σ²,

while for the Lasso (no matter the choice of λ),

    E‖Xˆβ − Xβ‖² ≳ nσ².

Take-away message: Even in the noiseless case, correlation between the predictors causes severe problems for the Lasso.

Next we show another case where the Lasso does not work well, even though the correlation between the columns of X is very low.

Case study 2: X = [ I_n   F_{2:n} ], where F_{2:n} is the discrete cosine transform matrix excluding the first column (the DC component). Specifically,

    F_{2:n} = [ φ_1(1)   φ_2(1)   ⋯   φ_{n−1}(1) ]
              [ φ_1(2)   φ_2(2)   ⋯   φ_{n−1}(2) ]
              [   ⋮         ⋮               ⋮    ]
              [ φ_1(n)   φ_2(n)   ⋯   φ_{n−1}(n) ],

where

    φ_{2k−1}(t) = √(2/n)·cos(2πkt/n),   k = 1, 2, ..., n/2 − 1,
    φ_{2k}(t)   = √(2/n)·sin(2πkt/n),   k = 1, 2, ..., n/2 − 1,
    φ_{n−1}(t)  = (−1)^t / √n.

In this case, p = 2n − 1. The maximum correlation between columns of X is µ(X) = √(2/n), which shows that the columns of X are extremely incoherent. Moreover, every column x satisfies ‖x‖₂² = 1.
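As a sanity check (an addition, not from the notes), the following sketch builds this dictionary from the formulas above, assuming that reconstruction is correct, and verifies numerically that the columns have unit norm and that the coherence equals √(2/n).

```python
import numpy as np

def dct_no_dc(n):
    """Real Fourier-type basis on {1,...,n} without the constant (DC) column."""
    t = np.arange(1, n + 1)
    cols = []
    for k in range(1, n // 2):
        cols.append(np.sqrt(2.0 / n) * np.cos(2 * np.pi * k * t / n))
        cols.append(np.sqrt(2.0 / n) * np.sin(2 * np.pi * k * t / n))
    cols.append((-1.0) ** t / np.sqrt(n))         # Nyquist column
    return np.column_stack(cols)                  # n x (n-1)

n = 256
F = dct_no_dc(n)
X = np.hstack([np.eye(n), F])                     # n x (2n-1) dictionary

G = X.T @ X                                       # Gram matrix
print("unit-norm columns:", np.allclose(np.diag(G), 1.0))
off = np.abs(G - np.diag(np.diag(G)))
print("coherence:", off.max(), "vs sqrt(2/n) =", np.sqrt(2.0 / n))
```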

Then we can construct f = Xβ, of which each coordinate is 1, with a sparse β (shown in Figure 4).

Figure 4: Graphical representation of f = Xβ and of β in case study 2, with n = 256 and p = 511.

In the noisy setting, the noisy observations y = Xβ + z, the Lasso estimate ˆβ, and the fit Xˆβ are shown in Figure 5.

Figure 5: Graphical representation of the results in case study 2 with n = 256 and p = 511. The Lasso estimate ˆβ has far more nonzero entries than β and results in a large MSE.

Empirically and provably,

    Xˆβ = y − λσ,   with   ˆβ_i = y_i − λσ for i ∈ {1, ..., n}   and   ˆβ_i = 0 for i ∈ {n+1, ..., 2n−1}.

The Lasso MSE, ‖ˆf − f‖² ≈ nσ²(1 + λ²) with ˆf = Xˆβ, is horrible. The ℓ0 norm of ˆβ, |supp(ˆβ)| = n, is much higher than that of β, |supp(β)| = 3 ≪ 2n.
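A sketch of the noisy experiment (an addition to the notes). The target with every coordinate equal to 1, σ = 0.1, λ = √(2 log p), the random seed, and the use of scikit-learn's Lasso with alpha = λσ/n are all assumptions; the point is only to see a dense solution and an MSE close to nσ²(1 + λ²).

```python
import numpy as np
from sklearn.linear_model import Lasso

n, sigma = 256, 0.1
p = 2 * n - 1
lam = np.sqrt(2 * np.log(p))
t = np.arange(1, n + 1)

# Rebuild the dictionary X = [I_n, F_{2:n}] from the formulas in the text.
cols = []
for k in range(1, n // 2):
    cols.append(np.sqrt(2.0 / n) * np.cos(2 * np.pi * k * t / n))
    cols.append(np.sqrt(2.0 / n) * np.sin(2 * np.pi * k * t / n))
cols.append((-1.0) ** t / np.sqrt(n))
X = np.hstack([np.eye(n), np.column_stack(cols)])

f = np.ones(n)                                    # assumed target: every coordinate equal to 1
rng = np.random.default_rng(4)
y = f + sigma * rng.standard_normal(n)

# sklearn's Lasso minimizes ||y - Xb||^2/(2n) + alpha*||b||_1, so alpha = lam*sigma/n
# matches the penalty lam*sigma*||b||_1 used in these notes.
fit = Lasso(alpha=lam * sigma / n, fit_intercept=False, max_iter=100000).fit(X, y)
b_hat = fit.coef_

print("nonzeros in b_hat:", np.sum(np.abs(b_hat) > 1e-8))           # close to n
print("lasso MSE:", np.sum((X @ b_hat - f) ** 2),
      " predicted n*sigma^2*(1+lam^2):", n * sigma ** 2 * (1 + lam ** 2))
```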