IEOR 265 Lecture 3 Sparse Linear Regression


1 M∗ Bound

Recall from last lecture that the reason we are interested in complexity measures of sets is because of the following result, which is known as the M∗ bound. Suppose (i) $T \subseteq \mathbb{R}^p$ is bounded and symmetric (i.e., $T = -T$), and (ii) $E$ is a random subspace of $\mathbb{R}^p$ with fixed codimension $n$ that is drawn according to an appropriate distribution. More specifically, suppose $E = \operatorname{nullspace}(A) = \{x : Ax = 0\}$, where $A \in \mathbb{R}^{n \times p}$ has iid $N(0,1)$ entries, meaning they are iid Gaussian entries with zero mean and unit variance. Then we have
$$\mathbb{E}\operatorname{diam}(T \cap E) \leq \frac{C\,w(T)}{\sqrt{n}}.$$

1.1 Three Useful Results

Before we can prove the M∗ bound, we need three results. The first result is that if (i) $U$ is a symmetric random variable (e.g., its pdf is even: $f_U(u) = f_U(-u)$), and (ii) $\epsilon$ is a Rademacher random variable (i.e., $\mathbb{P}(\epsilon = \pm 1) = 1/2$) that is independent of $U$, then $\epsilon U$ has the same distribution as $U$. To see why this is the case, note that if we condition $\epsilon U$ on the two possible values $\epsilon = \pm 1$, then the conditional distribution is the same as the distribution of $U$ because of the symmetry of $U$.

The second result is that if $\varphi$ is a Lipschitz continuous function with Lipschitz constant $L$ and with $\varphi(0) = 0$, then we have
$$\mathbb{E}\sup_{t \in T} \frac{1}{n}\Big|\sum_{i=1}^n \epsilon_i \varphi(t_i)\Big| \leq 2L\cdot\mathbb{E}\sup_{t \in T} \frac{1}{n}\Big|\sum_{i=1}^n \epsilon_i t_i\Big|.$$
If $L = 1$, then the function $\varphi$ is also known as a contraction, and so this result is sometimes known as a contraction comparison theorem.

The third result is a concentration of measure result; these types of results can be thought of as finite-sample versions of the central limit theorem. In particular, suppose $x_1, \ldots, x_n$ are iid $N(0,1)$, and let $\varphi : \mathbb{R}^n \to \mathbb{R}$ be a Lipschitz continuous function with Lipschitz constant $L$. Then we have
$$\mathbb{P}\big(|\varphi(x_1, \ldots, x_n) - \mathbb{E}\varphi(x_1, \ldots, x_n)| > t\big) \leq 2\exp\!\big(-t^2/(2L^2)\big).$$
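
As a supplement (not part of the original notes), the third useful result can be checked numerically. The sketch below is a minimal Monte Carlo illustration using the 1-Lipschitz function $\varphi(x) = \|x\|_2$; the sample sizes and the use of numpy are assumptions made only for this example.

```python
# Monte Carlo sanity check of Gaussian concentration for a Lipschitz function.
# phi(x) = ||x||_2 is Lipschitz with constant L = 1, so deviations of phi(x)
# from its mean should satisfy P(|phi - E phi| > t) <= 2 exp(-t^2 / 2).
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 100_000

x = rng.standard_normal((trials, n))     # each row is a draw of (x_1, ..., x_n)
phi = np.linalg.norm(x, axis=1)          # phi evaluated on every draw
mean_phi = phi.mean()                    # Monte Carlo estimate of E phi

for t in [0.5, 1.0, 2.0]:
    empirical = np.mean(np.abs(phi - mean_phi) > t)
    bound = 2 * np.exp(-t**2 / 2)        # concentration bound with L = 1
    print(f"t={t}: empirical tail {empirical:.4f} <= bound {bound:.4f}")
```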

1.2 Proof of the M∗ Bound

Let $a_i$ be the $i$-th row of $A$. We begin by noting that $a_i't$, where $t \in \mathbb{R}^p$ is an arbitrary vector, is a Gaussian random variable with $\mathbb{E}[a_i't] = 0$ and $\operatorname{var}(a_i't) = \|t\|_2^2$, since the entries of $a_i$ are iid $N(0,1)$. Thus, $|a_i't|$ has a folded normal distribution with mean $\mathbb{E}|a_i't| = \sqrt{2/\pi}\,\|t\|_2$.

Next, we use a common trick known as symmetrization. In particular, observe that
$$\begin{aligned}
\mathbb{E}\sup_{t \in T \cap E} \frac{1}{n}\Big|\sum_{i=1}^n \big(|a_i't| - \mathbb{E}|a_i't|\big)\Big|
&\leq \mathbb{E}\sup_{t \in T \cap E} \frac{1}{n}\Big|\sum_{i=1}^n \big(|a_i't| - |\tilde{a}_i't|\big)\Big| \\
&= \mathbb{E}\sup_{t \in T \cap E} \frac{1}{n}\Big|\sum_{i=1}^n \epsilon_i\big(|a_i't| - |\tilde{a}_i't|\big)\Big| \\
&\leq 2\,\mathbb{E}\sup_{t \in T \cap E} \frac{1}{n}\Big|\sum_{i=1}^n \epsilon_i |a_i't|\Big|,
\end{aligned}$$
where $\tilde{a}_i$ is an independent copy of $a_i$, and (i) the first line follows by Jensen's inequality, (ii) the second line follows by the first useful result from above, and (iii) the third line follows from the triangle inequality. Since the absolute value function is Lipschitz continuous with $L = 1$ and satisfies $|0| = 0$, we can apply the second useful result from above. This gives the bound
$$\mathbb{E}\sup_{t \in T \cap E} \frac{1}{n}\Big|\sum_{i=1}^n \big(|a_i't| - \mathbb{E}|a_i't|\big)\Big| \leq 4\,\mathbb{E}\sup_{t \in T \cap E} \frac{1}{n}\Big|\sum_{i=1}^n \epsilon_i a_i't\Big| = 4\,\mathbb{E}\sup_{t \in T \cap E} \Big|\Big(\frac{1}{n}\sum_{i=1}^n \epsilon_i a_i\Big)'t\Big|.$$
However, note that conditioned on the $\epsilon_i$ the random vector $\frac{1}{\sqrt{n}}\sum_{i=1}^n \epsilon_i a_i$ has distribution $N(0, I)$, since the entries of $a_i$ are iid $N(0,1)$. Since this holds conditioned on all possible values of the $\epsilon_i$, we have that unconditionally the random vector $\frac{1}{\sqrt{n}}\sum_{i=1}^n \epsilon_i a_i$ has distribution $N(0, I)$. Thus, we can make the variable substitution $g = \frac{1}{\sqrt{n}}\sum_{i=1}^n \epsilon_i a_i$, which gives
$$\mathbb{E}\sup_{t \in T \cap E} \frac{1}{n}\Big|\sum_{i=1}^n \big(|a_i't| - \mathbb{E}|a_i't|\big)\Big| \leq \frac{4}{\sqrt{n}}\,\mathbb{E}\sup_{t \in T \cap E} |g't| = \frac{4}{\sqrt{n}}\,\mathbb{E}\sup_{t \in T \cap E} g't,$$
where the last equality follows since $T \cap E$ is symmetric (because $T$ is assumed to be symmetric, and $E$ is symmetric because it is defined as a nullspace).

The last step is to make substitutions. First, recall that $w(T \cap E) = \mathbb{E}\sup_{t \in T \cap E} g't$ by definition of the Gaussian average. Second, recall we showed above that $\mathbb{E}|a_i't| = \sqrt{2/\pi}\,\|t\|_2$. Third, note that $a_i't = 0$ whenever $t \in E$, by definition of $E$ as the nullspace of the matrix $A$. Making these substitutions into the last inequality above, we have
$$\mathbb{E}\sup_{t \in T \cap E} \|t\|_2 \leq \sqrt{\frac{\pi}{2}}\cdot\frac{4\,w(T \cap E)}{\sqrt{n}}.$$
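
The key distributional fact driving the proof is $\mathbb{E}|a_i't| = \sqrt{2/\pi}\,\|t\|_2$ for the folded normal. The following short simulation (an illustration only, not part of the original notes; the dimensions and the use of numpy are assumptions) compares a Monte Carlo estimate of this mean with the closed form.

```python
# Check that E|a_i' t| matches sqrt(2/pi) * ||t||_2 when a_i ~ N(0, I_p).
import numpy as np

rng = np.random.default_rng(1)
p, trials = 20, 200_000

t = rng.standard_normal(p)               # an arbitrary fixed vector t
a = rng.standard_normal((trials, p))     # each row plays the role of a_i
empirical_mean = np.abs(a @ t).mean()    # Monte Carlo estimate of E|a_i' t|
theoretical_mean = np.sqrt(2 / np.pi) * np.linalg.norm(t)

print(empirical_mean, theoretical_mean)  # the two values should nearly agree
```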

Since i diamt = sup t, and ii wt wt because T T, this proves the M Bound. 1.3 High Probability Version The M Bound has a version that holds with high probability, which can be proved using the third useful result from above. In particular, assume that i T R p is bounded and symmetric i.e., T = T, and ii is a random subspace of R p with fixed codimension n and drawn according to an appropriate distribution. More specifically, suppose = nullspacea = {x : Ax = 0}, where A R n p has iid N 0, 1 entries. Then with probability at least 1 exp nt /diamt we have diamt CwT + Ct. n The benefit of this version is that allows us to state that the diameter of T is small with high probability, as opposed to merely stating that the expected diameter is small. High-Dimensional Linear Regression Recall the setting of a linear model with noiseless measurements: We have pairs of independent measurements x i, y i for i = 1,..., n, where x i R p and y i R, and the system is described by a linear model p y i = j x j i + ɛ i = x i. j=1 In the case where p is large relative to n, the OLS estimator has high estimation error. We now return to the original question posed in the first lecture: How can we estimate when p is large relative to n? In general, this problem is ill-posed. However, suppose we know that the majority of entries of are exactly equal to zero. We can imagine that our chances of estimating are improved in this specific setting. Intuitively, the reason is that if we knew which coefficients were zero beforehand, then we could have used a simple OLS estimator to estimate the nonzero coefficients of. However, the situation is more complicated in our present case. Specifically, we do not know a priori which coefficients of are nonzero. Fortunately, it turns out that this is not a major impediment..1 stimator for Sparse Linear Models Assume that the true value of has at most s nonzero components and has j µ. Now suppose we solve the feasibility problem { } ˆ = find : Y = X, {t R p : t 0 s, t j µ}, 3

where $\|t\|_0 = \sum_{j=1}^p \mathbf{1}(t_j \neq 0)$ counts the number of nonzero coefficients of $t$. By the M∗ bound we know that if the entries of $x_i$ are iid $N(0,1)$, then we have
$$\mathbb{E}\|\hat{\beta} - \beta\|_2 \leq \frac{C\,w(T)}{\sqrt{n}},$$
where $T = \{t \in \mathbb{R}^p : \|t\|_0 \leq s,\ |t_j| \leq \mu\}$. The final question is: What is the value of $w(T)$?

To bound $w(T)$, observe that $w(T) \leq w(S)$ if $T \subseteq S$. Define the $\ell_1$-ball with radius $s\mu$ to be $S = \{t : \|t\|_1 \leq s\mu\}$, and note that by construction $T \subseteq S$. Thus, we have that
$$w(T) \leq s\mu\sqrt{4\log p},$$
where we have used the Gaussian average of an $\ell_1$-ball from last lecture. This allows us to conclude that
$$\mathbb{E}\|\hat{\beta} - \beta\|_2 \leq C\mu s\sqrt{\frac{\log p}{n}}.$$
Thus, solving the above feasibility problem is able to estimate the vector $\beta$ in high dimensions because the dependence on dimensionality is logarithmic. However, this feasibility problem is nonconvex, and so an alternative solution approach is needed.

2.2 Convex Estimators

Instead, consider the convex feasibility problem
$$\hat{\beta} = \operatorname{find}\big\{\beta : Y = X\beta,\ \beta \in \{t \in \mathbb{R}^p : \|t\|_1 \leq s\mu\}\big\}.$$
By the same reasoning as above, when $x_i$ has iid $N(0,1)$ entries we have that
$$\mathbb{E}\|\hat{\beta} - \beta\|_2 \leq C\mu s\sqrt{\frac{\log p}{n}}.$$

Note that we can reformulate the convex feasibility problem as
$$\hat{\beta} = \arg\min\{\|\beta\|_1 : X\beta = Y\}.$$
It turns out that a solution of this optimization problem gives a solution to the convex feasibility problem. The reason is that the true value of $\beta$ and the estimated value $\hat{\beta}$ are both feasible for $X\beta = Y$ when there is no noise, and so we have $\|\hat{\beta}\|_1 \leq \|\beta\|_1 \leq s\mu$. Thus, $\hat{\beta}$ is also a solution to the convex feasibility problem.

Another reformulation of the convex feasibility problem is
$$\hat{\beta} = \arg\min\{\|Y - X\beta\|_2^2 : \|\beta\|_1 \leq s\mu\}.$$
Note that the true value of $\beta$ is feasible for this convex optimization problem by assumption, and achieves the minimum value $\|Y - X\beta\|_2^2 = 0$. Hence, any minimizer of this second convex optimization problem satisfies $\|Y - X\hat{\beta}\|_2^2 = 0$, which implies that $Y = X\hat{\beta}$. As a result, any solution to this optimization problem is also a solution to the convex feasibility problem.
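
To make the convex estimator concrete, the sketch below (an illustration only, not part of the original notes; the problem sizes and the use of scipy are assumptions) solves $\arg\min\{\|\beta\|_1 : X\beta = Y\}$ by rewriting it as a linear program, and compares it with the minimum-norm least-squares solution, which illustrates why OLS-type estimators struggle when $p$ is large relative to $n$.

```python
# l1 minimization subject to X beta = Y, posed as a linear program:
# variables z = (beta, u) with -u <= beta <= u; minimize sum(u) s.t. X beta = Y.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
n, p, s = 50, 200, 5

beta = np.zeros(p)
beta[:s] = 1.0                                   # s-sparse ground truth
X = rng.standard_normal((n, p))
Y = X @ beta                                     # noiseless measurements

c = np.concatenate([np.zeros(p), np.ones(p)])    # objective: sum of u
A_eq = np.hstack([X, np.zeros((n, p))])          # X beta = Y
A_ub = np.block([[np.eye(p), -np.eye(p)],        # beta - u <= 0
                 [-np.eye(p), -np.eye(p)]])      # -beta - u <= 0
b_ub = np.zeros(2 * p)
bounds = [(None, None)] * p + [(0, None)] * p    # beta free, u >= 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=Y, bounds=bounds)
beta_l1 = res.x[:p]

beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]  # minimum-norm least squares
print("l1 estimator error:  ", np.linalg.norm(beta_l1 - beta))
print("min-norm OLS error:  ", np.linalg.norm(beta_ols - beta))
```

With these sizes the $\ell_1$ solution typically recovers the sparse $\beta$ essentially exactly, while the minimum-norm least-squares solution does not.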

3 Lasso Regression

We have so far developed an estimator for high-dimensional linear models when our measurements $y_i$ have no noise. However, suppose we have noisy measurements $y_i = x_i'\beta + \epsilon_i$, where the $\epsilon_i$ are iid zero-mean and bounded random variables. A common approach to estimating the parameters $\beta$ is known as lasso regression, and is given by
$$\hat{\beta} = \arg\min\{\|Y - X\beta\|_2^2 : \|\beta\|_1 \leq \lambda\}.$$
This estimator is often written as
$$\hat{\beta} = \arg\min_\beta\ \|Y - X\beta\|_2^2 + \mu\|\beta\|_1,$$
and the reason is that the minimizers of the two optimization problems above are identical for an appropriate choice of the value $\mu$. To see why, consider the first optimization problem for $\lambda > 0$. Slater's condition holds, and so the Lagrange dual problem has zero duality gap. This dual problem is given by
$$\max_{\nu \geq 0}\ \min_\beta\ \|Y - X\beta\|_2^2 + \nu(\|\beta\|_1 - \lambda) = \max_\nu\big\{\|Y - X\hat{\beta}_\nu\|_2^2 + \nu\|\hat{\beta}_\nu\|_1 - \nu\lambda : \nu \geq 0\big\},$$
where $\hat{\beta}_\nu$ denotes the minimizer of the inner problem for a given $\nu$. Let the optimizer be $\nu^*$, and set $\mu = \nu^*$. Then the solution $\hat{\beta}_\mu$ minimizes $\|Y - X\beta\|_2^2 + \mu\|\beta\|_1$ and is identical to $\hat{\beta} = \arg\min\{\|Y - X\beta\|_2^2 : \|\beta\|_1 \leq \lambda\}$.

This result is useful because it has a graphical interpretation that provides additional insight. Visualizing the constrained form of the estimator provides intuition into why using the $\ell_1$-norm produces sparsity in the estimated coefficients $\hat{\beta}$. (A minimal numerical sketch of the penalized form appears after the references below.)

4 More Reading

The material in these sections follows that of

R. Vershynin (2014) "Estimation in high dimensions: a geometric perspective," in Sampling Theory, a Renaissance. Springer. To appear.

R. Vershynin (2009) Lectures in Geometric Functional Analysis. Available online: http://www-personal.umich.edu/~romanv/papers/gfa-book/gfa-book.pdf

M. Ledoux and M. Talagrand (1991) Probability in Banach Spaces. Springer.

More details about these sections can be found in the above references.
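
As a supplement to Section 3 (not part of the original notes), here is a minimal sketch of the penalized form of the lasso estimator applied to noisy measurements. The problem sizes, noise level, penalty weight, and the use of scikit-learn are all assumptions; note that scikit-learn scales the squared-error term by $1/(2n)$, so its alpha parameter corresponds to $\mu$ only up to that scaling.

```python
# Penalized lasso on noisy measurements y_i = x_i' beta + eps_i.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p, s = 100, 400, 5

beta = np.zeros(p)
beta[:s] = 1.0                                   # s-sparse ground truth
X = rng.standard_normal((n, p))
Y = X @ beta + 0.1 * rng.standard_normal(n)      # noisy measurements

# alpha plays the role of the penalty weight mu (up to sklearn's 1/(2n) scaling).
lasso = Lasso(alpha=0.05).fit(X, Y)
print("estimation error:    ", np.linalg.norm(lasso.coef_ - beta))
print("nonzero coefficients:", np.count_nonzero(lasso.coef_))
```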