Lecture 21 Theory of the Lasso II

02 December 2015
Taylor B. Arnold, Yale Statistics STAT 312/612

Class Notes
Midterm II - available now, due next Monday
Problem Set 7 - available now, due December 11th (grace period through the 16th)

Last time

Last class, we started investigating the theory of the lasso estimator. For the case of $X^t X$ equal to the identity matrix, we were able to quickly establish bounds on the prediction error, the estimation of $\beta$, and the reconstruction of the support of $\beta$. For an arbitrary $X$ matrix we were able to calculate a bound on $\|X(\widehat{\beta} - \beta)\|_2^2$. Today's goal is to establish a bound on $\|\widehat{\beta} - \beta\|_2^2$.

The basic starting point from last time was the following decomposition, which had no assumptions beyond linearity of the true model (here and below, $b$ denotes the lasso solution $\widehat{\beta}$):

$$\|X(\beta - b)\|_2^2 \leq 2\epsilon^t X(b - \beta) + \lambda \left( \|\beta\|_1 - \|b\|_1 \right)$$

We can think of this decomposition as the loss to be minimized, the empirical part, and the penalty term.
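As a quick sanity check, this inequality can be verified numerically. The sketch below is a minimal example, assuming a Gaussian design and using scikit-learn's Lasso; since scikit-learn minimizes $\frac{1}{2n}\|y - Xw\|_2^2 + \alpha\|w\|_1$, the conversion $\alpha = \lambda/(2n)$ corresponds to the unscaled objective $\|y - Xb\|_2^2 + \lambda\|b\|_1$ that the decomposition above comes from. All simulation settings are illustrative.

```python
# Minimal numerical check of the basic lasso decomposition (sketch; the
# simulation settings and the choice of lambda are illustrative).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s, sigma = 200, 50, 5, 1.0

X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:s] = 1.0                      # true support S = first s coordinates
eps = rng.normal(scale=sigma, size=n)
y = X @ beta + eps

lam = 2 * sigma * np.sqrt(2 * n * np.log(2 * p))   # illustrative lambda
# scikit-learn minimizes (1/(2n))||y - Xw||^2 + alpha*||w||_1,
# so alpha = lam / (2n) matches the objective ||y - Xb||^2 + lam*||b||_1.
b = Lasso(alpha=lam / (2 * n), fit_intercept=False, max_iter=50_000).fit(X, y).coef_

lhs = np.sum((X @ (beta - b)) ** 2)
rhs = 2 * eps @ X @ (b - beta) + lam * (np.abs(beta).sum() - np.abs(b).sum())
print(f"lhs = {lhs:.3f} <= rhs = {rhs:.3f}: {lhs <= rhs}")
```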

I then defined the set

$$\mathcal{A} = \left\{ 2 \|\epsilon^t X\|_\infty \leq \lambda \right\}$$

and showed that for any $A > 1$ we have $\mathbb{P}[\mathcal{A}] \geq 1 - A^{-1}$ whenever $\lambda \geq A \sqrt{8 \log(2p) \sigma^2}$.

Today we will motivate a stronger assumption on the model and use these two results to establish bounds on the estimation of $\beta$. Also, it will be helpful to write the set $\mathcal{A}$ as being parameterized by the value of $\lambda_0$:

$$\mathcal{A}(\lambda_0) = \left\{ 2 \|\epsilon^t X\|_\infty \leq \lambda_0 \right\}$$
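To get a feel for how conservative the probability statement is, $\mathbb{P}[\mathcal{A}(\lambda_0)]$ can be estimated by Monte Carlo. The sketch below assumes Gaussian noise and a design whose columns are rescaled to unit $\ell_2$ norm (an assumption made here so that the $\sqrt{8\log(2p)\sigma^2}$ scaling applies as written); all numeric settings are illustrative.

```python
# Monte Carlo estimate of P[A(lambda_0)] (sketch; unit-norm columns and
# Gaussian noise are assumptions made for this illustration).
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma, A = 200, 500, 1.0, 2.0

X = rng.normal(size=(n, p))
X /= np.linalg.norm(X, axis=0)           # rescale columns to unit l2 norm
lam0 = A * np.sqrt(8 * np.log(2 * p) * sigma**2)

n_rep = 2000
hits = 0
for _ in range(n_rep):
    eps = rng.normal(scale=sigma, size=n)
    if 2 * np.max(np.abs(eps @ X)) <= lam0:
        hits += 1

print(f"estimated P[A(lambda_0)] = {hits / n_rep:.3f}  (target >= {1 - 1/A:.3f})")
```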

Bounds on estimation error

We already know that on $\mathcal{A}(\lambda_0)$, and with $\lambda > 2\lambda_0$, we have:

$$\|X(b - \beta)\|_2^2 + \lambda \|b\|_1 \leq 2\epsilon^t X(b - \beta) + \lambda \|\beta\|_1 \leq \lambda_0 \|b - \beta\|_1 + \lambda \|\beta\|_1$$

where the second inequality uses Hölder's inequality, $2\epsilon^t X(b - \beta) \leq 2\|\epsilon^t X\|_\infty \|b - \beta\|_1 \leq \lambda_0 \|b - \beta\|_1$, on the set $\mathcal{A}(\lambda_0)$. Now, multiplying by two and using $2\lambda_0 \leq \lambda$ gives:

$$2\|X(b - \beta)\|_2^2 + 2\lambda \|b\|_1 \leq \lambda \|b - \beta\|_1 + 2\lambda \|\beta\|_1$$

Recall that we defined the notation: $S = \{j : \beta_j \neq 0\}$, $s$ is the size of the set $S$, and $v_S$ is the vector $v$ with the components not in $S$ set to zero.
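In code this notation is just masking; a tiny illustrative helper (the names here are my own, not from the lecture):

```python
# Support-set notation as numpy masking (illustrative helper, not from the notes).
import numpy as np

def support_pieces(v, S):
    """Return (v_S, v_Sc): v with entries outside / inside S zeroed out."""
    mask = np.zeros_like(v, dtype=bool)
    mask[list(S)] = True
    return np.where(mask, v, 0.0), np.where(mask, 0.0, v)

beta = np.array([1.5, 0.0, -2.0, 0.0])
S = {j for j, bj in enumerate(beta) if bj != 0}   # S = {0, 2}, s = len(S)
beta_S, beta_Sc = support_pieces(beta, S)
```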

Notice that:

$$\|b\|_1 = \|b_S\|_1 + \|b_{S^c}\|_1 \geq \|\beta\|_1 - \|b_S - \beta\|_1 + \|b_{S^c}\|_1$$

using the (reverse) triangle inequality and the fact that $\beta_{S^c}$ is zero by definition.

Similarly, we have:

$$\|b - \beta\|_1 = \|b_S - \beta_S\|_1 + \|b_{S^c}\|_1$$

where writing $\beta_S$ (rather than just $\beta$) is clearly redundant, but useful to keep the notation straight.

Plugging these in, we now get:

$$2\|X(b - \beta)\|_2^2 + 2\lambda\|\beta\|_1 - 2\lambda\|b_S - \beta_S\|_1 + 2\lambda\|b_{S^c}\|_1 \leq \lambda\|b_S - \beta_S\|_1 + \lambda\|b_{S^c}\|_1 + 2\lambda\|\beta\|_1$$

which, after canceling the $2\lambda\|\beta\|_1$ terms and collecting the remaining terms, simplifies to:

$$2\|X(b - \beta)\|_2^2 + \lambda\|b_{S^c}\|_1 \leq 3\lambda\|b_S - \beta_S\|_1$$

This result actually gives two sub-results: both terms on the left hand side are non-negative, so each is individually bounded by the right hand side. In particular, we have:

$$\|b_{S^c}\|_1 \leq 3\|b_S - \beta_S\|_1$$

which implies that the error in $b$ cannot be too heavily concentrated on $S^c$.
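This "cone condition" can be checked empirically. The sketch below reuses the earlier simulation setup (illustrative settings, scikit-learn's Lasso with the $\alpha = \lambda/(2n)$ conversion) and compares the two sides of the inequality at the fitted solution.

```python
# Empirical check of the cone condition ||b_Sc||_1 <= 3 ||b_S - beta_S||_1
# (sketch; simulation settings and lambda are illustrative).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, s, sigma = 200, 400, 5, 1.0

X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:s] = 2.0
y = X @ beta + rng.normal(scale=sigma, size=n)

lam = 2 * sigma * np.sqrt(2 * n * np.log(2 * p))      # illustrative lambda
b = Lasso(alpha=lam / (2 * n), fit_intercept=False, max_iter=50_000).fit(X, y).coef_

S = beta != 0
lhs = np.abs(b[~S]).sum()                 # ||b_{S^c}||_1
rhs = 3 * np.abs(b[S] - beta[S]).sum()    # 3 ||b_S - beta_S||_1
print(f"||b_Sc||_1 = {lhs:.3f} <= 3||b_S - beta_S||_1 = {rhs:.3f}: {lhs <= rhs}")
```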

The other sub-result gives:

$$2\|X(b - \beta)\|_2^2 \leq 3\lambda\|b_S - \beta_S\|_1$$

If $\sigma_{\min}$ is the minimum singular value of $X$, then the left hand side can be bounded below, giving:

$$2\sigma_{\min}^2 \|b - \beta\|_2^2 \leq 3\lambda\|b_S - \beta_S\|_1$$

Using the Cauchy-Schwarz inequality, and the fact that $\|b_S - \beta_S\|_2 \leq \|b - \beta\|_2$, this becomes:

$$2\sigma_{\min}^2 \|b - \beta\|_2^2 \leq 3\lambda\sqrt{s}\,\|b_S - \beta_S\|_2 \leq 3\lambda\sqrt{s}\,\|b - \beta\|_2
\quad\Longrightarrow\quad
\|b - \beta\|_2 \leq \frac{3\lambda\sqrt{s}}{2\sigma_{\min}^2}$$

which gives a bound on the error of estimating $\beta$, which is exactly what we wanted to establish.

Why is this not sufficient for us? Well, in the high-dimensional case $p > n$, we will always have $\sigma_{\min}$ equal to $0$. We can get around this problem by defining a modified version of the minimum eigenvalue (or squared singular value) that only considers directions $b - \beta$ such that:

$$\|b_{S^c}\|_1 \leq 3\|b_S - \beta_S\|_1$$
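The $p > n$ claim is easy to see numerically: the Gram matrix $X^t X$ is rank deficient, so its smallest eigenvalue (which is $\sigma_{\min}^2$ taken over all $p$ directions) is zero. A two-line check with illustrative sizes:

```python
# When p > n, X^t X is rank deficient, so its smallest eigenvalue
# (= sigma_min^2 over all p directions) is zero (illustrative sizes).
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 200
X = rng.normal(size=(n, p))

eigs = np.linalg.eigvalsh(X.T @ X)        # p eigenvalues, sorted ascending
print(f"smallest eigenvalue of X^t X: {eigs[0]:.2e}")      # numerically ~0
print(f"rank(X) = {np.linalg.matrix_rank(X)} < p = {p}")
```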

The (minimum) restricted eigenvalue $\phi_S$ on the set $S$ is defined as:

$$\phi_S = \min_{v \in V_S} \frac{\|Xv\|_2}{\|v\|_2}$$

where

$$V_S = \left\{ v \in \mathbb{R}^p \text{ s.t. } \|v_{S^c}\|_1 \leq 3\|v_S\|_1 \right\}$$

Because we do not know $S$, it is impossible to calculate $\phi_S$ in practice. In theoretical work, one often considers the restricted eigenvalue $\phi$, defined as the smallest $\phi_S$ over all sets $S$ whose size is bounded by some predefined $s_0$.
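Even with $S$ known, $\phi_S$ involves a minimum over a non-convex cone and is hard to compute exactly. A crude Monte Carlo search over random directions in $V_S$ only gives an upper bound on the minimum, but it is a quick way to get a feel for how the restricted quantity behaves when the ordinary $\sigma_{\min}$ is zero. The sketch below uses illustrative sizes and an illustrative scaling of the design.

```python
# Crude Monte Carlo upper bound on the restricted eigenvalue phi_S
# (sketch: random search over the cone V_S; all settings are illustrative).
import numpy as np

rng = np.random.default_rng(4)
n, p, s = 50, 200, 5
X = rng.normal(size=(n, p)) / np.sqrt(n)      # illustrative scaling of the design
S = np.arange(s)

best = np.inf
for _ in range(20_000):
    v = np.zeros(p)
    v[S] = rng.normal(size=s)
    w = rng.normal(size=p - s)
    # scale the off-support part so that ||v_Sc||_1 <= 3 ||v_S||_1
    budget = 3 * np.abs(v[S]).sum() * rng.uniform()
    v[s:] = w / np.abs(w).sum() * budget
    best = min(best, np.linalg.norm(X @ v) / np.linalg.norm(v))

print(f"Monte Carlo upper bound on phi_S: {best:.3f}")
print("(the ordinary sigma_min of X over all directions is 0 since p > n)")
```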

Now, we can bound the following using our prior result:

$$2\|X(b - \beta)\|_2^2 + \lambda\|b - \beta\|_1 = 2\|X(b - \beta)\|_2^2 + \lambda\|b_S - \beta_S\|_1 + \lambda\|b_{S^c}\|_1 \leq 4\lambda\|b_S - \beta_S\|_1$$

Using Cauchy-Schwarz again, we can change the $\ell_1$-norm into an $\ell_2$-norm at the cost of a factor of $\sqrt{s}$:

$$2\|X(b - \beta)\|_2^2 + \lambda\|b - \beta\|_1 \leq 4\lambda\sqrt{s}\,\|b_S - \beta_S\|_2$$

Finally, since $b - \beta$ lies in the cone $V_S$, we can use the restricted eigenvalue $\phi$ to convert from $\beta$ space to $X\beta$ space, via $\|b_S - \beta_S\|_2 \leq \|b - \beta\|_2 \leq \|X(b - \beta)\|_2 / \phi$:

$$2\|X(b - \beta)\|_2^2 + \lambda\|b - \beta\|_1 \leq 4\lambda\sqrt{s}\,\|X(b - \beta)\|_2 / \phi$$

I am now going to use an inequality trick that is often useful in theoretical statistics derivations. For any $u$ and $v$, notice that $4uv \leq u^2 + 4v^2$. For a proof, simply observe that $u^2 + 4v^2 - 4uv = (u - 2v)^2 \geq 0$.

Setting $u = \|X(b - \beta)\|_2$ and $v = \lambda\sqrt{s}/\phi$, we then have:

$$2\|X(b - \beta)\|_2^2 + \lambda\|b - \beta\|_1 \leq \|X(b - \beta)\|_2^2 + 4\lambda^2 s / \phi^2$$

and, after canceling one factor of $\|X(b - \beta)\|_2^2$:

$$\|X(b - \beta)\|_2^2 + \lambda\|b - \beta\|_1 \leq 4\lambda^2 s / \phi^2$$

which holds on the entire set $\mathcal{A}(\lambda_0)$.

This establishes two simultaneous bounds:

$$\|X(b - \beta)\|_2^2 \leq 4\lambda^2 s / \phi^2$$

$$\|b - \beta\|_1 \leq 4\lambda s / \phi^2$$

The first is slightly less satisfying than our result from last class, as it relies on $\phi^2$; on the other hand, it no longer requires the $\ell_1$ norm of $\beta$.
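Putting the pieces together, the sketch below simulates data with a known support, fits the lasso (again with the $\alpha = \lambda/(2n)$ conversion), checks the $\phi$-free intermediate inequality, and then displays the two final bounds using a rough plug-in for $\phi$. Since the exact restricted eigenvalue is intractable, the plug-in is only indicative; all settings are illustrative.

```python
# End-to-end simulation: fit the lasso and compare errors to the bounds above
# (sketch; the phi plug-in is a rough over-estimate, so the displayed bounds
#  are only indicative; all settings are illustrative).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p, s, sigma = 400, 1000, 5, 1.0

X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:s] = 3.0
y = X @ beta + rng.normal(scale=sigma, size=n)

lam = 6 * sigma * np.sqrt(2 * n * np.log(2 * p))   # comfortably above 2*lambda_0 here (illustrative)
b = Lasso(alpha=lam / (2 * n), fit_intercept=False, max_iter=100_000).fit(X, y).coef_

pred_err = np.sum((X @ (b - beta)) ** 2)      # ||X(b - beta)||_2^2
l1_err = np.abs(b - beta).sum()               # ||b - beta||_1

# Phi-free intermediate inequality (holds on A(lambda_0) with lambda >= 2*lambda_0):
interm_rhs = 4 * lam * np.abs(b[:s] - beta[:s]).sum()
print(f"2*pred_err + lam*l1_err = {2*pred_err + lam*l1_err:.1f} <= {interm_rhs:.1f}")

# Final bounds, using a rough plug-in for phi (the ratio at the realized error
# direction, which over-estimates phi, so these bounds may look optimistic).
v = b - beta
phi_hat = np.linalg.norm(X @ v) / np.linalg.norm(v)
print(f"pred_err = {pred_err:.1f}  vs  4*lam^2*s/phi^2 ~= {4*lam**2*s/phi_hat**2:.1f}")
print(f"l1_err   = {l1_err:.3f}  vs  4*lam*s/phi^2   ~= {4*lam*s/phi_hat**2:.3f}")
```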

Asymptotic analysis

As before, we can convert to a more natural re-scaled problem by dividing all of the $\lambda$ parameters by $n$. Also, remember that for some $A > 1$, we have $\mathbb{P}[\mathcal{A}(\lambda_0)] \geq 1 - A^{-1}$ for all $\lambda > A\sqrt{16 n^{-1}\log(2p)\sigma^2}$.
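For concreteness, the theoretical (re-scaled) $\lambda$ is easy to compute for given problem sizes; the values of $n$, $p$, $\sigma^2$, and $A$ below are arbitrary illustrations.

```python
# Theoretical (re-scaled) lambda from the probability bound above
# (illustrative values of n, p, sigma^2, and A).
import numpy as np

n, p, sigma2, A = 500, 10_000, 1.0, 2.0
lam_theory = A * np.sqrt(16 * np.log(2 * p) / n * sigma2)
print(f"lambda > {lam_theory:.4f} gives P[A(lambda_0)] >= {1 - 1/A:.2f}")
```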

Therefore, we have:

$$\|b - \beta\|_1 \leq \frac{4\lambda s}{\phi^2} \leq \frac{8\sqrt{A\sigma^2}}{\phi^2}\sqrt{\frac{s_n^2\log(2p_n)}{n}}$$

which is the same result as in the Bickel, Ritov, and Tsybakov paper.

To establish consistency of the estimator under constant noise and restricted eigenvalues $\phi^2$, we need the following limit to go to zero:

$$\lim_{n \to \infty} \frac{s_n^2 \log(2p_n)}{n} = 0$$

This can happen with a number of different scalings, such as a constant number of non-zero terms but a nearly exponential number of total variables $p_n$; or $s_n$ growing like $n^{1/3}$ and $p_n$ growing linearly with $s_n$.
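A quick numeric illustration of the second scaling (example values of my own choosing): with $s_n = n^{1/3}$ and $p_n$ proportional to $s_n$, the ratio $s_n^2\log(2p_n)/n$ indeed decays.

```python
# Numeric illustration that s_n^2 log(2 p_n) / n -> 0 under one of the
# scalings mentioned above (example values only).
import numpy as np

for n in [10**3, 10**4, 10**5, 10**6]:
    s_n = n ** (1 / 3)
    p_n = 10 * s_n                      # p_n growing linearly with s_n
    print(f"n = {n:>8d}: s_n^2 log(2 p_n)/n = {s_n**2 * np.log(2*p_n) / n:.4f}")
```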

Some (personal) closing thoughts on the application of the lasso theory to data analysis:

1. The theory is useful for establishing a rough rule of thumb for how large $p_n$ and $s_n$ can be to have a reasonable chance of reconstructing $\beta$ or $X\beta$.
2. The theory also helps guide where to start looking for the optimal $\lambda$.
3. We still generally need some form of cross-validation, however, as the theoretical values tend to overestimate $\lambda$ in practice; we also do not know $\sigma^2$ and in theory need to use an over-estimate of it for the convergence results to hold.
4. Bounds on $\|X(\widehat{\beta} - \beta)\|_2^2$ are nice to have; however, the theoretical bounds on $\|\widehat{\beta} - \beta\|_2^2$ are difficult to use in practice due to the nearly impossible to calculate restricted eigenvalue assumption.
5. I have always been skeptical of the asymptotic results for the same reason; $\phi$ likely depends on $n$, $p_n$, and $s_n$ in complex ways that are not accounted for.

For our next (and last) week we will:

1. use the lasso to encode more complex forms of linear sparsity (e.g., outlier detection and the fused lasso)
2. give an alternative approach to solving for the lasso solution at a particular value of $\lambda$