
ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis

Lecture 3: Sparse signal recovery: a RIPless analysis of $\ell_1$ minimization

Yuejie Chi, The Ohio State University

Page 1

Outline

- A RIPless theory for sparse recovery in CS using $\ell_1$ minimization.

Reference: E. J. Candes and Y. Plan. A probabilistic and RIPless theory of compressed sensing. 2010.

Page 2

Subgradient of the $\ell_1$ function

Consider a convex function $f(x)$.

Definition 1 (Subgradient). $u \in \partial f(x_0)$ is a subgradient of a convex $f$ at $x_0$ if for all $x$:
$$f(x) \geq f(x_0) + u^T (x - x_0).$$

Remark: if $f$ is differentiable at $x_0$, the only subgradient is the gradient $\nabla f(x_0)$.

Example: for the scalar absolute value function $f(t) = |t|$, $t \in \mathbb{R}$, we have $u \in \partial f(t)$ iff
$$u = \mathrm{sgn}(t) \text{ if } t \neq 0, \qquad u \in [-1, 1] \text{ if } t = 0.$$

For $f(x) = \|x\|_1$, $x \in \mathbb{R}^n$, we have $u \in \partial f(x)$ iff
$$u_i = \mathrm{sgn}(x_i) \text{ if } x_i \neq 0, \qquad u_i \in [-1, 1] \text{ if } x_i = 0.$$

Page 3
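As a quick numerical sanity check (a minimal sketch, ours rather than part of the slides), the snippet below picks the subgradient of $\|\cdot\|_1$ that sets $u_i = 0$ on the zero coordinates — one valid choice from $[-1, 1]$ — and verifies the defining inequality at random test points.

```python
import numpy as np

rng = np.random.default_rng(0)

def l1_subgradient(x0):
    # One valid subgradient of f(x) = ||x||_1 at x0:
    # u_i = sgn(x0_i) where x0_i != 0, and u_i = 0 (a choice from [-1, 1]) where x0_i = 0.
    return np.sign(x0)

x0 = np.array([1.5, 0.0, -2.0, 0.0])
u = l1_subgradient(x0)

# Verify the defining inequality f(x) >= f(x0) + <u, x - x0> at random test points.
for _ in range(1000):
    x = rng.normal(size=x0.shape)
    assert np.sum(np.abs(x)) >= np.sum(np.abs(x0)) + u @ (x - x0) - 1e-12

print("subgradient inequality holds at all 1000 test points")
```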

Characterization of the $\ell_1$ solution: an optimization viewpoint

Proposition 1 (Necessary and sufficient condition for $\ell_1$ recovery). Denote the support of $x$ as $T$. Then $x$ is a solution to BP if and only if for all $h \in \mathrm{Null}(A)$,
$$\sum_{i \in T} \mathrm{sign}(x_i)\, h_i \leq \sum_{i \in T^c} |h_i|.$$
Furthermore, $x$ is the unique solution if equality holds only when $h = 0$.

Remark: the recovery property depends only on the sign pattern of $x$, not on the magnitudes!

Proof of Proposition 1: We first show the condition is sufficient. Denote the solution of BP as $\hat{x} = x + h$. We have $Ah = A(\hat{x} - x) = 0$, i.e., $h \in \mathrm{Null}(A)$.

Page 4

Since $x$ is supported on $T$, we have
$$\|x\|_1 \geq \|\hat{x}\|_1 = \|x + h\|_1 = \sum_{i \in T} |x_i + h_i| + \sum_{i \in T^c} |h_i| \geq \sum_{i \in T} \big( |x_i| + \mathrm{sign}(x_i) h_i \big) + \sum_{i \in T^c} |h_i| \geq \sum_{i \in T} |x_i| = \|x\|_1.$$
All inequalities therefore hold with equality, so under the uniqueness assumption $h = 0$ and $\hat{x} = x$.

Next we show the condition is also necessary. If there exists $h \in \mathrm{Null}(A)$ such that
$$\sum_{i \in T} \mathrm{sign}(x_i)\, h_i > \sum_{i \in T^c} |h_i|,$$
then (rescaling $h$ if necessary so that $|h_i| < |x_i|$ for $i \in T$, which is allowed since the condition is scale-invariant) we can verify
$$\|x - h\|_1 = \sum_{i \in T} |x_i - h_i| + \sum_{i \in T^c} |h_i| \leq \sum_{i \in T} \big( |x_i| - \mathrm{sign}(x_i) h_i \big) + \sum_{i \in T^c} |h_i| < \sum_{i \in T} |x_i| = \|x\|_1,$$
so $x - h$ is feasible with strictly smaller $\ell_1$ norm and $x$ cannot be a solution to BP.

Page 5
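To make Proposition 1 concrete, here is a minimal numerical sketch (ours, not from the notes): it draws a Gaussian $A$ and a sparse $x$, samples random directions $h \in \mathrm{Null}(A)$ via scipy.linalg.null_space, and checks the sign condition. Random directions only give a sanity check, not a proof.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)
n, m, k = 40, 20, 3

x = np.zeros(n)
T = rng.choice(n, size=k, replace=False)   # support of the ground truth
x[T] = rng.normal(size=k)
A = rng.normal(size=(m, n))                # i.i.d. N(0,1) sensing matrix

N = null_space(A)                          # orthonormal basis of Null(A), shape (n, n - m)
Tc = np.setdiff1d(np.arange(n), T)

# Check | sum_{i in T} sign(x_i) h_i | < sum_{i in T^c} |h_i| on random h in Null(A)
# (the two-sided form is equivalent, since h and -h are both in Null(A)).
worst_ratio = 0.0
for _ in range(5000):
    h = N @ rng.normal(size=N.shape[1])    # random direction in the null space
    lhs = abs(np.sign(x[T]) @ h[T])
    rhs = np.sum(np.abs(h[Tc]))
    worst_ratio = max(worst_ratio, lhs / rhs)

print("largest observed ratio:", worst_ratio)   # < 1 is consistent with unique recovery
```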

Dual certificate

Denote the support of $x$ as $T$.

Proposition 2. $x$ is an optimal solution of BP iff there exists $u = A^T \lambda$ such that
$$u_i = \mathrm{sgn}(x_i), \ i \in T; \qquad u_i \in [-1, 1], \ i \in T^c.$$
In addition, if $|u_i| < 1$ for $i \in T^c$ and $A_T$ has full column rank, then $x$ is the unique solution.

Remarks:

- We call $u$ or $\lambda$ the (exact) dual certificate. If we can find such a dual certificate, we can verify the optimality of BP.
- Note that $u \perp \mathrm{Null}(A)$, and $u$ is also a subgradient of $\|x\|_1$ at $x$.

Page 6

Dual certificate: geometric interpretation

Geometric interpretation of the dual certificate: there exists a subgradient $u$ of the objective function $\|x\|_1$ at the ground truth $x$ such that $u \perp \mathrm{Null}(A)$.

Page 7

Unicity

If $\mathrm{supp}(x) \subseteq T$, then $Ax = A_T x_T$. Note that for any $h \in \mathrm{Null}(A)$,
$$\sum_{i \in T} \mathrm{sgn}(x_i) h_i = \sum_{i \in T} u_i h_i = \langle u, h \rangle - \sum_{i \in T^c} u_i h_i = -\sum_{i \in T^c} u_i h_i \quad (\text{since } u \perp \mathrm{Null}(A))$$
$$\leq \sum_{i \in T^c} |u_i|\, |h_i| < \sum_{i \in T^c} |h_i| \quad (\text{since } |u_i| < 1 \text{ for } i \in T^c),$$
unless $h_{T^c} = 0$. If $h_{T^c} = 0$, since $A_T$ has full column rank, $Ah = A_T h_T = 0$ implies $h_T = 0$ as well. In summary, equality in the condition of Proposition 1 can only occur at $h = h_T + h_{T^c} = 0$, so $x$ is the unique solution.

Page 8

A Probabilistic Approach with Gaussian Matrices

Our goal is to develop a theory of compressed sensing that 1) does not require RIP; and 2) admits near-optimal performance guarantees.

Let $A$ be composed of i.i.d. $\mathcal{N}(0, 1)$ entries.

Question: How well does BP perform for an arbitrary but fixed sparse signal?
$$\text{(BP:)} \quad \hat{x} = \arg\min_x \|x\|_1 \quad \text{subject to} \quad y = Ax.$$

Theorem 1. Let $x \in \mathbb{R}^n$ be an arbitrary fixed vector that is $k$-sparse. Assume $A$ is composed of i.i.d. $\mathcal{N}(0, 1)$ entries. As long as $m \geq C_1 k \log n$ for some large enough constant $C_1$, $x$ is the unique solution to BP with probability at least $1 - n^{-C_2}$ for some constant $C_2$.

Remark: Compare this result with the earlier RIP-based result.

Page 9
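Below is a minimal sketch (ours, not part of the slides) that illustrates Theorem 1 empirically: BP is recast as the standard linear program over $(x, t)$ with $-t \leq x \leq t$ and $Ax = y$, solved with scipy.optimize.linprog. With $m \approx 3 k \log n$ Gaussian measurements, the $k$-sparse signal is recovered essentially exactly.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
n, k = 128, 5
m = int(3 * k * np.log(n))              # m ~ C1 * k * log n measurements

# k-sparse ground truth and i.i.d. N(0,1) sensing matrix
x = np.zeros(n)
supp = rng.choice(n, size=k, replace=False)
x[supp] = rng.normal(size=k)
A = rng.normal(size=(m, n))
y = A @ x

# BP as a linear program over z = [x; t]:
#   minimize sum(t)   s.t.  -t <= x <= t,  A x = y
c = np.concatenate([np.zeros(n), np.ones(n)])
A_ub = np.block([[np.eye(n), -np.eye(n)],       #  x - t <= 0
                 [-np.eye(n), -np.eye(n)]])     # -x - t <= 0
b_ub = np.zeros(2 * n)
A_eq = np.hstack([A, np.zeros((m, n))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * (2 * n), method="highs")
x_hat = res.x[:n]
print("recovery error:", np.linalg.norm(x_hat - x))   # near machine precision
```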

Proof by certifying the dual certificate

Denote the support of $x$ as $T$. We first verify that $A_T$ has full column rank with high probability. Since $A_T$ is a fixed $m \times k$ matrix with i.i.d. $\mathcal{N}(0, 1)$ entries, random matrix theory tells us (we'll just take this for granted)
$$\mathbb{P}\left( \frac{1}{\sqrt{m}} \sigma_{\max}(A_T) > 1 + \sqrt{\frac{k}{m}} + t \right) \leq e^{-mt^2/2},$$
$$\mathbb{P}\left( \frac{1}{\sqrt{m}} \sigma_{\min}(A_T) < 1 - \sqrt{\frac{k}{m}} - t \right) \leq e^{-mt^2/2}.$$
Then as long as $m \geq c_1 k$ for some large constant $c_1$, we have
$$\left\| \frac{1}{m} A_T^T A_T - I \right\| \leq \frac{1}{2}$$
with probability at least $1 - e^{-c_2 m}$ for some $c_2$. Call this event $\mathcal{A}$.

Page 10
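The concentration step can be seen numerically. This small sketch (illustrative, with arbitrary sizes of our choosing) draws $A_T$ with i.i.d. $\mathcal{N}(0,1)$ entries and reports the spectral-norm deviation $\|\frac{1}{m} A_T^T A_T - I\|$, which shrinks as $m/k$ grows, consistent with event $\mathcal{A}$.

```python
import numpy as np

rng = np.random.default_rng(3)
k = 10
for m in [4 * k, 16 * k, 64 * k]:
    devs = []
    for _ in range(50):
        A_T = rng.normal(size=(m, k))      # i.i.d. N(0,1), support of size k
        devs.append(np.linalg.norm(A_T.T @ A_T / m - np.eye(k), 2))  # spectral norm
    print(f"m = {m:4d}:  max deviation over 50 draws = {max(devs):.3f}")
```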

Construction of the dual certificate

We need to find a dual certificate $u = A^T \lambda$ such that
$$u_i = \mathrm{sgn}(x_i), \ i \in T; \qquad |u_i| < 1, \ i \in T^c.$$
Consider the solution to the following $\ell_2$ minimization problem:
$$\min \|u\|_2 \quad \text{s.t.} \quad u = A^T \lambda, \quad u_i = \mathrm{sgn}(x_i), \ i \in T,$$
which can be written explicitly as
$$u = A^T A_T (A_T^T A_T)^{-1} \mathrm{sgn}(x_T).$$
Note that under event $\mathcal{A}$, $A_T^T A_T$ is invertible, and $\|(A_T^T A_T)^{-1}\| \leq \frac{2}{m}$. We will show the above choice is a valid dual certificate.

Page 11
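A minimal numerical sketch of this construction (ours, not from the slides): build $u = A^T A_T (A_T^T A_T)^{-1} \mathrm{sgn}(x_T)$ for one Gaussian instance and check that it matches $\mathrm{sgn}(x)$ on $T$ and stays strictly below 1 in magnitude on $T^c$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 256, 6
m = int(4 * k * np.log(n))

x = np.zeros(n)
T = rng.choice(n, size=k, replace=False)
x[T] = rng.normal(size=k)
A = rng.normal(size=(m, n))

# Least-squares dual certificate: u = A^T A_T (A_T^T A_T)^{-1} sgn(x_T)
A_T = A[:, T]
lam = A_T @ np.linalg.solve(A_T.T @ A_T, np.sign(x[T]))   # lambda in R^m
u = A.T @ lam

Tc = np.setdiff1d(np.arange(n), T)
print("u = sgn(x) on T :", np.allclose(u[T], np.sign(x[T])))
print("max |u_i| on T^c:", np.abs(u[Tc]).max())           # should be < 1 w.h.p.
```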

Validation of the dual certificate

The only condition that needs extra work is to establish $|u_i| < 1$, $i \in T^c$. This amounts to bounding
$$\max_{i \in T^c} |u_i| = \max_{i \in T^c} \Big| \big\langle a_i, \underbrace{A_T (A_T^T A_T)^{-1} \mathrm{sgn}(x_T)}_{w} \big\rangle \Big|,$$
where $a_i$ is the $i$th column of $A$.

Note that $a_i$ and $w$ are independent for $i \in T^c$. For a fixed index $i \in T^c$, conditioned on $w$, $u_i \sim \mathcal{N}(0, \|w\|_2^2)$, and we have the Chernoff bound for the tail of a Gaussian random variable:
$$\mathbb{P}(|u_i| \geq 1 \mid w) \leq 2 \exp\left( -\frac{1}{2 \|w\|_2^2} \right).$$

Page 12

Under the event $\mathcal{A}$, we can also bound $\|w\|_2$ as
$$\|w\|_2 = \left\| A_T (A_T^T A_T)^{-1} \mathrm{sgn}(x_T) \right\|_2 \leq \left\| (A_T^T A_T)^{-1} \right\|^{1/2} \|\mathrm{sgn}(x_T)\|_2 \leq \sqrt{\frac{2k}{m}},$$
since $\left( A_T (A_T^T A_T)^{-1} \right)^T A_T (A_T^T A_T)^{-1} = (A_T^T A_T)^{-1}$.

We have
$$\mathbb{P}\left( \max_{i \in T^c} |u_i| \geq 1 \right) \leq \sum_{i \in T^c} \mathbb{P}(|u_i| \geq 1) \quad \text{(union bound)} \quad \leq n \int_w \mathbb{P}(|u_i| \geq 1 \mid w)\, d\mu(w).$$

Page 13

Note that
$$\int_w \mathbb{P}(|u_i| \geq 1 \mid w)\, d\mu(w) = \int_{\|w\|_2 \leq \sqrt{2k/m}} \mathbb{P}(|u_i| \geq 1 \mid w)\, d\mu(w) + \int_{\|w\|_2 > \sqrt{2k/m}} \mathbb{P}(|u_i| \geq 1 \mid w)\, d\mu(w)$$
$$\leq \int_{\|w\|_2 \leq \sqrt{2k/m}} 2 e^{-\frac{1}{2\|w\|_2^2}}\, d\mu(w) + \mathbb{P}\left( \|w\|_2 > \sqrt{\frac{2k}{m}} \right) \leq 2 e^{-\frac{m}{4k}} + \mathbb{P}(\mathcal{A}^c) \leq 2 e^{-\frac{m}{4k}} + e^{-c_2 m} \leq 3 e^{-\frac{m}{4k}},$$
which gives
$$\mathbb{P}\left( \max_{i \in T^c} |u_i| \geq 1 \right) \leq 3 n\, e^{-\frac{m}{4k}}.$$
Set $m = 4(\gamma + 1) k \log n$ for some $\gamma > 0$; then $\|u_{T^c}\|_\infty < 1$ with probability at least $1 - 3 n^{-\gamma}$.

Page 14
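As a rough empirical counterpart to this bound (a sketch with problem sizes of our choosing), the snippet below builds the least-squares certificate over many random instances with $m = 4(\gamma + 1) k \log n$ and compares the observed failure frequency of $\|u_{T^c}\|_\infty < 1$ with the bound $3 n e^{-m/(4k)}$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, gamma = 512, 4, 1.0
m = int(4 * (gamma + 1) * k * np.log(n))          # m = 4(gamma + 1) k log n

trials, failures = 200, 0
for _ in range(trials):
    A = rng.normal(size=(m, n))
    T = rng.choice(n, size=k, replace=False)
    sgn = rng.choice([-1.0, 1.0], size=k)          # sign pattern of x_T
    w = A[:, T] @ np.linalg.solve(A[:, T].T @ A[:, T], sgn)
    u_off = np.delete(A, T, axis=1).T @ w          # u_i = <a_i, w> for i in T^c
    failures += np.abs(u_off).max() >= 1

print(f"empirical failure frequency: {failures}/{trials}")
print("theoretical bound 3 n exp(-m/(4k)):", 3 * n * np.exp(-m / (4 * k)))
```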

A General RIPless Theory for CS

Approach: consider a sampling model that allows dependence among the entries within each measurement vector (row).

Let the signal be denoted as $x \in \mathbb{R}^n$. The $i$th measurement is given as
$$y_i = \langle a_i, x \rangle, \quad i = 1, \ldots, m,$$
where each sampling/measurement vector is drawn i.i.d. from a distribution $F$: $a_i \sim F$.

We will make a few assumptions on $F$ such that it provides the incoherent sampling we want.

Page 15

Incoherent sampling

We require $F$ to satisfy two key properties:

- Isometry property: for $a \sim F$, $\mathbb{E}[a a^T] = I$.
- Incoherence property: we let $\mu$ be the smallest number such that for $a = [a_1, \ldots, a_n]^T \sim F$, $\max_{1 \leq i \leq n} |a_i|^2 \leq \mu$.

Remarks:

- Both conditions may be relaxed a little; see [Candes and Plan, 2010]. In particular, we could allow the incoherence property to hold with high probability, to accommodate the case $a \sim \mathcal{N}(0, I)$.
- $\mu \geq 1$ since $\mathbb{E}|a_i|^2 = 1$. On the other hand, $\mu$ could be as large as $n$. To get good performance, we would like $\mu$ to be small.

Page 16

Examples of incoherent sampling

Denote the (scaled) DFT matrix $\Phi$ with entries $\phi_{l,i} = e^{j 2\pi l i / n}$. Let $a \sim F$ be obtained by sampling a row of $\Phi$ uniformly at random; then (writing $\phi_l$ for the $l$th row of $\Phi$)
$$\mathbb{E}[a a^H] = \sum_{l=1}^{n} \frac{1}{n} \phi_l^H \phi_l = I \quad \text{and} \quad \max_i |a_i|^2 = 1 =: \mu.$$

Page 17

Other examples of incoherent sampling

- Binary sensing: $\mathbb{P}(a_i = \pm 1) = \frac{1}{2}$ i.i.d.; then $\mathbb{E}[a a^T] = I$ and $\max_i |a_i|^2 = 1$.
- Gaussian sensing: $a_i \sim \mathcal{N}(0, 1)$ i.i.d.; then $\mathbb{E}[a a^T] = I$ and $\max_i |a_i|^2 \approx 2 \log n$ with high probability.
- Partial Fourier transform (useful in MRI): pick a frequency $\omega \sim \mathrm{Unif}[0, 1]$ and set $a_i = e^{j 2\pi \omega i}$; then $\mathbb{E}[a a^H] = I$ and $\max_i |a_i|^2 = 1$.

Page 18
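The following sketch (illustrative; the sampler names and sizes are ours) draws sampling vectors from the models of the last two slides and empirically checks the two properties of slide 16: the sample average of $a a^H$ is close to $I$, and a typical value of $\max_i |a_i|^2$ estimates the coherence $\mu$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, num_draws = 64, 20000

def dft_row():
    l = rng.integers(n)                          # random row of the scaled DFT matrix
    return np.exp(2j * np.pi * l * np.arange(n) / n)

def binary_row():
    return rng.choice([-1.0, 1.0], size=n)       # i.i.d. +/- 1 entries

def gaussian_row():
    return rng.normal(size=n)                    # i.i.d. N(0,1) entries

def fourier_row():
    omega = rng.uniform()                        # continuous random frequency
    return np.exp(2j * np.pi * omega * np.arange(n))

for name, sampler in [("DFT row", dft_row), ("binary", binary_row),
                      ("Gaussian", gaussian_row), ("Fourier", fourier_row)]:
    rows = np.array([sampler() for _ in range(num_draws)])
    cov = rows.conj().T @ rows / num_draws       # empirical E[a a^H]
    iso_err = np.linalg.norm(cov - np.eye(n), 2)
    mu_typ = np.median(np.abs(rows).max(axis=1) ** 2)   # typical max_i |a_i|^2 per draw
    print(f"{name:8s}: ||E[a a^H] - I|| ~ {iso_err:.3f},  typical mu ~ {mu_typ:.2f}")
```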

Performance Guarantees for BP

Theorem 2 (Noise-free, Basis Pursuit). Let $x \in \mathbb{R}^n$ be an arbitrary fixed vector that is $k$-sparse. Then $x$ is the unique solution to BP with high probability, as long as $m \geq C \mu k \log n$ for some constant $C$.

- The proof is based on slightly different methods: an inexact dual certificate is constructed using the golfing scheme. This technique was first developed by D. Gross for analyzing matrix completion. We will discuss this method in more detail later in the course.
- The result is near-optimal for the general class of incoherent sampling models. It is clear that the oversampling ratio depends on the coherence parameter $\mu$.
- When specialized to the Gaussian case, this result is suboptimal by a factor of $\log n$.

Page 19

Performance Guarantees for the General Case using LASSO

Consider noisy observations with Gaussian noise: $y = Ax + w$, where $w \sim \mathcal{N}(0, \sigma^2 I)$. Consider the LASSO algorithm:
$$\hat{x} = \arg\min_x \frac{1}{2} \|y - Ax\|_2^2 + \lambda \|x\|_1.$$

Theorem 3 [Candes and Plan, 2010]. Set $\lambda = 10 \sigma \sqrt{\log n}$. Then with high probability, we have
$$\|\hat{x} - x\|_2 \lesssim \frac{\|x - x_k\|_1}{\sqrt{k}} + \sigma \sqrt{\frac{k \log n}{m}},$$
provided $m \gtrsim \mu k \log n$. (Here $x_k$ denotes the best $k$-term approximation of $x$.)

Page 20
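A minimal sketch of the LASSO in this setting (ours, not the paper's code): proximal gradient descent (ISTA) on $\frac{1}{2}\|y - Ax\|_2^2 + \lambda \|x\|_1$, with soft-thresholding as the proximal step. Since the $A$ below has unnormalized i.i.d. $\mathcal{N}(0,1)$ entries rather than the paper's normalization, we pick $\lambda$ at the noise level of $\|A^T w\|_\infty \approx \sigma \sqrt{2 m \log n}$, which plays the role of the theorem's $\lambda \propto \sigma \sqrt{\log n}$ under this scaling.

```python
import numpy as np

def lasso_ista(A, y, lam, n_iter=1000):
    """Minimize 0.5 * ||y - A x||_2^2 + lam * ||x||_1 by proximal gradient (ISTA)."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2          # step size 1 / ||A||^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - step * (A.T @ (A @ x - y))          # gradient step on the quadratic part
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding
    return x

rng = np.random.default_rng(7)
n, k, sigma = 256, 5, 0.05
m = int(6 * k * np.log(n))

x = np.zeros(n)
T = rng.choice(n, size=k, replace=False)
x[T] = rng.normal(size=k)
A = rng.normal(size=(m, n))
y = A @ x + sigma * rng.normal(size=m)

lam = sigma * np.sqrt(2 * m * np.log(n))            # ~ ||A^T w||_inf for this unnormalized A
x_hat = lasso_ista(A, y, lam)
print("relative estimation error:", np.linalg.norm(x_hat - x) / np.linalg.norm(x))
```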