Subspace embeddings

Daniel Hsu
COMS 4772


Supremum of simple stochastic processes

Recap: JL lemma

JL lemma. For any $\varepsilon \in (0, 1/2)$, point set $S \subset \mathbb{R}^d$ of cardinality $|S| = n$, and $k \in \mathbb{N}$ such that $k \geq \frac{16 \ln n}{\varepsilon^2}$, there exists a linear map $f \colon \mathbb{R}^d \to \mathbb{R}^k$ such that
  $(1 - \varepsilon) \|x - y\|_2^2 \leq \|f(x) - f(y)\|_2^2 \leq (1 + \varepsilon) \|x - y\|_2^2$ for all $x, y \in S$.

Main probabilistic lemma. There is a random linear map $M \colon \mathbb{R}^d \to \mathbb{R}^k$ such that, for any $u \in S^{d-1}$,
  $\mathbb{P}\big( \big| \|Mu\|_2^2 - 1 \big| > \varepsilon \big) \leq 2 \exp\big( -\Omega(k \varepsilon^2) \big).$

The JL lemma is a consequence of the main probabilistic lemma applied to the collection $T \subset S^{d-1}$ of $|T| = \binom{n}{2}$ unit vectors (the normalized pairwise differences of points in $S$), plus a union bound:
  $\mathbb{P}\Big( \max_{u \in T} \big| \|Mu\|_2^2 - 1 \big| > \varepsilon \Big) \leq |T| \cdot 2 \exp\big( -\Omega(k \varepsilon^2) \big).$


Related question

For $T \subseteq S^{d-1}$, what is the expected maximum deviation $\mathbb{E} \max_{u \in T} \big| \|Mu\|_2^2 - 1 \big|$?

General questions. For an arbitrary collection of zero-mean random variables $\{X_t : t \in T\}$:
  $\mathbb{E} \max_{t \in T} X_t$?   $\mathbb{E} \max_{t \in T} |X_t|$?
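The following quick simulation (my own, not part of the notes; the dimensions, $\varepsilon$, and the constant in the exponential are illustrative assumptions) checks the main probabilistic lemma for the standard Gaussian choice of $M$ with iid $N(0, 1/k)$ entries:

```python
# Sketch of an empirical check: estimate P(|‖Mu‖² − 1| > ε) for a fixed unit
# vector u and a Gaussian M with iid N(0, 1/k) entries, and compare with a
# bound of the form 2·exp(−k ε²/8) (the constant 1/8 is illustrative).
import numpy as np

rng = np.random.default_rng(0)
d, k, eps, trials = 300, 100, 0.3, 1000

u = rng.standard_normal(d)
u /= np.linalg.norm(u)                               # fixed u in S^{d-1}

deviations = np.empty(trials)
for i in range(trials):
    M = rng.standard_normal((k, d)) / np.sqrt(k)     # entries iid N(0, 1/k)
    deviations[i] = abs(np.linalg.norm(M @ u) ** 2 - 1)

print("empirical P(|‖Mu‖² − 1| > ε):", (deviations > eps).mean())
print("2·exp(−k ε² / 8)            :", 2 * np.exp(-k * eps**2 / 8))
```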

Finite collections

Let $\{X_t : t \in T\}$ be a finite collection of $v$-subgaussian, mean-zero random variables. Then
  $\mathbb{E} \max_{t \in T} X_t \leq \sqrt{2 v \ln |T|}.$

Doesn't assume independence of $\{X_t : t \in T\}$. (The independent case is the worst.)

Get a bound on $\mathbb{E} \max_{t \in T} |X_t|$ as a corollary: apply the result to the collection $\{X_t : t \in T\} \cup \{-X_t : t \in T\}$.


Proof

Starting point is an identity from two invertible operations ($\lambda > 0$):
  $\mathbb{E} \max_{t \in T} X_t = \frac{1}{\lambda} \ln \exp\Big( \mathbb{E} \max_{t \in T} \lambda X_t \Big).$

Apply Jensen's inequality:
  $\leq \frac{1}{\lambda} \ln \mathbb{E} \exp\Big( \max_{t \in T} \lambda X_t \Big) = \frac{1}{\lambda} \ln \mathbb{E} \max_{t \in T} \exp(\lambda X_t).$

Bound the max by a sum, and use linearity of expectation:
  $\leq \frac{1}{\lambda} \ln \sum_{t \in T} \mathbb{E} \exp(\lambda X_t).$

Exploit the $v$-subgaussian property:
  $\leq \frac{1}{\lambda} \ln \sum_{t \in T} \exp(v \lambda^2 / 2) = \frac{\ln |T|}{\lambda} + \frac{v \lambda}{2}.$

Choose $\lambda = \sqrt{2 \ln |T| / v}$ to conclude.
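A quick Monte Carlo check (my own; the sizes are arbitrary) of the bound for iid standard Gaussians, which are $1$-subgaussian. Independence is not required by the result; it is just the easiest case to simulate:

```python
# Compare E max_t X_t (Monte Carlo, |T| iid N(0,1) variables) with sqrt(2 ln|T|).
import numpy as np

rng = np.random.default_rng(0)
trials = 2000

for n in (10, 100, 1000, 10000):
    X = rng.standard_normal((trials, n))      # rows: independent copies of (X_t)_{t in T}
    emp = X.max(axis=1).mean()                # Monte Carlo estimate of E max_t X_t
    bound = np.sqrt(2 * np.log(n))            # sqrt(2 v ln|T|) with v = 1
    print(f"|T| = {n:6d}   E max ≈ {emp:.3f}   bound = {bound:.3f}")
```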

Alternative proof

Integrate the tail bound: for any non-negative random variable $Y$,
  $\mathbb{E}(Y) = \int_0^\infty \mathbb{P}(Y \geq y) \, dy.$

For $Y := \max_{t \in T} X_t$, this gives the same result up to constants.


Infinite collections

For an infinite collection of zero-mean random variables $\{X_t : t \in T\}$:
  $\mathbb{E} \sup_{t \in T} X_t$?

In general, this can be infinite. To bound it, one must exploit correlations among the $X_t$.

E.g., in $\big\{ \big| \|Mu\|_2^2 - 1 \big| : u \in T \big\}$ for $T \subseteq S^{d-1}$, the random variables for $u$ and $u + \delta$, for small $\delta$, are highly correlated.
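As a sanity check of the tail-integration identity (my own example, not from the notes), take $Y := \max_t |X_t|$ for three iid standard Gaussians and compare the direct Monte Carlo estimate of $\mathbb{E}(Y)$ with a numerical integral of the empirical tail:

```python
# Numerical check of E(Y) = ∫_0^∞ P(Y ≥ y) dy for Y = max_t |X_t|, t = 1, 2, 3.
import numpy as np

rng = np.random.default_rng(0)
Y = np.abs(rng.standard_normal((100_000, 3))).max(axis=1)   # samples of Y

ys = np.linspace(0, 8, 801)                                 # integration grid
tail = np.array([(Y >= y).mean() for y in ys])              # empirical P(Y ≥ y)

print("direct E(Y)   ≈", Y.mean())
print("∫ P(Y ≥ y) dy ≈", np.trapz(tail, ys))
```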

Convex hulls of linear functionals

Let $T \subset \mathbb{R}^d$ be a finite set of vectors, and let $X$ be a random vector in $\mathbb{R}^d$ such that $\langle w, X \rangle$ is $v$-subgaussian for every $w \in T$. Then
  $\mathbb{E} \max_{w \in \operatorname{conv}(T)} \langle w, X \rangle \leq \sqrt{2 v \ln |T|}.$

Proof:
Write $w \in \operatorname{conv}(T)$ as $w = \sum_{w' \in T} p_{w'} w'$ for some $p_{w'} \geq 0$ that sum to one.
Observe that $\langle w, x \rangle = \sum_{w' \in T} p_{w'} \langle w', x \rangle \leq \max_{w' \in T} \langle w', x \rangle$.
So the max over $w \in \operatorname{conv}(T)$ is at most the max over $w \in T$.
Conclude by applying the previous result for finite collections.


Euclidean norm

Let $X$ be a random vector such that $\langle u, X \rangle$ is $v$-subgaussian for every $u \in S^{d-1}$. Then
  $\mathbb{E} \|X\|_2 = \mathbb{E} \max_{u \in S^{d-1}} \langle u, X \rangle \leq 2 \sqrt{2 v \ln 5^d} = O\big( \sqrt{v d} \big).$

Key step of proof: For any $\varepsilon > 0$, there is a finite subset $N \subset S^{d-1}$ of cardinality $|N| \leq (1 + 2/\varepsilon)^d$ such that, for every $u \in S^{d-1}$, there exists $u_0 \in N$ with $\|u - u_0\|_2 \leq \varepsilon$.

Such a set $N$ is called an $\varepsilon$-net for $S^{d-1}$. We need a $1/2$-net, of cardinality at most $5^d$.

Proof

Write $u \in S^{d-1}$ as $u = u_0 + \delta q$, where $u_0 \in N$, $q \in S^{d-1}$, $\delta \in [0, 1/2]$, so
  $\langle u, X \rangle = \langle u_0, X \rangle + \delta \langle q, X \rangle.$

Observe that
  $\max_{u \in S^{d-1}} \langle u, X \rangle \leq \max_{u_0 \in N} \langle u_0, X \rangle + \max_{\delta \in [0,1/2]} \max_{q \in S^{d-1}} \delta \langle q, X \rangle \leq \max_{u_0 \in N} \langle u_0, X \rangle + \tfrac{1}{2} \max_{q \in S^{d-1}} \langle q, X \rangle.$

So the max over $S^{d-1}$ is at most twice the max over $N$.
Conclude by applying the previous result for finite collections.


ε-nets for unit sphere

There is an $\varepsilon$-net for $S^{d-1}$ of cardinality at most $(1 + 2/\varepsilon)^d$.

Proof:
Repeatedly select points from $S^{d-1}$ so that each selected point has distance more than $\varepsilon$ from all previously selected points.
Equivalent: repeatedly select points from $S^{d-1}$ as long as the balls of radius $\varepsilon/2$, centered at the selected points, are disjoint. (The process must eventually stop.)
When the process stops, every $u \in S^{d-1}$ is at distance at most $\varepsilon$ from the selected points, i.e., the selected points form an $\varepsilon$-net for $S^{d-1}$.
If we select $N$ points, then the $N$ balls of radius $\varepsilon/2$ are disjoint, and they are contained in a ball of radius $1 + \varepsilon/2$. So
  $N \cdot \operatorname{vol}\big( (\varepsilon/2) B^d \big) \leq \operatorname{vol}\big( (1 + \varepsilon/2) B^d \big).$
This implies $N \leq (1 + 2/\varepsilon)^d$.
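The proof's greedy selection is easy to simulate (a sketch of my own; it draws random candidate points from the sphere rather than enumerating it, and the dimension and $\varepsilon$ are illustrative), and the resulting net size can be compared with the volume bound $(1 + 2/\varepsilon)^d$:

```python
# Greedy construction from the proof: keep a point only if it is farther than ε
# from every previously selected point. Random candidates stand in for "points of S^{d-1}".
import numpy as np

rng = np.random.default_rng(0)
d, eps, candidates = 3, 0.5, 50_000

net = []
for _ in range(candidates):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                                   # candidate point on S^{d-1}
    if not net or np.linalg.norm(np.array(net) - u, axis=1).min() > eps:
        net.append(u)

print("greedy net size:", len(net))
print("volume bound (1 + 2/ε)^d =", (1 + 2 / eps) ** d)
```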

Remarks

All previous results also hold when the random variables are $(v, c)$-subexponential (possibly with $c > 0$), with a slightly different bound: e.g.,
  $\mathbb{E} \max_{t \in T} X_t \leq \max\big\{ \sqrt{2 v \ln |T|},\ 2 c \ln |T| \big\}.$

Also easy to get probability tail bounds (rather than expectation bounds).


Subspace embeddings

Subspace JL lemma

Consider a $k \times d$ random matrix $M$ whose entries are iid $N(0, 1/k)$. For a subspace $W \subseteq \mathbb{R}^d$ of dimension $r$,
  $\mathbb{E} \max_{u \in S^{d-1} \cap W} \big| \|Mu\|_2^2 - 1 \big| \leq O\Big( \sqrt{\tfrac{r}{k}} + \tfrac{r}{k} \Big).$

Bound is at most $\varepsilon$ when $k \geq O\big( \tfrac{r}{\varepsilon^2} \big)$.

Implies existence of a mapping $M \colon \mathbb{R}^d \to \mathbb{R}^k$ that approximately preserves all distances between points in $W$.


Proof of subspace JL lemma

Let the columns of $Q$ be an orthonormal basis for $W$. Then
  $\max_{u \in S^{d-1} \cap W} \big| \|Mu\|_2^2 - 1 \big| = \max_{u \in S^{r-1}} \big| u^\top Q^\top (M^\top M - I) Q u \big| = \max_{u, v \in S^{r-1}} \big| u^\top Q^\top (M^\top M - I) Q v \big|.$

Lemma. For any $u, v \in S^{r-1}$,
  $X_{u,v} := u^\top Q^\top (M^\top M - I) Q v$
is $(O(1/k), O(1/k))$-subexponential.

Proof of subspace JL lemma (continued)

For $u, v \in S^{r-1}$, $X_{u,v} := u^\top Q^\top (M^\top M - I) Q v$.

Let $N$ be a $1/4$-net for $S^{r-1}$. Write $u, v \in S^{r-1}$ as $u = u_0 + \varepsilon p$, $v = v_0 + \delta q$, where $u_0, v_0 \in N$, $p, q \in S^{r-1}$ and $\varepsilon, \delta \in [0, 1/4]$, so
  $X_{u,v} = X_{u_0, v_0} + \varepsilon X_{p,v} + \delta X_{u_0, q}.$

Therefore
  $\max_{u,v \in S^{r-1}} X_{u,v} \leq \max_{u_0, v_0 \in N} X_{u_0, v_0} + \tfrac{1}{2} \max_{p,q \in S^{r-1}} X_{p,q},$
which implies
  $\max_{u,v \in S^{r-1}} X_{u,v} \leq 2 \max_{u_0, v_0 \in N} X_{u_0, v_0}.$

Conclude by applying the previous result for finite collections.
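A small numerical illustration of the subspace JL lemma (my own; the random subspace, sizes, and the use of singular values are not from the notes): for $u \in S^{d-1} \cap W$ we can write $u = Qw$ with $w \in S^{r-1}$, so the maximum of $\big| \|Mu\|_2^2 - 1 \big|$ over the subspace equals $\max\big( |\sigma_{\max}^2 - 1|, |\sigma_{\min}^2 - 1| \big)$ for the singular values $\sigma$ of $MQ$:

```python
# For an ONB Q of a random r-dimensional subspace W, the maximum distortion of a
# Gaussian sketch M over S^{d-1} ∩ W is read off the extreme singular values of MQ.
import numpy as np

rng = np.random.default_rng(0)
d, r = 500, 10

Q, _ = np.linalg.qr(rng.standard_normal((d, r)))             # ONB of a random subspace
for k in (50, 200, 800):
    M = rng.standard_normal((k, d)) / np.sqrt(k)             # entries iid N(0, 1/k)
    s = np.linalg.svd(M @ Q, compute_uv=False)
    distortion = max(abs(s.max() ** 2 - 1), abs(s.min() ** 2 - 1))
    print(f"k = {k:3d}   max distortion ≈ {distortion:.3f}   sqrt(r/k) ≈ {np.sqrt(r / k):.3f}")
```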

Application to least squares


Big data least squares

Input: matrix $A \in \mathbb{R}^{n \times d}$, vector $b \in \mathbb{R}^n$ ($n \gg d$).
Goal: find $x \in \mathbb{R}^d$ so as to (approximately) minimize $\|Ax - b\|_2^2$.

Computation time: $O(n d^2)$. Can we speed this up?


Simple approach

Pick $m \ll n$. Let $M$ be a random $m \times n$ matrix (e.g., entries iid $N(0, 1/m)$, or a Fast JL Transform).

Let $\tilde{A} := MA$ and $\tilde{b} := Mb$.

Obtain the solution $\hat{x}$ to the least squares problem on $(\tilde{A}, \tilde{b})$.
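A minimal sketch-and-solve example (my own; the sizes are assumptions). It uses a dense Gaussian sketch rather than a Fast JL Transform, so it does not realize the speedup discussed below; it only illustrates that the sketched solution $\hat{x}$ nearly minimizes the original objective:

```python
# Sketch-and-solve least squares with a dense Gaussian sketch M (entries iid N(0, 1/m)).
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10_000, 50, 500

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)     # planted signal + noise

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)              # exact solution, O(nd^2)

M = rng.standard_normal((m, n)) / np.sqrt(m)                # sketch matrix
x_hat, *_ = np.linalg.lstsq(M @ A, M @ b, rcond=None)       # solve on (Ã, b̃) = (MA, Mb)

cost = lambda x: np.linalg.norm(A @ x - b) ** 2
print("approximation ratio:", cost(x_hat) / cost(x_star))
```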

Simple (somewhat loose) analysis

Let $W$ be the subspace spanned by the columns of $A$ and $b$. Its dimension is at most $d + 1$.

If $m \geq O(d/\varepsilon^2)$, then $M$ is a subspace embedding for $W$:
  $(1 - \varepsilon) \|x\|_2^2 \leq \|Mx\|_2^2 \leq (1 + \varepsilon) \|x\|_2^2$ for all $x \in W$.

Let $x^\star := \arg\min_{x \in \mathbb{R}^d} \|Ax - b\|_2^2$. Then
  $\|A\hat{x} - b\|_2^2 \leq \tfrac{1}{1 - \varepsilon} \|M(A\hat{x} - b)\|_2^2 \leq \tfrac{1}{1 - \varepsilon} \|M(Ax^\star - b)\|_2^2 \leq \tfrac{1 + \varepsilon}{1 - \varepsilon} \|Ax^\star - b\|_2^2,$
where the middle step uses that $\hat{x}$ minimizes the sketched objective.

Running time (using FJLT): $O\big( (m + n) d \log n + m d^2 \big)$.


Another perspective: random sampling

Pick a random sample of $m \ll n$ rows of $(A, b)$; obtain the solution $\hat{x}$ for the least squares problem on the sample. Hope $\hat{x}$ is also good for the original problem.

In statistics, this is the random design setting for regression.

Random sample of covariates $\tilde{A} \in \mathbb{R}^{m \times d}$ and responses $\tilde{b} \in \mathbb{R}^m$ from the full population $(A, b)$.

The least squares solution $\hat{x}$ on $(\tilde{A}, \tilde{b})$ is the MLE for the linear regression coefficients under a linear model with Gaussian noise.

Can also regard $\hat{x}$ as the empirical risk minimizer among all linear predictors under squared loss.
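A companion illustration of the random-sampling perspective (my own; a Gaussian design is assumed, which has roughly uniform leverage scores, so uniform row sampling works well here):

```python
# Solve least squares on m uniformly sampled rows of (A, b) and compare with the full solution.
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 50_000, 20, 1_000

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

idx = rng.choice(n, size=m, replace=False)                  # uniform row sample
x_hat, *_ = np.linalg.lstsq(A[idx], b[idx], rcond=None)

cost = lambda x: np.linalg.norm(A @ x - b) ** 2
print("approximation ratio:", cost(x_hat) / cost(x_star))
```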

Simple random design analysis

Let $x^\star := \arg\min_{x \in \mathbb{R}^d} \|Ax - b\|_2^2$. With high probability over the choice of the random sample,
  $\|A\hat{x} - b\|_2^2 \leq \Big( 1 + O\big( \tfrac{\kappa}{m} \big) \Big) \|Ax^\star - b\|_2^2$
(up to lower-order terms), where
  $\kappa := n \max_{i \in [n]} \big\| (A^\top A)^{-1/2} A^\top e_i \big\|_2^2$
and $e_i$ is the $i$-th coordinate basis vector.

Write the thin SVD of $A$ as $A = U S V^\top$, where $U \in \mathbb{R}^{n \times d}$. Then
  $(A^\top A)^{-1/2} A^\top = (V S^2 V^\top)^{-1/2} V S U^\top = V U^\top.$
So $\kappa = n \max_{i \in [n]} \|U^\top e_i\|_2^2$.

$\|U^\top e_i\|_2^2$ is the statistical leverage score for the $i$-th row of $A$: it measures how much influence the $i$-th row has on the least squares solution.


Statistical leverage

$i$-th statistical leverage score: $\ell_i := \|U^\top e_i\|_2^2$, where $U \in \mathbb{R}^{n \times d}$ is the matrix of left singular vectors of $A$.

Two extreme cases:
  $U = \begin{bmatrix} I_{d \times d} \\ 0_{(n-d) \times d} \end{bmatrix}$   vs.   $U = \frac{1}{\sqrt{n}} \begin{bmatrix} H_n e_1 & H_n e_2 & \cdots & H_n e_d \end{bmatrix},$
where $H_n$ is the $n \times n$ Hadamard matrix.

First case: the first $d$ rows are the only rows that matter; $n \max_{i \in [n]} \ell_i = n$.
Second case: all $n$ rows are equally important; $n \max_{i \in [n]} \ell_i = d$.
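A minimal sketch (my own) that computes the leverage scores $\ell_i$ from the thin SVD and evaluates $n \max_i \ell_i$ for the two extreme cases above; scipy.linalg.hadamard supplies $H_n$ (for $n$ a power of 2):

```python
# Leverage scores ℓ_i = ‖Uᵀe_i‖² via the thin SVD, and κ = n·max_i ℓ_i for two extreme designs.
import numpy as np
from scipy.linalg import hadamard

def leverage_scores(A):
    U, _, _ = np.linalg.svd(A, full_matrices=False)   # thin SVD: A = U S Vᵀ
    return np.sum(U ** 2, axis=1)                     # ℓ_i = ‖row i of U‖²

n, d = 64, 8

# Case 1: column space spanned by the first d coordinate vectors.
A1 = np.vstack([np.eye(d), np.zeros((n - d, d))])
# Case 2: columns are the first d (normalized) columns of the n×n Hadamard matrix.
A2 = hadamard(n)[:, :d] / np.sqrt(n)

for name, A in [("identity-like", A1), ("Hadamard", A2)]:
    l = leverage_scores(A)
    print(f"{name:14s}  n·max ℓ_i = {n * l.max():.1f}   (n = {n}, d = {d})")
```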

Ensuring small statistical leverage

To ensure the situation is more like the second case, apply a random rotation (e.g., a randomized Hadamard transform) to $A$ and $b$.

This randomly mixes up the rows of $(A, b)$ so that no single row is (much) more important than another. Get $n \max_{i \in [n]} \ell_i = O(d + \log n)$ with high probability.

To get a $1 + \varepsilon$ approximation ratio, i.e., $\|A\hat{x} - b\|_2^2 \leq (1 + \varepsilon) \|Ax^\star - b\|_2^2$, it suffices to have
  $m \geq O\Big( \frac{d + \log n}{\varepsilon} \Big).$
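A sketch (my own; the notes do not specify an implementation) of the randomized Hadamard transform $x \mapsto \frac{1}{\sqrt{n}} H_n D x$ with random signs $D$, applied to the worst-case matrix from the previous slide; a hand-rolled fast Walsh-Hadamard recursion stands in for a library routine:

```python
# Randomized Hadamard mixing: A ↦ (1/√n)·H_n·diag(signs)·A, with n a power of 2.
import numpy as np

def fwht(X):
    """Unnormalized Walsh-Hadamard transform H_n @ X, applied along axis 0."""
    X = X.copy()
    n = X.shape[0]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = X[i:i + h].copy(), X[i + h:i + 2 * h].copy()
            X[i:i + h], X[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return X

def leverage_scores(A):
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return np.sum(U ** 2, axis=1)

rng = np.random.default_rng(0)
n, d = 1024, 10

A = np.vstack([np.eye(d), np.zeros((n - d, d))])             # worst case: n·max ℓ_i = n
signs = rng.choice([-1.0, 1.0], size=n)
A_mixed = fwht(signs[:, None] * A) / np.sqrt(n)              # rows mixed by (1/√n)·H_n·D

print("n·max ℓ_i before mixing:", n * leverage_scores(A).max())
print("n·max ℓ_i after mixing: ", n * leverage_scores(A_mixed).max())
```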

Application to compressed sensing


Under-determined least squares

Input: matrix $A \in \mathbb{R}^{n \times d}$, vector $b \in \mathbb{R}^n$ ($n \ll d$).
Goal: find the sparsest $x \in \mathbb{R}^d$ so as to minimize $\|Ax - b\|_2^2$. NP-hard in general.

Suppose $b = A\bar{x}$ for some $\bar{x} \in \mathbb{R}^d$ with $\mathrm{nnz}(\bar{x}) \leq k$, i.e., $\bar{x}$ is $k$-sparse.
Is $\bar{x}$ the (unique) sparsest solution? If so, how to find it?


Null space property

Lemma. The null space of $A$ contains no non-zero $2k$-sparse vectors $\iff$ every $k$-sparse vector $\bar{x} \in \mathbb{R}^d$ is the unique $k$-sparse solution to $Ax = A\bar{x}$.

Proof.
($\Rightarrow$) Take any $k$-sparse vectors $x$ and $y$ with $Ax = Ay$. Want to show $x = y$. Then $x - y$ is $2k$-sparse, and $A(x - y) = 0$. By assumption, the null space of $A$ contains no non-zero $2k$-sparse vectors. So $x - y = 0$, i.e., $x = y$.
($\Leftarrow$) Take any $2k$-sparse vector $z$ in the null space of $A$. Want to show $z = 0$. Write it as $z = x - y$ for some $k$-sparse vectors $x$ and $y$ with disjoint supports. Then $A(x - y) = 0$, i.e., $Ax = Ay$, and hence $x = y$ by assumption. But $x$ and $y$ have disjoint supports, so it must be that $x = y = 0$, so $z = 0$.
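For tiny sizes, the null space condition can be checked by brute force (my own sketch; the sizes are illustrative): the null space of $A$ contains no non-zero $2k$-sparse vector iff every $n \times 2k$ column submatrix of $A$ has full column rank, i.e., smallest singular value bounded away from zero:

```python
# Brute-force check of the null space property for a small Gaussian matrix.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 12, 20, 2

A = rng.standard_normal((n, d))
min_sv = min(
    np.linalg.svd(A[:, list(I)], compute_uv=False)[-1]      # smallest singular value
    for I in itertools.combinations(range(d), 2 * k)         # all supports of size 2k
)
print("smallest singular value over all 2k-column submatrices:", min_sv)
print("null space property holds (numerically):", min_sv > 1e-8)
```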

Null space property from subspace embeddings

If $A$ is an $n \times d$ random matrix with iid $N(0, 1)$ entries, then under what conditions is there no non-zero $2k$-sparse vector in its null space?

Want: for any non-zero $2k$-sparse vector $z$, $Az \neq 0$, i.e., $\|Az\|_2^2 > 0$.

Consider a particular choice $I \subseteq [d]$ of $|I| = 2k$ coordinates, and the corresponding subspace $W_I$ spanned by $\{e_i : i \in I\}$. Every $2k$-sparse $z$ is in $W_I$ for some $I$.

Sufficient for $A$ to be a $1/2$-subspace embedding for $W_I$ for all $I$:
  $\tfrac{1}{2} \|z\|_2^2 \leq \|Az\|_2^2 \leq \tfrac{3}{2} \|z\|_2^2$ for all $2k$-sparse $z$.


Null space property from subspace embeddings (continued)

Say $A$ fails for $I$ if it is not a $1/2$-subspace embedding for $W_I$.

Subspace JL lemma: $\mathbb{P}(A \text{ fails for } I) \leq 2^{O(k)} \exp\big( -\Omega(n) \big)$.

Union bound over all choices of $I$ with $|I| = 2k$:
  $\mathbb{P}(A \text{ fails for some } I) \leq \binom{d}{2k} \cdot 2^{O(k)} \exp\big( -\Omega(n) \big).$

To ensure this is, say, at most $1/2$, we just need
  $n \geq O\Big( k + \log \binom{d}{2k} \Big) = O\big( k + k \log(d/k) \big).$

Restricted isometry property

$(\ell, \delta)$-restricted isometry property (RIP):
  $(1 - \delta) \|z\|_2^2 \leq \|Az\|_2^2 \leq (1 + \delta) \|z\|_2^2$ for all $\ell$-sparse $z$.

Many algorithms can recover the unique sparsest solution under RIP (with $\ell = O(k)$ and $\delta = \Omega(1)$, i.e., a constant isometry parameter suffices).
E.g., basis pursuit, Lasso, orthogonal matching pursuit.
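As an end-to-end illustration (my own; the sizes, the LP formulation, and the use of scipy.optimize.linprog are assumptions, not from the notes), basis pursuit (minimize $\|x\|_1$ subject to $Ax = b$) can be written as a linear program and recovers a planted $k$-sparse vector from Gaussian measurements:

```python
# Basis pursuit as an LP: variables z = [x; t], minimize sum(t) subject to
# -t <= x <= t and Ax = b, so that sum(t) = ||x||_1 at the optimum.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d, k, m = 100, 5, 40                                         # m roughly O(k log(d/k))

x_bar = np.zeros(d)                                          # ground-truth k-sparse signal
support = rng.choice(d, size=k, replace=False)
x_bar[support] = rng.standard_normal(k)
A = rng.standard_normal((m, d))
b = A @ x_bar

c = np.concatenate([np.zeros(d), np.ones(d)])                # objective: sum of t
A_eq = np.hstack([A, np.zeros((m, d))])                      # Ax = b
I = np.eye(d)
A_ub = np.vstack([np.hstack([I, -I]),                        #  x - t <= 0
                  np.hstack([-I, -I])])                      # -x - t <= 0
b_ub = np.zeros(2 * d)
bounds = [(None, None)] * d + [(0, None)] * d                # x free, t >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b,
              bounds=bounds, method="highs")
x_hat = res.x[:d]
print("recovery error:", np.linalg.norm(x_hat - x_bar))
```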