Subspace embeddings

Daniel Hsu
COMS 4772


Supremum of simple stochastic processes
Recap: JL lemma

JL lemma. For any $\varepsilon \in (0, 1/2)$, point set $S \subseteq \mathbb{R}^d$ of cardinality $|S| = n$, and $k \in \mathbb{N}$ such that $k \geq \frac{16 \ln n}{\varepsilon^2}$, there exists a linear map $f \colon \mathbb{R}^d \to \mathbb{R}^k$ such that
$$(1-\varepsilon) \|x-y\|_2^2 \leq \|f(x)-f(y)\|_2^2 \leq (1+\varepsilon) \|x-y\|_2^2 \quad \text{for all } x, y \in S.$$

Main probabilistic lemma: there is a random linear map $M \colon \mathbb{R}^d \to \mathbb{R}^k$ such that, for any $u \in S^{d-1}$,
$$\mathbb{P}\left( \left| \|Mu\|_2^2 - 1 \right| > \varepsilon \right) \leq 2 \exp\left( -\Omega(k\varepsilon^2) \right).$$

The JL lemma is a consequence of the main probabilistic lemma applied to a collection $T \subseteq S^{d-1}$ of $|T| = \binom{n}{2}$ unit vectors (plus a union bound):
$$\mathbb{P}\left( \max_{u \in T} \left| \|Mu\|_2^2 - 1 \right| > \varepsilon \right) \leq |T| \cdot 2 \exp\left( -\Omega(k\varepsilon^2) \right).$$


Related question

For $T \subseteq S^{d-1}$, what is the expected maximum deviation $\mathbb{E} \max_{u \in T} \left| \|Mu\|_2^2 - 1 \right|$?

General questions: for an arbitrary collection of zero-mean random variables $\{X_t : t \in T\}$, what are $\mathbb{E} \max_{t \in T} X_t$ and $\mathbb{E} \max_{t \in T} |X_t|$?
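As a quick numerical illustration of the recapped lemma (not from the lecture; the dimensions and seed below are arbitrary choices), one can draw a Gaussian map with entries iid $N(0, 1/k)$ and check the distortion of all pairwise squared distances:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, eps = 50, 1000, 0.25
k = int(np.ceil(16 * np.log(n) / eps**2))   # k >= 16 ln(n) / eps^2

S = rng.normal(size=(n, d))                           # arbitrary point set of size n
M = rng.normal(scale=1.0 / np.sqrt(k), size=(k, d))   # entries iid N(0, 1/k)

# Check distortion of all pairwise squared distances under f(x) = Mx.
worst = 0.0
for i in range(n):
    for j in range(i + 1, n):
        orig = np.sum((S[i] - S[j]) ** 2)
        emb = np.sum((M @ (S[i] - S[j])) ** 2)
        worst = max(worst, abs(emb / orig - 1))

print(f"k = {k}, worst relative distortion = {worst:.3f} (target: {eps})")
```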
Finite collections

Let $\{X_t : t \in T\}$ be a finite collection of $v$-subgaussian, mean-zero random variables. Then
$$\mathbb{E} \max_{t \in T} X_t \leq \sqrt{2 v \ln |T|}.$$

This doesn't assume independence of $\{X_t : t \in T\}$. (The independent case is the worst.)

We get a bound on $\mathbb{E} \max_{t \in T} |X_t|$ as a corollary: apply the result to the collection $\{X_t : t \in T\} \cup \{-X_t : t \in T\}$.


Proof

The starting point is an identity obtained from two mutually inverse operations ($\lambda > 0$):
$$\mathbb{E} \max_{t \in T} X_t = \frac{1}{\lambda} \ln \exp\left( \mathbb{E} \max_{t \in T} \lambda X_t \right).$$
Apply Jensen's inequality:
$$\leq \frac{1}{\lambda} \ln \mathbb{E} \exp\left( \max_{t \in T} \lambda X_t \right) = \frac{1}{\lambda} \ln \mathbb{E} \max_{t \in T} \exp(\lambda X_t).$$
Bound the max by a sum, and use linearity of expectation:
$$\leq \frac{1}{\lambda} \ln \sum_{t \in T} \mathbb{E} \exp(\lambda X_t).$$
Exploit the $v$-subgaussian property:
$$\leq \frac{1}{\lambda} \ln \sum_{t \in T} \exp\left( v\lambda^2/2 \right) = \frac{\ln |T|}{\lambda} + \frac{v\lambda}{2}.$$
Choose $\lambda = \sqrt{2 \ln |T| / v}$ to conclude.
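A quick sanity check of the maximal inequality (not from the lecture): draw $|T|$ iid $N(0, v)$ variables, since the independent case is stated to be the worst, and compare the empirical expected maximum to the bound.

```python
import numpy as np

rng = np.random.default_rng(0)

v, T, trials = 1.0, 1000, 2000

# Empirical E max over |T| iid N(0, v) variables (the worst case for the bound).
maxes = rng.normal(scale=np.sqrt(v), size=(trials, T)).max(axis=1)
print(f"empirical E max  = {maxes.mean():.3f}")
print(f"sqrt(2 v ln|T|)  = {np.sqrt(2 * v * np.log(T)):.3f}")
```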
Alternative proof

Integrate the tail bound: for any non-negative random variable $Y$,
$$\mathbb{E}(Y) = \int_0^\infty \mathbb{P}(Y \geq y) \, dy.$$
Applied to $Y := \max_{t \in T} X_t$ (after handling the negative part), this gives the same result up to constants.


Infinite collections

For an infinite collection of zero-mean random variables $\{X_t : t \in T\}$:
$$\mathbb{E} \sup_{t \in T} X_t \; ?$$
In general, this can be $\infty$. To bound it, we must exploit correlations among the $X_t$.

E.g., in $\{ \|Mu\|_2^2 - 1 : u \in T \}$ for $T \subseteq S^{d-1}$, the random variables for $u$ and $u + \delta$, for small $\delta$, are highly correlated.
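One way to carry out the tail integration sketched above (an editorial fill-in, using the $v$-subgaussian tail $\mathbb{P}(X_t > y) \leq e^{-y^2/(2v)}$ and a cutoff $a > 0$):

```latex
\mathbb{E}\max_{t\in T} X_t
  \le a + \int_a^\infty \mathbb{P}\Big(\max_{t\in T} X_t > y\Big)\,dy
  \le a + |T| \int_a^\infty e^{-y^2/(2v)}\,dy
  \le a + |T| \cdot \frac{v}{a}\, e^{-a^2/(2v)}.
% Taking a = sqrt(2 v ln|T|) makes |T| e^{-a^2/(2v)} = 1, so the last term is
% v/a = sqrt(v / (2 ln|T|)), recovering sqrt(2 v ln|T|) up to lower-order terms.
```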
Convex hulls of linear functionals

Let $T \subseteq \mathbb{R}^d$ be a finite set of vectors, and let $X$ be a random vector in $\mathbb{R}^d$ such that $\langle w, X \rangle$ is $v$-subgaussian for every $w \in T$. Then
$$\mathbb{E} \max_{w \in \mathrm{conv}(T)} \langle w, X \rangle \leq \sqrt{2 v \ln |T|}.$$

Proof: Write $w \in \mathrm{conv}(T)$ as $w = \sum_{w' \in T} p_{w'} w'$ for some $p_{w'} \geq 0$ that sum to one. Observe that
$$\langle w, x \rangle = \sum_{w' \in T} p_{w'} \langle w', x \rangle \leq \max_{w' \in T} \langle w', x \rangle.$$
So the max over $w \in \mathrm{conv}(T)$ is at most the max over $w \in T$. Conclude by applying the previous result for finite collections.


Euclidean norm

Let $X$ be a random vector such that $\langle u, X \rangle$ is $v$-subgaussian for every $u \in S^{d-1}$. Then
$$\mathbb{E} \|X\|_2 = \mathbb{E} \max_{u \in S^{d-1}} \langle u, X \rangle \leq 2 \sqrt{2 v \ln 5^d} = O\big(\sqrt{vd}\big).$$

Key step of the proof: For any $\varepsilon > 0$, there is a finite subset $N \subseteq S^{d-1}$ of cardinality $|N| \leq (1 + 2/\varepsilon)^d$ such that, for every $u \in S^{d-1}$, there exists $u_0 \in N$ with $\|u - u_0\|_2 \leq \varepsilon$. Such a set $N$ is called an $\varepsilon$-net for $S^{d-1}$. We need a $1/2$-net, of cardinality at most $5^d$.
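To see the Euclidean norm bound in action (an illustrative check, not from the lecture): for $X \sim N(0, I_d)$, every $\langle u, X \rangle$ with $\|u\|_2 = 1$ is $1$-subgaussian, and $\mathbb{E}\|X\|_2 \approx \sqrt{d}$, comfortably inside the net bound.

```python
import numpy as np

rng = np.random.default_rng(0)

d, trials = 100, 5000

# For X ~ N(0, I_d), every <u, X> with ||u||_2 = 1 is 1-subgaussian (v = 1).
norms = np.linalg.norm(rng.normal(size=(trials, d)), axis=1)
print(f"empirical E||X||_2  = {norms.mean():.2f}")
print(f"sqrt(d)             = {np.sqrt(d):.2f}")   # true value, up to O(1)
print(f"2 sqrt(2 d ln 5)    = {2 * np.sqrt(2 * d * np.log(5)):.2f}")  # net bound
```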
Proof

Write $u \in S^{d-1}$ as $u = u_0 + \delta q$, where $u_0 \in N$, $q \in S^{d-1}$, $\delta \in [0, 1/2]$, so
$$\langle u, X \rangle = \langle u_0, X \rangle + \delta \langle q, X \rangle.$$
Observe that
$$\max_{u \in S^{d-1}} \langle u, X \rangle \leq \max_{u_0 \in N} \langle u_0, X \rangle + \max_{\delta \in [0,1/2]} \max_{q \in S^{d-1}} \delta \langle q, X \rangle \leq \max_{u_0 \in N} \langle u_0, X \rangle + \frac{1}{2} \max_{q \in S^{d-1}} \langle q, X \rangle.$$
So the max over $S^{d-1}$ is at most twice the max over $N$. Conclude by applying the previous result for finite collections.


$\varepsilon$-nets for unit sphere

There is an $\varepsilon$-net for $S^{d-1}$ of cardinality at most $(1 + 2/\varepsilon)^d$.

Proof: Repeatedly select points from $S^{d-1}$ so that each selected point has distance more than $\varepsilon$ from all previously selected points. Equivalently: repeatedly select points from $S^{d-1}$ as long as the balls of radius $\varepsilon/2$, centered at the selected points, are disjoint. (The process must eventually stop.)

When the process stops, every $u \in S^{d-1}$ is at distance at most $\varepsilon$ from the selected points; i.e., the selected points form an $\varepsilon$-net for $S^{d-1}$.

If we select $N$ points, then the $N$ balls of radius $\varepsilon/2$ are disjoint, and they are contained in a ball of radius $1 + \varepsilon/2$. So
$$N \cdot \mathrm{vol}\big((\varepsilon/2) B^d\big) \leq \mathrm{vol}\big((1 + \varepsilon/2) B^d\big).$$
This implies $N \leq (1 + 2/\varepsilon)^d$.
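The greedy selection process translates directly into code. Below is a minimal sketch (not from the lecture): random candidate points stand in for "all of $S^{d-1}$", so the result is only an approximate net, but the greedy rule is exactly the one in the proof.

```python
import numpy as np

def greedy_eps_net(d, eps, n_candidates=20_000, seed=0):
    """Greedy epsilon-net for S^{d-1}: keep a candidate point only if it is
    farther than eps from every point kept so far. Finitely many random
    candidates approximate the sphere, so this is an approximate net."""
    rng = np.random.default_rng(seed)
    pts = rng.normal(size=(n_candidates, d))
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)  # project onto sphere
    net = []
    for p in pts:
        if not net or np.min(np.linalg.norm(np.array(net) - p, axis=1)) > eps:
            net.append(p)
    return np.array(net)

net = greedy_eps_net(d=3, eps=0.5)
print(f"1/2-net size for S^2: {len(net)} (volume bound: 5^3 = 125)")
```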
Remarks

All previous results also hold when the random variables are $(v, c)$-subexponential (possibly with $c > 0$), with a slightly different bound: e.g.,
$$\mathbb{E} \max_{t \in T} X_t \leq \max\left\{ \sqrt{2 v \ln |T|}, \; 2 c \ln |T| \right\}.$$
It is also easy to get probability tail bounds (rather than expectation bounds).


Subspace embeddings
Subspace JL lemma

Consider a $k \times d$ random matrix $M$ whose entries are iid $N(0, 1/k)$. For $W \subseteq \mathbb{R}^d$ a subspace of dimension $r$,
$$\mathbb{E} \max_{u \in S^{d-1} \cap W} \left| \|Mu\|_2^2 - 1 \right| \leq O\left( \sqrt{\frac{r}{k}} + \frac{r}{k} \right).$$
The bound is at most $\varepsilon$ when $k \geq O\left( \frac{r}{\varepsilon^2} \right)$.

This implies the existence of a mapping $M \colon \mathbb{R}^d \to \mathbb{R}^k$ that approximately preserves all distances between points in $W$.


Proof of subspace JL lemma

Let the columns of $Q$ be an orthonormal basis for $W$. Then
$$\max_{u \in S^{d-1} \cap W} \left| \|Mu\|_2^2 - 1 \right| = \max_{u \in S^{r-1}} \left| u^\top Q^\top \big( M^\top M - I \big) Q u \right| = \max_{u, v \in S^{r-1}} u^\top Q^\top \big( M^\top M - I \big) Q v.$$

Lemma. For any $u, v \in S^{r-1}$,
$$X_{u,v} := u^\top Q^\top \big( M^\top M - I \big) Q v$$
is $(O(1/k), O(1/k))$-subexponential.
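An empirical check of the subspace JL lemma (illustrative, not from the lecture). The key observation making it computable: the max distortion over $S^{d-1} \cap W$ is the spectral norm of $Q^\top (M^\top M - I) Q$, which is determined by the extreme singular values of $MQ$.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, k = 500, 10, 1000

# Random r-dimensional subspace W, via an orthonormal basis Q (d x r).
Q, _ = np.linalg.qr(rng.normal(size=(d, r)))
M = rng.normal(scale=1.0 / np.sqrt(k), size=(k, d))  # entries iid N(0, 1/k)

# max over u in S^{d-1} cap W of | ||Mu||^2 - 1 | is the spectral norm of
# Q^T (M^T M - I) Q, computable from the singular values of M Q.
s = np.linalg.svd(M @ Q, compute_uv=False)
distortion = max(s.max()**2 - 1, 1 - s.min()**2)
print(f"max distortion over W: {distortion:.3f}  "
      f"(theory: O(sqrt(r/k)) = {np.sqrt(r / k):.3f}, up to constants)")
```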
Proof of subspace JL lemma (continued)

For $u, v \in S^{r-1}$, $X_{u,v} := u^\top Q^\top \big( M^\top M - I \big) Q v$.

Let $N$ be a $1/4$-net for $S^{r-1}$. Write $u, v \in S^{r-1}$ as $u = u_0 + \varepsilon p$, $v = v_0 + \delta q$, where $u_0, v_0 \in N$, $p, q \in S^{r-1}$, and $\varepsilon, \delta \in [0, 1/4]$, so
$$X_{u,v} = X_{u_0,v_0} + \varepsilon X_{p,v} + \delta X_{u_0,q}.$$
Therefore
$$\max_{u,v \in S^{r-1}} X_{u,v} \leq \max_{u_0,v_0 \in N} X_{u_0,v_0} + \frac{1}{2} \max_{p,q \in S^{r-1}} X_{p,q},$$
which implies
$$\max_{u,v \in S^{r-1}} X_{u,v} \leq 2 \max_{u_0,v_0 \in N} X_{u_0,v_0}.$$
Conclude by applying the previous result for finite collections.


Application to least squares
Big data least squares

Input: matrix $A \in \mathbb{R}^{n \times d}$, vector $b \in \mathbb{R}^n$ ($n \gg d$).
Goal: find $x \in \mathbb{R}^d$ so as to (approximately) minimize $\|Ax - b\|_2^2$.

Computation time: $O(nd^2)$. Can we speed this up?


Simple approach

Pick $m \ll n$. Let $M$ be a random $m \times n$ matrix (e.g., entries iid $N(0, 1/m)$, or a Fast JL Transform). Let $\tilde{A} := MA$ and $\tilde{b} := Mb$. Obtain the solution $\hat{x}$ to the least squares problem on $(\tilde{A}, \tilde{b})$.
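Here is a minimal sketch of the simple approach (dimensions and the planted data model are illustrative choices, not from the lecture). A dense Gaussian sketch is used for simplicity; a Fast JL Transform would reduce the sketching cost itself.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 10_000, 20, 1000

A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + rng.normal(size=n)   # planted model plus noise

# Sketch A and b with a Gaussian map with entries iid N(0, 1/m).
M = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))
A_tld, b_tld = M @ A, M @ b

x_hat, *_ = np.linalg.lstsq(A_tld, b_tld, rcond=None)   # sketched solution
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)          # exact solution

ratio = np.sum((A @ x_hat - b)**2) / np.sum((A @ x_star - b)**2)
print(f"cost(x_hat) / cost(x_star) = {ratio:.4f}")
```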
Simple (somewhat loose) analysis

Let $W$ be the subspace spanned by the columns of $A$ and $b$. Its dimension is at most $d + 1$. If $m \geq O(d/\varepsilon^2)$, then $M$ is a subspace embedding for $W$:
$$(1-\varepsilon) \|x\|_2^2 \leq \|Mx\|_2^2 \leq (1+\varepsilon) \|x\|_2^2 \quad \text{for all } x \in W.$$
Let $x^* := \arg\min_{x \in \mathbb{R}^d} \|Ax - b\|_2^2$. Then
$$\|A\hat{x} - b\|_2^2 \leq \frac{1}{1-\varepsilon} \|M(A\hat{x} - b)\|_2^2 \leq \frac{1}{1-\varepsilon} \|M(Ax^* - b)\|_2^2 \leq \frac{1+\varepsilon}{1-\varepsilon} \|Ax^* - b\|_2^2,$$
where the middle inequality holds because $\hat{x}$ minimizes the sketched objective.

Running time (using FJLT): $O\big( (m + n) d \log n + m d^2 \big)$.


Another perspective: random sampling

Pick a random sample of $m \ll n$ of the rows of $(A, b)$; obtain the solution $\hat{x}$ for the least squares problem on the sample. Hope that $\hat{x}$ is also good for the original problem.

In statistics, this is the random design setting for regression: a random sample of covariates $\tilde{A} \in \mathbb{R}^{m \times d}$ and responses $\tilde{b} \in \mathbb{R}^m$ from the full population $(A, b)$. The least squares solution $\hat{x}$ on $(\tilde{A}, \tilde{b})$ is the MLE for the linear regression coefficients under a linear model with Gaussian noise. One can also regard $\hat{x}$ as the empirical risk minimizer among all linear predictors under squared loss.
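The sampling perspective is even simpler to implement (an illustrative sketch, not from the lecture). Uniform sampling works well on this synthetic Gaussian design precisely because its leverage scores, introduced next, are uniformly small.

```python
import numpy as np

rng = np.random.default_rng(1)

n, d, m = 10_000, 20, 500

A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + rng.normal(size=n)

# Uniform row sampling: solve least squares on m randomly chosen rows of (A, b).
idx = rng.choice(n, size=m, replace=False)
x_hat, *_ = np.linalg.lstsq(A[idx], b[idx], rcond=None)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

ratio = np.sum((A @ x_hat - b)**2) / np.sum((A @ x_star - b)**2)
print(f"cost ratio with uniform sampling: {ratio:.4f}")
```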
Simple random design analysis

Let $x^* := \arg\min_{x \in \mathbb{R}^d} \|Ax - b\|_2^2$. With high probability over the choice of the random sample,
$$\|A\hat{x} - b\|_2^2 \leq \left( 1 + O\left( \frac{\kappa}{m} \right) \right) \|Ax^* - b\|_2^2$$
(up to lower-order terms), where
$$\kappa := n \max_{i \in [n]} \|(A^\top A)^{-1/2} A^\top e_i\|_2^2$$
and $e_i$ is the $i$-th coordinate basis vector.

Write the thin SVD of $A$ as $A = USV^\top$, where $U \in \mathbb{R}^{n \times d}$. Then
$$(A^\top A)^{-1/2} A^\top = (V S^2 V^\top)^{-1/2} V S U^\top = V U^\top.$$
So $\kappa = n \max_{i \in [n]} \|U^\top e_i\|_2^2$.

$\|U^\top e_i\|_2^2$ is the statistical leverage score for the $i$-th row of $A$: it measures how much influence the $i$-th row has on the least squares solution.


Statistical leverage

$i$-th statistical leverage score: $\ell_i := \|U^\top e_i\|_2^2$, where $U \in \mathbb{R}^{n \times d}$ is the matrix of left singular vectors of $A$. Two extreme cases:
$$U = \begin{bmatrix} I_{d \times d} \\ 0_{(n-d) \times d} \end{bmatrix}, \qquad U = \frac{1}{\sqrt{n}} \begin{bmatrix} H_n e_1 & H_n e_2 & \cdots & H_n e_d \end{bmatrix},$$
where $H_n$ is the $n \times n$ Hadamard matrix.

First case: the first $d$ rows are the only rows that matter; $n \max_{i \in [n]} \ell_i = n$.
Second case: all $n$ rows are equally important; $n \max_{i \in [n]} \ell_i = d$.
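Leverage scores are the squared row norms of $U$, so they are a few lines of code (an illustrative sketch; the spiked example is an arbitrary choice to show leverage concentrating on one row):

```python
import numpy as np

def leverage_scores(A):
    """Statistical leverage scores: l_i = ||U^T e_i||_2^2, i.e. the squared
    row norms of the matrix of left singular vectors of A."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)  # thin SVD, U is n x d
    return np.sum(U**2, axis=1)

rng = np.random.default_rng(0)
n, d = 1000, 10

A = rng.normal(size=(n, d))
l = leverage_scores(A)
print(f"Gaussian A:  n * max l_i = {n * l.max():.1f}  (ideal: d = {d})")

# A matrix with one outsized row: leverage concentrates on it.
A[0] *= 100
l = leverage_scores(A)
print(f"spiked A:    n * max l_i = {n * l.max():.1f}  (worst case: n = {n})")
```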
Ensuring small statistical leverage

To ensure the situation is more like the second case, apply a random rotation (e.g., a randomized Hadamard transform) to $A$ and $b$. This randomly mixes up the rows of $(A, b)$ so that no single row is (much) more important than another. We get $n \max_{i \in [n]} \ell_i = O(d + \log n)$ with high probability.

To get a $1 + \varepsilon$ approximation ratio, i.e., $\|A\hat{x} - b\|_2^2 \leq (1 + \varepsilon) \|Ax^* - b\|_2^2$, it suffices to have
$$m \geq O\left( \frac{d + \log n}{\varepsilon} \right).$$
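A sketch of the mixing step (illustrative; the dense Hadamard matrix from `scipy.linalg.hadamard` stands in for a fast recursive transform, which would apply $HD$ in $O(n \log n)$ time per column, and `leverage_scores` is the helper defined above):

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
n, d = 1024, 10   # n must be a power of 2 for scipy's Hadamard matrix

# Matrix with badly skewed leverage: one row dominates.
A = rng.normal(size=(n, d))
A[0] *= 100

def max_leverage(A):
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return np.max(np.sum(U**2, axis=1))

# Randomized Hadamard transform: x -> (1/sqrt(n)) H D x, where D is a random
# diagonal sign matrix (broadcasting the signs scales the columns of H).
HD = hadamard(n) / np.sqrt(n) * rng.choice([-1.0, 1.0], size=n)

print(f"before mixing: n * max l_i = {n * max_leverage(A):.1f}")
print(f"after mixing:  n * max l_i = {n * max_leverage(HD @ A):.1f}"
      f"  (theory: O(d + log n))")
```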
Application to compressed sensing


Under-determined least squares

Input: matrix $A \in \mathbb{R}^{n \times d}$, vector $b \in \mathbb{R}^n$ ($n \ll d$).
Goal: find the sparsest $x \in \mathbb{R}^d$ that minimizes $\|Ax - b\|_2^2$. NP-hard in general.

Suppose $b = A\bar{x}$ for some $\bar{x} \in \mathbb{R}^d$ with $\mathrm{nnz}(\bar{x}) \leq k$, i.e., $\bar{x}$ is $k$-sparse. Is $\bar{x}$ the (unique) sparsest solution? If so, how do we find it?


Null space property

Lemma. The null space of $A$ contains no non-zero $2k$-sparse vectors $\iff$ every $k$-sparse vector $\bar{x} \in \mathbb{R}^d$ is the unique $k$-sparse solution to $Ax = A\bar{x}$.

Proof. ($\Rightarrow$) Take any $k$-sparse vectors $x$ and $y$ with $Ax = Ay$; we want to show $x = y$. The difference $x - y$ is $2k$-sparse, and $A(x - y) = 0$. By assumption, the null space of $A$ contains no non-zero $2k$-sparse vectors, so $x - y = 0$, i.e., $x = y$.

($\Leftarrow$) Take any $2k$-sparse vector $z$ in the null space of $A$; we want to show $z = 0$. Write it as $z = x - y$ for some $k$-sparse vectors $x$ and $y$ with disjoint supports. Then $A(x - y) = 0$, and hence $x = y$ by assumption. But $x$ and $y$ have disjoint supports, so it must be that $x = y = 0$, so $z = 0$.
Null space property from subspace embeddings

If $A$ is an $n \times d$ random matrix with iid $N(0, 1/n)$ entries, then under what conditions is there no non-zero $2k$-sparse vector in its null space?

Want: for any non-zero $2k$-sparse vector $z$, $Az \neq 0$, i.e., $\|Az\|_2^2 > 0$.

Consider a particular choice $I \subseteq [d]$ of $|I| = 2k$ coordinates, and the corresponding subspace $W_I$ spanned by $\{e_i : i \in I\}$. Every $2k$-sparse $z$ is in $W_I$ for some $I$. So it is sufficient for $A$ to be a $1/2$-subspace embedding for $W_I$ for all $I$:
$$\frac{1}{2} \|z\|_2^2 \leq \|Az\|_2^2 \leq \frac{3}{2} \|z\|_2^2 \quad \text{for all } 2k\text{-sparse } z.$$


Null space property from subspace embeddings (continued)

Say $A$ fails for $I$ if it is not a $1/2$-subspace embedding for $W_I$. By the subspace JL lemma,
$$\mathbb{P}(A \text{ fails for } I) \leq 2^{O(k)} \exp\left( -\Omega(n) \right).$$
Union bound over all choices of $I$ with $|I| = 2k$:
$$\mathbb{P}(A \text{ fails for some } I) \leq \binom{d}{2k} \cdot 2^{O(k)} \exp\left( -\Omega(n) \right).$$
To ensure this is, say, at most $1/2$, we just need
$$n \geq O\left( k + \log \binom{d}{2k} \right) = O\big( k + k \log(d/k) \big).$$
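For tiny instances, the sufficient condition can be verified by brute force (an illustrative check, not from the lecture; $n$ is chosen generously here since the constants hidden in $O(\cdot)$ and $\Omega(\cdot)$ are not specified). Over each coordinate subspace $W_I$, the embedding condition reduces to the singular values of the column submatrix $A_{\cdot, I}$.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

d, k, n = 12, 2, 100

A = rng.normal(scale=1.0 / np.sqrt(n), size=(n, d))  # entries iid N(0, 1/n)

# Check the sufficient condition: for every support I with |I| = 2k, A must be
# a 1/2-subspace embedding for W_I. Over W_I this holds iff the squared
# singular values of the n x 2k submatrix A[:, I] all lie in [1/2, 3/2].
ok = all(
    0.5 <= np.linalg.svd(A[:, list(I)], compute_uv=False).min()**2
    and np.linalg.svd(A[:, list(I)], compute_uv=False).max()**2 <= 1.5
    for I in combinations(range(d), 2 * k)
)
print(f"1/2-subspace embedding for all C({d},{2*k}) supports: {ok}")
```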
Restricted isometry property

$(l, \delta)$-restricted isometry property (RIP):
$$(1-\delta) \|z\|_2^2 \leq \|Az\|_2^2 \leq (1+\delta) \|z\|_2^2 \quad \text{for all } l\text{-sparse } z.$$
Many algorithms can recover the unique sparsest solution under RIP (with $l = O(k)$ and $\delta = \Omega(1)$), e.g., basis pursuit, Lasso, orthogonal matching pursuit.
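Basis pursuit, listed above, minimizes $\|x\|_1$ subject to $Ax = b$ and can be written as a linear program via the standard split $x = u - v$ with $u, v \geq 0$. A minimal sketch using SciPy (the problem sizes are illustrative; recovery is only expected when $n$ is sufficiently large relative to $k \log(d/k)$):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

n, d, k = 40, 100, 4

A = rng.normal(size=(n, d))
x_bar = np.zeros(d)
support = rng.choice(d, size=k, replace=False)
x_bar[support] = rng.normal(size=k)
b = A @ x_bar                       # noiseless measurements of a k-sparse x

# Basis pursuit: min ||x||_1 s.t. Ax = b, as an LP over (u, v) with x = u - v.
c = np.ones(2 * d)                  # objective: sum(u) + sum(v) = ||x||_1
A_eq = np.hstack([A, -A])           # A(u - v) = b
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * d))
x_hat = res.x[:d] - res.x[d:]

print(f"recovery error: {np.linalg.norm(x_hat - x_bar):.2e}")
```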