Lecture 15: Random Projections

Lindsey Leonard
5 years ago

Introduction to Learning and Analysis of Big Data
Kontorovich and Sabato (BGU)

Review of PCA

- Unsupervised learning technique.
- Performs dimensionality reduction from dimension $d$ to dimension $k < d$.
- Optimization problem: find a rank-$k$ orthogonal projection $UU^\top$ that minimizes the mean square distortion on the data:
  $$\sum_{i=1}^{m} \| x_i - UU^\top x_i \|_2^2$$
- Solution: define the data matrix $A = \sum_{i=1}^{m} x_i x_i^\top$, diagonalize it, and choose $U = [u_1, \ldots, u_k]$ to be the top $k$ eigenvectors of $A$.
- Equivalently: find $k$ orthogonal directions that capture most of the data variance.
- Often used as a pre-processing step for supervised learning; has a denoising effect.
- Computational cost:
  - $O(md^2 + d^3)$ if $m \ge d$
  - $O(m^2 d + m^3)$ if $d > m$

What about the interpoint distances?

- After PCA projection, distinct points could be mapped to identical ones: $UU^\top x = UU^\top x'$ even though $x \ne x'$.
- For some applications, we would like relative distances to be approximately preserved.
- We will discuss two applications:
  - Nearest neighbors
  - SVM
- Formally, if $W : \mathbb{R}^d \to \mathbb{R}^k$ is a linear map, we would like
  $$\frac{\| Wx - Wx' \|}{\| x - x' \|} \approx 1$$
  for all $x, x'$ in the data set.
- PCA does not have this property.
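A minimal sketch of the collapse described on this slide. The direction `U` and the two points are hypothetical values chosen for illustration, not taken from the lecture: any two points that differ only in a direction orthogonal to the retained subspace end up with the same image.

```python
def project(U, x):
    """Apply the orthogonal projection U U^T to x.

    U is a list of k orthonormal direction vectors (the columns of U).
    """
    coeffs = [sum(u_j * x_j for u_j, x_j in zip(u, x)) for u in U]  # U^T x
    return [sum(c * u[j] for c, u in zip(coeffs, U)) for j in range(len(x))]

# Hypothetical example: keep only the first coordinate direction (k = 1, d = 2).
U = [[1.0, 0.0]]
x = [2.0, 1.0]
x_prime = [2.0, -3.0]

# The points differ only along the discarded direction, so they collapse.
collapsed = project(U, x) == project(U, x_prime)
print(project(U, x), project(U, x_prime), collapsed)
```

The original distance between the two points is 4, while the distance after projection is 0, so no ratio close to 1 is possible here.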

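Pairwise distances and norms

- Seeking a linear map $W : \mathbb{R}^d \to \mathbb{R}^k$ that approximately preserves distances:
  $$\frac{\| Wx - Wx' \|}{\| x - x' \|} \approx 1$$
- Equivalently, one that preserves norms:
  $$\frac{\| Wx \|}{\| x \|} \approx 1. \qquad (*)$$
- Why is this equivalent? By linearity, $Wx - Wx' = W(x - x')$, so applying $(*)$ to the difference vector $x - x'$ gives the distance guarantee; conversely, taking $x' = 0$ in the distance guarantee gives $(*)$.
- Theorem: $(*)$ can be obtained if all $W_{ij}$ are independently Gaussian.
- Reminder: the normal (Gaussian) distribution $N(\mu, \sigma^2)$ with mean $\mu$ and variance $\sigma^2$ has density function
  $$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$$
- Key properties:
  - $X \sim N(0, 1) \implies aX + b \sim N(b, a^2)$
  - $X, Y$ independent and $X, Y \sim N(0, 1) \implies X + Y \sim N(0, 2)$
  - $X_1, \ldots, X_n$ independent and $X_i \sim N(0, 1) \implies \sum_{i=1}^{n} a_i X_i \sim N(0, \|a\|_2^2)$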
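The last Gaussian property (a linear combination of independent standard normals has variance $\|a\|_2^2$) can be checked numerically. The following is a small simulation sketch; the coefficient vector `a`, the sample size, and the seed are arbitrary choices made for illustration.

```python
import random

def combination_variance(a, n_samples=20000, seed=0):
    """Estimate Var(sum_i a_i X_i) for independent X_i ~ N(0, 1) by simulation."""
    rng = random.Random(seed)
    draws = [sum(a_i * rng.gauss(0.0, 1.0) for a_i in a) for _ in range(n_samples)]
    mean = sum(draws) / n_samples
    return sum((z - mean) ** 2 for z in draws) / (n_samples - 1)

a = [3.0, 4.0]                              # assumed coefficients
squared_norm = sum(a_i ** 2 for a_i in a)   # ||a||_2^2 = 25
estimate = combination_variance(a)
print(squared_norm, estimate)               # the estimate should be near 25
```

This is the property used in the next slide: each coordinate of $Wx$ is a linear combination of independent Gaussians with coefficients $x$.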

Preserving norm of a single vector

Theorem. Fix $x \in \mathbb{R}^d$ and draw $W \in \mathbb{R}^{k \times d}$ such that $W_{ij} \sim N(0, 1/k)$ independently. Then, for all $0 < \varepsilon < 3$,
$$P\left[\, \left| \frac{\| Wx \|^2}{\| x \|^2} - 1 \right| > \varepsilon \right] < 2 e^{-\varepsilon^2 k / 6}.$$

Proof idea:

- It suffices to consider $x$ such that $\| x \| = 1$.
- Let $U \in \mathbb{R}^{d \times d}$ be such that $UU^\top = I$ and $\tilde{x} := U^\top x = (1, 0, \ldots, 0)$.
- The entries of $\tilde{W} := WU$ are also distributed as $N(0, 1/k)$.
- We have $\| \tilde{x} \| = \| x \|$ and $Wx = WUU^\top x = \tilde{W}\tilde{x}$.
- So it suffices to look at $x = (1, 0, 0, \ldots, 0)$ and Gaussian $W$.
- $Wx = (W_{11}, \ldots, W_{k1})$, so $Wx$ is a $k$-dimensional vector with entries distributed as $N(0, 1/k)$.
- $E[\| Wx \|^2] = \sum_{i=1}^{k} E[W_{i1}^2] = k \cdot 1/k = 1$. Expectations match.
- Theorem: the norm is close to 1 with high probability for large enough $k$.
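As a rough empirical check of the single-vector theorem, the sketch below draws many independent matrices $W$ with $N(0, 1/k)$ entries and compares the observed failure frequency with the bound $2e^{-\varepsilon^2 k/6}$. The vector `x`, the values of `k` and `eps`, the number of trials, and the seed are all assumptions made for illustration.

```python
import math
import random

def norm_ratio(x, k, rng):
    """Draw W in R^{k x d} with i.i.d. N(0, 1/k) entries; return ||Wx||^2 / ||x||^2."""
    sigma = 1.0 / math.sqrt(k)
    squared_norm_wx = 0.0
    for _ in range(k):
        # One row of W, applied to x on the fly (W is never stored).
        row_dot_x = sum(rng.gauss(0.0, sigma) * x_j for x_j in x)
        squared_norm_wx += row_dot_x ** 2
    return squared_norm_wx / sum(x_j ** 2 for x_j in x)

rng = random.Random(0)
x = [float(j + 1) for j in range(50)]     # arbitrary fixed vector in R^50
k, eps, trials = 200, 0.5, 100

failures = sum(abs(norm_ratio(x, k, rng) - 1.0) > eps for _ in range(trials))
empirical_rate = failures / trials
bound = 2.0 * math.exp(-eps ** 2 * k / 6.0)
print(empirical_rate, bound)
```

With these values the bound is already below $10^{-3}$, so failures should be rare; the simulation only illustrates the theorem and does not prove it.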

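From one to many

Theorem: for fixed $x \in \mathbb{R}^d$ and $W \in \mathbb{R}^{k \times d}$ such that $W_{ij} \sim N(0, 1/k)$ independently, we have, for all $0 < \varepsilon < 3$,
$$P\left[\, \left| \frac{\| Wx \|^2}{\| x \|^2} - 1 \right| > \varepsilon \right] < 2 e^{-\varepsilon^2 k / 6}.$$

But we have several points $x$. Denote them $Q \subseteq \mathbb{R}^d$.

Theorem (Johnson-Lindenstrauss). Let $Q \subseteq \mathbb{R}^d$ be a finite set and $\delta \in (0, 1)$. Let $k \in \mathbb{N}$ and define
$$\varepsilon = \sqrt{\frac{6 \log(2|Q|/\delta)}{k}}.$$
If $\varepsilon \le 3$, we have, with probability at least $1 - \delta$ over $W$:
$$\max_{x \in Q} \left| \frac{\| Wx \|^2}{\| x \|^2} - 1 \right| < \varepsilon.$$

Proof: the previous theorem + union bound.

The new dimension is $k = 6 \log(2|Q|/\delta)/\varepsilon^2$. It does not depend on $d$!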
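The dimension formula from this slide is easy to turn into a small calculator. The function name `jl_dimension` and its argument names are hypothetical; the formula is the one stated above, rounded up to an integer, with the natural logarithm assumed.

```python
import math

def jl_dimension(num_points, eps, delta):
    """Smallest integer k with k >= 6 * log(2 |Q| / delta) / eps^2.

    num_points is |Q|; eps must lie in (0, 3] and delta in (0, 1).
    Note that the ambient dimension d is not an argument.
    """
    if not (0.0 < eps <= 3.0):
        raise ValueError("eps must be in (0, 3]")
    if not (0.0 < delta < 1.0):
        raise ValueError("delta must be in (0, 1)")
    return math.ceil(6.0 * math.log(2.0 * num_points / delta) / eps ** 2)

k_small = jl_dimension(1000, eps=0.5, delta=0.01)      # 293
k_large = jl_dimension(10 ** 6, eps=0.5, delta=0.01)   # grows only logarithmically in |Q|
print(k_small, k_large)
```

Multiplying the number of points by 1000 increases the target dimension by less than a factor of 2, which is the logarithmic dependence on $|Q|$.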

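Preserving distances

- The guarantee: if $k = 6 \log(2|Q|/\delta)/\varepsilon^2$, then
  $$\max_{x \in Q} \left| \frac{\| Wx \|^2}{\| x \|^2} - 1 \right| < \varepsilon.$$
- This preserves norms in $Q$.
- Consider a labeled sample $S = \{ (x_i, y_i) \in \mathbb{R}^d \times \{-1, 1\} : i \le m \}$.
- How to preserve distances in $S$?
- Set $Q = \{ x_i - x_j \mid i, j \le m \}$, so that $|Q| = \binom{m}{2}$.
- It suffices to have $k = 12 \log(m/\delta)/\varepsilon^2$, since $2|Q| \le m^2$ and $6 \log(m^2/\delta) \le 12 \log(m/\delta)$.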
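The pairwise guarantee can be checked on synthetic data. This is a sketch under assumed values: the sample is random Gaussian data with made-up sizes `m` and `d`, and `k` follows the $12 \log(m/\delta)/\varepsilon^2$ rule from the slide. The helper names are hypothetical.

```python
import itertools
import math
import random

def random_projection(points, k, rng):
    """Project each point with one shared W in R^{k x d}, W_ij ~ N(0, 1/k)."""
    d = len(points[0])
    sigma = 1.0 / math.sqrt(k)
    W = [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(k)]
    return [[sum(w * x for w, x in zip(row, p)) for row in W] for p in points]

def sq_dist(a, b):
    return sum((s - t) ** 2 for s, t in zip(a, b))

def max_distortion(points, projected):
    """Largest | ||W(x_i - x_j)||^2 / ||x_i - x_j||^2 - 1 | over all pairs."""
    worst = 0.0
    for i, j in itertools.combinations(range(len(points)), 2):
        ratio = sq_dist(projected[i], projected[j]) / sq_dist(points[i], points[j])
        worst = max(worst, abs(ratio - 1.0))
    return worst

rng = random.Random(0)
m, d, eps, delta = 20, 500, 0.5, 0.1          # assumed sizes and tolerances
k = math.ceil(12 * math.log(m / delta) / eps ** 2)
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
worst = max_distortion(points, random_projection(points, k, rng))
print(k, worst)   # the guarantee says worst < eps with probability at least 1 - delta
```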

Applications: Approximate Nearest Neighbors

- Labeled sample $S = \{ (x_i, y_i) \in \mathbb{R}^d \times \{-1, 1\} : i \le m \}$.
- Recall the 1-NN classifier $h_{1\text{-NN}} : \mathbb{R}^d \to \{-1, 1\}$:
  $h_{1\text{-NN}}(x) = y_j$ if $x_j \in S$ is closest to $x$.
- Naively, evaluating the 1-NN classifier on $x$ requires time $O(md)$.
- Suppose we will be required to label at most $m'$ points.
- Use Johnson-Lindenstrauss (J-L): select a random $W : \mathbb{R}^d \to \mathbb{R}^k$ and project to dimension $k = O(\log(m + m')/\varepsilon^2)$.
- Get, with high probability, for all training and query points:
  $$1 - \varepsilon \le \frac{\| Wx - Wx' \|}{\| x - x' \|} \le 1 + \varepsilon$$
- To classify, set $\tilde{x} = Wx$ and calculate 1-NN from $\tilde{x}$ to the projected training set $\tilde{S} = \{ (\tilde{x}_i, y_i) \}$, where $\tilde{x}_i = W x_i$.
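The scheme on this slide can be sketched in a few lines: project the training set once, then answer each query by a 1-NN search in dimension $k$. Everything below (the synthetic data, the sizes `m`, `d`, `k`, and the helper names) is assumed for illustration; it is not the lecture's implementation.

```python
import math
import random

def sq_dist(a, b):
    return sum((s - t) ** 2 for s, t in zip(a, b))

def nearest_index(points, query):
    """Index of the point closest to query (exact 1-NN by linear scan)."""
    return min(range(len(points)), key=lambda i: sq_dist(points[i], query))

def apply_map(W, x):
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]

rng = random.Random(0)
m, d, k = 30, 500, 100                       # assumed sizes, with k << d
train = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
labels = [rng.choice([-1, 1]) for _ in range(m)]

# Preprocessing: draw W once (entries N(0, 1/k)) and project the training set.
sigma = 1.0 / math.sqrt(k)
W = [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(k)]
train_projected = [apply_map(W, x) for x in train]

def classify(query):
    """Approximate 1-NN: search among projected points, O(mk) per query after projecting."""
    j = nearest_index(train_projected, apply_map(W, query))
    return labels[j], j

query = [rng.gauss(0.0, 1.0) for _ in range(d)]
label, j = classify(query)
i = nearest_index(train, query)              # exact neighbour, O(md), for comparison
approx_factor = math.sqrt(sq_dist(query, train[j]) / sq_dist(query, train[i]))
print(label, i, j, approx_factor)            # approx_factor is at least 1 by definition of i
```

Projecting the query itself costs $O(dk)$, so the saving is in the search over the $m$ training points, and it matters most when $m$ is large relative to $k$.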

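Applications: Approximate Nearest Neighbors

- How good is this scheme?
- Put $\tilde{x}_i := W x_i$; at query point $x$, suppose $1\text{-NN}(x, S) = x_i$ but $1\text{-NN}(\tilde{x}, \tilde{S}) = \tilde{x}_j$.
- Claim: $\| x - x_j \| \le (1 + O(\varepsilon)) \| x - x_i \|$.
- Proof:
  - NN property: $\| x - x_i \| \le \| x - x_j \|$ and $\| \tilde{x} - \tilde{x}_j \| \le \| \tilde{x} - \tilde{x}_i \|$.
  - J-L: $\| x - x_j \| \le \frac{1}{1 - \varepsilon} \| \tilde{x} - \tilde{x}_j \|$ and $\| x - x_i \| \ge \frac{1}{1 + \varepsilon} \| \tilde{x} - \tilde{x}_i \|$.
  - Hence
    $$\frac{\| x - x_j \|}{\| x - x_i \|} \le \frac{\frac{1}{1 - \varepsilon} \| \tilde{x} - \tilde{x}_j \|}{\frac{1}{1 + \varepsilon} \| \tilde{x} - \tilde{x}_i \|} \le \frac{1 + \varepsilon}{1 - \varepsilon} = 1 + O(\varepsilon).$$
- This is a $(1 + O(\varepsilon))$-approximate NN.
- Evaluation cost is now $O(mk)$ instead of $O(md)$.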

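Applications: Approximate SVM

- Claim: random projections approximately preserve inner products.
- Assume $\| x \| \le 1$ for all $x \in S$; put $\tilde{x} = Wx$.
- Claim: for $k = O(\log(m)/\varepsilon^2)$, a random $W \in \mathbb{R}^{k \times d}$ satisfies, for all $u, v \in S$,
  $$| \langle u, v \rangle - \langle \tilde{u}, \tilde{v} \rangle | = O(\varepsilon).$$
- Proof:
  - For $u, v \in \mathbb{R}^d$ we have $\langle u, v \rangle = \tfrac{1}{2} \left( \| u \|^2 + \| v \|^2 - \| u - v \|^2 \right)$.
  - J-L: $1 - \varepsilon \le \dfrac{\| \tilde{u} - \tilde{v} \|^2}{\| u - v \|^2} \le 1 + \varepsilon$, and likewise for $\| \tilde{u} \|^2 / \| u \|^2$ and $\| \tilde{v} \|^2 / \| v \|^2$.
  - $(1 - \varepsilon)^{-1} = 1 + O(\varepsilon)$; $(1 + \varepsilon)^{-1} = 1 - O(\varepsilon)$.
  - Hence
    $$\langle u, v \rangle = \tfrac{1}{2} \left( \| u \|^2 + \| v \|^2 - \| u - v \|^2 \right) \le \tfrac{1}{2} \left( (1 - \varepsilon)^{-1} ( \| \tilde{u} \|^2 + \| \tilde{v} \|^2 ) - (1 + \varepsilon)^{-1} \| \tilde{u} - \tilde{v} \|^2 \right)$$
    $$= \tfrac{1}{2} \left( \| \tilde{u} \|^2 + \| \tilde{v} \|^2 - \| \tilde{u} - \tilde{v} \|^2 \right) + O(\varepsilon) = \langle \tilde{u}, \tilde{v} \rangle + O(\varepsilon),$$
  - and
    $$\langle \tilde{u}, \tilde{v} \rangle = \tfrac{1}{2} \left( \| \tilde{u} \|^2 + \| \tilde{v} \|^2 - \| \tilde{u} - \tilde{v} \|^2 \right) \le \tfrac{1}{2} \left( (1 + \varepsilon)( \| u \|^2 + \| v \|^2 ) - (1 - \varepsilon) \| u - v \|^2 \right) = \langle u, v \rangle + O(\varepsilon).$$
  - The $O(\varepsilon)$ terms are bounded because all norms involved are at most a constant ($\| u \|, \| v \| \le 1$).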
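A small numerical sketch of the inner-product claim, on assumed synthetic data: random unit vectors (so that $\| x \| \le 1$ holds), projected with one Gaussian matrix, comparing $\langle u, v \rangle$ with $\langle \tilde{u}, \tilde{v} \rangle$ over all pairs. The sizes and helper names are hypothetical.

```python
import itertools
import math
import random

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def random_unit_vector(d, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(v, v))
    return [v_j / norm for v_j in v]

rng = random.Random(0)
m, d, k = 10, 500, 200                       # assumed sizes
S = [random_unit_vector(d, rng) for _ in range(m)]

# One shared random map W with i.i.d. N(0, 1/k) entries.
sigma = 1.0 / math.sqrt(k)
W = [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(k)]
S_projected = [[dot(row, x) for row in W] for x in S]

# Largest additive error | <u, v> - <Wu, Wv> | over all pairs in S.
max_error = max(
    abs(dot(S[i], S[j]) - dot(S_projected[i], S_projected[j]))
    for i, j in itertools.combinations(range(m), 2)
)
print(max_error)
```

The error is additive rather than relative, which is why the norm bound $\| x \| \le 1$ is needed in the claim.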

Approximate SVM

Theorem. Suppose that all $(x, y) \in S \subseteq \mathbb{R}^d \times \{-1, 1\}$ satisfy $\| x \| \le 1$ and that $S$ is linearly separable with margin $\gamma$. Assume $d \gg 1/\gamma^2$. Then there is $k = O(1/\gamma^2)$ such that, with high probability, the randomly projected sample $\tilde{S}$ is linearly separable with margin $\gamma/2$.

- If $d \gg 1/\gamma^2$, we can use a dimension of order $1/\gamma^2$ instead of the original dimension!
- Another proof that when the margin is large, the dimension doesn't matter.
- Guarantees can also be shown for non-separable problems.

Johnson-Lindenstrauss summary

- Unsupervised learning technique.
- Key insight: for any $m$ points $\{ x_i \}$ in $\mathbb{R}^d$ and $\varepsilon \in (0, 3)$, there is a $W \in \mathbb{R}^{k \times d}$ with $k = O(\log(m)/\varepsilon^2)$ such that
  $$1 - \varepsilon \le \frac{\| W x_i \|^2}{\| x_i \|^2} \le 1 + \varepsilon, \quad i \le m. \qquad (*)$$
- Further, if we draw $W_{ij} \sim N(0, 1/k)$ independently, then $(*)$ holds with high probability for any $m$-point set!
- J-L is a data-oblivious technique (unlike PCA).
- Applications: NN
  - $(1 + O(\varepsilon))$-approximate nearest neighbors.
  - At query point $x$, the approximate NN is at most $(1 + O(\varepsilon))$ times as far from $x$ as the true NN.
  - Speedup from $O(md)$ to $O(mk)$.
- Applications: SVM
  - If the data is separable with margin $\gamma$, random projection to dimension $O(1/\gamma^2)$ results, with high probability, in data separable with margin $\gamma/2$.
  - The effective dimension is $O(1/\gamma^2)$, independent of $d$.
