Lecture 6 Proof for JL Lemma and Linear Dimensionality Reduction

Size: px

Start display at page:

Download "Lecture 6 Proof for JL Lemma and Linear Dimensionality Reduction"

Alan Lane
5 years ago
Views:

1 COMS 4995: Unsupervised Learning (Summer 18) June 7, 018 Lecture 6 Proof for JL Lemma and Linear imensionality Reduction Instructor: Nakul Verma Scribes: Ziyuan Zhong, Kirsten Blancato This lecture gives a proof of JL-lemma and introduces linear dimensionality reduction techniques including: RP (Random Projection), PCA (Principal Component Analysis), LA (Linear iscriminant Analysis) and MS (Multidimensional scaling). 1 JL-Lemma 1.1 Recall Theorem 1 (Johnson-Lidenstrauss flattening lemma, 1984). Pick any 0 < ɛ < 1. Then for any integer n, let d > 4 ( ln n + ln 3), that is, d > Ω( ln n ). Then for any set V R, s.t. V = n, ɛ ɛ there exists a linear map f : R R d s.t. u, v V, (1 ɛ) u v f(u) f(v) (1 + ɛ) u v Remark. Since 1 ɛ 1 ɛ and 1 + ɛ 1 + ɛ, the embedding is thus a -embedding where the distortion = 1+ɛ 1 ɛ 1 + 5ɛ because 1 1 ɛ 1 + ɛ 0 < ɛ < 1. The above holds true for any random d-dim subspace (in -dim) with high probability (minor global scaling). Lemma 3 (Concentration of Measure). Pick any 0 < ɛ < 1, fix any unit vector w R (i.e. w = 1), let φ : R R d, d <, be a random subspace map. Then, [ Pr φ(w) < (1 ɛ) d φ or φ(w) > (1 + ɛ) d ] 3e dɛ /4. For JL, we ll want to project to a random subspace, then scale by /d. 1. Proof of JL-Lemma Proof. Because φ is linear, there is a corresponding matrix P R d s.t. φ(w) = P w. f := d φ, so f(w) = d P w. For any distinct u, v V, 1

2 ] Pr [ u, v V s.t. f(u) f(v) < (1 ɛ) u v or f(u) f(v) > (1 + ɛ) u v = (u,v) V V unordered pairs (u,v) V V unordered pairs ( ) n 3e dɛ /4 < 1 [ Pr f(u) f(v) < (1 ɛ) u v φ or f(u) f(v) > (1 + ɛ) u v ] [ ( ) u v Pr φ < (1 ɛ) d φ u v or ( ) u v φ > (1 + ɛ) d ] u v where the first inequality is a union bound, and the last inequality holds if we choose d such that: 4 d > ( ln n + ln 3). ɛ The inner inequality follows from the linearity of f = d φ, followed by an application of Lemma 3: ( ) f(u) f(v) < (1 ɛ) u v u v d φ < (1 ɛ) u v ( ) u v φ < (1 ɛ) d u v. Recap: JL is a linear dimensionality-reduction technique. The goal is to preserve l distances up to distortions of 1 ± ɛ. This is also a concentration result, φ(w) < (1 ɛ)d/, where φ(w) is the actual length of a particular draw and d/ is the expected length. The actual draw will be concentrated towards a specific value, typically the expected value. 1.3 Aside: A list of concentration inequalities Markov Chebychev Chernoff Hoeffding Bernstein

3 Effron-Stein Azuma Mcdiarmid Talagrand Linear imensionality Reduction The goal of JL is to presere l distances. However, depending on what property you care about, there will be different linear dimensionality reduction techniques appropriate for your task. Common linear dimensionality reduction techniques: RP (Random Projection) PCA (Principal Component Analysis) LA (Linear iscriminate Analysis) MS (Multi-dimensional Scaling) ICA/BSS (Independent Component Analysis/Blind Source Separation) CCA (Canonical Correlation Analysis) ML (istance Metric Learning) Factor (Factor Analysis) NMF/MF ((Non-negative) Matrix Factorization).1 RP (Random Projection) Method For random projection, P R d with P ij = N (0, 1). i.e. [ ] N (0, 1) N (0, 1) P =.... If a projection matrix is wanted, apply Gram-Schmidt to P..1.1 Practical application PCA has time complexity O(n 3 ), if we assume both the number of data points and the number of features are equal to n. In practice, if we are working with large amounts of data our first instinct to speed up PCA might be to subsample the data, i.e. if our dataset has 10k samples in R 10k, randomly subsample 1k points and perform PCA on this reduced dataset. However, the quality of the k th eigenvector of the subsampled data decays with respect to the k th eigenvector of the full dataset. While the first eigenvector of the subsampled and full dataset will be similar, all subsequent eigenvectors of the subsampled data will be of worsening quality. 3

4 The better approach to speeding up PCA is to first do a random projection, and then perform PCA. While the distances between points will be distorted within 1 ± ɛ, the quality of the eigenvectors will be better.. PCA (Principal Component Analysis)..1 Outline ata: x 1, x,..., x n R d Goal: Find the best linear transformation φ : R d R k that best maintains reconstruction accuracy. Equivalently, minimize aggregate residual error. efine: Π k : R d R d minimize 1 n n i=1 x i Π k (x i ) Figure 1: An illustration of PCA.. Method A k dimensional subspace can be represented by q 1,..., q k R d orthonormal vectors. The projection of any x R d in the span(q 1,..., q k ) is given by ( k q i qi T )x = i=1 } {{ } Π k k (q i x)q i i=1 4

5 To represent it in R k (using basis q 1,..., q k ) the coefficients simply are: (q 1, x),..., (q k, x)...3 The k = 1 case In k = 1 case, the objective is the following: 1 minimize n q =1 n i=1 x i (qq T )x i Equivalently, maximize q =1 q T ( 1 n XXT )q, where 1 n XXT is the covariance of data, if the data is mean-centered. The solution is the top eigenvector (1/n)XX T. Remark: For any q, the quadratic form q T ( 1 n XXT )q is the empirical variance of data in the direction q...4 General k case The general k case is similar and the solution is basically the top k eigenvectors of the matrix XX T..3 LA (Linear iscriminate Analysis).3.1 Motivation The goal of PCA is to minimize the reconstruction error. However, this is not necessarily going to help for classification purposes..3. Method efine µ 1 = 1 C 1 x C 1 x and µ = 1 One intuitive formulation would be: It is easy to see that w = µ 1 µ µ 1 µ. C x C x. efine µ 1 = w T µ 1 and µ = w T µ max L(w) = µ 1 µ = w T µ 1 w T µ w,s.t. w =1 Problem: Maintaining the distance between the means is not sufficient. What we want are for the: class means to be as far as possible after projection, and class variances to be as small as possible. 5

6 After projection, denote s 1 = x C 1 ( x µ 1 ), s = x C ( x µ ). The formulation according to our criteria would be the following: max w L(w) = ( µ 1 µ ) s 1 + s We will expand each term in this equation in the following: s 1 = x C 1 ( x µ 1 ) Similarly, s = wt S w = x C 1 (w T x w T µ 1 ) = w T ( x C 1 (x µ 1 )(x µ 1 ) T ) w = w T S 1 w where S 1 is the scatter matrix for 1 Note: rank of S B = 1. Thus, ( µ 1 µ ) = (w T (µ 1 µ )) = w T (µ 1 µ )(µ 1 µ ) T w = w T S B w where S B is the between class scatter matrix L(w) = wt S B w w T S w w It follows that it has derivative (if we ignore its denominator): When we divide it by w T S w w, we get: If we let it equal to 0, we get: d dw L(w) = wt S w ws B w w T S B ws w w S B w wt S B w w T S w w (S ww) = S B w L(w)(S w w) S B w = S w L(w)w S 1 w S B w = L(w)w Maximizing L(w) finding eigenvector corresponding to the largest eigenvalues of S 1 w S B. Remark: When you have c-classes, LA will find a c 1 subspace. 6

7 .4 MS (Multi-dimensional Scaling) MS is useful when you don t have a Euclidean representation of the data, and wish to come up with one based on distances between data points. There are three types of MS: classical MS metric MS non-metric MS Here, we are mostly going to discuss metric MS..4.1 Metric MS method Given a distance matrix P R n n, define the stress function, for x i R k i = 1,..., n, to be the following: S(x 1,..., x n ) = ( x i x j p ij ) i<j We want to minimize the stress function over the data points x 1,..., x n. The problem can be formulated as the following: min S(x 1,..., x n ) s.t. x i = 0 There is no easy way to write this as an eigenvalue problem, so to solve this we can simply do any gradient based optimization..4. Non-metric MS The goal on non-metric MS is to maintain the order of distances between data points. For example, if x i is closer to x j than x k, then maintain this ordering. Can make x i x j into any monotonic function. References [1] An Elementary Proof of a Theorem of Johnson and Lindenstrauss. Sanjoy asgupta, Anupam Gupta. [] Verma 4771 Machine Learning Lecture Slide: classes/ml/lec/lec8_unsupervised_dim_redux.pdf [3] Linear imensionality Reduction- Survey, Insights, and Generalizations. John P. Cunningham, Zoubin Ghahramani. 7

Principal Component Analysis

Principal Component Analysis CSci 5525: Machine Learning Dec 3, 2008 The Main Idea Given a dataset X = {x 1,..., x N } The Main Idea Given a dataset X = {x 1,..., x N } Find a low-dimensional linear projection The Main Idea Given