Methods for sparse analysis of high-dimensional data, II

Rachel Ward, May 23, 2011

High-dimensional data with low-dimensional structure: 300 by 300 pixel images = 90,000 dimensions.


We need to recall some Euclidean geometry, statistics, and linear algebra.

Euclidean Geometry

An element of $\mathbb{R}^n$ is written $x = (x_1, x_2, \ldots, x_n)$.
$\mathbb{R}^n$ is a vector space:
$x + y = (x_1 + y_1,\ x_2 + y_2,\ \ldots,\ x_n + y_n)$
$ax = (ax_1,\ ax_2,\ \ldots,\ ax_n)$
$x = (x_1, x_2, \ldots, x_n) = \sum_{j=1}^n x_j e_j$, where $e_1 = (1, 0, \ldots, 0)$, $e_2 = (0, 1, \ldots, 0)$, ..., $e_n = (0, 0, \ldots, 1)$ are the standard basis vectors.

The inner product between $x$ and $y$ is
$\langle x, y\rangle = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n = \sum_{j=1}^n x_j y_j$
$\|x\| := \langle x, x\rangle^{1/2} = (x_1^2 + x_2^2 + \cdots + x_n^2)^{1/2}$ is the Euclidean length of $x$. It is a norm:
$\|x\| = 0$ if and only if $x = 0$
$\|ax\| = |a|\,\|x\|$
triangle inequality: $\|x + y\| \le \|x\| + \|y\|$
$\dfrac{\langle x, y\rangle}{\|x\|\,\|y\|} = \cos(\theta)$, where $\theta$ is the angle between $x$ and $y$
$x$ and $y$ are orthogonal (perpendicular) if and only if $\langle x, y\rangle = 0$
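
These definitions translate directly into a few lines of NumPy. The following minimal sketch (not part of the original slides) computes the inner product, norms, and angle for two small example vectors and checks orthogonality:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0])
y = np.array([2.0, 0.0, -1.0])

inner = np.dot(x, y)                   # <x, y> = sum_j x_j y_j
norm_x = np.sqrt(np.dot(x, x))         # ||x|| = <x, x>^{1/2}
norm_y = np.linalg.norm(y)
cos_theta = inner / (norm_x * norm_y)  # cosine of the angle between x and y

print(inner, norm_x, norm_y, cos_theta)
print("orthogonal:", np.isclose(inner, 0.0))   # True for this pair
```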

Statistics

$x = (x_1, x_2, x_3, \ldots, x_n) \in \mathbb{R}^n$
Sample mean: $\bar{x} = \frac{1}{n}\sum_{j=1}^n x_j$
Standard deviation: $s = \sqrt{\dfrac{\sum_{j=1}^n (x_j - \bar{x})^2}{n-1}} = \sqrt{\tfrac{1}{n-1}\,\langle x - \bar{x},\ x - \bar{x}\rangle}$
(Here $x - \bar{x}$ denotes the vector obtained by subtracting the sample mean from each entry of $x$.)

Variance: $s^2 = \frac{1}{n-1}\,\langle x - \bar{x},\ x - \bar{x}\rangle = \frac{1}{n-1}\,\|x - \bar{x}\|^2$
Suppose we have $p$ data vectors $\{x_1, x_2, \ldots, x_p\}$.
Covariance: $\mathrm{cov}(x_j, x_k) = \frac{1}{n-1}\,\langle x_j - \bar{x}_j,\ x_k - \bar{x}_k\rangle$
Covariance matrix for 3 data vectors $\{x_1, x_2, x_3\}$:
$$C = \begin{pmatrix} \mathrm{cov}(x_1, x_1) & \mathrm{cov}(x_1, x_2) & \mathrm{cov}(x_1, x_3) \\ \mathrm{cov}(x_2, x_1) & \mathrm{cov}(x_2, x_2) & \mathrm{cov}(x_2, x_3) \\ \mathrm{cov}(x_3, x_1) & \mathrm{cov}(x_3, x_2) & \mathrm{cov}(x_3, x_3) \end{pmatrix}$$
The covariance matrix for $p$ data vectors has $p$ rows and $p$ columns.
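
The construction above is easy to check numerically. Here is a short NumPy sketch (illustrative, not from the slides) that forms the $p \times p$ covariance matrix of $p$ data vectors with $n$ samples each and compares it against NumPy's built-in routine:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                      # n samples per vector, p data vectors
X = rng.normal(size=(p, n))        # rows are the data vectors x_1, ..., x_p

# Center each data vector, then C[j, k] = <x_j - mean_j, x_k - mean_k> / (n - 1)
Xc = X - X.mean(axis=1, keepdims=True)
C = Xc @ Xc.T / (n - 1)

print(C.shape)                     # (p, p)
print(np.allclose(C, np.cov(X)))   # matches np.cov, which also normalizes by n - 1
```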

What does the covariance matrix look like?

Linear Algebra

Eigenvectors
Suppose $A$ is a $p \times p$ matrix. If $Av = \lambda v$, then we say $v$ is an eigenvector of $A$ with eigenvalue $\lambda$.
Are these eigenvectors?
$A = \begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}, \quad v = \begin{pmatrix} 1 \\ 3 \end{pmatrix}$
$A = \begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}, \quad v = \begin{pmatrix} 3 \\ 2 \end{pmatrix}$
If $v$ is an eigenvector of $A$ with eigenvalue $\lambda$, then $\alpha v$ is also an eigenvector of $A$ with eigenvalue $\lambda$. We will always use the normalized eigenvector, $\|v\| = 1$.
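
The question on this slide can be settled numerically: a nonzero $v$ is an eigenvector exactly when $Av$ is a constant multiple of $v$. A tiny check (illustrative, not from the slides):

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])

for v in (np.array([1.0, 3.0]), np.array([3.0, 2.0])):
    Av = A @ v
    ratios = Av / v                  # constant ratios  <=>  Av is a multiple of v
    is_eig = np.allclose(ratios, ratios[0])
    print(v, Av, "eigenvector" if is_eig else "not an eigenvector")
# (1, 3) is not an eigenvector; (3, 2) is, with eigenvalue 4.
```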

Any real-valued symmetric matrix $C$ has $n$ eigenvectors $\{v_1, v_2, \ldots, v_n\}$ which form an orthonormal basis for $\mathbb{R}^n$ (a.k.a. a rotated coordinate view).
Any $x \in \mathbb{R}^n$ can be expressed in this basis via $x = \sum_{j=1}^n \langle x, v_j\rangle\, v_j$.
$Cx = \sum_{j=1}^n \lambda_j\, \langle x, v_j\rangle\, v_j$
$C = PDP^{-1}$ is diagonalizable, where the eigenvectors are the columns of $P$:
$$P = \begin{pmatrix} v_1 & v_2 & \cdots & v_n \end{pmatrix}, \qquad D = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}$$
Since the columns of $P$ are orthonormal, $P^{-1} = P^T$.
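
A short NumPy check of this decomposition (an illustration, not part of the slides): build a random symmetric matrix, compute its orthonormal eigenvectors with np.linalg.eigh, and verify $C = PDP^{-1} = PDP^T$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
B = rng.normal(size=(n, n))
C = (B + B.T) / 2                       # a real symmetric matrix

eigvals, P = np.linalg.eigh(C)          # columns of P are orthonormal eigenvectors
D = np.diag(eigvals)

print(np.allclose(P @ P.T, np.eye(n)))  # P is orthogonal, so P^{-1} = P^T
print(np.allclose(C, P @ D @ P.T))      # C = P D P^{-1}
```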

Example
$x = (7.5,\ 1.5,\ 6.6,\ 5.7,\ 9.3,\ 6.9,\ 6,\ 3,\ 4.5,\ 3.3)$
$y = (7.2,\ 2.1,\ 8.7,\ 6.6,\ 9,\ 8.1,\ 4.8,\ 3.3,\ 4.8,\ 2.7)$
$\mathrm{cov}(x, y) = \frac{1}{n-1}\,\langle x - \bar{x},\ y - \bar{y}\rangle$
$$C = \begin{pmatrix} \mathrm{cov}(x, x) & \mathrm{cov}(x, y) \\ \mathrm{cov}(x, y) & \mathrm{cov}(y, y) \end{pmatrix} = \begin{pmatrix} 5.549 & 5.539 \\ 5.539 & 6.449 \end{pmatrix}$$

Eigenvectors / eigenvalues for $C$:
$v_1 = \begin{pmatrix} .6780 \\ .7352 \end{pmatrix}, \quad \lambda_1 = 11.5562$
$v_2 = \begin{pmatrix} -.7352 \\ .6780 \end{pmatrix}, \quad \lambda_2 = .4418$
Figure: $x - \bar{x}$ vs. $y - \bar{y}$
We call $v_1$ the first principal component of the data $(x, y)$, $v_2$ the second principal component, and so on.
Prove: $v_1$ is in the direction of the least squares fit to the centered data $(x_j - \bar{x},\ y_j - \bar{y})$, $j = 1, 2, \ldots, n$.
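
The numbers above can be reproduced with a few lines of NumPy (a quick check, not part of the original slides). Note that np.linalg.eigh returns eigenvalues in ascending order, and eigenvectors are determined only up to sign:

```python
import numpy as np

x = np.array([7.5, 1.5, 6.6, 5.7, 9.3, 6.9, 6, 3, 4.5, 3.3])
y = np.array([7.2, 2.1, 8.7, 6.6, 9, 8.1, 4.8, 3.3, 4.8, 2.7])

C = np.cov(np.vstack([x, y]))         # 2 x 2 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)  # ascending eigenvalues; columns are eigenvectors

print(C)               # approx [[5.549, 5.539], [5.539, 6.449]]
print(eigvals)         # approx [0.4418, 11.5562]
print(eigvecs[:, -1])  # first principal component, approx (0.678, 0.735) up to sign
```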

Principal component analysis
Figure: Original data and projection onto first principal component
Figure: Residual

Principal component analysis
Best-fit ellipsoid to the data

Principal component analysis
The covariance matrix is written as $C = PDP^{-1}$, where the columns of $P$ are the eigenvectors $v_1, v_2, \ldots, v_n$ and
$$D = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}$$
Suppose that $C$ is $n \times n$ but $\lambda_{k+1} = \cdots = \lambda_n = 0$. Then the underlying data is low-rank.
Suppose that $C$ is $n \times n$ but $\lambda_{k+1}$ through $\lambda_n$ are very small. Then the underlying data is approximately low-rank.

Eigenfaces
The first few principal components (a.k.a. eigenvectors of the covariance matrix) for a database of many faces. Different components accentuate different facial characteristics.

Eigenfaces
The top left face is the projection of the bottom right face onto its first principal component. Each new image from left to right corresponds to using 8 additional principal components for the reconstruction.

Eigenfaces
The projections of non-face images onto the first few principal components.

Reducing dimensionality using random projections

Principal components: directions of projection are data-dependent.
Random projections: directions of projection are independent of the data.
Why not always use principal components?
1. May not have access to all the data at once, as in data streaming.
2. Computing principal components (eigenvectors) is computationally expensive in high dimensions: $O(kn^2)$ flops to compute $k$ principal components.

Data streaming
Massive amounts of data arrive in small time increments.
Often past data cannot be accumulated and stored; when they can, access is expensive.

Data streaming
$x = (x_1, x_2, \ldots, x_n)$ at times $(t_1, t_2, \ldots, t_n)$, and $\tilde{x} = (\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n)$ at times $(t_1 + \Delta t,\ t_2 + \Delta t,\ \ldots,\ t_n + \Delta t)$.
Summary statistics that can be computed in one pass:
Mean value: $\bar{x} = \frac{1}{n}\sum_{j=1}^n x_j$
Euclidean length: $\|x\|^2 = \sum_{j=1}^n x_j^2$
Variance: $\sigma^2(x) = \frac{1}{n}\sum_{j=1}^n (x_j - \bar{x})^2$
What about the correlation $\langle x - \bar{x},\ y - \bar{y}\rangle / \big(\sigma(x)\,\sigma(y)\big)$, used to assess the risk of a stock $x$ against the market $y$?
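
The one-pass claim is concrete in code: only a few running sums are needed, and the mean, squared Euclidean length, and variance are recovered at the end without storing the stream. A small Python sketch (illustrative, not from the slides):

```python
def one_pass_stats(stream):
    """Compute mean, squared Euclidean length, and variance in a single pass."""
    n = 0
    total = 0.0     # running sum of x_j
    total_sq = 0.0  # running sum of x_j^2
    for value in stream:
        n += 1
        total += value
        total_sq += value * value
    mean = total / n
    length_sq = total_sq                   # ||x||^2
    variance = total_sq / n - mean * mean  # equals (1/n) * sum (x_j - mean)^2
    return mean, length_sq, variance

print(one_pass_stats(iter([7.5, 1.5, 6.6, 5.7, 9.3])))
```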

Approach: introduce randomness
Consider $x = (x_1, x_2, \ldots, x_n)$ and a vector $g = (g_1, g_2, \ldots, g_n)$ of independent and identically distributed (i.i.d.) standard normal Gaussian random variables:
$g_j \sim \mathcal{N}(0, 1), \qquad \mathbb{P}(g_j \le s) = \int_{-\infty}^{s} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt$
Consider
$u = \langle g, x\rangle - \langle g, \tilde{x}\rangle = (g_1 x_1 + g_2 x_2 + \cdots + g_n x_n) - (g_1 \tilde{x}_1 + g_2 \tilde{x}_2 + \cdots + g_n \tilde{x}_n) = \langle g,\ x - \tilde{x}\rangle$
Theorem
For fixed vectors $x, y \in \mathbb{R}^n$, $\ \mathbb{E}\,\big|\langle g,\ x - y\rangle\big|^2 = \|x - y\|^2$.
(This follows by expanding the square and using $\mathbb{E}[g_j g_k] = 1$ if $j = k$ and $0$ otherwise.)

For an $m \times n$ matrix $\Phi$ with i.i.d. Gaussian entries $\varphi_{i,j} \sim \mathcal{N}(0, 1)$, whose rows we denote $g_1, \ldots, g_m$,
$\mathbb{E}\Big(\tfrac{1}{m}\,\|\Phi(x - y)\|^2\Big) = \tfrac{1}{m}\sum_{i=1}^{m} \mathbb{E}\,\big|\langle g_i,\ x - y\rangle\big|^2 = \|x - y\|^2$
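
A quick Monte Carlo check of this identity (a sketch under the stated Gaussian model; the dimensions and number of trials are arbitrary choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, trials = 50, 20, 2000

x = rng.normal(size=n)
y = rng.normal(size=n)
true_dist_sq = np.sum((x - y) ** 2)

# Average (1/m) * ||Phi (x - y)||^2 over many independent Gaussian matrices Phi
estimates = [np.sum((rng.normal(size=(m, n)) @ (x - y)) ** 2) / m
             for _ in range(trials)]

print(true_dist_sq, np.mean(estimates))  # the two numbers should be close
```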

Approach: introduce randomness
Concentration around the expectation: for a fixed $x \in \mathbb{R}^n$,
$\mathbb{P}\Big(\tfrac{1}{m}\,\|\Phi x\|^2 \ge (1 + \varepsilon)\,\|x\|^2\Big) \le \exp\big(-\tfrac{m}{4}\,\varepsilon^2\big)$
For $p$ vectors $\{x_1, x_2, \ldots, x_p\}$ in $\mathbb{R}^n$, a union bound gives
$\mathbb{P}\Big(\exists\, x_j :\ \tfrac{1}{m}\,\|\Phi x_j\|^2 \ge (1 + \varepsilon)\,\|x_j\|^2\Big) \le \exp\big(\log p - \tfrac{m}{4}\,\varepsilon^2\big)$
How small can $m$ be such that this probability is still small?

Geometric intuition
The linear map $x \mapsto \tfrac{1}{\sqrt{m}}\,\Phi x$ is similar to a random projection onto an $m$-dimensional subspace of $\mathbb{R}^n$: most projections preserve geometry, but not all.

Measure concentration for Gaussian matrices
Theorem (Concentration of lengths / Johnson-Lindenstrauss)
Fix an accuracy $\varepsilon > 0$ and a probability of failure $\eta > 0$. Fix an integer $m \ge 10\,\varepsilon^{-2}\log(p)$, and fix an $m \times n$ Gaussian random matrix $\Phi$. Then with probability greater than $1 - \eta$,
$\Big|\ \tfrac{1}{\sqrt{m}}\,\|\Phi x_j - \Phi x_k\| - \|x_j - x_k\|\ \Big| \le \varepsilon\,\|x_j - x_k\|$
for all $j$ and $k$.
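
The theorem is easy to test empirically. The sketch below (illustrative; the choices of $n$, $p$, and $\varepsilon$ are arbitrary) projects $p$ random points with a Gaussian matrix of the size suggested by the theorem and reports the worst relative distortion over all pairs:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, p, eps = 1000, 50, 0.25
m = int(np.ceil(10 * eps**-2 * np.log(p)))  # m ~ 10 eps^{-2} log p, as in the theorem

X = rng.normal(size=(p, n))                 # p points in R^n
Phi = rng.normal(size=(m, n))               # Gaussian random matrix
Y = X @ Phi.T / np.sqrt(m)                  # projected points, scaled by 1/sqrt(m)

worst = 0.0
for j, k in combinations(range(p), 2):
    d_orig = np.linalg.norm(X[j] - X[k])
    d_proj = np.linalg.norm(Y[j] - Y[k])
    worst = max(worst, abs(d_proj - d_orig) / d_orig)

print(m, worst)  # worst relative distortion over all pairs; typically well below eps
```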

Corollary (Concentration for inner products)
Fix an accuracy $\varepsilon > 0$ and a probability of failure $\eta > 0$. Fix an integer $m \ge 10\,\varepsilon^{-2}\log(p)$, and fix an $m \times n$ Gaussian random matrix $\Phi$. Then with probability greater than $1 - \eta$,
$\Big|\ \tfrac{1}{m}\,\langle \Phi x_j,\ \Phi x_k\rangle - \langle x_j,\ x_k\rangle\ \Big| \le \tfrac{\varepsilon}{2}\,\big(\|x_j\|^2 + \|x_k\|^2\big)$
for all $j$ and $k$.

Nearest neighbors

The nearest-neighbors problem
Find the closest point to a point $q$ from among a set of points $S = \{x_1, x_2, \ldots, x_p\}$.
Originally called the post-office problem (1973).

Applications
Similarity searching...

The nearest-neighbors problem
Find the closest point to a point $q \in \mathbb{R}^N$ from among a set of points $S = \{x_1, x_2, \ldots, x_p\}$:
$x_* = \arg\min_{x_j \in S} \|q - x_j\|^2 = \arg\min_{x_j \in S} \sum_{k=1}^{N} \big(q(k) - x_j(k)\big)^2$
Computational cost (number of flops) per search: $O(Np)$.
Computational cost of $m$ searches: $O(Nmp)$.
Curse of dimensionality: if $N$ and $p$ are large, this is a lot of flops!

The $\varepsilon$-approximate nearest-neighbors problem
Given a tolerance $\varepsilon > 0$ and a point $q \in \mathbb{R}^N$, return a point $x_\varepsilon$ from the set $S = \{x_1, x_2, \ldots, x_p\}$ which is an $\varepsilon$-approximate nearest neighbor to $q$:
$\|q - x_\varepsilon\| \le (1 + \varepsilon)\,\|q - x_*\|$
This problem can be solved using random projections:
Let $\Phi$ be an $m \times N$ Gaussian random matrix, where $m = 10\,\varepsilon^{-2}\log p$.
Compute $r = \Phi q$, and for all $j = 1, \ldots, p$ compute $u_j = \Phi x_j$. Computational cost: $O(Np\log(p))$.
Compute $x_\varepsilon = \arg\min_{x_j \in S} \|r - u_j\|$. Computational cost of $m$ searches: $O(pm\log(p + m))$.
Total computational cost: $O\big((N + m)\,p\log(p + m)\big) \ll O(Np^2)$!
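
A compact NumPy sketch of this procedure (illustrative; it uses the projection dimension from the slide, a brute-force search in the projected space, and arbitrary problem sizes):

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, eps = 500, 2000, 0.5
m = int(np.ceil(10 * eps**-2 * np.log(p)))  # projection dimension from the slide

S = rng.normal(size=(p, N))                 # the point set, one point per row
q = rng.normal(size=N)                      # the query point

Phi = rng.normal(size=(m, N)) / np.sqrt(m)  # scaled Gaussian random matrix
U = S @ Phi.T                               # u_j = Phi x_j for all j
r = q @ Phi.T                               # r = Phi q

j_approx = np.argmin(np.sum((U - r) ** 2, axis=1))  # nearest neighbor in R^m
j_exact = np.argmin(np.sum((S - q) ** 2, axis=1))   # nearest neighbor in R^N

ratio = np.linalg.norm(q - S[j_approx]) / np.linalg.norm(q - S[j_exact])
print(j_approx == j_exact, ratio)           # ratio should be at most about 1 + eps
```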

Random projections and sparse recovery

Theorem (Subspace preservation)
Suppose that $\Phi$ is an $m \times n$ random matrix with the distance-preservation property: for any fixed $x$,
$\mathbb{P}\Big(\big|\,\|\Phi x\|^2 - \|x\|^2\,\big| \ge \varepsilon\,\|x\|^2\Big) \le 2e^{-c_\varepsilon m}$
Let $k \le c'_\varepsilon m$ and let $T_k$ be a $k$-dimensional subspace of $\mathbb{R}^n$. Then
$\mathbb{P}\Big(\text{for all } x \in T_k:\ (1 - \varepsilon)\,\|x\|^2 \le \|\Phi x\|^2 \le (1 + \varepsilon)\,\|x\|^2\Big) \ge 1 - e^{-c''_\varepsilon m}$
Outline of proof: an $\varepsilon$-cover and the Vitali covering lemma; a continuity argument.

Sparse recovery and RIP
Restricted Isometry Property (RIP) of order $k$: $\Phi$ has the RIP of order $k$ if for all $k$-sparse vectors $x \in \mathbb{R}^n$,
$.8\,\|x\|^2 \le \|\Phi x\|^2 \le 1.2\,\|x\|^2$
Theorem
If $\Phi$ has the RIP of order $k$, then for all $k$-sparse vectors $x$ such that $\Phi x = b$,
$x = \arg\min\Big\{\ \textstyle\sum_{j=1}^{n} |z(j)| \ :\ \Phi z = b,\ z \in \mathbb{R}^n\ \Big\}$
(that is, $x$ is recovered by $\ell_1$-minimization).
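
The $\ell_1$-minimization above can be posed as a linear program. The sketch below (not from the slides; it assumes SciPy is available and uses the standard reformulation over variables $[z, t]$ with $-t \le z \le t$) recovers a synthetic $k$-sparse vector from $m < n$ Gaussian measurements:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
n, m, k = 80, 40, 5

# k-sparse ground truth and Gaussian measurement matrix
x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = rng.normal(size=k)
Phi = rng.normal(size=(m, n)) / np.sqrt(m)
b = Phi @ x

# Basis pursuit:  min sum_j |z_j|  subject to  Phi z = b,
# written as an LP over [z, t] with -t <= z <= t and objective sum(t).
c = np.concatenate([np.zeros(n), np.ones(n)])
A_eq = np.hstack([Phi, np.zeros((m, n))])
A_ub = np.block([[np.eye(n), -np.eye(n)],
                 [-np.eye(n), -np.eye(n)]])
b_ub = np.zeros(2 * n)
bounds = [(None, None)] * n + [(0, None)] * n

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b, bounds=bounds)
z = res.x[:n]
print(np.max(np.abs(z - x)))  # recovery error; should be close to zero
```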

Theorem (Distance-preservation implies RIP)
Suppose that $\Phi$ is an $m \times N$ random matrix with the subspace-preservation property: for every $k$-dimensional subspace $T_k$,
$\mathbb{P}\Big(\text{for all } x \in T_k:\ (1 - \varepsilon)\,\|x\|^2 \le \|\Phi x\|^2 \le (1 + \varepsilon)\,\|x\|^2\Big) \ge 1 - e^{-c_\varepsilon m}$
Then with probability greater than $.99$,
$(1 - \varepsilon)\,\|x\|^2 \le \|\Phi x\|^2 \le (1 + \varepsilon)\,\|x\|^2$
for all $x$ of sparsity level $k \le c'_\varepsilon\, m / \log(N)$.
Outline of proof: the bound for a fixed subspace $T_k$, followed by a union bound over all $\binom{N}{k} \le N^k$ subspaces of $k$-sparse vectors.

Fast principal component analysis

Principal component analysis
Figure: Original data and projection onto first principal component
Figure: Residual

Principal components in higher dimensions
Computing principal components is expensive: use fast randomized algorithms for approximate PCA.

Randomized principal component analysis
The first principal component is the eigenvector $v_1 = \big(v_1(1), \ldots, v_1(n)\big)$ of the covariance matrix $C = PDP^{-1}$ corresponding to the largest eigenvalue, where the eigenvectors are the columns of $P$:
$$P = \begin{pmatrix} v_1(1) & v_2(1) & \cdots & v_n(1) \\ v_1(2) & v_2(2) & \cdots & v_n(2) \\ \vdots & & & \vdots \\ v_1(n) & v_2(n) & \cdots & v_n(n) \end{pmatrix}, \qquad D = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}$$
The power method for computing the largest principal component is based on the observation: if $x_0$ is a random Gaussian vector and $x_{j+1} = Cx_j$, then $x_j / \|x_j\| \to v_1$ (up to sign).
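
A minimal power-method sketch (an illustration, not part of the slides); it assumes the largest eigenvalue is strictly dominant and returns a unit vector aligned with $v_1$ up to sign:

```python
import numpy as np

def power_method(C, num_iters=200, seed=0):
    """Estimate the leading eigenvector of a symmetric matrix C by repeated multiplication."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=C.shape[0])  # random Gaussian starting vector x_0
    for _ in range(num_iters):
        x = C @ x                    # x_{j+1} = C x_j
        x /= np.linalg.norm(x)       # normalize to avoid overflow/underflow
    return x

# Example: leading principal component of the 2 x 2 covariance matrix computed earlier
C = np.array([[5.549, 5.539],
              [5.539, 6.449]])
print(power_method(C))               # approx (0.678, 0.735), up to sign
```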

Randomized principal component analysis
If $C = PDP^{-1}$ is a rank-$k$ (or approximately rank-$k$) matrix, then all of its principal components can be computed using $2k$ Gaussian random vectors.
For a more accurate approximate PCA, do more iterations of the power method.
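
One way to realize this idea is a randomized range-finder style sketch (an illustration under the stated low-rank assumption; the specific steps and parameters below are illustrative choices, not taken from the slides): multiply $C$ against $2k$ Gaussian random vectors, orthonormalize, and diagonalize the small projected matrix.

```python
import numpy as np

def randomized_pca(C, k, n_iter=2, seed=0):
    """Approximate the top-k eigenpairs of a symmetric matrix C using 2k random vectors."""
    rng = np.random.default_rng(seed)
    n = C.shape[0]
    Omega = rng.normal(size=(n, 2 * k))  # 2k Gaussian random vectors
    Y = C @ Omega
    for _ in range(n_iter):              # extra power iterations improve accuracy
        Y = C @ Y
    Q, _ = np.linalg.qr(Y)               # orthonormal basis for the range of Y
    B = Q.T @ C @ Q                      # small (2k x 2k) projected matrix
    eigvals, W = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:k]  # keep the k largest
    return eigvals[idx], Q @ W[:, idx]   # approximate eigenvalues / principal components

# Example on a synthetic, approximately rank-3 covariance matrix
rng = np.random.default_rng(6)
A = rng.normal(size=(100, 3))
C = A @ A.T + 1e-6 * np.eye(100)
vals, vecs = randomized_pca(C, k=3)
print(np.allclose(np.sort(vals), np.linalg.eigvalsh(C)[-3:], rtol=1e-3))  # True
```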