
Methods for sparse analysis of high-dimensional data, II
Rachel Ward, May 26, 2011

High-dimensional data with low-dimensional structure: 300 by 300 pixel images = 90,000 dimensions


We need to recall some... Euclidean geometry, statistics, linear algebra.

Euclidean Geometry

An element of R^n is written x = (x_1, x_2, ..., x_n). R^n is a vector space:
x + y = (x_1 + y_1, x_2 + y_2, ..., x_n + y_n)
ax = (ax_1, ax_2, ..., ax_n)
x = (x_1, x_2, ..., x_n) = \sum_{j=1}^n x_j e_j, where e_1 = (1, 0, ..., 0), e_2 = (0, 1, ..., 0), ..., e_n = (0, 0, ..., 1) are the standard basis vectors.

The inner product between x and y is ⟨x, y⟩ = x_1 y_1 + x_2 y_2 + ... + x_n y_n = \sum_{j=1}^n x_j y_j.
‖x‖ := ⟨x, x⟩^{1/2} = (x_1^2 + x_2^2 + ... + x_n^2)^{1/2} is the Euclidean length of x. It is a norm:
‖x‖ = 0 if and only if x = 0
‖ax‖ = |a| ‖x‖
triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖
⟨x, y⟩ / (‖x‖ ‖y‖) = cos(θ), where θ is the angle between x and y
x and y are orthogonal (perpendicular) if and only if ⟨x, y⟩ = 0

Statistics

x = (x_1, x_2, x_3, ..., x_n) ∈ R^n
Sample mean: x̄ = (1/n) \sum_{j=1}^n x_j
Standard deviation: s = ( \sum_{j=1}^n (x_j − x̄)^2 / (n − 1) )^{1/2} = (n − 1)^{−1/2} ⟨x − x̄, x − x̄⟩^{1/2}, where x − x̄ denotes the vector with entries x_j − x̄

Variance: s^2 = (1/(n − 1)) ⟨x − x̄, x − x̄⟩ = (1/(n − 1)) ‖x − x̄‖^2
Suppose we have p data vectors {x_1, x_2, ..., x_p}.
Covariance: Cov(x_j, x_k) = (1/(n − 1)) ⟨x_j − x̄_j, x_k − x̄_k⟩
Covariance matrix for 3 data vectors {x_1, x_2, x_3}:
C = [ cov(x_1, x_1)  cov(x_1, x_2)  cov(x_1, x_3)
      cov(x_2, x_1)  cov(x_2, x_2)  cov(x_2, x_3)
      cov(x_3, x_1)  cov(x_3, x_2)  cov(x_3, x_3) ]
The covariance matrix for p data vectors has p rows and p columns.
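To make the definition concrete, here is a minimal NumPy sketch (my own illustration, not from the slides): it builds the p × p covariance matrix of p data vectors stored as rows of an array, directly from the formula above, and checks it against numpy.cov.

```python
import numpy as np

def covariance_matrix(X):
    """Covariance matrix of the p data vectors stored as the rows of X."""
    p, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)   # subtract each vector's sample mean
    return (Xc @ Xc.T) / (n - 1)             # entry (j, k) = <x_j - mean, x_k - mean> / (n - 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))                # p = 3 data vectors, each with n = 100 entries
C = covariance_matrix(X)
print(C.shape)                               # (3, 3): p rows and p columns
print(np.allclose(C, np.cov(X)))             # matches NumPy's built-in estimator
```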

What does the covariance matrix look like?

Linear Algebra

Eigenvectors. Suppose A is a p × p matrix. If Av = λv, then we say v is an eigenvector of A with eigenvalue λ.
Are these eigenvectors?
A = [ 2 3; 2 1 ], v = (1, 3)
A = [ 2 3; 2 1 ], v = (3, 2)
If v is an eigenvector of A with eigenvalue λ, then αv is also an eigenvector of A with eigenvalue λ. We will always use the normalized eigenvector, ‖v‖ = 1.

Any real-valued and symmetric matrix C has n eigenvectors {v_1, v_2, ..., v_n} which form an orthonormal basis for R^n (a.k.a. a rotated coordinate view). Any x ∈ R^n can be expressed in this basis via x = \sum_{j=1}^n ⟨x, v_j⟩ v_j, and then Cx = \sum_{j=1}^n λ_j ⟨x, v_j⟩ v_j.
Equivalently, C = PDP^{-1} is diagonalizable, where the columns of P = [ v_1 v_2 ... v_n ] are the eigenvectors and D = diag(λ_1, λ_2, ..., λ_n).
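A small NumPy sketch of this diagonalization (the matrix and names are placeholders of my choosing): it computes the orthonormal eigenvectors of a random symmetric matrix with numpy.linalg.eigh and verifies both C = PDP^T and the expansion x = \sum_j ⟨x, v_j⟩ v_j.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
C = (A + A.T) / 2                            # a real symmetric matrix

lam, P = np.linalg.eigh(C)                   # columns of P are orthonormal eigenvectors
D = np.diag(lam)
print(np.allclose(C, P @ D @ P.T))           # C = P D P^{-1} = P D P^T

x = rng.normal(size=5)
expansion = sum(np.dot(x, P[:, j]) * P[:, j] for j in range(5))
print(np.allclose(x, expansion))             # x = sum_j <x, v_j> v_j
```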

Example
x = (7.5, 1.5, 6.6, 5.7, 9.3, 6.9, 6, 3, 4.5, 3.3), y = (7.2, 2.1, 8.7, 6.6, 9, 8.1, 4.8, 3.3, 4.8, 2.7)
cov(x, y) = (1/(n − 1)) ⟨x − x̄, y − ȳ⟩
C = [ cov(x, x)  cov(x, y)
      cov(x, y)  cov(y, y) ] = [ 5.549  5.539
                                 5.539  6.449 ]

Eigenvectors / eigenvalues for C:
v_1 = (0.6780, 0.7352), λ_1 = 11.5562
v_2 = (−0.7352, 0.6780), λ_2 = 0.4418
[Figure: scatter plot of x − x̄ vs. y − ȳ]
v_1 is the first principal component of the data (x, y), v_2 the second principal component, and so on.
Prove: v_1 is in the direction of the least squares fit to the centered data (x_j − x̄, y_j − ȳ), j = 1, 2, ..., n.

Principal component analysis
[Figure: original data and its projection onto the first principal component]
[Figure: residual]
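The following sketch redoes the two-variable example above in NumPy (a minimal illustration, not the original course code): it centers the data, forms the 2 × 2 covariance matrix, extracts the principal components, and projects the centered data onto v_1.

```python
import numpy as np

x = np.array([7.5, 1.5, 6.6, 5.7, 9.3, 6.9, 6.0, 3.0, 4.5, 3.3])
y = np.array([7.2, 2.1, 8.7, 6.6, 9.0, 8.1, 4.8, 3.3, 4.8, 2.7])

X = np.vstack([x - x.mean(), y - y.mean()])   # centered data, one row per variable
C = X @ X.T / (len(x) - 1)                    # 2 x 2 covariance matrix
print(C)                                      # approx [[5.549, 5.539], [5.539, 6.449]]

lam, V = np.linalg.eigh(C)                    # eigenvalues in increasing order
v1, lam1 = V[:, -1], lam[-1]
print(v1, lam1)                               # approx (0.678, 0.735) up to sign, and 11.556

scores = v1 @ X                               # coordinate of each point along v1
projection = np.outer(v1, scores)             # rank-one projection of the centered data
residual = X - projection                     # what the first component misses
print(np.linalg.norm(residual) ** 2 / (len(x) - 1))   # approx lambda_2 = 0.442
```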

Principal component analysis: best fit ellipsoid to the data

Principal component analysis
The covariance matrix is written as C = PDP^{-1}, where the columns of P = [ v_1 v_2 ... v_n ] are the eigenvectors and D = diag(λ_1, λ_2, ..., λ_n).
Suppose that C is n × n but λ_{k+1} = ... = λ_n = 0. Then the underlying data is low-rank.
Suppose that C is n × n but λ_{k+1} through λ_n are very small. Then the underlying data is approximately low-rank.
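One way to see "approximately low-rank" concretely is to truncate the eigendecomposition after the k largest eigenvalues. The sketch below (my own example with a synthetic covariance-like matrix) does exactly that and reports the resulting spectral-norm error, which equals λ_{k+1}.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 5

Q, _ = np.linalg.qr(rng.normal(size=(n, n)))  # random orthonormal eigenvectors
lam = 2.0 ** -np.arange(n)                    # eigenvalues 1, 1/2, 1/4, ... (fast decay)
C = Q @ np.diag(lam) @ Q.T                    # synthetic covariance-like matrix

w, V = np.linalg.eigh(C)
idx = np.argsort(w)[::-1][:k]                 # indices of the k largest eigenvalues
C_k = V[:, idx] @ np.diag(w[idx]) @ V[:, idx].T

print(np.linalg.norm(C - C_k, 2))             # spectral error = lambda_{k+1} = 2**-5
```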

Eigenfaces
The first few principal components (a.k.a. eigenvectors of the covariance matrix) for a database of many faces. Different components accentuate different facial characteristics.

Eigenfaces
The top-left face is the projection of the bottom-right face onto its first principal component. Each successive image, moving left to right, uses 8 additional principal components in the reconstruction.
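A sketch of the reconstruction procedure behind these images, with a small random matrix standing in for the face database (the data, sizes, and names here are placeholders, not the experiment shown on the slides): each face is expanded in the top k eigenvectors of the covariance matrix of the database, with k growing by 8 at each step.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 200, 900                               # p "faces", each with n pixels (stand-in data)
faces = rng.normal(size=(p, n))               # placeholder for a real face database

mean_face = faces.mean(axis=0)
A = faces - mean_face                         # centered faces, one per row

# The eigenfaces are the eigenvectors of the pixel covariance matrix; here they are
# obtained as the right singular vectors of the centered data matrix A.
_, _, Vt = np.linalg.svd(A, full_matrices=False)

def reconstruct(face, k):
    """Project a face onto the span of the first k eigenfaces (plus the mean face)."""
    coords = Vt[:k] @ (face - mean_face)
    return mean_face + Vt[:k].T @ coords

target = faces[0]
for k in (1, 9, 17, 25):                      # each step uses 8 more principal components
    err = np.linalg.norm(target - reconstruct(target, k))
    print(k, round(float(err), 2))            # error shrinks as k grows
```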

Eigenfaces
The projections of non-face images onto the first few principal components.

Fast principal component analysis

Randomized principal component analysis
[Figure: original data and its projection onto the first principal component]
[Figure: residual]

Inspiration: power iteration
Suppose that A is an n × n real-valued symmetric positive semidefinite matrix (e.g., a covariance matrix). Then A has n non-negative eigenvalues and n orthonormal eigenvectors, so A is diagonalizable: A = PDP^{-1}, where the columns of P = [ v_1 v_2 ... v_n ] are the eigenvectors and D = diag(λ_1, λ_2, ..., λ_n).
Power iteration for computing v_1 and λ_1: start from a Gaussian random x_0 and iterate x_{k+1} = Ax_k / ‖Ax_k‖. If λ_1 > λ_2, then x_k → v_1 (up to sign).
Problem: unstable for computing subsequent eigenvalues and eigenvectors.
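A minimal NumPy sketch of power iteration on a symmetric positive semidefinite matrix (the fixed iteration count and the Rayleigh-quotient estimate of λ_1 are my choices):

```python
import numpy as np

def power_iteration(A, num_iters=200, seed=4):
    """Approximate the top eigenvector / eigenvalue of a symmetric PSD matrix A."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=A.shape[0])           # Gaussian random starting vector x_0
    for _ in range(num_iters):
        x = A @ x
        x /= np.linalg.norm(x)                # x_{k+1} = A x_k / ||A x_k||
    return x, x @ A @ x                       # Rayleigh quotient estimates lambda_1

rng = np.random.default_rng(5)
B = rng.normal(size=(6, 6))
A = B @ B.T                                   # symmetric positive semidefinite
v1, lam1 = power_iteration(A)
print(lam1, np.linalg.eigvalsh(A)[-1])        # the two values should agree closely
```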

Stable randomized power iteration
Consider an n × n symmetric real matrix A, and suppose we want to compute its first k eigenvectors and eigenvalues.
1. Set l = k + 10. Using a random number generator, form an n × l matrix G whose entries are i.i.d. Gaussian random variables of zero mean and unit variance.
2. Form the n × l matrix H^{(0)} = AG and compute Q^{(0)}, whose columns form an orthonormal basis for the range of H^{(0)}.
3. For j = 1, ..., q, form H^{(j)} = AQ^{(j−1)} and compute Q^{(j)}, whose columns form an orthonormal basis for the range of H^{(j)}.
4. Set Q = Q^{(q)}.
THEOREM: With high probability, ‖A − QQ^T A‖ ≤ (nk)^{1/(2q)} λ_{k+1}.
Importance: Using an FFT-based random matrix in place of G, this algorithm takes O(n^2 log(k)) flops to compute k eigenvalues / eigenvectors (compare to the standard O(n^2 k) flops).
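A minimal NumPy sketch of steps 1-4 (my implementation of the recipe above, using a QR factorization for the orthonormalization and a synthetic matrix with decaying spectrum):

```python
import numpy as np

def randomized_subspace_iteration(A, k, q=3, seed=6):
    """Return Q with l = k + 10 orthonormal columns approximately spanning
    the dominant eigenspace of the symmetric matrix A."""
    n = A.shape[0]
    l = k + 10
    rng = np.random.default_rng(seed)
    G = rng.normal(size=(n, l))               # step 1: i.i.d. N(0, 1) entries
    Q, _ = np.linalg.qr(A @ G)                # step 2: orthonormal basis for range(AG)
    for _ in range(q):                        # step 3: H^(j) = A Q^(j-1), re-orthonormalize
        Q, _ = np.linalg.qr(A @ Q)
    return Q                                  # step 4: Q = Q^(q)

rng = np.random.default_rng(7)
U, _ = np.linalg.qr(rng.normal(size=(300, 300)))
eigvals = 1.0 / (1.0 + np.arange(300)) ** 2   # decaying spectrum
A = U @ np.diag(eigvals) @ U.T                # symmetric test matrix

k = 10
Q = randomized_subspace_iteration(A, k)
err = np.linalg.norm(A - Q @ Q.T @ A, 2)
print(err, eigvals[k])                        # error is on the order of lambda_{k+1}
```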

Randomized Algorithm for PCA
THEOREM: With high probability, ‖A − QQ^T A‖ ≤ (nk)^{1/(2q)} λ_{k+1}.
Why does ‖A − QQ^T A‖ ≤ ε give us eigenvalues and eigenvectors of A?
Form B = Q^T AQ. Diagonalize B: B = V Λ V^{-1}. Form U = QV. Then
‖A − U Λ U^T‖ = ‖A − QV Λ V^{-1} Q^T‖ = ‖A − QQ^T A QQ^T‖
≤ ‖A − QQ^T A‖ + ‖QQ^T A − QQ^T A QQ^T‖
≤ ε + ‖QQ^T‖ ε = 2ε.
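Continuing in the same spirit, the eigenpair-recovery step B = Q^T AQ, B = VΛV^{-1}, U = QV looks like this in NumPy (again a self-contained sketch with a synthetic test matrix, not the authors' code):

```python
import numpy as np

def eigenpairs_from_Q(A, Q):
    """Approximate eigenpairs of a symmetric matrix A from an orthonormal basis Q."""
    B = Q.T @ A @ Q                           # small l x l matrix
    lam, V = np.linalg.eigh(B)                # diagonalize B = V Lambda V^{-1}
    U = Q @ V                                 # U = QV: approximate eigenvectors of A
    return lam, U

rng = np.random.default_rng(8)
W, _ = np.linalg.qr(rng.normal(size=(200, 200)))
A = W @ np.diag(1.0 / (1.0 + np.arange(200)) ** 2) @ W.T   # symmetric, decaying spectrum

G = rng.normal(size=(200, 20))
Q, _ = np.linalg.qr(A @ (A @ (A @ G)))        # a few applications of A, then orthonormalize

lam, U = eigenpairs_from_Q(A, Q)
print(np.sort(lam)[-3:])                      # approx 1/9, 1/4, 1 (the top eigenvalues of A)
print(np.linalg.norm(A @ U[:, -1] - lam[-1] * U[:, -1]))   # small residual for the top pair
```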

Reducing dimensionality using random projections

Principal components: directions of projection are data-dependent.
Random projections: directions of projection are independent of the data.
Two situations where we use random projections:
1. The data is so high-dimensional that it is too expensive to compute principal components directly.
2. You do not have access to all the data at once, as in data streaming.

Data streaming
Massive amounts of data arrive in small time increments. Often past data cannot be accumulated and stored, or when it can, access is expensive.

Data streaming
x = (x_1, x_2, ..., x_n) arriving at times (t_1, t_2, ..., t_n).
Summary statistics that can be computed in one pass:
Mean value: x̄ = (1/n) \sum_{j=1}^n x_j
Euclidean length: ‖x‖^2 = \sum_{j=1}^n x_j^2
Variance: σ^2(x) = (1/n) \sum_{j=1}^n (x_j − x̄)^2
Now we want to compare x to a second stream x̃ = (x̃_1, x̃_2, ..., x̃_n). The correlation ⟨x − x̄, x̃ − mean(x̃)⟩ / (σ(x) σ(x̃)) is used, for example, to assess the risk of a stock x against the market x̃.
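A sketch of how such statistics can be maintained in a single pass, keeping only running sums rather than the stream itself (a standard streaming update written by me, not code from the slides):

```python
class RunningStats:
    """One-pass summary statistics: only three running sums are stored."""

    def __init__(self):
        self.n = 0
        self.total = 0.0       # running sum of x_j
        self.total_sq = 0.0    # running sum of x_j^2

    def update(self, value):
        self.n += 1
        self.total += value
        self.total_sq += value * value

    def mean(self):
        return self.total / self.n

    def length_squared(self):
        return self.total_sq                      # ||x||^2

    def variance(self):
        m = self.mean()
        return self.total_sq / self.n - m * m     # equals (1/n) sum_j (x_j - mean)^2

stats = RunningStats()
for value in [7.5, 1.5, 6.6, 5.7, 9.3]:           # the stream arrives one entry at a time
    stats.update(value)
print(stats.mean(), stats.length_squared(), stats.variance())
```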

Approach: introduce randomness
Consider x = (x_1, x_2, ..., x_n), x̃ = (x̃_1, x̃_2, ..., x̃_n), and a vector ϕ = (ϕ_1, ϕ_2, ..., ϕ_n) of i.i.d. unit normal Gaussian random variables: ϕ_j ∼ N(0, 1), with P(ϕ_j ≥ x) = \int_x^∞ (1/\sqrt{2π}) e^{−t^2/2} dt.
Set
b = ⟨ϕ, x⟩ − ⟨ϕ, x̃⟩ = (ϕ_1 x_1 + ϕ_2 x_2 + ... + ϕ_n x_n) − (ϕ_1 x̃_1 + ϕ_2 x̃_2 + ... + ϕ_n x̃_n) = ⟨ϕ, x − x̃⟩.
Claim: E b^2 = ‖x − x̃‖^2.

More generally, for an m × N matrix Φ with i.i.d. Gaussian entries ϕ_{i,j} ∼ N(0, 1),
E(b^2) = E( (1/m) ‖Φx‖^2 ) = (1/m) \sum_{i=1}^m E ⟨ϕ_i, x⟩^2 = ‖x‖^2,
where ϕ_i denotes the i-th row of Φ.
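An empirical check of this identity (a minimal sketch; the dimensions and number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
N, m, trials = 500, 50, 1000

x = rng.normal(size=N)
x_tilde = rng.normal(size=N)
diff = x - x_tilde

estimates = []
for _ in range(trials):
    Phi = rng.normal(size=(m, N))             # i.i.d. N(0, 1) entries
    estimates.append(np.linalg.norm(Phi @ diff) ** 2 / m)

print(np.mean(estimates))                     # approx ||x - x_tilde||^2
print(np.linalg.norm(diff) ** 2)
```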

Geometric intuition
The linear map x ↦ (1/\sqrt{m}) Φx is similar to a random projection onto an m-dimensional subspace of R^n: most projections preserve geometry, but not all.

Concentration around the expectation: for a fixed x ∈ R^n,
P( (1/m) ‖Φx‖^2 ≥ (1 + ε) ‖x‖^2 ) ≤ exp( −(m/4) ε^2 ).
For p vectors {x_1, x_2, ..., x_p} in R^n,
P( ∃ x_j : (1/m) ‖Φx_j‖^2 ≥ (1 + ε) ‖x_j‖^2 ) ≤ exp( log p − (m/4) ε^2 ).
How small can m be such that this probability is still small?

Distance preservation of random matrices
Theorem (Distance preservation). Fix an accuracy ε > 0 and a probability of failure η > 0. Fix an integer m ≥ c_{ε,η} log(p), and fix an m × n Gaussian random matrix Φ. Then, with probability greater than 1 − η,
| (1/m) ‖Φ(x_j − x_k)‖^2 − ‖x_j − x_k‖^2 | ≤ ε ‖x_j − x_k‖^2
for all j and k. (The key ingredient is the concentration bound for a single fixed x: P( | (1/m) ‖Φx‖^2 − ‖x‖^2 | ≥ ε ‖x‖^2 ) ≤ 2 e^{−c_ε m}.)

Corollary (Angle preservation). Fix an accuracy ε > 0 and a probability of failure η > 0. Fix an integer m ≥ c_{ε,η} log(p), and fix an m × n Gaussian random matrix Φ. Then, with probability greater than 1 − η,
| (1/m) ⟨Φx_j, Φx_k⟩ − ⟨x_j, x_k⟩ | ≤ (ε/2) ( ‖x_j‖^2 + ‖x_k‖^2 )
for all j and k.
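A quick empirical check of both the distance and the angle statements (a minimal sketch; the constant in the choice of m is not tuned):

```python
import numpy as np

rng = np.random.default_rng(10)
p, n, eps = 50, 1000, 0.2
m = int(16 / eps**2 * np.log(p))              # m on the order of eps^{-2} log p

X = rng.normal(size=(p, n))                   # p arbitrary vectors in R^n
Phi = rng.normal(size=(m, n))
Y = X @ Phi.T / np.sqrt(m)                    # rows are (1/sqrt(m)) Phi x_j

worst_dist, worst_ip = 0.0, 0.0
for j in range(p):
    for k in range(j + 1, p):
        d_true = np.sum((X[j] - X[k]) ** 2)
        d_proj = np.sum((Y[j] - Y[k]) ** 2)
        worst_dist = max(worst_dist, abs(d_proj - d_true) / d_true)
        ip_err = abs(Y[j] @ Y[k] - X[j] @ X[k])
        worst_ip = max(worst_ip, ip_err / (0.5 * (X[j] @ X[j] + X[k] @ X[k])))

print(worst_dist, worst_ip)                   # both should be below eps
```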

The nearest-neighbors problem

Find the closest point to a point q from among a set of points S = {x_1, x_2, ..., x_p}. Originally called the post-office problem (1973).

Applications: similarity searching...

The nearest-neighbors problem
Find the closest point to a point q from among a set of points S = {x_1, x_2, ..., x_p}:
x* = arg min_{x_j ∈ S} ‖q − x_j‖^2 = arg min_{x_j ∈ S} \sum_{k=1}^N ( q(k) − x_j(k) )^2
Computational cost (number of flops) per search: O(Np). Computational cost of m searches: O(Nmp).
Curse of dimensionality: if N and p are large, this is a lot of flops!

The ε-approximate nearest-neighbors problem
Given a tolerance ε > 0 and a point q ∈ R^N, return a point x_ε from the set S = {x_1, x_2, ..., x_p} which is an ε-approximate nearest neighbor to q:
‖q − x_ε‖ ≤ (1 + ε) ‖q − x*‖.
This problem can be solved using random projections:
Let Φ be an m × N Gaussian random matrix, where m = 10 ε^{−2} log p.
Compute r = Φq. For all j = 1, ..., p, compute u_j = Φx_j. Computational cost: O(Np log(p)).
Compute x_ε = arg min_{x_j ∈ S} ‖r − u_j‖. Computational cost of the searches: O(pm log(p + m)).
Total computational cost: O((N + m) p log(p + m)) ≪ O(Np^2)!
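A sketch of this procedure (a minimal illustration; the problem sizes and the brute-force comparison at the end are just for the demo):

```python
import numpy as np

rng = np.random.default_rng(11)
N, p, eps = 2000, 500, 0.25
S = rng.normal(size=(p, N))                   # the point set, one point per row
q = rng.normal(size=N)

m = int(10 / eps**2 * np.log(p))              # m = 10 eps^{-2} log p
Phi = rng.normal(size=(m, N)) / np.sqrt(m)    # scaled Gaussian projection

r = Phi @ q                                   # project the query once
U = S @ Phi.T                                 # u_j = Phi x_j for all j

j_approx = np.argmin(np.linalg.norm(U - r, axis=1))   # search in R^m
j_exact = np.argmin(np.linalg.norm(S - q, axis=1))    # brute force in R^N

ratio = np.linalg.norm(q - S[j_approx]) / np.linalg.norm(q - S[j_exact])
print(ratio)                                  # should be at most about 1 + eps
```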

Random projections and sparse recovery

Theorem (Subspace-preservation). Suppose that Φ is an m × n random matrix with the distance-preservation property: for any fixed x,
P( | ‖Φx‖^2 − ‖x‖^2 | ≥ ε ‖x‖^2 ) ≤ 2 e^{−c_ε m}.
Let k ≤ c_ε m and let T_k be a k-dimensional subspace of R^n. Then
P( for all x ∈ T_k : (1 − ε) ‖x‖^2 ≤ ‖Φx‖^2 ≤ (1 + ε) ‖x‖^2 ) ≥ 1 − e^{−c_ε m}.
Outline of proof: an ε-cover and the Vitali covering lemma; a continuity argument.

Sparse recovery and RIP
Restricted Isometry Property of order k: Φ has the RIP of order k if, for all k-sparse vectors x ∈ R^n,
0.8 ‖x‖^2 ≤ ‖Φx‖^2 ≤ 1.2 ‖x‖^2.
Theorem. If Φ has the RIP of order k, then for all k-sparse vectors x such that Φx = b,
x = arg min { \sum_{j=1}^n |z(j)| : Φz = b, z ∈ R^n }.
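A sketch of recovery by ℓ_1 minimization (assuming SciPy's linear-programming routine is available; the split z = u − v with u, v ≥ 0 is the standard LP reformulation, and the sizes are small so the demo runs quickly):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(12)
n, m, k = 200, 60, 5

x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = rng.normal(size=k)               # a k-sparse vector

Phi = rng.normal(size=(m, n)) / np.sqrt(m)    # Gaussian measurement matrix
b = Phi @ x

# Minimize sum_j |z_j| subject to Phi z = b, written as an LP in (u, v) with z = u - v.
c = np.ones(2 * n)
A_eq = np.hstack([Phi, -Phi])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None))
z = res.x[:n] - res.x[n:]

print(np.linalg.norm(z - x))                  # near zero: exact recovery
```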

Theorem (Distance-preservation implies RIP). Suppose that Φ is an m × N random matrix with the subspace-preservation property:
P( for all x ∈ T_k : (1 − ε) ‖x‖^2 ≤ ‖Φx‖^2 ≤ (1 + ε) ‖x‖^2 ) ≥ 1 − e^{−c_ε m}.
Then, with probability greater than 0.99,
(1 − ε) ‖x‖^2 ≤ ‖Φx‖^2 ≤ (1 + ε) ‖x‖^2
for all x of sparsity level k ≤ c_ε m / log(N).
Outline of proof: the bound for a fixed subspace T_k, followed by a union bound over all (N choose k) ≤ N^k subspaces of k-sparse vectors.