PCA with random noise. Van Ha Vu. Department of Mathematics, Yale University


An important problem that appears in various areas of applied mathematics (in particular statistics, computer science, and numerical analysis) is to compute the first few singular vectors of a large matrix. Among others, this problem lies at the heart of PCA (Principal Component Analysis), which has a very wide range of applications.

Problem. For a matrix $A$ of size $n \times n$ with singular values $\sigma_1 \ge \cdots \ge \sigma_n \ge 0$, let $v_1, \dots, v_n$ be the corresponding (unit) singular vectors. Compute $v_1, \dots, v_k$ for some $k \le n$.
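For concreteness, the first few singular vectors can be computed with any standard SVD routine; the following minimal NumPy sketch (not part of the original notes; the function name and sizes are illustrative) does exactly this.

```python
import numpy as np

def top_k_singular_vectors(A, k):
    """Return the first k (right) singular vectors of A as columns."""
    # np.linalg.svd returns the singular values in non-increasing order
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:k].T                      # column i is v_{i+1}

# toy usage: first 3 singular vectors of a random 100 x 100 matrix
A = np.random.randn(100, 100)
Vk = top_k_singular_vectors(A, 3)
print(Vk.shape)                          # (100, 3)
```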

Typically $n$ is large and $k$ is relatively small. As a matter of fact, in many applications $k$ is a constant independent of $n$. For example, to obtain a visualization of a large set of data, one often sets $k = 2$ or $3$. The assumption that $A$ is a square matrix is for convenience; our analysis can be carried out with nominal modifications for rectangular matrices. Asymptotic notation ($\Theta$, $\Omega$, $O$) is used under the assumption that $n \to \infty$. For a vector $v$, $\|v\|$ denotes its $L_2$ norm. For a matrix $A$, $\|A\| = \sigma_1(A)$ denotes its spectral norm.

A model. The matrix $A$, which represents the data, is often perturbed by noise. Thus, one works with $A + E$, where $E$ represents the noise. A natural and important problem is to estimate the influence of the noise on the vectors $v_1, \dots, v_k$. We denote by $v_1', \dots, v_k'$ the first $k$ singular vectors of $A + E$.

Question. When is $v_1'$ a good approximation of $v_1$, or how much does the noise change $v_1$?

For singular values, Weyl's bound gives
\[ |\sigma_1(A + E) - \sigma_1(A)| \le \sigma_1(E). \]
If $E \to 0$, then $\sigma_1(A + E) \to \sigma_1(A)$. In other words, $\sigma_1$ is continuous.
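The following small NumPy check (an illustration added here, not from the slides; the matrix size and noise level are arbitrary) confirms Weyl's bound numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
E = 0.01 * rng.standard_normal((n, n))

s1_A  = np.linalg.norm(A, 2)              # spectral norm = sigma_1(A)
s1_AE = np.linalg.norm(A + E, 2)
s1_E  = np.linalg.norm(E, 2)

# Weyl's bound: the top singular value moves by at most ||E||
print(abs(s1_AE - s1_A) <= s1_E + 1e-12)  # True
```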

On the other hand, the singular vectors are not continuous. Let $A$ be the matrix
\[ A = \begin{pmatrix} 1 + \epsilon & 0 \\ 0 & 1 - \epsilon \end{pmatrix}. \]
Apparently, the singular values of $A$ are $1 + \epsilon$ and $1 - \epsilon$, with corresponding singular vectors $(1, 0)$ and $(0, 1)$. Let $E$ be
\[ E = \begin{pmatrix} -\epsilon & \epsilon \\ \epsilon & \epsilon \end{pmatrix}, \]
where $\epsilon$ is a small positive number. The perturbed matrix $A + E$ has the form
\[ A + E = \begin{pmatrix} 1 & \epsilon \\ \epsilon & 1 \end{pmatrix}. \]
Obviously, the singular values of $A + E$ are also $1 + \epsilon$ and $1 - \epsilon$. However, the corresponding singular vectors are now $\left(\tfrac{1}{\sqrt 2}, \tfrac{1}{\sqrt 2}\right)$ and $\left(\tfrac{1}{\sqrt 2}, -\tfrac{1}{\sqrt 2}\right)$, no matter how small $\epsilon$ is.
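This discontinuity is easy to observe numerically; the sketch below (added for illustration, with eps chosen arbitrarily small) reproduces the $2 \times 2$ example.

```python
import numpy as np

eps = 1e-6
A = np.array([[1 + eps, 0.0],
              [0.0, 1 - eps]])
E = np.array([[-eps, eps],
              [ eps, eps]])

v1  = np.linalg.svd(A)[2][0]       # top right singular vector of A
v1p = np.linalg.svd(A + E)[2][0]   # top right singular vector of A + E

print(v1)    # ~ (1, 0)
print(v1p)   # ~ (0.707, 0.707) up to sign, no matter how small eps is
```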

A traditional way to measure the distance between two vectors $v$ and $v'$ is to look at $\sin \angle(v, v')$, where $\angle(v, v')$ is the angle between the vectors, taken in $[0, \pi/2]$. Let us fix a small parameter $\epsilon > 0$, which represents a desired accuracy. We want to find a sufficient condition on the matrix $A$ which guarantees that $\sin \angle(v_1, v_1') \le \epsilon$. The key parameter to look at is the gap (or separation) $\delta := \sigma_1 - \sigma_2$ between the first and second singular values of $A$.

Theorem (Wedin sin theorem). There is a positive constant $C$ such that
\[ \sin \angle(v_1, v_1') \le C \frac{\|E\|}{\delta}. \]
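For later experiments it is convenient to have a small helper computing this distance; the function below is an added illustration (the name sin_angle is not from the talk).

```python
import numpy as np

def sin_angle(u, v):
    """sin of the angle between unit vectors u and v, taken in [0, pi/2]."""
    c = min(1.0, abs(float(np.dot(u, v))))   # |cos|, insensitive to sign flips
    return np.sqrt(1.0 - c * c)
```

Applied to v1 and v1p from the previous snippet, it returns a value close to $1/\sqrt{2}$, reflecting the 45-degree rotation.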

Corollary. For any given $\epsilon > 0$, there is $C = C(\epsilon) > 0$ such that if $\delta \ge C \|E\|$, then $\sin \angle(v_1, v_1') \le \epsilon$.

In the case when $A$ and $A + E$ are Hermitian, this statement is a special case of the Davis-Kahan $\sin\theta$ theorem. Wedin extended the Davis-Kahan theorem to non-Hermitian matrices.

Random perturbation. Noise (or perturbation) represents errors that come from various sources, frequently of entirely different natures: errors occurring in measurement, errors occurring in recording and transmitting data, errors introduced by rounding, etc. It is usually too complicated to model noise deterministically, so in practice one often assumes that it is random. In particular, a popular model is that the entries of $E$ are independent random variables with mean 0 and variance 1 (the value 1 is, of course, just a matter of normalization).

For simplicity, we restrict ourselves to a representative case when all entries of $E$ are i.i.d. Bernoulli random variables, taking values $\pm 1$ with probability $1/2$ each. We prefer the Bernoulli model over the Gaussian one for two reasons. First, in many real-life applications noise has a discrete nature (after all, data are finite), so it seems reasonable to model noise with random variables of discrete support, and Bernoulli is the simplest such variable. Second, the analysis for the Bernoulli model easily extends to many other models of random matrices (including the Gaussian one). On the other hand, the analysis for Gaussian matrices often relies on special properties of the Gaussian measure which are not available in other cases.

It is well known that a random matrix of size $n$ has norm $\|E\| \approx 2\sqrt{n}$ with high probability.

Corollary. For any given $\epsilon > 0$, there is $C = C(\epsilon) > 0$ such that if $\delta \ge C \sqrt{n}$, then with probability $1 - o(1)$,
\[ \sin \angle(v_1, v_1') \le \epsilon. \]
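A quick simulation (added illustration; the size $n = 1000$ is arbitrary) shows the normalized norm $\|E\|/\sqrt{n}$ hovering near 2 for Bernoulli noise.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
E = rng.choice([-1.0, 1.0], size=(n, n))   # i.i.d. Bernoulli +/-1 entries
print(np.linalg.norm(E, 2) / np.sqrt(n))   # typically close to 2
```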

[Figure: empirical CDF plots. Caption: 400 × 400 matrix of rank 2, with gaps being 1 and 8, respectively; the effective gap is much smaller than predicted by Wedin's bound.]

[Figure: empirical CDF plots. Caption: 1000 × 1000 matrix of rank 2, with gaps being 1 and 10, respectively.]

[Figure: additional empirical CDF plots (no caption transcribed).]

Low dimensional data and improved bounds. In a large variety of problems, the data is of small dimension, namely $r := \operatorname{rank} A \ll n$. In this setting, we discovered that the results can be significantly improved. This improvement reflects the real dimension $r$, rather than the size $n$ of the matrix.

Corollary. For any positive constant $\epsilon$ there is a positive constant $C = C(\epsilon)$ such that the following holds. Assume that $A$ has rank $r \le n^{0.99}$, that $\sigma_1 \ge \frac{n}{\sqrt{r \log n}}$, and that $\delta \ge C \sqrt{r \log n}$. Then with probability $1 - o(1)$,
\[ \sin \angle(v_1, v_1') \le \epsilon. \tag{1} \]
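In the spirit of the rank-2 experiments shown above, the sketch below (an added illustration; the particular singular values 60 and 52 are arbitrary and not taken from the talk) estimates the spread of $\sin \angle(v_1, v_1')$ for a low-rank signal plus Bernoulli noise.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 400, 2
sigma = np.array([60.0, 52.0])                 # illustrative values, gap = 8

# rank-2 signal with prescribed singular values
U = np.linalg.qr(rng.standard_normal((n, r)))[0]
V = np.linalg.qr(rng.standard_normal((n, r)))[0]
A = U @ np.diag(sigma) @ V.T
v1 = np.linalg.svd(A)[2][0]

sins = []
for _ in range(50):
    E = rng.choice([-1.0, 1.0], size=(n, n))   # Bernoulli noise
    v1p = np.linalg.svd(A + E)[2][0]
    c = min(1.0, abs(float(v1 @ v1p)))
    sins.append(np.sqrt(1.0 - c * c))

print(np.quantile(sins, [0.1, 0.5, 0.9]))      # empirical spread of sin angle
```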

Theorem (Probabilistic sin theorem). For any positive constants $\alpha_1, \alpha_2$ there is a positive constant $C$ such that the following holds. Assume that $A$ has rank $r \le n^{1 - \alpha_1}$ and $\sigma_1 := \sigma_1(A) \ge n^{\alpha_2}$. Let $E$ be a random Bernoulli matrix. Then with probability $1 - o(1)$,
\[ \sin^2 \angle(v_1, v_1') \le C \max\left( \frac{\sqrt{r \log n}}{\delta},\ \frac{n}{\delta \sigma_1} \right). \tag{2} \]

Let us now consider the general case when we try to approximate the first $k$ singular vectors. Set $\epsilon_k := \sin \angle(v_k, v_k')$ and $s_k := (\epsilon_1^2 + \cdots + \epsilon_k^2)^{1/2}$. We can bound $\epsilon_k$ recursively as follows.

Theorem. For any positive constants $\alpha_1, \alpha_2, k$ there is a positive constant $C$ such that the following holds. Assume that $A$ has rank $r \le n^{1 - \alpha_1}$ and $\sigma_1 := \sigma_1(A) \ge n^{\alpha_2}$. Let $E$ be a random Bernoulli matrix. Then with probability $1 - o(1)$,
\[ \epsilon_k^2 \le C \max\left( \frac{\sqrt{r \log n}}{\delta_k},\ \frac{n}{\sigma_k \delta_k},\ \frac{\sqrt{n}}{\sigma_k},\ \frac{\sigma_1^2 s_{k-1}^2}{\sigma_k \delta_k},\ \frac{(\sigma_1 + \sqrt{n})(\sigma_k + \sqrt{n})\, s_{k-1}}{\sigma_k \delta_k} \right). \tag{3} \]

Take $A$ such that $r = n^{o(1)}$, $\sigma_1 = 2n^{\alpha}$, $\sigma_2 = n^{\alpha}$, $\delta_2 = n^{\beta}$, where $\alpha > 1/2 > \beta > 1 - \alpha$ are positive constants. Then $\delta_1 = n^{\alpha}$ and
\[ \epsilon_1^2 \le \max\left( n^{-\alpha + o(1)},\ n^{1 - 2\alpha + o(1)} \right) \]
almost surely. Assume that we want to bound $\sin \angle(v_2, v_2')$. The gap $\delta_2 = n^{\beta} = o(n^{1/2})$, so Wedin's theorem does not apply. On the other hand, our theorem implies that almost surely
\[ \epsilon_2^2 \le \max\left( n^{-\beta + o(1)},\ n^{1/2 - \alpha + o(1)},\ n^{1 - \alpha - \beta} \right). \]
Thus, we have almost surely $\sin \angle(v_2, v_2') = n^{-\Omega(1)} = o(1)$.

Proof strategy. Bound the difference $\sigma_1' - \sigma_1$ from both above and below. Show that if $v_1'$ is far from $v_1$, then $\sigma_1'$ is far from $\sigma_1$. The second step relies on the formula
\[ \sigma_1' := \sup_{\|v\| = 1} \|(A + E)v\|. \]
It suffices to consider $v$ in an $\epsilon$-net of the unit sphere. Critical step: it suffices to restrict to a subspace of dimension roughly $\operatorname{rank} A$!

Fix a system $v_1, \dots, v_n$ of unit singular vectors of $A$. It is well known that $v_1, \dots, v_n$ form an orthonormal basis. (If $A$ has rank $r$, the choice of $v_{r+1}, \dots, v_n$ will turn out to be irrelevant.) For a vector $v$, if we decompose it as
\[ v := \alpha_1 v_1 + \cdots + \alpha_n v_n, \]
then
\[ \|Av\|^2 = v^* A^* A v = \sum_{i=1}^{n} \alpha_i^2 \sigma_i^2. \tag{4} \]
Courant-Fischer minimax principle for singular values:
\[ \sigma_k(M) = \max_{\dim H = k}\ \min_{v \in H,\ \|v\| = 1} \|Mv\|, \tag{5} \]
where $\sigma_k(M)$ is the $k$-th largest singular value of $M$.
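Identity (4) is easy to sanity-check numerically; the snippet below (an added illustration with arbitrary sizes) does so for a random matrix and a random unit vector.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
A = rng.standard_normal((n, n))
U, s, Vt = np.linalg.svd(A)

v = rng.standard_normal(n)
v /= np.linalg.norm(v)

alpha = Vt @ v                      # coefficients of v in the basis v_1, ..., v_n
lhs = np.linalg.norm(A @ v) ** 2
rhs = np.sum(alpha**2 * s**2)       # right-hand side of (4)
print(abs(lhs - rhs))               # ~ 0 up to rounding error
```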

Let $\epsilon$ be a positive number. A set $X$ is an $\epsilon$-net of a set $Y$ if for any $y \in Y$, there is $x \in X$ such that $\|x - y\| \le \epsilon$.

Lemma [$\epsilon$-approximation lemma]. Let $H$ be a subspace and $S := \{v : \|v\| = 1,\ v \in H\}$. Let $0 < \epsilon \le 1$ be a number and $M$ a linear map. Let $N \subset S$ be an $\epsilon$-net of $S$. Then there is a vector $w \in N$ such that
\[ \|Mw\| \ge (1 - \epsilon) \max_{v \in S} \|Mv\|. \]

Let $v$ be the vector at which the maximum is attained and let $w$ be a vector in the net closest to $v$ (ties are broken arbitrarily). Then by the triangle inequality
\[ \|Mw\| \ge \|Mv\| - \|M(v - w)\|. \]
As $\|v - w\| \le \epsilon$, we have $\|M(v - w)\| \le \epsilon \max_{v \in S} \|Mv\|$.

Lemma [Net size]. The unit sphere in $d$ dimensions admits an $\epsilon$-net of size at most $(3\epsilon^{-1})^d$.

Let $S$ be the sphere in question, centered at $O$, and let $N \subset S$ be a finite subset of $S$ such that the distance between any two points is at least $\epsilon$. If $N$ is maximal with respect to this property, then $N$ is an $\epsilon$-net. On the other hand, the balls of radius $\epsilon/2$ centered at the points of $N$ are disjoint subsets of the ball of radius $1 + \epsilon/2$ centered at $O$. Since
\[ \frac{1 + \epsilon/2}{\epsilon/2} \le 3\epsilon^{-1}, \]
the claim follows by a volume argument.

Lemma [Spectral norm; Alon-Krivelevich-V.]. There is a constant $C_0 > 0$ such that the following holds. Let $E$ be a random Bernoulli matrix of size $n$. Then
\[ \mathbb{P}(\|E\| \ge 3\sqrt{n}) \le \exp(-C_0 n). \]

Next, we present a lemma which roughly asserts that for any two given vectors $u$ and $v$, the vectors $u$ and $Ev$ are, with high probability, almost orthogonal.

Lemma [Orthogonality lemma]. Let $E$ be a random Bernoulli matrix of size $n$. For any fixed unit vectors $u, v$ and positive number $t$,
\[ \mathbb{P}(|u^T E v| \ge t) \le 2\exp(-t^2/16). \]
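A small Monte Carlo check of the orthogonality lemma (an added illustration; the values of n, the number of trials, and t are arbitrary, and the constant 16 is taken from the lemma as stated above):

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials, t = 200, 1000, 8.0

u = rng.standard_normal(n); u /= np.linalg.norm(u)
v = rng.standard_normal(n); v /= np.linalg.norm(v)

hits = 0
for _ in range(trials):
    E = rng.choice([-1.0, 1.0], size=(n, n))   # random Bernoulli matrix
    hits += abs(u @ E @ v) >= t
print(hits / trials, 2 * np.exp(-t**2 / 16))   # empirical tail vs. stated bound
```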

Lemma [Main lemma]. For any constant $0 < \beta \le 1$ there is a constant $C$ such that the following holds. Assume that $A$ is such that $\sigma_1 \le n^{\beta^{-1}}$, and let $V := \operatorname{span}\{v_1, \dots, v_d\}$ for some $d = o(n/\log n)$. Then the following holds almost surely: for any unit vector $v \in V$,
\[ \|(A + E)v\|^2 \le \sum_{i=1}^{n} (v \cdot v_i)^2 \sigma_i^2 + C\left(n + \sigma_1 \sqrt{d \log n}\right). \]
It is important that the statement holds for all unit $v$ simultaneously.

It suffices to prove the bound for $v$ belonging to an $\epsilon$-net $N$ of the unit sphere $S$ in $V$, with $\epsilon := \frac{1}{n + \sigma_1}$. With such a small $\epsilon$, the error coming from the factor $(1 - \epsilon)$ is swallowed by the error term $O(n + \sigma_1 \sqrt{d \log n})$. Thanks to the upper bound on the net size, it suffices to show that if $C$ is large enough, then
\[ \mathbb{P}\left( \|(A + E)v\|^2 \ge \sum_{i=1}^{n} (v \cdot v_i)^2 \sigma_i^2 + C(n + \sigma_1 \sqrt{d \log n}) \right) \le \exp(-2C_1 d \log n) \]
for any fixed $v \in N$.

Fix $v \in N$. Then
\[ \|(A + E)v\|^2 = \|Av\|^2 + \|Ev\|^2 + 2(Av) \cdot (Ev) = \sum_{i=1}^{n} (v \cdot v_i)^2 \sigma_i^2 + \|Ev\|^2 + 2(Av) \cdot (Ev). \]
Use the spectral norm lemma and the orthogonality lemma.

Let $v_i$ and $u_i$ ($1 \le i \le n$) be the right and left singular vectors of the matrix $A$. First, we give a lower bound for $\sigma_1' := \|A + E\|$. By the minimax principle, we have
\[ \sigma_1' = \|A + E\| \ge u_1^T (A + E) v_1 = \sigma_1 + u_1^T E v_1. \]
By the orthogonality lemma, with probability $1 - o(1)$, $|u_1^T E v_1| \le \log\log n$. (The choice of $\log\log n$ is not important. One can replace it by any function that tends to infinity slowly with $n$.) Thus, we have, with probability $1 - o(1)$, that
\[ \|A + E\| \ge \sigma_1 - \log\log n. \tag{6} \]
Our main observation is that, with high probability, any $v$ that is far from $v_1$ would yield $\|(A + E)v\| < \sigma_1 - \log\log n$. Therefore, the first singular vector $v_1'$ of $A + E$ must be close to $v_1$.

Consider a unit vector $v$ and write it as
\[ v = c_1 v_1 + c_2 v_2 + \cdots + c_r v_r + c_0 u, \tag{7} \]
where $u$ is a unit vector orthogonal to $H := \operatorname{span}\{v_1, \dots, v_r\}$ and $c_1^2 + \cdots + c_r^2 + c_0^2 = 1$. Recall that $r$ is the rank of $A$, so $Au = 0$. Setting $w := c_1 v_1 + \cdots + c_r v_r$ and using Cauchy-Schwarz, we have
\[ \|(A + E)v\|^2 = \|(A + E)w + c_0 Eu\|^2 \le \|(A + E)w\|^2 + 2|c_0|\, \|(A + E)w\|\, \|Eu\| + c_0^2 \|Eu\|^2 \le \left(1 + \frac{c_0^2}{4}\right) \|(A + E)w\|^2 + (4 + c_0^2) \|Eu\|^2. \]

By the spectral norm lemma, we have, with probability $1 - o(1)$, that $\|Eu\| \le 3\sqrt{n}$ for every unit vector $u$. Furthermore, by the main lemma, we have, with probability $1 - o(1)$,
\[ \|(A + E)w\|^2 \le \sum_{i=1}^{r} (w \cdot v_i)^2 \sigma_i^2 + O(\sigma_1 \sqrt{r \log n} + n) \]
for every vector $w \in H$ of length at most 1. Since
\[ \sum_{i=1}^{r} (w \cdot v_i)^2 \sigma_i^2 = \sum_{i=1}^{r} c_i^2 \sigma_i^2 \le (1 - c_0^2)\sigma_1^2 - (1 - c_0^2 - c_1^2)(\sigma_1^2 - \sigma_2^2), \]
we can conclude that with probability $1 - o(1)$ the following holds: any unit vector $v$ written in the form above satisfies
\[ \frac{1}{1 + c_0^2/4} \|(A + E)v\|^2 \le (1 - c_0^2)\sigma_1^2 - (1 - c_0^2 - c_1^2)(\sigma_1^2 - \sigma_2^2) + O(\sigma_1 \sqrt{r \log n} + n). \]

Set $v$ to be the first singular vector of $A + E$. By the lower bound on $\|(A + E)v\|$,
\[ \frac{1}{1 + c_0^2/4} \|(A + E)v\|^2 \ge \left(1 - \frac{c_0^2}{4}\right)(\sigma_1 - \log\log n)^2. \]
Combining it with the previous inequality, we get
\[ (1 - c_1^2)\, \sigma_1 \delta \le \frac{c_0^2}{4}\sigma_1^2 - c_0^2 \sigma_2^2 + C(\sigma_1 \sqrt{r \log n} + n). \]
From here we can get an upper bound on $1 - c_1^2$ after some manipulation.

Further directions of research. Improve bounds. Other models of random matrices. Limiting distributions. Data in low dimension.