PCA with random noise
Van Ha Vu
Department of Mathematics, Yale University
An important problem that appears in various areas of applied mathematics (in particular statistics, computer science, and numerical analysis) is to compute the first few singular vectors of a large matrix. Among others, this problem lies at the heart of PCA (Principal Component Analysis), which has a very wide range of applications.

Problem. For a matrix $A$ of size $n \times n$ with singular values $\sigma_1 \ge \dots \ge \sigma_n \ge 0$, let $v_1, \dots, v_n$ be the corresponding (unit) singular vectors. Compute $v_1, \dots, v_k$, for some $k \le n$.
Typically $n$ is large and $k$ is relatively small. As a matter of fact, in many applications $k$ is a constant independent of $n$. For example, to obtain a visualization of a large set of data, one often sets $k = 2$ or $3$. The assumption that $A$ is a square matrix is for convenience, and our analysis can be carried out with nominal modification for rectangular matrices.

Asymptotic notation: $\Theta, \Omega, O$ are used under the assumption that $n \to \infty$. For a vector $v$, $\|v\|$ denotes its $L_2$ norm. For a matrix $A$, $\|A\| = \sigma_1(A)$ denotes its spectral norm.
A model. The matrix $A$, which represents data, is often perturbed by noise. Thus, one works with $A + E$, where $E$ represents the noise. A natural and important problem is to estimate the influence of the noise on the vectors $v_1, \dots, v_k$. We denote by $v_1', \dots, v_k'$ the first $k$ singular vectors of $A + E$.

Question. When is $v_1'$ a good approximation of $v_1$, or how much does the noise change $v_1$?

For singular values, Weyl's bound gives
$$ |\sigma_1(A+E) - \sigma_1(A)| \le \sigma_1(E). $$
If $E \to 0$, then $\sigma_1(A+E) \to \sigma_1(A)$. In other words, $\sigma_1$ is continuous.
On the other hand, the singular vectors are not continuous. Let $A$ be the matrix
$$ \begin{pmatrix} 1+\epsilon & 0 \\ 0 & 1-\epsilon \end{pmatrix}. $$
Apparently, the singular values of $A$ are $1+\epsilon$ and $1-\epsilon$, with corresponding singular vectors $(1,0)$ and $(0,1)$. Let $E$ be
$$ \begin{pmatrix} -\epsilon & \epsilon \\ \epsilon & \epsilon \end{pmatrix}, $$
where $\epsilon$ is a small positive number. The perturbed matrix $A+E$ has the form
$$ \begin{pmatrix} 1 & \epsilon \\ \epsilon & 1 \end{pmatrix}. $$
Obviously, the singular values of $A+E$ are also $1+\epsilon$ and $1-\epsilon$. However, the corresponding singular vectors now are $(\tfrac{1}{\sqrt 2}, \tfrac{1}{\sqrt 2})$ and $(\tfrac{1}{\sqrt 2}, -\tfrac{1}{\sqrt 2})$, no matter how small $\epsilon$ is.
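This discontinuity is easy to reproduce numerically. The sketch below is not part of the original argument; it uses plain NumPy with an illustrative value $\epsilon = 10^{-3}$ and the matrices above.

```python
import numpy as np

eps = 1e-3  # an illustrative small perturbation
A = np.diag([1 + eps, 1 - eps])
E = np.array([[-eps, eps],
              [eps,  eps]])  # chosen so that A + E = [[1, eps], [eps, 1]]

# Rows of Vt in np.linalg.svd are the right singular vectors.
v1 = np.linalg.svd(A)[2][0]            # top singular vector of A: (1, 0) up to sign
v1_pert = np.linalg.svd(A + E)[2][0]   # top singular vector of A + E

print(np.round(np.abs(v1), 4))        # [1. 0.]
print(np.round(np.abs(v1_pert), 4))   # [0.7071 0.7071]
```

However small eps is made, the perturbed top singular vector stays at a 45-degree angle to the original one.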
A traditional way to measure the distance between two vectors $v$ and $v'$ is to look at $\sin \angle(v, v')$, where $\angle(v, v')$ is the angle between the vectors, taken in $[0, \pi/2]$. Let us fix a small parameter $\epsilon > 0$, which represents a desired accuracy. We want to find a sufficient condition on the matrix $A$ which guarantees that $\sin\angle(v_1, v_1') \le \epsilon$. The key parameter to look at is the gap (or separation)
$$ \delta := \sigma_1 - \sigma_2 $$
between the first and second singular values of $A$.

Theorem (Wedin sin theorem) There is a positive constant $C$ such that
$$ \sin\angle(v_1, v_1') \le C \frac{\|E\|}{\delta}. $$
Corollary For any given $\epsilon > 0$, there is $C = C(\epsilon) > 0$ such that if $\delta \ge C\|E\|$, then $\sin\angle(v_1, v_1') \le \epsilon$.

In the case when $A$ and $A+E$ are Hermitian, this statement is a special case of the Davis-Kahan $\sin\theta$ theorem. Wedin extended the Davis-Kahan theorem to non-Hermitian matrices.
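As a quick sanity check of the theorem, the sketch below builds a matrix with prescribed singular values and gap (the choices $n = 200$, $\sigma_1 = 200$, $\sigma_2 = 100$ are arbitrary illustrations, not from the text), adds $\pm 1$ noise, and compares the observed angle with the quantity $\|E\|/\delta$ appearing in Wedin's bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Random orthonormal bases; prescribed singular values 200, 100, 0, ..., 0
U = np.linalg.qr(rng.standard_normal((n, n)))[0]
V = np.linalg.qr(rng.standard_normal((n, n)))[0]
s = np.zeros(n); s[0], s[1] = 200.0, 100.0
A = U @ np.diag(s) @ V.T

E = rng.choice([-1.0, 1.0], size=(n, n))  # Bernoulli +-1 noise

v1 = V[:, 0]
v1_pert = np.linalg.svd(A + E)[2][0]      # top singular vector of A + E
sin_angle = np.sqrt(max(0.0, 1.0 - (v1 @ v1_pert) ** 2))
wedin = np.linalg.norm(E, 2) / (s[0] - s[1])  # ||E|| / delta, up to the constant C
print(sin_angle, wedin)
```

For random noise the observed angle is typically well below the Wedin quantity, which is the phenomenon the later slides quantify.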
Random perturbation. Noise (or perturbation) represents errors that come from various sources which are frequently of entirely different natures, such as errors occurring in measurements, errors occurring in recording and transmitting data, errors caused by rounding, etc. It is usually too complicated to model noise deterministically, so in practice one often assumes that it is random. In particular, a popular model is that the entries of $E$ are independent random variables with mean 0 and variance 1 (the value 1 is, of course, just a matter of normalization).
For simplicity, we restrict ourselves to a representative case when all entries of $E$ are iid Bernoulli random variables, taking values $\pm 1$ with probability $1/2$. We prefer the Bernoulli model over the Gaussian one for two reasons:
- In many real-life applications, noise has a discrete nature (after all, data are finite). So it seems reasonable to use random variables with discrete support to model noise, and Bernoulli is the simplest such variable.
- The analysis for the Bernoulli model easily extends to many other models of random matrices (including the Gaussian one). On the other hand, the analysis for Gaussian matrices often relies on special properties of the Gaussian measure which are not available in other cases.
It is well known that a random matrix of size $n$ has norm $\|E\| \approx 2\sqrt n$ with high probability.

Corollary For any given $\epsilon > 0$, there is $C = C(\epsilon) > 0$ such that if $\delta \ge C\sqrt n$, then with probability $1 - o(1)$,
$$ \sin\angle(v_1, v_1') \le \epsilon. $$
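The $2\sqrt n$ heuristic can be checked directly; the short sketch below (matrix sizes chosen arbitrarily) prints $\|E\|/\sqrt n$ for growing $n$.

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = {}
for n in (100, 400, 1600):
    E = rng.choice([-1.0, 1.0], size=(n, n))  # Bernoulli +-1 matrix
    ratios[n] = np.linalg.norm(E, 2) / np.sqrt(n)  # spectral norm over sqrt(n)
    print(n, round(ratios[n], 3))  # approaches 2 as n grows
```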
[Figure: empirical CDFs.] $400 \times 400$ matrix of rank 2, with gaps being 1 and 8, respectively; the effective gap is much less than predicted by Wedin's bound.
[Figure: empirical CDFs.] $1000 \times 1000$ matrix of rank 2, with gaps being 1 and 10, respectively.
Low dimensional data and improved bounds. In a large variety of problems, the data is of small dimension, namely, $r := \operatorname{rank} A \ll n$. In this setting, we discovered that the results can be significantly improved. The improvement reflects the real dimension $r$, rather than the size $n$ of the matrix.

Corollary For any positive constant $\epsilon$ there is a positive constant $C = C(\epsilon)$ such that the following holds. Assume that $A$ has rank $r \le n^{.99}$, that $\frac{n}{\sqrt{r \log n}} \le \sigma_1$, and that $\delta \ge C \sqrt{r \log n}$. Then with probability $1 - o(1)$,
$$ \sin\angle(v_1, v_1') \le \epsilon. \qquad (1) $$
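To see the improvement, one can rerun the earlier experiment with a rank-2 matrix whose gap is far below $\sqrt n$; the parameters below ($n = 1000$, $\sigma_1 = 3000$, $\delta = 20$) are illustrative choices, not from the text. The Wedin quantity $\|E\|/\delta$ exceeds 1 and is vacuous, yet the top singular vector is still recovered accurately.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma1, delta = 1000, 3000.0, 20.0  # delta << sqrt(n) approx 31.6
U = np.linalg.qr(rng.standard_normal((n, n)))[0]
V = np.linalg.qr(rng.standard_normal((n, n)))[0]
s = np.zeros(n); s[0], s[1] = sigma1, sigma1 - delta  # rank 2, tiny gap
A = U @ np.diag(s) @ V.T

E = rng.choice([-1.0, 1.0], size=(n, n))
v1_pert = np.linalg.svd(A + E)[2][0]
sin_angle = np.sqrt(max(0.0, 1.0 - (V[:, 0] @ v1_pert) ** 2))

wedin_q = np.linalg.norm(E, 2) / delta
print(wedin_q)    # > 1: Wedin's bound says nothing here
print(sin_angle)  # yet the observed angle is small
```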
Theorem (Probabilistic sin theorem) For any positive constants $\alpha_1, \alpha_2$ there is a positive constant $C$ such that the following holds. Assume that $A$ has rank $r \le n^{1-\alpha_1}$ and $\sigma_1 := \sigma_1(A) \ge n^{\alpha_2}$. Let $E$ be a random Bernoulli matrix. Then with probability $1 - o(1)$,
$$ \sin^2\angle(v_1, v_1') \le C \max\left( \frac{\sqrt{r \log n}}{\delta}, \frac{n}{\delta \sigma_1} \right). \qquad (2) $$
Let us now consider the general case, when we try to approximate the first $k$ singular vectors. Set $\epsilon_k := \sin\angle(v_k, v_k')$ and $s_k := (\epsilon_1^2 + \dots + \epsilon_k^2)^{1/2}$. We can bound $\epsilon_k$ recursively as follows.

Theorem For any positive constants $\alpha_1, \alpha_2, k$ there is a positive constant $C$ such that the following holds. Assume that $A$ has rank $r \le n^{1-\alpha_1}$ and $\sigma_1 := \sigma_1(A) \ge n^{\alpha_2}$. Let $E$ be a random Bernoulli matrix. Then with probability $1 - o(1)$,
$$ \epsilon_k^2 \le C \max\left( \frac{\sqrt{r \log n}}{\delta_k}, \frac{n}{\sigma_k \delta_k}, \frac{\sqrt n}{\sigma_k}, \frac{\sigma_1^2 s_{k-1}^2}{\sigma_k \delta_k}, \frac{(\sigma_1 + \sqrt n)(\sigma_k + \sqrt n) s_{k-1}}{\sigma_k \delta_k} \right). \qquad (3) $$
Take $A$ such that $r = n^{o(1)}$, $\sigma_1 = 2n^\alpha$, $\sigma_2 = n^\alpha$, $\delta_2 = n^\beta$, where $\alpha > 1/2 > \beta > 1 - \alpha$ are positive constants. Then $\delta_1 = n^\alpha$ and
$$ \epsilon_1^2 \le \max\left( n^{-\alpha + o(1)}, n^{1 - 2\alpha + o(1)} \right) $$
almost surely. Assume that we want to bound $\sin\angle(v_2, v_2')$. The gap $\delta_2 = n^\beta = o(n^{1/2})$, so Wedin's theorem does not apply. On the other hand, our theorem implies that almost surely
$$ \epsilon_2^2 \le \max\left( n^{-\beta + o(1)}, n^{1/2 - \alpha + o(1)}, n^{1 - \alpha - \beta} \right). $$
Thus, we have almost surely $\sin\angle(v_2, v_2') = n^{-\Omega(1)} = o(1)$.
Proof strategy.
- Bound the difference $\sigma_1' - \sigma_1$ from both above and below.
- Show that if $v_1'$ is far from $v_1$, then $\sigma_1'$ is far from $\sigma_1$.

The second step relies on the formula
$$ \sigma_1' = \sup_{\|v\| = 1} \|(A+E)v\|. $$
It suffices to consider $v$ in an $\epsilon$-net of the unit sphere. Critical step: it suffices to restrict to a subspace of dimension roughly $\operatorname{rank} A$.
Fix a system $v_1, \dots, v_n$ of unit singular vectors of $A$. It is well known that $v_1, \dots, v_n$ form an orthonormal basis. (If $A$ has rank $r$, the choice of $v_{r+1}, \dots, v_n$ will turn out to be irrelevant.) For a vector $v$, if we decompose it as
$$ v = \alpha_1 v_1 + \dots + \alpha_n v_n, $$
then
$$ \|Av\|^2 = v^* A^* A v = \sum_{i=1}^n \alpha_i^2 \sigma_i^2. \qquad (4) $$
Courant-Fischer minimax principle for singular values:
$$ \sigma_k(M) = \max_{\dim H = k} \; \min_{v \in H,\, \|v\| = 1} \|Mv\|, \qquad (5) $$
where $\sigma_k(M)$ is the $k$th largest singular value of $M$.
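Identity (4) is easy to verify numerically; the sketch below uses a random $6 \times 6$ matrix (the size is an arbitrary choice).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
s, Vt = np.linalg.svd(A)[1:]  # singular values and right singular vectors

v = rng.standard_normal(n)
v /= np.linalg.norm(v)         # a random unit vector
alpha = Vt @ v                 # coordinates alpha_i = v . v_i in the singular basis
lhs = np.linalg.norm(A @ v) ** 2
rhs = np.sum(alpha ** 2 * s ** 2)  # sum_i alpha_i^2 sigma_i^2
print(abs(lhs - rhs))  # ~ 0 up to floating-point error
```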
Let $\epsilon$ be a positive number. A set $X$ is an $\epsilon$-net of a set $Y$ if for any $y \in Y$ there is $x \in X$ such that $\|x - y\| \le \epsilon$.

Lemma [$\epsilon$-approximation lemma] Let $H$ be a subspace and $S := \{v \mid \|v\| = 1, v \in H\}$. Let $0 < \epsilon \le 1$ be a number and $M$ a linear map. Let $N \subset S$ be an $\epsilon$-net of $S$. Then there is a vector $w \in N$ such that
$$ \|Mw\| \ge (1 - \epsilon) \max_{v \in S} \|Mv\|. $$

Let $v$ be the vector where the maximum is attained and let $w$ be a vector in the net closest to $v$ (ties are broken arbitrarily). Then by the triangle inequality
$$ \|Mw\| \ge \|Mv\| - \|M(v - w)\|. $$
As $\|v - w\| \le \epsilon$, we have $\|M(v - w)\| \le \epsilon \max_{v \in S} \|Mv\|$.
Lemma [Net size] A unit sphere in $d$ dimensions admits an $\epsilon$-net of size at most $(3\epsilon^{-1})^d$.

Let $S$ be the sphere in question, centered at $O$, and let $N \subset S$ be a finite subset of $S$ such that the distance between any two points is at least $\epsilon$. If $N$ is maximal with respect to this property, then $N$ is an $\epsilon$-net. On the other hand, the balls of radius $\epsilon/2$ centered at the points of $N$ are disjoint subsets of the ball of radius $1 + \epsilon/2$ centered at $O$. Since
$$ \frac{1 + \epsilon/2}{\epsilon/2} \le 3\epsilon^{-1}, $$
the claim follows by a volume argument.
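The maximal-packing construction in the proof can be imitated on the circle ($d = 2$); the greedy pass below over a fine sample of the circle is a sketch, with $\epsilon = 1/2$ chosen arbitrarily.

```python
import numpy as np

def maximal_separated_subset(points, eps):
    # Greedily keep points at pairwise distance >= eps.  A maximal such
    # subset of a dense sample is (approximately) an eps-net of the curve.
    net = []
    for p in points:
        if all(np.linalg.norm(p - q) >= eps for q in net):
            net.append(p)
    return net

eps = 0.5
theta = np.linspace(0, 2 * np.pi, 2000, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
net = maximal_separated_subset(circle, eps)
print(len(net), (3 / eps) ** 2)  # net size vs the (3/eps)^d bound with d = 2
```

The greedy net has about a dozen points, comfortably under the volume bound of 36.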
Lemma [Spectral norm; Alon-Krivelevich-V.] There is a constant $C_0 > 0$ such that the following holds. Let $E$ be a random Bernoulli matrix of size $n$. Then
$$ \mathbb{P}(\|E\| \ge 3\sqrt n) \le \exp(-C_0 n). $$

Next, we present a lemma which roughly asserts that for any two given vectors $u$ and $v$, the vectors $u$ and $Ev$ are, with high probability, almost orthogonal.

Lemma [Orthogonality lemma] Let $E$ be a random Bernoulli matrix of size $n$. For any fixed unit vectors $u, v$ and positive number $t$,
$$ \mathbb{P}(|u^T E v| \ge t) \le 2\exp(-t^2/16). $$
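A quick Monte Carlo check of the orthogonality lemma (the size $n = 300$, trial count, and threshold $t = 4$ are arbitrary choices): since $u^T E v = \sum_{i,j} u_i v_j E_{ij}$ has variance $\|u\|^2 \|v\|^2 = 1$, values much larger than 1 should be rare.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 300, 200, 4.0
u = rng.standard_normal(n); u /= np.linalg.norm(u)
v = rng.standard_normal(n); v /= np.linalg.norm(v)

# |u^T E v| over many independent Bernoulli matrices E
vals = np.array([abs(u @ rng.choice([-1.0, 1.0], size=(n, n)) @ v)
                 for _ in range(trials)])
print(vals.mean())       # typically below 1, consistent with unit variance
print(np.mean(vals > t)) # empirical tail beyond t = 4: essentially 0
```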
Lemma [Main lemma] For any constant $0 < \beta \le 1$ there is a constant $C$ such that the following holds. Assume that $A$ is such that $\sigma_1 \le n^{\beta^{-1}}$ and let $V := \operatorname{span}\{v_1, \dots, v_d\}$ for some $d = o(n/\log n)$. Then the following holds almost surely: for any unit vector $v \in V$,
$$ \|(A+E)v\|^2 \le \sum_{i=1}^n (v \cdot v_i)^2 \sigma_i^2 + C(n + \sigma_1 \sqrt{d \log n}). $$
It is important that the statement holds for all unit $v \in V$ simultaneously.
It suffices to prove the bound for $v$ belonging to an $\epsilon$-net $N$ of the unit sphere $S$ in $V$, with $\epsilon := \frac{1}{n + \sigma_1}$. With such a small $\epsilon$, the error coming from the factor $(1 - \epsilon)$ is swallowed by the error term $O(n + \sigma_1 \sqrt{d \log n})$. Thanks to the upper bound on the net size, it suffices to show that if $C$ is large enough, then for any fixed $v \in N$,
$$ \mathbb{P}\Big( \|(A+E)v\|^2 \ge \sum_{i=1}^n (v \cdot v_i)^2 \sigma_i^2 + C(n + \sigma_1 \sqrt{d \log n}) \Big) \le \exp(-2C_1 d \log n). $$
Fix $v \in N$. Then
$$ \|(A+E)v\|^2 = \|Av\|^2 + \|Ev\|^2 + 2(Av)\cdot(Ev) = \sum_{i=1}^n (v \cdot v_i)^2 \sigma_i^2 + \|Ev\|^2 + 2(Av)\cdot(Ev). $$
Now use the spectral norm lemma and the orthogonality lemma.
Let $u_i$ ($1 \le i \le n$) be the left singular vectors of the matrix $A$. First, we give a lower bound for $\sigma_1' := \|A + E\|$. By the minimax principle, we have
$$ \sigma_1' = \|A + E\| \ge u_1^T (A+E) v_1 = \sigma_1 + u_1^T E v_1. $$
By the orthogonality lemma, with probability $1 - o(1)$,
$$ |u_1^T E v_1| \le \log\log n. $$
(The choice of $\log\log n$ is not important. One can replace it by any function that tends slowly to infinity with $n$.) Thus, we have, with probability $1 - o(1)$, that
$$ \|A + E\| \ge \sigma_1 - \log\log n. \qquad (6) $$
Our main observation is that, with high probability, any $v$ that is far from $v_1$ would yield $\|(A+E)v\| < \sigma_1 - \log\log n$. Therefore, the first singular vector $v_1'$ of $A+E$ must be close to $v_1$.
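The lower bound (6) comes from testing $\|A+E\|$ against the single pair $(u_1, v_1)$, which is easy to see numerically; the sizes below ($n = 400$, $\sigma_1 = 500$) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
U = np.linalg.qr(rng.standard_normal((n, n)))[0]
V = np.linalg.qr(rng.standard_normal((n, n)))[0]
s = np.zeros(n); s[0], s[1] = 500.0, 100.0
A = U @ np.diag(s) @ V.T
E = rng.choice([-1.0, 1.0], size=(n, n))

correction = U[:, 0] @ E @ V[:, 0]   # u1^T E v1, typically O(1)
test_value = s[0] + correction       # u1^T (A+E) v1 = sigma_1 + u1^T E v1
sigma1_pert = np.linalg.norm(A + E, 2)

print(sigma1_pert >= test_value)  # True: any fixed unit pair gives a lower bound
print(abs(correction))            # small compared with sigma_1 = 500
```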
Consider a unit vector $v$ and write it as
$$ v = c_1 v_1 + c_2 v_2 + \dots + c_r v_r + c_0 u, \qquad (7) $$
where $u$ is a unit vector orthogonal to $H := \operatorname{span}\{v_1, \dots, v_r\}$ and $c_1^2 + \dots + c_r^2 + c_0^2 = 1$. Recall that $r$ is the rank of $A$, so $Au = 0$. Setting $w := c_1 v_1 + \dots + c_r v_r$ and using Cauchy-Schwarz, we have
$$ \|(A+E)v\|^2 = \|(A+E)w + c_0 Eu\|^2 \le \|(A+E)w\|^2 + 2c_0 \|(A+E)w\| \|Eu\| + c_0^2 \|Eu\|^2 \le \Big(1 + \frac{c_0^2}{4}\Big) \|(A+E)w\|^2 + (4 + c_0^2) \|Eu\|^2. $$
By the spectral norm lemma, we have, with probability $1 - o(1)$, that $\|Eu\| \le 3\sqrt n$ for every unit vector $u$. Furthermore, by the main lemma, we have, with probability $1 - o(1)$,
$$ \|(A+E)w\|^2 \le \sum_{i=1}^r (w \cdot v_i)^2 \sigma_i^2 + O(\sigma_1 \sqrt{r \log n} + n) $$
for every vector $w \in H$ of length at most 1. Since
$$ \sum_{i=1}^r (w \cdot v_i)^2 \sigma_i^2 = \sum_{i=1}^r c_i^2 \sigma_i^2 \le (1 - c_0^2)\sigma_1^2 - (1 - c_0^2 - c_1^2)(\sigma_1^2 - \sigma_2^2), $$
we can conclude that with probability $1 - o(1)$ the following holds. Any unit vector $v$ written in the form above satisfies
$$ \frac{1}{1 + c_0^2/4} \|(A+E)v\|^2 \le (1 - c_0^2)\sigma_1^2 - (1 - c_0^2 - c_1^2)(\sigma_1^2 - \sigma_2^2) + O(\sigma_1 \sqrt{r \log n} + n). $$
Set $v$ to be the first singular vector of $A+E$. By the lower bound (6) on $\|(A+E)v\|$,
$$ \frac{1}{1 + c_0^2/4} \|(A+E)v\|^2 \ge \Big(1 - \frac{c_0^2}{4}\Big)(\sigma_1 - \log\log n)^2. $$
Combining this with the previous inequality, we get
$$ (1 - c_1^2)\sigma_1 \delta \le \frac{c_0^2}{4}\sigma_1^2 - c_0^2 \sigma_2^2 + C(\sigma_1 \sqrt{r \log n} + n). $$
From here we can get an upper bound on $1 - c_1^2$ after some manipulation.
Further directions of research.
- Improved bounds.
- Other models of random matrices.
- Limiting distributions.
- Data in low dimension.