Supplementary Material

A. Auxiliary Lemmas

Lemma A.1 (Shalev-Shwartz & Ben-David, 2014). Any update of the form
$$P_{t+1} = \Pi_C\big(P_t - \eta g_t\big),$$
for an arbitrary sequence of matrices $g_1, g_2, \dots, g_T$, projection $\Pi_C$ onto an arbitrary convex set $C$, and initialization $P_1 = 0$, satisfies, for any $P \in C$,
$$\sum_{t=1}^{T} \langle P_t - P,\, g_t\rangle \;\le\; \frac{\|P\|^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\|g_t\|^2 .$$

Lemma A.2 (Warmuth & Kuzmin, 2008). For any PSD $A$ and any symmetric $B, C$ for which $B \preceq C$, it holds that $\mathrm{tr}(AB) \le \mathrm{tr}(AC)$.

B. Proofs of Section 3

Proof of Theorem 2.1. Using Lemma A.1, noting $\|P\| \le \sqrt{k}$ for all $P \in C$, and $g_t = x_t x_t^\top$ so that $\|g_t\| = \|x_t\|^2 \le 1$,
$$\sum_{t=1}^{T} x_t^\top P x_t - x_t^\top P_t x_t \;\le\; \frac{k}{2\eta} + \frac{\eta T}{2}.$$
Choosing $\eta = \sqrt{k/T}$ completes the proof.

Proof of Theorem 3.1. Let $\hat g_t = g_t + E_t$ be the noisy gradient. The analysis begins with bounding the distance between the $t$-th iterate and any candidate $P \in C$, $D_t = \|P_t - P\|$:
$$\begin{aligned}
D_{t+1}^2 &= \|P_{t+1} - P\|^2 = \|\Pi_C(P_t - \eta \hat g_t) - P\|^2 \le \|P_t - \eta \hat g_t - P\|^2\\
&= \|P_t - P\|^2 + \eta^2\|\hat g_t\|^2 - 2\eta\,\langle P_t - P,\, g_t + E_t\rangle\\
&\le D_t^2 + \eta^2 G^2 - 2\eta\,\langle P_t - P,\, g_t\rangle - 2\eta\,\langle P_t - P,\, E_t\rangle\\
&\le D_t^2 + \eta^2 G^2 - 2\eta\,\langle P_t - P,\, g_t\rangle + 2\eta\,\|P_t - P\|\,\|E_t\|\\
&\le D_t^2 + \eta^2 G^2 - 2\eta\,\langle P_t - P,\, g_t\rangle + 4\sqrt{k}\,\eta\,\|E_t\|,
\end{aligned}$$
where the first inequality is due to projection onto a convex set being non-expanding and the third one is by Hölder's inequality. The last inequality follows from the triangle inequality and the constraints: $\|P_t - P\| \le \|P_t\| + \|P\| \le 2\sqrt{k}$. Noting $g_t = x_t x_t^\top$, rearranging and dividing both sides by $2\eta$ we get
$$\langle P - P_t,\, x_t x_t^\top\rangle \;\le\; \frac{D_t^2 - D_{t+1}^2}{2\eta} + \frac{\eta}{2}G^2 + 2\sqrt{k}\,\|E_t\|. \qquad (5)$$
We then sum over iterates and use the telescopic property of the first term on the right-hand side of (5), so that $\sum_{t}\big(D_t^2 - D_{t+1}^2\big) = D_1^2 - D_{T+1}^2 \le D_1^2$. The initial distance $D_1$ is bounded as follows:
$$D_1^2 = \|P_1 - P\|^2 = \|P_1\|^2 + \|P\|^2 - 2\langle P_1, P\rangle \le k + k + 2\|P_1\|\,\|P\| \le 4k.$$
Plugging the above bound back into (5) and noting $\|E_t\| \le E$, we get
$$\sum_{t=1}^{T} x_t^\top P x_t - x_t^\top P_t x_t \;\le\; \frac{2k}{\eta} + \frac{\eta T G^2}{2} + 2\sqrt{k}\,T E.$$
By choosing the optimal learning rate $\eta = \frac{2}{G}\sqrt{k/T}$ we get the desired result.
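The update analysed in Lemma A.1 and used in the two proofs above is an ordinary projected (stochastic) gradient step on the matrix iterate. The following is a minimal Python sketch of this scheme for the $k$-PCA reward $x^\top P x$, under the assumption that the comparator set is $C = \{P : 0 \preceq P \preceq I,\ \mathrm{tr}\,P = k\}$ and that projection onto $C$ is computed by shifting and clipping eigenvalues; the choice of set, the bisection-based projection routine, and all function names are our own illustration choices, not the paper's implementation.

```python
import numpy as np

def project_capped_simplex(P, k):
    """Project a symmetric matrix onto {0 <= P <= I, tr(P) = k} in Frobenius norm.

    The projection only shifts eigenvalues: find s such that
    sum_i clip(lam_i + s, 0, 1) = k, then clip.  Assumes 1 <= k <= d.
    """
    lam, V = np.linalg.eigh((P + P.T) / 2)
    def capped_trace(s):
        return np.clip(lam + s, 0.0, 1.0).sum()
    lo, hi = -lam.max(), 1.0 - lam.min()     # capped_trace(lo) = 0, capped_trace(hi) = d
    for _ in range(100):                     # bisection on the monotone map s -> capped_trace(s)
        mid = (lo + hi) / 2
        if capped_trace(mid) < k:
            lo = mid
        else:
            hi = mid
    lam_proj = np.clip(lam + hi, 0.0, 1.0)
    return (V * lam_proj) @ V.T

def msg_pca(stream, d, k, eta):
    """Projected gradient steps on the reward x^T P x (descent on its negative,
    which is the form covered by Lemma A.1), starting from P_1 = 0."""
    P = np.zeros((d, d))
    for x in stream:
        P = project_capped_simplex(P + eta * np.outer(x, x), k)
    return P
```

Because the constraint set is rotation-invariant, the projection never mixes eigenvectors; only the spectrum is shifted and clipped, which is why a one-dimensional bisection suffices.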
Proof of Theorem 3.2. Let $\hat g_k = (x_k + y_k)(x_k + y_k)^\top$ denote the corrupted gradient and let $g_k = x_k x_k^\top$ denote the unbiased estimate of $C$ based on the $k$-th sample. Let $v$ denote the top eigenvector of $C := \mathbb{E}_D[xx^\top]$ with associated eigenvalue $\lambda$, and let $u \sim N(0, I)$ be the initialization for Oja's algorithm. We are going to assume that $2\|y_k\| + \|y_k\|^2 \le 1$. We note that our assumption implies that $\|\hat g_k - g_k\| \le O(1)$ and $\|\hat g_k\|^n \le O(2^{n/2})$. Let $\Phi^M_k$ and $\Psi_k$ be defined as in the proof of Theorem 5.3. We are first going to upper bound $\mathrm{tr}\,\Phi^{uu^\top}_T$:
$$\begin{aligned}
\mathrm{tr}\,\Phi^{uu^\top}_T &= \mathrm{tr}\,\Phi^{uu^\top}_{T-1} + 2\eta\,\mathrm{tr}\big((\hat g_T - g_T)\Phi^{uu^\top}_{T-1}\big) + 2\eta\,\mathrm{tr}\big(x_T x_T^\top\,\Phi^{uu^\top}_{T-1}\big) + \eta^2\,\mathrm{tr}\big(\hat g_T\,\Phi^{uu^\top}_{T-1}\,\hat g_T\big)\\
&\le \mathrm{tr}\,\Phi^{uu^\top}_{T-1}\Big(1 + 2\eta\, w_T^\top g_T w_T + 2\eta\,\|\hat g_T - g_T\| + \eta^2\|\hat g_T\|^2\Big)\\
&\le \mathrm{tr}\,\Phi^{uu^\top}_{T-1}\,\exp\Big(2\eta\, w_T^\top g_T w_T + 2\eta\,\|\hat g_T - g_T\| + \eta^2\|\hat g_T\|^2\Big)\\
&\le \cdots \le \|u\|^2\,\exp\Big(\sum_{k=1}^{T} 2\eta\, w_k^\top g_k w_k + 2\eta\,\|\hat g_k - g_k\| + \eta^2\|\hat g_k\|^2\Big),
\end{aligned}$$
where we have used Lemma A.2 (Warmuth & Kuzmin, 2008). Next we are going to lower bound $\mathbb{E}[\mathrm{tr}\,\Psi_T]$:
$$\begin{aligned}
\mathbb{E}[\mathrm{tr}\,\Psi_T] &= \mathbb{E}\big[\mathrm{tr}\big(vv^\top(I + 2\eta\hat g_T)\Phi^I_{T-1}\big)\big] + \eta^2\,\mathbb{E}\big[v^\top\hat g_T\Phi^I_{T-1}\hat g_T v\big]\\
&\ge \mathbb{E}\big[\mathrm{tr}\big(vv^\top(I + 2\eta C)\Phi^I_{T-1}\big)\big] + 2\eta\,\mathbb{E}\big[\mathrm{tr}\big(vv^\top y_T y_T^\top\,\Phi^I_{T-1}\big)\big]\\
&\ge (1 + 2\eta\lambda)\,\mathbb{E}\big[v^\top\Phi^I_{T-1}v\big] \ge \exp\big(2\eta\lambda - 4\eta^2\lambda^2\big)\,\mathbb{E}\big[v^\top\Phi^I_{T-1}v\big] \ge \cdots \ge \exp\big(T(2\eta\lambda - 4\eta^2\lambda^2)\big), \qquad (7)
\end{aligned}$$
where we used the fact that $\mathbb{E}[x_k] = 0$ and that $1 + x \ge \exp(x - x^2)$ for $x \in [0, 1]$. Finally, using the Lieb-Thirring inequality we have the following upper bound on $\mathbb{E}[\mathrm{tr}\,\Psi_T^2]$:
$$\mathbb{E}[\mathrm{tr}\,\Psi_T^2] \le \mathbb{E}\big[\mathrm{tr}\big((I + \eta\hat g_T)^4\,\Psi_{T-1}^2\big)\big] \le \cdots \le \exp\Big(T\big(4\eta\lambda + \alpha_1\eta + \alpha_2\eta^2 + \alpha_3\eta^3 + \alpha_4\eta^4\big)\Big), \qquad (8)$$
where for the last inequality we have pushed the expectation inside the trace and used the fact that $\mathbb{E}[\|\hat g\|^n]$ is bounded above by a linear combination of the moments $\|y\|^m$, $m \in [n]$; we have also used the fact that $1 + x \le \exp(x)$ and induction, together with $\|y_k\| \le O(1)$ and $\|y_k\|^n \le O(2^{n/2})$.

Let $\eta = \beta/\sqrt{T}$ for some constant $\beta$ independent of $T$. Using Chebyshev's inequality with equations (7) and (8), we claim that for small enough $\beta$ it holds, with probability at least $1 - \delta$, that
$$\mathrm{tr}\,\Psi_T \ge \tfrac{1}{3}\exp\big(T(2\eta\lambda - 4\eta^2\lambda^2)\big).$$
For this choice of $\eta$, the claim reduces to $\exp(\beta\bar\alpha) \le 1 + \delta/9$, where $\bar\alpha = \alpha_1 + \alpha_2\beta + \lambda^2\beta + \alpha_3\beta^2 + \alpha_4\beta^3$. Taking $\beta \to 0$ we can see that $\exp(\bar\alpha\beta) < 1 + \delta/9$, and since $\exp(\bar\alpha\beta)$ is a continuous function of $\beta$, the desired value of $\beta > 0$ exists. From the derivation in (Allen-Zhu & Li, 2017, Appendix I), it follows that $\|u\|^2 \le O(d + \log(1/\delta))$ and $\mathrm{tr}\,\Phi^{uu^\top}_T \ge \Omega(\delta^2)\,\mathrm{tr}\,\Psi_T$ with probability $1 - \delta$. Thus with probability $1 - 2\delta$ it holds that
$$O(d + \log(1/\delta))\,\exp\Big(\sum_{k=1}^{T} 2\eta\, w_k^\top g_k w_k + 2\eta\,\|\hat g_k - g_k\| + \eta^2\|\hat g_k\|^2\Big) \ge \frac{\delta^2}{3}\exp\big(T(2\eta\lambda - 4\eta^2\lambda^2)\big),$$
which, after taking logarithms and rearranging, gives
$$\sum_{k=1}^{T}\big\langle g_k,\, P^* - P_k\big\rangle \le \frac{\log O(d + \log(1/\delta)) + \log(3/\delta^2)}{2\eta} + O\Big(\sum_{k=1}^{T}\|\hat g_k - g_k\|\Big) + \eta\, O\Big(\sum_{k=1}^{T}\|\hat g_k\|^2\Big) + 2\eta\lambda^2 T.$$
Substituting $\eta = \beta/\sqrt{T}$ finishes the proof.

C. Proofs of Section 4

Lemma C.1. The naive estimator $\tilde x\tilde x^\top$ is not unbiased, i.e.
$$\mathbb{E}_R[\tilde x\tilde x^\top \mid x] = q^2\, xx^\top + q(1 - q)\,\mathrm{diag}(xx^\top),$$
where the expectation is taken with respect to the Bernoulli random model.

Proof of Lemma C.1. For off-diagonal elements $i \ne j$,
$$\mathbb{E}_R[\tilde x\tilde x^\top \mid x]_{i,j} = x_i x_j\, \mathbb{P}\big(i, j \in \{i_1, \dots, i_r\}\big) = x_i x_j\, \mathbb{P}(Z_i = 1)\,\mathbb{P}(Z_j = 1) = q^2\, x_i x_j.$$
Now consider the diagonal elements:
$$\mathbb{E}_R[\tilde x\tilde x^\top \mid x]_{i,i} = x_i^2\, \mathbb{P}(Z_i = 1) = q\, x_i^2.$$
Putting these together, $\mathbb{E}_R[\tilde x\tilde x^\top \mid x] = q^2\, xx^\top + q(1 - q)\,\mathrm{diag}(xx^\top)$.

Proof of Lemma 4.1. First note that
$$\mathbb{E}_S[zz^\top \mid \tilde x] = \frac{r_t(1 - q)}{q^2}\,\mathbb{E}_S\big[\tilde x_{i_s}^2\, e_{i_s} e_{i_s}^\top \mid \tilde x\big] = \frac{1 - q}{q^2}\,\mathrm{diag}(\tilde x\tilde x^\top).$$
The proof simply follows from the following equalities:
$$\begin{aligned}
\mathbb{E}_{S,R}\Big[\tfrac{1}{q^2}\tilde x\tilde x^\top - zz^\top \,\Big|\, x\Big] &= \tfrac{1}{q^2}\,\mathbb{E}_R\big[\mathbb{E}_S[\tilde x\tilde x^\top \mid \tilde x]\mid x\big] - \mathbb{E}_R\big[\mathbb{E}_S[zz^\top \mid \tilde x]\mid x\big]\\
&= \tfrac{1}{q^2}\,\mathbb{E}_R[\tilde x\tilde x^\top \mid x] - \tfrac{1-q}{q^2}\,\mathbb{E}_R[\mathrm{diag}(\tilde x\tilde x^\top)\mid x]\\
&= xx^\top + \tfrac{1-q}{q}\,\mathrm{diag}(xx^\top) - \tfrac{1-q}{q}\,\mathrm{diag}(xx^\top) = xx^\top. \qquad (9)
\end{aligned}$$
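Lemma C.1 and Lemma 4.1 together say that the naive second-moment estimate built from Bernoulli-masked coordinates over-weights the diagonal, and that subtracting a correctly scaled diagonal term (or a sampled surrogate of it) restores unbiasedness. The short Monte Carlo check below illustrates both facts. It uses the closed-form correction $\frac{1}{q^2}\tilde x\tilde x^\top - \frac{1-q}{q^2}\mathrm{diag}(\tilde x\tilde x^\top)$ rather than the paper's sampled term $zz^\top$ (the conditional mean is the same), and all numerical constants are illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, n_trials = 8, 0.3, 200_000

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                     # ||x|| = 1, as assumed in the analysis

acc_naive = np.zeros((d, d))
acc_debiased = np.zeros((d, d))
for _ in range(n_trials):
    mask = rng.random(d) < q               # Bernoulli(q) observation pattern (model R)
    xt = x * mask                          # observed vector, zeros at missing entries
    naive = np.outer(xt, xt)
    acc_naive += naive
    # debiasing implied by Lemma C.1: E[naive/q^2 - (1-q)/q^2 diag(naive) | x] = x x^T
    acc_debiased += naive / q**2 - (1 - q) / q**2 * np.diag(np.diag(naive))

pred_naive = q**2 * np.outer(x, x) + q * (1 - q) * np.diag(np.diag(np.outer(x, x)))
print("naive bias matches Lemma C.1:", np.allclose(acc_naive / n_trials, pred_naive, atol=5e-3))
print("debiased estimator ~ x x^T:  ", np.allclose(acc_debiased / n_trials, np.outer(x, x), atol=5e-3))
```

Both checks should report True up to Monte Carlo error with the sample size above.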
Proof of Theorem 4.2. We first bound $\|\hat g_t\|^2$ as follows:
$$\begin{aligned}
\|\hat g_t\|^2 &= \Big\|\tfrac{1}{q^2}\tilde x_t\tilde x_t^\top - z_t z_t^\top\Big\|^2 = \tfrac{1}{q^4}\|\tilde x_t\tilde x_t^\top\|^2 + \|z_t z_t^\top\|^2 - \tfrac{2}{q^2}\big\langle \tilde x_t\tilde x_t^\top,\, z_t z_t^\top\big\rangle\\
&\le \tfrac{1}{q^4}\|\tilde x_t\|^4 + \|z_t\|^4 \le \tfrac{1}{q^4} + \tfrac{r_t^2(1-q)^2}{q^4},
\end{aligned}$$
where the first inequality follows from the fact that the inner product of PSD matrices is non-negative, the second is by the definition of $\tilde x$ and $z$, and the last inequality holds since $\|\tilde x\| \le \|x\| \le 1$. Taking the expectation of both sides, and noting that for the Binomial distribution $\mathbb{E}_R[r_t^2] = \mathrm{var}(r_t) + \mathbb{E}_R[r_t]^2 = dq(1-q) + d^2q^2$, we get
$$\mathbb{E}_R\big[\|\hat g_t\|^2\big] \le \frac{1 + dq(1-q)^3 + d^2q^2(1-q)^2}{q^4}.$$
Now, using Lemma A.1 with $\{\hat g_t\}$ and noting $\|P\| \le \sqrt{k}$ for all $P \in C$,
$$\mathbb{E}\Big[\sum_{t=1}^{T}\big\langle P - P_t,\, \hat g_t\big\rangle\Big] \le \frac{k}{2\eta} + \frac{\eta T}{2}\cdot\frac{1 + dq(1-q)^3 + d^2q^2(1-q)^2}{q^4}.$$
Setting $\eta = q^2\sqrt{k}\Big/\sqrt{T\big(1 + dq(1-q)^3 + d^2q^2(1-q)^2\big)}$ and taking expectation with respect to the internal randomization $S$ and the distribution of the missing data $R$ finishes the proof.

Proof of Theorem 4.3 (Oja with missing entries). We begin by bounding $\mathbb{E}_{S,R}[\|\hat g_k\|^2]$, $\mathbb{E}_{S,R}[\|\hat g_k\|^3]$ and $\mathbb{E}_{S,R}[\|\hat g_k\|^4]$. Let $a := \|\tilde x_k\tilde x_k^\top\|$ and $b := \|z_k z_k^\top\|$. Using Jensen's inequality we have $\|\hat g_k\|^2 \le 2a^2 + 2b^2$, $\|\hat g_k\|^3 \le 4a^3 + 4b^3$ and $\|\hat g_k\|^4 \le 8a^4 + 8b^4$. Since $\|x\| \le 1$ it holds that $a^n \le 1/q^{2n}$ and $b^n \le r_k^n(1-q)^n/q^{2n}$. Let
$$\begin{aligned}
\mu_2 &= dq(1-q) + d^2q^2,\\
\mu_3 &= dq\big(1 - 3q + 3dq + 2q^2 - 3dq^2 + d^2q^2\big),\\
\mu_4 &= dq\big(1 - 7q + 7dq + 12q^2 - 18dq^2 + 6d^2q^2 - 6q^3 + 11dq^3 - 6d^2q^3 + d^3q^3\big).
\end{aligned}$$
These are the second, third and fourth moments of the binomial random variable $r_k$, respectively. Also let $\alpha_k = \big(1 + r_k(1-q)\big)/q^2$. Thus we have
$$\mathbb{E}_{S,R}\big[\|\hat g_k\|^n\big] \le \alpha_n := \frac{2^{n-1}\big(1 + \mu_n(1-q)^n\big)}{q^{2n}}.$$
Define $\Phi^M_k$, $\Psi_k$, $w_k$, $\lambda$, $v$ and $u$ as in the proof of Theorem 5.3. We have:
$$\begin{aligned}
\mathrm{tr}\,\Phi^{uu^\top}_T &= \mathrm{tr}\,\Phi^{uu^\top}_{T-1} + 2\eta\,\mathrm{tr}\big(\hat g_T\Phi^{uu^\top}_{T-1}\big) + \eta^2\,\mathrm{tr}\big(\hat g_T\Phi^{uu^\top}_{T-1}\hat g_T\big)\\
&\le \mathrm{tr}\,\Phi^{uu^\top}_{T-1}\big(1 + 2\eta\,w_T^\top\hat g_T w_T + \eta^2\alpha_2\big) \le \mathrm{tr}\,\Phi^{uu^\top}_{T-1}\,\exp\big(\eta^2\alpha_2 + 2\eta\,w_T^\top\hat g_T w_T\big)\\
&\le \cdots \le \|u\|^2\,\exp\Big(\sum_{t=1}^{T}\eta^2\alpha_2 + 2\eta\,w_t^\top\hat g_t w_t\Big).
\end{aligned}$$
Next we need a lower bound on $\mathbb{E}[\mathrm{tr}\,\Psi_T]$. This is done in the same way as in Theorem 5.3:
$$\mathbb{E}[\mathrm{tr}\,\Psi_T] \ge \exp\big(T(2\eta\lambda - 4\eta^2\lambda^2)\big).$$
Finally, we need to bound $\mathbb{E}[\mathrm{tr}\,\Psi_T^2]$. It holds that:
$$\mathbb{E}[\mathrm{tr}\,\Psi_T^2] = \mathbb{E}\big[\mathrm{tr}\big(((I + \eta\hat g_T)\Psi_{T-1}(I + \eta\hat g_T))^2\big)\big] \le \mathbb{E}\big[\mathrm{tr}\big((I + \eta\hat g_T)^4\,\Psi_{T-1}^2\big)\big] \le \exp\big(\bar\alpha\eta^2 + 4\eta\lambda\big)\,\mathbb{E}[\mathrm{tr}\,\Psi_{T-1}^2] \le \cdots \le \exp\big(T(\bar\alpha\eta^2 + 4\eta\lambda)\big),$$
where $\bar\alpha$ collects the $\alpha_2$, $\alpha_3$ and $\alpha_4$ terms arising from expanding $(I + \eta\hat g_T)^4$, the first inequality is the Lieb-Thirring inequality, and we again used $1 + x \le \exp(x)$. Using Chebyshev's inequality with the last two displays, as long as $\eta \le \frac{\log(1 + \delta/9)}{\bar\alpha + \lambda}$ it holds with probability $1 - \delta$ that
$$\mathrm{tr}\,\Psi_T \ge \tfrac{1}{3}\exp\big(T(2\eta\lambda - 4\eta^2\lambda^2)\big).$$
From the derivation in (Allen-Zhu & Li, 2017, Appendix I), it follows that $\|u\|^2 \le O(d + \log(1/\delta))$ and $\mathrm{tr}\,\Phi^{uu^\top}_T \ge \Omega(\delta^2)\,\mathrm{tr}\,\Psi_T$ with probability $1 - \delta$. Thus with probability $1 - 2\delta$ it holds that
$$O(d + \log(1/\delta))\,\exp\Big(\sum_{t=1}^{T}\eta^2\alpha_2 + 2\eta\,w_t^\top\hat g_t w_t\Big) \ge \frac{\delta^2}{3}\exp\big(T(2\eta\lambda - 4\eta^2\lambda^2)\big).$$
Taking logarithms on both sides implies that
$$T\big(2\eta\lambda - 4\eta^2\lambda^2\big) - 2\eta\sum_{t=1}^{T}\mathrm{tr}\big(\mathbb{E}_{S,R}[w_t w_t^\top]\,\mathbb{E}_{S,R}[\hat g_t]\big) \le \log O(d + \log(1/\delta)) + \log\frac{3}{\delta^2} + \eta^2\sum_{t=1}^{T}\mathbb{E}_{S,R}[\alpha_2],$$
where we have used the independence between $\hat g_t$ and $w_t$. This implies
$$\sum_{t=1}^{T}\big\langle g_t,\, P^* - \mathbb{E}_{S,R}[P_t]\big\rangle \le \frac{\log O(d + \log(1/\delta)) + \log(3/\delta^2)}{2\eta} + 2\eta\lambda^2 T + \frac{\eta\alpha_2 T}{2}.$$
Setting $\eta = \frac{\log(1 + \delta/9)}{\bar\alpha + \lambda}$ finishes the proof.
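Theorem 4.3 concerns Oja's algorithm driven by debiased gradients built from Bernoulli-masked samples. The sketch below is a small Python illustration of that algorithmic idea. It substitutes the closed-form diagonal correction of Lemma C.1 for the sampled term $z_k z_k^\top$ analysed in the proof (the conditional mean is identical), and the step size, masking rate and toy data model are arbitrary illustration choices rather than the paper's settings.

```python
import numpy as np

def oja_missing_entries(X, q, eta, rng):
    """One pass of Oja's algorithm over Bernoulli-masked samples (rows of X).

    Uses the debiased gradient implied by Lemma C.1,
        g_hat = xt xt^T / q^2 - (1 - q) / q^2 * diag(xt xt^T),
    which satisfies E[g_hat | x] = x x^T.  Theorem 4.3 analyses a sampled
    surrogate z z^T in place of the diagonal term; its conditional mean is the
    same, so this sketch captures the algorithm, not the exact estimator in the proof.
    """
    n, d = X.shape
    w = rng.standard_normal(d)            # u ~ N(0, I) initialization, as in the analysis
    w /= np.linalg.norm(w)
    for x in X:
        mask = rng.random(d) < q          # Bernoulli(q) observation pattern
        xt = x * mask                     # partially observed sample
        # apply (I + eta * g_hat) to w without forming the d x d matrix
        gw = xt * (xt @ w) / q**2 - (1 - q) / q**2 * (xt**2) * w
        w = w + eta * gw
        w /= np.linalg.norm(w)            # Oja normalization step
    return w

# toy usage: planted covariance v v^T + 0.01 I, top eigenvector v = e_0
rng = np.random.default_rng(1)
d, n = 50, 20_000
v = np.eye(d)[0]
X = 0.1 * rng.standard_normal((n, d)) + rng.standard_normal((n, 1)) * v
w = oja_missing_entries(X, q=0.5, eta=0.02, rng=rng)
print("alignment |<w, v>| =", abs(w @ v))
```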
D. Proofs of Section 5

Lemma D.1. The naive estimator $\tilde x\tilde x^\top$ is not unbiased, i.e.
$$\mathbb{E}_R[\tilde x\tilde x^\top \mid x] = \frac{r(r-1)}{d(d-1)}\, xx^\top + \frac{r(d-r)}{d(d-1)}\,\mathrm{diag}(xx^\top),$$
where the expectation is taken with respect to the uniform sampling model.

Proof of Lemma D.1. For off-diagonal elements $i \ne j$, $\mathbb{E}_R[\tilde x\tilde x^\top \mid x]_{i,j} = x_i x_j\,\mathbb{P}(i, j \in \{i_1, \dots, i_r\})$, so we need to compute the probability of the event that two fixed elements $i, j \in [d]$ are both in the subset $\{i_1, \dots, i_r\}$ sampled uniformly at random from all subsets of size $r$. The number of sets of $r$ elements which contain both $i$ and $j$ is exactly $\binom{d-2}{r-2}$ and the total number of $r$-subsets is $\binom{d}{r}$; thus
$$\mathbb{P}\big(i, j \in \{i_1, \dots, i_r\}\big) = \binom{d-2}{r-2}\Big/\binom{d}{r} = \frac{r(r-1)}{d(d-1)},$$
and so $\mathbb{E}_R[\tilde x\tilde x^\top \mid x]_{i,j} = x_i x_j\,\frac{r(r-1)}{d(d-1)}$. Now consider the diagonal elements: $\mathbb{E}_R[\tilde x\tilde x^\top \mid x]_{i,i} = x_i^2\,\mathbb{P}(i \in \{i_1, \dots, i_r\})$, so we need the probability of the event $i \in \{i_1, \dots, i_r\}$. The number of sets of size $r$ containing $i$ is $\binom{d-1}{r-1}$, and thus $\mathbb{E}_R[\tilde x\tilde x^\top \mid x]_{i,i} = x_i^2\,\frac{r}{d}$. Putting these together,
$$\mathbb{E}_R[\tilde x\tilde x^\top \mid x] = \frac{r(r-1)}{d(d-1)}\,xx^\top + \Big(\frac{r}{d} - \frac{r(r-1)}{d(d-1)}\Big)\mathrm{diag}(xx^\top) = \frac{r(r-1)}{d(d-1)}\,xx^\top + \frac{r(d-r)}{d(d-1)}\,\mathrm{diag}(xx^\top).$$

Proof of Lemma 5.1. First note that
$$\mathbb{E}_S[zz^\top \mid \tilde x] = \frac{d(d-r)}{r-1}\,\mathbb{E}_S\big[\tilde x_{i_s}^2\, e_{i_s}e_{i_s}^\top \mid \tilde x\big] = \frac{d(d-r)}{r(r-1)}\,\mathrm{diag}(\tilde x\tilde x^\top).$$
The proof simply follows from the following equalities:
$$\begin{aligned}
\mathbb{E}_{S,R}\Big[\tfrac{d(d-1)}{r(r-1)}\tilde x\tilde x^\top - zz^\top \,\Big|\, x\Big] &= \tfrac{d(d-1)}{r(r-1)}\,\mathbb{E}_R\big[\mathbb{E}_S[\tilde x\tilde x^\top \mid \tilde x] \mid x\big] - \mathbb{E}_R\big[\mathbb{E}_S[zz^\top \mid \tilde x] \mid x\big]\\
&= \tfrac{d(d-1)}{r(r-1)}\,\mathbb{E}_R[\tilde x\tilde x^\top \mid x] - \tfrac{d(d-r)}{r(r-1)}\,\mathbb{E}_R[\mathrm{diag}(\tilde x\tilde x^\top) \mid x]\\
&= xx^\top + \tfrac{d-r}{r-1}\,\mathrm{diag}(xx^\top) - \tfrac{d-r}{r-1}\,\mathrm{diag}(xx^\top) = xx^\top, \qquad (13)
\end{aligned}$$
where $\mathrm{diag}(A)$, $A \in \mathbb{R}^{d\times d}$, is the $d\times d$ matrix consisting of the diagonal of $A$.

Proof of Theorem 5.2. We first bound $\|\hat g_t\|^2$ as follows:
$$\begin{aligned}
\|\hat g_t\|^2 &= \Big\|\tfrac{d(d-1)}{r(r-1)}\tilde x_t\tilde x_t^\top - z_t z_t^\top\Big\|^2 = \tfrac{d^2(d-1)^2}{r^2(r-1)^2}\|\tilde x_t\tilde x_t^\top\|^2 + \|z_t z_t^\top\|^2 - \tfrac{2d(d-1)}{r(r-1)}\big\langle \tilde x_t\tilde x_t^\top,\, z_t z_t^\top\big\rangle\\
&\le \tfrac{d^2(d-1)^2}{r^2(r-1)^2}\|\tilde x_t\|^4 + \|z_t\|^4 \le \tfrac{d^2(d-1)^2}{r^2(r-1)^2} + \tfrac{d^2(d-r)^2}{(r-1)^2},
\end{aligned}$$
where the first inequality follows from $\langle \tilde x_t\tilde x_t^\top,\, z_t z_t^\top\rangle \ge 0$, since this is an inner product of PSD matrices, the second follows from the definition of $\tilde x$ and $z$, and the last holds since $\|\tilde x\| \le \|x\| \le 1$. Now, using Lemma A.1 with $\{\hat g_t\}$ and noting $\|P\| \le \sqrt{k}$ for all $P \in C$, we have that
$$\sum_{t=1}^{T}\big\langle P - P_t,\, \hat g_t\big\rangle \le \frac{k}{2\eta} + \frac{\eta T}{2}\Big(\tfrac{d^2(d-1)^2}{r^2(r-1)^2} + \tfrac{d^2(d-r)^2}{(r-1)^2}\Big).$$
Since $\mathbb{E}[\hat g_t] = g_t = x_t x_t^\top$, setting
$$\eta = \sqrt{k}\,\Big/\sqrt{T\Big(\tfrac{d^2(d-1)^2}{r^2(r-1)^2} + \tfrac{d^2(d-r)^2}{(r-1)^2}\Big)}$$
and taking expectation with respect to the internal randomization $S$ and the distribution of the missing data $R$ finishes the proof.
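The estimator in Lemma 5.1 combines two sources of randomness: the uniformly sampled observation set ($R$) and an internally sampled coordinate used to build the rank-one correction $zz^\top$ ($S$). The snippet below is a small Monte Carlo sanity check of the unbiasedness identity (13). The explicit scaling $\sqrt{d(d-r)/(r-1)}$ used here to construct $z$ is our reconstruction of the estimator and should be read as an assumption; the check itself validates numerically that this choice yields an unbiased estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, n_trials = 10, 4, 300_000

x = rng.standard_normal(d)
x /= np.linalg.norm(x)

acc = np.zeros((d, d))
for _ in range(n_trials):
    obs = rng.choice(d, size=r, replace=False)       # R: uniformly random r-subset observed
    xt = np.zeros(d)
    xt[obs] = x[obs]
    i_s = rng.choice(obs)                            # S: one coordinate among the observed ones
    z = np.zeros(d)
    z[i_s] = np.sqrt(d * (d - r) / (r - 1)) * xt[i_s]   # assumed scaling (see lead-in)
    g_hat = d * (d - 1) / (r * (r - 1)) * np.outer(xt, xt) - np.outer(z, z)
    acc += g_hat

print("E[g_hat] ~ x x^T:", np.allclose(acc / n_trials, np.outer(x, x), atol=2e-2))
```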
Proof of Theorem 5.3. First we bound $\|\hat g_k\|$. Since $\hat g_k = \frac{d(d-1)}{r(r-1)}\tilde x_k\tilde x_k^\top - z_k z_k^\top$ is the difference of two PSD matrices, it holds that
$$\|\hat g_k\| \le \max\Big(\tfrac{d(d-1)}{r(r-1)}\|\tilde x_k\tilde x_k^\top\|,\ \|z_k z_k^\top\|\Big) \le \alpha.$$
Let $v$ be the top eigenvector of $C$ with associated eigenvalue $\lambda$. Also let $g_k = x_k x_k^\top$ and let $w_k$ be the $k$-th iterate of Oja's algorithm, so that $P_k = w_k w_k^\top$. Define
$$\begin{aligned}
\Phi^M_k &= (I + \eta\hat g_k)\cdots(I + \eta\hat g_1)\, M\,(I + \eta\hat g_1)\cdots(I + \eta\hat g_k),\\
\Psi_k &= (I + \eta\hat g_k)\cdots(I + \eta\hat g_1)\, vv^\top\,(I + \eta\hat g_1)\cdots(I + \eta\hat g_k).
\end{aligned}$$
Following the calculations in (Allen-Zhu & Li, 2017, Appendix I), we have:
$$\begin{aligned}
\mathrm{tr}\,\Phi^{uu^\top}_T &= \mathrm{tr}\big((I + \eta\hat g_T)\Phi^{uu^\top}_{T-1}(I + \eta\hat g_T)\big) = \mathrm{tr}\,\Phi^{uu^\top}_{T-1} + 2\eta\,\mathrm{tr}\big(\hat g_T\Phi^{uu^\top}_{T-1}\big) + \eta^2\,\mathrm{tr}\big(\hat g_T\Phi^{uu^\top}_{T-1}\hat g_T\big)\\
&\le \mathrm{tr}\,\Phi^{uu^\top}_{T-1} + 2\eta\,\mathrm{tr}\big(\hat g_T\Phi^{uu^\top}_{T-1}\big) + \eta^2\alpha^2\,\mathrm{tr}\,\Phi^{uu^\top}_{T-1} = \big(1 + \eta^2\alpha^2 + 2\eta\,w_T^\top\hat g_T w_T\big)\,\mathrm{tr}\,\Phi^{uu^\top}_{T-1}\\
&\le \exp\big(\eta^2\alpha^2 + 2\eta\,w_T^\top\hat g_T w_T\big)\,\mathrm{tr}\,\Phi^{uu^\top}_{T-1} \le \cdots \le \|u\|^2\,\exp\Big(\sum_{k=1}^{T}\eta^2\alpha^2 + 2\eta\,w_k^\top\hat g_k w_k\Big), \qquad (15)
\end{aligned}$$
where for the first inequality we have used the fact that $\Phi^{uu^\top}_{T-1}$ is PSD and $\|\hat g_k\| \le \alpha$, and for the second inequality we have used $1 + x \le \exp(x)$. We have also used the fact that $\Phi^{uu^\top}_{T-1} = \mathrm{tr}\big(\Phi^{uu^\top}_{T-1}\big)\,w_{T-1}w_{T-1}^\top$. Next we have
$$\begin{aligned}
\mathbb{E}[\mathrm{tr}\,\Psi_T] &= \mathbb{E}\big[\mathrm{tr}\big(vv^\top(I + \eta\hat g_T)\Phi^I_{T-1}(I + \eta\hat g_T)\big)\big] = \mathbb{E}\big[\mathrm{tr}\big(vv^\top(I + 2\eta\hat g_T)\Phi^I_{T-1}\big)\big] + \eta^2\,\mathbb{E}\big[v^\top\hat g_T\Phi^I_{T-1}\hat g_T v\big]\\
&\ge \mathbb{E}\big[\mathrm{tr}\big(vv^\top(I + 2\eta C)\Phi^I_{T-1}\big)\big] = (1 + 2\eta\lambda)\,\mathbb{E}\big[v^\top\Phi^I_{T-1}v\big] \ge \exp\big(2\eta\lambda - 4\eta^2\lambda^2\big)\,\mathbb{E}\big[v^\top\Phi^I_{T-1}v\big]\\
&\ge \cdots \ge \exp\big(T(2\eta\lambda - 4\eta^2\lambda^2)\big), \qquad (16)
\end{aligned}$$
where the expectation is with respect to both $R$ and $D$, and we have used $1 + x \ge \exp(x - x^2)$ for $x \in [0, 1]$. Finally, we have:
$$\mathbb{E}[\mathrm{tr}\,\Psi_T^2] = \mathbb{E}\big[\mathrm{tr}\big(((I + \eta\hat g_T)\Psi_{T-1}(I + \eta\hat g_T))^2\big)\big] \le \mathbb{E}\big[\mathrm{tr}\big((I + \eta\hat g_T)^4\,\Psi_{T-1}^2\big)\big] \le \exp\big(\eta^2\bar\alpha + 4\eta\lambda\big)\,\mathbb{E}[\mathrm{tr}\,\Psi_{T-1}^2] \le \cdots \le \exp\big(T(\eta^2\bar\alpha + 4\eta\lambda)\big), \qquad (17)$$
where $\bar\alpha$ collects the $\alpha^2$, $\alpha^3$ and $\alpha^4$ terms arising from expanding $(I + \eta\hat g_T)^4$, the first inequality is the Lieb-Thirring inequality, and we again used $1 + x \le \exp(x)$. Combining equations (16) and (17) with Chebyshev's inequality, as long as $\eta \le \frac{\log(1 + \delta/9)}{\alpha^2 + \lambda}$ it holds with probability $1 - \delta$ that
$$\mathrm{tr}\,\Psi_T \ge \tfrac{1}{3}\exp\big(T(2\eta\lambda - 4\eta^2\lambda^2)\big).$$
Following the derivations in (Allen-Zhu & Li, 2017), we have that with probability $1 - \delta$ it holds that $\mathrm{tr}\,\Phi^{uu^\top}_T \ge \Omega(\delta^2)\,\mathrm{tr}\,\Psi_T$ and that $\|u\|^2 \le O(d + \log(1/\delta))$. A union bound, together with inequality (15), implies that with probability $1 - 2\delta$ we have:
$$O(d + \log(1/\delta))\,\exp\Big(T\eta^2\alpha^2 + 2\eta\sum_{k=1}^{T} w_k^\top\hat g_k w_k\Big) \ge \frac{\delta^2}{3}\exp\big(T(2\eta\lambda - 4\eta^2\lambda^2)\big),$$
which implies
$$\lambda - \frac{1}{T}\sum_{k=1}^{T} w_k^\top\hat g_k w_k \le \frac{\log O(d + \log(1/\delta)) + \log(3/\delta^2)}{2\eta T} + 2\eta\lambda^2 + \frac{\eta\alpha^2}{2}.$$
Taking expectations and using $\mathbb{E}_{S,R}[\hat g_k] = g_k$, this gives
$$\sum_{k=1}^{T}\big\langle g_k,\, P^* - \mathbb{E}_{S,R}[P_k]\big\rangle \le \frac{(\alpha^2 + \lambda)\big(\log O(d + \log(1/\delta)) + \log(3/\delta^2)\big)}{2\log(1 + \delta/9)} + O\big(T\log(1 + \delta/9)\big),$$
where in the last implication we have substituted $\eta = \frac{\log(1 + \delta/9)}{\alpha^2 + \lambda}$ and used the fact that $\alpha^2 < \alpha^2 + \lambda$.
E. Proofs of Section 6

Proof of Theorem 6.1. Denoting $(P_t - I)x_t$ by $y_t$, we can rewrite $g_t$ as $g_t = \frac{x_t y_t^\top + y_t x_t^\top}{2\|y_t\|}$. Thus we have
$$\|g_t\|^2 = \frac{1}{4\|y_t\|^2}\,\mathrm{tr}\big[(x_t y_t^\top + y_t x_t^\top)(x_t y_t^\top + y_t x_t^\top)\big] = \frac{2(x_t^\top y_t)^2 + 2\|x_t\|^2\|y_t\|^2}{4\|y_t\|^2} \le \frac{\|x_t\|^2\|y_t\|^2}{\|y_t\|^2} = \|x_t\|^2.$$
By convexity of norms,
$$\|(P_t - I)x_t\| - \|(P - I)x_t\| \le \langle P_t - P,\, g_t\rangle.$$
Using Lemma A.1, noting $\|P\| \le \sqrt{k}$ for all $P \in C$, and $\|g_t\| \le 1$,
$$\sum_{t=1}^{T}\|(P_t - I)x_t\| - \|(P - I)x_t\| \le \frac{k}{2\eta} + \frac{\eta T}{2}.$$
Choosing $\eta = \sqrt{k/T}$ completes the proof.

F. Proofs of Section 8

Proof of Theorem 8.1. Our proof follows (Zinkevich, 2003). We start the analysis by bounding the distance of our iterate at time $t+1$ from any dynamic competitor $P^*_{t+1}$ at time $t+1$:
$$\begin{aligned}
\|P_{t+1} - P^*_{t+1}\|^2 &= \|\Pi_C(P_t - \eta g_t) - P^*_{t+1}\|^2 \le \|P_t - \eta g_t - P^*_{t+1}\|^2 = \|(P_t - P^*_t) + (P^*_t - P^*_{t+1}) - \eta g_t\|^2\\
&= \|P_t - P^*_t\|^2 + \|P^*_t - P^*_{t+1}\|^2 + \eta^2\|g_t\|^2 + 2\langle P_t - P^*_t,\, P^*_t - P^*_{t+1}\rangle - 2\eta\langle g_t,\, P_t - P^*_{t+1}\rangle.
\end{aligned}$$
Bounding the terms using $\|g_t\| = \|x_t\|^2 \le 1$, $\|P_t - P^*_t\| \le 2\sqrt{k}$ and $\|P^*_t - P^*_{t+1}\| \le 2\sqrt{k}$, rearranging and dividing both sides by $2\eta$, we get
$$x_t^\top P^*_t x_t - x_t^\top P_t x_t \le \frac{\|P_t - P^*_t\|^2 - \|P_{t+1} - P^*_{t+1}\|^2}{2\eta} + \frac{3\sqrt{k}}{\eta}\,\|P^*_t - P^*_{t+1}\| + \frac{\eta}{2}.$$
Summing over $t$, observe that the telescopic sum on the right-hand side is bounded by
$$\sum_{t=1}^{T}\big(\|P_t - P^*_t\|^2 - \|P_{t+1} - P^*_{t+1}\|^2\big) = \|P_1 - P^*_1\|^2 - \|P_{T+1} - P^*_{T+1}\|^2 \le 4k.$$
Plugging in the definition of the total shift $S = \sum_{t}\|P^*_{t+1} - P^*_t\|$,
$$\sum_{t=1}^{T} x_t^\top P^*_t x_t - x_t^\top P_t x_t \le \frac{3\sqrt{k}\,S}{\eta} + \frac{\eta T}{2} + \frac{2k}{\eta} = \frac{3\sqrt{k}\,S + 2k}{\eta} + \frac{\eta T}{2}.$$
Choosing $\eta = \sqrt{2(3\sqrt{k}\,S + 2k)/T}$ completes the proof.

G. Experimental Results

We provide additional experiments in Figures 3, 4, 5 and 6 for PCA with partial observations and for PCA with missing entries. As stated in the main text, all our observations from Figure 1 and Figure 2 hold for the results presented here. Further, in Figure 7 we present experimental results for the proposed Absolute Subspace Deviation Model algorithm of Section 6, referred to as Robust GD in the plots. The algorithms against which we compare are Robust Online PCA, proposed in (Goes et al., 2014) and referred to as Robust MEG in the plots, which is an instance of online mirror descent with the potential function chosen to be the entropy, and simple batch PCA on the whole dataset. As ground truth we take the $k$-dimensional subspace returned by the method proposed in (Lerman et al., 2015). In the following discussion the ground truth is represented by the projection matrix $P^* \in \mathbb{R}^{d\times d}$ and the estimated projection matrix returned by an algorithm is denoted by $P_t$. Since our experiments are run on a synthetic dataset of fixed size, we denote this dataset by $X \in \mathbb{R}^{d\times n}$. The criteria against which we evaluate are: the average angle between subspaces, given by $\mathrm{tr}(P^* P_t)/k$; the reconstruction error on the whole dataset, given by $\|U_t^\top X - U^{*\top} X\|^2/n$, where a rank-$k$ projection matrix $P$ is decomposed as $P = UU^\top$ for orthogonal $U \in \mathbb{R}^{d\times k}$; and the total regret incurred.

The dataset is generated as in (Goes et al., 2014). First, inliers are sampled from a $k$-dimensional multivariate normal distribution with mean $0$ and covariance matrix $\frac{1}{k}I_k$. Next, the inliers are embedded in a $d$-dimensional space via the linear transformation $U \in \mathbb{R}^{d\times k}$ with entries $U_{ii} = 1$ and $U_{ij} = 0$ for $i \ne j$. Finally, outliers are sampled from a $d$-dimensional multivariate normal distribution with mean $0$ and a diagonal covariance proportional to $\frac{\lambda}{d}\,\Sigma_{\mathrm{out}}$, where $(\Sigma_{\mathrm{out}})_{ii} = 1/i$; here $\lambda$ is a user-specified parameter which governs the SNR. In our experiments $\lambda$, $k$ and $d$ are held fixed and the number of outliers is set to one of three increasing fractions of the total number of points. In our comparison we also include a capped version of our proposed method, where the rank of each intermediate iterate is hard-capped by keeping only the directions associated with its top 5 singular values.

We now briefly discuss the more interesting points which should be observed from Figure 7. First, our algorithm clearly outperforms Robust Online PCA and batch PCA on all the chosen criteria. Secondly, since Robust Online PCA is required to always keep full-rank iterates, as it operates in the complementary space to the one we wish to recover, it has to pay a computational cost of $O(d^3)$ per iteration, while our method only pays $O(d\tilde k^2)$, where $\tilde k$ is the rank of the current iterate. As can be seen from the plots, for the two lower levels of outlier contamination the intermediate rank of the iterates does not grow too much, so in practice our proposed method is efficient. Finally, we note that the capped version of the proposed method performs as well, or even better in some cases.
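For reference, the following sketch shows one way to generate a synthetic dataset of the kind described above: low-dimensional Gaussian inliers embedded by a coordinate-selection map, plus Gaussian outliers with decaying diagonal covariance. The exact scaling of the outlier covariance, the sizes and the outlier fraction below are illustrative assumptions; the paper's experiments fix their own values of $\lambda$, $k$, $d$ and the outlier percentages.

```python
import numpy as np

def make_robust_pca_data(n, d, k, outlier_frac, snr_lambda, rng):
    """Synthetic inlier/outlier mixture in the spirit of the setup above.

    Inliers: z ~ N(0, (1/k) I_k), embedded into R^d by U with U[i, i] = 1
    (first k coordinates), so inliers live in a k-dimensional axis-aligned subspace.
    Outliers: Gaussian with diagonal covariance (snr_lambda/d) * Sigma_out, where
    (Sigma_out)_{ii} = 1/i for i = 1..d (the scaling is an assumption here).
    Returns the data matrix X (d x n) and a boolean outlier indicator.
    """
    n_out = int(outlier_frac * n)
    n_in = n - n_out

    U = np.zeros((d, k))
    U[:k, :k] = np.eye(k)                                    # embedding U_ii = 1, U_ij = 0
    Z = rng.multivariate_normal(np.zeros(k), np.eye(k) / k, size=n_in)
    inliers = Z @ U.T                                        # n_in x d inlier samples

    sigma_out = 1.0 / np.arange(1, d + 1)                    # decaying diagonal spectrum
    outliers = rng.standard_normal((n_out, d)) * np.sqrt(snr_lambda / d * sigma_out)

    X = np.vstack([inliers, outliers])
    is_outlier = np.zeros(n, dtype=bool)
    is_outlier[n_in:] = True
    perm = rng.permutation(n)                                # shuffle inliers and outliers
    return X[perm].T, is_outlier[perm]

rng = np.random.default_rng(0)
X, is_outlier = make_robust_pca_data(n=2000, d=100, k=10,
                                     outlier_frac=0.2, snr_lambda=4.0, rng=rng)
print(X.shape, is_outlier.mean())
```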
Figure 3: Comparisons of Oja, MGD, MGD-PO, MGD-MD, Oja-PO, Oja-MD and GROUSE for PCA with missing data on the XRMB dataset, in terms of the variance captured on a test set as a function of the number of observed entries (top), the number of iterations (middle) and runtime (bottom); columns correspond to increasing fractions of missing entries (88% and above).
Figure 4: Comparisons of Oja, MGD, MGD-PO, MGD-MD, Oja-PO, Oja-MD and GROUSE for PCA with missing data on the XRMB dataset, in terms of the variance captured on a test set as a function of the number of observed entries (top), the number of iterations (middle) and runtime (bottom); columns correspond to increasing fractions of missing entries (88% and above).
Figure 5: Comparisons of Oja, MGD, MSG-PO, MSG-MD, Oja-PO, Oja-MD and GROUSE for PCA with missing data on the MNIST dataset, in terms of the variance captured on a test set as a function of the number of observed entries (top), the number of iterations (middle) and runtime (bottom); columns correspond to increasing fractions of missing entries (88% and above).
Figure 6: Comparisons of Oja, MGD, MSG-PO, MSG-MD, Oja-PO, Oja-MD and GROUSE for PCA with missing data on the MNIST dataset, in terms of the variance captured on a test set as a function of the number of observed entries (top), the number of iterations (middle) and runtime (bottom); columns correspond to increasing fractions of missing entries (88% and above).
Figure 7: Comparison of batch PCA, Robust Online PCA (Robust MEG), and the Absolute Subspace Deviation Model (Robust GD, including its capped variant) on the synthetic dataset, with three increasing fractions of outliers in the top, middle and bottom rows. Experiments are in terms of the average subspace angle $\mathrm{tr}(P^* P_t)/k$ (leftmost), the variance captured on a test set (left), the reconstruction error $\|P^* X - P_t X\|^2/n$ (right) and the rank of the iterates (rightmost), as a function of the number of samples.
More information