Subspace Estimation from Incomplete Observations: A High-Dimensional Analysis


Chuang Wang, Yonina C. Eldar, Fellow, IEEE, and Yue M. Lu, Senior Member, IEEE

Abstract: We present a high-dimensional analysis of three popular algorithms, namely, Oja's method, GROUSE and PETRELS, for subspace estimation from streaming and highly incomplete observations. We show that, with proper time scaling, the time-varying principal angles between the true subspace and its estimates given by the algorithms converge weakly to deterministic processes when the ambient dimension $n$ tends to infinity. Moreover, the limiting processes can be exactly characterized as the unique solutions of certain ordinary differential equations (ODEs). A finite-sample bound is also given, showing that the rate of convergence towards such limits is $O(1/\sqrt{n})$. In addition to providing asymptotically exact predictions of the dynamic performance of the algorithms, our high-dimensional analysis yields several insights, including an asymptotic equivalence between Oja's method and GROUSE, and a precise scaling relationship linking the amount of missing data to the signal-to-noise ratio. By analyzing the solutions of the limiting ODEs, we also establish phase transition phenomena associated with the steady-state performance of these techniques.

Index Terms: Subspace tracking, streaming PCA, incomplete data, high-dimensional analysis, scaling limit

I. INTRODUCTION

Subspace estimation is a key task in many signal processing applications. Examples include source localization in array processing, system identification, network monitoring, and image sequence analysis, to name a few. The ubiquity of subspace estimation comes from the fact that a low-rank subspace model can conveniently capture the intrinsic, low-dimensional structures of many large datasets.

In this paper, we consider the problem of estimating and tracking an unknown subspace from streaming measurements with many missing entries. The streaming setting appears in applications (e.g., video surveillance) where high-dimensional data arrive sequentially over time at high rates. It is especially relevant in dynamic scenarios where the underlying subspace to be estimated can be time-varying. Missing data is also a very common issue in practice. Incomplete observations may result from a variety of reasons, such as the limitations of the sensing mechanisms, constraints on power consumption or communication bandwidth, or a deliberate design feature that protects the privacy of individuals by removing partial records.

C. Wang is with the John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA (e-mail: chuangwang@g.harvard.edu). Y. C. Eldar is with the Department of EE, Technion, Israel Institute of Technology, Haifa 32000, Israel (e-mail: yonina@ee.technion.ac.il). Y. M. Lu is with the John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA (e-mail: yuelu@seas.harvard.edu). The work of C. Wang and Y. M. Lu was supported in part by the US Army Research Office under contract W9NF and in part by the US National Science Foundation under grants CCF-394 and CCF. The work of Y. Eldar was supported in part by the European Union's Horizon 2020 Research and Innovation Program under Grant ERC-COG-BNYQ. Preliminary results of this work were presented at the Signal Processing with Adaptive Sparse Structured Representations (SPARS) workshop in 2017.

GROUSE [1] and PETRELS [2], as well as the classical Oja's method [3], are three popular algorithms for solving the above estimation problem.
They are all streaming algorithms in the sense that they provide instantaneous, on-the-fly updates to their subspace estimates upon the arrival of a data point. The three differ in their update rules: Oja's method and GROUSE perform first-order incremental gradient descent on the Euclidean space and the Grassmannian, respectively, whereas PETRELS can be interpreted as a second-order stochastic gradient descent scheme. These algorithms have been shown to be highly effective in practice, but their performance depends on the careful choice of algorithmic parameters such as the step size (for GROUSE and Oja's method) and the discount parameter (for PETRELS). Various convergence properties of these techniques have been studied in [2], [4]-[7], but a precise analysis of their performance is still an open problem. Moreover, the important question of how the signal-to-noise ratios (SNRs), the amount of missing data, and various other algorithmic parameters affect the estimation performance is not fully understood.

As the main objective of this work, we present a tractable and asymptotically exact analysis of the dynamic performance of Oja's method, GROUSE and PETRELS in the high-dimensional regime. Our contribution is mainly threefold:

1. Precise analysis via scaling limits. We show in Theorem 1 and Theorem 2 that the time-varying trajectories of the estimation errors, measured in terms of the principal angles between the true underlying subspace and the estimates given by the algorithms, converge weakly to deterministic processes as the ambient dimension $n \to \infty$. Moreover, such deterministic limits can be characterized as the unique solutions of certain ordinary differential equations (ODEs). In addition, we provide a finite-size guarantee in Theorem 3, showing that the convergence rate towards such limits is $O(1/\sqrt{n})$. Numerical simulations verify the accuracy of our asymptotic predictions. The main technical tool behind our analysis is the weak convergence theory of stochastic processes (see [8]-[12] for its mathematical foundations and [13]-[15] for its recent applications in related estimation problems).

2. Insights regarding the algorithms. In addition to providing asymptotically exact predictions of the dynamic performance of the three subspace estimation algorithms, our high-

dimensional analysis leads to several valuable insights. First, the result of Theorem 1 implies that, despite their different update rules, Oja's method and GROUSE are asymptotically equivalent, with both converging to the same deterministic process as the dimension increases. Second, the characterization given in Theorem 2 shows that PETRELS can be examined within a common framework that incorporates all three algorithms, with the difference being that PETRELS uses an adaptive scheme to adjust its effective step sizes. Third, our limiting ODEs also reveal an (asymptotically) exact scaling relationship that links the amount of missing data to the SNR. See the discussions in Section IV-A for details.

3. Fundamental limits and phase transitions. Analyzing the limiting ODEs also reveals phase transition phenomena associated with the steady-state performance of these algorithms. Specifically, we provide in Propositions 1 and 2 critical thresholds for setting key algorithm parameters (as a function of the SNR and the subsampling ratio), beyond which the algorithms converge to noninformative estimates that are no better than mere random guesses.

The rest of the paper is organized as follows. We start by presenting in Section II-A the exact problem formulation for subspace estimation with missing data. This is followed by a brief review of the three algorithms to be analyzed in this work. The main results are presented in Section III, where we show that the dynamic performance of Oja's method, GROUSE and PETRELS can be asymptotically characterized by the solutions of certain deterministic systems of ODEs. Numerical experiments are also provided to illustrate and verify our theoretical predictions. To place our asymptotic analysis in proper context, we discuss related work in the literature in Section III-D. We consider various implications and insights drawn from our analysis in Section IV. Due to space limitations, we only present informal derivations of the limiting ODEs and proof sketches in Section V. More technical details and the proofs of all the results presented in this paper can be found in the Supplementary Materials [16].

Notation: Throughout the paper, we use $I_d$ to denote the identity matrix. For any positive semidefinite matrix $M$, its principal square root is written as $(M)^{\frac{1}{2}}$. Depending on the context, $\|\cdot\|$ denotes either the $\ell_2$ norm of a vector or the spectral norm of a matrix. For any $x \in \mathbb{R}$, the floor operation $\lfloor x \rfloor$ gives the largest integer that is smaller than or equal to $x$. Let $\{X_n\}$ be a sequence of random variables in a general probability space. $X_n \xrightarrow{P} X$ means that $X_n$ converges in probability to a random variable $X$, whereas $X_n \xrightarrow{\text{weakly}} X$ means that $X_n$ converges to $X$ weakly (i.e., in law). Finally, $\mathbb{1}_A$ denotes the indicator function for an event $A$.

II. PROBLEM FORMULATION AND OVERVIEW OF ALGORITHMS

A. Observation Model

We consider the problem of estimating a low-rank subspace using partial observations from a data stream. At any discrete time $k$, suppose that a sample vector $s_k \in \mathbb{R}^n$ is generated according to
$$s_k = U c_k + a_k. \qquad (1)$$
Here, $U \in \mathbb{R}^{n \times d}$ is an unknown deterministic matrix whose columns form an orthonormal basis of a $d$-dimensional subspace, and $c_k \in \mathbb{R}^d$ is a random vector representing the expansion coefficients in that subspace. We also assume that the covariance matrix of $c_k$ is
$$\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d), \qquad (2)$$
where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ are some strictly positive numbers.¹ The noise in the observations is modeled by a random vector $a_k \in \mathbb{R}^n$ with zero mean and a covariance matrix equal to $I_n$. Furthermore, $a_k$ is independent of $c_k$.

¹The assumption that the covariance matrix is diagonal can be made without loss of generality, after a rotation of the coordinate system. To see that, suppose $c_k$ has a general covariance matrix $\Sigma$, which is diagonalized as $\Sigma = \Phi \Lambda \Phi^\top$. Here, $\Phi$ is an orthonormal matrix and $\Lambda$ is a diagonal matrix as in (2). The generating model (1) can then be rewritten as $s_k = (U\Phi)(\Phi^\top c_k) + a_k$. Thus, our problem is equivalent to estimating the subspace spanned by $U\Phi$, and $\Lambda$ is the covariance matrix of the new expansion coefficient vector $\Phi^\top c_k$.
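To make the model concrete, the following minimal Python sketch (all function and variable names here are ours, and NumPy is assumed) draws one sample from (1)-(2); the basis $U$ is produced by the generic construction (28) discussed later in Section III-C:

import numpy as np

def make_subspace(n, d, rng):
    # A generic orthonormal basis U, obtained by orthonormalizing an
    # n x d matrix with i.i.d. standard normal entries; cf. (28).
    X = rng.standard_normal((n, d))
    U, _ = np.linalg.qr(X)
    return U

def draw_sample(U, lam, rng):
    # One observation s_k = U c_k + a_k as in (1), with Cov(c_k) = diag(lam)
    # as in (2) and unit-variance noise a_k.
    n, d = U.shape
    c = np.sqrt(lam) * rng.standard_normal(d)
    a = rng.standard_normal(n)
    return U @ c + a

rng = np.random.default_rng(0)
U = make_subspace(n=1000, d=4, rng=rng)
s = draw_sample(U, np.array([5.0, 4.0, 3.0, 2.0]), rng)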
Since $\{\lambda_l\}_{1 \le l \le d}$ in (2) indicate the strength of the subspace component relative to the noise, we shall refer to these parameters as the SNR in our subsequent discussions.

We consider the missing data case, where only a subset of the entries of $s_k$ is available. This observation process can be modeled by a diagonal matrix
$$\Omega_k = \operatorname{diag}(v_{k,1}, v_{k,2}, \ldots, v_{k,n}), \qquad (3)$$
where $v_{k,i} = 1$ if the $i$th component of $s_k$ is observed, and $v_{k,i} = 0$ otherwise. Our actual observation, denoted by $y_k$, may then be written as
$$y_k = \Omega_k s_k. \qquad (4)$$
Given a sequence of incomplete observations $\{y_k, \Omega_k\}$ arriving in a stream, we aim to estimate the subspace spanned by the columns of $U$.
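Continuing the sketch above, the observation model (3)-(4) amounts to keeping each entry independently with a fixed probability; the helper below (ours) returns both the masked vector $y_k$ and the mask itself:

def observe(s, alpha, rng):
    # Independent Bernoulli(alpha) mask per entry, eqs. (3)-(4):
    # y_k = Omega_k s_k, with omega playing the role of diag(Omega_k).
    omega = rng.random(s.shape[0]) < alpha
    return np.where(omega, s, 0.0), omega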

B. Oja's Method

Oja's method [3] is a classical algorithm for estimating low-rank subspaces from streaming samples. It was originally designed for the case where the full sample vectors $s_k$ in (1) are available. Given a collection of $K$ such sample vectors, it is natural to use the following optimization formulation to estimate the unknown subspace:
$$\hat{U} = \arg\min_{X^\top X = I_d} \frac{1}{K} \sum_{k=1}^{K} \min_{w_k} \|s_k - X w_k\|^2 \qquad (5)$$
$$= \arg\max_{X^\top X = I_d} \frac{1}{K} \sum_{k=1}^{K} \operatorname{tr}\big(X^\top s_k s_k^\top X\big), \qquad (6)$$
where the equivalence between (5) and (6) is established by solving the simple quadratic problem $\min_{w_k} \|s_k - X w_k\|^2$ and substituting the solution into (5).

Oja's method is a stochastic projected-gradient algorithm for solving (6). At each step $k$, let $X_k$ denote the current estimate of the subspace. Then, with the arrival of a new sample vector $s_k$, we first update $X_k$ according to
$$\widetilde{X}_k = X_k + \frac{\tau_k}{n} s_k w_k^\top, \qquad (7)$$
where $w_k = X_k^\top s_k$ and $\{\tau_k\}$ is a sequence of positive constants that control the step size (or learning rate) of the algorithm. We note that, up to a scaling constant, $s_k w_k^\top$ in (7) is exactly equal to the gradient of the objective function $\operatorname{tr}(X^\top s_k s_k^\top X)$ in (6) due to the new sample $s_k$. Next, to enforce the orthogonality constraint, we compute
$$X_{k+1} = \widetilde{X}_k \big(\widetilde{X}_k^\top \widetilde{X}_k\big)^{-\frac{1}{2}}, \qquad (8)$$
where $(\cdot)^{\frac{1}{2}}$ stands for the principal square root of a positive semidefinite matrix. In practice, (8) is implemented using the QR decomposition of $\widetilde{X}_k$.

To handle the case of partially observed samples, we can modify Oja's method in two ways. First, we estimate the expansion coefficients $w_k$ in (7) by solving a least-squares problem that takes into account the missing data model:
$$\hat{w}_k = \arg\min_{w \in \mathbb{R}^d} \|y_k - \Omega_k X_k w\|^2, \qquad (9)$$
where $y_k$ is the incomplete sample vector defined in (4), $\Omega_k$ is the corresponding subsampling matrix, and $X_k$ is the current estimate of the subspace. Next, we replace the missing elements in $y_k$ by the corresponding entries in $X_k \hat{w}_k$. This imputation step leads to an estimate of the full vector:
$$\hat{y}_k = y_k + (I_n - \Omega_k) X_k \hat{w}_k. \qquad (10)$$
Replacing the original vectors $s_k$ and $w_k$ in (7) by their estimated counterparts $\hat{y}_k$ and $\hat{w}_k$, we reach the modified Oja's method, the pseudocode of which is summarized in Algorithm 1. Note that, to ensure we have enough observed entries in $y_k$, we first check, with the arrival of a new partially observed vector $y_k$, whether
$$\det(X_k^\top \Omega_k X_k) > \epsilon^d \det(X_k^\top X_k), \qquad (11)$$
where $\epsilon > 0$ is a small positive constant. If this is indeed the case, we do the standard update as described above; otherwise, we simply ignore the new sample vector and do not change the estimate in this step. Note that, under a suitable probabilistic model for the subsampling process (see assumption (A.3) in Section III-C), one can show that (11) is satisfied with high probability as long as $\epsilon < \alpha$, where $\alpha$ denotes the subsampling ratio defined in assumption (A.3).

Algorithm 1 Oja's method [3] with imputation
Require: An initial estimate $X_0$ such that $X_0^\top X_0 = I_d$, a sequence of step-size parameters $\{\tau_k\}$ and a positive constant $\epsilon$.
1: $k := 0$
2: repeat
3:   if $\det(X_k^\top \Omega_k X_k) > \epsilon^d \det(X_k^\top X_k)$ then
4:     $\hat{w}_k := \arg\min_w \|y_k - \Omega_k X_k w\|^2$
5:     $\hat{y}_k := y_k + (I_n - \Omega_k) X_k \hat{w}_k$
6:     $\widetilde{X}_k := X_k + \frac{\tau_k}{n} \hat{y}_k \hat{w}_k^\top$
7:     $X_{k+1} := \widetilde{X}_k (\widetilde{X}_k^\top \widetilde{X}_k)^{-\frac{1}{2}}$
8:   else
9:     $X_{k+1} := X_k$
10:  end if
11:  $k := k + 1$
12: until termination

C. GROUSE

Similar to Oja's method, Grassmannian Rank-One Update Subspace Estimation (GROUSE) [1] is a first-order stochastic gradient descent algorithm for solving (5). The main difference is that GROUSE solves the optimization problem on the Grassmannian, the manifold of all subspaces with a fixed rank. One advantage of this approach is that it avoids the explicit orthogonalization step in (8), allowing the algorithm to achieve even lower computational complexity.

At each step, GROUSE first finds the coefficients $\hat{w}_k$ according to (9). It then computes the reconstruction error vector
$$r_k = y_k - \Omega_k p_k, \qquad (12)$$
where $p_k = X_k \hat{w}_k$. Next, it updates the current estimate $X_k$ on the Grassmannian as
$$X_{k+1} = X_k + \Big[(\cos\theta_k - 1) \frac{p_k}{\|p_k\|} + \sin\theta_k \frac{r_k}{\|r_k\|}\Big] \frac{\hat{w}_k^\top}{\|\hat{w}_k\|},$$
where
$$\theta_k = \frac{\tau_k}{n} \|r_k\| \|p_k\|, \qquad (13)$$
and $\{\tau_k\}$ is a sequence of step-size parameters. The algorithm is summarized in Algorithm 2.

Algorithm 2 GROUSE [1]
Require: An initial estimate $X_0$ such that $X_0^\top X_0 = I_d$, a sequence of step-size parameters $\{\tau_k\}$ and a positive constant $\epsilon$.
1: $k := 0$
2: repeat
3:   if $\det(X_k^\top \Omega_k X_k) > \epsilon^d \det(X_k^\top X_k)$ then
4:     $\hat{w}_k := \arg\min_w \|y_k - \Omega_k X_k w\|^2$
5:     $p_k := X_k \hat{w}_k$
6:     $r_k := y_k - \Omega_k p_k$
7:     $\theta_k := \frac{\tau_k}{n} \|r_k\| \|p_k\|$
8:     $X_{k+1} := X_k + \big[(\cos(\theta_k) - 1) \frac{p_k}{\|p_k\|} + \sin(\theta_k) \frac{r_k}{\|r_k\|}\big] \frac{\hat{w}_k^\top}{\|\hat{w}_k\|}$
9:   else
10:    $X_{k+1} := X_k$
11:  end if
12:  $k := k + 1$
13: until termination
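For concreteness, here is a Python sketch of one iteration of each of the two updates just described. This is our own illustrative code, not the authors' reference implementation: it assumes NumPy, represents $\Omega_k$ by a boolean mask, and omits the determinant test (11) for brevity.

import numpy as np

def oja_step(X, y, omega, tau, n):
    # One step of Oja's method with imputation (Algorithm 1, lines 4-7).
    w, *_ = np.linalg.lstsq(X[omega], y[omega], rcond=None)  # eq. (9)
    y_hat = y.copy()
    y_hat[~omega] = (X @ w)[~omega]                          # imputation, eq. (10)
    X_tilde = X + (tau / n) * np.outer(y_hat, w)             # eq. (7)
    Q, _ = np.linalg.qr(X_tilde)                             # orthonormalization, eq. (8)
    return Q

def grouse_step(X, y, omega, tau, n):
    # One step of GROUSE (Algorithm 2): a rank-one geodesic update on the
    # Grassmannian, with no explicit re-orthogonalization required.
    w, *_ = np.linalg.lstsq(X[omega], y[omega], rcond=None)  # eq. (9)
    p = X @ w
    r = np.zeros_like(y)
    r[omega] = y[omega] - p[omega]                           # eq. (12)
    norm_r, norm_p = np.linalg.norm(r), np.linalg.norm(p)
    if norm_r == 0.0 or norm_p == 0.0:
        return X
    theta = (tau / n) * norm_r * norm_p                      # eq. (13)
    step = (np.cos(theta) - 1.0) * p / norm_p + np.sin(theta) * r / norm_r
    return X + np.outer(step, w / np.linalg.norm(w))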

D. PETRELS

When there is no missing data, an alternative to Oja's method is a classical algorithm called Projection Approximation Subspace Tracking (PAST) [17]. This method estimates the underlying subspace $U$ by solving an exponentially weighted least-squares problem
$$X_{k+1} = \arg\min_{X \in \mathbb{R}^{n \times d}} \sum_{k'=1}^{k} \gamma^{k-k'} \|s_{k'} - X w_{k'}\|^2, \qquad (14)$$
where $w_k = X_k^\top s_k$ and $\gamma \in (0, 1]$ is a discount parameter. The solution of (14) has a simple recursive update rule
$$X_{k+1} = X_k + (s_k - X_k w_k) w_k^\top R_k \qquad (15)$$
$$R_{k+1} = \big(\gamma R_k^{-1} + w_k w_k^\top\big)^{-1}. \qquad (16)$$
Moreover, one can avoid the explicit calculation of the matrix inverse in (16) by using the Woodbury identity and the fact that (16) amounts to a rank-one update.

Parallel Subspace Estimation and Tracking by Recursive Least Squares (PETRELS) [2] extends PAST to the case of partially observed data. The main change is that it estimates the coefficients $w_k$ in (14) using (9). In addition, it provides a parallel subroutine in its calculations so that updates to different coordinates can be computed in a fully parallel fashion. In its most general form, PETRELS needs to maintain and update a different matrix $R_k^i$ for each of the $n$ coordinates. To reduce computational complexity, a simplified version of PETRELS has been provided in [2], using a common $R_k$ for all the coordinates.

In this paper, we focus on this simplified version of PETRELS, which is summarized in Algorithm 3. Note that we introduce an additional parameter $\alpha$ in lines 7 and 8 of the pseudocode. The simplified algorithm given in [2] corresponds to setting $\alpha = 1$. In our analysis, we set $\alpha$ to be equal to the subsampling ratio defined later in (25). Empirically, we find that, with this modification, the performance of the simplified algorithm matches that of the full PETRELS algorithm when the ambient dimension $n$ is large.

Algorithm 3 Simplified PETRELS [2]
Require: An initial estimate of the subspace $X_0$, $R_0 = \frac{\delta}{n} I_d$ for some $\delta > 0$, and positive constants $\gamma$ and $\epsilon$.
1: $k := 0$
2: repeat
3:   if $\det(X_k^\top \Omega_k X_k) > \epsilon^d \det(X_k^\top X_k)$ then
4:     $\hat{w}_k := \arg\min_w \|y_k - \Omega_k X_k w\|^2$
5:     $X_{k+1} := X_k + \Omega_k (y_k - X_k \hat{w}_k) \hat{w}_k^\top R_k$
6:     $v_k := \gamma^{-1} R_k \hat{w}_k$
7:     $\beta_k := 1 + \alpha \hat{w}_k^\top v_k$
8:     $R_{k+1} := \gamma^{-1} R_k - \alpha v_k v_k^\top / \beta_k$
9:   else
10:    $X_{k+1} := X_k$
11:    $R_{k+1} := R_k$
12:  end if
13:  $k := k + 1$
14: until termination

III. MAIN RESULTS: SCALING LIMITS

In this section, we present the main results of this work: a tractable and asymptotically exact analysis of the performance of the three algorithms reviewed in Section II.

A. Performance Metric: Principal Angles

We start by defining the performance metric we will be using in our analysis. Recall the generative model defined in (1). The ground-truth subspace is represented by the matrix $U$, whose column vectors form an orthonormal basis of that subspace. For Algorithms 1, 2, and 3, the estimated subspace at the $k$th step is spanned by an orthogonal matrix
$$\hat{U}_k = X_k \big(X_k^\top X_k\big)^{-\frac{1}{2}}, \qquad (17)$$
where $X_k$ is the $k$th iterate generated by the algorithms. Note that, for Oja's method and GROUSE, $\hat{U}_k = X_k$ as the matrix $X_k$ is already orthogonal, whereas for PETRELS, generally $X_k^\top X_k \ne I_d$ and thus the step in (17) is necessary.

In the special case of $d = 1$ (i.e., rank-one subspaces), both $U$ and $\hat{U}_k$ are unit-norm vectors. The degree to which these vectors are aligned can be measured by their cosine similarity, defined as $U^\top \hat{U}_k$. This concept can be naturally extended to arbitrary $d$. In general, the closeness of two $d$-dimensional subspaces may be quantified by their principal angles [18], [19]. In particular, the cosines of the principal angles are uniquely specified as the singular values of a matrix defined as
$$Q_k^{(n)} = U^\top \hat{U}_k = U^\top X_k \big(X_k^\top X_k\big)^{-\frac{1}{2}}. \qquad (18)$$
In what follows, we shall refer to $Q_k^{(n)}$ as the cosine similarity matrix. Since we will be studying the high-dimensional limit of $Q_k^{(n)}$ as the ambient dimension $n \to \infty$, we use the superscript $(n)$ to make the dependence of $Q_k^{(n)}$ on $n$ explicit.
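The metric itself takes only a few lines of code. The sketch below (ours) computes the cosine similarity matrix (18) and the principal-angle cosines for a general iterate $X_k$; it orthonormalizes via QR, which spans the same subspace as (17) and therefore leaves the singular values unchanged:

import numpy as np

def cosine_similarity_matrix(U, X):
    # Q = U^T X (X^T X)^{-1/2}, eqs. (17)-(18), computed up to a rotation.
    U_hat, _ = np.linalg.qr(X)
    return U.T @ U_hat

def principal_angle_cosines(U, X):
    # The singular values of Q are the cosines of the d principal angles.
    return np.linalg.svd(cosine_similarity_matrix(U, X), compute_uv=False)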
B. The Scaling Limits of Stochastic Processes: Main Ideas

To analyze the performance of Algorithms 1, 2, and 3, our goal boils down to tracking the evolution of the cosine similarity matrix $Q_k^{(n)}$ over time. Thanks to the streaming nature of all three methods, it is easy to see that the dynamics of their estimates $X_k$ can be modeled by homogeneous Markov chains with state space in $\mathbb{R}^{n \times d}$. Thus, being a function of $X_k$ [see (18)], the dynamics of $Q_k^{(n)}$ forms a hidden Markov chain. We then show that, as $n \to \infty$ and with proper time scaling, the family of stochastic processes $\{Q_k^{(n)}\}$ indexed by $n$ converges weakly to a deterministic function of time that is characterized as the unique solution of some ODEs. Such convergence is known in the literature as the scaling limit [10], [12], [15], [20] of stochastic processes. To present our results, we first consider a simple one-dimensional example that illustrates the underlying ideas behind scaling limits. Our main convergence theorems are presented in Section III-C.

Consider a 1-D stochastic process defined by a recursion
$$q_{k+1} = q_k + \frac{\tau}{n} f(q_k) + \frac{1}{n^{(1/2)+\delta}} v_k, \qquad (19)$$
where $f(\cdot)$ is a Lipschitz function, $\tau$ and $\delta$ are two positive constants, $\{v_k\}$ is a sequence of i.i.d. random variables with zero mean and unit variance, and $n > 0$ is a constant introduced to scale the step size and the noise variance. (This particular scaling is chosen here because it mimics the actual scaling that appears in the high-dimensional dynamics of $Q_k^{(n)}$ we shall study.)
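The recursion (19) is easy to simulate. The sketch below (our code) uses $f(q) = -q$, anticipating Example 1 below, and compares the rescaled process with the deterministic limit $q_0 e^{-\tau t}$ derived there:

import numpy as np

def simulate_q(n, tau=1.0, delta=0.25, q0=1.0, T=3.0, seed=0):
    # Run (19) with f(q) = -q for floor(nT) steps and return the trajectory.
    rng = np.random.default_rng(seed)
    q = q0
    traj = [q]
    for _ in range(int(n * T)):
        q += -(tau / n) * q + rng.standard_normal() / n ** (0.5 + delta)
        traj.append(q)
    return traj

for n in (100, 1000, 10000):
    traj = simulate_q(n)
    # q^(n)(t) at t = 1 concentrates around q0 * exp(-tau) as n grows:
    print(n, traj[n], np.exp(-1.0))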

Figure 1. Convergence of the 1-D stochastic process $q^{(n)}(t)$ described in Example 1 to its deterministic scaling limit. Here, we use $\delta = 0.25$.

When $n$ is large, the difference between $q_k$ and $q_{k+1}$ is small. In other words, we will not be able to see macroscopic changes unless we observe the process over a large number of steps. To accelerate the time (by a factor of $n$), we embed $\{q_k\}$ in continuous time by defining a piecewise-constant process
$$q^{(n)}(t) = q_{\lfloor nt \rfloor}, \qquad (20)$$
where $\lfloor \cdot \rfloor$ is the floor function. Here, $t$ is the rescaled (accelerated) time: within $t \in [0, 1]$, the original discrete-time process moves $n$ steps. Due to the scaling of the noise term in (19) (with the noise variance equal to $n^{-1-2\delta}$ for some $\delta > 0$), the rescaled stochastic process $q^{(n)}(t)$ converges to a deterministic limit function as $n \to \infty$. We illustrate this convergence behavior using the following example.

Example 1: Let us consider the special case where $f(q) = -q$. We plot in Figure 1 simulation results of $q^{(n)}(t)$ for several different values of $n$. We see that, as $n$ increases, the rescaled stochastic processes $q^{(n)}(t)$ indeed converge to some limit function (the black line in the figure), which will be denoted by $q(t)$. To prove this convergence, we first expand the recursion (19) (by using the fact that $f(q) = -q$) and get
$$q_k = \Big(1 - \frac{\tau}{n}\Big)^k q_0 + \Delta_k, \qquad (21)$$
where $\Delta_k$ is a zero-mean random variable defined as
$$\Delta_k = \frac{1}{n^{(1/2)+\delta}} \sum_{i=0}^{k-1} \Big(1 - \frac{\tau}{n}\Big)^{k-1-i} v_i.$$
Since $\{v_i\}_{i \ge 0}$ are independent random variables with unit variance,
$$\mathbb{E}\,\Delta_k^2 \le \frac{1 - (1 - \tau/n)^{2k}}{n^{1+2\delta}\big(1 - (1 - \tau/n)^2\big)} = O(n^{-2\delta}).$$
It then follows from (21) that, for any $t > 0$,
$$q^{(n)}(t) = q_{\lfloor nt \rfloor} \xrightarrow{P} \lim_{n \to \infty} \Big(1 - \frac{\tau}{n}\Big)^{\lfloor nt \rfloor} q_0 = q_0 e^{-\tau t}, \qquad (22)$$
where $\xrightarrow{P}$ stands for convergence in probability.

For a general nonlinear function $f(q)$, we can no longer directly simplify the recursion (19) as in (21). However, similar convergence behaviors of $q^{(n)}(t)$ still exist. Moreover, the limit function $q(t)$ can be characterized via an ODE. To see the origin of the ODE, we note that, for any $t > 0$ and $k = \lfloor nt \rfloor$, we may rewrite (19) as
$$\frac{q^{(n)}(t + 1/n) - q^{(n)}(t)}{1/n} = \tau f\big[q^{(n)}(t)\big] + n^{(1/2)-\delta} v_k. \qquad (23)$$
Taking the $n \to \infty$ limit on both sides of (23) and neglecting the zero-mean noise term $n^{(1/2)-\delta} v_k$, we may then write, at least in a nonrigorous way, the following ODE
$$\frac{d}{dt} q(t) = \tau f\big[q(t)\big],$$
which always has a unique solution due to the Lipschitz property of $f(\cdot)$. For instance, the ODE associated with the linear setting in Example 1 is $\frac{d}{dt} q(t) = -\tau q(t)$, whose unique solution $q(t) = q_0 e^{-\tau t}$ is indeed the limit established in (22). A rigorous justification of the above steps can be found in the theory of weak convergence of stochastic processes (see, for example, [12], [20]).

Returning from the above detour, we recall that the central objects of our analysis are the cosine similarity matrices $Q_k^{(n)}$ defined in (18). It turns out that, just like the simple 1-D process $q_k$ in (19), the matrix-valued stochastic processes $Q_k^{(n)}$, after a proper time rescaling $k = \lfloor nt \rfloor$, also converge to a deterministic limit as the ambient dimension $n \to \infty$. This phenomenon is demonstrated in Figure 2, where we plot the cosine similarity $Q_{\lfloor nt \rfloor}^{(n)}$ of GROUSE at $t = 0.5$ for different values of $n$.

Figure 2. Convergence of the cosine similarity $Q_{\lfloor nt \rfloor}^{(n)}$ associated with GROUSE at a fixed rescaled time $t = 0.5$, as $n$ increases. In this experiment, $d = 1$ and thus $Q_{\lfloor nt \rfloor}^{(n)}$ reduces to a scalar. The error bars show the standard deviation of $Q_{\lfloor nt \rfloor}^{(n)}$ over independent trials. In each trial, we randomly generate a subspace $U$, the expansion coefficients $\{c_k\}$ and the noise vectors $\{a_k\}$ as in (1). The red dashed line is the limiting value predicted by our asymptotic characterization, to be given in Theorem 1.
The standard deviations of $Q_{\lfloor nt \rfloor}^{(n)}$ over independent trials, shown as error bars in Figure 2, decrease as $n$ increases. This indicates that the performance of these stochastic algorithms can indeed be characterized by certain deterministic limits when the dimension is high.

C. The Scaling Limits of Oja's, GROUSE and PETRELS

To study the scaling limits of the cosine similarity matrices, we embed the discrete-time process $Q_k^{(n)}$ into a continuous-time process $Q^{(n)}(t)$ via a simple piecewise-constant

interpolation:
$$Q^{(n)}(t) = Q_{\lfloor nt \rfloor}^{(n)}. \qquad (24)$$
The main objective of this work is to establish the high-dimensional limit of $Q^{(n)}(t)$ as $n \to \infty$. Our asymptotic analysis is carried out under the following technical assumptions on the generative model (1) and the observation model (3).

(A.1) The elements of the noise vector $a_k$ are i.i.d. random variables with zero mean, unit variance, and finite higher-order moments;

(A.2) $c_k$ in (1) is a $d$-dimensional random vector with zero mean and a covariance matrix $\Lambda$ as given in (2). Moreover, all the higher-order moments of $c_k$ exist and are finite, and $\{c_k\}$ is independent of $\{a_k\}$;

(A.3) We assume that $\{v_{k,i}\}$ in the observation model (3) is a collection of independent and identically distributed binary random variables such that
$$\mathbb{P}(v_{k,i} = 1) = \alpha, \qquad (25)$$
for some constant $\alpha \in (0, 1)$. Throughout the paper, we refer to $\alpha$ as the subsampling ratio. We shall also assume that the algorithmic parameter $\epsilon$ used in Algorithms 1-3 satisfies the condition that $\epsilon < \alpha$.

(A.4) The subspace matrix $U$ and the initial guess $X_0$ are incoherent in the sense that
$$\sum_{i=1}^{n} \sum_{j=1}^{d} U_{i,j}^4 \le \frac{C}{n} \quad \text{and} \quad \sum_{i=1}^{n} \sum_{j=1}^{d} X_{0,i,j}^4 \le \frac{C}{n}, \qquad (26)$$
where $U_{i,j}$ and $X_{0,i,j}$ denote the $(i,j)$th entries of $U$ and $X_0$, respectively, and $C$ is a generic constant that does not depend on $n$.

(A.5) The initial cosine similarity $Q_0^{(n)}$ converges entrywise and in probability to a deterministic matrix $Q(0)$.

(A.6) For Oja's method and GROUSE, the step-size parameters $\tau_k = \tau(k/n)$, where $\tau(\cdot)$ is a deterministic function such that $\sup_t \tau(t) \le C$ for a generic constant $C$ that does not depend on $n$. For PETRELS, the discount factor
$$\gamma = 1 - \frac{\mu}{n}, \qquad (27)$$
for some constant $\mu > 0$.

Assumption (A.4) requires some further explanation. The condition (26) essentially requires the basis matrix $U$ and the initial guess $X_0$ to be generic. To see this, consider a $U$ that is drawn uniformly at random from the Grassmannian for rank-$d$ subspaces. Such a $U$ can be generated as
$$U = X (X^\top X)^{-1/2}, \qquad (28)$$
where $X$ is an $n \times d$ random matrix whose entries are i.i.d. standard normal random variables. For such a generic choice of $U$, one can show that its entries $U_{i,j} \sim O(1/\sqrt{n})$ and that (26) holds with high probability when $n$ is large.

Theorem 1 (Oja's method and GROUSE): Fix $T > 0$, and let $\{Q^{(n)}(t)\}_{t \in [0,T]}$ be the time-varying cosine similarity matrices associated with either Oja's method or GROUSE over the finite interval $t \in [0, T]$. Under assumptions (A.1)-(A.6), we have
$$\{Q^{(n)}(t)\}_{t \in [0,T]} \xrightarrow{\text{weakly}} Q(t),$$
where $\xrightarrow{\text{weakly}}$ stands for weak convergence and $Q(t)$ is a deterministic matrix-valued process. Moreover, $Q(t)$ is the unique solution of the ODE
$$\frac{d}{dt} Q(t) = F\big(Q(t), \tau(t) I_d\big), \qquad (29)$$
where $F: \mathbb{R}^{d \times d} \times \mathbb{R}^{d \times d} \to \mathbb{R}^{d \times d}$ is a matrix-valued function defined as
$$F(Q, G) = \Big[\alpha \Lambda^2 Q - \tfrac{1}{2} Q G - Q \big(I_d + \tfrac{1}{2} G\big) Q^\top \alpha \Lambda^2 Q\Big] G. \qquad (30)$$
Here $\alpha$ is the subsampling ratio, and $\Lambda$ is the diagonal covariance matrix defined in (2).

In Section V, we present a (nonrigorous) derivation of the limiting ODE (29). Full technical details and a complete proof can be found in the Supplementary Materials [16]. An interesting conclusion of this theorem is that the cosine similarity matrices $Q^{(n)}(t)$ associated with Oja's method and GROUSE converge to the same asymptotic trajectory. We will elaborate on this point in Section IV-A.
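Since (29) is a $d \times d$ system, the limit is cheap to evaluate numerically. The sketch below (our code, not part of the paper) applies a simple forward Euler scheme to (29), with the right-hand side taken from (30) and a constant step size $\tau$ assumed:

import numpy as np

def F(Q, G, alpha, lam):
    # Right-hand side (30), with aL2 standing for alpha * Lambda^2.
    aL2 = alpha * np.diag(np.asarray(lam, dtype=float) ** 2)
    I = np.eye(Q.shape[0])
    return (aL2 @ Q - 0.5 * Q @ G - Q @ (I + 0.5 * G) @ Q.T @ aL2 @ Q) @ G

def solve_oja_grouse_ode(Q0, tau, alpha, lam, T=10.0, dt=1e-3):
    # Forward Euler integration of dQ/dt = F(Q, tau * I), eq. (29).
    Q = Q0.copy()
    G = tau * np.eye(Q0.shape[0])
    for _ in range(int(T / dt)):
        Q = Q + dt * F(Q, G, alpha, lam)
    return Q

For $d = 1$ and an informative step size, the output approaches the closed-form steady state given later in Proposition 1.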
To establish the scaling limits of PETRELS, we need to introduce an auxiliary matrix
$$G_k^{(n)} = n \big(X_k^\top X_k\big)^{\frac{1}{2}} R_k \big(X_k^\top X_k\big)^{\frac{1}{2}}, \qquad (31)$$
where the matrices $R_k$ and $X_k$ are those used in Algorithm 3. Similar to (24), we embed the discrete-time process $G_k^{(n)}$ into a continuous-time process:
$$G^{(n)}(t) = G_{\lfloor nt \rfloor}^{(n)}. \qquad (32)$$
The following theorem, whose proof can be found in the Supplementary Materials [16], characterizes the asymptotic dynamics of PETRELS.

Theorem 2 (PETRELS): For any fixed $T > 0$, let $\{Q^{(n)}(t)\}_{t \in [0,T]}$ be the time-varying cosine similarity matrices associated with PETRELS on the interval $t \in [0, T]$. Let $\{G^{(n)}(t)\}_{t \in [0,T]}$ be the process defined in (32). Under assumptions (A.1)-(A.6) and as $n \to \infty$, we have
$$\{Q^{(n)}(t)\}_{t \in [0,T]} \xrightarrow{\text{weakly}} Q(t) \quad \text{and} \quad \{G^{(n)}(t)\}_{t \in [0,T]} \xrightarrow{\text{weakly}} G(t),$$
where $\{Q(t), G(t)\}$ is the unique solution of the following system of coupled ODEs:
$$\frac{d}{dt} Q(t) = F\big(Q(t), G(t)\big), \qquad (33)$$
$$\frac{d}{dt} G(t) = H\big(Q(t), G(t)\big). \qquad (34)$$
Here, $F$ is the function defined in (30) and $H$ is a function defined by
$$H(Q, G) = G\Big[\mu I_d - G(G + I_d)\big(Q^\top \alpha \Lambda^2 Q + I_d\big)\Big], \qquad (35)$$
where $\mu > 0$ is the constant given in (27).

Theorem 1 and Theorem 2 establish the scaling limits of Oja's method, GROUSE and PETRELS, respectively, as $n \to \infty$.
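The coupled system (33)-(34) can be integrated the same way. The sketch below (ours) reuses the function F from the sketch following Theorem 1 and adds the right-hand side (35):

import numpy as np

def H(Q, G, alpha, lam, mu):
    # Right-hand side (35): H(Q, G) = G [mu I - G (G + I)(Q^T aL2 Q + I)].
    aL2 = alpha * np.diag(np.asarray(lam, dtype=float) ** 2)
    I = np.eye(Q.shape[0])
    return G @ (mu * I - G @ (G + I) @ (Q.T @ aL2 @ Q + I))

def solve_petrels_ode(Q0, G0, alpha, lam, mu, T=10.0, dt=1e-3):
    # Forward Euler integration of the coupled ODEs (33)-(34);
    # F is the function from the sketch following Theorem 1.
    Q, G = Q0.copy(), G0.copy()
    for _ in range(int(T / dt)):
        Q, G = Q + dt * F(Q, G, alpha, lam), G + dt * H(Q, G, alpha, lam, mu)
    return Q, G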

In practice, the dimension $n$ is always finite, and thus the actual trajectories of the performance curves will fluctuate around their asymptotic limits. To bound such fluctuations via a finite-sample analysis, we first need to slightly strengthen assumption (A.5) as follows:

(A.7) Let $Q_0^{(n)}$ be the initial cosine similarity matrices. There exists a fixed matrix $Q(0)$ such that
$$\mathbb{E}\big\|Q_0^{(n)} - Q(0)\big\|_2 \le C n^{-1/2},$$
where $\|\cdot\|_2$ denotes the spectral norm of a matrix and $C > 0$ is a constant that does not depend on $n$.

Theorem 3 (Finite Sample Analysis): Let $Q^{(n)}(t)$ be the time-varying cosine similarity matrices associated with Oja's method, GROUSE, or PETRELS, respectively. Let $Q(t)$ denote the corresponding scaling limit given in (29), (33) or (34). Fix any $T > 0$. Under assumptions (A.1)-(A.4) and (A.6)-(A.7), for any $t \in [0, T]$, we have
$$\mathbb{E}\big\|Q^{(n)}(t) - Q(t)\big\|_2 \le \frac{C(T)}{\sqrt{n}}, \qquad (36)$$
where $C(T)$ is a constant that can depend on the terminal time $T$ but not on $n$.

The above theorem, whose proof can be found in the Supplementary Materials [16], shows that the rate of convergence towards the scaling limits is $O(1/\sqrt{n})$.

Example 2: To demonstrate the accuracy of the asymptotic characterizations given in Theorem 1 and Theorem 2, we compare the actual performance of the algorithms against their theoretical predictions in Figure 3. In our experiments, we generate a random orthogonal matrix $U$ according to (28) with $n = 2{,}000$ and $d = 4$. For Oja's method and GROUSE, we use a constant step size $\tau = 0.5$. For PETRELS, the discount factor is $\gamma = 1 - \mu/n$ with $\mu = 5$, and $R_0 = \frac{\delta}{n} I_d$ with $\delta = 1$. The covariance matrix is set to $\Lambda = \operatorname{diag}\{5, 4, 3, 2\}$ and the subsampling ratio is $\alpha = 0.5$. Figure 3(a) shows the evolution of the cosines of the 4 principal angles between $U$ and the estimates given by Oja's method (shown as crosses) and GROUSE (shown as circles). We compute the theoretical predictions of the principal angles by performing an SVD of the limiting matrices $Q(t)$ specified by the ODE (29). (In fact, this ODE has a simple analytical solution. See Section IV-B for details.) Figure 3(b) shows similar comparisons between PETRELS and its corresponding theoretical predictions. In this case, we solve the limiting ODEs (33) and (34) numerically.

Figure 3. Numerical simulations vs. asymptotic characterizations. (a) Results for Oja's method (crosses) and GROUSE (circles), where the solid lines are the theoretical predictions of the cosines of the 4 principal angles given by the solution of the ODE (29). The markers show the simulation results averaged over independent trials. In each trial, we randomly generate a subspace $U$ as in (28), the expansion coefficients $\{c_k\}$ and the noise vectors $\{a_k\}$. The error bars indicate $\pm 2$ standard deviations. (b) Similar comparisons of numerical simulations and theoretical predictions for PETRELS.

D. Related Work

The problem of estimating and tracking low-rank subspaces has received a lot of attention recently in the signal processing and learning communities. Under the setting of fully observed data, an earlier work [21] studies a block version of Oja's method and provides a sample complexity estimate for the case of $d = 1$. Similar analysis is available for general $d$-dimensional subspaces [22], [23]. The streaming version of Oja's method and its sample complexities have also been extensively studied; see, e.g., [24]-[28]. For the case of incomplete observations, the sample complexity of a block version of Oja's method with missing data is analyzed in [29] under the same generative model as in (1). In [7], the authors provide the sample complexity for learning a low-rank subspace from subsampled data under a nonparametric model much more general than (1): the complete data vectors are assumed to be i.i.d. samples from a general probability distribution on $\mathbb{R}^n$.
In the streaming setting, Oja's method, GROUSE and PETRELS are three popular algorithms for tackling the challenge of subspace learning with partial information. Other interesting approaches include online matrix completion methods [30]-[32]. See [33] for a recent review of relevant literature in this area. Local convergence of GROUSE is given in [4], [5]. Global convergence of GROUSE is established in [6] under the noiseless setting. In general, establishing finite-sample global performance guarantees for GROUSE and other algorithms such as Oja's method and PETRELS in the missing data case is still an open problem.

Unlike most work in the literature that seeks to establish finite-sample performance guarantees for various subspace

estimation algorithms, our results in this paper provide an asymptotically exact characterization of three popular methods in the high-dimensional limit. The main technical tool behind our analysis is the weak convergence of stochastic processes towards their scaling limits, which are characterized by ODEs or stochastic differential equations (see, e.g., [8]-[10], [15]). Using ODEs to analyze stochastic recursive algorithms has a long history [34], [35]. An ODE analysis of an early subspace tracking algorithm was given in [36], and this result was adapted to analyze PETRELS in the non-subsampled case [2]. Our results in this paper differ from previous analyses not only in that they can handle the more challenging case of incomplete observations. In addition, the previous ODE analyses in [2], [36] keep the ambient dimension $n$ fixed and study the asymptotic limit as the step size tends to 0. The resulting ODEs involve $O(n)$ variables. In contrast, our analysis studies the limit as the dimension $n \to \infty$, and the resulting ODEs involve at most $2d^2$ variables, where $d$ is the dimension of the subspace, which, in many practical situations, is a small constant. This low-dimensional characterization makes our limiting results more practical to use, especially when the ambient dimension $n$ is large.

It is important to point out a limitation of our asymptotic analysis: we require the initial estimate $X_0$ to be asymptotically correlated with the true subspace $U$. To see why this is an issue, we note that if the initial cosine similarity matrix $Q(0) = 0$ (i.e., a fully uncorrelated initial estimate), then the ODEs in Theorems 1 and 2 only provide a trivial solution $Q(t) \equiv 0$, yielding no useful information. In practice, a correlated initial estimate can be obtained by performing a PCA on a small batch of samples; it may also be available from additional side information about the true subspace $U$. Therefore, the requirement that $Q(0)$ be invertible is not overly restrictive. Nevertheless, we observe in numerical simulations that, under sufficiently high SNRs, Oja's method, GROUSE and PETRELS can successfully estimate the subspace by starting from random initial guesses that are uncorrelated with $U$. Extending our analysis framework to handle the case of random initial estimates is an important line of future work.

IV. IMPLICATIONS OF HIGH-DIMENSIONAL ANALYSIS

The scaling limits presented in Section III provide asymptotically exact characterizations of the dynamic performance of Oja's method, GROUSE, and PETRELS. In this section, we discuss implications of these results. Analyzing the limiting ODEs also reveals the fundamental limits and phase transition phenomena associated with the steady-state performance of these algorithms.

A. Algorithmic Insights

By examining Theorem 1 and Theorem 2, we draw the following conclusions regarding the three subspace estimation algorithms.

Figure 4. Monte Carlo simulations of the PETRELS algorithm vs. asymptotic predictions obtained by the limiting ODEs given in Theorem 2 for $d = 1$. In this case, the two matrices $Q(t)$ and $G(t)$ reduce to two scalars $Q(t)$ and $G(t)$. The variable $G(t)$ acts as an effective step size, which adaptively adjusts its value according to the change in $Q(t)$. The error bars shown in the figure represent one standard deviation over independent trials. The signal dimension is $n = 10^4$.

1. Connections and differences between the algorithms. Theorem 1 implies that, as $n \to \infty$, Oja's method and GROUSE converge to the same deterministic limit process characterized as the solution of the ODE (29). This result is
somewhat surprising, as the update rules of the two methods (see Algorithm 1 and Algorithm 2) appear to be quite different. Theorem 2 shows that PETRELS is also intricately connected to the other two algorithms. Indeed, the ODE (33) of the cosine similarity matrix $Q(t)$ for PETRELS has exactly the same form as the one for GROUSE and Oja's method shown in (29), except for the fact that the nonadaptive step size $\tau(t) I_d$ in (29) is now replaced by a matrix $G(t)$, itself governed by the ODE (34). Thus, $G(t)$ in PETRELS can be viewed as an adaptive scheme for adjusting the step size.

To investigate how $G(t)$ evolves, we run an experiment for $d = 1$. In this case, the quantities $Q(t)$, $G(t)$ and $\Lambda$ reduce to three scalars, denoted by $Q(t)$, $G(t)$, and $\lambda$, respectively. Figure 4 shows the dynamics of PETRELS as it recovers this 1-D subspace. It shows that $G(t)$ increases initially, which helps to boost the convergence speed. As $Q(t)$ increases (meaning that the estimates become more accurate), however, the effective step size $G(t)$ gradually decreases, in order to help $Q(t)$ reach a higher steady-state value.

2. Subsampling vs. the SNR. The ODEs in Theorems 1 and 2 also reveal an interesting (asymptotic) equivalence between the subsampling ratio $\alpha$ and the SNR as specified by the matrix $\Lambda$. To see this, we observe from the definitions of the two functions $F$ and $H$ in (30) and (35) that $\alpha$ always appears together with $\Lambda$ in the form of the product $\alpha \Lambda^2$. This implies that an observation model with subsampling ratio $\alpha$ and SNR $\Lambda$ will have the same asymptotic performance as a different model with subsampling ratio $\hat{\alpha}$ and SNR $\sqrt{\alpha/\hat{\alpha}}\, \Lambda$. In simpler terms, having missing data is asymptotically equivalent to lowering the SNR in the fully observed setting.

B. Oja's Method and GROUSE: Analytical Solutions and Phase Transitions

Next, we investigate the dynamics of Oja's method and GROUSE by studying the solution of the ODE given in

Theorem 1. To that end, we consider a change of variables by defining
$$P(t) = \big[Q(t) Q^\top(t)\big]^{-1}. \qquad (37)$$
One may deduce from (29) that the evolution of $P(t)$ is also governed by a first-order ODE:
$$\frac{d}{dt} P(t) = A(t) - P(t) B(t) - B(t) P(t), \qquad (38)$$
where
$$A(t) = \tau(t)\big[2 + \tau(t)\big] \alpha \Lambda^2 \qquad (39)$$
$$B(t) = \tau(t)\Big(\alpha \Lambda^2 - \frac{\tau(t)}{2} I_d\Big) \qquad (40)$$
are two diagonal matrices. Thanks to the linearity of (38), it admits an analytical solution
$$P(t) = e^{-\int_0^t B(r)\,dr}\, P(0)\, e^{-\int_0^t B(r)\,dr} + \int_0^t A(s)\, e^{-2\int_s^t B(r)\,dr}\, ds. \qquad (41)$$
Note that the first term on the right-hand side of (41) represents the influence of the initial estimate $P(0) = [Q(0) Q^\top(0)]^{-1}$ on the current state at time $t$. In the special case of the algorithms using a constant step size, i.e., $\tau(t) \equiv \tau > 0$, the solution (41) may be further simplified as
$$P(t) = e^{-tB} P(0) e^{-tB} + Z(t), \qquad (42)$$
where $Z(t) = \operatorname{diag}\{z_1(t), \ldots, z_d(t)\}$ with
$$z_l(t) = \frac{(2 + \tau)\, \alpha \lambda_l^2}{2 \alpha \lambda_l^2 - \tau} \Big(1 - e^{-\tau (2\alpha\lambda_l^2 - \tau)\, t}\Big) \qquad (43)$$
for $1 \le l \le d$. Note that if $2\alpha\lambda_l^2 - \tau = 0$ for some $l$, the above expression for $z_l$ is understood via the convention that $\frac{1}{x}\big(1 - e^{-\tau x t}\big) = \tau t$ at $x = 0$.

The formula (42) reveals a phase transition phenomenon for the steady-state performance of the two algorithms as we change the step-size parameter $\tau$. To see that, we first recall that the eigenvalues of $Q^{(n)}(t)\big(Q^{(n)}(t)\big)^\top$ are exactly equal to the squared cosines of the principal angles $\{\theta_l^{(n)}(t)\}$ between the true subspace $U$ and the estimates given by the algorithms. We say an algorithm generates an asymptotically informative solution if
$$\lim_{t \to \infty} \lim_{n \to \infty} \cos^2\big(\theta_l^{(n)}(t)\big) > 0 \quad \text{for all } 1 \le l \le d, \qquad (44)$$
i.e., the steady-state estimates of the algorithms achieve nontrivial correlations with all the directions of $U$. In contrast, a noninformative solution corresponds to
$$\lim_{t \to \infty} \lim_{n \to \infty} \cos^2\big(\theta_l^{(n)}(t)\big) = 0 \quad \text{for all } 1 \le l \le d, \qquad (45)$$
in which case the steady-state estimates carry no information about $U$. For $d > 1$, one may also have the third situation where only a subset of the directions of $U$ can be recovered (with nontrivial correlations) by the algorithm.

Proposition 1: Let $\theta_l^{(n)}(t)$ denote the $l$th principal angle between the true subspace and the estimate obtained by Oja's method or GROUSE with a constant step size $\tau$. Under the same assumptions as in Theorem 1, we have
$$\lim_{t \to \infty} \lim_{n \to \infty} \cos^2\big(\theta_l^{(n)}(t)\big) = \max\Big\{0,\ \frac{2\alpha\lambda_l^2 - \tau}{\alpha\lambda_l^2 (2 + \tau)}\Big\}, \qquad (46)$$
where $\{\lambda_l\}$ are the SNR parameters defined in (2). It follows that the two algorithms provide asymptotically informative solutions if and only if
$$\tau < 2\alpha \min_l \lambda_l^2. \qquad (47)$$

Proof: Suppose the diagonal matrix $B$ in (40) has $d_1$ positive diagonal entries (with $d_1 \le d$), and $d_2 = d - d_1$ negative or zero entries. Without loss of generality, we may assume that $B$ can be split into a block form $B = \begin{bmatrix} B_1 & 0 \\ 0 & B_2 \end{bmatrix}$ such that $B_1$ only contains the positive diagonal entries, and $B_2$ only contains the nonpositive entries. Accordingly, we split the other two matrices in (42) as $P(0) = \begin{bmatrix} P_{1,1} & P_{1,2} \\ P_{2,1} & P_{2,2} \end{bmatrix}$ and $Z(t) = \begin{bmatrix} Z_1(t) & 0 \\ 0 & Z_2(t) \end{bmatrix}$. Applying the block matrix inverse formula to (42), we get
$$P^{-1}(t) = \begin{bmatrix} W_{1,1}(t) & W_{1,2}(t) \\ W_{2,1}(t) & W_{2,2}(t) \end{bmatrix}, \qquad (48)$$
where
$$W_{1,1}(t) = \Big[e^{-tB_1} P_{1,1} e^{-tB_1} + Z_1(t) - e^{-tB_1} P_{1,2}\, e^{-tB_2} \big(e^{-tB_2} P_{2,2}\, e^{-tB_2} + Z_2\big)^{-1} e^{-tB_2} P_{2,1}\, e^{-tB_1}\Big]^{-1}.$$
It is easy to verify from the definitions of $B$ and $Z$ that
$$\lim_{t \to \infty} W_{1,1}(t) = \operatorname{diag}\Big\{\frac{2\alpha\lambda_1^2 - \tau}{\alpha\lambda_1^2(2+\tau)}, \ldots, \frac{2\alpha\lambda_{d_1}^2 - \tau}{\alpha\lambda_{d_1}^2(2+\tau)}\Big\}. \qquad (49)$$
Similarly, we may verify that
$$\lim_{t \to \infty} W_{1,2}(t) = 0_{d_1 \times d_2} \quad \text{and} \quad \lim_{t \to \infty} W_{2,2}(t) = 0_{d_2 \times d_2}. \qquad (50)$$
Substituting (49) and (50) into (48) and recalling that the eigenvalues of $P^{-1}(t)$ are exactly equal to the squared cosines of the principal angles, we reach (46). Applying the conditions given in (44) and (45) to (46) yields (47).
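In code, the predictions (46) and (47) are one-liners; the following sketch (ours) evaluates the limiting squared cosines and the phase-transition condition:

import numpy as np

def steady_state_cos2(tau, alpha, lam):
    # Eq. (46): limiting squared cosine of the l-th principal angle for
    # Oja's method or GROUSE with a constant step size tau.
    lam2 = np.asarray(lam, dtype=float) ** 2
    return np.maximum(0.0, (2 * alpha * lam2 - tau) / (alpha * lam2 * (2 + tau)))

def is_informative(tau, alpha, lam):
    # Eq. (47): all d directions are recovered iff tau < 2 alpha min_l lambda_l^2.
    return tau < 2 * alpha * np.min(np.asarray(lam, dtype=float) ** 2)

print(steady_state_cos2(0.5, 0.5, [5, 4, 3, 2]))  # one value per direction
print(is_informative(0.5, 0.5, [5, 4, 3, 2]))     # True for these parameters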
C. Steady-State Analysis of PETRELS

The steady-state properties of PETRELS can also be obtained by studying the limiting ODEs given in Theorem 2. The coupling of $Q(t)$ and $G(t)$ in (33) and (34), however, makes the analysis much more challenging. Unlike the case of Oja's method and GROUSE, we are not able to obtain closed-form analytical solutions of the ODEs for PETRELS. In what follows, we restrict our discussion to the special case of $d = 1$. This simplifies the task, as the matrix-valued ODEs (33) and (34) reduce to scalar-valued ones.

It is not hard to verify that, for any solution $\{Q(t), G(t)\}$ with an initial condition $\{Q(0), G(0)\}$, there is a symmetric solution $\{-Q(t), G(t)\}$ for the initial condition $\{-Q(0), G(0)\}$. To remove this redundancy, it is convenient to investigate the dynamics of $Q^2(t)$ and $G(t)$, which satisfy the following ODEs
$$\frac{d}{dt}\big[Q^2(t)\big] = G Q^2 \Big[2\alpha\lambda^2 - G - 2 Q^2 \Big(1 + \frac{G}{2}\Big) \alpha\lambda^2\Big] \qquad (51)$$
$$\frac{d}{dt} G(t) = G\Big[\mu - G(G+1)\big(Q^2 \alpha\lambda^2 + 1\big)\Big]. \qquad (52)$$
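Trajectories of the scalar system (51)-(52), like those shown in Figure 5 below, can be reproduced with a direct Euler integration; a minimal sketch (ours, with $\alpha\lambda^2$ passed as a single parameter):

def petrels_1d_trajectory(q2, g, alpha_lam2, mu, T=50.0, dt=1e-3):
    # Euler integration of the scalar ODEs (51)-(52) from (q2, g).
    for _ in range(int(T / dt)):
        dq2 = g * q2 * (2 * alpha_lam2 - g - 2 * q2 * (1 + 0.5 * g) * alpha_lam2)
        dg = g * (mu - g * (g + 1) * (q2 * alpha_lam2 + 1))
        q2, g = q2 + dt * dq2, g + dt * dg
    return q2, g

# Different initializations flow to the same stable fixed point:
for q2_init in (0.05, 0.5, 0.9):
    print(petrels_1d_trajectory(q2_init, 1.0, alpha_lam2=2.0, mu=5.0))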

Figure 5. Phase portraits of the nonlinear ODEs in Theorem 2, with panels (a) and (b) showing informative solutions and panels (c) and (d) showing noninformative ones. The black curves are trajectories of the solutions $(Q^2(t), G(t))$ of the ODEs starting from different initial values. The green and red curves represent nontrivial solutions of the two stationary equations $\frac{d}{dt} Q^2(t) = 0$ and $\frac{d}{dt} G(t) = 0$. Their intersection point, if it exists, is a stable fixed point of the dynamical system. The fixed points of the top two figures correspond to $Q^2(\infty) > 0$, and thus the steady-state solutions in these two cases are informative. In contrast, the fixed points of the bottom two figures are associated with noninformative steady-state solutions with $Q^2(\infty) = 0$.

Figure 5 visualizes several different solution trajectories of these ODEs as the black curves in the $(Q^2, G)$ plane. These solutions start from different initial conditions at the borders of the figures, and they converge to certain stationary points. The locations of these stationary points depend on the SNR $\lambda$, the subsampling ratio $\alpha$ and the discount parameter $\mu$ used by the algorithm. In Figures 5(a) and 5(b), the stationary points correspond to $Q^2 > 0$, and thus the algorithm generates asymptotically informative solutions according to the definition in (44). In contrast, Figure 5(c) and Figure 5(d) show situations where the steady-state solutions are noninformative.

Proposition 2: Let $d = 1$. Under the same assumptions as in Theorem 2, PETRELS generates an asymptotically informative solution if and only if
$$\mu < \Big(2\alpha\lambda^2 + \frac{1}{2}\Big)^2 - \frac{1}{4}, \qquad (53)$$
where $\mu$ is the parameter defined in (27), $\lambda$ denotes the SNR in (2), and $\alpha$ is the subsampling ratio.

Proof: It follows from Theorem 2 that verifying the conditions (44) and (45) boils down to studying the fixed points of the dynamical system governed by the limiting ODEs (51) and (52). This task is in turn equivalent to setting the left-hand sides of the ODEs to zero and solving the resulting equations. Let $\{Q^*, G^*\}$ be any solution to the equations $\frac{d}{dt} Q^2 = 0$ and $\frac{d}{dt} G = 0$. From the forms of the right-hand sides of (51) and (52), we see that $\{Q^*, G^*\}$ must fall into one of the following three cases:

Case I: $G^* = 0$ and $Q^*$ can take arbitrary values;

Case II: $Q^* = 0$ and $G^*$ is the unique positive solution to
$$G^*(G^* + 1) = \mu; \qquad (54)$$

Case III: $Q^* \ne 0$ and $G^* \ne 0$.

A local stability analysis, deferred to the end of the proof, shows that the fixed points in Case I are always unstable, in the sense that any small perturbation will make the dynamics move away from these fixed points. Thus, we just need to focus on Case II and Case III, with the former corresponding to an uninformative solution and the latter to an informative one. We will show that, under (53), a fixed point in Case III exists and is the unique stable fixed point. That solution disappears when (53) ceases to hold, in which case the solution in Case II becomes the unique stable fixed point.

To see why (53) provides the phase transition boundary, we note that a solution in Case III, if it exists, must satisfy $(Q^*)^2 = f(G^*)$ and $(Q^*)^2 = h(G^*)$, where
$$f(G) = \frac{\alpha\lambda^2 - \frac{G}{2}}{\big(1 + \frac{G}{2}\big)\alpha\lambda^2} \qquad (55)$$
$$h(G) = \frac{\frac{\mu}{G(G+1)} - 1}{\alpha\lambda^2}. \qquad (56)$$
The above two equations are derived from $\frac{d}{dt} Q^2(t) = 0$ and $\frac{d}{dt} G(t) = 0$, respectively. In Figure 5, the functions $f(G)$ and $h(G)$ are plotted as the green and red dashed lines, respectively. It is easy to verify from their definitions that $f(G)$ and $h(G)$ are both monotonically decreasing in the feasible region ($0 \le Q^2 \le 1$ and $G > 0$). Moreover, $0 = f^{-1}(1) < h^{-1}(1)$, where $f^{-1}$ and $h^{-1}$ denote the inverse functions of $f$ and $h$, respectively.
Thus, a solution in Case III exists if $f^{-1}(0) > h^{-1}(0)$, which then leads to (53) after some algebraic manipulations.

Finally, we examine the local stability of the fixed points in Case I and Case II. Note that a fixed point $(Q^*, G^*)$ of the two-dimensional ODEs (51) and (52) is stable if and only if
$$\frac{\partial}{\partial Q^2}\Big[\frac{d}{dt} Q^2(t)\Big]\Big|_{Q=Q^*, G=G^*} < 0 \quad \text{and} \quad \frac{\partial}{\partial G}\Big[\frac{d}{dt} G(t)\Big]\Big|_{Q=Q^*, G=G^*} < 0,$$
where $\frac{d}{dt} Q^2(t)$ and $\frac{d}{dt} G(t)$ are the functions on the right-hand sides of (51) and (52), respectively. It follows that all the Case I fixed points are always unstable, because $\frac{\partial}{\partial G}\big[\frac{d}{dt} G(t)\big]\big|_{G=0} = \mu > 0$. Furthermore, the Case II fixed point is also unstable if (53) holds, because
$$\frac{\partial}{\partial Q^2}\Big[\frac{d}{dt} Q^2(t)\Big]\Big|_{Q=0, G=G^*} = \big(2\alpha\lambda^2 - G^*\big) G^* > 0,$$
where $G^*$ is the value specified in (54). On the other hand, when (53) does not hold, the Case II fixed point becomes stable.

Example 3: Proposition 2 predicts a critical choice of $\mu$ (as a function of the SNR $\lambda$ and the subsampling ratio $\alpha$) that separates informative solutions from noninformative ones. This prediction is confirmed numerically in Figure 6.
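The critical curve in Figure 6 follows directly from (53); in code (ours):

def petrels_informative(mu, alpha_lam2):
    # Eq. (53): for d = 1, PETRELS is asymptotically informative iff
    # mu < (2 alpha lambda^2 + 1/2)^2 - 1/4.
    return mu < (2 * alpha_lam2 + 0.5) ** 2 - 0.25

# For alpha * lambda^2 = 2 the threshold is 4.5^2 - 0.25 = 20:
print(petrels_informative(5.0, 2.0), petrels_informative(25.0, 2.0))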

In our experiments, we set $d = 1$ and $n = 10{,}000$. We then scan the parameter space of $\mu$ and $\alpha\lambda^2$. For each choice of these two parameters on our search grid, we perform independent trials, with each trial using different realizations of $c_k$ and $a_k$ in (1) and a different $U$ drawn uniformly at random from the $n$-dimensional sphere. The grayscale in Figure 6 shows the average value of the squared cosine similarity $Q^2(t)$ at $t = 3$.

Figure 6. The grayscale in the figure visualizes the steady-state squared cosine similarities of PETRELS corresponding to different values of the SNR $\lambda^2$, the subsampling ratio $\alpha$, and the step-size parameter $\mu$. The red curve is the theoretical prediction given in Proposition 2 of a phase transition boundary, below which no informative solution can be achieved by the algorithm. The theoretical prediction matches the numerical results.

V. DERIVATIONS OF THE ODES AND PROOF SKETCHES

In this section, we present a nonrigorous derivation of the limiting ODEs and sketch the main ingredients of our proofs of Theorems 1 and 2. More technical details and the complete proofs can be found in the Supplementary Materials [16].

A. Derivation of the ODE

In what follows, we show how one may derive the limiting ODE in Theorem 1. We focus on GROUSE, but the other two algorithms can be treated similarly. For simplicity, we consider the case in which the subspace dimension is $d = 1$. In this case, the true subspace $U$ in (1) and its estimate $X_k$ given by Algorithm 2 reduce to vectors $u$ and $x_k$, respectively. The covariance matrix $\Lambda$ in (2) also reduces to a scalar $\lambda$. Consequently, the weight vector $\hat{w}_k$ obtained in (9) becomes a scalar $\hat{w}_k = x_k^\top \Omega_k s_k / \|\Omega_k x_k\|^2$.

Our first observation is that the dynamics of GROUSE can be modeled by a Markov chain $(x_k, u_k)$ on $\mathbb{R}^{2n}$, where $u_k \equiv u$ for all $k$. The update rule of this Markov chain is
$$x_{k+1} - x_k = \Big[(\cos(\theta_k) - 1) \frac{p_k}{\|p_k\|} + \sin(\theta_k) \frac{r_k}{\|r_k\|}\Big] \mathbb{1}_{A_k}, \qquad (57)$$
where $A_k = \{\|\Omega_k x_k\|^2 > \epsilon\}$. Here, the indicator function $\mathbb{1}_{A_k}$ encodes the test in line 3 of Algorithm 2. Since we are considering the special case of $d = 1$, the vectors $r_k$ and $p_k$ as originally defined in (12) can be rewritten as
$$r_k = \Omega_k (s_k - p_k) \qquad (58)$$
$$p_k = x_k\, \frac{x_k^\top \Omega_k s_k}{\|\Omega_k x_k\|^2}. \qquad (59)$$
Multiplying both sides of (57) from the left by $u^\top$, we get
$$Q_{k+1} - Q_k = \frac{1}{n}\, g_k, \qquad (60)$$
where
$$g_k = n \Big[(\cos(\theta_k) - 1) \frac{u^\top p_k}{\|p_k\|} + \sin(\theta_k) \frac{u^\top r_k}{\|r_k\|}\Big] \mathbb{1}_{A_k} \qquad (61)$$
specifies the increment of the cosine similarity from $Q_k$ to $Q_{k+1}$. To derive the limiting ODE, we first rewrite (60) as
$$\frac{Q_{k+1} - Q_k}{1/n} = \mathbb{E}_k\, g_k + (g_k - \mathbb{E}_k\, g_k), \qquad (62)$$
where $\mathbb{E}_k$ denotes conditional expectation with respect to all the random elements encountered up to step $k$, i.e., $\{c_j, a_j, \Omega_j\}_{j<k}$ in the generative model (1). One can show that
$$\mathbb{E}_k (g_k - \mathbb{E}_k\, g_k)^2 = O(1) \qquad (63)$$
and
$$\mathbb{E}_k\, g_k = F(Q_k, \tau_k) + O(1/\sqrt{n}), \qquad (64)$$
where $F(\cdot, \cdot)$ is the function defined in (30). Substituting (64) into (62) and omitting the zero-mean difference term $(g_k - \mathbb{E}_k\, g_k)$, we get
$$\frac{Q_{k+1} - Q_k}{1/n} = F(Q_k, \tau_k) + O(1/\sqrt{n}). \qquad (65)$$
Let $Q(t)$ be a continuous-time process defined as in (24), with $t = k/n$ being the rescaled time. In an intuitive but nonrigorous way, we then have $\frac{Q_{k+1} - Q_k}{1/n} \to \frac{d}{dt} Q(t)$ as $n \to \infty$. This then gives us the ODE in (29).

In what follows, we provide some additional details behind the estimate in (64). To simplify our presentation, we first introduce a few variables. Let
$$z_k = \|\Omega_k x_k\|^2, \quad \tilde{z}_k = \frac{1}{n}\|\Omega_k s_k\|^2, \quad \tilde{p}_k = u^\top \Omega_k s_k, \quad \tilde{q}_k = x_k^\top \Omega_k s_k, \quad \tilde{Q}_k = u^\top \Omega_k x_k. \qquad (66)$$
Since $\|u\| = \|x_k\| = 1$, all these variables are $O(1)$ quantities when $n \to \infty$. (See Lemma 5 in the Supplementary Materials.) Given its definition in (13), we rewrite $\theta_k$ used in (57) as
$$\theta_k^2 = \frac{\tau_k^2}{n}\, \frac{\tilde{q}_k^2}{z_k} \Big[\tilde{z}_k - \frac{\tilde{q}_k^2}{n z_k}\Big] = O(1/n).$$
Thus, it is natural to expand the two terms $\cos(\theta_k)$ and $\sin(\theta_k)$ that appear in (57) via a Taylor series expansion, which yields
$$\cos(\theta_k) = 1 - \frac{\tau_k^2 \|r_k\|^2 \|p_k\|^2}{2 n^2} + O(n^{-2}) \qquad (67)$$
$$\sin(\theta_k) = \frac{\tau_k}{n} \|r_k\| \|p_k\| + O(n^{-3/2}).$$
Substituting (67) into (61) gives us
$$g_k = \frac{\tau_k}{z_k} \Big[\tilde{p}_k \tilde{q}_k - \frac{\tilde{q}_k^2}{z_k}\Big(\tilde{Q}_k + \frac{\tau_k}{2}\, \tilde{z}_k\, Q_k\Big)\Big] \mathbb{1}_{\{z_k > \epsilon\}} + O(n^{-1/2}). \qquad (68)$$
A rigorous justification of this step is presented as Lemma 8 in the Supplementary Materials [16].


Online ICA: Understanding Global Dynamics of Nonconvex Optimization via Diffusion Processes Online ICA: Unerstaning Global Dynamics of Nonconvex Optimization via Diffusion Processes Chris Junchi Li Zhaoran Wang Han Liu Department of Operations Research an Financial Engineering, Princeton University

More information

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x)

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x) Y. D. Chong (2016) MH2801: Complex Methos for the Sciences 1. Derivatives The erivative of a function f(x) is another function, efine in terms of a limiting expression: f (x) f (x) lim x δx 0 f(x + δx)

More information

Quantum mechanical approaches to the virial

Quantum mechanical approaches to the virial Quantum mechanical approaches to the virial S.LeBohec Department of Physics an Astronomy, University of Utah, Salt Lae City, UT 84112, USA Date: June 30 th 2015 In this note, we approach the virial from

More information

Euler equations for multiple integrals

Euler equations for multiple integrals Euler equations for multiple integrals January 22, 2013 Contents 1 Reminer of multivariable calculus 2 1.1 Vector ifferentiation......................... 2 1.2 Matrix ifferentiation........................

More information

All s Well That Ends Well: Supplementary Proofs

All s Well That Ends Well: Supplementary Proofs All s Well That Ens Well: Guarantee Resolution of Simultaneous Rigi Boy Impact 1:1 All s Well That Ens Well: Supplementary Proofs This ocument complements the paper All s Well That Ens Well: Guarantee

More information

Diagonalization of Matrices Dr. E. Jacobs

Diagonalization of Matrices Dr. E. Jacobs Diagonalization of Matrices Dr. E. Jacobs One of the very interesting lessons in this course is how certain algebraic techniques can be use to solve ifferential equations. The purpose of these notes is

More information

Sturm-Liouville Theory

Sturm-Liouville Theory LECTURE 5 Sturm-Liouville Theory In the three preceing lectures I emonstrate the utility of Fourier series in solving PDE/BVPs. As we ll now see, Fourier series are just the tip of the iceberg of the theory

More information

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks A PAC-Bayesian Approach to Spectrally-Normalize Margin Bouns for Neural Networks Behnam Neyshabur, Srinah Bhojanapalli, Davi McAllester, Nathan Srebro Toyota Technological Institute at Chicago {bneyshabur,

More information

Generalization of the persistent random walk to dimensions greater than 1

Generalization of the persistent random walk to dimensions greater than 1 PHYSICAL REVIEW E VOLUME 58, NUMBER 6 DECEMBER 1998 Generalization of the persistent ranom walk to imensions greater than 1 Marián Boguñá, Josep M. Porrà, an Jaume Masoliver Departament e Física Fonamental,

More information

Generalized Tractability for Multivariate Problems

Generalized Tractability for Multivariate Problems Generalize Tractability for Multivariate Problems Part II: Linear Tensor Prouct Problems, Linear Information, an Unrestricte Tractability Michael Gnewuch Department of Computer Science, University of Kiel,

More information

Admin BACKPROPAGATION. Neural network. Neural network 11/3/16. Assignment 7. Assignment 8 Goals today. David Kauchak CS158 Fall 2016

Admin BACKPROPAGATION. Neural network. Neural network 11/3/16. Assignment 7. Assignment 8 Goals today. David Kauchak CS158 Fall 2016 Amin Assignment 7 Assignment 8 Goals toay BACKPROPAGATION Davi Kauchak CS58 Fall 206 Neural network Neural network inputs inputs some inputs are provie/ entere Iniviual perceptrons/ neurons Neural network

More information

Equilibrium in Queues Under Unknown Service Times and Service Value

Equilibrium in Queues Under Unknown Service Times and Service Value University of Pennsylvania ScholarlyCommons Finance Papers Wharton Faculty Research 1-2014 Equilibrium in Queues Uner Unknown Service Times an Service Value Laurens Debo Senthil K. Veeraraghavan University

More information

Tractability results for weighted Banach spaces of smooth functions

Tractability results for weighted Banach spaces of smooth functions Tractability results for weighte Banach spaces of smooth functions Markus Weimar Mathematisches Institut, Universität Jena Ernst-Abbe-Platz 2, 07740 Jena, Germany email: markus.weimar@uni-jena.e March

More information

Error Floors in LDPC Codes: Fast Simulation, Bounds and Hardware Emulation

Error Floors in LDPC Codes: Fast Simulation, Bounds and Hardware Emulation Error Floors in LDPC Coes: Fast Simulation, Bouns an Harware Emulation Pamela Lee, Lara Dolecek, Zhengya Zhang, Venkat Anantharam, Borivoje Nikolic, an Martin J. Wainwright EECS Department University of

More information

The total derivative. Chapter Lagrangian and Eulerian approaches

The total derivative. Chapter Lagrangian and Eulerian approaches Chapter 5 The total erivative 51 Lagrangian an Eulerian approaches The representation of a flui through scalar or vector fiels means that each physical quantity uner consieration is escribe as a function

More information

LECTURE NOTES ON DVORETZKY S THEOREM

LECTURE NOTES ON DVORETZKY S THEOREM LECTURE NOTES ON DVORETZKY S THEOREM STEVEN HEILMAN Abstract. We present the first half of the paper [S]. In particular, the results below, unless otherwise state, shoul be attribute to G. Schechtman.

More information

Monotonicity for excited random walk in high dimensions

Monotonicity for excited random walk in high dimensions Monotonicity for excite ranom walk in high imensions Remco van er Hofsta Mark Holmes March, 2009 Abstract We prove that the rift θ, β) for excite ranom walk in imension is monotone in the excitement parameter

More information

On combinatorial approaches to compressed sensing

On combinatorial approaches to compressed sensing On combinatorial approaches to compresse sensing Abolreza Abolhosseini Moghaam an Hayer Raha Department of Electrical an Computer Engineering, Michigan State University, East Lansing, MI, U.S. Emails:{abolhos,raha}@msu.eu

More information

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes Leaving Ranomness to Nature: -Dimensional Prouct Coes through the lens of Generalize-LDPC coes Tavor Baharav, Kannan Ramchanran Dept. of Electrical Engineering an Computer Sciences, U.C. Berkeley {tavorb,

More information

05 The Continuum Limit and the Wave Equation

05 The Continuum Limit and the Wave Equation Utah State University DigitalCommons@USU Founations of Wave Phenomena Physics, Department of 1-1-2004 05 The Continuum Limit an the Wave Equation Charles G. Torre Department of Physics, Utah State University,

More information

On the Surprising Behavior of Distance Metrics in High Dimensional Space

On the Surprising Behavior of Distance Metrics in High Dimensional Space On the Surprising Behavior of Distance Metrics in High Dimensional Space Charu C. Aggarwal, Alexaner Hinneburg 2, an Daniel A. Keim 2 IBM T. J. Watson Research Center Yortown Heights, NY 0598, USA. charu@watson.ibm.com

More information

Hyperbolic Moment Equations Using Quadrature-Based Projection Methods

Hyperbolic Moment Equations Using Quadrature-Based Projection Methods Hyperbolic Moment Equations Using Quarature-Base Projection Methos J. Koellermeier an M. Torrilhon Department of Mathematics, RWTH Aachen University, Aachen, Germany Abstract. Kinetic equations like the

More information

Agmon Kolmogorov Inequalities on l 2 (Z d )

Agmon Kolmogorov Inequalities on l 2 (Z d ) Journal of Mathematics Research; Vol. 6, No. ; 04 ISSN 96-9795 E-ISSN 96-9809 Publishe by Canaian Center of Science an Eucation Agmon Kolmogorov Inequalities on l (Z ) Arman Sahovic Mathematics Department,

More information

θ x = f ( x,t) could be written as

θ x = f ( x,t) could be written as 9. Higher orer PDEs as systems of first-orer PDEs. Hyperbolic systems. For PDEs, as for ODEs, we may reuce the orer by efining new epenent variables. For example, in the case of the wave equation, (1)

More information

Lecture XII. where Φ is called the potential function. Let us introduce spherical coordinates defined through the relations

Lecture XII. where Φ is called the potential function. Let us introduce spherical coordinates defined through the relations Lecture XII Abstract We introuce the Laplace equation in spherical coorinates an apply the metho of separation of variables to solve it. This will generate three linear orinary secon orer ifferential equations:

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Keywors: multi-view learning, clustering, canonical correlation analysis Abstract Clustering ata in high-imensions is believe to be a har problem in general. A number of efficient clustering algorithms

More information

12.11 Laplace s Equation in Cylindrical and

12.11 Laplace s Equation in Cylindrical and SEC. 2. Laplace s Equation in Cylinrical an Spherical Coorinates. Potential 593 2. Laplace s Equation in Cylinrical an Spherical Coorinates. Potential One of the most important PDEs in physics an engineering

More information

Dot trajectories in the superposition of random screens: analysis and synthesis

Dot trajectories in the superposition of random screens: analysis and synthesis 1472 J. Opt. Soc. Am. A/ Vol. 21, No. 8/ August 2004 Isaac Amiror Dot trajectories in the superposition of ranom screens: analysis an synthesis Isaac Amiror Laboratoire e Systèmes Périphériques, Ecole

More information

arxiv: v1 [math.mg] 10 Apr 2018

arxiv: v1 [math.mg] 10 Apr 2018 ON THE VOLUME BOUND IN THE DVORETZKY ROGERS LEMMA FERENC FODOR, MÁRTON NASZÓDI, AND TAMÁS ZARNÓCZ arxiv:1804.03444v1 [math.mg] 10 Apr 2018 Abstract. The classical Dvoretzky Rogers lemma provies a eterministic

More information

Robustness and Perturbations of Minimal Bases

Robustness and Perturbations of Minimal Bases Robustness an Perturbations of Minimal Bases Paul Van Dooren an Froilán M Dopico December 9, 2016 Abstract Polynomial minimal bases of rational vector subspaces are a classical concept that plays an important

More information

The Principle of Least Action

The Principle of Least Action Chapter 7. The Principle of Least Action 7.1 Force Methos vs. Energy Methos We have so far stuie two istinct ways of analyzing physics problems: force methos, basically consisting of the application of

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri ITA, UC San Diego, 9500 Gilman Drive, La Jolla, CA Sham M. Kakae Karen Livescu Karthik Sriharan Toyota Technological Institute at Chicago, 6045 S. Kenwoo Ave., Chicago, IL kamalika@soe.ucs.eu

More information

Lower bounds on Locality Sensitive Hashing

Lower bounds on Locality Sensitive Hashing Lower bouns on Locality Sensitive Hashing Rajeev Motwani Assaf Naor Rina Panigrahy Abstract Given a metric space (X, X ), c 1, r > 0, an p, q [0, 1], a istribution over mappings H : X N is calle a (r,

More information

How to Minimize Maximum Regret in Repeated Decision-Making

How to Minimize Maximum Regret in Repeated Decision-Making How to Minimize Maximum Regret in Repeate Decision-Making Karl H. Schlag July 3 2003 Economics Department, European University Institute, Via ella Piazzuola 43, 033 Florence, Italy, Tel: 0039-0-4689, email:

More information

Sharp Thresholds. Zachary Hamaker. March 15, 2010

Sharp Thresholds. Zachary Hamaker. March 15, 2010 Sharp Threshols Zachary Hamaker March 15, 2010 Abstract The Kolmogorov Zero-One law states that for tail events on infinite-imensional probability spaces, the probability must be either zero or one. Behavior

More information

ALGEBRAIC AND ANALYTIC PROPERTIES OF ARITHMETIC FUNCTIONS

ALGEBRAIC AND ANALYTIC PROPERTIES OF ARITHMETIC FUNCTIONS ALGEBRAIC AND ANALYTIC PROPERTIES OF ARITHMETIC FUNCTIONS MARK SCHACHNER Abstract. When consiere as an algebraic space, the set of arithmetic functions equippe with the operations of pointwise aition an

More information

TOEPLITZ AND POSITIVE SEMIDEFINITE COMPLETION PROBLEM FOR CYCLE GRAPH

TOEPLITZ AND POSITIVE SEMIDEFINITE COMPLETION PROBLEM FOR CYCLE GRAPH English NUMERICAL MATHEMATICS Vol14, No1 Series A Journal of Chinese Universities Feb 2005 TOEPLITZ AND POSITIVE SEMIDEFINITE COMPLETION PROBLEM FOR CYCLE GRAPH He Ming( Λ) Michael K Ng(Ξ ) Abstract We

More information

Quantum Mechanics in Three Dimensions

Quantum Mechanics in Three Dimensions Physics 342 Lecture 20 Quantum Mechanics in Three Dimensions Lecture 20 Physics 342 Quantum Mechanics I Monay, March 24th, 2008 We begin our spherical solutions with the simplest possible case zero potential.

More information

Parameter estimation: A new approach to weighting a priori information

Parameter estimation: A new approach to weighting a priori information Parameter estimation: A new approach to weighting a priori information J.L. Mea Department of Mathematics, Boise State University, Boise, ID 83725-555 E-mail: jmea@boisestate.eu Abstract. We propose a

More information

arxiv: v1 [cs.lg] 22 Mar 2014

arxiv: v1 [cs.lg] 22 Mar 2014 CUR lgorithm with Incomplete Matrix Observation Rong Jin an Shenghuo Zhu Dept. of Computer Science an Engineering, Michigan State University, rongjin@msu.eu NEC Laboratories merica, Inc., zsh@nec-labs.com

More information

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences.

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences. S 63 Lecture 8 2/2/26 Lecturer Lillian Lee Scribes Peter Babinski, Davi Lin Basic Language Moeling Approach I. Special ase of LM-base Approach a. Recap of Formulas an Terms b. Fixing θ? c. About that Multinomial

More information

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y Ph195a lecture notes, 1/3/01 Density operators for spin- 1 ensembles So far in our iscussion of spin- 1 systems, we have restricte our attention to the case of pure states an Hamiltonian evolution. Toay

More information

Linear First-Order Equations

Linear First-Order Equations 5 Linear First-Orer Equations Linear first-orer ifferential equations make up another important class of ifferential equations that commonly arise in applications an are relatively easy to solve (in theory)

More information

Optimized Schwarz Methods with the Yin-Yang Grid for Shallow Water Equations

Optimized Schwarz Methods with the Yin-Yang Grid for Shallow Water Equations Optimize Schwarz Methos with the Yin-Yang Gri for Shallow Water Equations Abessama Qaouri Recherche en prévision numérique, Atmospheric Science an Technology Directorate, Environment Canaa, Dorval, Québec,

More information

Switching Time Optimization in Discretized Hybrid Dynamical Systems

Switching Time Optimization in Discretized Hybrid Dynamical Systems Switching Time Optimization in Discretize Hybri Dynamical Systems Kathrin Flaßkamp, To Murphey, an Sina Ober-Blöbaum Abstract Switching time optimization (STO) arises in systems that have a finite set

More information

6 General properties of an autonomous system of two first order ODE

6 General properties of an autonomous system of two first order ODE 6 General properties of an autonomous system of two first orer ODE Here we embark on stuying the autonomous system of two first orer ifferential equations of the form ẋ 1 = f 1 (, x 2 ), ẋ 2 = f 2 (, x

More information

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS Yannick DEVILLE Université Paul Sabatier Laboratoire Acoustique, Métrologie, Instrumentation Bât. 3RB2, 8 Route e Narbonne,

More information

TMA 4195 Matematisk modellering Exam Tuesday December 16, :00 13:00 Problems and solution with additional comments

TMA 4195 Matematisk modellering Exam Tuesday December 16, :00 13:00 Problems and solution with additional comments Problem F U L W D g m 3 2 s 2 0 0 0 0 2 kg 0 0 0 0 0 0 Table : Dimension matrix TMA 495 Matematisk moellering Exam Tuesay December 6, 2008 09:00 3:00 Problems an solution with aitional comments The necessary

More information

BEYOND THE CONSTRUCTION OF OPTIMAL SWITCHING SURFACES FOR AUTONOMOUS HYBRID SYSTEMS. Mauro Boccadoro Magnus Egerstedt Paolo Valigi Yorai Wardi

BEYOND THE CONSTRUCTION OF OPTIMAL SWITCHING SURFACES FOR AUTONOMOUS HYBRID SYSTEMS. Mauro Boccadoro Magnus Egerstedt Paolo Valigi Yorai Wardi BEYOND THE CONSTRUCTION OF OPTIMAL SWITCHING SURFACES FOR AUTONOMOUS HYBRID SYSTEMS Mauro Boccaoro Magnus Egerstet Paolo Valigi Yorai Wari {boccaoro,valigi}@iei.unipg.it Dipartimento i Ingegneria Elettronica

More information

Discrete Mathematics

Discrete Mathematics Discrete Mathematics 309 (009) 86 869 Contents lists available at ScienceDirect Discrete Mathematics journal homepage: wwwelseviercom/locate/isc Profile vectors in the lattice of subspaces Dániel Gerbner

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri ITA, UC San Diego, 9500 Gilman Drive, La Jolla, CA Sham M. Kakae Karen Livescu Karthik Sriharan Toyota Technological Institute at Chicago, 6045 S. Kenwoo Ave., Chicago, IL kamalika@soe.ucs.eu

More information

Simultaneous Input and State Estimation with a Delay

Simultaneous Input and State Estimation with a Delay 15 IEEE 5th Annual Conference on Decision an Control (CDC) December 15-18, 15. Osaa, Japan Simultaneous Input an State Estimation with a Delay Sze Zheng Yong a Minghui Zhu b Emilio Frazzoli a Abstract

More information

Assignment 1. g i (x 1,..., x n ) dx i = 0. i=1

Assignment 1. g i (x 1,..., x n ) dx i = 0. i=1 Assignment 1 Golstein 1.4 The equations of motion for the rolling isk are special cases of general linear ifferential equations of constraint of the form g i (x 1,..., x n x i = 0. i=1 A constraint conition

More information

IERCU. Institute of Economic Research, Chuo University 50th Anniversary Special Issues. Discussion Paper No.210

IERCU. Institute of Economic Research, Chuo University 50th Anniversary Special Issues. Discussion Paper No.210 IERCU Institute of Economic Research, Chuo University 50th Anniversary Special Issues Discussion Paper No.210 Discrete an Continuous Dynamics in Nonlinear Monopolies Akio Matsumoto Chuo University Ferenc

More information

arxiv: v2 [math.pr] 27 Nov 2018

arxiv: v2 [math.pr] 27 Nov 2018 Range an spee of rotor wals on trees arxiv:15.57v [math.pr] 7 Nov 1 Wilfrie Huss an Ecaterina Sava-Huss November, 1 Abstract We prove a law of large numbers for the range of rotor wals with ranom initial

More information

The effect of dissipation on solutions of the complex KdV equation

The effect of dissipation on solutions of the complex KdV equation Mathematics an Computers in Simulation 69 (25) 589 599 The effect of issipation on solutions of the complex KV equation Jiahong Wu a,, Juan-Ming Yuan a,b a Department of Mathematics, Oklahoma State University,

More information

A Review of Multiple Try MCMC algorithms for Signal Processing

A Review of Multiple Try MCMC algorithms for Signal Processing A Review of Multiple Try MCMC algorithms for Signal Processing Luca Martino Image Processing Lab., Universitat e València (Spain) Universia Carlos III e Mari, Leganes (Spain) Abstract Many applications

More information

II. First variation of functionals

II. First variation of functionals II. First variation of functionals The erivative of a function being zero is a necessary conition for the etremum of that function in orinary calculus. Let us now tackle the question of the equivalent

More information

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS ALINA BUCUR, CHANTAL DAVID, BROOKE FEIGON, MATILDE LALÍN 1 Introuction In this note, we stuy the fluctuations in the number

More information

Implicit Differentiation

Implicit Differentiation Implicit Differentiation Thus far, the functions we have been concerne with have been efine explicitly. A function is efine explicitly if the output is given irectly in terms of the input. For instance,

More information

KNN Particle Filters for Dynamic Hybrid Bayesian Networks

KNN Particle Filters for Dynamic Hybrid Bayesian Networks KNN Particle Filters for Dynamic Hybri Bayesian Networs H. D. Chen an K. C. Chang Dept. of Systems Engineering an Operations Research George Mason University MS 4A6, 4400 University Dr. Fairfax, VA 22030

More information

Function Spaces. 1 Hilbert Spaces

Function Spaces. 1 Hilbert Spaces Function Spaces A function space is a set of functions F that has some structure. Often a nonparametric regression function or classifier is chosen to lie in some function space, where the assume structure

More information

State observers and recursive filters in classical feedback control theory

State observers and recursive filters in classical feedback control theory State observers an recursive filters in classical feeback control theory State-feeback control example: secon-orer system Consier the riven secon-orer system q q q u x q x q x x x x Here u coul represent

More information

Optimal Variable-Structure Control Tracking of Spacecraft Maneuvers

Optimal Variable-Structure Control Tracking of Spacecraft Maneuvers Optimal Variable-Structure Control racking of Spacecraft Maneuvers John L. Crassiis 1 Srinivas R. Vaali F. Lanis Markley 3 Introuction In recent years, much effort has been evote to the close-loop esign

More information

ON THE OPTIMALITY SYSTEM FOR A 1 D EULER FLOW PROBLEM

ON THE OPTIMALITY SYSTEM FOR A 1 D EULER FLOW PROBLEM ON THE OPTIMALITY SYSTEM FOR A D EULER FLOW PROBLEM Eugene M. Cliff Matthias Heinkenschloss y Ajit R. Shenoy z Interisciplinary Center for Applie Mathematics Virginia Tech Blacksburg, Virginia 46 Abstract

More information

Energy behaviour of the Boris method for charged-particle dynamics

Energy behaviour of the Boris method for charged-particle dynamics Version of 25 April 218 Energy behaviour of the Boris metho for charge-particle ynamics Ernst Hairer 1, Christian Lubich 2 Abstract The Boris algorithm is a wiely use numerical integrator for the motion

More information

The Subtree Size Profile of Plane-oriented Recursive Trees

The Subtree Size Profile of Plane-oriented Recursive Trees The Subtree Size Profile of Plane-oriente Recursive Trees Michael FUCHS Department of Applie Mathematics National Chiao Tung University Hsinchu, 3, Taiwan Email: mfuchs@math.nctu.eu.tw Abstract In this

More information

SINGULAR PERTURBATION AND STATIONARY SOLUTIONS OF PARABOLIC EQUATIONS IN GAUSS-SOBOLEV SPACES

SINGULAR PERTURBATION AND STATIONARY SOLUTIONS OF PARABOLIC EQUATIONS IN GAUSS-SOBOLEV SPACES Communications on Stochastic Analysis Vol. 2, No. 2 (28) 289-36 Serials Publications www.serialspublications.com SINGULAR PERTURBATION AND STATIONARY SOLUTIONS OF PARABOLIC EQUATIONS IN GAUSS-SOBOLEV SPACES

More information

Hyperbolic Systems of Equations Posed on Erroneous Curved Domains

Hyperbolic Systems of Equations Posed on Erroneous Curved Domains Hyperbolic Systems of Equations Pose on Erroneous Curve Domains Jan Norström a, Samira Nikkar b a Department of Mathematics, Computational Mathematics, Linköping University, SE-58 83 Linköping, Sween (

More information

Perturbation Analysis and Optimization of Stochastic Flow Networks

Perturbation Analysis and Optimization of Stochastic Flow Networks IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. XX, NO. Y, MMM 2004 1 Perturbation Analysis an Optimization of Stochastic Flow Networks Gang Sun, Christos G. Cassanras, Yorai Wari, Christos G. Panayiotou,

More information

The Exact Form and General Integrating Factors

The Exact Form and General Integrating Factors 7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily

More information