Fast and Memory Optimal Low-Rank Matrix Approximation


Fast and Memory Optimal Low-Rank Matrix Approximation
Yun Se-Young, Marc Lelarge, Alexandre Proutière

To cite this version: Yun Se-Young, Marc Lelarge, Alexandre Proutière. Fast and Memory Optimal Low-Rank Matrix Approximation. NIPS 2015, Dec 2015, Montreal, Canada. <hal > HAL Id: hal  Submitted on 12 Jan 2016.

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Fast and Memory Optimal Low-Rank Matrix Approximation

Se-Young Yun (MSR, Cambridge), Marc Lelarge (Inria & ENS), Alexandre Proutiere (KTH, EE School / ACL), alepro@kth.se

(Work performed as part of the MSR-INRIA joint research centre. M.L. acknowledges the support of the French Agence Nationale de la Recherche (ANR) under reference ANR-11-JS (GAP project). A. Proutiere's research is supported by the ERC FSA grant, and the SSF ICT-Psi project.)

Abstract

In this paper, we revisit the problem of constructing a near-optimal rank $k$ approximation of a matrix $M \in [0,1]^{m \times n}$ under the streaming data model where the columns of $M$ are revealed sequentially. We present SLA (Streaming Low-rank Approximation), an algorithm that is asymptotically accurate when $k\, s_{k+1}(M) = o(\sqrt{mn})$, where $s_{k+1}(M)$ is the $(k+1)$-th largest singular value of $M$. This means that its average mean-square error converges to 0 as $m$ and $n$ grow large (i.e., $\|\hat{M}^{(k)} - M^{(k)}\|_F^2 = o(mn)$ with high probability, where $\hat{M}^{(k)}$ and $M^{(k)}$ denote the output of SLA and the optimal rank $k$ approximation of $M$, respectively). Our algorithm makes one pass on the data if the columns of $M$ are revealed in a random order, and two passes if the columns of $M$ arrive in an arbitrary order. To reduce its memory footprint and complexity, SLA uses random sparsification, and samples each entry of $M$ with a small probability $\delta$. In turn, SLA is memory optimal as its required memory space scales as $k(m+n)$, the dimension of its output. Furthermore, SLA is computationally efficient as it runs in $O(\delta k m n)$ time (a constant number of operations is made for each observed entry of $M$), which can be as small as $O(k \log(m)^4 n)$ for an appropriate choice of $\delta$ and if $m \le n$.

1 Introduction

We investigate the problem of constructing, in a memory and computationally efficient manner, an accurate estimate of the optimal rank $k$ approximation $M^{(k)}$ of a large ($m \times n$) matrix $M \in [0,1]^{m \times n}$. This problem is fundamental in machine learning, and has naturally found numerous applications in computer science. The optimal rank $k$ approximation $M^{(k)}$ minimizes, over all rank $k$ matrices $Z$, the Frobenius norm $\|M - Z\|_F$ (and any norm that is invariant under rotation), and can be computed by Singular Value Decomposition (SVD) of $M$ in $O(mn^2)$ time (if we assume that $m \ge n$). For massive matrices $M$ (i.e., when $m$ and $n$ are very large), this becomes unacceptably slow. In addition, storing and manipulating $M$ in memory may become difficult. In this paper, we design a memory and computationally efficient algorithm, referred to as Streaming Low-rank Approximation (SLA), that computes a near-optimal rank $k$ approximation $\hat{M}^{(k)}$. Under mild assumptions on $M$, the SLA algorithm is asymptotically accurate in the sense that as $m$ and $n$ grow large, its average mean-square error converges to 0, i.e., $\|\hat{M}^{(k)} - M^{(k)}\|_F^2 = o(mn)$ with high probability (we interpret $M^{(k)}$ as the signal that we aim to recover from a noisy observation $M$). To reduce its memory footprint and running time, the proposed algorithm combines random sparsification and the idea of the streaming data model. More precisely, each entry of $M$ is revealed to the algorithm with probability $\delta$, called the sampling rate.
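For reference, the object that SLA approximates, the optimal rank-$k$ approximation $M^{(k)}$, can be computed exactly by a truncated SVD whenever $M$ fits in memory. The following minimal NumPy sketch of this (memory- and time-expensive) baseline uses illustrative names and toy data; it is not part of SLA.

```python
import numpy as np

def optimal_rank_k(M, k):
    """Optimal rank-k approximation M^(k) via a full truncated SVD.

    This is the expensive baseline SLA approximates: it needs the whole
    m x n matrix in memory and O(m n^2) operations (for m >= n).
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)   # M = U diag(s) Vt
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # keep the top-k triplets

# Toy usage: a noisy rank-2 "signal" matrix with entries in [0, 1].
rng = np.random.default_rng(0)
m, n, k = 200, 400, 2
signal = rng.random((m, k)) @ rng.random((k, n)) / k
M = np.clip(signal + 0.05 * rng.standard_normal((m, n)), 0.0, 1.0)
Mk = optimal_rank_k(M, k)
print(np.linalg.norm(M - Mk, "fro") / np.linalg.norm(M, "fro"))
```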

Moreover, SLA observes and treats the columns of $M$ one after the other in a sequential manner. The sequence of observed columns may be chosen uniformly at random, in which case the algorithm requires one pass on $M$ only, or can be arbitrary, in which case the algorithm needs two passes. SLA first stores $l = \frac{1}{\delta \log(m)}$ randomly selected columns, and extracts via spectral decomposition an estimator of parts of the $k$ top right singular vectors of $M$. It then completes the estimator of these vectors by receiving and treating the remaining columns sequentially. SLA finally builds, from the estimated top $k$ right singular vectors, the linear projection onto the subspace generated by these vectors, and deduces an estimator of $M^{(k)}$.

The analysis of the performance of SLA is presented in Theorems 7 and 8. In summary: when $m \le n$ and $\frac{\log^4(m)}{m} \le \delta \le 8/9$, with probability $1 - \frac{k}{\delta m}$, the output $\hat{M}^{(k)}$ of SLA satisfies:

$$\frac{\|M^{(k)} - \hat{M}^{(k)}\|_F^2}{mn} = O\!\left( k^2 \left( \frac{s_{k+1}^2(M)}{mn} + \sqrt{\frac{\log(m)}{\delta m}} \right) \right), \qquad (1)$$

where $s_{k+1}(M)$ is the $(k+1)$-th singular value of $M$. SLA requires $O(k(m+n))$ memory space, and if $\delta \ge \frac{\log^4(m)}{m}$ and $k \le \frac{m}{\log^6(m)}$, its running time is $O(\delta k m n)$. To ensure the asymptotic accuracy of SLA, the upper bound in (1) needs to converge to 0, which is true as soon as $k\, s_{k+1}(M) = o(\sqrt{mn})$. In the case where $M$ is seen as a noisy version of $M^{(k)}$, this condition quantifies the maximum amount of noise allowed for our algorithm to be asymptotically accurate.

SLA is memory optimal, since any rank $k$ approximation algorithm needs to at least store its output, i.e., $k$ right and left singular vectors, and hence needs at least $O(k(m+n))$ memory space. Further observe that, among the class of algorithms sampling each entry of $M$ at a given rate $\delta$, SLA is computationally optimal, since it runs in $O(\delta k m n)$ time (it does a constant number of operations per observed entry if $k = O(1)$). In turn, to the best of our knowledge, SLA is both faster and more memory efficient than existing algorithms. SLA is the first memory optimal and asymptotically accurate low-rank approximation algorithm. The approach used to design SLA can be readily extended to devise memory and computationally efficient matrix completion algorithms. We present this extension in the supplementary material.

Notations. Throughout the paper, we use the following notations. For any $m \times n$ matrix $A$, we denote by $A^\top$ its transpose, and by $A^{-1}$ its pseudo-inverse. We denote by $s_1(A) \ge \dots \ge s_n(A) \ge 0$ the singular values of $A$. When matrices $A$ and $B$ have the same number of rows, we use $[A, B]$ to denote the matrix whose first columns are those of $A$, followed by those of $B$. $A_\perp$ denotes an orthonormal basis of the subspace perpendicular to the linear span of the columns of $A$. $A_j$, $A_i$, and $A_{ij}$ denote the $j$-th column of $A$, the $i$-th row of $A$, and the entry of $A$ on the $i$-th line and $j$-th column, respectively. For $h \le l$, $A_{h:l}$ (resp. $A^{h:l}$) is the matrix obtained by extracting the columns (resp. lines) $h, \dots, l$ of $A$. For any ordered set $B = \{b_1, \dots, b_p\} \subset \{1, \dots, n\}$, $A^{(B)}$ refers to the matrix composed of the ordered set $B$ of columns of $A$ ($A_{(B)}$ is defined similarly, but for lines). For real numbers $a \le b$, we define $[A]_a^b$ as the matrix with $(i,j)$ entry equal to $([A]_a^b)_{ij} = \min(b, \max(a, A_{ij}))$. Finally, for any vector $v$, $\|v\|$ denotes its Euclidean norm, whereas for any matrix $A$, $\|A\|_F$ denotes its Frobenius norm, $\|A\|_2$ its operator norm, and $\|A\|_\infty$ its $\ell_\infty$-norm, i.e., $\|A\|_\infty = \max_{i,j} |A_{ij}|$.

2 Related Work

Low-rank approximation algorithms have received a lot of attention over the last decade. There are two types of error estimates for these algorithms: either the error is additive or relative. To translate our bound (1) into an additive error is easy:

$$\|M - \hat{M}^{(k)}\|_F \le \|M - M^{(k)}\|_F + O\!\left( k \left( s_{k+1}(M) + \frac{\log^{1/2}(m)}{(\delta m)^{1/4}} \sqrt{mn} \right) \right). \qquad (2)$$
Sparsifying $M$ to speed up the computation of a low-rank approximation has been proposed in the literature, and the best additive error bounds have been obtained in [AM07]. When the sampling rate $\delta$ satisfies $\delta \ge \frac{\log^4 m}{m}$, the authors show that with probability $1 - \exp(-\log^4 m)$,

$$\|M - \hat{M}^{(k)}\|_F \le \|M - M^{(k)}\|_F + O\!\left( \frac{k^{1/2} n^{1/2}}{\delta^{1/2}} + \frac{k^{1/4} n^{1/4}}{\delta^{1/4}} \|M^{(k)}\|_F^{1/2} \right). \qquad (3)$$

This performance guarantee is derived from Lemma 1.1 and Theorem 1.4 in [AM07]. To compare (2) and (3), note that our assumption of bounded entries of $M$ ensures that $s_{k+1}^2(M) \le \frac{mn}{1+k}$ and $\|M^{(k)}\|_F \le \|M\|_F \le \sqrt{mn}$. In particular, we see that the worst case bound for (3) is $\frac{k^{1/2} n^{1/2}}{\delta^{1/2}} + \frac{k^{1/4} m^{1/4} n^{1/2}}{\delta^{1/4}}$, which is always lower than the worst case bound for (2): $(kmn)^{1/2} + k (mn)^{1/2} \big(\frac{\log m}{\delta m}\big)^{1/4}$. When $k = O(1)$, our bound is only larger by a logarithmic term in $m$ compared to [AM07]. However, the algorithm proposed in [AM07] requires storing $O(\delta mn)$ entries of $M$, whereas SLA needs $O(k(m+n))$ memory space. Recall that $\frac{\log^4 m}{m} \le \delta \le 8/9$, so that our algorithm makes a significant improvement on the memory requirement at a low price in the error guarantee bounds. Although biased sampling algorithms can reduce the error, these algorithms have to compute leverage scores with multiple passes over the data [BJS15]. In a recent work, [CW13] proposes a time-efficient algorithm to compute a low-rank approximation of a sparse matrix. Combined with [AM07], we obtain an algorithm running in time $O(\delta m n) + O(n k^2 + k^3)$, but with an increased additive error term.

We can also compare our result to papers providing an estimate $\tilde{M}^{(k)}$ of the optimal low-rank approximation of $M$ with a relative error $\varepsilon$, i.e., such that $\|M - \tilde{M}^{(k)}\|_F \le (1 + \varepsilon) \|M - M^{(k)}\|_F$. To the best of our knowledge, [CW09] provides the best result in this setting. Theorem 4.4 in [CW09] shows that, provided the rank of $M$ is at least $2(k+1)$, their algorithm outputs with probability $1 - \eta$ a rank-$k$ matrix $\tilde{M}^{(k)}$ with relative error $\varepsilon$ using memory space $O\big(\frac{k}{\varepsilon} \log(1/\eta)(n+m)\big)$ (note that in [CW09] the authors use a bit as the unit of memory whereas we use an entry of the matrix, so we removed a $\log n$ factor in their expression to make fair comparisons). To compare with our result, we can translate our bound (1) into a relative error, and we need to take:

$$\varepsilon = O\!\left( k\, \frac{s_{k+1}(M) + \frac{\log^{1/2}(m)}{(\delta m)^{1/4}} \sqrt{mn}}{\|M - M^{(k)}\|_F} \right).$$

First note that since $M$ is assumed to be of rank at least $2(k+1)$, we have $\|M - M^{(k)}\|_F \ge s_{k+1}(M) > 0$ and $\varepsilon$ is well defined. Clearly, for our $\varepsilon$ to tend to zero, we need $\|M - M^{(k)}\|_F$ to be not too small. For the scenario we have in mind, $M$ is a noisy version of the signal $M^{(k)}$, so that $M - M^{(k)}$ is the noise matrix. When every entry of $M - M^{(k)}$ is generated independently at random with a constant variance, $\|M - M^{(k)}\|_F = \Theta(\sqrt{mn})$ while $s_{k+1}(M) = \Theta(\sqrt{m} + \sqrt{n})$. In such a case, we have $\varepsilon = o(1)$ and we improve the memory requirement of [CW09] by a factor $\varepsilon^{-1} \log\big(\frac{\delta m}{k}\big)$. [CW09] also considers a model where the full columns of $M$ are revealed one after the other in an arbitrary order, and proposes a one-pass algorithm to derive the rank-$k$ approximation of $M$ with the same memory requirement. In this general setting, our algorithm is required to make two passes on the data (and only one pass if the order of arrival of the columns is random instead of arbitrary). The running time of their algorithm scales as $O\big(k m n\, \varepsilon^{-1} \log\big(\frac{\delta m}{k}\big)\big)$, needed to project $M$ onto a $k \varepsilon^{-1} \log\big(\frac{\delta m}{k}\big)$-dimensional random space. Thus, SLA improves the running time again by a factor of $\varepsilon^{-1} \log\big(\frac{\delta m}{k}\big)$.

We could also think of using sketching and streaming PCA algorithms to estimate $M^{(k)}$. When the columns arrive sequentially, these algorithms identify the left singular vectors using one pass on the matrix, and then need a second pass on the data to estimate the right singular vectors. For example, [Lib13] proposes a sketching algorithm that updates the $p$ most frequent directions as columns are observed.
[GP14] shows that with $O(mk/\varepsilon)$ memory space (for $p = k/\varepsilon$), this sketching algorithm finds an $m \times k$ matrix $\hat{U}$ such that $\|M - P_{\hat{U}} M\|_F \le (1+\varepsilon) \|M - M^{(k)}\|_F$, where $P_{\hat{U}}$ denotes the projection matrix onto the linear span of the columns of $\hat{U}$. The running time of the algorithm is roughly $O(k m n \varepsilon^{-1})$, which is much greater than that of SLA. Note also that, to identify such a matrix $\hat{U}$ in one pass on $M$, it is shown in [Woo14] that we have to use $\Omega(mk/\varepsilon)$ memory space. This result does not contradict the performance analysis of SLA, since the latter needs two passes on $M$ if the columns of $M$ are observed in an arbitrary manner. Finally, note that the streaming PCA algorithm proposed in [MCJ13] does not apply to our problem, as that paper investigates a very specific setting: the spiked covariance model, where columns are randomly generated in an i.i.d. manner.

3 Streaming Low-rank Approximation Algorithm

Algorithm 1 Streaming Low-rank Approximation (SLA)
Input: $M$, $k$, $\delta$, and $l = \frac{1}{\delta \log(m)}$
1. $A^{(B_1)}, A^{(B_2)} \leftarrow$ independently sample entries of $[M_1, \dots, M_l]$ at rate $\delta$
2. PCA for the first $l$ columns: $Q \leftarrow \mathrm{SPCA}(A^{(B_1)}, k)$
3. Trimming the rows and columns of $A^{(B_2)}$:
   $A^{(B_2)} \leftarrow$ set the entries of the rows of $A^{(B_2)}$ having more than two non-zero entries to 0
   $A^{(B_2)} \leftarrow$ set the entries of the columns of $A^{(B_2)}$ having more than $10\delta m$ non-zero entries to 0
4. $W \leftarrow A^{(B_2)} Q$
5. $\hat{V}^{(B_1)} \leftarrow (A^{(B_1)})^\top W$
6. $\hat{I} \leftarrow A^{(B_1)} \hat{V}^{(B_1)}$
   Remove $A^{(B_1)}$, $A^{(B_2)}$, and $Q$ from the memory space
for $t = l+1$ to $n$ do
7. $A_t \leftarrow$ sample entries of $M_t$ at rate $\delta$
8. $\hat{V}_t \leftarrow (A_t)^\top W$
9. $\hat{I} \leftarrow \hat{I} + A_t \hat{V}_t$
   Remove $A_t$ from the memory space
end for
10. $\hat{R} \leftarrow$ find $\hat{R}$ using the Gram-Schmidt process such that $\hat{V}\hat{R}$ is an orthonormal matrix
11. $\hat{U} \leftarrow \frac{1}{\delta} \hat{I} \hat{R} \hat{R}^\top$
Output: $\hat{M}^{(k)} = [\hat{U} \hat{V}^\top]_0^1$

Algorithm 2 Spectral PCA (SPCA)
Input: $C \in [0,1]^{m \times l}$, $k$
$\Omega \leftarrow l \times k$ Gaussian random matrix
Trimming: $\bar{C} \leftarrow$ set the entries of the rows of $C$ with more than 10 non-zero entries to 0
$\Phi \leftarrow \bar{C}^\top \bar{C} - \mathrm{diag}(\bar{C}^\top \bar{C})$
Power Iteration: $QR \leftarrow$ QR decomposition of $\Phi^{\lceil 5 \log(l) \rceil} \Omega$
Output: $Q$

In this section, we present the Streaming Low-rank Approximation (SLA) algorithm and analyze its performance. SLA makes one pass on the matrix $M$, and is provided with the columns of $M$ one after the other in a streaming manner. The SVD of $M$ is $M = U \Sigma V^\top$, where $U$ and $V$ are $(m \times m)$ and $(n \times n)$ unitary matrices and $\Sigma$ is the $(m \times n)$ matrix $\mathrm{diag}(s_1(M), \dots, s_n(M))$. We assume (or impose by design of SLA) that the $l$ (specified below) first observed columns of $M$ are chosen uniformly at random among all columns. An extension of SLA to scenarios where columns are observed in an arbitrary order is presented in Section 3.5, but this extension requires two passes on $M$. To be memory efficient, SLA uses sampling. Each observed entry of $M$ is erased (i.e., set equal to 0) with probability $1 - \delta$, where $\delta > 0$ is referred to as the sampling rate. The algorithm, whose pseudo-code is presented in Algorithm 1, proceeds in three steps:

1. In the first step, we observe $l = \frac{1}{\delta \log(m)}$ columns of $M$ chosen uniformly at random. These columns form the matrix $M^{(B)} = U \Sigma (V^{(B)})^\top$, where $B$ denotes the ordered set of the indexes of the $l$ first observed columns. $M^{(B)}$ is sampled at rate $\delta$. More precisely, we apply two independent sampling procedures, where in each of them every entry of $M^{(B)}$ is sampled at rate $\delta$. The two resulting independent random matrices $A^{(B_1)}$ and $A^{(B_2)}$ are stored in memory. $A^{(B_1)}$, referred to as $A^{(B)}$ to simplify the notations, is used in this first step, whereas $A^{(B_2)}$ will be used in subsequent steps. Next, through a spectral decomposition of $A^{(B)}$, we derive an $(l \times k)$ orthonormal matrix $Q$ such that the span of its column vectors approximates that of the column vectors of $V^{(B)}$. The first step corresponds to Lines 1 and 2 in the pseudo-code of SLA.

2. In the second step, we complete the construction of our estimator of the top $k$ right singular vectors $V$ of $M$. Denote by $\hat{V}$ the $(n \times k)$ matrix formed by these estimated vectors. We first compute the components of these vectors corresponding to the set of indexes $B$ as $\hat{V}^{(B)} = (A^{(B_1)})^\top W$ with $W = A^{(B_2)} Q$. Then for $t = l+1, \dots, n$, after receiving the $t$-th column $M_t$ of $M$, we set $\hat{V}_t = A_t^\top W$, where $A_t$ is obtained by sampling entries of $M_t$ at rate $\delta$. Hence after one pass on $M$, we get $\hat{V} = \tilde{A}^\top W$, where $\tilde{A} = [A^{(B_1)}, A_{l+1}, \dots, A_n]$. As it turns out, multiplying $W$ by $\tilde{A}^\top$ amplifies the useful signal contained in $W$, and yields an accurate approximation of the span of the top $k$ right singular vectors $V$ of $M$. The second step is presented in Lines 3, 4, 5, 7 and 8 in the SLA pseudo-code.

3. In the last step, we deduce from $\hat{V}$ a set of column vectors gathered in a matrix $\hat{U}$ such that $\hat{U}\hat{V}^\top$ provides an accurate approximation of $M^{(k)}$. First, using the Gram-Schmidt process, we find $\hat{R}$ such that $\hat{V}\hat{R}$ is an orthonormal matrix, and compute $\hat{U} = \frac{1}{\delta} \tilde{A} \hat{V} \hat{R} \hat{R}^\top$ in a streaming manner as in Step 2. Then, $\hat{U}\hat{V}^\top = \frac{1}{\delta} \tilde{A}\, \hat{V}\hat{R}(\hat{V}\hat{R})^\top$, where $\hat{V}\hat{R}(\hat{V}\hat{R})^\top$ approximates the projection matrix onto the linear span of the top $k$ right singular vectors of $M$. Thus, $\hat{U}\hat{V}^\top$ is close to $M^{(k)}$. This last step is described in Lines 6, 9, 10 and 11 in the SLA pseudo-code.

In the next subsections, we present in more detail the rationale behind the three steps of SLA, and provide a performance analysis of the algorithm.
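As a concrete illustration of the first step, here is a minimal NumPy sketch of the SPCA subroutine (Algorithm 2). It follows the pseudo-code above (row trimming, removal of the diagonal of the covariance matrix, roughly $5\log l$ power iterations); function and variable names are ours, and dense arrays are used for readability even though an actual implementation would exploit the sparsity of the sampled matrix.

```python
import numpy as np

def spca(C, k, row_limit=10, seed=0):
    """Spectral PCA on a sampled m x l batch C, in the spirit of Algorithm 2."""
    m, l = C.shape
    C_bar = C.copy()
    # Trimming: zero out rows with more than `row_limit` observed (non-zero) entries.
    heavy_rows = (C_bar != 0).sum(axis=1) > row_limit
    C_bar[heavy_rows, :] = 0.0
    # Covariance-like matrix with its diagonal removed: the diagonal scales as
    # delta*m while off-diagonal entries scale as delta^2*m, so it would dominate.
    Phi = C_bar.T @ C_bar
    np.fill_diagonal(Phi, 0.0)
    # Power (subspace) iteration against an l x k Gaussian test matrix; re-orthonormalizing
    # at every step is numerically safer than forming Phi^(5 log l) explicitly.
    Q = np.random.default_rng(seed).standard_normal((l, k))
    for _ in range(int(np.ceil(5 * np.log(max(l, 2))))):
        Q, _ = np.linalg.qr(Phi @ Q)
    return Q  # l x k orthonormal matrix whose span estimates that of V^(B)
```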

3.1 Step 1: Estimating right singular vectors of the first batch of columns

The objective of the first step is to estimate $V^{(B)}$, i.e., those components of the top $k$ right singular vectors of $M$ whose indexes are in the set $B$ (remember that $B$ is the set of indexes of the $l$ first observed columns). This estimator, denoted by $Q$, is obtained by applying the power method to extract the top $k$ right singular vectors of $M^{(B)}$, as described in Algorithm 2. In the design of this algorithm and its performance analysis, we face two challenges: (i) we only have access to a sampled version $A^{(B)}$ of $M^{(B)}$; and (ii) $U\Sigma(V^{(B)})^\top$ is not the SVD of $M^{(B)}$, since the column vectors of $V^{(B)}$ are not orthonormal in general (we keep the components of these vectors corresponding to the set of indexes $B$). Hence, the top $k$ right singular vectors of $M^{(B)}$ that we extract in Algorithm 2 do not necessarily correspond to $V^{(B)}$.

To address (i), in Algorithm 2, we do not directly extract the top $k$ right singular vectors of $A^{(B)}$. We first remove the rows of $A^{(B)}$ with too many non-zero entries (i.e., too many observed entries from $M^{(B)}$), since these rows would perturb the SVD of $A^{(B)}$. Let us denote by $\bar{A}$ the obtained trimmed matrix. We then form the covariance matrix $\bar{A}^\top \bar{A}$, and remove its diagonal entries to obtain the matrix $\Phi = \bar{A}^\top \bar{A} - \mathrm{diag}(\bar{A}^\top \bar{A})$. Removing the diagonal entries is needed because of the sampling procedure. Indeed, the diagonal entries of $\bar{A}^\top \bar{A}$ scale as $\delta m$, whereas its off-diagonal entries scale as $\delta^2 m$. Hence, when $\delta$ is small, the diagonal entries would clearly become dominant in the spectral decomposition. We finally apply the power method to $\Phi$ to obtain $Q$.

In the analysis of the performance of Algorithm 2, the following lemma will be instrumental; it provides an upper bound on the gap between $\Phi$ and $\delta^2 (M^{(B)})^\top M^{(B)}$ using the matrix Bernstein inequality (Theorem 6.1 in [Tro12]). All proofs are detailed in the Appendix.

Lemma 1 If $\delta \le \frac{8}{9}$, with probability $1 - \frac{1}{l^2}$, $\|\Phi - \delta^2 (M^{(B)})^\top M^{(B)}\|_2 \le c_1 \delta \sqrt{m l \log(l)}$, for some constant $c_1 > 1$.

To address (ii), we first establish in Lemma 2 that, for an appropriate choice of $l$, the column vectors of $V^{(B)}$ are approximately orthonormal. This lemma is of independent interest, and relates the SVD of a truncated matrix, here $M^{(B)}$, to that of the initial matrix $M$. More precisely:

Lemma 2 If $\delta \le 8/9$, there exists an $l \times k$ matrix $\bar{V}^{(B)}$ such that its column vectors are orthonormal, and with probability $1 - \exp(-m^{1/7})$, for all $i \le k$ satisfying $s_i^2(M) \ge \frac{n}{\delta l}\sqrt{m l \log(l)}$, $\bar{V}^{(B)}_{1:i}$ is close to $\sqrt{\tfrac{n}{l}}\, V^{(B)}_{1:i}$ (the explicit bound is stated in the Appendix).

Note that, as suggested by the above lemma, it might be impossible to recover $V^{(B)}_i$ when the corresponding singular value $s_i(M)$ is small (more precisely, when $s_i^2(M) \le \frac{n}{\delta l}\sqrt{m l \log(l)}$).
However, the singular vectors corresponding to such small singular values generate very little error for low-rank approximation. Thus, we are only interested in singular vectors whose singular values are above the threshold $\big( \frac{n}{\delta l}\sqrt{m l \log(l)} \big)^{1/2}$. Let $k' = \max\{ i : s_i^2(M) \ge \frac{n}{\delta l}\sqrt{m l \log(l)},\ i \le k \}$. Now, to analyze the performance of Algorithm 2 when applied to $A^{(B)}$, we decompose $\Phi$ as $\Phi = \delta^2 \frac{l}{n} \bar{V}^{(B)} \Sigma^2 (\bar{V}^{(B)})^\top + Y$, where $Y = \Phi - \delta^2 \frac{l}{n} \bar{V}^{(B)} \Sigma^2 (\bar{V}^{(B)})^\top$ is a noise matrix.

The following lemma quantifies how noise may affect the performance of the power method, i.e., it provides an upper bound on the gap between $Q$ and $\bar{V}^{(B)}$ as a function of the operator norm of the noise matrix $Y$:

Lemma 3 With probability $1 - \frac{1}{l^2}$, the output $Q$ of SPCA when applied to $A^{(B)}$ satisfies, for all $i \le k'$: $\big\| (\bar{V}^{(B)}_{1:i})^\top Q_\perp \big\|_2 \le \frac{3 \|Y\|_2}{\delta^2 \frac{l}{n} s_i(M)^2}$.

In the proof, we analyze the power iteration algorithm building on results from [HMT11]. To complete the performance analysis of Algorithm 2, it remains to upper bound $\|Y\|_2$. To this aim, we decompose $Y$ into three terms:
$$Y = \big( \Phi - \delta^2 (M^{(B)})^\top M^{(B)} \big) + \delta^2 (M^{(B)})^\top (I - U_{1:k} U_{1:k}^\top) M^{(B)} + \big( \delta^2 (M^{(B)})^\top U_{1:k} U_{1:k}^\top M^{(B)} - \delta^2 \tfrac{l}{n} \bar{V}^{(B)} \Sigma^2 (\bar{V}^{(B)})^\top \big).$$
The first term can be controlled using Lemma 1, and the last term is upper bounded using Lemma 2. Finally, the second term corresponds to the error made by ignoring the singular vectors which are not within the top $k$. To estimate this term, we use the matrix Chernoff bound (Theorem 2.2 in [Tro11]), and prove that:

Lemma 4 With probability $1 - \exp(-m^{1/4})$, $\big\| (I - U_{1:k} U_{1:k}^\top) M^{(B)} \big\|_2^2 \le \frac{1}{\delta}\sqrt{m l \log(l)} + \frac{l}{n} s_{k+1}^2(M)$.

In summary, combining the four above lemmas, we can establish that $Q$ accurately estimates $\bar{V}^{(B)}$:

Theorem 5 If $\delta \le 8/9$, with probability $1 - \frac{3}{l^2}$, the output $Q$ of Algorithm 2 when applied to $A^{(B)}$ satisfies, for all $i \le k'$:
$$\big\| (\bar{V}^{(B)}_{1:i})^\top Q_\perp \big\|_2 \le \frac{3 \delta^2 \frac{l}{n} \big( s_{k+1}^2(M) + 2\sqrt{mn} \big) + 3(2 + c_1)\,\delta \sqrt{m l \log(l)}}{\delta^2 \frac{l}{n} s_i^2(M)},$$
where $c_1$ is the constant from Lemma 1.

3.2 Step 2: Estimating the principal right singular vectors of M

In this step, we aim at estimating the top $k$ right singular vectors $V$, or at least at producing $k$ vectors whose linear span approximates that of $V$. Towards this objective, we start from $Q$ derived in the previous step, and define the $(m \times k)$ matrix $W = A^{(B_2)} Q$. $W$ is stored and kept in memory for the remainder of the algorithm. It is tempting to directly read from $W$ the top $k$ left singular vectors $U$. Indeed, we know that $Q \approx \sqrt{\tfrac{n}{l}}\, V^{(B)}$ and $\mathbb{E}[A^{(B_2)}] = \delta U \Sigma (V^{(B)})^\top$, and hence $\mathbb{E}[W] \approx \delta \sqrt{\tfrac{l}{n}}\, U \Sigma$. However, the level of the noise in $W$ is too high to accurately extract $U$. Indeed, $W$ can be written as $\delta U \Sigma (V^{(B)})^\top Q + Z$, where $Z = (A^{(B_2)} - \delta U \Sigma (V^{(B)})^\top) Q$ partly captures the noise in $W$. It is then easy to see that the level of the noise $Z$ satisfies $\mathbb{E}[\|Z\|_2] \ge \mathbb{E}[\|Z\|_F / \sqrt{k}] = \Omega(\sqrt{\delta m})$. Indeed, first observe that $Z$ is of rank $k$. Then $\mathbb{E}[\|Z\|_F^2] = \sum_{i=1}^m \sum_{j=1}^k \mathbb{E}[Z_{ij}^2] \approx k \delta m$: this is due to the facts that (i) $Q$ and $A^{(B_2)} - \delta U \Sigma (V^{(B)})^\top$ are independent (since $A^{(B_1)}$ and $A^{(B_2)}$ are independent), (ii) $\|Q_j\|_2^2 = 1$ for all $j \le k$, and (iii) the entries of $A^{(B_2)}$ are independent with variance $\Theta(\delta(1-\delta))$. However, for all $j \le k$, the $j$-th singular value of $\delta U \Sigma (V^{(B)})^\top Q$ scales as $O(\delta \sqrt{m l}) = O\big(\sqrt{\tfrac{\delta m}{\log(m)}}\big)$, since $s_j(M) \le \sqrt{mn}$ and $s_j(M^{(B)}) \approx \sqrt{\tfrac{l}{n}}\, s_j(M)$ for $j \le k'$ from Lemma 2.

Instead, from $W$, $A^{(B_1)}$ and the subsequently sampled arriving columns $A_t$, $t > l$, we produce an $(n \times k)$ matrix $\hat{V}$ whose linear span approximates that of $V$. More precisely, we first let $\hat{V}^{(B)} = (A^{(B_1)})^\top W$. Then for all $t = l+1, \dots, n$, we define $\hat{V}_t = A_t^\top W$, where $A_t$ is obtained from the $t$-th observed column of $M$ after sampling each of its entries at rate $\delta$. Multiplying $W$ by $\tilde{A}^\top = [A^{(B_1)}, A_{l+1}, \dots, A_n]^\top$ amplifies the useful signal in $W$, so that $\hat{V} = \tilde{A}^\top W$ constitutes a good approximation of $V$.
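A minimal NumPy sketch of this second step (together with the accumulation of $\hat{I}$ used later in Step 3), processing the remaining columns one at a time. The helper names are ours; `Q` is assumed to be the output of SPCA on the first batch, and `A_B1`, `A_B2` the two independent samples of that batch.

```python
import numpy as np

def stream_step2(columns, A_B1, A_B2, Q, delta, rng):
    """Step 2 of SLA (sketch): V_hat = A_tilde^T W and I_hat = A_tilde V_hat.

    `columns` iterates over the remaining columns M_{l+1}, ..., M_n (1-D arrays),
    A_B1 / A_B2 are the two independent m x l samples of the first batch, and
    Q is the l x k output of SPCA.
    """
    W = A_B2 @ Q                       # m x k, kept in memory for the whole pass
    V_rows = [A_B1.T @ W]              # rows of V_hat indexed by the first batch B
    I_hat = A_B1 @ V_rows[0]           # m x k accumulator; equals A_tilde @ V_hat at the end
    for M_t in columns:
        # sample each entry of the incoming column with probability delta
        A_t = np.where(rng.random(M_t.shape) < delta, M_t, 0.0)
        v_t = A_t @ W                  # the row of V_hat associated with column t
        V_rows.append(v_t[None, :])
        I_hat += np.outer(A_t, v_t)    # rank-one update; the column is then discarded
    V_hat = np.vstack(V_rows)          # n x k
    return V_hat, I_hat
```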

To understand why, we can rewrite $\hat{V}$ as follows:
$$\hat{V} = \delta^2 M^\top M^{(B)} Q + \delta M^\top (A^{(B_2)} - \delta M^{(B)}) Q + (\tilde{A} - \delta M)^\top W.$$
In the above equation, the first term corresponds to the useful signal and the two remaining terms constitute noise matrices. From Theorem 5, the linear span of the columns of $Q$ approximates that of the columns of $V^{(B)}$, and thus, for $j \le k'$, $s_j(\delta^2 M^\top M^{(B)} Q) \ge \delta^2 s_j^2(M) \sqrt{\tfrac{l}{n}} - \delta\sqrt{mn\log(l)}$. The spectral norms of the noise matrices are bounded using random matrix arguments, and the fact that $(A^{(B_2)} - \delta M^{(B)})$ and $(\tilde{A} - \delta M)$ are zero-mean random matrices with independent entries. We can show (see Lemma 14 given in the supplementary material), using the independence of $A^{(B_1)}$ and $A^{(B_2)}$, that with high probability $\|\delta M^\top (A^{(B_2)} - \delta M^{(B)}) Q\|_2 = O(\delta \sqrt{mn})$. We may also establish that with high probability $\|(\tilde{A} - \delta M)^\top W\|_2 = O(\delta \sqrt{m(m+n)})$. This is a consequence of a result derived in [AM07] (quoted in Lemma 13 in the supplementary material) stating that with high probability $\|\tilde{A} - \delta M\|_2 = O(\sqrt{\delta(m+n)})$, and of the fact that, due to the trimming process presented in Line 3 of Algorithm 1, $\|W\|_2 = O(\sqrt{\delta m})$. In summary, as soon as $n$ scales at least as $m$, the noise level becomes negligible, and the span of $\hat{V}$ provides an accurate approximation of that of $V$. The above arguments are made precise and rigorous in the supplementary material. The following theorem summarizes the accuracy of our estimator of $V$.

Theorem 6 With $\frac{\log^4(m)}{m} \le \delta \le \frac{8}{9}$, for all $i \le k$, there exists a constant $c_2$ such that with probability $1 - \frac{k}{\delta m}$,
$$\big\| V_i^\top \hat{V}_\perp \big\|^2 \le c_2\, \frac{s_{k+1}^2(M) + n\sqrt{m \log(m)/\delta} + \sqrt{mn}\,\log(m)/\delta}{s_i^2(M)}.$$

3.3 Step 3: Estimating the principal left singular vectors of M

In the last step, we estimate the principal left singular vectors of $M$ to finally derive an estimator of $M^{(k)}$, the optimal rank-$k$ approximation of $M$. The construction of this estimator is based on the observation that $M^{(k)} = M P_V$, where $P_V = V_{1:k} V_{1:k}^\top$ is an $(n \times n)$ matrix representing the projection onto the linear span of the top $k$ right singular vectors of $M$. Hence, to estimate $M^{(k)}$, we try to approximate the matrix $P_V$. To this aim, we construct a $(k \times k)$ matrix $\hat{R}$ so that the column vectors of $\hat{V}\hat{R}$ form an orthonormal basis whose span corresponds to that of the column vectors of $\hat{V}$. This construction is achieved using the Gram-Schmidt process. We then approximate $P_V$ by $P_{\hat{V}} = \hat{V}\hat{R}\hat{R}^\top\hat{V}^\top$, and finally our estimator $\hat{M}^{(k)}$ of $M^{(k)}$ is $\frac{1}{\delta}\tilde{A} P_{\hat{V}}$ (clipped entry-wise to $[0,1]$).

The construction of $\hat{M}^{(k)}$ can be made in a memory-efficient way, accommodating our streaming model where the columns of $M$ arrive one after the other, as described in the pseudo-code of SLA. First, after constructing $\hat{V}^{(B)}$ in Step 2, we build the matrix $\hat{I} = A^{(B_1)} \hat{V}^{(B)}$. Then, for $t = l+1, \dots, n$, after constructing the $t$-th line $\hat{V}_t$ of $\hat{V}$, we update $\hat{I}$ by adding to it the matrix $A_t \hat{V}_t$, so that after all columns of $M$ have been observed, $\hat{I} = \tilde{A}\hat{V}$. Hence we can build an estimator $\hat{U}$ of the principal left singular vectors of $M$ as $\hat{U} = \frac{1}{\delta}\hat{I}\hat{R}\hat{R}^\top$, and finally obtain $\hat{M}^{(k)} = [\hat{U}\hat{V}^\top]_0^1$.

To quantify the estimation error of $\hat{M}^{(k)}$, we decompose $M^{(k)} - \hat{M}^{(k)}$ as:
$$M^{(k)} - \hat{M}^{(k)} = M^{(k)}(I - P_{\hat{V}}) + (M^{(k)} - M)P_{\hat{V}} + (M - \tfrac{1}{\delta}\tilde{A})P_{\hat{V}}.$$
The first term of the r.h.s. of the above equation can be bounded using Theorem 6: for $i \le k$, we have $s_i(M)^2 \|V_i^\top \hat{V}_\perp\|^2 \le z := c_2\big( s_{k+1}^2(M) + n\sqrt{m\log(m)/\delta} + \sqrt{mn}\,\log(m)/\delta \big)$, and hence we can conclude that for all $i \le k$, $\| s_i(M) U_i V_i^\top (I - P_{\hat{V}}) \|_2^2 \le z$. The second term can be easily bounded by observing that the matrix $(M^{(k)} - M)P_{\hat{V}}$ is of rank $k$: $\|(M^{(k)} - M)P_{\hat{V}}\|_F^2 \le k \|(M^{(k)} - M)P_{\hat{V}}\|_2^2 \le k \|M^{(k)} - M\|_2^2 = k\, s_{k+1}(M)^2$. The last term of the r.h.s. can be controlled as in the performance analysis of Step 2, observing that $(\frac{1}{\delta}\tilde{A} - M)P_{\hat{V}}$ is of rank $k$: $\|(\frac{1}{\delta}\tilde{A} - M)P_{\hat{V}}\|_F^2 \le k \|\frac{1}{\delta}\tilde{A} - M\|_2^2 = O\big(k\,\frac{m+n}{\delta}\big)$.
It is then easy to remark that, for the range of the parameter $\delta$ we are interested in, the upper bound $z$ of the first term dominates the upper bounds of the two other terms. Finally, we obtain the following result (see the supplementary material for a complete proof):

Theorem 7 When $\frac{\log^4(m)}{m} \le \delta \le \frac{8}{9}$, with probability $1 - \frac{k}{\delta m}$, the output of the SLA algorithm satisfies, for some constant $c_3$:
$$\frac{\big\| M^{(k)} - [\hat{U}\hat{V}^\top]_0^1 \big\|_F^2}{mn} \le c_3\, k^2 \left( \frac{s_{k+1}^2(M)}{mn} + \sqrt{\frac{\log(m)}{\delta m}} + \frac{\log(m)}{\delta\sqrt{mn}} \right).$$
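Concretely, Step 3 only involves a few dense operations on the small matrices $\hat{V}$ ($n \times k$) and $\hat{I}$ ($m \times k$) accumulated during the pass. A minimal NumPy sketch follows; it uses a QR factorization in place of an explicit Gram-Schmidt pass (both yield an $\hat{R}$ such that $\hat{V}\hat{R}$ is orthonormal), names are ours, and we assume $\hat{V}$ has full column rank $k$.

```python
import numpy as np

def step3_estimate(V_hat, I_hat, delta):
    """Step 3 of SLA (sketch): U_hat = (1/delta) I_hat R R^T, output clipped to [0,1]."""
    # Find R such that V_hat @ R has orthonormal columns (QR in place of Gram-Schmidt);
    # V_hat is assumed to have full column rank so that r is invertible.
    _, r = np.linalg.qr(V_hat)                   # V_hat = Q_v r
    R = np.linalg.inv(r)
    U_hat = (I_hat @ (R @ R.T)) / delta
    M_hat = np.clip(U_hat @ V_hat.T, 0.0, 1.0)   # the [.]_0^1 truncation of the output
    return M_hat
```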

Note that if $\frac{\log^4(m)}{m} \le \delta \le \frac{8}{9}$, then $\sqrt{\frac{\log(m)}{\delta m}} = o(1)$. Hence, if $m \le n$, the SLA algorithm provides an asymptotically accurate estimate of $M^{(k)}$ as soon as $k^2 s_{k+1}(M)^2 = o(mn)$.

3.4 Required Memory and Running Time

Required memory. Lines 1-6 of the SLA pseudo-code: $A^{(B_1)}$ and $A^{(B_2)}$ have $O(\delta m l)$ non-zero entries, and we need $O(\delta m l \log m)$ bits to store the identities of these entries. Similarly, the memory required to store $\Phi$ is $O(\delta^2 m l^2 \log(l))$. Storing $Q$ further requires $O(lk)$ memory. Finally, $\hat{V}^{(B_1)}$ and $\hat{I}$ computed in Line 6 require $O(lk)$ and $O(km)$ memory space, respectively. Thus, when $l = \frac{1}{\delta \log m}$, this first part of the algorithm requires $O(k(m+n))$ memory. Lines 7-9: before we treat the remaining columns, $A^{(B_1)}$, $A^{(B_2)}$, and $Q$ are removed from memory. Using this released memory, when the $t$-th column arrives, we can store it, compute $\hat{V}_t$ and $\hat{I}$, and remove the column to save memory. Therefore, we do not need additional memory to treat the remaining columns. Lines 10 and 11: from $\hat{I}$ and $\hat{V}$, we compute $\hat{U}$. To this aim, the memory required is $O(k(m+n))$.

Running time. Lines 1 to 6: the SPCA algorithm requires $O(lk(\delta^2 m l + k)\log(l))$ floating-point operations to compute $Q$. $W$, $\hat{V}^{(B_1)}$, and $\hat{I}$ are inner products, and their computation requires $O(\delta k m l)$ operations. With $l = \frac{1}{\delta\log(m)}$, the number of operations to treat the first $l$ columns is $O(lk(\delta^2 m l + k)\log(l) + k\delta m l) = O(km) + O(\frac{k^2}{\delta})$. Lines 7 to 9: to compute $\hat{V}_t$ and $\hat{I}$ when the $t$-th column arrives, we need $O(\delta k m)$ operations. Since there are $n - l$ remaining columns, the total number of operations is $O(\delta k m n)$. Lines 10 and 11: $\hat{R}$ is computed from $\hat{V}$ using the Gram-Schmidt process, which requires $O(k^2 m)$ operations; we then compute $\hat{I}\hat{R}\hat{R}^\top$ using $O(k^2 m)$ operations. In summary, we have shown that:

Theorem 8 The memory required to run the SLA algorithm is $O(k(m+n))$. Its running time is $O(\delta k m n + \frac{k^2}{\delta} + k^2 m)$.

Observe that when $\delta \ge \max\big( \frac{(\log m)^4}{m}, \frac{(\log m)^2 m}{n} \big)$ and $k \le \frac{m}{(\log m)^6}$, we have $\delta k m n \ge \frac{k^2}{\delta}$ and $\delta k m n \ge k^2 m$, and therefore the running time of SLA is $O(\delta k m n)$.

3.5 General Streaming Model

SLA is a one-pass low-rank approximation algorithm, but the set of the $l$ first observed columns of $M$ needs to be chosen uniformly at random. We can readily extend SLA to deal with scenarios where the columns of $M$ are observed in an arbitrary order. This extension requires two passes on $M$, but otherwise performs exactly the same operations as SLA. In the first pass, we extract a set of $l$ columns chosen uniformly at random, and in the second pass, we deal with all other columns. To extract $l$ randomly selected columns in the first pass, we proceed as follows. Assume that, when the $t$-th column of $M$ arrives, we have already extracted $l'$ columns. Then the $t$-th column is extracted with probability $\frac{l - l'}{n - t + 1}$. This two-pass version of SLA enjoys the same performance guarantees as those of SLA.
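The only non-trivial ingredient of the first pass of this two-pass variant is the uniform selection of $l$ column indices from a stream of known length $n$. A small sketch of the selection rule above (names are ours):

```python
import numpy as np

def select_l_columns(n, l, rng=None):
    """Pick l column indices uniformly at random while scanning columns once."""
    rng = rng or np.random.default_rng()
    selected = []
    for t in range(n):                        # t is 0-indexed here
        remaining_slots = l - len(selected)
        remaining_cols = n - t                # equals n - t + 1 in 1-indexed terms
        # select column t with probability (l - #already selected) / (n - t + 1)
        if rng.random() < remaining_slots / remaining_cols:
            selected.append(t)
    return selected                           # always exactly l indices, uniform over subsets

print(select_l_columns(n=10, l=3))
```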

4 Conclusion

This paper revisited the low-rank approximation problem. We proposed a streaming algorithm that samples the data and produces a near-optimal solution with a vanishing mean-square error. The algorithm uses a memory space scaling linearly with the ambient dimension of the matrix, i.e., the memory required to store the output alone. Its running time scales as the number of sampled entries of the input matrix. The algorithm is relatively simple and, in particular, does not exploit elaborate techniques (such as sparse embedding techniques) recently developed to reduce the memory requirement and complexity of algorithms addressing various problems in linear algebra.

References

[AM07] Dimitris Achlioptas and Frank McSherry. Fast computation of low-rank matrix approximations. Journal of the ACM (JACM), 54(2):9, 2007.

[BJS15] Srinadh Bhojanapalli, Prateek Jain, and Sujay Sanghavi. Tighter low-rank approximation via sampling the leveraged element. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2015.

[CW09] Kenneth L. Clarkson and David P. Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing. ACM, 2009.

[CW13] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing. ACM, 2013.

[GP14] Mina Ghashami and Jeff M. Phillips. Relative errors for deterministic low-rank matrix approximations. In SODA. SIAM, 2014.

[HMT11] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011.

[Lib13] Edo Liberty. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013.

[MCJ13] Ioannis Mitliagkas, Constantine Caramanis, and Prateek Jain. Memory limited, streaming PCA. In Advances in Neural Information Processing Systems, 2013.

[Tro11] Joel A. Tropp. Improved analysis of the subsampled randomized Hadamard transform. Advances in Adaptive Data Analysis, 3(1-2), 2011.

[Tro12] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389-434, 2012.

[Woo14] David Woodruff. Low rank approximation lower bounds in row-update streams. In Advances in Neural Information Processing Systems, 2014.


More information

Boosting with log-loss

Boosting with log-loss Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the

More information

Reed-Muller Codes. m r inductive definition. Later, we shall explain how to construct Reed-Muller codes using the Kronecker product.

Reed-Muller Codes. m r inductive definition. Later, we shall explain how to construct Reed-Muller codes using the Kronecker product. Coding Theory Massoud Malek Reed-Muller Codes An iportant class of linear block codes rich in algebraic and geoetric structure is the class of Reed-Muller codes, which includes the Extended Haing code.

More information

Topic 5a Introduction to Curve Fitting & Linear Regression

Topic 5a Introduction to Curve Fitting & Linear Regression /7/08 Course Instructor Dr. Rayond C. Rup Oice: A 337 Phone: (95) 747 6958 E ail: rcrup@utep.edu opic 5a Introduction to Curve Fitting & Linear Regression EE 4386/530 Coputational ethods in EE Outline

More information

Using a De-Convolution Window for Operating Modal Analysis

Using a De-Convolution Window for Operating Modal Analysis Using a De-Convolution Window for Operating Modal Analysis Brian Schwarz Vibrant Technology, Inc. Scotts Valley, CA Mark Richardson Vibrant Technology, Inc. Scotts Valley, CA Abstract Operating Modal Analysis

More information

. The univariate situation. It is well-known for a long tie that denoinators of Pade approxiants can be considered as orthogonal polynoials with respe

. The univariate situation. It is well-known for a long tie that denoinators of Pade approxiants can be considered as orthogonal polynoials with respe PROPERTIES OF MULTIVARIATE HOMOGENEOUS ORTHOGONAL POLYNOMIALS Brahi Benouahane y Annie Cuyt? Keywords Abstract It is well-known that the denoinators of Pade approxiants can be considered as orthogonal

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory Proble sets 5 and 6 Due: Noveber th Please send your solutions to learning-subissions@ttic.edu Notations/Definitions Recall the definition of saple based Radeacher

More information

arxiv: v1 [cs.ds] 29 Jan 2012

arxiv: v1 [cs.ds] 29 Jan 2012 A parallel approxiation algorith for ixed packing covering seidefinite progras arxiv:1201.6090v1 [cs.ds] 29 Jan 2012 Rahul Jain National U. Singapore January 28, 2012 Abstract Penghui Yao National U. Singapore

More information

LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting

LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting LogLog-Beta and More: A New Algorith for Cardinality Estiation Based on LogLog Counting Jason Qin, Denys Ki, Yuei Tung The AOLP Core Data Service, AOL, 22000 AOL Way Dulles, VA 20163 E-ail: jasonqin@teaaolco

More information

Asynchronous Gossip Algorithms for Stochastic Optimization

Asynchronous Gossip Algorithms for Stochastic Optimization Asynchronous Gossip Algoriths for Stochastic Optiization S. Sundhar Ra ECE Dept. University of Illinois Urbana, IL 680 ssrini@illinois.edu A. Nedić IESE Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu

More information

Design of Spatially Coupled LDPC Codes over GF(q) for Windowed Decoding

Design of Spatially Coupled LDPC Codes over GF(q) for Windowed Decoding IEEE TRANSACTIONS ON INFORMATION THEORY (SUBMITTED PAPER) 1 Design of Spatially Coupled LDPC Codes over GF(q) for Windowed Decoding Lai Wei, Student Meber, IEEE, David G. M. Mitchell, Meber, IEEE, Thoas

More information

Recovery of Sparsely Corrupted Signals

Recovery of Sparsely Corrupted Signals TO APPEAR IN IEEE TRANSACTIONS ON INFORMATION TEORY 1 Recovery of Sparsely Corrupted Signals Christoph Studer, Meber, IEEE, Patrick Kuppinger, Student Meber, IEEE, Graee Pope, Student Meber, IEEE, and

More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

Homework 3 Solutions CSE 101 Summer 2017

Homework 3 Solutions CSE 101 Summer 2017 Hoework 3 Solutions CSE 0 Suer 207. Scheduling algoriths The following n = 2 jobs with given processing ties have to be scheduled on = 3 parallel and identical processors with the objective of iniizing

More information

Convex Programming for Scheduling Unrelated Parallel Machines

Convex Programming for Scheduling Unrelated Parallel Machines Convex Prograing for Scheduling Unrelated Parallel Machines Yossi Azar Air Epstein Abstract We consider the classical proble of scheduling parallel unrelated achines. Each job is to be processed by exactly

More information

Compressive Distilled Sensing: Sparse Recovery Using Adaptivity in Compressive Measurements

Compressive Distilled Sensing: Sparse Recovery Using Adaptivity in Compressive Measurements 1 Copressive Distilled Sensing: Sparse Recovery Using Adaptivity in Copressive Measureents Jarvis D. Haupt 1 Richard G. Baraniuk 1 Rui M. Castro 2 and Robert D. Nowak 3 1 Dept. of Electrical and Coputer

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 018: Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee7c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee7c@berkeley.edu October 15,

More information

Detection and Estimation Theory

Detection and Estimation Theory ESE 54 Detection and Estiation Theory Joseph A. O Sullivan Sauel C. Sachs Professor Electronic Systes and Signals Research Laboratory Electrical and Systes Engineering Washington University 11 Urbauer

More information