Secrets of Matrix Factorization: Supplementary Document

Je Hyeong Hong, University of Cambridge, jhh37@cam.ac.uk
Andrew Fitzgibbon, Microsoft, Cambridge, UK, awf@microsoft.com

Abstract

This document consists of details to which the original submission article [1] makes reference. These include a list of symbols, some matrix identities and calculus rules, a list of VarPro derivatives, a symmetry analysis of the un-regularized VarPro Gauss-Newton matrix, and MTSS computation details.

Symbol | Meaning
R^{p×q} | the space of real matrices of size p×q
S^p | the space of symmetric real matrices of size p×p
M | the measurement matrix ∈ R^{m×n} (may contain noise and missing data)
W | the weight matrix ∈ R^{m×n}
U | the first decomposition matrix ∈ R^{m×r}
V | the second decomposition matrix ∈ R^{n×r}
u | vec(U) ∈ R^{mr}
v | vec(V^T) ∈ R^{nr}
m | the number of rows in M
n | the number of columns in M
p | the number of visible entries in M
p_j | the number of visible entries in the j-th column of M
f | the objective ∈ R
ε | the vector objective ∈ R^p
J | the Jacobian
H | the Hessian
V* | argmin_V f(U, V) ∈ R^{n×r}
Π | the projection matrix ∈ R^{p×mn} selecting the nonzero entries of vec(W)
W̃ | Π diag(vec(W)) ∈ R^{p×mn}
m̃ | W̃ vec(M) ∈ R^p
Ũ | W̃ (I_n ⊗ U) ∈ R^{p×nr}
Ṽ | W̃ (V ⊗ I_m) ∈ R^{p×mr}
K_mr | the permutation matrix ∈ R^{mr×mr} satisfying vec(U^T) = K_mr vec(U)
K_nr | the permutation matrix ∈ R^{nr×nr} satisfying vec(V^T) = K_nr vec(V)
R | W ⊙ (U V^T − M) ∈ R^{m×n}
Z | (W ⊙ R) ⊗ I_r ∈ R^{mr×nr}

Table 1: A list of symbols used in [1].
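The projected quantities in Table 1 can be formed explicitly at small scale. The following numpy sketch (variable names are ours, and the dense Π is only viable at toy sizes) builds Π, W̃, m̃ and Ũ and checks their dimensions:

```python
import numpy as np

# Toy problem: a 4x3 measurement matrix with rank-2 factors and missing data.
m, n, r = 4, 3, 2
rng = np.random.default_rng(0)
W = (rng.random((m, n)) > 0.3).astype(float)   # binary weights: 1 = visible
M = rng.standard_normal((m, n))
U = rng.standard_normal((m, r))

w = W.flatten(order="F")                   # vec(W), column-major
Pi = np.eye(m * n)[np.flatnonzero(w)]      # Pi in R^{p x mn}, selects visible entries
W_tilde = Pi @ np.diag(w)                  # W~ := Pi diag(vec(W))
m_tilde = W_tilde @ M.flatten(order="F")   # m~ := W~ vec(M)
U_tilde = W_tilde @ np.kron(np.eye(n), U)  # U~ := W~ (I_n kron U)

p = int(W.sum())                           # number of visible entries
assert W_tilde.shape == (p, m * n)
assert m_tilde.shape == (p,)
assert U_tilde.shape == (p, n * r)
```

For binary weights, m̃ is simply the vector of visible entries of W ⊙ M, which the assertions above confirm dimensionally.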
1. Symbols

A list of the symbols used in [1] can be found in Table 1.

2. Matrix identities and calculus rules

Some of the key matrix identities and calculus rules in [6, 5] are stated and extended below for convenience.

2.1. Vec and Kronecker product identities

Given that each bold lowercase letter denotes an arbitrary column vector of appropriate size, we have

vec(A ⊙ B) = diag(vec A) vec(B),   (1)
vec(AXB) = (B^T ⊗ A) vec(X),   (2)
vec(AB) = (B^T ⊗ I) vec(A) = (I ⊗ A) vec(B),   (3)
(A ⊗ B)(C ⊗ D) = AC ⊗ BD,   (4)
(A ⊗ b)C = AC ⊗ b,   (5)
vec(A^T) = K_1 vec(A),   (6)
(A ⊗ B) = K_1^T (B ⊗ A) K_2, and   (7)
(A ⊗ b) = K_1^T (b ⊗ A),   (8)

where K_1 and K_2 are the appropriate commutation (permutation) matrices.

2.2. Differentiation of the µ-pseudo-inverse

Let us first define the µ-pseudo-inverse X†_µ := (X^T X + µI)^{-1} X^T. The derivative of the µ-pseudo-inverse is

d[X†_µ] = d[(X^T X + µI)^{-1}] X^T + (X^T X + µI)^{-1} d[X^T]   (9)
= −(X^T X + µI)^{-1} d[X^T X + µI] (X^T X + µI)^{-1} X^T + (X^T X + µI)^{-1} d[X^T]   (10)
= −(X^T X + µI)^{-1} (d[X^T] X + X^T d[X]) (X^T X + µI)^{-1} X^T + (X^T X + µI)^{-1} d[X^T]   (11)
= −(X^T X + µI)^{-1} d[X^T] X X†_µ − X†_µ d[X] X†_µ + (X^T X + µI)^{-1} d[X^T]   (12)
= −X†_µ d[X] X†_µ + (X^T X + µI)^{-1} d[X^T] (I − X X†_µ).   (13)

The derivative of the pseudo-inverse X† := (X^T X)^{-1} X^T is then simply the above with µ = 0.

3. Derivatives for variable projection (VarPro)

A full list of derivatives for un-regularized VarPro can be found in Table 2. The derivations are illustrated in [2], where the regularized setting is also discussed.

3.1. Differentiating v*(u)

From [1], we have

v*(u) := argmin_v f(u, v) = Ũ†_µ m̃.   (14)

Substituting (14) into the original objective yields

ε*(u, v*(u)) = Ũ v* − m̃ = −(I − Ũ Ũ†_µ) m̃.   (15)

Taking d[v*] and applying (13) gives us

d[v*] = −Ũ†_µ d[Ũ] Ũ†_µ m̃ + (Ũ^T Ũ + µI)^{-1} d[Ũ^T] (I − Ũ Ũ†_µ) m̃   (16)
= −Ũ†_µ d[Ũ] v* − (Ũ^T Ũ + µI)^{-1} d[Ũ^T] ε*   / noting v*(u) = Ũ†_µ m̃ and (15)   (17)
= −Ũ†_µ Ṽ du − (Ũ^T Ũ + µI)^{-1} Z^T K_mr du,   / noting bilinearity and the results in [2]   (18)

and hence

dv*/du = −(Ũ^T Ũ + µI)^{-1} (Ũ^T Ṽ + Z^T K_mr).   (19)
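The closed form (13) can be sanity-checked numerically. The sketch below (our own test harness, not from the paper) compares it against a central finite difference of X†_µ along a random perturbation direction:

```python
import numpy as np

rng = np.random.default_rng(1)

def pinv_mu(X, mu):
    # X_mu^† := (X^T X + mu I)^{-1} X^T
    return np.linalg.solve(X.T @ X + mu * np.eye(X.shape[1]), X.T)

X = rng.standard_normal((6, 3))
dX = rng.standard_normal((6, 3))   # an arbitrary perturbation direction
mu, h = 0.1, 1e-6

# Finite-difference directional derivative of X_mu^† along dX
fd = (pinv_mu(X + h * dX, mu) - pinv_mu(X - h * dX, mu)) / (2 * h)

# Closed form (13): -X_mu^† dX X_mu^† + (X^T X + mu I)^{-1} dX^T (I - X X_mu^†)
Xm = pinv_mu(X, mu)
A = np.linalg.inv(X.T @ X + mu * np.eye(X.shape[1]))
closed = -Xm @ dX @ Xm + A @ dX.T @ (np.eye(X.shape[0]) - X @ Xm)

assert np.allclose(fd, closed, atol=1e-5)
```

Setting mu = 0 checks the ordinary pseudo-inverse derivative in the same way, provided X has full column rank.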
The algorithms in Table 2 are cumulative: ALS (RW3), plus the [·]_{RW2} terms for approximate Gauss-Newton (RW2), plus the [·]_{RW1} terms for full Gauss-Newton (RW1), plus the [·]_{FN} terms for full Newton.

Quantity | Definition | Expression
v*(u) ∈ R^{nr} | v* := argmin_v f(u, v) | v* = Ũ† m̃   (m̃ ∈ R^p consists of the non-zero elements of vec(W ⊙ M))
dv*/du ∈ R^{nr×mr} | dv*/du := d vec(V*^T) / d vec(U) | dv*/du = −[Ũ† Ṽ]_{RW2} − [(Ũ^T Ũ)^{-1} Z^T K_mr]_{RW1}
Cost vector ε* ∈ R^p | ε* := Ũ v* − m̃ | —
Jacobian J ∈ R^{p×mr} | J := dε*/du | J = (I_p − [Ũ Ũ†]_{RW2}) Ṽ − [Ũ†^T Z^T K_mr]_{RW1}
Cost function f ∈ R | f := ‖ε*‖₂² | f = ‖W ⊙ (U V*^T − M)‖_F²
Ordinary gradient g ∈ R^{mr} | g := df/du | ½ g = Ṽ^T ε* = vec((W ⊙ R*) V*)
Projected gradient g_p ∈ R^{mr} | g_p := P_p g   (P_p := I_mr − I_r ⊗ UU^T) | ½ g_p = ½ g
Reduced gradient g_r ∈ R^{mr−r²} | g_r := P_r g   (P_r := I_r ⊗ U_⊥^T) | ½ g_r = ½ P_r g
Ordinary Hessian H_GN, H ∈ S^{mr} | ½ H_GN := J^T J + λI;  ½ H := ½ dg/du + λI | ½ H_GN = Ṽ^T (I_p − [Ũ Ũ†]_{RW2}) Ṽ + [K_mr^T Z (Ũ^T Ũ)^{-1} Z^T K_mr]_{RW1} + λ I_mr;  ½ H = ½ H_GN − [2 K_mr^T Z (Ũ^T Ũ)^{-1} Z^T K_mr]_{FN} − [K_mr^T Z Ũ† Ṽ + Ṽ^T Ũ†^T Z^T K_mr]_{FN} + λ I_mr
Projected Hessian H_pGN, H_p ∈ S^{mr} | ½ H_pGN := P_p J^T J P_p + λI;  ½ H_p := ½ P_p (dg_p/du) P_p + λI   (P_p := I_r ⊗ (I_m − UU^T)) | ½ H_pGN = ½ H_GN;  ½ H_p = ½ H_pGN − [2 K_mr^T Z (Ũ^T Ũ)^{-1} Z^T K_mr]_{FN} − [K_mr^T Z Ũ† Ṽ P_p + P_p Ṽ^T Ũ†^T Z^T K_mr]_{FN} + α (I_r ⊗ UU^T) + λ I_mr
Reduced Hessian H_r ∈ S^{mr−r²} | ½ H_r := ½ P_r H_p P_r^T   (P_r := I_r ⊗ U_⊥^T) | ½ H_r = ½ P_r H_p P_r^T = ½ P_r H P_r^T = ½ P_r (dg/du) P_r^T + λI
K_mr | the permutation matrix satisfying K_mr vec(U) = vec(U^T) | —
R* | R* := W ⊙ (U V*^T − M) | —
Z | Z := (W ⊙ R*) ⊗ I_r | —
Ordinary update | ordinary update for U | U ← U − unvec(H^{-1} g)
Projected update | Grassmann manifold retraction | U ← qf(U − unvec(H_p^{-1} g_p))
Reduced update | Grassmann manifold retraction | U ← qf(U − unvec(P_r^T H_r^{-1} g_r))

Table 2: List of derivatives and extensions for un-regularized variable projection on low-rank matrix factorization with missing data, including the connections to Ruhe and Wedin's algorithms (RW1–RW3). The expressions are given without damping and without the projection constraint P; U_⊥ denotes an orthonormal basis of the complement of span(U).
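Several rows of Table 2 can be verified at toy scale. The following numpy sketch (ours; the dense Π stand-in and the column-major vec ordering are our assumptions) computes v* = Ũ†m̃ and checks the gradient identity ½g = Ṽ^Tε* = vec((W ⊙ R*)V*):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 5, 4, 2
W = (rng.random((m, n)) > 0.4).astype(float)   # binary weights
M = rng.standard_normal((m, n))
U = rng.standard_normal((m, r))

# Dense stand-ins for the projected quantities of Table 1 (toy scale only)
w = W.flatten(order="F")
Pi = np.eye(m * n)[np.flatnonzero(w)]
Wt = Pi @ np.diag(w)
mt = Wt @ M.flatten(order="F")
Ut = Wt @ np.kron(np.eye(n), U)

# v* = U~^† m~ (the inner ALS solve); v* stacks the rows of V*
v_star, *_ = np.linalg.lstsq(Ut, mt, rcond=None)
V = v_star.reshape(n, r)

# Cost vector eps* = U~ v* - m~ and residual R* = W . (U V*^T - M)
eps = Ut @ v_star - mt
R = W * (U @ V.T - M)

# Gradient identity: (1/2) g = V~^T eps* = vec((W . R*) V*)
Vt = Wt @ np.kron(V, np.eye(m))
half_g = Vt.T @ eps
assert np.allclose(half_g, ((W * R) @ V).flatten(order="F"))
```

The gradient identity is purely algebraic for binary weights, so the assertion holds whether or not V is optimal; the lstsq call additionally makes ε* orthogonal to the range of Ũ.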
4. Additional symmetry of the Gauss-Newton matrix for un-regularized VarPro

To observe the additional symmetry, we first need to know how the un-regularized VarPro problem can be reformulated in terms of column-wise derivatives. Below is a summary of these derivatives, which are derived in [2].

4.1. Summary of column-wise derivatives for VarPro

The un-regularized objective f(u, v*(u)) can be written as the sum of the squared norms of the columns of R*(u, v*(u)), i.e.

f(u, v*(u)) := Σ_{j=1}^{n} ‖W̃_j (U v*_j − m_j)‖₂²,   (20)

where v*_j is the j-th row of V*(U), m_j is the j-th column of M, and W̃_j consists of the non-zero rows of diag(w_j), where w_j ∈ R^m is the j-th column of W. Hence, W̃_j is a p_j × m weight-projection matrix, where p_j is the number of non-missing entries in column j. Now define the following:

Ũ_j := W̃_j U,   (21)
m̃_j := W̃_j m_j, and   (22)
ε_j := Ũ_j v*_j − m̃_j.   (23)

Hence,

v* := argmin_v Σ_{j=1}^{n} ‖Ũ_j v_j − m̃_j‖₂²,   (24)

which shows that each row of V* is independent of the other rows, and

v*_j = argmin_{v_j} ‖Ũ_j v_j − m̃_j‖₂² = Ũ_j† m̃_j.   (25)

4.1.1 Some definitions

We define

Ṽ_j := W̃_j (v*_j^T ⊗ I_m) ∈ R^{p_j×mr}, and   (26)
Z_j := (w_j ⊙ r*_j) ⊗ I_r ∈ R^{mr×r},   (27)

where r*_j is the j-th column of R*.

4.1.2 Column-wise Jacobian J_j

The Jacobian matrix is also independent for each column. From [2], the Jacobian matrix for column j, J_j, is

J_j = (I − Ũ_j Ũ_j†) Ṽ_j − Ũ_j†^T Z_j^T K_mr   (28)
= (I − Ũ_j Ũ_j†) Ṽ_j − Ũ_j ((Ũ_j^T Ũ_j)^{-1} ⊗ (w_j ⊙ r*_j)^T).   (29)

4.1.3 The column-wise form of the Hessian H

From [2], the column-wise form of H is

½ H = Σ_{j=1}^{n} [ Ṽ_j^T (I − [Ũ_j Ũ_j†]_{RW2}) Ṽ_j + [K_mr^T Z_j (Ũ_j^T Ũ_j)^{-1} Z_j^T K_mr]_{RW1}
      − [2 K_mr^T Z_j (Ũ_j^T Ũ_j)^{-1} Z_j^T K_mr]_{FN}
      − [Ṽ_j^T Ũ_j†^T Z_j^T K_mr + K_mr^T Z_j Ũ_j† Ṽ_j]_{FN} ] + λI.   (30)

Every Z_j^T K_mr can be replaced with I_r ⊗ (w_j ⊙ r*_j)^T using (8).
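The row-wise independence in (24)–(25) is easy to confirm numerically. In the sketch below (our own toy setup with binary weights), solving one small least-squares problem per column of M reproduces the joint solution v* = Ũ†m̃:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 6, 4, 2
W = np.ones((m, n))
W[0, 0] = W[2, 1] = W[4, 3] = 0.0    # a few missing entries
M = rng.standard_normal((m, n))
U = rng.standard_normal((m, r))

# Per-column solves (25): v_j* = U~_j^† m~_j
V = np.zeros((n, r))
for j in range(n):
    vis = np.flatnonzero(W[:, j])    # the p_j visible rows of column j
    U_j = U[vis]                     # U~_j = W~_j U (binary weights)
    m_j = M[vis, j]                  # m~_j = W~_j m_j
    V[j], *_ = np.linalg.lstsq(U_j, m_j, rcond=None)

# Joint solve over all columns at once: v* = U~^† m~
w = W.flatten(order="F")
Pi = np.eye(m * n)[np.flatnonzero(w)]
Wt = Pi @ np.diag(w)
Ut = Wt @ np.kron(np.eye(n), U)
v_joint, *_ = np.linalg.lstsq(Ut, Wt @ M.flatten(order="F"), rcond=None)

assert np.allclose(V.flatten(), v_joint)   # rows of V* stack into v*
```

The joint normal equations are block-diagonal with one r × r block per column, which is exactly why the per-column solves agree.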
4.2. Symmetry analysis

As derived in [2], the Gauss-Newton matrix for un-regularized VarPro is identical to its projection onto the tangent space of U. The column-wise representation of this matrix is

½ H_GN = Σ_{j=1}^{n} [ Ṽ_j^T (I − [Ũ_j Ũ_j†]_{RW2}) Ṽ_j + [K_mr^T Z_j (Ũ_j^T Ũ_j)^{-1} Z_j^T K_mr]_{RW1} ] + λI.   (31)

Simplifying the first term yields

Ṽ_j^T Ṽ_j = (v*_j ⊗ I) W̃_j^T W̃_j (v*_j^T ⊗ I)   (32)
= (v*_j ⊗ W̃_j^T)(v*_j^T ⊗ W̃_j) = v*_j v*_j^T ⊗ W̃_j^T W̃_j,   / noting (5)   (33)

which is the Kronecker product of two symmetric matrices. Now defining Ũ_jQ := qf(Ũ_j) gives Ũ_j Ũ_j† = Ũ_jQ Ũ_jQ^T. Hence, simplifying the second part of the first term yields

Ṽ_j^T (Ũ_j Ũ_j†) Ṽ_j = (v*_j ⊗ W̃_j^T)(Ũ_jQ Ũ_jQ^T)(v*_j^T ⊗ W̃_j)   (34)
= v*_j v*_j^T ⊗ W̃_j^T Ũ_jQ Ũ_jQ^T W̃_j.   / noting (5)   (35)

Given that Ũ_j = Ũ_jQ Ũ_jR, we can transform (Ũ_j^T Ũ_j)^{-1} into Ũ_jR^{-1} Ũ_jR^{-T}. The last term then becomes

K_mr^T Z_j (Ũ_j^T Ũ_j)^{-1} Z_j^T K_mr = (I ⊗ (w_j ⊙ r*_j)) Ũ_jR^{-1} Ũ_jR^{-T} (I ⊗ (w_j ⊙ r*_j)^T)   / noting (8)   (36)
= Ũ_jR^{-1} Ũ_jR^{-T} ⊗ (w_j ⊙ r*_j)(w_j ⊙ r*_j)^T.   / noting (5)   (37)

Hence, every term inside the square brackets in (31) can be represented as the Kronecker product of two symmetric matrices, which gives an additional internal symmetry that can be exploited. This means that computing the Gauss-Newton matrix for un-regularized VarPro can be sped up by a factor of at most 4.

5. Computing mean time to second success (MTSS)

As mentioned in [1], computing MTSS rests on the assumption that a dataset has a known best optimum to which thousands or millions of past runs have converged. Given such a dataset, we first draw a set of random starting points, usually between 20 and 100, from an isotropic Normal distribution (i.e. U = randn(m, r) with some predefined seed). For each starting sample, we run each algorithm and record two things:
1. whether the algorithm has successfully converged to the best-known optimum, and
2. the elapsed time.
Since the samples of starting points are drawn in a pseudo-random manner with a monotonically-increasing seed number, we can estimate how long each algorithm would have taken to observe two successes starting from each sample number. The average of these times yields the MTSS. An example is included in the next section to demonstrate the procedure more clearly.

5.1. An example illustration

Table 3 shows some experimental data produced by an algorithm on a dataset whose best-known optimum is 1.225. For each starting sample, the final cost and the elapsed time have been recorded. We first determine which samples yielded a success, defined as convergence to the best-known optimum (1.225). Then, for each sample number, we calculate up to which sample number we would have had to run to obtain two successes; e.g. starting from sample number 5, we would have needed to run the algorithm up to sample number 7. This allows us to estimate the time required for two successes starting from each sample number (see Table 3). MTSS is simply the average of these times, 432.4/7 = 61.8.
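The procedure above can be scripted directly. The sketch below (our code; the per-run times and success flags are our reading of Table 3) computes the time to the second success from every starting sample and averages the entries for which a second success exists:

```python
# Per-run data as read from Table 3: elapsed times (s) and success flags
times = [23.2, 15.1, 24.7, 19.5, 25.4, 16.3, 15.5, 21.2, 17.8, 2.0]
success = [False, True, False, True, False, True, True, False, True, False]

def mtss(times, success):
    """Mean time to second success over all starting samples for which
    two successes occur at or after that sample."""
    totals = []
    for start in range(len(times)):
        hits = [i for i in range(start, len(times)) if success[i]]
        if len(hits) >= 2:                        # a second success exists
            totals.append(sum(times[start:hits[1] + 1]))
    return sum(totals) / len(totals)

print(round(mtss(times, success), 1))  # -> 61.8
```

Starting samples 8–10 contribute nothing because fewer than two successes remain, leaving the seven totals of Table 3 whose mean is 432.4/7 = 61.8.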
Sample no.                  | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10
Final cost                  | 1.523 | 1.225 | 1.647 | 1.225 | 1.52  | 1.225 | 1.225 | 1.647 | 1.225 | 1.774
Elapsed time (s)            | 23.2  | 15.1  | 24.7  | 19.5  | 25.4  | 16.3  | 15.5  | 21.2  | 17.8  | 2.0
Success (S)                 | –     | S     | –     | S     | –     | S     | S     | –     | S     | –
Samples of next 2 successes | 2 & 4 | 2 & 4 | 4 & 6 | 4 & 6 | 6 & 7 | 6 & 7 | 7 & 9 | N/A   | N/A   | N/A
Time for 2 successes (s)    | 82.5  | 59.3  | 85.9  | 61.2  | 57.2  | 31.8  | 54.5  | N/A   | N/A   | N/A

Table 3: The example data above is used to compute the MTSS of an algorithm. We assume that each sample starts from randomly-drawn initial points and that the best-known optimum for this dataset, obtained over millions of trials, is 1.225. In this case, MTSS is the average of the bottom-row values, which is 61.8.

6. Results

The supplementary package includes a set of CSV files that were used to generate the results figures in [1]. They can be found in the <main>/results directory.

6.1. Limitations

Some results are incomplete for the following reasons:

1. RTRMC [3] fails to run on the FAC and Scu datasets, and has also failed on several occasions when running on other datasets. The suspected cause is the algorithm's use of the Cholesky decomposition, which cannot factorize matrices that are not positive definite.
2. The original Damped Wiberg code (TO DW) [7] fails to run on the FAC and Scu datasets. The way the code implements the QR decomposition requires datasets to be trimmed in advance so that each column of the measurement matrix has at least as many visible entries as the estimated rank of the factorization model.
3. CSF [4] runs extremely slowly on the UGb dataset.
4. Our Ceres-implemented algorithms are not yet able to take in sparse measurement matrices, which are required to handle large datasets such as NET.

References

[1] Authors. Secrets of matrix factorization: Approximations, numerics and manifold optimization. 2015.
[2] Authors. Secrets of matrix factorization: Further derivations and comparisons. Technical report, further.pdf, 2015.
[3] N. Boumal and P.-A. Absil.
RTRMC: A Riemannian trust-region method for low-rank matrix completion. In Advances in Neural Information Processing Systems 24 (NIPS 2011), pages 406–414, 2011.
[4] P. F. Gotardo and A. M. Martinez. Computing smooth time trajectories for camera and deformable shape in structure from motion with occlusion. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(10):2051–2065, Oct 2011.
[5] J. R. Magnus and H. Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. 3rd edition, 2007.
[6] T. P. Minka. Old and new matrix algebra useful for statistics. Technical report, Microsoft Research, 2000.
[7] T. Okatani, T. Yoshida, and K. Deguchi. Efficient algorithm for low-rank matrix factorization with missing components and performance comparison of latest algorithms. In Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV), pages 842–849, 2011.