Secrets of Matrix Factorization: Supplementary Document

Je Hyeong Hong, University of Cambridge, jhh37@cam.ac.uk
Andrew Fitzgibbon, Microsoft, Cambridge, UK, awf@microsoft.com

Abstract

This document provides details referenced by the original submission article [1]: a list of symbols, some matrix identities and calculus rules, a list of VarPro derivatives, a symmetry analysis of the un-regularized VarPro Gauss-Newton matrix, and details of the MTSS computation.

| Symbol | Meaning |
|---|---|
| R^{p×q} | The space of real matrices of size p × q |
| S^p | The space of symmetric real matrices of size p × p |
| M | The measurement matrix ∈ R^{m×n} (may contain noise and missing data) |
| W | The weight matrix ∈ R^{m×n} |
| U | The first decomposition matrix ∈ R^{m×r} |
| V | The second decomposition matrix ∈ R^{n×r} |
| u | vec(U) ∈ R^{mr} |
| v | vec(V^⊤) ∈ R^{nr} |
| m | The number of rows in M |
| n | The number of columns in M |
| p | The number of visible entries in M |
| p_j | The number of visible entries in the j-th column of M |
| f | The objective, ∈ R |
| ε | The vector objective, ∈ R^p |
| J | Jacobian |
| H | Hessian |
| V* | arg min_V f(U, V), ∈ R^{n×r} |
| Π | Projection matrix ∈ R^{p×mn} selecting the nonzero entries of vec(W) |
| W̃ | Π diag(vec(W)) ∈ R^{p×mn} |
| m̃ | W̃ vec(M) ∈ R^p |
| Ũ | W̃ (I_n ⊗ U) ∈ R^{p×nr} |
| Ṽ | W̃ (V ⊗ I_m) ∈ R^{p×mr} |
| K_mr | The permutation matrix satisfying vec(U^⊤) = K_mr vec(U), ∈ R^{mr×mr} |
| K_nr | The permutation matrix satisfying vec(V^⊤) = K_nr vec(V), ∈ R^{nr×nr} |
| R | W ⊙ (UV^⊤ − M) ∈ R^{m×n} |
| Z | (W ⊙ R) ⊗ I_r ∈ R^{mr×nr} |

Table 1: A list of symbols used in [1].
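As a concrete illustration of the lifted quantities in Table 1, the following NumPy sketch (ours, not part of the supplementary package; the binary weight mask and all variable names are illustrative) builds W̃, m̃, Ũ and Ṽ for a small random instance and checks the bilinear identity Ũ vec(V^⊤) = Ṽ vec(U) = W̃ vec(UV^⊤), which is relied upon in Section 3.1.

```python
# Minimal sketch of the lifted quantities of Table 1 and the bilinear identity
# U~ vec(V^T) = V~ vec(U) = W~ vec(U V^T).
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
W = (rng.random((m, n)) < 0.7).astype(float)      # binary weights (missing-data mask)
M = rng.standard_normal((m, n))
U = rng.standard_normal((m, r))
V = rng.standard_normal((n, r))

vecW = W.flatten(order='F')                        # vec() stacks columns
Pi = np.eye(m * n)[vecW != 0]                      # projection onto visible entries (p x mn)
Wt = Pi @ np.diag(vecW)                            # W~ = Pi diag(vec W)
mt = Wt @ M.flatten(order='F')                     # m~ = W~ vec(M)
Ut = Wt @ np.kron(np.eye(n), U)                    # U~ = W~ (I_n kron U)
Vt = Wt @ np.kron(V, np.eye(m))                    # V~ = W~ (V kron I_m)

u = U.flatten(order='F')                           # u = vec(U)
v = V.T.flatten(order='F')                         # v = vec(V^T)
rhs = Wt @ (U @ V.T).flatten(order='F')            # W~ vec(U V^T)
print(np.allclose(Ut @ v, rhs), np.allclose(Vt @ u, rhs))  # True True
```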

1. Symbols

A list of the symbols used in [1] can be found in Table 1.

2. Matrix identities and calculus rules

Some of the key matrix identities and calculus rules in [6, 5] are listed and extended below for convenience.

2.1. Vec and Kronecker product identities

Given that each lower-case letter in bold denotes an arbitrary column vector of appropriate size, we have

vec(A ⊙ B) = diag(vec A) vec(B),   (1)
vec(AXB) = (B^⊤ ⊗ A) vec(X),   (2)
vec(AB) = (B^⊤ ⊗ I) vec(A) = (I ⊗ A) vec(B),   (3)
(A ⊗ B)(C ⊗ D) = AC ⊗ BD,   (4)
(A ⊗ b)C = AC ⊗ b,   (5)
vec(A^⊤) = K_A vec(A),   (6)
(A ⊗ B) = K_1 (B ⊗ A) K_2, and   (7)
(A ⊗ b) = K_1 (b ⊗ A),   (8)

where K_A, K_1 and K_2 denote commutation (vec-permutation) matrices of appropriate sizes.

2.2. Differentiation of the µ-pseudo-inverse

Let us first define the µ-pseudo-inverse X^†_µ := (X^⊤X + µI)^{−1} X^⊤. Its derivative is

∂[X^†_µ] = ∂[(X^⊤X + µI)^{−1}] X^⊤ + (X^⊤X + µI)^{−1} ∂[X^⊤]   (9)
= −(X^⊤X + µI)^{−1} ∂[X^⊤X + µI] (X^⊤X + µI)^{−1} X^⊤ + (X^⊤X + µI)^{−1} ∂[X^⊤]   (10)
= −(X^⊤X + µI)^{−1} (∂[X^⊤] X + X^⊤ ∂[X]) (X^⊤X + µI)^{−1} X^⊤ + (X^⊤X + µI)^{−1} ∂[X^⊤]   (11)
= −(X^⊤X + µI)^{−1} ∂[X^⊤] X X^†_µ − (X^⊤X + µI)^{−1} X^⊤ ∂[X] X^†_µ + (X^⊤X + µI)^{−1} ∂[X^⊤]   (12)
= −X^†_µ ∂[X] X^†_µ + (X^⊤X + µI)^{−1} ∂[X^⊤] (I − XX^†_µ).   (13)

The derivative of the pseudo-inverse X^† := (X^⊤X)^{−1} X^⊤ is then simply the above with µ = 0.

3. Derivatives for variable projection (VarPro)

A full list of derivatives for un-regularized VarPro can be found in Table 2. The derivations are given in [2], where the regularized setting is also discussed.

3.1. Differentiating v*(u)

From [1], we have

v*(u) := arg min_v f(u, v) = Ũ^†_µ m̃.   (14)

Substituting (14) into the original objective yields

ε(u, v*(u)) = Ũ v* − m̃ = −(I − ŨŨ^†_µ) m̃.   (15)

Taking ∂[v*] and applying (13) gives us

∂[v*] = −Ũ^†_µ ∂[Ũ] Ũ^†_µ m̃ + (Ũ^⊤Ũ + µI)^{−1} ∂[Ũ^⊤] (I − ŨŨ^†_µ) m̃   (16)
= −Ũ^†_µ ∂[Ũ] v* − (Ũ^⊤Ũ + µI)^{−1} ∂[Ũ^⊤] ε   / noting v*(u) = Ũ^†_µ m̃ and (15)   (17)
= −Ũ^†_µ Ṽ ∂u − (Ũ^⊤Ũ + µI)^{−1} Z^⊤ K_mr ∂u,   / noting bilinearity and the results in [2]   (18)

and hence

dv*/du = −(Ũ^⊤Ũ + µI)^{−1} (Ũ^⊤ Ṽ + Z^⊤ K_mr).   (19)
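The µ-pseudo-inverse derivative (13), on which (16)-(19) rely, can be verified numerically. Below is a small sketch (ours, not part of the supplementary package; all names are illustrative) that compares (13) with a central finite difference along a random perturbation direction.

```python
# Finite-difference check of the mu-pseudo-inverse derivative (eq. 13):
#   d[X_mu^+][dX] = -X_mu^+ dX X_mu^+ + (X^T X + mu I)^{-1} dX^T (I - X X_mu^+).
import numpy as np

def pinv_mu(X, mu):
    """mu-pseudo-inverse X_mu^+ = (X^T X + mu I)^{-1} X^T."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + mu * np.eye(n), X.T)

rng = np.random.default_rng(1)
p, n, mu, t = 8, 3, 0.1, 1e-6
X = rng.standard_normal((p, n))
dX = rng.standard_normal((p, n))                    # perturbation direction

Xmu = pinv_mu(X, mu)
analytic = (-Xmu @ dX @ Xmu
            + np.linalg.solve(X.T @ X + mu * np.eye(n), dX.T) @ (np.eye(p) - X @ Xmu))
numeric = (pinv_mu(X + t * dX, mu) - pinv_mu(X - t * dX, mu)) / (2 * t)
print(np.allclose(analytic, numeric, atol=1e-5))    # True
```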

Table 2: List of derivatives and extensions for un-regularized variable projection on low-rank matrix factorization with missing data, including the quantities used in Ruhe and Wedin's algorithms (RW1-RW3). Each entry gives the definition followed by its closed form.

- v*(u) ∈ R^{nr}. Definition: v* := arg min_v f(u, v). Closed form: v* = Ũ^† m̃, where m̃ ∈ R^p consists of the non-zero elements of vec(M).
- dv*/du ∈ R^{nr×mr}. Definition: dv*/du := d vec(V*^⊤)/d vec(U). Closed form: dv*/du = −[Ũ^† Ṽ]_RW2 − [(Ũ^⊤Ũ)^{−1} Z^⊤ K_mr]_RW1.
- Cost vector ε ∈ R^p. Definition: ε := Ũ v* − m̃.
- Jacobian J ∈ R^{p×mr}. Definition: J := dε/du. Closed form: J = (I_p − [ŨŨ^†]_RW2) Ṽ − [Ũ^{†⊤} Z^⊤ K_mr]_RW1.
- Cost function f ∈ R. Definition: f := ‖ε‖²₂. Closed form: f = ‖W ⊙ (UV^⊤ − M)‖²_F.
- Ordinary gradient g ∈ R^{mr}. Definition: g := df/du. Closed form: ½g = Ṽ^⊤ ε = vec((W ⊙ R) V).
- Projected gradient g_p ∈ R^{mr}. Definition: g_p := P_p^⊤ g, with P_p := I_mr − I_r ⊗ UU^⊤. Closed form: ½g_p = ½g.
- Reduced gradient g_r ∈ R^{mr−r²}. Definition: g_r := P_r^⊤ g, with P_r := I_r ⊗ U_⊥ (U_⊥ denotes an orthonormal basis of the orthogonal complement of span(U)). Closed form: ½g_r = ½ P_r^⊤ g.
- Ordinary Hessians H_GN, H ∈ S^{mr}. Definitions: ½H_GN := J^⊤J + λI and ½H := ½ dg/du + λI. Closed forms:
  ½H_GN = Ṽ^⊤ (I_p − [ŨŨ^†]_RW2) Ṽ + [K_mr^⊤ Z (Ũ^⊤Ũ)^{−1} Z^⊤ K_mr]_RW1 + λI_mr,
  ½H = ½H_GN − [2 K_mr^⊤ Z (Ũ^⊤Ũ)^{−1} Z^⊤ K_mr]_FN − [K_mr^⊤ Z Ũ^† Ṽ + Ṽ^⊤ Ũ^{†⊤} Z^⊤ K_mr]_FN + λI_mr.
- Projected Hessians H_pGN, H_p ∈ S^{mr}. Definitions: ½H_pGN := P_p^⊤ J^⊤ J P_p + λI and ½H_p := ½ P_p^⊤ (dg_p/du) P_p + λI, with P_p := I_r ⊗ (I_m − UU^⊤). Closed forms:
  ½H_pGN = ½H_GN,
  ½H_p = ½H_pGN − [2 K_mr^⊤ Z (Ũ^⊤Ũ)^{−1} Z^⊤ K_mr]_FN − [K_mr^⊤ Z Ũ^† Ṽ P_p + P_p^⊤ Ṽ^⊤ Ũ^{†⊤} Z^⊤ K_mr]_FN + α (I_r ⊗ UU^⊤) + λI_mr.
- Reduced Hessian H_r ∈ S^{mr−r²}. Definition: ½H_r := ½ P_r^⊤ H_p P_r, with P_r := I_r ⊗ U_⊥. Closed form: ½H_r = ½ P_r^⊤ H_p P_r = ½ P_r^⊤ H P_r = ½ P_r^⊤ (dg/du) P_r + λI.
- Ordinary update (ordinary update for U): U ← U − unvec(H^{−1} g).
- Projected update (Grassmann manifold retraction): U ← qf(U − unvec(H_p^{−1} g_p)).
- Reduced update (Grassmann manifold retraction): U ← qf(U − unvec(P_r H_r^{−1} g_r)).

Throughout the table, K_mr vec(U) = vec(U^⊤), R := W ⊙ (UV^⊤ − M), Z := (W ⊙ R) ⊗ I_r, and qf(·) denotes the Q factor of the thin QR decomposition. The bracket subscripts indicate the terms that are added as one moves from ALS (RW3) to approximate Gauss-Newton (RW2), full Gauss-Newton (RW1) and full Newton (FN); the damping terms (λ) and the projection constraint (P) can be dropped to obtain the undamped and unconstrained variants.
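For concreteness, the following sketch (ours, not part of the supplementary package; a binary weight mask and illustrative variable names are assumed) evaluates the closed form v* = Ũ^† m̃ on a small random instance and checks the gradient identity ½g = Ṽ^⊤ ε = vec((W ⊙ R) V*) numerically.

```python
# Evaluate v* = U~^+ m~ and check  (1/2) g = V~^T eps = vec((W .* R) V*),
# where R = W .* (U V*^T - M).
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 8, 6, 2
W = (rng.random((m, n)) < 0.8).astype(float)
M = rng.standard_normal((m, n))
U = rng.standard_normal((m, r))

vecW = W.flatten(order='F')
Wt = np.eye(m * n)[vecW != 0] @ np.diag(vecW)      # W~
mt = Wt @ M.flatten(order='F')                     # m~
Ut = Wt @ np.kron(np.eye(n), U)                    # U~

v_star = np.linalg.pinv(Ut) @ mt                   # v* = U~^+ m~  (this is vec(V*^T))
V_star = v_star.reshape(n, r)                      # rows of V* are the r-blocks of v*
eps = Ut @ v_star - mt                             # eps = U~ v* - m~

Vt = Wt @ np.kron(V_star, np.eye(m))               # V~ evaluated at V*
R = W * (U @ V_star.T - M)                         # R = W .* (U V*^T - M)
half_g_a = Vt.T @ eps                              # (1/2) g = V~^T eps
half_g_b = ((W * R) @ V_star).flatten(order='F')   # (1/2) g = vec((W .* R) V*)
print(np.allclose(half_g_a, half_g_b))             # True
```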

4. Additional symmetry of the Gauss-Newton matrix for un-regularized VarPro

To observe the additional symmetry, we first need to know how the un-regularized VarPro problem can be reformulated in terms of column-wise derivatives. Below is a summary of these derivatives, taken from [2].

4.1. Summary of column-wise derivatives for VarPro

The un-regularized objective f(u, v*(u)) can be written as the sum of the squared norms of the columns of R(u, v*(u)), i.e.

f(u, v*(u)) := Σ_{j=1}^{n} ‖W̃_j (U v*_j − m_j)‖²₂,   (20)

where v*_j is the j-th row of V*(U) (written as a column vector), m_j is the j-th column of M, and W̃_j consists of the non-zero rows of diag(w_j), where w_j ∈ R^m is the j-th column of W. Hence, W̃_j is a p_j × m weight-projection matrix, where p_j is the number of non-missing entries in column j. Now define the following:

Ũ_j := W̃_j U,   (21)
m̃_j := W̃_j m_j, and   (22)
ε_j := Ũ_j v*_j − m̃_j.   (23)

Hence,

v*_j := arg min_{v_j} Σ_{j=1}^{n} ‖Ũ_j v_j − m̃_j‖²₂   (24)
= arg min_{v_j} ‖Ũ_j v_j − m̃_j‖²₂ = Ũ_j^† m̃_j,   (25)

which shows that each row of V* is independent of the other rows.

4.1.1. Some definitions

We define

Ṽ_j := W̃_j (v*_j^⊤ ⊗ I_m) ∈ R^{p_j × mr}, and   (26)
Z_j := (w_j ⊙ r_j) ⊗ I_r ∈ R^{mr × r}.   (27)

4.1.2. Column-wise Jacobian J_j

The Jacobian matrix is also independent for each column. From [2], the Jacobian matrix for column j, J_j, is

J_j = (I − Ũ_j Ũ_j^†) Ṽ_j − Ũ_j^{†⊤} Z_j^⊤ K_mr   (28)
= (I − Ũ_j Ũ_j^†) Ṽ_j − Ũ_j ((Ũ_j^⊤ Ũ_j)^{−1} ⊗ (w_j ⊙ r_j)^⊤).   (29)

4.1.3. The column-wise form of the Hessian H

From [2], the column-wise form of H is

½H = Σ_{j=1}^{n} [ Ṽ_j^⊤ (I − [Ũ_j Ũ_j^†]_RW2) Ṽ_j + [K_mr^⊤ Z_j (Ũ_j^⊤ Ũ_j)^{−1} Z_j^⊤ K_mr]_RW1 − [2 K_mr^⊤ Z_j (Ũ_j^⊤ Ũ_j)^{−1} Z_j^⊤ K_mr]_FN − [Ṽ_j^⊤ Ũ_j^{†⊤} Z_j^⊤ K_mr + K_mr^⊤ Z_j Ũ_j^† Ṽ_j]_FN ] + λI.   (30)

All occurrences of Z_j^⊤ K_mr can be replaced with I_r ⊗ (w_j ⊙ r_j)^⊤ using (8).
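The row-wise independence of V* expressed by (24)-(25) can be checked directly. The sketch below (ours, with illustrative names and a binary weight mask) solves the per-column least-squares problems v*_j = Ũ_j^† m̃_j and compares the stacked result with the joint solution v* = Ũ^† m̃.

```python
# Check that the rows of V* can be computed independently, one per column of M:
#   v*_j = U~_j^+ m~_j  with  U~_j = W~_j U,  m~_j = W~_j m_j   (eqs. 21-25).
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 8, 6, 2
W = (rng.random((m, n)) < 0.8).astype(float)
M = rng.standard_normal((m, n))
U = rng.standard_normal((m, r))

# Joint solution: v* = U~^+ m~ with U~ = W~ (I_n kron U).
vecW = W.flatten(order='F')
Wt = np.eye(m * n)[vecW != 0] @ np.diag(vecW)
v_joint = np.linalg.pinv(Wt @ np.kron(np.eye(n), U)) @ (Wt @ M.flatten(order='F'))
V_joint = v_joint.reshape(n, r)

# Column-wise solution: one small least-squares problem per column of M.
V_colwise = np.zeros((n, r))
for j in range(n):
    rows = W[:, j] != 0                    # visible entries of column j
    Uj = W[rows][:, j:j + 1] * U[rows]     # U~_j = W~_j U (rows of U scaled by w_ij)
    mj = W[rows, j] * M[rows, j]           # m~_j = W~_j m_j
    V_colwise[j] = np.linalg.lstsq(Uj, mj, rcond=None)[0]

print(np.allclose(V_joint, V_colwise))     # True
```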

4.2. Symmetry analysis

As derived in [2], the Gauss-Newton matrix for un-regularized VarPro is identical to its projection onto the tangent space of U. The column-wise representation of this is

½H_GN = Σ_{j=1}^{n} [ Ṽ_j^⊤ (I − [Ũ_j Ũ_j^†]_RW2) Ṽ_j + [K_mr^⊤ Z_j (Ũ_j^⊤ Ũ_j)^{−1} Z_j^⊤ K_mr]_RW1 ] + λI.   (31)

Simplifying the first term yields

Ṽ_j^⊤ Ṽ_j = (v*_j ⊗ I) W̃_j^⊤ W̃_j (v*_j^⊤ ⊗ I)   (32)
= (v*_j ⊗ W̃_j^⊤)(v*_j^⊤ ⊗ W̃_j) = v*_j v*_j^⊤ ⊗ W̃_j^⊤ W̃_j,   / noting (5)   (33)

which is the Kronecker product of two symmetric matrices. Now defining Ũ_jQ := qf(Ũ_j) gives Ũ_j Ũ_j^† = Ũ_jQ Ũ_jQ^⊤. Hence, simplifying the second term yields

Ṽ_j^⊤ (Ũ_j Ũ_j^†) Ṽ_j = (v*_j ⊗ W̃_j^⊤)(Ũ_jQ Ũ_jQ^⊤)(v*_j^⊤ ⊗ W̃_j)   (34)
= v*_j v*_j^⊤ ⊗ W̃_j^⊤ Ũ_jQ Ũ_jQ^⊤ W̃_j.   / noting (5)   (35)

Given that Ũ_j = Ũ_jQ Ũ_jR, we can transform (Ũ_j^⊤ Ũ_j)^{−1} to Ũ_jR^{−1} Ũ_jR^{−⊤}. The last term then becomes

K_mr^⊤ Z_j (Ũ_j^⊤ Ũ_j)^{−1} Z_j^⊤ K_mr = (I ⊗ (w_j ⊙ r_j)) Ũ_jR^{−1} Ũ_jR^{−⊤} (I ⊗ (w_j ⊙ r_j)^⊤)   / noting (8)   (36)
= Ũ_jR^{−1} Ũ_jR^{−⊤} ⊗ (w_j ⊙ r_j)(w_j ⊙ r_j)^⊤.   / noting (5)   (37)

Hence, all the terms inside the square brackets in (31) can be represented as Kronecker products of two symmetric matrices, which have an additional internal symmetry that can be exploited. This means that the computation of the Gauss-Newton matrix for un-regularized VarPro can be sped up by a factor of at most 4.

5. Computing mean time to second success (MTSS)

As mentioned in [1], computing MTSS is based on the assumption that there is a known best optimum for a dataset, to which thousands or millions of past runs have converged. Given such a dataset, we first draw a set of random starting points, usually between 20 and 100, from an isotropic Normal distribution (i.e. U = randn(m, r) with some predefined seed). For each sample of starting points, we run each algorithm and observe two things: 1. whether the algorithm has successfully converged to the best-known optimum, and 2. the elapsed time. Since the samples of starting points have been drawn in a pseudo-random manner with a monotonically-increasing seed number, we can estimate how long it would have taken each algorithm to observe two successes starting from each sample number. The average of these times yields the MTSS. An example is included in the next section to demonstrate this procedure more clearly.

5.1. An example illustration

Table 3 shows some experimental data produced by an algorithm on a dataset whose best-known optimum is 1.225. For each sample of starting points, the final cost and the elapsed time have been recorded. We first determine which samples have yielded a success, defined as convergence to the best-known optimum (1.225). Then, for each sample number, we can calculate up to which sample number we would have had to run to obtain two successes, e.g. starting from sample number 5, we would have needed to run the algorithm up to sample number 7. This allows us to estimate the time required for two successes starting from each sample number (see Table 3). The MTSS is simply the average of these times, 432.4/7 ≈ 61.8.
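The procedure can be summarized in a few lines of code. The sketch below (ours, not the evaluation code used for [1]; the success tolerance is an assumed parameter) computes the MTSS from per-sample records of final cost and elapsed time, and reproduces the example of Table 3 below.

```python
# Mean time to second success (MTSS): for every starting sample index, accumulate
# elapsed time until two successes (runs reaching the best-known optimum) have been
# observed, then average over all starting indices for which two successes exist.
import numpy as np

def mtss(final_costs, elapsed_times, best_cost, tol=1e-3):
    costs = np.asarray(final_costs)
    times = np.asarray(elapsed_times)
    success = np.abs(costs - best_cost) < tol
    totals = []
    for start in range(len(costs)):
        idx = np.flatnonzero(success[start:])
        if len(idx) >= 2:                      # a 2nd success must exist after this start
            totals.append(times[start:start + idx[1] + 1].sum())
    return np.mean(totals) if totals else np.nan

# Data of Table 3 (best-known optimum 1.225).
costs = [1.523, 1.225, 1.647, 1.225, 1.52, 1.225, 1.225, 1.647, 1.225, 1.774]
times = [23.2, 15.1, 24.7, 19.5, 25.4, 16.3, 15.5, 21.2, 17.8, 2.0]
print(round(mtss(costs, times, best_cost=1.225), 1))   # 61.8
```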

| Sample no. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Final cost | 1.523 | 1.225 | 1.647 | 1.225 | 1.52 | 1.225 | 1.225 | 1.647 | 1.225 | 1.774 |
| Elapsed time (s) | 23.2 | 15.1 | 24.7 | 19.5 | 25.4 | 16.3 | 15.5 | 21.2 | 17.8 | 2.0 |
| Success | | S | | S | | S | S | | S | |
| Sample numbers of next 2 successes | 2 & 4 | 2 & 4 | 4 & 6 | 4 & 6 | 6 & 7 | 6 & 7 | 7 & 9 | N/A | N/A | N/A |
| Time required for 2 successes (s) | 82.5 | 59.3 | 85.9 | 61.2 | 57.2 | 31.8 | 54.5 | N/A | N/A | N/A |

Table 3: Example data used to compute the MTSS of an algorithm. We assume that each sample starts from randomly-drawn initial points and that the best-known optimum for this dataset is 1.225 after millions of trials. In this case, the MTSS is the average of the bottom-row values, which is 61.8.

6. Results

The supplementary package includes a set of csv files that were used to generate the results figures in [1]. They can be found in the <main>/results directory.

6.1. Limitations

Some results are incomplete for the following reasons:

1. RTRMC [3] fails to run on the FAC and Scu datasets, and has also failed on several occasions when running on other datasets. The suspected cause of these failures is the algorithm's use of the Cholesky decomposition, which cannot factorize matrices that are not positive definite.
2. The original Damped Wiberg code (TO DW) [7] fails to run on the FAC and Scu datasets. The way in which the code implements the QR decomposition requires the datasets to be trimmed in advance so that each column of the measurement matrix has at least as many visible entries as the estimated rank of the factorization model.
3. CSF [4] runs extremely slowly on the UGb dataset.
4. Our Ceres-implemented algorithms are not yet able to take in sparse measurement matrices, which is required to handle large datasets such as NET.

References

[1] Authors. Secrets of matrix factorization: Approximations, numerics and manifold optimization. 2015.
[2] Authors. Secrets of matrix factorization: Further derivations and comparisons. Technical report, further.pdf, 2015.
[3] N. Boumal and P.-A. Absil. RTRMC: A Riemannian trust-region method for low-rank matrix completion. In Advances in Neural Information Processing Systems 24 (NIPS 2011), pages 406-414, 2011.
[4] P. F. Gotardo and A. M. Martinez. Computing smooth time trajectories for camera and deformable shape in structure from motion with occlusion. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(10):2051-2065, Oct 2011.
[5] J. R. Magnus and H. Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. 3rd edition, 2007.
[6] T. P. Minka. Old and new matrix algebra useful for statistics. Technical report, Microsoft Research, 2000.
[7] T. Okatani, T. Yoshida, and K. Deguchi. Efficient algorithm for low-rank matrix factorization with missing components and performance comparison of latest algorithms. In Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV), pages 842-849, 2011.