Riemannian Pursuit for Big Matrix Recovery


1 Riemannian Pursuit for Big Matrix Recovery
Mingkui Tan 1, Ivor W. Tsang 2, Li Wang 3, Bart Vandereycken 4, Sinno Jialin Pan 5
1 School of Computer Science, University of Adelaide, Australia
2 Center for Quantum Computation & Intelligent Systems, UTS, Australia
3 Department of Mathematics, University of California, USA
4 Department of Mathematics, Princeton University, USA
5 Institute for Infocomm Research, Singapore
August 18. 1 / 36

2 Outline
- Introduction to Low Rank Matrix Recovery: Problem Formulation; Existing MR Methods; Riemannian Optimization on Fixed-rank Manifold
- Riemannian Pursuit for Matrix Recovery: Motivations; Our Contributions; Proposed Algorithm; Main Theoretical Results; Stopping Conditions of RP; Conjugate Gradient Descent on Manifold
- Experiments on Matrix Completion
- Conclusions
2 / 36

3 Notations
We will use the following notations:
- Given a linear operator A, denote its adjoint operator by A^*. For instance, if A is a matrix A, its adjoint operator is A^T.
- Let X = U diag(σ) V^T be the SVD of X ∈ R^{m×n}. The nuclear norm of X is defined as ||X||_* = ||σ||_1 = Σ_i σ_i.
- The condition number κ_r(X) of X w.r.t. a given number r is defined as κ_r(X) = σ_1 / σ_r.
A small numeric sketch of these quantities follows below.
3 / 36
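These quantities are straightforward to evaluate numerically. A minimal NumPy sketch, assuming a dense matrix small enough for a full SVD:

```python
import numpy as np

def nuclear_norm_and_condition(X, r):
    """Return ||X||_* = sum_i sigma_i and kappa_r(X) = sigma_1 / sigma_r."""
    sigma = np.linalg.svd(X, compute_uv=False)  # singular values, descending order
    return sigma.sum(), sigma[0] / sigma[r - 1]

X = np.random.randn(50, 40)
print(nuclear_norm_and_condition(X, r=5))
```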

4 Low Rank Assumption
- Recovering X̄ from partial observations b is impossible in general.
- Low-rank assumption: rank(X̄) ≤ r̄, where r̄ ≪ min(m, n).
- Faces of the same person lie on a low-dimensional manifold (Yan et al., 2007).
- In collaborative filtering tasks, items from the same group may have similar actions.
4 / 36

5 Problem Formulation
Given a linear operator A : R^{m×n} → R^l, let b = A(X̄) + e be l linear measurements of an unknown rank-r̄ matrix X̄ ∈ R^{m×n}, where e denotes noise. Matrix recovery tries to recover X̄ by solving
    min_X f(X),  s.t.  rank(X) ≤ r,    (1)
where l ≪ mn, r ≥ r̄, and f(X) = (1/2)||b − A(X)||_2^2.
The definition of the operator A depends on the specific application (a sketch for the matrix-completion case follows below):
- Matrix completion [Recht et al. (2010)]
- Quantum state tomography [Candès & Plan (2010a)]
- Matrix learning & factorizations [Laue (2012)]
- Low-rank structure learning or clustering [Deng et al. (2013)]
- ...
5 / 36
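For matrix completion, A simply samples the entries indexed by an observed set Ω. A minimal sketch of A, its adjoint A^*, and the objective f(X) = (1/2)||b − A(X)||_2^2 under that assumption; `rows`/`cols` are hypothetical index arrays describing Ω:

```python
import numpy as np

def A_op(X, rows, cols):
    """Sampling operator A for matrix completion: the entries of X observed at (rows, cols)."""
    return X[rows, cols]

def A_adj(xi, rows, cols, m, n):
    """Adjoint operator A*: scatter a length-l vector back to an m x n matrix supported on Omega."""
    Z = np.zeros((m, n))
    Z[rows, cols] = xi
    return Z

def f_obj(X, b, rows, cols):
    """Least-squares data fit f(X) = 0.5 * ||b - A(X)||_2^2."""
    res = b - A_op(X, rows, cols)
    return 0.5 * res @ res
```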

6 Existing MR Methods
MR by minimizing f(X) with the rank constraint rank(X) ≤ r is known to be NP-hard.
How to estimate r? Estimate rank(X̄) by cross-validation? A good idea, but how do we measure the performance of a specific parameter? Very difficult!
Existing methods:
1. Nuclear-norm convex relaxations [Candès & Plan (2010a)]: replace the rank constraint by ||X||_* ≤ v for some v > 0.
2. Fixed-rank methods (relaxations): assume rank(X) = r, where r is supposed to be known.
3. Other methods: max-norm based methods, p-norm non-convex methods (0 < p < 1), and so on.
6 / 36

7 Nuclear-norm Convex Relaxations
Two kinds of trace-norm relaxations are usually studied:
- Nuclear-norm minimization with an equality constraint:
    min_X ||X||_*,  s.t.  A(X) = b.    (2)
  Typical methods: Singular Value Thresholding (SVT) [Cai et al. (2010)]; Augmented Lagrangian method (ALM) [Lin et al. (2010)].
- Matrix lasso problem:
    min_{X,ξ} λ||X||_* + (1/2)||ξ||_2^2,  s.t.  ξ = b − A(X).    (3)
  Typical methods: Accelerated Proximal Gradient (APG) [Toh & Yun (2010)]; stochastic gradient methods.
7 / 36

8 Nuclear-norm Convex Relaxations
Advantages:
- A global solution can be obtained by convex optimization methods (but it may not be unique).
- Good RIP guarantees for exact matrix recovery, namely γ_r < 1/3 (Cai et al., 2013).
Disadvantages:
- Require many high-dimensional SVDs, which is very expensive for large-scale problems.
- Suffer from solution bias due to the nuclear-norm regularization [Vandereycken (2013)].
8 / 36

9 Challenges
The key is how to avoid high-dimensional SVDs and estimate the correct rank.
9 / 36

10 Fixed-rank Methods
The fixed-rank methods solve the following problem:
    min_X f(X),  s.t.  rank(X) = r,    (4)
where r is supposed to be known.
- Non-convex, since the constraint rank(X) = r defines a nonlinear smooth matrix manifold [Meyer et al. (2011)].
- Many efficient local-search methods have been proposed:
  - Greedy methods: Singular Value Projection (SVP) [Meka et al. (2009b)]; Atomic Decomposition for Minimum Rank Approximation (ADMiRA) [Lee & Bresler (2010)]; stochastic gradient methods [Wen et al. (2012)].
  - Manifold optimization methods: the low-rank geometric conjugate gradient method (LRGeomCG) [Vandereycken (2013)]; the quotient geometric matrix completion method (qGeomMC) [Mishra et al. (2012)].
10 / 36

11 Differential Geometry of Fixed-rank Matrices
Smooth manifold M_r of fixed-rank matrices:
    M_r = {X ∈ R^{m×n} : rank(X) = r} = {U diag(σ) V^T : U ∈ St_r^m, V ∈ St_r^n, ||σ||_0 = r}.
Stiefel manifold of m×r real, orthonormal matrices:
    St_r^m = {U ∈ R^{m×r} : U^T U = I}.
The tangent space T_X M_r of M_r at X = U diag(σ) V^T ∈ R^{m×n}:
    T_X M_r = {U M V^T + U_p V^T + U V_p^T : M ∈ R^{r×r}, U_p ∈ R^{m×r}, U_p^T U = 0, V_p ∈ R^{n×r}, V_p^T V = 0}.
11 / 36

12 Differential Geometry of Fixed-rank Matrices
Riemannian gradient of f on M_r: given a smooth function f : M_r → R, the Riemannian gradient is the orthogonal projection of the (Euclidean) gradient of f onto the tangent space.
Define P_U = U U^T and P_U^⊥ = I − U U^T for any U ∈ St_r^m. The orthogonal projection of any Z ∈ R^{m×n} onto the tangent space at X = U diag(σ) V^T is
    E = P_{T_X M_r}(Z) = P_U Z P_V + P_U^⊥ Z P_V + P_U Z P_V^⊥.    (5)
Retraction: the retraction mapping takes a point X + E in the tangent space back to the manifold:
    R_X(E) = P_{M_r}(X + E) = Σ_{i=1}^r σ_i p_i q_i^T,    (6)
where Σ_{i=1}^r σ_i p_i q_i^T denotes the best rank-r approximation to X + E by the SVD. The retraction can be computed with O((m + n)r^2) cost (a dense-matrix sketch of both maps follows below).
12 / 36
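A minimal NumPy sketch of the projection (5) and the retraction (6). For clarity it works with dense m x n matrices; an efficient implementation would keep everything in factored form to reach the stated O((m + n)r^2) cost:

```python
import numpy as np

def project_tangent(Z, U, V):
    """Orthogonal projection of Z onto T_X M_r at X = U diag(sigma) V^T, Eq. (5):
    P_U Z P_V + P_U_perp Z P_V + P_U Z P_V_perp."""
    PU, PV = U @ U.T, V @ V.T
    PU_perp = np.eye(U.shape[0]) - PU
    PV_perp = np.eye(V.shape[0]) - PV
    return PU @ Z @ PV + PU_perp @ Z @ PV + PU @ Z @ PV_perp

def retract(Y, r):
    """Retraction R_X(E) = P_{M_r}(X + E), Eq. (6): best rank-r approximation of Y = X + E via the SVD."""
    P, s, Qt = np.linalg.svd(Y, full_matrices=False)
    return P[:, :r] @ np.diag(s[:r]) @ Qt[:r, :]
```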

13 Fixed-rank Methods
Advantages of fixed-rank methods:
- Superior scalability compared with the nuclear-norm based methods [Boumal & Absil (2012), Mishra et al. (2012), Vandereycken (2013)].
- Greedy methods need only a truncated SVD of known low rank.
- Some manifold optimization methods involve simple matrix products, which is particularly important for parallel computing.
Disadvantages:
- How do we know the rank r? It is nontrivial.
- Greedy methods require restrictive conditions to converge.
- Convergence issues: if X̄ is ill-conditioned, existing fixed-rank methods may converge very slowly [Ngo & Saad (2012)].
13 / 36

14 Motivations
Fixed-rank methods have achieved great success in solving big MR problems, but they require explicit knowledge of the rank r.
Three questions:
1. Can we avoid the rank estimation by iteratively increasing the rank by a fixed integer ρ?
2. If we use this procedure, how can we stop it?
3. Does this procedure help to avoid the convergence issues on ill-conditioned problems?
14 / 36

15 Main Contributions
- We propose a Riemannian Pursuit (RP) method, which iteratively increases the rank by ρ and essentially solves a sequence of fixed-rank minimization problems.
- RP converges linearly under mild conditions.
- RP can effectively address the convergence issues that occur with ill-conditioned and large-rank problems.
- RP can automatically estimate the rank under proper stopping conditions.
15 / 36

16 Active Subspace Search
Let G = A^*(A(X) − b). The gradient of f(X) on M_r can be calculated by
    grad f(X) = P_{T_X M_r}(G).    (7)
The gradient direction orthogonal to T_X M_r is Q = G − P_{T_X M_r}(G).
- Select a rank-ρ subspace from Q.
- Add this subspace to the current manifold.
16 / 36

17 Riemannian Pursuit
Algorithm 1 Riemannian Pursuit for MR.
1: Input: inner iteration tolerance ε_in and outer iteration tolerance ε_out. Initialize X^0 = 0, ξ^0 = b and G = A^*(ξ^0). Let t = 1 and r = ρ.
2: Perform an active-subspace search as follows.
   (2a): Compute Q = G − P_{T_X M_r}(G).
   (2b): Compute a best rank-ρ approximation of Q: H_2^{t−1} = U_ρ diag(σ_ρ) V_ρ^T.
3: Let H_1^{t−1} = P_{T_X M_r}(G) and H^{t−1} = H_1^{t−1} + H_2^{t−1}.
   (3a): Choose a proper step size τ_t and set X_initial = R_X(τ_t H^{t−1}). (Warm start)
   (3b): Update X^t by X^t = NRCG(X_initial, ε_in).
4: Update r = r + ρ, ξ^t = b − A(X^t) and G = A^*(ξ^t).
5: Quit if the stopping conditions are achieved; otherwise, let t = t + 1 and go to Step 2.
A minimal code sketch of this outer loop follows below.
17 / 36
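The following is a minimal sketch of the RP outer loop, under the assumptions of the earlier snippets: dense matrices, the matrix-completion operators `A_op`/`A_adj`, and the `project_tangent`/`retract` helpers. The gradient convention follows slide 16 (G = A^*(A(X) − b), so the warm start moves along −τ G), a fixed τ replaces the step-size rule, the stopping test uses the averaged-decrease rule of Eq. (8) introduced on slide 19, and a simple SVP-style projected-gradient routine stands in for the NRCG subproblem solver of Algorithm 3.

```python
import numpy as np

def svp_stand_in(X0, b, rows, cols, r, iters=100, step=1.0):
    """Stand-in for NRCG (Step 3b): a few projected-gradient (SVP-style) iterations at rank r."""
    X = X0
    for _ in range(iters):
        grad = A_adj(A_op(X, rows, cols) - b, rows, cols, *X.shape)
        X = retract(X - step * grad, r)
    return X

def riemannian_pursuit(b, rows, cols, m, n, rho=4, tau=1.0, eps_out=1e-4, max_outer=20):
    """Sketch of Algorithm 1 (Riemannian Pursuit) with the assumptions stated above."""
    X = np.zeros((m, n))
    f_prev = 0.5 * b @ b                 # f(X^0) with X^0 = 0
    r = rho
    for t in range(1, max_outer + 1):
        G = A_adj(A_op(X, rows, cols) - b, rows, cols, m, n)   # G = A*(A(X) - b), slide 16
        if t == 1:
            H1 = np.zeros((m, n))        # X^0 = 0, so there is no tangent component yet
        else:
            U, _, Vt = np.linalg.svd(X, full_matrices=False)
            rank_X = r - rho
            H1 = project_tangent(G, U[:, :rank_X], Vt[:rank_X, :].T)
        Q = G - H1                       # component orthogonal to the tangent space (active-subspace search)
        P, s, Qt = np.linalg.svd(Q, full_matrices=False)
        H2 = P[:, :rho] @ np.diag(s[:rho]) @ Qt[:rho, :]       # best rank-rho approximation of Q
        X_init = retract(X - tau * (H1 + H2), r)               # warm start, Step (3a)
        X = svp_stand_in(X_init, b, rows, cols, r)             # Step (3b): fixed-rank subproblem
        f_cur = 0.5 * np.sum((b - A_op(X, rows, cols)) ** 2)
        if 2.0 * (f_prev - f_cur) / (rho * (b @ b)) <= eps_out:  # stopping rule, Eq. (8)
            return X, r
        f_prev, r = f_cur, r + rho
    return X, r
```

The warm start is the design point worth noting: each new fixed-rank subproblem is initialized from the previous solution enlarged along the most important directions of Q, rather than from scratch.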

18 Main Theoretical Results
Lemma 1: Let {X^t} be the sequence generated by RP. Then f(X^t) decreases monotonically w.r.t. t.
Theorem 1: Let {X^t} be the sequence generated by RP. As long as f(X^t) > C f(X̄) = (C/2)||e||_2^2 (where C > 1), and there exists an integer ι > 0 such that γ_{r̄+2ιρ} < 1/2, then RP decreases linearly in objective value when t < ι, namely f(X^{t+1}) ≤ ν f(X^t), where
    ν = 1 − (ρζ / (2r̄)) · ( C(1 − 2γ_{r̄+2ιρ})^2 / ((√C + 1)^2 (1 − γ_{r̄+2ιρ})) ) · (1 − 1/√C)^2.
18 / 36

19 Stopping Conditions of RP
RP monotonically decreases the objective value. Without proper stopping conditions, it may keep increasing the rank until tρ = min(m, n), leading to over-fitting.
To avoid this, we propose to use the averaged function-value difference between iterations as the stopping condition, namely
    2(f(X^{t−1}) − f(X^t)) / (ρ ||b||_2^2) ≤ ε_out,    (8)
where ε_out is a predefined tolerance value.
If increasing the rank does not significantly reduce the objective value, we stop to avoid over-fitting. When the procedure stops, a rank is returned!
19 / 36

20 Riemannian Pursuit
Algorithm 2 (Riemannian Pursuit for MR) restates Algorithm 1 from slide 17 for reference.
20 / 36

21 Fixed-rank Subproblem
Step (3b) (update X^t by X^t = NRCG(X_initial, ε_in)) solves the following problem:
    min_X f(X),  s.t.  rank(X) = tρ.    (9)
- ρ is much smaller than r. When t is small, tρ < r.
- Smaller condition number: κ_{tρ}(X) = σ_1 / σ_{tρ} < κ_r(X) (a small numeric illustration follows below).
- Faster convergence with a small condition number.
- The selection of ρ is important. Setting ρ = 1 is the simplest way, but ρ > 1 is better.
21 / 36
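A tiny numeric illustration of this point, using a hypothetical heavy-tailed spectrum in the spirit of the χ² toy setting from the experiments: κ_k = σ_1/σ_k grows quickly with k, so the low-rank subproblems solved early on are much better conditioned than the full-rank problem.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = np.sort(1000.0 * rng.chisquare(df=1, size=20))[::-1]  # hypothetical chi^2-style spectrum, 20 singular values
for k in (4, 8, 20):
    print(k, sigma[0] / sigma[k - 1])  # kappa_k = sigma_1 / sigma_k increases with k
```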

22 Nonlinear Conjugate Gradient Descent
Figure: Basic elements of the Riemannian manifold of matrices.
22 / 36

23 Nonlinear Conjugate Gradient Descent
The nonlinear Riemannian conjugate gradient descent algorithm is as follows (a code sketch follows below):
Algorithm 3 NRCG for solving the fixed-rank minimization problem.
1: Initialize X_1 = X_initial and ζ_0 = 0. Let k = 1.
2: Compute the Riemannian gradient E_k = grad f(X_k).
3: Compute the conjugate direction ζ_k according to ζ_k = −E_k + β_k T_{X_{k−1}→X_k}(ζ_{k−1}).
4: Choose a step size θ_k satisfying the strong Wolfe conditions, and set X_{k+1} = R_{X_k}(θ_k ζ_k).
5: Terminate and output X_{k+1} if the stopping conditions are achieved; otherwise, let k = k + 1 and go to Step 2.
23 / 36
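A schematic NumPy sketch of Algorithm 3, reusing the `A_op`/`A_adj`/`project_tangent`/`retract` helpers above. Two simplifications are assumptions on my part: a fixed step size replaces the strong Wolfe line search, and the vector transport T_{X_{k−1}→X_k} is realised by projecting the previous direction onto the new tangent space; β_k follows the Fletcher-Reeves rule. This routine could replace the SVP-style stand-in used in the RP sketch.

```python
import numpy as np

def nrcg(X0, b, rows, cols, r, eps_in=1e-2, step=1.0, max_iter=200):
    """Schematic NRCG for min f(X) s.t. rank(X) = r (Algorithm 3)."""
    X, zeta, f_prev, g_sq_prev = X0, np.zeros_like(X0), None, None
    for _ in range(max_iter):
        U, _, Vt = np.linalg.svd(X, full_matrices=False)
        U, V = U[:, :r], Vt[:r, :].T
        egrad = A_adj(A_op(X, rows, cols) - b, rows, cols, *X.shape)
        grad = project_tangent(egrad, U, V)                  # Riemannian gradient, Eq. (7)
        g_sq = np.sum(grad * grad)
        beta = 0.0 if g_sq_prev is None else g_sq / g_sq_prev  # Fletcher-Reeves rule
        zeta = -grad + beta * project_tangent(zeta, U, V)    # transported conjugate direction, Eq. (11)
        X = retract(X + step * zeta, r)                      # fixed step in place of the strong Wolfe search
        f_cur = 0.5 * np.sum((b - A_op(X, rows, cols)) ** 2)
        if f_prev is not None and (f_prev - f_cur) / f_prev <= eps_in:  # stopping rule, Eq. (12)
            break
        f_prev, g_sq_prev = f_cur, g_sq
    return X
```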

24 Experiments
- Toy experiments for convergence comparison
- Real-world experiments on collaborative filtering tasks
24 / 36

25 Toy Experimental Settings
Methods for comparison: LRGeomCG, qGeomMC, ScGrassMC, SVP, ADMiRA, SpaRCS, GECO and APG. Except for GECO and APG, all are fixed-rank methods.
Toy experimental setting: we generate the ground truth X̄ by X̄ = Û diag(σ̄) V̂^T + e, where σ̄ is an r̄-sparse vector, Û ∈ St_r̄^m, and V̂ ∈ St_r̄^n.
Two kinds of singular values are studied (a generation sketch follows below):
1) a Gaussian sparse singular-value vector s_g sampled from the Gaussian distribution N(0, 1000);
2) a χ² sparse singular-value vector s_{χ²}, where each entry is the square of the corresponding entry of s_g.
25 / 36
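A sketch of the toy ground-truth generation described above. The matrix sizes, the use of absolute values to make the Gaussian draws valid singular values, and reading N(0, 1000) as variance 1000 are assumptions on my part; the noise term e is omitted for brevity.

```python
import numpy as np

def random_stiefel(m, r, rng):
    """A random element of St_r^m: an m x r matrix with orthonormal columns."""
    Q, _ = np.linalg.qr(rng.standard_normal((m, r)))
    return Q

def toy_ground_truth(m, n, r_bar, spectrum="gaussian", rng=None):
    """Ground truth X_bar = U_hat diag(sigma_bar) V_hat^T with a Gaussian or chi^2 singular-value profile."""
    rng = rng or np.random.default_rng(0)
    s_g = rng.normal(0.0, np.sqrt(1000.0), size=r_bar)           # s_g ~ N(0, 1000)
    sigma = np.abs(s_g) if spectrum == "gaussian" else s_g ** 2   # chi^2 profile: entrywise square of s_g
    U, V = random_stiefel(m, r_bar, rng), random_stiefel(n, r_bar, rng)
    return U @ np.diag(sigma) @ V.T

X_bar = toy_ground_truth(1000, 1000, r_bar=20, spectrum="chi2")
```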

26 Toy Experiments for Convergence Comparison
Figure: Convergence of the comparison methods (GECO, ScGrassMC, LRGeomCG, LMaFit, qGeomMC, and RP with η = 1.00, 0.75, 0.65, 0.55) on s_g and s_{χ²}; relative objective value vs. time in seconds. (a) On s_g, ρ for the different η's is 1, 8, 14, 18, respectively. (b) On s_{χ²}, ρ for the different η's is 1, 4, 6, 12, respectively.
- RP with different ρ converges well on ill-conditioned problems. Other algorithms have convergence issues.
- GECO cannot converge within 1 hour on s_g, and we omit its results on s_{χ²}.
- ScGrassMC runs into numerical problems after 50 iterations on s_{χ²}.
26 / 36

27 Experiments on Real-world Collaborative Filtering Tasks
Table: Experimental results on real-world datasets: RMSE and running time in seconds on Movie-10M and Netflix-100M for APG, LRGeomCG, qGeomMC, LMaFit, LMaFit-A and RP. The result of APG on Netflix is absent. The average ranks estimated by APG, LMaFit-A and RP on Movie-10M are 100, 77 and 10, respectively. The average ranks estimated by LMaFit-A and RP on Netflix are 81 and 12, respectively.
- RP performs best among all the methods in terms of RMSE and computational efficiency.
- We use the rank returned by RP as the rank for LRGeomCG, qGeomMC and LMaFit; thus RP is much faster if the model selection cost is taken into account.
27 / 36

28 Comparison on MovieLens with 10M ratings
Table: Performance comparison on the Movie-10M dataset (time in seconds, speed-up of RP over each method, RMSE, and CPU clock in GHz).
- GECO: 784,941 s (RP speed-up 9,000x)
- Laue: 2,663 s (RP speed-up 30x)
- Jaggi: 3,120 s (RP speed-up 38x)
- RP: reference
28 / 36

29 Conclusions
- We propose Riemannian Pursuit (RP) for tackling big matrix recovery problems.
- By exploiting the Riemannian geometry of the fixed-rank manifolds, high-dimensional SVDs in the master problem of RP are avoided, which gives good scalability for large-scale problems.
- RP converges linearly under mild conditions.
- RP can automatically estimate the rank and effectively address the convergence issues on ill-conditioned and large-rank matrices.
- Extensive experimental results show that RP achieves superb scalability on big matrices using a single PC, while maintaining similar or better MR performance.
29 / 36

30 Thank you for your attention! Our Poster is at T4! 30 / 36

31 Conjugate Gradient Descent on Manifold
Steepest gradient descent is in general very slow; conjugate gradient descent is a good choice.
Let ζ_{k−1} be the search direction at the (k−1)-th iteration. We compute the conjugate search direction
    ζ_k = −grad f(X_k) + β_k ζ_{k−1},    (10)
where β_k can be calculated by the Fletcher-Reeves (F-R) rule.
But notice: grad f(X_k) and β_k ζ_{k−1} lie in different tangent spaces! Therefore, Eq. (10) is not valid. We still need one more operator: the vector transport.
31 / 36

32 Conjugate Gradient Descent on Manifold
A vector transport T on a manifold M_r is a smooth map that transports tangent vectors from one tangent space to another [Absil et al. (2008)]:
    T : T M_r ⊕ T M_r → T M_r : (ζ_X, ν_X) ↦ T_{ζ_X}(ν_X),
satisfying the following properties for all X ∈ M_r:
- There exists a retraction R associated with T such that, for all (ζ_X, ν_X), it holds that T_{ζ_X}(ν_X) ∈ T_{R_X(ζ_X)} M_r.
- T_{0_X}(ν_X) = ν_X for all ν_X ∈ T_X M_r.
- T_{ζ_X}(a ν_X + b ω_X) = a T_{ζ_X}(ν_X) + b T_{ζ_X}(ω_X) for all ν_X, ω_X ∈ T_X M_r.
A rather tedious definition, but the computation takes only O((m + n)r^2) cost.
32 / 36

33 Conjugate Gradient Descent on Manifold
With the vector transport T, the conjugate search direction can be calculated by
    ζ_k = −grad f(X_k) + β_k T_{X_{k−1}→X_k}(ζ_{k−1}).    (11)
Figure: Computation of the conjugate search direction.
33 / 36

34 Other Issues in NRCG
Line search for the step size θ:
- Armijo line search: convergence cannot be guaranteed.
- Strong Wolfe line search: convergence can be guaranteed.
Initialization with warm start: set X_1 = X_initial, where X_initial comes from Algorithm 1.
34 / 36

35 Riemannian Pursuit
Algorithm 4 (Riemannian Pursuit for MR) restates Algorithm 1 from slide 17 for reference.
35 / 36

36 Other Issues in NRCG
Line search for the step size θ:
- Armijo line search: convergence cannot be guaranteed.
- Strong Wolfe line search: convergence can be guaranteed.
Warm start for initialization: set X_1 = X_initial, where X_initial is from Algorithm 1.
Stopping condition of NRCG:
    (f(X_{k−1}) − f(X_k)) / f(X_{k−1}) ≤ ε_in.    (12)
There is no need to solve the fixed-rank problem with high precision! We set ε_in = 0.01 in practice.
All these techniques are important for efficiency. Please find more details in the paper.
36 / 36

37 References
Absil, P.-A., Mahony, R., and Sepulchre, R. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
Boumal, N. and Absil, P.-A. RTRMC: A Riemannian trust-region method for low-rank matrix completion. In NIPS.
Cai, J.-F., Candès, E. J., and Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM J. on Optim., 20(4), 2010.
Candès, E. J. and Plan, Y. Tight oracle bounds for low-rank matrix recovery from a minimal number of random measurements. IEEE Trans. on Inform. Theory, 57(4), 2010a.
Candès, E. J. and Plan, Y. Matrix completion with noise. Proceedings of the IEEE, 98(6), 2010b.
Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9, 2009.
Deng, Y., Dai, Q., Liu, R., Zhang, Z., and Hu, S. Low-rank structure learning via nonconvex heuristic recovery. IEEE Trans. Neural Netw. Learning Syst., 24(3), 2013.
Donoho, D. L., Tsaig, Y., Drori, I., and Starck, J.-L. Sparse solution of underdetermined systems of linear equations by stagewise orthogonal matching pursuit. IEEE Trans. Info. Theory, 58(2).
Fazel, M. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.
Golub, G. H. and Van Loan, C. F. Matrix Computations. JHU Press, 3rd edition, 1996.
Hazan, E. Sparse approximate solutions to semidefinite programs. In LATIN.
Herlocker, J. L., Konstan, J. A., Borchers, A., and Riedl, J. An algorithmic framework for performing collaborative filtering. In SIGIR.
Jaggi, M. and Sulovský, M. A simple algorithm for nuclear norm regularized problems. In ICML, 2010.
KDDCup. ACM SIGKDD and Netflix. In Proceedings of KDD Cup and Workshop.
Keshavan, R. H., Montanari, A., and Oh, S. Matrix completion from a few entries. IEEE Trans. on Info. Theory, 56, 2010a.
Keshavan, R. H., Montanari, A., and Oh, S. Matrix completion from noisy entries. JMLR, 99, 2010b.
Laue, S. A hybrid algorithm for convex semidefinite optimization. In ICML, 2012.
Lee, K. and Bresler, Y. ADMiRA: Atomic decomposition for minimum rank approximation. IEEE Trans. on Inform. Theory, 56(9), 2010.
Lin, Z., Chen, M., and Ma, Y. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. Technical report, UIUC, 2010.
Lin, Z., Liu, R., and Su, Z. Linearized alternating direction method with adaptive penalty for low-rank representation. arXiv preprint.
Meka, R., Jain, P., and Dhillon, I. S. Guaranteed rank minimization via singular value projection. Technical report, 2009a.
Meka, R., Jain, P., and Dhillon, I. S. Guaranteed rank minimization via singular value projection. In NIPS, 2009b.
Meyer, G., Bonnabel, S., and Sepulchre, R. Linear regression under fixed-rank constraints: A Riemannian approach. In ICML, 2011.
Mishra, B., Apuroop, K. A., and Sepulchre, R. A Riemannian geometry for low-rank matrix completion. Technical report, 2012.
Mishra, B., Meyer, G., Bach, F., and Sepulchre, R. Low-rank optimization with trace norm penalty. SIAM J. Optim., 23(4), 2013.
Mitra, K., Sheorey, S., and Chellappa, R. Large-scale matrix factorization with missing data under additional constraints. In NIPS.
Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. JMLR, 13, 2012.
Ngo, T. T. and Saad, Y. Scaled gradients on Grassmann manifolds for matrix completion. In NIPS, 2012.
Recht, B. A simpler approach to matrix completion. JMLR, 2011.
Recht, B., Fazel, M., and Parrilo, P. A. Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization. SIAM Rev., 52(3), 2010.
Ring, W. and Wirth, B. Optimization methods on Riemannian manifolds and their application to shape space. SIAM J. Optim., 22(2), 2012.
Sato, H. and Iwai, T. A new, globally convergent Riemannian conjugate gradient method. Optimization: A Journal of Mathematical Programming and Operations Research, ahead-of-print, 1-21.
Selvan, S. E., Amato, U., Gallivan, K. A., Qi, C., Carfora, M. F., Larobina, M., and Alfano, B. Descent algorithms on oblique manifold for source-adaptive ICA contrast. IEEE Trans. Neural Netw. Learning Syst., 23(12), 2012.
Shalit, U., Weinshall, D., and Chechik, G. Online learning in the embedded manifold of low-rank matrices. JMLR, 13, 2012.
Shalev-Shwartz, S., Gonen, A., and Shamir, O. Large-scale convex minimization with a low-rank constraint. In ICML, 2011.
Toh, K.-C. and Yun, S. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pac. J. Optim., 6, 2010.
Vandereycken, B. Low-rank matrix completion by Riemannian optimization. SIAM J. Optim., 23(2):1214-1236, 2013.
Waters, A. E., Sankaranarayanan, A. C., and Baraniuk, R. G. SpaRCS: Recovering low-rank and sparse matrices from compressive measurements. In NIPS, 2011.
Wen, Z., Yin, W., and Zhang, Y. Solving a low-rank factorization model for matrix completion by a non-linear successive over-relaxation algorithm. Math. Program. Comput., 4(4), 2012.
Yang, J. and Yuan, X. Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization. Mathematics of Computation, 82(281), 2013.
