Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

1 Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms François Caron Department of Statistics, Oxford STATLEARN 2014, Paris April 7, 2014 Joint work with Adrien Todeschini, Marie Chavent (INRIA, U. of Bordeaux) F. Caron 1 / 44

2 Outline Introduction Complete case Matrix completion Experiments Conclusion and perspectives F. Caron 2 / 44

3 Matrix Completion Netflix prize: 480k users and 18k movies providing 1-5 ratings; 99% of the ratings are missing. Objective: predict missing entries in order to make recommendations. [Illustration: users × movies rating matrix] F. Caron 3 / 44

4 Matrix Completion Objective Complete a matrix X of size m × n from a subset of its entries Applications Recommender systems Image inpainting Imputation of missing data F. Caron 4 / 44

5 Matrix Completion Potentially large matrices (each dimension of order ) Very sparsely observed (1%-10%) F. Caron 5 / 44

6 Low rank Matrix Completion Assume that the complete matrix Z is of low rank: $\underbrace{Z}_{m \times n} \simeq \underbrace{A}_{m \times k} \underbrace{B^T}_{k \times n}$ with $k \ll \min(m, n)$. F. Caron 6 / 44

7 Low rank Matrix Completion Assume that the complete matrix Z is of low rank: $\underbrace{Z}_{m \times n} \simeq \underbrace{A}_{m \times k} \underbrace{B^T}_{k \times n}$ with $k \ll \min(m, n)$. [Illustration: users × items matrix factorized through latent features] F. Caron 6 / 44

8 Low rank Matrix Completion Let $\Omega \subset \{1,\dots,m\} \times \{1,\dots,n\}$ be the subset of observed entries. For $(i, j) \in \Omega$, $X_{ij} = Z_{ij} + \varepsilon_{ij}$, $\varepsilon_{ij} \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$ where $\sigma^2 > 0$. F. Caron 7 / 44

9 Low rank Matrix Completion Optimization problem $\operatorname*{minimize}_Z \; \underbrace{\frac{1}{2\sigma^2} \sum_{(i,j)\in\Omega} (X_{ij} - Z_{ij})^2}_{\text{loglikelihood}} + \underbrace{\lambda\,\mathrm{rank}(Z)}_{\text{penalty}}$ where $\lambda > 0$ is some regularization parameter. Non-convex; computationally hard for a general subset $\Omega$. F. Caron 8 / 44

10 Low rank Matrix Completion Matrix completion with nuclear norm penalty $\operatorname*{minimize}_Z \; \underbrace{\frac{1}{2\sigma^2} \sum_{(i,j)\in\Omega} (X_{ij} - Z_{ij})^2}_{\text{loglikelihood}} + \underbrace{\lambda \|Z\|_*}_{\text{penalty}}$ where $\|Z\|_*$ is the nuclear norm of Z, i.e. the sum of the singular values of Z. Convex relaxation of the rank penalty optimization [Fazel, 2002, Candès and Recht, 2009, Candès and Tao, 2010] F. Caron 9 / 44

11 Low rank Matrix Completion Soft-Impute algorithm Start with an initial matrix Z (0) At each iteration t = 1, 2,... Replace the missing elements in X with those in Z (t 1) Perform a soft-thresholded SVD on the completed matrix, with shrinkage λ to obtain the low rank matrix Z (t) [Mazumder et al., 2010] F. Caron 10 / 44
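As a concrete illustration, here is a minimal NumPy sketch of the Soft-Impute iteration described above. The helper names, the zero initialization and the stopping rule are illustrative choices, not the authors' released code.

```python
import numpy as np

def soft_thresholded_svd(M, lam):
    """SVD of M with every singular value shrunk by lam and clipped at zero."""
    U, d, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(d - lam, 0.0)) @ Vt

def soft_impute(X, mask, lam, n_iter=200, tol=1e-5):
    """Soft-Impute sketch. X: data matrix; mask: boolean array, True on observed entries."""
    Z = np.zeros_like(X, dtype=float)
    for _ in range(n_iter):
        X_filled = np.where(mask, X, Z)            # replace missing entries with current estimate
        Z_new = soft_thresholded_svd(X_filled, lam)
        if np.linalg.norm(Z_new - Z, 'fro') <= tol * max(np.linalg.norm(Z, 'fro'), 1.0):
            return Z_new
        Z = Z_new
    return Z
```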

12 Low rank Matrix Completion Soft-Impute algorithm Soft-thresholded SVD yields a low-rank representation Each iteration decreases the value of the nuclear norm objective function towards its minimum Various strategies proposed to scale the algorithm to problems where n, m are of order $10^6$ Same shrinkage applied to all singular values [Mazumder et al., 2010] F. Caron 11 / 44

13 Contributions Probabilistic interpretation of the nuclear norm objective function: Maximum A Posteriori estimation assuming exponential priors on the singular values; Soft-Impute = Expectation-Maximization algorithm Construction of alternative non-convex objective functions building on hierarchical priors, bridging the gap between the rank penalty and the nuclear norm penalty EM: adaptive algorithm that iteratively adjusts the shrinkage coefficients for each singular value, similar to the adaptive lasso in multivariate regression Numerical results show the benefits of the approach on various datasets F. Caron 12 / 44

14 Outline Introduction Complete case Hierarchical adaptive spectral penalty EM algorithm for MAP estimation Matrix completion Experiments Conclusion and perspectives F. Caron 13 / 44

15 Nuclear Norm penalty Complete matrix X. Nuclear norm objective function $\operatorname*{minimize}_Z \; \frac{1}{2\sigma^2} \|X - Z\|_F^2 + \lambda \|Z\|_*$ where $\|\cdot\|_F$ is the Frobenius norm. Global solution given by a soft-thresholded SVD $\widehat{Z} = S_{\lambda\sigma^2}(X)$ where $S_\lambda(X) = \widetilde{U} \widetilde{D}_\lambda \widetilde{V}^T$ with $\widetilde{D}_\lambda = \mathrm{diag}\left((\widetilde{d}_1 - \lambda)_+, \dots, (\widetilde{d}_r - \lambda)_+\right)$ and $t_+ = \max(t, 0)$. [Cai et al., 2010, Mazumder et al., 2010] F. Caron 14 / 44
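A small numerical sanity check of this closed form, assuming the operator and objective are implemented as below (the function names and the random perturbation test are illustrative, not part of the talk): the soft-thresholded SVD $S_{\lambda\sigma^2}(X)$ should not be improved by nearby perturbations.

```python
import numpy as np

def nuclear_norm_objective(X, Z, lam, sigma2):
    """(1 / 2 sigma^2) ||X - Z||_F^2 + lam * ||Z||_*"""
    return (np.linalg.norm(X - Z, 'fro') ** 2) / (2 * sigma2) \
        + lam * np.linalg.svd(Z, compute_uv=False).sum()

def soft_svd(X, threshold):
    """S_threshold(X): shrink every singular value of X by `threshold` and clip at zero."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(d - threshold, 0.0)) @ Vt

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 20))
lam, sigma2 = 2.0, 1.0
Z_hat = soft_svd(X, lam * sigma2)                 # closed-form global minimizer
best = nuclear_norm_objective(X, Z_hat, lam, sigma2)
for _ in range(5):
    Z_pert = Z_hat + 1e-2 * rng.standard_normal(X.shape)
    assert nuclear_norm_objective(X, Z_pert, lam, sigma2) >= best
```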

16 Nuclear Norm penalty Maximum A Posteriori (MAP) estimate $\widehat{Z} = \arg\max_Z \left[\log p(X \mid Z) + \log p(Z)\right]$ under the prior $p(Z) \propto \exp\left(-\lambda \|Z\|_*\right)$ where $Z = U D V^T$ with $D = \mathrm{diag}(d_1, d_2, \dots, d_r)$, U and V given iid Haar uniform priors on unitary matrices, and $d_i \overset{iid}{\sim} \mathrm{Exp}(\lambda)$. F. Caron 15 / 44

17 Hierarchical adaptive spectral penalty Each singular value has its own random shrinkage coefficient. Hierarchical model, for each singular value $i = 1, \dots, r$: $d_i \mid \gamma_i \sim \mathrm{Exp}(\gamma_i)$, $\gamma_i \sim \mathrm{Gamma}(a, b)$. Marginal distribution over $d_i$: $p(d_i) = \int_0^\infty \mathrm{Exp}(d_i; \gamma_i)\, \mathrm{Gamma}(\gamma_i; a, b)\, d\gamma_i = \frac{a\, b^a}{(d_i + b)^{a+1}}$, a Pareto distribution with heavier tails than the exponential distribution. [Todeschini et al., 2013] F. Caron 16 / 44
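A short sketch that evaluates this closed-form marginal and checks the exponential-gamma mixture by Monte Carlo; the values of a and b below are arbitrary and chosen only for illustration.

```python
import numpy as np

def hasp_marginal_density(d, a, b):
    """Closed-form marginal p(d) = a * b**a / (d + b)**(a + 1) (Pareto-type density)."""
    return a * b**a / (d + b) ** (a + 1)

# Monte-Carlo check of the mixture d | gamma ~ Exp(gamma), gamma ~ Gamma(a, b) (rate b):
rng = np.random.default_rng(1)
a, b = 2.0, 3.0                                              # arbitrary illustrative values
gamma = rng.gamma(shape=a, scale=1.0 / b, size=500_000)      # rate b   <->  scale 1/b
d_samples = rng.exponential(scale=1.0 / gamma)               # rate gamma <-> scale 1/gamma
t = 1.5
print((d_samples > t).mean(), (b / (b + t)) ** a)            # empirical vs closed-form P(d > t)
```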

18 Hierarchical adaptive spectral penalty [Figure: Marginal distribution $p(d_i)$ with $a = b = \beta$, for several values of β] HASP penalty: $\mathrm{pen}(Z) = -\log p(Z) = (a + 1) \sum_{i=1}^r \log(b + d_i)$ Admits as a special case the nuclear norm penalty $\lambda \|Z\|_*$ when $a = \lambda b$ and $b \to \infty$. F. Caron 17 / 44

19 Hierarchical adaptive spectral penalty [Figure: Top: manifolds of constant penalty for a symmetric 2×2 matrix Z = [x, y; y, z] for (a) the nuclear norm, the hierarchical adaptive spectral penalty with a = b = β for (b) β = 1 and (c) β = 0.1, and (d) the rank penalty. Bottom: contours of constant penalty for a diagonal matrix [x, 0; 0, z], where one recovers the classical (e) lasso, (f-g) hierarchical adaptive lasso (HAL) and (h) $\ell_0$ penalties.] F. Caron 18 / 44

20 EM algorithm for MAP estimation Expectation Maximization (EM) algorithm to obtain a MAP estimate $\widehat{Z} = \arg\max_Z \left[\log p(X \mid Z) + \log p(Z)\right]$, i.e. to minimize $L(Z) = \frac{1}{2\sigma^2} \|X - Z\|_F^2 + (a + 1) \sum_{i=1}^r \log(b + d_i)$. F. Caron 19 / 44

21 EM algorithm for MAP estimation Latent variables: $\gamma = (\gamma_1, \dots, \gamma_r)$. E step: $Q(Z, Z^*) = \mathbb{E}\left[\log p(X, Z, \gamma) \mid Z^*, X\right] = C - \frac{1}{2\sigma^2} \|X - Z\|_F^2 - \sum_{i=1}^r \omega_i d_i$ where $\omega_i = \mathbb{E}[\gamma_i \mid d_i^*] = \frac{a+1}{b + d_i^*}$. F. Caron 20 / 44

22 EM algorithm for MAP estimation M step: $\operatorname*{minimize}_Z \; \frac{1}{2\sigma^2} \|X - Z\|_F^2 + \sum_{i=1}^r \omega_i d_i \quad (1)$ Problem (1) is an adaptive spectral penalty regularized optimization problem, with weights $\omega_i = \frac{a+1}{b + d_i^*}$ satisfying $d_1^* \geq d_2^* \geq \dots \geq d_r^* \geq 0 \;\Rightarrow\; \omega_1 \leq \omega_2 \leq \dots \leq \omega_r \quad (2)$ Given condition (2), the solution is given by a weighted soft-thresholded SVD $\widehat{Z} = S_{\sigma^2 \omega}(X) \quad (3)$ where $S_\omega(X) = \widetilde{U} \widetilde{D}_\omega \widetilde{V}^T$ with $\widetilde{D}_\omega = \mathrm{diag}\left((\widetilde{d}_1 - \omega_1)_+, \dots, (\widetilde{d}_r - \omega_r)_+\right)$. [Gaïffas and Lecué, 2011] F. Caron 21 / 44
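A minimal sketch of the M-step operator and the corresponding weights (function names and argument order are illustrative). Since $\omega_i = (a+1)/(b + d_i^*)$ is decreasing in $d_i^*$, passing the previous singular values in decreasing order automatically produces non-decreasing weights, as condition (2) requires.

```python
import numpy as np

def em_weights(d_prev, a, b):
    """Adaptive weights omega_i = (a + 1) / (b + d_i*) from the previous singular values
    (assumed sorted in decreasing order, so the weights come out non-decreasing)."""
    return (a + 1.0) / (b + d_prev)

def weighted_soft_svd(X, omega):
    """S_omega(X): shrink the i-th singular value of X by omega[i] and clip at zero."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(d - omega[: len(d)], 0.0)) @ Vt
```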

23 EM algorithm for MAP estimation [Figure: Thresholding rules on the singular values $\widetilde{d}_i$ of X, for the nuclear norm and the HASP penalty with β = 2 and β = 0.1] The weights penalize the higher singular values less heavily, hence reducing bias. F. Caron 22 / 44

24 Low rank estimation of complete matrices Hierarchical Adaptive Soft Thresholded (HAST) algorithm for low rank estimation of complete matrices: Initialize $Z^{(0)}$. At iteration $t \geq 1$: For $i = 1, \dots, r$, compute the weights $\omega_i^{(t)} = \frac{a+1}{b + d_i^{(t-1)}}$. Set $Z^{(t)} = S_{\sigma^2 \omega^{(t)}}(X)$. If $\frac{L(Z^{(t-1)}) - L(Z^{(t)})}{L(Z^{(t-1)})} < \varepsilon$ then return $\widehat{Z} = Z^{(t)}$. Admits the soft-thresholded SVD operator as a special case when $a = \lambda b$, $b = \beta$ and $\beta \to \infty$. F. Caron 23 / 44
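Putting the two steps together, one possible reading of the HAST loop for a fully observed matrix; the zero initialization and convergence tolerance below are placeholder assumptions, not the authors' implementation. Since X is fixed in the complete case, a single SVD of X suffices and only the thresholds change across iterations.

```python
import numpy as np

def hast(X, a, b, sigma2, n_iter=200, eps=1e-8):
    """Hierarchical Adaptive Soft Thresholded (HAST) estimation, complete case (sketch)."""
    def loss(Z):
        d = np.linalg.svd(Z, compute_uv=False)
        return np.linalg.norm(X - Z, 'fro') ** 2 / (2 * sigma2) + (a + 1) * np.log(b + d).sum()

    Z = np.zeros_like(X, dtype=float)                  # placeholder initialization
    d_prev = np.linalg.svd(Z, compute_uv=False)
    prev_loss = loss(Z)
    U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X is fixed: one SVD suffices
    for _ in range(n_iter):
        omega = (a + 1.0) / (b + d_prev)               # adaptive weights
        d_new = np.maximum(d - sigma2 * omega, 0.0)    # weighted soft-thresholding
        Z = U @ np.diag(d_new) @ Vt
        d_prev = d_new
        new_loss = loss(Z)
        if (prev_loss - new_loss) / prev_loss < eps:   # relative decrease of the objective
            break
        prev_loss = new_loss
    return Z
```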

25 Settings Parametrization: we set $b = \beta$ and $a = \lambda\beta$, where λ and β are tuning parameters that can be chosen by cross-validation. It is also possible to estimate σ within the EM algorithm. Initialization: initialize with the soft-thresholded SVD with parameter $\sigma^2 \lambda$. F. Caron 24 / 44

26 Outline Introduction Complete case Matrix completion Experiments Conclusion and perspectives F. Caron 25 / 44

27 Matrix completion Only a subset $\Omega \subset \{1,\dots,m\} \times \{1,\dots,n\}$ of the entries of the matrix X is observed. Subset operators: $P_\Omega(X)(i,j) = X_{ij}$ if $(i,j) \in \Omega$, 0 otherwise; $P_\Omega^\perp(X)(i,j) = 0$ if $(i,j) \in \Omega$, $X_{ij}$ otherwise. Same prior over Z. The MAP estimate is obtained by minimizing $L(Z) = \frac{1}{2\sigma^2} \|P_\Omega(X) - P_\Omega(Z)\|_F^2 + (a+1) \sum_{i=1}^r \log(b + d_i)$. F. Caron 26 / 44
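With a boolean mask encoding $\Omega$, the two projection operators reduce to one-liners; the helper layout below is a hypothetical convenience, not the authors' code.

```python
import numpy as np

def P_Omega(X, mask):
    """P_Omega(X): keep the observed entries (mask True), zero elsewhere."""
    return np.where(mask, X, 0.0)

def P_Omega_perp(X, mask):
    """P_Omega_perp(X): keep the unobserved entries, zero on the observed ones."""
    return np.where(mask, 0.0, X)
```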

28 EM algorithm for MAP estimation Latent variables: $\gamma$ and $P_\Omega^\perp(X)$. Hierarchical Adaptive Soft Impute (HASI) algorithm for matrix completion: Initialize $Z^{(0)}$. At iteration $t \geq 1$: For $i = 1, \dots, r$, compute the weights $\omega_i^{(t)} = \frac{a+1}{b + d_i^{(t-1)}}$. Set $Z^{(t)} = S_{\sigma^2 \omega^{(t)}}\left(P_\Omega(X) + P_\Omega^\perp(Z^{(t-1)})\right)$. If $\frac{L(Z^{(t-1)}) - L(Z^{(t)})}{L(Z^{(t-1)})} < \varepsilon$ then return $\widehat{Z} = Z^{(t)}$. F. Caron 27 / 44
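A compact sketch of this HASI iteration under the definitions above; the zero initialization, the way sigma2 enters the thresholds and the stopping tolerance are illustrative assumptions.

```python
import numpy as np

def hasi(X, mask, a, b, sigma2, n_iter=200, eps=1e-6):
    """Hierarchical Adaptive Soft Impute (HASI) sketch.
    X: data matrix (values outside mask are ignored); mask: True on observed entries."""
    def loss(Z, d):
        resid = np.where(mask, X - Z, 0.0)
        return np.linalg.norm(resid, 'fro') ** 2 / (2 * sigma2) + (a + 1) * np.log(b + d).sum()

    Z = np.zeros_like(X, dtype=float)                  # placeholder initialization
    d_prev = np.zeros(min(X.shape))
    prev_loss = loss(Z, d_prev)
    for _ in range(n_iter):
        omega = (a + 1.0) / (b + d_prev)               # adaptive weights
        X_filled = np.where(mask, X, Z)                # P_Omega(X) + P_Omega_perp(Z)
        U, d, Vt = np.linalg.svd(X_filled, full_matrices=False)
        d_new = np.maximum(d - sigma2 * omega, 0.0)    # weighted soft-thresholded SVD
        Z = U @ np.diag(d_new) @ Vt
        new_loss = loss(Z, d_new)
        if (prev_loss - new_loss) / prev_loss < eps:   # relative decrease of the objective
            return Z
        d_prev, prev_loss = d_new, new_loss
    return Z
```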

29 EM algorithm for MAP estimation The HASI algorithm admits the Soft-Impute algorithm as a special case when $a = \lambda b$, $b = \beta$ and $\beta \to \infty$: in this case, $\omega_i^{(t)} = \lambda$ for all i. When $\beta < \infty$, the algorithm adaptively updates the weights so as to penalize the higher singular values less heavily. F. Caron 28 / 44

30 Initialization Non-convex objective function: different initializations may lead to different modes. We set $a = \lambda b$ and $b = \beta$ and initialize the algorithm with the Soft-Impute algorithm with regularization parameter $\sigma^2 \lambda$. F. Caron 29 / 44

31 Scaling Similarly to the Soft-Impute algorithm, the computational bottleneck is the computation of the weighted soft-thresholded SVD $S_{\sigma^2 \omega^{(t)}}\left(P_\Omega(X) + P_\Omega^\perp(Z^{(t-1)})\right)$. For large matrices, one can resort to the PROPACK algorithm. It efficiently computes the truncated SVD of the sparse plus low rank matrix $P_\Omega(X) + P_\Omega^\perp(Z^{(t-1)}) = \underbrace{P_\Omega(X) - P_\Omega(Z^{(t-1)})}_{\text{sparse}} + \underbrace{Z^{(t-1)}}_{\text{low rank}}$ and can thus handle large matrices. [Larsen, 2004] F. Caron 30 / 44
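The same sparse-plus-low-rank trick can be sketched with SciPy's iterative SVD instead of PROPACK: wrap the matrix as a LinearOperator whose matrix-vector products combine the sparse residual with the thin factors of $Z^{(t-1)}$, so the dense matrix is never formed. The function name and the choice of scipy.sparse.linalg.svds are assumptions for illustration, not the talk's implementation.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

def truncated_svd_sparse_plus_lowrank(R_sparse, U_prev, d_prev, Vt_prev, k):
    """Truncated SVD of  R_sparse + U_prev @ diag(d_prev) @ Vt_prev  without densifying it.
    R_sparse: scipy sparse matrix holding P_Omega(X) - P_Omega(Z_prev);
    (U_prev, d_prev, Vt_prev): thin SVD factors of the previous low-rank iterate Z_prev."""
    m, n = R_sparse.shape

    def matvec(v):
        v = np.asarray(v).ravel()
        return R_sparse @ v + U_prev @ (d_prev * (Vt_prev @ v))

    def rmatvec(v):
        v = np.asarray(v).ravel()
        return R_sparse.T @ v + Vt_prev.T @ (d_prev * (U_prev.T @ v))

    op = LinearOperator((m, n), matvec=matvec, rmatvec=rmatvec)
    U, d, Vt = svds(op, k=k)
    order = np.argsort(d)[::-1]      # svds does not guarantee an ordering; sort descending
    return U[:, order], d[order], Vt[order, :]
```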

32 Outline Introduction Complete case Matrix completion Experiments Simulated data Collaborative filtering examples Conclusion and perspectives F. Caron 31 / 44

33 Simulated data Procedure We generate matrices $Z = AB^T$ of low rank q, where A and B are Gaussian matrices of size m × q and n × q, with m = n = 100, and add Gaussian noise with σ = 1. The signal to noise ratio is defined as $\mathrm{SNR} = \frac{\mathrm{var}(Z)}{\sigma^2}$. For the HASP penalty, we set $a = \lambda\beta$ and $b = \beta$. Grid of 50 values of the regularization parameter λ. Metrics: $\mathrm{err} = \frac{\|\widehat{Z} - Z\|_F^2}{\|Z\|_F^2}$ and $\mathrm{err}_\Omega = \frac{\|P_\Omega(\widehat{Z}) - P_\Omega(Z)\|_F^2}{\|P_\Omega(Z)\|_F^2}$. F. Caron 32 / 44
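For reference, a sketch of this simulation protocol and of the relative-error metric; the RNG seeding, default arguments and helper names are arbitrary illustrative choices.

```python
import numpy as np

def simulate_low_rank(m=100, n=100, q=5, sigma=1.0, frac_missing=0.5, seed=0):
    """Low-rank matrix Z = A B^T plus Gaussian noise, with a random subset of entries observed."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, q))
    B = rng.standard_normal((n, q))
    Z = A @ B.T
    X = Z + sigma * rng.standard_normal((m, n))
    mask = rng.random((m, n)) >= frac_missing     # True on observed entries
    return X, Z, mask

def relative_error(Z_hat, Z, mask=None):
    """||Z_hat - Z||_F^2 / ||Z||_F^2, optionally restricted to the entries selected by mask."""
    if mask is not None:
        Z_hat, Z = np.where(mask, Z_hat, 0.0), np.where(mask, Z, 0.0)
    return np.linalg.norm(Z_hat - Z, 'fro') ** 2 / np.linalg.norm(Z, 'fro') ** 2
```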

34 Simulated data Complete case [Figure: Relative error w.r.t. the rank obtained by varying the value of the regularization parameter λ, for ST, HT and HAST with β = 100, 10, 1; SNR=1, complete, rank=10] The HASP penalty provides a bridge/tradeoff between the nuclear norm and the rank penalty. For example, the value β = 10 shows a minimum at the true rank q = 10, as HT does, but with a lower error when the rank is overestimated. F. Caron 33 / 44

35 Simulated data Incomplete case [Figure: Test error w.r.t. the rank obtained by varying the value of the regularization parameter λ, averaged over 50 replications, for MMMF, SoftImp, SoftImp+, HardImp and HASI with β = 100, 10, 1; (a) SNR=1, 50% missing, rank=5 and (b) SNR=10, 80% missing, rank=5] Similar behavior is observed, with the HASI algorithm attaining a minimum at the true rank q = 5. F. Caron 34 / 44

36 Simulated data Incomplete case We then remove 20% of the observed entries as a validation set to estimate the regularization parameters, and use the unobserved entries as a test set. [Figure: Boxplots of the test error and ranks obtained over 50 replications for MMMF, SoftImp, SoftImp+, HardImp and HASI; (a)-(b) SNR=1, 50% missing and (c)-(d) SNR=10, 80% missing] F. Caron 35 / 44

37 Collaborative filtering examples (Jester) Procedure We randomly select two ratings per user as a test set, and two other ratings per user as a validation set to select the parameters λ and β. The results are computed over four values β = 1000, 100, 10, 1. We compare the different methods using the Normalized Mean Absolute Error (NMAE) $\mathrm{NMAE} = \frac{1}{\mathrm{card}(\Omega_{\mathrm{test}})} \sum_{(i,j) \in \Omega_{\mathrm{test}}} \frac{|X_{ij} - \widehat{Z}_{ij}|}{\max(X) - \min(X)}$ F. Caron 36 / 44
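The NMAE is straightforward to compute from a test mask; the helper below is a sketch in which the rating range is passed explicitly rather than hard-coded, and the names are illustrative.

```python
import numpy as np

def nmae(X, Z_hat, test_mask, rating_min, rating_max):
    """Normalized mean absolute error over the entries flagged by test_mask (Omega_test)."""
    abs_err = np.abs(X - Z_hat)[test_mask]
    return abs_err.mean() / (rating_max - rating_min)
```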

38 Collaborative filtering examples (Jester) [Table: Results on the Jester 1, Jester 2 and Jester 3 datasets (27.3% and 75.3% missing), averaged over 10 replications; NMAE and rank for MMMF, Soft Imp, Soft Imp+, Hard Imp and HASI] F. Caron 37 / 44

39 Collaborative filtering examples (Jester) [Figure: NMAE w.r.t. the rank obtained by varying the regularization parameter λ, for MMMF, SoftImp, SoftImp+, HardImp and HASI with β = 1000, 100, 10, 1; (a) Jester 1 and (b) Jester 3] F. Caron 38 / 44

40 Collaborative filtering examples (MovieLens) Procedure We randomly select 20% of the entries as a test set, and the remaining entries are split between a training set (80%) and a validation set (20%). For all the methods, we stop the regularization path as soon as the estimated rank exceeds $r_{\max} = 100$. For the larger MovieLens 1M dataset, the precision, maximum number of iterations and maximum rank are decreased to $\epsilon = 10^{-6}$, $t_{\max} = 100$ and $r_{\max} = 30$. F. Caron 39 / 44

41 Collaborative filtering examples (MovieLens) [Table: Results on the MovieLens 100k and MovieLens 1M (95.8% missing) datasets, averaged over 5 replications; NMAE and rank for MMMF, Soft Imp, Soft Imp+, Hard Imp and HASI] F. Caron 40 / 44

42 Outline Introduction Complete case Matrix completion Experiments Conclusion and perspectives F. Caron 41 / 44

43 Conclusion and perspectives Conclusion: Good results compared to several alternative low rank matrix completion methods. Bridge between nuclear norm and rank regularization algorithms. Can be extended to binary matrices Non-convex optimization, but experiments show that initializing the algorithm with the Soft-Impute algorithm provides very satisfactory results. Matlab code available online Perspectives: Fully Bayesian approach Larger datasets F. Caron 42 / 44

44 Bibliography I
Cai, J., Candès, E., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4).
Candès, E. and Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6).
Candès, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5).
Fazel, M. (2002). Matrix rank minimization with applications. PhD thesis, Stanford University.
Gaïffas, S. and Lecué, G. (2011). Weighted algorithms for compressed sensing and matrix completion. arXiv preprint.
Larsen, R. M. (2004). PROPACK: software for large and sparse SVD calculations. Available online.
F. Caron 43 / 44

45 Bibliography II
Mazumder, R., Hastie, T., and Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research, 11.
Todeschini, A., Caron, F., and Chavent, M. (2013). Probabilistic low-rank matrix completion with adaptive spectral regularization algorithms. In Advances in Neural Information Processing Systems.
F. Caron 44 / 44
