Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms


1 Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms Adrien Todeschini Inria Bordeaux JdS 2014, Rennes Aug Joint work with François Caron (Univ. Oxford), Marie Chavent (Inria, Univ. Bordeaux) A. Todeschini 1 / 35

2 Disclaimer Not a fully Bayesian approach. Derivation of an EM algorithm for MAP estimation. But builds on a hierarchical prior construction. A. Todeschini 2 / 35

3 Outline Introduction Hierarchical adaptive spectral penalty EM algorithm for MAP estimation Experiments A. Todeschini 3 / 35

4 Matrix Completion Netflix prize: 480k users and 18k movies providing 1-5 ratings; 99% of the ratings are missing. Objective: predict missing entries in order to make recommendations. [Figure: users × movies rating matrix] A. Todeschini 4 / 35

5 Matrix Completion Objective Complete a matrix X of size m n from a subset of its entries Applications Recommender systems Image inpainting Imputation of missing data A. Todeschini 5 / 35

6 Matrix Completion Potentially large matrices (each dimension of order ) Very sparsely observed (1%-10%) A. Todeschini 6 / 35

7 Low rank Matrix Completion Assume that the complete matrix Z is of low rank k ≪ min(m, n) = r: Z (m × n) ≈ A (m × k) · Bᵀ (k × n) A. Todeschini 7 / 35

8 Low rank Matrix Completion Assume that the complete matrix Z is of low rank k ≪ min(m, n) = r: Z (m × n) ≈ A (m × k) · Bᵀ (k × n), where the rows of A index users, the rows of B index items, and the k columns are latent features. A. Todeschini 7 / 35

9 Low rank Matrix Completion Let Ω ⊂ {1,..., m} × {1,..., n} be the subset of observed entries. For (i, j) ∈ Ω, X_ij = Z_ij + ε_ij, ε_ij ~ iid N(0, σ²), where σ² > 0. A. Todeschini 8 / 35
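
This observation model is easy to simulate. Below is a minimal NumPy sketch (the sizes, noise level and 20% observation rate are our own illustrative choices, not values from the talk); the later code sketches reuse the mask/X pair produced here.

```python
import numpy as np

# Illustrative simulation of the observation model above (parameters are assumptions).
rng = np.random.default_rng(0)
m, n, k, sigma = 100, 80, 5, 0.1

A = rng.standard_normal((m, k))
B = rng.standard_normal((n, k))
Z_true = A @ B.T                              # complete low-rank matrix Z

mask = rng.random((m, n)) < 0.2               # Omega: ~20% of entries observed
noise = sigma * rng.standard_normal((m, n))   # eps_ij ~ iid N(0, sigma^2)
X = np.where(mask, Z_true + noise, 0.0)       # P_Omega(X); zeros off Omega
```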

10 Low rank Matrix Completion Optimization problem: minimize over Z (1/(2σ²)) Σ_{(i,j)∈Ω} (X_ij − Z_ij)² + λ rank(Z), where the first term is the negative log-likelihood, the second term is the penalty, and λ > 0 is some regularization parameter. Non-convex; computationally hard for a general subset Ω. A. Todeschini 9 / 35

11 Low rank Matrix Completion Matrix completion with nuclear norm penalty: minimize over Z (1/(2σ²)) Σ_{(i,j)∈Ω} (X_ij − Z_ij)² + λ ‖Z‖_*, where ‖Z‖_* is the nuclear norm of Z, i.e. the sum of the singular values of Z. Convex relaxation of the rank penalty optimization [Fazel, 2002, Candès and Recht, 2009, Candès and Tao, 2010]. A. Todeschini 10 / 35

12 Low rank Matrix Completion Complete matrix X. Nuclear norm objective function: minimize over Z (1/(2σ²)) ‖X − Z‖²_F + λ ‖Z‖_*, where ‖·‖_F denotes the Frobenius norm. The global solution is given by a soft-thresholded SVD: Ẑ = S_{λσ²}(X), where S_λ(X) = Ũ D̃_λ Ṽᵀ with D̃_λ = diag((d̃_1 − λ)₊,..., (d̃_r − λ)₊) and t₊ = max(t, 0). [Cai et al., 2010, Mazumder et al., 2010] A. Todeschini 11 / 35
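
The operator S_λ takes only a few lines of NumPy. A hedged sketch (the function name soft_threshold_svd is ours, not from the authors' released code):

```python
import numpy as np

def soft_threshold_svd(X, lam):
    # S_lambda(X) = U diag((d_i - lambda)_+) V^T, computed from the SVD of X
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(d - lam, 0.0)) @ Vt
```

With a fully observed X, the global solution above is then soft_threshold_svd(X, lam * sigma**2).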

13 Low rank Matrix Completion Soft-Impute algorithm: start with an initial matrix Z^(0). At each iteration t = 1, 2,..., replace the missing elements in X with those in Z^(t−1), then perform a soft-thresholded SVD on the completed matrix, with shrinkage λ, to obtain the low rank matrix Z^(t). [Mazumder et al., 2010] A. Todeschini 12 / 35
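
A minimal sketch of this loop, reusing soft_threshold_svd from above; mask encodes Ω, and the fixed iteration count is an assumption standing in for the convergence test used in practice:

```python
def soft_impute(X, mask, lam, sigma2=1.0, n_iter=100):
    # Soft-Impute sketch: impute missing entries from the current estimate,
    # then soft-threshold the singular values of the completed matrix.
    Z = np.zeros_like(X)                       # Z^(0)
    for _ in range(n_iter):
        X_filled = np.where(mask, X, Z)        # P_Omega(X) + P_Omega_perp(Z^(t-1))
        Z = soft_threshold_svd(X_filled, lam * sigma2)
    return Z
```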

14 Low rank Matrix Completion [Figure: thresholding rules on the singular values d̃_i of X, comparing the nuclear norm rule with HASP (β = 2) and HASP (β = 0.1)] A. Todeschini 13 / 35

15 Outline Introduction Hierarchical adaptive spectral penalty EM algorithm for MAP estimation Experiments A. Todeschini 14 / 35

16 Nuclear Norm penalty Maximum A Posteriori (MAP) estimate Ẑ = arg max_Z [log p(X|Z) + log p(Z)] under the prior p(Z) ∝ exp(−λ ‖Z‖_*), where Z = UDVᵀ with D = diag(d_1, d_2,..., d_r), U and V given iid Haar uniform priors on unitary matrices, and d_i ~ iid Exp(λ). [Graphical model: λ → d_1, d_2,..., d_r] A. Todeschini 15 / 35

17 Hierarchical adaptive spectral penalty Each singular value has its own random shrinkage coefficient. Hierarchical model, for each singular value i = 1,..., r: d_i | γ_i ~ Exp(γ_i), γ_i ~ Gamma(a, b). We set a = λb and b = β. [Graphical model: (λ, β) → γ_1,..., γ_r → d_1,..., d_r] [Todeschini et al., 2013] A. Todeschini 16 / 35

18 Hierarchical adaptive spectral penalty Marginal distribution over d_i: p(d_i) = ∫₀^∞ Exp(d_i; γ_i) Gamma(γ_i; a, b) dγ_i = a bᵃ / (d_i + b)^(a+1), a Pareto distribution with heavier tails than the exponential distribution. [Figure: p(d_i) for β = ∞, β = 2 and β = 0.1] A. Todeschini 17 / 35
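
The marginalization can be checked numerically. An illustrative Monte Carlo sketch of our own (reading b as the Gamma rate parameter, which is the convention under which the stated Pareto form holds), comparing the empirical CDF of d against the Lomax CDF 1 − (b/(t + b))ᵃ:

```python
import numpy as np

# Monte Carlo check (our construction, not from the talk) of the Exp-Gamma marginal.
rng = np.random.default_rng(1)
lam, beta = 2.0, 0.5
a, b = lam * beta, beta                       # slide convention: a = lambda*b, b = beta

gam = rng.gamma(shape=a, scale=1.0 / b, size=1_000_000)  # gamma_i ~ Gamma(a, rate b)
d = rng.exponential(scale=1.0 / gam)                     # d_i | gamma_i ~ Exp(gamma_i)

t = 1.0
print(np.mean(d <= t), 1.0 - (b / (t + b)) ** a)         # empirical vs. Lomax CDF
```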

19 Hierarchical adaptive spectral penalty pen(Z) = −log p(Z) = Σ_{i=1}^r (a + 1) log(b + d_i) + const. [Figure: penalty contours for (a) nuclear norm, (b) HASP (β = 1), (c) HASP (β = 0.1), (d) rank penalty, and their vector analogues (e) ℓ1 norm, (f) HAL (β = 1), (g) HAL (β = 0.1), (h) ℓ0 norm] Admits as a special case the nuclear norm penalty λ ‖Z‖_* when a = λb and b → ∞. A. Todeschini 18 / 35

20 Outline Introduction Hierarchical adaptive spectral penalty EM algorithm for MAP estimation Experiments A. Todeschini 19 / 35

21 EM algorithm for MAP estimation Expectation Maximization (EM) algorithm to obtain a MAP estimate Ẑ = arg max_Z [log p(X|Z) + log p(Z)], i.e. to minimize L(Z) = (1/(2σ²)) ‖P_Ω(X) − P_Ω(Z)‖²_F + (a + 1) Σ_{i=1}^r log(b + d_i), where P_Ω(X)(i, j) = X_ij if (i, j) ∈ Ω and 0 otherwise, and P_Ω^⊥(X)(i, j) = 0 if (i, j) ∈ Ω and X_ij otherwise. A. Todeschini 20 / 35

22 EM algorithm for MAP estimation Latent variables: γ = (γ_1,..., γ_r) and P_Ω^⊥(X). E step: Q(Z, Z*) = E[log p(P_Ω^⊥(X), Z, γ) | Z*, P_Ω(X)] = C − (1/(2σ²)) ‖X* − Z‖²_F − Σ_{i=1}^r ω_i d_i, where X* = P_Ω(X) + P_Ω^⊥(Z*) and ω_i = E[γ_i | d_i*] = (a + 1)/(b + d_i*). A. Todeschini 21 / 35

23 EM algorithm for MAP estimation M step: minimize over Z (1/(2σ²)) ‖X* − Z‖²_F + Σ_{i=1}^r ω_i d_i. (1) Problem (1) is an adaptive spectral penalty regularized optimization problem, with weights ω_i = (a + 1)/(b + d_i*). Since d_1* ≥ d_2* ≥ ... ≥ d_r* ≥ 0, the weights satisfy ω_1 ≤ ω_2 ≤ ... ≤ ω_r. (2) Given condition (2), the solution is given by a weighted soft-thresholded SVD: Ẑ = S_{σ²ω}(X*), (3) where S_ω(X) = Ũ D̃_ω Ṽᵀ with D̃_ω = diag((d̃_1 − ω_1)₊,..., (d̃_r − ω_r)₊). [Gaïffas and Lecué, 2011] A. Todeschini 22 / 35
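
A sketch of the weighted operator S_ω in NumPy; since np.linalg.svd returns singular values in decreasing order, passing a nondecreasing weight vector w matches condition (2):

```python
def weighted_soft_threshold_svd(X, w):
    # S_omega(X) = U diag((d_i - omega_i)_+) V^T; d comes back in decreasing
    # order, so a nondecreasing w gives the valid solution of the M step.
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(d - w, 0.0)) @ Vt
```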

24 EM algorithm for MAP estimation [Figure: thresholding rules on the singular values d̃_i of X*, for the nuclear norm, HASP (β = 2) and HASP (β = 0.1)] The weights penalize the higher singular values less heavily, hence reducing bias. A. Todeschini 23 / 35

25 EM algorithm for MAP estimation Hierarchical Adaptive Soft Impute (HASI) algorithm for matrix completion: initialize Z^(0) with Soft-Impute. At each iteration t ≥ 1, compute the weights ω_i^(t) = (a + 1)/(b + d_i^(t−1)) for i = 1,..., r, then set Z^(t) = S_{σ²ω^(t)}(P_Ω(X) + P_Ω^⊥(Z^(t−1))). A. Todeschini 24 / 35
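
Putting the E and M steps together gives the following illustrative sketch of the HASI iteration, reusing the helpers defined above (lam_init for the Soft-Impute initialization and the fixed iteration count are our assumptions, not the authors' settings):

```python
def hasi(X, mask, a, b, sigma2=1.0, n_iter=100, lam_init=1.0):
    # HASI sketch: Soft-Impute initialization, then EM iterations alternating
    # weight updates (E step) and a weighted soft-thresholded SVD (M step).
    Z = soft_impute(X, mask, lam_init, sigma2)           # Z^(0)
    for _ in range(n_iter):
        d = np.linalg.svd(Z, compute_uv=False)           # singular values of Z^(t-1)
        w = (a + 1.0) / (b + d)                          # omega^(t), nondecreasing
        X_filled = np.where(mask, X, Z)                  # impute missing entries
        Z = weighted_soft_threshold_svd(X_filled, sigma2 * w)
    return Z
```

Under the slide convention a = λβ, b = β, a call would look like hasi(X, mask, a=lam * beta, b=beta, sigma2=sigma**2).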

26 EM algorithm for MAP estimation The HASI algorithm admits the Soft-Impute algorithm as a special case when a = λb and b = β → ∞: in this limit, ω_i^(t) = λ for all i. When β < ∞, the algorithm adaptively updates the weights so as to penalize the higher singular values less heavily. A. Todeschini 25 / 35

27 Outline Introduction Hierarchical adaptive spectral penalty EM algorithm for MAP estimation Experiments Simulated data Collaborative filtering examples A. Todeschini 26 / 35

28 Simulated data [Figure: test error vs. rank for MMMF, SoftImp, SoftImp+, HardImp and HASI (β = 100, 10, ...); (a) SNR = 1, 50% missing; (b) SNR = 10, 80% missing, rank = 5] A. Todeschini 27 / 35

29 Simulated data We then remove 20% of the observed entries as a validation set to estimate the regularization parameters. We use the unobserved entries as a test set. [Figure: test error and estimated rank for MMMF, SoftImp, SoftImp+, HardImp and HASI; (c, d) SNR = 1, 50% missing; (e, f) SNR = 10, 80% missing] A. Todeschini 28 / 35

30 Collaborative filtering examples (Jester) [Table: NMAE and estimated rank of MMMF, Soft-Imp, Soft-Imp+, Hard-Imp and HASI on Jester 1, Jester 2 (27.3% missing) and Jester 3 (75.3% missing); numerical values not recovered] A. Todeschini 29 / 35

31 Collaborative filtering examples (Jester) [Figure: test error vs. rank for MMMF, SoftImp, SoftImp+, HardImp and HASI (β = 1000, 100, 10, ...); (g) Jester; (h) Jester 3] A. Todeschini 30 / 35

32 Collaborative filtering examples (MovieLens) [Table: NMAE and estimated rank of MMMF, Soft-Imp, Soft-Imp+, Hard-Imp and HASI on MovieLens 100k and MovieLens 1M (95.8% missing); numerical values not recovered] A. Todeschini 31 / 35

33 Conclusion and perspectives Conclusion: good results compared to several alternative low rank matrix completion methods; a bridge between nuclear norm and rank regularization algorithms; can be extended to binary matrices; non-convex optimization, but experiments show that initializing the algorithm with Soft-Impute provides very satisfactory results; Matlab code available online. Perspectives: fully Bayesian approach; tensor factorization; online EM. A. Todeschini 32 / 35

34 Bibliography I
Cai, J., Candès, E., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4).
Candès, E. and Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6).
Candès, E. J. and Tao, T. (2010). The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5).
Fazel, M. (2002). Matrix rank minimization with applications. PhD thesis, Stanford University.
Gaïffas, S. and Lecué, G. (2011). Weighted algorithms for compressed sensing and matrix completion. arXiv preprint.
Larsen, R. M. (2004). PROPACK: software for large and sparse SVD calculations. Available online: stanford.edu/rmunk/propack.
A. Todeschini 33 / 35

35 Bibliography II
Mazumder, R., Hastie, T., and Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11.
Todeschini, A., Caron, F., and Chavent, M. (2013). Probabilistic low-rank matrix completion with adaptive spectral regularization algorithms. In Advances in Neural Information Processing Systems.
A. Todeschini 34 / 35

36 Thank you A. Todeschini 35 / 35
