Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

Transcription:

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms François Caron Department of Statistics, Oxford STATLEARN 2014, Paris April 7, 2014 Joint work with Adrien Todeschini, Marie Chavent (INRIA, U. of Bordeaux) F. Caron 1 / 44

Outline Introduction Complete case Matrix completion Experiments Conclusion and perspectives F. Caron 2 / 44

Matrix Completion Netflix prize: 480k users and 18k movies providing 1-5 ratings; 99% of the ratings are missing. Objective: predict missing entries in order to make recommendations. [Illustration: sparsely observed users-by-movies rating matrix.] F. Caron 3 / 44

Matrix Completion Objective: complete a matrix X of size $m \times n$ from a subset of its entries. Applications: recommender systems, image inpainting, imputation of missing data. F. Caron 4 / 44

Matrix Completion Potentially large matrices (each dimension of order $10^4$ to $10^6$). Very sparsely observed (1%-10%). F. Caron 5 / 44

Low rank Matrix Completion Assume that the complete matrix Z is of low rank: $Z \approx A B^T$, where Z is $m \times n$, A is $m \times k$, $B^T$ is $k \times n$ and $k \ll \min(m, n)$. In the recommender-system setting, the rows of Z are indexed by users, the columns by items, and the $k$ columns of A and B correspond to latent features. F. Caron 6 / 44

Low rank Matrix Completion Let $\Omega \subset \{1,\ldots,m\} \times \{1,\ldots,n\}$ be the subset of observed entries. For $(i, j) \in \Omega$,
$$X_{ij} = Z_{ij} + \varepsilon_{ij}, \qquad \varepsilon_{ij} \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$$
where $\sigma^2 > 0$. F. Caron 7 / 44

Low rank Matrix Completion Optimization problem
$$\operatorname*{minimize}_{Z} \; \underbrace{\frac{1}{2\sigma^2} \sum_{(i,j)\in\Omega} (X_{ij} - Z_{ij})^2}_{\text{negative log-likelihood}} + \underbrace{\lambda\, \operatorname{rank}(Z)}_{\text{penalty}}$$
where $\lambda > 0$ is some regularization parameter. Non-convex. Computationally hard for a general subset $\Omega$. F. Caron 8 / 44

Low rank Matrix Completion Matrix completion with nuclear norm penalty
$$\operatorname*{minimize}_{Z} \; \underbrace{\frac{1}{2\sigma^2} \sum_{(i,j)\in\Omega} (X_{ij} - Z_{ij})^2}_{\text{negative log-likelihood}} + \underbrace{\lambda \|Z\|_*}_{\text{penalty}}$$
where $\|Z\|_*$ is the nuclear norm of Z, i.e. the sum of the singular values of Z. Convex relaxation of the rank-penalized optimization problem [Fazel, 2002, Candès and Recht, 2009, Candès and Tao, 2010]. F. Caron 9 / 44
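To make the objective concrete, here is a minimal NumPy sketch (not from the talk) that evaluates the nuclear-norm-penalized objective; the boolean `mask` indicator of $\Omega$ is an assumed convention.

```python
import numpy as np

def nuclear_norm_objective(X, Z, mask, lam, sigma2=1.0):
    """(1 / (2 * sigma^2)) * sum over observed (i, j) of (X_ij - Z_ij)^2 + lam * ||Z||_*."""
    resid = (X - Z)[mask]                               # residuals on the observed entries only
    loss = 0.5 / sigma2 * np.sum(resid ** 2)            # data-fit (negative log-likelihood) term
    nuclear = np.linalg.svd(Z, compute_uv=False).sum()  # nuclear norm = sum of singular values
    return loss + lam * nuclear
```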

Low rank Matrix Completion Soft-Impute algorithm Start with an initial matrix $Z^{(0)}$. At each iteration $t = 1, 2, \ldots$: replace the missing elements in X with those in $Z^{(t-1)}$; perform a soft-thresholded SVD on the completed matrix, with shrinkage $\lambda$, to obtain the low rank matrix $Z^{(t)}$. [Mazumder et al., 2010] F. Caron 10 / 44
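A minimal NumPy sketch of one way to implement this iteration (function names, tolerance and iteration cap are illustrative choices, not the authors' code):

```python
import numpy as np

def soft_thresholded_svd(M, thresh):
    """S_thresh(M): shrink all singular values of M by `thresh`, clipping at zero."""
    U, d, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(d - thresh, 0.0)) @ Vt

def soft_impute(X, mask, lam, n_iter=200, tol=1e-5):
    """Soft-Impute: alternate imputation and soft-thresholded SVD (Mazumder et al., 2010)."""
    Z = np.zeros_like(X)
    for _ in range(n_iter):
        completed = np.where(mask, X, Z)        # fill missing entries with the current estimate
        Z_new = soft_thresholded_svd(completed, lam)
        if np.linalg.norm(Z_new - Z) <= tol * max(np.linalg.norm(Z), 1.0):
            return Z_new
        Z = Z_new
    return Z
```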

Low rank Matrix Completion Soft-Impute algorithm Soft-thresholded SVD yields a low-rank representation. Each iteration decreases the value of the nuclear norm objective function towards its minimum. Various strategies have been proposed to scale the algorithm to problems where n, m are of order $10^6$. The same shrinkage is applied to all singular values. [Mazumder et al., 2010] F. Caron 11 / 44

Contributions Probabilistic interpretation of the nuclear norm objective function: Maximum A Posteriori estimation assuming exponential priors on the singular values; Soft-Impute = Expectation-Maximization algorithm. Construction of alternative non-convex objective functions building on hierarchical priors, bridging the gap between the rank penalty and the nuclear norm penalty. EM: adaptive algorithm that iteratively adjusts the shrinkage coefficients for each singular value, similar to the adaptive lasso in multivariate regression. Numerical results demonstrate the benefits of the approach on various datasets. F. Caron 12 / 44

Outline Introduction Complete case Hierarchical adaptive spectral penalty EM algorithm for MAP estimation Matrix completion Experiments Conclusion and perspectives F. Caron 13 / 44

Nuclear Norm penalty Complete matrix X. Nuclear norm objective function
$$\operatorname*{minimize}_{Z} \; \frac{1}{2\sigma^2} \|X - Z\|_F^2 + \lambda \|Z\|_*$$
where $\|\cdot\|_F$ is the Frobenius norm. The global solution is given by a soft-thresholded SVD, $\hat{Z} = S_{\lambda\sigma^2}(X)$, where $S_\lambda(X) = \tilde{U} \tilde{D}_\lambda \tilde{V}^T$ with $\tilde{D}_\lambda = \operatorname{diag}((\tilde{d}_1 - \lambda)_+, \ldots, (\tilde{d}_r - \lambda)_+)$ and $t_+ = \max(t, 0)$. [Cai et al., 2010, Mazumder et al., 2010] F. Caron 14 / 44

Nuclear Norm penalty Maximum A Posteriori (MAP) estimate
$$\hat{Z} = \operatorname*{arg\,max}_{Z} \left[\log p(X \mid Z) + \log p(Z)\right]$$
under the prior $p(Z) \propto \exp(-\lambda \|Z\|_*)$, where $Z = U D V^T$ with $D = \operatorname{diag}(d_1, d_2, \ldots, d_r)$, $U, V$ having Haar uniform priors on unitary matrices and $d_i \overset{iid}{\sim} \operatorname{Exp}(\lambda)$. F. Caron 15 / 44

Hierarchical adaptive spectral penalty Each singular value has its own random shrinkage coefficient. Hierarchical model, for each singular value $i = 1, \ldots, r$:
$$d_i \mid \gamma_i \sim \operatorname{Exp}(\gamma_i), \qquad \gamma_i \sim \operatorname{Gamma}(a, b)$$
Marginal distribution over $d_i$:
$$p(d_i) = \int_0^\infty \operatorname{Exp}(d_i; \gamma_i)\, \operatorname{Gamma}(\gamma_i; a, b)\, d\gamma_i = \frac{a b^a}{(d_i + b)^{a+1}}$$
a Pareto distribution with heavier tails than the exponential distribution. [Todeschini et al., 2013] F. Caron 16 / 44
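As a quick numerical sanity check of this marginal (illustrative, not from the talk), one can integrate the exponential likelihood against the Gamma(a, b) prior, with b a rate parameter, and compare against the closed form; the test values below are arbitrary.

```python
import numpy as np
from scipy import integrate
from scipy.special import gammaln

a, b, d = 2.0, 3.0, 1.5   # arbitrary test values

def integrand(gam):
    # Exp(d; gam) * Gamma(gam; a, b), computed on the log scale for stability
    log_exp = np.log(gam) - gam * d
    log_gamma = a * np.log(b) + (a - 1.0) * np.log(gam) - b * gam - gammaln(a)
    return np.exp(log_exp + log_gamma)

numeric, _ = integrate.quad(integrand, 0.0, np.inf)
closed_form = a * b**a / (d + b) ** (a + 1)
print(numeric, closed_form)   # both print ~0.1975: the Pareto-type marginal from the slide
```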

Hierarchical adaptive spectral penalty [Figure: Marginal distribution $p(d_i)$ with $a = b = \beta$, for several values of $\beta$.] HASP penalty
$$\operatorname{pen}(Z) = -\log p(Z) = \sum_{i=1}^r (a + 1) \log(b + d_i)$$
(up to an additive constant). Admits as a special case the nuclear norm penalty $\lambda \|Z\|_*$ when $a = \lambda b$ and $b \to \infty$. F. Caron 17 / 44

Hierarchical adaptive spectral penalty Figure: Top: manifold of constant penalty for a symmetric 2x2 matrix Z = [x, y; y, z], for (a) the nuclear norm, the hierarchical adaptive spectral penalty with a = b = β for (b) β = 1 and (c) β = 0.1, and (d) the rank penalty. Bottom: contours of constant penalty for a diagonal matrix [x, 0; 0, z], where one recovers the classical (e) lasso, (f-g) hierarchical adaptive lasso (HAL, β = 1 and β = 0.1) and (h) $\ell_0$ penalties. F. Caron 18 / 44

EM algorithm for MAP estimation Expectation Maximization (EM) algorithm to obtain a MAP estimate $\hat{Z} = \operatorname*{arg\,max}_{Z} \left[\log p(X \mid Z) + \log p(Z)\right]$, i.e. to minimize
$$L(Z) = \frac{1}{2\sigma^2} \|X - Z\|_F^2 + \sum_{i=1}^r (a + 1) \log(b + d_i)$$
F. Caron 19 / 44

EM algorithm for MAP estimation Latent variables: $\gamma = (\gamma_1, \ldots, \gamma_r)$. E step:
$$Q(Z, Z^*) = \mathbb{E}\left[\log p(X, Z, \gamma) \mid Z^*, X\right] = C - \frac{1}{2\sigma^2} \|X - Z\|_F^2 - \sum_{i=1}^r \omega_i d_i$$
where $\omega_i = \mathbb{E}[\gamma_i \mid d_i^*] = \frac{a+1}{b + d_i^*}$ and the $d_i^*$ are the singular values of the current iterate $Z^*$. F. Caron 20 / 44

EM algorithm for MAP estimation M step:
$$\operatorname*{minimize}_{Z} \; \frac{1}{2\sigma^2} \|X - Z\|_F^2 + \sum_{i=1}^r \omega_i d_i \qquad (1)$$
Problem (1) is an adaptive spectral penalty regularized optimization problem, with weights $\omega_i = \frac{a+1}{b + d_i^*}$ satisfying
$$d_1^* \geq d_2^* \geq \cdots \geq d_r^* \geq 0 \;\Longrightarrow\; \omega_1 \leq \omega_2 \leq \cdots \leq \omega_r \qquad (2)$$
Given condition (2), the solution is given by a weighted soft-thresholded SVD
$$\hat{Z} = S_{\sigma^2\omega}(X) \qquad (3)$$
where $S_\omega(X) = \tilde{U} \tilde{D}_\omega \tilde{V}^T$ with $\tilde{D}_\omega = \operatorname{diag}((\tilde{d}_1 - \omega_1)_+, \ldots, (\tilde{d}_r - \omega_r)_+)$. [Gaïffas and Lecué, 2011] F. Caron 21 / 44
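A minimal sketch of the weighted operator $S_\omega$ (assuming one weight per singular value, sorted non-decreasingly as in condition (2)):

```python
import numpy as np

def weighted_soft_svd(M, weights):
    """S_omega(M): shrink the i-th singular value of M by weights[i], clipping at zero.

    `weights` should be non-decreasing (condition (2)) with one entry per singular value."""
    U, d, Vt = np.linalg.svd(M, full_matrices=False)
    d_shrunk = np.maximum(d - np.asarray(weights, dtype=float), 0.0)
    return U @ np.diag(d_shrunk) @ Vt
```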

EM algorithm for MAP estimation [Figure: Thresholding rules on the singular values $\tilde{d}_i$ of X, for the nuclear norm and the HASP penalty with β = 2 and β = 0.1.] The weights penalize larger singular values less heavily, hence reducing bias. F. Caron 22 / 44

Low rank estimation of complete matrices Hierarchical Adaptive Soft Thresholded (HAST) algorithm for low rank estimation of complete matrices. Initialize $Z^{(0)}$. At iteration $t \geq 1$: for $i = 1, \ldots, r$, compute the weights $\omega_i^{(t)} = \frac{a+1}{b + d_i^{(t-1)}}$; set $Z^{(t)} = S_{\sigma^2 \omega^{(t)}}(X)$; if $\frac{L(Z^{(t-1)}) - L(Z^{(t)})}{L(Z^{(t-1)})} < \varepsilon$ then return $\hat{Z} = Z^{(t)}$. Admits the soft-thresholded SVD operator as a special case when $a = \lambda b$ and $b \to \infty$. F. Caron 23 / 44
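Putting the steps together, a sketch of the HAST iteration for a fully observed X (variable names, the starting point $Z^{(0)} = 0$ and the tolerance are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def hast(X, a, b, sigma2=1.0, n_iter=200, tol=1e-6):
    """Hierarchical Adaptive Soft Thresholded estimation of a complete matrix (sketch)."""
    def objective(Z, d_Z):
        return 0.5 / sigma2 * np.sum((X - Z) ** 2) + (a + 1) * np.sum(np.log(b + d_Z))

    U, d, Vt = np.linalg.svd(X, full_matrices=False)  # the SVD of X is computed only once
    d_Z = np.zeros_like(d)                            # singular values of Z^(0) = 0
    Z = np.zeros_like(X)
    L_prev = objective(Z, d_Z)
    for _ in range(n_iter):
        weights = (a + 1) / (b + d_Z)                 # adaptive shrinkage coefficients
        d_Z = np.maximum(d - sigma2 * weights, 0.0)   # weighted soft-thresholding of X's spectrum
        Z = U @ np.diag(d_Z) @ Vt
        L_new = objective(Z, d_Z)
        if (L_prev - L_new) / max(abs(L_prev), 1e-12) < tol:
            break
        L_prev = L_new
    return Z
```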

Settings Parametrization: we set $b = \beta$ and $a = \lambda\beta$, where $\lambda$ and $\beta$ are tuning parameters that can be chosen by cross-validation. It is also possible to estimate $\sigma$ within the EM algorithm. Initialization: the algorithm is initialized with the soft-thresholded SVD with parameter $\sigma^2\lambda$. F. Caron 24 / 44

Outline Introduction Complete case Matrix completion Experiments Conclusion and perspectives F. Caron 25 / 44

Matrix completion Only a subset $\Omega \subset \{1, \ldots, m\} \times \{1, \ldots, n\}$ of the entries of the matrix X is observed. Subset operators:
$$P_\Omega(X)(i,j) = \begin{cases} X_{ij} & \text{if } (i,j) \in \Omega \\ 0 & \text{otherwise} \end{cases} \qquad P_\Omega^\perp(X)(i,j) = \begin{cases} 0 & \text{if } (i,j) \in \Omega \\ X_{ij} & \text{otherwise} \end{cases}$$
Same prior over Z. The MAP estimate is obtained by minimizing
$$L(Z) = \frac{1}{2\sigma^2} \|P_\Omega(X) - P_\Omega(Z)\|_F^2 + (a+1) \sum_{i=1}^r \log(b + d_i)$$
F. Caron 26 / 44
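With a boolean mask indicating $\Omega$, both projections are one-liners (a sketch; the `mask` argument is an assumption, not notation from the slides):

```python
import numpy as np

def P_omega(X, mask):
    """Keep the observed entries of X, set the rest to zero."""
    return np.where(mask, X, 0.0)

def P_omega_perp(X, mask):
    """Set the observed entries of X to zero, keep the rest."""
    return np.where(mask, 0.0, X)
```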

EM algorithm for MAP estimation Latent variables: $\gamma$ and $P_\Omega^\perp(X)$. Hierarchical Adaptive Soft Impute (HASI) algorithm for matrix completion. Initialize $Z^{(0)}$. At iteration $t \geq 1$: for $i = 1, \ldots, r$, compute the weights $\omega_i^{(t)} = \frac{a+1}{b + d_i^{(t-1)}}$; set $Z^{(t)} = S_{\sigma^2 \omega^{(t)}}\left(P_\Omega(X) + P_\Omega^\perp(Z^{(t-1)})\right)$; if $\frac{L(Z^{(t-1)}) - L(Z^{(t)})}{L(Z^{(t-1)})} < \varepsilon$ then return $\hat{Z} = Z^{(t)}$. F. Caron 27 / 44
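A compact NumPy sketch of the HASI iteration as described above (dense SVD for readability; the zero initialization and tolerance are illustrative assumptions, not the authors' MATLAB implementation):

```python
import numpy as np

def hasi(X, mask, a, b, sigma2=1.0, n_iter=500, tol=1e-6):
    """Hierarchical Adaptive Soft Impute for matrix completion (sketch)."""
    def objective(Z, d):
        resid = (X - Z)[mask]
        return 0.5 / sigma2 * np.sum(resid ** 2) + (a + 1) * np.sum(np.log(b + d))

    Z = np.zeros_like(X)
    d = np.zeros(min(X.shape))
    L_prev = objective(Z, d)
    for _ in range(n_iter):
        weights = (a + 1) / (b + d)                  # adaptive shrinkage per singular value
        completed = np.where(mask, X, Z)             # P_Omega(X) + P_Omega_perp(Z^(t-1))
        U, s, Vt = np.linalg.svd(completed, full_matrices=False)
        d = np.maximum(s - sigma2 * weights, 0.0)    # weighted soft-thresholded SVD
        Z = U @ np.diag(d) @ Vt
        L_new = objective(Z, d)
        if (L_prev - L_new) / max(abs(L_prev), 1e-12) < tol:
            break
        L_prev = L_new
    return Z
```

Setting a = λb with a very large b makes all weights approximately equal to λ, recovering essentially the Soft-Impute update discussed on the next slide.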

EM algorithm for MAP estimation The HASI algorithm admits the Soft-Impute algorithm as a special case when $a = \lambda b$ and $b \to \infty$: in that limit, $\omega_i^{(t)} = \lambda$ for all $i$. When $\beta < \infty$, the algorithm adaptively updates the weights so as to penalize larger singular values less heavily. F. Caron 28 / 44

Initialization Non-convex objective function: different initializations may lead to different modes. We set $a = \lambda b$ and $b = \beta$ and initialize the algorithm with the Soft-Impute algorithm with regularization parameter $\sigma^2\lambda$. F. Caron 29 / 44

Scaling Similarly to the Soft-Impute algorithm, the computational bottleneck is the computation of the weighted soft-thresholded SVD of $P_\Omega(X) + P_\Omega^\perp(Z^{(t-1)})$. For large matrices, one can resort to the PROPACK algorithm, which efficiently computes the truncated SVD of the sparse + low rank matrix
$$P_\Omega(X) + P_\Omega^\perp(Z^{(t-1)}) = \underbrace{P_\Omega(X) - P_\Omega(Z^{(t-1)})}_{\text{sparse}} + \underbrace{Z^{(t-1)}}_{\text{low rank}}$$
and can thus handle large matrices. [Larsen, 2004] F. Caron 30 / 44
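The same structure can be exploited in SciPy by wrapping the sparse + low-rank sum in a `LinearOperator` and calling a truncated SVD routine, so the completed matrix is never formed densely (a sketch under these assumptions; PROPACK itself is a separate Fortran/MATLAB package, although recent SciPy releases also expose a "propack" solver option in `svds`):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

def truncated_svd_sparse_plus_lowrank(X_sparse, A, B, k):
    """Truncated rank-k SVD of M = X_sparse + A @ B.T without densifying M.

    X_sparse : scipy sparse matrix, e.g. P_Omega(X) - P_Omega(Z^(t-1))
    A, B     : thin factors of the current low-rank iterate Z^(t-1) = A @ B.T"""
    m, n = X_sparse.shape

    def matvec(v):
        return X_sparse @ v + A @ (B.T @ v)

    def rmatvec(v):
        return X_sparse.T @ v + B @ (A.T @ v)

    M = LinearOperator((m, n), matvec=matvec, rmatvec=rmatvec, dtype=float)
    U, s, Vt = svds(M, k=k)                  # Lanczos/ARPACK-based truncated SVD
    order = np.argsort(s)[::-1]              # svds returns singular values in ascending order
    return U[:, order], s[order], Vt[order]
```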

Outline Introduction Complete case Matrix completion Experiments Simulated data Collaborative filtering examples Conclusion and perspectives F. Caron 31 / 44

Simulated data Procedure We generate matrices $Z = AB^T$ of low rank $q$, where A and B are Gaussian matrices of size $m \times q$ and $n \times q$, with $m = n = 100$, and add Gaussian noise with $\sigma = 1$. The signal-to-noise ratio is defined as $\mathrm{SNR} = \frac{\operatorname{var}(z)}{\sigma^2}$. For the HASP penalty, we set $a = \lambda\beta$ and $b = \beta$. Grid of 50 values of the regularization parameter $\lambda$. Metrics:
$$\mathrm{err} = \frac{\|\hat{Z} - Z\|_F^2}{\|Z\|_F^2} \qquad \text{and} \qquad \mathrm{err}_\Omega = \frac{\|P_\Omega(\hat{Z}) - P_\Omega(Z)\|_F^2}{\|P_\Omega(Z)\|_F^2}$$
F. Caron 32 / 44
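A sketch of this data-generating procedure (the missingness level and random seed below are illustrative choices):

```python
import numpy as np

def simulate(m=100, n=100, q=5, sigma=1.0, prop_missing=0.5, seed=0):
    """Generate X = Z + noise with Z = A @ B.T of rank q, plus a random missing pattern."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, q))
    B = rng.standard_normal((n, q))
    Z = A @ B.T
    X = Z + sigma * rng.standard_normal((m, n))
    mask = rng.random((m, n)) > prop_missing     # True on observed entries
    snr = np.var(Z) / sigma**2                   # signal-to-noise ratio as defined above
    return X, Z, mask, snr
```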

Simulated data Complete case [Figure, panel (a) SNR=1, complete, rank=10: relative test error w.r.t. the rank for ST, HT and HAST (β = 100, 10, 1), obtained by varying the value of the regularization parameter λ.] The HASP penalty provides a bridge/tradeoff between the nuclear norm and the rank penalty. For example, β = 10 shows a minimum at the true rank q = 10, as HT does, but with a lower error when the rank is overestimated. F. Caron 33 / 44

Simulated data Incomplete case [Figure: test error w.r.t. the rank for MMMF, SoftImp, SoftImp+, HardImp and HASI (β = 100, 10, 1), obtained by varying the value of the regularization parameter λ and averaged over 50 replications; panels (a) SNR=1, 50% missing, rank=5 and (b) SNR=10, 80% missing, rank=5.] Similar behavior is observed, with the HASI algorithm attaining a minimum at the true rank q = 5. F. Caron 34 / 44

Simulated data Incomplete case We then remove 20% of the observed entries as a validation set to estimate the regularization parameters. We use the unobserved entries as a test set. [Figure: boxplots of the test error and ranks obtained over 50 replications for MMMF, SoftImp, SoftImp+, HardImp and HASI; panels (a)-(b) SNR=1, 50% missing, and (c)-(d) SNR=10, 80% missing.] F. Caron 35 / 44

Collaborative filtering examples (Jester) Procedure We randomly select two ratings per user as a test set, and two other ratings per user as a validation set to select the parameters λ and β. The results are computed over four values β = 1000, 100, 10, 1. We compare the results of the different methods with the Normalized Mean Absolute Error (NMAE)
$$\mathrm{NMAE} = \frac{1}{\operatorname{card}(\Omega_{\text{test}})} \sum_{(i,j) \in \Omega_{\text{test}}} \frac{|X_{ij} - \hat{Z}_{ij}|}{\max(X) - \min(X)}$$
F. Caron 36 / 44
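The NMAE is straightforward to compute once the test entries are marked by a boolean mask (a minimal sketch; `test_mask` is an assumed indicator of $\Omega_{\text{test}}$):

```python
import numpy as np

def nmae(X, Z_hat, test_mask):
    """Normalized Mean Absolute Error over the test entries."""
    abs_err = np.abs(X - Z_hat)[test_mask]
    return abs_err.mean() / (X.max() - X.min())
```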

Collaborative filtering examples (Jester) Table: Results on the Jester datasets, averaged over 10 replications.

             Jester 1              Jester 2              Jester 3
             24983 x 100           23500 x 100           24938 x 100
             27.5% miss.           27.3% miss.           75.3% miss.
Method       NMAE    Rank          NMAE    Rank          NMAE    Rank
MMMF         0.161   95            0.162   96            0.183   58
Soft Imp     0.161   100           0.162   100           0.184   78
Soft Imp+    0.169   14            0.171   11            0.184   33
Hard Imp     0.158   7             0.159   6             0.181   4
HASI         0.153   100           0.153   100           0.174   30

F. Caron 37 / 44

Collaborative filtering examples (Jester) [Figure: NMAE w.r.t. the rank obtained by varying the regularization parameter λ, for MMMF, SoftImp, SoftImp+, HardImp and HASI (β = 1000, 100, 10, 1); panels (a) Jester 1 and (b) Jester 3.] F. Caron 38 / 44

Collaborative filtering examples (MovieLens) Procedure We randomly select 20% of the entries as a test set, and the remaining entries are split between a training set (80%) and a validation set (20%). For all the methods, we stop the regularization path as soon as the estimated rank exceeds $r_{\max} = 100$. For the larger MovieLens 1M dataset, the precision, maximum number of iterations and maximum rank are decreased to $\varepsilon = 10^{-6}$, $t_{\max} = 100$ and $r_{\max} = 30$. F. Caron 39 / 44

Collaborative filtering examples (MovieLens) Table: Results on the MovieLens datasets, averaged over 5 replications.

             MovieLens 100k        MovieLens 1M
             943 x 1682            6040 x 3952
             93.7% miss.           95.8% miss.
Method       NMAE    Rank          NMAE    Rank
MMMF         0.195   50            0.169   30
Soft Imp     0.197   156           0.176   30
Soft Imp+    0.197   108           0.189   30
Hard Imp     0.190   7             0.175   8
HASI         0.187   35            0.172   27

F. Caron 40 / 44

Outline Introduction Complete case Matrix completion Experiments Conclusion and perspectives F. Caron 41 / 44

Conclusion and perspectives Conclusion: Good results compared to several alternative low-rank matrix completion methods. Bridge between nuclear norm and rank regularization algorithms. Can be extended to binary matrices. Non-convex optimization, but experiments show that initializing the algorithm with the Soft-Impute algorithm provides very satisfactory results. Matlab code available online. Perspectives: Fully Bayesian approach. Larger datasets. F. Caron 42 / 44

Bibliography I
Cai, J., Candès, E., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956-1982.
Candès, E. and Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717-772.
Candès, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053-2080.
Fazel, M. (2002). Matrix rank minimization with applications. PhD thesis, Stanford University.
Gaïffas, S. and Lecué, G. (2011). Weighted algorithms for compressed sensing and matrix completion. arXiv preprint arXiv:1107.1638.
Larsen, R. M. (2004). PROPACK: software for large and sparse SVD calculations. Available online at http://sun.stanford.edu/rmunk/propack.
F. Caron 43 / 44

Bibliography II
Mazumder, R., Hastie, T., and Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research, 11:2287-2322.
Todeschini, A., Caron, F., and Chavent, M. (2013). Probabilistic low-rank matrix completion with adaptive spectral regularization algorithms. In Advances in Neural Information Processing Systems, pages 845-853.
F. Caron 44 / 44