Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

Adrien Todeschini (Inria Bordeaux)
JdS 2014, Rennes, Aug. 2014
Joint work with François Caron (Univ. Oxford) and Marie Chavent (Inria, Univ. Bordeaux)

Disclaimer

Not a fully Bayesian approach: the talk derives an EM algorithm for MAP estimation, but one that builds on a hierarchical prior construction.

Outline

- Introduction
- Hierarchical adaptive spectral penalty
- EM algorithm for MAP estimation
- Experiments

Matrix Completion

Netflix prize: 480k users and 18k movies providing 1-5 ratings; 99% of the ratings are missing. Objective: predict the missing entries in order to make recommendations.

(Illustration: sparsely observed users-by-movies rating matrix.)

Matrix Completion

Objective: complete a matrix $X$ of size $m \times n$ from a subset of its entries.

Applications: recommender systems, image inpainting, imputation of missing data.

Matrix Completion

Potentially large matrices (each dimension of order $10^4$ to $10^6$), very sparsely observed (1%-10% of the entries).

Low rank Matrix Completion

Assume that the complete matrix $Z$ is of low rank:
$$\underbrace{Z}_{m \times n} \simeq \underbrace{A}_{m \times k}\, \underbrace{B^T}_{k \times n}, \qquad k \ll \min(m, n) = r.$$
In the recommendation setting, the rows of $A$ collect user features and the columns of $B^T$ collect item features.

Low rank Matrix Completion

Let $\Omega \subset \{1, \ldots, m\} \times \{1, \ldots, n\}$ be the subset of observed entries. For $(i, j) \in \Omega$,
$$X_{ij} = Z_{ij} + \varepsilon_{ij}, \qquad \varepsilon_{ij} \overset{iid}{\sim} \mathcal{N}(0, \sigma^2),$$
where $\sigma^2 > 0$.
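
A minimal simulation sketch of this observation model (all dimensions, rank, noise level and observation rate below are illustrative assumptions, not values from the talk):

```python
# Sketch: simulate X_ij = Z_ij + eps_ij on a random subset Omega,
# with Z = A B^T of low rank. Constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 150, 5        # assumed sizes and rank
sigma, p_obs = 0.1, 0.2      # assumed noise level and observation rate

A = rng.standard_normal((m, k))
B = rng.standard_normal((n, k))
Z = A @ B.T                              # complete low-rank matrix

mask = rng.random((m, n)) < p_obs        # Omega: True on observed entries
X = np.where(mask, Z + sigma * rng.standard_normal((m, n)), 0.0)
```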

Low rank Matrix Completion

Optimization problem:
$$\min_{Z} \; \underbrace{\frac{1}{2\sigma^2} \sum_{(i,j) \in \Omega} (X_{ij} - Z_{ij})^2}_{-\text{loglikelihood}} \;+\; \underbrace{\lambda\, \mathrm{rank}(Z)}_{\text{penalty}}$$
where $\lambda > 0$ is some regularization parameter. This problem is non-convex and computationally hard for a general subset $\Omega$.

Low rank Matrix Completion

Matrix completion with nuclear norm penalty:
$$\min_{Z} \; \underbrace{\frac{1}{2\sigma^2} \sum_{(i,j) \in \Omega} (X_{ij} - Z_{ij})^2}_{-\text{loglikelihood}} \;+\; \underbrace{\lambda \|Z\|_*}_{\text{penalty}}$$
where $\|Z\|_*$ is the nuclear norm of $Z$, i.e. the sum of the singular values of $Z$. This is a convex relaxation of the rank-penalized problem [Fazel, 2002, Candès and Recht, 2009, Candès and Tao, 2010].

Low rank Matrix Completion

When the matrix $X$ is fully observed, the nuclear norm objective
$$\min_{Z} \; \frac{1}{2\sigma^2} \|X - Z\|_F^2 + \lambda \|Z\|_*,$$
where $\|\cdot\|_F$ is the Frobenius norm, has a global solution given by a soft-thresholded SVD:
$$\hat Z = S_{\lambda\sigma^2}(X), \qquad S_{\lambda}(X) = \tilde U D_{\lambda} \tilde V^T,$$
with $D_{\lambda} = \mathrm{diag}((\tilde d_1 - \lambda)_+, \ldots, (\tilde d_r - \lambda)_+)$ and $t_+ = \max(t, 0)$ [Cai et al., 2010, Mazumder et al., 2010].
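
A small sketch of the soft-thresholding operator $S_\lambda$ on the singular values (the function name is ours; a plain dense SVD is used for clarity):

```python
import numpy as np

def soft_threshold_svd(X, lam):
    """S_lam(X) = U diag((d_i - lam)_+) V^T, the soft-thresholded SVD."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(d - lam, 0.0)) @ Vt

# For a fully observed X, the global solution is
# Z_hat = soft_threshold_svd(X, lam * sigma**2)
```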

Low rank Matrix Completion

Soft-Impute algorithm [Mazumder et al., 2010]:
- Start with an initial matrix $Z^{(0)}$.
- At each iteration $t = 1, 2, \ldots$:
  - Replace the missing elements of $X$ with the corresponding entries of $Z^{(t-1)}$.
  - Perform a soft-thresholded SVD on the completed matrix, with shrinkage $\lambda$, to obtain the low rank matrix $Z^{(t)}$.
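
A minimal Soft-Impute sketch reusing soft_threshold_svd from the previous snippet (a fixed iteration count replaces a proper convergence test; illustrative only):

```python
import numpy as np

def soft_impute(X, mask, lam, n_iter=100):
    """Soft-Impute: alternate filling missing entries and soft-thresholding."""
    Z = np.zeros_like(X)                       # Z^(0)
    for _ in range(n_iter):
        X_filled = np.where(mask, X, Z)        # missing entries <- Z^(t-1)
        Z = soft_threshold_svd(X_filled, lam)  # soft-thresholded SVD, shrinkage lam
    return Z
```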

Low rank Matrix Completion

Thresholding rule. Figure: thresholding rules on the singular values $d_i$ of $X$, comparing the nuclear norm with HASP (β = 2) and HASP (β = 0.1).

Outline

- Introduction
- Hierarchical adaptive spectral penalty
- EM algorithm for MAP estimation
- Experiments

Nuclear Norm penalty

The nuclear norm solution is the Maximum A Posteriori (MAP) estimate
$$\hat Z = \arg\max_{Z} \left[ \log p(X \mid Z) + \log p(Z) \right]$$
under the prior $p(Z) \propto \exp(-\lambda \|Z\|_*)$, where $Z = U D V^T$ with $D = \mathrm{diag}(d_1, d_2, \ldots, d_r)$, $U$ and $V$ have Haar uniform priors on unitary matrices and $d_i \overset{iid}{\sim} \mathrm{Exp}(\lambda)$.

(Graphical model: a single rate $\lambda$ shared by all singular values $d_1, \ldots, d_r$.)

Hierarchical adaptive spectral penalty

Each singular value has its own random shrinkage coefficient. Hierarchical model, for each singular value $i = 1, \ldots, r$:
$$d_i \mid \gamma_i \sim \mathrm{Exp}(\gamma_i), \qquad \gamma_i \sim \mathrm{Gamma}(a, b).$$
We set $a = \lambda b$ and $b = \beta$ [Todeschini et al., 2013].

(Graphical model: hyperparameters $\lambda, \beta$ generate the rates $\gamma_1, \ldots, \gamma_r$, which in turn generate $d_1, \ldots, d_r$.)

Hierarchical adaptive spectral penalty

Marginal distribution over $d_i$:
$$p(d_i) = \int_0^{\infty} \mathrm{Exp}(d_i; \gamma_i)\, \mathrm{Gamma}(\gamma_i; a, b)\, d\gamma_i = \frac{a\, b^a}{(d_i + b)^{a+1}},$$
a Pareto-type distribution with heavier tails than the exponential distribution.

Figure: marginal densities $p(d_i)$ for β = ∞ (exponential limit), β = 2 and β = 0.1.
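
An illustrative Monte Carlo check of this marginal (the hyperparameter values are arbitrary): samples drawn from the hierarchical model should match the closed-form density above.

```python
import numpy as np

a, b = 2.0, 1.0                                          # illustrative hyperparameters
rng = np.random.default_rng(1)

gamma = rng.gamma(shape=a, scale=1.0 / b, size=200_000)  # gamma_i ~ Gamma(a, b), rate b
d = rng.exponential(scale=1.0 / gamma)                   # d_i | gamma_i ~ Exp(gamma_i)

def marginal_density(d, a, b):
    """p(d) = a b^a / (d + b)^(a + 1)."""
    return a * b**a / (d + b) ** (a + 1)

hist, edges = np.histogram(d, bins=100, range=(0.0, 5.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - marginal_density(centers, a, b))))  # should be small
```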

Hierarchical adaptive spectral penalty pen(z) = log p(z) = r X (a + 1) log(b + di ) i=1 (a) Nuclear norm (e) I `1 norm (b) (f) HASP (β = 1) HAL (β = 1) (c) (g) HASP (β = 0.1) HAL (β = 0.1) (d) Rank penalty (h) `0 norm Admits as special case the nuclear norm penalty λ Z when a = λb and b. A. Todeschini 18 / 35

Outline

- Introduction
- Hierarchical adaptive spectral penalty
- EM algorithm for MAP estimation
- Experiments

EM algorithm for MAP estimation

Expectation Maximization (EM) algorithm to obtain the MAP estimate
$$\hat Z = \arg\max_{Z} \left[ \log p(X \mid Z) + \log p(Z) \right],$$
i.e. to minimize
$$L(Z) = \frac{1}{2\sigma^2} \left\| P_{\Omega}(X) - P_{\Omega}(Z) \right\|_F^2 + (a + 1) \sum_{i=1}^{r} \log(b + d_i),$$
where
$$P_{\Omega}(X)(i, j) = \begin{cases} X_{ij} & \text{if } (i, j) \in \Omega \\ 0 & \text{otherwise,} \end{cases} \qquad P_{\Omega}^{\perp}(X)(i, j) = \begin{cases} 0 & \text{if } (i, j) \in \Omega \\ X_{ij} & \text{otherwise.} \end{cases}$$
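
The two projection operators as a short sketch (mask is a boolean array with True on the observed set Ω; function names are ours):

```python
import numpy as np

def P_Omega(X, mask):
    """Keep the observed entries of X, set the rest to zero."""
    return np.where(mask, X, 0.0)

def P_Omega_perp(X, mask):
    """Keep the unobserved entries of X, set the rest to zero."""
    return np.where(mask, 0.0, X)
```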

EM algorithm for MAP estimation

Latent variables: $\gamma = (\gamma_1, \ldots, \gamma_r)$ and $P_{\Omega}^{\perp}(X)$.

E step:
$$Q(Z, Z^*) = \mathbb{E}\left[ \log p(P_{\Omega}(X), Z, \gamma) \mid Z^*, P_{\Omega}(X) \right] = C - \frac{1}{2\sigma^2} \|X^* - Z\|_F^2 - \sum_{i=1}^{r} \omega_i d_i,$$
where $X^* = P_{\Omega}(X) + P_{\Omega}^{\perp}(Z^*)$ and $\omega_i = \mathbb{E}[\gamma_i \mid d_i^*] = \frac{a+1}{b + d_i^*}$, with $d_1^* \geq \ldots \geq d_r^*$ the singular values of $Z^*$.

EM algorithm for MAP estimation

M step:
$$\min_{Z} \; \frac{1}{2\sigma^2} \|X^* - Z\|_F^2 + \sum_{i=1}^{r} \omega_i d_i \qquad (1)$$
Problem (1) is an adaptive spectral penalty regularized optimization problem, with weights $\omega_i = \frac{a+1}{b + d_i^*}$ satisfying
$$d_1^* \geq d_2^* \geq \ldots \geq d_r^* \geq 0 \;\Rightarrow\; \omega_1 \leq \omega_2 \leq \ldots \leq \omega_r. \qquad (2)$$
Given condition (2), the solution is a weighted soft-thresholded SVD
$$\hat Z = S_{\sigma^2 \omega}(X^*), \qquad (3)$$
where $S_{\omega}(X) = \tilde U D_{\omega} \tilde V^T$ with $D_{\omega} = \mathrm{diag}((\tilde d_1 - \omega_1)_+, \ldots, (\tilde d_r - \omega_r)_+)$ [Gaïffas and Lecué, 2011].
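
A sketch of the weighted soft-thresholded SVD of equation (3), assuming the weights are already ordered as in condition (2):

```python
import numpy as np

def weighted_soft_threshold_svd(X, weights):
    """S_w(X) = U diag((d_i - w_i)_+) V^T for non-decreasing weights w."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(d - weights[: d.size], 0.0)) @ Vt
```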

EM algorithm for MAP estimation

Figure: thresholding rules on the singular values $d_i$ of $X$, comparing the nuclear norm with HASP (β = 2) and HASP (β = 0.1).

The weights penalize large singular values less heavily, hence reducing bias.

EM algorithm for MAP estimation

Hierarchical Adaptive Soft Impute (HASI) algorithm for matrix completion:
- Initialize $Z^{(0)}$ with Soft-Impute.
- At iteration $t \geq 1$:
  - For $i = 1, \ldots, r$, compute the weights $\omega_i^{(t)} = \frac{a+1}{b + d_i^{(t-1)}}$.
  - Set $Z^{(t)} = S_{\sigma^2 \omega^{(t)}}\!\left( P_{\Omega}(X) + P_{\Omega}^{\perp}(Z^{(t-1)}) \right)$.
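
Combining the sketches above, a compact HASI outline (the iteration count, initialization shrinkage and overall structure are illustrative assumptions, not the authors' exact implementation):

```python
import numpy as np

def hasi(X, mask, a, b, sigma2, lam_init, n_iter=100):
    """Hierarchical Adaptive Soft Impute (sketch)."""
    Z = soft_impute(X, mask, lam_init)                        # Z^(0) from Soft-Impute
    for _ in range(n_iter):
        d = np.linalg.svd(Z, compute_uv=False)                # singular values of Z^(t-1)
        w = (a + 1.0) / (b + d)                               # adaptive weights omega^(t)
        X_star = np.where(mask, X, Z)                         # P_Omega(X) + P_Omega^perp(Z^(t-1))
        Z = weighted_soft_threshold_svd(X_star, sigma2 * w)   # weighted soft-thresholded SVD
    return Z
```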

EM algorithm for MAP estimation

The HASI algorithm admits the Soft-Impute algorithm as a special case when $a = \lambda b$ and $b = \beta \to \infty$: in that limit, $\omega_i^{(t)} = \lambda$ for all $i$. When $\beta < \infty$, the algorithm adaptively updates the weights so as to penalize large singular values less heavily.

Outline

- Introduction
- Hierarchical adaptive spectral penalty
- EM algorithm for MAP estimation
- Experiments
  - Simulated data
  - Collaborative filtering examples

Simulated data

Figure: test error as a function of the rank for MMMF, SoftImp, SoftImp+, HardImp and HASI (β = 100, 10, 1). (a) SNR = 1, 50% missing, rank 5; (b) SNR = 10, 80% missing, rank 5.

Simulated data

We then remove 20% of the observed entries as a validation set to estimate the regularization parameters, and use the unobserved entries as a test set.

Figure: test error and estimated rank for MMMF, SoftImp, SoftImp+, HardImp and HASI. (c)-(d) SNR = 1, 50% missing; (e)-(f) SNR = 10, 80% missing.

Collaborative filtering examples (Jester)

                Jester 1          Jester 2          Jester 3
                24983 x 100       23500 x 100       24938 x 100
                27.5% miss.       27.3% miss.       75.3% miss.
Method          NMAE     Rank     NMAE     Rank     NMAE     Rank
MMMF            0.161    95       0.162    96       0.183    58
Soft Imp        0.161    100      0.162    100      0.184    78
Soft Imp+       0.169    14       0.171    11       0.184    33
Hard Imp        0.158    7        0.159    6        0.181    4
HASI            0.153    100      0.153    100      0.174    30

Collaborative filtering examples (Jester)

Figure: test error as a function of the rank for MMMF, SoftImp, SoftImp+, HardImp and HASI (β = 1000, 100, 10, 1). (g) Jester 1; (h) Jester 3.

Collaborative filtering examples (MovieLens)

                MovieLens 100k       MovieLens 1M
                943 x 1682           6040 x 3952
                93.7% miss.          95.8% miss.
Method          NMAE     Rank        NMAE     Rank
MMMF            0.195    50          0.169    30
Soft Imp        0.197    156         0.176    30
Soft Imp+       0.197    108         0.189    30
Hard Imp        0.190    7           0.175    8
HASI            0.187    35          0.172    27

Conclusion and perspectives

Conclusion:
- Good results compared to several alternative low rank matrix completion methods.
- Provides a bridge between nuclear norm and rank regularization algorithms.
- Can be extended to binary matrices.
- Non-convex optimization, but experiments show that initializing with the Soft-Impute algorithm gives very satisfactory results.
- Matlab code available online.

Perspectives:
- Fully Bayesian approach
- Tensor factorization
- Online EM

Bibliography I

Cai, J., Candès, E., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956-1982.

Candès, E. and Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717-772.

Candès, E. J. and Tao, T. (2010). The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053-2080.

Fazel, M. (2002). Matrix rank minimization with applications. PhD thesis, Stanford University.

Gaïffas, S. and Lecué, G. (2011). Weighted algorithms for compressed sensing and matrix completion. arXiv preprint arXiv:1107.1638.

Larsen, R. M. (2004). PROPACK: software for large and sparse SVD calculations. Available online at http://sun.stanford.edu/rmunk/propack.

Bibliography II

Mazumder, R., Hastie, T., and Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287-2322.

Todeschini, A., Caron, F., and Chavent, M. (2013). Probabilistic low-rank matrix completion with adaptive spectral regularization algorithms. In Advances in Neural Information Processing Systems, pages 845-853.

Thank you