Adaptive one-bit matrix completion

Size: px

Start display at page:

Download "Adaptive one-bit matrix completion"

Bonnie Craig
5 years ago
Views:

1 Adaptive one-bit matrix completion Joseph Salmon Télécom Paristech, Institut Mines-Télécom Joint work with Jean Lafond (Télécom Paristech) Olga Klopp (Crest / MODAL X, Université Paris Ouest) Éric Moulines (Télécom Paristech)

2 Motivation : recommender systems movies (Netflix, Itunes, Allociné)

3 Motivation : recommender systems songs (Pandora, Itunes)

4 Motivation : recommender systems restaurants (Yelp, la Fourchette)

5 Motivation : recommender systems trips, hotels (TripAdvisor, Voyages-SNCF)

6 Motivation : recommender systems...

7 Motivation : recommender systems In all those cases matrix completion is a crucial ingredient (not the only one) for improving recommender systems Koren et al. [2009]

8 Recommendation systems : movie example

9 Recommendation systems : movie example

10 Other aspect of matrix completion Quantum physics

11 Other aspect of matrix completion Quantum physics Image/signal processing with missing pixels

12 Other aspect of matrix completion Quantum physics Image/signal processing with missing pixels Communications

13 Other aspect of matrix completion Quantum physics Image/signal processing with missing pixels Communications Analysis of survey data

14 Other aspect of matrix completion Quantum physics Image/signal processing with missing pixels Communications Analysis of survey data...

15 Classical theoretical model : partial observation and Gaussian noise Observation model Matrix of true ratings : X R m 1 m 2 (to recover) Indexes observed : (ω i ) 1 i n i.i.d. Unif over [m 1 ] [m 2 ], Noisy observations : Y ωi = X ω i + σε ωi for 1 i n σ : noise level, ε : centered standard Gaussian random vector Rem: potentially n m 1 m 2 Rem: randomness sources : 1) index picking 2) degraded answer

16 Some dataset sizes Parameter Size m 1 m 2 n MovieLens NetFlix Yahoo

17 Low rank and matrix factorization Underlying simplifying assumption : r = rank(x ) is small Consequence : pass from m 1 m 2 to r (m 1 + m 2 ) degrees of freedom Interpretation : a combination of few items can represent all of them a combination of few users can represent all of them

18 Popular estimator Least square penalized by trace/nuclear norm ˆX = arg min X R m 1 m n i=1 (Y ωi X ωi ) 2 + λ X σ,1 X σ,1 : trace/nuclear norm (l 1 norm of the singular values) λ>0 : regularization parameter controlling data-fitting / low rank trade-off Rem: vector case : 1 regularization sparsity (LASSO) matrix case : σ,1 regularization low rank

19 Previous theoretical work on matrix completion with Rem: can be extended to non uniform sampling provided each coefficient is sampled sufficiently often noise-free scenario : Recht, Fazel and Parrilo [2010] Candès and Recht [2009] Candès and Tao [2010] additive noise scenario : Candès and Plan [2010] Koltchinskii, Tsybakov and Lounici [2011] Negahban and Wainwrigh [2012] Klopp [2014] Typical results : Klopp [2014] For λ = Cσ log(m 1 +m 2 ) min(m 1,m 2 )n, w.h.p. ˆX X 2 F c max(σ 2, X 2 m 1 m ) r max(m 1, m 2 ) log(m 1 + m 2 ) 2 n

20 Limits of the previous model Generally ratings are discrete (0-1, 1-5 stars, etc.) In surveys, answers are naturally discrete (yes/no, classes, etc.) Variance of the noise model (implicitly) assumed identical for all entries. Cases with picky distribution e.g., movies with agreement (only 5 s) / disagreement among the audience (lots of 1 s and lots of 5 s).

21 Binary model Observation model Davenport et al. [2012] Matrix of true ratings to recover : X R m 1 m 2 Indexes observed : (ω i ) 1 i n i.i.d. Unif over [m 1 ] [m 2 ], Indirect observations : P(Y ωi =1)=f (X ω i ) and P(Y ωi = 1) = 1 f (X ω i ), where f is a link function taking value in [0, 1]. Rem: Uniform sampling only for the sake of simplicity Rem: To obtain theoretical guarantees log(f ( )) and log(1 f ( )) need to be concave (e.g., logit, probit)

22 The estimator The log-likelihood of the observations X L(X) : L(X) = n i=1 [ ] 1 {Yωi =1} log(f (X ωi )) + 1 {Yωi = 1} log(1 f (X ωi )). Penalized log-likelihood estimator ˆX = arg min X R m 1 m 2 F(X), where F(X) = 1 n L(X)+λ X σ,1, with λ>0 a regularization parameter.

23 Results Proposed result For λ = C KL log(m 1 +m 2 ) min(m 1,m 2 )n w.h.p. ( f (X ), f ( ˆX) ) c r max(m 1, m 2 ) log(m 1 + m 2 ) n where we define the Kullback-Liebler divergence : KL (P, Q) := 1 m 1 m 2 1 i m 1 1 j m 2 [ P i,j log P i,j Q i,j +(1 P i,j ) log 1 P i,j 1 Q i,j ].

24 Multinomial Coordinate Lifted Gradient Desc. : Dudik et al. [2012] Data: Observations : Y ini. param. : θ 0 Θ + ; tolerance : ɛ ; maximum iterations : K Result: θ Θ + Initialization : θ θ 0, conv 0, k 0 while k K and conv = 0 do Compute top singular vectors pair of ( F(W θ )) : u, v Let g = λ + L, uv if g ɛ/2 then β = arg min b R F(θ +(bδ uv )) θ θ + βδ uv ; k k +1 end else if g ɛ then conv 1 end else θ arg min θ R +K,supp(θ ) supp(θ) F(θ ) ; k k +1 end end end

25 Main interests compared to other classical methods Does not require full SVD as proximal methods Convex formulation which offers strong theoretical guarantees Well adapted to sparse structure

26 Numerical experiments Simulate X for m 1 m 2 = , , and r =5

27 Numerical experiments Simulate X for m 1 m 2 = , , and r =5 For each X simulate with logit distribution n observations, from n = to

28 Numerical experiments Simulate X for m 1 m 2 = , , and r =5 For each X simulate with logit distribution n observations, from n = to For Gaussian and Binomial estimator, choose λ by cross validation

29 Numerical experiments Simulate X for m 1 m 2 = , , and r =5 For each X simulate with logit distribution n observations, from n = to For Gaussian and Binomial estimator, choose λ by cross validation For Gaussian and Binomial estimator, estimate X

30 Numerical experiments Simulate X for m 1 m 2 = , , and r =5 For each X simulate with logit distribution n observations, from n = to For Gaussian and Binomial estimator, choose λ by cross validation For Gaussian and Binomial estimator, estimate X For Gaussian and Binomial estimator, compute KL divergence

31 mean KL distance Illustration over simulation (with cross-validation choice for λ) 0.20 KL distance for logistic (plain), gaussian (dashed) size: 100x150 size: 300x450 size: 900x number of observations

32 Conclusion New results for binary / logit matrix completion No need to know a bound on the rank or to make a spikiness assumption Fast algorithm based on Lifted Coordinate Descent Dudik et al. [2012] Extension to multinomial under some separability

33 Références I E. J. Candès and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6) : , E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9(6) : , E. J. Candès and T. Tao. The power of convex relaxation : Near-optimal matrix completion. IEEE Trans. Inf. Theory, 56(5) : , M. Dudík, Z. Harchaoui, and J. Malick. Lifted coordinate descent for learning with trace-norm regularization. In AISTATS, M. A. Davenport, Y. Plan, E. van den Berg, and M. Wootters. 1-bit matrix completion. CoRR, abs/ , 2012.

34 Références II Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8) :30 37, O. Klopp. Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 2(1) : , V. Koltchinskii, A. B. Tsybakov, and K. Lounici. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Statist., 39(5) : , S. Negahban and M. J. Wainwright. Restricted strong convexity and weighted matrix completion : optimal bounds with noise. J. Mach. Learn. Res., 13 : , 2012.

35 Références III B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review, 52(3) : , 2010.

Probabilistic low-rank matrix completion on finite alphabets

Probabilistic low-rank matrix completion on finite alphabets Jean Lafond Institut Mines-Télécom Télécom ParisTech CNRS LTCI jean.lafond@telecom-paristech.fr Olga Klopp CREST et MODAL X Université Paris