1-bit Matrix Completion. PAC-Bayes and Variational Approximation

1 1-bit Matrix Completion: PAC-Bayes and Variational Approximation (with P. Alquier). PhD supervisor: N. Chopin. Bayes In Paris, 5 January 2017 (Happy New Year!)

2 Topics Covered
- Matrix completion
- PAC-Bayesian estimation
- Variational Bayes approximation
- PAC bounds and algorithm
- Experiments

4 Introduction: Matrix Completion
Incomplete matrix (figure): black cells are known entries, white cells are unknown entries.
Matrix completion, main questions:
- When is it possible?
- With which method?
- Is it efficient?

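5 Introduction: Low Rank Assumption
$M = L R^\top$, where $M$ has $m_1 \times m_2$ entries while the factors $L$ ($m_1 \times r$) and $R$ ($m_2 \times r$) together have only $m_1 r + m_2 r$ entries.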
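To make the entry count concrete, here is a small NumPy sketch comparing the storage of a low-rank matrix with that of its factors. The sizes `m1`, `m2`, `r` are arbitrary illustrative values, not taken from the talk.

```python
import numpy as np

# Illustrative sizes (assumptions, not taken from the slides).
m1, m2, r = 200, 100, 5

rng = np.random.default_rng(0)
L = rng.standard_normal((m1, r))   # m1 x r factor
R = rng.standard_normal((m2, r))   # m2 x r factor
M = L @ R.T                        # m1 x m2 matrix of rank at most r

full_entries = m1 * m2             # entries needed to store M directly
factor_entries = m1 * r + m2 * r   # entries needed to store L and R
rank_M = np.linalg.matrix_rank(M)  # numerical rank of the product

print(full_entries, factor_entries, rank_M)
```

With these sizes the factors need far fewer entries than the full matrix, which is what makes recovery from few observations plausible.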

7 Introduction: Recovery
Data: given $n$ observations, $Y_i \in \mathbb{R}$ and $X_i \in \{1, \dots, m_1\} \times \{1, \dots, m_2\}$.
Heuristic: best low-rank approximation of the observations.
Method, penalized least squares:
$$\hat{M} = \arg\min_M \sum_{i=1}^n (Y_i - M_{X_i})^2 + \lambda \, \mathrm{rank}(M)$$
Convex penalty (nuclear norm):
$$\hat{M} = \arg\min_M \sum_{i=1}^n (Y_i - M_{X_i})^2 + \lambda \|M\|_*$$
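As an illustration of the two criteria, the sketch below evaluates the rank-penalized and the nuclear-norm-penalized objectives for a given candidate matrix. The function name `penalized_objectives` and the toy data are hypothetical. It only evaluates the objectives and does not solve the minimization.

```python
import numpy as np

def penalized_objectives(M, rows, cols, y, lam):
    """Return (rank-penalized, nuclear-norm-penalized) least-squares objectives."""
    residual = np.sum((y - M[rows, cols]) ** 2)       # sum_i (Y_i - M_{X_i})^2
    rank_obj = residual + lam * np.linalg.matrix_rank(M)
    nuclear_obj = residual + lam * np.linalg.norm(M, ord="nuc")  # sum of singular values
    return rank_obj, nuclear_obj

# Toy candidate: singular values 4, 3, 0, hence rank 2 and nuclear norm 7.
M = np.diag([3.0, 4.0, 0.0])
rows = np.array([0, 1])
cols = np.array([0, 1])
y = np.array([3.0, 5.0])          # observed values at (0, 0) and (1, 1)

rank_obj, nuclear_obj = penalized_objectives(M, rows, cols, y, lam=0.5)
print(rank_obj, nuclear_obj)
```

The rank penalty counts nonzero singular values while the nuclear norm sums them, which is why the second criterion is convex in $M$.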

9 Bayesian Framework
Prior distribution on matrices: factorization $M = L R^\top$ with
$$L_{i,k}, R_{j,k} \mid \gamma_k \sim \mathcal{N}(0, \gamma_k), \qquad \gamma_k \sim \Gamma(a, b).$$
Parameter: $\theta = (L, R, \gamma)$.
Rank adaptation:
- large $\gamma_k$: large variance for $L_{i,k}$ and $R_{j,k}$
- small $\gamma_k$: small variance, hence an almost null column
Model, Gaussian noise: $Y_i = M_{X_i} + \varepsilon_i$, with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$.
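The hierarchical prior and the Gaussian observation model can be simulated as in the following sketch. Two points are assumptions not stated on the slide: $\Gamma(a, b)$ is read as shape $a$ and rate $b$, and the observed positions $X_i$ are drawn uniformly. The function names are hypothetical.

```python
import numpy as np

def sample_prior(m1, m2, K, a, b, rng):
    """Draw (L, R, gamma) from the prior; column k has variance gamma_k."""
    # Assumption: Gamma(a, b) means shape a and rate b, so NumPy's scale is 1 / b.
    gamma = rng.gamma(shape=a, scale=1.0 / b, size=K)
    L = rng.standard_normal((m1, K)) * np.sqrt(gamma)
    R = rng.standard_normal((m2, K)) * np.sqrt(gamma)
    return L, R, gamma

def sample_observations(M, n, sigma, rng):
    """Draw n noisy entries Y_i = M_{X_i} + eps_i with eps_i ~ N(0, sigma^2)."""
    # Assumption: positions X_i are sampled uniformly over the matrix.
    rows = rng.integers(0, M.shape[0], size=n)
    cols = rng.integers(0, M.shape[1], size=n)
    y = M[rows, cols] + sigma * rng.standard_normal(n)
    return rows, cols, y

rng = np.random.default_rng(0)
L, R, gamma = sample_prior(m1=30, m2=20, K=4, a=1.0, b=2.0, rng=rng)
rows, cols, y = sample_observations(L @ R.T, n=100, sigma=0.1, rng=rng)
```

Columns whose $\gamma_k$ is drawn close to zero contribute almost nothing to $L R^\top$, which is the rank-adaptation effect described on the slide.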

10 Introduction: Bibliography about Least Squares (among many others)
- Exact reconstruction: Candès & Tao, 2010
- Noisy reconstruction: Candès & Plan, 2010
- Trace regression: Koltchinskii et al., 2011
- General sampling: Klopp, 2014
- Bayesian way of life: Mai & Alquier, 2015
- Variational Bayes approximation: Lim & Teh, 2007

12 Introduction: 1-bit Matrix Completion
Observations are binary: $Y_i \in \{-1, +1\}$.
- $+1$: like / vote yes
- $-1$: dislike / vote no
Toy example, movie recommendation: a table of users (rows: Michel, Vincent, Pierre, ..., Keefe, Gosia, Emma) by movies (columns: James Bond, Toy Story, Batman, Heat, Psycho, ...), with an entry in $\{-1, +1\}$ wherever a rating is observed.

14 Introduction: Previous Models
Statistical model:
$$Y \mid \{X = x\} = \begin{cases} +1 & \text{with prob. } f(M^0_x) \\ -1 & \text{with prob. } 1 - f(M^0_x) \end{cases}$$
where $f : \mathbb{R} \to [0, 1]$ is the link function and $M^0$ is the parameter.
Assumptions and results:
- Assumption: $M^0$ has low rank.
- Estimator: $\hat{M} = \arg\min_M \left\{ -\sum_{i=1}^n \log L_i(M) + \lambda \|M\|_* \right\}$
- Results: recovery of $M^0$.
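A minimal sketch of this model follows, assuming the logistic link $f(t) = 1 / (1 + e^{-t})$ (the slide leaves $f$ generic) and uniformly sampled positions. The function names are hypothetical.

```python
import numpy as np

def logistic(t):
    """Logistic link f: R -> [0, 1] (one possible choice of link function)."""
    return 1.0 / (1.0 + np.exp(-t))

def sample_one_bit(M0, n, rng):
    """Draw n binary observations: Y = +1 with prob. f(M0_x), else -1."""
    rows = rng.integers(0, M0.shape[0], size=n)
    cols = rng.integers(0, M0.shape[1], size=n)
    p_plus = logistic(M0[rows, cols])
    y = np.where(rng.random(n) < p_plus, 1, -1)
    return rows, cols, y

def neg_log_likelihood(M, rows, cols, y):
    """Negative log-likelihood; for the logistic link P(Y = y | X = x) = f(y * M_x)."""
    return np.sum(np.logaddexp(0.0, -y * M[rows, cols]))

rng = np.random.default_rng(0)
M0 = rng.standard_normal((10, 2)) @ rng.standard_normal((8, 2)).T  # rank-2 parameter
rows, cols, y = sample_one_bit(M0, n=50, rng=rng)
nll_at_zero = neg_log_likelihood(np.zeros_like(M0), rows, cols, y)
```

At $M = 0$ every observation has probability $1/2$, so the negative log-likelihood equals $n \log 2$, a convenient sanity check.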

15 Outline
1 Introduction
2 PAC-Bayes Estimation
3 Theoretical Results
4 Example

17 PAC-Bayes Estimation
Different background: classification (machine learning), with $M \in \mathbb{R}^{m_1 \times m_2}$ and $(y, x) \in \{-1, +1\} \times ([m_1] \times [m_2])$.
Relevant loss, 0-1 loss: $\ell_M(y, x) = \mathbb{1}(y \neq \mathrm{sign}(M_x))$
Convex surrogate, hinge loss: $\ell^h_M(y, x) = \max(0, 1 - y M_x)$
Integrated 0-1 risk of $M$: $R(M) = \mathbb{E}[\ell_M(Y, X)]$
Empirical hinge risk of $M$: $r_n^h(M) = \frac{1}{n} \sum_{i=1}^n \ell^h_M(Y_i, X_i)$
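The two losses can be compared on a few entries with the sketch below. The function names and the toy values are hypothetical. Note that `np.sign(0)` is 0, so an entry exactly equal to zero counts as an error here.

```python
import numpy as np

def zero_one_risk(M, rows, cols, y):
    """Empirical 0-1 risk: fraction of observations with y != sign(M_x)."""
    return np.mean(y != np.sign(M[rows, cols]))

def hinge_risk(M, rows, cols, y):
    """Empirical hinge risk r_n^h(M) = mean of max(0, 1 - y * M_x)."""
    return np.mean(np.maximum(0.0, 1.0 - y * M[rows, cols]))

# Toy 1 x 3 matrix with three observed entries.
M = np.array([[2.0, -0.5, 0.3]])
rows = np.array([0, 0, 0])
cols = np.array([0, 1, 2])
y = np.array([1, 1, -1])

r01 = zero_one_risk(M, rows, cols, y)   # errors on the 2nd and 3rd entries
rh = hinge_risk(M, rows, cols, y)       # (0 + 1.5 + 1.3) / 3
print(r01, rh)
```

The hinge loss upper-bounds the 0-1 loss pointwise, so the empirical hinge risk is never smaller than the empirical 0-1 risk.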

18 PAC-Bayesian Estimation: Pseudo-posterior Distribution
Classic Bayesian framework: if $L(\theta)$ is the likelihood, the posterior distribution is
$$p(\theta) \propto L(\theta) \pi(\theta).$$
No likelihood, pseudo-posterior distribution:
$$p(\theta) \propto \exp[-\lambda r_n^h(L R^\top)] \pi(\theta).$$
- Not a statistical model, just a loss function.
- Related to aggregation and EWA.
- Big issue: hard to use in practice.

19 PAC-Bayes Estimation: Variational Approximation
Fast method: search for an approximation in a family $\mathcal{F}$,
$$\tilde{\rho} = \arg\min_{\rho \in \mathcal{F}} \mathrm{KL}(\rho, p).$$
Practical aspects:
- family $\mathcal{F}$ small for a fast computation;
- family $\mathcal{F}$ large to get a close approximation.
Usual families:
- Mean field: product of independent factors
- Parametric: optimizes only a finite number of parameters

21 PAC-Bayes Estimation: Variational Approximation
Proposed family, a product of parametric densities: for $\rho \in \mathcal{F}$,
$$\rho(L, R, \gamma) = \prod_{i,k} \mathcal{N}(L_{i,k}; L^0_{i,k}, v^0_{i,k}) \prod_{j,k} \mathcal{N}(R_{j,k}; R^0_{j,k}, u^0_{j,k}) \prod_k \Gamma(\gamma_k; \alpha_k, \beta_k).$$
Parameters: $L^0, R^0, v^0, u^0, \alpha, \beta$.
Question: how to find the minimizer of $\mathrm{KL}(\rho, p)$?

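22 PAC-Bayes Estimation: Approximation of the Variational Approximation
Difficulty: $\mathrm{KL}(\rho, p)$ is intractable. Instead, optimize an upper bound, which may not be far from it.
Upper bound: for all $\rho \in \mathcal{F}$,
$$\mathrm{KL}(\rho, p) = \lambda \int r_n^h \, d\rho + \mathrm{KL}(\rho, \pi) + C \le r_n^h(L^0 R^{0\top}) + \mathcal{R}(L^0, R^0, v^0, u^0, \alpha, \beta) + C$$
$$\tilde{\rho} = \arg\min_{L^0, R^0, v^0, u^0, \alpha, \beta} \left\{ r_n^h(L^0 R^{0\top}) + \mathcal{R}(L^0, R^0, v^0, u^0, \alpha, \beta) \right\}$$
Point estimate: $\mathbb{E}_{\tilde{\rho}}(L R^\top) = L^0 R^{0\top}$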

23 PAC-Bayes Estimation: Algorithm
Program:
$$\min_{L^0, R^0, v^0, u^0, \alpha, \beta} \left\{ r_n^h(L^0 R^{0\top}) + \mathcal{R}(L^0, R^0, v^0, u^0, \alpha, \beta) \right\}$$
1 Gradient descent for $L^0$
2 Gradient descent for $R^0$
3 Explicit formula for $u^0$, $v^0$
4 Mean field update for $\alpha$, $\beta$
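The alternating structure of steps 1 and 2 can be sketched as follows. This is not the authors' implementation: the slides do not give the form of the regularizer, so a plain ridge penalty stands in for it (an assumption), and steps 3 and 4 are omitted. The step-halving rule and all names are also illustrative choices.

```python
import numpy as np

def objective(L0, R0, rows, cols, y, ridge):
    """Empirical hinge risk of L0 R0^T plus a ridge penalty (stand-in regularizer)."""
    margins = y * np.sum(L0[rows] * R0[cols], axis=1)
    hinge = np.mean(np.maximum(0.0, 1.0 - margins))
    return hinge + ridge * (np.sum(L0 ** 2) + np.sum(R0 ** 2))

def subgradients(L0, R0, rows, cols, y, ridge):
    """Subgradients of the objective with respect to L0 and R0."""
    margins = y * np.sum(L0[rows] * R0[cols], axis=1)
    # Derivative of the mean hinge loss with respect to each observed entry M_x.
    w = -y * (margins < 1.0) / len(y)
    gL = 2.0 * ridge * L0
    gR = 2.0 * ridge * R0
    np.add.at(gL, rows, w[:, None] * R0[cols])  # accumulate over repeated rows
    np.add.at(gR, cols, w[:, None] * L0[rows])  # accumulate over repeated columns
    return gL, gR

def fit(rows, cols, y, m1, m2, K=3, ridge=1e-3, lr=1.0, n_iter=50, seed=0):
    """Alternate subgradient steps on L0 and R0, halving the step until no increase."""
    rng = np.random.default_rng(seed)
    L0 = 0.1 * rng.standard_normal((m1, K))
    R0 = 0.1 * rng.standard_normal((m2, K))
    history = [objective(L0, R0, rows, cols, y, ridge)]
    for _ in range(n_iter):
        for block in ("L", "R"):
            gL, gR = subgradients(L0, R0, rows, cols, y, ridge)
            step = lr
            for _try in range(20):
                cand_L = L0 - step * gL if block == "L" else L0
                cand_R = R0 - step * gR if block == "R" else R0
                value = objective(cand_L, cand_R, rows, cols, y, ridge)
                if value <= history[-1]:      # accept only non-increasing steps
                    L0, R0 = cand_L, cand_R
                    history.append(value)
                    break
                step *= 0.5
    return L0, R0, history

# Synthetic noiseless 1-bit data from a rank-2 matrix (illustrative only).
rng = np.random.default_rng(1)
m1, m2 = 20, 15
M_true = rng.standard_normal((m1, 2)) @ rng.standard_normal((m2, 2)).T
rows = rng.integers(0, m1, size=200)
cols = rng.integers(0, m2, size=200)
y = np.where(M_true[rows, cols] >= 0, 1, -1)

L0_hat, R0_hat, history = fit(rows, cols, y, m1, m2)
M_hat = L0_hat @ R0_hat.T   # point estimate L0 R0^T; predictions are sign(M_hat)
```

Because a step is accepted only when the objective does not increase, `history` is non-increasing by construction. A full implementation would also update the variances and the Gamma parameters at each sweep.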

26 Theoretical Results: Best Predictor
Bayes predictor: $\forall x, \ M^B_x = \mathrm{sign}(\mathbb{E}[Y \mid X = x])$
Risk control in a restrictive case. Assume:
- Margin assumption
- $Y$ is observed without noise
- $M^B$ has rank $r$ (hopefully small)
Then, with probability at least $1 - \epsilon$,
$$\int R \, d\tilde{\rho} \le C \left[ \frac{r (m_1 + m_2)(\log n + l) + \log \frac{1}{\epsilon}}{n} \right].$$

28 Theoretical Results: Best Predictor
Bayes predictor: $\forall x, \ M^B_x = \mathrm{sign}(\mathbb{E}[Y \mid X = x])$
Risk control in a restrictive case. Assume:
- Margin assumption
- $Y$ is observed with a switch noise, with probability $p$
- $M^B$ has rank $r$ (hopefully small)
Then, with probability at least $1 - \epsilon$,
$$\int R \, d\tilde{\rho} \le C \left[ \frac{r (m_1 + m_2)(\log n + l) + \log \frac{1}{\epsilon}}{n} \right] + (2 + \delta) p.$$

30 Example
Figure: misclassification rate against the level of switch noise, for the methods Freq. Logit, HL G and HL IG.

31 Application: Movie Recommender System
MovieLens dataset:
- 943 users
- 1682 movies
- 100,000 ratings from 1 (very bad) to 5 (excellent)
- split into: $+1$ = ratings > 3.5, $-1$ = ratings < 3.5
Table: Results
Algorithm | Hinge Bayes. | Logit Bayes. | Logit Freq.
Correct. | 72% | 68% | 73%
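The binarization of the ratings can be sketched as below, assuming a single threshold of 3.5 separates the two classes. The function name `binarize_ratings` is hypothetical, and the input here is a toy list rather than the MovieLens file.

```python
import numpy as np

def binarize_ratings(ratings, threshold=3.5):
    """Map ratings in {1, ..., 5} to +1 (above threshold) or -1 (below threshold)."""
    ratings = np.asarray(ratings)
    return np.where(ratings > threshold, 1, -1)

labels = binarize_ratings([1, 2, 3, 4, 5])
print(labels)
```

Since ratings are integers, no rating equals the threshold, so every observation gets a label.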

32 Conclusion
Summary:
- New approach to 1-bit matrix completion
- PAC-Bayes estimation
- Theoretical bounds, not yet conclusive enough
- Works well in practice
Future work:
- Different losses for matrix completion
- New estimation to get better bounds

33 Bibliography
- OLS matrix completion: Koltchinskii, Lounici, Tsybakov, Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion, 2011
- 1-bit matrix completion: Lafond, Klopp, Moulines, Salmon, Probabilistic low-rank matrix completion on finite alphabets, 2014
- Variational approximation of PAC-Bayes estimation: Alquier, Ridgway, Chopin, On the properties of variational approximations of Gibbs posteriors, 2015
- Our paper: Cottet, Alquier, 1-bit Matrix Completion: PAC-Bayesian Analysis of a Variational Approximation, 2016

34 Thank you. Questions?
