Maximum Margin Matrix Factorization

1 Maximum Margin Matrix Factorization. Nati Srebro, Toyota Technological Institute, Chicago. Joint work with Noga Alon (Tel-Aviv), Yonatan Amit (Hebrew U), Alexandre d'Aspremont (Princeton), Michael Fink (Hebrew U), Tommi Jaakkola (MIT), Ben Marlin (Toronto), Jason Rennie (MIT), Adi Shraibman (Hebrew U), Shimon Ullman (Weizmann)

2 Collaborative Prediction. Based on a partially observed users × movies matrix, predict the unobserved entries: will user i like movie j? [figure: ratings matrix with a few observed entries and question marks elsewhere]

3 Linear Factor Model. The preferences of a specific user are a weighted combination of a few factors over movies: e.g., $+1 \cdot v_1$ (comic value) $+2 \cdot v_2$ (dramatic value) $-1 \cdot v_3$ (violence), where the $v_k$ are factor vectors over movies and the weights $u_k$ are characteristics of the user. [figure: three movie factor rows combined with user-specific weights]

4 Linear Factor Model. A single user's preference row over movies is the user's weight vector $u$ times the factor matrix $V$. [figure: $u \cdot V$]

5 Linear Factor Model. Stacking all users: $Y \approx U V$, with one row of $U$ per user and one row of $V$ per factor over movies. [figure: users × movies matrix $Y$ as the product $U V$]

6 Matrix Factorization. Unconstrained low-rank approximation: $Y \approx U V = X$, rank $k$.
- Additive Gaussian noise: minimize $\|Y - UV\|_{Fro}$
- General additive noise
- General conditional models: multiplicative noise, Exponential-PCA [Collins+01], multinomial (pLSA [Hofmann01]), etc.
- General loss functions: hinge loss, loss functions appropriate for ratings, etc. [Gordon03]
For unconstrained $U, V$ and a fully observed $Y$ with Frobenius loss, use the SVD; the other variants are non-convex, with no explicit solution.
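
For the fully observed Frobenius case, the SVD solution is easy to state concretely. A minimal NumPy sketch; the random data and the rank $k = 2$ are illustrative:

```python
# Best rank-k approximation of a fully observed matrix Y (Eckart-Young):
# truncate the SVD to the top k singular values.
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((30, 40))

k = 2
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
X = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.linalg.norm(Y - X, "fro"))  # minimal over all rank-k matrices
```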

7 Matrix Factorization. $Y \approx U V = X$, rank $k$, with structural constraints:
- Non-negativity [LeeSeung99]
- Stochasticity (convexity) [LeeSeung97] [Hofmann01]
- Sparsity, with clustering as an extreme case (when the rows of $U$ are sparse)
The overall number of factors is still constrained, and these are non-convex optimization problems.

8 Outline.
- Maximum Margin Matrix Factorization: low max-norm and low trace-norm matrices; unbounded number of factors; convex!
- Learning MMMF: semidefinite programming
- Generalization error bounds: learning with low-rank (finite factor) matrices; MMMF (low max-norm / trace-norm)
- Representational power of low-rank, max-norm and trace-norm matrices

9 Collaborative Prediction with Matrix Factorization. Fit a factorizable (low-rank) matrix $X = UV$ to the observed entries: minimize $\sum_{ij \in S} loss(X_{ij}; Y_{ij})$, where $X_{ij}$ is the prediction and $Y_{ij}$ the observation. Then use $X$ to predict the unobserved entries. [Sarwar+00] [Azar+01] [Hoffman04] [Marlin+04]

10 Collaborative Prediction with Matrix Factorization. When $U$ is fixed, each column becomes a separate linear classification problem: the rows of $U$ are feature vectors and the columns of $V$ are linear classifiers. Fitting $U$ and $V$ jointly means learning features that work well across all the classification problems.
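
To make the alternating view concrete, here is a minimal alternating-least-squares sketch for a squared-loss variant of this objective (not the hinge-loss MMMF problem): with $U$ fixed, each column of $V$ solves an independent ridge-regularized linear problem, and symmetrically for $U$. The rank k, the regularization lam, and the data are illustrative:

```python
# Alternating least squares on observed entries: with U fixed, each column
# of V solves an independent ridge-regularized least-squares problem, and
# symmetrically for U (squared loss, for illustration).
import numpy as np

rng = np.random.default_rng(1)
n, m, k, lam = 30, 40, 3, 0.1
Y = rng.standard_normal((n, m))
mask = rng.random((n, m)) < 0.3              # which entries are observed

U = rng.standard_normal((n, k))
V = rng.standard_normal((m, k))              # prediction matrix is U @ V.T

for _ in range(20):
    for j in range(m):                       # refit column j of V
        r = mask[:, j]
        A = U[r].T @ U[r] + lam * np.eye(k)
        V[j] = np.linalg.solve(A, U[r].T @ Y[r, j])
    for i in range(n):                       # refit row i of U
        c = mask[i]
        A = V[c].T @ V[c] + lam * np.eye(k)
        U[i] = np.linalg.solve(A, V[c].T @ Y[i, c])
```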

11 Geometric Interpretation: Co-embedding Points and Separating Hyperplanes. [figure: rows of $U$ (users) embedded as points, columns of $V$ (movies) as separating hyperplanes]

12 Geometric Interpretation: Co-embedding Points and Separating Hyperplanes. [figure: the same co-embedding, continued]

13 Max-Margin Matrix Factorization: bound the norms of $U, V$ instead of their dimensionality. Bound the norms uniformly: $(\max_i \|U_i\|^2)(\max_j \|V_j\|^2) \le 1$. For observed $Y_{ij} = \pm 1$, require $Y_{ij} X_{ij} \ge$ Margin, where $X_{ij} = \langle U_i, V_j \rangle$. [figure: low-norm rows of $U$ (users) as points, columns of $V$ (movies) as separating hyperplanes]

14 Max-Margin Matrix Factorization: bound the norms of $U, V$ instead of their dimensionality. For observed $Y_{ij} = \pm 1$, require $Y_{ij} X_{ij} \ge$ Margin, $X_{ij} = \langle U_i, V_j \rangle$.
- Bound the norms uniformly: $(\max_i \|U_i\|^2)(\max_j \|V_j\|^2) \le R^2$. This is the max-norm: $\|X\|_{max} = \min_{X=UV} (\max_i \|U_i\|)(\max_j \|V_j\|)$.
- Bound the norms on average: $(\sum_i \|U_i\|^2)(\sum_j \|V_j\|^2) \le R^2$. This is the trace-norm: $\|X\|_{tr} = \min_{X=UV} \|U\|_{Fro} \|V\|_{Fro}$.
When $U$ is fixed, each column of $V$ is an SVM.

15 Max-Margin Matrix Factorization: bound the norms of $U, V$ instead of their dimensionality. Unlike $rank(X) \le k$, the max-norm and trace-norm constraints are convex! [figure: the convex combination $\alpha U_1 V_1 + (1-\alpha) U_2 V_2$ of two bounded-norm factorizations, itself realizable with bounded norms via rescaled factors]

16 Max-Margin Matrix Factorization: Convex Combination of Factors. Write $X = \sum_i s_i u_i v_i$, where a diagonal $S$ of weights $s_i \ge 0$, $\sum_i s_i \le 1$, scales unit-norm columns $u_i$ of $U$ against unit-norm rows $v_i$ of $V$; each outer product $u_i v_i$ of norm-1 vectors is a rank-1, norm-1 matrix. Bounding the norms on average, $(\sum_i \|U_i\|^2)(\sum_j \|V_j\|^2) \le 1$, corresponds to $\sum_i s_i \le 1$, and $\|X\|_{tr} = \sum$ (singular values of $X$).
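
A quick numerical check of the variational characterization above, using the standard fact that the balanced factorization built from the SVD attains it. A NumPy sketch with an arbitrary random matrix:

```python
# Check: the trace norm (sum of singular values) equals ||U||_Fro * ||V||_Fro
# for the balanced factorization X = (Us sqrt(S)) (sqrt(S) Vt).
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((15, 20))

Us, s, Vt = np.linalg.svd(X, full_matrices=False)
U = Us * np.sqrt(s)                # scale columns by sqrt(singular values)
V = np.sqrt(s)[:, None] * Vt       # scale rows by sqrt(singular values)

print(np.allclose(X, U @ V))                                         # True
print(s.sum(), np.linalg.norm(U, "fro") * np.linalg.norm(V, "fro"))  # equal
```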

17 Finding Max-Margin Matrix Factorizations. minimize $(\sum_i \|U_i\|^2)(\sum_j \|V_j\|^2)$ subject to $Y_{ij} X_{ij} \ge 1$, $X = UV$. Since $\|X\|_{tr} = \sum$ (singular values of $X$) $= \min_{A,B} \frac{1}{2}(tr(A) + tr(B))$ such that $\begin{pmatrix} A & X \\ X' & B \end{pmatrix}$ is p.s.d. [Fazel Hindi Boyd 2001], this is the semidefinite program: minimize $\frac{1}{2}(tr(A) + tr(B))$ subject to $Y_{ij} X_{ij} \ge 1$ and $\begin{pmatrix} A & X \\ X' & B \end{pmatrix} \succeq 0$.
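
For small problems, this SDP can be written almost verbatim in a modeling language. A minimal CVXPY sketch under the assumption that a generic SDP solver suffices; the sizes and random data are illustrative (the hard-margin problem is always feasible here, since $X$ is otherwise unconstrained):

```python
# Hard-margin MMMF as an SDP, via the [Fazel Hindi Boyd 2001] trace-norm
# characterization: minimize (tr(A)+tr(B))/2 with [[A, X], [X', B]] PSD
# and Y_ij * X_ij >= 1 on observed entries.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(3)
n, m = 8, 10
Y = np.sign(rng.standard_normal((n, m)))        # +/-1 labels
idx = np.argwhere(rng.random((n, m)) < 0.5)     # observed (i, j) pairs

M = cp.Variable((n + m, n + m), PSD=True)       # M = [[A, X], [X', B]]
X = M[:n, n:]
obj = 0.5 * (cp.trace(M[:n, :n]) + cp.trace(M[n:, n:]))

constraints = [Y[i, j] * X[i, j] >= 1 for i, j in idx]
prob = cp.Problem(cp.Minimize(obj), constraints)
prob.solve()
print(prob.value)   # trace norm of the minimum-norm margin-1 interpolant
```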

18 Finding Max-Margin Matrix Factorizations. Primal ($X$ dense; $Y$ contributes one sparse constraint per observation): minimize $\frac{1}{2}(tr(A) + tr(B))$ subject to $Y_{ij} X_{ij} \ge 1$ and $\begin{pmatrix} A & X \\ X' & B \end{pmatrix} \succeq 0$. Dual ($Q$ sparse), with one dual variable $Q_{ij}$ per observed $(i,j)$: maximize $\sum Q_{ij}$ subject to $0 \le Q_{ij}$ and $\|Q \otimes Y\|_2 \le 1$, where $Q \otimes Y$ is the sparse elementwise product (zero for unobserved entries).

19 Finding Max-Margin Matrix Factorizations, soft margin. Primal: minimize $\frac{1}{2}(tr(A) + tr(B)) + c \sum \xi_{ij}$, i.e. $\|X\|_{tr} + c \cdot err(X)$, subject to $Y_{ij} X_{ij} \ge 1 - \xi_{ij}$ and $\begin{pmatrix} A & X \\ X' & B \end{pmatrix} \succeq 0$. Dual, with one variable $Q_{ij}$ per observed $(i,j)$: maximize $\sum Q_{ij}$ subject to $0 \le Q_{ij} \le c$ and $\|Q \otimes Y\|_2 \le 1$ (again with the sparse elementwise product, zero for unobserved entries).
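
The soft-margin problem also has an equivalent direct form, $\min \|X\|_{tr} + c \sum_{ij \in S} (1 - Y_{ij} X_{ij})_+$, which CVXPY can express with its nuclear-norm atom (internally this is the same SDP). A sketch; c and the data are illustrative:

```python
# Soft-margin MMMF in its direct form: trace (nuclear) norm plus c times
# the hinge loss over observed entries.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
n, m, c = 8, 10, 1.0
Y = np.sign(rng.standard_normal((n, m)))
idx = np.argwhere(rng.random((n, m)) < 0.5)     # observed (i, j) pairs

X = cp.Variable((n, m))
hinge = sum(cp.pos(1 - Y[i, j] * X[i, j]) for i, j in idx)

prob = cp.Problem(cp.Minimize(cp.normNuc(X) + c * hinge))
prob.solve()
print(prob.value)
```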

20 Finding Max-Margin Matrix Factorizations. Same primal and dual as above; at the optimum the two share singular vectors, $X^* = U D V'$ and $Q^* \otimes Y = U S V'$, so $X^*$ can be recovered from the dual solution by solving a small LP.

21 Fast Optimization of the Dual. Primal: minimize $\|X\|_{tr} + c \sum (1 - Y_{ij} X_{ij})_+$, trading off $\|X\|_{tr}$ against $err(X)$; rescaling $\tilde{X} = X/v$ gives: maximize $v - b \sum (v - Y_{ij} \tilde{X}_{ij})_+$ subject to $\|\tilde{X}\|_{tr} \le 1$. Dual: maximize $\sum Q_{ij}$ subject to $\|Q \otimes Y\|_2 \le 1$, $0 \le Q_{ij} \le c$; rescaled: minimize $\|\tilde{Q} \otimes Y\|_2$ subject to $\sum \tilde{Q}_{ij} = 1$, $0 \le \tilde{Q}_{ij} \le b$, a saddle problem $\min_{\tilde{Q}} \max_u \|(\tilde{Q} \otimes Y) u\|$. Use Nesterov's smoothing for saddle problems; the subgradient (and the gradient of the smoothed problem) is given by an SVD. [figure: the attainable $(err(X), \|X\|_{tr})$ pairs; the front of extreme values is the regularization path]
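
As a simpler alternative to the smoothed dual, the primal can also be attacked directly with a proximal subgradient method: the prox of the trace norm is singular-value soft-thresholding, so each step costs one SVD, echoing the point that (sub)gradients here come from SVDs. A minimal sketch, not the dual method above; step size, c and data are illustrative:

```python
# Proximal subgradient method for  min ||X||_tr + c * sum hinge(Y_ij X_ij)
# over observed entries: subgradient step on the hinge term, then
# soft-threshold the singular values (prox of the trace norm).
import numpy as np

def svt(Z, tau):
    """Singular-value soft-thresholding: prox of tau * ||.||_tr."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(5)
n, m, c, step = 20, 25, 1.0, 0.1
Y = np.sign(rng.standard_normal((n, m)))
mask = rng.random((n, m)) < 0.4                     # observed entries

X = np.zeros((n, m))
for _ in range(200):
    G = np.where(mask & (Y * X < 1), -c * Y, 0.0)   # hinge subgradient
    X = svt(X - step * G, step)                     # prox step on trace norm
```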

22 Finding Max-Margin Matrix Factorizations: Runtime and Prediction Performance. Runtime of the SDP solver (3.02 GHz Xeon) on problems of increasing size, up to …×562 (the matrix sizes are garbled in this transcription):
- 1% observed: 2 sec / 2 sec / 19 sec / 3:27 min / 34:35 min / 5:44:07 / 57:23:09
- 10% observed: 3 sec / 18 sec / 2:34 min / 3:37 min / 41:15 min / 6:40:06 / 67:35:34
- 50% observed: 10 sec / 35 sec / 2:41 min / 19:11 min / 1:35:28 / 19:09:49 / 62:12:21
Prediction performance (normalized mean absolute error), comparing MMMF against the top methods from [Marlin04]'s comprehensive comparison, URP (aspect/LDA-like) and "Attitude" (softmax-based), on EachMovie (74,424 users, 1,… movies, …M observations, ratings 1–6) and MovieLens (6,040 users, 3,952 movies, 1M observations, ratings 1–5); the slide also notes runs with over 1.5 million and 5 million observations. [figure: NMAE bars; exact values lost in transcription]

23 MMMF Optimization: Other Options. minimize $\|X\|_{tr} + loss(X)$.
- Gradient descent on a smoothed primal: $\|X\|_{tr} = \sum_i \lambda_i(X)$ (sum of singular values); smoothing the singular-value sum gives $\partial \|X\|_{tr\text{-}smooth} / \partial X = U \, (\partial \|S\|_{smooth} / \partial S) \, V'$ for the SVD $X = U S V'$.
- Gradient descent on the unconstrained factorized objective $\|U\|_{Fro}^2 + \|V\|_{Fro}^2 + loss(UV)$: non-convex, but if the dimensionality of $U, V$ is large enough, the global optimum is still guaranteed [Burer Monteiro 2004].
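
A sketch of the second option, in the spirit of [Burer Monteiro 2004]: plain gradient descent on the factored objective, here with a smooth squared hinge so gradients are well defined everywhere. The factor dimension k, step size and c are illustrative assumptions:

```python
# Gradient descent on the non-convex factored objective
#   0.5*(||U||_F^2 + ||V||_F^2) + c * 0.5 * sum max(0, 1 - Y_ij (U V')_ij)^2
# over observed entries.
import numpy as np

rng = np.random.default_rng(6)
n, m, k, c, step = 20, 25, 10, 1.0, 0.05
Y = np.sign(rng.standard_normal((n, m)))
mask = rng.random((n, m)) < 0.4

U = 0.1 * rng.standard_normal((n, k))
V = 0.1 * rng.standard_normal((m, k))

for _ in range(500):
    X = U @ V.T
    viol = np.where(mask, np.maximum(1 - Y * X, 0.0), 0.0)
    G = -c * Y * viol                               # d(squared hinge)/dX
    U, V = U - step * (U + G @ V), V - step * (V + G.T @ U)
```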

24 Outline.
- Maximum Margin Matrix Factorization: low max-norm and low trace-norm matrices; unbounded number of factors; convex!
- Learning MMMF: semidefinite programming
- Generalization error bounds: learning with low-rank (finite factor) matrices; MMMF (low max-norm / trace-norm)
- Representational power of low-rank, max-norm and trace-norm matrices

25 Generalization Error Bounds. Generalization error: $D(X;Y) = |\{ij : X_{ij} Y_{ij} \le 0\}| / nm$. Empirical error: $D_S(X;Y) = |\{ij \in S : X_{ij} Y_{ij} \le 0\}| / |S|$. Prior work assumes a low-rank structure (eigengap): asymptotic behavior [Azar+01]; sample complexity and query strategy [Drineas+02]. Here, assumption-free bounds: for any unknown $Y$ (the "source distribution") and a random training set $S$, $\Pr_S(\forall X \in \mathcal{X}: D(X;Y) < D_S(X;Y) + \epsilon) > 1 - \delta$, where $X$ is the hypothesis.

26 Generalization Error Bounds: Low Trace-Norm, Max-Norm or Rank. With generalization error $D(X;Y) = |\{ij : X_{ij} Y_{ij} \le 0\}| / nm$ and empirical margin error $D_S(X;Y) = |\{ij \in S : X_{ij} Y_{ij} < 1\}| / |S|$, for any $Y$: $\Pr_S(\forall X \in \mathcal{X}: D(X;Y) < D_S(X;Y) + \epsilon) > 1 - \delta$, for each of the classes:
- $\mathcal{X} = \{X \in R^{n \times m} : rank(X) \le k\}$, with $dc(A) = \min rank(X)$ s.t. $A_{ij} X_{ij} > 0$: $\epsilon = \sqrt{\big(k(n+m)\log\frac{8en}{k} + \log\frac{1}{\delta}\big) / (2|S|)}$.
- $\mathcal{X} = \{X \in R^{n \times m} : \|X\|_{max}^2 \le R^2\}$, with $mc^2(A) = \min \|X\|_{max}^2$ s.t. $A_{ij} X_{ij} \ge 1$: $\epsilon = \sqrt{\big(12 R^2 (n+m) + \log\frac{1}{\delta}\big) / |S|}$.
- $\mathcal{X} = \{X \in R^{n \times m} : \|X\|_{tr}^2 / nm \le R^2\}$, with $ac^2(A) = \min \|X\|_{tr}^2 / nm$ s.t. $A_{ij} X_{ij} \ge 1$: $\epsilon = K R \sqrt{(n+m)\log n / |S|} + \sqrt{\log\frac{1}{\delta} / |S|}$.

27 Matrices Realizable by Low-Rank, Low-Trace-Norm and Low-Max-Norm. Rank and max-norm are uniform measures, while the trace-norm is an on-average measure: a matrix may be realizable with low trace-norm yet require high ($O(n^{2/3})$) rank and max-norm. Anything realizable with low max-norm is realizable with low rank, up to an extra log factor: $dc(A) \le 10 \, mc^2(A) \log(3nm)$. Conversely, some matrices realizable with low rank require arbitrarily high max-norm and trace-norm [Sherstov 07].

28 Maximum Margin Matrix Factorization as a Convex Combination of Classifiers. $\{UV : (\sum_i \|U_i\|^2)(\sum_i \|V_i\|^2) \le 1\} = \text{conv}(\{uv : u \in R^n, v \in R^m, \|u\| = \|v\| = 1\})$, the convex hull of the rank-one, unit-norm matrices. Similarly, $\text{conv}(\{uv : u \in \{\pm 1\}^n, v \in \{\pm 1\}^m\}) \subseteq \{UV : (\max_i \|U_i\|^2)(\max_j \|V_j\|^2) \le 1\} \subseteq 2 \, \text{conv}(\{uv : u \in \{\pm 1\}^n, v \in \{\pm 1\}^m\})$: the max-norm ball is sandwiched between convex hulls of the rank-one sign matrices.

29 Transfer Learning in Multi-Task and Multi-Class Settings. minimize $\frac{1}{2}\|X\|_{tr} + loss(X$ predicting $Y)$ becomes, with explicit features, minimize $\frac{1}{2}\|W\|_{tr} + loss(W\Phi$ predicting $Y)$, where $\Phi$ holds the instance features and $Y$ the per-task labels over the instances. [figure: (tasks × features) $W$ times (features × instances) $\Phi$ matching (tasks × instances) $Y$] [Argyriou et al 2007] [Abernethy et al] [Amit et al 2007]

30 Transfer Learning in Multi-Task and Multi-Class Settings. Factoring $W = UV$, the product $V\Phi$ acts as a learned feature space shared across the tasks: minimize $\frac{1}{2}\|W\|_{tr} + loss(W\Phi$ predicting $Y)$. [figure: $U(V\Phi) \approx Y$, with $V\Phi$ the learned feature space] [Argyriou et al 2007] [Abernethy et al] [Amit et al 2007]
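
A sketch of trace-norm regularized multi-task learning in this spirit: squared loss over all tasks plus $\lambda \|W\|_{tr}$, solved by proximal gradient with the same singular-value soft-thresholding as earlier. The shapes, step size and synthetic data are illustrative assumptions:

```python
# Multi-task learning with trace-norm regularization on the weight matrix W
# (tasks x features):  min ||W Phi - Y||_F^2 + lam * ||W||_tr.
import numpy as np

def svt(Z, tau):
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(7)
T, d, N, lam, step = 6, 30, 200, 1.0, 1e-3
Phi = rng.standard_normal((d, N))                       # features x instances
W_true = rng.standard_normal((T, 2)) @ rng.standard_normal((2, d))
Y = W_true @ Phi + 0.1 * rng.standard_normal((T, N))    # tasks x instances

W = np.zeros((T, d))
for _ in range(300):
    G = 2 * (W @ Phi - Y) @ Phi.T            # gradient of the squared loss
    W = svt(W - step * G, step * lam)        # prox step on lam * ||W||_tr

print(np.round(np.linalg.svd(W, compute_uv=False), 2))  # low rank: shared structure
```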

31 Mammal Recognition Experiment. 72 classes of mammals; training: 1000 images, testing: 1000 images. Compared: Frobenius-norm SVM vs. trace-norm SVM.

32 Mammal Recognition Experiment. [figure: accuracy change from using the trace norm instead of the Frobenius norm, as a function of the number of training instances]

33 Mammal Recognition Experiment. [figure: example results, trace norm vs. Frobenius norm]

34 Max-Margin Matrix Factorization. Nati Srebro, TTI-Chicago.
- Infinite number of factors: the norm replaces dimensionality
- Convex optimization problem (SDP)
- The low-norm factorization is the true goal, not a surrogate for rank: an infinite factor model with a connection to SVMs; generalization error bounds; theoretically not a good approximation to rank, yet an appropriate model in practice, with good empirical results
- Not only collaborative filtering: multi-task learning, multi-class learning, semi-supervised learning following e.g. [Ando Zhang 05,07], bottleneck-type methods [Globerson Tishby 04]
