Learning with Matrix Factorizations

Size: px

Start display at page:

Download "Learning with Matrix Factorizations"

Katrina Robertson
5 years ago
Views:

1 Learning with Matrix Factorizations Nati Srebro Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Adviser: Tommi Jaakkola (MIT) Committee: Alan Willsky (MIT), Tali Tishby (HUJI), Josh Tenenbaum (MIT) Joint work with Shie Mannor (McGill), Noga Alon (TAU) and Jason Rennie (MIT) Lots of help from John Barnett, Karen Livescu, Adi Shraibman, Michael Collins, Erik Demaine, Michel Goemans, David Karger and Dmitry Panchenko

2 Dimensionality Reduction: Low dimensional representation capturing important aspects of high dimensional data observations y image gene expression levels a user s preferences document small representation u varying aspects of scene description of biological process characterization of user topics Compression (mostly to reduce processing time) Reconstructing latent signal biological processes through gene expression Capturing structure in a corpus documents, images, etc Prediction: collaborative filtering

3 Linear Dimensionality Reduction y u1 v 1 + u2 v 2 + u3 v 3

4 Linear Dimensionality Reduction titles titles y u1 +1 v 1 comic value + u2 +2 v 2 dramatic value + u3-1 v 3 violence preferences of a specific user (real-valued preference level for each title) characteristics of the user

5 Linear Dimensionality Reduction titles y u titles V

6 Linear Dimensionality Reduction users titles y Y data u U titles V

7 Matrix Factorization Y data U V = X rank k Non-Negativity [LeeSeung99] Stochasticity (convexity) [LeeSeung97][Barnett04] Sparsity Clustering as an extreme (when rows of U sparse) Unconstrained: Low Rank Approximation

8 Matrix Factorization Y data U V = + Z i.i.d. Gaussian Additive Gaussian noise minimize Y-UV Fro ( ) 1 = = 2 '; Y log P( Y UV ' ) ( Y UV ' ) + log L UV 2 ij ij ij 2σ ij ij ij const

9 Matrix Factorization Y data U V = + Z i.i.d. p(z ij ) Additive Gaussian noise minimize Y-UV Fro General additive noise minimize -log p(y ij -UV ij )

10 Additive Gaussian noise minimize Y-UV Fro General additive noise minimize -log p(y ij -UV ij ) General conditional models minimize -log p(y ij UV ij ) Multiplicative noise Binary observations: Logistic LRA Exponential PCA [Collins+01] Multinomial (plsa [Hofman01]) Other loss functions [Gordon02] minimize loss(uv ;Y ) p(y ij UV ij ) Matrix Factorization Y data U V

11 topic 1..k T I The Aspect Model (plsa) J Y co-occurrence occurrence counts word X p(i,j) rank k = p(i,t) [Hoffman+99] p(j t) document Y X Multinomial(N,X) Y ij X ij Binomial(N,X ij ) N= Y ij

12 Low-Rank Models for Matrices of Counts, Occurrences or Frequencies Multinomial Independent Binomials Independent Bernoulli Mean parameterization 0 X ij 1 E[Y ij X ij ]=X ij Natural parameterization unconstrained X ij Aspect Model (plsa) [Hoffman+99] Sufficient Dimensionality Reduction [Globerson+02] Y ij X ij ~Bin(N,X ij ) NMF if X ij =1 NMF [Lee+01] Y ij X ij ~Bin(N,g(X ij )) P(Y ij =1) = X ij Exponential PCA: [Collins+01] p(y ij X ij ) exp(y ij X ij +F(Y ij )) Logistic Low Rank Approximation [Schein+03] hinge loss g(x)=1/(1+e x )

13 Outline Finding Low Rank Approximations Weight Low Rank Approx: minimize ij W ij (Y ij -X ij ) 2 Use WLRA Basis for other losses / conditional models Consistency of Low Rank Approximation When more data is available, do we converge to correct solution? Not always Matrix Factorization for Collaborative Filtering Maximum Margin Matrix Factorization Generalization Error Bounds (Low Rank & MMMF)

14 Finding Low Rank Approximation Find rank-k X minimizing loss(x ij ;Y ij ) (=log p(y X)) Non-convex: X is rank-k is a non-convex constraint loss((uv) ij ;Y ij ) not convex in U,V rank-k X minimizing (Y ij -X ij ) 2 : non-convex, but no (non-global) local minima solution: leading components of SVD For other loss functions, or with missing data: cannot use SVD, local minima, difficult problem Weighted Low Rank Approximation: minimize W ij (Y ij -X ij ) 2 Arbitrarily specified weights (part of input)

15 WLRA: Optimization J ( UV ') = W ( Y UV ') ij ij 2 ij For fixed V, find optimal U For fixed U, find optimal V n k U m V k V J * ( V ) = minj ( UV ') U J* ( V ') = 2U*'(( U* V ' Y ) W ) Conjugate gradient descent on J* Optimize km parameters instead of k(n+m) EM approach: elementwise product X LowRankApprox(W Y+(1-W) X)

16 Newton Optimization for Non-Quadratic Loss Functions minimize ij loss(x ij ;Y ij ) loss(x ij ;Y ij ) convex in X ij Iteratively optimize quadratic approximations of objective Each such quadratic optimization is a weighted low rank approximation

17 Maximum Likelihood Estimation with Gaussian Mixture Noise Y observed = X + rank k Z noise C (hidden) Mixture of Gaussians: {p c,µ c,σ c } E step: calculate posteriors of C M step: WLRA with Pr( Cij = c ) Wij = 2 σ c C A Pr( C ij ij = Yij + 2 c σc = c ) µ C W ij

18 Outline Finding Low Rank Approximations Weight Low Rank Approx: minimize ij W ij (Y ij -X ij ) 2 Use WLRA Basis for other losses / conditional models Consistency of Low Rank Approximation When more data is available, do we converge to correct solution? Not always Matrix Factorization for Collaborative Filtering Maximum Margin Matrix Factorization Generalization Error Bounds (Low Rank & MMMF)

19 Asymptotic Behavior of Low Rank Approximation parameters random Y X = + rank k Z independent Single observation, Number of parameters is linear in number of observables Can never approach correct estimation of parameters What can be estimated is row-space of X

20 Y U V = + random Z independent Number of parameters is linear in number of observables What can be estimated is row-space of X

21 random V random Y = + U Z independent What can be estimated is row-space of X

22 y = u + V z Multiple samples of random variable y What can be estimated is row-space of X

23 Probabilistic PCA y = u + ~ Gaussian V z ~ Gaussian x ~ Gaussian Σ X rank k estimated parameters σ 2 I Maximum likelihood PCA [Tipping Bishop 97]

24 Probabilistic PCA [Tipping Bishop 97] y = u + ~ Gaussian V estimated parameters z ~ Gaussian(σ 2 I) Latent Dirichlet Allocation [Blei Ng Jordan 03] y ~ V Multinomial( N, u ) ~ Dirichlet(α) estimated parameters

25 Generative and Non-Generative Low Rank Models Y=X+Gaussian Probabilistic PCA plsa, Y~Binom(N,X) Non-parametric generative models Latent Dirichlet Allocatoin Parametric generative models: Consistency of Maximum Likelihood estimation guaranteed if model assumptions hold (both on Y X and on U)

26 estimated parameters row-space of V, not entries of V y = u + P U x V z ~ Gaussian nonparametric, unconstrained nuisance Non-parametric model, estimation of a parametric part of the model Maximum Likelihood Single Observed Y

27 Consistency of ML Estimation estimated V true V

Consistency of ML Estimation V expected contribution of x to likelihood of V x [ max log p (( x + z ) )] Ψ( V ; x ) = E uv ML estimator is consistent for any P u z u Z for all x,

28 Consistency of ML Estimation V expected contribution of x to likelihood of V x [ max log p (( x + z ) )] Ψ( V ; x ) = E uv ML estimator is consistent for any P u z u Z for all x, V maximizes iff V spans x Ψ( V ; x ) When Z~Gaussian, Ψ ( V ; x ) = E[L 2 distance of x+z from V] For iid Gaussian noise, ML estimation (PCA) of the low-rank sub-space is consistent

29 Consistency of ML Estimation V expected contribution of x to likelihood of V x=0 [ max log p (( x + z ) )] Ψ( V ; x ) = E uv ML estimator is consistent for any P u z u Z for all x, V maximizes iff V spans x Ψ( V ; x ) ML estimator is consistent for any P u Ψ(V ;0) is constant for all V

30 Consistency of ML Estimation V expected contribution of x to likelihood of V x=0 θ [ max log p (( x + z ) )] Ψ( V ; x ) = E z u Z uv 1 z [ i ] Laplace: p ( z [ i ]) = e Z vertical axis Ψ(V ;0) = 1 horizontal axis π θ= 2 π

31 Consistency of ML Estimation expected contribution of x to likelihood of V X=UV General conditional model for Y X [ max log p ( Y uv ) x] Ψ( V; x) = E X u Y ML estimator is consistent for any P u Y X for all x, V maximizes iff V spans x Ψ( V ; x ) ML estimator is consistent for any P u Ψ(V ;0) is constant for all V

32 Consistency of ML Estimation V expected contribution of x to likelihood of V x=0 θ [ max log p ( Y uv ) x] Ψ( V; x) = E X u Y Logistic: py X Y i = x Y X i 1 = 1 e ( [ ] 1 [ ]) x[ i] Ψ(V ;0) = θ= π π 2

33 Consistent Estimation Additive i.i.d. noise Y=X+Z: Maximum Likelihood generally not consistent PCA is consistent (for any noise distribution) Span of k leading eigenvectors of Σˆ Y Σ X + σ 2 I rank k

34 Consistent Estimation Additive i.i.d. noise Y=X+Z: Maximum Likelihood generally not consistent PCA is consistent (for any noise distribution) s 1, s 2,, s k, 0, 0,, 0 Σˆ Y Σ Y = Σ X + Σ Z = Σ X + σ 2 I s 1 +σ 2, s 2 +σ 2,, s k +σ 2, σ 2, σ 2,, σ 2

35 Consistent Estimation Additive i.i.d. noise Y=X+Z: Maximum Likelihood generally not consistent PCA is consistent (for any noise distribution) distance to correct subspace ML PCA Noise: 0.99 N(0,1) N(0,100) Sample size (number of observed rows)

36 Consistent Estimation Additive i.i.d. noise Y=X+Z: Maximum Likelihood generally not consistent PCA is consistent (for any noise distribution) Unbiased noise E[Y X]=X: Maximum Likelihood generally not consistent PCA not consistent -- Can correct by ignoring diagonal of covariance Exponential PCA (X are natural parameters) Maximum Likelihood generally not consistent Covariance methods not consistent???

37 Outline Finding Low Rank Approximations Weight Low Rank Approx: minimize ij W ij (Y ij -X ij ) 2 Use WLRA Basis for other losses / conditional models Consistency of Low Rank Approximation When more data is available, do we converge to correct solution? Not always Matrix Factorization for Collaborative Filtering Maximum Margin Matrix Factorization Generalization Error Bounds (Low Rank & MMMF)

38 Collaborative Filtering Based on preferences so far, and preferences of others: Predict further preferences users movies Implicit or explicit preferences? Type of queries.

39 Matrix Completion Based on partially observed matrix: Predict unobserved entries users movies ? ? 5 3? ? ? 5? ? 5 2? ? ? Will user i like movie j?

40 Matrix Completion with Matrix Factorization U V ? ? ? ? ? +1 5? ? ? ? ? Fit factorizable (lowrank) matrix X=UV to observed entries. minimize Σloss(X ij ;Y ij ) prediction observation Use matrix X to predict unobserved entries. [Sarwar+00] [Azar+01] [Hoffman04] [Marlin+04]

41 Matrix Completion with Matrix Factorization U v12 v9 v10 v11 v8 v7 v6 v5 v4 v3 v2 v When U is fixed, each row is a linear classification problem: rows of U are feature vectors columns of V are linear classifiers Fitting U and V: Learning features that work well across all classification problems.

42 Max-Margin Matrix Factorization low norm U V Instead of bounding dimensionality of U,V, bound norms of U,V For observed Y ij ±1: Y ij X ij Margin U i,v j

43 Max-Margin Matrix Factorization low norm U V bound norms on average: ( i U i 2 ) ( j V j 2 ) 1 bound norms uniformly: (max i U i 2 ) (max j V j 2 ) 1 For observed Y ij ±1: Y ij X ij Margin U i,v j U is fixed: each column of V is SVM

44 Max-Margin Matrix Factorization low norm U V bound norms on average: ( i U i 2 ) ( j V j 2 ) 1 bound norms uniformly: (max i U i 2 ) (max j V j 2 ) 1 For observed Y ij ±1: Y ij X ij Margin U i,v j U is fixed: each column of V is SVM

45 Geometric Interpretation V columns of V (items) rows of U (users) U (max i U i 2 ) (max j V j 2 ) 1

46 Geometric Interpretation V columns of V (items) rows of U (users) U (max i U i 2 ) (max j V j 2 ) 1

47 Finding Max-Margin Matrix Factorizations maximize M Y ij X ij M X=UV ( i U i 2 ) ( j V j 2 ) 1 maximize M Y ij X ij M X=UV (max i U i 2 ) (max j V j 2 ) 1 Unlike rank(x) k, these are convex constraints!

48 Finding Max-Margin Matrix Factorizations maximize M Y ij X ij M X=UV ( i U i 2 ) ( j V j 2 ) 1 X Y Q X tr = (singular values of X) minimize tr(a)+tr(b) + c ξ ij A X Y ij X ij 1 X B p.s.d. Dual variable Q ij for each observed (i,j) maximize Q ij - ξ ij 0 Q ij c Q Y 2 1 sparse elementwise product (zero for unobserved entries)

49 Finding Max-Margin Matrix Factorizations Semi-definite program with sparse dual: Limited by number of observations, not size (for both average-norm and max-norm) Current implementation: use CSDP (off-the-shelf solver), up to 30k observations (e.g. 1000x1000, 3% observed) For large-scale problems: updates on dual alone? Dual variable Q ij for each observed (i,j) minimize tr(a)+tr(b) + c ξ ij maximize Q ij Y ij X ij 1 - ξ ij 0 Q ij c A X X B p.s.d. Q Y 2 1 sparse elementwise product (zero for unobserved entries)

50 Loss Functions for Rankings loss(x ij ;Y ij ) 0 Y ij =+1 X ij

51 Loss Functions for Rankings loss(x ij ;Y ij ) Y ij = θ 1 θ 2 θ 3 θ 4 θ 5 θ 6 X ij

52 Loss Functions for Rankings all-threshold loss loss(x ij ;Y ij ) Immediate-threshold loss [Shashua Levin 03] Y ij =3 θ 1 θ 2 θ 3 θ 4 θ 5 θ 6 X ij All-threshold loss is a bound on the absolute rank-difference For both loss functions: learn per-user θ s

53 Experimental Results on MovieLens Subset all threshold MMMF immediate threshold MMMF K-medians K=2 Rank-1 Rank-2 Rank Difference Zero One Error users 100 movies subset of MovieLens, 3515 training ratings, 3515 test ratings

54 Outline Finding Low Rank Approximations Weight Low Rank Approx: minimize ij W ij (Y ij -X ij ) 2 Use WLRA Basis for other losses / conditional models Consistency of Low Rank Approximation When more data is available, do we converge to correct solution? Not always Matrix Factorization for Collaborative Filtering Maximum Margin Matrix Factorization Generalization Error Bounds (Low Rank & MMMF)

55 Generalization Error Bounds D(X;Y) = ij loss(x ij ;Y ij )/nm generalization error Assuming a low-rank structure (eigengap): Asymptotic behavior [Azar+01] Sample complexity, query strategy [Drineas+02] X Y unknown, assumption-free

56 Generalization Error Bounds D(X;Y) = ij loss(x ij ;Y ij )/nm generalization error D S (X;Y) = ij S loss(x ij ;Y ij )/ S empirical error Y Pr S ( rank-k X D(X;Y)<D S (X;Y)+ε ) > 1-δ hypothesis X Y S source distribution training set random unknown, assumption-free

57 Generalization Error Bounds 0/1 loss: loss(x ij ;Y ij ) = 1 when sign(x ij ) Y ij D(X;Y) = ij loss(x ij ;Y ij )/nm generalization error D S (X,Y) = ij S loss(x ij ;Y ij )/ S empirical error For particular X,Y: random loss(x ij ;Y ij ) Bernoulli(D(X;Y)) Pr( D S (X;Y) < D(X;Y)-ε ) < e -2 S ε2 random Union bound over all possible Xs: Y Pr S ( X D(X,Y)<D S (X,Y)+ε ) > 1-δ ε = 8em klog(# ( n + of mpossible )log Xs) + log 1 2 S k δ

58 Number of Sign Configurations of Rank-k Matrices U V X nk variables mk variables X i,j = r U i,r V r,j nm polynomials of degree 2 polynomials Warren (1968): The number of connected components of { x i P i (x) 0 } is at most 4 e (degree) (#polys) (# variables) Based on [Alon95] (# variables) P 2 (x 1,x 2 )=0 P 1 (x 1,x 2 )= P 3 (x 1,x 2 )=0

59 Number of Sign Configurations of Rank-k Matrices U V X nk variables mk variables X i,j = r U i,r V r,j nm polynomials of degree 2 polynomials Warren (1968): The number of connected components of { x i P i (x) 0 } is at most Based on [Alon95] 4 e (degree) 2 (#polys) nm (# k(n+m) variables) k(n+m) (# variables) P 2 (x 1,x 2 )=0 P 1 (x 1,x 2 )= P 3 (x 1,x 2 )=0

60 Generalization Error Bounds: Low Rank Matrix Factorization D(X;Y) = ij loss(x ij ;Y ij )/nm generalization error D S (X,Y) = ij S loss(x ij ;Y ij )/ S empirical error Y Pr S ( X of rank-k D(X,Y)<D S (X,Y)+ε ) > 1-δ 0/1 loss: loss(x ij ;Y ij ) = sign(x ij Y ij ) ε = 8em k( n + m)log + log 1 2 S k δ loss(x ij ;Y ij ) 1: (by bounding the psudodimension) ε = 6 k( n + m)log 8em k log S S k ( n+ m) + log 1 δ

61 Generalization Error Bounds: Large Margin Matrix Factorization D(X;Y) = ij loss(x ij ;Y ij )/nm generalization error loss(x ij ;Y ij ) = sign(x ij Y ij ) D S (X;Y) = ij S loss 1 (X ij ;Y ij )/ S empirical error loss 1 (X ij ;Y ij ) = sign(x ij Y ij -1) Y Pr S ( X D(X,Y)<D S (X,Y)+ε ) > 1-δ universal constant from [Seginer00] bound on spectral norm of random matrix ( U i 2 /n)( V i 2 /m) R 2 : ε = K 4 ln m R 2 ( n + m)log n S + log 1 δ (max U i 2 )(max V j 2 ) R 2 : ε = 12 R 2 ( n + m) + S log 1 δ

62 Maximum Margin Matrix Factorization as a Convex Combination of Classifiers { UV ( U i 2 )( V i 2 ) 1 } = convex-hull( { uv u R n, v R m u = v =1} ) conv( { uv u ±1 n, v ±1m} ) { UV (max U i 2 )(max V j 2 ) 1 } 2 conv( { uv u ±1 n, v ±1m} ) Grothendiek's Inequality

63 Summary Finding Low Rank Approximations Weighted Low Rank Approximations Basis for other loss function: Newton, Gaussian mixtures Consistency of Low Rank Approximation ML for popular low-rank models is not consistent! PCA consistent for additive noise; diagonal ignoring for unbiased Efficient estimators? Consistent estimators for Exponential-PCA? Maximum Margin Matrix Factorization Correspondence with large margin linear classification Sparse SDPs for both avarage-norm and max-norm formulations Direct optimization of dual would enable large-scale applications Generalization Error Bounds for Collaborative Prediction First assumption free bounds for matrix completion Both for Low-Rank and for Max-Margin Observation process?

64 Average February Temperature (centigrade) Tel-Aviv Haifa (Technion) Boston Toronto????

Maximum Margin Matrix Factorization

Maximum Margin Matrix Factorization Nati Srebro Toyota Technological Institute Chicago Joint work with Noga Alon Tel-Aviv Yonatan Amit Hebrew U Alex d Aspremont Princeton Michael Fink Hebrew U Tommi Jaakkola