Matrix and Tensor Factorization from a Machine Learning Perspective

1 Matrix and Tensor Factorization from a Machine Learning Perspective. Christoph Freudenthaler, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim. Research Seminar, Vienna University of Economics and Business, January 13, 2012.

2-3 Matrix Factorization - SVD
SVD: decompose the p_1 x p_2 matrix Y := V_1 D V_2^T, where the columns of V_1 are eigenvectors of Y Y^T, the columns of V_2 are eigenvectors of Y^T Y, and D = diag(d_1, ..., d_k) holds the singular values (the square roots of the eigenvalues of Y Y^T). (Figure: the p_1 x p_2 matrix Y written as the product of V_1, the diagonal D, and V_2^T.)
Properties:
- a method from linear algebra that works for arbitrary Y
- a tool for descriptive and exploratory data analysis
- the decomposition optimizes the Frobenius norm under orthogonality constraints
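As a small illustration (not part of the original slides), a truncated SVD can be computed with NumPy; the matrix values and the rank k are arbitrary choices:

```python
import numpy as np

# a small example matrix Y (p1 x p2); the values are arbitrary
Y = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0]])

k = 2  # number of latent dimensions to keep (illustrative choice)

# full SVD: Y = V1 @ diag(d) @ V2t
V1, d, V2t = np.linalg.svd(Y, full_matrices=False)

# rank-k truncation: the Frobenius-norm-optimal rank-k approximation of Y
Y_k = V1[:, :k] @ np.diag(d[:k]) @ V2t[:k, :]
print(np.linalg.norm(Y - Y_k, ord="fro"))
```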

4-5 Tensor Factorization - Tucker Decomposition
Tucker decomposition: decompose the p_1 x p_2 x p_3 tensor Y := D x_1 V_1 x_2 V_2 x_3 V_3, where V_1 holds the k_1 eigenvectors of the mode-1 unfolded Y, V_2 the k_2 eigenvectors of the mode-2 unfolded Y, V_3 the k_3 eigenvectors of the mode-3 unfolded Y, and D ∈ R^{k_1 x k_2 x k_3} is a non-diagonal core tensor. (Figure: Y as the core tensor D multiplied along each mode by V_1, V_2, and V_3.)
Properties:
- an extension of the SVD method to an arbitrary tensor Y, with an orthogonal representation
- a tool for descriptive and exploratory data analysis
- the decomposition optimizes the Frobenius norm
- expensive inference: sequences of SVDs plus the core tensor D
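A minimal higher-order SVD (HOSVD) sketch, one standard way to obtain a Tucker decomposition; the function names, toy tensor, and ranks below are illustrative, not taken from the slides:

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding of a 3-way tensor into a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(Y, ranks):
    """Higher-order SVD: factor matrices from the mode-n unfoldings plus core tensor."""
    factors = []
    for mode, k in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(Y, mode), full_matrices=False)
        factors.append(U[:, :k])  # k leading left singular vectors of the mode-n unfolding
    # core tensor D = Y x_1 V1^T x_2 V2^T x_3 V3^T
    D = Y
    for mode, V in enumerate(factors):
        D = np.moveaxis(np.tensordot(V.T, D, axes=(1, mode)), 0, mode)
    return D, factors

Y = np.random.rand(4, 5, 6)            # toy tensor with arbitrary values
D, (V1, V2, V3) = hosvd(Y, (2, 2, 2))  # ranks are an illustrative choice

# reconstruct: Y_hat = D x_1 V1 x_2 V2 x_3 V3
Y_hat = D
for mode, V in enumerate((V1, V2, V3)):
    Y_hat = np.moveaxis(np.tensordot(V, Y_hat, axes=(1, mode)), 0, mode)
print(np.linalg.norm(Y - Y_hat))       # reconstruction error of the truncated model
```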

6-7 Tensor Factorization - Canonical Decomposition
Canonical Decomposition (CD): decompose the p_1 x p_2 x p_3 tensor Y := D x_1 V_1 x_2 V_2 x_3 V_3 with a diagonal identity tensor D, i.e., as a sum of rank-one tensors: Y = sum_{f=1}^k v_{1,f} ∘ v_{2,f} ∘ v_{3,f}. (Figure: Y as the sum of k outer products of corresponding columns of V_1, V_2, and V_3.)
Properties:
- an extension of the SVD method to an arbitrary tensor Y
- a tool for descriptive and exploratory data analysis
- the decomposition optimizes the Frobenius norm
- fast inference due to fewer parameters
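A short sketch (not from the slides) of how a tensor is assembled from rank-one components in the CD/CP form; all sizes and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, p3, k = 4, 5, 6, 3
V1, V2, V3 = (rng.random((p, k)) for p in (p1, p2, p3))

# Y[i1, i2, i3] = sum_f V1[i1, f] * V2[i2, f] * V3[i3, f]  (sum of k rank-one tensors)
Y = np.einsum('af,bf,cf->abc', V1, V2, V3)
print(Y.shape)  # (4, 5, 6)
```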

8-10 Machine Learning Perspective
Machine learning is the task of learning from (noisy) experience E with respect to some class of tasks T and a performance measure P.
- Experience E: data
- Task T: predictions
- Performance P: root mean square error (RMSE), misclassification, ...
The goal is to generalize from noisy data in order to accurately predict new cases. In other words: prediction accuracy on new cases = machine learning; no hypothesis generation/testing = data mining.

11 Applications of the Machine Learning Perspective
Many different applications: social network analysis, recommender systems, graph analysis, image/video analysis, ...

12-13 Applications of the Machine Learning Perspective
Overall task: prediction of missing friendships, interesting items, corrupted pixels, missing edges, etc. Data representation: a matrix or tensor with missing entries. (Figure: a partially observed matrix Y with users u_1, ..., u_{p_1} as rows and items i_1, ..., i_{p_2} as columns.)
Further properties:
- unobserved heterogeneity, e.g., different consumption preferences of different users and items
- large scale: millions of interactions
- poor data quality: high noise due to indirect data collection
2 Factorization Models

14 Factorization Models from a Machine Learning Perspective
Common usage of matrix (and tensor) factorization:
- identification/interpretation of unobserved heterogeneity, i.e., latent dependencies between instances of a mode (rows, columns, ...)
- data compression: e.g., instead of p_1 * p_2 values, store only (p_1 + p_2) * k values
- data preprocessing: decorrelate the p predictor variables of a design matrix X

15-18 Factorization Models from a Machine Learning Perspective
Machine learning perspective on factorization models: factorization models are seen as predictive models, but in their classical form
- there is no probabilistic embedding, e.g. for Bayesian analysis; implicitly, a Gaussian likelihood Y = V_1 D V_2^T + E with each e_l ∈ E distributed as e_l ∼ N(0, σ² = 1)
- there is no missing value treatment; missing entries are often treated as 0
- the prior is non-informative, p(V_1, D, V_2) ∝ 1, and does not distinguish between signal and noise
- there is no interest in the latent representation itself, e.g. its interpretation, so the orthonormality constraint can be eliminated: Y = V_1 V_2^T + E, and for general tensors Y = sum_{f=1}^k v_{1,f} ∘ v_{2,f} ∘ ... ∘ v_{m,f} + E

19-22 Related Work
Existing extensions of factorization models:
- more general likelihoods for multinomial data, count data, etc.: Exponential Family PCA [1]
- inference only on the observed tensor entries
- introduction of prior distributions [2]
- scalable learning algorithms for large-scale data: gradient descent models [3]
Missing extensions:
- extensions of the predictive models themselves, i.e., SVD, CD
- comparison to standard predictive models
[1] Mohamed, S., Heller, K. A., Ghahramani, Z.: Exponential Family PCA. NIPS 2008.
[2] Tipping, M. E., Bishop, C. M.: Probabilistic PCA. Journal of the Royal Statistical Society.
[3] Koren, Y.: Factorization meets the neighborhood: a multifaceted collaborative filtering model. KDD 2008.

23-24 Outline: Generalized Factorization Model; Relations to Standard Models; Empirical Evaluation; Conclusion and Discussion

25-28 Probabilistic SVD
Frobenius-norm optimal: argmin_{θ={V_1,V_2}} ||Y - V_1 V_2^T||_F^2 = argmin_{θ} sum_{i=1}^{p_1} sum_{j=1}^{p_2} (y_{i,j} - v_{1,i}^T v_{2,j})^2. (Figure: each entry y_{i,j} of Y is the dot product of row i of V_1 and row j of V_2, i.e., v_{1,i}^T v_{2,j} = sum_{f=1}^k v_{1,i,f} v_{2,j,f}.)
Gaussian maximum likelihood: argmax_{θ={V_1,V_2}} prod_{i=1}^{p_1} prod_{j=1}^{p_2} exp(-1/2 (y_{i,j} - v_{1,i}^T v_{2,j})^2)
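As a small numeric check (not in the slides): with unit noise variance, the Gaussian log-likelihood and the squared Frobenius loss differ only by a constant factor and offset, so they share the same optimum. All values below are arbitrary toy data:

```python
import numpy as np

rng = np.random.default_rng(1)
p1, p2, k = 6, 8, 3
Y = rng.normal(size=(p1, p2))    # toy fully observed matrix
V1 = rng.normal(size=(p1, k))    # row factors
V2 = rng.normal(size=(p2, k))    # column factors

residual = Y - V1 @ V2.T
frobenius_loss = np.sum(residual ** 2)

# Gaussian log-likelihood with sigma^2 = 1, up to an additive constant:
# maximizing it is equivalent to minimizing the squared Frobenius norm
log_lik = -0.5 * frobenius_loss
print(frobenius_loss, log_lik)
```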

29-32 Probabilistic CD - Extend SVD to arbitrary m
Frobenius-norm optimal: argmin_{θ={V_1,V_2,...,V_m}} sum_{i_1=1}^{p_1} sum_{i_2=1}^{p_2} ... sum_{i_m=1}^{p_m} (y_{i_1,i_2,...,i_m} - sum_{f=1}^k v_{1,i_1,f} v_{2,i_2,f} ... v_{m,i_m,f})^2, where the model term is abbreviated y^{CD}_{i_1,i_2,...,i_m}. (Figure: for m = 3, a tensor entry y_{i_1,i_2,i_3} is the generalized dot product of rows of V_1, V_2, and V_3: y^{CD}_{i_1,i_2,i_3} = sum_{f=1}^k v_{1,i_1,f} v_{2,i_2,f} v_{3,i_3,f}.)
Gaussian maximum likelihood: argmax_{θ={V_1,V_2,...}} prod_{i_1=1}^{p_1} prod_{i_2=1}^{p_2} ... prod_{i_m=1}^{p_m} exp(-1/2 (y_{i_1,i_2,...,i_m} - y^{CD}_{i_1,i_2,...,i_m})^2)
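A minimal sketch (not from the slides) of the CD model value for a single tensor entry with an arbitrary number of modes m; the function name and toy sizes are illustrative:

```python
import numpy as np

def y_cd(factors, idx):
    """CD/CP model value for one tensor entry: sum_f prod_j V_j[idx[j], f]."""
    prod = np.ones(factors[0].shape[1])
    for V, i in zip(factors, idx):
        prod *= V[i]            # elementwise product over the k latent dimensions
    return prod.sum()

rng = np.random.default_rng(2)
factors = [rng.normal(size=(p, 2)) for p in (3, 4, 5)]  # m = 3 modes, k = 2
print(y_cd(factors, (0, 1, 2)))
```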

33-36 GFM I: Predictive Model Interpretation
Adapt the notation:
- Introduce, for each tensor element l = 1, ..., n, indicator vectors x_{l,j}, j = 1, ..., m, of length p_j.
- Form for each tensor element y_{i_1,...,i_m} a predictor vector x_l ∈ R^p, p = p_1 + p_2 + ... + p_m, by concatenating the x_{l,j}: x^{CD}_l = (0, ..., 1, ..., 0 | 0, ..., 1, ..., 0 | ...), where the first block is x_{l,1}, the second x_{l,2}, and so on.
- Denote by V = {V_1, ..., V_m} the set of all predictive model parameters.
- Rewrite y^{CD}_{i_1,...,i_m} as y^{CD}_l = f(x^{CD}_l | V) with the general model f(x_l | V) = sum_{f=1}^k prod_{j=1}^m x_{l,j}^T v_{j,f} = sum_{f=1}^k v_{1,i_1,f} v_{2,i_2,f} ... v_{m,i_m,f}.
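A sketch (not from the slides) of this notation change: building the concatenated one-hot predictor vector and evaluating f(x_l | V) block by block. The helper names, 0-based indices, and toy sizes are assumptions for illustration:

```python
import numpy as np

def x_cd(idx, mode_sizes):
    """Concatenated one-hot predictor vector for a CD tensor entry (0-based indices)."""
    blocks = []
    for i, p in zip(idx, mode_sizes):
        x = np.zeros(p)
        x[i] = 1.0
        blocks.append(x)
    return np.concatenate(blocks)

def f_gfm(x, factors, mode_sizes):
    """f(x_l | V) = sum_f prod_j x_{l,j}^T v_{j,f}."""
    prod = np.ones(factors[0].shape[1])
    offset = 0
    for V, p in zip(factors, mode_sizes):
        x_j = x[offset:offset + p]    # sub-vector for mode j
        prod *= x_j @ V               # x_{l,j}^T v_{j,f} for all latent dimensions f
        offset += p
    return prod.sum()

rng = np.random.default_rng(3)
sizes = (3, 4, 5)
V = [rng.normal(size=(p, 2)) for p in sizes]
x = x_cd((0, 1, 2), sizes)
print(f_gfm(x, V, sizes))   # equals the CD model value for entry (0, 1, 2)
```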

37-42 GFM I: Predictive Model Interpretation
More intuitive interpretation: a 2-level hierarchical representation. (Figure: graphical model with the per-mode linear models x_{l,j}^T v_{j,f} and parameters V in level 1, and the observations y_l ∼ N(µ_l, α = 1), l = 1, ..., n, in level 2.)
f(x_l | V) = µ_l = sum_{f=1}^k prod_{j=1}^m x_{l,j}^T v_{j,f}, where each factor x_{l,j}^T v_{j,f} is denoted µ_{j,f}(x_l).
- Each pair of mode j and latent dimension f has a different linear model µ_{j,f}(x_l) = x_{l,j}^T v_{j,f}, with a predictor vector x_{l,j} describing mode j and p_j model parameters v_{j,f} per mode j and latent dimension f.
- Factorization models are therefore 2-level hierarchical (multi-)linear models: the upper level has one µ_{j,f}(x_l) per latent dimension f and mode j, modeled as a linear function of the mode-dependent predictors x_{l,j}; these are merged by the generalized dot product sum_{f=1}^k prod_{j=1}^m µ_{j,f} to obtain f(x_l | V), and y^{CD}_l = f(x^{CD}_l | V) using x^{CD}_l.

43-44 Important example: Matrix Factorization
Level 1: µ_{1,f}(x_l) = x_{l,1}^T v_{1,f} (≈ row means) and µ_{2,f}(x_l) = x_{l,2}^T v_{2,f} (≈ column means).
Level 2: f(x^{CD}_l | V) = sum_{f=1}^k µ_{1,f}(x_l) µ_{2,f}(x_l) = v_{1,i}^T v_{2,j}.
(Figure: the two-level graphical model instantiated for matrix factorization.)

45-47 GFM II: Predictive Model Extension
From x^{CD}_l to arbitrary x_l:
- x_l = (x_{l,1}, ..., x_{l,m}) still consists of m sub-vectors, one per mode.
- While each x_{l,j} of x^{CD}_l has exactly one active position, and thus p_j - 1 zero values, e.g. x_{l,1} = (1, 0, ..., 0) for the 1st row and x_{l,2} = (0, 1, 0, ..., 0) for the 2nd column, ...
- ... the mode-vectors x_{l,j} may be defined arbitrarily, e.g., to take into account more complex conditional dependencies.

48-49 Understanding induced dependencies
Overall independence assumption: y_l | x_l ∼ N(µ_l = f(x_l | V), 1).
Simplest case, SVD: correlation only through identical mode indices, e.g. the row or column index; rows/columns are independent of other rows/columns. (Figure: y_{i,j} = v_{1,i}^T v_{2,j} = sum_{f=1}^k v_{1,i,f} v_{2,j,f}.)
Example with p_1 = 3 rows and p_2 = 8 columns: x_{l,1} = (1, 0, 0), x_{l,2} = (0, 1, 0, 0, 0, 0, 0, 0), so
y_l = sum_{f=1}^k µ_{1,f}(x_l) µ_{2,f}(x_l) + ε_l = sum_{f=1}^k (x_{l,1}^T v_{1,f})(x_{l,2}^T v_{2,f}) + ε_l = sum_{f=1}^k v_{1,1,f} v_{2,2,f} + ε_l.

50-52 Creating new dependencies
Overall independence assumption: y_l | x_l ∼ N(µ_l = f(x_l | V), 1).
Dependencies between different rows/columns for SVD: more than one active indicator per mode. With p_1 = 3 rows and p_2 = 8 columns, e.g. x_{l,1} = (1, 0, 0), x_{l,2} = (0, 1, 1, 0, 1, 0, 0, 0):
f(x_l | V) = sum_{f=1}^k (x_{l,1}^T v_{1,f})(x_{l,2}^T v_{2,f}) = sum_{f=1}^k v_{1,1,f} (v_{2,2,f} + v_{2,3,f} + v_{2,5,f})
Example from recommender systems: SVD++ [4].
[4] Koren, Y.: Factorization meets the neighborhood: a multifaceted collaborative filtering model. KDD 2008.
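A self-contained numeric check (not from the slides) of the expansion above, with random factors and the slide's predictor vectors; 0-based indexing shifts the active positions by one:

```python
import numpy as np

rng = np.random.default_rng(4)
p1, p2, k = 3, 8, 2
V1, V2 = rng.normal(size=(p1, k)), rng.normal(size=(p2, k))

# more than one active indicator in the column mode, as in the slide
x1 = np.array([1., 0., 0.])
x2 = np.array([0., 1., 1., 0., 1., 0., 0., 0.])

f = np.sum((x1 @ V1) * (x2 @ V2))                  # sum_f (x1^T v_{1,f})(x2^T v_{2,f})
check = np.sum(V1[0] * (V2[1] + V2[2] + V2[4]))    # v_{1,1,f}(v_{2,2,f}+v_{2,3,f}+v_{2,5,f})
print(np.allclose(f, check))   # True
```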

53-54 Creating new dependencies
Overall independence assumption: y_l | x_l ∼ N(µ_l = f(x_l | V), 1).
Dependencies beyond rows and/or columns for SVD: p_1 = 3 rows and p_2 = 8 columns plus additional rating information, e.g.
x_{l,1} = (1, 0, 0) (users), x_{l,2} = (0, 1, 0, 0, 0, 0, 0, 0 | 0, 0, 3, 0, 4, 0, 0, 0) (items | ratings for co-rated items):
f(x_l | V) = sum_{f=1}^k (x_{l,1}^T v_{1,f})(x_{l,2}^T v_{2,f}) = sum_{f=1}^k v_{1,1,f} (3 v_{2,3,f} + 4 v_{2,5,f})
Example from recommender systems: Factorized Transition Tensors [5].
[5] Rendle, S., Freudenthaler, C., Schmidt-Thieme, L.: Factorizing personalized Markov chains for next-basket recommendation. WWW 2010.

55-57 GFM II: Predictive Model Extension II
f(x_l | V) = µ_l = sum_{f=1}^k prod_{j=1}^m x_{l,j}^T v_{j,f} = sum_{f=1}^k prod_{j=1}^m sum_{i_j=1}^{p_j} x_{l,j,i_j} v_{j,i_j,f}
- Only one predictive model per mode m. To increase predictive accuracy, combine several reasonable predictive models per mode:
- Introduce mode-defining selection vectors d_j ∈ {0, 1}^p on x_l = (x_1, ..., x_m) ∈ R^p, j = 1, ..., m. One set of selection vectors {d_1, ..., d_m} defines one model: f(x_l | V) = µ_l = sum_{f=1}^k prod_{j=1}^m sum_{i=1}^p x_{l,i} d_{j,i} v_{i,f}. (Figure: selection vectors picking sub-blocks of the parameter matrix V.)
- Collect several selection sets {d_1, ..., d_m} ∈ D: f_GFM(x_l | V) = µ_l = sum_{f=1}^k sum_{{d_1,...,d_m} ∈ D} prod_{j=1}^m sum_{i=1}^p x_{l,i} d_{j,i} v_{i,f}

58 GFM II: Predictive Model Learning?
f_GFM(x_l | V) = sum_D sum_{i_1=1}^p ... sum_{i_m=1}^p x_{l,i_1} ... x_{l,i_m} (d_{1,i_1} ... d_{m,i_m}) (sum_{f=1}^k v_{i_1,f} ... v_{i_m,f}), where the first bracket acts as interaction weight I and the second as interaction weight II.
Infer the model selection D by Bayesian model averaging?
+ also captures negative interaction effects of even order
- redundant parameterization
- expensive model prediction, O(|D| m p k)

59-61 GFM II: Predictive Model Selection
Specify D, i.e. select a specific model.
Extended Matrix Factorization: m = 2 different modes, p different mode-models, |D| = m p = 2p:
D = { {(d_{1,1}, ..., d_{1,p}), (d_{2,1}, ..., d_{2,p})} : d_{1,i_1} = 1, d_{2,i_2} = 1 for i_2 > i_1, 0 else }
ex. 1: d_1 = (1, 0, 0, ..., 0), d_2 = (0, 1, 1, 1, ..., 1)
ex. 2: d_1 = (0, 1, 0, ..., 0), d_2 = (0, 0, 1, 1, ..., 1)
f(x_l | V) = µ_l = sum_{i_1=1}^p sum_{i_2 > i_1}^p x_{l,i_1} x_{l,i_2} sum_{f=1}^k v_{i_1,f} v_{i_2,f}
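A naive O(p² k) sketch (not from the slides) of this extended matrix factorization model; the function name and toy values are illustrative:

```python
import numpy as np

def f_pairwise(x, V):
    """Extended matrix factorization from the slide:
       f(x | V) = sum_{i1} sum_{i2 > i1} x_{i1} x_{i2} <v_{i1}, v_{i2}>."""
    p = len(x)
    total = 0.0
    for i1 in range(p):
        for i2 in range(i1 + 1, p):
            total += x[i1] * x[i2] * (V[i1] @ V[i2])
    return total

rng = np.random.default_rng(5)
x, V = rng.random(5), rng.normal(size=(5, 2))   # p = 5 predictors, k = 2
print(f_pairwise(x, V))
```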

62-66 Selecting a prior distribution
So far: y_l | x_l ∼ N(µ_l = f(x_l | V), 1); the more general x_l unifies recent factorization model enhancements.
Next step: select a prior distribution for the model parameters V, using prior knowledge:
- all tensor elements lie in R ⇒ k linearly independent, real-valued latent dimensions
- each tensor element is a sum over linear functions of the same x_l ⇒ different variances σ_f^2 along the latent dimensions
- centered at some dimension-dependent mean µ_f
- center µ_f and variance σ_f^2 may differ between modes

67-68 Selecting a prior distribution
Conjugate Gaussian prior distribution: each latent k-dimensional representation v_i ∈ R^k, i = 1, ..., p, is an independent draw v_i ∼ N((µ_1, ..., µ_k), diag(σ_1^2, ..., σ_k^2)).
Conjugate Gaussian hyperprior: each prior mean µ_f, f = 1, ..., k, is an independent realization µ_f ∼ N(µ_0, (c_0 λ_f)^{-1}).
Conjugate Gamma hyperprior: each precision λ_f = σ_f^{-2}, f = 1, ..., k, is an independent realization λ_f ∼ G(α_0, β_0).
(Figure: graphical model with hyperparameters µ_0, c_0, α_0, β_0, per-dimension µ_f and λ_f, latent vectors v_i, i = 1, ..., p, and observations y_l, l = 1, ..., n.)
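A minimal sketch (not from the slides) of drawing parameters from this hierarchical prior. The hyperparameter values are arbitrary, and the variance form (c_0 λ_f)^{-1} for µ_f is assumed here as the standard conjugate choice:

```python
import numpy as np

rng = np.random.default_rng(6)
p, k = 10, 4                 # p rows of one mode, k latent dimensions
alpha0, beta0 = 1.0, 1.0     # Gamma hyperparameters (illustrative values)
mu0, c0 = 0.0, 1.0           # Gaussian hyperparameters (illustrative values)

# hyperpriors: one precision and one mean per latent dimension f
lam = rng.gamma(alpha0, 1.0 / beta0, size=k)      # lambda_f ~ Gamma(alpha0, beta0)
mu = rng.normal(mu0, np.sqrt(1.0 / (c0 * lam)))   # mu_f ~ N(mu0, (c0 * lambda_f)^-1), assumed conjugate form

# prior: each latent row v_i drawn independently from N(mu, diag(1 / lambda))
V = rng.normal(mu, np.sqrt(1.0 / lam), size=(p, k))
print(V.shape)
```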

69-72 Learning Generalized Factorization Models
- Stochastic gradient descent: scalable, simple (a minimal sketch for the matrix case follows below)
- Alternating least squares
- Bayesian analysis, i.e. standard Gibbs sampling: easy to derive for multi-linear models like f(x_l | V) = sum_{i_1=1}^p sum_{i_2 > i_1}^p x_{l,i_1} x_{l,i_2} sum_{f=1}^k v_{i_1,f} v_{i_2,f}
- Learning on large datasets: block size = 1 ⇒ O(N_z k), with N_z the number of observed (non-zero) entries
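A minimal stochastic gradient descent sketch for matrix factorization on observed entries only (not the slides' implementation); learning rate, regularization, and the toy data are illustrative:

```python
import numpy as np

def sgd_epoch(obs, V1, V2, lr=0.01, reg=0.05):
    """One SGD pass over observed entries (i, j, y_ij) of a partially observed matrix."""
    for i, j, y in obs:
        err = y - V1[i] @ V2[j]
        v1_old = V1[i].copy()
        V1[i] += lr * (err * V2[j] - reg * V1[i])
        V2[j] += lr * (err * v1_old - reg * V2[j])

rng = np.random.default_rng(7)
V1 = rng.normal(0.0, 0.1, (5, 3))   # row factors
V2 = rng.normal(0.0, 0.1, (8, 3))   # column factors
obs = [(0, 1, 4.0), (2, 5, 1.0), (4, 7, 3.5)]   # toy observed entries
for _ in range(200):
    sgd_epoch(obs, V1, V2)
print(V1[0] @ V2[1])   # moves toward the observed value 4.0 during training
```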

73 Outline: Generalized Factorization Model; Relations to Standard Models; Empirical Evaluation; Conclusion and Discussion

74-77 Polynomial Regression
f(x_l) = sum_{i_1=1}^p x_{l,i_1} β_{i_1} + sum_{i_1=1}^p sum_{i_2 ≥ i_1}^p x_{l,i_1} x_{l,i_2} β_{i_1,i_2} + ... + sum_{i_1=1}^p ... sum_{i_o ≥ i_{o-1}}^p x_{l,i_1} ... x_{l,i_o} β_{i_1,...,i_o}: an order-o polynomial regression model.
f_GFM(x_l | V) = sum_{f=1}^k sum_{{d_1,...,d_m} ∈ D} prod_{j=1}^m sum_{i=1}^p x_{l,i} d_{j,i} v_{i,f}: the Generalized Factorization Model of an m-mode tensor.
Define one of the p predictor variables as constant, e.g. x_{l,1} = 1 for all l = 1, ..., n, and fix the corresponding parameter vector v_1 ∈ V to (1, ..., 1) of length k.
Selecting D accordingly gives
f_GFM(x_l | V) = sum_{i_1=1}^p x_{l,i_1} sum_{f=1}^k v_{i_1,f} + sum_{i_1=1}^p sum_{i_2 ≥ i_1}^p x_{l,i_1} x_{l,i_2} sum_{f=1}^k v_{i_1,f} v_{i_2,f} + ... + sum_{i_1=1}^p ... sum_{i_m ≥ i_{m-1}}^p x_{l,i_1} ... x_{l,i_m} sum_{f=1}^k v_{i_1,f} ... v_{i_m,f},
where the factorized sums play the roles of β_{i_1}, β_{i_1,i_2}, ..., β_{i_1,...,i_m}.

78 Polynomial Regression vs. GFM
Generalized Factorization Models include polynomial regression:
- for factorized parameters, e.g. β_{i_1,i_2} = sum_{f=1}^k v_{i_1,f} v_{i_2,f}
- if the number of modes m equals the order o

79-80 Factorization Machines [6] vs. GFM
Very similar to the factorized polynomial regression model above:
f_GFM(x_l | V) = sum_{i_1=1}^p x_{l,i_1} sum_{f=1}^k v_{i_1,f} + sum_{i_1=1}^p sum_{i_2 ≥ i_1}^p x_{l,i_1} x_{l,i_2} sum_{f=1}^k v_{i_1,f} v_{i_2,f} + ...
f_FM(x_l | V) = sum_{i_1=1}^p x_{l,i_1} β_{i_1} + sum_{i_1=1}^p sum_{i_2 ≥ i_1}^p x_{l,i_1} x_{l,i_2} sum_{f=1}^k v_{i_1,f} v_{i_2,f}
(in the GFM the first-order weights are factorized, β_{i_1} = sum_f v_{i_1,f}; in the FM they are free parameters).
Factorization Machines using simple Gibbs sampling have been successfully applied in several challenges.
[6] Rendle, S.: Factorization Machines. ICDM 2010.
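A sketch (not from the slides) of a second-order factorization machine predictor in the common formulation with strictly ordered pairs i < j, using the O(k p) reformulation of the pairwise term from the FM paper; the names and toy data are illustrative:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM: w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j,
       with the pairwise term computed in O(k p)."""
    linear = w0 + w @ x
    xv = x @ V                                        # shape (k,)
    pair = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))
    return linear + pair

rng = np.random.default_rng(8)
p, k = 6, 2
x = rng.random(p)
w0, w, V = 0.1, rng.normal(size=p), rng.normal(size=(p, k))
print(fm_predict(x, w0, w, V))
```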

81-85 Neural Networks vs. GFM
(Figure: a feed-forward network with input nodes x_{l,1}, ..., x_{l,4}, two hidden nodes with activations f(x_l, v_1) and f(x_l, v_2), weights v_{i,f} on the input-to-hidden connections, and output weights w_1, w_2 leading to the output y_l.)
- Input: predictor vector x_l ∈ R^p, l = 1, ..., n (here p = 4)
- Hidden: the hidden nodes correspond to the k latent dimensions (here k = 2)
- Activation function: f(x_l, v_f) := sum_{{d_1,...,d_m} ∈ D} prod_{j=1}^m sum_{i=1}^p x_{l,i} d_{j,i} v_{i,f}
- Output: weights w_f = 1
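A small sketch (not from the slides) of this network view for the simplest CD-type case with a single selection set: one hidden node per latent dimension, the multiplicative activation above, and output weights fixed to 1. Function name, mode layout, and values are illustrative:

```python
import numpy as np

def gfm_network_forward(x, V, mode_slices):
    """Network view of the CD-type GFM: hidden node f computes prod_j (x_{l,j}^T v_{j,f}),
       and the output layer sums the hidden nodes with weights w_f = 1."""
    hidden = np.ones(V.shape[1])
    for sl in mode_slices:          # slice of input positions belonging to mode j
        hidden *= x[sl] @ V[sl]     # x_{l,j}^T v_{j,f} for every hidden node f
    return hidden.sum()

p1, p2, k = 3, 4, 2
rng = np.random.default_rng(9)
V = rng.normal(size=(p1 + p2, k))               # weights on the input-to-hidden edges
x = np.array([1., 0., 0., 0., 1., 0., 0.])      # one-hot blocks for the two modes
print(gfm_network_forward(x, V, [slice(0, p1), slice(p1, p1 + p2)]))
```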

86 Outline: Generalized Factorization Model; Relations to Standard Models; Empirical Evaluation; Conclusion and Discussion

87 Empirical Evaluation on Recommender System Data
Task: movie rating prediction. Data [7]: 100M ratings of 480K users on 18K items (Netflix).
(Figure: RMSE on Netflix versus the number of latent dimensions for SGD KNN256, SGD KNN256++, BFM KNN256, BFM KNN256++, SGD PMF, BPMF, BFM (u,i), BFM (u,i,t,f), BFM (u,i,ut,it,if), and BFM Ensemble.)

88 Bayesian Factorization Machines - Kaggle
Task: student performance prediction on GMAT (Graduate Management Admission Test), SAT, and ACT (college admission tests). 118 contestants. Data [8].

89 Bayesian Factorization Machines - KDD Cup 11
Task: music rating prediction. 1000 contestants. Data [9]: 250M ratings of 1M users on 625K items (songs, tracks, albums, or artists).

90 Outline: Generalized Factorization Model; Relations to Standard Models; Empirical Evaluation; Conclusion and Discussion

91-95 In a nutshell
- GFMs, and thus matrix and tensor factorization, are types of regression models.
- Generalized Factorization Models relate to polynomial regression (for factorized parameters) and to feed-forward neural networks (for a given activation function).
- Factorization Machines with simple Gibbs sampling show decent predictive performance.
Open questions:
- Bayesian learning for models with non-linear functions of V?
- Bayesian model averaging for GFM?
- Efficient Bayesian inference for non-Gaussian likelihoods?


More information

Introduction to Probabilistic Graphical Models

Introduction to Probabilistic Graphical Models Introduction to Probabilistic Graphical Models Sargur Srihari srihari@cedar.buffalo.edu 1 Topics 1. What are probabilistic graphical models (PGMs) 2. Use of PGMs Engineering and AI 3. Directionality in

More information

Recommender Systems EE448, Big Data Mining, Lecture 10. Weinan Zhang Shanghai Jiao Tong University

Recommender Systems EE448, Big Data Mining, Lecture 10. Weinan Zhang Shanghai Jiao Tong University 2018 EE448, Big Data Mining, Lecture 10 Recommender Systems Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html Content of This Course Overview of

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

PILCO: A Model-Based and Data-Efficient Approach to Policy Search

PILCO: A Model-Based and Data-Efficient Approach to Policy Search PILCO: A Model-Based and Data-Efficient Approach to Policy Search (M.P. Deisenroth and C.E. Rasmussen) CSC2541 November 4, 2016 PILCO Graphical Model PILCO Probabilistic Inference for Learning COntrol

More information

Latent Semantic Analysis. Hongning Wang

Latent Semantic Analysis. Hongning Wang Latent Semantic Analysis Hongning Wang CS@UVa VS model in practice Document and query are represented by term vectors Terms are not necessarily orthogonal to each other Synonymy: car v.s. automobile Polysemy:

More information

Dimension Reduction. David M. Blei. April 23, 2012

Dimension Reduction. David M. Blei. April 23, 2012 Dimension Reduction David M. Blei April 23, 2012 1 Basic idea Goal: Compute a reduced representation of data from p -dimensional to q-dimensional, where q < p. x 1,...,x p z 1,...,z q (1) We want to do

More information

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1

More information

Collaborative Filtering

Collaborative Filtering Collaborative Filtering Nicholas Ruozzi University of Texas at Dallas based on the slides of Alex Smola & Narges Razavian Collaborative Filtering Combining information among collaborating entities to make

More information

Probabilistic & Unsupervised Learning

Probabilistic & Unsupervised Learning Probabilistic & Unsupervised Learning Week 2: Latent Variable Models Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College

More information

Recommender Systems. Dipanjan Das Language Technologies Institute Carnegie Mellon University. 20 November, 2007

Recommender Systems. Dipanjan Das Language Technologies Institute Carnegie Mellon University. 20 November, 2007 Recommender Systems Dipanjan Das Language Technologies Institute Carnegie Mellon University 20 November, 2007 Today s Outline What are Recommender Systems? Two approaches Content Based Methods Collaborative

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Information retrieval LSI, plsi and LDA. Jian-Yun Nie

Information retrieval LSI, plsi and LDA. Jian-Yun Nie Information retrieval LSI, plsi and LDA Jian-Yun Nie Basics: Eigenvector, Eigenvalue Ref: http://en.wikipedia.org/wiki/eigenvector For a square matrix A: Ax = λx where x is a vector (eigenvector), and

More information

Probabilistic Latent-Factor Database Models

Probabilistic Latent-Factor Database Models Probabilistic Latent-Factor Database Models Denis Krompaß 1, Xueyian Jiang 1, Maximilian Nicel 2, and Voler Tresp 1,3 1 Ludwig Maximilian University of Munich, Germany 2 Massachusetts Institute of Technology,

More information

Large Scale Data Analysis Using Deep Learning

Large Scale Data Analysis Using Deep Learning Large Scale Data Analysis Using Deep Learning Linear Algebra U Kang Seoul National University U Kang 1 In This Lecture Overview of linear algebra (but, not a comprehensive survey) Focused on the subset

More information

CS60021: Scalable Data Mining. Dimensionality Reduction

CS60021: Scalable Data Mining. Dimensionality Reduction J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Dimensionality Reduction Sourangshu Bhattacharya Assumption: Data lies on or near a

More information

Background Mathematics (2/2) 1. David Barber

Background Mathematics (2/2) 1. David Barber Background Mathematics (2/2) 1 David Barber University College London Modified by Samson Cheung (sccheung@ieee.org) 1 These slides accompany the book Bayesian Reasoning and Machine Learning. The book and

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Matrix Factorization Techniques for Recommender Systems

Matrix Factorization Techniques for Recommender Systems Matrix Factorization Techniques for Recommender Systems By Yehuda Koren Robert Bell Chris Volinsky Presented by Peng Xu Supervised by Prof. Michel Desmarais 1 Contents 1. Introduction 4. A Basic Matrix

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

Factor Analysis (10/2/13)

Factor Analysis (10/2/13) STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.

More information

Personalized Ranking for Non-Uniformly Sampled Items

Personalized Ranking for Non-Uniformly Sampled Items JMLR: Workshop and Conference Proceedings 18:231 247, 2012 Proceedings of KDD-Cup 2011 competition Personalized Ranking for Non-Uniformly Sampled Items Zeno Gantner Lucas Drumond Christoph Freudenthaler

More information

Manifold Learning for Signal and Visual Processing Lecture 9: Probabilistic PCA (PPCA), Factor Analysis, Mixtures of PPCA

Manifold Learning for Signal and Visual Processing Lecture 9: Probabilistic PCA (PPCA), Factor Analysis, Mixtures of PPCA Manifold Learning for Signal and Visual Processing Lecture 9: Probabilistic PCA (PPCA), Factor Analysis, Mixtures of PPCA Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inria.fr http://perception.inrialpes.fr/

More information

Linear Algebra for Machine Learning. Sargur N. Srihari

Linear Algebra for Machine Learning. Sargur N. Srihari Linear Algebra for Machine Learning Sargur N. srihari@cedar.buffalo.edu 1 Overview Linear Algebra is based on continuous math rather than discrete math Computer scientists have little experience with it

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Relational Stacked Denoising Autoencoder for Tag Recommendation. Hao Wang

Relational Stacked Denoising Autoencoder for Tag Recommendation. Hao Wang Relational Stacked Denoising Autoencoder for Tag Recommendation Hao Wang Dept. of Computer Science and Engineering Hong Kong University of Science and Technology Joint work with Xingjian Shi and Dit-Yan

More information

Matrix decompositions

Matrix decompositions Matrix decompositions Zdeněk Dvořák May 19, 2015 Lemma 1 (Schur decomposition). If A is a symmetric real matrix, then there exists an orthogonal matrix Q and a diagonal matrix D such that A = QDQ T. The

More information

Linear Algebra Background

Linear Algebra Background CS76A Text Retrieval and Mining Lecture 5 Recap: Clustering Hierarchical clustering Agglomerative clustering techniques Evaluation Term vs. document space clustering Multi-lingual docs Feature selection

More information

Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning

Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning Michalis K. Titsias Department of Informatics Athens University of Economics and Business

More information

Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine

Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine Nitish Srivastava nitish@cs.toronto.edu Ruslan Salahutdinov rsalahu@cs.toronto.edu Geoffrey Hinton hinton@cs.toronto.edu

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Machine Learning Techniques

Machine Learning Techniques Machine Learning Techniques ( 機器學習技法 ) Lecture 13: Deep Learning Hsuan-Tien Lin ( 林軒田 ) htlin@csie.ntu.edu.tw Department of Computer Science & Information Engineering National Taiwan University ( 國立台灣大學資訊工程系

More information

Graphical Models 359

Graphical Models 359 8 Graphical Models Probabilities play a central role in modern pattern recognition. We have seen in Chapter 1 that probability theory can be expressed in terms of two simple equations corresponding to

More information

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan Clustering CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Supervised vs Unsupervised Learning Supervised learning Given x ", y " "%& ', learn a function f: X Y Categorical output classification

More information

Wavelet Transform And Principal Component Analysis Based Feature Extraction

Wavelet Transform And Principal Component Analysis Based Feature Extraction Wavelet Transform And Principal Component Analysis Based Feature Extraction Keyun Tong June 3, 2010 As the amount of information grows rapidly and widely, feature extraction become an indispensable technique

More information