Matrix and Tensor Factorization from a Machine Learning Perspective

Similar documents
Scaling Neighbourhood Methods

Latent Variable Models and EM algorithm

Large-scale Ordinal Collaborative Filtering

Techniques for Dimensionality Reduction. PCA and Other Matrix Factorization Methods

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling

Andriy Mnih and Ruslan Salakhutdinov

Matrix Factorization In Recommender Systems. Yong Zheng, PhDc Center for Web Intelligence, DePaul University, USA March 4, 2015

Using SVD to Recommend Movies

Matrix Factorization and Recommendation Systems

Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning

Graphical Models for Collaborative Filtering

Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model

Pattern Recognition and Machine Learning

a Short Introduction

ECS289: Scalable Machine Learning

A Modified PMF Model Incorporating Implicit Item Associations

Collaborative Filtering. Radek Pelánek

Data Mining Techniques

Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF)

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent

Collaborative Filtering Applied to Educational Data Mining

Reading Group on Deep Learning Session 1

Matrix Factorization Techniques for Recommender Systems

15 Singular Value Decomposition

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD

Introduction PCA classic Generative models Beyond and summary. PCA, ICA and beyond

Matrix Factorization Techniques For Recommender Systems. Collaborative Filtering

Jeffrey D. Ullman Stanford University

Mixed Membership Matrix Factorization

Content-based Recommendation

Lecture 16: Introduction to Neural Networks

Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang

Kernel methods, kernel SVM and ridge regression

Recent Advances in Bayesian Inference Techniques

SCMF: Sparse Covariance Matrix Factorization for Collaborative Filtering

Collaborative Filtering: A Machine Learning Perspective

Recommendation Systems

Restricted Boltzmann Machines for Collaborative Filtering

Decoupled Collaborative Ranking

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

Dimensionality Reduction and Principle Components Analysis

NetBox: A Probabilistic Method for Analyzing Market Basket Data

Data Mining Techniques

STA 4273H: Statistical Machine Learning

Scalable Bayesian Matrix Factorization

Mixed Membership Matrix Factorization

Introduction to Data Mining

CS425: Algorithms for Web Scale Data

Introduction to Machine Learning

10-701/15-781, Machine Learning: Homework 4

Nonparametric Bayesian Methods (Gaussian Processes)

Algorithms for Collaborative Filtering

Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis

PCA, Kernel PCA, ICA

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

CS281 Section 4: Factor Analysis and PCA

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data.

Exercise Sheet 1. 1 Probability revision 1: Student-t as an infinite mixture of Gaussians

Generative Models for Discrete Data

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Introduction to Probabilistic Graphical Models

Recommender Systems EE448, Big Data Mining, Lecture 10. Weinan Zhang Shanghai Jiao Tong University

STA 4273H: Statistical Machine Learning

PILCO: A Model-Based and Data-Efficient Approach to Policy Search

Latent Semantic Analysis. Hongning Wang

Dimension Reduction. David M. Blei. April 23, 2012

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

Collaborative Filtering

Probabilistic & Unsupervised Learning

Recommender Systems. Dipanjan Das Language Technologies Institute Carnegie Mellon University. 20 November, 2007

Lecture Notes 1: Vector spaces

Information retrieval LSI, plsi and LDA. Jian-Yun Nie

Probabilistic Latent-Factor Database Models

Large Scale Data Analysis Using Deep Learning

CS60021: Scalable Data Mining. Dimensionality Reduction

Background Mathematics (2/2) 1. David Barber

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Matrix Factorization Techniques for Recommender Systems

Statistical Machine Learning

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

Factor Analysis (10/2/13)

Personalized Ranking for Non-Uniformly Sampled Items

Manifold Learning for Signal and Visual Processing Lecture 9: Probabilistic PCA (PPCA), Factor Analysis, Mixtures of PPCA

Linear Algebra for Machine Learning. Sargur N. Srihari

Linear & nonlinear classifiers

Chris Bishop s PRML Ch. 8: Graphical Models

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Relational Stacked Denoising Autoencoder for Tag Recommendation. Hao Wang

Matrix decompositions

Linear Algebra Background

Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning

Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine

ECE521 week 3: 23/26 January 2017

Machine Learning Techniques

Graphical Models 359

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Wavelet Transform And Principal Component Analysis Based Feature Extraction

Transcription:

Matrix and Tensor Factorization from a Machine Learning Perspective
Christoph Freudenthaler
Information Systems and Machine Learning Lab, University of Hildesheim
Research Seminar, Vienna University of Economics and Business, January 13, 2012
Christoph Freudenthaler 1 / 34 ISMLL, University of Hildesheim

Matrix Factorization - SVD
SVD: Decompose a $p_1 \times p_2$ matrix $Y := V_1 D V_2^T$
$V_1$ are the eigenvectors of $YY^T$
$V_2$ are the eigenvectors of $Y^T Y$
$D = \mathrm{diag}(d_1, \dots, d_k)$ is the diagonal matrix of eigenvalues of $YY^T$
[Diagram: the $p_1 \times p_2$ matrix Y factored into $V_1$ ($p_1 \times k$), the diagonal matrix D, and $V_2^T$ ($k \times p_2$)]
Properties:
Method from linear algebra for arbitrary Y
Tool for descriptive and exploratory data analysis
Decomposition optimizes the Frobenius norm with orthogonality constraints
Christoph Freudenthaler 2 / 34 ISMLL, University of Hildesheim
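
As a concrete illustration (not part of the original slides), a minimal numpy sketch of this decomposition and of the eigenvector characterization of $V_1$ and $V_2$:

```python
# Minimal sketch: Y = V1 @ D @ V2.T, where V1 holds eigenvectors of Y Y^T and
# V2 holds eigenvectors of Y^T Y; the squared singular values are the eigenvalues.
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 3))             # arbitrary p1 x p2 matrix

V1, d, V2t = np.linalg.svd(Y, full_matrices=False)
D = np.diag(d)                          # singular values on the diagonal

# Reconstruction is exact up to floating-point error
assert np.allclose(Y, V1 @ D @ V2t)

# Eigenvector characterization of the factor matrices
assert np.allclose(Y @ Y.T @ V1, V1 @ D**2)
assert np.allclose(Y.T @ Y @ V2t.T, V2t.T @ D**2)
```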

Tensor Factorization - Tucker Decomposition
Tucker Decomposition: Decompose a $p_1 \times p_2 \times p_3$ tensor $Y := D \times_1 V_1 \times_2 V_2 \times_3 V_3$
$V_1$ are the $k_1$ eigenvectors of the mode-1 unfolded Y
$V_2$ are the $k_2$ eigenvectors of the mode-2 unfolded Y
$V_3$ are the $k_3$ eigenvectors of the mode-3 unfolded Y
$D \in \mathbb{R}^{k_1 \times k_2 \times k_3}$ is a non-diagonal core tensor
[Diagram: Y ($p_1 \times p_2 \times p_3$) equals the core tensor D multiplied along each mode by $V_1$ ($p_1 \times k_1$), $V_2$ ($p_2 \times k_2$), $V_3$ ($p_3 \times k_3$)]
Properties:
Extension of the SVD to an arbitrary tensor Y, orthogonal representation
Tool for descriptive and exploratory data analysis
Decomposition optimizes the Frobenius norm
Expensive inference: sequences of SVDs + core tensor D
Christoph Freudenthaler 3 / 34 ISMLL, University of Hildesheim
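
A minimal HOSVD-style sketch (my own illustration, not the authors' code) of how the factor matrices come from SVDs of the mode-n unfoldings and how the core tensor and reconstruction are formed:

```python
# Tucker via higher-order SVD: factor matrices from the unfoldings, core tensor
# from multiplying Y by their transposes along every mode.
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_multiply(T, M, mode):
    """Multiply tensor T along `mode` by matrix M of shape (r, T.shape[mode])."""
    Tm = M @ unfold(T, mode)
    new_shape = (M.shape[0],) + tuple(np.delete(T.shape, mode))
    return np.moveaxis(Tm.reshape(new_shape), 0, mode)

rng = np.random.default_rng(0)
Y = rng.normal(size=(4, 5, 6))
ranks = (4, 5, 6)                       # full ranks -> exact reconstruction

# Factor matrices: leading left singular vectors of each unfolding
V = [np.linalg.svd(unfold(Y, n), full_matrices=False)[0][:, :r]
     for n, r in enumerate(ranks)]

# Core tensor D = Y x_1 V1^T x_2 V2^T x_3 V3^T
D = Y
for n, Vn in enumerate(V):
    D = mode_multiply(D, Vn.T, n)

# Reconstruction Y = D x_1 V1 x_2 V2 x_3 V3
Y_hat = D
for n, Vn in enumerate(V):
    Y_hat = mode_multiply(Y_hat, Vn, n)

assert np.allclose(Y, Y_hat)
```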

Tensor Factorization - Canonical Decomposition
Canonical Decomposition (CD): Decompose a $p_1 \times p_2 \times p_3$ tensor $Y := D \times_1 V_1 \times_2 V_2 \times_3 V_3$ with a diagonal identity tensor D, i.e. as a sum of k rank-one tensors:
$Y = \sum_{f=1}^{k} v_{1,f} \otimes v_{2,f} \otimes v_{3,f}$
[Diagram: Y ($p_1 \times p_2 \times p_3$) expressed with the $k \times k \times k$ identity core and factor matrices $V_1$, $V_2$, $V_3$]
Properties:
Extension of the SVD to an arbitrary tensor Y
Tool for descriptive and exploratory data analysis
Decomposition optimizes the Frobenius norm
Fast inference due to fewer parameters
Christoph Freudenthaler 4 / 34 ISMLL, University of Hildesheim
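
A small numpy sketch (again my own, not from the slides) of the CD model as a sum of k rank-one tensors:

```python
# CD / CP reconstruction: Y[i,j,l] = sum_f V1[i,f] * V2[j,f] * V3[l,f].
import numpy as np

def cp_reconstruct(V1, V2, V3):
    return np.einsum('if,jf,lf->ijl', V1, V2, V3)

rng = np.random.default_rng(0)
k = 2
V1, V2, V3 = (rng.normal(size=(p, k)) for p in (4, 5, 6))

Y = cp_reconstruct(V1, V2, V3)          # a 4 x 5 x 6 tensor of CP-rank <= k

# Equivalent explicit sum of outer products, one rank-one tensor per latent dimension f
Y_sum = sum(np.multiply.outer(np.multiply.outer(V1[:, f], V2[:, f]), V3[:, f])
            for f in range(k))
assert np.allclose(Y, Y_sum)
```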

Machine Learning Perspective
Machine Learning ... is the task of learning from (noisy) experience E with respect to some class of tasks T and performance measure P
Experience E: data
Tasks T: predictions
Performance P: root mean square error (RMSE), misclassification, ...
... to generalize from noisy data to accurately predict new cases
In other words:
Prediction accuracy on new cases = Machine Learning
No hypothesis generation/testing = Data Mining
Christoph Freudenthaler 5 / 34 ISMLL, University of Hildesheim

Applications of the Machine Learning Perspective
Many different applications:
Social Network Analysis
Recommender Systems
Graph Analysis
Image/Video Analysis
...
Christoph Freudenthaler 6 / 34 ISMLL, University of Hildesheim

Applications of the Machine Learning Perspective
Overall Task: Prediction of missing friendships, interesting items, corrupted pixels, missing edges, etc.
Data representation: matrix/tensor with missing entries
[Diagram: a sparse rating matrix Y with users $u_1, \dots, u_{p_1}$ as rows, items $i_1, \dots, i_{p_2}$ as columns, and a few observed entries such as 3, 5, 3, 4, 4, 5, 2, 2]
Further Properties:
Unobserved heterogeneity, e.g. different consumption preferences of different users and items
Large-scale: millions of interactions
Poor data quality: high noise due to indirect data collection
⇒ Factorization Models
Christoph Freudenthaler 7 / 34 ISMLL, University of Hildesheim

Factorization Models from a Machine Learning Perspective
Common usage of matrix (and tensor) factorization:
Identification/interpretation of unobserved heterogeneity, i.e. latent dependencies between instances of a mode (rows, columns, ...)
Data compression, e.g., instead of $p_1 \cdot p_2$ values, store only $k(p_1 + p_2)$ values
Data preprocessing: decorrelate the p predictor variables of a design matrix X
Christoph Freudenthaler 8 / 34 ISMLL, University of Hildesheim

Factorization Models from a Machine Learning Perspective
Machine Learning perspective on factorization models: factorization models seen as predictive models
No probabilistic embedding, e.g. for Bayesian analysis
Gaussian likelihood: $Y = V_1 D V_2^T + E$, $\forall e_l \in E: e_l \sim N(0, \sigma^2 = 1)$
No missing value treatment; missing entries are often treated as 0
Non-informative prior: $p(V_1, D, V_2) \propto 1$, which does not distinguish between signal and noise
No interest in the latent representation, e.g. its interpretation; elimination of the orthonormality constraint:
$Y = V_1 V_2^T + E$, and for general tensors: $Y = \sum_{f=1}^{k} v_{1,f} \otimes v_{2,f} \otimes \dots \otimes v_{m,f} + E$
Christoph Freudenthaler 8 / 34 ISMLL, University of Hildesheim
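
To make the missing-value point concrete, a small sketch (mine, not from the slides) contrasting the classical objective, where unobserved cells are effectively fitted as zeros, with an objective restricted to the observed cells:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 4))
observed = rng.random(Y.shape) < 0.5          # boolean mask of known entries
Y_filled = np.where(observed, Y, 0.0)         # "missing treated as 0"

def frobenius_loss_all(Y_filled, V1, V2):
    # classical objective: every cell contributes, including the imputed zeros
    return np.sum((Y_filled - V1 @ V2.T) ** 2)

def squared_loss_observed(Y, observed, V1, V2):
    # machine-learning objective: only observed cells contribute
    return np.sum((Y - V1 @ V2.T)[observed] ** 2)

k = 2
V1, V2 = rng.normal(size=(6, k)), rng.normal(size=(4, k))
print(frobenius_loss_all(Y_filled, V1, V2), squared_loss_observed(Y, observed, V1, V2))
```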

Related Work
Existing extensions of factorization models:
More general likelihood for multinomial, count data, etc.: Exponential Family PCA [1]
Inference only on observed tensor entries
Introduction of prior distributions [2]
Scalable learning algorithms for large-scale data: gradient descent models [3]
Missing extensions:
Extensions of the predictive models, i.e. SVD, CD
Comparison to standard predictive models
[1] Mohamed, S., Heller, K. A., Ghahramani, Z.: Exponential Family PCA, NIPS 2008.
[2] Tipping, M. E., Bishop, C. M.: Probabilistic PCA, Journal of the Royal Statistical Society.
[3] Koren, Y.: Factorization meets the neighborhood: a multifaceted collaborative filtering model, KDD 2008.
Christoph Freudenthaler 9 / 34 ISMLL, University of Hildesheim

Outline
Generalized Factorization Model
Relations to standard Models
Empirical Evaluation
Conclusion and Discussion
Christoph Freudenthaler 10 / 34 ISMLL, University of Hildesheim

Outline
Generalized Factorization Model
Relations to standard Models
Empirical Evaluation
Conclusion and Discussion
Christoph Freudenthaler 11 / 34 ISMLL, University of Hildesheim

Probabilistic SVD
Frobenius norm optimal:
$\mathrm{argmin}_{\theta=\{V_1,V_2\}} \|Y - V_1 V_2^T\|_F = \mathrm{argmin}_{\theta} \sum_{i=1}^{p_1} \sum_{j=1}^{p_2} (y_{i,j} - v_{1,i}^T v_{2,j})^2$
[Diagram: entry $y_{i,j}$ of Y is the dot product of row i of $V_1$ and row j of $V_2$, i.e. $y_{i,j} \approx v_{1,i}^T v_{2,j} = \sum_{f=1}^{k} v_{1,i,f}\, v_{2,j,f}$]
Gaussian maximum likelihood:
$\mathrm{argmax}_{\theta=\{V_1,V_2\}} \prod_{i=1}^{p_1} \prod_{j=1}^{p_2} \exp\!\left(-\tfrac{1}{2}(y_{i,j} - v_{1,i}^T v_{2,j})^2\right)$
Christoph Freudenthaler 11 / 34 ISMLL, University of Hildesheim
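
A compact sketch (my own, with an assumed learning rate and iteration count) of fitting this Frobenius-norm / Gaussian maximum-likelihood objective by plain gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, k = 20, 15, 3
Y = rng.normal(size=(p1, k)) @ rng.normal(size=(k, p2))   # synthetic low-rank data

V1 = 0.1 * rng.normal(size=(p1, k))
V2 = 0.1 * rng.normal(size=(p2, k))
lr = 0.01                                                  # assumed step size

for step in range(2000):
    E = Y - V1 @ V2.T              # residuals y_ij - v_1i^T v_2j
    V1 += lr * E @ V2              # gradient step on 0.5 * sum of squared residuals
    V2 += lr * E.T @ V1
print("final squared error:", np.sum((Y - V1 @ V2.T) ** 2))
```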

Probabilistic CD - Extend SVD to arbitrary m
Frobenius norm optimal:
$\mathrm{argmin}_{\theta=\{V_1,\dots,V_m\}} \sum_{i_1=1}^{p_1} \sum_{i_2=1}^{p_2} \cdots \sum_{i_m=1}^{p_m} \Big(y_{i_1,i_2,\dots,i_m} - \underbrace{\sum_{f=1}^{k} v_{1,i_1,f}\, v_{2,i_2,f} \cdots v_{m,i_m,f}}_{y^{CD}_{i_1,i_2,\dots,i_m}}\Big)^2$
[Diagram: for a 3-mode tensor, entry $y_{i_1,i_2,i_3}$ is the generalized dot product $\sum_{f=1}^{k} v_{1,i_1,f}\, v_{2,i_2,f}\, v_{3,i_3,f}$ of the corresponding rows of $V_1$, $V_2$ and $V_3$]
Gaussian maximum likelihood:
$\mathrm{argmax}_{\theta=\{V_1,\dots,V_m\}} \prod_{i_1=1}^{p_1} \prod_{i_2=1}^{p_2} \cdots \prod_{i_m=1}^{p_m} \exp\!\left(-\tfrac{1}{2}\big(y_{i_1,i_2,\dots,i_m} - y^{CD}_{i_1,i_2,\dots,i_m}\big)^2\right)$
Christoph Freudenthaler 12 / 34 ISMLL, University of Hildesheim

GFM I: Predictive Model Interpretation
Adapt notation:
Introduce for each tensor element $l = 1,\dots,n$ indicator vectors $x_{l,j}$, $j = 1,\dots,m$, of length $p_j$
Form for each tensor element $y_{i_1,\dots,i_m}$ a predictor vector $x_l \in \mathbb{R}^{p = p_1 + p_2 + \dots + p_m}$ by concatenating the $x_{l,j}$:
$x_l^{CD} = (\underbrace{0,\dots,1,\dots,0}_{x_{l,1}},\ \underbrace{0,\dots,1,\dots,0}_{x_{l,2}},\ \dots)$
Denote with $V = \{V_1,\dots,V_m\}$ the set of all predictive model parameters
Rewrite $y^{CD}_{i_1,\dots,i_m}$ as $y^{CD}_{i_1,\dots,i_m} = y^{CD}_l = f(x_l^{CD} \mid V)$ with the general form
$f(x_l \mid V) = \sum_{f=1}^{k} \prod_{j=1}^{m} x_{l,j}^T v_{j,f} = \sum_{f=1}^{k} v_{1,i_1,f}\, v_{2,i_2,f} \cdots v_{m,i_m,f}$
Christoph Freudenthaler 13 / 34 ISMLL, University of Hildesheim
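
A small sketch (my own construction) of the predictor vector $x_l^{CD}$, one one-hot block per mode, and of evaluating $f(x_l \mid V)$ from it:

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = (3, 8, 5), 2                              # a 3 x 8 x 5 tensor, k = 2
V = [rng.normal(size=(p_j, k)) for p_j in p]     # factor matrices V_1, V_2, V_3

idx = (0, 1, 4)                                  # one tensor element y_{i1,i2,i3}
x_blocks = []
for i_j, p_j in zip(idx, p):                     # one one-hot block per mode
    b = np.zeros(p_j)
    b[i_j] = 1.0
    x_blocks.append(b)
x_l = np.concatenate(x_blocks)                   # x_l^CD in R^{p_1+p_2+p_3}

def f(x_blocks, V):
    """f(x_l | V) = sum_f prod_j x_{l,j}^T v_{j,f}."""
    k = V[0].shape[1]
    return sum(np.prod([x_j @ V_j[:, f] for x_j, V_j in zip(x_blocks, V)])
               for f in range(k))

# Equals the CD model value sum_f v_{1,i1,f} v_{2,i2,f} v_{3,i3,f}
assert np.isclose(f(x_blocks, V),
                  sum(V[0][idx[0], t] * V[1][idx[1], t] * V[2][idx[2], t] for t in range(k)))
```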

GFM I: Predictive Model Interpretation
More intuitive interpretation: a 2-level hierarchical representation
Level 1: $y_l \sim N(\mu_l, \alpha = 1)$
Level 2: $f(x_l \mid V) = \mu_l = \sum_{f=1}^{k} \prod_{j=1}^{m} \underbrace{x_{l,j}^T v_{j,f}}_{\mu_{j,f}(x_l)}$
[Diagram: graphical model with the latent factors V at the upper level and the observations $y_l$, $l = 1,\dots,n$, at the lower level, connected through $\mu_{j,f}(x_l) = x_{l,j}^T v_{j,f}$]
Each pair of mode j and latent dimension f has a different linear model $\mu_{j,f}(x_l) = x_{l,j}^T v_{j,f}$, with predictor vector $x_{l,j}$ describing mode j and $p_j$ model parameters $v_{j,f}$ per mode j and latent dimension f
Factorization models are 2-level hierarchical (multi-)linear models with $\mu_{j,f}(x_l)$ per latent dimension f and mode j in the upper level,
modeled as linear functions of mode-dependent predictors $x_{l,j}$,
merged by the generalized dot product $\sum_{f=1}^{k} \prod_{j=1}^{m} \mu_{j,f}$ to get $f(x_l \mid V)$,
and $y_l^{CD} = f(x_l^{CD} \mid V)$ using $x_l^{CD}$
Christoph Freudenthaler 14 / 34 ISMLL, University of Hildesheim

Important example: Matrix Factorization
[Diagram: graphical model for matrix factorization with factor matrices $V_1$ and $V_2$ feeding $\mu_{1,f}(x_l) = x_{l,1}^T v_{1,f}$ and $\mu_{2,f}(x_l) = x_{l,2}^T v_{2,f}$, which generate $y_l$, $l = 1,\dots,n$]
Level 1:
$\mu_{1,f}(x_l) = x_{l,1}^T v_{1,f}$ (analogous to row means)
$\mu_{2,f}(x_l) = x_{l,2}^T v_{2,f}$ (analogous to column means)
Level 2:
$f(x_l^{CD} \mid V) = \sum_{f=1}^{k} \mu_{1,f}(x_l)\, \mu_{2,f}(x_l) = v_{1,i}^T v_{2,j}$
Christoph Freudenthaler 15 / 34 ISMLL, University of Hildesheim

GFM II: Predictive Model Extension
From $x_l^{CD}$ to arbitrary $x_l$:
$x_l = (x_{l,1}, \dots, x_{l,m})$ still consists of m subvectors, one per mode
While each $x_{l,j}$ of $x_l^{CD}$ has exactly one active position, and thus $p_j - 1$ zero values, e.g.
$x_{l,1} = (1, 0, \dots, 0)$ (the 1 marks the 1st row), $x_{l,2} = (0, 1, 0, \dots, 0)$ (the 1 marks the 2nd column),
the mode-vectors $x_{l,j}$ may be defined arbitrarily, e.g. to take into account more complex conditional dependencies
Christoph Freudenthaler 16 / 34 ISMLL, University of Hildesheim

Understanding induced dependencies:
Overall independence assumption: $y_l \mid x_l \sim N(\mu_l = f(x_l \mid V), 1)$
Simplest case, SVD: correlation only through identical mode indices, e.g. row or column index; rows/columns are independent from other rows/columns
[Diagram: $y_{i,j} \approx v_{1,i}^T v_{2,j} = \sum_{f=1}^{k} v_{1,i,f}\, v_{2,j,f}$]
E.g. with $p_1 = 3$ rows and $p_2 = 8$ columns: $x_{l,1} = (1, 0, 0)$, $x_{l,2} = (0, 1, 0, 0, 0, 0, 0, 0)$
$y_l = \sum_{f=1}^{k} \mu_{1,f}(x_l)\, \mu_{2,f}(x_l) + \epsilon_l = \sum_{f=1}^{k} (x_{l,1}^T v_{1,f})(x_{l,2}^T v_{2,f}) + \epsilon_l = \sum_{f=1}^{k} v_{1,1,f}\, v_{2,2,f} + \epsilon_l$
Christoph Freudenthaler 17 / 34 ISMLL, University of Hildesheim

Creating new dependencies:
Overall independence assumption: $y_l \mid x_l \sim N(\mu_l = f(x_l \mid V), 1)$
Dependencies between different rows/columns for SVD: more than one active indicator per mode, e.g. with $p_1 = 3$ rows, $p_2 = 8$ columns:
$x_{l,1} = (1, 0, 0)$, $x_{l,2} = (0, 1, 1, 0, 1, 0, 0, 0)$
[Diagram: entry $y_{i,j}$ now depends on columns j, j', j'' of $V_2$]
$f(x_l \mid V) = \sum_{f=1}^{k} (x_{l,1}^T v_{1,f})(x_{l,2}^T v_{2,f}) = \sum_{f=1}^{k} v_{1,1,f}\,(v_{2,2,f} + v_{2,3,f} + v_{2,5,f})$
Example from recommender systems: SVD++ [4]
[4] Koren, Y.: Factorization meets the neighborhood: a multifaceted collaborative filtering model, KDD 2008.
Christoph Freudenthaler 18 / 34 ISMLL, University of Hildesheim
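
A short sketch (mine) of the effect of several active indicators in one mode: the prediction couples a single row factor with a sum of column factors, as in SVD++-style models:

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, k = 3, 8, 2
V1, V2 = rng.normal(size=(p1, k)), rng.normal(size=(p2, k))

x1 = np.zeros(p1); x1[0] = 1.0                       # row 1 active
x2 = np.zeros(p2); x2[[1, 2, 4]] = 1.0               # columns 2, 3 and 5 active

f_val = np.sum((x1 @ V1) * (x2 @ V2))                # sum_f (x1^T v_1f)(x2^T v_2f)
expected = np.sum(V1[0] * (V2[1] + V2[2] + V2[4]))   # v_{1,1} . (v_{2,2}+v_{2,3}+v_{2,5})
assert np.isclose(f_val, expected)
```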

Creating new dependencies:
Overall independence assumption: $y_l \mid x_l \sim N(\mu_l = f(x_l \mid V), 1)$
Dependencies beyond rows and/or columns for SVD: $p_1 = 3$ rows and $p_2 = 8$ columns plus additional rating information:
$x_{l,1} = (1, 0, 0)$ (user indicators), $x_{l,2} = (0, 1, 0, 0, 0, 0, 0, 0,\ 0, 0, 3, 0, 4, 0, 0, 0)$ (item indicators followed by ratings for co-rated items)
[Diagram: row i of Y with the target entry $y_{i,j}$ and ratings 3 and 4 in the co-rated columns]
$f(x_l \mid V) = \sum_{f=1}^{k} (x_{l,1}^T v_{1,f})(x_{l,2}^T v_{2,f}) = \sum_{f=1}^{k} v_{1,1,f}\,(3\, v_{2,3,f} + 4\, v_{2,5,f})$
Example from recommender systems: Factorized Transition Tensors [5]
[5] Rendle, S., Freudenthaler, C., Schmidt-Thieme, L.: Factorizing personalized Markov chains for next-basket recommendation, WWW 2010.
Christoph Freudenthaler 19 / 34 ISMLL, University of Hildesheim

GFM II: Predictive Model Extension II
$f(x_l \mid V) = \mu_l = \sum_{f=1}^{k} \prod_{j=1}^{m} x_{l,j}^T v_{j,f} = \sum_{f=1}^{k} \prod_{j=1}^{m} \sum_{i_j=1}^{p_j} x_{l,j,i_j}\, v_{j,i_j,f}$
Only one predictive model per mode; to increase predictive accuracy, combine several reasonable predictive models per mode:
Introduce mode-defining selection vectors $d_j \in \{0,1\}^p$ on $x_l = (x_{l,1}, \dots, x_{l,m}) \in \mathbb{R}^p$, $j = 1,\dots,m$
One set of selection vectors $\{d_1, \dots, d_m\}$ defines one model:
$f(x_l \mid V) = \mu_l = \sum_{f=1}^{k} \prod_{j=1}^{m} \sum_{i=1}^{p} x_{l,i}\, d_{j,i}\, v_{i,f}$
Collect several selection sets $\{d_1, \dots, d_m\} \in D$:
$f_{GFM}(x_l \mid V) = \mu_l = \sum_{f=1}^{k} \sum_{\{d_1,\dots,d_m\} \in D} \prod_{j=1}^{m} \sum_{i=1}^{p} x_{l,i}\, d_{j,i}\, v_{i,f}$
[Diagram: several factor blocks $V_1$, $V_2$, $V_3$ selected from a common parameter matrix V]
Christoph Freudenthaler 20 / 34 ISMLL, University of Hildesheim

GFM II: Predictive Model Learning?
$f_{GFM}(x_l \mid V) = \sum_{i_1=1}^{p} \cdots \sum_{i_m=1}^{p} x_{l,i_1} \cdots x_{l,i_m} \sum_{D} \underbrace{d_{1,i_1} \cdots d_{m,i_m}}_{\mathrm{Interaction\ Weight\ I}} \sum_{f=1}^{k} \underbrace{v_{i_1,f} \cdots v_{i_m,f}}_{\mathrm{Interaction\ Weight\ II}}$
Infer the model selection D: Bayesian model averaging?
+ Also negative interaction effects of even order
- Redundant parameterization
- Expensive model prediction: $O(|D|\, m\, p)$
Christoph Freudenthaler 21 / 34 ISMLL, University of Hildesheim

GFM II: Predictive Model Selection
Specify D, i.e. select a specific model
Extended Matrix Factorization: m = 2 different modes, p different mode-models, $|D| = m\,p = 2p$
$D = \big\{\{(d_{1,1},\dots,d_{1,p}), (d_{2,1},\dots,d_{2,p})\} : d_{1,i_1} = 1,\ d_{2,i_2} = 1\ \forall i_2 > i_1,\ 0\ \text{else}\big\}$
ex. 1: $d_1 = (1, 0, 0, \dots, 0)$, $d_2 = (0, 1, 1, 1, \dots, 1)$
ex. 2: $d_1 = (0, 1, 0, \dots, 0)$, $d_2 = (0, 0, 1, 1, \dots, 1)$
$f(x_l \mid V) = \mu_l = \sum_{i_1=1}^{p} \sum_{i_2 > i_1}^{p} x_{l,i_1}\, x_{l,i_2} \sum_{f=1}^{k} v_{i_1,f}\, v_{i_2,f}$
Christoph Freudenthaler 22 / 34 ISMLL, University of Hildesheim
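
A compact sketch (mine) of this selected model: all pairwise interactions $x_{i_1} x_{i_2}$, $i_2 > i_1$, weighted by the dot product of the corresponding factor vectors:

```python
import numpy as np

def f_pairwise(x, V):
    """f(x | V) = sum_{i1 < i2} x_i1 x_i2 <v_i1, v_i2> for V of shape (p, k)."""
    total = 0.0
    p = len(x)
    for i1 in range(p):
        for i2 in range(i1 + 1, p):
            total += x[i1] * x[i2] * (V[i1] @ V[i2])
    return total

rng = np.random.default_rng(0)
p, k = 6, 3
V = rng.normal(size=(p, k))
x = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])   # e.g. concatenated indicator blocks

print(f_pairwise(x, V))
```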

Selecting a prior distribution:
So far: $y_l \mid x_l \sim N(\mu_l = f(x_l \mid V), 1)$; a more general $x_l$ unifies recent factorization model enhancements
Next step: selecting a prior distribution for the model parameters V, using prior knowledge:
If all tensor elements are in $\mathbb{R}$: linearly independent, real-valued latent dimensions
Each tensor element is a sum over linear functions of the same $x_l$
Different variances $\sigma_f^2$ along latent dimensions
Centered at some dimension-dependent mean $\mu_f$
Center $\mu_f$ and variance $\sigma_f^2$ may differ for different modes
Christoph Freudenthaler 23 / 34 ISMLL, University of Hildesheim

Selecting a prior distribution:
Conjugate Gaussian prior distribution: each latent k-dimensional representation $v_i \in \mathbb{R}^k$, $i = 1,\dots,p$, is an independent draw
$v_i \sim N\big((\mu_1, \dots, \mu_k), \mathrm{diag}(\sigma_1^2, \dots, \sigma_k^2)\big)$
Conjugate Gaussian hyperprior: each prior mean $\mu_f$ is considered an independent realization of
$\mu_f \sim N(\mu_0, c\,\lambda_f)$, $f = 1,\dots,k$
Conjugate Gamma hyperprior: each precision $\lambda_f = \sigma_f^{-2}$ is considered an independent realization of
$\lambda_f \sim G(\alpha_0, \beta_0)$, $f = 1,\dots,k$
[Diagram: graphical model with hyperparameters $\mu_0, c, \alpha_0, \beta_0$ at the top, per-dimension $\mu_f, \lambda_f$ in the middle, latent vectors $v_i$, $i = 1,\dots,p$, and observations $y_l$, $l = 1,\dots,n$, at the bottom]
Christoph Freudenthaler 24 / 34 ISMLL, University of Hildesheim
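
A minimal generative sketch of this hierarchy (my reading, with assumed hyper-parameter values; the exact parameterization of the mean's hyperprior may differ from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 10, 4                                     # p latent vectors, k latent dimensions
mu_0, c, alpha_0, beta_0 = 0.0, 1.0, 1.0, 1.0    # assumed hyper-parameters

lam = rng.gamma(shape=alpha_0, scale=1.0 / beta_0, size=k)   # precisions lambda_f
sigma2 = 1.0 / lam                                           # variances sigma_f^2
mu = rng.normal(loc=mu_0, scale=np.sqrt(c * sigma2))         # prior means mu_f
# (the mean's variance is taken proportional to sigma_f^2 here; an assumption)

V = rng.normal(loc=mu, scale=np.sqrt(sigma2), size=(p, k))   # v_i ~ N(mu, diag(sigma^2))
print(V.shape, V.mean(axis=0))
```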

Learning Generalized Factorization Models
Stochastic Gradient Descent: scalable, simple
Alternating Least Squares
Bayesian analysis, i.e. standard Gibbs sampling: easy to derive for multi-linear models like
$f(x_l \mid V) = \sum_{i_1=1}^{p} \sum_{i_2 > i_1}^{p} x_{l,i_1}\, x_{l,i_2} \sum_{f=1}^{k} v_{i_1,f}\, v_{i_2,f}$
Learning on large datasets: block size = 1, $O(N_z\, k)$
Christoph Freudenthaler 25 / 34 ISMLL, University of Hildesheim
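
A short stochastic gradient descent sketch (mine, with assumed step size and L2 regularization) for the observed-entries matrix factorization case, updating one (row, column, value) triple at a time, which is what makes the method scale to large data:

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, k = 50, 40, 5
V1 = 0.1 * rng.normal(size=(p1, k))
V2 = 0.1 * rng.normal(size=(p2, k))

# toy observations: (row index, column index, value)
obs = [(rng.integers(p1), rng.integers(p2), rng.normal()) for _ in range(1000)]
lr, reg = 0.05, 0.01                     # assumed step size and L2 penalty

for epoch in range(20):
    rng.shuffle(obs)
    for i, j, y in obs:
        err = y - V1[i] @ V2[j]          # residual for this single cell
        v1_old = V1[i].copy()
        V1[i] += lr * (err * V2[j] - reg * V1[i])
        V2[j] += lr * (err * v1_old - reg * V2[j])
```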

Outline
Generalized Factorization Model
Relations to standard Models
Empirical Evaluation
Conclusion and Discussion
Christoph Freudenthaler 26 / 34 ISMLL, University of Hildesheim

Polynomial Regression
$f(x_l \mid \beta) = \sum_{i_1=1}^{p} x_{l,i_1} \beta_{i_1} + \sum_{i_1=1}^{p} \sum_{i_2 \ge i_1}^{p} x_{l,i_1} x_{l,i_2} \beta_{i_1,i_2} + \dots + \sum_{i_1=1}^{p} \cdots \sum_{i_o \ge i_{o-1}}^{p} x_{l,i_1} \cdots x_{l,i_o} \beta_{i_1,\dots,i_o}$
Order-o polynomial regression model
$f_{GFM}(x_l \mid V) = \sum_{f=1}^{k} \sum_{\{d_1,\dots,d_m\} \in D} \prod_{j=1}^{m} \sum_{i=1}^{p} x_{l,i}\, d_{j,i}\, v_{i,f}$
Generalized Factorization Model of an m-mode tensor
Define one of the p predictor variables as constant, e.g. $x_{l,1} = 1\ \forall l = 1,\dots,n$, and fix the corresponding parameter vector to $v_1 = (1,\dots,1)$ of length k
Selecting D accordingly gives
$f_{GFM}(x_l \mid V) = \sum_{i_1=1}^{p} x_{l,i_1} \underbrace{\sum_{f=1}^{k} v_{i_1,f}}_{\beta_{i_1}} + \sum_{i_1=1}^{p} \sum_{i_2 \ge i_1}^{p} x_{l,i_1} x_{l,i_2} \underbrace{\sum_{f=1}^{k} v_{i_1,f}\, v_{i_2,f}}_{\beta_{i_1,i_2}} + \dots + \sum_{i_1=1}^{p} \cdots \sum_{i_m \ge i_{m-1}}^{p} x_{l,i_1} \cdots x_{l,i_m} \underbrace{\sum_{f=1}^{k} v_{i_1,f} \cdots v_{i_m,f}}_{\beta_{i_1,\dots,i_m}}$
Christoph Freudenthaler 26 / 34 ISMLL, University of Hildesheim
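
A small numerical check (mine, restricted to pairwise interactions between distinct predictors) of this correspondence: with a constant predictor and its parameter vector fixed to ones, the factorized pairwise model reproduces polynomial-regression coefficients $\beta_i = \sum_f v_{i,f}$ and $\beta_{i_1,i_2} = \langle v_{i_1}, v_{i_2} \rangle$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 5, 3
V = rng.normal(size=(p, k))
V[0] = 1.0                                   # parameter vector of the constant predictor

x = rng.normal(size=p)
x[0] = 1.0                                   # constant predictor variable

# GFM / factorized form: sum over all pairs i1 < i2 of x_i1 x_i2 <v_i1, v_i2>
gfm = sum(x[i1] * x[i2] * (V[i1] @ V[i2])
          for i1 in range(p) for i2 in range(i1 + 1, p))

# Polynomial regression form over the non-constant predictors:
# linear coefficients beta_i = <v_1, v_i> = sum_f v_{i,f}, pairwise beta_{i1,i2} = <v_i1, v_i2>
beta = {i: V[i].sum() for i in range(1, p)}
beta2 = {(i1, i2): V[i1] @ V[i2] for i1 in range(1, p) for i2 in range(i1 + 1, p)}
poly = (sum(x[i] * beta[i] for i in range(1, p))
        + sum(x[i1] * x[i2] * beta2[(i1, i2)] for (i1, i2) in beta2))

assert np.isclose(gfm, poly)
```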

Polynomial Regression vs. GFM
Generalized Factorization Models include polynomial regression
for factorized parameters, e.g. $\beta_{i_1,i_2} = \sum_{f=1}^{k} v_{i_1,f}\, v_{i_2,f}$,
if the number of modes m equals the order o
Christoph Freudenthaler 27 / 34 ISMLL, University of Hildesheim

Factorization Machines [6] vs. GFM
Very similar to the previous factorized polynomial regression model:
$f_{GFM}(x_l \mid V) = \sum_{i_1=1}^{p} x_{l,i_1} \underbrace{\sum_{f=1}^{k} v_{i_1,f}}_{\beta_{i_1}} + \sum_{i_1=1}^{p} \sum_{i_2 \ge i_1}^{p} x_{l,i_1} x_{l,i_2} \underbrace{\sum_{f=1}^{k} v_{i_1,f}\, v_{i_2,f}}_{\beta_{i_1,i_2}} + \dots$
$f_{FM}(x_l \mid V) = \sum_{i_1=1}^{p} x_{l,i_1} \beta_{i_1} + \sum_{i_1=1}^{p} \sum_{i_2 \ge i_1}^{p} x_{l,i_1} x_{l,i_2} \underbrace{\sum_{f=1}^{k} v_{i_1,f}\, v_{i_2,f}}_{\beta_{i_1,i_2}} + \dots$
Factorization Machines using simple Gibbs sampling have been successfully applied in several challenges
[6] Rendle, S.: Factorization Machines. ICDM 2010.
Christoph Freudenthaler 28 / 34 ISMLL, University of Hildesheim
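
A sketch (mine) of the degree-2 FM prediction, computing the pairwise term both naively and with the O(kp) reformulation from Rendle's Factorization Machines paper [6]; the reformulation is standard for FMs but not shown on these slides, and strict pairs $i_2 > i_1$ are assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 8, 4
x = rng.normal(size=p)
w0, w = rng.normal(), rng.normal(size=p)     # global bias and linear weights beta_i
V = rng.normal(size=(p, k))                  # factorized pairwise weights

# Naive: sum over distinct pairs of x_i1 x_i2 <v_i1, v_i2>
pair_naive = sum(x[i] * x[j] * (V[i] @ V[j])
                 for i in range(p) for j in range(i + 1, p))

# Linear-time: 0.5 * sum_f [ (sum_i x_i v_if)^2 - sum_i x_i^2 v_if^2 ]
s = V.T @ x                                  # per-dimension sums
pair_fast = 0.5 * np.sum(s ** 2 - (V.T ** 2) @ (x ** 2))

assert np.isclose(pair_naive, pair_fast)
y_hat = w0 + w @ x + pair_fast               # full degree-2 FM prediction
```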

Neural Networks vs. GFM
[Diagram: a feed-forward network with input nodes $x_{l,1}, \dots, x_{l,4}$, two hidden nodes computing $f(x_l, v_1)$ and $f(x_l, v_2)$ via weights $v_{i,f}$, and output $y_l$ with weights $w_1, w_2$]
Input: predictor vector $x_l \in \mathbb{R}^p$, $l = 1,\dots,n$ (here p = 4)
Hidden: hidden nodes correspond to latent dimensions, i.e. k nodes (here k = 2)
Activation function: $f(x_l, v_f) := \sum_{\{d_1,\dots,d_m\} \in D} \prod_{j=1}^{m} \sum_{i=1}^{p} x_{l,i}\, d_{j,i}\, v_{i,f}$
Output: weights $w_f = 1$
Christoph Freudenthaler 29 / 34 ISMLL, University of Hildesheim
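
A toy sketch (mine) of this network view: k hidden units with a product-of-linear-scores activation and output weights fixed to 1 recover the GFM prediction for a two-mode example:

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, k = 3, 4, 2
V1, V2 = rng.normal(size=(p1, k)), rng.normal(size=(p2, k))

x1 = np.zeros(p1); x1[1] = 1.0           # indicator block for mode 1
x2 = np.zeros(p2); x2[2] = 1.0           # indicator block for mode 2

hidden = (x1 @ V1) * (x2 @ V2)           # one product activation per latent dimension f
w = np.ones(k)                           # output weights fixed to 1
y_hat = w @ hidden                       # equals sum_f (x1^T v_1f)(x2^T v_2f)

assert np.isclose(y_hat, V1[1] @ V2[2])
```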

Outline
Generalized Factorization Model
Relations to standard Models
Empirical Evaluation
Conclusion and Discussion
Christoph Freudenthaler 30 / 34 ISMLL, University of Hildesheim

Empirical Evaluation on Recommender System Data
Task: movie rating prediction
Data [7]: 100M ratings of 480k users on 18k items (Netflix)
[Plot: RMSE on Netflix (roughly 0.88 to 0.92) versus the number of latent dimensions (0 to 150) for SGD KNN256, SGD KNN256++, BFM KNN256, BFM KNN256++, SGD PMF, BPMF, BFM (u,i), BFM (u,i,t,f), BFM (u,i,ut,it,if), and BFM Ensemble]
Christoph Freudenthaler 30 / 34 ISMLL, University of Hildesheim

Bayesian Factorization Machines - Kaggle
Task: student performance prediction on GMAT (Graduate Management Admission Test), SAT and ACT (college admission tests)
118 contestants
Data [8]
[8] http://www.kaggle.com/c/whatdoyouknow
Christoph Freudenthaler 31 / 34 ISMLL, University of Hildesheim

Bayesian Factorization Machines - KDDCup 11
Task: music rating prediction
1000 contestants
Data [9]: 250M ratings of 1M users on 625k items (songs, tracks, albums or artists)
[9] http://kddcup.yahoo.com/
Christoph Freudenthaler 32 / 34 ISMLL, University of Hildesheim

Outline
Generalized Factorization Model
Relations to standard Models
Empirical Evaluation
Conclusion and Discussion
Christoph Freudenthaler 33 / 34 ISMLL, University of Hildesheim

In a nutshell:
GFM, and thus matrix and tensor factorization, are types of regression models
Generalized Factorization Models relate to
polynomial regression for factorized parameters,
feed-forward neural networks for a given activation function
Factorization Machines with simple Gibbs sampling show decent predictive performance
Open questions:
Bayesian learning for models with non-linear functions of V?
Bayesian model averaging for GFM?
Efficient Bayesian inference for non-Gaussian likelihoods?
Christoph Freudenthaler 33 / 34 ISMLL, University of Hildesheim

Christoph Freudenthaler 34 / 34 ISMLL, University of Hildesheim