Deep Matrix Factorization for Recommendation


MSc Artificial Intelligence
Track: Natural Language Processing and Learning

Master's Thesis

Deep Matrix Factorization for Recommendation

by Mart van Baalen
September 30, 2016

42 EC

Supervisors: Prof. Dr. M. Welling, Thomas Kipf, MSc
Assessors: Dr. P. H. Rodenburg, Dr. E. Kanoulas


Abstract

Matrix factorization (MF) based methods have proven hugely successful in modern recommendation systems. MF methods learn latent representations of users and items that, when combined in a dot product, produce an approximation of the rating that a user would give to an item. Recent forays into deep learning based MF methods have shown interesting results. In this thesis we expand upon these methods. We explore four different but related avenues in autoencoding research: 1) we introduce the Matrix Factorizing Autoencoder, which encodes sparse user and item rating vectors into latent user and item factors; 2) the Matrix Factorizing Variational Autoencoder, which learns variational posterior distributions over latent user and item factors given their respective sparse rating vectors; 3) the Matrix Factorizing Graph Autoencoder, which encodes the graph structure formed by a rating dataset into latent user and item factors; and 4) the Matrix Factorizing Variational Graph Autoencoder framework, which learns variational posterior distributions over latent user and item factors by encoding the graph structure of the rating data. We run a number of experiments with these models on three Movielens datasets, the Netflix dataset and two proprietary datasets that contain user clicks on a Dutch news website. We reach competitive results with some of these models. More importantly, we show that these autoencoder frameworks, especially the Graph Autoencoder framework, are well suited to the problem of rating prediction and might reach state-of-the-art results in future work.


List of Abbreviations

CF    Collaborative Filtering
MF    Matrix Factorization
AEVB  Auto-Encoding Variational Bayes
GCN   Graph Convolutional (Neural) Network

List of Symbols

$r_{ij}$           Rating from user i for item j
$\mathcal{D}$      A dataset of observed ratings containing triples $\{(i, j, r_{ij})\}$
$\mathbf{R}$       Sparse matrix containing ratings, $R_{ij} = r_{ij}$ if $(i, j, r_{ij}) \in \mathcal{D}$
$\mathbf{r}_i$     Sparse vector of observed ratings from user i
$\mathbf{r}_j$     Sparse vector of observed ratings for item j
$\hat{r}_{ij}$     Predicted rating from user i for item j
$\mathbf{u}_i$     Latent factor for user i
$\mathbf{v}_j$     Latent factor for item j
$\mathbf{U}$       Matrix of latent user factors
$\mathbf{V}$       Matrix of latent item factors
$N_u$              Number of users
$N_v$              Number of items
$N_r$              Number of ratings
$g_\phi(\cdot)$    Encoder function with parameters $\phi$
$f_\theta(\cdot)$  Decoder function with parameters $\theta$

We use lowercase boldface characters (e.g. $\mathbf{v}$) to indicate vectors. We treat all vectors, including vectors that correspond to rows of matrices, as column vectors, unless explicitly stated otherwise. We use uppercase boldface characters (e.g. $\mathbf{M}$) to indicate matrices. The only exception to this rule is that, in the discussion of variational Bayesian methods, we use $\mathbf{Z}$ to denote the set of latent variables and parameters, all of which are considered stochastic variables. In general, unless otherwise noted, we use $i$ to indicate a user, and $j$ to indicate an item.

Acknowledgements

I would like to thank my supervisors, Max Welling and Thomas Kipf, for their knowledge, guidance and encouragement. Without their support this thesis would never have been possible. I would also like to thank Scyfer B.V., in particular its CEO and my internship supervisor Jörgen Sandig, for allowing me to use the BigScyfer server for the many experiments I ran for this thesis, and for letting me write this thesis as part of an internship. This was invaluable in completing my thesis. On a personal note I would like to thank my parents, Jieles and Paulien, for their support throughout my academic career, and in all other aspects of my life. I would also like to thank my girlfriend Kate for her patience during the long process of completing my thesis.

Notice

This thesis was written as part of an internship at Scyfer B.V.

Contents

1 Introduction
  1.1 Contributions
  1.2 Thesis Contents
2 Background
  2.1 Recommendation Systems
  2.2 Autoencoding Variational Bayes
    2.2.1 Autoencoders
    2.2.2 Variational Bayes
    2.2.3 Autoencoding Variational Bayes
  2.3 Graph-based Recommendation Systems and Graph Convolutional Networks
    2.3.1 Graph Convolutional Nets
3 Autoencoding Variational Matrix Factorization
  3.1 Matrix Factorizing Autoencoder
    3.1.1 Supervised MFAE
    3.1.2 Unsupervised MFAE
  3.2 Matrix Factorizing Variational Autoencoder
  3.3 Matrix Factorizing Graph Autoencoder
    3.3.1 Relationship to MFAE
  3.4 Matrix Factorizing Variational Graph Autoencoder
4 Experiments, Results and Analysis
  4.1 Experimental Setup
    4.1.1 Datasets
    4.1.2 Evaluation Metrics
    4.1.3 Current State of the Art
    4.1.4 Preprocessing
    4.1.5 Connectivity
  4.2 Experiments
    4.2.1 Experiment 1: Model Performance
    4.2.2 Experiment 2: Hyperparameter Optimization
    4.2.3 Experiment 3: New Users and Items
    4.2.4 Experiment 4: Supervised MFAE
  4.3 Results
    4.3.1 Experiment 1: Model Performance
    4.3.2 Experiment 2: Hyperparameter Optimization
    4.3.3 Experiment 3: New Users and Items
    4.3.4 Experiment 4: Supervised MFAE
  4.4 Discussion
    4.4.1 Experiment 1: Model Performance
    4.4.2 Experiment 2: Hyperparameter Optimization
    4.4.3 Experiment 3: New Users and Items
    4.4.4 Experiment 4: Supervised MFAE
5 Related Work
  5.1 Sparse Matrix Factorization
  5.2 Neural Network Collaborative Filtering
  5.3 Recent Synergy
6 Conclusion
  6.1 Future work
  6.2 Concluding words
Appendices
A Derivations
  A.1 Derivation of the ELBO
B Derivatives of the SGVBs
  B.1 Derivative of KL-Divergence
  B.2 Derivative of the Log-Expectation
C Hyperparameters
D Training Times


Chapter 1

Introduction

In navigating the vast ocean of information available on the modern internet, recommendation systems are indispensable in helping users find relevant items. Recommendation systems are systems that make personalized item recommendations for users based on which items the recommendation system thinks are relevant to the user [38]. Both the size of the inventories of web services, i.e. the items that web services can consider recommending to users, and the amount of web traffic necessitate that the recommendations be automated. While in a brick-and-mortar store an employee might point a customer to interesting items based on personal experience, it is intractable to let humans recommend items in a web-scale operation. For example, Amazon has 488 million items for sale in the United States, while Netflix has thousands of titles available for streaming and millions of monthly users [12]. Navigating inventories of this size without any sense of direction is a daunting task, and companies have a large financial incentive to help users find the items they need. If a visitor to a web store is recommended an item that they are interested in but did not specifically search for, they might purchase that item as well as the item(s) for which they initially visited the website. Likewise, if a member of a subscription video streaming service is regularly unsuccessful in finding a video to watch, they might eventually end their subscription. In order to point users to useful items in their inventories, online stores like Amazon and online video streaming services such as Netflix invest heavily in recommendation systems.

In 2007 Netflix awarded a $1 million prize to a research team that improved the performance of their proprietary recommendation system by more than 10% [4], and Netflix claims that their current recommendation system is worth $1 billion per year [12].

Many successful recommendation systems treat the problem of recommending items to users as a matrix factorization (MF) problem ([1], [26], [28], [34], [35], [47]): known ratings given by users to specific items are elements in a sparse $N_u \times N_v$ rating matrix $\mathbf{R}$, where $N_u$ is the number of observed users and $N_v$ is the number of items in the system's inventory. Element $r_{ij}$ is the (observed) rating given to item $j$ by user $i$. MF recommenders decompose $\mathbf{R}$ into low-rank user and item factor matrices $\mathbf{U}$ and $\mathbf{V}$, of dimensionality $N_u \times D$ and $N_v \times D$ respectively, with $D \ll N_u, N_v$. We use $\hat{r}_{ij}$ to indicate a predicted rating given by user $i$ to item $j$. A prediction $\hat{r}_{i'j'}$ for a previously unobserved user/item pair $i'/j'$ can be made by taking the dot product of the corresponding user and item factors in $\mathbf{U}$ and $\mathbf{V}$ respectively: $\hat{r}_{i'j'} = \mathbf{u}_{i'}^T \mathbf{v}_{j'}$. However, the sparsity of $\mathbf{R}$ makes it non-trivial to find a decomposition that both fits the training data and also generalizes well to new ratings ([35], [34]).

Recent research has extended the MF framework into the realm of deep learning. For example, Strub et al. [41] use an autoencoder to learn a factorization of $\mathbf{R}$. Dziugaite and Roy [19] extend the basic MF framework by using a deep neural network to predict a rating from latent user and item factors, instead of the standard dot product. That is, instead of using the prediction function $\hat{r}_{ij} = \mathbf{u}_i^T \mathbf{v}_j$, the authors use a deep neural network to predict a rating from $\mathbf{u}_i$ and $\mathbf{v}_j$:¹

$$\hat{r}_{ij} = \mathrm{NN}(\mathbf{u}_i, \mathbf{v}_j)$$

where $\mathrm{NN}(\cdot, \cdot)$ indicates a neural network that takes as input a user factor and an item factor.

¹ This is a simplified version of the system described in [19] that functions as an example. For a description of the actual system we refer the reader to the original article [19].
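To make the contrast concrete, the following minimal numpy sketch computes both prediction functions for a single user/item pair. All sizes, weights and factor values are illustrative assumptions, and the small MLP merely stands in for the $\mathrm{NN}(\cdot, \cdot)$ above; it is not the architecture of [19].

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                          # latent dimensionality (illustrative)
u_i = rng.normal(size=D)       # latent user factor
v_j = rng.normal(size=D)       # latent item factor

# Classic MF prediction: the dot product of the two factors.
r_hat_dot = u_i @ v_j

# Deep MF prediction: a small MLP on the concatenated factors, standing in
# for NN(u_i, v_j). Weights are random here; in practice they are learned.
W1 = rng.normal(size=(2 * D, 16)) * 0.1
b1 = np.zeros(16)
W2 = rng.normal(size=(16, 1)) * 0.1
b2 = np.zeros(1)

h = np.tanh(np.concatenate([u_i, v_j]) @ W1 + b1)
r_hat_nn = (h @ W2 + b2).item()
```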

1.1 Contributions

In this thesis we extend the previously mentioned research by applying new advances in deep learning. Our extensions of current methods lead to the following three contributions:

1. We build an autoencoder-based matrix factorization model that includes encoders for the sparse rating vectors of both the users and the items. This is in contrast to Strub et al. [41], who only encode either the user or the item rating vectors. This is also in contrast to Dziugaite and Roy [19], who use gradient descent to find the latent user and item factors and thus have no explicit encoding function for the rating vectors. The benefit of using encoders for both user and item rating vectors is that a mapping from user and item ratings to latent user and item factors is learned. This can be used to add new users or items to the system as soon as a small number of ratings for these users/items is available.

2. We apply the Autoencoding Variational Bayes (AEVB) algorithm [21] to the problem of matrix factorization. This bridges the gap between the autoencoding formulation of matrix factorization introduced in point 1 and the probabilistic formulation of matrix factorization introduced in e.g. [28], [1], [34].

3. We apply very recent research on Graph Convolutional Networks by Kipf and Welling [22]. This research employs a novel approximation to previously difficult-to-compute convolutions over graph-structured data. Kipf and Welling build an autoencoder that encodes both item features and the graph structure. We apply this method to encode the graph structure of the rating data in an autoencoding setting, similar to the other two methods. We apply this method both to the regular autoencoder and within the AEVB framework.

1.2 Thesis Contents

The remainder of this thesis is organized as follows. Chapter 2 introduces core concepts on which this research builds in more depth. Chapter 3 introduces the Matrix Factorizing Autoencoder, the Matrix Factorizing Variational Autoencoder, the Matrix Factorizing Graph Autoencoder and the Matrix Factorizing Variational Graph Autoencoder algorithms. Chapter 4 describes a number of experiments run on real-life rating and click data and presents and analyzes the results. Chapter 5 places this research in the broader context of matrix factorization for recommendation and describes similarities and differences between this and other recent work. Chapter 6 gives a summary of the work presented in this thesis, leads for future work, and concludes this thesis.

Chapter 2

Background

This chapter introduces important concepts that the original research in this thesis builds upon.

2.1 Recommendation Systems

Many recommendation systems utilize a user's history of explicit or implicit preference indications to select items that might be relevant to the user. Preference indications can be explicit, as is the case when a user rates items, or implicit, as is the case when a recommendation system has access to a user's click behavior and uses that to infer preference. An example of the former is Netflix, where users can give a rating of 1 to 5 stars to videos they have watched. An example of the latter is Amazon, which exploits observed user behavior, such as past search terms, item clicks and time spent on a clicked item, as well as purchase history, to infer which items were interesting to a user and which were not.

Recommendation systems have traditionally been categorized into Content-based Filtering and Collaborative Filtering systems. Content-based filtering systems combine the history of a user's preference indications with features of the items, such as item categories or text in the item description, to find new items that are similar to the items the user has shown a preference for in the past. Collaborative filtering (CF) systems use the preference indications given by all users to find items that are relevant to a target user.

The term filtering stems from the idea that recommendation systems help filter relevant items in an inventory. The term collaborative indicates that all users collaborate in finding interesting items. Modern CF systems learn some internal representation of users and items and combine user and item representations to predict relevance. In recent years the distinction between CF and content-based filtering has become less strict, as many CF systems incorporate side information about users or items (e.g. the age of a user, the genre of a movie) to mitigate the cold-start problem. However, the term Collaborative Filtering is still often used to describe systems in which the preference indication history of all users is utilized to recommend items.

A drawback of basing recommendations on previously observed preference indications is that no recommendations can be made for users who have not interacted with any items, e.g. new users. This problem is referred to as the cold-start problem. Note that this is a bigger problem in CF than in content-based filtering: since content-based filtering depends on item features and user preference indications, it is not necessary for an item to be rated before it can be recommended. CF, however, depends on the preference indications of other users for items, which implies that an item will not be recommended if it has no observed preference indications.

While there are many possible ways to use past interactions to predict relevant items, in this thesis we will focus on the problem of using CF for rating prediction, using past ratings to predict new ratings. By supposition, higher predicted ratings correspond to more relevant items. Learning-to-rank recommendation systems that predict a ranking of relevance to a user for all items do exist (e.g., [2], [46]), but are not discussed in this thesis.

Collaborative Filtering and Matrix Factorization

As stated in the introduction, many successful recommendation systems approach the problem of rating prediction as a problem of matrix factorization (for example: [42], [1], [23], [25]). A set of ratings $r_{ij}$ of a user $i$ for an item $j$ can be interpreted as elements in a partially observed $N_u \times N_v$ matrix $\mathbf{R}$.

[Figure 2.1: Illustration of the sparse rating matrix $\mathbf{R}$. To make the illustration less overwhelming only the ratings for user $i$ and item $j$ are shown. Note that user $i$ has not rated item $j$. $\mathbf{r}_i$ indicates the sparse vector of observed ratings for user $i$, i.e. row $i$ in matrix $\mathbf{R}$; $\mathbf{r}_j$ indicates the sparse vector of observed ratings for item $j$, i.e. column $j$ in matrix $\mathbf{R}$.]

Element $R_{ij}$ of matrix $\mathbf{R}$ corresponds to rating $r_{ij}$ of user $i$ for item $j$. Figure 2.1 gives a toy example of a sparse rating matrix. The recurring theme in the family of MF methods is the decomposition of the sparse rating matrix $\mathbf{R}$ into user and item matrix factors $\mathbf{U}$ and $\mathbf{V}$ of dimensionality $N_u \times D$ and $N_v \times D$ respectively, with $D \ll N_u, N_v$, such that $\mathbf{U}\mathbf{V}^T$ most closely reconstructs the previously observed ratings, while still generalizing well to new ratings. A new rating $\hat{r}_{ij}$ for a previously unobserved user/item pair $i, j$ is predicted as the dot product of the latent user factor $\mathbf{u}_i$ and the latent item factor $\mathbf{v}_j$:

$$\hat{r}_{ij} = \mathbf{u}_i^T \mathbf{v}_j \tag{2.1}$$

Figure 2.2 gives a schematic view of matrix factorization.

If $\mathbf{R}$ were fully observed one could use Singular Value Decomposition (SVD) to find low-rank matrix factors $\mathbf{U}$ and $\mathbf{V}$. However, $\mathbf{R}$ is only partially observed. A naive approach in which zeros are used as placeholders for the unobserved values will cause the SVD algorithm to fit the zeros and predict (a value close to) 0 for unobserved ratings. Modifying SVD to optimize the factors $\mathbf{U}$ and $\mathbf{V}$ w.r.t. the sum-squared distance between the predicted and observed ratings for only the observed ratings leads to a non-convex optimization problem [39].

[Figure 2.2: Schematic overview of the factorization of matrix $\mathbf{R}$ into user and item matrix factors $\mathbf{U}$ and $\mathbf{V}^T$. User $i$ is represented by the latent user factor $\mathbf{u}_i$, i.e. row $i$ in $\mathbf{U}$; item $j$ is represented by the latent item factor $\mathbf{v}_j$, i.e. column $j$ in $\mathbf{V}^T$. The unobserved rating $\hat{r}_{ij}$, indicated by the purple square in $\mathbf{R}$, is predicted from the dot product of $\mathbf{u}_i$ and $\mathbf{v}_j$. Adapted from [45].]

Probabilistic Matrix Factorization

An early successful approach to matrix factorization is Probabilistic Matrix Factorization (PMF, [35]). We will describe this method in some detail to illustrate matrix factorization for recommendation. The PMF approach assigns a 0-mean Gaussian prior distribution with spherical variance $\sigma_f^2 \mathbf{I}$ to the rows of the matrix factors $\mathbf{U}$ and $\mathbf{V}$, and assumes a rating $r_{ij}$ is a normally distributed variable with mean $\mathbf{u}_i^T \mathbf{v}_j$ and some fixed variance $\sigma^2$. The full data likelihood of this model is:

$$\begin{aligned}
\log p(\mathcal{D}, \mathbf{U}, \mathbf{V}) &= \log p(\mathbf{U}) p(\mathbf{V}) p(\mathcal{D} \mid \mathbf{U}, \mathbf{V}) \\
&= \sum_i \log p(\mathbf{u}_i) + \sum_j \log p(\mathbf{v}_j) + \sum_{(i,j,r_{ij}) \in \mathcal{D}} \log p(r_{ij} \mid \mathbf{u}_i, \mathbf{v}_j) \\
&= \sum_i \log \mathcal{N}(\mathbf{u}_i \mid \mathbf{0}, \sigma_f^2 \mathbf{I}) + \sum_j \log \mathcal{N}(\mathbf{v}_j \mid \mathbf{0}, \sigma_f^2 \mathbf{I}) + \sum_{(i,j,r_{ij}) \in \mathcal{D}} \log \mathcal{N}(r_{ij} \mid \mathbf{u}_i^T \mathbf{v}_j, \sigma^2)
\end{aligned} \tag{2.2}$$

where $\sigma_f^2$ is the variance of the latent factors and $\sigma^2$ is the variance of the ratings. If we assume $\sigma_f^2$ and $\sigma^2$ are fixed, then optimizing the negative log-likelihood is equivalent to optimizing the regularized squared error on the data:

$$\operatorname*{argmin}_{\mathbf{U}, \mathbf{V}} \big({-\log p(\mathcal{D}, \mathbf{U}, \mathbf{V})}\big) = \operatorname*{argmin}_{\mathbf{U}, \mathbf{V}} \sum_{(i,j,r_{ij}) \in \mathcal{D}} (r_{ij} - \mathbf{u}_i^T \mathbf{v}_j)^2 + \lambda \big( \|\mathbf{U}\|_2^2 + \|\mathbf{V}\|_2^2 \big) \tag{2.3}$$

where we have defined $\lambda = \sigma^2 / \sigma_f^2$, and $\|\mathbf{A}\|_2^2 = \sum_{ij} A_{ij}^2$ is the squared Frobenius norm [35]. The regularized squared error of eq. (2.3) can easily be optimized by gradient-based optimization methods [35].

Note that it is not necessarily straightforward to add new users or items to $\mathbf{U}$ or $\mathbf{V}$. For example, if ratings for a new user $i'$ are observed, we could optimize their latent factor $\mathbf{u}_{i'}$ by finding

$$\operatorname*{argmin}_{\mathbf{u}_{i'}} \sum_{j: r_{i'j} \in \mathcal{D}} (r_{i'j} - \mathbf{u}_{i'}^T \mathbf{v}_j)^2 + \lambda \|\mathbf{u}_{i'}\|_2^2 \tag{2.4}$$

where $\|\mathbf{u}_{i'}\|_2^2$ indicates the squared vector norm of $\mathbf{u}_{i'}$. This objective is convex in $\mathbf{u}_{i'}$ and can be solved using regularized least squares. However, this solution would (presumably) have an effect on the error surface of eq. (2.3) w.r.t. the rows in $\mathbf{V}$ that correspond to items rated by user $i'$. Optimizing $\mathbf{V}$ will then create a second-order effect on the optimal value of $\mathbf{U}$. This means that gradient descent updates w.r.t. the full matrices $\mathbf{U}$ and $\mathbf{V}$ have to be done. In other words, one or more potentially expensive iterations of gradient descent have to be performed to update the factors.

A variational Bayesian algorithm for matrix factorization was introduced by Lim and Teh [28]. This algorithm is described at a high level in Section 2.2.2. Many other algorithms based on a probabilistic interpretation of matrix factorization that are not directly relevant to this thesis have been developed. A selection of these algorithms is discussed in Chapter 5.
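As an illustration of eq. (2.3), the following numpy sketch (not code from [35]) evaluates the regularized squared error and its gradients on a toy set of rating triples and takes a single full-batch gradient descent step; all sizes, values and the learning rate are illustrative assumptions.

```python
import numpy as np

def pmf_loss_and_grads(ratings, U, V, lam):
    """Regularized squared error of eq. (2.3) and its gradients.

    ratings: list of (i, j, r_ij) triples; U: N_u x D; V: N_v x D.
    """
    dU, dV = lam * 2 * U, lam * 2 * V   # gradient of lam * (||U||^2 + ||V||^2)
    loss = lam * ((U ** 2).sum() + (V ** 2).sum())
    for i, j, r in ratings:
        err = U[i] @ V[j] - r           # prediction error on one observed rating
        loss += err ** 2
        dU[i] += 2 * err * V[j]
        dV[j] += 2 * err * U[i]
    return loss, dU, dV

# One step of (full-batch) gradient descent on a toy problem:
rng = np.random.default_rng(0)
N_u, N_v, D, lr = 4, 5, 3, 0.01
U = 0.1 * rng.normal(size=(N_u, D))
V = 0.1 * rng.normal(size=(N_v, D))
data = [(0, 1, 4.0), (0, 3, 2.0), (2, 1, 5.0)]
loss, dU, dV = pmf_loss_and_grads(data, U, V, lam=0.1)
U -= lr * dU
V -= lr * dV
```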

New Users in the Matrix Factorization Paradigm

We explicate some concepts that are relevant to the remainder of this thesis:

1. We use the term Known Users or Known Items to denote users or items that were seen during training.

2. We use the term New Users or New Items to denote users or items that were not seen during training. Note that this implies that, when new users and items appear after training, new users can rate new items.

Figure 2.3 shows how known users and items and new users and items, as well as their ratings, relate to each other. Also note that, in this terminology, new users differ from cold-start users in that we assume they have already rated a number of items. Similarly, new items differ from cold-start items in that we assume they have already been rated by a number of users. This distinction will be important later on in this thesis.

2.2 Autoencoding Variational Bayes

This section describes the Autoencoding Variational Bayes (AEVB) algorithm introduced by Kingma and Welling [21]. This section is more extensive than the previous sections in this chapter, as the AEVB algorithm requires some background in both autoencoders and variational Bayesian (VB) methods. In this section we will first briefly describe autoencoders and basic variational Bayesian methods, and then use these descriptions to introduce the Autoencoding Variational Bayes algorithm of Kingma and Welling [21].

2.2.1 Autoencoders

Autoencoders are unsupervised learning methods that are used to learn latent representations of input data [3]. Autoencoders consist of two parts:

1. An encoder function $g_\phi: \mathbb{R}^N \to \mathbb{R}^M$, parametrized by a set of parameters $\phi$

[Figure 2.3: Illustration of the difference between ratings from new users for known items or from known users on new items on the one hand, and ratings from new users for new items on the other hand. In this figure, submatrix $\mathbf{R}$ contains the ratings used for training. The bottom left submatrix contains ratings from new users on known items. The top right submatrix contains ratings from known users on new items. The bottom right submatrix contains ratings from new users on new items.]

2. A decoder function $f_\theta: \mathbb{R}^M \to \mathbb{R}^N$, parametrized by a set of parameters $\theta$

The encoder encodes an input vector $\mathbf{x} \in \mathbb{R}^N$ (e.g. an image vector) into a latent representation $\mathbf{z} \in \mathbb{R}^M$, while the decoder decodes a latent vector $\mathbf{z} \in \mathbb{R}^M$ and attempts to reconstruct the original input vector $\mathbf{x} \in \mathbb{R}^N$. For some dataset $\mathcal{D}$ an autoencoder is trained with the following objective:

$$\operatorname*{argmin}_{\phi, \theta} \sum_{\mathbf{x} \in \mathcal{D}} \mathrm{Err}\big[\mathbf{x}, f_\theta(g_\phi(\mathbf{x}))\big] + \mathrm{Reg}(\phi, \theta) \tag{2.5}$$

where $\mathrm{Err}\big[\mathbf{x}, f_\theta(g_\phi(\mathbf{x}))\big]$ denotes the reconstruction error for an input $\mathbf{x}$. The reconstruction error is a measure of how much the original input to an encoder differs from the output predicted by the autoencoder. $\mathrm{Reg}(\phi, \theta)$ is the regularization penalty for the parameters $\phi$ and $\theta$.

Autoencoders are usually used in Deep Learning. In this context the encoder $g_\phi(\cdot)$ and the decoder $f_\theta(\cdot)$ are both Neural Networks (e.g. simple Multilayer Perceptrons) with parameters $\phi$ and $\theta$, respectively. The objective function of eq. (2.5) is then minimized with respect to $\phi$ and $\theta$ using gradient-based optimization methods, such as Stochastic Gradient Descent (SGD).

Hybrid CF with Autoencoders

Relevant to the work in this thesis is recent work by Strub et al. [41].¹ Strub et al. approach the MF problem as an autoencoder problem. Their Collaborative Filtering Network (CFN) model is a very basic autoencoder that encodes either the rows (sparse user rating vectors) or columns (sparse item rating vectors) of $\mathbf{R}$ into a latent space of dimensionality $D$. Their decoder then predicts a dense rating vector, i.e. including predictions for unobserved ratings. In the following discussion we will describe their U-CFN model, which encodes the sparse vectors of user ratings $\mathbf{r}_i$, i.e. the rows of $\mathbf{R}$, with the understanding that their V-CFN model, which encodes the sparse item rating vectors $\mathbf{r}_j$, i.e. the columns of $\mathbf{R}$, mirrors the U-CFN model.

¹ Research for this thesis had begun before [41] was released. The work in this thesis is however different from the approach taken by Strub et al.

Their architecture has the form:

$$\hat{\mathbf{r}}_i = \mathbf{W}^{(2)} \sigma(\mathbf{W}^{(1)} \mathbf{r}_i + \mathbf{b}^{(1)}) + \mathbf{b}^{(2)} \tag{2.6}$$

where $\hat{\mathbf{r}}_i$ indicates the reconstructed (dense) user rating vector; $\mathbf{W}^{(1)}, \mathbf{b}^{(1)}$ indicate the $D \times N_v$-dimensional weight matrix and the $D$-dimensional bias vector of the encoder function; $\mathbf{W}^{(2)}, \mathbf{b}^{(2)}$ indicate the $N_v \times D$-dimensional weight matrix and the $N_v$-dimensional bias vector of the decoder; and $\sigma(\cdot)$ indicates an arbitrary activation function. The encoder and decoder thus have the following form:

$$\begin{aligned}
\text{encoder}: \quad & g_\phi(\mathbf{r}_i) = \sigma(\mathbf{W}^{(1)} \mathbf{r}_i + \mathbf{b}^{(1)}) \\
\text{decoder}: \quad & f_\theta(g_\phi(\mathbf{r}_i)) = \mathbf{W}^{(2)} g_\phi(\mathbf{r}_i) + \mathbf{b}^{(2)}
\end{aligned}$$

The encoder $g_\phi(\cdot)$ maps the sparse rating vector to a latent representation. The decoder $f_\theta(\cdot)$ reconstructs the dense rating vector from the latent representation.

The authors note that there is a strong similarity between their autoencoding formulation of matrix factorization and classic matrix factorization methods. To make the link more apparent we shall refer to the output vector of the encoder function $g_\phi$ with input $\mathbf{r}_i$ as $\mathbf{u}_i$. Thus:

$$\mathbf{u}_i = g_\phi(\mathbf{r}_i) = \sigma(\mathbf{W}^{(1)} \mathbf{r}_i + \mathbf{b}^{(1)}) \tag{2.7}$$

The similarities become clear when focusing on a single rating $\hat{r}_{ij}$:

$$\hat{r}_{ij} = f_\theta(g_\phi(\mathbf{r}_i))_j = \mathbf{W}_j^{(2)T} \mathbf{u}_i + b_j^{(2)} \tag{2.8}$$

where we use $\mathbf{W}_j^{(2)}$ to indicate the $j$th row in $\mathbf{W}^{(2)}$ and $b_j^{(2)}$ to denote the $j$th element in $\mathbf{b}^{(2)}$. If we use $\mathbf{v}_j$ to denote $\mathbf{W}_j^{(2)}$, and $b_j$ to denote $b_j^{(2)}$, the rating prediction becomes an item-bias corrected matrix factorization prediction:

$$\hat{r}_{ij} = \mathbf{u}_i^T \mathbf{v}_j + b_j \tag{2.9}$$
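The following numpy sketch illustrates the U-CFN forward pass of eqs. (2.6)-(2.9) on a toy example; the weights are random stand-ins for learned parameters, and tanh is an arbitrary choice for the activation $\sigma(\cdot)$.

```python
import numpy as np

def ucfn_forward(r_i, W1, b1, W2, b2):
    """U-CFN forward pass of eq. (2.6): dense reconstruction of a sparse
    user rating vector. Unobserved entries of r_i are zeros."""
    u_i = np.tanh(W1 @ r_i + b1)   # encoder g_phi, eq. (2.7)
    return W2 @ u_i + b2           # decoder f_theta: dense predictions

rng = np.random.default_rng(0)
N_v, D = 6, 3
W1, b1 = 0.1 * rng.normal(size=(D, N_v)), np.zeros(D)
W2, b2 = 0.1 * rng.normal(size=(N_v, D)), np.zeros(N_v)

r_i = np.array([5.0, 0.0, 3.0, 0.0, 0.0, 1.0])  # sparse user rating vector
r_hat = ucfn_forward(r_i, W1, b1, W2, b2)
# The prediction for item j is r_hat[j] = W2[j] @ u_i + b2[j], cf. eq. (2.9).
```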

[Figure 2.4: Schematic (and simplified) representation of the U-CFN architecture. The encoder $g_\phi(\cdot)$ encodes the sparse rating vector $\mathbf{r}_i$ of user $i$ into a (dense) latent factor $\mathbf{u}_i$. The decoder $f_\theta(\cdot)$ decodes the encoded factor $\mathbf{u}_i$ into a dense rating vector reconstruction $\hat{\mathbf{r}}_i$ by multiplying the latent factor $\mathbf{u}_i$ with a learned matrix $\mathbf{W}^{(2)}$. In this case the rows of $\mathbf{W}^{(2)}$ encode the sparse item rating vectors, i.e. the columns of $\mathbf{R}$. The predicted rating $\hat{r}_{ij}$ is indicated in the rightmost column vector by the purple square. Note that in this case $\mathbf{v}_j$ is fixed after training, while $\mathbf{u}_i$ is a function of the sparse rating vector $\mathbf{r}_i$.]

To emphasize the similarity to standard MF approaches we shall refer to the weight matrix $\mathbf{W}^{(2)}$ as $\mathbf{V}$. Figure 2.4 gives a schematic representation of the U-CFN autoencoder architecture. The bias vector $\mathbf{b}^{(2)}$ is left out of this figure for simplicity.

The authors optimize their model by performing gradient descent on the regularized error on the observed ratings. The target is thus to find:

$$\operatorname*{argmin}_{\phi, \theta} \sum_i \sum_{j: r_{ij} \in \mathcal{D}} \mathrm{Err}\big(r_{ij}, f_\theta(g_\phi(\mathbf{r}_i))_j\big) + \mathrm{Reg}(\phi, \theta) \tag{2.10}$$

where $\mathrm{Err}(\cdot, \cdot)$ is the reconstruction error, $\mathrm{Reg}(\cdot)$ is a regularization penalty on the parameters, and $f_\theta(g_\phi(\mathbf{r}_i))_j$ is used to denote the $j$th element in the predicted rating vector.

An interesting aspect of this approach is that only the item factor matrix $\mathbf{W}^{(2)}$ (i.e. $\mathbf{V}$ in this example) is directly optimized. The model learns a mapping $g_\phi(\cdot)$ from user ratings to a latent user factor. Therefore, the latent user matrix $\mathbf{U}$ can be constructed by feeding $\mathbf{R}$ into the encoder function:

$$\mathbf{U} = g_\phi(\mathbf{R}^T) \tag{2.11}$$

But this also means that a new user $i'$, provided they have only rated known items, can easily be added to the system by feeding their observed rating vector $\mathbf{r}_{i'}$ into the encoder $g_\phi(\cdot)$.

Note that Sedhain et al. [37] propose a model that is very similar to the CFN models proposed by Strub et al.

2.2.2 Variational Bayes

This subsection is mostly a brief summary of the material in chapter 10 of Bishop [6], unless otherwise noted.

Variational Bayesian methods are used for finding approximate posterior distributions in complex (generally Bayesian) probabilistic models where there is no analytical solution to the true posterior. We consider models in which there are observed variables denoted by $\mathbf{X}$. We assume that the model has latent variables as well. In true Bayesian fashion we also treat the parameters of distributions as latent variables with appropriate priors. We use $\mathbf{Z}$ to denote the set of all latent variables and parameters, and $z_i$ to denote a single latent variable or parameter.

Since, by supposition, there is no analytical solution to the true posterior distribution of the latent variables, variational Bayesian methods introduce a variational distribution $Q(\mathbf{Z})$ over the latent variables and parameters. Variational methods are optimized by making the variational distribution $Q(\mathbf{Z})$ approximate the true but intractable posterior distribution $p(\mathbf{Z} \mid \mathbf{X})$ as closely as possible.

One can verify that the marginal log probability of the data, $\log p(\mathbf{X})$, can be written as:

$$\log p(\mathbf{X}) = \int Q(\mathbf{Z}) \log \frac{p(\mathbf{X}, \mathbf{Z})}{Q(\mathbf{Z})} d\mathbf{Z} - \int Q(\mathbf{Z}) \log \frac{p(\mathbf{Z} \mid \mathbf{X})}{Q(\mathbf{Z})} d\mathbf{Z} \tag{2.12}$$

We define

$$\mathcal{L}(Q) = \int Q(\mathbf{Z}) \log \frac{p(\mathbf{X}, \mathbf{Z})}{Q(\mathbf{Z})} d\mathbf{Z}, \qquad D_{KL}\big[Q(\mathbf{Z}) \,\|\, p(\mathbf{Z} \mid \mathbf{X})\big] = -\int Q(\mathbf{Z}) \log \frac{p(\mathbf{Z} \mid \mathbf{X})}{Q(\mathbf{Z})} d\mathbf{Z}$$

Thus

$$\log p(\mathbf{X}) = \mathcal{L}(Q) + D_{KL}\big[Q(\mathbf{Z}) \,\|\, p(\mathbf{Z} \mid \mathbf{X})\big] \tag{2.13}$$

where $D_{KL}$ is the Kullback-Leibler (KL) divergence and $Q(\mathbf{Z})$ is the variational distribution over the latent variables. The KL-divergence $D_{KL}[Q \,\|\, P]$ is a measure of how much a probability density function (pdf), or probability mass function (pmf) for discrete variables, $Q$ differs from a pdf (or pmf) $P$ over the same variables. The KL-divergence is always $\geq 0$, with equality iff $Q = P$. Therefore the KL-divergence can be used as a measure of how much the variational approximation $Q(\mathbf{Z})$ differs from the true posterior $p(\mathbf{Z} \mid \mathbf{X})$.

Since the KL-divergence is nonnegative and 0 iff $Q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X})$, $\mathcal{L}(Q)$ is a lower bound on $\log p(\mathbf{X})$, and is often referred to as the Evidence Lower BOund, or ELBO. The difference between $\mathcal{L}(Q)$ and $\log p(\mathbf{X})$ is $D_{KL}\big[Q(\mathbf{Z}) \,\|\, p(\mathbf{Z} \mid \mathbf{X})\big]$. This implies that maximizing $\mathcal{L}(Q)$ is equivalent to minimizing the KL-divergence between the approximate posterior and the true posterior.

The ELBO can itself be further decomposed into:

$$\mathcal{L}(Q) = \int Q(\mathbf{Z}) \log \frac{p(\mathbf{X}, \mathbf{Z})}{Q(\mathbf{Z})} d\mathbf{Z} = \mathbb{E}_Q[\log p(\mathbf{X}, \mathbf{Z})] + H[Q] \tag{2.14}$$

where $H[Q]$ indicates the entropy of $Q$. Assuming the approximate posterior $Q(\mathbf{Z})$ is parametrized by a set of parameters $\Psi$, one can perform gradient ascent on $\mathcal{L}(Q)$ w.r.t. the parameters $\Psi$ [30]:

$$\nabla_\Psi \mathcal{L}(Q) = \nabla_\Psi \mathbb{E}_Q[\log p(\mathbf{X}, \mathbf{Z})] + \nabla_\Psi H[Q] \tag{2.15}$$

However, the expectation of the log-joint $\mathbb{E}_Q[\log p(\mathbf{X}, \mathbf{Z})]$, which decomposes into a sum of terms, possibly contains expectations that are intractable to compute [30]. Let $f(z_i)$ be a function that contains all terms in $\log p(\mathbf{X}, \mathbf{Z})$ that depend on $z_i$, with $z_i$ chosen such that $\mathbb{E}_{q_\psi}[f(z_i)]$ is an intractable expectation. Let $q_\psi(z_i)$ be the approximate posterior distribution over $z_i$, parametrized by a subset of the variational parameters $\psi \subset \Psi$. Paisley et al. [30] note that $\nabla_\psi \mathbb{E}_{q_\psi}[f(z_i)] = \mathbb{E}_{q_\psi}[f(z_i) \nabla_\psi \log q_\psi(z_i)]$. The latter expectation can then be approximated by using Monte Carlo integration [30], thus solving the issue of intractability:

$$\mathbb{E}_{q_\psi}[f(z_i) \nabla_\psi \log q_\psi(z_i)] \approx \frac{1}{S} \sum_{s=1}^S f(z_i^{(s)}) \nabla_\psi \log q_\psi(z_i^{(s)}) \tag{2.16}$$

where $z_i^{(s)}$ indicates the $s$th sample drawn from $q_\psi(z_i)$, out of a total of $S$ samples.

Note that stochastic integration of $\mathbb{E}_{q_\psi}[f(z_i)]$ itself is possible by using

$$\mathbb{E}_{q_\psi}[f(z_i)] = \int q_\psi(z_i) f(z_i) dz_i \approx \frac{1}{S} \sum_{s=1}^S f(z_i^{(s)}) \tag{2.17}$$

with $z_i^{(s)}$ drawn from $q_\psi(z_i)$, but that the gradient of $f(z_i^{(s)})$ w.r.t. $\psi$ is 0, as the sample $z_i^{(s)}$ is constant once drawn. The gradient w.r.t. $\psi$ is nonzero in eq. (2.16).

The authors perform gradient ascent using this stochastic gradient estimator. However, this estimator exhibits very high variance ([21], [30]), which makes it impractical in many settings.

Variational Bayesian Matrix Factorization

We illustrate the variational Bayesian approach by briefly describing the variational Bayesian Matrix Factorization algorithm of Lim and Teh [28]. As in the PMF algorithm, the authors assume ratings $r_{ij}$ are normally distributed around the dot product of a latent user and item factor:

$$p(r_{ij} \mid \mathbf{u}_i, \mathbf{v}_j, \sigma) = \mathcal{N}(r_{ij} \mid \mathbf{u}_i^T \mathbf{v}_j, \sigma^2) \tag{2.18}$$

where $\sigma^2$ is the variance of the ratings. The latent factors are given 0-mean Gaussian priors with diagonal (but not necessarily spherical) variance:

$$p(\mathbf{U} \mid \boldsymbol{\sigma}_u) = \prod_i \mathcal{N}(\mathbf{u}_i \mid \mathbf{0}, \mathrm{diag}(\boldsymbol{\sigma}_u^2)), \qquad p(\mathbf{V} \mid \boldsymbol{\sigma}_v) = \prod_j \mathcal{N}(\mathbf{v}_j \mid \mathbf{0}, \mathrm{diag}(\boldsymbol{\sigma}_v^2)) \tag{2.19}$$

where $\boldsymbol{\sigma}_u^2$ and $\boldsymbol{\sigma}_v^2$ are vectors, and $\mathrm{diag}(\boldsymbol{\sigma}_u^2)$ and $\mathrm{diag}(\boldsymbol{\sigma}_v^2)$ indicate square diagonal matrices with the values of $\boldsymbol{\sigma}_u^2$ and $\boldsymbol{\sigma}_v^2$ on the diagonal, respectively. The full data likelihood of this model is thus

$$p(\mathcal{D}, \mathbf{U}, \mathbf{V} \mid \boldsymbol{\sigma}_u, \boldsymbol{\sigma}_v, \sigma) = p(\mathbf{U} \mid \boldsymbol{\sigma}_u) p(\mathbf{V} \mid \boldsymbol{\sigma}_v) p(\mathcal{D} \mid \mathbf{U}, \mathbf{V}, \sigma) = \prod_i p(\mathbf{u}_i \mid \boldsymbol{\sigma}_u) \prod_j p(\mathbf{v}_j \mid \boldsymbol{\sigma}_v) \prod_{(i,j,r_{ij}) \in \mathcal{D}} p(r_{ij} \mid \mathbf{u}_i, \mathbf{v}_j, \sigma) \tag{2.20}$$

Note that the variances $\sigma^2$, $\boldsymbol{\sigma}_u^2$ and $\boldsymbol{\sigma}_v^2$ are not assigned prior distributions. The authors are interested in finding the posterior distribution $p(\mathbf{U}, \mathbf{V} \mid \mathcal{D})$. However, this posterior distribution cannot be computed in closed form. The authors instead define an approximate posterior distribution $Q(\mathbf{U}, \mathbf{V})$ that they assume factorizes into distributions over $\mathbf{U}$ and $\mathbf{V}$:

$$Q(\mathbf{U}, \mathbf{V}) = Q(\mathbf{U}) Q(\mathbf{V}) \tag{2.21}$$

The ELBO is then:

$$\begin{aligned}
\mathcal{L}(Q) &= \int Q(\mathbf{U}, \mathbf{V}) \log \frac{p(\mathcal{D} \mid \mathbf{U}, \mathbf{V}, \sigma) p(\mathbf{U} \mid \boldsymbol{\sigma}_u) p(\mathbf{V} \mid \boldsymbol{\sigma}_v)}{Q(\mathbf{U}, \mathbf{V})} d\mathbf{U}\, d\mathbf{V} \\
&= -D_{KL}\big[Q(\mathbf{U}) \,\|\, p(\mathbf{U} \mid \boldsymbol{\sigma}_u)\big] - D_{KL}\big[Q(\mathbf{V}) \,\|\, p(\mathbf{V} \mid \boldsymbol{\sigma}_v)\big] + \mathbb{E}_{Q(\mathbf{U})Q(\mathbf{V})}\big[\log p(\mathcal{D} \mid \mathbf{U}, \mathbf{V}, \sigma)\big]
\end{aligned} \tag{2.22}$$

Lim and Teh optimize the ELBO of eq. (2.22) by iterating between optimizing w.r.t. $\mathbf{U}$, optimizing w.r.t. $\mathbf{V}$, and optimizing w.r.t. the variances $\sigma$, $\boldsymbol{\sigma}_u$ and $\boldsymbol{\sigma}_v$.

2.2.3 Autoencoding Variational Bayes

Kingma and Welling [21] extend both variational Bayesian methods and previous work on autoencoders in the Autoencoding Variational Bayes (AEVB) algorithm. Their work considers models in which there is a latent variable $\mathbf{z}_i$ for each data point $\mathbf{x}_i$. The datapoint $\mathbf{x}_i$ is assumed generated by some conditional distribution $p_\theta(\mathbf{x}_i \mid \mathbf{z}_i)$. The latent variables are endowed with prior distributions $p(\mathbf{z}_i)$. The parameters of this generative distribution are produced by a (deterministic) function $f_\theta(\cdot)$ of the latent variable $\mathbf{z}_i$. $f_\theta(\cdot)$ is parametrized by the set of parameters $\theta$, with $\theta$ shared amongst all distributions. For example, if $p_\theta(\mathbf{x}_i \mid \mathbf{z}_i)$ is Gaussian, then the mean $\boldsymbol{\mu}_i$ and diagonal variance $\boldsymbol{\sigma}_i^2 \mathbf{I}$ are computed from $\mathbf{z}_i$ using $f_\theta$:

$$[\boldsymbol{\mu}_i, \log \boldsymbol{\sigma}_i] = f_\theta(\mathbf{z}_i) \tag{2.23}$$

Furthermore, the variational distribution over the latent variables is conditioned on the data, i.e. the variational distribution is expressed as $Q_\phi(\mathbf{Z} \mid \mathcal{D})$ and factorizes over the independent datapoints:

$$Q_\phi(\mathbf{Z} \mid \mathcal{D}) = \prod_i q_\phi(\mathbf{z}_i \mid \mathbf{x}_i)$$

The parameters of each distribution $q_\phi(\mathbf{z}_i \mid \mathbf{x}_i)$ are a (deterministic) function $g_\phi(\cdot)$ of $\mathbf{x}_i$, where $g_\phi(\cdot)$ is parametrized by $\phi$. Similarly to the role $f_\theta(\cdot)$ plays in the generative conditional distribution $p_\theta$, the function $g_\phi(\cdot)$ produces the parameters of each of the variational distributions. The distributions $q_\phi(\mathbf{z}_i \mid \mathbf{x}_i)$ are referred to as the recognition model.

The ELBO for this model decomposes into a sum over ELBOs for individual datapoints. The ELBO for a single data point is [21]:

$$\mathcal{L}(\theta, \phi; \mathbf{x}_i) = -D_{KL}\big[q_\phi(\mathbf{z}_i \mid \mathbf{x}_i) \,\|\, p(\mathbf{z}_i)\big] + \mathbb{E}_q\big[\log p_\theta(\mathbf{x}_i \mid \mathbf{z}_i)\big] \tag{2.24}$$

In case the expectation $\mathbb{E}_q\big[\log p_\theta(\mathbf{x}_i \mid \mathbf{z}_i)\big]$ is intractable to compute, one can resort to stochastic integration to approximate the expectation [21]. By drawing $S$ samples from $q_\phi(\mathbf{z}_i \mid \mathbf{x}_i)$ the expectation can be approximated as

$$\mathbb{E}_q\big[\log p_\theta(\mathbf{x}_i \mid \mathbf{z}_i)\big] = \int q_\phi(\mathbf{z}_i \mid \mathbf{x}_i) \log p_\theta(\mathbf{x}_i \mid \mathbf{z}_i) d\mathbf{z}_i \approx \frac{1}{S} \sum_{s=1}^S \log p_\theta(\mathbf{x}_i \mid \mathbf{z}_i^{(s)}) \tag{2.25}$$

where $\mathbf{z}_i^{(s)}$ denotes sample $s$ drawn from $q_\phi(\mathbf{z}_i \mid \mathbf{x}_i)$. The term in the ELBO that depends on the single datapoint $\mathbf{x}_i$ can then be approximated by [21]:

$$\mathcal{L}(\theta, \phi; \mathbf{x}_i) \approx \tilde{\mathcal{L}}(\theta, \phi; \mathbf{x}_i) \overset{\text{def}}{=} -D_{KL}\big[q_\phi(\mathbf{z}_i \mid \mathbf{x}_i) \,\|\, p(\mathbf{z}_i)\big] + \frac{1}{S} \sum_{s=1}^S \log p_\theta(\mathbf{x}_i \mid \mathbf{z}_i^{(s)}) \tag{2.26}$$

This yields the Stochastic Gradient Variational Bayes (SGVB) estimator for a single datapoint, denoted by $\tilde{\mathcal{L}}(\theta, \phi; \mathbf{x}_i)$. The sum of eq. (2.26) over all datapoints $\mathbf{x}_i$ approximates the ELBO [21].

Note that the gradient of $p_\theta(\mathbf{x}_i \mid \mathbf{z}_i)$ w.r.t. $\phi$ is 0 in eq. (2.26). The reason for this is that a sample $\mathbf{z}_i^{(s)}$ of $q_\phi(\mathbf{z}_i \mid \mathbf{x}_i)$ is constant once it is drawn. This means that the SGVB estimator cannot be directly optimized w.r.t. the parameters $\phi$ of the recognition model.

To solve this issue, Kingma and Welling introduce the reparametrization trick. This entails the introduction of a deterministic function $h(\boldsymbol{\epsilon}, g_\phi(\mathbf{x}_i))$ that maps a random noise vector $\boldsymbol{\epsilon}$ and the parameters of $q_\phi(\mathbf{z}_i \mid \mathbf{x}_i)$, computed by $g_\phi(\mathbf{x}_i)$, to a sample from the posterior $q_\phi(\mathbf{z}_i \mid \mathbf{x}_i)$ [21]. This yields:

$$\tilde{\mathcal{L}}(\theta, \phi; \mathbf{x}_i) = -D_{KL}\big[q_\phi(\mathbf{z}_i \mid \mathbf{x}_i) \,\|\, p(\mathbf{z}_i)\big] + \frac{1}{S} \sum_{s=1}^S \log p_\theta\big(\mathbf{x}_i \mid h(\boldsymbol{\epsilon}^{(s)}, g_\phi(\mathbf{x}_i))\big) \tag{2.27}$$

This separates the sample $\boldsymbol{\epsilon}^{(s)}$ from the parameters $\phi$, which allows $\tilde{\mathcal{L}}$ to be differentiated w.r.t. both $\theta$ and $\phi$, assuming that $h$ is differentiable w.r.t. the parameters produced by $g_\phi(\mathbf{x}_i)$. In case the KL-divergence term can be computed in closed form (which is often the case, e.g. when both the prior and the approximate posterior distribution are Gaussian), the SGVB estimator can be expressed in closed form, and can be optimized by performing gradient ascent using the gradient $\nabla_{\theta, \phi} \tilde{\mathcal{L}}(\theta, \phi; \mathbf{x}_i)$ [21].

Relationship to Autoencoders

The authors apply the SGVB estimator in the Autoencoding Variational Bayes (AEVB) algorithm. The relationship between autoencoders as described in Section 2.2.1 and the SGVB estimator becomes clear when one looks at the form of eq. (2.27). The distribution $q_\phi(\mathbf{z}_i \mid \mathbf{x}_i)$ can be seen as an encoding distribution that gives a probability distribution on the encoded vector $\mathbf{z}_i$ for input vector $\mathbf{x}_i$. The KL-divergence between $q_\phi(\mathbf{z}_i \mid \mathbf{x}_i)$ and $p(\mathbf{z}_i)$ can be interpreted as a regularization term, and the expectation $\mathbb{E}_{q_\phi(\mathbf{z}_i \mid \mathbf{x}_i)}[\log p_\theta(\mathbf{x}_i \mid \mathbf{z}_i)]$ can be interpreted as the expected negative reconstruction error [21].
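A minimal numpy sketch of the reparametrization trick, with illustrative parameter values: the sample is expressed as a deterministic function of the variational parameters, with all randomness isolated in $\boldsymbol{\epsilon}$, so the sample can be differentiated w.r.t. those parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the recognition model g_phi produced these parameters for q(z|x):
mu = np.array([0.5, -1.0])
log_sigma = np.array([-0.2, 0.1])

# Reparametrized sample: z = h(eps, mu, sigma) = mu + sigma * eps,
# with eps ~ N(0, I). z is a differentiable function of mu and log_sigma.
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(log_sigma) * eps

# Naive alternative: sampling directly from the distribution gives a value
# that is constant once drawn, so it carries no gradient w.r.t. mu, log_sigma.
z_naive = rng.normal(loc=mu, scale=np.exp(log_sigma))
```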

2.3 Graph-based Recommendation Systems and Graph Convolutional Networks

An alternative path in recommendation systems research is to view recommendation systems as graphs. In this view, users and items form nodes in a bipartite graph $G = (U, V, E)$, where $U$ is the set of nodes associated with users, $V$ is the set of nodes associated with items, and $E$ is the set of edges connecting users to items. Edges can denote binary preference indications, or ratings if the edges are weighted [7], [10]. Note that the bipartiteness of the graph follows from the fact that users can only rate items, not other users, and that items cannot rate other items. Therefore edges only exist between users and items.

Traditionally the graph structure of the recommendation data is used to determine the similarity between nodes. This can then be used either by directly exploiting the similarity between a user node $i$ and an item node $j$ to predict the rating the user would give the item, or by exploiting the similarity of a target user $i$ with all other users and their ratings on a target item $j$:

$$\hat{r}_{ij} = \frac{\sum_{i' \in \mathcal{D}_j} \mathrm{sim}(i, i') \, r_{i'j}}{\sum_{i' \in \mathcal{D}_j} \mathrm{sim}(i, i')} \tag{2.28}$$

where $\mathcal{D}_j$ is the subset of $\mathcal{D}$ containing ratings for item $j$ (and we assume that user $i$ has not yet rated item $j$, as it should be excluded from this set) [9]. Much of this work focuses on random walk properties of the recommendation graph to determine similarities between users or between users and items [7], [10]. Recent work extends this to more general graph kernels [9].

2.3.1 Graph Convolutional Nets

Kipf and Welling [22] introduce an efficient approximation to localized spectral filters on graphs in a Graph Convolutional Network (GCN). These filters can be interpreted as local graph feature extractors. These feature extractors encode, for each node, both the local neighborhood structure of the node and a transformation of the features of the node and the nodes in its neighborhood.

This is analogous to Convolutional Neural Networks (CNNs), in which the consecutive application of learned, localized filters allows the networks to learn complex features of signals, e.g. images. The GCN framework can be seen as a generalization of CNNs, where GCNs are also sensitive to the local structure of a graph, something that does not hold for standard CNNs, as these only work on regular graphs (i.e. the lattice of pixels). However, like CNNs, consecutive applications of the graph convolution allow higher-order neighborhoods of nodes to be encoded.

Kipf and Welling [22] propose the following propagation rule:

$$\mathbf{H}^{(l)} = \sigma\big(\hat{\mathbf{A}} \mathbf{H}^{(l-1)} \mathbf{W}^{(l)}\big) \tag{2.29}$$

where $\mathbf{H}^{(l)}$ is the matrix of activations in layer $l$; $\hat{\mathbf{A}} \overset{\text{def}}{=} \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}$; $\tilde{\mathbf{A}} \overset{\text{def}}{=} \mathbf{A} + \mathbf{I}$, where $\mathbf{A}$ is the adjacency matrix; $\tilde{\mathbf{D}}$ is a diagonal matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$; and $\sigma(\cdot)$ is an activation function. In the input layer $l = 0$, the matrix of activations $\mathbf{H}^{(0)}$ is the feature matrix $\mathbf{X}$, in which row $i$ contains the features for node $i$ [22].

The adjacency matrix $\mathbf{A}$ is a symmetric matrix where $A_{ij} = A_{ji} = w_{ij}$ is the weight of an edge between nodes $i$ and $j$, and 0 if there is no edge between $i$ and $j$. If the edges in the graph are unweighted, then $A_{ij} = A_{ji} = 1$ if there is an edge between node $i$ and node $j$, and 0 otherwise.

Kipf and Welling motivate the form of the propagation rule of eq. (2.29) from a first-order convolution over the full graph. A convolution over the full graph generally requires a full eigenvalue decomposition of the normalized graph Laplacian matrix $\mathbf{L} = \mathbf{I} - \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}$ [22], which is expensive to compute.
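The propagation rule of eq. (2.29) is straightforward to implement; the following numpy sketch applies one graph convolution to a toy unweighted graph, with ReLU as an example choice for $\sigma(\cdot)$ and random weights standing in for learned parameters.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One application of the propagation rule of eq. (2.29)."""
    A_tilde = A + np.eye(A.shape[0])            # add self-connections
    d = A_tilde.sum(axis=1)                     # degrees D~_ii
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)       # ReLU activation

# Toy chain graph with 4 nodes, 2 input features, 3 output features.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 2))                     # H^(0) = feature matrix X
W = rng.normal(size=(2, 3))
H1 = gcn_layer(A, X, W)
```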

Chapter 3

Autoencoding Variational Matrix Factorization

This chapter describes the Autoencoding Variational Matrix Factorization (AEVMF) algorithm. For a description of the Autoencoding Variational Bayes algorithm we refer the reader to Section 2.2 and the references cited therein. In this chapter we first introduce the Matrix Factorizing Autoencoder, which functions as a baseline model for the Matrix Factorizing Variational Autoencoder, which we introduce next. At the end of this chapter we introduce a related but slightly different model that is based on graph convolutions, the Matrix Factorizing Graph Autoencoder, and its variational adaptation.

3.1 Matrix Factorizing Autoencoder

Our simplest model is the Matrix Factorizing Autoencoder (MFAE). This model consists of the following three components:

1. A user rating vector encoder $g_{\phi_u}: \mathbb{R}^{N_v} \to \mathbb{R}^{D_u}$, parametrized by $\phi_u$

2. An item rating vector encoder $g_{\phi_v}: \mathbb{R}^{N_u} \to \mathbb{R}^{D_v}$, parametrized by $\phi_v$

3. A decoder $f_\theta: \mathbb{R}^{D_u} \times \mathbb{R}^{D_v} \to \mathbb{R}$, parametrized by $\theta$

where $D_u$ and $D_v$ are the dimensionalities of the latent user and item factors, respectively. The encoder $g_{\phi_u}(\cdot)$ encodes a sparse user rating vector $\mathbf{r}_i \in \mathbb{R}^{N_v}$, i.e. a row in matrix $\mathbf{R}$, into a latent user representation $\mathbf{u}_i$, while $g_{\phi_v}(\cdot)$ encodes a sparse item rating vector $\mathbf{r}_j \in \mathbb{R}^{N_u}$, i.e. a column in $\mathbf{R}$, into a latent item representation $\mathbf{v}_j$. The decoder $f_\theta(\cdot, \cdot)$ takes as input a latent user representation $\mathbf{u}_i$ and a latent item representation $\mathbf{v}_j$ and predicts a rating $\hat{r}_{ij}$. Figure 3.1 gives a schematic overview of this model. The MFAE is trained to minimize the objective:

$$\operatorname*{argmin}_{\phi_u, \phi_v, \theta} \sum_{(i,j,r_{ij}) \in \mathcal{D}} \mathrm{Err}\big(r_{ij}, f_\theta(g_{\phi_u}(\mathbf{r}_i), g_{\phi_v}(\mathbf{r}_j))\big) + \mathrm{Reg}(\phi_u, \phi_v, \theta) \tag{3.1}$$

where $\mathrm{Err}(\cdot)$ is an arbitrary error function, and $\mathrm{Reg}(\cdot)$ is some regularization term on the learnable parameters.

This might seem trivial, as for any triple $(i, j, r_{ij}) \in \mathcal{D}$ the rating $r_{ij}$ is present in both $\mathbf{r}_i$ and $\mathbf{r}_j$, since $\mathbf{r}_i$ and $\mathbf{r}_j$ are both constructed from the ratings in $\mathcal{D}$. However, the decoder only has access to the latent representations ($\mathbf{u}_i$ and $\mathbf{v}_j$) of the rating vectors ($\mathbf{r}_i$ and $\mathbf{r}_j$). Furthermore, even if the decoder did have access to the raw rating vectors, it has no way of learning which of the ratings in the rating vectors is the target rating for a combination of two vectors $\mathbf{r}_i$ and $\mathbf{r}_j$.

This formulation provides a benefit over the CFN model of Strub et al. [41]. Recall that their model only encodes rows or columns of $\mathbf{R}$; the side that is not encoded is represented by a fixed matrix $\mathbf{W}^{(2)}$. For example, in the U-CFN model, user rating vectors are encoded and the item factors remain fixed. If a new item is added to a system in which the U-CFN model is employed, even if it is only rated by known users, the model cannot readily compute a latent factor for this new item.

In the remainder of this section we present a number of increasingly sophisticated implementations of the MFAE, culminating in the Matrix Factorizing Variational Graph Autoencoder (MFVGAE).

[Figure 3.1: Schematic overview of the Matrix Factorizing Autoencoder. Elliptical nodes represent real- or vector-valued variables, while rectangular boxes denote function application. The arrows indicate the flow of information. Red is used to indicate user ratings and the latent factor for user $i$; blue is used to indicate item ratings and the latent factor for item $j$. Purple is used to indicate a predicted (and previously unobserved) rating $\hat{r}_{ij}$.]

3.1.1 Supervised MFAE

In this model we assume the matrix factorization into $\mathbf{U}$ and $\mathbf{V}$ is known, and we train the encoders to learn a mapping from the sparse user and item rating vectors $\mathbf{r}_i$ and $\mathbf{r}_j$ to the user and item factors $\mathbf{u}_i$ and $\mathbf{v}_j$ for all $i$ and $j$. The training objective is to find

$$\operatorname*{argmin}_{\phi_u} \sum_i \mathrm{Err}(g_{\phi_u}(\mathbf{r}_i), \mathbf{u}_i) + \mathrm{Reg}(\phi_u), \qquad \operatorname*{argmin}_{\phi_v} \sum_j \mathrm{Err}(g_{\phi_v}(\mathbf{r}_j), \mathbf{v}_j) + \mathrm{Reg}(\phi_v) \tag{3.2}$$

In order to run supervised training, we use the latent user and item factors produced by a different matrix factorization algorithm as targets to train the encoders.

Encoders

For all our encoders for the unsupervised MFAE we use a simple two-layer Neural Network. The networks have the following form:

$$\begin{aligned}
\mathbf{u}_i &= g_{\phi_u}(\mathbf{r}_i) = \sigma_2\big(\sigma_1(\mathbf{r}_i \mathbf{W}_u^{(1)} + \mathbf{b}_u^{(1)}) \mathbf{W}_u^{(2)} + \mathbf{b}_u^{(2)}\big) \\
\mathbf{v}_j &= g_{\phi_v}(\mathbf{r}_j) = \sigma_2\big(\sigma_1(\mathbf{r}_j \mathbf{W}_v^{(1)} + \mathbf{b}_v^{(1)}) \mathbf{W}_v^{(2)} + \mathbf{b}_v^{(2)}\big)
\end{aligned} \tag{3.3}$$

where $\sigma_1(\cdot)$ and $\sigma_2(\cdot)$ are arbitrary activation functions; $\mathbf{W}_u^{(1)}$, $\mathbf{W}_u^{(2)}$, $\mathbf{W}_v^{(1)}$ and $\mathbf{W}_v^{(2)}$ are weight matrices for the user and the item rating vector encoders, of dimensionality $N_v \times D_h$, $D_h \times D$, $N_u \times D_h$ and $D_h \times D$, respectively; and $\mathbf{b}_u^{(1)}$, $\mathbf{b}_u^{(2)}$, $\mathbf{b}_v^{(1)}$ and $\mathbf{b}_v^{(2)}$ are bias vectors for the user and item rating encoders, of dimensionality $D_h$, $D$, $D_h$ and $D$, respectively. The number of free parameters $N_{params}$ for this model is

$$N_{params} = (N_u + N_v + 2) D_{h1} + 2 (D_{h1} + 1) D \tag{3.4}$$

where $N_u$ is the number of users, $N_v$ is the number of items, $D_{h1}$ is the number of hidden units in the hidden layer (assumed equal for the user and the item rating encoders), and $D$ is the number of latent factors.

Note that each of the two encoders constitutes a simple Multilayer Perceptron (MLP) with one hidden layer:

$$\mathbf{h}_u = \sigma_1(\mathbf{r}_i \mathbf{W}_u^{(1)} + \mathbf{b}_u^{(1)}), \qquad \mathbf{h}_v = \sigma_1(\mathbf{r}_j \mathbf{W}_v^{(1)} + \mathbf{b}_v^{(1)}) \tag{3.5}$$

$$\mathbf{u}_i = \sigma_2(\mathbf{h}_u \mathbf{W}_u^{(2)} + \mathbf{b}_u^{(2)}), \qquad \mathbf{v}_j = \sigma_2(\mathbf{h}_v \mathbf{W}_v^{(2)} + \mathbf{b}_v^{(2)}) \tag{3.6}$$

where $\mathbf{h}_u$ and $\mathbf{h}_v$ are the vectors of activations of the hidden layers. While our formulation allows for differences in hidden layer dimensionality between the user rating and item rating encoders, we only include models in which the hidden layers of the two encoders have the same dimensionality.

We also define a simplified encoder in which the weights of the output layer are shared. Thus we constrain $\mathbf{W}_u^{(2)} = \mathbf{W}_v^{(2)}$ and $\mathbf{b}_u^{(2)} = \mathbf{b}_v^{(2)}$. We refer to the shared weight matrix and bias of the output layer simply as $\mathbf{W}^{(2)}$ and $\mathbf{b}^{(2)}$.

This simplified encoder has the following form:

$$\begin{aligned}
\mathbf{u}_i &= g_{\phi_u}(\mathbf{r}_i) = \sigma_2\big(\sigma_1(\mathbf{r}_i \mathbf{W}_u^{(1)} + \mathbf{b}_u^{(1)}) \mathbf{W}^{(2)} + \mathbf{b}^{(2)}\big) \\
\mathbf{v}_j &= g_{\phi_v}(\mathbf{r}_j) = \sigma_2\big(\sigma_1(\mathbf{r}_j \mathbf{W}_v^{(1)} + \mathbf{b}_v^{(1)}) \mathbf{W}^{(2)} + \mathbf{b}^{(2)}\big)
\end{aligned} \tag{3.7}$$

The benefit of this encoder over the first encoder is that it has fewer parameters, and that it ensures that the hidden activations of the user and item encoders are projected onto the same space. The number of parameters $N_{params}$ for the simplified encoder is

$$N_{params} = (N_u + N_v + 2) D_{h1} + (D_{h1} + 1) D \tag{3.8}$$

Decoders

While the formulation of the general MFAE allows an arbitrary function to be used as a decoder, the initial matrix factorization on which the supervised MFAE is trained was found such that it minimizes the error when the dot product between a user and an item factor is used as the prediction function. It is therefore sensible to use the dot product as a decoder in this setting:

$$\hat{r}_{ij} = f_\theta(\mathbf{u}_i, \mathbf{v}_j) = \mathbf{u}_i^T \mathbf{v}_j \tag{3.9}$$

This decoder has no learnable parameters.

The supervised MFAE model is used as a baseline model to investigate whether it is possible to learn a mapping from a raw rating vector to a user or item factor.
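The following numpy sketch ties the pieces of this section together on a toy example: two two-layer rating-vector encoders with a shared output layer, as in eq. (3.7), followed by the dot-product decoder of eq. (3.9). All sizes, rating values and (random) weights are illustrative assumptions.

```python
import numpy as np

def encode(r, W1, b1, W2, b2):
    """Two-layer rating-vector encoder of eq. (3.7); tanh as both activations."""
    h = np.tanh(r @ W1 + b1)     # hidden activations, eq. (3.5)
    return np.tanh(h @ W2 + b2)  # latent factor, eq. (3.6)

rng = np.random.default_rng(0)
N_u, N_v, D_h, D = 5, 6, 4, 3

# Separate first layers for users and items; shared output layer (W2, b2).
W1_u, b1_u = 0.1 * rng.normal(size=(N_v, D_h)), np.zeros(D_h)
W1_v, b1_v = 0.1 * rng.normal(size=(N_u, D_h)), np.zeros(D_h)
W2, b2 = 0.1 * rng.normal(size=(D_h, D)), np.zeros(D)

r_i = np.array([0.0, 5.0, 0.0, 3.0, 0.0, 4.0])  # user i's sparse ratings (length N_v)
r_j = np.array([5.0, 0.0, 0.0, 2.0, 0.0])       # item j's sparse ratings (length N_u)

u_i = encode(r_i, W1_u, b1_u, W2, b2)
v_j = encode(r_j, W1_v, b1_v, W2, b2)
r_hat_ij = u_i @ v_j                            # dot-product decoder, eq. (3.9)

# Matches the parameter count of eq. (3.8):
n_params = (N_u + N_v + 2) * D_h + (D_h + 1) * D
```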

3.1.2 Unsupervised MFAE

In this setting we do not have previously learned targets. Instead the training objective is to find

$$\operatorname*{argmin}_{\phi_u, \phi_v, \theta} \sum_{(i,j,r_{ij}) \in \mathcal{D}} \mathrm{Err}\big(r_{ij}, f_\theta(g_{\phi_u}(\mathbf{r}_i), g_{\phi_v}(\mathbf{r}_j))\big) + \mathrm{Reg}(\phi_u, \phi_v, \theta) \tag{3.10}$$

The error of the Supervised MFAE algorithm is an upper bound on the error of the Matrix Factorizing Autoencoder. This is caused by the following two facts:

1. The Supervised MFAE algorithm will only perform as well as the algorithm that produced the target factors in case the MFAE algorithm is able to predict the target factors perfectly.

2. In case the Supervised MFAE algorithm is unable to predict the target factors perfectly, the prediction error on the target factors of the Supervised MFAE algorithm will likely be exacerbated when user and item factors are combined to predict a rating.

By optimizing the reconstruction error between an original rating $r_{ij}$ and a predicted rating $\hat{r}_{ij}$ we circumvent this problem.

Encoders

We use the same encoders as those introduced in Section 3.1.1.

Decoders

In this section we present several decoder functions $f_\theta$.

Dot Product Decoder  This is the dot-product based decoder introduced in Section 3.1.1, eq. (3.9).

Matrix Dot Product Decoder  In this decoder we introduce a learnable $D_u \times D_v$ matrix $\mathbf{M}$. This matrix projects $\mathbf{u}_i$ onto the same space as $\mathbf{v}_j$, which means that it is no longer required that $D_u = D_v$. Ratings are predicted as:

$$\hat{r}_{ij} = f_\theta(\mathbf{u}_i, \mathbf{v}_j) = \mathbf{u}_i^T \mathbf{M} \mathbf{v}_j \tag{3.11}$$

Note that this decoder can be interpreted as projecting the latent item factor onto the space of the latent user factor. The learnable parameters of this decoder are the elements of the matrix $\mathbf{M}$.

Multinomial Dot Product Decoder  The underlying assumption of a dot-product based decoder is that $r_{ij}$ is a real-valued variable.

It has been noted in the literature that this is not necessarily a reasonable assumption ([26], [48], [44]). For this reason we introduce the family of Multinomial Dot Product Decoders. This family of decoders treats rating $r_{ij}$ as a categorical variable of $D_r$ categories, where $D_r$ is the number of possible rating values. For a movie rating dataset where users can rate movies on a 1- to 5-star scale, $D_r$ would be 5. The initial output of the decoder is a $D_r$-dimensional vector $\mathbf{r}_{ij}$. Element $k: 1 \leq k \leq D_r$ of this vector is computed as:

$$r_{ij}^{(k)} = f_\theta^{(k)}(\mathbf{u}_i, \mathbf{v}_j) = \mathbf{u}_i^T \mathbf{M}^{(k)} \mathbf{v}_j \tag{3.12}$$

Thus, for every rating category $k$ we introduce a learnable matrix $\mathbf{M}^{(k)}$. The predicted probability of the rating taking value $k$ is computed using the softmax function:

$$p(\hat{r}_{ij} = k \mid \mathbf{u}_i, \mathbf{v}_j) = \frac{e^{r_{ij}^{(k)}}}{\sum_{k'} e^{r_{ij}^{(k')}}} \tag{3.13}$$

The predicted rating $\hat{r}_{ij}$ can then either be computed as the most probable rating, or as the expected rating [48]:

$$\hat{r}_{ij} = \operatorname*{argmax}_k p(\hat{r}_{ij} = k \mid \mathbf{u}_i, \mathbf{v}_j), \qquad \hat{r}_{ij} = \mathbb{E}[k] = \sum_k k \, p(\hat{r}_{ij} = k \mid \mathbf{u}_i, \mathbf{v}_j) \tag{3.14}$$

Deep Decoder  This decoder appends $\mathbf{v}_j$ to $\mathbf{u}_i$ to form one vector $\mathbf{z}_{ij}$. This vector is then fed into an MLP whose output is a single node corresponding to $\hat{r}_{ij}$. Thus:

$$\hat{r}_{ij} = f_\theta(\mathbf{u}_i, \mathbf{v}_j) = \mathrm{MLP}([\mathbf{u}_i, \mathbf{v}_j]) \tag{3.15}$$

where $\mathrm{MLP}$ denotes an arbitrary multilayer perceptron and $[\cdot, \cdot]$ is used to denote vector concatenation. This model can easily be extended to predict ratings as a categorical variable by changing the output dimensionality from 1 to $D_r$. This model is a simplified version of the model presented by Dziugaite and Roy [19] that only takes user and item factors as input (not user- and item-specific matrices).
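The following numpy sketch illustrates the multinomial dot-product decoder of eqs. (3.12)-(3.14) for a single user/item pair; the factor values and the matrices $\mathbf{M}^{(k)}$ are random illustrative stand-ins for learned parameters.

```python
import numpy as np

def multinomial_decoder(u_i, v_j, M):
    """Multinomial dot-product decoder of eqs. (3.12)-(3.14).

    M has shape (D_r, D_u, D_v): one learnable matrix per rating category.
    Returns class probabilities, the argmax rating, and the expected rating
    for rating values 1..D_r.
    """
    scores = np.einsum('u,kuv,v->k', u_i, M, v_j)  # r_ij^(k) = u_i^T M^(k) v_j
    p = np.exp(scores - scores.max())
    p /= p.sum()                                    # softmax, eq. (3.13)
    values = np.arange(1, len(p) + 1)
    return p, values[p.argmax()], values @ p        # eq. (3.14)

rng = np.random.default_rng(0)
D_u = D_v = 3
D_r = 5                                             # 1- to 5-star ratings
M = 0.1 * rng.normal(size=(D_r, D_u, D_v))
u_i, v_j = rng.normal(size=D_u), rng.normal(size=D_v)
p, r_map, r_expected = multinomial_decoder(u_i, v_j, M)
```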

3.2 Matrix Factorizing Variational Autoencoder

The Matrix Factorizing Variational Autoencoder (MFVAE) algorithm bridges the gap between the autoencoder formulation introduced in Section 3.1 and the variational Bayesian Matrix Factorization algorithm introduced by Lim and Teh [28]. In this algorithm, we assume observed ratings are i.i.d. generated from a conditional probabilistic distribution:

$$p_\theta(\mathcal{D} \mid \mathbf{U}, \mathbf{V}) = \prod_{(i,j,r_{ij}) \in \mathcal{D}} p_\theta(r_{ij} \mid \mathbf{u}_i, \mathbf{v}_j) \tag{3.16}$$

The parameters of the conditional distribution are a deterministic function $f_\theta$ of $\mathbf{u}_i$ and $\mathbf{v}_j$. $f_\theta$ is in turn parametrized by a set of parameters $\theta$. The latent factors, collected in matrices $\mathbf{U}$ and $\mathbf{V}$, are unobserved stochastic variables. We assign a prior distribution $p(\mathbf{U}, \mathbf{V})$ to the latent factors $\mathbf{U}$ and $\mathbf{V}$ that factorizes into prior distributions on the rows of $\mathbf{U}$ and $\mathbf{V}$:

$$p(\mathbf{U}, \mathbf{V}) = \prod_i p(\mathbf{u}_i) \prod_j p(\mathbf{v}_j) \tag{3.17}$$

The full data likelihood in this model is

$$p(\mathbf{U}, \mathbf{V}, \mathcal{D} \mid \theta) = \prod_i p(\mathbf{u}_i) \prod_j p(\mathbf{v}_j) \prod_{(i,j,r_{ij}) \in \mathcal{D}} p_\theta(r_{ij} \mid \mathbf{u}_i, \mathbf{v}_j) \tag{3.18}$$

Note that the matrix factorization likelihood of probabilistic matrix factorization [35] (Section 2.1) is recovered in case $\mathcal{N}(r_{ij} \mid \mathbf{u}_i^T \mathbf{v}_j, \sigma^2)$ is the generative distribution (and thus $f_\theta$ has no learnable parameters, since in this case $f_\theta(\mathbf{u}_i, \mathbf{v}_j) = \mathbf{u}_i^T \mathbf{v}_j$, and we assume $\sigma$ is fixed).

As in any Bayesian model, the distribution of interest in this model is the posterior distribution over the latent factors $p(\mathbf{U}, \mathbf{V} \mid \mathcal{D})$, which is intractable to compute for even mildly complicated functions $f_\theta(\cdot)$.

We therefore introduce a variational distribution $Q(\mathbf{U}, \mathbf{V})$ to approximate the posterior distribution on the latent variables $\mathbf{U}$ and $\mathbf{V}$. We assume that the variational distribution factorizes over the rows of $\mathbf{U}$ and $\mathbf{V}$. Note that [28] only assumes a factorization of $Q(\mathbf{U}, \mathbf{V})$ into $Q(\mathbf{U}) Q(\mathbf{V})$, but that the factorization of $Q(\mathbf{U})$ and $Q(\mathbf{V})$ over the rows of $\mathbf{U}$ and $\mathbf{V}$ follows from the form of the expectations in the ELBO (we refer the reader to [28] for details). Unlike [28] and following [21], we condition the posterior distribution on the data. Specifically, in the variational posterior distribution we condition each latent factor $\mathbf{u}_i$ and $\mathbf{v}_j$ on the observed ratings associated with that latent factor:

$$Q_\phi(\mathbf{U}, \mathbf{V} \mid \mathcal{D}) = \prod_i q_{\phi_u}(\mathbf{u}_i \mid \mathbf{r}_i) \prod_j q_{\phi_v}(\mathbf{v}_j \mid \mathbf{r}_j) \tag{3.19}$$

The parameters of each distribution $q_{\phi_u}(\mathbf{u}_i \mid \mathbf{r}_i)$ depend on the observed ratings $\mathbf{r}_i$ through a deterministic function $g_{\phi_u}(\cdot)$, which is in turn parametrized by the set of parameters $\phi_u$. Likewise, the parameters of $q_{\phi_v}(\mathbf{v}_j \mid \mathbf{r}_j)$ depend on $\mathbf{r}_j$ through the deterministic function $g_{\phi_v}(\cdot)$, parametrized by $\phi_v$.

The ELBO for the MFVAE model is presented in eq. (3.20). In the matrix factorization setting, the ELBO does not decompose into a convenient sum over ELBOs of individual datapoints, due to the dyadic nature of the data. Instead, the marginal likelihood decomposes into a sum over KL-divergence terms for each of the latent factors and a sum over the expected log conditional probability of each of the ratings. For a derivation of the full data marginal likelihood we refer the reader to Appendix A.1.

$$\begin{aligned}
\mathcal{L}(\theta, \phi_u, \phi_v; \mathbf{R}) &= -D_{KL}\big[Q(\mathbf{U}, \mathbf{V} \mid \mathcal{D}) \,\|\, p(\mathbf{U}, \mathbf{V})\big] + \mathbb{E}_{Q(\mathbf{U}, \mathbf{V} \mid \mathcal{D})}\big[\log p_\theta(\mathcal{D} \mid \mathbf{U}, \mathbf{V})\big] \\
&= -\sum_i D_{KL}\big[q_{\phi_u}(\mathbf{u}_i \mid \mathbf{r}_i) \,\|\, p(\mathbf{u}_i)\big] - \sum_j D_{KL}\big[q_{\phi_v}(\mathbf{v}_j \mid \mathbf{r}_j) \,\|\, p(\mathbf{v}_j)\big] \\
&\quad + \sum_{(i,j,r_{ij}) \in \mathcal{D}} \mathbb{E}_{q_{\phi_u}(\mathbf{u}_i \mid \mathbf{r}_i) q_{\phi_v}(\mathbf{v}_j \mid \mathbf{r}_j)}\big[\log p_\theta(r_{ij} \mid \mathbf{u}_i, \mathbf{v}_j)\big]
\end{aligned} \tag{3.20}$$

Note that this ELBO is highly similar to the ELBO for the Variational Bayesian Matrix Factorization model of Lim and Teh [28], which can be found in eq. (2.22). The major difference between this ELBO and the ELBO of the Variational Bayesian Matrix Factorization algorithm is the dependence of the approximate posterior distribution $Q(\mathbf{U}, \mathbf{V} \mid \mathcal{D})$ on the data $\mathcal{D}$. This is typical of Variational Autoencoder-type applications [21].

The form of the ELBO implies that the use of mini-batches of data is only possible if subsets of users and items can be selected such that there are no users or items outside of such a subset that have rated, or have been rated by, an item or user inside it. Viewed from a graph perspective: mini-batches only make sense for connected components of the rating graph. While connected components in the rating graph, if there are more than one, could be precomputed, we assume that the set of connected components in the rating graph contains one very large connected component containing most user and item nodes, and a few very small connected components containing unpopular movies and users with few ratings. Defining mini-batches of equal size would thus require some sort of subsampling procedure.¹ Otherwise, the procedure for maximizing the variational lower bound of eq. (3.20) is identical to the procedure for maximizing the variational lower bound in the original AEVB algorithm. For convenience this is reproduced here.

¹ The connectivity of each of the rating datasets and the splits we create (i.e. train/val/test splits) is investigated briefly in Chapter 4.
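The connected components mentioned above are cheap to compute. As a sketch (using scipy, on hypothetical toy data), the bipartite rating graph can be assembled from the rating triples and its components counted as follows.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def rating_graph_components(triples, n_users, n_items):
    """Connected components of the bipartite rating graph.

    Nodes 0..n_users-1 are users; nodes n_users..n_users+n_items-1 are items.
    """
    rows = [i for i, j, _ in triples] + [n_users + j for i, j, _ in triples]
    cols = [n_users + j for i, j, _ in triples] + [i for i, j, _ in triples]
    n = n_users + n_items
    A = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    return connected_components(A, directed=False)

triples = [(0, 0, 5.0), (1, 0, 3.0), (2, 1, 4.0)]
n_comp, labels = rating_graph_components(triples, n_users=3, n_items=2)
# n_comp == 2: {user 0, user 1, item 0} and {user 2, item 1}
```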

The expectation $\mathbb{E}_{q_{\phi_u}(\mathbf{u}_i \mid \mathbf{r}_i) q_{\phi_v}(\mathbf{v}_j \mid \mathbf{r}_j)}\big[\log p_\theta(r_{ij} \mid \mathbf{u}_i, \mathbf{v}_j)\big]$ in eq. (3.20) can, by supposition, not be computed in closed form. However, we can estimate the expectation by drawing samples $\mathbf{u}_i \sim q_{\phi_u}(\mathbf{u}_i \mid \mathbf{r}_i)$ and $\mathbf{v}_j \sim q_{\phi_v}(\mathbf{v}_j \mid \mathbf{r}_j)$, and computing an approximation to the expectation:

$$\mathbb{E}_{q_{\phi_u}(\mathbf{u}_i \mid \mathbf{r}_i) q_{\phi_v}(\mathbf{v}_j \mid \mathbf{r}_j)}\big[\log p_\theta(r_{ij} \mid \mathbf{u}_i, \mathbf{v}_j)\big] \approx \frac{1}{S} \sum_{s=1}^S \log p_\theta(r_{ij} \mid \mathbf{u}_i^{(s)}, \mathbf{v}_j^{(s)}) \tag{3.21}$$

where $\mathbf{u}_i^{(s)}$ indicates the $s$th sample from $q_{\phi_u}(\mathbf{u}_i \mid \mathbf{r}_i)$ and likewise $\mathbf{v}_j^{(s)}$ indicates the $s$th sample from $q_{\phi_v}(\mathbf{v}_j \mid \mathbf{r}_j)$.

Recognition Models and Prior Distributions

In all our models we assume that the prior distribution for all latent factors $\mathbf{u}_i$ and $\mathbf{v}_j$ is a 0-mean Gaussian distribution with diagonal variance. In most models we assume that the priors have unit variance, but we allow some flexibility in setting the (spherical) variance in the prior. This has the effect of giving the KL-divergence in the ELBO a higher or lower weight, depending on whether the variance is decreased or increased, respectively, i.e. giving the approximate posterior more freedom to move away from the prior mean. Thus:

$$p(\mathbf{u}_i) = \mathcal{N}(\mathbf{u}_i \mid \mathbf{0}, \sigma_u^2 \mathbf{I}), \qquad p(\mathbf{v}_j) = \mathcal{N}(\mathbf{v}_j \mid \mathbf{0}, \sigma_v^2 \mathbf{I}) \tag{3.22}$$

where $\sigma_u$ and $\sigma_v$ are positive scalars.

We also use Gaussian recognition models. We use the notation $\boldsymbol{\mu}_{u_i}$ and $\boldsymbol{\sigma}_{u_i}^2$ to denote the mean and a vector representing the diagonal variance of the recognition model $q_{\phi_u}(\mathbf{u}_i \mid \mathbf{r}_i)$, respectively, and we use $\boldsymbol{\mu}_{v_j}$ and $\boldsymbol{\sigma}_{v_j}^2$ to denote the mean and a vector representing the diagonal variance of the recognition model $q_{\phi_v}(\mathbf{v}_j \mid \mathbf{r}_j)$, respectively. Thus:

$$q_{\phi_u}(\mathbf{u}_i \mid \mathbf{r}_i) = \mathcal{N}(\mathbf{u}_i \mid \boldsymbol{\mu}_{u_i}, \mathrm{diag}(\boldsymbol{\sigma}_{u_i}^2)) \tag{3.23}$$

$$q_{\phi_v}(\mathbf{v}_j \mid \mathbf{r}_j) = \mathcal{N}(\mathbf{v}_j \mid \boldsymbol{\mu}_{v_j}, \mathrm{diag}(\boldsymbol{\sigma}_{v_j}^2)) \tag{3.24}$$

Following the original AEVB algorithm we design the functions $g_{\phi_u}$ and $g_{\phi_v}$ such that they predict both the mean and the log of the diagonal standard deviation of each of the recognition models:

$$[\boldsymbol{\mu}_{u_i}, \log \boldsymbol{\sigma}_{u_i}] = g_{\phi_u}(\mathbf{r}_i) \tag{3.25}$$
$$[\boldsymbol{\mu}_{v_j}, \log \boldsymbol{\sigma}_{v_j}] = g_{\phi_v}(\mathbf{r}_j) \tag{3.26}$$

The functions $g_{\phi_u}(\cdot)$ and $g_{\phi_v}(\cdot)$ are defined as follows. Note that $\mathbf{h}_u$ and $\mathbf{h}_v$ are computed in the recognition model as they are in eq. (3.5).

$$
\begin{aligned}
\boldsymbol{\mu}_{u_i} &= \sigma(\mathbf{h}_u \mathbf{W}^{(2)}_{\mu_u} + \mathbf{b}^{(2)}_{\mu_u}) & \log \boldsymbol{\sigma}_{u_i} &= \mathbf{h}_u \mathbf{W}^{(2)}_{\sigma_u} + \mathbf{b}^{(2)}_{\sigma_u} \\
\boldsymbol{\mu}_{v_j} &= \sigma(\mathbf{h}_v \mathbf{W}^{(2)}_{\mu_v} + \mathbf{b}^{(2)}_{\mu_v}) & \log \boldsymbol{\sigma}_{v_j} &= \mathbf{h}_v \mathbf{W}^{(2)}_{\sigma_v} + \mathbf{b}^{(2)}_{\sigma_v}
\end{aligned}
\tag{3.27}
$$

Again, we allow for weight sharing in this model. When sharing weights between $g_{\phi_u}(\cdot)$ and $g_{\phi_v}(\cdot)$ we constrain $\mathbf{W}^{(2)}_{\mu_u} = \mathbf{W}^{(2)}_{\mu_v}$, $\mathbf{b}^{(2)}_{\mu_u} = \mathbf{b}^{(2)}_{\mu_v}$, $\mathbf{W}^{(2)}_{\sigma_u} = \mathbf{W}^{(2)}_{\sigma_v}$ and $\mathbf{b}^{(2)}_{\sigma_u} = \mathbf{b}^{(2)}_{\sigma_v}$. We refer to the shared weight matrices and bias vectors as $\mathbf{W}^{(2)}_{\mu}$, $\mathbf{W}^{(2)}_{\sigma}$, $\mathbf{b}^{(2)}_{\mu}$ and $\mathbf{b}^{(2)}_{\sigma}$, respectively.

The KL-divergences $D_{KL}(q_i \| p)$ and $D_{KL}(q_j \| p)$ can then be computed in closed form as functions of $\boldsymbol{\mu}_{u_i}$, $\boldsymbol{\sigma}_{u_i}$, $\boldsymbol{\mu}_{v_j}$ and $\boldsymbol{\sigma}_{v_j}$:

$$D_{KL}(q_i \| p) = \frac{1}{2} \sum_{d=1}^{D} \left[ \frac{\sigma_{u_i}^{(d)2}}{\sigma_u^2} + \frac{\mu_{u_i}^{(d)2}}{\sigma_u^2} + \log \sigma_u^2 - \log \sigma_{u_i}^{(d)2} - 1 \right] \tag{3.28}$$

$$D_{KL}(q_j \| p) = \frac{1}{2} \sum_{d=1}^{D} \left[ \frac{\sigma_{v_j}^{(d)2}}{\sigma_v^2} + \frac{\mu_{v_j}^{(d)2}}{\sigma_v^2} + \log \sigma_v^2 - \log \sigma_{v_j}^{(d)2} - 1 \right] \tag{3.29}$$

where $D$ is the dimensionality of the latent factors, and we use the notation $v^{(d)}$ to indicate the $d$th element of some vector $\mathbf{v}$, e.g. $\sigma_{u_i}^{(d)}$ indicates the $d$th element of the vector $\boldsymbol{\sigma}_{u_i}$. The equations simplify for those models where $\sigma_u$ and $\sigma_v$ are 1.
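A minimal numpy sketch of the closed-form KL-divergence of eqs. (3.28) and (3.29), assuming the posterior parameters have already been produced by the recognition model:

```python
import numpy as np

def kl_to_spherical_prior(mu, sigma, sigma_prior=1.0):
    """KL( N(mu, diag(sigma^2)) || N(0, sigma_prior^2 I) ),
    eqs. (3.28)-(3.29). `mu` and `sigma` are D-dimensional arrays of
    posterior means and standard deviations; `sigma_prior` is the
    scalar prior standard deviation."""
    var_ratio = sigma ** 2 / sigma_prior ** 2      # sigma_d^2 / sigma_p^2
    mean_term = mu ** 2 / sigma_prior ** 2         # mu_d^2 / sigma_p^2
    log_term = 2.0 * (np.log(sigma_prior) - np.log(sigma))
    return 0.5 * np.sum(var_ratio + mean_term + log_term - 1.0)
```

With `sigma_prior = 1` this reduces to the familiar $\frac{1}{2}\sum_d \big(\sigma_d^2 + \mu_d^2 - \log \sigma_d^2 - 1\big)$.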

Reparametrization Trick

Kingma and Welling's reparametrization trick [21] can now be applied to draw samples from $q_{\phi_u}(\mathbf{u}_i \mid \mathbf{r}_i)$ and $q_{\phi_v}(\mathbf{v}_j \mid \mathbf{r}_j)$:

$$\mathbf{u}_i^{(s)} = h(\boldsymbol{\epsilon}^{(s)}, \boldsymbol{\mu}_{u_i}, \boldsymbol{\sigma}_{u_i}) = \boldsymbol{\mu}_{u_i} + \boldsymbol{\sigma}_{u_i} \odot \boldsymbol{\epsilon}^{(s)} \tag{3.30}$$
$$\mathbf{v}_j^{(s)} = h(\boldsymbol{\epsilon}^{(s')}, \boldsymbol{\mu}_{v_j}, \boldsymbol{\sigma}_{v_j}) = \boldsymbol{\mu}_{v_j} + \boldsymbol{\sigma}_{v_j} \odot \boldsymbol{\epsilon}^{(s')} \tag{3.31}$$

where $\mathbf{u}_i^{(s)}$ and $\mathbf{v}_j^{(s)}$ are sample $s$ from $q_{\phi_u}(\mathbf{u}_i \mid \mathbf{r}_i)$ and $q_{\phi_v}(\mathbf{v}_j \mid \mathbf{r}_j)$ respectively, $\boldsymbol{\epsilon}^{(s)}$ and $\boldsymbol{\epsilon}^{(s')}$ are two independent samples of $\mathcal{N}(\mathbf{0}, \mathbf{I})$ of appropriate dimensionality, and $\odot$ denotes the Hadamard (elementwise) vector-vector product. The authors of [21] use the fact that a sample $\mathbf{x}$ from $\mathcal{N}(\mathbf{0}, \mathbf{I})$ can be transformed into a sample $\mathbf{y}$ from $\mathcal{N}(\boldsymbol{\mu}, \operatorname{diag}(\boldsymbol{\sigma}^2))$ with diagonal covariance by applying:

$$\mathbf{y} = \boldsymbol{\mu} + \operatorname{diag}(\boldsymbol{\sigma}^2)^{\frac{1}{2}} \mathbf{x} = \boldsymbol{\mu} + \operatorname{diag}(\boldsymbol{\sigma})\mathbf{x} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \mathbf{x} \tag{3.32}$$

Conditional Distributions

We propose two conditional distributions $p_\theta(r_{ij} \mid \mathbf{u}_i, \mathbf{v}_j)$.

Gaussian Model The first model is a Gaussian model where

$$p_\theta(r_{ij} \mid \mathbf{u}_i, \mathbf{v}_j) = \mathcal{N}(r_{ij} \mid \mu_{r_{ij}}, \sigma_{r_{ij}}^2) \tag{3.33}$$

$\mu_{r_{ij}}$ and $\sigma_{r_{ij}}$ are computed by $f_\theta(\cdot, \cdot)$ from $\mathbf{u}_i$ and $\mathbf{v}_j$. Note that to compute $\mu_{r_{ij}}$ we can use any of the decoder models proposed in Section 3.1.2. For the variance $\sigma_{r_{ij}}^2$ we can either use a global variance $\sigma$ that is either fixed or learnable, or we can define a new function of $\mathbf{u}_i$ and $\mathbf{v}_j$ that computes $\log \sigma_{r_{ij}}$.

Multinomial Model The second model is a multinomial model where $p_\theta(r_{ij} \mid \mathbf{u}_i, \mathbf{v}_j)$ is computed as in eqs. (3.12) to (3.14).
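A minimal sketch of the reparametrized sampling of eqs. (3.30) to (3.32); the random generator and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparametrized_sample(mu, sigma):
    """Draw one sample from N(mu, diag(sigma^2)) as mu + sigma * eps,
    eps ~ N(0, I) (eq. 3.32). All randomness lives in eps, so gradients
    can flow through mu and sigma during training."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

# One sample each for a user and an item, with independent noise:
# u_s = reparametrized_sample(mu_ui, sigma_ui)
# v_s = reparametrized_sample(mu_vj, sigma_vj)
```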

SGVB Estimators

The SGVB estimator for this model is then:

$$
\begin{aligned}
\mathcal{L}(\theta, \phi_u, \phi_v; \mathcal{D}) = &-\sum_i D_{KL}(q_i \| p(\mathbf{u}_i)) - \sum_j D_{KL}(q_j \| p(\mathbf{v}_j)) \\
&+ \sum_{(i,j,r_{ij}) \in \mathcal{D}} \frac{1}{S} \sum_{s=1}^{S} \log p\big(r_{ij} \mid h(\boldsymbol{\epsilon}^{(s)}, \boldsymbol{\mu}_{u_i}, \boldsymbol{\sigma}_{u_i}), h(\boldsymbol{\epsilon}^{(s')}, \boldsymbol{\mu}_{v_j}, \boldsymbol{\sigma}_{v_j})\big)
\end{aligned}
\tag{3.34}
$$

where samples $\mathbf{u}_i^{(s)}$ and $\mathbf{v}_j^{(s)}$ are drawn using eqs. (3.30) and (3.31) respectively. The derivatives of eq. (3.34) can be found in Appendix B for both conditional distributions introduced in this section.

3.3 Matrix Factorizing Graph Autoencoder

We now introduce the Matrix Factorizing Graph Autoencoder (MFGAE). Its variational counterpart, the Matrix Factorizing Variational Graph Autoencoder (MFVGAE), is introduced in the next section. Like, e.g., [10], we treat the ratings data as a bipartite graph. The dataset $\mathcal{D}$ is interpreted as a set of weighted edges $(i, j, r_{ij})$, where $i \in \mathcal{U}$ is a user node, $j \in \mathcal{V}$ is an item node and $r_{ij}$ is the weight of the edge between $i$ and $j$. We assign individual user nodes an integer $1 \ldots N_u$ and individual item nodes an integer $1 \ldots N_v$. We always state clearly whether an integer refers to a user node or an item node. The variable $i$ is always used to indicate user nodes, whereas $j$ is always used to indicate item nodes. The adjacency matrix for $\mathcal{D}$ is the $N \times N$ matrix $\mathbf{A}$, where $N = N_u + N_v$ is the sum of the number of users and the number of items. The rows and columns of $\mathbf{A}$ are ordered such that rows and columns $1 \ldots N_u$ refer to users $1 \ldots N_u$, and rows and columns $N_u + 1 \ldots N$ refer to items $1 \ldots N_v$. Furthermore, if there is an observed rating $r_{ij}$ from user $i$ for item $j$ in $\mathcal{D}$, there is a weighted edge connecting user $i$ to item $j$ in the rating data graph, and we

use $r_{ij}$ as the edge weight. In the adjacency matrix $\mathbf{A}$ for the rating graph we store the edge weight of each edge: $\mathbf{A}_{i,\,j+N_u} = \mathbf{A}_{j+N_u,\,i} = r_{ij}$. If there is no edge connecting user node $i$ and item node $j$ then $\mathbf{A}_{i,\,j+N_u} = \mathbf{A}_{j+N_u,\,i} = 0$. The matrix $\hat{\mathbf{A}}$ is then constructed as described in Section 2.3. We assume that no user or item features are available and therefore replace the feature matrix $\mathbf{X}$ in the input layer of the GAE model of Kipf and Welling [22] (see Section 2.3) by the $N \times N$ identity matrix. Since the identity matrix has no influence on the resulting formulae, we leave it implicit from here on. We define two encoders.

Encoder 1:
$$\mathbf{H} = \sigma_1(\hat{\mathbf{A}}\mathbf{W}^{(1)} + \mathbf{b}^{(1)}) \qquad \mathbf{Z} = \sigma_2(\hat{\mathbf{A}}\mathbf{H}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}) \tag{3.35}$$

and Encoder 2:
$$\mathbf{H} = \sigma_1(\hat{\mathbf{A}}\mathbf{W}^{(1)} + \mathbf{b}^{(1)}) \qquad \mathbf{Z} = \sigma_2(\mathbf{H}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}) \tag{3.36}$$

where $\mathbf{W}^{(1)}$ is an $N \times D_h$-dimensional weight matrix, $\mathbf{W}^{(2)}$ is a $D_h \times D$-dimensional weight matrix, $\mathbf{b}^{(1)}$ is a $D_h$-dimensional bias vector and $\mathbf{b}^{(2)}$ is a $D$-dimensional bias vector, and $D_h$ and $D$ are the dimensionalities of the hidden layer and the latent factors respectively. $\sigma_1(\cdot)$ and $\sigma_2(\cdot)$ are again used to indicate arbitrary activation functions. $\mathbf{H}$ denotes a matrix containing the convolved features of one node on each row, and $\mathbf{Z}$ denotes the matrix of latent user and item representations. Note that the only difference between the two encoders is that $\hat{\mathbf{A}}$ is not applied in the second layer of Encoder 2.

Our reasoning behind Encoder 2 is as follows. One graph convolution layer convolves the first-order neighborhood of each node. In our model this means that for each node, only information from the edges to its direct neighbors is accumulated. Since the graph is bipartite, information is propagated from one type of node (e.g. users) to the other type of node (e.g. items). In the decoder there is another step of information propagation: when nodes are combined to predict a rating, information again propagates from one type of node to the other type of

node. This means that with an odd number of convolution operations in the encoder, information is propagated from one type of node to the same type of node in the full MFGAE architecture. If an even number of convolution operations were applied in the encoder, information would be propagated from user nodes to item nodes. We note that the shape of $\mathbf{Z}$ is $N \times D$ and that the order of the rows has not changed between steps. This means that we can interpret rows $1 \ldots N_u$ as encoded versions of the corresponding user nodes, and rows $N_u + 1 \ldots N$ as encoded versions of the corresponding item nodes. We then define

$$\mathbf{U} = \mathbf{Z}[1 \ldots N_u] \tag{3.37}$$
$$\mathbf{V} = \mathbf{Z}[N_u + 1 \ldots N] \tag{3.38}$$

where we use the notation $\mathbf{M}' = \mathbf{M}[a \ldots b]$ to indicate a matrix $\mathbf{M}'$ consisting of rows $a$ through $b$ of matrix $\mathbf{M}$. Thus row $i$ of $\mathbf{U}$ corresponds to row $i$ of $\mathbf{Z}$, which is a representation of the node associated with user $i$. Row $j$ of $\mathbf{V}$ corresponds to row $j + N_u$ of $\mathbf{Z}$, which is a representation of the node associated with item $j$. This formulation allows us to use the same decoders as the MFAE model (Section 3.1.2).

Relationship to MFAE

There is a strong similarity between Encoder 2 described in eq. (3.36), containing a single convolutional layer, and the simplified MFAE model described in eq. (3.7). To show this we introduce the following definitions:

$\tilde{\mathbf{r}}_i$ : Sparse vector of scaled ratings from user $i$ in $\hat{\mathbf{A}}$ (row $i$ of $\hat{\mathbf{A}}$), $1 \le i \le N_u$
$\tilde{\mathbf{r}}_j$ : Sparse vector of scaled ratings for item $j$ in $\hat{\mathbf{A}}$, $1 \le j \le N_v$ (or $N_u < j \le N$)
$\bar{\mathbf{r}}_i$ : Vector containing elements $N_u + 1 \ldots N$ of row $i$ of $\hat{\mathbf{A}}$; $\bar{r}_{ij} = \hat{A}_{i,\,j+N_u}$
$\bar{\mathbf{r}}_j$ : Elements $1 \ldots N_u$ of row $j$ of $\hat{\mathbf{A}}$
$\mathbf{h}^{(i)}$ : The activated values of the hidden layer for an input $\tilde{\mathbf{r}}_i$ for user $i$

$\tilde{\mathbf{r}}_i$ and $\tilde{\mathbf{r}}_j$ are similar to but different from $\mathbf{r}_i$ and $\mathbf{r}_j$. First, from the definition of $\hat{\mathbf{A}}$ we have that $\tilde{r}_{ij} = \tilde{r}_{ji} = \frac{r_{ij}}{\sqrt{D_{ii} D_{jj}}}$. Second, the normalized edge weights for user $i$ are stored in elements $N_u + 1 \ldots N$. The first $N_u$ elements of $\tilde{\mathbf{r}}_i$ are 0, except for element $i$. The value of this element is $\frac{1}{D_{ii}}$. Third, the normalized edge weights for item $j$ are stored in elements $1 \ldots N_u$. The final $N_v$ elements of $\tilde{\mathbf{r}}_j$ are 0, except for the element corresponding to item $j$ itself. The value of this element is $\frac{1}{D_{jj}}$.

Then, focusing on element $n$ of the hidden layer for user $i$, we see that

$$h_n^{(i)} = \sigma_1\left(\sum_{k=1}^{N} \tilde{r}_{ik} W_{kn}^{(1)} + b_n^{(1)}\right) \tag{3.39}$$

Since the first $N_u$ elements of $\tilde{\mathbf{r}}_i$, except for element $i$, are 0, we can write this product as:

$$h_n^{(i)} = \sigma_1\left(\sum_{k=N_u+1}^{N} \tilde{r}_{ik} W_{kn}^{(1)} + \frac{W_{in}^{(1)}}{D_{ii}} + b_n^{(1)}\right) \tag{3.40}$$

We let $\mathbf{W}_v^{(1)} = \mathbf{W}^{(1)}[1 \ldots N_u]$ and $\mathbf{W}_u^{(1)} = \mathbf{W}^{(1)}[N_u + 1 \ldots N]$. We can then write

$$\mathbf{h}^{(i)} = \sigma_1\left(\bar{\mathbf{r}}_i^T \mathbf{W}_u^{(1)} + \frac{[\mathbf{W}_v^{(1)}]_i}{D_{ii}} + \mathbf{b}^{(1)}\right) \tag{3.41}$$

where $[\mathbf{W}_v^{(1)}]_i$ indicates row $i$ of $\mathbf{W}_v^{(1)}$. There is an analogous definition of the activated hidden layer values $\mathbf{h}^{(j)}$ for item $j$.

Recall from eq. (3.5) that the hidden layer activation for the user rating

encoding is

$$\sigma_1\left(\mathbf{r}_i \mathbf{W}_u^{(1)} + \mathbf{b}_u^{(1)}\right) \tag{3.42}$$

The differences between the activations $\mathbf{h}^{(i)}$ in the MFAE model and in the GAE model are thus caused by:

1. A scaling of the sparse user rating vector, such that $\bar{r}_{ij} = \frac{r_{ij}}{\sqrt{D_{ii} D_{jj}}}$.

2. The addition of a per-user bias vector, scaled by the inverse of that user's degree $D_{ii}$. The bias vector added for user $i$ is row $i$ of $\mathbf{W}_v^{(1)}$, i.e. it is shared with the item encoder's weights.

Finally we note that all hidden layers, for users and for items, are multiplied by the same weight matrix $\mathbf{W}^{(2)}$. This is analogous to the weight sharing of the simplified MFAE model.

3.4 Matrix Factorizing Variational Graph Autoencoder

It is now straightforward to extend the Matrix Factorizing Graph Autoencoder to the Matrix Factorizing Variational Graph Autoencoder (MFVGAE). We use a Gaussian approximate posterior $q_\phi(\mathbf{Z} \mid \mathbf{X})$, where $\mathbf{X}$ indicates the matrix of user and item features, i.e. the identity matrix in our case. Thus for a single row $\mathbf{z}$ of $\mathbf{Z}$ we have that:

$$q(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\mathbf{z} \mid \boldsymbol{\mu}, \operatorname{diag}(\boldsymbol{\sigma}^2)) \tag{3.43}$$

The computation of $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ for Encoder 1 and Encoder 2 is described below.

Recognition model 1 The mean and diagonal variance vector for a single row of $\mathbf{Z}$ under Encoder 1 are computed as follows:

$$\boldsymbol{\mu} = \sigma_2(\hat{\mathbf{A}}\mathbf{h}\mathbf{W}_\mu^{(2)} + \mathbf{b}_\mu^{(2)}) \tag{3.44}$$
$$\log \boldsymbol{\sigma} = \sigma_2(\hat{\mathbf{A}}\mathbf{h}\mathbf{W}_\sigma^{(2)} + \mathbf{b}_\sigma^{(2)}) \tag{3.45}$$
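For concreteness, a dense numpy sketch of the normalized adjacency matrix and the two encoders of eqs. (3.35) and (3.36) follows. The exact construction of $\hat{\mathbf{A}}$ is the one of Section 2.3, which is not restated in this chapter, so the self-connections and normalization below ($\hat{\mathbf{A}} = \mathbf{D}^{-1/2}(\mathbf{A} + \mathbf{I})\mathbf{D}^{-1/2}$) should be read as an assumption consistent with the $\frac{1}{D_{ii}}$ self-weights derived in Section 3.3; a sparse implementation would be needed for real datasets.

```python
import numpy as np

def normalized_adjacency(A):
    """A_hat = D^{-1/2} (A + I) D^{-1/2} for the (N x N) weighted
    bipartite adjacency matrix A (assumed construction, cf. Section 2.3)."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def encode(A_hat, W1, b1, W2, b2, sigma1=np.tanh, sigma2=np.tanh,
           convolve_second_layer=True):
    """Encoder 1 (eq. 3.35) when convolve_second_layer=True, Encoder 2
    (eq. 3.36) otherwise. The feature matrix X is the identity, so the
    first layer reduces to A_hat @ W1."""
    H = sigma1(A_hat @ W1 + b1)
    Z = sigma2((A_hat @ H if convolve_second_layer else H) @ W2 + b2)
    return Z

# U, V = Z[:n_users], Z[n_users:]   # eqs. (3.37) and (3.38)
```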

Recognition model 2 The mean and diagonal variance vector for a single row of $\mathbf{Z}$ under Encoder 2 are computed as follows:

$$\boldsymbol{\mu} = \sigma_2(\mathbf{h}\mathbf{W}_\mu^{(2)} + \mathbf{b}_\mu^{(2)}) \tag{3.46}$$
$$\log \boldsymbol{\sigma} = \sigma_2(\mathbf{h}\mathbf{W}_\sigma^{(2)} + \mathbf{b}_\sigma^{(2)}) \tag{3.47}$$

where we use $\mathbf{h}$ to denote a row of $\mathbf{H}$, which is computed as in eqs. (3.35) and (3.36). For a sample $\mathbf{Z}^{(s)}$ from the approximate posterior distribution, we construct $\mathbf{U}^{(s)}$ and $\mathbf{V}^{(s)}$ as in eqs. (3.37) and (3.38).

Conditional models For this model we can use the same conditional distributions as in the MFVAE model introduced in Section 3.2. For convenience we reproduce these models here:

Gaussian Model The first model is a Gaussian model where

$$p_\theta(r_{ij} \mid \mathbf{u}_i, \mathbf{v}_j) = \mathcal{N}(r_{ij} \mid \mu_{r_{ij}}, \sigma_{r_{ij}}^2) \tag{3.48}$$

$\mu_{r_{ij}}$ and $\log \sigma_{r_{ij}}$ are computed by $f_\theta(\cdot, \cdot)$ from $\mathbf{u}_i$ and $\mathbf{v}_j$. Note that to compute $\mu_{r_{ij}}$ we can use any of the decoder models proposed in Section 3.1.2. For the variance $\sigma_{r_{ij}}^2$ we can either use a global variance $\sigma$ that is either fixed or learnable, or we can define a new function of $\mathbf{u}_i$ and $\mathbf{v}_j$ that computes $\log \sigma_{r_{ij}}$.

Multinomial Model The second model is a multinomial model where $p_\theta(r_{ij} \mid \mathbf{u}_i, \mathbf{v}_j)$ is computed as in eqs. (3.12) to (3.14).
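Eqs. (3.12) to (3.14) are not restated in this chapter; purely as an illustration of the multinomial decoder family used here (cf. the MMDP decoder with diagonal matrices $\mathbf{M}^{(k)}$ discussed in Section 4.4), a sketch could look as follows. The shapes and names are assumptions, not the thesis's exact parametrization.

```python
import numpy as np

def multinomial_decoder(u, v, M_diag):
    """One score per rating level k via the diagonal bilinear form
    u^T M^(k) v, softmaxed over the K levels. `M_diag` has shape (K, D):
    row k holds the diagonal of M^(k)."""
    logits = M_diag @ (u * v)              # (K,) scores
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    return probs / probs.sum()

def expected_rating(probs, levels=np.arange(1, 6)):
    """Prediction E[r_hat] = sum_k k * p(r_hat = k | u, v), as used in
    the experiments of Section 4.2."""
    return float(levels @ probs)
```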

Chapter 4

Experiments, Results and Analysis

4.1 Experimental Setup

This section describes the different datasets used in our experiments, the preprocessing applied to the datasets, benchmark results on the different datasets, and the experimental setup that we used to test the models presented in this thesis.

4.1.1 Datasets

This section describes the datasets used in our experiments. The Movielens and Netflix datasets are benchmark datasets, and only basic statistics of these datasets are given. The NOS dataset presented at the end of this section is a proprietary dataset; for this reason the creation of this dataset is described in more detail. Table 4.1 shows basic statistics for the different datasets used in our experiments. We distinguish between rating-based and binary datasets. Rating-based datasets contain explicit user-item ratings on a scale of 1 to 5. Binary datasets contain implicit user-item preference judgements based on website clicks. This section discusses the raw datasets. The subsections of this section

briefly describe the publicly available benchmark datasets and extensively describe the way the proprietary dataset was created.

Movielens

The Movielens datasets [29] contain ratings for movies from users of the Movielens website¹. Movielens is a website that provides non-commercial, personalized movie recommendations. Users enter ratings for movies and the website recommends movies for them to watch based on the ratings they entered. The ratings in the Movielens dataset are gathered by the Grouplens organization. The datasets only contain ratings from users who have rated at least 20 movies. Grouplens does not report a (lower or upper) limit on the number of ratings that movies must have received in order to appear in the dataset. The dataset consists of ratings of the form

(user_id, movie_id, rating, timestamp)

where user_id and movie_id are unique identifiers for users and movies respectively, rating is an integer rating on a 1-5 scale, and timestamp is a UNIX timestamp, i.e. an integer indicating the time of the rating in seconds since 1970-01-01 (the UNIX epoch).

¹https://movielens.org

Table 4.1: Basic dataset statistics.

Name             Type     N_r          N_u       N_v
Movielens 100K   Rating   100,000      943       1,682
Movielens 1M     Rating   1,000,209    6,040     3,706
Movielens 10M    Rating   10,000,054   69,878    10,677
Netflix          Rating   100,480,507  480,189   17,770
NOS 300K         Binary   341,829      4,302     3,338
NOS 30M          Binary   33,066,…     …,400     3,424

Netflix Prize Dataset

The Netflix Prize dataset was compiled in 2006 by Netflix for the Netflix Prize [4]. The dataset consists of a training set of approximately 100 million ratings from 480 thousand users for 17,770 movies, and a qualifying set of 3 million ratings². There are no new users or items in the qualifying set, i.e. all users and items that appear in the qualifying set also appear in the training data. Ratings are on a scale of 1-5, and are only provided for the training set. In this thesis the qualifying set is ignored. The set of all ratings in the training and the qualifying sets is a subset of the roughly 1.2 billion ratings gathered by Netflix between October 1998 and December 2005. The dataset contains ratings from a random selection of users who had rated at least 20 movies. The ratings are randomly perturbed in a way that, according to the authors, does not change the overall statistics of the Prize dataset [4]. However, Netflix does disclose that their perturbation process can change the number of ratings for a user. The perturbation is performed to protect user information [4]. The process by which the perturbation is performed is not disclosed by Netflix. The qualifying set was created by selecting the 9 most recent ratings from a random selection of users. If a randomly selected user had fewer than 18 ratings (due to the perturbation), the most recent 50% of their ratings were used in the qualifying set. The combined effect of the perturbation and the random selection for the qualifying set is that it is not guaranteed that a user in the training set has 20 ratings or more.

The Netflix Prize was awarded to the team that was able to improve the Root Mean Squared Error (RMSE) on the test set by 10% relative to Netflix's own Cinematch system, which scores an RMSE of 0.9525 on that set. The target score on the Netflix test set is thus 0.8572.

²In fact, the training set subsumes a probe set that researchers can use to evaluate the performance of their algorithm. The qualifying set consists of a quiz set and a test set. When researchers submitted ratings to the Netflix Prize website they received the RMSE on the quiz set; the RMSE on the test set was only known to Netflix and was used to determine the final RMSE score.

NOS Click Data

To investigate whether our models work on data different from rating data, we used a dataset of (anonymized) clicks and views on news articles. This data was gathered on the website of the Nederlandse Omroep Stichting (NOS)³, a Dutch news organization that is part of the Dutch public broadcasting system (NPO, Nederlandse Publieke Omroep). This dataset was provided by Scyfer B.V.⁴ with permission from the NOS. The data was used in a pilot implementation project of a recommendation system for the website of the NOS and covers the period between February 1st and February 29th, 2016. Since this is a proprietary dataset that has not been described in any publicly available document, its creation is described in more detail in this section. There are several aspects relevant to the dataset and its creation to consider:

1. The desktop version of the NOS homepage shows two highlighted articles and twelve featured articles. The editorial staff of the NOS decides which articles are considered highlighted or featured articles. For the highlighted articles only the title and a large accompanying image are shown. For the featured articles the title, a smaller photograph and a short summary are shown. Occasionally non-article items (e.g. video content or live blogs of current events) are featured or highlighted. Not all articles published on the NOS website are highlighted or featured. Articles are usually highlighted or featured soon after being created. Besides the highlighted and featured articles, the desktop version of the homepage contains a recent article feed on the right side of the screen. It is possible that an article appears in both the article feed and the highlighted or featured articles set at the same time. The mobile version of the NOS website does not contain a recent articles feed.

2. When a user clicks on the link to a news article, the user is taken to the news article's page. This page contains the text of the article and some

related content (e.g. images or videos). The article-specific content is followed by a list of six article links, comprising the two highlighted articles and the first four featured articles. The article links show the article title, a short summary, and an accompanying image.

The unprocessed data in this dataset consists of events of the form

(user_id, article_id, timestamp, click_or_view)

user_id is a meaningless (i.e. anonymized) unique identifier that is derived from a unique visitor id and cannot be traced back to any individual or IP-address. article_id is a unique (but anonymized) identifier for an article in the NOS article catalog. timestamp is a UNIX timestamp (i.e. seconds since 1970-01-01), and click_or_view is a variable that indicates whether the event is a click on an article or a view.

To keep the dataset manageable we retained only the following events:

1. Clicks on articles (i.e. not clicks on e.g. live blogs) that were featured or highlighted at the time of the click. Since articles are usually highlighted or featured shortly after being created, we retained all clicks on articles that occurred from the time they were created until the time they were no longer featured.

2. Views of articles that were featured or highlighted. Note that these events only occur when an item is featured or highlighted.

We created a binary dataset out of the retained events by counting clicks as positive feedback and views that were never followed by a click as negative feedback. This was done by the following procedure:

1. If a user with user_id i has ever seen an article with article_id j in the highlighted or featured block, then a triple with a negative response (i, j, −1) is added to the dataset.

2. If a user with user_id i has clicked an article with article_id j in the period between the article's creation and its removal from the featured

or highlighted set, a triple with a positive response (i, j, 1) is added to the dataset. If a triple (i, j, −1) was already present in the dataset then the negative triple is removed⁵. This has the effect that no triple with the same user/article combination occurs more than once in the dataset.

3. In the final dataset we retained only those users who have more than 5 positive responses and more than 5 negative responses.

The reason we added negative responses is that without them a network would learn that the correct prediction for every user/item combination is positive. While this would increase the recall of our system tremendously, it would severely harm precision. This is a general issue with rating prediction for implicit feedback datasets [18].

4.1.2 Evaluation Metrics

Root Mean Squared Error For our experiments we measure performance with the Root Mean Squared Error (RMSE) between the target ratings $r_{ij}$ in dataset $\mathcal{D}$ and the predicted ratings $\hat{r}_{ij}$ in the set of predictions $\hat{\mathcal{D}}$:

$$\text{RMSE}(\mathcal{D}, \hat{\mathcal{D}}) = \sqrt{\frac{1}{|\mathcal{D}|} \sum_{r_{ij} \in \mathcal{D}} (r_{ij} - \hat{r}_{ij})^2} \tag{4.1}$$

The RMSE is not a perfect measure for assessing the performance of a recommendation system. Since one uses recommendation systems to learn which items are relevant to a user, it would in theory be better to define some score on the predicted ranking of the items. However, the RMSE as a measure of a recommendation system's performance is very widespread in the recommendation systems literature as a result of the Netflix Prize, which awarded the prize based on an RMSE target score. As a result, the RMSE score of a new recommendation system is mentioned in nearly every

⁵This triple is not necessarily present in the dataset, for example when a user i clicks on an article link for article j from a source outside of the NOS, e.g. Facebook.

current publication on rating-prediction recommendation systems, in particular MF-based recommendation systems. It is important to note that the RMSE score depends on the scale of the ratings. For this thesis this implies that RMSE scores on the NOS datasets cannot be directly compared to the scores on the Movielens datasets, as the NOS datasets are binary whereas the Movielens datasets contain ratings on a 1 to 5 scale.

ROC-AUC For the NOS data we also report the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve. The ROC is used in binary classification problems to measure how the scores assigned by an algorithm to positive examples compare to the scores given by the same algorithm to negative examples. This is done as follows. We assume the algorithm assigns scores between 0 and 1 to examples, but the general idea holds for other scores. We then define a threshold t. All examples with scores > t are predicted as positive, while all examples with scores ≤ t are predicted as negative. A True Positive (TP) prediction is an example that is predicted positive and whose ground truth value is positive. A False Positive (FP) prediction is an example that is predicted positive but whose ground truth value is negative. The TP-rate TPR_t for a threshold t is then defined as TPR_t = TP_t / P, where TP_t indicates the number of TP examples at threshold t, and P indicates the number of all positive examples. Likewise, the FP-rate FPR_t for a threshold t is computed as FPR_t = FP_t / N, where FP_t indicates the number of FP examples at threshold t and N indicates the number of all negative examples.

The ROC curve shows how the TP and FP rates change as the threshold t is lowered from 1 to 0. A mock example of an ROC curve for two fictional systems is given in fig. 4.1. The horizontal axis indicates the FPR, while the vertical axis indicates the TPR, both ranging from 0 to 1. If t = 1, then the TPR and FPR will both be 0. When t is decreased, either the TPR will increase, which happens when the lower threshold causes positive

Figure 4.1: Example of an ROC curve. The system that produced the blue line performed better than the system that produced the red line.

examples to be predicted to have the positive label; or the FPR will increase, which happens when the lower threshold causes negative examples to be predicted to have the positive label; or both. In case the examples are perfectly ranked by the algorithm, the TPR will increase while the FPR stays 0, until the threshold has reached the score of the lowest-ranked positive example, at which point the TPR is 1 and the FPR is 0. In this case the ROC curve passes through coordinate (0, 1). In case the examples are ordered randomly, the TPR and the FPR will increase at the same rate, giving a line through the diagonal of the TPR/FPR coordinate system. The Area Under the (ROC-)Curve then gives a global measure of how well the algorithm ranks examples. The ROC-AUC score indicates how closely the ROC curve approximates the curve of a perfectly ranked dataset. An algorithm that perfectly ranks all examples will have an AUC of 1, while an algorithm that randomly ranks examples will have an AUC of 0.5. Note that an AUC of 0 means that the algorithm perfectly ranks the examples in reverse order. The worst achievable score for the AUC-ROC is thus 0.5. The AUC scores for our mock example in fig. 4.1 can be found in the legend of the figure.
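For reference, a short sketch of the two metrics used in this chapter. The RMSE follows eq. (4.1) directly; for the ROC-AUC we call scikit-learn's roc_auc_score, which is a tooling assumption (the thesis does not state which implementation was used).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def rmse(targets, predictions):
    """Root Mean Squared Error, eq. (4.1)."""
    targets, predictions = np.asarray(targets), np.asarray(predictions)
    return float(np.sqrt(np.mean((targets - predictions) ** 2)))

def roc_auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    positive example is scored above a randomly chosen negative one.
    `labels` are 0/1 ground truths, `scores` the predicted preferences."""
    return roc_auc_score(labels, scores)
```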

4.1.3 Current State of the Art

In this section we present State of the Art (SOTA) or near-SOTA results against which we will compare the results of our algorithms. The results are summarized in tables 4.2 and 4.3. We compare our results against the Variational Bayesian Matrix Factorization algorithm of Lim and Teh [28] (see Chapter 2 for details); the I-CF-NADE-S and U-CF-NADE-S algorithms presented by Zheng et al. [48] (see Chapter 5 for details); the global and local implementations of the LLORMA algorithm of Lee et al. [27] (see Chapter 5 for details); the Neural Network Matrix Factorization algorithm of Dziugaite and Roy [19] (see Chapter 5 for details); and the User- and Item-based Hybrid CF Autoencoders of Strub et al. [41], described in detail in Chapter 2.

We ran benchmark experiments on the datasets that we ourselves preprocessed, to ensure a fair comparison of the results of our models with the results of the best scoring models. Due to resource limitations we were only able to run the benchmark experiments described below. All other benchmark scores are taken from the paper in which the algorithm was described.

CF-NADE To run the experiments we used the code made publicly available by the authors⁶. We ran this code on the Movielens 100K and 1M datasets. Training this model on the other datasets required too much time. We modified the data preprocessing code to process the datasets that were used in all of our experiments. For each dataset on which we ran experiments we tried to use the parameters cited by the authors. We used two hidden layers of 500 nodes, with tanh(·) activations. We trained the model for 1500 epochs. Otherwise, we used the default parameters the authors recommend in the paper or in their Github repository.

Variational Bayesian Matrix Factorization We implemented the VBMF algorithm ourselves. We used the memory-efficient updates described in sections 4.1 and 4.2 of Lim and Teh's KDDCup paper [28]. We trained the

Algorithm     ml100k   ml1m    ml10m   Netflix
U-CF-NADE     …        …       …*      …*
I-CF-NADE     …        …       …       …
LLORMA-G      …        …*      …*      …*
LLORMA-L      …        …*      …*      …*
NNMF          …*       …*      -       -
U-CFN         …        …*      …*      -
V-CFN         …        …*      …*      -
VBMF          …        …       …       …

Table 4.2: State of the art or near state of the art RMSE scores for all rating-based datasets used in our experiments, from a number of algorithms. Scores marked with an asterisk (*) are taken from the paper in which the algorithm was originally described. Scores not marked with an asterisk were reproduced by us. We refer the reader to the main text of this thesis for an explanation of each of the algorithms. The state of the art for each dataset is marked in boldface.

Algorithm       NOS300k   NOS30M
VBMF RMSE       …         …
VBMF ROC-AUC    …         …

Table 4.3: VBMF baseline RMSE and ROC-AUC scores for the NOS binary datasets used in our experiments. All scores in this table are produced by us. We refer the reader to the main text of this thesis for an explanation of each of the algorithms.

model with 25-dimensional latent factors.

4.1.4 Preprocessing

The current state of the art is held by the CF-NADE algorithm of Zheng et al. [48]. We therefore chose to preprocess our data the same way Zheng et al. did, to be able to compare our results more closely to theirs. To ensure that our datasets are the same as theirs, we load the data in an equivalent way to Zheng et al., we use the same randomization method that they use (the random.shuffle function in Python 2.7), and we seed the random number generator with the same seeds⁷.

⁷We do reimplement the dataset creation procedure in a more efficient way than the authors do. However, the differences between the way we construct our datasets and the way Zheng et al. do, do not affect the ordering and splits of the ratings in the final datasets.

Each dataset is split into train, validation, and test sets, indicated henceforth by $\mathcal{D}_{train}$, $\mathcal{D}_{val}$, and $\mathcal{D}_{test}$, respectively. The full dataset is denoted by $\mathcal{D}_{total}$, to distinguish it from the general token $\mathcal{D}$ used in previous sections. The sets have the following proportions relative to $\mathcal{D}_{total}$:

1. $\mathcal{D}_{test}$: 10% of $\mathcal{D}_{total}$.

2. $\mathcal{D}_{val}$: For the larger datasets, i.e. ≥ 1M ratings, 0.45% of $\mathcal{D}_{total}$ is used. For the two smaller datasets (ML100K and NOS300K) 4.5% of $\mathcal{D}_{total}$ is used.

3. $\mathcal{D}_{train}$: The above implies that for the larger datasets, i.e. ≥ 1M ratings, 89.55% of $\mathcal{D}_{total}$ is used. For the two smaller datasets 85.5% of $\mathcal{D}_{total}$ is used.

The datasets are created such that each of the subsets is disjoint from the other two, and the three subsets combined contain all ratings:

$$(\mathcal{D}_{train} \cap \mathcal{D}_{val}) \cup (\mathcal{D}_{train} \cap \mathcal{D}_{test}) \cup (\mathcal{D}_{val} \cap \mathcal{D}_{test}) = \emptyset, \qquad \mathcal{D}_{train} \cup \mathcal{D}_{val} \cup \mathcal{D}_{test} = \mathcal{D}_{total} \tag{4.2}$$

We perform no other preprocessing on the datasets.

4.1.5 Connectivity

In this section we briefly discuss the connectivity of the graphs formed by the rating data. We used NetworkX [13] to determine the connectivity of each of the subsets (training set, validation set and test set). The number of connected components can be found in table 4.4. A value of 1 in this table means that the graph is fully connected.

4.2 Experiments

In this section we describe the different experiments: details of the experimental setup for each experiment, the dataset(s) on which the experiments were

Dataset    D_train   D_val   D_test
ml100k     …         …       …
ml1m       …         …       …
ml10m      …         …       …
Netflix    …         …       …
nos300k    …         …       …
nos30m     …         …       …

Table 4.4: Connectivity

run, and the motivation for the experiments and the hypothesis (or hypotheses) we aimed to test. The code used for running our experiments is written in Python and is freely available at …/recommendation. The code uses Theano [43] for automatic gradient computation.

4.2.1 Experiment 1: Model Performance

In this experiment we train each model on each of the datasets with the hyperparameters that yielded the best validation set error for each combination of model and dataset. The search procedure for the best hyperparameter settings for each of the models and each of the datasets is presented in Experiment 2 below. After each model has converged, we use the data in the training set $\mathcal{D}_{train}$ to infer the latent factor matrices $\mathbf{U}$ and $\mathbf{V}$ for the MFAE and MFGAE models, and the parameters of the variational posterior distributions $q_{\phi_u}(\mathbf{u}_i \mid \mathbf{r}_i)$ and $q_{\phi_v}(\mathbf{v}_j \mid \mathbf{r}_j)$ for the MFVAE and MFVGAE models. From these factors we predict ratings for the user and item pairs in the test set. From the predicted ratings we compute the RMSE scores as a measure of the performance of our models. As stated before, we also compute the ROC-AUC score for the binary datasets. In the variational models we use a fixed variance of 1 for the prior distributions over the latent factors. We also set the number of samples drawn for each user and item to 1. We use the following predictive distribution:

$$p(\hat{r}_{ij} \mid \mathcal{D}) = \int p(\hat{r}_{ij} \mid \mathbf{u}_i, \mathbf{v}_j)\, q_{\phi_u}(\mathbf{u}_i \mid \mathbf{r}_i)\, q_{\phi_v}(\mathbf{v}_j \mid \mathbf{r}_j)\, d\mathbf{u}_i\, d\mathbf{v}_j \approx \frac{1}{S} \sum_{s=1}^{S} p_\theta(\hat{r}_{ij} \mid \mathbf{u}_i^{(s)}, \mathbf{v}_j^{(s)})$$

where $\mathbf{u}_i^{(s)}$ and $\mathbf{v}_j^{(s)}$ are samples from $q_{\phi_u}(\mathbf{u}_i \mid \mathbf{r}_i)$ and $q_{\phi_v}(\mathbf{v}_j \mid \mathbf{r}_j)$, respectively, and we again set the number of samples $S$ to 1. We use the multinomial probability for $p_\theta(\hat{r}_{ij} \mid \mathbf{u}_i, \mathbf{v}_j)$. We then use the expected rating $\mathbb{E}[\hat{r}_{ij}] = \sum_{\hat{r}_{ij}=1}^{D_r} \hat{r}_{ij}\, p(\hat{r}_{ij} \mid \mathbf{u}_i, \mathbf{v}_j)$ as our prediction.

For the binary datasets, we change all values from −1 to 0 for negative examples and keep the value 1 for positive examples for the MFGAE and MFVGAE models. We need to do this in order to be able to compute the $\hat{\mathbf{A}}$ matrix: if we did not force our ratings to be nonnegative, the values of the degree matrix $\mathbf{D}$ might be negative, in which case the square root does not yield real results. For the MFAE and MFVAE models we do retain the value −1 for negative examples and 1 for positive examples. Note that this means that we throw away information in the MFGAE and MFVGAE models, as in the sparse matrix $\mathbf{A}$ we do not distinguish between "observed but 0" and "not observed". We therefore expect the MFAE and MFVAE models to outperform the MFGAE and MFVGAE models.

4.2.2 Experiment 2: Hyperparameter Optimization

In this experiment we run a grid search over a wide range of hyperparameters. Since the training time increases with dataset size, we are able to train on a much larger number of hyperparameter settings for the smaller datasets than for the larger datasets. We use the outcome of earlier hyperparameter experiments to inform next steps in hyperparameter optimization. For example, we found that adding a bias vector to hidden layers does not improve results on the ml100k dataset (see Section 4.4) and therefore removed bias vectors from all models described in Chapter 3. Findings on hyperparameter optimization are described in more detail in Sections 4.3

and 4.4. For all hyperparameter settings we use the validation set to monitor training progress. We train the model until there is no more improvement in the validation set error. We restrict ourselves to symmetric models in which the user and item sparse rating vector encoders share hyperparameters; e.g. we do not use models with different hidden layer or latent factor dimensionalities for users and items, despite this being possible according to our model definitions.

4.2.3 Experiment 3: New Users and Items

In this experiment we investigate whether a trained model can predict ratings for new users who have only rated items seen during training, and new items that have only been rated by users seen during training. The reason we do not consider new users who have rated new items (or, conversely, new items that have only been rated by new users) is that the first hidden layer only contains weights for ratings from users, and ratings for items, that have been seen during training. There are no hidden layer weights for new users or new items. This means that if a new user rates a new item, there is no weight in the model connecting the new item to the hidden layer, and thus a latent factor cannot be inferred. To answer this question we run two experiments:

Experiment 3a In this experiment we select a random subset of 10% of all users or items. The ratings of these users or items form the test set $\mathcal{D}_{test}$. From the remainder of the ratings, a random selection of 5% is assigned to the validation set $\mathcal{D}_{val}$, and the rest is assigned to the training set $\mathcal{D}_{train}$. We then train a model on the training data as in Experiment 1. In case we subsample users, we use the ratings from the users in $\mathcal{D}_{test}$ to infer the latent test-set user factors at test time. We use the notation $\mathbf{u}_{i'}$ to indicate the inferred latent factor for a new user $i'$ that occurs only in the test set. We use the ratings in the training set to infer the latent factors of all items that occur in the test set. We use the notation $\hat{r}_{i'j}$ to indicate a

predicted rating from a new user with inferred latent factor $\mathbf{u}_{i'}$ for an item seen during training with latent factor $\mathbf{v}_j$. We thus have that:

$$\hat{r}_{i'j} = f_\theta(\mathbf{u}_{i'}, \mathbf{v}_j) \tag{4.3}$$

Similarly, in case we subsample items, we use the ratings for the items in $\mathcal{D}_{test}$ to infer the latent test-set item factors at test time. We use the notation $\mathbf{v}_{j'}$ to indicate the inferred latent factor for a new item $j'$ that occurs only in the test set. We use the ratings in the training set to infer the latent factors of all users that occur in the test set. We use the notation $\hat{r}_{ij'}$ to indicate a predicted rating from a user $i$ who was seen during training, with latent factor $\mathbf{u}_i$, for a new item $j'$ with inferred latent factor $\mathbf{v}_{j'}$. We thus have that:

$$\hat{r}_{ij'} = f_\theta(\mathbf{u}_i, \mathbf{v}_{j'}) \tag{4.4}$$

We only run this experiment on the Movielens 100K and Movielens 1M datasets, as training on the other datasets would be too costly. We refer the reader to fig. 2.3 for a visual depiction of what happens in this experiment. For example, in the experiment in which we subsample users from the dataset, the test set is formed by the bottom left submatrix. Similarly, in the experiment in which we subsample items from the dataset, the test set is formed by the top right submatrix.

Experiment 3b In this experiment we use the models trained in Experiment 1 to assess how well our models can infer latent factors for new users and items. Recall that these models are trained on $\mathcal{D}_{train}$, while training is monitored using $\mathcal{D}_{val}$. At test time we treat the ratings in the test set as if they were ratings for new users and items. That is, we treat the test-set ratings from user $i$, all of which are ratings for items seen during training, as if they were ratings from a new, virtual user $i'$. Similarly, we treat the test-set ratings for an item $j$ as if they were for a new, virtual item $j'$ that was only rated by users seen during training. In other words, we split the real user $i$ into two users: the

user $i$ seen during training, whose ratings consist of the subset of all ratings from user $i$ in $\mathcal{D}_{train}$, and a virtual user $i'$, whose ratings consist of the subset of ratings from user $i$ in $\mathcal{D}_{test}$. The same holds for items. For the MFGAE and MFVGAE models we treat the test-set ratings as if they formed a new rating graph, from which we compute the $\hat{\mathbf{A}}$ matrix as explained in Section 3.3. Using the test-set $\hat{\mathbf{A}}$ matrix we infer the latent representations of users and items. We use the notation $\mathbf{r}_i^{(test)}$ to indicate the sparse rating vector consisting of the test-set ratings of user $i$. As stated before, we treat these ratings as if they were from a new user $i'$, whose observed ratings consist of the ratings from user $i$ in the test set. Similarly, we use the notation $\mathbf{r}_j^{(test)}$ to indicate the sparse rating vector consisting of the test-set ratings of item $j$. Again, we treat these ratings as if they were for a new item $j'$, whose observed ratings consist of the ratings for item $j$ in the test set.

We then test the ability of a model to predict ratings for new users who have only rated known items, and items that have only been rated by known users, as follows. We encode the sparse user and item rating vectors $\mathbf{r}_i^{(test)}$ and $\mathbf{r}_j^{(test)}$ into latent factors $\mathbf{u}_{i'}$ and $\mathbf{v}_{j'}$ using the encoders learned during Experiment 1. That is:

$$\mathbf{u}_{i'} = g_{\phi_u}(\mathbf{r}_i^{(test)}) \qquad \mathbf{v}_{j'} = g_{\phi_v}(\mathbf{r}_j^{(test)}) \tag{4.5}$$

For a target test-set rating $r_{ij}$ we predict $\hat{r}_{i'j}$, $\hat{r}_{ij'}$ and $\hat{r}_{i'j'}$. That is, we predict ratings from the virtual user $i'$ for the known item $j$, from the known user $i$ for the virtual item $j'$, and from the virtual user $i'$ for the virtual item $j'$. The ratings $\hat{r}_{i'j}$, $\hat{r}_{ij'}$ and $\hat{r}_{i'j'}$ are predicted as follows:

$$\hat{r}_{i'j} = f_\theta(\mathbf{u}_{i'}, \mathbf{v}_j) \qquad \hat{r}_{ij'} = f_\theta(\mathbf{u}_i, \mathbf{v}_{j'}) \qquad \hat{r}_{i'j'} = f_\theta(\mathbf{u}_{i'}, \mathbf{v}_{j'}) \tag{4.6}$$

where $f_\theta$ is the decoder function learned during Experiment 1. Note that, for the target test-set rating (i.e. not the predicted rating), we

have that the true rating $r_{ij} = r_{i'j} = r_{ij'} = r_{i'j'}$, because user $i'$ and item $j'$ are virtual users and items that correspond to user $i$ and item $j$ respectively. Our motivation behind this somewhat unorthodox-seeming experiment is twofold:

1. We would like our models to behave such that, if we split the ratings of a real user $\mathbf{r}_i$ into two virtual users $\mathbf{r}_{i'}$ and $\mathbf{r}_{i''}$, the models infer similar latent factors $\mathbf{u}_i$, $\mathbf{u}_{i'}$ and $\mathbf{u}_{i''}$.

2. This is a free experiment that requires no new training cycles, only a forward pass through the autoencoder architecture to infer the latent factors and predict the test-set ratings.

We compare the performance of our models on this task to the performance of our models on the test-set results from Experiment 1, as other authors do not report scores on this task.

4.2.4 Experiment 4: Supervised MFAE

In this experiment we perform simple tests on the Supervised MFAE model introduced in Section 3.1. For the encoder we use the simplified encoder (see eq. (3.7)) and for the decoder we use the dot-product decoder (see eq. (3.12)). Note that the Supervised MFAE model is a toy model, and this is thus a toy experiment. The only purpose of this experiment is to show that our general idea, i.e. learning a mapping from sparse user and item rating vectors to latent factors, is feasible. In this experiment we assume that we have access to target factors for each user and each item. These factors are stored in the target factor matrices $\mathbf{U}$ and $\mathbf{V}$ for user and item factors, respectively. We then learn a function $g_{\phi_u}(\cdot)$ that maps a sparse user rating vector $\mathbf{r}_i$ from a user $i$ (i.e. a row of the sparse matrix $\mathbf{R}$) to a predicted latent user factor $\hat{\mathbf{u}}_i$, and a function $g_{\phi_v}(\cdot)$ that maps a sparse item rating vector $\mathbf{r}_j$ for an item $j$ (i.e. a column of the sparse matrix $\mathbf{R}$, or a row of $\mathbf{R}^T$) to a predicted latent item factor $\hat{\mathbf{v}}_j$. The predicted factors are stored in the predicted latent factor matrices $\hat{\mathbf{U}}$ and $\hat{\mathbf{V}}$. We use the following model to predict the learned latent factors from the

input rating matrix:

$$\hat{\mathbf{U}} = g_{\phi_u}(\mathbf{R}) = \sigma(\mathbf{R}\mathbf{W}_u^{(1)})\mathbf{W}^{(2)} \qquad \hat{\mathbf{V}} = g_{\phi_v}(\mathbf{R}^T) = \sigma(\mathbf{R}^T\mathbf{W}_v^{(1)})\mathbf{W}^{(2)} \tag{4.7}$$

This means we need target factors on which we can train our model. We used the Variational Bayesian Matrix Factorization (VBMF) algorithm of Lim and Teh [28] to find the target factors. We used the same implementation of VBMF for this experiment as the one described in Section 4.1.3 that was used for the baseline experiments. We trained this model on the Movielens 100K and Movielens 1M datasets. We used the mean vectors of the variational posterior user and item factor distributions to build our target factor matrices $\mathbf{U}$ and $\mathbf{V}$, where row $i$ of $\mathbf{U}$ contains user factor $\mathbf{u}_i$ and row $j$ of $\mathbf{V}$ contains item factor $\mathbf{v}_j$. We then used the simplified encoder to learn a mapping from $\mathbf{R}$ to $\mathbf{U}$ and from $\mathbf{R}^T$ to $\mathbf{V}$.

In our Supervised MFAE implementation as described in eq. (4.7) we use no activation function on the output layer. The reason for this is that the factors learned by the VBMF algorithm do not fit in any specific range. We found that using an activation function in the output layer and rescaling the outputs to match the range of the values of the target latent factors yielded worse results. We also found that rescaling the latent factors by $\frac{1}{m}$ (where $m = \max(m_u, m_v)$, and $m_u$ and $m_v$ are the maximum absolute values in $\mathbf{U}$ and $\mathbf{V}$, respectively), followed by a rescaling of the ratings by $m^2$ (which yields the same results because $(\frac{1}{m}\mathbf{u}_i)^T(\frac{1}{m}\mathbf{v}_j) = \frac{1}{m^2}\mathbf{u}_i^T\mathbf{v}_j$), produced slightly worse results in preliminary experiments. Also note that we share the output-layer weights $\mathbf{W}^{(2)}$ between the $\mathbf{U}$ and $\mathbf{V}$ encodings. We optimize this model w.r.t. $\mathbf{W}_u^{(1)}$, $\mathbf{W}_v^{(1)}$ and $\mathbf{W}^{(2)}$ by minimizing the mean squared error between $\mathbf{U}$ and $\hat{\mathbf{U}}$, and $\mathbf{V}$ and $\hat{\mathbf{V}}$.
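A compact numpy sketch of the forward pass and objective of eq. (4.7); the hidden activation is taken to be the logistic sigmoid purely for illustration (the equation leaves σ(·) unspecified), and the gradient-based minimization itself, done with Theano in the thesis code, is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def supervised_mfae_loss(R, U_target, V_target, W1_u, W1_v, W2):
    """Supervised MFAE of eq. (4.7): U_hat = sigma(R W1_u) W2 and
    V_hat = sigma(R^T W1_v) W2, with a shared output weight matrix W2,
    no biases and no output activation. Returns the mean squared error
    against the VBMF target factors, the quantity minimized w.r.t.
    W1_u, W1_v and W2."""
    U_hat = sigmoid(R @ W1_u) @ W2
    V_hat = sigmoid(R.T @ W1_v) @ W2
    return (np.mean((U_hat - U_target) ** 2)
            + np.mean((V_hat - V_target) ** 2))
```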

4.3 Results

4.3.1 Experiment 1: Model Performance

In this section we present the best RMSE scores achieved by each of the algorithms on each of the datasets in tables 4.5 and 4.6, and the AUC scores for the binary datasets in table 4.7. We compare our results to the current state of the art for each of the datasets. The state of the art results can be found in tables 4.2 and 4.3, but are reproduced in tables 4.5 and 4.6. We then present the more detailed findings for each of the experiments described in Section 4.2. The ROC curves from which the AUC scores were computed are shown in figs. 4.2a and 4.3a; a detail of these curves for the lowest FPR and TPR values can be found in figs. 4.2b and 4.3b. The time it took to train each of the models can be found in Appendix D.

Due to the long training times required for the larger datasets (Netflix and NOS30M), we were unable to run experiments on these datasets for the MFVAE and MFVGAE models. While we did run an experiment on the Movielens 10M dataset for the MFVGAE model, we were unable to find hyperparameter settings that yielded reasonable results. The reason we decided to allocate our resources towards running experiments with the MFAE and MFGAE models is that these models appeared to yield better results in the experiments on the smaller datasets.

4.3.2 Experiment 2: Hyperparameter Optimization

In this section we present the hyperparameter settings that yielded the best RMSE scores. Since the tables containing these values are rather unwieldy, we have relegated them to Appendix C at the end of this thesis. The values of the hyperparameters that yielded the best validation set RMSE scores can be found in tables C.1 to C.4 in Appendix C.

              ml100k   ml1m    ml10m   Netflix
MFAE          …        …       …       …†
MFGAE         …        …       …       …†
MFVAE         …        …       …       -
MFVGAE        …        …       …       -
U-CF-NADE     …        …       …*      …*
I-CF-NADE     …        …       …       …
LLORMA-G      …        …*      …*      …*
LLORMA-L      …        …*      …*      …*
NNMF          …*       …*      -       -
U-CFN         …        …*      …*      -
V-CFN         …        …*      …*      -
VBMF          …        …       …       …

Table 4.5: Test RMSE scores for each of our models with the lowest validation set RMSE on the rating datasets, compared to the SOTA in the literature. For the origin of the SOTA scores we refer the reader to tables 4.2 and 4.3. The best score on each dataset is marked in boldface. The best score among our models on each of the datasets is marked in italics. Scores marked with a dagger (†) are scores from experiments that had to be cut short due to excessive running time (in the case of Netflix: 10 days). Scores marked with an asterisk (*) are taken from the paper in which the baseline model was presented.

         nos300k   nos30m
MFAE     …         …
MFGAE    …         …
MFVAE    …         -
MFVGAE   …         -
VBMF     …         …

Table 4.6: Best (lowest) RMSE scores for each of our models on the binary datasets, compared to a benchmark score from the VBMF algorithm. The best score on each dataset is marked in boldface. The best score among our models on each of the datasets is marked in italics.

         nos300k   nos30m
MFAE     …         …
MFGAE    …         …
MFVAE    …         -
MFVGAE   …         -
VBMF     …         …

Table 4.7: Best (highest) AUC scores for each of our models on the binary datasets, compared to a benchmark score from the VBMF algorithm. The best score on each dataset is marked in boldface. The best score among our models on each of the datasets is marked in italics.

(a) ROC curves for the predicted ratings of our models on the NOS300K dataset. The VBMF curve functions as a benchmark. (b) Detail of the ROC curves for the lowest FPR and TPR rates.

Figure 4.2: ROC curves for the MFAE and MFGAE models on the NOS300k dataset. The detail view on the right shows the superior performance of all models on the highest scoring ratings compared to the VBMF baseline.

(a) ROC curves for the predicted ratings of our models on the NOS30M dataset. The VBMF curve functions as a benchmark. (b) Detail of the ROC curves for the lowest FPR and TPR rates.

Figure 4.3: ROC curves for the MFAE and MFGAE models on the NOS30M dataset. The detail view on the right shows the superior performance of the MFAE model on the highest scoring ratings compared to the VBMF baseline.

4.3.3 Experiment 3: New Users and Items

In this section we present the results of Experiment 3a and Experiment 3b on predicting ratings for new users and items.

Experiment 3a We show the RMSE of the predicted ratings $\hat{r}_{i'j}$ from new users for known items, and $\hat{r}_{ij'}$ from known users for new items, in table 4.8.

Experiment 3b We show the results on the predicted ratings $\hat{r}_{i'j}$, $\hat{r}_{ij'}$ and $\hat{r}_{i'j'}$. The RMSE computed on the predicted factors for each of the three types of predicted ratings can be found in table 4.9.

             ml100k-U   ml1m-U   ml100k-I   ml1m-I
MFAE val     …          …        …          …
MFAE test    …          …        …          …
MFVAE val    …          …        …          …
MFVAE test   …          …        …          …

Table 4.8: Validation and test-set RMSE results of randomly removing users and items from the dataset. The -U suffix indicates an experiment in which users were removed, i.e. the results are the RMSE on $\hat{r}_{i'j}$; the -I suffix indicates an experiment in which items were removed, i.e. the results are the RMSE on $\hat{r}_{ij'}$.

          ml100k   ml1m   ml10m   Netflix   nos300k   nos30m
MFAE      …        …      …       …         …         …
MFGAE     …        …      …       …         …         …
MFVAE     …        …      …       -         …         -
MFVGAE    …        …      …       -         …         -
MFAE      …        …      …       …         …         …
MFGAE     …        …      …       …         …         …
MFVAE     …        …      …       -         …         -
MFVGAE    …        …      …       -         …         -
MFAE      …        …      …       …         …         …
MFGAE     …        …      …       …         …         …
MFVAE     …        …      …       -         …         -
MFVGAE    …        …      …       -         …         -

Table 4.9: The $\hat{r}_{i'j}$, $\hat{r}_{ij'}$ and $\hat{r}_{i'j'}$ scores for each of the models on each of the datasets. The top subtable shows the $\hat{r}_{i'j}$ scores, the middle subtable the $\hat{r}_{ij'}$ scores, and the bottom subtable the $\hat{r}_{i'j'}$ scores.

4.3.4 Experiment 4: Supervised MFAE

The results of the supervised MFAE experiment can be found in table 4.10. In this table we present the RMSE of rating predictions made using the reconstructed factors inferred by the supervised MFAE algorithm on the training, validation and test sets of the Movielens 100K and 1M datasets. We compare these scores to the rating predictions made using the factors learned by the VBMF algorithm. Our best model for this experiment was able to reconstruct the original factors with a mean squared error of … on the Movielens 100K dataset and … on the Movielens 1M dataset.

                          train   val   test
VBMF Target ml100k        …       …     …
Supervised MFAE ml100k    …       …     …
VBMF Target ml1m          …       …     …
Supervised MFAE ml1m      …       …     …

Table 4.10: Results of the supervised MFAE experiment on the Movielens 100K and 1M datasets.

4.4 Discussion

In this section we delve deeper into our findings.

4.4.1 Experiment 1: Model Performance

Rating Datasets The MFGAE model outperforms all our other models. Most notably, we see the gap between the MFGAE and the MFAE model widen with increasing dataset size. This might be caused by the fact that the raw rating matrix $\mathbf{R}$ is a crude approximation to the graph convolution of Kipf and Welling [22] (which is itself a first-order approximation of graph convolutions). As stated in Section 3.3, the difference between the MFAE and the MFGAE is the addition of self-connections for each user and item, and a scaling of each

rating $r_{ij}$ in $\mathbf{R}$ by $\frac{1}{\sqrt{D_{ii} D_{jj}}}$. We assume that particularly the scaling plays an important role in the performance difference between the MF(V)AE models and the MF(V)GAE models.

We also note that the variational models underperform in comparison to the non-variational models on the rating datasets. One possible explanation for this observation is that the KL-divergence term in the ELBO of the MFV(G)AE models too strongly regularizes latent factors whose posterior distributions differ too strongly from $\mathcal{N}(\mathbf{0}, \mathbf{I})$. This would imply that increasing the variance of the prior should improve scores, as a higher prior variance would allow the posterior to range farther from the prior mean. Table 4.11 presents a small experiment in which we varied the prior variance of the latent factors. We can see that increasing the prior variance in fact yields worse results. An interesting future experiment might measure the effect of decreasing the variance for the rating datasets.

              ml100k   ml1m
MFVAE, σ = …  …        …
MFVAE, σ = …  …        …
MFVAE, σ = …  …        …

Table 4.11: Results for the MFVAE model on two different datasets, with different values of the prior variance σ.

Binary Datasets The MFVAE beats all our other models, and its performance is very close to the VBMF baseline on the nos300k dataset. We can also see in the ROC plots of figs. 4.2a and 4.3a that the MFVAE model most closely approximates the ranking performance of the VBMF baseline. When zooming in on the ROC plots, as we have done in figs. 4.2b and 4.3b, one can see that our models outperform the VBMF baseline at the lower values of the TPR and FPR. In particular the MFVAE model on the nos300k dataset shows strong performance relative to the VBMF baseline. This means that our models make fewer mistakes in the predicted rating $\hat{r}_{ij}$ for the user/item pairs to which they assign high scores than the VBMF model does.

We note that the models based on the Graph Autoencoder approach perform worse than the MFAE and MFVAE models on both the nos300k and the nos30m datasets. This result is in line with our expectation that removing the negative values from the adjacency matrix $\mathbf{A}$ throws away crucial information. It is unfortunate that no results for the MFVAE model are available on the nos30m dataset. As stated before, not running an experiment on the nos30m dataset was a judgement call based on results on the larger rating datasets. It turns out that our models behave differently on binary datasets than on rating datasets, and that we should have allocated our resources towards running an experiment with the MFVAE model on the nos30m dataset instead of the MFGAE model.

General Observations In general, the distance between the performance of our models and the state of the art increases with dataset size. We present two explanations for this, one practical and one structural:

1. As stated before, the larger datasets require much more training time. This means that it is expensive to perform hyperparameter optimization over a wide range of hyperparameters. It is therefore very possible that there exist different hyperparameters for which the models would perform better.

2. The reason training takes so long is that, for the larger datasets, our models contain a very large number of parameters. The neural network architecture of the encoders requires that the total number of parameters in the first hidden layer is $N \times D_{h_1}$, where $N = N_u + N_v$ and $D_{h_1}$ is the dimensionality of the first hidden layer. This is, however, on par with more traditional matrix factorization models: PMF [35], for example, requires $N \times D$ parameters. The large number of parameters nevertheless makes our model impractical on very large datasets such as Netflix (a back-of-the-envelope calculation follows below).
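To make the second point concrete, a back-of-the-envelope calculation using the Netflix row of Table 4.1 and an assumed first hidden layer of $D_{h_1} = 500$ nodes (in line with the hidden layer sizes discussed in Section 4.4):

$$N = N_u + N_v = 480{,}189 + 17{,}770 = 497{,}959, \qquad N \cdot D_{h_1} \approx 2.5 \times 10^8$$

i.e. the first-layer weight matrix alone would hold roughly a quarter of a billion parameters.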

The total number of parameters is dominated by the weights of the first hidden layer. The large size of the weight matrix for the first hidden layer is caused by the fact that there is a weight from each user and item rating value to each hidden layer node. I.e. for a hidden layer node $k$, the user encoding weight matrix $\mathbf{W}_u^{(1)}$ has a weight $[\mathbf{W}_u^{(1)}]_{jk}$ for each item $j$ that a user can rate, and the item encoding weight matrix $\mathbf{W}_v^{(1)}$ has a weight $[\mathbf{W}_v^{(1)}]_{ik}$ for each user $i$ who can rate the items.

The large number of parameters for the larger datasets also indicates that our model might be overparametrized. We are ultimately only interested in the latent representations of the users and items. However, in learning a function that maps sparse rating vectors to latent factors, our current model in essence learns a very high-dimensional representation of each user and item in the weight matrix: row $j$ of $\mathbf{W}_u^{(1)}$ can itself be interpreted as a representation of item $j$. For the MFGAE and MFVGAE models this is caused by our initial choice of using the identity matrix $\mathbf{I}$ in place of the feature matrix $\mathbf{X}$. This assigns a unique feature to each user and item. One could imagine a vastly simpler model in which only two features exist: an indicator feature for users and an indicator feature for items. The graph convolution would still ultimately ensure that each user and item is assigned a unique latent factor, but with a much smaller number of parameters. This approach would also vastly speed up training, as the matrix product $\hat{\mathbf{A}}\mathbf{X}$ can be precomputed and has dimensionality $N \times 2$ instead of $N \times N$. A myriad of similar approaches (e.g. assigning all items a unique indicator feature but only using a global user indicator feature, or assigning random lower-dimensional unique feature vectors to each user and/or item) that reduce the computational complexity and the number of parameters is possible and should be investigated in future work.

4.4.2 Experiment 2: Hyperparameter Optimization

Phase 1: Preliminary Experiments In this section we present some of the findings of our preliminary experiments on the Movielens 100K dataset for the MFAE and the MFGAE models.

(a) L2 regularization vs. best validation RMSE of an MFAE model on the Movielens 100K dataset, for two models with one hidden layer, one including and one not including a bias vector in the hidden and factor layers. (b) L2 regularization vs. best validation RMSE of an MFAE model on the Movielens 100K dataset, for two models with no hidden layer, one including and one not including a bias vector in the hidden and factor layers.

Figure 4.4: Effects of certain hyperparameters in preliminary experiments. These two plots are kept separate due to the differing ranges of values on the horizontal axes.

Bias Vector In preliminary experiments we found that adding a bias vector to the hidden and factor layers neither hurt nor improved performance. Figures 4.4a and 4.4b show the RMSE as a function of the L2 regularization parameter for MFAE models with and without a bias vector.

L2 Norm L2 regularization required careful tuning. Figures 4.4a and 4.4b, again, show the effect of the L2 regularization parameter on the best validation RMSE score.

Presence of Hidden Layer Figure 4.4a shows a graph for a model containing one hidden layer, while fig. 4.4b shows a graph for a model containing no hidden layer. The vertical axes of figs. 4.4a and 4.4b are on the same scale and show the same range (the horizontal axes show a different range, which is why the images cannot effectively be combined). In this way one can see that slightly better results are achieved by the model with a hidden layer.

Hidden Layer Size We found that, while the hidden layer size does influence the RMSE, there is a point of diminishing returns in the size of the
Presence of hidden layer Figure 4.4a shows a graph for a model containing one hidden layer, while fig. 4.4b shows a graph for a model containing no hidden layer. The vertical axes of figs. 4.4a and 4.4b are on the same scale and show the same range (the horizontal axes show different ranges, which is why the images cannot effectively be combined). In this way one can see that slightly better results are achieved by the model with a hidden layer.

Hidden Layer Size We found that, while the hidden layer size does influence RMSE, there is a point of diminishing returns for the size of the hidden layer.
Figure 4.5: Effects of hidden node size and weight sharing in preliminary experiments. (a) RMSE as a function of the number of hidden nodes and weight sharing; a model that does not share weights between the hidden layer and the factor layer performs worse than a model that does. (b) Number of hidden nodes vs. best validation RMSE of an MFGAE model on the Movielens 100K dataset; there is a point of diminishing returns as the number of hidden nodes increases.

See fig. 4.5b for the development of the best RMSE achieved as a function of the number of hidden nodes. In the experiments for this figure the dimensionality of the latent factors is kept fixed, while the other hyperparameters are optimized to give the lowest validation RMSE. From 250 to 500 nodes there is a slight decrease in RMSE; adding more than 500 nodes yields no significant improvement.

Weight Sharing We found that weight sharing in the last layer always increases performance. Figure 4.5a illustrates this for an MFAE model trained on Movielens 100K, but the same pattern holds for other models and datasets.

More Than One Graph Convolution We also found, for the Graph Autoencoder (GAE) models, i.e. MFGAE and MFVGAE, that adding a second graph-convolutional layer slightly decreased performance. This is in line with our reasoning behind Encoder 2 for the MFVGAE introduced in Section 3.3.

Decoder Types Lastly, we found that the MLP decoder does not give good results, and that the Multinomial Matrix Dot Product (MMDP) decoder with a diagonal matrix M^(k) gives better results than all non-multinomial models, and results equal to those of an MMDP decoder with a symmetric or dense learnable matrix M^(k), while using fewer parameters. A minimal sketch of such a decoder follows.
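The sketch below shows one plausible reading of an MMDP decoder with diagonal matrices M^(k): a per-rating-class bilinear score u^T M^(k) v, a softmax over the rating classes, and the expected rating as prediction. The exact parametrization in our implementation may differ, and all names here are illustrative.

```python
import numpy as np

def mmdp_decoder(u, v, diag_weights, rating_values):
    """Multinomial Matrix Dot Product decoder with diagonal matrices M^(k).

    u, v: latent user and item factors, shape (d,).
    diag_weights: array of shape (K, d); row k is the diagonal of M^(k).
    rating_values: the K possible rating levels, e.g. [1, 2, 3, 4, 5].
    Returns the expected rating under the softmax distribution over levels.
    """
    scores = diag_weights @ (u * v)          # u^T M^(k) v for each class k
    scores -= scores.max()                   # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return float(probs @ np.asarray(rating_values, dtype=float))

rng = np.random.default_rng(1)
d, K = 50, 5
u, v = rng.normal(size=d), rng.normal(size=d)
M_diags = rng.normal(size=(K, d))            # one diagonal per rating class
print(mmdp_decoder(u, v, M_diags, [1, 2, 3, 4, 5]))
```

With diagonal matrices the decoder adds only K × d parameters, which is why it can match the symmetric or dense variants while remaining much cheaper.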

Phase 2: Optimal Hyperparameters

We find that it is hard to predict which hyperparameters will perform well for each combination of dataset and model. We do note that some regularities pop up:

1. The MFVAE and MFVGAE perform best when the hidden and latent factor layers both use the tanh(·) activation function.

2. The other models perform best when they use the ReLU(·) activation function on the hidden layer and the sigmoid(·) function on the latent factor layers.

Experiment 3: New Users and Items

Experiment 3a

Table 4.8 presents the results for Experiment 3a. In this table, the rows marked val contain the best validation scores achieved for each combination of model and dataset, while the rows marked test contain the RMSE scores on predicted ratings for new users or items. One can see in this table that the RMSE goes up when new users or items are added to the system, albeit not vastly so. We had, however, hoped for better results in this experiment.

It is interesting to see that, for both datasets on which we ran this test, predictions for new items are better than predictions for new users. It is possible that the distribution of the sparse rating vectors of new items resembles that of known items r_j more closely than the distribution for new users resembles that of known users r_i. This would allow the models to better predict latent factors for new items than for new users, resulting in better RMSE scores. However, this does not explain why the performance on new users and items is worse across the board. We hypothesize that the model overfits on the users and items it sees in the training set: while our careful tuning of the L2 regularization parameter ensured that the model did not overfit on the ratings, it did not prevent the model from overfitting on users and items. Future work might investigate whether normalization of the sparse rating vectors, or randomly subsampling user and item ratings in the sparse input vectors r_i and r_j, can function as a regularizer on the users and items; a minimal sketch of the latter idea follows.
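As an illustration of the subsampling idea (a suggestion for future work, not something we evaluated), the sketch below randomly drops a fraction of the observed ratings from a sparse input vector at each training step, analogous to dropout on the input.

```python
import numpy as np

def subsample_ratings(r, keep_prob, rng):
    """Randomly keep each observed rating with probability keep_prob.

    r: dense representation of a sparse rating vector (0 = unobserved).
    Returns a copy of r with a random subset of observed entries zeroed
    out, which could act as a regularizer on users/items during training.
    """
    observed = r != 0
    keep_mask = rng.random(r.shape) < keep_prob
    return np.where(observed & keep_mask, r, 0.0)

rng = np.random.default_rng(2)
r_i = np.zeros(20)
r_i[[1, 4, 7, 13, 17]] = [4, 5, 3, 2, 5]     # a user's observed ratings
print(subsample_ratings(r_i, keep_prob=0.8, rng=rng))
```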

Experiment 3b

Rating datasets The results for new users presented in table 4.9 show significantly worse performance, with RMSE scores that are much higher than those computed for users and items seen during training. A first interesting observation from table 4.9 is that the predicted rating ˆr_i′j′ for a pair of a new user i′ and a new item j′ is always lower than when the pair consists of a known user or item and a new item or user. To analyze why this occurs we plot the distribution of the ground-truth test-set ratings r_ij and compare it with:

1. the distribution of the predicted ratings ˆr_ij for users and items seen during training,

2. the distribution of the predicted ratings ˆr_i′j′ for new users and new items.

The encoder used for these plots is the MFGAE encoder from the best-scoring model on the Movielens 1M dataset. This plot can be found in fig. 4.6. The distribution of the ratings predicted for known users and items follows the distribution of the ground-truth test-set ratings much more closely than the distribution of the ratings for new users and items. We hypothesize that this implies that the latent factors are distributed differently when the test-set ratings are used as input. To test this hypothesis we plot a 2D PCA projection of the latent user and item factors. To make the distributions somewhat clearer, we also plot a two-standard-deviation ellipse around the mean of a 2D Gaussian fit to the PCA projections of the training-set and test-set based factors. This plot can be found in fig. 4.7. A minimal sketch of this projection-and-ellipse procedure is given below.
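The sketch below shows one way such a 2D PCA projection and two-standard-deviation Gaussian ellipse can be computed; the factor matrix is a random placeholder and the code is illustrative rather than our actual plotting code.

```python
import numpy as np

def pca_2d(factors):
    """Project latent factors onto their first two principal components."""
    centered = factors - factors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def gaussian_ellipse(points_2d, n_std=2.0, n_points=200):
    """Boundary of the n_std ellipse of a 2D Gaussian fit to the points."""
    mean = points_2d.mean(axis=0)
    cov = np.cov(points_2d, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    angles = np.linspace(0, 2 * np.pi, n_points)
    circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    # Scale the unit circle by n_std standard deviations along each
    # principal axis, then rotate into the data's coordinate frame.
    return mean + n_std * circle * np.sqrt(eigvals) @ eigvecs.T

rng = np.random.default_rng(3)
train_factors = rng.normal(size=(1000, 50))   # placeholder latent factors
proj = pca_2d(train_factors)
ellipse = gaussian_ellipse(proj)              # points on the ellipse boundary
print(proj.shape, ellipse.shape)
```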

Figure 4.6: Ground-truth test-set rating distribution, test-set rating predictions for known users and items, and test-set rating predictions for new virtual users and items, for the Movielens 1M dataset.

In this plot we see that the distributions of the PCA projections of the factors differ, but not very strongly. We then hypothesize that subsampling ratings for the test set (as opposed to subsampling users or items) results in rating distributions within the sparse rating vectors r_i^(test) and r_j^(test) that differ from those of the sparse rating vectors r_i and r_j used during training. This is a likely explanation for the discrepancy between the results in table 4.8 and table 4.9. A sketch of how this distributional difference could be quantified is given below.

Finally, we note a few interesting outliers in table 4.9. The MFVAE model performs decently in predicting the ratings for all combinations of new users or items on the Movielens 10M dataset. Furthermore, the MFAE model shows reasonable performance on the predictions of ˆr_ij for both the Movielens 10M and the Netflix datasets. Lastly, the MFGAE model performs very well on the predictions of ˆr_ij for the Movielens 10M dataset. We have no explanation for this behavior and assume that it is an artifact of the datasets and dataset splits that we used for our experiments.
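One way the hypothesized distributional difference could be quantified (a sketch of a possible check, not an analysis we performed) is to compare the empirical distributions of the ratings inside the training-time and test-time sparse vectors, for instance with a two-sample Kolmogorov-Smirnov test; the rating pools below are random placeholders.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(4)
# Placeholder rating pools: all observed ratings inside the training-time
# sparse vectors r_i / r_j, and inside the test-time vectors r_i^(test).
train_ratings = rng.integers(1, 6, size=5000)
test_ratings = rng.integers(1, 6, size=1000)

# The KS statistic measures the largest gap between the two empirical CDFs;
# a small p-value would support the hypothesis that the distributions
# differ (the test is approximate on discrete rating data).
statistic, p_value = ks_2samp(train_ratings, test_ratings)
print(f"KS statistic: {statistic:.3f}, p-value: {p_value:.3f}")
```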
