Content-based Recommendation

Suthee Chaidaroon

June 13, 2016

Contents

1 Introduction
  1.1 Matrix Factorization
2 sLDA
  2.1 Model
3 fLDA
4 Collaborative Topic Regression (CTR)
  4.1 Model
5 CRPF
  5.1 Model
6 Revision

1 Introduction

Much of the matrix factorization literature predicts a rating with the dot product between user and item latent factors: $r_{ui} = q_i^T p_u$. This is the basic idea of matrix factorization. We can enhance the user and item representations by incorporating additional user or item information; a better, more natural integration between the new input sources and the prediction model improves prediction accuracy and helps solve the cold-start problem. This writeup summarizes selected content-based recommendation techniques based on Bayesian methods. We start with the classic matrix factorization by Koren, then turn our attention to topic-model methods that derive a better item representation, and examine how each model integrates extra information. These integration methods are the theme of this writeup.
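As a concrete illustration of this prediction rule, here is a minimal sketch of the dot product $r_{ui} = q_i^T p_u$; the dimensionality and the random toy vectors are illustrative assumptions, not values from any of the papers below.

```python
import numpy as np

K = 5  # latent dimensionality (illustrative choice)

rng = np.random.default_rng(0)
p_u = rng.normal(size=K)  # user latent factor p_u
q_i = rng.normal(size=K)  # item latent factor q_i

# Basic matrix factorization prediction: r_ui = q_i^T p_u
r_ui = q_i @ p_u
print(f"predicted rating: {r_ui:.3f}")
```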
1.1 Matrix Factorization

The classic matrix factorization model of Koren et al. [1] adds further observed variables, such as implicit feedback and user attributes, to enhance the user representation. The model is described as:

$$r_{ui} = \mu + b_i + b_u + q_i^T \Big[ p_u + |N(u)|^{-0.5} \sum_{i \in N(u)} x_i + \sum_{a \in A(u)} y_a \Big] \quad (1)$$

This model adds fixed additional terms to the user latent factor: the user's implicit feedback $x_i$ and the user attributes $y_a$. The authors need to explicitly tune and normalize the implicit feedback and attributes to achieve an improvement; for example, the implicit-feedback sum is scaled by $|N(u)|^{-0.5}$, and the appropriate scaling may depend on the dataset.
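A minimal sketch of the prediction rule in Eq. (1), assuming toy dimensions and randomly initialized parameters; the variable names mirror the equation, and nothing here is taken from the authors' actual implementation.

```python
import numpy as np

K = 5  # latent dimensionality (illustrative)
rng = np.random.default_rng(1)

mu, b_i, b_u = 3.5, 0.2, -0.1   # global mean, item bias, user bias
p_u = rng.normal(size=K)        # explicit user factor
q_i = rng.normal(size=K)        # item factor
X = rng.normal(size=(4, K))     # x_i for items in N(u), here |N(u)| = 4
Y = rng.normal(size=(2, K))     # y_a for attributes in A(u), here |A(u)| = 2

# Eq. (1): augment p_u with normalized implicit feedback and attribute factors
user_vec = p_u + len(X) ** -0.5 * X.sum(axis=0) + Y.sum(axis=0)
r_ui = mu + b_i + b_u + q_i @ user_vec
print(f"predicted rating: {r_ui:.3f}")
```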
2 sLDA

Supervised LDA (sLDA) [2] is an extension of LDA. It ties an observed label, category, or real-valued response to the item latent space. Its PGM is shown in Figure 1.

Figure 1: PGM for the sLDA model

The connection between the topic assignment variables $z_{d,n}$ and the response variable $y_d$ enforces relatedness between the topic distribution and the response variable. Without this connection, the topic distribution learned from the data might be unrelated to the response variable. For instance, movie reviews containing words of praise should be grouped together given the movie rating, whereas plain LDA simply groups words based on their co-occurrence or the theme of the movies, which may not relate to the movie rating at all. Figure 2 shows that positive words are mapped to high ratings while negative words are mapped to lower ratings.

Figure 2: The word clusters after fitting movie reviews to the sLDA model

2.1 Model

The generative process of sLDA is described as follows:

1. Draw topic proportions $\theta \mid \alpha \sim \mathrm{Dir}(\alpha)$
2. For each word:
   (a) Draw topic assignment $z_n \mid \theta \sim \mathrm{Mult}(\theta)$
   (b) Draw word $w_n \mid z_n, \beta_{1:K} \sim \mathrm{Mult}(\beta_{z_n})$
3. Draw response variable $y \mid z_{1:N}, \eta, \sigma^2 \sim N(\eta^T \bar{z}, \sigma^2)$

This model assumes that the response variable is drawn from a Gaussian distribution with mean $\eta^T \bar{z}$, where $\bar{z}$ is the average of the topic assignments for a particular item. The model implies that if two items have similar average topic assignments, they should have very similar response values. For example, movie reviews that contain many negative words should have similar topic assignments and will end up with low ratings. We can also view this model as finding a weight vector $\eta$ such that its prediction $\eta^T \bar{z}$ is as close to the actual response value as possible; this is essentially a linear regression problem.
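A minimal simulation of the sLDA generative process above, assuming toy values for the number of topics, vocabulary size, and document length; the regression weights $\eta$ and variance $\sigma^2$ are arbitrary illustrative parameters, not fitted values.

```python
import numpy as np

rng = np.random.default_rng(2)
K, V, N = 3, 10, 50        # topics, vocabulary size, words per document (toy sizes)
alpha = np.ones(K)          # symmetric Dirichlet prior
beta = rng.dirichlet(np.ones(V), size=K)  # K topic-word distributions beta_{1:K}
eta = np.array([2.0, 0.0, -2.0])          # regression weights (illustrative)
sigma2 = 0.5

theta = rng.dirichlet(alpha)                         # 1. topic proportions
z = rng.choice(K, size=N, p=theta)                   # 2a. topic assignments z_n
w = np.array([rng.choice(V, p=beta[k]) for k in z])  # 2b. words w_n
z_bar = np.bincount(z, minlength=K) / N              # average topic assignment
y = rng.normal(eta @ z_bar, np.sqrt(sigma2))         # 3. y ~ N(eta^T z_bar, sigma^2)
print(f"average topic assignments: {z_bar}, response: {y:.3f}")
```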
3 fLDA

While the sLDA model constructs a single regression model for all users, fLDA [3] generalizes this constraint by incorporating a separate regression for each user. This model is built specifically for the recommendation problem, whereas sLDA is a more general idea. Looking at its PGM in Figure 3, there are two main components: the user and item plates. As in a standard probabilistic recommendation model, we draw user and item latent factors and draw a rating from the dot product of these two latent factors. fLDA follows this same idea but adds a user bias $\alpha_i$, an item popularity $\beta_j$, a user factor $s_i$, and an average latent topic $\bar{z}_j$.

Figure 3: PGM for the fLDA model

The rating is drawn from a normal distribution: $y_{ij} \sim N(u_{ij}, \sigma^2)$, where $u_{ij} = x_{ij}^T b + \alpha_i + \beta_j + s_i^T \bar{z}_j$. The term $\bar{z}_j$ is the average topic assignment vector, $\bar{z}_j = \frac{1}{W_j} \sum_{n=1}^{W_j} z_{jn}$. This idea is similar to sLDA. The authors claim that $\bar{z}_j$ has more variability than the topic distribution $\theta_j$, which leads to faster convergence [3]. The model is prone to overfitting because of its large number of latent variables, so regularization is crucial for it to perform well.

Figure 4: Generative process for the fLDA model
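A minimal sketch of the fLDA rating mean $u_{ij} = x_{ij}^T b + \alpha_i + \beta_j + s_i^T \bar{z}_j$, assuming toy dimensions and random parameters; it only illustrates how the per-user regression weights $s_i$ interact with the item's average topic assignment $\bar{z}_j$, and is not the authors' fitting procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
K, F, W_j = 3, 4, 30   # topics, rating features, words in item j (toy sizes)

b = rng.normal(size=F)          # global regression weights for features x_ij
x_ij = rng.normal(size=F)       # observed features of the (user i, item j) pair
alpha_i, beta_j = 0.3, -0.2     # user bias and item popularity
s_i = rng.normal(size=K)        # per-user regression weights (one s_i per user)

z_jn = rng.choice(K, size=W_j)                  # topic assignment of each word
z_bar_j = np.bincount(z_jn, minlength=K) / W_j  # average topic assignment vector
u_ij = x_ij @ b + alpha_i + beta_j + s_i @ z_bar_j
y_ij = rng.normal(u_ij, 1.0)    # rating y_ij ~ N(u_ij, sigma^2), sigma^2 = 1 here
print(f"mean rating u_ij = {u_ij:.3f}, sampled rating = {y_ij:.3f}")
```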
4 Collaborative Topic Regression (CTR)

The previous model, fLDA, uses a linear regression to address the cold-start problem: for a user with little rating information, a regression learned from the content is used to make recommendations. The smooth transition between the cold-start and warm-start situations makes this model attractive. However, the problem with using latent topics directly lies in their inability to distinguish topics useful for explaining recommendations from topics important for explaining content [4]. The CTR model proposed by Wang and Blei handles this seamlessly by adding uncertainty to the item latent feature.
Figure 5: The graphical model for the CTR model

4.1 Model

CTR models the item latent vector as a topic distribution with additive Gaussian noise. Its generative process is summarized as follows:

1. For each user i, draw the user latent vector $u_i \sim N(0, \lambda_u^{-1} I_K)$.
2. For each item j,
   (a) Draw topic proportions $\theta_j \sim \mathrm{Dirichlet}(\alpha)$.
   (b) Draw the item latent offset $\epsilon_j \sim N(0, \lambda_v^{-1} I_K)$ and set the item latent vector as $v_j = \epsilon_j + \theta_j$.
   (c) For each word $w_{jn}$,
       i. Draw topic assignment $z_{jn} \sim \mathrm{Mult}(\theta_j)$.
       ii. Draw word $w_{jn} \sim \mathrm{Mult}(\beta_{z_{jn}})$.
3. For each user-item pair (i, j), draw the rating $r_{ij} \sim N(u_i^T v_j, c_{ij}^{-1})$.

Its PGM in Figure 6 shows that each item effectively performs a linear regression to fit its latent vector to the ratings. This is a major difference from fLDA, which essentially sets the item latent vector to the topic distribution inferred for each item. By allowing each item's latent vector to diverge from its topic distribution, CTR can construct different latent vectors for items that have similar topic distributions. For example, if two research papers mention the same algorithm, but the first is popular among computer science researchers while the second suits social-behavior researchers, their latent vectors will differ because of the offset term $\epsilon_j$. The more users have rated an article (item), the higher the precision of its offset term.
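A minimal simulation of the CTR generative process above, under assumed toy dimensions; for simplicity the per-rating confidence $c_{ij}$ is held constant here (in practice it differs between observed and unobserved ratings), and all hyperparameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
U, D, K, V, N = 4, 3, 3, 10, 20   # users, items, topics, vocab, words/item (toy)
lam_u, lam_v = 1.0, 10.0          # precisions lambda_u, lambda_v
beta = rng.dirichlet(np.ones(V), size=K)        # topic-word distributions

u = rng.normal(0, lam_u ** -0.5, size=(U, K))   # 1. user latent vectors
theta = rng.dirichlet(np.ones(K), size=D)       # 2a. topic proportions
eps = rng.normal(0, lam_v ** -0.5, size=(D, K)) # 2b. item latent offsets
v = eps + theta                                 #     item latent vectors v_j
words = [[rng.choice(V, p=beta[rng.choice(K, p=theta[j])]) for _ in range(N)]
         for j in range(D)]                     # 2c. draw each word via its topic

c = 1.0                                         # constant confidence c_ij
r = rng.normal(u @ v.T, c ** -0.5)              # 3. r_ij ~ N(u_i^T v_j, c_ij^-1)
print(np.round(r, 2))
```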
5 CRPF

Figure 6: The graphical model for the CTR model

Content-based recommendation with Poisson factorization (CRPF) [5] was proposed to exploit the properties of Poisson factorization [6]. The two most important properties are: (1) the Gamma distribution is well suited to modeling sparse information; (2) Poisson factorization can be viewed as a resource allocation task in which a user allocates his or her attention to rating particular items. The first property works well on recommendation data because roughly 99% of its entries are zeros. The second property treats any unrated item as unobserved rather than as negatively rated by the user. This matters because Gaussian-based MF assumes that a rating of zero implies a negative response from the user. Thus, modeling user attention with Gamma distributions is a step forward for the recommendation problem.

5.1 Model

Its generative process is described below:

1. Document model:
   (a) Draw topics $\beta_{vk} \sim \mathrm{Gamma}(a, b)$
   (b) Draw document topic intensities $\theta_{dk} \sim \mathrm{Gamma}(c, d)$
   (c) Draw word count $w_{dv} \sim \mathrm{Poisson}(\theta_d^T \beta_v)$
2. Recommendation model:
   (a) Draw user preferences $\eta_{uk} \sim \mathrm{Gamma}(e, f)$
   (b) Draw document topic offsets $\epsilon_{dk} \sim \mathrm{Gamma}(g, h)$
   (c) Draw rating $r_{ud} \sim \mathrm{Poisson}(\eta_u^T (\theta_d + \epsilon_d))$
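A minimal simulation of the CRPF generative process above, assuming toy dimensions and unit Gamma hyperparameters; note this reads $\mathrm{Gamma}(a, b)$ as shape/rate, so numpy's shape/scale parameterization uses scale $1/b$.

```python
import numpy as np

rng = np.random.default_rng(5)
U, D, V, K = 4, 3, 10, 3              # users, documents, vocab, topics (toy sizes)
a = b = c = d = e = f = g = h = 1.0   # Gamma hyperparameters (illustrative)

# Document model
beta = rng.gamma(a, 1 / b, size=(V, K))    # topics beta_vk
theta = rng.gamma(c, 1 / d, size=(D, K))   # document topic intensities theta_dk
w = rng.poisson(theta @ beta.T)            # counts w_dv ~ Poisson(theta_d^T beta_v)

# Recommendation model
eta = rng.gamma(e, 1 / f, size=(U, K))     # user preferences eta_uk
eps = rng.gamma(g, 1 / h, size=(D, K))     # document topic offsets epsilon_dk
r = rng.poisson(eta @ (theta + eps).T)     # r_ud ~ Poisson(eta_u^T (theta_d + eps_d))
print("word counts:\n", w, "\nratings:\n", r)
```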
Another interesting contribution is the connection between LDA and Poisson factorization: Poisson factorization turns out to be a generalization of LDA. Under this view, we treat each document as a user whose latent vector encodes topic preferences, and each word in the vocabulary as an item whose latent vector encodes its attributes; the count of a word observed in a document then plays the role of a rating given by that document to that word.
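To make the document-as-user analogy concrete, here is a minimal sketch under assumed toy dimensions: a small document-term count matrix is given the role of a rating matrix, with documents as users, vocabulary words as items, and counts as ratings; the matrix and Gamma parameters are made up for illustration.

```python
import numpy as np

# Toy document-term count matrix: rows = documents ("users"),
# columns = vocabulary words ("items"), entries = counts ("ratings").
counts = np.array([[3, 0, 1, 0],
                   [0, 2, 0, 4],
                   [1, 1, 0, 0]])

docs, vocab = counts.shape
K = 2  # latent topics (illustrative)

rng = np.random.default_rng(6)
theta = rng.gamma(1.0, 1.0, size=(docs, K))  # document "preferences"
beta = rng.gamma(1.0, 1.0, size=(vocab, K))  # word "attributes"

# Poisson factorization of the counts under this parameterization:
# counts[d, v] ~ Poisson(theta_d^T beta_v), i.e. exactly the PF rating model
# with documents in the role of users and words in the role of items.
rate = theta @ beta.T
print("expected counts under the factorization:\n", np.round(rate, 2))
```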
CRPF models user preferences and can identify the groups of users interested in a given document. Figure 7 shows how an article on the EM algorithm is popular not only among machine learning researchers but also among computer vision and statistical network analysis researchers.

Figure 7: The topic distribution and user preferences estimated by CRPF

6 Revision

Jun 14 - Draft - Summarized all models and their generative processes.

References

[1] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30-37, 2009.

[2] Jon D. McAuliffe and David M. Blei. Supervised topic models. In Advances in Neural Information Processing Systems, pages 121-128, 2008.

[3] Deepak Agarwal and Bee-Chung Chen. fLDA: Matrix factorization through latent Dirichlet allocation. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, pages 91-100. ACM, 2010.

[4] Chong Wang and David M. Blei. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 448-456. ACM, 2011.

[5] Prem K. Gopalan, Laurent Charlin, and David M. Blei. Content-based recommendations with Poisson factorization. In Advances in Neural Information Processing Systems, pages 3176-3184, 2014.

[6] Prem Gopalan, Jake M. Hofman, and David M. Blei. Scalable recommendation with Poisson factorization. arXiv preprint arXiv:1311.1704, 2013.