Offline Components: Collaborative Filtering in Cold-start Situations

Similar documents
Scaling Neighbourhood Methods

Restricted Boltzmann Machines for Collaborative Filtering

Generative Models for Discrete Data

Andriy Mnih and Ruslan Salakhutdinov

Regression-based Latent Factor Models

Content-based Recommendation

Large-scale Ordinal Collaborative Filtering

Mixed Membership Matrix Factorization

Collaborative Filtering. Radek Pelánek

Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF)

Mixed Membership Matrix Factorization

Data Mining Techniques

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

Bayesian Methods for Machine Learning

Collaborative topic models: motivations cont

Generative Clustering, Topic Modeling, & Bayesian Inference

Lecture 13 : Variational Inference: Mean Field Approximation

Bayesian Matrix Factorization with Side Information and Dirichlet Process Mixtures

Matrix Factorization and Recommendation Systems

Predictive Discrete Latent Factor Models for large incomplete dyadic data

Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model

Scalable Bayesian Matrix Factorization

Algorithms for Collaborative Filtering

Matrix Factorization Techniques for Recommender Systems

Factor Modeling for Advertisement Targeting

Click Prediction and Preference Ranking of RSS Feeds

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Recent Advances in Bayesian Inference Techniques

STA 4273H: Statistical Machine Learning

Using SVD to Recommend Movies

Collaborative Filtering Matrix Completion Alternating Least Squares

Recommender Systems. Dipanjan Das Language Technologies Institute Carnegie Mellon University. 20 November, 2007

Matrix Factorization Techniques For Recommender Systems. Collaborative Filtering

Latent Dirichlet Allocation (LDA)

Graphical Models for Collaborative Filtering

Latent Dirichlet Allocation Introduction/Overview

CS598 Machine Learning in Computational Biology (Lecture 5: Matrix - part 2) Professor Jian Peng Teaching Assistant: Rongda Zhu

Collaborative Topic Modeling for Recommending Scientific Articles

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Density Estimation. Seungjin Choi

Matrix Factorization In Recommender Systems. Yong Zheng, PhDc Center for Web Intelligence, DePaul University, USA March 4, 2015

1-bit Matrix Completion. PAC-Bayes and Variational Approximation

Generalized Linear Models in Collaborative Filtering

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

STA 4273H: Statistical Machine Learning

Introduction to Machine Learning

13: Variational inference II

Chapter 20. Deep Generative Models

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London. Table of contents

Sparse Stochastic Inference for Latent Dirichlet Allocation

Gaussian Models

Classical Predictive Models

STA 4273H: Statistical Machine Learning

Machine Learning Techniques for Computer Vision

Probabilistic Time Series Classification

Collaborative Filtering

Decoupled Collaborative Ranking

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

Distributed ML for DOSNs: giving power back to users

Study Notes on the Latent Dirichlet Allocation

Nonparametric Bayesian Methods (Gaussian Processes)

Collaborative Filtering Applied to Educational Data Mining

Side Information Aware Bayesian Affinity Estimation

Collaborative Filtering: A Machine Learning Perspective

Matrix Factorization Techniques for Recommender Systems

Introduction to Gaussian Process

A Simple Algorithm for Nuclear Norm Regularized Problems

Matrix Factorizations: A Tale of Two Norms

Modeling User Rating Profiles For Collaborative Filtering

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Matrix Factorization and Factorization Machines for Recommender Systems

Advanced Machine Learning

Topic Modelling and Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA)

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

Pattern Recognition and Machine Learning

Hierarchical Bayesian Matrix Factorization with Side Information

Bayesian Nonparametrics for Speech and Signal Processing

Bagging During Markov Chain Monte Carlo for Smoother Predictions

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

Matrix and Tensor Factorization from a Machine Learning Perspective

cross-language retrieval (by concatenate features of different language in X and find co-shared U). TOEFL/GRE synonym in the same way.

CPSC 540: Machine Learning

Computational statistics

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Online Algorithms for Sum-Product

Logistic Regression. COMP 527 Danushka Bollegala

Afternoon Meeting on Bayesian Computation 2018 University of Reading

Introduction to Logistic Regression

Scaling up Bayesian Inference

Chris Bishop s PRML Ch. 8: Graphical Models

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Integrated Non-Factorized Variational Inference

CS425: Algorithms for Web Scale Data

Replicated Softmax: an Undirected Topic Model. Stephen Turner

Bayes methods for categorical data. April 25, 2017

Bayesian linear regression

6.034 Introduction to Artificial Intelligence


Transcription:

Offline Components: Collaborative Filtering in Cold-start Situations

Problem: The algorithm selects item j, with item features x_j (keywords, content categories, ...), for user i, with user features x_i (demographics, browse history, search history, ...). For each visit (i, j) we observe a response y_ij (explicit rating, or implicit click/no-click). Goal: predict the unobserved entries based on the features and the observed entries.

Model Choices. Feature-based (or content-based) approach: use features to predict the response (regression, Bayes net, mixture models, ...). Limitation: needs predictive features; bias is often high and signals at granular levels are not captured. Collaborative filtering (CF, aka memory-based): make recommendations based on past user-item interactions (user-user, item-item, matrix factorization, ...; see [Adomavicius & Tuzhilin, TKDE, 2005], [Konstan, SIGMOD '08 Tutorial], etc.). Better performance for old users and old items, but does not naturally handle new users and new items (cold-start).

Collaborative Filtering (memory-based methods): user-user similarity, item-item similarities, or incorporating both. Estimating similarities: Pearson's correlation, or optimization based (Koren et al).

How to Deal with the Cold-Start Problem. Heuristic-based approaches: linear combination of regression and CF models; Filterbot: add user features as pseudo users and do collaborative filtering. Hybrid approaches: use a content-based model to fill in entries, then use CF. Matrix factorization: good performance on Netflix (Koren, 2009). Model-based approaches: bilinear random-effects model (probabilistic matrix factorization), good on Netflix data [Salakhutdinov & Mnih, ICML 2008]; add feature-based regression to matrix factorization (Agarwal and Chen, 2009); add topic discovery (from textual items) to matrix factorization (Agarwal and Chen, 2009; Wang and Blei, 2011).

Per-item regression models: when tracking users by cookies, the distribution of visit patterns can get extremely skewed; the majority of cookies have only 1-2 visits. Per-item models (regressions) based on user covariates are attractive in such cases.

Several per-item regressions: multi-task learning (Agarwal, Chen and Elango, KDD 2010). Low dimension (5-10); B is estimated from retrospective data; captures affinity to old items.
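As a rough illustration (not from the slides), a minimal sketch of one per-item regression on user covariates, assuming binary click/no-click responses and a low-dimensional projection B of the user features; all data and dimensions below are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_users, n_user_features, low_dim = 1000, 50, 5

X_user = rng.normal(size=(n_users, n_user_features))   # user covariates x_i (fake)
B = rng.normal(size=(n_user_features, low_dim))        # low-dimensional projection (estimated from retrospective data in the paper)

# One regression per item on the projected covariates.
per_item_models = {}
for item_id in range(3):                               # a few illustrative items
    y = rng.integers(0, 2, size=n_users)               # observed clicks for this item (fake)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_user @ B, y)                           # regress response on B' x_i
    per_item_models[item_id] = model

# Predict a new user's click probability for item 0.
x_new = rng.normal(size=(1, n_user_features))
p_click = per_item_models[0].predict_proba(x_new @ B)[0, 1]
```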

Per-user, per-item models via bilinear random-effects model

Motivation: data measuring k-way interactions is pervasive; consider k = 2 for all our discussions (e.g. user-movie, user-content, user-publisher-ads, ...). Power law on both user and item degrees. Classical technique: approximate the matrix through a singular value decomposition (SVD) after adjusting for marginal effects (user popularity, movie popularity, ...). This does not work: the matrix is highly incomplete, so there is severe over-fitting. Key issue: regularization of the eigenvectors (factors) to avoid overfitting.

Early work on complete matrices: Tukey's 1-df model (1956), a rank-1 approximation of a small, nearly complete matrix; criss-cross regression (Gabriel, 1978). Incomplete matrices: psychometrics (1-factor model only; small data sets; 1960s). Modern-day recommender problems: highly incomplete, large, noisy.

Latent Factor Models. [Slide figure: a user vector u and an item vector v in a latent space with "sporty" and "newsy" directions, with affinity = u'v; and, for factors on the simplex, a user topic-affinity vector s and item topic vector z with affinity = s'z.]

Factorization: Brief Overview. Latent user factors: (α_i, u_i = (u_i1, ..., u_in)); latent movie factors: (β_j, v_j = (v_j1, ..., v_jm)). Interaction: E(y_ij) = µ + α_i + β_j + u_i' B v_j. This gives (Nn + Mm) parameters. Key technical issue: the model will overfit for moderate values of n, m; hence regularization.

Latent Factor Models: Different Aspects. Matrix factorization with factors in Euclidean space or factors on the simplex; incorporating features and ratings simultaneously; online updates.

Maximum Margin Matrix Factorization (MMMF): complete the matrix by minimizing a loss (hinge, squared error) on the observed entries subject to constraints on the trace norm (Srebro, Rennie, Jaakkola, NIPS 2004); convex semi-definite programming (expensive, not scalable). Fast MMMF (Rennie & Srebro, ICML 2005): constrain the Frobenius norm of the left and right eigenvector matrices; not convex, but becomes scalable. Other variation: Ensemble MMMF (DeCoste, ICML 2005), ensembles of partially trained MMMF (some improvements).

Matrix Factorization for the Netflix prize data: minimize the objective function

  Σ_{(i,j) ∈ obs} (r_ij − u_i' v_j)² + λ (‖u_i‖² + ‖v_j‖²)

Simon Funk: stochastic gradient descent. Koren et al (KDD 2007): alternating least squares; they moved to SGD later in the competition.
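A minimal sketch of stochastic gradient descent on this objective, in the spirit of the Funk approach; the learning rate, λ and toy ratings are placeholders, not the settings used in the competition.

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, rank=10, lr=0.01, lam=0.05, n_epochs=20, seed=0):
    """Minimize sum (r_ij - u_i'v_j)^2 + lam * (||u_i||^2 + ||v_j||^2) over observed ratings."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(n_users, rank))
    V = 0.1 * rng.normal(size=(n_items, rank))
    for _ in range(n_epochs):
        for i, j, r in ratings:                      # ratings: list of (user, item, value)
            err = r - U[i] @ V[j]
            U[i] += lr * (err * V[j] - lam * U[i])   # gradient step on u_i
            V[j] += lr * (err * U[i] - lam * V[j])   # gradient step on v_j
    return U, V

# Toy usage with a handful of observed entries.
obs = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]
U, V = sgd_mf(obs, n_users=3, n_items=2, rank=2)
print(U @ V.T)   # reconstructed rating matrix
```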

Probabilistic Matrix Factorization (Salakhutdinov & Mnih, NIPS 2008):

  r_ij ~ N(u_i' v_j, σ²),  u_i ~ MVN(0, a_u I),  v_j ~ MVN(0, a_v I)

Optimization is through gradient descent (iterated conditional modes). Other variations: constraining the mean through a sigmoid, using who-rated-whom. Combining with Restricted Boltzmann Machines also improved performance.

Bayesian Probabilistic Matrix Factorization (Salakhutdinov and Mnih, ICML 2008): fully Bayesian treatment using an MCMC approach; significant improvement. Interpretation as a fully Bayesian hierarchical model shows why that is the case: failing to incorporate uncertainty leads to bias in the estimates. The posterior is multi-modal, and MCMC helps in converging to a better mode. MCEM is also more resistant to over-fitting. Variance component: a_u.

Non-parametric Bayesian matrix completion (Zhou et al, SAM, 2010): specify the rank probabilistically (automatic rank selection).

  r_ij ~ N( Σ_{k=1}^{r} z_k u_ik v_jk, σ² )
  z_k ~ Ber(π_k),  π_k ~ Beta(a/r, b(r−1)/r)
  Marginally, z_k ~ Ber( a / (a + b(r−1)) )
  E(# factors) = r a / (a + b(r−1))
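A quick sanity check of the implied rank under this prior (a sketch with arbitrary a, b, r): marginally each z_k is Bernoulli with success probability a/(a + b(r−1)), so the empirical number of active factors should match the formula above.

```python
import numpy as np

a, b, r = 2.0, 1.0, 100          # arbitrary hyper-parameters and maximum rank
rng = np.random.default_rng(0)

# Draw pi_k ~ Beta(a/r, b(r-1)/r) and z_k ~ Ber(pi_k), then count active factors.
pi = rng.beta(a / r, b * (r - 1) / r, size=(10000, r))
z = rng.random((10000, r)) < pi
print(z.sum(axis=1).mean())          # empirical E[# factors]
print(r * a / (a + b * (r - 1)))     # formula: ra / (a + b(r-1))
```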

How to incorporate features: deal with both warm-start and cold-start. Models to predict ratings for new pairs. Warm-start: (user, movie) present in the training data with large sample size. Cold-start: at least one of (user, movie) is new or has small sample size. This is a rough definition; warm-start/cold-start is a continuum. Challenges: highly incomplete (user, movie) matrix; heavy-tailed degree distributions for users/movies (a large fraction of ratings comes from a small fraction of users/movies); handling both warm-start and cold-start effectively in the presence of predictive features.

Possible approaches. Large-scale regression based on covariates: does not provide good estimates for heavy users/movies; large number of predictors needed to estimate interactions. Collaborative filtering (neighborhood-based, factorization): good for warm-start; cold-start dealt with separately. A single model that handles cold-start and warm-start: heavy users/movies get a user/movie-specific model; light users/movies fall back on the regression model; a smooth fallback mechanism gives good performance.

Add Feature-based Regression into Matrix Factorization RLFM: Regression-based Latent Factor Model

Regression-based Factorization Model (RLFM). Main idea: a flexible prior that predicts the factors through regressions; seamlessly handles cold-start and warm-start. Modified state equation to incorporate covariates.

RLFM: Model

  Rating: y_ij ~ N(µ_ij, σ²) (Gaussian model); y_ij ~ Bernoulli(µ_ij) (logistic model, for binary ratings); y_ij ~ Poisson(µ_ij N_ij) (Poisson model, for counts), where y_ij is the rating user i gives item j.

  t(µ_ij) = x_ij' b + α_i + β_j + u_i' v_j

  Bias of user i:        α_i = g_0' x_i + ε_i^α,  ε_i^α ~ N(0, σ_α²)
  Popularity of item j:  β_j = d_0' x_j + ε_j^β,  ε_j^β ~ N(0, σ_β²)
  Factors of user i:     u_i = G x_i + ε_i^u,     ε_i^u ~ N(0, σ_u² I)
  Factors of item j:     v_j = D x_j + ε_j^v,     ε_j^v ~ N(0, σ_v² I)

  Other classes of regression models could be used.
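A small simulation sketch of the RLFM prior and mean structure for the Gaussian case; the covariate dimensions, variances and the omission of the x_ij' b term are arbitrary simplifications for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, p_user, p_item, rank = 200, 100, 8, 6, 4
sig_a, sig_b, sig_u, sig_v, sig_y = 0.5, 0.5, 0.3, 0.3, 1.0   # made-up standard deviations

x_user = rng.normal(size=(n_users, p_user))     # user covariates x_i
x_item = rng.normal(size=(n_items, p_item))     # item covariates x_j
g0, d0 = rng.normal(size=p_user), rng.normal(size=p_item)
G = rng.normal(size=(rank, p_user))
D = rng.normal(size=(rank, p_item))

# Regression priors: biases and factors are regressions on covariates plus Gaussian noise.
alpha = x_user @ g0 + sig_a * rng.normal(size=n_users)          # user bias alpha_i
beta  = x_item @ d0 + sig_b * rng.normal(size=n_items)          # item popularity beta_j
u = x_user @ G.T + sig_u * rng.normal(size=(n_users, rank))     # user factors u_i
v = x_item @ D.T + sig_v * rng.normal(size=(n_items, rank))     # item factors v_j

# Mean rating for a (user, item) pair; the x_ij' b term is omitted for brevity.
def mu(i, j):
    return alpha[i] + beta[j] + u[i] @ v[j]

y_03 = mu(0, 3) + sig_y * rng.normal()    # one simulated Gaussian rating
```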

Graphical representation of the model

Advantages of RLFM: better regularization of the factors (covariates shrink them towards a better centroid); cold-start: fallback to the regression model (FeatureOnly).

RLFM: Illustration of Shrinkage. Plot of the first factor value for each user (fitted using Yahoo! FP data).

Induced correlations among observations: hierarchical random-effects model; the marginal distribution is obtained by integrating out the random effects.

Closer look at induced marginal correlations

Model fitting: EM for our class of models

The parameters for RLFM. Latent parameters: ({α_i}, {β_j}, {u_i}, {v_j}). Hyper-parameters: Θ = (b, G, D, A_u = a_u I, A_v = a_v I).

Computing the mode. [Slide shows the objective that is minimized.]

The EM algorithm

Computing the E-step: often hard to compute in closed form. Stochastic EM (Markov Chain EM; MCEM): compute the expectation by drawing samples from the posterior of the latent factors; effective for multi-modal posteriors but more expensive. Iterated Conditional Modes algorithm (ICM): faster, but gives biased hyper-parameter estimates.

Monte Carlo E-step: through a vanilla Gibbs sampler (conditionals in closed form). The other conditionals are also Gaussian and in closed form. Conditionals of users (movies) are sampled simultaneously. A small number of samples in early iterations, larger numbers in later iterations.
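Schematically, the MCEM fit alternates Gibbs sampling of the factors with hyper-parameter updates. The sketch below shows only the control flow; the two helper functions are placeholders standing in for the closed-form conditional draws and the M-step regressions described on the slides.

```python
import numpy as np

def gibbs_sample_factors(theta, data, n_samples, rng):
    """Placeholder E-step: would draw factor samples from their full conditionals."""
    return [{"u": rng.normal(size=(data["n_users"], theta["rank"])),
             "v": rng.normal(size=(data["n_items"], theta["rank"]))}
            for _ in range(n_samples)]

def update_hyperparameters(theta, samples):
    """Placeholder M-step: would solve regressions for b, G, D and update a_u, a_v
    from the Monte Carlo samples (including their spread, unlike ICM)."""
    theta = dict(theta)
    theta["a_u"] = np.mean([np.var(s["u"]) for s in samples])
    return theta

rng = np.random.default_rng(0)
data = {"n_users": 50, "n_items": 30}
theta = {"rank": 3, "a_u": 1.0}

for it in range(10):
    n_samples = 5 + 5 * it                                        # few samples early, more later
    samples = gibbs_sample_factors(theta, data, n_samples, rng)   # Monte Carlo E-step
    theta = update_hyperparameters(theta, samples)                # M-step
```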

M-step (why MCEM is better than ICM): update G by optimizing the corresponding regression objective; update A_u = a_u I. The posterior uncertainty in the factors is ignored by ICM, which underestimates factor variability: factors are over-shrunk and the posterior is not explored well.

Experiment 1: Better regularization. MovieLens-100K, average RMSE using pre-specified splits; ZeroMean, RLFM and FeatureOnly (no cold-start issues). Covariates: users: age, gender, zipcode (1st digit only); movies: genres.

Experiment 2: Better handling of cold-start. MovieLens-1M; EachMovie. Training-test split based on timestamp. Same covariates as in Experiment 1.

Experiment 4: Predicting click-rate on articles. Goal: predict the click-rate on articles for a user at the F1 position. Article lifetimes are short, so dynamic updates are important. User covariates: age, gender, geo, browse behavior. Article covariates: content category, keywords. 2M ratings, 30K users, 4.5K articles.

Results on Y! FP data

Some other related approaches. Stern, Herbrich and Graepel (WWW, 2009): similar to RLFM, with a different parametrization; expectation propagation is used to fit the models. Porteous, Asuncion and Welling (AAAI, 2011): non-parametric approach using a Dirichlet process. Agarwal, Zhang and Mazumdar (Annals of Applied Statistics, 2011): regression plus per-user random effects regularized through a graphical lasso.

Add Topic Discovery into Matrix Factorization flda: Matrix Factorization through Latent Dirichlet Allocation

flda: Introduction. Model the rating y_ij that user i gives item j as the user's affinity to the topics that the item has:

  y_ij = ... + Σ_k s_ik z_jk

where s_ik is user i's affinity to topic k and z_jk = Pr(item j has topic k), estimated by averaging the LDA topic of each word in item j. Old items: the z_jk are item latent factors learnt from data with the LDA prior. New items: the z_jk are predicted based on the bag of words in the item. Unlike regular unsupervised LDA topic modeling, here the LDA topics are learnt in a supervised manner based on past rating data. flda can be thought of as a multi-task learning version of the supervised LDA model [Blei 07] for cold-start recommendation.

LDA Topic Modeling (1). LDA is effective for unsupervised topic discovery [Blei 03]. It models the generating process of a corpus of items (articles):
For each topic k, draw a word distribution Φ_k = [Φ_k1, ..., Φ_kW] ~ Dir(η).
For each item j, draw a topic distribution θ_j = [θ_j1, ..., θ_jK] ~ Dir(λ).
For each word, say the nth word, in item j: draw a topic z_jn for that word from θ_j, then draw a word w_jn from Φ_k with topic k = z_jn.
[Slide figure: item j with its topic distribution [θ_j1, ..., θ_jK], per-word topics z_j1, ..., z_jn, ..., observed words w_j1, ..., w_jn, ..., and the topic-word distributions Φ_1, ..., Φ_k, ..., Φ_K.]
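A literal sketch of that generating process; the number of topics, vocabulary size and Dirichlet parameters are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
K, W, eta, lam = 5, 1000, 0.1, 0.5        # topics, vocabulary size, Dirichlet priors

# For each topic k, draw a word distribution Phi_k ~ Dir(eta).
Phi = rng.dirichlet(np.full(W, eta), size=K)

def generate_item(n_words):
    theta_j = rng.dirichlet(np.full(K, lam))                   # topic distribution of item j
    z = rng.choice(K, size=n_words, p=theta_j)                 # per-word topics z_jn
    words = np.array([rng.choice(W, p=Phi[k]) for k in z])     # words w_jn
    return theta_j, z, words

theta_j, z_j, w_j = generate_item(n_words=80)
```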

LDA Topic Modeling (2). Model training: estimate the prior parameters and the posterior topic-word distribution Φ based on a training corpus of items; EM + Gibbs sampling is a popular method. Inference for new items: compute the item topic distribution based on the prior parameters and the Φ estimated in the training phase. Supervised LDA [Blei 07]: predict a target value for each item based on supervised LDA topics,

  y_j = Σ_k s_k z_jk   vs.   y_ij = ... + Σ_k s_ik z_jk   (flda: one regression per user)

where s_k is the regression weight for topic k, y_j is the target value of item j, and z_jk = Pr(item j has topic k), estimated by averaging the topic of each word in item j. The same set of topics is shared across the different regressions.

flda: Model

  Rating: y_ij ~ N(µ_ij, σ²) (Gaussian model); y_ij ~ Bernoulli(µ_ij) (logistic model, for binary ratings); y_ij ~ Poisson(µ_ij N_ij) (Poisson model, for counts), where y_ij is the rating user i gives item j.

  t(µ_ij) = x_ij' b + α_i + β_j + Σ_k s_ik z_jk

  Bias of user i:            α_i = g_0' x_i + ε_i^α,  ε_i^α ~ N(0, σ_α²)
  Popularity of item j:      β_j = d_0' x_j + ε_j^β,  ε_j^β ~ N(0, σ_β²)
  Topic affinity of user i:  s_i = H x_i + ε_i^s,     ε_i^s ~ N(0, σ_s² I)
  Pr(item j has topic k):    z_jk = Σ_n 1(z_jn = k) / (# words in item j)
  Observed words:            w_jn ~ LDA(λ, η, z_jn), where z_jn is the LDA topic of the nth word in item j and w_jn is the nth word in item j.
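To make the rating equation concrete, a sketch of the flda mean for one (user, item) pair, with z_jk obtained by averaging the per-word topic assignments; all quantities are toy values and the x_ij' b term is omitted.

```python
import numpy as np

rng = np.random.default_rng(2)
K, p_user, p_item = 5, 8, 6

x_i = rng.normal(size=p_user)                 # user covariates
x_j = rng.normal(size=p_item)                 # item covariates
g0, d0, H = rng.normal(size=p_user), rng.normal(size=p_item), rng.normal(size=(K, p_user))

alpha_i = x_i @ g0 + 0.3 * rng.normal()       # user bias
beta_j = x_j @ d0 + 0.3 * rng.normal()        # item popularity
s_i = H @ x_i + 0.2 * rng.normal(size=K)      # user topic affinities

z_words = rng.choice(K, size=120)             # LDA topic of each word in item j
z_j = np.bincount(z_words, minlength=K) / len(z_words)   # z_jk = fraction of words on topic k

mu_ij = alpha_i + beta_j + s_i @ z_j          # mean rating (x_ij' b term omitted)
```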

Model Fitting. Given: features X = {x_i, x_j, x_ij}, observed ratings y = {y_ij} and words w = {w_jn}. Estimate:
Parameters: Θ = [b, g_0, d_0, H, σ², a_α, a_β, A_s, λ, η] (regression weights and prior parameters).
Latent factors: Δ = {α_i, β_j, s_i} and z = {z_jn} (user factors, item factors and per-word topic assignments).
Empirical Bayes approach:
Maximum likelihood estimate of the parameters:  Θ̂ = argmax_Θ Pr[y, w | Θ] = argmax_Θ ∫ Pr[y, w, Δ, z | Θ] dΔ dz
Posterior distribution of the factors:  Pr[Δ, z | y, Θ̂]

The EM Algorithm. Iterate through the E and M steps until convergence. Let Θ̂^(n) be the current estimate.
E-step: compute

  f(Θ) = E_{(Δ, z) | y, w, Θ̂^(n)} [ log Pr(y, w, Δ, z | Θ) ]

The expectation is not available in closed form; we draw Gibbs samples and compute the Monte Carlo mean.
M-step: find

  Θ̂^(n+1) = argmax_Θ f(Θ)

This consists of solving a number of regression and optimization problems.

Supervised Topic Assignment. The Gibbs conditional for the topic of the nth word in item j:

  Pr(z_jn = k | Rest) ∝ (Z_jk^{¬jn} + λ) · (Z_{k,w_jn}^{¬jn} + η) / (Z_k^{¬jn} + Wη) · Π_{i rated j} f(y_ij | z_jn = k)

The first two factors are the same as in unsupervised LDA; the last factor is the likelihood of the observed ratings by users who rated item j when z_jn is set to topic k (the probability of observing y_ij given the model).
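A sketch of the unnormalized sampling weight for one word's topic, assuming Gaussian ratings so that f(y_ij | z_jn = k) is a normal density; the count arrays and the simplified rating mean (s_i' z_j only) are stand-ins for the full model.

```python
import numpy as np
from scipy.stats import norm

def topic_weights(Z_jk, Z_kw, Z_k, w, lam, eta, W,
                  ratings_j, s_users, z_bar_minus, n_words, sigma):
    """Unnormalized Pr(z_jn = k | rest) for every topic k.

    Z_jk, Z_kw, Z_k: LDA count arrays with the current word removed
                     (item-topic counts, topic-word counts, topic totals).
    ratings_j:       {user i: y_ij} for users who rated item j.
    s_users:         {user i: topic-affinity vector s_i}.
    z_bar_minus:     item j's topic proportions computed without the current word.
    """
    K = len(Z_jk)
    lda_part = (Z_jk + lam) * (Z_kw[:, w] + eta) / (Z_k + W * eta)   # unsupervised LDA term
    weights = np.array(lda_part, dtype=float)
    for k in range(K):
        z_j = z_bar_minus + np.eye(K)[k] / n_words       # proportions if this word takes topic k
        for i, y in ratings_j.items():                   # likelihood of ratings on item j
            weights[k] *= norm.pdf(y, loc=s_users[i] @ z_j, scale=sigma)
    return weights
```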

flda: Experimental Results (Movie). Task: predict the rating that a user would give a movie. Training/test split: sort observations by time; first 75% training data, last 25% test data. Item warm-start scenario: only 2% new items in the test data.

  Model          Test RMSE
  RLFM           0.9363
  flda           0.9381
  Factor-Only    0.9422
  FilterBot      0.9517
  unsup-LDA      0.9520
  MostPopular    0.9726
  Feature-Only   1.0906
  Constant       1.1190

flda is as strong as the best method, and it does not reduce performance in warm-start scenarios.

flda: Experimental Results (Yahoo! Buzz). Task: predict whether a user would buzz-up an article. Severe item cold-start: all items in the test data are new. flda significantly outperforms the other models. Data statistics: 1.2M observations, 4K users, 10K articles.

Experimental Results: Buzzing Topics. 3/4 of the topics are interpretable; 1/2 are similar to unsupervised topics.

  Topic               Top terms (after stemming)
  CIA interrogation   bush, tortur, interrog, terror, administr, CIA, offici, suspect, releas, investig, georg, memo, al
  Swine flu           mexico, flu, pirat, swine, drug, ship, somali, border, mexican, hostag, offici, somalia, captain
  NFL games           NFL, player, team, suleman, game, nadya, star, high, octuplet, nadya_suleman, michael, week
  Gay marriage        court, gai, marriag, suprem, right, judg, rule, sex, pope, supreme_court, appeal, ban, legal, allow
  Sarah Palin         palin, republican, parti, obama, limbaugh, sarah, rush, gop, presid, sarah_palin, sai, gov, alaska
  American idol       idol, american, night, star, look, michel, win, dress, susan, danc, judg, boyl, michelle_obama
  Recession           economi, recess, job, percent, econom, bank, expect, rate, jobless, year, unemploy, month
  North Korea issues  north, korea, china, north_korea, launch, nuclear, rocket, missil, south, said, russia

flda Summary. flda is a useful model for cold-start item recommendation. It also provides interpretable recommendations for users: each user's preference over interpretable LDA topics. Future directions: investigate the Gibbs sampling chains and the convergence properties of the EM algorithm; apply flda to other multi-task prediction problems; flda can be used as a tool to generate supervised features (topics) from text data.

Summary. Regularizing factors through covariates is effective. A regression-based factor model that regularizes better and deals with both cold-start and warm-start seamlessly in a single framework looks attractive. The fitting method is scalable: Gibbs sampling for users and movies can be done in parallel, and the regressions in the M-step can be done with any off-the-shelf scalable linear-regression routine. Distributed computing on Hadoop: fit multiple models and average across partitions.

Hierarchical smoothing. Advertising application (Agarwal et al., KDD 2010). The model works on the product of states for each node pair across two hierarchies. [Slide figure: publisher hierarchy (pubclass → Pub-id) and advertiser hierarchy (Advertiser → campaign → Ad-id → Conv-id), with a state (S_z, E_z, λ_z) for each node z.] A spike-and-slab prior is placed on the corrections; it is known to encourage parsimonious solutions, so several cell states have no corrections. It had not been used before for multi-hierarchy models, only in regression. We choose P = 0.5 and choose a, a pseudo number of successes, by cross-validation.
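For intuition, a tiny sketch of drawing corrections under a spike-and-slab prior with P = 0.5; the Gaussian slab and all numbers are illustrative assumptions, not the exact prior from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, P, slab_scale = 20, 0.5, 1.0       # illustrative values

# With probability P a cell gets no correction (the spike at zero);
# otherwise its correction is drawn from the slab distribution.
spike = rng.random(n_cells) < P
corrections = np.where(spike, 0.0, rng.normal(scale=slab_scale, size=n_cells))
print((corrections == 0).sum(), "of", n_cells, "cells have no correction")
```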