Causal Inference and Recommendation Systems. David M. Blei Departments of Computer Science and Statistics Data Science Institute Columbia University

Similar documents
Modeling User Exposure in Recommendation

arxiv: v2 [stat.ml] 4 Feb 2016

Probabilistic Matrix Factorization

Dynamic Poisson Factorization

arxiv: v2 [cs.ir] 14 May 2018

Content-based Recommendation

* Matrix Factorization and Recommendation Systems

Collaborative topic models: motivations cont

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

Understanding Music Semantics and User Behavior with Probabilistic Latent Variable Models. Dawen Liang

SQL-Rank: A Listwise Approach to Collaborative Ranking

Power Laws & Rich Get Richer

SCMF: Sparse Covariance Matrix Factorization for Collaborative Filtering

Recommendations as Treatments: Debiasing Learning and Evaluation

Decoupled Collaborative Ranking

Lecture 16 Deep Neural Generative Models

Generative Clustering, Topic Modeling, & Bayesian Inference

An Overview of Edward: A Probabilistic Programming System. Dustin Tran Columbia University

CPSC 340: Machine Learning and Data Mining

A Modified PMF Model Incorporating Implicit Item Associations

Andriy Mnih and Ruslan Salakhutdinov

A.I. in health informatics lecture 2 clinical reasoning & probabilistic inference, I. kevin small & byron wallace

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Matrix Factorization Techniques For Recommender Systems. Collaborative Filtering

Recommender Systems EE448, Big Data Mining, Lecture 10. Weinan Zhang Shanghai Jiao Tong University

Naïve Bayes classification

Scalable Recommendation with Hierarchical Poisson Factorization

Today. Statistical Learning. Coin Flip. Coin Flip. Experiment 1: Heads. Experiment 1: Heads. Which coin will I use? Which coin will I use?

Machine Learning for Large-Scale Data Analysis and Decision Making A. Week #1

Variable selection and machine learning methods in causal inference

Modeling User Rating Profiles For Collaborative Filtering

Minimum Message Length Inference and Mixture Modelling of Inverse Gaussian Distributions

PMR Learning as Inference

Restricted Boltzmann Machines for Collaborative Filtering

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Expectation Propagation Algorithm

Probability and Information Theory. Sargur N. Srihari

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Scaling Neighbourhood Methods

Large-scale Ordinal Collaborative Filtering

arxiv: v3 [cs.ir] 20 May 2014

Mixed Membership Matrix Factorization

Data Mining Techniques

Counterfactual Model for Learning Systems

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Mixed Membership Matrix Factorization

13: Variational inference II

Generative Models for Discrete Data

Probability Theory for Machine Learning. Chris Cremer September 2015

NetBox: A Probabilistic Method for Analyzing Market Basket Data

Lecture : Probabilistic Machine Learning

Machine Learning (CSE 446): Multi-Class Classification; Kernel Methods

Bayesian Approaches Data Mining Selected Technique

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li.

Collaborative Filtering. Radek Pelánek

Relational Stacked Denoising Autoencoder for Tag Recommendation. Hao Wang

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

Bayesian Methods: Naïve Bayes

STA 4273H: Statistical Machine Learning

Recommendation Systems

RaRE: Social Rank Regulated Large-scale Network Embedding

Bayesian Methods for Machine Learning

Primer on statistics:

Be able to define the following terms and answer basic questions about them:

A Gradient-based Adaptive Learning Framework for Efficient Personal Recommendation

Generalized Linear Models and Exponential Families

Statistical Distribution Assumptions of General Linear Models

DEPARTMENT OF COMPUTER SCIENCE AUTUMN SEMESTER MACHINE LEARNING AND ADAPTIVE INTELLIGENCE

Machine Learning Linear Classification. Prof. Matteo Matteucci

Matrix and Tensor Factorization from a Machine Learning Perspective

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

MS&E 226: Small Data

Bounding the Probability of Causation in Mediation Analysis

Nonparametric Bayesian Methods (Gaussian Processes)

Machine Learning 4771

Hierarchical Models & Bayesian Model Selection

Bayesian Deep Learning

Click Prediction and Preference Ranking of RSS Feeds

CS249: ADVANCED DATA MINING

STA 4273H: Statistical Machine Learning

STA414/2104 Statistical Methods for Machine Learning II

Rating Prediction with Topic Gradient Descent Method for Matrix Factorization in Recommendation

The Bayesian Brain. Robert Jacobs Department of Brain & Cognitive Sciences University of Rochester. May 11, 2017

Statistical Models. David M. Blei Columbia University. October 14, 2014

Model Averaging (Bayesian Learning)

Gaussian Models

Stochastic Gradient Estimate Variance in Contrastive Divergence and Persistent Contrastive Divergence

Bayesian Paragraph Vectors

Probabilistic Matrix Factorization with Non-random Missing Data

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

Context Selection for Embedding Models

Recommendation Systems

Some Probability and Statistics

Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006

Computational Cognitive Science

Bayesian probability theory and generative models

Algorithms for Collaborative Filtering

A Count Data Frontier Model

Transcription:

Causal Inference and Recommendation Systems David M. Blei Departments of Computer Science and Statistics Data Science Institute Columbia University

EF In: Ratings or click data Out: A system that provides recommendations

Classical solution: Matrix factorization ITEM ATTRIBUTES USER PREFERENCES u f./ ˇi g. / ˇi y ui u y ui exp-fam > u ˇi Users have latent preferences u. Items have latent attributes ˇi. A users rating comes from an exponential family, e.g., Poisson, Bernoulli, Gaussian. Koren et al., Computer 2009

Classical solution: Matrix factorization All Things Airplane Flying Solo Crew-Only 787 Flight Is Approved By FAA All Aboard Rescued After Plane Skids Into Water at Bali Airport Investigators Begin to Test Other Parts On the 787 American and US Airways May Announce a Merger This Week Personal Finance In Hard Economy for All Ages Older Isn't Better It's Brutal Younger Generations Lag Parents in Wealth-Building Fast-Growing Brokerage Firm Often Tangles With Regulators The Five Stages of Retirement Planning Angst Signs That It's Time for a New Broker Example components from New York Times click data Condition on click data y to estimate the posterior p.; ˇ j y/. estimates of user preferences and item attributes Form predictions with the posterior predictive distribution. This is the backbone of many recommender methods. Salakhutdinov and Mnih, International Conference on Machine Learning 2008 Gopalan et al., Uncertainty in Artificial Intelligence 2015

Today Two related ideas that build on matrix factorization. Causal inference for recommendation (new; feedback wanted!) Modeling user exposure in recommendation Both ideas use the theory around causal inference. Key idea: The exposure model (aka the assignment mechanism) Liang et al., ACM Conference on Recommendation Systems, 2016

Causal Inference for Recommendation (with Dawen Liang and Laurent Charlin)

Causal inference and recommendation Causal inference Expose a unit to a treatment What would happen if a patient received a treatment? Biased data from observational studies Recommendation Expose a user to an item What would happen if a user was recommended an item? Biased data from logged user behavior

Causal recommendation EF DEF Treat the inference of user preferences as a causal inference We want preferences i to answer (from attributes ˇj ): What if user i saw item j? How would user i like it? This leads to a different approach from classical matrix factorization.

A potential outcomes perspective Consider a single user i and fix the attributes of all items ˇj. Assume a potential outcomes model a ij p ij y ij.0/ ı 0./ y ij.1/ exp-fam. > i ˇj / Usual inference about i is causal if we have ignorability, a ij?.y ij.0/; y ij.1// (This is typically the main issue behind observational data.) Rubin, 1974, 1975 Imbens and Rubin, 2015

A potential outcomes perspective Clearly, a ij? y ij.0/, because y ij.0/ is known. But is a ij? y ij.1/? Probably not. Users seek out items that they will like. This biases our estimates of the users preferences. Marlin and Zemel, 2009 Schnabel et al., 2016

Causal recommendation In ratings data we observe which items each user saw a ij (binary). how they rated those items, y ij.1/ when a ij D 1 and y.0/, trivially Fit an exposure model, how users find items. (This is also called the assignment mechanism.) Fit preferences by using the exposure model to correct for self-selection

Causal recommendation Fit the exposure model using Poisson factorization, a ij Poisson. > i j / Use inverse propensity weighting to estimate user preferences i, O D MAP.fa ij ; y ij.a ij /; 1=p.a ij D 1/g/ Intuition: if a user sees and likes a difficult-to-find movie then we upweight its influence on her preferences.

Why propensity weighting works Suppose our data come from this model: i a ij y ij.1/ y ij.0/ ˇj Then inference about preferences i are not causal inferences.

Why propensity weighting works We want data to come from this model: i a ij y ij.1/ y ij.0/ ˇj The exposure probabilities are known and independent of ˇi. Solution: Importance sampling, divide each sample by 1=p.a ij /.

Data sets Learning in Implicit Generative Models Shakir Mohamed and Balaji Lakshminarayanan DeepMind, London {shakir,balajiln}@google.com Abstract Generative adversarial networks (GANs) provide an algorithmic framework for constructing generative models with several appealing properties: they do not require a likelihood function to be specified, only a generating procedure; they provide samples that are sharp and compelling; and they allow ustoharnessour knowledge of building highly accurate neural network classifiers. Here, we develop our understanding of GANs with the aim of forming a rich view of this growing area of machine learning to build connections to the diversesetofstatistical thinking on this topic, of which much can be gained byamutualexchange of ideas. We frame GANs within the wider landscape of algorithms for learning in implicit generative models models that only specify a stochastic procedure with which to generate data and relate these ideas to modelling problems in related fields, such as econometrics and approximate Bayesian computation. We develop likelihood-free inference methods and highlight hypothesis testing as a principle for learning in implicit generative models, using which we are able to derive the objective function used by GANs, and many other related objectives. The testing viewpoint directs our focus to the general problem of density ratio estimation. There are four approaches for density ratio estimation, one of which is a solution using classifiers to distinguish real from generated data. Other approaches such as divergence minimisation and moment matching have also been explored in the GAN literature, and we synthesise these views to form an understanding in terms of the relationships between them and the wider literature, highlighting avenues for future exploration and cross-pollination. MovieLens 1M, 10M: Users rating movies Yahoo R3: Users listening to music Arxiv: Users downloading PDFs (exposure = seeing the abstract)

How good is the exposure model? Model ML-1M ML-10M Yahoo-R3 ArXiv Popularity -1.39-1.64-1.81-3.83 Poisson factorization -0.97-1.08-1.58-2.71 Tran et al., Arxiv, 2016 Golapan et al., Uncertainty in Artificial Intelligence, 2015

Self-selection on ML 1M 500 400 300 200 100 0 0.8 0.6 0.4 0.2 0.0 0.2 0.4 0.6 0.8 1.0 Correlation of the log odds of assignment to the observed rating

Self-selection on ML 1M 600 500 400 300 200 100 0 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Correlation of the log odds of assignment to the predicted rating (Workshop confession: What does this mean?)

Different confounding on the Arxiv 1200 1000 800 600 400 200 0 0.8 0.6 0.4 0.2 0.0 0.2 0.4 0.6 0.8 Correlation of the log odds of assignment to the predicted rating

Aside: Two test sets 70000 60000 SKEW REG 50000 40000 30000 20000 10000 0 0.0 0.2 0.4 0.6 0.8 1.0 Histogram of popularities of the standard and skewed test set A test set is biased by the same process that created the training set. We created skewed test sets, where any item has equal probability. Main idea: Test set distribution is different from the training distribution.

Causal inference improves predictions ML-1M ML-10M Yahoo-R3 ArXiv REG SKEW REG SKEW REG SKEW REG SKEW Pop OBS -1.50-2.07-1.62-2.59-1.58-1.75-1.61-1.65 CAU -1.61-1.95-1.67-1.89-1.51-1.56-1.74-1.76 PF OBS -1.50-2.07-1.62-2.59-1.58-1.75-1.61-1.65 CAU -1.48-1.84-1.51-1.96-1.49-1.55-1.60-1.62 Predictive log tail probability (bigger is better)

ML-1M (regular test set) 700 600 OBS CAU 500 400 300 200 100 0 6 5 4 3 2 1 0

ML-1M (skewed test set) 700 600 OBS CAU 500 400 300 200 100 0 6 5 4 3 2 1 0

Modeling User Exposure in Recommendation (with Dawen Liang, Laurent Charlin, and James McInerney)

Users Items Star Wars Empire Strikes Back Return of the Jedi When Harry Met Sally Barefoot in the Park Star Trek III: Wrath of Khan Implicit data is about users interacting with items clicks, likes, purchases,... Less information than explicit data (e.g. ratings), but more prevalent Challenge: We only observe positive signal.

ITEM ATTRIBUTES USER PREFERENCES u f./ ˇi g. / ˇi y ui u y ui exp-fam > u ˇi Classical solution: matrix factorization Users are associated with latent preferences u. Items are associated with latent attributes ˇi. Whether a user clicked on an item comes from an exponential family, e.g., Poisson, Bernoulli, Gaussian. Koren et al., Computer 2009

Users Items Star Wars Empire Strikes Back Return of the Jedi When Harry Met Sally Barefoot in the Park Star Trek III: Wrath of Khan Tacit assumption each user considered clicking on each item. I.e., every zero indicates that a user decided not to click on an item. This is false! Users are only exposed to a subset of items. Hu et al., International Conference on Data Mining 2008 Rendle et al., Uncertainty in Artificial Intelligence 2009

Users Items Star Wars Empire Strikes Back Return of the Jedi When Harry Met Sally Barefoot in the Park Star Trek III: Wrath of Khan This issue biases estimates of preferences and attributes. In practice, researchers correct for this bias by downweighting the 0 s. This gives state-of-the-art results, but it is ad-hoc. Hu et al., International Conference on Data Mining 2008 Rendle et al., Uncertainty in Artificial Intelligence 2009

Star Wars Empire Strikes Back Return of the Jedi When Harry Met Sally Barefoot in the Park Star Trek III: Wrath of Khan Idea: Model clicks from a two-stage process. First, a user is exposed to an item. Then, she decides whether to click on it. (Inspired by, but different from, potential outcomes.)

Star Wars Empire Strikes Back Return of the Jedi When Harry Met Sally Barefoot in the Park Star Trek III: Wrath of Khan Challenge: Exposure is (partly) latent. Clicks mean that the user was exposed. But non-clicks can arise in two ways: The user didn t see the item. The user chose not to click on it.

EXPOSURE INDICATOR a ui a ui Bern. i / y ui j a ui D 0 ı 0 i y ui j a ui D 1 exp-fam > u ˇi ˇi y ui u Exposure MF: an exposure model and a click model First generate whether the user was exposed to the item. Conditional on the exposure, generate whether the user clicked.

EXPOSURE INDICATOR a ui a ui Bern. i / y ui j a ui D 0 ı 0 i y ui j a ui D 1 exp-fam > u ˇi ˇi y ui u Popularity-based exposure: a ui Bernoulli. i / Location-based exposure: a ui Bernoulli..x > u `i// Others: social-network exposure, author-preference exposure,...

EXPOSURE INDICATOR a ui a ui Bern. i / y ui j a ui D 0 ı 0 i y ui j a ui D 1 exp-fam > u ˇi ˇi y ui u Given data y we find MAP estimates of the parameters M D f; ˇ; g Recall that the exposure indicator is sometimes latent. We use expectation maximization.

EXPOSURE INDICATOR a ui a ui Bern. i / y ui j a ui D 0 ı 0 i y ui j a ui D 1 exp-fam > u ˇi ˇi y ui u E-step: Estimate the posterior of each unclicked exposure, ui D p.a ui D 1 j y ui D 0; M/: M-step: Estimate M from the expected complete log likelihood of y.

Star Wars Empire Strikes Back Return of the Jedi When Harry Met Sally Barefoot in the Park Star Trek III: Wrath of Khan E-step: Exposure prior interacts with probability of not clicking, p.a ui D 1 j y ui D 0; M/ / p.a ui D 1 j M/p.y ui D 0 j a ui D 1; M/ In this example: Posterior exposure to Return of the Jedi is smaller than the prior. Posterior exposure to When Harry Met Sally is larger.

Star Wars Empire Strikes Back Return of the Jedi When Harry Met Sally Barefoot in the Park Star Trek III: Wrath of Khan M-step: Weighted factorization Zeros weighted by the posterior probability of exposure. Related to WMF, but emerges from a probabilistic perspective And, not all 0s are reweighted by the same amount E.g., the zero for Return of the Jedi does not pull the user s preferences. (Exposure M-step: MLE with expected sufficient statistics.)

log( [aui]) log(µui) 0.3 0.2 0.1 0.0 0.1 0.2 0.3 8sHU A (5DGLRKHDG DnG InWHUSRO OLsWHnHU) 0.4 10 1 10 2 10 3 10 4 10 5 (PSLULFDO LWHP SRSuODULWy 5DGLRKHDG: ".LG A" 5DGLRKHDG: "JLJsDw )DOOLnJ InWR 3ODFH" 5DGLRKHDG: "6DLO TR TKH 0RRn" 5DGLRKHDG: "6LW 'Rwn. 6WDnG 8S" 5DGLRKHDG: "BORw 2uW" InWHUSRO: "C'PHUH" InWHUSRO: "1RW (vhn JDLO" InWHUSRO: "3uEOLF 3HUvHUW" InWHUSRO: "1HxW (vlo" InFuEus: "BDss 6ROR" GUHDW :KLWH: "HRusH 2I BURNHn LRvH" This user likes Interpol and Radiohead. Songs that she might like are down-weighted more than usual. This distinguishes the algorithm from WMF. But, it s a modeling choice. (The alternative could also be plausible.)

TPS Mendeley Gowalla ArXiv Type Music Papers Venues Clicks # of users 220K 45K 58K 38K # of items 23K 76K 47K 45K # interactions 14.0M 2.4M 2.3M 2.5M % interactions 0.29% 0.07% 0.09% 0.15% We studied several large data sets of implicit data.

TPS Mendeley ArXiv WMF ExpoMF WMF ExpoMF WMF ExpoMF Recall@20 0.195 0.201 0.128 0.139 0.143 0.147 NDCG@100 0.255 0.263 0.149 0.159 0.154 0.157 MAP@100 0.092 0.109 0.048 0.055 0.051 0.054 Weighted MF vs. popularity-based exposure

WMF Popularity Exposure Location Exposure Recall@20 0.122 0.118 0.129 NDCG@100 0.118 0.116 0.125 MAP@100 0.044 0.043 0.048 WMF vs. popularity-based vs. location-based exposure (Gowalla)

Related work: Causal inference and potential outcomes Weighted matrix factorization Missing data and recommender systems Spike and slab models Pearl, Causality Imbens and Rubin, Causal Inference in Statistics, Social and Biomedical Sciences: An Introduction Gelman et al., Bayesian Data Analysis Hu et al., International Conference on Data Mining 2008 Marlin et al., Uncertainty in Artificial Intelligence 2007

a ui i ˇi y ui u Summary: We developed a latent exposure process for modeling implicit data. Enables us to consider two models one of clicks and one of exposure. Outperforms weighted matrix factorization; opens the door to new methods See our paper at WWW 2016.

Causal Inference and Recommendation Systems Learning the assignment mechanism is useful for causal inference It helps us better learn user preferences It helps us better model implicit data Other work on recommender systems from our group Hierarchical Poisson factorization Social networks and recommender systems Dynamic recommender systems Combining content and clicks Embeddings for recommendation Gopalan et al., Uncertainty in Artificial Intelligence 2015 Chaney et al., ACM Conference on Recommendation Systems 2015 Charlin et al., ACM Conference on Recommendation Systems 2015 Gopalan et al., Neural Information Processing Systems 2014 Gopalan et al., Artificial Intelligence and Statistics 2014 Wang and Blei, Knowledge Discovery and Data Mining 2013 Liang et al., ACM Conference on Recommendation Systems 2016

Why it s not crazy to model integer data with a Normal distribution. OK, it is a little. Let D > u ˇi. Consider the Poisson with parameter expfg. p.y/ / expfy expfg log yšg Now consider a (unit-variance) Gaussian p.y/ / expfy 2 =2 y 2 =2g These are different distributions. But both contain y A penalty for large parameters A penalty for large variables