Counterfactual Evaluation and Learning from Logged User Feedback

Size: px

Start display at page:

Download "Counterfactual Evaluation and Learning from Logged User Feedback"

Marylou Robinson
6 years ago
Views:

Counterfactual Evaluation and Learning from Logged User Feedback Adith Swaminathan adith@cs.cornell.

1 Counterfactual Evaluation and Learning from Logged User Feedback Adith Swaminathan Department of Computer Science Cornell University Committee: Thorsten Joachims (chair), Johannes Gehrke, Éva Tardos, Robert Kleinberg

2 Interactive Systems Search E-commerce We collect user interactions for Personalization Evaluating performance Improving systems Entertainment SmartHome

3 Logs of Interactive Systems Alice: Strategy game Civilization 5 x: Context x 1 = (Alice, strat) y 1 = (Civ5) δ 1 = 7$ y: Action Alice: $$$ δ: Feedback 3

4 Example: Ad Placement Context x: User and page Action y: Ad Feedback δ: Click / no-click 4

5 Example: Chatbot Context x: Query Action y: GIF Feedback δ: Thumbs up / down 5

6 Goal Use to x 1 = (Alice, strat) y 1 = (Civ5) δ 1 = 7$ Evaluate new model Train models 6

7 Non-solutions Ignore Annotate x 1 = (Alice, strat) y 1 = (Civ5) δ 1 = 7$ x 1 = (Alice, strat) y 1 = (Civ5) δ 1 = 7$ y = (AoE2) Evaluate: Online A/B Train: Bandit algorithms Evaluate: Hold-out Train: Supervised learning 7

8 Thesis: Re-use logged data Logged interventions Observational logs??? x 1, y 1, δ 1 Evaluating slates [ICML 16 workshop] Batch Learning from Bandit Feedback (BLBF) [ICML 15] Better BLBF [NIPS 15] x 1, y 1, δ 1 Recommender systems [ICML 16] Learning to rank [WSDM 17] 8

9 Wald s insight: What s missing? Where to add armor? Cover bullet-holes? (Survivor bias!) Beware: Confounding due to missing info 9

10 Overview Introduction Interactive systems Reusing data How good is this new system? Project: MNAR (Schnabel et al, ICML 2016) Find the best new system Project: POEM 10

11 Romance Lovers Horror Lovers Movie Recommendation O Observed Y/N Horror Romance Drama Y True Rating Data is Missing Not At Random (MNAR) Example adapted from (Steck et al, 2010) 11

12 Selection Bias in Recommendations User-induced (e.g. browsing) System-induced (e.g. advertising) Question: What if we ignore these biases? 12

13 Romance Lovers Horror Lovers Evaluating recommendations under Selection Bias O Observed Y/N Y Recommend Y True Rating Horror Romance Drama Observed ratings are misleading

14 Romance Lovers Horror Lovers Evaluating rating predictions under Selection Bias Horror Romance Drama Horror Romance Drama Observed losses are misleading Y 1 Pred Ratings (worse) Y 2 Pred Ratings (better) 14

15 patients Users Recommendations as Treatments Fix selection bias potential outcomes framework Counterfactual Outcomes Y Items treatments Factual Outcomes Y Understand assignment mechanism (Imbens & Ruben, 2015) 15

16 Assignment Mechanism for Recommendation P u,i = P O u,i = 1 Propensities P Horror Romance Drama Inverse Propensity Scoring (IPS) is unbiased if P u,i > 0: p p/10 p/2 R IPS = 1 U I u,i 1{O ui =1} P u,i Y u,i Y u,i 2 p/10 p p/2 (Horvitz & Thompson, 1952; Rosenbaum & Rubin, 1983;...) 16

17 Interventional vs. Observational Interventional (Controlled Experiments) We control assignment mechanism (e.g. ad placement) Propensities P u,i = P O u,i = 1 known [ Just log propensities! ] Requirement: P u,i > 0 (prob. assignment) Observational Assignment mechanism not under our control (e.g. reviews/ratings) Use features Z; P u,i = P O u,i = 1 Z [ Estimate propensity ] Requirement: O u,i Y u,i Z (unconfounded) 17

18 Propensity Estimation Supervised Regression Problem P u,i = P O u,i = 1 Z Off-the-shelf ML, e.g., Logistic regression Naïve Bayes Bernoulli Matrix Factorization Observations O Horror Romance Drama IPS is robust to inaccurate propensities 18

19 Debiased Collaborative Filtering Y ERM = argmin V,W O u,i =1 1 P u,i Y u,i V u W i 2 + λ V F 2 + W F 2 Observations O Features Z Observed ratings Y Propensity estimation MF Complete Data Model Latent variables Missing Data Model discriminative generative (Marlin et al, 2007; Steck, 2011;...) 19

20 Collaborative Filtering Results Two real-world MNAR datasets YAHOO: Song ratings (15400 users; Marlin & Zemel, 2009) COAT: Shopping ratings (300 users; new Schnabel et al, 2016) Report performance on MAR datasets 20

21 Overview Introduction Interactive systems Reusing data How good is this new system? Project: MNAR Find the best new system Project: POEM (Swaminathan & Joachims, ICML 2015) 21

22 Evaluate with logged interventions New System π, π: x y How good is π? R π = E x E π δ Estimate R π using y i ~ Old System π 0 x 1, y 1, δ 1 = 1 22

23 System: Stochastic Policies Definition [Deterministic Policy]: Function y = π(x) that picks action y for context x Definition [Stochastic Policy]: Distribution π y x that samples action y given context x Y x Y x π 1 x π 1 (Y x) π 2 x π 2 (Y x) 23

24 Recap: IPS R ips π = 1 n i π(y i x i ) π 0 y i x i δ i π 0 π 24

25 Now: Learning Problem Find the best π Π, argmin π Π R π Empirical Risk Minimization: Work-horse of ML π = argmin π Π R(π) + λ Reg(π) Train loss Regularizer 25

26 ERM with IPS: Issue π 0 (Y x) π(y x) π 0 (Y x) π (Y x) argmin π Ƹ R π = 1 n i π(y i x i ) π 0 y i x i δ i Can we detect and avoid IPS failure when learning? 26

27 ERM: Generalization Error Bound Classic ERM: argmin π Π R π + λ Reg(π) Train loss Regularizer Classic Risk Bound: R π R π + O(Ϲ[Π]) Data used to estimate R(π) did not depend on π (Vapnik & Chervonenkis, 1979) 27

28 Now: π influences its data π 0 (Y x) π(y x) π 0 (Y x) π (Y x) argmin π Ƹ R π = 1 n i π(y i x i ) π 0 y i x i δ i 28

29 Counterfactual Learning Risk Bound: R π R π + O Var π n + O(Ϲ[Π]) Off-policy est. Emp. variance Regularizer Objective: argmin π Π R π + λ 1 Var π n + λ 2 Reg(π) Accounts for different Counterfactual Risk Minimization Τ π(y x) π 0 (y x) variability across Π (Maurer & Pontil, 2009; Bottou et al, 2013; Swaminathan & Joachims, 2015) 29

30 CRM for Structured Prediction Policy class, H: Stochastic linear rules π w y x exp w T ψ x, y Same form as Cond. Random Field or Structural SVM Learning: Use x i, y i, δ i, p i to find good w Policy Optimizer for Exponential Models (POEM) (Swaminathan & Joachims, 2015) 30

31 Case Study: Newsbox Placement Context x: Query, User, Ranked docs, Newsbox content features Action y: Position to place newsbox Loss δ: (1 MRR) of entire SERP Pos 1 Logger π 0 : Multinomial using production position scorer Pos 2 Pos 3 31

32 News Box Placement: Results Across 100 datasets, POEM consistently beats production ranker 32

33 Overview Introduction Interactive systems Reusing data How good is this new system? Find the best new system Discussion 33

34 Goal Re-use to x 1 = (Alice, strat) y 1 = (Civ5) δ 1 = 7$ Evaluate new model offline IPS (Off-policy estimation) Train models offline POEM 34

35 Thesis: Re-use logged data Logged interventions Observational logs??? x 1, y 1, δ 1 Evaluating slates [ICML 16 workshop] Batch Learning from Bandit Feedback (BLBF) [ICML 15] Better BLBF [NIPS 15] x 1, y 1, δ 1 Recommender systems [ICML 16] Learning to rank [WSDM 17] 35

36 Thesis: Other Projects Better BLBF Equivariant Optimization with Self-Normalized Estimators (Swaminathan & Joachims, NIPS 2015) Solve propensity overfitting when learning from bandit feedback Several variance reduction methods (clipping, control variates, explicit control, under-fit propensities). How do they compare and interact? 36

37 Thesis: Other Projects Evaluating slates Off-Policy Evaluation for Slate Recommendation (Swaminathan et al, ICML Workshop on Personalization 2016) Usable alternative to IPS for combinatorial contextual bandits Several ways to achieve a good bias-variance trade-off. How best to explore/exploit structured action spaces? 37

38 Thesis: Other Projects Learning to rank De-biasing Learning to Rank when using Implicit Feedback (Joachims et al, WSDM 2017) To get around getting randomized data, piggyback on randomness in user actions Can employ click models as propensity estimators. How to best adapt them for this purpose? 38

39 Counterfactual Techniques Model the world Model the bias e.g., predict user-item ratings Project: MNAR e.g., CTR of recommender Project: POEM Causal modelling Monte Carlo estimation 39

40 Looking Ahead Bridge Causal Inference and ML Off-policy Reinforcement Learning 40

41 Conclusion See also and Horror Romance Drama Thanks! π 0 (Y x) π (Y x) adith@cs.cornell.edu 41

42 Thorsten Prathmesh Hema Swetha Isaac Johannes Omkar Ashesh Laure Thanks! Bobby Sunny Abhishek Matthew Ruben Éva Hooda Anshumali Fabian Pannaga Stavros Karthik Raman Shuo Ethan Krishnaram Telang Siddarth Chandrasekaran Amit Soumen Anitha Nikos Karn Pushpak Saraswathi David Grangier Tobias Z Aditya GP Swaminathan Navdeep Ashudeep Auro Ramdas Arun Lillian Peter Frazier Basu Abi Sankaranarayanan Becky AkshayK/B Chenhao Indu Aravind Charlie Alekh Jon Prema Moontae K.P. Miro Yexiang Raja Vikram M. John Lu Hema Aparna Abbi Maarten Karthik Sridharan Ashwinkumar Ramesh Ashish Claire Damien Lefortier Sarah Wenlei Ria/Nina Harshal David Mimno Sumita Raghu Anne Schuth Shinde Ainur Tirth Artem Binit Bishan Maithra Daan Reetika Arzoo 42

Recommendations as Treatments: Debiasing Learning and Evaluation

Recommendations as Treatments: Debiasing Learning and Evaluation ICML 2016, NYC Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, Thorsten Joachims Cornell University, Google Funded in part through NSF Awards IIS-1247637, IIS-1217686, IIS-1513692. Romance