Counterfactual Model for Learning Systems

Counterfactual Model for Learning Systems. CS 7792 - Fall 2018. Thorsten Joachims, Department of Computer Science & Department of Information Science, Cornell University. Reading: Imbens, Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences, 2015. Chapters 1, 3, 12.

Interactive System Schematic: the system π receives a context x, takes an action y for x, and accrues utility U(π).

News Recommender. Context x: user. Action y: portfolio of news articles. Feedback δ(x, y): reading time in minutes.

Ad Placement. Context x: user and page. Action y: ad that is placed. Feedback δ(x, y): click / no-click.

Search Engine. Context x: query. Action y: ranking. Feedback δ(x, y): click / no-click.

Log Data from Interactive Systems. Data: S = ((x_1, y_1, δ_1), ..., (x_n, y_n, δ_n)), where x is the context, y the action, and δ the reward / loss. This is partial-information (aka "contextual bandit") feedback. Properties: contexts x_i are drawn i.i.d. from an unknown P(X); actions y_i are selected by the existing system π_0: X → Y; feedback δ_i comes from an unknown function δ: X × Y → R. [Zadrozny et al., 2003] [Langford & Li] [Bottou et al., 2014]
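To make this data format concrete, here is a minimal sketch (not from the lecture) that simulates such a bandit log: contexts are drawn i.i.d., actions are sampled from a hypothetical logging policy pi0, and only the reward of the chosen action is recorded, together with its propensity. All names (pi0, true_reward, the 3-action setup) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions = 10000, 3

def pi0(x):
    """Hypothetical logging policy pi0(Y | x): a fixed, non-uniform distribution over 3 actions."""
    return np.array([0.7, 0.2, 0.1]) if x < 0.5 else np.array([0.2, 0.3, 0.5])

def true_reward(x, y):
    """Hypothetical reward function delta(x, y); unknown to the estimators."""
    return float(rng.random() < 0.2 + 0.25 * y * x)

log = []  # entries (x_i, y_i, delta_i, p_i); the propensity p_i is recorded for later use
for _ in range(n):
    x = rng.random()                          # context x_i ~ P(X)
    probs = pi0(x)
    y = int(rng.choice(n_actions, p=probs))   # action y_i ~ pi0(Y | x_i)
    delta = true_reward(x, y)                 # bandit feedback: reward only for the chosen y_i
    log.append((x, y, delta, probs[y]))
```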

Goal: Counterfactual Evaluation. Use the interaction log data S = ((x_1, y_1, δ_1), ..., (x_n, y_n, δ_n)) to evaluate a system π: estimate online measures of some system π offline. The system π can be different from the system π_0 that generated the log.

Evaluation: Outline. Evaluating online metrics offline: A/B testing (on-policy); counterfactual estimation from logs (off-policy). Approach 1: Model the world (estimation via reward prediction). Approach 2: Model the bias (counterfactual model, inverse propensity scoring (IPS) estimator).

Online Performance Metrics. Example metrics: CTR, revenue, time-to-success, interleaving, etc. The correct choice depends on the application and is not the focus of this lecture. This lecture: the metric is encoded as δ(x, y) [click / payoff / time for the (x, y) pair].

System. Definition [Deterministic Policy]: a function y = π(x) that picks action y for context x. Definition [Stochastic Policy]: a distribution π(y|x) that samples action y given context x. (Diagram: two policies π_1 and π_2 induce different distributions π_1(Y|x) and π_2(Y|x) over the actions for the same context x.)

System Performance. Definition [Utility of Policy]: the expected reward / utility U(π) of policy π is U(π) = ∫∫ δ(x, y) π(y|x) P(x) dx dy, where δ(x, y) is, e.g., the reading time of user x for portfolio y. (Diagram: action distributions π(y|x_i) and π(y|x_j) for two contexts.)

Online Evaluation: A/B Testing. Given S = ((x_1, y_1, δ_1), ..., (x_n, y_n, δ_n)) collected under π, the on-policy estimate is Û(π) = (1/n) Σ_{i=1}^n δ_i. A/B testing: deploy π_1: draw x ~ P(X), predict y ~ π_1(Y|x), get δ(x, y); deploy π_2: draw x ~ P(X), predict y ~ π_2(Y|x), get δ(x, y); ...; deploy π_H: draw x ~ P(X), predict y ~ π_H(Y|x), get δ(x, y).

Pros and Cons of A/B Testing. Pros: user-centric measure; no need for manual ratings; no user/expert mismatch. Cons: requires interactive experimental control; risk of fielding a bad or buggy π_i; the number of A/B tests is limited; long turnaround time.

Evaluating Online Metrics Offline. Online (on-policy A/B test): draw S_1 from π_1 to get Û(π_1); draw S_2 from π_2 to get Û(π_2); ...; draw S_7 from π_7 to get Û(π_7). Offline (off-policy counterfactual estimates): draw a single S from π_0 and compute Û(π_6), Û(π_2), Û(π_8), Û(π_24), Û(π_3), ... from that one log.

Evaluation: Outline. Evaluating online metrics offline: A/B testing (on-policy); counterfactual estimation from logs (off-policy). Approach 1: Model the world (estimation via reward prediction). Approach 2: Model the bias (counterfactual model, inverse propensity scoring (IPS) estimator).

Approach 1: Reward Predictor. Idea: use S = ((x_1, y_1, δ_1), ..., (x_n, y_n, δ_n)) collected under π_0 to estimate a reward predictor δ̂(x, y). Deterministic π: simulated A/B testing with the predicted δ̂(x, y); for the actions y_i = π(x_i) of the new policy π, generate the predicted log S' = ((x_1, y_1, δ̂(x_1, y_1)), ..., (x_n, y_n, δ̂(x_n, y_n))) and estimate the performance of π via Û_RP(π) = (1/n) Σ_{i=1}^n δ̂(x_i, y_i). Stochastic π: Û_RP(π) = (1/n) Σ_{i=1}^n Σ_y δ̂(x_i, y) π(y|x_i).

Regression for Reward Prediction. Learn δ̂: X × Y → R. 1. Represent (x, y) via features Ψ(x, y). 2. Learn a regression for δ̂ based on Ψ(x, y) from S collected under π_0. 3. Predict δ̂(x, y) for the actions y = π(x) of the new policy π. (Figure: regression fit of δ̂(x, y) to δ(x, y) over the feature space Ψ_1, Ψ_2.)
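A minimal sketch of Approach 1 ("model the world") under the assumptions of the simulated log above; the feature map psi, the ridge fit, and the example policy are illustrative choices, not the lecture's implementation.

```python
import numpy as np

# Assumes a bandit log in the (x, y, delta, propensity) format of the simulation sketch above.
log = [(0.1, 0, 1.0, 0.7), (0.9, 2, 1.0, 0.5), (0.4, 1, 0.0, 0.2),
       (0.7, 2, 0.0, 0.5), (0.3, 0, 1.0, 0.7), (0.8, 1, 1.0, 0.3)]

def psi(x, y, n_actions=3):
    """Joint features Psi(x, y): per-action intercept and slope in x (illustrative choice)."""
    f = np.zeros(2 * n_actions)
    f[2 * y], f[2 * y + 1] = 1.0, x
    return f

def fit_reward_predictor(log, lam=1.0):
    """Step 2: ridge regression for delta_hat, trained only on the actions chosen by pi0."""
    X = np.array([psi(x, y) for x, y, _, _ in log])
    d = np.array([delta for _, _, delta, _ in log])
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ d)
    return lambda x, y: float(psi(x, y) @ w)

def u_rp(log, policy, delta_hat):
    """Reward-prediction estimate U_RP(pi) = mean of delta_hat(x_i, pi(x_i))."""
    return float(np.mean([delta_hat(x, policy(x)) for x, _, _, _ in log]))

delta_hat = fit_reward_predictor(log)
new_policy = lambda x: 2 if x > 0.5 else 0   # hypothetical deterministic policy pi
print("U_RP(pi) =", u_rp(log, new_policy, delta_hat))
```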

News Recommender: Experiment Setup. Context x: user profile. Action y: ranking, picking from 7 candidates to place into 3 slots. Reward δ: revenue, a complicated hidden function. Logging policy π_0: non-uniform randomized logging system (Plackett-Luce exploration around the current production ranker).

News Recommender: Results. RP is inaccurate even with more training and logged data.

Problems of the Reward Predictor. Modeling bias: the choice of features and model. Selection bias: π_0's actions are overrepresented in the training data. As a result, Û_RP(π) = (1/n) Σ_i δ̂(x_i, π(x_i)) can be unreliable and biased.

Evaluation: Outline. Evaluating online metrics offline: A/B testing (on-policy); counterfactual estimation from logs (off-policy). Approach 1: Model the world (estimation via reward prediction). Approach 2: Model the bias (counterfactual model, inverse propensity score (IPS) weighting estimator).

Approach 2: Model the Bias. Idea: fix the mismatch between the distribution π_0(Y|x) that generated the data and the distribution π(Y|x) we aim to evaluate: U(π) = ∫∫ δ(x, y) π(y|x) P(x) dx dy.

Counterfactual Model. Example: treating heart attacks. Patients x_i, i = 1, ..., n. Treatments Y: bypass / stent / drugs. Chosen treatment for patient x_i: y_i. Outcomes δ_i: 5-year survival (0/1). Which treatment is best?

Counterfactual Model. Example: treating heart attacks. Patients i = 1, ..., n; treatments Y: bypass / stent / drugs; chosen treatment for patient x_i: y_i; outcomes δ_i: 5-year survival (0/1). Which treatment is best? The same model covers placing a vertical on a SERP: the treatments are position 1 / position 2 / position 3, and the outcome is click / no click.

Counterfactual Model. Example: treating heart attacks. Patients x_i, i = 1, ..., n; treatments Y: bypass / stent / drugs; chosen treatment for patient x_i: y_i; outcomes δ_i: 5-year survival (0/1). Which treatment is best: everybody gets drugs, everybody gets a stent, or everybody gets a bypass? Drugs 3/4, stent 2/3, bypass 2/4... really?

Treatment Effects. Average treatment effect of treatment y: U(y) = (1/n) Σ_i δ(x_i, y). Example: U(bypass) = 0.4, U(stent) = 0.6, U(drugs) = 0.3. (Table: for each patient, one factual outcome for the treatment actually received and counterfactual outcomes for the treatments not received.)

Assignment Mechanism. Probabilistic treatment assignment: for patient i, the treatment is assigned with probability π_0(Y_i = y | x_i); this creates selection bias. Inverse propensity score estimator: Û_IPS(y) = (1/n) Σ_i [I{y_i = y} / p_i] δ(x_i, y_i), with propensity p_i = π_0(Y_i = y_i | x_i). Unbiased: E[Û_IPS(y)] = U(y) if π_0(Y_i = y | x_i) > 0 for all i. Example: Û_IPS(drugs) = (1/n)(1/0.8 + 1/0.7 + 1/0.8 + ...) = 0.36, less than the naive estimate of 0.75. (Table: assignment probabilities π_0(Y_i = y | x_i) for each patient and treatment.)
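A sketch of the single-treatment IPS estimator; the patients, outcomes, and propensities below are made up for illustration and are not the lecture's numbers.

```python
import numpy as np

# Hypothetical logged treatments, 0/1 outcomes, and assignment propensities p_i = pi0(y_i | x_i).
y_logged = np.array(["drugs", "drugs", "stent", "bypass", "drugs", "stent"])
delta    = np.array([1,       1,       0,       1,        0,       1])
p        = np.array([0.8,     0.7,     0.5,     0.3,      0.8,     0.4])

def u_ips_treatment(y, y_logged, delta, p):
    """IPS estimate U_IPS(y) = (1/n) * sum_i I{y_i = y} / p_i * delta_i."""
    n = len(y_logged)
    return float(np.sum((y_logged == y) / p * delta) / n)

print("U_IPS(drugs) =", u_ips_treatment("drugs", y_logged, delta, p))
```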

Experimental vs. Observational. Controlled experiment: the assignment mechanism is under our control; propensities p_i = π_0(Y_i = y_i | x_i) are known by design; requirement: π_0(Y_i = y | x_i) > 0 for all y (probabilistic assignment). Observational study: the assignment mechanism is not under our control; propensities p_i need to be estimated, i.e., estimate π_0(Y_i | z_i) based on features z_i; requirement: π_0(Y_i | z_i) = π_0(Y_i | δ_i, z_i) (unconfoundedness).

Conditional Treatment Policies. Policy (deterministic): context x_i describes the patient; pick treatment y_i based on x_i: y_i = π(x_i). Example policy: π(A) = drugs, π(B) = stent, π(C) = bypass. Average treatment effect: U(π) = (1/n) Σ_i δ(x_i, π(x_i)). IPS estimator: Û_IPS(π) = (1/n) Σ_i [I{y_i = π(x_i)} / p_i] δ(x_i, y_i).

Stochastic Treatment Policies. Policy (stochastic): context x_i describes the patient; pick treatment y for x_i by sampling from π(y|x_i). Note: the assignment mechanism is a stochastic policy as well! Average treatment effect: U(π) = (1/n) Σ_i Σ_y δ(x_i, y) π(y|x_i). IPS estimator: Û_IPS(π) = (1/n) Σ_i [π(y_i|x_i) / p_i] δ(x_i, y_i).

Counterfactual Model = Logs.
Medical: context x_i = diagnostics; treatment y_i = bypass/stent/drugs; outcome δ_i = survival; propensities p_i = controlled (*); new policy π = FDA guidelines.
Search engine: context x_i = query; treatment y_i = ranking; outcome δ_i = click metric; propensities p_i = controlled; new policy π = ranker.
Ad placement: context x_i = user + page; treatment y_i = placed ad; outcome δ_i = click / no click; propensities p_i = controlled; new policy π = ad placer.
Recommender: context x_i = user + movie; treatment y_i = watched movie; outcome δ_i = star rating; propensities p_i = observational; new policy π = recommender.
Context, treatment, outcome, and propensities are recorded in the log. Treatment effect U(π): the average quality of the new policy.

Evaluation: Outline. Evaluating online metrics offline: A/B testing (on-policy); counterfactual estimation from logs (off-policy). Approach 1: Model the world (estimation via reward prediction). Approach 2: Model the bias (counterfactual model, inverse propensity scoring (IPS) estimator).

System Evaluation via Inverse Propensity Scoring. Definition [IPS Utility Estimator]: given S = ((x_1, y_1, δ_1), ..., (x_n, y_n, δ_n)) collected under π_0, Û_IPS(π) = (1/n) Σ_{i=1}^n δ_i π(y_i|x_i) / π_0(y_i|x_i), where π_0(y_i|x_i) is the propensity p_i. This is an unbiased estimate of the utility for any π, provided the propensity is nonzero whenever π(y_i|x_i) > 0. Note: if π = π_0, this reduces to the online A/B test with Û_IPS(π) = (1/n) Σ_i δ_i (off-policy vs. on-policy estimation). [Horvitz & Thompson, 1952] [Rubin, 1983] [Zadrozny et al., 2003] [Li et al., 2011]
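A minimal sketch of the IPS utility estimator for a new stochastic policy, assuming the same (x, y, delta, propensity) log format as above; pi_new is a hypothetical target policy, not part of the lecture.

```python
import numpy as np

# Assumes the (x, y, delta, propensity) log format of the earlier sketches.
log = [(0.1, 0, 1.0, 0.7), (0.9, 2, 1.0, 0.5), (0.4, 1, 0.0, 0.2),
       (0.7, 2, 0.0, 0.5), (0.3, 0, 1.0, 0.7), (0.8, 1, 1.0, 0.3)]

def pi_new(y, x):
    """Hypothetical stochastic target policy pi(y | x) to be evaluated offline."""
    probs = np.array([0.1, 0.2, 0.7]) if x > 0.5 else np.array([0.6, 0.3, 0.1])
    return probs[y]

def u_ips(log, policy_prob):
    """U_IPS(pi) = (1/n) * sum_i delta_i * pi(y_i | x_i) / pi0(y_i | x_i)."""
    return float(np.mean([delta * policy_prob(y, x) / p for x, y, delta, p in log]))

print("U_IPS(pi_new) =", u_ips(log, pi_new))
```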

Illustration of IPS. IPS estimator: Û_IPS(π) = (1/n) Σ_i [π(y_i|x_i) / π_0(y_i|x_i)] δ_i. Unbiased: if π_0(y|x) > 0 for all x, y with π(y|x) P(x) > 0, then E[Û_IPS(π)] = U(π). (Figure: overlapping distributions π_0(Y|x) and π(Y|x).)

IPS Estimator is Unbiased.
E[Û_IPS(π)]
= Σ_{x_1,y_1} ... Σ_{x_n,y_n} [(1/n) Σ_i (π(y_i|x_i) / π_0(y_i|x_i)) δ(x_i, y_i)] π_0(y_1|x_1) P(x_1) ... π_0(y_n|x_n) P(x_n)   (samples are independent)
= (1/n) Σ_i Σ_{x_1,y_1} ... Σ_{x_n,y_n} (π(y_i|x_i) / π_0(y_i|x_i)) δ(x_i, y_i) π_0(y_1|x_1) P(x_1) ... π_0(y_n|x_n) P(x_n)
= (1/n) Σ_i Σ_{x_i,y_i} (π(y_i|x_i) / π_0(y_i|x_i)) δ(x_i, y_i) π_0(y_i|x_i) P(x_i)   (marginalize out the other samples)
= (1/n) Σ_i Σ_{x_i,y_i} P(x_i) π(y_i|x_i) δ(x_i, y_i)   (full support: π_0 cancels)
= (1/n) Σ_i U(π)   (each term is identical)
= U(π)

News Recommender: Results. IPS eventually beats RP; its variance decays as O(1/n).

Counterfactual Policy Evaluation. Controlled experiment setting: log data D = ((x_1, y_1, δ_1, p_1), ..., (x_n, y_n, δ_n, p_n)). Observational setting: log data D = ((x_1, y_1, δ_1, z_1), ..., (x_n, y_n, δ_n, z_n)); estimate the propensities p̂_i = P(y_i | x_i, z_i) based on x_i and other confounders z_i. Goal: estimate the average treatment effect of a new policy π, via the IPS estimator Û(π) = (1/n) Σ_i δ_i π(y_i|x_i) / p_i, or many others.
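In the observational setting the propensities must be estimated before IPS can be applied. A hedged sketch, with scikit-learn's logistic regression as one assumed choice of probabilistic classifier and entirely synthetic confounders, treatments, and outcomes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical observational log: confounder features z_i, logged treatments y_i, outcomes delta_i.
rng = np.random.default_rng(1)
Z = rng.normal(size=(5000, 4))                               # confounders z_i
y = (Z[:, 0] + rng.normal(size=5000) > 0).astype(int)        # treatment depends on Z: selection bias
delta = (0.3 * y + Z[:, 0] + rng.normal(size=5000) > 0).astype(float)

# Step 1: estimate propensities p_hat_i = P(y_i | z_i) with a probabilistic classifier.
clf = LogisticRegression().fit(Z, y)
proba = clf.predict_proba(Z)                                 # columns ordered as clf.classes_
p_hat = proba[np.arange(len(y)), np.searchsorted(clf.classes_, y)]

# Step 2: plug the estimated propensities into IPS for a hypothetical new deterministic policy.
pi_new = (Z[:, 1] > 0).astype(int)
u_ips = np.mean((y == pi_new) / p_hat * delta)
print("U_IPS(pi_new) with estimated propensities:", u_ips)
```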

Evaluation: Summary. Evaluating online metrics offline: A/B testing (on-policy); counterfactual estimation from logs (off-policy). Approach 1: Model the world (estimation via reward prediction). Pro: low variance. Con: model mismatch can lead to high bias. Approach 2: Model the bias (counterfactual model, inverse propensity scoring (IPS) estimator). Pro: unbiased for known propensities. Con: large variance.

From Evaluation to Learning. Naïve "model the world" learning: learn δ̂: X × Y → R and derive the policy π(x) = argmin_y δ̂(x, y). Naïve "model the bias" learning: find the policy that optimizes the IPS training error, π̂ = argmin_{π ∈ Π} Σ_i [π(y_i|x_i) / π_0(y_i|x_i)] δ_i.
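A sketch of the naïve "model the bias" learning step: over a small hypothetical class of threshold policies, pick the one with the best IPS estimate on the logged data (argmax of estimated reward here; argmin of estimated loss as written on the slide). The log format and candidate class are illustrative assumptions.

```python
import numpy as np

# Assumes the (x, y, delta, propensity) log format of the earlier sketches; delta is a reward.
log = [(0.1, 0, 1.0, 0.7), (0.9, 2, 1.0, 0.5), (0.4, 1, 0.0, 0.2),
       (0.7, 2, 1.0, 0.5), (0.3, 0, 1.0, 0.7), (0.8, 1, 0.0, 0.3)]

def u_ips_det(log, policy):
    """IPS estimate for a deterministic policy: only matching actions contribute delta_i / p_i."""
    return float(np.mean([(policy(x) == y) * delta / p for x, y, delta, p in log]))

# Hypothetical candidate policy class Pi: threshold rules mapping context x to an action.
candidates = {f"threshold_{t:.1f}": (lambda x, t=t: 2 if x > t else 0)
              for t in np.linspace(0.1, 0.9, 9)}

# Naive "model the bias" learning: pick the policy in Pi with the best IPS estimate.
scores = {name: u_ips_det(log, pol) for name, pol in candidates.items()}
best = max(scores, key=scores.get)
print("selected policy:", best, "U_IPS =", round(scores[best], 3))
```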

Outline of Class. Counterfactual and causal inference. Evaluation: improved counterfactual estimators; applications in recommender systems, etc.; dealing with missing propensities, randomization, etc. Learning: batch learning from bandit feedback; dealing with combinatorial and continuous action spaces; learning theory. More general learning with partial-information data (e.g., ranking, embedding).