Counterfactual Evaluation and Learning

1 SIGIR 2016 Tutorial: Counterfactual Evaluation and Learning
    Adith Swaminathan, Thorsten Joachims
    Department of Computer Science & Department of Information Science, Cornell University
    Website:
    Funded in part through NSF Awards IIS , IIS-27686, IIS

2 User Interactive Systems
    Examples: search engines, entertainment media, e-commerce, smart homes, robots, etc.
    Logs of user behavior for
      Evaluating system performance
      Learning improved systems and gathering knowledge
      Personalization

3 Interactive System Schematic
    System $\pi_0$ chooses action $y$ for context $x$
    Utility: $U(\pi_0)$

4 Ad Placement
    Context $x$: user and page
    Action $y$: ad that is placed
    Feedback $\delta(x, y)$: click / no click

5 News Recommender
    Context $x$: user
    Action $y$: portfolio of news articles
    Feedback $\delta(x, y)$: reading time in minutes

6 Search Engine
    Context $x$: query
    Action $y$: ranking
    Feedback $\delta(x, y)$: win/loss against baseline in interleaving

7 Log Data from Interactive Systems
    Data: $S = ((x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n))$, where $x_i$ is the context, $y_i$ the action chosen by $\pi_0$, and $\delta_i$ the reward / loss
    Partial-information (aka "contextual bandit") feedback
    Properties
      Contexts $x_i$ drawn i.i.d. from unknown $P(X)$
      Actions $y_i$ selected by existing system $\pi_0: X \to Y$
      Feedback $\delta_i$ from unknown function $\delta: X \times Y \to \mathbb{R}$
    [Zadrozny et al., 2003] [Langford & Li] [Bottou et al., 2014]
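The logging process described on this slide can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the tutorial's demonstration code: the contexts, the ads, the `PI0` table, and the `click` reward are all hypothetical, contexts are drawn uniformly, and the log additionally stores the propensity of each chosen action because the estimators later in the tutorial need it.

```python
import random

# Hypothetical logging policy pi_0(y | x): one action distribution per context.
PI0 = {
    "sports_fan": {"ad_a": 0.8, "ad_b": 0.2},
    "news_reader": {"ad_a": 0.5, "ad_b": 0.5},
}

def click(x, y):
    """Toy reward delta(x, y); in a real system this function is unknown."""
    good_pairs = {("sports_fan", "ad_a"), ("news_reader", "ad_b")}
    return 1.0 if (x, y) in good_pairs else 0.0

def collect_log(n, policy, reward, seed=0):
    """Return S = [(x_i, y_i, delta_i, p_i)] with bandit feedback only."""
    rng = random.Random(seed)
    contexts = list(policy)
    log = []
    for _ in range(n):
        x = rng.choice(contexts)                      # x_i ~ P(X), uniform here (assumption)
        actions = list(policy[x])
        weights = [policy[x][a] for a in actions]
        y = rng.choices(actions, weights=weights)[0]  # y_i ~ pi_0(Y | x_i)
        # Only the reward of the chosen action is observed (partial information).
        log.append((x, y, reward(x, y), policy[x][y]))
    return log

S = collect_log(1000, PI0, click)
```

Each record holds the reward of one action only; what the other ad would have earned for the same user is never observed, which is what makes the feedback "partial".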

8 Goals for this Tutorial
    Use interaction log data $S = ((x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n))$ for
      Evaluation: Estimate online measures of some system $\pi$ offline. System $\pi$ is typically different from the system $\pi_0$ that generated the log.
      Learning: Find a new system $\pi$ that improves performance over $\pi_0$. Do not rely on interactive experiments as in online learning.

9 SIGIR 2016 Tutorial: Counterfactual Evaluation and Learning
    PART 1: EVALUATION

10 Evaluation: Outline
    Evaluating Online Metrics Offline
      A/B Testing (on-policy)
      Counterfactual estimation from logs (off-policy)
    Approach 1: Model the world
      Estimation via reward prediction
    Approach 2: Model the bias
      Counterfactual Model
      Inverse propensity scoring (IPS) estimator
    Advanced Estimators
      Self-normalized IPS estimator
      Doubly robust estimator
      Slates estimator
    Case Studies
    Summary & Demonstration with code samples

11 Online Performance Metrics
    Example metrics: CTR, revenue, time-to-success, interleaving, etc.
    The correct choice depends on the application and is not the focus of this tutorial.
    This tutorial: metric encoded as $\delta(x, y)$ [click / payoff / time for the $(x, y)$ pair]

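12 System
    Definition [Deterministic Policy]: Function $y = \pi(x)$ that picks action $y$ for context $x$.
    Definition [Stochastic Policy]: Distribution $\pi(y \mid x)$ that samples action $y$ given context $x$.
    [Figure: deterministic policies $\pi_1(x)$, $\pi_2(x)$ and stochastic policies $\pi_1(Y \mid x)$, $\pi_2(Y \mid x)$ over the action space $Y \mid x$]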
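The two policy definitions can be made concrete with a small sketch. Everything here is hypothetical (the action set `ACTIONS`, the selection rule, the 0.8 / 0.1 probabilities); the point is only that a deterministic policy is the special case of a stochastic policy that puts all probability mass on one action.

```python
import random

ACTIONS = ["ad_a", "ad_b", "ad_c"]  # hypothetical action space Y

def deterministic_policy(x):
    """y = pi(x): always the same action for a given context."""
    return ACTIONS[len(x) % len(ACTIONS)]  # arbitrary illustrative rule

def stochastic_policy(x):
    """pi(Y | x): a probability distribution over actions for context x."""
    favored = deterministic_policy(x)
    return {y: 0.8 if y == favored else 0.1 for y in ACTIONS}

def sample_action(dist, rng):
    """Draw y ~ pi(Y | x) from a distribution given as {action: probability}."""
    actions = list(dist)
    return rng.choices(actions, weights=[dist[a] for a in actions])[0]

def as_distribution(policy, x):
    """View a deterministic policy as a stochastic one with all mass on pi(x)."""
    chosen = policy(x)
    return {y: 1.0 if y == chosen else 0.0 for y in ACTIONS}
```

Treating both kinds of policy as distributions lets a single estimator handle either case.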

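13 System Performance
    Definition [Utility of Policy]: The expected reward / utility $U(\pi)$ of policy $\pi$ is
      $$U(\pi) = \int\!\!\int \delta(x, y)\, \pi(y \mid x)\, P(x)\, dx\, dy$$
    where $\delta(x, y)$ is, e.g., the reading time of user $x$ for portfolio $y$.
    [Figure: action distributions $\pi(y \mid x_i)$ and $\pi(y \mid x_j)$ over $Y \mid x_i$ and $Y \mid x_j$]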
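For finite context and action sets the double integral becomes a double sum, which can be evaluated exactly. The sketch below assumes such a finite setting; the context distribution `P_X`, the `READING_TIME` table, and `POLICY` are made-up numbers for illustration.

```python
def utility(policy, p_x, reward):
    """U(pi) = sum_x P(x) * sum_y pi(y | x) * delta(x, y) for finite X and Y."""
    return sum(
        p_x[x] * sum(prob * reward(x, y) for y, prob in policy[x].items())
        for x in p_x
    )

# Hypothetical example: two users, two portfolios, reading time in minutes.
P_X = {"u1": 0.5, "u2": 0.5}
READING_TIME = {("u1", "a"): 4.0, ("u1", "b"): 1.0,
                ("u2", "a"): 2.0, ("u2", "b"): 6.0}
POLICY = {"u1": {"a": 0.75, "b": 0.25}, "u2": {"a": 0.5, "b": 0.5}}

u = utility(POLICY, P_X, lambda x, y: READING_TIME[(x, y)])
# 0.5 * (0.75 * 4 + 0.25 * 1) + 0.5 * (0.5 * 2 + 0.5 * 6) = 3.625
```

In practice neither $P(x)$ nor $\delta$ is known, so $U(\pi)$ has to be estimated from samples rather than computed this way.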

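14 Online Evaluation: A/B Testing
    Given $S = ((x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n))$ collected under $\pi_0$,
      $$\hat{U}(\pi_0) = \frac{1}{n} \sum_{i=1}^{n} \delta_i$$
    A/B Testing
      Deploy $\pi_1$: Draw $x \sim P(X)$, predict $y \sim \pi_1(Y \mid x)$, get $\delta(x, y)$
      Deploy $\pi_2$: Draw $x \sim P(X)$, predict $y \sim \pi_2(Y \mid x)$, get $\delta(x, y)$
      ...
      Deploy $\pi_H$: Draw $x \sim P(X)$, predict $y \sim \pi_H(Y \mid x)$, get $\delta(x, y)$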
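A simulated A/B test makes the on-policy estimate concrete: each policy is deployed on its own traffic and scored by the average observed reward. The click probabilities in `P_CLICK`, the two policies, and the uniform context distribution are assumptions chosen for illustration only.

```python
import random

# Hypothetical click probabilities; unknown to the experimenter in practice.
P_CLICK = {("u1", "a"): 0.9, ("u1", "b"): 0.1,
           ("u2", "a"): 0.2, ("u2", "b"): 0.6}

def deploy(policy, n, seed):
    """Run `policy` on n fresh users and return the average observed reward."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.choice(["u1", "u2"])                                       # x ~ P(X)
        actions = list(policy[x])
        y = rng.choices(actions, weights=[policy[x][a] for a in actions])[0]  # y ~ pi(Y | x)
        total += 1.0 if rng.random() < P_CLICK[(x, y)] else 0.0           # observe delta(x, y)
    return total / n

PI_1 = {"u1": {"a": 1.0, "b": 0.0}, "u2": {"a": 1.0, "b": 0.0}}
PI_2 = {"u1": {"a": 1.0, "b": 0.0}, "u2": {"a": 0.0, "b": 1.0}}

u1_hat = deploy(PI_1, 20000, seed=1)  # true utility: 0.5 * 0.9 + 0.5 * 0.2 = 0.55
u2_hat = deploy(PI_2, 20000, seed=2)  # true utility: 0.5 * 0.9 + 0.5 * 0.6 = 0.75
```

Every additional policy costs another deployment with fresh users, which is the limitation the off-policy estimators in the rest of the tutorial address.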

15 Pros and Cons of A/B Testing
    Pros
      User-centric measure
      No need for manual ratings
      No user/expert mismatch
    Cons
      Requires interactive experimental control
      Risk of fielding a bad or buggy $\pi_i$
      Number of A/B tests limited
      Long turnaround time

16 Evaluating Online Metrics Offline
    Online: On-policy A/B Test
      Draw $S_1$ from $\pi_1$ to get $\hat{U}(\pi_1)$, draw $S_2$ from $\pi_2$ to get $\hat{U}(\pi_2)$, and so on up to $S_7$ from $\pi_7$: one fresh sample per policy.
    Offline: Off-policy Counterfactual Estimates
      Draw a single $S$ from $\pi_0$ and reuse it to compute $\hat{U}(h)$ for many candidate policies $h$.

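18 Approach 1: Reward Predictor
    Idea: Use $S = ((x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n))$ from $\pi_0$ to estimate a reward predictor $\hat{\delta}(x, y)$.
    Deterministic $\pi$: simulated A/B testing with predicted $\hat{\delta}(x, y)$
      For actions $y_i' = \pi(x_i)$ from the new policy $\pi$, generate the predicted log
        $S' = ((x_1, y_1', \hat{\delta}(x_1, y_1')), \ldots, (x_n, y_n', \hat{\delta}(x_n, y_n')))$
      Estimate performance of $\pi$ via
        $$\hat{U}_{rp}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \hat{\delta}(x_i, y_i')$$
    Stochastic $\pi$:
        $$\hat{U}_{rp}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \sum_{y} \hat{\delta}(x_i, y)\, \pi(y \mid x_i)$$
    [Figure: true reward $\delta(x, y)$ versus predicted reward $\hat{\delta}(x, y)$ over $Y \mid x$]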

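19 Regression for Reward Prediction
    Learn $\hat{\delta}: X \times Y \to \mathbb{R}$
      1. Represent $(x, y)$ via features $\Psi(x, y)$
      2. Learn regression based on $\Psi(x, y)$ from $S$ collected under $\pi_0$
      3. Predict $\hat{\delta}(x, y')$ for $y' = \pi(x)$ of new policy $\pi$
    [Figure: regression fit $\hat{\delta}(x, y)$ versus true $\delta(x, y)$ in feature space $(\Psi_1, \Psi_2)$]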
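The three steps can be sketched with the simplest possible regression: a per-feature mean. This is an illustrative stand-in, not the model used in the tutorial's experiments; the feature map, the six-record log, and the global-mean fallback for unseen features are all assumptions.

```python
from collections import defaultdict

def fit_reward_predictor(log, featurize):
    """Steps 1 and 2: average the logged rewards per feature value Psi(x, y)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y, delta in log:
        key = featurize(x, y)
        sums[key] += delta
        counts[key] += 1
    global_mean = sum(d for _, _, d in log) / len(log)

    def predict(x, y):
        key = featurize(x, y)
        if counts.get(key, 0) == 0:
            return global_mean  # fallback for never-observed features (assumption)
        return sums[key] / counts[key]

    return predict

def u_rp_deterministic(log, predict, pi):
    """Step 3 for y' = pi(x): average predicted reward on the logged contexts."""
    return sum(predict(x, pi(x)) for x, _, _ in log) / len(log)

def u_rp_stochastic(log, predict, pi):
    """Step 3 for pi(y | x): expected predicted reward on the logged contexts."""
    return sum(
        sum(p * predict(x, y) for y, p in pi(x).items()) for x, _, _ in log
    ) / len(log)

# Hypothetical log of (x_i, y_i, delta_i) collected under pi_0.
LOG = [("u1", "a", 1.0), ("u1", "a", 1.0), ("u1", "b", 0.0),
       ("u2", "a", 0.0), ("u2", "b", 1.0), ("u2", "b", 0.0)]

predict = fit_reward_predictor(LOG, featurize=lambda x, y: (x, y))
u_det = u_rp_deterministic(LOG, predict, lambda x: "a" if x == "u1" else "b")
u_sto = u_rp_stochastic(LOG, predict, lambda x: {"a": 0.5, "b": 0.5})
```

With this log the fitted rewards are 1.0 for (u1, a), 0.5 for (u2, b) and 0.0 otherwise, so `u_det` is 0.75 and `u_sto` is 0.375. The estimate is only as good as the predictor for the actions the new policy picks.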

20 News Recommender: Experiment Setup
    Context $x$: user profile
    Action $y$: ranking
      Pick from 7 candidates to place into 3 slots
    Reward $\delta$: revenue
      Complicated hidden function
    Logging policy $\pi_0$: non-uniform randomized logging system
      Plackett-Luce exploration around the current production ranker (see case study)

21 News Recommender: Results
    RP is inaccurate even with more training and logged data

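22 Problems of Reward Predictor
    Modeling bias: choice of features and model
    Selection bias: $\pi_0$'s actions are overrepresented
      $$\hat{U}_{rp}(\pi) = \frac{1}{n} \sum_i \hat{\delta}(x_i, \pi(x_i))$$
    Can be unreliable and biased
    [Figure: regression fit $\hat{\delta}(x, y)$ versus true $\delta(x, \pi(x))$ in feature space $(\Psi_1, \Psi_2)$]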
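A small exact calculation shows how the two biases interact. The sketch below assumes a deliberately misspecified predictor that ignores the context and computes its large-sample limit under the logging policy; all numbers (`P_X`, `DELTA`, `PI0`) are hypothetical.

```python
P_X = {"u1": 0.5, "u2": 0.5}
DELTA = {("u1", "a"): 1.0, ("u1", "b"): 0.0,
         ("u2", "a"): 0.0, ("u2", "b"): 1.0}
PI0 = {"u1": {"a": 0.9, "b": 0.1}, "u2": {"a": 0.5, "b": 0.5}}  # logging policy

def context_free_predictor(y):
    """Large-sample limit of regressing delta on the action alone:
    E[delta | y] under the logging distribution P(x) * pi_0(y | x)."""
    num = sum(P_X[x] * PI0[x][y] * DELTA[(x, y)] for x in P_X)
    den = sum(P_X[x] * PI0[x][y] for x in P_X)
    return num / den

def always_b(x):
    """New policy to evaluate: play action b for every context."""
    return "b"

u_true = sum(P_X[x] * DELTA[(x, always_b(x))] for x in P_X)            # 0.5
u_rp = sum(P_X[x] * context_free_predictor(always_b(x)) for x in P_X)  # 0.25 / 0.30 = 5/6
```

Action b was logged mostly for u2, where it works, so the context-free model credits b with a reward of about 0.83 everywhere, while the true utility of always playing b is 0.5. More logged data does not remove this gap, because it is already the infinite-data limit.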

24 Approach 2: Model the Bias
    Idea: Fix the mismatch between the distribution $\pi_0(Y \mid x)$ that generated the data and the distribution $\pi(Y \mid x)$ we aim to evaluate.
      $$U(\pi) = \int\!\!\int \delta(x, y)\, \pi(y \mid x)\, P(x)\, dx\, dy$$

25 Counterfactual Model
    Example: Treating Heart Attacks
      Patients $x_i$, $i \in \{1, \ldots, n\}$
      Treatments $Y$: Bypass / Stent / Drugs
      Chosen treatment for patient $x_i$: $y_i$
      Outcomes $\delta_i$: 5-year survival, 0 / 1
      Which treatment is best?

26 Counterfactual Model
    Example: Treating Heart Attacks (analogous example: Placing a Vertical)
      Patients $x_i$, $i \in \{1, \ldots, n\}$
      Treatments $Y$: Bypass / Stent / Drugs (vertical: Pos 1 / Pos 2 / Pos 3)
      Chosen treatment for patient $x_i$: $y_i$
      Outcomes $\delta_i$: 5-year survival, 0 / 1 (vertical: click / no click on SERP)
      Which treatment is best?

27 Counterfactual Model
    Example: Treating Heart Attacks
      Patients $x_i$, $i \in \{1, \ldots, n\}$
      Treatments $Y$: Bypass / Stent / Drugs
      Chosen treatment for patient $x_i$: $y_i$
      Outcomes $\delta_i$: 5-year survival, 0 / 1
      Which treatment is best?
        Everybody Drugs
        Everybody Stent
        Everybody Bypass
      Observed survival: Drugs 3/4, Stent 2/3, Bypass 2/4. Really?

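28 Treatment Effects
    Average Treatment Effect of treatment $y$:
      $$U(y) = \frac{1}{n} \sum_i \delta(x_i, y)$$
    Example: $U(\text{bypass}) = 5/11$, $U(\text{stent}) = 7/11$, $U(\text{drugs}) = 4/11$
    [Figure: table of patients showing one factual outcome and the remaining counterfactual outcomes per patient]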

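29 Assignment Mechanism
    Probabilistic Treatment Assignment
      For patient $i$: $\pi_0(Y_i = y \mid x_i)$
      Selection bias
    Inverse Propensity Score Estimator
      Propensity: $p_i = \pi_0(Y_i = y_i \mid x_i)$
        $$\hat{U}_{ips}(y) = \frac{1}{n} \sum_i \frac{\mathbb{I}\{y_i = y\}}{p_i}\, \delta(x_i, y_i)$$
      Unbiased: $E[\hat{U}_{ips}(y)] = U(y)$, if $\pi_0(Y_i = y \mid x_i) > 0$ for all $i$
    Example: $\hat{U}_{ips}(\text{drugs}) = 0.36 < 0.75$, where 0.75 is the naive survival rate among the patients who received drugs
    [Figure: patients annotated with assignment probabilities $\pi_0(Y_i = y \mid x_i)$]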
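The estimator can be traced by hand on a tiny log. The four records below are invented for illustration and are not the patients from the slide; each record stores the assigned treatment $y_i$, its propensity $p_i$, and the outcome $\delta_i$.

```python
def ips_treatment_effect(log, treatment):
    """U_ips(y) = (1/n) * sum_i 1{y_i == y} / p_i * delta_i."""
    return sum(delta / p for y, p, delta in log if y == treatment) / len(log)

def naive_treatment_effect(log, treatment):
    """Average outcome among the patients who happened to get `treatment`."""
    outcomes = [delta for y, _, delta in log if y == treatment]
    return sum(outcomes) / len(outcomes)

# Hypothetical records (y_i, p_i, delta_i).
LOG = [("drugs", 0.8, 1), ("drugs", 0.8, 1), ("stent", 0.5, 0), ("drugs", 0.5, 0)]

naive = naive_treatment_effect(LOG, "drugs")  # 2 survivors out of 3 treated
ips = ips_treatment_effect(LOG, "drugs")      # (1/0.8 + 1/0.8 + 0) / 4 = 0.625
```

The naive average only looks at patients who received drugs, so it inherits the assignment mechanism's preferences. The IPS estimate divides by $n$, not by the number of treated patients, and reweights each treated patient by how unlikely that assignment was.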

30 Experimental vs Observational
    Controlled Experiment
      Assignment mechanism under our control
      Propensities $p_i = \pi_0(Y_i = y_i \mid x_i)$ are known by design
      Requirement: $\forall y: \pi_0(Y_i = y \mid x_i) > 0$ (probabilistic)
    Observational Study
      Assignment mechanism not under our control
      Propensities $p_i$ need to be estimated
      Estimate $\hat{\pi}_0(Y_i \mid z_i) = \pi_0(Y_i \mid x_i)$ based on features $z_i$
      Requirement: $\hat{\pi}_0(Y_i \mid z_i) = \hat{\pi}_0(Y_i \mid \delta_i, z_i)$ (unconfounded)
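In the observational case the propensities must be estimated from the data. One simple option, assumed here purely for illustration, is to stratify on a discrete feature $z_i$ and use the empirical treatment frequencies within each stratum; the records and the age feature are hypothetical, and the estimate is only meaningful if assignment is unconfounded given $z_i$.

```python
from collections import Counter, defaultdict

def estimate_propensities(records):
    """Estimate pi_0(y | z) by the empirical frequency of y within each stratum z."""
    counts = defaultdict(Counter)
    for z, y in records:
        counts[z][y] += 1
    return {
        z: {y: c / sum(ys.values()) for y, c in ys.items()}
        for z, ys in counts.items()
    }

# Hypothetical observational records (z_i, y_i).
RECORDS = [("age>=60", "drugs"), ("age>=60", "drugs"), ("age>=60", "drugs"),
           ("age>=60", "bypass"), ("age<60", "stent"), ("age<60", "drugs")]

p_hat = estimate_propensities(RECORDS)
# p_hat["age>=60"] == {"drugs": 0.75, "bypass": 0.25}
```

A treatment that never occurs in a stratum gets no entry, that is, an estimated propensity of zero, which signals that the probabilistic-assignment requirement cannot be verified from this data for that stratum.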

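31 Conditional Treatment Policies
    Policy (deterministic)
      Context $x_i$ describing patient
      Pick treatment $y_i$ based on $x_i$: $y_i = \pi(x_i)$
      Example policy: $\pi(A) = \text{drugs}$, $\pi(B) = \text{stent}$, $\pi(C) = \text{bypass}$
    Average Treatment Effect
      $$U(\pi) = \frac{1}{n} \sum_i \delta(x_i, \pi(x_i))$$
    IPS Estimator
      $$\hat{U}_{ips}(\pi) = \frac{1}{n} \sum_i \frac{\mathbb{I}\{y_i = \pi(x_i)\}}{p_i}\, \delta(x_i, y_i)$$
    [Figure: the 11 patients labeled with context types B, C, A, B, A, B, A, C, A, C, B]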

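32 Stochastic Treatment Policies
    Policy (stochastic)
      Context $x_i$ describing patient
      Pick treatment $y$ based on $x_i$: $\pi(y \mid x_i)$
    Note: the assignment mechanism is a stochastic policy as well!
    Average Treatment Effect
      $$U(\pi) = \frac{1}{n} \sum_i \sum_y \delta(x_i, y)\, \pi(y \mid x_i)$$
    IPS Estimator
      $$\hat{U}_{ips}(\pi) = \frac{1}{n} \sum_i \frac{\pi(y_i \mid x_i)}{p_i}\, \delta(x_i, y_i)$$
    [Figure: the 11 patients labeled with context types B, C, A, B, A, B, A, C, A, C, B]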
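Both policy types fit one estimator once a deterministic policy is written as a distribution with all mass on a single treatment. The sketch below uses an invented five-patient log with context types A, B, C; the propensities and the 0.5 / 0.25 stochastic policy are assumptions for illustration.

```python
PREFERRED = {"A": "drugs", "B": "stent", "C": "bypass"}
TREATMENTS = ["drugs", "stent", "bypass"]

def deterministic(x):
    """pi(A) = drugs, pi(B) = stent, pi(C) = bypass, as a one-hot distribution."""
    return {PREFERRED[x]: 1.0}

def stochastic(x):
    """Hypothetical policy: 0.5 on the preferred treatment, 0.25 on each other one."""
    return {y: 0.5 if y == PREFERRED[x] else 0.25 for y in TREATMENTS}

def ips_policy(log, pi):
    """U_ips(pi) = (1/n) * sum_i pi(y_i | x_i) / p_i * delta_i."""
    return sum(pi(x).get(y, 0.0) / p * delta for x, y, p, delta in log) / len(log)

# Hypothetical records (x_i, y_i, p_i, delta_i).
LOG = [("A", "drugs", 0.8, 1), ("B", "drugs", 0.5, 0), ("B", "stent", 0.5, 1),
       ("C", "bypass", 0.8, 0), ("C", "stent", 0.2, 1)]

u_det = ips_policy(LOG, deterministic)  # (1/0.8 + 1/0.5) / 5 = 0.65
u_sto = ips_policy(LOG, stochastic)     # (0.5/0.8 + 0.5/0.5 + 0.25/0.2) / 5 = 0.575
```

For the deterministic policy the weight $\pi(y_i \mid x_i)$ is exactly the indicator $\mathbb{I}\{y_i = \pi(x_i)\}$, so only records where the logged treatment matches the policy contribute.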

33 Counterfactual Model = Logs
                            Medical          Search Engine    Ad Placement        Recommender
    Recorded in log:
      Context $x_i$         Diagnostics      Query            User + Page         User + Movie
      Treatment $y_i$       BP/Stent/Drugs   Ranking          Placed Ad           Watched Movie
      Outcome $\delta_i$    Survival         Click metric     Click / no Click    Star rating
      Propensities $p_i$    controlled (*)   controlled       controlled          observational
    New Policy $\pi$        FDA Guidelines   Ranker           Ad Placer           Recommender
    T-effect $U(\pi)$       Average quality of new policy.

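35 System Evaluation via Inverse Propensity Scoring
    Definition [IPS Utility Estimator]: Given $S = ((x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n))$ collected under $\pi_0$,
      $$\hat{U}_{ips}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \delta_i\, \frac{\pi(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}$$
    where the denominator $\pi_0(y_i \mid x_i)$ is the propensity $p_i$.
    Unbiased estimate of utility for any $\pi$, if the propensity is nonzero whenever $\pi(y_i \mid x_i) > 0$.
    Note: If $\pi = \pi_0$, then this is the online A/B test with $\hat{U}_{ips}(\pi_0) = \frac{1}{n} \sum_i \delta_i$. Off-policy vs. on-policy estimation.
    [Horvitz & Thompson, 1952] [Rubin, 1983] [Zadrozny et al., 2003] [Li et al., 2011]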
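A short simulation illustrates the definition and the on-policy special case. It is a sketch under assumptions: a two-context, two-action world with a deterministic reward table `DELTA`, a logging policy `PI0` whose propensities are stored in the log, and a target policy `PI` that was never deployed.

```python
import random

P_X = {"u1": 0.5, "u2": 0.5}
DELTA = {("u1", "a"): 1.0, ("u1", "b"): 0.0,
         ("u2", "a"): 0.0, ("u2", "b"): 1.0}
PI0 = {"u1": {"a": 0.8, "b": 0.2}, "u2": {"a": 0.5, "b": 0.5}}  # logging policy
PI = {"u1": {"a": 1.0, "b": 0.0}, "u2": {"a": 0.0, "b": 1.0}}   # policy to evaluate

def draw_log(n, seed=0):
    """Sample S = [(x_i, y_i, delta_i, p_i)] under the logging policy pi_0."""
    rng = random.Random(seed)
    log = []
    for _ in range(n):
        x = rng.choices(list(P_X), weights=list(P_X.values()))[0]
        actions = list(PI0[x])
        y = rng.choices(actions, weights=[PI0[x][a] for a in actions])[0]
        log.append((x, y, DELTA[(x, y)], PI0[x][y]))  # propensity logged with the action
    return log

def ips(log, pi):
    """U_ips(pi) = (1/n) * sum_i delta_i * pi(y_i | x_i) / p_i."""
    total = 0.0
    for x, y, delta, p in log:
        if p <= 0.0:
            raise ValueError("logged propensity must be positive")
        total += delta * pi[x][y] / p
    return total / len(log)

S = draw_log(20000)
u_ips = ips(S, PI)                                  # off-policy estimate of U(pi)
u_true = sum(P_X[x] * PI[x][y] * DELTA[(x, y)] for x in P_X for y in PI[x])
on_policy = ips(S, PI0)                             # all weights equal 1
average = sum(d for _, _, d, _ in S) / len(S)       # plain A/B-test average
```

Here `u_true` is 1.0 and `u_ips` should land close to it, while `on_policy` coincides with `average` because every importance weight is exactly 1 when $\pi = \pi_0$.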

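36 Illustration of IPS
    IPS Estimator:
      $$\hat{U}_{ips}(\pi) = \frac{1}{n} \sum_i \frac{\pi(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}\, \delta_i$$
    [Figure: logging distribution $\pi_0(Y \mid x)$ and target distribution $\pi(Y \mid x)$]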

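37 IPS Estimator is Unbiased
    $$
    \begin{aligned}
    E[\hat{U}_{ips}(\pi)]
    &= \frac{1}{n} \sum_{x_1, y_1} \cdots \sum_{x_n, y_n} \left[ \sum_i \frac{\pi(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}\, \delta(x_i, y_i) \right] \pi_0(y_1 \mid x_1) \cdots \pi_0(y_n \mid x_n)\, P(x_1) \cdots P(x_n) \\
    &= \frac{1}{n} \sum_{x_1, y_1} \pi_0(y_1 \mid x_1) P(x_1) \cdots \sum_{x_n, y_n} \pi_0(y_n \mid x_n) P(x_n) \left[ \sum_i \frac{\pi(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}\, \delta(x_i, y_i) \right] \\
    &= \frac{1}{n} \sum_i \sum_{x_1, y_1} \pi_0(y_1 \mid x_1) P(x_1) \cdots \sum_{x_n, y_n} \pi_0(y_n \mid x_n) P(x_n)\, \frac{\pi(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}\, \delta(x_i, y_i) \\
    &= \frac{1}{n} \sum_i \sum_{x_i, y_i} \pi_0(y_i \mid x_i) P(x_i)\, \frac{\pi(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}\, \delta(x_i, y_i) \\
    &= \frac{1}{n} \sum_i \sum_{x_i, y_i} \pi(y_i \mid x_i) P(x_i)\, \delta(x_i, y_i) \\
    &= \frac{1}{n} \sum_i U(\pi) = U(\pi)
    \end{aligned}
    $$
    The fourth line marginalizes out all samples other than $i$ (each sums to 1). The fifth line cancels $\pi_0(y_i \mid x_i)$, which requires probabilistic assignment: $\pi_0(y_i \mid x_i) > 0$ wherever $\pi(y_i \mid x_i) > 0$.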
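For finite context and action sets the expectation can be checked by direct enumeration. Because the $n$ samples are i.i.d., the expectation of the IPS average equals the expectation of a single term, which the sketch below computes exactly. The tables are hypothetical, and `PI0_BAD` is included to show what happens when the probabilistic-assignment condition is violated.

```python
P_X = {"u1": 0.5, "u2": 0.5}
DELTA = {("u1", "a"): 1.0, ("u1", "b"): 0.0,
         ("u2", "a"): 0.0, ("u2", "b"): 1.0}
PI = {"u1": {"a": 1.0, "b": 0.0}, "u2": {"a": 0.0, "b": 1.0}}        # policy to evaluate
PI0_GOOD = {"u1": {"a": 0.8, "b": 0.2}, "u2": {"a": 0.5, "b": 0.5}}  # full support
PI0_BAD = {"u1": {"a": 0.8, "b": 0.2}, "u2": {"a": 1.0, "b": 0.0}}   # never logs (u2, b)

def utility(pi):
    """U(pi) = sum_x P(x) * sum_y pi(y | x) * delta(x, y)."""
    return sum(P_X[x] * pi[x][y] * DELTA[(x, y)] for x in P_X for y in pi[x])

def expected_ips_term(pi0, pi):
    """E over (x, y) ~ P(x) pi_0(y | x) of pi(y | x) / pi_0(y | x) * delta(x, y)."""
    total = 0.0
    for x, px in P_X.items():
        for y, p0 in pi0[x].items():
            if p0 > 0.0:  # pairs with zero propensity never appear in the log
                total += px * p0 * (pi[x][y] / p0) * DELTA[(x, y)]
    return total

unbiased = expected_ips_term(PI0_GOOD, PI)  # equals U(pi) = 1.0
biased = expected_ips_term(PI0_BAD, PI)     # 0.5: the (u2, b) reward is never observed
```

With full support the expectation matches `utility(PI)`. With `PI0_BAD` the target policy's action for u2 is never logged, so its reward cannot be recovered by any reweighting.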

38 News Recommender: Results
    IPS eventually beats RP; variance decays as $O(1/n)$

39 Adith takes over
