Counterfactual Model for Learning Systems
CS 7792 - Fall 2018
Thorsten Joachims
Department of Computer Science & Department of Information Science, Cornell University
Reading: Imbens, Rubin, "Causal Inference for Statistics, Social, and Biomedical Sciences", 2015. Chapters 1, 3, 12.
Interactive System Schematic
[Diagram: the system π receives a context x, chooses an action y for x, and its performance is measured by the utility U(π)]
News Recommender
Context x: user
Action y: portfolio of news articles
Feedback δ(x, y): reading time in minutes
Ad Placement
Context x: user and page
Action y: ad that is placed
Feedback δ(x, y): click / no-click
Search Engine
Context x: query
Action y: ranking
Feedback δ(x, y): click / no-click
Log Data from Interactive Systems
Data: S = ((x_1, y_1, δ_1), ..., (x_n, y_n, δ_n)) with contexts, actions, and rewards/losses logged under π_0
→ Partial-information (aka contextual bandit) feedback
Properties:
- Contexts x_i drawn i.i.d. from unknown P(X)
- Actions y_i selected by existing system π_0: X → Y
- Feedback δ_i from unknown function δ: X × Y → ℝ
[Zadrozny et al., 2003] [Langford & Li] [Bottou et al., 2013]
Goal: Counterfactual Evaluation
Use interaction log data S = ((x_1, y_1, δ_1), ..., (x_n, y_n, δ_n)) for evaluation of a system π:
- Estimate online measures of some system π offline.
- System π can be different from the π_0 that generated the log.
Evaluation: Outline
Evaluating Online Metrics Offline
- A/B testing (on-policy)
- Counterfactual estimation from logs (off-policy)
Approach 1: Model the world
- Estimation via reward prediction
Approach 2: Model the bias
- Counterfactual model
- Inverse propensity scoring (IPS) estimator
Online Performance Metrics
Example metrics: CTR, revenue, time-to-success, interleaving, etc.
The correct choice depends on the application and is not the focus of this lecture.
This lecture: metric encoded as δ(x, y) [click/payoff/time for the (x, y) pair]
System
Definition [Deterministic Policy]: Function y = π(x) that picks action y for context x.
Definition [Stochastic Policy]: Distribution π(y|x) that samples action y given context x.
[Figure: action space Y(x) with the distributions π_1(Y|x) and π_2(Y|x) of two policies]
System Performance
Definition [Utility of Policy]: The expected reward / utility U(π) of policy π is
U(π) = ∫ ∫ δ(x, y) π(y|x) P(x) dx dy
(e.g., δ(x, y) = reading time of user x for portfolio y)
[Figure: distributions π(y|x_i) and π(y|x_j) over the action spaces Y(x_i) and Y(x_j)]
Online Evaluation: A/B Testing
Given S = ((x_1, y_1, δ_1), ..., (x_n, y_n, δ_n)) collected under π, the on-policy estimate is
U(π) ≈ (1/n) Σ_{i=1}^{n} δ_i
A/B Testing:
- Deploy π_1: draw x ~ P(X), predict y ~ π_1(Y|x), get δ(x, y)
- Deploy π_2: draw x ~ P(X), predict y ~ π_2(Y|x), get δ(x, y)
- ...
- Deploy π_H: draw x ~ P(X), predict y ~ π_H(Y|x), get δ(x, y)
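A minimal simulation sketch of this on-policy A/B estimate (not from the lecture): the context distribution, the stochastic policy pi, and the reward function delta below are toy assumptions, chosen so that the true utility U(pi) is 0.7 by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = np.arange(3)                      # toy action space Y = {0, 1, 2}

def delta(x, y):
    """Synthetic reward delta(x, y): 1 if action y matches context x, else 0 (assumption)."""
    return 1.0 if y == x % 3 else 0.0

def pi(x):
    """Toy stochastic policy pi(Y|x): prefers action (x mod 3) with probability 0.7."""
    probs = np.full(3, 0.15)
    probs[x % 3] = 0.7
    return probs

# On-policy A/B test: deploy pi, log (x, y, delta) triples, and average the rewards.
n = 10_000
xs = rng.integers(0, 3, size=n)                                  # x ~ P(X), here uniform
ys = np.array([rng.choice(ACTIONS, p=pi(x)) for x in xs])        # y ~ pi(Y|x)
deltas = np.array([delta(x, y) for x, y in zip(xs, ys)])

U_hat = deltas.mean()                       # (1/n) * sum_i delta_i
print(f"A/B estimate of U(pi): {U_hat:.3f}  (true value is 0.7 by construction)")
```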
Pros and Cons of A/B Testing
Pros:
- User-centric measure
- No need for manual ratings
- No user/expert mismatch
Cons:
- Requires interactive experimental control
- Risk of fielding a bad or buggy π_i
- Number of A/B tests is limited
- Long turnaround time
Evaluating Online Metrics Offline
Online (on-policy A/B test): draw S_1 from π_1 → Û(π_1); draw S_2 from π_2 → Û(π_2); ...; draw S_7 from π_7 → Û(π_7) (one deployment per policy).
Offline (off-policy counterfactual estimates): draw a single S from π_0 → counterfactual estimates Û(π) for many different policies π from that one log.
Evaluation: Outline
Evaluating Online Metrics Offline
- A/B testing (on-policy)
- Counterfactual estimation from logs (off-policy)
Approach 1: Model the world
- Estimation via reward prediction
Approach 2: Model the bias
- Counterfactual model
- Inverse propensity scoring (IPS) estimator
Approach 1: Reward Predictor
Idea: use S = ((x_1, y_1, δ_1), ..., (x_n, y_n, δ_n)) from π_0 to estimate a reward predictor δ̂(x, y).
Deterministic π: simulated A/B test with the predicted δ̂(x, y)
- For the actions π(x_i) of the new policy π, generate the predicted log S' = ((x_1, π(x_1), δ̂(x_1, π(x_1))), ..., (x_n, π(x_n), δ̂(x_n, π(x_n))))
- Estimate the performance of π via Û_rp(π) = (1/n) Σ_{i=1}^{n} δ̂(x_i, π(x_i))
Stochastic π: Û_rp(π) = (1/n) Σ_{i=1}^{n} Σ_y δ̂(x_i, y) π(y|x_i)
Regression for Reward Prediction
Learn δ̂: X × Y → ℝ
1. Represent (x, y) via joint features Ψ(x, y)
2. Learn a regression for δ̂ based on Ψ(x, y) from S collected under π_0
3. Predict δ̂(x, y) for the actions y = π(x) of the new policy π
[Figure: regression fit of δ̂(x, y) vs. true δ(x, y) over the feature space (Ψ_1, Ψ_2)]
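A hedged sketch of this "model the world" approach, assuming scikit-learn is available; the one-hot feature map Ψ(x, y), the synthetic reward, and the deliberately skewed logging policy π_0 are all made up for illustration and are not the lecture's experiment.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
N_ACTIONS = 3

def psi(x, y):
    """Joint feature map Psi(x, y): a one-hot of the (x mod 3, y) pair (toy assumption)."""
    f = np.zeros(3 * N_ACTIONS)
    f[(x % 3) * N_ACTIONS + y] = 1.0
    return f

def delta(x, y):
    """Unknown true reward (synthetic): high if y matches the context, low otherwise."""
    return 1.0 if y == x % 3 else 0.1

# 1. Logged data collected under a logging policy pi_0 that rarely explores.
n = 2_000
xs = rng.integers(0, 3, size=n)
ys = rng.choice(N_ACTIONS, size=n, p=[0.8, 0.1, 0.1])            # pi_0 ignores the context
deltas = np.array([delta(x, y) + rng.normal(0, 0.1) for x, y in zip(xs, ys)])

# 2. Learn a regression-based reward predictor delta_hat on Psi(x, y).
features = np.array([psi(x, y) for x, y in zip(xs, ys)])
model = Ridge(alpha=1.0).fit(features, deltas)

def delta_hat(x, y):
    return model.predict(psi(x, y).reshape(1, -1))[0]

# 3. Simulated A/B test: score a new deterministic policy pi(x) = x mod 3 on predicted rewards.
def pi_new(x):
    return x % 3

U_rp = np.mean([delta_hat(x, pi_new(x)) for x in xs])
print(f"Reward-predictor estimate U_rp(pi): {U_rp:.3f}")
```

With this well-specified toy model the estimate comes out fine; the slides below discuss why it can fail badly when the model is misspecified or the logged actions are heavily skewed.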
News Recommender: Experiment Setup
Context x: user profile
Action y: ranking (pick from 7 candidates to place into 3 slots)
Reward δ: revenue (complicated hidden function)
Logging policy π_0: non-uniform randomized logging system (Plackett-Luce exploration around the current production ranker)
News Recommender: Results
[Figure: estimation error of the reward predictor (RP); RP is inaccurate even with more training and logged data]
Problems of the Reward Predictor
- Modeling bias: choice of features and model
- Selection bias: π_0's actions are overrepresented in the training data
Û_rp(π) = (1/n) Σ_i δ̂(x_i, π(x_i)) can be unreliable and biased.
[Figure: regression fit δ̂(x, y) vs. true δ(x, π(x)) over the feature space (Ψ_1, Ψ_2)]
Evaluation: Outline
Evaluating Online Metrics Offline
- A/B testing (on-policy)
- Counterfactual estimation from logs (off-policy)
Approach 1: Model the world
- Estimation via reward prediction
Approach 2: Model the bias
- Counterfactual model
- Inverse propensity score (IPS) weighting estimator
Approach 2: Model the Bias
Idea: fix the mismatch between the distribution π_0(Y|x) that generated the data and the distribution π(Y|x) we aim to evaluate:
U(π) = ∫ ∫ δ(x, y) π(y|x) P(x) dx dy
Counterfactual Model
Example: treating heart attacks
- Patients x_i, i = 1, ..., n
- Treatments Y: bypass / stent / drugs
- Chosen treatment for patient x_i: y_i
- Outcome δ_i: 5-year survival (0 / 1)
Which treatment is best?
Counterfactual Model
Same setup, search analogy: placing a vertical
- Contexts i = 1, ..., n (patients → queries)
- Treatments Y: position 1 / position 2 / position 3 (instead of bypass / stent / drugs)
- Chosen treatment for x_i: y_i (the position the vertical was placed in)
- Outcome δ_i: click / no click on the SERP (instead of 5-year survival)
Which treatment (position) is best?
Counterfactual Model
Example: treating heart attacks (continued)
- Patients x_i, i = 1, ..., n; treatments Y: bypass / stent / drugs; chosen treatment y_i; outcome δ_i: 5-year survival (0 / 1)
Which treatment is best? Consider the hypothetical columns "everybody drugs", "everybody stent", "everybody bypass".
Naive averages over the patients who actually received each treatment: drugs 3/4, stent 2/3, bypass 2/4. Really?
Treatment Effects
Average treatment effect of treatment y over all patients:
U(y) = (1/n) Σ_i δ(x_i, y)
Computing it requires both the factual outcome (the treatment each patient actually received) and the counterfactual outcomes (the treatments they did not receive).
[Slide example: numerical values of U(bypass), U(stent), U(drugs) from the patient table]
Assignment Mechanism
Probabilistic treatment assignment: for patient i, treatment y is assigned with probability π_0(Y_i = y | x_i) → selection bias.
Inverse propensity score (IPS) estimator:
Û_ips(y) = (1/n) Σ_i I{y_i = y} / p_i · δ(x_i, y_i),   with propensity p_i = π_0(Y_i = y_i | x_i)
Unbiased: E[Û_ips(y)] = U(y), if π_0(Y_i = y | x_i) > 0 for all i.
[Slide example: a table of propensities π_0(Y_i = y | x_i) and the resulting IPS estimate for U(drugs)]
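A hedged sketch of selection bias and the single-treatment IPS estimator, using a made-up population (a binary "severity" covariate, synthetic survival probabilities, and an assignment mechanism that sends severe patients to bypass); all numbers are illustrative assumptions, not the lecture's example.

```python
import numpy as np

rng = np.random.default_rng(2)
treatments = ["bypass", "stent", "drugs"]
n = 20_000

# Hypothetical patients: severity in {0, 1} drives both outcomes and treatment choice.
severity = rng.binomial(1, 0.5, size=n)

# Full potential-outcome table delta(x_i, y): 5-year survival for every treatment
# (in reality only the factual entry of each row is ever observed).
surv_prob = np.where(severity[:, None] == 1,
                     np.array([[0.5, 0.3, 0.2]]),      # severe patients
                     np.array([[0.8, 0.7, 0.6]]))      # mild patients
outcomes = rng.binomial(1, surv_prob).astype(float)

# Assignment mechanism pi_0(Y = y | x_i): severe patients tend to get bypass.
probs = np.where(severity[:, None] == 1,
                 np.array([[0.7, 0.2, 0.1]]),
                 np.array([[0.2, 0.3, 0.5]]))
assigned = np.array([rng.choice(3, p=p) for p in probs])    # factual treatment y_i
observed = outcomes[np.arange(n), assigned]                 # factual outcome delta_i
p_i = probs[np.arange(n), assigned]                         # propensity of chosen treatment

for k, name in enumerate(treatments):
    true_u = outcomes[:, k].mean()                          # needs counterfactual outcomes
    naive = observed[assigned == k].mean()                  # biased: averages treated only
    ips = np.mean((assigned == k) / p_i * observed)         # IPS estimator U_ips(y)
    print(f"{name:6s}  true={true_u:.3f}  naive={naive:.3f}  IPS={ips:.3f}")
```

The naive per-treatment averages make bypass look worse than it is because it is assigned mostly to severe patients; dividing by the propensities removes that selection bias.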
Experimental vs. Observational
Controlled experiment:
- Assignment mechanism under our control
- Propensities p_i = π_0(Y_i = y_i | x_i) are known by design
- Requirement: ∀y: π_0(Y_i = y | x_i) > 0 (probabilistic)
Observational study:
- Assignment mechanism not under our control
- Propensities p_i need to be estimated: estimate π_0(Y_i | z_i) = π_0(Y_i | x_i) based on features z_i
- Requirement: π_0(Y_i | z_i) = π_0(Y_i | δ_i, z_i) (unconfounded)
Conditional Treatment Policies
Policy (deterministic): context x_i describes the patient; pick treatment y_i based on x_i: y_i = π(x_i).
Example policy: π(A) = drugs, π(B) = stent, π(C) = bypass
Average treatment effect: U(π) = (1/n) Σ_i δ(x_i, π(x_i))
IPS estimator: Û_ips(π) = (1/n) Σ_i I{y_i = π(x_i)} / p_i · δ(x_i, y_i)
[Figure: patients labeled with types A, B, C]
Stochastic Treatment Policies
Policy (stochastic): context x_i describes the patient; pick treatment y based on x_i with probability π(y|x_i).
Note: the assignment mechanism is a stochastic policy as well!
Average treatment effect: U(π) = (1/n) Σ_i Σ_y δ(x_i, y) π(y|x_i)
IPS estimator: Û_ips(π) = (1/n) Σ_i π(y_i|x_i) / p_i · δ(x_i, y_i)
[Figure: patients labeled with types A, B, C]
Counterfactual Model = Logs
(recorded in the log)     Medical              Search Engine    Ad Placement      Recommender
Context x_i               Diagnostics          Query            User + page       User + movie
Treatment y_i             BP / stent / drugs   Ranking          Placed ad         Watched movie
Outcome δ_i               Survival             Click metric     Click / no click  Star rating
Propensities p_i          controlled (*)       controlled       controlled        observational
New policy π              FDA guidelines       Ranker           Ad placer         Recommender
Treatment effect U(π)     Average quality of the new policy
Evaluation: Outline
Evaluating Online Metrics Offline
- A/B testing (on-policy)
- Counterfactual estimation from logs (off-policy)
Approach 1: Model the world
- Estimation via reward prediction
Approach 2: Model the bias
- Counterfactual model
- Inverse propensity scoring (IPS) estimator
System Evaluation via Inverse Propensity Scoring
Definition [IPS Utility Estimator]: Given S = ((x_1, y_1, δ_1), ..., (x_n, y_n, δ_n)) collected under π_0,
Û_ips(π) = (1/n) Σ_{i=1}^{n} δ_i · π(y_i|x_i) / π_0(y_i|x_i),   where π_0(y_i|x_i) is the propensity p_i.
Unbiased estimate of the utility for any π, if the propensity is nonzero whenever π(y_i|x_i) > 0.
Note: if π = π_0, this reduces to the online A/B test with Û_ips(π_0) = (1/n) Σ_i δ_i → off-policy vs. on-policy estimation.
[Horvitz & Thompson, 1952] [Rubin, 1983] [Zadrozny et al., 2003] [Li et al., 2011]
Illustration of IPS
IPS estimator: Û_ips(π) = (1/n) Σ_i δ_i · π(y_i|x_i) / π_0(y_i|x_i)
Unbiased: if π_0(y|x) > 0 for all x, y with π(y|x) P(x) > 0, then E[Û_ips(π)] = U(π).
[Figure: distributions π_0(Y|x) and π(Y|x) over the action space]
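A hedged sketch of the IPS utility estimator for a stochastic target policy, again with toy context, reward, and policy choices (all assumptions), chosen so that the true U(π) equals 0.8.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_actions = 50_000, 3

def delta(x, y):                               # synthetic reward (assumption)
    return 1.0 if y == x % 3 else 0.0

def pi0_probs(x):                              # logging policy pi_0: mostly action 0
    return np.array([0.6, 0.3, 0.1])

def pi_probs(x):                               # target policy pi: prefers action (x mod 3)
    p = np.full(n_actions, 0.1)
    p[x % 3] = 0.8
    return p

# Log data under pi_0, recording the propensity of each logged action.
xs = rng.integers(0, 3, size=n)
ys = np.array([rng.choice(n_actions, p=pi0_probs(x)) for x in xs])
deltas = np.array([delta(x, y) for x, y in zip(xs, ys)])
props = np.array([pi0_probs(x)[y] for x, y in zip(xs, ys)])       # pi_0(y_i | x_i)

# IPS estimate of U(pi): reweight each logged reward by pi(y_i|x_i) / pi_0(y_i|x_i).
weights = np.array([pi_probs(x)[y] for x, y in zip(xs, ys)]) / props
U_ips = np.mean(weights * deltas)
print(f"IPS estimate of U(pi): {U_ips:.3f}   (true U(pi) = 0.800 by construction)")
```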
IPS Estimator is Unbiased
E[Û_ips(π)]
  = Σ_{x_1,y_1} ... Σ_{x_n,y_n} [ (1/n) Σ_i (π(y_i|x_i)/π_0(y_i|x_i)) δ(x_i, y_i) ] · π_0(y_1|x_1)P(x_1) ... π_0(y_n|x_n)P(x_n)      (samples are independent)
  = (1/n) Σ_i Σ_{x_1,y_1} ... Σ_{x_n,y_n} π_0(y_1|x_1)P(x_1) ... π_0(y_n|x_n)P(x_n) · (π(y_i|x_i)/π_0(y_i|x_i)) δ(x_i, y_i)
  = (1/n) Σ_i Σ_{x_i,y_i} π_0(y_i|x_i)P(x_i) · (π(y_i|x_i)/π_0(y_i|x_i)) δ(x_i, y_i)      (marginalize out the other samples)
  = (1/n) Σ_i Σ_{x_i,y_i} P(x_i) π(y_i|x_i) δ(x_i, y_i)      (full support: π_0 cancels)
  = (1/n) Σ_i U(π)      (identical terms)
  = U(π)
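To see the unbiasedness claim numerically, one can repeat the IPS estimate on many freshly drawn logs and check that the estimates average to U(π) even though each individual estimate is noisy; a small sketch reusing the toy setup from the previous block (all numbers are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

def one_ips_estimate(n=200):
    """One IPS estimate from a fresh log of size n (same toy setup as above)."""
    xs = rng.integers(0, 3, size=n)
    pi0 = np.array([0.6, 0.3, 0.1])
    ys = rng.choice(3, size=n, p=pi0)                  # logging policy ignores the context
    deltas = (ys == xs % 3).astype(float)              # synthetic reward
    pi_target = np.where(ys == xs % 3, 0.8, 0.1)       # pi(y_i | x_i)
    return np.mean(pi_target / pi0[ys] * deltas)

estimates = np.array([one_ips_estimate() for _ in range(2000)])
print(f"mean of 2000 IPS estimates: {estimates.mean():.3f}   (true U(pi) = 0.800)")
print(f"std of a single estimate at n=200: {estimates.std():.3f}")
```

The mean is close to the true utility (no bias), while the per-estimate standard deviation illustrates the variance that the summary slide warns about.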
News Recommender: Results
[Figure: estimation error vs. amount of log data; IPS eventually beats RP, and its variance decays as O(1/n)]
Counterfactual Policy Evaluation
Controlled experiment setting: log data D = ((x_1, y_1, δ_1, p_1), ..., (x_n, y_n, δ_n, p_n)) with logged propensities.
Observational setting: log data D = ((x_1, y_1, δ_1, z_1), ..., (x_n, y_n, δ_n, z_n)); estimate propensities p_i = P(y_i | x_i, z_i) based on x_i and other confounders z_i.
Goal: estimate the average treatment effect of a new policy π, via the IPS estimator (or many others):
Û(π) = (1/n) Σ_i δ_i · π(y_i|x_i) / p_i
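A hedged sketch of the observational setting, assuming scikit-learn: the confounder z, the assignment mechanism, and the outcome model below are invented for illustration; the point is only the pattern "fit a propensity model on the confounders, then plug the estimated propensities into IPS".

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 20_000

# Observational logs: a confounder z influences both the chosen action and the outcome.
z = rng.normal(size=(n, 2))
p_action1 = 1 / (1 + np.exp(-(1.5 * z[:, 0] - 0.5 * z[:, 1])))   # unknown assignment mechanism
y = rng.binomial(1, p_action1)                                   # logged action y_i in {0, 1}
delta = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * y + z[:, 0]))))  # logged outcome delta_i

# Estimate propensities p_i = P(y_i | z_i) with a logistic model on the confounders.
prop_model = LogisticRegression().fit(z, y)
p_hat = prop_model.predict_proba(z)[np.arange(n), y]             # prob. of the logged action

# IPS estimate for the deterministic policy "always take action 1".
match = (y == 1).astype(float)                                   # I{y_i = pi(x_i)}
U_ips = np.mean(match / p_hat * delta)
U_true = np.mean(1 / (1 + np.exp(-(0.5 + z[:, 0]))))             # truth under "always 1"
print(f"IPS with estimated propensities: {U_ips:.3f}   truth: {U_true:.3f}")
```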
Evaluation: Summary
Evaluating Online Metrics Offline
- A/B testing (on-policy)
- Counterfactual estimation from logs (off-policy)
Approach 1: Model the world
- Estimation via reward prediction
- Pro: low variance. Con: model mismatch can lead to high bias.
Approach 2: Model the bias
- Counterfactual model
- Inverse propensity scoring (IPS) estimator
- Pro: unbiased for known propensities. Con: large variance.
From Evaluation to Learning
Naive "model the world" learning:
- Learn δ̂: X × Y → ℝ
- Derive policy: π(x) = argmin_y δ̂(x, y)
Naive "model the bias" learning:
- Find the policy that optimizes the IPS training error:
  π̂ = argmin_π (1/n) Σ_i δ_i · π(y_i|x_i) / π_0(y_i|x_i)
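A hedged sketch of the naive "model the bias" learner: pick, from a tiny hand-made policy class, the policy that minimizes the IPS training objective on logged bandit data (treating δ as a loss, matching the argmin above); the policy class, logging policy, and loss function are all toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_actions = 20_000, 3

def loss(x, y):                              # synthetic loss delta(x, y): 0 is best
    return 0.0 if y == x % 3 else 1.0

# Logged bandit data under a fixed logging policy pi_0 with known propensities.
pi0 = np.array([0.5, 0.3, 0.2])
xs = rng.integers(0, 3, size=n)
ys = rng.choice(n_actions, size=n, p=pi0)
deltas = np.array([loss(x, y) for x, y in zip(xs, ys)])
props = pi0[ys]

def ips_risk(policy_probs):
    """IPS training objective (1/n) sum_i delta_i * pi(y_i|x_i) / pi_0(y_i|x_i)."""
    return np.mean(deltas * policy_probs[xs, ys] / props)

def policy_table(theta):
    """Tiny policy class: softmax policies that prefer action (x mod 3) with strength theta."""
    table = np.ones((3, n_actions))
    table[np.arange(3), np.arange(3)] = np.exp(theta)
    return table / table.sum(axis=1, keepdims=True)

thetas = np.linspace(-2.0, 4.0, 13)
best_theta = min(thetas, key=lambda t: ips_risk(policy_table(t)))
print(f"theta minimizing the IPS risk: {best_theta:.2f} (larger theta favors action x mod 3)")
```

Later lectures in the outline below discuss why this naive objective can be problematic (e.g., variance and propensity overfitting) and how it is improved.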
Outline of Class: Counterfactual and Causal Inference
Evaluation
- Improved counterfactual estimators
- Applications in recommender systems, etc.
- Dealing with missing propensities, randomization, etc.
Learning
- Batch learning from bandit feedback
- Dealing with combinatorial and continuous action spaces
- Learning theory
- More general learning with partial-information data (e.g., ranking, embedding)