
causal inference at hulu
Allen Tran
July 17, 2016, Hulu

Introduction

Most interesting business questions at Hulu are causal. Business: what would happen if we did x instead of y?

- dropped prices for risky subs
- halved our AdWords marketing spend
- bought some piece of content

Causal inference is not supervised learning

Supervised learning: a model f : X -> Y, optimized for f(x) ~ y
- the test-train paradigm works
- regularization works

Causal inference: a model f : X, T -> Y, optimized to identify the treatment effect f(x, T = 1) - f(x, T = 0)
- there is no test set; we need to lean on statistical theory
- naive regularization is a horrible idea

We sacrifice predictive performance to identify the treatment effect.

Roadmap

1. Potential outcomes framework/treatment effects
2. What we do at Hulu
3. Modern ML approaches to causality (we should do this)

Goal: convince us to think less supervised learning, more causal inference.
Note: the focus will be on observational studies, not on designing randomized trials.

Causal inference = identifying treatment effects

Let x_i describe some features (a user), with two potential outcomes, Y_1(x_i) and Y_0(x_i). Y_k(x) is observed when T = k; Y_k(x) when T = 1 - k is the unobserved counterfactual.

Individual treatment effect:
    ITE(x_i) = E_{Y_1 \sim p(Y_1 \mid x_i)}[Y_1(x_i)] - E_{Y_0 \sim p(Y_0 \mid x_i)}[Y_0(x_i)]

Heterogeneous treatment effect:
    HTE(X) = E_{Y_1 \sim p(Y_1 \mid x \in X)}[Y_1(x)] - E_{Y_0 \sim p(Y_0 \mid x \in X)}[Y_0(x)]

Average treatment effect:
    ATE = E[Y_1(x) - Y_0(x)] = E_{x \sim p(x)}[ITE(x)]
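As a toy illustration of these estimands (a minimal sketch; the data-generating process and all variable names here are invented for the example), we can simulate both potential outcomes and compute them directly, which is exactly what real observational data never lets us do:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# A single user covariate (e.g., some engagement measure).
x = rng.normal(size=n)

# Both potential outcomes -- observable only in a simulation.
y0 = 1.0 + 0.5 * x + rng.normal(scale=0.1, size=n)  # outcome if untreated
y1 = y0 + 2.0 + 1.0 * x                             # outcome if treated

ite = y1 - y0              # individual treatment effect, ITE(x_i)
ate = ite.mean()           # average treatment effect, ATE
hte = ite[x > 1.0].mean()  # HTE for the subgroup X = {x > 1}

print(f"ATE = {ate:.3f}, HTE(x > 1) = {hte:.3f}")
```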

Assumptions

Ignorability: assignment to treatment/control is effectively random, conditional on x:
    (Y_0, Y_1) \perp T \mid x
or equivalently,
    p(Y_0, Y_1, T \mid x) = p(Y_0, Y_1 \mid x) p(T \mid x)

Overlap/common support: both treatment and control occur across all units:
    p(T = t \mid x) > 0 \quad \forall t, x
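Overlap is checkable in practice. A minimal sketch (the function name and setup are illustrative, assuming a feature matrix X and a binary treatment vector t): estimate the propensity p(T = 1 | x) and inspect its range within each group; scores piling up near 0 or 1 signal a common-support problem.

```python
from sklearn.linear_model import LogisticRegression

def check_overlap(X, t):
    """Estimate propensity scores and report their range by treatment group."""
    p = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    for group in (0, 1):
        scores = p[t == group]
        print(f"T={group}: propensity in [{scores.min():.3f}, {scores.max():.3f}]")
    return p
```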

Estimating Average Treatment Effects

The first part of the ATE:
    E_{x, Y_1, T}[Y_1(x)] = E_x( E_{Y_1, T \mid x}[Y_1(x, T)] )
                          = E_x( E_{Y_1 \mid x}[Y_1(x) \mid T = 1] )    (by ignorability)
The term inside the parentheses is observed. Hence
    ATE = E_x( E_{Y_1 \mid x}[Y_1(x) \mid T = 1] - E_{Y_0 \mid x}[Y_0(x) \mid T = 0] )

But the x_i units we observe are distributed as p(x_i \mid T = 1) and p(x_i \mid T = 0), not p(x). In a randomized trial, x \perp T, so a difference in means is sufficient.
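Continuing the toy simulation from above (again purely illustrative), a difference in means recovers the ATE under randomized assignment but not when assignment depends on x:

```python
t_rand = rng.binomial(1, 0.5, size=n)               # randomized: x independent of T
t_conf = rng.binomial(1, 1 / (1 + np.exp(-2 * x)))  # confounded: p(T=1|x) rises with x

y_rand = np.where(t_rand == 1, y1, y0)  # observed outcomes under each assignment
y_conf = np.where(t_conf == 1, y1, y0)

def diff_in_means(y, t):
    return y[t == 1].mean() - y[t == 0].mean()

print(f"true ATE                 = {ate:.3f}")
print(f"randomized diff-in-means = {diff_in_means(y_rand, t_rand):.3f}")  # ~ ATE
print(f"confounded diff-in-means = {diff_in_means(y_conf, t_conf):.3f}")  # biased
```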

Estimating causal effects in practice

In order of simplicity:
1. Matching
2. Propensity score/reweighting
3. Covariate adjustment (the current strategy at Hulu: incremental retention, portfolio, sub churn)
4. Modern ML methods designed for causal inference

Covariate adjustment

Supervised learning with the treatment variable as a feature (a sketch follows below):
    ATE = (1/N) \sum_i [ f(x_i, T = 1) - f(x_i, T = 0) ]

Problems:
- No goodness-of-fit metrics. Remember, we are fitting treatment effects, not Y.
- Treatment effects are wrong if the model is misspecified (e.g., unconfoundedness does not hold):
  - unobserved demographics behind watched/did not watch (treatment) in the incremental retention model
  - specific types of shows (unobserved) get more marketing (treatment) in the MMM model
- Regularization? It may regularize away our treatment effect, which is potentially harmful.
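A minimal plug-in sketch of covariate adjustment (the model choice and names are arbitrary illustrations, not Hulu's production code): fit f(x, T) with T as a feature, then average the difference of the two plug-in predictions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def covariate_adjustment_ate(X, t, y):
    """Fit f(x, T) with T as a feature, then plug in T=1 and T=0 for everyone."""
    model = GradientBoostingRegressor().fit(np.column_stack([X, t]), y)
    f1 = model.predict(np.column_stack([X, np.ones(len(X))]))
    f0 = model.predict(np.column_stack([X, np.zeros(len(X))]))
    return (f1 - f0).mean()
```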

Causal Trees/Forests

Athey & Imbens (2015). Directly estimate treatment effects via decision trees. Shown to be a consistent estimator of the treatment effect (i.e., it is a correct model).

A regular decision tree:
1. Estimator (of the outcome): the mean of the outcome in leaf l,
       \hat{y}_l = (1/N_l) \sum_{i \in l} Y_i
2. Splitting criterion: a metric on the outcome (information gain/Gini/variance reduction) plus a complexity penalty
3. Score: assess predicted outcomes \hat{y}_i against y_i on a test set

Causal Trees/Forests: Transforming Outcomes

Transform the outcome Y_i. Let p_i = p(T = 1 \mid x_i):
    Y*_i = Y_i (T_i - p_i) / (p_i (1 - p_i))

The expectation of the transformed outcome is the average treatment effect!
    E_{x,T}[Y*_i] = E_x( E_T[Y*_i \mid x_i] )
                 = E_x( p_i E[Y*_i \mid T = 1, x_i] + (1 - p_i) E[Y*_i \mid T = 0, x_i] )
                 = E_x( Y_1(x_i) - Y_0(x_i) )
                 = ATE
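A quick numerical check of this identity on the toy simulation above (assuming the x, y1, y0, t_conf, and ate variables from the earlier sketches; illustrative only):

```python
# With the propensity p_i = p(T=1|x_i) known (or well estimated), the mean of
# the transformed outcome recovers the ATE even under confounded assignment.
p = 1 / (1 + np.exp(-2 * x))   # the true propensity used in the simulation
y_conf = np.where(t_conf == 1, y1, y0)
y_star = y_conf * (t_conf - p) / (p * (1 - p))
print(f"mean(Y*) = {y_star.mean():.3f}  vs  true ATE = {ate:.3f}")
```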

Causal Trees/Forests: Splitting

The ideal but infeasible criterion compares estimated treatment effects to ground truth:
    Q(\hat{\tau}) = E[ (\tau_i - \hat{\tau}_i)^2 ]

But decompose this:
    Q(\hat{\tau}) = E[\tau_i^2] + E[\hat{\tau}_i^2] - 2 E[\tau_i \hat{\tau}_i]

The first term doesn't involve \hat{\tau}_i, so we can ignore it. The second term is the sample mean of the squared estimator. The third term can be calculated (after more algebra): since the leaf estimate \hat{\tau}_l targets E[\tau_i \mid x_i \in l],
    E[\tau_i \hat{\tau}_i] = E_l( E[\tau_i \hat{\tau}_i \mid x_i \in l] ) = E_l( \hat{\tau}_l E[\tau_i \mid x_i \in l] ) = E[\hat{\tau}_i^2]

Substituting back, Q(\hat{\tau}) = const - E[\hat{\tau}_i^2]. Hence the criterion to continue splitting is to maximize
    E[ \hat{\tau}_i^2 ]
This rewards variance in the estimates of treatment effects. (Compare to the limiting case of no splits, where every x_i gets the same treatment effect.)

Causal Trees/Forests: Implementation (a sketch follows below)

1. Estimator (of the treatment effect): the mean of the transformed outcome in the leaf (the propensity weighting inside Y*_i, via p(T = 1 \mid x_i), corrects for leaf treatment/control distributions not being representative):
       \hat{\tau}_l = (1/N_l) \sum_{i \in l} Y*_i
2. Splitting criterion: MSE on the treatment effect plus a complexity penalty on the number of leaves:
       -(1/N) \sum_i \hat{\tau}_i^2 + \lambda n_{leaves}
3. Score: out-of-sample MSE on the treatment effect:
       (1/N_{test}) \sum_{i \in test} (\hat{\tau}_i - Y*_i)^2
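A minimal transformed-outcome tree fit with an off-the-shelf regressor (a sketch of the idea, not Athey & Imbens' honest-splitting implementation; it assumes the x and y_star arrays from the sketches above, and cost-complexity pruning stands in for the lambda penalty):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Regressing the transformed outcome Y* on x turns a regression tree into a
# (noisy) treatment-effect estimator: leaf means of Y* estimate tau(x).
X_feat = x.reshape(-1, 1)
X_tr, X_te, ys_tr, ys_te = train_test_split(X_feat, y_star, random_state=0)

tree = DecisionTreeRegressor(min_samples_leaf=2000, ccp_alpha=1e-3)
tree.fit(X_tr, ys_tr)
tau_hat = tree.predict(X_te)

mse = ((tau_hat - ys_te) ** 2).mean()  # out-of-sample score against Y*
print(f"estimated ATE = {tau_hat.mean():.3f}, test MSE vs Y* = {mse:.2f}")
```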

Deep Learning/Regularizing Representations

Shalit, Johansson & Sontag (2016)
- Learn a representation of x, \Phi(x)
- Merge the representation and the treatment to output a prediction, h(\Phi(x), t)
- Measure the difference between the distributions of representations for treatment and control, IPM(\Phi_{t=0}, \Phi_{t=1})

[Architecture diagram: x feeds a representation network \Phi; \Phi and t feed an outcome head with loss L(h(\Phi, t), y); an IPM penalty compares \Phi_{t=0} against \Phi_{t=1}.]

Deep Learning/Regularizing Representations

Loss function (a sketch follows below):
    \min_{\Phi, h} (1/N) \sum_{i=1}^N L( h(\Phi(x_i), T_i), Y_i ) + \alpha IPM_F( \{\Phi(x_i)\}_{i: T_i=0}, \{\Phi(x_i)\}_{i: T_i=1} )

IPM_F is some distance metric over distributions. Penalizing differences between the treatment and control groups:
- mitigates the problem that treatment is not randomly assigned (incremental retention: different demographics watch different shows)
- puts regularization in the right place

Theoretical result: if IPM_F is the Wasserstein distance or Maximum Mean Discrepancy, the loss function bounds the true counterfactual error.
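A compact PyTorch sketch of this objective (my own simplification of the paper's architecture: a shared representation, one outcome head per arm, and a linear-kernel MMD standing in for the IPM; all names and hyperparameters are invented):

```python
import torch
import torch.nn as nn

class CFRNet(nn.Module):
    def __init__(self, d_in, d_rep=64):
        super().__init__()
        # Shared representation Phi(x).
        self.phi = nn.Sequential(nn.Linear(d_in, d_rep), nn.ReLU(),
                                 nn.Linear(d_rep, d_rep), nn.ReLU())
        # One outcome head per treatment arm: h(Phi, t).
        self.h0 = nn.Sequential(nn.Linear(d_rep, d_rep), nn.ReLU(), nn.Linear(d_rep, 1))
        self.h1 = nn.Sequential(nn.Linear(d_rep, d_rep), nn.ReLU(), nn.Linear(d_rep, 1))

    def forward(self, x, t):
        rep = self.phi(x)
        y_hat = torch.where(t.bool(), self.h1(rep).squeeze(-1), self.h0(rep).squeeze(-1))
        return y_hat, rep

def mmd_linear(rep0, rep1):
    """Linear-kernel MMD: squared distance between group means of Phi."""
    return ((rep0.mean(0) - rep1.mean(0)) ** 2).sum()

def cfr_loss(model, x, t, y, alpha=1.0):
    y_hat, rep = model(x, t)
    factual = ((y_hat - y) ** 2).mean()          # L(h(Phi(x), T), Y)
    ipm = mmd_linear(rep[t == 0], rep[t == 1])   # IPM penalty on representations
    return factual + alpha * ipm
```

Training is an ordinary gradient-descent loop on cfr_loss; the design choice is that the IPM term regularizes the representation (the "right place") rather than the outcome weights.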

Dealing with confounding/missing variables

What if we know that some unobserved variable affects the outcome?

Strategies:
1. Causal graphs (Judea Pearl et al.): assess whether it is a problem, and find potential fixes
2. Instrumental variables: find instruments that induce random-like properties in our x variables

Causal graphs

Bishop, Pattern Recognition and Machine Learning, Chapter 8. Represent causal effects via directed graphs. A theoretical result ("d-separation") infers conditional independence from directed graphs.

We want
    Y \perp Z \mid X
where Y is our outcome, Z is unobserved, and X is observed.

Causal Graphs: Head-to-Tail

Chain Z -> X -> Y: observing X blocks the path from Z to Y, making Z independent of Y conditional on X.

Causal Graphs: Tail-to-Tail

Common cause Z <- X -> Y: observing X blocks the path from Z to Y, making Z independent of Y conditional on X.

Causal Graphs: Head-to-Head

Collider Z -> X <- Y: observing X unblocks the path from Z to Y, making Z dependent on Y conditional on X. We cannot use X as a conditioning variable.
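A tiny simulation of this collider effect (illustrative; independent Z and Y become correlated once we condition on their common effect X):

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=200_000)          # unobserved cause
y = rng.normal(size=200_000)          # outcome, independent of z
x = z + y + rng.normal(size=200_000)  # collider: z -> x <- y

print(f"corr(z, y)         = {np.corrcoef(z, y)[0, 1]:+.3f}")  # ~ 0
sel = x > 1.0                          # conditioning on (a stratum of) x
print(f"corr(z, y | x > 1) = {np.corrcoef(z[sel], y[sel])[0, 1]:+.3f}")  # clearly nonzero
```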

Causal Graphs: Fixing Head-to-Head

We can fix the prior situation by finding something else to condition on: conditioning on another variable that blocks the path from Z to Y makes Z independent of Y, conditional on that variable.

Instrumental Variables

Often (in nearly all observational data) our X variable is correlated with unobserved variables, making it impossible to identify the effect of X alone.

Idea: extract the random component of X that is uncorrelated with those other variables.

Instrumental Variables: Implementation

Suppose you have an instrument Z that is correlated with X but affects Y only through X.

Two-Stage Least Squares (a sketch follows below):
1. First stage: OLS of X on Z gives you fitted values \hat{X}_{IV}
2. Second stage: OLS of Y on \hat{X}_{IV}

But linear regression is not very powerful. How do we improve? What if I want to use a non-linear estimator (deep learning)?
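A minimal 2SLS sketch with numpy (all data here is invented for the example; the instrument z_iv shifts x_end but not the unobserved confounder u, which is what biases naive OLS):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
u = rng.normal(size=n)                       # unobserved confounder
z_iv = rng.normal(size=n)                    # instrument: moves x, not y directly
x_end = z_iv + u + rng.normal(size=n)        # endogenous regressor
y_out = 2.0 * x_end + u + rng.normal(size=n) # true effect of x on y is 2.0

def ols(X, y):
    """Return (intercept, slope) from least squares."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

print(f"naive OLS slope = {ols(x_end, y_out)[1]:.3f}")  # biased upward by u
a, b = ols(z_iv, x_end)                                 # first stage: x on z
x_hat = a + b * z_iv                                    # fitted values of x
print(f"2SLS slope      = {ols(x_hat, y_out)[1]:.3f}")  # ~ 2.0
```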

Instrumental Variables: Examples

Note: there is a whole branch of economics dedicated to finding super fun/crazy instruments.

- Effect of education on wages: education is correlated with unobserved ability. Use exogenous districting/local government policies that force more or less education on people.
- Effect of watching show A on retention: watch behavior is correlated with unobserved demographics/preferences. Induce (some) randomness in the reco strategy (change the probability of exposure) and use exposure probabilities as instruments.

Conclusion

- A lot of our questions are causal.
- We try to answer these causal questions in a supervised-learning-heavy manner.
- Supervised learning/covariate adjustment is not great at causal questions.
- There are alternatives that make use of modern ML algorithms/big data and directly identify treatment effects.