Causal Inference by Minimizing the Dual Norm of Bias. Nathan Kallus. Cornell University and Cornell Tech

1 Causal Inference by Minimizing the Dual Norm of Bias Nathan Kallus Cornell University and Cornell Tech

2 Matching Zoo
- It's a zoo of matching estimators for causal effects: PSM, NN, CM, CEM, GenMatch, Mean-Matching
- What are the inherent differences? What does it mean to match on covariates?
- This talk: a new classification, worst-case bias minimizing (WCBM)
  - Encompasses many existing methods
  - Reveals each method's motivation as optimality for a particular structure
  - Gives rise to new kernel matching estimators

3 Set-up
- tl;dr: measure SATT under unconfoundedness using a covariate-matched control sample
- Subjects $i = 1, \dots, n$
- Two treatments: treatment ($T_i = 1$) and control ($T_i = 0$), with $\mathcal{T}_0 = \{i : T_i = 0\}$ and $\mathcal{T}_1 = \{i : T_i = 1\}$
- Observe: covariates $X_i$, treatment $T_i$, and outcome $Y_i = Y_i(T_i)$
- Unseen: counterfactual potential outcomes $Y_i(0)$, $Y_i(1)$
- $X = (X_1, \dots, X_n)$ and $T = (T_1, \dots, T_n)$ are the whole sample
- Assume unconfoundedness: $\mathbb{E}[Y_i(t) \mid T_i, X_i] = \mathbb{E}[Y_i(t) \mid X_i]$ and $P(T_i = t \mid X_i) > 0$
- Want to measure SATT: $\mathrm{SATT} = \frac{1}{n_1} \sum_{i \in \mathcal{T}_1} \left( Y_i(1) - Y_i(0) \right)$, where $n_1 = |\mathcal{T}_1|$

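4 Set-up
- Estimate by making a matched control sample: $\hat\tau_W = \frac{1}{n_1} \sum_{i \in \mathcal{T}_1} Y_i - \sum_{i \in \mathcal{T}_0} W_i Y_i$, with $W_i \ge 0$ and $\sum_{i \in \mathcal{T}_0} W_i = 1$
- Honest weights $W(T, X)$ that only depend on $T$, $X$. No causal effect mining allowed!
- Weight types:
  - General weights: $\mathcal{W}_0^{\text{general}} = \{ W_{\mathcal{T}_0} \in \mathbb{R}_+^{\mathcal{T}_0} : \sum_{i \in \mathcal{T}_0} W_i = 1 \}$
  - Matched subset w/o rep.: $\mathcal{W}_0^{\text{w/o rep.}} = \{ W_{\mathcal{T}_0} \in \{0, 1/n_0'\}^{\mathcal{T}_0} : \sum_{i \in \mathcal{T}_0} W_i = 1 \}$
  - Matched multi-subset w/ rep.: $\mathcal{W}_0^{\text{w/ rep.}} = \{ W_{\mathcal{T}_0} \in \{0, 1/n_0', 2/n_0', \dots\}^{\mathcal{T}_0} : \sum_{i \in \mathcal{T}_0} W_i = 1 \}$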
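The estimator on this slide can be sketched in a few lines of plain Python. The function name `tau_hat`, the dictionary layout for the control weights, and the toy numbers are assumptions made for illustration; only the formula (treated mean minus weighted control mean, with nonnegative weights summing to one) comes from the slide.

```python
def tau_hat(y, t, w):
    """Treated mean minus weighted control mean.

    y: observed outcomes, t: 0/1 treatment indicators,
    w: dict mapping each control index to its weight (hypothetical layout).
    """
    treated = [i for i, ti in enumerate(t) if ti == 1]
    controls = [i for i, ti in enumerate(t) if ti == 0]
    total = sum(w[i] for i in controls)
    if any(w[i] < 0 for i in controls) or abs(total - 1.0) > 1e-9:
        raise ValueError("control weights must be nonnegative and sum to one")
    treated_mean = sum(y[i] for i in treated) / len(treated)
    weighted_control_mean = sum(w[i] * y[i] for i in controls)
    return treated_mean - weighted_control_mean


# Toy data: two treated units, three controls, one control left unmatched.
y = [3.0, 5.0, 1.0, 2.0, 4.0]
t = [1, 1, 0, 0, 0]
w = {2: 0.5, 3: 0.5, 4: 0.0}
estimate = tau_hat(y, t, w)
```

Note that the weights here are fixed by hand; an honest method would choose them from `t` and the covariates alone, never from `y`.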

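5 Decomposing Bias
- Define bias (a misnomer, more like error): $\text{bias} = \mathbb{E}[\hat\tau_W - \mathrm{SATT} \mid X, T]$
- Let $f_0(x) = \mathbb{E}[Y_i(0) \mid X_i = x]$ and $\epsilon_i = Y_i(0) - f_0(X_i)$
- Theorem: $\hat\tau_W - \mathrm{SATT} = B(W; f_0) + \mathcal{E}(W)$, where
  - $B(W; f) := \frac{1}{n_1} \sum_{i \in \mathcal{T}_1} f(X_i) - \sum_{i \in \mathcal{T}_0} W_i f(X_i)$
  - $\mathcal{E}(W) := \frac{1}{n_1} \sum_{i \in \mathcal{T}_1} \epsilon_i - \sum_{i \in \mathcal{T}_0} W_i \epsilon_i$
- And, under unconfoundedness, $\mathbb{E}[\mathcal{E}(W) \mid X, T] = 0$, so $\mathbb{E}[\hat\tau_W - \mathrm{SATT} \mid X, T] = B(W; f_0)$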
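The decomposition is an algebraic identity, so it can be checked numerically on simulated data where the potential outcomes are known. Everything specific below (the linear `f0`, the constant treatment effect of 1.5, the sample sizes, uniform control weights) is an assumption chosen for illustration, not part of the talk.

```python
import random

random.seed(0)
n = 20

def f0(x):
    # Hypothetical conditional mean of Y(0); any function would do.
    return 1.0 + 2.0 * x

x = [random.uniform(-1, 1) for _ in range(n)]
t = [1 if i < 8 else 0 for i in range(n)]           # first 8 units treated
eps = [random.gauss(0, 0.1) for _ in range(n)]      # noise in Y(0)
y0 = [f0(xi) + e for xi, e in zip(x, eps)]
y1 = [v + 1.5 for v in y0]                          # assumed constant effect
y = [y1[i] if t[i] == 1 else y0[i] for i in range(n)]

T1 = [i for i in range(n) if t[i] == 1]
T0 = [i for i in range(n) if t[i] == 0]
n1 = len(T1)
w = {i: 1.0 / len(T0) for i in T0}                  # uniform control weights

def contrast(g):
    """Treated average of g minus W-weighted control average of g."""
    return sum(g[i] for i in T1) / n1 - sum(w[i] * g[i] for i in T0)

tau_hat = contrast(y)
satt = sum(y1[i] - y0[i] for i in T1) / n1
bias_term = contrast([f0(xi) for xi in x])          # B(W; f0)
noise_term = contrast(eps)                          # E(W)
gap = (tau_hat - satt) - (bias_term + noise_term)   # should be ~0
```

The quantity `gap` is zero up to floating-point rounding, whatever weights are used, which is why only the `bias_term` depends on the unknown function.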

6 Worst-Case Bias
- Under unconfoundedness, our bias is $B(W; f_0)$
- It involves an unknown function, $f_0$
- Consider guarding against any possible such function
- $B(W; f)$ scales linearly in $f$
- Consider the worst case relative to a magnitude $\|f\| \in [0, \infty]$
- I.e., minimize $\max_f \frac{B(W; f)}{\|f\|}$

7 Worst-Case Bias
- For the worst-case bias to be well-defined, assume:
  - $\|\cdot\|$ is a semi-norm on $\mathcal{F} = \{f : \|f\| < \infty\}$ (implies $\mathcal{F}$ is a linear subspace)
  - $\mathcal{F} / \{f : \|f\| = 0\}$ forms a Banach space (complete normed vector space)
  - Contrasts $f \mapsto B(W; f)$ are continuous maps for any $W$ (equivalently, $\exists M(W) : B(W; f) \le M(W) \|f\|$, since continuous operators on a Banach space are exactly the bounded operators)
- The dual space $\mathcal{F}^* = \{\text{continuous linear operators on } \mathcal{F}\}$ is a Banach space with norm $\|A\|_* = \sup_{\|f\| \le 1} A(f)$

8 Dual Norm of Bias
- The dual norm of bias is the normalized worst-case bias: $\mathfrak{B}(W; \mathcal{F}) = \max_f \frac{B(W; f)}{\|f\|} = \max_{\|f\| \le 1} B(W; f) = \|B(W; \cdot)\|_*$
- Definition 1. A matching method $W(T, X)$ is said to be worst-case bias minimizing (WCBM) if, for some $\mathcal{W}$ and $\|\cdot\|$ satisfying the assumptions, we have $W(T, X) \in \arg\min_{W \in \mathcal{W}} \mathfrak{B}(W; \mathcal{F}) \ne \mathcal{W}$.

9 Existing Methods as WCBM
- Surprising fact: most covariate-matching methods are WCBM!
- Reveals structural motivations of different matching methods
- Choose the method that matches your structural beliefs
- WCBM is the right framework

10 Nearest-Neighbor Matching
- NNM: find a control match for each treated unit and minimize the sum of pairwise distances per $\delta(x, x')$
- Can be with or without replacement
- Hansen, 2004 & 2006; Rubin, 1973; Cochran, 1953
- Classically, not necessarily the minimal sum of distances
- Usually Mahalanobis: $\delta(x, x') = \sqrt{(x - x')^T \hat\Sigma^{-1} (x - x')}$

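11 Nearest-Neighbor Matching
- Theorem: nearest-neighbor matching with replacement wrt $\delta(x, x')$ is WCBM with $\|f\| = \sup_{x \ne x'} \frac{f(x) - f(x')}{\delta(x, x')}$ and either
  - $\mathcal{W}_0 = \{ W_{\mathcal{T}_0} \in \mathbb{R}_+^{\mathcal{T}_0} : \sum_{i \in \mathcal{T}_0} W_i = 1 \}$, or
  - $\mathcal{W}_0 = \{ W_{\mathcal{T}_0} \in \{0, 1/n_1, 2/n_1, \dots\}^{\mathcal{T}_0} : \sum_{i \in \mathcal{T}_0} W_i = 1 \}$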
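As a concrete picture of the with-replacement case, the sketch below matches each treated unit to its closest control and converts match counts into weights that are multiples of $1/n_1$. The function name, the one-dimensional toy covariates, and the absolute-value distance are assumptions for illustration; the code performs the matching only and does not itself verify the worst-case bias claim of the theorem.

```python
def nn_match_with_replacement(x_treated, x_control, dist):
    """Match every treated unit to its nearest control, reusing controls freely.

    Returns the control weights (multiples of 1/n1) and the summed match distance.
    """
    n1 = len(x_treated)
    weights = [0.0] * len(x_control)
    total_distance = 0.0
    for xt in x_treated:
        # Index of the control closest to this treated unit.
        j = min(range(len(x_control)), key=lambda k: dist(xt, x_control[k]))
        weights[j] += 1.0 / n1
        total_distance += dist(xt, x_control[j])
    return weights, total_distance


# Toy example: the control at 0.9 is matched twice, the control at 5.0 never.
weights, total_distance = nn_match_with_replacement(
    [0.0, 1.0, 1.1], [0.1, 0.9, 5.0], lambda a, b: abs(a - b)
)
```

With replacement, the per-unit nearest neighbor automatically minimizes the sum of distances; without replacement an assignment problem has to be solved instead.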

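12 Nearest-Neighbor Matching
- Theorem: nearest-neighbor matching without replacement wrt $\delta(x, x')$ is WCBM with $\|f\| = \sup_{x \ne x'} \frac{f(x) - f(x')}{\delta(x, x')}$ and either
  - $\mathcal{W}_0 = \{ W_{\mathcal{T}_0} \in [0, 1/n_1]^{\mathcal{T}_0} : \sum_{i \in \mathcal{T}_0} W_i = 1 \}$, or
  - $\mathcal{W}_0 = \{ W_{\mathcal{T}_0} \in \{0, 1/n_1\}^{\mathcal{T}_0} : \sum_{i \in \mathcal{T}_0} W_i = 1 \}$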

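13 Caliper Matching
- CM: find the smallest caliper size, and pairs, such that all pairwise distances fit within the caliper
- Raynor, 1983; Cochran & Rubin, 1973
- Classically, not necessarily the optimal caliper
- When with replacement, (almost) the same as NNM
- Theorem: caliper matching without replacement wrt $\delta(x, x')$ is WCBM with $\|f\| = \mathbb{E}_{x \sim \hat\mu_n,\, x' \sim \hat\mu_n} \left[ \frac{|f(x) - f(x')|}{\delta(x, x')} \,\middle|\, x \ne x' \right]$, where $\hat\mu_n$ is the EDF, and either
  - $\mathcal{W}_0 = \{ W_{\mathcal{T}_0} \in [0, 1/n_1]^{\mathcal{T}_0} : \sum_{i \in \mathcal{T}_0} W_i = 1 \}$, or
  - $\mathcal{W}_0 = \{ W_{\mathcal{T}_0} \in \{0, 1/n_1\}^{\mathcal{T}_0} : \sum_{i \in \mathcal{T}_0} W_i = 1 \}$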

14 Coarsened Exact Matching
- CEM: match exactly within each stratum, as defined by a coarsening function $C : \mathcal{X} \to \{1, \dots, M\}$
- E.g., if there are 5 treated subjects and 3 control subjects in a given stratum, then each of the control subjects is given weight proportional to 5/3 (weights sum to one)
- Iacus et al., 2011
- Theorem: CEM with coarsening function $C$ is WCBM with $\|f\| = \sup_{x \in \mathcal{X}} |f(x)|$ if $|f(C^{-1}(j))| = 1 \ \forall j$ (that is, $f$ is constant on every stratum) and $\|f\| = \infty$ otherwise, and $\mathcal{W}_0 = \{ W_{\mathcal{T}_0} \in \mathbb{R}_+^{\mathcal{T}_0} : \sum_{i \in \mathcal{T}_0} W_i = 1 \}$
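The weighting rule in the example (each control weighted in proportion to the treated-to-control ratio of its stratum, normalized to sum to one) can be sketched as follows. The function name and the stratum labels are hypothetical, and raising an error when a treated unit has no control in its stratum is a simplification chosen here; the slide does not say how that case is handled.

```python
from collections import Counter


def cem_weights(strata, t):
    """Control weights for exact matching within strata.

    strata: stratum label C(X_i) per unit, t: 0/1 treatment indicators.
    A control in stratum j gets weight (n1_j / n0_j) / n1.
    """
    n1_by = Counter(s for s, ti in zip(strata, t) if ti == 1)
    n0_by = Counter(s for s, ti in zip(strata, t) if ti == 0)
    if any(n0_by[s] == 0 for s in n1_by):
        # Assumption: this sketch refuses strata with treated units but no controls.
        raise ValueError("some treated units have no control in their stratum")
    n1 = sum(n1_by.values())
    return {
        i: n1_by[s] / (n0_by[s] * n1)
        for i, (s, ti) in enumerate(zip(strata, t))
        if ti == 0
    }


# Stratum "a": 5 treated and 3 controls (the slide's example); stratum "b": 1 and 2.
strata = ["a"] * 8 + ["b"] * 3
t = [1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0]
weights = cem_weights(strata, t)
```

In stratum "a" each control receives (5/3)/6 = 5/18, so the three of them together carry the same share of weight, 5/6, as the stratum's share of treated units.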

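15 Mean Matching
- MM: subsample the control population to have a sample mean similar to that of the treated population wrt $M_V(W) = \left\| V^{1/2} \left( \frac{1}{n_1} \sum_{i \in \mathcal{T}_1} X_i - \sum_{i \in \mathcal{T}_0} W_i X_i \right) \right\|_2$
- Rubin, 2012; Rubin, 1973; Greenberg, 1953
- Classically, not necessarily optimal
- Theorem: mean matching is WCBM with $\|f\|^2 = \beta^T V^{-1} \beta$ if $f(x) = \beta_0 + \beta^T x$ and $\|f\| = \infty$ otherwise, and
  - $\mathcal{W}_0 = \{ W_{\mathcal{T}_0} \in \{0, 1/n_1, 2/n_1, \dots\}^{\mathcal{T}_0} : \sum_{i \in \mathcal{T}_0} W_i = 1 \}$ (w/ repl), or
  - $\mathcal{W}_0 = \{ W_{\mathcal{T}_0} \in \{0, 1/n_1\}^{\mathcal{T}_0} : \sum_{i \in \mathcal{T}_0} W_i = 1 \}$ (w/o repl)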
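A small sketch of the mean-imbalance criterion may help. It reads the criterion as the $V$-weighted Euclidean norm of the difference between the treated covariate mean and the weighted control covariate mean, that is, $\sqrt{d^T V d}$ for the mean difference $d$; the function name, the identity choice for `V`, and the toy points are assumptions for illustration.

```python
import math


def mean_imbalance(x_treated, x_control, w, V):
    """sqrt(d' V d), where d = treated mean minus w-weighted control mean."""
    dim = len(V)
    n1 = len(x_treated)
    d = [
        sum(x[k] for x in x_treated) / n1
        - sum(wi * x[k] for wi, x in zip(w, x_control))
        for k in range(dim)
    ]
    return math.sqrt(sum(d[a] * V[a][b] * d[b] for a in range(dim) for b in range(dim)))


identity = [[1.0, 0.0], [0.0, 1.0]]
x_treated = [(0.0, 0.0), (2.0, 2.0)]      # treated mean is (1, 1)
x_control = [(1.0, 1.0), (4.0, 0.0)]

balanced = mean_imbalance(x_treated, x_control, [1.0, 0.0], identity)
unbalanced = mean_imbalance(x_treated, x_control, [0.5, 0.5], identity)
```

Putting all weight on the control at (1, 1) reproduces the treated mean exactly, so `balanced` is zero even though no individual treated unit sits at that point. This is what distinguishes mean matching from pairwise matching.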

16 Kernel Matching
- Most matching methods are WCBM
- Each corresponds to a particular structure / function space
- What about other spaces?
- In ML, reproducing kernel Hilbert spaces (RKHS) are very common for generalizing learned functions, e.g. kernelized SVM, kernel ridge regression, kernel PCA, ...
- Via WCBM, kernels can be used for matching too!

17 Reproducing Kernel Hilbert Space
- HS = inner product space that is a Banach space
- RKHS = HS with continuous evaluations
- By the Riesz representation theorem, PSD kernel $K(x, x')$ $\leftrightarrow$ RKHS
- Polynomial kernel: $K_s(x, x') = (1 + x^T x' / s)^s$
  - Spans polynomials of degree $\le s$ (finite-dimensional)
- Exponential kernel: $K(x, x') = e^{x^T x'}$
  - Infinite-dimensional
  - $C_0$-universal
- Gaussian kernel: $K_s(x, x') = e^{-s^2 \|x - x'\|^2}$
  - Infinite-dimensional
  - $C_0$-universal
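For reference, the three kernels can be written as plain functions on coordinate tuples. The function names are made up for this sketch, and the Gaussian scaling (the exponent $-s^2 \|x - x'\|^2$) is one reading of the slide's formula and should be treated as an assumption.

```python
import math


def dot(x, z):
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, s):
    # (1 + x'z / s)^s: its RKHS contains polynomials of degree at most s.
    return (1.0 + dot(x, z) / s) ** s


def exponential_kernel(x, z):
    # exp(x'z)
    return math.exp(dot(x, z))


def gaussian_kernel(x, z, s=1.0):
    # exp(-s^2 ||x - z||^2); the placement of s is an assumed parametrization.
    return math.exp(-(s ** 2) * sum((a - b) ** 2 for a, b in zip(x, z)))


u, v = (1.0, 0.0), (0.5, -0.5)
# 2x2 Gram matrix for the Gaussian kernel; PSD means its determinant is >= 0.
gram = [[gaussian_kernel(p, q) for q in (u, v)] for p in (u, v)]
gram_det = gram[0][0] * gram[1][1] - gram[0][1] * gram[1][0]
```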

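18 Kernel Matching
- Kernel Gram matrix: $K_{ij} = K(X_i, X_j)$
- Theorem: $\mathfrak{B}^2(W; \mathcal{F}) = \frac{1}{n_1^2} e_{n_1}^T K_{\mathcal{T}_1, \mathcal{T}_1} e_{n_1} + W_{\mathcal{T}_0}^T K_{\mathcal{T}_0, \mathcal{T}_0} W_{\mathcal{T}_0} - \frac{2}{n_1} e_{n_1}^T K_{\mathcal{T}_1, \mathcal{T}_0} W_{\mathcal{T}_0}$, where $e_{n_1}$ is the all-ones vector of length $n_1$
- Minimizing over different domains gives different matching methods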
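The squared worst-case bias is a quadratic form in the control weights and can be evaluated directly from kernel evaluations. The sketch below uses a Gaussian kernel with an assumed parametrization, a hypothetical function name, and toy points in which two controls sit exactly on the two treated units, so a perfect match exists.

```python
import math


def gaussian_kernel(x, z, s=1.0):
    # Assumed parametrization: exp(-s^2 ||x - z||^2).
    return math.exp(-(s ** 2) * sum((a - b) ** 2 for a, b in zip(x, z)))


def worst_case_bias_sq(x, t, w, kernel):
    """Treated-treated term + control-control term - 2 * cross term.

    x: covariates, t: 0/1 treatment indicators, w: dict control index -> weight.
    """
    T1 = [i for i, ti in enumerate(t) if ti == 1]
    T0 = [i for i, ti in enumerate(t) if ti == 0]
    n1 = len(T1)
    treated_term = sum(kernel(x[i], x[j]) for i in T1 for j in T1) / n1 ** 2
    control_term = sum(w[i] * w[j] * kernel(x[i], x[j]) for i in T0 for j in T0)
    cross_term = sum(w[j] * kernel(x[i], x[j]) for i in T1 for j in T0) / n1
    return treated_term + control_term - 2.0 * cross_term


x = [(0.0, 0.0), (1.0, 1.0), (0.0, 0.0), (1.0, 1.0), (3.0, 3.0)]
t = [1, 1, 0, 0, 0]

exact_match = worst_case_bias_sq(x, t, {2: 0.5, 3: 0.5, 4: 0.0}, gaussian_kernel)
uniform = worst_case_bias_sq(x, t, {2: 1 / 3, 3: 1 / 3, 4: 1 / 3}, gaussian_kernel)
```

The weights that reproduce the treated sample drive the criterion to zero (up to rounding), while uniform weights, which give mass to the distant control at (3, 3), leave it strictly positive.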

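19 Kernel Matching
- General weight kernel matching: $\mathcal{W}_0 = \{ W_{\mathcal{T}_0} \in \mathbb{R}_+^{\mathcal{T}_0} : \sum_{i \in \mathcal{T}_0} W_i = 1 \}$
- Discrete kernel matching with replacement: $\mathcal{W}_0 = \{ W_{\mathcal{T}_0} \in \{0, 1/n_1, 2/n_1, \dots\}^{\mathcal{T}_0} : \sum_{i \in \mathcal{T}_0} W_i = 1 \}$
- Discrete kernel matching without replacement: $\mathcal{W}_0 = \{ W_{\mathcal{T}_0} \in \{0, 1/n_1\}^{\mathcal{T}_0} : \sum_{i \in \mathcal{T}_0} W_i = 1 \}$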
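For the general-weight domain, the problem is a convex quadratic program over the probability simplex. The slides do not say which solver is used; the sketch below swaps in a simple Frank-Wolfe iteration with exact line search, chosen here only because it needs no external library. The kernel parametrization, function names, iteration count, and random toy data are all assumptions.

```python
import math
import random


def gaussian_kernel(x, z, s=1.0):
    # Assumed parametrization: exp(-s^2 ||x - z||^2).
    return math.exp(-(s ** 2) * sum((a - b) ** 2 for a, b in zip(x, z)))


def kernel_weights(x_treated, x_control, kernel, iters=500):
    """Minimize the squared worst-case bias over nonnegative weights summing to one."""
    n1, n0 = len(x_treated), len(x_control)
    K00 = [[kernel(a, b) for b in x_control] for a in x_control]
    # c[j] = average kernel value between control j and the treated sample.
    c = [sum(kernel(a, b) for a in x_treated) / n1 for b in x_control]
    const = sum(kernel(a, b) for a in x_treated for b in x_treated) / n1 ** 2

    def objective(w):
        quad = sum(w[i] * K00[i][j] * w[j] for i in range(n0) for j in range(n0))
        return const + quad - 2.0 * sum(ci * wi for ci, wi in zip(c, w))

    w = [1.0 / n0] * n0                      # start from uniform weights
    for _ in range(iters):
        Kw = [sum(K00[i][j] * w[j] for j in range(n0)) for i in range(n0)]
        grad = [2.0 * Kw[i] - 2.0 * c[i] for i in range(n0)]
        j = min(range(n0), key=grad.__getitem__)   # best simplex vertex
        d = [-wi for wi in w]
        d[j] += 1.0                                # direction e_j - w
        slope = sum(g * di for g, di in zip(grad, d))
        curv = sum(d[i] * K00[i][k] * d[k] for i in range(n0) for k in range(n0))
        if slope >= -1e-12 or curv <= 1e-15:
            break                                  # no descent direction left
        step = min(1.0, -slope / (2.0 * curv))     # exact line search, clipped
        w = [wi + step * di for wi, di in zip(w, d)]
    return w, objective


rng = random.Random(1)
x_treated = [(rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)) for _ in range(5)]
x_control = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(15)]
weights, objective = kernel_weights(x_treated, x_control, gaussian_kernel)
uniform = [1.0 / len(x_control)] * len(x_control)
```

Each update is a convex combination of the current weights and a simplex vertex, so the iterates stay feasible and the objective never increases. The two discrete domains are integer programs and would need a different solver.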

20 Numerics
- Hypothetical observational study
- $X_i \in \mathbb{R}^2$ distributed uniformly on $[-1, 1]^2$
- $T_i \sim \mathrm{Bernoulli}\left(0.8 / (1 + \sqrt{2}\, \|X_i\|_2)\right)$
- $Y_i(0) = f_0(X_i) + \epsilon_i$, $\epsilon_i \sim N(0, 0.1)$
- Various forms for $f_0$
- Measure RMSE
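The data-generating process can be simulated as below. Two points are assumptions rather than facts from the slide: the 0.1 in $N(0, 0.1)$ is treated as a variance, and only $Y_i(0)$ is generated because the slide does not specify $Y_i(1)$. The function names and the example `f0` are also made up for this sketch.

```python
import math
import random


def propensity(x):
    # P(T = 1 | X = x) = 0.8 / (1 + sqrt(2) * ||x||_2)
    return 0.8 / (1.0 + math.sqrt(2.0) * math.hypot(x[0], x[1]))


def simulate(n, f0, seed=0):
    """Draw (x, t, y0) triples from the hypothetical observational study."""
    rng = random.Random(seed)
    noise_sd = math.sqrt(0.1)          # assumption: 0.1 is the noise variance
    sample = []
    for _ in range(n):
        x = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        t = 1 if rng.random() < propensity(x) else 0
        y0 = f0(x) + rng.gauss(0.0, noise_sd)
        sample.append((x, t, y0))
    return sample


# Example f0 chosen for illustration only.
sample = simulate(1000, lambda x: abs(x[0]) + abs(x[1]))
treated_fraction = sum(t for _, t, _ in sample) / len(sample)
```

Treatment is most likely near the origin (probability 0.8) and least likely at the corners of the square, so treated and control covariate distributions differ and unweighted comparisons are confounded.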

21 Numerics: L1 norm
$f_0(x) = |x_1| + |x_2|$
[Figure: RMSE versus $n$ for No matching, CEM, PSM, Exp kernel weight, Exp kernel match, One-to-one, Mahal. means, Quad kernel weight, Gauss kernel weight, Gauss kernel match]

22 Numerics: quadratic
$f_0(x) = (x_1 + x_2) + (x_1 + x_2)^2$
[Figure: RMSE versus $n$ for No matching, CEM, PSM, Exp kernel weight, Exp kernel match, One-to-one, Mahal. means, Quad kernel weight, Gauss kernel weight, Gauss kernel match]

23 Numerics: cubic
$f_0(x) = (x_1 + x_2)^2 + (x_1 + x_2)^3$
[Figure: RMSE versus $n$ for No matching, CEM, PSM, Exp kernel weight, Exp kernel match, One-to-one, Mahal. means, Quad kernel weight, Gauss kernel weight, Gauss kernel match]

24 Numerics: sinusoidal
$f_0(x) = \sin(\pi (x_1 + x_2)) + \cos(\pi (x_1 - x_2))$
[Figure: RMSE versus $n$ for No matching, CEM, PSM, Exp kernel weight, Exp kernel match, One-to-one, Mahal. means, Quad kernel weight, Gauss kernel weight, Gauss kernel match]

25 huh?
- WCBM offers a general framework for matching estimators
- Structure leads to an imbalance metric and to matching methods that minimize imbalance
- This recovers existing matching methods and uncovers their structural underpinnings
- New methods: kernel matching

Causal Inference by Minimizing the Dual Norm of Bias: Kernel Matching & Weighting Estimators for Causal Effects Nathan Kallus School of Operations Research and Information Engineering Cornell University