Robustness to Parametric Assumptions in Missing Data Models


Robustness to Parametric Assumptions in Missing Data Models
Bryan Graham (NYU) and Keisuke Hirano (University of Arizona)
April 2011

Motivation

We consider the classic missing data problem. In practice the covariate dimension is often very high, and conventional asymptotics may be misleading. We consider a finite-sample setting where correct parametric specifications make a big difference, adopting the Angrist and Hahn (2004) setup, where cells can be small or even empty.

Outline

- Missing Data Model
- Stratified Sampling Setup
- Parametric Imputation
- Empirical Bayes
- Double Robustness and EB
- Monte Carlo
- Some work in progress

Missing Data Model (MAR)

- Random sample from a population.
- Observe a covariate X, and if D = 1, observe Y.
- Interested in the population mean of Y: θ = E[Y].
- Assume Y is missing at random (MAR): Y ⊥ D | X.
- We are especially interested in cases where cells are small.

Missing Data Model

Let

  μ(x) = E[Y | X = x]
  σ²(x) = V[Y | X = x]
  e(x) = Pr(D = 1 | X = x), bounded away from 0.

Semiparametric efficiency bound for estimating θ (Hahn, 1998):

  V_B = E[(μ(X) − θ)²] + E[σ²(X)/e(X)].
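With a discrete covariate the bound specializes to V_B = Σ_k p_k[(μ_k − θ)² + σ²_k/e_k]. A minimal numerical sketch (all numbers below are hypothetical, chosen only to illustrate the formula):

```python
import numpy as np

# Hypothetical discrete example of the Hahn (1998) efficiency bound:
# V_B = E[(mu(X) - theta)^2] + E[sigma^2(X) / e(X)].
p = np.array([0.5, 0.3, 0.2])        # cell probabilities p_k
mu = np.array([1.0, 2.0, 4.0])       # cell means mu_k
sigma2 = np.array([1.0, 1.0, 2.0])   # cell variances sigma^2_k
e = np.array([0.8, 0.5, 0.3])        # propensity scores e_k

theta = float(p @ mu)
vb = float(p @ ((mu - theta) ** 2 + sigma2 / e))
print(theta, vb)
```

Note how the second term blows up as any e_k approaches zero, which is exactly the small-cell regime the talk focuses on.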

Missing Data Model

Various efficient estimators have been proposed:
- Hahn (1998)
- Hirano, Imbens, and Ridder (2003)
- Chen, Hong, and Tarozzi (2008)

These typically involve nonparametric estimation of μ(x), e(x), or both, and are typically identical when X is discrete.

Stratified Sampling: Discrete Covariate / Small Cell Model

Following Angrist and Hahn (2004):
- The covariate takes values in {x_1, ..., x_K}.
- M_k individuals in each cell, fixed and known (stratified sampling).
- Let p_k = M_k / Σ_{j=1}^K M_j (the distribution of X).

Stratified Sampling

Propensity score: in cell k, each outcome is observed with probability e_k > 0, so the number of observed outcomes n_k in cell k satisfies

  n_k ~ Binomial(M_k, e_k).

Within cell k,

  Y_{k1}, ..., Y_{k n_k} iid (μ_k, σ²_k).

Estimand:

  θ = Σ_{k=1}^K p_k μ_k.

Poststratification Estimator

Let

  Ȳ_k = (1/n_k) Σ_{i=1}^{n_k} Y_{ki}.

Estimator:

  θ̂_PS = Σ_{k=1}^K p_k Ȳ_k.

May work poorly if n_k is small. For empty cells: drop cells (and adjust the p_k), or combine cells.
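The estimator above can be sketched in a few lines; the cell means, sizes, and shares below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stratified sample: known stratum shares p_k and the
# observed outcomes in each cell (normal draws stand in for real data).
p = np.array([0.25, 0.25, 0.25, 0.25])
true_means = [1.0, 2.0, 3.0, 4.0]
cell_sizes = [10, 8, 12, 9]
cells = [rng.normal(loc=m, size=n) for m, n in zip(true_means, cell_sizes)]

# Poststratification: theta_PS = sum_k p_k * Ybar_k.
ybar = np.array([c.mean() for c in cells])
theta_ps = float(p @ ybar)
print(theta_ps)
```

With these shares the target is θ = Σ p_k μ_k = 2.5, and θ̂_PS lands nearby; with n_k of 1 or 2 per cell the same code would be far noisier, which is the small-cell problem.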

Parametric Imputation

Suppose μ_k = x_k'β, where β is a low-dimensional parameter. Then

  E[Ȳ_k | n_1, ..., n_K] = x_k'β,   V[Ȳ_k | n_1, ..., n_K] = σ²_k / n_k.

Parametric Imputation

Let β̂ = WLS of Ȳ_k on x_k (equivalently, OLS of the underlying y's on x). Parametric imputation estimator:

  θ̂_PI = Σ_{k=1}^K p_k x_k'β̂.

Can handle empty cells, and will work well when the model is correct, but may be sensitive to the parametric specification.
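A minimal sketch of θ̂_PI, including an empty cell to show that imputation handles it (the design matrix, true β, and cell counts are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical design: x_k = (1, k), linear cell means mu_k = x_k' beta.
K = 5
X = np.column_stack([np.ones(K), np.arange(K, dtype=float)])
beta_true = np.array([1.0, 0.5])
p = np.full(K, 1.0 / K)
n = np.array([0, 3, 5, 4, 6])            # note one empty cell (n_0 = 0)

mu = X @ beta_true
noise = rng.normal(scale=1.0, size=K) / np.sqrt(np.maximum(n, 1))
ybar = np.where(n > 0, mu + noise, 0.0)  # Ybar undefined for empty cells

# WLS of cell means on x_k with weights n_k; empty cells get zero weight.
W = np.diag(n.astype(float))
beta_hat = np.linalg.solve(X.T @ W @ X, X.T @ W @ ybar)

# theta_PI = sum_k p_k x_k' beta_hat -- imputes every cell, empty or not.
theta_pi = float(p @ (X @ beta_hat))
print(theta_pi)
```

Here θ = Σ p_k (1 + 0.5k) = 2.0, and the estimator recovers it because the linear mean model is correct; under misspecification it can be badly biased, as the Monte Carlo tables later show.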

Empirical Bayes

Based on Morris (1983) and Chamberlain (2009). Suppose (temporarily) that

  μ_k ~ N(x_k'β, τ²);

note that τ² = 0 corresponds to parametric imputation. Further suppose

  Ȳ_k | μ_k ~ N(μ_k, v_k), where v_k = σ²_k / n_k.

Empirical Bayes

Marginal distribution of the cell means:

  Ȳ_k ~ N(x_k'β, v_k + τ²).

Then the posterior mean of μ_k is

  μ̄_k = (1 − γ_k) Ȳ_k + γ_k x_k'β,   where γ_k = v_k / (v_k + τ²).

Empirical Bayes

Replace β, v_k, and τ² with estimates, form

  γ̂_k = v̂_k / (v̂_k + τ̂²),

and

  θ̂_EB1 = Σ_{k=1}^K p_k [ (1 − γ̂_k) Ȳ_k + γ̂_k x_k'β̂ ].

Small cells: if n_k = 0, set γ̂_k = 1; modify the variance estimate if n_k = 1.
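A sketch of θ̂_EB1, treating σ² and τ² as known for simplicity (in the talk they are estimated; all numerical settings here are invented):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical cells: mu_k = x_k' beta plus a cell effect with variance tau^2.
K, tau2, sigma2 = 20, 0.5, 1.0
X = np.column_stack([np.ones(K), np.linspace(-1, 1, K)])
beta = np.array([0.0, 1.0])
mu = X @ beta + rng.normal(scale=np.sqrt(tau2), size=K)
n = rng.integers(2, 15, size=K)
ybar = mu + rng.normal(scale=np.sqrt(sigma2 / n))
p = np.full(K, 1.0 / K)

# Plug-in pieces: WLS beta_hat, v_k = sigma^2 / n_k, and shrinkage weights
# gamma_k = v_k / (v_k + tau^2).
W = np.diag(n.astype(float))
beta_hat = np.linalg.solve(X.T @ W @ X, X.T @ W @ ybar)
v = sigma2 / n
gamma = v / (v + tau2)

# theta_EB1 = sum_k p_k [ (1 - gamma_k) Ybar_k + gamma_k x_k' beta_hat ]
theta_eb = float(p @ ((1 - gamma) * ybar + gamma * (X @ beta_hat)))
print(theta_eb)
```

Cells with small n_k get γ_k near 1 (pulled toward the regression fit); cells with large n_k get γ_k near 0 (their own mean dominates).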

Empirical Bayes

Qualitative features of EB:
- If all γ̂_k are near 1 (e.g. τ̂² ≈ 0), similar to parametric imputation.
- If all γ̂_k are near 0 (e.g. if all n_k are large), similar to poststratification.
- More generally, the weighting varies by cell based on v_k and τ².
- Similar to a kernel estimator with an adaptive bandwidth.

Double Robustness and EB

Assumption DR1: μ_k = x_k'β for all k.
Assumption DR2: e_k = G(x_k) for all k, where G is a known function.

Double robustness: the estimator is consistent if one or both of DR1, DR2 hold. This gives protection against misspecification of μ_k, provided G(x_k) is correct (and vice versa).

Double Robustness and EB

Bang and Robins (2005): augment the regression with the inverse of the propensity score. Let α_1, α_2 solve

  min_{α_1, α_2} Σ_{k=1}^K p_k e_k E[ (Ȳ_k − x_k'α_1 − G(x_k)^{-1} α_2)² ].   (1)

If DR1, DR2, or both hold, then

  θ = Σ_{k=1}^K p_k [ x_k'α_1 + G(x_k)^{-1} α_2 ].

Double Robustness and EB: Proof

Equivalent minimization problem:

  min_{α_1, α_2} Σ_{k=1}^K p_k e_k ( μ_k − x_k'α_1 − G(x_k)^{-1} α_2 )².   (2)

If DR1 holds, (2) is solved by setting α_1 = β and α_2 = 0. Then

  Σ_{k=1}^K p_k [ x_k'α_1 + G(x_k)^{-1} α_2 ] = Σ_{k=1}^K p_k x_k'β = Σ_{k=1}^K p_k μ_k = θ.

Double Robustness and EB: Proof cont'd

If DR2 holds, the first-order condition for (2) with respect to α_2 implies

  Σ_{k=1}^K (p_k e_k / G(x_k)) ( μ_k − x_k'α_1 − G(x_k)^{-1} α_2 ) = 0.

Then, since e_k = G(x_k), the factor e_k / G(x_k) equals 1, so

  Σ_{k=1}^K p_k μ_k = Σ_{k=1}^K p_k [ x_k'α_1 + G(x_k)^{-1} α_2 ].

Double Robustness and EB

Feasible version: ê_k = n_k / M_k. Then p_k ê_k ∝ n_k, so we can solve

  min_{α_1, α_2} Σ_{k=1}^K n_k ( Ȳ_k − x_k'α_1 − G(x_k)^{-1} α_2 )².

This is WLS of Ȳ_k on (x_k, G(x_k)^{-1}), with weights proportional to n_k. Then

  θ̂_DR = Σ_{k=1}^K p_k [ x_k'α̂_1 + G(x_k)^{-1} α̂_2 ].
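A numerical sketch of the feasible WLS version under the DR2 side of the story: the mean model below is deliberately misspecified (a constant, while the true μ_k is sinusoidal — both inventions for illustration), but G is correct, so θ̂_DR still lands near θ:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical cells with a misspecified mean model but correct, known G.
K = 50
x = np.linspace(-1, 1, K)
mu = np.sin(3 * x)                      # true means: not constant in x
G = np.where(x < 0, 0.75, 0.25)         # known propensity score, e_k = G(x_k)
M = 2000                                # individuals per cell
p = np.full(K, 1.0 / K)
n = rng.binomial(M, G)                  # n_k ~ Binomial(M_k, e_k)
ybar = mu + rng.normal(scale=1.0 / np.sqrt(n))

# Regressors: constant (the misspecified mean model) plus G(x_k)^{-1}.
Z = np.column_stack([np.ones(K), 1.0 / G])
W = np.diag(n.astype(float))
alpha = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ ybar)

theta_dr = float(p @ (Z @ alpha))
theta_true = float(p @ mu)
print(theta_dr, theta_true)
```

Dropping the 1/G column turns this back into (badly biased) parametric imputation with a constant mean, which is the PI column of the misspecified-mean tables below.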

Double Robustness and EB: EB Extension of DR

θ̂_DR is a parametric imputation estimator with an additional regressor, so we can develop a corresponding empirical Bayes extension θ̂_EB2. Triply robust? Caution: DR2 does not imply τ² = 0. One could try to engineer alternative estimators of γ_k (but see the Monte Carlo evidence below).

Monte Carlo Study

- Support of X: {−J, ..., 0, ..., J}.
- Equal-size strata: M_k = M for all k.
- μ_k = x_k'β (implies θ = 0).
- Propensity score: step function, e_k = 0.75 if x_k < 0; overall probability of selection 1/2.
- Y_ki iid N(μ_k, σ²).

Monte Carlo

- Consider various values of J and M with overall sample size 3000.
- Choose σ² so that the (large-sample) efficient estimator should have SE = 0.1.
- Specification for μ_k: correct model and incorrect model (constant mean).
- Propensity score correctly specified.
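One replication of a stylized version of this design can be sketched as follows (the slides do not pin down every constant, so the slope of μ_k, the propensity level above zero, and σ are assumptions chosen to roughly match the description):

```python
import numpy as np

rng = np.random.default_rng(4)

# Stylized design: K = 2J+1 equal strata, linear mean (theta = 0),
# step-function propensity with overall selection near 1/2.
J, M = 12, 120                           # K = 25 cells, N = K * M = 3000
x = np.arange(-J, J + 1, dtype=float)
K = x.size
p = np.full(K, 1.0 / K)
mu = 0.1 * x                             # mu_k = x_k' beta, so theta = 0
e = np.where(x < 0, 0.75, 0.25)          # step propensity (levels assumed)
sigma = 1.0

n = rng.binomial(M, e)
noise = rng.normal(size=K) * sigma / np.sqrt(np.maximum(n, 1))
ybar = np.where(n > 0, mu + noise, np.nan)

# Poststratification, dropping empty cells and renormalizing p_k:
obs = n > 0
theta_ps = float(p[obs] @ ybar[obs] / p[obs].sum())

# Parametric imputation with the (here correct) linear mean model:
X = np.column_stack([np.ones(K), x])
W = np.diag(n.astype(float))
beta_hat = np.linalg.solve(X.T @ W @ X, X.T @ W @ np.where(obs, ybar, 0.0))
theta_pi = float(p @ (X @ beta_hat))
print(theta_ps, theta_pi)
```

Repeating this across many draws (and across J, M combinations) and recording bias and RMSE for each estimator produces tables of the kind reported next.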

Monte Carlo: Bias, Correct Parametric Mean

  K      PS        PI        DR        EB        EBPS
  5     -0.0043   -0.0047   -0.0043   -0.0047   -0.0043
  15     0.0037    0.0021    0.0036    0.0022    0.0037
  25    -0.0014   -0.0025   -0.0021   -0.0025   -0.0020
  75     0.0016   -0.0003    0.0012    0.0002    0.0015
  125    0.0032    0.0037    0.0033    0.0038    0.0034
  375   -0.2471    0.0039    0.0032    0.0027    0.0024

K = number of cells. For K = 375, the median number of empty cells is 18.

Monte Carlo: RMSE, Correct Parametric Mean

  K      PS       PI       DR       EB       EBPS
  5      0.0997   0.0981   0.0996   0.0980   0.0996
  15     0.1023   0.0994   0.1022   0.0995   0.1022
  25     0.0993   0.0960   0.0988   0.0961   0.0989
  75     0.1030   0.0952   0.0982   0.0956   0.0985
  125    0.1018   0.0902   0.0931   0.0926   0.0957
  375    0.2766   0.0950   0.0990   0.0968   0.1004

Monte Carlo: Bias, Misspecified Mean

  K      PS        PI        DR        EB        EBPS
  5     -0.0043   -1.9401   -0.0054   -0.0096   -0.0043
  15     0.0037   -2.2162    0.0059   -0.0145    0.0038
  25    -0.0014   -2.2840   -0.0041   -0.0325   -0.0011
  75     0.0016   -2.3360    0.0017   -0.0886    0.0017
  125    0.0032   -2.3528    0.0045   -0.1394    0.0037
  375   -0.2471   -2.3641    0.0018   -0.6982    0.0007

Monte Carlo: RMSE, Misspecified Parametric Mean

  K      PS       PI       DR       EB       EBPS
  5      0.0997   1.9449   0.1213   0.1001   0.0998
  15     0.1023   2.2201   0.1189   0.1034   0.1023
  25     0.0993   2.2879   0.1196   0.1049   0.0995
  75     0.1030   2.3395   0.1196   0.1360   0.1028
  125    0.1018   2.3563   0.1170   0.1856   0.1016
  375    0.2766   2.3677   0.1186   0.7651   0.1101

Monte Carlo

[Figures omitted from the transcription.]

Monte Carlo: Remarks

- PS is poor when cells are small.
- EB is similar to PI when the parametric mean is correctly specified.
- With a misspecified mean but a correct propensity score: PI is poor, EB is better; DR is good, but EBPS is even better. Shrinking τ̂ all the way to zero would not have helped!
- In general, EB-type estimators should be (nearly) admissible, but make different risk tradeoffs over the parameter space.

Extensions and Ongoing Work

- Continuous covariates: Gaussian process priors.
- Inference? There is some existing literature on EB inference.
- Finite-sample MSE calculations and finite-sample decision theory.
- Asymptotics / limit experiments which preserve small/zero cells?

Extensions and Ongoing Work: Limits of Experiments (very, very preliminary)

Suppose (X_i, D_i, Y_i) are iid with

  X_i ~ P with density p(x) on X,
  Pr(D_i = 1 | X_i = x) = e(x),
  Y_i | D_i, X_i = x ~ F(y | x, π).

The parameters are θ = (e(·), π), and we think of e(x) as close to zero.

Extensions and Ongoing Work: Hellinger Transform

Let E = {F_θ : θ ∈ Θ} be an experiment. Let α = {α_θ : θ ∈ Θ} satisfy:
- α_θ ≥ 0,
- Σ_θ α_θ = 1,
- only a finite number of the α_θ are > 0.

Define

  η(α) = ∫ Π_θ (dF_θ)^{α_θ}.

Extensions and Ongoing Work

Useful properties of the Hellinger transform of E:
- Pointwise convergence of η(α) is equivalent to weak convergence of experiments in the sense of Le Cam. In particular, it implies an asymptotic representation theorem.
- The Hellinger transform of a product experiment is the product of the Hellinger transforms.

Extensions and Ongoing Work

For simplicity, drop the Y component and focus on (X, D). Local parametrization:

  Pr(D_i = 1 | X_i = x) = e(x)/n,   i = 1, ..., n.

Single-observation density:

  f(x, d) = p(x) (e(x)/n)^d (1 − e(x)/n)^{1−d}

with respect to the dominating measure λ(x, d) = λ_d(d) λ_x(x).

Extensions and Ongoing Work

Let α have G non-zero components α_1, ..., α_G. The single-observation Hellinger transform is

  η(α) = ∫ Π_{g=1}^G f_g(x, d)^{α_g} dλ(x, d)
       = ∫ [ Π_g (e_g(x)/n)^{α_g} + Π_g (1 − e_g(x)/n)^{α_g} ] dP(x)
       = E[ Π_g (e_g(X)/n)^{α_g} ] + E[ Π_g (1 − e_g(X)/n)^{α_g} ].

Extensions and Ongoing Work

For n observations (using Σ_g α_g = 1, so Π_g (e_g(X)/n)^{α_g} = (1/n) Π_g e_g(X)^{α_g}):

  η_n(α) = {η(α)}^n
         = { (1/n) E[ Π_g e_g(X)^{α_g} ] + E[ Π_g (1 − e_g(X)/n)^{α_g} ] }^n
         ≈ { 1 + (1/n) ( E[ Π_g e_g(X)^{α_g} ] − Σ_g α_g E[ e_g(X) ] ) }^n
         → exp( E[ Π_g e_g(X)^{α_g} ] − Σ_g α_g E[ e_g(X) ] ).
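The limit can be checked numerically. The sketch below takes the special (and illustrative) case where X is degenerate at a single point, so the outer expectations drop out; the two propensity values and the weights α are arbitrary choices:

```python
import numpy as np

# Check: with X degenerate, eta_n(alpha) =
#   [ (1/n) prod_g e_g^{alpha_g} + prod_g (1 - e_g/n)^{alpha_g} ]^n
# should converge to exp( prod_g e_g^{alpha_g} - sum_g alpha_g e_g ).
e = np.array([0.3, 0.6])        # two experiments' propensity values
alpha = np.array([0.5, 0.5])    # weights, summing to 1

def eta_n(n):
    return float(((np.prod(e ** alpha) / n)
                  + np.prod((1 - e / n) ** alpha)) ** n)

limit = float(np.exp(np.prod(e ** alpha) - alpha @ e))
print(eta_n(10), eta_n(10_000), limit)
```

Already at n = 10 the finite-n transform is close to the limit, and at n = 10,000 they agree to several decimal places, which is consistent with the O(1/n) approximation in the displayed derivation.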

Extensions and Ongoing Work

Now consider a different experiment: Z is a Poisson point process on X with intensity measure ν_e, indexed by e = e(·), where

  dν_e(x) = e(x) p(x) dλ_x(x).

Extensions and Ongoing Work

By results in Le Cam and Yang (2000), the Hellinger transform of this experiment is

  η(α) = exp( ∫ Π_{g=1}^G (dν_{e_g})^{α_g} − Σ_{g=1}^G α_g ∫ dν_{e_g} )
       = exp( E[ Π_g e_g(X)^{α_g} ] − Σ_g α_g E[ e_g(X) ] ).

Hence the Poisson process experiment characterizes what can be done asymptotically in the small-probabilities model. Full model with Y: a Poisson process on X × Y?