Robust Outcome Analysis for Observational Studies Designed Using Propensity Score Matching


Bradley E. Huitema, Western Michigan University
Scott F. Kosten, PharmaNet/i3
Joseph W. McKean, Western Michigan University

The work of Kosten and McKean was partially supported by NIAAA Grant 1R21AA017906-01A1.

In estimating the treatment effect in an observational study, there are likely to be differences between the treatment and control groups on many baseline covariates. If any of these covariates are correlated with the response variable, the difference in sample outcome means is likely to be a biased estimate of the true treatment effect. Propensity score matching can be used to redesign the study so that it provides meaningful comparison groups. After these comparison groups are formed, a choice must be made for the outcome analysis.

Outcome Analyses

The Monte Carlo study by Hill and Reiter (2006) evaluated many outcome analyses, from simple matched-pairs tests to more complicated bootstrap and Hodges-Lehmann type methods.

Results:
Popular methods were inefficient, with (relatively) wide confidence intervals for the treatment effect.
Many of the methods were empirically invalid.
Each method performed poorly under some conditions.
There was no clear winner.

In this work, we introduce new outcome analyses called block-adjusted methods. Our first method is a block-adjusted method based on the least squares (LS) fit. In our Monte Carlo study it performed well under normal error distributions for the response, but it performed poorly under distributions with heavier tails. Our second block-adjusted method is based on a robust rank-based (Wilcoxon) fit. It performed nearly as well as the LS method when the errors have a normal distribution, and it was much more powerful than the LS procedure for the thicker-tailed distributions. In our Monte Carlo study these block-adjusted methods outperformed the methods in the Hill and Reiter study. We also show the results of these methods on a real data set.

Notation and Models

Each treated subject is matched to its closest control subject. The matching is done with replacement, so a control subject may be matched with more than one treated subject.

$n_t$ = number of Treated subjects.
$n_{uc}$ = number of unique Control subjects in the matching; so $n_{uc}$ = number of blocks.
$s_i$ = length of block $i$.
$N = \sum_{i=1}^{n_{uc}} s_i$ = total sample size.
$Y$ is the $N \times 1$ vector of all responses.
$X_C$ is the $N \times p$ matrix of covariates.
$I_i$ is the $N \times 1$ indicator vector for the $i$th block.
$T$ is the $N \times 1$ indicator vector for treatment.
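As an illustration of the matching step, here is a minimal Python sketch. This is our own illustrative code, not the implementation used in the study; the function name is ours, and the propensity model here is an ordinary logistic regression from scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def match_with_replacement(X_treat, X_ctrl):
    """Match each treated subject to the control with the closest
    estimated propensity score. Matching is with replacement, so the
    same control index may be returned for several treated subjects."""
    X = np.vstack([X_treat, X_ctrl])
    z = np.r_[np.ones(len(X_treat)), np.zeros(len(X_ctrl))]
    # Propensity score: estimated P(treated | covariates).
    ps = LogisticRegression().fit(X, z).predict_proba(X)[:, 1]
    ps_t, ps_c = ps[:len(X_treat)], ps[len(X_treat):]
    return np.array([np.argmin(np.abs(ps_c - p)) for p in ps_t])
```

The unique indices in the returned array are the $n_{uc}$ matched controls, i.e., the blocks.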

Let $\theta$ denote the treatment effect, i.e., the regression coefficient corresponding to the treatment indicator $T$.

Hypotheses: $H_0: \theta = 0$ versus $H_A: \theta \neq 0$. (1)

Models

There is controversy over whether or not the model should include the covariates. This is a point of investigation in our study. The models are:

Design matrix with covariates in: $X = [\,\mathbf{1}_N \; T \; I_2 \cdots I_{n_{uc}} \; X_C\,]$. The model is $Y = X\beta + e$. (2)

Design matrix without covariates: $X^* = [\,\mathbf{1}_N \; T \; I_2 \cdots I_{n_{uc}}\,]$. The model is $Y = X^*\beta^* + e$. (3)
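To make the block structure concrete, the following sketch assembles $Y$ and the design matrix from the matches. The layout and names are our own assumptions; only the column order $[\mathbf{1}_N, T, I_2, \ldots, I_{n_{uc}}, X_C]$ follows models (2) and (3).

```python
import numpy as np

def block_design(y_treat, y_ctrl, X_treat, X_ctrl, match_idx, covariates=True):
    """Build Y and X = [1_N, T, I_2, ..., I_nuc, X_C] as in models (2)/(3).

    match_idx[i] is the index (into the control arrays) of the control
    matched to treated subject i; blocks are the unique matched controls."""
    blocks = np.unique(match_idx)                  # the n_uc unique controls
    block_of = {c: k for k, c in enumerate(blocks)}
    # One row per unique control, then one row per treated subject.
    y = np.concatenate([y_ctrl[blocks], y_treat])
    T = np.concatenate([np.zeros(len(blocks)), np.ones(len(y_treat))])
    labels = np.concatenate([np.arange(len(blocks)),
                             [block_of[c] for c in match_idx]]).astype(int)
    N = len(y)
    I = np.zeros((N, len(blocks)))
    I[np.arange(N), labels] = 1.0                  # block indicator columns
    cols = [np.ones((N, 1)), T[:, None], I[:, 1:]] # drop I_1: absorbed by intercept
    if covariates:
        cols.append(np.vstack([X_ctrl[blocks], X_treat]))
    return y, np.hstack(cols)
```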

LS Block-Adjusted Methods l and L

LS Block-Adjusted Method l. Obtain the LS estimate of $\beta$ using the model with covariates,

$\hat{\beta}_{LS} = \operatorname{Argmin} \sum_{j=1}^{N} [Y_j - \mathbf{x}_j'\beta]^2$. (4)

The estimate of $\theta$ is $\hat{\theta}_{LS} = \hat{\beta}_{LS,2}$. The usual CI for $\theta$ is

$\hat{\theta}_{LS} \pm z_{\alpha/2}\, \hat{\sigma} \sqrt{(X'X)^{-1}_{22}}$, (5)

where $(X'X)^{-1}_{22}$ is the second diagonal element of $(X'X)^{-1}$ and $\hat{\sigma}^2$ is the usual MSE. The test of $H_0$ for the l method is: Reject $H_0$ if 0 is not in the confidence interval (5).

LS Block-Adjusted Method L: Same as l, but use the design matrix $X^*$.
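A minimal sketch of the l/L computation, assuming y and X come from a block-design construction such as the one above (illustrative code only):

```python
import numpy as np
from scipy import stats

def ls_block_ci(y, X, alpha=0.05):
    """Interval (5): theta_hat +/- z_{alpha/2} * sigma_hat * sqrt((X'X)^{-1}_{22}),
    where theta is the coefficient on T, the second column of X."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])   # the usual MSE
    theta = beta[1]
    half = stats.norm.ppf(1 - alpha / 2) * np.sqrt(sigma2 * XtX_inv[1, 1])
    return theta, (theta - half, theta + half)       # reject H0 iff 0 is outside
```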

Wilcoxon Block-Adjusted Methods w and W

Instead of the Euclidean ($L_2$) norm, the Wilcoxon procedures use the norm based on the dispersion function $D(\beta)$:

$D(\beta) = \sum_{j=1}^{N} \left[ R(Y_j - \mathbf{x}_j'\beta) - \frac{N+1}{2} \right] (Y_j - \mathbf{x}_j'\beta)$, (6)

where $R(Y_j - \mathbf{x}_j'\beta)$ denotes the rank of $Y_j - \mathbf{x}_j'\beta$. The R-estimate of $\beta$ minimizes this dispersion function; see Kloke and McKean (2012) for R software. $D(\beta)$ is invariant to the intercept parameter, so the intercept is estimated separately, usually by the median of the residuals. This estimate was proposed by Jaeckel (1972) and is discussed in detail in Chapters 3-5 of Hettmansperger and McKean (2011). The estimate is highly efficient, attaining an efficiency of 0.955 relative to LS for normal errors, and it is more efficient for error distributions with tails thicker than the normal. A simple weighting scheme (the HBR estimator) can attain up to a 50% breakdown point.
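The dispersion function (6) is easy to write down directly. The sketch below minimizes it numerically with a general-purpose optimizer; this is purely illustrative (Nelder-Mead is a crude choice for this convex, piecewise-linear objective), whereas the Rfit package of Kloke and McKean (2012) provides a proper implementation, including the $\hat{\tau}$ needed for interval (8). Pass the design without its leading column of ones, since $D(\beta)$ is invariant to the intercept.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import rankdata

def dispersion(beta, X, y):
    """Jaeckel's dispersion (6): sum_j [R(e_j) - (N+1)/2] e_j, e = y - X beta."""
    e = y - X @ beta
    return np.sum((rankdata(e) - (len(y) + 1) / 2) * e)

def wilcoxon_fit(X, y):
    """Minimize D(beta) over the non-intercept coefficients, then estimate
    the intercept by the median of the residuals (illustrative sketch)."""
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]   # LS starting values
    res = minimize(dispersion, beta0, args=(X, y), method="Nelder-Mead")
    intercept = np.median(y - X @ res.x)
    return intercept, res.x
```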

Wilcoxon Block-Adjusted Method w

The Wilcoxon estimate is

$\hat{\beta}_W = \operatorname{Argmin} D(\beta)$. (7)

The Wilcoxon estimate of $\theta$ is $\hat{\theta}_W = \hat{\beta}_{W,2}$. The usual CI for $\theta$ is

$\hat{\theta}_W \pm z_{\alpha/2}\, \hat{\tau} \sqrt{(X'X)^{-1}_{22}}$, (8)

where $(X'X)^{-1}_{22}$ is the second diagonal element of $(X'X)^{-1}$ and the scale parameter $\tau$ is estimated as discussed in Kosten et al. (2012). The test of $H_0$ for the w method is: Reject $H_0$ if 0 is not in the confidence interval (8).

Wilcoxon Block-Adjusted Method W: Same as w, but use the design matrix $X^*$ (without covariates).

Monte Carlo Investigation

Our study is similar to that of Hill and Reiter (2006). Their study, however, included only normal errors, whereas, in addition to the normal, we added several heavier-tailed error distributions to investigate the robustness of the methods. For all situations:

A single treatment and a control plus two covariates were employed.
Matching based on the propensity scores is done with replacement.
For most situations, about 150 treated subjects and 350 control subjects were generated.

The basic response surface is

$Y = \theta c + x_1 + 2 x_2 + e$, (9)

where $c$ is either 0 or 1 depending on whether $Y$ is a control or a treated response, $x_1$ and $x_2$ are continuous covariates, and $e$ is the error term. Hence, the parameter $\theta$ is the treatment effect. For the study, we set $\theta = 4$, the same value used by Hill and Reiter (2006).

Misspecified Response Surfaces

Besides this response surface, Hill and Reiter (2006) considered two other response surfaces for which the fitted model is misspecified. Our fully adjusted procedures l and w, however, are based on full model fits. It is thus easy to use their associated residual analyses to diagnose misspecified models and, hence, to ultimately fit more appropriate models. We show this later in an example.

Two main factors of the study:

Error Distributions. A normal distribution; a contaminated normal distribution with the contamination proportion at 20% and the ratio of standard deviations (contaminated to good) set at 4; and a Cauchy distribution.

Degree of overlap between treated and control subjects.

a. Strong Overlap (SO). All covariates (for both Treated and Control) are drawn iid from N(1, 1). This allows for very close matches.

b. Moderate Overlap (MO). Covariates for Treated subjects are drawn as in SO, while only 150 covariates for Control are drawn this way. The remaining Control covariates are iid N(3, 1) (these are called distracters).

c. Weak Overlap (WO). Similar to MO, except now only 50 Control covariates are drawn from N(1, 1) while 300 are drawn from N(3, 1).

d. Uneven Overlap (UO). The probability of the covariates being assigned to Treatment or Control depends on the region into which the covariates fall, so there is good matching in some regions but poor matching in others; see Kosten et al. (2012).
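To make the simulation design concrete, here is a sketch of one data-generating step under response surface (9) and the overlap settings above. This is our own illustrative code; the UO setting is omitted since its region-dependent assignment is specified in Kosten et al. (2012).

```python
import numpy as np

rng = np.random.default_rng(1)

def gen_data(n_t=150, n_c=350, theta=4.0, overlap="SO", errors="normal"):
    """One simulated data set: Y = theta*c + x1 + 2*x2 + e, per (9)."""
    Xt = rng.normal(1, 1, (n_t, 2))
    if overlap == "SO":                        # strong: all controls match well
        Xc = rng.normal(1, 1, (n_c, 2))
    elif overlap == "MO":                      # moderate: 150 good, rest distracters
        Xc = np.vstack([rng.normal(1, 1, (150, 2)),
                        rng.normal(3, 1, (n_c - 150, 2))])
    else:                                      # "WO", weak: only 50 good controls
        Xc = np.vstack([rng.normal(1, 1, (50, 2)),
                        rng.normal(3, 1, (n_c - 50, 2))])
    def err(n):
        if errors == "normal":
            return rng.normal(0, 1, n)
        if errors == "cn":                     # contaminated normal: 20%, sd ratio 4
            bad = rng.random(n) < 0.20
            return np.where(bad, rng.normal(0, 4, n), rng.normal(0, 1, n))
        return rng.standard_cauchy(n)          # Cauchy
    y_t = theta + Xt @ [1.0, 2.0] + err(n_t)
    y_c = Xc @ [1.0, 2.0] + err(n_c)
    return Xt, Xc, y_t, y_c
```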

Methods Investigated

More detailed descriptions of these methods can be found in the manuscript by Kosten et al. (2012).

a. Two LS block-adjusted methods, l and L.

b. Two Wilcoxon block-adjusted methods, w and W.

c. Matched Pairs Method (M). This is the usual paired t. Weighted Two-Sample Method (T): the same estimator as M, i.e., the paired mean difference, but the variance is weighted to account for the number of treated subjects a control is matched to.

d. Weighted LS Methods (s, S). A weighted LS fit with weights similar to the last method. The s method uses the design matrix $[\mathbf{1}_N \; T \; X_C]$. The S method uses the design matrix $[\mathbf{1}_N \; T]$.

e. Robust Sandwich Variance Methods (r, R). This estimate is described by Huber (1967). A diagonal matrix based on the residuals and weights is used as the sandwich. The method r uses the design matrix with the covariates in, while R uses the one without covariates.

f. Bootstrap Methods (d, D, b, B). For each bootstrap, the Treated and Control groups were each resampled (with replacement), then the matches were obtained with replacement and the selected outcome method was used to estimate the effect. The number of bootstraps was set at B = 1000. The d-methods use the variance of the estimates over the bootstrap resamples: the method d is based on the weighted LS fit with the design including the covariates, and the method D is based on the weighted LS fit with the design excluding the covariates. Methods b and B are similar, but the bootstrap percentile confidence interval is used.
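A sketch of the percentile bootstrap (b, B) loop. Here `estimate` is an assumed user-supplied function that performs the matching on the resampled groups and returns $\hat{\theta}$; the rest is illustrative.

```python
import numpy as np

def bootstrap_percentile_ci(X_t, X_c, y_t, y_c, estimate,
                            B=1000, alpha=0.05, rng=None):
    """Resample Treated and Control separately, re-match and re-estimate
    inside `estimate`, then take the percentile interval of the estimates."""
    rng = rng or np.random.default_rng()
    ests = np.empty(B)
    for i in range(B):
        it = rng.integers(0, len(y_t), len(y_t))   # bootstrap the treated group
        ic = rng.integers(0, len(y_c), len(y_c))   # bootstrap the control pool
        ests[i] = estimate(X_t[it], X_c[ic], y_t[it], y_c[ic])
    return np.quantile(ests, [alpha / 2, 1 - alpha / 2])
```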

Table: Methods in the Simulation.

Method                                                   Label
Matched Pairs                                            M
Weighted Two-Sample                                      T
WLS (Non-Covariate Adjusted)                             S
WLS (Covariate Adjusted)                                 s
WLS Robust Sandwich Variance (Non-Covariate Adjusted)    R
WLS Robust Sandwich Variance (Covariate Adjusted)        r
Bootstrap - Variance (Non-Covariate Adjusted)            D
Bootstrap - Variance (Covariate Adjusted)                d
Bootstrap - Percentile (Non-Covariate Adjusted)          B
Bootstrap - Percentile (Covariate Adjusted)              b
Hodges-Lehmann Aligned Rank (Non-Covariate Adjusted)     H
Hodges-Lehmann Aligned Rank (Covariate Adjusted)         h
Block-Adjusted Least Squares (Non-Covariate Adjusted)    L
Block-Adjusted Least Squares (Covariate Adjusted)        l
Block-Adjusted Wilcoxon (Non-Covariate Adjusted)         W
Block-Adjusted Wilcoxon (Covariate Adjusted)             w

Results of the Monte Carlo Study

There are 12 situations: 3 distributions × 4 degrees of overlap. For each situation, 10,000 simulations were run. For each method, its outcome analysis is based on its confidence interval for the effect. The nominal confidence was set at 95%.

Validity

For each method, its validity is based on the method's empirical coverage of its confidence interval. We deemed a method to be valid for a situation if its empirical confidence is between 93% and 97%. On this basis:

Results of the Validity Study

None of the bootstrap procedures were valid; they were far too conservative.
The Weighted Two-Sample (T) method is conservative for 7 situations.
The Sandwich Two-Sample (R) method is conservative for 8 situations.
The Weighted LS (S) method is conservative for 4 situations and liberal for 4 situations.
The Weighted LS (s) method is liberal for 8 situations.
As attested by their empirical coverages in the following tables, the remaining procedures (r, H, h, l, L, W, w) are valid for almost all of the situations.

Efficiency

The efficiency for two procedures is the ratio of the mean lengths of their confidence intervals. This measure was obtained for each situation and is tabled by degree of overlap.
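For reference, the two quantities reported in the tables, empirical coverage and mean interval length, can be computed as follows (illustrative helper; the efficiency of a method versus l is the mean length for l divided by the mean length for the method):

```python
import numpy as np

def coverage_and_mean_length(cis, theta=4.0):
    """Empirical coverage and mean CI length from an (nsim, 2) array
    of simulated confidence intervals."""
    cis = np.asarray(cis)
    covered = (cis[:, 0] <= theta) & (theta <= cis[:, 1])
    return covered.mean(), (cis[:, 1] - cis[:, 0]).mean()
```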

For each procedure at each situation, the tabled ratio is the mean length of the l (LS block-adjusted, covariates in the design matrix) confidence interval to the mean length of the procedure's confidence interval. Thus ratios greater than 1 mean that the procedure is more efficient than the l procedure.

Table: Simulation results for the strong overlap (SO) setting over all error distributions.

                Normal          CN              Cauchy
Meth     Eff    Conf     Eff    Conf     Eff     Conf
r        0.97   0.944    0.98   0.951     1.09   0.978
H        0.54   0.982    1.88   0.961    42.27   0.963
h        0.97   0.948    2.65   0.952    15.94   0.961
L        0.56   0.982    0.98   0.953     1.01   0.975
l        1.00   0.946    1.00   0.949     1.00   0.974
W        0.53   0.982    2.09   0.962    45.66   0.958
w        0.96   0.947    3.18   0.958    62.16   0.942

Table: Simulation results for the moderate overlap (MO) setting over all error distributions.

                Normal          CN              Cauchy
Meth     Eff    Conf     Eff    Conf     Eff     Conf
r        0.95   0.942    0.96   0.946     1.13   0.978
H        0.77   0.972    2.24   0.954    31.99   0.954
h        0.97   0.949    2.44   0.949    13.04   0.962
L        0.80   0.971    1.00   0.948     1.00   0.970
l        1.00   0.948    1.00   0.948     1.00   0.968
W        0.76   0.969    2.90   0.953    38.22   0.944
w        0.96   0.949    3.41   0.946    43.02   0.933

Table: Simulation results for the weak overlap (WO) setting over all error distributions.

                Normal          CN              Cauchy
Meth     Eff    Conf     Eff    Conf     Eff     Conf
r        0.91   0.933    0.94   0.946     1.18   0.976
H        0.78   0.971    2.06   0.950    23.29   0.953
h        0.97   0.952    2.19   0.946    11.04   0.961
L        0.80   0.970    1.00   0.946     1.00   0.954
l        1.00   0.951    1.00   0.946     1.00   0.954
W        0.75   0.974    3.10   0.941    30.36   0.940
w        0.94   0.954    3.66   0.936    34.40   0.927

Table: Simulation results for the uneven overlap (UO) setting over all error distributions.

                Normal          CN              Cauchy
Meth     Eff    Conf     Eff    Conf     Eff     Conf
r        0.89   0.932    0.93   0.948     1.13   0.976
H        0.75   0.969    2.14   0.953    23.30   0.958
h        0.97   0.948    2.33   0.948     9.72   0.963
L        0.81   0.963    1.00   0.950     0.99   0.960
l        1.00   0.944    1.00   0.948     1.00   0.960
W        0.75   0.968    2.92   0.950    28.51   0.944
w        0.94   0.950    3.43   0.943    32.14   0.935

Conclusions on Efficiency

The r (robust sandwich) procedure is inefficient; even the l procedure dominates it for the contaminated normal situations.
Between h and H, the H procedure is more efficient for the Cauchy situations. Evidently, in the presence of heavy-tailed error distributions, the less LS estimation the better.
The W procedure clearly dominates the H procedure.
The w procedure, however, is definitely superior to all of the other procedures over the nonnormal situations. It is clearly the most robust procedure in the study. Furthermore, for the normal situations, its efficiency is only slightly less than that of h and l.

Learning to Learn (LTL) Data Set

The purpose of the study was to investigate the effect of the LTL program on grade point average (GPA) and graduation rate. The study was conducted from the fall semester of 1987 to the winter semester of 1996. A total of 310 students participated in the LTL program; the study also collected data on 30,025 control subjects.

Response: the student's last known college GPA.

Covariates: gender; race (Caucasian or otherwise); age; Alpha program participation (1 = yes, 0 = no); overall ACT score; English ACT score; Math ACT score; Reading ACT score; Science ACT score; high school GPA; and entry year into the study.

One-to-one matching with replacement using propensity scores resulted in 294 matched controls. The next table shows how different the treatment group is from the full control group and how close the matched groups are:

Table: Baseline Covariates for the Treated (LTL), Full Control, and Matched Control Groups

Variable     Treatment   All Controls   Matched Controls
No. Obs.     310         30025          294
Gender       0.516       0.560          0.544
Race         0.635       0.899          0.619
Entry Age    18.474      18.470         18.306
Alpha        0.087       0.052          0.102
ACT          18.326      21.081         18.500
Eng. ACT     18.206      20.421         18.422
Math ACT     16.684      20.400         16.755
Read ACT     17.635      20.326         17.830
Sci. ACT     19.990      22.564         20.102
HS GPA       2.637       3.059          2.657
Entry        88.729      88.754         88.759

Results of the Valid Methods for the Treatment Effect

Table: Point Estimates and Confidence Intervals for the Seven Valid Methods

Method   Estimate   95% CI
r        0.0820     (-0.0164, 0.1805)
H        0.0000     (-0.0319, 0.0627)
h        0.0115     (-0.0514, 0.0229)
L        0.0584     ( 0.0014, 0.1153)
l        0.0570     ( 0.0034, 0.1106)
W        0.0000     (-0.0055, 0.0055)
w        0.0000     (-0.0040, 0.0040)

All methods except the LS-based L and l conclude that the LTL treatment was ineffective at changing GPA.

The discrepancy between the LS-based and the robust Wilcoxon-based methods was investigated by a robust diagnostic analysis based on the w-fit; see McKean and Sheather (2009). Numerous outliers in the data led to the difference between the l and w procedures. Further, outliers were detected in factor (covariate) space. Hence, a robust high breakdown (HBR) fit was used. The high breakdown estimate of the effect is 0.0602, with confidence interval (-0.0368, 0.1572), which confirms the w analysis. In summary, based on our diagnostic analysis, it appears that the outliers in both the response and factor spaces impaired the l analysis. Also, in light of the high breakdown estimate and confidence interval for the effect, it appears that the analysis obtained from the w procedure is valid.

Conclusion

Overall, method w has been shown to be the most versatile method for constructing a confidence interval for a treatment effect in an observational study. Method w was the most efficient estimator for heavy-tailed error distributions, and it was very close to the most efficient when the errors were normal, while still providing near-nominal coverage of the true treatment effect. Further, method w lends itself to a robust diagnostic residual analysis, which checks the quality of the fit and identifies outliers in both response and covariate space. The nonrobust l method is a desirable approach in the context of well-behaved error distributions.