Robust Outcome Analysis for Observational Studies Designed Using Propensity Score Matching


1 Bradley E. Huitema, Western Michigan University
Scott F. Kosten, PharmaNet/i3
Joseph W. McKean, Western Michigan University

The work of Kosten and McKean was partially supported by NIAAA Grant 1R21AA A1.

2 In estimating the treatment effect in an observational study, the treatment and control groups are likely to differ on many baseline covariates. If any of these covariates are correlated with the response variable, the difference in sample outcome means is likely to be a biased estimate of the true treatment effect. Propensity score matching can be used to redesign the study so that it provides meaningful comparison groups. After these comparison groups are formed, a choice must be made for the outcome analysis.

3 Outcome Analyses
The Monte Carlo study by Hill and Reiter (2006) evaluated many outcome analyses, from simple matched-pair tests to more complicated bootstrap and Hodges-Lehmann type methods.
Results:
Popular methods were inefficient, with (relatively) wide confidence intervals for the treatment effect.
Many of the methods were empirically invalid.
Each method performed poorly under some conditions.
There was no clear winner.

4 In this work, we introduce new outcome analyses called block-adjusted methods. The first is a block-adjusted method based on the least-squares (LS) fit. In our Monte Carlo study it performed well under normal error distributions for the response, but it performed poorly under distributions with heavier tails. The second is a block-adjusted method based on a robust rank-based (Wilcoxon) fit. It performed nearly as well as the LS fit when the errors have a normal distribution, and it was much more powerful than the LS procedure for the thicker-tailed distributions. In our Monte Carlo study these block-adjusted methods outperformed the methods in the Hill and Reiter study. We also show the results of these methods on a real data set.

5 Notation and Models
Each treated subject is matched to its closest control subject. The matching is done with replacement, so a control subject may be matched with more than one treated subject.
$n_t$ = number of treated subjects.
$n_{uc}$ = number of unique control subjects in the matching; so $n_{uc}$ = number of blocks.
$s_i$ = length of block $i$.
$N = \sum_{i=1}^{n_{uc}} s_i$ = total sample size.
$Y$ is the $N \times 1$ vector of all responses.
$X_c$ is the $N \times p$ matrix of covariates.
$I_i$ is the $N \times 1$ indicator vector for the $i$th block.
$T$ is the $N \times 1$ indicator vector for treatment.
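The notation above can be illustrated with a minimal Python sketch. It assumes the propensity scores have already been estimated and that "closest" means the smallest absolute difference in propensity score. It also assumes each block consists of one unique control plus the treated subjects matched to it. The function names and toy scores are hypothetical.

```python
import numpy as np

def match_with_replacement(ps_treated, ps_control):
    """Index of the closest control (by propensity score) for each treated subject."""
    ps_treated = np.asarray(ps_treated, dtype=float)
    ps_control = np.asarray(ps_control, dtype=float)
    # Absolute score difference for every treated/control pair.
    dist = np.abs(ps_treated[:, None] - ps_control[None, :])
    return dist.argmin(axis=1)  # a control may be picked more than once

def block_summary(matched_control):
    """Block bookkeeping: n_t, n_uc, block lengths s_i, and total sample size N."""
    unique_controls, counts = np.unique(matched_control, return_counts=True)
    s = counts + 1  # assumed block: one control plus its matched treated subjects
    return {
        "n_t": len(matched_control),
        "n_uc": len(unique_controls),
        "s": s,
        "N": int(s.sum()),
    }

# Toy propensity scores (hypothetical values).
ps_t = [0.30, 0.32, 0.70, 0.71, 0.69]
ps_c = [0.10, 0.31, 0.50, 0.72, 0.90]
matched = match_with_replacement(ps_t, ps_c)
info = block_summary(matched)
print(matched, info["n_uc"], info["s"], info["N"])
```

Under this reading of a block, the total sample size always satisfies $N = n_t + n_{uc}$, since every treated subject appears once and every matched control appears once.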

6 Let $\theta$ denote the treatment effect (the regression coefficient corresponding to the treatment indicator $T$).
Hypotheses:
$$H_0: \theta = 0 \quad \text{versus} \quad H_A: \theta \neq 0. \qquad (1)$$
Models
There is controversy over whether or not the model should include the covariates. This is a point of investigation in our study. The models are:
Design matrix with covariates: $X = [1_N \;\; T \;\; I_2 \cdots I_{n_{uc}} \;\; X_c]$. The model is
$$Y = X\beta + e. \qquad (2)$$
Design matrix without covariates: $X = [1_N \;\; T \;\; I_2 \cdots I_{n_{uc}}]$. The model is
$$Y = X\beta + e. \qquad (3)$$

7 LS Block Adjusted Methods l and L
LS Block Adjusted Method l: Obtain the LS estimate of $\beta$ using the model with covariates,
$$\hat\beta_{LS} = \operatorname{Argmin} \sum_{j=1}^{N} [Y_j - x_j'\beta]^2. \qquad (4)$$
The estimate of $\theta$ is $\hat\theta_{LS} = \hat\beta_{LS,2}$. The usual CI for $\theta$ is
$$\hat\theta_{LS} \pm z_{\alpha/2}\, \hat\sigma \sqrt{(X'X)^{-1}_{22}}, \qquad (5)$$
where $(X'X)^{-1}_{22}$ is the second diagonal element of $(X'X)^{-1}$ and $\hat\sigma^2$ is the usual MSE. The test of $H_0$ for the l method is: reject $H_0$ if 0 is not in the confidence interval (5).
LS Block Adjusted Method L: Same as l, but use the design matrix without covariates, model (3).
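A minimal numerical sketch of the l method follows. It is an illustration under assumptions and not the authors' software: the data are simulated, the block labels run from 0 to $n_{uc}-1$, the 95% normal quantile is hard-coded, and the function names are hypothetical.

```python
import numpy as np

def block_design(treat, block, Xc=None):
    """Design matrix [1_N, T, I_2, ..., I_nuc, (X_c)] from treatment and block labels."""
    treat = np.asarray(treat, dtype=float)
    block = np.asarray(block)
    labels = np.unique(block)
    cols = [np.ones_like(treat), treat]
    # The first block's indicator is omitted; it is absorbed by the intercept.
    cols += [(block == b).astype(float) for b in labels[1:]]
    X = np.column_stack(cols)
    if Xc is not None:
        X = np.column_stack([X, np.asarray(Xc, dtype=float)])
    return X

def ls_block_adjusted(y, X, z=1.959964):
    """LS estimate of theta (second coefficient) and its z-based interval, as in (5)."""
    y = np.asarray(y, dtype=float)
    N, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (N - k)       # usual MSE
    se = np.sqrt(sigma2 * XtX_inv[1, 1])   # second diagonal element of (X'X)^{-1}
    theta = beta[1]
    return theta, (theta - z * se, theta + z * se)

# Simulated blocks (hypothetical data): one control followed by 1 to 3 treated subjects.
rng = np.random.default_rng(0)
n_uc = 30
sizes = rng.integers(2, 5, size=n_uc)            # block lengths s_i in {2, 3, 4}
block = np.repeat(np.arange(n_uc), sizes)
treat = np.ones(block.size)
treat[np.cumsum(sizes) - sizes] = 0.0            # first subject in each block is the control
Xc = rng.normal(1.0, 1.0, size=(block.size, 2))
y = 4.0 * treat + Xc[:, 0] + 2.0 * Xc[:, 1] + rng.normal(0.0, 0.5, block.size)

X = block_design(treat, block, Xc)
theta_hat, ci = ls_block_adjusted(y, X)
print(theta_hat, ci)
```

Calling `block_design(treat, block)` without `Xc` gives the design matrix of model (3), so the same function covers the L method.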

8 Wilcoxon Block Adjusted Methods w and W
Instead of the Euclidean norm $\|\cdot\|_2$, the Wilcoxon procedures use the norm based on the dispersion function $D(\beta)$:
$$D(\beta) = \sum_{j=1}^{N} \left[ R(Y_j - x_j'\beta) - \frac{N+1}{2} \right] (Y_j - x_j'\beta), \qquad (6)$$
where $R(Y_j - x_j'\beta)$ denotes the rank of $Y_j - x_j'\beta$ among the residuals. The R-estimate of $\beta$ minimizes this dispersion function; see Kloke and McKean (2012) for R software. $D(\beta)$ is invariant to the intercept parameter, so the intercept is estimated separately, usually by the median of the residuals. The R-estimate was proposed by Jaeckel (1972) and is discussed in detail in Chapters 3-5 of Hettmansperger and McKean (2011). The estimate is highly efficient, attaining the efficiency of relative to the LS for normal errors and is more efficient for error distributions with thicker tails than the normal. A simple weighting scheme (HBR estimator) can attain up to a 50% breakdown point.

9 Wilcoxon Block Adjusted Method w
The Wilcoxon estimate is
$$\hat\beta_{W} = \operatorname{Argmin} D(\beta). \qquad (7)$$
The Wilcoxon estimate of $\theta$ is $\hat\theta_{W} = \hat\beta_{W,2}$. The usual CI for $\theta$ is
$$\hat\theta_{W} \pm z_{\alpha/2}\, \hat\tau \sqrt{(X'X)^{-1}_{22}}, \qquad (8)$$
where $(X'X)^{-1}_{22}$ is the second diagonal element of $(X'X)^{-1}$ and the scale parameter $\tau$ is estimated as discussed in Kosten et al. (2012). The test of $H_0$ for the w method is: reject $H_0$ if 0 is not in the confidence interval (8).
Wilcoxon Block Adjusted Method W: Same as w, but use the design matrix without covariates, model (3).
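To make the dispersion function concrete, the sketch below evaluates $D(\beta)$ on a small two-sample data set and minimizes it over a grid of treatment effects. It is only an illustration under assumptions: the data are made up, there are no blocks or covariates, and a grid search stands in for the fitting algorithms in the R software cited above. In this special case the minimizer is known to be the Hodges-Lehmann shift estimate, the median of all treated-minus-control differences, which the sketch also computes for comparison.

```python
import numpy as np

def ranks(v):
    """Ranks 1..N of the entries of v (ties broken by position)."""
    r = np.empty(len(v))
    r[np.argsort(v, kind="stable")] = np.arange(1, len(v) + 1)
    return r

def dispersion(beta, y, X):
    """Wilcoxon dispersion: sum of [R(e_j) - (N+1)/2] * e_j over residuals e = y - X beta."""
    e = y - X @ beta
    return float(np.sum((ranks(e) - (len(e) + 1) / 2.0) * e))

# Hypothetical two-sample data with one gross outlier among the treated.
ctrl = np.array([0.1, 0.5, 1.2])
trt = np.array([4.0, 4.6, 30.0])
y = np.concatenate([ctrl, trt])
T = np.concatenate([np.zeros(3), np.ones(3)])
X = np.column_stack([np.ones(6), T])

# D does not depend on the intercept: shifting it leaves D unchanged.
d0 = dispersion(np.array([0.0, 4.0]), y, X)
d5 = dispersion(np.array([5.0, 4.0]), y, X)

# Grid search over theta (intercept fixed at 0, which is harmless by invariance).
grid = np.arange(0.0, 35.0, 0.01)
d_vals = [dispersion(np.array([0.0, th]), y, X) for th in grid]
theta_w = grid[int(np.argmin(d_vals))]

hodges_lehmann = np.median(trt[:, None] - ctrl[None, :])
mean_diff = trt.mean() - ctrl.mean()   # the LS estimate in this two-sample design
print(theta_w, hodges_lehmann, mean_diff)
```

Here the rank-based estimate stays near 4.1, while the mean difference is pulled to about 12.27 by the single outlier. This is the kind of behaviour that separates the w and l procedures under heavy-tailed errors.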

10 Monte Carlo Investigation
Our study is similar to that of Hill and Reiter (2006). Their study, however, only included normal errors, whereas we have added several heavier-tailed error distributions to investigate the robustness of the methods. For all situations:
A single treatment and a control plus two covariates were employed.
Matching based on the propensity scores is done with replacement.
For most situations about 150 treated subjects and 350 control subjects were generated.
The basic response surface is
$$Y = \theta c + x_1 + 2x_2 + e, \qquad (9)$$
where $c$ is either 0 or 1 depending on whether $Y$ is a control or a treated response, $x_1$ and $x_2$ are continuous covariates, and $e$ is the error term. Hence, the parameter $\theta$ is the treatment effect. For the study, we set $\theta = 4$, the same value used by Hill and Reiter (2006).
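A minimal sketch of generating one data set from the response surface (9) is given below. The standard normal error scale, the seed, and the function name are assumptions for illustration. Both groups draw their covariates from the same distribution here, so the naive difference in means is close to $\theta$.

```python
import numpy as np

def simulate_response(c, x1, x2, theta=4.0, rng=None):
    """Y = theta*c + x1 + 2*x2 + e with standard normal errors (assumed scale)."""
    rng = np.random.default_rng() if rng is None else rng
    c = np.asarray(c, dtype=float)
    e = rng.normal(0.0, 1.0, size=c.size)
    return theta * c + np.asarray(x1) + 2.0 * np.asarray(x2) + e

rng = np.random.default_rng(2006)
n_t, n_c = 150, 350                                # about 150 treated and 350 controls
c = np.concatenate([np.ones(n_t), np.zeros(n_c)])
x = rng.normal(1.0, 1.0, size=(n_t + n_c, 2))      # same covariate law in both groups
y = simulate_response(c, x[:, 0], x[:, 1], rng=rng)

naive = y[c == 1].mean() - y[c == 0].mean()
print(naive)
```

When the control covariates are shifted away from the treated covariates, the same naive difference becomes biased, which is the situation the matching is designed to repair.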

11 Misspecified Response Surfaces
Besides this response surface, Hill and Reiter (2006) considered two other response surfaces where the fitted model is misspecified. Our fully adjusted procedures l and w, however, are based on full model fits. It is thus easy to use their associated residual analyses to diagnose misspecified models and, hence, to ultimately fit more appropriate models. We show this later in an example.

12 Two main factors of the study:
Error distributions. A normal distribution; a contaminated normal distribution with contamination at 20% and ratio (contaminated to good) set at 4; and a Cauchy distribution.
Degree of overlap between treated and control subjects:
a Strong Overlap (SO). All covariates (for both Treated and Control) are drawn iid from N(1, 1). This allows for very close matches.
b Moderate Overlap (MO). Covariates for Treated subjects are drawn as in SO, while only 150 covariates for Control are drawn this way. The remaining Control covariates are iid N(3, 1) (these are called distracters).

13 c Weak Overlap (WO). Similar to MO except now only 50 Control covariates are drawn from N(1, 1) while 300 are drawn from N(3, 1).
d Uneven Overlap (UO). The probability of the covariates being assigned to Treatment or Control depends on the region the covariates fall into, so there is good matching in some regions but poor matching in others; see Kosten et al. (2012).
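The two factors can be sketched as follows. Several choices are assumptions: the contaminated normal is read as a 20% mixture whose contaminated component has 4 times the standard deviation of the good component, the normal and Cauchy errors use unit scale, and all names are hypothetical. The uneven overlap (UO) setting is omitted because its details are given only in Kosten et al. (2012).

```python
import numpy as np

def draw_errors(n, dist, rng):
    """Errors for the three distributions in the study (scales are assumptions)."""
    if dist == "normal":
        return rng.normal(0.0, 1.0, n)
    if dist == "cn":
        # Assumed reading: 20% of draws come from a component with 4x the standard deviation.
        contaminated = rng.random(n) < 0.20
        return rng.normal(0.0, 1.0, n) * np.where(contaminated, 4.0, 1.0)
    if dist == "cauchy":
        return rng.standard_cauchy(n)
    raise ValueError(f"unknown distribution: {dist}")

def draw_treated_covariates(rng, n_treated=150, p=2):
    """Treated covariates are iid N(1, 1) in the SO, MO, and WO settings."""
    return rng.normal(1.0, 1.0, size=(n_treated, p))

def draw_control_covariates(overlap, rng, n_control=350, p=2):
    """Control covariates: some N(1, 1) draws plus N(3, 1) distracters."""
    n_close = {"SO": n_control, "MO": 150, "WO": 50}[overlap]
    close = rng.normal(1.0, 1.0, size=(n_close, p))
    distracters = rng.normal(3.0, 1.0, size=(n_control - n_close, p))
    return np.vstack([close, distracters])

rng = np.random.default_rng(12)
x_so = draw_control_covariates("SO", rng)
x_mo = draw_control_covariates("MO", rng)
x_wo = draw_control_covariates("WO", rng)
e_cn = draw_errors(10_000, "cn", rng)
print(x_so.mean(), x_mo.mean(), x_wo.mean(), e_cn.std())
```

As the overlap weakens, the control covariate means drift toward 3, so fewer controls lie close to any treated subject and the same controls are reused more often in the matching.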

14 Methods Investigated
More detailed descriptions of these methods can be found in the manuscript by Kosten et al. (2012).
a Two LS block adjusted methods, l and L.
b Two Wilcoxon block adjusted methods, w and W.
c Matched Pairs Method (M). This is the usual paired t. Weighted Two Sample Method (T). Same estimator as M, i.e., the paired mean difference, but the variance is weighted to account for the number of treated subjects a control is matched to.
d Weighted LS Method (s, S). A weighted LS fit with weights similar to the last method. The s method uses the design matrix $[1_N \;\; T \;\; X_c]$. The S method uses the design matrix $[1_N \;\; T]$.

15 e Robust Sandwich Variance Methods (r, R). This estimate is described by Huber (1967). A diagonal matrix based on the residuals and weights is used as the sandwich. The method r uses the design matrix with the covariates in, while R uses the one without covariates.
f Bootstrap Methods (d, D, b, B). For each bootstrap, the Treated and Control subjects were each resampled (with replacement), then the matches were obtained with replacement, and the selected outcome method was used to estimate the effect. The number of bootstraps was set at B = 1000. The d-methods use the variance from the resampled bootstraps of the estimates. The method d is based on the weighted LS fit with the design including the covariates; the method D is based on the weighted LS fit with the design excluding the covariates. Methods b and B are similar, but the bootstrap percentile confidence interval is used.

16 Table: Methods in the Simulation.

Label  Method
M      Matched Pairs
T      Weighted Two-Sample
S      WLS (Non-Covariate Adjusted)
s      WLS (Covariate Adjusted)
R      WLS Robust Sandwich Variance (Non-Covariate Adjusted)
r      WLS Robust Sandwich Variance (Covariate Adjusted)
D      Bootstrap - Variance (Non-Covariate Adjusted)
d      Bootstrap - Variance (Covariate Adjusted)
B      Bootstrap - Percentile (Non-Covariate Adjusted)
b      Bootstrap - Percentile (Covariate Adjusted)
H      Hodges-Lehmann Aligned Rank (Non-Covariate Adjusted)
h      Hodges-Lehmann Aligned Rank (Covariate Adjusted)
L      Block-Adjusted Least Squares (Non-Covariate Adjusted)
l      Block-Adjusted Least Squares (Covariate Adjusted)
W      Block-Adjusted Wilcoxon (Non-Covariate Adjusted)
w      Block-Adjusted Wilcoxon (Covariate Adjusted)

17 Results of the Monte Carlo Study
There are 12 situations: 3 distributions × 4 degrees of overlap. For each situation, 10,000 simulations were run. For each method, its outcome analysis is based on its confidence interval for the effect. Nominal confidence was set at 95%.
Validity
For each method, its validity is based on the empirical coverage of its confidence interval. We deemed a method to be valid for a situation if its empirical confidence is between 93% and 97%. On this basis:

18 Results of the Validity Study
None of the bootstrap procedures were valid. They were far too conservative.
The Weighted Two-Sample (T) method is conservative for 7 situations.
The Sandwich Two-Sample (R) method is conservative for 8 situations.
The Weighted LS (S) method is conservative for 4 situations and liberal for 4 situations.
The Weighted LS (s) method is liberal for 8 situations.
As attested by their empirical coverages in the following tables, the remaining procedures (r, H, h, l, L, W, w) are valid for almost all the situations.
Efficiency
The efficiency for two procedures is the ratio of the mean lengths of their confidence intervals. This measure was obtained for each situation and is tabled by degree of overlap.
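The validity and efficiency criteria amount to simple summaries of the simulated confidence intervals, as the sketch below shows. The function names and the four toy intervals are hypothetical. The efficiency ratio puts the reference method's mean length in the numerator, so that values above 1 favor the procedure being compared, which is the convention used in the tables.

```python
import numpy as np

def empirical_coverage(lower, upper, theta):
    """Fraction of simulated intervals [lower, upper] that contain the true effect."""
    lower, upper = np.asarray(lower), np.asarray(upper)
    return float(np.mean((lower <= theta) & (theta <= upper)))

def is_valid(coverage, lo=0.93, hi=0.97):
    """Validity rule: empirical confidence between 93% and 97% (nominal 95%)."""
    return lo <= coverage <= hi

def efficiency(ref_lower, ref_upper, lower, upper):
    """Mean CI length of the reference method divided by that of the procedure."""
    ref_len = np.mean(np.asarray(ref_upper) - np.asarray(ref_lower))
    length = np.mean(np.asarray(upper) - np.asarray(lower))
    return float(ref_len / length)

# Toy intervals from four hypothetical simulation runs with theta = 4.
lower = [3.5, 3.8, 4.1, 3.0]
upper = [4.5, 4.6, 5.0, 4.2]
cov = empirical_coverage(lower, upper, theta=4.0)
print(cov, is_valid(cov))

# Reference intervals of length 1.0 against intervals of length 0.5.
eff = efficiency([3.5, 3.5], [4.5, 4.5], [3.75, 3.75], [4.25, 4.25])
print(eff)
```

Three of the four toy intervals contain 4, so the coverage is 0.75 and the rule flags the method as not valid. The second example gives an efficiency of 2, since the compared intervals are half as long as the reference intervals.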

19 For each procedure at each situation, the tabled ratio is the mean length of the confidence interval for the l method (LS block-adjusted with covariates in the design matrix) divided by the mean length of the procedure's confidence interval. Thus ratios greater than 1 mean that the procedure is more efficient than the l procedure.

Table: Simulation results for the strong overlap (SO) setting over all error distributions.
Columns: Eff and Conf under each of the Normal, CN, and Cauchy error distributions. Rows (Meth): r, H, h, L, l, W, w. [Table values not recoverable.]

20 Table: Simulation results for the moderate overlap (MO) setting over all error distributions.
Columns: Eff and Conf under each of the Normal, CN, and Cauchy error distributions. Rows (Meth): r, H, h, L, l, W, w. [Table values not recoverable.]

21 Table: Simulation results for the weak overlap (WO) setting over all error distributions.
Columns: Eff and Conf under each of the Normal, CN, and Cauchy error distributions. Rows (Meth): r, H, h, L, l, W, w. [Table values not recoverable.]

22 Table: Simulation results for the uneven overlap (UO) setting over all error distributions.
Columns: Eff and Conf under each of the Normal, CN, and Cauchy error distributions. Rows (Meth): r, H, h, L, l, W, w. [Table values not recoverable.]

23 Conclusions on Efficiency The r (robust sandwich) procedure is inefficient. Even the l procedure dominates it for the contaminated normal situations. Between the h and H, the H procedure is more efficient than the h procedure for the nonnormal situations. Evidently, in the presence of heavy tailed error distributions, the less LS estimation the better. The W procedure clearly dominates the H procedure. The w procedure, however, is definitely superior to all of the procedures over the nonnormal situations. It is clearly the most robust procedure in the study. Furthermore, for the normal situations, its efficiency is only slightly less than that of h and l.

24 Learning to Learn (LTL) Data Set
The purpose of the study was to investigate the effect of the LTL program on grade point average (GPA) and graduation rate. The study was conducted from the fall semester of 1987 to the winter semester of students that participated in the LTL program. The study also collected data on control subjects.
Response: the student's last known college GPA.
Covariates: gender; race (Caucasian or otherwise); age; Alpha program participation (1 = yes and 0 = no); overall ACT score; English ACT score; Math ACT score; Reading ACT score; Science ACT score; high school GPA; and entry year into the study.
One-to-one matching with replacement using propensity scores resulted in 294 matched controls. The next table shows how different the treatment group is from the full set of controls and how close the matched groups are:

25 Table: Baseline Covariates for the Treated (LTL), Full Control Groups, and Matched Controls
Columns: Variable, Treatment, All Controls, Matched Controls. Rows: No. Obs, Gender, Race, Entry Age, Alpha, ACT, Eng. ACT, Math ACT, Read ACT, Sci. ACT, HS GPA, Entry. [Table values not recoverable.]

26 Results of the Valid Methods for the Treatment Effect

Table: Point Estimates and Confidence Intervals for the Seven Valid Methods

Method  Estimate  95% CI
r                 ( , )
H                 ( , )
h                 ( , )
L                 (0.0014, )
l                 (0.0034, )
W                 ( , )
w                 ( , )

All methods except the LS-based L and l conclude that the treatment LTL was ineffective at changing GPA.

27 The discrepancy between the LS-based and robust Wilcoxon-based methods was investigated by a robust diagnostic analysis based on the w-fit; see McKean and Sheather (2009). Numerous outliers in the data led to the difference between the l and w procedures. Further, outliers were detected in factor (covariate) space. Hence, a robust high breakdown fit (HBR) was used. The high breakdown estimate of the effect is , along with the confidence interval ( , ), which confirms the w analysis. In summary, based on our diagnostic analysis, it appears that the outliers in both the response and factor spaces impaired the l analysis. Also, in light of the high breakdown estimate and confidence interval for the effect, it appears that the analysis obtained from the w procedure is valid.

28 Conclusion
Overall, method w has been shown to be the most versatile method for constructing a confidence interval for a treatment effect in an observational study. Method w was the most efficient estimator for heavy-tailed error distributions. It was very close to the most efficient when the errors were normal, while still providing near-optimal coverage of the true treatment effect. Further, method w lends itself to a robust diagnostic residual analysis, which checks quality of fit and identifies outliers in both response and covariate space. The nonrobust l method is a desirable approach if used in the context of well-behaved distributions.
