
Web-based Supplementary Materials for "Calibrating Sensitivity Analyses to Observed Covariates in Observational Studies" by Hsu and Small

Jesse Y. Hsu 1,2,3 and Dylan S. Small 1

1 Department of Statistics, The Wharton School, University of Pennsylvania, 400 Jon M. Huntsman Hall, 3730 Walnut Street, Philadelphia, Pennsylvania, U.S.A.
2 Center for Outcomes Research, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, U.S.A.
3 hsu9@wharton.upenn.edu

A Details on Computation of p_i(u)

In this section, we provide details on the computation of p_i(u) and p_i^+ discussed in Section 2.3; see Gastwirth et al. (1998) and Gastwirth et al. (2000) for more details. Under model (2), in the case of pair matching, the conditional distribution of the treatment assignment within pair i is

$$
\pi_i = \Pr(Z_{i1}=1, Z_{i2}=0 \mid \tilde{y}, \mathcal{Z}, X)
= \frac{\Pr(Z_{i1}=1 \mid \tilde{y}, \mathcal{Z}, X)\,\Pr(Z_{i2}=0 \mid \tilde{y}, \mathcal{Z}, X)}
{\Pr(Z_{i1}=1 \mid \tilde{y}, \mathcal{Z}, X)\,\Pr(Z_{i2}=0 \mid \tilde{y}, \mathcal{Z}, X)
+ \Pr(Z_{i1}=0 \mid \tilde{y}, \mathcal{Z}, X)\,\Pr(Z_{i2}=1 \mid \tilde{y}, \mathcal{Z}, X)}
= \frac{\exp(\gamma u_{i1})}{\exp(\gamma u_{i1}) + \exp(\gamma u_{i2})}
= \frac{1}{1+\exp\{\gamma(u_{i2}-u_{i1})\}}. \quad (A.1)
$$

Similarly, under model (3), the conditional distribution of the response within pair i is

$$
\lambda_i = \Pr(y_{i1}=y_{i(2)}, y_{i2}=y_{i(1)} \mid \tilde{y}, \mathcal{Z}, X)
= \frac{\Pr(y_{i1}=y_{i(2)} \mid \tilde{y}, \mathcal{Z}, X)\,\Pr(y_{i2}=y_{i(1)} \mid \tilde{y}, \mathcal{Z}, X)}
{\Pr(y_{i1}=y_{i(2)} \mid \tilde{y}, \mathcal{Z}, X)\,\Pr(y_{i2}=y_{i(1)} \mid \tilde{y}, \mathcal{Z}, X)
+ \Pr(y_{i1}=y_{i(1)} \mid \tilde{y}, \mathcal{Z}, X)\,\Pr(y_{i2}=y_{i(2)} \mid \tilde{y}, \mathcal{Z}, X)}
= \frac{\exp\{\delta(y_{i(2)}u_{i1}+y_{i(1)}u_{i2})\}}
{\exp\{\delta(y_{i(2)}u_{i1}+y_{i(1)}u_{i2})\} + \exp\{\delta(y_{i(1)}u_{i1}+y_{i(2)}u_{i2})\}}
= \frac{1}{1+\exp\{\delta(y_{i(2)}-y_{i(1)})(u_{i2}-u_{i1})\}}. \quad (A.2)
$$

For pair i, the chance that the treated subject has the higher response under the null hypothesis is

$$
p_i(u) = \pi_i\lambda_i + (1-\pi_i)(1-\lambda_i)
= \frac{\exp\{\gamma(u_{i2}-u_{i1})\}\exp\{\delta(y_{i(2)}-y_{i(1)})(u_{i2}-u_{i1})\}+1}
{[1+\exp\{\gamma(u_{i2}-u_{i1})\}][1+\exp\{\delta(y_{i(2)}-y_{i(1)})(u_{i2}-u_{i1})\}]}. \quad (A.3)
$$
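As a quick numerical check of (A.1)-(A.3), the sketch below (in Python rather than the paper's R, purely for illustration; the function names are ours) computes p_i(u) from pi_i and lambda_i and compares it with the closed form in (A.3); it also evaluates p_i(u) at u_i1 = 0 and u_i2 = 1, the point at which (A.4) says the maximum is attained.

```python
import math

def pair_probs(gamma, delta, u1, u2, y_lo, y_hi):
    """Return (pi_i, lambda_i, p_i(u)) for one matched pair, following (A.1)-(A.3);
    y_lo and y_hi play the roles of y_i(1) <= y_i(2)."""
    pi_i = 1.0 / (1.0 + math.exp(gamma * (u2 - u1)))                   # (A.1)
    lam_i = 1.0 / (1.0 + math.exp(delta * (y_hi - y_lo) * (u2 - u1)))  # (A.2)
    p_i = pi_i * lam_i + (1.0 - pi_i) * (1.0 - lam_i)                  # (A.3), first form
    return pi_i, lam_i, p_i

def p_i_closed_form(gamma, delta, u1, u2, y_lo, y_hi):
    """Closed form on the right-hand side of (A.3)."""
    a = math.exp(gamma * (u2 - u1))
    b = math.exp(delta * (y_hi - y_lo) * (u2 - u1))
    return (a * b + 1.0) / ((1.0 + a) * (1.0 + b))

gamma, delta = math.log(2.0), math.log(3.0)   # illustrative values, not from the paper
p = pair_probs(gamma, delta, u1=0.3, u2=0.9, y_lo=0.0, y_hi=1.0)[2]
# (A.4): the maximum of p_i(u) is attained at u_i1 = 0, u_i2 = 1
p_max = pair_probs(gamma, delta, u1=0.0, u2=1.0, y_lo=0.0, y_hi=1.0)[2]
# here p_max = (2*3 + 1) / ((1 + 2)(1 + 3)) = 7/12
```

With a binary response (y_i(2) - y_i(1) = 1), p_max reduces to exactly the worst-case probability used later for McNemar's statistic.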

The maximum value of p_i(u), denoted p_i^+, is obtained by setting u_i1 = 0 and u_i2 = 1:

$$
p_i^+ = p_i(u_{i1}=0, u_{i2}=1)
= \frac{\exp(\gamma)\exp\{\delta(y_{i(2)}-y_{i(1)})\}+1}
{\{1+\exp(\gamma)\}[1+\exp\{\delta(y_{i(2)}-y_{i(1)})\}]}. \quad (A.4)
$$

B Empirical Results

In this section, we provide additional empirical results for the NHANES data, including a simultaneous sensitivity analysis, an approximation of Ω.05, calibration of a simultaneous sensitivity analysis to the observed covariates, and the sensitivity of estimates to the choice of (Γ, Δ) ∈ Ω.05.

B.1 Simultaneous Sensitivity Analysis to Hidden Bias

One type of sensitivity analysis, the simultaneous sensitivity analysis, uses two sensitivity parameters, Γ and Δ, to measure the degree of hidden bias due to an unobserved covariate in an observational study (Gastwirth et al., 1998). Suppose there is an unobserved covariate u that lies between 0 and 1. One sensitivity parameter, Γ, relates u to treatment; namely, the odds ratio of receiving treatment for two subjects with different values of u is at most Γ. The other parameter, Δ, relates u to response; namely, the odds ratio of having the higher response for two subjects with different values of u is at most Δ. The simultaneous sensitivity analysis finds the maximum p-value over all distributions of u for given values of Γ and Δ. Web Table 1 gives the simultaneous sensitivity analysis for the NHANES data.

[Web Table 1 about here.]

B.2 Approximation of Ω.05

Because the curve Ω.05 does not have a closed form, we suggest a grid search to approximate Ω.05. First, we expand the values of (Γ, Δ)

from 1.01 to 16 in increments of 0.01 and create a grid. Second, for each combination of (Γ, Δ) in the grid, we calculate the corresponding maximum p-value for McNemar's test statistic. Finally, for each Γ, we look for the Δ such that the maximum p-value is close to the significance level, say 0.05. In Web Table 2, we list 100 randomly selected values of (Γ, Δ) and their maximum p-values from the collection of (Γ, Δ) ∈ Ω.05, which is used to draw the curves in all figures.

[Web Table 2 about here.]

B.3 Calibrating the Sensitivity Analysis to Observed Covariates

Using the optim function in R, we maximize the two log-likelihood functions, $\log L\{\theta; Z, X, \gamma = \log(2.21)\}$ and $\log L\{\phi; y^{(0)}, X, \delta = \log(2.21)\}$, in (12) and (14) to obtain $\hat{\theta}_{\gamma}$ and $\hat{\phi}_{\delta}$, listed in the first two columns of Web Table 3. The last two columns of Web Table 3 show $\hat{\Theta}_{\gamma} = \exp(\hat{\theta}_{\gamma})$ and $\hat{\Phi}_{\delta} = \exp(\hat{\phi}_{\delta})$, which are the estimated effects of the observed covariates on smoking and high blood lead. As an example of how to read Web Table 3, if there are two subjects within a matched set and the subjects have the same values of all the observed covariates and the unobserved covariate except that one is male and the other is female, then the male subject has 2.36 times as high odds of smoking and 5.32 times as high odds of having high blood lead.

[Web Table 3 about here.]

B.4 Sensitivity of Estimates to the Choice of (Γ, Δ) ∈ Ω.05 for the NHANES Data

In this section, we examine how sensitive our proposed method is to different choices of (Γ, Δ) ∈ Ω.05 for the NHANES data. The general conjecture is discussed in Web Section C.2. Specifically, we empirically investigate the behavior of the estimates obtained from the

log-likelihood functions in (12) and (14) given different values of (Γ, Δ) ∈ Ω.05 (see Section B.2). In Web Table 2, we list 100 randomly selected values of (Γ, Δ) ∈ Ω.05. In this section, we pick 4 out of Web Table 2 along with the default choice of (Γ, Δ) = (2.21, 2.21) and obtain the corresponding estimates for the observed covariates given these 5 selected (Γ, Δ), which are (1.44, 8.52), (1.61, 4.12), (2.21, 2.21), (6.95, 1.47) and (11.28, 1.41). Web Figure 1 shows the calibration of the simultaneous sensitivity analysis to the observed covariates age, income-to-poverty level, gender, education and race for the 5 selected values of (Γ, Δ) ∈ Ω.05 in Web Table 2. The estimates are not sensitive to the choice of (Γ, Δ) ∈ Ω.05 for age, income-to-poverty level, gender and race. For education, the estimates reveal some variation. Even though these estimates vary in one dimension, mostly along the Γ-axis, at different values of (Γ, Δ) ∈ Ω.05, the conclusion remains consistent; i.e., the estimates fall in the shaded area where the maximum p-value for McNemar's test statistic is less than 0.05. Based on this empirical investigation, the calibration of the simultaneous sensitivity analysis to observed covariates is not sensitive to the choice of (Γ, Δ) ∈ Ω.05. Therefore, we suggest a default choice of Γ = Δ = 2.21 for the NHANES data.

[Web Figure 1 about here.]

C Simulation Studies

We consider the following setup for simulation studies such that the simulated data are similar to, but simpler than, the data from our motivating example described in Section 1.2. Let X1 be a continuous covariate (e.g., standardized age) with $X_{1ij} \sim N(0, 0.25)$, and let X2 be a binary covariate (e.g., male gender) with $X_{2ij} \sim \mathrm{Bernoulli}(0.5)$. Both X1 and X2 are observed. Let u denote a binary unobserved covariate such that $u_{ij} \sim \mathrm{Bernoulli}(0.5)$. All of X1, X2 and u have the same variance of 0.25. Consider the following data generating

processes for the binary treatment Z_ij and the binary response under control y_ij^(0):

$$
Z_{ij} \sim \mathrm{Bernoulli}(p_{ij}), \quad
p_{ij} = \frac{\exp(\theta_0+\theta_1 x_{1ij}+\theta_2 x_{2ij}+\gamma u_{ij})}
{1+\exp(\theta_0+\theta_1 x_{1ij}+\theta_2 x_{2ij}+\gamma u_{ij})}, \quad (C.5)
$$

and

$$
y_{ij}^{(0)} \sim \mathrm{Bernoulli}(\pi_{ij}), \quad
\pi_{ij} = \frac{\exp(\phi_0+\phi_1 x_{1ij}+\phi_2 x_{2ij}+\delta u_{ij})}
{1+\exp(\phi_0+\phi_1 x_{1ij}+\phi_2 x_{2ij}+\delta u_{ij})}. \quad (C.6)
$$

Parameters in (C.5) and (C.6) are set as follows: (θ0, θ1, θ2) = (−2, 0.5, 0.5), (φ0, φ1, φ2) = (−3, 0.2, 0.2), and (γ, δ) = (0.5, 0.2). The sample size for each replicate is 3,340. Suppose the treatment has a nonnegative effect, $y_{ij}^{(1)} \geq y_{ij}^{(0)}$: if $y_{ij}^{(0)} = 1$, then $y_{ij}^{(1)} = 1$; if $y_{ij}^{(0)} = 0$, then $y_{ij}^{(1)} = 0$ or $y_{ij}^{(1)} = 1$. We assume the attributable effect, the effect of the treatment on the treated subjects, is $A = \sum_{i=1}^{I}\sum_{j=1}^{n_i} Z_{ij}(y_{ij}^{(1)}-y_{ij}^{(0)})$. The parameter A is the number of treated responses actually caused by exposure to the treatment. In Section C.1, we compare the estimates obtained from two methods when the $y_{ij}^{(0)}$ in the log-likelihood function in (13) are only partially observed. In Section C.2, we examine the sensitivity of the estimates obtained from (14) to the choice of (Γ, Δ) ∈ Ω.05, as discussed in Section 3.2.

C.1 Sensitivity of Estimates to the Use of a Sub-Sample

In this simulation, we compare the estimates obtained from two methods when the $y_{ij}^{(0)}$ in the log-likelihood function in (13) are only partially observed. Following the discussion in Section 3.1, method I replaces the unobserved $y_{ij}^{(0)}$ with $y_{ij}^{(1)}$ for those who have $Z_{ij} = 1$, and method II uses data only from subjects whose $Z_{ij} = 0$. We consider the following four possible treatment effects: A ∈ {0, 50, 75, 100}, where A = 0 represents the null treatment effect. Web Table 4 shows the averaged estimates for the parameters φ1 and φ2 in (C.6) and their standard errors from 1,000 replicates. When Fisher's sharp null hypothesis is true (i.e., A = 0), both methods provide good estimates, although the estimates from method II lose some efficiency due to the use of a sub-sample.
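The data generating processes (C.5) and (C.6) are straightforward to reproduce. The sketch below (in Python rather than the paper's R, purely for illustration; the helper names are ours) draws one replicate of size 3,340 under the stated parameter values.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate_replicate(n=3340, seed=1,
                       theta=(-2.0, 0.5, 0.5), phi=(-3.0, 0.2, 0.2),
                       gamma=0.5, delta=0.2):
    """Draw one replicate of (x1, x2, u, z, y0) from (C.5) and (C.6)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 0.5)       # X1 ~ N(0, 0.25): sd = 0.5
        x2 = int(rng.random() < 0.5)   # X2 ~ Bernoulli(0.5), observed
        u = int(rng.random() < 0.5)    # unobserved covariate u ~ Bernoulli(0.5)
        p = expit(theta[0] + theta[1] * x1 + theta[2] * x2 + gamma * u)  # (C.5)
        pi = expit(phi[0] + phi[1] * x1 + phi[2] * x2 + delta * u)       # (C.6)
        z = int(rng.random() < p)      # treatment
        y0 = int(rng.random() < pi)    # response under control
        rows.append((x1, x2, u, z, y0))
    return rows

sim = simulate_replicate()
treated_rate = sum(r[3] for r in sim) / len(sim)       # roughly expit(-1.5), about 0.18
control_resp_rate = sum(r[4] for r in sim) / len(sim)  # roughly expit(-2.8), about 0.06
```

A treatment effect A > 0 would then be imposed by flipping y0 = 0 to y1 = 1 for A randomly chosen treated subjects, consistent with the nonnegative-effect assumption above.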
When the alternative hypothesis is true (i.e., A = 50, 75, or

100), the estimates from method I start deviating from the true parameters as the treatment effect increases, whereas method II still provides consistent estimates. Based on the results of this set of simulation studies, we suggest the use of method II in our paper.

[Web Table 4 about here.]

C.2 Sensitivity of Estimates to the Choice of (Γ, Δ) ∈ Ω.05

In this section, we examine through simulation how sensitive the estimates for θ and φ are to the choice of (Γ, Δ) ∈ Ω.05. For each replicate, we choose 5 different values of (Γ, Δ) ∈ Ω.05, including the suggested default choice of Γ = Δ, and obtain 5 sets of estimates for θ1 and θ2 in (C.5) and φ1 and φ2 in (C.6) following the proposed method discussed in Section 3. We consider one treatment effect, A = 100, and generate 10 data sets following the setup described at the beginning of Section C. Web Figure 2 shows the calibration of the simultaneous sensitivity analysis to (a) the observed covariate X1 and (b) the observed covariate X2, at (Γ, Δ) ∈ Ω.05. Ten solid curves represent values of (Γ, Δ) ∈ Ω.05 from the 10 simulated data sets. For each simulated data set, five symbols (square, circle, diamond, up-triangle, and down-triangle) represent 5 different arbitrary values of (Γ, Δ) ∈ Ω.05 (on the curve) and their corresponding estimates for exp(θp) and exp(φp) for p = 1, 2 (off the curve); i.e., the calibration of the unobserved u to the observed Xp for p = 1, 2. For both the continuous (X1) and binary (X2) covariates, because the estimates are located close to each other among all 10 simulated data sets, the calibration of the simultaneous sensitivity analysis to observed covariates is not sensitive to the choice of (Γ, Δ) ∈ Ω.05. Therefore, in our paper, we suggest choosing a default value of Γ = Δ.

[Web Figure 2 about here.]
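The grid-search approximation of Ω.05 used throughout Sections B and C can also be sketched independently of the paper's R code. The Python illustration below (function names are ours) uses the worst-case discordant-pair probability p = π̄θ̄ + (1 − π̄)(1 − θ̄) from Gastwirth et al. (1998), with the NHANES counts I = 68 discordant pairs and T = 46 from the paper's Part 4 code, and steps Δ upward on a 0.01 grid until the maximum p-value reaches 0.05; for Γ = 2.21 the boundary Δ lands near 2.21, consistent with the suggested default Γ = Δ = 2.21.

```python
import math

def mcnemar_max_pvalue(I, T, Gamma, Delta):
    """Worst-case one-sided p-value for McNemar's statistic under (Gamma, Delta):
    each of the I discordant pairs is 'treated subject has the higher response'
    with probability at most p = pi_bar*theta_bar + (1-pi_bar)*(1-theta_bar)."""
    pi_bar = Gamma / (1.0 + Gamma)
    theta_bar = Delta / (1.0 + Delta)
    p = pi_bar * theta_bar + (1.0 - pi_bar) * (1.0 - theta_bar)
    # P(Bin(I, p) >= T), computed exactly
    return sum(math.comb(I, t) * p**t * (1.0 - p)**(I - t) for t in range(T, I + 1))

def boundary_delta(I, T, Gamma, alpha=0.05, step=0.01, dmax=16.0):
    """Smallest Delta on a 0.01 grid whose maximum p-value reaches alpha:
    one column of the grid search sketched in Web Section B.2."""
    D = 1.0
    while D <= dmax:
        if mcnemar_max_pvalue(I, T, Gamma, D) >= alpha:
            return round(D, 2)
        D += step
    return None  # this Gamma stays significant for every Delta <= dmax

I_pairs, T_pairs = 68, 46  # discordant-pair counts from the NHANES pair matching
d_boundary = boundary_delta(I_pairs, T_pairs, Gamma=2.21)
```

Repeating boundary_delta over a grid of Γ values traces out the approximation to the Ω.05 curve on which the five plotting symbols in Web Figure 2 sit.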

D Software

In this section, we provide the R code for the calibration of a simultaneous sensitivity analysis to the observed covariates using the NHANES data in the paper.

##############################################################################
# PART 1: Data (ref. ####
d <- read.table("nhanes2008.lead.smoking.txt", header = TRUE)
attach(d)
x <- cbind(age, male, edu.lt9, edu.9to11, edu.hischl, edu.somecol, edu.unknown,
           income, income.mis, black, mexicanam, otherhispan, otherrace)

##############################################################################
# PART 2: Pair Matching ####
library(optmatch)

# Matching functions
smahal <- function(z, X){
  # Calculate the rank-based Mahalanobis distance
  X <- as.matrix(X)
  n <- dim(X)[1]
  rownames(X) <- 1:n
  k <- dim(X)[2]
  m <- sum(z)
  for (j in 1:k) X[, j] <- rank(X[, j])
  cv <- cov(X)
  vuntied <- var(1:n)
  rat <- sqrt(vuntied/diag(cv))
  cv <- diag(rat) %*% cv %*% diag(rat)
  out <- matrix(NA, m, n - m)
  Xc <- X[z == 0, ]
  Xt <- X[z == 1, ]
  rownames(out) <- rownames(X)[z == 1]
  colnames(out) <- rownames(X)[z == 0]
  library(MASS)
  icov <- ginv(cv)
  for (i in 1:m) out[i, ] <- mahalanobis(Xc, Xt[i, ], icov, inverted = TRUE)
  out
}

addcaliper <- function(dmat, z, p, calipersd = 0.2, penalty = 1000){
  # add a propensity score caliper to a distance matrix
  sdp <- sd(p)
  adif <- abs(outer(p[z == 1], p[z == 0], "-"))
  adif <- (adif - (calipersd * sdp)) * (adif > (calipersd * sdp))
  dmat <- dmat + adif * penalty
  dmat
}

pair.vector <- function(pairmatchvec, treatment){
  # find out who is the matched treated and who is the matched control
  # NOTE: treated subjects need to be ordered at the beginning
  pairs.short <- substr(pairmatchvec, start = 3, stop = 10)
  pairsnumeric <- as.numeric(pairs.short)
  notreated <- sum(treatment)
  pairsvec <- rep(0, notreated)
  for (i in 1:notreated) {
    temp <- (pairsnumeric == i) * seq(1, length(pairsnumeric), 1)
    pairsvec[i] <- sum(temp, na.rm = TRUE) - i
  }
  pairsvec
}

# propensity score model
propscore.model <- glm(smoking ~ x, family = binomial, x = TRUE)
Xmat <- propscore.model$x[, -1]

colnames(Xmat) <- colnames(x)
propscore <- predict(propscore.model, type = "response")
# rank-based Mahalanobis distance
distmat <- smahal(smoking, Xmat)
# add caliper to distance matrix
distmat2 <- addcaliper(distmat, smoking, propscore)

##### pair matching #####
pairmatchvec <- pairmatch(distmat2)
# Create a vector saying which control unit each treated unit is matched to
# NOTE: treated subjects need to be ordered at the beginning
pairsvec <- pair.vector(pairmatchvec, smoking)

# prepare to put matched data together
matched.id <- seq(1, length(which(smoking == 1)))
Xmat <- as.data.frame(cbind(Xmat, edu.college, white))
Xmat[which(Xmat$income.mis == 1), "income"] <- NA
Xnames <- names(Xmat)
# smokers
id.s <- id[which(smoking == 1)]
Xmat.matched.smoker <- Xmat[which(smoking == 1), ]
colnames(Xmat.matched.smoker) <- paste(Xnames, sep = "", ".s")
lead.s <- lead[which(smoking == 1)]
# non-smokers
id.u <- id[pairsvec]
Xmat.matched.nonsmoker <- Xmat[pairsvec, ]
colnames(Xmat.matched.nonsmoker) <- paste(Xnames, sep = "", ".u")
lead.u <- lead[pairsvec]
# put matched data (smokers and non-smokers) together
pair.match.data <- as.data.frame(cbind(matched.id, id.s, Xmat.matched.smoker, lead.s,
                                       id.u, Xmat.matched.nonsmoker, lead.u))

##############################################################################
# PART 3: Covariate Balance ####
maketable.1 <- function(){
  # Table 1 in the paper
  ##### Standardized differences #####
  Xnames.stddif <- c("age", "income", "income.mis", "male", "edu.lt9", "edu.9to11",
                     "edu.hischl", "edu.somecol", "edu.college", "edu.unknown",
                     "white", "black", "mexicanam", "otherhispan", "otherrace")
  sd.s <- apply(Xmat[which(smoking == 1), ], 2, sd, na.rm = TRUE)
  sd.u <- apply(Xmat[which(smoking == 0), ], 2, sd, na.rm = TRUE)
  ### Before matching ###
  STD.DIFF.before <- NULL
  for (i in 1:length(Xnames.stddif)){
    name <- Xnames.stddif[i]
    x.s <- Xmat[which(smoking == 1), name]
    x.u <- Xmat[which(smoking == 0), name]
    if (name %in% c("age", "income")){
      mu.s <- mean(x.s, na.rm = TRUE)
      mu.u <- mean(x.u, na.rm = TRUE)
      std.diff <- (mu.s - mu.u)/sqrt((sd.s[name]^2 + sd.u[name]^2)/2)
      STD.DIFF.before <- rbind(STD.DIFF.before, data.frame(cbind(mu.s, mu.u, std.diff)))
    } else {
      mu.s <- mean(x.s)*100
      mu.u <- mean(x.u)*100
      std.diff <- (mu.s/100 - mu.u/100)/sqrt((sd.s[name]^2 + sd.u[name]^2)/2)
      STD.DIFF.before <- rbind(STD.DIFF.before, data.frame(cbind(mu.s, mu.u, std.diff)))
    }
  }
  lead.before <- cbind(round(mean(lead[smoking == 1] >= 5)*100, 1),
                       round(mean(lead[smoking == 0] >= 5)*100, 1))
  colnames(lead.before) <- c("mu.s", "mu.u")
  rownames(lead.before) <- c("lead.before")
  ### After matching ###
  # pair matching
  STD.DIFF.pair <- NULL
  for (i in 1:length(Xnames.stddif)){
    name <- Xnames.stddif[i]
    x.s <- pair.match.data[, paste(name, ".s", sep = "")]
    x.u <- pair.match.data[, paste(name, ".u", sep = "")]
    if (name %in% c("age", "income")){
      mu.s <- mean(x.s, na.rm = TRUE)
      mu.u <- mean(x.u, na.rm = TRUE)
      std.diff <- (mu.s - mu.u)/sqrt((sd.s[name]^2 + sd.u[name]^2)/2)
      STD.DIFF.pair <- rbind(STD.DIFF.pair, data.frame(cbind(mu.s, mu.u, std.diff)))
    } else {
      mu.s <- mean(x.s)*100
      mu.u <- mean(x.u)*100
      std.diff <- (mu.s/100 - mu.u/100)/sqrt((sd.s[name]^2 + sd.u[name]^2)/2)
      STD.DIFF.pair <- rbind(STD.DIFF.pair, data.frame(cbind(mu.s, mu.u, std.diff)))
    }
  }
  lead.after <- cbind(round(mean(pair.match.data$lead.s >= 5)*100, 1),
                      round(mean(pair.match.data$lead.u >= 5)*100, 1))
  colnames(lead.after) <- c("mu.s", "mu.u")
  rownames(lead.after) <- c("lead.after")
  list(before = STD.DIFF.before, after = STD.DIFF.pair,
       outcome = rbind(lead.before, lead.after))
}
maketable.1()

##############################################################################
# PART 4: Sensitivity Analysis ####
# Functions for sensitivity analysis for McNemar (binary) and Wilcoxon (continuous) statistics
McNemar.sens <- function(I, T, Gamma, Delta){
  # Simultaneous sensitivity analysis for a binary outcome and a binary treatment
  n.row <- length(Gamma)
  n.col <- length(Delta)
  p.value <- matrix(NA, n.row, n.col)
  rownames(p.value) <- paste("Gamma=", Gamma, sep = "")
  colnames(p.value) <- paste("Delta=", Delta, sep = "")
  for (i in 1:n.row){
    for (j in 1:n.col){
      gamma <- log(Gamma[i])
      delta <- log(Delta[j])
      pi.bar <- exp(abs(gamma))/(1 + exp(abs(gamma)))
      theta.bar <- exp(abs(delta))/(1 + exp(abs(delta)))
      if (gamma*delta >= 0) p <- pi.bar*theta.bar + (1 - pi.bar)*(1 - theta.bar)
      else p <- 1/2
      p.value[i, j] <- 1 - pbinom(T - 1, I, p)
    }
  }
  p.value
}

Wilcoxon.sens <- function(x, Gamma = 1, Delta = 1, Gastwirth = TRUE){
  # Simultaneous sensitivity analysis for a continuous outcome and a binary treatment
  # Default with adjustment from Gastwirth et al. (1998)
  n.row <- length(Gamma)
  n.col <- length(Delta)
  if (sum(x == 0) > 0){
    x <- x[x != 0]
  }
  Std.Dev <- matrix(NA, n.row, n.col)
  p.value <- matrix(NA, n.row, n.col)
  rownames(Std.Dev) <- paste("Gamma=", Gamma, sep = "")
  colnames(Std.Dev) <- paste("Delta=", Delta, sep = "")
  rownames(p.value) <- paste("Gamma=", Gamma, sep = "")
  colnames(p.value) <- paste("Delta=", Delta, sep = "")
  rank <- rank(abs(x))
  if (Gastwirth == TRUE){
    rank.new <- 2*rank/length(rank)
    T <- sum(rank.new[x > 0])
    for (i in 1:n.row){
      for (j in 1:n.col){
        gamma <- log(Gamma[i])
        delta <- log(Delta[j])
        pi.bar <- 1/(1 + exp(-abs(gamma)))
        theta.bar <- 1/(1 + exp(-abs(delta)*rank.new))
        if (gamma*delta >= 0) p <- pi.bar*theta.bar + (1 - pi.bar)*(1 - theta.bar)
        else p <- 1/2
        Std.Dev[i, j] <- (T - sum(p*rank.new))/sqrt(sum(p*(1 - p)*rank.new*rank.new))
        p.value[i, j] <- 1 - pnorm(Std.Dev[i, j])
      }
    }
  } else {
    T <- sum(rank[x > 0])
    for (i in 1:n.row){
      for (j in 1:n.col){
        gamma <- log(Gamma[i])
        delta <- log(Delta[j])
        pi.bar <- 1/(1 + exp(-abs(gamma)))
        theta.bar <- 1/(1 + exp(-abs(delta)*rank))
        if (gamma*delta >= 0) p <- pi.bar*theta.bar + (1 - pi.bar)*(1 - theta.bar)
        else p <- 1/2
        Std.Dev[i, j] <- (T - sum(p*rank))/sqrt(sum(p*(1 - p)*rank*rank))
        p.value[i, j] <- 1 - pnorm(Std.Dev[i, j])
      }
    }
  }
  p.value
}

##### binary outcome #####
# binary outcome: 1 if lead>=5 (CDC cutoff) vs 0 if lead<5
binary.lead.s <- 1*(pair.match.data$lead.s >= 5)
binary.lead.u <- 1*(pair.match.data$lead.u >= 5)
# I=68 (number of all discordant pairs)
# T=46 (number of discordant pairs in which the smoker had high blood lead but not the nonsmoker)
I <- sum(binary.lead.s != binary.lead.u)
T <- sum(binary.lead.s == 1 & binary.lead.u == 0)
# Test the null assuming no unobserved covariate (Gamma=1 & Delta=1) using McNemar's test
McNemar.sens(I, T, 1, 1)

omega <- function(I = NULL, T = NULL, r = NULL, length.out = 16, alpha = 0.05, type){
  # This function creates the curve of Omega at alpha
  # length.out: largest value of the sensitivity parameters used for the plot
  # type: type of outcome, "binary" or "continuous"
  if (type == "binary" & (!is.numeric(I) | !is.numeric(T))){
    stop("I or T cannot be NULL if binary")
  } else if (type == "continuous" & !is.numeric(r)){
    stop("r cannot be NULL if continuous")
  }
  # Find values of Gamma and Delta such that Gamma=Delta and the max. p-value is about 0.05
  if (type == "binary"){
    Gamma.eq.Delta <- floor(uniroot(function(g){McNemar.sens(I, T, g, g) - alpha},
                                    interval = c(1, 50))$root*1000)/1000
    # Starting from the point where Gamma=Delta,
    # create a list of Gamma and Delta that will lead to alpha;
    # this is the collection of points on the curve at alpha.
    # Rounding at one-hundredth, values tend to repeat at both ends
    Gamma.Delta <- rbind(
      cbind(round(unlist(lapply(rev(seq(ceiling(Gamma.eq.Delta*100)/100, length.out, .01)),
              function(d){uniroot(function(g){McNemar.sens(I, T, g, d) - alpha},
                                  interval = c(1, 50))$root})), 2),
            rev(seq(ceiling(Gamma.eq.Delta*100)/100, length.out, .01))),
      cbind(seq(ceiling(Gamma.eq.Delta*100)/100, length.out, .01),
            round(unlist(lapply(seq(ceiling(Gamma.eq.Delta*100)/100, length.out, .01),
              function(g){uniroot(function(d){McNemar.sens(I, T, g, d) - alpha},
                                  interval = c(1, 50))$root})), 2)))
    # max. p-values corresponding to the values of Gamma and Delta in Gamma.Delta
    p.alpha <- NULL
    for (i in 1:dim(Gamma.Delta)[1]){
      p.alpha <- c(p.alpha,
        as.numeric(lapply(Gamma.Delta[i, 1], function(g){
          McNemar.sens(I, T, g, Gamma.Delta[i, 2])})))
    }; rm(i)
  } else if (type == "continuous"){
    Gamma.eq.Delta <- floor(uniroot(function(g){Wilcoxon.sens(r, g, g) - alpha},
                                    interval = c(1, 50))$root*1000)/1000
    # create a list of Gamma and Delta that will lead to alpha
    Gamma.Delta <- rbind(
      cbind(round(unlist(lapply(rev(seq(ceiling(Gamma.eq.Delta*100)/100, length.out, .01)),
              function(d){uniroot(function(g){Wilcoxon.sens(r, g, d) - alpha},
                                  interval = c(1, 11))$root})), 2),
            rev(seq(ceiling(Gamma.eq.Delta*100)/100, length.out, .01))),
      cbind(seq(ceiling(Gamma.eq.Delta*100)/100, length.out, .01),
            round(unlist(lapply(seq(ceiling(Gamma.eq.Delta*100)/100, length.out, .01),
              function(g){uniroot(function(d){Wilcoxon.sens(r, g, d) - alpha},
                                  interval = c(1, 50))$root})), 2)))
    p.alpha <- NULL
    for (i in 1:dim(Gamma.Delta)[1]){
      p.alpha <- c(p.alpha,
        unlist(lapply(Gamma.Delta[i, 1], function(g){
          Wilcoxon.sens(r, g, Gamma.Delta[i, 2])})))
    }
  }
  # Redefine Gamma.Delta so that no value repeats
  a <- unique(Gamma.Delta[(Gamma.Delta[, 1] < Gamma.eq.Delta), 1])
  b <- c(a, unique(Gamma.Delta[Gamma.Delta[, 2] < Gamma.eq.Delta, 2]))
  Omega.Gamma.Delta <- matrix(0, nrow = length(b), ncol = 2)
  for (i in 1:length(a)){
    Omega.Gamma.Delta[i, ] <-
      matrix(Gamma.Delta[Gamma.Delta[, 1] == a[i], ])[abs(p.alpha[Gamma.Delta[, 1] == a[i]] - alpha)
        == min(abs(p.alpha[Gamma.Delta[, 1] == a[i]] - alpha)), ]
  }
  for (i in (length(a) + 1):length(b)){
    Omega.Gamma.Delta[i, ] <-
      matrix(Gamma.Delta[Gamma.Delta[, 2] == b[i], ])[abs(p.alpha[Gamma.Delta[, 2] == b[i]] - alpha)
        == min(abs(p.alpha[Gamma.Delta[, 2] == b[i]] - alpha)), ]
  }
  colnames(Omega.Gamma.Delta) <- c("Gamma", "Delta")
  # max. p-values corresponding to the values of Gamma and Delta in Omega.Gamma.Delta
  Omega.p.alpha <- NULL
  for (i in 1:dim(Omega.Gamma.Delta)[1]){
    Omega.p.alpha <- c(Omega.p.alpha,
      p.alpha[which(Gamma.Delta[, 1] == Omega.Gamma.Delta[i, 1] &
                    Gamma.Delta[, 2] == Omega.Gamma.Delta[i, 2])])
  }
  # the function returns a list of two objects:
  # (1) values of Gamma and Delta at alpha and (2) their corresponding max. p-values
  list(Gamma.Delta = Omega.Gamma.Delta, p.alpha = Omega.p.alpha)
}

# binary outcome requires:
# I (number of all discordant pairs) and T (number of discordant pairs in which the treated had the higher outcome)
binary.omega <- omega(I = I, T = T, type = "binary")

# Create a huge table (grid) of sensitivity analyses
sens.grid <- function(I = NULL, T = NULL, r = NULL, type, length.out){
  # length.out: largest value of the sensitivity parameters used for the plot
  # type: type of outcome, "binary" or "continuous"
  if (type == "binary" & (!is.numeric(I) | !is.numeric(T))){
    stop("I or T cannot be NULL if binary")
  } else if (type == "continuous" & !is.numeric(r)){
    stop("r cannot be NULL if continuous")
  }
  Gamma <- seq(101, length.out*100, by = 1)/100
  Delta <- seq(101, length.out*100, by = 1)/100
  if (type == "binary"){
    grid <- McNemar.sens(I, T, Gamma = Gamma, Delta = Delta)
  } else if (type == "continuous"){
    grid <- Wilcoxon.sens(r, Gamma = Gamma, Delta = Delta)
  }
  grid
}
binary.sens <- sens.grid(I = I, T = T, type = "binary", length.out = 12)

maketable.2 <- function(){
  # Web Table 1 in the supplementary materials
  sens <- round(McNemar.sens(I, T, c(1, 1.75, 2, 2.21, 2.25, 2.5),
                             c(1, 1.75, 2, 2.21, 2.25, 2.5)), 4)
  list(simultaneous.sens = sens)
}
maketable.2()

maketable.3 <- function(){
  # Web Table 2 in the supplementary materials
  # Only the first row is shown
  Gamma <- c(1.39, 1.88, 2.29, 2.95)
  out <- data.frame(binary.omega$Gamma.Delta[which(binary.omega$Gamma.Delta[, 1] %in% Gamma), ])
  out$p.value <- round(binary.omega$p.alpha[which(binary.omega$Gamma.Delta[, 1] %in% Gamma)], 4)
  out
}
maketable.3()
##############################################################################

# PART 5: Calibrating Sensitivity Analysis ####
# Standardize continuous covariates to mean 0 and sd 1/2 (Gelman, 2008)
stand.age <- (age - mean(age))/(2*sd(age))
stand.income <- (income - mean(income))/(2*sd(income))
binary.lead.outcome <- (lead >= 5)
# For all (Gamma, Delta) in binary.omega$Gamma.Delta, the max. p-value is about .05
# Set k such that k = Gamma = Delta
k <- as.numeric(apply(binary.omega$Gamma.Delta[
  which(abs(binary.omega$Gamma.Delta[, 1] - binary.omega$Gamma.Delta[, 2])
        == min(abs(binary.omega$Gamma.Delta[, 1] - binary.omega$Gamma.Delta[, 2]))), ],
  2, mean)[1])
# reset x with standardized continuous covariates
x <- cbind(stand.age, male, edu.lt9, edu.9to11, edu.hischl, edu.somecol, edu.unknown,
           stand.income, income.mis, black, mexicanam, otherhispan, otherrace)

calibration <- function(y, z, x, Gamma, Delta, type){
  # y: outcome
  # z: binary treatment (0 represents control)
  # x: matrix of covariates
  # Gamma: sensitivity parameter
  # Delta: sensitivity parameter
  # type: type of outcome, "binary" or "continuous"
  likefunc <- function(gamma, z, xmat, beta){
    ### Compute the log-likelihood for P(Z_ij=1|X_ij) (or P(Y_ij=1|X_ij)), where
    ### beta is known and u_ij=1 (0) with probability 1/2 for each X_ij
    marginal.model <- glm(z ~ xmat, family = binomial, x = TRUE)
    xmat.marginal.model <- marginal.model$x
    expit <- function(x){exp(x)/(1 + exp(x))}
    marginal.over.u.prob <- .5*expit(xmat.marginal.model %*% matrix(gamma, ncol = 1)) +
      .5*expit(xmat.marginal.model %*% matrix(gamma, ncol = 1) + beta)
    loglikefunc <- sum(z*log(marginal.over.u.prob/(1 - marginal.over.u.prob)) +
                       log(1 - marginal.over.u.prob))
    loglikefunc
  }
  likefunc.cont <- function(parameter, y, xmat, beta){
    ### variation of likefunc when the outcome is continuous
    ### Compute the log-likelihood for P(Y_ij=y_ij|X_ij), where
    ### beta is known and u_ij=1 (0) with probability 1/2 for each X_ij
    marginal.model <- glm(y ~ xmat, family = gaussian, x = TRUE)
    xmat.marginal.model <- marginal.model$x
    gamma <- parameter[1:(length(parameter) - 1)]
    mu0 <- xmat.marginal.model %*% matrix(gamma, ncol = 1)
    mu1 <- xmat.marginal.model %*% matrix(gamma, ncol = 1) + beta
    sigma2 <- parameter[length(parameter)]
    marginal.over.u.prob <- .5*((1/sqrt(2*pi*sigma2))*exp(-(y - mu0)^2/(2*sigma2))) +
      .5*((1/sqrt(2*pi*sigma2))*exp(-(y - mu1)^2/(2*sigma2)))
    loglikefunc <- sum(log(marginal.over.u.prob))
    loglikefunc
  }
  # Find the treatment model that maximizes the log-likelihood;
  # beta is set to the value of log(k) where Gamma=Delta=k
  treatmentmodel.start <- glm(z ~ x, family = binomial, x = TRUE)
  Xmat.treatment <- treatmentmodel.start$x[, -1]
  treatmentmodel.optim <- optim(coef(treatmentmodel.start), likefunc,
    control = list(maxit = 20000, fnscale = -1),
    z = z, xmat = Xmat.treatment, beta = log(Gamma))
  treatmentmodel.par <- treatmentmodel.optim$par[-1]
  select.s0 <- which(z == 0)  # only use controls
  if (type == "binary"){
    # Find the outcome model that maximizes the log-likelihood;
    # beta is set to the value of log(k) where Gamma=Delta=k
    outcomemodel.start <- glm(y ~ x, family = binomial, subset = select.s0, x = TRUE)
    Xmat.outcome <- outcomemodel.start$x[, -1]
    outcomemodel.optim <- optim(coef(outcomemodel.start), likefunc,
      control = list(maxit = 20000, fnscale = -1),
      z = y[select.s0], xmat = Xmat.outcome, beta = log(Delta))
  } else if (type == "continuous"){
    # Find the outcome model that maximizes the log-likelihood;
    # beta is set to the value of log(4.505) where Gamma=Delta=4.505
    outcomemodel.start <- glm(y ~ x, family = gaussian, subset = select.s0, x = TRUE)
    sigma.sq.start <- sum(residuals(outcomemodel.start)^2)/length(residuals(outcomemodel.start))
    Xmat.outcome <- outcomemodel.start$x[, -1]
    outcomemodel.optim <- optim(c(coef(outcomemodel.start), sigma.sq.start), likefunc.cont,
      control = list(maxit = 20000, fnscale = -1),
      y = y[select.s0], xmat = Xmat.outcome, beta = log(Delta))
  }
  outcomemodel.par <- outcomemodel.optim$par[-1]
  calibrate.obs <- data.frame(cbind(treatmentmodel.par, outcomemodel.par,
                                    exp(treatmentmodel.par), exp(outcomemodel.par)))
  colnames(calibrate.obs) <- c("gamma", "delta", "Gamma", "Delta")
  rownames(calibrate.obs) <- colnames(x)
  calibrate.obs
}
McNemar.calibrate.obs <- calibration(binary.lead.outcome, smoking, x,
                                     Gamma = k, Delta = k, "binary")

maketable.4 <- function(){
  # Web Table 3 in the supplementary materials
  out <- round(McNemar.calibrate.obs[c("stand.age", "stand.income", "male", "edu.lt9",
    "edu.9to11", "edu.hischl", "edu.somecol", "black", "mexicanam", "otherhispan",
    "otherrace"), ], 4)
  out
}
maketable.4()

References

Gastwirth, J. L., Krieger, A. M., and Rosenbaum, P. R. (1998). Dual and simultaneous sensitivity analysis for matched pairs. Biometrika 85, 907-920.

Gastwirth, J. L., Krieger, A. M., and Rosenbaum, P. R. (2000). Asymptotic separability in sensitivity analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62, 545-555.

[Web Figure 1 about here.]

Web Figure 1: Calibration of the simultaneous sensitivity analysis to observed covariates (i.e., age, income-to-poverty level, gender and race) in NHANES data given (Γ, Δ) = {(1.44, 8.52), (1.61, 4.12), (2.21, 2.21), (6.95, 1.47), (11.28, 1.41)} ⊂ Ω.05. Panels show Age (2 SD difference), Income-to-poverty level (2 SD difference), Male vs. Female, Less than 9th grade vs. College, 9th-11th grade vs. College, High school vs. College, Some College vs. College, Black vs. White, Mexican American vs. White, Other Hispanic vs. White, and Other races vs. White. The solid curves represent values of (Γ, Δ) ∈ Ω.05 where the maximum p-value is approximately equal to 0.05. The shaded area represents values of (Γ, Δ) where the maximum p-value is less than 0.05. The white area represents values of (Γ, Δ) ∈ Ω+.05 where the maximum p-value is greater than 0.05. The areas (Γ, Δ) ∈ {[0, 1] × [0, 1], [0, 1] × [1, 12], [1, 12] × [0, 1]} are magnified to enable clear display of covariates in these areas.

[Web Figure 2 about here.]

Web Figure 2: Sensitivity of estimates to the choice of (Γ, Δ) ∈ Ω.05 for (a) a continuous covariate X1; and (b) a binary covariate X2, among 10 simulated data sets. Solid curves are Ω.05 from the 10 simulated data sets. Five symbols (square, circle, diamond, up-triangle, and down-triangle) on Ω.05 are 5 arbitrary choices of (Γ, Δ) ∈ Ω.05 including the default choice of Γ = Δ. Symbols off the curves are estimates for the observed covariate from the 10 simulated data sets. Dashed lines are Γ = 1 or Δ = 1.

Web Table 1: The simultaneous sensitivity analysis for NHANES data of smoking and high blood lead: maximum p-values for McNemar's test statistic for hidden bias of various magnitudes, with rows Γ ∈ {1, 1.75, 2, 2.21, 2.25, 2.5} and columns Δ ∈ {1, 1.75, 2, 2.21, 2.25, 2.5}.

Web Table 2: One hundred randomly selected values of (Γ, Δ) ∈ Ω.05 and their corresponding maximum p-values for McNemar's test statistic, arranged in four column blocks of (Γ, Δ, p-value).

Web Table 3: Maximum likelihood estimates for (θ, φ) and (e^θ, e^φ) given (γ, δ) = {log(2.21), log(2.21)} from NHANES data of smoking and high blood lead. Columns give the estimates θ̂γ and φ̂δ and the corresponding e^θ̂γ and e^φ̂δ for each observed covariate: Age (old vs. young)*; Income-to-poverty level (high vs. low)*; Male vs. Female; Education: Less than 9th grade vs. College, 9th-11th grade vs. College, High school graduate vs. College, Some college vs. College; Race: Black vs. White, Mexican American vs. White, Other Hispanic vs. White, Other races vs. White.
* comparisons for a 2-standard-deviation difference

Web Table 4: Simulation studies for estimates and their standard errors of observed covariates obtained from both full and partial data under the null (H0: A = 0) and various alternative hypotheses (H1: A = 50, 75, and 100) from 1,000 replicates, where A is the effect of the treatment on the treated subjects. For each hypothesis (HYP), the table reports, for covariates (COV) X1 and X2 with parameters (PAR) φ1 = 0.2 and φ2 = 0.2, the Monte Carlo estimate (MCEST) and Monte Carlo standard error (MCSE), separately for Method I (use full data (y(0), X)) and Method II (use partial data (y(0), X) from subjects whose Zij = 0).
HYP: Hypothesis; COV: Covariate; PAR: Parameter; MCEST: Monte Carlo Estimate; MCSE: Monte Carlo Standard Error


Using Split Samples and Evidence Factors in an Observational Study of Neonatal Outcomes University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2011 Using Split Samples and Evidence Factors in an Observational Study of Neonatal Outcomes Kai Zhang University

More information

STA 2101/442 Assignment 3 1

STA 2101/442 Assignment 3 1 STA 2101/442 Assignment 3 1 These questions are practice for the midterm and final exam, and are not to be handed in. 1. Suppose X 1,..., X n are a random sample from a distribution with mean µ and variance

More information

R Package glmm: Likelihood-Based Inference for Generalized Linear Mixed Models

R Package glmm: Likelihood-Based Inference for Generalized Linear Mixed Models R Package glmm: Likelihood-Based Inference for Generalized Linear Mixed Models Christina Knudson, Ph.D. University of St. Thomas user!2017 Reviewing the Linear Model The usual linear model assumptions:

More information

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014 LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R Liang (Sally) Shan Nov. 4, 2014 L Laboratory for Interdisciplinary Statistical Analysis LISA helps VT researchers

More information