Does low participation in cohort studies induce bias? Additional material


Content:
- A heuristic proof of the formula for the asymptotic standard error
- A description of the simulation study
- The interpretation of the estimates and the confidence intervals
- Additional table: Participation rates and crude relative odds ratios for 3 exposure-outcome associations

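A heuristic proof of the formula for the asymptotic standard error

We here present a heuristic proof of the formula for the asymptotic variance of the difference between the estimate based on a random subpopulation and that based on the total population, i.e.

$$\operatorname{Var}_{as}\big(\hat\theta_{Sub} - \hat\theta_{Tot}\big) = \operatorname{Var}_{as}\big(\hat\theta_{Sub}\big) - \operatorname{Var}_{as}\big(\hat\theta_{Tot}\big) \qquad (1)$$

First, we make a general observation. Let $Z$ be the weighted average of two random variables $X$ and $Y$, using the inverses of their variances as weights:

$$Z = \big(\operatorname{Var}(X)^{-1} + \operatorname{Var}(Y)^{-1}\big)^{-1}\big(\operatorname{Var}(X)^{-1} X + \operatorname{Var}(Y)^{-1} Y\big).$$

If $X$ and $Y$ are uncorrelated, then

$$\operatorname{Var}(X - Z) = \operatorname{Var}(X) - \operatorname{Var}(Z).$$

This surprising result follows as

$$\operatorname{Var}(Z) = \operatorname{Cov}(Z, X) = \big(\operatorname{Var}(X)^{-1} + \operatorname{Var}(Y)^{-1}\big)^{-1}$$

and

$$\operatorname{Var}(X - Z) = \operatorname{Var}(X) + \operatorname{Var}(Z) - 2\operatorname{Cov}(Z, X).$$

Now consider two subpopulations 1 and 2, and let $\hat\theta_{Sub_i}$ denote the maximum likelihood estimate based on the data in the $i$th subpopulation. If we let

$$\tilde\theta = \big(\operatorname{Var}_{as}(\hat\theta_{Sub_1})^{-1} + \operatorname{Var}_{as}(\hat\theta_{Sub_2})^{-1}\big)^{-1}\big(\operatorname{Var}_{as}(\hat\theta_{Sub_1})^{-1}\,\hat\theta_{Sub_1} + \operatorname{Var}_{as}(\hat\theta_{Sub_2})^{-1}\,\hat\theta_{Sub_2}\big)$$

then it follows that

$$\operatorname{Var}_{as}\big(\hat\theta_{Sub_1} - \tilde\theta\big) = \operatorname{Var}_{as}\big(\hat\theta_{Sub_1}\big) - \operatorname{Var}_{as}\big(\tilde\theta\big). \qquad (2)$$

Finally, we note that if the two subpopulations are a random partitioning of the total population, then $\tilde\theta$ is asymptotically equal to the maximum likelihood estimate based on the total population, $\hat\theta_{Tot}$. Taking subpopulation 1 as the random subpopulation, equation 2 then gives equation 1.

It is important to note that equation 1 does not hold in general.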
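The general observation about the inverse-variance weighted average Z can be checked with exact arithmetic. The Python sketch below is illustrative only: the function name and the variances 4 and 12 are assumptions chosen for the example, not quantities from the study.

```python
from fractions import Fraction


def weighted_average_variances(var_x, var_y):
    """Exact variance algebra for Z, the inverse-variance weighted
    average of two uncorrelated random variables X and Y."""
    var_x, var_y = Fraction(var_x), Fraction(var_y)
    total_precision = 1 / var_x + 1 / var_y
    w_x = (1 / var_x) / total_precision  # weight given to X
    w_y = (1 / var_y) / total_precision  # weight given to Y
    # Var(Z) for uncorrelated X and Y
    var_z = w_x**2 * var_x + w_y**2 * var_y
    # Cov(Z, X): only the X term of Z contributes
    cov_zx = w_x * var_x
    # Var(X - Z) = Var(X) + Var(Z) - 2 Cov(Z, X)
    var_diff = var_x + var_z - 2 * cov_zx
    return var_z, cov_zx, var_diff


# Hypothetical variances, chosen only to keep the arithmetic simple.
var_z, cov_zx, var_diff = weighted_average_variances(4, 12)
print(var_z, cov_zx, var_diff)
```

With these inputs Var(Z) and Cov(Z, X) coincide, so Var(X - Z) reduces to Var(X) - Var(Z), as the argument requires.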

Description of the simulation study

The performance of the two methods for calculating confidence intervals was assessed in a small simulation study covering scenarios identical or close to the actual data included in the adjusted analyses of IVF and preterm birth, and of smoking and SGA. The simulation set-up required specification of the distribution of the covariate pattern in the source population, the participation rate within each covariate pattern, and the dependence of the outcome on the covariates both among the participants and among the non-participants.

For IVF and preterm birth the relevant model involved 96 = 2 × 2 × 3 × 2 × 2 different covariate patterns. In the simulations we used the observed distribution, with the modification that patterns with fewer than 20 observations were assigned a small but positive probability obtained by smoothing with a Poisson regression model. For each covariate pattern the participation rate was set equal to the observed participation rate, except for rare patterns, where the participation rate was found by smoothing with a logistic regression model. For the associations between the covariates and the risk of preterm birth among the participants and among the non-participants we used logistic regression models of the same form as those used in the analysis of the actual data.

We considered three scenarios reflecting different choices of the coefficients in the logistic model describing the relation between the covariates and preterm birth. In the first scenario the coefficients were identical to the estimates found in the actual data. In the two other scenarios we modified the adjusted odds ratios between IVF and preterm birth in order to consider other values of ROR.

Since the source population is a mixture of participants and non-participants, the risk of preterm birth in the source population will not follow a logistic regression model. No closed-form expressions exist for the coefficients in the approximating logistic model, so these were found by calculating the expected probabilities and using these as weights in estimating the logistic regression model.

For smoking and SGA the model involved two levels of exposure and 48 = 3 × 2 × 2 × 2 different covariate patterns. Here we also considered three scenarios, established in a way similar to that used for IVF and preterm birth.

For each scenario we considered source population sizes of 25,000 and 50,000. Each combination of scenario and sample size was simulated 5,000 times. The results of the simulations were summarized by the observed coverage probability of a nominal 95% interval for ROR, i.e. by computing how often the true ROR was contained in the interval estimated ROR × exp(±1.96 × se).

All analyses and simulations were made using Stata version 8.2.
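The coverage summary described above can be sketched in Python as follows. This is a minimal illustration, not the Stata code used in the study: it assumes that se is the standard error of the log ROR, and the function names and the four replicate values are made up for the example.

```python
import math


def ror_interval(ror_hat, se_log_ror, z=1.96):
    """Nominal 95% interval: estimated ROR times exp(+/- z * se).
    Assumes se is the standard error on the log-ROR scale."""
    factor = math.exp(z * se_log_ror)
    return ror_hat / factor, ror_hat * factor


def coverage(true_ror, replicates):
    """Share of replicates whose interval contains the true ROR.
    `replicates` is a list of (estimated ROR, se of log ROR) pairs."""
    hits = 0
    for ror_hat, se_log_ror in replicates:
        lower, upper = ror_interval(ror_hat, se_log_ror)
        if lower <= true_ror <= upper:
            hits += 1
    return hits / len(replicates)


# Hypothetical replicates, for illustration only (not study results).
replicates = [(1.0, 0.1), (1.5, 0.1), (0.9, 0.2), (0.6, 0.1)]
observed_coverage = coverage(1.0, replicates)
print(observed_coverage)
```

In a full simulation, each of the 5,000 replicates per scenario would contribute one (estimate, se) pair, and the observed coverage would be compared with the nominal 95%.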

Interpretation of the estimates and the confidence intervals

We here comment on the interpretation of the estimates and their confidence intervals. As an example, consider the proportion of primiparae. In the data used in the study, primiparae made up 45.7% of the source population and 50.4% of the participants in the DNBC, giving an observed ratio of 1.103. That is, among the pregnancies in North Jutland County and Aarhus Municipality, we found a 10.3% overrepresentation of primiparae among the participants in the DNBC. The confidence interval (1.089-1.117) presented in the paper reflects the uncertainty of this estimate when considering similar cohort studies in a similar setting. This is also the usual interpretation of a confidence interval.

Another relevant question is: how large is the over- or underrepresentation of primiparae in the entire DNBC? Since we believe that the pregnancies in North Jutland County and Aarhus Municipality are a representative sample of the eligible population of the DNBC, the ratio of 1.10 is a valid estimate of this quantity. The eligible population of the DNBC is finite (approximately 310,000 pregnancies), so this is a finite population problem. An approximate confidence interval in such a situation can be obtained by multiplying the standard error by the finite population adjustment factor $\sqrt{1 - \text{sample fraction}} = \sqrt{1 - 0.16} = 0.92$. This gives the confidence interval (1.090-1.115).

The observed adjusted ROR for BMI and stillbirth was 0.97, with the infinite-population confidence interval (0.48-1.96) (Table 2). The confidence interval for ROR in the entire DNBC would be slightly narrower, namely (0.51-1.85).
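The finite population adjustment can be illustrated numerically. The Python sketch below reads the adjustment factor as the square root of (1 - sample fraction), which matches the quoted 0.92 for a sample fraction of 0.16. It also assumes that the intervals are symmetric on the log scale; that assumption and the function name are introduced here and are not stated in the text.

```python
import math


def finite_population_interval(estimate, lower, upper, sample_fraction):
    """Shrink a ratio-scale confidence interval by the finite population
    factor sqrt(1 - sample fraction), working on the log scale.
    Assumption: the original interval is symmetric in log(estimate)."""
    factor = math.sqrt(1 - sample_fraction)
    # Half-width of the original interval on the log scale
    half_width = (math.log(upper) - math.log(lower)) / 2
    centre = math.log(estimate)
    return (math.exp(centre - factor * half_width),
            math.exp(centre + factor * half_width))


# Ratio of primiparae: 1.103 (1.089-1.117), sample fraction 0.16
primiparae = finite_population_interval(1.103, 1.089, 1.117, 0.16)
# Adjusted ROR for BMI and stillbirth: 0.97 (0.48-1.96)
bmi_stillbirth = finite_population_interval(0.97, 0.48, 1.96, 0.16)
print(primiparae, bmi_stillbirth)
```

Because the published estimates and limits are rounded, the sketch reproduces the adjusted intervals only approximately, to within about 0.001 for the primiparae ratio.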

Additional table. Participation rates and crude relative odds ratios for 3 exposure-outcome associations*

                          Participation rates                 Crude relative odds ratio (95% CI)
                          (by outcome)                        ROR      Equation†      Bootstrap‡

IVF and preterm birth     Preterm      Term         Total
  No treatment            30%          32%          32%       1.00§
  IVF                     34%          38%          38%       0.94     (0.68-1.28)    (0.67-1.31)
  Total                   30%          32%          32%

Smoking and SGA           SGA          not SGA      Total
  Non-smoker              29%          33%          33%       1.00§
  0-10 cig/day            23%          27%          27%       0.98     (0.79-1.21)    (0.80-1.20)
  >10 cig/day             24%          22%          22%       1.24     (0.95-1.63)    (0.96-1.61)
  Total                   26%          32%          32%

BMI and stillbirth        Stillbirth   Live birth   Total
  <18.5                   -            27%          27%       -
  18.5-24.9               42%          33%          33%       1.00§
  25.0-29.9               39%          31%          31%       1.01     (0.58-1.73)    (0.57-1.79)
  30+                     33%          28%          28%       0.94     (0.47-1.85)    (0.45-1.93)
  Total                   39%          32%          32%

* Based on the populations displayed in Table 2. Associations are: in vitro fertilization (IVF) and preterm birth, smoking and birth of a small-for-gestational-age infant (SGA), and body mass index and antepartum stillbirth. ROR, relative odds ratio; CI, confidence interval; ref., reference.
† Computed using equation 1, see text for details.
‡ Based on a non-parametric bootstrap sample of size 200.
§ Reference category.

Participation was in general non-differential, i.e. the dependence on the outcome category was consistent across exposure categories. The only exception was heavy smokers, where the highest bias was also observed. Here the pattern was the reverse of that in the reference category, with a higher participation rate in births with SGA than in births without SGA.
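The link between the participation rates and the crude ROR in the table can be sketched in Python. The sketch assumes that ROR is the odds ratio among participants divided by the odds ratio in the source population; this definition is not stated in this supplement. Under that assumption the crude ROR equals the cross-product ratio of the four participation rates. The function name is hypothetical.

```python
def crude_ror(p_exposed_case, p_exposed_noncase, p_ref_case, p_ref_noncase):
    """Cross-product ratio of participation rates.
    Assumption: ROR = OR among participants / OR in the source population,
    in which case the cell counts cancel and only the rates remain."""
    return (p_exposed_case / p_exposed_noncase) / (p_ref_case / p_ref_noncase)


# Heavy smokers (>10 cig/day) versus non-smokers, rates from the table:
# SGA 24% and not SGA 22% against SGA 29% and not SGA 33%.
heavy_smokers = crude_ror(0.24, 0.22, 0.29, 0.33)
print(round(heavy_smokers, 2))

# Non-differential participation: the outcome dependence is the same in
# both exposure groups, so the ratio is 1 (hypothetical rates).
non_differential = crude_ror(0.30, 0.30, 0.50, 0.50)
```

Because the percentages in the table are rounded, this calculation only approximates the tabulated ROR values. The heavy-smoker row agrees to two decimals, while other rows differ slightly in the second decimal.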