A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness


A. Linero and M. Daniels (UF, UT-Austin)
SRC 2014, Galveston, TX

Outline
1. Background
2. Working model
3. Extrapolation distribution
4. Analysis of Schizophrenia Trial
5. Conclusions

Notation
- y: full data response
- y_mis: missing data
- y_obs: observed data
- r: missingness indicators
- S: number of observed responses
- p_s(·) = p(· | S = s)
- Ȳ_s = (Y_1, ..., Y_s)
- J: number of observation times

Quick review of missing data I
- Nonignorable missingness: need to model p(y, s).
- Ignorable missingness: MAR, with distinct parameters for p(y | θ) and p(s | y, γ) and a priori independence.
- Of interest: p(y | ω) = ∫ p(y | s; ω) dF(s; ω).

Quick review of missing data II
- Extrapolation factorization: p(y, s; ω) = p(y_mis | y_obs, s; ω) p(y_obs, s; ω).
- Unidentified parameters, sensitivity analysis, and informative priors (per the NAS 2010 report) are not possible with parametric selection models (SMs) and shared-parameter models (SPMs).

General approach
p(y, s; ω) = p(y_mis | y_obs, s; ω) p(y_obs, s; ω)
- Specify a working model for the joint distribution of the responses and the dropout process (S).
- Extract the observed data model from this working model.
- Identify the extrapolation distribution using priors grounded in identifying restrictions.

Why this approach?
- Avoids parametric assumptions on the observed data distribution.
- Scales up in a reasonable way.
- Allows sensitivity parameters.
- Allows a fair characterization of uncertainty, since it stays within the Bayesian paradigm (which offers advantages over semiparametric doubly robust approaches).

What is a Working model? I
We begin by specifying a working model for the joint distribution of the response and dropout processes.
Definition: For a model p(y, s | ω), a model p*(y, s | ω) is called a working model if, for all s ∈ {1, 2, ..., J},
p(y_obs, s | ω) = ∫ p*(y_obs, y_mis, s | ω) dy_mis.  (1)
A given specification of p*(y, s | ω) identifies p(y, s | ω) only up to p(y_obs, s | ω), leaving p(y_mis | y_obs, s, ω) unidentified.

What is a Working model? II
For the purposes of likelihood-based inference, it suffices to fit p*(y, s | ω) to the data.
Proposition: A model p(y, s | ω) and corresponding working model p*(y, s | ω) have the same observed data likelihood.

Implications of the working model
1. We can focus on specifying p* to fit the data well without affecting the extrapolation distribution p(y_mis | y_obs, s, ω).
2. It is often conceptually easier to design p* to induce the desired sharing of information across dropout times, without needing to take precautions to leave the extrapolation distribution unidentified, than to specify p directly.
3. It is also convenient because it allows us to specify a single model of dimension J (as opposed to one model for each pattern (Ȳ_s, S)).
4. For computational purposes, Y_mis may be imputed via data augmentation using p* rather than p, which is substantially simpler.

Form of the Working model I
We take p* to be a mixture of models:
- f(y | θ_1): model for the outcome process;
- g(s | y, θ_2): model for the dropout process conditional on the outcome process;
p*(y, s | ω) = ∫ f(y | θ_1) g(s | y, θ_2) F(dθ).
- F is modeled with a Dirichlet process prior with base distribution H and mass α > 0.
- We specify g(s | y, θ_2) as MAR, so the resulting working model is a mixture of MAR models.

Form of the Working model II
The specification on the previous slide is equivalent to the stick-breaking construction (Sethuraman, 1994), which shows that the Dirichlet process mixture is a prior on latent class models:
p*(y, s | ω) = Σ_{k=1}^∞ β_k f(y | θ_1^(k)) g(s | y, θ_2^(k)),  (2)
where β_k = β'_k ∏_{j<k} (1 − β'_j), β'_j ~ Beta(1, α), and θ^(k) ~iid H(dθ).

Choice of the models in the Working model
- f(y | θ_1): a normal kernel (for continuous data).
- g(s | y, θ_2): a sequence of regressions for the discrete dropout hazards, or g(s | y, θ_2) = g(s | θ_2), a categorical or ordinal distribution on {1, ..., J}. A convenient choice of g(s | θ_2) in this case is the ordinal probit model.
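To make the working model concrete, here is a minimal simulation sketch of a draw from a truncated version of (2), with a normal kernel for f and a y-independent ordinal probit for g (the categorical/ordinal option above). The truncation level K, the hyperparameters, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
J, K, alpha = 6, 20, 1.0  # observation times, truncation level, DP mass (illustrative)

# Truncated stick-breaking: beta_k = beta'_k * prod_{j<k} (1 - beta'_j), with beta'_K = 1
b = rng.beta(1.0, alpha, size=K)
b[-1] = 1.0
beta = b * np.concatenate(([1.0], np.cumprod(1.0 - b[:-1])))

# Latent-class parameters theta^(k) ~ H: normal kernel for y, ordinal probit for S
mu = rng.normal(0.0, 2.0, size=(K, J))                         # class-specific means
sigma = np.ones((K, J))                                        # class-specific sds
cuts = np.sort(rng.normal(0.0, 1.0, size=(K, J - 1)), axis=1)  # probit cutpoints

def draw_working_model(n):
    """Draw (y, S) from p*(y, s) = sum_k beta_k f(y | theta1_k) g(s | theta2_k)."""
    k = rng.choice(K, size=n, p=beta)           # latent class labels
    y = rng.normal(mu[k], sigma[k])             # y | k ~ N(mu_k, diag(sigma_k^2))
    z = rng.normal(size=n)                      # ordinal probit latent variable
    s = 1 + (z[:, None] > cuts[k]).sum(axis=1)  # S in {1, ..., J}
    return y, s

y, s = draw_working_model(500)
y_obs = [yi[:si] for yi, si in zip(y, s)]       # monotone dropout: first S responses observed
```

Fitting then proceeds by data augmentation over the class labels and the missing responses, as described under Inference below.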

Nonparametric?
- The proposed method is nonparametric in the sense of having large support in the space of random probability measures, but draws from the Dirichlet process mixture resemble finite mixture models.
- In light of the curse of dimensionality, it is not feasible to estimate the distribution of longitudinal data in a fully nonparametric manner for even moderate J.
- The Dirichlet process mixture combats this by shrinking towards latent class models with a small number of dominant components.

Extrapolation distribution I
We now need to specify the extrapolation distribution p(y_mis | y_obs, s, ω).
- Identifying restrictions, which express the extrapolation distribution as a function of the observed data distribution, provide a natural starting point.
- The available case missing value (ACMV) restriction sets
p_k(y_j | ȳ_{j−1}, ω) = p_{≥j}(y_j | ȳ_{j−1}, ω), for all k < j and 2 ≤ j ≤ J;
it is equivalent to the MAR restriction under monotone missingness (Molenberghs et al., 1998),
P(S = s | Y, ω) = P(S = s | Ȳ_s, ω).

Extrapolation distribution II
A subclass of identifying restrictions is generated by the non-future dependence (NFD) assumption (Kenward et al., 2003): the probability of dropout at time s depends only on ȳ_{s+1}. NFD holds if and only if
P(S = s | Y, ω) = P(S = s | Ȳ_{s+1}, ω).  (3)
NFD implies
p_k(y_j | ȳ_{j−1}, ω) = p_{≥j−1}(y_j | ȳ_{j−1}, ω),  (4)
for k < j − 1 and 2 < j ≤ J, but places no restriction on p_{j−1}(y_j | ȳ_{j−1}, ω). MAR results when p_{j−1}(y_j | ȳ_{j−1}, ω) = p_{≥j}(y_j | ȳ_{j−1}, ω).

Extrapolation distribution III

        j = 1     j = 2           j = 3              j = 4
S = 1   p_1(y_1)  ?               p_{≥2}(y_3 | ȳ_2)  p_{≥3}(y_4 | ȳ_3)
S = 2   p_2(y_1)  p_2(y_2 | y_1)  ?                  p_{≥3}(y_4 | ȳ_3)
S = 3   p_3(y_1)  p_3(y_2 | y_1)  p_3(y_3 | ȳ_2)     ?
S = 4   p_4(y_1)  p_4(y_2 | y_1)  p_4(y_3 | ȳ_2)     p_4(y_4 | ȳ_3)

Table: Schematic representation of NFD when J = 4. Distributions above the dividing line (the "?" cells and the entries to their right) are not identified by the observed data.

Identification within NFD I
We consider two methods to identify the distribution p_{j−1}(y_j | ȳ_{j−1}, ω) under NFD.
1. Posit a transformation T_j(y_j; ȳ_{j−1}, ξ_j) such that
[Y_j | Ȳ_{j−1}, S = j − 1, ω] =_d [T_j(Y_j; Ȳ_{j−1}, ξ_j) | Ȳ_{j−1}, S ≥ j, ω],
where =_d denotes equality in distribution. If T_j is chosen so that T_j(Y_j; Ȳ_{j−1}, 0) = Y_j, then deviations of ξ_j from 0 represent deviations of the assumed model from MAR. Wang and Daniels (2011) implicitly take this approach.

Identification within NFD II
2. An exponential tilting assumption (Birmingham et al., 2003):
p_{j−1}(y_j | ȳ_{j−1}, ω) = p_{≥j}(y_j | ȳ_{j−1}, ω) e^{q_j(ȳ_j)} / ∫ p_{≥j}(y_j | ȳ_{j−1}, ω) e^{q_j(ȳ_j)} dy_j.  (5)
The function q_j(ȳ_j) characterizes, on the log-odds-ratio scale, the influence of ȳ_j on the probability of dropout at time j − 1 conditional on S ≥ j − 1. When a normal kernel is used to model the outcome, q_j(ȳ_j) = γ_j y_j leads to tractable inference.
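To see why the normal kernel makes the tilt tractable, consider a single normal component; completing the square shows that tilting by e^{γ_j y_j} simply shifts the mean (a standard calculation, sketched here):

```latex
p_{\ge j}(y_j \mid \bar y_{j-1}, \omega) = N(\mu_j, \sigma_j^2)
\;\Longrightarrow\;
p_{j-1}(y_j \mid \bar y_{j-1}, \omega)
\propto \exp\!\left\{-\tfrac{(y_j-\mu_j)^2}{2\sigma_j^2} + \gamma_j y_j\right\}
\propto \exp\!\left\{-\tfrac{\bigl(y_j-(\mu_j+\gamma_j\sigma_j^2)\bigr)^2}{2\sigma_j^2}\right\},
```

so the tilted conditional is again normal, N(μ_j + γ_j σ_j², σ_j²). For a mixture of normals, each component mean shifts by γ_j σ_k² and the mixture weights are reweighted by the factors exp(γ_j μ_k + γ_j² σ_k²/2), so the normalizing integral in (5) remains available in closed form.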

Analysis of Schizophrenia trial I
- We analyze data from a clinical trial designed to determine the efficacy of a new drug for the treatment of acute schizophrenia.
- The response was measured at baseline, Day 4, and Weeks 1, 2, 3, and 4 (so J = 6).
- Three treatment arms (V): test drug (81 subjects), active control (45 subjects), and placebo (78 subjects).
- Primary endpoint: mean change from baseline in the Positive and Negative Syndrome Scale (PANSS) after Week 4, η_v = E(Y_6 − Y_1 | V = v, ω).

Analysis of Schizophrenia trial II
- Moderate dropout: 33%, 19%, and 25% for V = 1, 2, 3.
- Many dropout patterns were sparse; for example, in the active control arm, only a single subject dropped out at each of the times j = 1, 2, 3.

Informative priors under MNAR I
We compare inferences under MAR to inferences under NFD. To complete the NFD specification, we first introduce sensitivity parameters ξ_j, common across treatments, such that
[Y_j | Ȳ_{j−1}, S = j − 1, V = v, ω] =_d [Y_j + ξ_j | Ȳ_{j−1}, S ≥ j, V = v, ω].
MAR corresponds to ξ_j ≡ 0 for all j. We specify an informative prior ξ_j ~iid Uniform(0, 8), so deviations from MAR:
- are at most roughly one residual standard deviation (8);
- are restricted to be positive, reflecting the a priori belief that those who dropped out have higher PANSS scores on average.
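As a sketch of how the shift enters imputation under NFD: at a subject's first missing time the unidentified (shifted) conditional applies directly; at later times the NFD conditional p_{≥j−1}(y_j | ȳ_{j−1}) is a mixture of the shifted piece and the MAR piece. The helper names (draw_next, prob_drop_next) are hypothetical stand-ins for conditionals extracted from the fitted working model.

```python
import numpy as np

J = 6

def impute_subject(y_obs, s, draw_next, prob_drop_next, xi, rng):
    """Impute Y_{s+1}, ..., Y_J for a subject with S = s under NFD.

    draw_next(j, y_past): assumed draw from the MAR-identified conditional
        p_{>=j}(y_j | y_past) of the fitted working model.
    prob_drop_next(j, y_past): assumed P(S = j-1 | y_past, S >= j-1).
    xi: sensitivity parameters; [Y_j | S = j-1] =_d [Y_j + xi_j | S >= j].
    """
    y = list(y_obs)
    for j in range(s + 1, J + 1):
        yj = draw_next(j, np.array(y))
        # At j = s+1 we know S = j-1, so the shifted piece applies; afterwards,
        # it is mixed in with probability P(S = j-1 | ybar_{j-1}, S >= j-1).
        if j == s + 1 or rng.uniform() < prob_drop_next(j, np.array(y)):
            yj += xi[j - 1]
        y.append(yj)
    return np.array(y)

rng = np.random.default_rng(1)
xi = rng.uniform(0.0, 8.0, size=J)  # one prior draw: xi_j ~iid Uniform(0, 8)
```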

Informative priors under MNAR II
- We also compare allowing different missingness mechanisms across treatments.
- Fewer of the dropouts in the active control arm dropped out due to lack of efficacy, compared with the placebo and test arms.
- As a result, we consider MAR for the active control arm and the MNAR priors for the other two arms.

Informative priors under MNAR III

Model       η_1 − η_3          η_2 − η_3            DIC
MAR model
  DP        −1.9 (−8.9, 5.2)   −6.5 (−14.1, 1.2)    7762
DP sensitivity analysis
  MNAR      −1.7 (−9.3, 5.9)   −7.4 (−15.3, 0.6)    7762
  MNAR-2    −1.7 (−9.3, 5.9)   −8.3 (−16.3, −0.3)   7762

Table: Inferences under different models for the schizophrenia data. MNAR refers to the model which takes ξ_j ~ Uniform(0, 8) for all treatments; MNAR-2 assumes ξ_j ≡ 0 for the active control arm only. Note the significant credible interval for η_2 − η_3 under MNAR-2.

Informative priors under MNAR IV Active MAR 0 0.15 0.10 5 0.05 10 0.00 Posterior Density 0.15 0.10 0.05 0.00 Placebo Test Missingness MAR MNAR Mean Change from Baseline 15 20 0 5 MNAR Treatment Active Placebo Test 0.15 10 0.10 15 0.05 0.00 30 20 10 0 10 Mean Change from Baseline at Four Weeks 20 Baseline Day 4 Week 1 Week 2 Week 3 Week 4 Time Figure : Density estimates for η v (left) and posterior estimate of mean change from baseline over time under MAR and MNAR.

Conclusions I
We have introduced a general methodology for conducting nonparametric Bayesian inference under nonignorable missingness. It:
- allows a clear separation of the observed data distribution and the extrapolation distribution;
- allows both flexible modeling of the observed data and flexible specification of the extrapolation distribution;
- provides robustness similar to semiparametric AIPW (simulations not shown).

Conclusions II
There is nothing particular about the Dirichlet process in our specification; in principle,
p(y_obs, s | ω) = ∫ p*(y_obs, y_mis, s | ω) dy_mis
could be applied to any joint distribution p*(y, s | ω), provided that inference is tractable.

Conclusions III
- Model complexity is controlled both by the prior on α, the mass parameter of the Dirichlet process, and by a hierarchical specification of the prior on H forcing the latent classes to be similar.
- An R package is available to implement these models.
- Intermittent missingness is implicitly handled under an assumption of partial ignorability.

Extensions
- Extend the work to applications with covariates. Often covariates are used to help with imputation of missing values, or to make the MAR assumption more plausible, but are not of primary interest (i.e., auxiliary covariates). Incorporate covariates in a semiparametric fashion.
- Extend these methods to non-monotone missingness.
- Extend these methods to more complex data (e.g., multivariate longitudinal data); the challenge is specification of the extrapolation distribution.

Inference I
We work with an approximation of the Dirichlet process mixture based on truncating the stick-breaking construction at a fixed K by setting β'_K ≡ 1 (Ishwaran and James [IJ], 2001). We break inference into two steps:
1. Draw a sample of (θ^(1), β_1, ..., θ^(K), β_K) from the posterior distribution given the observed data, using the working model p*(y, s | ω).
2. Calculate the posterior distribution of desired functionals of the true distribution p(y | ω).
For step 1 we use a data-augmentation scheme similar to IJ's, but which also includes augmenting the missing data.

Inference II
Once we have a sample from the posterior distribution of (θ^(k), β_k) from step 1, scientific interest often lies in functionals of the form
E[t(Y) | ω] = ∫ t(y) p(y | ω) dy.
Define φ_j ≡ P(S = j | ω) = Σ_{k=1}^K β_k g(j | θ_2^(k)). Then
E[t(Y) | ω] = Σ_{j=1}^J φ_j E[t(Y) | S = j, ω].
While φ_j is typically available in closed form given ω, the expectation E[t(Y) | S = j, ω] has a complicated form and depends on the missing data assumption.

Inference III
Under MAR,
E[t(Y) | S = j, ω] = ∫ t(y) p_j(ȳ_j | ω) p_{≥j+1}(y_{j+1} | ȳ_j, ω) p_{≥j+2}(y_{j+2} | ȳ_{j+1}, ω) ··· p_{≥J}(y_J | ȳ_{J−1}, ω) dy.
To calculate E[t(Y) | ω] we use Monte Carlo integration, sampling pseudo-data Y*_1, ..., Y*_N and forming the average (1/N) Σ_{i=1}^N t(Y*_i) (easy under NFD in general).
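A minimal sketch of this Monte Carlo step (helper names hypothetical): draw a pattern from (φ_1, ..., φ_J), draw the observed portion from the fitted model, complete it under the chosen identifying restriction, and average.

```python
import numpy as np

def functional_mc(t, draw_pattern, draw_obs, complete, N=10_000):
    """Monte Carlo estimate of E[t(Y) | omega] = sum_j phi_j E[t(Y) | S = j, omega].

    draw_pattern(): draws S = j with probability phi_j;
    draw_obs(j): draws (Y_1, ..., Y_j) from p_j(ybar_j | omega);
    complete(y_obs, j): fills in Y_{j+1}, ..., Y_J under MAR or NFD.
    """
    vals = np.empty(N)
    for i in range(N):
        j = draw_pattern()            # S = j with probability phi_j
        y = complete(draw_obs(j), j)  # pseudo-datum Y*_i of length J
        vals[i] = t(y)
    return vals.mean()
```

Repeating this for each posterior draw of ω gives the posterior distribution of the functional.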

Simulations I
We assess the performance of our method as an estimator of the population mean at the end of a clinical trial with J = 3 time points and N = 100. We generate data under the following conditions:
S1: Y ~ N(μ, Σ). Missingness is MAR.
S2: Y is distributed as a 50-50 mixture of normal distributions, chosen to make the distributions of (Y_1, Y_2) and (Y_1, Y_3) highly non-linear while (Y_2, Y_3) is roughly linear. Missingness is MAR.

Simulations II
We compare our method to:
(a) modeling the data as normally distributed;
(b) augmented inverse probability weighting (AIPW), which solves the estimating equation
Σ_{i=1}^n [ I(S_i = J)/P(S_i = J | Y_i) · ϕ(Y_i, θ) + Σ_{j=1}^{J−1} {I(S_i = j) − λ_j(Ȳ_{ij}) I(S_i ≥ j)}/P(S_i > j | Y_i) · E[ϕ(Y_i, θ) | Ȳ_{ij}] ] = 0,
where Σ_{i=1}^n ϕ(Y_i, θ) = 0 is a complete-data least-squares estimating equation for the regressions of Y_2 on Y_1 and of Y_3 on (Y_1, Y_2), and λ_j is the dropout hazard at time j.
AIPW is doubly robust in the sense that if either the dropout model or the mean response model is correctly specified, then the associated estimator is consistent and asymptotically normal (CAN).
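For intuition, with a single follow-up time (J = 2, dropout after Y_1) and the mean functional, the estimating equation above reduces to the familiar AIPW estimator of E[Y_2]; a minimal sketch under that simplification (names hypothetical):

```python
import numpy as np

def aipw_mean(y1, y2, r, pi_hat, m_hat):
    """Doubly robust AIPW estimate of E[Y_2] under MAR with J = 2.

    r[i] = 1 if subject i completed (Y_2 observed), else 0;
    pi_hat[i]: estimated P(R = 1 | Y_1 = y1[i])  (dropout model);
    m_hat[i]:  estimated E[Y_2 | Y_1 = y1[i]]    (mean response model).
    Consistent if either pi_hat or m_hat is correctly specified.
    """
    y2_safe = np.where(r == 1, y2, 0.0)  # unobserved Y_2 receives zero weight
    return float(np.mean(r * y2_safe / pi_hat + (1.0 - r / pi_hat) * m_hat))
```

The first term is the inverse-probability-weighted complete-case mean; the second augments it with the outcome model, which is what delivers the double robustness.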

Simulations III
- Under (S1), the AIPW estimator was constructed with the correct mean and dropout models.
- Under (S2), the AIPW estimator was constructed assuming E[Y_2 | Y_1] to be quadratic in Y_1 and E[Y_3 | Y_1, Y_2] quadratic in Y_1 and linear in Y_2, with dropout modeled correctly (so the estimator is consistent by double robustness).
- The expectation E[ϕ(Y, θ) | Ȳ_j] is taken under the assumption that the mean response is modeled correctly.

Simulations IV

                 Bias           95% CI Width   95% CI Coverage   MSE
Normal model (S1)
  DP             0.001 (0.004)  0.493 (0.001)  0.963 (0.006)     0.01443 (0.0006)
  Normal         0.005 (0.004)  0.494 (0.002)  0.944 (0.007)     0.01524 (0.0007)
  AIPW           0.001 (0.004)  0.470 (0.002)  0.943 (0.007)     0.01530 (0.0007)
Mixture of normal models (S2)
  DP             0.010 (0.004)  0.542 (0.001)  0.950 (0.007)     0.0182 (0.0008)
  Normal         0.039 (0.005)  0.586 (0.001)  0.949 (0.007)     0.0220 (0.0010)
  AIPW           0.001 (0.004)  0.523 (0.001)  0.944 (0.007)     0.0185 (0.0008)
  Mixture        0.006 (0.004)  0.536 (0.001)  0.952 (0.007)     0.0182 (0.0008)

Table: Comparison of methods for estimating the population mean at time J = 3.