The mice package. Stef van Buuren (1,2). amst-r-dam, Oct. 29. 1 TNO, Leiden. 2 Methodology and Statistics, FSBS, Utrecht University


1 The mice package
Stef van Buuren 1,2
1 TNO, Leiden
2 Methodology and Statistics, FSBS, Utrecht University
amst-r-dam, Oct. 29, 2012

2 > The problem of missing data
Consequences of missing data
- Less information than planned
- Enough statistical power?
- Statistics are undefined (e.g. mean)
- Biases in the data analysis
- Systematic bias
- Representativity
- Appropriate confidence intervals, P-values?
In general, the presence of missing data can severely complicate interpretation and analysis of the data.

3 > The problem of missing data
How to calculate the mean
Calculate the mean in R as

    > y <- c(1, 2, 4)
    > mean(y)
    [1] 2.333333

where y is a vector containing three numbers, and where mean(y) is the R expression that returns their mean.

4 > The problem of missing data
How to calculate the mean
Now suppose that the last number is missing. R indicates this by the symbol NA, which stands for "not available":

    > y <- c(1, 2, NA)
    > mean(y)
    [1] NA

The mean is now undefined, and R informs us about this outcome by setting the mean to NA.

5 > The problem of missing data
How to calculate the mean
It is possible to add an extra argument na.rm = TRUE to the function call. This removes any missing data before calculating the mean:

    > mean(y, na.rm = TRUE)
    [1] 1.5

This makes it possible to calculate a result, but of course the set of observations on which the calculations are based has changed. This may cause problems in statistical inference and interpretation.

6 > The problem of missing data
Questions about complete-case analysis
- If numbers of cases change, can we compare the estimates from both models?
- Should we attribute differences in the estimates to changes in the model, or to changes in the subsample?
- Do the estimates generalize to the study population?
- Do we have enough cases to detect the effect of interest?
- Are we making the best use of the costly collected data?
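The first question can be made concrete with a small sketch. It assumes the built-in airquality data (the Ozone and Solar Radiation variables on the following slides suggest this data set, but that is an assumption) and two illustrative models that are not part of the talk. Because lm() silently drops incomplete rows, adding a predictor with missing values changes the subsample.

```r
# Complete-case analysis: lm() drops every row with a missing value
# in any variable used by the model (default na.action = na.omit).
fit1 <- lm(Ozone ~ Wind, data = airquality)            # Wind is complete
fit2 <- lm(Ozone ~ Wind + Solar.R, data = airquality)  # Solar.R has NAs

# The two fits are based on different numbers of rows
n1 <- nobs(fit1)
n2 <- nobs(fit2)
c(model1 = n1, model2 = n2)

# Differences between these coefficients mix a change in the model
# with a change in the subsample.
rbind(coef(fit1)[c("(Intercept)", "Wind")],
      coef(fit2)[c("(Intercept)", "Wind")])
```

Any comparison of the Wind coefficient across the two fits is therefore confounded with the loss of the rows that lack Solar.R.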

7 > Single imputation methods > Mean imputation
Mean imputation
[Figure: two panels. Axes: Frequency against Ozone (ppb); Ozone (ppb) against Solar Radiation (lang)]
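A minimal sketch of mean imputation, assuming the built-in airquality data stands in for the ozone data in the figure. The variable names ozone and ozone_mean are illustrative only.

```r
# Mean imputation: replace every missing value by the observed mean.
ozone <- airquality$Ozone
miss <- is.na(ozone)

ozone_mean <- ozone
ozone_mean[miss] <- mean(ozone, na.rm = TRUE)

# All imputed values are identical, so the spread of the completed
# variable is smaller than the spread of the observed values.
sd_observed <- sd(ozone, na.rm = TRUE)
sd_completed <- sd(ozone_mean)
c(observed = sd_observed, completed = sd_completed)
```

The spike at the mean distorts the distribution and weakens the relation with other variables, which is why the talk treats this method as a poor fix.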

8 > Single imputation methods > Regression imputation
Regression imputation
[Figure: two panels. Axes: Frequency against Ozone (ppb); Ozone (ppb) against Solar Radiation (lang)]
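Regression imputation can be sketched as follows, again assuming the built-in airquality data and a simple model of Ozone on Solar.R (the exact model behind the figure is not given in the slides).

```r
# Regression imputation: fill in the value predicted by a regression
# fitted on the complete cases.
fit <- lm(Ozone ~ Solar.R, data = airquality)

# Only rows with missing Ozone and observed Solar.R can be predicted
to_impute <- is.na(airquality$Ozone) & !is.na(airquality$Solar.R)

imputed <- airquality$Ozone
imputed[to_impute] <- predict(fit, newdata = airquality[to_impute, ])

# The imputed points lie exactly on the regression line, so their
# correlation with the predictor is perfect.
r_imputed <- cor(imputed[to_impute], airquality$Solar.R[to_impute])
r_imputed
```

Imputations without residual variation overstate the strength of the relation between the two variables.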

9 > Single imputation methods > Stochastic regression imputation
Stochastic regression imputation
[Figure: two panels. Axes: Frequency against Ozone (ppb); Ozone (ppb) against Solar Radiation (lang)]
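Stochastic regression imputation adds random residual noise to the predicted value. The sketch below assumes the built-in airquality data and a normal error model; the seed is arbitrary.

```r
# Stochastic regression imputation: predicted value plus a random
# draw from the estimated residual distribution.
set.seed(1)
fit <- lm(Ozone ~ Solar.R, data = airquality)
to_impute <- is.na(airquality$Ozone) & !is.na(airquality$Solar.R)

pred <- predict(fit, newdata = airquality[to_impute, ])
noise <- rnorm(length(pred), mean = 0, sd = summary(fit)$sigma)

imputed <- airquality$Ozone
imputed[to_impute] <- pred + noise

# The imputed points now scatter around the regression line.
# Note that a normal error model can produce negative ozone values.
r_imputed <- cor(imputed[to_impute], airquality$Solar.R[to_impute])
range(imputed[to_impute])
```

This restores the residual spread, but it still treats the fitted regression line as if it were known without error.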

10 > Multiple imputation with mice
Rising popularity of multiple imputation
[Figure: Number of publications (log) by Year. Series: early publications; 'multiple imputation' in abstract; 'multiple imputation' in title]

11 > Multiple imputation with mice
Working flow in mice

    Step                Function   Object class
    incomplete data                data frame
    imputed data        mice()     mids
    analysis results    with()     mira
    pooled results      pool()     mipo

12 > Multiple imputation with mice
How to do this in R?
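The three steps of the working flow can be sketched on the nhanes example data that ships with mice. The analysis model chl ~ age + bmi is an illustrative choice, not taken from the slides.

```r
library(mice)

# Step 1: impute. mice() turns a data frame into a 'mids' object
# holding m completed versions of the data.
imp <- mice(nhanes, m = 5, seed = 29981, printFlag = FALSE)

# Step 2: analyse. with() fits the model on each completed data set
# and returns a 'mira' object.
fit <- with(imp, lm(chl ~ age + bmi))

# Step 3: pool. pool() combines the m sets of estimates with
# Rubin's rules and returns a 'mipo' object.
est <- pool(fit)
summary(est)
```

Each object keeps enough information to repeat or inspect the step that produced it, so the stages can be checked separately.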

13 > Multiple imputation with mice
Philosophy of mice
- Imputation is a scientific activity, not a simple technical fix
- Imputed values should be plausible
- The method should work for any statistic calculated from the completed data

14 > How to create imputations > Criteria for good imputations
Proper imputation in practice
The imputation model should
- account for the process that created the missing data,
- preserve the relations in the data, and
- preserve the uncertainty about these relations.

15 > How to create imputations > Incorporating appropriate variation
Relation between temperature and gas consumption
[Figure: Gas consumption (cubic feet) against Temperature (°C)]

16 > How to create imputations > Incorporating appropriate variation
We delete gas consumption of observation 47
[Figure, panel a: Gas consumption (cubic feet) against Temperature (°C), with the deleted observation marked]

17 > How to create imputations > Incorporating appropriate variation
Predict imputed value from regression line
[Figure, panel b: Gas consumption (cubic feet) against Temperature (°C)]

18 > How to create imputations > Incorporating appropriate variation
Predicted value + noise
[Figure, panel c: Gas consumption (cubic feet) against Temperature (°C)]

19 > How to create imputations > Incorporating appropriate variation
Predicted value + noise + parameter uncertainty
[Figure, panel d: Gas consumption (cubic feet) against Temperature (°C)]
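The progression in panels b to d can be sketched in base R. The sketch assumes that the gas consumption data are the whiteside data from the MASS package (an assumption; the slides do not name the data set) and uses a standard Bayesian draw for a normal linear model. It is an illustration of the idea, not the code used for the figures.

```r
library(MASS)  # assumed source of the gas consumption data (whiteside)
set.seed(1)

dat <- whiteside
dat$Gas[47] <- NA                      # delete gas consumption of observation 47
obs <- !is.na(dat$Gas)

X <- cbind(1, dat$Temp)                # design matrix: intercept and temperature
Xobs <- X[obs, , drop = FALSE]
yobs <- dat$Gas[obs]
Xmis <- X[!obs, , drop = FALSE]

# Least-squares estimates from the observed cases
XtX_inv <- solve(crossprod(Xobs))
beta_hat <- XtX_inv %*% crossprod(Xobs, yobs)
rss <- sum((yobs - Xobs %*% beta_hat)^2)
df <- sum(obs) - ncol(X)
sigma_hat <- sqrt(rss / df)

# (b) predicted value from the regression line
imp_b <- drop(Xmis %*% beta_hat)

# (c) predicted value + noise
imp_c <- imp_b + rnorm(1, 0, sigma_hat)

# (d) predicted value + noise + parameter uncertainty:
# first draw sigma and beta from their posterior, then predict and add noise
sigma_dot <- sqrt(rss / rchisq(1, df))
beta_dot <- beta_hat + sigma_dot * t(chol(XtX_inv)) %*% rnorm(ncol(X))
imp_d <- drop(Xmis %*% beta_dot) + rnorm(1, 0, sigma_dot)

c(b = imp_b, c = imp_c, d = imp_d)
```

Repeating step (d) with fresh draws gives imputations that differ both in the line and in the noise, which is the variation that multiple imputation needs.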

20 > How to create imputations > Incorporating appropriate variation
Imputation based on two predictors
[Figure, panel e: Gas consumption (cubic feet) against Temperature (°C). Groups: before insulation, after insulation]

21 > How to create imputations > Incorporating appropriate variation
Predictive mean matching: Y given X
[Figure: Gas consumption (cubic feet) against Temperature (°C). Groups: before insulation, after insulation]

22 > How to create imputations > Incorporating appropriate variation
Add two regression lines
[Figure: Gas consumption (cubic feet) against Temperature (°C). Groups: before insulation, after insulation]

23 > How to create imputations > Incorporating appropriate variation
Predicted given 5 °C, after insulation
[Figure: Gas consumption (cubic feet) against Temperature (°C). Groups: before insulation, after insulation]

24 > How to create imputations > Incorporating appropriate variation
Define a matching range ŷ ± δ
[Figure: Gas consumption (cubic feet) against Temperature (°C). Groups: before insulation, after insulation]

25 > How to create imputations > Incorporating appropriate variation
Select potential donors
[Figure: Gas consumption (cubic feet) against Temperature (°C). Groups: before insulation, after insulation]

26 > How to create imputations > Incorporating appropriate variation
Bayesian PMM: Draw a line
[Figure: Gas consumption (cubic feet) against Temperature (°C). Groups: before insulation, after insulation]

27 > How to create imputations > Incorporating appropriate variation
Define a matching range ŷ ± δ
[Figure: Gas consumption (cubic feet) against Temperature (°C). Groups: before insulation, after insulation]

28 > How to create imputations > Incorporating appropriate variation
Select potential donors
[Figure: Gas consumption (cubic feet) against Temperature (°C). Groups: before insulation, after insulation]
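The matching steps on these slides can be sketched as follows. The sketch again assumes the whiteside data from MASS, uses a hypothetical matching width delta = 0.3, and skips the Bayesian draw of the regression line, so it is a simplified version of predictive mean matching rather than the pmm method as implemented in mice.

```r
library(MASS)  # assumed source of the gas consumption data (whiteside)
set.seed(2)

# Two regression lines with a common slope: before and after insulation
fit <- lm(Gas ~ Temp + Insul, data = whiteside)

# Predicted value for the case to impute: 5 degrees C, after insulation
target <- data.frame(Temp = 5,
                     Insul = factor("After", levels = levels(whiteside$Insul)))
yhat_target <- predict(fit, newdata = target)

# Matching range yhat +/- delta (delta is a hypothetical choice)
delta <- 0.3
yhat_obs <- fitted(fit)
donors <- which(abs(yhat_obs - yhat_target) <= delta)

# Draw one donor at random and borrow its observed gas consumption
pick <- donors[sample.int(length(donors), 1)]
imputed <- whiteside$Gas[pick]

length(donors)
imputed
```

Because the imputed value is always an observed value of a donor, it stays within the range of realistic gas consumptions, whatever the shape of the residual distribution.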

29 > How to create imputations > Imputation method in mice

    Method        Description                          Scale type
    pmm           Predictive mean matching             numeric
    norm          Bayesian linear regression           numeric
    norm.nob      Linear regression, non-Bayesian      numeric
    norm.boot     Linear regression with bootstrap     numeric
    mean          Unconditional mean imputation        numeric
    2L.norm       Two-level linear model               numeric
    logreg        Logistic regression                  factor, 2 levels
    logreg.boot   Logistic regression with bootstrap   factor, 2 levels
    polyreg       Multinomial logit model              factor, > 2 levels
    polr          Ordered logit model                  ordered, > 2 levels
    lda           Linear discriminant analysis         factor
    sample        Simple random sample                 any
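A method from the table is selected per column through the method argument of mice(). The sketch below uses the nhanes2 example data from mice; the particular assignment of methods to columns is an illustrative choice.

```r
library(mice)

# One method per column; "" means the column is not imputed
# (age is complete in nhanes2).
meth <- c(age = "", bmi = "norm", hyp = "logreg", chl = "pmm")

imp <- mice(nhanes2, method = meth, m = 3, seed = 1, printFlag = FALSE)

# The methods actually used are stored in the mids object
imp$method
```

The method has to fit the scale type of the column: hyp is a factor with two levels, so logreg applies, while bmi and chl are numeric.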

30 > Multivariate missing data > MICE
Multivariate Imputation by Chained Equations (MICE)
MICE algorithm
- Specify imputation model for each incomplete column
- Fill in starting imputations
- And iterate
Model: Fully Conditional Specification (FCS)
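The three steps can be sketched as a bare loop in base R. This is a simplified illustration, assuming the built-in airquality data and stochastic linear regression for every incomplete column; it creates a single completed data set and leaves out the parameter draws that mice adds.

```r
set.seed(123)
dat <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
miss <- is.na(dat)                               # fixed map of the missing cells
incomplete <- names(which(colSums(miss) > 0))

# Fill in starting imputations: random draws from the observed values
for (v in incomplete) {
  observed <- dat[!miss[, v], v]
  dat[miss[, v], v] <- sample(observed, sum(miss[, v]), replace = TRUE)
}

# Iterate: visit each incomplete column and re-impute it from a model
# conditional on all other columns (one model per column)
for (iter in 1:10) {
  for (v in incomplete) {
    form <- reformulate(setdiff(names(dat), v), response = v)
    fit <- lm(form, data = dat[!miss[, v], ])    # fit where v was observed
    mu <- predict(fit, newdata = dat[miss[, v], ])
    dat[miss[, v], v] <- mu + rnorm(length(mu), 0, summary(fit)$sigma)
  }
}

colSums(is.na(dat))
```

Each conditional model is specified on its own, which is what makes the approach flexible; the price is that the joint distribution is only defined implicitly.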

31 > Multivariate missing data > MICE
Fully Conditional Specification: Cons
- Theoretical properties only known in special cases
- Cannot use computational shortcuts, like the sweep operator
- Care needed in building and checking the model

32 > Multivariate missing data > MICE
Fully Conditional Specification: Pros
- Extremely flexible
- Close to the data
- Subset selection of predictors
- Modular, can preserve valuable work
- Appears to work very well in practice

33 > Flexible Imputation of Missing Data Flexible Imputation of Missing Data (FIMD)

34 > Diagnostics > Graphs
Standard diagnostic plots in mice
Since mice 2.5, plots for imputed data:
- one-dimensional scatter: stripplot
- box-and-whisker plot: bwplot
- densities: densityplot
- scattergram: xyplot

35 > Diagnostics > Graphs
Stripplot

    > library(mice)
    > imp <- mice(nhanes, seed = 29981)
    > stripplot(imp, pch = c(1, 19))

36 > Diagnostics > Graphs
[Figure: output of stripplot(imp, pch = c(1, 19)). Panels: age, bmi, hyp, chl. Horizontal axis: Imputation number]

37 > Diagnostics > Graphs
A larger data set

    > imp <- mice(boys, seed = 24331, maxit = 1)
    > stripplot(imp)
    > bwplot(imp)

38 > Diagnostics > Graphs
[Figure: output of stripplot(imp). Panels: wgt, hgt, age, hc, tv, bmi. Horizontal axis: Imputation number]

39 > Diagnostics > Graphs
[Figure: output of bwplot(imp). Panels: age, hgt, wgt, bmi, hc, tv. Horizontal axis: Imputation number]

40 > Diagnostics > Graphs
[Figure: output of densityplot(imp). Panels: hgt, wgt, bmi, hc, tv. Vertical axis: Density]

41 > Diagnostics > Graphs
Imputed by a normal model
[Figure: Genital stage against Age]

42 > Diagnostics > Graphs
Imputed by a proportional odds model
[Figure: Genital stage (G1 to G5) against Age]

43 > Derived variables
Derived variables
- ratio of two variables
- sum score
- index variable
- quadratic relations
- interaction term
- conditional imputation
- compositions

44 > Derived variables > Imputing a ratio
How to impute a ratio?
Weight/height ratio: whr = wgt/hgt (kg/m).
Easy if only one of wgt or hgt or whr is missing.
Methods
- POST: Impute wgt and hgt, and calculate whr after imputation
- JAV: Impute whr as just another variable
- PASSIVE1: Impute wgt and hgt, and calculate whr during imputation
- PASSIVE2: As PASSIVE1 with adapted predictor matrix

45 > Derived variables > Imputing a ratio
Method POST

    > imp1 <- mice(boys)
    > long <- complete(imp1, "long", inc = TRUE)
    > long$whr <- with(long, wgt / (hgt / 100))
    > imp2 <- long2mids(long)   # later versions of mice: as.mids(long)

46 > Derived variables > Imputing a ratio
Method JAV: Just another variable

    > boys$whr <- boys$wgt / (boys$hgt / 100)
    > imp.jav <- mice(boys, m = 1, seed = 32093, maxit = 10)

47 > Derived variables > Imputing a ratio
Method JAV
[Figure: panels JAV, passive, passive 2. Axes: Weight/Height (kg/m) against Height (cm)]

48 > Derived variables > Imputing a ratio
Method PASSIVE

    > meth["whr"] <- "~I(wgt/(hgt/100))"
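The slide shows only the line that sets the passive method. A fuller sketch, under stated assumptions, looks like this: it works on a subset of the boys data to keep the run short, obtains meth and pred from a dry run with maxit = 0, and removes whr as a predictor of wgt and hgt in the spirit of PASSIVE2. The column subset, m, maxit and seed are illustrative choices.

```r
library(mice)

# Subset of the boys data plus the derived weight/height ratio
dat <- boys[, c("age", "hgt", "wgt", "hc")]
dat$whr <- dat$wgt / (dat$hgt / 100)

# Dry run to obtain the default methods and predictor matrix
ini <- mice(dat, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix

# Passive imputation: whr is recalculated from the imputed wgt and hgt
meth["whr"] <- "~I(wgt/(hgt/100))"

# Adapted predictor matrix: whr does not predict its own components,
# which breaks the direct feedback loop
pred[c("wgt", "hgt"), "whr"] <- 0

imp.pas <- mice(dat, method = meth, predictorMatrix = pred,
                m = 2, maxit = 5, seed = 1, printFlag = FALSE)

# The relation between whr, wgt and hgt holds in the completed data
completed <- complete(imp.pas, 1)
max_gap <- max(abs(completed$whr - completed$wgt / (completed$hgt / 100)))
max_gap
```

Because whr comes after wgt and hgt in the column order, it is recalculated last in every iteration, so the completed data respect the definition of the ratio.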

49 > Derived variables > Imputing a ratio
Method PASSIVE
[Figure: panels JAV, passive, passive 2. Axes: Weight/Height (kg/m) against Height (cm)]

50 > Derived variables > Imputing a ratio
Method PASSIVE
[Figure: panels JAV, passive, passive 2. Axes: Weight/Height (kg/m) against Height (cm)]

51 > Derived variables > Summary
Derived variables: summary
- Derived variables pose special challenges
- Plausible values respect data dependencies
- If you can, create derived variables after imputation
- If you cannot, use passive imputation
- Break up direct feedback loops using the predictor matrix

52 > Conclusion
Why use MICE?
1 State-of-the-art methodology
2 Addresses all phases of multiple imputation
3 MICE algorithm is flexible and extendible
4 Extensive documentation, sample code and real datasets
5 Light and stable R package
6 Easy to use, good defaults
7 Hundreds of applications
8 Free software, open source

53 > Conclusion
Key documentation
1 Van Buuren, S. and Groothuis-Oudshoorn, C.G.M. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3).
2 Van Buuren, S. (2012). Flexible Imputation of Missing Data. Chapman & Hall/CRC, Boca Raton, FL.
