Toutenburg, Fieger: Using diagnostic measures to detect non-mcar processes in linear regression models with missing covariates

Similar documents
Toutenburg, Nittner: The Classical Linear Regression Model with one Incomplete Binary Variable

Statistical Methods. Missing Data snijders/sm.htm. Tom A.B. Snijders. November, University of Oxford 1 / 23

Projektpartner. Sonderforschungsbereich 386, Paper 163 (1999) Online unter:

Basics of Modern Missing Data Analysis

A Note on Bayesian Inference After Multiple Imputation

analysis of incomplete data in statistical surveys

Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling

Inferences on missing information under multiple imputation and two-stage multiple imputation

Streamlining Missing Data Analysis by Aggregating Multiple Imputations at the Data Level

Statistical Practice

F-tests for Incomplete Data in Multiple Regression Setup

Multiple Linear Regression

Sociedad de Estadística e Investigación Operativa

Toutenburg, Srivastava: Estimation of Ratio of Population Means in Survey Sampling When Some Observations are Missing

Stein-Rule Estimation under an Extended Balanced Loss Function

Inferences on a Normal Covariance Matrix and Generalized Variance with Monotone Missing Data

Augustin: Some Basic Results on the Extension of Quasi-Likelihood Based Measurement Error Correction to Multivariate and Flexible Structural Models

ANALYSIS OF ORDINAL SURVEY RESPONSES WITH DON T KNOW

Küchenhoff: An exact algorithm for estimating breakpoints in segmented generalized linear models

A weighted simulation-based estimator for incomplete longitudinal data models

Fractional Imputation in Survey Sampling: A Comparative Review

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

Analyzing Pilot Studies with Missing Observations

Shu Yang and Jae Kwang Kim. Harvard University and Iowa State University

Pooling multiple imputations when the sample happens to be the population.

Multiple Imputation for Missing Data in Repeated Measurements Using MCMC and Copulas

Holzmann, Min, Czado: Validating linear restrictions in linear regression models with general error structure

ASA Section on Survey Research Methods

PIRLS 2016 Achievement Scaling Methodology 1

Summary and discussion of The central role of the propensity score in observational studies for causal effects

Mean squared error matrix comparison of least aquares and Stein-rule estimators for regression coefficients under non-normal disturbances

Evaluating the Importance of Multiple Imputations of Missing Data on Stochastic Frontier Analysis Efficiency Measures

Outlier detection and variable selection via difference based regression model and penalized regression

1. The Multivariate Classical Linear Regression Model

Pseudo-minimax linear and mixed regression estimation of regression coecients when prior estimates are available

Alexina Mason. Department of Epidemiology and Biostatistics Imperial College, London. 16 February 2010

Regression Analysis for Data Containing Outliers and High Leverage Points

Longitudinal analysis of ordinal data

Some methods for handling missing values in outcome variables. Roderick J. Little

Regression Graphics. 1 Introduction. 2 The Central Subspace. R. D. Cook Department of Applied Statistics University of Minnesota St.

A Comparative Study of Imputation Methods for Estimation of Missing Values of Per Capita Expenditure in Central Java

Reconstruction of individual patient data for meta analysis via Bayesian approach

Two-phase sampling approach to fractional hot deck imputation

Ekkehart Schlicht: Trend Extraction From Time Series With Structural Breaks and Missing Observations

Estimating and Using Propensity Score in Presence of Missing Background Data. An Application to Assess the Impact of Childbearing on Wellbeing

Regression diagnostics

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix

MISSING or INCOMPLETE DATA

Christopher Dougherty London School of Economics and Political Science

Advising on Research Methods: A consultant's companion. Herman J. Ader Gideon J. Mellenbergh with contributions by David J. Hand

Finite Sample Performance of A Minimum Distance Estimator Under Weak Instruments

Lecture Notes: Some Core Ideas of Imputation for Nonresponse in Surveys. Tom Rosenström University of Helsinki May 14, 2014

identify only a few factors (possibly none) with dispersion eects and a larger number of factors with location eects on the transformed response. Tagu

LECTURE 2 LINEAR REGRESSION MODEL AND OLS

A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL

TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST

INTERNALLY STUDENTIZED RESIDUALS TEST FOR MULTIPLE NON-NESTED REGRESSION MODELS

Implications of Missing Data Imputation for Agricultural Household Surveys: An Application to Technology Adoption

STA 4273H: Statistical Machine Learning

Linear Mixed Models for Longitudinal Data with Nonrandom Dropouts

Bootstrapping, Randomization, 2B-PLS

An Empirical Comparison of Multiple Imputation Approaches for Treating Missing Data in Observational Studies

Diagnostics for Linear Models With Functional Responses

Local Polynomial Wavelet Regression with Missing at Random

((n r) 1) (r 1) ε 1 ε 2. X Z β+

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim

Bayesian Estimation of a Possibly Mis-Specified Linear Regression Model

Discussing Effects of Different MAR-Settings

review session gov 2000 gov 2000 () review session 1 / 38

Bayesian Inference for the Multivariate Normal

Pruscha: Semiparametric Estimation in Regression Models for Point Processes based on One Realization

6 Pattern Mixture Models

Experimental Design and Data Analysis for Biologists

A note on multiple imputation for general purpose estimation

Bayesian methods for missing data: part 1. Key Concepts. Nicky Best and Alexina Mason. Imperial College London

Recent Advances in the analysis of missing data with non-ignorable missingness

Correction for Ascertainment

Younshik Chung and Hyungsoon Kim 968). Sharples(990) showed how variance ination can be incorporated easily into general hierarchical models, retainin

Introduction to Statistical Hypothesis Testing

Multiple QTL mapping

Estimation of Missing Data Using Convoluted Weighted Method in Nigeria Household Survey

Maximum Likelihood Estimation; Robust Maximum Likelihood; Missing Data with Maximum Likelihood

Whether to use MMRM as primary estimand.

A Robust Approach of Regression-Based Statistical Matching for Continuous Data

The Effects of Monetary Policy on Stock Market Bubbles: Some Evidence

Missing Data. On Missing Data and Interactions in SNP Association Studies. Missing Data - Approaches. Missing Data - Approaches. Multiple Imputation

Robust Wilks' Statistic based on RMCD for One-Way Multivariate Analysis of Variance (MANOVA)

MISSING or INCOMPLETE DATA

Managing Uncertainty

Miscellanea A note on multiple imputation under complex sampling

ADJUSTED POWER ESTIMATES IN. Ji Zhang. Biostatistics and Research Data Systems. Merck Research Laboratories. Rahway, NJ

Does k-th Moment Exist?

The Model Building Process Part I: Checking Model Assumptions Best Practice

Vector Autoregressive Model. Vector Autoregressions II. Estimation of Vector Autoregressions II. Estimation of Vector Autoregressions I.

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

Introduction An approximated EM algorithm Simulation studies Discussion

11. Bootstrap Methods

Short course on Missing Data

Monitoring Random Start Forward Searches for Multivariate Data

Transcription:

Toutenburg, Fieger: Using diagnostic measures to detect non-mcar processes in linear regression models with missing covariates Sonderforschungsbereich 386, Paper 24 (2) Online unter: http://epub.ub.uni-muenchen.de/ Projektpartner

Using diagnostic measures to detect non-mcar processes in linear regression models with missing covariates H. Toutenburg A. Fieger y June 6, 2 Abstract This paper presents methods to analyze and detect non-mcar processes that lead to missing covariate values in linear regression models. First, the data situation and the problem is sketched. The next section provides an overview of the methods that deal with missing covariate values. The idea of using outlier methods to detect non-mcar processes is described in section 3. Section 4 uses these ideas to introduce a graphical method to visualize the problem. Possible extensions conclude the presentation. 1 Data and problem We consider the classical linear regression model y(n 1) = X(n p) (p 1) + (n 1) with missing data in the n p covariate matrix X. Reorganization of the n rows of the data matrix X, the corresponding elements of the response y and the error term leads to the following structure y mis = + c mis The index c indicates the completely observed submodel whereas the index mis corresponds to the submodel with missing values in the covariate matrix (note that y mis is completely observed). Using the missing data indicator matrix R introduced by (Rubin, 1976) R ij = 1 if Zij is observed if Z ij is missing with data Z ij =(y i X ij ), the missing mechanism can be characterized by the conditional distribution f(rjz )ofr given the data Z and an unknown parameter. The n(p+1) matrix Z consists of observed data Z obs and unobserved values Z mis. The data are said to be missing completely at random (MCAR) if the distribution of R given Z and only depends on the unknown parameter for any Z, i.e. f(rjz ) =f(rj) 8Z: If the conditional distribution of R depends on Z only via the observed values Z obs (for all Z mis, i.e. (1) f(rjz ) =f(rjz obs ) 8Z mis the data are called missing at random (MAR). Institute of Statistics, Ludwig-Maximilians University of Munich, Germany, email: toutenb@stat.unimuenchen.de y Institute of Statistics, Ludwig-Maximilians University of Munich, Germany, email: andreas@stat.unimuenchen.de 1

The optimal estimator is the Gauss-Markov estimator b applied to the data in (1): b =!;1 y mis = (X cx c + X mis ) ;1 (X cy c + X misy mis ) : (2) Due to the unknown values in, of course this estimator can not be used directly. There are a variety of methods that deal with this problem. 2 Dealing with missing values A simple and often used method is to discard all the information available in (y mis ) and to use the completely observed data in (y c X c ) only: b c =(X cx c ) ;1 X cy c : Using the available information in Z by estimating ^ = ^ ;1 xx ^ xy where the elements of ^ xx and ^ xy are formed by using all jointly observed pairs x ij x i j and x ij y i is called the available case method. Maximum likelihood procedures address the missing data problem by factoring the joint distribution f(z Rj ) =f(zj)f(rjz ) : Integration over the missing data Z mis yields f(z obs Rj ) = Z f(z Rj ) dz mis = Z f(rjz )f(zj) dz mis : If f(rjz) depends only on the observed data Z obs, i.e. the MAR assumption holds, we have f(z obs Rj ) =f(rjz obs ) Z f(zj) dz mis = f(rjz obs )f(z obs j) which iswhy the missing data mechanism is also called ignorable in this case. Imputation procedures form a dierent approach to the problem. Here the missing values in are replaced by new values. Having done this by some procedure, the estimator (2) with replaced by with as described below becomes operational. To replace the unknown values in a variety of imputation methods exist: mean imputation or zero order regression (ZOR) replaces an unknown value x ij by the mean x j, either formed of the complete cases in X c or the available cases in X c and. Conditional mean imputation or rst order regression (FOR) uses auxilliary regressions to nd replacements for the missing values. Regressing the covariate with missing values on the remaining covaraites (with parameters estimates based on the complete cases) yields predictions of the missing values that are used as substitutes. If the response y is also used in these regressions a stochastic element is introduced (see Buck (196) or Toutenburg and Shalabh (1998)). Multiple Imputation (Rubin, 1987 Schafer, 1997) repeats the imputation step and averages the results. While a single imputation is too smooth, the dierences between the individual imputation steps can be properly used to estimate the variance as the sum of the average variance within the imputed data sets and the between imputation variance. This strategy reects the uncertainty about the imputation process which is ignored in a single imputation strategy. By replacing a missing value by x R, the model (2) becomes the mixed model y R = + + c R 2

where addresses the dierence between the true but unknown values in and their replacements in. Using the mixed estimator (Theil and Goldberger, 1961) we have b =!;1 = (X c X c + X R ) ;1 (X c y c + X R y ) y The weighted-mixed-estimator introduced by Scharin and Toutenburg (199) uses a weight <1 for the values in (y mis ) b() = ( X c + XR ) ;1 ( y c + XR y R) : (3) This estimator may beinterpreted as the familiar mixed estimator in the model p y = p XR + c p : 3 MCAR diagnosis with outlier measures Popular diagnostics to detect non-mcar processes contain the comparison of correlation or covariance matrices, the comparison of means (y c vs. y mis ) or a more general test as described by Little (????). For the situation with only one column aected by missing values Simon and Simono (1986) present diagnostic plots where `envelopes' are compared. The idea rst presented by Simono (1988) combines the missing data problem with statistics that derive from the oulier detection eld. A comparison of the values of a statistic computed with and without imputation is the comparison of the sub-samples Z c and Z mis. If the imputation of values can be considered appropriate under MCAR and we really have MCAR (which isthenull hypothesis H ), the statistics should be `more or less' the same. If we have something other than MCAR, the statistics should reect this by having dierent values. Simono (1988) uses Cook's distance, which is based on the condence ellipsoid C = ( ^ ; ^ c )(X X )( ^ ; ^ c ) ps 2 the residual sum of squares DRSS (Andrews and Pregibon, 1978) DRSS = (RSS ; RSS c )=n mis RSS c =(n c ; n mis ; p +1) and the determinant ofthex X matrix DXX (Andrews and Pregibon, 1978) DXX = jx cx c j jx X j : For the construction of tests the distribution of the measures under H is needed. As this distribution also depends on the X values, Monte Carlo methods are used to determine it by rst computing the complete case statistics, and imputing missing values under MCAR-assumption. The generation of new response values y MC mis = ^Xmis ^c + MC with MC N( s 2 I) generates a new data set where `missing values' are drawn from using an MCAR mechanism. After applying the imputation procedure to these data the diagnostic measures are computed. Repeating the `data deletion' and imputation steps a null distribution of the diagnostic measure is generated. Finally the measure can be applied to the imputed original data and the resulting value can be compared to the null distribution. 3

4 Graphical diagnosis of the missing mechanism Animated residualplots are presented in Cook and Weisberg (1989). In a stepwise procedure weights between and 1 are used to include one case into the regression. The plots thus represent the inuence of that single case. Park, Kim and Toutenburg (1992) present a similar approach to visualize the inclusion of a further variable into a regression model. The adaption to the situation with missing data which shows a close relationship to the procedures of the preceeding section is described in?. Like in the above procedures, imputation is performed under an MCAR assumption. Having lled the gaps in, the weighted-mixed-estimator (3) is computed for certain values 2 [ 1]. Again the ideas is, that if we really have MCAR, there should not be any tendency in the residual plot, when stepwise including Z R in the model by increasing the weight from to 1. Figure 1 shows a small program that visualizes the following procedure: for ( = 1 +=step) f compute regression parameters compute estimated residuals display residual-plot g Figure 1: Java program for visualization on computer screen. Reads data for each frame of the animation and draws the single plots of the animation. See http://www.stat.uni-muenchen.de/~andreas/ The residual plot in gure 2 shows an example of an animated plot of ^y (an the X-axis) versus ^ (on the Y-axis) for a model y = + 1 x i + 2 x 2 + with missing data generated by a non MAR process where P (R i2 =)(avalue x i2 is missing) dependes on x 2. An increasing gives higher weight to the imputed data in the estimation of the regression parameters. The center of the residual plot in gure 2 shifts towards the origin. For = 1 the imputed data have the same weight as the complete data and biased estimated result. = (the complete case esitmator) on the other hand gives consistent estimates as the missing process is independent of the response y. Possible extensions The ideas of animated residual plots could be extended in various ways. Imagine a simultaneous plot of ^ vs. ^y vs. X i in dierent windows where the windows are linked. By brushing selected points of the plot could be highlighted in all windows and their location or movement can be studied while the value of changes. Univariate plots y vs. X j for all j together with the estimated regression line ^ + ^ i X i, where the points are static (as the imputation does not depend on the weight ) and the estimated regression line is dynamic. Again, these plots could be linked as described above. Creation of a null plot where missing values are created articially by aknown MCAR mechanism. This plot can be used as a means of comparison in order to have an idea of what the plot should look like under MCAR. 4

1 - -1-1 -1-1 1.3. 1 - -1-1 -1-1 1.6. 1 - -1-1 -1-1 1 1 -. 1 -.1. -1-1 -1-1 1.4. 1 - -1-1 -1-1 1.7. 1 - -1-1 -1-1 1.9. -1-1 -1-1 1 1-1 -.2. -1-1 -1-1 1.. 1 - -1-1 -1-1 1.8. 1 - -1-1 -1 -. 1 1-1 -1-1 - 1 1 1 Figure 2: Residualplot of ^y (X-axis) versus ^ (Y-axis) for a model y = + 1 x i + 2 x 2 + with a non MAR process where P (R i2 =)(avalue x i2 is missing) dependes on x 2. from (top left) to 1 (bottom right). References Andrews, D. F. and Pregibon, D. (1978). Finding outliers that matter, Journal of the Royal Statistical Society, Series B 4: 8{93. Buck, S. F. (196). A method of estimation of missing values in multivariate data suitable for use with an electronic computer, Journal of the Royal Statistical Society, Series B 22: 32{37. Cook, R. D. and Weisberg, S. (1989). Technometrics 31: 277{291. Regression diagnostics with dynamic graphics, Park, S. H., Kim, Y. H. and Toutenburg, H. (1992). Regression diagnostics for removing an observation with animating graphics, Statistical Papers 33: 227{24. Rubin, D. B. (1976). Inference and missing data, Biometrika 63: 81{92. Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Sample Surveys, Wiley, New York. Schafer, J. L. (1997). London. Analysis of Incomplete Multivariate Data, Chapman and Hall, Scharin, B. and Toutenburg, H. (199). Weighted mixed regression, Zeitschrift fur Angewandte Mathematik und Mechanik 7: 73{738. Simon, G. A. and Simono, J. S. (1986). Diagnostic plots for missing data in least squares regression, Journal of the American Statistical Association 81: 1{9.

Simono, J. S. (1988). Regression diagnostics to detect nonrandom missingness in linear regression, Technometrics 3: 2{214. Theil, H. and Goldberger, A. S. (1961). On pure and mixed estimation in econometrics, International Economic Review 2: 6{78. Toutenburg, H. and Shalabh (1998). Prediction of response values in linear regression models from replicated experiments, to be published. 6