Missing values imputation for mixed data based on principal component methods

Size: px

Start display at page:

Download "Missing values imputation for mixed data based on principal component methods"

Kerry Bennett
5 years ago
Views:

1 Missing values imputation for mixed data based on principal component methods Vincent Audigier, François Husson & Julie Josse Agrocampus Rennes Compstat' 2012, Limassol (Cyprus), / 21

2 A real dataset age weight size alcohol sex snore tobacco or 2 glasses/day M yes no or 2 glasses/day M no <= No W no <= or 2 glasses/day M no <= No W yes > >2 glasses/day M yes no or 2 glasses/day M yes <= No W no <= >2 glasses/day M no > or 2 glasses/day W no <= or 2 glasses/day M no no or 2 glasses/day W no <= or 2 glasses/day M yes <= >2 glasses/day M yes <= or 2 glasses/day M no <= No W no <= No M no <= or 2 glasses/day M yes <= or 2 glasses/day M yes no >2 glasses/day M no <= or 2 glasses/day W no <=1 2 / 21

3 A real dataset age weight size alcohol sex snore tobacco 51 NA 172 NA M yes no or 2 glasses/day M NA <=1 48 NA 164 No W no NA or 2 glasses/day M no <= No W yes > NA >2 glasses/day M yes no or 2 glasses/day M NA NA 49 NA 179 No W no <= >2 glasses/day M NA >1 NA or 2 glasses/day W no <= or 2 glasses/day M no no 34 NA 181 NA W no <= NA 1 or 2 glasses/day M yes <= NA >2 glasses/day M NA <= or 2 glasses/day M no NA NA No W NA <= No M no <= NA 1 or 2 glasses/day M NA NA 65 NA or 2 glasses/day M yes no >2 glasses/day M no <= NA W no <=1 2 / 21

4 A real dataset age weight size alcohol sex snore tobacco 51 NA 172 NA M yes no or 2 glasses/day M NA <=1 48 NA 164 No W no NA or 2 glasses/day M no <= No W yes > NA >2 glasses/day M yes no or 2 glasses/day M NA NA 49 NA 179 No W no <= >2 glasses/day M NA >1 NA or 2 glasses/day W no <= or 2 glasses/day M no no 34 NA 181 NA W no <= NA 1 or 2 glasses/day M yes <= NA >2 glasses/day M NA <= or 2 glasses/day M no NA NA No W NA <= No M no <= NA 1 or 2 glasses/day M NA NA 65 NA or 2 glasses/day M yes no >2 glasses/day M no <= NA W no <=1 Popular approach to deal with missing values: Single imputation 2 / 21

5 Single imputation methods Continuous variables: k-nearest neighbours, normal distribution (joint modelling), iterative regression (fully conditional specication), etc. Categorical variables: hot-deck imputation, multinomial model, latent class model (Vermunt et al., 2008), etc. Mixed data: transform the categorical variables into dummy variables and deal as continuous variables (package Amelia) MICE (multivariate imputation by chained equations, van Bureen 1999): a model must be specied for each variable random forest (Stekhoven & Bühlmann, 2011) New imputation method based on principal component methods 3 / 21

6 Imputation with PCA for continuous variables PCA minimizes: With missing values: C = X I J F I S U t S J 2 C = W (X FU t ) 2, with w ij = 0 if x ij is missing, w ij = 1 otherwise. Iterative PCA (Kiers, 1997) 4 / 21

7 Iterative PCA algorithm 1.5 NA x2 The data set x1 5 / 21

8 Iterative PCA algorithm 1.5 NA x2 Initialization step: mean imputation x1 5 / 21

9 1.5 NA Iterative PCA algorithm PCA performed on the completed data set; 1 dimension is kept x2 x1 5 / 21

10 Iterative PCA algorithm 1.5 NA x2 Calculation of the model prediction x1 5 / 21

11 Iterative PCA algorithm 1.5 NA x2 Imputation step: X l = W X + (1 W) ˆX l x1 5 / 21

12 Iterative PCA algorithm 1.5 NA x2 PCA is performed; 1 dimension is kept x1 5 / 21

13 Iterative PCA algorithm 1.5 NA x2 Imputation step: X l = W X + (1 W) ˆX l x1 5 / 21

14 Iterative PCA algorithm 1.5 NA x2 Iterate until convergence x1 5 / 21

15 x2 Iterative PCA - convergence Imputed values are obtained at convergence 1.5 NA x1 6 / 21

16 Iterative PCA 1 initialization l = 0: X 0 (mean imputation) 2 step l: (a) PCA on the completed matrix X l 1 ˆF l I S, Ûl K S S dimensions are kept; ˆX l = ˆF l Û l (estimation) (b) X l = W X + (1 W) ˆF l Û l (imputation) 3 Estimation and imputation are repeated until convergence The number of dimensions S has to be chosen a priori An imputation is performed during the algorithm PCA can be seen as an imputation method Overtting problems are managed with a regularized algorithm 7 / 21

17 Imputation with MCA for categorical variables MCA can be seen as the PCA on an indicator matrix X with specic weights depending on D Σ NA NA NA NA NA J I X= xik J D Σ = 0 I k NA NA I 1 I k I K J IJ I K 8 / 21

18 Iterative MCA Iterative MCA algorithm: 1 initialization: imputation of the indicator matrix (proportion) 2 iterate until convergence (a) MCA on the completed indicator matrix ˆF, Û (Estimation) (b) imputation of the missing values with the model matrix (c) column margins are updated V1 V2 V3 V14 V1_a V1_b V1_c V2_e V2_f V3_g V3_h ind 1 a NA g u ind ind 2 NA f g u ind ind 3 a e h v ind ind 4 a e h v ind ind 5 b f h u ind ind 6 c f h u ind ind 7 c f NA v ind ind 1232 c f h v ind imputed values can be seen as degree of membership 9 / 21

19 Principal component method for mixed data The core of principal component methods is PCA on particular matrices "Doing a data analysis, in good mathematics, is simply searching eigenvectors, all the science of it (the art) is just to nd the right matrix to diagonalize" (Benzécri) 10 / 21

20 Principal component method for mixed data (complete case) Factorial Analysis on Mixed Data (Escoer, 1979), PCAMIX (Kiers, 1991) Continuous variables Categorical variables Indicator matrix centring & scaling I 1 I 2 I k division by I/Ik and centring Matrix which balances the influence of each variable A PCA is performed on the weighted matrix 11 / 21

21 Properties of the method The distance between individuals is: K cont d 2 (i, l) = (x ik x lk ) 2 + k=1 K Q q q=1 k=1 The principal component F s maximises: K cont k=1 Q cat r 2 (F s, v k ) + η 2 (F s, v q ) q=1 1 I kq (x iq x lq ) 2 12 / 21

22 Imputation of mixed data The same kind of iterative algorithm as before age weight size alcohol sex snore tobacco NA NA M yes no gl/d M NA <=1 NA No W no NA gl/d M no <=1 NA NA NA NA NA NA NA NA NA NA imputeafdm age weight size alcohol sex snore tobacco gl/d M yes no gl/d M no <= No W no <= gl/d M no <= / 21

23 Properties on the imputations Imputation based on scores and loadings similarities between individuals and relationships between variables Relationships between continuous and categorical variables are taken into account Categorical variables evolve within many dimensions (k q 1 if they have k q categories) so they need many dimensions to be well predicted Compared to a PCA on the (unweighted) indicator matrix, small categories are better imputed 14 / 21

24 Simulation pattern Simulations 2 independent variables are drawn from a normal distribution 1 variable is replicated 4 times, the other 8 2 dimensions Random noise is added Half of the variables in each dimension are split in 3 clusters 10%, 20% or 30% of missing values are chosen at random Data are constructed (expected) to be in 4 dimensions Criterion for continuous data: N2RMSE = i missing mean ( ( X true i var (X true i ) ) ) 2 X imp i for categorical data: proportion of falsely classied entries 15 / 21

25 Simulations Imputation using continuous data only Imputation using both continuous and categorical data Error on continous data N2RMSE quanti 10% mixte 10% quanti 20% mixte 20% quanti 30% mixte 30% 16 / 21

26 Simulations Imputation using continuous data only Imputation using both continuous and categorical data Error on continous data N2RMSE quanti 10% mixte 10% quanti 20% mixte 20% quanti 30% mixte 30% Categorical data improved the imputation on continuous data / 21

27 Simulations Imputation using continuous data only Imputation using categorical data only Imputation using both continuous and categorical data Error on continous data Error on categorical data N2RMSE PFC quanti 10% mixte 10% quanti 20% mixte 20% quanti 30% mixte 30% quali 10% mixte 10% quali 20% mixte 20% quali 30% mixte 30% Categorical data improved the imputation on continuous data / 21

28 Simulations Imputation using continuous data only Imputation using categorical data only Imputation using both continuous and categorical data Error on continous data Error on categorical data N2RMSE PFC quanti 10% mixte 10% quanti 20% mixte 20% quanti 30% mixte 30% quali 10% mixte 10% quali 20% mixte 20% quali 30% mixte 30% Categorical data improved the imputation on continuous data and continuous data improved the imputation on categorical data 16 / 21

29 Simulations Error on continuous variables Error on the qualitative variables N2RMSE % 20% 30% PFC % 20% 30% Nb of dimensions Nb of dimensions The error on the estimation of the number of dimensions has not an important impact on the imputation error... if the estimation is not too bad 17 / 21

30 Comparison with random forest on real data sets Imputations obtained with random forest & iterative algorithm GBSG2 N2RMSE RF 10% AFDM 10% RF 20% AFDM 20% RF 30% AFDM 30% PFC RF 10% AFDM 10% RF 20% AFDM 20% RF 30% AFDM 30% 18 / 21

31 Comparison with random forest on real data sets Imputations obtained with random forest & iterative algorithm GBSG2 Ozone N2RMSE N2RMSE RF 10% AFDM 10% RF 20% AFDM 20% RF 30% AFDM 30% RF 10% AFDM 10% RF 20% AFDM 20% RF 30% AFDM 30% PFC PFC RF 10% AFDM 10% RF 20% AFDM 20% RF 30% AFDM 30% RF 10% AFDM 10% RF 20% AFDM 20% RF 30% AFDM 30% 18 / 21

32 Comparison with random forest Compared to random forest, imputations are quite similar Imputations are slightly better: for categorical variables especially for rare categories and imputations are worse: when there are non-linear relationships between continuous variables when there are interactions 19 / 21

33 Conclusion A new way to impute missing values is ecient when strong linear relationships between variables (you learn from the other variables) but needs tuning parameters, cv? approximation? is available in the missmda package. This package: handles missing values in principal component methods (PCA, MCA, MIXPCA, MFA) impute missing values for continuous, categorical and mixed variables performs multiple imputation for continuous variables 20 / 21

34 Perspective How to perform a statistical analysis from an incomplete dataset? we can modify the estimation process to apply it on an incomplete dataset (not always easy!) we can predict the missing entries with a single imputation method, but BE CAREFUL using the usual methods leads to underestimate the standard errors An alternative is to use multiple imputation... and single imputation is a rst step towards multiple imputation 21 / 21

missmda: a package to handle missing values in Multivariate exploratory Data Analysis methods

missmda: a package to handle missing values in Multivariate exploratory Data Analysis methods Julie Josse & Francois Husson Applied Mathematics Department Agrocampus Rennes - France user! 2011, Warwick,