Handling missing values in Multiple Factor Analysis


1 Handling missing values in Multiple Factor Analysis
François Husson, Julie Josse
Applied mathematics department, Agrocampus Ouest, Rennes, France
Rabat, 26 March

2 Multi-blocks data set
Groups of variables (MFA): groups of variables are quantitative and/or qualitative
Objectives:
- study the link between the sets of variables
- balance the influence of each group of variables
- give the classical graphs but also specific graphs: groups of variables - partial representation
Examples:
- Genomic: DNA, protein
- Sensory analysis: sensorial, physico-chemical
- Comparison of coding (quantitative / qualitative)
Continuous and/or categorical (contingency) sets of variables:
- products - sensorial, physico-chemical
- panels comparison (one group = one country)
- products - judges (napping, flash profile, etc.)
- collect methods comparison (sorting task / QDA)
Similarities - specificities of each group?
Multiple Factor Analysis (Escofier & Pagès, 1998)

3 Missing values in multi-blocks data set
Different patterns of missing values: scattered or structured
Sensory analysis: each judge can't assess more than a certain number of products (saturation), so an experimental design (BIB) is used
MFA with missing values

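4 Missing values in MFA
MFA balances the influence of the groups:
$$X = \left[ \frac{X_1}{\sqrt{\lambda_1^1}} \; ; \; \frac{X_2}{\sqrt{\lambda_1^2}} \; ; \; \dots \; ; \; \frac{X_K}{\sqrt{\lambda_1^K}} \right]$$
Complete case: SVD (PCA) on X, minimizing
$$\left\| X_{n \times p} - U_{n \times S} \, \Lambda^{1/2}_{S \times S} \, V'_{p \times S} \right\|^2$$
With missing values: weighted least squares, minimizing
$$\left\| W_{n \times p} * \left( X_{n \times p} - U_{n \times S} \, \Lambda^{1/2}_{S \times S} \, V'_{p \times S} \right) \right\|^2$$
with $w_{ij} = 0$ if $x_{ij}$ is missing, $w_{ij} = 1$ otherwise ($*$ is the elementwise product)
Algorithms: weighted alternating least squares - iterative imputation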
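The group weighting on this slide can be sketched in a few lines of NumPy. This is an illustrative sketch, not the FactoMineR implementation: the function name `mfa_weight` is hypothetical, each group is only centered (not standardized), and the first eigenvalue of group k is taken as the squared first singular value divided by the number of rows.

```python
import numpy as np

def mfa_weight(groups):
    """Concatenate groups after dividing each by sqrt of its first eigenvalue.

    `groups` is a list of (n x p_k) arrays sharing the same rows.
    Assumption: variables are centered only; lambda_1^k = s_1^2 / n.
    """
    weighted = []
    for Xk in groups:
        Xk = np.asarray(Xk, dtype=float)
        Xc = Xk - Xk.mean(axis=0)                      # center each variable
        s1 = np.linalg.svd(Xc, compute_uv=False)[0]    # first singular value
        lambda1 = s1 ** 2 / Xc.shape[0]                # first eigenvalue of group k
        weighted.append(Xc / np.sqrt(lambda1))         # X_k / sqrt(lambda_1^k)
    return np.hstack(weighted)
```

After this step every group has a first eigenvalue equal to 1, so no single group can dominate the first dimension of the global analysis.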

8 Iterative MFA
1. initialization $\ell = 0$: $X^0$ (mean imputation)
2. steps of estimation and imputation are repeated:
(a) SVD on the completed data: $(U^\ell, \Lambda^\ell, V^\ell)$; S dimensions kept
(b) $X^\ell = W * X + (1 - W) * \hat{X}^\ell$, with $\hat{X}^\ell = U^\ell (\Lambda^\ell)^{1/2} V^{\ell\prime}$
(c) means, standard deviations and eigenvalues are updated
Number of dimensions: cross-validation (Bro, 2008; Josse & Husson, 2011)
Same rationale in other multi-blocks methods: GPA (Commandeur, 1991), PARAFAC (Tomasi & Bro, 2005), GCCA (Van de Velden & Bijmolt, 2006)
Overfitting problems:
- many parameters / observed values (S large - many NA)
- data are very noisy
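The estimation and imputation loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the missMDA code: the function name `iterative_svd_impute` and its parameters are hypothetical, the data are treated as a single block (no group weights), only the column means are updated at each pass (standard deviations and group eigenvalues are left out), and every column is assumed to have at least one observed value.

```python
import numpy as np

def iterative_svd_impute(X, n_dims, max_iter=500, tol=1e-8):
    """Impute NaN entries of X by iterating a rank-`n_dims` SVD fit."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)                       # W: True where x_ij is observed
    # Step 1: initialization by mean imputation
    X_filled = np.where(observed, X, np.nanmean(X, axis=0))
    for _ in range(max_iter):
        means = X_filled.mean(axis=0)             # step (c): update the means
        # Step (a): SVD on the completed (centered) data, n_dims kept
        U, s, Vt = np.linalg.svd(X_filled - means, full_matrices=False)
        X_hat = (U[:, :n_dims] * s[:n_dims]) @ Vt[:n_dims] + means
        # Step (b): keep observed values, replace missing ones by the fit
        X_new = np.where(observed, X, X_hat)
        change = np.sum((X_new - X_filled) ** 2)
        X_filled = X_new
        if change < tol:                          # stop when imputations stabilize
            break
    return X_filled
```

A call such as `iterative_svd_impute(X, n_dims=2)` returns a completed copy of `X` in which the observed cells are left untouched.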

9 Regularized iterative MFA (Husson & Josse, 2013)
Initialization - estimation step - imputation step
The imputation step
$$\hat{x}^{\text{MFA}}_{ij} = \sum_{s=1}^{S} \sqrt{\lambda_s} \, u_{is} v_{js}$$
is replaced by a "shrunk" imputation step
$$\hat{x}^{\text{rMFA}}_{ij} = \sum_{s=1}^{S} \left( \frac{\lambda_s - \hat{\sigma}^2}{\sqrt{\lambda_s}} \right) u_{is} v_{js}$$
Compromise between hard and soft thresholding (Mazumder, Hastie & Tibshirani, 2010)
- $\sigma^2$ small: regularized MFA $\approx$ MFA
- $\sigma^2$ large: "kill" the dimensions with small eigenvalues
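The shrunk reconstruction can be illustrated with a short NumPy sketch. Several choices here are assumptions made for the example and are not taken from the slides: the function name `shrunk_reconstruction` is hypothetical, $\lambda_s$ is taken as the squared singular value of the centered data, the noise variance defaults to the mean of the discarded $\lambda_s$, and the numerator is clipped at zero as a safeguard.

```python
import numpy as np

def shrunk_reconstruction(X_centered, n_dims, sigma2=None):
    """Rank-`n_dims` fit with singular values shrunk by (lambda - sigma2) / sqrt(lambda)."""
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    lam = s ** 2                                   # assumption: lambda_s = s_s^2
    if sigma2 is None:
        # assumption: noise variance = mean of the discarded eigenvalues
        sigma2 = lam[n_dims:].mean() if n_dims < lam.size else 0.0
    # (lambda_s - sigma2) / sqrt(lambda_s), clipped at 0 as a safeguard
    shrunk = np.maximum(lam[:n_dims] - sigma2, 0.0) / s[:n_dims]
    return (U[:, :n_dims] * shrunk) @ Vt[:n_dims]
```

With `sigma2=0` the function returns the ordinary truncated SVD fit, which matches the "$\sigma^2$ small" case; as `sigma2` grows, dimensions whose eigenvalue is close to the noise level contribute almost nothing.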

10 Example on a wine data set
21 wines described by two groups of variables (olfaction / gustation); 6 missing rows in the olfaction group
[Figure: True configuration - partial points (Olfaction, Tasting); axes: Dim 1 (65.39 %), Dim 2 (13.22 %)]

11 Example on a wine data set
21 wines described by two groups of variables (olfaction / gustation); 6 missing rows in the olfaction group
[Figure, left panel: True configuration - partial points (Olfaction, Tasting); axes: Dim 1 (65.39 %), Dim 2 (13.22 %)]
[Figure, right panel: Mean imputation - partial points (Olfaction, Tasting); axes: Dim 1, Dim 2 (22.13 %)]

12 Example on a wine data set
21 wines described by two groups of variables (olfaction / gustation); 6 missing rows in the olfaction group
[Figure, left panel: True configuration - partial points (Olfaction, Tasting); axes: Dim 1 (65.39 %), Dim 2 (13.22 %)]
[Figure, right panel: Iterative MFA - partial points (Olfaction, Tasting); axes: Dim 1 (71.74 %), Dim 2 (15.29 %)]

13 Simulations on napping data set
99 judges - 12 perfumes
Subset of products (6-11): experimental design
RV coefficient between the "true" MFA product coordinates (12 × 2) and the regularized iterative MFA ones (12 × 2)
[Figure: RV coefficient against the number of products seen by each panellist, for iterative MFA and mean imputation, with 12, 25, 50, 75 and 99 consumers; annotated points: 75 judges and 8 products, 99 judges and 12 products]
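The RV coefficient used to compare two product configurations can be computed as below. This is a minimal sketch of the standard definition, $RV(A, B) = \mathrm{tr}(AA'BB') / \sqrt{\mathrm{tr}((AA')^2)\,\mathrm{tr}((BB')^2)}$ on column-centered matrices; the function name `rv_coefficient` is chosen for the example.

```python
import numpy as np

def rv_coefficient(A, B):
    """RV coefficient between two configurations of the same rows (e.g. 12 x 2)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    A = A - A.mean(axis=0)          # center the coordinates
    B = B - B.mean(axis=0)
    SA = A @ A.T                    # cross-product matrices between rows
    SB = B @ B.T
    return np.trace(SA @ SB) / np.sqrt(np.trace(SA @ SA) * np.trace(SB @ SB))
```

The coefficient lies between 0 and 1 and equals 1 when one configuration is a rotated and uniformly rescaled copy of the other, which is why it suits the comparison of factorial maps.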

14 Simulations on napping data set
[Table: Panelists / Nb products / RV]
- Better to have 25 judges assessing all the products than 50 judges assessing half of the products
- Better to have more judges (50) assessing fewer products (9) than a small number of judges (25) assessing all the products (12)

15 Conclusion
- Multi-blocks method with missing values (regularized algorithm)
- Evaluation of n products - each judge assessing (n - k) products
- R package missMDA: missing values in principal components methods (PCA, MCA, MIXPCA, MFA, multi-level methods)
- single imputation for continuous - categorical - mixed data
- single imputation is a first step to multiple imputation
Videos on Youtube! FactoMineR package
