Distributed multilevel matrix completion for medical databases

1 Distributed multilevel matrix completion for medical databases Julie Josse Ecole Polytechnique, INRIA Séminaire Parisien de Statistique, IHP January 15, / 38

2 Overview 1 Introduction 2 Single imputation with PCA 3 Single imputation for mixed multilevel data 4 Distribution 2 / 38

3 Outline 1 Introduction 2 Single imputation with PCA 3 Single imputation for mixed multilevel data 4 Distribution 3 / 38

4 Collaborators Imputation of mixed data with multilevel SVD Distributed computation Genevieve Robin, François Husson, Balasubramanian Narasimhan Polytechnique, Agrocampus, Stanford (stat/biomedical) 4 / 38

5 Public Assistance - Paris Hospitals
Traumabase: patients / 250 variables
Excerpt of the table (columns: Center, Accident, Age, Sex, Weight, Height, BMI, BP, SBP, SpO2, Temperature, Lactates, Hb, Glasgow, Transfusion), rows: Beaujon (Fall, 54, m), Lille (Other, 33, m), Pitie Salpetriere (Gun, 26, m), Beaujon (AVP moto, 63, m), Pitie Salpetriere (AVP bicycle, 33, m), Pitie Salpetriere (AVP pedestrian, 30, w), HEGP (White weapon, 16, m), Toulon (White weapon, 20, m), Bicetre (Fall, 61, m); many entries are NR/NM/NA
Predict the Glasgow score, whether to start a blood transfusion, whether to administer fresh frozen plasma, etc. (Logistic) regressions with categorical/continuous values 5 / 38

6 Public Assistance - Paris Hospitals
Traumabase: patients / 250 variables / 6 hospitals (same table excerpt as the previous slide)
Predict the Glasgow score, whether to start a blood transfusion, whether to administer fresh frozen plasma, etc. (Logistic) regressions with missing categorical/continuous values 5 / 38

7 Public Assistance - Paris Hospitals
6 hospitals: multilevel structure (patients nested within hospitals)
Hospital effect: lack of standardization
Missing values (same table excerpt as above)
1) Handling missing values in multilevel data 6 / 38

8 Public Assistance - Paris Hospitals
6 hospitals: multilevel structure (patients nested within hospitals)
Hospital effect: lack of standardization
Missing values (same table excerpt as above)
1) Handling missing values in multilevel data
2) Data are not aggregated but stay on each site 6 / 38

9 Missing values are everywhere: unanswered questions in a survey, lost data, damaged plants, machines that fail...
"The best thing to do with missing values is not to have any." (Gertrude Mary Cox)
Still an issue with "big data"
Data integration: data from different sources (diagram: matrices X_1, ..., X_I stacked by group, with ? marking missing blocks)
Multilevel structure: sporadically and systematically missing values (e.g. one variable missing in an entire hospital) 7 / 38

10 Solutions to handle missing values
Literature: Schafer (2002); Little & Rubin (2002); Gelman & Meng (2004); Kim & Shao (2013); Carpenter & Kenward (2013); van Buuren (2015)
Modify the estimation process to deal with missing values. Maximum likelihood: EM algorithm to obtain point estimates + Supplemented EM (Meng & Rubin, 1991) or Louis' formula for their variability. One specific algorithm for each statistical method...
(Multiple) imputation to get a completed data set on which you can perform any statistical method (Rubin, 1976)
Famous imputation based on SVD (PCA) - quantitative data 8 / 38

11 Solutions to handle missing values (same as the previous slide, adding:)
Extension to imputation with multilevel SVD 8 / 38

12 Outline 1 Introduction 2 Single imputation with PCA 3 Single imputation for mixed multilevel data 4 Distribution 9 / 38

13 PCA reconstruction (figure: data points x, the principal subspace spanned by V, and the fitted points \hat{\mu})
Minimizes the distance between observations and their projection: approximate X (n x p) with a low-rank matrix, k < p. With \|A\|_2^2 = tr(AA^\top):
\hat{\mu}_{PCA} = \arg\min_{\mu} \{ \|X - \mu\|_2^2 : \mathrm{rank}(\mu) \le k \}
Solution via the SVD of X: \hat{\mu}_{PCA} = U_{n \times k} D_{k \times k} V_{p \times k}^\top = F_{n \times k} V_{p \times k}^\top, with F = UD the principal component scores and V the principal axes (loadings) 10 / 38

14 PCA reconstruction (same figure, now with some entries of X marked NA) 10 / 38

15 Missing values in PCA
PCA: least squares
\arg\min_{\mu} \{ \|X_{n \times p} - \mu_{n \times p}\|_2^2 : \mathrm{rank}(\mu) \le k \}
PCA with missing values: weighted least squares
\arg\min_{\mu} \{ \|W_{n \times p} \odot (X - \mu)\|_2^2 : \mathrm{rank}(\mu) \le k \}
with W_{ij} = 0 if X_{ij} is missing, W_{ij} = 1 otherwise; \odot denotes elementwise multiplication
Many algorithms: Gabriel & Zamir, 1979: weighted alternating least squares (without explicit imputation); Kiers, 1997: iterative PCA (with imputation) 11 / 38

16-24 Iterative PCA (toy two-variable example; the figure shows the imputed points moving at each step)
Initialization l = 0: X^0 (mean imputation)
PCA on the completed data set gives (U^l, D^l, V^l)
Missing values are imputed with the fitted matrix \hat{\mu}^l = U^l D^l V^{l\top}
The new imputed dataset is \hat{X}^l = W \odot X + (1 - W) \odot \hat{\mu}^l
Steps are repeated until convergence 12 / 38

25 Iterative PCA
1. Initialization l = 0: X^0 (mean imputation)
2. Step l: (a) PCA on the completed data gives (U^l, D^l, V^l); k dimensions kept; (b) \hat{\mu}^l = U^l D^l V^{l\top} and X^l = W \odot X + (1 - W) \odot \hat{\mu}^l
3. The estimation and imputation steps are repeated until convergence 13 / 38

26 Iterative PCA (same algorithm as the previous slide, adding:)
\hat{\mu} from incomplete data: an EM algorithm for the model X = F V^\top + \varepsilon, i.e. X = \mu + \varepsilon with \mu of low rank and \varepsilon_{ij} iid N(0, \sigma^2)
Completed data: good imputation (matrix completion, Netflix) (Udell & Townsend, Nice Latent Variable Models Have Log-Rank, 2017)
Selecting k? Generalized cross-validation (Josse & Husson, 2012) 13 / 38
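A minimal base-R sketch of this iterative algorithm, assuming a numeric matrix with NAs and a chosen rank k (function and variable names are illustrative; column centring is omitted to keep it short):

```r
## Iterative PCA imputation: alternate a rank-k SVD fit on the completed
## matrix and imputation of the missing cells, until the imputations stabilise.
iterative_pca_impute <- function(X, k, max_iter = 200, tol = 1e-6) {
  W <- !is.na(X)                                  # indicator of observed cells
  Xhat <- X
  for (j in seq_len(ncol(X)))                     # initialisation: mean imputation
    Xhat[!W[, j], j] <- mean(X[, j], na.rm = TRUE)
  for (l in seq_len(max_iter)) {
    s  <- svd(Xhat)                               # PCA/SVD of the completed data
    mu <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k) %*%
          t(s$v[, 1:k, drop = FALSE])             # rank-k fitted matrix
    Xnew <- ifelse(W, X, mu)                      # keep observed cells, impute the rest
    if (sum((Xnew - Xhat)^2) < tol) { Xhat <- Xnew; break }
    Xhat <- Xnew
  }
  Xhat
}

## Toy use (sizes and rank are arbitrary)
set.seed(1)
X <- matrix(rnorm(100), 20, 5); X[sample(length(X), 15)] <- NA
X_completed <- iterative_pca_impute(X, k = 2)
```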

28 Regularized iterative SVD
Overfitting issues of iterative PCA: many parameters (U_{n \times k}, V_{k \times p}) relative to the observed values (k large, many NA); noisy data
Regularized versions keep the same initialization - estimation - imputation steps, but the imputation by \hat{\mu}_{PCA} = \sum_{q=1}^{k} d_q u_q v_q^\top is replaced either by soft-thresholding, (\hat{\mu})_\lambda = \sum_{q} (d_q - \lambda)_+ u_q v_q^\top, the solution of \arg\min_{\mu} \{ \|W \odot (X - \mu)\|_2^2 + \lambda \|\mu\|_* \}, or by the shrinkage \hat{\mu} = \sum_{q=1}^{k} (d_q - \hat{\sigma}^2 / d_q) u_q v_q^\top
Hastie et al. (2015); Verbanck, J. & Husson (2013); Gavish & Donoho (2014); J. & Wager (2015); J. & Sardy (2014)
Iterative SVD algorithms are good at imputing data; the model makes sense: data = structure of rank k + noise 14 / 38
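The regularized variants only change the reconstruction step. A sketch of the two shrinkage rules quoted above, to be used in place of the plain rank-k fit in the iterative loop (the function name and argument choices are assumptions):

```r
## Reconstruction with shrunken singular values: soft-thresholding (d_q - lambda)_+
## or the shrinkage (d_q - sigma2/d_q) applied to the k leading values.
shrunk_fit <- function(Xhat, k, lambda = NULL, sigma2 = NULL) {
  s <- svd(Xhat)
  if (!is.null(lambda)) {
    d <- pmax(s$d - lambda, 0)                              # soft-thresholding
  } else {
    d <- c(pmax(s$d[1:k] - sigma2 / s$d[1:k], 0),           # shrink the k kept values
           rep(0, length(s$d) - k))                         # drop the remaining ones
  }
  s$u %*% diag(d, length(d)) %*% t(s$v)
}
```

Swapping this reconstruction into the loop above gives a regularized iterative SVD in the spirit of the references cited.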

29 Outline 1 Introduction 2 Single imputation with PCA 3 Single imputation for mixed multilevel data 4 Distribution 15 / 38

30 Multilevel component analysis
Example: inhabitants nested within countries; X \in R^{K \times J} stacks the group matrices X_1, ..., X_I (group i has k_i rows)
Similarities between countries? (level 1) Similarities between inhabitants within each country? (level 2) Relationships between variables at each level
x_{ijk_i} = \bar{x}_{\cdot j \cdot} + (\bar{x}_{ij\cdot} - \bar{x}_{\cdot j \cdot}) + (x_{ijk_i} - \bar{x}_{ij\cdot}) : between + within
Analysis of variance: split the sum of squares of each variable j,
\sum_{i=1}^{I} \sum_{k=1}^{k_i} x_{ijk}^2 = \sum_{i=1}^{I} k_i \bar{x}_{\cdot j \cdot}^2 + \sum_{i=1}^{I} k_i (\bar{x}_{ij\cdot} - \bar{x}_{\cdot j \cdot})^2 + \sum_{i=1}^{I} \sum_{k=1}^{k_i} (x_{ijk} - \bar{x}_{ij\cdot})^2 16 / 38
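A quick numerical check of this sum-of-squares split for a single variable, with made-up groups:

```r
## Total sum of squares = grand-mean part + between-group part + within-group part.
set.seed(2)
g <- rep(1:3, times = c(5, 8, 7))                 # group label of each observation
x <- rnorm(length(g), mean = g)                   # one variable
grand   <- mean(x)
groupm  <- ave(x, g)                              # group mean, repeated per observation
total   <- sum(x^2)
parts   <- c(grand   = length(x) * grand^2,
             between = sum((groupm - grand)^2),
             within  = sum((x - groupm)^2))
all.equal(total, sum(parts))                      # TRUE: the decomposition is exact
```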

31 Multilevel PCA (MLPCA): model for the between and within parts
For i = 1, ..., I groups and J variables:
X_{i (k_i \times J)} = 1_{k_i} m^\top + 1_{k_i} F_i^b V^{b\top} + F_i^w V^{w\top} + E_i
F_i^b (1 \times Q_b): between component scores of group i; V^b (J \times Q_b): between loading matrix
F_i^w (k_i \times Q_w): within component scores of group i; V^w (J \times Q_w): within loading matrix, constant across groups
Fitted by least squares (Timmerman, 2006):
\arg\min_{m, F_i^b, V^b, F_i^w, V^w} \sum_{i=1}^{I} \| X_i - 1_{k_i} m^\top - 1_{k_i} F_i^b V^{b\top} - F_i^w V^{w\top} \|^2
with \sum_{i=1}^{I} k_i F_i^b = 0_{Q_b} and 1_{k_i}^\top F_i^w = 0_{Q_w} for all i, for identifiability 17 / 38

32 MLPCA - quantitative data
i = 1, ..., I groups, J variables, k_i observations in group i. Estimation: minimize the residual sum of squares
\arg\min \sum_{i=1}^{I} \| X_i - 1_{k_i} m^\top - 1_{k_i} F_i^b V^{b\top} - F_i^w V^{w\top} \|^2,
with \sum_{i=1}^{I} k_i F_i^b = 0_{Q_b} and 1_{k_i}^\top F_i^w = 0_{Q_w} for all i, for identifiability.
(\hat{F}^b, \hat{V}^b): weighted PCA on the between part, i.e. SVD of W X_m, where X_m (I \times J) contains the means of the variables per group and W (I \times I) is diagonal with w_{ii} = k_i
(\hat{F}^w, \hat{V}^w): PCA on the within part, i.e. SVD of the data centred per group, X_w (K \times J), K = \sum_i k_i
With missing values: weighted least squares, solved by an iterative imputation algorithm (imputation - estimation) 18 / 38
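A base-R sketch of this complete-data estimation, with a weighted SVD of the group means for the between part and an SVD of the group-centred data for the within part. The weighting by the square root of the group sizes is one common convention; names and ranks are illustrative and the identifiability constraints are only handled implicitly:

```r
## Complete-data multilevel SVD. X: K x J numeric matrix, g: group label per row.
multilevel_svd_fit <- function(X, g, Qb = 2, Qw = 2) {
  g  <- as.factor(g)
  m  <- colMeans(X)
  Xc <- sweep(X, 2, m)                                    # remove the overall offset
  Xm <- apply(Xc, 2, function(col) tapply(col, g, mean))  # I x J matrix of group means
  w  <- sqrt(as.numeric(table(g)))                        # sqrt of the group sizes
  sb <- svd(Xm * w)                                       # weighted SVD: between part
  Bmeans <- (sb$u[, 1:Qb, drop = FALSE] %*% diag(sb$d[1:Qb], Qb) %*%
             t(sb$v[, 1:Qb, drop = FALSE])) / w           # fitted group means
  between <- Bmeans[g, , drop = FALSE]                    # expanded to one row per obs.
  sw <- svd(Xc - Xm[g, , drop = FALSE])                   # SVD of group-centred data
  within <- sw$u[, 1:Qw, drop = FALSE] %*% diag(sw$d[1:Qw], Qw) %*%
            t(sw$v[, 1:Qw, drop = FALSE])                 # within part
  sweep(between + within, 2, m, "+")                      # fitted matrix 1 m' + Xb + Xw
}
```

With missing values, this estimation step is alternated with imputation of the missing cells by the fitted matrix, exactly as in the iterative PCA sketch above.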

33 Iterative MLPCA
2. Iteration l, estimation of the between structure: SVD of W X_m^l = P D Q^\top; the first Q_b eigenvectors are kept: \hat{F}_i^b = [W^{-1} P_{Q_b}]_i, \hat{F}^b is the row-wise concatenation of the blocks 1_{k_i} \hat{F}_i^b, and \hat{V}^b = Q_{Q_b} D_{Q_b} (J \times Q_b); the between hat matrix is (\hat{X}^b)^l = \hat{F}^b \hat{V}^{b\top}
3. Iteration l, imputation of the missing values with the fitted values: \hat{X}^l = 1_K \hat{m}^{(l-1)\top} + (\hat{X}^b)^l + (\hat{X}^w)^{(l-1)}; the newly imputed dataset is X^l = M \odot X + (1_K 1_J^\top - M) \odot \hat{X}^l, with M the indicator of observed entries; \hat{m}^l is computed on X^l
4. Iteration l, estimation of the within structure: SVD of (X^w)^l = P D Q^\top; the first Q_w eigenvectors are kept: \hat{F}^w = P_{Q_w} (K \times Q_w), \hat{V}^w = Q_{Q_w} D_{Q_w} (J \times Q_w); the within hat matrix is (\hat{X}^w)^l = \hat{F}^w \hat{V}^{w\top}
5. Iteration l, imputation of the missing values with the fitted values: X^{l+1} = M \odot X + (1_K 1_J^\top - M) \odot (1_K \hat{m}^{(l)\top} + (\hat{X}^b)^l + (\hat{X}^w)^l); \hat{m}^{l+1} is computed on X^{l+1} 19 / 38
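Combining the two previous sketches gives the overall shape of the iterative multilevel imputation. This is a rough sketch: the algorithm above updates the between and within parts in separate sub-steps and re-estimates the offset at each pass, whereas here a single complete-data fit is used per iteration:

```r
## Rough sketch: alternate the complete-data multilevel fit and imputation.
iterative_mlpca_impute <- function(X, g, Qb = 2, Qw = 2, max_iter = 100, tol = 1e-6) {
  M <- !is.na(X)                               # indicator of observed cells
  Xhat <- X
  for (j in seq_len(ncol(X)))                  # initialisation: mean imputation
    Xhat[!M[, j], j] <- mean(X[, j], na.rm = TRUE)
  for (l in seq_len(max_iter)) {
    fit  <- multilevel_svd_fit(Xhat, g, Qb, Qw)   # sketched after slide 32
    Xnew <- ifelse(M, X, fit)                     # keep observed cells, impute the rest
    if (sum((Xnew - Xhat)^2) < tol) { Xhat <- Xnew; break }
    Xhat <- Xnew
  }
  Xhat
}
```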

34 Regularized iterative MLSCA
Estimation by SVD - imputation by \hat{X} = 1_K \hat{m}^\top + \hat{F}^b \hat{V}^{b\top} + \hat{F}^w \hat{V}^{w\top}, with shrunken singular values:
\hat{V}^b = Q_{Q_b} D_{Q_b} = Q_{Q_b}\, \mathrm{diag}(d_1, ..., d_{Q_b}) is replaced by Q_{Q_b}\, \mathrm{diag}(d_1 - \hat{\sigma}_b^2/d_1, ..., d_{Q_b} - \hat{\sigma}_b^2/d_{Q_b}), with \hat{\sigma}_b^2 = \frac{1}{J - Q_b} \sum_{q=Q_b+1}^{J} d_q
and similarly \hat{V}^w = Q_{Q_w}\, \mathrm{diag}(d_1 - \hat{\sigma}_w^2/d_1, ..., d_{Q_w} - \hat{\sigma}_w^2/d_{Q_w}) 20 / 38

35 Multiple Correspondence Analysis (MCA)
X (n \times m): m categorical variables, coded with the indicator matrix A of dummy variables; D_p is the diagonal matrix of the column proportions of A (small worked example on the slide, with categories such as attack/suicide/accident, omitted)
For a category c, its frequency is p_c = n_c / n. MCA is an SVD of the weighted, centred indicator matrix:
Z = \frac{1}{\sqrt{mn}} (A - 1 p^\top) D_p^{-1/2} = U D V^\top
The principal components F = UD satisfy \arg\max_{F_q \in R^n} \frac{1}{m} \sum_{j=1}^{m} \eta^2(F_q, X_j), where the correlation ratio \eta^2(F, X_j) is the between-category sum of squares \sum_{c=1}^{C_j} n_c (\bar{F}_c - \bar{F})^2 divided by the total sum of squares
Benzécri, 1973: "In data analysis the mathematical problem reduces to computing eigenvectors; all the science (the art) is in finding the right matrix to diagonalize" 21 / 38
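A sketch of this weighted-SVD view of MCA on a small data frame of factors. The exact scaling constant is taken from the formula above and is an assumption; other conventions only rescale the axes:

```r
## MCA as an SVD of the centred, margin-weighted indicator matrix.
mca_svd <- function(df, ncp = 2) {
  stopifnot(all(sapply(df, is.factor)))
  n <- nrow(df); m <- ncol(df)
  A <- model.matrix(~ . - 1, data = df,                    # full indicator (dummy) matrix
                    contrasts.arg = lapply(df, contrasts, contrasts = FALSE))
  p <- colMeans(A)                                         # category frequencies p_c = n_c / n
  Z <- (A - matrix(p, n, length(p), byrow = TRUE)) %*% diag(1 / sqrt(p)) / sqrt(m * n)
  s <- svd(Z)
  list(scores   = s$u[, 1:ncp, drop = FALSE] %*% diag(s$d[1:ncp], ncp),  # F = UD
       loadings = s$v[, 1:ncp, drop = FALSE])
}

## Tiny example
df <- data.frame(v1 = factor(c("a", "b", "a", "c", "b")),
                 v2 = factor(c("x", "x", "y", "y", "x")))
res <- mca_svd(df, ncp = 2)
```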

36 Multilevel MCA
Start from the matrix of dummy variables A and define a between and a within part; then MCA is applied to each part.
Between part: apply MCA to the matrix of the means of A per group i (the proportion of observations taking each category in group i, e.g. the proportion of some disease in a particular hospital): \hat{A}^b = F^b V^{b\top} D_p^{1/2} + 1_n p^\top
Within part: apply MCA to the data where the between part has been swept out (the SVD is applied to \frac{1}{np} (A - \hat{A}^b) D_p^{-1/2}): \hat{A}^w = (np) F^w V^{w\top} D_p^{1/2}
\hat{A} = \hat{A}^b + \hat{A}^w 22 / 38

37-44 Regularized iterative Multilevel MCA
Initialization: imputation of the indicator matrix (with the column proportions)
Iterate until convergence:
1. estimation: Multilevel MCA on the completed data gives \hat{F}^b, \hat{V}^b, \hat{F}^w, \hat{V}^w
2. imputation with the fitted matrix \hat{A} = \hat{A}^b + \hat{A}^w
3. the column margins are updated
Example: a row with V1 = NA gets fuzzy values in the V1_a, V1_b, V1_c columns of the indicator matrix; the imputed values can be seen as degrees of membership
Two ways to impute categories: majority or random draw 23 / 38
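A small sketch of these two rules for turning the fuzzy indicator values of one missing cell into a category (the input format, a named vector of degrees of membership, is an assumption):

```r
## Choose a category from degrees of membership: most likely one, or a random draw.
impute_category <- function(memberships, method = c("majority", "draw")) {
  method <- match.arg(method)
  memberships <- pmax(memberships, 0)
  memberships <- memberships / sum(memberships)    # renormalise to probabilities
  if (method == "majority") {
    names(which.max(memberships))
  } else {
    sample(names(memberships), 1, prob = memberships)
  }
}

impute_category(c(V1_a = 0.15, V1_b = 0.70, V1_c = 0.15), "majority")  # "V1_b"
impute_category(c(V1_a = 0.15, V1_b = 0.70, V1_c = 0.15), "draw")      # random category
```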

45 Public Assistance - Paris Hospitals
Traumabase: patients / 250 variables / 6 hospitals (same incomplete table excerpt as in the introduction) 24 / 38

46 Imputed Paris Hospitals data
Traumabase: patients / 250 variables / 6 hospitals: the same excerpt, with the missing Weight, Height, BMI, BP, SBP, SpO2, Temperature, Lactates, Hb and Glasgow entries now filled in by the imputation 25 / 38

47 The simulated data: design
X_{i (k_i \times J)} = 1_{k_i} m^\top + 1_{k_i} F_i^b V^{b\top} + F_i^w V^{w\top} + E_i, with E_{ijk_i} \sim N(0, \sigma)
5 groups, 10 variables, Q_b = 2, Q_w = 2
Many scenarios are considered: number of measurement occasions (10 or 20), level of noise (low, strong), percentage of missing values (10%, 25%, 40%), missing-values mechanism (MCAR, 2 MAR situations)
Prediction error: \frac{1}{KJ} \sum (x_{ijk_i} - \hat{x}_{ijk_i})^2 26 / 38
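A sketch of one simulation run under this design (5 groups, 10 variables, Q_b = Q_w = 2, MCAR missingness); the specific sizes, noise level and missingness rate are illustrative choices within the ranges above:

```r
## Simulate X_i = 1 m' + 1 F_i^b V^b' + F_i^w V^w' + E_i, then remove cells at random.
set.seed(3)
I <- 5; J <- 10; ki <- 20; Qb <- 2; Qw <- 2; sigma <- 1; pna <- 0.25
m  <- rnorm(J)
Vb <- matrix(rnorm(J * Qb), J, Qb)
Vw <- matrix(rnorm(J * Qw), J, Qw)
Fb <- matrix(rnorm(I * Qb), I, Qb)                          # one between score per group
X  <- do.call(rbind, lapply(seq_len(I), function(i) {
  Fw <- matrix(rnorm(ki * Qw), ki, Qw)                      # within scores of group i
  matrix(m, ki, J, byrow = TRUE) +
    matrix(Fb[i, ] %*% t(Vb), ki, J, byrow = TRUE) +
    Fw %*% t(Vw) +
    matrix(rnorm(ki * J, sd = sigma), ki, J)                # noise E_i
}))
g    <- rep(seq_len(I), each = ki)
miss <- matrix(runif(length(X)) < pna, nrow(X), ncol(X))    # MCAR mask
Xobs <- X; Xobs[miss] <- NA
## After imputing Xobs (e.g. with the multilevel fit sketched earlier),
## the prediction error is mean((Xhat[miss] - X[miss])^2).
```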

48 Results
Competitors: conditional model with random-effect regression (mice); random forests (Bühlmann, 2012) (not designed for this setting); global PCA; separate PCA; global mixed PCA (with hospital as a variable)
(Figure: boxplots of the MSE per group for FAMD global, Jomo, multilevel PCA, global PCA and separate PCA; sigma = 2, 20% missing values) 27 / 38

49 Results
Table: time in seconds for a dataset with 20% NA, I = 5, k_i = 200 (J = 10 and J = 30, five-category variables), for global PCA, mice, multilevel SVD, global mixed PCA and random forests
Global mixed PCA behaves like random forests; mice (random-effect model) has difficulties with large dimensions; separate PCA has problems with many missing values; multilevel SVD reduces to the global SVD when there is no group effect; imputation properties depend on the (linear) method; the other methods do not handle categorical variables 28 / 38

50 Outline 1 Introduction 2 Single imputation with PCA 3 Single imputation for mixed multilevel data 4 Distribution 29 / 38

51 6 hospital databases Combining data from different institutional databases promises many advantages in personalizing medical care (large n, more chance of finding patients "like me"). NIH requires sharing of data from funded projects 30 / 38

52 6 hospital databases Combining data from different institutional databases promises many advantages in personalizing medical care (large n, more chance of finding patients "like me"). NIH requires sharing of data from funded projects
The problem: high barriers to the aggregation of medical data: lack of standardization of ontologies; privacy concerns; proprietary attitude towards data, reluctance to cede control; complexity/size of the aggregated data, update problems 30 / 38

53 Solution: distributed computation Data aggregation is not always necessary (NIH splits the storage of aggregated data across several centers). Data can stay on site, computations can be distributed (sharing the burden): hospitals only share intermediate results instead of the raw data 31 / 38

54 Topology: master-workers (Wolfson et al., 2010)
Example: each site shares its mean age \bar{X}_i (equivalently the sum of the ages) and its number of patients n_i; the master computes \bar{X} = \sum_i n_i \bar{X}_i / \sum_i n_i 32 / 38
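The same example in code: each site only returns a sum and a count, and the master combines them (the site data here are made up):

```r
## Each site computes local summaries; only these summaries are shared with the master.
site_summary <- function(ages) list(sum = sum(ages), n = length(ages))

sites <- list(A = c(54, 33, 26), B = c(63, 33), C = c(30, 16, 20, 61))  # stays on site
summaries <- lapply(sites, site_summary)

## The master recovers the global mean from the summaries alone
global_mean <- sum(sapply(summaries, `[[`, "sum")) / sum(sapply(summaries, `[[`, "n"))
global_mean                                   # equals mean(unlist(sites))
```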

55 Solution: distributed computation
Many model fits can be implemented:
Maximizing a likelihood: the intermediate computations break up into sums of quantities computed on the local data at each site; the log-likelihood, score function and Fisher information partition into sums (OK for logistic regression)
Singular value decomposition: iterative algorithms are available that only use quantities computed on the local data at each site
And more. Implemented in the R package distcomp (Narasimhan et al., 2017) 33 / 38
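For instance, for logistic regression the log-likelihood and its gradient are sums over observations, so each site can return only its own contribution (a toy sketch; data layout and names are made up):

```r
## Per-site contributions to the logistic log-likelihood and score at parameter beta.
site_loglik_score <- function(X, y, beta) {
  eta <- drop(X %*% beta)
  p   <- 1 / (1 + exp(-eta))
  list(loglik = sum(y * eta - log(1 + exp(eta))),   # local log-likelihood
       score  = drop(crossprod(X, y - p)))          # local score vector
}
## The master adds the site contributions and takes a Newton/gradient step on beta.
```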

56 Singular value decomposition
SVD: X_{n \times p} \approx U_{n \times k} D_{k \times k} V_{p \times k}^\top
Power method to get the first direction; the other dimensions by "deflation": the same procedure applied to the residuals (X - u d v^\top)
The iterations only involve inner products and sums, which can be distributed 34 / 38
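A sketch of the distributed power iteration for the first singular triplet: each site keeps its own block of rows X_i and only shares the length-p vector X_i'(X_i v) at each step; further directions would be obtained by the deflation mentioned above (names are illustrative):

```r
## Distributed power method for the leading singular triplet of the stacked matrix.
## X_blocks: list of matrices, one block of rows per site, all with the same columns.
distributed_first_svd <- function(X_blocks, n_iter = 100) {
  p <- ncol(X_blocks[[1]])
  v <- rnorm(p); v <- v / sqrt(sum(v^2))
  for (it in seq_len(n_iter)) {
    contribs <- lapply(X_blocks, function(Xi) crossprod(Xi, Xi %*% v))  # X_i'(X_i v), shared
    v <- Reduce(`+`, contribs)                 # the master sums the p-vectors
    v <- v / sqrt(sum(v^2))
  }
  d <- sqrt(sum(sapply(X_blocks, function(Xi) sum((Xi %*% v)^2))))      # first singular value
  u <- unlist(lapply(X_blocks, function(Xi) Xi %*% v)) / d              # left singular vector
  list(u = u, d = d, v = as.numeric(v))
}
```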

57 Privacy preserving rank k SVD 35 / 38

58 Multilevel imputation (diagram: matrices X_1, ..., X_I stacked by hospital, with ? marking missing blocks)
Impute multilevel data with the multilevel SVD
Distributed multilevel imputation: impute the data of one hospital using the data of the others; an incentive to encourage the hospitals to participate in the project
Apply other statistical methods on the imputed data (logistic regression) 36 / 38

59-61 Take home message - ongoing work
Multilevel PCA is powerful for single imputation of continuous and categorical multilevel data: it reduces the dimensionality and captures the similarities between rows and the relationships between variables at both levels
Method without missing values; computationally fast; distributed; implemented in the R package missMDA
Numbers of components Q_b and Q_w? Cross-validation?
Inference after imputation: single imputation underestimates the variance
Multiple imputation: bootstrap + draws from the predictive distribution N(1_K \hat{m}^\top + \hat{F}^b \hat{B}^{b\top} + \hat{F}^w \hat{B}^{w\top}, \hat{\sigma}^2) 37 / 38
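A sketch of how one stochastic imputation could be drawn once a fitted decomposition and a residual variance estimate are available (all object names are assumptions):

```r
## One draw: observed cells are kept, missing cells are drawn around the fitted values.
draw_imputation <- function(X, fitted, sigma2_hat) {
  miss <- is.na(X)
  X[miss] <- rnorm(sum(miss), mean = fitted[miss], sd = sqrt(sigma2_hat))
  X
}
## Repeating this after refitting on bootstrap samples gives several imputed
## tables that reflect the imputation uncertainty.
```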

62 Research activities
Dimensionality reduction methods to visualize complex data (PCA-based): multi-source, textual, arrays, questionnaires
Missing values - matrix completion; low-rank estimation, selection of regularization parameters
Fields of application: bio-sciences (agronomy, sensory analysis), health data (hospital data)
R community: book R for Statistics, R foundation, R taskforce, R packages and JSS papers: FactoMineR to explore continuous, categorical and multiple contingency tables (correspondence analysis) and combine clustering and principal components; missMDA for single and multiple imputation and PCA with missing values; denoiseR to denoise data 38 / 38
