Distributed multilevel matrix completion for medical databases

1 Distributed multilevel matrix completion for medical databases Julie Josse Ecole Polytechnique, INRIA Séminaire Parisien de Statistique, IHP January 15, / 38

2 Overview 1 Introduction 2 Single imputation with PCA 3 Single imputation for mixed multilevel data 4 Distribution 2 / 38

3 Outline 1 Introduction 2 Single imputation with PCA 3 Single imputation for mixed multilevel data 4 Distribution 3 / 38

4 Collaborators Imputation of mixed data with multilevel SVD Distributed computation Genevieve Robin, François Husson, Balasubramanian Narasimhan Polytechnique, Agrocampus, Stanford (stat/biomedical) 4 / 38

5 Public Assistance - Paris Hospitals
Traumabase: patients / 250 variables
Excerpt of the table (columns: Center, Accident, Age, Sex, Weight, Height, BMI, BP, SBP, SpO2, Temperature, Lactates, Hb, Glasgow, Transfusion), rows: Beaujon (Fall, 54, m), Lille (Other, 33, m), Pitie Salpetriere (Gun, 26, m), Beaujon (AVP moto, 63, m), Pitie Salpetriere (AVP bicycle, 33, m), Pitie Salpetriere (AVP pedestrian, 30, w), HEGP (White weapon, 16, m), Toulon (White weapon, 20, m), Bicetre (Fall, 61, m); many entries are NR/NM/NA
Predict the Glasgow score, whether to start a blood transfusion, whether to administer fresh frozen plasma, etc. (Logistic) regressions with categorical/continuous values 5 / 38

6 Public Assistance - Paris Hospitals
Traumabase: patients / 250 variables / 6 hospitals (same table excerpt as the previous slide)
Predict the Glasgow score, whether to start a blood transfusion, whether to administer fresh frozen plasma, etc. (Logistic) regressions with missing categorical/continuous values 5 / 38

7 Public Assistance - Paris Hospitals
6 hospitals: multilevel structure (patients nested within hospitals)
Hospital effect: lack of standardization
Missing values (same table excerpt as above)
1) Handling missing values in multilevel data 6 / 38

8 Public Assistance - Paris Hospitals
6 hospitals: multilevel structure (patients nested within hospitals)
Hospital effect: lack of standardization
Missing values (same table excerpt as above)
1) Handling missing values in multilevel data
2) Data are not aggregated but stay on each site 6 / 38

9 Missing values are everywhere: unanswered questions in a survey, lost data, damaged plants, machines that fail...
"The best thing to do with missing values is not to have any." (Gertrude Mary Cox)
Still an issue with "big data"
Data integration: data from different sources (diagram: matrices X_1, ..., X_I stacked by group, with ? marking missing blocks)
Multilevel structure: sporadically and systematically missing values (e.g. one variable missing in an entire hospital) 7 / 38

10 Solutions to handle missing values
Literature: Schafer (2002); Little & Rubin (2002); Gelman & Meng (2004); Kim & Shao (2013); Carpenter & Kenward (2013); van Buuren (2015)
Modify the estimation process to deal with missing values. Maximum likelihood: EM algorithm to obtain point estimates + Supplemented EM (Meng & Rubin, 1991) or Louis' formula for their variability. One specific algorithm for each statistical method...
(Multiple) imputation to get a completed data set on which you can perform any statistical method (Rubin, 1976)
Famous imputation based on SVD (PCA) - quantitative data 8 / 38

11 Solutions to handle missing values (same as the previous slide, adding:)
Extension to imputation with multilevel SVD 8 / 38

12 Outline 1 Introduction 2 Single imputation with PCA 3 Single imputation for mixed multilevel data 4 Distribution 9 / 38

13 PCA reconstruction (figure: data points x, the principal subspace spanned by V, and the fitted points \hat{\mu})
Minimizes the distance between observations and their projection: approximate X (n x p) with a low-rank matrix, k < p. With \|A\|_2^2 = tr(AA^\top):
\hat{\mu}_{PCA} = \arg\min_{\mu} \{ \|X - \mu\|_2^2 : \mathrm{rank}(\mu) \le k \}
Solution via the SVD of X: \hat{\mu}_{PCA} = U_{n \times k} D_{k \times k} V_{p \times k}^\top = F_{n \times k} V_{p \times k}^\top, with F = UD the principal component scores and V the principal axes (loadings) 10 / 38

14 PCA reconstruction (same figure, now with some entries of X marked NA) 10 / 38

15 Missing values in PCA
PCA: least squares
\arg\min_{\mu} \{ \|X_{n \times p} - \mu_{n \times p}\|_2^2 : \mathrm{rank}(\mu) \le k \}
PCA with missing values: weighted least squares
\arg\min_{\mu} \{ \|W_{n \times p} \odot (X - \mu)\|_2^2 : \mathrm{rank}(\mu) \le k \}
with W_{ij} = 0 if X_{ij} is missing, W_{ij} = 1 otherwise; \odot denotes elementwise multiplication
Many algorithms: Gabriel & Zamir, 1979: weighted alternating least squares (without explicit imputation); Kiers, 1997: iterative PCA (with imputation) 11 / 38

16-24 Iterative PCA (toy two-variable example; the figure shows the imputed points moving at each step)
Initialization l = 0: X^0 (mean imputation)
PCA on the completed data set gives (U^l, D^l, V^l)
Missing values are imputed with the fitted matrix \hat{\mu}^l = U^l D^l V^{l\top}
The new imputed dataset is \hat{X}^l = W \odot X + (1 - W) \odot \hat{\mu}^l
Steps are repeated until convergence 12 / 38

25 Iterative PCA
1. Initialization l = 0: X^0 (mean imputation)
2. Step l: (a) PCA on the completed data gives (U^l, D^l, V^l); k dimensions kept; (b) \hat{\mu}^l = U^l D^l V^{l\top} and X^l = W \odot X + (1 - W) \odot \hat{\mu}^l
3. The estimation and imputation steps are repeated until convergence 13 / 38

26 Iterative PCA (same algorithm as the previous slide, adding:)
\hat{\mu} from incomplete data: an EM algorithm for the model X = F V^\top + \varepsilon, i.e. X = \mu + \varepsilon with \mu of low rank and \varepsilon_{ij} iid N(0, \sigma^2)
Completed data: good imputation (matrix completion, Netflix) (Udell & Townsend, Nice Latent Variable Models Have Log-Rank, 2017)
Selecting k? Generalized cross-validation (Josse & Husson, 2012) 13 / 38
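A minimal base-R sketch of this iterative algorithm, assuming a numeric matrix with NAs and a chosen rank k (function and variable names are illustrative; column centring is omitted to keep it short):

```r
## Iterative PCA imputation: alternate a rank-k SVD fit on the completed
## matrix and imputation of the missing cells, until the imputations stabilise.
iterative_pca_impute <- function(X, k, max_iter = 200, tol = 1e-6) {
  W <- !is.na(X)                                  # indicator of observed cells
  Xhat <- X
  for (j in seq_len(ncol(X)))                     # initialisation: mean imputation
    Xhat[!W[, j], j] <- mean(X[, j], na.rm = TRUE)
  for (l in seq_len(max_iter)) {
    s  <- svd(Xhat)                               # PCA/SVD of the completed data
    mu <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k) %*%
          t(s$v[, 1:k, drop = FALSE])             # rank-k fitted matrix
    Xnew <- ifelse(W, X, mu)                      # keep observed cells, impute the rest
    if (sum((Xnew - Xhat)^2) < tol) { Xhat <- Xnew; break }
    Xhat <- Xnew
  }
  Xhat
}

## Toy use (sizes and rank are arbitrary)
set.seed(1)
X <- matrix(rnorm(100), 20, 5); X[sample(length(X), 15)] <- NA
X_completed <- iterative_pca_impute(X, k = 2)
```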

28 Regularized iterative SVD
Overfitting issues of iterative PCA: many parameters (U_{n \times k}, V_{k \times p}) relative to the observed values (k large, many NA); noisy data
Regularized versions keep the same initialization - estimation - imputation steps, but the imputation by \hat{\mu}_{PCA} = \sum_{q=1}^{k} d_q u_q v_q^\top is replaced either by soft-thresholding, (\hat{\mu})_\lambda = \sum_{q} (d_q - \lambda)_+ u_q v_q^\top, the solution of \arg\min_{\mu} \{ \|W \odot (X - \mu)\|_2^2 + \lambda \|\mu\|_* \}, or by the shrinkage \hat{\mu} = \sum_{q=1}^{k} (d_q - \hat{\sigma}^2 / d_q) u_q v_q^\top
Hastie et al. (2015); Verbanck, J. & Husson (2013); Gavish & Donoho (2014); J. & Wager (2015); J. & Sardy (2014)
Iterative SVD algorithms are good at imputing data; the model makes sense: data = structure of rank k + noise 14 / 38
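The regularized variants only change the reconstruction step. A sketch of the two shrinkage rules quoted above, to be used in place of the plain rank-k fit in the iterative loop (the function name and argument choices are assumptions):

```r
## Reconstruction with shrunken singular values: soft-thresholding (d_q - lambda)_+
## or the shrinkage (d_q - sigma2/d_q) applied to the k leading values.
shrunk_fit <- function(Xhat, k, lambda = NULL, sigma2 = NULL) {
  s <- svd(Xhat)
  if (!is.null(lambda)) {
    d <- pmax(s$d - lambda, 0)                              # soft-thresholding
  } else {
    d <- c(pmax(s$d[1:k] - sigma2 / s$d[1:k], 0),           # shrink the k kept values
           rep(0, length(s$d) - k))                         # drop the remaining ones
  }
  s$u %*% diag(d, length(d)) %*% t(s$v)
}
```

Swapping this reconstruction into the loop above gives a regularized iterative SVD in the spirit of the references cited.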

29 Outline 1 Introduction 2 Single imputation with PCA 3 Single imputation for mixed multilevel data 4 Distribution 15 / 38

30 Multilevel component analysis
Example: inhabitants nested within countries; X \in R^{K \times J} stacks the group matrices X_1, ..., X_I (group i has k_i rows)
Similarities between countries? (level 1) Similarities between inhabitants within each country? (level 2) Relationships between variables at each level
x_{ijk_i} = \bar{x}_{\cdot j \cdot} + (\bar{x}_{ij\cdot} - \bar{x}_{\cdot j \cdot}) + (x_{ijk_i} - \bar{x}_{ij\cdot}) : between + within
Analysis of variance: split the sum of squares of each variable j,
\sum_{i=1}^{I} \sum_{k=1}^{k_i} x_{ijk}^2 = \sum_{i=1}^{I} k_i \bar{x}_{\cdot j \cdot}^2 + \sum_{i=1}^{I} k_i (\bar{x}_{ij\cdot} - \bar{x}_{\cdot j \cdot})^2 + \sum_{i=1}^{I} \sum_{k=1}^{k_i} (x_{ijk} - \bar{x}_{ij\cdot})^2 16 / 38
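A quick numerical check of this sum-of-squares split for a single variable, with made-up groups:

```r
## Total sum of squares = grand-mean part + between-group part + within-group part.
set.seed(2)
g <- rep(1:3, times = c(5, 8, 7))                 # group label of each observation
x <- rnorm(length(g), mean = g)                   # one variable
grand   <- mean(x)
groupm  <- ave(x, g)                              # group mean, repeated per observation
total   <- sum(x^2)
parts   <- c(grand   = length(x) * grand^2,
             between = sum((groupm - grand)^2),
             within  = sum((x - groupm)^2))
all.equal(total, sum(parts))                      # TRUE: the decomposition is exact
```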

31 Multilevel PCA (MLPCA): model for the between and within parts
For i = 1, ..., I groups and J variables:
X_{i (k_i \times J)} = 1_{k_i} m^\top + 1_{k_i} F_i^b V^{b\top} + F_i^w V^{w\top} + E_i
F_i^b (1 \times Q_b): between component scores of group i; V^b (J \times Q_b): between loading matrix
F_i^w (k_i \times Q_w): within component scores of group i; V^w (J \times Q_w): within loading matrix, constant across groups
Fitted by least squares (Timmerman, 2006):
\arg\min_{m, F_i^b, V^b, F_i^w, V^w} \sum_{i=1}^{I} \| X_i - 1_{k_i} m^\top - 1_{k_i} F_i^b V^{b\top} - F_i^w V^{w\top} \|^2
with \sum_{i=1}^{I} k_i F_i^b = 0_{Q_b} and 1_{k_i}^\top F_i^w = 0_{Q_w} for all i, for identifiability 17 / 38

32 MLPCA - quantitative data
i = 1, ..., I groups, J variables, k_i observations in group i. Estimation: minimize the residual sum of squares
\arg\min \sum_{i=1}^{I} \| X_i - 1_{k_i} m^\top - 1_{k_i} F_i^b V^{b\top} - F_i^w V^{w\top} \|^2,
with \sum_{i=1}^{I} k_i F_i^b = 0_{Q_b} and 1_{k_i}^\top F_i^w = 0_{Q_w} for all i, for identifiability.
(\hat{F}^b, \hat{V}^b): weighted PCA on the between part, i.e. SVD of W X_m, where X_m (I \times J) contains the means of the variables per group and W (I \times I) is diagonal with w_{ii} = k_i
(\hat{F}^w, \hat{V}^w): PCA on the within part, i.e. SVD of the data centred per group, X_w (K \times J), K = \sum_i k_i
With missing values: weighted least squares, solved by an iterative imputation algorithm (imputation - estimation) 18 / 38
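A base-R sketch of this complete-data estimation, with a weighted SVD of the group means for the between part and an SVD of the group-centred data for the within part. The weighting by the square root of the group sizes is one common convention; names and ranks are illustrative and the identifiability constraints are only handled implicitly:

```r
## Complete-data multilevel SVD. X: K x J numeric matrix, g: group label per row.
multilevel_svd_fit <- function(X, g, Qb = 2, Qw = 2) {
  g  <- as.factor(g)
  m  <- colMeans(X)
  Xc <- sweep(X, 2, m)                                    # remove the overall offset
  Xm <- apply(Xc, 2, function(col) tapply(col, g, mean))  # I x J matrix of group means
  w  <- sqrt(as.numeric(table(g)))                        # sqrt of the group sizes
  sb <- svd(Xm * w)                                       # weighted SVD: between part
  Bmeans <- (sb$u[, 1:Qb, drop = FALSE] %*% diag(sb$d[1:Qb], Qb) %*%
             t(sb$v[, 1:Qb, drop = FALSE])) / w           # fitted group means
  between <- Bmeans[g, , drop = FALSE]                    # expanded to one row per obs.
  sw <- svd(Xc - Xm[g, , drop = FALSE])                   # SVD of group-centred data
  within <- sw$u[, 1:Qw, drop = FALSE] %*% diag(sw$d[1:Qw], Qw) %*%
            t(sw$v[, 1:Qw, drop = FALSE])                 # within part
  sweep(between + within, 2, m, "+")                      # fitted matrix 1 m' + Xb + Xw
}
```

With missing values, this estimation step is alternated with imputation of the missing cells by the fitted matrix, exactly as in the iterative PCA sketch above.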

33 Iterative MLPCA
2. Iteration l, estimation of the between structure: SVD of W X_m^l = P D Q^\top; the first Q_b eigenvectors are kept: \hat{F}_i^b = [W^{-1} P_{Q_b}]_i, \hat{F}^b is the row-wise concatenation of the blocks 1_{k_i} \hat{F}_i^b, and \hat{V}^b = Q_{Q_b} D_{Q_b} (J \times Q_b); the between hat matrix is (\hat{X}^b)^l = \hat{F}^b \hat{V}^{b\top}
3. Iteration l, imputation of the missing values with the fitted values: \hat{X}^l = 1_K \hat{m}^{(l-1)\top} + (\hat{X}^b)^l + (\hat{X}^w)^{(l-1)}; the newly imputed dataset is X^l = M \odot X + (1_K 1_J^\top - M) \odot \hat{X}^l, with M the indicator of observed entries; \hat{m}^l is computed on X^l
4. Iteration l, estimation of the within structure: SVD of (X^w)^l = P D Q^\top; the first Q_w eigenvectors are kept: \hat{F}^w = P_{Q_w} (K \times Q_w), \hat{V}^w = Q_{Q_w} D_{Q_w} (J \times Q_w); the within hat matrix is (\hat{X}^w)^l = \hat{F}^w \hat{V}^{w\top}
5. Iteration l, imputation of the missing values with the fitted values: X^{l+1} = M \odot X + (1_K 1_J^\top - M) \odot (1_K \hat{m}^{(l)\top} + (\hat{X}^b)^l + (\hat{X}^w)^l); \hat{m}^{l+1} is computed on X^{l+1} 19 / 38
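Combining the two previous sketches gives the overall shape of the iterative multilevel imputation. This is a rough sketch: the algorithm above updates the between and within parts in separate sub-steps and re-estimates the offset at each pass, whereas here a single complete-data fit is used per iteration:

```r
## Rough sketch: alternate the complete-data multilevel fit and imputation.
iterative_mlpca_impute <- function(X, g, Qb = 2, Qw = 2, max_iter = 100, tol = 1e-6) {
  M <- !is.na(X)                               # indicator of observed cells
  Xhat <- X
  for (j in seq_len(ncol(X)))                  # initialisation: mean imputation
    Xhat[!M[, j], j] <- mean(X[, j], na.rm = TRUE)
  for (l in seq_len(max_iter)) {
    fit  <- multilevel_svd_fit(Xhat, g, Qb, Qw)   # sketched after slide 32
    Xnew <- ifelse(M, X, fit)                     # keep observed cells, impute the rest
    if (sum((Xnew - Xhat)^2) < tol) { Xhat <- Xnew; break }
    Xhat <- Xnew
  }
  Xhat
}
```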

34 Regularized iterative MLSCA
Estimation by SVD - imputation by \hat{X} = 1_K \hat{m}^\top + \hat{F}^b \hat{V}^{b\top} + \hat{F}^w \hat{V}^{w\top}, with shrunken singular values:
\hat{V}^b = Q_{Q_b} D_{Q_b} = Q_{Q_b}\, \mathrm{diag}(d_1, ..., d_{Q_b}) is replaced by Q_{Q_b}\, \mathrm{diag}(d_1 - \hat{\sigma}_b^2/d_1, ..., d_{Q_b} - \hat{\sigma}_b^2/d_{Q_b}), with \hat{\sigma}_b^2 = \frac{1}{J - Q_b} \sum_{q=Q_b+1}^{J} d_q
and similarly \hat{V}^w = Q_{Q_w}\, \mathrm{diag}(d_1 - \hat{\sigma}_w^2/d_1, ..., d_{Q_w} - \hat{\sigma}_w^2/d_{Q_w}) 20 / 38

35 Multiple Correspondence Analysis (MCA)
X (n \times m): m categorical variables, coded with the indicator matrix A of dummy variables; D_p is the diagonal matrix of the column proportions of A (small worked example on the slide, with categories such as attack/suicide/accident, omitted)
For a category c, its frequency is p_c = n_c / n. MCA is an SVD of the weighted, centred indicator matrix:
Z = \frac{1}{\sqrt{mn}} (A - 1 p^\top) D_p^{-1/2} = U D V^\top
The principal components F = UD satisfy \arg\max_{F_q \in R^n} \frac{1}{m} \sum_{j=1}^{m} \eta^2(F_q, X_j), where the correlation ratio \eta^2(F, X_j) is the between-category sum of squares \sum_{c=1}^{C_j} n_c (\bar{F}_c - \bar{F})^2 divided by the total sum of squares
Benzécri, 1973: "In data analysis the mathematical problem reduces to computing eigenvectors; all the science (the art) is in finding the right matrix to diagonalize" 21 / 38
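A sketch of this weighted-SVD view of MCA on a small data frame of factors. The exact scaling constant is taken from the formula above and is an assumption; other conventions only rescale the axes:

```r
## MCA as an SVD of the centred, margin-weighted indicator matrix.
mca_svd <- function(df, ncp = 2) {
  stopifnot(all(sapply(df, is.factor)))
  n <- nrow(df); m <- ncol(df)
  A <- model.matrix(~ . - 1, data = df,                    # full indicator (dummy) matrix
                    contrasts.arg = lapply(df, contrasts, contrasts = FALSE))
  p <- colMeans(A)                                         # category frequencies p_c = n_c / n
  Z <- (A - matrix(p, n, length(p), byrow = TRUE)) %*% diag(1 / sqrt(p)) / sqrt(m * n)
  s <- svd(Z)
  list(scores   = s$u[, 1:ncp, drop = FALSE] %*% diag(s$d[1:ncp], ncp),  # F = UD
       loadings = s$v[, 1:ncp, drop = FALSE])
}

## Tiny example
df <- data.frame(v1 = factor(c("a", "b", "a", "c", "b")),
                 v2 = factor(c("x", "x", "y", "y", "x")))
res <- mca_svd(df, ncp = 2)
```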

36 Multilevel MCA
Start from the matrix of dummy variables A and define a between and a within part; then MCA is applied to each part.
Between part: apply MCA to the matrix of the means of A per group i (the proportion of observations taking each category in group i, e.g. the proportion of some disease in a particular hospital): \hat{A}^b = F^b V^{b\top} D_p^{1/2} + 1_n p^\top
Within part: apply MCA to the data where the between part has been swept out (the SVD is applied to \frac{1}{np} (A - \hat{A}^b) D_p^{-1/2}): \hat{A}^w = (np) F^w V^{w\top} D_p^{1/2}
\hat{A} = \hat{A}^b + \hat{A}^w 22 / 38

37-44 Regularized iterative Multilevel MCA
Initialization: imputation of the indicator matrix (with the column proportions)
Iterate until convergence:
1. estimation: Multilevel MCA on the completed data gives \hat{F}^b, \hat{V}^b, \hat{F}^w, \hat{V}^w
2. imputation with the fitted matrix \hat{A} = \hat{A}^b + \hat{A}^w
3. the column margins are updated
Example: a row with V1 = NA gets fuzzy values in the V1_a, V1_b, V1_c columns of the indicator matrix; the imputed values can be seen as degrees of membership
Two ways to impute categories: majority or random draw 23 / 38
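A small sketch of these two rules for turning the fuzzy indicator values of one missing cell into a category (the input format, a named vector of degrees of membership, is an assumption):

```r
## Choose a category from degrees of membership: most likely one, or a random draw.
impute_category <- function(memberships, method = c("majority", "draw")) {
  method <- match.arg(method)
  memberships <- pmax(memberships, 0)
  memberships <- memberships / sum(memberships)    # renormalise to probabilities
  if (method == "majority") {
    names(which.max(memberships))
  } else {
    sample(names(memberships), 1, prob = memberships)
  }
}

impute_category(c(V1_a = 0.15, V1_b = 0.70, V1_c = 0.15), "majority")  # "V1_b"
impute_category(c(V1_a = 0.15, V1_b = 0.70, V1_c = 0.15), "draw")      # random category
```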

45 Public Assistance - Paris Hospitals
Traumabase: patients / 250 variables / 6 hospitals (same incomplete table excerpt as in the introduction) 24 / 38

46 Imputed Paris Hospitals data
Traumabase: patients / 250 variables / 6 hospitals: the same excerpt, with the missing Weight, Height, BMI, BP, SBP, SpO2, Temperature, Lactates, Hb and Glasgow entries now filled in by the imputation 25 / 38

47 The simulated data: design
X_{i (k_i \times J)} = 1_{k_i} m^\top + 1_{k_i} F_i^b V^{b\top} + F_i^w V^{w\top} + E_i, with E_{ijk_i} \sim N(0, \sigma)
5 groups, 10 variables, Q_b = 2, Q_w = 2
Many scenarios are considered: number of measurement occasions (10 or 20), level of noise (low, strong), percentage of missing values (10%, 25%, 40%), missing-values mechanism (MCAR, 2 MAR situations)
Prediction error: \frac{1}{KJ} \sum (x_{ijk_i} - \hat{x}_{ijk_i})^2 26 / 38
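A sketch of one simulation run under this design (5 groups, 10 variables, Q_b = Q_w = 2, MCAR missingness); the specific sizes, noise level and missingness rate are illustrative choices within the ranges above:

```r
## Simulate X_i = 1 m' + 1 F_i^b V^b' + F_i^w V^w' + E_i, then remove cells at random.
set.seed(3)
I <- 5; J <- 10; ki <- 20; Qb <- 2; Qw <- 2; sigma <- 1; pna <- 0.25
m  <- rnorm(J)
Vb <- matrix(rnorm(J * Qb), J, Qb)
Vw <- matrix(rnorm(J * Qw), J, Qw)
Fb <- matrix(rnorm(I * Qb), I, Qb)                          # one between score per group
X  <- do.call(rbind, lapply(seq_len(I), function(i) {
  Fw <- matrix(rnorm(ki * Qw), ki, Qw)                      # within scores of group i
  matrix(m, ki, J, byrow = TRUE) +
    matrix(Fb[i, ] %*% t(Vb), ki, J, byrow = TRUE) +
    Fw %*% t(Vw) +
    matrix(rnorm(ki * J, sd = sigma), ki, J)                # noise E_i
}))
g    <- rep(seq_len(I), each = ki)
miss <- matrix(runif(length(X)) < pna, nrow(X), ncol(X))    # MCAR mask
Xobs <- X; Xobs[miss] <- NA
## After imputing Xobs (e.g. with the multilevel fit sketched earlier),
## the prediction error is mean((Xhat[miss] - X[miss])^2).
```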

48 Results
Competitors: conditional model with random-effect regression (mice); random forests (Bühlmann, 2012) (not designed for this setting); global PCA; separate PCA; global mixed PCA (with hospital as a variable)
(Figure: boxplots of the MSE per group for FAMD global, Jomo, multilevel PCA, global PCA and separate PCA; sigma = 2, 20% missing values) 27 / 38

49 Results
Table: time in seconds for a dataset with 20% NA, I = 5, k_i = 200 (J = 10 and J = 30, five-category variables), for global PCA, mice, multilevel SVD, global mixed PCA and random forests
Global mixed PCA behaves like random forests; mice (random-effect model) has difficulties with large dimensions; separate PCA has problems with many missing values; multilevel SVD reduces to the global SVD when there is no group effect; imputation properties depend on the (linear) method; the other methods do not handle categorical variables 28 / 38

50 Outline 1 Introduction 2 Single imputation with PCA 3 Single imputation for mixed multilevel data 4 Distribution 29 / 38

51 6 hospital databases Combining data from different institutional databases promises many advantages in personalizing medical care (large n, more chance of finding patients "like me"). NIH requires sharing of data from funded projects 30 / 38

52 6 hospital databases Combining data from different institutional databases promises many advantages in personalizing medical care (large n, more chance of finding patients "like me"). NIH requires sharing of data from funded projects
The problem: high barriers to the aggregation of medical data: lack of standardization of ontologies; privacy concerns; proprietary attitude towards data, reluctance to cede control; complexity/size of the aggregated data, update problems 30 / 38

53 Solution: distributed computation Data aggregation is not always necessary (NIH splits the storage of aggregated data across several centers). Data can stay on site, computations can be distributed (sharing the burden): hospitals only share intermediate results instead of the raw data 31 / 38

54 Topology: master-workers (Wolfson et al., 2010)
Example: each site shares its mean age \bar{X}_i (equivalently the sum of the ages) and its number of patients n_i; the master computes \bar{X} = \sum_i n_i \bar{X}_i / \sum_i n_i 32 / 38
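The same example in code: each site only returns a sum and a count, and the master combines them (the site data here are made up):

```r
## Each site computes local summaries; only these summaries are shared with the master.
site_summary <- function(ages) list(sum = sum(ages), n = length(ages))

sites <- list(A = c(54, 33, 26), B = c(63, 33), C = c(30, 16, 20, 61))  # stays on site
summaries <- lapply(sites, site_summary)

## The master recovers the global mean from the summaries alone
global_mean <- sum(sapply(summaries, `[[`, "sum")) / sum(sapply(summaries, `[[`, "n"))
global_mean                                   # equals mean(unlist(sites))
```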

55 Solution: distributed computation
Many model fits can be implemented:
Maximizing a likelihood: the intermediate computations break up into sums of quantities computed on the local data at each site; the log-likelihood, score function and Fisher information partition into sums (OK for logistic regression)
Singular value decomposition: iterative algorithms are available that only use quantities computed on the local data at each site
And more. Implemented in the R package distcomp (Narasimhan et al., 2017) 33 / 38
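For instance, for logistic regression the log-likelihood and its gradient are sums over observations, so each site can return only its own contribution (a toy sketch; data layout and names are made up):

```r
## Per-site contributions to the logistic log-likelihood and score at parameter beta.
site_loglik_score <- function(X, y, beta) {
  eta <- drop(X %*% beta)
  p   <- 1 / (1 + exp(-eta))
  list(loglik = sum(y * eta - log(1 + exp(eta))),   # local log-likelihood
       score  = drop(crossprod(X, y - p)))          # local score vector
}
## The master adds the site contributions and takes a Newton/gradient step on beta.
```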

56 Singular value decomposition
SVD: X_{n \times p} \approx U_{n \times k} D_{k \times k} V_{p \times k}^\top
Power method to get the first direction; the other dimensions by "deflation": the same procedure applied to the residuals (X - u d v^\top)
The iterations only involve inner products and sums, which can be distributed 34 / 38
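A sketch of the distributed power iteration for the first singular triplet: each site keeps its own block of rows X_i and only shares the length-p vector X_i'(X_i v) at each step; further directions would be obtained by the deflation mentioned above (names are illustrative):

```r
## Distributed power method for the leading singular triplet of the stacked matrix.
## X_blocks: list of matrices, one block of rows per site, all with the same columns.
distributed_first_svd <- function(X_blocks, n_iter = 100) {
  p <- ncol(X_blocks[[1]])
  v <- rnorm(p); v <- v / sqrt(sum(v^2))
  for (it in seq_len(n_iter)) {
    contribs <- lapply(X_blocks, function(Xi) crossprod(Xi, Xi %*% v))  # X_i'(X_i v), shared
    v <- Reduce(`+`, contribs)                 # the master sums the p-vectors
    v <- v / sqrt(sum(v^2))
  }
  d <- sqrt(sum(sapply(X_blocks, function(Xi) sum((Xi %*% v)^2))))      # first singular value
  u <- unlist(lapply(X_blocks, function(Xi) Xi %*% v)) / d              # left singular vector
  list(u = u, d = d, v = as.numeric(v))
}
```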

57 Privacy preserving rank k SVD 35 / 38

58 Multilevel imputation (diagram: matrices X_1, ..., X_I stacked by hospital, with ? marking missing blocks)
Impute multilevel data with the multilevel SVD
Distributed multilevel imputation: impute the data of one hospital using the data of the others; an incentive to encourage the hospitals to participate in the project
Apply other statistical methods on the imputed data (logistic regression) 36 / 38

59-61 Take home message - ongoing work
Multilevel PCA is powerful for single imputation of continuous and categorical multilevel data: it reduces the dimensionality and captures the similarities between rows and the relationships between variables at both levels
Method without missing values; computationally fast; distributed; implemented in the R package missMDA
Numbers of components Q_b and Q_w? Cross-validation?
Inference after imputation: single imputation underestimates the variance
Multiple imputation: bootstrap + draws from the predictive distribution N(1_K \hat{m}^\top + \hat{F}^b \hat{B}^{b\top} + \hat{F}^w \hat{B}^{w\top}, \hat{\sigma}^2) 37 / 38
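A sketch of how one stochastic imputation could be drawn once a fitted decomposition and a residual variance estimate are available (all object names are assumptions):

```r
## One draw: observed cells are kept, missing cells are drawn around the fitted values.
draw_imputation <- function(X, fitted, sigma2_hat) {
  miss <- is.na(X)
  X[miss] <- rnorm(sum(miss), mean = fitted[miss], sd = sqrt(sigma2_hat))
  X
}
## Repeating this after refitting on bootstrap samples gives several imputed
## tables that reflect the imputation uncertainty.
```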

62 Research activities
Dimensionality reduction methods to visualize complex data (PCA-based): multi-source, textual, arrays, questionnaires
Missing values - matrix completion; low-rank estimation, selection of regularization parameters
Fields of application: bio-sciences (agronomy, sensory analysis), health data (hospital data)
R community: book R for Statistics, R foundation, R taskforce, R packages and JSS papers: FactoMineR to explore continuous, categorical and multiple contingency tables (correspondence analysis) and combine clustering and principal components; missMDA for single and multiple imputation and PCA with missing values; denoiseR to denoise data 38 / 38
