Multidimensional data analysis in biomedicine and epidemiology

Katja Ickstadt and Leo N. Geppert
Faculty of Statistics, TU Dortmund, Germany

Stakeholder Workshop, 12–13 December 2017, PTB Berlin

Supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876, project C4
Challenges

Overview of typical tasks in multidimensional data analysis in biomedicine and epidemiology, the aims of the employed methods, and the related model complexity.

Katja Ickstadt (Statistics, TU Dortmund), Multidimensional data analysis, 12. 12. 2017
Challenges: Data

Main challenge: molecular data
- Different data types (e.g., genomic, epigenomic, proteomic data)
- Measured on different platforms (arrays, chips)
- Different biological units (e.g., gene, methylation site, protein)
- Preprocessing involving advanced (statistical) modelling and algorithms
- Avoid "garbage in, garbage out"
Challenges: Data
- Published high-level functional information from databases, e.g., Kyoto Encyclopedia of Genes and Genomes pathways (http://www.genome.jp/kegg/) or gene ontology terms (http://www.geneontology.org)
- Clinical variables including, e.g., Magnetic Resonance Imaging (MRI) data
- Epidemiological variables, e.g., environmental variables
- Several studies (similar questions)

Goal: analyse all data available for a specific question jointly in an integrative analysis
Challenges: Analysis
- Large number of observations n and small to moderate number of variables p
- Small n, large p
- Large n, large p
Analysis for large n, small p
- Large n (100 000 or more observations; second-by-second data), small to moderate p (up to 100)
- Data arising in streams (e.g., online monitoring of patients)
- Image data with short acquisition time (e.g., MRI images)
- Huge data sets (e.g., meta-studies with RNA-seq data, MRI data)
- High model complexity (e.g., network structure, spatio-temporal dependencies)

Goal: solve the problem for reduced n while keeping the main information and introducing a controllable error
Setting
- Large number of observations n, small to moderate number of variables p (n >> p)
- Data set might be too large to load into memory (e.g., data stream)
- Aim: conduct Bayesian (or frequentist) linear regression
  y = Xβ + u, with u ~ N(0, σ_u² I)
- In the Bayesian case, β is a random vector with prior distribution p(β)
- Computationally demanding
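As a minimal sketch of this setting, the following assumes a conjugate N(0, τ²I) prior on β and a known error variance σ_u²; the data, τ and σ_u below are illustrative choices, not taken from the slides.

```python
# Bayesian linear regression y = X beta + u, u ~ N(0, sigma_u^2 I), with a
# conjugate N(0, tau^2 I) prior on beta; sigma_u is treated as known here.
import numpy as np

def posterior_beta(X, y, sigma_u=1.0, tau=10.0):
    """Posterior mean and covariance of beta under the conjugate prior."""
    p = X.shape[1]
    prec = X.T @ X / sigma_u**2 + np.eye(p) / tau**2   # posterior precision
    cov = np.linalg.inv(prec)
    mean = cov @ X.T @ y / sigma_u**2
    return mean, cov

rng = np.random.default_rng(0)
n, p = 5_000, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

mean, cov = posterior_beta(X, y)
print(np.round(mean, 2))   # posterior mean; concentrates near beta_true as n grows
```

For large n the posterior mean approaches the least-squares estimate, which is exactly why the computation becomes demanding: every evaluation touches all n rows.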
Johnson–Lindenstrauss theorem
- The idea of random projections is based on the Johnson–Lindenstrauss theorem (JLT) [Johnson & Lindenstrauss, 1984]
- The JLT states that for every vector v ∈ R^n there exists a random matrix Π ∈ R^{k×n} such that
  (1 − ε) ‖v‖₂² ≤ ‖Πv‖₂² ≤ (1 + ε) ‖v‖₂²
- The matrix Π is a random projection of v and is also called an ε-subspace embedding
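A quick numerical check of this norm-preservation property, using a k × n Rademacher matrix scaled by 1/√k; the dimensions and seed are arbitrary.

```python
# Empirical check of the JLT bound: a scaled Rademacher matrix Pi
# approximately preserves the squared Euclidean norm of a fixed vector v.
import numpy as np

rng = np.random.default_rng(1)
n, k = 10_000, 500
v = rng.normal(size=n)

Pi = rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(k)  # Rademacher projection
ratio = np.sum((Pi @ v) ** 2) / np.sum(v ** 2)
print(ratio)  # lies in [1 - eps, 1 + eps] with high probability
```

The concentration around 1 tightens as k grows, which is the trade-off between target dimension and approximation quality mentioned on the following slides.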
Role of the ε-subspace embedding
- ε-subspace embeddings can be used to reduce the number of observations from n to k while retaining most of the algebraic structure
- The analysis is then carried out on the embedded data set Π[X, Y] = [ΠX, ΠY]:
  p_post(β | X, Y) ≈_ε p_post(β | ΠX, ΠY)
- The trade-off between goodness of approximation and data reduction can be adjusted using ε
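To illustrate that the embedded data set supports essentially the same analysis, the frequentist least-squares estimates from [ΠX, Πy] can be compared with those from the full data. A Gaussian sketching matrix is used here purely for simplicity; all sizes are made up.

```python
# Least-squares estimates on the sketched data [Pi X, Pi y] are close to the
# estimates on the original (n x p) data, although only k << n rows are used.
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 10_000, 4, 500
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.5, 0.0])
y = X @ beta + rng.normal(size=n)

Pi = rng.normal(size=(k, n)) / np.sqrt(k)   # Gaussian sketch for illustration
b_full, *_ = np.linalg.lstsq(X, y, rcond=None)
b_sketch, *_ = np.linalg.lstsq(Pi @ X, Pi @ y, rcond=None)
print(np.max(np.abs(b_full - b_sketch)))    # small deviation, shrinks with k
```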
How do we find an ε-subspace embedding Π?

We compare three subspace embeddings:
- the Rademacher matrix (RAD) [Sarlós (2006)]: target dimension O(d/ε²), running time O(ndk)
- the Subsampled Randomized Hadamard Transform (SRHT) [Ailon & Liberty (2009)]: target dimension O(d log(d)/ε²), running time O(nd log k)
- the Clarkson–Woodruff embedding (CW) [Clarkson & Woodruff (2013)]: target dimension O(d²/ε²), running time O(nnz(X)) ≤ O(nd)

Theoretical guarantees based on the Wasserstein distance hold for all three embedding methods.
Choose the approximation parameter ε or the target dimension k.
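A minimal version of the CW embedding (a CountSketch): each input row is hashed to one of k output rows with a random sign, so the sketch costs one pass over the non-zero entries. The dimensions below are illustrative.

```python
# Clarkson-Woodruff / CountSketch embedding: hash each of the n rows of A to
# one of k buckets with a random sign and add it there -- O(nnz(A)) work.
import numpy as np

def cw_sketch(A, k, rng):
    n = A.shape[0]
    buckets = rng.integers(0, k, size=n)       # target row for each input row
    signs = rng.choice([-1.0, 1.0], size=n)    # random sign for each input row
    S = np.zeros((k, A.shape[1]))
    np.add.at(S, buckets, signs[:, None] * A)  # unbuffered scatter-add
    return S

rng = np.random.default_rng(3)
A = rng.normal(size=(10_000, 5))
S = cw_sketch(A, k=2_000, rng=rng)
ratio = np.linalg.norm(S) ** 2 / np.linalg.norm(A) ** 2  # Frobenius norms
print(ratio)  # close to 1: squared norms are preserved in expectation
```

Because each input row touches exactly one output row, the embedding can be applied to streamed data one row at a time, matching the memory-constrained setting above.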
[Bar plot omitted: running times for the original data set vs. RAD, SRHT and CW sketches at n = 50 000 and n = 100 000.]

Figure: Total running times in minutes for data sets with n ∈ {50 000, 100 000}, d = 50, σ = 5 and approximation parameter ε = 0.1. For the embedded data sets, the total running time consists of the time for reading, embedding and analysing the data set. For the original data set, the embedding time is 0 since this step is not applied.
Summary & Outlook
- Random projections can be used for Bayesian (and frequentist) linear regression
- Results of the analyses are close to the original model; some additional variation is introduced
- Running time is reduced by a substantial amount; the reduction grows with increasing n
- Generalisations: hierarchical models, likelihoods based on p-norms with p ∈ [1, 2], and some generalised linear models
- Implementation available in our R package RaProR [Geppert et al. (2015)], which internally calls C++ code: http://sfb876.tu-dortmund.de, category Software
Analysis for small n, large p
- Clinical or epidemiological studies with small n but many variables (e.g., GWAS studies, RNA-seq data, more than one molecular data type)
- Interactions between variables within or across data sets; pathway or network structures between variables
- Goal: combine variable selection with other tasks such as regression or prediction (e.g., first using leverage scores or principal component analysis and, second, a regression method)
- Bayesian methods are well suited for this situation (modelling all sources of uncertainty, incorporating information for variable selection and shrinkage), but computationally challenging
Leverage Scores
- Sampling approaches (instead of random projections)
- In a regression context, leverage scores measure the importance of observations; they are the diagonal elements of the hat matrix H = X(XᵀX)⁻¹Xᵀ
- In our context, leverage scores measure the importance of variables, via the hat matrix of the transposed augmented matrix: H = [X, Y]ᵀ([X, Y][X, Y]ᵀ)⁻¹[X, Y]
- Cross-leverage scores in general: entries of H off the main diagonal
- Here: influence of variable X_i on Y, C_i = h_{i,(p+1)}
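A sketch of this construction on simulated data (all data and dimensions below are invented for illustration; only the first two variables influence y). H is the (p + 1) × (p + 1) hat matrix of the transposed augmented matrix, and the cross-leverage scores C_i sit in its last column.

```python
# Cross-leverage scores for variable screening (small n, large p): hat matrix
# of the transposed augmented matrix [X, y]; its last column gives C_i.
import numpy as np

rng = np.random.default_rng(4)
n, p = 40, 200
X = rng.normal(size=(n, p))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=n)

M = np.hstack([X, y[:, None]])            # n x (p + 1) augmented matrix
H = M.T @ np.linalg.inv(M @ M.T) @ M      # (p + 1) x (p + 1) hat matrix
cross = np.abs(H[:-1, -1])                # |C_i| = |h_{i, p+1}| per variable

top2 = np.argsort(cross)[-2:]             # variables with largest |C_i|
print(sorted(top2.tolist()))
```

Note that H is an orthogonal projector of rank n, so the diagonal leverage scores sum to n; the off-diagonal last column is what discriminates influential variables here.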
Example: Idea and Simulation of SNPs
- Single nucleotide polymorphisms (SNPs) represent mutations at a single locus
- They are coded by the three values {0, 1, 2}; in cases where more mutations indicate disease, S = 2 and S ≠ 0 characterise diseased patients and S = 0 and S ≠ 2 non-diseased patients
- Simulate SNP data sets with an increasing number of variables p
- Only the first 12 SNPs influence Y (including higher-level interactions)
- Reduce the number of possibly relevant variables before conducting the analysis
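A toy simulation in the spirit of this setup: p SNPs coded 0/1/2, of which only the first few drive a binary response through an interaction. The allele frequency, sample size and effect rule are illustrative assumptions, not the simulation design used in the study.

```python
# Simulated SNP data: genotypes 0/1/2 per locus, binary disease status driven
# by an interaction of the first two SNPs. All settings are made up.
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 500
maf = 0.3                                   # assumed minor-allele frequency
geno = rng.binomial(2, maf, size=(n, p))    # genotype matrix, values in {0,1,2}

# disease probability raised when SNP 1 is homozygous and SNP 2 carries a mutation
risk = 0.1 + 0.8 * ((geno[:, 0] == 2) & (geno[:, 1] >= 1))
y = rng.binomial(1, risk)                   # binary disease status
print(geno.shape, float(y.mean()))
```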
[Figure: Density estimates of (a) leverage scores (LS) and (b) cross-leverage scores (CLS) for n = 40, separately for SNP values S = 0, 1, 2 and S ≠ 0, 1, 2.]

Usefulness:
- The distribution of leverage scores is similar for influential and non-influential SNPs
- The distribution of cross-leverage scores differs between influential and non-influential SNPs, which is useful for a sampling approach
Logic Regression
- Analysis using logistic regression is possible, but a more suitable tool for SNP data is logic regression:
  Y = Σ_i β_i L_i
  with logic independent variables L_i being 0 or 1
- E.g., L₂ = (S₄ ≠ 0) ∧ (S₅ = 0), or L₂ = (X₇ ∧ X₉) ∨ X₁₀ using dummy variables
- Logic regression is particularly suited for modelling and interpreting higher-order interactions
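A minimal illustration of logic predictors: Boolean combinations of SNP variables entering a linear model Y = Σ_i β_i L_i. This is a hand-rolled sketch with invented data and coefficients, not the logic regression search procedure itself (which also learns the Boolean expressions).

```python
# Logic predictors built from SNP variables coded 0/1/2, e.g.
# L1 = (S4 != 0) AND (S5 == 0), then fitted in a linear model.
import numpy as np

rng = np.random.default_rng(6)
n = 200
S4 = rng.integers(0, 3, size=n)              # SNP coded 0/1/2
S5 = rng.integers(0, 3, size=n)

L1 = ((S4 != 0) & (S5 == 0)).astype(float)   # logic predictor 1
L2 = ((S4 == 2) | (S5 == 2)).astype(float)   # logic predictor 2

y = 1.5 * L1 - 0.5 * L2 + 0.1 * rng.normal(size=n)
D = np.column_stack([L1, L2])
beta_hat, *_ = np.linalg.lstsq(D, y, rcond=None)
print(np.round(beta_hat, 2))                 # recovers roughly (1.5, -0.5)
```

Because each L_i is a single 0/1 variable, a fitted coefficient is directly the effect of the whole Boolean interaction, which is what makes higher-order interactions interpretable here.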
Results
Summary & Outlook
- Reducing all variables to the most important main effects and interactions yields good results
- Cross-leverage scores are well suited for the reduction
- Logic regression is well suited for (high-order) interactions
- Such a two-stage procedure is recommended for small n, large p
- External knowledge can be incorporated in a Bayesian analysis through prior weights in the sampling procedure or informative prior distributions in the subsequent analysis
Analysis for large n, large p
- MRI image data with p and n of the order of 10⁵ and 10⁶, respectively
- This represents a large-scale nonlinear regression task; the model can be written as
  Y ~ N(m(θ), σ² I)
  where Y is an n × 1 vector, θ is p × 1, I denotes the n × n identity matrix, and m(θ) stands for the underlying nonlinear physical model
- Underlying situation in the EMPIR application
Suggested Approach
- High-dimensional Bayesian regression with (spatial) Markov Random Field (MRF) priors
- Dimension reduction for the high number of variables p using Bayesian principal component regression or sampling methods
- For the large number of observations n, employ random projections or Merge & Reduce approaches
- To model the MRF prior, introduce sparsity (combination with dimension reduction methods possible) or utilise approximate posterior inference
- Develop appropriate software
Literature I

L.N. Geppert, K. Ickstadt, A. Munteanu, J. Quedenfeld, C. Sohler: Random Projections for Bayesian Regression. Statistics and Computing (2017)

H. Schwender, K. Ickstadt: Identification of SNP interactions using logic regression. Biostatistics (2008)

K. Ickstadt, M. Schäfer, M. Zucknick: Toward Integrative Bayesian Analysis in Molecular Biology. Annual Review of Statistics and Its Application (2018)
Literature II

L.N. Geppert, K. Ickstadt, A. Munteanu, J. Quedenfeld, C. Sohler: RaProR: Calculate Sketches using Random Projections to Reduce Large Data Sets, Version 1.0 (2015)

W.B. Johnson, J. Lindenstrauss: Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics (1984)

T. Sarlós: Improved approximation algorithms for large matrices via random projections. In: Proceedings of FOCS (2006)
Literature III

N. Ailon, E. Liberty: Fast Dimension Reduction Using Rademacher Series on Dual BCH Codes. Discrete & Computational Geometry (2009)

K.L. Clarkson, D.P. Woodruff: Low rank approximation and regression in input sparsity time. In: STOC '13, Proceedings of the forty-fifth annual ACM symposium on Theory of Computing (2013)