Multidimensional data analysis in biomedicine and epidemiology

1 Multidimensional data analysis in biomedicine and epidemiology
Katja Ickstadt and Leo N. Geppert, Faculty of Statistics, TU Dortmund, Germany
Stakeholder Workshop, December 2017, PTB Berlin
Supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876, project C4

2 Challenges Overview of typical tasks in multidimensional data analysis in biomedicine and epidemiology, aims of employed methods, and related model complexity. Katja Ickstadt (Statistics, TU Dortmund) Multidimensional data analysis / 24

3 Challenges: Data
Main challenge: molecular data
- Different data types (e.g., genomic, epigenomic, proteomic data)
- Measured on different platforms (arrays, chips)
- Different biological units (e.g., gene, methylation site, protein)
- Preprocessing involving advanced (statistical) modelling and algorithms
- Avoid "garbage in, garbage out"

5 Challenges: Data
- Published high-level functional information from databases, e.g., Kyoto Encyclopedia of Genes and Genomes pathways or gene ontology terms
- Clinical variables including, e.g., Magnetic Resonance Imaging (MRI) data
- Epidemiological, e.g., environmental variables
- Several studies (similar questions)
Goal: Analyse all data available for a specific question jointly in an integrative analysis

6 Challenges: Analysis
- Large number of observations n and small to moderate number of variables p
- Small n, large p
- Large n, large p

7 Analysis for large n, small p
- Large n (… or more observations; second-by-second data), small to moderate p (up to 100)
- Data arising in streams (e.g., online monitoring of patients)
- Image data with short acquisition time (e.g., MRI images)
- Huge data sets (e.g., meta-studies with RNA-seq data, MRI data)
- High model complexity (e.g., network structure, spatio-temporal dependencies)
Goal: Solve the problem for a reduced n while keeping the main information and introducing a controllable error

8 Setting
- Large number of observations n, small to moderate number of variables p ($n \gg p$)
- Data set might be too large to load into memory (e.g., data stream)
- Aim: conduct Bayesian (or frequentist) linear regression $y = X\beta + u$ with $u \sim N(0, \sigma_u^2)$
- In the Bayesian case, $\beta$ is a random vector with prior distribution $p(\beta)$
- Computationally demanding

9 Johnson-Lindenstrauss theorem
- The idea of random projections is based on the Johnson-Lindenstrauss theorem (JLT) [Johnson & Lindenstrauss, 1984]
- The JLT states that for every vector $v \in \mathbb{R}^n$ there exists a random matrix $\Pi \in \mathbb{R}^{k \times n}$ such that, with high probability over the choice of $\Pi$,
  $(1 - \varepsilon) \|v\|_2^2 \le \|\Pi v\|_2^2 \le (1 + \varepsilon) \|v\|_2^2$
- $\Pi v$ is a random projection of $v$; the matrix $\Pi$ is also called an $\varepsilon$-subspace embedding
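The norm-preservation property can be checked numerically. The following is a minimal sketch, assuming a scaled Rademacher matrix as the random projection; the sizes n and k, the tolerance eps and the seed are illustrative choices and do not come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, used only for reproducibility

n, k, eps = 2000, 1000, 0.25    # illustrative sizes (assumptions)
v = rng.standard_normal(n)      # an arbitrary vector in R^n

# Rademacher matrix: i.i.d. entries +1/-1, scaled by 1/sqrt(k) so that
# the expected squared norm of Pi @ v equals the squared norm of v.
Pi = rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(k)

ratio = np.sum((Pi @ v) ** 2) / np.sum(v ** 2)
within_bounds = (1 - eps) <= ratio <= (1 + eps)
print(f"||Pi v||^2 / ||v||^2 = {ratio:.3f}; inside (1 +/- eps): {within_bounds}")
```

A single vector is the simplest case; a subspace embedding requires the same bound to hold simultaneously for all vectors in the column space of the data.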

10 Role of the ε-subspace embedding
- ε-subspace embeddings can be used to reduce the number of observations from n to k while retaining most of the algebraic structure
- The analysis is then carried out on the embedded data set $\Pi [X, Y] = [\Pi X, \Pi Y]$:
  $p_{\mathrm{post}}(\beta \mid X, Y) \approx_{\varepsilon} p_{\mathrm{post}}(\beta \mid \Pi X, \Pi Y)$
- The trade-off between goodness of approximation and data reduction can be adjusted using ε
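To illustrate how an analysis on the embedded data can stay close to the analysis on the full data, here is a small sketch. It assumes a conjugate Gaussian prior $\beta \sim N(0, \tau^2 I)$ with known error variance, so that the posterior is available in closed form; this prior, the helper name `gaussian_posterior`, the simulated data and all sizes are assumptions made for the example, not the setup used by the authors.

```python
import numpy as np

def gaussian_posterior(X, y, sigma2=1.0, tau2=100.0):
    """Posterior mean and covariance of beta for y = X beta + u,
    u ~ N(0, sigma2 * I), under the assumed prior beta ~ N(0, tau2 * I)."""
    d = X.shape[1]
    precision = X.T @ X / sigma2 + np.eye(d) / tau2
    cov = np.linalg.inv(precision)
    mean = cov @ (X.T @ y) / sigma2
    return mean, cov

rng = np.random.default_rng(1)
n, d, k = 5000, 5, 500                      # illustrative sizes (assumptions)
beta_true = np.arange(1.0, d + 1.0)
X = rng.standard_normal((n, d))
y = X @ beta_true + rng.standard_normal(n)  # simulated data with unit noise

# Embed observations: n rows are reduced to k rows.
Pi = rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(k)
X_sk, y_sk = Pi @ X, Pi @ y

mean_full, cov_full = gaussian_posterior(X, y)
mean_sk, cov_sk = gaussian_posterior(X_sk, y_sk)
print("posterior mean (full):  ", np.round(mean_full, 3))
print("posterior mean (sketch):", np.round(mean_sk, 3))
```

The sketched posterior is computed from k instead of n rows; a smaller k saves more work but moves the two posteriors further apart.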

11 How do we find an ε-subspace embedding Π?
We compare three subspace embeddings (d denotes the number of variables, n the number of observations and k the target dimension):
- the Rademacher Matrix (RAD) [Sarlós (2006)]: target dimension $O(d / \varepsilon^2)$, running time $O(ndk)$
- the Subsampled Randomized Hadamard Transform (SRHT) [Ailon & Liberty (2009)]: target dimension $O(d \log(d) / \varepsilon^2)$, running time $O(nd \log k)$
- the Clarkson-Woodruff embedding (CW) [Clarkson & Woodruff (2013)]: target dimension $O(d^2 / \varepsilon^2)$, running time $O(\mathrm{nnz}(X)) = O(nd)$
Theoretical guarantees based on the Wasserstein distance hold for all three embedding methods
Choose the approximation parameter ε or the target dimension k
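The input-sparsity running time of the CW embedding comes from the fact that every observation is added, with a random sign, to exactly one row of the sketch. The following is a minimal dense-array sketch of that idea; the function name `clarkson_woodruff_sketch` and the sizes are hypothetical, and this is not the RaProR implementation.

```python
import numpy as np

def clarkson_woodruff_sketch(A, k, rng):
    """Map the n rows of A to k rows: row i is multiplied by a random
    sign and added to one randomly chosen target row (one pass over A)."""
    n = A.shape[0]
    bucket = rng.integers(0, k, size=n)     # target row for each observation
    sign = rng.choice([-1.0, 1.0], size=n)  # random sign for each observation
    sketch = np.zeros((k, A.shape[1]))
    # np.add.at accumulates correctly when several rows share a bucket.
    np.add.at(sketch, bucket, sign[:, None] * A)
    return sketch, bucket, sign

rng = np.random.default_rng(2)
n, d, k = 1000, 4, 64                       # illustrative sizes (assumptions)
A = rng.standard_normal((n, d))
SA, bucket, sign = clarkson_woodruff_sketch(A, k, rng)

# The same map written as an explicit k x n matrix with a single
# non-zero entry per column, for comparison only.
Pi = np.zeros((k, n))
Pi[bucket, np.arange(n)] = sign
print("max difference to explicit product:", np.abs(SA - Pi @ A).max())
```

Because each row is touched once, the sketch can also be updated row by row, which suits data that arrive in a stream.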

12 Figure: Total running times in minutes, plotted against n, for the original data set and for the RAD, SRHT and CW sketches. Data sets with $n \in \{50\,000, \ldots\}$, $d = 50$, $\sigma = 5$ and approximation parameter $\varepsilon = 0.1$. For the embedded data sets, the total running time consists of the time for reading, embedding and analysing the data set. For the original data set, the embedding time is 0 since this step is not applied.

13 Summary & Outlook
- Projections can be used for Bayesian (and frequentist) linear regression
- Results of the analyses are close to those of the original model; some additional variation is introduced
- Running time is reduced substantially, and the reduction grows with increasing n
- Generalisations to hierarchical models, likelihoods based on p-norms with $p \in [1, 2]$ and some generalised linear models
- Implementation available in our R package RaProR [RaProR, Geppert et al. (2015)], which internally calls C++ code (category: Software)

14 Analysis for small n, large p
- Clinical or epidemiological studies with small n but many variables (e.g., GWAS, RNA-seq data, more than one molecular data type)
- Interactions between variables within or across data sets, pathway or network structures between variables
- Bayesian methods are well suited for this situation (modelling all sources of uncertainty, incorporating information for variable selection and shrinkage), but computationally challenging
Goal: Combine variable selection with other tasks like regression or prediction (e.g., first using leverage scores or principal component analysis and then a regression method)

15 Leverage Scores
- Sampling approaches (instead of random projections)
- In a regression context, leverage scores are a measure of the importance of observations, given as the diagonal elements of the hat matrix $H = X (X^\top X)^{-1} X^\top$
- In our context, leverage scores serve as a measure of the importance of variables: $H = [X, Y]^\top ([X, Y][X, Y]^\top)^{-1} [X, Y]$
- Cross-leverage scores in general: entries of H not on the main diagonal
- Here: influence of variable $X_i$ on Y, $C_i = h_{i,(p+1)}$
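The variable-level scores can be computed as follows. This is a sketch under two assumptions: the combined matrix [X, Y] has full row rank with n at most p + 1 (the small n, large p case), and the projection is obtained through a reduced QR decomposition, which is a numerical choice made here and not taken from the slides. The function name and the simulated data are hypothetical.

```python
import numpy as np

def variable_leverage_scores(X, y):
    """Leverage scores of the p variables and their cross-leverage
    scores with the response, based on H = Z^T (Z Z^T)^{-1} Z, Z = [X, y]."""
    Z = np.column_stack([X, y])   # n x (p + 1)
    Q, _ = np.linalg.qr(Z.T)      # reduced QR: Q is (p + 1) x n
    H = Q @ Q.T                   # equals Z^T (Z Z^T)^{-1} Z for full row rank Z
    p = X.shape[1]
    leverage = np.diag(H)[:p].copy()   # h_ii for the variables X_1..X_p
    cross_leverage = H[:p, p].copy()   # C_i = h_{i,(p+1)}
    return leverage, cross_leverage, H

rng = np.random.default_rng(4)
n, p = 20, 50                          # small n, large p (illustrative)
X = rng.standard_normal((n, p))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(n)
Z = np.column_stack([X, y])

ls, cls, H = variable_leverage_scores(X, y)
print("variables with largest |C_i|:", np.argsort(-np.abs(cls))[:5])
```

H is an orthogonal projection of rank n, so its diagonal entries lie between 0 and 1 and sum to n.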

16 Example: Idea and Simulation of SNPs
- Single Nucleotide Polymorphisms (SNPs) represent mutations at a single locus
- They are coded by the three values {0, 1, 2}; $S = 2$ and $S \neq 0$ represent cases in which more mutations are associated with diseased patients, $S = 0$ and $S \neq 2$ cases in which they are associated with non-diseased patients
- Simulate SNP data sets with an increasing number of variables p
- Only the first 12 SNPs influence Y (including higher-level interactions)
- Reduce the number of possibly relevant variables before conducting the analysis

17 Usefulness
Figure: density plots of the leverage scores (LS) and cross-leverage scores (CLS), shown separately for $S = 0$, $S = 1$, $S = 2$, $S \neq 0$, $S \neq 1$ and $S \neq 2$, in two panels (a) and (b) for different n.
- The distribution of leverage scores is similar for influential and non-influential SNPs
- The distribution of cross-leverage scores is different for influential and non-influential SNPs, which makes them useful for the sampling approach

18 Logic Regression
- Analysis using logistic regression is possible, but a more suitable tool for SNP data is logic regression: $Y = \sum_i \beta_i L_i$ with logic independent variables $L_i$ being 0 or 1
- E.g., $L_2$ is a Boolean combination of $(S_4 \neq 0)$ and $(S_5 = 0)$, or of the dummy variables $X_7$, $X_9$ and $X_{10}$
- Logic regression is particularly suited for modelling and interpreting higher-order interactions
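How logic variables enter the model can be illustrated with a toy example. The genotype values, the coefficients and in particular the AND/OR connectives below are assumptions made for this sketch, since the slide text does not preserve the operators. The code only evaluates given logic expressions; it does not perform the search over logic expressions that a logic regression fit involves.

```python
import numpy as np

# Toy genotype matrix: rows are individuals, columns are SNPs S_1..S_5,
# each coded 0, 1 or 2 (values made up for illustration).
S = np.array([
    [0, 1, 2, 1, 0],
    [1, 0, 0, 0, 0],
    [2, 2, 1, 2, 1],
    [0, 0, 1, 1, 0],
])

# Logic variables taking the values 0 or 1.
# Assumed connectives: OR for L1, AND for L2.
L1 = ((S[:, 0] == 2) | (S[:, 1] != 0)).astype(int)  # (S_1 = 2) OR (S_2 != 0)
L2 = ((S[:, 3] != 0) & (S[:, 4] == 0)).astype(int)  # (S_4 != 0) AND (S_5 = 0)

beta = np.array([1.5, -0.5])          # hypothetical coefficients
L = np.column_stack([L1, L2])
linear_predictor = L @ beta           # sum_i beta_i * L_i for each individual
print(L1, L2, linear_predictor)
```

Each logic variable encodes an interaction between several SNPs in a single 0/1 column, which is what makes higher-order interactions easy to read off the fitted model.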

19 Results

20 Summary & Outlook
- Reducing all variables to the most important main effects and interactions yields good results
- Cross-leverage scores are well suited for the reduction
- Logic regression is well suited for (high-order) interactions
- Such a two-stage procedure is recommended for small n, large p
- External knowledge can be incorporated in a Bayesian analysis through prior weights in the sampling procedure or informative prior distributions in the subsequent analysis
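One possible form of the first stage is to draw variables with probability proportional to the absolute cross-leverage score, multiplied by a prior weight. This particular sampling rule, the function name `sample_variables` and the numbers below are assumptions made for illustration; the slides do not specify the sampling scheme.

```python
import numpy as np

def sample_variables(cls, m, prior_weights=None, rng=None):
    """Draw m distinct variable indices with probability proportional to
    |cross-leverage score| times an optional prior weight (assumed rule)."""
    rng = np.random.default_rng() if rng is None else rng
    cls = np.asarray(cls, dtype=float)
    if prior_weights is None:
        weights = np.ones_like(cls)
    else:
        weights = np.asarray(prior_weights, dtype=float)
    scores = np.abs(cls) * weights
    probs = scores / scores.sum()
    chosen = rng.choice(cls.size, size=m, replace=False, p=probs)
    return np.sort(chosen)

cls = np.array([0.5, -0.1, 0.0, 0.3, 0.05])    # made-up cross-leverage scores
weights = np.array([0.0, 1.0, 1.0, 2.0, 1.0])  # hypothetical prior weights; 0 excludes a variable
selected = sample_variables(cls, m=2, prior_weights=weights,
                            rng=np.random.default_rng(3))
print("selected variable indices:", selected)
```

The selected columns would then be passed to the second stage, for example a logic regression on the reduced set of SNPs.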

21 Analysis for large n, large p
- MRI image data, where p and n are of the order of $10^5$ and $10^6$, respectively, represent a large-scale nonlinear regression task
- The model can be written as $Y \sim N(m(\theta), \sigma^2 I)$, where Y is an $n \times 1$ vector, $\theta$ is $p \times 1$, I denotes the $n \times n$ identity matrix, and $m(\theta)$ stands for the underlying nonlinear physical model
- Underlying situation in EMPIR application

22 Suggested Approach
- High-dimensional Bayesian regression with (spatial) Markov Random Field (MRF) priors
- Dimension reduction for a high number of variables p using Bayesian Principal Component Regression or sampling methods
- For a large number of observations n, employ Random Projections or Merge & Reduce approaches
- To model the MRF prior, introduce sparsity (combination with dimension reduction methods possible) or utilise approximate posterior inference
- Develop appropriate software

23 Literature I
- L. N. Geppert, K. Ickstadt, A. Munteanu, J. Quedenfeld, C. Sohler. Random Projections for Bayesian Regression. Statistics and Computing (2017)
- H. Schwender, K. Ickstadt. Identification of SNP interactions using logic regression. Biostatistics (2008)
- K. Ickstadt, M. Schäfer, M. Zucknick. Toward Integrative Bayesian Analysis in Molecular Biology. Annual Review of Statistics and Its Application (2018)

24 Literature II
- L. N. Geppert, K. Ickstadt, A. Munteanu, J. Quedenfeld, C. Sohler. RaProR: Calculate Sketches using Random Projections to Reduce Large Data Sets, Version
- W. B. Johnson, J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics (1984)
- T. Sarlós. Improved approximation algorithms for large matrices via random projections. In: Proceedings of FOCS (2006)

25 Literature III
- N. Ailon, E. Liberty. Fast Dimension Reduction Using Rademacher Series on Dual BCH Codes. Discrete & Computational Geometry (2009)
- K. L. Clarkson, D. P. Woodruff. Low rank approximation and regression in input sparsity time. STOC '13: Proceedings of the forty-fifth annual ACM Symposium on Theory of Computing (2013)
