Multidimensional data analysis in biomedicine and epidemiology


in biomedicine and epidemiology

Katja Ickstadt and Leo N. Geppert
Faculty of Statistics, TU Dortmund, Germany

Stakeholder Workshop, 12-13 December 2017, PTB Berlin

Supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876, project C4

Challenges

Overview of typical tasks in multidimensional data analysis in biomedicine and epidemiology, the aims of the employed methods, and the related model complexity.

Katja Ickstadt (Statistics, TU Dortmund) Multidimensional data analysis 12. 12. 2017 1 / 24

Challenges: Data

Main challenge: molecular data
- Different data types (e.g., genomic, epigenomic, proteomic data)
- Measured on different platforms (arrays, chips)
- Different biological units (e.g., gene, methylation site, protein)
- Preprocessing involving advanced (statistical) modelling and algorithms
- Avoid "garbage in, garbage out"


Challenges: Data

- Published high-level functional information from databases, e.g., Kyoto Encyclopedia of Genes and Genomes pathways (http://www.genome.jp/kegg/) or Gene Ontology terms (http://www.geneontology.org)
- Clinical variables including, e.g., Magnetic Resonance Imaging (MRI) data
- Epidemiological variables, e.g., environmental variables
- Several studies (similar questions)

Goal: analyse all data available for a specific question jointly in an integrative analysis

Challenges: Analysis

- Large number of observations n and small to moderate number of variables p
- Small n, large p
- Large n, large p

Analysis for large n, small p

- Large n (100 000 or more observations; second-by-second data), small to moderate p (up to 100)
- Data arising in streams (e.g., online monitoring of patients)
- Image data with short acquisition time (e.g., MRI images)
- Huge data sets (e.g., meta-studies with RNA-seq data, MRI data)
- High model complexity (e.g., network structure, spatio-temporal dependencies)

Goal: solve the problem for reduced n while keeping the main information and introducing a controllable error

Setting

- Large number of observations n, small to moderate number of variables p (n >> p)
- Data set might be too large to load into memory (e.g., data stream)
- Aim: conduct Bayesian (or frequentist) linear regression y = Xβ + u with u ~ N(0, σ_u² I)
- In the Bayesian case, β is a random vector with prior distribution p(β)
- Computationally demanding

Johnson-Lindenstrauss theorem

The idea of random projections is based on the Johnson-Lindenstrauss theorem (JLT) [Johnson & Lindenstrauss, 1984]. The JLT states that for every vector v ∈ R^n there exists a random matrix Π ∈ R^(k×n) such that

(1 - ε) ‖v‖₂² ≤ ‖Πv‖₂² ≤ (1 + ε) ‖v‖₂²

The matrix Π is a random projection of v and is also called an ε-subspace embedding.
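The norm-preservation guarantee of the JLT is easy to check numerically. The following is a minimal sketch, not the talk's implementation; the dimensions, the seed, and the choice of a Rademacher matrix scaled by 1/√k are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

n, k = 5_000, 1_000           # original and target dimension (illustrative)
v = rng.standard_normal(n)    # an arbitrary vector in R^n

# Rademacher projection: i.i.d. +/-1 entries, scaled by 1/sqrt(k)
Pi = rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(k)

# ||Pi v||_2^2 / ||v||_2^2 should lie within (1 - eps, 1 + eps)
ratio = np.sum((Pi @ v) ** 2) / np.sum(v ** 2)
print(ratio)
```

For moderate k the ratio concentrates around 1; its standard deviation is roughly √(2/k) here, so a few percent of deviation is expected.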

Role of the ε-subspace embedding

- ε-subspace embeddings can be used to reduce the number of observations from n to k while retaining most of the algebraic structure
- The analysis is then carried out on the embedded data set Π[X, Y] = [ΠX, ΠY]
- p_post(β | X, Y) ≈_ε p_post(β | ΠX, ΠY)
- The trade-off between goodness of approximation and data reduction can be adjusted using ε
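As a hypothetical illustration of running the regression on the embedded data [ΠX, ΠY] rather than on [X, Y]: the sketch below uses simulated data and a Gaussian sketching matrix (one valid ε-subspace embedding); sizes, seed and coefficients are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, k = 10_000, 5, 1_000
X = rng.standard_normal((n, p))
beta = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ beta + rng.normal(scale=1.0, size=n)

# Gaussian sketching matrix, one valid epsilon-subspace embedding
Pi = rng.standard_normal((k, n)) / np.sqrt(k)

# frequentist analysis on the original vs. the embedded data set
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
beta_sketch = np.linalg.lstsq(Pi @ X, Pi @ y, rcond=None)[0]
print(np.max(np.abs(beta_full - beta_sketch)))  # small
```

The sketched estimate agrees with the full-data estimate up to the controllable approximation error, while the solve itself only touches k rows instead of n.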

How do we find an ε-subspace embedding Π?

We compare three subspace embeddings:
- the Rademacher matrix (RAD) [Sarlós (2006)]: target dimension O(d/ε²), running time O(ndk)
- the Subsampled Randomized Hadamard Transform (SRHT) [Ailon & Liberty (2009)]: target dimension O(d log(d)/ε²), running time O(nd log k)
- the Clarkson-Woodruff embedding (CW) [Clarkson & Woodruff (2013)]: target dimension O(d²/ε²), running time O(nnz(X)) ≤ O(nd)

Theoretical guarantees based on the Wasserstein distance hold for all three embedding methods. Choose the approximation parameter ε or the target dimension k.
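Of the three, the Clarkson-Woodruff embedding is the simplest to write down: every row is hashed to one of k buckets with a random sign and summed, which is what gives the O(nnz(X)) running time. Below is a minimal numpy sketch of this construction applied to a simulated regression problem; data sizes, seed and the true coefficients are illustrative assumptions, not the talk's setup.

```python
import numpy as np

def countsketch(A, k, rng):
    """Clarkson-Woodruff (CountSketch) embedding: each row of A is mapped
    to one of k buckets with a random sign and summed, so building the
    sketch is a single pass over the nonzero entries of A."""
    n = A.shape[0]
    buckets = rng.integers(0, k, size=n)       # hash bucket per row
    signs = rng.choice([-1.0, 1.0], size=n)    # random sign per row
    S = np.zeros((k, A.shape[1]))
    np.add.at(S, buckets, signs[:, None] * A)
    return S

rng = np.random.default_rng(1)
n, d, k = 50_000, 5, 2_000
X = rng.standard_normal((n, d))
y = X @ np.arange(1.0, d + 1) + rng.normal(scale=1.0, size=n)

# sketch [X, y] jointly, then solve least squares on the embedded data
SXy = countsketch(np.column_stack([X, y]), k, rng)
beta_sketch = np.linalg.lstsq(SXy[:, :d], SXy[:, d], rcond=None)[0]
print(beta_sketch)  # close to (1, 2, 3, 4, 5)
```

Note that CW needs the larger target dimension k = O(d²/ε²), but building the sketch never materialises a k × n matrix.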

[Figure: total running times in minutes for data sets with n ∈ {50 000, 100 000}, d = 50, σ = 5 and approximation parameter ε = 0.1, comparing the original data set with the RAD, SRHT and CW sketches. For the embedded data sets, the total running time consists of the time for reading, embedding and analysing the data set. For the original data set, the embedding time is 0 since this step is not applied.]

Summary & Outlook

- Random projections can be used for Bayesian (and frequentist) linear regression
- Results of the analyses are close to the original model; some additional variation is introduced
- Running time is reduced by a substantial amount, and the reduction grows with increasing n
- Generalisations exist to hierarchical models, likelihoods based on p-norms with p ∈ [1, 2], and some generalised linear models
- An implementation is available in our R package RaProR [Geppert et al. (2015)], which internally calls C++ code: http://sfb876.tu-dortmund.de, category Software

Analysis for small n, large p

- Clinical or epidemiological studies with small n but many variables (e.g., GWAS studies, RNA-seq data, more than one molecular data type)
- Interactions between variables within or across data sets; pathway or network structures between variables

Goal: combine variable selection with other tasks like regression or prediction (e.g., first using leverage scores or principal component analysis and, second, a regression method)

Bayesian methods are well suited for this situation (modelling all sources of uncertainty, incorporating information for variable selection and shrinkage), but computationally challenging.

Leverage Scores

Sampling approaches (instead of random projections):
- In a regression context, leverage scores are a measure of the importance of observations, given as the diagonal elements of the hat matrix H = X(XᵀX)⁻¹Xᵀ
- In our context, leverage scores serve as a measure of the importance of variables: H = [X, Y]ᵀ([X, Y][X, Y]ᵀ)⁻¹[X, Y]
- Cross-leverage scores in general: entries of H off the main diagonal
- Here: influence of variable X_i on Y, C_i = h_{i,(p+1)}
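A small simulated example of the variable-wise hat matrix may help; the dimensions, seed and the choice of which variables influence Y are invented for illustration. Since H is an orthogonal projection, its diagonal entries lie in [0, 1] and its trace equals the rank n.

```python
import numpy as np

rng = np.random.default_rng(7)

n, p = 30, 100                              # small n, large p
X = rng.standard_normal((n, p))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=n)  # only X_1, X_2 matter

A = np.column_stack([X, y])                 # [X, Y], an n x (p+1) matrix
H = A.T @ np.linalg.inv(A @ A.T) @ A        # hat matrix over variables, (p+1) x (p+1)

leverage = np.diag(H)[:p]                   # leverage score of each variable
cross = H[:p, p]                            # cross-leverage C_i = h_{i, p+1}

# influential variables tend to show large |C_i|
print(np.argsort(-np.abs(cross))[:5])
```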

Example: Idea and Simulation of SNPs

- Single nucleotide polymorphisms (SNPs) represent mutations at a single locus
- They are coded by the three values {0, 1, 2}; S = 2 and S ≠ 0 stand for cases where more mutations indicate disease, and S = 0 and S ≠ 2 for non-diseased patients
- Simulate SNP data sets with an increasing number of variables p
- Only the first 12 SNPs influence Y (including higher-level interactions)
- Reduce the number of possibly relevant variables before conducting the analysis
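A hypothetical simulation of such SNP data can be set up in a few lines; the allele frequencies, sizes, effect probabilities and disease mechanism below are invented for illustration and do not reproduce the talk's simulation design.

```python
import numpy as np

rng = np.random.default_rng(3)

n, p = 200, 500                              # patients x SNPs (illustrative)
maf = rng.uniform(0.05, 0.5, size=p)         # invented minor allele frequencies
S = rng.binomial(2, maf, size=(n, p))        # genotypes coded 0/1/2

# only the first few SNPs influence disease status, including an interaction
risk = (S[:, 0] == 2) | ((S[:, 1] >= 1) & (S[:, 2] >= 1))
y = rng.binomial(1, np.where(risk, 0.8, 0.2))

print(S.shape, round(y.mean(), 2))
```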

[Figure: densities of (a) leverage scores (LS) and (b) cross-leverage scores (CLS) for n = 40, separately for SNPs with S = 0, S = 1, S = 2 and S ≠ 0, S ≠ 1, S ≠ 2.]

- The distribution of leverage scores is similar for influential and non-influential SNPs
- The distribution of cross-leverage scores differs between influential and non-influential SNPs, which makes them useful for the sampling approach

Logic Regression

- Analysis using logistic regression is possible, but a more suitable tool for SNP data is logic regression: Y = Σᵢ βᵢ Lᵢ with logic independent variables Lᵢ being 0 or 1
- E.g., L₂ = (S₄ ≠ 0) ∧ (S₅ = 0), or L₂ = (X₇ ∧ X₉) ∨ X₁₀ using dummy variables
- Logic regression is particularly suited for modelling and interpreting higher-order interactions
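Given fixed logic terms Lᵢ, estimating the βᵢ is ordinary regression; the hard part of logic regression is the search over Boolean terms (e.g., via simulated annealing or MCMC), which is omitted here. A minimal sketch with hand-picked, hypothetical logic terms on simulated genotypes:

```python
import numpy as np

rng = np.random.default_rng(5)

n = 500
S = rng.integers(0, 3, size=(n, 6))          # six SNPs coded 0/1/2

# dummy coding per SNP: "any mutation" (S != 0) and "homozygous" (S == 2)
D_any = (S != 0)
D_hom = (S == 2)

# two illustrative logic terms (the trees logic regression would search over)
L1 = D_any[:, 0] & ~D_hom[:, 1]              # (S_1 != 0) AND NOT (S_2 = 2)
L2 = D_hom[:, 2] | D_any[:, 3]               # (S_3 = 2) OR (S_4 != 0)

y = 1.0 * L1 + 2.0 * L2 + rng.normal(scale=0.1, size=n)

# with the logic terms fixed, the betas are an ordinary least-squares fit
Z = np.column_stack([np.ones(n), L1, L2]).astype(float)
beta = np.linalg.lstsq(Z, y, rcond=None)[0]
print(beta)   # approximately (0, 1, 2)
```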

Results

Summary & Outlook

- Reducing all variables to the most important main effects and interactions yields good results
- Cross-leverage scores are well suited for this reduction
- Logic regression is well suited for (high-order) interactions
- Such a two-stage procedure is recommended for small n, large p
- External knowledge can be incorporated in a Bayesian analysis through prior weights in the sampling procedure or informative prior distributions in the subsequent analysis

Analysis for large n, large p

- MRI image data with p and n of the order of 10⁵ and 10⁶, respectively
- This represents a large-scale nonlinear regression task; the model can be written as Y ~ N(m(θ), σ² I), where Y is an n × 1 vector, θ is p × 1, I denotes the n × n identity matrix, and m(θ) stands for the underlying nonlinear physical model
- This is the underlying situation in the EMPIR application
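To make the model class concrete, here is a toy instance of Y ~ N(m(θ), σ²I): a hypothetical two-parameter exponential decay standing in for the (vastly larger) physical MRI model, fitted by Gauss-Newton. Everything here, the model m(θ), the design points, the seed and the starting value, is an invented illustration.

```python
import numpy as np

rng = np.random.default_rng(11)

# hypothetical nonlinear model m(theta) evaluated at fixed design points
t = np.linspace(0.0, 4.0, 200)

def m(theta):
    a, b = theta
    return a * np.exp(-b * t)

def jacobian(theta):
    a, b = theta
    return np.column_stack([np.exp(-b * t), -a * t * np.exp(-b * t)])

theta_true = np.array([3.0, 0.7])
y = m(theta_true) + rng.normal(scale=0.05, size=t.size)  # Y ~ N(m(theta), sigma^2 I)

# Gauss-Newton iterations for the maximum-likelihood estimate of theta
theta = np.array([2.0, 1.0])
for _ in range(20):
    J, r = jacobian(theta), y - m(theta)
    theta = theta + np.linalg.lstsq(J, r, rcond=None)[0]

print(theta)  # approximately theta_true
```

In the slide's setting, θ has around 10⁵ components and Y around 10⁶, which is why the dimension-reduction techniques above become necessary.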

Suggested Approach

- High-dimensional Bayesian regression with (spatial) Markov random field (MRF) priors
- Dimension reduction for the high number of variables p using Bayesian principal component regression or sampling methods
- For the large number of observations n, employ random projections or merge & reduce approaches
- To model the MRF prior, introduce sparsity (combination with dimension reduction methods possible) or utilise approximate posterior inference
- Develop appropriate software

Literature I

- L.N. Geppert, K. Ickstadt, A. Munteanu, J. Quedenfeld, C. Sohler: Random Projections for Bayesian Regression. Statistics and Computing (2017)
- H. Schwender, K. Ickstadt: Identification of SNP interactions using logic regression. Biostatistics (2008)
- K. Ickstadt, M. Schäfer, M. Zucknick: Toward Integrative Bayesian Analysis in Molecular Biology. Annual Review of Statistics and Its Application (2018)

Literature II

- L.N. Geppert, K. Ickstadt, A. Munteanu, J. Quedenfeld, C. Sohler: RaProR: Calculate Sketches using Random Projections to Reduce Large Data Sets, Version 1.0 (2015)
- W.B. Johnson, J. Lindenstrauss: Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics (1984)
- T. Sarlós: Improved approximation algorithms for large matrices via random projections. In: Proceedings of FOCS (2006)

Literature III

- N. Ailon, E. Liberty: Fast Dimension Reduction Using Rademacher Series on Dual BCH Codes. Discrete & Computational Geometry (2009)
- K.L. Clarkson, D.P. Woodruff: Low rank approximation and regression in input sparsity time. STOC '13: Proceedings of the forty-fifth annual ACM Symposium on Theory of Computing (2013)

Katja Ickstadt (Statistics, TU Dortmund) Multidimensional data analysis 12. 12. 2017 24 / 24