High-dimensional data: Exploratory data analysis


High-dimensional data: Exploratory data analysis Mark van de Wiel mark.vdwiel@vumc.nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University Contributions by Wessel van Wieringen

Intro High-Dimensional Data

High-dimensional data: Definition High-dimensional data: Data for which the number of variables, p, exceeds the number of observations, n. Examples Genomics data. E.g. measurements on all human genes, p = 25,000, for, say, n = 100 individuals. Imaging data (fMRI). Thousands or millions of pixels (or voxels) for hundreds of individuals. Astronomy. Terabytes of data for a limited number of galaxies / black holes / etc.

High-dimensional data: practical challenges Inference Which genes are differentially expressed between cancer and normal tissue? Visualization How to visualize high-dimensional observations (samples) and discover subgroups? Prediction Probability of tumor recurrence given the genomic profile of the primary tumor (baseline). Functional relationships Which brain regions interact functionally? Expression refers to a (relative) quantification of the gene in a cell/tissue/sample.

High-dimensional data: statistical challenges Inference Statistical models; multiple testing; shrinkage: borrowing information across features. Visualization Clustering, principal component analysis. Prediction Fitting ordinary regression models is not feasible: penalized regression; machine learning approaches. Functional relationships Construction of networks that describe such relationships.

Slides: Wessel van Wieringen Exploratory analysis I: Hierarchical clustering

Hierarchical clustering Objective of cluster analysis: cluster analysis seeks meaningful, data-determined groupings of samples, such that samples are more similar within than across groups; this similarity in gene expression profiles is assumed to imply some form of phenotypic similarity of the samples. Cluster analysis is also known as: unsupervised learning, unsupervised classification, class discovery, and data segmentation.

Hierarchical clustering Hierarchical clustering produces a nested sequence of clusters. It starts with all objects apart, and at each step two clusters are merged, until only one is left. The nested sequence can be represented by a dendrogram. A dendrogram is a two-dimensional diagram, a tree. Each fusion of clusters is plotted at a height equal to the dissimilarity of the two clusters which are joined.
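
As a loose sketch of how such a nested merging sequence can be computed in practice (not from the slides; it assumes numpy and scipy are available and uses a small simulated expression matrix):

```python
# Minimal sketch: agglomerative clustering of samples on simulated expression data.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
# 6 samples (rows) x 10 genes (columns); two groups that differ in mean expression
Y = np.vstack([rng.normal(0.0, 1.0, size=(3, 10)),
               rng.normal(2.0, 1.0, size=(3, 10))])

# Agglomerative clustering: start with every sample apart and merge the two
# closest clusters at each step (average linkage, Euclidean distance here).
Z = linkage(Y, method="average", metric="euclidean")
print(Z)   # each row: the two clusters merged and the height (dissimilarity) of the fusion

# scipy.cluster.hierarchy.dendrogram(Z) would draw the corresponding tree.
```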

Hierarchical clustering Building a dendrogram (loosely): Find the samples that have the most similar gene expression profiles. [Figure: samples 1-6 plotted by the expression of two genes, with the dendrogram under construction; leaves ordered sample 2, 6, 5, 4, 3, 1.]

Hierarchical clustering Building a dendrogram (loosely): Samples 1 and 3 have the most similar gene expression profiles. Let these samples form a cluster. Repeat this exercise. [Figure: as before, with samples 1 and 3 now joined.]

Hierarchical clustering Building a dendrogram (loosely): Look for samples or clusters that have the most similar gene expression profiles. [Figure: as before.]

Hierarchical clustering Building a dendrogram (loosely): New clusters may form: samples 2 and 6. [Figure: as before, with samples 2 and 6 now joined.]

Hierarchical clustering Building a dendrogram (loosely). [Figure: as before, after further merges.]

Hierarchical clustering Building a dendrogram (loosely): Finally, all samples/clusters are merged into one big cluster. [Figure: the completed dendrogram over samples 2, 6, 5, 4, 3, 1.]

Hierarchical clustering Heatmap A dendrogram is often used in combination with a heatmap. Heatmap of an expression matrix: a heat map is a graphical representation of data where the values taken by a variable in a two-dimensional map are represented as colors. [Color key ranging from -5 to 5.]

Hierarchical clustering Expression matrix / Heatmap

         S_1    ...   S_i    ...   S_50
g_1     -1.3          0.1         -1.2
g_2     -0.1          2.4          0.3
...
g_j      0.4          1.5         -0.2
...
g_846   -0.9         -0.8          0.4

Entry (j, i): expression of gene j in sample i. [Color key from -5 to 5.]

Hierarchical clustering Visualization of hierarchical clustering results: dendrogram and heatmap combined. [Figure: heatmap with genes as rows and samples as columns, combined with the dendrogram(s).]
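
A hedged sketch of how a combined dendrogram-plus-heatmap display of this kind might be produced, assuming the seaborn library is available; the data and parameter choices below are illustrative only:

```python
# Sketch: heatmap combined with dendrograms, assuming seaborn is installed.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
expr = rng.normal(size=(30, 12))   # illustrative matrix: 30 genes x 12 samples
expr[:10, :6] += 2.0               # a block of co-expressed genes in half the samples

# clustermap clusters both genes (rows) and samples (columns) and draws the
# heatmap together with the two dendrograms.
sns.clustermap(expr, method="complete", metric="euclidean", cmap="RdBu_r")
plt.show()
```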

Hierarchical clustering Hierarchical clustering of genes Genes that cluster together are believed to be functionally related (modules / pathways / GO nodes). This may help to characterize unknown genes. One may also cluster samples and genes simultaneously. [Figure: heatmap with genes as rows and samples as columns.]

Hierarchical clustering Distance Central to cluster analysis is the notion of distance (or dissimilarity) between the objects being clustered. Distance measures take on values between 0 and 1: 0 reflects maximum similarity between two samples, 1 means that two samples are not similar at all, and values in between indicate various degrees of resemblance.

Hierarchical clustering Some distance measures (for continuous data)
Data: Y_ij, with column vectors (samples) Y_·j.
Euclidean distance: d_E(Y_·j, Y_·k) = sqrt( Σ_{i=1}^p (Y_ij − Y_ik)² )
Manhattan distance: d_M(Y_·j, Y_·k) = Σ_{i=1}^p |Y_ij − Y_ik|
(Pearson) correlation: d_C(Y_·j, Y_·k) = 1 − Σ_{i=1}^p (Y_ij − Ȳ_·j)(Y_ik − Ȳ_·k) / sqrt( Σ_{i=1}^p (Y_ij − Ȳ_·j)² Σ_{i=1}^p (Y_ik − Ȳ_·k)² )
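
The three dissimilarities can be written down directly; a minimal numpy sketch, using two simulated sample profiles (not data from the slides):

```python
# Sketch: the three dissimilarities between two sample profiles (numpy only).
import numpy as np

rng = np.random.default_rng(3)
p = 1000
y_j, y_k = rng.normal(size=p), rng.normal(size=p)   # two simulated expression profiles

d_euclidean = np.sqrt(np.sum((y_j - y_k) ** 2))
d_manhattan = np.sum(np.abs(y_j - y_k))
d_correlation = 1.0 - np.corrcoef(y_j, y_k)[0, 1]   # 1 minus the Pearson correlation

print(d_euclidean, d_manhattan, d_correlation)
# scipy.spatial.distance.pdist(Y.T, metric="euclidean" / "cityblock" / "correlation")
# computes the corresponding distances for all pairs of columns of a data matrix Y.
```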

Hierarchical clustering Distance between clusters Distance measures are defined between two samples. In hierarchical clustering, the distance between groups of samples (clusters) also needs to be assessed. Linkage tells us how to do that. [Figure: two clusters, A and B.]

Hierarchical clustering Linkage between cluster A and cluster B: single linkage uses the minimum distance, average linkage the average distance, and complete linkage the maximum distance between members of the two clusters.

Hierarchical clustering Effects of linkage: complete linkage yields a more compact clustering. [Figure: dendrograms obtained with complete, single and average linkage.]
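
A small sketch comparing the three linkage choices on the same simulated data, assuming scipy; the two-group structure and all parameter values are illustrative:

```python
# Sketch: effect of the linkage choice on the same simulated two-group data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
Y = np.vstack([rng.normal(0.0, 1.0, size=(10, 50)),
               rng.normal(3.0, 1.0, size=(10, 50))])   # 20 samples x 50 genes

for method in ("single", "average", "complete"):
    Z = linkage(Y, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")     # cut the tree into 2 clusters
    print(f"{method:9s} final merge height: {Z[-1, 2]:.2f}  labels: {labels}")
```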

Exploratory analysis II: Principal Component Analysis (PCA)

Principal component analysis (PCA) Samples/individuals, e.g. n = 8; features/genes, e.g. p = 25,000. [Data matrix: genes (rows) × samples (columns), with the samples divided into Group 1 and Group 2.] Now suppose we pretend not to see the groups...

Principal component analysis (PCA) Samples/individuals, e.g. n = 8; features/genes, e.g. p = 25,000. [The same data matrix, now without group labels.] Challenge: if the genomics data are relevant for the underlying grouping, we should be able to observe this after visualization. Solution 1: clustering (but this depends a lot on ad hoc choices...). Solution 2: principal component analysis (PCA).

Principal component analysis (PCA) Two-gene world. [Figure: samples plotted by the expression of gene 1 vs gene 2.] But how to obtain a similar visualisation for gene dimension p = 25,000?

Principal component analysis (PCA) Principal components (PCs). Y_ij: data. The k-th PC for sample j is the linear combination Z_j^k = Σ_{i=1}^p w_i^k Y_ij.
First PC: argmax_w Var( Σ_{i=1}^p w_i Y_ij ) subject to ||w||² = Σ_{i=1}^p (w_i)² = 1.
k-th PC: as above, but with an additional orthogonality constraint:
argmax_{w^k} Var( Σ_{i=1}^p w_i^k Y_ij ) subject to (I) ||w^k||² = Σ_{i=1}^p (w_i^k)² = 1 and (II) w^k · w^h = 0 for h = 1, ..., k-1.

Principal component analysis (PCA)
Var( Σ_{i=1}^p w_i Y_ij ) = w^T Σ_{p×p} w, so the problem is max_w w^T Σ_{p×p} w subject to w^T w = 1.
Introduce a Lagrange multiplier to deal with the constraint:
L(w) = w^T Σ_{p×p} w − λ (w^T w − 1)
dL(w)/dw = 2 w^T Σ_{p×p} − 2 λ w^T = 0  ⟹  Σ_{p×p} w = λ w.
Eigenvectors z = w are the solutions; z_max, corresponding to the maximum eigenvalue λ_max, renders the global maximum (simply substitute: w^T Σ_{p×p} w = λ w^T w = λ).
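
A minimal sketch of this eigenvector characterisation, assuming numpy and simulated data with a small p so that the p × p covariance matrix is easy to handle:

```python
# Sketch: first PC as the top eigenvector of the sample covariance matrix.
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 10                  # small p here, so the p x p covariance is easy to handle
Y = rng.normal(size=(n, p))    # rows = samples, columns = variables (genes)
Y -= Y.mean(axis=0)            # center each variable

Sigma = Y.T @ Y / (n - 1)                  # p x p sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
w1 = eigvecs[:, -1]                        # unit-norm eigenvector of the largest eigenvalue
z1 = Y @ w1                                # scores on the first principal component

# Var(Y w) over unit-norm w is maximised at w1, with value equal to the top eigenvalue:
print(np.isclose(z1.var(ddof=1), eigvals[-1]))
```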

Principal component analysis (PCA) Efficient computation of PCs (1)
Required: eigenvalues of Σ and orthonormal eigenvectors z. Solution: singular value decomposition (SVD).
First, let X = Y^T, so X is n×p and Σ_{p×p} = X^T X / (n−1) = Y Y^T / (n−1). ¹
Then, the SVD is a factorisation of X into U, an orthonormal n×n matrix, D, a rectangular n×p diagonal matrix ², and W, an orthonormal p×p matrix:
X = U D W^T  ⟹  X^T X = W D² W^T (with D² = D^T D, a p×p diagonal matrix).
¹ Assume w.l.o.g. that each gene is centered: mean 0 per row of Y (column of X).
² A matrix consisting of an n×n diagonal block and (p−n) zero columns.

Principal component analysis (PCA) Efficient computation of PCs (2)
X = U D W^T  ⟹  X^T X = W D² W^T.
The latter is the eigenvalue decomposition of the symmetric p×p matrix X^T X. Problem: p is large. However,
Y = X^T = (U D W^T)^T = W D^T U^T  ⟹  Y^T Y = U (D D^T) U^T,
and Y^T Y is of dimension n×n; n is small. So the solution is:
1. The eigenvalue decomposition of Y^T Y renders D and U. ¹
2. Y U = W D^T, or, for the k-th PC: column W_·k = [Y U]_·k / D_kk, where k corresponds to the k-th largest eigenvalue.
¹ Using standard algorithms for finding eigenvalues and eigenvectors.
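
A sketch of this n × n trick on simulated data (illustrative sizes, numpy only); the direct SVD is computed purely as a check and would be costly for a truly huge p:

```python
# Sketch: PC loadings via the n x n matrix Y^T Y instead of the p x p covariance.
import numpy as np

rng = np.random.default_rng(6)
n, p = 20, 5000
Y = rng.normal(size=(p, n))                # genes x samples, as on the slides
Y -= Y.mean(axis=1, keepdims=True)         # center each gene across the samples

G = Y.T @ Y                                # n x n matrix Y^T Y
eigvals, U = np.linalg.eigh(G)             # ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

r = n - 1                                  # centering removes one dimension
D = np.sqrt(eigvals[:r])                   # singular values of X = Y^T
W = (Y @ U[:, :r]) / D                     # loadings: W[:, k] = [Y U]_{.k} / D_kk

# Check against a direct SVD of X = Y^T (feasible here; costly when p is huge):
_, s, Vt = np.linalg.svd(Y.T, full_matrices=False)
print(np.allclose(D, s[:r]))
print(np.allclose(np.abs(W.T @ Vt[:r].T), np.eye(r), atol=1e-6))  # same loadings up to sign
```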

Principal component analysis (PCA) Principal components (recap): the k-th PC for sample j is Z_j^k = Σ_{i=1}^p w_i^k Y_ij, where the weight vectors w^k maximise the variance subject to unit norm and orthogonality to the previous weight vectors (see the definition above).

Principal component analysis (PCA) Visualization
The k-th principal component for individual j: PC_k(j) = Σ_{i=1}^p w_i^k Y_ij, where Y_ij is the data for individual j.
Plot PC_1(j) vs PC_2(j) for all individuals j = 1, ..., n.
In words: for each individual, plot the coordinates of the two orthogonal summaries of the p-dimensional data that explain most of the variation between individuals. If a group label associates strongly with the p-dimensional data, one may expect the groups to be separated by (a combination of) the two PCs.
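
A sketch of such a PC1-versus-PC2 plot for simulated two-group expression data, assuming numpy and matplotlib; the group effect and all sizes are illustrative:

```python
# Sketch: PC1-vs-PC2 plot for simulated two-group expression data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
n, p = 8, 2000
group = np.repeat([0, 1], n // 2)               # hypothetical group label per sample
Y = rng.normal(size=(p, n)) + 1.5 * group       # mean shift for the second group
Y -= Y.mean(axis=1, keepdims=True)              # center each gene

# Scores of the samples on the PCs via the SVD of X = Y^T:
U, D, Wt = np.linalg.svd(Y.T, full_matrices=False)
scores = U * D                                   # column k holds PC_k(j) for all samples j

plt.scatter(scores[:, 0], scores[:, 1], c=group)
plt.xlabel("PC 1"); plt.ylabel("PC 2")
plt.show()
```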

Principal component analysis (PCA) Application: colon cancer. Black: healthy colon tissue; green: tumor colon tissue. Measurements: ~2000 microRNA ¹ expressions.
¹ Small pieces of RNA, which can degrade the mRNA of genes.

Efficient parameter estimation in p > n models: shrinkage Mark van de Wiel mark.vdwiel@vumc.nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University Many slides courtesy of Wessel van Wieringen

Data: Setting. Samples/individuals, e.g. n = 8; features/genes, e.g. p = 25,000. [Data matrix X^T: genes (rows) × samples (columns), with the samples divided into Group 1 and Group 2.]
Model per gene: X_ij = β_0j + β_1j Z_i (i = sample, j = gene; Z_i = 0 when sample i is in group 1 and 1 otherwise).
How to efficiently estimate β_1j?
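
A sketch of fitting this per-gene model by ordinary least squares for all genes at once, on simulated data (the group effects and all sizes below are illustrative, not from the slides):

```python
# Sketch: per-gene least-squares estimates of beta_1j in X_ij = beta_0j + beta_1j Z_i.
import numpy as np

rng = np.random.default_rng(8)
n, p = 8, 1000
Z = np.repeat([0, 1], n // 2)                 # group indicator per sample
beta1 = rng.normal(0.0, 0.5, size=p)          # hypothetical true per-gene group effects
X = rng.normal(size=(n, p)) + np.outer(Z, beta1)

# One least-squares fit handles all p genes at once (columns of X):
design = np.column_stack([np.ones(n), Z])     # n x 2 design matrix (intercept, group)
coef, *_ = np.linalg.lstsq(design, X, rcond=None)
beta1_hat = coef[1]                           # estimated beta_1j for every gene j

print(np.round(beta1[:5], 2), np.round(beta1_hat[:5], 2))
```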

James-Stein estimator

James-Stein estimator JS example
Let X_i = (X_i1, ..., X_ip)^T, i = 1, ..., n, be p-variate normally distributed random variables with mean vector μ and independent components. The mean vector μ is estimated from this random sample of size n using the quadratic loss function L(μ, μ̂) = Σ_{j=1}^p (μ̂_j − μ_j)². Then, the least squares (LS) estimate of μ is the sample mean: μ̂ = X̄ = n^{-1} Σ_{i=1}^n X_i.

James-Stein estimator JS example (continued)
The (total) mean squared error (MSE) of this estimator is MSE(X̄) = E||X̄ − μ||² = Σ_{j=1}^p Var(X̄_j) for independent X_j's. This does not involve μ. Recall that, in general, for any estimator μ̂ of μ: MSE(μ̂) = E||μ̂ − μ||² = Σ_{j=1}^p { Var(μ̂_j) + Bias(μ̂_j)² }. Hence, the MSE is a measure of the quality of the estimator.

James-Stein estimator The James-Stein (JS) estimator is an estimator that outperforms the ML estimator, in the sense that it yields a smaller MSE. The JS estimator is of the form
θ̂_j(λ) = (1 − λ) θ̂_j + λ θ̂_target,j,
where θ̂_j is the original estimator, θ̂_target,j is a target estimator, and λ ∈ [0, 1] is the shrinkage parameter, which determines how much the two estimators are pooled.

James-Stein estimator The JS estimator is of the form θ̂_j(λ) = (1 − λ) θ̂_j + λ θ̂_target,j, with e.g. θ̂_j the sample variance for a given gene and θ̂_target,j a pooled variance estimate across all genes.

λ       Estimate   Pooled estimate   Shrinkage estimate
0          2              1                  2
0.25       2              1                  1.75
0.5        2              1                  1.5
1          2              1                  1
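
The table can be reproduced with a few lines; a sketch using the example values above (estimate 2, pooled estimate 1):

```python
# Sketch: linear shrinkage for a grid of lambda, reproducing the table above.
theta_hat = 2.0      # original estimate (e.g. gene-specific variance)
theta_target = 1.0   # target estimate (e.g. pooled variance across genes)

for lam in (0.0, 0.25, 0.5, 1.0):
    shrunken = (1 - lam) * theta_hat + lam * theta_target
    print(f"lambda = {lam:<4}  shrinkage estimate = {shrunken}")
# lambda = 0 gives the original estimate (2), lambda = 1 the full target (1).
```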

James-Stein estimator The MSE of the JS estimator can be expressed as
MSE(θ̂(λ)) = Σ_{j=1}^p MSE(θ̂_j(λ)),
MSE(θ̂_j(λ)) = E[ ( (θ̂_j − θ_j) − λ (θ̂_j − θ̂_target,j) )² ]
            = MSE(θ̂_j) + λ² E[ (θ̂_j − θ̂_target,j)² ] − 2λ E[ (θ̂_j − θ_j)(θ̂_j − θ̂_target,j) ].
This is a parabola in λ, whose coefficients are determined by the first two moments of both estimators.

James-Stein estimator [Figure: MSE = f(λ), a parabola showing the interval of λ-values leading to an MSE decrease, the optimal shrinkage, no shrinkage (λ = 0) and full shrinkage (λ = 1).]
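
A Monte-Carlo sketch of this MSE curve, assuming numpy; the parameter values (p, n, τ) and the choice of the overall mean as target are illustrative, not from the slides:

```python
# Sketch: Monte-Carlo estimate of MSE(lambda) for the shrunken estimator.
import numpy as np

rng = np.random.default_rng(9)
p, n, tau, reps = 100, 10, 0.3, 200
lambdas = np.linspace(0.0, 1.0, 21)
mse = np.zeros_like(lambdas)

for _ in range(reps):                              # repeated data sets
    mu = rng.normal(0.0, tau, size=p)              # true gene-wise means
    X = rng.normal(mu, 1.0, size=(n, p))           # n observations per gene
    theta_hat = X.mean(axis=0)                     # gene-wise sample means
    target = theta_hat.mean()                      # overall mean as the target
    for k, lam in enumerate(lambdas):
        shrunken = (1 - lam) * theta_hat + lam * target
        mse[k] += np.mean((shrunken - mu) ** 2)
mse /= reps

# The curve is a parabola in lambda: lambda = 0 is 'no shrinkage', lambda = 1 'full shrinkage'.
print("approximate optimal lambda:", lambdas[np.argmin(mse)])
print("MSE at lambda = 0:", round(mse[0], 4), " minimum MSE:", round(mse.min(), 4))
```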

James-Stein estimator Simulation: n samples, p genes, X_ij ~ N(μ_j, 1), μ_j ~ N(0, τ²).
Investigate the shrinkage effect under 3 different scenarios:
I: vary τ; p = 100, n = 40, τ = 0.1, 0.2, 0.4
II: vary n; p = 10·n, n = 10, 100, 200, τ = 0.1
III: vary p/n; p = 1000, n = 20, 50, 300, τ = 0.1

James-Stein estimator Simulation (continued) Estimators: the gene-wise estimate and its shrunken (JS) version as defined above. Now study the MSE of the JS estimator in relation to λ.

James-Stein estimator Simulation (continued): scenario I. Shrinkage yields more benefit when the genes are more alike.

James-Stein estimator Simulation (continued): scenario II. Shrinkage yields more benefit when n is small.

James-Stein estimator Simulation (continued): scenario III. Shrinkage yields more benefit with larger p/n ratios.

James-Stein estimator Crucial question: how to determine λ in θ̂_j(λ) = (1 − λ) θ̂_j + λ θ̂_target,j?
Remember the simulation: n samples, p genes, X_ij ~ N(μ_j, 1), μ_j ~ N(0, τ²).
The latter can be regarded as a prior. We need to know this prior to estimate λ: empirical Bayes.

Empirical Bayes

Empirical Bayes The JS estimator can be motivated from an empirical Bayes perspective. Empirical Bayes methods are Bayesian methods with a twist. In an empirical Bayes setting, the parameters at the top level of a hierarchical model are set to their optimal values (as determined from the data), instead of being integrated out. Roughly: the priors are estimated rather than assumed.

Empirical Bayes JS example (continued) Recall: X_ij ~ N(μ_j, 1). The μ_j are a sample from a prior distribution: μ_j ~ N(θ, τ²). The Bayes estimator of the μ_j given the data is their posterior mean.

Empirical Bayes JS example (continued) The posterior mean is given by ¹
E[μ_j | X_1j, ..., X_nj] = X̄_j − (nτ² + 1)^{-1} (X̄_j − θ),
which is of the same form as the JS estimator...
¹ Standard Bayesian calculations (normal-normal conjugacy).

Empirical Bayes Rewrite with θ̂_target,j = θ and θ̂_j = X̄_j: the posterior mean
θ + [1 − (nτ² + 1)^{-1}] (X̄_j − θ) = X̄_j − (nτ² + 1)^{-1} (X̄_j − θ)
is of the James-Stein form with λ = (nτ² + 1)^{-1}. If n or τ is large, there is little shrinkage towards the target.
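
A numerical sanity check of this posterior-mean formula for a single gene, assuming numpy; the prior parameters and sample size below are illustrative, and the grid integration serves only as a check:

```python
# Sketch: check the normal-normal posterior mean against numerical integration.
import numpy as np

rng = np.random.default_rng(10)
theta, tau, n = 0.5, 0.8, 6                  # illustrative prior parameters and sample size
mu_true = rng.normal(theta, tau)
x = rng.normal(mu_true, 1.0, size=n)

# Closed form: posterior mean = xbar - lambda * (xbar - theta), lambda = 1 / (n tau^2 + 1)
lam = 1.0 / (n * tau**2 + 1.0)
post_mean_formula = x.mean() - lam * (x.mean() - theta)

# Numerical integration of the posterior mean over a fine grid of mu values
grid = np.linspace(-10.0, 10.0, 20001)
log_post = -0.5 * ((x[:, None] - grid) ** 2).sum(axis=0) - 0.5 * (grid - theta) ** 2 / tau**2
weights = np.exp(log_post - log_post.max())
post_mean_numeric = (grid * weights).sum() / weights.sum()

print(round(post_mean_formula, 4), round(post_mean_numeric, 4))  # should agree closely
```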

Empirical Bayes JS example (continued) Remember: μ_j ~ N(θ, τ²). Typically, use θ̂_target,j = θ: the prior mean θ plays the role of the target. How to estimate θ and τ?

Empirical Bayes Marginal likelihood
Marginal likelihood: the likelihood integrated w.r.t. all prior(s):
p(X; α) = ∫ p(X | λ) p_α(λ) dλ
Parametric empirical Bayes: maximize p(X; α) w.r.t. the parameters α.
Example: X_ij ~ iid N(μ_j, 1), μ_j ~ iid N(θ, τ²), so α = {θ, τ}:
p(X; θ, τ) = Π_j ∫ Π_i N(x_ij; μ_j, 1) N(μ_j; θ, τ²) dμ_j

Empirical Bayes
p(X; θ, τ) = Π_j p(X_·j; θ, τ) = Π_j ∫ Π_i N(x_ij; μ_j, 1) N(μ_j; θ, τ²) dμ_j
The integral reduces to a product of Gaussian forms (conjugacy).
Example: X_ij ~ iid N(μ_j, 1), μ_j ~ iid N(θ, τ²). What is the unconditional density of X_j = (X_1j, ..., X_nj): p(X_j) = p(X_j; θ, τ)?

Bayesian inference, conjugate priors, example ¹
P(X) = ∫ P(X | μ) P(μ) dμ
     = C ∫ exp( −Σ_{i=1}^n (X_i − μ)² / 2 ) exp( −(μ − θ)² / (2τ²) ) dμ
     = C ∫ exp( −Σ_{i=1}^n (X_i − A)² / B ) exp( −(μ − D)² / E ) dμ,
where A, B do not depend on μ (but do depend on {θ, τ}). The second factor is a Gaussian form in μ, so its integral over μ is a known constant; the first exponential is also a Gaussian form: a product of Gaussians.
¹ Dropping the index j.

Empirical Bayes
p(X; θ, τ) = Π_j ∫ Π_i N(x_ij; μ_j, 1) N(μ_j; θ, τ²) dμ_j
Each p(X_j; θ, τ) reduces to a product of Gaussians; the outer product is therefore a product of products of Gaussians. Hence, solving argmax_{θ,τ} p(X; θ, τ) reduces to maximum likelihood estimation, which is equivalent to moment estimation in a Gaussian setting:
E[X_ij] = E_{μ_j}{ E[X_ij | μ_j] } = E[μ_j] = θ, set equal to X̄;
V[X_ij] = E_{μ_j}{ V[X_ij | μ_j] } + V_{μ_j}{ E[X_ij | μ_j] } = 1 + τ², set equal to (pn − 1)^{-1} Σ_{i,j} (X_ij − X̄)².

Back to the James-Stein estimator Moreover, we derived λ = (nτ² + 1)^{-1}. We obtain an estimate of λ by substituting the empirical Bayes estimate
τ̂² = (pn − 1)^{-1} Σ_{i,j} (X_ij − X̄)² − 1.
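
Putting the pieces together, a sketch of the resulting empirical-Bayes/James-Stein procedure on simulated data (numpy only; all parameter values are illustrative):

```python
# Sketch: empirical-Bayes estimate of lambda and the resulting James-Stein shrinkage.
import numpy as np

rng = np.random.default_rng(11)
p, n, theta, tau = 1000, 10, 0.0, 0.3
mu = rng.normal(theta, tau, size=p)           # gene-specific true means
X = rng.normal(mu, 1.0, size=(n, p))          # X_ij ~ N(mu_j, 1)

# Moment (empirical Bayes) estimates of the prior parameters theta and tau^2:
theta_hat = X.mean()                                          # overall mean
tau2_hat = max(((X - theta_hat) ** 2).sum() / (p * n - 1) - 1.0, 0.0)
lam_hat = 1.0 / (n * tau2_hat + 1.0)                          # estimated shrinkage parameter

# Shrink the gene-wise means towards the overall mean:
gene_means = X.mean(axis=0)
js = gene_means - lam_hat * (gene_means - theta_hat)

print("estimated lambda:", round(lam_hat, 3))
print("MSE unshrunken: ", round(float(np.mean((gene_means - mu) ** 2)), 4))
print("MSE James-Stein:", round(float(np.mean((js - mu) ** 2)), 4))
```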

Beneficial effect of shrinkage [Figure: 5 repeated studies; estimates of the parameter of interest ± sd. Solid: no shrinkage; dashed: shrinkage. (a): n = 5, (b): n = 40.]

Beneficial effects of shrinkage (more to come)
Better testing in a multiple testing setting. Shrinkage causes bias, but under selection pressure (e.g. picking the 5 genes with the largest parameter) this bias is generally smaller than for the unshrunken estimate. In a regression setting, shrinking a nuisance parameter can render higher power for the parameter of interest. To be continued.