MULTI-VARIATE/MODALITY IMAGE ANALYSIS

1 MULTI-VARIATE/MODALITY IMAGE ANALYSIS Duygu Tosun-Turgut, Ph.D. Center for Imaging of Neurodegenerative Diseases Department of Radiology and Biomedical Imaging

2 Curse of dimensionality A major problem in imaging statistics is the curse of dimensionality. If the data x lies in high dimensional space, then an enormous amount of data is required to learn distributions or decision rules.

3 Curse of dimensionality One way to deal with dimensionality is to assume that we know the form of the probability distribution. For example, a Gaussian model in N dimensions has N mean parameters plus N(N+1)/2 covariance parameters to estimate. Requires O(N^2) data to learn reliably. This may not be practical.

4 Typical solution: divide into sub-problems Each sub-problem relates single voxel to clinical variable Known as voxel-wise, univariate, or pointwise regression Approach popularized by SPM (Friston, 1995) Dependencies between spatial variables neglected!!

5 Dimension reduction One way to avoid the curse of dimensionality is by projecting the data onto a lower-dimensional space. Techniques for dimension reduction: Principal Component Analysis (PCA), Fisher's Linear Discriminant, Multi-dimensional Scaling, Independent Component Analysis (ICA).

6 A dual goal Find a good representation (the features part). Reduce redundancy in the data (a side effect of proper features).

7 A good feature Simplifies the explanation of the input. Represents repeating patterns. When found, it makes the input simpler to describe. How do we define these abstract qualities? On to the math.

8 Linear features Z = WX, where Z (features × samples) is the feature matrix, W (features × dimensions) is the weight matrix, and X (dimensions × samples) is the input matrix.

9 A 2D example Matrix representation of the data: Z = WX, i.e. [z_1^T; z_2^T] = [w_1^T; w_2^T] [x_1^T; x_2^T]. (Figure: scatter plot of the same data in the (x_1, x_2) plane.)

10 Defining a goal Desirable features: give simple weights, avoid feature similarity. How do we define these?

11 One way to proceed Simple weights: minimize the relation between the two dimensions, i.e. decorrelate. Feature similarity: same thing, decorrelate. Decorrelate: z_1^T z_2 = 0 and w_1^T w_2 = 0.

12 Diagonalizing the covariance Covariance matrix: Cov(z_1, z_2) = [z_1^T z_1, z_1^T z_2; z_2^T z_1, z_2^T z_2] / N. Diagonalizing the covariance suppresses cross-dimensional co-activity (if z_1 is high, z_2 won't be): Cov(z_1, z_2) = [z_1^T z_1, 0; 0, z_2^T z_2] / N = I.

13 Problem definition For a given input X, find a feature matrix W so that the weights decorrelate: (WX)(WX)^T = N I, i.e. Z Z^T = N I, i.e. W Cov(X) W^T = I. Solving for diagonalization!

14 Solving for diagonalization Covariance matrices are symmetric and positive semi-definite. They therefore have orthogonal eigenvectors and real (non-negative) eigenvalues, and are factorizable as U^T A U = Λ, where U has the eigenvectors of A in its columns and Λ = diag(λ_i), with λ_i the eigenvalues of A.
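
A minimal numpy sketch of this whitening step, assuming zero-mean data arranged as dimensions × samples; the function name and the toy 2-D data are illustrative only, not part of the original slides:

```python
import numpy as np

def whiten(X):
    """Decorrelate zero-mean data X (dimensions x samples) so that
    the transformed weights Z = W X satisfy Z Z^T / N = I."""
    N = X.shape[1]
    C = X @ X.T / N                              # covariance of zero-mean X
    eigvals, U = np.linalg.eigh(C)               # C = U diag(eigvals) U^T
    W = np.diag(1.0 / np.sqrt(eigvals)) @ U.T    # scale by 1/sqrt(lambda), rotate by U^T
    return W @ X, W

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=500).T
Z, W = whiten(X)
print(np.round(Z @ Z.T / Z.shape[1], 2))         # approximately the identity matrix
```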

15 Decorrelation An implicit Gaussian assumption N-D data has N directions of variance

16 Undoing the variance The decorrelating matrix W contains two vectors that normalize the input s variance

17 Resulting transform Input gets scaled to a well-behaved Gaussian with unit variance in all dimensions. (Figure: input data vs. transformed data/weights.)

18 A more complex case Having correlation between two dimensions. We still find the directions of maximal variance, but we also rotate in addition to scaling. (Figure: input data vs. transformed data/weights.)

19 One more detail So far we considered zero-mean inputs; the transforming operation was a rotation. If the input mean is not zero, bad things happen! Make sure that your data is zero-mean! (Figure: input data vs. transformed data/weights.)

20 Principal component analysis This transform is known as PCA. The features are the principal components. They are orthogonal to each other and produce orthogonal (white) weights. Major tool in statistics: removes dependencies from multivariate data. Also known as the KLT (Karhunen-Loève transform).

21 A better way to compute PCA The Singular Value Decomposition way: [U, S, V] = SVD(A), with A = U S V^T. Relationship to eigendecomposition: in our case (covariance input A), U and S hold the eigenvectors and eigenvalues of A. Why the SVD? More stable, more robust, fancy extensions.
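
A short numpy sketch of PCA via the SVD, under the convention that X holds samples in rows; the function name and the choice of k are illustrative:

```python
import numpy as np

def pca_svd(X, k):
    """PCA via the SVD. X is samples x dimensions; returns the top-k
    principal directions, the projected scores, and their variances."""
    Xc = X - X.mean(axis=0)                         # PCA assumes zero-mean data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                             # rows are principal directions
    scores = Xc @ components.T                      # data expressed in the new basis
    explained_variance = S[:k] ** 2 / (X.shape[0] - 1)
    return components, scores, explained_variance
```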

22 Dimensionality reduction PCA is great for high dimensional data Allows us to perform dimensionality reduction Helps us find relevant structure in data Helps us throw away things that won t matter

23 What is the number of dimensions? If the input was M-dimensional, how many dimensions do we keep? No solid answer (estimators exist). Look at the singular/eigenvalues: they show the variance of each component, and at some point it becomes small.

24 Example Eigenvalues of 1200-dimensional video data. Little variance after component 30, so we don't need to keep the rest of the data. Dimension reduction occurs by ignoring the directions in which the covariance is small.
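
One common heuristic, sketched below, is a cumulative-variance cutoff on the singular values; the 95% threshold is an arbitrary illustrative choice, not a rule from the slides:

```python
import numpy as np

def choose_k(singular_values, threshold=0.95):
    """Smallest k whose leading components explain `threshold`
    of the total variance (variance = squared singular values)."""
    var = np.asarray(singular_values) ** 2
    cumulative = np.cumsum(var) / var.sum()
    return int(np.searchsorted(cumulative, threshold) + 1)
```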

25 Recap PCA decorrelates multivariate data, finds useful components, and reduces dimensionality. PCA is only powerful if the biological question is related to the highest variance in the dataset. If not, other techniques are more useful: Independent Component Analysis.

26 Limitations of PCA PCA may not find the best directions for discriminating between two classes. The 1st eigenvector is best for representing the probabilities; the 2nd eigenvector is best for discrimination.

27 Fisher's linear discriminant analysis (LDA) Performs dimensionality reduction while preserving as much of the class-discriminatory information as possible. Seeks to find directions along which the classes are best separated. Takes into consideration not only the within-class scatter but also the between-class scatter. (Figure: projection directions z_1 and z_2.)

28 LDA via PCA PCA is first applied to the data set to reduce its dimensionality: [x_1, x_2, ..., x_N] --PCA--> [y_1, y_2, ..., y_K]. LDA is then applied to find the most discriminative directions: [y_1, y_2, ..., y_K] --LDA--> [z_1, z_2, ..., z_(C-1)].
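
A scikit-learn sketch of this PCA-then-LDA pipeline; the iris data and the component counts are placeholders chosen only to make the example run:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# PCA reduces x_1..x_N to K components; LDA then finds at most C-1
# discriminative directions (2 here, since iris has 3 classes).
pca_lda = make_pipeline(PCA(n_components=3),
                        LinearDiscriminantAnalysis(n_components=2))
Z = pca_lda.fit_transform(X, y)
print(Z.shape)   # (150, 2)
```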

29 Where PCA fails PCA focuses on uncorrelated, Gaussian components (second-order statistics, orthogonal transformation). ICA instead focuses on independent, non-Gaussian components (higher-order statistics, non-orthogonal transformation).

30 Statistical independence If we know something about x, that should tell us nothing about y. (Figure: dependent vs. independent scatter plots.)

31 Concept of ICA A given signal x is generated by linear mixing A of independent components s: x = As. ICA is a statistical analysis method to estimate those independent components z and the unmixing rule W: z = Wx = WAs. Blind source separation. (Diagram: sources s_1..s_M → mixing A → observations x_1..x_M → unmixing W → estimated components z_1..z_M; s and A are unknown, x is observed.)

32 Example: PCA and ICA Model: [x_1; x_2] = [a_11, a_12; a_21, a_22] [s_1; s_2].

33 How does it work?

34 Rationale of ICA Find the components s_i that are as independent as possible, in the sense of maximizing some function F(s_1, s_2, ..., s_k) that measures independence. All ICs (except at most one) must be non-Gaussian. The variance of all ICs is fixed to 1. There is no hierarchy between ICs.

35 How to find ICs? Many choices of objective function F. Mutual information: MI = ∫ f(s_1, s_2, ..., s_k) log [ f(s_1, s_2, ..., s_k) / ( f_1(s_1) f_2(s_2) ... f_k(s_k) ) ] ds. Use the kurtosis of the variables to approximate the distribution function. The number of ICs is chosen by the user.

36 Difference with PCA ICA is not a dimensionality reduction technique. There is no single (exact) solution for the components (implementations in R: FastICA, PearsonICA, MLICA). ICs are of course uncorrelated, but also as independent as possible. Uninteresting for Gaussian-distributed variables.
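
The R packages above have Python counterparts; here is a minimal scikit-learn FastICA sketch on toy mixtures, where the sources and mixing matrix are invented purely for illustration:

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                        # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))               # source 2: square wave (non-Gaussian)
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5],                 # "unknown" mixing matrix
              [0.5, 2.0]])
X = S @ A.T                               # observed mixtures x = A s

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)              # estimated independent components
A_hat = ica.mixing_                       # estimated mixing matrix
```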

37 Feature Extraction in ECG data (Raw Data)

38 Feature Extraction in ECG data (PCA)

39 Feature Extraction in ECG data (Extended ICA)

40 Feature Extraction in ECG data (flexible ICA)

41 Application domains of ICA Image denoising. Medical signal processing (fMRI, ECG, EEG). Feature extraction, face recognition. Compression, redundancy reduction. Watermarking. Clustering. Time series analysis (stock market, microarray data). Topic extraction. Econometrics: finding hidden factors in financial data.

42 Gene clusters [Hsiao et al. 2002] Each gene is mapped to a point based on the value assigned to the gene in the 14th (x-axis), 15th (y-axis) and 55th (z-axis) independent components: Red - enriched with liver-specific genes; Orange - enriched with muscle-specific genes; Green - enriched with vulva-specific genes; Yellow - genes not annotated as liver-, muscle-, or vulva-specific.

43 Segmentation by ICA Image I → ICA filter bank with n filters → filter responses I_1, I_2, ..., I_n → clustering → segmented image C. Above is an unsupervised setting. Segmentation (i.e., classification in this context) can also be performed by a supervised method on the output feature images I_1, I_2, ..., I_n. (Figure: a texture image and its segmentation.) Jenssen and Eltoft, ICA filter bank for segmentation of textured images, 4th International Symposium on ICA and BSS, Nara, Japan, 2003.

44 Recap: PCA vs LDA vs ICA PCA: proper for dimension reduction. LDA: proper for pattern classification if the number of training samples in each class is large. ICA: proper for blind source separation, or for classification using ICs when the class ID of the training data is not available.

45 Multimodal Data Collection Brain function (spatiotemporal, task-related): EEG and fMRI across tasks 1-N. Brain structure (spatial): T1/T2, DTI. Genetic data (spatial/chromosomal): SNP, gene expression. Covariates (age, etc.) and other measures.

46 Joint ICA A variation on ICA that looks for components that appear jointly across features or modalities: [X_Modality1, X_Modality2] = A [S_Modality1, S_Modality2].
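
A structural sketch of joint ICA under the stated model, assuming subject-by-feature matrices for two modalities; the sizes and simulated data are placeholders, so only the bookkeeping (feature-wise concatenation, shared mixing matrix) is meaningful:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n_subjects, n_feat1, n_feat2, k = 40, 5000, 600, 8   # hypothetical sizes

# Simulate the jICA model: non-Gaussian joint sources with a single
# (shared) subject-by-component mixing matrix A.
A_true = rng.standard_normal((n_subjects, k))
S_true = rng.laplace(size=(k, n_feat1 + n_feat2))
X1 = A_true @ S_true[:, :n_feat1]                    # placeholder modality 1
X2 = A_true @ S_true[:, n_feat1:]                    # placeholder modality 2

# Joint ICA: concatenate along the feature axis and unmix once,
# so both modalities share the same mixing matrix.
X_joint = np.hstack([X1, X2])
ica = FastICA(n_components=k, random_state=0)
S_joint = ica.fit_transform(X_joint.T).T             # components x (feat1 + feat2)
S1_hat, S2_hat = S_joint[:, :n_feat1], S_joint[:, n_feat1:]
A_hat = ica.mixing_                                  # shared subject loadings (subjects x k)
```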

47 jica Results: Multitask fmri One component Significant at p<0.01 Calhoun VD, Adali T, Kiehl KA, Astur RS, Pekar JJ, Pearlson GD. (2006): A Method for Multitask fmri Data Fusion Applied to Schizophrenia. Hum.Brain Map. 27(7):

48 jica Results: Joint Histograms Calhoun VD, Adali T, Kiehl KA, Astur RS, Pekar JJ, Pearlson GD. (2006): A Method for Multitask fmri Data Fusion Applied to Schizophrenia. Hum.Brain Map. 27(7):

49 Parallel ICA: Two Goals Data 1 (fMRI) X1 and Data 2 (SNP) X2. Goal 1, identify hidden features: unmix X1 with W1 (loadings A1, sources S1) and X2 with W2 (loadings A2, sources S2). Goal 2, identify linked features: maximize {H(Y1) + H(Y2)} (Infomax), subject to arg max over {W1, W2, ŝ1, ŝ2} of g(A1, A2) = Correlation(A1, A2) = Cov(a_1i, a_2j)^2 / (Var(a_1i) Var(a_2j)). J. Liu, O. Demirci, and V. D. Calhoun, "A Parallel Independent Component Analysis Approach to Investigate Genomic Influence on Brain Function," IEEE Signal Proc. Letters, vol. 15.

50 Example 2: Default Mode Network & Multiple Genetic Factors fMRI component t-map and mixing coefficients (ρ = 0.31) linking the fMRI component with the SNPs component (rs6136 and other SNP identifiers). J. Liu, G. D. Pearlson, A. Windemuth, G. Ruano, N. I. Perrone-Bizzozero, and V. D. Calhoun, "Combining fMRI and SNP data to investigate connections between brain function and genetics using parallel ICA," Hum. Brain Mapp., vol. 30.

51 Partial least squares (PLS) Understanding relationships between process & response variables

52 PLS fundamentals (Figure: projections onto planes in X space (X_1, X_2, X_3) and Y space (Y_1, Y_2, Y_3).)

53 PLS is the multivariate version of regression PLS uses two different PCA models, one for the X's and one for the Y's, and finds the links between the two. Mathematically, the difference is as follows: in PCA, we are maximizing the variance that is explained by the model; in PLS, we are maximizing the covariance.

54 Conceptually, how does PLS work? A step-wise process: PLS finds a set of orthogonal components that maximize the level of explanation of both X and Y and provide a predictive equation for Y in terms of the X's. This is done by fitting a set of components to X (as in PCA), similarly fitting a set of components to Y, and reconciling the two sets of components so as to maximize the explanation of X and Y. PLS finds components of X that are also relevant to Y. Latent vectors are components that simultaneously decompose X and Y; latent vectors explain the covariance between X and Y.
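
A minimal scikit-learn sketch of PLS regression; the simulated X and Y are placeholders, and the number of components is an arbitrary choice:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))                       # process variables
B = rng.standard_normal((3, 2))
Y = X[:, :3] @ B + 0.1 * rng.standard_normal((100, 2))   # responses driven by 3 X-variables

pls = PLSRegression(n_components=3)
pls.fit(X, Y)
T = pls.x_scores_          # latent vectors: components of X that are relevant to Y
Y_pred = pls.predict(X)    # the predictive equation for Y in terms of X
```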

55 PLS: the inner relation The way PLS works, visually, is by tweaking the two PCA models (X and Y) until their covariance is optimised. It is this tweaking that led to the name partial least-squares. All three are solved simultaneously via numerical methods.

56 Canonical correlation analysis (CCA) CCA was first developed by H. Hotelling (Relations between two sets of variates, Biometrika, 28). CCA measures the linear relationship between two multidimensional variables. CCA finds two bases, one for each variable, that are optimal with respect to correlations. Applications in economics, medical studies, bioinformatics, and other areas.

57 Canonical correlation analysis (CCA) Two multidimensional variables Two different measurement on the same set of objects Web images and associated text Protein (or gene) sequences and related literature (text) Protein sequence and corresponding gene expression In classification: feature vector and class label Two measurements on the same object are likely to be correlated. May not be obvious on the original measurements. Find the maximum correlation on transformed space.

58 Canonical correlation analysis (CCA) (Diagram: measurements X and Y are transformed by W_X and W_Y into W_X^T X and W_Y^T Y; the correlation is computed between the transformed data.)

59 Problem definition Find two sets of basis vectors, one for x and the other for y, such that the correlations between the projections of the variables onto these basis vectors are maximized. Given x and y, compute two basis vectors w_x and w_y and project: x → <w_x, x> and y → <w_y, y>.

60 Problem definition Compute the two basis vectors so that the correlations of the projections onto these vectors are maximized.
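
Equivalently, CCA maximizes ρ = corr(w_x^T x, w_y^T y). A scikit-learn sketch on simulated data with a shared latent structure; the data and dimensions are illustrative only:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.standard_normal((200, 2))                                  # shared structure
X = latent @ rng.standard_normal((2, 10)) + 0.5 * rng.standard_normal((200, 10))
Y = latent @ rng.standard_normal((2, 6)) + 0.5 * rng.standard_normal((200, 6))

cca = CCA(n_components=2)
Xc, Yc = cca.fit_transform(X, Y)       # projections onto the basis vectors w_x, w_y
canonical_corr = [np.corrcoef(Xc[:, i], Yc[:, i])[0, 1] for i in range(2)]
print(np.round(canonical_corr, 2))
```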

61 Least squares regression Estimate the parameters β based on a set of training data (x_1, y_1), ..., (x_N, y_N). Minimize the residual sum of squares RSS(β) = Σ_{i=1}^{N} ( y_i - β_0 - Σ_{j=1}^{p} x_ij β_j )^2. The closed-form solution β̂ = (X^T X)^{-1} X^T y has poor numeric properties!
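
A numpy sketch contrasting the normal-equations solution with a numerically preferred least-squares solver; the intercept is omitted and the data are simulated for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + 0.1 * rng.standard_normal(50)

# Normal equations beta = (X^T X)^{-1} X^T y: simple, but numerically fragile
# when X^T X is ill-conditioned.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based solver: the numerically preferred route.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```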

62 Subset selection We want to eliminate unnecessary features. Use additional penalties to reduce coefficients: coefficient shrinkage. Ridge regression: minimize least squares subject to Σ_{j=1}^{p} β_j^2 ≤ s. The lasso: minimize least squares subject to Σ_{j=1}^{p} |β_j| ≤ s. Principal components regression: regress on M < p principal components of X. Partial least squares: regress on M < p directions of X weighted by y.
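
A scikit-learn sketch of the two shrinkage penalties in their equivalent Lagrangian (penalty-weight) form; the data and alpha values are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: shrinks and sets some exactly to zero

# Count coefficients driven (essentially) to zero by each penalty
print((abs(ridge.coef_) < 1e-8).sum(), (abs(lasso.coef_) < 1e-8).sum())
```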

63 Prostate Cancer Data Example

64 Error comparison

65 Shrinkage methods: Ridge regression

66 Shrinkage methods: Lasso regression

67 A unifying view We can view all the linear regression techniques under a common framework: β̂ = arg min_β Σ_{i=1}^{N} ( y_i - β_0 - Σ_{j=1}^{p} x_ij β_j )^2 + λ Σ_{j=1}^{p} |β_j|^q. λ includes bias, q indicates a prior distribution on β. λ = 0: least squares. λ > 0, q = 0: subset selection (counts the number of nonzero parameters). λ > 0, q = 1: the lasso. λ > 0, q = 2: ridge regression.

68 Family of shrinkage regression

69 Atrophy associated with delayed memory Spatial correlation Ridge regression
