MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A


1 MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL'INFORMAZIONE, POLITECNICO DI BARI Pietro Guccione, Assistant Professor in Signal Processing (pietro.guccione@poliba.it)

2 Lecture 8 - Summary: Linear Dimensionality Reduction. Dimensionality Reduction; Principal Component Analysis, examples; Canonical Correlation Analysis, examples; Multiset Canonical Correlation Analysis, example; Summary.

3 Data Collection: high dimensionality More variables than observations (Hughes phenomenon): when the number of variables is too high compared to the number of samples, the algorithm may be unable to find a proper structure within the data that can be generalized to other datasets from the same experiment. This is known as the curse of dimensionality, or Hughes phenomenon, and it commonly occurs in many fields. Example: chemometrics: to determine concentrations of certain chemical compounds, calibration studies often need to analyze intensity measurements on a very large number (500-1,000 or more) of different spectral wavelengths using a small number of samples (a few dozen). [Figure: overfitting in classification]

4 Data Collection: high dimensionality The problem of high dimensionality also involves the estimation of parameters in hidden models (e.g. the number of coefficients in a regression problem) or of latent variables (the number of mixtures in a density estimation problem). [Figure: overfitting in regression] The problem of dimensionality depends on both the data and the algorithm. Possible solutions are changing the algorithm or reducing the dimensionality of the problem.

5 Dimensionality Reduction Two approaches are available to perform dimensionality reduction: feature selection, i.e. choosing a subset of all the features (the most informative ones; topic of a later lecture), and feature extraction, i.e. creating a set of new features by combining the existing ones (see the sketch below). In general, the optimal mapping y = f(x) is a non-linear function; however, feature extraction is commonly limited to linear transformations, y = Wx.
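A minimal MATLAB sketch (MATLAB is also used later in these slides) contrasting the two approaches on synthetic data; the selected columns and the matrix W are arbitrary here and only stand in for a transformation learned, for instance, by PCA:
% Feature selection vs. linear feature extraction (synthetic data).
X = randn(100, 10);                    % 100 samples, 10 original features (samples in rows)
% Feature selection: keep a subset of the existing features (columns chosen arbitrarily here).
idx  = [1 4 7];
Xsel = X(:, idx);                      % 100 x 3
% Feature extraction: build new features as linear combinations, y = W*x.
W    = randn(3, 10);                   % 3 x 10 transformation (random here; learned in PCA/CCA)
Xext = X * W';                         % 100 x 3: row n is (W * x_n)'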

6 From Dimensionality Reduction to PCA Principal Component Analysis is a standard technique for visualizing high-dimensional data and for data pre-processing. PCA reduces the dimensionality (the number of variables) of a data set while maintaining as much variance as possible. PCA finds the directions of maximum variation of the data and decorrelates the original variables by means of an orthogonal transformation. The resulting uncorrelated variables are called principal components. One can then either retain all the dimensions or reduce them.

7 PCA: mathematical details Principal Component Analysis is an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance lies on the first coordinate, the second greatest on the second coordinate, and so on. Organize the data in a matrix X [N x P], with N samples (repetitions of the experiment) and P variates (the features of the experiment). The full principal components decomposition of X can be written as T = XW ([N x P] = [N x P][P x P]) and X = TW^T, with X^T X = W Λ W^T and W^T W = W W^T = I. The principal components T (called scores) are obtained as a linear combination of the data and a set of weights (the loadings). The (column) weights W (the loadings) are the eigenvectors of the sample covariance matrix of the data; Λ is the diagonal matrix of the eigenvalues (sorted in decreasing order). Just to recall, the sample mean and covariance are μ_p = (1/N) sum_{n=1}^{N} x_{np} and q_{jk} = 1/(N-1) sum_{n=1}^{N} (x_{nj} - μ_j)(x_{nk} - μ_k).
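A short MATLAB sketch of the decomposition above on synthetic data (the data matrix and its size are hypothetical); it computes the loadings W as eigenvectors of the sample covariance and the scores as T = XW:
% PCA via eigendecomposition of the sample covariance matrix (synthetic data).
N = 200; P = 5;
X  = randn(N, P) * randn(P);                 % N samples, P correlated variates
Xc = bsxfun(@minus, X, mean(X, 1));          % remove the column means

S = (Xc' * Xc) / (N - 1);                    % sample covariance, P x P
[W, L] = eig(S);                             % eigenvectors (loadings) and eigenvalues
[lam, order] = sort(diag(L), 'descend');     % sort the eigenvalues in decreasing order
W = W(:, order);

T = Xc * W;                                  % scores: T = X*W
recErr = norm(Xc - T * W', 'fro');           % ~0, since W is orthogonal (W*W' = I)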

8 PCA: meaning In PCA, data are decomposed by projecting them onto a new space of the same dimension (T = XW). Samples are described in a multi-dimensional space. The loadings are the weights by which each standardized original variable should be multiplied to get the component score; the scores are the transformed variable values corresponding to each sample. The decomposition is done so as to maximize the variance (the energy) of the data in the first (few) dimensions.

9 PCA: dimensionality reduction Not all the principal components are equally important: their relative importance is given by the explained variance. A typical plot shows the cumulative variance explained [%] versus the number of components; if we want 99% of the variance explained, n_c = 6 components are enough in the example. In formulas, using MATLAB notation: Y = X - M = UW', so X = UW' + M, and the reduced reconstruction is X_hat = U(:,1:n_c)*W(:,1:n_c)' + M, where M is the matrix of column means.
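Continuing the sketch above (reusing X, Xc, W and the sorted eigenvalues lam), a possible way to pick n_c from the cumulative explained variance and to rebuild the reduced-rank approximation; the 99% target mirrors the example in the slide:
% Choose n_c from the cumulative explained variance and reconstruct X (continues the previous sketch).
cumVar = 100 * cumsum(lam) / sum(lam);            % cumulative explained variance [%]
nc     = find(cumVar >= 99, 1);                   % smallest number of components reaching 99%

M    = repmat(mean(X, 1), size(X, 1), 1);         % matrix of column means
U    = Xc * W;                                    % scores (same as T above)
Xhat = U(:, 1:nc) * W(:, 1:nc)' + M;              % X ~ U(:,1:nc)*W(:,1:nc)' + M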

10 PCA dimensionality reduction: example Example: hyperspectral image of the Earth acquired using a sensor with 103 bands in the visible and near-infrared range. A false-color representation of the city of Pavia, Italy, is obtained by simple superposition of all the bands.

11 PCA dimensionality reduction: example PCA on the dataset: [Figure: cumulative explained variance [%] versus number of components]

12 PCA dimensionality reduction: example [Figure: residual of the original data and its recomposition using the 1st component; the 1st and 2nd components; and the 1st, 2nd and 3rd components]

13 PCA: dimensionality reduction Another tool: the biplot. The biplot is a way to jointly represent information about the samples and the variables after dimensionality reduction. The samples are represented as dots in a plane (supposing just 2 components survive after dimensionality reduction); the variables are displayed as vectors (using the values of the loadings). Questions: Is there a particular combination of experimental conditions able to better describe a group of samples? Which samples are described by which variates/combination? [Figure: PCA of the samples AA-1, AAm-1, HEMA-1, HEMA-2, MAA-1, described by the experimental conditions monomer 1/2 ratio, crosslinker, crosslinker concentration, support, CaCl2 concentration, additive, additive concentration, with the explained variance [%] per component]

14 PCA: dimensionality reduction [Figure: biplot of the first two components, with the samples (AA-1, AAm-1, HEMA-1, HEMA-2, MAA-1) as points and the experimental conditions (additive, additive concentration, support, CaCl2 concentration, crosslinker, crosslinker concentration, monomer 1/2 ratio) as loading vectors] The vectors are the loadings, i.e. the variates, projected on the two components. The higher the score of a vector (loading) on a principal component, the higher the contribution of that variate to that component and the higher its influence on that group of samples. Similar scores of the samples mean similar behavior (in this case: similar experimental conditions).
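A sketch of how such a biplot can be produced in MATLAB, assuming the Statistics Toolbox functions zscore, pca and biplot are available; the data matrix is a random placeholder and the variable labels are taken from the figure above:
% Biplot of the first two principal components (Statistics Toolbox assumed).
X = randn(30, 7);                                          % placeholder data: 30 samples, 7 variates
varNames = {'Monomer ratio', 'Crosslinker', 'Crosslinker conc.', ...
            'Support', 'CaCl2 conc.', 'Additive', 'Additive conc.'};
[coeff, score] = pca(zscore(X));                           % loadings (coeff) and scores
figure;
biplot(coeff(:, 1:2), 'Scores', score(:, 1:2), 'VarLabels', varNames);
xlabel('First Component'); ylabel('Second Component');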

15 Example: X-ray Powder Diffraction /1 A set of X-ray Powder Diffraction (XPD) patterns is collected by changing an external, properly driven stimulus along time. The spectra are functions of the diffraction angle 2θ and of time. As the stimulus changes, the active and the silent parts of the structure behave differently. The corresponding XPD spectrum can be written as the superposition of three contributions: one coming from active atoms, one coming from active and silent atoms, and one coming from silent atoms only. The three terms (each a function of angle and time) have constraints among them, so a simple decomposition in principal components might fail. [Figure: spectra A(2θ, t) arranged in the data matrix X(n, p), n = 1, ..., N (time profile), p = 1, ..., P (~1200 angular samples)]

16 Example: X-ray Powder Diffraction /2 Decomposition of A(2θ, t) by using PCA: A(2θ, t) = R_1(2θ) f_1(t) + R_2(2θ) f_2(t) + R_3(2θ) f_3(t), where the R_i(2θ) are the three spectral contributions and the f_i(t) their time profiles. [Figure: data matrix X(n, p), n = 1, ..., N (time profile), p = 1, ..., P (~1200 angular samples)]

17 Example: X-ray Powder Diffraction /3 Normalize the data? Usually the column mean is removed, while dividing by the standard deviation depends on the energy level of each variate: μ_p = (1/N) sum_{n=1}^{N} x_{np}, σ_p^2 = 1/(N-1) sum_{n=1}^{N} (x_{np} - μ_p)^2. Z-scoring: remove the mean and divide by the standard deviation, Y = (X - 1μ') ./ (1σ'). Then apply the PCA: Y = UW'.
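A MATLAB sketch of this normalization step followed by PCA, on a generic data matrix X [N x P] (synthetic here, standing in for the XPD matrix); it mirrors the formulas above:
% Z-scoring followed by PCA (synthetic data).
N = 100; P = 50;
X     = randn(N, P);
mu    = mean(X, 1);                               % column means
sigma = std(X, 0, 1);                             % column standard deviations (1/(N-1) normalization)
Y     = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);   % Y = (X - 1*mu') ./ (1*sigma')

[W, L] = eig(cov(Y));                             % PCA on the standardized data
[~, order] = sort(diag(L), 'descend');
W = W(:, order);                                  % loadings
U = Y * W;                                        % scores, so that Y = U*W'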

18 Example: X-ray Powder Diffraction /4 In chemometrics, PCA relates the multivariate response (the spectra = the loadings, or components) to the concentration of the analyte of interest (the scores in PCA). A way to interpret the matrix multiplication scores times loadings: X is approximated as a sum of rank-one terms, each given by one score profile (concentration) times one loading (spectrum); see the sketch below. Note that signs are not accounted for in the PCA decomposition!
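A small sketch of that interpretation, reusing the scores U and loadings W from the previous sketch (n_c is an arbitrary number of retained components): the product of the first n_c scores and loadings is the sum of rank-one terms, each one a score profile times a loading:
% Scores-times-loadings as a sum of rank-one contributions (continues the previous sketch).
nc   = 3;
Yhat = zeros(size(U, 1), size(W, 1));
for k = 1:nc
    Yhat = Yhat + U(:, k) * W(:, k)';             % k-th score profile times k-th loading (spectrum)
end
% Yhat is identical to U(:, 1:nc) * W(:, 1:nc)'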

19 Example: X-ray Powder Diffraction /5 [Figure: scores and corresponding loadings (spectra, intensity vs. 2θ [deg]) of the first three components. 1st component (score) / corresponding loading: active and silent contribution. 2nd component (score) / corresponding loading: active contribution only. 3rd component (score) / corresponding loading: silent contribution only.]

20 PCA: how many components to retain? /1 This is the main problem in PCA when an approximation of the original dimensionality is necessary: how many principal components to retain. Components are sorted from the highest variance to the lowest (PCA is a variance-maximization technique), so the first PCs, which cumulate enough variance, should be sufficient. The question is then shifted to the magnitude of the eigenvalues of the sample covariance matrix of X, Σ_X: how small can an eigenvalue be, while the corresponding principal component is still considered significant? First method, the scree plot: the sample eigenvalues from a PCA are ordered from largest to smallest. If the largest few sample eigenvalues dominate in magnitude, with the remaining sample eigenvalues very small, then the scree plot will exhibit an elbow corresponding to the division into large and small sample eigenvalues. The order number at which the elbow occurs can be used to determine how many principal components to retain.

21 PCA: how many components to retain? /2 Scree plot for Gaussian multivariate data. The scenarios may be very different, according to the relationship between the data size and the number of variates (or the rank of the covariance matrix). Example: Z is an [r x n] matrix with elements drawn from N(0,1); D is an [r x r] diagonal matrix, with D^2 = diag{20, 17, 12, 8, 3, 2, 2, ..., 2}; X = DZ is the multivariate dataset; Σ_X = n^{-1} XX^T is the sample covariance matrix. 1st case: n = 300, r = 30. 2nd case: n = 30, r = 30.
>> n=300; r=30;
>> Z = randn(r,n);                  % written as the transpose of the usual orientation (variates in rows)
>> diagvec = ones(r,1)*2;
>> diagvec(1:5) = [20 17 12 8 3];
>> D = sqrt(diag(diagvec));
>> X = D*Z;
>> lambda = eig(X*X'/n);
>> figure, plot(lambda(end:-1:1)); grid on;

22 PCA: how many components to retain? /3 1st case: n = 300, r = 30 (the scree plot shows an elbow); 2nd case: n = 30, r = 30 (no elbow).
>> n=30; r=30;
>> Z = randn(r,n);
>> diagvec = ones(r,1)*2;
>> diagvec(1:5) = [20 17 12 8 3];
>> D = sqrt(diag(diagvec));
>> X = D*Z;
>> lambda = eig(X*X'/n);
>> figure, plot(lambda(end:-1:1)); grid on;

23 PCA: how many components to retain? /4 Second method: the rank trace plot. It consists of the plot of the residual of the eigenvalues versus a properly transformed sequential number t, C_t = (sum_{i=t+1}^{r} λ_i) / (sum_{i=1}^{r} λ_i). [Figure: rank trace plots for the two cases above: an elbow is visible in the 1st case, no elbow in the 2nd]

24 The Canonical Correlation Analysis /1 CCA is a multivariate statistical technique used to analyze the relationship between two sets of variables. CCA seeks two sets of transformed variates that attain the maximum correlation across the two datasets, X = [X_1, ..., X_p]^T and Y = [Y_1, ..., Y_q]^T. CCA aims at finding the basis vectors (or coefficients) {b_x, b_y}_i such that: the correlations between the projections of the variables onto these bases are mutually maximized; each pair of basis vectors is uncorrelated with the preceding ones. The obtained projections are the canonical variables, i.e. the linear combinations of variables making up the j-th basis for X and Y: b_x^T X ([d x p][p x N]) and b_y^T Y ([d x q][q x N]), with d = min{rank(X), rank(Y)}.

25 The Canonical Correlation Analysis /2 Let x and y be the two vectors of random variables and {b_x, b_y} the basis vectors (giving the scores). Canonical correlation analysis can be defined as the problem of finding two sets of basis vectors, one for x and the other for y, such that the correlations between the projections of the variables onto these basis vectors are mutually maximized. Let us consider the first step: find the first pair of canonical variables, ξ_1 = b_x^T x and ω_1 = b_y^T y, and the maximum canonical correlation coefficient ρ_1. Problem: ρ_1 = max_{b_x, b_y} corr(b_x^T x, b_y^T y) = max_{b_x, b_y} (b_x^T Σ_XY b_y) / sqrt((b_x^T Σ_XX b_x)(b_y^T Σ_YY b_y)).

26 The Canonical Correlation Analysis /3 The subsequent canonical variables must be uncorrelated with those of the previous solutions: E[ξ_j ξ_k] = E[ω_j ω_k] = 0 for j ≠ k. The solution of the CCA problem can be obtained by solving the eigenvalue equations Σ_XX^{-1} Σ_XY Σ_YY^{-1} Σ_YX b_x = ρ^2 b_x and Σ_YY^{-1} Σ_YX Σ_XX^{-1} Σ_XY b_y = ρ^2 b_y, where Σ_XX and Σ_YY are the within-sets covariance matrices and Σ_XY, Σ_YX are the between-sets covariance matrices.
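A MATLAB sketch of CCA obtained directly from the eigenvalue equations above, on synthetic data sharing two latent sources; the Statistics Toolbox function canoncorr (which expects samples in rows) is a ready-made alternative:
% CCA of two synthetic datasets X [p x N] and Y [q x N] via the eigenvalue equations.
N = 500; p = 4; q = 3;
S = randn(2, N);                                   % shared latent sources
X = randn(p, 2) * S + 0.5 * randn(p, N);           % first set of variates
Y = randn(q, 2) * S + 0.5 * randn(q, N);           % second set of variates

Xc = X - mean(X, 2) * ones(1, N);                  % remove the means
Yc = Y - mean(Y, 2) * ones(1, N);
Cxx = Xc * Xc' / (N - 1);  Cyy = Yc * Yc' / (N - 1);   % within-sets covariances
Cxy = Xc * Yc' / (N - 1);  Cyx = Cxy';                 % between-sets covariances

[Bx, D] = eig((Cxx \ Cxy) * (Cyy \ Cyx));          % eigenvalues are the squared canonical correlations
[rho2, order] = sort(real(diag(D)), 'descend');
bx  = real(Bx(:, order(1)));                       % first canonical basis vector for X
by  = Cyy \ (Cyx * bx);                            % corresponding basis vector for Y (up to scale)
rho = sqrt(rho2(1));                               % first canonical correlation coefficient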

27 The Canonical Correlation Analysis /4 CCA is optimal in solving, in the least-squares sense, the following problem: with ξ = GX (G of size [d x p]) and ω = HY (H of size [d x q]), model HY = ν + GX + E and find G, H and ν that minimize the expected squared error, subject to H Σ_YY H^T = I_t. The solution is given by H^(t) = V_t^T Σ_YY^{-1/2} and G^(t) = V_t^T Σ_YY^{-1/2} Σ_YX Σ_XX^{-1}, where V_t = (v_1, ..., v_t) collects the first t eigenvectors of Σ_YY^{-1/2} Σ_YX Σ_XX^{-1} Σ_XY Σ_YY^{-1/2}, with t ≤ d = min(p, q) and the pairs (ξ_j, ω_j) being the canonical variate scores. The correlation, ρ_j, between ξ_j and ω_j is called the canonical correlation coefficient associated with the j-th pair of canonical variates, j = 1, 2, ..., t.

28 The Canonical Correlation Analysis /5 Example: we take a subset of a public catalogue of a large number of astronomical objects (from Izenman, Modern Multivariate Statistical Techniques, Springer, 2008): the COMBO-17 dataset. The brightness of each object in 17 passbands, the magnitude, the redshift and other variables have been arranged in X (23 variables) and Y (6 variables), according to some criterion.

29 The Canonical Correlation Analysis /6 Astronomers want to know whether groups of absolute magnitudes are correlated with each other.

30 The Canonical Correlation Analysis /7 Only one or two of the projected (canonical) correlations are large; the others are very small.

31 The Multiset Canonical Correlation Analysis /1 Given several vectors of random variables, M-CCA performs a sequence of deflationary linear transformations of the original sets that are solutions of a constrained optimization problem. The s-th stage set of canonical variables can be selected to maximize or minimize a particular function of its correlation matrix (for instance the sum of its entries or its largest eigenvalue), subject to certain restrictions.

32 M-CCA: operative example /1 General principle of linear BSS: blind source separation consists in estimating some underlying source signals s_1, ..., s_N from the mixtures x_1, ..., x_N, produced by an unknown mixing process A (see the sketch below). Functional Magnetic Resonance Imaging (fMRI) measures brain activity by detecting changes in blood flow and hence in neuronal activity. The goal of fMRI data analysis is to detect correlations between brain activation areas and a task paradigm.
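A minimal sketch of the linear mixing model behind BSS, with two synthetic sources and a random mixing matrix A (both of which are unknown in the real problem):
% Linear mixing model x = A*s (synthetic sources, unknown in practice).
N = 1000;                                           % number of time samples
S = [sin(0.02 * (1:N)); sign(sin(0.07 * (1:N)))];   % two source signals (rows)
A = randn(3, 2);                                    % unknown mixing matrix: 3 observations, 2 sources
X = A * S;                                          % observed mixtures; BSS estimates S (and A) from X only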

33 M-CCA: operative example /2 Data-driven methods make no assumptions on the response to recover. [Figure: the decompositions of subject i and subject j each yield components C_1, C_2, C_3, ..., C_n, leading to a source matching problem across subjects]

34 M-CCA: operative example /3 Task: visual N-Back. 0-Back condition: identify the number currently seen. 2-Back condition: recall the number seen two stimuli before. [Figure: sequence of stimuli over time with the correct responses for the 0-Back and 2-Back conditions]

35 M-CCA: operative example /4 Stimulus paradigm: task duration 4 min, block duration 30 sec, tasks 0-Back / 2-Back alternating in time (0B, 2B, 0B, 2B, ...). The cyclic nature of the reference temporal task paradigm may be exploited to further populate the dataset.

36 M-CCA: operative example /5 Processing pipeline: for each subject (Subj 1, ..., Subj M), the [T, V] data (T time points, V voxels) are reshaped and reduced by spatial PCA to [K, V]; M-CCA is then applied across subjects for spatial source extraction and source unmixing, followed by time-course extraction, yielding K spatial maps and K temporal trends for each subject.
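A rough MATLAB sketch of this pipeline on synthetic data (not the authors' exact implementation): per-subject spatial PCA to K components, followed by an M-CCA stage written here in one common MAXVAR-style generalized-eigenvalue form; pca is assumed available from the Statistics Toolbox, and the sizes M, T, V, K are placeholders:
% Per-subject spatial PCA followed by a MAXVAR-style M-CCA stage (sketch, synthetic data).
M = 3; T = 120; V = 500; K = 5;
data = arrayfun(@(m) randn(T, V), 1:M, 'UniformOutput', false);   % stand-ins for [T x V] fMRI data

Y = cell(1, M);
for m = 1:M
    [~, score] = pca(data{m}');               % spatial PCA: voxels as observations, time as variables
    Y{m} = score(:, 1:K)';                    % [K x V] reduced dataset for subject m
end

Z = cat(1, Y{:});                             % stacked reduced data, [M*K x V]
C = Z * Z' / (V - 1);                         % full covariance of the stacked data
D = zeros(size(C));                           % block-diagonal, within-subject part
for m = 1:M
    idx = (m - 1) * K + (1:K);
    D(idx, idx) = C(idx, idx);
end
[Wg, L] = eig(C, D);                          % multiset CCA as a generalized eigenvalue problem
[~, imax] = max(real(diag(L)));
w = real(Wg(:, imax));                        % stacked canonical vectors; w((m-1)*K+(1:K)) applies to Y{m}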

37 M-CCA: operative example /6 [Figure: most significant estimated sources (spatial maps and time courses in seconds). The working memory system involves the prefrontal cortex, parietal cortex, anterior cingulate and basal ganglia.]

38 M-CCA: operative example /7 [Figure: mean sources of the controls vs. mean sources of the patients]

39 M-CCA: operative example /8 [Figure: controls, patients, and difference maps]

40 M-CCA: operative example /9 t-score: selected features, with thresholds η_{t,inf} = -2.45 and η_{t,sup} = 2.47. Fisher score: selected features, with threshold η_w.

41 Component Analysis: summary When the number of variates is too high, a reduction can be useful. Reduction based on linear decomposition rests on the hypothesis that many of the variates are correlated among themselves and that the variates can be represented in a new space, where a reduced number of components is sufficient to represent the data. PCA is the simplest component decomposition technique; the choice of the number of components to retain is a side problem. Related decompositions can be applied to groups of datasets: Canonical Correlation Analysis, on a group of two datasets, and Multiset Canonical Correlation Analysis, which is a generalization of CCA. All of these rely on the real possibility of identifying a linear decomposition of sources starting from the variates (homogeneous variates, i.e. variates that come from the same origin and the same physical quantity, is a strong hypothesis).
