Principal component analysis (PCA). Kathleen Marchal, Dept. of Plant Biotechnology and Bioinformatics / Department of Information Technology (INTEC)


1 Principal component analysis (PCA). Kathleen Marchal, Dept. of Plant Biotechnology and Bioinformatics / Department of Information Technology (INTEC)

2 Overview of the lectures: 26/02 13:00 biostat S3 Emile Clapeyron; 07/03 Mon 13:00 S5 Grace Hopper; 11/03 Fri 13:00 S5 Grace Hopper; 07/04 Thu 09:00; 14/04 Thu 09:00 Emile Clapeyron S3; 21/04 Thu 09:00 Emile Clapeyron S3; 21/04 Thu 13:00 S5 Grace Hopper; 22/04 Fri 13:00 S5 Grace Hopper; (28/04 Thu 09:00); 29/04 Fri 13:00 S5 Grace Hopper

3 Multivariate analysis methods: methods for multiple variables. [Data-matrix schematic, statistical-textbook convention: variables (n) in columns, observations in rows.]

4 PCA How does it work? Intuitive (case study, course notes) Geometric interpretation (course notes) Algebraic solution (tutorial)

5 Case study: systems biomedicine. Cancer is a heterogeneous disease: subtypes exist within one cancer, and subtypes have a different molecular origin/prognosis. Can molecular information help explain the subtypes?

6 Case study: systems biomedicine. Genes = variables (G1, G2, G3, G4, ..., Gn); patients = observations (P1, P2, P3, P4, ..., Pm): the patient profiles. Golub 1999: 72 patients with acute lymphoblastic leukemia (ALL; in this text these patients are taken to belong to class 1) or acute myeloid leukemia (AML; in this text these patients belong to class 2); 7000 genes.

7 Case study: systems biomedicine. A high-dimensional dataset. [Scatter plot of the patient profiles against variable 1 (gene 1) and variable 2 (gene 2).] Variables: genes (7000); observations: patients (37).

8 High dimensionality of the datasets. Goal: finding biomarkers and building predictors.

9 Subtyping/biomarker selection. What do we expect? Patients with the same subtype (class) should have the same expression profiles, i.e. the clinical subtype is reflected in the molecular phenotype. This implies that the most variable genes or gene combinations can be associated with the class distinction.

10 Biomarkers

11 ... but there are confounding factors. Expression signals contain variation related to age, drug usage, gender, ...; there are redundant signals. Feature selection: select those genes that are most distinctive for the phenotype of interest.

12 Supervised analysis. The class distinction is known. Select the features/genes that are most discriminative for the a priori known class distinction. These genes are biomarkers, used to screen novel patients. Supervised dimensionality reduction.

13 Feature extraction. Choose a class distinction vector c = [...] (related to a known class distinction). Calculate for every gene g its metric P(g,c), i.e. its distance to the class distinction vector. The metric favors genes that have a pronounced between-class variance but a low within-class variance. [Figure: three cases: pronounced between-class variance with high within-class variance; pronounced between-class variance with low within-class variance; low between-class variance with low within-class variance.]
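
The slides do not spell out the formula behind P(g,c); a common choice, and the one used in Golub et al. (1999), is the signal-to-noise statistic P(g,c) = (mean_1 - mean_2) / (sd_1 + sd_2), computed per gene over the two classes. A minimal R sketch under that assumption (the matrix and class labels are made-up stand-ins):

# Signal-to-noise metric per gene, assuming the Golub-style statistic.
# expr: genes x patients matrix; cls: class label (1 or 2) per patient.
signal_to_noise <- function(expr, cls) {
  m1 <- rowMeans(expr[, cls == 1]); m2 <- rowMeans(expr[, cls == 2])
  s1 <- apply(expr[, cls == 1], 1, sd); s2 <- apply(expr[, cls == 2], 1, sd)
  (m1 - m2) / (s1 + s2)  # large |P(g,c)|: strong between-class, weak within-class variation
}

set.seed(1)
expr <- matrix(rnorm(100 * 10), nrow = 100)  # toy data: 100 genes x 10 patients
cls <- rep(c(1, 2), each = 5)
head(order(abs(signal_to_noise(expr, cls)), decreasing = TRUE))  # top-ranked candidate genes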

14 Unsupervised analysis. The previous methods only select single genes, which do not necessarily contain independent information. Sometimes linear combinations of genes can be more discriminative, because the activity of a tumor is rarely determined by the activity of one gene: it is a complex phenotype that requires interactions between genes. And what if the class distinction is not known a priori?

15 PCA. => The dataset can be disentangled into different directions of variation (phenotype-related and/or confounding factors). => We assume that the most pronounced variance in the dataset (changes in gene expression between patient groups) can be explained by the cancer phenotype.

16 PCA. Variables: genes (7000); observations: patients (37). Patients are thus represented by 7000-dimensional vectors and would have to be plotted in a 7000-dimensional space. We will now reduce the dimensions of the dataset by making linear combinations of the variables (genes) that capture most of the variability in the dataset (the 1st PC). The PC is represented by the vector (a11, a12, ...), where a11 and a12 correspond to the loadings of gene 1 and gene 2 respectively (i.e. the contributions of gene 1 and gene 2 to the 1st PC).
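
As a concrete illustration (a toy sketch, not the course's own code), base R's prcomp computes exactly these loadings; the 37 x 5 matrix below is a made-up stand-in for the patients-by-genes data:

set.seed(1)
expr <- matrix(rnorm(37 * 5), nrow = 37)  # toy data: 37 patients x 5 genes
pcares <- prcomp(expr)                    # mean-centers the data by default
pcares$rotation[, 1]                      # loadings (a11, a12, ...) of the 1st PC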

17 In case we have only two variables, i.e. two dimensions. [Two scatter plots of patient profiles against variable 1 (gene 1) and variable 2 (gene 2), each with PC1 = (a11, a12) drawn: in the left panel gene 1 has the high loading (feature important for the class distinction); in the right panel gene 2 has the high loading.] Express the observations in the new basis determined by the PC.

18 PCA. Biologically it also makes sense to reduce the dimensionality of the problem: not all genes are independent of each other. Some genes are e.g. coexpressed, i.e. with respect to the class distinction they give a redundant and thus not independent signal. With dimensionality reduction we can group these genes.

19 Dimensionality reduction: project the observations on the first PC (or the first two PCs). [Scatter plots in the (gene 1, gene 2) plane showing patient P1, with coordinates P1_gene1 and P1_gene2, projected onto PC1 = (a11, a12).] The coordinate of the first patient along the first PC: P1_(PC1) = a11 * P1_gene1 + a12 * P1_gene2.
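
Continuing the toy prcomp example from above, the slide's projection formula can be checked by hand: the score of patient 1 on PC1 is the inner product of the PC1 loadings with the mean-centered coordinates of patient 1.

centered <- scale(expr, center = TRUE, scale = FALSE)  # PCA works on centered data
drop(centered[1, ] %*% pcares$rotation[, 1])           # a11*P1_gene1 + a12*P1_gene2 + ...
pcares$x[1, 1]                                         # the same score, as stored by prcomp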

20 PCA (intuitive). The new variables (PCs) are linear combinations of the original variables. The principal components are selected such that they are uncorrelated with each other. The first principal component accounts for the maximum variance in the data, the second principal component accounts for the maximum of the variance not yet explained by the first component, and so on.
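
With the toy pcares object from above, these properties can be read off directly: summary() reports per component the standard deviation, the proportion of variance explained, and the cumulative proportion; the proportions are non-increasing from PC1 onward.

summary(pcares)$importance  # variance explained per PC, in decreasing order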

21 PCA scores. [Scatter plot of the patients' scores on the first two PCs; axes: predict(pcares)[, 1] and predict(pcares)[, 2].]
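
A sketch reproducing such a scores plot with the toy pcares object from above; for a prcomp fit, predict() without new data returns the score matrix (identical to pcares$x):

plot(predict(pcares)[, 1], predict(pcares)[, 2],
     xlab = "predict(pcares)[, 1]", ylab = "predict(pcares)[, 2]",
     main = "PCA scores")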

22 PCA How does it work? Intuitive (case study, course notes) Geometric interpretation (course notes) Algebraic solution (tutorial)

23 Conventions. Statistical textbooks: variables (n) in columns, observations in rows. Bioinformatics convention: patients = observations (n), genes = variables (4 in the toy example that follows).

24 PCA (geometric). PCA is a basis transformation PX = Y, in which P = the transformation matrix. In PCA this transformation corresponds to a rotation of the original basis vectors over an angle θ. In the example below, the rows of the transformation matrix are the PCs.

25 PCA (geometric). The data are mean-centered. Decide whether the data need to be standardized or not. The first component is selected in the direction along which the observations show most of the data variability. The second component is selected in the direction that is orthogonal to the first component and that accounts for most of the remaining variance in the data. The procedure continues until the number of principal components equals the number of variables. Together the new axes account for the same variation as the original axes.

26 PCA (geometric). [Scatter plot of the raw data in the (x1, x2) plane.]

27 PCA (geometric): mean centering.

28 PCA (geometric): variance rescaling.

29 PCA (geometric). [Two panels: PC1 = (cos θ, sin θ) drawn as the new axis X1* in the (X1, X2) plane, with axis scales in %, before and after rescaling.] Without rescaling, the directionality is driven by the scale of the variables and not by the difference in their contribution to the variance.

30 PCA (geometric). [Two panels: candidate first axes PC1 = (cos θ, sin θ) at different angles θ in the (x1, x2) plane, each annotated with the variance explained along that direction.]

31

32 PCA (geometric). PCA is a basis transformation PX = Y, in which P = the transformation matrix. In PCA this transformation corresponds to a rotation of the original basis vectors over an angle θ. The rows of the transformation matrix are the PCs: the loadings! P = [ cos θ  sin θ ; -sin θ  cos θ ], so that P [x1 ; x2] = [x1* ; x2*]. The vector representing the first new axis (PC1) has the loadings as its elements; they express the contribution of each original axis to the PC. Performing the matrix multiplication corresponds to calculating the coordinates of the original data points along the new axes: each data point is projected on each new PC, and the results of these projections are the scores of the data points.
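
A toy numeric check of this rotation in R (θ = pi/6 is an arbitrary example angle and the data matrix is made up):

theta <- pi / 6
P <- rbind(c(cos(theta),  sin(theta)),        # row 1: loadings of PC1
           c(-sin(theta), cos(theta)))        # row 2: loadings of PC2
X <- rbind(x1 = c(1, 2, 3), x2 = c(2, 1, 0))  # 2 variables x 3 observations
Y <- P %*% X                                  # scores: coordinates in the rotated basis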

33 PCA (geometric)

34 PCA (geometric). How do we determine the rotation (θ) of the new axis? [Figure: an observation p projected onto the new axis X1* = PC1 = (cos θ, sin θ); these projections are the scores!] The new coordinate x1* can be written as x1* = cos θ * x1 + sin θ * x2, where x1 and x2 are the coordinates of that observation with respect to X1 and X2.

35 PCA (geometric)

36 PCA (geometric). Variance explained as a function of the rotation angle.
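
The curve on this slide is easy to reproduce: sweep the angle θ, project the mean-centered data on (cos θ, sin θ), and record the variance of the projections. A sketch with made-up correlated data:

set.seed(2)
g1 <- rnorm(50); g2 <- g1 + rnorm(50, sd = 0.3)           # two correlated toy variables
Xc <- scale(cbind(g1, g2), center = TRUE, scale = FALSE)  # mean-centered data
thetas <- seq(0, pi, length.out = 181)
v <- sapply(thetas, function(th) var(drop(Xc %*% c(cos(th), sin(th)))))
thetas[which.max(v)]  # the angle that maximizes the variance = direction of PC1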

37 PCA (geometric). [Two panels: candidate first axes PC1 = (cos θ, sin θ) at different angles θ in the (x1, x2) plane, each annotated with the variance explained along that direction.]

38 PCA (geometric). The second PC: PC2 = (-sin θ, cos θ). The observations are now projected with respect to the new axis X2*: x2* = -sin θ * x1 + cos θ * x2. Data reduction becomes possible.

39 PCA (geometric): the 3D case.

40 PCA (geometric) PCA as a dimensionality reduction technique

41 PCA (geometric). p1 is the rotation of the unit vector over an angle θ. The coordinates of p1 according to the original basis are (x1*, y1*) = (cos θ, sin θ), so the directionality of the first PC is (cos θ, sin θ).

42 PCA (geometric). PX = Y: PCA is a basis transformation, a rotation, obtained by multiplying the matrix P with X. P: the transformation matrix, which contains the PCs in its rows. Y: the coordinates of the original data points (x1, x2) according to the transformed basis.

43 PCA (geometric). The projection of p1 on PC1 consists of two components, px1 and px2, being respectively the projection of the first original coordinate on PC1 and the projection of the second original coordinate on PC1.

44 PCA How does it work? Intuitive (case study, course notes) Geometric interpretation (course notes) Algebraic solution (tutorial)

45 Basis transformation. Matrix multiplication = linear map.

46 Basis transformation. PCA = defining a new basis (a basis transformation). Assume the 4-dimensional case (4 genes are the basis vectors: a, b, c, d). X = the matrix of n observations, i.e. the coordinates of the n patients in the original basis (the original expression measures), of size 4 x n: X = [ x_1a ... x_na ; ... ; x_1d ... x_nd ]. P = the linear transformation matrix (loading matrix), of size 4 x 4: P = [ a_1a ... a_1d ; ... ; a_4a ... a_4d ]. Y = PX (the linear map onto the new basis).

47 Basis transformation. Y: the coordinates according to the new basis. Y = [ a_1a x_1a + a_1b x_1b + a_1c x_1c + a_1d x_1d ... a_1a x_na + a_1b x_nb + a_1c x_nc + a_1d x_nd ; ... ; a_4a x_1a + a_4b x_1b + a_4c x_1c + a_4d x_1d ... a_4a x_na + a_4b x_nb + a_4c x_nc + a_4d x_nd ]. Row i of Y contains the projections of the observations on the i-th new basis vector (the i-th row of P).

48 PCA (algebraic solution). How do we best re-represent X, i.e. how do we choose the new basis P? In PCA the rows of P consist of the loadings and determine the PCs. The data are noisy and redundant.

49 PCA (algebraic solution). Signal-to-noise ratio (SNR). Noise is usually randomly distributed, whereas the variance that is due to a signal is expected to be spread in a particular direction (certain genes tend to be consistently differentially expressed in the cancer patients of a specific subtype). The observed variance is not random: this is the signal we are interested in.

50 PCA (algebraic solution). In gene expression data, genes that belong to the same pathway and that are coexpressed confer redundant information => dimensionality reduction.

51 PCA (algebraic solution). [Two scatter plots: left, gene 1 and gene 2 are correlated, so there is redundancy in the signal; right, gene 1 and gene 2 are not correlated, so there is no redundancy in the signal.]

52 PCA (algebraic solution). The variance-covariance matrix: multiplying the 4 x n matrix X (mean-centered) with its n x 4 transpose gives the square 4 x 4 matrix S_X = X X^T / (n - 1). The diagonal elements are the variances; the off-diagonal elements are the covariances. Computing S_X quantifies the correlations between all possible pairs of measurements (between the gene profiles).
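
A sketch of this computation on made-up data; it matches base R's cov, which expects observations in rows, hence the transpose:

n <- 10
X <- matrix(rnorm(4 * n), nrow = 4)  # toy data: 4 genes x n patients
Xc <- X - rowMeans(X)                # mean-center each gene (row)
S_X <- Xc %*% t(Xc) / (n - 1)        # 4 x 4 variance-covariance matrix
all.equal(S_X, cov(t(Xc)), check.attributes = FALSE)  # TRUE: same as base R's cov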

53 PCA (algebraic solution): the variance-covariance matrix.

54 PCA (algebraic solution): the variance-covariance matrix.

55 PCA (algebraic solution): the variance-covariance matrix.

56 PCA (algebraic solution). Diagonalize the covariance matrix. Our goal is to find the basis in which the covariance matrix: minimizes redundancy, measured by covariance (the off-diagonal elements), i.e. we would like each variable to co-vary as little as possible with the other variables; and maximizes the signal, measured by variance (the diagonal elements). Since the off-diagonal covariances are minimized when they are exactly zero, the optimized covariance matrix will be a diagonal matrix.

57 PCA (algebraic solution). Choose P in the transformation Y = PX such that S_Y is diagonalized and the diagonal values are ranked according to the variance in the data they explain. PCA does this in the simplest way: the new basis is orthonormal, and the directions with the largest variances are the most important (the solution is possible with linear algebra).
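
Concretely, linear algebra delivers this P as the eigenvectors of S_X: a sketch continuing from the covariance example above (eigen() returns the eigenvalues in decreasing order, so the ranking comes for free):

eig <- eigen(S_X, symmetric = TRUE)  # S_X from the sketch on slide 52
P <- t(eig$vectors)                  # rows of P = orthonormal PCs
S_Y <- P %*% S_X %*% t(P)            # covariance matrix of Y = PX
round(S_Y, 10)                       # diagonal matrix; diag(S_Y) equals eig$values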

58 PCA(algebraic solution)

59 PCA(algebraic solution)

60 PCA(algebraic solution)

61 Case study. [Table: tetranucleotide frequencies per scaffold, e.g. AAAA 0.3, AAAT 7, ..., for scaffolds S1, S2, ..., Sn; further rows AAAG, AAAC, AATA, ...] Number of tetranucleotides: 4^4 = 256 (variables). Number of observations = scaffolds = restricted because of the frequency-based binning (20?). Can the scaffolds be separated based on their tetranucleotide frequencies? -> PCA

62 Case study. Variables: tetranucleotides, 4^4 = 256. Observations: scaffolds, restricted because of the frequency-based binning (20?). Can the scaffolds be separated based on their tetranucleotide frequencies? -> PCA. Make a new axis that is a linear combination of the tetranucleotides, i.e. reduce the 256-dimensional space to a 2-dimensional space. [Plot: scores of the original data points on the new axes.]
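
A sketch of the whole case study on made-up data (the real scaffold frequencies are not reproduced in these notes; 20 scaffolds with random profiles are stand-ins):

set.seed(3)
freqs <- matrix(runif(20 * 256), nrow = 20)  # toy data: 20 scaffolds x 256 tetranucleotides
freqs <- freqs / rowSums(freqs)              # each row is a frequency profile
pcares_tn <- prcomp(freqs)
plot(predict(pcares_tn)[, 1:2],              # scaffold scores on the first two PCs
     xlab = "PC1 scores", ylab = "PC2 scores")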

63
