Principal component analysis, PCA

1 CHEM-E3205 Bioprocess Optimization and Simulation Principal component analysis, PCA Tero Eerikäinen Room D416d

2 Data: Process or system measurements. New information from the gathered data. Data type and variability are important to know. How can we extract information from data?

3 Purposes: Monitoring the state of processes. Understanding relationships between variables. Optimisation.

4 Good to know: In a stable situation, every process or system measurement varies around its mean value. Typically, normal variation stays inside the control limits: for normally distributed data, about 99.7% of measurements are within mean ± 3 standard deviations. Dimensionality (overload if too many variables are measured too often). Collinearity (not all variables are independent).
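The control-limit idea can be sketched in a few lines of Python. This is only an illustration, assuming limits at mean ± 3 sample standard deviations estimated from a stable reference period; the function names and the measurement values are hypothetical.

```python
import numpy as np

def control_limits(x):
    """Return (lower, centre, upper) as mean -/+ 3 sample standard deviations."""
    x = np.asarray(x, dtype=float)
    centre = x.mean()
    sd = x.std(ddof=1)  # sample standard deviation
    return centre - 3 * sd, centre, centre + 3 * sd

def out_of_control(x, limits):
    """Indices of measurements that fall outside the given limits."""
    lower, _, upper = limits
    x = np.asarray(x, dtype=float)
    return np.flatnonzero((x < lower) | (x > upper))

# Hypothetical reference measurements from a stable period
reference = np.array([7.1, 6.9, 7.0, 7.2, 6.8, 7.0, 7.1, 6.9, 7.0, 7.0])
limits = control_limits(reference)

# New measurements checked against the reference limits
new = np.array([7.05, 6.95, 8.4, 7.0])
flagged = out_of_control(new, limits)
print(limits, flagged)
```

With many collinear variables, one such chart per variable quickly becomes unmanageable, which is the motivation for the projection methods that follow.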

5 Data types: Univariate, K = 1. Bivariate, K = 2. Low number of variables, K < 6. Multivariable data, K ≥ 6, controlled and/or response variables. Number of variables (K) vs. observations (N): K > N, observations more interesting; K < N, variables more interesting.

6 Problem types

8 Classical methods, 1930->: Few variables, many observations, independent X's: multivariate regression, canonical analysis, linear discriminant analysis, analysis of variance, maximum likelihood, etc. Chemistry, biology, engineering, PAT, 1990: Many variables, few observations: chemometrics, PCA, PLS, other projection methods.

10 Principal Component Analysis PCA basic principle

11 Basic principle: From the variables $X_1, X_2, \dots, X_k$, new variables $P_1, P_2, \dots, P_k$ (at most k) are created. The new variables (principal components) are linear combinations of the original variables. Principal components combine the variance of many original variables.

12 K-dimensional variable space

13 Observations from data matrix

14 The center of gravity

15 Mean centering

16 Variance maximizing, residual variance minimizing

17 The first principal component

18 The second principal component

19 Two PCs --> plane

20 Scaling

21 Scaling

22 Scaling
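Mean centering and scaling can be sketched as follows. This assumes scaling to unit variance with the sample standard deviation; the function name `autoscale` and the data are hypothetical.

```python
import numpy as np

def autoscale(X):
    """Mean-centre each column and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)  # sample standard deviation per variable
    return (X - mean) / sd, mean, sd

# Hypothetical data: 5 observations, 2 variables on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 180.0],
              [3.0, 260.0],
              [4.0, 240.0],
              [5.0, 320.0]])
Z, mean, sd = autoscale(X)
print(Z.mean(axis=0), Z.std(axis=0, ddof=1))
```

Without scaling, the second variable would dominate the first principal component simply because its numerical range is larger.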

23 PCA principles: Principal components are uncorrelated with each other (orthogonality). The distances between the original observation points remain unchanged after the transformation. The first principal component $P_1$ covers most of the original variance, $P_2$ most of the remaining variance, etc.
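These properties can be checked numerically. The sketch below is an illustration on randomly generated, hypothetical data; it assumes that all k components are kept, in which case the transformation is a pure rotation of the mean-centered data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated data: 50 observations, 3 variables
base = rng.normal(size=(50, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(50, 1)),
               2 * base + 0.1 * rng.normal(size=(50, 1)),
               rng.normal(size=(50, 1))])
Xc = X - X.mean(axis=0)

# Eigenvectors of the covariance matrix, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

T = Xc @ eigvecs  # scores on all k components

# 1) Scores are uncorrelated: their covariance matrix is diagonal
score_cov = np.cov(T, rowvar=False)
# 2) Distances between observations are unchanged by the rotation
d_before = np.linalg.norm(Xc[0] - Xc[1])
d_after = np.linalg.norm(T[0] - T[1])
# 3) Share of the total variance carried by each component, largest first
explained = eigvals / eigvals.sum()
print(np.round(score_cov, 6), d_before, d_after, explained)
```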

24 PCA properties: The goal is to reduce or compress the original data to a few explanatory components, i.e. to reduce the dimensionality of the original data. Works efficiently if there is strong correlation between some of the original variables $X_1, X_2, \dots, X_k$.

25 Choosing the variables: PCA variables are treated equally, since they are not classified into dependent and independent variables. The correlation matrix helps to inspect whether principal components should be used. A multivariate normal distribution of the variables is desirable. A linear relation between variables is needed (because the method is based on the correlation or covariance matrix).

26 PCA calculation: The calculation is carried out using the covariance matrix or the correlation matrix.

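27 Analysis principle: n x k observation matrix (n observations, k variables). The first principal component is $P_1 = a_{11} X_1 + a_{12} X_2 + \dots + a_{1k} X_k$. The coefficients $a_{11}, a_{12}, \dots, a_{1k}$ are chosen so that the variance of the new variable $P_1$ is as large as possible, subject to the sum of squares being $\sum_{j=1}^{k} a_{1j}^2 = 1$.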

28 Analysis principle: A similar approach is used for the other components $P_x$, with the same constraint on the coefficients: $\sum_{j=1}^{k} a_{xj}^2 = 1$. The eigenvalues (k of them) of the correlation or covariance matrix are the variances of the principal components. The sum of the principal component variances is equal to the sum of the original variable variances. When the correlation matrix is used, the sum of the eigenvalues is equal to the number of variables.
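The statements about eigenvalues can be verified with a short sketch. The data below are randomly generated and purely hypothetical; only standard NumPy routines are used.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 30 observations, 4 variables with different spreads
X = rng.normal(size=(30, 4)) * np.array([1.0, 5.0, 0.5, 2.0])

cov = np.cov(X, rowvar=False)        # covariance matrix
corr = np.corrcoef(X, rowvar=False)  # correlation matrix

cov_eigvals, cov_eigvecs = np.linalg.eigh(cov)
corr_eigvals = np.linalg.eigvalsh(corr)

total_variance = X.var(axis=0, ddof=1).sum()
coefficient_norms = (cov_eigvecs ** 2).sum(axis=0)  # sum of squared coefficients

print(cov_eigvals.sum(), total_variance)  # equal: variance is preserved
print(corr_eigvals.sum())                 # equals the number of variables
print(coefficient_norms)                  # each component has unit sum of squares
```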

29 Analysis principle: The variances $D^2(P_i)$ of the principal components $P_i$ are the eigenvalues $\lambda_i$ of matrix C. The number of principal components to keep is decided according to the eigenvalues, for example to cover a chosen percentage of the total variance, or by graphical inspection.
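One way to pick the number of components from the eigenvalues is a cumulative-variance threshold. The sketch below is an assumption-laden illustration: the eigenvalues, the 85 % target and the function name are all hypothetical, not values from the course material.

```python
import numpy as np

def n_components_for(eigvals, target):
    """Smallest number of components whose cumulative variance share reaches target."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    n = int(np.searchsorted(cumulative, target)) + 1
    return min(n, len(eigvals)), cumulative

eigvals = [2.5, 1.0, 0.3, 0.2]  # hypothetical eigenvalues of a correlation matrix
n, cumulative = n_components_for(eigvals, target=0.85)
print(n, cumulative)
```

Plotting the sorted eigenvalues against the component number (a scree plot) is the usual graphical alternative to a fixed threshold.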

30 Simca PCA criteria: A principal component is significant if at least one of the following is true. Rule 1: $Q^2$ > Limit. The significance limit is displayed near the component. For a PCA model, the limit increases with subsequent components to account for the loss in degrees of freedom. Rule 2: At least $\sqrt{K}$ variables (K = number of X-variables) have $Q^2V$ > Limit.

31 Determination<>prediction

32 Component weights = loadings: The loadings show the composition of each principal component, i.e. how much of each original variable is explained by the different principal components. Eigenvector $a_i$ forms the weights of component $P_i$. The interpretation of the principal components is the most subjective phase.

33 Loadings

34 Score: For each observation, a score value can be calculated from the original variable values. For example, the score value $t_{1n}$ for the first component and observation n is $t_{1n} = a_{11} x_{n1} + a_{12} x_{n2} + \dots + a_{1k} x_{nk} = \sum_{i=1}^{k} a_{1i} x_{ni}$, in which $x_{ni}$ is the value of $X_i$ for observation n. From score plots one can see process trends, clustering of observations, etc.
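The score calculation can be sketched as below. The data matrix is hypothetical, and the sketch assumes mean-centered data and loadings taken from the covariance matrix; the sign of a loading vector is arbitrary, so scores may come out with flipped sign.

```python
import numpy as np

# Hypothetical data: 6 observations (rows), 3 variables (columns)
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.8],
              [1.9, 2.2, 1.1],
              [3.1, 3.0, 0.3],
              [2.3, 2.7, 0.9]])
Xc = X - X.mean(axis=0)  # mean centering (an assumption of this sketch)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

a1 = eigvecs[:, 0]  # loadings of the first component
t1 = Xc @ a1        # first-component score for every observation

# The same score written out as the explicit sum for one observation
n = 2
t1_n = sum(a1[i] * Xc[n, i] for i in range(Xc.shape[1]))
print(t1, t1_n)
```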

35 Scores and distance to model
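A distance to the model can be illustrated as the size of the residual left after projecting an observation onto the retained components. The sketch below is a simplified assumption: it computes a residual standard deviation per observation and does not reproduce the exact normalisation used by SIMCA. Function names and data are hypothetical.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return the column means and the loading matrix (K x A) of a PCA model."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T

def distance_to_model(X, mean, P):
    """Residual standard deviation of each observation after projection."""
    Xc = X - mean
    T = Xc @ P       # scores
    E = Xc - T @ P.T # residuals not explained by the model
    K, A = P.shape
    return np.sqrt((E ** 2).sum(axis=1) / (K - A))

rng = np.random.default_rng(2)
# Hypothetical training data lying close to a line in 3-D space
t = rng.normal(size=40)
X_train = np.outer(t, [1.0, 2.0, 3.0]) + 0.05 * rng.normal(size=(40, 3))
mean, P = fit_pca(X_train, n_components=1)

X_new = np.array([[1.0, 2.0, 3.0],    # follows the training pattern
                  [3.0, -2.0, 0.0]])  # breaks the correlation structure
d_train = distance_to_model(X_train, mean, P)
d_new = distance_to_model(X_new, mean, P)
print(d_train.max(), d_new)
```

The second new observation may have an unremarkable score, but its large residual shows that it does not fit the correlation pattern of the training data.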

36 Example: Foods (Simca program). 20 variables (foodstuffs), 16 observations (countries).

37 Calculate enough principal components

38 Characteristic values for the model and original variables

39 Score values for the observations: Middle European countries form one cluster, Mediterranean countries are on the left, and Nordic countries are up right and in the middle.

40 Loadings: Garlic and olive oil form a Mediterranean group. Crisp bread and frozen fish seem to be characteristic of Nordic countries. Instant coffee and powder soup are used a lot in Middle European countries.

41 Components 1 and 3: The third component separates England and Ireland from the rest of the countries.

42 Components 1 and 3: The loadings plot shows that tea and jam are popular, but ground coffee and fruit are consumed less on these islands.

43 Example: size classification. The table shows the morphometric measurements (in mm) of 2-24 day old water fleas.

44 Example: size classification...

45 Example: size classification... The first and second principal components explain about 96 % of the original variation. Each of the body size variables (all except $X_2$) accounts for about 20 % of the variation that PC1 explains. The second component (which could be said to describe the shape of a water flea) is actually only affected by $X_2$, which describes the size of abdomen fling.

46 Multivariate analysis of dynamic gene expression data from yeast: Data were originally gathered from samples of very high gravity wort fermentations using Saccharomyces pastorianus (combining S. cerevisiae and S. bayanus genes). Samples were analysed using the transcript analysis with aid of affinity capture (TRAC) method. TRAC can be used to create a dynamic expression picture along the physiological states of the observed cultivations. The expression of selected genes relevant to wort fermentation was monitored at high frequency during fermentations lasting several days. Changes in expression during the first hours of fermentation appeared remarkable for several genes affecting maltose metabolism, glycolysis and ergosterol synthesis. To find out more about gene interactions during different metabolic states, multivariate modelling was carried out using PCA and PLS methods. Score plots formed trajectories from the first hours through different metabolic states. Gene expression could be used to monitor fermentation phase changes and product quality. PLS modelling of fermentation sugars and apparent extract (carbohydrate conversion) is shown here.

49 PLS models: Partial Least Squares or Projection to Latent Structures. Find the correlation in multivariate data between output/explained variables (Y) and input/process variables (X).
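A PLS model for a single Y-variable can be sketched with the NIPALS-style algorithm below. This is a minimal illustration under stated assumptions (one response, mean-centered data, no scaling, no cross-validation), not the implementation used in Simca; the function name and the data are hypothetical.

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """PLS regression for one response; returns coefficients and the centring means."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean  # residual matrices, deflated per component
    W, P, c = [], [], []
    for _ in range(n_components):
        w = E.T @ f                # weight: direction in X most covarying with y
        w /= np.linalg.norm(w)
        t = E @ w                  # X-scores
        tt = t @ t
        p = E.T @ t / tt           # X-loadings
        ca = f @ t / tt            # Y-weight for this component
        E = E - np.outer(t, p)     # deflate X
        f = f - ca * t             # deflate y
        W.append(w); P.append(p); c.append(ca)
    W, P, c = np.array(W).T, np.array(P).T, np.array(c)
    B = W @ np.linalg.solve(P.T @ W, c)  # regression coefficients for centered X
    return B, x_mean, y_mean

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))  # hypothetical process variables
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=40)

B, x_mean, y_mean = pls1_nipals(X, y, n_components=3)
y_hat = (X - x_mean) @ B + y_mean
print(B)
```

With as many components as X-variables the coefficients coincide with ordinary least squares; the benefit of PLS appears when fewer components are used on many collinear variables.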

50 PLS 1

51 PLS 2

52 PLS 3

53 Y-score vs. X-score

54 PLS weights (w*c[1] vs. w*c[2]): Weights w* reflect the correlation of X vs. u (Y). Weights c reflect the correlation of Y vs. t (X). The figure shows the weights of the 1st and 2nd dimension for both the X and Y space: w*c[1] vs. w*c[2]. The farther a variable is from the center, the greater its effect on the model. For example, the most positive effect on y6 comes from x5in and the most negative from x3in and x1in.

57 Some more examples: Batch statistical process control. Bioinformatics.

58 BSPC: Batch-wise manufacturing processes: baker's yeast, beer brewing, polymerization processes, car painting, bioreactor cultivation, etc. Finite duration. Time-dependent variable trajectories.

59 Baker's yeast (BSPC): 7 batch trajectories of a single batch. How do the trajectories correlate with product quality? Batch-to-batch variations are due to deviations in batch initiation, raw materials and impurities.

60 Baker's yeast (BSPC): Evolution measurements yield a 3D table with N batches, J time points and K variables. Initial conditions and result characteristics yield additional data tables. Batch maturity and the various phases are important.

61 Baker's yeast: N = 33 batches, 23 selected as reference. The last step (14 hours) is considered. 1 sample/10 minutes -> J = 83 samples/batch. 7 measured variables.

62 Baker's yeast: Develop a model of good batches. Use the model to monitor new batches. Early fault detection is possible. Helps to understand how Y = f(z,x).

63 Baker's yeast, unfolding data: Local batch time serves as the "dummy" Y-variable, which Simca-P+ generates automatically during data import. In the example, this variable is named $Time, and the estimates from the PLS model describe the maturity of the batch well.
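Unfolding a three-way batch table can be sketched as follows. This is an illustration only, assuming observation-wise unfolding in which every time point of every batch becomes one row and local batch time is the Y-variable; the array sizes are hypothetical and much smaller than in the baker's yeast example.

```python
import numpy as np

N, J, K = 3, 4, 2  # hypothetical sizes: batches, time points, variables
data = np.arange(N * J * K, dtype=float).reshape(N, J, K)  # 3-D batch table

X = data.reshape(N * J, K)             # one row per (batch, time point)
time = np.tile(np.arange(J), N)        # local batch time, restarts for each batch
batch_id = np.repeat(np.arange(N), J)  # which batch each row came from

print(X.shape)
print(time)
print(batch_id)
```

A PLS model of `time` on `X`, fitted on good batches only, then gives a predicted maturity for every new sample.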

64 Baker's yeast, PLS model

65 Baker's yeast, control limits

66 Baker's yeast, control charts

67 Baker's yeast, control charts (monitoring)

68 QSAM: Quantitative sequence-activity models. Models that make it possible to alter the biological activity of a DNA segment. A PLS model relates a numerical description of 68 bp fragments of 25 E. coli promoters to their in vivo strength. The QSAM model was used to predict more potent promoters.

69 QSAM

70 PPs (scores) of 20 nucleosides: 21 variables describe the properties of 20 nucleosides. These give 4 principal components, the score values of which may be used further in PLS modeling (e.g. Table 6. A, C, G...).

71 Promoter hyperspace: The 25 promoters were parametrized in each of the 68 positions by four descriptor variables. This gives a 25 x 272 X-matrix. This hyperspace is correlated with the promoter efficiency.
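The parametrization can be sketched as below to show where the 25 x 272 matrix comes from (68 positions x 4 descriptors). The descriptor values and the sequences are placeholders invented for this illustration; they are not the published nucleoside scores or real promoter sequences.

```python
import numpy as np

# Placeholder descriptor values per base (NOT the published principal properties)
DESCRIPTORS = {
    "A": (0.1, -1.2, 0.7, 0.3),
    "C": (-0.8, 0.4, 1.1, -0.5),
    "G": (1.3, 0.9, -0.6, 0.2),
    "T": (-0.4, -0.3, -1.0, 0.8),
}

def encode(sequence):
    """Replace every base by its four descriptors and concatenate them."""
    return np.array([value for base in sequence for value in DESCRIPTORS[base]])

rng = np.random.default_rng(4)
# Hypothetical random sequences standing in for the 25 promoters of 68 bp
sequences = ["".join(rng.choice(list("ACGT"), size=68)) for _ in range(25)]

X = np.vstack([encode(s) for s in sequences])
print(X.shape)
```

Each row of `X` would then be paired with the measured promoter strength as the Y-variable of the PLS model.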

72 Influence of each position: Positions -35, -11 and 1 are constant across all promoters and are likely to be important for all of them, even where the numerical value is small. Otherwise, the most significant positions are found at -12, ... etc.

73 Literature vs. measured promoter strengths: Using the PLS model, two promoters, $P_{LS1}$ and $P_{LS2}$, were built from one- and two-dimensional QSAM models. Predictions were then calculated, and in vivo experiments were made in which their activities were compared with those of the existing reference promoters.

74 Summary: Principal component analysis summarizes the variation of a data matrix X. The data are modelled as a plane or hyperplane. The axes of the (hyper)plane are the principal components. Prior to PCA, data are typically pre-processed by mean centering and scaling to unit variance. PLS models relate output/explained variables (Y) to input/process variables (X).
