Chemometrics Matti Hotokka Physical chemistry Åbo Akademi University
Linear regression Experiment Consider spectrophotometry as an example. Beer-Lambert's law: A = εc. Experiment: Make three known references with concentrations c_1, c_2, c_3 and measure the absorbances A_1, A_2 and A_3. Place a straight line through the points: A = a + bc. Measure the absorbance A of the unknown sample and read the concentration from the calibration curve.
Linear regression Calibration curve A_calc(c) = a + bc. We have good theoretical grounds for saying that the calibration model is linear. The intercept a is determined by additional disturbing components in the sample and can be ignored. (Figure: calibration line, absorbance A versus concentration c.)
Linear regression How to solve One measured value, one regressor. Equation: y_calc(x) = b_0 + b_1 x. To be solved: b_0, b_1. How to: minimize the sum of squared residuals Σ_i (y_i - y_calc(x_i))², which gives b_1 = Σ(x_i - x̄)(y_i - ȳ) / Σ(x_i - x̄)² and b_0 = ȳ - b_1 x̄.
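As a concrete illustration, a minimal numpy sketch of this least-squares calibration, including reading an unknown concentration back from the fitted line; the concentrations and absorbances are invented for the example:

```python
import numpy as np

# Calibration data: reference concentrations and measured absorbances
# (all numbers invented for illustration).
c = np.array([0.10, 0.20, 0.30])   # mol/L
A = np.array([0.15, 0.31, 0.44])   # absorbance

# Least-squares estimates for A = b0 + b1*c
b1 = np.sum((c - c.mean()) * (A - A.mean())) / np.sum((c - c.mean())**2)
b0 = A.mean() - b1 * c.mean()

# Read the unknown concentration from the calibration curve (inverse prediction)
A_sample = 0.25
c_sample = (A_sample - b0) / b1
print(f"A = {b0:.4f} + {b1:.4f} c  ->  c_sample = {c_sample:.4f} mol/L")
```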
Multiregression Experiment Two components a and b. Measure at two wavelengths, λ_1 and λ_2. Beer-Lambert's law, but different for each component and wavelength; the contributions are additive. At λ_1: A_1 = ε_a1 c_a + ε_b1 c_b. At λ_2: A_2 = ε_a2 c_a + ε_b2 c_b. In matrix form, A = E c:
(A_1)   (ε_a1  ε_b1) (c_a)
(A_2) = (ε_a2  ε_b2) (c_b)
Multiregression Calibrate the model Obtain the unknown coefficients ε. In this case, use two solutions, each containing only a or only b in a known concentration. The molar absorption coefficients are calculated from A_1' = ε_a1 c_a at λ_1 and A_2' = ε_a2 c_a at λ_2 (solution containing only a), and A_1" = ε_b1 c_b at λ_1 and A_2" = ε_b2 c_b at λ_2 (solution containing only b).
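A short sketch of the mixture analysis, assuming the molar absorptivities ε have already been calibrated as above; all numbers are invented:

```python
import numpy as np

# Hypothetical molar absorptivities from the single-component calibrations:
# eps[i, j] = epsilon of component j at wavelength i (values invented).
eps = np.array([[0.90, 0.20],
                [0.15, 0.75]])

# Absorbances of the unknown mixture at the two wavelengths
A = np.array([0.40, 0.30])

# Solve the 2x2 linear system A = eps @ c for the concentrations
c = np.linalg.solve(eps, A)
print(c)   # [c_a, c_b]
```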
Linear regression again Standard method One dimension: y = b_0 + b_1 x = b_0 · 1 + b_1 · x. Generalization of the linear model: collect the measurements into y = X b, where the rows of X are (1  x_i) and b = (b_0  b_1)^T; the least-squares solution is b = (X^T X)^{-1} X^T y.
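The same generalized form in code, using the example data that appears in the ANOVA slides below; in practice np.linalg.lstsq is the numerically safer choice:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 2.0])
y = np.array([0.3, 2.2, 3.0, 4.0])

# Design matrix with a column of ones for the constant b0: y = X b
X = np.column_stack([np.ones_like(x), x])

# Normal equations b = (X^T X)^{-1} X^T y
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)   # [b0, b1] ~ [0.409, 1.573]
```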
ANOVA In linear regression Definitions

Sum of squares                     Calculation                            D.f.
SS_T, Total                        Σ y_i² = y^T y                         n
SS_M, Mean                         n ȳ²                                   1
SS_corr, Corrected for the mean    SS_T - SS_M                            n - 1
SS_fact, Factors                   Σ (ŷ_i - ȳ)² = b^T X^T y - n ȳ²        p - 1
SS_R, Residuals                    Σ (y_i - ŷ_i)² = y^T y - b^T X^T y     n - p
SS_lof, Lack of fit                SS_R - SS_pe                           n - p - f
SS_pe, Pure experimental error     Σ over replicates (y_ij - ȳ_j)²        f

n = observations; p = coefficients b; f = replications. The matrix operations use b = (X^T X)^{-1} X^T y.
ANOVA Example

No.   x   y
1     0   0.3
2     1   2.2
3     2   3
4     2   4
ANOVA Sums of squares For the example data (fitted line y = 0.409 + 1.573 x): SS_T = 29.93, SS_M = 22.56, SS_corr = 7.37, SS_fact = 6.80, SS_R = 0.57, SS_lof = 0.07, SS_pe = 0.50.
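The sums of squares above can be reproduced with a short numpy sketch (printed values rounded in the comment):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 2.0])
y = np.array([0.3, 2.2, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b

n, p = X.shape
SS_T = np.sum(y**2)                       # total, d.f. = n
SS_M = n * y.mean()**2                    # mean, d.f. = 1
SS_corr = SS_T - SS_M                     # corrected for the mean, d.f. = n-1
SS_fact = np.sum((y_hat - y.mean())**2)   # factors, d.f. = p-1
SS_R = np.sum((y - y_hat)**2)             # residuals, d.f. = n-p

# Pure error from the replicated point x = 2
reps = y[x == 2.0]
SS_pe = np.sum((reps - reps.mean())**2)   # d.f. = f = 1
SS_lof = SS_R - SS_pe                     # lack of fit, d.f. = n-p-f

print(SS_T, SS_M, SS_corr, SS_fact, SS_R, SS_lof, SS_pe)
# 29.93  22.5625  7.3675  ~6.802  ~0.566  ~0.066  0.5
```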
ANOVA Quality of the fit Correlation: R² = SS_fact / SS_corr. Mean sum of squares: MSS = SS divided by its D.f. F-test for goodness of fit: F = MSS_fact / MSS_R. If this F-value exceeds the critical value in the table, the fit is significant. F-test for lack of fit: F = MSS_lof / MSS_pe. This F-value should not exceed the critical value if the model is appropriate.
ANOVA The example Goodness of fit: F = MSS_fact / MSS_R = 6.80 / 0.28 = 24.1. This exceeds the critical value at 5 % risk, 18.51, so the fit is statistically significant. Lack of fit: F = MSS_lof / MSS_pe = 0.07 / 0.50 = 0.13. This value is below the critical value at 5 % risk, 161. The model is appropriate because the lack of fit is not significant.
ANOVA Confidence intervals Variance-covariance matrix: V(b) = s² (X^T X)^{-1}. For an appropriate fit with a low value of SS_lof, MSS_R = s_R² can be used instead of SS_pe as the variance estimate s². The diagonal elements of the variance-covariance matrix are the variances of the factors b. The confidence limits of a factor b (either b_0 or b_1) are b ± s_b √F(α; 1, n - p). The prediction at a given point x_0 = (1  x_0) is ŷ_0 = x_0 b, with confidence limits ŷ_0 ± √(F s² x_0 (X^T X)^{-1} x_0^T).
ANOVA The example At 5 % risk level, F(0.05; 1, 2) = 18.51. Thus we obtain b_0 = 0.41 ± 1.74 and b_1 = 1.57 ± 1.05.
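A sketch of the confidence-limit calculation for the example; the critical F value is read from a table, not computed:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 2.0])
y = np.array([0.3, 2.2, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b
s2 = np.sum((y - y_hat)**2) / (n - p)     # MSS_R, the residual variance
XtX_inv = np.linalg.inv(X.T @ X)
V = s2 * XtX_inv                          # variance-covariance matrix of b

F_crit = 18.51                            # F(0.05; 1, n-p) from a table
half = np.sqrt(F_crit * np.diag(V))       # half-widths: b +/- s_b * sqrt(F)
print(b, half)                            # ~[0.41 1.57], ~[1.74 1.05]

# Prediction and its confidence interval at a new point x0
x0 = np.array([1.0, 1.5])                 # (1, x0)
y0 = x0 @ b
y0_half = np.sqrt(F_crit * s2 * (x0 @ XtX_inv @ x0))
```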
Multiple linear regression Ordinary regression Two-dimensional case: y = b_0 + b_1 x_1 + b_2 x_2. Measurements at three points (x_11, x_12), (x_21, x_22) and (x_31, x_32) are needed. In matrix form this system of equations is written as y = X b, where the rows of X are (1  x_i1  x_i2) and b = (b_0  b_1  b_2)^T. The order has been changed to stress the similarity to 1-dim regression.
Multiple linear regression Ordinary regression If there are several dependent variables y, each with a different equation: Y = X B + residuals, where Y is n×m, X is n×p and B is p×m (n = observations, p = regressors, m = dependent variables).
Multiple linear regression Ordinary regression The equation is solved exactly as the 1-dim equation: B = (X^T X)^{-1} X^T Y. However, if there are linear dependencies between the x's, the system becomes singular and cannot be solved. Prediction: the y_0 vector (dimension 1×m) at a given point x_0 (dimension 1×p) is ŷ_0 = x_0 B.
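A minimal sketch with invented data (n = 4, p = 3 regressors including the constant, m = 2 responses), using np.linalg.lstsq so that every column of B is fitted at once:

```python
import numpy as np

# Invented example data
X = np.array([[1.0, 0.2, 1.1],
              [1.0, 0.5, 0.9],
              [1.0, 0.8, 1.4],
              [1.0, 1.1, 0.7]])
Y = np.array([[0.30, 1.1],
              [0.55, 0.9],
              [0.90, 1.5],
              [1.05, 0.8]])

# Each column of B solves its own least-squares problem: Y = X B + E
B, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Prediction of the 1xm vector y0 at a new point x0 (1xp)
x0 = np.array([1.0, 0.6, 1.0])
y0 = x0 @ B
```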
PCR Principal component regression In full multicomponent regression it often happens that some of the x's are interdependent, i.e., not linearly independent. To avoid this, only a few coordinate axes are used, and they are chosen to be orthogonal. The selection of the orthogonal coordinates resembles the PCA method.
PCR PCA revisited In PCA, the original data matrix X is written as a product of the scores and loadings, X = T L^T. One method of solving the problem is to use the SVD (singular value decomposition) method. In that case the X matrix is written as X = U W V^T. Here matrix U corresponds to T and V corresponds to L. They are joined by a diagonal matrix W whose diagonal elements are the singular values, w_ii = √λ_i, where λ_i are the eigenvalues of the X^T X matrix. The smallest singular values can be forced to zero; the truncated matrix then removes the directions with small eigenvalues, which indicate the (near-)linear dependencies.
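The correspondence between the singular values and the eigenvalues is easy to check numerically; random data stands in for a real X here:

```python
import numpy as np

# Column-centered data matrix (centering is the usual first step of PCA)
X = np.random.default_rng(1).normal(size=(10, 3))
X = X - X.mean(axis=0)

U, w, Vt = np.linalg.svd(X, full_matrices=False)

# The squared singular values equal the eigenvalues of X^T X
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]   # sorted descending
print(np.allclose(w**2, eigvals))             # True
```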
PCR The SVD method Solution of the full linear equation is B = (X^T X)^{-1} X^T Y. Now the matrix X is written in the SVD approximation as X ≈ U W V^T, with the smallest singular values set to zero. Then a pseudo-inverse matrix X^+ = V W^+ U^T can be used, where W^+ inverts the retained singular values (1/w_ii) and leaves the zeroed ones at zero. The solution along the desired number of principal axes is then B = X^+ Y.
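A sketch of PCR along these lines; pcr_coefficients is a hypothetical helper name, and the number of retained axes is passed in explicitly rather than chosen automatically:

```python
import numpy as np

def pcr_coefficients(X, y, n_components):
    """Regression coefficients along the first n_components principal axes,
    via the truncated SVD pseudo-inverse X+ = V W+ U^T (a sketch)."""
    U, w, Vt = np.linalg.svd(X, full_matrices=False)
    w_inv = np.zeros_like(w)
    w_inv[:n_components] = 1.0 / w[:n_components]   # small singular values -> 0
    X_pinv = Vt.T @ np.diag(w_inv) @ U.T
    return X_pinv @ y
```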
PLS Partial Least Squares method In PCA the matrix X is split into a product of the scores matrix and the loadings matrix, X = T P^T + E, where X is n×p, the scores T are n×d, the loadings P^T are d×p and the residuals E are n×p.
PLS Partial Least Squares method In PLS, also the matrix Y is split into a product of a scores matrix and a loadings matrix, Y = U Q^T + F, where Y is n×m, the scores U are n×d, the loadings Q^T are d×m and the residuals F are n×m.
PLS Partial Least Squares method The solution can then be written as B = W (P^T W)^{-1} diag(b) Q^T, where b holds the inner-relation coefficients determined below. Here W is the p×d matrix of PLS weights. Only a few latent dimensions are kept; the rest are set to zero.
PLS Algorithm Initialize: Center the columns of the matrices (subtract the column means). Initialize: Use the first column of the Y matrix as the first Y score vector u.
PLS Algorithm (1) Compute the X weights: w = X^T u / (u^T u). (2) Scale the weights: w = w / ||w||. (3) Estimate the scores of the X matrix: t = X w.
PLS Algorithm (4) Compute the Y loadings: q = Y^T t / (t^T t), scaled to ||q|| = 1. (5) Generate a new u vector: u = Y q. Repeat from step (1) until u is stationary.
PLS Algorithm (6) Determine the scalar coefficient b for this latent variable: b = u^T t / (t^T t). (7) Compute the loadings of the X matrix: p = X^T t / (t^T t). (8) Compute the residuals: E = X - t p^T and F = Y - b t q^T.
PLS Algorithm Stopping criterion: Calculate the standard error of prediction from cross-validation. If SEP_CV no longer decreases when a factor is added, the optimum number of dimensions has been reached and the final B coefficients can be calculated. Otherwise use the residuals from step (8) as the new X and Y matrices and continue from the initialization step with an additional dimension.
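Putting steps (1)-(8) together, a sketch of the NIPALS loop in the common Geladi-Kowalski formulation; pls_nipals is a hypothetical name, the cross-validation stopping rule is omitted, and the number of components is fixed in advance:

```python
import numpy as np

def pls_nipals(X, Y, n_components, tol=1e-10, max_iter=500):
    """Sketch of the NIPALS PLS algorithm. X (n x p) and Y (n x m)
    are assumed to be column-centered already."""
    E, F = X.copy(), Y.copy()
    p_dim, m_dim = X.shape[1], Y.shape[1]
    W = np.zeros((p_dim, n_components))   # X weights
    P = np.zeros((p_dim, n_components))   # X loadings
    Q = np.zeros((m_dim, n_components))   # Y loadings
    b = np.zeros(n_components)            # inner-relation coefficients

    for d in range(n_components):
        u = F[:, 0].copy()                       # init: first Y column as u
        for _ in range(max_iter):
            w = E.T @ u / (u @ u)                # (1) X weights
            w /= np.linalg.norm(w)               # (2) scale the weights
            t = E @ w                            # (3) X scores
            q = F.T @ t / (t @ t)                # (4) Y loadings ...
            q /= np.linalg.norm(q)               #     ... scaled to unit length
            u_new = F @ q                        # (5) new Y scores
            if np.linalg.norm(u_new - u) < tol:  # u stationary -> converged
                u = u_new
                break
            u = u_new
        b[d] = u @ t / (t @ t)                   # (6) inner relation
        p_vec = E.T @ t / (t @ t)                # (7) X loadings
        E = E - np.outer(t, p_vec)               # (8) residuals of X
        F = F - b[d] * np.outer(t, q)            #     and of Y
        W[:, d], P[:, d], Q[:, d] = w, p_vec, q
    return W, P, Q, b
```

With these conventions the regression matrix of the solution slide follows as B = W @ np.linalg.inv(P.T @ W) @ np.diag(b) @ Q.T.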