Chemometrics The secrets behind multivariate methods in a nutshell!


Chemometrics: The secrets behind multivariate methods in a nutshell! "Statistics means never having to say you're certain." We will use spectroscopy as the analytical method to explain the most commonly applied multivariate models and their pros and cons: Least Squares Regression (LSR) as the starting point, Classical Least Squares Regression (CLS), Inverse Least Squares Regression (ILS), Principal Component Analysis/Regression (PCA/PCR), Partial Least Squares Regression (PLS) (remember: we are still thinking linear!), and, for the nonlinear case, Artificial Neural Networks (ANN)!

What do we want? Take advantage of changes we do not see! [Figure: IR absorbance spectra of salt ions in water (NaCl, KCl, NaBr, KBr, CaCl2, MgCl2 at 1, 3 and 5 % w/v; Na2SO4 at 0.09, 0.25 and 0.5 % w/v) over the wavenumber range ~2000-1000 cm-1.]

What do we want? Another example (courtesy K. Booksh, ASU): 80 corn flour samples, NIR reflectance measurements (differences?), calibrate for moisture, oil, protein, and starch. [Figure: NIR reflectance spectra (%R) of the corn flour samples, 1000-2500 nm.]

What we really want Calibration models! Quantitation of analytical results (in our case spectral analysis) requires prior training of the system with samples of known concentration! Simplest case: measurement of the band height or band area of samples with known concentration and comparison of the raw numeric values to the unknown sample (measured under the same conditions). (Note: one-point calibration) Requirement: well-resolved bands, but how about the real world???

A little more sophisticated Calibration equations instead of single values! Create a calibration equation (or series of equations) based on a set of standard mixtures (= training set). The set reflects the composition of the unknown samples as closely as possible. The set spans the expected range of concentrations and composition. The training set is measured under the same conditions (e.g. path length, sampling method, instrument, resolution, etc.) as the unknown sample.

A little more sophisticated Calibration equations instead of single values! c_a = A·(area_a) + B, or c_a = A·(height_a)² + B·(height_a) + C, etc. A, B, C are calibration coefficients. The coefficients are usually not known. Samples with known concentrations (training set). The minimum number of calibration samples is the number of unknown coefficients! Usually more samples are measured to improve the accuracy of the calibration coefficients. (Note: robust model; minimize the sum of squared errors/residuals) Repeated measurements of the same concentration for averaging (Note: noise!).

A little more sophisticated Calibration equations instead of single values! Best way to find the calibration coefficients: Least Squares Regression (LSR)! Calculate the coefficients of the given equation such that the differences between the known responses (peak areas or heights) and the predicted responses are minimized. The areas of the spectral component band and the concentrations are used to compute the coefficients of the calibration equation by Least Squares Regression.
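A minimal numerical sketch of this step (not from the slides; all band areas and concentrations below are hypothetical stand-ins) showing how the coefficients of c = A·area + B could be obtained by least squares with NumPy:

```python
import numpy as np

# Hypothetical training data: band areas measured for standards of known concentration.
areas = np.array([0.12, 0.25, 0.38, 0.49, 0.63, 0.74])   # peak areas (a.u.)
conc  = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])          # known concentrations (% w/v)

# Fit c = A*area + B by least squares (more samples than coefficients -> averaging effect).
A, B = np.polyfit(areas, conc, deg=1)

# Residuals show how well the calibration line fits the training standards.
residuals = conc - (A * areas + B)

# Predict an "unknown" sample from its measured band area.
area_unknown = 0.41
c_pred = A * area_unknown + B
print(f"c = {A:.3f}*area + {B:.3f}; predicted conc: {c_pred:.2f} % w/v")
```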

LSR The easy way! What do we need to pay attention to: If there is more than one component in the samples, a separate band must be used for each. Hence, one equation is necessary for each component. LSR assumes that the absorbance measurement of peak height or peak area is the result of only one component. Hence, results are not accurate for mixtures with overlapping bands. Predictions will have large errors if interfering (spectrally overlapping) components are present. Solution: more sophisticated statistics!

More sophisticated getting closer to real world samples! We should calculate the absorptivity coefficients across a much larger portion of the spectrum. Beer's law: A_λ = ε_λ·c·b. Assuming that for a given λ, ε_λ and b remain constant, we define the constant K_λ = ε_λ·b. Solving this equation: measure the absorbance of a single sample of known concentration and use these values to solve for the given K_λ. Prediction of an unknown sample: c = A_λ / K_λ.

Classical Least Squares Regression (CLS) also called K-Matrix! Problem? Basing an entire calibration on a single sample is generally not a good idea (noise, instrument error, sample handling error, etc.). Solution? Measure the absorbances of a series of different concentrations and calculate the best fit line through all the data points (see LSR).

CLS has more problems Sample with two constituents: an algebraic solution requires # equations = # unknowns. Let's consider that the absorbances of components A and B occur at different λ and that the absorptivity constants K_λ are different. We can solve each equation independently provided that the spectrum of one constituent does not interfere with the spectrum of the other: A_λ1 = c_A·K_A,λ1 and A_λ2 = c_B·K_B,λ2. Do you see a problem with that?

CLS has more problems Yes, there is a problem! The equations make the assumption that the absorbance at λ1 is entirely due to constituent A and that at λ2 entirely due to constituent B! Similar to LSR, we need to find two λ in the spectra of the training set exclusively representing constituents A and B. Difficult with complex mixtures or simple mixtures of similar materials! Do you see a solution to that?

CLS Beer's law can help us: absorbances of multiple constituents at the same λ are additive. A_λ1 = c_A·K_A,λ1 + c_B·K_B,λ1 and A_λ2 = c_A·K_A,λ2 + c_B·K_B,λ2. Did we forget something? We assume there is no error in the measurement (i.e. the calculated least squares line(s) that best fit the calibration samples are perfect). Unfortunately, this never happens! Hence, we add a variable E describing the residual error between the least squares fit line and the actual absorbances.

CLS Now we write: A_λ1 = c_A·K_A,λ1 + c_B·K_B,λ1 + … + c_n·K_n,λ1 + E_λ1; A_λ2 = c_A·K_A,λ2 + c_B·K_B,λ2 + … + c_n·K_n,λ2 + E_λ2; …; A_λm = … As with most calibration models, CLS requires many more training samples to build an accurate calibration. Hence, we need to solve many equations (for many constituents and λs). Solution? Linear algebra: formulating the equations into a matrix, something every PC is craving for!

CLS If we solve the equations for the K matrix, we can use the resulting best fit least squares line(s) to predict the concentrations of unknowns: K = A·C⁻¹ (Note: check back on matrix algebra from the beginning of this class!). Advantage compared to LSR: we can use parts of or the entire spectrum for calibration. The averaging effect increases the accuracy of prediction. If the entire spectrum is used for calibration, the rows of the K matrix are actually spectra of the absorptivities for each of the constituents, which look very similar to the pure constituent spectra.
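As a rough sketch of the K-matrix idea (not from the slides; the spectra below are simulated stand-ins and all numbers are hypothetical), CLS calibration and prediction can be written with NumPy least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pure-constituent absorptivity spectra (2 constituents x 50 wavelengths).
K_true = np.abs(rng.normal(size=(2, 50)))

# Training set: known concentrations (6 samples x 2 constituents) and mixture spectra A = C*K + noise.
C = rng.uniform(0.5, 5.0, size=(6, 2))
A = C @ K_true + 0.01 * rng.normal(size=(6, 50))

# CLS calibration: solve A ~ C*K for K in the least-squares sense (the "K-matrix").
K, *_ = np.linalg.lstsq(C, A, rcond=None)

# Prediction: for an unknown spectrum a, solve a ~ c*K for the concentrations c.
c_unknown = np.array([2.0, 1.5])
a_unknown = c_unknown @ K_true + 0.01 * rng.normal(size=50)
c_pred, *_ = np.linalg.lstsq(K.T, a_unknown, rcond=None)
print("predicted concentrations:", c_pred)
```

Note that the full concentration matrix C of every constituent must be known for the calibration step, which is exactly the CLS limitation discussed below.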

CLS The Good, the Bad and the Ugly
Advantages:
- Based on Beer's law.
- Calculations are relatively fast.
- Applicable to moderately complex mixtures.
- Calibrations do not require wavelength selection as long as the # wavelengths exceeds the # constituents.
Disadvantages:
- Requires knowing the entire composition and concentration of every constituent of the training set.
- Limited applicability for mixtures with constituents that interact.
- Very susceptible to baseline effects (e.g. drifts), as the equations assume the response at a wavelength is only due to the calibrated constituents.

Again more sophisticated getting even closer to real world samples! Beer's law: A_λ = ε_λ·c·b. No interference in the spectrum between the individual sample constituents. Concentrations of all the constituents in the samples are known ahead of time. Very unlikely for real world samples! Solution: let's rearrange Beer's law again!

Inverse Least Squares Regression (ILS) also called P-Matrix or Multiple Linear Regression (MLR)! Beer's law rearranged: c = A_λ / (ε_λ·b). Assuming that for a given λ, ε_λ and b remain constant, we define the constant P = 1/(ε_λ·b). Now we can write: c = P·A_λ + E. In this expression Beer's law says that the concentration is a function of the absorbances at a series of given wavelengths!??? Where is the difference/advantage to CLS???

ILS CLS: A_λ1 = c_A·K_A,λ1 + c_B·K_B,λ1 + E_λ1; A_λ2 = c_A·K_A,λ2 + c_B·K_B,λ2 + E_λ2. The absorbance at a single wavelength is calculated as an additive function of the constituent concentrations, i.e. the concentrations of ALL components need to be known! ILS: c_A = A_λ1·P_A,λ1 + A_λ2·P_A,λ2 + E_A; c_B = A_λ1·P_B,λ1 + A_λ2·P_B,λ2 + E_B. NOTE: even if the concentrations of all the other constituents in the mixture are not known, the matrix of coefficients P can still be calculated correctly!!!

ILS Consequently: Only the concentrations of the constituents of interest need to be known. No knowledge of the rest of the sample composition is needed. What do we need? Selected wavelengths must lie in a region where the constituent of interest contributes to the overall spectrum. Measurements of the absorbances at different wavelengths are needed for each constituent. A measurement at at least one additional λ is needed for each additional independent variation (constituent) in the spectrum.

ILS Matrix algebra is helping again: P = C·A⁻¹. Now we can accurately build models for complex mixtures when only some of the constituent concentrations are known. We just need to select wavelengths corresponding to the absorbances of the desired constituents. So where is the UGLY?
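A minimal ILS/MLR sketch (again with simulated, hypothetical data): only the concentration of one constituent of interest is known, and it is regressed against absorbances at a few selected wavelengths.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training set: 20 mixture spectra (200 wavelengths); only constituent A's concentration is known.
n_samples, n_wl = 20, 200
spectra = rng.normal(scale=0.02, size=(n_samples, n_wl))   # stand-in for measured absorbances
c_A = rng.uniform(1.0, 5.0, size=n_samples)
spectra[:, 40] += 0.8 * c_A                                  # constituent A contributes at a few wavelengths
spectra[:, 41] += 0.5 * c_A

# ILS/MLR: pick a few wavelengths where A absorbs and regress c_A on those absorbances.
selected = [40, 41, 42]                                      # wavelength selection (# wavelengths < # samples!)
X = np.column_stack([spectra[:, selected], np.ones(n_samples)])   # add an intercept term
P, *_ = np.linalg.lstsq(X, c_A, rcond=None)                  # the "P-matrix" coefficients

# Prediction for an "unknown" spectrum needs only its absorbances at the selected wavelengths.
a_unknown = spectra[0]                                       # pretend the first sample is unknown
c_pred = np.append(a_unknown[selected], 1.0) @ P
print("predicted c_A:", round(float(c_pred), 3))
```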

ILS Dimensionality of matrix equations: (see also: very beginning of this class on matrix algebra!) The number of selected wavelengths cannot exceed the number of training samples. Collinearity: (see also: linear independence!) We could measure many more training samples to allow for additional wavelengths. BUT: absorbances in a spectrum tend to all increase and decrease together as the concentrations of the constituents in the mixture change.

ILS Overfitting: (see also: chance correlations!) In general, starting from very few λ and adding more to the model (of course selected to reflect the constituents of interest) will improve the prediction accuracy. BUT: as the number of λ in the calibration equations increases, the likelihood that "unknown" samples will vary in exactly the same manner decreases and the prediction accuracy goes down again. Noise: (see also: issues of noise!) If too much information (too many λ) is used to calibrate, the model starts to include the spectral noise (which is unique to the training set only) as constituent signal and the prediction accuracy for unknown samples suffers.

ILS Consequence: The averaging effect gained by selecting many wavelengths as in CLS is effectively lost. Wavelength selection is critically important to building an accurate ILS model. Ideal situation: selecting sufficient wavelengths to compute accurate least squares line(s), yet few enough that the calibration is not (overly) affected by the collinearity of the spectral data. Hence optimization of the model is required! Advantage: the main advantage of this multivariate method is the ability to calibrate for a constituent of interest without having to account for interferences in the spectra.

ILS The Good, the Bad and the Ugly
Advantages:
- Based on Beer's law.
- Calculations are relatively fast.
- True multivariate model, which allows calibration of very complex mixtures since only knowledge of the constituents of interest is required. (Note: multivariate because the concentration (dependent variable) is solved by calculating a solution from the responses at several selected wavelengths (multiple independent variables).)
Disadvantages:
- Wavelength selection can be difficult and time consuming.
- Collinearity of wavelengths must be avoided.
- The # wavelengths used in the model is limited by the # calibration samples.
- A large number of samples is required for accurate calibration.

Maybe we need to think more abstractly to solve some of these problems Why? In the spectrum of real world samples many different variations contribute, incl. constituents in the mixture, interactions between constituents, instrument variations (e.g. detector noise), changing ambient conditions affecting baseline and absorbance, sample handling, etc. What do we hope for? That the largest variations in the calibration set are the changes in the spectrum due to the different concentrations of the constituents!

Do we need the absolute absorbance values in the spectrum? Well, even if many complex variations affect the spectrum, there should be a finite number of independent common variations in the spectral data. Conclusion? If we can calculate a set of variation spectra representing the changes in the absorbances at all wavelengths in the spectra, this data could be used instead of the raw spectral data for building the calibration model!

Let's continue this idea Can we reconstruct a spectrum from variations? If we multiply a sufficient number of variation spectra each with a different constant scaling factor and add the results together, we should be able to reconstruct a real spectrum. Each spectrum in the calibration set would have a different set of scaling constants for each variation since the concentrations of the constituents are all different. Hence, the fraction of each variation spectrum that must be added to reconstruct the unknown data should be related to the concentration of the constituents.

What does this mean mathematically? Variation spectra are called Eigenvectors! (also: loading vectors or principal components) The scaling constants applied to reconstruct the spectra (with which we multiply the variation spectra = Eigenvectors) are called scores. What do we need to do? We break down a spectroscopic data set into its most basic variations.

Variations vs. Absorbances [Figure: spectrum of a 3-component mixture.]

which is called: Principal Component Analysis (PCA) Why do we want to do that? Because there should be much fewer common variations in the calibration spectra than the number of calibration spectra. Hence, because we are lazy (or time is pressing) we expect to significantly reduce the number of calculations for the calibration equations!

Principal Component Analysis (PCA) Why does prediction work on the basis of Eigenvectors? The calculated eigenvectors derive from the original calibration data (spectra). Hence, they must inherently relate to the concentrations of the constituents making up the samples. Consequently, the same loading vectors (eigenvectors, principal components) can be used to predict unknown samples. Consequence: the only difference between the spectra of samples with different constituent concentrations is the fraction of each added loading vector (the scores).

Principal Component Analysis (PCA) What do we need to do? The calculated scores are unique to each separate principal component and to each training spectrum. In fact, they can be used in lieu of absorbances in either of the classical model equations in CLS or ILS. Consequently, the representation of the mixture spectrum is reduced from many wavelengths to a few scores.

Principal Component Analysis (PCA) Now comes the real advantage! We can use the ILS expression of Beer's law (c = P·A_λ + E) to calculate concentrations, as this allows us to calculate concentrations among interfering species. At the same time the calculations maintain the averaging effect of CLS by using a large number of wavelengths in the spectrum (up to the entire spectrum) for calculating the eigenvectors. Eigenvector models combine the best of both worlds!

Principal Component Analysis (PCA) What do we get? Spectral data condensed into the most prevalent spectral variations (principal components, eigenvectors, loadings) and the corresponding scaling coefficients (scores).

Principal Component Analysis (PCA) Difference to CLS and ILS: PCA models base the concentration predictions on changes in the data ("variation spectra") and not on absolute absorbance measurements. Conclusion: in order to establish a PCA model, the spectral data must change. Simplest way: vary the concentrations of the constituents of interest in the training set. Important: avoid collinearity, i.e. 2 or more components in the calibration samples should not be present in the same ratio (e.g. A and B are present in the stock solution in ratio 2:1 and the training set is prepared by dilution of that solution)! PCA will detect only ONE variation! Calibration of Eigenvector models requires randomly distributed ratios of the constituents of interest.

Principal Component Analysis (PCA) Mean centering of data: Data is commonly mean centered prior to PCA. Mean centering: the mean spectrum (average spectrum) is calculated from all calibration spectra and then subtracted from every calibration spectrum. Effect: enhancement of small differences between spectra, as the changes in the absorbance data are what matters and not the absolute absorbance (i.e. the data is not falsified!). Following mean centering, a set of Eigenvectors (principal components) is created that represents the changes in the absorbances common to all calibration spectra. After the training data has been fully processed by the PCA algorithm, two main matrices remain: + the Eigenvectors (spectra) + the scores (the eigenvector weighting values for all the calibration spectra)
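A minimal NumPy sketch of mean centering and extracting eigenvectors (loadings) and scores via singular value decomposition; the calibration spectra below are random stand-ins and all numbers are hypothetical. It also reconstructs the data, which corresponds to the matrix expression A = S·F + E_A on the next slide.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical calibration spectra: 15 samples x 100 wavelengths.
A = rng.normal(size=(15, 100))

# Mean centering: subtract the average spectrum from every calibration spectrum.
mean_spectrum = A.mean(axis=0)
A_centered = A - mean_spectrum

# PCA via singular value decomposition of the centered data.
U, s, Vt = np.linalg.svd(A_centered, full_matrices=False)

n_pc = 3                        # number of principal components kept
F = Vt[:n_pc]                   # eigenvectors / loadings ("variation spectra"), shape (n_pc, 100)
S = U[:, :n_pc] * s[:n_pc]      # scores: weighting of each eigenvector for each spectrum, shape (15, n_pc)

# Reconstruction and residual spectra E_A (adding the mean spectrum back because the data was centered).
A_reconstructed = S @ F + mean_spectrum
E_A = A - A_reconstructed
print("explained variance per PC:", (s[:n_pc]**2 / (s**2).sum()).round(3))
```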

Principal Component Analysis (PCA) Matrix expression of PCA: A = S·F + E_A
- E_A is the error matrix describing the model's ability to predict the calibration absorbances; it has the same dimensionality as the A matrix.
- E_A is called the matrix of residual spectra. (Note: see residual analysis!)
(Note: the mean spectrum is only added back if the data was mean centered; the spectral residual E_A is the difference between the reconstructed spectrum and the original calibration spectrum. Keep in mind: no model is perfect!)

Principal Component Analysis (PCA) Now the question arises: how many PCs do we need to model our data? (Note: same question for the # of factors in PLS)
- The calculated Eigenvectors are ordered by their degree of importance to the model; in the case of PCA the decisive parameter is the variance.
- If too many PCs are taken into account ("overfit"), the Eigenvectors will begin modeling the system noise as the smallest contributions of variance in the training data set.
- This is great: if we select the correct number of PCs we effectively filter out the noise!
- However: if the number of PCs is too small ("underfit"), the concentration prediction for unknown samples will suffer.
- So here is the task: define a model that contains enough orthogonal (linearly independent) Eigenvectors to properly model the components of interest without adding too much contribution from noise! But how?

Principal Component Analysis (PCA) Calculate the PRESS value for every possible factor (PRESS = Prediction Residual Error Sum of Squares)
- We build a calibration model with a number of factors.
- Then we predict some samples of known concentration (usually from the training set) with the model.
- The sum of the squared differences between the predicted and known concentrations gives the Prediction Residual Error Sum of Squares for that model: PRESS = Σ_i Σ_j (Cp_ij - C_ij)², summed over all n samples and m constituents.
(n is the number of samples in the training set; m is the number of constituents; Cp is the matrix of predicted sample concentrations from the model; C is the matrix of known concentrations of the samples)

Principal Component Analysis (PCA) Self prediction: - Models built using all the spectra in the training set. - Then the same spectra are predicted back against these models. - Disadvantage: all vectors calculated exist in all training spectra. Hence, the PRESS plot will continue to fall as new factors are added to the model and will never rise. This gives the (false) impression that all vectors are constituent vectors and that there are no noise vectors to eliminate (which is never the case!). - However, there is one tempting advantage this method is very fast as the model is only built once! Better way: Cross validation

Principal Component Analysis (PCA) Cross validation (1): - Again, unknown samples emulated by training set. - However: sample to be predicted left out during calibration. - Procedure repeated until every calibration sample has been left out and predicted at least once. The calculated squared residual error is added to all the previous PRESS values. - Disadvantage: time consuming as re-calculation is required for every left-out sample. - However, as the predicted samples are not the same as the samples used to build the model, the calculated PRESS value is a very good indication of the error in the accuracy of the model when used to predict "unknown" samples in the future! Hence: The only recommended way!
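A rough sketch of leave-one-out cross-validation for a PCR-style model (not taken from the slides; the data is simulated and all names are hypothetical): for each candidate number of PCs, every sample is left out once, predicted from a model built on the remaining samples, and the squared prediction residuals are accumulated into PRESS.

```python
import numpy as np

def press_loo(A, C, max_pcs):
    """Leave-one-out cross-validation PRESS per number of PCs.
    A: (n_samples, n_wavelengths) spectra; C: (n_samples, n_constituents) known concentrations."""
    n = A.shape[0]
    press = np.zeros(max_pcs)
    for i in range(n):                                    # leave sample i out
        keep = np.arange(n) != i
        A_cal, C_cal = A[keep], C[keep]
        mean_a, mean_c = A_cal.mean(axis=0), C_cal.mean(axis=0)
        U, s, Vt = np.linalg.svd(A_cal - mean_a, full_matrices=False)
        for k in range(1, max_pcs + 1):
            F = Vt[:k]                                    # loadings for k PCs
            S = (A_cal - mean_a) @ F.T                    # calibration scores
            B, *_ = np.linalg.lstsq(S, C_cal - mean_c, rcond=None)   # regress conc. on scores
            s_out = (A[i] - mean_a) @ F.T                 # score of the left-out sample
            c_pred = s_out @ B + mean_c
            press[k - 1] += np.sum((c_pred - C[i]) ** 2)  # accumulate squared prediction residuals
    return press

# Hypothetical example: choose the number of PCs where PRESS reaches its minimum.
rng = np.random.default_rng(3)
C = rng.uniform(1, 5, size=(12, 2))
K = np.abs(rng.normal(size=(2, 80)))
A = C @ K + 0.05 * rng.normal(size=(12, 80))
press = press_loo(A, C, max_pcs=6)
print("PRESS per number of PCs:", press.round(2), "-> best:", int(press.argmin()) + 1, "PCs")
```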

Principal Component Analysis (PCA) Cross validation (2): - Initially the prediction error (PRESS value) decreases as new Eigenvectors (PCs) are added to the model. This indicates that the model is still underfit and there are not enough factors to completely account for the constituents of interest. - At some point the PRESS values reach a minimum and then start to rise again as PCs that contain uncorrelated noise are added, indicating that the model is overfit.

Principal Component Regression (PCR) PCA combined with ILS: - Quantitative models for complex samples can be established. - Instead of directly regressing the constituent concentrations against the spectroscopic response via Beer's law, we regress the concentrations against the PCA scores. - The Eigenvectors of the PCA decomposition represent the spectral variations common to all of the spectroscopic calibration data. - We can use that information to calculate a regression equation providing a robust model for predicting concentrations of the desired constituents in very complex samples (instead of directly utilizing absorbances).

Principal Component Regression (PCR) How does it work? - Let's compare against the techniques we know: CLS: K = A·C⁻¹, A_λ1 = c_A·K_A,λ1 + c_B·K_B,λ1 + E_λ1; ILS: P = C·A⁻¹, c_A = A_λ1·P_A,λ1 + A_λ2·P_A,λ2 + E_A; PCA: A = S·F + E_A. - The F-Matrix in PCA (containing the PCs) has a similar function as the K-Matrix in CLS: it stores the spectral (or spectral variance) data of the constituents. The F-Matrix needs the S-Matrix (scores) to be useful; likewise, the K-Matrix needs the C-Matrix. - The scores summarized in the S-Matrix are unique to each calibration spectrum. - An optical spectrum is represented by a collection of absorbances at a series of wavelengths. In analogy, the very same spectrum can be represented by a series of scores for a given set of factors. Hence: we can regress the concentrations (C-Matrix) against the scores (similar to the classical approach of regressing the concentrations against the absorbances, i.e. the A-Matrix).

Principal Component Regression (PCR) Using the ILS approach we can formulate: C = B·S + E_C, where C represents the constituent concentration matrix, B the matrix of regression coefficients and S the scores matrix from the PCA. - Now we understand why this approach is called PCR: we combine PCA (first step) with ILS regression (second step) to solve the calibration equation for the model. In contrast (and as we shall see later), partial least squares (PLS) regression performs these operations in one step. - We can use A = S·F rearranged to S = A·F⁻¹ (neglecting the error matrix for simplicity): C = B·A·F⁻¹ + E_C, the PCR model equation.
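A compact PCR sketch using off-the-shelf scikit-learn building blocks (an assumption of this example, not a tool named in the slides; the spectra and concentrations are simulated): PCA supplies the scores, and a linear regression then relates the scores to the concentrations.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(4)

# Hypothetical training data: spectra (A matrix) and known concentrations (C matrix).
C_train = rng.uniform(1, 5, size=(30, 2))
K = np.abs(rng.normal(size=(2, 120)))
A_train = C_train @ K + 0.02 * rng.normal(size=(30, 120))

# Step 1: PCA decomposition of the spectra (scores S, loadings F).
# Step 2: regress the concentrations against the scores (ILS-style): C = B*S + E_C.
pcr = make_pipeline(PCA(n_components=4), LinearRegression())
pcr.fit(A_train, C_train)

# Prediction of an "unknown" spectrum: project onto the loadings, then apply the regression.
a_unknown = np.array([2.5, 1.0]) @ K
print("predicted concentrations:", pcr.predict(a_unknown.reshape(1, -1)).round(2))
```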

Principal Component Regression (PCR) IMPORTANT (1): - The PCR calibration model is a two-step process: (1) PCA Eigenvectors and scores are calculated; (2) the scores are regressed against the constituent concentrations using a regression method similar to ILS. - NOTE: Remember that the ILS approach can build accurate calibrations, provided that the selected variables are physically related to the constituent concentrations. However, the PCA factors/scores are calculated independently of any knowledge of these concentrations and represent only the largest common variations among all the spectra in the training set. - We assume that these variations will be mostly related to changes in the constituent concentrations, but there is no guarantee this will be true.

Principal Component Regression (PCR) IMPORTANT (2): - Practically, many PCR models include more factors than are actually necessary, as some of the Eigenvectors are probably not related to any of the constituents of interest. - Ideally, a PCR model should be built by performing a selection on the scores (similar to wavelength selection in an ILS model), determining which factors should be used to build a model for each constituent. - As these selection rules are difficult to establish and to wrap into algorithms, the corresponding treatment is not included in most chemometrics packages! That's why we have yet another technique: PLS ;-)

PCA/PCR The Good, the Bad and the Ugly
Advantages:
- Does not require wavelength selection; usually the whole spectrum or large regions are used (though scores selection might be advantageous sometimes!).
- The larger number of wavelengths provides an averaging effect (the model is less susceptible to spectral noise).
- PCA data compression (many fewer PCs than spectra) allows using inverse regression to calculate model coefficients, calibrating only for the constituents of interest.
- Can be used for very complex mixtures since only knowledge of the constituents of interest is required.
- Can sometimes be used to predict samples with constituents (contaminants) not present in the original calibration mixtures.

PCA/PCR The Good, the Bad and the Ugly
Disadvantages:
- Calculations are slower than most classical methods (not a tremendous problem nowadays given the available computation power).
- Optimization requires some knowledge of PCA; models are more complex to understand and interpret.
- A large number of samples is required for accurate calibration. Hence preparation/collection of calibration samples that avoid collinearity of the constituent concentrations can be difficult.

Partial Least Squares (PLS) yet another technique! More focused on concentrations - PLS is closely related to PCA. - Main difference: spectral decomposition uses concentration information provided in the training set. - PCA: first we decompose spectral matrix into set of Eigenvectors and scores; then we regress them against the concentrations in a separate step. - PLS: concentration information used already during the decomposition process; hence, spectra containing higher constituent concentrations weighted more heavily than spectra containing low concentrations. - Consequence: Eigenvectors and scores calculated using PLS are different from those in PCR. The main idea of PLS is to get as much concentration information as possible into the first few loading vectors

Partial Least Squares (PLS) Here's what we do - PCA decomposes spectra into the most common variations. - PLS takes advantage of the correlation relationship already existing between the spectral data and the constituent concentrations and decomposes the concentration data also into the most common variations. - Consequently: two sets of vectors (one set for the spectral data; one set for the constituent concentrations) and two sets of corresponding scores are generated for the calibration model.

Partial Least Squares (PLS) Let's contrast the results of PCA and PLS: - PCA decomposes first and then performs the regression. - PLS performs the decomposition of spectral and concentration data simultaneously (= the regression is already included in one step). Principal components are called factors in PLS. [Figure: schematic comparison of the PCA and PLS decompositions.]

Partial Least Squares (PLS) How does it work? - We will not derive the algorithms. For those interested in applying the methodology, guidelines for implementation will be posted on the web. - It is assumed that the two sets of scores (spectral and concentration scores) are related to each other through some type of regression (which appears natural as the spectral features are dominated by the constituent concentrations). Hence, a calibration model can be constructed. - As each new factor is calculated for the model, the scores are "swapped" before the contribution of the factor is removed from the raw data. The reduced data matrices are then used to calculate the next factor. This process is repeated until the desired number of factors is calculated.

Partial Least Squares (PLS) Main difference between PCA and PLS - In PLS the resulting spectral vectors are directly related to the constituents of interest. - In PCR the vectors only represent the most common spectral variations in the data completely ignoring their relation to the constituents of interest until the final regression step.

Partial Least Squares (PLS) to make it even more complicated there is a PLS-1 and a PLS-2 algorithm! - PLS-1 is the procedure we just discussed, which results in a separate set of scores and loading vectors for each constituent of interest (see previous slide). Hence, the calculated vectors are optimized for each individual constituent. - PLS-2 basically adopts the strategy of PCA and calibrates for all constituents simultaneously. Hence, the calculated vectors are not optimized for each individual constituent. - Consequence: in principle, the predictions derived from PLS-1 should be more accurate than PLS-2 and PCA. But: speed of calculation! (Note: a separate set of eigenvectors and scores must be calculated for every constituent of interest; training sets with a large number of samples and constituents will significantly increase the time of calculation.)
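A sketch of the PLS-1 vs. PLS-2 distinction using scikit-learn's PLSRegression as an off-the-shelf stand-in for the algorithm (an assumption of this example; the spectra, constituent names and concentration ranges below are hypothetical): PLS-2 fits one model to all constituents at once, while PLS-1 fits a separate model, with its own factors, per constituent.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(5)

# Hypothetical training data: 40 spectra, 3 constituents with very different concentration ranges.
conc = np.column_stack([
    rng.uniform(40, 60, 40),      # constituent A: 40-60 %
    rng.uniform(5, 8, 40),        # constituent B: 5-8 %
    rng.uniform(0.1, 0.5, 40),    # constituent C: 0.1-0.5 %
])
K = np.abs(rng.normal(size=(3, 150)))
spectra = conc @ K + 0.05 * rng.normal(size=(40, 150))

# PLS-2: one model calibrated for all constituents simultaneously.
pls2 = PLSRegression(n_components=5).fit(spectra, conc)

# PLS-1: a separate model (own factors/loadings) optimized for each constituent.
pls1_models = [PLSRegression(n_components=5).fit(spectra, conc[:, j]) for j in range(3)]

a_unknown = spectra[:1]           # pretend the first training spectrum is an unknown
print("PLS-2 prediction:", pls2.predict(a_unknown).ravel().round(2))
print("PLS-1 predictions:", [round(m.predict(a_unknown).ravel()[0], 2) for m in pls1_models])
```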

Partial Least Squares (PLS) Advantage of PLS: - For systems that have constituent concentrations that are widely varied. - Example: calibration spectra contain A in concentration range 40-60%, B in concentration range 5-8% and C in concentration range 0.1-0.5%. - Here, PLS-1 will very likely predict better than PLS-2 or PCA. - If the concentration ranges of A, B and C are approx. the same, PCA and PLS-2 will perform with similar predictive quality, however, PLS-1 will definitely take longer to calculate.

PLS The Good, the Bad and the Ugly
Advantages:
- Combines the full spectral coverage of CLS with the partial composition regression of ILS (the "best of both worlds" argument!).
- Single-step decomposition and regression.
- Eigenvectors/factors are directly related to the constituents of interest rather than to the largest common spectral variations.
- Calibrations are generally more robust if the calibration set accurately reflects the range of variability expected in unknown samples.
- Can sometimes be used to predict samples with constituents (contaminants) not present in the original calibration mixtures.
- In general, the literature argues that PLS has superior predictive ability. HOWEVER: there are many published examples where certain calibrations simply have performed better using PCR or PLS-2 instead of PLS-1!!!

PLS The Good, the Bad and the Ugly
Disadvantages:
- Extensive calculation times.
- Models are fairly abstract and difficult to understand and interpret.
- A large number of samples is required for accurate calibration. Hence preparation/collection of calibration samples that avoid collinearity of the constituent concentrations can be difficult.

Decision maker on the correct method

What do we want? 80 corn flour samples, NIR reflectance measurements (differences?), calibrate for moisture, oil, protein, and starch. The 80 samples are split into 40 calibration, 20 validation and 20 test samples. [Figure: NIR reflectance spectra (%R) of the corn flour samples, 1000-2500 nm.]

Corn flour samples PCR vs. PLS for Oil in Corn Flour (left: PCR; right: PLS)

PCR:
Factor   % Spectral Variance   Cumulative % Spectral Variance
1        88.21                 88.21
2        6.57                  94.78
3        2.16                  96.94
4        0.98                  97.92
5        0.76                  98.68
6        0.46                  99.14
7        0.30                  99.45
8        0.25                  99.70
9        0.11                  99.81
10       0.06                  99.87
11       0.03                  99.90
12       0.03                  99.93
13       0.02                  99.95
14       0.01                  99.96
15       0.01                  99.97

PLS:
Factor   % Spectral Variance   Cumulative % Spectral Variance
1        82.54                 82.54
2        12.19                 94.73
3        0.84                  95.58
4        0.57                  96.15
5        1.78                  97.94
6        0.78                  98.72
7        0.61                  99.33
8        0.11                  99.44
9        0.22                  99.66
10       0.12                  99.78
11       0.07                  99.85
12       0.05                  99.90
13       0.02                  99.92
14       0.01                  99.93
15       0.02                  99.96

Salinity PCR [Figure: IR absorbance spectra of salt ions in water (NaCl, KCl, NaBr, KBr, CaCl2, MgCl2 at 1, 3 and 5 % w/v; Na2SO4 at 0.09, 0.25 and 0.5 % w/v), 2000-1000 cm-1.]

Salinity PCR Spectral evaluation:
- around 3350 cm-1: water absorption is too strong
- around 2100 cm-1: weak water absorption (included)
- around 1635 cm-1: appropriate water absorption (included)
- around 900 cm-1: appropriate water absorption (included)
- around 1100 cm-1: absorption band of the SO4 2- ion
- wavenumber range used for chemometric data evaluation: 2300 cm-1 to 700 cm-1

Salinity PCR [Figure: loadings of the first eight principal components (PC #1 to PC #8) for salt ions in water, 2000-1000 cm-1.]

Salinity PCR Synthetic samples. [Figure: measured vs. input concentration (mmol L-1) for Na+ + K+, Cl-, Br-, Ca2+, Mg2+, and SO4 2-; estimated detection limit: 100 mmol L-1 (0.3 mmol L-1 for SO4 2-).]

Salinity PCR Artificial seawater. [Figure: measured vs. input concentration (mmol L-1) for Na+ + K+ (typ. 479 mmol L-1), Cl- (typ. 559 mmol L-1), Ca2+ (typ. 11 mmol L-1), Br- (typ. 1 mmol L-1), Mg2+ (typ. 54 mmol L-1), and SO4 2- (typ. 29 mmol L-1); estimated detection limits: 100 mmol L-1 and 0.3 mmol L-1 (SO4 2-).]

Salinity PCR Conclusions:
- a sensor is proposed for salinity analysis of aqueous samples
- investigated ions: Cl-, Na+, Mg2+, SO4 2-, Ca2+, K+, Br-
- the measurement principle is based on changes of the water IR spectrum due to the ions (species and concentration dependent)
- multicomponent analysis of several salt ions was successful
- the influence of Na+ and K+ on the water spectrum is too similar to be discriminated
- estimated detection limits: 100 mmol L-1 for all ion species except SO4 2-; 0.3 mmol L-1 for SO4 2-
- Cl-, Na+ + K+, and SO4 2- can be determined at the concentrations present in sea water
- Ca2+, Br-, and Mg2+ are present in real world samples at too low concentrations

Design of Training Data Set In general The quality of the training data set is the most important aspect! The predictive ability of the equations is only as good as the data used to calculate them in the first place! Control the variables: collecting representative samples; accurate primary calibration method; appropriate sample measurements (reproducibility of conditions, etc.).

Design of Training Data Set Training samples similar to unknown samples Training samples should be as similar as possible to the unknown samples! The spectrum of a pristine constituent looks different from when it is part of a mixture! Exception: very simple mixtures; samples in the gas phase. Factor based models (PCA, PLS) can compensate for inter-constituent interactions. BUT only if the training set contains examples of these!!!

Design of Training Data Set Training samples similar to unknown samples If samples are simple mixtures: Few components, distinct absorption features. Use simple models (CLS, ILS)! If samples are complex mixtures: Many components, overlapping absorption features. Use factor based models (PCR, PLS) extracting the relevant information from the spectra and ignoring the rest! HOWEVER: give the model the best chance to learn! Train it using samples that emulate the unknowns as closely as possible.

Design of Training Data Set Training samples similar to unknown samples Strategy: Collect actual samples from the measurement environment (e.g. plant, the field, etc.). Analyze them in the lab using other primary calibration methods (e.g. chromatography, wet chemical test, etc.). This data along with the sample spectra formulates the training set to build a reliable calibration model.

Design of Training Data Set Bracket concentration range Strategy: Constituent values for the training samples should span the expected range of all future unknown samples. Extrapolation is generally not a good idea! External validation is the only way to determine how well a model will predict outside the original calibration range. Consequently: constituent values in the training samples should be larger and smaller than the expected values in unknown samples. Do not hesitate using a lot of calibration samples!

Design of Training Data Set Use enough samples Strategy: Training set must have at least as many samples as there are constituents of interest. Usually many more than that (Note: noise)! How many samples are required to build a good model? As many samples as it takes! The more data fed into the model, the higher your confidence in the prediction!

Design of Training Data Set Use enough samples Strategy: Use a sufficiently large number of samples for calibration to allow sufficient factors in the model. For complex matrices you need enough samples to account for all the variability in the real samples. Note: the maximum number of factors that can be calculated for a given training set is limited by the smallest dimension of the data matrix. Example: if a training set has 300 samples but the calibration regions have only 20 total spectral data points, then the maximum number of factors is limited to 20 as well.

Design of Training Data Set Use enough samples Keep in mind: Quantity AND quality of the data is important! The more samples, the better the discrimination between analytically relevant signatures and noise. BUT: only accumulating a huge number of spectra as a training set will not guarantee a better model; carefully measuring and qualifying a much smaller number for calibration is the way to go!

Design of Training Data Set Constituent collinearity We already know: collinearity is the effect observed when the relative amounts of two or more constituents are constant throughout all the training samples. Why is this a problem? Factor based models do not calibrate by creating a direct relationship between the constituent data and the spectral response. They correlate the change in concentration to corresponding changes in the spectra. If constituents are collinear, multivariate models cannot differentiate them, and the calibrations for these constituents will be unstable.

Design of Training Data Set Constituent collinearity Note: For simple bivariate calibrations we make one stock solution with high concentrations of all constituents of interest. Then we make multiple dilutions of that one mixture to create the remaining samples. This procedure will completely fail for multivariate models! In an eigenvector-based model only one factor will arise, containing nearly all the variance in the training data set. How can we determine that collinearity happens?

Design of Training Data Set Constituent collinearity Plot the sample concentrations of each constituent in the model against the others: If the points fall on a straight line, the concentrations are collinear. If the constituents were uncorrelated they would form a cluster of points!
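The same check can be done numerically instead of graphically; a small sketch with hypothetical training-set concentrations (constituents A and B prepared by serial dilution of one stock, constituent C varied independently):

```python
import numpy as np

# Hypothetical training-set concentrations (columns = constituents A, B, C).
conc = np.array([
    [4.0, 2.0, 1.1],
    [3.0, 1.5, 0.4],
    [2.0, 1.0, 0.9],
    [1.0, 0.5, 0.2],
    [0.5, 0.25, 0.7],
])

# Pairwise correlation of the constituent concentrations:
# values close to +/-1 flag collinear constituents (here A and B, which are always in a 2:1 ratio).
corr = np.corrcoef(conc, rowvar=False)
print(np.round(corr, 2))
```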

Spectral region selection An optimization problem Should we always use the whole spectrum? It is very easy (and convenient) to simply select the entire range of the training spectra as the set of data to use for calibration. PLS and PCR models will certainly be able to figure out the regions in the spectra that are most important for calibration. Since there is no apparent penalty in using as many wavelengths as possible for calibrations, why not just use the entire spectrum?

Spectral region selection An optimization problem There are many reasons why not Regions of the spectrum where either the detector, the spectrometer source or the optics are not effective (e.g. noise at the detector cut-off). Example: including data from wavelengths below the detector cut-off adds randomly distributed and uncorrelated absorbances to the factors. In general, selecting only the highly correlated regions of the spectrum for calibration will improve the accuracy for predicting the constituents of interest (Note: we evaluate the changes in the spectra!).

Spectral region selection An optimization problem There are more reasons why not Use this information along with your chemical knowledge of the samples to pre-select spectral regions for inclusion in a calibration. PCA and PLS factor analysis can correct for some non-linearities (e.g. deviations from Beer's law). However, they cannot correct for regions of over-absorbance (total absorbance).

Spectral region selection An optimization problem What is the price to pay? Discovery of impurities in the samples or unknown absorbers may be impaired. If the spectral bands of the impurities do not appear in the selected calibration regions, then there will be no indication that the predicted constituent values are potentially incorrect. Only a problem if samples are entirely unknown, which is a problem as such anyway (see also: how much should we know about the sample to establish reliable calibration models)!

Spectral region selection How do we determine the useful regions? Correlation analysis Calculate the correlation of the absorbance at every wavelength in the training spectra to the concentrations of constituents. Regions that show high correlation are regions that should be selected for calibration, regions that show low or no correlation should be ignored.
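A small sketch of this correlation analysis (simulated spectra; the "correlated" region and the 0.8 threshold are arbitrary, hypothetical choices): the correlation coefficient R between the absorbance at each wavelength and the constituent concentration is computed across the training set.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical training set: 20 spectra (300 wavelengths) and the known concentration of one constituent.
conc = rng.uniform(1, 10, size=20)
spectra = rng.normal(scale=0.05, size=(20, 300))
spectra[:, 120:140] += 0.1 * conc[:, None]      # region genuinely correlated with the constituent

# Correlation (R) of the absorbance at every wavelength with the constituent concentration.
a_c = spectra - spectra.mean(axis=0)
c_c = conc - conc.mean()
R = (a_c * c_c[:, None]).sum(axis=0) / (
    np.sqrt((a_c ** 2).sum(axis=0)) * np.sqrt((c_c ** 2).sum()))

# Select wavelengths with high |R| (or R**2) as candidate calibration regions.
candidate_regions = np.where(np.abs(R) > 0.8)[0]
print("highly correlated wavelength indices:", candidate_regions)
```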

Spectral region selection How do we determine the useful regions? 20 FT-IR spectra of ethanol/water mixtures. Coefficient of determination (R²). (Note: goodness of fit for linear regression; 1 = perfect fit; values near 1 indicate regions of high correlation between the spectral absorbances and the constituent concentrations; regions near 0 are not correlated.) Linear correlation (R). (Note: this type of correlation plot not only indicates regions of the spectrum that are correlated to the constituents but also the type of correlation.) Negative correlation in (R): two-constituent mixture; an increase in ethanol concentration gives a corresponding decrease in water.

Spectral region selection How do we determine the useful regions? Reason for negative correlation: increasing the concentration of one constituent in a mixture "dilutes" the others. If the dilution occurs as a ratio function of the increase of the constituent added, a negative correlation will appear. In most cases, these regions are as useful for calibration as the positively correlated regions! Always true?

Spectral region selection How do we determine the useful regions? No! Collinearity! If the concentrations of the constituents of interest vary as a function of one another (or of other unknown constituents), the correlations will indicate regions that are not really useful. Example: creating a training set by simply making dilutions of a single mixture! Be careful in correlation analysis!

Artificial neural networks What is it? Definition? Sophisticated modeling techniques capable of modeling extremely complex functions. Capable of modeling non-linear relationships. Can handle large numbers of variables. They learn by example: NN user gathers representative data, and then invokes training algorithms to automatically learn the structure of the data. Hence easy to use!

Artificial neural networks What is it? When to apply? NN are applicable in virtually every situation in which a relationship between the predictor variables (independents, inputs) and predicted variables (dependents, outputs) exists. Even when the relationship is very complex and not easy to articulate in the usual terms of correlations or defined differences between groups.

Artificial neural networks What is it? Why neural? Grew out of research in Artificial Intelligence. Attempts to mimic the fault-tolerance and capacity to learn of biological neural systems by modeling the low-level structure of the brain. Idea: the brain is composed of a large number (approx. 10^10) of neurons, massively interconnected (an average of several thousand interconnects per neuron, although this varies enormously ;-)

Artificial neural networks What is it? Why neural? Each neuron is a specialized cell which can propagate an electrochemical signal. A neuron has a branching input structure (dendrites), a cell body, and a branching output structure (axon). The axons of one cell connect to the dendrites of another via a synapse. If a neuron is activated, it fires an electrochemical signal along the axon. The signal crosses the synapses to other neurons, which may in turn fire. A neuron fires only if the total signal received at the cell body from the dendrites exceeds a certain level (the firing threshold).

Artificial neural networks What is it? Why neural? Conclusion: from a very large number of extremely simple processing units (each performing a weighted sum of its inputs, and then firing a binary signal if the total input exceeds a certain level) the brain manages to perform extremely complex tasks. Can we use that model?

Artificial neural networks What is it? The artificial neural network (ANN) Artificial neuron: + Receives a number of inputs (either from original data, or from the output of other neurons in the ANN). + Each input comes via a connection that has a strength ( weight ) + Weights correspond to synaptic efficacy in a biological neuron. + Each neuron also has a single threshold value. + The weighted sum of the inputs is formed, and the threshold subtracted, to compose the activation of the neuron (postsynaptic potential). + The activation signal is passed through an activation function ( transfer function ) to produce the output.

Artificial neural networks What is it? The artificial neural network (ANN) How does it work? If a step activation function is used (i.e. the neuron output is 0 if the input is less than zero and 1 if the input is greater than or equal to 0), then the neuron acts like the biological neuron. (Note: usually sigmoid functions are applied) Network: inputs (which carry the values of variables of interest in the outside world) and outputs (which form predictions, or control signals) have to be connected. Inputs and outputs correspond to sensory and motor nerves such as those coming from the eyes and leading to the hands.

Artificial neural networks What is it? The artificial neural network (ANN) Usual structure Input layer, hidden layer and output layer connected together in feed-forward structure: signals flow from inputs, forward through any hidden units, finally reaching the output units. Distinct layered topology. Hidden and output layer neurons are each connected to all of the units in the preceding layer.

Artificial neural networks How does it work? Operation of an ANN Feed information The input variable values are placed in the input units. Processing + The hidden and output layer units are progressively executed. + Each of them calculates its activation value by taking the weighted sum of the outputs of the units in the preceding layer and subtracting the threshold. + Activation value is passed through the activation function to produce the output of the neuron. + When entire network has been executed: outputs of output layer act as the output of the entire network.
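A minimal sketch of this forward pass (not from the slides; the layer sizes, random weights and thresholds are hypothetical and the network is untrained): each unit forms the weighted sum of the previous layer's outputs, subtracts its threshold, and passes the result through a sigmoid activation function.

```python
import numpy as np

def sigmoid(x):
    # Typical activation ("transfer") function used instead of a hard step.
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, thresholds):
    """One forward pass through a layered feed-forward network.
    weights: list of (n_out, n_in) matrices; thresholds: list of (n_out,) vectors."""
    a = x
    for W, theta in zip(weights, thresholds):
        # Each unit: weighted sum of the preceding layer's outputs minus its threshold,
        # passed through the activation function to produce its output.
        a = sigmoid(W @ a - theta)
    return a

# Hypothetical tiny network: 3 inputs -> 4 hidden units -> 2 outputs, random (untrained) weights.
rng = np.random.default_rng(7)
weights    = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
thresholds = [rng.normal(size=4), rng.normal(size=2)]
print(forward(np.array([0.2, 0.7, 0.1]), weights, thresholds))
```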