How to avoid over-fitting in multivariate calibration: The conventional validation approach and an alternative


Analytica Chimica Acta 595 (2007)

N.M. Faber a,*, R. Rajkó b

a Chemometry Consultancy, Rubensstraat 7, 6717 VD Ede, The Netherlands
b Department of Unit Operations and Food Engineering, Szeged College of Food Engineering, University of Szeged, H-6701 Szeged, POB 433, Hungary

Received 27 September 2006; received in revised form 17 May 2007; accepted 21 May 2007
Available online 25 May 2007

Abstract

This paper critically reviews the problem of over-fitting in multivariate calibration and the conventional validation-based approach to avoid it. It proposes a randomization test that enables one to assess the statistical significance of each component that enters the model. This alternative is compared with cross-validation and independent test set validation for the calibration of a near-infrared spectral data set using partial least squares (PLS) regression. The results indicate that the alternative approach is more objective, since, unlike the validation-based approach, it does not require the use of soft decision rules. The alternative approach therefore appears to be a useful addition to the chemometrician's toolbox.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Multivariate calibration; PLS; Component selection; Cross-validation; Test set validation; Randomization test; Near-infrared spectroscopy

"...I personally have not been aware of clear unambiguous automated warnings starting to appear when data was being over-fitted..."
A.N. Davies, Spectroscopy Europe (2004).

1. Introduction

Multivariate calibration models play an important role in various technical fields. These models are not only applied in particular in the chemical, petrochemical, pharmaceutical, cosmetic, coloring, plastics, paper, rubber and foodstuffs industries, but also in forensic, environmental, medical, sensory and marketing research.
As an illustration, consider near-infrared (NIR) spectroscopy, which is increasingly used for the characterization of solid, semi-solid, fluid and vapor samples [1]. Frequently, the objective with this characterization is to determine the value of one or several concentrations in future unknown samples. Multivariate calibration is then used to develop a quantitative relation, i.e., a model, between the digitized spectra, stored in a data matrix X, and the concentrations, stored in a data matrix Y, as reviewed by Martens and Næs [2]. NIR spectroscopy is also increasingly used to infer other properties than concentrations, e.g., the strength and viscosity of polymers, the thickness of a tablet coating, and the octane rating of gasoline. It is important to note that precise and accurate quantification on the basis of highly non-selective NIR spectra is one of the major success stories of chemometrics.

* Corresponding author. E-mail address: nmf@chemometry.com (N.M. Faber).

Various methods have been developed for building a multivariate calibration model. The three most common ones are multiple linear regression (MLR), which is also known as ordinary least squares (OLS), principal component regression (PCR) and partial least squares (PLS) regression. While MLR requires more samples, denoted by N, than spectral channels, denoted by K, PCR and PLS can handle the opposite case as well, i.e., K > N. For that reason, they are often referred to as full-spectrum methods. PCR and PLS are able to cope with an arbitrarily large number of spectral channels by compressing the X-data into a relatively small number, denoted by A, of so-called t-scores, usually less than ten. The score matrix T of size N × A then replaces the original X-matrix of size N × K in the subsequent regression step, i.e., Y is regressed onto T instead of X. The regression step amounts to solving a system of equations where each sample represents an equation and each t-score can be regarded as an unknown.
Consequently, the strict mathematical requirement follows that the number of samples must exceed the number of t-scores, i.e., N > A. This requirement is easily fulfilled in practice. PCR constructs t-scores that successively describe the maximum amount of variation in X while being orthogonal to each other. PLS can be seen as a further development of PCR because the Y-data contribute to the construction of the t-scores [3]. The full-spectrum methods are generally preferred since the additional wavelength selection step required for application of MLR, to ensure that N > K, is problematic in itself. Moreover, the compression to a small number of t-scores acts as an effective noise filter. PLS is currently the de facto standard in chemometrics because it has often been reported to exhibit a slight edge over PCR in applied work. For this reason, we will in the remainder restrict ourselves to PLS, although the proposed methodology should be equally suited for use with other score-based multivariate calibration methods.

2. Background

2.1. The problem of over-fitting

Usually, the first step towards constructing a PLS model is to remove undesirable features from the X-data by pre-treatment techniques such as filtering [1] or differentiation [4]. When the data have been made appropriate for the actual modeling process, the next critical step serves to select the optimum model dimensionality (also known as model rank), which is the number of PLS components (also known as factors or latent variables) that constitute the multivariate model. This step is equivalent to determining the optimum degree of a polynomial for fitting univariate (x,y)-data pairs. However, it is a much harder problem to solve for multivariate calibration, owing to the larger amount of input data at hand, with possibly intricate signal and noise characteristics, and the consequently increased complexity of the calibration method deployed.
The state of the art concerning commercially available software has been recently criticized by Davies [5]: "Back in 1998 more advanced chemometric tools were being made available as standard in spectrometer control packages. This had, however, raised fears that the inherent dangers of over-fitting data were not being sufficiently addressed in order to help inexperienced spectroscopists handle the additional computing power that was becoming available. I must admit that the work of my co-column Editor in pushing for Good Chemometrics Practice has hopefully raised awareness in the community of the potential pitfalls in using these packages without due consideration, but I personally have not been aware of clear unambiguous automated warnings starting to appear when data was being over-fitted." (Our italics.)

Over-fitting causes harm because one not only incorporates predictive features of the data in the model, but also noise. The implication is degraded model performance in the prediction stage. An example of over-fitting that is conveniently visualized occurs when a (two-dimensional) plane is fitted using two scores, while a (one-dimensional) line, using a single score, would be appropriate (Fig. 1). It is readily observed that prediction is still reliable in a restricted region, namely sufficiently close to the line. The concept of a correct predictive region while effectively over-fitting is further illustrated for a univariate polynomial fit in Fig. 2. For interpolating points one distinguishes a small but statistically significant increase of prediction uncertainty. By contrast, a large increase of prediction uncertainty is clearly observed for extrapolating points. This is all the more surprising because the fitted relationship is almost the same for this particular example.
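The contrast between interpolating and extrapolating points can be made concrete with a small numerical sketch. It does not use the census data of Fig. 2. Instead it assumes 11 equally spaced, hypothetical calibration points on [-1, 1] and compares the leverage x0'(X'X)^-1 x0 of a quadratic and a cubic fit at an interior point and at an extrapolated point. Under the usual least-squares assumptions the variance of a fitted value is proportional to this leverage. The names `design_row`, `solve`, `leverage` and `x_cal`, as well as the chosen points, are illustrative assumptions.

```python
def design_row(x, degree):
    """Row of the polynomial design matrix: [1, x, x^2, ..., x^degree]."""
    return [x ** p for p in range(degree + 1)]


def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def leverage(x_cal, x_new, degree):
    """Leverage of x_new for a polynomial of the given degree fitted on x_cal."""
    rows = [design_row(x, degree) for x in x_cal]
    p = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    v = design_row(x_new, degree)
    z = solve(xtx, v)  # z = (X'X)^-1 v
    return sum(vi * zi for vi, zi in zip(v, z))


# Hypothetical calibration design: 11 equally spaced points on [-1, 1].
x_cal = [-1.0 + 0.2 * i for i in range(11)]
for x_new in (0.05, 1.5):  # an interpolating and an extrapolating point
    h2 = leverage(x_cal, x_new, 2)
    h3 = leverage(x_cal, x_new, 3)
    print(f"x = {x_new:4.2f}: leverage degree 2 = {h2:.3f}, degree 3 = {h3:.3f}")
```

In this sketch the extra cubic term adds little leverage inside the calibration range but a great deal outside it, which mirrors the widening uncertainty bands described for extrapolating points.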
It is generally good advice to avoid extrapolation when deploying an empirical, entirely data-driven, soft calibration model, since in a strict sense the estimated relationship is only supported in a region close to the calibration points. However, extrapolation is often implied (to some degree) by the goal of the application. Apart from genuine prediction in time or forecasting as in Fig. 2, important examples of unavoidable extrapolation are: the detection of lower analyte concentrations in trace analysis; the determination of analyte content using the method of standard additions; the development of a product with higher consumer appreciation in sensory and marketing research; and the search for a molecule with higher biological activity using a quantitative structure-activity relationship (QSAR) model.

Fig. 1. Two collinear X-variables onto which the Y-data ( ) are regressed. Note that an extremely high collinearity is the rule for adjacent channels in molecular spectroscopy, e.g., NIR. The X-variables allow for the construction of only one stable component using the first score, t1. By contrast, the plane spanned by the first two scores, t1 and t2, is unstable. The spread of the fitted Y-data points ( ) around the first axis is caused by noise and should therefore be ignored. The model based on the first score is an effective noise filter, whereas the plane is over-fitting the data.

Fig. 2. The population of the USA in the period ( ). This data set is available through the built-in function census of Matlab (The Mathworks, Natick, MA, USA). The data are fitted using a polynomial of degree 2 ( ) and 3 ( ), respectively. The associated uncertainty bands are given by the same line type.

One should further bear in mind that many calibration samples are required to cover a high-dimensional space. Assuming that, for example, five (non-replicated) calibration points are sufficient to fit a straight line, then surely many more points are required to obtain the same coverage for a plane, a three-dimensional space, etc. It is easily verified that a high-dimensional space can be virtually empty (S. de Jong, personal communication). It follows that with increasing model dimensionality (A) it becomes increasingly difficult to achieve the same degree of interpolation for a multivariate model as for its familiar univariate counterparts. In summary, the consequences of over-fitting are likely to be much more disturbing for multivariate calibration models and this is the very reason why component selection is to be regarded as a critical step in multivariate predictive modeling.

2.2. The conventional validation approach to avoid over-fitting

Many methods have been developed to tackle the problem of over-fitting, of which model validation is the most frequently applied one in practice. In the context of multivariate calibration, validation amounts to assessing the ability of a model to predict the property of interest for future samples of the same type. This assessment can be performed in two essentially different modes, namely externally and internally. The adjective external refers to the requirement that the validation samples (also known as test samples) be independent of the samples used for constructing the model, i.e., the calibration set; otherwise one does not properly assess the ability to predict for truly unknown future samples. For example, simple replicates are not allowed. The predictive ability is estimated by applying the model to these (independent) validation samples and averaging the squared prediction errors, i.e., the differences between model prediction and the associated known reference value (see footnote 1). The square root of this average squared error is known as the root mean squared error of prediction (RMSEP). In equation form, for increasing number of components (A),

$$\mathrm{RMSEP}(A) = \sqrt{\frac{1}{N_{\mathrm{val}}} \sum_{n=1}^{N_{\mathrm{val}}} \left(\hat{Y}_{A,n} - Y_{\mathrm{ref},n}\right)^{2}} \qquad (1)$$

where $N_{\mathrm{val}}$ is the number of validation samples and $\hat{Y}_{A,n}$ and $Y_{\mathrm{ref},n}$ denote the model prediction with A components and known reference value for sample n (n = 1, ..., $N_{\mathrm{val}}$), respectively. Ideally, the results of this calculation lead to a clear (i.e., not too broad and shallow) minimum RMSEP for the optimum model dimensionality. This is achieved in case of a favorable bias-variance trade-off, see Fig. 3.

Internal validation differs from external validation in the sense that the validation samples are taken from the calibration set itself, i.e., the validation samples are not truly independent. To execute an internal validation, one has the choice between (1) cross-validation, (2) bootstrapping, (3) leverage correction and (4) criteria originally intended in particular for variable selection in connection with MLR (e.g., Mallows' $C_p$). In cross-validation, one constructs models after judiciously leaving out segments of (calibration) samples. Then an estimate of RMSEP follows by averaging squared prediction errors for the left-out samples, as in external validation. To emphasize that this estimate is not based on truly independent validation samples, it will be denoted as root mean squared error of cross-validation (RMSECV) in the remainder of this paper.
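The two error measures just defined can be sketched in a few lines. The block below is a minimal illustration, not the authors' implementation: `rmsep` follows Eq. (1) directly, while `rmsecv_leave_one_out` assumes the simplest segmentation (one sample per segment) and uses a univariate straight-line fit, `fit_line`, as a stand-in for the multivariate calibration model. All function names and the example numbers are hypothetical.

```python
import math


def rmsep(y_pred, y_ref):
    """Root mean squared error of prediction, Eq. (1)."""
    if len(y_pred) != len(y_ref) or not y_ref:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(y_pred, y_ref)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def fit_line(x, y):
    """Least-squares intercept and slope (stand-in for a calibration model)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def rmsecv_leave_one_out(x, y):
    """RMSECV with one left-out sample per segment; x and y are lists."""
    predictions = []
    for i in range(len(x)):
        # Build the model without sample i, then predict sample i.
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        predictions.append(b0 + b1 * x[i])
    return rmsep(predictions, y)


# Hypothetical calibration data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
print(f"RMSECV (leave-one-out) = {rmsecv_leave_one_out(x, y):.3f}")
```

For a PLS model the same loop would be repeated for each candidate number of components A, giving one RMSECV value per A.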
Cross-validation can be quite computer-intensive, depending on the size of the calibration set and the number of segments. Bootstrapping performs similarly to cross-validation [7,8]. Leverage correction is only a quick and dirty alternative when applied to PCR and PLS [9]. Finally, criteria like Mallows' $C_p$ are seldom used. In the remainder we will therefore focus on internal validation using cross-validation.

Footnote 1. To estimate the predictive ability of a multivariate calibration model one does not want to rely on theoretical formulas such as the ones underlying the uncertainty bands displayed in Fig. 2, although it is important to note that significant advances have been made in terms of characterizing the uncertainty in multivariate model results, see Part III of Guidelines for calibration in analytical chemistry of the International Union of Pure and Applied Chemistry (IUPAC) [6].

Fig. 3. Schematic presentations of bias and variance contribution to RMSEP as a function of model dimensionality, e.g., the number of PLS components: (left panel) standard presentation where variance ( ) increases rapidly and bias ( ) gives a substantial contribution to RMSEP ( ) for the optimum model, and (right panel) alternative presentation where variance increases slowly (when interpolating) and bias is relatively small for the optimum model. The latter asymmetric presentation is usually more realistic in practice and illustrates why under-fitting is seldom a concern.

2.3. Problems with the conventional validation approach

Validation-based component selection is problematic for various general and specific reasons. Three general reasons are:

Eq. (1) will only provide a correct estimate of RMSEP if the reference values are known with sufficient precision. This condition is, however, often not fulfilled in practice. DiFoggio [10] has coined the term apparent RMSEP to emphasize that the result of Eq. (1) is a pessimistic estimate (i.e., biased high) of the actual RMSEP because it contains a spurious error component, namely the reference error. It is clear that this spurious contribution to RMSEP does not depend on model dimensionality (A). Consequently, it will tend to obscure the sought-for global minimum by making it broader and shallower.

Martens and Dardenne [11] have shown that many validation samples are required to obtain a sufficiently precise estimate of RMSEP. Simulations [12] inspired by this study have led to the suggestion that a rule of thumb holding for a plain standard deviation also works for estimates of RMSEP, i.e.,

$$\frac{\sigma(\mathrm{RMSEP})}{\mathrm{RMSEP}} = \frac{1}{\sqrt{2 N_{\mathrm{val}}}} \qquad (2)$$

where $\sigma(\cdot)$ denotes the standard error of the associated quantity. This expression is an example of the law of diminishing returns. For example, to have a relative uncertainty of less than 20% requires about 13 validation samples (that spread out reasonably well in calibration space). To further reduce this uncertainty to less than 10% one has to quadruple the number of validation samples. Eq. (2) intends to enable the analyst to calculate the number of validation samples (s)he needs to report an RMSEP estimate in sufficient (significant) decimal digits, usually two.

Often, the RMSEP estimates do not exhibit a clear global minimum, as in Fig. 3. This is a direct consequence of the previous issues.
As a result, one often has to resort to soft decision rules like "the first local minimum" or "the start of a plateau", which is highly unsatisfactory from both a practical and a scientific point of view. It is important to note that the previous issues have led researchers to develop error indicator functions that do not require possibly noisy reference values [13,14].

Specific problems with the conventional approach are:

External validation, i.e., test set validation, is best in the sense that a closer assessment of RMSEP is possible ("test is best"). However, it is wasteful because the validation samples are not available for the construction of the model.

Cross-validation, on the other hand, ensures a more economic use of the available data, but it has two major drawbacks. First, it cannot be used if the data are designed. This can be understood as follows. Design points are special in the sense that they should have a large impact on the model. Consequently, the actual prediction uncertainty should be correspondingly small for these points. However, when leaving out these points, the model constructed for the remaining points may be very different from the full model; hence it may generate an unduly large prediction residual for the left-out samples. Depending on the type of design, this drawback can be ignored if the calibration set is large enough to have some redundancy, but it certainly precludes the use of cross-validation in many sensory and QSAR applications, where the calibration set can be as small as ten samples (and, in principle, redundancy is avoided because of the high cost of sampling). Similar reasoning holds when the calibration model must be updated for new sources of variation, with few samples (X. Capron, personal communication). Obviously, in such cases one will not have recourse to a sufficiently large independent validation set either.
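How large a "sufficiently large" validation set has to be follows from the rule of thumb in Eq. (2). The short sketch below simply evaluates that expression and inverts it. The function names `relative_uncertainty` and `samples_needed` are illustrative, and the calculation assumes that the rule of thumb holds for the data at hand.

```python
import math


def relative_uncertainty(n_val):
    """Relative standard error of an RMSEP estimate according to Eq. (2)."""
    return 1.0 / math.sqrt(2.0 * n_val)


def samples_needed(target):
    """Smallest number of validation samples whose relative uncertainty does not exceed target."""
    return math.ceil(1.0 / (2.0 * target ** 2))


print(samples_needed(0.20))            # validation samples for 20% relative uncertainty
print(relative_uncertainty(13))        # relative uncertainty with 13 samples
print(relative_uncertainty(4 * 13))    # quadrupling the set roughly halves the uncertainty
```

Because the uncertainty shrinks only with the square root of the number of samples, halving it always costs four times as many validation samples, which is the law of diminishing returns mentioned with Eq. (2).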
Second, many variants of cross-validation have a tendency to select too many components, because they do not compensate for the fact that the same samples are used for both calibration and validation. In other words, with cross-validation one is vulnerable to over-fitting the calibration data ("false positive" components). So-called Monte Carlo cross-validation has recently been introduced in chemometrics to reduce the risk of over-fitting [15]. For simulated data, the risk was reduced from 25% to about 14%. However, the latter risk is still fairly large, and what is even more disturbing: the procedure does not provide any hint about this risk.

Finally, the simple fact that a different implementation can easily lead to a different advice constitutes an ambiguity that is confusing to the analyst.

2.4. The proposed alternative

The proposed alternative assesses the statistical significance of each individual component that enters the model. Theoretical approaches to achieve this goal (using a t- or F-test) have been put forth but they are all based on unrealistic assumptions about the data, e.g., the absence of spectral noise, see [16] for examples. A pragmatic data-driven approach is therefore called for. A so-called randomization test is a data-driven approach and therefore ideally suited for avoiding unrealistic assumptions. For an excellent description of this methodology, see van der Voet [17].

Fig. 4. Generating the distribution under the null-hypothesis ($H_0$) by building a series of PLS models after pairing up the observations for predictor (X) and response (Y) variables at random. Any result obtained by PLS modeling after randomization must be due to chance. Consequently, the statistical significance of the value obtained for the original data follows from a comparison with the corresponding randomization results.

The rationale behind the randomization test in the context of regression modeling is illustrated in Fig. 4. Randomization amounts to permuting indices. For that reason, the randomization test is often referred to as a permutation test. In QSAR applications it is known as Y-scrambling. Clearly, scrambling the elements of Y, while keeping the corresponding numbers in X fixed, destroys any relationship that might exist between the X- and Y-variables. Randomization therefore yields PLS regression models that should reflect the absence of a real association between the X- and Y-variables, in other words: purely random models. For each of these random models, a test statistic is calculated. We have opted for the covariance between the t-score and the Y-values because it is a natural measure of association, see [16] for more details. Geometrically, it is the inner product of the t-score vector and the Y-vector in Fig. 1. Clearly, the value for a test statistic obtained after randomization should be indistinguishable from a chance fluctuation. For this reason, it will be referred to as a noise value. Repeating this calculation a number of times generates a histogram for the null-distribution, i.e., the distribution that holds when the component under scrutiny is due to chance, the null-hypothesis ($H_0$).
Next, a critical value is derived from the null-distribution as the value exceeded by a certain percentage of noise values (say 5% or 10%). Finally, the statistic obtained for the original data, the value under test, is compared with the critical value. The (only) difference with a conventional statistical test is that the critical value follows as a percentage point of a data-driven histogram of noise values instead of a theoretical distribution that is tabulated, e.g., t or F.

It is important to note that Y-scrambling has become a standard for assessing the significance of a (complete) QSAR model [18]. One may therefore feel tempted to apply this test to counter over-fitting but this will not work as intended because a significant model can either over- or under-fit. It follows that the resulting significance levels will be misleading at best. For example, the grand mean in analysis of variance invariably comes out as significant in a test, but it (usually) under-fits. It appears that the testing of complete models, instead of individual components, must lead to trouble.

3. Experimental

3.1. The example data set

A NIR spectral data set will serve to illustrate the problems with the conventional validation approach to avoid over-fitting. This type of spectral data provides critical test cases for PLS component selection procedures because tiny substructures may have predictive value. The example data set (F. Wahl, Institut Français du Pétrole, personal communication) contains NIR spectra (X) for 239 gas oil samples measured between cm$^{-1}$ (Fig. 5). The property of interest (Y) is the hydrogen content. The reference values were determined by nuclear magnetic resonance, which has an estimated measurement error standard deviation $\sigma_{\mathrm{ref}}$ = g (100 g)$^{-1}$. The 239 samples were split into a calibration and validation set by using the duplex algorithm. This method starts by selecting the two points furthest from each other and puts them both in a first set (calibration). Then the next two points furthest from each other are put in a second set (validation), and the procedure is continued by alternately placing pairs of points in the first or second set. As a result, 84 samples were used for calibration and 155 samples for validation. It is noted that the majority of the available samples was selected for (external) validation, which is unusual in practice. However, Fernández Pierna et al. had chosen this particular data split to test expressions for multivariate sample-specific prediction uncertainty [19]. In other words: focus was more on assessing the predictive ability of a model than on obtaining the best model. Also for the current study it should be useful to have a relatively large validation set because external validation is generally considered to be the gold standard.

Fig. 5. NIR spectra of the example data set.

Fig. 6. Validation results for the example data set: (top panels) internal RMSECV ( ) for the 84 calibration samples and (bottom panels) external RMSEP ( ) for the 155 independent validation samples. To better exploit the vertical scale, the first point is omitted in panels (b) and (d).

3.2. Calculations

The proposed randomization test has been implemented in Matlab 7.0 (The Mathworks, Natick, MA, USA) and the program is available from the first author. Histograms of noise values were generated using 1000 permutations. Although as few as 100 permutations can be used [17], this relatively large number ensures that the resulting histograms are fairly smooth. For the current example data set (84 samples × 2128 wavelengths), the computations were completed within seven CPU seconds on a 3.4 GHz personal computer.
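The randomization procedure can be sketched for the first PLS component as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab program: it handles only the first component (how later components are treated is not specified here and is not attempted), it uses simulated data, and it estimates α as the plain fraction of noise values that reach the value under test. The function names, the simulated data and the seeds are hypothetical.

```python
import math
import random


def center(values):
    """Subtract the mean from a list of numbers."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def first_component_statistic(x_cols, y):
    """Covariance between the first PLS t-score and y.

    x_cols holds the mean-centered X-variables column by column; y is mean-centered.
    """
    n = len(y)
    w = [sum(xi * yi for xi, yi in zip(col, y)) for col in x_cols]  # weights ~ X'y
    norm = math.sqrt(sum(wk * wk for wk in w))
    if norm == 0.0:
        return 0.0
    w = [wk / norm for wk in w]
    t = [sum(w[k] * x_cols[k][i] for k in range(len(x_cols))) for i in range(n)]  # t = Xw
    return sum(ti * yi for ti, yi in zip(t, y)) / (n - 1)


def randomization_test(x_cols, y, n_perm=1000, seed=0):
    """Return the value under test, the noise values and the estimated alpha."""
    rng = random.Random(seed)
    x_c = [center(col) for col in x_cols]
    y_c = center(y)
    observed = first_component_statistic(x_c, y_c)
    y_perm = y_c[:]
    noise = []
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # scramble Y while keeping X fixed
        noise.append(first_component_statistic(x_c, y_perm))
    alpha = sum(v >= observed for v in noise) / n_perm
    return observed, noise, alpha


# Simulated example: 30 samples, 3 channels, y driven by the first channel.
rng = random.Random(1)
n_samples, n_channels = 30, 3
x_cols = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in range(n_channels)]
y = [2.0 * x_cols[0][i] + rng.gauss(0.0, 0.1) for i in range(n_samples)]
observed, noise, alpha = randomization_test(x_cols, y, n_perm=200)
print(f"value under test = {observed:.3f}, alpha = {alpha:.3f}")
```

An estimate of exactly zero only means that α is smaller than one over the number of permutations; the plain fraction cannot resolve smaller risks.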
To calculate the risk of over-fitting when, in fact, none of the noise values exceeds the value under test, the so-called inverse Gaussian function is fit to the noise values. This function is often suited for modeling positive and/or positively skewed data [20].

4. Results and discussion

4.1. The conventional validation approach

Both internal and external validation (the gold standard) lead to a rather subjective decision process, see Fig. 6. The five-dimensional model achieves the first local minimum in RMSECV (see top panels). By contrast, the external RMSEP estimates continue to decrease until eight components have been fitted (see bottom panels). The analyst faces major difficulties in deciding objectively whether the further decrease of RMSEP is worthwhile or merely results from statistical fluctuations (see footnote 2). We suspect that to obtain a clear minimum as in the schematic presentations of Fig. 3, many more samples are required since the law of diminishing returns is in force (Eq. (2)). However, the currently available total number of samples (239) is already quite favorable.

Footnote 2. The decision process is subjective in the sense that different analysts may easily arrive at different conclusions about the optimal number of components. For the current example data set, the conclusion may even depend on the use of scale for the ordinate, cf. Fig. 6c and d.

It is noted that The Unscrambler (CAMO, Trondheim, Norway) and SIMCA (Umetrics, Umeå, Sweden) packages have a decision rule implemented to assess whether a decrease of RMSEP or RMSECV resulting from the fit of an additional component is worthwhile or not. The underlying idea, which makes good sense, is that small improvements of RMSEP or RMSECV should be regarded with caution, because larger models are inherently less robust. Interestingly, both packages suggest on the basis of (internal) RMSECV that as few as two components are sufficient. This would, however, lead to a serious under-fitting of the data since the (external) RMSEP further decreases from 0.01 g (100 g)$^{-1}$ to g (100 g)$^{-1}$, the latter value being quite close to the reference error ($\sigma_{\mathrm{ref}}$ = g (100 g)$^{-1}$). The fundamental problem with these decision rules is that the actual significance of a further improvement of RMSEP estimates depends on the size and quality of the data set at hand, as well as the validation procedure applied. Consequently, general purpose rules may easily fail in specific cases.

It has been attempted to rationalize the validation-based selection of model dimensionality by comparing competing models in a pair-wise fashion [17,21]. However, the initial choice of competing model dimensionalities to be further scrutinized is left to the practitioner. For example, for the current example data set, likely initial choices are five, in combination with cross-validation, or eight, when test set validation is deployed. As a result, a major source of subjectivity is not eliminated. For example, the finally selected model could still be either under- or over-fitting the data. After all, each model entering this stage could either under- or over-fit. It stands to reason that the chain cannot be stronger than its weakest link. Finally, it is noted that the method developed by van der Voet [17] has been implemented in SAS (SAS Institute, Cary, NC, USA), as thoroughly reviewed by Reeves and Delwiche [22].

4.2. The proposed alternative

Histograms of noise values generated for components 1-8 are presented in Fig. 7. It is observed that the probability that the value under test is due to chance (α) is extremely small for components 1 (0.0009%), 2 (0.02%), 4 (0.0006%) and 5 (0.002%). Interestingly, the significance of component 3 is only 3.3%. We speculate this to be due to component 3 taking care, with some difficulty, of subtle non-linearities in the spectra, after which the remaining linear contributions are conveniently handled by components 4 and 5. The high α-values for components 6-8 constitute a clear unambiguous warning that over-fitting starts after the fifth component.

Fig. 7. Randomization results for the example data set: histogram of 1000 noise values, fit using the inverse Gaussian function ( ) and value under test ( ). The symbol α stands for the significance level.

Fig. 8. Comparison of current practice of multivariate predictive modeling (left) and the one enabled by the proposed alternative (right).

5. Recommendations

The proposed randomization test enables a different scheme for calibration modeling (Fig. 8). The essential difference is that the two critical steps preceding the actual modeling process are disentangled. The best data pre-treatment depends highly on the type and quality of the input data. Expert knowledge is valuable in this stage and allowing for subjectivity here is to be understood in a favorable sense: it may help to reduce the inherent black-box character of soft modeling procedures. By contrast, the selection of optimum model dimensionality should be kept fully objective since a human expert cannot judge the observed modeling power of an additional component to be genuine, i.e., not due to chance. An adequate validation step (of course one must validate!) then constitutes the justification of the overall trial-and-error procedure. It is stressed that not completely relying on validation for component selection has an added bonus in the sense that the RMSEP estimate can be reported with more confidence, simply because it has not guided component selection.

Until now the discussion focused completely on quantitative aspects of multivariate calibration modeling.
However, in application areas such as sensory research, qualitative aspects such as the interpretation of the individual components can be even more important. Standard practice is to visualize the lowest-numbered components in score and loading plots (components 1 versus 2, components 1 versus 3, etc.) to discover patterns and trends. These observations then may lead to the development of a better product. A tacit assumption is that the components included in the model are ordered according to their importance for describing the Y-variable, the property of interest. It has been observed, however, that non-significant components can be preceded and followed by (highly) significant ones [16,23]. This phenomenon has been termed sandwiching and can often be rationalized (see [24-27] for in-depth discussions of this aspect). An early component can, for example, take care of a background in the X-data and it consequently bears no relationship with the Y-variable. (Recall that PLS component 3 of the current example data set is close to being non-significant, while the preceding and following ones are highly significant.) It is clear that one should be cautious when attempting to interpret these sandwiched components. We therefore recommend displaying the statistical significance of a component in score and loading plots to avoid the interpretation of patterns and trends that have no significant relationship with the Y-variable, the property of interest.

6. Concluding remarks

The conventional validation approach to component selection is problematic in practice because often the RMSEP estimates do not yield a clear global minimum. In such a case, the analyst has to resort to visual inspection and its associated soft decision rules. This all leads to a rather subjective decision process, which makes the proposed statistical alternative rather attractive. The following concluding remarks seem to be in order:

• The alternative enables one to scrutinize individual components without making strong assumptions about the data.
• It is user-friendly because it only requires (1) the number of permutations and (2) the critical significance level to be selected. The first requirement constitutes the only practical difference between a randomization test and a conventional statistical test.
• The result is often consistent with the one obtained using validation (e.g., Unscrambler or SIMCA advice), but now it is fully objective: visual inspection does not play a role.
• It can replace validation for component selection, but it can also supplement the common plot (RMSEP estimates vs. components) with advice.
• The applicability of the randomization test is not restricted to PLS regression. Moreover, it is easily verified that it also applies to multiway calibration. One only needs to replace the (1-way) rows of the X-matrix in Fig. 4 by the appropriate data object (2-way matrices or, in general, N-way arrays).
• Compression may be necessary to handle large data sets. However, once the compression is done, other computer-intensive methods such as bootstrap, jack-knife and cross-validation can be entertained almost for free as well. The only requirement is that the compression should not introduce dependencies among the samples.
• The currently described randomization test operates on the calibration set. A purpose could be to add objectivity to cross-validation, cf. Fig. 6a and b. There is no reason, however, why it cannot be adapted to add objectivity to test set validation, cf. Fig. 6c and d.

In summary, the proposed randomization test appears to be a useful addition to the chemometrician's toolbox.

Acknowledgements

The thoughtful comments by Waltraud Kessler (Reutlingen University), Randy Pell (The Dow Chemical Company), Michael Sjöström (Umeå University) and Svante Wold (Umeå University) are appreciated by the authors.
We further thank Chris Brown (InLight Solutions) for supplying the function for the inverse Gaussian fit and Alejandro Olivieri (Universidad Nacional de Rosario) for pointing out a numerical problem in the calculations. The relationship of the proposed alternative to the following patent is acknowledged: N.M. Faber, Method and system for selection of calibration model dimensionality, and use of such a calibration model, PCT/NL2005/ Part of this work was supported by the Hungarian Scientific Research Fund (OTKA T ) and was completed when one of the authors (RR) spent a sabbatical year from the University of Szeged. The critical comments by the reviewer are appreciated by the authors.

References

[1] S. Wold, H. Antti, F. Lindgren, J. Öhman, Chemom. Intell. Lab. Syst. 44 (1998) 175.
[2] H. Martens, T. Næs, Multivariate Calibration, Wiley, New York,
[3] S. Wold, A. Ruhe, H. Wold, W.J. Dunn III, SIAM J. Sci. Statist. Comput. 5 (1984) 735.
[4] A. Savitzky, M.J.E. Golay, Anal. Chem. 36 (1964)
[5] A.N. Davies, Spectrosc. Eur. 16 (3) (2004) 26.
[6] A.C. Olivieri, N.M. Faber, J. Ferré, R. Boqué, J.H. Kalivas, H. Mark, Pure Appl. Chem. 78 (2006) 633.
[7] M.C. Denham, J. Chemom. 14 (2000) 351.
[8] R. Wehrens, H. Putter, L.M.C. Buydens, Chemom. Intell. Lab. Syst. 54 (2000) 35.
[9] A. Lorber, B.R. Kowalski, Appl. Spectrosc. 44 (1990)
[10] R. DiFoggio, Appl. Spectrosc. 49 (1995) 67.
[11] H.A. Martens, P. Dardenne, Chemom. Intell. Lab. Syst. 44 (1998) 99.
[12] N.M. Faber, Chemom. Intell. Lab. Syst. 49 (1999) 79.
[13] L. Xu, I. Schechter, Anal. Chem. 68 (1996)
[14] E.T.S. Skibsted, H.F.M. Boelens, J.A. Westerhuis, D.T. Witte, A.K. Smilde, Anal. Chem. 58 (2004) 264.
[15] Q.-S. Xu, Y.-Z. Liang, Chemom. Intell. Lab. Syst. 56 (2001) 1.
[16] S. Wiklund, D. Nilsson, L. Eriksson, M. Sjöström, S. Wold, K. Faber, J. Chemom., submitted.
[17] H. van der Voet, Chemom. Intell. Lab. Syst. 25 (1994) 313.
[18] S.-S. So, M. Karplus, J. Med. Chem. 40 (1997)
[19] J.A. Fernández Pierna, L. Jin, F. Wahl, N.M. Faber, D.L. Massart, Chemom. Intell. Lab. Syst. 65 (2003) 281.
[20] R.S. Chhikara, J.L. Folks, The Inverse Gaussian Distribution: Theory, Methodology, and Applications, Marcel Dekker, New York,
[21] E.V. Thomas, J. Chemom. 17 (2003) 653.
[22] J.B. Reeves III, S.R. Delwiche, J. Near Infrared Spectrosc. 11 (2003) 415.
[23] M.P. Gómez-Carracedo, J.M. Andrade, D.N. Rutledge, N.M. Faber, Anal. Chim. Acta 585 (2007) 253.
[24] H.R. Keller, D.L. Massart, Y.-Z. Liang, O.M. Kvalheim, Anal. Chim. Acta 263 (1992) 29.
[25] Y.-L. Xie, J.H. Kalivas, Anal. Chim. Acta 348 (1997) 19.
[26] Y.-L. Xie, J.H. Kalivas, Anal. Chim. Acta 348 (1997) 29.
[27] U. Depczynski, V.J. Frost, K. Molt, Anal. Chim. Acta 420 (2000) 217.
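
As an illustration of the concluding remark that the randomization test only requires the number of permutations and a critical significance level, the following Python sketch shows one way such a test could be organized for PLS1. It is not the authors' implementation. It assumes that the test statistic is the absolute covariance between the X-score and the (deflated) y-vector, that y is permuted relative to the rows of X, and that α is estimated as the plain fraction of permutations reaching the observed statistic; the paper instead mentions an inverse Gaussian fit, which is not reproduced here. The names `pls1_component` and `randomization_test` are hypothetical.

```python
import numpy as np


def pls1_component(X, y):
    """Compute one PLS1 component: weight vector, score, covariance statistic."""
    w = X.T @ y
    norm = np.linalg.norm(w)
    if norm == 0.0:
        # y carries no covariance with X; return a null component.
        return w, np.zeros(X.shape[0]), 0.0
    w = w / norm
    t = X @ w
    stat = abs(t @ y) / (len(y) - 1)  # assumed test statistic
    return w, t, stat


def randomization_test(X, y, max_components=5, n_permutations=1000, seed=0):
    """Estimate, per component, the risk (alpha) that its statistic is due to chance."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X = X - X.mean(axis=0)  # column-center X
    y = y - y.mean()        # center y
    alphas = []
    for _ in range(max_components):
        _, t, observed = pls1_component(X, y)
        exceed = 0
        for _ in range(n_permutations):
            # Permuting y destroys any genuine X-y relationship.
            _, _, stat = pls1_component(X, rng.permutation(y))
            exceed += int(stat >= observed)
        # Empirical estimate; the +1 terms keep alpha strictly positive.
        alphas.append((exceed + 1) / (n_permutations + 1))
        tt = t @ t
        if tt == 0.0:
            continue  # nothing left to deflate
        # Deflate X and y before testing the next component.
        p = X.T @ t / tt
        q = (y @ t) / tt
        X = X - np.outer(t, p)
        y = y - q * t
    return alphas
```

A component would be retained when its α falls below the chosen critical level, for example `[a < 0.05 for a in randomization_test(X, y)]`. With the empirical estimate used in this sketch, α cannot be smaller than 1/(n_permutations + 1), so very small values such as those quoted in the paper would require either a large number of permutations or a fitted distribution.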

Letter to the Editor. On the calculation of decision limits in doping control

Letter to the Editor. On the calculation of decision limits in doping control Letter to the Editor On the calculation of decision limits in doping control Nicolaas (Klaas). Faber 1, and Ricard Boqué 2 1 Chemometry Consultancy, Rubensstraat 7, 6717 VD Ede, The Netherlands 2 Department

More information

Chemometrics. 1. Find an important subset of the original variables.

Chemometrics. 1. Find an important subset of the original variables. Chemistry 311 2003-01-13 1 Chemometrics Chemometrics: Mathematical, statistical, graphical or symbolic methods to improve the understanding of chemical information. or The science of relating measurements

More information

EXTENDING PARTIAL LEAST SQUARES REGRESSION

EXTENDING PARTIAL LEAST SQUARES REGRESSION EXTENDING PARTIAL LEAST SQUARES REGRESSION ATHANASSIOS KONDYLIS UNIVERSITY OF NEUCHÂTEL 1 Outline Multivariate Calibration in Chemometrics PLS regression (PLSR) and the PLS1 algorithm PLS1 from a statistical

More information

Generalized Least Squares for Calibration Transfer. Barry M. Wise, Harald Martens and Martin Høy Eigenvector Research, Inc.

Generalized Least Squares for Calibration Transfer. Barry M. Wise, Harald Martens and Martin Høy Eigenvector Research, Inc. Generalized Least Squares for Calibration Transfer Barry M. Wise, Harald Martens and Martin Høy Eigenvector Research, Inc. Manson, WA 1 Outline The calibration transfer problem Instrument differences,

More information

New method for the determination of benzoic and. sorbic acids in commercial orange juices based on

New method for the determination of benzoic and. sorbic acids in commercial orange juices based on New method for the determination of benzoic and sorbic acids in commercial orange juices based on second-order spectrophotometric data generated by a ph gradient flow injection technique (Supporting Information)

More information

Shrinkage regression

Shrinkage regression Shrinkage regression Rolf Sundberg Volume 4, pp 1994 1998 in Encyclopedia of Environmetrics (ISBN 0471 899976) Edited by Abdel H. El-Shaarawi and Walter W. Piegorsch John Wiley & Sons, Ltd, Chichester,

More information

ReducedPCR/PLSRmodelsbysubspaceprojections

ReducedPCR/PLSRmodelsbysubspaceprojections ReducedPCR/PLSRmodelsbysubspaceprojections Rolf Ergon Telemark University College P.O.Box 2, N-9 Porsgrunn, Norway e-mail: rolf.ergon@hit.no Published in Chemometrics and Intelligent Laboratory Systems

More information

EFFECT OF THE UNCERTAINTY OF THE STABILITY DATA ON THE SHELF LIFE ESTIMATION OF PHARMACEUTICAL PRODUCTS

EFFECT OF THE UNCERTAINTY OF THE STABILITY DATA ON THE SHELF LIFE ESTIMATION OF PHARMACEUTICAL PRODUCTS PERIODICA POLYTECHNICA SER. CHEM. ENG. VOL. 48, NO. 1, PP. 41 52 (2004) EFFECT OF THE UNCERTAINTY OF THE STABILITY DATA ON THE SHELF LIFE ESTIMATION OF PHARMACEUTICAL PRODUCTS Kinga KOMKA and Sándor KEMÉNY

More information

Multivariate calibration What is in chemometrics for the analytical chemist?

Multivariate calibration What is in chemometrics for the analytical chemist? Analytica Chimica Acta 500 (2003) 185 194 Multivariate calibration What is in chemometrics for the analytical chemist? Rasmus Bro Department of Dairy and Food Science, The Royal Veterinary and Agricultural

More information

Explaining Correlations by Plotting Orthogonal Contrasts

Explaining Correlations by Plotting Orthogonal Contrasts Explaining Correlations by Plotting Orthogonal Contrasts Øyvind Langsrud MATFORSK, Norwegian Food Research Institute. www.matforsk.no/ola/ To appear in The American Statistician www.amstat.org/publications/tas/

More information

Supplementary material (Additional file 1)

Supplementary material (Additional file 1) Supplementary material (Additional file 1) Contents I. Bias-Variance Decomposition II. Supplementary Figures: Simulation model 2 and real data sets III. Supplementary Figures: Simulation model 1 IV. Molecule

More information

The Theory of HPLC. Quantitative and Qualitative HPLC

The Theory of HPLC. Quantitative and Qualitative HPLC The Theory of HPLC Quantitative and Qualitative HPLC i Wherever you see this symbol, it is important to access the on-line course as there is interactive material that cannot be fully shown in this reference

More information

Accounting for measurement uncertainties in industrial data analysis

Accounting for measurement uncertainties in industrial data analysis Accounting for measurement uncertainties in industrial data analysis Marco S. Reis * ; Pedro M. Saraiva GEPSI-PSE Group, Department of Chemical Engineering, University of Coimbra Pólo II Pinhal de Marrocos,

More information

INDEPENDENT COMPONENT ANALYSIS (ICA) IN THE DECONVOLUTION OF OVERLAPPING HPLC AROMATIC PEAKS OF OIL

INDEPENDENT COMPONENT ANALYSIS (ICA) IN THE DECONVOLUTION OF OVERLAPPING HPLC AROMATIC PEAKS OF OIL INDEPENDENT COMPONENT ANALYSIS (ICA) IN THE DECONVOLUTION OF OVERLAPPING HPLC AROMATIC PEAKS OF OIL N. Pasadakis, V. Gaganis, P. Smaragdis 2 Mineral Resources Engineering Department Technical University

More information

Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size

Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size Berkman Sahiner, a) Heang-Ping Chan, Nicholas Petrick, Robert F. Wagner, b) and Lubomir Hadjiiski

More information

Statistical concepts in QSAR.

Statistical concepts in QSAR. Statistical concepts in QSAR. Computational chemistry represents molecular structures as a numerical models and simulates their behavior with the equations of quantum and classical physics. Available programs

More information

UPSET AND SENSOR FAILURE DETECTION IN MULTIVARIATE PROCESSES

UPSET AND SENSOR FAILURE DETECTION IN MULTIVARIATE PROCESSES UPSET AND SENSOR FAILURE DETECTION IN MULTIVARIATE PROCESSES Barry M. Wise, N. Lawrence Ricker and David J. Veltkamp Center for Process Analytical Chemistry and Department of Chemical Engineering University

More information

SC705: Advanced Statistics Instructor: Natasha Sarkisian Class notes: Introduction to Structural Equation Modeling (SEM)

SC705: Advanced Statistics Instructor: Natasha Sarkisian Class notes: Introduction to Structural Equation Modeling (SEM) SC705: Advanced Statistics Instructor: Natasha Sarkisian Class notes: Introduction to Structural Equation Modeling (SEM) SEM is a family of statistical techniques which builds upon multiple regression,

More information

Basics of Multivariate Modelling and Data Analysis

Basics of Multivariate Modelling and Data Analysis Basics of Multivariate Modelling and Data Analysis Kurt-Erik Häggblom 6. Principal component analysis (PCA) 6.1 Overview 6.2 Essentials of PCA 6.3 Numerical calculation of PCs 6.4 Effects of data preprocessing

More information

2 D wavelet analysis , 487

2 D wavelet analysis , 487 Index 2 2 D wavelet analysis... 263, 487 A Absolute distance to the model... 452 Aligned Vectors... 446 All data are needed... 19, 32 Alternating conditional expectations (ACE)... 375 Alternative to block

More information

DEPARTMENT OF ENGINEERING MANAGEMENT. Two-level designs to estimate all main effects and two-factor interactions. Pieter T. Eendebak & Eric D.

DEPARTMENT OF ENGINEERING MANAGEMENT. Two-level designs to estimate all main effects and two-factor interactions. Pieter T. Eendebak & Eric D. DEPARTMENT OF ENGINEERING MANAGEMENT Two-level designs to estimate all main effects and two-factor interactions Pieter T. Eendebak & Eric D. Schoen UNIVERSITY OF ANTWERP Faculty of Applied Economics City

More information

Two-Level Designs to Estimate All Main Effects and Two-Factor Interactions

Two-Level Designs to Estimate All Main Effects and Two-Factor Interactions Technometrics ISSN: 0040-1706 (Print) 1537-2723 (Online) Journal homepage: http://www.tandfonline.com/loi/utch20 Two-Level Designs to Estimate All Main Effects and Two-Factor Interactions Pieter T. Eendebak

More information

Deconvolution of Overlapping HPLC Aromatic Hydrocarbons Peaks Using Independent Component Analysis ICA

Deconvolution of Overlapping HPLC Aromatic Hydrocarbons Peaks Using Independent Component Analysis ICA MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Deconvolution of Overlapping HPLC Aromatic Hydrocarbons Peaks Using Independent Component Analysis ICA V. Gaganis, N. Pasadakis, P. Smaragdis,

More information

MULTIVARIATE STATISTICAL ANALYSIS OF SPECTROSCOPIC DATA. Haisheng Lin, Ognjen Marjanovic, Barry Lennox

MULTIVARIATE STATISTICAL ANALYSIS OF SPECTROSCOPIC DATA. Haisheng Lin, Ognjen Marjanovic, Barry Lennox MULTIVARIATE STATISTICAL ANALYSIS OF SPECTROSCOPIC DATA Haisheng Lin, Ognjen Marjanovic, Barry Lennox Control Systems Centre, School of Electrical and Electronic Engineering, University of Manchester Abstract:

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information

A Unified Approach to Uncertainty for Quality Improvement

A Unified Approach to Uncertainty for Quality Improvement A Unified Approach to Uncertainty for Quality Improvement J E Muelaner 1, M Chappell 2, P S Keogh 1 1 Department of Mechanical Engineering, University of Bath, UK 2 MCS, Cam, Gloucester, UK Abstract To

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 1 Bootstrapped Bias and CIs Given a multiple regression model with mean and

More information

THE ROYAL STATISTICAL SOCIETY 2008 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE (MODULAR FORMAT) MODULE 4 LINEAR MODELS

THE ROYAL STATISTICAL SOCIETY 2008 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE (MODULAR FORMAT) MODULE 4 LINEAR MODELS THE ROYAL STATISTICAL SOCIETY 008 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE (MODULAR FORMAT) MODULE 4 LINEAR MODELS The Society provides these solutions to assist candidates preparing for the examinations

More information

Reliability of Acceptance Criteria in Nonlinear Response History Analysis of Tall Buildings

Reliability of Acceptance Criteria in Nonlinear Response History Analysis of Tall Buildings Reliability of Acceptance Criteria in Nonlinear Response History Analysis of Tall Buildings M.M. Talaat, PhD, PE Senior Staff - Simpson Gumpertz & Heger Inc Adjunct Assistant Professor - Cairo University

More information

IENG581 Design and Analysis of Experiments INTRODUCTION

IENG581 Design and Analysis of Experiments INTRODUCTION Experimental Design IENG581 Design and Analysis of Experiments INTRODUCTION Experiments are performed by investigators in virtually all fields of inquiry, usually to discover something about a particular

More information

New tricks by very old dogs: Predicting the catalytic hydrogenation of HMF derivatives using Slater-type orbitals

New tricks by very old dogs: Predicting the catalytic hydrogenation of HMF derivatives using Slater-type orbitals Supporting information S1 Ras et al. Supporting information for the article New tricks by very old dogs: Predicting the catalytic hydrogenation of HMF derivatives using Slater-type orbitals Erik-Jan Ras,*

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1)

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) Authored by: Sarah Burke, PhD Version 1: 31 July 2017 Version 1.1: 24 October 2017 The goal of the STAT T&E COE

More information

Independence and Dependence in Calibration: A Discussion FDA and EMA Guidelines

Independence and Dependence in Calibration: A Discussion FDA and EMA Guidelines ENGINEERING RESEARCH CENTER FOR STRUCTURED ORGANIC PARTICULATE SYSTEMS Independence and Dependence in Calibration: A Discussion FDA and EMA Guidelines Rodolfo J. Romañach, Ph.D. ERC-SOPS Puerto Rico Site

More information

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 Statistics Boot Camp Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 March 21, 2018 Outline of boot camp Summarizing and simplifying data Point and interval estimation Foundations of statistical

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice

The Model Building Process Part I: Checking Model Assumptions Best Practice The Model Building Process Part I: Checking Model Assumptions Best Practice Authored by: Sarah Burke, PhD 31 July 2017 The goal of the STAT T&E COE is to assist in developing rigorous, defensible test

More information

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1 Regression diagnostics As is true of all statistical methodologies, linear regression analysis can be a very effective way to model data, as along as the assumptions being made are true. For the regression

More information

A Better Way to Do R&R Studies

A Better Way to Do R&R Studies The Evaluating the Measurement Process Approach Last month s column looked at how to fix some of the Problems with Gauge R&R Studies. This month I will show you how to learn more from your gauge R&R data

More information

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel Lecture 2 Judging the Performance of Classifiers Nitin R. Patel 1 In this note we will examine the question of how to udge the usefulness of a classifier and how to compare different classifiers. Not only

More information

Spectroscopy in Transmission

Spectroscopy in Transmission Spectroscopy in Transmission + Reflectance UV/VIS - NIR Absorption spectra of solids and liquids can be measured with the desktop spectrograph Lambda 9. Extinctions up to in a wavelength range from UV

More information

Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

More information

Chapter 7: Simple linear regression

Chapter 7: Simple linear regression The absolute movement of the ground and buildings during an earthquake is small even in major earthquakes. The damage that a building suffers depends not upon its displacement, but upon the acceleration.

More information

Supplementary Note on Bayesian analysis

Supplementary Note on Bayesian analysis Supplementary Note on Bayesian analysis Structured variability of muscle activations supports the minimal intervention principle of motor control Francisco J. Valero-Cuevas 1,2,3, Madhusudhan Venkadesan

More information

Chemometrics. Matti Hotokka Physical chemistry Åbo Akademi University

Chemometrics. Matti Hotokka Physical chemistry Åbo Akademi University Chemometrics Matti Hotokka Physical chemistry Åbo Akademi University Linear regression Experiment Consider spectrophotometry as an example Beer-Lamberts law: A = cå Experiment Make three known references

More information

Chapter Three. Hypothesis Testing

Chapter Three. Hypothesis Testing 3.1 Introduction The final phase of analyzing data is to make a decision concerning a set of choices or options. Should I invest in stocks or bonds? Should a new product be marketed? Are my products being

More information

THE APPLICATION OF SIMPLE STATISTICS IN GRAINS RESEARCH

THE APPLICATION OF SIMPLE STATISTICS IN GRAINS RESEARCH THE APPLICATION OF SIMPLE STATISTICS IN GRAINS RESEARCH Phil Williams PDK Projects, Inc., Nanaimo, Canada philwilliams@pdkgrain.com INTRODUCTION It is helpful to remember two descriptions. The first is

More information

Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics

Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model Checking/Diagnostics Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics The session is a continuation of a version of Section 11.3 of MMD&S. It concerns

More information

Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics

Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model Checking/Diagnostics Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics The session is a continuation of a version of Section 11.3 of MMD&S. It concerns

More information

Howard Mark and Jerome Workman Jr.

Howard Mark and Jerome Workman Jr. Linearity in Calibration: How to Test for Non-linearity Previous methods for linearity testing discussed in this series contain certain shortcomings. In this installment, the authors describe a method

More information

Drift Reduction For Metal-Oxide Sensor Arrays Using Canonical Correlation Regression And Partial Least Squares

Drift Reduction For Metal-Oxide Sensor Arrays Using Canonical Correlation Regression And Partial Least Squares Drift Reduction For Metal-Oxide Sensor Arrays Using Canonical Correlation Regression And Partial Least Squares R Gutierrez-Osuna Computer Science Department, Wright State University, Dayton, OH 45435,

More information

1 Overview. Coefficients of. Correlation, Alienation and Determination. Hervé Abdi Lynne J. Williams

1 Overview. Coefficients of. Correlation, Alienation and Determination. Hervé Abdi Lynne J. Williams In Neil Salkind (Ed.), Encyclopedia of Research Design. Thousand Oaks, CA: Sage. 2010 Coefficients of Correlation, Alienation and Determination Hervé Abdi Lynne J. Williams 1 Overview The coefficient of

More information

Introduction to Principal Component Analysis (PCA)

Introduction to Principal Component Analysis (PCA) Introduction to Principal Component Analysis (PCA) NESAC/BIO NESAC/BIO Daniel J. Graham PhD University of Washington NESAC/BIO MVSA Website 2010 Multivariate Analysis Multivariate analysis (MVA) methods

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

Advancing from unsupervised, single variable-based to supervised, multivariate-based methods: A challenge for qualitative analysis

Advancing from unsupervised, single variable-based to supervised, multivariate-based methods: A challenge for qualitative analysis Advancing from unsupervised, single variable-based to supervised, multivariate-based methods: A challenge for qualitative analysis Bernhard Lendl, Bo Karlberg This article reviews and describes the open

More information

Method Transfer across Multiple MicroNIR Spectrometers for Raw Material Identification

Method Transfer across Multiple MicroNIR Spectrometers for Raw Material Identification Method Transfer across Multiple MicroNIR Spectrometers for Raw Material Identification Raw material identification or verification (of the packaging label) is a common quality-control practice. In the

More information

Rigorous Evaluation R.I.T. Analysis and Reporting. Structure is from A Practical Guide to Usability Testing by J. Dumas, J. Redish

Rigorous Evaluation R.I.T. Analysis and Reporting. Structure is from A Practical Guide to Usability Testing by J. Dumas, J. Redish Rigorous Evaluation Analysis and Reporting Structure is from A Practical Guide to Usability Testing by J. Dumas, J. Redish S. Ludi/R. Kuehl p. 1 Summarize and Analyze Test Data Qualitative data - comments,

More information

Supporting Australian Mathematics Project. A guide for teachers Years 11 and 12. Probability and statistics: Module 25. Inference for means

Supporting Australian Mathematics Project. A guide for teachers Years 11 and 12. Probability and statistics: Module 25. Inference for means 1 Supporting Australian Mathematics Project 2 3 4 6 7 8 9 1 11 12 A guide for teachers Years 11 and 12 Probability and statistics: Module 2 Inference for means Inference for means A guide for teachers

More information

CHAPTER 4 THE COMMON FACTOR MODEL IN THE SAMPLE. From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum

CHAPTER 4 THE COMMON FACTOR MODEL IN THE SAMPLE. From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum CHAPTER 4 THE COMMON FACTOR MODEL IN THE SAMPLE From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum 1997 65 CHAPTER 4 THE COMMON FACTOR MODEL IN THE SAMPLE 4.0. Introduction In Chapter

More information

Mixture Analysis Made Easier: Trace Impurity Identification in Photoresist Developer Solutions Using ATR-IR Spectroscopy and SIMPLISMA

Mixture Analysis Made Easier: Trace Impurity Identification in Photoresist Developer Solutions Using ATR-IR Spectroscopy and SIMPLISMA Mixture Analysis Made Easier: Trace Impurity Identification in Photoresist Developer Solutions Using ATR-IR Spectroscopy and SIMPLISMA Michel Hachey, Michael Boruta Advanced Chemistry Development, Inc.

More information

statistical methods for tailoring seasonal climate forecasts Andrew W. Robertson, IRI

statistical methods for tailoring seasonal climate forecasts Andrew W. Robertson, IRI statistical methods for tailoring seasonal climate forecasts Andrew W. Robertson, IRI tailored seasonal forecasts why do we make probabilistic forecasts? to reduce our uncertainty about the (unknown) future

More information

Physics 509: Error Propagation, and the Meaning of Error Bars. Scott Oser Lecture #10

Physics 509: Error Propagation, and the Meaning of Error Bars. Scott Oser Lecture #10 Physics 509: Error Propagation, and the Meaning of Error Bars Scott Oser Lecture #10 1 What is an error bar? Someone hands you a plot like this. What do the error bars indicate? Answer: you can never be

More information

Multivariate analysis (chemometrics) - quality of multivariate calibration

Multivariate analysis (chemometrics) - quality of multivariate calibration Multivariate analysis (chemometrics) - quality of multivariate calibration Wolfhard Wegscheider with contributions by Alessandra Rachetti 15 May 2018 Outline Historical reminescence: how it all started

More information

Inferential Analysis with NIR and Chemometrics

Inferential Analysis with NIR and Chemometrics Inferential Analysis with NIR and Chemometrics Santanu Talukdar Manager, Engineering Services Part 2 NIR Spectroscopic Data with Chemometrics A Tutorial Presentation Part 2 Page.2 References This tutorial

More information

Group comparison test for independent samples

Group comparison test for independent samples Group comparison test for independent samples The purpose of the Analysis of Variance (ANOVA) is to test for significant differences between means. Supposing that: samples come from normal populations

More information

A Calibration Model Maintenance Roadmap

A Calibration Model Maintenance Roadmap Preprints of the 9th International Symposium on Advanced Control of Chemical Processes The International Federation of Automatic Control June 7-,, Whistler, British Columbia, Canada MoA3. A Calibration

More information

CHEMISTRY. CHEM 0100 PREPARATION FOR GENERAL CHEMISTRY 3 cr. CHEM 0110 GENERAL CHEMISTRY 1 4 cr. CHEM 0120 GENERAL CHEMISTRY 2 4 cr.

CHEMISTRY. CHEM 0100 PREPARATION FOR GENERAL CHEMISTRY 3 cr. CHEM 0110 GENERAL CHEMISTRY 1 4 cr. CHEM 0120 GENERAL CHEMISTRY 2 4 cr. CHEMISTRY CHEM 0100 PREPARATION FOR GENERAL CHEMISTRY 3 cr. Designed for those students who intend to take chemistry 0110 and 0120, but whose science and mathematical backgrounds are judged by their advisors

More information

12.12 MODEL BUILDING, AND THE EFFECTS OF MULTICOLLINEARITY (OPTIONAL)

12.12 MODEL BUILDING, AND THE EFFECTS OF MULTICOLLINEARITY (OPTIONAL) 12.12 Model Building, and the Effects of Multicollinearity (Optional) 1 Although Excel and MegaStat are emphasized in Business Statistics in Practice, Second Canadian Edition, some examples in the additional

More information





