Accounting for measurement uncertainties in industrial data analysis

Marco S. Reis*; Pedro M. Saraiva
GEPSI-PSE Group, Department of Chemical Engineering, University of Coimbra, Pólo II Pinhal de Marrocos, 3030-290 Coimbra, Portugal

Abstract
This paper addresses the integration of measurement uncertainty information in data analysis, namely in parametric and non-parametric regression. Several existing and new approaches are presented and critically assessed regarding their prediction and parameter estimation abilities, under different scenarios. The results show that methods which explicitly incorporate measurement uncertainty information are quite sound and promising, but do not always outperform other, simpler approaches.

1. Introduction

With the development of measurement instrumentation methods and metrology, the depth of knowledge regarding measurement quality, features and uncertainty has increased significantly (ISO, 1993). Even though many efforts have been made regarding the specification of uncertainty in data generation and measurement, this is not often the case when one moves to the corresponding task of data analysis, where we should also use techniques that take into account not only the data but also their associated uncertainty. The work presented in this paper addresses this issue. Several methodologies with the potential of integrating uncertainty information in the analysis of industrial data are briefly presented and compared with their current counterparts, which ignore measurement uncertainties. Some new methodologies are also proposed, in order to overcome some of the existing shortcomings. We provide examples from the two extremes of the modelling paradigm spectrum: non-parametric (nearest neighbour methodology) and parametric (linear regression methodologies).
2. The General Modelling Problem

When we have at our disposal a reference data set with both inputs, X, and outputs, Y, from a given system, and the goal is to develop approaches that will allow us to make, in the future, predictive inferences about Y under given scenarios in the X domain, a wide spectrum of approaches can be used, with two major poles: non-parametric approaches make very mild assumptions about the X to Y relationship and basically use the reference data as it is (data-driven techniques); parametric approaches assume a given well-defined underlying model of reality and adjust some of its parameters according to the data. Here, we will pick two particular cases that represent each of these categories: the nearest neighbour method and the class of methods relying on linear models. They will be used to illustrate what options are available for explicitly accounting for known measurement uncertainties in addition to the X and Y values. For each technique that makes use of the [X,Y] data, we present its counterpart, which explores the availability of both the measurement values and the corresponding uncertainties, respectively [X,Y] and [unc(X),unc(Y)].

* Author to whom correspondence should be addressed: marco@eq.uc.pt

2.1 Non-parametric approaches

Nearest neighbour regression (NNR) consists of using only those k observations from the reference (or training) data set that are closest to the new X value whose Y we want to estimate, with the inference for Y(x) being (Hastie et al., 2001):

\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i    (1)

where N_k(x) is the set of k nearest neighbours of x. However, when data measurement uncertainties are also available, the distance in the X space should reflect them as well: if x is at the same Euclidean distance from x_i and x_k, but unc(x_i) > unc(x_k), it is more likely for x_i to be further away from x than x_k. Therefore, we propose the following modification of the Euclidean distance for the counterpart, uncertainty-based approach (uNNR):

D_W(x, x_i) = \sum_{j=1}^{N} \frac{(x_j - x_{i,j})^2}{unc(x_j)^2 + unc(x_{i,j})^2}    (2)

where N is the number of input variables. This should be complemented with a modified averaging methodology that also takes care of the information regarding uncertainties in Y, leading to:

\hat{Y}(x) = \left( \sum_{x_i \in N_{W,k}(x)} \frac{y_i}{unc(y_i)^2} \right) \Bigg/ \left( \sum_{x_i \in N_{W,k}(x)} \frac{1}{unc(y_i)^2} \right)    (3)

where N_{W,k}(x) is the set of k nearest neighbours of x under the distance D_W.

2.2 Parametric approaches

We will restrict ourselves here to linear regression models with single/multiple inputs and a single output (i.e. SISO/MISO models). Furthermore, due to space limitations, some well-known linear regression approaches, such as weighted least squares (WLS), will not be part of our comparison study. Also, classical EIV approaches, which simultaneously estimate parameters and true data, are not considered, since our purpose is to compare (i) estimated parameter vectors and (ii) predictions over new data sets.
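The uncertainty-based nearest neighbour scheme described above can be sketched in a few lines; the function below is a minimal illustration (names and the default k are ours, not from the paper), combining the uncertainty-weighted distance with the inverse-variance weighted average of the neighbours' outputs:

```python
import numpy as np

def unnr_predict(x_new, unc_x_new, X, Y, unc_X, unc_Y, k=5):
    """Uncertainty-based nearest neighbour regression (uNNR) sketch.

    Distances are down-weighted by the combined uncertainties of the query
    point and each reference point; the prediction is an inverse-variance
    weighted average of the k nearest neighbours' outputs.
    """
    # uncertainty-weighted squared distance to every reference point
    d2 = np.sum((x_new - X) ** 2 / (unc_x_new ** 2 + unc_X ** 2), axis=1)
    idx = np.argsort(d2)[:k]            # indices of the k nearest neighbours
    w = 1.0 / unc_Y[idx] ** 2           # inverse-variance weights on Y
    return np.sum(w * Y[idx]) / np.sum(w)
```

With equal uncertainties everywhere this reduces to ordinary NNR with a plain average, which is a quick sanity check for an implementation.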
2.2.1 Ordinary Least Squares (OLS) and Multivariate Least Squares (MLS)

The well-known OLS estimate considers only homoscedastic errors in Y and is given by:

\hat{B}_{OLS} = \arg\min_{B} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \hat{B}_{OLS} = (X^T X)^{-1} X^T y    (4)

where X is the n×(p+1) matrix with the n observations of the p inputs, plus one column for the intercept, y is the n×1 column vector of outputs, and B is the (p+1)×1 vector with the intercept and input coefficients. The full consideration of measurement uncertainties in both inputs and outputs, which can be heteroscedastic, is carried out with MLS (Martínez et al., 2002), and consists of numerically solving the following optimization problem:
\hat{B}_{MLS} = \arg\min_{B} \sum_{i=1}^{n} \frac{(y_i - \hat{y}_i)^2}{s_{e_i}^2}    (5)

where s_{e_i}^2 is the variance of the regression residual at observation i when uncertainties in both inputs and outputs are accounted for, calculated using error propagation theory. Although the statistical properties of the OLS estimator are well established, it is pertinent to analyze the less well known properties of MLS.

2.2.2 Stepwise regression (SR) and best subset (BS) regression with OLS and MLS

A well-known problem of the OLS estimator is the increasing variability of the estimated parameters when the inputs are correlated. To overcome this problem, often only a subset of the variables is selected and used, through stepwise regression or best subset regression (Draper and Smith, 1998). These two procedures are based on OLS estimates applied to a subset of the variables (we will refer to them as SR-OLS and BS-OLS). Furthermore, SR-OLS uses the ANOVA decomposition along with a normality assumption to perform the necessary significance tests. In order to develop the counterpart methodologies that account for measurement uncertainties, we replace the OLS steps with corresponding MLS steps. This does not raise any conceptual problem regarding the implementation of the so-called BS-MLS but, to the best of our knowledge, there is no exact ANOVA decomposition for MLS regression, and thus one needs to come up with an appropriate procedure for SR-MLS, by using the weighted sums of squares provided by the MLS algorithm (5) instead of the usual un-weighted ones of SR-OLS.

2.2.3 Partial least squares (PLS) and principal components regression (PCR)

Another class of methodologies for overcoming collinearity and achieving a sort of dimensionality reduction involves choosing several orthogonal directions in the X space and regressing Y onto those directions, which are linear combinations of the X-variables (called X-scores).
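Before moving on, the OLS and MLS estimators above can be sketched numerically. This is an illustration, not the algorithm of Martínez et al.: we assume a first-order error-propagation form for the residual variance (output variance plus input variances propagated through the current slope estimates) and solve the weighted problem by a simple fixed-point iteration rather than a general-purpose optimizer:

```python
import numpy as np

def ols_fit(X, y):
    """OLS estimate: append an intercept column and solve the
    least squares problem (lstsq is numerically safer than
    forming (X'X)^{-1} explicitly)."""
    Xa = np.column_stack([np.ones(len(X)), X])   # n x (p+1) design matrix
    B, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    return B                                      # [intercept, slopes...]

def mls_fit(X, y, unc_X, unc_y, n_iter=50):
    """MLS-style estimate by iteratively reweighted least squares (sketch).

    At each pass the residual variance is rebuilt from the current slopes
    using first-order error propagation (an assumed form):
        s_ei^2 = unc(y_i)^2 + sum_j B_j^2 * unc(x_ij)^2
    and a weighted least squares problem is solved with weights 1/s_ei^2.
    """
    Xa = np.column_stack([np.ones(len(X)), X])
    B = ols_fit(X, y)                             # OLS starting point
    for _ in range(n_iter):
        s2 = unc_y ** 2 + (unc_X ** 2) @ (B[1:] ** 2)  # error propagation
        W = np.diag(1.0 / s2)
        B_new = np.linalg.solve(Xa.T @ W @ Xa, Xa.T @ W @ y)
        if np.allclose(B_new, B, rtol=0.0, atol=1e-12):
            break
        B = B_new
    return B
```

Because the weights themselves depend on B, the fixed-point scheme is only an approximation to the minimizer of (5); in practice it converges quickly when the input uncertainties are moderate.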
In PCR, the linear combinations are those that explain most of the X variability, while in PLS they are those that, taking the X-variability also into account, correlate most with the output (or, more generally, the Y-scores). Both PCR and PLS present some robustness to noise and, to some extent, have means to take it into account through adequate scaling. However, this is not enough for general error structures, such as heteroscedastic noise. Based on previous work on maximum likelihood (ML) PCA, which incorporates the X uncertainties in the estimation of the PCA model, Wentzell and Andrews (1997) developed ML-PCR*, using OLS in the regression step. Martínez et al. (2002) replaced OLS by MLS, incorporating in this way the Y-uncertainties. This method will henceforth be called ML-PCR. So far, to the best of our knowledge, no analogous methodology has been presented for PLS. Here, we propose a methodology that preserves the original, successful algorithmic nature of PLS, while modifying the optimization problems solved at each PLS step into counterparts that incorporate uncertainty, leading to what we will call uPLS. Thus, OLS regression steps are replaced by MLS steps, and least squares optimization problems by general weighted least squares problems, where the weights are given by the inverse of the square of the data measurement uncertainties. Furthermore, score uncertainties are calculated using error propagation formulas.
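The regression-stage substitution that distinguishes ML-PCR from plain PCR (and, analogously, uPLS from PLS) can be illustrated with a simplified sketch: PCA by SVD, first-order propagation of the X-uncertainties onto the scores, and an inverse-variance weighted regression in place of OLS. This illustrates the idea only; the weight formula is our simplification, and the full ML-PCA of Wentzell and co-workers estimates the subspace itself in a maximum likelihood sense, which is not attempted here:

```python
import numpy as np

def pcr_weighted(X, y, unc_X, unc_y, n_comp=1):
    """PCR with an uncertainty-weighted regression step (simplified sketch).

    PCA is obtained by SVD on mean-centred X; score uncertainties come from
    first-order propagation through the loadings; the regression of y on
    the scores uses inverse-variance weights instead of plain OLS.
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_comp].T                     # loadings, p x n_comp
    T = Xc @ P                            # scores,   n x n_comp
    # propagate input uncertainties onto the scores:
    # var(t_ia) = sum_j P_ja^2 * unc(x_ij)^2
    unc_T2 = (unc_X ** 2) @ (P ** 2)
    # inverse-variance weights combining score and output uncertainties
    w = 1.0 / (unc_y ** 2 + unc_T2.sum(axis=1))
    Ta = np.column_stack([np.ones(len(T)), T])
    B = np.linalg.solve(Ta.T @ (w[:, None] * Ta), Ta.T @ (w * y))
    return B, P
```

Setting all uncertainties equal recovers ordinary PCR, which makes the effect of heteroscedastic weights easy to isolate in simulations.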
3. A Case Study Comparative Analysis

In this section we describe the main comparative results obtained by applying the different approaches mentioned before (with and without fully accounting for uncertainties) to a set of Monte Carlo simulated examples. To provide the basis for comparison, the following quantities were varied: the number of variables or number of latent dimensions; the correlation structure (COST) of the input variables (all variables with a fixed correlation among themselves), studied at both 0.1 and 0.9; and the heterogeneity level (HLEV) of the uncertainties for each variable, also studied at two levels (high/low; a high level means a highly heteroscedastic behaviour of the measurement noise standard deviation, or uncertainty, from observation to observation). The uncertainty variations occur randomly (uniform distribution) within a range given by 0.01 (HLEV=low) or 2 (HLEV=high) times the mean uncertainty for each variable. The mean uncertainty for each variable was kept constant at 0.1 times its standard deviation. For each scenario, reference data were generated (using a linear model with unit coefficients), and we use the mean relative error, MRE (or the mean absolute error, MAE), for parameter estimation performance assessment. Then, another data set is generated and predictions are made using the estimated vector, after which the root mean square error of prediction (RMSEP) is calculated. This process is repeated 100 times and the mean MRE, MAE and RMSEP values are presented. For the comparison between NNR and uNNR, our simulation procedure is simpler, since the goal is mainly to illustrate the advantage of incorporating uncertainty information.

3.1 NNR and uNNR

Our simulation study considers a non-linear relationship between Y and X (a sine wave), where we: (i) generate 500 samples uniformly distributed in [0, 2π]; (ii) add heteroscedastic noise to X and Y (mean uncertainty of 0.1
for X and Y; HLEV=high); (iii) create 50 testing samples, for which the corresponding weighted root mean square error (RMSE_W) is calculated. This process was repeated 50 times for each value of the number of nearest neighbours, k, and the means are shown in Figure 1, where uNNR consistently outperforms NNR.

3.2 Ordinary Least Squares (OLS) and Multivariate Least Squares (MLS)

In this analysis the number of variables was varied over two levels: 1 and 6. It can be seen in Table 1 that the results for MRE differ widely, depending on whether or not we consider the intercept term in the calculations. For just one regressor variable, MLS does a better job of estimating the parameters correctly, but its performance deteriorates for 6 variables.

3.3 Stepwise regression (SR) and best subset (BS) regression with OLS and MLS

We considered 4 and 10 variables, only half of which have non-zero coefficients. For the subset methods, we used the a priori known optimal number of variables. The detailed results cannot be shown due to space restrictions but, in general, SR-MLS does a better parameter estimation job than its counterpart, SR-OLS, in particular for COST=0.1; the same happens for the BS methods, although the results are similar for COST=0.9. SR-MLS and BS-MLS results are quite similar.
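The performance measures used throughout these comparisons can be computed as follows. Their exact definitions are not spelled out in the text, so we assume the usual ones, with MRE expressed in percent (consistent with the magnitudes reported in the tables):

```python
import numpy as np

def mre(b_hat, b_true):
    """Mean relative error of an estimated coefficient vector, in percent."""
    return 100.0 * np.mean(np.abs((b_hat - b_true) / b_true))

def mae(b_hat, b_true):
    """Mean absolute error of an estimated coefficient vector."""
    return np.mean(np.abs(b_hat - b_true))

def rmsep(y_hat, y_new):
    """Root mean square error of prediction over a fresh data set."""
    return np.sqrt(np.mean((y_hat - y_new) ** 2))
```

In a Monte Carlo loop these are evaluated once per replicate and then averaged over the replicates, as described above.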
Figure 1. Mean RMSE_W for NNR (- -) and uNNR (—) with an increasing number of nearest neighbours considered.

Table 1. Mean values of MRE for the regression coefficient vector with OLS and MLS (values marked * do not consider the intercept term).

  Simulation conditions           Methods
  # variables  HLEV  COST    OLS                  MLS
  1            high  0.9     0.798 / 4.644*       39.8639 / 0.7864*
  1            high  0.1     98.56 / 3.93*        33.99 / 0.6565*
  1            low   0.9     90.655 / 3.7869*     3.3586 / 0.646*
  1            low   0.1     94.4889 / 3.850*     37.67 / 0.74*
  6            high  0.9     8.9859 / 7.840*      5.0894 / 9.9840*
  6            high  0.1     46.6668 / 3.60*      60.7667 / .0947*
  6            low   0.9     75.045 / 7.6608*     57.884 / 0.78*
  6            low   0.1     6.06 / .903*         6.6 / .080*

3.4 PLS, PCR, uPLS and ML-PCR

In the Monte Carlo study for this class of methods, the number of latent dimensions to be used was kept fixed at two different levels (1 and 2 latent dimensions), in order to allow for a fair comparison at similar levels of complexity. The results for the MRE means are shown in Table 2. With one latent variable, the methods tend to perform better at high correlation levels than at low ones, which could be expected: at high input correlations the use of only one latent dimension does not limit the explanation of variability as strongly as in the case where the variables are almost uncorrelated. The fact that the PLS-based methods perform better for COST=0.1 may indicate a more effective use of the specified latent dimension. When 2 latent dimensions are used, the pattern of results for PLS and PCR changes. One possible explanation is that they use the second dimension to better estimate the remaining variability in the uncorrelated case (COST=0.1), but mostly fit noise in the high correlation case (COST=0.9). This explanation is consistent with the prediction results obtained (Table 3), where we can see a similar pattern. The proposed uPLS does not show this strong pattern and presents consistently good estimation performance at COST=0.9.
As for prediction, there is a certain dependency of uPLS on HLEV, with the best performance being achieved at the low level.
Table. Mean values of MRE for regression coefficient vector with PLS, upls and PCR (values without considering the intercept term). Simulation conditions Methods # Lat. Dim. HLEV COST PLS upls PCR ML-PCR 0.9 3.633 3.5605 3.685-0. 4.503 5.938 36.530-0.9 3.558 3.367 3.5648-0. 4.304.0608 37.794-0.9 6.605 9.35 0.059-0. 7.5586 9.9069 7.896-0.9 4.888.08 9.363-0. 6.956 0.704 8.634 - Table 3. RMSEP results for PLS, upls, PCR and ML-PCR. Simulation conditions Methods # Lat. Dim. HLEV COST PLS upls PCR ML-PCR 0.9 3.8070 5.088 3.8076 4.7035 0. 4.6393 5.0877 9.409.7403 0.9 3.740 3.735 3.747 3.747 0. 4.5894 4.077 9.5960 9.4759 0.9 4.3887 4.87 3.9454 4.5960 0. 3.953 5.8639 7.599 0.0 0.9 4.077 3.8365 3.7808 3.844 0. 3.6950 4.0647 7.488 7.994 4. Discussion and Conclusions In any simulation study of this kind, the results are strictly linked to the simulation settings used, but hopefully provide useful guidelines to adequately use the suggested methods as well as point out future research directions. In this line of thought, the results shown allow us to conclude that methods which explicitly incorporate measurement uncertainty information are quite sound and promising but do not always clearly outperform other approaches. For instance, MLS shows problems for the multivariate collinear case and ML-PCR seems to require substantial more dimensions to achieve the same predictive performance of other methods. In general, the predictive performance of the uncertainty based methods can also be a matter of concern. This underlines the importance of developing methodologies that consistently perform better in multivariate noise environments either in estimation or in prediction. In this regard, we proposed several methodologies (unnr, SR-MLS, BS-MLS, upls) under a common general framework for measurement uncertainty incorporation that seems to be particularly promising under certain operating scenarios. 
Acknowledgements
The authors would like to acknowledge FCT for financial support through project POCTI/3647/EQU/000.

References
Draper, N.R.; H. Smith, 1998, Applied Regression Analysis, 3rd ed., Wiley, NY.
Hastie, T.; R. Tibshirani; J. Friedman, 2001, The Elements of Statistical Learning, Springer.
ISO, 1993, Guide to the Expression of Uncertainty in Measurement, Geneva, Switzerland.
Martínez, À.; J. Riu; F.X. Rius, 2002, J. Chemometrics 16.
Wentzell, P.D.; D.T. Andrews; B.R. Kowalski, 1997, Anal. Chem. 69.