Accounting for measurement uncertainties in industrial data analysis

Marco S. Reis*; Pedro M. Saraiva
GEPSI-PSE Group, Department of Chemical Engineering, University of Coimbra, Pólo II Pinhal de Marrocos, 3030-90 COIMBRA, PORTUGAL
* Author to whom correspondence should be addressed: marco@eq.uc.pt

Abstract

This paper addresses the issue of integrating measurement uncertainty information in data analysis, namely in parametric and non-parametric regression. Several existing and new approaches are presented and critically assessed with respect to their prediction and parameter estimation abilities under different scenarios. The results show that methods which explicitly incorporate measurement uncertainty information are quite sound and promising, but do not always outperform other, simpler approaches.

1. Introduction

With the development of measurement instrumentation methods and metrology, the depth of knowledge regarding measurement quality, features and uncertainty has increased significantly (ISO, 1993). Even though many efforts have been made regarding the specification of uncertainty in data generation and measurement, the same is not often true for the corresponding task of data analysis, where we should also use techniques that take into account not only the data but also their associated uncertainty. The work presented in this paper addresses this issue. Several methodologies with the potential to integrate uncertainty information in the analysis of industrial data are briefly presented and compared with their current counterparts, which ignore measurement uncertainties. Some new methodologies are also proposed, in order to overcome some of the existing shortcomings. We provide examples from the two extremes of the modelling spectrum: non-parametric (nearest neighbour regression) and parametric (linear regression methodologies).

2. The General Modelling Problem

When we have at our disposal a reference data set with both inputs, X, and outputs, Y, from a given system, and the goal is to develop approaches that will allow us to make, in the future, predictive inferences about Y under given scenarios in the X domain, a wide spectrum of approaches can be used, with two major poles: non-parametric approaches make very mild assumptions about the X to Y relationship and basically use only the reference data as it is ("data-driven" techniques); parametric approaches assume a given well-defined underlying model of reality and adjust some of its parameters according to the data. Here we pick two particular cases, one from each category: the nearest neighbour method and the class of methods relying on linear models. They will be used to illustrate what options are available for explicitly accounting for known measurement uncertainties in addition to the X and Y values.

For each technique that makes use of the [X, Y] data, we present its counterpart that exploits the availability of both the measurement values and the corresponding uncertainties, respectively [X, Y] and [unc(X), unc(Y)].

2.1 Non-parametric approaches

Nearest neighbour regression (NNR) consists of using only those k observations from the reference (or training) data set that are closest to the new X value whose Y we want to estimate, with the inference for Y(x) being (Hastie et al., 2001):

\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i \qquad (1)

where N_k(x) is the set of k nearest neighbours of x. However, when data measurement uncertainties are also available, the distance in the X space should reflect them as well: if x is at the same Euclidean distance from x_i and x_k, but unc(x_i) > unc(x_k), it is more likely that x_i is further away from x than x_k. Therefore, we propose the following modification of the Euclidean distance for the counterpart, uncertainty-based approach (uNNR):

D_W(x_i, x_k)^2 = \sum_{j=1}^{N} \frac{(x_{i,j} - x_{k,j})^2}{\mathrm{unc}(x_{i,j})^2 + \mathrm{unc}(x_{k,j})^2} \qquad (2)

This should be complemented with a modified averaging methodology that also takes care of the information regarding the uncertainties in Y, leading to:

\hat{Y}(x) = \frac{\sum_{x_i \in N_{W,k}(x)} y_i / \mathrm{unc}(y_i)^2}{\sum_{x_i \in N_{W,k}(x)} 1 / \mathrm{unc}(y_i)^2} \qquad (3)

where N_{W,k}(x) is the set of k nearest neighbours of x under the weighted distance (2).
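As an illustration of how Eqs. (2) and (3) work together, here is a minimal NumPy sketch of the uNNR estimator; the function name, array layout and default value of k are our own illustrative choices, not part of the original formulation.

```python
import numpy as np

def unnr_predict(X, Y, uX, uY, x_new, ux_new, k=5):
    """Minimal uNNR sketch: uncertainty-weighted distances, Eq. (2),
    followed by an inverse-variance weighted average, Eq. (3).

    X  : (n, p) training inputs      uX : (n, p) input uncertainties
    Y  : (n,)   training outputs     uY : (n,)   output uncertainties
    x_new, ux_new : (p,) new observation and its uncertainties
    """
    # Eq. (2): squared differences scaled by the combined squared uncertainties
    d2 = np.sum((X - x_new) ** 2 / (uX ** 2 + ux_new ** 2), axis=1)
    idx = np.argsort(d2)[:k]               # the k nearest neighbours
    w = 1.0 / uY[idx] ** 2                 # inverse-variance weights for Y
    return np.sum(w * Y[idx]) / np.sum(w)  # Eq. (3)
```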

2.2 Parametric approaches

We restrict ourselves here to linear regression models with single or multiple inputs and a single output (i.e. SISO/MISO models). Furthermore, due to space limitations, some well-known linear regression approaches, such as weighted least squares (WLS), will not be part of our comparison study. Classical errors-in-variables (EIV) approaches, which simultaneously estimate parameters and true data, are also not considered, since our purpose is to compare (i) estimated parameter vectors and (ii) predictions over new data sets.

2.2.1 Ordinary Least Squares (OLS) and Multivariate Least Squares (MLS)

The well-known OLS estimate considers only homoscedastic errors in Y and is given by:

\hat{B}_{OLS} = \arg\min_{B} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \hat{B}_{OLS} = (X^T X)^{-1} X^T y \qquad (4)

where X is the n × (p+1) matrix with n observations of the p inputs plus one column for the intercept, y is the n × 1 column vector of outputs, and B is the (p+1) × 1 vector with the intercept and input coefficients. The full consideration of measurement uncertainties in both inputs and outputs, which can be heteroscedastic, is carried out with MLS (Martínez et al., 2002), and consists of numerically solving the following optimization problem:

\hat{B}_{MLS} = \arg\min_{B} \sum_{i=1}^{n} \frac{(y_i - \hat{y}_i)^2}{s_{e_i}^2} \qquad (5)

where s_{e_i}^2 is the variance of the regression residual at observation i when uncertainties in both inputs and outputs are accounted for, calculated using error propagation theory. Although the statistical properties of the OLS estimator are well established, it is pertinent to analyze the less well known properties of MLS.

2.2.2 Stepwise regression (SR) and best subset (BS) regression with OLS and MLS

A well-known problem of the OLS estimator is the increasing variability of the estimated parameters when the inputs are correlated. To overcome this problem, often only a subset of variables is selected and used, via stepwise regression or best subset regression (Draper and Smith, 1998). These two procedures are based on OLS estimates applied to a subset of variables (we will refer to them as SR-OLS and BS-OLS). Furthermore, SR-OLS uses the ANOVA decomposition along with a normality assumption to perform the necessary significance tests. In order to develop counterpart methodologies that account for measurement uncertainties, we replace the OLS steps with corresponding MLS steps. This raises no conceptual problem regarding the implementation of the so-called BS-MLS but, to the best of our knowledge, there is no exact ANOVA decomposition for MLS regression, and thus one needs to come up with an appropriate procedure for SR-MLS, by using the weighted sums of squares provided by the MLS algorithm (5) instead of the usual un-weighted ones of SR-OLS.

2.2.3 Partial least squares (PLS) and principal components regression (PCR)

Another class of methodologies for overcoming collinearity and achieving dimensionality reduction involves choosing several orthogonal directions in the X space and regressing Y onto those directions, which are linear combinations of the X-variables (called X-scores). In PCR, the linear combinations are those that explain most of the X variability; in PLS, they are those that, while also taking into account the X variability, correlate most with the output (or, more generally, the Y-scores). Both PCR and PLS present some robustness to noise and, to some extent, have means to take it into account through adequate scaling. However, this is not enough for general error structures, such as heteroscedastic noise. Based on previous work on maximum likelihood (ML) PCA, which incorporates X uncertainties in the estimation of the PCA model, Wentzell and Andrews (1997) developed ML-PCR*, using OLS in the regression step. Martínez et al. (2002) replaced OLS by MLS, thereby also incorporating the Y uncertainties. This method will henceforth be called ML-PCR. So far, to the best of our knowledge, no analogous methodology has been presented for PLS. Here, we propose a methodology that preserves the successful algorithmic nature of the original PLS, but modifies the optimization problems solved at each PLS step into counterparts that incorporate uncertainty, leading to what we will call uPLS. Thus, OLS regression steps are replaced by MLS steps, and least squares optimization problems by generalized weighted least squares problems, where the weights are given by the inverse of the square of the data measurement uncertainties. Furthermore, score uncertainties are calculated using error propagation formulas.
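All of the uncertainty-based parametric methods above reduce to repeated MLS fits of Eq. (5), which must be solved numerically because the residual variance depends on the coefficients being estimated. The sketch below shows one common way to perform such a fit, assuming first-order error propagation of the input and output uncertainties through the linear model and a simple iteratively reweighted least squares loop; it is our own illustration of the idea, not the exact algorithm of Martínez et al. (2002).

```python
import numpy as np

def mls_fit(X, y, uX, uy, n_iter=50, tol=1e-10):
    """MLS-type fit in the spirit of Eq. (5).

    X  : (n, p) inputs (no intercept column)   uX : (n, p) input uncertainties
    y  : (n,)   outputs                        uy : (n,)   output uncertainties
    """
    Xa = np.column_stack([np.ones(len(y)), X])   # add intercept column
    b = np.linalg.lstsq(Xa, y, rcond=None)[0]    # OLS starting point
    for _ in range(n_iter):
        # error propagation: s_e_i^2 = unc(y_i)^2 + sum_j b_j^2 * unc(x_ij)^2
        s2 = uy ** 2 + (uX ** 2) @ (b[1:] ** 2)
        W = 1.0 / s2
        # weighted least squares step: minimize sum_i W_i * (y_i - x_i^T b)^2
        b_new = np.linalg.solve(Xa.T @ (W[:, None] * Xa), Xa.T @ (W * y))
        if np.max(np.abs(b_new - b)) < tol:
            return b_new
        b = b_new
    return b  # [intercept, input coefficients]
```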

3. A Case Study Comparative Analysis

In this section we describe the main comparative results obtained by applying the different approaches mentioned above (with and without fully accounting for uncertainties) to a set of Monte Carlo simulated examples. To provide the basis for comparison, the following quantities were varied: the number of variables or number of latent dimensions; the correlation structure (COST) of the input variables (all variables with a fixed correlation among themselves), studied at both 0. and 0.9; and the heterogeneity level (HLEV) of the uncertainties of each variable, also studied at two levels (high/low; a high level means a highly heteroscedastic behaviour of the measurement noise standard deviation, or uncertainty, from observation to observation). The uncertainty variations occur randomly (uniform distribution) within a range given by 0.0 (HLEV=low) or (HLEV=high) times the mean uncertainty of each variable. The mean uncertainty of each variable was kept constant at 0. times its standard deviation. For each scenario, reference data were generated (using a linear model with unit coefficients), and we use the mean relative error, MRE (or the mean absolute error, MAE), to assess parameter estimation performance. Then another data set is generated, predictions are made using the estimated coefficient vector, and the root mean square error of prediction (RMSEP) is calculated. This process is repeated 00 times and the mean values of MRE, MAE and RMSEP are presented (a schematic sketch of one such scenario is given after Section 3.3 below). For the comparison between NNR and uNNR, our simulation procedure is simpler, since the goal is mainly to illustrate the advantage of incorporating uncertainty information.

3.1 NNR and uNNR

Our simulation study considers a non-linear relationship between Y and X (a sine wave), where we: (i) generate 500 samples uniformly distributed in [0, π]; (ii) add heteroscedastic noise to X and Y (mean uncertainty of 0. for X and Y; HLEV=high); (iii) create 50 testing samples, for which the weighted root mean square error (RMSE_W) is calculated. This process was repeated 50 times for each value of the number of nearest neighbours, and the means are shown in Figure 1, where uNNR consistently outperforms NNR.

3.2 Ordinary Least Squares (OLS) and Multivariate Least Squares (MLS)

In this analysis the number of variables was varied over two levels: 1 and 6. It can be seen in Table 1 that the MRE results differ widely depending on whether or not the intercept term is considered in the calculations. For just one regressor variable, MLS does a better job of estimating the parameters correctly, but its performance deteriorates for 6 variables.

3.3 Stepwise regression (SR) and best subset (BS) regression with OLS and MLS

We considered 4 and 0 variables, only half of which have non-zero coefficients. For the subset methods, we used the a priori known optimal number of variables. The detailed results cannot be shown due to space restrictions, but in general SR-MLS does a better parameter estimation job than its counterpart, SR-OLS, in particular for COST=0.; the same happens for the BS methods, although the results are similar for COST=0.9. The SR-MLS and BS-MLS results are quite similar.
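For concreteness, here is a stripped-down sketch of one Monte Carlo scenario of the kind described at the start of this section, fitting plain OLS and reporting MRE and RMSEP. The sample size, correlation and uncertainty settings are illustrative placeholders of our own, not the values used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_scenario(n=100, p=6, cost=0.9, mean_unc=0.1, n_rep=100):
    """One scenario: true linear model with unit coefficients, correlated
    inputs (pairwise correlation = cost) and heteroscedastic measurement
    noise. All settings are illustrative placeholders."""
    b_true = np.ones(p)
    cov = np.full((p, p), cost) + (1.0 - cost) * np.eye(p)
    mre, rmsep = [], []
    for _ in range(n_rep):
        # noise-free reference and test data sets
        X = rng.multivariate_normal(np.zeros(p), cov, size=n)
        Xt = rng.multivariate_normal(np.zeros(p), cov, size=n)
        y, yt = X @ b_true, Xt @ b_true
        # heteroscedastic measurement noise on inputs and output
        uX = mean_unc * X.std(axis=0) * rng.uniform(0.5, 1.5, size=(n, p))
        uy = mean_unc * y.std() * rng.uniform(0.5, 1.5, size=n)
        Xm, ym = X + rng.normal(0.0, uX), y + rng.normal(0.0, uy)
        # plain OLS fit on the noisy reference data (intercept omitted here)
        b_hat = np.linalg.lstsq(Xm, ym, rcond=None)[0]
        mre.append(np.mean(np.abs((b_hat - b_true) / b_true)))
        rmsep.append(np.sqrt(np.mean((Xt @ b_hat - yt) ** 2)))
    return np.mean(mre), np.mean(rmsep)

print(simulate_scenario())   # mean MRE and RMSEP over the replicates
```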

[Figure 1. Mean RMSE_W for NNR (- -) and uNNR (—) with an increasing number of nearest neighbours considered. Axes: RMSE_W versus number of nearest neighbours.]

Table 1. Mean values of MRE for the regression coefficient vector with OLS and MLS (* values without considering the intercept term).

# variables | HLEV | COST | OLS | MLS
1 | | 0.9 | 0.798 / 4.644 * | 39.8639 / 0.7864 *
1 | | 0.  | 98.56 / 3.93 * | 33.99 / 0.6565 *
1 | | 0.9 | 90.655 / 3.7869 * | 3.3586 / 0.646 *
1 | | 0.  | 94.4889 / 3.850 * | 37.67 / 0.74 *
6 | | 0.9 | 8.9859 / 7.840 * | 5.0894 / 9.9840 *
6 | | 0.  | 46.6668 / 3.60 * | 60.7667 / .0947 *
6 | | 0.9 | 75.045 / 7.6608 * | 57.884 / 0.78 *
6 | | 0.  | 6.06 / .903 * | 6.6 / .080 *

3.4 PLS, PCR, uPLS and ML-PCR

In the Monte Carlo study for this class of methods, the number of latent dimensions to be used was kept constant at two different levels (1 and 2 latent dimensions), in order to allow a fair comparison at similar levels of complexity. The results for the MRE means are shown in Table 2. With one latent variable, the methods tend to perform better at high correlation levels than at low correlations, which is to be expected: at high input correlations the use of only one latent dimension does not limit the explanation of variability as strongly as when the variables are almost uncorrelated. The fact that the PLS-based methods perform better for COST=0. may indicate a more effective use of the specified latent dimension. When 2 latent dimensions are used, the pattern of results for PLS and PCR changes. One possible explanation is that they use the second dimension to better estimate the remaining variability in the uncorrelated case (COST=0.), but are mostly fitting noise in the high correlation case (COST=0.9). This explanation is consistent with the prediction results obtained (Table 3), where a similar pattern can be seen. The proposed uPLS does not show this strong pattern and presents a consistently interesting estimation performance at COST=0.9. As for prediction, uPLS shows a certain dependency on HLEV, with the best performance being achieved at the low level.

Table 2. Mean values of MRE for the regression coefficient vector with PLS, uPLS and PCR (values without considering the intercept term).

# Lat. Dim. | HLEV | COST | PLS | uPLS | PCR | ML-PCR
1 | | 0.9 | 3.633 | 3.5605 | 3.685 | -
1 | | 0.  | 4.503 | 5.938 | 36.530 | -
1 | | 0.9 | 3.558 | 3.367 | 3.5648 | -
1 | | 0.  | 4.304 | .0608 | 37.794 | -
2 | | 0.9 | 6.605 | 9.35 | 0.059 | -
2 | | 0.  | 7.5586 | 9.9069 | 7.896 | -
2 | | 0.9 | 4.888 | .08 | 9.363 | -
2 | | 0.  | 6.956 | 0.704 | 8.634 | -

Table 3. RMSEP results for PLS, uPLS, PCR and ML-PCR.

# Lat. Dim. | HLEV | COST | PLS | uPLS | PCR | ML-PCR
1 | | 0.9 | 3.8070 | 5.088 | 3.8076 | 4.7035
1 | | 0.  | 4.6393 | 5.0877 | 9.409 | .7403
1 | | 0.9 | 3.740 | 3.735 | 3.747 | 3.747
1 | | 0.  | 4.5894 | 4.077 | 9.5960 | 9.4759
2 | | 0.9 | 4.3887 | 4.87 | 3.9454 | 4.5960
2 | | 0.  | 3.953 | 5.8639 | 7.599 | 0.0
2 | | 0.9 | 4.077 | 3.8365 | 3.7808 | 3.844
2 | | 0.  | 3.6950 | 4.0647 | 7.488 | 7.994

4. Discussion and Conclusions

In any simulation study of this kind, the results are strictly linked to the simulation settings used, but they hopefully provide useful guidelines for the adequate use of the suggested methods, as well as pointing out future research directions. In this line of thought, the results shown allow us to conclude that methods which explicitly incorporate measurement uncertainty information are quite sound and promising, but do not always clearly outperform other approaches. For instance, MLS shows problems in the multivariate collinear case, and ML-PCR seems to require substantially more dimensions to achieve the same predictive performance as other methods. In general, the predictive performance of the uncertainty-based methods can also be a matter of concern. This underlines the importance of developing methodologies that consistently perform better in multivariate noise environments, in estimation as well as in prediction. In this regard, we have proposed several methodologies (uNNR, SR-MLS, BS-MLS, uPLS) under a common general framework for measurement uncertainty incorporation, which seems to be particularly promising under certain operating scenarios.

Acknowledgements

The authors would like to acknowledge FCT for financial support through project POCTI/3647/EQU/000.

References

Draper, N.R.; Smith, H., 1998, Applied Regression Analysis, 3rd ed., Wiley, New York.
ISO, 1993, Guide to the Expression of Uncertainty in Measurement, Geneva, Switzerland.
Hastie, T.; Tibshirani, R.; Friedman, J., 2001, The Elements of Statistical Learning, Springer.
Martínez, À.; Riu, J.; Rius, F.X., 2002, J. Chemometrics, 16.
Wentzell, P.D.; Andrews, D.T.; Kowalski, B.R., 1997, Anal. Chem., 69.