Least squares with non-normal data: estimating experimental variance functions


PERSPECTIVE (The Analyst)

Joel Tellinghuisen*

DOI: 10.1039/b708709h

Contrary to popular belief, the method of least squares (LS) does not require that the data have normally distributed (Gaussian) error for its validity. One practically important application of LS fitting that does not involve normal data is the estimation of data variance functions (VFE) from replicate statistics. If the raw data are normal, sampling estimates s² of the variance σ² are χ² distributed. For small degrees of freedom, the χ² distribution is strongly asymmetrical (exponential in the case of three replicates, for example). Monte Carlo computations for linear variance functions demonstrate that with proper weighting, the LS variance-function parameters remain unbiased, minimum-variance estimates of the true quantities. However, the parameters are strongly non-normal: almost exponential for some parameters estimated from s² values derived from three replicates, for example. Similar LS estimates of standard deviation functions (SDFE) from estimated s values have a predictable and correctable bias stemming from the bias inherent in s as an estimator of σ. Because s² and s have uncertainties proportional to their magnitudes, the VFE and SDFE fits require weighting as σ⁻⁴ and σ⁻², respectively. However, these weights must be evaluated on the calculated functions rather than directly from the sampling estimates. The computation is thus iterative but usually converges in a few cycles, with remaining weighting bias sufficiently small as to be of no practical consequence.

Introduction

The method of least squares (LS) is perhaps the most powerful and general-purpose data analysis tool in physical science. Yet its validity does rest on a number of premises about the data, which are generally impossible to confirm in practical cases.
Among these is the statement, often used disparagingly: "Least squares requires that the data have random, normally distributed errors, and because of experimental anomalies, most real data cannot be normal." Actually, that statement is not quite correct; but more importantly, the question of what effect non-normal data have on the outcome of an LS analysis has rarely been investigated. This paper addresses that question in the context of data that can be very far from normal: sampling estimates of variances. The practical need for such fitting is to determine data variance functions, in order to provide correct statistical weights for analyzing the data themselves. For the purpose of the present study, several replicates are taken for each of a number of (x, y) data points. We will see that sampling estimates (s²) of variances (σ²), though wildly non-normal in their distribution (exponential for three normally distributed replicates!), nonetheless yield unbiased, minimum-variance estimates of the LS parameters, at least in the idealized world of Monte Carlo (MC) computations, where we can be sure of the data structure and correct weights at the outset. On the other hand, if the same data are fitted as standard deviation (s) estimates of σ, the results exhibit predictable bias, because now the data themselves are biased, contradicting one of the basic LS premises.

The present study arose from a recent investigation of the effects of neglecting weights in calibration with heteroscedastic data (ref. 1). For minimum-variance estimation of the parameters, the data must be weighted inversely as their variance, w_i ∝ σ_i⁻². Yet, in ref. 1 it was noted (and not for the first time; ref. 6) that weighting as the inverse sampling variance, w_i ∝ s_i⁻², could actually be worse than neglecting weights altogether.

* Department of Chemistry, Vanderbilt University, Nashville, Tennessee 37235, USA
On the other hand, if the data variance depended in some simple way on the experimental parameters, and the sampling estimates were used collectively to estimate that dependence, then even approximate determination of the true variance function sufficed to yield near-minimum-variance estimates of the calibration parameters. These results left open the questions: (1) is it better to fit standard deviations or variances; (2) how should the s (s²) data be weighted in the variance function estimation (VFE) step in the real world, where true weights are not known in advance; and (3) how many replicates suffice? The last question was already answered partially in ref. 1, by the observation that even a crude determination of the data variance function led to <10% loss in precision in the desired calibration function.

This journal is © The Royal Society of Chemistry 2008. Analyst, 2008, 133.

The MC approach of this study is the same as that of ref. 1 and earlier works in a similar vein (refs. 7 and 8): assume simple functional forms for the variance function and the data response function, generate N equivalent data sets by adding random

normal (Gaussian) error to the true response function, and then analyze each data set and accumulate statistics for all such sets. The choice of response function is unimportant here, except that it must permit variation in the variance function. I have used just the linear function, y = f(x) = a + bx. For the data error, I have also confined my attention to just two simple two-parameter forms: σ² = A₁ + B₁y² and σ = A₂ + B₂y. All of these forms thus have the advantage of being linear in the adjustable parameters; and the error functions are reasonable for many experimental techniques, where constant error must dominate in the weak-signal limit, but where proportional error dominates for strong signal. Since the raw data for the estimation of the data error functions are the replicate-based estimates of s² or s, each of the N data sets consists of n_R replicates at each of n_x values of the independent variable (x). Thereby the reliability of the VFE fitting can be related to the manner in which the n_R × n_x values are distributed among x values and replicates.

Before getting into the details of the computations, it is useful to review the tenets of linear LS (refs. 2-5, 8). The independent variable x is assumed to be error-free. If the data for the dependent variable y are unbiased, with errors that are random, independent, and of finite variance, the LLS estimators will be unbiased and of finite variance. If the data are weighted as σ_i⁻², the LLS estimators will be minimum variance (which essentially means the best possible). When the data error structure is known a priori (as, for example, in MC computations), the parameter variances are also known exactly a priori; they are the diagonal elements of the variance-covariance matrix V_prior.
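As a concrete illustration of this data structure, the following sketch (in Python rather than the paper's FORTRAN; the function name, parameter values, and seed are illustrative, not the paper's) generates n_R replicates at each of n_x x values and forms the replicate means and sampling variances s²:

```python
import random

def make_replicates(a=0.1, b=10.0, A1=1.0, B1=4e-4,
                    n_x=6, n_r=3, x_max=10.0, seed=1):
    """Generate n_r replicate measurements at each of n_x evenly
    spaced x values, with true variance sigma^2 = A1 + B1*y^2.
    Returns a list of (x, mean, sampling variance s^2) tuples."""
    rng = random.Random(seed)
    data = []
    for i in range(n_x):
        x = x_max * i / (n_x - 1)
        y_true = a + b * x
        sigma = (A1 + B1 * y_true ** 2) ** 0.5
        reps = [y_true + rng.gauss(0.0, sigma) for _ in range(n_r)]
        ybar = sum(reps) / n_r
        s2 = sum((y - ybar) ** 2 for y in reps) / (n_r - 1)  # nu = n_r - 1
        data.append((x, ybar, s2))
    return data
```

The s² values returned here are the raw data for the VFE fits discussed below; with n_r = 3 each carries only ν = 2 degrees of freedom.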
If the data errors are normally distributed, then the LLS parameter estimates are normally distributed; and the sum of weighted, squared residuals, S = Σ(δ_i/σ_i)², is χ² distributed for ν degrees of freedom, where ν is the number of fitted data points minus the number of adjustable parameters.

Least squares and Monte Carlo computations

The method of least squares is described in refs 1-8, most of which include the compact matrix notation that I will use here to cover the bare essentials. In LLS the data and adjustable parameters can be expressed as

y = Xβ + δ, (1)

where the column vectors y and δ contain the n measured values of the dependent variable and the residuals, respectively, β contains the p adjustable parameters, and the design matrix X has n rows and p columns. The LS solution minimizes S = Σw_iδ_i² = δᵀWδ with respect to the p adjustable parameters, where the residuals δ_i = y_i − f(x_i), and the weight matrix W is diagonal, with W_ii = w_i. Minimum-variance estimation of the parameters β requires that w_i ∝ σ_i⁻². Unweighted LS (OLS) assumes that all σ_i are the same and uses w_i = 1, giving

b_u = (XᵀX)⁻¹Xᵀy ≡ A_u⁻¹Xᵀy ≡ V_u Xᵀy. (2)

The data variance is then estimated from the residuals, s² = S/(n − p), and the estimated parameter variances are the diagonal elements of the a posteriori variance-covariance matrix,

V_post = s²V_u. (3)

V_u is entirely determined by the x-structure of the data, so it is known from the outset. If the data variance is known to be σ², then

V_prior = σ²V_u, (4)

a result which is both known at the outset and exact. This is a special case of that for data of non-constant σ,

V_prior = (XᵀWX)⁻¹ ≡ A_W⁻¹, (5)

where w_i = σ_i⁻². If the σ_i are known in only a relative sense, we must again resort to a posteriori estimates,

V_post = (S/ν) A_W⁻¹. (6)

Then the prefactor is known as the variance for data of unit weight.
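For the straight-line model used throughout, eqns (2)-(5) reduce to 2 × 2 normal equations that can be solved in closed form. A minimal weighted-LS sketch (the helper name is mine, not the paper's):

```python
def wls_line(x, y, w):
    """Weighted LS fit of y = a + b*x with weights w_i = 1/sigma_i^2.
    Returns (a, b) and the covariance matrix (X^T W X)^{-1} of eqn (5),
    as ((Vaa, Vab), (Vab, Vbb))."""
    Sw  = sum(w)
    Sx  = sum(wi * xi for wi, xi in zip(w, x))
    Sy  = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    D = Sw * Sxx - Sx * Sx          # determinant of X^T W X
    a = (Sxx * Sy - Sx * Sxy) / D
    b = (Sw * Sxy - Sx * Sy) / D
    V = ((Sxx / D, -Sx / D), (-Sx / D, Sw / D))
    return a, b, V
```

With w_i = 1 throughout, this is the OLS solution of eqn (2); with true weights it gives the exact V_prior of eqn (5).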
However, unless the σ_i are known to within a constant factor, eqn (6) is invalid and may be either optimistic or pessimistic (ref. 1).

If the data error is normal and the data are properly weighted, the sampling estimates s² of the variance are distributed as a scaled χ² variate; thus they have the statistical properties of χ², which has mean ν and variance 2ν (ref. 9). In the reduced form χ²/ν, the probability distribution is

P(z) = C z^((ν−2)/2) exp(−νz/2), (7)

where C is a normalization constant and z = χ²/ν. The reduced χ² has mean 1 and variance 2/ν; accordingly, sampling variance estimates (s², V_post,ii) have relative standard deviation (2/ν)^(1/2); and from simple error propagation, estimated standard deviations and parameter errors have relative standard error (RSE) a factor of 2 smaller, or (2ν)^(−1/2). P(z) becomes Gaussian in the large-ν limit, but is far from normal and highly asymmetrical for the small ν of primary concern here (Fig. 1). The distribution of standard deviations is that for (χ²/ν)^(1/2) and can be obtained from P(z) by

P_y(y) dy = P_z(z) dz, (8)

with y = √z. Using either P_z(z) or P_y(y), we can verify that the average of y, ⟨y⟩ = ⟨√z⟩, is

⟨y⟩ = (2/ν)^(1/2) Γ((ν + 1)/2) / Γ(ν/2), (9)

where Γ is the gamma function (ref. 9). The value of ⟨y⟩ is <1 but approaches 1 for large ν. This is the bias in sampling estimates of standard deviations. (The difference between ⟨y⟩ and ⟨y²⟩^(1/2) is familiar to chemists in the context of gas kinetic theory, where the mean and rms molecular speeds differ.) Since both s and s² have standard deviations proportional to their magnitudes, LS fits of s and s² to data error functions should be weighted inversely as σ² and σ⁴, respectively. In MC computations, we know the true σ (σ²) by assignment.
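Eqn (9) is easy to evaluate numerically; a small sketch (the function name is mine), using log-gamma to avoid overflow at large ν:

```python
import math

def sd_bias(nu):
    """<s>/sigma for nu degrees of freedom, eqn (9):
    sqrt(2/nu) * Gamma((nu+1)/2) / Gamma(nu/2),
    computed via lgamma so large nu does not overflow."""
    return math.sqrt(2.0 / nu) * math.exp(
        math.lgamma((nu + 1) / 2.0) - math.lgamma(nu / 2.0))
```

For three replicates (ν = 2) this evaluates to √π/2 ≈ 0.886, i.e. s underestimates σ by about 11% on average; the bias vanishes as ν grows.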
However, in dealing with actual data we have two choices, neither equivalent to this theoretical weighting: compute the weights from either the sampling estimates or the fitted functional approximation of σ (σ²). The latter approach requires iteration: the fitted function is not known at the outset, so estimates of the weights must be

revised following the fit. This procedure typically converges adequately within ca. 4 cycles.

Fig. 1 Probability density functions (unnormalized) for χ²/ν (var) and its square root (sd), for the indicated numbers of degrees of freedom ν.

The Monte Carlo computations were done with FORTRAN programs and usually involved N = 10⁵ data sets. Simple statistics and binning were done on the fly to avoid having to store and post-process large data sets. Thus, for example, from running sums of the parameters and their squares, one can compute the averages and variances at the end of the run, from, e.g., ⟨a⟩ = Σa_i/N and s_a² = ⟨a²⟩ − ⟨a⟩². For assessing the statistical significance of apparent bias, the metric for parameters is the standard error divided by √N, while that for the RSEs is (2N)^(−1/2), or 0.224% for N = 10⁵. Thus exact agreement with predictions in the MC sense means parameter disparities smaller than this metric and standard deviations within 0.22%, for 68% confidence.

σ₂ = A₂ + B₂y (11b)

In a formal sense one can show that fitting s² to σ₂² and s to σ₁ yields the respective parameters with identical standard errors. However, the latter forms are non-linear, and that, combined with the relatively large uncertainties of the parameters (up to 100% for three replicates), gives many divergences in the MC runs. Thus I have used just the two, slightly inequivalent functions of eqns (11) for the present MC experiments (getting at most ca. 50 divergences, from the iterative weighting cycle, in 10⁵ data sets). Table 1 summarizes results for a response function having six evenly spaced x values with three replicates at each. Of all the tabulated results, only those for fitting s² with theoretical weighting and calculated weighting (B₂ only) are statistically consistent with the true parameter values and predictions from V_prior.
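The on-the-fly accumulation just described can be sketched as follows (an illustrative class, not the original FORTRAN):

```python
class RunningStats:
    """On-the-fly accumulation of <a> = sum(a_i)/N and
    s_a^2 = <a^2> - <a>^2, as used in the MC runs, without
    storing the individual values."""
    def __init__(self):
        self.n = 0
        self.s1 = 0.0   # running sum of a_i
        self.s2 = 0.0   # running sum of a_i^2
    def add(self, a):
        self.n += 1
        self.s1 += a
        self.s2 += a * a
    def mean(self):
        return self.s1 / self.n
    def variance(self):
        m = self.mean()
        return self.s2 / self.n - m * m
```

One caveat: the naive sum-of-squares form can lose precision when the variance is tiny compared with the mean; Welford's one-pass update is the usual remedy, though for MC statistics of this kind the simple sums normally suffice.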
Results and discussion

To demonstrate the statistical properties of fitting variances and standard deviations to functional forms, I have chosen models for the response function and its error close to those explored in ref. 1, specifically

y = a + bx; a = 0.1; b = 10; 0 ≤ x ≤ 10 (10)

for the response function, and

σ₁² = A₁ + B₁y² (11a)

for the data error.

Table 1 Monte Carlo results (N = 10⁵) for fitting data variance and standard deviation to the model of eqn (10) and eqn (11), with six x values and three replicates at each.ᵃ [The tabulated entries (parameter values, standard errors,ᵇ and S/ν under theoretical, observed, and calculated weighting,ᶜ for both the s² and the sᵈ fits) are not recoverable from this transcription.]

ᵃ Values in bold are the only ones in satisfactory agreement with theory and predictions. ᵇ MC results are for single runs and are given to slightly higher precision than warranted for N = 10⁵. Exact values of standard errors are from V_prior. ᶜ Weightings: theoretical employs true variances from the model; observed uses sampling estimates directly; calculated uses the fitted curve for each data set, with iteration to convergence. ᵈ Correcting the sampling s values for their known bias [here (√π)/2 ≈ 0.886] brings the parameter values into statistical consistency but raises the parameter standard errors above their theoretical values.

Also noteworthy: (1) observed weighting gives catastrophic biases for the important proportionality constants B₁ and B₂; (2) not even theoretical weighting gives agreement in

the fitting of s to eqn (11a); and (3) even for theoretically weighted fitting of s², the standard deviation of S/ν is wrong. The first of these amplifies previous concerns about basing weights on sampling estimates of variance. The second stems from the bias inherent in s, which, if corrected in the data, yields correct parameter values (but too-large parameter standard errors). The third failure is a reminder that the χ² distribution for S and s² is contingent upon the data having normally distributed error. While the raw data are normal by design, the s² estimates which constitute the data for VFE are themselves χ² distributed, which means exponentially distributed for ν = 2 [eqn (7)].

While exponentially distributed data are unusual in LS fitting, a truly surprising outcome is the nearly exponentially distributed results for the fitted values of two of the constants included in Fig. 2. At first reflection, this seems in conflict with the central limit theorem, according to which averages should become normal in the limit of large n. (As a simple example of this, and an easy way to generate normal error for synthetic data, the sum of 12 uniform deviates is very nearly normal, with mean 6 and standard deviation 1.) On reflection, it is clear that the result for A₁ stems from the very strong weighting for the lowest-y variance, meaning that A₁ is effectively determined by just that one point, which itself is χ² distributed with ν = 2. The very strongly biased B₁ and B₂ from observed weighting are harder to explain, but presumably both the distribution and the bias stem from the dominance of anomalously small s² values (which thus are strongly overweighted). None of the S distributions in Fig. 2B approximates the χ² distribution for ν = 4 expected for normal data. The distributions from the fitting of s, shown in Fig. 3, though more biased than those from fitting s², are narrower and more symmetrically distributed.
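The sum-of-12-uniforms recipe in the parenthetical above is easy to verify numerically; a minimal sketch (seed and sample size are arbitrary):

```python
import random

def gauss12(rng):
    """Approximate N(0,1) deviate: sum of 12 U(0,1) deviates minus 6.
    Each U(0,1) has mean 1/2 and variance 1/12, so the sum has mean 6
    and variance exactly 1."""
    return sum(rng.random() for _ in range(12)) - 6.0

rng = random.Random(7)
draws = [gauss12(rng) for _ in range(100000)]
m = sum(draws) / len(draws)
v = sum(d * d for d in draws) / len(draws) - m * m
# m should be near 0 and v near 1
```

The resulting distribution is bounded on (−6, 6) and has slightly thin tails, but for generating synthetic normal error it is more than adequate.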
This is mainly just a consequence of the transformation of eqn (8). None of the S distributions (not shown) resembles the χ² distribution. It is interesting that some of the MC parameter errors are actually smaller than the corresponding exact values, seemingly in conflict with the LS minimum-variance guarantee. However, in each such case the bias in the MC parameter value is appreciable, and when the bias error is included in the variance computation, the MC result no longer undershoots the exact value.

Fig. 2 Histogrammed results of variance fitting summarized in Table 1, for A₁ and B₁ (A) and S/ν (B), under different weightings: theoretical (TW), observed (OW) and calculated (CW). In A the argument X is the (observed − true) difference divided by the exact standard error (TW). The solid curve in A is a declining exponential fitted to all but the first displayed point for A₁ (TW). The solid curve in B is the reduced χ² distribution [eqn (7)] for ν = 4, fitted to the CW data.

Fig. 3 Histogrammed results from standard deviation fitting summarized in Table 1. The argument is as defined in Fig. 2; the bin width is 0.25, but to avoid confusion, not all points are shown. Results for calculated weighting (not shown) are very similar to those for theoretical weighting.

The occurrence of bias in the estimated parameters for observed and calculated

weighting is interesting, because in LLS the use of incorrect weights leads to bias in the parameter error estimates but not in the parameters themselves (ref. 8). The source of bias here is the variation in the weights from data set to data set and has nothing to do with the fitting of non-normal data. This was confirmed by MC computations in which the s² values were fitted using weights that were incorrect but the same for all data sets.

Some analysts choose to express their variance functions and weights in terms of the independent variable x rather than the dependent variable y [eqns (11)], so it is worth asking whether this choice has any bearing on the results for calculated weighting in Table 1. The answer is: none at all for the 2% proportional error involved here. When this error is increased ten-fold, there is a slight statistical discrimination against the results for the y-based functions, from the set-to-set variability in the averaged y values used in eqns (11). However, this difference is of no practical significance.

The choice of only three replicates was designed to emphasize the anomalies. To check the dependence on the structure of the data set, I conducted further MC calculations on variance fitting with weighting on the calculated curve, for varying numbers of calibration points and replicates. To make the constant contribution to the variance significant over a larger range of x, I increased the constant A₁ by a factor of 100 for these tests. Results in Fig. 4 show statistically significant bias in both constants for most probed data structures. The MC standard errors on A₁ and B₁ (not shown) are incremented over their theoretical values by amounts that roughly track the biases, amounting to at most 25% for A₁ and 8% for B₁. The theoretical standard errors for both parameters decrease in the order of the displayed points 5-8 in Fig. 4, by about 30% for A₁ and 20% for B₁.
In part, this trend is due to the increasing total number of degrees of freedom with fewer x values; in part it reflects the greater precision inherent in using more points near the extremes of the range of x (ref. 10). For the intended purpose of VFE (to obtain weights for the fit of the data to the response function) neither the biases nor these precision dependences are significant.

Fig. 4 Bias in variance function coefficients for different data structures. The first four pairs of entries are for six x values and 3, 4, 6, and 10 replicates, respectively. The last four are for ca. 24 total points each, with data values and replicates apportioned as (n_x × n_R): 8 × 3, 5 × 5, 4 × 6, 3 × 8, in order of points 5-8. For these tests, A₁ was increased by a factor of 100 (see text). The error bars indicate the MC precision.

Conclusion

The method of least squares does not require normal data for validity: as long as the data are of finite variance and unbiased, LLS parameter estimates will be of finite variance and unbiased; and they will also be minimum-variance estimates if the data are weighted inversely as their variances. However, biased data produce biased parameter estimates, and non-normal data give non-normal parameter distributions and non-χ²-distributed sums of squared residuals. As a consequence it may be difficult to assign confidence limits and conduct goodness-of-fit tests. In the particular application under study here, variance function estimation (VFE), the data can be far from normal, and this can translate into strongly non-normal parameter distributions: nearly exponential for some parameters estimated from three replicates at each point. However, such problems are of little importance in this application, since the outcome of the analysis is a means to an end, namely the proper weighting of the data themselves. The primary need for data variance functions in analytical chemistry is in the LS determination of calibration functions.
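The calculated-weighting VFE procedure described earlier can be sketched in outline as follows (a minimal Python illustration, not the original FORTRAN; the function name, the fixed iteration count, and the clamping guard against non-positive fitted variances are my choices):

```python
def fit_variance_function(ybar, s2, n_iter=4):
    """Fit sampling variances s2 to sigma^2 = A1 + B1*ybar^2 by
    weighted linear LS, with weights w = 1/sigma_hat^4 evaluated on
    the fitted curve and revised after each fit (calculated weighting).
    The first pass is unweighted, since the curve is not yet known."""
    t = [y * y for y in ybar]        # predictor: y^2 (model is linear in A1, B1)
    w = [1.0] * len(s2)
    A1 = B1 = 0.0
    for _ in range(n_iter):
        Sw  = sum(w)
        St  = sum(wi * ti for wi, ti in zip(w, t))
        Ss  = sum(wi * si for wi, si in zip(w, s2))
        Stt = sum(wi * ti * ti for wi, ti in zip(w, t))
        Sts = sum(wi * ti * si for wi, ti, si in zip(w, t, s2))
        D = Sw * Stt - St * St
        A1 = (Stt * Ss - St * Sts) / D
        B1 = (Sw * Sts - St * Ss) / D
        # re-evaluate weights on the calculated function: w = sigma^-4,
        # clamped to avoid division by a non-positive fitted variance
        w = [1.0 / max(A1 + B1 * ti, 1e-12) ** 2 for ti in t]
    return A1, B1
```

The same skeleton serves for the SDFE fit of s to A₂ + B₂y, with predictor y and weights 1/σ̂² instead of 1/σ̂⁴.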
Monte Carlo computational experiments on simple variance functions, resembling those likely to be encountered in actual data analysis situations, show that:

- The weights should be assessed using the fitted function rather than the raw s² (s) values. This renders the fit iterative, but convergence is typically quick.
- The variance function can be estimated with adequate precision and tolerable bias from as few as three replicates taken at each of six data values.
- It makes no difference whether the variance function is expressed in terms of the dependent variable y or the independent variable x (though the former is theoretically more sensible).
- It makes little difference whether s² or s is fitted; although fitting s yields biased parameters, the bias is correctable and anyway has negligible effect on the data fit for which the VFE or SDFE is needed.

The foregoing is premised upon the existence of a simple relation between the data error and the experimental variables, like eqns (11). Such relations must surely exist in all experimental techniques, e.g. dependence upon wavelength and absorbance in spectrophotometry (ref. 11), or signal level and titrant injection volume in isothermal titration calorimetry (ref. 12). If this dependence cannot be perceived by the analyst, and heteroscedasticity still seems evident in the data, then considerably more than three replicates will be needed at each point to provide reliable data weights. Use of nine replicates gave ca. 10% loss in precision in the calibration tests in ref. 1.

The need to weight the data in VFE/SDFE stems from the inherent properties

of sampling estimates of variances, which have the statistical properties of χ². All estimated variances and standard deviations have proportional error, or uncertainty that is proportional to their magnitude. The proportionality constant is readily calculated from the degrees of freedom: (2/ν)^(1/2) for estimated variances and (2ν)^(−1/2) for estimated standard deviations. The fact that this uncertainty in the uncertainty is so easily calculated makes it all the more surprising that it is so widely neglected by data analysts in the physical sciences.

References

1 J. Tellinghuisen, Analyst, 2007, 132.
2 A. M. Mood and F. A. Graybill, Introduction to the Theory of Statistics, McGraw-Hill, New York, 2nd edn.
3 W. C. Hamilton, Statistics in Physical Science: Estimation, Hypothesis Testing, and Least Squares, The Ronald Press Co., New York.
4 W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling, Numerical Recipes, Cambridge University Press, Cambridge, UK.
5 R. N. Draper and H. Smith, Applied Regression Analysis, Wiley, New York, 3rd edn.
6 R. J. Carroll and D. Ruppert, Transformation and Weighting in Regression, Chapman and Hall, New York.
7 J. Tellinghuisen, J. Phys. Chem. A, 2000, 104.
8 J. Tellinghuisen, J. Chem. Educ., 2005, 82.
9 M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions, Dover, New York.
10 N. Francois, B. Govaerts and B. Boulanger, Chemom. Intell. Lab. Syst., 2004, 74.
11 J. Tellinghuisen, Appl. Spectrosc., 2000, 54.
12 J. Tellinghuisen, Anal. Biochem., 2005, 343.


More information

Time-Series Cross-Section Analysis

Time-Series Cross-Section Analysis Time-Series Cross-Section Analysis Models for Long Panels Jamie Monogan University of Georgia February 17, 2016 Jamie Monogan (UGA) Time-Series Cross-Section Analysis February 17, 2016 1 / 20 Objectives

More information

Uncertainty due to Finite Resolution Measurements

Uncertainty due to Finite Resolution Measurements Uncertainty due to Finite Resolution Measurements S.D. Phillips, B. Tolman, T.W. Estler National Institute of Standards and Technology Gaithersburg, MD 899 Steven.Phillips@NIST.gov Abstract We investigate

More information

Variable Selection and Model Building

Variable Selection and Model Building LINEAR REGRESSION ANALYSIS MODULE XIII Lecture - 37 Variable Selection and Model Building Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur The complete regression

More information

1 Random and systematic errors

1 Random and systematic errors 1 ESTIMATION OF RELIABILITY OF RESULTS Such a thing as an exact measurement has never been made. Every value read from the scale of an instrument has a possible error; the best that can be done is to say

More information

Monte Carlo Simulations

Monte Carlo Simulations Monte Carlo Simulations What are Monte Carlo Simulations and why ones them? Pseudo Random Number generators Creating a realization of a general PDF The Bootstrap approach A real life example: LOFAR simulations

More information

Statistical Analysis of Engineering Data The Bare Bones Edition. Precision, Bias, Accuracy, Measures of Precision, Propagation of Error

Statistical Analysis of Engineering Data The Bare Bones Edition. Precision, Bias, Accuracy, Measures of Precision, Propagation of Error Statistical Analysis of Engineering Data The Bare Bones Edition (I) Precision, Bias, Accuracy, Measures of Precision, Propagation of Error PRIOR TO DATA ACQUISITION ONE SHOULD CONSIDER: 1. The accuracy

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 12: Frequentist properties of estimators (v4) Ramesh Johari ramesh.johari@stanford.edu 1 / 39 Frequentist inference 2 / 39 Thinking like a frequentist Suppose that for some

More information

interval forecasting

interval forecasting Interval Forecasting Based on Chapter 7 of the Time Series Forecasting by Chatfield Econometric Forecasting, January 2008 Outline 1 2 3 4 5 Terminology Interval Forecasts Density Forecast Fan Chart Most

More information

LECTURE NOTES FYS 4550/FYS EXPERIMENTAL HIGH ENERGY PHYSICS AUTUMN 2013 PART I A. STRANDLIE GJØVIK UNIVERSITY COLLEGE AND UNIVERSITY OF OSLO

LECTURE NOTES FYS 4550/FYS EXPERIMENTAL HIGH ENERGY PHYSICS AUTUMN 2013 PART I A. STRANDLIE GJØVIK UNIVERSITY COLLEGE AND UNIVERSITY OF OSLO LECTURE NOTES FYS 4550/FYS9550 - EXPERIMENTAL HIGH ENERGY PHYSICS AUTUMN 2013 PART I PROBABILITY AND STATISTICS A. STRANDLIE GJØVIK UNIVERSITY COLLEGE AND UNIVERSITY OF OSLO Before embarking on the concept

More information

Physics 403. Segev BenZvi. Propagation of Uncertainties. Department of Physics and Astronomy University of Rochester

Physics 403. Segev BenZvi. Propagation of Uncertainties. Department of Physics and Astronomy University of Rochester Physics 403 Propagation of Uncertainties Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Maximum Likelihood and Minimum Least Squares Uncertainty Intervals

More information

An Introduction to Error Analysis

An Introduction to Error Analysis An Introduction to Error Analysis Introduction The following notes (courtesy of Prof. Ditchfield) provide an introduction to quantitative error analysis: the study and evaluation of uncertainty in measurement.

More information

Part I. Experimental Error

Part I. Experimental Error Part I. Experimental Error 1 Types of Experimental Error. There are always blunders, mistakes, and screwups; such as: using the wrong material or concentration, transposing digits in recording scale readings,

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

The Use of Large Intervals in Finite- Difference Equations

The Use of Large Intervals in Finite- Difference Equations 14 USE OF LARGE INTERVALS IN FINITE-DIFFERENCE EQUATIONS up with A7! as a free parameter which can be typed into the machine as occasion demands, no further information being needed. This elaboration of

More information

1 The problem of survival analysis

1 The problem of survival analysis 1 The problem of survival analysis Survival analysis concerns analyzing the time to the occurrence of an event. For instance, we have a dataset in which the times are 1, 5, 9, 20, and 22. Perhaps those

More information

1 Measurement Uncertainties

1 Measurement Uncertainties 1 Measurement Uncertainties (Adapted stolen, really from work by Amin Jaziri) 1.1 Introduction No measurement can be perfectly certain. No measuring device is infinitely sensitive or infinitely precise.

More information

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1 Regression diagnostics As is true of all statistical methodologies, linear regression analysis can be a very effective way to model data, as along as the assumptions being made are true. For the regression

More information

Protean Instrument Dutchtown Road, Knoxville, TN TEL/FAX:

Protean Instrument Dutchtown Road, Knoxville, TN TEL/FAX: Application Note AN-0210-1 Tracking Instrument Behavior A frequently asked question is How can I be sure that my instrument is performing normally? Before we can answer this question, we must define what

More information

ECNS 561 Multiple Regression Analysis

ECNS 561 Multiple Regression Analysis ECNS 561 Multiple Regression Analysis Model with Two Independent Variables Consider the following model Crime i = β 0 + β 1 Educ i + β 2 [what else would we like to control for?] + ε i Here, we are taking

More information

Supporting Information for Estimating restricted mean. treatment effects with stacked survival models

Supporting Information for Estimating restricted mean. treatment effects with stacked survival models Supporting Information for Estimating restricted mean treatment effects with stacked survival models Andrew Wey, David Vock, John Connett, and Kyle Rudser Section 1 presents several extensions to the simulation

More information

STATISTICS OF OBSERVATIONS & SAMPLING THEORY. Parent Distributions

STATISTICS OF OBSERVATIONS & SAMPLING THEORY. Parent Distributions ASTR 511/O Connell Lec 6 1 STATISTICS OF OBSERVATIONS & SAMPLING THEORY References: Bevington Data Reduction & Error Analysis for the Physical Sciences LLM: Appendix B Warning: the introductory literature

More information

ON REPLICATION IN DESIGN OF EXPERIMENTS

ON REPLICATION IN DESIGN OF EXPERIMENTS ON REPLICATION IN DESIGN OF EXPERIMENTS Bianca FAGARAS 1), Ingrid KOVACS 1), Anamaria OROS 1), Monica RAFAILA 2), Marina Dana TOPA 1), Manuel HARRANT 2) 1) Technical University of Cluj-Napoca, Str. Baritiu

More information

Error Analysis in Experimental Physical Science Mini-Version

Error Analysis in Experimental Physical Science Mini-Version Error Analysis in Experimental Physical Science Mini-Version by David Harrison and Jason Harlow Last updated July 13, 2012 by Jason Harlow. Original version written by David M. Harrison, Department of

More information

The Monte Carlo method what and how?

The Monte Carlo method what and how? A top down approach in measurement uncertainty estimation the Monte Carlo simulation By Yeoh Guan Huah GLP Consulting, Singapore (http://consultglp.com) Introduction The Joint Committee for Guides in Metrology

More information

Multicollinearity and A Ridge Parameter Estimation Approach

Multicollinearity and A Ridge Parameter Estimation Approach Journal of Modern Applied Statistical Methods Volume 15 Issue Article 5 11-1-016 Multicollinearity and A Ridge Parameter Estimation Approach Ghadban Khalaf King Khalid University, albadran50@yahoo.com

More information

MS&E 226. In-Class Midterm Examination Solutions Small Data October 20, 2015

MS&E 226. In-Class Midterm Examination Solutions Small Data October 20, 2015 MS&E 226 In-Class Midterm Examination Solutions Small Data October 20, 2015 PROBLEM 1. Alice uses ordinary least squares to fit a linear regression model on a dataset containing outcome data Y and covariates

More information

ECON2228 Notes 2. Christopher F Baum. Boston College Economics. cfb (BC Econ) ECON2228 Notes / 47

ECON2228 Notes 2. Christopher F Baum. Boston College Economics. cfb (BC Econ) ECON2228 Notes / 47 ECON2228 Notes 2 Christopher F Baum Boston College Economics 2014 2015 cfb (BC Econ) ECON2228 Notes 2 2014 2015 1 / 47 Chapter 2: The simple regression model Most of this course will be concerned with

More information

Central Limit Theorem and the Law of Large Numbers Class 6, Jeremy Orloff and Jonathan Bloom

Central Limit Theorem and the Law of Large Numbers Class 6, Jeremy Orloff and Jonathan Bloom Central Limit Theorem and the Law of Large Numbers Class 6, 8.5 Jeremy Orloff and Jonathan Bloom Learning Goals. Understand the statement of the law of large numbers. 2. Understand the statement of the

More information

arxiv:hep-ex/ v1 2 Jun 2000

arxiv:hep-ex/ v1 2 Jun 2000 MPI H - V7-000 May 3, 000 Averaging Measurements with Hidden Correlations and Asymmetric Errors Michael Schmelling / MPI for Nuclear Physics Postfach 03980, D-6909 Heidelberg arxiv:hep-ex/0006004v Jun

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Modern Methods of Data Analysis - WS 07/08

Modern Methods of Data Analysis - WS 07/08 Modern Methods of Data Analysis Lecture V (12.11.07) Contents: Central Limit Theorem Uncertainties: concepts, propagation and properties Central Limit Theorem Consider the sum X of n independent variables,

More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size

Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size Berkman Sahiner, a) Heang-Ping Chan, Nicholas Petrick, Robert F. Wagner, b) and Lubomir Hadjiiski

More information

Optimisation of multi-parameter empirical fitting functions By David Knight

Optimisation of multi-parameter empirical fitting functions By David Knight 1 Optimisation of multi-parameter empirical fitting functions By David Knight Version 1.02, 1 st April 2013. D. W. Knight, 2010-2013. Check the author's website to ensure that you have the most recent

More information

Research Article A Nonparametric Two-Sample Wald Test of Equality of Variances

Research Article A Nonparametric Two-Sample Wald Test of Equality of Variances Advances in Decision Sciences Volume 211, Article ID 74858, 8 pages doi:1.1155/211/74858 Research Article A Nonparametric Two-Sample Wald Test of Equality of Variances David Allingham 1 andj.c.w.rayner

More information

Logistic Regression: Regression with a Binary Dependent Variable

Logistic Regression: Regression with a Binary Dependent Variable Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression

More information

Statistics for Data Analysis. Niklaus Berger. PSI Practical Course Physics Institute, University of Heidelberg

Statistics for Data Analysis. Niklaus Berger. PSI Practical Course Physics Institute, University of Heidelberg Statistics for Data Analysis PSI Practical Course 2014 Niklaus Berger Physics Institute, University of Heidelberg Overview You are going to perform a data analysis: Compare measured distributions to theoretical

More information

Uncertainty and Graphical Analysis

Uncertainty and Graphical Analysis Uncertainty and Graphical Analysis Introduction Two measures of the quality of an experimental result are its accuracy and its precision. An accurate result is consistent with some ideal, true value, perhaps

More information

IEOR 165 Lecture 7 1 Bias-Variance Tradeoff

IEOR 165 Lecture 7 1 Bias-Variance Tradeoff IEOR 165 Lecture 7 Bias-Variance Tradeoff 1 Bias-Variance Tradeoff Consider the case of parametric regression with β R, and suppose we would like to analyze the error of the estimate ˆβ in comparison to

More information

FAQ: Linear and Multiple Regression Analysis: Coefficients

FAQ: Linear and Multiple Regression Analysis: Coefficients Question 1: How do I calculate a least squares regression line? Answer 1: Regression analysis is a statistical tool that utilizes the relation between two or more quantitative variables so that one variable

More information

Combining multiple surrogate models to accelerate failure probability estimation with expensive high-fidelity models

Combining multiple surrogate models to accelerate failure probability estimation with expensive high-fidelity models Combining multiple surrogate models to accelerate failure probability estimation with expensive high-fidelity models Benjamin Peherstorfer a,, Boris Kramer a, Karen Willcox a a Department of Aeronautics

More information

1 Using standard errors when comparing estimated values

1 Using standard errors when comparing estimated values MLPR Assignment Part : General comments Below are comments on some recurring issues I came across when marking the second part of the assignment, which I thought it would help to explain in more detail

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Estimation and Confidence Intervals for Parameters of a Cumulative Damage Model

Estimation and Confidence Intervals for Parameters of a Cumulative Damage Model United States Department of Agriculture Forest Service Forest Products Laboratory Research Paper FPL-RP-484 Estimation and Confidence Intervals for Parameters of a Cumulative Damage Model Carol L. Link

More information

ASA Section on Survey Research Methods

ASA Section on Survey Research Methods REGRESSION-BASED STATISTICAL MATCHING: RECENT DEVELOPMENTS Chris Moriarity, Fritz Scheuren Chris Moriarity, U.S. Government Accountability Office, 411 G Street NW, Washington, DC 20548 KEY WORDS: data

More information

Regression Analysis for Data Containing Outliers and High Leverage Points

Regression Analysis for Data Containing Outliers and High Leverage Points Alabama Journal of Mathematics 39 (2015) ISSN 2373-0404 Regression Analysis for Data Containing Outliers and High Leverage Points Asim Kumer Dey Department of Mathematics Lamar University Md. Amir Hossain

More information

Statistics for the LHC Lecture 1: Introduction

Statistics for the LHC Lecture 1: Introduction Statistics for the LHC Lecture 1: Introduction Academic Training Lectures CERN, 14 17 June, 2010 indico.cern.ch/conferencedisplay.py?confid=77830 Glen Cowan Physics Department Royal Holloway, University

More information

Basic Analysis of Data

Basic Analysis of Data Basic Analysis of Data Department of Chemical Engineering Prof. Geoff Silcox Fall 008 1.0 Reporting the Uncertainty in a Measured Quantity At the request of your supervisor, you have ventured out into

More information

ON THE CONSEQUENCES OF MISSPECIFING ASSUMPTIONS CONCERNING RESIDUALS DISTRIBUTION IN A REPEATED MEASURES AND NONLINEAR MIXED MODELLING CONTEXT

ON THE CONSEQUENCES OF MISSPECIFING ASSUMPTIONS CONCERNING RESIDUALS DISTRIBUTION IN A REPEATED MEASURES AND NONLINEAR MIXED MODELLING CONTEXT ON THE CONSEQUENCES OF MISSPECIFING ASSUMPTIONS CONCERNING RESIDUALS DISTRIBUTION IN A REPEATED MEASURES AND NONLINEAR MIXED MODELLING CONTEXT Rachid el Halimi and Jordi Ocaña Departament d Estadística

More information

Conservative variance estimation for sampling designs with zero pairwise inclusion probabilities

Conservative variance estimation for sampling designs with zero pairwise inclusion probabilities Conservative variance estimation for sampling designs with zero pairwise inclusion probabilities Peter M. Aronow and Cyrus Samii Forthcoming at Survey Methodology Abstract We consider conservative variance

More information

Data Analysis for University Physics

Data Analysis for University Physics Data Analysis for University Physics by John Filaseta orthern Kentucky University Last updated on ovember 9, 004 Four Steps to a Meaningful Experimental Result Most undergraduate physics experiments have

More information

A process capability index for discrete processes

A process capability index for discrete processes Journal of Statistical Computation and Simulation Vol. 75, No. 3, March 2005, 175 187 A process capability index for discrete processes MICHAEL PERAKIS and EVDOKIA XEKALAKI* Department of Statistics, Athens

More information

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Many economic models involve endogeneity: that is, a theoretical relationship does not fit

More information

Estimating and Testing the US Model 8.1 Introduction

Estimating and Testing the US Model 8.1 Introduction 8 Estimating and Testing the US Model 8.1 Introduction The previous chapter discussed techniques for estimating and testing complete models, and this chapter applies these techniques to the US model. For

More information

Wright-Fisher Models, Approximations, and Minimum Increments of Evolution

Wright-Fisher Models, Approximations, and Minimum Increments of Evolution Wright-Fisher Models, Approximations, and Minimum Increments of Evolution William H. Press The University of Texas at Austin January 10, 2011 1 Introduction Wright-Fisher models [1] are idealized models

More information

Bayesian vs frequentist techniques for the analysis of binary outcome data

Bayesian vs frequentist techniques for the analysis of binary outcome data 1 Bayesian vs frequentist techniques for the analysis of binary outcome data By M. Stapleton Abstract We compare Bayesian and frequentist techniques for analysing binary outcome data. Such data are commonly

More information

Human-Oriented Robotics. Probability Refresher. Kai Arras Social Robotics Lab, University of Freiburg Winter term 2014/2015

Human-Oriented Robotics. Probability Refresher. Kai Arras Social Robotics Lab, University of Freiburg Winter term 2014/2015 Probability Refresher Kai Arras, University of Freiburg Winter term 2014/2015 Probability Refresher Introduction to Probability Random variables Joint distribution Marginalization Conditional probability

More information

Howard Mark and Jerome Workman Jr.

Howard Mark and Jerome Workman Jr. Linearity in Calibration: How to Test for Non-linearity Previous methods for linearity testing discussed in this series contain certain shortcomings. In this installment, the authors describe a method

More information

Lecture 5. G. Cowan Lectures on Statistical Data Analysis Lecture 5 page 1

Lecture 5. G. Cowan Lectures on Statistical Data Analysis Lecture 5 page 1 Lecture 5 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,

More information

Statistical Applications in the Astronomy Literature II Jogesh Babu. Center for Astrostatistics PennState University, USA

Statistical Applications in the Astronomy Literature II Jogesh Babu. Center for Astrostatistics PennState University, USA Statistical Applications in the Astronomy Literature II Jogesh Babu Center for Astrostatistics PennState University, USA 1 The likelihood ratio test (LRT) and the related F-test Protassov et al. (2002,

More information

A test for improved forecasting performance at higher lead times

A test for improved forecasting performance at higher lead times A test for improved forecasting performance at higher lead times John Haywood and Granville Tunnicliffe Wilson September 3 Abstract Tiao and Xu (1993) proposed a test of whether a time series model, estimated

More information

Point-to-point response to reviewers comments

Point-to-point response to reviewers comments Point-to-point response to reviewers comments Reviewer #1 1) The authors analyze only one millennial reconstruction (Jones, 1998) with the argument that it is the only one available. This is incorrect.

More information

Analysis of Type-II Progressively Hybrid Censored Data

Analysis of Type-II Progressively Hybrid Censored Data Analysis of Type-II Progressively Hybrid Censored Data Debasis Kundu & Avijit Joarder Abstract The mixture of Type-I and Type-II censoring schemes, called the hybrid censoring scheme is quite common in

More information

Testing composite hypotheses applied to AR-model order estimation; the Akaike-criterion revised

Testing composite hypotheses applied to AR-model order estimation; the Akaike-criterion revised MODDEMEIJER: TESTING COMPOSITE HYPOTHESES; THE AKAIKE-CRITERION REVISED 1 Testing composite hypotheses applied to AR-model order estimation; the Akaike-criterion revised Rudy Moddemeijer Abstract Akaike

More information

Computer Science Foundation Exam

Computer Science Foundation Exam Computer Science Foundation Exam May 6, 2016 Section II A DISCRETE STRUCTURES NO books, notes, or calculators may be used, and you must work entirely on your own. SOLUTION Question Max Pts Category Passing

More information

Food delivered. Food obtained S 3

Food delivered. Food obtained S 3 Press lever Enter magazine * S 0 Initial state S 1 Food delivered * S 2 No reward S 2 No reward S 3 Food obtained Supplementary Figure 1 Value propagation in tree search, after 50 steps of learning the

More information

Acceptable Ergodic Fluctuations and Simulation of Skewed Distributions

Acceptable Ergodic Fluctuations and Simulation of Skewed Distributions Acceptable Ergodic Fluctuations and Simulation of Skewed Distributions Oy Leuangthong, Jason McLennan and Clayton V. Deutsch Centre for Computational Geostatistics Department of Civil & Environmental Engineering

More information

Estimation of Effect Size From a Series of Experiments Involving Paired Comparisons

Estimation of Effect Size From a Series of Experiments Involving Paired Comparisons Journal of Educational Statistics Fall 1993, Vol 18, No. 3, pp. 271-279 Estimation of Effect Size From a Series of Experiments Involving Paired Comparisons Robert D. Gibbons Donald R. Hedeker John M. Davis

More information

Section 8.1: Interval Estimation

Section 8.1: Interval Estimation Section 8.1: Interval Estimation Discrete-Event Simulation: A First Course c 2006 Pearson Ed., Inc. 0-13-142917-5 Discrete-Event Simulation: A First Course Section 8.1: Interval Estimation 1/ 35 Section

More information

Solar neutrinos are the only known particles to reach Earth directly from the solar core and thus allow to test directly the theories of stellar evolu

Solar neutrinos are the only known particles to reach Earth directly from the solar core and thus allow to test directly the theories of stellar evolu Absence of Correlation between the Solar Neutrino Flux and the Sunspot Number Guenther Walther Dept. of Statistics, Stanford University, Stanford, CA 94305 Abstract There exists a considerable amount of

More information

Advanced Statistical Methods. Lecture 6

Advanced Statistical Methods. Lecture 6 Advanced Statistical Methods Lecture 6 Convergence distribution of M.-H. MCMC We denote the PDF estimated by the MCMC as. It has the property Convergence distribution After some time, the distribution

More information

A nonparametric two-sample wald test of equality of variances

A nonparametric two-sample wald test of equality of variances University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 211 A nonparametric two-sample wald test of equality of variances David

More information

VHF Dipole effects on P-Band Beam Characteristics

VHF Dipole effects on P-Band Beam Characteristics VHF Dipole effects on P-Band Beam Characteristics D. A. Mitchell, L. J. Greenhill, C. Carilli, R. A. Perley January 7, 1 Overview To investigate any adverse effects on VLA P-band performance due the presence

More information