
Linearity in Calibration: The Durbin-Watson Statistic

A discussion of how DW can be a useful tool when different statistical approaches show different sensitivities to particular departures from the ideal.

Howard Mark and Jerome Workman, Jr.

Spectroscopy 20(3) March

As we left off in our last column, we had proposed a definition of linearity. Now let's start by delving into the ins and outs of the Durbin-Watson statistic (1-6) and looking at how to use it to test for nonlinearity. In fact, we've talked about the Durbin-Watson statistic previously in our columns, although a long time ago and under a different name. Quite a while ago we published a column titled "Alternative Ways to Calculate Standard Deviation" (7). One of the alternative ways described was the calculation by Successive Differences. As we shall see, that calculation is very closely related to the Durbin-Watson statistic. More recently we described this statistic (more directly named) in a sidebar to an article in the American Pharmaceutical Review (8).

To relate the Durbin-Watson statistic to our current concerns, we go back to the basics of statistical analysis and remind ourselves how statisticians think about statistics. Here we get into the deep thickets of statistical theory, and meaning and philosophy. We will try to keep it as simple as possible, though.

How DW Works

Let us start with two of the formulas for standard deviation presented in our earlier column (7). One of the formulas is the "ordinary" formula for standard deviation:

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}} \qquad (1) $$

The other formula is the formula for calculating standard deviation by Successive Differences:

$$ s = \sqrt{\frac{\sum_{i=1}^{n-1} (X_{i+1} - X_i)^2}{2(n-1)}} \qquad (2) $$

Now we ask ourselves the question: "If we calculate the standard deviation for a set of data (or errors) from these two formulas, will they give us the same answer?" And the answer to that question is that they will, if (that's a very big "if") the data and the errors have the characteristics that statisticians consider "good" statistical properties: random, independent (uncorrelated), constant variance, and in this case, a Normal distribution, and for errors, a mean (μ) of zero, as well. For a set of data that meets all these criteria, we can expect the two computations to produce the same answer (within the limits of what sometimes loosely is called "statistical variability").

So under conditions where we expect the same answer from both computations, we expect the ratio of the computations to equal 1 (unity). Basically, this is a general description of how statisticians think about problems: First, compare the results of two computations of what is nominally the same quantity when all conditions meet the specified assumptions. Then if the comparison fails, this constitutes evidence that something about the data is not conforming to the expected characteristic (that is, is not random, is correlated, is heteroscedastic, is not Normal, and so forth).

The Durbin-Watson statistic is that type of computation, stripped to its barest essentials. Dividing equation 2 by equation 1 above, squaring the result, canceling similar terms, noting that the mean error is zero and ignoring the constant factor (2), we arrive at:

$$ DW = \frac{\sum_{i=1}^{n-1} (e_{i+1} - e_i)^2}{\sum_{i=1}^{n} e_i^2} \qquad (3) $$

Because of the way it is calculated, particularly the way the constant factor is ignored, the expected value of DW is 2 when the data do in fact meet all the specified criteria: random, independent errors, and so forth. Nonlinearity will cause the computed value of DW to be statistically significantly less than 2. (Homework assignment for the reader: what characteristic will make DW be statistically significantly greater than 2?)

Figures 1 and 2 illustrate graphically what happens when you inspect the residuals from a calibration. When you plot linear data, the data are spread out evenly around the calibration line, as shown in Figure 1a. When plotting the residuals, the line representing the calibration line is brought into coincidence with the x axis, so that the residuals are spread out evenly around the x axis, as shown in Figure 1b. For nonlinear data, shown in Figure 2a, a plot of the residuals shows that although the calibration line still coincides with the x axis, the data do not follow that line. Therefore, although the residuals still have equal positive and negative values, they are no longer spread out evenly around the zero line because the actual function is no longer a straight line. Instead, the residuals are spread out evenly around some hypothetical curved line (shown) representing the actual (nonlinear) function describing the data.

In both the linear and nonlinear cases the total variation of the residuals is the sum of the random error plus the departure from linearity. When the data are linear, the variance due to the departure from linearity effectively is zero. For a nonlinear set of data, because the X-difference between adjacent data points is small, the nonlinearity of the function makes minimal contribution to the total difference between adjacent residuals; most of that difference contributing to the successive differences in the numerator of the DW calculation is due to the random noise of the data. The denominator term, on the other hand, is dependent almost entirely upon the systematic variation due to the curvature, and for nonlinear data this is much larger than the random noise contribution. Therefore the denominator variance of the residuals is much larger than the numerator variance when nonlinearity is present, and the Durbin-Watson statistic reflects this by assuming a value less than 2.

Good Statistics vs. Good Data

The problem we all have is that we want answers to be in clear, unambiguous terms: yes/no, black/white, is/isn't linear, while statistics deals in probabilities.
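The comparison of the two standard deviation formulas described in the preceding section can be made concrete with a small numerical experiment. The following is a minimal Python sketch, not the authors' code: the function names, the seed, and the sample size are illustrative assumptions, and the synthetic errors are deliberately generated to be random, independent, Normal, and of zero mean.

```python
import math
import random


def sd_ordinary(values):
    """Ordinary standard deviation about the mean (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def sd_successive_differences(values):
    """Standard deviation estimated from differences between adjacent values."""
    n = len(values)
    squared_diffs = sum((b - a) ** 2 for a, b in zip(values, values[1:]))
    return math.sqrt(squared_diffs / (2 * (n - 1)))


def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Synthetic errors that meet the "good" criteria (assumed for illustration).
rng = random.Random(12345)  # arbitrary seed, used only for repeatability
errors = [rng.gauss(0.0, 1.0) for _ in range(5000)]

s_ordinary = sd_ordinary(errors)
s_successive = sd_successive_differences(errors)
dw = durbin_watson(errors)

print(f"ordinary SD:              {s_ordinary:.3f}")
print(f"successive-difference SD: {s_successive:.3f}")
print(f"ratio of the two:         {s_successive / s_ordinary:.3f}")
print(f"DW:                       {dw:.3f}")
```

For well-behaved errors like these, the ratio of the two estimates should come out close to 1 and DW close to 2, within ordinary statistical variability.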
It is certainly true that there is no single statistic (not SEE, not R², not DW, nor any other) that is going to answer the question of whether a given set of data, or residuals, has a linear relation. If we wanted to be really ornery, we could even argue that "linearity" is, as with most mathematical concepts, an idealization of a property that never exists in real data. But that is not productive, and doesn't address the real-world issues that confront us.

What are some of these real-world issues? Well, you might want to check out the paper "Graphs in Statistical Analysis" by F.J. Anscombe (9). We'll describe his results again, but it really is worth getting hold of and reading the original paper anyway; it's quite an eye-opener. What Anscombe presents are four sets of synthetic data, representing four simple (single X-variable) regression situations. One of the data sets represents a reasonably well-behaved set of data: uniform distribution of data along the x axis; errors that are random, independent and Normally distributed; and in all respects all the properties that statisticians consider "good." The other three sets show very gross departures of varying kinds (including one that is severely nonlinear) from this well-behaved data set.

So what's the big deal about that? By design, all four sets of data have identical values of all the common regression statistics: coefficients, SEE, R², and so forth. The intent is, of course, to show that no set of statistics can diagnose unambiguously all possible problems in all situations. When you look at the graphs of the four data sets, on the other hand, it is immediately clear which is the "good" one, which have the problems, and what the problems are.

Any statistician worth his salt will tell you that if you are doing calibration work, you should examine the residual plots, and any others that might be informative. But the FDA/ICH guidelines do not promote that approach, even though such plots are mentioned. To the contrary, they emphasize calculating and submitting the numerical results from the line-fitting process. Under ordinary circumstances, that really is not too bad, as long as you understand what you are doing, which usually means going back to basic statistical theory. This theory states that if data meet certain criteria, criteria that (always) include the fact that the errors are random and independent, and (usually) Normally distributed, then certain calculations can be done and probabilistic statements made about the results of those calculations. If you make the calculation and the value turns out to be one of low probability, then that is taken as evidence that your data fail to meet one or more of the criteria that they are assumed to meet. Note that the calculation alone does not tell you which criterion is not met; the criterion that it does not meet might or might not be the one you are concerned with.
The converse, however, is, strictly speaking, not true. If your calculated result turns out to be a high-probability value, it does not "prove" that the data meet the criteria. That is what Anscombe's paper is demonstrating, because there is a (natural) tendency to forget that point, and assume that a "good" statistic means "good" data.

Applying DW

So where does that leave us? Does it mean that statistics are useless, or that FDA is clueless? No, but it means that all these things have to be done with an eye to knowing what can go wrong. We strongly suspect that FDA has taken the position it has because it has found that, even though numerical statistics are not perfect, they provide an objective measure of calibration performance, and it has found through hard experience that the subjective interpretation of graphs is fraught with even more problems than is the use of admittedly imperfect statistics.

For similar reasons, the statement, "If the Durbin-Watson test demonstrates a correlation, then the relationship between the two assays is not linear," is not exactly correct, either. Under some circumstances, a linear correlation also can give rise to a statistically significant value of DW. In fact, for any statistic, it always is possible to construct a data set that gives a high-probability value for the statistic, yet the data clearly and obviously fail to meet the pertinent criteria (again, Anscombe is a good example of this for a few common statistics).

So what should we do? Well, different statistics show different sensitivities to particular departures from the ideal, and this is where DW comes in. The key to calculating the Durbin-Watson statistic is that prior to performing the calculation, the data must be put into a suitable order. The Durbin-Watson statistic then is sensitive to serial correlations of the ordered data. While the serial correlation often is thought of in connection with time series, that is only one of its applications.
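To make the role of ordering concrete, here is a minimal Python sketch under stated assumptions: the residuals are hypothetical and noise-free, with the centered-parabola shape that a straight-line fit leaves behind for a purely quadratic response on evenly spaced x values. The function name, the number of points, and the shuffle seed are all illustrative.

```python
import random


def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals: pure curvature, no random noise (an assumption
# made only to isolate the effect of ordering).
n = 50
x = list(range(n))
x_mean = sum(x) / n
curve = [(xi - x_mean) ** 2 for xi in x]
curve_mean = sum(curve) / n
residuals_in_x_order = [c - curve_mean for c in curve]

# The same residual values, but with the ordering destroyed.
shuffled = residuals_in_x_order[:]
random.Random(0).shuffle(shuffled)  # arbitrary seed for repeatability

dw_ordered = durbin_watson(residuals_in_x_order)
dw_shuffled = durbin_watson(shuffled)

print(f"DW, residuals in x order: {dw_ordered:.4f}")
print(f"DW, same values shuffled: {dw_shuffled:.4f}")
```

The two calls use exactly the same numbers; only the order differs. The ordered case gives a value far below 2 because adjacent residuals are nearly equal, which is the serial correlation the statistic responds to.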
Draper and Smith (1) discuss the application of DW to the analysis of residuals from a calibration; their discussion is based upon the fundamental work of Durbin et al., in the references listed at the beginning of this column. While we cannot reproduce their entire discussion here, at the heart of it is the fact that there are many kinds of serial correlation, including linear, quadratic, and higher order. As Draper and Smith show, the linear correlation between the residuals from the calibration data and the predicted values from that calibration model is zero. Therefore, if the sample data are ordered according to the analyte values predicted from the calibration model, a statistically significant value of the Durbin-Watson statistic for the residuals is indicative of high-order serial correlation, that is, nonlinearity.

Draper and Smith point out that you need a minimum of 15 samples in order to get meaningful results from the calculation of the Durbin-Watson statistic (1). Because the Anscombe data set contains only 11 readings, statistically meaningful statements cannot be made. Nevertheless, it is interesting to see the results of the Durbin-Watson statistic applied to the nonlinear set of Anscombe data; the value of the statistic is For comparison, the Durbin-Watson statistic for the data set representing normal "good" data is
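The procedure just described (fit the line, order the samples by predicted value, compute DW on the residuals) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names are hypothetical, and the data values are typed in from the commonly reproduced version of Anscombe's first two data sets and should be checked against reference 9 before being relied upon.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single x variable."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def dw_for_linearity(x, y):
    """Fit a line, sort by predicted value, then compute DW on the residuals."""
    slope, intercept = fit_line(x, y)
    predicted = [intercept + slope * xi for xi in x]
    residuals = [yi - pi for yi, pi in zip(y, predicted)]
    pairs = sorted(zip(predicted, residuals), key=lambda pair: pair[0])
    return durbin_watson([r for _, r in pairs])


# Anscombe's sets share the same x values (assumed transcription, see lead-in).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_good = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_curved = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

dw_good = dw_for_linearity(x, y_good)
dw_curved = dw_for_linearity(x, y_curved)

print(f"DW, well-behaved set: {dw_good:.3f}")
print(f"DW, nonlinear set:    {dw_curved:.3f}")
```

With only 11 points per set, the results are illustrative rather than statistically meaningful, as noted above; the nonlinear set should nonetheless give the markedly lower value.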

Is DW perfect? Not at all. The way it is calculated, the highest-probability value (the "expected" value) for DW is, as we saw above, 2. Yet it is possible to construct a data set that has a DW value of 2, and clearly and obviously is not linear, as well as being nonrandom. That data set is as follows:

0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,...

But for ordinary data, we would not expect such a sequence to happen. This is the reason most statistics work as general indicators of data performance: the special cases that cause them to fail are themselves low-probability occurrences. In this case the problem is not whether the data are nonlinear; the problem is that they are nonrandom. This is a perfect example of the data failing to meet a criterion other than the one you are concerned with. Therefore the Durbin-Watson test fails, as would any statistical test for such data; they are simply not amenable to meaningful statistical calculations. Nevertheless, a "blind" computation of the Durbin-Watson statistic would give an apparently satisfactory value. This is a warning that other characteristics of the data can cause it to appear to meet the criteria. And you have to know what can occur.

But the mechanics of calculating DW for testing linearity are relatively simple, once you've gone through all the above: sort the data set according to the values predicted from the calibration model, then perform the calculation specified in equation 3. Note that, while the sorting is done using the predicted values from the model, the DW calculations are done using the residuals. But anyone doing calibration work should read Draper and Smith anyway; it's the "bible" of regression analysis (see reference 1). The discussions of DW appear on pages 69 and of Draper and Smith, third edition (the second edition contains a similar but somewhat less extensive discussion).
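The repeating sequence quoted above is easy to check numerically. The following minimal Python sketch (the function name is an illustrative choice) applies the successive-difference ratio to the 17 values shown.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# The 17 values quoted in the text: four repeats of 0, 1, 0, -1, then a final 0.
sequence = [0, 1, 0, -1] * 4 + [0]

dw = durbin_watson(sequence)
print(dw)  # 2.0
```

Every successive difference is +1 or -1, so the numerator is 16, while only the eight nonzero values contribute to the denominator. The ratio is therefore exactly 2 for this length, even though the sequence is plainly nonrandom.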
Draper and Smith also include an algorithm and tables of critical values for deciding whether the correlation is statistically significant. You might also want to check out page 64 for the proof that the linear correlation between residuals and predicted values from the calibration is zero.

DW vs. R²

So DW and R² test different things. As a specific test for nonlinearity, what is the relative utility of DW versus R² for that purpose? Basically, the answer is that when done according to the way Draper and Smith (and we) described, DW then is sensitive specifically to nonlinearity in the predictions. So, for example, in the case of the Anscombe data, all the other statistics (including R²) might be considered satisfactory, and because they are the same for all four sets of data, all four sets would be considered satisfactory. But if you do the DW test on the data showing nonlinearity, it will flag it as having a low value of the statistic. Anscombe did not provide enough samples' worth of synthetic data in his sets, however, for the calculated statistics to be statistically meaningful.

We also note that as a practical matter, meaningful calculation of the Durbin-Watson statistic requires many samples' worth of data. We noted above that for fewer than 15 samples critical values for this statistic are not listed in the tables. The reason for requiring so many samples is that essentially we are comparing two variances (or, at least, two measures of the same variance). Because variances are distributed as χ², for small numbers of samples this statistic has a very wide range of values indeed, so that comparisons virtually become meaningless because almost anything will fall within the confidence interval, giving this test low statistical power.

On the other hand, characterizing R² as a general measure of how good the fit is doesn't make us flinch, either; it is one of the standard statistics for doing that evaluation.
Quite the contrary, when we saw R² being specified as the way to test linearity, we wondered why it was chosen by FDA and ICH, because it is so nonspecific. We still don't know why, except for the obvious guess that they weren't aware of DW. We are in favor of keeping the other statistics as measures of the general "goodness of fit" of the model to the data, but in the specific context of trying to assess linearity, we still have to promote DW over R² as being more suited for that special purpose. (We eventually will discuss in our next few columns an even better method for assessing linearity.)

Sensitive, But Not Perfect

As for testing other characteristics of a univariate calibration, there also are ways to test for statistical significance of the slope, to see whether unity slope adequately describes the relationship between test results and analyte concentration. These are described in the book Principles and Practice of Spectroscopic Calibration (10). The statistics described there are called the "Data Significance t" (DST) test and the "Slope Significance t" (SST) test. Unless the DST is significant statistically, though, the SST is meaningless. In principle, there also is a test for the intercept. But because the expected value for the intercept depends upon the slope, it gets a bit hairy. It also makes the confidence interval so large that the test is nigh-on useless; few statisticians recommend it.


But let's add this coda to the discussion of DW: The fact that DW specifically is sensitive to nonlinearity does not mean that it is perfect. There might be cases of nonlinearity that will not be detected (especially if it's a marginal amount), linear data occasionally will be flagged as nonlinear (α percent of the time, in the long run), and other types of defects in the data can show up by giving a statistically significant value to DW. But all this is true for any and all statistics. The existence of at least one data set that is known to fool the calculation is a warning that the Durbin-Watson statistic, while a (large) step in the right direction, is not the ultimate answer.

Some further comments here: there does seem to be some confusion between the usage of the statistics recommended by the guidelines, which are excellent for their intended purpose of testing the general "goodness of fit" of a model, and the specific testing of a particular model characteristic, such as linearity. A good deal of this confusion probably is due to the fact that the guidelines recommend those general statistics for the specific task of testing linearity. As Anscombe shows, however, and as we referred to previously, those generalized statistics are not up to the task.

In our next column we will discuss other methods of testing for linearity that have appeared in the literature. We then will turn our attention to a new test that has been devised. In fact, it turns out that while DW has much in its favor, it is not the final or best answer. The new method is much more direct and specific even than DW. It is the correct way to test for linearity. We will discuss it all in due course, in a future installment of "."

References

1. N. Draper and H. Smith, Applied Regression Analysis (third edition) (John Wiley & Sons, New York, 1998).
2. J. Durbin and G.S. Watson, Biometrika 37, (1950).
3. J. Durbin and G.S. Watson, Biometrika 38, (1951).
4. J. Durbin, Biometrika 56, 1-15 (1969).
5. J. Durbin, Econometrica 38, (1970).
6. J. Durbin and G.S. Watson, Biometrika 58, 1-19 (1971).
7. H. Mark and J. Workman, Spectroscopy 2(11), (1987).
8. G. Ritchie and E. Ciurczak, Amer. Pharm. Rev. 3(3), (2000).
9. F.J. Anscombe, Amer. Stat. 27, (1973).
10. H. Mark, Principles and Practice of Spectroscopic Calibration (John Wiley & Sons, New York, 1991).

Jerome Workman Jr. serves on the Editorial Advisory Board of Spectroscopy and is director of research, technology, and applications development for the Molecular Spectroscopy & Microanalysis division of Thermo Electron Corp. He can be reached by e-mail at: jerry.workman@thermo.com
