Robust Regression via Discriminant Analysis. Author(s): A. C. Atkinson and D. R. Cox. Source: Biometrika, Vol. 64, No. 1 (Apr., 1977), pp. 15-19. Published by: Oxford University Press on behalf of Biometrika Trust. Stable URL: http://www.jstor.org/stable/2335764
Biometrika (1977), 64, 1, pp. 15-19
Printed in Great Britain

Robust regression via discriminant analysis

BY A. C. ATKINSON AND D. R. COX
Department of Mathematics, Imperial College, London

SUMMARY

For linear regression with grouped data, and with an equal number of observations per grouping point, linear discriminant analysis is applied to the ordered response variables to find that combination which shows strongest linear regression. After a modification to remove weights with an anomalous sign, there results a robust estimate of slope. Generalizations are outlined.

Some key words: Discriminant analysis; Grouped data; Order statistics; Outlier; Robust regression.

1. INTRODUCTION

In the past, fitting a straight line by least squares has been preceded by plotting the data to check for outliers and other anomalies. There has been much recent interest (Huber, 1973) in robust methods which will automatically deflate the contribution of extreme observations. Such methods are likely to be particularly useful when the total amount of data is large, as, for example, when straight lines are to be fitted to a large number of sets of data in an initial stage of analysis, detailed inspection of individual plots being impracticable. In the present paper, we follow Huber in concentrating on the control of extreme errors in the response, or dependent, variable; the possibility of reducing the contributions of points extreme in the explanatory variable raises fresh issues. We take the straight line model in the usual form

$$Y_i = \alpha + \beta x_i + \epsilon_i,$$

where the $\epsilon_i$ are random disturbances, centred in some sense about zero. A major difficulty in the robust estimation of location from single samples (Andrews et al., 1972) is the need to assume symmetrical distributions, or, failing that, to introduce some specific measure of the 'centre' of the distribution.
This difficulty does not arise in the regression problem, provided that we are interested only in the slope $\beta$ and that the distribution of $\epsilon_i$ does not depend on $x_i$. In the remainder of the paper, we concentrate on the point estimation of $\beta$.

2. CENTRAL IDEA

Consider data in which there are m values $x_1, \ldots, x_m$ at each of which r independent values of Y are observed. Denote the ordered values of Y at $x = x_i$ by $Y_{i(1)} \le \ldots \le Y_{i(r)}$. Now a linear estimate of location based on these values is

$$Y_i(a) = \sum_{s=1}^{r} a_s Y_{i(s)}, \qquad \sum_{s=1}^{r} a_s = 1, \qquad (1)$$

where $a_1, \ldots, a_r$ are suitable constants, the same for all i. The central idea of the paper is that the $a_s$ should be chosen so that the regression relation is
recovered in as strong a form as possible; if one or more outliers in the data would seriously distort the relation, then we may hope to choose the a's to give these values little or no weight. One way to do this is to choose the $a_s$ so that the proportion of the corrected sum of squares of the $Y_i(a)$ accounted for by linear regression on $x_i$ is maximized. That is, the $a_s$ are chosen by linear discriminant analysis. Because of the algebraic identity between linear discriminant analysis and multiple regression, we can therefore translate the problem into one of multiple regression with m individuals and r explanatory variables.

The analysis is thus as follows. The formal regression analysis of x on the Y's determines a linear function $\sum c'_s Y_{i(s)}$, and the imposition of the constraint (1) leads to the location measure $\sum c_s Y_{i(s)}$, where $c_s = c'_s / \sum c'_t$. The regression coefficient of this quantity on $x_i$ is the required estimate of $\beta$.

More formally, let Y be the $m \times r$ matrix of ranked observations and let x be the $m \times 1$ vector of centred independent variables, so that $x_1 + \ldots + x_m = 0$. Then, if X is the $m \times 2$ design matrix with ith row $(1 \  x_i)$, the ratio of the regression sum of squares to the residual sum of squares is

$$\frac{(a^T Y^T x)^2 / (x^T x)}{a^T Y^T \{I - X(X^T X)^{-1} X^T\} Y a},$$

which is maximized as a function of a by the values a satisfying the multiple regression equation

$$Y^T \{I - X(X^T X)^{-1} X^T\} Y a = Y^T x.$$

An alternative way of writing this equation is that the $(s, t)$th element of the matrix on the left-hand side is

$$\sum Y_{i(s)} Y_{i(t)} - \sum Y_{i(s)} \sum Y_{i(t)} / m - \sum Y_{i(s)} x_i \sum Y_{i(t)} x_i \Big/ \sum x_i^2,$$

where all summations are over $i = 1, \ldots, m$. The required coefficients c are found by imposing the constraint (1) on the a. The same values of c result if the residual sum of squares in the discriminant analysis is replaced by the sum of squares of the Y's corrected for the mean.
This leads to a second multiple regression equation identical to the first except for the omission of the third term in the expression above for the elements of the left-hand matrix. The resulting regression coefficients are a constant multiple of the a and so yield the same values of c after operation of the constraint. The resulting robust estimate of the regression coefficient is

$$\beta^* = (c^T Y^T x)/(x^T x) = \sum_{i=1}^{m} \sum_{s=1}^{r} c_s Y_{i(s)} x_i \Big/ \sum_{i=1}^{m} x_i^2.$$

Note that for m distinct values of x the degrees of freedom for deviations from linear regression are $m - 2$. For the method to work at all we therefore must have $m \ge 3$, and appreciably larger m are, of course, desirable. In addition, if $r > m - 2$, it can be shown that exact linearity is achievable by suitable choice of the $c_s$. Therefore in what follows we assume that $m \ge 3$, $r \le m - 2$.

3. TWO DIFFICULTIES

Unfortunately there are two difficulties with the relatively simple procedure outlined in §2. First, in defining the location measure (1) it would in many respects have been natural to have imposed the further constraint that the $a_s$ are nonnegative. Certainly if the $c_s$ of §2 are arbitrary, there is at least the possibility that the analysis will produce a linear combination that could not naturally be regarded as a measure of location and which, more seriously, would
lead to a poor estimate of slope. Numerical investigation shows that the estimates produced allowing negative $c_s$ are indeed poor. Two modifications were therefore studied:

(i) to apply the procedure of §2, dropping 'variables' with negative weights and renormalizing so that $c_s = c'_s / \sum c'_t$; the whole calculation is repeated until only 'variables' with nonnegative weights remain;

(ii) to fit all possible subset regressions and to take the best with all $c_s$ nonnegative.

Numerical work shows that (ii), which is effectively the exact solution of the appropriate constrained optimization problem, gives negligible gain over (i). Subsequent work on the method therefore uses (i).

The second point concerns the noninvariance of the estimated slope under simple transformation. If we add $k x_i$ to each $Y_{i(s)}$, the slope of the regression is changed from $\beta$ to $\beta + k$, and the least squares estimate transforms similarly. But the estimate defined either by the simple procedure of §2, or by the above modifications, does not transform simply. This can be verified algebraically. Qualitatively, the general method makes sense only when there is appreciable regression present. Detailed numerical work shows that:

(a) for errors with a finite variance, provided that we operate in a region of strong regression defined empirically by $|\beta| / \{rm \surd \mathrm{var}(\hat\beta)\} > 0.1$, where $\mathrm{var}(\hat\beta)$ is the variance of the least squares estimate, then the new estimates have the required invariance to within 1%;

(b) if the method is used in a region of nearly zero regression, estimates of notably inferior performance are obtained. Some results, given in Table 3, are discussed in §4.

A consequence of (b) is that for data showing slight regression it is best to add $k x_i$ to each $Y_{i(s)}$ for a suitably large k, to apply the procedure of (i) above and then to subtract k from the resulting estimate of slope.
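To make the scheme concrete, the procedure of §2 together with modification (i) can be sketched in code. This is our illustrative numpy sketch, not the authors' program; the function name and interface are invented for the example.

```python
import numpy as np

def robust_slope(x, Y):
    """Illustrative sketch of the weighted-ranks slope estimate:
    weights for the within-group order statistics are found by the
    multiple regression of x on the ordered Y's (the discriminant-
    analysis identity of Sec. 2), and order statistics receiving
    negative weight are dropped and the fit repeated (modification (i))."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                                 # centre the x's
    Y = np.sort(np.asarray(Y, dtype=float), axis=1)  # order values within each group
    m, r = Y.shape
    X = np.column_stack([np.ones(m), x])             # design matrix with ith row (1, x_i)
    M = np.eye(m) - X @ np.linalg.solve(X.T @ X, X.T)  # residual-forming matrix
    cols = list(range(r))                            # order statistics still in play
    while True:
        Z = Y[:, cols]
        # multiple regression equations: (Z' M Z) a = Z' x
        a = np.linalg.solve(Z.T @ M @ Z, Z.T @ x)
        c = a / a.sum()                              # impose sum of weights = 1
        if np.all(c >= 0):
            break
        cols = [j for j, cj in zip(cols, c) if cj >= 0]  # drop negative weights, refit
    beta = (c @ Z.T @ x) / (x @ x)   # slope of the weighted locations on x
    return beta, cols, c
```

For data showing slight regression, the advice of the text applies unchanged: add $k x_i$ to each row of Y before calling the function and subtract k from the returned slope.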
The reason that these points are difficult to investigate algebraically is that the operation of the sign constraints changes in a complicated way with k.

4. SOME NUMERICAL RESULTS

Tables 1, 2 and 3 summarize a small part of extensive simulation studies of the estimate of §3. In all cases we took r = 5, m = 10 and, where appropriate, $\sigma^2 = 1$. Results for r = 3 and 8 and for m = 5 and 15 were also obtained, but showed no unexpected features. For m = 10 the values of $x_i$ were taken equally spaced on (-9, 9). For these values of r and m the requirement of strong regression in §3 reduces to requiring $|\beta| > 5 \surd \mathrm{var}(\hat\beta)$. In the simulations yielding Tables 1 and 2 we took $\beta = 1$, which is 40.6 standard errors. In Table 1 we compare the means and variances from 1000 simulations of the two estimators, with the variances standardized against ordinary least squares; for these conditions $\mathrm{var}(\hat\beta) = 0.606 \times 10^{-3}$. The results justify the following conclusions.

(i) When the ε's are normally distributed, the estimated slope has no bias and a variance at most 25% greater than that of the least squares estimate.

(ii) The effect of outliers with normal errors was studied by alternately adding and subtracting 10 from one of the observations for which $x_i = 9$. In this situation crude least squares, i.e. least squares without preliminary inspection of the data, does badly, giving an estimate with over four times the variance of the robust estimate.

(iii) For the Laplace distribution performance is better than that of least squares and satisfactory as judged against the best linear estimate, given in Table 1 as 'theoretical' (Govindarajulu, 1966).
(iv) If the errors are Cauchy, the least squares estimate of the slope has infinite variance, 'approximated' by 14 530 in our results. The robust method gives a variance a little less than that of the estimator based on the median. In simple samples of size 5 from the Cauchy distribution, the efficiency of the median relative to the maximum likelihood estimate is 77.8% (Barnett, 1966).

Table 1. Means and standardized variances of robust and least squares estimates when β = 1

Model                 Estimate              Mean    Standardized variance
Normal                Least squares         0.999    1.006
                      Weighted ranks        1.001    1.241
Normal plus outlier   Least squares         1.001    6.008
                      Weighted ranks        1.003    1.468
Laplace               Least squares         1.000    0.992
                      Weighted ranks        1.001    0.929
                      Theoretical                    0.792
Cauchy                Least squares         1.041    14 530
                      Weighted ranks        1.003    5.482
                      Theoretical, median            6.106

The standardized variance is the ratio of empirical variance to theoretical variance for least squares with $\sigma^2 = 1$.

Table 2. Average weights, from 1000 simulations, of the order statistics in the robust estimate

Model               Estimate                                       Weights
Normal              Weighted ranks                 0.195  0.199  0.199  0.210  0.197
                    Theoretical, least squares     0.2    0.2    0.2    0.2    0.2
Normal plus         Weighted ranks                 0.015  0.264  0.246  0.250  0.225
  negative outlier  Theoretical, least squares     0      0.25   0.25   0.25   0.25
                      omitting outlier
Laplace             Weighted ranks                 0.083  0.246  0.340  0.247  0.084
                    Theoretical (best linear)      0.017  0.221  0.524  0.221  0.017
Cauchy              Weighted ranks                 0.019  0.206  0.539  0.221  0.015
                    Theoretical, median            0      0      1      0      0

Table 3. Dependence of standardized variance of robust estimates on the value of β

                                      |β|/√var(β̂)
Model     Estimate                  0        1        5
Normal    Weighted ranks            2.760    2.377    1.180
          Trapezoidal weighting     2.171    1.866    1.137
Cauchy    Weighted ranks            34.80    41.42    14.95
          Trapezoidal weighting     52.88    65.44    29.30

It was hoped initially that the estimated weights $c_s$ would give information about the distributional shape involved. Unfortunately, even in the normal case, the weights vary considerably from one realization to another.
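The least squares half of conclusion (ii) is easy to reproduce. The following sketch (our code, not the authors' simulation program) regenerates the clean-data variance $0.606 \times 10^{-3}$ and the roughly sixfold inflation shown in the 'normal plus outlier' rows of Table 1.

```python
import numpy as np

# Illustrative re-creation (ours, not the authors' study) of the variance
# inflation of crude least squares: r = 5 observations at each of m = 10
# x-values equally spaced on (-9, 9), with 10 alternately added to and
# subtracted from one observation at x = 9.
rng = np.random.default_rng(0)
x = np.linspace(-9.0, 9.0, 10)
xc = x - x.mean()                       # x is already centred here
m, r, beta = len(x), 5, 1.0
clean, dirty = [], []
for rep in range(2000):
    Y = beta * x[:, None] + rng.standard_normal((m, r))
    clean.append((Y.sum(axis=1) @ xc) / (r * xc @ xc))   # LS slope, clean data
    Y[-1, 0] += 10.0 if rep % 2 == 0 else -10.0          # outlier at x = 9
    dirty.append((Y.sum(axis=1) @ xc) / (r * xc @ xc))   # LS slope, contaminated
# theoretical clean-data variance is 1/(5 * 330) = 0.606e-3, as quoted in the text
print(np.var(clean), np.var(dirty))
```

The first printed variance sits close to the theoretical $\sigma^2/(r \sum x_i^2)$; the second is inflated by the squared slope shift that a single ±10 perturbation at the extreme design point induces.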
In Table 2 we show the weights averaged over 1000 realizations; these do, however, approximate the theoretically optimum weights, although with some diffusion. To exhibit more strongly the effect of outliers on the weights, contamination in this
case was introduced solely by subtracting 10 from one observation. The behaviour of the averaged values of the weights is useful only when large numbers of similar sets of data are to be analyzed.

To quantify our earlier remarks about requiring strong regression we show in Table 3 the effect of the value of $\beta$ on the variance of the robust estimator for both normal and Cauchy errors, again based on 1000 simulations. In the normal case a variance multiplication of nearly three when $\beta$ is zero is reduced to one when $\beta$ is five times its standard error. In the Cauchy case greater regression seems to be required.

5. SOME GENERALIZATIONS

The work described above can be extended in various ways. The following outlines some of the main possibilities.

(i) If the values of x are not initially grouped, a grouping can be imposed artificially. It is known that for least squares analysis (Haitovsky, 1973) imposition of even quite coarse grouping has little effect, unless there are a few very extreme points. The same applies to the method proposed here.

(ii) The method can be applied to multiple regression, provided that reasonable grouping of the explanatory variables is feasible. The linear discriminant function is replaced by the first canonical variable in the canonical regression analysis of the ordered responses on the explanatory variables.

(iii) Especially with larger values of r, it is possible to replace the arbitrary weights $a_s$ by smoothly varying weights, e.g. by trapezoidal weights or even by symmetrical weights. For r = 5, trapezoidal weighting reduces to replacing the observations by three triangularly weighted combinations of the order statistics. As the results in Table 3 show, these are, as is to be expected, better behaved than arbitrary weights if the errors are normal and less well behaved if they are Cauchy. These results were not sufficiently encouraging to demand further study.
But a possible extension of the idea would allow the handling of data with varying numbers per group.

(iv) If the variance of ε changes strongly with x, the method of §3 does badly, in fact worse than unweighted least squares.

(v) In principle it should be possible to distinguish, say, between a case where the median of Y varies linearly with x as compared with one in which E(Y) varies linearly with x; these will differ when the distributional shape changes with x. Trials with this were not encouraging.

(vi) Methods for the estimation of the precision of the robust estimate of slope would be of interest.

REFERENCES

ANDREWS, D. F., BICKEL, P. J., HAMPEL, F. R., HUBER, P. J., ROGERS, W. H. & TUKEY, J. W. (1972). Robust Estimates of Location. Princeton: Princeton University Press.
BARNETT, V. D. (1966). Order statistics estimators of the location of the Cauchy distribution. J. Am. Statist. Assoc. 61, 1205-18.
GOVINDARAJULU, Z. (1966). Best linear estimates under symmetric censoring of the parameters of a double exponential population. J. Am. Statist. Assoc. 61, 248-58.
HAITOVSKY, Y. (1973). Regression Estimation from Grouped Observations. London: Griffin.
HUBER, P. J. (1973). Robust regression: asymptotics, conjectures and Monte Carlo. Ann. Statist. 1, 799-821.

[Received July 1976. Revised September 1976]