Use and Abuse of Regression

Similar documents
Gilles Bourgeois a, Richard A. Cunjak a, Daniel Caissie a & Nassir El-Jabi b a Science Brunch, Department of Fisheries and Oceans, Box

Open problems. Christian Berg a a Department of Mathematical Sciences, University of. Copenhagen, Copenhagen, Denmark Published online: 07 Nov 2014.

The American Statistician Publication details, including instructions for authors and subscription information:

George L. Fischer a, Thomas R. Moore b c & Robert W. Boyd b a Department of Physics and The Institute of Optics,

Published online: 05 Oct 2006.

University, Tempe, Arizona, USA b Department of Mathematics and Statistics, University of New. Mexico, Albuquerque, New Mexico, USA

To cite this article: Edward E. Roskam & Jules Ellis (1992) Reaction to Other Commentaries, Multivariate Behavioral Research, 27:2,

University, Wuhan, China c College of Physical Science and Technology, Central China Normal. University, Wuhan, China Published online: 25 Apr 2014.

Ankara, Turkey Published online: 20 Sep 2013.

Nacional de La Pampa, Santa Rosa, La Pampa, Argentina b Instituto de Matemática Aplicada San Luis, Consejo Nacional de Investigaciones Científicas

Communications in Algebra Publication details, including instructions for authors and subscription information:

Dissipation Function in Hyperbolic Thermoelasticity

Guangzhou, P.R. China

CCSM: Cross correlogram spectral matching F. Van Der Meer & W. Bakker Published online: 25 Nov 2010.

The Fourier transform of the unit step function B. L. Burrows a ; D. J. Colwell a a

Geometric View of Measurement Errors

Park, Pennsylvania, USA. Full terms and conditions of use:

Full terms and conditions of use:

Version of record first published: 01 Sep 2006.

Online publication date: 01 March 2010 PLEASE SCROLL DOWN FOR ARTICLE

Diatom Research Publication details, including instructions for authors and subscription information:

Published online: 17 May 2012.

Precise Large Deviations for Sums of Negatively Dependent Random Variables with Common Long-Tailed Distributions

Characterizations of Student's t-distribution via regressions of order statistics George P. Yanev a ; M. Ahsanullah b a

Online publication date: 30 March 2011

Testing Goodness-of-Fit for Exponential Distribution Based on Cumulative Residual Entropy

OF SCIENCE AND TECHNOLOGY, TAEJON, KOREA

PLEASE SCROLL DOWN FOR ARTICLE

FB 4, University of Osnabrück, Osnabrück

Dresden, Dresden, Germany Published online: 09 Jan 2009.

Erciyes University, Kayseri, Turkey

The Homogeneous Markov System (HMS) as an Elastic Medium. The Three-Dimensional Case

G. S. Denisov a, G. V. Gusakova b & A. L. Smolyansky b a Institute of Physics, Leningrad State University, Leningrad, B-

Online publication date: 22 March 2010

France. Published online: 23 Apr 2007.

Tong University, Shanghai , China Published online: 27 May 2014.

Communications in Algebra Publication details, including instructions for authors and subscription information:

Acyclic, Cyclic and Polycyclic P n

Full terms and conditions of use:

Discussion on Change-Points: From Sequential Detection to Biology and Back by David Siegmund

MSE Performance and Minimax Regret Significance Points for a HPT Estimator when each Individual Regression Coefficient is Estimated

The use of the Normalized Difference Water Index (NDWI) in the delineation of open water features

Geometrical optics and blackbody radiation Pablo BenÍTez ab ; Roland Winston a ;Juan C. Miñano b a

Derivation of SPDEs for Correlated Random Walk Transport Models in One and Two Dimensions

University of Thessaloniki, Thessaloniki, Greece

Global Existence of Large BV Solutions in a Model of Granular Flow

Tore Henriksen a & Geir Ulfstein b a Faculty of Law, University of Tromsø, Tromsø, Norway. Available online: 18 Feb 2011

C.K. Li a a Department of Mathematics and

P. C. Wason a & P. N. Johnson-Laird a a Psycholinguistics Research Unit and Department of

Projective spaces in flag varieties

Published online: 27 Aug 2014.

Avenue G. Pompidou, BP 56, La Valette du Var cédex, 83162, France

PLEASE SCROLL DOWN FOR ARTICLE

T. Runka a, M. Kozielski a, M. Drozdowski a & L. Szczepańska b a Institute of Physics, Poznan University of

PLEASE SCROLL DOWN FOR ARTICLE

PLEASE SCROLL DOWN FOR ARTICLE

PLEASE SCROLL DOWN FOR ARTICLE. Full terms and conditions of use:

Published online: 04 Oct 2006.

Online publication date: 12 January 2010

A Simple Approximate Procedure for Constructing Binomial and Poisson Tolerance Intervals

35-959, Rzeszów, Poland b Institute of Computer Science, Jagiellonian University,

Sports Technology Publication details, including instructions for authors and subscription information:

Published online: 04 Jun 2015.

A note on adaptation in garch models Gloria González-Rivera a a

Published online: 10 Apr 2012.

Melbourne, Victoria, 3010, Australia. To link to this article:

PLEASE SCROLL DOWN FOR ARTICLE. Full terms and conditions of use:

Astrophysical Observatory, Smithsonian Institution, PLEASE SCROLL DOWN FOR ARTICLE

Xiaojun Yang a a Department of Geography, Florida State University, Tallahassee, FL32306, USA Available online: 22 Feb 2007

A Strongly Convergent Method for Nonsmooth Convex Minimization in Hilbert Spaces

PLEASE SCROLL DOWN FOR ARTICLE

Quan Wang a, Yuanxin Xi a, Zhiliang Wan a, Yanqing Lu a, Yongyuan Zhu a, Yanfeng Chen a & Naiben Ming a a National Laboratory of Solid State

Marko A. A. Boon a, John H. J. Einmahl b & Ian W. McKeague c a Department of Mathematics and Computer Science, Eindhoven

PLEASE SCROLL DOWN FOR ARTICLE

Fokker-Planck Solution for a Neuronal Spiking Model Derek J. Daniel a a

Fractal Dimension of Turbulent Premixed Flames Alan R. Kerstein a a

CENTER, HSINCHU, 30010, TAIWAN b DEPARTMENT OF ELECTRICAL AND CONTROL ENGINEERING, NATIONAL CHIAO-TUNG

Raman Research Institute, Bangalore, India

Silvio Franz a, Claudio Donati b c, Giorgio Parisi c & Sharon C. Glotzer b a The Abdus Salam International Centre for Theoretical

PLEASE SCROLL DOWN FOR ARTICLE

Yan-Qing Lu a, Quan Wang a, Yuan-Xin Xi a, Zhi- Liang Wan a, Xue-Jing Zhang a & Nai-Ben Ming a a National Laboratory of Solid State

Analysis of Leakage Current Mechanisms in BiFeO 3. Thin Films P. Pipinys a ; A. Rimeika a ; V. Lapeika a a

András István Fazekas a b & Éva V. Nagy c a Hungarian Power Companies Ltd., Budapest, Hungary. Available online: 29 Jun 2011

Horizonte, MG, Brazil b Departamento de F sica e Matem tica, Universidade Federal Rural de. Pernambuco, Recife, PE, Brazil

PLEASE SCROLL DOWN FOR ARTICLE

PLEASE SCROLL DOWN FOR ARTICLE

Accepted author version posted online: 28 Nov 2012.

To link to this article:

Anisotropic Hall effect in Al 13 TM 4

Móstoles, Madrid. Spain. Universidad de Málaga. Málaga. Spain

Online publication date: 29 April 2011

Morris J. Robins a, Sanchita Sarker a & Stanislaw F. Wnuk a a Department of Chemistry, Biochemistry, Brigham Young

A. H. Abuzaid a, A. G. Hussin b & I. B. Mohamed a a Institute of Mathematical Sciences, University of Malaya, 50603,

Osnabrück, Germany. Nanotechnology, Münster, Germany

Surface Tension of Liquid Crystals S. Chandrasekhar a a

Ampangan, Nibong Tebal, Penang, 14300, Malaysia c Director, REDAC, Universiti Sains Malaysia, Engineering Campus, Seri Ampangan, Nibong

William Henderson a & J. Scott McIndoe b a Department of Chemistry, The University of Waikato, Hamilton,

100084, People's Republic of China Published online: 07 May 2010.

PLEASE SCROLL DOWN FOR ARTICLE

Transcription:

This article was downloaded by: [130.132.123.28] On: 16 May 2015, At: 01:35 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK Technometrics Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/utch20 Use and Abuse of Regression George E.P. Box a a University of Wisconsin Published online: 30 Apr 2012. To cite this article: George E.P. Box (1966) Use and Abuse of Regression, Technometrics, 8:4, 625-629 To link to this article: http://dx.doi.org/10.1080/00401706.1966.10490407 PLEASE SCROLL DOWN FOR ARTICLE Taylor & Francis makes every effort to ensure the accuracy of all the information (the Content ) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content. This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http:// www.tandfonline.com/page/terms-and-conditions

TECHNOMETRICS VOL. 8, No. 4 NOVEMBER 1966 Use and Abuse of Regression GEORGE E. P. Box University of Wisconsin Let us first restate the usual assumptions and conclusions for linear least squares. Gauss showed that if we have n observations y1, yz,. * +, un and if an appropriate model for the uth observation is yu = 0, + A21, + 82% + * * * + P&-h + cl (1) where the,8 s are unknown parameters, the x s known constants, and the t s random variables uncorrelated and having the same variance and zero expectation, then estimates b,, b,,..., b, of the p s obtained by minimizing c (Y - 9) with jj = box0 + &xl + bzxz +... + bkxk are unbiassed and have smallest variance among all linear unbiassed estimates. The method of least squares is used in the analysis of data from planned experiments and also in the analysis of data from unplanned happenings. The word regression is most often used to describe analysis of unplanned data. It is the tacit assumption that the requirements for the validity of least squares analysis are satisfied for unplanned data that produces a great deal of trouble. Whether the data are planned or unplanned the quantity E, which is usually quickly dismissed as a random variable having the very specific properties mentioned above, really describes the effect of a large number of latent variables x~+~, x~+~,..., x, which we know nothing about. If we suppose that it is enough to consider the linear effects of these latent variables (which would often be realistic for small variations in x,++~,..., x,) we should have e = pk+lxk+l + flk+2xk+2 + *-* + /Lx, (2) Thus in matrix notation we can write for the column of n observations y J = x41 + xzea (3) where X1 has for elements the n values of the k regression variables and X, has for elements the n unknown values of the m - k latent* variables. The situation is illustrated in Figure 1 in which the variables xkcl. e e, x, are hidden behind the wall. In practice various kinds of linkages would occur between the variables indicated by lines. These linkages might indicate causative relations; for instance, an increase in temperature might necessarily produce an increase Received March 1966. t A talk prepared for the tenth conference on the Design of Experiments in Army Research, Development and Testing. Washington, D. C. November 1964. * More dramatically described as lurking variables. 625

626 GEORGE E. P. BOX X In Latent variables St2 /////////////////////////////,///////////// Regression variables FVXJRE 1 Latent variables and regression variables. in pressure; or merely relationships due to correlation. Thus, an operator in charge of a process might as a standard operating procedure always reduce the flow of one of the reactants if a certain temperature was observed to be high. We must now ask the question, What do we wish to do with the fitted regression equation? We might (i) desire to predict y in the future from passive observation of x1 -. - xl. We assume that the causal and correlative system which operated during the data taking has not been interferred with and also operates during the period when predictions are being made. (ii) to discover how deliberate changes in x, - -. xb will effect?/ with the intention of actually modifying the system to get a better value for y. The position is quite different depending upon whether prediction from passive observation or improvement from active interference is in mind. This is made clear by the following example. latent variable I/////// regression variable 7 x1 Pressure FIGURE 2 Relations between yield, impurity, and pressure.

USE AND ABUSE OF REGRESSION 627 Suppose that in a chemical process it has been found that undesirable frothing can be reduced by increasing pressure. The standard operating procedure is, therefore, to increase pressure whenever frothing appears. Suppose that the frothing in fact occurs because of an unsuspected impurity x2 (which is, of course, not measured because it is unknown). Suppose finally that a high value of impurity xz not only produces frothing but also lowers yield but that yield is unaffected directly by a change in pressure. If (with now x and y representing deviations from respective averages) a regression of yield on pressure $ = &x1 is fitted by the usual least squares procedure we may well find a highly significant coefficient 5,. This well known phenomenon of nonsense correlation exhibited in this example is worth studying further. Suppose there is a relationship y = fllxl + f12x2 connecting y exactly with the two variables x1 and x2 (with, in the present instance, p1 = 0). Now, of course, the actual levels of x2 are unknown but suppose that & = ax1, is the formal regression of x2 on x1 which would be obtained if values of xz were available. Then it is readily shown that bl = A + aa (4) In this expression 8, is zero and we appear to obtain a real effect only because of the influence of the bias term (a&). On the other hand using (4) we see that our fitted equation 9 = &x1 which ignored x2 can be written $ = fllxl + &ax1 or as 9 = PA + 82~1 (5) This equation which replaces xz by & is the best estimate of y we can expect to get from observing x1 only. Provided the system continues to be run in the same fashion as when the data were recorded we can use pressure to indicate the level of y. Of course, if we had measured x2 a more accurate (indeed in the present instance an exact) value of y would be deducible, but, lacking knowledge of the importance of x2, we might nevertheless appropriately use the simple regression equation 6 = &x1. On the other hand the value of b, will be utterly misleading if interpreted as the effect on the variable y of a unit change in x1. If we hope to increase yield by reducing pressure we will be disappointed. A similar argument applies for any number of variables. The true model is 9 = &LA + xzea (f-5) By including only the variables X, in the regression equation our prediction equation for y becomes i.e. 9 = X,b, = X,(X:X,)- X;y (7) = wgq- x:(xl@l + X&) (8) 9 = Xlljl + %e* (9) where ii, = X,A and A = (X:X,)- X:X, is the k + 1 X m - k matrix of regression coefficients of the latent variables on the regression variables.

628 GEORGE E. P. BOX Again we see that so far as the passive prediction of 9 is concerned our simple regression onto the known variables X, in effect replaces the unknown X, by X,. On the other hand the regression coefficients b, = 0, + A@, represent combinations of effects due to regression variables and latent variables and as before it is impossible to draw any valid conclusions as to how interference with the levels of the regression variables will affect the system. In a designed experiment, we are in quite a different case. It was, of course, to overcome such difficulties as those described above that Fisher introduced the idea of designed experiments and in particular of randomization. When the levels of the regression variables are chosen in some deliberately random manner it is impossible for the levels of a regression variable to be affected by the level of a latent variable. The only cause of the particular values which the regression variables have within the design framework is the throw of an unbiassed die or other random process. Fisher makes it possible to analyze the data as if Gaussian assumptions were true by making X, a random variable. The regression variables can, of course, still affect the latent variables and these may in turn effect y. Provided, however, we apply our results to the same system for which we obtained our data this will cause no problem. It will be genuinely true that apart from experimental error manipulation of regression variables will produce the predicted change in y even though it does it via some latent variable. The basic difficulty mentioned above is by no means the only one that faces us in the analysis of unplanned data. In the operation of an industrial process past experience often shows that certain variables are of major importance. In order to control fluctuations in the process, therefore, care is taken to hold precisely these variables very close to fixed values. As the statistical significance of any variable is greatly affected by the range it covers there is a strong probability, therefore, that the most important variables will be dubbed not significant by a standard regression analysis. A further difficulty is that with unplanned data regression variables will frequently be highly correlated only because of operating policy. The operator is told to reduce x2 whenever x1 becomes high. With such data even if difficulties from latent variables could be ignored it may be almost impossible to discover whether changes in y are associated with x1, with x2, or with both. In designed experiments, of course, one normally arranges that x1 and x2 are uncorrelated by using an orthogonal design. In summary the regression analysis of unplanned data is a technique which must be used with great care. However, (i) It may provide a useful prediction of y in a fixed system being passively observed even when latent variables of some importance exist. For this application computer programs which progressively add or drop variables make some sense. (ii) It is one of a number of tools sometimes useful in indicating variables which ought to be included in some latter planned experiment (in which randomization will, of course, be included as an integral part of the design). It ought never to be used to decide which variable should be

USE AND ABUSE OF REGRESSION 629 excluded from further investigation for reasons which are obvious from the above. To find out what happens to a system when you interfere with it you have to interfere with it (not just passively observe it). REFERENCE 1. FISHER, R. A., 1937. Design of Experiments, published by Oliver and Boyd.