The Effect of a Single Point on Correlation and Slope


Rochester Institute of Technology, RIT Scholar Works, Articles, 1990. This work is licensed under a Creative Commons Attribution 3.0 License.

Recommended citation: David L. Farnsworth, "The effect of a single point on correlation and slope," International Journal of Mathematics and Mathematical Sciences, vol. 13, no. 4, pp. 799-805, 1990. doi:10.1155/S0161171290001107

Internat. J. Math. & Math. Sci. Vol. 13 No. 4 (1990) 799-805

THE EFFECT OF A SINGLE POINT ON CORRELATION AND SLOPE

DAVID L. FARNSWORTH
Department of Mathematics, Rochester Institute of Technology, Rochester, New York 14623, U.S.A.

(Received December 8, 1989, and in revised form April 27, 1990)

ABSTRACT. By augmenting a bivariate data set with one point, the correlation coefficient and/or the slope of the regression line can be changed to any prescribed values. For the target value of the correlation coefficient or the slope, the coordinates of the new point are found as a function of certain statistics of the original data. The location of this new point with respect to the original data is investigated.

KEY WORDS AND PHRASES. Correlation coefficient, deletion technique, influence of data, least squares line, regression diagnostic, sample influence curve.

1980 AMS SUBJECT CLASSIFICATION CODES. 62J05, 62F35.

1. INTRODUCTION.

The correlation coefficient and slope of a least squares regression line are known to be sensitive to a few outliers in the data. For this reason, these estimators are called nonrobust or nonresistant. It is not unexpected that changing one data point or introducing a new data point will greatly perturb the regression line and various statistics associated with it. In fact, we can judiciously introduce a new data point and create a line of our choosing. The effect of adding one new data point is the subject of this paper.

In situations where we have a choice, this addition of a point is a way to lie with statistics. For example, a fit of payroll data, which we fear could reveal our bias, might be changed in this manner by appropriately hiring one more employee. Or, more benignly, in composing an example

for a classroom exercise or a made-up case study, we may want to produce certain outcomes. These might be obtained with the introduction of just one additional point. The idea of adding one special point in order to force a regression line through the origin was studied by Casella [1]. The location of the point was shown to be related to various statistics for the line.

The robustness of an estimator is in part gauged by the influence function. The influence curve or function for a given estimator is defined pointwise as a limit as a vanishing weight is placed on the data point. These curves are derivatives and measure infinitesimal asymptotic (large sample) influence on the given estimator. The earliest work on this concept was by Hampel [2] and [3]. Also, see Belsley et al. [4], Cook and Weisberg [5], and Huber [6].

Finite sample influence curves are difference quotients, not limits. For sample size $n$, the numerator consists of the difference between a statistic of interest with a given point included in its formulation and the same statistic without that point. The denominator is $1/n$. This sample influence curve is used to measure the influence of a single data point within a sample. The deletion of an influential data point would lead to large values of the influence curve. The diagnostic tests for influence presented in Section 5 are standardized versions of the sample influence curve.

In Sections 2, 3 and 4, for any target value of the slope or the correlation coefficient, the coordinates of one additional point are found as a function of certain statistics of the original data. The locations of these points are presented both analytically and graphically. They are realizations of curves of constant influence where the original $n-1$ data are parameters. The trouble is that we may be caught tampering with the data in this way if conventional diagnostics flag the new point. In Section 5, four of the customary measures of leverage and influence are briefly discussed. In Section 6, an illustrative numerical example is given.

2. PRESCRIBING THE CORRELATION COEFFICIENT.

We are given the original data $\{(x_i, y_i) : i = 1, 2, \ldots, n-1\}$. For convenience take the center $(\bar{x}, \bar{y}) = (0, 0)$ and variances $s_x^2 = \sum x_i^2/(n-1) = 1$ and $s_y^2 = \sum y_i^2/(n-1) = 1$, that is, the $x_i$ and $y_i$ are in standard units. The denominators of $s_x^2$ and $s_y^2$ are the full sample size since that choice leads to considerable simplification of subsequent formulas. Thus, the regression line is $y = bx$, and the correlation coefficient is $r = b$. The additional point $(x_n, y_n)$ is expressed in standard deviation units of the original $n-1$ data as $(u, v) = (x_n/s_x, y_n/s_y) = (x_n, y_n)$.

The condition that the correlation coefficient of the augmented data is the prescribed value $\rho$ is

$$\rho = \frac{nr + uv}{\sqrt{(n + u^2)(n + v^2)}}. \tag{2.1}$$

The dependence of $\rho$ upon the original data is only through the values of $n$ and $r$. Equation (2.1) is expressible as

$$[(1 - \rho^2)u^2 - n\rho^2]\,v^2 + [2nru]\,v + [n^2(r^2 - \rho^2) - n\rho^2 u^2] = 0. \tag{2.2}$$
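As a numerical sanity check on equation (2.1), the closed form can be compared with a direct recomputation of the correlation after the point is appended. The Python sketch below is illustrative and not part of the original article; the function name, the synthetic data, and the choice of $(u, v)$ are assumptions of the example.

```python
import numpy as np

def augmented_correlation(r, n, u, v):
    """Equation (2.1): the correlation of the n points obtained by
    appending (u, v) to n-1 standardized points with correlation r."""
    return (n * r + u * v) / np.sqrt((n + u ** 2) * (n + v ** 2))

# Brute-force check on synthetic data (not the paper's data).
rng = np.random.default_rng(1)
n = 12
x = rng.standard_normal(n - 1)
y = 0.5 * x + rng.standard_normal(n - 1)

# Standardize with denominator n-1, the paper's convention, so the
# center is (0, 0) and s_x = s_y = 1; then r is also the slope b.
x = (x - x.mean()) / np.sqrt(np.sum((x - x.mean()) ** 2) / (n - 1))
y = (y - y.mean()) / np.sqrt(np.sum((y - y.mean()) ** 2) / (n - 1))
r = np.sum(x * y) / (n - 1)

u, v = 2.0, -1.0
direct = np.corrcoef(np.append(x, u), np.append(y, v))[0, 1]
print(augmented_correlation(r, n, u, v), direct)  # agree to machine precision
```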

For each $\rho$, the solution curves are symmetric about the origin and the lines $v = u$ and $v = -u$, since equation (2.1) is invariant under replacement of $(u, v)$ with $(-u, -v)$ or $(v, u)$. Vertical asymptotes appear where $(1 - \rho^2)u^2 - n\rho^2 = 0$, that is, at $u = \pm\sqrt{n}\,\rho/\sqrt{1 - \rho^2} \equiv \pm k$. By symmetry, horizontal asymptotes are $v = \pm k$. Note that $k$ is a monotonically increasing function of $n$ and $\rho^2$. For $\rho = 0$, the asymptotes are the axes, and equation (2.1) becomes $uv = -nr$.

The discriminant of equation (2.2) is $4n\rho^2[n + u^2][(1 - \rho^2)u^2 + n(r^2 - \rho^2)]$, which is nonnegative for $u^2 \geq n(\rho^2 - r^2)/(1 - \rho^2)$. Thus, for $\rho^2 \leq r^2$ there is no restriction upon $u$. For $\rho^2 > r^2$ the nonnegativity of the discriminant does exclude $u$ from a strip about the $v$-axis. If $r = 0$, the $\rho = r$ solution curve becomes the two axes. If $r^2 = 1$, there are just the $\rho^2 < r^2$ type of solution curves.

Representative solution curves are shown in Figure 1 for $r = b = 0.5$ and $n = 12$. Because of the symmetries, only the right half-plane is displayed. The regression line $v = 0.5u$ and the asymptotes $u = 2$ and $v = 2$ for the $\rho = r = 0.5$ curve are shown. The $\rho = 0.6$ and $\rho = -0.6$ curves have asymptotes $u = \pm 2.59$ and $v = \pm 2.59$. Only the bulge about $v = u$ for the $\rho = 0.9$ curve is visible. Its shape is similar to that of the curve for $\rho = 0.6$. The $\rho = 0.9$ curve doubles back, changing its concavity and approaching the vertical asymptote $u = 7.15$ from the left and the horizontal asymptote $v = 7.15$ from below. All of the $\rho = 0$ curve is outside the square $\{-2 < u < 2, -2 < v < 2\}$, indicating how unusual the new point $(u, v)$ would have to be to convert from $r = 0.5$ to $\rho = 0$. Of particular interest are the two branches of the $\rho = r = 0.5$ curve: the correlation coefficient is unchanged even by points which are very distant along this curve.

Figure 1: Curves along which (u, v) may be placed to change the value of the correlation coefficient to selected values ρ for r = b = 0.5 and n = 12.
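The branches in Figure 1 can be traced by solving the quadratic (2.2) for $v$ at each abscissa $u$. The following sketch (function name illustrative, not from the paper) does this and filters out the sign-spurious roots that squaring (2.1) can introduce.

```python
import numpy as np

def v_for_target_rho(rho, r, n, u):
    """Real roots v of quadratic (2.2): ordinates at which an added point
    (u, v) changes the correlation from r to rho.  Squaring (2.1) can
    introduce sign-spurious roots, so candidates are verified against
    (2.1) itself before being returned."""
    A = (1 - rho ** 2) * u ** 2 - n * rho ** 2
    B = 2 * n * r * u
    C = n ** 2 * (r ** 2 - rho ** 2) - n * rho ** 2 * u ** 2
    if np.isclose(A, 0.0):          # u sits on a vertical asymptote +-k
        candidates = [] if np.isclose(B, 0.0) else [-C / B]
    else:
        # disc = 4 n rho^2 (n+u^2) [(1-rho^2) u^2 + n (r^2 - rho^2)]
        disc = B ** 2 - 4 * A * C
        if disc < 0:                # u inside the excluded strip (rho^2 > r^2)
            return []
        candidates = [(-B + s * np.sqrt(disc)) / (2 * A) for s in (1.0, -1.0)]

    def achieved(v):                # equation (2.1)
        return (n * r + u * v) / np.sqrt((n + u ** 2) * (n + v ** 2))

    return [v for v in candidates if np.isclose(achieved(v), rho)]

# r = b = 0.5 and n = 12, the setting of Figure 1.
print(v_for_target_rho(0.0, 0.5, 12, 3.0))  # double root v = -2, since uv = -nr
print(v_for_target_rho(0.9, 0.5, 12, 1.0))  # []: u = 1 lies inside the excluded strip
```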

3. PRESCRIBING THE SLOPE.

The condition that the slope of the least squares regression line based on the $n$ points is the prescribed value $d$ is

$$\frac{nb + uv}{n + u^2} = d, \tag{3.1}$$

or $v = n(d - b)/u + du$.

For each $d$ the solution curve is symmetric with respect to the origin. Asymptotes are $u = 0$ and $v = du$ for $d \neq b$. Maxima and minima occur at $(\pm\sqrt{n(d-b)/d},\ \pm 2d\sqrt{n(d-b)/d})$ for $(d - b)/d > 0$. Intersections with the $u$-axis occur at $(\pm\sqrt{n(b-d)/d},\ 0)$ for $(d - b)/d < 0$. Placing $d = b$ in equation (3.1) implies $u = 0$ or $v = bu$.

Figure 2 displays selected solution curves for $b = r = 0.5$ and $n = 12$. Because of symmetry, just the right half-plane is shown. Of course, the $d = 0$ curve in Figure 2 is the same as the $\rho = 0$ curve in Figure 1.

Figure 2: Curves along which (u, v) may be placed to change the value of the slope to selected values d for r = b = 0.5 and n = 12.

4. PRESCRIBING BOTH THE CORRELATION COEFFICIENT AND THE SLOPE.

From Figures 1 and 2 it is apparent that the slope can be drastically changed while the correlation coefficient remains numerically the same. Setting $\rho = r = b$, equations (2.1) and (3.1) give

$$d = b\sqrt{\frac{n + v^2}{n + u^2}}. \tag{4.1}$$

If $n$ is large, then $u$ or $v$ must be very large for $d$ to be much different from $b$. Within the constraints $\rho \in (-1, 1)$ and $\operatorname{sign}\rho = \operatorname{sign} d$, the two values $\rho$ and $d$ can be chosen arbitrarily by specifying the two numbers $u$ and $v$. However, the issue of remoteness of $(u, v)$ from $(\bar{x}, \bar{y}) = (0, 0)$ and from $y = bx$ is important and is addressed below.
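Equation (3.1) makes such points easy to manufacture. The sketch below, with an illustrative function name not taken from the paper, solves (3.1) for $v$ and reports the resulting correlation via (2.1) with $r = b$; the three abscissas $u = 0.80, 1.85, 3.00$ recover points B, C and D of Section 6 up to rounding.

```python
import numpy as np

def v_for_target_slope(d, b, n, u):
    """Equation (3.1) solved for v: the ordinate that makes the augmented
    least squares slope equal d, given the original standardized slope b."""
    return n * (d - b) / u + d * u

n, b = 12, 0.5                                     # the setting of Figure 2
for u in (0.80, 1.85, 3.00):
    v = v_for_target_slope(0.7, b, n, u)
    assert np.isclose((n * b + u * v) / (n + u ** 2), 0.7)        # check (3.1)
    rho = (n * b + u * v) / np.sqrt((n + u ** 2) * (n + v ** 2))  # (2.1), r = b
    print(f"u = {u:.2f}  v = {v:.2f}  resulting rho = {rho:.3f}")
```

The printed correlations (about 0.50, 0.64 and 0.71) show how far the correlation drifts from $r = 0.5$ as one moves along the $d = 0.7$ curve, consistent with equation (4.1).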

5. DETECTING THE NEW POINT.

Among the available diagnostic tests for gauging whether a data point $(x_i, y_i)$ is an interloper, four are the most common. Each measures a different feature of the data. See Neter et al. [7], Atkinson [8], or Cook and Weisberg [5]. The notation and critical values are not uniformly agreed upon in the literature. Those suggested in Chapter 11 of Neter et al. [7] are used below.

First, the point $(u, v)$ could be an $x$-outlier, that is, it could be far in a horizontal direction from the mean of all $n$ data. Such points influence the slope the most for a given change in the $y$-coordinate and are said to have high leverage. The usual measure of leverage for $(x_i, y_i)$ is the diagonal element $h_{ii}$ of the hat matrix. See Hoaglin and Welsch [9]. Each $h_{ii}$ depends upon all the $x$-coordinates but does not depend upon the $y$-coordinates of the data. If $4/n < h_{ii} \leq 1$, then $(x_i, y_i)$ will be said to have high leverage. If $1/n \leq h_{ii} < 4/n$, then $(x_i, y_i)$ has low leverage.

The next three diagnostic measures arise from the deletion technique. The impact of a data point is often detected by using this technique: the data point is removed; the statistic of interest is computed; and a standardized difference between that statistic and the corresponding statistic utilizing all the data is analysed as the diagnostic measure. The unstandardized differences, divided by $1/n$, are the sample influence curves discussed in the Introduction.

Second, the point $(u, v)$ could be outlying with respect to the $y$-value. A diagnostic statistic for measuring whether $(x_i, y_i)$ is a $y$-outlying point is the externally studentized residual $d_i$, which is a standardized vertical distance at $x_i$ between the point $(x_i, y_i)$ and the regression line created without $(x_i, y_i)$. Each $d_i$ is $t$-distributed with $n - 3$ degrees of freedom. Using the prediction limits for a new point $(x_i, y_i)$ based on the other $n-1$ data's regression line as the criterion for designating a point as outlying is equivalent to using $d_i$, as shown in Neter et al. [7], page 399.

Third, $(u, v)$ could inordinately change predicted values. The statistic $(\mathrm{DFFITS})_i$ is a standardized difference between the predicted values at $x_i$, one computed with, and one without, the data point $(x_i, y_i)$. The point $(x_i, y_i)$ is said to be influential in this sense if $|(\mathrm{DFFITS})_i| > 2\sqrt{p/n}$ (with $p = 2$ parameters here) for large $n$ and if $|(\mathrm{DFFITS})_i| > 1$ for small to moderate $n$. This test is based on the usual confidence bands for regression lines.

Fourth, $(u, v)$ could unduly impact the slope. The statistic $(\mathrm{DFBETA})_i$ is a standardized difference between the slopes of the two regression lines, one with $(x_i, y_i)$ and one without. If $|(\mathrm{DFBETA})_i| > 1$ for small to moderate $n$ or $|(\mathrm{DFBETA})_i| > 2/\sqrt{n}$ for large $n$, then $(x_i, y_i)$ is called influential in this sense.

Generally, in data analysis each of these four measures is routinely computed for all $n$ data sets obtained by deleting each point in turn. Then, the structure of each of the four batches of numbers is examined. But, since all critical values are absolute, we shall use an abridged procedure and find these four statistics for only the possible augmenting point $(u, v)$ in the illustrative example in the next section.
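For a simple regression with intercept, all four measures can be computed directly from their definitions. The following sketch assumes the formulas of Neter et al. [7], Chapter 11, as quoted above; the function name and interface are illustrative.

```python
import numpy as np

def diagnostics(x, y, i):
    """The four measures of Section 5 for point i of a simple linear
    regression with intercept: leverage h_ii, externally studentized
    residual d_i, DFFITS_i, and the standardized DFBETA_i for the slope.
    Formulas follow Neter et al. [7], Chapter 11."""
    n = len(x)
    X = np.column_stack([np.ones(n), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    h = X[i] @ XtX_inv @ X[i]                    # hat-matrix diagonal h_ii
    beta = XtX_inv @ X.T @ y                     # full-data fit
    e_i = y[i] - X[i] @ beta                     # ordinary residual at i
    keep = np.arange(n) != i                     # deletion technique
    Xd, yd = X[keep], y[keep]
    beta_d = np.linalg.inv(Xd.T @ Xd) @ Xd.T @ yd
    mse_d = np.sum((yd - Xd @ beta_d) ** 2) / (n - 3)
    d_i = e_i / np.sqrt(mse_d * (1 - h))         # externally studentized residual
    dffits = d_i * np.sqrt(h / (1 - h))
    dfbeta = (beta[1] - beta_d[1]) / np.sqrt(mse_d * XtX_inv[1, 1])
    return h, d_i, dffits, dfbeta
```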

6. NUMERICAL EXAMPLE.

Consider the following constructed data in standard units: (-1.54, -1.07), (-1.16, -1.26), (-0.77, 1.26), (-0.77, -0.63), (-0.39, -0.60), (-0.39, 0.60), (0.39, -0.63), (0.77, 0), (1.16, 1.89), (1.16, -0.63), and (1.54, 1.07). For these 11 data, $r = b = 0.50$.

Let us select $d = 0.7$ as our desired slope and points A(0.39, 6.49), B(0.80, 3.55), C(1.85, 2.59), and D(3.00, 2.90) as possible auxiliary points on the $d = 0.7$ curve. Point B is one where the correlation coefficient remains 0.5; see equation (4.1). Point C is the local minimum point of the $d = 0.7$ curve.

Each of the four diagnostic statistics discussed in Section 5 is presented in Table 1 for each of these four points representing $(x_n, y_n)$. The symbol * next to a numerical value means that the point would be deemed unusual, that is, considered of high leverage or influence, by the criteria given in Section 5. The level of significance is 0.05 for the $d_i$ column.

Point   h_ii    d_i     (DFFITS)_i   (DFBETA)_i
A       0.09    6.27*   2.03*        0.70
B       0.13    3.08*   1.19*        0.71
C       0.29    1.48    0.93         0.79
D       0.48*   1.06    1.01*        0.92

Table 1.

Points on the $d = 0.7$ curve with $x_n = u \geq 3.59$ will be flagged with $(\mathrm{DFBETA})_i > 1$. Table 1 shows that introducing point C will give us the desired change of slope but will not be detected with these four diagnostic statistics. Other points on the $d = 0.7$ curve may or may not be detected with various test statistics depending upon their locations.

The original eleven data, the regression line based on the original data, and a branch of the $d = 0.7$ curve with the four points labeled A, B, C, and D are shown in Figure 3. For purposes of scaling of $d_i$, the hyperbolic 95% prediction band is displayed as well.
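Table 1 can be checked with standard library routines. The sketch below uses statsmodels' influence diagnostics, taking the standardized dfbetas for the slope as the $(\mathrm{DFBETA})_i$ of Section 5; its output should agree with Table 1 up to the rounding of the published coordinates.

```python
import numpy as np
import statsmodels.api as sm

x0 = np.array([-1.54, -1.16, -0.77, -0.77, -0.39, -0.39,
               0.39, 0.77, 1.16, 1.16, 1.54])
y0 = np.array([-1.07, -1.26, 1.26, -0.63, -0.60, 0.60,
               -0.63, 0.00, 1.89, -0.63, 1.07])

candidates = {"A": (0.39, 6.49), "B": (0.80, 3.55),
              "C": (1.85, 2.59), "D": (3.00, 2.90)}

for label, (u, v) in candidates.items():
    # Refit with the augmenting point appended, then inspect that point.
    res = sm.OLS(np.append(y0, v), sm.add_constant(np.append(x0, u))).fit()
    infl = res.get_influence()
    i = len(x0)                          # index of the augmenting point
    print(f"{label}: h_ii = {infl.hat_matrix_diag[i]:.2f}  "
          f"d_i = {infl.resid_studentized_external[i]:.2f}  "
          f"DFFITS = {infl.dffits[0][i]:.2f}  "
          f"DFBETA = {infl.dfbetas[i, 1]:.2f}")
```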

Figure 3: Numerical example described in Section 6.

REFERENCES

1. CASELLA, G. Leverage and Regression Through the Origin, Amer. Statist. 37 (1983), 147-152.
2. HAMPEL, F.R. Contributions to the Theory of Robust Estimation, Ph.D. Thesis, University of California, Berkeley, 1968.
3. HAMPEL, F.R. The Influence Curve and Its Role in Robust Estimation, J. Amer. Statist. Assoc. 69 (1974), 383-393.
4. BELSLEY, D.A., KUH, E. and WELSCH, R.E. Regression Diagnostics, John Wiley and Sons, New York, 1980.
5. COOK, R.D. and WEISBERG, S. Residuals and Influence in Regression, Chapman and Hall, New York, 1982.
6. HUBER, P.J. Robust Statistics, John Wiley and Sons, New York, 1981.
7. NETER, J., WASSERMAN, W. and KUTNER, M.H. Applied Linear Regression Models, Second Edition, Richard D. Irwin, Inc., Homewood, Illinois, 1989.
8. ATKINSON, A.C. Plots, Transformations and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis, Oxford University Press, Oxford, 1985.
9. HOAGLIN, D.C. and WELSCH, R.E. The Hat Matrix in Regression and ANOVA, Amer. Statist. 32 (1978), 17-22.