Complements on Simple Linear Regression


Terry R. McConnell
Syracuse University
March 16, 2015

Abstract

We present a simple-minded approach to a variant of simple linear regression that seeks to minimize the sum of squared distances to a line rather than the sum of squared vertical distances. We also discuss variants where distances rather than squared distances are summed.

1 Introduction

The Park Service is planning to build a number of shelters at selected scenic locations throughout a large tract of forest. Since much of the labor in this project involves carrying materials and equipment to each shelter site from the nearest road, the Service also plans to construct a straight section of supply road that will minimize the amount of labor involved. How should the road be located?

If we assume labor is proportional to squared distance rather than distance, as is mathematically convenient, then this type of problem is a first cousin of the problem of simple linear regression from elementary statistics. In simple linear regression one is given a data set in the form of a scatter-plot, a plot depicting the locations of points in a 2-dimensional Cartesian coordinate system, and one seeks the line that best fits the set of points in the sense that it minimizes the sum of squared vertical distances from the points to the line. The type of problem where one seeks to minimize the squared distances, as opposed to squared vertical (or horizontal) distances, will be discussed in this paper. We shall call it orthogonal regression.

Regression problems arise in statistics in two quite different contexts, which we review here briefly.

(I) Measurement Error Models

In problems involving measurement error models, one posits that a linear relationship $y = mx + b$ holds between two physical quantities $x$ and $y$, but the exact relationship is obscured by measurement error when actual pairs $(x_i, y_i)$ are observed. A common mathematical model is to suppose that
$$y_i = m x_i + b + \epsilon_i, \quad i = 1, 2, \dots, n,$$
where the measurement errors $\epsilon_i$ are assumed to be independent and normally distributed with mean zero, and either known or unknown variance $\sigma^2$. The problem is to infer the values of the model parameters $m$, $b$, and perhaps $\sigma^2$ from the observed data. The maximum likelihood estimators of $m$ and $b$ are obtained as the slope and $y$-intercept of the least squares (vertical) regression line described above.

In actual collection of data, the $x_i$ are often selected by the experimenter and then corresponding $y_i$ are observed. In this context it is natural to think of $x$ as the independent variable and $y$ as the dependent variable, but the underlying mathematical model is symmetric in $x$ and $y$. If we switch to $y$ as the independent variable then the analogous statistical procedure could be termed horizontal regression: it would determine the line that minimizes the sum of squared horizontal distances to points in the scatter-plot. Horizontal and vertical regression generally give different solutions, which might be regarded as a jarring asymmetry in light of the symmetry of the underlying model.
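To make the contrast concrete, here is a minimal Python sketch, not part of the original paper, that computes both the vertical and the horizontal least-squares lines from the standard closed-form estimates ($m_v = SS(xy)/SS(x)$, and $m_h = SS(y)/SS(xy)$ once the horizontal line is rewritten in $y = mx + b$ form); the function name and the sample data are purely illustrative.

```python
def vertical_and_horizontal_regression(xs, ys):
    """Least-squares lines treating y (vertical) and x (horizontal) as the
    dependent variable.  Both are returned as (slope, intercept) pairs in
    y = m*x + b form so that the asymmetry is easy to see.
    """
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    ss_x = sum((x - xbar) ** 2 for x in xs)
    ss_y = sum((y - ybar) ** 2 for y in ys)
    ss_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    m_v = ss_xy / ss_x   # vertical regression: minimizes squared vertical distances
    m_h = ss_y / ss_xy   # horizontal regression in y = m*x + b form (assumes SS(xy) != 0)
    return (m_v, ybar - m_v * xbar), (m_h, ybar - m_h * xbar)

# Illustrative data: the two lines differ unless the points are exactly collinear.
print(vertical_and_horizontal_regression([0, 1, 2, 3], [0.2, 0.9, 2.3, 2.8]))
```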

(II) Bivariate Normal Parameter Estimation

Recall that the bivariate normal distributions govern the joint distributions of $(x, y)$ pairs in some models. They depend on 5 parameters: the means and variances of the marginal distributions of $x$ and $y$, and the correlation coefficient between $x$ and $y$. Observed data can again be depicted with scatter-plots, but in this case both coordinates are random. However, if we condition on either the $x$ values or the $y$ values, then the resulting model is mathematically equivalent to a measurement error model, and the maximum likelihood estimators can be found using horizontal and vertical regression. In this case, the decision of which variable to regard as given is entirely arbitrary.

In the context of both models discussed above, orthogonal regression can be viewed as an alternative method that is symmetric in its treatment of the variables and therefore more mathematically appealing. It is generally regarded as being significantly less tractable than horizontal or vertical regression, but, as we show in the next section, at least in the two-dimensional case it is possible to give a treatment that is no more involved than the usual derivations of least squares lines one sees in calculus and statistics courses.

Orthogonal regression is also known as Deming regression, after W. E. Deming, who wrote about the problem in the 1940s, but the roots of the problem and its solution date back to the late 19th century. See, for example, [3]. For an exposition similar in spirit to this one, but using somewhat different methods, see [2].

In section 2 we discuss orthogonal regression, and in section 3 we provide some examples to illustrate the results of section 2. In section 4 we discuss absolute value regression, in which distances replace squared distances.

It is a pleasure to thank Vincent Fatica for some interesting conversations on the subject of this paper. In particular, Vince suggested the idea for the Park Service problem stated above, and provided the argument to handle the annoying special case of Theorem 4.1. I also wish to thank Thomas John for pointing out reference [3].

2 Results

A convenient parametrization of the set of affine linear functions on $\mathbb{R}^2$ is given by
$$L(x, y) = \cos\theta\, x + \sin\theta\, y + \lambda, \quad \theta \in [0, 2\pi), \ \lambda \in \mathbb{R}.$$
We obtain all lines in the plane as the zero sets of affine functions. In what follows we will abuse notation by identifying lines with the affine functions that vanish on them.

The orthogonal distance $d$ from a given point $x_0 = (x_0, y_0)$ to $L$ satisfies
$$d^2 = L(x_0, y_0)^2 = (\cos\theta\, x_0 + \sin\theta\, y_0 + \lambda)^2.$$
To see this, note that the point $x_1 = (-\lambda\cos\theta, -\lambda\sin\theta)$ lies on $L$, and then compute the component of $x_0 - x_1$ in the direction of the unit normal, the vector with components $(\cos\theta, \sin\theta)$.

Now suppose we are given a scatter-plot of $n$ points $(x_i, y_i)$, $i = 1, 2, \dots, n$, and we are to find the values of $\theta$ and $\lambda$ that minimize the sum of squared orthogonal residuals given by
$$R_o^2(L) = \sum_{j=1}^{n} (\cos\theta\, x_j + \sin\theta\, y_j + \lambda)^2.$$
Setting $\partial R_o^2/\partial\lambda = 0$ we obtain that $L(\bar{x}, \bar{y}) = 0$, i.e., the minimizing line passes through the centroid, as in horizontal and vertical regression. (Here $\bar{x}$ denotes the arithmetic mean of the $x$-coordinates and $\bar{y}$ denotes the arithmetic mean of the $y$-coordinates.) It is convenient, then, to put the origin at the centroid and minimize
$$R_o^2(L) = \sum_{j=1}^{n} (\cos\theta\, x_j + \sin\theta\, y_j)^2$$
over $\theta \in [0, 2\pi)$.

A little manipulation shows that the normal equation $\partial R_o^2/\partial\theta = 0$ can be rewritten as
$$\frac{1}{2}\tan 2\theta = \frac{SS(xy)}{SS(x) - SS(y)}, \tag{2.1}$$
at least when $SS(x) \ne SS(y)$. Here we use the standard notations
$$SS(x) = \sum_j (x_j - \bar{x})^2, \quad SS(y) = \sum_j (y_j - \bar{y})^2, \quad SS(xy) = \sum_j (x_j - \bar{x})(y_j - \bar{y})$$
from statistics.

It is important to note here that $R_o^2$ has both a maximum and a minimum value if we restrict to lines that pass through the centroid. This is in contrast to the usual quantities $R_v^2$ ($R_h^2$) from vertical (horizontal) regression, where nearly vertical (horizontal) lines lead to unbounded values. In the case of orthogonal regression, it is clear that $R_o^2$ cannot exceed $n$ times the square of the diameter of the scatter-plot.

If $SS(xy) = 0$ then we have
$$R_o^2(L) = \cos^2\theta\, SS(x) + \sin^2\theta\, SS(y).$$
If $SS(x) = SS(y)$ also, then $R_o$ is constant and there is no unique minimum. If $SS(x) < SS(y)$ then the minimum occurs for $\theta = 0$ and $\pi$, and the maximum occurs for $\theta = \pi/2$ and $3\pi/2$. If $SS(x) > SS(y)$ then the maximum occurs for $\theta = 0$ and $\pi$, and the minimum occurs for $\theta = \pi/2$ and $3\pi/2$.

If $SS(xy) \ne 0$ and $SS(x) = SS(y)$, then we may interpret (2.1) as yielding solutions $\theta = \pi/4$, $3\pi/4$, $5\pi/4$, and $7\pi/4$. Thus, in all cases where $SS(xy) \ne 0$, there are exactly 4 solutions in $[0, 2\pi)$ that differ by multiples of $\pi/2$. Pairs that differ by $\pi$ describe the same line. The minimum line is the one whose slope has the same sign as $SS(xy)$. The maximum line is always perpendicular to the minimum line. (The latter fact is intuitive on geometric grounds, without being completely obvious.)

Orthogonal regression outperforms both vertical and horizontal regression in the sense that if $L_o$, $L_v$, and $L_h$ denote the respective minimum lines, then
$$R_o^2(L_o) \le \tfrac{1}{2} R_v^2(L_v) + \tfrac{1}{2} R_h^2(L_h). \tag{2.2}$$
To see this, note that neither the horizontal nor the vertical distance from a point to a line can be less than the orthogonal distance. Thus we have that $R_o^2(L) \le R_v^2(L)$ and $R_o^2(L) \le R_h^2(L)$ for any test line $L$. Since $L_o$ minimizes $R_o^2$, it follows that
$$R_o^2(L_o) \le \tfrac{1}{2} R_o^2(L_v) + \tfrac{1}{2} R_o^2(L_h) \le \tfrac{1}{2} R_v^2(L_v) + \tfrac{1}{2} R_h^2(L_h).$$

It is perhaps worth noting that a much stronger inequality holds for any given test line:
$$R_o^2(L) \le \tfrac{1}{4} R_v^2(L) + \tfrac{1}{4} R_h^2(L). \tag{2.3}$$
This follows from some elementary geometry: the squared distance from vertex to hypotenuse in a right triangle is equal to half the harmonic mean of the squares on the legs. This, in turn, cannot exceed half the arithmetic mean of the squares on the legs, and the desired inequality follows by summing these inequalities over points in the scatter-plot.

Plane-geometric results often assume their most elegant form when expressed in terms of complex numbers. Let $w$ be a non-zero complex number. Then the set of $z \in \mathbb{C}$ such that $\mathrm{Re}(\bar{w}z) = 0$ is a line through the origin. We shall call this the line determined by $w$.

Theorem 2.1. Let $(x_j, y_j)$, $j = 1, 2, \dots, n$ be points in the plane having the origin as centroid, and $w$ be a non-zero complex number. Put $z_j = x_j + i y_j$. Then the line determined by $w$ is extremal for orthogonal regression if and only if
$$\sum_{j=1}^{n} \mathrm{Im}(\bar{w}^2 z_j^2) = 0.$$
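As a numerical companion to (2.1), here is a minimal Python sketch, not from the paper, that locates the minimizing $\theta$. Instead of solving (2.1) directly it uses the equivalent rearrangement (mine, in the notation above) $R_o^2(\theta) = \tfrac{1}{2}(SS(x)+SS(y)) + \tfrac{1}{2}(SS(x)-SS(y))\cos 2\theta + SS(xy)\sin 2\theta$, whose minimizer can be read off with `atan2`; the function name is illustrative.

```python
import math

def orthogonal_regression(xs, ys):
    """Orthogonal (Deming) regression line for the points (xs[j], ys[j]).

    Returns (theta, lam) parametrizing the minimizing line
    L(x, y) = cos(theta)*x + sin(theta)*y + lam = 0.  The line passes
    through the centroid, and theta satisfies the normal equation (2.1):
    (1/2) tan(2*theta) = SS(xy) / (SS(x) - SS(y)).
    """
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    ss_x = sum((x - xbar) ** 2 for x in xs)
    ss_y = sum((y - ybar) ** 2 for y in ys)
    ss_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    # R_o^2(theta) = const + A*cos(2 theta) + B*sin(2 theta); the minimum occurs
    # where (cos 2theta, sin 2theta) points opposite to (A, B).
    A = 0.5 * (ss_x - ss_y)
    B = ss_xy
    theta = 0.5 * (math.atan2(B, A) + math.pi)   # not unique when A == B == 0
    lam = -(math.cos(theta) * xbar + math.sin(theta) * ybar)
    return theta, lam

# Three-point configuration of Example 1 below: expect the line y = x,
# i.e. slope -cos(theta)/sin(theta) == 1 and lam == 0, up to rounding.
theta, lam = orthogonal_regression([-2, 1, 1], [-1, -1, 2])
print(-math.cos(theta) / math.sin(theta), lam)
```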

The proof is a straightforward verification using (2.1). (The centroid is assumed to lie at the origin for convenience only. In general one needs to subtract the arithmetic mean of the $z_j$ from both $w$ and each of the $z_j$.)

3 Examples

In this section we provide 3 examples to illustrate the foregoing.

Example 1. The collection of 3 points with coordinates (−2, −1), (1, −1), and (1, 2) has centroid at the origin. For this arrangement, $L_o$ is the line through the origin of slope 1, and $L_v$ and $L_h$ are, respectively, the lines through the origin with slope 1/2 and with slope 2. We have $R_o^2(L_o) = 3$, while both $R_v^2(L_v)$ and $R_h^2(L_h)$ are equal to 4.5. Also, $R_v^2(L_o)$ and $R_h^2(L_o)$ both equal 6, showing that equality can hold in (2.3).

Example 2. The 4 corners (±1, ±1) of the side-2 square centered at the origin also have centroid at the origin. In this case $SS(xy) = 0$ and $SS(x) = SS(y) = 4$. All lines $L$ through the origin have $R_o^2(L) = 4$, so in this case there is no unique minimizing $L_o$. Also, $L_v$ is horizontal and $L_h$ is vertical, with $R_v^2(L_v) = R_h^2(L_h) = 4$, showing that equality can hold in (2.2).

Example 3 ([1], Example 14.3.9). A survey of agricultural yield against amount of a pest allowed (the latter can be controlled by crop treatment) gave the data in the following table, where $x_i$ is the pest rate and $y_i$ is the crop yield:

  i  :  1   2   3   4   5   6   7   8   9  10  11  12
 x_i :  8   6  11  22  14  17  18  24  19  23  26  40
 y_i : 59  58  56  53  50  45  43  42  39  38  30  27

In this example, $L_v$ has slope-intercept equation $y = -1.013x + 64.247$, $L_h$ has equation $y = -1.306x + 69.805$, and $L_o$ has equation $y = -1.172x + 67.264$.

4 Absolute Value (AV) Regression

In absolute value regression, also known as $L^1$ regression, residuals are obtained by adding distances (horizontal, vertical, or orthogonal) rather than squared distances. This type of regression is more natural for the Park Service problem than are the forms of regression discussed above.

As before, we take as given a set of points $(x_i, y_i)$, $i = 1, 2, \dots, n$, which we shall call the configuration. In vertical AV regression we seek to minimize the residual sum, given by
$$R_v = \sum_{j=1}^{n} |y_j - m x_j - b|,$$
over all possible lines $L$ having slope-intercept equation $y = mx + b$. In the case of orthogonal AV regression, the summands are replaced by the (orthogonal) distances from the points to the test line. This quantity is denoted $R_o$.

Theorem 4.1. If $n \ge 2$ then in all 3 cases of AV regression (vertical, horizontal, and orthogonal) there is an extremal line passing through two distinct points of the configuration.

This result provides an effective procedure for finding a best-fit line: simply try all possible lines through pairs of points in the configuration and select the one that gives the smallest residual sum. Given the speed of today's computers, this procedure is even practical for problems of modest size.
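The procedure just described is easy to implement. The sketch below, illustrative Python rather than anything from the paper, carries it out for vertical AV regression: it tries the line through every pair of distinct configuration points and keeps the one with the smallest residual sum $R_v$. Pairs with equal $x$-coordinates are skipped, since the vertical line through them is not of the form $y = mx + b$.

```python
from itertools import combinations

def av_vertical_regression(points):
    """Brute-force vertical AV (L1) regression, relying on Theorem 4.1:
    some extremal line passes through two distinct configuration points,
    so it suffices to try all such lines and minimize
    R_v = sum_j |y_j - m*x_j - b|.
    Returns (R_v, slope, intercept), or None if every pair shares an x-coordinate.
    """
    def residual(m, b):
        return sum(abs(y - m * x - b) for x, y in points)

    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # vertical candidate line: not of the form y = m*x + b
        m = (y2 - y1) / (x2 - x1)
        b = y1 - m * x1
        r = residual(m, b)
        if best is None or r < best[0]:
            best = (r, m, b)
    return best

# The pest-rate / crop-yield data of Example 3 (12 points, hence only 66 candidate lines).
xs = [8, 6, 11, 22, 14, 17, 18, 24, 19, 23, 26, 40]
ys = [59, 58, 56, 53, 50, 45, 43, 42, 39, 38, 30, 27]
print(av_vertical_regression(list(zip(xs, ys))))
```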

Theorem 4.1 is a well-known folklore result, at least in the case of vertical regression, but some proof sketches we have seen in the literature miss (or at least fail to mention) one annoying detail. Accordingly, it seems desirable to give a proof here.

Consider first the case of vertical AV regression. Let $L$ be a line that avoids all points in the configuration. We shall argue that $L$ either does not minimize $R_v$, or can be replaced by another line with the same value of $R_v$ that does go through some point in the configuration. Since no point lies on $L$, all points can be classified as belonging to one of two groups: group (I), points lying above $L$, i.e. $(x_j, y_j)$ for which $y_j > m x_j + b$; and group (II), points lying below $L$. If more than half of the points belong to group I, then translating the line vertically, i.e., in the direction of the positive $y$-axis, decreases $R_v$ until a point of the configuration is encountered. Similarly, if more than half of the points belong to group II, then a better line can be obtained by translating $L$ downward. Finally, if the number of points in the two groups is equal, $L$ can be translated either up or down without changing the value of $R_v$ until some point of the configuration is encountered. In any case, $L$ can be replaced by a line going through at least one point of the configuration and that has a value of $R_v$ that is no larger than the value $L$ had itself.

Accordingly, we may assume that $L$ passes through some point of the configuration. Without loss of generality, that point is $(x_1, y_1)$. Next, we repeat the argument, but with rotation about $(x_1, y_1)$ in place of vertical translation. Assume that $L$ does not pass through any other point of the configuration. Let $\theta$ be the angle $L$ makes at $(x_1, y_1)$ with the horizontal. A straightforward, but somewhat tedious, calculation shows that
$$R_v = \sum_{I} (y_j - y_1) - \sum_{II} (y_j - y_1) - (S_I - S_{II})\tan\theta, \tag{4.1}$$
where $S_I$ is the sum of $x_j - x_1$ over points in group I, $S_{II}$ is the corresponding sum over points in group II, and the first two sums likewise run over groups I and II. A differential change $d\theta$ in the inclination of $L$ that does not cause the line to encounter any other point of the configuration produces a differential change in $R_v$ of $-(S_I - S_{II})\sec^2\theta\, d\theta$. Thus, if $S_I \ne S_{II}$, the sign of $d\theta$ can be chosen so as to lead to a smaller value of $R_v$ for the rotated line. On the other hand, if $S_I = S_{II}$, then the value of $R_v$ is independent of $\theta$, and $L$ can be rotated until some other point of the configuration is encountered without changing the value of $R_v$. See (4.1). Thus, in any case, we can replace $L$ by a line, produced by rotation around $(x_1, y_1)$, that goes through a second point of the configuration, and that has a value of $R_v$ that is no larger than the value $L$ had itself.

The case of horizontal AV regression is similar.

Finally, consider a best-fit line $L$ for orthogonal AV regression, and let $R$ be the value of $R_o$ for this line. Rotate the coordinate system about the origin until the line $L$ becomes horizontal. Since rotations preserve distance, the resulting horizontal line is necessarily a best-fit line for the rotated configuration under orthogonal AV regression. It is also necessarily a best-fit line for the rotated configuration under vertical AV regression, since the minimum value of $R_v$ cannot be less than $R$. It may be that this line does not go through two points of the configuration, but if so, it can be replaced by another having the same value of $R_v = R$, and that does go through two points. But for this line we must have $R_o = R_v = R$, since $R_o \le R_v$ for any line.
(Indeed, the latter line must also be horizontal, in addition to passing through two points of the configuration.) Applying the inverse rotation transforms this line into one that has a value of $R_o$ equal to $R$, and that goes through two points of the original configuration.

References

[1] E. Dudewicz and S. Mishra, Modern Mathematical Statistics, Wiley, New York, 1988.
[2] P. Glaister, Least Squares Revisited, The Mathematical Gazette 85 (2001), 104-107.
[3] K. Pearson, On Lines and Planes of Closest Fit to Systems of Points in Space, Philosophical Magazine 2 (1901), 559-572.