Linear Regression Demystified


Linear regression is an important subject in statistics. In elementary statistics courses, formulae related to linear regression are often stated without derivation. This note intends to derive these formulae for students with a more advanced math background.

1 Simple Regression: One Variable

1.1 Least Square Prescription

Suppose we have a set of data points $(x_i, y_i)$ ($i = 1, 2, \ldots, n$). The goal of linear regression is to find a straight line that best fits the data. In other words, we want to build a model

$$\hat{y}_i = \beta_0 + \beta_1 x_i \qquad (1)$$

and tune the parameters $\beta_0$ and $\beta_1$ so that $\hat{y}_i$ is as close to $y_i$ as possible for all $i = 1, 2, \ldots, n$.

Before we do the math, we need to clarify the problem. How do we judge the closeness of $\hat{y}_i$ and $y_i$ for all $i$? If the data points $(x_i, y_i)$ do not fall exactly on a straight line, $y_i$ and $\hat{y}_i$ are not going to be the same for all $i$. The deviation of $y_i$ from $\hat{y}_i$ is called the residual and is denoted by $\epsilon_i$ here. In other words,

$$y_i = \hat{y}_i + \epsilon_i = \beta_0 + \beta_1 x_i + \epsilon_i. \qquad (2)$$

The least square prescription is to find $\beta_0$ and $\beta_1$ that minimize the sum of the squares of the residuals, SSE:

$$SSE = \sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2. \qquad (3)$$

If you are familiar with calculus, you will know that the minimization can be done by setting the derivatives of SSE with respect to $\beta_0$ and $\beta_1$ to 0, i.e. $\partial SSE/\partial\beta_0 = 0$ and $\partial SSE/\partial\beta_1 = 0$. (Careful students may realize that this only guarantees that SSE is stationary, not necessarily a minimum. There is also a concern whether the resulting solution is a global minimum or a local minimum. Detailed analysis, see Section 1.3, reveals that the solution in our case is a global minimum.) The resulting equations are

$$\sum_{i=1}^n (y_i - \hat{y}_i) = \sum_{i=1}^n \epsilon_i = 0 \quad \text{and} \quad \sum_{i=1}^n x_i (y_i - \hat{y}_i) = \sum_{i=1}^n x_i \epsilon_i = 0. \qquad (4)$$

We can combine the first and second equations to derive an alternative form of the second equation:

$$\sum_{i=1}^n \epsilon_i = 0 \;\Rightarrow\; \sum_{i=1}^n \bar{x}\,\epsilon_i = 0.$$

Subtracting this equation from the second equation of (4) yields $\sum_i (x_i - \bar{x})\epsilon_i = 0$. Equation (4) can therefore be expressed as

$$\sum_{i=1}^n (y_i - \hat{y}_i) = \sum_{i=1}^n \epsilon_i = 0 \quad \text{and} \quad \sum_{i=1}^n (x_i - \bar{x})(y_i - \hat{y}_i) = \sum_{i=1}^n (x_i - \bar{x})\epsilon_i = 0. \qquad (5)$$
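The normal equations (4) are two linear equations in $\beta_0$ and $\beta_1$ and can be checked numerically. Below is a minimal NumPy sketch; the data values and variable names are made up for illustration and are not part of the note:

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
y = np.array([1.0, 1.8, 2.1, 3.2, 3.9])
n = len(x)

# Equations (4) written out give a 2x2 linear system for (beta0, beta1):
#   n*beta0      + beta1*sum(x)   = sum(y)
#   beta0*sum(x) + beta1*sum(x^2) = sum(x*y)
A = np.array([[n, x.sum()],
              [x.sum(), (x * x).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])
beta0, beta1 = np.linalg.solve(A, rhs)

eps = y - (beta0 + beta1 * x)          # residuals
print(beta0, beta1, (eps ** 2).sum())  # fitted parameters and SSE
print(eps.sum(), (x * eps).sum())      # both sums ~0, as equation (4) requires
```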

If you are not familiar with calculus, we can show you a proof using algebra. I actually prefer this proof since it is more rigorous. Before presenting the proof, let's review some concepts in statistics.

1.2 Mean, Standard Deviation and Correlation

The mean and standard deviation of a set of points $\{u_i\}$ are defined as

$$\bar{u} = \frac{1}{n}\sum_{i=1}^n u_i, \qquad SD_u = \sqrt{\frac{1}{n}\sum_{i=1}^n (u_i - \bar{u})^2}. \qquad (6)$$

The Z score of $u_i$ is defined as

$$Z_{u_i} = \frac{u_i - \bar{u}}{SD_u}. \qquad (7)$$

The correlation of two sets of points (with equal numbers of elements) $\{x_i\}$ and $\{y_i\}$ is defined as

$$r_{xy} = r_{yx} = \frac{1}{n}\sum_{i=1}^n Z_{x_i} Z_{y_i} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n\, SD_x\, SD_y}. \qquad (8)$$

The Cauchy-Schwarz inequality states that for two sets of real numbers $\{u_i\}$ and $\{v_i\}$,

$$\left(\sum_{i=1}^n u_i v_i\right)^2 \leq \left(\sum_{i=1}^n u_i^2\right)\left(\sum_{i=1}^n v_i^2\right),$$

and the equality holds if and only if $v_i = k u_i$ for all $i$, where $k$ is a constant. You can find the proof of the inequality in, e.g., Wikipedia. It follows from the Cauchy-Schwarz inequality that $|r_{xy}| \leq 1$, with $|r_{xy}| = 1$ if and only if the points $(x_i, y_i)$ fall exactly on a straight line, i.e. $\epsilon_i = 0$ for all $i$. So the correlation $r_{xy}$ measures how tightly the points $(x_i, y_i)$ are clustered around a line.
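These quantities map directly onto NumPy one-liners. The sketch below (our own illustrative code, with the same made-up data as before) uses the $1/n$ convention of equation (6):

```python
import numpy as np

def z_scores(u):
    """Z scores, equation (7); np.std defaults to the 1/n convention of equation (6)."""
    return (u - u.mean()) / u.std()

x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
y = np.array([1.0, 1.8, 2.1, 3.2, 3.9])

r_xy = np.mean(z_scores(x) * z_scores(y))   # correlation, equation (8)
print(r_xy)
print(np.corrcoef(x, y)[0, 1])              # agrees with NumPy's built-in correlation
```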

1.3 Regression Equation

To derive the regression equation, we first rewrite SSE in equation (3) in terms of Z-scores. It follows from the definition of Z-scores that

$$x_i = \bar{x} + SD_x Z_{x_i}, \qquad y_i = \bar{y} + SD_y Z_{y_i},$$

and the SSE becomes

$$SSE = \sum_{i=1}^n \left[\bar{y} + SD_y Z_{y_i} - \beta_0 - \beta_1(\bar{x} + SD_x Z_{x_i})\right]^2.$$

For simplicity, we denote $\tilde{\beta}_0 = \bar{y} - \beta_0 - \beta_1\bar{x}$. Then we have

$$SSE = \sum_{i=1}^n \left(SD_y Z_{y_i} - \beta_1 SD_x Z_{x_i} + \tilde{\beta}_0\right)^2
= SD_y^2\sum_i Z_{y_i}^2 + \beta_1^2 SD_x^2\sum_i Z_{x_i}^2 + n\tilde{\beta}_0^2
+ 2\tilde{\beta}_0 SD_y\sum_i Z_{y_i} - 2\tilde{\beta}_0\beta_1 SD_x\sum_i Z_{x_i}
- 2\beta_1 SD_x SD_y\sum_i Z_{x_i} Z_{y_i}. \qquad (9)$$

Recall that the Z-scores are constructed to have zero mean and unit standard deviation. It follows that

$$\sum_{i=1}^n Z_{x_i} = \sum_{i=1}^n Z_{y_i} = 0, \qquad \sum_{i=1}^n Z_{x_i}^2 = \sum_{i=1}^n Z_{y_i}^2 = n,$$

and from the definition of correlation we have

$$\sum_{i=1}^n Z_{x_i} Z_{y_i} = n\, r_{xy}.$$

Thus the SSE in equation (9) can be written as

$$SSE = n\left(SD_y^2 + \beta_1^2 SD_x^2 + \tilde{\beta}_0^2 - 2\beta_1 r_{xy} SD_x SD_y\right)
= n\left[\tilde{\beta}_0^2 + (SD_x\beta_1 - r_{xy} SD_y)^2\right] + n(1 - r_{xy}^2) SD_y^2
= n\left[(\bar{y} - \beta_0 - \beta_1\bar{x})^2 + (SD_x\beta_1 - r_{xy} SD_y)^2\right] + n(1 - r_{xy}^2) SD_y^2. \qquad (10)$$

Note that we have re-expressed $\tilde{\beta}_0$ in terms of $\beta_0$ and $\beta_1$ in the last step. Remember that our goal is to find $\beta_0$ and $\beta_1$ to minimize SSE. The quantity inside the square bracket in equation (10) is a sum of two squares and contains the variables $\beta_0$ and $\beta_1$, whereas $n(1 - r_{xy}^2) SD_y^2$ is a fixed quantity. Therefore, we conclude that

$$SSE \geq n(1 - r_{xy}^2) SD_y^2,$$

and the equality holds if and only if

$$\bar{y} - \beta_0 - \beta_1\bar{x} = 0 \quad \text{and} \quad SD_x\beta_1 - r_{xy} SD_y = 0.$$

Therefore, the values of $\beta_0$ and $\beta_1$ that minimize SSE are

$$\beta_1 = r_{xy}\frac{SD_y}{SD_x}, \qquad \beta_0 = \bar{y} - \beta_1\bar{x}, \qquad (11)$$

and the minimum SSE is

$$SSE = n(1 - r_{xy}^2) SD_y^2. \qquad (12)$$
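Equation (11) is easy to verify numerically. A small sketch with made-up data follows; np.polyfit is used only as an independent cross-check and is not part of the note's derivation:

```python
import numpy as np

x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
y = np.array([1.0, 1.8, 2.1, 3.2, 3.9])

r_xy = np.corrcoef(x, y)[0, 1]
beta1 = r_xy * y.std() / x.std()        # equation (11); std() uses the 1/n convention
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)
print(np.polyfit(x, y, 1))              # [slope, intercept] from NumPy agrees

sse = ((y - beta0 - beta1 * x) ** 2).sum()
print(sse, len(x) * (1 - r_xy ** 2) * y.var())   # equation (12): the two should match
```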

How are equations (11) related to equations (5) derived from calculus? We are going to prove that they are the same thing. Using equations (11) we have

$$\sum_{i=1}^n \epsilon_i = \sum_{i=1}^n (y_i - \hat{y}_i) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)
= \sum_{i=1}^n (y_i - \bar{y} + \beta_1\bar{x} - \beta_1 x_i)
= \sum_{i=1}^n (y_i - \bar{y}) - \beta_1\sum_{i=1}^n (x_i - \bar{x}) = 0.$$

If the last step is not obvious to you, here is a simple proof:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i \;\Rightarrow\; \sum_{i=1}^n x_i = n\bar{x} \;\Rightarrow\; \sum_{i=1}^n (x_i - \bar{x}) = 0,$$

and similarly $\sum_{i=1}^n (y_i - \bar{y}) = 0$.

Now let's look at the second equation of (5):

$$\sum_{i=1}^n (x_i - \bar{x})(y_i - \hat{y}_i)
= \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y} + \beta_1\bar{x} - \beta_1 x_i)
= \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) - \beta_1\sum_{i=1}^n (x_i - \bar{x})^2.$$

From the definitions of correlation and standard deviation, we have

$$r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n\, SD_x\, SD_y} \;\Rightarrow\; \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = n\, r_{xy}\, SD_x\, SD_y,$$

$$SD_x^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2 \;\Rightarrow\; \sum_{i=1}^n (x_i - \bar{x})^2 = n\, SD_x^2.$$

Thus,

$$\sum_{i=1}^n (x_i - \bar{x})(y_i - \hat{y}_i) = n\, r_{xy}\, SD_x\, SD_y - \beta_1 n\, SD_x^2
= n\, SD_x^2\left(r_{xy}\frac{SD_y}{SD_x} - \beta_1\right) = 0.$$

Therefore, we have proved that equations (5) follow from equations (11). On the other hand, equations (5) are two linear equations for $\beta_0$ and $\beta_1$. We can solve for $\beta_0$ and $\beta_1$ from these equations, which we will do in Section 1.7. The result is that $\beta_0$ and $\beta_1$ are given by equations (11), and so the two sets of equations are equivalent.

1.4 Interpretation

We will try to give you an intuition of the meaning of equations (5) below. We can rewrite the first equation of (5) as

$$\bar{\epsilon} = 0. \qquad (13)$$

That is to say that the mean of $\epsilon_i$ vanishes. The second equation of (5) means that

$$\sum_{i=1}^n (x_i - \bar{x})\epsilon_i = \sum_{i=1}^n (x_i - \bar{x})(\epsilon_i - \bar{\epsilon}) = 0
\;\Rightarrow\; n\, SD_x\, SD_\epsilon\, r_{x\epsilon} = 0 \;\Rightarrow\; r_{x\epsilon} = 0.$$

That is to say that $x_i$ and $\epsilon_i$ are uncorrelated. Thus the least square prescription is to find $\beta_0$ and $\beta_1$ that make the residuals have zero mean and be uncorrelated with $x_i$. Let's illustrate how this can be done graphically. Suppose we have the following data set of $x_i$ and $y_i$:

[Figure: scatter plot of the data points $y$ versus $x$, for $x$ between 0 and 1.]

We want to fit a straight line $y = \beta_0 + \beta_1 x$ to the data points. We know the line has to make the residuals have zero mean and be uncorrelated with $x_i$. First, let's imagine choosing various values of $\beta_1$ and seeing what the plot of $y - \beta_1 x$ versus $x$ looks like. In general, we will get a plot similar to the one above but with a different overall slope. If we pick the right value of $\beta_1$, we will see that the plot of the residual $y - \beta_1 x$ versus $x$ is flat:

[Figure: scatter plot of $y - \beta_1 x$ versus $x$; the points show no trend with $x$.]

We have now achieved the first goal: by choosing the right value of $\beta_1$, the residual $y - \beta_1 x$ is uncorrelated with $x$. However, the residuals do not have zero mean. So next we want to choose $\beta_0$ so that $y - \beta_0 - \beta_1 x$ has zero mean.

When the right value of $\beta_0$ is chosen, the residuals become this:

[Figure: scatter plot of $\epsilon = y - \beta_0 - \beta_1 x$ versus $x$; the points are flat and centered on zero.]

We have accomplished our goal. The residuals now have zero mean and are uncorrelated with $x_i$. The resulting regression line $y = \beta_0 + \beta_1 x$ is the straight line that best fits the data points. The graph below shows the regression line (blue), the data points (black) and the residuals (red).

[Figure: the data points with the fitted regression line and the residuals drawn as vertical segments.]
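The two-step procedure just described translates directly into code: first pick $\beta_1$ to decorrelate the residual from $x$, then pick $\beta_0$ to remove its mean. Here is a small sketch with synthetic data of our own (not the data set shown in the figures):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = 0.5 + 2.5 * x + rng.normal(scale=0.3, size=50)   # made-up data

# Step 1: choose beta1 so that y - beta1*x is uncorrelated with x.
beta1 = np.corrcoef(x, y)[0, 1] * y.std() / x.std()
print(np.corrcoef(x, y - beta1 * x)[0, 1])           # ~0: decorrelated

# Step 2: choose beta0 so that the residual has zero mean.
beta0 = np.mean(y - beta1 * x)
eps = y - beta0 - beta1 * x
print(eps.mean(), np.corrcoef(x, eps)[0, 1])         # both ~0
```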

We have outlined the idea of linear regression. Next we introduce the concept of a vector, which proves to be very useful for regression with more than one variable.

1.5 Vectors

An $n$-dimensional vector $\mathbf{V}$ contains $n$ real numbers $v_1, v_2, \ldots, v_n$, often written in the form

$$\mathbf{V} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}.$$

The numbers $v_1, v_2, \ldots, v_n$ are called the components of $\mathbf{V}$. Here we adopt the convention of writing vector variables in boldface to distinguish them from numbers. It is useful to define a special vector $\mathbf{X}_0$ whose components are all equal to 1. That is,

$$\mathbf{X}_0 = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}. \qquad (14)$$

The scalar product of two vectors $\mathbf{U}$ and $\mathbf{V}$ is defined as

$$\mathbf{U}\cdot\mathbf{V} = \mathbf{V}\cdot\mathbf{U} = \sum_{i=1}^n u_i v_i,$$

where $u_i$ and $v_i$ are the components of $\mathbf{U}$ and $\mathbf{V}$. Note that the result of the scalar product of two vectors is a number, not a vector. Two vectors $\mathbf{U}$ and $\mathbf{V}$ are said to be orthogonal if and only if $\mathbf{U}\cdot\mathbf{V} = 0$. It follows from the definition that $\mathbf{V}\cdot\mathbf{V} = \sum_i v_i^2 \geq 0$, with $\mathbf{V}\cdot\mathbf{V} = 0$ if and only if all components of $\mathbf{V}$ are 0. It follows from the Cauchy-Schwarz inequality that

$$(\mathbf{U}\cdot\mathbf{V})^2 \leq (\mathbf{U}\cdot\mathbf{U})(\mathbf{V}\cdot\mathbf{V}),$$

with equality holding if and only if $\mathbf{U} = k\mathbf{V}$ for some constant $k$. Now that we have introduced the basic concept of a vector, we are ready to express the regression equations in vector form.

1.6 Regression Equations in Vector Form

Equation (2) can be written as

$$\mathbf{Y} = \hat{\mathbf{Y}} + \boldsymbol{\epsilon} = \beta_0\mathbf{X}_0 + \beta_1\mathbf{X} + \boldsymbol{\epsilon}. \qquad (15)$$

The sum of squares of the residuals SSE in (3) can be written as

$$SSE = \boldsymbol{\epsilon}\cdot\boldsymbol{\epsilon} = (\mathbf{Y} - \hat{\mathbf{Y}})\cdot(\mathbf{Y} - \hat{\mathbf{Y}}) = (\mathbf{Y} - \beta_0\mathbf{X}_0 - \beta_1\mathbf{X})\cdot(\mathbf{Y} - \beta_0\mathbf{X}_0 - \beta_1\mathbf{X}). \qquad (16)$$

To minimize SSE, we set the derivatives of SSE with respect to $\beta_0$ and $\beta_1$ to 0. The resulting equations are

$$\mathbf{X}_0\cdot(\mathbf{Y} - \beta_0\mathbf{X}_0 - \beta_1\mathbf{X}) = 0, \qquad \mathbf{X}\cdot(\mathbf{Y} - \beta_0\mathbf{X}_0 - \beta_1\mathbf{X}) = 0,$$

or

$$\mathbf{X}_0\cdot\boldsymbol{\epsilon} = \mathbf{X}\cdot\boldsymbol{\epsilon} = 0. \qquad (17)$$

This is equation (4) written in vector form. We see that the least square prescription can be interpreted as finding $\beta_0$ and $\beta_1$ to make the residual vector $\boldsymbol{\epsilon}$ orthogonal to both $\mathbf{X}_0$ and $\mathbf{X}$.
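The orthogonality conditions (17) are easy to check numerically. In the sketch below (our own illustrative code), the least-squares fit is obtained with np.linalg.lstsq on the matrix whose columns are $\mathbf{X}_0$ and $\mathbf{X}$, and the residual vector is then verified to be orthogonal to both columns:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=50)
y = 0.5 + 2.5 * x + rng.normal(scale=0.3, size=50)

X0 = np.ones_like(x)                       # the all-ones vector, equation (14)
A = np.column_stack([X0, x])               # columns X0 and X
(beta0, beta1), *_ = np.linalg.lstsq(A, y, rcond=None)

eps = y - (beta0 * X0 + beta1 * x)         # residual vector
print(X0 @ eps, x @ eps)                   # both ~0: eps is orthogonal to X0 and X
```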

The mean and standard deviation in (6) can be expressed as

$$\bar{u} = \frac{1}{n}\,\mathbf{X}_0\cdot\mathbf{U}, \qquad SD_u = \sqrt{\frac{1}{n}(\mathbf{U} - \bar{u}\mathbf{X}_0)\cdot(\mathbf{U} - \bar{u}\mathbf{X}_0)}, \qquad (18)$$

and the Z score in (7) becomes

$$\mathbf{Z}_u = \frac{\mathbf{U} - \bar{u}\mathbf{X}_0}{SD_u}. \qquad (19)$$

It follows that

$$\mathbf{Z}_u\cdot\mathbf{Z}_u = n. \qquad (20)$$

The correlation $r_{xy}$ in (8) is proportional to the scalar product of $\mathbf{Z}_x$ and $\mathbf{Z}_y$:

$$r_{xy} = \frac{1}{n}\,\mathbf{Z}_x\cdot\mathbf{Z}_y. \qquad (21)$$

The equation $r_{x\epsilon} = 0$ is equivalent to

$$\mathbf{Z}_x\cdot\boldsymbol{\epsilon} = 0. \qquad (22)$$

We see that the equations in vector form are simpler and more elegant. We are now ready to solve the regression equations (15) and (17) for $\beta_0$ and $\beta_1$.

1.7 Regression Coefficients

In this section, we are going to use vector algebra to solve for $\beta_0$ and $\beta_1$ and re-derive equations (11). This is a useful exercise because it can easily be generalized to the multiple regression we consider later. Start with equation (15). Taking the scalar product of (15) with $\mathbf{X}_0$, using equations (18), (17) and $\mathbf{X}_0\cdot\mathbf{X}_0 = n$, we obtain

$$n\bar{y} = n\beta_0 + n\beta_1\bar{x} \;\Rightarrow\; \bar{y} = \beta_0 + \beta_1\bar{x}. \qquad (23)$$

Multiplying both sides of the above equation by $\mathbf{X}_0$ gives

$$\bar{y}\mathbf{X}_0 = \beta_0\mathbf{X}_0 + \beta_1\bar{x}\mathbf{X}_0.$$

Subtracting the above equation from (15) gives

$$\mathbf{Y} - \bar{y}\mathbf{X}_0 = \beta_1(\mathbf{X} - \bar{x}\mathbf{X}_0) + \boldsymbol{\epsilon}.$$

It follows from equation (19) that $\mathbf{Y} - \bar{y}\mathbf{X}_0 = SD_y\mathbf{Z}_y$ and $\mathbf{X} - \bar{x}\mathbf{X}_0 = SD_x\mathbf{Z}_x$. Hence we have

$$SD_y\mathbf{Z}_y = \beta_1 SD_x\mathbf{Z}_x + \boldsymbol{\epsilon}.$$

Taking the scalar product with $\mathbf{Z}_x$ and using $\mathbf{Z}_x\cdot\boldsymbol{\epsilon} = 0$, we obtain

$$SD_y\,\mathbf{Z}_x\cdot\mathbf{Z}_y = \beta_1 SD_x\,\mathbf{Z}_x\cdot\mathbf{Z}_x.$$

It follows from equations (21) and (20) that $\mathbf{Z}_x\cdot\mathbf{Z}_y = n r_{xy}$ and $\mathbf{Z}_x\cdot\mathbf{Z}_x = n$, and so the above equation reduces to

$$r_{xy} SD_y = \beta_1 SD_x \;\Rightarrow\; \beta_1 = r_{xy}\frac{SD_y}{SD_x}.$$

Combining the above equation with (23), we finally solve for $\beta_0$ and $\beta_1$:

$$\beta_1 = r_{xy}\frac{SD_y}{SD_x}, \qquad \beta_0 = \bar{y} - \beta_1\bar{x}. \qquad (24)$$

The regression line is the straight line with slope $r_{xy}\, SD_y/SD_x$ passing through the point $(\bar{x}, \bar{y})$.

1.8 Root Mean Square Error

Having found the straight line that best fits the data points, we next want to know how good the fit is. One way to characterize the goodness of the fit is to calculate the root mean square error RMSE, defined as

$$RMSE = \sqrt{\frac{SSE}{n}} = \sqrt{\frac{\boldsymbol{\epsilon}\cdot\boldsymbol{\epsilon}}{n}}. \qquad (25)$$

It follows from equation (12) that $RMSE = \sqrt{1 - r_{xy}^2}\, SD_y$. Here we derive it again using vector algebra, since that derivation can easily be generalized to multiple regression. We introduce two quantities, SST and SSM. The total sum of squares SST is defined as

$$SST = \sum_{i=1}^n (y_i - \bar{y})^2 = (\mathbf{Y} - \bar{y}\mathbf{X}_0)\cdot(\mathbf{Y} - \bar{y}\mathbf{X}_0). \qquad (26)$$

It follows from the definition of standard deviation [see equation (6) or (18)] that

$$SST = n\, SD_y^2. \qquad (27)$$

The sum of squares predicted by the linear model, SSM, is defined as

$$SSM = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = (\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0)\cdot(\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0). \qquad (28)$$

Using $\hat{\mathbf{Y}} = \beta_0\mathbf{X}_0 + \beta_1\mathbf{X}$ and $\beta_0 = \bar{y} - \beta_1\bar{x}$, we have

$$\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0 = (\bar{y} - \beta_1\bar{x})\mathbf{X}_0 + \beta_1\mathbf{X} - \bar{y}\mathbf{X}_0 = \beta_1(\mathbf{X} - \bar{x}\mathbf{X}_0) = \beta_1 SD_x\mathbf{Z}_x, \qquad (29)$$

where we have used equation (19) to obtain the last equality. Combining (28), (29), (20), (24) and (27), we obtain

$$SSM = \beta_1^2 SD_x^2\,\mathbf{Z}_x\cdot\mathbf{Z}_x = n\left(r_{xy}\frac{SD_y}{SD_x}\right)^2 SD_x^2 = n\, r_{xy}^2\, SD_y^2 = r_{xy}^2\, SST. \qquad (30)$$

Next we want to prove the identity $SST = SSM + SSE$. To see that, we start with the identity

$$\mathbf{Y} - \bar{y}\mathbf{X}_0 = (\mathbf{Y} - \hat{\mathbf{Y}}) + (\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0) = \boldsymbol{\epsilon} + (\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0).$$

We then square both sides by taking the scalar product with itself:

$$(\mathbf{Y} - \bar{y}\mathbf{X}_0)\cdot(\mathbf{Y} - \bar{y}\mathbf{X}_0) = [\boldsymbol{\epsilon} + (\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0)]\cdot[\boldsymbol{\epsilon} + (\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0)].$$

The left-hand side is SST; the right-hand side is the sum of SSE, SSM and a cross term:

$$SST = \boldsymbol{\epsilon}\cdot\boldsymbol{\epsilon} + (\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0)\cdot(\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0) + 2\boldsymbol{\epsilon}\cdot(\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0) = SSE + SSM + 2\boldsymbol{\epsilon}\cdot(\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0).$$

It follows from (29) and $\mathbf{Z}_x\cdot\boldsymbol{\epsilon} = 0$ that the cross term vanishes:

$$\boldsymbol{\epsilon}\cdot(\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0) = \beta_1 SD_x\,\mathbf{Z}_x\cdot\boldsymbol{\epsilon} = 0.$$

Another way of seeing this is to note that $\boldsymbol{\epsilon}$ is orthogonal to both $\mathbf{X}_0$ and $\mathbf{X}$, as required by the least square prescription (17). So $\boldsymbol{\epsilon}$ is orthogonal to any linear combination of $\mathbf{X}_0$ and $\mathbf{X}$. Since $\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0$ is a linear combination of $\mathbf{X}_0$ and $\mathbf{X}$, it is orthogonal to $\boldsymbol{\epsilon}$. We have just proved that

$$SST = SSM + SSE. \qquad (31)$$

We have calculated above that $SSM = r_{xy}^2\, SST$ and $SST = n\, SD_y^2$. Using the identity $SST = SSM + SSE$, we obtain $SSE = SST - SSM = (1 - r_{xy}^2)SST = n(1 - r_{xy}^2)SD_y^2$. Combining this equation with the definition of RMSE in (25), we find

$$RMSE = \sqrt{1 - r_{xy}^2}\, SD_y. \qquad (32)$$
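A quick numerical check of (31) and (32), using the same kind of made-up data as before (our own sketch, not part of the note):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=100)
y = 0.5 + 2.5 * x + rng.normal(scale=0.3, size=100)

r = np.corrcoef(x, y)[0, 1]
beta1 = r * y.std() / x.std()
beta0 = y.mean() - beta1 * x.mean()
yhat = beta0 + beta1 * x
eps = y - yhat

sst = ((y - y.mean()) ** 2).sum()
ssm = ((yhat - y.mean()) ** 2).sum()
sse = (eps ** 2).sum()
print(sst, ssm + sse)                                          # identity (31)
print(np.sqrt(sse / len(x)), np.sqrt(1 - r ** 2) * y.std())    # equation (32)
```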

2 Multiple Regression

Now we want to generalize the results of simple regression to multiple regression. We will first consider the case with two variables in Section 2.1, because simple analytic expressions exist in this case and the derivation is relatively straightforward. Then we will turn to the more general case with more than two variables in Section 2.2. Finally, we will derive a general result for the RMSE calculation in Section 2.4.

2.1 Two Variables

Suppose we want to fit the data points $\{y_i\}$ with two variables $\{x_{1i}\}$ and $\{x_{2i}\}$ by a linear model with

$$y_i = \hat{y}_i + \epsilon_i, \quad i = 1, 2, \ldots, n, \qquad \hat{y}_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}.$$

The least square prescription is again to find $\beta_0$, $\beta_1$ and $\beta_2$ to minimize the sum of squares of the residuals:

$$SSE = \sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2.$$

As in the case of simple regression, it is more convenient to rewrite the above equations in vector form as follows:

$$\mathbf{Y} = \hat{\mathbf{Y}} + \boldsymbol{\epsilon} = \beta_0\mathbf{X}_0 + \beta_1\mathbf{X}_1 + \beta_2\mathbf{X}_2 + \boldsymbol{\epsilon}, \qquad (33)$$

$$\hat{\mathbf{Y}} = \beta_0\mathbf{X}_0 + \beta_1\mathbf{X}_1 + \beta_2\mathbf{X}_2, \qquad (34)$$

$$SSE = \boldsymbol{\epsilon}\cdot\boldsymbol{\epsilon} = (\mathbf{Y} - \beta_0\mathbf{X}_0 - \beta_1\mathbf{X}_1 - \beta_2\mathbf{X}_2)\cdot(\mathbf{Y} - \beta_0\mathbf{X}_0 - \beta_1\mathbf{X}_1 - \beta_2\mathbf{X}_2). \qquad (35)$$

To employ the least square prescription, we set the derivatives of SSE with respect to $\beta_0$, $\beta_1$ and $\beta_2$ to 0. The resulting equations are

$$\mathbf{X}_0\cdot\boldsymbol{\epsilon} = \mathbf{X}_1\cdot\boldsymbol{\epsilon} = \mathbf{X}_2\cdot\boldsymbol{\epsilon} = 0. \qquad (36)$$

That is to say that $\boldsymbol{\epsilon}$ is orthogonal to $\mathbf{X}_0$, $\mathbf{X}_1$ and $\mathbf{X}_2$. In other words, $\hat{\mathbf{Y}}$ is the vector $\mathbf{Y}$ projected onto the vector space spanned by $\mathbf{X}_0$, $\mathbf{X}_1$ and $\mathbf{X}_2$. For convenience, we denote by $\mathbf{Z}_1$ and $\mathbf{Z}_2$ the Z-score vectors associated with $\mathbf{X}_1$ and $\mathbf{X}_2$, respectively. That is,

$$\mathbf{Z}_1 = \frac{\mathbf{X}_1 - \bar{x}_1\mathbf{X}_0}{SD_1}, \qquad \mathbf{Z}_2 = \frac{\mathbf{X}_2 - \bar{x}_2\mathbf{X}_0}{SD_2}, \qquad (37)$$

where $SD_1$ and $SD_2$ are the standard deviations of $\{x_{1i}\}$ and $\{x_{2i}\}$, respectively. We see that $\mathbf{Z}_1$ is a linear combination of $\mathbf{X}_0$ and $\mathbf{X}_1$, and $\mathbf{Z}_2$ is a linear combination of $\mathbf{X}_0$ and $\mathbf{X}_2$. Since $\boldsymbol{\epsilon}$ is orthogonal to $\mathbf{X}_0$, $\mathbf{X}_1$ and $\mathbf{X}_2$, it is orthogonal to $\mathbf{Z}_1$ and $\mathbf{Z}_2$ as well:

$$\mathbf{Z}_1\cdot\boldsymbol{\epsilon} = \mathbf{Z}_2\cdot\boldsymbol{\epsilon} = 0. \qquad (38)$$

Recall that the mean of a vector $\mathbf{U}$ is $\bar{u} = \mathbf{X}_0\cdot\mathbf{U}/n$, and the correlation between $\mathbf{U}$ and $\mathbf{V}$ is $r_{uv} = \mathbf{Z}_u\cdot\mathbf{Z}_v/n$. Thus the orthogonality conditions on $\boldsymbol{\epsilon}$ mean that (1) $\boldsymbol{\epsilon}$ has zero mean, $\bar{\epsilon} = 0$, and (2) $\boldsymbol{\epsilon}$ is uncorrelated with both $\mathbf{X}_1$ and $\mathbf{X}_2$: $r_{1\epsilon} = r_{2\epsilon} = 0$.

To solve the regression equations (36), we take the scalar product of equation (33) with $\mathbf{X}_0$, resulting in the equation

$$\bar{y} = \beta_0 + \beta_1\bar{x}_1 + \beta_2\bar{x}_2. \qquad (39)$$

This means that the regression plane passes through the average point $(\bar{x}_1, \bar{x}_2, \bar{y})$. Multiplying equation (39) by $\mathbf{X}_0$ yields

$$\bar{y}\mathbf{X}_0 = \beta_0\mathbf{X}_0 + \beta_1\bar{x}_1\mathbf{X}_0 + \beta_2\bar{x}_2\mathbf{X}_0.$$

Subtracting the above equation from (33) gives

$$\mathbf{Y} - \bar{y}\mathbf{X}_0 = \beta_1(\mathbf{X}_1 - \bar{x}_1\mathbf{X}_0) + \beta_2(\mathbf{X}_2 - \bar{x}_2\mathbf{X}_0) + \boldsymbol{\epsilon}.$$

Using the definition of the Z score vector, we can write the above equation as

$$SD_y\mathbf{Z}_y = \beta_1 SD_1\mathbf{Z}_1 + \beta_2 SD_2\mathbf{Z}_2 + \boldsymbol{\epsilon}. \qquad (40)$$

Taking the scalar product of the above equation with $\mathbf{Z}_1$ and dividing by $n$ gives

$$r_{y1} SD_y = \beta_1 SD_1 + \beta_2 SD_2 r_{12}, \qquad (41)$$

where $r_{y1} = \mathbf{Z}_y\cdot\mathbf{Z}_1/n$ is the correlation between $\mathbf{Y}$ and $\mathbf{X}_1$, and $r_{12} = \mathbf{Z}_1\cdot\mathbf{Z}_2/n$ is the correlation between $\mathbf{X}_1$ and $\mathbf{X}_2$. Equation (41) can be written as

$$\beta_1 = r_{y1}\frac{SD_y}{SD_1} - \beta_2 r_{12}\frac{SD_2}{SD_1} = \beta_{y1} - \beta_2\beta_{21}, \qquad (42)$$

where $\beta_{y1} = r_{y1} SD_y/SD_1$ is the slope in the simple regression for predicting $\mathbf{Y}$ from $\mathbf{X}_1$, and $\beta_{21} = r_{12} SD_2/SD_1$ is the slope in the simple regression for predicting $\mathbf{X}_2$ from $\mathbf{X}_1$.

Before deriving the final expression for $\beta_1$, let's point out two things. First, if $r_{12} = 0$ (or equivalently $\mathbf{Z}_1\cdot\mathbf{Z}_2 = 0$), then $\beta_1 = \beta_{y1}$. That is to say that if $\mathbf{X}_1$ and $\mathbf{X}_2$ are uncorrelated (orthogonal), the slope $\beta_1$ in the multiple regression is exactly the same as the slope $\beta_{y1}$ in the simple regression for predicting $\mathbf{Y}$ from $\mathbf{X}_1$. The addition of $\mathbf{X}_2$ does not change the slope. Second, if $r_{12} > 0$ and $\beta_2 > 0$, then $\beta_1 < \beta_{y1}$. The slope $\beta_1$ decreases if $\mathbf{X}_1$ and $\mathbf{X}_2$ are positively correlated and the slope of $\mathbf{X}_2$ in the multiple regression is positive.

Let's go back to equation (42). If $r_{12} \neq 0$, the equation for $\beta_1$ involves $\beta_2$, so $\beta_1$ and $\beta_2$ have to be solved together. Since $\mathbf{X}_1$ and $\mathbf{X}_2$ are symmetric, the equation for $\beta_2$ can be obtained by simply exchanging the indices 1 and 2 in the $\beta_1$ equation:

$$\beta_2 = \beta_{y2} - \beta_1\beta_{12}, \qquad (43)$$

where

$$\beta_{y2} = r_{y2}\frac{SD_y}{SD_2} \quad \text{and} \quad \beta_{12} = r_{12}\frac{SD_1}{SD_2}. \qquad (44)$$

Substituting $\beta_2$ from (43) into (42), we obtain

$$\beta_1 = \beta_{y1} - (\beta_{y2} - \beta_1\beta_{12})\beta_{21} = \beta_{y1} - \beta_{y2}\beta_{21} + \beta_1\beta_{12}\beta_{21}. \qquad (45)$$

Note that the product

$$\beta_{12}\beta_{21} = \left(r_{12}\frac{SD_1}{SD_2}\right)\left(r_{12}\frac{SD_2}{SD_1}\right) = r_{12}^2.$$

Thus equation (45) becomes

$$\beta_1 = \beta_{y1} - \beta_{y2}\beta_{21} + r_{12}^2\beta_1 \quad \text{or} \quad (1 - r_{12}^2)\beta_1 = \beta_{y1} - \beta_{y2}\beta_{21}.$$

Using the expressions for $\beta_{y1}$, $\beta_{y2}$ and $\beta_{21}$, we find

$$\beta_1 = b_1\frac{SD_y}{SD_1}, \qquad (46)$$

where

$$b_1 = \frac{r_{y1} - r_{y2}\, r_{12}}{1 - r_{12}^2}.$$

The quantity $b_1$ may be interpreted as the adjusted correlation between $\mathbf{Y}$ and $\mathbf{X}_1$, taking into account the presence of $\mathbf{X}_2$. The expression for $\beta_2$ is obtained by exchanging the indices 1 and 2:

$$\beta_2 = b_2\frac{SD_y}{SD_2}, \qquad b_2 = \frac{r_{y2} - r_{y1}\, r_{12}}{1 - r_{12}^2}.$$

Gathering all the results, we conclude that the regression coefficients are given by

$$\beta_1 = b_1\frac{SD_y}{SD_1}, \qquad \beta_2 = b_2\frac{SD_y}{SD_2}, \qquad \beta_0 = \bar{y} - \beta_1\bar{x}_1 - \beta_2\bar{x}_2, \qquad (47)$$

with

$$b_1 = \frac{r_{y1} - r_{y2}\, r_{12}}{1 - r_{12}^2}, \qquad b_2 = \frac{r_{y2} - r_{y1}\, r_{12}}{1 - r_{12}^2}. \qquad (48)$$

In Section 1.8, we saw that the RMSE in simple regression is related to $r_{xy}$ by $RMSE = \sqrt{1 - r_{xy}^2}\, SD_y$. In multiple regression, we will show in Section 2.4 that the formula generalizes to

$$RMSE = \sqrt{1 - R^2}\, SD_y,$$

where $R$ is the correlation between $\mathbf{Y}$ and $\hat{\mathbf{Y}}$:

$$R = \frac{1}{n}\,\mathbf{Z}_y\cdot\mathbf{Z}_{\hat{y}}.$$

We will defer the calculation of $R$ to Section 2.3. In the case of multiple regression with two variables considered here, $R$ is given by

$$R^2 = \frac{r_{y1}^2 + r_{y2}^2 - 2 r_{y1} r_{y2} r_{12}}{1 - r_{12}^2}. \qquad (49)$$
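Here is a small numerical illustration of equations (47) and (48), using synthetic data of our own; the cross-check against np.linalg.lstsq is not part of the note's derivation:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)             # deliberately correlated with x1
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

def corr(u, v):
    return np.corrcoef(u, v)[0, 1]

ry1, ry2, r12 = corr(y, x1), corr(y, x2), corr(x1, x2)
b1 = (ry1 - ry2 * r12) / (1 - r12 ** 2)        # equation (48)
b2 = (ry2 - ry1 * r12) / (1 - r12 ** 2)
beta1 = b1 * y.std() / x1.std()                # equation (47)
beta2 = b2 * y.std() / x2.std()
beta0 = y.mean() - beta1 * x1.mean() - beta2 * x2.mean()
print(beta0, beta1, beta2)

A = np.column_stack([np.ones(n), x1, x2])      # direct least-squares cross-check
print(np.linalg.lstsq(A, y, rcond=None)[0])
```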

2.2 More Than Two Variables

Suppose we now want to fit $\{y_i\}$ with $p$ variables $\{x_{1i}, x_{2i}, \ldots, x_{pi}\}$. The regression equation has $p+1$ parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$. It can be written in vector form as

$$\mathbf{Y} = \hat{\mathbf{Y}} + \boldsymbol{\epsilon} = \sum_{j=0}^p \beta_j\mathbf{X}_j + \boldsymbol{\epsilon}, \qquad (50)$$

$$\hat{\mathbf{Y}} = \sum_{j=0}^p \beta_j\mathbf{X}_j, \qquad (51)$$

$$SSE = \boldsymbol{\epsilon}\cdot\boldsymbol{\epsilon} = \left(\mathbf{Y} - \sum_{j=0}^p \beta_j\mathbf{X}_j\right)\cdot\left(\mathbf{Y} - \sum_{j=0}^p \beta_j\mathbf{X}_j\right). \qquad (52)$$

To minimize SSE, we set the derivatives of SSE with respect to $\beta_j$ ($j = 0, 1, \ldots, p$) to 0. The resulting equations can be written as

$$\mathbf{X}_j\cdot\boldsymbol{\epsilon} = 0, \qquad j = 0, 1, \ldots, p. \qquad (53)$$

This means that $\boldsymbol{\epsilon}$ is orthogonal to all of $\mathbf{X}_0, \mathbf{X}_1, \ldots, \mathbf{X}_p$, or equivalently, the mean of $\boldsymbol{\epsilon}$ is 0 and $\boldsymbol{\epsilon}$ is uncorrelated with any of the variables we are trying to fit.

To find the regression coefficients, we follow the same procedure as before. First, take the scalar product of equation (50) with $\mathbf{X}_0$. The result is

$$\bar{y} = \beta_0 + \sum_{j=1}^p \beta_j\bar{x}_j. \qquad (54)$$

As before, this means that the fitted hyperplane passes through the point of averages. Next, we multiply the above equation by $\mathbf{X}_0$:

$$\bar{y}\mathbf{X}_0 = \beta_0\mathbf{X}_0 + \sum_{j=1}^p \beta_j\bar{x}_j\mathbf{X}_0,$$

and then subtract it from equation (50):

$$\mathbf{Y} - \bar{y}\mathbf{X}_0 = \sum_{j=1}^p \beta_j(\mathbf{X}_j - \bar{x}_j\mathbf{X}_0) + \boldsymbol{\epsilon}.$$

Using the definition of the Z score vector (19), we can write the above equation as

$$SD_y\mathbf{Z}_y = \sum_{j=1}^p \beta_j SD_j\mathbf{Z}_j + \boldsymbol{\epsilon}.$$

Dividing both sides by $SD_y$ results in

$$\mathbf{Z}_y = \sum_{j=1}^p b_j\mathbf{Z}_j + \frac{\boldsymbol{\epsilon}}{SD_y}, \qquad (55)$$

where

$$b_j = \beta_j\frac{SD_j}{SD_y}. \qquad (56)$$

Taking the scalar product of equation (55) with $\mathbf{Z}_i$ ($i = 1, 2, \ldots, p$) and dividing by $n$, we obtain

$$\sum_{j=1}^p r_{ij} b_j = r_{yi}, \qquad i = 1, 2, \ldots, p. \qquad (57)$$

This is a system of linear equations for $b_1, b_2, \ldots, b_p$. Written out, the equations look like

$$\begin{aligned}
b_1 + r_{12} b_2 + r_{13} b_3 + \cdots + r_{1p} b_p &= r_{y1} \\
b_1 r_{21} + b_2 + r_{23} b_3 + \cdots + r_{2p} b_p &= r_{y2} \\
b_1 r_{31} + b_2 r_{32} + b_3 + \cdots + r_{3p} b_p &= r_{y3} \\
&\;\;\vdots \\
b_1 r_{p1} + b_2 r_{p2} + b_3 r_{p3} + \cdots + b_p &= r_{yp}
\end{aligned}$$

If all of the variables $\mathbf{X}_j$ are uncorrelated (i.e. $r_{ij} = 0$ if $i \neq j$), the solution is $b_j = r_{yj}$ and the slopes are $\beta_j = r_{yj}\, SD_y/SD_j$. This is exactly the same as the slopes in simple regression for predicting $\mathbf{Y}$ from $\mathbf{X}_j$. There are no simple analytic expressions for the $b_j$ in general, but there are several well-known procedures for obtaining the solution by successive algebraic operations; we will not discuss those methods here. Suppose all the $b$'s have been solved for using one of those procedures. Then the slopes are given by equation (56) as

$$\beta_j = b_j\frac{SD_y}{SD_j}, \qquad j = 1, 2, \ldots, p, \qquad (58)$$

and the intercept $\beta_0$ is given by equation (54) as

$$\beta_0 = \bar{y} - \sum_{j=1}^p \beta_j\bar{x}_j. \qquad (59)$$
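In practice the system (57) is just a $p \times p$ linear solve on the correlation matrix. A sketch with synthetic data of our own (the lstsq line is only an independent cross-check):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 300, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + 4.0 + rng.normal(size=n)

sd = X.std(axis=0)                            # 1/n convention, as in equation (6)
Z = (X - X.mean(axis=0)) / sd                 # Z-score vectors Z_1, ..., Z_p as columns
Zy = (y - y.mean()) / y.std()

R_xx = Z.T @ Z / n                            # matrix of r_ij
r_y = Z.T @ Zy / n                            # vector of r_yi
b = np.linalg.solve(R_xx, r_y)                # solve system (57)

beta = b * y.std() / sd                       # slopes, equation (58)
beta0 = y.mean() - beta @ X.mean(axis=0)      # intercept, equation (59)
print(beta0, beta)
print(np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)[0])
```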

Finally, we should mention that the term "linear" in linear regression refers to the model being linear in the fitting parameters $\beta_0, \beta_1, \ldots, \beta_p$. For example, we can fit $y_i$ by the model

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3\sqrt{x_i} + \beta_4\ln x_i + \beta_5\frac{x_i^3}{1 + x_i} \qquad (60)$$

using the technique of multiple linear regression, since the model is linear in $\beta_0, \beta_1, \ldots, \beta_5$. We simply label

$$x_{1i} = x_i, \quad x_{2i} = x_i^2, \quad x_{3i} = \sqrt{x_i}, \quad x_{4i} = \ln x_i, \quad x_{5i} = \frac{x_i^3}{1 + x_i},$$

and equation (60) can be written in vector form as

$$\mathbf{Y} = \beta_0\mathbf{X}_0 + \beta_1\mathbf{X}_1 + \beta_2\mathbf{X}_2 + \beta_3\mathbf{X}_3 + \beta_4\mathbf{X}_4 + \beta_5\mathbf{X}_5,$$

which is equation (50) with $p = 5$. The key is to note that the least square prescription is to minimize SSE by varying the parameters $\beta$'s, not the independent variables $x$'s.
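This "linear in the parameters" point is exactly what makes feature transformations cheap in code: we just build the transformed columns and run the same least-squares solve. A sketch with a synthetic example of our own, using columns of the same form as equation (60):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0.5, 3.0, size=100)             # keep x > 0 so sqrt and log are defined
y = 2.0 + 0.7 * x - 0.3 * np.log(x) + rng.normal(scale=0.1, size=100)

# Columns X0, X1, ..., X5 of model (60); nonlinear in x, linear in the betas.
A = np.column_stack([
    np.ones_like(x),       # X0
    x,                     # X1 = x
    x ** 2,                # X2 = x^2
    np.sqrt(x),            # X3 = sqrt(x)
    np.log(x),             # X4 = ln(x)
    x ** 3 / (1 + x),      # X5 = x^3 / (1 + x)
])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # ordinary multiple linear regression
print(beta)
```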

2.3 Correlation Between Y and Ŷ

To generalize the RMSE expression in Section 1.8 to multiple regression, we consider the quantity $R$ defined as the correlation between $\mathbf{Y}$ and $\hat{\mathbf{Y}}$:

$$R = \frac{1}{n}\,\mathbf{Z}_y\cdot\mathbf{Z}_{\hat{y}}. \qquad (61)$$

We first calculate the average of $\hat{\mathbf{Y}}$:

$$\bar{\hat{y}} = \frac{1}{n}\,\hat{\mathbf{Y}}\cdot\mathbf{X}_0 = \frac{1}{n}\left(\beta_0\mathbf{X}_0 + \sum_{j=1}^p \beta_j\mathbf{X}_j\right)\cdot\mathbf{X}_0 = \beta_0 + \sum_{j=1}^p \beta_j\bar{x}_j = \bar{y},$$

where we have used equation (51) for $\hat{\mathbf{Y}}$ and (54) for $\bar{y}$. From the definition of the Z score vector (19) and $\bar{\hat{y}} = \bar{y}$, we can write (61) as

$$R = \frac{(\mathbf{Y} - \bar{y}\mathbf{X}_0)\cdot(\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0)}{n\, SD_y\, SD_{\hat{y}}}
= \frac{(\hat{\mathbf{Y}} + \boldsymbol{\epsilon} - \bar{y}\mathbf{X}_0)\cdot(\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0)}{n\, SD_y\, SD_{\hat{y}}}
= \frac{(\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0)\cdot(\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0)}{n\, SD_y\, SD_{\hat{y}}}
= \frac{SD_{\hat{y}}}{SD_y}, \qquad (62)$$

where we have used $\boldsymbol{\epsilon}\cdot\mathbf{X}_0 = 0$ and $\boldsymbol{\epsilon}\cdot\hat{\mathbf{Y}} = 0$ (since $\hat{\mathbf{Y}}$ is a linear combination of $\mathbf{X}_0, \ldots, \mathbf{X}_p$, and all of these are orthogonal to $\boldsymbol{\epsilon}$). We have also used the definition of the standard deviation (18) to obtain the last equality. Hence we have

$$R^2 = \frac{SD_{\hat{y}}^2}{SD_y^2} = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}, \qquad (63)$$

which is interpreted as the fraction of the variance of $\mathbf{Y}$ explained by the linear model.

To compute $R^2$, we use equation (51) for $\hat{\mathbf{Y}}$ and (59) for $\beta_0$, and write

$$\begin{aligned}
SD_{\hat{y}}^2 &= \frac{1}{n}(\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0)\cdot(\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0)
= \frac{1}{n}\left[(\beta_0 - \bar{y})\mathbf{X}_0 + \sum_i \beta_i\mathbf{X}_i\right]\cdot\left[(\beta_0 - \bar{y})\mathbf{X}_0 + \sum_j \beta_j\mathbf{X}_j\right] \\
&= \frac{1}{n}\left[\sum_i \beta_i(\mathbf{X}_i - \bar{x}_i\mathbf{X}_0)\right]\cdot\left[\sum_j \beta_j(\mathbf{X}_j - \bar{x}_j\mathbf{X}_0)\right]
= \frac{1}{n}\left(\sum_i SD_i\beta_i\mathbf{Z}_i\right)\cdot\left(\sum_j SD_j\beta_j\mathbf{Z}_j\right) \\
&= \frac{1}{n}\sum_i\sum_j SD_i SD_j\beta_i\beta_j\,\mathbf{Z}_i\cdot\mathbf{Z}_j
= \sum_i\sum_j SD_i SD_j\beta_i\beta_j r_{ij},
\end{aligned}$$

and therefore

$$R^2 = \frac{SD_{\hat{y}}^2}{SD_y^2} = \sum_i\sum_j b_i b_j r_{ij},$$

where we have used the definition of $b_j$ in equation (56). The sum can be simplified by noting that the $b_j$ satisfy equation (57), and thus we obtain

$$R^2 = \sum_i b_i r_{yi}. \qquad (64)$$

When the solution for the $b_i$ is obtained, $R^2$ can be calculated using the above equation. In the case of multiple regression with two variables ($p = 2$), $b_1$ and $b_2$ are given by equation (48). Plugging them into equation (64), we obtain equation (49).

2.4 Root Mean Square Error

As in Section 1.8, we define SST and SSM as

$$SST = \sum_{i=1}^n (y_i - \bar{y})^2 = (\mathbf{Y} - \bar{y}\mathbf{X}_0)\cdot(\mathbf{Y} - \bar{y}\mathbf{X}_0),$$

$$SSM = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = (\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0)\cdot(\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0).$$

It follows from the definition of standard deviation that

$$SST = n\, SD_y^2, \qquad (65)$$

and it follows from (63) that

$$SSM = R^2\, SST. \qquad (66)$$

The identity $SST = SSM + SSE$ still holds in multiple linear regression. The proof is almost exactly the same as in Section 1.8.

We start with the identity

$$\mathbf{Y} - \bar{y}\mathbf{X}_0 = (\mathbf{Y} - \hat{\mathbf{Y}}) + (\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0) = \boldsymbol{\epsilon} + (\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0).$$

Take the scalar product with itself:

$$(\mathbf{Y} - \bar{y}\mathbf{X}_0)\cdot(\mathbf{Y} - \bar{y}\mathbf{X}_0) = [\boldsymbol{\epsilon} + (\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0)]\cdot[\boldsymbol{\epsilon} + (\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0)]$$

$$\Rightarrow\; SST = SSE + SSM + 2\boldsymbol{\epsilon}\cdot(\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0).$$

Since $\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0$ is a linear combination of $\mathbf{X}_0, \mathbf{X}_1, \ldots, \mathbf{X}_p$ and $\boldsymbol{\epsilon}$ is orthogonal to all these vectors, $\boldsymbol{\epsilon}\cdot(\hat{\mathbf{Y}} - \bar{y}\mathbf{X}_0) = 0$, and so we have

$$SST = SSM + SSE. \qquad (67)$$

Combining equations (65), (66) and (67), we have $SSE = n(1 - R^2)SD_y^2$. It follows from the definition $RMSE = \sqrt{SSE/n}$ that

$$RMSE = \sqrt{1 - R^2}\, SD_y, \qquad (68)$$

which is the generalization of equation (32).
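A final numerical check of (64), (67) and (68) on a synthetic multiple-regression example (our own sketch, not part of the note):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 500, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 2.0]) + 3.0 + rng.normal(size=n)

A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
yhat = A @ coef
eps = y - yhat

sst = ((y - y.mean()) ** 2).sum()
ssm = ((yhat - y.mean()) ** 2).sum()
sse = (eps ** 2).sum()
print(sst, ssm + sse)                                   # identity (67)

R2 = ssm / sst                                          # equation (63)
rmse = np.sqrt(sse / n)
print(rmse, np.sqrt(1 - R2) * y.std())                  # equation (68)

# Equation (64): R^2 = sum_i b_i r_yi, with b solved from the correlation system (57).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
Zy = (y - y.mean()) / y.std()
b = np.linalg.solve(Z.T @ Z / n, Z.T @ Zy / n)
print(R2, b @ (Z.T @ Zy / n))
```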