Chapter 1. Linear Regression with One Predictor Variable

1.1 Statistical Relation Between Two Variables

To motivate statistical relationships, let us first consider a mathematical relation between two mathematical variables x and y. This may be represented by a functional relation

    y = f(x),    (1)

which says that given a value of x, there is a unique value of y that can be determined exactly.
For example, the relation between the number of hours (x) a car is driven at a constant speed c and the distance (y) travelled may be given by y = cx. There are many examples of such relations in the physical and other sciences; they are known as deterministic or exact relationships. To define a statistical relationship, we replace the mathematical variables by random variables X and Y and add a random error component ε representing the deviation from the true relation:

    y = f(x) + ε.    (2)
Here (x, y) represents a typical value of the bivariate random variable (X, Y). Such a relation is also known as a stochastic relation; it models random phenomena in which (i) the Y values tend to vary around a smooth function of x and (ii) there is a random scatter of points around this systematic component. Figure 1.1 presents a plot of the heights and weights of 23 students enrolled in last year's STAT 360 class (the data are given in Table 1.1).
This graph shows the tendency of the data to vary around a straight line; this tendency of the weights to vary linearly as a function of height is called a linear trend. Since the points do not fall exactly on a straight line, it is suitable to use a statistical relationship,

    y = β_0 + β_1 x + ε,

where β_0 and β_1 are unknown constants, x represents height, y represents weight, and ε represents a random error. The subject matter of this course is the study of such relationships.
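To illustrate how data arise from such a model, a minimal simulation sketch in Python is given below. The parameter values β_0 = -173, β_1 = 1.4 and σ = 10 are hypothetical choices (roughly matching the fit obtained later in this chapter), not estimates from the data.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Hypothetical parameters, roughly matching the fit obtained later
    beta0, beta1, sigma = -173.0, 1.4, 10.0

    x = rng.uniform(158, 186, size=23)     # heights (cm), like Table 1.1
    eps = rng.normal(0.0, sigma, size=23)  # random error component
    y = beta0 + beta1 * x + eps            # statistical relation y = f(x) + eps
    # The points scatter randomly around the systematic component beta0 + beta1*x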
Figure 1.1 Scatter Plot of Height-Weight Data of STAT 360 2001 Class
Table 1.1 Heights and Weights of 23 Students in STAT 360 Class of 2001

    Student ID    Height (cm)    Weight (kg)
    4126548       183.00         77.09
    4281675       177.80         90.70
    4100212       172.72         81.63
    4411919       167.64         49.88
    5936748       162.56         45.35
    5919460       162.56         54.42
    5945267       172.72         72.56
    4276051       177.80         74.83
    4084489       172.72         54.42
    4139615       185.42         92.97
    5928281       180.34         81.63
    5922763       172.72         80.72
    3630137       180.34         70.29
    4751612       158.00         55.00
    4767098       163.00         50.00
    4767209       158.00         42.00
    4766733       182.00         72.00
    4766164       166.00         60.00
    4763661       168.00         62.00
    4766970       163.00         55.00
    4763734       170.00         65.00
    3952312       172.72         95.23
    5928389       162.56         72.56
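For readers who wish to reproduce the computations in this chapter, the Table 1.1 data can be entered directly; the arrays below (in Python) are reused in the later sketches.

    import numpy as np

    # Heights (cm) and weights (kg) from Table 1.1
    height = np.array([183.00, 177.80, 172.72, 167.64, 162.56, 162.56, 172.72,
                       177.80, 172.72, 185.42, 180.34, 172.72, 180.34, 158.00,
                       163.00, 158.00, 182.00, 166.00, 168.00, 163.00, 170.00,
                       172.72, 162.56])
    weight = np.array([77.09, 90.70, 81.63, 49.88, 45.35, 54.42, 72.56, 74.83,
                       54.42, 92.97, 81.63, 80.72, 70.29, 55.00, 50.00, 42.00,
                       72.00, 60.00, 62.00, 55.00, 65.00, 95.23, 72.56])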
1.2 Regression Models

Terminology: Regression. The conditional expectation m(x) = E(Y | X = x) in a bivariate setting is called the regression of Y on X. The term regression was used by Sir Francis Galton (1822-1911) in studying the heights of offspring as a function of the heights of their parents, in a paper entitled "Regression towards mediocrity in hereditary stature" (Nature, vol. 15, pp. 507-510).
In this paper Galton reported his discovery that the offspring did not resemble their parents in size but tended "to be always more mediocre [i.e. more average] than they - to be smaller than the parents if parents were large; to be larger than parents if they were very small...". Thus the random variable Y may be assumed to vary around its mean m(x) as a function of x; denoting the random deviation Y − m(x) by ε, we can write

    Y = m(x) + ε.    (3)

Note that the probability distribution of ε is the conditional probability distribution of Y − m(x) given X = x, so Eq. (3) is essentially the same as Eq. (2). Hence statistical relationships such as these are known as regression models.
Dependent and Independent Variables

The relation y = f(x) implicitly requires studying the changes in y as a function of x, and is sometimes interpreted as a causal relation (i.e. x causes y). This understanding has resulted in calling x the independent variable and y the dependent variable.

Uses of the Regression Relation

The regression model is used for:

Description: simply knowing the nature of the relationship, such as that described by Sir Francis Galton.
Prediction: prediction of Y values (which are random) as a function of some related variable; this is an educated guess. For example, the increase in sales (Y) as a function of advertising expense (X) will be an important quantity for a company to predict. In this context, X is known as the predictor variable and Y as the predictand or response variable.

Control: knowledge of the regression relation may be used to control Y values. For example, in an industrial process, temperature (X) may be used to control the density (Y) of the finished product. Hence, to produce material of a given average density, the regression relation may be used to determine the proper temperature level.
1.3 Simple Linear Regression Model

Distribution of the Errors Unspecified

Let the n observations obtained on the bivariate random variable (X, Y) be denoted by (X_i, Y_i), i = 1, 2, ..., n. Then the Simple Linear Regression (SLR) model can be stated as follows:

    Y_i = β_0 + β_1 X_i + ε_i,    (4)

where

Y_i is the value of the response (or dependent) variable in the ith trial;
β_0 and β_1 are parameters known as the regression parameters;
X_i is a known constant, the value of the predictor variable in the ith trial;
ε_i is the random error term for the ith trial, such that E(ε_i) = 0 and Var(ε_i) = σ²{ε_i} = σ²; ε_i and ε_j for i ≠ j are uncorrelated, so that their covariance is zero, i.e. cov(ε_i, ε_j) = σ{ε_i, ε_j} = 0.

Normal Distribution of the Errors

For theoretical purposes it is important to assume that the errors are normally distributed; we denote this by ε_i ~ i.i.d. N(0, σ²). Note that i.i.d. is short for "independent and identically distributed", and that zero covariance between two normal random variables implies independence. The model with this extra assumption is known as the Normal Simple Linear Regression model.
Some Features of the SLR Model

In the expressions below, expectations are taken as if the X values were fixed; hence these are in fact conditional expectations. This should not create confusion if we keep in mind that the regression relation studies the variation in Y for fixed values of X.

(i) Y_i is the sum of a constant and a random variable; hence it is a random variable.
(ii) E(Y_i) = β_0 + β_1 X_i.
(iii) Var(Y_i) = σ²{Y_i} = σ²{ε_i} = σ².

Hence this model assumes that the mean function is linear in X but the variance function is constant in X.
(iv) For i ≠ j, the observations Y_i and Y_j are uncorrelated.

The above features follow from simple rules of expectation and variance.

Meaning of the Regression Parameters

Since E(Y) = β_0 + β_1 X, it is clear that

    β_0 = E(Y | X = 0) = intercept of the regression line = mean response when X = 0,

and

    β_1 = slope of the regression line = change in the mean response per unit change in X.
1.4 Estimation of the Regression Function

Method of Least Squares (LS)

When the distribution of the errors is not specified, we work with the observed deviations Y_i − β_0 − β_1 X_i. The least squares principle provides the best-fitting line to the data by minimizing

    Q(β_0, β_1) = Σ_{i=1}^{n} (Y_i − β_0 − β_1 X_i)².    (5)

Note: other criteria may also be proposed, such as the least absolute deviation (LAD) criterion Σ_{i=1}^{n} |Y_i − β_0 − β_1 X_i|, but LS offers an enormous theoretical simplification and gives estimators with good properties.
Least Squares Estimators

The analytical solutions for β_0 and β_1, denoted by b_0 and b_1 respectively, are obtained by solving the following simultaneous linear equations, known as the normal equations:

    Σ Y_i = n b_0 + b_1 Σ X_i,    (6)
    Σ X_i Y_i = b_0 Σ X_i + b_1 Σ X_i².    (7)

These can be solved explicitly to give

    b_1 = Σ (X_i − X̄)(Y_i − Ȳ) / Σ (X_i − X̄)²,    (8)
    b_0 = Ȳ − b_1 X̄.    (9)
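As a numerical check, the normal equations (6)-(7) can be solved directly as a 2 x 2 linear system. A minimal sketch, assuming the height and weight arrays entered after Table 1.1:

    import numpy as np

    n = len(height)  # height, weight: arrays entered after Table 1.1

    # Coefficient matrix and right-hand side of the normal equations (6)-(7)
    A = np.array([[n,            height.sum()],
                  [height.sum(), (height ** 2).sum()]])
    rhs = np.array([weight.sum(), (height * weight).sum()])

    b0, b1 = np.linalg.solve(A, rhs)  # least squares estimates
    print(b0, b1)                     # approximately -172.88 and 1.4069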
Proof. The minimizing equations are

    ∂Q/∂β_0 = 0,    (10)
    ∂Q/∂β_1 = 0.    (11)

It is easy to obtain

    ∂Q/∂β_0 = −2 Σ (Y_i − β_0 − β_1 X_i),    (12)
    ∂Q/∂β_1 = −2 Σ X_i (Y_i − β_0 − β_1 X_i).    (13)

Equating these to zero and substituting b_0 and b_1 for β_0 and β_1 respectively, we get

    Σ (Y_i − b_0 − b_1 X_i) = 0,    (14)
    Σ X_i (Y_i − b_0 − b_1 X_i) = 0.    (15)

Expanding the summations term by term, we get

    Σ Y_i − n b_0 − b_1 Σ X_i = 0,    (16)
    Σ X_i Y_i − b_0 Σ X_i − b_1 Σ X_i² = 0.    (17)
Rearranging the terms gives the normal equations. Dividing the first normal equation by n, we get

    (1/n) Σ Y_i = b_0 + b_1 (1/n) Σ X_i,    (18)

or

    Ȳ = b_0 + b_1 X̄.    (19)

Hence,

    b_0 = Ȳ − b_1 X̄.

Substituting this into the second normal equation, we get

    Σ X_i Y_i = n X̄ (Ȳ − b_1 X̄) + b_1 Σ X_i² = n X̄ Ȳ + b_1 (Σ X_i² − n X̄²).
This gives

    b_1 = (Σ X_i Y_i − n X̄ Ȳ) / (Σ X_i² − n X̄²).

Using the facts that Σ X_i Y_i − n X̄ Ȳ = Σ (X_i − X̄)(Y_i − Ȳ) and Σ X_i² − n X̄² = Σ (X_i − X̄)², the above expression becomes

    b_1 = Σ (X_i − X̄)(Y_i − Ȳ) / Σ (X_i − X̄)².
Example

For the data in Table 1.1, the following quantities are computed:

    n = 23;  Σ X_i = 3931.6;  Σ Y_i = 1555.3;  Σ X_i Y_i = 267951;  Σ X_i² = 673552.

Hence X̄ = 3931.6/23 = 170.94 and Ȳ = 1555.3/23 = 67.6231. For computing b_1, the numerator is computed as

    Σ X_i Y_i − n X̄ Ȳ = Σ X_i Y_i − (Σ X_i)(Σ Y_i)/n,

and the denominator as

    Σ X_i² − n X̄² = Σ X_i² − (Σ X_i)²/n.

Hence,

    b_1 = (267951 − 3931.6 × 1555.3/23) / (673552 − 3931.6²/23) = 1.40694,
    b_0 = 67.6231 − 1.40694 × 170.94 = −172.8792.
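The same estimates can be reproduced from the centered formulas (8)-(9); a short sketch, again assuming the Table 1.1 arrays:

    import numpy as np

    xbar, ybar = height.mean(), weight.mean()
    Sxy = np.sum((height - xbar) * (weight - ybar))
    Sxx = np.sum((height - xbar) ** 2)

    b1 = Sxy / Sxx         # slope, Eq. (8)
    b0 = ybar - b1 * xbar  # intercept, Eq. (9)
    print(b0, b1)          # approximately -172.88 and 1.4069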
1.5 Point Estimation of the Mean Response

Let X_h be a typical value of the independent variable at which the mean response E(Y) is to be estimated. Note that this is equivalent to estimating the regression function

    E(Y) = β_0 + β_1 X    (20)

at X = X_h. An individual value of Y is known as a response, and E(Y) is known as the mean response. The regression function is linear in the parameters β_0 and β_1; hence its estimate is easily obtained as

    Ŷ = β̂_0 + β̂_1 X = b_0 + b_1 X.    (21)
For the cases in the study, we call

    Ŷ_i = b_0 + b_1 X_i,  i = 1, 2, ..., n,    (22)

the fitted value for the ith case; it is viewed as the estimate of the mean response for X = X_i.

Example 1.2

For the data in Table 1.1, the estimates were obtained as b_0 = −172.88 and b_1 = 1.41. Hence the estimated regression function is given by

    Ŷ = −172.88 + 1.41 X.

This estimated regression function is plotted in Figure 1.2. The fitted values are reported in the following table.
Table 1.2: Fitted Values and Residuals for Height-Weight (2001) Data

    Student#   Height(X)   Weight(Y)   Fits       Residuals
    1          183.00      77.0975     84.5908     -7.4933
    2          177.80      90.7029     77.2747     13.4282
    3          172.72      81.6327     70.1274     11.5052
    4          167.64      49.8866     62.9802    -13.0936
    5          162.56      45.3515     55.8329    -10.4815
    6          162.56      54.4218     55.8329     -1.4112
    7          172.72      72.5624     70.1274      2.4349
    8          177.80      74.8299     77.2747     -2.4448
    9          172.72      54.4218     70.1274    -15.7057
    10         185.42      92.9705     87.9956      4.9749
    11         180.34      81.6327     80.8483      0.7843
    12         172.72      80.7256     70.1274     10.5982
    13         180.34      70.2948     80.8483    -10.5535
    14         158.00      55.0000     49.4173      5.5827
    15         163.00      50.0000     56.4520     -6.4520
    16         158.00      42.0000     49.4173     -7.4173
    17         182.00      72.0000     83.1838    -11.1838
    18         166.00      60.0000     60.6728     -0.6728
    19         168.00      62.0000     63.4867     -1.4867
    20         163.00      55.0000     56.4520     -1.4520
    21         170.00      65.0000     66.3006     -1.3006
    22         172.72      95.2381     70.1274     25.1107
    23         162.56      72.5624     55.8329     16.7294
Figure 1.2 Scatter Plot and Fitted Line Plot of Height-Weight Data of STAT 360 2001 Class
(Regression plot of Weight(Y) against Height(X), with fitted line Y = −172.879 + 1.40694 X; R-Sq = 55.7%)
The graph shows a reasonable scatter around the fitted line. Suppose that the mean weight of a person of typical height X = 171 cm is desired; the corresponding point estimate is given by Ŷ = −172.88 + 1.41(171) = 68.23 kg. Table 1.2 gives the fitted values for all the heights in the data, obtained by substituting X_i for X in the equation of the fitted line. The table also gives the residuals, which are the differences between the observed and fitted values. In general, the ith residual is given by

    e_i = Y_i − Ŷ_i.    (23)
For the SLR model it can be written as

    e_i = Y_i − b_0 − b_1 X_i = (Y_i − Ȳ) − b_1 (X_i − X̄).    (24)

The latter form is useful for theoretical derivations in this course. The residuals are in some sense estimates of the errors ε_i; they are used to check the validity of the model as well as to find departures from the model.
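Continuing from the previous sketches (height, weight, b0, b1 as defined there), the point estimate at X = 171 and the fitted values and residuals of Table 1.2 can be reproduced as follows:

    import numpy as np

    fits = b0 + b1 * height  # fitted values, Eq. (22)
    resid = weight - fits    # residuals, Eq. (23)

    print(b0 + b1 * 171)       # estimated mean weight at 171 cm, about 68.2 kg
    print(np.round(fits, 4))   # compare with the "Fits" column of Table 1.2
    print(np.round(resid, 4))  # compare with the "Residuals" column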
1.6 Properties of the Fitted Regression Line

(Several of these properties are verified numerically in the sketch following the MSE example in Section 1.7.)

(i) The sum of all the residuals equals zero:

    Σ_{i=1}^{n} e_i = 0.    (25)

Note that this implies that the sample mean of the residuals, ē = (1/n) Σ_{i=1}^{n} e_i, is zero. The sample mean being an estimator of the population mean, this is in line with the assumption that E(ε) = 0. To prove this, use Eq. (24) and the facts that Σ (X_i − X̄) = 0 and Σ (Y_i − Ȳ) = 0.

(ii) The sum of squared residuals Σ (Y_i − Ŷ_i)² = Σ e_i² is a minimum; this was the defining requirement of least squares estimation.
(iii) The sum of the observed values Y_i equals the sum of the fitted values Ŷ_i:

    Σ_{i=1}^{n} Y_i = Σ_{i=1}^{n} Ŷ_i.    (26)

This follows from the first property, since Σ e_i = Σ (Y_i − Ŷ_i) = 0. It implies that the sample mean of the observed values and the sample mean of the fitted values are the same, both equal to Ȳ.

(iv) The sum of the weighted residuals is zero when the residuals are weighted by the corresponding level of the predictor variable:

    Σ X_i e_i = 0.    (27)
To prove this, we see that

    Σ X_i e_i = Σ X_i {(Y_i − Ȳ) − b_1 (X_i − X̄)} = Σ X_i (Y_i − Ȳ) − b_1 Σ X_i (X_i − X̄).    (28)

Furthermore,

    S_xy = Σ (X_i − X̄)(Y_i − Ȳ) = Σ X_i (Y_i − Ȳ) − X̄ Σ (Y_i − Ȳ) = Σ X_i (Y_i − Ȳ),

since Σ (Y_i − Ȳ) = 0; similarly, S_xx = Σ (X_i − X̄)² = Σ X_i (X_i − X̄). Hence Eq. (28) becomes

    Σ X_i e_i = S_xy − b_1 S_xx.
Using the formula b_1 = S_xy / S_xx, the above equation becomes

    Σ X_i e_i = S_xy − (S_xy / S_xx) S_xx = S_xy − S_xy = 0.

(v) The sum of the weighted residuals is also zero when the residuals are weighted by the corresponding fitted values:

    Σ Ŷ_i e_i = 0.    (29)

This follows easily since Σ Ŷ_i e_i = b_0 Σ e_i + b_1 Σ X_i e_i, together with the facts (proved above) that Σ e_i = 0 and Σ X_i e_i = 0.
(vi) The fitted regression line always passes through the point (X̄, Ȳ). Substituting X = X̄ into the fitted line, we find

    Ŷ = b_0 + b_1 X̄ = Ȳ − b_1 X̄ + b_1 X̄ = Ȳ,

which proves this property.

Notes:

(i) Property (i) also follows directly from the first normal equation, since Σ e_i = Σ (Y_i − b_0 − b_1 X_i) = Σ Y_i − n b_0 − b_1 Σ X_i = 0.

(ii) Property (iv) also follows directly from the second normal equation, since Σ X_i e_i = Σ X_i Y_i − b_0 Σ X_i − b_1 Σ X_i² = 0.
(iii) If the data are transformed as Y → y = Y − Ȳ and X → x = X − X̄, the fitted equation becomes

    ŷ = b_1 x,    (30)

where ŷ = Ŷ − Ȳ. It is clear that this equation passes through the point (0, 0), which is a consequence of shifting the origin to the point (X̄, Ȳ).

1.7 Estimation of the Error Variance

In general, variation is estimated by the squared deviations of the observations from their mean, or from an estimate of that mean. For example, for observations Y_1, Y_2, ..., Y_n from a normal population N(µ, σ²), the unbiased estimator of σ² is given by

    σ̂² = (1/n) Σ_{i=1}^{n} (Y_i − µ)²,      if µ is known;
    s² = (1/(n−1)) Σ_{i=1}^{n} (Y_i − Ȳ)²,  if µ is unknown.

In other words, the estimate of σ² is a sum of squared deviations divided by its degrees of freedom: n if µ is known, and n − 1 if µ is estimated by Ȳ. In the case of the regression model, the deviation of the observation Y_i from its estimated mean m̂(X_i) = b_0 + b_1 X_i is the residual e_i = Y_i − Ŷ_i = Y_i − b_0 − b_1 X_i, and the corresponding sum of squares, denoted by SSE for Sum of Squares due to Error, is given by

    SSE = Σ_{i=1}^{n} (Y_i − Ŷ_i)² = Σ_{i=1}^{n} e_i².    (31)
The corresponding degrees of freedom is n − 2 (two degrees of freedom are lost in estimating the two parameters β_0 and β_1). This gives rise to the following estimate of σ²:

    MSE = SSE/(n − 2) = Σ_{i=1}^{n} e_i² / (n − 2),    (32)

where MSE stands for Mean Square due to Error. It will be proved later that MSE is unbiased for σ².

Example

For the data of Table 1.1, the sum of squared residuals is SSE = 2329.5, based on n = 23 observations. Hence,

    MSE = 2329.5/21 = 110.93.
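The residual properties of Section 1.6 and the SSE and MSE computations can be checked numerically. A sketch continuing from the earlier snippets (height, weight, fits, resid as defined there):

    import numpy as np

    n = len(weight)

    # Properties (i), (iv) and (v): all three sums are zero up to rounding
    print(resid.sum())             # property (i):  sum of residuals
    print((height * resid).sum())  # property (iv): X-weighted residuals
    print((fits * resid).sum())    # property (v):  fitted-value-weighted residuals

    SSE = np.sum(resid ** 2)  # Eq. (31); about 2329.5
    MSE = SSE / (n - 2)       # Eq. (32); about 110.93
    print(SSE, MSE)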
1.8 Normal Error Regression Model

The information about the parameters is contained in the joint distribution of Y_1, ..., Y_n. For the normal error regression model, ε_1, ..., ε_n are independent and normally distributed with mean zero and variance σ². This implies that Y_1, ..., Y_n are also independent and normal, with Y_i ~ N(β_0 + β_1 X_i, σ²). The probability density function of Y_i is given by

    f_i(Y_i) = (1/(σ√(2π))) exp{ −(1/(2σ²)) (Y_i − m(X_i))² }
             = (1/(σ√(2π))) exp{ −(1/(2σ²)) (Y_i − β_0 − β_1 X_i)² }.
Since Y_1, ..., Y_n are independent, their joint probability density function is

    f(Y_1, ..., Y_n) = f_1(Y_1) f_2(Y_2) ... f_n(Y_n),

and the likelihood function L(β_0, β_1, σ²) is given by

    L(β_0, β_1, σ²) = f(Y_1, ..., Y_n)
                    = ∏_{i=1}^{n} (1/(σ√(2π))) exp{ −(1/(2σ²)) (Y_i − β_0 − β_1 X_i)² }
                    = (1/(σ√(2π)))ⁿ exp{ −(1/(2σ²)) Σ_{i=1}^{n} (Y_i − β_0 − β_1 X_i)² }.    (33)

To find the maximum likelihood estimators of β_0, β_1 and σ², the likelihood function has to be maximized. Equivalently, we maximize the log-likelihood function

    log_e L = −(n/2) log_e(2π) − (n/2) log_e σ² − (1/(2σ²)) Σ_{i=1}^{n} (Y_i − β_0 − β_1 X_i)².    (34)
Maximum Likelihood Estimators of the Parameters

The maximum likelihood estimators are obtained by solving the following three equations:

    ∂ log_e L / ∂β_0 = 0,  ∂ log_e L / ∂β_1 = 0,  ∂ log_e L / ∂σ² = 0.

The partial derivatives are given by

    ∂ log_e L / ∂β_0 = (1/σ²) Σ_{i=1}^{n} (Y_i − β_0 − β_1 X_i),
    ∂ log_e L / ∂β_1 = (1/σ²) Σ_{i=1}^{n} X_i (Y_i − β_0 − β_1 X_i),
    ∂ log_e L / ∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^{n} (Y_i − β_0 − β_1 X_i)².
Replacing β_0, β_1, σ² by β̂_0, β̂_1, σ̂² and simplifying, we obtain

    Σ_{i=1}^{n} (Y_i − β̂_0 − β̂_1 X_i) = 0,    (35)
    Σ_{i=1}^{n} X_i (Y_i − β̂_0 − β̂_1 X_i) = 0,    (36)
    σ̂² = Σ_{i=1}^{n} (Y_i − β̂_0 − β̂_1 X_i)² / n.    (37)

Note that equations (35) and (36) are the two normal equations obtained by the least squares method. Hence the maximum likelihood estimators of β_0 and β_1 are the same as b_0 and b_1, respectively, whereas the MLE of σ² is given by

    σ̂² = Σ_{i=1}^{n} (Y_i − b_0 − b_1 X_i)² / n    (38)
        = Σ_{i=1}^{n} e_i² / n.    (39)

Note that the MLE of σ² is biased, since

    E(σ̂²) = E( ((n−2)/n) MSE ) = ((n−2)/n) σ².    (40)
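As a numerical check of this derivation, the log-likelihood (34) can also be maximized directly. A sketch using scipy.optimize, with arbitrary starting values chosen near the least squares fit (height and weight as entered after Table 1.1):

    import numpy as np
    from scipy.optimize import minimize

    def neg_loglik(theta, x, y):
        b0, b1, log_s2 = theta  # sigma^2 on the log scale keeps it positive
        s2 = np.exp(log_s2)
        r = y - b0 - b1 * x
        n = len(y)
        # Negative of the log-likelihood in Eq. (34)
        return 0.5 * n * np.log(2 * np.pi) + 0.5 * n * np.log(s2) \
               + np.sum(r ** 2) / (2 * s2)

    res = minimize(neg_loglik, x0=[-170.0, 1.4, np.log(110.0)],
                   args=(height, weight), method="Nelder-Mead")
    b0_mle, b1_mle, s2_mle = res.x[0], res.x[1], np.exp(res.x[2])
    print(b0_mle, b1_mle)  # agree with the least squares b0 and b1
    print(s2_mle)          # about ((n-2)/n) * MSE, the biased MLE of sigma^2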
The following output was obtained using MINITAB (available in the Math and Stat department PC lab) with the height-weight data for this class. (The data can be downloaded by following the links from http://alcor.concordia.ca/~chaubey in either Excel or text format, and can subsequently be copied and pasted into a MINITAB worksheet.) Use the Stat > Regression > Regression menu to obtain the following output. (MINITAB ignores missing data, denoted by *.)

Regression Analysis: Weight versus Height

The regression equation is
Weight = - 161 + 1.33 Height

52 cases used; 6 cases contain missing values
    Predictor     Coef     SE Coef       T       P
    Constant   -160.74       21.59   -7.45   0.000
    Height      1.3259      0.1265   10.48   0.000

    S = 7.325   R-Sq = 68.7%   R-Sq(adj) = 68.1%

    Analysis of Variance

    Source           DF       SS       MS        F       P
    Regression        1   5898.4   5898.4   109.92   0.000
    Residual Error   50   2683.0     53.7
    Total            51   8581.4
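Readers without MINITAB can obtain an analogous summary in Python with the statsmodels package. The sketch below uses the 23-case Table 1.1 data, so its numbers will differ from the 52-case output above:

    import statsmodels.api as sm

    # height, weight as entered after Table 1.1 (23 cases, no missing values)
    X = sm.add_constant(height)  # adds the intercept column
    model = sm.OLS(weight, X).fit()
    print(model.summary())       # coefficients, standard errors, R-Sq, ANOVA table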
Notes:

1. If a missing weight is substituted by its fitted value and the regression is run again, the same results are obtained. To store the fitted values and residuals, use the STORAGE option by checking the appropriate boxes.

2. To obtain a fitted line plot, use the Stat > Regression > Fitted Line Plot menu in MINITAB.

3. Regression output may also be obtained in EXCEL using Tools > Data Analysis > Regression.