Chapter 5 Linear least squares regression

Valerie Bridges

Consider first simple linear regression. For now, the model of interest is only the model of the conditional expectations; the distribution of the residuals is ignored. The immediate objective is to determine estimators of the model coefficients, or parameters. As there are many different ways of estimating the parameters, a decision must be made on which to use. Those used in conventional linear regression are optimal, or best, among a reasonably large class of estimators. The least squares estimators minimize the sum of the squared residuals. No other set of unbiased estimators produces a smaller sum of squared residuals. Since the sum of the squared residuals is used in computing the estimate of residual error, informally at least, the least squares estimators have the least error among all (unbiased) estimators. The derivation of the least squares estimators is useful to understand, but more useful is understanding the properties of the least squares estimators.

We consider only the n observation pairs $(y_1, x_1), \ldots, (y_n, x_n)$. The model of the ith observation $Y_i$ is

$$E(Y_i \mid x_i) = \beta_0 + \beta_1 x_i.$$

The objective is to use the n observation pairs to estimate the parameters $\beta_0$ and $\beta_1$ in a manner that will result in the smallest possible error, where error is defined (in this context) to be the sum of the squared residuals. Let $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$ denote the ith fitted value (the estimators $\hat\beta_0$ and $\hat\beta_1$ have not been derived yet) and $\hat\varepsilon_i = y_i - \hat{y}_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$ denote the ith residual. The sum of the squared residuals depends on, and is a function of, the estimators $\hat\beta_0$ and $\hat\beta_1$. It is

$$S(\hat\beta_0, \hat\beta_1) = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2.$$

The calculus-based approach (the simplest) to deriving estimators of $\beta_0$ and $\beta_1$ minimizes $S(\beta_0, \beta_1)$ by the choice of $\beta_0$ and $\beta_1$. The values of $\beta_0$ and $\beta_1$ that minimize $S(\beta_0, \beta_1)$ are the least squares estimates.
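The idea that the least squares estimates are simply the minimizers of $S(\beta_0, \beta_1)$ can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the data values are made up, and the names `S`, `fit_optim` and `fit_lm` are hypothetical. It minimizes the sum of squared residuals with a general-purpose optimizer and compares the result with R's `lm()`.

```r
# Toy data (hypothetical values, not taken from the text)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

# Sum of squared residuals as a function of candidate coefficients b = (b0, b1)
S <- function(b) sum((y - b[1] - b[2] * x)^2)

# Minimize S numerically, starting from an arbitrary point
fit_optim <- optim(c(0, 0), S, method = "BFGS")

# Least squares estimates as computed by lm(), for comparison
fit_lm <- coef(lm(y ~ x))

print(fit_optim$par)  # numerical minimizer of S
print(fit_lm)         # should agree up to optimizer tolerance
```

The two sets of estimates should agree to several decimal places; any small difference reflects the stopping rule of the optimizer, not a different criterion.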

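$S(\beta_0, \beta_1)$ is a quadratic function of the two variables $\beta_0$ and $\beta_1$, and it can be visualized in three-space as a bowl-shaped surface: something like a downward-pointing cone, though with a smooth tip rather than a pointed one. The function has its minimum value at the tip. The tip lies above the solution pair $(\hat\beta_0, \hat\beta_1)$; these are the estimates that minimize the sum of the squared residuals.

To find the minimum of $S(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2$, the partial derivatives $\partial S(\beta_0, \beta_1)/\partial\beta_0$ and $\partial S(\beta_0, \beta_1)/\partial\beta_1$ are computed, set to zero, and the resulting equations solved for $\beta_0$ and $\beta_1$. First,

$$\frac{\partial S(\beta_0, \beta_1)}{\partial\beta_0} = \frac{\partial}{\partial\beta_0}\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 = \sum_{i=1}^n \frac{\partial}{\partial\beta_0}(y_i - \beta_0 - \beta_1 x_i)^2 = -2\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) = -2(n\bar{y} - n\beta_0 - n\beta_1\bar{x}),$$

since $\sum_{i=1}^n y_i = n\bar{y}$ and $\sum_{i=1}^n x_i = n\bar{x}$. Setting $\partial S(\beta_0, \beta_1)/\partial\beta_0 = 0$ and rearranging produces

$$\beta_0 + \bar{x}\beta_1 = \bar{y}.$$

Second,

$$\frac{\partial S(\beta_0, \beta_1)}{\partial\beta_1} = \frac{\partial}{\partial\beta_1}\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 = \sum_{i=1}^n \frac{\partial}{\partial\beta_1}(y_i - \beta_0 - \beta_1 x_i)^2 = -2\sum_{i=1}^n x_i(y_i - \beta_0 - \beta_1 x_i) = -2\left(\sum_{i=1}^n x_i y_i - n\beta_0\bar{x} - \beta_1\sum_{i=1}^n x_i^2\right).$$

Setting $\partial S(\beta_0, \beta_1)/\partial\beta_1 = 0$ and rearranging produces

$$n\bar{x}\beta_0 + \sum_{i=1}^n x_i^2\,\beta_1 = \sum_{i=1}^n x_i y_i.$$

Collectively, the system of equations (known as the normal equations) is

$$\beta_0 + \bar{x}\beta_1 = \bar{y}$$
$$n\bar{x}\beta_0 + \sum_{i=1}^n x_i^2\,\beta_1 = \sum_{i=1}^n x_i y_i.$$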
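The two normal equations form a small linear system, so they can be solved directly. The following sketch uses made-up data and hypothetical object names (`A`, `rhs`, `b`); it writes the system in matrix form, solves it with `solve()`, and compares the answer with `lm()`.

```r
# Toy data (hypothetical values, not taken from the text)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)
n <- length(y)

# Coefficients and right-hand side of the two normal equations:
#   1 * b0        + xbar * b1     = ybar
#   n * xbar * b0 + sum(x^2) * b1 = sum(x * y)
A <- rbind(c(1,           mean(x)),
           c(n * mean(x), sum(x^2)))
rhs <- c(mean(y), sum(x * y))

# Solve the 2 x 2 system for (b0, b1)
b <- solve(A, rhs)

print(b)
print(coef(lm(y ~ x)))  # same values from R's built-in least squares fit
```

The system has a unique solution whenever the $x_i$ are not all equal, because the determinant of `A` is $\sum (x_i - \bar{x})^2$.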

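The solution to the system is

$$\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x}, \qquad \hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}.$$

The least squares estimate of $E(Y_i \mid x_i)$ is

$$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i = \bar{y} - \hat\beta_1\bar{x} + \hat\beta_1 x_i = \bar{y} + (x_i - \bar{x})\hat\beta_1. \qquad (1)$$

If you are able to use calculus, then you should verify the following. Without any explanatory variables, the linear model is $E(Y) = \beta_0$. The least squares estimator (derive this) is $\hat\beta_0 = n^{-1}\sum_{i=1}^n y_i = \bar{y}$. If there is an explanatory variable, then the least squares estimator of $E(Y \mid x_i)$ (formula 1) adjusts the estimator $\bar{y}$ according to how far $x_i$ is from $\bar{x}$. The degree of adjustment depends on the estimate of $\beta_1$.

Properties of the least squares estimators

1. The fitted value $\hat{y}_i$, given that $x_i = \bar{x}$, is $\bar{y}$. Therefore, the least squares line passes through the pair $(\bar{x}, \bar{y})$.

2. The residuals are

$$\hat\varepsilon_i = y_i - \hat{y}_i = y_i - \hat\beta_0 - \hat\beta_1 x_i,$$

and the sum of the residuals is

$$\sum_{i=1}^n \hat\varepsilon_i = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i) = \sum_{i=1}^n (y_i - \bar{y} + \hat\beta_1\bar{x} - \hat\beta_1 x_i) = \sum_{i=1}^n (y_i - \bar{y}) + \hat\beta_1\sum_{i=1}^n (\bar{x} - x_i) = \sum_{i=1}^n (y_i - \bar{y}) - \hat\beta_1\sum_{i=1}^n (x_i - \bar{x}).$$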

Since $\sum_{i=1}^n y_i = n\bar{y}$, it follows that $\sum_{i=1}^n (y_i - \bar{y}) = 0$; likewise $\sum_{i=1}^n (x_i - \bar{x}) = 0$. Hence the sum of the residuals is $\sum_{i=1}^n \hat\varepsilon_i = 0$.

3. Another property that will be used later is $\sum_{i=1}^n \hat\varepsilon_i x_i = 0$. The argument supporting this claim uses the second normal equation $n\bar{x}\beta_0 + \sum_{i=1}^n x_i^2\,\beta_1 = \sum_{i=1}^n x_i y_i$. Since the least squares estimates $\hat\beta_0$ and $\hat\beta_1$ satisfy the normal equations, it is true that

$$n\bar{x}\hat\beta_0 + \sum_{i=1}^n x_i^2\,\hat\beta_1 = \sum_{i=1}^n x_i y_i. \qquad (2)$$

Consider

$$\sum_{i=1}^n \hat\varepsilon_i x_i = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)x_i = \sum_{i=1}^n x_i y_i - \hat\beta_0\sum_{i=1}^n x_i - \hat\beta_1\sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i - n\hat\beta_0\bar{x} - \hat\beta_1\sum_{i=1}^n x_i^2.$$

Equation (2) implies that the last line of the above set of equations is 0 and that $\sum_{i=1}^n \hat\varepsilon_i x_i = 0$.

4. The sum of the products of residuals and fitted values is 0, since

$$\sum_{i=1}^n \hat{y}_i\hat\varepsilon_i = \sum_{i=1}^n (\hat\beta_0 + \hat\beta_1 x_i)\hat\varepsilon_i = \hat\beta_0\sum_{i=1}^n \hat\varepsilon_i + \hat\beta_1\sum_{i=1}^n \hat\varepsilon_i x_i.$$

The properties above established that $\sum_{i=1}^n \hat\varepsilon_i = 0$ and $\sum_{i=1}^n \hat\varepsilon_i x_i = 0$. Hence, $\sum_{i=1}^n \hat{y}_i\hat\varepsilon_i = 0$.

Assessing the adequacy of the regression model: The model must be considered a good (adequate) model before it is useful or interesting. A benchmark model against which the regression model ought to be substantially better is the null model $E(Y) = \beta_0$. A measure of adequacy is either the sum or the mean of the squared residuals about the model. The residuals associated with the null model are $y_i - \bar{y}$, $i = 1, \ldots, n$, since $\bar{y}$ is the least squares estimator of the parameter $\beta_0 = \mu$. The total sum of squares measures the fit of the null model:

$$TSS = \sum_{i=1}^n (y_i - \bar{y})^2.$$

TSS is the sum of squared residuals about the null model. The corresponding measure of fit for the linear regression model is the sum of squared residuals

$$RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2.$$

Another property of least squares regression is that $RSS \le TSS$ because the null model is a constrained version of the two-parameter linear model $E(Y \mid x) = \beta_0 + \beta_1 x$. The constraint imposed on the parameters that leads to the null model is $\beta_1 = 0$. In the unconstrained case of the two-parameter model, the parameter estimates are those that minimize the residual sum of squares. In the constrained case, one parameter is not chosen to minimize the residual sum of squares; instead, it is fixed ($\beta_1 = 0$), and so it is not optimal unless the least squares estimate of $\beta_1$ is exactly 0. Constraining a parameter can never decrease the residual sum of squares compared to the residual sum of squares obtained when the parameter is not constrained but instead fit to the data by least squares, and it increases the residual sum of squares unless the constraint happens to hold for the least squares estimate. In other words, a constrained model will never fit better than an unconstrained model.

TSS will be close to RSS when the constrained and unconstrained models are alike. The constrained and unconstrained models are alike if $\hat\beta_1$ is very close to 0. Then the constrained and unconstrained model residuals are much alike, that is,

$$y_i - \hat{y}_i \approx y_i - \hat\beta_0 \approx y_i - \bar{y},$$

and TSS will be only slightly larger than RSS. Conversely, if TSS is only slightly larger than RSS, then

$$y_i - \hat{y}_i \approx y_i - \bar{y} \quad \text{for } i = 1, \ldots, n,$$

which implies that $\bar{y} \approx \hat\beta_0 + \hat\beta_1 x_i$, $i = 1, \ldots, n$; thus $\hat\beta_1 \approx 0$ and the constrained and unconstrained models are very similar with respect to the residuals.
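The four properties of the least squares estimators, and the inequality $RSS \le TSS$, can be checked numerically. The sketch below uses simulated data (the seed, sample size and coefficients are arbitrary assumptions, and the object names are hypothetical); the sums are zero only up to floating-point rounding.

```r
# Simulated data (arbitrary, for illustration only)
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30)

fit  <- lm(y ~ x)
e    <- resid(fit)    # residuals
yhat <- fitted(fit)   # fitted values

# Property 1: the fitted line passes through (xbar, ybar)
at_mean <- unname(predict(fit, newdata = data.frame(x = mean(x))))

# Properties 2-4: each of these sums is zero up to rounding error
sum_e    <- sum(e)
sum_ex   <- sum(e * x)
sum_eyhat <- sum(e * yhat)

# Fit of the null model (TSS) versus the regression model (RSS)
TSS <- sum((y - mean(y))^2)
RSS <- sum(e^2)

print(c(at_mean = at_mean, ybar = mean(y)))
print(c(sum_e = sum_e, sum_ex = sum_ex, sum_eyhat = sum_eyhat))
print(c(TSS = TSS, RSS = RSS))
```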

This logic will lead to a test of significance for a set of parameters. It will be concluded that the parameters are zero if the residual sum of squares associated with a model with the set constrained to be zero is not much larger than the residual sum of squares associated with a model in which the parameters are unconstrained.

Suppose that the residuals about the regression model were all zero so that the model fit the data perfectly. Then $RSS = 0$ and none of the variability in the response variable is left unexplained by the regression model. More generally, RSS is said to be the residual variation not explained by the regression model, and the difference

$$RegSS = TSS - RSS \qquad (3)$$

is the portion of the total sum of squares that is explained by the regression model. The ratio of the regression and total sums of squares is the coefficient of determination

$$r^2 = \frac{RegSS}{TSS} \qquad (4)$$
$$\phantom{r^2} = \frac{TSS - RSS}{TSS}. \qquad (5)$$

Since $0 \le TSS - RSS \le TSS$, it follows that $0 \le r^2 \le 1$. The coefficient of determination is sometimes said to be the proportion of variation in the response variable that is explained by the regression model (or the explanatory variable).

These ideas can be pursued further. Note that

$$y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}). \qquad (6)$$

The total sum of squares can be expanded using formula (6):

$$TSS = \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n [(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})]^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + 2\sum_{i=1}^n (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2.$$

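The cross-product $2\sum_{i=1}^n (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})$ is zero since

$$\sum_{i=1}^n (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = \sum_{i=1}^n \hat\varepsilon_i(\hat{y}_i - \bar{y}) = \sum_{i=1}^n \hat\varepsilon_i\hat{y}_i - \sum_{i=1}^n \hat\varepsilon_i\bar{y} = \sum_{i=1}^n \hat\varepsilon_i\hat{y}_i - \bar{y}\sum_{i=1}^n \hat\varepsilon_i.$$

The second term, $\bar{y}\sum_{i=1}^n \hat\varepsilon_i$, is 0 since $\sum_{i=1}^n \hat\varepsilon_i = 0$ (see property (2) of the least squares estimators above). The first term, $\sum_{i=1}^n \hat\varepsilon_i\hat{y}_i$, is 0 by property (4) above. Thus, the cross-product term is 0, and the expansion of the total sum of squares is

$$TSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = RSS + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2,$$

so that

$$TSS - RSS = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2.$$

Recall the definition of the regression sum of squares (formula 3), which set $TSS - RSS = RegSS$. Comparing this definition to $TSS - RSS = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$ leads to another expression for the regression sum of squares:

$$RegSS = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2.$$

This expression for the regression sum of squares shows that the extent to which the unconstrained model predictions of the $y_i$'s (the $\hat{y}_i$'s) differ from the constrained model predictions of the $y_i$'s ($\bar{y}$) determines how well the model fits.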
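The decomposition $TSS = RSS + RegSS$ and the two expressions for the coefficient of determination can be verified on any data set. The sketch below uses simulated data with arbitrary settings and hypothetical object names; `summary(fit)$r.squared` is the value R itself reports for the fitted model.

```r
# Simulated data (arbitrary, for illustration only)
set.seed(2)
x <- runif(40, 0, 10)
y <- 1 + 0.8 * x + rnorm(40, sd = 2)
fit <- lm(y ~ x)

TSS   <- sum((y - mean(y))^2)            # residuals about the null model
RSS   <- sum(resid(fit)^2)               # residuals about the regression model
RegSS <- sum((fitted(fit) - mean(y))^2)  # fitted values about ybar

# Coefficient of determination, formulas (4) and (5)
r2_a <- RegSS / TSS
r2_b <- (TSS - RSS) / TSS

print(c(TSS = TSS, RSS_plus_RegSS = RSS + RegSS))
print(c(r2_a = r2_a, r2_b = r2_b, r2_lm = summary(fit)$r.squared))
```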

Example

The figure to the right shows world record times for the mile run, from 1861 to 2003 for men (red) and from 1967 to 1996 for women (blue). [1] Separate regression lines are graphed for men and women, as are horizontal lines identifying the mean world record times. [2] For the male observations, I have graphed vertical lines between the fitted values and the sample mean, thus each vertical line has length $|\hat{y}_i - \bar{y}_{\text{males}}|$. For the female observations, I have graphed vertical lines from the fitted values to the observed values, so each of these lines has length $|y_i - \hat{y}_i|$. The lengths of the lines illustrate that $TSS = RegSS + RSS$ is dominated by RegSS for both genders.

[Figure: world record time (sec) versus year, with fitted regression lines and mean lines for men and women.]

[1] Running a mile in four minutes (240 seconds) translates to a speed of 15 miles per hour.
[2] The risks of extrapolation are apparent since, as of February 2009, the women's world record for the mile run is 4:12.56, set by Svetlana Masterkova of Russia on August 14, 1996.

Correlation: The correlation coefficient r is a measure of the strength of the linear association between two quantitative variables. The correlation coefficient is

$$r = \frac{1}{n-1}\sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right),$$

where $s_x$ and $s_y$ are the sample standard deviations of the explanatory and response variables, respectively. r is sometimes called Pearson's correlation to distinguish it from other measures of association; however, the term correlation coefficient in statistics refers specifically to r. Notice that the correlation coefficient is symmetric in the sense that the response and explanatory variables can be reversed or switched without altering r. The same statement does not hold for the parameter estimates

$$\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x}, \qquad \hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}.$$

$\hat\beta_1$ and r are related:

$$r = \frac{s_x}{s_y}\hat\beta_1 \quad \text{and} \quad \hat\beta_1 = r\,\frac{s_y}{s_x}. \qquad (7)$$

If $s_x = s_y$, then $\hat\beta_1 = r$; if the explanatory variable is twice as variable as the response ($s_x = 2s_y$), then the slope of the least squares line is reduced by a factor of two; i.e., $\hat\beta_1 = r/2$.
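The definition of r, its symmetry in x and y, and relation (7) between r and the slope can be checked with a few lines of R. This is a sketch on simulated data; the settings and object names (`r_formula`, `b1`, `b1_rev`) are illustrative assumptions.

```r
# Simulated data (arbitrary, for illustration only)
set.seed(3)
x <- rnorm(50, mean = 10, sd = 2)
y <- 3 + 1.5 * x + rnorm(50, sd = 2)
n <- length(x)

# r computed from its definition, and by the built-in function
r_formula <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)
r <- cor(x, y)

# Slope of y on x, and slope of x on y (roles of the variables switched)
b1     <- coef(lm(y ~ x))[["x"]]
b1_rev <- coef(lm(x ~ y))[["y"]]

print(c(r_formula = r_formula, r = r, r_switched = cor(y, x)))
print(c(b1 = b1, r_sy_over_sx = r * sd(y) / sd(x)))
print(c(b1_rev = b1_rev, r_sx_over_sy = r * sd(x) / sd(y)))
```

Switching the roles of the variables leaves r unchanged but changes the slope, which is why the product of the two slopes equals $r^2$.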

In the case of simple linear regression, the coefficient of determination is the square of the correlation coefficient, which explains why the symbol for the coefficient of determination is $r^2$.

Properties of correlation

1. r measures only the strength of linear association between two variables.

2. The value of r is always between −1 and 1. If r > 0, then the association between the variables is positive. If r < 0, then the association between the variables is negative. If r = 0, then there is no linear association between the variables.

[Figure: example scatterplots labeled with their correlation coefficients; one surviving label reads r = .48.]

3. If r = 1 or r = −1, then there is perfect positive or negative linear association, respectively, between the variables.

4. The value of r is unitless, and so the meaning of r as a measure of correlation (as discussed above) holds for any pair of quantitative variables.

5. There is no distinction between the x and y variable in terms of one variable explaining the other variable.

More examples:

[Figure, left: Missoula monthly average temperatures (degrees Fahrenheit) versus month. Right: distance from the Sun versus planet position, for all 9 planets.]

Remarks

The plot on the left and above illustrates that the correlation coefficient is a measure of linear association: r = .16 does not imply that a relationship does not exist between the variables. The planet distance plot illustrates that a curved (but monotone) relationship may have a large correlation coefficient provided that the points lie close to a line. For these data, r = .91.

Effect of outliers on r: The correlation coefficient is not resistant to outliers; it is sensitive to them. Consider the two scatterplots below.

[Figure: two scatterplots of y versus x, each containing a single outlier.]

Left panel: r = .06 without the outlier and r = .90 with the outlier. Right panel: r = 1 without the outlier and r = 0 with the outlier.

There are several measures of association which are resistant to the effects of outliers. Two of the more common such measures are Kendall's τ and Spearman's ρ. Both of these measure the degree of monotonic association between two quantitative variables.

Causation: Two variables have a cause-and-effect relationship if changes in one variable cause changes in the other variable.

Examples: A change in the concentration of a catalyst causes changes in a reaction rate. It is argued that smoking causes cardiovascular diseases and some cancers. Many studies have shown associations between smoking level and the incidence of lung cancer, but studies that imply causation in humans have not been carried out.

Example: In 1982, SAT scores were first published on a state-by-state basis. Average scores ranged from 790 (out of 1600) (S. Carolina) to 1088 (Iowa). The data were used to support conflicting positions and claims regarding secondary education. In response, Powell and Steelman [3] attempted to use the data to understand why state-to-state variation was so great and, as they stated, assess the extent to which the compositional/demographic and school-structural characteristics are implicated in SAT differences.

[3] Powell, B. and Steelman, L.C. Variations in state SAT performance: Meaningful or misleading? Harvard Educational Review 54(4).

These data are sometimes called ecological data. Such data are aggregate statistics, usually means. The data (means) differ from observations on individuals, and so inferences drawn about individuals from these data are potentially misleading or incorrect. In this example, the data are measurements on states, not individual high school students, and the reader should not draw inferences about individuals from the data.

Powell and Steelman obtained measurements on a number of variables, but in particular

1. The percentage of total eligible students that took the test
2. The average number of years that the takers took formal courses in social sciences, humanities, and natural sciences
3. The median percentile ranking of the test takers among the students in their classes

The figure below and left is a dotplot showing mean SAT scores by state, and the figure below and right plots rank against percent takers.

[Figure, left: dotplot of mean SAT score (1982) by state. Right: median class rank of takers versus takers (percentage of class), with selected states labeled.]

Midwestern states tend to have low percentages of test takers, principally because most midwestern state universities required that applicants take the ACT test. Presumably, most of the students that took the SAT in these midwestern states were interested in attending college out-of-state. The percentage of test takers may have a substantial role in explaining variation in state SAT means, but it offers little if any insight into the school-structural characteristics responsible for SAT differences.

It's expected that median class rank (of the takers) also explains some of the variation in mean state SAT scores. Differences between states with respect to these variables will confound a comparison of states aimed at understanding the success (or lack thereof) of states at educating students. The other potential explanatory variables (income, years, expenditure and public) are far more interesting with respect to the characteristics of states that succeed in educating students.

The plot below and left shows the relationship between SAT means and (mean) years of formal study. For these data, r = .330. If the confounding variables percent takers and median class rank are accounted for, or adjusted for, then the apparent relationship is stronger. The figure below and right plots the adjusted mean SAT scores against adjusted years of formal study. For these data, r = .561.

[Figure, left: state mean SAT versus years of formal studies. Right: state mean SAT (adjusted) versus years of formal studies (adjusted).]

Percent takers and median class rank are called confounding variables because the variables affect SAT means and also are associated with years of formal schooling. Differences among states with respect to percent takers and median class rank confound, or obscure, the relationship between mean SAT and mean years of formal schooling. Confounding variables such as these are sometimes called lurking variables, particularly if they are not measured but are understood to confound the relationship between two other measured variables.

Residual error

If the null model is assumed to be true, then the estimator of the residual variance $\sigma^2$ is

$$\frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2.$$

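If the linear regression model is assumed to be true, then the estimator of $\sigma^2$ is the sample variance of the residuals

$$s^2 = \frac{1}{n-2}\sum_{i=1}^n (\hat\varepsilon_i - \bar{\hat\varepsilon})^2 = \frac{1}{n-2}\sum_{i=1}^n \hat\varepsilon_i^2 = \frac{1}{n-2}\sum_{i=1}^n (y_i - \hat{y}_i)^2,$$

where the middle equality holds because the residuals sum to zero. The denominator $n - 2$ must be used so that the estimator is unbiased (an estimator is unbiased provided that its expected value is the same as the estimand: $E(s^2) = \sigma^2$). The term $n - 2$ is the degrees of freedom associated with $\sum (y_i - \hat{y}_i)^2$ (assuming there is one explanatory variable). The degrees of freedom are, in most cases, $n - (k + 1)$ where $k + 1$ is the number of parameters in the linear model (including $\beta_0$). The degrees of freedom associated with a particular sum of squares is the scaling factor that makes the scaled sum of squares an unbiased estimator of $\sigma^2$. Determining the specific degrees of freedom for a sum of squares is sometimes complex, though in most conventional problems it is, as stated above, $n - k - 1$.

Multiple regression

The linear regression model is extended from one explanatory variable to two, and then to k > 1. With two explanatory variables, the model is

$$E(Y_i \mid x_{i,1}, x_{i,2}) = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2}.$$

Once again, the objective is to use the n observation triples $(y_i, x_{i,1}, x_{i,2})$, $i = 1, \ldots, n$, to estimate the parameters $\beta_0$, $\beta_1$, and $\beta_2$ in a manner that will result in the smallest possible error, where error is defined (in this context) to be the sum of the squared residuals. Now the fitted values are $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_{i,1} + \hat\beta_2 x_{i,2}$ and the associated residual is $\hat\varepsilon_i = y_i - \hat{y}_i = y_i - \hat\beta_0 - \hat\beta_1 x_{i,1} - \hat\beta_2 x_{i,2}$. The sum of the squared residuals is a function of the estimates $\hat\beta_0$, $\hat\beta_1$ and $\hat\beta_2$. It is

$$S(\hat\beta_0, \hat\beta_1, \hat\beta_2) = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_{i,1} - \hat\beta_2 x_{i,2})^2.$$

The least squares parameter estimates $\hat\beta_0, \hat\beta_1, \hat\beta_2$ are the values of $\beta_0, \beta_1, \beta_2$ that minimize $S(\beta_0, \beta_1, \beta_2)$. The calculus-based approach computes the partial derivatives $\partial S/\partial\beta_0$, $\partial S/\partial\beta_1$ and $\partial S/\partial\beta_2$, sets each of the partial derivatives equal to zero, and solves the resulting system of normal equations for $\beta_0$, $\beta_1$ and $\beta_2$. The partial derivatives are

$$\frac{\partial S(\beta_0, \beta_1, \beta_2)}{\partial\beta_0} = \frac{\partial}{\partial\beta_0}\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i,1} - \beta_2 x_{i,2})^2 = -2\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i,1} - \beta_2 x_{i,2}) = -2\left(\sum_{i=1}^n y_i - n\beta_0 - \beta_1\sum_{i=1}^n x_{i,1} - \beta_2\sum_{i=1}^n x_{i,2}\right).$$

Next,

$$\frac{\partial S(\beta_0, \beta_1, \beta_2)}{\partial\beta_1} = \frac{\partial}{\partial\beta_1}\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i,1} - \beta_2 x_{i,2})^2 = -2\sum_{i=1}^n x_{i,1}(y_i - \beta_0 - \beta_1 x_{i,1} - \beta_2 x_{i,2}) = -2\left(\sum_{i=1}^n x_{i,1}y_i - \beta_0\sum_{i=1}^n x_{i,1} - \beta_1\sum_{i=1}^n x_{i,1}^2 - \beta_2\sum_{i=1}^n x_{i,1}x_{i,2}\right).$$

Similarly,

$$\frac{\partial S(\beta_0, \beta_1, \beta_2)}{\partial\beta_2} = -2\left(\sum_{i=1}^n x_{i,2}y_i - \beta_0\sum_{i=1}^n x_{i,2} - \beta_1\sum_{i=1}^n x_{i,1}x_{i,2} - \beta_2\sum_{i=1}^n x_{i,2}^2\right).$$

The normal equations are

$$n\beta_0 + \beta_1\sum x_{i,1} + \beta_2\sum x_{i,2} = \sum y_i$$
$$\beta_0\sum x_{i,1} + \beta_1\sum x_{i,1}^2 + \beta_2\sum x_{i,1}x_{i,2} = \sum x_{i,1}y_i$$
$$\beta_0\sum x_{i,2} + \beta_1\sum x_{i,1}x_{i,2} + \beta_2\sum x_{i,2}^2 = \sum x_{i,2}y_i.$$

The system can be written in matrix form as

$$\begin{pmatrix} n & \sum x_{i,1} & \sum x_{i,2} \\ \sum x_{i,1} & \sum x_{i,1}^2 & \sum x_{i,1}x_{i,2} \\ \sum x_{i,2} & \sum x_{i,1}x_{i,2} & \sum x_{i,2}^2 \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum x_{i,1}y_i \\ \sum x_{i,2}y_i \end{pmatrix}.$$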
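The matrix form of the normal equations translates almost directly into R. In the sketch below, the data are simulated and the names `X`, `XtX`, `Xty` and `b` are hypothetical; the matrix of sums is built as the cross-product of a design matrix whose first column is all ones, and the system is solved with `solve()`. Solving the normal equations this way is done here only to mirror the algebra; `lm()` itself works through a QR decomposition of the design matrix rather than forming the matrix of sums.

```r
# Simulated data with two explanatory variables (arbitrary, for illustration)
set.seed(4)
n  <- 50
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 1 + 2 * x1 - 1 * x2 + rnorm(n)

# Design matrix: a column of ones (for beta_0), then x1 and x2
X <- cbind(1, x1, x2)

# Left-hand matrix and right-hand vector of the normal equations
XtX <- crossprod(X)      # entries n, sum(x1), sum(x2), sum(x1^2), sum(x1*x2), ...
Xty <- crossprod(X, y)   # entries sum(y), sum(x1*y), sum(x2*y)

# Solve the 3 x 3 system for (beta_0, beta_1, beta_2)
b <- solve(XtX, Xty)

print(as.numeric(b))
print(coef(lm(y ~ x1 + x2)))  # same estimates from lm()
```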

The solution is

$$\begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = \begin{pmatrix} n & \sum x_{i,1} & \sum x_{i,2} \\ \sum x_{i,1} & \sum x_{i,1}^2 & \sum x_{i,1}x_{i,2} \\ \sum x_{i,2} & \sum x_{i,1}x_{i,2} & \sum x_{i,2}^2 \end{pmatrix}^{-1}\begin{pmatrix} \sum y_i \\ \sum x_{i,1}y_i \\ \sum x_{i,2}y_i \end{pmatrix},$$

provided that the inverse exists. If there are $k \ge 2$ explanatory variables, the system is

$$\begin{pmatrix} n & \sum x_{i,1} & \cdots & \sum x_{i,k} \\ \sum x_{i,1} & \sum x_{i,1}^2 & \cdots & \sum x_{i,1}x_{i,k} \\ \vdots & \vdots & \ddots & \vdots \\ \sum x_{i,k} & \sum x_{i,1}x_{i,k} & \cdots & \sum x_{i,k}^2 \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum x_{i,1}y_i \\ \vdots \\ \sum x_{i,k}y_i \end{pmatrix}.$$

The system consists of $k + 1$ equations in $k + 1$ unknowns ($\beta_0, \beta_1, \ldots, \beta_k$) and is usually solved by a computer implementation of Householder's or a similar algorithm. These algorithms are remarkably accurate and fast.

Example

The Ginzberg data set consists of observations made on patients hospitalized for depression. Three variables were measured on each patient: depression (Beck self-report depression scale), simplicity (a score of a subject's need to see the world in black and white), and fatalism score. [4] Both explanatory variables together explain $R^2$ = [...]% of the variation in depression score. The figure to the right shows the fitted model as a surface. The surface of the plane consists of estimates of the expected depression score given the fatalism and simplicity scores.

[Figure: fitted regression plane for depression score as a function of simplicity score and fatalism score.]

[4] A fatalistic outlook is one in which the person feels they lack control of events to the extent that there is little point to attempting to improve their situation.

There are very few situations in which there is a correct regression model in the sense that there is some particular combination of explanatory variables that describes the mean or expected value of y better than all other choices of explanatory variables. Every model is wrong, but some models are less wrong than others (which is to say, they more accurately describe the expected value of y).

Some examples of multiple linear regression models are

$$E(Y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
$$E(Y \mid x_1, x_2) = \beta_1 x_1 + \beta_2 x_2$$
$$E(Y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$$
$$E(Y \mid x_1, x_2) = \beta_0 + \beta_1\log(x_1) + \beta_2\log(x_2).$$

An example of a nonlinear multiple regression model is

$$E(Y \mid x_1, x_2) = \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2).$$

So far, the multiple linear regression model describes only the expected value of the response variable Y. For inferential purposes, the model needs to describe the distribution of the response variable more completely. The full model is very similar to the simple linear regression model. It is

$$Y \sim N(E(Y \mid x_1, \ldots, x_k), \sigma) \quad \text{where} \quad E(Y \mid x_1, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k.$$

The model states that the variance of Y is constant, provided that the expected value, or mean, is correctly specified. This condition is expressed by writing $\mathrm{var}(Y \mid x_1, \ldots, x_k) = \sigma^2$. An equivalent specification of the model is

$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon \quad \text{where} \quad \varepsilon \sim N(0, \sigma).$$

The residuals about the fitted regression model must be realizations of independent random variables $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ to carry out accurate inference (hypothesis tests and confidence intervals). A compact model specification relevant to the data is

$$Y_i = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i} + \varepsilon_i, \quad \text{where } \varepsilon_i \overset{iid}{\sim} N(0, \sigma), \; i = 1, 2, \ldots, n.$$

The notation iid is shorthand for independent and identically distributed. An additional assumption has been stated: the observations (or the residuals) are independently distributed.

Regression surfaces: Mathematically, a linear model with k explanatory variables describes a k-dimensional surface (a hyperplane) within $R^{k+1}$. For example, a model with two explanatory variables $x_1$ and $x_2$ (k = 2) defines a plane within three-space, or $R^3$. There are three axes: y, $x_1$ and $x_2$. The regression plane intersects the y axis where $x_1 = x_2 = 0$. The height of the plane at the intersection is $\beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 0 = \beta_0$. The coefficient $\beta_1$ is the slope of the plane if $x_2$ is fixed at any particular value, and $\beta_2$ is the slope of the plane if $x_1$ is fixed at any particular value. If more than two explanatory variables are present in the model, then it is difficult to develop a geometrical interpretation. For example, a model with three variables describes a three-dimensional object embedded in 4-space.

Interpretation of regression coefficients

A coefficient, or parameter estimate, for a multiple linear regression model is interpreted as the effect of the associated explanatory variable if all other variables are held fixed. The effect of a one unit increase of $x_1$ on $E(Y \mid x_1, \ldots, x_k)$, when all other variables are held fixed, is

$$E(Y \mid x_1 + 1, x_2, \ldots, x_k) - E(Y \mid x_1, x_2, \ldots, x_k) = \beta_0 + \beta_1(x_1 + 1) + \beta_2 x_2 + \cdots + \beta_k x_k - (\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k) = \beta_1.$$

The effect of a one unit increase in $x_1$ does not depend on the value of any of the variables. The effect of a simultaneous one unit increase in both $x_1$ and $x_2$ is the sum of the coefficients $\beta_1$ and $\beta_2$, since

$$E(Y \mid x_1 + 1, x_2 + 1, \ldots, x_k) - E(Y \mid x_1, x_2, \ldots, x_k) = \beta_0 + \beta_1(x_1 + 1) + \beta_2(x_2 + 1) + \cdots + \beta_k x_k - (\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k) = \beta_1 + \beta_2.$$

For this reason, the model $E(Y \mid x_1, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$ is called an additive model.

The parameter estimates (coefficients) from the simple linear regression of depression score on fatalism are shown in Table 1; Table 2 shows the coefficients from the multiple linear regression of depression score on fatalism and simplicity. The coefficients associated with fatalism differ between models, in part because simplicity is correlated with fatalism (r = .631). The interpretation of the parameter estimate associated with fatalism differs between models because the estimated change in mean depression score associated with a one unit change in fatalism is numerically different, and because the parameter estimate from the multiple regression model describes the estimated effect provided simplicity is held fixed.
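The way a coefficient changes when a correlated explanatory variable enters the model can be seen on simulated data. The sketch below does not use the Ginzberg data; the variables `x1` and `x2`, their coefficients, and the seed are all arbitrary assumptions. It also checks a standard algebraic identity that is not stated in the text: the simple regression slope on `x1` equals the multiple regression coefficient of `x1` plus the coefficient of `x2` times the slope from regressing `x2` on `x1`.

```r
# Simulated, positively correlated explanatory variables (illustration only)
set.seed(5)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n, sd = 0.8)
y  <- 0.4 * x1 + 0.4 * x2 + rnorm(n, sd = 0.5)

simple   <- lm(y ~ x1)        # x2 is not held fixed
multiple <- lm(y ~ x1 + x2)   # effect of x1 with x2 held fixed

b_simple <- coef(simple)[["x1"]]
b1 <- coef(multiple)[["x1"]]
b2 <- coef(multiple)[["x2"]]
d  <- coef(lm(x2 ~ x1))[["x1"]]  # how x2 moves, on average, with x1

# Additivity: raising x1 by one unit with x2 fixed changes the fit by b1
new0 <- data.frame(x1 = 1, x2 = 2)
new1 <- data.frame(x1 = 2, x2 = 2)
effect_x1 <- unname(predict(multiple, new1) - predict(multiple, new0))

print(c(simple = b_simple, multiple = b1, b1_plus_b2_d = b1 + b2 * d))
print(c(effect_x1 = effect_x1, b1 = b1))
```

Because `x1` and `x2` are positively correlated and both coefficients are positive in this simulation, the simple regression slope exceeds the multiple regression coefficient, which parallels the fatalism and simplicity discussion.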

Table 1: Coefficients and standard errors obtained from the linear regression of depression score on fatalism. $r^2$ = [...].

Variable     Coefficient   Std. Error   t-statistic   P(T > |t|)
Intercept    [...]         [...]        [...]         [...]
Fatalism     [...]         [...]        [...]         < .0001

Table 2: Coefficients and standard errors obtained from the linear regression of depression score on fatalism and simplicity. $r^2$ = [...].

Variable     Coefficient   Std. Error   t-statistic   P(T > |t|)
Intercept    [...]         [...]        [...]         [...]
Fatalism     [...]         [...]        [...]         < .0001
Simplicity   [...]         [...]        [...]         [...]

The simple linear regression model estimates the effect of a one unit increase in fatalism to be an increase in depression score of .657. No control of simplicity is assumed. Because fatalism and simplicity are positively correlated, an increase in fatalism is associated with an increase in simplicity (for most patients), so a one unit increase in fatalism without fixing simplicity has a greater effect than a one unit increase in fatalism when fixing simplicity. Analyzing the effect of fatalism with simplicity fixed is unrealistic, and in general, the interpretation of regression coefficients obtained from observational studies is not as clear as in experimental studies. In experimental studies, the values of the explanatory variables (properly called factors) are chosen by the researcher. With two factors with $l_1$ and $l_2$ levels respectively, there will usually be approximately the same number of observations at each of the $l_1 \times l_2$ treatments (combinations).

Model assessment

As with k = 1 explanatory variable, model assessment is based on the decomposition of the total sum of squares, TSS. TSS measures the fit of the null model (which constrains all the linear model parameters to be 0 except for the intercept $\beta_0$). The estimator of $\beta_0$ is the sample mean and the total sum of squares is

$$TSS = \sum_{i=1}^n (y_i - \bar{y})^2.$$

As with k = 1, the measure of fit for the linear regression model is the sum of squared residuals

$$RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_{i,1} - \cdots - \hat\beta_k x_{i,k})^2.$$

The sum of squared residuals provides an estimator of $\sigma^2 = \mathrm{var}(Y \mid x_1, \ldots, x_k)$. Fox uses $s_E^2$ to denote the estimator:

$$s_E^2 = \frac{RSS}{n - k - 1} = \frac{\sum \hat\varepsilon_i^2}{n - k - 1} = \frac{\sum (y_i - \hat\beta_0 - \hat\beta_1 x_{i,1} - \cdots - \hat\beta_k x_{i,k})^2}{n - k - 1}.$$

For the Ginzberg data, the estimated variance of depression scores was .250 (using the null model $E(Y) = \beta_0$) and $s_E^2 = .123$. The relative reduction is $(.250 - .123)/.250 \approx .508$, which is not (as should be expected) much different from $r^2$ = [...]. The difference is attributable to the degrees of freedom associated with the two sums of squares (TSS for the null model and RSS for the linear regression model). The relative reduction in the variance estimates is sometimes called the adjusted $r^2$. Neither $r^2$ nor adjusted $r^2$ is unbiased, but the adjusted $r^2$ is less biased.

The standard error of the regression is $s_E = \sqrt{s_E^2}$ and it estimates, roughly, the average distance of an observation y from its fitted value $\hat{y}$.

The partitioning of TSS according to $TSS = RegSS + RSS$ holds as it did with a single variable, and the coefficient of determination

$$r^2 = \frac{RegSS}{TSS}$$

has the same interpretation as before: it's the proportion of the variation in the response variable explained by the regression.

Standardized regression coefficients

In some cases, it's desirable to compare parameter estimates $\hat\beta_1, \ldots, \hat\beta_k$, or a subset of the estimates, with the intent of assessing the relative importance of the associated explanatory variables $x_1, \ldots, x_k$ toward explaining the expected value of the response variable. If the explanatory variables are measured in different units, then a direct comparison of, say, $\hat\beta_1$, the estimated change in the mean of Y given a one unit change in $x_1$, to $\hat\beta_2$, the estimated change in the mean of Y given a one unit change in $x_2$, has little meaning because a one unit change in $x_1$ may span a large portion of the range of $x_1$ while a one unit change in $x_2$ may span a small portion of the range of $x_2$.
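The estimator $s_E^2$ and the adjusted $r^2$, computed as the relative reduction in the variance estimates, can be reproduced from the sums of squares. The sketch below uses simulated data rather than the Ginzberg data, and the object names are hypothetical; the results are compared with the values reported by `summary()` for a fitted `lm` object.

```r
# Simulated data with k = 2 explanatory variables (illustration only)
set.seed(6)
n  <- 60
k  <- 2
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.3 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

TSS <- sum((y - mean(y))^2)
RSS <- sum(resid(fit)^2)

s2_null <- TSS / (n - 1)       # variance estimate under the null model
s2_E    <- RSS / (n - k - 1)   # variance estimate under the regression model
s_E     <- sqrt(s2_E)          # standard error of the regression

r2     <- (TSS - RSS) / TSS
adj_r2 <- (s2_null - s2_E) / s2_null  # relative reduction in variance estimates

print(c(s2_null = s2_null, s2_E = s2_E, s_E = s_E))
print(c(r2 = r2, adj_r2 = adj_r2))
```

The adjusted value is never larger than $r^2$, since it charges the regression model for the $k$ extra degrees of freedom it uses.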

20 One solution to the problem is to standardize all variables 5 according to Z i,y y i y Z i,1 x i,1 x 1 s 1.. Z i,k x i,k x k s k. Afterwards, a one unit change in any explanatory variable x j away from x j has the same meaning: it spans about 68/2 34% of the observations closest to the center of the distribution of x j provided that the distribution of x j is roughly symmetric and without substantial outliers. Let β 1,..., β k denote the parameter estimates obtained from a model fit using the standardized variables. Afterward, a comparison of the magnitudes of β 1,..., β k reflects the relative importance of each variable towards explaining the conditional mean of Y. All variables in Ginzberg data set (from the R library alr3) have been scaled to have the same standard deviation, and so the relative importance of fatalism and simplicity score towards explaining expected depression score can be assessed by comparing the parameter estimates β Fatalism.418 and β Simplicity.380. The variables have approximately the same contribution towards explaining the expectation of Y. It s not necessary to scale or standardize the variables before the regression. The standardized regression coefficients (those coefficients obtained after standardizing all variables) can be computed using the formulas β 1 s 1 β1.. β k s k βk where β 1,..., β k are the estimates obtained without standardizing the variables. The explanation of why these formulas yield the standardized coefficients provides some insight into the properties of the coefficients. The first normal equation in the system for k explanatory variables is nβ 0 + β 1 x i,1 + β 1 x i,2 5 It s not necessary to standardize the response variable but it does lead to a simple shortcut. y i 36

and it implies that

\[
\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}_1 - \cdots - \hat\beta_k \bar{x}_k.
\]

Substituting this expression for $\hat\beta_0$ into the definition of a fitted value leads to

\[
\begin{aligned}
\hat{y}_i &= \hat\beta_0 + \hat\beta_1 x_{i,1} + \cdots + \hat\beta_k x_{i,k} \\
&= (\bar{y} - \hat\beta_1 \bar{x}_1 - \cdots - \hat\beta_k \bar{x}_k) + \hat\beta_1 x_{i,1} + \cdots + \hat\beta_k x_{i,k} \\
&= \bar{y} + \hat\beta_1 (x_{i,1} - \bar{x}_1) + \cdots + \hat\beta_k (x_{i,k} - \bar{x}_k) \\
\Rightarrow \quad \hat{y}_i - \bar{y} &= \hat\beta_1 (x_{i,1} - \bar{x}_1) + \cdots + \hat\beta_k (x_{i,k} - \bar{x}_k).
\end{aligned}
\tag{8}
\]

From the third equation, it is apparent that the fitted model passes through the point $(\bar{y}, \bar{x}_1, \ldots, \bar{x}_k)$. (This can be verified by evaluating the model at $x_1 = \bar{x}_1, \ldots, x_k = \bar{x}_k$; the fitted value is $\hat{y} = \bar{y}$.) A few more operations on the last equation show how to compute the standardized coefficients. First, divide both sides by $s_y$:

\[
\frac{\hat{y}_i - \bar{y}}{s_y} = \frac{\hat\beta_1 (x_{i,1} - \bar{x}_1) + \cdots + \hat\beta_k (x_{i,k} - \bar{x}_k)}{s_y}.
\]

Multiply the $j$th term on the right by $1 = s_j / s_j$, $j = 1, \ldots, k$:

\[
\frac{\hat{y}_i - \bar{y}}{s_y} = \frac{s_1 \hat\beta_1 (x_{i,1} - \bar{x}_1)}{s_y s_1} + \cdots + \frac{s_k \hat\beta_k (x_{i,k} - \bar{x}_k)}{s_y s_k}.
\]

Rearrange terms:

\[
\frac{\hat{y}_i - \bar{y}}{s_y} = \frac{s_1}{s_y}\hat\beta_1 \left( \frac{x_{i,1} - \bar{x}_1}{s_1} \right) + \cdots + \frac{s_k}{s_y}\hat\beta_k \left( \frac{x_{i,k} - \bar{x}_k}{s_k} \right).
\]

In this expression for the linear regression model, the response (through its fitted values) and the explanatory variable observations appear in standardized form. The coefficients associated with the standardized variables are

\[
\hat\beta^{*}_1 = \hat\beta_1 \frac{s_1}{s_y}, \qquad \ldots, \qquad \hat\beta^{*}_k = \hat\beta_k \frac{s_k}{s_y}.
\]

Returning to the last line in derivation (8),

\[
\hat{y}_i - \bar{y} = \hat\beta_1 (x_{i,1} - \bar{x}_1) + \cdots + \hat\beta_k (x_{i,k} - \bar{x}_k),
\]

it is apparent that the regression sum of squares (which is the sum of squares of the left-hand terms, $i = 1, \ldots, n$) depends on the magnitudes of the regression coefficients and the standard deviations of the explanatory variables. A regression coefficient that is large in magnitude and associated with a variable with a large standard deviation will have the greatest impact upon the regression sum of squares.

Example. Returning to the data on 1984 record times of Scottish fell races, the relative importance of distance versus elevation gain can be assessed by regressing the logarithm of record time on the logarithms of distance and elevation gain, and then computing the standardized coefficients. Parameter estimates are shown in the table below.

Table 3: Coefficients and standard errors obtained from the linear regression of log-record time on log-distance and log-elevation gain using the data set hills from the MASS library.

Variable         Parameter estimate   Std. Error   t-statistic   P(|T| > |t|)
Constant                                                         < .0001
log(distance)                                                    < .0001
log(climb)

The standard deviations of the variables (obtained from sapply(log(hills), sd); older versions of R accepted sd(log(hills))) are $s_y = .7048$, $s_{\log(\text{distance})} = .5928$ and $s_{\log(\text{climb})} = .8110$, and the standardized coefficients are

\[
\hat\beta^{*}_{\log(\text{distance})} = \frac{s_{\log(\text{distance})}}{s_y}\,\hat\beta_1
\qquad \text{and} \qquad
\hat\beta^{*}_{\log(\text{climb})} = \frac{s_{\log(\text{climb})}}{s_y}\,\hat\beta_2.
\]

Distance of the race is far more important towards determining the record times than the elevation gain. The question is muddied by log(distance) and log(climb) being moderately or strongly correlated ($r = .700$).

Another approach to assessing the influence of log(climb) is to conduct a regression of log(time) on log(distance) and examine the association between the residuals from that model and log(climb). The residuals are, in essence, the portion of log(time) that has not been accounted for by log(distance). If a significant fraction of the variation in the residuals is explained by log(climb), then it can be concluded that elevation gain is important. However, elevation gain is associated with distance, and we ought to remove this association from elevation gain if possible. This is accomplished by regressing log(climb) on log(distance) and using the residuals from that regression in place of log(climb).
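The standardized coefficients in the fell-race example can be recomputed in two ways, as a check on the rescaling formula. The following is a minimal sketch, assuming the hills data frame from the MASS library with its columns dist, climb and time; the object names lh, fit, s, b, b.std, z and b.z are illustrative choices and do not come from the text.

```r
# Sketch: standardized coefficients for the fell-race example, two ways.
library(MASS)                        # provides hills (columns dist, climb, time)

lh  <- log(hills)                    # log-transform all three columns
fit <- lm(time ~ dist + climb, data = lh)

s <- sapply(lh, sd)                  # standard deviation of each logged column
b <- coef(fit)[c("dist", "climb")]   # unstandardized slope estimates

# Route 1: rescale each slope by s_j / s_y
b.std <- b * s[c("dist", "climb")] / s["time"]

# Route 2: refit after standardizing every variable, response included
z   <- as.data.frame(scale(lh))
b.z <- coef(lm(time ~ dist + climb, data = z))[c("dist", "climb")]

# The two columns should agree up to rounding error
print(round(cbind(rescaled = b.std, refit = b.z), 4))
```

Both routes should return the same pair of numbers, which is the content of the derivation: standardizing the variables only rescales each slope by $s_j / s_y$.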

A partial residual plot plots the residuals from a regression of $Y$ on $x_j$ against the residuals from a regression of $x_i$ on $x_j$ (this construction is also widely known as an added-variable plot). The resulting plot reveals the extent to which additional information in $x_i$ is useful for explaining the response variable.

[Figure: partial residual plot of the log(time) residuals (vertical axis) against the log(climb) residuals (horizontal axis).]

The accompanying plot is a partial residual plot showing that there is a single, highly influential race (Knock Hill; the recorded pace was faster than a 4-minute mile) confounding the contributory effect of log(climb) on log(time), given log(distance). For these partial residuals, $r = .309$; however, with Knock Hill removed from the residual pairs, the correlation coefficient is $.665$. The partial residual plot also suggests that the log transformation of elevation gain is unhelpful because of the apparent curvilinear relationship between the residuals. The conclusion is that elevation gain is important but that it is difficult to gauge its relative importance because it is moderately correlated with distance, and because of the impact of the dubious observation on the Knock Hill race.

R code used to construct the partial residual plots follows.

    # loghills: data frame with columns logtime, logdistance and logclimb
    # Residuals of log(time) and log(climb) after removing log(distance)
    y.r <- resid(lm(logtime ~ logdistance, data = loghills))
    x.r <- resid(lm(logclimb ~ logdistance, data = loghills))
    plot(x.r, y.r, pch = 16, xlab = "log(climb)", ylab = "log(time)")
    cor(x.r, y.r)
    # Repeat without observation 18 (Knock Hill)
    cor(x.r[-18], y.r[-18])
    plot(x.r[-18], y.r[-18], pch = 16, xlab = "log(climb)", ylab = "log(time)")
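The residual pairs used in the plot have a further property that is easy to check numerically: the least squares slope of the log(time) residuals on the log(climb) residuals equals the coefficient of log(climb) in the regression on both explanatory variables (the Frisch-Waugh-Lovell result). The sketch below assumes that loghills holds the logged columns of the hills data from the MASS library under the names used in the code above. That construction is an assumption, since the text does not show how loghills was built, and the names slope.partial and slope.full are illustrative.

```r
# Sketch: the slope in the residual-on-residual plot equals the
# multiple-regression coefficient of log(climb).
library(MASS)

# Assumption: loghills contains the logged hills variables under these names
loghills <- data.frame(logtime     = log(hills$time),
                       logdistance = log(hills$dist),
                       logclimb    = log(hills$climb))

# Residuals after removing log(distance) from the response and from log(climb)
y.r <- resid(lm(logtime  ~ logdistance, data = loghills))
x.r <- resid(lm(logclimb ~ logdistance, data = loghills))

# Slope of the line through the residual pairs
slope.partial <- unname(coef(lm(y.r ~ x.r))["x.r"])

# Coefficient of log(climb) in the regression on both variables
slope.full <- unname(coef(lm(logtime ~ logdistance + logclimb,
                             data = loghills))["logclimb"])

print(c(partial = slope.partial, full = slope.full))
```

Because the two slopes coincide, a single point that tilts the line in the residual plot, such as Knock Hill, tilts the fitted coefficient of log(climb) by the same amount.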
