Chapter 2. Continued

Proofs For ANOVA

Proof of ANOVA Identity

We are going to prove that SST = SSR + SSE. Writing

Y_i − Ȳ = (Y_i − Ŷ_i) + (Ŷ_i − Ȳ),

squaring both sides and summing over all i = 1, ..., n, we get

Σ (Y_i − Ȳ)² = Σ (Y_i − Ŷ_i)² + Σ (Ŷ_i − Ȳ)² + 2 Σ (Y_i − Ŷ_i)(Ŷ_i − Ȳ).

Noting that

Y_i − Ŷ_i = Y_i − (b_0 + b_1 X_i) = Y_i − (Ȳ − b_1 X̄ + b_1 X_i) = (Y_i − Ȳ) − b_1 (X_i − X̄),

Ŷ_i − Ȳ = (b_0 + b_1 X_i) − Ȳ = (Ȳ − b_1 X̄) + b_1 X_i − Ȳ = b_1 (X_i − X̄),

the product term in the above equation can be simplified as

Σ (Y_i − Ŷ_i)(Ŷ_i − Ȳ) = Σ [(Y_i − Ȳ) − b_1 (X_i − X̄)] b_1 (X_i − X̄) = b_1 S_xy − b_1² S_xx = b_1 [S_xy − b_1 S_xx].

And since b_1 = S_xy/S_xx, we have S_xy = b_1 S_xx, so the right-hand side of the above equation is zero. Therefore,

Σ (Y_i − Ȳ)² = Σ (Y_i − Ŷ_i)² + Σ (Ŷ_i − Ȳ)², or SST = SSE + SSR.

Expected Mean Squares

We are going to prove that

E(MSE) = σ², E(MSR) = σ² + β_1² S_xx.

Note that these results allow us to compare MSE and MSR in an average sense when testing H_0: β_1 = 0 vs. H_1: β_1 ≠ 0. Since E(MSR) > E(MSE) when β_1 ≠ 0, we can create a decision rule: reject H_0 if MSR/MSE is large. This gives the F-test of ANOVA.

To find E{MSE}, we may use the result quoted earlier that

SSE/σ² ~ χ²(n − 2).

This gives E{SSE/σ²} = n − 2, and hence

E{MSE} = E{SSE/(n − 2)} = (n − 2)σ²/(n − 2) = σ².

Alternatively, we will prove, without the normality assumption, that

E{SSTO} = (n − 1)σ² + β_1² S_xx.

Using the expression for E{MSR}, this in turn provides

E{SSE} = E{SST} − E{SSR} = (n − 1)σ² + β_1² S_xx − σ² − β_1² S_xx = (n − 2)σ²,

which implies that E{MSE} = σ².

Proof of E{SST} = (n − 1)σ² + β_1² S_xx

Using the model Y_i = β_0 + β_1 X_i + ɛ_i, we can write

Ȳ = (1/n) Σ (β_0 + β_1 X_i + ɛ_i) = β_0 + β_1 X̄ + ɛ̄.

Hence,

Y_i − Ȳ = β_1 (X_i − X̄) + (ɛ_i − ɛ̄).

Squaring both sides and summing, we get

SST = β_1² S_xx + S_ɛɛ + 2 β_1 S_xɛ.

Note that since S_ɛɛ/(n − 1) denotes the sample variance of ɛ_1, ..., ɛ_n, which are i.i.d. with mean zero and variance σ², we have E{S_ɛɛ} = (n − 1)σ². For the expectation of the product term we see that

E{β_1 Σ (X_i − X̄)(ɛ_i − ɛ̄)} = β_1 Σ (X_i − X̄) E{ɛ_i − ɛ̄} = 0,

since E{ɛ_i − ɛ̄} = 0 by the assumption on the errors. This proves that

E{SST} = (n − 1)σ² + β_1² S_xx.
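As a quick numerical illustration of the identity just proved, a small least squares fit (on made-up data) shows both that the cross-product term vanishes and that SST = SSR + SSE:

```python
import numpy as np

# Hypothetical data, for illustration only
rng = np.random.default_rng(0)
X = np.linspace(1.0, 10.0, 20)
Y = 2.0 + 0.5 * X + rng.normal(scale=1.0, size=X.size)

# Least squares estimates: b1 = Sxy/Sxx, b0 = Ybar - b1*Xbar
Sxx = np.sum((X - X.mean()) ** 2)
Sxy = np.sum((X - X.mean()) * (Y - Y.mean()))
b1 = Sxy / Sxx
b0 = Y.mean() - b1 * X.mean()
Yhat = b0 + b1 * X

SST = np.sum((Y - Y.mean()) ** 2)
SSE = np.sum((Y - Yhat) ** 2)
SSR = np.sum((Yhat - Y.mean()) ** 2)

# The cross-product term is zero, so SST = SSR + SSE (up to float error)
assert abs(np.sum((Y - Yhat) * (Yhat - Y.mean()))) < 1e-8
assert abs(SST - (SSR + SSE)) < 1e-8
```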
Proof of E{SSR} = σ² + β_1² S_xx

For this result we note that Ŷ_i − Ȳ = b_1 (X_i − X̄), hence

SSR = Σ (Ŷ_i − Ȳ)² = b_1² Σ (X_i − X̄)² = b_1² S_xx.

Therefore

E{SSR} = S_xx E{b_1²}.

To evaluate E{b_1²}, use the formula Var(b_1) = E{b_1²} − (E{b_1})², which gives

E{b_1²} = Var(b_1) + (E{b_1})².

Using the sampling properties of b_1 (E{b_1} = β_1 and Var(b_1) = σ²/S_xx), we obtain from the above equation E{b_1²} = σ²/S_xx + β_1², and hence

E{SSR} = S_xx [σ²/S_xx + β_1²] = σ² + β_1² S_xx.

Equivalence of t and F for H_0: β_1 = 0 vs. H_1: β_1 ≠ 0

The test statistic t is given by

t = b_1/s{b_1}.

Using the formula s²{b_1} = MSE/S_xx, we find that

t² = b_1² S_xx / MSE.

The numerator b_1² S_xx may be recognized to be SSR = MSR (since SSR carries one degree of freedom). Hence

t² = MSR/MSE,

which is the usual F, the ANOVA F-test statistic. Since t²(ν) follows the F(1, ν) distribution, the critical region |t| > t(1 − α/2; n − 2) is equivalent to

F > t²(1 − α/2; n − 2) = F(1 − α; 1, n − 2).

2.8 General Linear Test Approach

This approach is based on the fact that under restrictions on the model the sum of squared errors is generally larger than without any restriction (because the SSE without any restrictions is the absolute minimum). The difference in these sums of squared errors is used to propose a test statistic for the hypothesis imposing the restrictions on the model. The model without any hypothesis is known as the full model, and the model under the hypothesis is called the reduced model. Let SSE(F) and SSE(R) denote the sums of squared errors under these models. Since

SSE(F) = Σ (Y_i − Ŷ_i)² = min Σ (Y_i − Ŷ_i)²

over all linear predictions Ŷ_i, we have SSE(F) ≤ SSE(R). Under departures from H_0, the difference SSE(R) − SSE(F) is expected to be significantly large. Hence, we may create a test statistic as

F = {[SSE(R) − SSE(F)] / (df_R − df_F)} / {SSE(F) / df_F}.

This test statistic follows an F(df_R − df_F, df_F) distribution under the null hypothesis, when the errors are assumed normally distributed. Hence the decision rule is to reject the null hypothesis when

F > F(1 − α; df_R − df_F, df_F).

Testing H_0: β_1 = 0

In this case, SSE(F) = SSE.
The reduced model becomes Y_i = β_0 + ɛ_i, in which case the least squares estimator of β_0 becomes b_0(R) = Ȳ and Ŷ_i(R) = Ȳ; hence

SSE(R) = Σ (Y_i − Ŷ_i(R))² = Σ (Y_i − Ȳ)² = SST,

and the test statistic becomes

F = {(SST − SSE) / [(n − 1) − (n − 2)]} / {SSE / (n − 2)} = MSR/MSE,

the usual F statistic.
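Both results above, the equivalence t² = F and the reduction of the general linear test statistic to MSR/MSE, can be verified numerically; the following sketch uses simulated data:

```python
import numpy as np

# Hypothetical data, for illustration only
rng = np.random.default_rng(2)
n = 25
X = np.linspace(1, 12, n)
Y = 3.0 + 0.6 * X + rng.normal(scale=1.5, size=n)

Sxx = np.sum((X - X.mean()) ** 2)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
b0 = Y.mean() - b1 * X.mean()
Yhat = b0 + b1 * X

SSE_full = np.sum((Y - Yhat) ** 2)      # full model, df = n - 2
SSE_red = np.sum((Y - Y.mean()) ** 2)   # reduced model (beta1 = 0), df = n - 1; equals SST

MSE = SSE_full / (n - 2)
MSR = (SSE_red - SSE_full) / 1          # SSR with 1 degree of freedom

# t statistic for H0: beta1 = 0, with s{b1} = sqrt(MSE/Sxx)
t = b1 / np.sqrt(MSE / Sxx)

# General linear test statistic
F = ((SSE_red - SSE_full) / 1) / (SSE_full / (n - 2))

assert abs(t ** 2 - F) < 1e-8           # t^2 = F
assert abs(F - MSR / MSE) < 1e-8        # F = MSR/MSE
```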
2.9 Descriptive Measures of Association

The goodness of fit of the line can be measured by the amount of the total variation attributed to regression. For example, if SSR = SST then SSE = 0 and all the predicted values fall on the LS line. In this case we can say that the regression explains 100% of the variation in the Y_i's. SSR is termed the Explained Variation and SSE the Unexplained Variation.

Coefficient of Determination

It is defined as the ratio of Explained Variation to Total Variation,

r² = SSR/SST.

Since the denominator SST = SSR + SSE ≥ SSR and all the sums of squares are non-negative,

0 ≤ r² ≤ 1.

1. When all the values fall on the regression line, SSE = 0 and r² = 1. Hence the predictor variable accounts for all the variability.

2. When b_1 = 0, the predictor variable drops out from the model and we have SSE = SST, i.e. SSR = 0 and r² = 0. The variable X then does not play any role in explaining the variation in the Y's.

3. The above two cases are extreme cases. A value of r² closer to 1 is regarded as giving a good fit. Usually it is measured in percentages, and it is also called the Multiple Correlation Coefficient.

The Correlation Coefficient

It is defined by

r = ±√r²,

where the positive sign corresponds to positive slope (b_1 > 0) and the negative sign corresponds to negative slope (b_1 < 0). It is clear that

−1 ≤ r ≤ 1.

A computational formula is given by

r = b_1 √(S_xx/S_yy).

The correlation coefficient is used more in describing the joint association between X and Y. And since r² < |r| when 0 < |r| < 1, r may give an impression of a closer relationship than r².

Example 2.9.1

The coefficient of determination for the height-weight data from the ANOVA table is given by

r² = 2930.8/5260.3 = .5570 = 55.7%,

and r = +√.5570 = .7463, since b_1 > 0.
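Up to the rounding used in the notes, the arithmetic of Example 2.9.1 can be reproduced directly from the two sums of squares in the ANOVA table (SSR = 2930.8 and SSE = 2329.5); a minimal sketch:

```python
import math

# Sums of squares from the height-weight ANOVA table
SSR = 2930.8
SSE = 2329.5
SST = SSR + SSE  # 5260.3

r_squared = SSR / SST
r = math.sqrt(r_squared)  # positive root, since b1 > 0

print(round(r_squared, 3))  # 0.557, i.e. about 55.7%
print(round(r, 3))          # about 0.746
```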
Adjusted R²

Since SSR and SSE carry different degrees of freedom, their ratio adjusted for degrees of freedom may be more appropriate as a measure of goodness of fit; it is given by

Adjusted R² = 1 − {SSE/(n − 2)} / {SSTO/(n − 1)} = 1 − [(n − 1)/(n − 2)] · SSE/SSTO.

For the previous example,

Adj. R² = 1 − (2329.5/22) / (5260.3/23) = .5359,

which tallies with the value in the ANOVA table.

Chapter 3. Diagnostics and Remedial Measures

Diagnostic Tools

Diagnostic tools are used to check for any irregularities in the data. Graphical techniques are visual aids in locating patterns in the data and identifying any extreme or unusual observations.

3.1 Diagnostics for Predictor Variable (X)

Dot Plots

These are basically frequency plots. Dots are placed above the variable line for the values of the variable. Dots are stacked over each other if a value is repeated in the data. These plots display the dispersion of the variable. It is desirable that the data are evenly dispersed. Figure 3.1 below represents the dot plot for the heights of the HtWt data. It shows that the data are evenly distributed and there are no outlying observations. It can be obtained using the Graph/Dotplot menu from MINITAB.
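A dot plot of this kind is easy to sketch outside MINITAB as well; the following minimal Python sketch stacks one dot per occurrence above each distinct value (the height values here are hypothetical, not the HtWt data):

```python
from collections import Counter

# Hypothetical height data, for illustration only
heights = [63, 64, 64, 65, 66, 66, 66, 67, 68, 68, 69, 70, 70, 71, 72]

# One dot per occurrence beside each distinct value, like a dot plot on its side
counts = Counter(heights)
for value in sorted(counts):
    print(f"{value:>3} | " + "." * counts[value])
```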
Sequence Plots

A sequence plot is the plot of the observation against its place in the data. Such plots are useful when the data are observed as a sequence in time. The points are connected to show the time sequence more effectively, and can depict a time trend or some other pattern. Figure 3.2 gives a sequence plot for the height data; but in this case it does not have much meaning since the order of the data is arbitrary. Such a plot can be obtained using the Graph/Time Series Plot menu from MINITAB.

Stem and Leaf Plot

This plot is an alternative way to display the data. The main column, called the stem of the data, generally displays the first n − 1 digits for n-digit numbers in the data. The data are displayed by listing the last digit of the observation (called the leaf) beside the proper stem. To the left of the stem column may appear the frequency of the branch, meaning the number of observations in the corresponding row. Also, the frequency of the branch containing the median is written in parentheses. It basically resembles a histogram displayed sideways and may bring out the symmetry or asymmetry of the distribution. [Note that symmetry of the distribution is preferred.] Figure 3.3 displays the distribution of the heights. The distribution is concentrated more towards larger values. It can be obtained using the Graph/Stem and Leaf Plot menu from MINITAB.

Box-Whisker Plot

This plot gives a box with the top boundary at the third quartile and the bottom boundary at the first quartile, and a line in the middle of the box signifying the median. Two lines protrude from the bottom and top, giving the minimum and maximum. This is known as the five-number summary of the data. The median being in the centre signifies symmetry of the data. Any long whisker signifies outlying observations. Figure 3.4 gives the box-whisker plot of the height data and signifies that the asymmetry reported from the stem and leaf plot is not severe. It can be obtained using the Graph/Box Plot menu from MINITAB.

3.2 Residual Plots

Residuals, as introduced earlier, may be used for checking various model departures. Recall that the residuals, defined by

e_i = Y_i − Ŷ_i,

may be regarded as the observed errors, in contrast to the unknown true errors

ɛ_i = Y_i − E{Y_i}.

The properties used in diagnostic residual plots are:

1. Mean. The mean of the n residuals is ē = Σ e_i / n = 0. Since ē is always zero, it does not provide any information on the assumption E{ɛ_i} = 0.

2. Variance. The variance of the n residuals for the simple regression model is defined by

Σ (e_i − ē)² / (n − 2) = Σ e_i² / (n − 2) = MSE.

If the model is appropriate, it provides an unbiased estimator of the error variance σ².
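The claim that MSE is an unbiased estimator of σ² can be illustrated by a small Monte Carlo sketch (the model parameters here are made up for illustration):

```python
import numpy as np

# Monte Carlo sketch: MSE = SSE/(n-2) averages to sigma^2 over many samples
rng = np.random.default_rng(1)
n, beta0, beta1, sigma = 30, 1.0, 0.8, 2.0
X = np.linspace(0, 10, n)
Sxx = np.sum((X - X.mean()) ** 2)

mses = []
for _ in range(2000):
    Y = beta0 + beta1 * X + rng.normal(scale=sigma, size=n)
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
    b0 = Y.mean() - b1 * X.mean()
    e = Y - (b0 + b1 * X)
    mses.append(np.sum(e ** 2) / (n - 2))

print(np.mean(mses))  # close to sigma^2 = 4
```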
3. Nonindependence. The residuals in general are not independent, as they are subject to the linear constraints (i) Σ e_i = 0 and (ii) Σ X_i e_i = 0. The dependency can, however, be ignored when n is large.

Semistudentized Residuals

It may be helpful to standardize the residuals for residual analysis. The following form of standardization is useful:

e*_i = (e_i − ē)/√MSE = e_i/√MSE.

These are known as semistudentized residuals because they are an approximation to the standardized residuals e_i/s.d.{e_i}. Since s.d.{e_i} is complex and varies with each X_i, √MSE is only an approximation to this standard deviation.

Departures to be Studied from Residuals

1. The regression function is not linear.
2. The variance of the error terms is not constant.
3. The error terms are not independent.
4. The model fits all but a few outlier observations.
5. The error terms are not normally distributed.
6. One or several important predictor variables are absent from the model.
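The two linear constraints on the residuals and the semistudentized form can be illustrated with a short computation (the data are made up for illustration):

```python
import numpy as np

# Hypothetical data, for illustration only
rng = np.random.default_rng(3)
n = 15
X = np.linspace(60, 75, n)                      # e.g., heights
Y = -100 + 3.5 * X + rng.normal(scale=8.0, size=n)

Sxx = np.sum((X - X.mean()) ** 2)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
b0 = Y.mean() - b1 * X.mean()
e = Y - (b0 + b1 * X)                           # residuals

# Constraints: sum e_i = 0 and sum X_i e_i = 0 (up to float error)
assert abs(e.sum()) < 1e-8
assert abs((X * e).sum()) < 1e-6

# Semistudentized residuals e* = e / sqrt(MSE)
MSE = np.sum(e ** 2) / (n - 2)
e_star = e / np.sqrt(MSE)
print(np.round(e_star, 2))
```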