Given a sample of n observations measured on k IVs and one DV, we obtain the equation


Hubert Robinson
5 years ago

Psychology 282 Lecture #13 Outline

Prediction and Cross-Validation

One of the primary uses of MLR is for prediction of the value of a dependent variable for future observations, or observations that were not part of the original sample. In order for a MLR equation to have utility for prediction it must generalize beyond the sample that was used to derive it. A variety of methods are available for assessing such generalizability, and also for improving it. Here we consider how to obtain information about the performance of a MLR equation when applied to new observations.

Derivation of the MLR equation

Given a sample of n observations measured on k IVs and one DV, we obtain the equation

$$\hat{Y} = B_0 + B_1 X_1 + B_2 X_2 + \cdots + B_k X_k$$

such that $\sum e^2$ is minimized. We also obtain the SMC, $R^2$.
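The derivation step can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function name `fit_mlr`, the use of NumPy, and the simulated data are all assumptions made for the example.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares fit of y on the columns of X (intercept added here).

    Returns (B, r2): B[0] is the intercept B_0, B[1:] are B_1..B_k,
    and r2 is the squared multiple correlation R^2.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])   # prepend a column of 1s
    B, *_ = np.linalg.lstsq(design, y, rcond=None)   # minimizes the sum of e^2
    resid = y - design @ B
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return B, r2

# Hypothetical data: n = 100 observations, k = 3 IVs.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=100)

B, r2 = fit_mlr(X, y)
print("coefficients:", B.round(3), "R^2:", round(r2, 3))
```

The returned R^2 equals the squared correlation between observed and predicted Y in the sample used for fitting, which is the quantity later compared against its cross-validated counterpart.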

It must always be kept in mind that this equation is obtained to fit the observed sample as well as possible. The resulting regression coefficients will be influenced by any idiosyncrasies of that sample. As a result, the equation will differ from what would be obtained in the population or in another sample. The equation will make the most accurate predictions of Y in the observed sample, but the same equation may not make the best predictions for new individuals.

Prediction of Y for new observations

After the MLR equation is obtained from the observed sample, suppose we wish to use it to make a prediction of Y for a new observation for whom the scores on the IVs are available. Let those scores be designated $X_{1o}, X_{2o}, \ldots, X_{ko}$. Using the MLR equation we can obtain the predicted Y value for this individual:

$$\hat{Y}_o = B_0 + B_1 X_{1o} + B_2 X_{2o} + \cdots + B_k X_{ko}$$

The question at this point is how precise this prediction is likely to be.
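One way to gauge that precision numerically is a prediction interval, which the outline develops next. The sketch below is an illustration under stated assumptions rather than the lecture's own procedure: it uses the standard matrix expression for the prediction variance, s^2 (1 + x_o'(X'X)^{-1} x_o), instead of the outline's z-score notation. The function name `prediction_interval` and the simulated data are hypothetical, and the critical t value is passed in by the caller.

```python
import numpy as np

def prediction_interval(X, y, x_new, t_crit):
    """Point prediction and prediction interval for one new observation.

    t_crit is the two-tailed critical t value on n - k - 1 degrees of
    freedom, supplied by the caller (for example from a t table).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    B, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ B
    s2 = resid @ resid / (n - k - 1)              # residual variance
    x0 = np.concatenate([[1.0], np.asarray(x_new, dtype=float)])
    y_hat = x0 @ B
    # Variance of (Y_o - Yhat_o) is s^2 * (1 + x0' (X'X)^-1 x0).
    leverage = x0 @ np.linalg.solve(design.T @ design, x0)
    se_pred = np.sqrt(s2 * (1.0 + leverage))
    return y_hat, y_hat - t_crit * se_pred, y_hat + t_crit * se_pred

# Hypothetical data: n = 64, k = 3, so df = 60.
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 3))
y = 2.0 + X @ np.array([0.6, 0.4, -0.5]) + rng.normal(size=64)
t_crit = 2.000   # two-tailed .05 critical value of t for df = 60

near = prediction_interval(X, y, X.mean(axis=0), t_crit)        # at the IV means
far = prediction_interval(X, y, X.mean(axis=0) + 3.0, t_crit)   # far from the means
print("near the means:", np.round(near, 3))
print("far from the means:", np.round(far, 3))
```

Comparing the two calls shows the interval widening as the new observation moves away from the means of the IVs.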

We can address this question by forming a prediction interval around the predicted score. This procedure provides us with an interval estimate of this individual's actual Y score. We first need the standard error of new Y scores around the predicted score:

$$sd_{Y_o - \hat{Y}_o} = sd_{Y - \hat{Y}} \sqrt{1 + \frac{1}{n} + \frac{1}{n-1}\left[\sum_{j} \frac{z_{jo}^2}{1 - R_j^2} - 2\sum_{i<j} \frac{\beta_{ij}\, z_{io} z_{jo}}{1 - R_j^2}\right]}$$

Here $sd_{Y - \hat{Y}}$ is the standard error of estimate from the original sample, $z_{jo}$ is the new observation's standard score on $X_j$ (computed from the original sample's mean and standard deviation), and $\beta_{ij}$ is the standardized regression coefficient for $X_i$ when $X_j$ is predicted from the other IVs. The first summation is over the IVs (subscripted j) and the second summation is over all pairs of IVs (subscripted i and j).

There are two important features of this expression. First note that the z values represent distance from the mean on the IVs for the new observation. For an observation more distant from the means of the IVs, the standard error will increase, meaning that predictions will be less precise. Another important feature of this expression involves the term $R_j^2$, which refers to the SMC obtained when predicting IV $X_j$ from all other IVs. Note that when these values are high, meaning multicollinearity is high, the magnitude of the terms in the two summations will increase, in turn increasing the size

of the entire standard error. Thus, when multicollinearity is high, predictions will become less precise.

Using this standard error we can construct a prediction interval by the usual methods:

$$\hat{Y}_o - t_{\alpha/2}\, sd_{Y_o - \hat{Y}_o} \le Y_o \le \hat{Y}_o + t_{\alpha/2}\, sd_{Y_o - \hat{Y}_o}$$

This prediction interval will be wider when multicollinearity is high and for observations more deviant from the means of the IVs. That is, predictions for new observations will be less precise in those circumstances.

Validation of a MLR equation on new data

When an MLR equation is to be used for prediction purposes it is useful to obtain empirical evidence as to its generalizability, or its capacity to make accurate predictions for new samples of data. This process is sometimes referred to as validating the regression equation. One way to address this issue is to literally obtain a new sample of observations. That is, after the MLR equation is developed from the original sample, the investigator conducts a new study, replicating the original one as closely as possible, and uses the new

data to assess the predictive validity of the MLR equation. Logistically, this procedure can work as follows:

1) The MLR equation is derived in the original sample.
2) A new sample is obtained, providing measures on IVs and DV.
3) The MLR equation is applied to the individuals in the new sample, producing predicted values of the DV.
4) The correlation between the observed and predicted values of the DV in the new sample is obtained. That correlation reflects the performance of the MLR equation in the new sample.

This procedure is usually viewed as impractical because of the requirement to conduct a new study to obtain validation data, as well as the difficulty in truly replicating the original study. An alternative, more practical procedure is cross-validation.

Cross-Validation

In cross-validation the original sample is split into two parts. One part is called the derivation sample, and the other part is called the validation sample. The splitting of the sample raises two questions.

1) What portion of the sample should be in each part? If sample size is very large, it is often best to split the sample in half. For smaller samples, it is more conventional to split the sample such that 2/3 of the observations are in the derivation sample and 1/3 are in the validation sample.

2) How should the sample be split? The most common approach is to divide the sample randomly, thus theoretically eliminating any systematic differences. One alternative is to define matched pairs of subjects in the original sample and to assign one member of each pair to the derivation sample and the other to the validation sample.

Once the sample is split, MLR is applied to the derivation sample, yielding a regression equation and a SMC, designated $R_D^2$.

The MLR equation obtained from the derivation sample is then applied to the observations in the validation sample, yielding predicted Y values for those observations. We then obtain the squared correlation between observed and predicted Y values in the validation sample. That value is designated $R_V^2$. It is often called a cross-validated squared multiple correlation.

Of particular interest is the comparison of $R_D^2$ and $R_V^2$. Note that $R_D^2$ has been maximized by MLR. That is, the regression equation is defined so that the predictions of Y are as precise as possible in the derivation sample. The equation is affected by idiosyncrasies of that sample. When that same equation is applied to a different sample (the validation sample), that sample probably exhibits different idiosyncrasies. As a result, the equation probably will not work as well, and the predictions are likely to be less accurate, meaning that we are likely to see $R_V^2 < R_D^2$. This does not have to happen, but it nearly always does.
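The split-sample procedure just described can be sketched as follows. This is a minimal illustration under assumptions: the function name `split_sample_cv`, the random 2/3 to 1/3 split, and the simulated data are choices made for the example, not details taken from the lecture.

```python
import numpy as np

def split_sample_cv(X, y, derivation_fraction=2/3, seed=0):
    """Return (R2_D, R2_V) for one random derivation/validation split."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    order = np.random.default_rng(seed).permutation(n)   # random assignment
    n_d = int(round(derivation_fraction * n))
    d, v = order[:n_d], order[n_d:]

    # Derive the equation in the derivation sample only.
    design_d = np.column_stack([np.ones(n_d), X[d]])
    B, *_ = np.linalg.lstsq(design_d, y[d], rcond=None)

    # Apply the same coefficients, unchanged, to both parts.
    pred_d = design_d @ B
    pred_v = np.column_stack([np.ones(n - n_d), X[v]]) @ B
    r2_d = np.corrcoef(y[d], pred_d)[0, 1] ** 2
    r2_v = np.corrcoef(y[v], pred_v)[0, 1] ** 2
    return r2_d, r2_v

# Hypothetical data: n = 90 observations, k = 6 IVs, only two of which matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(90, 6))
y = X @ np.array([0.5, 0.4, 0.0, 0.0, 0.0, 0.0]) + rng.normal(size=90)

r2_d, r2_v = split_sample_cv(X, y)
print(f"R2_D = {r2_d:.3f}, R2_V = {r2_v:.3f}")
```

The key design point is that the coefficients are estimated once, in the derivation sample, and are not re-estimated in the validation sample.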

This reduction in the SMC under cross-validation provides an indication of the predictive validity of the MLR equation. A large reduction may indicate poor predictive validity, or poor cross-validation. A small reduction may indicate little loss of predictive precision when the equation is used outside of the original sample.

There are several drawbacks to this cross-validation procedure:

1) Sampling variability: The outcome will vary depending on the splitting of the original sample. This variability may be quite large when n is not large.

2) Increased standard errors of coefficients: When the sample is split, the MLR equation is estimated using only a subset of the original sample. This smaller derivation sample causes regression coefficients to have larger standard errors, meaning they are less stable, less precise.

3) In general, because of (1) and (2), cross-validation methods are impractical when n is small.
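Drawback (1) can be made concrete by repeating the random split many times on the same data and looking at the spread of the cross-validated value. The sketch below is an assumption-laden illustration: the helper `validation_r2`, the choice of 200 splits, and the simulated small sample are all hypothetical.

```python
import numpy as np

def validation_r2(X, y, seed, derivation_fraction=2/3):
    """Cross-validated squared correlation R2_V for one random split."""
    n = len(y)
    order = np.random.default_rng(seed).permutation(n)
    n_d = int(round(derivation_fraction * n))
    d, v = order[:n_d], order[n_d:]
    # Fit in the derivation sample, then predict the validation sample.
    B, *_ = np.linalg.lstsq(np.column_stack([np.ones(n_d), X[d]]), y[d], rcond=None)
    pred_v = np.column_stack([np.ones(n - n_d), X[v]]) @ B
    return np.corrcoef(y[v], pred_v)[0, 1] ** 2

# Hypothetical small sample: n = 45, k = 5.
rng = np.random.default_rng(3)
X = rng.normal(size=(45, 5))
y = X @ np.array([0.5, 0.3, 0.0, 0.0, -0.4]) + rng.normal(size=45)

# Same data, 200 different random splits.
values = np.array([validation_r2(X, y, seed) for seed in range(200)])
print(f"R2_V over 200 splits: min={values.min():.3f}, "
      f"mean={values.mean():.3f}, max={values.max():.3f}")
```

Because only the split changes from run to run, the range of the printed values reflects variability due to splitting alone.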

Fortunately there is an alternative approach. It is possible to estimate the value of the cross-validated SMC without splitting the sample and carrying out the procedures just described. Given an original sample of size n, we can estimate the cross-validated SMC that would be obtained if the resulting MLR equation were applied to another sample of the same size:

$$\hat{R}^2 = 1 - (1 - R^2)\,\frac{n + k}{n - k}$$

It is important to distinguish this adjusted $R^2$ value from one we studied earlier. In our study of inferences in MLR we noted that the sample SMC is a biased estimate of the population SMC, and we made use of a correction for shrinkage that produced an approximately unbiased estimate of the population SMC:

$$\tilde{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$

The two different corrections are often confused. They provide two very different pieces of information. The first one above estimates the SMC that would be obtained when the MLR equation obtained in one sample is applied to a new sample. The second provides an estimate of the population SMC.
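To keep the two corrections apart, it can help to compute them side by side. The sketch below simply codes the two formulas, 1 - (1 - R^2)(n + k)/(n - k) for the cross-validated estimate and 1 - (1 - R^2)(n - 1)/(n - k - 1) for the shrinkage correction. The function names and the example values (R^2 = .40, n = 50, k = 5) are hypothetical.

```python
def cross_validated_r2(r2, n, k):
    """Estimated cross-validated SMC: 1 - (1 - R^2)(n + k)/(n - k)."""
    if n <= k:
        raise ValueError("n must exceed k")
    return 1.0 - (1.0 - r2) * (n + k) / (n - k)

def shrunken_r2(r2, n, k):
    """Estimated population SMC: 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    if n <= k + 1:
        raise ValueError("n must exceed k + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical values: sample R^2 = .40, n = 50, k = 5.
r2_cv = cross_validated_r2(0.40, 50, 5)
r2_pop = shrunken_r2(0.40, 50, 5)
print(round(r2_cv, 4))    # 0.2667
print(round(r2_pop, 4))   # 0.3318
```

For these values the estimate of the cross-validated SMC falls below the estimate of the population SMC, which in turn falls below the sample value.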

We focus here on the estimate of the cross-validated SMC. Note that this value will be smaller than the $R^2$ in the original sample:

$$\hat{R}^2 < R^2$$

This reduction is due to the fact that the MLR equation derived in the original sample will tend not to work as well in new samples. It is useful to consider what factors will affect the degree of this reduction in the SMC. The reduction will be smaller when:

The original $R^2$ is larger.
Sample size n is larger.
The number of IVs k is smaller.

So regression equations cross-validate best when $R^2$ is large, n is large, and k is small. When $R^2$ is small, n is small, and k is large, it can be expected that a regression equation will cross-validate very poorly.

The last point is especially relevant. Regression equations cross-validate better when the number of IVs is smaller, holding other factors constant. Thus, it is disadvantageous to include too many IVs in a regression model. Generalizability will be improved if we can exclude IVs that do not contribute to explaining the variance in Y.
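As a worked illustration of the roles of n and k, using hypothetical values that do not come from the lecture, take a sample $R^2 = .50$ and apply the estimate $\hat{R}^2 = 1 - (1 - R^2)(n + k)/(n - k)$:

$$
\begin{aligned}
n = 60,\ k = 4: &\quad \hat{R}^2 = 1 - (.50)\tfrac{64}{56} \approx .43 \\
n = 60,\ k = 12: &\quad \hat{R}^2 = 1 - (.50)\tfrac{72}{48} = .25 \\
n = 240,\ k = 12: &\quad \hat{R}^2 = 1 - (.50)\tfrac{252}{228} \approx .45
\end{aligned}
$$

With the sample $R^2$ held fixed, moving from 4 to 12 IVs at n = 60 lowers the estimate from about .43 to .25, and raising n to 240 with 12 IVs brings it back to about .45.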

Alternative weighting methods

The reason for reduction in $R^2$ when the MLR equation is used in a new sample is that the estimates of the regression coefficients can be very sensitive to idiosyncrasies of the original sample, especially when n is not large. This phenomenon is attributable to the use of the least-squares principle. The regression coefficients, often called least-squares weights, are estimated so as to minimize the sum of squared residuals in the original sample and thus are very sensitive to the nature of that sample. In turn, those same weights may result in a substantial loss in predictive accuracy when applied to a new sample.

Given this fact, it may be useful to consider whether a different method for defining the coefficients to be used in the prediction equation might provide better performance under cross-validation. One approach that seems to have this characteristic is called unit-weighting. It works like this. Consider a conventional cross-validation design involving a split sample.

In the derivation sample, obtain the correlation of each IV with the DV; these values are designated $r_{Yj}$. For each $X_j$, simply determine whether the sign of $r_{Yj}$ is positive or negative. Define the weights for the IVs as follows:

If $r_{Yj} > 0$, then $U_j = 1$.
If $r_{Yj} < 0$, then $U_j = -1$.

Then define a prediction equation using standardized variables as:

$$\hat{z}_Y = U_1 z_1 + U_2 z_2 + \cdots + U_k z_k$$

This is a greatly simplified prediction equation. The Us are not least-squares weights, but rather are unit weights with either a positive or negative sign. As such, they are far less influenced by the chance characteristics of the derivation sample.

The predictive accuracy of these unit weights can be evaluated in the derivation sample by obtaining the squared correlation between observed and predicted scores on the DV; call that value $R_{DU}^2$. Note that this value cannot exceed the $R_D^2$ obtained using least-squares weights, and will nearly always be smaller. That is, $R_{DU}^2 < R_D^2$.

The cross-validated predictive accuracy can be evaluated by applying the unit-weight equation in the validation sample and then obtaining the squared correlation between the resulting predicted and observed scores on the DV. Call this value $R_{VU}^2$.

Of particular interest here is a comparison of the cross-validated SMCs obtained using least-squares weights versus unit weights. Much research has shown a clear tendency for $R_{VU}^2 > R_V^2$. That is, the unit weights tend to provide better predictive accuracy in the validation sample than do the least-squares weights.

This is an important finding that is relevant to true prediction problems in practice. Unit weights may provide better predictive accuracy for new samples than do least-squares weights. This finding will tend to hold more consistently when n is not large and k is not small. For large n and small k, least-squares weights will still often provide better generalizability. In practice, if prediction is the primary objective, the investigator should try both methods.
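Trying both methods on one split sample might look like the sketch below. It is an illustration under assumptions, not the lecture's procedure: the function name `compare_weights`, the simulated data, and the decision to standardize validation-sample IVs with derivation-sample means and standard deviations are all choices made for the example.

```python
import numpy as np

def compare_weights(X, y, seed=0, derivation_fraction=2/3):
    """Return (R2_V, R2_VU): validation-sample squared correlations for
    least-squares weights and for unit weights."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    order = np.random.default_rng(seed).permutation(n)
    n_d = int(round(derivation_fraction * n))
    d, v = order[:n_d], order[n_d:]

    # Least-squares weights estimated in the derivation sample.
    B, *_ = np.linalg.lstsq(np.column_stack([np.ones(n_d), X[d]]), y[d], rcond=None)
    pred_ls = np.column_stack([np.ones(n - n_d), X[v]]) @ B

    # Unit weights: the sign of each IV's correlation with Y in the
    # derivation sample (np.sign gives 0 if a correlation is exactly 0).
    r_yj = np.array([np.corrcoef(X[d][:, j], y[d])[0, 1] for j in range(k)])
    U = np.sign(r_yj)

    # Assumption: validation IVs are standardized with derivation-sample
    # means and SDs; the outline does not say which sample supplies them.
    z_v = (X[v] - X[d].mean(axis=0)) / X[d].std(axis=0, ddof=1)
    pred_unit = z_v @ U

    r2_v = np.corrcoef(y[v], pred_ls)[0, 1] ** 2
    r2_vu = np.corrcoef(y[v], pred_unit)[0, 1] ** 2
    return r2_v, r2_vu

# Hypothetical data: small n (45) and fairly large k (8).
rng = np.random.default_rng(4)
X = rng.normal(size=(45, 8))
y = X @ np.full(8, 0.25) + rng.normal(size=45)

r2_v, r2_vu = compare_weights(X, y)
print(f"least-squares R2_V = {r2_v:.3f}, unit-weight R2_VU = {r2_vu:.3f}")
```

Which value comes out larger depends on the data and on the particular split, so a single run like this one is a check on a given data set, not evidence about the general tendency.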
