Decision 411: Class 8


1 Decision 411: Class 8
One more way to model seasonality
Advanced regression (power tools):
Stepwise and all-possible regressions
1-way ANOVA
Multifactor ANOVA
General Linear Models (GLM)
Out-of-sample validation of regression models
Logistic regression

2 One more way to model seasonality with regression Suppose a time series has an underlying stable trend and stable seasonal pattern (either additive or multiplicative), with effects of other independent variables added on, so the effects of the independent variables are not seasonal. Suppose that you also have an externally supplied seasonal index. Then it may be appropriate to use the seasonal index and/or the seasonal index multiplied by the time index as separate regressors to capture the seasonal part of the overall pattern.

3 Details Let SINDEX denote a seasonal index variable and let TIME denote a time index variable. Then by including TIME, SINDEX, and SINDEX*TIME as potential regressors, you can model a range of patterns with stable trend and seasonality. Depending on the amount of trend and the degree to which seasonal swings get larger as the level of the series rises, perhaps not all of these terms would be significant.

4 Depending on the estimated coefficients, you could fit any of these patterns:
Only SINDEX is significant: seasonal pattern with no trend (no real difference between additive and multiplicative)
Only TIME and SINDEX are significant: additive seasonal pattern with trend
SINDEX*TIME is significant: multiplicative seasonal pattern with trend
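As a rough illustration of the idea (synthetic data; the variable names TIME and SINDEX follow the slides, and the fitting is plain least squares rather than Statgraphics), here is how the SINDEX*TIME term picks up a multiplicative pattern:

```python
import numpy as np

# Hypothetical illustration: fit TIME, SINDEX, and SINDEX*TIME as regressors
# to a series with a stable trend and a multiplicative seasonal pattern.
rng = np.random.default_rng(0)
n = 48                                    # 4 years of monthly data
TIME = np.arange(1, n + 1, dtype=float)
SINDEX = np.tile([0.8, 0.9, 1.1, 1.2] * 3, 4)[:n]   # externally supplied index
# Multiplicative pattern: seasonal swings grow with the level of the series
y = SINDEX * (100 + 2.0 * TIME) + rng.normal(0, 1.0, n)

# Design matrix: constant, TIME, SINDEX, SINDEX*TIME
X = np.column_stack([np.ones(n), TIME, SINDEX, SINDEX * TIME])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
r_squared = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
```

Here the true pattern is purely multiplicative, so the SINDEX and SINDEX*TIME coefficients carry the fit while the constant and TIME terms come out near zero.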

5 Stepwise regression Automatic stepwise variable selection is a standard feature in multiple regression (a right-mouse-button analysis option in Statgraphics).

6 Backward stepwise regression Automates the common process of sequentially removing the variable with the smallest t-stat, if that t-stat is less than a specified threshold. The F-to-remove parameter is the square of the minimum t-stat needed to remain in the model, given the other variables still present. Can be used to fine-tune the selection of variables, but should not be used to go fishing for significant variables in a large pool.
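A minimal sketch of the backward-elimination logic described above (an illustration of the idea, not Statgraphics' actual implementation; the data and variable names are made up):

```python
import numpy as np

def backward_stepwise(X, y, names, f_to_remove=4.0):
    """Repeatedly drop the regressor whose squared t-stat (its F-to-remove)
    is smallest, refitting after each removal, until all remaining
    regressors pass the threshold."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        Xk = np.column_stack([np.ones(len(y)), X[:, keep]])
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        resid = y - Xk @ beta
        mse = resid @ resid / (len(y) - Xk.shape[1])
        cov = mse * np.linalg.inv(Xk.T @ Xk)
        t2 = beta[1:] ** 2 / np.diag(cov)[1:]     # squared t-stats
        worst = int(np.argmin(t2))
        if t2[worst] >= f_to_remove:
            break                                 # all remaining terms pass
        del keep[worst]
    return [names[j] for j in keep]

# Toy data: y depends on x0 and x2 only; x1 and x3 are pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(0, 0.5, 80)
selected = backward_stepwise(X, y, ["x0", "x1", "x2", "x3"])
```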

7 Forward stepwise Automates the process of sequentially adding the variable that would have the highest t-stat if it were the next variable entered. F-to-enter is the square of the minimum t-stat needed to enter, given the other variables already present. Can be used (with care!) to go fishing for significant variables in a large pool. It's a potentially powerful data exploration tool, because it does something that would be hard to do by hand.

8 Example: enrollment revisited Dependent variable: ROLL. Potential regressors (8): lag(roll,1), lag(roll,2), HSGRAD, lag(hsgrad,1), UNEMP, lag(unemp,1), INCOME, lag(income,1). Previously we had considered models whose equations involved up to 2 lags of ROLL and up to 1 lag of HSGRAD and UNEMP. INCOME is an additional predictive variable that was not considered before. Note that YOU are responsible for anticipating all transformations that may be useful.

9 Here is the "all likely suspects" model: many variables are not significant.

10 With F-to-enter and F-to-remove set at 4.0, both forward and backward stepwise regression lead to a 2-variable model.

11 Details of the steps in the backward stepwise regression: you can see here by how much the MSE changes as variables are removed. MSE actually improves (i.e., gets smaller) when some of the least significant variables are removed. In this case the smallest MSE was actually reached at step 2, although the MSEs after the later steps are the same for all practical purposes. The INCOME variables are removed first, followed by the HSGRAD variables, leaving UNEMP as the only exogenous variable after step 4. MSE goes up somewhat in steps 5 & 6, although the variables removed at those points are technically not significant at the 0.05 level.

12 When F-to-enter and F-to-remove are lowered to 3.0, which corresponds to a permissible t-stat as low as 1.73 in magnitude, forward stepwise still leads to the 2-variable model, while backward stepwise leads to this 4-variable model in which one variable (lag(roll,2)) has a t-stat of -1.86.

13 Caveats Stepwise regression (or any other automatic model selection method) is not a substitute for logical thinking and graphical data exploration. There is a danger of overfitting from fishing in too large a pool of potential variables and finding spurious regressors. Resist the urge to lower F-to-enter or F-to-remove below 3.0 to find more significant variables. Ideally you should hold out a significant sample of data while selecting variables, for later out-of-sample validation of the model. Validation is not honest if you peeked at the hold-out data while trying to identify significant variables.

14 All possible regressions Automatic stepwise selection (forward or backward) is efficient, but not guaranteed to find the best model that can be constructed from a given set of potential regressors. It is computationally feasible to test all possible regressions that can be constructed with k out of m potential regressors. Beware: danger of getting obsessed with rankings & forgetting about logic & intuition!

15 All possible regressions, continued In Statgraphics, all-possible-regressions is the Regression Model Selection procedure. Analysis options allow you to set the maximum number of variables (default is 5). Outputs include rankings of models by adjusted R-squared and the Mallows Cp stat. Pane options for these reports allow you to limit the number of best models shown for a given number of variables (default is 5).

16 What is the Mallows Cp stat?

Cp = p + (n - p) * [ MSE(subset of size p) / MSE(all variables) - 1 ]

where p = # of coefficients in the subset model, including the constant. Ideally, Cp should be small and close to p. Note that Cp = p for the all-variable model, so Cp < p if the subset model has lower MSE than the all-variable model. Ideally you should approach or even beat the all-variable MSE with fewer variables. Ranking by Cp penalizes more heavily for model complexity than ranking by adjusted R-squared.
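The formula translates directly into code (a hypothetical helper, not part of any package):

```python
def mallows_cp(mse_subset, mse_full, n, p):
    """Mallows' Cp for a subset model with p coefficients (including the
    constant), per the slide: Cp = p + (n - p) * (MSE_subset/MSE_full - 1)."""
    return p + (n - p) * (mse_subset / mse_full - 1.0)

# The all-variable model always gives Cp = p exactly:
cp_full = mallows_cp(2.0, 2.0, 27, 9)       # -> 9
# A subset that beats the full-model MSE gives Cp < p:
cp_sub = mallows_cp(1.8, 2.0, 27, 5)        # 5 + 22*(-0.1) = 2.8
```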

17 Example: enrollment revisited Dependent variable: ROLL. Potential regressors (8): lag(roll,1), lag(roll,2), HSGRAD, lag(hsgrad,1), UNEMP, lag(unemp,1), INCOME, lag(income,1). Number of possible models = 2^8 = 256.

18 Lining up the suspects: Since there are only 8 potential regressors, it is feasible to ask for reports on all possible models with up to 8 regressors (just for purposes of illustration!). 256 models are fitted to only 27 data points. Overkill? The default maximum number of variables is 5, which is usually plenty! In most applications, I would not recommend raising it.
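For illustration, a brute-force version of all-possible-regressions on synthetic data, ranked by adjusted R-squared (real implementations avoid refitting every model from scratch; the names and data here are made up):

```python
import numpy as np
from itertools import combinations

def all_possible_regressions(X, y, names, max_vars=5):
    """Fit every subset of up to max_vars regressors and rank by
    adjusted R-squared (brute-force sketch of the idea)."""
    n, m = X.shape
    sst = np.sum((y - y.mean()) ** 2)
    results = []
    for k in range(1, max_vars + 1):
        for cols in combinations(range(m), k):
            Xk = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            sse = np.sum((y - Xk @ beta) ** 2)
            adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
            results.append((adj_r2, [names[j] for j in cols]))
    return sorted(results, reverse=True)

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 4))
y = 2.0 * X[:, 1] + rng.normal(0, 0.3, 40)    # only x1 matters
ranked = all_possible_regressions(X, y, ["x0", "x1", "x2", "x3"], max_vars=2)
best_adj_r2, best_vars = ranked[0]
```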

19 How does it work so fast? It is actually unnecessary for the computer to run a complete set of calculations from scratch for each possible regression. Once the correlation matrix has been computed from the original variables, a simple sequence of calculations on the correlation matrix can determine the R-squareds and MSEs of all possible models. The big problem is the length of the reports!

20 Ranking by R-squared Note the hair-splitting differences in adjusted R-squared among models at the top of the rankings. Easy to get lost here! Since the dependent variable is non-stationary, all of the good models have adjusted R-squared very close to 100%. Some 7-variable models, and even the 8-variable model, show up near the top of the rankings (yikes!). As usual, it's better to focus on MSE to decide whether additional variables are worth their weight in model complexity: MSE goes down as adjusted R-squared goes up.

21 Plot of R-squared vs. # coefficients The plot of adjusted R-squared vs. # of coefficients (including the constant) shows that most of the variance is explained by the first regressor added, which happens to be lag(roll,1). This is what should be expected with a nonstationary (strongly trended) dependent variable.

22 Ranking by Cp Ranking by Cp favors models with fewer coefficients and discriminates more finely among the models at the top. The best model includes lag(roll,1), lag(roll,2), unemp, and lag(unemp,1).

23 Plot of Cp vs. # coefficients The Cp plot shows that Cp is minimized at 5 coefficients (i.e., 4 regressors + constant). The 4-coefficient model also yields Cp close to p. Note: the Y-axis scale had to be adjusted to show only small values of Cp.

24 Details of the best-Cp model It's necessary to run a manual regression to see the coefficients. Note that the coefficients of unemp and lag(unemp,1) are roughly equal and opposite. Would a difference be just as good?

25 Restarting from a different set of transformed variables Let's re-run the all-possible-regressions with the lagged regressors replaced by differences. This allows the selection algorithm to choose the difference alone or the difference together with the unlagged variable, which would be logically equivalent to including the lags separately.

26 New ranking by Cp Here the two clearly-best models both include lag(roll,1), lag(roll,2), and diff(unemp), and the top model also includes diff(hsgrad). Thus, collapsing separate lags into a difference may allow a model with fewer coefficients to fit as well, or allow a model with the same number of coefficients to fit better.

27 New plot of Cp vs. p The two highest-ranking models now have Cp much less than p.

28 Conclusions All-possible-regressions makes sure you don't overlook the model with the lowest possible error stats for a given number of regressors... but staring at rankings can distract you from thinking about other issues, such as which model makes the most sense. It won't find a good model by magic: you still have to choose the set of potential regressors and consider transformations of the variables. (Ditto for stepwise!) You are NOT REQUIRED to choose the model that is #1 in the rankings (on whatever measure).

29 Caution: do not overdifference In several of our examples of regressions of nonstationary time series, it has turned out that a differencing transformation was useful. However, beware of using differencing when it is not really needed! Differencing adds complexity to a model, and sometimes it may even create artificial correlation patterns and increase the variance to be explained. Differencing is most appropriate when the original variables either look like random walks (e.g., stock prices) or else are very smooth (e.g., ROLL), with variances dramatically reduced by differencing.

30 Analysis of Variance (ANOVA) ANOVA is multiple regression with (only) categorical independent variables. In a one-way ANOVA, a dummy variable is created for all but one level of the independent variable. The model then estimates the mean of the dependent variable for each level of the independent variable. A pooled estimate of the error standard deviation is used to compute standard errors of the means. This is how one-way ANOVA differs from separate calculations of the means, in which standard errors are based on separate standard deviations.

31 ANOVA in practice Analysis of variance is typically used to analyze data from designed experiments in marketing research, pharmaceutical research, crop science, quality, etc. Interest often centers on nonlinear effects and/or interactions among effects of independent variables: Does relative effectiveness of different ad formats vary with market or demographics? Which combinations and dosages of drugs work best? Which combinations and quantities of crop treatments maximize yield and/or quality? ANOVA is also appropriate for natural experiments with categorical variables, if error variances can be assumed to be the same for all categories.

32 Example: cardata Let's start by doing a one-way ANOVA of mpg vs. origin (continent). Origin codes 1, 2, 3 refer to America, Europe, and Japan, respectively. This model will test for differences in average (mean) mpg among the 3 origins. Typical ANOVA output: ANOVA table, means table, box-and-whisker plot.

33 [Annotated ANOVA table: the sum of squared deviations between group means (predictions) and the grand mean is the explained variance; the sum of squared errors (deviations from group means) is the unexplained variance; together they make up the sum of squared deviations from the grand mean.] The ANOVA table shows the decomposition of the sum of squared deviations from the grand mean and the corresponding variances (but no R-squared!). The F-ratio (30.20) is the ratio of the explained variance ( ) to the unexplained variance ( ). The variables in the model are jointly significant if this ratio is significantly greater than 1, which means they are doing more than what would happen if you just dummied out some data points.

34 Decomposition of the sum of squares: SS(total) = SS(between) + SS(within). [Diagram: total variation = deviations from the grand mean; between-groups variation (prediction) = deviations of the A and B group means from the grand mean; within-groups variation (error) = deviations from the group means.]
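A quick numeric check of this decomposition and of the F-ratio from the previous slide (made-up mpg-style data, three groups of five):

```python
import numpy as np

# Verify SS(total) = SS(between) + SS(within) on toy data.
groups = {
    "America": np.array([18.0, 20.0, 22.0, 19.0, 21.0]),
    "Europe":  np.array([27.0, 30.0, 28.0, 29.0, 26.0]),
    "Japan":   np.array([30.0, 32.0, 31.0, 33.0, 29.0]),
}
all_y = np.concatenate(list(groups.values()))
grand_mean = all_y.mean()

ss_total = np.sum((all_y - grand_mean) ** 2)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups.values())

# F-ratio: explained variance over unexplained variance
k, n = len(groups), len(all_y)
f_ratio = (ss_between / (k - 1)) / (ss_within / (n - k))
```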

35 Means table The table of means shows the estimated mean of the dependent variable for each level of the independent variable: in this case, mean mpg's for cars from each continent. These are just ordinary means. However, the standard errors of the means are based on a pooled estimate of the standard deviation of the errors.

36 The box-and-whisker plot provides a nice visual comparison of the means, interquartile ranges, and extreme values. [Box-and-Whisker Plot of mpg by origin: the box spans the interquartile range (25%-tile to 75%-tile), the whiskers show the minimum and maximum if not "outside", and outside points lie beyond the box by more than 1.5x the interquartile range.] Here we see that the American cars have significantly lower mean mpg than European or Japanese cars, although the highest-mpg American car is in the upper quartile of the European and Japanese ranges. Also, although the European and Japanese cars have similar mean mpg's, the Japanese cars have a tighter distribution of mpg's except for two outliers, one high and one low.

37 For comparison, here's the same model fitted by using multiple regression with dummy variables for the first two origin codes. Note that the ANOVA table shows the same F-ratio, etc. The CONSTANT in this model is the mean for origin=3, and the coefficients of the dummy variables origin=1 and origin=2 are the differences in means for the other two levels. The standard error of the regression ( ) is the square root of the mean square for error ( ), and R-squared is the model SS ( ) divided by the total SS ( ). Thus, ANOVA is nothing really new; it's just a repackaging of regression output for the special case when the independent variables are dummies for levels of a categorical variable.
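The equivalence can be verified in a few lines (synthetic data; plain least squares stands in for the regression procedure):

```python
import numpy as np

# Regressing y on dummies for the first two origin codes reproduces the
# group means: CONSTANT = mean of origin 3, and each dummy coefficient is
# that group's difference from the origin-3 mean. Data are made up.
origin = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
mpg = np.array([20.0, 22.0, 21.0, 28.0, 27.0, 29.0, 31.0, 30.0, 32.0])

d1 = (origin == 1).astype(float)   # dummy for origin = 1
d2 = (origin == 2).astype(float)   # dummy for origin = 2
X = np.column_stack([np.ones(len(mpg)), d1, d2])
const, b1, b2 = np.linalg.lstsq(X, mpg, rcond=None)[0]

mean1, mean2, mean3 = (mpg[origin == k].mean() for k in (1, 2, 3))
```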

38 Multifactor ANOVA Multifactor ANOVA is regression with dummy variables for levels of two or more categorical independent variables. When there are two or more variables, you can estimate not only main effects, but also interactions among levels of two different variables. One of the questions of interest is whether the interactions are significant.

39 Multi-factor ANOVA: possible patterns in data [Bar charts: "A main effect only"; "B main effect only". These bar charts show hypothetical mean responses for 2 levels of factor A and 3 levels of factor B.]

40 Interactions between factors? [Bar charts: "Both A and B main effects, without interaction"; "Both A and B main effects, with interaction".]

41 Two factors: mpg vs. origin & year Here is a 2-factor ANOVA with no interactions: only main effects have been estimated. The ANOVA table now shows separate F-ratios for each of the two input variables, reflecting the joint significance of their respective dummy variables. (Both are significant here, but origin is more significant.)

42 Main effects (mean mpg) for origin & year The means table now shows means of the dependent variable for each level of both variables: these are the main effects. (The means by origin are slightly different from those of the one-way ANOVA, since the coefficients of the origin dummies are now being estimated simultaneously with those for year.)

43 [Two plots: "Means and 95.0 Percent LSD Intervals" of mpg by origin and by year. Here Europe and Japan have the same high average mpg.] The LSD (least significant difference) intervals are constructed in such a way that if two means are the same, their intervals will overlap 95.0% of the time. Any pair of intervals that do not overlap vertically corresponds to a pair of means which have a statistically significant difference.

44 Same model fitted by multiple regression: the differences among coefficients for different levels of the same variable are the same as the differences among means in the ANOVA output. These coefficients can be computed from the ones in the multifactor ANOVA output, but not vice versa: the multiple regression output does not show the grand mean.

45 Estimating interaction effects If the order of interactions is set to 2, additional dummy variables will be added for all possible combinations of a level of one variable and a level of the other variable.

46 Are interaction effects significant? The F-ratio for the variance explained by the interaction terms is not significant. Hence there is no significant interaction between origin and year. This means that variations of average mpg across years are essentially the same for each origin, and correspondingly, variations of average mpg across origins are essentially the same for each year.

47 Here are the details of the estimated interactions, as well as the main effects.

48 Categorical + quantitative? Suppose that you want to include a quantitative independent variable along with dummies for categorical variables. Example: suppose you want to include weight as an additional regressor to control for differences in average weights of cars from different countries of origin. This brings us to...

49 General linear models (GLM) GLM is a combination of multifactor ANOVA and ordinary multiple regression. You can specify both categorical and quantitative independent variables. You can also estimate interactions and nested effects.

50 Input variables for GLM

51 Effects and interactions to be estimated After the variables have been specified on the data input panel, this panel is used to specify interactions and/or nesting of effects (if any). To begin with, we will just look for main effects

52 Here's the ANOVA report. Note that weight has a very significant F-ratio. (For a quantitative variable, the F-ratio is simply the square of its t-stat. Here the F-ratio of 201 corresponds to a t-stat of around 14.)

53 Regression with out-of-sample validation! At the bottom of the Analysis Summary report are the usual regression statistics, including separate error stats for a validation period. If you use the Select box to hold out data in this (or any Advanced Regression) procedure, the de-selected points are used as the hold-out sample. Hence if you use the GLM procedure to fit a multiple regression model, you can perform out-of-sample validation!

54 Holding out a random sample An additional column is added to the data worksheet, with the name "random120". The Generate Data option is used with the expression RANDOM(120) to fill the column with 1's in 120 random places, 0's elsewhere. When this variable is used as the Select criterion, only the randomly chosen rows with 1's will be fitted.
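The same trick outside Statgraphics might look like this (a hypothetical helper sketching the RANDOM(120) idea with numpy):

```python
import numpy as np

def random_select(n_rows, n_fit, seed=None):
    """Return a 0/1 column with 1's in n_fit randomly chosen places.
    Used as a Select criterion, only those rows are fitted and the
    remaining rows form the hold-out sample."""
    rng = np.random.default_rng(seed)
    col = np.zeros(n_rows, dtype=int)
    col[rng.choice(n_rows, size=n_fit, replace=False)] = 1
    return col

select = random_select(155, 120, seed=3)   # 155 rows assumed for illustration
```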

55 Refitting the GLM model with a random hold-out sample Here the new random120 variable is used as the Select criterion. In this case it is appropriate for the hold-out sample to be determined randomly, because the variables are not time series and the rows are sorted according to the values of some of the independent variables. Hence holding out the last k values would not necessarily yield a representative sample.

56 Validation results Of the 120 rows that were randomly selected for fitting, only 119 had non-missing values for all independent variables. Note that MSE is actually smaller in the validation period (perhaps there was less variance in the hold-out sample), while MAPE is slightly larger. So, the model appears to be valid, i.e., not overfitted.

57 Back to the original model with no hold-out: here's the Model Coefficients report. It also includes Variance Inflation Factors to test for multicollinearity (VIF > 10 is "bad"). The dummy variables actually have values of +1/-1/0 instead of +1/0, although taken together they are equivalent to the usual dummy variables.

58 What is multicollinearity? Multicollinearity refers to a situation in which the independent variables are strongly linearly related to each other. When multicollinearity exists, the estimated coefficients may not represent the true effects of the variables, and standard errors will be inflated (variables may all appear to be insignificant despite a high R-squared). In the most extreme case, where one independent variable is an exact linear function of the others, the regression will fail to produce any results at all (you will get an error condition).

59 What are Variance Inflation Factors? The VIF for the k-th regressor is

VIF_k = 1 / (1 - R²_k)

where R²_k is the R-squared obtained by regressing the k-th regressor on all the other regressors. Thus, VIF_k = 1 when R²_k = 0. Severe multicollinearity is indicated if VIF_k > 10, which means R²_k > 90%.
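The definition translates directly into code (a sketch; the regress-on-the-others step is done with plain least squares, and the collinear data are synthetic):

```python
import numpy as np

def vif(X):
    """Variance inflation factors per the slide's definition:
    VIF_k = 1/(1 - R²_k), where R²_k comes from regressing the k-th
    column of X on all the other columns (plus a constant)."""
    n, m = X.shape
    out = []
    for k in range(m):
        others = np.column_stack([np.ones(n), np.delete(X, k, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, k], rcond=None)
        resid = X[:, k] - others @ beta
        r2 = 1 - resid @ resid / np.sum((X[:, k] - X[:, k].mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(4)
z = rng.normal(size=(100, 2))
# Third column is nearly a linear function of the first two -> severe VIF
X = np.column_stack([z, z[:, 0] + z[:, 1] + rng.normal(0, 0.05, 100)])
factors = vif(X)
```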

60 GLM example, continued [Plots: "Means and 95.0 Percent LSD Intervals" of mpg by origin and by year.] When we control for weight, the pattern of main effects is different: European cars get higher mileage for a given weight. Japanese cars evidently get high mileage by being lighter than American or European cars on average. Also, there has been a general upward trend in mpg, except for a drop in 1981.

61 [Residual plot: Studentized residual vs. predicted mpg.] Another nice feature of the GLM procedure: both autocorrelation and probability plots are pane options for the residual plot.

62 Probability plot in GLM [Normal probability plot for mpg: percentage vs. Studentized residual.] Here's the (vertical) probability plot. The slight S-shaped pattern indicates that the tails of the residual distribution are a bit fatter than normal, but nothing much to worry about in this case. (There is no simple transformation of the data that will make the distribution look any better; apparently a few cars are just exceptional.)

63 Here's the same model fitted in the Multiple Regression procedure instead. Note that the regression stats and the coefficient of weight are identical. The differences in coefficients between levels of the same factor are also identical.

64 GLM with interaction effects Hit the Cross button to insert the interaction operator (*). The GLM procedure can also be used to test for interaction ("cross") effects, exactly as in the ANOVA procedure, as well as to use nested experimental designs. Here this feature is used to look for interactions between the categorical factors while controlling for the effect of a quantitative factor (weight).

65 The F-ratio for the variance explained by the interaction between origin and year is larger when controlling for weight, but still not technically significant (F=1.76, P=0.089).

66 Summary of GLM features GLM is an "all-everything" procedure for fitting models with categorical and/or quantitative factors, with or without interaction effects. It can perform out-of-sample validation. Variance Inflation Factors (VIFs) are a test for multicollinearity (>10 is "bad"). It also includes a few more built-in plots (residual autocorrelation & probability plot).

67 Logistic regression Logistic regression is regression with a binary (0-1) dependent variable, e.g., an indicator variable for the occurrence of some event or condition. Applications: predicting probabilities of events or fractions of individuals who will respond to a given promotion or medical treatment, etc. The probabilistic prediction equation has this form:

Ŷt = exp(β0 + β1X1t + β2X2t + ...) / (1 + exp(β0 + β1X1t + β2X2t + ...))

In this case Ŷt is the predicted probability that Yt = 1.

68 Predictions expressed in terms of odds The predicted probability can equivalently be expressed in the form of odds in favor of Yt = 1:

Ŷt / (1 - Ŷt) = exp(β0 + β1X1t + β2X2t + ...) = exp(β0) · exp(β1)^X1t · exp(β2)^X2t · ...

Here exp(β0) is a constant odds factor, and exp(βi) is the odds factor for the i-th variable, raised to the power Xit. The predicted total odds is a product rather than a sum of contributions of the independent variables. If Xit increases by one unit, the predicted odds in favor of Yt = 1 increase by the factor exp(βi), other things being equal.

69 Predictions expressed in terms of log-odds The predicted probability can also be equivalently expressed in log-odds form:

log( Ŷt / (1 - Ŷt) ) = β0 + β1X1t + β2X2t + ...

Thus, logistic regression uses a linear regression equation to predict the log odds in favor of Yt = 1. However, you can't estimate the model by regressing log(Yt/(1 - Yt)) on the X's. (Can't take the log of zero!) In practice, the betas are estimated by a procedure that is similar to minimizing a weighted sum of the squared prediction errors Σ(Yt - Ŷt)².
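A compact sketch of one such iterative estimation procedure (Newton's method, i.e. iteratively reweighted least squares, on synthetic data; this illustrates the idea, not Statgraphics' exact algorithm), together with the three equivalent prediction forms from the last few slides:

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Minimal logistic regression by Newton's method (IRLS).
    X should include a constant column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)
        # Newton step: beta += (X'WX)^-1 X'(y - p)
        beta += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - p))
    return beta

rng = np.random.default_rng(5)
n = 500
x = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + 1.5 * x)))     # true betas: -0.5, 1.5
y = (rng.random(n) < p_true).astype(float)

X = np.column_stack([np.ones(n), x])
beta_hat = fit_logistic(X, y)

# The three equivalent prediction forms from the slides:
logit = X @ beta_hat                      # log odds in favor of Y=1
odds = np.exp(logit)                      # odds in favor of Y=1
prob = odds / (1 + odds)                  # predicted probability
```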

70 Logistic example: predicting magazine subscription responses by age and sex The dependent variable can either be a binary (0-1) variable (shown here) or it can be a vector of proportions or probabilities, together with a vector of sample sizes.

71 Coefficients, standard errors, and R-squared are interpreted in the same manner as in multiple regression. The odds ratio of an independent variable is just EXP(beta).

72 Differences in predicted subscription responses for male and female subjects

73 This plot shows a summary of the prediction capability of the fitted model. First, the model is used to predict the response using the information in each row of the data file. If the predicted value is larger than the cutoff, the response is predicted to be TRUE. If the predicted value is less than or equal to the cutoff, the response is predicted to be FALSE. The table shows the percent of the observed data correctly predicted at various cutoff values. For example, using a cutoff equal to 0.56, 75.0% of all TRUE responses were correctly predicted, while 95.0% of all FALSE responses were correctly predicted, for a total of 85.0%. Using the cutoff value which maximizes the total percentage correct may provide a good value to use for predicting additional individuals.
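The cutoff table is easy to reproduce for a toy example (hypothetical predicted probabilities and observed responses, not the magazine data):

```python
import numpy as np

def cutoff_table(prob, y, cutoffs):
    """Percent of TRUE and FALSE responses correctly predicted at each
    cutoff, plus the total percent correct (sketch of the slide's table)."""
    rows = []
    for c in cutoffs:
        pred = prob > c
        pct_true = 100.0 * np.mean(pred[y == 1])     # TRUEs predicted TRUE
        pct_false = 100.0 * np.mean(~pred[y == 0])   # FALSEs predicted FALSE
        total = 100.0 * np.mean(pred == (y == 1))
        rows.append((c, pct_true, pct_false, total))
    return rows

prob = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.6, 0.4, 0.1])
y = np.array([1, 1, 1, 0, 0, 1, 0, 0])
table = cutoff_table(prob, y, [0.25, 0.5, 0.75])
```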

74 Other advanced regression procedures Comparison of regression lines: fit several simple regressions to the same X & Y variables, splitting the data on levels of another variable. Nonlinear regression: estimate a model such as Y = 1/(a + b*X^c). Similar to Solver in Excel.

75 Resources You can find out more about these and other procedures via the Statgraphics help system, StatAdvisor, and user manuals (in pdf files in your Statgraphics directory). There are also many good on-line sources: the Statsoft on-line textbook and David Garson's on-line textbook (links are available on the Decision 411 course home page).


Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model Checking/Diagnostics Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics The session is a continuation of a version of Section 11.3 of MMD&S. It concerns

More information

Decision 411: Class 4

Decision 411: Class 4 Decision 411: Class 4 Non-seasonal averaging & smoothing models Simple moving average (SMA) model Simple exponential smoothing (SES) model Linear exponential smoothing (LES) model Combining seasonal adjustment

More information

y response variable x 1, x 2,, x k -- a set of explanatory variables

y response variable x 1, x 2,, x k -- a set of explanatory variables 11. Multiple Regression and Correlation y response variable x 1, x 2,, x k -- a set of explanatory variables In this chapter, all variables are assumed to be quantitative. Chapters 12-14 show how to incorporate

More information

Biostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras

Biostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras Biostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras Lecture - 39 Regression Analysis Hello and welcome to the course on Biostatistics

More information

Decision 411: Class 9. HW#3 issues

Decision 411: Class 9. HW#3 issues Decision 411: Class 9 Presentation/discussion of HW#3 Introduction to ARIMA models Rules for fitting nonseasonal models Differencing and stationarity Reading the tea leaves : : ACF and PACF plots Unit

More information

10 Model Checking and Regression Diagnostics

10 Model Checking and Regression Diagnostics 10 Model Checking and Regression Diagnostics The simple linear regression model is usually written as i = β 0 + β 1 i + ɛ i where the ɛ i s are independent normal random variables with mean 0 and variance

More information

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI Introduction of Data Analytics Prof. Nandan Sudarsanam and Prof. B Ravindran Department of Management Studies and Department of Computer Science and Engineering Indian Institute of Technology, Madras Module

More information

Decision 411: Class 4

Decision 411: Class 4 Decision 411: Class 4 Non-seasonal averaging & smoothing models Simple moving average (SMA) model Simple exponential smoothing (SES) model Linear exponential smoothing (LES) model Combining seasonal adjustment

More information

8. Example: Predicting University of New Mexico Enrollment

8. Example: Predicting University of New Mexico Enrollment 8. Example: Predicting University of New Mexico Enrollment year (1=1961) 6 7 8 9 10 6000 10000 14000 0 5 10 15 20 25 30 6 7 8 9 10 unem (unemployment rate) hgrad (highschool graduates) 10000 14000 18000

More information

Unit 11: Multiple Linear Regression

Unit 11: Multiple Linear Regression Unit 11: Multiple Linear Regression Statistics 571: Statistical Methods Ramón V. León 7/13/2004 Unit 11 - Stat 571 - Ramón V. León 1 Main Application of Multiple Regression Isolating the effect of a variable

More information

Trendlines Simple Linear Regression Multiple Linear Regression Systematic Model Building Practical Issues

Trendlines Simple Linear Regression Multiple Linear Regression Systematic Model Building Practical Issues Trendlines Simple Linear Regression Multiple Linear Regression Systematic Model Building Practical Issues Overfitting Categorical Variables Interaction Terms Non-linear Terms Linear Logarithmic y = a +

More information

Dr. Maddah ENMG 617 EM Statistics 11/28/12. Multiple Regression (3) (Chapter 15, Hines)

Dr. Maddah ENMG 617 EM Statistics 11/28/12. Multiple Regression (3) (Chapter 15, Hines) Dr. Maddah ENMG 617 EM Statistics 11/28/12 Multiple Regression (3) (Chapter 15, Hines) Problems in multiple regression: Multicollinearity This arises when the independent variables x 1, x 2,, x k, are

More information

What If There Are More Than. Two Factor Levels?

What If There Are More Than. Two Factor Levels? What If There Are More Than Chapter 3 Two Factor Levels? Comparing more that two factor levels the analysis of variance ANOVA decomposition of total variability Statistical testing & analysis Checking

More information

Predict y from (possibly) many predictors x. Model Criticism Study the importance of columns

Predict y from (possibly) many predictors x. Model Criticism Study the importance of columns Lecture Week Multiple Linear Regression Predict y from (possibly) many predictors x Including extra derived variables Model Criticism Study the importance of columns Draw on Scientific framework Experiment;

More information

LECTURE 15: SIMPLE LINEAR REGRESSION I

LECTURE 15: SIMPLE LINEAR REGRESSION I David Youngberg BSAD 20 Montgomery College LECTURE 5: SIMPLE LINEAR REGRESSION I I. From Correlation to Regression a. Recall last class when we discussed two basic types of correlation (positive and negative).

More information

The Steps to Follow in a Multiple Regression Analysis

The Steps to Follow in a Multiple Regression Analysis ABSTRACT The Steps to Follow in a Multiple Regression Analysis Theresa Hoang Diem Ngo, Warner Bros. Home Video, Burbank, CA A multiple regression analysis is the most powerful tool that is widely used,

More information

Performance of fourth-grade students on an agility test

Performance of fourth-grade students on an agility test Starter Ch. 5 2005 #1a CW Ch. 4: Regression L1 L2 87 88 84 86 83 73 81 67 78 83 65 80 50 78 78? 93? 86? Create a scatterplot Find the equation of the regression line Predict the scores Chapter 5: Understanding

More information

STAT 212 Business Statistics II 1

STAT 212 Business Statistics II 1 STAT 1 Business Statistics II 1 KING FAHD UNIVERSITY OF PETROLEUM & MINERALS DEPARTMENT OF MATHEMATICAL SCIENCES DHAHRAN, SAUDI ARABIA STAT 1: BUSINESS STATISTICS II Semester 091 Final Exam Thursday Feb

More information

Day 4: Shrinkage Estimators

Day 4: Shrinkage Estimators Day 4: Shrinkage Estimators Kenneth Benoit Data Mining and Statistical Learning March 9, 2015 n versus p (aka k) Classical regression framework: n > p. Without this inequality, the OLS coefficients have

More information

MATH 1150 Chapter 2 Notation and Terminology

MATH 1150 Chapter 2 Notation and Terminology MATH 1150 Chapter 2 Notation and Terminology Categorical Data The following is a dataset for 30 randomly selected adults in the U.S., showing the values of two categorical variables: whether or not the

More information

B. Weaver (24-Mar-2005) Multiple Regression Chapter 5: Multiple Regression Y ) (5.1) Deviation score = (Y i

B. Weaver (24-Mar-2005) Multiple Regression Chapter 5: Multiple Regression Y ) (5.1) Deviation score = (Y i B. Weaver (24-Mar-2005) Multiple Regression... 1 Chapter 5: Multiple Regression 5.1 Partial and semi-partial correlation Before starting on multiple regression per se, we need to consider the concepts

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

Multiple Regression. Peerapat Wongchaiwat, Ph.D.

Multiple Regression. Peerapat Wongchaiwat, Ph.D. Peerapat Wongchaiwat, Ph.D. wongchaiwat@hotmail.com The Multiple Regression Model Examine the linear relationship between 1 dependent (Y) & 2 or more independent variables (X i ) Multiple Regression Model

More information

Logistic Regression: Regression with a Binary Dependent Variable

Logistic Regression: Regression with a Binary Dependent Variable Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression

More information

Multiple Regression Analysis

Multiple Regression Analysis Multiple Regression Analysis y = β 0 + β 1 x 1 + β 2 x 2 +... β k x k + u 2. Inference 0 Assumptions of the Classical Linear Model (CLM)! So far, we know: 1. The mean and variance of the OLS estimators

More information

Chapter 1 Review of Equations and Inequalities

Chapter 1 Review of Equations and Inequalities Chapter 1 Review of Equations and Inequalities Part I Review of Basic Equations Recall that an equation is an expression with an equal sign in the middle. Also recall that, if a question asks you to solve

More information

Introducing Generalized Linear Models: Logistic Regression

Introducing Generalized Linear Models: Logistic Regression Ron Heck, Summer 2012 Seminars 1 Multilevel Regression Models and Their Applications Seminar Introducing Generalized Linear Models: Logistic Regression The generalized linear model (GLM) represents and

More information

176 Index. G Gradient, 4, 17, 22, 24, 42, 44, 45, 51, 52, 55, 56

176 Index. G Gradient, 4, 17, 22, 24, 42, 44, 45, 51, 52, 55, 56 References Aljandali, A. (2014). Exchange rate forecasting: Regional applications to ASEAN, CACM, MERCOSUR and SADC countries. Unpublished PhD thesis, London Metropolitan University, London. Aljandali,

More information

x3,..., Multiple Regression β q α, β 1, β 2, β 3,..., β q in the model can all be estimated by least square estimators

x3,..., Multiple Regression β q α, β 1, β 2, β 3,..., β q in the model can all be estimated by least square estimators Multiple Regression Relating a response (dependent, input) y to a set of explanatory (independent, output, predictor) variables x, x 2, x 3,, x q. A technique for modeling the relationship between variables.

More information

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011)

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011) Ron Heck, Fall 2011 1 EDEP 768E: Seminar in Multilevel Modeling rev. January 3, 2012 (see footnote) Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October

More information

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 Introduction to Generalized Univariate Models: Models for Binary Outcomes EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 EPSY 905: Intro to Generalized In This Lecture A short review

More information

Lecture 10: Alternatives to OLS with limited dependent variables. PEA vs APE Logit/Probit Poisson

Lecture 10: Alternatives to OLS with limited dependent variables. PEA vs APE Logit/Probit Poisson Lecture 10: Alternatives to OLS with limited dependent variables PEA vs APE Logit/Probit Poisson PEA vs APE PEA: partial effect at the average The effect of some x on y for a hypothetical case with sample

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

7. Assumes that there is little or no multicollinearity (however, SPSS will not assess this in the [binary] Logistic Regression procedure).

7. Assumes that there is little or no multicollinearity (however, SPSS will not assess this in the [binary] Logistic Regression procedure). 1 Neuendorf Logistic Regression The Model: Y Assumptions: 1. Metric (interval/ratio) data for 2+ IVs, and dichotomous (binomial; 2-value), categorical/nominal data for a single DV... bear in mind that

More information

Week 8 Hour 1: More on polynomial fits. The AIC

Week 8 Hour 1: More on polynomial fits. The AIC Week 8 Hour 1: More on polynomial fits. The AIC Hour 2: Dummy Variables Hour 3: Interactions Stat 302 Notes. Week 8, Hour 3, Page 1 / 36 Interactions. So far we have extended simple regression in the following

More information

Regression Analysis. BUS 735: Business Decision Making and Research. Learn how to detect relationships between ordinal and categorical variables.

Regression Analysis. BUS 735: Business Decision Making and Research. Learn how to detect relationships between ordinal and categorical variables. Regression Analysis BUS 735: Business Decision Making and Research 1 Goals of this section Specific goals Learn how to detect relationships between ordinal and categorical variables. Learn how to estimate

More information

Multiple Regression: Chapter 13. July 24, 2015

Multiple Regression: Chapter 13. July 24, 2015 Multiple Regression: Chapter 13 July 24, 2015 Multiple Regression (MR) Response Variable: Y - only one response variable (quantitative) Several Predictor Variables: X 1, X 2, X 3,..., X p (p = # predictors)

More information

Hypothesis T e T sting w ith with O ne O One-Way - ANOV ANO A V Statistics Arlo Clark Foos -

Hypothesis T e T sting w ith with O ne O One-Way - ANOV ANO A V Statistics Arlo Clark Foos - Hypothesis Testing with One-Way ANOVA Statistics Arlo Clark-Foos Conceptual Refresher 1. Standardized z distribution of scores and of means can be represented as percentile rankings. 2. t distribution

More information

Lecture 4: Multivariate Regression, Part 2

Lecture 4: Multivariate Regression, Part 2 Lecture 4: Multivariate Regression, Part 2 Gauss-Markov Assumptions 1) Linear in Parameters: Y X X X i 0 1 1 2 2 k k 2) Random Sampling: we have a random sample from the population that follows the above

More information

TESTING FOR CO-INTEGRATION

TESTING FOR CO-INTEGRATION Bo Sjö 2010-12-05 TESTING FOR CO-INTEGRATION To be used in combination with Sjö (2008) Testing for Unit Roots and Cointegration A Guide. Instructions: Use the Johansen method to test for Purchasing Power

More information

Regression in R. Seth Margolis GradQuant May 31,

Regression in R. Seth Margolis GradQuant May 31, Regression in R Seth Margolis GradQuant May 31, 2018 1 GPA What is Regression Good For? Assessing relationships between variables This probably covers most of what you do 4 3.8 3.6 3.4 Person Intelligence

More information

AP Final Review II Exploring Data (20% 30%)

AP Final Review II Exploring Data (20% 30%) AP Final Review II Exploring Data (20% 30%) Quantitative vs Categorical Variables Quantitative variables are numerical values for which arithmetic operations such as means make sense. It is usually a measure

More information

Regression: Ordinary Least Squares

Regression: Ordinary Least Squares Regression: Ordinary Least Squares Mark Hendricks Autumn 2017 FINM Intro: Regression Outline Regression OLS Mathematics Linear Projection Hendricks, Autumn 2017 FINM Intro: Regression: Lecture 2/32 Regression

More information

How To: Analyze a Split-Plot Design Using STATGRAPHICS Centurion

How To: Analyze a Split-Plot Design Using STATGRAPHICS Centurion How To: Analyze a SplitPlot Design Using STATGRAPHICS Centurion by Dr. Neil W. Polhemus August 13, 2005 Introduction When performing an experiment involving several factors, it is best to randomize the

More information

Topic 18: Model Selection and Diagnostics

Topic 18: Model Selection and Diagnostics Topic 18: Model Selection and Diagnostics Variable Selection We want to choose a best model that is a subset of the available explanatory variables Two separate problems 1. How many explanatory variables

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Topic 1. Definitions

Topic 1. Definitions S Topic. Definitions. Scalar A scalar is a number. 2. Vector A vector is a column of numbers. 3. Linear combination A scalar times a vector plus a scalar times a vector, plus a scalar times a vector...

More information

Using Microsoft Excel

Using Microsoft Excel Using Microsoft Excel Objective: Students will gain familiarity with using Excel to record data, display data properly, use built-in formulae to do calculations, and plot and fit data with linear functions.

More information

Mathematical Notation Math Introduction to Applied Statistics

Mathematical Notation Math Introduction to Applied Statistics Mathematical Notation Math 113 - Introduction to Applied Statistics Name : Use Word or WordPerfect to recreate the following documents. Each article is worth 10 points and can be printed and given to the

More information

Sociology 593 Exam 1 Answer Key February 17, 1995

Sociology 593 Exam 1 Answer Key February 17, 1995 Sociology 593 Exam 1 Answer Key February 17, 1995 I. True-False. (5 points) Indicate whether the following statements are true or false. If false, briefly explain why. 1. A researcher regressed Y on. When

More information

Stochastic Processes

Stochastic Processes qmc082.tex. Version of 30 September 2010. Lecture Notes on Quantum Mechanics No. 8 R. B. Griffiths References: Stochastic Processes CQT = R. B. Griffiths, Consistent Quantum Theory (Cambridge, 2002) DeGroot

More information

Linear Regression. Linear Regression. Linear Regression. Did You Mean Association Or Correlation?

Linear Regression. Linear Regression. Linear Regression. Did You Mean Association Or Correlation? Did You Mean Association Or Correlation? AP Statistics Chapter 8 Be careful not to use the word correlation when you really mean association. Often times people will incorrectly use the word correlation

More information

appstats8.notebook October 11, 2016

appstats8.notebook October 11, 2016 Chapter 8 Linear Regression Objective: Students will construct and analyze a linear model for a given set of data. Fat Versus Protein: An Example pg 168 The following is a scatterplot of total fat versus

More information

Model Selection. Frank Wood. December 10, 2009

Model Selection. Frank Wood. December 10, 2009 Model Selection Frank Wood December 10, 2009 Standard Linear Regression Recipe Identify the explanatory variables Decide the functional forms in which the explanatory variables can enter the model Decide

More information

Linear Modelling in Stata Session 6: Further Topics in Linear Modelling

Linear Modelling in Stata Session 6: Further Topics in Linear Modelling Linear Modelling in Stata Session 6: Further Topics in Linear Modelling Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 14/11/2017 This Week Categorical Variables Categorical

More information

Chapter 13. Multiple Regression and Model Building

Chapter 13. Multiple Regression and Model Building Chapter 13 Multiple Regression and Model Building Multiple Regression Models The General Multiple Regression Model y x x x 0 1 1 2 2... k k y is the dependent variable x, x,..., x 1 2 k the model are the

More information

Modeling Machiavellianism Predicting Scores with Fewer Factors

Modeling Machiavellianism Predicting Scores with Fewer Factors Modeling Machiavellianism Predicting Scores with Fewer Factors ABSTRACT RESULTS Prince Niccolo Machiavelli said things on the order of, The promise given was a necessity of the past: the word broken is

More information

Elementary Statistics

Elementary Statistics Elementary Statistics Q: What is data? Q: What does the data look like? Q: What conclusions can we draw from the data? Q: Where is the middle of the data? Q: Why is the spread of the data important? Q:

More information

Specific Differences. Lukas Meier, Seminar für Statistik

Specific Differences. Lukas Meier, Seminar für Statistik Specific Differences Lukas Meier, Seminar für Statistik Problem with Global F-test Problem: Global F-test (aka omnibus F-test) is very unspecific. Typically: Want a more precise answer (or have a more

More information

Statistics for Managers using Microsoft Excel 6 th Edition

Statistics for Managers using Microsoft Excel 6 th Edition Statistics for Managers using Microsoft Excel 6 th Edition Chapter 3 Numerical Descriptive Measures 3-1 Learning Objectives In this chapter, you learn: To describe the properties of central tendency, variation,

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

PBAF 528 Week 8. B. Regression Residuals These properties have implications for the residuals of the regression.

PBAF 528 Week 8. B. Regression Residuals These properties have implications for the residuals of the regression. PBAF 528 Week 8 What are some problems with our model? Regression models are used to represent relationships between a dependent variable and one or more predictors. In order to make inference from the

More information

The General Linear Model Ivo Dinov

The General Linear Model Ivo Dinov Stats 33 Statistical Methods for Biomedical Data The General Linear Model Ivo Dinov dinov@stat.ucla.edu http://www.stat.ucla.edu/~dinov Slide 1 Problems with t-tests and correlations 1) How do we evaluate

More information

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore What is Multiple Linear Regression Several independent variables may influence the change in response variable we are trying to study. When several independent variables are included in the equation, the

More information

Analysing data: regression and correlation S6 and S7

Analysing data: regression and correlation S6 and S7 Basic medical statistics for clinical and experimental research Analysing data: regression and correlation S6 and S7 K. Jozwiak k.jozwiak@nki.nl 2 / 49 Correlation So far we have looked at the association

More information

Multiple Regression. Midterm results: AVG = 26.5 (88%) A = 27+ B = C =

Multiple Regression. Midterm results: AVG = 26.5 (88%) A = 27+ B = C = Economics 130 Lecture 6 Midterm Review Next Steps for the Class Multiple Regression Review & Issues Model Specification Issues Launching the Projects!!!!! Midterm results: AVG = 26.5 (88%) A = 27+ B =

More information

ISQS 5349 Spring 2013 Final Exam

ISQS 5349 Spring 2013 Final Exam ISQS 5349 Spring 2013 Final Exam Name: General Instructions: Closed books, notes, no electronic devices. Points (out of 200) are in parentheses. Put written answers on separate paper; multiple choices

More information

Chapter 4: Regression Models

Chapter 4: Regression Models Sales volume of company 1 Textbook: pp. 129-164 Chapter 4: Regression Models Money spent on advertising 2 Learning Objectives After completing this chapter, students will be able to: Identify variables,

More information

FORECASTING PROCEDURES FOR SUCCESS

FORECASTING PROCEDURES FOR SUCCESS FORECASTING PROCEDURES FOR SUCCESS Suzy V. Landram, University of Northern Colorado Frank G. Landram, West Texas A&M University Chris Furner, West Texas A&M University ABSTRACT This study brings an awareness

More information

Ch 7: Dummy (binary, indicator) variables

Ch 7: Dummy (binary, indicator) variables Ch 7: Dummy (binary, indicator) variables :Examples Dummy variable are used to indicate the presence or absence of a characteristic. For example, define female i 1 if obs i is female 0 otherwise or male

More information

Contingency Tables. Safety equipment in use Fatal Non-fatal Total. None 1, , ,128 Seat belt , ,878

Contingency Tables. Safety equipment in use Fatal Non-fatal Total. None 1, , ,128 Seat belt , ,878 Contingency Tables I. Definition & Examples. A) Contingency tables are tables where we are looking at two (or more - but we won t cover three or more way tables, it s way too complicated) factors, each

More information

Linear regression. Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X 1,X 2,...X p is linear.

Linear regression. Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X 1,X 2,...X p is linear. Linear regression Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X 1,X 2,...X p is linear. 1/48 Linear regression Linear regression is a simple approach

More information

2 Prediction and Analysis of Variance

2 Prediction and Analysis of Variance 2 Prediction and Analysis of Variance Reading: Chapters and 2 of Kennedy A Guide to Econometrics Achen, Christopher H. Interpreting and Using Regression (London: Sage, 982). Chapter 4 of Andy Field, Discovering

More information

PLS205 Lab 6 February 13, Laboratory Topic 9

PLS205 Lab 6 February 13, Laboratory Topic 9 PLS205 Lab 6 February 13, 2014 Laboratory Topic 9 A word about factorials Specifying interactions among factorial effects in SAS The relationship between factors and treatment Interpreting results of an

More information

REVIEW 8/2/2017 陈芳华东师大英语系

REVIEW 8/2/2017 陈芳华东师大英语系 REVIEW Hypothesis testing starts with a null hypothesis and a null distribution. We compare what we have to the null distribution, if the result is too extreme to belong to the null distribution (p

More information

DOE Wizard Screening Designs

DOE Wizard Screening Designs DOE Wizard Screening Designs Revised: 10/10/2017 Summary... 1 Example... 2 Design Creation... 3 Design Properties... 13 Saving the Design File... 16 Analyzing the Results... 17 Statistical Model... 18

More information

MATH 10 INTRODUCTORY STATISTICS

MATH 10 INTRODUCTORY STATISTICS MATH 10 INTRODUCTORY STATISTICS Tommy Khoo Your friendly neighbourhood graduate student. Week 1 Chapter 1 Introduction What is Statistics? Why do you need to know Statistics? Technical lingo and concepts:

More information

MULTIPLE LINEAR REGRESSION IN MINITAB

MULTIPLE LINEAR REGRESSION IN MINITAB MULTIPLE LINEAR REGRESSION IN MINITAB This document shows a complicated Minitab multiple regression. It includes descriptions of the Minitab commands, and the Minitab output is heavily annotated. Comments

More information

How To: Deal with Heteroscedasticity Using STATGRAPHICS Centurion

How To: Deal with Heteroscedasticity Using STATGRAPHICS Centurion How To: Deal with Heteroscedasticity Using STATGRAPHICS Centurion by Dr. Neil W. Polhemus July 28, 2005 Introduction When fitting statistical models, it is usually assumed that the error variance is the

More information

Homework 2. For the homework, be sure to give full explanations where required and to turn in any relevant plots.

Homework 2. For the homework, be sure to give full explanations where required and to turn in any relevant plots. Homework 2 1 Data analysis problems For the homework, be sure to give full explanations where required and to turn in any relevant plots. 1. The file berkeley.dat contains average yearly temperatures for

More information