19.0 Practical Issues in Regression

Bernice McKenzie
5 years ago

Outline:
- Answer Questions
- Nonparametric Regression
- Residual Plots
- Extrapolation

19.1 Nonparametric Regression

Recall that the multiple linear regression model is

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon$$

where $\mathbb{E}[\epsilon] = 0$, $\mathrm{Var}[\epsilon] = \sigma^2$, and the $\epsilon$ are independent.

The model is useful because:
- it is interpretable: the effect of each explanatory variable is captured by a single coefficient
- theory supports inference, and prediction is easy
- simple interactions and transformations are easy (how?)
- dummy variables allow use of categorical information
- computation is fast.

We extended the multiple linear regression model to nonlinear regression, in which we fit a model of the form

$$g_0(Y) = \beta_0 + \beta_1 g_1(X_1) + \cdots + \beta_p g_p(X_p) + \epsilon$$

where the $g_i$ are known transformations of the data, such as $\log Y$ or $1/X_1$, and, as before, $\mathbb{E}[\epsilon] = 0$, $\mathrm{Var}[\epsilon] = \sigma^2$, and the $\epsilon$ are independent.

This model can be further extended to nonparametric regression, in which one does not know the functions $g_1, \ldots, g_p$ but instead must estimate them by smoothing the data.

In applications, the linear regression model is usually only a locally correct approximation, and it is rare that one has a strong theoretical model that prescribes specific nonlinear transformations. Thus nonparametric regression is a practical tool in many cases.

As a running example for the next several pages, assume we have data generated from the following function by adding N(0, 0.25) noise.

The x values were chosen to be spaced out at the left and right sides of the domain, and the raw data are shown below.

Bin Smoothing

Here one partitions the x-axis into disjoint bins; e.g., take $\{[i, i+1) : i \in \mathbb{Z}\}$. Within each bin, average the Y values to obtain a smooth that is a step function.
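Bin smoothing can be sketched in a few lines of Python. This is only an illustration: the function name `bin_smooth`, the `width` parameter, and the toy data are assumptions, not part of the lecture.

```python
import math
from collections import defaultdict


def bin_smooth(xs, ys, width=1.0):
    """Return a step function: the mean of y over the bin [k*width, (k+1)*width) containing x."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in zip(xs, ys):
        k = math.floor(x / width)  # index of the bin that holds x
        sums[k] += y
        counts[k] += 1
    means = {k: sums[k] / counts[k] for k in sums}

    def fitted(x):
        # Returns None when the bin holding x contains no observations.
        return means.get(math.floor(x / width))

    return fitted


# Hypothetical toy data, with unit-width bins [i, i + 1).
xs = [0.1, 0.5, 0.9, 1.2, 1.8, 2.5]
ys = [1.0, 2.0, 3.0, 4.0, 6.0, 5.0]
f = bin_smooth(xs, ys)
print(f(0.3), f(1.5), f(2.0), f(3.5))
```

The fitted value is constant within each bin and jumps at the bin edges, which is why the smooth is a step function.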

Moving Averages

Moving averages use variable bins containing a fixed number of observations, rather than fixed-width bins with a variable number of observations. They tend to wiggle near the center of the data, but flatten out near the boundary of the data.
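A minimal moving-average sketch follows. It assumes one particular convention: the window always holds exactly `k` observations, so near the ends it stops sliding and reuses the same `k` points. The name `moving_average` and the toy data are illustrative.

```python
def moving_average(xs, ys, k=3):
    """Average y over a window of exactly k observations around each x (in x-order)."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of observations")
    half = k // 2
    fits = []
    for i in range(n):
        # Clamp the window so it always holds k points, even at the boundaries.
        start = min(max(i - half, 0), n - k)
        window = pairs[start:start + k]
        fits.append((pairs[i][0], sum(y for _, y in window) / k))
    return fits


# Hypothetical toy data.
xs = [1, 2, 3, 4, 5]
ys = [1, 3, 2, 6, 4]
smooth = moving_average(xs, ys, k=3)
for x, fit in smooth:
    print(x, fit)
```

Under this convention the first two and the last two fitted values coincide, because the window cannot move any further out. That is one way to see the flattening at the boundary.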

Running Line

This improves on the moving average by fitting a line rather than an average to the data within a variable-width bin. But it still tends to be rough.
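A running-line smoother can be sketched as follows, assuming a window of exactly `k` observations (clamped at the boundaries) and an ordinary least-squares line within each window. The function name and the toy data are illustrative.

```python
def running_line(xs, ys, k=5):
    """Fit a least-squares line to the k observations around each x and evaluate it at x."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of observations")
    half = k // 2
    fits = []
    for i in range(n):
        start = min(max(i - half, 0), n - k)  # window always holds k points
        wx = [x for x, _ in pairs[start:start + k]]
        wy = [y for _, y in pairs[start:start + k]]
        mx, my = sum(wx) / k, sum(wy) / k
        sxx = sum((x - mx) ** 2 for x in wx)
        sxy = sum((x - mx) * (y - my) for x, y in zip(wx, wy))
        slope = sxy / sxx if sxx > 0 else 0.0  # fall back to the window mean if all x are equal
        x0 = pairs[i][0]
        fits.append((x0, my + slope * (x0 - mx)))
    return fits


# Hypothetical check: exactly linear data should be reproduced by the smoother.
xs = [0, 1, 2, 3, 4, 5]
ys = [2 * x + 1 for x in xs]
line_fit = running_line(xs, ys, k=3)
print(line_fit)
```

Because a line is fitted locally, a linear trend is reproduced without the boundary flattening of a plain moving average.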

Regression becomes much harder as the number of explanatory variables increases. This is called the Curse of Dimensionality (COD). The term was coined by Richard Bellman in the context of approximation theory.

The COD applies to all multivariate regressions that do not impose strong modeling assumptions, especially the nonparametric regressions, but also those in which one tests whether a specific variable or transformed variable should be included in the model.

In terms of the sample size n and dimension p, the COD has three nearly equivalent descriptions:
- For fixed n, as p increases, the data become sparse.
- As p increases, the number of possible models explodes.
- For large p, most datasets are multicollinear.

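For the sparsity description of the COD, let n points be uniformly distributed in the unit cube in $\mathbb{R}^p$. What is the side length $l$ of a subcube that is expected to contain a fraction $d$ of the data?

Answer: the expected fraction equals the volume of the subcube, so $l^p = d$ and $l = d^{1/p}$.

For large p, a subcube that holds even a small fraction of the data must span almost the full range of every variable, so the amount of local information available to fit bumps and wiggles in f is too small.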
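To get a feel for the side-length formula, the short sketch below evaluates $l = d^{1/p}$ for a few dimensions. The function name `subcube_side` and the choice d = 0.1 are illustrative.

```python
def subcube_side(d, p):
    """Side length of a subcube of the unit cube in R^p with volume d."""
    return d ** (1.0 / p)


# Side length needed to capture 10% of uniformly distributed data.
for p in (1, 2, 10, 100):
    print(p, round(subcube_side(0.1, p), 3))
```

With d = 0.1, the side length is about 0.79 for p = 10 and about 0.98 for p = 100, so a "local" neighborhood covers nearly the whole range of each variable.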

To explain the model explosion aspect, suppose we restrict attention to just linear models of degree 2 or less. For p = 1 these are:
- $\mathbb{E}[Y] = \beta_0$
- $\mathbb{E}[Y] = \beta_1 x_1$
- $\mathbb{E}[Y] = \beta_2 x_1^2$
- $\mathbb{E}[Y] = \beta_0 + \beta_1 x_1$
- $\mathbb{E}[Y] = \beta_0 + \beta_2 x_1^2$
- $\mathbb{E}[Y] = \beta_1 x_1 + \beta_2 x_1^2$
- $\mathbb{E}[Y] = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2$

For p = 2 this set is extended to include expressions with the terms $\alpha_1 x_2$, $\alpha_2 x_2^2$, and $\gamma_{12} x_1 x_2$.

For general p, combinatorics shows that the number of possible models is

$$2^{\,1 + 2p + \binom{p}{2}} - 1.$$

This increases superexponentially in p, and the sample is not large enough for the data to discriminate among these models.
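The count can be checked numerically. The sketch below assumes the candidate terms are the intercept, the p linear terms, the p squared terms, and the $\binom{p}{2}$ pairwise products, and that a model is any non-empty subset of these terms. The function names are illustrative.

```python
from math import comb


def n_terms(p):
    """Intercept + p linear terms + p squares + C(p, 2) cross-products."""
    return 1 + 2 * p + comb(p, 2)


def n_models(p):
    """Number of non-empty subsets of the candidate terms."""
    return 2 ** n_terms(p) - 1


for p in (1, 2, 3, 10):
    print(p, n_terms(p), n_models(p))
```

This gives 7 models for p = 1, matching the list above, and 63 for p = 2. By p = 10 there are 66 candidate terms and hence $2^{66} - 1$ models.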

For the multicollinearity issue, we note that multicollinearity occurs when two or more of the explanatory variables are highly correlated. This implies that the predictive value of the fitted model breaks down quickly as one moves away from the subspace in which the data concentrate. (Insert a physical demonstration.)

In this class, we shall agree that multicollinearity occurs whenever the absolute value of the correlation between two of the explanatory variables exceeds 0.9. But this is a judgment call, and one can have multicollinearity that arises in more complex ways.

For large p with finite n, it is almost certain that two explanatory variables will have high correlation, just by chance.
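The chance-correlation effect can be explored with a small simulation. The sketch below draws independent standard normal columns (so any correlation is spurious) and reports the largest absolute pairwise correlation among the first p of them. The sample size n = 10, the seed, and all names are assumptions chosen for illustration.

```python
import math
import random


def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def max_abs_corr(columns):
    """Largest absolute correlation over all pairs of columns."""
    return max(
        abs(corr(columns[i], columns[j]))
        for i in range(len(columns))
        for j in range(i + 1, len(columns))
    )


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10                  # small sample size
columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(200)]

# Nested subsets of the same columns, so the maximum can only grow with p.
results = {p: max_abs_corr(columns[:p]) for p in (2, 10, 50, 200)}
for p, r in results.items():
    print(p, round(r, 2))
```

Because the column sets are nested, the reported maximum never decreases as p grows; comparing it with the 0.9 threshold shows how quickly unrelated variables can look collinear in a small sample.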

19.2 Variable Selection

One wants to select a multiple regression model that only includes useful variables. Some methods are:
- Forward Selection. Start with no variables in the model, and sequentially add the one that best explains the current residuals (or the raw data, at the initial step). Stop when none of the remaining variables provides significant explanation.
- Backwards Elimination. Start with all the variables in the model, and sequentially remove the variable that explains the least, until a t-test shows that no further variables should be removed.
- Stepwise Regression. Alternate use of forward selection and backwards elimination.

None of these is bulletproof.
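The forward-selection idea can be sketched as below. This is a deliberately simplified variant, not a full implementation: each step regresses the current residuals on a single candidate variable and does not refit earlier coefficients jointly, and the stopping rule is a hypothetical threshold (`min_gain`, a fraction of the total sum of squares) standing in for a formal significance test. All names and data are illustrative.

```python
def center(v):
    """Subtract the mean from every entry."""
    m = sum(v) / len(v)
    return [a - m for a in v]


def rss_gain(x, resid):
    """Drop in residual sum of squares from regressing resid on the centered x."""
    sxx = sum(a * a for a in x)
    sxr = sum(a * r for a, r in zip(x, resid))
    return sxr * sxr / sxx if sxx > 0 else 0.0


def forward_select(columns, y, min_gain=0.05):
    """Greedily add the variable that best explains the current residuals.

    columns: dict mapping variable name -> list of values.
    Stops when the best remaining gain is at most min_gain * total sum of squares
    (an assumed stand-in for a significance test).
    """
    resid = center(y)
    total = sum(r * r for r in resid)
    remaining = {name: center(col) for name, col in columns.items()}
    chosen = []
    while remaining:
        best = max(remaining, key=lambda name: rss_gain(remaining[name], resid))
        if rss_gain(remaining[best], resid) <= min_gain * total:
            break
        x = remaining.pop(best)
        slope = sum(a * r for a, r in zip(x, resid)) / sum(a * a for a in x)
        resid = [r - slope * a for a, r in zip(x, resid)]  # update the residuals
        chosen.append(best)
    return chosen


# Hypothetical data: y depends on x1 only; x2 is an unrelated alternating pattern.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1, -1, 1, -1, 1, -1]
noise = [0.1, -0.2, 0.1, 0.0, -0.1, 0.1]
y = [3 * a + e for a, e in zip(x1, noise)]
selected = forward_select({"x1": x1, "x2": x2}, y)
print(selected)
```

In practice one would refit the full model after each addition and use a t-test or F-test to decide when to stop; the threshold here only keeps the sketch short.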

19.3 Cross-Validation

To assess model fit in complex, computer-intensive situations, the ideal strategy is to hold out a random portion of the data, fit a model to the rest, then use the fitted model to predict the response values from the values of the explanatory variables in the hold-out sample. This allows a straightforward estimate of the error in prediction using regression.

But we usually need to compare fits among many models. If the same hold-out sample is re-used, then the comparisons are not independent and (worse) the model selection process will tend to choose a model that overfits the hold-out sample, causing spurious optimism.

Cross-validation is a procedure that balances the need to use data to select a model and the need to use data to assess prediction. Specifically, v-fold cross-validation is as follows:
- randomly divide the sample into v portions;
- for i = 1,...,v, hold out portion i and fit the model from the rest of the data;
- for i = 1,...,v, use the fitted model to predict the hold-out sample;
- average the predictive mean squared error (PMSE) over the v different fits.

One repeats these steps (including the random division of the sample!) each time a new model is assessed. The choice of v requires judgment. Often v = 10.
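The v-fold steps above can be sketched as follows. The two candidate models (a constant mean and a simple least-squares line), the simulated data, and all function names are assumptions chosen for illustration; the sketch also assumes the sample has at least v points so that no portion is empty.

```python
import random


def fit_mean(train):
    """Candidate model 1: predict the training mean everywhere."""
    m = sum(y for _, y in train) / len(train)
    return lambda x: m


def fit_line(train):
    """Candidate model 2: simple least-squares line."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx if sxx > 0 else 0.0
    return lambda x: my + slope * (x - mx)


def cv_pmse(data, fit, v=10, rng=None):
    """Average hold-out mean squared prediction error over v random portions."""
    rng = rng or random.Random()
    shuffled = data[:]
    rng.shuffle(shuffled)                       # fresh random division on every call
    folds = [shuffled[i::v] for i in range(v)]  # v roughly equal portions
    fold_errors = []
    for i in range(v):
        train = [pt for j, fold in enumerate(folds) if j != i for pt in fold]
        predict = fit(train)                    # fit without portion i
        sq = [(y - predict(x)) ** 2 for x, y in folds[i]]
        fold_errors.append(sum(sq) / len(sq))   # PMSE on the hold-out portion
    return sum(fold_errors) / v


# Hypothetical data: a linear trend plus normal noise.
rng = random.Random(1)
data = [(i / 10, 2.0 + 0.5 * (i / 10) + rng.gauss(0, 0.5)) for i in range(100)]
pmse_mean = cv_pmse(data, fit_mean, v=10, rng=rng)
pmse_line = cv_pmse(data, fit_line, v=10, rng=rng)
print(pmse_mean, pmse_line)
```

Each call to `cv_pmse` reshuffles the data, so every model is assessed on its own random division of the sample, as the procedure requires. Here the line should show a clearly smaller PMSE than the constant mean, since the simulated data have a real trend.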

19.5 Case Study

You should never believe your model. Personally, I'm sometimes willing to believe the binomial model applies, but for nearly every other situation, the mechanisms that generate the data just do not quite match the simple assumptions that underlie the named probability distributions.

George Box said: "All models are wrong, but some are useful."

Economists look at a lot of data and often attempt to fit it by models. Be wary. Always plot your data.

One can do goodness-of-fit tests to see whether the data conform with a particular model, but this has dangers too, especially with very large samples.

When deciding on which model to use to describe a data set, one should consider:
- Do you believe that a simple, single probability distribution generated the data?
- Do the data have some natural support set? (The support set is the set on which the probability mass function or density function is positive.)
- Do you believe the data are roughly symmetrically distributed about the mean? Or is there skewness?
- Will the data have fat tails? (That is, are there likely to be some exceptionally large or small values, compared to what one would see in a sample from a normal distribution?)
- Do you understand the measurement process that acquired the data?

Beware of premature framing of a problem.

In January 1985, a team of engineers at Morton Thiokol was tasked to study O-ring failures in space shuttle launches. They were given information on all the launches in which O-ring failures occurred, and related data on temperature, manufacturing history, and so forth.

The engineers looked at all the variables. Temperature did not stand out.

On January 28, 1986, when the executives at Morton Thiokol were asked by NASA whether they objected to greenlighting the launch given the unusual cold weather at Cape Kennedy, they contacted their engineers and asked their opinion.

The engineers, led by Roger Boisjoly, were nervous and tried to stop the flight. The Morton Thiokol management agreed that the issue was serious enough to recommend delaying the flight, and they arranged a telephone conference with NASA. However, during the call, the Morton Thiokol managers asked for a few minutes off the phone to discuss their final position again.

The Morton Thiokol managers decided to advise NASA that their data was inconclusive. NASA asked if there were objections. Hearing none, the decision to launch was made.

The engineers should have looked at all the data, not just the data on failures.

Roger Boisjoly was one of the witnesses at the Rogers Commission. After the Commission gave its findings, Boisjoly found himself shunned by colleagues and managers, and he resigned from Morton Thiokol.

Subsequently, Roger Boisjoly wrote: "[S]ome may argue that sufficient funds or schedule were not available and that may be so, but MTI contracted for that condition. The Shuttle program was declared operational by NASA after the fourth flight, but the technical problems in producing and maintaining the reusable boosters were escalating rapidly as the program matured, instead of decreasing as one would normally expect. Many opportunities were available to structure the work force for corrective action, but the MTI Management style would not let anything compete or interfere with the production and shipping of boosters."
