Introduction to Regression


1 PSU Summer School in Statistics Regression Slide 1 Introduction to Regression Chad M. Schafer Department of Statistics & Data Science, Carnegie Mellon University cschafer@cmu.edu June 2018

2 PSU Summer School in Statistics Regression Slide 2 Outline Motivation Linear Models Nonparametric Regression The Curse of Dimensionality Additive Models

3 PSU Summer School in Statistics Regression Slide 3 Motivation The objective is to construct a model that relates the response variable Y to the predictor variables X_1, X_2, ..., X_p. Example 1: Predicting redshift from photometry, with Y = redshift and X_1, X_2, X_3, X_4, X_5 = magnitudes in the u, g, r, i, z bands. Example 2: Modelling the relationship between distance modulus and redshift for Type Ia Supernovae, with Y = distance modulus and X_1 = redshift.

4 PSU Summer School in Statistics Regression Slide 4 [Figure: distance modulus versus redshift, from Riess, et al. (2007), 182 Type Ia Supernovae.]

5 PSU Summer School in Statistics Regression Slide 5 The ΛCDM model proposes a simple model for the relationship between redshift (z) and distance modulus:

Distance Modulus = 5 log10[ (c(1+z)/H_0) ∫_0^z ( Ω_m(1+u)^3 + (1 - Ω_m) )^{-1/2} du ] + 25

This depends on two cosmological parameters, Ω_m and H_0. Hence learning this relationship can help constrain these parameters.
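
The formula above can be evaluated numerically. Below is a minimal sketch in R, assuming H_0 is given in km/s/Mpc and the speed of light c in km/s; the function and argument names (dist_modulus, Omega_m) are illustrative, not from the lecture code.

# Sketch of the ΛCDM distance modulus as a function of redshift.
dist_modulus <- function(z, H0, Omega_m) {
  cc <- 299792.458   # speed of light in km/s
  integrand <- function(u) (Omega_m * (1 + u)^3 + (1 - Omega_m))^(-1/2)
  I <- integrate(integrand, lower = 0, upper = z)$value
  5 * log10(cc * (1 + z) / H0 * I) + 25
}

# Example: distance modulus at z = 0.5, with illustrative values H0 = 70, Omega_m = 0.34
dist_modulus(0.5, H0 = 70, Omega_m = 0.34)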

6 PSU Summer School in Statistics Regression Slide 6 [Figure: distance modulus versus redshift for the case where H_0 = and Ω_m = 0.34.]

7 PSU Summer School in Statistics Regression Slide 7 The two above examples contrast two of the typical motivations for using regression: In Example 1 we want to make predictions for new/future response values (redshifts) given the ugriz magnitudes of an object. Very important problem for LSST and other photometric surveys. In Example 2 we want to learn something about the relationship between redshift and distance modulus for Type Ia supernovae, since different physical theories make different predictions for its shape. Which of the two scenarios we are in will dictate our level of concern for model fit and violations of other theoretical assumptions.

8 PSU Summer School in Statistics Regression Slide 8 Another Example: Modelling Emission Lines Here we seek to model the relationships between the log equivalent widths of different emission lines of galaxies. This was motivated by an effort to improve simulation models for galaxy spectra. The sample consists of 13,788 SDSS galaxies, and can be read in via the following R command:

> eldata = read.table("...", header=T)

The use of header=T specifies that the first line of the file includes the variable names. Those can now be seen via

> names(eldata)
[1] "Halpha" "Hbeta"  "SII"    "Hgamma" "OII"    "OIII"   "NII"

9 PSU Summer School in Statistics Regression Slide 9 Create a scatter plot to consider one of the relationships between the lines:

> plot(eldata$SII, eldata$OII, xlab="SII Log Equivalent Width",
+      ylab="OII Log Equivalent Width", pch=".", col="blue")

[Figure: scatter plot of OII log equivalent width versus SII log equivalent width.]

10 PSU Summer School in Statistics Regression Slide 10 The plot shows the least squares regression line. Our assessment of this line would depend, in part, on our motivation for fitting the regression model. If our objective was solely to make good predictions of the OII log equivalent width using the SII line, then it is not doing a bad job. On the other hand, we would be very reluctant to claim that the linear model is revealing some truth regarding the relationships between these emission lines. In what follows, we will consider a range of increasingly-complex models. The starting point will be linear regression.

11 PSU Summer School in Statistics Regression Slide 11 Linear Regression Linear regression models predict the response variable as a linear combination of the predictors, plus random scatter:

Y = β_0 + β_1 X_1 + β_2 X_2 + ... + β_p X_p + ε

If we want to stress that there is a sample of n observations being used, we can write the following:

Y_i = β_0 + β_1 X_{1i} + β_2 X_{2i} + ... + β_p X_{pi} + ε_i,   i = 1, 2, ..., n.

The available sample used for fitting the model is often referred to as the training sample.

12 PSU Summer School in Statistics Regression Slide 12 Despite their simplicity, linear models are utilized in a wide range of situations, and understanding them is crucial to exploring more advanced regression methods. Also, keep in mind that the predictors X_i can be transformed versions of the available variable(s). The response can also be transformed. These operations can create linear models out of nonlinear relationships. For example, this describes a cubic relationship between X and Y:

Y = β_0 + β_1 X + β_2 X^2 + β_3 X^3 + ε,

but it is a linear model since it is linear in the β parameters.

13 PSU Summer School in Statistics Regression Slide 13 The Irreducible Errors The ε_i are sometimes called the irreducible errors, to suggest that they're ever-present, and not to be ignored: We do not seek to fit models that eliminate random scatter, as this would just lead to overfitting. It is almost always assumed that E(ε_i) = 0. If Var(ε_i) = σ^2 for all i, then the errors are said to be homoskedastic. Although homoskedasticity is a common assumption, it can be relaxed, and heteroskedastic errors can be handled. The irreducible errors for different observations are often assumed to be uncorrelated. Assuming that the ε_i are iid Gaussian leads to some additional nice theoretical results, but it doesn't tend to be important in practical situations.

14 PSU Summer School in Statistics Regression Slide 14 Simple Linear Regression The simple linear regression model consists of a single predictor:

Y_i = β_0 + β_1 X_i + ε_i

with the assumptions that the ε_i are uncorrelated, mean zero random variables with variance σ^2. There are three parameters to be estimated: β_0, β_1, and σ^2. The standard (but not the only) approach to estimating β_0 and β_1 is to use least squares, i.e., to find the values of these parameters that minimize

∑_{i=1}^n (Y_i - (β_0 + β_1 X_i))^2

These minimizers are usually denoted β̂_0 and β̂_1.

15 PSU Summer School in Statistics Regression Slide 15 Recall our example of predicting galaxy emission line widths. [Figure: scatter plot of OII log equivalent width versus SII log equivalent width.]

16 PSU Summer School in Statistics Regression Slide 16 Consider one observation in the data set, Y_i: [Figure: the scatter plot with one observation Y_i highlighted.]

17 PSU Summer School in Statistics Regression Slide 17 The prediction from the regression model is the fitted value Ŷ_i: [Figure: the scatter plot showing both Y_i and the fitted value Ŷ_i on the regression line.]

18 PSU Summer School in Statistics Regression Slide 18 The difference is the residual, ε̂_i = Y_i - Ŷ_i: [Figure: the scatter plot showing the residual ε̂_i as the vertical distance between Y_i and Ŷ_i.] The least squares regression line minimizes ∑_i ε̂_i^2.

19 PSU Summer School in Statistics Regression Slide 19 With a little calculus, it is not difficult to show that

β̂_1 = ∑_{i=1}^n (X_i - X̄)(Y_i - Ȳ) / ∑_{i=1}^n (X_i - X̄)^2

and that β̂_0 = Ȳ - β̂_1 X̄.
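
As a quick check, the closed-form estimates above can be computed by hand in R and compared with lm(); this is a small sketch assuming eldata has been read in as on Slide 8.

# Closed-form least squares estimates versus lm()
x <- eldata$SII
y <- eldata$OII
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept
c(b0, b1)
coef(lm(OII ~ SII, data = eldata))   # should agree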

20 PSU Summer School in Statistics Regression Slide 20 The Correlation Coefficient The (Pearson) correlation coefficient provides a measure of the strength of linear association between two variables. The (sample) covariance between two variables is

Cov(X, Y) = ∑_{i=1}^n (X_i - X̄)(Y_i - Ȳ) / (n - 1)

and the correlation coefficient is

r_xy = Cov(X, Y) / (s_x s_y)

where s denotes the sample standard deviation. Note that -1 ≤ r_xy ≤ 1, and that r_xy = ±1 indicates a perfect linear relationship.

21 PSU Summer School in Statistics Regression Slide 21 The figure below gives examples of how the correlation coefficient expresses the strength of linear relationship in a scatter plot. Note that nonlinear relationships are not detected.

22 PSU Summer School in Statistics Regression Slide 22 It's not surprising that there is a close relationship between the correlation coefficient and simple linear regression. In fact,

β̂_1 = Cov(X, Y) / s_x^2 = r_xy (s_y / s_x).

It's worth noting that if X and Y have been rescaled to have mean of zero and standard deviation of one, then β̂_1 = r_xy, and β̂_0 = 0.
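
The identities above are easy to verify numerically; here is a brief sketch on the emission-line data, assuming eldata is already loaded.

# Check that the slope equals Cov(X,Y)/s_x^2 = r_xy * (s_y / s_x)
r  <- cor(eldata$SII, eldata$OII)
b1 <- cov(eldata$SII, eldata$OII) / var(eldata$SII)
c(b1, r * sd(eldata$OII) / sd(eldata$SII))   # the two should match

# With standardized variables the slope equals the correlation:
coef(lm(scale(OII) ~ scale(SII), data = eldata))[2]
r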

23 PSU Summer School in Statistics Regression Slide 23 Linear Regression in R Linear models are fit easily in R using the function lm(): > ellm = lm(oii ~ SII, data=eldata) Here, OII is the response, and SII is the predictor. Additional predictors are easy to include: > ellm2 = lm(oii ~ Halpha + Hbeta + SII, data=eldata) The output provided by summary() shows a lot of the important information. See the following slide.

24 PSU Summer School in Statistics Regression Slide 24

> summary(ellm2)

Call:
lm(formula = OII ~ Halpha + Hbeta + SII, data = eldata)

Residuals:
     Min       1Q   Median       3Q      Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               <2e-16 ***
Halpha                                    <2e-16 ***
Hbeta                                     <2e-16 ***
SII                                       <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:        on        degrees of freedom
Multiple R-squared:       ,  Adjusted R-squared:
F-statistic: 1.365e+04 on 3 and       DF,  p-value: < 2.2e-16

25 PSU Summer School in Statistics Regression Slide 25 Some comments on the output of the preceding slide:

1. The estimates of the β parameters are in the Estimate column of the table. Their standard errors are in the second column.
2. The t value column of the table is simply the ratio of the estimate to its standard error.
3. The fourth column reports the p-value for the test of H_0: β_i = 0 versus H_1: β_i ≠ 0. These tests are overused.
4. The estimate of σ is reported as the Residual standard error, i.e., σ̂ =

26 PSU Summer School in Statistics Regression Slide 26 The Coefficient of Determination The R output on the previous slide also reports the value of Multiple R-squared. This is sometimes called the coefficient of determination, and is defined as

R^2 = 1 - ∑_i (Y_i - Ŷ_i)^2 / ∑_i (Y_i - Ȳ)^2

This (overused) metric can be interpreted as the proportion of the variance in the response explained by the model. Unfortunately, R^2 is often used as a proxy for the quality of a model fit, even though it is not suited to assess that.
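
The definition above can be applied directly to the fitted model; a small sketch, assuming ellm2 from Slide 23 exists.

# Computing R^2 from the residuals and comparing with summary()
y    <- eldata$OII
yhat <- fitted(ellm2)
r2   <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
r2
summary(ellm2)$r.squared   # should match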

27 PSU Summer School in Statistics Regression Slide 27 Consider the four scatter plots below, Anscombe's Quartet (1973). In all four cases, the values of β̂_0, β̂_1, r_xy, R^2, and so forth, are equal.

28 PSU Summer School in Statistics Regression Slide 28 Residuals as Diagnostic Tools Anscombe's Quartet makes it clear that we cannot rely upon purely numerical assessments (including R^2) to judge the quality of a model fit. Inspection of the residuals is an important diagnostic tool. The i-th residual is an approximation to the i-th irreducible error:

ε̂_i = Y_i - Ŷ_i = Y_i - (β̂_0 + β̂_1 X_{1i} + ... + β̂_p X_{pi}) ≈ Y_i - (β_0 + β_1 X_{1i} + ... + β_p X_{pi}) = ε_i

If the model is a good fit, then the residuals should be random scatter, and lacking in any relationships with predictors, etc.

29 PSU Summer School in Statistics Regression Slide 29 The plot of residuals versus fitted values is standard. The curved shape in the plot below shows a lack of fit in our linear regression example.

> plot(ellm2$fitted.values, ellm2$residuals, pch=".",
+      xlab="Fitted Values", ylab="Residuals")

[Figure: residuals versus fitted values for the linear model.]

30 PSU Summer School in Statistics Regression Slide 30 It is also common to look at the normal probability plot of the residuals. While normality of the ε_i is not an important assumption for most applications, this plot can also help to highlight the observations with the most extreme residuals.

> qqnorm(ellm2$residuals, pch=16, main="")
> qqline(ellm2$residuals)

[Figure: normal probability plot of the residuals (sample quantiles versus theoretical quantiles).]

31 PSU Summer School in Statistics Regression Slide 31 Model Selection The process of model selection involves deciding which of the available predictors should be utilized in the model. Including too many predictors leads to overfitting, a situation in which a model fits better to the training sample than it would to external observations not used in the fitting. This is a big problem. Important: Increasing the number of predictors will necessarily decrease the sum of the squared residuals, and necessarily increase the value of R^2 (ignoring some technical cases where the new predictor is a linear combination of the predictors already included). These metrics are not useful for model selection.

32 PSU Summer School in Statistics Regression Slide 32 Leave-one-out cross-validation is an important tool for model selection. For each observation i, imagine fitting a model which leaves out that row of the data set. The model is fit on this reduced training set. The fitted model is then used to predict the response for observation i. Denote this prediction Ŷ_{(-i)}. The difference between Y_i and Ŷ_{(-i)} is a useful view of the performance of the model at predicting the response value for new cases. We can accumulate these errors into a score:

LOOCV Score = (1/n) ∑_{i=1}^n (Y_i - Ŷ_{(-i)})^2

This score can be compared over models with different sets of predictors. Of course, models with a small LOOCV score are preferred.
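
A brute-force version of this score can be written directly from the definition; this is a sketch only (the function name loocv_score is illustrative, the response OII is hard-coded, and refitting the model n times is slow for 13,788 rows).

# Brute-force LOOCV score for a candidate model
loocv_score <- function(formula, data) {
  n <- nrow(data)
  errs <- numeric(n)
  for (i in 1:n) {
    fit <- lm(formula, data = data[-i, ])              # leave out row i
    errs[i] <- data$OII[i] - predict(fit, newdata = data[i, ])
  }
  mean(errs^2)
}
loocv_score(OII ~ SII, eldata)
loocv_score(OII ~ Halpha + Hbeta + SII, eldata)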

33 PSU Summer School in Statistics Regression Slide 33 Fortunately, there is an amazing shortcut to calculating the LOOCV score. Assume that the observed response values Y_i are placed into a vector Y, and the fitted values are placed in a vector Ŷ. If there is a matrix L such that

Ŷ = L Y

then

Y_i - Ŷ_{(-i)} = ε̂_i / (1 - L_ii)

where L_ii is the i-th diagonal entry of L. These are called the leverages. The generalized cross validation (GCV) score is found by replacing L_ii by the average diagonal entry of L:

Y_i - Ŷ_{(-i)} = ε̂_i / (1 - L_ii) ≈ ε̂_i / (1 - tr(L)/n)

34 PSU Summer School in Statistics Regression Slide 34 The R function hatvalues() returns the leverages:

> levs = hatvalues(ellm)
> loocv = mean((ellm$residuals/(1-levs))^2)
> gcv = mean((ellm$residuals/(1-mean(levs)))^2)
> print(c(loocv, gcv))
[1]

Compare this with the score of the model with three predictors:

> levs2 = hatvalues(ellm2)
> loocv2 = mean((ellm2$residuals/(1-levs2))^2)
> gcv2 = mean((ellm2$residuals/(1-mean(levs2)))^2)
> print(c(loocv2, gcv2))
[1]

The function bestglm() in package bestglm will automate the search for the smallest LOOCV score.

35 PSU Summer School in Statistics Regression Slide 35 Confidence and Prediction Intervals As described above, these models are ultimately used for inference, i.e., drawing conclusions and making predictions. Any inference should be accompanied by an uncertainty measure. We should contrast two inference objectives: 1. When making predictions of the response for new X values, then uncertainty is best described by a prediction interval. This interval gives a range of values that we can claim, with some confidence, the true value of Y falls into. 2. When characterizing the relationship between the response and the X values, a confidence interval on the regression function is often of interest.

36 PSU Summer School in Statistics Regression Slide 36 In R, use the function predict(). The argument interval controls the type of interval (prediction or confidence) that is created.

> predict(ellm, data.frame(SII=2), interval="prediction")
  fit  lwr  upr

> predict(ellm, data.frame(SII=2), interval="confidence")
  fit  lwr  upr

Note that for the same value of X, the 95% prediction interval is much wider than the confidence interval, since the prediction interval must take into account the irreducible error. Rule of Thumb: For large n, 95% prediction intervals have a half-width which is approximately 2σ̂.
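
The rule of thumb above is easy to check numerically; a small sketch, assuming ellm from Slide 23 exists.

# Compare the half-width of a 95% prediction interval with 2 * sigma-hat
pred  <- predict(ellm, data.frame(SII = 2), interval = "prediction")
halfw <- (pred[ , "upr"] - pred[ , "lwr"]) / 2
sigma_hat <- summary(ellm)$sigma
c(half_width = unname(halfw), two_sigma = 2 * sigma_hat)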

37 PSU Summer School in Statistics Regression Slide 37 Nonparametric Regression The linear models considered thus far are parametric models, meaning that there is a fixed form for the relationship between the response and predictor(s); the only unknowns are real-valued parameters. We saw at the start of the lecture that parametric models come in nonlinear forms as well, e.g.,

Distance Modulus = 5 log10[ (c(1+z)/H_0) ∫_0^z ( Ω_m(1+u)^3 + (1 - Ω_m) )^{-1/2} du ] + 25

38 PSU Summer School in Statistics Regression Slide 38 [Figure: distance modulus versus redshift for the case where H_0 = and Ω_m = 0.34.]

39 PSU Summer School in Statistics Regression Slide 39 [Diagram: Data and Assumptions both contribute to the Estimate.] In the parametric case, the contribution to the estimate from the assumptions is fixed.

40 PSU Summer School in Statistics Regression Slide 40 [Diagram: Data and Assumptions both contribute to the Estimate, with the influence of the assumptions governed by h.] In the nonparametric case, the influence of the assumptions is controlled by a smoothing parameter h. The value of h can be chosen smaller with larger sample sizes.

41 PSU Summer School in Statistics Regression Slide 41 [Figure: distance modulus versus redshift, with a nonparametric estimate of the relationship.]

42 PSU Summer School in Statistics Regression Slide 42 Local Linear Regression The figure on the previous page shows a nonparametric regression or nonparametric smooth of the relationship. There are many variants of nonparametric regression, but here we focus on local linear regression, a version of local polynomial regression. This approach has proven to be very powerful, and possesses many strong theoretical properties to justify its use. Briefly stated, this procedure works by fitting a sequence of linear models: Each is fit not to the entire data set, but to only data within a neighborhood of a target point. The size of this neighborhood is the smoothing parameter: Large neighborhoods yield a large degree of smoothing, while small neighborhoods result in minimal smoothing.

43 PSU Summer School in Statistics Regression Slide 43 Our model here is that we observe (x i, Y i ) for i = 1, 2,..., n and that Y i = f(x i ) + ɛ i where the ɛ i are iid with mean zero and variance σ 2. Assuming that the ɛ i are normal will lead to some nice properties, but this development does not require that assumption. In order to construct the local linear regression estimate of f( ), it is best to consider a sequence of steps for each fixed x 0 at which f( ) will be estimated.

44 PSU Summer School in Statistics Regression Slide 44 Step One: Fix the target point x_0. Our objective is to estimate the regression function at x_0. [Figure: scatter plot of Y versus X with the target point x_0 marked.]

45 PSU Summer School in Statistics Regression Slide 45 Step Two: Create the neighborhood around x_0. A common way to choose the neighborhood size is to make it large enough to capture proportion α of the data. This parameter α is often called the span. A typical choice is α ≈ 0.5. [Figure: the scatter plot with the neighborhood around x_0 indicated.]

46 PSU Summer School in Statistics Regression Slide 46 Step Three: Weight the data in the neighborhood. Values of x which are close to x_0 will receive a larger weight than those far from x_0. Denote by w_i the weight placed on observation i. The default choice is the tri-cube weight function:

w_i = (1 - (|x_i - x_0| / maxdist)^3)^3, if x_i is in the neighborhood of x_0
w_i = 0, if x_i is not in the neighborhood of x_0

where maxdist is the largest distance |x_i - x_0| within the neighborhood.

47 PSU Summer School in Statistics Regression Slide 47 [Figure: the tri-cube weights applied to the observations in the neighborhood of x_0.]

48 PSU Summer School in Statistics Regression Slide 48 Step Four: Fit the local regression line. This is done by finding β_0 and β_1 to minimize the weighted sum of squares

∑_{i=1}^n w_i (y_i - (β_0 + β_1 x_i))^2

[Figure: the locally fitted regression line within the neighborhood.]

49 PSU Summer School in Statistics Regression Slide 49 Step Five: Estimate f(x_0). This is done using the fitted regression line to estimate the regression function at x_0:

f̂(x_0) = β̂_0 + β̂_1 x_0

[Figure: the estimate f̂(x_0) read off the local regression line at x_0.]
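
Steps One through Five can be coded directly; below is a from-scratch sketch using a span-based neighborhood and tri-cube weights, with illustrative function names (local_linear). It is meant to convey the idea, not to reproduce loess() exactly.

# Local linear estimate of f(x0) following Steps One-Five
local_linear <- function(x0, x, y, span = 0.5) {
  n <- length(x)
  d <- abs(x - x0)
  k <- ceiling(span * n)                       # neighborhood size (Step Two)
  maxdist <- sort(d)[k]                        # radius capturing proportion 'span'
  w <- ifelse(d <= maxdist, (1 - (d / maxdist)^3)^3, 0)   # tri-cube weights (Step Three)
  fit <- lm(y ~ x, weights = w)                # weighted least squares (Step Four)
  predict(fit, data.frame(x = x0))             # estimate of f(x0) (Step Five)
}

# Evaluate the estimate on a grid and overlay it on the scatter plot
grid <- seq(min(eldata$SII), max(eldata$SII), length.out = 100)
fhat <- sapply(grid, local_linear, x = eldata$SII, y = eldata$OII, span = 0.5)
plot(eldata$SII, eldata$OII, pch = ".", col = "blue")
lines(grid, fhat, col = "red", lwd = 2)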

50 PSU Summer School in Statistics Regression Slide 50 This is implemented in R using loess():

> holdlo = loess(OII ~ SII, data=eldata, degree=1, span=0.27)

Setting degree = 1 is necessary in order to get local linear models. You can use the default degree = 2 to get local quadratic fits, but this is usually unnecessary.

> plot(eldata$SII, eldata$OII, xlab="SII Log Equivalent Width",
+      ylab="OII Log Equivalent Width", pch=".", col="blue")
>
> SIIseq = seq(min(eldata$SII), max(eldata$SII), 0.1)
> points(SIIseq, predict(holdlo, SIIseq), type="l",
+      lwd=2, col="red")

51 PSU Summer School in Statistics Regression Slide 51 [Figure: scatter plot of OII versus SII log equivalent widths with the local linear regression fit overlaid.]

52 PSU Summer School in Statistics Regression Slide 52 It still makes sense to look at the plot of residuals versus fitted values. The improvement in the fit is noticeable.

> plot(holdlo$fitted, holdlo$residuals, xlab="Fitted Values",
+      ylab="Residuals", pch=".")

[Figure: residuals versus fitted values for the loess fit.]

53 PSU Summer School in Statistics Regression Slide 53 Smoothing Parameter Selection The smoothing parameter in nonparametric regression controls how smooth the resulting estimate is: A smaller value leads to a rougher estimate, a larger value gives a smoother estimate. The argument span to loess() controls the span, as defined above. The default value is 0.75. The smoothing parameter cannot be chosen by minimizing the sum of squared residuals ("least squares"). Doing so would lead to overfitting, since for a small enough choice, the residuals could all be made close to zero.

54 PSU Summer School in Statistics Regression Slide 54 [Figure: using span = . Clearly not enough smoothing.]

55 PSU Summer School in Statistics Regression Slide 55 [Figure: using span = 0.9. Too much smoothing, missing important features.]

56 PSU Summer School in Statistics Regression Slide 56 As before, a common strategy for setting smoothing parameters is to use cross-validation. The LOOCV score can be calculated for different values of the smoothing parameter:

LOOCV(h) = (1/n) ∑_{i=1}^n (Y_i - Ŷ_{(-i)})^2

and the choice of h which minimizes the score is optimal. GCV is also often used. The function loess.as(), part of the package fANCOVA, includes the ability to search for the optimal choice of the span, as determined by GCV.

> library(fANCOVA)
> optlo = loess.as(eldata$SII, eldata$OII, criterion="gcv")
> print(optlo$pars$span)
[1]

57 PSU Summer School in Statistics Regression Slide 57 When the sample size is small, cross-validation can be unreliable. Reconsider our example from above, with span = 0.16 chosen using GCV. [Figure: the resulting fit.]

58 PSU Summer School in Statistics Regression Slide 58 Smoothing Splines Another nonparametric approach to regression is the penalized spline or smoothing spline. This approach starts with a natural optimization problem: Find the twice-differentiable function f(·) that minimizes

∑_{i=1}^n (y_i - f(x_i))^2 + λ ∫_a^b [f^{(2)}(x)]^2 dx

where f^{(2)}(x) indicates the second derivative of f evaluated at x and λ > 0 is the smoothing parameter. Typically, a = min{x_i} and b = max{x_i}.

59 PSU Summer School in Statistics Regression Slide 59 Note that the penalty term

∫_a^b [f^{(2)}(x)]^2 dx

will be large if the function is wiggly. Of course, if f(x) = a + bx, then this penalty equals zero. Large values of λ lead to smooth functions f(·). As before, λ can be chosen via cross-validation.

60 PSU Summer School in Statistics Regression Slide 60 This may appear to set up a difficult optimization problem, but, in fact, the search for the optimal f( ) can be transformed into constructing f( ) as the linear combination of a specially formed basis of functions. Comment: The number of basis functions utilized is equal to the number of knots plus four. For computational reasons, the number of knots is usually chosen much smaller than the number of data points.

61 PSU Summer School in Statistics Regression Slide 61 The figure below shows a simple example for the case where there are five knots (shown as the vertical lines). [Figure: the basis functions plotted against x, with the five knots marked by vertical lines.]

62 PSU Summer School in Statistics Regression Slide 62 In R, penalized splines are implemented via smooth.spline(). The default is for smooth.spline() to choose λ using GCV, but if you set cv = TRUE, then true LOOCV is used. See below for the result of the fit to our SDSS example.

> holdspline = smooth.spline(eldata$SII, eldata$OII)
> print(holdspline)
Call:
smooth.spline(x = eldata$SII, y = eldata$OII)

Smoothing Parameter  spar=         lambda=         (12 iterations)
Equivalent Degrees of Freedom (Df):
Penalized Criterion (RSS):
GCV:

63 PSU Summer School in Statistics Regression Slide 63 [Figure: scatter plot of OII versus SII log equivalent widths with the smoothing spline fit overlaid.]

64 PSU Summer School in Statistics Regression Slide 64 Comment: The Use of Nonparametrics in Astronomy A common objection raised to using nonparametric regression in astronomy is the failure to obtain a functional form for the relationship between the variables. Although there are situations where this is a drawback, there are many other situations in which a researcher wants a model for either (1) making predictions (e.g., photometric redshifts), or (2) summarizing data. For instance, a nonparametric estimate of a luminosity function can be a useful means of summarizing a massive data set. The use of a nonparametric approach would provide some reassurance that the summary is not affected by our model choices. Letting the data speak for themselves becomes more important as samples grow.

65 PSU Summer School in Statistics Regression Slide 65 The Curse of Dimensionality Despite the promise of nonparametric approaches, they suffer from a serious drawback: The amount of data required to fit nonparametric regression models well increases exponentially with the dimension of the data. This is called the Curse of Dimensionality. What is the dimension of data? Standard data come to us in the form of numbers, but a single observation is often best thought of as a vector. Examples of high-dimensional data in astronomy include images, spectra, photometry, light curves, etc. measured from a single object.

66 PSU Summer School in Statistics Regression Slide 66 If there are p predictors being used in a model for a response variable, we would say that we are using a p-dimensional predictor vector. In some applications, p can be very large. Nonparametric regression is possible, in theory. For instance, if p = 2, a neighborhood can be thought of as a square centered on the target value. Local linear regression involves fitting a plane within this square. But, as the dimension increases, the available observations become more spread out. Neighborhoods must be made larger in order to compensate and achieve estimates with acceptable variance. Choosing a large neighborhood takes away a key feature of using nonparametric approaches: We want the estimate to be flexible, and not be based on too much smoothing.

67 PSU Summer School in Statistics Regression Slide 67 Consider the plot on the following slide. Here, the sample size is 100. Each observation is two-dimensional, i.e., imagine these are the two predictors. These data are simulated from uniform distributions. The marks shown in each margin depict the marginal distribution for each predictor. Note that in the margins, the data appear to be quite closely spaced. If we were to fit a nonparametric regression using just one predictor, neighborhoods could be chosen small. But, in two dimensions, large gaps appear in the distribution of the observations. Neighborhoods will have to be chosen larger in order to capture enough data.

68 PSU Summer School in Statistics Regression Slide 68 [Figure: 100 observations of Predictor One versus Predictor Two, simulated from uniform distributions, with marks along each margin showing the marginal distributions.]

69 PSU Summer School in Statistics Regression Slide 69 Another way of thinking about it: Under the uniform setup, with a p-dimensional predictor, in order for a neighborhood to include proportion α of the data, the length of the sides of the neighborhood must be α^{1/p}. This plot shows the case where α = 0.4. [Figure: length of side α^{1/p} as a function of p.]
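
The calculation behind that plot is a one-liner; a small sketch reproducing it for α = 0.4.

# Side length of a hypercube neighborhood capturing proportion alpha of uniform data
alpha <- 0.4
p <- 1:10
round(alpha^(1/p), 2)
# e.g. in 10 dimensions each side must span roughly 91% of the predictor's range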

70 PSU Summer School in Statistics Regression Slide 70 One approach to dealing with the curse of dimensionality is to utilize techniques of dimension reduction to map high-dimensional data into low-dimensional representations. Principal components analysis (PCA) is a classic method of dimension reduction, but more sophisticated nonlinear approaches such as isomap, local linear embeddings, and diffusion map are available.

71 PSU Summer School in Statistics Regression Slide 71 Additive Models Another approach to dealing with the curse of dimensionality is to fit models which are nonparametric, but are additive in the predictors. Consider a case with p = 3 predictors. A fully nonparametric model would be of the form

Y = f(X_1, X_2, X_3) + ε,

i.e., the function f(·) could take any form on the three-dimensional predictor space. This model is very flexible, but difficult to fit well due to the curse of dimensionality. An additive nonparametric model would assume

Y = β_0 + f_1(X_1) + f_2(X_2) + f_3(X_3) + ε

where each of the f_i(·) is estimated nonparametrically.

72 PSU Summer School in Statistics Regression Slide 72 The general estimation strategy is called backfitting. In this process, each f_k is estimated nonparametrically, in rotation, and the process is repeated until there is convergence. When estimating f_k(·), the other f_j(·) are held fixed at their current best estimates, and we set up a one-dimensional nonparametric estimation problem, for which one could use local linear regression, smoothing splines, or another approach. So, when estimating f_k(·), the response is taken to be

Y_i - ∑_{j ≠ k} f̂_j(x_{ij})
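
Below is a minimal sketch of that backfitting loop, using smooth.spline() for each one-dimensional fit; the function name backfit and the fixed iteration count are illustrative choices, not the lecture's implementation.

# Backfitting an additive model: X is an n x p matrix of predictors, y the response
backfit <- function(X, y, n_iter = 20) {
  n <- nrow(X); p <- ncol(X)
  beta0 <- mean(y)
  fhat  <- matrix(0, n, p)                     # current estimates f_k(x_ik)
  for (it in 1:n_iter) {
    for (k in 1:p) {
      partial <- y - beta0 - rowSums(fhat[, -k, drop = FALSE])  # partial residuals
      sk <- smooth.spline(X[, k], partial)
      fhat[, k] <- predict(sk, X[, k])$y
      fhat[, k] <- fhat[, k] - mean(fhat[, k]) # center each component
    }
  }
  list(beta0 = beta0, fhat = fhat)
}

# e.g. fit <- backfit(as.matrix(eldata[, c("Halpha", "Hbeta", "SII")]), eldata$OII)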

73 PSU Summer School in Statistics Regression Slide 73 Additive Nonparametric Regression in R There are two packages which include a function gam() used for fitting these models. In both cases, the syntax is much like when using lm(). The key difference is that one must specify which of the predictors are to be fit nonparametrically. This is done by wrapping the predictor with a special function. For the function gam() which is part of the package gam, there are two options: A function s() for fitting a smoothing spline, and a function lo() for fitting a local polynomial. For the function gam() which is part of the package mgcv, there is only the option s().

74 PSU Summer School in Statistics Regression Slide 74 In either case, the syntax would appear simply as

> holdgam = gam(y ~ s(x1) + s(x2) + x3)

In this example, the model being fit is

Y = β_0 + f_1(x_1) + f_2(x_2) + β_3 x_3 + ε

with f_1 and f_2 being estimated from the data using a smoothing spline. The advantage to using gam() from mgcv is that it incorporates automatic selection of the smoothing parameters for the individual smoothing splines.

75 PSU Summer School in Statistics Regression Slide 75 Back to the SDSS Data We can now fit an additive model relating the OII line strength to the other six lines, as follows:

> library(mgcv)
> holdgam = gam(OII ~ s(Halpha) + s(Hbeta) + s(SII) + s(Hgamma) +
+      s(OIII) + s(NII), data = eldata)

The estimates of the functions f_i can be viewed:

> plot(holdgam, pages=1, scale=0, scheme=1)

76 PSU Summer School in Statistics Regression Slide 76 [Figure: the estimated smooth functions from the additive model, each plotted against its predictor: s(Halpha, 6.02), s(Hbeta, 9), s(SII, 8.13), s(Hgamma, 7.17), s(OIII, 8.86), s(NII, 6.49).]

77 PSU Summer School in Statistics Regression Slide 77 It again is important to look at the plot of residuals versus fitted values. There is no evidence of a lack of fit.

> plot(holdgam$fitted, holdgam$residuals, xlab="Fitted Values",
+      ylab="Residuals", pch=".")

[Figure: residuals versus fitted values for the additive model.]

78 PSU Summer School in Statistics Regression Slide 78 Projection Pursuit Regression Recall our additive nonparametric regression model:

Y = β_0 + ∑_{j=1}^p f_j(x_j) + ε

A generalization of this model is the projection pursuit regression model, which can be written as

Y = β_0 + ∑_{k=1}^M β_k f_k(α_k^T x) + ε

where each of the α_k is a vector of length p. These α_k are the projection direction vectors. The functions f_k, called the ridge functions, are estimated nonparametrically. These functions are scaled to have mean zero and variance one when applied to the observed sample.

79 PSU Summer School in Statistics Regression Slide 79 The projection pursuit model takes linear combinations of the predictors, and then fits an additive model in these linear combinations. One can think of the α_k^T x as being new predictors which are being utilized in an additive model. Projection pursuit is implemented in R using the function ppr(). Projection pursuit is related to basic neural network models.
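
As a brief illustration of ppr() on the emission-line data, here is a sketch; the choice nterms = 2 (the number of ridge functions M) is arbitrary, and the object name holdppr is illustrative.

# Projection pursuit regression with two ridge functions
holdppr <- ppr(OII ~ Halpha + Hbeta + SII, data = eldata, nterms = 2)
summary(holdppr)   # reports the projection directions alpha_k
plot(holdppr)      # plots the estimated ridge functions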

80 PSU Summer School in Statistics Regression Slide 80 Conclusion Motivation Linear Models Nonparametric Regression The Curse of Dimensionality Additive Models


More information

41903: Introduction to Nonparametrics

41903: Introduction to Nonparametrics 41903: Notes 5 Introduction Nonparametrics fundamentally about fitting flexible models: want model that is flexible enough to accommodate important patterns but not so flexible it overspecializes to specific

More information

5. Linear Regression

5. Linear Regression 5. Linear Regression Outline.................................................................... 2 Simple linear regression 3 Linear model............................................................. 4

More information

Model Modifications. Bret Larget. Departments of Botany and of Statistics University of Wisconsin Madison. February 6, 2007

Model Modifications. Bret Larget. Departments of Botany and of Statistics University of Wisconsin Madison. February 6, 2007 Model Modifications Bret Larget Departments of Botany and of Statistics University of Wisconsin Madison February 6, 2007 Statistics 572 (Spring 2007) Model Modifications February 6, 2007 1 / 20 The Big

More information

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model 1 Linear Regression 2 Linear Regression In this lecture we will study a particular type of regression model: the linear regression model We will first consider the case of the model with one predictor

More information

Tutorial 6: Linear Regression

Tutorial 6: Linear Regression Tutorial 6: Linear Regression Rob Nicholls nicholls@mrc-lmb.cam.ac.uk MRC LMB Statistics Course 2014 Contents 1 Introduction to Simple Linear Regression................ 1 2 Parameter Estimation and Model

More information

Leverage. the response is in line with the other values, or the high leverage has caused the fitted model to be pulled toward the observed response.

Leverage. the response is in line with the other values, or the high leverage has caused the fitted model to be pulled toward the observed response. Leverage Some cases have high leverage, the potential to greatly affect the fit. These cases are outliers in the space of predictors. Often the residuals for these cases are not large because the response

More information

STAT 705 Nonlinear regression

STAT 705 Nonlinear regression STAT 705 Nonlinear regression Adapted from Timothy Hanson Department of Statistics, University of South Carolina Stat 705: Data Analysis II 1 / 1 Chapter 13 Parametric nonlinear regression Throughout most

More information

Section 3: Simple Linear Regression

Section 3: Simple Linear Regression Section 3: Simple Linear Regression Carlos M. Carvalho The University of Texas at Austin McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Regression: General Introduction

More information

We d like to know the equation of the line shown (the so called best fit or regression line).

We d like to know the equation of the line shown (the so called best fit or regression line). Linear Regression in R. Example. Let s create a data frame. > exam1 = c(100,90,90,85,80,75,60) > exam2 = c(95,100,90,80,95,60,40) > students = c("asuka", "Rei", "Shinji", "Mari", "Hikari", "Toji", "Kensuke")

More information

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij =

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij = K. Model Diagnostics We ve already seen how to check model assumptions prior to fitting a one-way ANOVA. Diagnostics carried out after model fitting by using residuals are more informative for assessing

More information

Regression and Models with Multiple Factors. Ch. 17, 18

Regression and Models with Multiple Factors. Ch. 17, 18 Regression and Models with Multiple Factors Ch. 17, 18 Mass 15 20 25 Scatter Plot 70 75 80 Snout-Vent Length Mass 15 20 25 Linear Regression 70 75 80 Snout-Vent Length Least-squares The method of least

More information

x 21 x 22 x 23 f X 1 X 2 X 3 ε

x 21 x 22 x 23 f X 1 X 2 X 3 ε Chapter 2 Estimation 2.1 Example Let s start with an example. Suppose that Y is the fuel consumption of a particular model of car in m.p.g. Suppose that the predictors are 1. X 1 the weight of the car

More information

Local regression I. Patrick Breheny. November 1. Kernel weighted averages Local linear regression

Local regression I. Patrick Breheny. November 1. Kernel weighted averages Local linear regression Local regression I Patrick Breheny November 1 Patrick Breheny STA 621: Nonparametric Statistics 1/27 Simple local models Kernel weighted averages The Nadaraya-Watson estimator Expected loss and prediction

More information

Statistics 203: Introduction to Regression and Analysis of Variance Penalized models

Statistics 203: Introduction to Regression and Analysis of Variance Penalized models Statistics 203: Introduction to Regression and Analysis of Variance Penalized models Jonathan Taylor - p. 1/15 Today s class Bias-Variance tradeoff. Penalized regression. Cross-validation. - p. 2/15 Bias-variance

More information