1 Chapter 2, Problem Set 1



1. The first model is the smoothest because it imposes a straight line. Only two degrees of freedom are lost. The second model exhibits the most jagged fit because each distinct value of temperature has its own conditional mean. Thirty-nine degrees of freedom are lost. There is no smoothing at all. The third model is a compromise between the first and second models, with four degrees of freedom used up. The compromise is far closer to a straight line than to the unsmoothed fit.

[Figure: partial plots for the three models; panel terms as.factor() and s()]

The AIC statistics for the three models are [...], [...], and [...], respectively. By the AIC criterion, the third model has the best fit. The residual deviance statistics for the three models are [...], [...], and [...], respectively. The second model has the smallest residual deviance and by that criterion has the best fit. However, unlike the AIC, the residual deviance does not take into account the flexibility of the model. Failing to take into account the flexibility built into a fitting function can cause misleading conclusions about fit quality.

From the GAM plots, the third model seems to be the most useful because it strikes the best balance between simplicity and quality of fit. The first model is the simplest but has the worst fit. The second model fits the data well (indeed, it is the best one can do) but would likely not perform well on a test data set and would be difficult to interpret in any case. It is too complex. If there were good substantive reasons to favor a linear fit, the first model might well be selected even though it did not perform as well as the third model by statistical criteria.

2. The span parameter seems to have the most influence on the fit of the smoother. A smaller span means each local fit uses a smaller neighborhood of points, which gives a more jagged fit. As the span approaches 1, the smoother becomes more linear. The degree and family tuning parameters influence the path of the smoother, but not to the same extent as the span. Below are the different smoothers with a span of 0.25, 0.50, or 0.75, and a degree of 0, 1, or 2:

[Figure panels: Span = 0.25 with Degree = 0, 1, 2]

[Figure panels: Span = 0.50 with Degree = 0, 1, 2]

[Figure panels: Span = 0.75 with Degree = 0, 1, 2]

3. There are numerous tuning parameter settings that suggest a positive, monotonic relationship and fit the data well. Any of these settings works well:

span=0.50, degree=1, family=gaussian or symmetric
span=0.75, degree=1, family=gaussian or symmetric
span=0.75, degree=2, family=gaussian or symmetric

4. The slope is relatively flat (though still positive) for temperatures below roughly 75 degrees. At roughly 75 degrees, the slope becomes considerably steeper, and the fit continues to increase monotonically. Below is a smoother using span=0.75 and degree=1.

[Figure: Span = 0.75, Degree = 1]

5. The quantity and local variation of data points dictate the width of the confidence intervals. With respect to temperature, instability is greatest at the tails of the temperature values because there are relatively few observations at the boundaries of the temperature distribution. Conversely, stability is greatest for temperatures in the mid-60s, mid-70s, and mid-to-high 80s, where there are a number of points and relatively little variation in the values of the response.
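The effect of the span described in questions 2 through 4 can be sketched with a stripped-down local regression in Python. This is only an illustration, not the R loess code behind the solutions: the functions `tricube`, `loess_at`, and `loess` are hypothetical helpers, the data are made up (a straight-line trend with one artificial bump at x = 20), only degree 0 and degree 1 are implemented, and the family="symmetric" robustness option is left out.

```python
def tricube(u):
    """Tricube weight: (1 - |u|^3)^3 inside the window, 0 outside."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_at(x0, xs, ys, span=0.75, degree=1):
    """Local fit at x0 using roughly span * n nearest points (degree 0 or 1)."""
    n = len(xs)
    k = max(2, min(n, int(round(span * n))))
    # Bandwidth: distance from x0 to its k-th nearest neighbour,
    # inflated very slightly so the weights never all vanish.
    h = sorted(abs(x - x0) for x in xs)[k - 1]
    h = h * (1.0 + 1e-9) + 1e-12
    w = [tricube((x - x0) / h) for x in xs]
    sw = sum(w)
    xbar = sum(wi * x for wi, x in zip(w, xs)) / sw
    ybar = sum(wi * y for wi, y in zip(w, ys)) / sw
    if degree == 0:
        return ybar  # locally weighted mean
    # Degree 1: weighted least-squares line evaluated at x0.
    sxx = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - xbar) * (y - ybar) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return ybar + slope * (x0 - xbar)


def loess(xs, ys, span=0.75, degree=1):
    """Fitted values at every observed x."""
    return [loess_at(x0, xs, ys, span, degree) for x0 in xs]


# Made-up data: linear trend 0.5 * x plus a single bump of height 3 at x = 20.
xs = [float(i) for i in range(41)]
ys = [0.5 * x + (3.0 if x == 20.0 else 0.0) for x in xs]

for span in (0.25, 0.50, 0.75):
    fit = loess(xs, ys, span=span, degree=1)
    print(f"span={span:.2f}: fitted value at x=20 is {fit[20]:.3f} (trend is 10.0)")
```

With these data, the smaller the span, the more the fitted value at x = 20 is pulled toward the bump, which is the local "jaggedness" noted above; larger spans average the bump away and stay closer to the straight-line trend.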

2 Chapter 2, Problem Set 2

1. Summary statistics for four different generalized additive models with loess polynomials are provided below:

Model                                AIC      Deviance
lo(, degree=1) + lo(, degree=1)      [...]    [...]
lo(, degree=1) + lo(, degree=2)      [...]    [...]
lo(, degree=2) + lo(, degree=1)      [...]    [...]
lo(, degree=2) + lo(, degree=2)      [...]    [...]

The AIC and deviance statistics exhibit little variation. All AIC values are on the order of 161,700, and the deviance statistics range from 11,743,705 to 11,764,933. The partial plots show some differences in specific local regions of the predictor of interest.

When a degree-1 polynomial is applied to latitude, the one-dimensional smoother of latitude shows a negative, monotonic relationship between temperature and latitude. The negative slope becomes steeper around a latitude value of 29. When a degree-2 polynomial is applied to latitude, the smoother shows two relative maxima at latitude values of roughly 29 and [...]. Aside from these small peaks, the relationship between temperature and latitude is negative.

When a degree-1 polynomial is applied to longitude, the suggested relationship between temperature and longitude is jagged and exhibits several inflection points. For longitude less than roughly -103, the relationship is negative. Between longitudes -103 and -101, temperature increases. After a brief, slight decrease at -101, temperature increases until roughly -97. For longitude values greater than -97, temperature decreases once again. However, the inflection points occur where the data become quite sparse; there is not much support. Unless there are very good subject-matter reasons for those inflections, they might usefully be ignored. When a degree-2 polynomial is applied to longitude, we again observe inflection points near -103 and -97. In addition, there are pronounced peaks at roughly -102 and [...]. But the same caveats apply.

[Figure: partial plots for lo(, degree = 1) + lo(, degree = 1) and lo(, degree = 1) + lo(, degree = 2)]

[Figure: partial plots for lo(, degree = 2) + lo(, degree = 1) and lo(, degree = 2) + lo(, degree = 2)]

2. When we construct a generalized additive model using a 2-D loess of latitude and longitude, we obtain one 3-D perspective plot (rather than two 2-D plots, as in #1). Using span=0.75, we observe that temperature is highest for relatively small values of latitude and large values of longitude. Temperature decreases essentially monotonically with respect to latitude. The surface is not substantially torqued, so the earlier additive model may be adequate. There do not seem to be any interaction effects. The curious inflection points remain along the front two edges of the figure.
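The comparison between the additive fit and the 2-D loess surface can be written out as follows. The notation is an assumption introduced here for illustration: y stands for temperature, x_1 and x_2 for the two location predictors, and alpha, f_1, f_2, and f for an intercept and unspecified smooth functions.

```latex
% Additive model (question 1): one smooth term per predictor
\mathbb{E}[\,y \mid x_1, x_2\,] = \alpha + f_1(x_1) + f_2(x_2)

% Bivariate smooth (question 2): a single surface over both predictors
\mathbb{E}[\,y \mid x_1, x_2\,] = \alpha + f(x_1, x_2)

% No interaction effects means the surface separates additively:
f(x_1, x_2) = f_1(x_1) + f_2(x_2)
```

A surface that is not "torqued" is one whose shape along x_1 looks the same at every value of x_2, which is what the last line expresses and why the additive model may be adequate.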

[Figure: perspective plot for lo(,, span = 0.5)]

3 Chapter 2, Problem Set 3

1. Using span=0.5, the partial plots for latitude and longitude resemble the partial plots constructed in Problem Set 2, question 1. The partial plot on year shows two pronounced peaks, in the mid-1960s and the early 1980s. Temperatures decrease from the mid-1960s to the mid-1970s, then increase from the mid-1970s until the early 1980s. After 1980, temperatures decline once again. Finally, the partial plot on month reveals nonlinear behavior, with a maximum during the summer (around July). Temperatures increase from January until the summer and then begin to decline once again. Note that if one increases the span parameter (to, say, 0.75 rather than 0.5), the smoothers become more linear and many of the local peaks disappear.

2. Using the default setting for degrees of freedom (df=4), the penalized smoothing spline smoothers are less jagged than the loess smoothers (in which span=0.5). Also, the relative minima and maxima are less pronounced when employing penalized splines. The greater smoothness at a fine-grained level may result from the penalized approach explicitly imposing smoothness requirements at the knots. However, in comparison to the plots produced in #1, the overall conclusions regarding the relationships between temperature and the predictors are generally the same.

[Figure: partial plots for lo(, span = 0.5), lo(, span = 0.5), lo(year, span = 0.5), and lo(month, span = 0.5)]

The relationship between temperature and latitude is negative and monotonic, with a steeper decrease in temperature for latitude values greater than 28. In the partial plot for longitude, there are two sharp inflection points, around -103 and -96. With respect to year, the penalized smoothing splines suggest a relative minimum around 1975 and relative maxima around 1966 and 1981 (approximately).
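The smoothness imposed by the penalized approach in question 2 can be made explicit with the usual smoothing spline criterion for a single predictor. The symbols are assumptions introduced here for illustration: y_i and x_i are the observed response and predictor values, f is the smooth function being estimated, and lambda is the penalty weight.

```latex
\hat{f} = \arg\min_{f} \; \sum_{i=1}^{n} \bigl( y_i - f(x_i) \bigr)^2
          \; + \; \lambda \int \bigl( f''(t) \bigr)^2 \, dt
```

The first term rewards closeness to the data and the second penalizes curvature. As lambda approaches 0 the fit interpolates the data, and as lambda grows very large the fit approaches the least-squares straight line. A fixed degrees-of-freedom setting such as df=4 can be read as one particular choice of lambda between those extremes.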

[Figure: partial plots for s(), s(), s(year), and s(month)]

4 Chapter 2, Problem Set 4

1. In the generalized linear model, the AIC is 992 and the residual deviance is [...]. Age and family income are negatively related to a wife's labor force participation. The log of the expected wage is positively related to labor force participation. In the generalized additive model, the AIC is [...] and the residual deviance is [...]. The GAM fits better even after adjusting for the degrees of freedom used up. Additionally, the partial plots generated by plot.gam() suggest that the relationships between the response and each of the predictors are not linear. For example, labor force participation increases until around age 45 and then declines. This pattern is missed entirely by the generalized linear model. The GAM is the better procedure in this case.

[Figure: partial plots for s(age), s(inc), and s(lwg)]
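Question 1 judges the two models by AIC as well as residual deviance. Why "adjusting for the degrees of freedom used up" matters can be sketched in Python with a small Gaussian example. Everything here is an assumption for illustration rather than the R code or data behind the solutions: the data are made up, the response is continuous rather than binary, `gaussian_aic` is a hypothetical helper, and the error variance is counted as one extra parameter.

```python
import math


def gaussian_aic(rss, n, n_coef):
    """AIC = -2 * logLik + 2 * k for a Gaussian model, with the error
    variance set to its maximum-likelihood value rss / n.
    k = n_coef + 1 (the coefficients plus the variance)."""
    loglik = -0.5 * n * (math.log(2.0 * math.pi) + math.log(rss / n) + 1.0)
    return -2.0 * loglik + 2.0 * (n_coef + 1)


# Made-up data: 20 distinct x values, two observations at each.
# The mean is a straight line plus a small wiggle; the noise is +1 and -1.
xs, ys = [], []
for x in range(20):
    wiggle = 0.3 if x % 2 == 0 else -0.3
    for noise in (1.0, -1.0):
        xs.append(float(x))
        ys.append(2.0 + 0.5 * x + wiggle + noise)
n = len(ys)

# Rigid model: ordinary least-squares straight line (2 coefficients).
xbar = sum(xs) / n
ybar = sum(ys) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
intercept = ybar - slope * xbar
rss_line = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))

# Fully flexible model: one mean per distinct x value (20 coefficients).
groups = {}
for x, y in zip(xs, ys):
    groups.setdefault(x, []).append(y)
means = {x: sum(v) / len(v) for x, v in groups.items()}
rss_factor = sum((y - means[x]) ** 2 for x, y in zip(xs, ys))

aic_line = gaussian_aic(rss_line, n, n_coef=2)
aic_factor = gaussian_aic(rss_factor, n, n_coef=len(means))

print(f"line:   RSS={rss_line:.2f}  AIC={aic_line:.2f}")
print(f"factor: RSS={rss_factor:.2f}  AIC={aic_factor:.2f}")
```

In this made-up case the fully flexible model has the smaller residual sum of squares, as it always will, but the straight line has the lower AIC because the small gain in fit does not pay for 18 extra coefficients. A more flexible model earns a lower AIC only when its gain in fit outweighs the penalty, which is the sense in which the GAM above is said to fit better after adjustment.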
