Stat 135, Fall 2006 A. Adhikari HOMEWORK 10 SOLUTIONS

1a) The model is cw_i = β_0 + β_1 el_i + ɛ_i, where cw_i is the weight of the ith chick, el_i the length of the egg from which it hatched, and ɛ_i the normal error. The index i ranges from 1 to n and the ɛ_i are i.i.d. normal with mean 0 and variance σ². The plot looks linear and roughly (though not perfectly) homoscedastic. I'm willing to believe that the model is OK.

The mean chick weight is 6.145455 grams with an SD of 0.4105892 grams. The mean egg length is 31.38955 mm with an SD of 1.100892 mm. The correlation between egg length and chick weight is 0.6761419. So by our old formulas the slope of the regression line is r s_y/s_x = 0.2521742 gm/mm, and the intercept is ȳ − slope·x̄ = −1.770180 gm. The equation of the regression line is:

estimated chick weight = 0.2521742 gm/mm × (egg length) − 1.770180 gm

b) According to R the slope of the regression line is 0.2522 and the intercept is −1.7702, both of which agree with the values computed in a. Assuming that the linear model holds, the t-test for the intercept is testing whether or not the intercept of the true line is 0; the p-value is large (19%), which supports the null hypothesis that the intercept is 0. In this case (simple regression) the t-test for the slope and the F-test are both testing the same thing: whether or not the slope is 0. Both reject the hypothesis that the slope is 0. I hope you noticed that the F-statistic of 35.37 is the square of the t-statistic of 5.947.

c) Not surprisingly, it's egg weight. The correlation between the weights of the eggs and the chicks is 0.847225. The plot is nice and linear but heteroscedasticity is an issue at the edges. The assumptions of the linear model are not terrible for the main bulk of the data. The residual plot shows the same thing. The R² is 0.72, not bad, and consistent with the correlation from the correlation matrix.

d) Use the estimated slope and intercept from c: the estimate of the mean weight is 0.71852 × 8.5 − 0.05827 = 6.04915. By the formula in the class handout, the standard error is estimated as

0.2207 × sqrt( 1/44 + (8.5 − mean(ew))² / (43 × var(ew)) ) = 0.03455294 gm.

Use the t distribution with 42 degrees of freedom to see that the confidence interval is 6.04915 ± 2.018082 × 0.03455294, which is about (5.98, 6.12) grams.

e) The estimate is the same as in d and so is the t multiplier, but now the standard error is estimated as

0.2207 × sqrt( 1/44 + (8.5 − mean(ew))² / (43 × var(ew)) + 1 ) = 0.2233884 gm.

f) The given egg weight is well outside the range of the data. I don't know whether the model holds at those weights or not. So, because we are dealing with such an outlier, I will declare this one to be not possible. In general, beware of extrapolation, that is, making estimates outside the range of your data. Unless you really have reason to believe that the model continues to hold even where you have no observations, don't do it.
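
A minimal R sketch of the arithmetic in 1(a)-(b). The data frame name eggs and the column names cw (chick weight, grams) and el (egg length, mm) are placeholder names for illustration, not the actual variable names from the assignment; the numbers in the comments are the ones quoted above.

    # Slope and intercept from summary statistics, checked against lm().
    # 'eggs' with columns cw and el is a hypothetical data frame.
    r <- cor(eggs$el, eggs$cw)                           # about 0.676
    slope     <- r * sd(eggs$cw) / sd(eggs$el)           # about 0.2522 gm/mm
    intercept <- mean(eggs$cw) - slope * mean(eggs$el)   # about -1.7702 gm

    fit <- lm(cw ~ el, data = eggs)
    coef(fit)       # should match (intercept, slope) above
    summary(fit)    # t-tests and F-test; in simple regression, F = t^2 for the slope
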

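
The hand computations in 1(d)-(e) can also be reproduced with predict(). This is a sketch under the same assumptions, with ew as an assumed name for the egg weight column.

    # Confidence interval for the mean response and prediction interval at ew = 8.5.
    fit_w <- lm(cw ~ ew, data = eggs)
    new   <- data.frame(ew = 8.5)
    predict(fit_w, newdata = new, interval = "confidence")   # part d
    predict(fit_w, newdata = new, interval = "prediction")   # part e: wider

    # The standard errors by hand, as in the solution:
    s <- summary(fit_w)$sigma        # residual SE, about 0.2207
    n <- nrow(eggs)                  # 44
    se_mean <- s * sqrt(1/n + (8.5 - mean(eggs$ew))^2 / ((n - 1) * var(eggs$ew)))
    se_pred <- s * sqrt(1/n + (8.5 - mean(eggs$ew))^2 / ((n - 1) * var(eggs$ew)) + 1)
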
2a) The two regressions are very similar. Both show the same problems with the homoscedasticity assumptions. Both have an R² of about 0.71. The normal quantile plot is worse for the multiple regression. But overall, not much to choose between them.

b) The R² is an astonishing 0.95, and the rest of the regression diagnostics are not bad. So you can think of egg weight as essentially a linear function of egg length and egg breadth. That explains why the two regressions in a are so similar: linear functions of egg weight are essentially linear functions of egg length and egg breadth.

c) Given the discussion in b), this is a really silly thing to do. And the silliness shows up in the apparently contradictory results of the F-test and the three t-tests for the slopes. Each t-test concludes that the corresponding slope is 0, but the F-test concludes that not all are 0. It's all correct! The predictor variables are so highly correlated with each other (see e.g. b) that the individual slopes have no meaning. It's true that the R² is just a shade higher than those of the earlier regressions, but the adjusted R² is a shade lower than that of the regression on egg weight alone. To summarize: if you use predictors that are highly correlated with each other, don't try to interpret results for individual slopes. Better still, don't use predictor variables that are highly correlated with each other!

d) The two in a are the best. I don't find anything else that compares well. If you use a combination of egg weight and either of the other two variables, the only significant slope is that of egg weight and the R² values are all around 0.7. So, for comprehensibility of the model and minimal correlation between predictors, I'd go with the two in a.

3a) You can do the parametric test of the null hypothesis that in the population the mean baseline score is the same as the mean 15-month score (i.e. the treatment did nothing). The data are paired, so you run the t-test on the differences between the scores to see that the value of t is −6.15 (15 months − baseline) with 21 degrees of freedom. The p-value is tiny, so you conclude that the means are different. Notice the negative sign of t. This is coming from the fact that the 15-month scores are on average lower than the baseline scores.

To run the nonparametric test of the null hypothesis that the underlying distributions at baseline and at 15 months are the same, do the Wilcoxon signed-rank test. The value of the statistic is 246 and the p-value is tiny, supporting the alternative that the two underlying distributions are different. This is consistent with the result of the parametric test.

b) A glance at the correlation matrix shows that you'll want to include the baseline score (big surprise) and chemo (it's correlated with the response but almost uncorrelated with baseline). You may want to use height, but that's correlated with the baseline measurement so it may not be a good idea. And indeed, it turns out not to be a good idea when you run the regression. The best one uses baseline and chemo as the predictors; the adjusted R² is about 0.43, clearly greater than the values from the other regressions. Both slopes are significantly different from 0. According to the diagnostic plots the model looks OK, though there are clearly some deviations from homoscedasticity and normality.
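
A minimal R sketch of the two tests in 3(a). The vector names base and m15 (baseline and 15-month scores, in the same subject order) are assumed names for illustration.

    # Parametric and nonparametric tests on the paired scores.
    t.test(m15, base, paired = TRUE)        # paired t-test: t about -6.15 on 21 df
    t.test(m15 - base)                      # equivalently, a one-sample test on the differences
    wilcox.test(m15, base, paired = TRUE)   # Wilcoxon signed-rank test
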
4a) Looks about as normal as a real dataset gets. The histogram has a fairly symmetric bell shape and the normal q-q plot looks like a straight line.

b) This distribution is skewed to the right. In the q-q plot this is represented by the bow shape of the line. If the skewness were in the other direction, the q-q plot would again be bow-shaped, but this time it would be concave instead of convex.

c) The mother's age doesn't seem to matter. The model which includes all the other predictors looks good; even better, you can drop the column of the mother's pregnancy weight without much loss. In both cases the adjusted R² values are around 0.25, clearly better than the others, and the diagnostic plots are pretty good.
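
The bow shapes described in 4(b) are easy to see by simulation; this sketch uses made-up data and is not tied to the homework dataset.

    # Normal q-q plots for skewed data: right skew bends one way, left skew the other.
    set.seed(1)
    par(mfrow = c(1, 2))
    right_skew <- rexp(500)      # skewed to the right
    qqnorm(right_skew, main = "Right skew: convex bow"); qqline(right_skew)
    left_skew <- -rexp(500)      # skewed to the left
    qqnorm(left_skew, main = "Left skew: concave bow"); qqline(left_skew)
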

d) The coefficient is estimated as −8.35 (or −8.5, depending on which model you used; it doesn't make much difference). It represents the average difference in birthweight between a baby born to a smoker and a baby born to a nonsmoker, provided the other variables included in the model are held constant. The conclusion is that a mother who smokes is expected to have a baby whose birthweight is 8.35 ounces less than that of a baby born to a mother who doesn't smoke (ceteris paribus, all else assumed to be equal).

5. a) R = 0.9955. If you just plot the data, the fit appears to be very good. R² is very high and the slope is highly significant. However, the residual plot is u-shaped, showing a strong non-linear pattern. So, regardless of the high R², we should not have fit a straight line.

b) Less. Each point in the dataset women represents many women: all those of a given height. If you replace the point by the individual points for all the women at that height, the picture will become much more fuzzy. And the correlation will drop.

c) The residual plot of the linear regression suggests a polynomial fit of degree 2. However, once that model is fitted, you can see a clear up-and-down pattern in the residuals. Try a final model, then, of degree 3. R² for this model equals 0.9998, which is ridiculously high. The residual plot is better than the previous two, though the normal q-q plot of the residuals is not as good as the one in the quadratic fit. I won't go to the 4th degree polynomial for several reasons, chief among them being that you don't gain much as far as the fit goes, and fourth powers of inches are beyond most people's comprehension. Stick with the cubic.

6a) The plots are very similar and neither shows any clear relationship, linear or otherwise. Both appear to be formless blobs.

b) Similar, except that the men's heart rates are clearly less variable than the women's.

c) The regression confirms what we saw in a: there's nothing much going on in terms of a linear relationship. R² is only about 0.04. The residual plot is a formless blob, which is good, but then so is the original plot. The estimated slope is 1.645 (not significantly different from 0, big surprise), and the estimated intercept is −87.967.

d) The story for the women is pretty much the same as that for the men, except that R² is higher (about 0.08) and the slope does come out to be significant. It's positive, so the estimated heart rate increases slightly with increasing temperature. But it's not a very convincing regression because its predictive power is pretty small due to the small R².

e) The slope for the men was estimated as 1.645 with an SE of 1.039, and the slope for the women was estimated as 3.128 with an SE of 1.316. So the difference in slopes (women − men) is estimated to be 1.483 with an SE of sqrt(1.039² + 1.316²) = 1.68, since the data for the men and women are independent. The sample sizes are large enough that I'm just going to use the normal approximation and not worry about t distributions. An approximate 95%-confidence interval for the difference between the slopes is 1.483 ± 2 × 1.68, which clearly contains 0. So at the 5% level you can conclude that the slopes are equal.

f) Play the same game as in e but with the intercepts. The estimated difference is −145.657 with an SE of 164.767, and the 95%-confidence interval for the difference clearly contains 0. So at the 5% level you can conclude that the intercepts are equal.
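
The arithmetic in 6(e)-(f) is just combining standard errors from two independent samples; a quick R sketch with the numbers quoted above:

    # Difference of slopes (women - men) and an approximate 95% CI.
    b_men <- 1.645;  se_men <- 1.039
    b_wom <- 3.128;  se_wom <- 1.316
    d    <- b_wom - b_men               # 1.483
    se_d <- sqrt(se_men^2 + se_wom^2)   # about 1.68; valid because the samples are independent
    d + c(-2, 2) * se_d                 # interval clearly contains 0
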

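
Returning to problem 5: the women dataset is built into R, so the sequence of fits is easy to reproduce. A minimal sketch (poly() is used here for convenience; fitting powers of height directly gives the same fitted values):

    # Linear, quadratic and cubic fits to the built-in 'women' data.
    data(women)
    cor(women$height, women$weight)                     # about 0.9955
    fit1 <- lm(weight ~ height, data = women)
    plot(fitted(fit1), resid(fit1))                     # u-shaped residual plot
    fit2 <- lm(weight ~ poly(height, 2), data = women)  # quadratic: up-and-down residuals
    fit3 <- lm(weight ~ poly(height, 3), data = women)  # cubic
    summary(fit3)$r.squared                             # about 0.9998
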
7a. There are n = 4 observations and 2 parameters, w_1 and w_2. The model is Y = Xβ + ɛ, where the observed value of Y is the transpose of (3, 3, 1, 7), β is the transpose of (w_1, w_2), the vector of errors ɛ consists of 4 i.i.d. normal (0, σ²) variables, and the design matrix X is

    1  0
    0  1
    1 -1
    1  1

b) The least squares estimates are (XᵀX)⁻¹XᵀY. Now XᵀX is a particularly simple matrix:

    3  0
    0  3

and so its inverse is equally simple:

    1/3   0
     0   1/3

And XᵀY is the transpose of (11, 9), so the estimates are ŵ_1 = 11/3 and ŵ_2 = 9/3.

c) Ŷ is the transpose of (11/3, 9/3, 2/3, 20/3). So the residual vector is the transpose of (−2/3, 0, 1/3, 1/3), and the estimate of σ² is (6/9)/(4 − 2) = 1/3.

d) The covariance matrix of the estimates is σ²(XᵀX)⁻¹, so both estimated variances are equal to 1/9, and the estimated standard errors are both 1/3.

e) The estimate is 11/3 − 9/3 = 2/3, and its standard error is sqrt(1/9 + 1/9) = 0.47, because the covariance is 0 (the off-diagonal elements of the covariance matrix are 0).

f) The relative sizes of the estimated difference and its SE in part e show that the 95%-confidence interval for w_1 − w_2 must contain 0, whether you use a t distribution or a normal (you should use a t with 2 d.f.). So you'll accept the null hypothesis.

8. Just like the previous one, except now X is an n × 2 matrix whose first column consists of the values of x and the second column consists of the values of x².

a) As in the previous problem, XᵀX is a 2 × 2 matrix whose diagonal entries are the sum of squares of x and the sum of fourth powers of x. The off-diagonal entries are the sum of cubes. After this, it's all as in the previous problem, except that XᵀY is the transpose of (Σ x_i y_i, Σ x_i² y_i). I'm not really interested in whether or not you wrote the algebraic formulas out correctly long-hand.

b) The matrix is σ²(XᵀX)⁻¹, where XᵀX was found in part a. That's all you have to do.
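
The matrix arithmetic in problem 7 is easy to check numerically; a minimal R sketch using the design matrix and response above:

    # Least squares for problem 7 by matrix arithmetic.
    X <- matrix(c(1,  0,
                  0,  1,
                  1, -1,
                  1,  1), nrow = 4, byrow = TRUE)
    Y <- c(3, 3, 1, 7)
    XtX_inv <- solve(t(X) %*% X)          # diag(1/3, 1/3)
    beta    <- XtX_inv %*% t(X) %*% Y     # 11/3 and 9/3
    res     <- Y - X %*% beta             # -2/3, 0, 1/3, 1/3
    sigma2  <- sum(res^2) / (4 - 2)       # 1/3
    covmat  <- sigma2 * XtX_inv           # SEs are sqrt(1/9) = 1/3
    sqrt(covmat[1, 1] + covmat[2, 2])     # SE of w1 - w2, about 0.47
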

9. These results are the probability versions of results you derived in HW 9 for lists of real numbers. The derivations are very similar, using expectations instead of averages. I won't use the hint; I'll just do the equivalent of what we did in class when we derived the formula for the slope and intercept of the regression line.

a) E[Y − (α + βX)]² = E(Y − βX)² − 2αE(Y − βX) + α². Fix β, treat this as a function of α, differentiate, and set equal to 0:

−2E(Y − βX) + 2α̂ = 0

and so for each fixed β, the value of α̂ is E(Y − βX) = µ_Y − βµ_X, which is one of the results we are asked to prove.

For the other, plug α̂ into the expected square error:

E[(Y − µ_Y) − β(X − µ_X)]² = Var(Y) − 2βCov(X, Y) + β²Var(X).

Minimize this with respect to β:

−2Cov(X, Y) + 2β̂Var(X) = 0

and therefore β̂ = σ_xy/σ_x², as was to be shown.

b)

Var(Y) = Var[(Y − Ŷ) + Ŷ]
       = Var(Y − Ŷ) + Var(Ŷ) + 2Cov(Y − Ŷ, Ŷ)
       = Var(Y − Ŷ) + Var(Ŷ) + 2Cov(Y, Ŷ) − 2Var(Ŷ)
       = Var(Y − Ŷ) − Var(Ŷ) + 2Cov(Y, Ŷ)
       = Var(Y − Ŷ) − β̂²Var(X) + 2β̂Cov(Y, X)

by plugging in Ŷ = α̂ + β̂X. Now plug in the value of β̂ to see that

Var(Y) − Var(Y − Ŷ) = −(σ_xy)²/σ_x² + 2(σ_xy)²/σ_x² = (σ_xy)²/σ_x².

Divide both sides by Var(Y) and you're done. The right-hand side is the square of the correlation, by the definition of correlation as σ_xy/(σ_x σ_y).

10. Freebie.
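
The identities in problem 9 can be checked numerically by replacing the population quantities with their sample versions on simulated data; a small sketch:

    # Numerical check of problem 9: alpha = mu_Y - beta*mu_X, beta = sigma_xy/sigma_x^2,
    # and the variance decomposition in part (b), using sample analogues.
    set.seed(2)
    x <- rnorm(10000)
    y <- 2 + 1.5 * x + rnorm(10000)
    beta  <- cov(x, y) / var(x)
    alpha <- mean(y) - beta * mean(x)
    yhat  <- alpha + beta * x
    1 - var(y - yhat) / var(y)   # left side of (b), after dividing by Var(Y)
    cor(x, y)^2                  # the squared correlation; the two should agree
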