Math 3080 § 1. Treibergs
Second Midterm Exam
Name: Solutions
March 19, 2014


(1.) The article "Withdrawal Strength of Threaded Nails," in Journal of Structural Engineering, 2001, describes an experiment to relate the diameter of a threaded nail ($x$ in mm) to its ultimate withdrawal strength from Douglas Fir lumber ($y$ in N/mm). Obtain the equation of the least-squares line. What proportion of observed variation in withdrawal strength can be attributed to the linear relationship between withdrawal strength and diameter? Values of summary quantities are

> c( length(x), sum(x), sum(y), sum(x^2), sum(x*y), sum(y^2) )
[1]    10.0    37.3   659.9   145.6  2533.0 44906.0

We compute intermediate quantities
$$\bar x = \frac1n \sum x_i = \frac{37.3}{10} = 3.73, \qquad \bar y = \frac1n \sum y_i = \frac{659.9}{10} = 65.99,$$
$$S_{xx} = \sum x_i^2 - \frac1n \Bigl(\sum x_i\Bigr)^2 = 145.6 - \frac{(37.3)^2}{10} = 6.471;$$
$$S_{xy} = \sum x_i y_i - \frac1n \Bigl(\sum x_i\Bigr)\Bigl(\sum y_i\Bigr) = 2533 - \frac{(37.3)(659.9)}{10} = 71.573;$$
$$S_{yy} = \sum y_i^2 - \frac1n \Bigl(\sum y_i\Bigr)^2 = 44906 - \frac{(659.9)^2}{10} = 1359.199.$$
Thus the least-squares line is $y = \hat\beta_0 + \hat\beta_1 x$, where
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{71.573}{6.471} = 11.06058, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x = 65.99 - (11.06058)(3.73) = 24.73404.$$
The proportion of observed variation in withdrawal strength attributed to the linear relationship between withdrawal strength and diameter is $R^2$, which may be obtained by
$$R^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}} = \frac{(71.573)^2}{(6.471)(1359.199)} = 0.5824303.$$
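As a numerical check, here is a minimal R sketch reproducing these hand computations from the summary quantities alone (the raw x and y vectors are not given in the problem, so the sums are entered directly; all variable names are illustrative):

# Summary quantities from the problem statement
n <- 10; Sx <- 37.3; Sy <- 659.9
Sx2 <- 145.6; Sxy2 <- 2533.0; Sy2 <- 44906.0

Sxx <- Sx2 - Sx^2 / n          # 6.471
Sxy <- Sxy2 - Sx * Sy / n      # 71.573
Syy <- Sy2 - Sy^2 / n          # 1359.199

b1 <- Sxy / Sxx                # slope: 11.06058
b0 <- Sy / n - b1 * Sx / n     # intercept: 24.73404
R2 <- Sxy^2 / (Sxx * Syy)      # 0.5824303
c(b1 = b1, b0 = b0, R2 = R2)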

(2.) Given $n$ data points $(x_i, y_i)$, suppose that we try to fit a linear equation $y = \beta_0 + \beta_1 x$. For the simple regression model, what are the assumptions on the data $\{(x_i, y_i)\}$? Find the variance $V(Y_i)$ of the $y$-coordinate of the $i$th observation. Explain. What is the fitted value $\hat Y_i$? Show that it may be expressed as a linear combination $\sum_{j=1}^n c_j Y_j$, where the $c_j$'s don't depend on the $Y_i$'s. Starting from the regression model, derive the expected value $E\bigl(\sum_{i=1}^n Y_i^2\bigr)$.

For the simple regression model, we assume that the $x_i$'s are fixed and that
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \text{where the } \epsilon_i \sim N(0, \sigma^2) \text{ are i.i.d.}$$
Hence the $y$-coordinate of the random observation at $x_i$ has
$$V(Y_i) = V(\beta_0 + \beta_1 x_i + \epsilon_i) = V(\epsilon_i) = \sigma^2,$$
because $Y_i$ and $\epsilon_i$ differ by a constant, so they have the same variance. The fitted value is the point predicted by the regression line, namely
$$\hat Y_i = \hat\beta_0 + \hat\beta_1 x_i = \bar Y - \hat\beta_1 \bar x + \hat\beta_1 x_i = \bar Y + \hat\beta_1 (x_i - \bar x) = \sum_{j=1}^n \frac{Y_j}{n} + (x_i - \bar x)\,\frac{S_{xy}}{S_{xx}}$$
$$= \sum_{j=1}^n \frac{Y_j}{n} + \frac{x_i - \bar x}{S_{xx}} \sum_{j=1}^n (x_j - \bar x)(Y_j - \bar Y) = \sum_{j=1}^n \left\{ \frac1n + \frac{(x_i - \bar x)(x_j - \bar x)}{S_{xx}} \right\} Y_j,$$
because $\sum_{j=1}^n (x_j - \bar x)\bar Y = 0$. Now, use the fact that for any random variable $Z$, $E(Z^2) = V(Z) + E(Z)^2$, so that
$$E\Bigl(\sum_{i=1}^n Y_i^2\Bigr) = \sum_{i=1}^n \bigl\{ V(Y_i) + E(Y_i)^2 \bigr\} = \sum_{i=1}^n \bigl\{ \sigma^2 + (\beta_0 + \beta_1 x_i)^2 \bigr\} = n\sigma^2 + \sum_{i=1}^n (\beta_0 + \beta_1 x_i)^2.$$
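A small numerical illustration of the linear-combination representation, using made-up data (the x and y values below are not from the exam; they only demonstrate that the weights $c_j = 1/n + (x_i - \bar x)(x_j - \bar x)/S_{xx}$ reproduce lm()'s fitted values):

# Illustrative data, not from the exam
set.seed(1)
x <- 1:10
y <- 2 + 3 * x + rnorm(10)

n   <- length(x)
Sxx <- sum((x - mean(x))^2)

# Weights c_j for, say, the i = 4th fitted value
i  <- 4
cj <- 1/n + (x[i] - mean(x)) * (x - mean(x)) / Sxx

# The linear combination matches lm()'s fitted value
c(linear.combination = sum(cj * y),
  lm.fitted          = unname(fitted(lm(y ~ x))[i]))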

(3.) A 1984 study in Journal of Retailing looked at variables that contribute to productivity in the retail grocery trade. A measure of productivity is value added per work hour (in $), which is the surplus money generated by the business, available to pay for labor, furniture and fixtures, and equipment. How does value added depend on store size (in 1000 ft²)? State the model and the assumptions on the data. Is there strong evidence that $\beta_2 < -.006$? State the null and alternative hypotheses. State the test statistic and rejection region. Carry out the test at significance level .05.

[Figure: scatterplot of ValAdd vs. Size with the fitted quadratic curve, titled "Value Added vs. Store Size".]

> ValAdd <- c( 4.08, 3.40, 3.51, 3.09, 2.92, 1.94, 4.11, 3.16, 3.75, 3.60 )
> Size <- c( 21.0, 12.0, 25.2, 10.4, 30.9, 6.8, 19.6, 14.5, 25.0, 19.1 )
> m1 <- lm(ValAdd ~ Size + I(Size^2))
> plot(ValAdd ~ Size, pch = 19, main = "Value Added vs. Store Size")
> lines(seq(0, 32, .3), predict(m1, data.frame(Size = seq(0, 32, .3))))
> summary(m1); anova(m1)

Call:
lm(formula = ValAdd ~ Size + I(Size^2))

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.159356   0.500580  -0.318 0.759512
Size         0.391931   0.058006   6.757 0.000263
I(Size^2)   -0.009495   0.001535  -6.188 0.000451

Residual standard error: 0.2503 on 7 degrees of freedom
Multiple R-squared: 0.8794,    Adjusted R-squared: 0.845
F-statistic: 25.53 on 2 and 7 DF,  p-value: 0.0006085

Analysis of Variance Table

Response: ValAdd
          Df  Sum Sq Mean Sq F value    Pr(>F)
Size       1 0.80032 0.80032  12.774 0.0090466
I(Size^2)  1 2.39858 2.39858  38.286 0.0004507
Residuals  7 0.43855 0.06265

The quadratic regression model assumes that the sizes $x_i$ are fixed and the value added $y_i$ satisfies
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i,$$
where the $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$, for some constants $\beta_i$. This is a quadratic regression, so the number of predictor variables is $k = 2$. We test the hypotheses
$$H_0: \beta_2 = \beta_{20} = -.006 \quad\text{vs.}\quad H_a: \beta_2 < \beta_{20} = -.006.$$
The test statistic is
$$t = \frac{\hat\beta_2 - \beta_{20}}{s_{\hat\beta_2}},$$
which is a random variable distributed according to the $t$-distribution with $\nu = n - k - 1$ degrees of freedom. The null hypothesis is rejected if $t \le -t_{\alpha,\,n-k-1}$. When $\alpha = 0.05$ and the number of observations is $n = 10$, by reading the summary table and looking at Table A5 from the text,
$$t = \frac{\hat\beta_2 - \beta_{20}}{s_{\hat\beta_2}} = \frac{-.009495 - (-.006)}{.001535} = -2.276873 < -t_{\alpha,\,n-k-1} = -t_{.05,7} = -1.895.$$
Thus we reject the null hypothesis: there is significant evidence that $\beta_2 < -.006$.
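The same test can be computed in R from the fitted model m1 above; a minimal sketch (the row and column names follow R's standard summary coefficient table):

# t statistic for H0: beta2 = -0.006 vs Ha: beta2 < -0.006
est <- coef(summary(m1))["I(Size^2)", "Estimate"]    # -0.009495
se  <- coef(summary(m1))["I(Size^2)", "Std. Error"]  #  0.001535
t.stat <- (est - (-0.006)) / se                      # -2.276873
crit   <- qt(0.05, df = 7)                           # -1.894579
p.val  <- pt(t.stat, df = 7)                         # one-sided p-value
c(t = t.stat, critical = crit, p = p.val)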

(4.) To develop a model to predict miles per gallon based on highway speed, a test car was driven during two trial periods at speeds ranging from 10 to 75 miles per hour. Consider two models to explain the data: first, a quadratic regression $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2$ (the solid lines), and second, a transformed version, $\log(y_i) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2$ (the dashed lines). R was used to generate tables and plots. What is the equation of the dashed line (for Model 2) in the first panel scatterplot, in (MPH, MPG) coordinates? Discuss the two models with regard to quality of fit and whether the model assumptions are satisfied. Compare at least five features in the tables and plots. Which is the better model?

MPH  10.0  10.0  15.0  15.0  20.0  20.0  25.0  25.0  30.0  30.0  35.0  35.0  40.0  40.0
MPG   4.8   5.7   8.6   7.3   9.8  11.2  13.7  12.4  18.2  16.8  19.9  19.0  22.4  23.5

MPH  45.0  45.0  50.0  50.0  55.0  55.0  60.0  60.0  65.0  65.0  70.0  70.0  75.0  75.0
MPG  21.3  22.0  20.5  19.7  18.6  19.3  14.4  13.7  12.1  13.0  10.1   9.4   8.4   7.6

Call:
lm(formula = MPG ~ MPH + I(MPH^2))

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.5555495  1.42491    -5.305 1.69e-05
MPH          1.2716937  0.0757321  16.792 3.99e-15
I(MPH^2)    -0.0145014  0.0008719 -16.633 4.97e-15

Residual standard error: 1.663 on 25 degrees of freedom
Multiple R-squared: 0.9188,    Adjusted R-squared: 0.9123
F-statistic: 141.5 on 2 and 25 DF,  p-value: 2.338e-14

Analysis of Variance Table

Response: MPG
          Df Sum Sq Mean Sq  F value    Pr(>F)
MPH        1  17.37   17.37   6.2775   0.01911
I(MPH^2)   1 765.46  765.46 276.6417 4.973e-15
Residuals 25  69.17    2.77

Call:
lm(formula = log(MPG) ~ MPH + I(MPH^2))

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.664e-01  6.768e-02   11.32 2.47e-11
MPH          1.030e-01  3.599e-03   28.62  < 2e-16
I(MPH^2)    -1.158e-03  4.144e-05  -27.96  < 2e-16

Residual standard error: 0.07906 on 25 degrees of freedom
Multiple R-squared: 0.9704,    Adjusted R-squared: 0.968
F-statistic: 409.7 on 2 and 25 DF,  p-value: < 2.2e-16

Analysis of Variance Table

Response: log(MPG)
          Df Sum Sq Mean Sq F value    Pr(>F)
MPH        1 0.2362  0.2362  37.792 1.992e-06
I(MPH^2)   1 4.8850  4.8850 781.587 < 2.2e-16
Residuals 25 0.1563  0.0063
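For reference, a sketch of an R session that would reproduce these fits and the diagnostic panels; the data come from the table above, but the exact plotting commands behind the exam figures are not shown, so the plotting details here are assumptions:

MPH <- rep(seq(10, 75, by = 5), each = 2)
MPG <- c(4.8, 5.7, 8.6, 7.3, 9.8, 11.2, 13.7, 12.4, 18.2, 16.8, 19.9, 19.0,
         22.4, 23.5, 21.3, 22.0, 20.5, 19.7, 18.6, 19.3, 14.4, 13.7, 12.1,
         13.0, 10.1, 9.4, 8.4, 7.6)

m1 <- lm(MPG ~ MPH + I(MPH^2))        # Model 1: quadratic in MPG
m2 <- lm(log(MPG) ~ MPH + I(MPH^2))   # Model 2: quadratic in log(MPG)
summary(m1); anova(m1)
summary(m2); anova(m2)

# Four diagnostic panels per model, e.g. for Model 1:
par(mfrow = c(2, 2))
plot(MPH, MPG, pch = 19, main = "Model 1")
lines(MPH, fitted(m1))                               # fitted curve
plot(fitted(m1), MPG, main = "Model 1")              # observed vs. fitted
qqnorm(rstandard(m1), main = "QQ Plot of Model 1")   # normality of residuals
plot(fitted(m1), rstandard(m1), main = "Model 1")    # residuals vs. fitted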

[Figure: eight diagnostic panels. Panels 1 to 4 (Model 1): scatterplot of MPG vs. MPH with the fitted solid and dashed curves, observed vs. fitted values, normal QQ plot of standardized residuals ("QQ Plot of Model 1"), and standardized residuals vs. fitted values. Panels 5 to 8 (Model 2): the same four plots with log(MPG) as the response.]

The equation of the dashed line in the first figure is found from the second model:
$$y = \exp\bigl(\hat\beta_0 + \hat\beta_1 x + \hat\beta_2 x^2\bigr) = \exp\bigl(0.7664 + 0.1030\,x - 0.001158\,x^2\bigr).$$

The second model seems to perform better. First, of the two intrinsically linear versions, the coefficient of multiple determination for the first model, $R^2 = 0.9188$, is smaller than $R^2 = 0.9704$ for the second model. Thus the second model accounts for a larger percentage of variation relative to the intrinsic model. Because both models have two independent variables, we do not need to compare adjusted $R^2$, which accounts for different numbers of variables; adjusted $R^2$ also shows that the second model accounts for more variability. However, since the second model has dimensions of $\log y$ and the first of $y$, this comparison is limited. Second, looking at the scatterplots, Panels 1 and 5, both show a downward arc, indicating that the quadratic term is useful; indeed, the quadratic coefficients are significantly nonzero in the tables. The dashed fitted line of the second model seems to capture the points slightly better than the solid line of the first model. Third, comparing the observed vs. fitted values, Panels 2 and 6, Model 1 shows a lot more upward bowing in the points than Model 2, indicating that the first model still needs some nonlinear adjustment, whereas the second has points lining up beautifully. Fourth, looking at the normal QQ-plots of the standardized residuals, Panels 3 and 7, both models line up very well with the 45° line, indicating that the residuals of both are equally plausibly normally distributed. Fifth, comparing the plots of standardized residuals vs. fitted values, Panels 4 and 8, Model 1 shows a marked U-shape, revealing dependence on the fitted values and a missing nonlinear explanatory variable, whereas the plot for Model 2 is more uniformly scattered, looking like the stars in the night sky.
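To draw the dashed curve on the original (MPH, MPG) scale, one could back-transform the Model 2 predictions; a sketch, continuing from the m2 fit above:

# Overlay exp(0.7664 + 0.1030 x - 0.001158 x^2) on the MPG scatterplot
plot(MPH, MPG, pch = 19)
grid.x   <- seq(10, 75, by = 0.5)
pred.log <- predict(m2, data.frame(MPH = grid.x))
lines(grid.x, exp(pred.log), lty = 2)   # dashed line for Model 2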

(5.) In the 1984 article "Diversity and Distribution of the Pedunculate Barnacles," in Crustaceana, researchers gave counts of the Octolasmis tridens and the O. lowei barnacles on each of 10 lobsters. Does it appear that the barnacles compete for space on the surface of a lobster? If they do compete, do you expect the number of O. tridens and the number of O. lowei to be positively or negatively correlated? Run a test to see if the barnacles compete for space. State the assumptions on the data. State the null and alternative hypotheses. State the test statistic and the rejection region. Compute the statistic and state your conclusion.

> O.tridens <- c( 645, 320, 401, 364, 327, 73, 20, 221, 3, 5 )
> O.lowei <- c( 6, 23, 40, 9, 24, 5, 86, 0, 9, 350 )
> cor(O.tridens, O.lowei)
[1] -0.551912
> plot(O.tridens, O.lowei, pch = 19, main = "Barnacle Counts per Lobster")

[Figure: scatterplot of O.lowei vs. O.tridens, titled "Barnacle Counts per Lobster".]

The scatterplot shows that when there are few O. tridens there are many O. lowei, and vice versa. There seems to be a competition, which we expect will yield a negative correlation between the counts of the two barnacle species. To do a test of correlation, it is assumed that the data points $(x_i, y_i)$ come from a bivariate normal distribution. Since the scatterplot is "L"-shaped, this may not hold, although it is hard to tell from such a small sample. The null and alternative hypotheses are on the correlation coefficient:
$$H_0: \rho = 0; \qquad H_a: \rho < 0.$$
To test against zero, we use Pearson's correlation test. The test statistic is
$$t = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}}.$$
Under the null hypothesis, it is distributed according to the $t$-distribution with $n - 2$ degrees of freedom. Thus the rejection region is $t \le -t_{\alpha,\,n-2}$ for the one-tailed test. There are $n = 10$ data points. Computing, we find
$$t = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}} = \frac{(-0.551912)\sqrt{8}}{\sqrt{1 - (-0.551912)^2}} = -1.871973.$$
At the significance level $\alpha = .05$ we find $-t_{\alpha,\,n-2} = -t_{.05,8} = -1.860$ from Table A5 in the text. Thus we reject the null hypothesis. There is significant evidence that the correlation is negative.
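R's built-in cor.test performs this one-sided Pearson test directly; a minimal sketch using the vectors above:

# One-sided Pearson correlation test, Ha: rho < 0
cor.test(O.tridens, O.lowei, alternative = "less", method = "pearson")
# Reports t = -1.872 on 8 df with a one-sided p-value just under .05,
# matching the hand computation, so H0 is rejected at alpha = .05.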