STAT 512 MidTerm I (2/21/2013) Spring 2013 INSTRUCTIONS


Name: Key

1. This exam is open book/open notes. All papers (but no electronic devices except for calculators) are allowed.
2. There are 5 pages in addition to the cover sheet. If you need more room for a problem, use the back of the sheets; clearly indicate where the answer is located.
3. Only 3 decimal places are required for all answers except for some answers in question 1. In question 1, if the number used is less than 0.01, please include the whole number in the work.
4. Work is required to receive credit. Partial credit will be given for work that is partially correct. Points will be deducted for incorrect work even if the final answer is correct.
5. If I cannot read your answer, it will be marked wrong.
6. Good Luck!

Question   Possible   Score
1          42
2          25
3          43
Total      110

(42 pts.) 1. How was Hubble's Constant calculated? In 1929, Hubble investigated the relationship between distance and recession velocity of extra-galactic nebulae. The following are the edited results from the 24 nebulae that Hubble used in his study. The distance is in Megaparsecs from Earth and the recession velocity (r_velocity) is in km/s. (http://lib.stat.cmu.edu/dasl/datafiles/hubble.html)

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1          5.97547       5.97547     36.44   <.0001
Error             22          3.60782       0.16399
Corrected Total   23          9.58329

Parameter Estimates

Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     1              0.39910          0.11847       ???     0.0028
r_velocity    1              0.00137       0.00022744       ???     <.0001

Output Statistics

Obs   r_velocity   Dependent Variable   Predicted Value   Std Error Mean Predict   95% CL Mean   Residual
25    200          .                    0.6737            0.0916                   ???   ???     .

a) Write down the simple linear regression model and the assumed distribution of the errors.

$Y\,(\text{distance}) = \beta_0 + \beta_1 X\,(\text{r\_velocity}) + \varepsilon$, with $\varepsilon \sim \text{iid } N(0, \sigma^2)$

b) Write down the estimated regression line using the data above.

$\hat{Y} = 0.399 + 0.00137\,X$

c) Explain the difference between parts a) and b).

The model (part a) describes each individual point in the population. The estimated regression line (part b) describes the best-fit line in the sample.

d) What is the fitted value of Y (distance) for X (r_velocity) = 300? If the actual value of Y is 0.80, what is the residual?

$\hat{Y} = 0.399 + (0.00137)(300) = 0.810$
$e = Y - \hat{Y} = 0.80 - 0.81 = -0.01$
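The arithmetic in part d) can be checked with a short sketch. It uses the rounded coefficients from part b); the function names `fitted_distance` and `residual` are illustrative and do not come from the exam or from SAS.

```python
# Coefficients rounded as in the answer to part b).
B0 = 0.399      # estimated intercept (Megaparsecs)
B1 = 0.00137    # estimated slope (Megaparsecs per km/s)


def fitted_distance(r_velocity):
    """Fitted distance (Megaparsecs) on the estimated regression line."""
    return B0 + B1 * r_velocity


def residual(observed, r_velocity):
    """Residual e = Y - Y_hat for a single observation."""
    return observed - fitted_distance(r_velocity)


y_hat = fitted_distance(300)
e = residual(0.80, 300)
print(f"fitted = {y_hat:.3f}, residual = {e:.3f}")
```

A negative residual means the observed distance lies below the fitted line at that recession velocity.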

e) Would it be reasonable to consider inference on the intercept $\beta_0$ for this data? Explain why or why not.

Without looking at the data, yes, you could consider inference on $\beta_0$ because it is possible for the recession velocity to be 0 (this means that the extra-galactic nebula is not moving toward or away from the Earth). In fact, the data included points that were negative and very close to 0.

f) Calculate $R^2$ using the information provided.

$R^2 = \dfrac{SSM}{SST} = \dfrac{5.975}{9.583} = 0.624$ OR $R^2 = 1 - \dfrac{SSE}{SST} = 1 - \dfrac{3.608}{9.583} = 0.624$

g) From the information provided, can this model be used for prediction of new data points? Why or why not?

Either answer was correct here depending on the justification. My answer is maybe. The p-value is good and, with the units used, both SSM and SST are small; therefore, if $R^2$ were high enough, it would be a good fit. However, $R^2$ is not very high (there is a fair amount of noise in the data), so I am not sure. Note: You had to mention the size of the values for SSM and SST in addition to $R^2$ and the p-value. We are assuming here that the data follow a straight line and are not curvilinear.

h) Calculate and interpret the 95% confidence interval for the estimate of the mean at X (r_velocity) = 200.

$t_c = t_{22}(1 - \alpha/2) = t_{22}(0.975) = 2.074$
$\hat{Y}_{200} \pm t_c\, s\{\hat{Y}_{200}\} = 0.674 \pm (2.074)(0.0916) = 0.674 \pm 0.190 \Rightarrow (0.484, 0.864)$

We are 95% confident that the mean distance in the population at a recession velocity of 200 km/s is between 0.484 Megaparsecs and 0.864 Megaparsecs.

i) In this situation, when would the confidence band be more appropriate than what was calculated in part h)?

A confidence band would be more appropriate than a confidence interval if we were interested in looking at all of the possible recession velocities at once, versus only one of them as in part h).

j) If the optimal Box-Cox transformation suggests $\lambda = 0$, what is the optimal transformed response? That is, what function of Y should be used to perform the linear regression?

$Y' = \ln Y = \log_e Y$

I did give full credit for $Y' = \log Y$ (this is assumed to be $\log_{10}$).
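The calculations in parts f), h), and j) can be reproduced with a minimal sketch. The critical value 2.074 is hard-coded from a t table, as in the answer above, rather than computed. The helper names are illustrative. `box_cox` uses the common $(Y^\lambda - 1)/\lambda$ form of the family, which is an assumption about convention; its limit as $\lambda \to 0$ is $\ln Y$.

```python
import math

# Values read from the SAS output in question 1.
SSM, SSE, SST = 5.97547, 3.60782, 9.58329
Y_HAT_200 = 0.6737   # predicted mean distance at r_velocity = 200
SE_MEAN = 0.0916     # standard error of the mean prediction
T_CRIT = 2.074       # t_22(0.975), taken from a t table


def r_squared(ssm, sst):
    """Proportion of total variation explained by the model."""
    return ssm / sst


def mean_ci(estimate, se, t_crit):
    """Confidence interval estimate +/- t_crit * se for the mean response."""
    margin = t_crit * se
    return estimate - margin, estimate + margin


def box_cox(y, lam):
    """Box-Cox transform of a positive value; lambda = 0 gives ln(y)."""
    if y <= 0:
        raise ValueError("Box-Cox requires a positive response")
    return math.log(y) if lam == 0 else (y ** lam - 1) / lam


r2 = r_squared(SSM, SST)
r2_alt = 1 - SSE / SST
lower, upper = mean_ci(Y_HAT_200, SE_MEAN, T_CRIT)
print(f"R^2 = {r2:.3f} (alternative form: {r2_alt:.3f})")
print(f"95% CI for the mean at 200 km/s: ({lower:.3f}, {upper:.3f})")
```

Both forms of $R^2$ agree because SSM + SSE = SST in the ANOVA table.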

(25 pts.) 2. The following modified data is based on the number of cigarettes smoked (households per capita) and the deaths per 100K from kidney cancer for 40 states in 1960. (http://lib.stat.cmu.edu/dasl/datafiles/cigcancerdat.html)

a) State each of the assumptions that are required for the linear regression model and state whether they are or are not met in this context. Be sure to mention all 3 of the plots in your discussion.

linearity: not met because of graphs A and B.
outliers: not met because of graph A.
constant variance: not met because of graphs A and B (note: I did give full credit if you stated that it was met).
normality: met because of the QQ plot (note: I did give full credit if you stated that it was not met).
independence: cannot tell from the graphs; needs to be determined from experimental conditions.

b) If any of the assumptions are not met, describe a possible method that could be used to remedy the situation. Please explain your choice, being as explicit as possible. That is, if you are going to use a transformation, state which possible transformations would be used; if you are going to use a procedure in SAS, describe which procedure to use; etc. If all of the assumptions are met, state that fact and then choose one assumption and finish the question assuming that your chosen assumption is violated.

The answer to this question depended on the answer to part a). If you stated that the only condition not met was linearity, then the correct answer here would be an X transformation. From the table in the book, the possible choices were $X' = \log X$ or $X' = \sqrt{X}$. If you stated that constant variance and/or normality were not met, then the correct answer here would be a Y transformation. To determine which Y transformation would be appropriate, you would use the Box-Cox procedure in SAS. After the transformation is performed, you need to rerun the diagnostics to see whether all of the assumptions are now met.

(43 pts.) 3. A General Psychology Instructor wanted to know if she could predict the score on the final from the scores on the first three exams. There were 25 students in this class. The following is the edited SAS output. (http://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/mlr/frames/frame.html - test scores for General Psychology)

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3            13732      4577.333   672.196   <.0001
Error             21              143        6.8095
Corrected Total   24            13875

a) Fill in the missing values in the SAS output above. Please show work for all of the empty spaces below.

$p = 4$, $n = 25$
$df_M = p - 1 = 4 - 1 = 3$
$df_E = n - p = 25 - 4 = 21$
$df_T = n - 1 = 25 - 1 = 24 = df_M + df_E = 3 + 21$
$SSE = SST - SSM = 13875 - 13732 = 143$
$MSM = \dfrac{SSM}{df_M} = \dfrac{13732}{3} = 4577.333$
$MSE = \dfrac{SSE}{df_E} = \dfrac{143}{21} = 6.8095$
$F = \dfrac{MSM}{MSE} = \dfrac{4577.333}{6.8095} = 672.196$
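The table completion in part a) follows a fixed recipe, sketched below. The function name `anova_table` and the dictionary keys are illustrative; the inputs (SSM, SST, n = 25 students, p = 4 parameters) are the values from the problem.

```python
def anova_table(ssm, sst, n, p):
    """Fill in a regression ANOVA table from SSM, SST, sample size n,
    and number of parameters p (intercept included)."""
    df_m = p - 1          # model degrees of freedom
    df_e = n - p          # error degrees of freedom
    df_t = n - 1          # corrected total degrees of freedom
    sse = sst - ssm       # error sum of squares
    msm = ssm / df_m      # model mean square
    mse = sse / df_e      # error mean square
    return {
        "df_m": df_m, "df_e": df_e, "df_t": df_t,
        "sse": sse, "msm": msm, "mse": mse,
        "F": msm / mse,
    }


table = anova_table(ssm=13732, sst=13875, n=25, p=4)
for name, value in table.items():
    print(f"{name:>5} = {value:.4f}" if isinstance(value, float) else f"{name:>5} = {value}")
```

A quick consistency check is that the model and error degrees of freedom sum to the total: 3 + 21 = 24.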

b) Using the results in part a), perform the appropriate significance test. Please include the null and alternative hypotheses, the test statistic with the degrees of freedom, the p-value, the decision, and the conclusion in words (that is, the conclusion needs to be stated in the context of the problem).

$H_0: \beta_1 = \beta_2 = \beta_3 = 0$
$H_a$: at least one $\beta_k \neq 0$
$F = 672.196$ with df(numerator) = 3, df(denominator) = 21
$p < 0.0001$
decision: reject $H_0$
The data strongly support the claim ($P < 0.0001$) that at least one of the three exam scores is associated with the score on the final.

c) Write down the design matrix for this problem. You may use variables for the actual data points. Be sure to clearly indicate what the dimensions of the matrix are.

Note: If you wrote down all of the matrices and did not indicate which one was the design matrix, you lost points.

$$X_{25 \times 4} = \begin{bmatrix} 1 & X_{1,1} & X_{1,2} & X_{1,3} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & X_{25,1} & X_{25,2} & X_{25,3} \end{bmatrix}$$

d) Given the data below, do you expect a problem with correlation between the explanatory variables? Why or why not?

Pearson Correlation Coefficients, N = 25

         Final     Exam1     Exam2     Exam3
Final    1.00000   0.94607   0.92947   0.97233
Exam1    0.94607   1.00000   0.90136   0.89274
Exam2    0.92947   0.90136   1.00000   0.84636
Exam3    0.97233   0.89274   0.84636   1.00000

Yes, I expect a problem with correlation between the explanatory variables because of the high correlations among them; they run from 0.846 to 0.901. There was no reason to include the p-values here because the correlations are so high. Remember, in the body fat example, we had problems when the correlation coefficients were around 0.5.
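Parts c) and d) can be illustrated with a small sketch. The three rows of scores are hypothetical placeholders, not the exam's data set; the pairwise correlations are copied from the table in part d). The names `design_matrix` and `predictor_corr` are illustrative.

```python
def design_matrix(rows):
    """Prepend the intercept column of 1s to each row of predictor values."""
    return [[1.0, *row] for row in rows]


# Hypothetical (Exam1, Exam2, Exam3) scores for three students;
# the real class has 25 rows, giving a 25 x 4 matrix.
scores = [(73, 80, 75), (93, 88, 93), (89, 91, 90)]
X = design_matrix(scores)
print(f"dimensions: {len(X)} x {len(X[0])}")

# Correlations among the explanatory variables only (Final is the response),
# taken from the Pearson correlation table in part d).
predictor_corr = {
    ("Exam1", "Exam2"): 0.90136,
    ("Exam1", "Exam3"): 0.89274,
    ("Exam2", "Exam3"): 0.84636,
}
lowest = min(predictor_corr.values())
highest = max(predictor_corr.values())
print(f"predictor correlations range from {lowest:.3f} to {highest:.3f}")
```

Only the off-diagonal entries among Exam1, Exam2, and Exam3 matter for this question; the correlations with Final describe the response, not the relationship among predictors.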