Stat 501, F. Chiaromonte. Lecture #8


Data set: BEARS.MTW, from the Minitab example data sets (for a description, open the Help option and search for "Data Set Description"). Wild bears were anesthetized, and their bodies were measured and weighed. One goal of the study was to make a table (or perhaps a set of tables) for hunters, so they could estimate the weight of a bear based on other measurements (this would be used because in the forest it is easier to measure the length of a bear, for example, than it is to weigh it).

Variable  Count (N)  Description
Age         143      Bear's age, in months (can be derived)
Sex         143      1 = male, 2 = female
Head.L      143      Length of the head, in inches
Head.W      143      Width of the head, in inches
Neck.G      143      Girth (distance around) of the neck, in inches
Length      143      Body length, in inches
Chest.G     143      Girth (distance around) of the chest, in inches
Weight      143      Weight of the bear, in pounds

Unit = observation and measurement of a bear (but some animals were re-captured and re-measured several times). (This is not the entire data set.)

This is a well-suited example for a regression application:
1. Construct and estimate a regression model for Y = Weight on one or more of the measurements.
2. Use the model to estimate the mean weight, or predict a new weight, given the measurement(s): point estimate, confidence interval, prediction interval.

Descriptive Statistics

Variable    N    Mean     Median   TrMean   StDev    SE Mean  Minimum  Maximum   Q1     Q3
Age       143    43.43    32.00    40.17    34.02     3.73      8.00   177.00   19.0   58.0
Head.L    143    13.416   13.500   13.419    1.921    0.161     9.000   18.500  12.0   15.0
Head.W    143     6.331    6.000    6.293    1.314    0.110     4.000   10.000   5.5    7.0
Neck.G    143    21.327   20.000   21.254    5.069    0.424    10.000   33.000  18.0   25.5
Length    143    61.283   61.000   61.488    9.352    0.782    36.000   83.000  57.0   67.5
Chest.G   143    36.313   35.000   36.134    8.235    0.689    19.000   55.000  30.5   42.0
Weight    143   192.16   154.00   186.03   110.54     9.24     26.00   514.00  116.0  250.0

Correlations (Pearson)

         Ag     Hd.L   Hd.W   Nk.G   Lgt    Ct.G
Hd.L   0.687
Hd.W   0.669  0.744
Nk.G   0.734  0.862  0.805
Lgt    0.691  0.895  0.736  0.873
Ct.G   0.734  0.854  0.756  0.940  0.889
We     0.774  0.833  0.756  0.943  0.875  0.966

[Scatter plot matrix of Age, Head.L, Head.W, Neck.G, Length, Chest.G, and Weight; axis tick labels omitted]

As is to be expected, measurements on a body are very much related to one another, with strong linear components in the relationships (although, as the scatter plot matrix shows, the relationships are not necessarily purely or mainly linear). Since we are still doing simple regression, we need to construct a model that uses only one measurement. As a first try, let us use the one that has the strongest linear relationship with Weight, i.e. Chest.G.

Regression Analysis

The regression equation is
Weight = - 279 + 13.0 Chest.G

Predictor    Coef       StDev     T       P
Constant   -278.75     10.88    -25.61   0.000
Chest.G      12.9680    0.2923   44.36   0.000

S = 28.68   R-Sq = 93.3%   R-Sq(adj) = 93.3%

Analysis of Variance

Source            DF    SS        MS        F        P
Regression         1   1619241   1619241   1968.0   0.000
Residual Error   141    116012       823
Total            142   1735253

The regression line parameter estimates can be derived from the descriptive statistics in the previous outputs:

b1 = (sY / sX) rXY = (110.54 / 8.235)(0.966) ≈ 12.97
b0 = Av(Y) - b1 Av(X) = 192.16 - (12.968)(36.313) ≈ -278.75

Note that b0 is negative, and in any case Chest.G = 0 is not "reasonable". In order to give an interpretation of the "location" of our line (given the slope), we could interpret:

A Weight value: the fitted value at min(Chest.G) = 19
A Chest.G value: the Chest.G value at which the estimated line crosses the abscissa axis
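The derivation above uses only summary statistics; a minimal Python sketch (the numbers are the sample statistics from the descriptive output earlier in these notes):

```python
# Recover the least-squares line from summary statistics alone:
#   b1 = (s_Y / s_X) * r_XY,   b0 = mean(Y) - b1 * mean(X)
mean_y, s_y = 192.16, 110.54   # Weight: sample mean and SD
mean_x, s_x = 36.313, 8.235    # Chest.G: sample mean and SD
r_xy = 0.966                   # Pearson correlation, Weight vs. Chest.G

b1 = (s_y / s_x) * r_xy
b0 = mean_y - b1 * mean_x

print(f"b1 = {b1:.2f}")   # close to Minitab's 12.9680
print(f"b0 = {b0:.1f}")   # close to Minitab's -278.75
```

The small discrepancies with the Minitab output come only from the rounding of the summary statistics.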

[Regression plot: Weight vs. Chest.G, fitted line Y = -278.749 + 12.9680X, R-Sq = 93.3%]

Take Chest.G = 40. From the regression output, the descriptive statistics, and tables for the t-distribution, we can calculate:

Confidence interval for the mean weight (Chest.G = 40)
Prediction interval for a new weight (Chest.G = 40)

This value is very close to the sample mean of Chest.G, and well within the sample range. Now do the same at Chest.G = 80 (far outside the range):

Confidence interval for the mean weight (Chest.G = 80)
Prediction interval for a new weight (Chest.G = 80)

With Minitab we can get the bands about the estimated regression line (in "Fitted line plot", check out the "Options").
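The two intervals can be computed from the textbook formulas using only quantities already in the outputs above. A sketch (the critical value 1.977 is an approximation to the t quantile with n - 2 = 141 degrees of freedom; Minitab reports these intervals directly):

```python
import math

# Interval estimates at a new predictor value x0, from the fitted line,
# the residual SD, and the sample statistics of the predictor.
b0, b1 = -278.75, 12.968      # fitted intercept and slope
s, n = 28.68, 143             # residual standard deviation, sample size
mean_x, s_x = 36.313, 8.235   # Chest.G sample mean and SD
t_crit = 1.977                # approx. t_{0.975} quantile, 141 df

def intervals(x0):
    fit = b0 + b1 * x0
    sxx = (n - 1) * s_x ** 2                       # sum of squared deviations of X
    lev = 1 / n + (x0 - mean_x) ** 2 / sxx
    me_mean = t_crit * s * math.sqrt(lev)          # margin: CI for the mean response
    me_new = t_crit * s * math.sqrt(1 + lev)       # margin: PI for a new observation
    return fit, (fit - me_mean, fit + me_mean), (fit - me_new, fit + me_new)

for x0 in (40, 80):   # 40 is near the sample mean; 80 is far outside the range
    fit, ci, pi = intervals(x0)
    print(f"Chest.G={x0}: fit={fit:.1f}, "
          f"CI=({ci[0]:.1f}, {ci[1]:.1f}), PI=({pi[0]:.1f}, {pi[1]:.1f})")
```

Note how the prediction interval is always wider than the confidence interval, and how both widen dramatically at Chest.G = 80, where the extrapolation term (x0 - mean_x)^2 / sxx dominates.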

[Regression plot: Weight vs. Chest.G, fitted line Y = -278.749 + 12.9680X, R-Sq = 93.3%, with 95% CI and 95% PI bands]

This could be a good summary to give to a hunter: measure the Chest.G, and you have

A point estimate of the corresponding mean weight
An interval about it, which is guaranteed to contain the unknown mean weight with probability .95
Another interval about it (wider), which is guaranteed to contain a new weight observation with probability .95

But we are in for a surprise! Although the estimated regression line captured a lot of the Weight variability, maybe a line in Chest.G is not the most appropriate regression model to use!

[Residual plot: RESI1 vs. Chest.G, and Normal Probability Plot for RESI1 (ML estimates: Mean 0.0000000, StDev 28.4829)]

The residuals show quite a bit of curvature along the predictor. The coefficient of determination is quite high (the linear component of the relationship is strong), but the simple regression is not performing well because it leaves residuals containing a clear pattern along the predictor (a curve, here). (If we take the residuals as the sample equivalent of the errors, we have graphical evidence against the assumption E(ε) = 0 at every level of X.)

We will get into diagnostics next time; for now, intuitively: this ought to suggest that the mean relationship between Y and X is not linear. Maybe we should change our model and consider a parabolic relationship (second-order polynomial):

Weight = β0 + β1 Chest.G + β2 (Chest.G)^2 + ε,  with ε indep. of X, E(ε) = 0, var(ε) = σ^2

This model, although it still refers to one variable, contains two terms (linear in the parameters), so it is technically a multiple regression. (Create the variable (Chest.G)^2 in Minitab through "Calc", and then regress Weight on both Chest.G and (Chest.G)^2.)
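The "create a squared column, then regress on both columns" step can be sketched as an ordinary least-squares fit on a three-column design matrix. Since the raw BEARS.MTW values are not reproduced in these notes, the sketch uses synthetic data generated from a quadratic truth:

```python
import numpy as np

# A second-order polynomial in one predictor is fit as a multiple
# regression: Y on X and X**2. Synthetic data for illustration
# (coefficients chosen near the values in the Minitab output).
rng = np.random.default_rng(0)
chest = rng.uniform(19, 55, size=143)          # Chest.G-like predictor values
weight = (12.7 - 3.17 * chest + 0.212 * chest**2
          + rng.normal(0, 23, size=143))       # quadratic mean + noise

# Design matrix: intercept, linear term, squared term
X = np.column_stack([np.ones_like(chest), chest, chest**2])
coefs, *_ = np.linalg.lstsq(X, weight, rcond=None)
b0, b1, b2 = coefs
print(f"Weight = {b0:.1f} + {b1:.2f} Chest.G + {b2:.3f} (Chest.G)^2")
```

The least-squares fit recovers coefficients close to the generating values; on the real data this is exactly what Minitab's regression command does with the two predictor columns.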

Regression Analysis

The regression equation is
Weight = 12.7 - 3.17 Chest.G + 0.212 (Chest.G)^2

Predictor      Coef      StDev     T      P
Constant      12.65     35.19     0.36   0.720
Chest.G       -3.166     1.901   -1.67   0.098
(Chest.G)^2    0.21247   0.02484  8.56   0.000

S = 23.33   R-Sq = 95.6%   R-Sq(adj) = 95.5%

Analysis of Variance

Source            DF    SS        MS       F         P
Regression         2   1659068   829534   1524.38   0.000
Residual Error   140     76185      544
Total            142   1735253

[Regression plot: Weight vs. Chest.G, fitted curve Y = 12.6535 - 3.16627X + 0.212473X**2, R-Sq = 95.6%, with 95% CI and 95% PI bands]

[Residual plot: RESI2 vs. Chest.G, and Normal Probability Plot for RESI2 (ML estimates: Mean -0.0000000, StDev 23.0816)]

Now we have eliminated the curved pattern in the residual plot quite efficiently. But now we can see clear non-constant variance: residuals seem to be larger at larger predictor levels. (We have graphical evidence against the assumption var(ε) = σ^2 at every level of X.)

One way to eliminate or weaken this kind of phenomenon is to transform the response via a variance-stabilizing transformation. A typical one is log(.). Maybe we should change our model again and consider:

log(Weight) = β0 + β1 Chest.G + β2 (Chest.G)^2 + ε,  with ε indep. of X, E(ε) = 0, var(ε) = σ^2

(Again, create the variable log(Weight) in Minitab through "Calc".)

Why should it "stabilize" the variability? It tends to "blow up" very small values, and "shrink" very large ones, in comparison.
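The stabilizing effect is easy to demonstrate numerically: when the spread of Y grows roughly in proportion to its level (as the residual plot suggests here), taking logs makes the spread approximately constant. A small synthetic sketch:

```python
import numpy as np

# Two groups whose SD is proportional to their mean (SD/mean = 0.1 in both):
# on the raw scale their spreads differ by a factor of ~8, but after log()
# the spreads are nearly equal.
rng = np.random.default_rng(1)
small = rng.normal(loc=50, scale=5, size=10_000)    # low level, small spread
large = rng.normal(loc=400, scale=40, size=10_000)  # high level, spread x8

print(f"raw SD ratio: {large.std() / small.std():.1f}")                  # near 8
print(f"log SD ratio: {np.log(large).std() / np.log(small).std():.1f}")  # near 1
```

This is the delta method in action: for Y with mean μ and SD proportional to μ, log(Y) has SD approximately SD(Y)/μ, which is constant.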

Regression Analysis

The regression equation is
log(Weight) = 0.309 + 0.0745 Chest.G - 0.000580 (Chest.G)^2

Predictor      Coef          StDev        T       P
Constant       0.30873       0.09496      3.25    0.001
Chest.G        0.074508      0.005129    14.53    0.000
(Chest.G)^2   -0.00057992    0.00006702  -8.65    0.000

S = 0.06295   R-Sq = 94.3%   R-Sq(adj) = 94.3%

Analysis of Variance

Source            DF    SS       MS       F         P
Regression         2   9.2369   4.6184   1165.61   0.000
Residual Error   140   0.5547   0.0040
Total            142   9.7916

[Regression plot: log(Weight) vs. Chest.G, fitted curve Y = 0.308728 + 7.45E-02X - 5.80E-04X**2, R-Sq = 94.3%, with 95% CI and 95% PI bands]

[Residual plot: RESI3 vs. Chest.G, and Normal Probability Plot for RESI3 (ML estimates: Mean 0.0000000, StDev 0.0622827)]

The sign of the quadratic term in the parabola has changed in passing to the log of Weight. The coefficient of determination has decreased slightly, but the non-constant variance in the residual plot has almost completely vanished.

Notice there seems to be an outlier (a point far from the bulk of the observations, that seems to "come from another population" and not to satisfy the relationship between response and predictor suggested by the other points). We could repeat the analysis eliminating this far-out point.

Also, we could separate females and males: we have not used the qualitative variable "Sex" yet. Construct and estimate two separate regression models for males and females:

If one obtains very similar models, then based on the sample data the relationship between Weight and Chest.G appears to be the same in the two sex subpopulations. We might as well have one common model for both.
If one obtains very different models, then based on the sample data the relationship between Weight and Chest.G is sex-specific. Two separate models will fit the data much better than a common one.

(In other words: is the gain in explanatory power and precision large enough to justify the fact that the hunter will carry along two tables/models, one for male and one for female bears?)

Notice that we are reasoning about how to introduce a qualitative variable into a (multiple) regression.
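One standard way to fit "two models at once", previewing the dummy-variable approach of multiple regression: code Sex as a 0/1 indicator and include both the indicator and its interaction with the predictor. A sketch on synthetic data (the real BEARS.MTW values are not reproduced here; the generating coefficients are illustrative):

```python
import numpy as np

# Sex-specific lines via one regression: Y on X, a 0/1 dummy, and their
# interaction. The dummy shifts the intercept; the interaction shifts the
# slope. Synthetic data with genuinely different slopes by group.
rng = np.random.default_rng(2)
n = 143
chest = rng.uniform(19, 55, size=n)
male = rng.integers(0, 2, size=n)      # dummy: 1 = male, 0 = female
weight = (-250 + 12 * chest           # female line
          + 30 * male + 2 * male * chest   # male shifts: intercept and slope
          + rng.normal(0, 25, size=n))

# Design: intercept, Chest.G, dummy, dummy x Chest.G
X = np.column_stack([np.ones(n), chest, male, male * chest])
b, *_ = np.linalg.lstsq(X, weight, rcond=None)

print(f"female line: Weight = {b[0]:.1f} + {b[1]:.2f} Chest.G")
print(f"male line:   Weight = {b[0] + b[2]:.1f} + {b[1] + b[3]:.2f} Chest.G")
```

If the dummy and interaction coefficients are both near zero, a single common line suffices; the t-test on the interaction term is the formal check of whether the slopes differ between the sexes.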