Lecture 19: Inference for SLR & Transformations

Lecture 19: Inference for SLR & Transformations
Statistics 101, Mine Çetinkaya-Rundel
April 3, 2012

Announcements

HW 7 due Thursday.

Correlation guessing game ends on April 12 at noon. The winner will be announced in class. Prize: +1 (out of 100) point on the final. http://istics.net/stat/correlations (group: sta101)

Recap: Online quiz 7 - commonly missed questions

Question 1:

Recap: Review question

Which of the following is false? In SLR,
(a) residuals should be nearly normally distributed with mean 0
(b) residuals should have non-constant variance
(c) the residuals vs. x plot should show a random scatter around 0
(d) the relationship between x and y should be linear, and outliers should be handled with caution

Outline

1 Inference for linear regression
   Understanding regression output from software
   HT for the slope
   CI for the slope
   An alternative statistic
2 Transformations

Major league baseball

Yesterday in lab you worked with 2009 MLB data. What was the best predictor of runs?

[Scatterplot: runs (650-900) vs. on-base plus slugging, ob_slg (0.70-0.82)]

Major league baseball

R² for the regression line for predicting runs from on-base plus slugging is 91.31%. Which of the below is the correct interpretation of this value?

91.31% of...
(a) runs can be accurately predicted by on-base plus slugging.
(b) variability in predictions of runs is explained by on-base plus slugging.
(c) variability in predictions of on-base plus slugging is explained by runs.
(d) variability in runs is explained by on-base plus slugging.
(e) variability in on-base plus slugging is explained by runs.
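This value is easy to verify in R (a minimal sketch, assuming a data frame mlb with columns runs and ob_slg as in the lab; the model m is the same one summarized on the next slide). In SLR, R² is also the square of the correlation between x and y.

# fit the SLR model and pull out R-squared
m = lm(runs ~ ob_slg, data = mlb)
summary(m)$r.squared          # 0.9131
# in SLR this equals the squared correlation
cor(mlb$runs, mlb$ob_slg)^2   # same value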

Understanding regression output from software: Major league baseball (regression output)

m = lm(runs ~ ob_slg, data = mlb)
summary(m)

Call:
lm(formula = runs ~ ob_slg, data = mlb)

Residuals:
     Min       1Q   Median       3Q      Max
 -39.140  -12.568   -1.205   10.488   57.634

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -921.14      97.38  -9.459 3.24e-10 ***
ob_slg       2222.61     129.61  17.148  < 2e-16 ***
---
Residual standard error: 22.37 on 28 degrees of freedom
Multiple R-squared: 0.9131, Adjusted R-squared: 0.91
F-statistic: 294.1 on 1 and 28 DF, p-value: < 2.2e-16

HT for the slope: Testing for the slope

Clicker question: Assuming that the 2009 season is representative of all MLB seasons, we would like to test if these data provide convincing evidence that the slope of the regression line for predicting runs from on-base plus slugging is different than 0. What are the appropriate hypotheses?
(a) H0: b0 = 0; HA: b0 ≠ 0
(b) H0: β1 = 0; HA: β1 ≠ 0
(c) H0: b1 = 0; HA: b1 ≠ 0
(d) H0: β0 = 0; HA: β0 ≠ 0

HT for the slope: Testing for the slope (cont.)

             Estimate Std. Error t value Pr(>|t|)
(Intercept)      -921      97.38   -9.46   0.0000
ob_slg           2223     129.61   17.15   0.0000

We always use a t-test in inference for regression.

Remember: test statistic, T = (point estimate - null value) / SE

Point estimate: b1 is the observed slope, and is given in the regression output.

SE_{b1} is the standard error associated with the slope, and can be calculated as

SE_{b1} = sqrt( [Σ(y_i - ŷ_i)² / (n - 2)] / Σ(x_i - x̄)² )

It is also given in the regression output (and it's silly to try to calculate it by hand; just know that it's doable and why the formula works the way it does).

Degrees of freedom associated with the slope: df = n - 2, where n is the sample size.

HT for the slope: Testing for the slope (cont.)

             Estimate Std. Error t value Pr(>|t|)
(Intercept)      -921      97.38   -9.46   0.0000
ob_slg           2223     129.61   17.15   0.0000

T = (2223 - 0) / 129.61 = 17.15

df = 30 - 2 = 28

p-value = P(|T| > 17.15) < 0.01
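The same p-value can be computed directly in R (a minimal sketch using only the numbers on this slide; pt() is base R's t-distribution CDF):

# two-sided p-value for the slope test
t_stat = (2223 - 0) / 129.61                       # 17.15
2 * pt(abs(t_stat), df = 28, lower.tail = FALSE)   # < 2e-16, matching the output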

HT for the slope: % College graduate vs. % Hispanic in LA

What can you say about the relationship between % college graduate and % Hispanic in a sample of 100 zip code areas in LA?

[Two maps of LA zip code areas on a 0.0-1.0 color scale, with freeways marked and no-data areas blank: Education: College graduate; Race/Ethnicity: Hispanic]

HT for the slope: % College educated vs. % Hispanic in LA - another look

What can you say about the relationship between % college graduate and % Hispanic in a sample of 100 zip code areas in LA?

[Scatterplot: % college graduate vs. % Hispanic, both axes 0%-100%]

HT for the slope: % College educated vs. % Hispanic in LA - linear model

Clicker question: Which of the below is the best interpretation of the slope?

             Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.7290     0.0308   23.68   0.0000
%Hispanic     -0.7527     0.0501  -15.01   0.0000

(a) A 1% increase in Hispanic residents in a zip code area in LA is associated with a 75% decrease in % of college grads.
(b) A 1% increase in Hispanic residents in a zip code area in LA is associated with a 0.75% decrease in % of college grads.
(c) An additional 1% of Hispanic residents decreases the % of college graduates in a zip code area in LA by 0.75%.
(d) In zip code areas with no Hispanic residents, % of college graduates is expected to be 75%.

HT for the slope: % College educated vs. % Hispanic in LA - linear model

Do these data provide convincing evidence that there is a statistically significant relationship between % Hispanic and % college graduates in zip code areas in LA?

             Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.7290     0.0308   23.68   0.0000
hispanic      -0.7527     0.0501  -15.01   0.0000

HT for the slope: Violent crime rate vs. unemployment

Relationship between violent crime rate (annual number of violent crimes per 100,000 population) and unemployment rate (% of work-eligible population not working) in 51 US states (including DC):

[Scatterplot: violent_crime_rate (200-1400) vs. unemployed (3-6), with DC labeled]

Note: The data are from the 2003 Statistical Abstract of the US. A 2012 version is available online; if you're looking for data on states for your project, it's a good resource.

HT for the slope: Violent crime rate vs. unemployment

Clicker question: Which of the below is the correct set of hypotheses and the p-value for testing if the slope of the relationship between violent crime rate and unemployment is positive?

             Estimate Std. Error t value Pr(>|t|)
(Intercept)     27.68     130.00    0.21   0.8323
unemployed     105.03      32.04    3.28   0.0019

(a) H0: b1 = 0, HA: b1 ≠ 0, p-value = 0.0019
(b) H0: β1 = 0, HA: β1 > 0, p-value = 0.0019/2 = 0.00095
(c) H0: β1 = 0, HA: β1 ≠ 0, p-value = 0.0019/2 = 0.00095
(d) H0: b1 = 0, HA: b1 > 0, p-value = 0.0019/2 = 0.00095
(e) H0: β1 = 0, HA: β1 ≠ 0, p-value = 0.8323
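The halved p-value in the one-sided options can be reproduced in R (a sketch using the t value from the output; df = 51 - 2 = 49):

# one-sided p-value for HA: slope > 0
pt(3.28, df = 49, lower.tail = FALSE)   # approx 0.00095, i.e. 0.0019 / 2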

CI for the slope: Confidence interval for the slope

Clicker question: Remember that a confidence interval is calculated as point estimate ± ME, and the degrees of freedom associated with the slope in a simple linear regression is n - 2. Which of the below is the correct 95% confidence interval for the slope parameter? Note that the model is based on observations from 51 states.

             Estimate Std. Error t value Pr(>|t|)
(Intercept)     27.68     130.00    0.21   0.8323
unemployed     105.03      32.04    3.28   0.0019

(a) 27.68 ± 1.65 × 32.04
(b) 105.03 ± 2.01 × 32.04
(c) 105.03 ± 1.96 × 32.04
(d) 27.68 ± 1.96 × 32.04

CI for the slope: Recap

Inference for the slope for an SLR model (only one explanatory variable):

Hypothesis test: T = (b1 - null value) / SE_{b1}, with df = n - 2

Confidence interval: b1 ± t*_{df = n-2} × SE_{b1}

The null value is often 0, since we are usually checking for any relationship between the explanatory and the response variable.

The regression output gives b1, SE_{b1}, and the two-tailed p-value for the t-test for the slope where the null value is 0.

We rarely do inference on the intercept, so we'll be focusing on the estimates and inference for the slope.
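A minimal R sketch of this interval for the unemployment slope (values taken from the output on the previous slide; qt() gives the t critical value):

# 95% CI for the slope: b1 ± t* × SE_{b1}
b1 = 105.03; se_b1 = 32.04; n = 51
t_star = qt(0.975, df = n - 2)     # approx 2.01
b1 + c(-1, 1) * t_star * se_b1     # approx (40.6, 169.4)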

CI for the slope: Caution

Always be aware of the type of data you're working with: random sample, non-random sample, or population.

Statistical inference, and the resulting p-values, are meaningless when you already have population data.

If you have a sample that is non-random (biased), the results will be unreliable.

The ultimate goal is to have independent observations, and you know how to check for those by now.

An alternative statistic: ANOVA

We considered the t-test as a way to evaluate the strength of evidence for a hypothesis test evaluating the relationship between x and y.

However, we could instead focus on R², the proportion of variability in the response variable (y) explained by the explanatory variable (x).

A large R² suggests a linear relationship between x and y exists; a small R² suggests the evidence provided by the data may not be convincing.

Considering the amount of explained variability is called analysis of variance (ANOVA).

In SLR, where there is only one explanatory variable (and hence one slope parameter), the t-test and ANOVA yield the same result. In multiple linear regression, they provide different pieces of information.
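This equivalence is easy to check in R (a sketch, reusing the MLB model m from the regression output slide): in SLR the ANOVA F statistic is the square of the slope's t statistic.

# ANOVA table for the SLR model
anova(m)       # F = 294.1 on 1 and 28 df, same p-value as the slope t-test
17.148^2       # approx 294.1: the squared t statistic equals F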

Outline

1 Inference for linear regression
2 Transformations

Transformations: Truck prices

The scatterplot below shows the relationship between year and price for a random sample of 43 pickup trucks. Describe the relationship between these two variables.

[Scatterplot: price ($5,000-$20,000) vs. year (1980-2005)]

From: http://faculty.chicagobooth.edu/robert.gramacy/teaching.html

Transformations: Remove unusual observations

Let's remove trucks older than 20 years and focus only on trucks made in 1992 or later. Now what can you say about the relationship?

[Scatterplot: price ($5,000-$20,000) vs. year (1995-2005)]

Transformations: Truck prices - linear model?

Model: price = b0 + b1 × year

[Scatterplot: price vs. year with fitted line; residuals vs. year]

The linear model doesn't appear to be a good fit since the residuals have non-constant variance.

Transformations: Truck prices - log transform of the response variable

Model: log(price) = b0 + b1 × year

[Scatterplot: log(price) (7.5-10.0) vs. year with fitted line; residuals vs. year]

We applied a log transformation to the response variable. The relationship now seems linear, and the residuals have roughly constant variance.

Transformations: Interpreting models with log transformation

             Estimate Std. Error t value Pr(>|t|)
(Intercept)   -265.07      25.04  -10.59     0.00
pu$year          0.14       0.01   10.94     0.00

Model: log(price) = -265.07 + 0.14 × year

For each additional year the car is newer (for each year's decrease in the car's age), we would expect the log price of the car to increase on average by 0.14 log dollars... which is not very useful.

Transformations: Working with logs

Subtraction and logs: log(a) - log(b) = log(a / b)

Natural logarithm: e^{log(x)} = x

We can use these identities to undo the log transformation.
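A quick numerical check of both identities (in R, log() is the natural log):

log(20) - log(4)   # equals log(20 / 4) = log(5) = 1.609...
exp(log(7))        # returns 7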

Transformations: Interpreting models with log transformation (cont.)

The slope coefficient for the log-transformed model is 0.14, meaning the log price difference between cars that are one year apart is predicted to be 0.14 log dollars:

log(price at year x + 1) - log(price at year x) = 0.14

log( (price at year x + 1) / (price at year x) ) = 0.14

(price at year x + 1) / (price at year x) = e^{0.14} = 1.15

For each additional year the car is newer (for each year's decrease in the car's age), we would expect the price of the car to increase on average by a factor of 1.15.

Transformations: Recap - dealing with non-constant variance

Non-constant variance is one of the most common model violations; however, it is usually fixable by transforming the response (y) variable.

The most common variance-stabilizing transformation is the log transformation, log(y). When using a log transformation on the response variable, the interpretation of the slope changes: for each unit increase in x, y is expected on average to decrease/increase by a factor of e^{b1}.

Another useful transformation is the square root: √y.

These transformations may also be useful when the relationship is non-linear, but in those cases a polynomial regression may also be needed (this is beyond the scope of this course, but you're welcome to try it for your project, and I'd be happy to provide further guidance).
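For reference, the square-root variant would be fit the same way as the log model in the R code below (a sketch, assuming the same pu data frame; the model name m3 is mine, and only the log model appears in the slides):

# square-root transformation of the response (assumed variant)
m3 = lm(sqrt(pu$price) ~ pu$year)
plot(sqrt(pu$price) ~ pu$year); abline(m3)
plot(m3$residuals ~ pu$year)   # check whether the variance is now roughly constant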

Transformations: R code

# load data
pu_allyrs = read.csv("http://stat.duke.edu/courses/spring12/sta101.1/lec/pickups.csv")

# drop trucks older than 20 yrs old
pu = subset(pu_allyrs, pu_allyrs$year >= 1992)

# linear model
plot(pu$price ~ pu$year)
m1 = lm(pu$price ~ pu$year)
abline(m1)
plot(m1$residuals ~ pu$year)

# model with log transformation
plot(log(pu$price) ~ pu$year)
m2 = lm(log(pu$price) ~ pu$year)
abline(m2)
plot(m2$residuals ~ pu$year)

# model summary and interpretation of the slope coefficient
summary(m2)
exp(0.14)