Introduction to Regression

Using Multiple Linear Regression
  - Derived variables
  - Many alternative models
  - Which model to choose?
Model Criticism
  - Modelling objective
  - Model details
  - Data and residuals
  - Assumptions

Introduction to Regression: Criticism
Objective?
  1. Informal
     - Analyse/explore the data
     - Find patterns/exceptions
  2. Use for prediction
     - Precision
  3. Use for testing/forming aspects of theory
     - Understand how the variables are interrelated
     - Understand coefficients
     - What variables are important?

1 Informal Objectives
Analyse/explore the data; find patterns/exceptions.
Is a single response variable clear?
  - No: use a scatterplot matrix, smoothers, correlation. Generally: use multivariate analysis.
  - Yes: also use regression.
      - What predictors are most important?
      - Interpret coefficients.
      - Add new x vars; lurking vars? Too many vars?
      - Check residuals. Exceptions? Patterns?

Math Marks: 88 Students

First rows of the data:
Stat   Anal   Alg   Vect   Mech
81     67     67    82     77
81     70     80    78     63
81     66     71    73     75
68     70     63    72     55
63     70     65    63     63
73     64     72    61     53
68     65     65    67     51
56     62     68    70     59

Descriptive Statistics: Stat, Anal, Alg, Vect, Mech
Variable   Mean    StDev
Stat       42.31   17.26
Anal       46.68   14.85
Alg        50.60   10.62
Vect       50.59   13.15
Mech       38.95   17.49

Correlation: Stat, Anal, Alg, Vect, Mech
        Stat    Anal    Alg     Vect
Anal    0.607
Alg     0.665   0.711
Vect    0.436   0.485   0.610
Mech    0.389   0.409   0.547   0.553
Cell contents: Pearson correlation

[Figure: Matrix Plot of Stat, Anal, Alg, Vect, Mech]

2 Prediction
Best?
  - Seek large R²; small S
  - Simple model: small number of predictors; natural scale
  - Interpretable coefficients
  - No important exceptions
  - No pattern in residuals (later)
Prediction intervals
  - Great care with extrapolation

Standard Errors: Data Like This
Regression Analysis: LogVol versus LogHt, LogDiam
The regression equation is
LogVol = - 2.88 + 1.12 LogHt + 1.98 LogDiam

Predictor   Coef      SE Coef   T       P
Constant    -2.8801   0.3473    -8.29   0.000
LogHt       1.1171    0.2044    5.46    0.000
LogDiam     1.98265   0.07501   26.43   0.000

S = 0.0353   R-Sq = 97.8%

For 95% of analyses of data like this:
  - LogHt coeff in (1.12 ± 2(0.20)), i.e. ≈ 1
  - LogDiam coeff in (1.98 ± 2(0.075)), i.e. ≈ 2
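The "estimate ± 2 standard errors" rule on this slide can be sketched in a few lines. This is only an illustration: the function name `approx_interval` is made up here, and the standard errors are the rounded values 0.20 and 0.075.

```python
# Rounded estimates and standard errors taken from the regression output above.
coefs = {
    "LogHt": (1.12, 0.20),
    "LogDiam": (1.98, 0.075),
}

def approx_interval(estimate, se, k=2.0):
    """Rough 95% interval: estimate +/- k standard errors (k = 2 by convention)."""
    return estimate - k * se, estimate + k * se

intervals = {name: approx_interval(b, se) for name, (b, se) in coefs.items()}

for name, (low, high) in intervals.items():
    print(f"{name}: {low:.2f} to {high:.2f}")
```

The LogHt interval contains 1 and the LogDiam interval contains 2, which is what motivates the simpler model Vol ∝ Ht × Diam².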

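Simple is Good
The regression equation is
  $\text{LogVol} = -2.699 + 1.005\,\log_{10}(\text{Ht} \times \text{Diam}^2) \pm 2(0.035)$
i.e.
  $\text{LogVol} = -2.7 + \log_{10}(\text{Ht} \times \text{Diam}^2) \pm 2(0.035)$
  $\text{Vol} = 10^{-2.7} \times (\text{Ht} \times \text{Diam}^2) \times 10^{\pm 2(0.035)}$
Meaning: Vol lies in an interval based on (Ht × Diam²). Roughly:
  - $10^{+2(0.035)} = 1.17$, i.e. +17%
  - $10^{-2(0.035)} = 0.85$, i.e. -15%
For 95% of trees, Vol = 0.0020 × (Ht × Diam²) to within about 16%.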
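The back-transformation from the log10 scale to the original scale can be checked numerically. The sketch below assumes the rounded intercept -2.7, a slope of exactly 1, and S = 0.035; the names `predict_volume` and `rough_band` are illustrative only.

```python
# Rounded values from the fitted log10 model; names are illustrative.
INTERCEPT = -2.7   # intercept on the log10 scale
S = 0.035          # residual standard deviation on the log10 scale

scale = 10 ** INTERCEPT          # multiplicative constant on the original scale
upper_factor = 10 ** (+2 * S)    # upper edge of the rough 95% band
lower_factor = 10 ** (-2 * S)    # lower edge of the rough 95% band

def predict_volume(height, diameter):
    """Central prediction Vol = 10^intercept * Ht * Diam^2 (slope taken as 1)."""
    return scale * height * diameter ** 2

def rough_band(height, diameter):
    """Rough 95% band: central prediction times the two back-transformed factors."""
    centre = predict_volume(height, diameter)
    return centre * lower_factor, centre * upper_factor

print(f"scale = {scale:.4f}")
print(f"band factors = {lower_factor:.2f} to {upper_factor:.2f}")
```

Because the error is additive on the log scale, it becomes a percentage error on the original scale, which is why the band is quoted as roughly -15% to +17%.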

Simple is good
Model                            S        R-Sq
Vol from Ht and Diam             3.88     94.80%
logvol from loght, logdiam       0.0353   97.87%
logvol from log(Ht × Diam²)      0.0355   97.7%

Residuals - Trees
Patterns? Exceptions?

Vol vs Ht, Diam in linear scale
Unusual Observations
Obs   Vol     Fit     Resid   Std Resid
31    77.00   68.52   8.48    2.9 R

Vol vs Ht × Diam² in log scale
Unusual Observations
Obs   logvol   Fit      Resid     Std Resid
15    1.2810   1.3539   -0.0728   -2.12 R

R: large residual

3 Prediction for Theory
Means of comparing/testing coeffs (evolving theory)
  - Compared to what?
  - Controlling for external variation?
Coefficients: interpretation
  - One-at-a-time: SEs, t-ratios, confidence intervals
  - Many-at-once: analysis of variance
  - Entire model: R²
Confidence bands

Gas: Modelled with Interaction
Regression Analysis: Gas versus Temperature, Insulated, Ins X Temp

Model Summary
S         R-sq     R-sq(adj)   R-sq(pred)
0.32300   92.77%   92.35%      91.44%

Coefficients
Term          Coef      SE Coef   T-Value   P-Value   VIF
Constant      6.854     0.136     50.41     0.000
Temperature   -0.3932   0.0225    -17.49    0.000     2.02
Insulated     -2.130    0.180     -11.83    0.000     4.33
Ins X Temp    0.1153    0.0321    3.59      0.001     4.70

S = 0.32300   R-Sq = 92.8%
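One way to read an interaction model is as two separate lines, one per group. The sketch below assumes Insulated is a 0/1 indicator and uses the coefficients reported for this model; `fitted_gas` and `line_for` are illustrative names.

```python
# Coefficients of Gas = b0 + b_temp*Temp + b_ins*Ins + b_int*(Ins*Temp).
B0, B_TEMP, B_INS, B_INT = 6.8538, -0.3932, -2.1300, 0.1153

def fitted_gas(temperature, insulated):
    """Fitted gas consumption; insulated is 0 (before) or 1 (after)."""
    return (B0 + B_TEMP * temperature + B_INS * insulated
            + B_INT * insulated * temperature)

def line_for(insulated):
    """Intercept and temperature slope implied for one insulation state."""
    return B0 + B_INS * insulated, B_TEMP + B_INT * insulated

for state in (0, 1):
    intercept, slope = line_for(state)
    print(f"Insulated={state}: Gas = {intercept:.3f} + ({slope:.4f}) * Temperature")
```

Under these assumptions insulation lowers the intercept by about 2.13 and flattens the temperature slope from about -0.39 to about -0.28.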

Stat Significance
Simplest for MTB
  - Coeff value is 0: var plays no role
  - Coeff ≠ 0: var plays some role
Statistically sig: 95% interval does not include 0
  - For data like this, var seems to play some role in prediction
Formal logic
  $T = \dfrac{\hat\beta - \beta_{Hyp}}{SE(\hat\beta)} = \dfrac{-0.3932 - 0}{0.0225} \approx -17.49$
  - T stat inconsistent with the explanation: $\beta_{Hyp} = 0$ and chance
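The T ratio for the Temperature coefficient can be reproduced from the rounded table values. This is a minimal sketch; `t_ratio` is an illustrative name, and the small difference from the package's -17.49 comes from using rounded inputs.

```python
def t_ratio(estimate, se, hypothesised=0.0):
    """T = (estimate - hypothesised value) / standard error."""
    return (estimate - hypothesised) / se

# Temperature coefficient and its standard error, as printed in the output.
beta_hat, se_beta = -0.3932, 0.0225

t = t_ratio(beta_hat, se_beta)

# Rough 95% interval; "statistically significant" means it excludes 0.
low, high = beta_hat - 2 * se_beta, beta_hat + 2 * se_beta
excludes_zero = not (low <= 0.0 <= high)

print(f"T = {t:.2f}, interval = ({low:.4f}, {high:.4f}), excludes 0: {excludes_zero}")
```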

Stat Significance
Statistically sig: 95% interval does not include 0
Alternatively, for data like this: if in fact the null hypothesis is true, then the observed T is large, in the sense that Pr(|T| > 17.49) is very small when the null hypothesis is true.
Contrast: scientifically sig
  - Var plays an important role in prediction / in theory

What are Significant implications?
Term          Coef      SE Coef   T-Value   P-Value   VIF
Constant      6.854     0.136     50.41     0.000
Temperature   -0.3932   0.0225    -17.49    0.000     2.02
Insulated     -2.130    0.180     -11.83    0.000     4.33
Ins X Temp    0.1153    0.0321    3.59      0.001     4.70

Residuals
Fits and Diagnostics for Unusual Observations
Obs   Gas     Fit     Resid    Std Resid
1     7.200   7.168   0.032    0.11  X
2     6.900   7.129   -0.229   -0.80 X
36    3.200   3.862   -0.662   -2.10 R
46    4.000   3.362   0.638    2.01  R
55    1.300   2.278   -0.978   -3.24 R
R: large residual   X: unusual X

Model Criticism
Variables are columns; cases are rows.
  - Which variables are important? In what way?
  - Which cases are important? In what way?
      - Outliers, residuals, influence
  - Errors: Normal? Independent?

Diagnostics
The regression equation is
Volume = - 1.7 - 0.09 Diameter + 0.030 Height + 0.39 Ht*Diam^2

Unusual Observations
Obs   Diameter   Volume   Fit      SE Fit   Residual   St Resid
18    13.3       27.400   32.334   1.006    -4.934     -2.08 R
31    20.6       77.000   78.288   2.030    -1.288     -0.81 X
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

[Figure: Residual Plots for Vol. Panels: Normal Probability Plot (Percent vs Residual), Versus Fits (Residual vs Fitted Value), Histogram (Frequency vs Residual), Versus Order (Residual vs Observation Order)]

Outliers
Unusual Observations
Obs   Diameter   Volume   Fit      SE Fit   Residual   St Resid
18    13.3       27.400   32.334   1.006    -4.934     -2.08 R
31    20.6       77.000   78.288   2.030    -1.288     -0.81 X
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

[Figure: Scatterplot of SRES1 vs FITS1, points labelled by observation number]
[Figure: Fitted Line Plot, Vol = - 0.0000 + 1.000 FITS1, points labelled by observation number]

Leverage
  - x values far from the centre have high leverage.
  - More difficult to see when there are many x variables.
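For a single predictor, leverage has a closed form, h_i = 1/n + (x_i - x̄)² / Σ(x_j - x̄)², which makes "far from the centre" concrete. The sketch below uses made-up x values purely for illustration; with several predictors the same quantity comes from the diagonal of the hat matrix instead.

```python
def leverages(x):
    """Leverage of each case in a straight-line regression on x."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]

# Hypothetical x values: four clustered cases and one far from the centre.
x_values = [1, 2, 3, 4, 10]
h = leverages(x_values)

for xi, hi in zip(x_values, h):
    print(f"x = {xi:>2}: leverage = {hi:.2f}")
```

The leverages always sum to the number of fitted parameters (2 here: intercept and slope), so one case holding 0.92 of that total dominates the fit.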

Outlying and Influential cases
Outlying cases
  - Unusual y values, for these x-values
Influential cases
  - Unusual combination of x-values
Perhaps erroneous
  - Omit and re-analyse?
  - Certainly worth double-checking

Gas Consumption
The regression equation is
Gas = 6.85 - 0.393 Temperature - 2.13 Insulated + 0.115 Ins X Temp

Unusual Observations
Obs   Temperature   Gas      Fit      SE Fit   Residual   St Resid
1     -0.8          7.2000   7.1684   0.1521   0.0316     0.11 X
2     -0.7          6.9000   7.1291   0.1501   -0.2291    -0.80 X
36    3.1           3.2000   3.8623   0.0667   -0.6623    -2.10 R
46    4.9           4.0000   3.3620   0.0598   0.6380     2.01 R
55    8.8           1.3000   2.2780   0.1156   -0.9780    -3.24 R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

Diagnostics for Gas
Obs   Temperature   Gas      Fit      SE Fit   Residual   St Resid
1     -0.8          7.2000   7.1684   0.1521   0.0316     0.11 X
2     -0.7          6.9000   7.1291   0.1501   -0.2291    -0.80 X
36    3.1           3.2000   3.8623   0.0667   -0.6623    -2.10 R
46    4.9           4.0000   3.3620   0.0598   0.6380     2.01 R
55    8.8           1.3000   2.2780   0.1156   -0.9780    -3.24 R
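The St Resid column can be rebuilt from the Residual and SE Fit columns. The sketch below assumes the usual definitions: leverage h = (SE Fit / S)² and standardized residual = residual / (S × sqrt(1 - h)), with S = 0.32300 from the model summary. The function names are illustrative.

```python
import math

S = 0.32300  # residual standard deviation of the fitted gas model

def leverage_from_se_fit(se_fit, s=S):
    """Leverage implied by the standard error of the fitted value."""
    return (se_fit / s) ** 2

def standardized_residual(residual, se_fit, s=S):
    """Residual divided by its own estimated standard deviation."""
    h = leverage_from_se_fit(se_fit, s)
    return residual / (s * math.sqrt(1.0 - h))

# (obs, SE Fit, Residual) as listed in the unusual-observations table.
rows = [(1, 0.1521, 0.0316), (2, 0.1501, -0.2291),
        (36, 0.0667, -0.6623), (55, 0.1156, -0.9780)]

std_resids = {obs: standardized_residual(res, se) for obs, se, res in rows}
for obs, value in std_resids.items():
    print(f"Obs {obs}: standardized residual = {value:.2f}")
```

Cases 1 and 2 have the largest SE Fit, hence the largest leverage (about 0.22), which is why they are flagged X despite small residuals; case 55 is flagged R because its residual is more than three estimated standard deviations from zero.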

Compare Models: Gas Consumption

Previous Analysis
The regression equation is
Gas = 6.85 - 0.393 Temperature - 2.13 Insulated + 0.115 Ins X Temp

Predictor     Coef      SE Coef   T        P
Constant      6.8538    0.1360    50.41    0.000
Temperature   -0.3932   0.02249   -17.49   0.000
Insulated     -2.1300   0.1801    -11.83   0.000
Ins X Temp    0.11530   0.03211   3.59     0.001

S = 0.32300   R-Sq = 92.8%   R-Sq(adj) = 92.4%

New Analysis (case 55 excluded)
The regression equation is
Gas = 6.85 - 0.393 Temperature - 2.20 Insulated + 0.140 Ins X Temp

Predictor     Coef      SE Coef   T        P
Constant      6.8538    0.1226    55.89    0.000
Temperature   -0.3932   0.02028   -19.39   0.000
Insulated     -2.2019   0.1637    -13.45   0.000
Ins X Temp    0.13981   0.02975   4.70     0.000

S = 0.291320   R-Sq = 93.6%   R-Sq(adj) = 93.2%
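A simple way to judge how much one case matters is to express each coefficient's change in units of its original standard error. The sketch below does that for the two analyses above; `shift_in_se` is an illustrative name and the scaling choice is just one convention.

```python
# (coefficient, standard error) from the previous analysis (all cases).
previous = {
    "Constant": (6.8538, 0.1360),
    "Temperature": (-0.3932, 0.02249),
    "Insulated": (-2.1300, 0.1801),
    "Ins X Temp": (0.11530, 0.03211),
}
# Coefficients from the new analysis (case 55 excluded).
new = {
    "Constant": 6.8538,
    "Temperature": -0.3932,
    "Insulated": -2.2019,
    "Ins X Temp": 0.13981,
}

def shift_in_se(old, old_se, updated):
    """Change in a coefficient measured in units of its original SE."""
    return (updated - old) / old_se

shifts = {name: shift_in_se(b, se, new[name]) for name, (b, se) in previous.items()}
for name, value in shifts.items():
    print(f"{name}: shift = {value:+.2f} SE")
```

The Constant and Temperature terms do not move at all, consistent with case 55 being an insulated case that affects only the insulated line, while the interaction coefficient shifts by roughly three quarters of a standard error.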

Compare diagnostics for Gas

Previous Diagnostics
Obs   Temperature   Gas      Fit      SE Fit   Residual   St Resid
1     -0.8          7.2000   7.1684   0.1521   0.0316     0.11 X
2     -0.7          6.9000   7.1291   0.1501   -0.2291    -0.80 X
36    3.1           3.2000   3.8623   0.0667   -0.6623    -2.10 R
46    4.9           4.0000   3.3620   0.0598   0.6380     2.01 R
55    8.8           1.3000   2.2780   0.1156   -0.9780    -3.24 R

New Diagnostics
Obs   Temperature   Gas      Fit      SE Fit   Residual   St Resid
1     -0.8          7.2000   7.1684   0.1372   0.0316     0.12 X
8     3.9           4.7000   5.3202   0.0643   -0.6202    -2.18 R
9     4.2           5.8000   5.2022   0.0617   0.5978     2.10 R
36    3.1           3.2000   3.8662   0.0602   -0.6662    -2.34 R
46    4.9           4.0000   3.4101   0.0556   0.5899     2.06 R
56    9.7           1.5000   2.1936   0.1291   -0.6936    -2.66 R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

Comparing diagnostics for Gas

New Diagnostics
Obs   Temperature   Gas      Fit      SE Fit   Residual   St Resid
1     -0.8          7.2000   7.1684   0.1372   0.0316     0.12 X
8     3.9           4.7000   5.3202   0.0643   -0.6202    -2.18 R
9     4.2           5.8000   5.2022   0.0617   0.5978     2.10 R
36    3.1           3.2000   3.8662   0.0602   -0.6662    -2.34 R
46    4.9           4.0000   3.4101   0.0556   0.5899     2.06 R
56    9.7           1.5000   2.1936   0.1291   -0.6936    -2.66 R

Normal Distribution
Diagnostics: probability plot
Normal distribution
  - Provides a scale for large outliers
  - Relatively unimportant for T-ratios, p-values

Diagnostics and Analysis
  - Residuals/leverage point to cases worth careful examination.
  - Plot the data. Use labels to identify cases in different plots. Ask questions.
  - Reduction in SS as more variables are added.
  - VIF: difficulties with correlation in the x-vars can lead to T-ratios for coeffs providing a very unreliable guide.
  - Proceed carefully.
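With exactly two predictors the variance inflation factor reduces to VIF = 1 / (1 - r²), where r is the correlation between them. The sketch below uses made-up data to show the calculation; with more predictors, r² is replaced by the R² from regressing one predictor on all the others.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when there are exactly two of them."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical predictor values, chosen only to give a correlation of 0.8.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]

r = pearson(x1, x2)
vif = vif_two_predictors(x1, x2)
print(f"r = {r:.2f}, VIF = {vif:.2f}")
```

A VIF near 1 means the predictor is nearly uncorrelated with the others; larger values mean the coefficient's variance is inflated by that factor, so its T-ratio becomes a less reliable guide.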

Interpreting Coefficients
What do coefficients mean?
Recall
  - Scale: log
  - Units, dimensions
  - SEs: 95% confidence intervals

Are coefficients important?
Any? Some? Omitted variables?
How to judge?
  - Science
  - SE
  - ANOVA
Are (all) data to be relied on?
Does part of the data dominate some conclusions?