Introduction to Regression
  Using Multiple Linear Regression
    Derived variables
    Many alternative models
    Which model to choose?
  Model Criticism
    Modelling objective
    Model details
    Data and residuals
    Assumptions
Introduction to Regression: Criticism
  Objective?
  1. Informal
       Analyse/explore the data
       Find patterns/exceptions
  2. Use for prediction
       Precision
  3. Use for testing/forming aspects of theory
       Understand how the variables are interrelated
       Understand coefficients
       What variables are important?
1 Informal Objectives
  Analyse/explore the data; find patterns/exceptions
  Single response variable clear?
    No:  Use scatterplot matrix; smoothers, correlation
         Generally: use multivariate analysis
    Yes: Also use regression
         What predictors are most important? Interpret coefficients
         Add new x vars; lurking vars? Too many vars?
         Check residuals: exceptions? patterns?
Math Marks: 88 Students
  Stat  Anal  Alg  Vect  Mech
   81    67    67   82    77
   81    70    80   78    63
   81    66    71   73    75
   68    70    63   72    55
   63    70    65   63    63
   73    64    72   61    53
   68    65    65   67    51
   56    62    68   70    59
  [Figure: Matrix Plot of Stat, Anal, Alg, Vect, Mech]
  Descriptive Statistics: Stat, Anal, Alg, Vect, Mech
    Variable   Mean    StDev
    Stat       42.31   17.26
    Anal       46.68   14.85
    Alg        50.60   10.62
    Vect       50.59   13.15
    Mech       38.95   17.49
  Correlation: Stat, Anal, Alg, Vect, Mech
            Stat    Anal    Alg     Vect
    Anal    0.607
    Alg     0.665   0.711
    Vect    0.436   0.485   0.610
    Mech    0.389   0.409   0.547   0.553
    Cell Contents: Pearson correlation
2 Prediction
  Best?
    Seek large $R^2$; small S
    Simple model
      Small number of predictors; natural scale
      Interpretable coefficients
    No important exceptions
    No pattern in residuals (later)
  Prediction Intervals
    Great care with extrapolation
Standard Errors: Data Like This
  Regression Analysis: LogVol versus LogHt, LogDiam
  The regression equation is
    LogVol = - 2.88 + 1.12 LogHt + 1.98 LogDiam
    Predictor   Coef      SE Coef   T       P
    Constant    -2.8801   0.3473    -8.29   0.000
    LogHt       1.1171    0.20      5.6     0.000
    LogDiam     1.98265   0.07501   26.43   0.000
    S = 0.0353   R-Sq = 97.8%
  95% of Analyses of Data Like This
    LogHt coeff in $1.12 \pm 2(0.22)$, ie about 1
    LogDiam coeff in $1.98 \pm 2(0.075)$, ie about 2
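The "estimate plus or minus two standard errors" rule on this slide can be checked with a few lines of Python. This is a minimal sketch using the rounded values quoted on the slide; the function name `rough_interval` is illustrative and not part of any Minitab output.

```python
# Rough 95% interval for a regression coefficient: estimate +/- 2 standard errors.
# Inputs are the rounded values quoted on the slide (assumption: 2 is used in
# place of the exact t quantile, as the slide does).

def rough_interval(coef, se, k=2.0):
    """Return (lower, upper) = (coef - k*se, coef + k*se)."""
    return coef - k * se, coef + k * se

log_ht = rough_interval(1.12, 0.22)
log_diam = rough_interval(1.98, 0.075)

print("LogHt   coeff interval: (%.2f, %.2f)" % log_ht)
print("LogDiam coeff interval: (%.2f, %.2f)" % log_diam)

# Do the intervals cover the simple values 1 and 2?
print("LogHt interval covers 1:  ", log_ht[0] <= 1 <= log_ht[1])
print("LogDiam interval covers 2:", log_diam[0] <= 2 <= log_diam[1])
```

Both intervals cover the simple values, which is what motivates trying a model in Ht Diam² with both exponents fixed.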
Simple is Good
  The regression equation is
    $\log \mathrm{Vol} = -2.699 + 1.005 \log(\mathrm{Ht} \cdot \mathrm{Diam}^2) \pm 2(0.035)$
  ie
    $\log \mathrm{Vol} = -2.7 + \log(\mathrm{Ht} \cdot \mathrm{Diam}^2) \pm 2(0.035)$
    $\mathrm{Vol} = 10^{-2.7} \, (\mathrm{Ht} \cdot \mathrm{Diam}^2) \cdot 10^{\pm 2(0.035)}$
  Meaning: Vol in interval based on (Ht Diam²)
  Roughly
    $10^{+2(0.035)} = 1.17$, ie +17%
    $10^{-2(0.035)} = 0.85$, ie -15%
  95% of Tree Vols = 0.0020 (Ht Diam²) to within about 16%
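The back-transformation from the log scale to percentage limits can be reproduced directly. The sketch below assumes base-10 logs (as the powers of 10 on the slide imply) and uses the rounded intercept and S from the slide; `predict_vol` is a hypothetical helper name.

```python
# Back-transform a "+/- 2 S" band on the log10 scale into multiplicative limits.
# Assumptions: base-10 logs, rounded slide values for intercept and S.

intercept = -2.7   # rounded intercept on the log10 scale
s = 0.035          # residual standard deviation on the log10 scale

const = 10 ** intercept      # multiplicative constant on the original scale
upper = 10 ** (+2 * s)       # factor for the upper limit
lower = 10 ** (-2 * s)       # factor for the lower limit

print("constant       = %.4f" % const)
print("upper factor   = %.2f  (%+.0f%%)" % (upper, 100 * (upper - 1)))
print("lower factor   = %.2f  (%+.0f%%)" % (lower, 100 * (lower - 1)))

def predict_vol(ht, diam):
    """Point prediction with rough 95% limits: (low, fit, high)."""
    fit = const * ht * diam ** 2
    return lower * fit, fit, upper * fit
```

Because the band is symmetric on the log scale, it is asymmetric in percentage terms (+17%, -15%), which the slide summarises as "about 16%".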
Simple is Good
  Predict Vol from Ht AND Diam
    S = 3.88      R-Sq = 94.80%
  logvol from loght, logdiam
    S = 0.0353    R-Sq = 97.87%
  logvol from log(Ht Diam²)
    S = 0.0355    R-Sq = 97.7%
Residuals - Trees
  Patterns? Exceptions?
  Vol vs Ht, Diam in linear scale
    Unusual Observations
    Obs   Vol     Fit     Resid   Std Resid
    31    77.00   68.52   8.48    2.9 R
    R  Large residual
  Vol vs Ht Diam² in log scale
    Unusual Observations
    Obs   logvol   Fit      Resid     Std Resid
    15    1.2810   1.3539   -0.0728   -2.12 R
    R  Large residual
3 Prediction for Theory
  Means of comparing/testing coeffs (evolving theory)
    Compared to what?
    Controlling for external variation?
  Coefficients
    Interpretation
    One-at-a-time: SEs, t-ratios, Conf Intervals
    Many-at-once: Analysis of variance
  Entire model
    $R^2$
    Confidence bands
Gas: Modelled with Interaction
  Regression Analysis: Gas versus Temperature, Insulated, Ins X Temp
  Model Summary
    S         R-sq     R-sq(adj)   R-sq(pred)
    0.32300   92.77%   92.35%      91.%
  Coefficients
    Term          Coef      SE Coef   T-Value   P-Value   VIF
    Constant      6.85      0.136     50.41     0.000
    Temperature   -0.3932   0.0225    -17.49    0.000     2.02
    Insulated     -2.130    0.180     -11.83    0.000     .33
    Ins X Temp    0.1153    0.0321    3.59      0.001     .70
  S = 0.32300   R-Sq = 92.8%
Stat Significance
  Simplest for MTB
    Coeff value is 0: Var plays no role
    Coeff ≠ 0: Var plays some role
  Statistically Sig
    95% interval does not include 0
    Data Like This: Var seems to play some role in pred
  Formal Logic
    $T = \dfrac{\hat\beta - \beta_{Hyp}}{SE(\hat\beta)} = \dfrac{-0.3932 - 0}{0.0225} \approx -17.5$
    T stat inconsistent with
    Explanation: $\beta_{Hyp} = 0$ and chance
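The formal logic on this slide amounts to one division and one interval check. A minimal sketch, using the Temperature coefficient and standard error from the Gas output; the function name `t_ratio` is illustrative, and 2 stands in for the exact t quantile.

```python
# t-ratio for testing a hypothesised coefficient value, plus the matching
# rough 95% interval check. Values are the rounded Temperature row of the
# Gas regression output.

def t_ratio(estimate, se, hypothesised=0.0):
    """(estimate - hypothesised value) / standard error."""
    return (estimate - hypothesised) / se

coef, se = -0.3932, 0.0225
t = t_ratio(coef, se)
lower, upper = coef - 2 * se, coef + 2 * se

print("T = %.2f" % t)
print("rough 95%% interval: (%.4f, %.4f)" % (lower, upper))
print("interval includes 0:", lower <= 0 <= upper)
```

An interval that excludes 0 and a T far outside roughly ±2 are two views of the same statement: the data are hard to explain by a zero coefficient plus chance.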
Stat Significance
  Statistically Sig
    95% interval does not include 0
  Alt: Data Like This
    If in fact Null Hyp true, then obs T is large, in the sense that
    $\Pr(|T| > 17.03)$ is very small when Null Hyp true
  Contrast: Scientifically Sig
    Var plays important role in prediction; in theory
What are Significant implications?
    Term          Coef      SE Coef   T-Value   P-Value   VIF
    Constant      6.85      0.136     50.41     0.000
    Temperature   -0.3932   0.0225    -17.49    0.000     2.02
    Insulated     -2.130    0.180     -11.83    0.000     .33
    Ins X Temp    0.1153    0.0321    3.59      0.001     .70
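One way to read the implications of these coefficients is to write out the two fitted lines. The sketch below assumes, since the slides do not say so explicitly, that Insulated is a 0/1 indicator (1 = after insulation) and that Ins X Temp is the product Insulated × Temperature; `predict_gas` is a hypothetical helper.

```python
# Fitted lines implied by an interaction model:
#   Gas = b0 + b1*Temp + b2*Ins + b3*(Ins*Temp)
# Assumption: Ins is a 0/1 indicator, 1 meaning "after insulation".

COEF = {"const": 6.85, "temp": -0.3932, "ins": -2.130, "ins_x_temp": 0.1153}

def predict_gas(temperature, insulated):
    """Fitted gas consumption for a given temperature and insulation status."""
    return (COEF["const"]
            + COEF["temp"] * temperature
            + COEF["ins"] * insulated
            + COEF["ins_x_temp"] * insulated * temperature)

# Intercept and slope of each line
intercept_before, slope_before = COEF["const"], COEF["temp"]
intercept_after = COEF["const"] + COEF["ins"]
slope_after = COEF["temp"] + COEF["ins_x_temp"]

print("before: Gas = %.2f %+.4f Temp" % (intercept_before, slope_before))
print("after:  Gas = %.2f %+.4f Temp" % (intercept_after, slope_after))

# Estimated saving from insulation depends on temperature
for temp in (0, 5, 10):
    saving = predict_gas(temp, 0) - predict_gas(temp, 1)
    print("Temp %2d: saving = %.2f" % (temp, saving))
```

Under these assumptions the interaction term means the insulated line is both lower and flatter, so the estimated saving shrinks as temperature rises.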
Residuals
  Fits and Diagnostics for Unusual Observations
    Obs   Gas     Fit     Resid    Std Resid
     1    7.200   7.168    0.032    0.11  X
     2    6.900   7.129   -0.229   -0.80  X
    36    3.200   3.862   -0.662   -2.10  R
    46    4.000   3.362    0.638    2.01  R
    55    1.300   2.278   -0.978   -3.24  R
    R  Large residual
    X  Unusual X
Model Criticism
  Variables are columns; cases are rows
  Which variables are important? In what way?
  Which cases are important? In what way?
    Outliers, residuals, influence
  Errors
    Normal?
    Independent?
Diagnostics
  The regression equation is
    Volume = - 1.7 - 0.09 Diameter + 0.030 Height + 0.39 Ht*Diam^2
  Unusual Observations
    Obs   Diameter   Volume   Fit      SE Fit   Residual   St Resid
    18    13.3       27.400   32.33    1.006    -4.93      -2.08 R
    31    20.6       77.000   78.288   2.030    -1.288     -0.81 X
    R denotes an observation with a large standardized residual.
    X denotes an observation whose X value gives it large leverage.
  [Figure: Residual Plots for Vol. Panels: Normal Probability Plot (Percent vs Residual); Versus Fits (Residual vs Fitted Value); Histogram (Frequency vs Residual); Versus Order (Residual vs Observation Order)]
Outliers
  Unusual Observations
    Obs   Diameter   Volume   Fit      SE Fit   Residual   St Resid
    18    13.3       27.400   32.33    1.006    -4.93      -2.08 R
    31    20.6       77.000   78.288   2.030    -1.288     -0.81 X
    R denotes an observation with a large standardized residual.
    X denotes an observation whose X value gives it large leverage.
  [Figure: Scatterplot of SRES1 vs FITS1, points labelled by observation number]
  [Figure: Fitted Line Plot, Vol = - 0.0000 + 1.000 FITS1, points labelled by observation number]
Leverage
  x values far from the centre have high leverage.
  More difficult to see when there are many x variables.
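For a single x variable, "far from the centre" can be made concrete with the standard leverage formula for simple linear regression with an intercept, h_i = 1/n + (x_i - x̄)² / Σ(x_j - x̄)². The sketch below uses made-up x values purely for illustration.

```python
# Leverage in simple linear regression (one x variable plus an intercept):
#   h_i = 1/n + (x_i - mean)^2 / sum((x_j - mean)^2)
# The x values are illustrative, not taken from the lecture data.

def leverages(x):
    n = len(x)
    mean = sum(x) / n
    sxx = sum((xi - mean) ** 2 for xi in x)
    return [1.0 / n + (xi - mean) ** 2 / sxx for xi in x]

x = [1, 2, 3, 4, 5, 20]   # the last value lies far from the centre
h = leverages(x)

for xi, hi in zip(x, h):
    print("x = %2d   leverage = %.3f" % (xi, hi))

# The leverages always sum to the number of fitted parameters (2 here).
print("sum of leverages = %.3f" % sum(h))
```

With several x variables the same idea holds, but an unusual combination of values can have high leverage without being extreme on any single variable, which is why it is harder to see in plots.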
Outlying and Influential cases
  Outlying cases
    Unusual y values, for these x-values
  Influential cases
    Unusual combination of x-values
  Perhaps erroneous
    Omit and re-analyse?
    Certainly worth double-checking
Gas Consumption
  The regression equation is
    Gas = 6.85 - 0.393 Temperature - 2.13 Insulated + 0.115 Ins X Temp
  Unusual Observations
    Obs   Temperature   Gas      Fit      SE Fit   Residual   St Resid
     1    -0.8          7.2000   7.1684   0.1521    0.0316     0.11 X
     2    -0.7          6.9000   7.1291   0.1501   -0.2291    -0.80 X
    36     3.1          3.2000   3.8623   0.0667   -0.6623    -2.10 R
    46     4.9          4.0000   3.3620   0.0598    0.6380     2.01 R
    55     8.8          1.3000   2.2780   0.1156   -0.9780    -3.24 R
    R denotes an observation with a large standardized residual.
    X denotes an observation whose X value gives it large leverage.
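The St Resid column can be recovered from the other columns, which also shows how leverage enters. This sketch relies on two standard identities, SE Fit = S·sqrt(h) and St Resid = Residual / (S·sqrt(1 - h)), with S = 0.32300 taken from the model summary; the function names are illustrative.

```python
import math

# Standardized residual from Residual, SE Fit and S.
# Identities used: SE Fit = S * sqrt(h)  and  St Resid = e / (S * sqrt(1 - h)).

S = 0.32300   # residual standard deviation from the Gas model summary

def leverage(se_fit, s=S):
    """Leverage h implied by the standard error of the fitted value."""
    return (se_fit / s) ** 2

def standardized_residual(resid, se_fit, s=S):
    h = leverage(se_fit, s)
    return resid / (s * math.sqrt(1.0 - h))

# Two rows of the table: observation 36 (flagged R) and observation 2 (flagged X)
st_36 = standardized_residual(-0.6623, 0.0667)
st_2 = standardized_residual(-0.2291, 0.1501)

print("obs 36: leverage = %.3f, St Resid = %.2f" % (leverage(0.0667), st_36))
print("obs  2: leverage = %.3f, St Resid = %.2f" % (leverage(0.1501), st_2))
```

Observation 2 has a modest residual but a leverage about five times that of observation 36, which is consistent with it being flagged X rather than R.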
Diagnostics for Gas
    Obs   Temperature   Gas      Fit      SE Fit   Residual   St Resid
     1    -0.8          7.2000   7.1684   0.1521    0.0316     0.11 X
     2    -0.7          6.9000   7.1291   0.1501   -0.2291    -0.80 X
    36     3.1          3.2000   3.8623   0.0667   -0.6623    -2.10 R
    46     4.9          4.0000   3.3620   0.0598    0.6380     2.01 R
    55     8.8          1.3000   2.2780   0.1156   -0.9780    -3.24 R
Compare Models: Gas Consumption
  Previous Analysis
    The regression equation is
      Gas = 6.85 - 0.393 Temperature - 2.13 Insulated + 0.115 Ins X Temp
    Predictor     Coef      SE Coef   T        P
    Constant      6.8538    0.1360    50.41    0.000
    Temperature   -0.3932   0.02249   -17.49   0.000
    Insulated     -2.1300   0.1801    -11.83   0.000
    Ins X Temp    0.11530   0.03211   3.59     0.001
    S = 0.32300   R-Sq = 92.8%   R-Sq(adj) = 92.4%
  New Analysis (Case 55 excluded)
    The regression equation is
      Gas = 6.85 - 0.393 Temperature - 2.20 Insulated + 0.140 Ins X Temp
    Predictor     Coef      SE Coef   T        P
    Constant      6.8538    0.1226    55.89    0.000
    Temperature   -0.3932   0.02028   -19.39   0.000
    Insulated     -2.2019   0.1637    -13.45   0.000
    Ins X Temp    0.13981   0.02975   4.70     0.000
    S = 0.291320   R-Sq = 93.6%   R-Sq(adj) = 93.2%
Compare diagnostics for Gas
  Previous Diagnostics
    Obs   Temperature   Gas      Fit      SE Fit   Residual   St Resid
     1    -0.8          7.2000   7.1684   0.1521    0.0316     0.11 X
     2    -0.7          6.9000   7.1291   0.1501   -0.2291    -0.80 X
    36     3.1          3.2000   3.8623   0.0667   -0.6623    -2.10 R
    46     4.9          4.0000   3.3620   0.0598    0.6380     2.01 R
    55     8.8          1.3000   2.2780   0.1156   -0.9780    -3.24 R
  New Diagnostics
    Obs   Temperature   Gas      Fit      SE Fit   Residual   St Resid
     1    -0.8          7.2000   7.1684   0.1372    0.0316     0.12 X
     8     3.9          4.7000   5.3202   0.063    -0.6202    -2.18 R
     9     4.2          5.8000   5.2022   0.0617    0.5978     2.10 R
    36     3.1          3.2000   3.8662   0.0602   -0.6662    -2.34 R
    46     4.9          4.0000   3.4101   0.0556    0.5899     2.06 R
    56     9.7          1.5000   2.1936   0.1291   -0.6936    -2.66 R
    R denotes an observation with a large standardized residual.
    X denotes an observation whose X value gives it large leverage.
Comparing diagnostics for Gas
  New Diagnostics
    Obs   Temperature   Gas      Fit      SE Fit   Residual   St Resid
     1    -0.8          7.2000   7.1684   0.1372    0.0316     0.12 X
     8     3.9          4.7000   5.3202   0.063    -0.6202    -2.18 R
     9     4.2          5.8000   5.2022   0.0617    0.5978     2.10 R
    36     3.1          3.2000   3.8662   0.0602   -0.6662    -2.34 R
    46     4.9          4.0000   3.4101   0.0556    0.5899     2.06 R
    56     9.7          1.5000   2.1936   0.1291   -0.6936    -2.66 R
Normal Distribution
  Diagnostics
    Prob Plot
  Normal Distribution
    Provides a scale for large outliers
    Relatively unimportant for T-ratios, p-values
Diagnostics and Analysis
  Residuals/leverage point to cases worth careful examination.
    Plot the data. Use labels to identify cases in different plots.
    Ask questions.
  Reduction in SS as more variables are added can lead to difficulties with correlation in x-vars (VIF).
    T-ratios for coeffs provide a very unreliable guide.
    Proceed carefully.
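The link between correlated x variables and VIF can be illustrated for the two-predictor case, where the general definition VIF_j = 1 / (1 - R_j²) reduces to 1 / (1 - r²), with r the correlation between the two predictors. The data below are made up for illustration only.

```python
import math

# Variance inflation factor for a model with exactly two predictors:
#   VIF = 1 / (1 - r^2), r = Pearson correlation between the predictors.
# With more predictors, r^2 is replaced by the R-sq from regressing one
# predictor on all the others. The data are illustrative.

def pearson_r(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def vif_two_predictors(a, b):
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r * r)

# Nearly collinear predictors: the VIF is very large
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
print("correlated predictors:   VIF = %.1f" % vif_two_predictors(x1, x2))

# Uncorrelated predictors: the VIF is 1
u1 = [-1, 0, 1, -1, 0, 1]
u2 = [1, -2, 1, 1, -2, 1]
print("uncorrelated predictors: VIF = %.1f" % vif_two_predictors(u1, u2))
```

A large VIF means the coefficient's standard error is inflated relative to the uncorrelated case, so its t-ratio on its own says little about whether the variable matters.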
Interpreting Coefficients
  What do coefficients mean?
    Recall
      Scale: log
      Units: dimensions
  SEs
    95% Conf Ints
Are coefficients important?
  Any? Some? Omitted variables?
  How to judge?
    Science
    SE
    ANOVA
  Are (all) data to be relied on?
    Does part of the data dominate some conclusions?