Q3 2009
Before/After Transformation
Construction and Role of t-ratios

Each t-ratio is t_k = β̂_k / SE(β̂_k).

Formally, even under the null hypothesis H0: β_k = 0, the statistic t_k, being computed from y values that themselves contain random error, will sometimes be large in magnitude for reasons of random chance only.

In (conceptual) replications, differing from the current data by chance alone, the probability of obtaining, by chance, a t-ratio larger in magnitude than the quoted value is the quoted P-value.
Role of t-ratios

t_k = β̂_k / SE(β̂_k)

Informally, if t_k is not large (|t_k| < 2 in magnitude; p > 0.05), then the coefficient of x_k can be given the value zero; equivalently, x_k can be dropped from the model with little appreciable impact.

Caution: this applies one variable at a time.
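The t-ratio on this slide can be sketched for the one-predictor case. This is a minimal illustration with invented data and our own function names, not part of the course material:

```python
# Hypothetical sketch: the t-ratio of the slope in a simple
# least-squares fit y = a + b*x + e.  Data and names are invented.
import math

def slope_t_ratio(x, y):
    """Return (slope, SE of slope, t-ratio) for y = a + b*x + e."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(r ** 2 for r in resid) / (n - 2)   # residual variance
    se = math.sqrt(s2 / sxx)                    # SE of the slope
    return b, se, b / se

b, se, t = slope_t_ratio([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.9, 5.1])
print(round(t, 1))  # 20.4 -- far above 2, so x clearly cannot be dropped
```

Here |t| is far above 2, so by the slide's informal rule x could not be dropped from the model.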
Residuals

Regression Analysis: Taste versus Lactic Acid, LAcetic, LH2S

Taste = - 28.9 + 19.7 Lactic Acid + 0.33 LAcetic + 3.91 LH2S

Predictor      Coef     SE Coef      T      P
Constant     -28.87       19.74  -1.46  0.156
Lactic Acid  19.670       8.629   2.28  0.031
LAcetic       0.327       4.461   0.07  0.942
LH2S          3.912       1.249   3.13  0.004

S = 10.1307   R-Sq = 65.2%   R-Sq(adj) = 61.2%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       3  4994.5  1664.8  16.22  0.000
Residual Error  26  2668.4   102.6
Total           29  7662.9

Source       DF  Seq SS
Lactic Acid   1  3800.4
LAcetic      1   186.5
LH2S         1  1007.6

Unusual Observations
      Lactic
Obs     Acid  Taste    Fit  SE Fit  Residual  St Resid
 15     1.52  54.90  29.45    3.04     25.45     2.63R
Residuals

Unusual Observations
      Lactic
Obs     Acid  Taste    Fit  SE Fit  Residual  St Resid
 15     1.52  54.90  29.45    3.04     25.45     2.63R

In fact this case is barely outlying, despite the standardized residual of 2.63. Recall: one residual has to be the largest!
Options with Large Residuals

Examine carefully: why is it outlying? Is there anything special about this case/observation?
Refit without it: does its removal change anything important?
If you delete it, then formally your conclusions are based on "something like this never happening in future". Is this a meaningful statement?
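The "refit without it" option above can be sketched numerically. This is a toy example with invented data, not the cheese-tasting data from the slides:

```python
# Sketch of "refit without the outlier": drop one case, refit the simple
# regression slope, and ask whether anything important changes.
def slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 2.1, 2.9, 4.2, 5.0, 9.5]   # the last case looks outlying

b_all = slope(x, y)
b_drop = slope(x[:-1], y[:-1])        # refit without the last case
print(round(b_all, 2), round(b_drop, 2))  # 1.5 1.01 -- slope changes a lot
```

Here removal changes the slope substantially, so the case matters and the decision to delete it needs the careful justification the slide describes.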
Residuals, Standardized Residuals, Deleted t-Residuals

[Figure: Normal scores plots.]
Transformations

Classic Linear Model:

Y_i = β_0 + β_1 x_1i + β_2 x_2i + ... + β_p x_pi + ε_i,  where ε_i ~ N(0, σ²) (Var) or N(0, σ) (SD)

Statistical theory assumes Normally distributed residuals/'errors'; it makes NO assumptions regarding the distribution of Y, X_1, ... (technical). It makes an assumption of additivity (crucial). But non-additivity (especially multiplicative models) and non-Normality often occur together.
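The additive model on this slide can be simulated in a few lines. The parameter names b0, b1, sigma are our own illustrative choices:

```python
# Sketch of the additive model: Y_i = b0 + b1*x_i + e_i, e_i ~ N(0, sigma).
# Note the assumption is on the errors e_i, NOT on the distribution of Y,
# which here inherits a trend from x.
import random
import statistics

random.seed(1)
b0, b1, sigma = 2.0, 0.5, 1.0
x = [i / 10 for i in range(200)]
e = [random.gauss(0, sigma) for _ in x]           # Normal errors
y = [b0 + b1 * xi + ei for xi, ei in zip(x, e)]   # additive response

print(round(statistics.mean(e), 2))    # near 0
print(round(statistics.stdev(e), 2))   # near sigma
```

The simulated errors have mean near 0 and spread near sigma, while y itself is far from Normal because of the trend in x, matching the slide's point that Normality is assumed of the errors, not of Y.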
Normality of data or errors/resids?
Extreme example
Random Variation: Additive?

[Figure: Exponential decay; random variation decreases with time.]
[Figure: Exponential decay on log scale; random variation constant in time.]
Artificially Created Data

Additive model used to create the data, then exponentiation of line and data:

 t  line  error  Y = line + e  Resid (subtraction)  Rescaled Exp line  Exp y
 1   7.9  -0.26          7.64                -0.26              0.579  0.315
 2   7.8  -0.12          7.68                -0.12              0.460  0.350
 3   7.7   0.27          7.97                 0.27              0.365  0.687
 4   7.6  -0.50          7.10                -0.50              0.290  0.091
 5   7.5  -0.67          6.83                -0.67              0.231  0.050
 6   7.4  -0.16          7.24                -0.16              0.183  0.127

Data created multiplicatively exhibit neither linearity nor Normality, in either the data or the residuals. A log transform solves both issues.
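The slide's construction can be re-created in miniature. The error scale and seed below are our own choices, not the slides' exact numbers; the point is that dividing out the curve (equivalently, subtracting after a log transform) recovers the original additive errors exactly:

```python
# Re-creation of the slide's construction: build data additively on the
# log scale, exponentiate, then check that a log transform of the
# ratio exp(y)/exp(line) recovers the original additive errors.
import math
import random

random.seed(2)
line = [7.9 - 0.1 * t for t in range(6)]        # linear trend, as on the slide
err  = [random.gauss(0, 0.3) for _ in line]     # additive Normal errors
y    = [li + ei for li, ei in zip(line, err)]   # additive model

exp_line = [math.exp(li) for li in line]
exp_y    = [math.exp(yi) for yi in y]           # multiplicative data

# Dividing out the curve, then logging, returns exactly the errors:
recovered = [math.log(ey / el) for ey, el in zip(exp_y, exp_line)]
print(all(abs(r - e) < 1e-9 for r, e in zip(recovered, err)))  # True
```

This is why subtraction residuals on the exponentiated scale mislead, while division residuals (or residuals after logging) reflect how the data were created.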
Artificially Created Data

(Table of the additive model, as on the previous slide.)

[Chart: Linear decay with random variation, constant in time by construction.]
Artificially Created Data

(Table of the additive model and its exponentiation, as on the previous slides.)

[Figure: Exponential decay; random variation decreases with time.]
[Figure: Exponential decay on log scale; random variation now seems constant in time.]
Artificially Created Data: distributions of observed Y under both models

Additive model to create the data; exponentiation of line and data:

 t  line  Y = line + e  Exp line  Exp y
 1   0.9          2.64      2.46  14.05
 2   0.8          0.51      2.23   1.66
 3   0.7          0.65      2.01   1.92
 4   0.6          0.96      1.82   2.61
 5   0.5          0.76      1.65   2.14
 6   0.4          0.24      1.49   1.28
 7   0.3         -1.09      1.35   0.34
 8   0.2         -0.26      1.22   0.77
 9   0.1          0.73      1.11   2.07
Residuals for Artificially Created Data (rescaled)

Exponentiation of line and data:

 Exp line  Exp y  Resid (subtraction)  Resid (division)
    0.281  0.412                 0.13              1.47
    0.223  0.125                -0.10              0.56
    0.177  0.146                -0.03              0.82
    0.141  0.121                -0.02              0.86
    0.112  0.091                -0.02              0.81
    0.089  0.052                -0.04              0.58
    0.071  0.073                 0.00              1.04

The subtraction residuals here are erroneous: they do not reflect how the data were created.
Artificially Created Data

Exponentiation of line and data:

 Exp line  Exp y  Resid (subtraction)  Resid (division)
     2.46  14.05                11.59              5.71
     2.23   1.66                -0.56              0.75
     2.01   1.92                -0.10              0.95
     1.82   2.61                 0.78              1.43
     1.65   2.14                 0.49              1.30
     1.49   1.28                -0.21              0.86
     1.35   0.34                -1.01              0.25
     1.22   0.77                -0.45              0.63
     1.11   2.07                 0.97              1.87
     1.00   1.75                 0.75              1.75

The division residuals now reflect how the data were created, but they are not Normal.
Artificially Created Data

(Table of exponentiated line, data, and residuals, as on the previous slide.)

The division residuals now reflect how the data were created, but they are not Normal. However, they are Normal after a log transformation.
Plotting Multiple Regression Fits

gpm = - 0.402 + 1.47 wt + 0.0240 hp/wt

 a = -0.402   b1 = 1.47   b2 = 0.0240   Given hp/wt = 93.838

  hp/wt     wt   line
 41.985  2.620  5.702
 38.261  2.875  6.076
 40.086  2.320  5.261
 34.215  3.215  6.576
 50.872  3.440  6.907
 30.347  3.460  6.936
 68.627  3.570  7.098
 19.436  3.190  6.539
 30.159  3.150  6.481
 35.756  3.440  6.907
 35.756  3.440  6.907
 44.226  4.070  7.833
 48.257  3.730  7.333
 47.619  3.780  7.407
 39.048  5.250  9.568
 39.639  5.424  9.823
 43.031  5.345  9.707
 30.000  2.200  5.084
 32.198  1.615  4.224
 35.422  1.835  4.548
 39.351  2.465  5.474
 42.614  3.520  7.025
 43.668  3.435  6.900
 63.802  3.840  7.495
 45.514  3.845  7.502

[Excel chart: fitted line vs wt when hp/wt = 41.98; wt axis from 0.0 to 6.0.]

Excel plotting.
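The slide's "line" column can be reproduced directly from the fitted equation by holding hp/wt at the fixed value shown under "Given hp/wt" (93.838 on the slide, although the chart title quotes 41.98). A small sketch, with our own function name:

```python
# Reconstructing the slide's 'line' column: the fitted gpm at each wt,
# with hp/wt held at the fixed value shown on the slide.
a, b1, b2 = -0.402, 1.47, 0.0240
hp_wt_fixed = 93.838

def fitted_gpm(wt, hp_wt=hp_wt_fixed):
    """Fitted gpm = a + b1*wt + b2*hp/wt, with hp/wt held fixed."""
    return a + b1 * wt + b2 * hp_wt

# First row of the slide's table: wt = 2.620 should give line = 5.702
print(round(fitted_gpm(2.620), 3))  # 5.702
```

This is the standard way to plot a multiple-regression fit against one predictor: fix the other predictors at chosen values and trace the fitted response as the plotted predictor varies.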
Causal Networks / Network Models

Very active area of advanced research. (An exam question will not ask for an undiscovered network.)

Direct/indirect links; no primary response variable.
Directed arrows: regression has no useful role.
Undirected arrows: regression has some role?

Now we can consider regressing Y = wt on the Xs hp, hp/wt, gpm.

[Diagram: network with nodes hp, hp/wt, wt, gpm.]
Undirected Network Models: indicative research methodology

Regression Analysis: wt versus hp, hp/wt, gpm
wt = 2.34 + 0.0159 hp - 0.0530 hp/wt + 0.175 gpm
T-ratios:        7.55      -8.88       2.99
(Deleted stuff...)

Source  DF   Seq SS
hp       1  12.8791
hp/wt    1  14.6780
gpm      1   0.5122

Note the order: gpm adds relatively little to the prediction of wt when hp and hp/wt are already in the model. This is relatively weak evidence in favour of the gpm-wt link, so we can drop that link, but propose a new link (NB: changing the response variable).

[Diagrams: tentative network model; proposed new network model, with nodes hp, hp/wt, wt, gpm.]
Undirected Network Models

Direct/indirect links. Key: no primary response variable.

Network order: recall that the regression fit is completely insensitive to the order of the predictors. Both of the following make the same predictions. Regression is not the natural tool here, but it is relevant.

wt = 2.34 + 0.0159 hp - 0.0530 hp/wt + 0.175 gpm
wt = 2.34 - 0.0530 hp/wt + 0.175 gpm + 0.0159 hp

[Diagram: network with nodes hp, hp/wt, wt, gpm.]
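The order-insensitivity claim above can be checked directly: fit the same least-squares model with the predictor columns in two different orders and compare the fitted values. Toy data and all names below are our own illustration:

```python
# Checking the slide's claim that the least-squares fit is insensitive
# to the order of the predictors.
def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for A x = b."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols_fit(cols, y):
    """Least squares with intercept via normal equations; returns fitted values."""
    X = [[1.0] + [c[i] for c in cols] for i in range(len(y))]
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    return [sum(b * xi for b, xi in zip(beta, row)) for row in X]

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y  = [1.2, 2.1, 2.8, 4.2, 4.9, 6.1]

fit_a = ols_fit([x1, x2], y)   # predictors in one order
fit_b = ols_fit([x2, x1], y)   # ... and reversed
print(all(abs(p - q) < 1e-9 for p, q in zip(fit_a, fit_b)))  # True
```

The fitted values agree to rounding error, which is exactly why the two equations on the slide make the same predictions; contrast this with the sequential sums of squares on the previous slide, which DO depend on order.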
Not on the exam in 2010:
ANOVA and F-tables
t-tables
PIs and CIs