Introduction to Regression
Using Multiple Linear Regression
  - Derived variables
  - Many alternative models
  - Which model to choose?
  - Model criticism
  - Modelling objective
  - Model details
  - Data and residuals
  - Assumptions
Data Like This
  - Values of coefficients
  - Sampling distributions
  - Standard errors
  - 95% confidence intervals
  - 95% prediction intervals
  - ANOVA etc
Derived variables
  - General: logs; proportions and ratios
  - Too many (derived) variables: redundancy; many versions of the same model
  - Indicator variables: categorical data
  - Time series applications: indicator variables (e.g. seasonal effects); lagged variables; differences; logs and rate of return
Gas Consumption vs Temp
  Weekly gas consumption (in 1000 cubic feet) and the average outside temperature (in degrees Celsius) at one house in south-east England for two "heating seasons", one of 26 weeks before, and one of 30 weeks after cavity-wall insulation was installed. The object of the exercise was to assess the effect of the insulation on gas consumption. The house thermostat was set at 20 C throughout.
  [Figure: fitted line plots of Gas against Temperature (0 to 10) for each period, plus a comparative plot]
  Period 1: Gas = 6.854 - 0.3932 Temperature; S 0.281334, R-Sq 94.4%, R-Sq(adj) 94.1%
  Period 2: Gas = 4.724 - 0.2779 Temperature; S 0.354848, R-Sq 81.3%, R-Sq(adj) 80.6%
Objective
  Nominal focus on prediction
  - Predict gas consumption in future for this house
  - Knowing temp and whether or not insulated
  Actual interest
  - Does insulation make a difference?
  - At all temps? How much?
  - Slope? Intercept? SEs?
  Data Like This
Using an Indicator variable
  Two parallel data sets
  Insulated  Week  Temperature  Gas    Insulated  Week  Temperature  Gas
  0          1     -0.8         7.2    1          27    -0.7         4.8
  0          2     -0.7         6.9    1          28     0.8         4.6
  0          3      0.4         6.4    1          29     1.0         4.7
  0          4      2.5         6.0    1          30     1.4         4.0
  etc                                  etc
  One stacked data set
  Week  Insulation  Temperature  Gas
  22    0            7.6         3.5
  23    0            8.0         4.0
  24    0            8.5         3.6
  25    0            9.1         3.1
  26    0           10.2         2.6
  27    1           -0.7         4.8
  28    1            0.8         4.6
  29    1            1.0         4.7
  etc
Simple Regression & Indicator Variable
  [Figure: fitted line plots of Gas and of Temperature against Insulated (0/1)]
  Gas vs Insulated: Gas = 4.750 - 1.267 Insulated; S 0.987577, R-Sq 29.8%, R-Sq(adj) 28.5%
  - Insulated = 0: Avg Gas = 4.750
  - Insulated = 1: Avg Gas = 3.483
  - Diff = -1.267
  Temp vs Insulated: Temperature = 5.350 - 0.8867 Insulated; S 2.73812, R-Sq 2.6%, R-Sq(adj) 0.8%
  Coeff: unit increase; random error; design implications
SLR with indicator var & T-test
  [Figure: fitted line plot, Gas = 4.750 - 1.267 Insulated; S 0.987577, R-Sq 29.8%, R-Sq(adj) 28.5%]
  Two-sample T for Gas
  Insulated   N   Mean   StDev   SE Mean
  0          26   4.75   1.16    0.23
  1          30   3.48   0.806   0.15
  Difference = μ (0) - μ (1)
  T-Value = 4.79  P-Value = 0.000  DF = 54
  Using Pooled StDev = 0.9876
  Regression Analysis: Gas versus Insulated
  S         R-sq    R-sq(adj)  R-sq(pred)
  0.987577  29.79%  28.49%     24.5%
  Coefficients
  Term       Coef    SE Coef  T-Value  P-Value
  Constant    4.750  0.194    24.5     0.000
  Insulated  -1.267  0.265    -4.79    0.000
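The agreement between the two outputs on this slide can be checked by hand. The sketch below is only an illustration, not the Minitab computation itself: it takes the group means as 4.750 and 3.483 (so that the difference is the printed 1.267), the group sizes as 26 and 30 (implied by DF = 54), and the pooled standard deviation 0.9876 from the output. The function name is hypothetical.

```python
import math

def pooled_two_sample_t(mean0, mean1, n0, n1, pooled_sd):
    """Return (SE of the difference, t statistic) for a pooled two-sample t-test."""
    # Standard error of the difference in means, using the pooled SD.
    se_diff = pooled_sd * math.sqrt(1.0 / n0 + 1.0 / n1)
    # t statistic for the difference mu(0) - mu(1).
    t_value = (mean0 - mean1) / se_diff
    return se_diff, t_value

# Values read from the slide (assumed as described above).
se_diff, t_value = pooled_two_sample_t(4.750, 3.483, 26, 30, 0.9876)
print(f"SE of difference = {se_diff:.3f}, T = {t_value:.2f}")
```

The SE of the difference matches the SE Coef of the Insulated term, and the T-value matches the regression T-value apart from its sign, which is the point of the slide: a regression on a single 0/1 indicator is the pooled two-sample t-test.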
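Indicator Variables in Regression
  Response variable: $Y$ = Gas
  Predictors: $x_1$ = Temp, $x_2$ = Insulated (0/1), a binary indicator variable
  Statistical model: $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon; \quad \varepsilon \sim N(0, \sigma^2)$
  When $x_2 = 0$: $Y = \beta_0 + \beta_1 x_1 + \varepsilon$
  When $x_2 = 1$: $Y = (\beta_0 + \beta_2) + \beta_1 x_1 + \varepsilon$
  Common slope $\beta_1$; difference between intercepts $\beta_2$; no interaction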
Multiple Regression Output
  Regression Analysis: Gas versus Temperature, Insulated
  The regression equation is
  Gas = 6.55 - 0.337 Temperature - 1.57 Insulated
  Predictor     Coef     SE Coef
  Constant       6.5513  0.1181
  Temperature   -0.3367  0.0177
  Insulated     -1.5652  0.0970
  $\hat{\beta}_2 = -1.565$, $SE(\hat{\beta}_2) = 0.097$
  Rough 95% CI: -1.57 ± 2(0.097) = (-1.76, -1.37)
  Prev: Mean Diff -1.27 ± 2(0.274)
  Parallel lines
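The rough interval on this slide is estimate ± 2 standard errors. A minimal sketch of that arithmetic, using the Insulated coefficient -1.5652 and its SE 0.0970 from the output (the helper name is hypothetical, and the multiplier 2 is the slide's rule of thumb rather than an exact t quantile):

```python
def rough_interval(estimate, se, k=2.0):
    """Rough interval: estimate plus or minus k standard errors."""
    return estimate - k * se, estimate + k * se

# Insulated coefficient and its standard error from the regression output.
lo, hi = rough_interval(-1.5652, 0.0970)
print(f"Rough 95% CI for the Insulated effect: ({lo:.2f}, {hi:.2f})")
```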
Implementation: Categorical Variable
Regression Output: Categorical Var
  Regression Analysis: Gas versus Temperature, Insulated
  Categorical predictor coding (1, 0)
  Model Summary
  S         R-sq
  0.357412  90.97%
  Coefficients
  Term          Coef     SE Coef  T-Value  P-Value
  Constant       6.551   0.118     55.48   0.000
  Temperature   -0.337   0.0178   -18.95   0.000
  Insulated
    1           -1.5652  0.0971   -16.13   0.000
  Regression Equation
  Insulated
  0   Gas = 6.551 - 0.337 Temperature
  1   Gas = 4.986 - 0.337 Temperature
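The two regression equations are the same fitted model evaluated at Insulated = 0 and Insulated = 1. The sketch below shows this with hypothetical function names. The Temperature slope is taken as -0.337, the value implied by the printed T-value (-18.95) and SE (0.0178); the other coefficients are as printed.

```python
CONSTANT = 6.551      # intercept for the uninsulated house
TEMP_SLOPE = -0.337   # common slope (assumed, see note above)
INSULATED = -1.5652   # shift in intercept when Insulated = 1

def predict_gas(temperature, insulated):
    """Fitted gas consumption for a given temperature and 0/1 insulation status."""
    return CONSTANT + TEMP_SLOPE * temperature + INSULATED * insulated

# Intercepts of the two parallel lines (temperature = 0).
intercept_before = predict_gas(0.0, 0)
intercept_after = predict_gas(0.0, 1)
print(f"Insulated 0: Gas = {intercept_before:.3f} {TEMP_SLOPE:+.3f} Temperature")
print(f"Insulated 1: Gas = {intercept_after:.3f} {TEMP_SLOPE:+.3f} Temperature")
```

Because the slope is shared, the gap between the two lines is the Insulated coefficient at every temperature.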
Aside: Omitted predictors
  Hidden/lurking variables
  Subset of data (used in exam)
  - Uninformed by insulation status: slope positive. On avg, gas consumption increases with temp!
  - Knowing insulation status: slopes negative. On avg, gas consumption decreases with temp.
Interaction?
  Refine the question: different slopes as well?
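Indicator Variables in Regression
  Response variable: $Y$ = Gas
  Predictors: $x_1$ = Temp, $x_2$ = Insulated (0/1), $x_3 = \text{Temp} \times x_2$
  Combined statistical model: $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon; \quad \varepsilon \sim N(0, \sigma^2)$
  When $x_2 = 0$: $Y = \beta_0 + \beta_1 x_1 + \varepsilon$
  When $x_2 = 1$: $Y = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1 + \varepsilon$
  $\beta_2$: diff in intercepts; $\beta_3$: diff in slopes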
New Derived Variable
Modelling two regression lines
  Regression Analysis: Gas versus Temperature, Insulated, Ins X Temp
  Gas = 6.85 - 0.393 Temperature - 2.13 Insulated + 0.115 Ins X Temp
  Predictor     Coef      SE Coef
  Constant       6.8538   0.1360
  Temperature   -0.39324  0.02249
  Insulated     -2.1300   0.1801
  Ins X Temp     0.11530  0.03211
  S = 0.323004  R-Sq = 92.8%  R-Sq(adj) = 92.4%
  Which coeff is most fundamental to the theory of heat loss?
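The interaction model encodes two separate lines: the Insulated coefficient is the difference in intercepts and the Ins X Temp coefficient is the difference in slopes. The sketch below recovers both lines from the four coefficients. The values 6.854, -0.393, -2.130 and 0.115 are assumptions, chosen to agree with the two fitted lines shown earlier for Period 1 and Period 2; the function name is hypothetical.

```python
def lines_from_interaction(b0, b1, b2, b3):
    """Return ((intercept, slope) for x2 = 0, (intercept, slope) for x2 = 1)."""
    uninsulated = (b0, b1)
    insulated = (b0 + b2, b1 + b3)  # shift both the intercept and the slope
    return uninsulated, insulated

# Assumed coefficient values (see the note above).
before, after = lines_from_interaction(6.854, -0.393, -2.130, 0.115)
print(f"Insulated 0: Gas = {before[0]:.3f} {before[1]:+.3f} Temperature")
print(f"Insulated 1: Gas = {after[0]:.3f} {after[1]:+.3f} Temperature")
```

The second line reproduces the Period 2 fit (intercept 4.724, slope about -0.278), which is why the one-model and two-model formulations give the same coefficient estimates.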
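Alt Models of two regression lines
  Nearly equivalent:
  a) One model, with interaction
  b) Two separate models: two separate linear regressions of Gas vs Temp
  Exercise: compare coefficient estimates and 95% intervals
  Response variable: $Y$ = Gas
  Predictors: $x_1$ = Temp, $x_2$ = Insulated (0/1)
  Two statistical models
  $x_2 = 0$: $Y = \beta_{0,NoIns} + \beta_{1,NoIns} x_1 + \varepsilon; \quad \varepsilon \sim N(0, \sigma^2_{NoIns})$
  $x_2 = 1$: $Y = \beta_{0,Ins} + \beta_{1,Ins} x_1 + \varepsilon; \quad \varepsilon \sim N(0, \sigma^2_{Ins})$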
Multiple indicator variables
  Will also meet:
  - Redundancy
  - Multiple formulations of same model
Housing Completions, quarterly, 1978 to 2000
  Quarter  1978  1979  1980  1981  1982  1983  1984  1985
  Q1       5777  7276  58    6642  5981  4859  5129  4947
  Q2       4772  4510  6001  4710  488   5862  4671  5188
  Q3       4579  4278  5879  5570  554   466   4947  90
  Q4       424   4274  68    614   4894  4564  195   60
  Quarter  1986  1987  1988  1989  1990  1991  1992  1993
  Q1       5186  4144  682   554   4296  4692  4155  684
  Q2       719   6     298   985   4477  898   560   4487
  Q3       45    491   747   5277  5011  4600  5919  5121
  Q4       726   478   477   4484  4752  5282  505   6009
  Quarter  1994  1995  1996  1997   1998   1999   2000
  Q1       4291  5770  6582  744    8010   990    1002
  Q2       5266  6149  720   8799   9506   10227  11590
  Q3       6871  6806  764   9140   1010   10788  11892
  Q4       7160  7879  871   10081  11474  12079  1287
Figure 1.0 Housing Completions, quarterly, 1978 to 2000
  [Figure: Time Series Plot of Completions, points marked by quarter (Q1, Q2, Q3, Q4)]
  Take objective: forecast one quarter ahead
Aside: Cubic/Quadratic Regression
  Fitted Line Plot options: Log, Quadratic, Cubic
  [Figure: fitted line plot of Comps against time, 1980 to 2000, with regression line and 95% PI]
  Comps = - 1.44E+10 + 217840 time - 10988 time**2 + 1.848 time**3
  S 822.624, R-Sq 88.3%, R-Sq(adj) 87.9%
Modelling Options
  1. Focus on stable linear structure post 1993
  - Assume this structure will continue
  - Exploit structure: extension of indicator vars
  - Disadvantage: smaller data set
  2. One model for entire data set
  - Note: structure has changed; might change again
  - Exploit weaker structure
  - Use lagged variables
  - Advantage: use all data
Comps, quarterly, 1993 to 2000
  Option 1: work since 1993
  Target is 2001 Q1
  - Use Q1 data only? OR
  - Use all 1993-2000 data? 4 parallel lines: more efficient. Why? In what sense?
  [Figure: Time Series Plot of Completions, 1993 to 1999, points marked by quarter (Q1, Q2, Q3, Q4)]
Completions Q1 only
  [Figure: fitted line plot of Q1 Completions against year, 1993 to 2000]
  Completions = - 1945191 + 977.8 year
  S 316.477, R-Sq 98.5%, R-Sq(adj) 98.3%
  Other Qs: 4 separate lines
  Later, use Time since 1978: changes intercept only
  Pred = -1945191 + 977.8 (2001.00) ± 2(316.5) = (9795, 11061)
Linear in Time plus Quarterly Ind Vars
  Create set of binary variables Q1, Q2, Q3, Q4
  $\text{Comps} = \alpha_1 Q_1 + \alpha_2 Q_2 + \alpha_3 Q_3 + \alpha_4 Q_4 + \beta\,\text{Time} + \varepsilon$
  Year.Quarter  time     Time since 1978  Comps  Q1  Q2  Q3  Q4
  1993 Q1       1993     15.00            684    1   0   0   0
  1993 Q2       1993.25  15.25            4487   0   1   0   0
  1993 Q3       1993.5   15.50            5089   0   0   1   0
  1993 Q4       1993.75  15.75            6041   0   0   0   1
  1994 Q1       1994     16.00            4291   1   0   0   0
  1994 Q2       1994.25  16.25            5266   0   1   0   0
  1994 Q3       1994.5   16.50            685    0   0   1   0
  1994 Q4       1994.75  16.75            7196   0   0   0   1
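The derived columns in this table can be built mechanically from the year and quarter. A minimal sketch, with hypothetical function and key names, assuming quarters are numbered 1 to 4 and time is measured in years with Q1 at the start of the year:

```python
def quarter_row(year, quarter, base_year=1978):
    """Build the derived time and indicator variables for one quarter."""
    time = year + (quarter - 1) / 4          # e.g. 1993 Q2 -> 1993.25
    indicators = [1 if q == quarter else 0 for q in (1, 2, 3, 4)]
    return {
        "time": time,
        "time_since_base": time - base_year,  # "Time since 1978"
        "Q": indicators,                      # [Q1, Q2, Q3, Q4]
    }

rows = [quarter_row(y, q) for y in (1993, 1994) for q in (1, 2, 3, 4)]
for row in rows:
    print(row["time"], row["time_since_base"], row["Q"])
```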
Multiple Indicator Vars: Tech Issue
  Regression Analysis: Comps versus Time since 1978, Q1, Q2, Q3, Q4
  * Q4 is highly correlated with other X variables
  * Q4 has been removed from the equation.
  The regression equation is
  Comps = - 9452 + 986 Time since 1978 - 1792 Q1 - 1139 Q2 - 758 Q3
  $Y = \alpha_0 + \alpha_1 Q_1 + \alpha_2 Q_2 + \alpha_3 Q_3 + \alpha_4 Q_4 + \beta t + \varepsilon$
  Interpretation of $\alpha_0$: $t = 0$ and all $Q_i = 0$
  Redundancy
  Alternatives:
  - $\alpha_0 = 0$: no constant, use indicator variables only
  - Equivalently, enter "Quarter" as a categorical variable
Multiple Indicator Vars: Tech Issue
  Regression Analysis: Comps versus Time since 1978, Q1, Q2, Q3, Q4
  * Q4 is highly correlated with other X variables
  * Q4 has been removed from the equation.
  Comps = - 9452 + 986 Time since 1978 - 1792 Q1 - 1139 Q2 - 758 Q3
  S = 297.82
  OR
  Regression Analysis: Comps versus Time since 1978, Q1, Q2, Q3, Q4
  No constant option
  Comps = 986 Time since 1978 - 11244 Q1 - 10592 Q2 - 10210 Q3 - 9452 Q4
  S = 297.82
  Note: -11244 = -9452 - 1792; -9452 = -9452 + 0; etc
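The note on this slide says the two outputs are the same model written two ways: each no-constant coefficient is the constant plus that quarter's effect, with the dropped quarter (Q4) having effect 0. A small sketch of the conversion, with a hypothetical function name. The Q2 effect is taken as -1139, an assumption consistent with the no-constant Q2 coefficient up to rounding of the printed values.

```python
def no_constant_intercepts(constant, effects):
    """Convert 'constant + effect' coding into one intercept per quarter."""
    return {quarter: constant + effect for quarter, effect in effects.items()}

# Effects relative to Q4, the quarter removed from the equation (effect 0).
effects = {"Q1": -1792, "Q2": -1139, "Q3": -758, "Q4": 0}
intercepts = no_constant_intercepts(-9452, effects)
for quarter, value in intercepts.items():
    print(quarter, value)
```

Q1, Q3 and Q4 reproduce the no-constant coefficients exactly; Q2 comes out at -10591 rather than -10592, a difference of one unit that is attributable to rounding in the printed equation.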
Categorical Variable approach
  Model Summary
  S       R-sq
  297.82  98.76%
  Regression Equations
  Quarter
  Q1   Comps = -11244 + 986.5 t
  Q2   Comps = -10592 + 986.5 t
  Q3   Comps = -10210 + 986.5 t
  Q4   Comps = -9452 + 986.5 t
  Coefficients
  Term             Coef    SE Coef
  Constant         -11244  47
  time since 1978   986.5  22.9
  Quarter
    Q2               653   149
    Q3              1034   149
    Q4              1792   150
  Consider Q2 - Q1 at t = 0
Derived variables and Transforms in Time Series
  - Lags
  - Differences
  - Rates of return
  - Log scale
All Comps, quarterly, 1978 to 2000
  Option 2: use all data, but different model
  [Figure: Time Series Plot of Completions, 1978 to 1999, points marked by quarter (Q1, Q2, Q3, Q4)]
Auto-Regression for Time Series
  Basic idea: next value like last value (Lag1)
  [Figure: fitted line plot of Comps against Lag1Comp]
  Comps = 564.6 + 0.9171 Lag1Comp
  S 1167.61, R-Sq 76.1%, R-Sq(adj) 75.8%
Auto-Regression for Time Series
  Basic idea: next value like last value (Lag1)
  Auto Regression
  $Y_t = \beta_0 + \beta_{lag1} Y_{t-1} + \varepsilon_t$
  $Y_t = \beta_0 + \beta_{lag1} Y_{t-1} + \beta_{lag4} Y_{t-4} + \varepsilon_t$
  Year.Quarter  Comps  Lag1Comp  Lag4Comp
  1978 Q1       5777
  1978 Q2       4772   5777
  1978 Q3       4588   4772
  1978 Q4       424    4588
  1979 Q1       7276   424       5777
  1979 Q2       451    7276      4772
  1979 Q3       4284   451       4588
  1979 Q4       4257   4284      424
  1980 Q1       778    4257      7276
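Lag1Comp and Lag4Comp are just the Comps column shifted down by one and four rows. The sketch below shows the shift on a made-up series (the numbers are purely illustrative and are not the completions data); the function name is hypothetical, and missing leading values are represented by None.

```python
def lagged(series, k):
    """Return the series shifted down by k positions, padded with None."""
    if k < 0 or k > len(series):
        raise ValueError("lag must be between 0 and the series length")
    return [None] * k + list(series[:len(series) - k])

# Illustrative series only, standing in for a quarterly Comps column.
comps = [10, 20, 30, 40, 50, 60]
lag1 = lagged(comps, 1)
lag4 = lagged(comps, 4)

# Rows usable in a regression on both lags: those with no missing values.
usable = [(y, l1, l4) for y, l1, l4 in zip(comps, lag1, lag4) if l4 is not None]
print(lag1)
print(lag4)
print(usable)
```

The first four rows are lost when Lag 4 is used, which is the small price paid for the seasonal term.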
Using two lagged variables
  Regression Analysis: Comps versus Lag1Comp, Lag4Comp
  The regression equation is
  Comps = - 87 + 0.28 Lag1Comp + 0.782 Lag4Comp
  S = 780.7
  Comp Q4 2000 = 1287, Comp Q1 2000 = 1002
  95% Pred Int Comp Q1 2001 = 11892 ± 2(780.7) = (10330, 13453)
Using Lagged Variables
  Basic idea: current quarter is like
  - the previous quarter
  - the same quarter last year
  [Figure: Matrix Plot of Completions, Lag1Comp, Lag4Comp]
Comparison
  [Figure: Forecasting models, 1994 to 2001: Comps; Linear in Time, quarter indicators; Lag1 and Lag 4]
  Modelling Options
  1. Parallel linear regressions: $Y_t = \alpha_1 Q_1 + \alpha_2 Q_2 + \alpha_3 Q_3 + \alpha_4 Q_4 + \beta t + \varepsilon_t$
  2. Seasonal autoregression: $Y_t = \beta_0 + \beta_1 Y_{t-1} + \beta_4 Y_{t-4} + \varepsilon_t$
  - More efficient for prediction
  - Fewer modelling assumptions
  - Different modelling strategy
  Year  t      Quarter  Comps  Lag 1  Lag 4  Q1  Q2  Q3  Lin in time + Q inds  Lag 1 and lag 4
  2000  22     Q1       1002   12079  990    1   0   0   10451                 1140.17
  2000  22.25  Q2       11590  1002   10227  0   1   0   11347.5               10989.57
  2000  22.5   Q3       11892  11590  10788  0   0   1   11945                 11850.74
  2000  22.75  Q4       1287   11892  12079  0   0   0   12979.5               12959.5
  2001  23     Q1              1287   1002   1   0   0   11437                 11891.51
  2001  23.25  Q2              ?      11590  0   1   0   12333.5               ?
  2001  23.5   Q3              ?      11892  0   0   1   12931
  2001  23.75  Q4              ?      1287   0   0   0   13965.5
  2002  24     Q1              ?      ?      1   0   0   12423
Model Criticism
  Criticism
  - Does it make sense?
  - Are there outliers?
  Choice amongst alternatives
  - $R^2$
  - SE
Extra: Logs, lags and differences
  Financial data: IBM share price
  - Natural language: %age change
  - MINITAB language: logs
Financial Series - IBM Prices daily
  Simple Reg on Time
Log IBM Prices
  [Figure: two fitted line plots of Logprice, each with regression line and 95% PI]
  Log(Y_t) vs t: Logprice = 1.64 + 0.000561 t; S 0.094, R-Sq 94.4%, R-Sq(adj) 94.4%
  Log(Y_t) vs log(Y_{t-1}): Logprice = 0.002264 + 0.9990 lag1logprice; S 0.0080199, R-Sq 99.8%, R-Sq(adj) 99.8%
Modeled in Log Scale, presented in original units
  [Figure: the same two fitted line plots with price on a log-scaled axis, each with regression line and 95% PI]
  Log(Y_t) vs t: log10(price) = 1.64 + 0.000561 t; S 0.094, R-Sq 94.4%, R-Sq(adj) 94.4%
  Log(Y_t) vs log(Y_{t-1}): log10(price) = 0.002264 + 0.9990 log10(lag1price); S 0.0080199, R-Sq 99.8%, R-Sq(adj) 99.8%
Differences / Ratios
  First differences: Today - Yesterday
  Seasonal diffs: This Q - same Q last year
  Ratio: Y(t) / Y(t-1)
  Rate of return: 100 × (Y(t) - Y(t-1)) / Y(t-1) = 100 × (Ratio - 1)
  Log(Ratio) = Log(Y(t)) - Log(Y(t-1))
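These derived quantities are all simple functions of two consecutive values. A minimal sketch with hypothetical names and illustrative prices (100 yesterday, 102 today), using base-10 logs as in the IBM slides:

```python
import math

def changes(previous, current):
    """First difference, ratio, rate of return (%) and log10 ratio for two values."""
    difference = current - previous
    ratio = current / previous
    rate_of_return = 100 * (ratio - 1)   # same as 100 * difference / previous
    log_ratio = math.log10(ratio)        # same as log10(current) - log10(previous)
    return difference, ratio, rate_of_return, log_ratio

# Illustrative prices only.
difference, ratio, rate_of_return, log_ratio = changes(100.0, 102.0)
print(difference, ratio, rate_of_return, log_ratio)
```

For small changes the log ratio is roughly proportional to the rate of return, which is why a first difference of logs is a convenient stand-in for a percentage change.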
Financial Series - IBM Prices daily
  Simple Regression of Daily Diffs vs Time
  [Figure: fitted line plot of Lag1diff against t, 0 to 1000, with regression line and 95% PI]
  Lag1diff = 0.01424 + 0.000109 t
  S 0.951260, R-Sq 0.1%, R-Sq(adj) 0.0%
Financial Series - IBM Prices daily
  Simple Regression of First Diffs of LogPrice vs Time
  [Figure: fitted line plot of Lag1difflog against t, 0 to 1000, with regression line and 95% PI]
  Lag1difflog = 0.000568 + 0.000000 t
  S 0.0080216, R-Sq 0.0%, R-Sq(adj) 0.0%
Financial Series - IBM Prices daily
  [Figure: fitted line plot of Lag1difflog against t, 0 to 1000, with regression line and 95% PI]
  Lag1difflog = 0.000568 + 0.000000 t; S 0.0080216, R-Sq 0.0%, R-Sq(adj) 0.0%
  Interpretation
  $\log P_t - \log P_{t-1} = 0.00057 + 0 \times \text{time}$
  $\log(P_t / P_{t-1}) = 0.00057$, or in $(0.00057 - 0.016,\ 0.00057 + 0.016)$, i.e. in $(-0.0154,\ 0.0166)$
  $P_t / P_{t-1} = 10^{0.00057}$, or in $(10^{-0.0154},\ 10^{0.0166})$, i.e. $1.001$, or in $(0.96,\ 1.04)$
  In summary: rate of return 0.1% per day ± 4%
Financial Series
  Day to day changes most naturally expressed as % change
  price tomorrow = price today × small change
  Log(price t+1) = Log(price t) + Log(small change)
  Average drift per day (for logs) is 0.00057, i.e. about 0.1% growth per day = 61% per annum
Financial Series
  Confidence in future prediction
               pt est   lo      hi
  log10        0.0006   -0.015  0.0166
  10^ Factor   1.001    0.9652  1.0390
  Eg initial capital 1000
  Day   pt est   lo    hi
  1     1001.3   965   1039
  2     1002.6   932   1079
  3     1003.9   899   1122
  4     1005.3   868   1165
  5     1006.6   838   1211
  364   1612.4   0.0   infinity
  365   1614.5   0.0   infinity
  61% per annum??
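The table compounds a daily factor of 10^(log10 step) over successive days. The sketch below reproduces that arithmetic under stated assumptions: a point estimate of 0.00057 per day on the log10 scale, limits of -0.0154 and 0.0166 (the interval quoted on the earlier interpretation slide), and a 365-day year. The function name is hypothetical.

```python
def capital_path(initial, log10_step, days):
    """Capital after each number of days when log10(price) moves by log10_step daily."""
    return [initial * 10 ** (log10_step * d) for d in days]

days = [1, 2, 3, 4, 5, 364, 365]
est = capital_path(1000, 0.00057, days)   # point estimate of the daily drift
lo = capital_path(1000, -0.0154, days)    # lower limit applied every day
hi = capital_path(1000, 0.0166, days)     # upper limit applied every day

for d, e, l, h in zip(days, est, lo, hi):
    print(f"{d:>3}  {e:10.1f}  {l:12.1f}  {h:14.1f}")

# Implied annual growth from the point estimate alone.
annual_growth_pct = 100 * (10 ** (0.00057 * 365) - 1)
print(f"Annual growth: {annual_growth_pct:.0f}%")
```

Applying the daily limits to every day of a year drives the lower bound to essentially zero and the upper bound to an astronomically large number, which is why the table shows 0.0 and infinity and why the 61% figure carries question marks.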
Derived Variables
  Why use derived variables?
  - Adding extra variables gives more options
  Challenge
  - Is there a cost?
  - Which is best?
  Scientific insight can give powerful & simple analysis