Multiple linear regression
Course MF9130: Introduction to statistics, June
Tron Anders Moger
Department of Biostatistics, IMB, University of Oslo

Aims for this lecture: Continue where we left off.
1. Repeat the most important things from last lecture
2. Learn tests for checking whether the slope of the regression line is different from zero
3. Look at what happens if more variables are included in the model
4. Learn how to handle binary independent variables and categorical independent variables
Example: (Scatter plot of birth weight in grams against mother's weight in pounds.)

Repetition: Simple linear regression
We define a model
  Y_i = β0 + β1*x_i + ε_i
where Y_i is the dependent variable, x_i is the independent variable, and the ε_i are independent, normally distributed errors with equal variance σ².
Wish to fit a line as close to the observed data (two normally distributed variables) as possible.
Example: Birth weight = β0 + β1*mother's weight
The estimate for β0 is called a, the estimate for β1 is called b.
Least squares regression
(Scatter plot of birth weight against mother's weight with fitted line; R Sq Linear = 0.035.)
Find the best fitting line by minimizing the squared distance from each data point to the line, summed over all data points.
Let (x1, y1), (x2, y2), ..., (xn, yn) denote the points in the plane. Find a and b so that the line y = a + bx fits the points by minimizing
  SSE = (a + bx1 - y1)² + (a + bx2 - y2)² + ... + (a + bxn - yn)² = Σ (a + bxi - yi)²
Solution:
  b = [n Σ xiyi - (Σ xi)(Σ yi)] / [n Σ xi² - (Σ xi)²] = [Σ xiyi - n x̄ ȳ] / [Σ xi² - n x̄²]
  a = ȳ - b x̄
where x̄ = Σ xi / n, ȳ = Σ yi / n, and all sums are over i = 1, ..., n.
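As an aside (not from the slides): a minimal Python sketch, with made-up toy numbers, showing that the formulas above reproduce what a standard least squares routine finds:

```python
import numpy as np

# Toy data (hypothetical): mother's weight in pounds, birth weight in grams
x = np.array([105.0, 120.0, 130.0, 150.0, 170.0, 200.0])
y = np.array([2600.0, 2900.0, 3100.0, 3000.0, 3400.0, 3600.0])

n = len(x)
# Slope: b = (sum(x*y) - n*xbar*ybar) / (sum(x^2) - n*xbar^2)
b = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
# Intercept: a = ybar - b*xbar
a = y.mean() - b * x.mean()
print(f"a = {a:.2f}, b = {b:.3f}")

# Cross-check against numpy's built-in least squares fit of a degree-1 polynomial
bb, aa = np.polyfit(x, y, 1)
assert np.allclose([a, b], [aa, bb])
```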
How close are the data to the fitted line?
Predicted value = any point on the regression line: ŷi = a + bxi
  SST = Σ (yi - ȳ)²   (total sum of squares)
  SSE = Σ (yi - ŷi)²  (sum of squared residuals, εi = yi - ŷi)
  SSR = Σ (ŷi - ȳ)²   (sum of squares explained by the regression)
R², the proportion of the total variance in the yi's in the data explained by the regression line, is given by SSR/SST.
Also remember: the residuals (distances from the data points to the regression line) have to be normally distributed!
Plots for checking this are easily obtained from SPSS:
Histograms
Q-Q plots (which SPSS calls P-P plots in regression)
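Again as a sketch with the same hypothetical toy data: computing SST, SSE, SSR and R², plus a rough numerical stand-in for the Q-Q plot check of the residuals:

```python
import numpy as np
from scipy import stats

# Toy data (hypothetical), as in the previous sketch
x = np.array([105.0, 120.0, 130.0, 150.0, 170.0, 200.0])
y = np.array([2600.0, 2900.0, 3100.0, 3000.0, 3400.0, 3600.0])

b, a = np.polyfit(x, y, 1)           # fitted line y = a + b*x
y_hat = a + b * x                    # predicted values on the regression line
residuals = y - y_hat

SST = np.sum((y - y.mean())**2)      # total variation
SSE = np.sum(residuals**2)           # unexplained variation
SSR = np.sum((y_hat - y.mean())**2)  # variation explained by the line
print(f"R^2 = {SSR / SST:.3f}")      # equals 1 - SSE/SST

# Normality check of the residuals: with real data one would inspect a
# histogram and a Q-Q plot; scipy's probplot returns the Q-Q coordinates
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print(f"Q-Q plot correlation: {r:.3f}")  # close to 1 suggests normality
```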
Example: Regression of birth weight with mother's weight as independent variable

Model Summary (Dependent Variable: birthweight; Predictors: (Constant), weight in pounds):
  R = .186 (Pearson's r)   R Square = .035   Adjusted R Square = .029
  Std. Error of the Estimate = 718.24270 (estimate for σ)

ANOVA:
               Sum of Squares    df    Mean Square       F      Sig.
  Regression      3448881.3       1     3448881.3      6.686    .010    (SSR)
  Residual       96468171       187      515872.574                     (SSE)
  Total          99917053       188                                     (SST)
Sig. is the p-value for the test of whether there is a significant relationship between the variables in the model. The null hypothesis is no relationship.

Coefficients (p-values, confidence intervals etc. for the β's):
                         B       Std. Error    Beta       t      Sig.   95% CI for B
  (Constant)         2369.672     228.431              10.374    .000   (1919.040, 2820.304)
  weight in pounds      4.429       1.713      .186     2.586    .010   (1.050, 7.809)
The estimate for β0 is the constant, 2369.672; the estimate for β1 is 4.429.

But how to answer questions like: Given that a positive slope (b) has been estimated: Does it give a reproducible indication that there is a positive trend, or is it a result of random variation? What is a confidence interval for the estimated slope?
Confidence intervals for simple regression
In a simple regression model:
  a estimates β0
  b estimates β1
  σ̂² = SSE/(n-2) estimates σ² (the variance of the errors)
The estimated standard deviation of b is
  Sb = σ̂ / √((n-1) sx²)
where sx² is the sample variance of the x's. Also,
  (b - β1)/Sb ~ t(n-2)
So a confidence interval for β1 is given by b ± t(n-2, α/2) * Sb.

Hypothesis testing for simple regression
Choose hypotheses: H0: β1 = 0 vs H1: β1 ≠ 0
Test statistic: b/Sb ~ t(n-2) under H0
Reject H0 if b/Sb < -t(n-2, α/2) or b/Sb > t(n-2, α/2)
For the example: Test H0: β(mother's weight) = 0 at the 5% significance level.
Get 4.429/1.713 = 2.586. Look up the 2.5 and 97.5 percentiles of the t-distribution with 187 degrees of freedom (can use the normal distribution).
Find p-value < 0.05, reject H0.
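A hedged Python sketch of the test and confidence interval above (toy data again, not the birth weight file):

```python
import numpy as np
from scipy import stats

# Toy data (hypothetical)
x = np.array([105.0, 120.0, 130.0, 150.0, 170.0, 200.0])
y = np.array([2600.0, 2900.0, 3100.0, 3000.0, 3400.0, 3600.0])
n = len(x)

b, a = np.polyfit(x, y, 1)
SSE = np.sum((y - (a + b * x))**2)

sigma2_hat = b_var = SSE / (n - 2)                          # estimates sigma^2
S_b = np.sqrt(sigma2_hat / ((n - 1) * np.var(x, ddof=1)))   # std. error of b

t_stat = b / S_b                                 # test statistic for H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
t_crit = stats.t.ppf(0.975, df=n - 2)            # 97.5-percentile of t(n-2)
ci = (b - t_crit * S_b, b + t_crit * S_b)        # 95% CI for beta1

print(f"b = {b:.3f}, S_b = {S_b:.3f}, t = {t_stat:.3f}, p = {p_value:.4f}")
print(f"95% CI for the slope: ({ci[0]:.3f}, {ci[1]:.3f})")
```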
More than one independent variable: Multiple regression
Assume we have data of the type (x11, x21, x31, y1), (x12, x22, x32, y2), ...
We want to explain y from the x-values by fitting the following model:
  y = a + b*x1 + c*x2 + d*x3
Just like before, one can produce formulas for a, b, c, d minimizing the sum of the squares of the errors.

Multiple regression model
  yi = β0 + β1*x1i + β2*x2i + ... + βn*xni + εi
The errors εi are independent random (normal) variables with expected value zero and variance σ².
The explanatory variables x1i, x2i, ..., xni cannot be linearly related, that is, measuring almost the same thing.
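A minimal sketch of fitting such a model by least squares in Python; the data are simulated for illustration, not the birth weight data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Hypothetical covariates: e.g. mother's weight, age, smoking status
x1 = rng.normal(130, 30, n)
x2 = rng.normal(25, 5, n)
x3 = rng.integers(0, 2, n).astype(float)
y = 2500 + 4 * x1 + 10 * x2 - 250 * x3 + rng.normal(0, 400, n)

# Design matrix with a leading column of ones for the constant term
X = np.column_stack([np.ones(n), x1, x2, x3])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes the sum of squared errors
print("b0, b1, b2, b3 =", np.round(coef, 3))
```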
Indicator variables
Binary variables (yes/no, male/female, ...) can be represented as 1/0, and used as independent variables. Also called dummy variables in the book.
When used directly, they influence only the constant term of the regression.
It is also possible to use a binary variable so that it changes both the constant term and the slope of the regression line (interaction).

Example: Regression of birth weight with mother's weight and smoking status as independent variables

Model Summary (Dependent Variable: birthweight; Predictors: (Constant), smoking status, weight in pounds):
  R = .259   R Square = .067   Adjusted R Square = .057
  Std. Error of the Estimate = 707.83567

ANOVA: Regression df = 2, Residual df = 186, Total df = 188 (Total Sum of Squares = 99917053); F ≈ 6.7, Sig. ≈ .001

Coefficients:
                         B        Std. Error      t       Sig.   95% CI for B
  (Constant)         2500.174     230.833      10.831     .000   (2044.787, 2955.561)
  weight in pounds      4.238       1.690       2.508     .013   (0.905, 7.571)
  smoking status     -270.013     105.590      -2.557     .011   (-478.321, -61.705)
Interpretation:
Have fitted the model
Birth weight = 2500.174 + 4.238*mother's weight - 270.013*smoking status
If the mother starts to smoke (and her weight remains constant), what is the predicted influence on the infant's birth weight?
-270.013*1 ≈ -270 grams
What is the predicted weight of the child of a 150-pound, smoking woman?
2500.174 + 4.238*150 - 270.013*1 ≈ 2866 grams

Confounding
See that the estimated effect of mother's weight has changed a little compared to the univariate analysis (where it was 4.429).
Mother's weight is slightly confounded by smoking.
(Diagram: smoking (Smk) is associated with both mother's weight (Mwt) and birth weight (Bwt).)
Confounder: An independent variable that causes a great change (at least 10%) in the effect of other independent variables (the β's) when it is included in the model.
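As a quick check of the arithmetic (using the coefficients as read off the SPSS output above):

```python
a, b_weight, b_smoke = 2500.174, 4.238, -270.013  # from the fitted model above

# Predicted birth weight for a 150-pound mother who smokes
pred = a + b_weight * 150 + b_smoke * 1
print(round(pred), "grams")  # -> 2866 grams
```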
Confounding cont'd
A confounder is differently distributed for different values of the variable it confounds.
E.g. if lean mothers smoked more than obese mothers, a univariate effect of mother's weight on birth weight would partly be due to smoking!
Including smoking in the model removes this effect; you get a more correct estimate of the effect of mother's weight.

What if a categorical variable has more than two values?
Example: Ethnicity; black, white, other
For categorical variables with m possible values, use m-1 indicators (see the sketch below).
Common to choose a large group as baseline, otherwise unstable estimation.
A model with two indicator variables will assume that the effect of one indicator adds to the effect of the other.
If this may be unsuitable, use an additional interaction variable (product of indicators).
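A small pandas sketch of the m-1 indicator coding; the data are hypothetical:

```python
import pandas as pd

# Hypothetical data with a three-level ethnicity variable
eth = pd.Categorical(
    ["white", "black", "other", "white", "other"],
    categories=["white", "black", "other"],  # list the baseline category first
)

# m = 3 categories -> m - 1 = 2 indicator (dummy) variables;
# drop_first=True drops the first category, making 'white' the baseline
dummies = pd.get_dummies(eth, drop_first=True, dtype=float)
print(dummies)
#    black  other   <- a white mother is coded (0, 0)
```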
Birth weight as a function of ethnicity
Have constructed indicator variables black = 0 or 1 and other = 0 or 1:
  Birth weight = a + b*black + c*other
Get (Dependent Variable: birthweight):
                  B        Std. Error      t       Sig.   95% CI for B
  (Constant)   3103.740      72.881      42.586    .000   (2959.959, 3247.521)
  black        -384.047     157.874      -2.433    .016   (-695.502, -72.593)
  other        -299.725     113.678      -2.637    .009   (-523.988, -75.461)
Hence, predicted birth weight decreases by 384 grams for blacks and 299 grams for others.
Predicted birth weight for whites is 3104 grams.

Multiple regression: Traffic deaths in 1976
Want to find if there is any relationship between the highway death rate (deaths per 1000 per state) in the U.S. and the following variables:
Average car age (in months)
Average car weight (in 1000 pounds)
Percentage light trucks
Percentage imported cars
All data are per state.
First: (Scatter plots of deaths per 1000 against each of car age (carage), car weight (vehwt), percentage light trucks (lghttrks) and percentage imported cars (impcars).)

Univariate effects (including one independent variable at a time!):

Deaths per 1000 = a + b*car age (in months)
  Model Summary: R = .492, R Square = .242, Adjusted R Square = .226, Std. Error of the Estimate = .05206
                  B      Std. Error    Beta       t       Sig.   95% CI for B
  (Constant)    4.516      1.134                3.981     .000   (2.233, 6.800)
  carage        -.062       .016      -.492    -3.834     .000   (-.094, -.029)
Hence: If all else is equal, if average car age increases by one month, you get 0.062 fewer deaths per 1000 inhabitants; increase age by 12 months, and you get 12*0.062 = 0.74 fewer deaths per 1000 inhabitants.

Deaths per 1000 = a + b*car weight (in 1000 pounds)
  Model Summary: R = .281, R Square = .079, Adjusted R Square = .059, Std. Error of the Estimate = .05740
                  B      Std. Error    Beta       t       Sig.   95% CI for B
  vehwt          .124       .062       .281     1.983     .053   (-.002, .249)
The slope for car weight is not significant at the 5% level (p = .053).
Univariate effects cont'd (one independent variable at a time!):

Deaths per 1000 = a + b*perc. light trucks
  Model Summary: R = .716, R Square = .512, Adjusted R Square = .501, Std. Error of the Estimate = .04178
                  B      Std. Error    Beta       t       Sig.   95% CI for B
  (Constant)     .046       .018                2.478     .017   (.009, .083)
  lghttrks       .007       .001       .716     6.947     .000   (.005, .010)
Hence: Increasing the prop. of light trucks by 20 means 20*0.007 = 0.14 more deaths per 1000 inhabitants.

Deaths per 1000 = a + b*perc. imported cars
  Model Summary: R = .308, R Square = .095, Adjusted R Square = .075, Std. Error of the Estimate = .05690
                  B      Std. Error    Beta       t       Sig.   95% CI for B
  (Constant)     .206       .020               10.462     .000   (.166, .246)
  impcars       -.004       .002      -.308    -2.193     .033   (-.007, .000)
Predicted number of deaths per 1000 if the prop. of imported cars is 10%: 0.206 - 0.004*10 ≈ 0.17

Building a multiple regression model, exploratory analysis:
Forward regression (see the sketch below):
1. Try all independent variables, one at a time; keep the variable with the lowest p-value
2. Repeat step 1, with the independent variable from the first round now included in the model
3. Repeat until no more variables can be added to the model (no more significant variables)
Backward regression:
1. Include all independent variables in the model; remove the variable with the highest p-value
2. Continue until only significant variables are left
However: In the health sciences you would often keep age, gender etc. in the model even though they are not significant.
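A rough sketch of forward regression in Python (simulated data; the 5% cut-off and the variable names merely mimic the traffic example):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 48  # e.g. one observation per state

# Hypothetical predictors mimicking the traffic example
X = pd.DataFrame({
    "carage": rng.normal(70, 1, n),
    "vehwt": rng.normal(3.4, 0.2, n),
    "lghttrks": rng.uniform(5, 35, n),
    "impcars": rng.uniform(0, 30, n),
})
y = 2.7 - 0.04 * X["carage"] + 0.006 * X["lghttrks"] + rng.normal(0, 0.04, n)

selected, remaining = [], list(X.columns)
while remaining:
    # Fit one model per candidate variable, note the candidate's p-value
    pvals = {}
    for var in remaining:
        model = sm.OLS(y, sm.add_constant(X[selected + [var]])).fit()
        pvals[var] = model.pvalues[var]
    best = min(pvals, key=pvals.get)
    if pvals[best] >= 0.05:        # no more significant variables
        break
    selected.append(best)          # keep the variable with the lowest p-value
    remaining.remove(best)

print("Selected variables:", selected)
```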
Two better methods of model building:
1. All independent variables chosen for the study have strong medical reasons for being interesting, and you have a large enough study. Then all might be included in the final model regardless of significance.
2. Middle road: use a cut-off, saying that all variables with p-value < e.g. 0.2 in simple analyses can be included in the final model.

For the traffic deaths, end up with:
Deaths per 1000 = 2.7 - 0.037*car age + 0.006*perc. light trucks
  Model Summary (Predictors: (Constant), lghttrks, carage; Dependent Variable: deaths):
  R = .768, R Square = .590, Adjusted R Square = .572, Std. Error of the Estimate = .03872
                  B      Std. Error      t       Sig.   95% CI for B
  (Constant)    2.668      .895         2.981    .005   (.865, 4.470)
  carage        -.037      .013        -2.930    .005   (-.063, -.011)
  lghttrks       .006      .001         6.128    .000   (.004, .009)

Conclusion: Did a multiple linear regression on traffic deaths, with car age, car weight, prop. light trucks and prop. imported cars as independent variables. Car age (in months, β = -0.037, 95% CI = (-0.063, -0.011)) and prop. light trucks (β = 0.006, 95% CI = (0.004, 0.009)) were significant at the 5% level.
Check of assumptions: Are the residuals normally distributed?
(Histogram of the regression standardized residuals, Mean ≈ 0, Std. Dev. = 0.978, N = 48, and normal P-P plot of the standardized residuals; Dependent Variable: deaths.)

Least squares estimation in multiple regression
  yi = β0 + β1*x1i + β2*x2i + ... + βK*xKi + εi
The least squares estimates of β0, β1, ..., βK are the values b0, b1, ..., bK minimizing
  SSE = Σ (b0 + b1*x1i + b2*x2i + ... + bK*xKi - yi)²   (sum over i = 1, ..., n)
They can be computed with formulas similar to, but more complex than, those for simple regression.
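In matrix form the SSE-minimizing coefficients solve the normal equations (X'X)b = X'y; this standard result is not derived on these slides, but a small numpy sketch with simulated data shows the idea:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 60, 3

# Design matrix: a column of ones (constant term) plus K covariates
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])
beta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ beta_true + rng.normal(0, 0.5, n)

# Solve the normal equations (X'X) b = X'y for the least squares estimates
b = np.linalg.solve(X.T @ X, X.T @ y)
print("b =", np.round(b, 3))
```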
R² is defined just as before:
Defining
  ŷi = b0 + b1*x1i + b2*x2i + ... + bK*xKi
we set
  SSE = Σ (yi - ŷi)²,  SSR = Σ (ŷi - ȳ)²,  SST = Σ (yi - ȳ)²
We get as before
  SST = SSR + SSE
and define
  R² = SSR/SST = 1 - SSE/SST

Adjusted coefficient of determination
Adding more independent variables will generally increase SSR and decrease SSE.
Thus the coefficient of determination will tend to indicate that models with many variables always fit better.
To avoid this effect, the adjusted coefficient of determination may be used:
  adjusted R² = 1 - [SSE/(n - K - 1)] / [SST/(n - 1)]
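A direct translation of these two formulas into Python (the helper name and the tiny example numbers are made up):

```python
import numpy as np

def r2_and_adjusted(y, y_hat, K):
    """R^2 and adjusted R^2 for a model with K independent variables."""
    n = len(y)
    SSE = np.sum((y - y_hat)**2)
    SST = np.sum((y - np.mean(y))**2)
    r2 = 1 - SSE / SST                                  # = SSR/SST
    r2_adj = 1 - (SSE / (n - K - 1)) / (SST / (n - 1))  # penalizes extra variables
    return r2, r2_adj

# Example with arbitrary numbers: adjusted R^2 is always below plain R^2
y = np.array([1.0, 2.0, 3.0, 2.5, 4.0, 3.5])
y_hat = np.array([1.2, 1.9, 2.8, 2.7, 3.8, 3.6])
print(r2_and_adjusted(y, y_hat, K=2))
```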
Drawing inference about the model parameters in multiple regression
Similar to simple regression, we get that the following statistic has a t-distribution with n - K - 1 degrees of freedom:
  t(bj) = (bj - βj) / s(bj)
where bj is the least squares estimate for βj and s(bj) is its estimated standard deviation.
K is the number of independent variables.
s(bj) is computed from SSE and the correlation between the independent variables.

Confidence intervals and hypothesis tests
A confidence interval for βj becomes
  bj ± t(n-K-1, α/2) * s(bj)
Testing the hypothesis H0: βj = 0 vs H1: βj ≠ 0:
Reject H0 if bj/s(bj) < -t(n-K-1, α/2) or bj/s(bj) > t(n-K-1, α/2)
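A sketch with simulated data of how the s(bj), t statistics and confidence intervals can be computed; the covariance formula σ̂²(X'X)^(-1) for the b's is the standard result, used here as an assumption:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, K = 48, 2
X = np.column_stack([np.ones(n), rng.normal(70, 1, n), rng.uniform(5, 35, n)])
y = X @ np.array([2.7, -0.04, 0.006]) + rng.normal(0, 0.04, n)

b = np.linalg.solve(X.T @ X, X.T @ y)
SSE = np.sum((y - X @ b)**2)
sigma2_hat = SSE / (n - K - 1)

# Estimated standard deviations of the b_j: square roots of the diagonal of
# sigma2_hat * (X'X)^{-1}; this is where the correlation between the
# independent variables enters
s_b = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))

t_stats = b / s_b                                      # for H0: beta_j = 0
p_vals = 2 * stats.t.sf(np.abs(t_stats), df=n - K - 1)
t_crit = stats.t.ppf(0.975, df=n - K - 1)
for j in range(K + 1):
    lo, hi = b[j] - t_crit * s_b[j], b[j] + t_crit * s_b[j]
    print(f"b{j} = {b[j]:8.4f}  t = {t_stats[j]:7.2f}  p = {p_vals[j]:.4f}  "
          f"95% CI = ({lo:.4f}, {hi:.4f})")
```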
Testing sets of parameters
We can also test the null hypothesis that a specific set of the β's are simultaneously zero. The alternative hypothesis is that at least one β in the set is non-zero. But we will not go into details here.

What if the relationship between x and y is non-linear?
The most common thing to do is to categorize the independent variable.
E.g. categorize age into 0-20 yrs, 21-40 yrs, 41-60 yrs and so on.
Choose a baseline category, and estimate a slope b for each of the other categories.
Then it does not matter what relationship you have between the outcome and the independent variable.
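A pandas sketch of this categorization trick (hypothetical ages and cut-points):

```python
import pandas as pd

# Hypothetical ages; categorize into 0-20, 21-40, 41-60 years
age = pd.Series([15, 35, 52, 28, 44, 19])
age_cat = pd.cut(age, bins=[0, 20, 40, 60], labels=["0-20", "21-40", "41-60"])

# Baseline = first category; one slope is estimated per remaining category
dummies = pd.get_dummies(age_cat, drop_first=True, dtype=float)
print(dummies)
```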
Other options if the relationship is non-linear: Transformed variables
The relationship between variables may not be linear.
Example: The natural model may be y = a*e^(bx). We want to find a and b so that the curve y = a*e^(bx) approximates the points as well as possible.
(Plot of an exponential curve through the data points.)

Example (cont.)
When y = a*e^(bx), then log(y) = log(a) + b*x.
Use the standard formulas on the pairs (x1, log(y1)), (x2, log(y2)), ..., (xn, log(yn)).
We get estimates for log(a) and b, and thus a and b.
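A numpy sketch of this log-transform trick on simulated data (the true a and b are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(5, 30, 40)
y = 0.02 * np.exp(0.15 * x) * rng.lognormal(0, 0.1, x.size)  # roughly y = a*e^(bx)

# Fit a straight line to (x, log y): log y = log a + b*x
b, log_a = np.polyfit(x, np.log(y), 1)
a = np.exp(log_a)
print(f"a = {a:.4f}, b = {b:.4f}")  # should be near a = 0.02, b = 0.15
```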
Doing a regression analysis
1. Plot the data first, to investigate whether there is a natural relationship: linear or transformed model? Are there outliers which will unduly affect the result?
2. Fit a model. Different models with the same number of parameters may be compared with R².
3. Check the assumptions!
4. Make tests / confidence intervals for the parameters.
A lot of practice is needed!