Multiple Regression

Relating a response (dependent, output) variable y to a set of explanatory (independent, input, predictor) variables x1, x2, x3, ..., xq. A technique for modeling the relationship between variables.

Deterministic component: µ_{y|x1,x2,...,xq} = α + β1x1 + β2x2 + β3x3 + ... + βqxq
Random component: ε = y − µ_{y|x1,x2,...,xq}

Multiple Linear Regression Model: y = α + β1x1 + β2x2 + β3x3 + ... + βqxq + ε

The parameters α, β1, β2, β3, ..., βq in the model can all be estimated by the least-squares estimators α̂, β̂1, β̂2, β̂3, ..., β̂q.

The Least-Squares Regression Equation: ŷ = α̂ + β̂1x1 + β̂2x2 + β̂3x3 + ... + β̂qxq

Example: Study weight (y) using age (x1) and height (x2). Data: age (months), height (inches), and weight (pounds) were recorded for a group of school children.

[Scatter plots of weight vs. age and weight vs. height omitted.]

- Scatter plots show that both age and height are linearly related to weight.
- Model: y = α + β1x1 + β2x2 + ε, where y: Weight, x1: Age, x2: Height
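The least-squares machinery above can be sketched in a few lines: build the design matrix, form the normal equations (XᵀX)β = Xᵀy, and solve them. A minimal sketch, assuming made-up data and function names (nothing below is from the handout):

```python
# Minimal least squares for y = a + b1*x1 + b2*x2 + e via the normal equations.

def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for k in range(col, n + 1):
                    M[r][k] -= f * M[col][k]
    return [M[i][n] / M[i][i] for i in range(n)]

def least_squares(X, y):
    """Return the coefficient vector minimizing ||y - X beta||^2."""
    q = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(q)] for i in range(q)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(q)]
    return solve_linear_system(XtX, Xty)

# Toy data: weight generated exactly as -20 + 0.5*age + 2*height,
# so least squares should recover these coefficients.
ages = [30.0, 40.0, 50.0, 60.0, 70.0]
heights = [38.0, 42.0, 45.0, 48.0, 52.0]
weights = [-20 + 0.5 * a + 2 * h for a, h in zip(ages, heights)]
X = [[1.0, a, h] for a, h in zip(ages, heights)]
alpha, b_age, b_height = least_squares(X, weights)
```

Because the toy response is an exact linear function of the predictors, the fit reproduces the generating coefficients, which is a handy self-check for the solver.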
Evaluation of the Model: (SPSS Output)

Model Summary
  R = .794, R Square = .630, Adjusted R Square = .627, Std. Error of the Estimate = 11.868
  a. Predictors: (Constant), Age, Height

Coefficient of determination (R Square): the percentage of variability in the response variable (Weight) that can be described by the predictor variables (Age, Height) through the model.

ANOVA
                Sum of Squares    df    Mean Square      F       Sig.
  Regression       56233.254       2     28116.627     199.6    .000
  Residual         32960.796     234       140.858
  Total            89194.050     236
  a. Predictors: (Constant), Age, Height
  b. Dependent Variable: Weight

Test for significance of the model: p-value = .000 < .05
  Ho: Model is insignificant (the βi's are all zero).
  Ha: Model is significant (some βi's are not zero).

Coefficient Estimation: (SPSS Output)
                 B       Std. Error    Beta      t        Sig.   Tolerance    VIF
  (Constant)  -27.820      2.099               -10.565    .000
  Age            .240       .055       .228       4.36    .000     .579      1.727
  Height        3.090       .257       .627     12.008    .000     .579      1.727

Inference for Regression Coefficients:
  Ho: α = 0 vs. Ha: α ≠ 0, p-value = .000 < .05
  Ho: β1 = 0 vs. Ha: β1 ≠ 0, p-value = .000 < .05
  Ho: β2 = 0 vs. Ha: β2 ≠ 0, p-value = .000 < .05

Collinearity* statistics: tolerance less than 0.1 or VIF (Variance Inflation Factor) greater than 10 implies serious collinearity.
  * Collinearity occurs when there are significant correlations between pairs of independent variables in the model.
  Collinearity: There is no serious collinearity in the model.

Tests for regression coefficients: the parameters in the model, α, β1, and β2, are all statistically significant.

Least-squares regression equation: ŷ = −27.82 + .24x1 + 3.09x2 (for estimating the expected response value)

The average weight of children who are 144 months old and whose height is 55 inches would be:
  −27.82 + .24×144 + 3.09×55 = 176.69 (lb) (estimated by the model)

How to interpret α, β1, and β2?
  α is the constant, or the y-intercept, in the model. It is the average value of the response when both predictor variables are 0.
  β1 is the rate of change of expected (average) weight per unit change of age, adjusted for the height variable.
  β2 is the rate of change of expected (average) weight per unit change of height, adjusted for the age variable.
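The prediction step above is just an evaluation of the fitted equation, which can be wrapped in a small helper (the coefficients are the least-squares estimates from the output; the helper name is illustrative):

```python
# Fitted least-squares equation:
#   weight-hat = -27.82 + 0.24*age + 3.09*height   (age in months, height in inches)
def predict_weight(age_months, height_inches):
    return -27.82 + 0.24 * age_months + 3.09 * height_inches

w = predict_weight(144, 55)   # a child 144 months old and 55 inches tall
print(round(w, 2))            # 176.69
```

Note this gives an estimate of the mean weight for children with those predictor values, not an exact weight for any individual child.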
Other possible models: (y: Weight, x1: Age, x2: Height)
  y = α + β1x1 + ε
  y = α + β2x2 + ε
With an interaction term (β3x1x2):
  y = α + β1x1 + β2x2 + β3x1x2 + ε
  y = α + β1x1 + β3x1x2 + ε
  y = α + β2x2 + β3x1x2 + ε

Coefficient Estimation with Interaction Between Age and Height for the Model:
  y = α + β1x1 + β2x2 + β3x1x2 + ε   (INTAG_HT = Age × Height)

                 B        Std. Error   Beta      t       Sig.   Tolerance    VIF
  (Constant)   66.996     106.89                 .63     .529
  Age           -.973        .66      -.923    -1.476    .141     .004     250.009
  Height     -3.3E-02       1.70      -.006     -.019    .985     .013      77.06
  INTAG_HT   1.936E-02       .010      .636     1.847    .066     .002     501.996

High VIFs imply very serious collinearity. The interaction term should not be included in the model.

Model: y = α + β1x1 + β2x2 + ε, where y: Weight, x1: Age, x2: Height
Prediction Equation: ŷ = −27.82 + .24x1 + 3.09x2

Is the model above a good model for estimating a child's weight based on age and height for the population the sample was taken from?
If only the male children or only the female children are modeled, the SPSS coefficients table for each model y = α + β1x1 + β2x2 + ε is:

For boys:
                 B        Std. Error   Beta      t       Sig.   Tolerance    VIF
  (Constant) -113.73       15.590              -7.294    .000
  Age            .308        .084      .289     3.672    .000     .443      2.259
  Height        2.680        .368      .574     7.283    .000     .443      2.259

  Is there serious collinearity? Explain with the statistics in the table.
  Write the weight prediction equation using age and height as predictor variables.
  Find the average weight for boys that are 144 months old and 55 inches tall.

For girls:
                 B        Std. Error   Beta      t       Sig.   Tolerance    VIF
  (Constant) -150.597      20.767              -7.252    .000
  Age            .190        .076      .186     2.524    .013     .704      1.420
  Height        3.410        .181      .650    18.838    .000     .704      1.420

  Is there serious collinearity? Explain with the statistics in the table.
  Write the weight prediction equation using age and height as predictor variables.
  Find the average weight for girls that are 144 months old and 55 inches tall.
Indicator Variables

Binary variables that take only two possible values, 0 and 1, and can be used for including categorical variables in the model. Male: 1, Female: 0.

Group Statistics (Weight)
            N      Mean      Std. Deviation   Std. Error Mean
  Male     126    103.448        19.968            1.779
  Female   111     98.878        18.616            1.767

Model: (A model for the two-independent-samples situation under the equal-variances condition.)
  y = α + β1x1 + ε, where y: Weight, x1: Gender (x1 = 0 for female, x1 = 1 for male)
  When x1 = 0: y = α + ε
  When x1 = 1: y = α + β1 + ε
  The difference between the averages of the two categories is β1.

SPSS output for linear regression with gender as the predictor variable:
                 B       Std. Error    Beta      t       Sig.
  (Constant)   98.878      1.836               53.846    .000
  Gender        4.570      2.518       .118     1.815    .071

SPSS output for the two-independent-samples t-test comparing the mean weight between male and female:

Independent Samples Test (Weight)
  Levene's Test for Equality of Variances: F = .630, Sig. = .428
                                t       df       Sig. (2-tailed)  Mean Diff.  Std. Error Diff.  95% CI Lower  Upper
  Equal variances assumed      1.815    235           .071           4.570         2.518           -.392       9.532
  Equal variances not assumed  1.823    234.233       .070           4.570         2.507           -.370       9.510

The relation between Weight and Gender is insignificant; equivalently, there is no significant difference between the average weights of male and female children.
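The equivalence claimed above (regression on a 0/1 indicator reproduces the two-group comparison: intercept = mean of the 0-group, slope = difference of the group means) can be checked numerically. A small sketch with made-up weights:

```python
# With a 0/1 indicator as the only predictor, least squares gives
# intercept = mean of the 0-group and slope = difference of group means.

def simple_ols(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

male_weights = [101.0, 105.0, 99.0, 108.0]    # indicator x = 1, mean 103.25
female_weights = [97.0, 95.0, 102.0, 98.0]    # indicator x = 0, mean 98.0
x = [1.0] * len(male_weights) + [0.0] * len(female_weights)
y = male_weights + female_weights
intercept, slope = simple_ols(x, y)
# intercept = 98.0 (female mean); slope = 103.25 - 98.0 = 5.25
```

This is exactly why the regression t-test for β1 and the equal-variances two-sample t-test in the SPSS output agree.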
Use of Indicator Variables in the Regression

Age, Height, and Gender variables as Predictor Variables
Model: y = α + β1x1 + β2x2 + β3x3 + ε, where y: Weight, x1: Age, x2: Height, x3: Gender (x3 = 0 for Female; x3 = 1 for Male)

Model Summary
  R = .794, R Square = .631, Adjusted R Square = .626, Std. Error of the Estimate = 11.893
  a. Predictors: (Constant), Age, Height, Gender

[Scatter plot of weight vs. height with male and female cases marked omitted.]

                 B        Std. Error   Beta      t       Sig.   Tolerance    VIF
  (Constant)  -28.209      2.264              -12.454    .000
  Age            .238        .056      .226     4.250    .000     .562      1.78
  Height        3.105        .267      .630    11.62     .000     .539      1.854
  Gender        -.338       1.61      -.009     -.210    .834     .932      1.073

With the Age and Height variables in the model, the Gender variable becomes insignificant. When comparing the difference in average weight between genders adjusted for the age and height variables, the difference is statistically insignificant.

Age, Height, Gender, and Age-Height Interaction variables as Predictor Variables (INTAG_HT = Age × Height)

                 B        Std. Error   Beta      t       Sig.   Tolerance    VIF
  (Constant)   81.307     108.647               .748     .455
  Age          -1.076        .679     -.020   -1.584     .115     .004     264.838
  Height        -.234       1.74      -.048    -.135     .893     .013      79.658
  Gender       -1.047       1.636     -.027    -.640     .523     .885      1.130
  INTAG_HT    2.09E-02       .011      .766    1.936     .054     .002     528.420

Adding the interaction term to the model increases the VIFs of the model estimation. The model without the interaction term would be better.
Age and Gender as Predictor Variables
Model: y = α + β1x1 + β2x2 + ε, where y: Weight, x1: Age, x2: Gender (x2 = 0 for Female; x2 = 1 for Male)

Model Summary
  R = .645, R Square = .416, Adjusted R Square = .411, Std. Error of the Estimate = 14.95
  a. Predictors: (Constant), Age, Gender

[Scatter plot of weight vs. age with male and female cases marked omitted.]

                 B        Std. Error   Beta      t       Sig.   Tolerance    VIF
  (Constant)  -11.18       8.778              -1.274    .204
  Age            .669        .053      .634    12.705    .000    1.000      1.000
  Gender        4.539       1.942      .117     2.338    .020    1.000      1.000

Age and Gender are both significant variables when using them to predict weight. There is a significant difference in average weight between genders when adjusted for the age variable.

Exercise: What would be the average weight for 14-year-old boys using the model above?

Age, Gender, and Age-Gender Interaction variables as Predictor Variables (INTGN_AG = Age × Gender)

                 B        Std. Error   Beta      t       Sig.   Tolerance    VIF
  (Constant)    7.83       12.892               .607     .544
  Age            .554        .078      .525    7.105     .000     .441      2.27
  Gender      -30.04       17.37      -.774   -1.729     .085     .012     81.48
  INTGN_AG       .21         .105      .903    2.002     .046     .012     82.6

Adding the interaction term to the model increases the VIFs of the model estimation. The model without the interaction term would be better.
Common mistake: using the internally coded values of a categorical explanatory variable directly in the linear regression calculation. The proper way to include a categorical variable is to use indicator variables: for a categorical variable with k categories, one should set up k − 1 indicator variables.

Example: A survey question asked Race with 3 possible responses: White = 1, Black = 2, Hispanic = 3. One can set up an indicator variable x1 so that x1 = 1 represents White, otherwise x1 = 0, and another indicator x2 such that x2 = 1 represents Black, otherwise x2 = 0; then x1 = 0 and x2 = 0 represents Hispanic. The survey also asked "Your Body Fat Percentage" (the response y) and "Number of hours of exercise per week" (x3).

Model: y = α + β1x1 + β2x2 + β3x3 + ε

Interpretation of the model:
  Race: White, x1 = 1 and x2 = 0: y = α + β1 + β3x3 + ε
  Race: Black, x1 = 0 and x2 = 1: y = α + β2 + β3x3 + ε
  Race: Hispanic, x1 = 0 and x2 = 0: y = α + β3x3 + ε

Exercise: Suppose that the estimated parameter values for the model are α̂ = 20, β̂1 = 2.1, β̂2 = 1.3, β̂3 = 1.1.
  Write down the prediction equation:
  Estimate the average body fat for a white person exercising 10 hours per week:
  Estimate the average body fat for a black person exercising 10 hours per week:
  Estimate the average body fat for a Hispanic person exercising 10 hours per week:
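The dummy-coding scheme (k − 1 indicators for k categories) can be sketched as a lookup table. The parameter values below are illustrative assumptions, not verified against the handout's (partly garbled) exercise numbers:

```python
# Dummy (indicator) coding for a 3-level categorical variable:
# two indicators, with Hispanic as the reference category (0, 0).
CODES = {"White": (1, 0), "Black": (0, 1), "Hispanic": (0, 0)}

# Illustrative parameter estimates (assumed values).
A, B1, B2, B3 = 20.0, 2.1, 1.3, 1.1

def predicted_body_fat(race, hours_per_week):
    x1, x2 = CODES[race]
    return A + B1 * x1 + B2 * x2 + B3 * hours_per_week

# At 10 hours of exercise per week:
white = predicted_body_fat("White", 10)        # 20 + 2.1 + 11 = 33.1
black = predicted_body_fat("Black", 10)        # 20 + 1.3 + 11 = 32.3
hispanic = predicted_body_fat("Hispanic", 10)  # 20 + 11 = 31.0
```

Note how each race shifts only the intercept; the exercise slope B3 is shared, which is exactly the "parallel lines" interpretation listed above.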
Example: Study female life expectancy using percentage of urbanization and birth rate.

[Scatter plots of female life expectancy (1992) vs. percent urban (1992) and vs. births per 1,000 population (1992) omitted.]

Model: y = α + β1x1 + β2x2 + ε, where y: Female life expectancy (1992), x1: Birth rate (births per 1,000 population, 1992), x2: Percent urban (1992)

Evaluation of the model: (SPSS output)

Model Summary
  R = .904, R Square = .817, Adjusted R Square = .813, Std. Error of the Estimate = 4.894
  a. Predictors: (Constant), Births per 1,000 population 1992, Percent urban 1992

Coefficient of determination (R Square): the percentage of variability in the response variable (female life expectancy) that can be described by the predictor variables (birth rate, percentage of urbanization) through the model.

ANOVA
                Sum of Squares    df    Mean Square      F       Sig.
  Regression       12577.056       2     6288.528     262.595   .000
  Residual          2825.820     118       23.948
  Total            15402.876     120
  a. Predictors: (Constant), Births per 1,000 population 1992, Percent urban 1992
  b. Dependent Variable: Female life expectancy 1992

Test for significance of the model: p-value = .000 < .05
  Ho: Model is insignificant (the βi's are all zero).
  Ha: Model is significant (some βi's are not zero).
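The model-summary quantities follow directly from the ANOVA decomposition, so they can be recomputed as a sanity check (sums of squares and degrees of freedom as reconstructed from the output; note that the mean square column is just SS/df):

```python
# ANOVA decomposition: SS_total = SS_regression + SS_residual
ss_regression = 12577.056
ss_residual = 2825.820
df_residual = 118            # n - q - 1, with n = 121 countries, q = 2 predictors
df_total = 120               # n - 1

ss_total = ss_regression + ss_residual                  # 15402.876
r_square = ss_regression / ss_total                     # ~ .817
adj_r_square = 1 - (ss_residual / df_residual) / (ss_total / df_total)  # ~ .813
```

Adjusted R Square penalizes for the number of predictors by comparing mean squares rather than raw sums of squares, which is why it is slightly smaller than R Square here.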
Coefficient estimation: (SPSS output)

                                         B      Std. Error   Beta      t       Sig.   Tolerance    VIF
  (Constant)                          76.216      2.431              31.350    .000
  Births per 1,000 population, 1992    -.555       .045      -.648  -12.96     .000     .551      1.814
  Percent urban, 1992                   .154       .025       .331    6.238    .000     .551      1.814
  a. Dependent Variable: Female life expectancy 1992

Inference for Regression Coefficients:
  Ho: α = 0 vs. Ha: α ≠ 0, p-value = .000 < .05
  Ho: β1 = 0 vs. Ha: β1 ≠ 0, p-value = .000 < .05
  Ho: β2 = 0 vs. Ha: β2 ≠ 0, p-value = .000 < .05

Collinearity* statistics: tolerance less than 0.1 or VIF (Variance Inflation Factor) greater than 10 implies serious collinearity.
  * Collinearity: significant correlations between pairs of independent variables in the model.

Tests for regression coefficients: the parameters in the model, α, β1, and β2, are all statistically significant.

Least-squares regression equation: ŷ = 76.216 − .555x1 + .154x2 (for estimating the expected response value)

The average female life expectancy for countries whose birth rate per 1,000 is 30 and whose percentage of urbanization is 40 would be 76.216 − 0.555×30 + 0.154×40 = 65.726 years (estimated by the model).

How to interpret α, β1, and β2?
  α is the constant, or the y-intercept, of the model. It is the average value of the response variable when both predictor variables are 0.
  β1 is the rate of change of expected (average) life expectancy per unit change of birth rate, adjusted for percentage of urbanization.
  β2 is the rate of change of expected (average) life expectancy per unit change of percentage of urbanization, adjusted for birth rate.

Other possible models: (x1: Birth rate, x2: Percent urban)
  y = α + β1x1 + ε
  y = α + β2x2 + ε
With an interaction term (β3x1x2):
  y = α + β1x1 + β2x2 + β3x1x2 + ε
  y = α + β1x1 + β3x1x2 + ε
  y = α + β2x2 + β3x1x2 + ε
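As with the weight example, the prediction is just an evaluation of the fitted equation; a small helper makes the role of each coefficient explicit (the helper name is illustrative; the inputs, birth rate 30 and 40% urbanization, are the worked example's):

```python
# Fitted least-squares equation:
#   life-expectancy-hat = 76.216 - 0.555*birth_rate + 0.154*percent_urban
def predict_life_expectancy(birth_rate, percent_urban):
    return 76.216 - 0.555 * birth_rate + 0.154 * percent_urban

e = predict_life_expectancy(30, 40)
print(round(e, 3))   # 65.726
```

Each additional birth per 1,000 lowers the expected life expectancy by 0.555 years (holding urbanization fixed), and each additional percentage point of urbanization raises it by 0.154 years (holding birth rate fixed).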
Understanding female life expectancy and how it is related to the explanatory variables: Birth Rate, Urbanization, Phones, Doctors, and GDP.

Before transformation                       After log transformation (Phones, Doctors, GDP)
  Female life expectancy 1992                 Female life expectancy 1992
  Births per 1,000 population, 1992           Births per 1,000 population, 1992
  Percent urban, 1992                         Percent urban, 1992
  Phones per 100 people                       Natural log of phones per 100 people
  Doctors per 10,000 people                   Natural log of doctors per 10,000
  GDP per capita                              Natural log of GDP

Model Summary
  R = .934, R Square = .873, Adjusted R Square = .867, Std. Error of the Estimate = 4.08, Durbin-Watson = 2.03
  a. Predictors: (Constant), Natural log of GDP, Percent urban 1992, Births per 1,000 population 1992, Natural log of doctors per 10,000, Natural log of phones per 100 people
  b. Dependent Variable: Female life expectancy 1992
  (The Durbin-Watson statistic checks independence of the errors.)

ANOVA
                Sum of Squares    df    Mean Square      F       Sig.
  Regression       12123.330       5     2424.666     145.342   .000
  Residual          1768.348     106       16.683
  Total            13891.679     111
  a. Predictors and b. Dependent Variable: as in the Model Summary above

Coefficients
                                        B       Std. Error   Beta      t       Sig.   Tolerance    VIF
  (Constant)                          77.448      5.829              13.287    .000
  Births per 1,000 population, 1992    -.272       .058      -.319   -4.659    .000     .256       3.903
  Percent urban, 1992               1.937E-02      .031       .043     .629    .531     .263       3.80
  Natural log of phones per 100       3.175        .679       .552    4.675    .000     .086      11.590
  Natural log of doctors per 10,000   1.894        .593       .262    3.194    .002     .178       5.62
  Natural log of GDP                 -1.390        .784      -.190   -1.772    .079     .105       9.543

Multicollinearity: Tolerance measures the strength of the linear relation between an independent variable and the other independent variables (tolerance = 1 − R² from regressing that predictor on the rest). It is better for it to be higher than 0.1. VIF is the reciprocal of Tolerance.
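The tolerance/VIF definitions above can be computed directly: regress one predictor on the others, take that R², and set tolerance = 1 − R² and VIF = 1/tolerance. A sketch with two made-up, strongly related predictor columns (with only two predictors, the R² is just the squared correlation):

```python
def r_squared(x, z):
    """Squared correlation between two variables."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((b - mz) ** 2 for b in z)
    return sxz * sxz / (sxx * szz)

# Hypothetical predictor columns that move almost in lockstep:
x1 = [30.0, 40.0, 50.0, 60.0, 70.0]
x2 = [38.0, 42.0, 45.0, 48.0, 52.0]

r2 = r_squared(x1, x2)       # ~ .997
tolerance = 1 - r2           # < 0.1  -> serious collinearity
vif = 1 / tolerance          # > 10   -> serious collinearity
```

With these columns the tolerance falls well below 0.1 and the VIF far above 10, so both rules of thumb flag the same problem, as they must, since one is the reciprocal of the other.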
Step-wise selection

ANOVA
  Model 1: Regression SS 11159.884, df 1,  MS 11159.884, F 449.370, Sig .000
           Residual   SS  2731.795, df 110, MS    24.834
           Total      SS 13891.679, df 111
  Model 2: Regression SS 11830.842, df 2,  MS  5915.421, F 312.873, Sig .000
           Residual   SS  2060.836, df 109, MS    18.907
           Total      SS 13891.679, df 111
  Model 3: Regression SS 12069.502, df 3,  MS  4023.167, F 238.452, Sig .000
           Residual   SS  1822.177, df 108, MS    16.872
           Total      SS 13891.679, df 111
  a. Predictors: (Constant), Natural log of phones per 100 people
  b. Predictors: (Constant), Natural log of phones per 100 people, Births per 1,000 population 1992
  c. Predictors: (Constant), Natural log of phones per 100 people, Births per 1,000 population 1992, Natural log of doctors per 10,000
  d. Dependent Variable: Female life expectancy 1992

Coefficients
  Model 1:
                                   B      Std. Error   Beta      t       Sig.   Tolerance   VIF
    (Constant)                   60.61       .562              107.84    .000
    ln(phones per 100 people)     5.16       .243      .896     21.198   .000    1.000     1.000
  Model 2:
    (Constant)                   72.566     2.119               34.239   .000
    ln(phones per 100 people)     3.352      .370      .582      9.048   .000     .329     3.042
    Births per 1,000, 1992        -.327      .055     -.383     -5.957   .000     .329     3.042
  Model 3:
    (Constant)                   68.176     2.317               29.418   .000
    ln(phones per 100 people)     2.386      .434      .414      5.496   .000     .214     4.682
    Births per 1,000, 1992        -.246      .056     -.288     -4.364   .000     .280     3.576
    ln(doctors per 10,000)        2.054      .546      .284      3.762   .000     .213     4.706

What are the significant factors that are related to female life expectancy?

In stepwise regression, a large number of tests are performed, which leads to a higher probability of Type I or Type II errors. It should be used when one wants to determine the important independent variables from a large number of potentially useful variables in the modeling process.
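The selection loop itself is easy to sketch. SPSS's stepwise method enters and removes variables using F-to-enter/F-to-remove p-values; the sketch below uses plain R² improvement instead, purely to keep the code short, and all data and names are made up:

```python
# Forward selection sketch: at each step, add the candidate predictor
# that most increases R^2 of the least-squares fit.

def fit_r2(cols, y):
    """R^2 of the least-squares fit of y on an intercept plus the given columns."""
    n = len(y)
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    q = len(X[0])
    # Normal equations (X'X) beta = X'y, solved by Gauss-Jordan elimination.
    M = [[sum(r[a] * r[b] for r in X) for b in range(q)] +
         [sum(r[a] * yi for r, yi in zip(X, y))] for a in range(q)]
    for c in range(q):
        p = max(range(c, q), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(q):
            if r != c:
                f = M[r][c] / M[c][c]
                for k in range(c, q + 1):
                    M[r][k] -= f * M[c][k]
    beta = [M[i][q] / M[i][i] for i in range(q)]
    my = sum(y) / n
    sse = sum((yi - sum(b * xi for b, xi in zip(beta, row))) ** 2
              for row, yi in zip(X, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst

def forward_select(candidates, y, steps):
    chosen = []
    for _ in range(steps):
        best = max((name for name in candidates if name not in chosen),
                   key=lambda name: fit_r2(
                       [candidates[c] for c in chosen] + [candidates[name]], y))
        chosen.append(best)
    return chosen

# Toy data: y depends strongly on "a", weakly on "b", not at all on "c".
cands = {"a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
         "b": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
         "c": [2.0, 7.0, 1.0, 8.0, 2.0, 8.0]}
y = [2 * a + 0.1 * b for a, b in zip(cands["a"], cands["b"])]
order = forward_select(cands, y, 2)   # picks "a" first, then "b"
```

Because each step runs a fresh comparison over all remaining candidates, many fits (and in real stepwise regression, many tests) are performed, which is exactly the multiple-testing caveat raised above.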
Use of regression analysis:
1. Description (model, system, relation)
   Relation between life expectancy, birth rate, GDP, ...
   Relation between salary, rank, years of service, ...
2. Control
   Died too young, underpaid, overpaid, ...
3. Prediction
   Life expectancy, salary for newcomers, future salary, ...
4. Variable screening (important factors)
   What are the important factors that affect salary or life expectancy?

Construction of regression models:
1. Hypothesize the form of the model for µ_{y|x1,x2,...,xq}:
   a) Selecting predictor variables.
   b) Deciding the functional form of the regression equation.
   c) Defining the scope of the model (design range).
2. Collect the sample data (observations, experiments).
3. Use the sample to estimate the unknown parameters in the model.
4. Specify the probability distribution of the random error.
5. Statistically check the usefulness of the model.
6. Apply the model in decision making.
7. Review the model with new data.

What is a linear model? A model is linear if it is linear in terms of its parameters. Examples of linear models:
  y = β0 + β1x + ε
  y = β0 + β1x1 + β2x2 + ε
  y = β0 + β1x1 + β2x2 + β3x1x2 + ε
  y = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2² + ε
  y = β0 + β1 ln(x) + ε
  y = β0 + β1 e^x + ε
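"Linear in the parameters" means a model like y = β0 + β1 ln(x) + ε is still fit by ordinary least squares: transform the predictor first, then run a straight-line fit. A minimal sketch with made-up, noise-free data:

```python
import math

def simple_ols(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

# Toy data generated exactly from y = 3 + 0.5*ln(x); fitting y against
# ln(x) is an ordinary straight-line fit, so it recovers (3, 0.5).
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [3.0 + 0.5 * math.log(x) for x in xs]
b0, b1 = simple_ols([math.log(x) for x in xs], ys)
```

By contrast, a model such as y = β0·e^(β1 x) + ε is nonlinear in β1 and cannot be handled this way without first transforming the model itself.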