Multiple Regression


Relating a response (dependent, output) variable y to a set of explanatory (independent, input, predictor) variables x1, x2, x3, ..., xq. A technique for modeling the relationship between variables.

Deterministic component: µ_{y|x1,x2,x3,...,xq} = α + β1x1 + β2x2 + β3x3 + ... + βq xq
Random component: ε
The model: y = µ_{y|x1,x2,x3,...,xq} + ε

Multiple Linear Regression Model: y = α + β1x1 + β2x2 + β3x3 + ... + βq xq + ε

α, β1, β2, β3, ..., βq in the model can all be estimated by the least-squares estimators α̂, β̂1, β̂2, β̂3, ..., β̂q.

The least-squares regression equation: ŷ = α̂ + β̂1x1 + β̂2x2 + β̂3x3 + ... + β̂q xq

Example: Study weight (y) using age (x1) and height (x2). Data: age (months), height (inches), and weight (pounds) were recorded for a group of school children.

[Scatter plots of Weight against Age and of Weight against Height]

The scatter plots show that both age and height are linearly related to weight.

Model: y = α + β1x1 + β2x2 + ε, where y: Weight, x1: Age, x2: Height
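For readers who want to reproduce this kind of fit outside SPSS, the sketch below uses Python's statsmodels. The age/height/weight values are simulated stand-ins (the classroom data set is not distributed with these notes), so the estimates will not match the SPSS output that follows.

    # A minimal sketch of fitting y = a + b1*x1 + b2*x2 + e by least squares.
    # The age/height/weight data here are simulated placeholders, not the
    # actual classroom data set used in these notes.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 237
    age = rng.uniform(60, 180, n)                              # months
    height = 30 + 0.2 * age + rng.normal(0, 3, n)              # inches
    weight = -30 + 0.3 * age + 2.5 * height + rng.normal(0, 10, n)  # pounds
    df = pd.DataFrame({"weight": weight, "age": age, "height": height})

    fit = smf.ols("weight ~ age + height", data=df).fit()
    print(fit.params)     # estimates of alpha, beta1, beta2
    print(fit.summary())  # R Square, ANOVA F test, t tests, as in the SPSS output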

Evaluation of the Model: (SPSS Output)

Model Summary
  R = .794, R Square = .630, Adjusted R Square = .627, Std. Error of the Estimate = 11.868
  a. Predictors: (Constant), Age, Height

Coefficient of determination (R Square): the percentage of the variability in the response variable (Weight) that can be described by the predictor variables (Age, Height) through the model.

ANOVA
              Sum of Squares    df     Mean Square    F        Sig.
  Regression     56233.254        2      28116.627    199.6    .000
  Residual       32960.796      234        140.858
  Total          89194.050      236
  a. Predictors: (Constant), Age, Height   b. Dependent Variable: Weight

Test for significance of the model: p-value = .000 < .05
  Ho: The model is insignificant (the βi's are all zero).
  Ha: The model is significant (not all of the βi's are zero).

Coefficient Estimation: (SPSS Output)
              B          Std. Error    Beta    t         Sig.    Tolerance    VIF
  (Constant)  -127.820   12.099                -10.565   .000
  Age         .24        .055          .228    4.3       .000    .579         1.727
  Height      3.090      .257          .627    12.008    .000    .579         1.727

Inference for the regression coefficients:
  Ho: α = 0 vs. Ha: α ≠ 0, p-value = .000 < .05
  Ho: β1 = 0 vs. Ha: β1 ≠ 0, p-value = .000 < .05
  Ho: β2 = 0 vs. Ha: β2 ≠ 0, p-value = .000 < .05

Collinearity* statistics: a tolerance less than 0.1, or a VIF (Variance Inflation Factor) greater than 10, implies serious collinearity.
* Collinearity occurs when there are significant correlations between pairs of independent variables in the model.

Collinearity: There is no serious collinearity in this model.

Tests for the regression coefficients: all parameters in the model, α, β1, and β2, are statistically significant.

Least-squares regression equation: ŷ = -127.82 + .24x1 + 3.09x2 (for estimating the expected response value)

The average weight of children who are 144 months old and whose height is 55 inches would be -127.82 + .24(144) + 3.09(55) = 76.69 (lb), as estimated by the model.

How should α, β1, and β2 be interpreted?
  α is the constant, or y-intercept, of the model. It is the average value of the response when both predictor variables are 0.
  β1 is the rate of change of the expected (average) weight per unit change of age, adjusted for the height variable.
  β2 is the rate of change of the expected (average) weight per unit change of height, adjusted for the age variable.
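The tolerance, VIF, and prediction steps above can be reproduced in code. A minimal sketch, reusing df and fit from the previous block (the printed numbers are placeholders, not the SPSS values):

    # Collinearity diagnostics (tolerance and VIF) and a model prediction.
    # Reuses df and fit from the previous sketch; values are illustrative only.
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    X = sm.add_constant(df[["age", "height"]])
    for i, name in enumerate(X.columns):
        if name == "const":
            continue  # the intercept column's VIF is not of interest here
        vif = variance_inflation_factor(X.values, i)
        print(name, "VIF =", round(vif, 3), "tolerance =", round(1 / vif, 3))

    # Estimated average weight for a child 144 months old and 55 inches tall
    new_child = pd.DataFrame({"age": [144], "height": [55]})
    print(fit.predict(new_child))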

Other possible models (y: Weight, x1: Age, x2: Height):
  y = α + β1x1 + ε
  y = α + β2x2 + ε
With an interaction term (x1x2):
  y = α + β1x1 + β2x2 + β3x1x2 + ε
  y = α + β1x1 + β3x1x2 + ε
  y = α + β2x2 + β3x1x2 + ε

Coefficient estimation with the interaction between Age and Height (INTAG_HT) for the model y = α + β1x1 + β2x2 + β3x1x2 + ε:

[SPSS coefficient table: with INTAG_HT in the model, none of Age, Height, or INTAG_HT is statistically significant, and the VIF values run into the hundreds (e.g., 250.009).]

A very high VIF implies very serious collinearity, so the interaction term should not be added to the model.

Model: y = α + β1x1 + β2x2 + ε, where y: Weight, x1: Age, x2: Height
Prediction equation: ŷ = -127.82 + .24x1 + 3.09x2

Is the model above a good model for estimating a child's weight based on age and height for the population from which the sample was taken?
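Why a raw interaction term causes collinearity is easy to demonstrate: the product x1x2 is strongly correlated with x1 and x2 themselves. A sketch, reusing the simulated data and imports from the previous blocks:

    # Adding a raw Age*Height interaction inflates the VIFs dramatically,
    # because age*height is highly correlated with age and height themselves.
    df["intag_ht"] = df["age"] * df["height"]
    X2 = sm.add_constant(df[["age", "height", "intag_ht"]])
    for i, name in enumerate(X2.columns[1:], start=1):
        print(name, "VIF =", round(variance_inflation_factor(X2.values, i), 1))
    # Centering the variables before forming the product (e.g., using
    # age - mean(age)) is a common way to reduce this inflation.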

If only the male children or only the female children are modeled, the SPSS coefficients table for each model y = α + β1x1 + β2x2 + ε is:

For boys:
              B          Std. Error    Beta    t        Sig.    Tolerance    VIF
  (Constant)  -113.73    15.590                -7.294   .000
  Age         .308       .084          .289    3.672    .000    .443         2.259
  Height      2.68       .368          .574    7.283    .000    .443         2.259

Is there serious collinearity? Explain using the statistics in the table.
Write the weight prediction equation using age and height as predictor variables.
Find the average weight for boys that are 144 months old and 55 inches tall.

For girls:
              B          Std. Error    Beta    t        Sig.    Tolerance    VIF
  (Constant)  -150.597   20.767                -7.252   .000
  Age         .19        .076          .186    2.524    .013    .704         1.420
  Height      3.4        .385          .650    8.838    .000    .704         1.420

Is there serious collinearity? Explain using the statistics in the table.
Write the weight prediction equation using age and height as predictor variables.
Find the average weight for girls that are 144 months old and 55 inches tall.
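Fitting the same model separately within each gender takes one fit per group. A sketch, reusing the simulated df from above and assigning a hypothetical gender column at random:

    # Fit the same age + height model separately for boys and girls.
    # The gender column here is randomly assigned, purely for illustration.
    df["gender"] = rng.choice(["male", "female"], size=len(df))
    for sex, sub in df.groupby("gender"):
        sub_fit = smf.ols("weight ~ age + height", data=sub).fit()
        print(sex, sub_fit.params.round(3).to_dict())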

Indicator Variables

Indicator (binary) variables take only two possible values, 0 and 1, and can be used to include categorical variables in the model. Male: 1; Female: 0.

Group Statistics (Weight)
            N      Mean      Std. Deviation    Std. Error Mean
  Male      126    103.448   19.968            1.779
  Female    111    98.878    18.66             1.767

Model: y = α + β1x1 + ε, where y: Weight, x1: Gender (x1 = 0 for female, x1 = 1 for male). (This model describes the two-independent-samples situation under the equal-variances condition.)
  When x1 = 0: y = α + ε
  When x1 = 1: y = α + β1 + ε
The difference between the averages of the two categories is β1.

SPSS output for linear regression with Gender as the predictor variable:
              B        Std. Error    Beta    t        Sig.
  (Constant)  98.878   1.836                 53.846   .000
  Gender      4.570    2.518         .118    1.815    .071

SPSS output for the two-independent-samples t-test comparing the mean weight of male and female children:

Independent Samples Test (Weight)
  Levene's test for equality of variances: F = .630, Sig. = .428
                                t       df        Sig. (2-tailed)   Mean Diff.   Std. Error Diff.   95% CI of the Difference
  Equal variances assumed       1.815   235       .071              4.570        2.518              (-.392, 9.532)
  Equal variances not assumed   1.823   234.233   .070              4.570        2.507              (-.370, 9.510)

The relation between Weight and Gender is insignificant; equivalently, there is no significant difference between the average weights of male and female children.
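The equivalence between the 0/1-indicator regression and the pooled two-sample t-test can be verified numerically. A sketch using scipy alongside statsmodels (gender was assigned at random above, so a non-significant result is expected):

    # Regressing weight on a 0/1 gender indicator reproduces the pooled
    # two-sample t-test: same t statistic, same p-value.
    from scipy import stats

    df["male"] = (df["gender"] == "male").astype(int)
    reg = smf.ols("weight ~ male", data=df).fit()
    tt = stats.ttest_ind(df.loc[df.male == 1, "weight"],
                         df.loc[df.male == 0, "weight"],
                         equal_var=True)
    print("regression: t =", round(reg.tvalues["male"], 3),
          "p =", round(reg.pvalues["male"], 4))
    print("t-test:     t =", round(tt.statistic, 3), "p =", round(tt.pvalue, 4))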

Use of Indicator Variables in the Regression Model

Age, Height, and Gender as predictor variables. Model: y = α + β1x1 + β2x2 + β3x3 + ε, where y: Weight, x1: Age, x2: Height, x3: Gender (x3 = 0 for female; x3 = 1 for male).

Model Summary
  R = .794, R Square = .631, Adjusted R Square = .626, Std. Error of the Estimate = 11.893
  a. Predictors: (Constant), Age, Height, Gender

[Scatter plot of Weight, with the male and female children marked separately]

Coefficients
              B          Std. Error    Beta    t         Sig.    Tolerance    VIF
  (Constant)  -128.209   12.264                -10.454   .000
  Age         .238       .056          .226    4.250     .000    .562         1.78
  Height      3.105      .267          .630    11.62     .000    .539         1.854
  Gender      -.338      1.61          -.009   -.21      .834    .932         1.073

With the Age and Height variables in the model, the Gender variable becomes insignificant: when the difference in average weight between the genders is compared adjusted for the age and height variables, the difference is statistically insignificant.

Age, Height, Gender, and the Age-Height interaction (INTAG_HT) as predictor variables:

[SPSS coefficient table: with INTAG_HT added, every predictor becomes insignificant and the VIF values explode (e.g., 264.838 for Age and 528.420 for INTAG_HT).]

Adding the interaction term to the model increases the VIFs of the model estimation. The model without the interaction term is better.

Age and Gender as Predictor Variables

Model: y = α + β1x1 + β2x2 + ε, where y: Weight, x1: Age, x2: Gender (x2 = 0 for female; x2 = 1 for male).

Model Summary
  R = .645, R Square = .416, Adjusted R Square = .411, Std. Error of the Estimate = 14.95
  a. Predictors: (Constant), Age, Gender

[Scatter plot of Weight against Age, with the male and female children marked separately]

Coefficients
              B        Std. Error    Beta    t        Sig.    Tolerance    VIF
  (Constant)  -11.18   8.778                 -1.274   .204
  Age         .669     .053          .634    12.705   .000    1.000        1.000
  Gender      4.539    1.942         .117    2.338    .020    1.000        1.000

Age and Gender are both significant variables when they are used to predict weight. There is a significant difference in average weight between the genders when adjusted for the age variable.

Exercise: What would be the average weight of 4-year-old (48-month-old) boys using the model above?

Age, Gender, and the Gender-Age interaction (INTGN_AG) as predictor variables:

[SPSS coefficient table: with INTGN_AG added, Gender is no longer significant (t = -1.729, Sig. = .085) and the VIF of the interaction term reaches 82.6.]

Adding the interaction term to the model increases the VIFs of the model estimation. The model without the interaction term is better.

A common mistake is to use the internally coded values of a categorical explanatory variable directly in the linear regression calculation. The proper way to include a categorical variable is to use indicator variables: for a categorical variable with k categories, one should set up k - 1 indicator variables.

Example: A survey question asked Race, with 3 possible responses: White = 1, Black = 2, Hispanic = 3. One can set up an indicator variable x1 such that x1 = 1 represents White and x1 = 0 otherwise, and another indicator x2 such that x2 = 1 represents Black and x2 = 0 otherwise; x1 = 0 and x2 = 0 then represents Hispanic. The survey also asked for Body Fat Percentage and the Number of hours of exercise per week.

Model: y = α + β1x1 + β2x2 + β3x3 + ε, where y: Body Fat Percentage, x1 and x2: Race indicators, x3: Number of hours of exercise per week.

Interpretation of the model:
  Race: White, x1 = 1 and x2 = 0: y = α + β1 + β3x3 + ε
  Race: Black, x1 = 0 and x2 = 1: y = α + β2 + β3x3 + ε
  Race: Hispanic, x1 = 0 and x2 = 0: y = α + β3x3 + ε

Exercise: Suppose that the estimated parameter values for the model are α̂ = 20, β̂1 = 2.1, β̂2 = 1.3, β̂3 = 1.1.
  Write down the prediction equation:
  Estimate the average body fat percentage for a white person who exercises 10 hours per week:
  Estimate the average body fat percentage for a black person who exercises 10 hours per week:
  Estimate the average body fat percentage for a Hispanic person who exercises 10 hours per week:
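Statistical software can build the k - 1 indicators automatically. A sketch of this dummy coding in Python, on a small invented survey (all values are made up for illustration):

    # Including a 3-category Race variable via k - 1 = 2 indicator variables.
    # Hypothetical survey data, invented for illustration.
    import pandas as pd
    import statsmodels.formula.api as smf

    survey = pd.DataFrame({
        "bodyfat":  [24.0, 31.5, 22.1, 28.3, 26.7, 30.2],
        "race":     ["White", "Black", "Hispanic", "White", "Black", "Hispanic"],
        "exercise": [5, 2, 8, 3, 4, 1],
    })
    # C(...) with treatment coding creates the two 0/1 indicators, using
    # Hispanic as the reference category (x1 = x2 = 0), as in the notes.
    fit = smf.ols("bodyfat ~ C(race, Treatment(reference='Hispanic')) + exercise",
                  data=survey).fit()
    print(fit.params)  # intercept plays the role of alpha; race terms are beta1, beta2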

Example: Study female life expectancy using the percentage of urbanization and the birth rate.

[Scatter plots of Female life expectancy 1992 against Percent urban 1992 and against Births per 1000 population 1992]

Model: y = α + β1x1 + β2x2 + ε, where y: Female life expectancy 1992, x1: Births per 1000 population 1992 (birth rate), x2: Percent urban 1992 (percentage of urbanization).

Evaluation of the model: (SPSS output)

Model Summary
  R = .904, R Square = .817, Adjusted R Square = .813, Std. Error of the Estimate = 4.89
  a. Predictors: (Constant), Births per 1000 population 1992, Percent urban 1992

Coefficient of determination (R Square): the percentage of the variability in the response variable (female life expectancy) that can be described by the predictor variables (birth rate, percentage of urbanization) through the model.

ANOVA
              Sum of Squares    df     Mean Square    F          Sig.
  Regression     12577.056        2       6288.528    262.595    .000
  Residual        2825.820      118         23.948
  Total          15402.876      120
  a. Predictors: (Constant), Births per 1000 population 1992, Percent urban 1992   b. Dependent Variable: Female life expectancy 1992

Test for significance of the model: p-value = .000 < .05
  Ho: The model is insignificant (the βi's are all zero).
  Ha: The model is significant (not all of the βi's are zero).

Coefficient estimation: (SPSS output; Dependent Variable: Female life expectancy 1992)

                                    B       Std. Error    Beta    t          Sig.    Tolerance    VIF
  (Constant)                        76.26   2.43                  31.350     .000
  Births per 1000 population, 1992  -.555   .045          -.648   -12.196    .000    .55          1.84
  Percent urban, 1992               .154    .025          .331    6.238      .000    .55          1.84

Inference for the regression coefficients:
  Ho: α = 0 vs. Ha: α ≠ 0, p-value = .000 < .05
  Ho: β1 = 0 vs. Ha: β1 ≠ 0, p-value = .000 < .05
  Ho: β2 = 0 vs. Ha: β2 ≠ 0, p-value = .000 < .05

Collinearity* statistics: a tolerance less than 0.1, or a VIF (Variance Inflation Factor) greater than 10, implies serious collinearity.
* Collinearity: significant correlations between pairs of independent variables in the model.

Tests for the regression coefficients: all parameters in the model, α, β1, and β2, are statistically significant.

Least-squares regression equation: ŷ = 76.26 - .555x1 + .154x2 (for estimating the expected response value)

For example, the average female life expectancy for countries whose birth rate is 30 per 1000 and whose percentage of urbanization is … would be 76.26 - 0.555(30) + 0.154(…) = 65.726, as estimated by the model.

How should α, β1, and β2 be interpreted?
  α is the constant, or y-intercept, of the model. It is the average value of the response variable when both predictor variables are 0.
  β1 is the rate of change of the expected (average) life expectancy per unit change of the birth rate, adjusted for the percentage of urbanization.
  β2 is the rate of change of the expected (average) life expectancy per unit change of the percentage of urbanization, adjusted for the birth rate.

Other possible models (x1: Birth rate, x2: Percent urbanization):
  y = α + β1x1 + ε
  y = α + β2x2 + ε
With an interaction effect (x1x2):
  y = α + β1x1 + β2x2 + β3x1x2 + ε
  y = α + β1x1 + β3x1x2 + ε
  y = α + β2x2 + β3x1x2 + ε

Understanding female life expectancy and how it is related to the explanatory variables Birth Rate, Urbanization, Phones, Doctors, and GDP.

Variables before transformation: Female life expectancy 1992; Births per 1000 population 1992; Percent urban 1992; Phones per 100 people; Doctors per 10,000 people; GDP per capita.
After a log transformation of Phones, Doctors, and GDP: Natural log of phones per 100 people; Natural log of doctors per 10,000 people; Natural log of GDP.

Model Summary
  R = .934, R Square = .873, Adjusted R Square = .867, Std. Error of the Estimate = 4.08, Durbin-Watson = 2.03
  a. Predictors: (Constant), Natural log of GDP, Percent urban 1992, Births per 1000 population 1992, Natural log of doctors per 10,000, Natural log of phones per 100 people
  b. Dependent Variable: Female life expectancy 1992
  (The Durbin-Watson statistic checks the independence of the errors.)

ANOVA (same predictors; Dependent Variable: Female life expectancy 1992)
              Sum of Squares    df     Mean Square    F          Sig.
  Regression     12123.330        5       2424.666    145.342    .000
  Residual        1768.348      106         16.683
  Total          13891.679      111

Coefficients (Dependent Variable: Female life expectancy 1992)
                                          B           Std. Error    Beta    t        Sig.    Tolerance    VIF
  (Constant)                              77.448      5.829                 13.287   .000
  Births per 1000 population, 1992        -.272       .058          -.319   -4.659   .000    .256         3.903
  Percent urban, 1992                     1.937E-02   .031          .043    .629     .531    .263         3.8
  Natural log of phones per 100 people    3.175       .679          .552    4.675    .000    .086         11.590
  Natural log of doctors per 10,000       1.894       .593          .262    3.194    .002    .178         5.61
  Natural log of GDP                      -1.390      .784          -.190   -1.772   .079    .105         9.543

Multicollinearity: Tolerance measures the strength of the linear relation between an independent variable and the remaining independent variables; it should be higher than 0.1. VIF is the reciprocal of Tolerance. Here the natural log of phones per 100 people (VIF = 11.590) shows serious collinearity.
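A sketch of the same workflow, log-transforming a right-skewed predictor and checking the Durbin-Watson statistic, on invented data (the UN country data used here is not reproduced in these notes):

    # Log-transforming a right-skewed predictor, then checking error
    # independence with the Durbin-Watson statistic. Invented data.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.stattools import durbin_watson

    rng = np.random.default_rng(1)
    n = 112
    gdp = rng.lognormal(8, 1.2, n)                       # right-skewed, like GDP per capita
    life = 45 + 2.2 * np.log(gdp) + rng.normal(0, 4, n)  # hypothetical relation
    countries = pd.DataFrame({"life": life, "gdp": gdp})
    countries["ln_gdp"] = np.log(countries["gdp"])

    fit = smf.ols("life ~ ln_gdp", data=countries).fit()
    print(fit.rsquared, durbin_watson(fit.resid))  # DW near 2 suggests independent errors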

Step-wise selection: ANOVA for the sequence of models (Dependent Variable: Female life expectancy 1992)

             Sum of Squares    df     Mean Square    F          Sig.
  Model 1
    Regression   11159.884       1      11159.884    449.370    .000
    Residual      2731.795     110         24.834
    Total        13891.679     111
  Model 2
    Regression   11830.842       2       5915.421    312.873    .000
    Residual      2060.836     109         18.907
    Total        13891.679     111
  Model 3
    Regression   12069.502       3       4023.167    238.452    .000
    Residual      1822.177     108         16.872
    Total        13891.679     111

  Model 1 predictors: (Constant), Natural log of phones per 100 people
  Model 2 predictors: (Constant), Natural log of phones per 100 people, Births per 1000 population 1992
  Model 3 predictors: (Constant), Natural log of phones per 100 people, Births per 1000 population 1992, Natural log of doctors per 10,000

Coefficients at each step (Dependent Variable: Female life expectancy 1992)
                                             B        Std. Error    Beta    t         Sig.    Tolerance    VIF
  Model 1
    (Constant)                               …        .562                  107.84    .000
    Natural log of phones per 100 people     5.16     .243          .896    21.198    .000    1.000        1.000
  Model 2
    (Constant)                               72.566   2.119                 34.239    .000
    Natural log of phones per 100 people     3.352    .370          .582    9.048     .000    .329         3.042
    Births per 1000 population, 1992         -.327    .055          -.383   -5.957    .000    .329         3.042
  Model 3
    (Constant)                               68.76    2.37                  29.48     .000
    Natural log of phones per 100 people     2.386    .434          .414    5.496     .000    .214         4.682
    Births per 1000 population, 1992         -.246    .056          -.288   -4.364    .000    .28          3.576
    Natural log of doctors per 10,000        2.054    .546          .284    3.76      .000    .213         4.706

What are the significant factors that are related to female life expectancy?

In step-wise regression a large number of tests are performed, which leads to a higher probability of a Type I or Type II error. It should be used when one wants to identify the important independent variables from among a large number of potentially useful variables in the modeling process.
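SPSS's step-wise procedure is not reproduced exactly by common Python libraries, but forward selection by p-value, one reasonable variant, is a short loop. A minimal sketch of the idea; the function name and threshold are illustrative, not SPSS's algorithm:

    # A minimal forward-selection sketch: at each step add the candidate
    # predictor with the smallest p-value, stopping when none is below alpha.
    # Illustrative only -- not SPSS's exact step-wise algorithm.
    import statsmodels.formula.api as smf

    def forward_select(data, response, candidates, alpha=0.05):
        candidates = list(candidates)   # work on a copy
        selected = []
        while candidates:
            # p-value of each remaining candidate, given what is already in
            pvals = {}
            for var in candidates:
                rhs = " + ".join(selected + [var])
                pvals[var] = smf.ols(f"{response} ~ {rhs}", data=data).fit().pvalues[var]
            best = min(pvals, key=pvals.get)
            if pvals[best] >= alpha:
                break
            selected.append(best)
            candidates.remove(best)
        return selected

    # e.g., with the invented frame from the previous sketch:
    # forward_select(countries, "life", ["ln_gdp"])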

Uses of regression analysis:
1. Description (model, system, relation): the relation between life expectancy, birth rate, GDP, ...; the relation between salary, rank, years of service, ...
2. Control: died too young, underpaid, overpaid, ...
3. Prediction: life expectancy, the salary of newcomers, future salary, ...
4. Variable screening (important factors): What are the important factors affecting salary or life expectancy?

Construction of regression models:
1. Hypothesize the form of the model for µ_{y|x1,x2,x3,...,xq}:
   a) selecting the predictor variables;
   b) deciding the functional form of the regression equation;
   c) defining the scope of the model (design range).
2. Collect the sample data (observations, experiments).
3. Use the sample to estimate the unknown parameters in the model.
4. Specify the probability distribution of the random error.
5. Statistically check the usefulness of the model.
6. Apply the model in decision making.
7. Review the model with new data.

What is a linear model? Examples of linear models:
  y = β0 + β1x + ε
  y = β0 + β1x1 + β2x2 + ε
  y = β0 + β1x1 + β2x2 + β3x1x2 + ε
  y = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2² + ε
  y = β0 + β1 ln(x) + ε
  y = β0 + β1 e^x + ε
A model is a linear model if it is linear in terms of its parameters.
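Each of these models is fit by ordinary least squares after transforming the predictors, which is exactly what "linear in the parameters" buys. A final sketch on invented data:

    # "Linear model" means linear in the parameters: y = b0 + b1*ln(x) + e
    # is still a linear model, fit by OLS on the transformed predictor.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    x = rng.uniform(1, 50, 200)
    y = 3 + 2 * np.log(x) + rng.normal(0, 0.5, 200)   # invented data

    X = sm.add_constant(np.log(x))     # the transformation happens before OLS
    print(sm.OLS(y, X).fit().params)   # recovers approximately [3, 2]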