Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore


What is Multiple Linear Regression? Several independent variables may influence the change in the response variable we are trying to study. When several independent variables are included in the equation, the regression is called multiple linear regression.

Multiple Regression Modeling Steps: 1. Start with a hypothesis or belief. 2. Estimate the unknown model parameters (β0, β1, β2, ...). 3. Specify the probability distribution of the random error term (assumed to be a normal distribution). 4. Check the assumptions of regression (normality, heteroscedasticity and multicollinearity). 5. Evaluate the model. 6. Use the model for prediction, estimation and other purposes.

Multiple Linear Regression Model: The relationship between 1 dependent and 2 or more independent variables is a linear function:

Yi = β0 + β1 X1i + β2 X2i + ... + βk Xki + εi

where Yi is the dependent (response) variable, X1i, ..., Xki are the independent (explanatory) variables, β0 is the population Y-intercept, β1, ..., βk are the population slopes, and εi is the random error.

Prediction Model: The prediction equation obtained from sample data is

Ŷi = β̂0 + β̂1 X1i + β̂2 X2i + ... + β̂k Xki

Multiple Regression Model: [Figure: response plane over the (X1, X2) plane.] Observed values follow Yi = β0 + β1 X1i + β2 X2i + εi, where Yi = Treatment Cost, X1i = Body Weight and X2i = HR Pulse; the response plane is E(Y|X) = β0 + β1 X1i + β2 X2i.

Estimating and Interpreting the β-Parameters. Belief: Y = β0 + β1 x1 + ... + βk xk + ε. Fitted model: ŷ = β̂0 + β̂1 x1 + ... + β̂k xk, obtained using the least squares method: minimize SSE = Σ (y - ŷ)².

Estimation of Parameters in Multiple Regression: The least squares function is given by

SSE = Σi=1..n (yi - β0 - Σj=1..k βj xij)²

The least squares estimates must satisfy

∂SSE/∂β0 = -2 Σi=1..n (yi - β̂0 - Σj=1..k β̂j xij) = 0

and

∂SSE/∂βj = -2 Σi=1..n (yi - β̂0 - Σj=1..k β̂j xij) xij = 0,  j = 1, 2, ..., k

Estimation of Parameters in Multiple Regression: The solutions to the above (normal) equations are the least squares estimators of the regression coefficients.
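As an illustration, here is a minimal sketch (not from the lecture) of solving the least squares problem with NumPy; the simulated data and coefficient values are made up for demonstration.

```python
# Minimal least squares sketch with NumPy (illustrative, not the DAD data).
import numpy as np

def fit_ols(X, y):
    """Estimate [b0, b1, ..., bk] by least squares."""
    Xd = np.column_stack([np.ones(len(X)), X])   # design matrix with intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                    # two explanatory variables
y = 5 + 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=100)
print(fit_ols(X, y))                             # approximately [5, 2.0, 1.5]
```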

DAD Example: Total Hospital Cost = β0 + β1 × Body weight + β2 × Body height + β3 × HR Pulse + β4 × RR

MODEL: Treatment Cost = -118569.892 + 2512.260 × Body weight + 161.083 × Body height + 1537.105 × HR Pulse + 2560.631 × RR

Interpretation of Coefficients (β1), the partial regression coefficient: the total cost of treatment increases by 2512.260 INR for every one-kg increase in body weight while the other variables (body height, HR pulse and RR) are controlled (kept constant). Treatment Cost = -118569.892 + 2512.260 × Body weight + 161.083 × Body height + 1537.105 × HR Pulse + 2560.631 × RR

Interpretation of Coefficients (β3), the partial regression coefficient: the total cost of treatment increases by 1537.105 INR for every one-unit increase in HR pulse while the other variables (body weight, body height and RR) are controlled. Treatment Cost = -118569.892 + 2512.260 × Body weight + 161.083 × Body height + 1537.105 × HR Pulse + 2560.631 × RR

Standardized Beta (Beta Weights): The beta weights are the regression coefficients for standardized data. The standardized beta is the average increase in the dependent variable, in standard deviation units, when the independent variable is increased by one standard deviation and the other independent variables are held constant.

Beta Weights: Standardized Beta_i = β̂i × (Sx / Sy), where Sx is the standard deviation of X and Sy is the standard deviation of Y.
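A one-line sketch of that rescaling, assuming beta_i is the raw (unstandardized) coefficient and x, y are the corresponding data columns:

```python
import numpy as np

def standardized_beta(beta_i, x, y):
    """Beta weight: raw coefficient times S_x / S_y (sample standard deviations)."""
    return beta_i * np.std(x, ddof=1) / np.std(y, ddof=1)
```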

Partial Correlation: The partial correlation coefficient measures the relationship between two variables (say Y and X1) when the influence of all other variables (say X2, X3, ..., Xk) connected with these two variables is removed. Partial correlation is the correlation between the residualized response and the residualized predictor.

Semi-Partial (Part) Correlation: The semi-partial (or part) correlation coefficient measures the relationship between two variables (say Y and X1) when the influence of all other variables (say X2, X3, ..., Xk) is removed from only one of the variables (X1). Only the predictor is residualized. The denominator of the coefficient (the total variance of the response variable, Y) remains the same no matter which predictor is being examined.

[Venn diagram: overlap of variance among Y, X1 and X2, with region A unique to X1 and Y, region B unique to X2 and Y, region C shared by X1, X2 and Y, and region D the unexplained part of Y.] Variance in Y = A + B + C + D. Variance in Y explained by the model = A + B + C. Variance in Y not explained by the model = D. Squared semi-partial (part) correlation for X1 = A / (A + B + C + D); for X2 = B / (A + B + C + D). Squared partial correlation for X1 = A / (A + D); for X2 = B / (B + D). IMPORTANT: in part correlation we do not residualize the response variable.

Partial Correlation:

r12,3 = (r12 - r13 r23) / √[(1 - r13²)(1 - r23²)]  (correlation between y1 and x2, when the influence of x3 is removed from both y1 and x2)

r13,2 = (r13 - r12 r23) / √[(1 - r12²)(1 - r23²)]  (correlation between y1 and x3, when the influence of x2 is removed from both y1 and x3)

r23,1 = (r23 - r12 r13) / √[(1 - r12²)(1 - r13²)]  (correlation between x2 and x3, when the influence of y1 is removed from both x2 and x3)

For proof: Theory of Econometrics by A. Koutsoyiannis.

Semi-partial Correlation (Part Correlation): The semi-partial (or part) correlation sr12,3 is the correlation between y1 and x2 when the influence of x3 is partialled out of x2 (but not out of y1):

sr12,3 = (r12 - r13 r23) / √(1 - r23²)

Part Correlation = Partial Correlation × √(1 - r13²)

Semi-partial (part) correlation: The square of the semi-partial (or part) correlation gives the increase in the R² value (coefficient of multiple determination) when an explanatory variable is added to the model.
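The residualization described above can be sketched directly; `controls` is an assumed list of the other predictor columns (X2, ..., Xk).

```python
import numpy as np

def _residualize(v, controls):
    """Residuals of v after regressing it on the control variables."""
    Z = np.column_stack([np.ones(len(v))] + list(controls))
    coef, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return v - Z @ coef

def partial_corr(y, x1, controls):
    """Both the response and the predictor are residualized."""
    return np.corrcoef(_residualize(y, controls), _residualize(x1, controls))[0, 1]

def semipartial_corr(y, x1, controls):
    """Only the predictor is residualized; its square is the gain in R^2."""
    return np.corrcoef(y, _residualize(x1, controls))[0, 1]
```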

Y = β0 + β1 × Body weight + β2 × HR Pulse. The squared correlation of body weight with Y is 0.348² = 0.121; adding the squared part correlation of HR pulse gives R² = 0.121 + (0.228)² = 0.173.

Change in R² when a new variable is added: [Venn diagram of Y, X1 and X2.] Part A is the variance specific to X1, part C the variance common to X1 and X2, and part B the variance specific to X2; the left-over, unexplained variance is 1 - R².

Multiple Regression Model Diagnostics: Test for overall model fitness (R-square and adjusted R-square). Test for overall model statistical significance (F-test). Test for portions of the model (partial F-test). Test for statistical significance of individual explanatory variables (t-test). Test for normality and homoscedasticity of residuals. Test for multicollinearity and autocorrelation.

Coefficient of multiple determination:

R² = SSR / SST = 1 - SSE / SST = Σ(ŷi - ȳ)² / Σ(yi - ȳ)²

The multiple coefficient of determination R² is the percentage of the variation of y explained by x1, x2, ..., xk.

Y = β0 + β1 × Body Weight + β2 × HR Pulse. R² is 0.173: 17.3% of the variation in Y is explained by body weight and HR pulse.

Y = -55512.272 + 2674.585 × Body Weight + 1668.361 × HR Pulse

Coefficient of determination in Multiple Regression: The coefficient of determination increases as the number of explanatory variables increases. In SSR/SST, the numerator SSR increases as explanatory variables are added, whereas the denominator SST remains constant. An increase in R² can therefore be deceptive, since additional explanatory variables may over-fit the data.

Adjusted R²: Inclusion of an additional explanatory variable will increase the R² value, because it increases the numerator of the expression for R² while the denominator remains the same. To correct this defect, we adjust R² by taking into account the degrees of freedom.

Adjusted R²:

R²_A = 1 - [SSE / (n - k - 1)] / [SST / (n - 1)]

where R²_A is the adjusted R-square, n the number of observations and k the number of explanatory variables.
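A direct translation of the two formulas, assuming y_hat holds the fitted values and k the number of explanatory variables:

```python
import numpy as np

def r2_and_adjusted(y, y_hat, k):
    """R^2 = 1 - SSE/SST; adjusted R^2 penalizes by degrees of freedom."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - np.mean(y)) ** 2)
    r2 = 1 - sse / sst
    r2_adj = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
    return r2, r2_adj
```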

Y = β0 + β1 × Body Weight + β2 × HR Pulse. Adjusted R² is 0.167.

Testing for Overall Significance of the Model (F-Test):

H0: β1 = β2 = ... = βk = 0
HA: not all β values are zero

This is a test for the overall significance of the multiple regression model; it checks whether there is a statistically significant relationship between Y and any of the explanatory variables (X1, X2, ..., Xk).

F Statistic: F = MSR / MSE. Relationship between F and R²:

F = (R² / k) / ((1 - R²) / (n - k - 1))
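A sketch of the F statistic computed from R², with the p-value taken from the F(k, n - k - 1) distribution via SciPy:

```python
from scipy import stats

def overall_f(r2, n, k):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1)) and its p-value."""
    f = (r2 / k) / ((1 - r2) / (n - k - 1))
    p = stats.f.sf(f, k, n - k - 1)   # survival function = P(F > f)
    return f, p
```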

DAD Case, ANOVA Table: F = 3.216 × 10^11 / 1.2524679898 × 10^10 ≈ 25.68

Test for Individual Variables

Testing for Significance of Individual Parameters (t-test):

H0: βi = 0
HA: βi ≠ 0

t = β̂i / Se(β̂i)

By rejecting the null hypothesis, we can claim that there is a statistically significant relationship between the response variable Y and the explanatory variable Xi.

SPSS Output for Coefficients

Testing Model Portions Partial F Test Examines the contribution of a set of explanatory variables to the regression model. Used for selecting explanatory variables in regression model building strategies such as stepwise regression.

Testing Model Portions (Partial F-Test):

Full model: Y = β0 + β1 X1 + β2 X2 + ... + βk Xk
Reduced model (r < k): Y = β0 + β1 X1 + β2 X2 + ... + βr Xr

Test H0: βr+1 = ... = βk = 0

Partial F = [(SSE_reduced - SSE_full) / (k - r)] / MSE_full

where k is the number of variables in the full model and r the number of variables in the reduced model.

Partial F-test Example (DAD):

Full model: Treatment cost = β0 + β1 Bodyweight + β2 Bodyheight + β3 HRpulse + β4 RR
Reduced model: Treatment cost = β0 + β1 Bodyweight + β2 HRpulse

F = [(SSE_reduced - SSE_full) / (4 - 2)] / MSE_full

F = [(SSE_reduced - SSE_full) / (4 - 2)] / MSE_full = [(3.069 × 10^12 - 3.046 × 10^12) / 2] / 1.253 × 10^10 = 0.9172, with p-value = 0.4700.
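The computation is easy to reproduce; plugging in the slide's SSE and MSE figures recovers F ≈ 0.92 (the error degrees of freedom, needed for the p-value, are not shown on the slide).

```python
def partial_f(sse_reduced, sse_full, mse_full, k, r):
    """Partial F = [(SSE_reduced - SSE_full) / (k - r)] / MSE_full."""
    return ((sse_reduced - sse_full) / (k - r)) / mse_full

print(partial_f(3.069e12, 3.046e12, 1.253e10, k=4, r=2))  # ~0.92
```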

Categorical Variables in Regression: Many regression models include qualitative (categorical) variables as explanatory variables. Qualitative (categorical) variables in regression are replaced with dummy variables (or indicator variables). A categorical variable with n levels is replaced with (n - 1) dummy variables. The category to which no dummy variable is assigned is known as the base category.

Dummy Variables: When there is more than one qualitative variable, it is advisable to use (n - 1) dummy variables for each qualitative variable along with the intercept. Use of n dummy variables along with the intercept will result in multicollinearity, known as the dummy variable trap.
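A small pandas sketch of (n - 1) dummy coding; the toy data frame is illustrative. drop_first=True leaves out the base category and so avoids the dummy variable trap.

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"],
                   "BodyWeight": [62, 48, 55, 70]})
X = pd.get_dummies(df, columns=["Gender"], drop_first=True)
print(X)   # keeps only Gender_Male; "Female" is the base category
```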

Dummy Variables in Regression: The intercept, β0, is the mean value of the base category. The coefficients attached to the dummy variables are called differential intercept coefficients.

Qualitative Variables DAD Case 1. Gender (2 levels) 2. Marital Status (2 levels) 3. Key Complaints (6 levels)

Model with a Dummy Variable: Y = β0 + β1 × Female. Y for Male = 211868.724; Y for Female = 211868.724 - 39756.800 = 172111.924.

Y = β0 + β1 × Female + β2 × Body Weight

Impact of a Dummy Variable: [Plot: two parallel regression lines. For males (X1 = 0), Y = β0 + β2 X2; for females (X1 = 1), Y = (β0 + β1) + β2 X2. The dummy variable shifts the intercept while the slope stays the same.]

Qualitative Variable with Multiple Levels, Key Complaints: 1. ACHD, 2. CAD, 3. NONE, 4. OS-ASD, 5. OTHER, 6. RHD.

Regression Model Building: Derived variables may help to explain the variation in the response variable. For example, consider a credit rating problem used for approving housing loans; a bank may use factors such as the loan amount and the value of the property.

Customer   Loan Amount
1          250,000
2          350,000
3          750,000

Derived Variables:

Customer   Loan Amount   Value of Property   Loan / Value
1          250,000       300,000             0.83
2          350,000       500,000             0.70
3          750,000       1,000,000           0.75
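Creating such a derived variable is a one-liner in pandas; the column names below are assumptions for illustration.

```python
import pandas as pd

loans = pd.DataFrame({"LoanAmount":    [250_000, 350_000, 750_000],
                      "PropertyValue": [300_000, 500_000, 1_000_000]})
loans["LoanToValue"] = loans["LoanAmount"] / loans["PropertyValue"]
print(loans["LoanToValue"].round(2))   # 0.83, 0.70, 0.75
```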

Interaction Variables: Consider a regression model of the type Y = β0 + β1 X1 + β2 X2 + β3 X1X2, where X1X2 is the interaction variable. The interaction variable is usually an interaction between a quantitative variable and a qualitative variable.

One of the popular applications of regression is to study gender discrimination. Consider a regression model with salary as the response variable Y: Y = β0 + β1 Gender + β2 Work Experience + β3 Gender × Work Experience

Interaction Variable: Let Gender = 1 denote Female. Then Y for Female is Y = β0 + β1 + (β2 + β3) × Work Experience, and Y for Male is Y = β0 + β2 × Work Experience.
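A sketch of building the interaction column, assuming Gender is already coded 0/1 with 1 = Female:

```python
import pandas as pd

df = pd.DataFrame({"Gender": [1, 0, 1, 0],            # 1 = Female (assumed coding)
                   "WorkExperience": [2, 5, 7, 10]})
df["Gender_x_WorkExp"] = df["Gender"] * df["WorkExperience"]
# Regressing salary on these three columns lets the experience slope differ
# by gender: beta2 for males, beta2 + beta3 for females.
```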

Conditional relationship in Interaction Variables

DAD Example, Interaction Variable: Y = β0 + β1 Marital Status + β2 Body Weight + β3 Marital Status × Body Weight

Multicollinearity: High correlation between explanatory variables is called multicollinearity. Multicollinearity leads to unstable coefficients. It always exists; it is a matter of degree.

Multicollinearity: Y = β0 + β1 Marital Status + β2 Body Weight + β3 Marital Status × Body Weight (the interaction term is correlated with its component variables).

Symptoms of Multicollinearity: High R² but few significant t-ratios. The F-test rejects the null hypothesis, but none of the individual t-tests are rejected. Correlations between pairs of X variables are higher than their correlations with Y.

Effects of Multicollinearity: The variances of the regression coefficient estimators are inflated. The magnitudes of the regression coefficient estimates may differ from what is expected. Adding or removing variables produces large changes in the coefficient estimates. Regression coefficients may have the opposite sign.

Identifying Multicollinearity, Variance Inflation Factor: The variance inflation factor (VIF) is a relative measure of the increase in the variance (and hence the standard error) of a beta coefficient because of collinearity. A VIF greater than 10 indicates that collinearity is very high; a VIF value of more than 4 is often considered unacceptable.

Variance of Beta Estimates ) (1 ) ( ) (1 ) ( 2 12 1 2 2 2 12 2 2 1 2 2 1 1 0 r SS S Var r SS S Var X X Y X e X e

Variance Inflation Factor: The variance inflation factor associated with a variable Xj is given by

VIF(Xj) = 1 / (1 - R²j)

where R²j is the coefficient of determination for the regression of Xj (as dependent variable) on the other explanatory variables. The standard error of the corresponding beta is inflated by √VIF (the variance is inflated by VIF).

Impact of VIF on Se(β):

R_j    Tolerance (1 - R²_j)   VIF    Impact on Se(β) = √VIF
0      1.00                   1.00   1.00
0.4    0.84                   1.19   1.09
0.6    0.64                   1.56   1.25
0.8    0.36                   2.78   1.67
0.87   0.25                   4.00   2.00
0.9    0.19                   5.26   2.29
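VIF can be computed directly from its definition by regressing each Xj on the remaining predictors; a minimal NumPy sketch:

```python
import numpy as np

def vif(X, j):
    """VIF(X_j) = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j of X on all the other columns (plus an intercept)."""
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(Z, X[:, j], rcond=None)
    resid = X[:, j] - Z @ coef
    r2_j = 1 - resid.var() / X[:, j].var()
    return 1.0 / (1.0 - r2_j)
```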

DAD Caselet: For body weight, the observed t = 1.272 and VIF = 4.681; in the absence of collinearity the t value would be 1.272 × √4.681 = 2.752, with corresponding p-value 0.006.

Remedies for Multicollinearity: Drop a variable (may result in specification bias); principal component analysis (PCA); partial least squares (PLS).

Forward Selection Method: A variable selection procedure in which variables are sequentially entered into the model. The first variable considered for entry is the one with the smallest p-value based on the F-test.

Backward Elimination Method: A variable selection procedure in which all variables are entered into the equation and then sequentially removed, starting with the most non-significant variable. At each step, the variable with the largest probability of F is removed.

Stepwise Regression: The first variable considered for entry into the equation is the one with the largest F value (smallest p-value based on the F-test). At each step, the independent variable not in the equation that has the smallest probability of F is entered. Variables already in the regression equation are removed if their probability of F becomes sufficiently large.
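A minimal forward-selection sketch with statsmodels; entry is based on each candidate coefficient's p-value (equivalent to the single-variable partial F-test, since t² = F), and the 0.05 threshold is an assumption, not a value from the lecture.

```python
import statsmodels.api as sm

def forward_select(X, y, alpha=0.05):
    """Greedy forward selection: at each step enter the candidate column
    of DataFrame X with the smallest p-value, while it stays below alpha."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().pvalues[c]
                 for c in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```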

Y = -7164.969 + 122458.276 × CAD + 103103.698 × RHD + 1342.539 × HR Pulse + 71547.412 × MARRIED + 33998.332 × OTHERS