Multiple Regression Part I STAT315, 19-20/3/2014


Regression problem: y = f(x 1, ..., x k) + ε. The x 1, ..., x k are the predictors/independent variables/features, and ε is error which can never be eliminated. f is the regression function, describing how the expected value of y is related to the x's. Our task is to estimate the regression function f.

[Scatter plots of y against x]

Since the regression function might be very complicated, we usually want to make some simplifying assumptions. The simplest kind of function which involves all the independent variables is a linear function.

Linear model: Y = β 0 + β 1 X 1 + β 2 X 2 + ... + β k X k + ε

Fitted by minimising the sum of squared errors. This leads to: the estimate of beta, β̂ = (X'X)⁻¹X'Y; the fitted values of Y, Ŷ = Xβ̂ = HY; and the hat matrix H = X(X'X)⁻¹X'. Interpretation: an increase of one unit in X j is associated with an increase of β j units in Y, provided the other independent variables are held constant (sometimes called the ceteris paribus condition). e.g. Y = salary of a cricketer, X 1 = length of career so far, X 2 = total number of runs scored in career.
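For concreteness, here is a small R sketch (not from the original slides) computing these quantities directly for the stackloss data used in the example below, and checking the result against lm():

# Least squares "by hand" for the stackloss data
data(stackloss)
X <- model.matrix(~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
y <- stackloss$stack.loss

beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y   # estimate of beta
H        <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
y.hat    <- H %*% y                            # fitted values of Y

coef(lm(stack.loss ~ ., data = stackloss))     # same estimates from lm()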

If the errors are i.i.d. normal, then the method of least squares is also the method of maximum likelihood, and we are able to make all sorts of inferences, such as confidence intervals. These can be calculated from the output given by software.

Under these assumptions β̂ has a multivariate normal distribution. The estimate of the variance of the error terms is used to calculate confidence intervals for the regression coefficients: β̂ j ± t(α/2, n − p − 1) × SE(β̂ j) is a 100(1 − α)% confidence interval, where the standard errors are calculated by software and the critical value comes from the t distribution.

> data(stackloss)
> model <- lm(stack.loss ~ ., data=stackloss)
> summary(model)

Call:
lm(formula = stack.loss ~ ., data = stackloss)

Residuals:
    Min      1Q  Median      3Q     Max
-7.2377 -1.7117 -0.4551  2.3614  5.6978

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -39.9197    11.8960  -3.356  0.00375 **
Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
Water.Temp    1.2953     0.3680   3.520  0.00263 **
Acid.Conc.   -0.1521     0.1563  -0.973  0.34405
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
F-statistic: 59.9 on 3 and 17 DF,  p-value: 3.016e-09

> 1.2953 + c(-1,1)*0.3680*qt(0.025, df=17)
[1] 2.0717121 0.5188879

The 95% CI for β 1 is [0.52, 2.07] (we are 95% confident that an increase of 1 unit in Water.Temp is associated with an increase of between 0.52 and 2.07 units in stack.loss).

One of the most useful things about linear regression is that it can handle categorical variables as well as numerical ones. For example:

Salary = β 0 + β 1 (Years of education) + β 2 Gender, where Gender is coded as 0/1.

This fits the same slope for both genders but different intercepts; parallel regression lines.

- Question: what term or terms do we need to add if we want separate slopes for the different genders as well?
- What is the interpretation of β 2?
- What if you have a categorical variable with several categories, e.g. Apple/Pear/Orange?

FruitWeight = β 0 + β 1 Apple + β 2 Orange

One category has to be treated as the baseline, in this case Pear. Interpretation: the mean weight of a Pear is β 0, of an Apple is β 0 + β 1, and of an Orange is β 0 + β 2.

SAS example:

data fruit;
  input apple orange weight;
  cards;
0 0 29
0 0 26
0 0 25
1 0 22
1 0 23
1 0 26
0 1 50
0 1 34
0 1 30
;
proc reg data=fruit;
  model weight = apple orange;
run; quit;

Parameter Estimates
Variable    DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept    1        26.66667         3.66161        7.28     0.0003
apple        1        -3.00000         5.17830       -0.58     0.5834
orange       1        11.33333         5.17830        2.19     0.0712

Note: the significant p-value here (the one for the Intercept) is completely irrelevant. Why? What is it telling us?
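The same model can be fitted in R. A small sketch (recreating the fruit data above) using a factor with Pear as the baseline level:

fruit <- data.frame(
  type   = rep(c("Pear", "Apple", "Orange"), each = 3),
  weight = c(29, 26, 25, 22, 23, 26, 50, 34, 30)
)
fruit$type <- relevel(factor(fruit$type), ref = "Pear")  # Pear as baseline
summary(lm(weight ~ type, data = fruit))
# The typeApple and typeOrange coefficients should reproduce the apple/orange
# dummy-variable estimates from the SAS output (-3.00 and 11.33).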

R 2 = Σ(ŷ i − ȳ) 2 / Σ(y i − ȳ) 2, i.e. the variance of the predicted y-values divided by the variance of the actual y-values. Not really the square of anything interesting, except in the one-variable case. Often quoted, but not necessarily a good way of measuring the quality of the fit: the more x-variables you add, the bigger R-squared will be, and the more spread out the x-values, the bigger R-squared will be.

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -39.9197    11.8960  -3.356  0.00375 **
Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
Water.Temp    1.2953     0.3680   3.520  0.00263 **
Acid.Conc.   -0.1521     0.1563  -0.973  0.34405
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
F-statistic: 59.9 on 3 and 17 DF,  p-value: 3.016e-09
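A quick R check of this definition for the stackloss model above (an illustrative sketch, not from the slides):

model <- lm(stack.loss ~ ., data = stackloss)
y    <- stackloss$stack.loss
yhat <- fitted(model)

sum((yhat - mean(y))^2) / sum((y - mean(y))^2)  # 0.9136, matches Multiple R-squared
summary(model)$r.squared                        # same value reported by summary()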

ANOVA. Nested models:

M1: Y = β 0 + β 1 X 1 + ... + β k X k
M2: Y = β 0 + β 1 X 1 + ... + β k X k + β k+1 X k+1 + ... + β p X p

H0: β k+1 = ... = β p = 0
H1: at least one of these β i is not zero.

Under the null hypothesis, the quantity

F = [(RSS(M1) − RSS(M2)) / (p − k)] / [RSS(M2) / (n − (p + 1))]

has an F(p − k, n − (p + 1)) distribution. Special case: M1: Y = β 0 is the null model; the p-value for that F-test is usually reported in regression output, e.g. F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09.

Often laid out as an ANOVA table:

Source   DF           Sum of Squares      Mean Square                     F
Model    p            Σ(ŷ i − ȳ) 2        Σ(ŷ i − ȳ) 2 / p                MS(Model) / MS(Error)
Error    n − p − 1    Σ(y i − ŷ i) 2      Σ(y i − ŷ i) 2 / (n − p − 1)
Total    n − 1        Σ(y i − ȳ) 2

SAS output example:

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             13            31638    2433.65468    108.08   <.0001
Error            492            11079      22.51785
Corrected Total  505            42716

Model comparison in R:

> lm(stack.loss ~ Air.Flow, data=stackloss) -> model1
> lm(stack.loss ~ Water.Temp + Air.Flow, data=stackloss) -> model2
> anova(model1, model2)

Analysis of Variance Table

Model 1: stack.loss ~ Air.Flow
Model 2: stack.loss ~ Water.Temp + Air.Flow
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1     19 319.12
2     18 188.80  1    130.32 12.425 0.002419 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Used in deciding whether to add or drop variables from the model. Leads to the idea of stepwise selection to choose the best variables to be in the regression model.
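To connect this with the F formula on the previous slide, the statistic can be recomputed by hand from the two residual sums of squares printed above; R's step() function automates the add/drop idea (a rough sketch; note that step() uses AIC rather than F-tests by default):

# F statistic recomputed from the RSS values in the anova() table
RSS1 <- 319.12; RSS2 <- 188.80
Fstat <- ((RSS1 - RSS2) / 1) / (RSS2 / 18)        # = 12.425, as reported
pf(Fstat, df1 = 1, df2 = 18, lower.tail = FALSE)  # p-value about 0.0024

# Stepwise selection starting from the full model
full <- lm(stack.loss ~ ., data = stackloss)
step(full, direction = "both")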

Multicollinearity. A common problem, especially if too many variables are included in the regression. Often manifests itself in questions like:

- How come the F-test is significant, but none of the t-tests for the individual coefficients are significant?
- How come the standard error of one of my coefficients is 10000 times larger than the coefficient itself?
- Why did the regression fail with an error message about something being singular?

Happens if some linear combination of the x-variables adds up to a constant (or close to a constant); equivalently, some subset of the centred x-variables is linearly dependent, which happens if and only if the variance-covariance matrix of the X's is singular (or nearly singular). e.g. if X 1 − X 2 = 0, then

Y = β 0 + β 1 X 1 + β 2 X 2 = β 0 + (β 1 + 1)X 1 + (β 2 − 1)X 2 = β 0 + (β 1 + 2)X 1 + (β 2 − 2)X 2 = ... and so on.
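A tiny R illustration of the last point (a sketch with made-up data, not from the slides): when one predictor is an exact copy of another, lm() cannot separate their coefficients and reports NA for one of them; when the dependence is only approximate, the fit succeeds but the standard errors explode.

set.seed(1)
x1 <- rnorm(30)
x2 <- x1                       # exact linear dependence: x1 - x2 = 0
y  <- 2 + 3 * x1 + rnorm(30)

coef(lm(y ~ x1 + x2))          # coefficient of x2 is NA: not estimable

x2b <- x1 + rnorm(30, sd = 0.001)              # nearly collinear instead
summary(lm(y ~ x1 + x2b))$coefficients         # huge standard errors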

One diagnostic for multicollinearity is the variance inflation factor, or VIF:

VIF(X i) = 1 / (1 − R i 2), where R i 2 is the R-squared from a regression of X i on the other X j, j ≠ i.

If X i is not involved in any linear combination that is close to constant, its VIF will be approximately 1.

proc reg data = boston;
  model MEDV = CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT / vif;
run;

Boston housing data variables:
1. CRIM - per capita crime rate by town
2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS - proportion of non-retail business acres per town
4. CHAS - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX - nitric oxides concentration (parts per 10 million)
6. RM - average number of rooms per dwelling
7. AGE - proportion of owner-occupied units built prior to 1940
8. DIS - weighted distances to five Boston employment centres
9. RAD - index of accessibility to radial highways
10. TAX - full-value property-tax rate per $10,000
11. PTRATIO - pupil-teacher ratio by town
12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of African-Americans by town
13. LSTAT - % lower status of the population
14. MEDV - median value of owner-occupied homes in $1000's

Parameter Estimates

Variable    DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation
Intercept    1        36.45949         5.10346        7.14    <.0001         0
CRIM         1        -0.10801         0.03286       -3.29    0.0011     1.79219
ZN           1         0.04642         0.01373        3.38    0.0008     2.29876
INDUS        1         0.02056         0.06150        0.33    0.7383     3.99160
CHAS         1         2.68673         0.86158        3.12    0.0019     1.07400
NOX          1       -17.76661         3.81974       -4.65    <.0001     4.39372
RM           1         3.80987         0.41793        9.12    <.0001     1.93374
AGE          1         0.00069222      0.01321        0.05    0.9582     3.10083
DIS          1        -1.47557         0.19945       -7.40    <.0001     3.95594
RAD          1         0.30605         0.06635        4.61    <.0001     7.48450
TAX          1        -0.01233         0.00376       -3.28    0.0011     9.00855
PTRATIO      1        -0.95275         0.13083       -7.28    <.0001     1.79908
B            1         0.00931         0.00269        3.47    0.0006     1.34852
LSTAT        1        -0.52476         0.05072      -10.35    <.0001     2.94149

Note the large VIFs for RAD (7.48) and TAX (9.01): in fact, corr(TAX, RAD) = 0.9.
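A rough R equivalent (an illustrative sketch, assuming the same Boston housing data is available as MASS::Boston, where the column names are lower case and B is called black), computing each VIF directly from the 1/(1 − R i 2) definition:

library(MASS)

vif <- function(fit) {
  X <- model.matrix(fit)[, -1]                 # drop the intercept column
  sapply(colnames(X), function(v) {
    # R-squared from regressing predictor v on all the other predictors
    r2 <- summary(lm(X[, v] ~ X[, colnames(X) != v]))$r.squared
    1 / (1 - r2)                               # VIF = 1 / (1 - R^2)
  })
}

fit <- lm(medv ~ ., data = Boston)
round(vif(fit), 2)          # rad and tax should show the largest VIFs
cor(Boston$tax, Boston$rad) # about 0.9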