Statistiek II. John Nerbonne. March 17, 2015. Dept of Information Science, incl. important reworkings by Hartmut Fitz


Dept of Information Science, j.nerbonne@rug.nl, incl. important reworkings by Hartmut Fitz. March 17, 2015

Review: regression
- compares results on two distinct tests, e.g., geographic and phonetic distance of dialects
- for numerical variables only
- fits a straight line on the data
- is there an explanatory relationship between these variables? Answer: hypothesis tests for regression coefficients
- regression is asymmetric (explanatory direction)
- regression fallacy: seeing causation in regression
- regression towards the mean (inevitable)

Review: regression
[Scatterplot of phonetic_distance against geographic_distance with fitted regression line]
The regression line y = a + bx minimizes the sum of squared residuals.

Review: correlation
- only for numeric variables x and y
- measures strength and direction of a linear relation between x and y
- correlation coefficient: r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i} z_{y_i}
- symmetric: r_{xy} = r_{yx}
- -1 \le r_{xy} \le 1; a pure number, no scale
- related to the slope of the regression line: y = a + bx has slope b = r \frac{\sigma_y}{\sigma_x}
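
The relation between the correlation coefficient and the regression slope can be checked directly in R. A minimal sketch with hypothetical vectors geo and phon for geographic and phonetic distance (illustrative values, not the actual course data):

  # hypothetical example vectors, not the course data
  geo  <- c(30, 80, 120, 200, 260, 310)          # geographic distance
  phon <- c(0.45, 0.60, 0.72, 0.95, 1.10, 1.30)  # phonetic distance
  n <- length(geo)
  # correlation as the average product of z-scores (n - 1 in the denominator)
  r <- sum(scale(geo) * scale(phon)) / (n - 1)
  # the least-squares slope equals r * sd(y) / sd(x)
  b_from_r  <- r * sd(phon) / sd(geo)
  b_from_lm <- unname(coef(lm(phon ~ geo))["geo"])
  c(r = r, b_from_r = b_from_r, b_from_lm = b_from_lm)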

Review: correlation and regression
[Diagram: the total variation of y around its mean is split into explained variation (captured by the regression line) and unexplained variation]
Coefficient of determination: r^2 = \frac{\text{Explained variation}}{\text{Total variation}}

Review: prediction with regression
[Scatterplot of phonetic distance against geographic distance with the regression line, the 95% confidence interval for the mean phonetic distance, and the wider 95% prediction interval for future observations]

Today: multiple regression
Idea: predict a numerical variable using several independent variables.
Examples:
- university performance dependent on general intelligence, high school grades, education of parents, ...
- income dependent on years of schooling, school performance, general intelligence, income of parents, ...
- level of language ability of immigrants depending on leisure contact with natives, age at immigration, employment-related contact with natives, professional qualification, duration of stay, accommodation

Regression techniques are attractive:
- they allow prediction of one variable's value based on one or more others
- they allow an estimation of the importance of various independent factors (cf. ANOVA)
Candidate models:
y = \epsilon
y = \alpha + \epsilon
y = \alpha + \beta_1 x_1 + \epsilon
y = \alpha + \beta_2 x_2 + \epsilon
y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \epsilon
Which independent factors, taken together or separately, explain the dependent variable best?

Multiple regression data
One dependent variable y, but several predictor variables x_1, ..., x_p; N cases c_i with i \in \{1, ..., N\}. Each case c_i has the form c_i = (x_{i1}, ..., x_{ip}, y_i).
Data:
Case 1: c_1 = (x_{11}, ..., x_{1p}, y_1)
Case 2: c_2 = (x_{21}, ..., x_{2p}, y_2)
...
Case N: c_N = (x_{N1}, ..., x_{Np}, y_N)
Example: do geographic (x_1) and phonetic distance (x_2) predict people's intuitions about dialect distance (y)? (see Bezooijen and Heeringa, 2006)

Multiple regression model
Statistical model of multiple linear regression:
y_1 = \alpha + \beta_1 x_{11} + \beta_2 x_{12} + ... + \beta_p x_{1p} + \epsilon_1
...
y_N = \alpha + \beta_1 x_{N1} + \beta_2 x_{N2} + ... + \beta_p x_{Np} + \epsilon_N
The mean response \mu_y is a linear combination of the predictor variables:
\mu_y = \alpha + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p
The deviations \epsilon_i are independent and normally distributed with mean 0 and standard deviation \sigma.
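
To make the model concrete, one can simulate data that obey it and check that lm() recovers the parameters. A minimal sketch; the values of alpha, beta_1, beta_2 and sigma below are made up for illustration, not taken from the course data:

  # simulate from y = alpha + beta1*x1 + beta2*x2 + epsilon
  # (alpha = 2, beta1 = 0.5, beta2 = -0.3, sigma = 1 are illustrative values)
  set.seed(1)
  N   <- 200
  x1  <- runif(N, 0, 10)
  x2  <- runif(N, 0, 10)
  eps <- rnorm(N, mean = 0, sd = 1)   # independent, normally distributed deviations
  y   <- 2 + 0.5 * x1 - 0.3 * x2 + eps
  coef(lm(y ~ x1 + x2))               # estimates should be close to 2, 0.5, -0.3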

Multiple regression model
We need to estimate p + 1 model parameters a, b_1, ..., b_p:
y = a + b_1 x_1 + b_2 x_2 + ... + b_p x_p
(the first part, a + b_1 x_1, is just simple linear regression)
Predicted response for case i:
\hat{y}_i = a + b_1 x_{i1} + b_2 x_{i2} + ... + b_p x_{ip}
Residual of case i:
e_i = observed response - predicted response = y_i - \hat{y}_i = y_i - a - b_1 x_{i1} - b_2 x_{i2} - ... - b_p x_{ip}
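
In R, predicted responses and residuals are available directly from a fitted model object. A minimal, self-contained sketch on the built-in mtcars data set (not the course data):

  m     <- lm(mpg ~ wt + hp, data = mtcars)          # two predictors, built-in data
  y_hat <- fitted(m)                                  # predicted responses
  e     <- resid(m)                                   # residuals e_i = y_i - y_hat_i
  all.equal(unname(e), mtcars$mpg - unname(y_hat))    # TRUE: observed minus predicted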

Least squares regression
Find parameters that minimize the sum of squared residuals (SSE):
\sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (y_i - a - b_1 x_{i1} - b_2 x_{i2} - ... - b_p x_{ip})^2
But this time, let software do it for you...
As usual, we partition the variance: SST = SSM + SSE
\sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
Total variance = Explained variance + Error variance
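
The decomposition SST = SSM + SSE can be verified numerically from any fitted model. A minimal sketch using the built-in mtcars data (not the course data):

  m   <- lm(mpg ~ wt + hp, data = mtcars)
  y   <- mtcars$mpg
  SST <- sum((y - mean(y))^2)            # total sum of squares
  SSM <- sum((fitted(m) - mean(y))^2)    # model (explained) sum of squares
  SSE <- sum(resid(m)^2)                 # error (residual) sum of squares
  all.equal(SST, SSM + SSE)              # TRUE: SST = SSM + SSE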

Degrees of freedom in multiple regression
The multiple linear regression model has p + 1 parameters. Hence:
- model degrees of freedom (DFM): (p + 1) - 1 = p
- total degrees of freedom (DFT): (number of cases) - 1 = N - 1
- error degrees of freedom (DFE): N - p - 1
As usual, DFT = DFM + DFE.
Mean square model: MSM = SSM/DFM
Mean square error: MSE = SSE/DFE

Multiple regression: example
Grade point average (GPA) of first-year computer science majors is measured (A = 4.0, B = 3.0, ...).
Questions:
(a) do high school grades (Mathematics, English, Science) predict university grades?
(b) do scholastic aptitude test (SAT) scores (Mathematics, Verbal) predict university grades?
(c) do both sets of scores predict GPA?

Multiple regression: example

  Obs  HS-M  HS-S  HS-E  SAT-M  SAT-V   GPA
    1    10    10    10    670    600  3.32
    2     6     8     5    700    640  2.26
    3     8     6     8    640    530  2.35
    4     9    10     7    670    600  2.08
    5     8     9     8    540    580  3.38
  ...   ...   ...   ...    ...    ...   ...
  224     9     8     9    559    488  2.28

HS-M/S/E: high school grades mathematics/science/English
SAT-M/V: scholastic aptitude test scores mathematics/verbal
GPA: grade point average

Distribution of scores
[Histograms of gpa, hsm, hss, hse, satm and satv]
Regression does not require that variables be normally distributed!

Multiple regression: predicted vs observed values
[3D scatterplot of GPA against math SAT and verbal SAT with the fitted regression plane]

Visualizing residuals
[Residuals vs fitted values plot]
No indication of a non-linear relationship between the variables.

Check normality of residuals
[Normal Q-Q plot of the residuals: sample quantiles against theoretical quantiles]
No indication that the residuals are non-normally distributed.
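
Both diagnostic plots are easy to produce for any fitted lm object. A minimal sketch on the built-in mtcars data as a stand-in (the course model would be used the same way):

  m <- lm(mpg ~ wt + hp, data = mtcars)    # stand-in model on built-in data
  plot(fitted(m), resid(m),                # residuals vs fitted values
       xlab = "Fitted", ylab = "Residuals")
  abline(h = 0, lty = 2)
  qqnorm(resid(m))                         # normal Q-Q plot of the residuals
  qqline(resid(m))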

Regression on high school grades
(a) do high school grades (HS-M, HS-S, HS-E) predict GPA?

Call:
lm(formula = gpa ~ hse + hsm + hss, data = gpa_data)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.58988    0.29424   2.005   0.0462 *
hse          0.04510    0.03870   1.166   0.2451
hsm          0.16857    0.03549   4.749 3.68e-06 ***
hss          0.03432    0.03756   0.914   0.3619
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6998 on 220 degrees of freedom
Multiple R-Squared: 0.2046, Adjusted R-squared: 0.1937
F-statistic: 18.86 on 3 and 220 DF, p-value: 6.359e-11

Regression equation: y = 0.59 + 0.04 x_1 + 0.17 x_2 + 0.03 x_3, where x_1 = HS-E, x_2 = HS-M, x_3 = HS-S

F-statistic for multiple regression
The F-statistic tests H_0: b_1 = b_2 = ... = b_p = 0 against H_a: at least one of the b_i \ne 0.
ANOVA table:

  Source  Degrees of freedom  Sum of squares                  Mean square  F
  Model   p                   \sum (\hat{y}_i - \bar{y})^2    SSM/DFM      MSM/MSE
  Error   N - p - 1           \sum (y_i - \hat{y}_i)^2        SSE/DFE
  Total   N - 1               \sum (y_i - \bar{y})^2          SST/DFT

In the example: F(3, 220) = 18.86 and p < 0.001.
Hence, we reject H_0: at least one regression coefficient b_i \ne 0 (but we don't know which one).
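
The p-value for this overall F-test can be reproduced from the F distribution in R, using the numbers reported in the summary output above:

  # F(3, 220) = 18.86 from the output above
  pf(18.86, df1 = 3, df2 = 220, lower.tail = FALSE)   # about 6.4e-11, as reported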


Hypothesis testing
Which of the high school grades significantly contributes to predicting GPA?
For each coefficient b_1, b_2, b_3 we test H_0: b_i = 0 vs H_a: b_i \ne 0.
Under H_0, t = \frac{b_i}{SE_i} follows a t-distribution with N - p - 1 degrees of freedom, where SE_i is the standard error of the estimated b_i.
If |t| exceeds the critical value of t(N - p - 1) at \alpha = 0.05, reject H_0.
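
Each t value and p-value in the coefficient table follows from the estimate and its standard error. A minimal sketch reproducing the line for hsm, with the numbers taken from the output above (N = 224, p = 3):

  b_hsm  <- 0.16857
  se_hsm <- 0.03549
  t_val  <- b_hsm / se_hsm                                    # 4.749, as in the output
  2 * pt(abs(t_val), df = 224 - 3 - 1, lower.tail = FALSE)    # two-sided p, about 3.7e-06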

Hypothesis testing
Which of the high school grades significantly contributes to predicting GPA?

Coefficients:
     Estimate Std. Error t value Pr(>|t|)
hse   0.04510    0.03870   1.166   0.2451
hsm   0.16857    0.03549   4.749 3.68e-06 ***
hss   0.03432    0.03756   0.914   0.3619
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this regression model, only high school grades in Mathematics (HS-M) are significant. BUT...

Hypothesis testing
...if we regress GPA on Science grades (HS-S) alone:

Call:
lm(formula = gpa ~ hss, data = gpa_data)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.41325    0.24017   5.884 1.46e-08 ***
hss          0.15106    0.02906   5.198 4.55e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7375 on 222 degrees of freedom
Multiple R-Squared: 0.1085, Adjusted R-squared: 0.1045
F-statistic: 27.02 on 1 and 222 DF, p-value: 4.552e-07

We find that HS-S is a significant predictor of GPA!

Hypothesis testing
Explanation: look at the correlations between the explanatory variables:
r_{HSM,HSE} = 0.47, r_{HSM,HSS} = 0.58, r_{HSE,HSS} = 0.58
Hence, Maths and Science grades are strongly correlated:
- HSS does not add to the explanatory power of HSM and HSE (in the full model)
- HSS alone, though, predicts GPA (to some extent)
Be careful: always compare several multiple regression models and determine the correlations before drawing conclusions.
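
These correlations can be obtained in one call. A minimal sketch, assuming a data frame gpa_data with the columns hsm, hse and hss used in the model calls above (the object name is an assumption):

  # pairwise correlations among the three high-school-grade predictors
  round(cor(gpa_data[, c("hsm", "hse", "hss")]), 2)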

Visualizing multiple regression
[Venn diagram: circles for Y and X_1]
Regress Y on X_1 (simple linear regression). The shaded area is r^2 (the squared Pearson correlation coefficient); r^2 measures the amount of variation in Y explained by X_1.

Visualizing multiple regression
[Venn diagram: circles for Y, X_1 and X_2]
Regress Y on X_1 and X_2 (multiple linear regression).
- dark grey areas: uniquely explained variance ("squared semi-partial correlation")
- light grey area: commonly explained variance (due to the correlation of X_1 and X_2)

Visualizing multiple regression
[Venn diagram: circles for Y, X_1, X_2 and X_3]
Regress Y on X_1, X_2 and X_3 (multiple linear regression).
- dark grey areas: uniquely explained variance ("squared semi-partial correlation")
- light grey area: commonly explained variance (due to the correlation of X_1 and X_2)
- note: X_1 and X_3 are uncorrelated

Visualizing multiple regression
[Venn diagram: circles for Y, X_1, X_2 and X_3]
Regress Y on X_1, X_2 and X_3 (multiple linear regression). The black area is R^2, the squared multiple correlation coefficient; R^2 measures the total proportion of variance in Y accounted for by X_1, X_2 and X_3.

Squared multiple correlation
R^2 = \frac{SSM}{SST} = \frac{\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}
Regression of GPA on HS-S, HS-M and HS-E:

Residual standard error: 0.6998 on 220 degrees of freedom
Multiple R-Squared: 0.2046, Adjusted R-squared: 0.1937
F-statistic: 18.86 on 3 and 220 DF, p-value: 6.359e-11

High school grades explain 20.5% of the variance in GPA. Not a whole lot, despite the highly significant p-value for the HS-M coefficient. Once again, small p-values do not entail a large effect!
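
R^2 can be computed directly from the sums of squares of a fitted model. A minimal sketch, assuming a data frame gpa_data with the columns used in the model calls above (the object name is an assumption):

  m_hs <- lm(gpa ~ hse + hsm + hss, data = gpa_data)
  y    <- gpa_data$gpa
  SSM  <- sum((fitted(m_hs) - mean(y))^2)
  SST  <- sum((y - mean(y))^2)
  SSM / SST                       # about 0.2046, the Multiple R-Squared above
  summary(m_hs)$r.squared         # same value, taken from the summary object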

Squared multiple correlation
R^2 = \frac{SSM}{SST} = \frac{\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}
Regression of GPA on HS-S only:

Residual standard error: 0.7375 on 222 degrees of freedom
Multiple R-Squared: 0.1085, Adjusted R-squared: 0.1045
F-statistic: 27.02 on 1 and 222 DF, p-value: 4.552e-07

The p-values in both models are comparable, but high school grades in Science explain only 10.8% of the variance in GPA. Adding more variables (HS-M, HS-E) to the model adds explanatory power.

Refining the model
In the full model (HS-S/E/M), HS-S had the largest p-value (0.3619); drop HS-S from the model:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.62423    0.29172   2.140   0.0335 *
hse          0.06067    0.03473   1.747   0.0820 .
hsm          0.18265    0.03196   5.716 3.51e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6996 on 221 degrees of freedom
Multiple R-Squared: 0.2016, Adjusted R-squared: 0.1943
F-statistic: 27.89 on 2 and 221 DF, p-value: 1.577e-11

R^2 = 0.2016 versus R^2 = 0.2046 in the bigger model. In this (precise) sense HS-S does not add to explanatory power.
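
Dropping a predictor and refitting is conveniently done with update(). A minimal sketch, again assuming the gpa_data data frame from above:

  m_full    <- lm(gpa ~ hse + hsm + hss, data = gpa_data)
  m_reduced <- update(m_full, . ~ . - hss)    # same model without hss
  summary(m_full)$r.squared                   # about 0.2046
  summary(m_reduced)$r.squared                # about 0.2016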

What about SAT scores?
Question (b): do SAT scores predict GPA?

Call:
lm(formula = gpa ~ satm + satv, data = gpa_data)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.289e+00  3.760e-01   3.427 0.000728 ***
satm         2.283e-03  6.629e-04   3.444 0.000687 ***
satv        -2.456e-05  6.185e-04  -0.040 0.968357
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7577 on 221 degrees of freedom
Multiple R-Squared: 0.06337, Adjusted R-squared: 0.05498
F-statistic: 7.476 on 2 and 221 DF, p-value: 0.0007218

Regression on SAT scores is also significant, but has less explanatory power than high school grades.

What about adding SAT scores?
Question (c): do high school grades and SAT scores predict GPA?

Call:
lm(formula = gpa ~ hse + hsm + hss + satm + satv, data = gpa_data)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.3267187  0.3999964   0.817 0.414932
hse          0.0552926  0.0395687   1.397 0.163719
hsm          0.1459611  0.0392610   3.718 0.000256 ***
hss          0.0359053  0.0377984   0.950 0.343207
satm         0.0009436  0.0006857   1.376 0.170176
satv        -0.0004078  0.0005919  -0.689 0.491518
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7 on 218 degrees of freedom
Multiple R-Squared: 0.2115, Adjusted R-squared: 0.1934
F-statistic: 11.69 on 5 and 218 DF, p-value: 5.058e-10

ANOVA for multiple regression
How do we formally compare different regression models? For example, do SAT scores significantly add to the explanatory power of high school grades?
Compare
lm(formula = gpa ~ hse + hsm + hss, data = gpa_data)
with
lm(formula = gpa ~ hse + hsm + hss + satm + satv, data = gpa_data)
Use ANOVA to test H_0: b_satm = b_satv = 0 versus H_a: at least one of these b's \ne 0.

ANOVA for multiple regression
ANOVA F-score:
F = \frac{(SSE_{shorter} - SSE_{longer}) / \#\text{new variables}}{MSE_{longer}}
In the example:

Analysis of Variance Table
Model 1: gpa ~ hse + hsm + hss
Model 2: gpa ~ hse + hsm + hss + satm + satv
  Res.Df     SSE Df Sum of Sq      F Pr(>F)
1    220 107.750
2    218 106.819  2     0.931 0.9503 0.3882

Hence, SAT scores are not significant predictors of GPA in a regression model which already contains high school scores.
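
This comparison is a single call in R. A minimal sketch, assuming the gpa_data data frame with the columns used above:

  m_short <- lm(gpa ~ hse + hsm + hss, data = gpa_data)
  m_long  <- lm(gpa ~ hse + hsm + hss + satm + satv, data = gpa_data)
  anova(m_short, m_long)   # partial F-test: do satm and satv add explanatory power?

The same F-value can be reproduced by hand from the table: ((107.750 - 106.819) / 2) / (106.819 / 218) is about 0.95.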

Analyses summary
What can we conclude from all these analyses?
- High school grades in Maths are a significant predictor of GPA
- High school grades in Science are a significant predictor of GPA
- High school grades in Science and English do not add to the explanatory power of Math grades
- SAT scores do not add explanatory power to the model either
Can we ignore SAT scores and Science/English grades then? No, because we only looked at the GPA of computer science majors at one university.

Problems with multiple regression
- Overfitting: the more variables, the more variance you can explain. Even if each variable explains little, adding a large number of variables can result in high values of R^2.
- Interaction: multiple regression is logically more complicated than simple regression applied several times for different variables.
- Collinearity: independent variables may be correlated among themselves, competing in their explanation. Consider "cleaning" one independent variable of another by using the residuals of a regression analysis, as sketched below.
- Suppression: an independent variable may appear not to be explanatory, but becomes significant in a combined model.
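
The "cleaning" idea mentioned under collinearity can be sketched as follows: regress one predictor on the other and use the residuals, which are uncorrelated with the other predictor, in the main model. A minimal illustration on the built-in mtcars data (not the course data); the variable choices are purely illustrative:

  # wt and hp are correlated predictors of mpg
  hp_clean <- resid(lm(hp ~ wt, data = mtcars))   # part of hp not explained by wt
  cor(hp_clean, mtcars$wt)                        # about 0: cleaned predictor uncorrelated with wt
  m <- lm(mpg ~ wt + hp_clean, data = mtcars)     # wt now keeps its simple-regression slope
  summary(m)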

Summary
Multiple regression:
- generalization of simple linear regression
- allows prediction of one variable's value based on one or more others
- tests hypotheses about the predictive power of variables (t-test for coefficients)
- measures the proportion of variance in the dependent variable explained by the predictors (R^2)
- allows an estimation of the importance of various independent factors (model comparison with ANOVA)
- which independent factors, taken together or separately, explain the dependent variable best?

Next week: logistic regression