Multiple linear regression


Multiple linear regression
Course MF9130: Introduction to statistics, June
Tron Anders Moger, Department of Biostatistics, IMB, University of Oslo

Aims for this lecture: Continue where we left off.
1. Repeat the most important things from last lecture
2. Learn tests for checking whether the slope of the regression line is different from zero
3. Look at what happens if more variables are included in the model
4. Learn how to handle binary independent variables and categorical independent variables

Example: [Scatter plot: birth weight in grams (0 to 5000) against mother's weight in pounds (50 to 250).]

Repetition: Simple linear regression

We define the model

Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

where Y_i is the dependent variable, x_i is the independent variable, and the errors \varepsilon_i are independent, normally distributed with equal variance \sigma^2. We wish to fit a line as close to the observed data (two normally distributed variables) as possible.

Example: Birth weight = \beta_0 + \beta_1 \cdot mother's weight. The estimate for \beta_0 is called a; the estimate for \beta_1 is called b.

Least squares regression

[Scatter plot: birth weight against mother's weight with the fitted regression line; R Sq Linear = 0.035.]

Find the best fitting line by minimizing the squared distance from each data point to the line, summed over all data points. Let (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) denote the points in the plane. Find a and b so that the line y = a + bx fits the points by minimizing

SSE = (a + bx_1 - y_1)^2 + (a + bx_2 - y_2)^2 + \dots + (a + bx_n - y_n)^2 = \sum_{i=1}^n (a + bx_i - y_i)^2

Solution:

b = \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{n \sum x_i^2 - (\sum x_i)^2} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}, \qquad a = \bar{y} - b\bar{x}

where \bar{x} = \sum x_i / n, \bar{y} = \sum y_i / n, and all sums run over i = 1, ..., n.
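The closed-form solution is easy to compute directly. A minimal sketch with made-up numbers (not the lecture's birth-weight data), checked against numpy's own least squares fit:

```python
import numpy as np

# Made-up data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

n = len(x)
# b = (sum x_i y_i - n*xbar*ybar) / (sum x_i^2 - n*xbar^2)
b = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
a = y.mean() - b * x.mean()  # a = ybar - b*xbar
print(a, b)

# Sanity check: polyfit returns (intercept, slope) for degree 1
assert np.allclose([a, b], np.polynomial.polynomial.polyfit(x, y, 1))
```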

How close are the data to the fitted line?

[Plot: data points (x_i, y_i), the fitted line \hat{y} = a + bx, and the deviations that make up SST, SSE and SSR.]

SST = \sum_i (y_i - \bar{y})^2, \qquad SSE = \sum_i (y_i - \hat{y}_i)^2, \qquad SSR = \sum_i (\hat{y}_i - \bar{y})^2

where the predicted value \hat{y}_i = a + bx_i is the point on the regression line corresponding to x_i. R^2, the proportion of the total variance in the y_i's explained by the regression line, is given by SSR/SST.

Also remember: the residuals (distances from the data points to the regression line) have to be normally distributed!! Plots for checking this are easily obtained from SPSS: histograms and Q-Q plots (which SPSS calls P-P plots in regression).
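A short sketch of this decomposition, again on made-up data (the fit is recomputed so the snippet stands alone):

```python
import numpy as np

# Same made-up data as above; a and b come from the least squares fit
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
a, b = np.polynomial.polynomial.polyfit(x, y, 1)

y_hat = a + b * x                       # predicted values on the regression line
SST = np.sum((y - y.mean()) ** 2)       # total variation in y
SSE = np.sum((y - y_hat) ** 2)          # residual variation around the line
SSR = np.sum((y_hat - y.mean()) ** 2)   # variation explained by the line
assert np.isclose(SST, SSR + SSE)       # the decomposition SST = SSR + SSE
print("R^2 =", SSR / SST)
```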

Example: Regression of birth weight with mother's weight as independent variable

Model Summary (dependent variable: birthweight; predictors: (constant), weight in pounds):
R = .186 (equal to Pearson's r), R Square = .035, Adjusted R Square = .029, Std. Error of the Estimate = 718.470 (the estimate for sigma)

ANOVA: F = 6.686 with df = 1 (regression) and 187 (residual; 188 total), Sig. = .010. The sums of squares in this table are SSR (regression), SSE (residual) and SST (total). The p-value tests whether there is a significant relationship between the variables in the model; the null hypothesis is no relationship.

Coefficients (dependent variable: birthweight), giving p-values, confidence intervals etc. for the betas:
(Constant): B = 2369.672 (estimate for \beta_0), Std. Error = 228.43, t = 10.374, Sig. = .000, 95% CI (1919.040, 2820.304)
weight in pounds: B = 4.429 (estimate for \beta_1), Std. Error = 1.713, Beta = .186, t = 2.586, Sig. = .010, 95% CI (1.050, 7.809)

But how do we answer questions like: Given that a positive slope (b) has been estimated, does it give a reproducible indication that there is a positive trend, or is it a result of random variation? What is a confidence interval for the estimated slope?
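The lecture uses SPSS, but the same table can be reproduced in other software. A sketch in Python with statsmodels, assuming the birth-weight data are available locally as birthwt.csv with columns bwt (birth weight in grams) and lwt (mother's weight in pounds); the file name and column names are assumptions, not fixed by the lecture:

```python
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("birthwt.csv")           # assumed local copy of the data
fit = smf.ols("bwt ~ lwt", data=data).fit()  # simple linear regression
print(fit.summary())                         # coefficients, t values, p values, R^2
print(fit.conf_int(0.05))                    # 95% CIs for beta_0 and beta_1
```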

Confidence intervals for simple regression

In the simple regression model, a estimates \beta_0 and b estimates \beta_1. Also, \hat{\sigma}^2 = SSE/(n-2) estimates the variance \sigma^2, and

(b - \beta_1)/S_b \sim t_{n-2}, \qquad S_b = \frac{\hat{\sigma}}{\sqrt{(n-1) s_x^2}}

where S_b estimates the standard deviation of b. So a confidence interval for \beta_1 is given by b \pm t_{n-2, \alpha/2} S_b.

Hypothesis testing for simple regression

Choose hypotheses: H_0: \beta_1 = 0 versus H_1: \beta_1 \neq 0. Test statistic: b/S_b \sim t_{n-2}. Reject H_0 if b/S_b < -t_{n-2, \alpha/2} or b/S_b > t_{n-2, \alpha/2}.

For the example: Test H_0: \beta_{mother's weight} = 0 at the 5% significance level. Get 4.429/1.713 = 2.586. Look up the 2.5 and 97.5 percentiles in the t-distribution with 187 degrees of freedom (can use the normal distribution). Find p-value < 0.05, reject H_0.
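The test and interval can be checked by hand from the SPSS output. A sketch using the reported b = 4.429, S_b = 1.713 and 187 degrees of freedom:

```python
from scipy import stats

b, s_b, df = 4.429, 1.713, 187
t_obs = b / s_b                            # test statistic under H0: beta_1 = 0
t_crit = stats.t.ppf(0.975, df)            # 97.5th percentile, about 1.97
p_value = 2 * stats.t.sf(abs(t_obs), df)   # two-sided p-value
ci = (b - t_crit * s_b, b + t_crit * s_b)  # 95% CI for beta_1
print(t_obs, p_value, ci)                  # ~2.586, ~0.010, ~(1.05, 7.81)
```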

More than one independent variable: Multiple regression

Assume we have data of the type (x_{11}, x_{21}, x_{31}, y_1), (x_{12}, x_{22}, x_{32}, y_2), ... We want to explain y from the x-values by fitting the following model:

y = a + b x_1 + c x_2 + d x_3

Just like before, one can produce formulas for a, b, c, d minimizing the sum of the squares of the errors.

Multiple regression model

y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_K x_{Ki} + \varepsilon_i

The errors \varepsilon_i are independent random (normal) variables with expected value zero and variance \sigma^2. The explanatory variables x_{1i}, x_{2i}, ..., x_{Ki} cannot be linearly related, that is, they must not measure almost the same thing.

Indicator variables

Binary variables (yes/no, male/female, ...) can be represented as 1/0 and used as independent variables. They are also called dummy variables in the book. When used directly, they influence only the constant term of the regression. It is also possible to use a binary variable so that it changes both the constant term and the slope of the regression line (interaction).

Example: Regression of birth weight with mother's weight and smoking status as independent variables

Model Summary (dependent variable: birthweight; predictors: (constant), smoking status, weight in pounds):
R = .259, R Square = .067, Adjusted R Square = .057, Std. Error of the Estimate = 707.836

ANOVA: F = 6.7 with df = 2 (regression) and 186 (residual; 188 total), Sig. = .002.

Coefficients (dependent variable: birthweight):
(Constant): B = 2500.174, Std. Error = 230.833, t = 10.831, Sig. = .000, 95% CI (2044.787, 2955.561)
weight in pounds: B = 4.238, Std. Error = 1.690, Beta = .178, t = 2.508, Sig. = .013, 95% CI (0.905, 7.571)
smoking status: B = -270.013, Std. Error = 105.590, Beta = -.18, t = -2.557, Sig. = .011, 95% CI (-478.32, -61.705)

Interpretation:

Have fitted the model: Birth weight = 2500.174 + 4.238 * mother's weight - 270.013 * smoking status

If the mother starts to smoke (and her weight remains constant), what is the predicted influence on the infant's birth weight? -270.013 * 1 ≈ -270 grams.

What is the predicted weight of the child of a 150-pound, smoking woman? 2500.174 + 4.238 * 150 - 270.013 * 1 ≈ 2866 grams.

Confounding

See that the estimated effect of mother's weight has changed a little compared to the univariate analysis (where it was 4.429): mother's weight is slightly confounded by smoking. [Diagram of the relationships between Mwt, Smk and Bwt.]

Confounder: an independent variable that causes a great change (at least 10%) in the effect of other independent variables (the betas) when it is included in the model.
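The worked predictions above are easy to script. A minimal sketch using the fitted coefficients (smoking status coded 1 for smokers, 0 otherwise):

```python
# Fitted equation from the SPSS output above
def predicted_bwt(mothers_weight_lbs: float, smoke: int) -> float:
    return 2500.174 + 4.238 * mothers_weight_lbs - 270.013 * smoke

print(predicted_bwt(150, 1))                          # ~2866 g, the worked example
print(predicted_bwt(150, 1) - predicted_bwt(150, 0))  # smoking effect: -270 g
```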

Confounding cont'd

A confounder is differently distributed for different values of the variable it confounds. E.g., if lean mothers smoked more than obese mothers, a univariate effect of mother's weight on birth weight would partly be due to smoking!! Including smoking in the model removes this effect, and you get a more correct estimate of the effect of mother's weight.

What if a categorical variable has more than two values?

Example: Ethnicity; black, white, other. For categorical variables with m possible values, use m-1 indicators, as in the sketch below. It is common to choose a large group as the baseline; otherwise the estimation becomes unstable. A model with two indicator variables will assume that the effect of one indicator adds to the effect of the other. If this is unsuitable, use an additional interaction variable (the product of the indicators).
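A sketch of the m-1 indicator coding in pandas; the column and category names are illustrative only:

```python
import pandas as pd

# Three-level categorical variable; 'white' (the largest group) is the baseline
eth = pd.Series(["white", "black", "other", "white", "other"], name="ethnicity")
dummies = pd.get_dummies(eth)[["black", "other"]].astype(int)
print(dummies)  # two 0/1 columns; 'white' is the omitted baseline category
```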

Birth weight as a function of ethnicity

Have constructed the variables black = 0 or 1 and other = 0 or 1: Birth weight = a + b*black + c*other. Get (dependent variable: birthweight):

(Constant): B = 3103.740, Std. Error = 72.88, t = 42.586, Sig. = .000, 95% CI (2959.959, 3247.521)
black: B = -384.047, Std. Error = 157.874, Beta = -.18, t = -2.433, Sig. = .016, 95% CI (-695.520, -72.593)
other: B = -299.725, Std. Error = 113.678, Beta = -.197, t = -2.637, Sig. = .009, 95% CI (-523.988, -75.461)

Hence, predicted birth weight decreases by 384 grams for blacks and 300 grams for others. The predicted birth weight for whites is 3104 grams.

Multiple regression: Traffic deaths in 1976

Want to find whether there is any relationship between the highway death rate (deaths per 1000, per state) in the U.S. and the following variables: average car age (in months), average car weight (in 1000 pounds), percentage light trucks, percentage imported cars. All data are per state.

First: Scatter plots

[Four scatter plots of deaths (roughly 0.05 to 0.35 per 1000) against carage (about 69 to 71.5 months), vehwt (about 3.0 to 3.8 thousand pounds), lghttrks (about 5% to 35%) and impcars (about 0% to 30%).]

Univariate effects (including one independent variable at a time!):

Deaths per 1000 = a + b * car age (in months)
Model Summary: R = .492, R Square = .242, Adjusted R Square = .226, Std. Error of the Estimate = .0506
(Constant): B = 4.516, Std. Error = 1.134, t = 3.982, Sig. = .000, 95% CI (2.233, 6.800)
carage: B = -.062, Std. Error = .016, Beta = -.492, t = -3.834, Sig. = .000, 95% CI (-.094, -.029)

Hence: all else being equal, if average car age increases by one month, you get 0.062 fewer deaths per 1000 inhabitants; increase age by 12 months and you get 12 * 0.062 = 0.74 fewer deaths per 1000 inhabitants.

Deaths per 1000 = a + b * car weight (in 1000 pounds)
Model Summary: R = .281, R Square = .079, Adjusted R Square = .059, Std. Error of the Estimate = .0574
vehwt: B = .124, Std. Error = .062, Beta = .281, t = 1.983, Sig. = .053, 95% CI (-.002, .249)

Univariate effects cont'd (one independent variable at a time!):

Deaths per 1000 = a + b * percentage light trucks
Model Summary: R = .716, R Square = .513, Adjusted R Square = .502, Std. Error of the Estimate = .0478
(Constant): B = .046, Std. Error = .018, t = 2.478, Sig. = .017, 95% CI (.009, .083)
lghttrks: B = .007, Std. Error = .001, Beta = .716, t = 6.947, Sig. = .000, 95% CI (.005, .010)

Hence: increasing the proportion of light trucks by 20 means 20 * 0.007 = 0.14 more deaths per 1000 inhabitants.

Deaths per 1000 = a + b * percentage imported cars
Model Summary: R = .308, R Square = .095, Adjusted R Square = .075, Std. Error of the Estimate = .0569
(Constant): B = .206, Std. Error = .020, t = 10.46, Sig. = .000, 95% CI (.166, .246)
impcars: B = -.004, Std. Error = .002, Beta = -.308, t = -2.193, Sig. = .033, 95% CI (-.007, .000)

Predicted number of deaths per 1000 if the proportion of imported cars is 10%: 0.206 - 0.004 * 10 ≈ 0.17.

Building a multiple regression model, exploratory analysis:

Forward regression:
1. Try all independent variables, one at a time; keep the variable with the lowest p-value.
2. Repeat step 1, with the independent variable from the first round now included in the model.
3. Repeat until no more variables can be added to the model (no more significant variables).

Backward regression: Include all independent variables in the model, then remove the variable with the highest p-value. Continue until only significant variables are left.

However: in the health sciences you would often keep age, gender etc. in the model even though they are not significant. A sketch of the forward procedure follows below.
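A sketch of forward selection, assuming a DataFrame `data` with the outcome deaths and the four candidate predictors (names as in the plots above); this is exploratory, not a fixed recipe:

```python
import statsmodels.formula.api as smf

def forward_select(data, outcome, candidates, alpha=0.05):
    """Greedy forward selection: at each step add the candidate with the
    lowest p-value, stopping when none is significant at level alpha."""
    chosen = []
    while candidates:
        pvals = {}
        for var in candidates:
            formula = f"{outcome} ~ {' + '.join(chosen + [var])}"
            fit = smf.ols(formula, data=data).fit()
            pvals[var] = fit.pvalues[var]  # p-value of the new variable
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:           # nothing significant left to add
            break
        chosen.append(best)
        candidates = [v for v in candidates if v != best]
    return chosen

# e.g. forward_select(data, "deaths", ["carage", "vehwt", "lghttrks", "impcars"])
```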

Two better methods of model building:

1. All independent variables chosen for the study have strong medical reasons for being interesting, and you have a large enough study. Then all might be included in the final model regardless of significance.
2. Middle road: use a cut-off, saying that all variables with p-value < e.g. 0.1 in the simple analyses can be included in the final model.

For the traffic deaths, we end up with:

Deaths per 1000 = 2.67 - 0.037 * car age + 0.006 * perc. light trucks

Model Summary (predictors: (constant), lghttrks, carage; dependent variable: deaths):
R = .768, R Square = .590, Adjusted R Square = .572, Std. Error of the Estimate = .0387

(Constant): B = 2.668, Std. Error = .895, t = 2.981, Sig. = .005, 95% CI (.865, 4.470)
carage: B = -.037, Std. Error = .013, Beta = -.295, t = -2.930, Sig. = .005, 95% CI (-.063, -.012)
lghttrks: B = .006, Std. Error = .001, t = 6.8, Sig. = .000, 95% CI (.004, .009)

Conclusion: Did a multiple linear regression on traffic deaths, with car age, car weight, prop. light trucks and prop. imported cars as independent variables. Car age (in months, beta = -0.037, 95% CI (-0.063, -0.012)) and prop. light trucks (beta = 0.006, 95% CI (0.004, 0.009)) were significant at the 5% level.

Check of assumptions: Are the residuals normally distributed?

[Histogram of the regression standardized residuals (mean ≈ 0, std. dev. = 0.978, N = 48) and normal P-P plot of the regression standardized residuals; dependent variable: deaths.]

Least squares estimation in multiple regression

y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_K x_{Ki} + \varepsilon_i

The least squares estimates of \beta_0, \beta_1, ..., \beta_K are the values b_0, b_1, ..., b_K minimizing

SSE = \sum_{i=1}^n (b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_K x_{Ki} - y_i)^2

They can be computed with similar but more complex formulas as in simple regression.
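In matrix form the estimates solve the normal equations (X'X)b = X'y, where X has a column of 1s plus one column per explanatory variable. A sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)  # simulated data

X = np.column_stack([np.ones(n), x1, x2])  # design matrix with intercept column
b = np.linalg.solve(X.T @ X, X.T @ y)      # normal equations: (X'X) b = X'y
print(b)                                   # close to (1.0, 2.0, -0.5)
```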

R^2 is defined just as before. Defining

\hat{y}_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_K x_{Ki}

we define

SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2, \qquad SSR = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2, \qquad SST = \sum_{i=1}^n (y_i - \bar{y})^2

and we get, as before, SST = SSR + SSE and

R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}

Adjusted coefficient of determination

Adding more independent variables will generally increase SSR and decrease SSE. Thus the coefficient of determination will tend to indicate that models with many variables always fit better. To avoid this effect, the adjusted coefficient of determination may be used:

\bar{R}^2 = 1 - \frac{SSE/(n-K-1)}{SST/(n-1)}
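A sketch comparing R^2 and adjusted R^2 on simulated data with K = 2 explanatory variables, fitted by least squares as above:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)
b = np.linalg.solve(X.T @ X, X.T @ y)

SSE = np.sum((y - X @ b) ** 2)       # residual sum of squares
SST = np.sum((y - y.mean()) ** 2)    # total sum of squares
R2 = 1 - SSE / SST
R2_adj = 1 - (SSE / (n - K - 1)) / (SST / (n - 1))  # penalizes extra variables
print(R2, R2_adj)
```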

Drawing inference about the model parameters in multiple regression

Similar to simple regression, we get that the following statistic has a t distribution with n - K - 1 degrees of freedom:

t_{b_j} = \frac{b_j - \beta_j}{s_{b_j}}

where b_j is the least squares estimate of \beta_j and s_{b_j} is its estimated standard deviation. K is the number of independent variables; s_{b_j} is computed from SSE and the correlations between the independent variables.

Confidence intervals and hypothesis tests

A confidence interval for \beta_j becomes b_j \pm t_{n-K-1, \alpha/2} s_{b_j}.

Testing the hypothesis H_0: \beta_j = 0 versus H_1: \beta_j \neq 0: reject H_0 if b_j/s_{b_j} < -t_{n-K-1, \alpha/2} or b_j/s_{b_j} > t_{n-K-1, \alpha/2}.

Testing sets of parameters

We can also test the null hypothesis that a specific set of the betas are simultaneously zero. The alternative hypothesis is that at least one beta in the set is nonzero. We will not go into the details here.

What if the relationship between x and y is non-linear?

The most common thing to do is to categorize the independent variable, e.g. categorize age into 0-20 yrs, 21-40 yrs, 41-60 yrs and so on. Choose a baseline category, and estimate a slope b for each of the other categories. Then it does not matter what relationship you have between the outcome and the independent variable.

Other options if the relationship is non-linear: Transformed variables

The relationship between the variables may not be linear. Example: the natural model may be y = a e^{bx}. We want to find a and b so that the curve y = a e^{bx} approximates the points as well as possible.

[Plot: data points with a fitted exponential curve; x roughly 5 to 30, y roughly 0.05 to 0.20.]

Example (cont.): When y = a e^{bx}, then log(y) = log(a) + bx. Use the standard formulas on the pairs (x_1, log(y_1)), (x_2, log(y_2)), ..., (x_n, log(y_n)). We get estimates for log(a) and b, and thus for a and b.
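A sketch of this transformation trick on made-up data: regress log(y) on x by ordinary least squares, then back-transform the intercept:

```python
import numpy as np

# Made-up data from a noisy exponential relationship y = 0.5 * exp(0.08 x)
rng = np.random.default_rng(2)
x = np.linspace(5, 30, 20)
y = 0.5 * np.exp(0.08 * x) * rng.lognormal(sigma=0.05, size=x.size)

# Fit log(y) = log(a) + b*x, then back-transform log(a)
log_a, b = np.polynomial.polynomial.polyfit(x, np.log(y), 1)
a = np.exp(log_a)
print(a, b)  # close to the true values 0.5 and 0.08
```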

Doing a regression analysis

- Plot the data first, to investigate whether there is a natural relationship. Linear or transformed model? Are there outliers which will unduly affect the result?
- Fit a model. Different models with the same number of parameters may be compared with R^2.
- Check the assumptions!
- Make tests / confidence intervals for the parameters.

A lot of practice is needed!