Lecture 12: Interactions and Splines


Sandy Eckel (seckel@jhsph.edu)
12 May 2007

Definition: Effect Modification
- The phenomenon in which the relationship between the primary predictor and the outcome varies across levels of another predictor.
- We say the other predictor modifies the effect of the primary predictor on the outcome.
- In linear regression, it is coded by including an interaction term between the primary predictor and the other predictor.
- Also called interaction.

Reminder: Nested models
- The parent model contains one set of variables.
- The extended model adds one or more new variables to the parent model.
  - One variable added: compare models with a t test.
  - Two or more variables added: compare models with an F test.
- Return to the example of wage versus experience.

Model 1: Wage on experience and gender
$$E[\text{Wage}_i] = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Gender}_i)$$
- This model allows the average wage to differ for men and women, but the difference in average wage between men and women is always the same regardless of experience level.
- We have created two parallel regression lines by including the binary covariate gender.
[Figure: hourly wage (0 to 50) against years of experience (0 to 60), showing men's and women's hourly wages with the fitted lines fit2_men and fit2_women.]

Model 1: Wage on experience and gender
$$E[\text{Wage}_i] = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Gender}_i)$$
Gender (0 = man, 1 = woman)
Model results (the continuous predictor is labelled educyrs in the output):

      Source |       SS       df       MS              Number of obs =     534
-------------+------------------------------           F(  2,   531) =   61.62
       Model |  2651.49936     2  1325.74968           Prob > F      =  0.0000
    Residual |  11425.1992   531  21.5163827           R-squared     =  0.1884
-------------+------------------------------           Adj R-squared =  0.1853
       Total |  14076.6985   533  26.4103162           Root MSE      =  4.6386

------------------------------------------------------------------------------
      wagehr |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     educyrs |   .7512834   .0768225     9.78   0.000     .6003701    .9021966
      gender |  -2.124057   .4028322    -5.27   0.000    -2.915397   -1.332716
 (Intercept) |   .2178312   1.036322     0.21   0.834    -1.817962    2.253624
------------------------------------------------------------------------------

Model 2: Wage on experience and gender, include interaction
Goal: create a model that allows
- the average wage to differ for men and women
- the difference in average wage between men and women to change as experience level increases
We will still have two regression lines (one for each gender), but they are no longer parallel.
$$E[\text{Wage}_i] = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Gender}_i) + \hat\beta_3(\text{Gender}_i \times \text{Experience}_i)$$
The last term is the interaction variable. How do we create it?

Model 2: Creating the interaction variable
Multiply the two X variables.
gender: 0 for men, 1 for women
gender*experience = 0*experience = 0           for men
                  = 1*experience = experience  for women
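A minimal sketch in R of the multiplication above. The data frame `toy` and its values are made up for illustration and are not the lecture's wage data; the column name `gender_exper` is also an assumption. The sketch also shows that R's formula operator `:` builds the same product column, so either route should give the same fit.

```r
# Hypothetical toy data, NOT the lecture's wage data
toy <- data.frame(
  wagehr     = c(8.5, 12.0, 6.0, 9.5, 15.0, 7.0, 11.0, 10.5),
  experience = c(2, 10, 1, 8, 20, 3, 12, 15),
  gender     = c(0, 0, 1, 1, 0, 1, 1, 0)   # 0 = man, 1 = woman
)

# Interaction variable: the product of the two predictors.
# It is 0 for men and equals experience for women.
toy$gender_exper <- toy$gender * toy$experience

# Fit using the hand-made product column
fit_manual <- lm(wagehr ~ experience + gender + gender_exper, data = toy)

# Equivalent fit: let the formula interface build the product
fit_formula <- lm(wagehr ~ experience + gender + experience:gender, data = toy)

print(coef(fit_manual))
print(coef(fit_formula))
```

Creating the column by hand keeps the construction visible, which matches the slides; the `:` form avoids carrying an extra column around.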

Model 2: R code and output
Code in R to create and include the interaction:
> gender_educ <- gender * educyrs
> lm(wagehr ~ educyrs + gender + gender_educ)
Output (similar to R's output):

      Source |       SS       df       MS              Number of obs =     534
-------------+------------------------------           F(  3,   530) =   41.50
       Model |  2677.43224     3  892.477414           Prob > F      =  0.0000
    Residual |  11399.2663   530  21.5080496           R-squared     =  0.1902
-------------+------------------------------           Adj R-squared =  0.1856
       Total |  14076.6985   533  26.4103162           Root MSE      =  4.6377

------------------------------------------------------------------------------
      wagehr |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     educyrs |   .6831451   .0987423     6.92   0.000     .4891708    .8771194
      gender |   -4.37045   2.085057    -2.10   0.037    -8.466441   -.2744591
 gender_educ |   .1725303   .1571232     1.10   0.273    -.1361305     .481191
 (Intercept) |   1.104571   1.313655     0.84   0.401    -1.476038    3.685181
------------------------------------------------------------------------------

Model 2: Interpretation in terms of gender value
$$E[\text{Wage}_i] = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Gender}_i) + \hat\beta_3(\text{Gender}_i \times \text{Experience}_i)$$
Gender: 0 for men, 1 for women
Equation for men:
$$E[\text{Wage}_i] = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) = 1.10 + 0.68(\text{Experience}_i)$$
Equation for women:
$$E[\text{Wage}_i] = (\hat\beta_0 + \hat\beta_2) + (\hat\beta_1 + \hat\beta_3)(\text{Experience}_i) = (1.10 - 4.37) + (0.68 + 0.17)(\text{Experience}_i)$$
- $\beta_2$: change in mean wage for women vs. men with no experience
- $\beta_3$: change in slope (of experience) for women vs. men

Model 2: Coefficient Interpretation
$$E[\text{Wage}_i] = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Gender}_i) + \hat\beta_3(\text{Gender}_i \times \text{Experience}_i)$$
- $\beta_0$: the average wage for men with no experience
- $\beta_1$: the difference in average wage for a one-year increase in experience among men
- $\beta_2$: the difference in average wage between women and men with no experience
- $\beta_3$: the difference of the difference in average wage for a one-year increase in experience between women and men
  - the change in slope between women and men
  - the slope for women is $\beta_1 + \beta_3$
  - the slope for men was $\beta_1$
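One consequence of these interpretations, worked out as a check: under Model 2 the expected difference between women and men at a given value $x$ of the continuous predictor is $\hat\beta_2 + \hat\beta_3 x$. The coefficients below are the fitted values from the Model 2 output; the evaluation point $x = 10$ is an arbitrary choice made here for illustration.

```latex
\begin{aligned}
E[\text{Wage} \mid \text{woman}, x] - E[\text{Wage} \mid \text{man}, x]
  &= \hat\beta_2 + \hat\beta_3 x \\
x = 0:\quad  & -4.370 \\
x = 10:\quad & -4.370 + 0.1725 \times 10 \approx -2.645
\end{aligned}
```

These are point estimates only; whether the gap really changes with $x$ depends on whether $\beta_3$ differs from 0.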

Compare to model 1
In the parent model:
- $\beta_1$ was the slope for both men and women
- $\beta_2$ was the difference between women and men at every experience level
In the extended model (with interaction):
- $\beta_1$ is the slope for men
- $\beta_2$ is the difference between women and men for experience = 0
- $\beta_3$ is the change in slope (wage per year of experience) for women vs. men

Is the change in slope statistically significant?
- Test model 1 vs. model 2.
- Only 1 variable was added, so use the t test for that variable to compare the models.
- $H_0$: $\beta_3 = 0$ in the population
- From the t statistic, p = 0.27.
- Fail to reject $H_0$.
- Conclude that there is no evidence the slope differs between women and men, so the simpler model 1 is preferred.
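As a check on the statement that the t test for the added variable compares the two nested models, the partial F statistic can be rebuilt from the residual sums of squares printed in the Model 1 and Model 2 outputs. With one added variable it should equal the square of the t statistic for the interaction term:

```latex
\begin{aligned}
F &= \frac{(\text{RSS}_1 - \text{RSS}_2)/1}{\text{RSS}_2/530}
   = \frac{11425.1992 - 11399.2663}{21.5080496}
   = \frac{25.9329}{21.5080496} \approx 1.206 \\
t^2 &= \left(\frac{0.1725303}{0.1571232}\right)^2 \approx (1.098)^2 \approx 1.206
\end{aligned}
```

The two agree, so the t test and the one-variable F test lead to the same p-value (about 0.27).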

Model 3: Interaction of two binary predictors
- Model 2: continuous X, binary X, their interaction
  - the slope changes by group
- Model 3: binary X, binary X, their interaction
  - the difference in mean changes by group

Model 3: Wage on gender and married, include interaction
$$E[\text{Wage}_i] = \hat\beta_0 + \hat\beta_1(\text{Gender}_i) + \hat\beta_2(\text{Married}_i) + \hat\beta_3(\text{Gender}_i \times \text{Married}_i)$$
gender: 0 for men, 1 for women
married: 0 if unmarried, 1 if married
To create the interaction term:
gender*married = 0*0 = 0 for unmarried men
               = 1*0 = 0 for unmarried women
               = 0*1 = 0 for married men
               = 1*1 = 1 for married women

Model 3: Results
$$E[\text{Wage}_i] = \hat\beta_0 + \hat\beta_1(\text{Gender}_i) + \hat\beta_2(\text{Married}_i) + \hat\beta_3(\text{Gender}_i \times \text{Married}_i)$$

      Source |       SS       df       MS              Number of obs =     534
-------------+------------------------------           F(  3,   530) =   13.94
       Model |  1029.58518     3  343.195059           Prob > F      =  0.0000
    Residual |  13047.1134   530   24.617195           R-squared     =  0.0731
-------------+------------------------------           Adj R-squared =  0.0679
       Total |  14076.6985   533  26.4103162           Root MSE      =  4.9616

------------------------------------------------------------------------------
      wagehr |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |  -.0951139   .7350696    -0.13   0.897    -1.539121    1.348894
     married |   2.521311   .6121088     4.12   0.000     1.318854    3.723768
gender_mar~d |  -3.097184    .907319    -3.41   0.001    -4.879567   -1.314802
 (Intercept) |   8.354752   .4936948    16.92   0.000     7.384914    9.324591
------------------------------------------------------------------------------

Let's visualize what this implies!
- Basically, we've created a model that allows for a different mean for each of the 4 possible types of people.
- The slopes ($\beta_1$, $\beta_2$ and $\beta_3$) represent differences in means between groups.

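Graph for Model 3
[Figure: mean hourly wage (0 to 12) for unmarried men, unmarried women, married men and married women. Annotations on the plot: $\beta_0$; Difference = $\beta_1$; Difference = $\beta_2$; Difference = $\beta_1 + \beta_3$; Difference = $\beta_2 + \beta_3$; $\beta_3$ = difference of differences.]
$$E[\text{Wage}_i] = \hat\beta_0 + \hat\beta_1(\text{Gender}_i) + \hat\beta_2(\text{Married}_i) + \hat\beta_3(\text{Gender}_i \times \text{Married}_i)$$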

Model 3: Interpretation (comparing women with men)
- $\beta_0$: the average wage for unmarried men
- $\beta_1$: the difference in average wage between unmarried women and unmarried men
- $\beta_1 + \beta_3$: the difference in average wage between married women and married men
- $\beta_3$: the difference of differences, that is, the women-vs-men difference in average wage among married people minus the women-vs-men difference among unmarried people

Model 3: Interpretation (comparing married with unmarried)
- $\beta_0$: the average wage for unmarried men
- $\beta_2$: the difference in average wage between married men and unmarried men
- $\beta_2 + \beta_3$: the difference in average wage between married women and unmarried women
- $\beta_3$: the difference of differences, that is, the married-vs-unmarried difference in average wage among women minus the married-vs-unmarried difference among men
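To make these interpretations concrete, the four fitted group means can be worked out from the Model 3 coefficients (rounded here to three decimals from the results table):

```latex
\begin{aligned}
\text{unmarried men:}   &\quad \hat\beta_0 = 8.355 \\
\text{unmarried women:} &\quad \hat\beta_0 + \hat\beta_1 = 8.355 - 0.095 = 8.260 \\
\text{married men:}     &\quad \hat\beta_0 + \hat\beta_2 = 8.355 + 2.521 = 10.876 \\
\text{married women:}   &\quad \hat\beta_0 + \hat\beta_1 + \hat\beta_2 + \hat\beta_3
                          = 8.355 - 0.095 + 2.521 - 3.097 = 7.684
\end{aligned}
```

As a consistency check, married women minus married men is $7.684 - 10.876 = -3.192 = \hat\beta_1 + \hat\beta_3$, and married women minus unmarried women is $7.684 - 8.260 = -0.576 = \hat\beta_2 + \hat\beta_3$.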

Model 3: Conclusion
- The coefficient of the interaction variable is statistically significantly different from 0 (p = 0.001, 95% CI: -4.9 to -1.3).
- The difference in mean hourly wage between women and men is greater for married people than for unmarried people.
  -or-
- The difference in mean hourly wage between married people and unmarried people is greater for men than for women.

Summary: Interaction
- interaction = var1*var2
- The interaction variable changes the interpretation of the entire model.
- With an interaction, the effect of one variable changes according to the level of the second variable.
- Test for interaction by testing the new variable.
  - Use a t test.
  - If significant (p < α, 0 not in the CI), keep it.
  - If not significant, go back to the parent model without the interaction variable.

Flexibility in linear models
- In linear regression, we assume the outcome, Y, has a linear relationship with the predictors, X.
- However, we have flexibility in defining the predictors:
  - transform X, such as $X^2$ or $X^3$
  - use linear splines to fit "broken arrow" models

Linear Splines: when are they needed?
Example 1: Outcome is "out of pocket" medical expenditures
- A researcher tells you most Health Maintenance Organizations (HMOs) will usually pay for the first week of a hospital stay only.
- She expects "out of pocket" expenditures to increase dramatically if the length of stay (LOS) is longer than one week.
- How should we set up the model?
Example 2: Outcome is medical expenditures
- In the US, older adults are eligible for Medicare, a national health insurance program, once they reach the age of 65.
- We expect study subjects' medical expenditure patterns before age 65 to be different from those after age 65.
- How should we set up the model?

Example 1: The researcher hypothesized this regression line for length of stay (LOS)
[Figure: "Broken Arrow Model", expenditures (2000 to 3500) against length of stay in days (3 to 9).]
Terminology: called a
- linear spline
- broken arrow
- hockey stick

Defining the spline variable
- We define a new variable that lets us check whether the slope is indeed different when LOS is greater than 7.
- Idea: create the term
$$(\text{LOS}-7)^+ = \begin{cases} \text{LOS} - 7 & \text{if LOS} > 7 \\ 0 & \text{if LOS} \le 7 \end{cases}$$
- When you run the regression, include both
  - LOS
  - $(\text{LOS}-7)^+$
- The spline allows the magnitude of the slope to change!
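A minimal sketch in R of how the spline term defined above could be built. The vector `los` holds made-up lengths of stay, and the names `los` and `los_spline` are assumptions chosen for this example; `pmax()` takes the elementwise maximum, which is exactly the "positive part" operation.

```r
# Hypothetical lengths of stay in days (illustrative values only)
los <- c(2, 5, 7, 8, 10, 14)

# (LOS - 7)+ : 0 up to and including day 7, LOS - 7 afterwards
los_spline <- pmax(los - 7, 0)

print(data.frame(los, los_spline))
```

Both columns would then enter the regression, for example `lm(expenditures ~ los + los_spline)`, where `expenditures` is a placeholder name for the outcome column.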

When to use a spline?
- When a continuous predictor is used, a typical regression equation assumes there is a straight-line relationship between X and Y in the population.
- If the relationship between X and Y is
  - a bent line
  - a curve
  then adding a spline may more accurately model the relationship between X and Y.

Visualizing the linear spline model
[Figure: "Broken Arrow Model", expenditures (2000 to 3500) against length of stay in days (3 to 9). The first segment is labelled Slope = $\beta_1$ and the second segment Slope = $\beta_1 + \beta_2$.]

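The Model
$$E(\text{expenditures}) = \beta_0 + \beta_1\,\text{LOS} + \beta_2\,(\text{LOS}-7)^+$$
where
$$(\text{LOS}-7)^+ = \begin{cases} \text{LOS} - 7 & \text{if LOS} > 7 \\ 0 & \text{if LOS} \le 7 \end{cases}$$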

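Then:
$$E(\text{expenditures} \mid \text{LOS} \le 7) = \beta_0 + \beta_1\,\text{LOS}$$
$$\begin{aligned}
E(\text{expenditures} \mid \text{LOS} > 7) &= \beta_0 + \beta_1\,\text{LOS} + \beta_2(\text{LOS} - 7) \\
&= (\beta_0 - 7\beta_2) + (\beta_1 + \beta_2)\,\text{LOS} \\
&= \beta_0^* + \beta_1^*\,\text{LOS}
\end{aligned}$$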
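A short check, added here as an illustration, that the two pieces of the broken-arrow model meet at the breakpoint, so the fitted line bends at LOS = 7 but does not jump:

```latex
\begin{aligned}
\text{first piece at LOS} = 7:  &\quad \beta_0 + 7\beta_1 \\
\text{second piece at LOS} = 7: &\quad (\beta_0 - 7\beta_2) + (\beta_1 + \beta_2)\cdot 7
                                  = \beta_0 + 7\beta_1
\end{aligned}
```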

Back to modelling wages
[Figure: standardized residuals (-2 to 4) against age (20 to 60).]
Do we need a spline on age?

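How should we add the spline?
Goal: let the regression line bend.
Model:
$$E(\text{Wage}_i) = \beta_0 + \beta_1(\text{age}-35) + \beta_2(\text{age}-35)^+$$
What is $(\text{age}-35)^+$?
$$(\text{age}-35)^+ = \begin{cases} 0 & \text{if age} < 35 \\ \text{age} - 35 & \text{if age} \ge 35 \end{cases}$$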

Fitted model with spline at 35
Model:
$$E(\text{Wage}_i) = \beta_0 + \beta_1(\text{age}-35) + \beta_2(\text{age}-35)^+$$

      Source |       SS       df       MS              Number of obs =     533
-------------+------------------------------           F(  2,   530) =   28.18
       Model |  1231.65577     2  615.827885           Prob > F      =  0.0000
    Residual |  11584.1395   530  21.8568669           R-squared     =  0.0961
-------------+------------------------------           Adj R-squared =  0.0927
       Total |  12815.7952   532  24.0898407           Root MSE      =  4.6751

------------------------------------------------------------------------------
      wagehr |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    age_cent |   .3328909   .0470853     7.07   0.000     .2403943    .4253876
  age_spline |   -.374546   .0663082    -5.65   0.000     -.504805   -.2442869
 (Intercept) |   10.45389   .3577241    29.22   0.000     9.751156    11.15662
------------------------------------------------------------------------------

Fitted Graph (with spline)
[Figure: wage ($/hour, 0 to 50) against age (20 to 60). Solid line: fitted spline. Dotted line: visualizes the change in slope.]

Working out what the spline term means...
$$E(\text{Wage}_i) = 10.45 + 0.33(\text{age}-35) - 0.37(\text{age}-35)^+$$
For a person under 35, $(\text{age}-35)^+ = 0$:
$$E(\text{Wage}_i) = 10.45 + 0.33(\text{age}-35)$$
For a person 35 or older, $(\text{age}-35)^+ = \text{age}-35$:
$$E(\text{Wage}_i) = 10.45 + 0.33(\text{age}-35) - 0.37(\text{age}-35) = 10.45 - 0.04(\text{age}-35)$$
$\beta_1 + \beta_2$ = new slope for those over 35
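As a worked illustration of the two pieces, the fitted mean wage can be evaluated at a few ages. The ages 25 and 45 are arbitrary choices made for this example, and the rounded coefficients from the slide are used:

```latex
\begin{aligned}
\text{age } 25: &\quad 10.45 + 0.33(25 - 35) = 10.45 - 3.30 = 7.15 \\
\text{age } 35: &\quad 10.45 \\
\text{age } 45: &\quad 10.45 - 0.04(45 - 35) = 10.45 - 0.40 = 10.05
\end{aligned}
```

Ten years below the breakpoint the fitted mean is about $3.30 lower than at 35, while ten years above it is only about $0.40 lower, which is the levelling-off the spline is meant to capture.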

Interpretation
- $\beta_0$ is the average wage for people who are 35 years old.
- $\beta_1$ is the change in average wage per additional year of age for those under 35.
- $\beta_2$ is the difference in the change in average wage per additional year of age for those over age 35 as compared to those under age 35.
  - $\beta_2$ is the change in the slope for over 35 vs. under 35.

Better Interpretation
- The average wage for people who are 35 years old is $10.45/hour (95% CI: $9.75, $11.16).
- For each additional year of age, those under age 35 earn an average of $0.33 more per hour (95% CI: $0.24, $0.43).
- For each additional year of age, those over age 35 earn an average of $0.04 less per hour (95% CI for the slope: -$0.10, $0.01).
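The regression output reports standard errors for $\beta_1$ and $\beta_2$ separately, but the slope for those over 35 is the sum $\beta_1 + \beta_2$, whose standard error also depends on the covariance between the two estimates. The sketch below shows one common way to obtain such an interval in R from the estimated covariance matrix. It runs on simulated data (all numbers in the simulation are assumptions), and it is not necessarily how the interval on the slide was computed.

```r
set.seed(1)                                  # simulated data, for illustration only
age        <- sample(18:64, 300, replace = TRUE)
age_cent   <- age - 35                       # centred age
age_spline <- pmax(age - 35, 0)              # (age - 35)+
wage <- 10 + 0.3 * age_cent - 0.35 * age_spline + rnorm(300, sd = 4)

fit <- lm(wage ~ age_cent + age_spline)

b <- coef(fit)   # estimated coefficients
V <- vcov(fit)   # their estimated covariance matrix

# Slope above the breakpoint: beta1 + beta2
slope_over <- unname(b["age_cent"] + b["age_spline"])

# SE of a sum: sqrt(Var(b1) + Var(b2) + 2 * Cov(b1, b2))
se_over <- sqrt(V["age_cent", "age_cent"] + V["age_spline", "age_spline"] +
                2 * V["age_cent", "age_spline"])

# 95% CI based on the t distribution with residual degrees of freedom
ci_over <- slope_over + c(-1, 1) * qt(0.975, df.residual(fit)) * se_over
print(c(slope = slope_over, lower = ci_over[1], upper = ci_over[2]))
```

An alternative with the same result is to refit the model with `pmin(age_cent, 0)` in place of `age_cent`, so that the coefficient on `age_spline` is itself the over-35 slope and `confint()` reports its interval directly.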

Is the change in slope statistically significant?
- One variable was added to create the change in slope.
- Compare the nested models with a t test.

------------------------------------------------------------------------------
Coefficients |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    age_cent |   .3328909   .0470853     7.07   0.000     .2403943    .4253876
  age_spline |   -.374546   .0663082    -5.65   0.000     -.504805   -.2442869
 (Intercept) |   10.45389   .3577241    29.22   0.000     9.751156    11.15662
------------------------------------------------------------------------------

- $H_0$: the spline is not needed (no change in slope in the population).
- p < 0.001, or equivalently the CI does not include 0: reject $H_0$.
- Conclude that the slope differs for those over vs. under 35 in the population.

Check the model assumptions again!
Note: plots are omitted.
- L: Linear relationship
  - With the spline, there is no longer any pattern in the residuals.
  - After removing the one outlier, no others appear to stand out.
- I: Independence
  - We cannot check this by looking at the data.
- N: Normality of the residuals
  - The residuals are slightly skewed to positive values.
  - The estimated regression coefficients are still correct.
  - Their confidence intervals may be misleading.
- E: Equal variance of the residuals across X
  - The vertical spread of the residuals may be smaller for those under 25 years of age.
  - The estimated regression coefficients are still correct.
  - Their confidence intervals may be misleading.

Conclusions
The increase in hourly wage with increasing age is statistically significant for those who recently entered the workforce (ages 18-35): for each additional year, these workers earn an average of 33 cents more per hour. However, this increase in wage with increasing age levels off for those over age 35, so that no appreciable increase in average wage is observed for those over age 35.

One 21-year-old had much higher earnings ($44.50 per hour) than other young workers. This person's results were so unlike the rest of the sample that the observation was dropped from the analysis. It is possible that the data were incorrectly entered for this person, but we are unable to assess the data entry since the original completed surveys are unavailable.

A final word on Splines
- Splines are used to allow the regression line to bend.
  - The breakpoint is arbitrary and decided graphically.
  - The actual slope above and below the breakpoint is usually of more interest than the coefficient for the spline (i.e., the change in slope).

Lecture 12 Summary
- Effect modification
  - Epidemiologic definition
  - Include in the model using an interaction term
    - 1 binary X, 1 continuous X
    - 2 binary X
- Linear splines
  - The relationship between Y and a continuous X may not be a single straight line
  - Allow for a change in slope at a certain point in the continuous X