Lecture 9: Linear regression: centering, hypothesis testing, multiple covariates, and confounding

Similar documents
Lecture 9: Linear regression: centering, hypothesis testing, multiple covariates, and confounding

Lecture 6: Introduction to Linear Regression

Statistics for Business and Economics

Chapter 15 - Multiple Regression

Reminder: Nested models. Lecture 9: Interactions, Quadratic terms and Splines. Effect Modification. Model 1

Y = β 0 + β 1 X 1 + β 2 X β k X k + ε

Statistics for Economics & Business

Department of Quantitative Methods & Information Systems. Time Series and Their Components QMIS 320. Chapter 6

Statistics for Managers Using Microsoft Excel/SPSS Chapter 13 The Simple Linear Regression Model and Correlation

Learning Objectives for Chapter 11

Chapter 11: Simple Linear Regression and Correlation

Chapter 9: Statistical Inference and the Relationship between Two Variables

[The following data appear in Wooldridge Q2.3.] The table below contains the ACT score and college GPA for eight college students.

Diagnostics in Poisson Regression. Models - Residual Analysis

Linear regression. Regression Models. Chapter 11 Student Lecture Notes Regression Analysis is the

Statistics II Final Exam 26/6/18

Comparison of Regression Lines

1. Inference on Regression Parameters a. Finding Mean, s.d and covariance amongst estimates. 2. Confidence Intervals and Working Hotelling Bands

Basic Business Statistics, 10/e

Linear Regression Analysis: Terminology and Notation

STAT 3008 Applied Regression Analysis

Chapter 14 Simple Linear Regression

Lecture Notes for STATISTICAL METHODS FOR BUSINESS II BMGT 212. Chapters 14, 15 & 16. Professor Ahmadi, Ph.D. Department of Management

17 - LINEAR REGRESSION II

28. SIMPLE LINEAR REGRESSION III

x i1 =1 for all i (the constant ).

Chapter 2 - The Simple Linear Regression Model S =0. e i is a random error. S β2 β. This is a minimization problem. Solution is a calculus exercise.

Introduction to Regression

2016 Wiley. Study Session 2: Ethical and Professional Standards Application

Topic 7: Analysis of Variance

NANYANG TECHNOLOGICAL UNIVERSITY SEMESTER I EXAMINATION MTH352/MH3510 Regression Analysis

Statistics MINITAB - Lab 2

Correlation and Regression

Chapter 13: Multiple Regression

DO NOT OPEN THE QUESTION PAPER UNTIL INSTRUCTED TO DO SO BY THE CHIEF INVIGILATOR. Introductory Econometrics 1 hour 30 minutes

e i is a random error

Basically, if you have a dummy dependent variable you will be estimating a probability.

Chapter 14 Simple Linear Regression Page 1. Introduction to regression analysis 14-2

ECONOMICS 351*-A Mid-Term Exam -- Fall Term 2000 Page 1 of 13 pages. QUEEN'S UNIVERSITY AT KINGSTON Department of Economics

18. SIMPLE LINEAR REGRESSION III

Here is the rationale: If X and y have a strong positive relationship to one another, then ( x x) will tend to be positive when ( y y)

Correlation and Regression. Correlation 9.1. Correlation. Chapter 9

Sociology 301. Bivariate Regression. Clarification. Regression. Liying Luo Last exam (Exam #4) is on May 17, in class.

Outline. Zero Conditional mean. I. Motivation. 3. Multiple Regression Analysis: Estimation. Read Wooldridge (2013), Chapter 3.

Lecture 3 Stat102, Spring 2007

STATISTICS QUESTIONS. Step by Step Solutions.

where I = (n x n) diagonal identity matrix with diagonal elements = 1 and off-diagonal elements = 0; and σ 2 e = variance of (Y X).

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Economics 130. Lecture 4 Simple Linear Regression Continued

Chapter 8 Multivariate Regression Analysis

Scatter Plot x

Chapter 5 Multilevel Models

Chapter 8 Indicator Variables

Statistics for Managers Using Microsoft Excel/SPSS Chapter 14 Multiple Regression Models

Midterm Examination. Regression and Forecasting Models

a. (All your answers should be in the letter!

BIO Lab 2: TWO-LEVEL NORMAL MODELS with school children popularity data

Negative Binomial Regression

The SAS program I used to obtain the analyses for my answers is given below.

January Examinations 2015

Cathy Walker March 5, 2010

β0 + β1xi and want to estimate the unknown

Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur

SIMPLE LINEAR REGRESSION

Sociology 301. Bivariate Regression II: Testing Slope and Coefficient of Determination. Bivariate Regression. Calculating Expected Values

Now we relax this assumption and allow that the error variance depends on the independent variables, i.e., heteroskedasticity

Interval Estimation in the Classical Normal Linear Regression Model. 1. Introduction

Resource Allocation and Decision Analysis (ECON 8010) Spring 2014 Foundations of Regression Analysis

PBAF 528 Week Theory Is the variable s place in the equation certain and theoretically sound? Most important! 2. T-test

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

Interpreting Slope Coefficients in Multiple Linear Regression Models: An Example

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

Module Contact: Dr Susan Long, ECO Copyright of the University of East Anglia Version 1

University of California at Berkeley Fall Introductory Applied Econometrics Final examination

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)

The Ordinary Least Squares (OLS) Estimator

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

Regression Analysis. Regression Analysis

STAT 405 BIOSTATISTICS (Fall 2016) Handout 15 Introduction to Logistic Regression

( )( ) [ ] [ ] ( ) 1 = [ ] = ( ) 1. H = X X X X is called the hat matrix ( it puts the hats on the Y s) and is of order n n H = X X X X.

β0 + β1xi. You are interested in estimating the unknown parameters β

Statistical Evaluation of WATFLOOD

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

Unit 10: Simple Linear Regression and Correlation

Durban Watson for Testing the Lack-of-Fit of Polynomial Regression Models without Replications

III. Econometric Methodology Regression Analysis

Properties of Least Squares

Professor Chris Murray. Midterm Exam

Statistics Chapter 4

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

STAT 3340 Assignment 1 solutions. 1. Find the equation of the line which passes through the points (1,1) and (4,5).

Chapter 15 Student Lecture Notes 15-1

/ n ) are compared. The logic is: if the two

Linear correlation and linear regression

Chapter 5: Hypothesis Tests, Confidence Intervals & Gauss-Markov Result

T E C O L O T E R E S E A R C H, I N C.

Econometrics: What's It All About, Alfie?

Chapter 3. Two-Variable Regression Model: The Problem of Estimation

UNIVERSITY OF TORONTO. Faculty of Arts and Science JUNE EXAMINATIONS STA 302 H1F / STA 1001 H1F Duration - 3 hours Aids Allowed: Calculator

PubH 7405: REGRESSION ANALYSIS. SLR: INFERENCES, Part II

Transcription:

Lecture 9: Lnear regresson: centerng, hypothess testng, multple covarates, and confoundng Sandy Eckel seckel@jhsph.edu 6 May 008

Recall: man dea of lnear regresson Lnear regresson can be used to study an outcome as a lnear functon of a predctor Example: 60 ctes n the US were evaluated for numerous characterstcs, ncludng: Outcome: the percentage of the populaton that had low ncome Predctor: medan educaton level

Recall: Where s our ntercept? We wll ntroduce the dea of centerng usng the cty % low ncome example % of populaton wth ncome < $3000 0 5 0 5 30 35 40 45 50 55 60 0 4 6 8 0 4 Medan educaton 3

Need for centerng β 0 makes no sense! We don t observe any ctes wth medan educaton 0 We can change X to fx ths problem by a process called centerng. Pck a value of X (c) wthn the range of the data. For each observaton, generate X centered X -c 3. Redo the regresson wth X centered 4

Centerng We ll use c, a hgh school degree 5 % of populaton wth ncome < $3000 0 5 0 5 30 9 0 3 Medan educaton

Centerng New equaton Yˆ Yˆ β β 0 0 + + β β ( X ) Centered ( X ) Yˆ ( )..0 X β has not changed β 0 now corresponds to average of y when X centered 0 or, equvalently, X (not X 0) Note: wth X 0, we have Ŷ..0. + 4 ( 0 ) 36. 6

Centerng Interpretaton β 0 s the mean outcome for the reference group, or the group for whch X centered X -0, or when X. Here, β 0 (.) s the average percent of the populaton that s dsadvantaged for ctes wth a medan educaton level of, the equvalent of a hgh school degree. The nterpretaton of β has not changed. 7

Hypothess testng and confdence ntervals of regresson coeffcents

Drawng conclusons about populaton assocaton usng our data (a sample) β 0 : changes dependng on centerng of X, whch doesn t affect assocaton of nterest Real concern: s X assocated wth Y? Assess by testngβ : Does β 0 n the populaton from whch ths sample was drawn? Hypothess testng Confdence nterval 9

Hypothess testng Null and alternate hypothess, test statstc H 0 : β 0 H 0 : β 0 Test statstc: t obs βˆ 0 SEβˆ ( ) df n-k- n number of observatons k number of predctors (X s) 0

Hypothess testng Educaton example H 0 : β 0 Test statstc: t obs -.0 0 0.59 3.36 df n-k- 60-- 58 n number of observatons 60 k number of predctors (X s) Calculate our p-value *pt(-3.36, df58) [] 0.0038308 p-value0.00

Hypothess testng Educaton example: nterpretaton and concluson If there were no assocaton between medan educaton and percentage of dsadvantaged ctzens n the populaton, there would be about a % chance of observng data as or more extreme than ours. The null probablty s very small, so: reject the null hypothess conclude that medan educaton level and percentage of dsadvantaged ctzens are assocated n the populaton

Confdence Interval for regresson coeffcents We calculate the CI usng the usual formula: df of t CR n-k- βˆ ± t CR SEβˆ ( ) For the educaton example, the 95% CI for β s:.0 ±.0 (- 3.,-0.8) 0.59 3

Confdence nterval Educaton example nterpretaton and concluson We are 95% confdent that the true populaton decrease n percentage of low ncome ctzens per addtonal year of medan educaton s between 3. and 0.8 Snce ths nterval does not contan 0, we beleve percentage of low ncome ctzens and medan educaton are assocated among ctes n the Unted States 4

Multple covarates and confoundng

Dataset Hourly wage nformaton from 998 workers, along wth nformaton regardng age, gender, years of experence, etc. We ll focus on predctng hourly wage wth avalable nformaton. Outcome: hourly wage Multple predctors: age, gender, years experence, etc 6

Regresson: Hourly wage vs. years of experence Smple lnear regresson snce only one covarate (years of experence) Hourly Wage 0 0 0 30 40 50 0 0 40 60 Years of Experence 7

How do we estmate the coeffcents? Use least squares For each person, ther actual hourly wage (Y ) and predcted hourly wage ( Ŷ ) are known εˆ Y Yˆ ( ) ( ( )) Y β0 + βx s the resdual or error The coeffcent estmates are found by mnmzng the sum of the squared error mn n ( Y ( + )) β0 βx The coeffcents are the least squares estmates 8

Notes on regresson analyss Ŷ β + 0 β X for any known pont on the lne Y β β X 0 + s always true Recall the regresson lne equaton Y β + β X + 0 ε 9

Model : years of experence Model : Predct ncome by years of experence Ŷ βˆ + βˆ X Ŷ 8.38 + βˆ 0 8.38 0 so the average hourly wage for someone wth no experence at all s about $8.40 0.04X βˆ 0.04 so for every addtonal year of experence, the predcted hourly wage ncreases about 4 cents For 0 years of addtonal experence, the predcted hourly wage ncreases about 40 cents 0

Should we center X? 0 years of experence s wthn the range of the data The average hourly wage correspondng to 0 years of experence makes sense No need to center X

Model : years of experence, gender (0man, woman) What happens f we also consder gender? Hourly Wage 0 0 0 30 40 50 0 0 40 60 Years of Experence Men's hourly wage ft_men Women's hourly wage ft_women

Model : for no experence years of experence, gender (0man, woman) Ŷ βˆ 0 + βˆ (Experence Ŷ ) + βˆ (Gender 9.7 + 0.04(Experence ) ) -.0(Gender ) For a man wth no experence: Ŷ 9.7 + 0.04(0)-.0(0) $9.7 βˆ For a woman wth no experence: 0 Ŷ 9.7 + 0.04(0) -.0() $7.07 βˆ 0 + βˆ 3

Model : for 0 years experence years of experence, gender (0man, woman) Yˆ βˆ 0 + βˆ ( Experence Yˆ 9.7 + ) + βˆ (Gender ) 0.04( Experence ) -.0(Gender ) For a man wth 0 years of experence: ˆ Y 97. + 004(0). -0. (0) $9.67 βˆ βˆ (0) 0 + For a woman wth 0 years of experence: Ŷ 9.7 + 0.04(0) -.0() $7.47 βˆ 0 + βˆ (0) + βˆ () 4

Model : for males years of experence, gender (0man, woman) Ŷ βˆ 0 + βˆ (Experence ) + βˆ (Gender ) Ŷ 9.7 + 0.04(Experence ) -.0(Gender ) For a man wth no experence: Ŷ 9.7 + 0.04(0) -.0(0) $9.7 For a man wth 0 years of experence: 0 βˆ Ŷ 9.7 + 0.04(0) -.0(0) $9.67 βˆ 0 + βˆ (0) 5

Model : for females years of experence, gender (0man, woman) Yˆ βˆ 0 + βˆ ( Experence Yˆ 9.7 + ) + βˆ (Gender ) 0.04( Experence ) -.0(Gender ) For a woman wth no experence: ˆ 9.7 + 0.04(0) -.0() Y For a woman wth 0 years of experence: ˆ Y $7.07 βˆ 0 + βˆ 9.7 + 0.04(0) -.0() $7.47 βˆ 0 + βˆ (0) + βˆ 6

Model Interpretaton Y ˆ 9.7 + 0.04( Experence ) -.0(Gender ) : the average hourly wage for a man βˆ0 9.7 wth no experence at all s about $9.30 βˆ 0.04 : for every addtonal year of experence, the predcted hourly wage ncreases about 4 cents for both men and women. βˆ.0 : the expected hourly wage s $.0 lower for women than t s for men at any experence level. 7

Confoundng by gender? Model vs. Model Model : Yˆ 8.38 + 0.04 ( Experence ) Model : Y ˆ 9.7 + 0.04(Experence ) -.0(Gender ) 95% CI for β n Model : (0.00, 0.07) and from Model s wthn ths CI βˆ Gender s not a confounder The assocaton between salary and experence does not change when we control for gender 8

Confoundng: the epdemologc defnton C s a confounder of the relaton between X and Y f: Outcome Y Confounder C Predctor X 9

Confoundng: example Smokng s a confounder of the relaton between coffee consumpton (X) and lung cancer (Y) snce: Lung Cancer Y Smokng C Coffee Consumpton X 30

Model 3: years of experence, age Let s try a model that ncludes age nstead of gender: Y ˆ βˆ + βˆ (Experence ) + ( Age 0-40) The relatonshp s harder to graph wth two contnuous predctors, snce now the regresson s n a 3-dmensonal space We wll learn AV plots n a few days Notce that age s centered at 40 years. Age ranged between 8 and 64 n ths dataset βˆ 3

Model 3: no experence years of experence, age Yˆ βˆ 0 + βˆ (Experence 6.5 ) + βˆ ( Age 0.8(Experence - 40) ) + 0.9( Age - 40) For a 40-year-old wth no experence: ˆ Y 6.5 0.8(0) + 0.9(40 40) βˆ 0 $6.50 For a 4-year-old wth no experence: ˆ Y 6.5 0.8(0) + 0.9(4 40) $7.4 βˆ 0 + βˆ 3

Model 3: 0 years experence years of experence, age Yˆ βˆ 0 + βˆ (Experence 6.5 ) + βˆ ( Age 0.8(Experence - 40) ) + 0.9( Age - 40) For a 40-year-old wth 0 years of experence: ˆ Y 6.5 0.8(0) + 0.9(40 40) $8.30 + βˆ For a 4-year-old wth 0 years of experence: ˆ Y βˆ 0 βˆ + βˆ 0 6.5 0.8(0) + 0.9(4 40) $9. 0 + βˆ 0 33

Model 3: 40 years of age years of experence, age Y ˆ βˆ 0 + βˆ (Experence ) + βˆ ( Age 6.5 0.8(Experence - 40) ) + 0.9( Age - 40) For a 40-year-old wth no experence: ˆ Y 6.5 0.8(0) + 0.9(40 40) $6.50 βˆ 0 For a 40-year-old wth 0 years of experence: ˆ Y 6.5 0.8(0) + 0.9(40 40) $8.30 βˆ + βˆ 0 0 34

Model 3: 4 years of age years of experence, age Y ˆ βˆ 0 + βˆ (Experence ) + βˆ ( Age 6.5 0.8(Experence - 40) ) + 0.9( Age - 40) For a 4-year-old wth no experence: Ŷ 6.5 0.8(0) + 0.9(4 40) $7.4 βˆ For a 4-year-old wth 0 years of experence: 0 + βˆ Ŷ 6.5 0.8(0) + 0.9(4 40) $9. βˆ 0 + βˆ 0 + βˆ 35

Model 3: Interpretaton βˆ : the average hourly wage for a 40-0 6.5 year-old wth no experence at all s about $6.50 : for every addtonal year of βˆ 0.8 experence, the predcted hourly wage decreases about 8 cents for two people of the same age (or adjustng for age ) βˆ 0.9 : for every addtonal year of age, the expected hourly wage ncreases about 9 cents for two people wth the same amount of experence (or adjustng for experence ) 36

Model vs. Model 3 What happens when we nclude age? Model : Yˆ 8.38 + 0.04 ( Experence ) Model 3: Y ˆ 6.5 0.8(Experence ) + 0.9( Age 95% CI for β n Model : (0.00, 0.07) and from Model 3 s outsde ths CI βˆ - 40) Age s a confounder! When we adjust for age, the apparent effect of experence on wage changes 37

The Coeffcent of Determnaton, R R measures the ablty to predct Y usng X Varablty explaned by X s SSM ( ˆ y) Total varablty s SST ( y y) y R s defned as R SSM SST ( yˆ ( y Measures the proporton of total varablty explaned by the model R s the square of r, Pearson s correlaton coeffcent Recall: r s a rough way of evaluatng the assocaton between two contnuous varables y) y) 38

Usng R as a model selecton crtera The coeffcent of determnaton, R evaluates the entre model R shows the proporton of the total varaton n Y that has been predcted by ths model Model : 0.0076; 0.8% of varaton explaned Model : 0.05; 5% of varaton explaned Model 3: 0.0; 0% of varaton explaned You want a model wth large R 39

Adjusted R another model selecton crtera In both models and 3, the new predctor added a great deal to the model R ncreased a lot More mportantly, both new predctors were statstcally sgnfcant R ncreases any tme you nclude a new varable! The adjusted R s adjusted for the number of X s n the model, so t only ncreases when helpful predctors are added Balance the need for Smple model Good predctve ablty 40

Summary of Lecture 9 Centerng Hypothess testng and confdence ntervals Regresson by least squares Interpretng regresson coeffcents Addng a nd predctor to a model Bnary X added: parallel lnes Contnuous X added: 3-dmensonal graph for both, new nterpretaton reflectng new model Is the new X a confounder? Compare β across models 4