Lecture 6: Introduction to Linear Regression

Ani Manichaikul, amanicha@jhsph.edu, 24 April 2007

Linear regression: main idea
Linear regression can be used to study an outcome as a linear function of a predictor.
Example: 60 cities in the US were evaluated for numerous characteristics, including the percentage of the population that was disadvantaged and the median education level.

Binary education variable
[Figure: % of population with income < $3 (vertical axis) for the Low Education and High Education groups]

Linear regression vs. ANOVA
These means could be compared by a t-test or ANOVA.
Mean in low education group: 15.7%
Mean in high education group: 13.2%
Regression provides a unified equation:
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_i = 15.7 - 2.5 X_i \]
where \(X_i = 1\) for high education and \(X_i = 0\) for low education (\(X_i\) is a dummy variable, or indicator variable, that designates group).

Interpreting the model
\(\hat{Y}_i\) is the predicted mean of the outcome for \(X_i\), that observation's value for X.
\(X_i = 0\) (low education): \(\hat{Y}_i = 15.7 - 2.5(0) = 15.7\)
\(X_i = 1\) (high education): \(\hat{Y}_i = 15.7 - 2.5(1) = 13.2\)
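As a minimal sketch of why a regression on a dummy variable reproduces the two group means, the snippet below fits a least-squares line to a small made-up data set. The numbers and the names `low`, `high`, and `fit_line` are illustrative assumptions, not the city data from the lecture. The fitted intercept equals the mean of the X = 0 group, and the fitted slope equals the difference in group means.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical outcomes (percent disadvantaged) for two groups of cities.
low = [16.0, 15.0, 17.0]    # X = 0, low education
high = [13.0, 12.0, 14.0]   # X = 1, high education

x = [0] * len(low) + [1] * len(high)
y = low + high

b0, b1 = fit_line(x, y)
mean_low = sum(low) / len(low)
mean_high = sum(high) / len(high)

print(b0, mean_low)               # intercept vs. mean of the reference group
print(b1, mean_high - mean_low)   # slope vs. difference in group means
```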

Interpretation
\(\hat\beta_0\) is the mean outcome for the reference group, or the group for which X = 0.
Here, \(\hat\beta_0\) is the average percent of the population that is disadvantaged for cities with low education.

Interpretation
\(\hat\beta_1\) is the difference in the mean outcome between the two groups (when X = 1 vs. when X = 0).
Here, \(\hat\beta_1\) is the difference in the average percent of the population that is disadvantaged for cities with high education compared to cities with low education.

Why use linear regression?
Linear regression is very powerful. It can be used for many things: binary X, continuous X, categorical X, adjustment for confounding, interaction, and curved relationships between X and Y.

Regression Analysis
A regression is a description of a response measure, Y, the dependent variable, as a function of an explanatory variable, X, the independent variable.
Goal: prediction or estimation of the value of one variable, Y, based on the value of the other variable, X.

Regression Analysis
A simple relationship between the two variables is a linear relationship (straight-line relationship).
Other names: linear, simple linear, least squares regression.

Galton's Example
Records of heights of family groups.
Really tall fathers tend on average to have tall sons, but not quite as tall as the really tall fathers.
There is a regression of a son's height toward the average height for sons.

Galton's Example
[Figure: Regression of Son's Stature on Father's. Son's Height versus Father's Height (inches), with fitted line E(Y) = 33.73 + 0.516*X]

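Regression Analysis: Population Model
Probability model: independent responses \(y_1, y_2, \ldots, y_n\) are sampled from \(Y_i \sim N(\mu_i, \sigma^2)\).
Systematic model: \(\mu_i = E(y_i \mid x_i) = \beta_0 + \beta_1 x_i\), where \(\beta_0\) = intercept and \(\beta_1\) = slope.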

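Another way to write the model
Systematic: \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\)
Probability: \(\varepsilon_i \sim N(0, \sigma^2)\)
The response, \(Y_i\), is a linear function of \(X_i\) plus some random, normally distributed error, \(\varepsilon_i\).
Data = Signal + noise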
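The "signal plus noise" form can be sketched by simulating data from the model. Everything numeric here is a hypothetical choice made for illustration (the coefficients `BETA0` and `BETA1`, the error standard deviation `SIGMA`, the sample size, and the range of x); none of it comes from the lecture's data sets.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population parameters.
BETA0, BETA1 = 2.0, 0.5   # signal: intercept and slope
SIGMA = 2.0               # noise: standard deviation of the errors

n = 1000
x = [random.uniform(0, 10) for _ in range(n)]        # predictor values
eps = [random.gauss(0, SIGMA) for _ in range(n)]     # errors ~ N(0, sigma^2)
y = [BETA0 + BETA1 * xi + ei for xi, ei in zip(x, eps)]  # data = signal + noise

mean_eps = sum(eps) / n
print(f"average simulated error: {mean_eps:.3f}")    # should be near 0
```

Each simulated response is the point on the line at its x plus an independent normal error, which is exactly the two-part description given by the systematic and probability models.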

Geometric Interpretation
[Figure: geometric interpretation of the regression model]

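Model
1) \(Y_i \sim N(\mu_i, \sigma^2)\)
2) \(\mu_i = E(y_i \mid x_i) = \beta_0 + \beta_1 x_i\)
OR
1) \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\)
2) \(\varepsilon_i \sim N(0, \sigma^2)\)
where \(\beta_0\) = intercept and \(\beta_1\) = slope.
The response, \(Y_i\), is a linear function of \(X_i\) plus some random, normally distributed error, \(\varepsilon_i\).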

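Interpretation of Coefficients
Mean model: \(\mu = E(y \mid x) = \beta_0 + \beta_1 x\)
\(\beta_0\) = expected response when X = 0, since \(E(y \mid x = 0) = \beta_0 + \beta_1(0) = \beta_0\).
\(\beta_1\) = change in expected response per 1-unit increase in X, since \(E(y \mid x + 1) = \beta_0 + \beta_1(x + 1)\) and \(E(y \mid x) = \beta_0 + \beta_1 x\), so \(\Delta E(y)\) from \(x\) to \(x + 1\) is \(\beta_1\).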

From Galton's Example
\(E(Y \mid x) = \beta_0 + \beta_1 x\)
\(E(Y \mid x) = 33.7 + 0.52x\)
where Y = son's height (inches) and x = father's height (inches).
Expected son's height = 33.7 inches when father's height is 0 inches.
Expected difference in heights for sons whose fathers' heights differ by one inch = 0.52 inches.
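A short sketch of how the fitted line is used, reading the slide's coefficients as an intercept of 33.7 and a slope of 0.52. The function name `expected_son_height` and the example father's heights are illustrative choices, not part of the lecture.

```python
B0_HAT = 33.7   # intercept (inches), as read from the slide
B1_HAT = 0.52   # slope: inches of son's height per inch of father's height

def expected_son_height(father_height):
    """Expected son's height (inches) for a given father's height (inches)."""
    return B0_HAT + B1_HAT * father_height

# Expected heights for a few illustrative father's heights.
for h in (64, 68, 72):
    print(h, round(expected_son_height(h), 2))

# Fathers one inch apart: the expected difference is the slope.
diff = expected_son_height(69) - expected_son_height(68)
print(round(diff, 2))
```

The intercept is the value of the line at a father's height of 0 inches, which is far outside any plausible data; that is the motivation for centering later in the lecture.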

City/Education Example
[Figure: % of population with income < $3 versus median education]

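Model
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_i = 36.2 - 2.0 X_i \]
where \(X_i\) = the median education level in city i.
When X = 0: \(\hat{Y} = 36.2 - 2.0(0) = 36.2\)
When X = 1: \(\hat{Y} = 36.2 - 2.0(1) = 34.2\)
When X = 2: \(\hat{Y} = 36.2 - 2.0(2) = 32.2\)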

Interpretation
\(\hat\beta_0\) is the mean outcome for the reference group, or the group for which X = 0.
Here, \(\hat\beta_0\) is the average percent of the population that is disadvantaged for cities with a median education level of 0.

Interpretation
\(\hat\beta_1\) is the difference in the mean outcome for a one-unit change in X.
Here, \(\hat\beta_1\) is the difference in the average percent of the population that is disadvantaged between two cities, when the first city has a median education level 1 unit higher than the second city.

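Finding the β's from the graph
\(\beta_0\) is the Y-intercept of the line, or the average value of Y when X = 0.
\(\beta_1\) is the slope of the line, or the average change in Y per unit change in X.
In the form \(y = mx + b\): \(b = \hat\beta_0\) and \(m = \hat\beta_1\), with
\[ \hat\beta_1 = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1} \]
Notation: \(\beta_1\) represents the true slope (in the population); \(b_1\) and \(\hat\beta_1\) are sample estimates of the slope.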

Where is our intercept?
[Figure: % of population with income < $3 versus median education, with the median education axis extended down to 0]

Centering
\(\hat\beta_0\) makes no sense here, because a median education level of 0 lies far outside the range of the data. We can change X to fix this problem by a process called centering:
1. Pick a value of X (c) within the range of the data.
2. For each observation, generate X_centered = X - c.
3. Redo the regression with X_centered.
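The three centering steps can be sketched on a small made-up data set. The data values and the helper `fit_line` are assumptions for illustration only; the point is that refitting with the centered predictor leaves the slope unchanged and moves the intercept to the fitted value at X = c.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: median education (years) and percent disadvantaged.
x = [9.0, 10.0, 11.0, 12.0, 13.0]
y = [18.0, 17.0, 13.0, 12.0, 11.0]

c = 12.0                                  # step 1: a value within the range of x
x_centered = [xi - c for xi in x]         # step 2: X_centered = X - c

b0, b1 = fit_line(x, y)                   # original fit
b0_c, b1_c = fit_line(x_centered, y)      # step 3: redo the regression

print("slope:", b1, b1_c)                 # same slope in both fits
print("intercepts:", b0, b0_c)            # centered intercept = fitted value at X = c
```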

We'll use c = 12, a high school degree
[Figure: % of population with income < $3 versus median education]

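New equation
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1 (X_i - 12) = 12.2 - 2.0 (X_i - 12) \]
\(\hat\beta_1\) has not changed.
\(\hat\beta_0 = 12.2\) now corresponds to X = 12, not X = 0.
Note: with X = 0, we have \(12.2 - 2.0(0 - 12) = 12.2 + 24 = 36.2\).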

Interpretation
\(\hat\beta_0\) is the mean outcome for the reference group, or the group for which X - 12 = 0, that is, X = 12.
Here, \(\hat\beta_0\) (12.2%) is the average percent of the population that is disadvantaged for cities with a median education level of 12, the equivalent of a high school degree.
The interpretation of \(\hat\beta_1\) has not changed.

Centering in Galton Example
Make 6-foot (72-inch) fathers the reference group.
Create a new X variable, X*, by subtracting 72 from our old X variable: X* = X - 72.
Then \(E(Y \mid x^*) = \beta_0 + \beta_1 x^* = \beta_0 + \beta_1 (x - 72)\).
So \(\beta_0\) = expected response when X = 72, since \(E(Y \mid x = 72) = \beta_0 + \beta_1 (72 - 72) = \beta_0\).
Center X's whenever interpretations call for it!
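As a worked illustration, assuming the fitted Galton line is read as E(Y | x) = 33.7 + 0.52x, centering at 72 inches leaves the slope alone and moves the intercept to the expected height of sons of 72-inch fathers. The value 71.14 is computed here from those two rounded coefficients and is not quoted in the lecture.

```latex
\begin{aligned}
E(Y \mid x) &= 33.7 + 0.52\,x \\
            &= 33.7 + 0.52\,(x^* + 72), \qquad x^* = x - 72 \\
            &= (33.7 + 0.52 \times 72) + 0.52\,x^* \\
            &= 71.14 + 0.52\,x^*
\end{aligned}
```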

Population Comparisons
\(\beta_0\): changes depending on the centering of X, which doesn't affect the association of interest.
Real concern: is X associated with Y?
Assess by testing \(\beta_1\): does \(\beta_1 = 0\) in the population from which this sample was drawn?
Two tools: hypothesis testing and confidence intervals.

Hypothesis testing
\(H_0: \beta_1 = 0\)
Test statistic:
\[ t_{obs} = \frac{\hat\beta_1}{SE(\hat\beta_1)}, \qquad df = n - k - 1 \]
n = number of observations
k = number of predictors (X's)

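Hypothesis testing for education example
\(H_0: \beta_1 = 0\)
Test statistic: \(t_{obs} = \hat\beta_1 / SE(\hat\beta_1) = -2.0 / 0.59 = -3.36\) (the estimate and standard error are shown rounded)
df = n - k - 1 = 60 - 1 - 1 = 58
n = number of observations = 60
k = number of predictors (X's) = 1
p < 2(1 - 0.995), so p < 0.01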
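The arithmetic of the test can be sketched as follows, using the rounded estimate and standard error shown on the slide. Because those inputs are rounded, the computed statistic (about -3.39) differs slightly from the slide's value. The critical value `T_CRIT_01` is an approximate t-table value for a two-sided 1% test with about 58 degrees of freedom; it is an assumption added here, and the function names are illustrative.

```python
def t_statistic(estimate, std_error, null_value=0.0):
    """t statistic for testing whether a coefficient equals null_value."""
    return (estimate - null_value) / std_error

def residual_df(n, k):
    """Degrees of freedom with n observations and k predictors plus an intercept."""
    return n - k - 1

t_obs = t_statistic(-2.0, 0.59)   # rounded slope and standard error from the slide
df = residual_df(60, 1)           # 60 cities, 1 predictor

T_CRIT_01 = 2.66                  # assumed approximate two-sided 1% critical value
reject_at_1pct = abs(t_obs) > T_CRIT_01

print(round(t_obs, 2), df, reject_at_1pct)
```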

Interpretation and conclusion
If there were no association between median education and percentage of disadvantaged citizens in the population, there would be less than a 1% chance of observing data as or more extreme than ours.
The null probability is very small, so we reject the null hypothesis and conclude that median education level and percentage of disadvantaged citizens are associated in the population.

Confidence Interval
No need to specify a hypothesis:
\[ \hat\beta_1 \pm t_{cr} \cdot SE(\hat\beta_1) = -2.0 \pm 2.002\,(0.59) = (-3.2,\ -0.8) \]
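A minimal sketch of the interval calculation, assuming the slide's values are an estimate of -2.0, a standard error of 0.59, and a critical value of 2.002 (the critical value is a reconstruction from the garbled slide and matches the usual t-table value for 58 degrees of freedom). The function name is illustrative.

```python
def confidence_interval(estimate, std_error, t_crit):
    """Two-sided interval: estimate +/- t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin

lower, upper = confidence_interval(-2.0, 0.59, 2.002)
contains_zero = lower <= 0.0 <= upper

print(round(lower, 1), round(upper, 1), contains_zero)
```

An interval that excludes 0 leads to the same conclusion as rejecting the null hypothesis at the matching significance level.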

Interpretation and conclusion
We are 95% confident that the true population decrease in percentage of disadvantaged citizens per additional year of median education is between 0.8 and 3.2.
Since this interval does not contain 0, we believe percentage of disadvantaged citizens and median education are associated among cities in the United States.

So far
Linear regression is used for continuous outcome variables.
\(\beta_0\): mean outcome when X = 0.
Binary X = dummy variable for group; \(\beta_1\): mean difference in outcome between groups.
Continuous X; \(\beta_1\): mean difference in outcome corresponding to a 1-unit increase in X.
Center X to give meaning to \(\beta_0\).
Test \(\beta_1 = 0\) in the population.

Linear Regression: Multiple covariates and confounding

Dataset
Hourly wage information from 9,98 workers, along with information regarding age, gender, years of experience, etc.
We'll focus on predicting hourly wage with available information.

Regression: Hourly wage vs. Years of experience
[Figure: Hourly Wage versus Years of Experience]

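What are the parameters?
For each person, their actual hourly wage (\(Y_i\)) and predicted hourly wage (\(\hat{Y}_i\)) are known.
\(Y_i - \hat{Y}_i = Y_i - (\hat\beta_0 + \hat\beta_1 X_i)\) is the residual or error.
The parameters are found by minimizing the sum of the squared errors:
\[ \min_{\beta_0, \beta_1} \sum_{i=1}^{n} \bigl( Y_i - (\beta_0 + \beta_1 X_i) \bigr)^2 \]
The parameters are the least squares estimates.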
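The minimization has a well-known closed-form solution for one predictor: the slope is the sum of cross-products divided by the sum of squares of X, and the intercept makes the line pass through the means. The sketch below applies it to a tiny made-up data set (not the wage data) and checks that nudging the coefficients only increases the sum of squared errors. The names `sse` and `least_squares` are illustrative.

```python
def sse(b0, b1, x, y):
    """Sum of squared errors for the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

def least_squares(x, y):
    """Closed-form least squares estimates for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = least_squares(x, y)
best = sse(b0, b1, x, y)

# Any small change to the fitted coefficients gives a larger SSE.
worse = [sse(b0 + d0, b1 + d1, x, y)
         for d0 in (-0.1, 0.1) for d1 in (-0.05, 0.05)]

print(b0, b1, best, min(worse))
```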

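Notes
\(\hat{Y} = \hat\beta_0 + \hat\beta_1 X\) holds for any known point on the line.
\(\bar{Y} = \hat\beta_0 + \hat\beta_1 \bar{X}\) is always true.
The regression line equation is \(\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_i\).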

Model 1
Model 1: predict income by years of experience.
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_i = 8.38 + 0.04 X_i \]
\(\hat\beta_0 = 8.38\), so the average hourly wage for someone with no experience at all is about $8.40.
\(\hat\beta_1 = 0.04\), so for every additional year of experience, the predicted hourly wage increases about 4 cents. For 10 years of additional experience, the predicted hourly wage increases about 40 cents.

Should we center X?
0 years of experience is within the range of the data.
The average hourly wage corresponding to 0 years of experience makes sense.
No need to center X.

What happens if we also consider gender? (Model 2)
[Figure: Hourly Wage versus Years of Experience, showing men's and women's hourly wages with fitted lines fit2_men and fit2_women]

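Model 2: Gender effect, no experience
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Gender}_i) = 9.27 + 0.04(\text{Experience}_i) - 2.20(\text{Gender}_i) \]
For a man with no experience: \(9.27 + 0.04(0) - 2.20(0) = 9.27 = \hat\beta_0\), so $9.27.
For a woman with no experience: \(9.27 + 0.04(0) - 2.20(1) = 7.07 = \hat\beta_0 + \hat\beta_2\), so $7.07.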

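Model 2: Gender effect, 10 years experience
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Gender}_i) = 9.27 + 0.04(\text{Experience}_i) - 2.20(\text{Gender}_i) \]
For a man with 10 years of experience: \(9.27 + 0.04(10) - 2.20(0) = 9.67 = \hat\beta_0 + \hat\beta_1(10)\), so $9.67.
For a woman with 10 years of experience: \(9.27 + 0.04(10) - 2.20(1) = 7.47 = \hat\beta_0 + \hat\beta_1(10) + \hat\beta_2(1)\), so $7.47.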

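Model 2: Experience effect, males
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Gender}_i) = 9.27 + 0.04(\text{Experience}_i) - 2.20(\text{Gender}_i) \]
For a man with no experience: \(9.27 + 0.04(0) - 2.20(0) = 9.27 = \hat\beta_0\), so $9.27.
For a man with 10 years of experience: \(9.27 + 0.04(10) - 2.20(0) = 9.67 = \hat\beta_0 + \hat\beta_1(10)\), so $9.67.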

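Model 2: Experience effect, females
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Gender}_i) = 9.27 + 0.04(\text{Experience}_i) - 2.20(\text{Gender}_i) \]
For a woman with no experience: \(9.27 + 0.04(0) - 2.20(1) = 7.07 = \hat\beta_0 + \hat\beta_2\), so $7.07.
For a woman with 10 years of experience: \(9.27 + 0.04(10) - 2.20(1) = 7.47 = \hat\beta_0 + \hat\beta_1(10) + \hat\beta_2\), so $7.47.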

Interpretation: Model 2
\(\hat\beta_0 = 9.27\): the average hourly wage for a man with no experience at all is about $9.30.
\(\hat\beta_1 = 0.04\): for every additional year of experience, the predicted hourly wage increases about 4 cents for both men and women.
\(\hat\beta_2 = -2.20\): the expected hourly wage is $2.20 lower for women than it is for men at any experience level.
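The four worked predictions for Model 2 can be reproduced with a short sketch, reading the coefficients as 9.27, 0.04, and -2.20 and assuming Gender is coded 1 for women and 0 for men, as the worked examples imply. The function and variable names are illustrative.

```python
B0, B1, B2 = 9.27, 0.04, -2.20   # intercept, experience slope, gender shift

def predicted_wage_model2(experience, female):
    """Predicted hourly wage; female is 1 for women and 0 for men."""
    return B0 + B1 * experience + B2 * female

for experience in (0, 10):
    man = predicted_wage_model2(experience, 0)
    woman = predicted_wage_model2(experience, 1)
    # The gap between the two parallel lines is B2 at every experience level.
    print(experience, round(man, 2), round(woman, 2), round(woman - man, 2))
```

Because the model has no interaction term, the two fitted lines are parallel: the experience slope is shared and the gender coefficient is a constant vertical shift.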

Model 1 vs. Model 2
Model 1: \(8.38 + 0.04(\text{Experience}_i)\)
Model 2: \(9.27 + 0.04(\text{Experience}_i) - 2.20(\text{Gender}_i)\)
95% CI for \(\beta_1\) in Model 1: (0.01, 0.07), and \(\hat\beta_1\) from Model 2 is within this CI.
Gender is not a confounder.

What happens if we consider age, instead? (Model 3)
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Age}_i - 40) \]
The relationship is harder to graph with two continuous predictors, since now the regression is in a 3-dimensional space.
Notice that age is centered at 40 years. Age ranged between 18 and 64 in this dataset.

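Model 3: Age effect, no experience
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Age}_i - 40) = 26.50 - 0.82(\text{Experience}_i) + 0.92(\text{Age}_i - 40) \]
For a 40-year-old with no experience: \(26.50 - 0.82(0) + 0.92(40 - 40) = 26.50 = \hat\beta_0\), so $26.50.
For a 41-year-old with no experience: \(26.50 - 0.82(0) + 0.92(41 - 40) = 27.42 = \hat\beta_0 + \hat\beta_2\), so $27.42.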

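Model 3: Age effect, 10 years experience
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Age}_i - 40) = 26.50 - 0.82(\text{Experience}_i) + 0.92(\text{Age}_i - 40) \]
For a 40-year-old with 10 years of experience: \(26.50 - 0.82(10) + 0.92(40 - 40) = 18.30 = \hat\beta_0 + 10\hat\beta_1\), so $18.30.
For a 41-year-old with 10 years of experience: \(26.50 - 0.82(10) + 0.92(41 - 40) = 19.22 = \hat\beta_0 + 10\hat\beta_1 + \hat\beta_2\), so $19.22.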

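Model 3: Experience effect, 40-year-old
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Age}_i - 40) = 26.50 - 0.82(\text{Experience}_i) + 0.92(\text{Age}_i - 40) \]
For a 40-year-old with no experience: \(26.50 - 0.82(0) + 0.92(40 - 40) = 26.50 = \hat\beta_0\), so $26.50.
For a 40-year-old with 10 years of experience: \(26.50 - 0.82(10) + 0.92(40 - 40) = 18.30 = \hat\beta_0 + 10\hat\beta_1\), so $18.30.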

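Model 3: Experience effect, 41-year-old
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Age}_i - 40) = 26.50 - 0.82(\text{Experience}_i) + 0.92(\text{Age}_i - 40) \]
For a 41-year-old with no experience: \(26.50 - 0.82(0) + 0.92(41 - 40) = 27.42 = \hat\beta_0 + \hat\beta_2\), so $27.42.
For a 41-year-old with 10 years of experience: \(26.50 - 0.82(10) + 0.92(41 - 40) = 19.22 = \hat\beta_0 + 10\hat\beta_1 + \hat\beta_2\), so $19.22.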

Interpretation: Model 3
\(\hat\beta_0 = 26.50\): the average hourly wage for a 40-year-old with no experience at all is about $26.50.
\(\hat\beta_1 = -0.82\): for every additional year of experience, the predicted hourly wage decreases about 82 cents for two people of the same age (or "adjusting for age").
\(\hat\beta_2 = 0.92\): for every additional year of age, the expected hourly wage increases about 92 cents for two people with the same amount of experience (or "adjusting for experience").
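The Model 3 predictions can be checked the same way, reading the coefficients as 26.50, -0.82, and 0.92 with age centered at 40. The function and constant names are illustrative.

```python
B0, B1, B2 = 26.50, -0.82, 0.92   # intercept, experience slope, age slope
AGE_CENTER = 40                   # age is centered at 40 years

def predicted_wage_model3(experience, age):
    """Predicted hourly wage from years of experience and age in years."""
    return B0 + B1 * experience + B2 * (age - AGE_CENTER)

for age in (40, 41):
    for experience in (0, 10):
        print(age, experience, round(predicted_wage_model3(experience, age), 2))

# Holding age fixed, one more year of experience changes the prediction by B1.
step = predicted_wage_model3(1, 40) - predicted_wage_model3(0, 40)
print(round(step, 2))
```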

Model 1 vs. Model 3
Model 1: \(8.38 + 0.04(\text{Experience}_i)\)
Model 3: \(26.50 - 0.82(\text{Experience}_i) + 0.92(\text{Age}_i - 40)\)
95% CI for \(\beta_1\) in Model 1: (0.01, 0.07), and \(\hat\beta_1\) from Model 3 is outside this CI.
Age is a confounder. When we adjust for age, the apparent effect of experience on wage changes.
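The comparison rule used in the lecture (does the adjusted experience coefficient fall inside the unadjusted model's 95% confidence interval?) can be sketched as a simple check. The interval (0.01, 0.07) is a reconstruction of garbled slide text and should be treated as an assumption; the helper `within` and the dictionary labels are illustrative.

```python
def within(value, interval):
    """True if value lies inside the closed interval (lower, upper)."""
    lower, upper = interval
    return lower <= value <= upper

CI_MODEL1 = (0.01, 0.07)   # assumed 95% CI for the experience slope in Model 1

# Experience slopes after adjusting for a second predictor.
adjusted_slopes = {
    "Model 2 (adjusted for gender)": 0.04,
    "Model 3 (adjusted for age)": -0.82,
}

for label, slope in adjusted_slopes.items():
    if within(slope, CI_MODEL1):
        verdict = "inside the CI: no sign of confounding"
    else:
        verdict = "outside the CI: the added variable looks like a confounder"
    print(label, slope, verdict)
```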

The Coefficient of Determination
\(R^2\) is the coefficient of determination.
\(R^2\) measures the ability to predict Y using X.
Variability explained by X is \(SSM = \sum_i (\hat{y}_i - \bar{y})^2\).
Total variability is \(SST = \sum_i (y_i - \bar{y})^2\).

The Coefficient of Determination
\(R^2\) is defined as
\[ R^2 = \frac{SSM}{SST} = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2} \]
It measures the proportion of total variability explained by the model.

The Coefficient of Determination
With a single predictor, \(R^2\) is the square of r, Pearson's correlation coefficient.
r is a rough way of evaluating the association between two continuous variables.
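A minimal sketch of both statements, computed on a small made-up data set (not the wage or city data): R² as SSM/SST from a least-squares fit, and the square of Pearson's r for the same single predictor. The function names are illustrative.

```python
import math

def r_squared(x, y):
    """R^2 = SSM / SST for a one-predictor least-squares fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    ssm = sum((yh - y_bar) ** 2 for yh in y_hat)   # variability explained by X
    sst = sum((yi - y_bar) ** 2 for yi in y)       # total variability
    return ssm / sst

def pearson_r(x, y):
    """Pearson's correlation coefficient."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r2 = r_squared(x, y)
r = pearson_r(x, y)
print(r2, r ** 2)   # the two agree when there is one predictor
```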

So, what is \(R^2\)?
The coefficient of determination, \(R^2\), evaluates the entire model.
\(R^2\) shows the proportion of the total variation in Y that has been predicted by this model.
Model 1: .76; .8% of variation explained
Model 2: .5; 5% of variation explained
Model 3: .2; 2% of variation explained

What is the adjusted \(R^2\)?
In both Models 2 and 3, the new predictor added a great deal to the model: \(R^2\) increased a lot.
More importantly, both new predictors were statistically significant.
\(R^2\) always goes up when a predictor is added!
The adjusted \(R^2\) is adjusted for the number of X's in the model, so it only goes up when helpful predictors are added.
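The slides do not give a formula for the adjusted R². The sketch below uses the standard definition, 1 - (1 - R²)(n - 1)/(n - k - 1), with hypothetical R² values chosen only to show the behavior: a predictor that barely raises R² lowers the adjusted value.

```python
def adjusted_r_squared(r2, n, k):
    """Standard adjusted R^2 for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 60  # hypothetical sample size

# Hypothetical fits: a second predictor raises R^2 only from 0.160 to 0.161.
adj_one = adjusted_r_squared(0.160, n, k=1)
adj_two = adjusted_r_squared(0.161, n, k=2)

print(round(adj_one, 4), round(adj_two, 4))
print(adj_two < adj_one)   # the unhelpful predictor lowers adjusted R^2
```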

Summary
Regression by least squares.
Interpreting regression coefficients.
Adding a 2nd predictor to a model: a binary X added gives 2 parallel lines; a continuous X added gives a 3-dimensional graph; for both, a new interpretation reflecting the new model.
Is the new X a confounder? Compare \(\hat\beta_1\) across models.