Statistical Modelling in Stata 5: Linear Models


Mark Lunt
Arthritis Research UK Epidemiology Unit, University of Manchester
07/11/2017

Structure

This Week:
  What is a linear model?
  How good is my model?
  Does a linear model fit this data?

Next Week:
  Categorical variables
  Interactions
  Confounding
  Other considerations
  Variable selection
  Polynomial regression

Statistical Models

"All models are wrong, but some are useful." (G.E.P. Box)
"A model should be as simple as possible, but no simpler." (attr. Albert Einstein)

What is a Linear Model?

Describes the relationship between variables.
Assumes that the relationship can be described by straight lines.
Tells you the expected value of an outcome (y) variable, given the values of one or more predictor (x) variables.

Variable Names

Outcome: dependent variable, y-variable, response variable, output variable.
Predictor: independent variable, x-variable, regressor, input variable, explanatory variable, carrier, covariate.

The Equation of a Linear Model

The equation of a linear model, with outcome Y and predictors x1, ..., xp:

  Y = β0 + β1x1 + β2x2 + ... + βpxp + ε

β0 + β1x1 + β2x2 + ... + βpxp is the linear predictor.
Ŷ = β0 + β1x1 + β2x2 + ... + βpxp is the predictable part of Y.
ε is the error term, the unpredictable part of Y. We assume that ε is normally distributed with mean 0 and variance σ².

Linear Model Assumptions

The mean of Y|x is a linear function of x.
The variables Y1, Y2, ..., Yn are independent.
The variance of Y|x is constant.
The distribution of Y|x is normal.

Parameter Interpretation

[Figure: the line Y = β0 + β1x1 plotted against x1, with intercept β0 and slope β1 marked]

β1 is the amount by which Y increases if x1 increases by 1 and none of the other x variables change.
β0 is the value of Y when all of the x variables are equal to 0.

Estimating Parameters

The βj in the previous equation are referred to as parameters or coefficients. (Don't use the expression "beta coefficients": it is ambiguous.)
We need to obtain estimates of them from the data we have collected.
Estimates are normally given roman letters b0, b1, ..., bp.
The values given to the bj are those which minimise Σ(Y − Ŷ)²: hence "least squares" estimates.
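
For a single predictor, this minimisation has a closed form; the slide does not show it, but the standard least squares solution is:

  b1 = Σ(xi − x̄)(Yi − Ȳ) / Σ(xi − x̄)²
  b0 = Ȳ − b1x̄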

Inference on Parameters

If the assumptions hold, the sampling distribution of bj is normal with mean βj and variance σ²/(n·s²x) (for sufficiently large n), where:
  σ² is the variance of the error terms ε,
  s²x is the variance of xj, and
  n is the number of observations.
We can perform t-tests of hypotheses about βj (e.g. βj = 0).
We can also produce a confidence interval for βj.
Inference on β0 (the intercept) is usually not interesting.
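
Concretely, with σ replaced by its estimate, the test and interval take the usual forms (standard results, implied but not written out on the slide):

  t = bj / s.e.(bj), compared with a t distribution on n − p − 1 degrees of freedom
  95% CI: bj ± t(0.975; n − p − 1) × s.e.(bj)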

Inference on the Predicted Value

  Y = β0 + β1x1 + ... + βpxp + ε
  Predicted value: Ŷ = b0 + b1x1 + ... + bpxp

Observed values will differ from predicted values because of:
  random error (ε), and
  uncertainty about the parameters βj.
We can calculate a 95% prediction interval, within which we would expect 95% of observations to lie: a reference range for Y.

Prediction Interval

[Figure: Y1 against x1, with the fitted line and 95% prediction interval]

Inference on the Mean

The mean value of Y at a given value of x does not depend on ε.
The standard error of Ŷ is called the standard error of the prediction (by Stata).
We can calculate a 95% confidence interval for Ŷ.
This can be thought of as a confidence region for the regression line.
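
A sketch of how both intervals can be obtained in Stata after a regression (variable names here are illustrative; stdp gives the standard error of the mean prediction, stdf the standard error for a single new observation):

  regress y x
  predict yhat, xb                // linear prediction
  predict se_mean, stdp           // SE of the predicted mean
  predict se_obs, stdf            // SE of a forecast (single new observation)
  * 95% confidence interval for the mean of Y at each x
  generate lo_ci = yhat - invttail(e(df_r), 0.025)*se_mean
  generate hi_ci = yhat + invttail(e(df_r), 0.025)*se_mean
  * 95% prediction interval for a new observation
  generate lo_pi = yhat - invttail(e(df_r), 0.025)*se_obs
  generate hi_pi = yhat + invttail(e(df_r), 0.025)*se_obs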

Confidence Interval

[Figure: Y1 against x1, with the fitted line and 95% confidence interval]

Analysis of Variance (ANOVA)

The variance of Y can be decomposed:

  Σ(Y − Ȳ)²/(n − 1) = [Σ(Y − Ŷ)² + Σ(Ŷ − Ȳ)²]/(n − 1)

SSreg = Σ(Ŷ − Ȳ)² (regression sum of squares)
SSres = Σ(Y − Ŷ)² (residual sum of squares)
Each part has associated degrees of freedom: p d.f. for the regression, n − p − 1 for the residual.
The mean square is MS = SS/d.f.
MSreg should be similar to MSres if there is no association between Y and x.
F = MSreg/MSres gives a measure of the strength of the association between Y and x.

ANOVA Table

Source      df         Sum of Squares   Mean Square                 F
Regression  p          SSreg            MSreg = SSreg/p             MSreg/MSres
Residual    n − p − 1  SSres            MSres = SSres/(n − p − 1)
Total       n − 1      SStot            MStot = SStot/(n − 1)

Goodness of Fit

The predictive value of a model depends on how much of the variance it can explain.
R² is the proportion of the variance explained by the model: R² = SSreg/SStot.
R² always increases when a predictor variable is added.
Adjusted R² is better for comparing models.
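
The slide does not give the adjustment, but the standard definition, which penalises additional predictors, is:

  Adjusted R² = ((n − 1)R² − p)/(n − p − 1) = 1 − (1 − R²)(n − 1)/(n − p − 1)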

Stata Commands for Linear Models

The basic command for linear regression is

  regress y-var x-vars

You can use by and if to select subgroups.
The command predict can produce predicted values, standard errors, residuals, etc.
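
For instance, a minimal run using the auto dataset shipped with Stata (the dataset and new variable names are purely illustrative):

  sysuse auto, clear
  regress mpg weight           // regress fuel efficiency on weight
  predict mpghat, xb           // predicted values
  predict res, residuals       // raw residuals
  predict sres, rstandard      // standardised residuals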

Stata Output 1: ANOVA Table

F(p, n − p − 1)   F statistic for the hypothesis βj = 0 for all j
Prob > F          p-value for the above hypothesis test
R-squared         Proportion of variance explained by the regression = SSModel/SSTotal
Adj R-squared     ((n − 1)R² − p)/(n − p − 1)
Root MSE          √MSResidual = σ̂

Stata Output 1: Example

      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =   17.99
       Model |  27.5100011     1  27.5100011           Prob > F      =  0.0022
    Residual |  13.7626904     9  1.52918783           R-squared     =  0.6665
-------------+------------------------------           Adj R-squared =  0.6295
       Total |  41.2726916    10  4.12726916           Root MSE      =  1.2366
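
These numbers tie back to the ANOVA formulas: F = MSModel/MSResidual = 27.5100/1.5292 ≈ 17.99, R² = SSModel/SSTotal = 27.5100/41.2727 ≈ 0.6665, and Root MSE = √1.5292 ≈ 1.2366.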

Stata Output 2: Coefficients

Coef.                 Estimate of the parameter β for the variable in the left-hand column (β0 is labelled _cons, for "constant").
Std. Err.             Standard error of b.
t                     The value of b/s.e.(b), used to test the hypothesis that β = 0.
P>|t|                 p-value resulting from the above hypothesis test.
[95% Conf. Interval]  A 95% confidence interval for β.

Stata Output 2: Example

------------------------------------------------------------------------------
       Y |     Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |  .5000909   .1179055    4.241    0.002      .2333701    .7668117
   _cons |  3.000091   1.124747    2.667    0.026      .4557369    5.544445
------------------------------------------------------------------------------
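
As a check on how the columns relate: t = .5000909/.1179055 ≈ 4.241, and with 9 residual degrees of freedom t(0.975; 9) ≈ 2.262, so the interval is .5000909 ± 2.262 × .1179055 ≈ (.2334, .7668), matching the output.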

Does a Linear Model Fit This Data?

Is a linear model appropriate? Does it provide adequate predictions?
Do my data satisfy the assumptions of the linear model?
Are there any individual points having an inordinate influence on the model?

Anscombe's Data

[Figure: four scatter plots (Y1, Y2, Y3, Y4 against x1 or x2) which give essentially identical regression output but show very different patterns]

Linear Model Assumptions

Linear models are based on 4 assumptions:
  The mean of Yi is a linear function of xi.
  The variables Y1, Y2, ..., Yn are independent.
  The variance of Yi|xi is constant.
  The distribution of Yi|xi is normal.
If any of these are incorrect, inference from the regression model is unreliable.
We may know about the assumptions from the experimental design (e.g. repeated measures on an individual are unlikely to be independent).
We should test all 4 assumptions.

Distribution of Residuals

Error term:    εi = Yi − (β0 + β1x1i + β2x2i + ... + βpxpi)
Residual term: ei = Yi − (b0 + b1x1i + b2x2i + ... + bpxpi) = Yi − Ŷi
These are nearly but not quite the same, since our estimates of the βj are imperfect.
Predicted values vary more at the extremes of the x-range (those points have greater leverage).
Hence residuals vary less at the extremes of the x-range.
So even if the error terms have constant variance, the residuals don't.

Standardised Residuals

The variation in the variance of the residuals as x changes is predictable, so we can correct for it.
Standardised residuals have mean 0 and standard deviation 1.
We can use standardised residuals to test the assumptions of the linear model:
  predict Yhat, xb will generate predicted values,
  predict sres, rstand will generate standardised residuals, and
  scatter sres Yhat will produce a plot of the standardised residuals against the fitted values.

Testing Constant Variance

The residuals should be independent of the predicted values: there should be no pattern in this plot.
Common patterns:
  The spread of the residuals increases with the fitted values. This is called heteroskedasticity. It may be removed by transforming Y, and can be formally tested for with hettest.
  There is curvature: the association between the x and Y variables is not linear. You may need to transform Y or x, or alternatively fit x², x³ etc. terms. This can be formally tested for with ovtest.
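
A sketch of the formal checks (the auto dataset is used purely for illustration; estat hettest and estat ovtest are the current spellings of these postestimation commands):

  sysuse auto, clear
  regress mpg weight
  estat hettest        // Breusch-Pagan test for heteroskedasticity
  estat ovtest         // Ramsey RESET test for omitted higher-order terms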

Residual vs Fitted Value Plot Examples

[Figure: two residual-vs-fitted plots: (a) non-constant variance, (b) non-linear association]

Testing Linearity: Partial Residual Plots

Partial residual: pj = e + bjxj = Y − b0 − Σ(l≠j) blxl
This is formed by subtracting that part of the predicted value that does not depend on xj from the observed value of Y.
A plot of pj against xj shows the association between Y and xj after adjusting for the other predictors.
It can be obtained from Stata by typing cprplot xvar after performing a regression.
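
For example (variable names hypothetical; the lowess option overlays a smooth curve to make curvature easier to see):

  regress y x1 x2
  cprplot x1, lowess   // partial residuals for x1, with a lowess curve overlaid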

Example Partial Residual Plot

[Figure: partial residuals e(Y2 | X, x1) + b·x1 plotted against x1, with the linear prediction overlaid]

Identifying Outliers

Points which have a marked effect on the regression equation are called influential points.
Points with unusual x-values are said to have high leverage.
Points with high leverage may or may not be influential, depending on their Y values.
A plot of the studentised residual (the residual from a regression excluding that point) against leverage can show influential points.

Statistics to Identify Influential Points

DFBETA           Measures the influence of an individual point on a single coefficient βj.
DFFITS           Measures the influence of an individual point on its predicted value.
Cook's Distance  Measures the influence of an individual point on all predicted values.
All can be produced by predict.
There are suggested cut-offs to determine influential observations, but it may be better to simply look for outliers.
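
A sketch of producing these statistics after a regression (model and new variable names hypothetical):

  regress y x1 x2
  predict lev, leverage    // leverage (hat) values
  predict rstu, rstudent   // studentised residuals
  predict cook, cooksd     // Cook's distance
  predict dfit, dfits      // DFFITS
  dfbeta                   // DFBETAs: one new variable per coefficient
  scatter rstu lev         // studentised residual vs leverage plot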

Y-outliers

A point with normal x-values and an abnormal Y-value may still be influential.
Robust regression can be used in this case: observations are repeatedly reweighted, with the weight decreasing as the magnitude of the residual increases.
Methods that are robust to x-outliers are very computationally intensive.
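
In Stata this iteratively reweighted approach is available as rreg; a minimal sketch with hypothetical variables:

  regress y x1    // ordinary least squares, for comparison
  rreg y x1       // robust regression: downweights large residuals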

Robust Regression

[Figure: Y3 against x1, comparing the LS regression line with the robust regression line]

Testing Normality

The standardised residuals should follow a normal distribution.
This can be tested formally with swilk varname.
It can be tested graphically with qnorm varname.
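
For example, applied to standardised residuals generated after a regress (variable name illustrative):

  predict sres, rstandard   // standardised residuals
  swilk sres                // Shapiro-Wilk test of normality
  qnorm sres                // normal quantile plot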

Normal Plot: Example

[Figure: two normal quantile plots of standardised residuals against the inverse normal]

Graphical Assessment & Formal Testing

We can test the assumptions both formally and informally; both approaches have advantages and disadvantages.
Tests are always significant in sufficiently large samples, even when the differences are slight and unimportant.
Conversely, differences may be marked but non-significant in small samples.
It is best to use both.