General Linear Model (Chapter 4)


Outline:
- Outcome variable is considered continuous
- Simple linear regression; scatterplots
- OLS is BLUE under basic assumptions
- MSE estimates the residual variance
- Testing regression coefficients and confidence intervals
- Centering and standardizing variables
- Regression coefficients with continuous versus categorical predictors

Cholesterol predicting blood pressure

A toy example: suppose we randomly select ten patients from a clinic and measure their blood pressure and cholesterol levels at their visit. Investigators are interested in the relationship between total blood cholesterol and blood pressure (the ratio systolic/diastolic).

10 BP ratios: (1.51, 1.63, 1.52, 1.43, 1.58, 1.50, 1.66, 1.55, 1.60, 1.49)
10 cholesterol levels: (190, 230, 175, 200, 245, 195, 300, 210, 235, 290)

Scatterplot of BP ratios versus cholesterol levels. What can we see?

Simple Linear Regression

Estimate the expected (mean) value of the blood pressure ratio given a particular value of cholesterol.

Assume a linear model for the mean: E(Y_i | X_i) = β0 + β1 X_i
Assume the error is additive with mean zero: Y_i = β0 + β1 X_i + ε_i, with E(ε_i) = 0
So, in general matrix form: Y = Xβ + ε

Ordinary Least Squares (OLS): Estimation

OLS chooses β̂ to minimize the residual sum of squares Σ(y_i − x_iᵀβ)²; the solution is β̂ = (XᵀX)⁻¹XᵀY.

Hypothesis test

To test H0: β1 = 0 versus H1: β1 ≠ 0, use t = β̂1 / SE(β̂1), which follows a t distribution with n − p degrees of freedom under H0 (here n − 2 = 8).

Hypothesis test (cont.)
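As an illustrative check (not part of the original slides), the closed-form simple-regression estimates can be reproduced from the toy data in pure Python. The variable names are ours; the numbers are the ones shown in the SAS/Stata output below.

```python
# Hedged sketch: closed-form OLS for simple linear regression,
# y_i = b0 + b1 * x_i + e_i, applied to the toy BP/cholesterol data.
chol = [190, 230, 175, 200, 245, 195, 300, 210, 235, 290]
bp   = [1.51, 1.63, 1.52, 1.43, 1.58, 1.50, 1.66, 1.55, 1.60, 1.49]

n = len(chol)
xbar = sum(chol) / n
ybar = sum(bp) / n

# Sums of squares and cross-products around the means
Sxx = sum((x - xbar) ** 2 for x in chol)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(chol, bp))

b1 = Sxy / Sxx          # slope
b0 = ybar - b1 * xbar   # intercept

print(round(b1, 8))  # 0.00084187, as in the SAS/Stata output
print(round(b0, 4))  # 1.3559 (SAS shows 1.35590)
```

The same two lines of algebra (b1 = Sxy/Sxx, b0 = ȳ − b1·x̄) are what proc reg and Stata's reg compute for one predictor.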

SAS output

proc reg data=bp;
  model bp = chol;
run;

Explanation of the outputs: http://www.ats.ucla.edu/stat/sas/output/reg.htm

Model: MODEL1
Dependent Variable: bp
Number of Observations Read  10
Number of Observations Used  10

Analysis of Variance
                          Sum of       Mean
Source           DF      Squares     Square   F Value   Pr > F
Model             1      0.01121    0.01121      2.67   0.1411
Error             8      0.03360    0.00420
Corrected Total   9      0.04481

Root MSE        0.06481   R-Square   0.2501
Dependent Mean  1.54700   Adj R-Sq   0.1563
Coeff Var       4.18952

Parameter Estimates
                   Parameter     Standard
Variable     DF     Estimate        Error   t Value   Pr > |t|
Intercept     1      1.35590      0.11879     11.41     <.0001
chol          1   0.00084187   0.00051545      1.63     0.1411

Stata output

. reg bp chol

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =    2.67
       Model |  .011205324     1  .011205324           Prob > F      =  0.1411
    Residual |  .033604686     8  .004200586           R-squared     =  0.2501
-------------+------------------------------           Adj R-squared =  0.1563
       Total |   .04481001     9   .00497889           Root MSE      =  .06481

------------------------------------------------------------------------------
          bp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        chol |   .0008419   .0005155     1.63   0.141    -.0003468    .0020305
       _cons |   1.355895   .1187893    11.41   0.000     1.081966    1.629823
------------------------------------------------------------------------------

Explanation of the outputs: http://www.ats.ucla.edu/stat/stata/output/reg_output.htm

Interpretation of output

Typically the main target of interest is whether the regression coefficients are significant.
- What is the conclusion? Is testing the intercept interesting in the current model?
- How do you interpret the estimate for chol?
- How would you estimate the expected increase in BP for a 10-unit increase in cholesterol?
- How do you create a 95% confidence interval for the parameter estimates? Know how to do it by hand. Question: which t-value to use: a. 1.633, b. qt(.975,8)=2.306, c. qt(.975,9)=2.262, d. qt(.975,1)=12.706, e. 1.96? In SAS, add options to the model statement after /: model bp = chol / clb alpha=.05;
- What are the different parts of the ANOVA? F-test, MSE.
- What is the standard deviation of the BP ratio?
- What would the Pearson correlation coefficient be between chol and BP? What would its p-value be?
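The by-hand confidence interval can be verified with a few lines of arithmetic. This illustrative sketch (not from the slides) takes the printed estimate and standard error and uses qt(.975, 8) = 2.306, the critical value quoted above, since df = n − 2 = 8.

```python
# Hedged check: t statistic and 95% CI for the chol slope, by hand.
b1, se_b1 = 0.00084187, 0.00051545  # estimate and SE from the SAS output
t_crit = 2.306                      # qt(.975, 8), as quoted in the slide

t_stat = b1 / se_b1
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

print(round(t_stat, 2))                   # 1.63, as in the output
print(round(ci[0], 7), round(ci[1], 7))   # matches Stata's -.0003468, .0020305
print(round(10 * b1, 4))                  # 0.0084: expected BP increase per 10-unit chol increase
```

Note that the correct critical value is qt(.975, 8): the residual degrees of freedom, not n − 1 and not the normal approximation.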

Standard Deviation vs. Standard Error

Suppose n samples with sample mean x̄.

Standard deviation: SD = sqrt( Σ(x_i − x̄)² / (n − 1) ) tells us the distribution of individual values around the mean. (If we draw another observation from the same population, it will likely fall within x̄ ± 3SD.)

Standard error of the mean: SE = SD / √n tells us the distribution of the means, i.e., it is the standard deviation of the sampling distribution of the mean. (If we draw another set of samples from the same population, the mean of the new samples will likely fall within x̄ ± 3SE.)

Standard error of the regression: an estimate of the standard deviation of the underlying errors. Recall the estimated error variance in OLS: σ̂² = MSE.

Sum of Squares

Total Sum of Squares (TSS): TSS = Σ_{i=1}^n (y_i − ȳ)² — total variability of the outcome.
Model Sum of Squares (MSS): MSS = Σ_{i=1}^n (ŷ_i − ȳ)² — variability explained by the model.
Residual Sum of Squares (RSS): RSS = Σ_{i=1}^n (y_i − ŷ_i)² — variability not explained by the model.

TSS = MSS + RSS

Estimate of the variance of ε: RSS/(n − p) (Mean Square Error, MSE).
Coefficient of determination: R² = MSS/TSS. Interpretation: the proportion of the total variability of the outcome (TSS) that is accounted for by the model (MSS). A statistically significant predictor does not necessarily imply a large R².
Adjusted R²: 1 − (n − 1)(1 − R²)/(n − p), which adjusts for the number of predictors in the model.
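The decomposition above can be checked numerically on the toy data. This is an illustrative pure-Python sketch (variable names are ours) reproducing the ANOVA table entries.

```python
# Hedged check: TSS = MSS + RSS, MSE, R^2, and adjusted R^2 for the toy data.
chol = [190, 230, 175, 200, 245, 195, 300, 210, 235, 290]
bp   = [1.51, 1.63, 1.52, 1.43, 1.58, 1.50, 1.66, 1.55, 1.60, 1.49]

n = len(bp)
xbar, ybar = sum(chol) / n, sum(bp) / n
b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(chol, bp))
      / sum((x - xbar) ** 2 for x in chol))
b0 = ybar - b1 * xbar
fitted = [b0 + b1 * x for x in chol]

TSS = sum((y - ybar) ** 2 for y in bp)
MSS = sum((f - ybar) ** 2 for f in fitted)
RSS = sum((y - f) ** 2 for y, f in zip(bp, fitted))

p = 2                      # estimated coefficients: intercept + slope
MSE = RSS / (n - p)
R2 = MSS / TSS
adjR2 = 1 - (n - 1) * (1 - R2) / (n - p)

print(round(TSS, 5), round(MSS, 5), round(RSS, 5))  # 0.04481 0.01121 0.0336
print(round(MSE, 5), round(R2, 4), round(adjR2, 4)) # 0.0042 0.2501 0.1563
```

These match the Corrected Total, Model, and Error rows of the ANOVA table and the R-Square / Adj R-Sq entries in the SAS output.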

Fitted regression line

[Figure: scatterplot of bp versus chol with the fitted regression line.]

Fitted regression line with confidence interval

We can also obtain a confidence interval for the fitted means.

[Figure: scatterplot of bp versus chol with the fitted line and 95% confidence band.]

Stata: twoway lfitci
SAS: proc sgscatter; plot / reg=(clm); run;

Fitted Mean

Fitted mean: Ŷ = Xβ̂ = X(XᵀX)⁻¹XᵀY = HY, where H = X(XᵀX)⁻¹Xᵀ is called the hat matrix (projection matrix), and Ŷ ~ N(Xβ, σ²H).

For given covariates X = x₀:
Var(Ŷ | x₀) = σ² x₀ᵀ(XᵀX)⁻¹x₀
95% CI: x₀ᵀβ̂ ± t_{α/2, n−p} σ̂ sqrt( x₀ᵀ(XᵀX)⁻¹x₀ )

Interpretation of the CI: if we repeat the study a large number of times using the same values of X, 95% of the time the observed CIs would bracket the true mean response, E(Y | x₀).

Predicted Mean

For a future observation (not included in the model fitting): Y* = Xβ + ε*, predicted by Ŷ* = Xβ̂; the prediction error Y* − Ŷ* has variance σ²(I + H).

For given covariates X = x₀:
Var(Y* | x₀) = σ² (1 + x₀ᵀ(XᵀX)⁻¹x₀)
95% CI: x₀ᵀβ̂ ± t_{α/2, n−p} σ̂ sqrt( 1 + x₀ᵀ(XᵀX)⁻¹x₀ )

The CI for a predicted observation (the prediction interval) is wider than that for the fitted mean.
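For one predictor, the quadratic form x₀ᵀ(XᵀX)⁻¹x₀ reduces to 1/n + (x₀ − x̄)²/Sxx, so both standard errors can be checked by hand. This illustrative sketch (our function names, with the Root MSE taken from the output) contrasts the fitted-mean and prediction standard errors.

```python
# Hedged sketch: SEs of the fitted mean vs. a new prediction at x0, using
# the simple-regression identity x0'(X'X)^{-1}x0 = 1/n + (x0 - xbar)^2/Sxx.
import math

chol = [190, 230, 175, 200, 245, 195, 300, 210, 235, 290]
n = len(chol)
xbar = sum(chol) / n
Sxx = sum((x - xbar) ** 2 for x in chol)
sigma_hat = 0.06481  # Root MSE from the regression output

def se_fitted(x0):
    # SE of the estimated mean response at x0
    return sigma_hat * math.sqrt(1 / n + (x0 - xbar) ** 2 / Sxx)

def se_predicted(x0):
    # SE for predicting a single new observation at x0 (extra "1 +" term)
    return sigma_hat * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)

# At x0 = xbar, the fitted-mean SE collapses to sigma_hat / sqrt(n):
print(round(se_fitted(xbar), 4))           # 0.0205, the intercept SE after centering
# The prediction SE is always larger, so the prediction band is wider:
print(se_predicted(227) > se_fitted(227))  # True
```

The value 0.0205 reappears below as the intercept standard error once the predictor is centered, since the intercept then estimates the mean response at x̄.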

Centering and standardizing variables

What happens if we center the X variable, i.e., create X_i^c = X_i − X̄, and redo the OLS regression, this time of Y on X_i^c?
- How do the estimates and their standard errors change?
- How do the elements of the ANOVA change? What about R²?
- What is the interpretation of the confidence interval for β₀?

What if we standardize the X variable, i.e., X_i^s = (X_i − X̄)/sd(X)? How do we interpret the coefficients?
What if we standardize both the X and Y variables? How do we interpret them?

Centering predictor

1. The predictor is centered:

Root MSE   0.06481   R-Square   0.2501

                   Parameter     Standard
Variable     DF     Estimate        Error   t Value   Pr > |t|
Intercept     1      1.54700      0.02050     75.48     <.0001
chol_c        1   0.00084187   0.00051545      1.63     0.1411

[Figures: scatterplots of bp versus chol and versus centered chol, with fitted lines.]

Standardizing predictor

2. The predictor is standardized:

Root MSE   0.06481   R-Square   0.2501

                  Parameter   Standard
Variable    DF     Estimate      Error   t Value   Pr > |t|
Intercept    1      1.54700    0.02050     75.48     <.0001
chol_std     1      0.03529    0.02160      1.63     0.1411

[Figures: scatterplots of bp versus chol and versus standardized chol, with fitted lines.]

Standardizing both outcome and predictor

3. Both the outcome and predictor are standardized:

Root MSE   0.91852   R-Square   0.2501

                   Parameter   Standard
Variable    DF      Estimate      Error   t Value   Pr > |t|
Intercept    1   -2.1982E-15    0.29046     -0.00     1.0000
chol_std     1       0.50006    0.30617      1.63     0.1411

[Figures: scatterplots of bp versus chol and of standardized bp versus standardized chol, with fitted lines.]
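The three transformations above can be verified directly: centering leaves the slope unchanged and turns the intercept into ȳ; standardizing X scales the slope by sd(X); standardizing both X and Y makes the slope equal to the Pearson correlation. This is an illustrative pure-Python check (our helper names), not the course's SAS code.

```python
# Hedged check of centering/standardizing effects on the toy-data regression.
import math

chol = [190, 230, 175, 200, 245, 195, 300, 210, 235, 290]
bp   = [1.51, 1.63, 1.52, 1.43, 1.58, 1.50, 1.66, 1.55, 1.60, 1.49]
n = len(chol)

def ols_slope(x, y):
    xbar, ybar = sum(x) / n, sum(y) / n
    return (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
            / sum((a - xbar) ** 2 for a in x))

def sd(v):
    vbar = sum(v) / n
    return math.sqrt(sum((a - vbar) ** 2 for a in v) / (n - 1))

xbar, ybar = sum(chol) / n, sum(bp) / n
x_c   = [x - xbar for x in chol]                # centered predictor
x_std = [(x - xbar) / sd(chol) for x in chol]   # standardized predictor
y_std = [(y - ybar) / sd(bp) for y in bp]       # standardized outcome

print(round(ols_slope(x_c, bp), 8))    # 0.00084187: slope unchanged by centering
print(round(ybar, 3))                  # 1.547: the intercept after centering is mean(bp)
print(round(ols_slope(x_std, bp), 5))  # b1 * sd(chol), the chol_std coefficient
print(round(ols_slope(x_std, y_std), 5))  # 0.50006: Pearson r, and sqrt(R-Square)
```

Note that Root MSE, R², the t statistic, and the p-value are identical across the first two fits: these linear rescalings of X change only how the coefficients are expressed.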

Continuous vs. categorical predictor

Continuous predictor X:
- β₁ is interpreted as the slope of the line
- β₀ is the intercept, which corresponds to the mean outcome when X = 0

Categorical predictor X: create dummy (0/1) variables
- β₁ is interpreted as the mean difference in outcome comparing a specific group to the reference group
- β₀ is interpreted as the mean outcome in the reference group

Categorical predictor For our BP ratio-chol example, suppose we also have Gender information. Create a 0-1 variable where 1 indicates Male and 0 indicates Female. Regress BP ratio on Gender:

SAS output:

proc reg data=bp;
  model bp = gender;
run;

Analysis of Variance
                           Sum of         Mean
Source           DF       Squares       Square   F Value   Pr > F
Model             1    0.00073500   0.00073500      0.13   0.7244
Error             8       0.04407      0.00551
Corrected Total   9       0.04481

Root MSE        0.07423   R-Square    0.0164
Dependent Mean  1.54700   Adj R-Sq   -0.1065
Coeff Var       4.79801

Parameter Estimates
                  Parameter   Standard
Variable    DF     Estimate      Error   t Value   Pr > |t|
Intercept    1      1.54000    0.03030     50.82     <.0001
gender       1      0.01750    0.04791      0.37     0.7244

What is 1.54? What is 0.0175?

Stata output:

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =    0.13
       Model |  .000734998     1  .000734998           Prob > F      =  0.7244
    Residual |  .044075012     8  .005509376           R-squared     =  0.0164
-------------+------------------------------           Adj R-squared = -0.1065
       Total |   .04481001     9   .00497889           Root MSE      =  .07423

------------------------------------------------------------------------------
          bp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |      .0175   .0479121     0.37   0.724    -.0929856    .1279856
       _cons |       1.54   .0303023    50.82   0.000     1.470123    1.609877
------------------------------------------------------------------------------

Testing gender using a two-sample t-test

gender         N     Mean   Std Dev   Std Err   Minimum   Maximum
0              6   1.5400    0.0759    0.0310    1.4300    1.6300
1              4   1.5575    0.0714    0.0357    1.5000    1.6600
Diff (1-2)        -0.0175    0.0742    0.0479

gender       Method           Mean       95% CL Mean     Std Dev   95% CL Std Dev
0                           1.5400   1.4604   1.6196      0.0759   0.0474   0.1861
1                           1.5575   1.4440   1.6710      0.0714   0.0404   0.2661
Diff (1-2)   Pooled        -0.0175  -0.1280   0.0930      0.0742   0.0501   0.1422
Diff (1-2)   Satterthwaite -0.0175  -0.1296   0.0946

Method          Variances       DF   t Value   Pr > |t|
Pooled          Equal            8     -0.37     0.7244
Satterthwaite   Unequal     6.8826     -0.37     0.7223

Equality of Variances
Method     Num DF   Den DF   F Value   Pr > F
Folded F        5        3      1.13   0.9814

How do these results match up with those from the regression?
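The match with the regression can be verified by hand from the summary statistics alone. This illustrative sketch (not from the slides) computes the pooled equal-variance t-test and recovers the regression quantities exactly.

```python
# Hedged check: pooled two-sample t-test from the group summaries above.
import math

n0, mean0, sd0 = 6, 1.5400, 0.0759  # gender = 0 (Female, reference group)
n1, mean1, sd1 = 4, 1.5575, 0.0714  # gender = 1 (Male)

# Pooled variance estimate, df = n0 + n1 - 2 = 8
pooled_var = ((n0 - 1) * sd0**2 + (n1 - 1) * sd1**2) / (n0 + n1 - 2)
se_diff = math.sqrt(pooled_var * (1 / n0 + 1 / n1))
t = (mean0 - mean1) / se_diff

print(round(pooled_var, 5))  # 0.00551, the Error mean square in the ANOVA
print(round(se_diff, 4))     # 0.0479, the regression SE for gender
print(round(t, 2))           # -0.37, the regression t up to sign
```

The regression slope (0.0175 = mean1 − mean0), its SE, and its t statistic reproduce the pooled t-test exactly; only the sign of t flips because SAS reports Diff (1-2) as group 0 minus group 1.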

Technically we are still estimating a line. What does the intercept represent? What does the slope represent?