General Linear Model (Chapter 4)
- Outcome variable is considered continuous
- Simple linear regression
- Scatterplots
- OLS is BLUE under basic assumptions
- MSE estimates residual variance
- Testing regression coefficients and confidence intervals
- Centering and standardizing variables
- Regression coefficients with continuous versus categorical predictors
Cholesterol predicting blood pressure

A toy example: suppose we randomly select ten patients from a clinic and measure their blood pressure and cholesterol levels at their visit. Investigators are interested in the relationship between total blood cholesterol and blood pressure (the ratio of systolic/diastolic).

10 BP ratios: (1.51, 1.63, 1.52, 1.43, 1.58, 1.50, 1.66, 1.55, 1.60, 1.49)
10 cholesterol levels: (190, 230, 175, 200, 245, 195, 300, 210, 235, 290)

Scatterplot of BP ratios versus cholesterol levels. What can we see?
Simple Linear Regression

Estimate the expected value (mean value) of the blood pressure ratio given a particular value of cholesterol.

Assume a linear model for the mean: E(Y_i | X_i) = β_0 + β_1 X_i

Assume the error is additive with mean zero: Y_i = β_0 + β_1 X_i + ε_i, with E(ε_i) = 0

So, we have in general matrix form: Y = Xβ + ε
Ordinary Least Squares (OLS): Estimation

Choose β̂ to minimize the residual sum of squares Σ (Y_i − β_0 − β_1 X_i)^2. In matrix form the solution is

β̂ = (X^T X)^{-1} X^T Y

which is unbiased for β, with Var(β̂) = σ^2 (X^T X)^{-1}.
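As a numerical check of the OLS estimator β̂ = (X^T X)^{-1} X^T Y on the toy data, here is a minimal NumPy sketch (the slides' own tools are SAS and Stata; Python is used here purely for illustration):

```python
import numpy as np

# Toy data from the slides: 10 BP ratios and 10 cholesterol levels
bp   = np.array([1.51, 1.63, 1.52, 1.43, 1.58, 1.50, 1.66, 1.55, 1.60, 1.49])
chol = np.array([190, 230, 175, 200, 245, 195, 300, 210, 235, 290])

# Design matrix X with an intercept column
X = np.column_stack([np.ones(len(chol)), chol])

# OLS: solve the normal equations (X'X) beta = X'y
# (np.linalg.solve is numerically safer than explicitly inverting X'X)
beta_hat = np.linalg.solve(X.T @ X, X.T @ bp)
print(beta_hat)  # ≈ [1.35590, 0.00084187], matching the SAS/Stata output
```

The two components reproduce the Intercept and chol estimates in the regression output that follows.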
Hypothesis test

To test H_0: β_1 = 0 versus H_1: β_1 ≠ 0, use the statistic t = β̂_1 / SE(β̂_1), which under H_0 follows a t distribution with n − p degrees of freedom (n − 2 in simple linear regression).
Hypothesis test (cont.)

A 100(1 − α)% confidence interval for β_1 is β̂_1 ± t_{α/2, n−p} SE(β̂_1). We reject H_0 at level α exactly when this interval excludes 0.
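The t statistic can be computed by hand from the toy data, using SE(β̂_1) = sqrt(MSE / S_xx). A short Python sketch (illustrative only; the slides use SAS/Stata):

```python
import numpy as np
from scipy import stats

bp   = np.array([1.51, 1.63, 1.52, 1.43, 1.58, 1.50, 1.66, 1.55, 1.60, 1.49])
chol = np.array([190, 230, 175, 200, 245, 195, 300, 210, 235, 290])
n = len(bp)

# Slope and intercept via the usual sums of squares
Sxx = np.sum((chol - chol.mean()) ** 2)
Sxy = np.sum((chol - chol.mean()) * (bp - bp.mean()))
b1 = Sxy / Sxx
b0 = bp.mean() - b1 * chol.mean()

# Residual variance (MSE) on n - 2 degrees of freedom
resid = bp - (b0 + b1 * chol)
mse = np.sum(resid ** 2) / (n - 2)

# t statistic and two-sided p-value for H0: beta1 = 0
se_b1 = np.sqrt(mse / Sxx)
t_stat = b1 / se_b1
p_val = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(round(t_stat, 2), p_val)  # t ≈ 1.63, p ≈ 0.141
```

These match the chol row of the SAS/Stata output (t = 1.63, Pr > |t| = 0.1411).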
SAS output

proc reg data=bp;
  model bp = chol;
run;

Explanation of the outputs: http://www.ats.ucla.edu/stat/sas/output/reg.htm

Model: MODEL1
Dependent Variable: bp
Number of Observations Read  10
Number of Observations Used  10

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1          0.01121       0.01121      2.67    0.1411
Error              8          0.03360       0.00420
Corrected Total    9          0.04481

Root MSE          0.06481   R-Square   0.2501
Dependent Mean    1.54700   Adj R-Sq   0.1563
Coeff Var         4.18952

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              1.35590          0.11879     11.41     <.0001
chol         1           0.00084187       0.00051545      1.63     0.1411
Stata output

. reg bp chol

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =    2.67
       Model |  .011205324     1  .011205324           Prob > F      =  0.1411
    Residual |  .033604686     8  .004200586           R-squared     =  0.2501
-------------+------------------------------           Adj R-squared =  0.1563
       Total |   .04481001     9   .00497889           Root MSE      =  .06481

------------------------------------------------------------------------------
          bp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        chol |   .0008419   .0005155     1.63   0.141    -.0003468    .0020305
       _cons |   1.355895   .1187893    11.41   0.000     1.081966    1.629823
------------------------------------------------------------------------------

Explanation of the outputs: http://www.ats.ucla.edu/stat/stata/output/reg_output.htm
Interpretation of output

- Typically the main target of interest is whether the regression coefficients are significant. What is the conclusion? Is testing the intercept in the current model interesting?
- How do you interpret the estimate for chol? How do you estimate the expected increase in BP for a 10-unit increase in cholesterol?
- How do you create a 95% confidence interval for the parameter estimates? Know how to do it by hand. Question: which t-value to use? a. 1.633, b. qt(.975,8)=2.306, c. qt(.975,9)=2.262, d. qt(.975,1)=12.706, e. 1.96
- In SAS, add options to the model statement after the slash: model bp = chol / clb alpha=.05;
- What are the different parts of the ANOVA? F-test, MSE
- What is the standard deviation of the BP ratio?
- What would the Pearson correlation coefficient be between chol and BP? What would its p-value be?
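The by-hand 95% confidence interval uses the t critical value with n − p = 8 degrees of freedom (answer b in the quiz above). A quick Python check, plugging in the estimate and standard error read off the regression output (Python is illustrative; the slides use SAS/Stata):

```python
from scipy import stats

# Estimate and standard error for chol, read from the regression output
b1, se_b1 = 0.00084187, 0.00051545
n, p = 10, 2  # 10 observations, 2 parameters (intercept + slope)

# Critical value: t with n - p = 8 df, not n - 1 and not the z value 1.96
t_crit = stats.t.ppf(0.975, df=n - p)
lo, hi = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(round(t_crit, 3), lo, hi)
# t_crit ≈ 2.306; CI ≈ (-0.0003468, 0.0020305), matching Stata's 95% interval
```

Multiplying the slope and its CI endpoints by 10 gives the expected BP change, with interval, for a 10-unit increase in cholesterol.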
Standard Deviation vs. Standard Error

Suppose n samples with sample mean x̄.

Standard deviation: SD = sqrt( Σ (x_i − x̄)^2 / (n − 1) ) tells us the distribution of individual values around the mean. (If we draw another sample from the same population, it will likely have a value within x̄ ± 3SD.)

Standard error of the mean: SE = SD / sqrt(n) tells us the distribution of the means, i.e., it is the standard deviation of the sampling distribution of the mean. (If we draw another set of samples from the same population, the mean of the new samples will likely be within x̄ ± 3SE.)

Standard error of the regression: an estimate of the standard deviation of the underlying errors. Recall the estimated error variance in OLS: σ̂^2 = MSE.
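The SD and SE formulas above, applied to the ten BP ratios (an illustrative Python check, not part of the slides' SAS/Stata workflow):

```python
import numpy as np

bp = np.array([1.51, 1.63, 1.52, 1.43, 1.58, 1.50, 1.66, 1.55, 1.60, 1.49])
n = len(bp)

sd = np.std(bp, ddof=1)   # sample SD: spread of individual BP ratios
se = sd / np.sqrt(n)      # SE of the mean: spread of the sample mean
print(round(sd, 4), round(se, 4))  # ≈ 0.0706 and 0.0223
```

Note that sd^2 ≈ 0.0049789, the Corrected Total mean square in the ANOVA table, which answers the "standard deviation of the BP ratio" question on the interpretation slide.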
Sum of Squares

Total Sum of Squares (TSS): TSS = Σ_{i=1}^{n} (y_i − ȳ)^2 — total variability of the outcome

Model Sum of Squares (MSS): MSS = Σ_{i=1}^{n} (ŷ_i − ȳ)^2 — variability explained by the model

Residual Sum of Squares (RSS): RSS = Σ_{i=1}^{n} (y_i − ŷ_i)^2 — variability not explained by the model

TSS = MSS + RSS

Estimate of the variance of ε: RSS/(n − p) (Mean Square Error, MSE)

Coefficient of determination: R^2 = MSS/TSS. Interpretation: the proportion of the total variability of the outcome (TSS) that is accounted for by the model (MSS). A statistically significant predictor does not necessarily suggest a large R^2.

Adjusted R^2 = 1 − (n − 1)(1 − R^2)/(n − p), which adjusts for the number of predictors in the model.
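The sum-of-squares decomposition can be verified on the toy data; this Python sketch (illustrative, not the slides' SAS/Stata) reproduces the ANOVA table and R^2:

```python
import numpy as np

bp   = np.array([1.51, 1.63, 1.52, 1.43, 1.58, 1.50, 1.66, 1.55, 1.60, 1.49])
chol = np.array([190, 230, 175, 200, 245, 195, 300, 210, 235, 290])
n, p = len(bp), 2

# Fitted values from the simple linear regression of bp on chol
b1 = np.sum((chol - chol.mean()) * (bp - bp.mean())) / np.sum((chol - chol.mean()) ** 2)
b0 = bp.mean() - b1 * chol.mean()
fitted = b0 + b1 * chol

tss = np.sum((bp - bp.mean()) ** 2)      # total variability
mss = np.sum((fitted - bp.mean()) ** 2)  # explained by the model
rss = np.sum((bp - fitted) ** 2)         # left unexplained

r2 = mss / tss
adj_r2 = 1 - (n - 1) * (1 - r2) / (n - p)
print(round(tss, 5), round(r2, 4), round(adj_r2, 4))
# ≈ 0.04481, 0.2501, 0.1563 — matching Corrected Total, R-Square, Adj R-Sq
```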
Fitted regression line

[Scatterplot of bp versus chol with the fitted regression line]
Fitted regression line with confidence interval

We can also obtain a confidence interval for the fitted means.

[Scatterplot of bp versus chol with the fitted line and 95% confidence band]

Stata: twoway lfitci bp chol
SAS: proc sgscatter data=bp; plot bp*chol / reg=(clm); run;
Fitted Mean

Fitted mean: Ŷ = Xβ̂ = X(X^T X)^{-1} X^T Y = HY, where H = X(X^T X)^{-1} X^T is called the hat matrix (projection matrix). Then Ŷ ~ N(Xβ, σ^2 H).

For given covariates X = x_0:

Var(Ŷ | x_0) = Var(x_0^T β̂) = σ^2 x_0^T (X^T X)^{-1} x_0

95% CI: x_0^T β̂ ± t_{α/2, n−p} σ̂ sqrt( x_0^T (X^T X)^{-1} x_0 )

Interpretation of the CI: if we repeat the study a large number of times using the same values of X, 95% of the time the observed CIs would bracket the true mean response E(Y | x_0).
Predicted Mean

For a future observation (not included in the model fitting): Y* = Xβ̂ + ε, so Y* ~ N(Xβ, σ^2 (I + H)).

For given covariates X = x_0:

Var(Y* | x_0) = σ^2 ( 1 + x_0^T (X^T X)^{-1} x_0 )

95% CI: x_0^T β̂ ± t_{α/2, n−p} σ̂ sqrt( 1 + x_0^T (X^T X)^{-1} x_0 )

The CI for a predicted observation is wider than that for the fitted mean.
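The extra "1 +" inside the square root is what widens the prediction interval. A short Python sketch comparing the two half-widths at a hypothetical new cholesterol value x0 = 250 (x0 is my choice for illustration; the slides do not fix a value):

```python
import numpy as np
from scipy import stats

bp   = np.array([1.51, 1.63, 1.52, 1.43, 1.58, 1.50, 1.66, 1.55, 1.60, 1.49])
chol = np.array([190, 230, 175, 200, 245, 195, 300, 210, 235, 290])
n, p = len(bp), 2

X = np.column_stack([np.ones(n), chol])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ bp
sigma2 = np.sum((bp - X @ beta) ** 2) / (n - p)  # MSE

x0 = np.array([1.0, 250.0])       # hypothetical new covariate vector
h = x0 @ XtX_inv @ x0             # leverage term x0'(X'X)^{-1} x0
t_crit = stats.t.ppf(0.975, df=n - p)

half_ci = t_crit * np.sqrt(sigma2 * h)        # CI half-width for E(Y | x0)
half_pi = t_crit * np.sqrt(sigma2 * (1 + h))  # interval half-width for a new Y
print(half_pi > half_ci)  # True: the prediction interval is always wider
```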
Centering and standardizing variables

What happens if we center the X variable, that is, create X_i' = X_i − X̄, and redo the OLS regression, this time of Y on X_i'? How do the estimates and their standard errors change? How do the elements of the ANOVA change? What about the R^2? What is the interpretation of the confidence interval for β_0?

What if we standardize the X variable, i.e., X_i' = (X_i − X̄)/sd(X)? How to interpret?

What if we standardize both the X and Y variables? How to interpret?
Centering predictor

1. The predictor is centered:

Root MSE   0.06481   R-Square   0.2501

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              1.54700          0.02050     75.48     <.0001
chol_c       1           0.00084187       0.00051545      1.63     0.1411

[Side-by-side scatterplots: bp versus chol, and bp versus centered chol]
Standardizing predictor

2. The predictor is standardized:

Root MSE   0.06481   R-Square   0.2501

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              1.54700          0.02050     75.48     <.0001
chol_std     1              0.03529          0.02160      1.63     0.1411

[Side-by-side scatterplots: bp versus chol, and bp versus standardized chol]
Standardizing both outcome and predictor

3. Both the outcome and predictor are standardized:

Root MSE   0.91852   R-Square   0.2501

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1          -2.1982E-15          0.29046     -0.00     1.0000
chol_std     1              0.50006          0.30617      1.63     0.1411

[Side-by-side scatterplots: bp versus chol, and standardized bp versus standardized chol]
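The three outputs above illustrate two facts worth verifying: centering X leaves the slope unchanged and turns the intercept into the outcome mean, and standardizing both variables makes the slope equal the Pearson correlation. An illustrative Python check (the slides use SAS/Stata):

```python
import numpy as np

bp   = np.array([1.51, 1.63, 1.52, 1.43, 1.58, 1.50, 1.66, 1.55, 1.60, 1.49])
chol = np.array([190, 230, 175, 200, 245, 195, 300, 210, 235, 290])

def slr(x, y):
    """Return (intercept, slope) of the simple linear regression of y on x."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return y.mean() - b1 * x.mean(), b1

# Centering: the intercept becomes the mean outcome (slope is unchanged)
b0_c, b1_c = slr(chol - chol.mean(), bp)
print(round(b0_c, 4))  # ≈ 1.547, the dependent mean, as in output 1

# Standardizing both: the slope equals corr(chol, bp)
z = lambda v: (v - v.mean()) / np.std(v, ddof=1)
_, b1_z = slr(z(chol), z(bp))
print(round(b1_z, 5))  # ≈ 0.50006, as in output 3
```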
Continuous vs. categorical predictor

Continuous predictor X:
- β_1 is interpreted as the slope of the line
- β_0 is the intercept, which corresponds to the mean outcome when X = 0

Categorical predictor X: create dummy (0/1) variables
- β_1 is interpreted as the mean difference in outcome comparing a specific group to the reference group
- β_0 is interpreted as the mean outcome in the reference group
Categorical predictor

For our BP ratio–chol example, suppose we also have gender information. Create a 0/1 variable where 1 indicates male and 0 indicates female. Regress BP ratio on gender: Y_i = β_0 + β_1 Gender_i + ε_i
SAS output

proc reg data=bp;
  model bp = gender;
run;

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1       0.00073500    0.00073500      0.13    0.7244
Error              8          0.04407       0.00551
Corrected Total    9          0.04481

Root MSE          0.07423   R-Square    0.0164
Dependent Mean    1.54700   Adj R-Sq   -0.1065
Coeff Var         4.79801

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              1.54000          0.03030     50.82     <.0001
gender       1              0.01750          0.04791      0.37     0.7244

What is 1.54? What is 0.0175?
Stata output

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =    0.13
       Model |  .000734998     1  .000734998           Prob > F      =  0.7244
    Residual |  .044075012     8  .005509376           R-squared     =  0.0164
-------------+------------------------------           Adj R-squared = -0.1065
       Total |   .04481001     9   .00497889           Root MSE      =  .07423

------------------------------------------------------------------------------
          bp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |      .0175   .0479121     0.37   0.724    -.0929856    .1279856
       _cons |       1.54   .0303023    50.82   0.000     1.470123    1.609877
------------------------------------------------------------------------------
Testing gender using a 2-sample t-test

gender        N     Mean   Std Dev   Std Err   Minimum   Maximum
0             6   1.5400    0.0759    0.0310    1.4300    1.6300
1             4   1.5575    0.0714    0.0357    1.5000    1.6600
Diff (1-2)       -0.0175    0.0742    0.0479

gender       Method           Mean       95% CL Mean       Std Dev   95% CL Std Dev
0                            1.5400   1.4604   1.6196       0.0759   0.0474   0.1861
1                            1.5575   1.4440   1.6710       0.0714   0.0404   0.2661
Diff (1-2)   Pooled         -0.0175  -0.1280   0.0930       0.0742   0.0501   0.1422
Diff (1-2)   Satterthwaite  -0.0175  -0.1296   0.0946

Method          Variances        DF   t Value   Pr > |t|
Pooled          Equal             8     -0.37     0.7244
Satterthwaite   Unequal      6.8826     -0.37     0.7223

Equality of Variances
Method     Num DF   Den DF   F Value   Pr > F
Folded F        5        3      1.13    0.9814

How do these results match up with those from the regression?
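The pooled t-test can be reproduced directly from the summary statistics in the table, without the raw gender assignments, and it agrees with the gender coefficient in the regression (t = 0.37, p = 0.7244, with the sign flipped because the table reports group 0 minus group 1). An illustrative Python check:

```python
from scipy import stats

# Summary statistics from the 2-sample t-test table above
# (group 1 here = gender 0, n=6; group 2 = gender 1, n=4)
t_stat, p_val = stats.ttest_ind_from_stats(
    mean1=1.5400, std1=0.0759, nobs1=6,
    mean2=1.5575, std2=0.0714, nobs2=4,
    equal_var=True,  # pooled variance, matching the "Pooled" row
)
print(round(t_stat, 2), p_val)  # t ≈ -0.37, p ≈ 0.724
```

This is the general fact behind the slide's question: simple linear regression on a single 0/1 dummy is equivalent to the pooled two-sample t-test.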
Technically we are still estimating a line

What does the intercept represent? What does the slope represent?