The General Linear Model. April 22, 2008

The General Linear Model. April 22, 2008

Multiple regression
Data: The Faroese Mercury Study
Simple linear regression
Confounding
The multiple linear regression model
Interpretation of parameters
Model control
Collinearity
Analysis of Covariance
Confounding
Interaction

Esben Budtz-Jørgensen
Department of Biostatistics, University of Copenhagen

Pilot whales

Design of the Faroese Mercury Study

EXPOSURE: 1. Cord Blood Mercury  2. Maternal Hair Mercury  3. Maternal Seafood Intake
RESPONSE: Neuropsychological Tests

Age       Calendar   Children
Birth     1986-87    1022
7 Years   1993-94    917

Neuropsychological Testing

Boston Naming Test

Simple Linear Regression

Response: test score. Covariate: log10(B-Hg).

Model: Y = α + β·log10(B-Hg) + ε, where ε ~ N(0, σ²).

Two children, A and B. Exposure of A: B-Hg₀. Exposure of B: 10·B-Hg₀.

Expected difference in test scores (B − A):
α + β·log10(10·B-Hg₀) − [α + β·log10(B-Hg₀)]
= β·[log10(10) + log10(B-Hg₀)] − β·log10(B-Hg₀) = β

Selected results
Response               β       s.e.   p
Boston Naming, cued    -2.55   0.51   <0.0001
Bender                  1.17   0.48   0.015
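A minimal SAS sketch of this simple regression, under the assumption that the cord blood mercury concentration is stored as bhg and the Boston Naming score as bos_tot in the data set temp (names taken from later slides); the working data set temp2 is hypothetical and only creates the log10-transformed exposure:

data temp2;
  set temp;
  logbhg = log10(bhg);   /* a 10-fold increase in B-Hg corresponds to +1 on this scale */
run;

proc reg data=temp2;
  model bos_tot = logbhg;   /* the slope estimates beta from the model above */
run;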

Confounding

Mercury Exposure → Test Score, with Maternal Intelligence as a potential confounder:
1. Intelligent mothers get intelligent children.
2. Children with intelligent mothers have lower prenatal mercury exposure.

In simple linear regression we ignore the confounder maternal intelligence and over-estimate the adverse mercury effect: highly exposed children are doing poorly also because their mothers are less intelligent. Ideally we want to compare children with different degrees of exposure but with the same level of the confounder.

Multiple regression analysis

DATA: n subjects, p explanatory variables and one response for each:

subject   x_1  ...  x_p    y
1         x_11 ... x_1p    y_1
2         x_21 ... x_2p    y_2
3         x_31 ... x_3p    y_3
...
n         x_n1 ... x_np    y_n

The linear regression model with p explanatory variables:

y_i = β_0 + β_1 x_i1 + ... + β_p x_ip + ε_i
(response = mean function + biological variation)

Parameters: β_0 (intercept); β_1, ..., β_p (regression coefficients).

The multiple regression model

y_i = β_0 + β_1 x_i1 + ... + β_p x_ip + ε_i,   i = 1, ..., n

Usual assumption: ε_i ~ N(0, σ²), independent.

Least squares estimation: minimize
S(β_0, β_1, ..., β_p) = Σ_i (y_i − β_0 − β_1 x_i1 − ... − β_p x_ip)²
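As a standard side result (not shown on the slide): writing the model in matrix form, y = Xβ + ε, the least squares minimizer and its estimated covariance are

β̂ = (X'X)⁻¹ X'y,   Var̂(β̂) = σ̂² (X'X)⁻¹,   σ̂² = SS(error)/(n − p − 1),

which is essentially what PROC REG and PROC GLM compute.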

Multiple regression: Interpretation of regression coefficients

Model: Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + ... + β_p X_ip + ε, where ε ~ N(0, σ²).

Consider two subjects:
A has covariate values (X_1, X_2, ..., X_p)
B has covariate values (X_1 + 1, X_2, ..., X_p)

Expected difference in the response (B − A):
β_0 + β_1(X_1 + 1) + β_2 X_2 + ... + β_p X_p − [β_0 + β_1 X_1 + β_2 X_2 + ... + β_p X_p] = β_1

β_1 is the effect of a one-unit increase in X_1 for a fixed level of the other predictors.

Interpretation of regression coefficients

Simple regression: Y = α + β·log10(B-Hg) + ε
β: change in the child's score when log10(B-Hg) is increased by one unit, i.e. when B-Hg is increased 10-fold.

Multiple regression: Y = α + β·log10(B-Hg) + β_1 X_1 + ... + β_p X_p + ε
β: change in score when log10(B-Hg) is increased by one unit while all other covariates (child's sex, maternal intelligence, ...) are held fixed. We have adjusted for the effects of the other covariates.

It is important to adjust for variables which are associated both with the exposure and the outcome.

In SAS:

proc reg data=temp;
  model bos_tot = logbhg kon age raven_sc risk childcar mattrain patempl pattrain town7;
run;

Analyst: Statistics -> Regression -> Linear

SAS output - Boston Naming Test

Model: MODEL1
Dependent Variable: BOS_TOT   (Boston naming, total after cues)

Analysis of Variance
Source     DF    Sum of Squares   Mean Square   F Value   Prob>F
Model      10    5088.07963       508.80796     21.197    0.0001
Error      780   18723.08598      24.00396
C Total    790   23811.16561

Root MSE   4.89938    R-square   0.2137
Dep Mean   27.38432   Adj R-sq   0.2036
C.V.       17.89120

Parameter Estimates
Variable   DF   Parameter Estimate   Standard Error   T for H0: Parameter=0   Prob > |T|
INTERCEP   1    -4.795598            4.12759213       -1.162                  0.2457
LOGBHG     1    -1.657193            0.49561145       -3.344                  0.0009
KON        1    -0.707163            0.35022509       -2.019                  0.0438
AGE        1     4.058173            0.57334132        7.078                  0.0001
RAVEN_SC   1     0.087834            0.02305023        3.811                  0.0001
RISK       1    -1.670470            0.49862857       -3.350                  0.0008
CHILDCAR   1     1.537897            0.38037560        4.043                  0.0001
MATTRAIN   1     0.944715            0.38764292        2.437                  0.0150
PATEMPL    1     0.917000            0.47760846        1.920                  0.0552
PATTRAIN   1     0.978188            0.41376069        2.364                  0.0183
TOWN7      1     0.977779            0.33099594        2.954                  0.0032

SAS output - Bender

Parameter Estimates
Variable    Label                                      DF   Estimate    Standard Error   t Value   Pr > |t|
Intercept   Intercept                                  1    61.73849    4.07309          15.16     <.0001
logbhg                                                 1     0.32415    0.48016           0.68     0.4998
sex                                                    1    -1.81533    0.34193          -5.31     <.0001
AGE                                                    1    -3.88138    0.56583          -6.86     <.0001
RAVEN_SC    maternal Raven score                       1    -0.09479    0.02258          -4.20     <.0001
TOWN7       Residence at age 7                         1    -1.19123    0.32200          -3.70     0.0002
RISK        lbw, sfd, premat, dysmat, trauma, mening   1     1.05263    0.49235           2.14     0.0328
CHILDCAR    cared for outside home, F115               1    -0.48337    0.37071          -1.30     0.1926
MATTRAIN    mother prof training                       1    -0.29383    0.37854          -0.78     0.4378
PATEMPL     father employed                            1     0.00361    0.46966           0.01     0.9939
PATTRAIN    father prof training                       1    -0.09930    0.40461          -0.25     0.8062

Hypothesis tests

Does mercury exposure have an effect, for a fixed level of the other covariates? H_0: β_1 = 0, assessed by a t-test: t = β̂_1 / s.e.(β̂_1).

BNT:    β̂_1 = -1.66, s.e.(β̂_1) = 0.50, t = -3.34 ~ t(780), p = 0.0009
Bender: β̂_1 =  0.32, s.e.(β̂_1) = 0.48, t =  0.68,          p = 0.50

95% confidence interval: β̂_1 ± t(97.5%, n−p−1) · s.e.(β̂_1)
BNT:    -1.66 ± 1.96·0.50 = (-2.63; -0.67)
Bender:  0.32 ± 1.96·0.48 = (-0.62; 1.26)
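In SAS, these confidence limits can be requested directly; a minimal sketch for the BNT model (the clb option on the model statement prints 95% confidence limits for each regression coefficient):

proc reg data=temp;
  model bos_tot = logbhg kon age raven_sc risk childcar mattrain patempl pattrain town7 / clb;
run;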

Prediction

Fitted model:
BNT_i = -4.8 - 1.66·log10(B-Hg)_i - 0.70·SEX_i + ... + 0.98·TOWN7_i + ε_i,   ε ~ N(0, 4.9²)

Expected response for the first child in the data:
BNT̂_1 = -4.8 - 1.66·log10(92.2) - 0.70·0 + ... + 0.98·0 = 27.8

Observed: BNT_1 = 21. Residual: ε̂_1 = 21 - 27.8 = -6.8

Prediction uncertainty, 95% prediction interval: expected value ± 1.96·4.9 = (18.2; 37.4)
(here we have ignored the estimation uncertainty in the regression coefficients)
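A sketch of how SAS can produce prediction intervals that do account for the estimation uncertainty in the coefficients (the output keywords lcl and ucl give 95% limits for an individual new response; the data set name pred is only for illustration):

proc reg data=temp;
  model bos_tot = logbhg kon age raven_sc risk childcar mattrain patempl pattrain town7;
  output out=pred p=expected r=res lcl=lower95 ucl=upper95;   /* fitted value, residual, 95% prediction limits */
run;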

Tests of type I and III

proc glm data=temp;
  model bender = logbhg kon age raven_sc town7 risk childcar mattrain patempl pattrain;
run;

Type I: tests the effect of each covariate after adjustment for all covariates listed before it in the model statement.
Type III: tests the effect of each covariate after adjustment for all other covariates.

Dependent Variable: BENDER

Source     DF   Type I SS       Mean Square     F Value   Pr > F
LOGBHG     1    152.5635551     152.5635551     6.45      0.0113
SEX        1    713.5462412     713.5462412     30.15     0.0001
AGE        1    1592.2315510    1592.2315510    67.27     0.0001
RAVEN_SC   1    650.8761087     650.8761087     27.50     0.0001
TOWN7      1    524.2593362     524.2593362     22.15     0.0001
RISK       1    102.2066429     102.2066429     4.32      0.0380
CHILDCAR   1    39.0848938      39.0848938      1.65      0.1992
MATTRAIN   1    19.6362731      19.6362731      0.83      0.3627
PATEMPL    1    0.0085257       0.0085257       0.00      0.9849
PATTRAIN   1    1.4256305       1.4256305       0.06      0.8062

Source     DF   Type III SS     Mean Square     F Value   Pr > F
LOGBHG     1    10.7875779      10.7875779      0.46      0.4998
SEX        1    667.1577363     667.1577363     28.19     0.0001
AGE        1    1113.7874323    1113.7874323    47.05     0.0001
RAVEN_SC   1    417.1263013     417.1263013     17.62     0.0001
TOWN7      1    323.9547265     323.9547265     13.69     0.0002
RISK       1    108.1966303     108.1966303     4.57      0.0328
CHILDCAR   1    40.2427791      40.2427791      1.70      0.1926
MATTRAIN   1    14.2615095      14.2615095      0.60      0.4378
PATEMPL    1    0.0013970       0.0013970       0.00      0.9939
PATTRAIN   1    1.4256305       1.4256305       0.06      0.8062

Multiple regression analysis

General form: Y = β_0 + β_1 x_1 + ... + β_k x_k + ε

Idea: the x's can be (almost) anything! They do not have to be continuous variables. By a suitable choice of artificial dummy variables the model becomes very flexible.

Group variables

Group variables can be handled directly in PROC GLM by declaring the group variable as a CLASS variable. In PROC REG covariates must be numeric, but group variables can be handled by generating dummy variables. For three groups, 2 dummy variables are generated:

Group   x_1   x_2   E(y)
A       0     0     β_0
B       1     0     β_0 + β_1
C       0     1     β_0 + β_2

β_0: response level in group A
β_1: difference between B and A
β_2: difference between C and A

data temp;
  set temp;
  if x='B' then x1=1; else x1=0;
  if x='C' then x2=1; else x2=0;
run;

y = β_0 + β_1 x_1 + β_2 x_2 + ε
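For comparison, a minimal PROC GLM sketch with the same three-group setup (assuming the group variable is x and the response is y, as in the table above); note that PROC GLM by default uses the last CLASS level as the reference group:

proc glm data=temp;
  class x;                   /* x handled as a grouping (factor) variable */
  model y = x / solution;    /* prints the intercept and group contrasts */
run;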

Model control

Model: Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + ... + β_p X_ip + ε, where ε ~ N(0, σ²).

What is there to check?
- Linearity
- Homogeneous variance of the residuals
- Independent and normally distributed residuals

Note that the model does not assume normality for X or Y.

Residual plots

Fitted values: Ŷ_i = β̂_0 + β̂_1 X_i1 + β̂_2 X_i2 + ... + β̂_p X_ip
Residuals: ε̂_i = Y_i − Ŷ_i
Standardized residuals: residuals standardized so that their variance is one, which makes it easier to identify outliers (SAS provides additional types of residuals).

Plots:
- Residuals vs covariates: mainly to test linearity.
- Residuals vs fitted values: to test that the variance is homogeneous. A trumpet shape indicates a log-transformation [var{log(Y)} ≈ var(Y)/Y²].

The plots should not show any structure.

Boston Naming Test: Standardized residual vs expected value

Boston Naming Test: Standardized residual vs Hg concentration

Residual plots in SAS

proc glm data=temp;
  model bos_tot = logbhg kon age raven_sc risk childcar mattrain patempl pattrain town7 / solution;
  output out=esti p=expected r=res student=st h=h;
run;

symbol1 v=circle;
axis1 label=(h=1.8 'Cord Blood Mercury Concentration (' f=cgreek 'm' f=complex 'g/l)') logbase=2 value=(h=1.1);
axis2 label=(a=90 r=0 h=1.8 'Studentized Residual') value=(h=1.4) minor=none;
axis3 label=(h=1.8 'Predicted Score') order=(19 to 37 by 2) value=(h=1.4) minor=none;

proc gplot data=esti;
  plot st*bhg=1 / frame haxis=axis1 vaxis=axis2;
run;

Test of linearity: Polynomial regression

Y = β_0 + β_1 x + β_2 x² + β_3 x³ + ε

Note: the relationship between Y and x is not linear, but this model is still a multiple regression model (Y is linear in the β's). Thus, the model can be fitted using standard regression software. One just has to generate the variables x², x³ and include them as covariates.

Test of linearity: H_0: β_2 = β_3 = 0. The model is tested against a more general (flexible) model.
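A minimal sketch of the corresponding SAS steps, assuming the covariate is logbhg and the response is systolic as in the next slide (the data set name poly and the variable names logbhg2 and logbhg3 are only for illustration); the test statement gives the joint F-test of H_0: β_2 = β_3 = 0:

data poly;
  set temp;
  logbhg2 = logbhg**2;   /* squared term */
  logbhg3 = logbhg**3;   /* cubic term */
run;

proc reg data=poly;
  model systolic = weight logbhg logbhg2 logbhg3;
  linearity: test logbhg2=0, logbhg3=0;   /* joint test of the higher-order terms */
run;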

Test of linearity

Association between prenatal Hg exposure and blood pressure? Systolic blood pressure (mmHg) is regressed on the child's weight (kg) and the prenatal mercury exposure.

Parameter    Estimate       T for H0: Parameter=0   Pr > |T|   Std Error of Estimate
INTERCEPT    86.91645496    44.84                   0.0001     1.93827135
WEIGHT        0.53336582     7.61                   0.0001     0.07011630
LOGBHG        0.01320824     0.02                   0.9856     0.73105266

The Hg effect is clearly not significant. Mercury exposure does not seem to affect blood pressure. Make a confidence interval!

Analyst: Statistics -> ANOVA -> Linear Models. Inclusion of terms of higher degree: under MODEL choose the covariate (LOGBHG), then POLYNOMIAL, and then specify the degree of the polynomial. I chose 3.

proc glm data=temp;
  model systolic = weight logbhg logbhg*logbhg logbhg*logbhg*logbhg;
run;

Parameter               Estimate        Standard Error   t Value   Pr > |t|
Intercept               72.88105274     3.49693136       20.84     <.0001
WEIGHT                   0.55404015     0.06966439        7.95     <.0001
logbhg                  32.10913100     7.92439933        4.05     <.0001
logbhg*logbhg          -22.13869674     6.68878362       -3.31     0.0010
logbhg*logbhg*logbhg     4.54468520     1.78555302        2.55     0.0111

Make a drawing showing the estimated relationship! Calculate ŷ = 32.1·logbhg - 22.1·logbhg² + 4.5·logbhg³ for each person and plot ŷ as a function of logbhg.

Estimated dose-response function

Sums of squares (SS)

Source   DF          SS
model    p           Σ_i (ŷ_i − ȳ)²
error    n − p − 1   Σ_i (y_i − ŷ_i)²
total    n − 1       Σ_i (y_i − ȳ)²

SS(model) is a measure of how much variation the model explains. The percentage of variation explained by the model:

R² = SS(model) / SS(total)

A low value indicates that the covariates are not important for predicting the response; it does NOT mean that the model assumptions are violated.
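As a check against the Boston Naming Test output above:

R² = SS(model)/SS(total) = 5088.08 / 23811.17 = 0.2137,

which matches the R-square value reported by PROC REG.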

Can we reduce the number of covariates? The F-test for model reduction

Idea in the F-test: yes, if the reduced model explains almost the same amount of variation.

Mercury example: can we remove logbhg, logbhg², logbhg³?

Generally: can we reduce the big Model 1 to the smaller Model 2? Look at the difference in sums of squares:

ΔSS = SS(model_1) − SS(model_2)

ΔSS > 0: more covariates always explain more variation. How large must ΔSS be before we consider Model 1 significantly better?

F = { ΔSS / [DF(model_1) − DF(model_2)] } / { SS(error_1) / DF(error_1) }

F follows an F-distribution with DF(model_1) − DF(model_2) and DF(error_1) degrees of freedom.

Mercury and blood pressure

Reduced model has covariates: weight

Dependent Variable: SYSTOLIC
Source            DF    Sum of Squares    Mean Square     F Value   Pr > F
Model             1     3722.3404216      3722.3404216    58.28     0.0001
Error             867   55374.6031918     63.8692078
Corrected Total   868   59096.9436133

Big model has covariates: weight, logbhg, logbhg², logbhg³

Dependent Variable: SYSTOLIC
Source            DF    Sum of Squares    Mean Square     F Value   Pr > F
Model             4     5261.9796055      1315.4949014    21.11     0.0001
Error             864   53834.9640078     62.3089861
Corrected Total   868   59096.9436133

F(4 − 1, 864) = [(5262.0 − 3722.3)/(4 − 1)] / (53835/864) = 8.23,  p < 0.001  -  the mercury effect is significant.
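The same reduction can be tested directly in SAS; a sketch assuming the polynomial terms logbhg2 and logbhg3 have been generated as in the earlier data step:

proc reg data=poly;
  model systolic = weight logbhg logbhg2 logbhg3;
  hg: test logbhg=0, logbhg2=0, logbhg3=0;   /* F-test of the big model against the weight-only model */
run;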

Influential observations

Leverage_i (hat-matrix diagonal): measures how extreme the covariate values of the i-th observation are.
(One covariate: h_ii = 1/n + (x_i − x̄)² / Σ_j (x_j − x̄)²)

Cook's D_i: measures how much all the regression coefficients change when the i-th observation is excluded.

dfbeta_i: measures how much a specific coefficient changes if the i-th observation is excluded:
dfbeta_i = [β̂ − β̂_(i)] / s.e.(β̂),  where β̂_(i) is the coefficient estimated without the i-th observation.
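A sketch of how these diagnostics can be obtained in SAS for the BNT model (the influence option prints hat diagonals and dfbetas, and the output statement saves Cook's D and leverage to a data set; the name diag is only for illustration):

proc reg data=temp;
  model bos_tot = logbhg kon age raven_sc risk childcar mattrain patempl pattrain town7 / influence;
  output out=diag cookd=cd h=leverage;   /* Cook's D and leverage for each child */
run;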

When to transform covariates?

When the relation between x and y is not linear: transform x (or y).

Why was B-Hg log-transformed when the linear model fits equally well?

Leverage

dfbeta

Automatic model selection

Forward selection: start with no covariates; try each covariate in turn and add the most significant one; continue until none of the remaining covariates is significant.

Backward elimination: start by including all covariates; remove the covariate with the highest p-value; continue until all variables in the model are significant.

Backward elimination is generally recommended. In SAS:

proc reg data=temp;
  model bos_tot = logbhg age sex raven_sc risk childcar mattrain patempl pattrain olderbs matfaro gestage town71 matage smoke nursnew1 vgt ferry examtime livepar youngbs
        / selection=backward slstay=0.1 include=1;
run;

WARNING: the output from the selected model does not take the model uncertainty into account. The effects of the selected covariates are over-estimated. These methods should not be used for identification of confounders. They can be used to determine a simple prediction model.

Collinearity

Two or more covariates are strongly associated.

Symptoms and consequences:
- Some regression coefficients have large standard errors
- R² is high but none of the covariates is significant
- The results are not as expected
- The results change a lot when one covariate is excluded
- Regression coefficients are correlated

Collinearity reflects poor study design, but is sometimes unavoidable.
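One way to quantify collinearity in SAS is through variance inflation factors; a sketch for the BNT model (the vif and tol options on the model statement print the variance inflation factor and tolerance for each covariate, and large VIFs flag strongly associated covariates):

proc reg data=temp;
  model bos_tot = logbhg kon age raven_sc risk childcar mattrain patempl pattrain town7 / vif tol;
run;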

PCB adjustment

PCB was measured in cord tissue, but only in half of the children (median concentration 2 ng/g). Hg and PCB are associated: corr[log10(B-Hg), log10(PCB)] = 0.40, p < 0.0001.

Response: BNT
                  Cord Blood Hg             PCB
                  β       s.e.   p          β       s.e.   p
Hg only           -1.93   0.74   0.009      -       -      -
PCB only          -       -      -          -1.55   0.71   0.029
Both included     -1.54   0.83   0.063      -0.89   0.80   0.27

Based on the marginal analyses both variables have an effect. If both variables are included, neither of them has a significant effect. Conclusion: at least one of these variables has an effect, but it is difficult to decide which of them it is. However, it seems to be mercury. In a standard backward elimination procedure PCB would be disregarded.