REGRESSION AND ANALYSIS OF VARIANCE. Motivation. Module structure

Motivation
Objective: Investigate associations between two or more variables.

What tools do you already have?
- t-test: comparison of means in two populations

What will we cover in this module?
- Linear regression: association of a continuous outcome with one or more predictors (categorical or continuous)
- Analysis of variance: comparison of a continuous outcome over a fixed number of groups
- Logistic regression: association of a binary outcome with one or more predictors (categorical or continuous)

Module structure
Alternating in-class and lab practical sessions, each of approximately 1.5 hours' duration, spread over Days 1 to 3.
- Day 1: Simple linear regression
- Day 2: Model checking; multiple linear regression; ANOVA
- Day 3: ANCOVA; logistic regression

REGRESSION MODELS: SIMPLE LINEAR REGRESSION

Outline: Simple Linear Regression
- Motivation
- The equation of a straight line
- Least squares estimation
- Inference
  - About regression coefficients
  - About predictions
- Model checking
  - Residual analysis
  - Outliers and influential observations

Motivation: Cholesterol Example
Data: Factors related to serum total cholesterol, 400 individuals, 11 variables

    > head(cholesterol)
    [Output: first six rows of the data, with columns ID, sex, age, chol, BMI, TG, APOE, rs174548, rs477541, HTN, chd]

Our first goal: Investigate the relationship between cholesterol (mg/dl) and age in adults.

Motivation: Cholesterol Example
Is cholesterol associated with age?
You could dichotomize age and compare cholesterol between two age groups:

    > group = 1*(age > 55)
    > group = factor(group, levels=c(0,1), labels=c("30-55","56-80"))
    > table(group)
    group
    30-55 56-80
      201   199
    > boxplot(chol ~ group, ylab="Total cholesterol (mg/dl)")

[Figure: boxplots of total cholesterol (mg/dl) for age groups 30-55 and 56-80]

You could compare mean cholesterol between the two groups with a t-test:

    > t.test(chol ~ group)

            Welch Two Sample t-test

    data:  chol by group
    t = -3.637, df = 393.477, p-value = 0.0003125
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -12.200290  -3.638487
    sample estimates:
    mean in group 30-55 mean in group 56-80
               179.9751            187.8945

Motivation: Cholesterol Example
Question: What do the boxplot and the t-test (t = -3.637, df = 393.477; group means 179.98 and 187.89 mg/dl) tell us about the relationship between age and cholesterol?

Using the t-test:
- There is a statistical association between cholesterol and age.
- There appears to be a positive association between cholesterol and age.
- Is there any way we could estimate the magnitude of this association without breaking the continuous measure of age into subgroups?
- With the t-test, we compared mean cholesterol in two age groups. Could we compare mean cholesterol across continuous age?

We might assume that mean cholesterol changes linearly with age.
[Figure: scatterplot of cholesterol against age]
Can we find the equation for a straight line that best fits these data?

Linear Regression
A statistical method for modeling the relationship between a continuous variable [response/outcome/dependent] and other variables [predictors/exposure/independent].
- Most commonly used statistical model
- Flexible
- Well-developed and understood properties
- Easy interpretation
- Building block for more general models

Goals of analysis:
- Estimate the association between response and predictors, or
- Predict response values given the values of the predictors.

We will start our discussion by studying the relationship between a response and a single predictor: the simple linear regression model.

The straight line equation
A line can be described by two numbers: y = β₀ + β₁x
- β₀ is the intercept: where the line crosses the y-axis, that is, the value of y when x = 0.

The straight line equation
β₁ is the slope: the change in y corresponding to a unit increase in x.
- At x the line has height β₀ + β₁x; at x + 1 it has height β₀ + β₁(x + 1). The difference is β₁.
- The slope is the same across the entire line.

The straight line equation
- Two values of x that are 2 units apart will have a difference in y values of 2 × β₁.
- The slope β₁ is the change in y corresponding to a unit increase in x.
- The slope gives information about the magnitude and direction of the association between x and y:
  - β₁ = 0: no association between x and y (values of y are the same regardless of x)
  - β₁ > 0: positive association between x and y (values of y increase as values of x increase)
  - β₁ < 0: negative association between x and y (values of y decrease as values of x increase)

Simple Linear Regression
- We can use linear regression to model how the mean of an outcome Y changes with the level of a predictor X.
- The individual Y observations will be scattered about the mean.
- We estimate a straight line describing the trend in the mean of an outcome Y as a function of the predictor X.

In regression, X is used to predict or explain the outcome Y.
- Response or dependent variable (Y): the variable we want to predict or explain
- Explanatory or independent or predictor variable (X): attempts to explain the response

Simple linear regression model:

    y = β₀ + β₁x + ε,   ε ~ N(0, σ²)

The model consists of two components:
- Systematic component: E[Y | X = x] = β₀ + β₁x, the mean population value of Y at X = x (β₀: intercept, β₁: slope)
- Random component: Var[Y | X = x] = σ²; the variance does not depend on x

Simple Linear Regression: Assumptions
Model: E[Y | X = x] = β₀ + β₁x,  Var[Y | X = x] = σ²
Compare with the boxplots of cholesterol by age group shown earlier.

Simple Linear Regression: Interpreting model coefficients
Model: E[Y | x] = β₀ + β₁x,  Var[Y | x] = σ²

Question: How do you interpret β₀?
Answer: β₀ = E[Y | x = 0], that is, the mean response when x = 0.

Question: How do you interpret β₁?
Answer:
    E[Y | x]     = β₀ + β₁x
    E[Y | x + 1] = β₀ + β₁(x + 1) = β₀ + β₁x + β₁
    E[Y | x + 1] − E[Y | x] = β₁, independent of x (linearity)
That is, β₁ is the difference in the mean response associated with a one unit positive difference in x.

Example: Cholesterol and age
Recall: Our motivating example was to determine if there is an association between age (a continuous predictor) and cholesterol (a continuous outcome).
Suppose: We believe they are associated via the linear relationship E[Y | x] = β₀ + β₁x.
Question: How do you interpret β₁?
Answer: β₁ is the difference in mean cholesterol associated with a one year increase in age.

Least Squares Estimation
Question: How do we find a best-fitting line?

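Least Squares Estimation
Question: How do we find a best-fitting line?
[Figure: scatterplot of y against x with a fitted line and the vertical distances from the points to the line]
Method: Least squares estimation.
Idea: Choose the line that minimizes the sum of squares of the vertical distances from the observed points to the line.

The least squares regression line is given by

$$ \hat{y} = \hat\beta_0 + \hat\beta_1 x $$

so the (squared) distance between the data (y) and the least squares regression line is

$$ D = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

We estimate β₀ and β₁ by finding the values that minimize D. These values are:

$$ \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}, \qquad \hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} $$

We estimate the variance as

$$ \hat\sigma^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2} = \frac{\sum_{i=1}^{n} r_i^2}{n-2} = \frac{\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}{n-2} $$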
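The closed-form estimates above can be checked numerically. The following is a minimal sketch on simulated data (the cholesterol data are not reproduced here, so the variable names, sample size, and true coefficients are all hypothetical); it computes the slope, intercept, and residual standard deviation by hand and compares them with `lm()`.

```r
# Simulated data (hypothetical stand-in for the cholesterol example)
set.seed(1)
n <- 50
x <- runif(n, 30, 80)
y <- 170 + 0.3 * x + rnorm(n, sd = 20)

# Closed-form least squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Residuals and the standard deviation estimate with n - 2 degrees of freedom
r <- y - (b0 + b1 * x)
sigma_hat <- sqrt(sum(r^2) / (n - 2))

# Compare with R's built-in fit
fit <- lm(y ~ x)
print(c(b0 = b0, b1 = b1, sigma_hat = sigma_hat))
print(coef(fit))
print(summary(fit)$sigma)
```

The hand-computed values should agree with `coef(fit)` and `summary(fit)$sigma` up to floating-point error.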

Estimated Standard Errors
- Recall that when estimating parameters, there will be sampling variability in the estimates.
- This is true for regression parameter estimates.
- Looking at the formulas for β̂₀ and β̂₁, we can see that these are just complicated means.
- In repeated sampling we would get different estimates.
- Knowledge of the sampling distribution of the parameter estimates can help us make inference about the line.

[Figure: simulation applet (http://lstat.kuleuve.be/java/versio./applet3.html) showing the true regression line and lines fitted to simulated data sets]

Sampling Distribution
[Figures: simulated sampling distributions of the parameter estimates; the theoretical distribution is shown in pink]

From statistical theory:

$$ \hat\beta_0 \sim N\!\left(\beta_0,\ \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{(n-1)s_x^2}\right]\right), \qquad \hat\beta_1 \sim N\!\left(\beta_1,\ \frac{\sigma^2}{(n-1)s_x^2}\right) $$

Sampling variability

[Figure: density plots of the estimates across repeated samples]

Estimated Standard Errors
Estimate the variability of β̂₀ and β̂₁ across repeated sampling:

$$ SE(\hat\beta_0) = \hat\sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{(n-1)s_x^2}}, \qquad SE(\hat\beta_1) = \hat\sigma \sqrt{\frac{1}{(n-1)s_x^2}} $$

Inference: About regression model parameters
Hypothesis testing: H₀: βⱼ = 0
Test statistic:

$$ \text{Large samples: } \frac{\hat\beta_j - (\text{null hyp})}{se(\hat\beta_j)} \sim N(0,1), \qquad \text{Small samples: } \frac{\hat\beta_j - (\text{null hyp})}{se(\hat\beta_j)} \sim t_{n-2} $$

Confidence intervals: β̂ⱼ ± (critical value) × se(β̂ⱼ)
[Don't worry about these formulae: we will use R to fit the models]

Inference: Hypothesis Testing
Null hypothesis: βⱼ = 0;  T = test statistic

    Alternative    P-value
    βⱼ > 0         P(t_{n-2} > T)
    βⱼ < 0         P(t_{n-2} < T)
    βⱼ ≠ 0         P(|t_{n-2}| > |T|)

[Figures: t densities with the corresponding P-value regions shaded]

Inference: Confidence Intervals
100(1−α)% confidence interval for βⱼ (j = 0, 1):

$$ \hat\beta_j \pm t_{n-2,\,1-\alpha/2}\; SE(\hat\beta_j) $$

Gives intervals that (1−α)100% of the time will cover the true parameter value (β₀ or β₁). We say we are (1−α)100% confident the interval covers βⱼ.

Example: Scientific Question: Is cholesterol associated with age?

    > fit = lm(chol ~ age)
    > summary(fit)

    Call:
    lm(formula = chol ~ age)

    ...

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 166.90168    4.26488  39.134  < 2e-16 ***
    age           0.31033    0.07524   4.125 4.52e-05 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 21.69 on 398 degrees of freedom
    Multiple R-squared:  0.04099,   Adjusted R-squared:  0.03858
    F-statistic: 17.01 on 1 and 398 DF,  p-value: 4.52e-05

    > confint(fit)
                      2.5 %      97.5 %
    (Intercept) 158.5171656 175.2861949
    age           0.1624211   0.4582481

Estimates of the model parameters and standard errors:
    β̂₀ = 166.9;  se(β̂₀) = 4.26
    β̂₁ = 0.31;   se(β̂₁) = 0.08
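As a worked check of the formulas, the t statistic and the 95% confidence interval for the age slope can be reproduced by hand from the estimate and standard error reported by `summary(fit)`. The critical value 1.966 for 398 degrees of freedom is an approximation introduced here; it is not printed on the slides.

```latex
\begin{align*}
T &= \frac{\hat\beta_1 - 0}{se(\hat\beta_1)} = \frac{0.31033}{0.07524} \approx 4.12 \\[4pt]
\hat\beta_1 \pm t_{398,\,0.975}\, se(\hat\beta_1)
  &\approx 0.31033 \pm 1.966 \times 0.07524 \\
  &\approx 0.31033 \pm 0.1479 \\
  &\approx (0.162,\ 0.458)
\end{align*}
```

Both results agree, up to rounding, with the `t value` column and the `confint(fit)` interval for age.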

Example: Scientific Question: Is cholesterol associated with age?
95% confidence intervals for the model fit = lm(chol ~ age):

    > confint(fit)
                      2.5 %      97.5 %
    (Intercept) 158.5171656 175.2861949
    age           0.1624211   0.4582481

What do these model results mean in terms of our scientific question?
Parameter estimates and confidence intervals:
    β̂₀ = 166.9   95% CI: (158.5, 175.3)
    β̂₁ = 0.31    95% CI: (0.16, 0.46)

β̂₀: The estimated average serum cholesterol for someone of age = 0 is 166.9?
β̂₁: Mean cholesterol is estimated to differ by 0.31 mg/dl for each one year difference in age.

Question: What about the confidence intervals?

Example: Scientific Question: Is cholesterol associated with age?
What do these model results mean in terms of our scientific question?
Parameter estimates and confidence intervals:
    β̂₀ = 166.9   95% CI: (158.5, 175.3)
    β̂₁ = 0.31    95% CI: (0.16, 0.46)

Answer: 95% CIs give us a range of values that will cover the true intercept and slope 95% of the time. For instance, we can be 95% confident that the true difference in mean cholesterol associated with a one year difference in age lies between 0.16 and 0.46 mg/dl.

Presentation of the results?
The mean serum total cholesterol is significantly higher in older individuals (p < 0.001). For each additional year of age, we estimate that the mean total cholesterol differs by approximately 0.31 mg/dl (95% CI: 0.16, 0.46).
Note the emphasis on:
- the slope parameter (sign and magnitude)
- the confidence interval
- units for predictor and response (scale matters)

Inference for predictions
Given estimates β̂₀ and β̂₁, we can find the predicted value ŷᵢ for any value of xᵢ as

    ŷᵢ = β̂₀ + β̂₁xᵢ

Interpretation of ŷᵢ: the estimated mean value of Y at X = xᵢ.
Be cautious: this assumes the model is true.
- This may be a reasonable assumption within the range of your data.
- It may not be true outside the range of your data.

[Figure: true model curve, observed values of x, and a region with no data. Would you use the regression line to extrapolate?]

Be careful of extrapolating
[Figure: height (inches) against age]
It would not make sense to extrapolate height at ages outside that range from a study of girls aged 4-9 years.

Prediction
Prediction of the mean E[Y | X = x]:
Point estimate:

$$ \hat{y} = \hat\beta_0 + \hat\beta_1 x $$

Standard error:

$$ se(\hat{y}) = \hat\sigma \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} $$

Note that as x gets further from x̄, the variance increases.
100(1−α)% confidence interval for E[Y | X = x]:

$$ \hat{y} \pm t_{n-2,\,1-\alpha/2}\; se(\hat{y}) $$

Prediction
Prediction of a new future observation, y*, at X = x:
Point estimate:

$$ \hat{y}^* = \hat\beta_0 + \hat\beta_1 x $$

Standard error:

$$ se(\hat{y}^*) = \hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} $$

100(1−α)% prediction interval for a new future observation:

$$ \hat{y}^* \pm t_{n-2,\,1-\alpha/2}\; se(\hat{y}^*) $$

The standard error for the prediction of a future observation is bigger: it depends not only on the precision of the estimated mean, but also on the amount of variability in Y around the line.

Cholesterol Example: Prediction
Prediction of the mean (values shown to two decimals):

    > predict.lm(fit, newdata=data.frame(age=c(46,47,48)), interval="confidence")
         fit    lwr    upr
    1 181.18 178.68 183.68
    2 181.49 179.06 183.91
    3 181.80 179.44 184.16

Prediction of a new observation:

    > predict.lm(fit, newdata=data.frame(age=c(46,47,48)), interval="prediction")
         fit    lwr    upr
    1 181.18 138.47 223.89
    2 181.49 138.78 224.19
    3 181.80 139.10 224.50

Example: Scientific Question: Is cholesterol associated with age?
Let's interpret these predictions. For x = 46:
    ŷ  = 181.2   95% CI: (178.7, 183.7)
    ŷ* = 181.2   95% CI: (138.5, 223.9)
Question: How do our interpretations for ŷ and ŷ* differ?

Example: Scientific Question: Is cholesterol associated with age?
Let's interpret these predictions. For x = 46:
    ŷ  = 181.2   95% CI: (178.7, 183.7)
    ŷ* = 181.2   95% CI: (138.5, 223.9)

Question: How do our interpretations for ŷ and ŷ* differ?
Answer: The point estimates represent our predictions for the mean serum cholesterol for individuals of age 46 (ŷ) and for a single new individual of age 46 (ŷ*).

Question: Why are the intervals for ŷ and ŷ* of differing widths?
Answer: The interval is broader when we make a prediction for the cholesterol level of a single individual because it must incorporate random variability around the mean.
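The two interval types can be compared side by side in R. This is a minimal sketch on simulated data (the variables `age` and `chol` below are generated, not the study data, and the true coefficients are hypothetical); it also recomputes both intervals at age 46 from the standard error formulas for the mean and for a new observation.

```r
# Simulated stand-in for the cholesterol data (hypothetical values)
set.seed(2)
n <- 100
age <- runif(n, 30, 80)
chol <- 170 + 0.3 * age + rnorm(n, sd = 20)
fit <- lm(chol ~ age)

new <- data.frame(age = c(46, 47, 48))
conf_int <- predict(fit, newdata = new, interval = "confidence")
pred_int <- predict(fit, newdata = new, interval = "prediction")

# Recompute both intervals at age 46 from the formulas
x0 <- 46
yhat <- sum(coef(fit) * c(1, x0))
s <- summary(fit)$sigma
lev <- 1 / n + (x0 - mean(age))^2 / sum((age - mean(age))^2)
se_mean <- s * sqrt(lev)       # standard error for the mean at x0
se_new <- s * sqrt(1 + lev)    # standard error for a new observation at x0
tcrit <- qt(0.975, df = n - 2)

print(conf_int)
print(pred_int)
print(c(lwr = yhat - tcrit * se_mean, upr = yhat + tcrit * se_mean))
print(c(lwr = yhat - tcrit * se_new, upr = yhat + tcrit * se_new))
```

The extra `1 +` under the square root is what makes the prediction interval wider than the confidence interval at every age.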

Simple Linear Regression: R²
Given no linear association: We could simply use the sample mean to predict E(Y). The variability using this simple prediction is given by SST (to be defined shortly).
Given a linear association: The use of X permits a potentially better prediction of Y by using E(Y | X).
Question: What did we gain by using X? Let's examine this question with the following figure.

Decomposition of sum of squares
[Figure: scatterplot showing, for one observation, the deviations y − ȳ, y − ŷ and ŷ − ȳ]

It is always true that:

$$ y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}) $$

It can be shown that:

$$ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 $$

that is, SST = SSE + SSR.
- SST: describes the total variation of the Yᵢ.
- SSE: describes the variation of the Yᵢ around the regression line.
- SSR: describes the structural variation; how much of the variation is due to the regression relationship.
This decomposition allows a characterization of the usefulness of the covariate X in predicting the response variable Y.

Simple Linear Regression: R²
Given no linear association: We could simply use the sample mean to predict E(Y). The variability between the data and this simple prediction is given as SST.
Given a linear association: The use of X permits a potentially better prediction of Y by using E(Y | X).
Question: What did we gain by using X?
Answer: We can answer this by computing the proportion of the total variation that can be explained by the regression on X:

$$ R^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST} $$

This R² is, in fact, the correlation coefficient squared.

Examples of R²
[Figures: example scatterplots with different values of R²]
Low values of R² indicate that the model is not adequate. However, high values of R² do not mean that the model is adequate.

Cholesterol Example: Scientific Question: Can we predict cholesterol based on age?
From summary(fit) for fit = lm(chol ~ age):

    Residual standard error: 21.69 on 398 degrees of freedom
    Multiple R-squared:  0.04099,   Adjusted R-squared:  0.03858
    F-statistic: 17.01 on 1 and 398 DF,  p-value: 4.52e-05
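The sum-of-squares decomposition and the definition of R² can be verified directly. The sketch below uses simulated data (hypothetical values, not the cholesterol data) and checks that SST = SSE + SSR, and that SSR/SST matches both the R² reported by `summary()` and the squared correlation.

```r
# Simulated data (hypothetical)
set.seed(3)
x <- runif(60, 0, 10)
y <- 1 + 0.8 * x + rnorm(60)
fit <- lm(y ~ x)
yhat <- fitted(fit)

SST <- sum((y - mean(y))^2)     # total variation
SSE <- sum((y - yhat)^2)        # variation around the regression line
SSR <- sum((yhat - mean(y))^2)  # variation explained by the regression

R2 <- SSR / SST
print(c(SST = SST, SSE_plus_SSR = SSE + SSR))
print(c(R2 = R2, summary_R2 = summary(fit)$r.squared, cor_squared = cor(x, y)^2))
```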

Cholesterol Example: Scientific Question: Can we predict cholesterol based on age?
R² = 0.04
What does R² tell us about our model for cholesterol?
Answer: 4% of the variability in cholesterol is explained by age. Although mean cholesterol increases with age, there is much more variability in cholesterol than age alone can explain.

Decomposition of Sum of Squares and the F-statistic

    > anova(fit)
    Analysis of Variance Table

    Response: chol
               Df Sum Sq Mean Sq F value   Pr(>F)
    age         1   8002  8001.7  17.013 4.52e-05 ***    (SSR)
    Residuals 398 187187   470.3                         (SSE)
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

- Df: degrees of freedom
- Sum Sq: decomposition of the sum of squares
- Mean squares: SS/df
- F-statistic: MSR/MSE
In simple linear regression: F-statistic = (t-statistic for slope)².
Hypothesis being tested: H₀: β₁ = 0, H₁: β₁ ≠ 0.
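The entries of the ANOVA table fit together arithmetically. As a worked check, using the sums of squares from `anova(fit)` and the slope's t value of 4.125 from `summary(fit)`:

```latex
\begin{align*}
MSE &= \frac{SSE}{n-2} = \frac{187187}{398} \approx 470.3,
  \qquad \hat\sigma = \sqrt{470.3} \approx 21.69 \\[4pt]
F &= \frac{MSR}{MSE} = \frac{8001.7}{470.3} \approx 17.01 \approx (4.125)^2 \\[4pt]
R^2 &= \frac{SSR}{SSR + SSE} = \frac{8002}{8002 + 187187} = \frac{8002}{195189} \approx 0.041
\end{align*}
```

These reproduce the residual standard error, the F-statistic, and the multiple R-squared reported for the same fit.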

Simple Linear Regression: Assumptions
1. E[Y | x] is related linearly to x (Linearity)
2. The Y's are independent of each other (Independence)
3. The distribution of [Y | x] is normal (Normality)
4. Var[Y | x] does not depend on x (Equal variance)
Can we assess if these assumptions are valid?

Model Checking: Residuals
(Raw or unstandardized) residual: the difference (rᵢ) between the observed response and the predicted response, that is,

    rᵢ = yᵢ − ŷᵢ = yᵢ − (β̂₀ + β̂₁xᵢ)

The residual captures the component of the measurement yᵢ that cannot be explained by xᵢ.

Residuals can be used to:
- Identify poorly fit data points
- Identify unequal variance (heteroscedasticity)
- Identify nonlinear relationships
- Identify additional variables
- Examine the normality assumption

Model Checking: Residuals

    Assumption       How to check
    Linearity        Plot residual vs X or vs Ŷ. Q: Is there any trend?
    Independence     Q: Any scientific concerns?
    Normality        Residual histogram or QQ-plot. Q: Symmetric? Normal?
    Equal variance   Plot residual vs X. Q: Is there any pattern?

If the linear model is appropriate, we should see an unstructured horizontal band of points centered at zero.
[Figure: residuals against x, scattered in a horizontal band around zero; deviation = residual]

The model does not provide a good fit in these cases:
[Figures: two residual plots showing systematic structure]
Violations of the model assumptions? How?

Simple Linear Regression: Residual Analysis: Nonlinear Association
True model: y = x^1.7
[Figures: plot of the fitted model; plot of residuals against fitted values]

Simple Linear Regression: Residual Analysis: Non-Constant Variance
True model: y = x + errors increasing with x
[Figures: plot of the fitted model; plot of residuals against fitted values]

Non-constant variance
- Sometimes the variance of y is not constant across the range of x (heteroscedasticity).
- This has little effect on point estimates, but variance estimates may be incorrect.
- This may affect confidence intervals and p-values.
- To account for heteroscedasticity we can:
  - Use robust standard errors
  - Transform the data
  - Fit a model that does not assume constant variance (GLM)

Robust standard errors
- Robust standard errors correctly estimate the variability of parameter estimates even under non-constant variance.
- These standard errors use empirical estimates of the variance in y at each x value rather than assuming this variance is the same for all x values.
- Regression point estimates will be unchanged.
- Robust or empirical standard errors will give correct confidence intervals and p-values.

Simple Linear Regression: Residual Analysis: Non-normality of errors
QQ-plot
- Graphical technique that allows us to assess whether or not a data set follows a given distribution (such as the normal distribution).
- The data are plotted against a given theoretical distribution.
- Points should approximately fall on a straight line.
- Departures from the straight line indicate departures from the specified distribution.

Construction of a QQ-plot (an example with n = 20 residuals):
1. Sort the residuals and assign each an index i = 1, ..., n.
2. Compute the empirical probabilities pr1 = i/n and pr2 = (i − 0.5)/n.
3. Get the z-values: the quantiles of the normal distribution at pr2.
4. Plot the sorted residuals (sample quantiles) against the z-quantiles (theoretical quantiles). This is the QQ-plot.

     i   i/n   (i-0.5)/n   z-quantile
     1   0.05    0.025       -1.96
     2   0.10    0.075       -1.44
     3   0.15    0.125       -1.15
     4   0.20    0.175       -0.93
     5   0.25    0.225       -0.76
     6   0.30    0.275       -0.60
     7   0.35    0.325       -0.45
     8   0.40    0.375       -0.32
     9   0.45    0.425       -0.19
    10   0.50    0.475       -0.06
    11   0.55    0.525        0.06
    12   0.60    0.575        0.19
    13   0.65    0.625        0.32
    14   0.70    0.675        0.45
    15   0.75    0.725        0.60
    16   0.80    0.775        0.76
    17   0.85    0.825        0.93
    18   0.90    0.875        1.15
    19   0.95    0.925        1.44
    20   1.00    0.975        1.96
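The construction steps can be sketched in a few lines of R. The residuals here are simulated stand-ins (hypothetical values, not the example residuals from the slide); the probabilities use the (i − 0.5)/n convention, which for n = 20 is also what `qqnorm()` uses internally.

```r
# Stand-in residuals (hypothetical; any numeric vector of length 20 would do)
set.seed(4)
res <- rnorm(20)
n <- length(res)

sorted <- sort(res)            # step 1: sample quantiles
p <- (seq_len(n) - 0.5) / n    # step 2: empirical probabilities (i - 0.5)/n
z <- qnorm(p)                  # step 3: theoretical normal quantiles

# Step 4: the QQ-plot is the sorted residuals against the z-quantiles
plot(z, sorted, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

print(round(z, 2))
```

The printed z-quantiles should match the last column of the table above, starting at -1.96 and ending at 1.96.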

Simple Linear Regression: Residual Analysis: Non-normality of errors
True model: y = x + chi-squared errors
[Figures: plot of the fitted model; Q-Q plot of residuals against inverse normal quantiles. Under normality, the residuals should fall on the straight line.]

Cholesterol-Age example: Residuals
- Plot of residuals versus fitted values. Curvature? Heteroscedasticity?
  R command: plot(fit$fitted, fit$residuals)
- Plot of residuals versus quantiles of a normal distribution (for n > 30). Normality?
  R command: qqnorm(fit$residuals)
[Figures: fit$residuals against fit$fitted; sample quantiles against theoretical quantiles]

Another example
Linear regression for the association between age and triglycerides:

    > fit.tg = lm(TG ~ age)

Robust standard errors
- Residual analysis suggests a mean-variance relationship.
- Use robust standard errors to get correct variance estimates.
[Figure: fit.tg$residuals against fit.tg$fitted]

Cholesterol example: Robust standard errors
Linear regression results:

    > summary(fit.tg)
    Call:
    lm(formula = TG ~ age)

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) -53.3059    11.1339  -4.788 2.38e-06 ***
    age           4.2209     0.1964   21.49  < 2e-16 ***

Results incorporating robust SEs (values shown to four decimals):

    > summary(fit.tg.ese)
    Call:
    gee(formula = TG ~ age, id = seq(1, length(age)))

    Coefficients:
                Estimate Naive S.E. Robust S.E.
    (Intercept) -53.3059    11.1339      8.7387
    age           4.2209     0.1964      0.1813

- Point estimates are unchanged.
- Standard errors are corrected.
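To illustrate what an empirical ("sandwich") standard error does, here is a minimal hand-rolled sketch in base R on simulated heteroscedastic data. It is not the `gee()` fit used above: the data, the variable names, and the plain sandwich formula (X'X)⁻¹ X' diag(rᵢ²) X (X'X)⁻¹ with no small-sample correction are all assumptions made for illustration.

```r
# Simulated data whose spread grows with age (hypothetical values)
set.seed(5)
n <- 200
age <- runif(n, 30, 80)
tg <- 20 + 3 * age + rnorm(n, sd = 0.8 * age)
fit <- lm(tg ~ age)

X <- model.matrix(fit)          # design matrix: intercept and age
r <- residuals(fit)

bread <- solve(crossprod(X))    # (X'X)^{-1}
meat <- crossprod(X * r)        # X' diag(r_i^2) X: each row of X scaled by its residual
V_robust <- bread %*% meat %*% bread

se_naive <- sqrt(diag(vcov(fit)))   # assumes constant variance
se_robust <- sqrt(diag(V_robust))   # uses the squared residual at each observation

print(cbind(estimate = coef(fit), se_naive, se_robust))
```

The point estimates come from the same least squares fit in both columns; only the standard errors differ.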

Transformations
Some reasons for using data transformations:
- Content area knowledge suggests nonlinearity
- Original data suggest nonlinearity
- Equal variance assumption violated
- Normality assumption violated
Transformations may be applied to the response, predictor or both.
Be careful with the interpretation of the results.

Cholesterol example: Transformations
We have seen that triglycerides are associated with age but display non-constant variance.
What about log transformed triglycerides?

Cholesterol example: Transformations
> summary(fit.tg.l)
Call: lm(formula = log(TG) ~ age)
Coefficients:
              Estimate    Std. Error   t value   Pr(>|t|)
(Intercept)   3.711583    .55937       66.37     <2e-16     ***
age           .48646      .9866        5.        <2e-16     ***
[Figure: fit.tg.l$residuals versus fit.tg.l$fitted]
- Heteroscedasticity is corrected.
- But interpretation of the model is more complicated.
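One reason the interpretation becomes more complicated after a log transformation is that the slope is a difference in mean log(y), so exponentiating it gives a multiplicative (ratio) effect on the original scale, commonly read as a ratio of geometric means. A minimal sketch on simulated data follows; the data and the names `fit_log`, `ratio_per_unit` and `pct_change` are assumptions for illustration only.

```r
# Hypothetical data: multiplicative errors, so log(y) is linear in x.
set.seed(3)
x <- runif(300, 20, 80)
y <- exp(3.7 + 0.02 * x + rnorm(300, sd = 0.4))
fit_log <- lm(log(y) ~ x)

b <- coef(fit_log)["x"]               # difference in mean log(y) per unit x
ratio_per_unit <- exp(b)              # multiplicative change per unit x
pct_change <- 100 * (ratio_per_unit - 1)

c(slope = unname(b),
  ratio = unname(ratio_per_unit),
  percent = unname(pct_change))
```

Here the simulated slope is 0.02 on the log scale, so each one-unit difference in x corresponds to roughly a 2% difference in y.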

Transformations
- Rarely do we know which transformation of the predictor provides the best linear fit.
- As always, there is a danger in using the data to estimate the best transformation to use.
  - If there is no association of any kind between the response and the predictor, a linear fit (with a zero slope) is the correct one.
  - Trying to detect a transformation is thus an informal test for an association.
  - Multiple testing procedures inflate the Type I error.
- It is best to choose the transformation of the predictor on scientific grounds.
- However, sometimes it doesn't matter: it is often the case that many functions are well approximated by a straight line over a small range of the data.
- Other approaches to non-linearity include splines and fractional polynomials.

Model Checking: Outliers vs Influential observations
- Outlier: an observation with a residual that is unusually large (positive or negative) as compared to the other residuals.
- Influential point: an observation that has a notable influence in determining the regression equation.
  - Removing such a point would markedly change the position of the regression line.
  - Observations that are somewhat extreme for the value of x can be influential.

Outlier vs Influential observations
[Figure: scatterplot of y versus x showing Point A, the fitted line including Point A, and the fitted line with Point A removed.]
Point A is an outlier, but is not influential.

Outlier vs Influential observations
[Figure: scatterplot of Y versus X showing Point B, the fitted line including Point B, and the fitted line with Point B removed.]
Point B is influential, but not an outlier.

Cholesterol-Age Example: Residuals
[Figure: residual plot]
No extreme outliers.

Model Checking: Deletion diagnostics
- Delta-beta: $\Delta\beta_{(i)} = \hat\beta - \hat\beta_{(i)}$
  tells how much the regression coefficient changed by excluding the i-th observation.
- Standardized delta-beta: $\Delta\beta_{(i)} / se(\hat\beta)$
  approximates how much the t-statistic for a coefficient changed by excluding the i-th observation.
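The delta-beta definition can be checked directly by refitting the model without each observation in turn. The sketch below does this on simulated data and compares the result with base R's dfbeta(); the data and the helper `delta_beta` are hypothetical. The comparison is made on absolute values so that it does not depend on the sign convention. The standardized version here divides by the full-model standard error, as in the definition above; R's dfbetas() uses a leave-one-out error estimate and may differ slightly.

```r
# Hypothetical data for a simple linear regression.
set.seed(4)
x <- runif(50, 20, 80)
y <- 170 + 0.3 * x + rnorm(50, sd = 20)
fit <- lm(y ~ x)

# Delta-beta for observation i: refit without it and compare coefficients.
delta_beta <- function(i) coef(fit) - coef(lm(y[-i] ~ x[-i]))
manual <- t(sapply(seq_along(y), delta_beta))   # one row per observation

# Standardized delta-beta: divide by the full-model standard errors.
std_manual <- sweep(manual, 2, sqrt(diag(vcov(fit))), "/")

builtin <- dfbeta(fit)                          # base R deletion diagnostic
max(abs(abs(manual) - abs(builtin)))            # agreement up to sign

# Observations with the largest change in the slope.
head(order(abs(manual[, 2]), decreasing = TRUE))
```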

Cholesterol-Age Example: Deletion diagnostics
> dfb = dfbeta(fit)
> index = order(abs(dfb[,2]), decreasing=T)
> cbind(dfb[index[1:15],], age[index[1:15]])
       (Intercept)    age
114    -.9893663      .1568514     34
166    -.687966       .14888475    78
55     -.619643       .139713      75
186    -.8544144      .1379531     33
113    .537693        -.11943495   76
35     -.7517511      .1138451     37
365    .767658        -.119778     39
57     -.73743        .119575      37
9      -.74787        .1757541     35
144    .7164          -.171881     37
197    -.678415       .14697       34
96     -.6499386      .111515      33
31     -.693174       .97116       34
7      .44397         -.95447      79
5      -.5981         .941761      31
No evidence of influential points. The largest (in absolute value) delta-beta is 0.015 compared to 0.31 for the regression coefficient.

Model Checking
What to do if you find an outlier and/or influential observation:
- Check it for accuracy.
- Decide (based on scientific judgment) whether it is best to keep it or omit it.
  - If you think it is representative, and likely would have appeared in a larger sample, keep it.
  - If you think it is very unusual and unlikely to occur again in a larger sample, omit it.
- Report its existence [whether or not it is omitted].

Simple Linear Regression: Impact of Violations of Model Assumptions
Non-linearity
  Estimates: Problematic
  Tests/CIs: Problematic
  Correction: Transform, or choose a nonlinear model
Non-normality
  Estimates: Minimal for most departures. Outliers can be a problem.
  Tests/CIs: Minimal for most departures. CIs for correlation are sensitive.
  Correction: Delete outliers (if warranted), or use robust regression
Unequal variances
  Estimates: Minimal impact
  Tests/CIs: Variance estimates are wrong, but the effect is usually not dramatic
  Correction: Transform, or use robust standard errors
Dependence
  Estimates: Often the estimates are unbiased
  Tests/CIs: Variance estimates are wrong
  Correction: Regression for dependent data

REGRESSION MODELS: MULTIPLE LINEAR REGRESSION

Outline: Multiple Linear Regression
- Motivation
- Model and Interpretation
- Estimation and Inference
- Interaction

Motivation
- The response or dependent variable, Y, may depend on several predictors, not just one.
- Multiple regression is an attempt to consider the simultaneous influence of several variables on the response.
- This may be with the goal of an unbiased estimate of association or for better prediction.

Motivation
Why not fit multiple separate simple linear regressions?
- If the goal is to estimate the association between the response and a predictor of interest, a confounder can make the observed association appear stronger than the true association, weaker than the true association, or even the reverse of the true association.
- How can we address this? We can adjust for the effects of the confounder by adding a corresponding term to our linear regression.
- If the goal is prediction of the response, we may be able to improve prediction by including additional variables in the regression model.

Motivation: Cholesterol Example
Data:
> head(cholesterol)
[Output: first six rows of the data set, with columns ID, sex, age, chol, BMI, TG, APOE, rs174548, rs477541, HTN, chd]
Our goal: Investigate the relationship between age (years), BMI (kg/m^2) and serum total cholesterol (mg/dl).

Motivation
In general, the multiple regression equation can be written as follows:
$E[Y \mid x_1, x_2, \ldots, x_p] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$
We use multiple variables when:
- The predictor variable is categorical with more than two groups.
- We need polynomials, splines or other functions to model the shape of the relationship(s) accurately.
- Estimating association:
  - We want to adjust for confounding by other variables.
  - We want to allow the association to differ for different values of other variables (interaction).
- Prediction: we use multiple variables if we think more than one variable will be useful in predicting future outcomes accurately.
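The claim that a confounder can even reverse the observed association is easy to see in a small simulation. The sketch below is illustrative only: the variables `w` (confounder), `x` (predictor of interest) and `y` (response), and all coefficients, are made up for the example.

```r
# Hypothetical data: w affects both x and y; the true effect of x on y is 0.5.
set.seed(8)
n <- 500
w <- rnorm(n)                          # confounder
x <- 0.8 * w + rnorm(n)                # predictor of interest, depends on w
y <- 1 + 0.5 * x - 2 * w + rnorm(n)    # response

crude    <- coef(lm(y ~ x))["x"]       # simple regression, ignores w
adjusted <- coef(lm(y ~ x + w))["x"]   # multiple regression, adjusts for w

c(crude = unname(crude), adjusted = unname(adjusted))
```

In this setup the unadjusted slope is negative while the adjusted slope is close to the true value of 0.5, so adding the confounder as a term in the regression recovers the direction of the association.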

Model and Interpretation
Model: $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$, where we assume $\varepsilon \overset{iid}{\sim} N(0, \sigma^2)$
Extension of simple linear regression:
- Systematic component: $E[Y \mid x_1, \ldots, x_p] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$
- Random component: $Var[Y \mid x_1, \ldots, x_p] = \sigma^2$

Model and Interpretation
For example, let us assume that there are two predictors in the model, so that
$E[Y \mid x_1, x_2] = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
Consider two observations with the same value for $x_2$, but one observation has $x_1$ one unit higher, that is,
Obs 1: $E[Y \mid x_1 = k+1, x_2 = c] = \beta_0 + \beta_1 (k+1) + \beta_2 c$
Obs 2: $E[Y \mid x_1 = k, x_2 = c] = \beta_0 + \beta_1 k + \beta_2 c$
Thus,
$E[Y \mid x_1 = k+1, x_2 = c] - E[Y \mid x_1 = k, x_2 = c] = \beta_1$
That is, $\beta_1$ is the expected mean change in y per unit change in $x_1$ if $x_2$ is held constant (adjusted/controlling for $x_2$).
A similar interpretation applies to $\beta_2$.

Model and Interpretation
To facilitate our discussion, let's assume we have two predictors with binary values.
Model: $E[Y \mid x_1, x_2] = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
Mean of Y:
              x_2 = 0                x_2 = 1
  x_1 = 0     $\beta_0$              $\beta_0 + \beta_2$
  x_1 = 1     $\beta_0 + \beta_1$    $\beta_0 + \beta_1 + \beta_2$
$E[Y \mid x_1 = 1, x_2 = 0] - E[Y \mid x_1 = 0, x_2 = 0] = \beta_1$
$E[Y \mid x_1 = 1, x_2 = 1] - E[Y \mid x_1 = 0, x_2 = 1] = \beta_1$
$E[Y \mid x_1 = 0, x_2 = 1] - E[Y \mid x_1 = 0, x_2 = 0] = \beta_2$
$E[Y \mid x_1 = 1, x_2 = 1] - E[Y \mid x_1 = 1, x_2 = 0] = \beta_2$

Estimation
Least Squares Estimation:
- Chooses the coefficient estimates that minimize the residual sum of squares $\sum_i (y_i - \hat y_i)^2$
- Computation is more difficult, but statistical software (R) will do that for you.

Estimation and Inference
Inference about regression model parameters
- Hypothesis testing: $H_0: \beta_j = 0$
  Interpretation: Is there a statistically significant relationship between the response y and $x_j$ after adjusting for all other factors (predictors) in the model?
  Test statistic: $\dfrac{\hat\beta_j - 0}{se(\hat\beta_j)} \sim t_{n-p-1}$ (the 0 is the null-hypothesis value)
  Note: The square of the t-statistic gives the F-statistic, and the test is known as the partial F-test.
- Confidence intervals: $\hat\beta_j \pm (\text{critical value}) \times se(\hat\beta_j)$

Estimation and Inference
Inference about the full model
- Hypotheses: $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$ vs. $H_1$: at least one $\beta_j$ is not zero
- Analysis of variance table:
  Source       df       SS                                  MS                   F
  Regression   p        SSR = $\sum_i (\hat y_i - \bar y)^2$   MSR = SSR/p          MSR/MSE
  Residual     n-p-1    SSE = $\sum_i (y_i - \hat y_i)^2$      MSE = SSE/(n-p-1)
  Total        n-1      SST = $\sum_i (y_i - \bar y)^2$
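The entries of the analysis of variance table can be computed directly from a fitted model and compared with the F-statistic that summary() reports. The following is a minimal sketch on simulated data with p = 2 predictors; the variable names and coefficients are assumptions for the example.

```r
# Hypothetical data with two predictors.
set.seed(5)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 0.5 * x1 + 0.3 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
p <- 2                                  # number of predictors

SSR <- sum((fitted(fit) - mean(y))^2)   # regression sum of squares
SSE <- sum(residuals(fit)^2)            # residual sum of squares
SST <- sum((y - mean(y))^2)             # total; equals SSR + SSE

F_stat  <- (SSR / p) / (SSE / (n - p - 1))             # MSR / MSE
p_value <- pf(F_stat, p, n - p - 1, lower.tail = FALSE)

c(F = F_stat, p_value = p_value)
summary(fit)$fstatistic                 # F value, numerator df, denominator df
```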

Estimation and Inference
- The F-value is tested against an F-distribution with p, n-p-1 degrees of freedom.
- If we reject the null hypothesis, then the predictors do aid in predicting Y [in this analysis we do not know which ones are important].
- Failing to reject the null hypothesis does not mean that none of the covariates are important, since the effect of one or more covariates may be "masked" by others.
- The hard part is choosing which covariates to include or exclude.
- This is known as the global (multiple) F-test.

Scientific example: Modeling cholesterol using age and BMI
- We have seen that there is a significant relationship between age and cholesterol.
- Can we better understand variability in cholesterol by incorporating additional covariates?

Scientific example: Modeling cholesterol using age and BMI
- It appears that BMI increases with increasing age.
- And cholesterol increases with increasing BMI.
- What if we want to estimate the association between age and cholesterol while holding BMI constant? Multiple regression.

Scientific example: Modeling cholesterol using age and BMI
> fit2 = lm(chol ~ age + BMI)
> summary(fit2)
Call: lm(formula = chol ~ age + BMI)
Residuals:
  Min        1Q         Median   3Q       Max
  -58.994    -15.793    .571     14.159   6.99
Coefficients:
              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   137.161    9.61         15.3      < 2e-16    ***
age           .3         .795         .544      .1137      *
BMI           1.466      .38          3.73      .17        ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.34 on 397 degrees of freedom
Multiple R-squared: .7351, Adjusted R-squared: .6884
F-statistic: 15.75 on 2 and 397 DF, p-value: .6e-7

Scientific example: Modeling cholesterol using age and BMI
Our estimated regression equation is $\hat y = 137.16 + 0.20\,\text{Age} + 1.43\,\text{BMI}$

Scientific example: Modeling cholesterol using age and BMI
Our estimated regression equation is $\hat y = 137.16 + 0.20\,\text{Age} + 1.43\,\text{BMI}$
Question: How do we interpret the age coefficient?
Answer: This is the estimated average difference in cholesterol associated with a one year difference in age for two subjects with the same BMI.

The age coefficient from our simple linear regression model was 0.31.
Question: Why do the estimates from the two models differ?
Answer: We are now conditioning on or controlling for BMI, so our estimate of the age association is among subjects with the same BMI.

Cholesterol Example: Did adding BMI improve our model?
> anova(fit, fit2)
Analysis of Variance Table
Model 1: chol ~ age
Model 2: chol ~ age + BMI
  Res.Df   RSS      Df   Sum of Sq   F        Pr(>F)
1 398      187187
2 397      1 884    1    6345.8      13.931   .174     ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

How does the model with age and BMI compare to a model that contains only the mean?
> fit0 = lm(chol ~ 1)
> anova(fit0, fit2)
Analysis of Variance Table
Model 1: chol ~ 1
Model 2: chol ~ age + BMI
  Res.Df   RSS      Df   Sum of Sq   F        Pr(>F)
1 399      195189
2 397      1884     2    14347       15.748   .6e-7    ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interaction and Linear Regression
- Statistical interaction (aka effect modification) occurs when the relationship between an outcome variable and one predictor is different depending on the levels of a second predictor.
- Interactions are usually investigated because of a priori assumptions/hypotheses on the part of the researchers.
- Linear regression models allow for the inclusion of interactions with cross-product terms.
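The nested-model comparison with anova() used in the cholesterol example is the partial F-test, and for a single added predictor its F-statistic equals the square of that predictor's t-statistic. The sketch below illustrates this on simulated data; the variables `age`, `bmi` and `chol` here are generated for the example and are not the course data set.

```r
# Hypothetical data loosely shaped like the cholesterol example.
set.seed(6)
n <- 200
age  <- runif(n, 30, 80)
bmi  <- 20 + 0.05 * age + rnorm(n, sd = 3)
chol <- 140 + 0.2 * age + 1.4 * bmi + rnorm(n, sd = 20)

fit_small <- lm(chol ~ age)             # reduced model
fit_big   <- lm(chol ~ age + bmi)       # adds one predictor
tab <- anova(fit_small, fit_big)        # partial F-test for adding bmi

F_partial <- tab$F[2]
t_bmi <- summary(fit_big)$coefficients["bmi", "t value"]

c(F = F_partial, t_squared = t_bmi^2)   # the two values agree
```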

Confounding vs. Interaction/Effect Modification
Data and scientific understanding help distinguish between confounding and effect modifying variables:
- Confounder:
  - Associated with predictor and response
  - Association between response and predictor is constant across strata of the new variable
- Effect modifier/interaction:
  - Association between response and the predictor varies across strata of the new variable

Confounding vs. Interaction/Effect Modification
- Confounding:
  - Estimates of association from unadjusted analysis are markedly different from estimates of association from adjusted analysis.
  - Association within each stratum is similar, but different from the crude association in the combined data (ignoring the strata).
  - In linear regression, these symptoms are diagnostic of confounding.
- Effect modification would show differences between adjusted analysis and unadjusted analysis, but would also show different associations in the different strata.

Effect Modification/Interaction
- Even if present, effect modification may not always be of interest in summarizing the effect of a predictor.
- For example, pleconaril, an antiviral drug, reduced the mean duration of symptoms in subjects with a common cold due to rhinoviruses but had no effect in subjects whose cold was due to some other agent.
- In the case of pleconaril, effect modification was important in checking that the drug did actually work by inhibiting rhinovirus.
- However, in clinical use of the drug, it would typically not be possible to determine the infectious agent (the tests are expensive and take longer than just recovering from the cold), and so the average effectiveness of the drug across all colds would be a more important quantity.

Graphical Representation
[Figure: Y versus X with separate lines for W = 1 and W = 0.]
- Non-parallel lines: interaction
- Parallel lines: no interaction

Graphical Representation
[Figure: Y versus X with parallel lines for W = 1 and W = 0, with the distributions [Y | W=1], [Y | W=0], [X | W=0] and [X | W=1] marked on the axes.]
- Parallel lines: no interaction
- W is possibly a confounder

Model and Interpretation: interaction
Assume that there are two predictors in the model:
$E[Y \mid x_1, x_2] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
Consider two observations with the same value, c, for $x_2$, but one observation has $x_1$ one unit higher:
Obs 1: $E[Y \mid x_1 = k+1, x_2 = c] = \beta_0 + \beta_1 (k+1) + \beta_2 c + \beta_3 (k+1) c$
Obs 2: $E[Y \mid x_1 = k, x_2 = c] = \beta_0 + \beta_1 k + \beta_2 c + \beta_3 k c$
Thus,
$E[Y \mid x_1 = k+1, x_2 = c] - E[Y \mid x_1 = k, x_2 = c] = \beta_1 + \beta_3 c$
That is, the difference in means now depends on the value of $x_2$.

Model and Interpretation: interaction
Model: $E[Y \mid x_1, x_2] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
Difference in means: $E[Y \mid x_1 = k+1, x_2 = c] - E[Y \mid x_1 = k, x_2 = c] = \beta_1 + \beta_3 c$
The difference in means depends on the value of $x_2$:
- The difference in means is $\beta_1$ if c = 0.
- The difference in means is $\beta_1 + \beta_3$ if c = 1.
- The difference in means changes by $\beta_3$ for each unit difference in c (that is, in $x_2$) [that is, $\beta_3$ is the difference of differences].
$H_0: \beta_3 = 0$ tests for interaction.

Model and Interpretation: interaction
Model: $E[Y \mid x_1, x_2] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
Another way to look at this:
- Factor the terms involving $x_1$: $E[Y \mid x_1, x_2] = \beta_0 + (\beta_1 + \beta_3 x_2) x_1 + \beta_2 x_2$
- The slope of $x_1$ changes with $x_2$. Equivalently, the difference in means for each unit difference in $x_1$ changes with $x_2$ (for each one unit difference in $x_2$, the difference in means changes by $\beta_3$).

Cholesterol Example: Does sex affect the age-cholesterol relationship?
[Figure: scatterplot of chol versus age, with points marked Male and Female.]
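The statement that the slope of x_1 is beta_1 when x_2 = 0 and beta_1 + beta_3 when x_2 = 1 can be verified numerically: with a binary x_2, the interaction model reproduces the slopes obtained by fitting each group separately. This is a sketch on simulated data; the variable names and coefficients are assumptions made for the example.

```r
# Hypothetical data: continuous x1, binary x2, and a true interaction.
set.seed(7)
n <- 300
x1 <- runif(n, 30, 80)
x2 <- rbinom(n, 1, 0.5)
y <- 160 + 0.3 * x1 + 15 * x2 - 0.1 * x1 * x2 + rnorm(n, sd = 20)

fit_int <- lm(y ~ x1 * x2)              # main effects plus x1:x2
b <- coef(fit_int)
slope_x2_0 <- b["x1"]                   # beta1: slope when x2 = 0
slope_x2_1 <- b["x1"] + b["x1:x2"]      # beta1 + beta3: slope when x2 = 1

# Fitting each group separately gives the same two slopes.
sep0 <- coef(lm(y ~ x1, subset = x2 == 0))["x1"]
sep1 <- coef(lm(y ~ x1, subset = x2 == 1))["x1"]

c(interaction_0 = unname(slope_x2_0), separate_0 = unname(sep0),
  interaction_1 = unname(slope_x2_1), separate_1 = unname(sep1))
```

The interaction model has the advantage over two separate fits that it provides a direct test of whether the two slopes differ, through the coefficient on x1:x2.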

Cholesterol Example: Does sex affect the age-cholesterol relationship?
We first fit the model with age and sex terms only (Male: sex=0, Female: sex=1).
> fit3 = lm(chol ~ age+sex)
> summary(fit3)
Call: lm(formula = chol ~ age + sex)
Residuals:
  Min       1Q        Median    3Q      Max
  -55.66    -14.48    -1.411    14.68   57.876
Coefficients:
              Estimate    Std. Error   t value   Pr(>|t|)
(Intercept)   16.35445    4.4184       38.75     < 2e-16    ***
age           .9697       .7313        4.61      5.89e-5    ***
sex           1.578       .1794        4.985     9.9e-7     ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.6 on 397 degrees of freedom
Multiple R-squared: .9748, Adjusted R-squared: .993
F-statistic: 1.44 on 2 and 397 DF, p-value: 1.44e-9

Cholesterol Example: Does sex affect the age-cholesterol relationship?
- This model indicates that, after controlling for the effect of sex, the average cholesterol differs by 0.30 for each additional year of age.
- The age effect in this model is very similar to the effect from our simple linear regression (0.31).
- However, this does not mean that the age/cholesterol relationship is the same in males and females.
- To answer this question we must add the interaction term.

Cholesterol Example: Does sex affect the age-cholesterol relationship?
Model with age and sex main effects, plus interaction effect
> fit4 = lm(chol ~ age*sex)
> summary(fit4)
Call: lm(formula = chol ~ age * sex)
Residuals:
  Min        1Q         Median   3Q       Max
  -56.474    -14.377    -1.15    14.764   58.31
Coefficients:
              Estimate    Std. Error   t value   Pr(>|t|)
(Intercept)   16.31151    5.8668       7.344     < 2e-16    ***
age           .3346       .144         3.4       .146       **
sex           14.5671     8.98         1.755     .84        .
age:sex       -.7399      .1464        -.55      .61361
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.8 on 396 degrees of freedom
Multiple R-squared: .986, Adjusted R-squared: .913
F-statistic: 14.35 on 3 and 396 DF, p-value: 6.795e-9

Reading the coefficients:
- (Intercept): mean cholesterol for males at age 0
- sex: difference in mean cholesterol between males and females at age 0
- age: difference in mean cholesterol associated with each one year change in age for males
- age:sex: difference in change in mean cholesterol associated with each one year change in age for females compared to males

Cholesterol Example: Does sex affect the age-cholesterol relationship?
Interpretation?
Estimated model: $160.3 + 0.33\,\text{Age} + 14.56\,\text{Sex} - 0.07\,\text{Age} \times \text{Sex}$
Subject 1: Age = a+1, sex = b
Subject 2: Age = a, sex = b
Difference in the estimated cholesterol:
$[160.3 + 0.33(a+1) + 14.56\,b - 0.07(a+1)b] - [160.3 + 0.33\,a + 14.56\,b - 0.07\,ab] = 0.33 - 0.07\,b$
Sex exerts a small (not statistically significant) effect on the age/cholesterol relationship:
- In males: 160.3 + 0.33 Age
- In females: 174.9 + 0.26 Age

Cholesterol Example: Does sex affect the age-cholesterol relationship?
We can also test the significance of interaction terms using an F-test.
> anova(fit3, fit4)
Analysis of Variance Table
Model 1: chol ~ age + sex
Model 2: chol ~ age * sex
  Res.Df   RSS      Df   Sum of Sq   F       Pr(>F)
1 397      17616
2 396      17649    1    113.5       .554    .6136
Adding the interaction term did not significantly improve model fit.

Cholesterol Example: Does sex affect the age-cholesterol relationship?
[Figure: Total cholesterol (mg/dl) versus Age (years), shown separately for Male and Female.]

Summary
We have considered:
- Simple linear regression
  - Interpretation
  - Estimation
  - Model checking
- Multiple linear regression
  - Confounding
  - Interpretation
  - Estimation
  - Interaction