Week 7: Multiple Factors. Ch. 18-19, some miscellaneous parts

Multiple Factors
Most experiments will involve multiple factors, some of which will be nuisance variables. Dealing with these factors requires more complex models.

Two-factor ANOVA, ANCOVA
A two-factor ANOVA typically includes both factors (each of which could have many levels) and the interaction term:
Y ~ A + B + A:B (written Y ~ A*B in R shorthand)
An analysis of covariance (ANCOVA) is the same model, except that one of the predictors is a continuous nuisance covariate:
Y ~ A + B + A:B
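A minimal sketch of both models in R (the data frame dat and the names Y, A, B, and X are hypothetical placeholders, not from the lecture data):

# Hypothetical data frame 'dat' with response Y, factors A and B,
# and a continuous covariate X
two_way <- lm(Y ~ A * B, data = dat)   # expands to A + B + A:B
ancova  <- lm(Y ~ X * A, data = dat)   # same structure; X is continuous
anova(two_way)
anova(ancova)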

Types of Sums of Squares
For a perfectly balanced design, the distinction doesn't matter. For unbalanced designs, different types of sums of squares can give different results. In R, the default for lm() is to calculate significance based on Type I sums of squares; most other statistical packages use Type III sums of squares.

Types of Sums of Squares
Think about it this way: SS(AB | A, B) is the effect of the interaction term after the main effects of A and B are fit:
SS(AB | A, B) = SS(A, B, AB) − SS(A, B)
In other words, SS(AB | A, B) is the added explanatory power of the model gained by adding the interaction term (after the main effects of A and B have already been fit).

Type I Sum of Squares
Sequential sum of squares: fits the factors in the order they are specified in the model. So it fits:
SS(A)
SS(B | A)
SS(AB | A, B)
Different models will give different results, even when they contain the same terms:
Y ~ A + B + A*B
Y ~ B + A + A*B
You will get different p-values from these two models if you use the Type I SS (which is the default for lm() in R). The sketch after this slide shows the effect directly.
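To see the order dependence, fit the same terms in both orders and compare the sequential ANOVA tables; a minimal sketch using the lldata example that appears later in the lecture:

# Type I (sequential) SS: term order changes the main-effect tests
m1 <- lm(ln_fn ~ MH * CW, data = lldata)   # tests MH first, then CW given MH
m2 <- lm(ln_fn ~ CW * MH, data = lldata)   # tests CW first, then MH given CW
anova(m1)
anova(m2)   # with unbalanced data, the main-effect p-values differ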

Type II Sum of Squares
Main effects are tested assuming the interactions are non-significant. It fits:
SS(A | B)
SS(B | A)
Generally not used, because we want to take the interaction terms into account (unless there's an a priori reason not to care about them).

Type III Sum of Squares
The effect of each factor is evaluated after all other factors are included in the model. The result for each factor is therefore similar to the result that would be obtained if that factor had been entered last in a Type I SS model. It fits:
SS(A | B, AB)
SS(B | A, AB)
SS(AB | A, B)
Type III SS is insensitive to the order in which terms are entered (which is probably why it is commonly used).

Implementing a two-factor ANOVA with Type III Sum of Squares
Use Anova() with a capital A:
> library(car)
> full_model <- lm(ln_fn ~ MH*CW, data=lldata)
> Anova(full_model, type="III")
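One caveat worth knowing about car::Anova (a general property of Type III tests, not specific to these data): the main-effect tests are only meaningful when factors use sum-to-zero contrasts, so it is common to set those before fitting. A minimal sketch:

# Sum-to-zero contrasts make Type III main-effect tests interpretable
options(contrasts = c("contr.sum", "contr.poly"))
full_model <- lm(ln_fn ~ MH*CW, data = lldata)
Anova(full_model, type = "III")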

Output
Anova Table (Type III tests)

Response: ln_fn
                                Sum Sq  Df F value    Pr(>F)
(Intercept)                     77.942   1  228.41 < 2.2e-16 ***
Microbe_History                  4.801   1  14.067 0.0002123 ***
Contemp_Water                    3.192   1  9.3536 0.0024304 **
Microbe_History:Contemp_Water    3.775   1  11.062 0.0009928 ***
Residuals                      100.326 294
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Unless you have a reason to put the factors in your model in a particular order (covariates first, for instance), you should usually use the Type III Sums of Squares.

Comparing Two Models
One way to determine whether or not a term in a model is significant is to compare the models with and without the term. This is done using anova(), with a lower-case a. Usage:
anova(reduced_model, full_model)

Example: Comparing Two Models
LLfullmodel <- lm(ln_fn ~ CW*MH, data=llsubset)
LLreducedmodel <- lm(ln_fn ~ CW + MH, data=llsubset)
anova(LLreducedmodel, LLfullmodel)

Output:
Analysis of Variance Table

Model 1: ln_fn ~ Contemp_Water + Microbe_History
Model 2: ln_fn ~ Contemp_Water * Microbe_History
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1    295 104.10
2    294 100.33  1     3.775 11.062 0.0009928 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Implementation of ANCOVA
Data from Photinus ignitus, a firefly in which males provide a large nuptial gift to the female. Closed circles are females that mated with three males; open circles are females that mated with one male. Hypothesis: multiple mating allows a female to lay more eggs.

Plotting two sets of points on one graph in R
Plot an ordinary scatterplot using plot(), then add the second set of points using points():
plot(y ~ x, data = one_mate, pch=19)
points(y ~ x, data = three_mates, pch=18, col="red")
Add a line for each set of points:
reg1 <- lm(y ~ x, data = one_mate)
abline(reg1, lty=2)
reg2 <- lm(y ~ x, data = three_mates)
abline(reg2, col="red")

Running the ANCOVA
> ffancova <- lm(eggs ~ weight*mates) [the covariate goes first]
> anova(ffancova)
Is the interaction term significant?
If yes, you are done, and the interpretation is that the relationship between eggs and weight differs between the two mating classes.
If no, rerun the analysis without the interaction term to test the difference in means, corrected for the covariate:
> ffancova <- lm(eggs ~ weight + mates)
> anova(ffancova)

Obtain the Least Squares Means
Least-squares means are the predicted means based on the model. They should be more informative than the raw means because they are corrected for the other variables in the model.
> install.packages("lsmeans", dependencies=TRUE)
> library(lsmeans)
> eggs_lsmeans <- lsmeans(ffancova, "mates") [gives the predicted mean of the response variable for each level of mates in the model]
Recall that in the plant paper from earlier in the semester, the authors reported the least-squares means in their bar graphs instead of the raw means.
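A side note: the lsmeans package has since been superseded by emmeans, which provides the same functionality. An equivalent call on the same fitted model would be:

# emmeans is the maintained successor to lsmeans
library(emmeans)
eggs_emmeans <- emmeans(ffancova, "mates")   # one adjusted mean per mating class
eggs_emmeans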

Logistic Regression: Hermon Bumpus
After a winter storm near Brown University in 1898, Dr. Bumpus obtained a sample of 136 house sparrows that were incapacitated by the storm. About half the sparrows survived. Bumpus measured a number of traits and wondered whether morphology predicted survival. This question is appropriately addressed with logistic regression.

[Figure: Bumpus Data. Scatterplot of survival (0 or 1) against total length (150-170).]

Logistic Regression
Logistic regression fits a model for a binary response variable. It assumes that the response variable has a binomial distribution at every x value, and uses the natural log of the odds of Y as the response:
log-odds(Y) = α + βX
The odds equal the probability of success divided by the probability of failure. Survival is an example of such a variable.
Implementation:
> bumpus_glm <- glm(survival ~ total_length, data=bumpus, family=binomial(link=logit))
> summary(bumpus_glm)
[The logit function gives the logarithm of the odds]

Logistic Regression Output
Call:
glm(formula = survival ~ total_length, family = binomial(link = logit),
    data = bumpus)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7321  -1.1396   0.7834   1.0907   1.5452

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   23.83490    8.35127   2.854  0.00432 **
total_length  -0.14860    0.05229  -2.842  0.00449 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 188.07  on 135  degrees of freedom
Residual deviance: 179.35  on 134  degrees of freedom
AIC: 183.35

Number of Fisher Scoring iterations: 4

There is a significant effect of total length: the coefficient is negative, so smaller birds were more likely to survive.
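As a quick worked check of what the coefficients mean, the fitted log-odds for a bird of a given length convert back to a probability; for example, at total_length = 160 (a value chosen purely for illustration):

# Predicted survival probability at total_length = 160,
# using the fitted coefficients above
log_odds <- 23.83490 - 0.14860 * 160   # about 0.06
1 / (1 + exp(-log_odds))               # about 0.51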

[Figure: Bumpus Data with Least Squares Line. Survival (0 or 1) against total length (150-170), with the fitted least-squares line overlaid.]
The line from a least-squares regression has no real validity, but it can help to describe the relationship. It looks like it doesn't do so well here, however.
abline(lm(survival ~ total_length, data=bumpus), lty=2, lwd=3, col="red")

Stabilizing Selection
With only one factor, the logistic regression isn't really necessary: you could do a t-test on the survivors versus the non-survivors. A logistic regression, however, allows you to model multiple factors simultaneously, and this feature makes it very useful. One example is stabilizing selection: do the intermediately sized birds survive better? We can test this hypothesis by fitting a model with a quadratic term (a standard approach from selection theory).

Types of Selection

Logistic Regression with Quadratic Term
A quadratic term is included by taking the square of the linear term and adding it to the model:
> bumpus_glm_quad <- glm(survival ~ total_length + I(total_length^2), data=bumpus, family=binomial(link=logit))
> summary(bumpus_glm_quad)

Output
Call:
glm(formula = survival ~ total_length + I(total_length^2),
    family = binomial(link = logit), data = bumpus)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.4963  -1.1478   0.8891   1.0066   2.0475

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)       -804.25460  354.06593  -2.271   0.0231 *
total_length        10.25912    4.45606   2.302   0.0213 *
I(total_length^2)   -0.03269    0.01402  -2.332   0.0197 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 188.07  on 135  degrees of freedom
Residual deviance: 173.47  on 133  degrees of freedom
AIC: 179.47

Number of Fisher Scoring iterations: 4
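Because the fitted log-odds form a downward-opening quadratic, α + βX + γX², predicted survival peaks at the vertex, X = −β/(2γ). A quick derived check using the coefficients above (not from the slides):

# Body length at which predicted survival is highest
-10.25912 / (2 * -0.03269)   # about 156.9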

[Figure: Fitting a Quadratic Curve. Survival (0 or 1) against total length (150-170), with a quadratic curve overlaid.]

Code for Quadratic Curve
> bumpus$tot_ln2 <- bumpus$total_length * bumpus$total_length
> bumpus_lm_quad <- lm(survival ~ total_length + tot_ln2, data=bumpus)
> quad_x <- seq(150, 170, 1)
> quad_y <- predict(bumpus_lm_quad, list(total_length=quad_x, tot_ln2=quad_x^2))
> lines(quad_x, quad_y, col="maroon", lwd=3)
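As an alternative, one could overlay the fitted probabilities from the logistic model itself, which stay between 0 and 1 (a sketch, assuming the plot from the earlier slide is still open):

# Overlay fitted probabilities from the quadratic logistic model
quad_x <- seq(150, 170, 0.5)
quad_p <- predict(bumpus_glm_quad,
                  newdata = list(total_length = quad_x),
                  type = "response")   # probabilities on the 0-1 scale
lines(quad_x, quad_p, col = "blue", lwd = 3)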

Logistic Regression Summary
If the response variable is binary, consider using a logistic regression. It is especially useful for models with more than one explanatory variable. A significant result indicates that the probability of success depends on the significant explanatory variable.
Be careful not to misapply it, because you can easily fall into a pseudoreplication trap: if the binary response variable is nested within test subjects, then you probably don't want to use logistic regression. For example, if you are looking at offspring survivorship in females treated with a chemical versus untreated females, the response variable might appear to be binary (each offspring either survives or doesn't). However, the unit of replication is the female, not the offspring, so you would severely inflate your sample size by running a logistic regression on offspring. Instead, you need to calculate a proportion per female and analyze that (one value per independent female), as in the sketch below.
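A minimal sketch of that per-female summary (the data frame offspring and its columns female_id, treatment, and survived are hypothetical names):

# Collapse offspring-level 0/1 outcomes to one proportion per female
per_female <- aggregate(survived ~ female_id + treatment,
                        data = offspring, FUN = mean)
# Then analyze one value per independent female, e.g.:
t.test(survived ~ treatment, data = per_female)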