Wrap-up. The General Linear Model is a special case of the Generalized Linear Model. Consequently, we can carry out any GLM as a GzLM.

Similar documents
Model Based Statistics in Biology. Part V. The Generalized Linear Model. Chapter 18.1 Logistic Regression (Dose - Response)

SRBx14_9.xls Ch14.xls

Sleep data, two drugs Ch13.xls

Data files & analysis PrsnLee.out Ch9.xls

Model Based Statistics in Biology. Part V. The Generalized Linear Model. Chapter 16 Introduction

Ch16.xls. Wrap-up. Many of the analyses undertaken in biology are concerned with counts.

Model Based Statistics in Biology. Part III. The General Linear Model. Chapter 8 Statistical Inference with the General Linear Model

GLMM workshop 7 July 2016 Instructors: David Schneider, with Louis Charron, Devin Flawd, Kyle Millar, Anne St. Pierre Provencher, Sam Trueman

Data files and analysis Lungfish.xls Ch9.xls

Where have all the puffins gone 1. An Analysis of Atlantic Puffin (Fratercula arctica) Banding Returns in Newfoundland

Overdispersion Workshop in generalized linear models Uppsala, June 11-12, Outline. Overdispersion

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

Contrasting Marginal and Mixed Effects Models Recall: two approaches to handling dependence in Generalized Linear Models:

Homework 5: Answer Key. Plausible Model: E(y) = µt. The expected number of arrests arrests equals a constant times the number who attend the game.

Cohen s s Kappa and Log-linear Models

Q30b Moyale Observed counts. The FREQ Procedure. Table 1 of type by response. Controlling for site=moyale. Improved (1+2) Same (3) Group only

Two Hours. Mathematical formula books and statistical tables are to be provided THE UNIVERSITY OF MANCHESTER. 26 May :00 16:00

Inferences for Regression

Generalized linear models

Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/

Generalized linear models

STA 101 Final Review

ST3241 Categorical Data Analysis I Multicategory Logit Models. Logit Models For Nominal Responses

Generalized Linear Models

Poisson Data. Handout #4

Count data page 1. Count data. 1. Estimating, testing proportions

Chapter 1. Modeling Basics

Six Sigma Black Belt Study Guides

Exercise 5.4 Solution

Ch 2: Simple Linear Regression

9 Generalized Linear Models

Basic Business Statistics 6 th Edition

Analysis of Count Data A Business Perspective. George J. Hurley Sr. Research Manager The Hershey Company Milwaukee June 2013

Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of

T.I.H.E. IT 233 Statistics and Probability: Sem. 1: 2013 ESTIMATION AND HYPOTHESIS TESTING OF TWO POPULATIONS

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

Generalized Additive Models

SAS Software to Fit the Generalized Linear Model

Statistics for Managers using Microsoft Excel 6 th Edition

Today: Cumulative distributions to compute confidence limits on estimates

GROUPED DATA E.G. FOR SAMPLE OF RAW DATA (E.G. 4, 12, 7, 5, MEAN G x / n STANDARD DEVIATION MEDIAN AND QUARTILES STANDARD DEVIATION

Chapter 1 Statistical Inference

PLS205 Lab 2 January 15, Laboratory Topic 3

One-Way ANOVA. Some examples of when ANOVA would be appropriate include:

" M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2

Lectures 5 & 6: Hypothesis Testing

LOOKING FOR RELATIONSHIPS

A Re-Introduction to General Linear Models (GLM)

Linear Regression Models P8111

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Subject-specific observed profiles of log(fev1) vs age First 50 subjects in Six Cities Study

Generalised linear models. Response variable can take a number of different formats

Week 7 Multiple factors. Ch , Some miscellaneous parts

UNIVERSITY OF TORONTO Faculty of Arts and Science

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics

UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Applied Statistics Friday, January 15, 2016

Covariance Structure Approach to Within-Cases

A course in statistical modelling. session 09: Modelling count variables

Analysis of Covariance. The following example illustrates a case where the covariate is affected by the treatments.

Test 3 Practice Test A. NOTE: Ignore Q10 (not covered)

6.1 Frequency Distributions from Data. Discrete Distributions Example, Four Forms, Four Uses Continuous Distributions. Example, Four Forms, Four Uses

Correlation Analysis

Logistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression

Regression: Main Ideas Setting: Quantitative outcome with a quantitative explanatory variable. Example, cont.

Exam details. Final Review Session. Things to Review

Econometrics. 4) Statistical inference

Statistical Distribution Assumptions of General Linear Models

IES 612/STA 4-573/STA Winter 2008 Week 1--IES 612-STA STA doc

MIXED MODELS FOR REPEATED (LONGITUDINAL) DATA PART 2 DAVID C. HOWELL 4/1/2010

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

Inference for Regression Inference about the Regression Model and Using the Regression Line, with Details. Section 10.1, 2, 3

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Exam Applied Statistical Regression. Good Luck!

1 Introduction to Minitab

Section Poisson Regression

Chapter 5: Logistic Regression-I

Multiple Logistic Regression for Dichotomous Response Variables

36-309/749 Experimental Design for Behavioral and Social Sciences. Sep. 22, 2015 Lecture 4: Linear Regression

(c) Interpret the estimated effect of temperature on the odds of thermal distress.

Some comments on Partitioning

Mixed models in R using the lme4 package Part 7: Generalized linear mixed models

1. (Problem 3.4 in OLRT)

Lecture 4. Checking Model Adequacy

SMA 6304 / MIT / MIT Manufacturing Systems. Lecture 10: Data and Regression Analysis. Lecturer: Prof. Duane S. Boning

Ron Heck, Fall Week 3: Notes Building a Two-Level Model

Lecture 11: Simple Linear Regression

Topic 20: Single Factor Analysis of Variance

Testing Independence

Review of One-way Tables and SAS

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Ch 3: Multiple Linear Regression

1.) Fit the full model, i.e., allow for separate regression lines (different slopes and intercepts) for each species

Introduction to Linear regression analysis. Part 2. Model comparisons

A discussion on multiple regression models

STAT 115:Experimental Designs

Regression models. Generalized linear models in R. Normal regression models are not always appropriate. Generalized linear models. Examples.

Sections 4.1, 4.2, 4.3

Value Added Modeling

The GENMOD Procedure. Overview. Getting Started. Syntax. Details. Examples. References. SAS/STAT User's Guide. Book Contents Previous Next

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference.

Transcription:

Model Based Statistics in Biology. Part V. The Generalized Linear Model. Analysis of Continuous Data ReCap. Part I (Chapters 1,2,3,4), Part II (Ch 5, 6, 7) ReCap Part III (Ch 9, 10, 11), Part IV (Ch13, 14) 16 The Generalized Linear Model 16.1 Analysis of Count Data Binomial, Poisson, and Negative Binomial Counts Goodness of Fit - Chisquared Statistic 16.2 Analysis of Deviance Goodness of Fit - G Statistic Likelihood ratio tests Data Equations Improvement in fit G Analysis of Deviance Table Analysis of Deviance - Mutant frequency 16.3 Analysis of Continuous Data on chalk board Ch16.xls ReCap (Ch 16) We extend the model based approach we have learned to non-normal errors. This is called the generalized linear model. GLM (normal errors) is a special case of GzLM Today: Analysis of Continuous Data Wrap-up. The General Linear Model is a special case of the Generalized Linear Model. Consequently, we can carry out any GLM as a GzLM. The example today demonstrated the analysis of deviance for fly heterzygosity data already analyzed with ANOVA. The notation for the GzLM differs from the GLM. The computational routine for the GzLM differs from the GLM and so we obtain somewhat different estimates of parameters and of p-values for each term in the model. 1

The Generalized Linear Model (normal errors). We can carry out hypothesis testing using goodness of fit ( G) and analysis of deviance within the framework of the Generalized Linear Model. To do this we use the generic recipe for analysis of data with the General Linear Model, with only a few modifications (see Chapter 16 for modified recipe). Fly Heterozygosity. 1. Construct model Verbal model. Inversion heterozygosity changes with altitude, depending on species. 2

Graphical model. Figure 14.1a Figure 14.1b Formal model Response variable is inversion heterozygosity in two species of fruit fly, Drosophila persimilis and D. pseudoobscura H = % H = % per The ratio scale explanatory variable is altitude Alt = km The nominal scale explanatory variable is species Sp = D. persimilis or D. pseudoobscura pse 3

Write formal model The notation differs from that for the general linear model. However, if we substitute the second expression into the first, we obtain the same model as for the general linear model. This new notation will be needed when we move from normal errors to other error distributions. 4

2. Execute model. Place data in model format (column of data for H, Alt, and Sp). Options linesize=80; Title1 'Heterozygosity in relation to Elevation D. persimils, D. Pseudoobscura, Brussard 1984'; Data a; Input Elev H SP $; Cards; 850 0.59 Dper 3000 0.37 Dper 4600 0.41 Dper 6200 0.40 Dper 8000 0.31 Dper 8600 0.18 Dper 10000 0.20 Dper 850 0.70 Dpse 3000 0.69 Dpse 4600 0.71 Dpse 6200 0.70 Dpse 8000 0.70 Dpse 8600 0.62 Dpse 10000 0.68 Dpse ; 5

Code the model in statistical package according to the structural model H = o + Alt Alt + Sp SP + Alt Sp Alt SP + Proc Genmod; Class Sp; Model H = Alt Sp Alt*Sp/ link=identity dist=normal type1 type3; OUTPUT out=respred p=pred stdresdev=stdresdev; PROC PLOT data=respred; plot stdresdev*pred/vref=0; The structural model consists of all of the explanatory terms, including interaction terms. The structural model is specified in a model statement, which has the same format as the model statement for the general linear model. In addition to the structural model, we need to state the error structure and the function that links the response variable to the structural model. For the general linear model, the error structure is normal and the link is the identify link. 6

2. Execute model. Here is the general linear model for the fly heterozygosity data, written as a general linear model. H = + [this is the identity link] ~N(0, ) [the error is normal, with mean of zero and dispersion that will be estimated from the data] = + Alt + SP + Alt SP [this is the structural model] o Alt Sp Alt Sp Code the model in statistical package according to the structural model H = o + Alt Alt + Sp SP + Alt Sp Alt SP + Proc Genmod; Class Sp; Model H = Alt Sp Alt*Sp/ link=identity dist=normal type1 type3; OUTPUT out=respred p=pred stdresdev=stdresdev; PROC PLOT data=respred; plot stdresdev*pred/vref=0; We calculate the standardized (deviance) residuals because we expect these residuals to be homogeneous if our choice of error structure was correct. A deviance residual is the contribution of a particular observation to the overall deviance. The deviance residuals are computed by dividing the raw residuals by a factor that makes the variance constant, if the assumed error distribution is correct. In the case of normal errors, this factor is unity and hence the raw and deviance residuals are the same. 7

3. Evaluate model a. Straight line assumption. OK - no bowls or arches b. Homogeneity of variance. OK - no cones in deviance residuals Note that we evaluate the homogeneity of the residuals, but we no longer evaluate whether the residuals are normal. 4. State population and whether sample is representative. As before, we will assume that the data come from a statistical population--all measurements that could have been obtained, using the procedural statement. 5. Decide on mode of inference. Is hypothesis testing appropriate? Yes. We wish to know whether the apparent difference in gradient between the two species is more than chance. 6. State H A / H o pair, tolerance for Type I error Interaction term. Are the heterozygosity gradients the same? Deviance( ) > 0 Same as H : Deviance( ) = 0 Same as H : = Sp x Alt A per pse Sp x Alt o per pse Statistic - Non-Pearsonian chisquare (G-statistic) 8

7. Analysis of Deviance (instead of analysis of variance). Here is the AnoDev table for the fly heterozygosity example. H = o + Alt Alt + Sp SP + Alt Sp Alt SP + Source df G = 2*lnL G -----> Pr>ChiSq Intercept 1 6.5402 Alt 1 8.2761 1.74 0.1877 Sp 1 35.9782 27.70 <0.0001 Alt*Sp 1 48.9910 13.01 0.0003 For comparison, here is the ANOVA table. Source df Seq SS MS F ----> Pr>F Alt 1 0.05991 0.0599 24.19 0.0006 Sp 1 0.39111 0.3911 157.91 <0.0001 Alt*Sp 1 0.03798 0.03798 15.33 0.0029 Res 10 0.02477 0.00248 Total 13 0.51377 9

7. Analysis of Deviance Source. The intercept is the first parameter estimated in the model. In this case the intercept is the Y-intercept of the first species group, D. persimilis. H = 0.712 0.0145 Alt pers df. The degrees are freedom are calculated in the same way as with the ANOVA table, for the structural part of the model. df and df are not listed. total residual G replaces Seq SS. The first G value is the fit of the model to a single value, the intercept. In this example (normal error, identity link) the intercept is 0.7117. The deviance associated with intercept term is G = 6.54. G This is the change in fit associated with each term in the model. The fit, if we add Altitude, is G = 8.2761, the change in fit is G = 1.74 The fit, after we add the species term, is G = 35.98, the change in fit due to species is then G = 27.7 The fit, after we add the interaction term, is G = 48.99, the change in fit is G = 13.01 10

7. Analysis of Deviance p-value For each change in fit, we can compute a p-value from a Chisquare distribution. It is evident that the AnoDev and ANOVA table give different p-values for each term. This stems in part from differences in the procedures used by GLM routines (ordinary least squares) and those used by GzLM routines (iteratively reweighted least squares). The differences also stem from the basis of comparison. The ANOVA table uses ratios of variances while the AnoDev table uses differences in deviations relative to a simpler model immediately above it in the table. The differences stemming from sequential comparisons in the AnoDev table are removed by computing an adjusted SS and adjusted G. The same strategy (called Type III analysis) is used for both ANOVA and Analysis of Deviance: what is the SS or G value if the term is included last in the model. 11

7. Analysis of Deviance Here are the ANOVA and AnoDev tables for Type III computation (each term entered last in the model). H = o + Alt Alt + Sp SP + Alt Sp Alt SP + Source df G = 2*lnL G ----> Pr>ChiSq Alt 1 17.21 <0.0001 Sp 1 5.78 0.0162 Alt*Sp 1 13.01 0.0003 Source df Adj SS Adj MS F ----> Pr>F Alt 1 0.05991 0.0599 24.19 0.0006 Sp 1 0.39111 0.0127 5.11 0.0473 Alt*Sp 1 0.03798 0.03798 15.33 0.0029 Res 10 0.02477 0.00248 Total 13 0.51377 Now the results from the ANOVA and AnoDev table are similar. The two tables produce the same decisions concerning the statistical significance of each term. Te next step is to declare a decision. We no longer need to decide whether to use randomization to overcome problems with non-normal errors. 12

8. Assess table, based on evaluation of residuals. Assumptions met, continue to step 9. 9. Declare decision about terms in model. Reject H o that slopes are equal. Accept H A that slopes differ. 0.00029 = p < = 0.05. 10. Analysis of parameters of biological interest. Analysis Of Parameter Estimates Standard Wald 95% Chi- Parameter DF Estimate Error Confidence Limits Square Pr > ChiSq Intercept 1 0.7117 0.0348 0.6435 0.7798 418.68 <.0001 SP Dper 1-0.1316 0.0492-0.2280-0.0352 7.16 0.0075 SP Dpse 0 0.0000 0.0000 0.0000 0.0000.. Elev 1-0.0000 0.0000-0.0000 0.0000 0.70 0.4016 Elev*SP Dper 1-0.0000 0.0000-0.0000-0.0000 21.46 <.0001 Elev*SP Dpse 0 0.0000 0.0000 0.0000 0.0000.. Scale 1 0.0421 0.0079 0.0290 0.0609 [slopes not reported in Elev*Sp term because too few decimal places reported] [redo with Elev = km in order to display parameter estimates] 13