Generalized Linear Models (GLZ)

Generalized Linear Models (GLZ) are an extension of the linear modeling process that allows models to be fit to data that follow probability distributions other than the Normal distribution, such as the Poisson, Binomial, and Multinomial distributions. Generalized Linear Models also relax the requirement of equality or constancy of variances that is required for hypothesis tests in traditional linear models.

The General Linear Univariate Model (GLUM)

Most parametric statistical analyses can be viewed as a process of fitting a linear model to the observed data and testing hypotheses about the fitted model's parameters. Even the lowly t test is a form of the General Linear Univariate Model (GLUM). The Analysis of Variance (ANOVA), Regression, Multiple Regression, and the Analysis of Covariance (ANCOVA) are more complicated forms of the GLUM. The least squares criterion is used to obtain estimates of the parameters of these models. Additional assumptions must be met in order to test hypotheses about the model's parameters. Besides the assumption of independence of the observations, which is required for all of these analyses, hypothesis tests derived from GLUMs require normality of the response variable and constancy or homogeneity of variances.

The General Linear Multivariate Model (GLMM)

When attempting to explain variation in more than one response variable simultaneously, the modeling exercise is to fit the General Linear Multivariate Model (GLMM) to the data. Commonly used multivariate statistical procedures such as Multivariate Analysis of Variance (MANOVA), Multivariate Analysis of Covariance (MANCOVA), Discriminant Function Analysis (DFA), Canonical Correlation Analysis (CCA), and Principal Components Analysis (PCA) are all forms of the GLMM. To perform hypothesis tests in the context of the GLMM, one must assume that the response variables are multivariate normal and that the variance-covariance matrices are homogeneous.
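The statement in the GLUM section that the t test is a form of the general linear model can be checked numerically. The sketch below is only an illustration on simulated data (the group means, sample sizes, and variable names are assumptions, not taken from this document): it computes the classical pooled-variance t statistic and the t statistic for the slope of a least squares fit on a 0/1 group indicator.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(10.0, 2.0, size=12)   # hypothetical group A
b = rng.normal(12.0, 2.0, size=15)   # hypothetical group B
n_a, n_b = a.size, b.size

# Classical two-sample t statistic with pooled variance.
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
t_classic = (b.mean() - a.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# The same comparison as a linear model: y = b0 + b1 * group,
# where group is 0 for A and 1 for B.
y = np.concatenate([a, b])
group = np.concatenate([np.zeros(n_a), np.ones(n_b)])
X = np.column_stack([np.ones(y.size), group])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual variance and the standard error of the slope.
resid = y - X @ beta
sigma2 = resid @ resid / (y.size - X.shape[1])
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
t_model = beta[1] / np.sqrt(cov_beta[1, 1])

print(t_classic, t_model)
```

The two printed values agree up to rounding error, because the slope equals the difference of the group means and the residual variance equals the pooled variance.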
When the distribution of the response variable(s) is not normal or multivariate normal, or if the variances or the variance-covariance matrices are not homogeneous, then application of hypothesis tests to GLUMs or GLMMs can lead to Type I and Type II error rates that differ from the nominal rates. Traditionally, transformations of the scale of the response variables have been applied to ensure that the assumptions required for hypothesis tests are met. For example, count data are often Poisson distributed and tend to be right skewed. Furthermore, the variance of a Poisson random variable is equal to its mean. Hence, for count data a transformation must both normalize the data and eliminate the inherent variance heterogeneity. Commonly, count data are transformed to a logarithmic scale or a square-root scale; however, such transformations do not always achieve the desired end. In fact, there is no a priori reason to believe that a scale exists on which the data meet both the normality and variance homogeneity assumptions.

Generalizing the Linear Model

The Generalized Linear Model is an extension of the General Linear Model to include response variables that follow any probability distribution in the exponential family of distributions. The exponential family includes such useful distributions as the Normal, Binomial, Poisson, Multinomial, Gamma, Negative Binomial (with known dispersion parameter), and others. Hypothesis tests applied to the Generalized Linear Model do not require normality of the response variable, nor do they require homogeneity of variances. Hence, Generalized Linear Models can be used when response variables follow distributions other than the Normal distribution, and when variances are not constant. For example, count data would be appropriately analyzed as a Poisson random variable within the context of the Generalized Linear Model. Parameter estimates are obtained using the principle of maximum likelihood; therefore hypothesis tests are based on comparisons of the likelihoods or the deviances of nested models.

What puts the -ized in Generalized Linear Models

The common linear regression model (a form of the general linear model) specifies that the mean response $\mu$ is identical to a linear function $\eta$ of the predictor variables $x_j$:

$$E(Y) = \mu = \eta = \beta_0 + \sum_{j=1}^{p} \beta_j x_j, \qquad (1)$$

and uses least squares as the criterion by which to estimate the unknown parameters $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$. When observations are independent and normally distributed with constant variance $\sigma^2$, least squares estimation of $\beta$ is equivalent to maximum likelihood estimation (the two estimates of $\sigma^2$ differ only in their divisor).
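The equivalence between least squares and maximum likelihood under normality follows directly from the form of the normal log-likelihood. A short derivation, assuming the notation $x_{ij}$ for the value of predictor $j$ on observation $i$ (an index the text above does not write out):

```latex
% Assumed notation: x_{ij} is predictor j measured on observation i, i = 1, ..., n.
Y_i \sim N(\mu_i, \sigma^2) \ \text{independently}, \qquad
\mu_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}

% Log-likelihood of the sample:
l(\beta, \sigma^2; y)
  = -\frac{n}{2} \log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu_i)^2

% For any fixed \sigma^2, maximizing l over \beta means minimizing the sum of squares:
\hat\beta_{ML} = \arg\max_{\beta} \, l(\beta, \sigma^2; y)
               = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \mu_i)^2
               = \hat\beta_{LS}

% Maximizing over \sigma^2 at \hat\beta gives the residual sum of squares divided by n:
\hat\sigma^2_{ML} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat\mu_i)^2
```

The maximum likelihood estimate of $\sigma^2$ divides by $n$, whereas the usual unbiased least squares estimate divides by $n - p - 1$; the estimates of $\beta$ are identical.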
Generalized linear models encompass the general linear model and enlarge the class of linear least-squares models in two ways. First, the distribution of Y for fixed x is merely assumed to be from the exponential family of distributions, which includes important distributions such as the binomial, Poisson, exponential, and gamma distributions, in addition to the normal distribution. Second, the relationship between $E(Y) = \mu$ and $\eta$ is specified by a link function $\eta = g(\mu)$, which may be non-linear and is only required to be monotonic and differentiable.
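As a rough illustration of what a link function does, the sketch below defines four commonly used links together with their inverses and checks that each one is monotonic and invertible on a grid of means. The dictionary `LINKS` and the test values in `mu` are hypothetical names and numbers chosen for the example.

```python
import numpy as np

# Each entry holds (g, g_inverse): g maps the mean mu to the linear
# predictor eta, and g_inverse maps eta back to mu.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (np.log, np.exp),
    "logit": (lambda mu: np.log(mu / (1 - mu)), lambda eta: 1 / (1 + np.exp(-eta))),
    "reciprocal": (lambda mu: 1 / mu, lambda eta: 1 / eta),
}

# Increasing grid of means inside (0, 1), so every link above is defined.
mu = np.array([0.1, 0.25, 0.5, 0.9])

round_trip_ok = {}
monotonic = {}
for name, (g, g_inv) in LINKS.items():
    eta = g(mu)
    # Applying the inverse link must recover the original means.
    round_trip_ok[name] = bool(np.allclose(g_inv(eta), mu))
    # Monotonic means eta is strictly increasing or strictly decreasing in mu.
    steps = np.diff(eta)
    monotonic[name] = bool(np.all(steps > 0) or np.all(steps < 0))

print(round_trip_ok)
print(monotonic)
```

The reciprocal link is decreasing rather than increasing in the mean, which is still monotonic and therefore admissible.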

The link function serves to link the random or stochastic component of the model, the probability distribution of the response variable, to the systematic component of the model (the linear predictor):

$$g(E(Y)) = g(\mu) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p, \qquad (2)$$

where $g(\mu)$ is a link function that connects the mean of the random component, $E(Y)$, to the systematic component $(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)$. For traditional linear models, in which the random component consists of the assumption that the response variable follows the Normal distribution, the canonical link function is the identity link. The identity link specifies that the expected value of the response variable is identical to the linear predictor, rather than to a non-linear function of the linear predictor. The canonical link functions for a variety of probability distributions are given below.

Probability Distribution    Canonical Link Function
Normal                      Identity
Binomial                    Logit
Poisson                     Log
Gamma                       Reciprocal

Although other link functions are possible, the canonical links are most often used.

Estimation and Testing

The parameters in a generalized linear model can be estimated by the maximum likelihood method. For a given probability distribution specified by $f(y_i; \beta, \phi)$, where $\phi$ is the dispersion parameter, and observations $y = (y_1, y_2, \ldots, y_n)'$, the log-likelihood function for $\beta$ and $\phi$, expressed as a function of the mean values $\mu = (\mu_1, \ldots, \mu_n)'$ of the responses $\{Y_1, Y_2, \ldots, Y_n\}$, has the form

$$l(\mu; y) = \sum_{i=1}^{n} \log f(y_i; \beta, \phi).$$

The maximum likelihood estimates of the parameters $\beta$ can be obtained by iteratively reweighted least squares (IRLS). Detailed information about the iterative algorithm and the asymptotic properties of the parameter estimates can be found in McCullagh and Nelder (1989).

Analogous to the residual sum of squares in linear regression, the goodness-of-fit of a generalized linear model can be measured by the scaled deviance

$$D(y; \hat{\mu}) = 2\,[\,l(y; y) - l(\hat{\mu}; y)\,],$$

where $l(y; y)$ is the maximum log-likelihood achievable for an exact fit in which the fitted values are equal to the observed values, and $l(\hat{\mu}; y)$ is the log-likelihood function calculated at the estimated parameters $\hat{\beta}$.

The deviance function is very useful for comparing two models when one model has parameters that are a subset of the second model. The deviance is additive for such nested models if maximum likelihood estimates are used (McCullagh and Nelder 1989). Consider two nested models, with the second having some covariates omitted, and denote the maximum likelihood estimates in the two models by $\hat{\mu}_1$ and $\hat{\mu}_2$, respectively. Then the deviance difference

$$D(y; \hat{\mu}_2) - D(y; \hat{\mu}_1)$$

is identical to the likelihood-ratio statistic and has an approximate $\chi^2$ distribution with degrees of freedom equal to the difference between the numbers of parameters in the two models. For probability distributions in the exponential family the $\chi^2$ approximation is usually quite accurate for differences of deviance, even though it may be inaccurate for the deviances themselves (McCullagh and Nelder 1989).

Over-dispersion

If the sampling variance of a response variable $Y_i$ is substantially greater than that predicted by the assumed probability distribution, $Y_i$ is said to be over-dispersed. The covariance matrix of $\hat{\beta}$ is estimated by

$$\mathrm{Cov}(\hat{\beta}) = \phi\,(X'WX)^{-1},$$

where $X$ is the covariate matrix and $W$ is the weight matrix used in the iterative algorithm. If over-dispersion occurs, ignoring it (i.e., setting $\phi = 1$) will result in underestimating the standard errors of the parameter estimates, which may lead to incorrect conclusions.
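The estimation, deviance, and over-dispersion formulas above can be sketched for one concrete case, Poisson regression with the log link. This is a minimal illustration under stated assumptions, not the algorithm as implemented in any particular package: the data are simulated, the function names `fit_poisson_irls` and `poisson_deviance` are hypothetical, and the dispersion is estimated with the Pearson chi-square statistic divided by the residual degrees of freedom, which is one common choice that this document does not itself specify.

```python
import math
import numpy as np

def fit_poisson_irls(X, y, max_iter=50, tol=1e-10):
    """Poisson GLM with log link, fitted by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # assumes column 0 of X is the intercept
    for _ in range(max_iter):
        eta = X @ beta                  # linear predictor
        mu = np.exp(eta)                # inverse of the log link
        z = eta + (y - mu) / mu         # working response
        XtW = X.T * mu                  # IRLS weights equal mu for Poisson with log link
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    mu = np.exp(X @ beta)
    return beta, mu, (X.T * mu) @ X     # estimates, fitted means, X'WX

def poisson_deviance(y, mu):
    """D = 2 * sum(y * log(y / mu) - (y - mu)), taking y * log(y) = 0 when y = 0."""
    safe_y = np.where(y > 0, y, 1.0)
    return 2.0 * np.sum(y * np.log(safe_y / mu) - (y - mu))

# Simulated counts whose log mean is linear in one covariate (hypothetical values).
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0.0, 2.0, size=n)
y = rng.poisson(np.exp(0.5 + 0.8 * x)).astype(float)

X_full = np.column_stack([np.ones(n), x])   # intercept and covariate
X_null = np.ones((n, 1))                    # nested model with the covariate omitted

beta_full, mu_full, xtwx_full = fit_poisson_irls(X_full, y)
_, mu_null, _ = fit_poisson_irls(X_null, y)

# Deviance difference of the nested models is the likelihood-ratio statistic.
dev_full = poisson_deviance(y, mu_full)
dev_null = poisson_deviance(y, mu_null)
lr_stat = dev_null - dev_full
p_value = math.erfc(math.sqrt(lr_stat / 2.0))   # chi-square tail, valid for 1 df only

# Pearson-based dispersion estimate and the resulting standard errors.
phi_hat = np.sum((y - mu_full) ** 2 / mu_full) / (n - X_full.shape[1])
se_poisson = np.sqrt(np.diag(np.linalg.inv(xtwx_full)))   # phi fixed at 1
se_scaled = np.sqrt(phi_hat) * se_poisson                 # phi estimated

print(beta_full, dev_full, dev_null, lr_stat, p_value, phi_hat)
```

Because the simulated data really are Poisson, the estimated dispersion should be close to 1 and the two sets of standard errors nearly equal; with over-dispersed counts the scaled standard errors would be noticeably larger.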
McCullagh and Nelder (1989) suggest modeling the mean and dispersion jointly as a way to take possible over-dispersion into account, and describe the fitting procedure in detail.

Applications

Several forms of the Generalized Linear Model are now commonly used and implemented in many statistical software packages. Logistic Regression, Multiway Frequency Analysis (Log-Linear Models), Logit Models, and Poisson Regression are all forms of the Generalized Linear Model. In Logistic Regression, the binary response variable is modeled as a Binomial random variable with the logit link function. For Multiway Frequency Analysis (Log-Linear Models), the response variable is usually modeled as a Poisson random variable with the log link function. However, one could assume that the response variable is Binomial or Multinomial, and the results would not differ from those obtained assuming the response variable to be Poisson distributed (Agresti 1996). For logit models, binary response variables are modeled as Binomial random variables, while polychotomous response variables are modeled as Multinomial random variables, but in both instances the link function is the logit function. In Poisson regression, the response variable is modeled as a Poisson random variable with the log link function.

Software

GLZs can be fit and evaluated using SPLUS, SAS, SPSS, and a number of other statistical packages. Of the major packages, SPLUS and SAS provide greater flexibility in fitting and evaluating GLZs.

References

Agresti, A. 1996. An Introduction to Categorical Data Analysis. John Wiley & Sons: New York. (A very readable introduction to the many forms of the generalized linear model.)

McCullagh, P. and J.A. Nelder. 1989. Generalized Linear Models. Chapman and Hall: London. (Mathematical statistics of the generalized linear model.)

Ecological Applications of Generalized Linear Models

Vincent, P.J. and J.M. Haworth. 1983. Poisson regression models of species abundance. Journal of Biogeography 10: 153-160.

Connor, E.F., E. Hosfield, D. Meeter, and X. Nui. 1997. Tests for aggregation and size-based sample-unit selection when sample units vary in size. Ecology 78: 1238-1249.

Links to Other Websites

Site                                  Description
The Generalized Linear Models Page    Introduction, bibliography, software, and other information on GLZs
Statsoft online textbook              Fairly comprehensive introduction to GLZs
GLMLAB                                Using Matlab to fit GLZs
Introduction to GLM                   Brief introduction to GLZs