Generalized Linear Models (GLZ)

Generalized Linear Models (GLZ) are an extension of the linear modeling process that allows models to be fit to data that follow probability distributions other than the Normal distribution, such as the Poisson, Binomial, and Multinomial. Generalized Linear Models also relax the requirement of equality or constancy of variances that is required for hypothesis tests in traditional linear models.

The General Linear Univariate Model (GLUM)

Most parametric statistical analyses can be viewed as a process of fitting a linear model to the observed data and testing hypotheses about the fitted model's parameters. Even the lowly t test is a form of the General Linear Univariate Model (GLUM). The Analysis of Variance (ANOVA), Regression, Multiple Regression, and the Analysis of Covariance (ANCOVA) are more complicated forms of the GLUM. The least squares criterion is used to obtain estimates of the parameters of these GLUM models. Additional assumptions must be met in order to test hypotheses about the model's parameters. Besides the assumption of independence of the observations, which is required for all statistical analyses, hypothesis tests derived from GLUMs require normality of the response variable and constancy or homogeneity of variances.

The General Linear Multivariate Model (GLMM)

When attempting to explain variation in more than one response variable simultaneously, the modeling exercise is to fit the General Linear Multivariate Model (GLMM) to the data. Commonly used multivariate statistical procedures such as Multivariate Analysis of Variance (MANOVA), Multivariate Analysis of Covariance (MANCOVA), Discriminant Function Analysis (DFA), Canonical Correlation Analysis (CCA), and Principal Components Analysis (PCA) are all forms of the GLMM. To perform hypothesis tests in the context of the GLMM, one must assume that the response variables are multivariate normal and that the variance-covariance matrices are homogeneous.
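The remark that even the t test is a form of the GLUM can be checked numerically. The sketch below is a minimal illustration, assuming a pooled-variance two-sample t test and made-up data (the group values and function names are hypothetical, not from the original text). It regresses the response on a 0/1 group indicator and compares the t statistic of the slope with the classical two-sample t statistic.

```python
import math

def pooled_t_statistic(group0, group1):
    """Classical two-sample t statistic with a pooled variance estimate."""
    n0, n1 = len(group0), len(group1)
    m0, m1 = sum(group0) / n0, sum(group1) / n1
    ss0 = sum((v - m0) ** 2 for v in group0)
    ss1 = sum((v - m1) ** 2 for v in group1)
    pooled_var = (ss0 + ss1) / (n0 + n1 - 2)
    return (m1 - m0) / math.sqrt(pooled_var * (1 / n0 + 1 / n1))

def slope_t_statistic(x, y):
    """t statistic for the slope of a least squares line y = b0 + b1*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return b1 / standard_error

# Hypothetical measurements for two groups (illustration only)
group0 = [4.1, 5.0, 3.8, 4.6]
group1 = [5.9, 6.3, 5.1, 6.8, 6.0]

# Code group membership as a 0/1 predictor and stack the responses
x = [0.0] * len(group0) + [1.0] * len(group1)
y = group0 + group1

t_from_test = pooled_t_statistic(group0, group1)
t_from_regression = slope_t_statistic(x, y)
print(t_from_test, t_from_regression)
```

The two statistics agree up to rounding error because, with a 0/1 predictor, the fitted slope is the difference between the group means and the residual sum of squares is the pooled within-group sum of squares.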
When the distribution of the response variable(s) is not normal or multivariate normal, or if the variances or the variance-covariance matrices are not homogeneous, then application of hypothesis tests to GLUMs or GLMMs can lead to Type I and Type II error rates that differ from the nominal rates. Traditionally, transformations of the scale of the response variables have been applied to ensure that the assumptions required for hypothesis tests are met. For example, count data are often Poisson distributed and tend to be right skewed. Furthermore, the variance of a Poisson random variable is equal to its mean. Hence, for count data a transformation must both normalize the data and eliminate the inherent variance heterogeneity. Commonly, count data are transformed to a logarithmic scale or even a square-root scale; however, such transformations are not always successful in achieving the desired end. In fact, there is no a priori reason to believe that a scale exists that will ensure that data meet the normality and variance homogeneity assumptions.

General-izing the Linear Model

The Generalized Linear Model is an extension of the General Linear Model to include response variables that follow any probability distribution in the exponential family of distributions. The exponential family includes such useful distributions as the Normal, Binomial, Poisson, Multinomial, Gamma, Negative Binomial, and others. Hypothesis tests applied to the Generalized Linear Model do not require normality of the response variable, nor do they require homogeneity of variances. Hence, Generalized Linear Models can be used when response variables follow distributions other than the Normal distribution, and when variances are not constant. For example, count data would be appropriately analyzed as a Poisson random variable within the context of the Generalized Linear Model. Parameter estimates are obtained using the principle of maximum likelihood; therefore hypothesis tests are based on comparisons of likelihoods or the deviances of nested models.

What puts the -ized in Generalized Linear Models

The common linear regression model (a form of the general linear model) specifies that the mean response µ is identical to a linear function η of the predictor variables x_j:

$$E(Y) = \mu = \eta = \beta_0 + \sum_{j=1}^{p} \beta_j x_j, \tag{1}$$

and uses least squares as the criterion by which to estimate the unknown parameters β = (β_0, β_1, ..., β_p)'. When observations are independent and normally distributed with constant variance σ², least squares estimation of β is equivalent to maximum likelihood estimation (the corresponding estimates of σ² differ only in their divisor).
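The equivalence between least squares and maximum likelihood under normal errors can be illustrated with a small numerical check. The following is a sketch under stated assumptions: a single predictor, made-up data, and a variance held at an arbitrary fixed value (all names and numbers are hypothetical). It computes the least squares line in closed form and confirms that nudging either coefficient away from it lowers the Normal log-likelihood.

```python
import math

def least_squares_fit(x, y):
    """Closed-form least squares estimates for y = b0 + b1*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def normal_log_likelihood(b0, b1, sigma2, x, y):
    """Log-likelihood of independent Normal responses with mean b0 + b1*x."""
    n = len(x)
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)

# Hypothetical data (illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

b0, b1 = least_squares_fit(x, y)
sigma2 = 0.05  # arbitrary fixed variance; the maximizing b0, b1 do not depend on it

best = normal_log_likelihood(b0, b1, sigma2, x, y)
perturbed = [
    normal_log_likelihood(b0 + d0, b1 + d1, sigma2, x, y)
    for d0 in (-0.1, 0.0, 0.1)
    for d1 in (-0.05, 0.0, 0.05)
    if (d0, d1) != (0.0, 0.0)
]
ls_is_ml = all(best > value for value in perturbed)
print(b0, b1, ls_is_ml)
```

For fixed σ², the log-likelihood depends on the coefficients only through the residual sum of squares, so minimizing that sum maximizes the likelihood.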
Generalized linear models encompass the general linear model and enlarge the class of linear least-squares models in two ways. First, the distribution of Y for fixed x is merely assumed to be from the exponential family of distributions, which includes important distributions such as the binomial, Poisson, exponential, and gamma distributions, in addition to the normal distribution. Second, the relationship between E(Y) = µ and η is specified by a (possibly non-linear) link function η = g(µ), which is only required to be monotonic and differentiable.
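A link function and its inverse can be written down directly. The sketch below is illustrative only: the dictionary name, helper function, and numeric values are hypothetical. It collects four commonly used links (identity, log, logit, reciprocal) and shows how the inverse link maps a linear predictor η back to a mean µ on the scale of the response.

```python
import math

# Each entry holds (g, g_inverse): eta = g(mu) and mu = g_inverse(eta)
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "reciprocal": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
}

def mean_from_linear_predictor(link_name, eta):
    """Map a linear predictor value back to the mean scale."""
    _, inverse_link = LINKS[link_name]
    return inverse_link(eta)

# Hypothetical coefficients and predictor value (illustration only)
beta0, beta1, x_value = 0.3, 0.5, 2.0
eta = beta0 + beta1 * x_value

mu_count = mean_from_linear_predictor("log", eta)          # always positive
mu_proportion = mean_from_linear_predictor("logit", eta)   # always in (0, 1)
print(eta, mu_count, mu_proportion)
```

The same linear predictor yields a positive mean under the log link and a mean between 0 and 1 under the logit link, which is why these links suit counts and proportions respectively.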

The link function serves to link the random or stochastic component of the model, the probability distribution of the response variable, to the systematic component of the model (the linear predictor):

$$g(E(Y)) = g(\mu) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p, \tag{2}$$

where g(µ) is a (possibly non-linear) link function that links the random component, E(Y), to the systematic component (β_0 + β_1 x_1 + ... + β_p x_p). For traditional linear models in which the random component consists of the assumption that the response variable follows the Normal distribution, the canonical link function is the identity link. The identity link specifies that the expected mean of the response variable is identical to the linear predictor, rather than to a non-linear function of the linear predictor. The canonical link functions for a variety of probability distributions are given below.

Probability Distribution    Canonical Link Function
Normal                      Identity
Binomial                    Logit
Poisson                     Log
Gamma                       Reciprocal

Although other link functions are possible, the canonical links are most often used.

Estimation and Testing

The parameters in a generalized linear model can be estimated by the maximum likelihood method. For a given probability distribution specified by f(y_i; β, φ) and observations y = (y_1, y_2, ..., y_n)', the log-likelihood function for β and φ, expressed as a function of the mean values µ = (µ_1, ..., µ_n)' of the responses {Y_1, Y_2, ..., Y_n}, has the form

$$l(\mu; y) = \sum_{i=1}^{n} \log f(y_i; \beta, \phi).$$

The maximum likelihood estimates of the parameters β can be obtained by iteratively re-weighted least squares (IRLS). Detailed information about the iterative algorithm and asymptotic properties of the parameter estimates can be found in McCullagh and Nelder (1989).

Analogous to the residual sum of squares in linear regression, the goodness-of-fit of a generalized linear model can be measured by the scaled deviance

$$D(y; \hat{\mu}) = 2\,[\,l(y; y) - l(\hat{\mu}; y)\,],$$

where l(y; y) is the maximum log-likelihood achievable for an exact fit in which the fitted values are equal to the observed values, and l(µ̂; y) is the log-likelihood function calculated at the estimated parameters β̂. The deviance function is very useful for comparing two models when one model has parameters that are a subset of the second model. The deviance is additive for such nested models if maximum likelihood estimates are used (McCullagh and Nelder 1989). Consider two nested models with the second having some covariates omitted, and denote the maximum likelihood estimates in the two models by µ̂_1 and µ̂_2, respectively. Then the deviance difference

$$D(y; \hat{\mu}_2) - D(y; \hat{\mu}_1)$$

is identical to the likelihood-ratio statistic and has an approximate χ² distribution with degrees of freedom equal to the difference between the numbers of parameters in the two models. For probability distributions in the exponential family the χ² approximation is usually quite accurate for differences of deviance even though it may be inaccurate for the deviances themselves (McCullagh and Nelder 1989).

Over-dispersion

If the sampling variance of a response variable Y_i is significantly greater than that predicted by the assumed probability distribution, Y_i is said to be over-dispersed. The covariance matrix of β̂ is estimated by

$$\mathrm{Cov}(\hat{\beta}) = \phi\,(X'WX)^{-1},$$

where X is the covariate matrix and W is a weight matrix used in the iterative algorithm. If over-dispersion occurs, ignoring it (i.e., setting φ = 1) will result in underestimating the standard errors of the parameter estimates, which may lead to incorrect conclusions.
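The estimation and testing steps described above can be sketched for the simplest case, a Poisson model with a log link and a single predictor. This is an illustrative sketch, not the algorithm as given by McCullagh and Nelder: the data are made up, the function names are hypothetical, the iteration count is fixed rather than checked for convergence, and the Pearson-based estimate of φ is one common choice that is assumed here rather than taken from the text.

```python
import math

def fit_poisson_irls(x, y, n_iter=25):
    """Fit log(mu) = b0 + b1*x by iteratively re-weighted least squares."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(n_iter):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # Log link: working response z = eta + (y - mu)/mu, weights w = mu
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        # Solve the 2x2 weighted normal equations (X'WX) b = X'Wz
        s_w = sum(w)
        s_wx = sum(wi * xi for wi, xi in zip(w, x))
        s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        s_wz = sum(wi * zi for wi, zi in zip(w, z))
        s_wxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = s_w * s_wxx - s_wx ** 2
        b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        b1 = (s_w * s_wxz - s_wx * s_wz) / det
    return b0, b1

def poisson_deviance(y, mu):
    """D = 2 * sum[y*log(y/mu) - (y - mu)], with y*log(y/mu) = 0 when y = 0."""
    return 2.0 * sum(
        (yi * math.log(yi / m) if yi > 0 else 0.0) - (yi - m)
        for yi, m in zip(y, mu)
    )

# Hypothetical counts observed at eight predictor values (illustration only)
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1, 3, 2, 5, 7, 9, 14, 22]

b0, b1 = fit_poisson_irls(x, y)
mu_hat = [math.exp(b0 + b1 * xi) for xi in x]

# Nested comparison: the intercept-only model fits every mean as the sample mean
mu_null = [sum(y) / len(y)] * len(y)
dev_full = poisson_deviance(y, mu_hat)
dev_null = poisson_deviance(y, mu_null)
deviance_difference = dev_null - dev_full  # likelihood-ratio statistic, 1 df

# Upper tail of a chi-square with 1 df: P(X > d) = erfc(sqrt(d / 2))
p_value = math.erfc(math.sqrt(deviance_difference / 2.0))

# Pearson-based dispersion estimate (assumed estimator); values well above 1
# would point to over-dispersion
n_params = 2
phi_hat = sum((yi - m) ** 2 / m for yi, m in zip(y, mu_hat)) / (len(y) - n_params)

print(b0, b1, deviance_difference, p_value, phi_hat)
```

In practice a statistical package would handle the fitting; the point of the sketch is only to show how the weights, the deviance, the nested-model comparison, and the dispersion factor relate to one another.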
McCullagh and Nelder (1989) suggest modeling mean and dispersion jointly as a way to take possible over-dispersion into account, and they describe the detailed fitting procedure.

Applications

Several forms of the Generalized Linear Model are now commonly used and implemented in many statistical software packages. Logistic Regression, Multiway Frequency Analysis (Log-Linear Models), Logit Models, and Poisson Regression are all forms of the Generalized Linear Model. In Logistic Regression, the binary response variable is modeled as a Binomial random variable with the logit link function. For Multiway Frequency Analysis (Log-Linear Models), the response variable is usually modeled as a Poisson random variable with the log link function. Alternatively, one could assume that the response variable is Binomial or Multinomial; the results would not differ from those obtained assuming the response variable to be Poisson distributed (Agresti 1996). For logit models, binary response variables are modeled as Binomial random variables, while polychotomous response variables are modeled as Multinomial random variables, but in both instances the link function is the logit function. In Poisson regression, the response variable is modeled as a Poisson random variable with the log link function.

Software

GLZs can be fit and evaluated using SPLUS, SAS, SPSS, and a number of other statistical packages. Of the major packages, SPLUS and SAS provide greater flexibility in fitting and evaluating GLZs.

References

Agresti, A. 1996. An Introduction to Categorical Data Analysis. John Wiley & Sons: New York. (A very readable introduction to the many forms of the generalized linear model)

McCullagh, P. and J.A. Nelder. 1989. Generalized Linear Models. Chapman and Hall: London. (Mathematical statistics of the generalized linear model)

Ecological Applications of Generalized Linear Models

Vincent, P.J. and J.M. Haworth. 1983. Poisson regression models of species abundance. Journal of Biogeography 10: 153-160.

Connor, E.F., E. Hosfield, D. Meeter, and X. Nui. 1997. Tests for aggregation and size-based sample-unit selection when sample units vary in size. Ecology 78: 1238-1249.

Links to Other Websites

Site                                  Description
The Generalized Linear Models Page    Introduction, bibliography, software, and other information on GLZs
Statsoft online textbook              Fairly comprehensive introduction to GLZs
GLMLAB                                Using Matlab to fit GLZs
Introduction to GLM                   Brief introduction to GLZs