Outline of GLMs

Definitions

This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman.

The responses Y_i have density (or mass function)

$$ p(y_i;\theta_i,\phi) = \exp\!\left( \frac{y_i\theta_i - b(\theta_i)}{\phi} + c(y_i,\phi) \right), \qquad (1) $$

which becomes a one-parameter exponential family in θ_i if the scale parameter φ is known. This is simpler than McCullagh and Nelder's version: they use a(φ), or even a_i(φ), in place of φ. The a_i(φ) version is useful for data whose precision varies in a known way, for example when Y_i is an average of n_i iid random variables.

There are two simple examples to keep in mind. Linear regression has Y_i ~ N(θ_i, σ²) where θ_i = x_i^T β. Logistic regression has Y_i ~ Bin(m, θ_i) where θ_i = (1 + exp(−x_i^T β))^{−1}. It is a good exercise to write these models out as in (1) and identify φ, b and c. Consider also what happens for the models Y_i ~ N(θ_i, σ_i²) and Y_i ~ Bin(m_i, θ_i).

The mean and variance of Y_i can be determined from its distribution given in equation (1). They must therefore be expressible in terms of φ, θ_i, and the functions b(·) and c(·,·). In fact, with b' denoting a first derivative,

$$ E(Y_i;\theta_i,\phi) = \mu_i = b'(\theta_i) \qquad (2) $$

and

$$ V(Y_i;\theta_i,\phi) = b''(\theta_i)\,\phi. \qquad (3) $$

Equations (2) and (3) can be derived by solving

$$ E\bigl( \partial \log p(y_i;\theta_i,\phi)/\partial\theta_i \bigr) = 0 $$

and

$$ E\bigl( \partial^2 \log p(y_i;\theta_i,\phi)/\partial\theta_i^2 \bigr) = -\,E\bigl( (\partial \log p(y_i;\theta_i,\phi)/\partial\theta_i)^2 \bigr). $$

To generalize linear models, one links the parameter θ_i to (a column of) predictors x_i by

$$ G(\mu_i) = x_i^{\top}\beta \qquad (4) $$

for a link function G. Thus

$$ \theta_i = (b')^{-1}\bigl( G^{-1}(x_i^{\top}\beta) \bigr). \qquad (5) $$

The choice G(·) = (b')^{−1}(·) is convenient because it makes θ_i = x_i^T β, with the result that X^T Y = Σ_i x_i y_i becomes a sufficient statistic for β (given φ). Of course real data are under no obligation to follow a distribution with this canonical link, but if they are close to it then sufficiency allows a great reduction in data set size.
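
As a concrete check of equations (2) and (3), the sketch below simulates from the binomial example above, written in the form (1) with canonical parameter θ = log(p/(1 − p)), b(θ) = m log(1 + e^θ) and φ = 1. The specific numbers and names are illustrative, not part of the original notes.

```python
# Monte Carlo check that E(Y) = b'(theta) and V(Y) = phi * b''(theta)
# for Y ~ Bin(m, p) with theta = logit(p) and phi = 1.  Values are made up.
import numpy as np

rng = np.random.default_rng(0)
m, p = 20, 0.3
theta = np.log(p / (1 - p))

b_prime = m * np.exp(theta) / (1 + np.exp(theta))             # b'(theta) = m*p
b_double_prime = m * np.exp(theta) / (1 + np.exp(theta))**2   # b''(theta) = m*p*(1-p)

y = rng.binomial(m, p, size=200_000)
print(y.mean(), b_prime)           # both close to 6.0
print(y.var(), b_double_prime)     # both close to 4.2
```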

Estimation

The log likelihood function is

$$ \ell(\theta,\phi) = \sum_i \left( \frac{y_i\theta_i - b(\theta_i)}{\phi} + c(y_i,\phi) \right), \qquad (6) $$

where θ subsumes all of the θ_i. It could also be written as a function of β and φ because, given the x_i, β determines all of the θ_i. The main way of estimating β is by maximizing (6). The fact that G(µ_i) = x_i^T β suggests a crude approximate estimate: regress G(y_i) on x_i, perhaps modifying y_i to avoid violating range restrictions (such as taking log(0)), and accounting for the differing variances of the observations.

Fisher scoring is the widely used technique for maximizing the GLM likelihood over β. The basic step is

$$ \beta^{(k+1)} = \beta^{(k)} - \left( E\!\left( \frac{\partial^2 \ell}{\partial\beta\,\partial\beta^{\top}} \right) \right)^{-1} \frac{\partial \ell}{\partial\beta}, \qquad (7) $$

where expectations are taken with β = β^{(k)}. This is like a Newton step, except that the Hessian of ℓ is replaced by its expectation, and the expectation takes a simpler form. This is similar to (and arguably generalizes) what happens in the Gauss-Newton algorithm, where individual observation Hessians multiplied by residuals are dropped. (It clearly would not work to replace ∂ℓ/∂β by its expectation, which is zero!)

Fisher scoring simplifies to

$$ \beta^{(k+1)} = (X^{\top} W X)^{-1} X^{\top} W z, \qquad (8) $$

where W is a diagonal matrix with

$$ W_{ii} = \bigl( G'(\mu_i)^2\, b''(\theta_i) \bigr)^{-1} \qquad (9) $$

and

$$ z_i = (Y_i - \mu_i)\, G'(\mu_i) + x_i^{\top}\beta. \qquad (10) $$

Both equations (9) and (10) use β^{(k)} and the derived values of θ_i^{(k)} and µ_i^{(k)}. The iteration (8) is known as iteratively reweighted least squares, or IRLS. The weights W_ii have the usual interpretation as reciprocal variances: b''(θ_i) is proportional to the variance of Y_i, and the G'(µ_i) factor in z_i is squared in W_ii. The constant φ does not appear; more precisely, it appeared twice in (7) and cancelled out. Fisher scoring may also be written

$$ \beta^{(k+1)} = \beta^{(k)} + (X^{\top} W X)^{-1} X^{\top} W z, \qquad (11) $$

where now z_i = (Y_i − µ_i) G'(µ_i).
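
To make (8)-(10) concrete, here is a minimal IRLS sketch for Poisson regression with the canonical log link, for which µ_i = exp(x_i^T β), G'(µ) = 1/µ, b''(θ) = µ and φ = 1, so W_ii = µ_i. The starting value uses the crude transformed-response regression suggested above. The function name, data, and convergence rule are illustrative, not from the original notes.

```python
# IRLS / Fisher scoring for a Poisson GLM with log link, as in (8)-(10).
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-8):
    # Crude start: regress G(y) = log(y) on x, nudging y away from zero.
    beta = np.linalg.lstsq(X, np.log(y + 0.5), rcond=None)[0]
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)                    # mu_i = b'(theta_i)
        W = mu                              # (G'(mu)^2 b''(theta))^{-1} = mu for log link
        z = (y - mu) / mu + eta             # working response, equation (10)
        XtW = X.T * W                       # X' diag(W)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # equation (8)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Hypothetical usage on simulated data; estimates should be near (0.5, 0.8).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.poisson(np.exp(0.5 + 0.8 * X[:, 1]))
print(irls_poisson(X, y))
```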

Inference

The saturated model S is one with n parameters θ_i, one for each y_i. The value of θ_i in the saturated model is the maximizer of (1), commonly y_i itself. In the N(θ_i, σ²) model this gives a zero sum of squares for the saturated model. [Exercise: what is the likelihood for a saturated normal model?] Note that if x_i = x_j, so that x_i^T β = x_j^T β for every β, the saturated model can still have θ_i ≠ θ_j.

Let ℓ_max be the log likelihood of S. In the widely used models this is finite. (In some cases saturating a model can lead to an infinite likelihood: consider the normal distribution where one picks a mean and a variance for each observation.)

The scaled deviance is

$$ D = 2\bigl( \ell_{\max} - \ell(\theta(\hat\beta)) \bigr) \qquad (12) $$

and the unscaled deviance is

$$ \bar{D} = \phi D = 2 \sum_i \bigl( Y_i(\tilde\theta_i - \hat\theta_i) - b(\tilde\theta_i) + b(\hat\theta_i) \bigr) = \sum_i d_i, \qquad (13) $$

where θ̃_i is the saturated-model value and θ̂_i the fitted value.

The scaled deviance D (and the unscaled one when φ = 1) is asymptotically χ²_{n−p}, where p is the number of parameters in β, under some restrictive conditions. The usual likelihood ratio asymptotics do not apply because the degrees of freedom n − p grow as fast as the number of observations n. In a limit with each observation becoming more normal and the number of observations remaining fixed, a chi-square result can be obtained. (Some examples: Bin(m_i, p_i) observations with increasing m_i and m_i p_i(1 − p_i) bounded away from zero, and Poi(λ_i) observations with λ_i increasing.) There are, however, cases where a chi-square limit holds with n increasing (see Rice's introductory statistics book).

When the χ²_{n−p} approximation is accurate we can use it to test goodness of fit: a model with too large a deviance does not fit the data well.
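
The following sketch illustrates the deviance (13) and its use as a goodness-of-fit statistic in the Poisson case, where d_i = 2[y_i log(y_i/µ_i) − (y_i − µ_i)] evaluated at the fitted means, with φ = 1. The means are made fairly large so that the χ²_{n−p} approximation is not unreasonable; the fit is done by direct maximization with scipy rather than IRLS, and all data and names are illustrative.

```python
# Deviance of a fitted Poisson log-linear model compared to chi^2_{n-p}.
import numpy as np
from scipy.optimize import minimize
from scipy.special import xlogy
from scipy import stats

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(3.0 + 0.3 * X[:, 1]))         # fairly large lambda_i

nll = lambda b: np.sum(np.exp(X @ b) - y * (X @ b))   # -loglik up to a constant
grad = lambda b: X.T @ (np.exp(X @ b) - y)
beta_hat = minimize(nll, np.zeros(2), jac=grad, method="BFGS").x
mu_hat = np.exp(X @ beta_hat)

d = 2 * (xlogy(y, y / mu_hat) - (y - mu_hat))         # summands d_i from (13)
D = d.sum()
print(D, stats.chi2.sf(D, df=n - 2))                  # deviance and approximate p-value
```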

Differences between scaled deviances of nested models have a more nearly chi-squared distribution than the deviances themselves. Because the difference in degrees of freedom between two models is fixed, these differences are also chi-squared in the usual asymptotic setup of increasing n. The deviance is additive, so that for models M_1 ⊂ M_2 ⊂ M_3 the chi-square statistic for testing M_1 within M_3 is the sum of those for testing M_1 within M_2 and M_2 within M_3.

The deviance generalizes the sum of squared errors (and D generalizes the sum of squares normalized by σ²). Another generalization of the sum of squared errors is Pearson's chi-square,

$$ \chi^2 = \phi \sum_i \frac{(Y_i - E(Y_i))^2}{V(Y_i)} = \sum_i \frac{(Y_i - b'(\theta_i))^2}{b''(\theta_i)}, \qquad (14) $$

often denoted by X² to indicate that it is a sample quantity. Pearson's chi-square can also be used to test nested models, but it does not have the additivity that the deviance does. It may also be asymptotically chi-squared, but just as for the deviance, there are problems with large sparse models (lots of observations but small m_i or λ_i). Pearson's chi-square is often used to estimate φ by

$$ \hat\phi = \frac{\chi^2}{n - p}. \qquad (15) $$

Residuals can be defined by partitioning either the deviance or Pearson's χ². The deviance residuals are

$$ r_i^{(D)} = \mathrm{sign}(Y_i - \mu_i)\,\sqrt{d_i}, \qquad (16) $$

where d_i is from (13), and the Pearson residuals are

$$ r_i^{(P)} = \frac{Y_i - \mu_i}{\sqrt{b''(\theta_i)}}. \qquad (17) $$

These can also be leverage adjusted to reflect the effect of x_i on the variance of the different observations; see the references in Green and Silverman.

These residuals can be used in plots to test for lack of fit of a model, or to identify which points might be outliers. They need not have a normal distribution, so qq plots of them cannot necessarily diagnose a lack of fit.
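
A minimal sketch of (16) and (17) in the Poisson case, where d_i = 2[y_i log(y_i/µ_i) − (y_i − µ_i)] and b''(θ_i) = µ_i; the fitted means mu_hat are assumed to come from a fit such as the IRLS sketch above, and the function name is illustrative.

```python
# Deviance and Pearson residuals for a Poisson GLM, given fitted means mu_hat.
import numpy as np
from scipy.special import xlogy   # xlogy(0, 0) = 0 handles y = 0 cleanly

def poisson_residuals(y, mu_hat):
    d = 2 * (xlogy(y, y / mu_hat) - (y - mu_hat))                  # per-observation d_i
    dev_resid = np.sign(y - mu_hat) * np.sqrt(np.maximum(d, 0.0))  # equation (16)
    pearson_resid = (y - mu_hat) / np.sqrt(mu_hat)                 # equation (17)
    return dev_resid, pearson_resid
```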

Approximate confidence intervals may be obtained by profiling the likelihood function and referring the difference in (scaled) deviance values to a chi-square distribution to calibrate the coverage levels. Computationally simpler confidence statements can be derived from the asymptotic covariance matrix of β̂,

$$ \left( -E\!\left( \frac{\partial^2 \ell}{\partial\beta\,\partial\beta^{\top}} \right) \right)^{-1} = \phi\,(X^{\top} W X)^{-1}. \qquad (18) $$

(I think the formula that Green and Silverman give (page 97) instead of (18) is wrong: they take partial derivatives with respect to a vector η with components η_i = x_i^T β.) These simple confidence statements are much like the linearization inferences in nonlinear least squares. They judge H_0: β = β_0 not by the likelihood at β_0 but by a prediction of what that likelihood would be, based on an expansion around β = β̂.

Overdispersion

Though our model may imply that V(Y_i) = φ b''(θ_i), in practice it can turn out that the model is too simple. For example, a model in which Y_i is a count may assume it to be a sum of independent identically distributed Bernoulli random variables, but non-constant success probabilities or dependence among the Bernoulli variables can cause V(Y_i) ≠ m_i p_i(1 − p_i). We write this as

$$ V(Y_i) = b''(\theta_i)\,\phi\,\sigma_i^2 \qquad (19) $$

for some overdispersion parameter σ_i² ≥ 1. One can model the overdispersion in terms of x_i, and possibly other parameters as well, but commonly a very simple model with a constant value σ_i² = σ² is used. Given that variances are usually harder to estimate than means, it is reasonable to use a smaller model for the overdispersion than for θ_i. A constant value of σ² is equivalent to replacing φ by φσ², so an estimated quantity φσ̂² can be simplified to φ̂. Given an estimate of the amount of overdispersion, one can use it to adjust the estimated covariance matrix of β̂: multiply the naive estimate by σ̂². Similarly one can adjust the chi-square statistics: divide the naive ones by σ̂².
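
As a sketch of the constant-σ² adjustment just described, assuming a Poisson-type fit with canonical log link (so φ = 1 and W_ii = b''(θ_i) = µ_i), with the model matrix X and fitted means mu_hat already in hand; the function name is illustrative.

```python
# Estimate a constant overdispersion sigma^2 via Pearson's X^2 / (n - p),
# then inflate the naive covariance phi * (X'WX)^{-1} of beta-hat by it.
import numpy as np

def overdispersion_adjustment(X, y, mu_hat):
    n, p = X.shape
    pearson_X2 = np.sum((y - mu_hat) ** 2 / mu_hat)   # equation (14) with b'' = mu
    sigma2_hat = pearson_X2 / (n - p)                 # as in (15), with phi = 1
    W = mu_hat                                        # log-link IRLS weights
    naive_cov = np.linalg.inv((X.T * W) @ X)          # phi * (X'WX)^{-1}, phi = 1
    return sigma2_hat, sigma2_hat * naive_cov         # adjusted covariance of beta-hat
```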

Quasi and empirical likelihoods

The asymptotic good behavior of generalized linear modelling depends on getting the relationship between µ = E(Y) = b'(θ) and V(Y) = φ b''(θ) right. The method of quasi-likelihood involves simply specifying how V(Y) depends on µ, without specifying a complete likelihood function. Asymptotically the inferences are justified for any likelihood having the postulated mean-variance relationship. McCullagh and Nelder (Chapter 9) discuss this. In quasi-likelihood, the derivative of the log likelihood is replaced by (Y − µ)/(σ² V(µ)) for the postulated function V(µ).

The method of empirical likelihood can also be used with generalized linear models. It can give asymptotically justified confidence statements and tests, and does not even require a correctly specified relationship between E(Y) and V(Y). This was studied by Eric Kolaczyk in Statistica Sinica in 1994.

Examples of GLMs

McCullagh and Nelder give a lot of examples of uses of GLMs. They consider models for cell counts in large tables, models for survival times, and models for variances.
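
As a small, self-contained illustration of the first kind of example, the sketch below fits the independence log-linear (Poisson) model to a hypothetical 2×2 table of cell counts and uses the deviance, with 4 − 3 = 1 degree of freedom, to test independence. The counts, names, and the use of scipy for the maximization are all illustrative assumptions, not taken from the notes.

```python
# Poisson log-linear independence model for a made-up 2x2 table of counts.
import numpy as np
from scipy.optimize import minimize
from scipy.special import xlogy
from scipy import stats

y = np.array([25.0, 15.0, 10.0, 30.0])               # hypothetical cell counts
X = np.array([[1, 0, 0], [1, 0, 1],                   # intercept, row, column effects
              [1, 1, 0], [1, 1, 1]], dtype=float)

nll = lambda b: np.sum(np.exp(X @ b) - y * (X @ b))   # Poisson -loglik up to a constant
grad = lambda b: X.T @ (np.exp(X @ b) - y)
beta_hat = minimize(nll, np.zeros(3), jac=grad, method="BFGS").x
mu_hat = np.exp(X @ beta_hat)

D = 2 * np.sum(xlogy(y, y / mu_hat) - (y - mu_hat))   # deviance, as in (13)
print(beta_hat, D, stats.chi2.sf(D, df=1))            # small p-value: independence doubtful
```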