Generalized Linear Models Kurt Hornik

Motivation Assuming normality, the linear model $y = X\beta + \varepsilon$ has $y_i = x_i'\beta + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$, so that $y_i \sim N(\mu_i, \sigma^2)$ and $E(y_i) = \mu_i = x_i'\beta$. Various generalizations exist, including the general linear model $Y = XB + E$ (with normal $E$ and flexible error covariance structures). But what if normality is not appropriate (e.g., skewed, bounded, or discrete responses)? Transformations, or generalized linear models. Slide 2

Exponential Dispersion Models Densities with respect to a reference measure $m$ of the form
$$f(y \mid \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{\phi} + c(y, \phi)\right\}$$
(alternatively, write $a(\phi)$ instead of $\phi$ in the denominator). For fixed $\phi$, this is an exponential family in $\theta$. Slide 3

Exponential Dispersion Models Differentiate $\int f(y \mid \theta, \phi)\,dm(y) = 1$ with respect to $\theta$ (and assume interchanging integration and differentiation is justified):
$$0 = \int \frac{\partial f(y \mid \theta, \phi)}{\partial \theta}\,dm(y) = \int \frac{y - b'(\theta)}{\phi}\, f(y \mid \theta, \phi)\,dm(y) = \frac{E_{\theta,\phi}(y) - b'(\theta)}{\phi}$$
so that $E_{\theta,\phi}(y) = b'(\theta)$ (which does not depend on $\phi$!). Slide 4

Exponential Dispersion Models Differentiate once more:
$$0 = \int \left( -\frac{b''(\theta)}{\phi} + \left(\frac{y - b'(\theta)}{\phi}\right)^{2} \right) f(y \mid \theta, \phi)\,dm(y) = -\frac{b''(\theta)}{\phi} + \frac{\operatorname{var}_{\theta,\phi}(y)}{\phi^{2}}$$
so that $\operatorname{var}_{\theta,\phi}(y) = \phi\, b''(\theta)$ (which shows that $\phi$ is a dispersion parameter). Slide 5

Exponential Dispersion Models We can thus write $E(y) = \mu = b'(\theta)$. If $\mu = b'(\theta)$ defines a one-to-one relation between $\mu$ and $\theta$ (which it does: this can be shown using convex analysis), we can write $b''(\theta) = V(\mu)$, formally $V(\mu) = b''((b')^{-1}(\mu))$, where $V$ is the variance function of the family. Thus: $\operatorname{var}(y) = \phi\, b''(\theta) = \phi\, V(\mu)$. Slide 6
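As an illustrative check of these two identities (my own sketch, not part of the slides), the Poisson family has $b(\theta) = e^{\theta}$ and $\phi = 1$, so $b'(\theta) = b''(\theta) = \mu$:

```python
import numpy as np

# Monte Carlo check (illustrative sketch) of E(y) = b'(theta) and
# var(y) = phi * b''(theta) for the Poisson family, where b(theta) = exp(theta)
# and phi = 1, so that b'(theta) = b''(theta) = mu.
rng = np.random.default_rng(0)
theta = 1.3
mu = np.exp(theta)                           # b'(theta)
y = rng.poisson(mu, size=200_000)
print(np.isclose(y.mean(), mu, rtol=0.01))   # E(y) is approximately b'(theta)
print(np.isclose(y.var(), mu, rtol=0.02))    # var(y) is approximately phi * b''(theta)
```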

Example: Bernoulli Family Take $y$ binary with $P(y = 1) = p$. With $m$ the counting measure on $\{0, 1\}$,
$$f(y) = p^{y}(1-p)^{1-y} = (1-p)\left(\frac{p}{1-p}\right)^{y} = \exp\left\{ y \log\frac{p}{1-p} + \log(1-p) \right\}.$$
I.e., an exponential dispersion model with $\phi = 1$ (hence in fact an exponential family) and
$$\theta = \log\frac{p}{1-p} = \operatorname{logit}(p)$$
(the quantile function of the standard logistic distribution). Slide 7

Example: Bernoulli Family Inverting $\theta = \operatorname{logit}(p)$ gives
$$p = \frac{e^{\theta}}{1 + e^{\theta}}$$
(the distribution function of the standard logistic distribution) and hence
$$1 - p = \frac{1}{1 + e^{\theta}},$$
so that $b(\theta) = -\log(1-p) = \log(1 + e^{\theta})$. Altogether (note that there is a problem for $p \in \{0, 1\}$):
$$f(y \mid \theta) = \exp\left(y\theta - \log(1 + e^{\theta})\right), \qquad \theta = \operatorname{logit}(p).$$ Slide 8

Example: Bernoulli Family Differentiation gives
$$b'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}} = p = \mu$$
and
$$b''(\theta) = \frac{e^{\theta}(1 + e^{\theta}) - e^{\theta} e^{\theta}}{(1 + e^{\theta})^{2}} = \frac{e^{\theta}}{1 + e^{\theta}} \cdot \frac{1}{1 + e^{\theta}} = p(1-p) = \mu(1-\mu).$$
Necessary? We already know that $E(y) = p$ and $\operatorname{var}(y) = p(1-p)$. Hence,
$$b'(\theta) = p = \frac{e^{\theta}}{1 + e^{\theta}} \implies b(\theta) = \log(1 + e^{\theta}).$$ Slide 9
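A quick numerical sanity check (my own sketch, not from the slides) that $b(\theta) = \log(1 + e^{\theta})$ indeed has $b'(\theta) = p$ and $b''(\theta) = p(1-p)$, using central finite differences:

```python
import numpy as np

# Finite-difference check (illustrative) that for b(theta) = log(1 + e^theta),
# b'(theta) = p and b''(theta) = p(1 - p), with p = e^theta / (1 + e^theta).
def b(theta):
    return np.log1p(np.exp(theta))

theta, eps = 0.7, 1e-5
p = np.exp(theta) / (1.0 + np.exp(theta))
b1 = (b(theta + eps) - b(theta - eps)) / (2 * eps)              # approximates b'(theta)
b2 = (b(theta + eps) - 2 * b(theta) + b(theta - eps)) / eps**2  # approximates b''(theta)
print(np.isclose(b1, p), np.isclose(b2, p * (1 - p), atol=1e-4))
```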

Generalized Linear Models For $i = 1, \ldots, n$ we have responses $y_i$ from an exponential dispersion family with the same $b$, and covariates $x_i$, such that for $E(y_i) = \mu_i = b'(\theta_i)$ we have
$$g(\mu_i) = x_i'\beta = \eta_i,$$
where $g$ is the link function and $\eta_i$ is the linear predictor. Alternatively, $\mu_i = h(x_i'\beta) = h(\eta_i)$, where $h$ is the response function (and $g$ and $h$ are inverses of each other if invertible). Why useful? A general conceptual framework for estimation and inference. Slide 10
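As a small illustration (my own sketch; the dictionary of link names is an assumption, not from the slides), some common link/response pairs $g$ and $h = g^{-1}$:

```python
import numpy as np

# Illustrative link g / response h = g^{-1} pairs for common families.
links = {
    "identity": (lambda mu: mu, lambda eta: eta),              # Gaussian
    "log": (np.log, np.exp),                                   # Poisson
    "logit": (lambda mu: np.log(mu / (1 - mu)),                # Bernoulli / binomial
              lambda eta: 1.0 / (1.0 + np.exp(-eta))),
}
mu = 0.3
for name, (g, h) in links.items():
    print(name, np.isclose(h(g(mu)), mu))   # g and h are mutual inverses
```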

Maximum Likelihood Estimation The log-likelihood is
$$\ell = \ell(\beta) = \sum_{i=1}^{n} \left( \frac{y_i \theta_i - b(\theta_i)}{\phi_i} + c(y_i, \phi_i) \right),$$
where $g(\mu_i) = g(b'(\theta_i)) = x_i'\beta$. Differentiating the latter with respect to $\beta_j$:
$$x_{ij} = \frac{\partial}{\partial \beta_j} x_i'\beta = \frac{\partial}{\partial \beta_j} g(b'(\theta_i)) = g'(b'(\theta_i))\, b''(\theta_i)\, \frac{\partial \theta_i}{\partial \beta_j} = g'(\mu_i)\, V(\mu_i)\, \frac{\partial \theta_i}{\partial \beta_j}.$$ Slide 11

Maximum Likelihood Estimation Hence,
$$\frac{\partial \theta_i}{\partial \beta_j} = \frac{x_{ij}}{g'(\mu_i) V(\mu_i)}$$
and
$$\frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^{n} \frac{y_i - b'(\theta_i)}{\phi_i} \cdot \frac{x_{ij}}{g'(\mu_i) V(\mu_i)} = \sum_{i=1}^{n} \frac{y_i - \mu_i}{\phi_i V(\mu_i)} \cdot \frac{x_{ij}}{g'(\mu_i)}.$$
MLE is typically performed by solving the score equations $\partial \ell / \partial \beta_j = 0$. For Newton-type algorithms we need the Hessian $H(\beta) = [\partial^2 \ell / \partial \beta_j \partial \beta_k]$. As $\mu_i = b'(\theta_i)$,
$$\frac{\partial \mu_i}{\partial \beta_j} = b''(\theta_i)\, \frac{\partial \theta_i}{\partial \beta_j} = V(\mu_i)\, \frac{x_{ij}}{g'(\mu_i) V(\mu_i)} = \frac{x_{ij}}{g'(\mu_i)}.$$ Slide 12

Maximum Likelihood Estimation Hence:
$$\frac{\partial^2 \ell}{\partial \beta_j \partial \beta_k} = \sum_{i=1}^{n} \frac{x_{ij}}{\phi_i} \frac{\partial}{\partial \beta_k} \frac{y_i - \mu_i}{V(\mu_i) g'(\mu_i)} = -\sum_{i=1}^{n} \frac{x_{ij}}{\phi_i V(\mu_i) g'(\mu_i)} \frac{\partial \mu_i}{\partial \beta_k} - \sum_{i=1}^{n} \frac{(y_i - \mu_i)\, x_{ij}}{\phi_i \bigl(V(\mu_i) g'(\mu_i)\bigr)^{2}} \frac{\partial}{\partial \beta_k} \bigl(V(\mu_i) g'(\mu_i)\bigr)$$
$$= -\sum_{i=1}^{n} \frac{x_{ij} x_{ik}}{\phi_i V(\mu_i) g'(\mu_i)^{2}} - \sum_{i=1}^{n} \frac{(y_i - \mu_i)\, x_{ij} x_{ik}}{\phi_i V(\mu_i)^{2} g'(\mu_i)^{3}} \bigl(V'(\mu_i) g'(\mu_i) + V(\mu_i) g''(\mu_i)\bigr).$$ Slide 13

Maximum Likelihood Estimation The second term looks complicated, but has expectation zero. Hence, drop it and only use the first term for the Newton-type iteration: the Fisher scoring algorithm. Equivalently, replace the observed information matrix (the negative Hessian of the log-likelihood) by its expectation (the Fisher information matrix). Next problem: what about $\phi$? Assume that $\phi_i = \phi / w_i$ with known case weights $w_i$. Slide 14

Maximum Likelihood Estimation Then the Fisher information matrix is
$$\frac{1}{\phi} \sum_{i=1}^{n} \frac{w_i}{V(\mu_i) g'(\mu_i)^{2}}\, x_{ij} x_{ik} = \frac{X' W(\beta) X}{\phi},$$
where $X$ is the usual regressor matrix (with $x_i'$ as row $i$) and
$$W(\beta) = \operatorname{diag}\left( \frac{w_i}{V(\mu_i) g'(\mu_i)^{2}} \right), \qquad g(\mu_i) = x_i'\beta.$$
Similarly, the score function is
$$\frac{1}{\phi} \sum_{i=1}^{n} \frac{w_i}{V(\mu_i) g'(\mu_i)^{2}}\, g'(\mu_i)(y_i - \mu_i)\, x_{ij} = \frac{X' W(\beta) r(\beta)}{\phi},$$
where $r(\beta)$ has elements $g'(\mu_i)(y_i - \mu_i)$: the so-called working residuals. Slide 15

Maximum Likelihood Estimation Remember: Newton updates for maximizing $\ell(\beta)$ are
$$\beta_{\text{new}} \leftarrow \beta - \bigl(H_\ell(\beta)\bigr)^{-1} \nabla \ell(\beta).$$
Thus, the Fisher scoring update (with the above approximation for $H$) uses
$$\beta_{\text{new}} \leftarrow \beta + (X' W(\beta) X)^{-1} X' W(\beta) r(\beta) = (X' W(\beta) X)^{-1} X' W(\beta) (X\beta + r(\beta)) = (X' W(\beta) X)^{-1} X' W(\beta) z(\beta),$$
where the working response $z(\beta)$ has elements $x_i'\beta + g'(\mu_i)(y_i - \mu_i)$, $g(\mu_i) = x_i'\beta$. I.e., the update is computed by a weighted least squares regression of $z(\beta)$ on $X$ (weights: square roots of $W(\beta)$): the Fisher scoring algorithm for obtaining the MLEs is an iterative weighted least squares (IWLS) algorithm. Note: the common dispersion parameter $\phi$ is not used! Slide 16
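A minimal sketch of this IWLS update for the Bernoulli family with the canonical logit link (my own illustrative code; names such as `irwls_logit`, `max_iter`, and `tol` are assumptions, and the case weights are taken to be 1):

```python
import numpy as np

# IWLS / Fisher scoring sketch for a Bernoulli GLM with the canonical logit link:
# beta_new = (X' W X)^{-1} X' W z, with W = diag(mu (1 - mu)) and working
# response z = eta + (y - mu) / (mu (1 - mu)).
def irwls_logit(X, y, max_iter=25, tol=1e-8):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        eta = X @ beta                        # linear predictor eta_i = x_i' beta
        mu = 1.0 / (1.0 + np.exp(-eta))       # mu_i = h(eta_i), logistic response
        # For the logit link, g'(mu) = 1 / (mu (1 - mu)) and V(mu) = mu (1 - mu),
        # so the IWLS weights 1 / (V(mu) g'(mu)^2) reduce to mu (1 - mu).
        w = mu * (1.0 - mu)
        z = eta + (y - mu) / w                # working response
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.column_stack([np.ones(500), rng.normal(size=500)])
    true_beta = np.array([-0.5, 1.2])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta))))
    print(irwls_logit(X, y))                  # should be close to true_beta
```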

Canonical Links The canonical link is given by $g = (b')^{-1}$, so that
$$\eta_i = g(\mu_i) = g(b'(\theta_i)) = \theta_i, \qquad g'(\mu) = \frac{d}{d\mu} (b')^{-1}(\mu) = \frac{1}{b''((b')^{-1}(\mu))} = \frac{1}{V(\mu)},$$
so that $g'(\mu) V(\mu) \equiv 1$, and hence
$$\frac{\partial \ell}{\partial \beta_j} = \frac{1}{\phi} \sum_{i=1}^{n} w_i (y_i - \mu_i) x_{ij}, \qquad \frac{\partial^2 \ell}{\partial \beta_j \partial \beta_k} = -\frac{1}{\phi} \sum_{i=1}^{n} w_i V(\mu_i) x_{ij} x_{ik}.$$
Thus: observed and expected information coincide, and the IWLS Fisher scoring algorithm is the same as Newton's algorithm. Slide 17
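A numerical check (illustrative sketch, not from the slides) that with the canonical logit link, $\phi = 1$ and unit case weights, the score reduces to $X'(y - \mu)$, compared against a finite-difference gradient of the Bernoulli log-likelihood:

```python
import numpy as np

# Check (sketch) that with the canonical logit link the score simplifies to
# X'(y - mu), by comparing against a finite-difference gradient of the
# Bernoulli log-likelihood.
def loglik(beta, X, y):
    eta = X @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta)))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = rng.binomial(1, 0.5, size=200)
beta = np.array([0.2, -0.3])
mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
score = X.T @ (y - mu)                        # canonical-link score
eps = 1e-6
numeric = np.array([(loglik(beta + eps * e, X, y) - loglik(beta - eps * e, X, y)) / (2 * eps)
                    for e in np.eye(2)])
print(np.allclose(score, numeric, atol=1e-4))
```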

Inference Under suitable conditions, the MLE $\hat\beta$ is asymptotically $N(\beta, I(\beta)^{-1})$ with expected Fisher information matrix
$$I(\beta) = \frac{1}{\phi}\, X' W(\beta) X.$$
Thus, standard errors can be computed as the square roots of the diagonal elements of
$$\widehat{\operatorname{cov}}(\hat\beta) = \phi\, (X' W(\hat\beta) X)^{-1},$$
where $X' W(\hat\beta) X$ is a by-product of the final IWLS iteration. Slide 18
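A short sketch of turning the final $X' W(\hat\beta) X$ into standard errors (illustrative; it assumes `X`, the diagonal IWLS weight vector `w`, and an estimate `phi_hat` are available from a fitted model):

```python
import numpy as np

# Standard errors from the final IWLS weights (sketch): cov(beta_hat) is
# estimated by phi_hat * (X' W X)^{-1}, with W = diag(w).
def glm_standard_errors(X, w, phi_hat):
    XtWX = X.T @ (X * w[:, None])               # X' W X, by-product of IWLS
    cov_beta = phi_hat * np.linalg.inv(XtWX)    # estimated covariance of beta_hat
    return np.sqrt(np.diag(cov_beta))
```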

Inference This needs an estimate of $\phi$ (unless it is known). Estimation by MLE is practically difficult; hence, $\phi$ is usually estimated by the method of moments. Remember that $\operatorname{var}(y_i) = \phi_i V(\mu_i) = \phi V(\mu_i)/w_i$. Hence, if $\beta$ were known, an unbiased estimator of $\phi$ would be
$$\frac{1}{n} \sum_{i=1}^{n} \frac{w_i (y_i - \mu_i)^{2}}{V(\mu_i)}.$$
Taking into account that $\beta$ is estimated, the estimate is
$$\hat\phi = \frac{1}{n - p} \sum_{i=1}^{n} \frac{w_i (y_i - \hat\mu_i)^{2}}{V(\hat\mu_i)}$$
(where $p$ is the number of $\beta$ parameters). Slide 19
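The method-of-moments dispersion estimate as a Python sketch (the name `dispersion_mom` and the variance-function argument `V` are illustrative, not from the slides):

```python
import numpy as np

# Method-of-moments estimate of phi (sketch):
# phi_hat = (1/(n - p)) * sum_i w_i (y_i - mu_hat_i)^2 / V(mu_hat_i).
def dispersion_mom(y, mu_hat, V, n_params, weights=None):
    y = np.asarray(y, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    w = np.ones_like(y) if weights is None else np.asarray(weights, dtype=float)
    pearson = w * (y - mu_hat) ** 2 / V(mu_hat)
    return pearson.sum() / (len(y) - n_params)

# Example call with a Gamma-type variance function V(mu) = mu^2 (illustrative):
# dispersion_mom(y, mu_hat, V=lambda mu: mu**2, n_params=3)
```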

Deviance A quality-of-fit statistic for model fitting achieved by ML, generalizing the idea of using the sum of squares of residuals in ordinary least squares:
$$D = 2\phi\, (\ell_{\text{sat}} - \ell_{\text{mod}})$$
(assuming a common $\phi$, perhaps after taking out weights), where the saturated model uses separate parameters for each observation so that the data are fitted exactly. For GLMs, $y_i = \tilde\mu_i = b'(\tilde\theta_i)$ achieves zero scores. The contribution of observation $i$ to $\ell_{\text{sat}} - \ell_{\text{mod}}$ is
$$\frac{y_i \tilde\theta_i - b(\tilde\theta_i)}{\phi_i} - \frac{y_i \hat\theta_i - b(\hat\theta_i)}{\phi_i} = \left[ \frac{y_i \theta - b(\theta)}{\phi_i} \right]_{\hat\theta_i}^{\tilde\theta_i},$$
where $\hat\theta_i$ is obtained from the fitted model (i.e., $g(b'(\hat\theta_i)) = x_i'\hat\beta$). Slide 20

Deviance We can write
$$\bigl[ y_i \theta - b(\theta) \bigr]_{\theta = \hat\theta_i}^{\tilde\theta_i} = \int_{\hat\theta_i}^{\tilde\theta_i} \frac{d}{d\theta} \bigl( y_i \theta - b(\theta) \bigr)\, d\theta = \int_{\hat\theta_i}^{\tilde\theta_i} \bigl( y_i - b'(\theta) \bigr)\, d\theta.$$
Substituting $\mu = b'(\theta)$: $d\mu = b''(\theta)\, d\theta$, i.e., $d\theta = V(\mu)^{-1}\, d\mu$, so that
$$\int_{\hat\theta_i}^{\tilde\theta_i} \bigl( y_i - b'(\theta) \bigr)\, d\theta = \int_{\hat\mu_i}^{y_i} \frac{y_i - \mu}{V(\mu)}\, d\mu,$$
and the deviance contribution of observation $i$ is
$$2\phi \left[ \frac{y_i \theta - b(\theta)}{\phi} \right]_{\hat\theta_i}^{\tilde\theta_i} = 2 \int_{\hat\mu_i}^{y_i} \frac{y_i - \mu}{V(\mu)}\, d\mu.$$
This can be taken to define the deviance and to introduce quasi-likelihood models. Slide 21
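The integral form of the unit deviance can be checked numerically; the sketch below (my own, using the Poisson variance function $V(\mu) = \mu$) compares $2\int_{\hat\mu}^{y}(y-\mu)/V(\mu)\,d\mu$ with the familiar closed form $2\{y\log(y/\hat\mu) - (y - \hat\mu)\}$:

```python
import numpy as np
from scipy.integrate import quad

# Unit deviance as an integral (sketch): 2 * int_{mu_hat}^{y} (y - mu)/V(mu) dmu,
# evaluated numerically for the Poisson variance function V(mu) = mu and compared
# with the closed form 2 * (y * log(y / mu_hat) - (y - mu_hat)).
y, mu_hat = 4.0, 2.5
numeric, _ = quad(lambda mu: (y - mu) / mu, mu_hat, y)
closed = 2.0 * (y * np.log(y / mu_hat) - (y - mu_hat))
print(np.isclose(2.0 * numeric, closed))   # True
```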

Residuals Several kinds of residuals can be defined for GLMs:
- response: $y_i - \hat\mu_i$
- working: from the working response in IWLS, i.e., $g'(\hat\mu_i)(y_i - \hat\mu_i)$
- Pearson: $r^{P}_i = (y_i - \hat\mu_i)/\sqrt{V(\hat\mu_i)}$, so that $\sum_i (r^{P}_i)^{2}$ equals the generalized Pearson statistic
- deviance: $r^{D}_i$ such that $\sum_i (r^{D}_i)^{2}$ equals the deviance (see above).
(All definitions are equivalent for the Gaussian family.) Slide 22
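A sketch of these residual types for a Bernoulli/logit fit (illustrative; `mu_hat` would come from a fitted model, and the helper name is mine):

```python
import numpy as np

# Residual types for a Bernoulli/logit fit (sketch); mu_hat are the fitted means.
def glm_residuals_logit(y, mu_hat):
    V = mu_hat * (1.0 - mu_hat)                    # variance function V(mu)
    g_prime = 1.0 / V                              # g'(mu) for the logit link
    response = y - mu_hat
    working = g_prime * (y - mu_hat)               # from the IWLS working response
    pearson = (y - mu_hat) / np.sqrt(V)
    # Bernoulli unit deviance, signed by y - mu_hat
    unit_dev = -2.0 * (y * np.log(mu_hat) + (1 - y) * np.log(1 - mu_hat))
    deviance = np.sign(y - mu_hat) * np.sqrt(unit_dev)
    return response, working, pearson, deviance
```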

Generalized Linear Mixed Models Augment the linear predictor by (unknown) random effects $b_i$:
$$\eta_i = x_i'\beta + z_i' b_i,$$
where the $b_i$ come from a suitable family of distributions and the $z_i$ (as well as the $x_i$, of course) are known covariates. Typically, $b_i \sim N(0, G(\vartheta))$. Conditionally on $b_i$, $y_i$ is taken to follow an exponential dispersion model with
$$g(E(y_i \mid b_i)) = \eta_i = x_i'\beta + z_i' b_i.$$
The marginal likelihood of the observed $y$ is obtained by integrating the joint likelihood of the $y_i$ and $b_i$ with respect to the marginal distribution of the $b_i$. If the $b_i$ are independent across observation units,
$$L(\beta, \phi, \vartheta) = \prod_{i=1}^{n} \int f(y_i \mid \beta, \phi, \vartheta, b_i)\, f(b_i \mid \vartheta)\, db_i.$$ Slide 23
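As an illustration of the integral above, a sketch (my own; the cluster data `X_i`, `y_i`, the random-intercept standard deviation `sigma_b`, and the quadrature order are assumptions) approximating one cluster's contribution to the marginal log-likelihood of a random-intercept logistic GLMM by Gauss-Hermite quadrature:

```python
import numpy as np

# One cluster's marginal log-likelihood for a random-intercept logistic GLMM
# (sketch): integrate the Bernoulli likelihood over b ~ N(0, sigma_b^2) using
# Gauss-Hermite quadrature, via
# int f(b) N(b; 0, s^2) db  approx  sum_k (w_k / sqrt(pi)) f(sqrt(2) * s * x_k).
def cluster_loglik(beta, sigma_b, X_i, y_i, n_quad=20):
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    eta_fixed = X_i @ beta                          # fixed-effects part of eta
    total = 0.0
    for x_k, w_k in zip(nodes, weights):
        b = np.sqrt(2.0) * sigma_b * x_k            # change of variables
        p = 1.0 / (1.0 + np.exp(-(eta_fixed + b)))  # conditional means given b
        total += (w_k / np.sqrt(np.pi)) * np.prod(p**y_i * (1 - p)**(1 - y_i))
    return np.log(total)
```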