ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples

Size: px

Start display at page:

Download "ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples"

Marshall Ryan
6 years ago
Views:

1 ST3241 Categorical Data Analysis I Generalized Linear Models Introduction and Some Examples 1

2 Introduction We have discussed methods for analyzing associations in two-way and three-way tables. Now we will use models as the basis of such analysis. Models can handle more complicated situations than discussed so far. We can also estimate the parameters, which describe the effects in a more informative way. 2

3 Example: Challenger O-ring For the 23 space shuttle flights that occurred before the Challenger mission disaster in 1986, the following table shows the temperature at the time of flight and whether at least one primary O-ring suffered thermal distress. 3

4 The Data Ft Temp TD Ft Temp TD Ft Temp TD Is there any association between Temperature and thermal distress? 4

5 Fit From Linear Regression 5

6 Fit From Logistic Regression 6

7 Example: Horseshoe Crabs Each female horseshoe crab in the crab in the study had a male crab attached to her in her nest. The study investigated factors that affect whether the female crab had any other males, called satellites, residing nearby her. Explanatory variables included the female crab s color, spine condition, weight, and carapace width. The response outcome for each female crab is her number of satellites. 7

8 Example Continued We consider the width alone as a predictor. To obtain a clearer picture, we grouped the female crabs into a set of width categories 23.25, , , , , , , >29.25 Calculated sample mean number of satellites for female crabs in each category. 8

9 9

10 10

11 Random component Components of A GLM Identifies the response variable Y and assumes a probability distribution for it Systematic component Specifies the explanatory variables used as predictors in the model Link Describes the functional relation between the systematic component and expected value of the random component 11

12 Random Component Let Y 1,, Y N denote the N observations on the response variables Y. The random component specifies a probability distribution for Y 1,, Y N. If the potential outcome for each observation Y i are binary such as success or failure ; or, more generally, each Y i might be number of successes out of a certain fixed number of trials, we can assume a binomial distribution for the random component. If each response observation is a non-negative count, such as cell count in a contingency table, then we may assume a Poisson distribution for the random component. 12

13 Systematic Component The systematic component specifies the explanatory variables. It specifies the variables that play the roles of x j in the formula α + β 1 x β k x k. This linear combination of explanatory variables is called the linear predictors. Some x j may be based on others in the model; for instance, perhaps x 3 = x 1 x 2, to allow interaction between x 1 and x 2 in their effects on Y, or perhaps x 3 = x 2 1 to allow a curvilinear effect of x 1. 13

14 Link It specifies how µ = E(Y ) relates to explanatory variables in the linear predictor. The model formula states that g(µ) = α + β 1 x β k x k The function g(.) is called the link function. 14

15 Some Popular Link Functions Identity Link g(µ) = µ = α + β 1 x β k x k Log link g(µ) = log(µ) = α + β 1 x β k x k Logit µ g(µ) = log[ 1 µ ] = α + β 1x β k x k 15

16 More On Link Functions Each potential probability distribution has one special function of the mean that is called its natural parameter. For the normal distribution, it is mean itself. For the Poisson, the natural parameter is the log of the mean. For the Binomial, the natural parameter is the logit of the success probability. The link function that uses the natural parameter as g(µ) in the GLM is called the canonical link. Though other links are possible, in practice the canonical links are most common 16

17 GLM For Binary Data: Random Component The distribution of a binary response is specified by probabilities P (Y = 1) = π of success and P (Y = 0) = 1 π of failure. For n independent observations on a binary response with parameter π, the number of successes has the binomial distribution specified by parameters n and π. 17

18 Linear Probability Model To model the effect of X, use ordinary linear regression, by which the expected value of Y is a linear function of X. The model π(x) = α + βx is called a linear probability model. Probabilities fall between 0 and 1 but for large of small values of x, the model may predict π(x) < 0 or π(x) > 1. This model is valid only for a finite range of x values 18

19 Example: Snoring Heart Disease Snoring Yes No Proportion Yes Linear Fit Never Occasional Nearly Every Night Every Night

20 Example: We use (0,2,4,5) for their snoring categories, treating the last two levels closer. Linear Fit using maximizing likelihood: π(x) = x The least squares fit is slightly different. 20

21 Logistic Regression Model Relationship between π(x) and x are usually nonlinear rather than linear. The most important function describing this nonlinear relationship has the form That is, π(x) log( 1 π(x) ) = α + βx π(x) = F 0 (α + βx), where F 0 (x) = ex 1 + e x = e x, where F 0 (x) is the cdf of the logistic distribution. Its pdf is F 0 (x)(1 F 0 (x)). The associated GLM is called the logistic regression f unction. Logistic regression models are often referred as logit models as the link in this GLM is the logit link. 21

22 Parameters The parameter β determines the rate of increase or decrease of the curve. When β > 0, π(x) increases with x. When β < 0, π(x) decreases as x increases. The magnitude of β determines how fast the curve increases or decreases. As β increases, the curve has a steeper rate of change. 22

23 Example: Snoring Heart Disease Snoring Yes No Proportion Linear Logit Yes Fit Fit Never Occasional Nearly Every Night Every Night

24 Effect of Parameters 24

25 Alternative Binary Links For logistic regression curves, the probability of a success increases or decreases continuously as x increases. Let X denote a random variable, the cumulative distribution function (cdf) F (x) is defined as F (x) = P (X x), < x < Such a function, plotted as a function of x, has appearance like that of the logistic function in the previous figures. It suggests a class of models for binary responses of the form π(x) = F (α + βx) where F is a cdf for some distribution. 25

26 Alternative Binary Links The logistic regression curve has this form. When β > 0, π(x) = F (α + βx) has the shape of the cdf of the two-parameter logistic distribution. When β < 0, 1 π(x) = 1 F (α + βx) has the shape of the cdf of the two-parameter logistic distribution. Each choice of α and β > 0 corresponds to a different logistic distribution. The logistic cdf F 0 (x) corresponds to a probability distribution F 0 (x)(1 F 0 (x)) with a symmetric, bell shape and very similar looking to a normal distribution. 26

27 Probit Models The probability of success, π(x), has the form Φ(α + βx) where Φ is the cdf of a standard normal distribution N(0, 1). The link function is known as probit link: g(π) = Φ 1 (π). The probit transform maps π(x) so that the regression curve for π(x) (or 1 π(x), when β < 0) has the appearance of the normal cdf with mean µ = α/β and standard deviation σ = 1/ β. 27

28 Example: Snoring Heart Disease Snoring Yes No Proportion Linear Logit Probit Yes Fit Fit Fit Never Occasional Nearly Every Night Every Night

29 29

30 GLM for Count Data Many discrete response variables have counts as possible outcomes. For a sample of cities worldwide, each observation might be the number of automobile thefts in For a sample of silicon wafers used in computer chips, each observation might be the number of imperfections on a wafer. We have earlier seen the Poisson distribution as a sampling model for counts. 30

31 Poisson Regression Assume a Poisson distribution for the random component. One can model the Poisson mean using the identity link. But more common to model the log of the mean. A Poisson loglinear model is a GLM that assumes a Poisson distribution for Y and uses the log link. 31

32 Poisson Regression - Continued Let µ denote the expected value of Y and let X denote an explanatory variable. Then the Poisson log-linear model has the form log(µ) = α + βx For this model: µ = exp(α + βx) = e α (e β ) x 32

33 Example: Horseshoe Data 33

34 Poisson Regression For Rate Data It is often relevant to certain types of events which occur over time, space, or some other index of size, to model the rate at which events occur. In modeling numbers of auto thefts in 2003 for a sample of cities, we could form a rate for each city by dividing the number of thefts by the city s population size. The model describes how the rate depends on some other explanatory variables. 34

35 Poisson Regression For Rate Data When a response count Y i has index (such as population size) equal to t i, the sample rate of outcomes is Y i /t i. The expected value of the rate is µ i /t i. A log-linear model for the expected rate has form: log(µ i /t i ) = α + βx i This has an equivalent representation: log µ i log t i = α + βx i The adjustment term, log t i, to the above equation is called an offset. 35

36 Exponential Family The random variable Y has a distribution in the exponential family, if its p.d.f (or p.m.f.) can be written as f(y; θ, ϕ) = exp{(yθ b(θ))/a(ϕ) + c(y, ϕ)} for some specific function a(ϕ), b(θ) and c(y, ϕ). The parameter θ is called the natural parameter and ϕ is called the dispersion (or scale) parameter. 36

37 The p.d.f. of N(µ, σ 2 ) Examples: Normal Distribution f(y; θ, ϕ) = 1 2πσ 2 exp{ (y µ)2 /(2σ 2 )} = exp{(yµ µ 2 /2)/σ 2 (y 2 /σ 2 + log(2πσ 2 ))/2} Here, θ = µ, ϕ = σ 2, and a(ϕ) = ϕ, b(θ) = θ 2 /2 and c(y, ϕ) = {y 2 /σ 2 + log(2πσ 2 )}/2 The canonical link: g(µ) = µ. 37

38 Examples: Binomial Distribution The p.d.f. of Bernoulli(π): f(y; θ, ϕ) = π y (1 π) 1 y π = (1 π)( 1 π )y π = exp{y log( 1 π ) log( 1 1 π )} Here θ = log( π 1 π ),ϕ = 1,b(θ) = log(1 + eθ ) a(ϕ) = 1, c(y, ϕ) = 0 The canonical link: g(π) = log( π 1 π ). 38

39 The p.d.f. of Poisson(λ): Examples: Poisson Distribution λ λy f(y; θ, ϕ) = e y! = exp{(y log λ λ) log y!} Here θ = log λ,ϕ = 1, b(θ) = e θ, a(ϕ) = 1, c(y, ϕ) = log y! The canonical link: g(λ) = log(λ). 39

40 The log-likelihood function Log-Likelihood Functions l(θ, ϕ; y) = log f(y; θ, ϕ) = (yθ b(θ))/a(ϕ) + c(y, ϕ) We use general likelihood results, applicable to exponential families E( l l θ ) = 0 and E( θ )2 = E( 2 l θ ) 2 Here l θ = (y b (θ))/a(ϕ) and 2 l θ 2 = b (θ)/a(ϕ) 40

41 Mean and Variances Now, we have 0 = E( l θ ) = {E(y) b (θ)}/a(ϕ) So that, E(Y ) = b (θ). Similarly, var(y ) a 2 (ϕ) So that, var(y ) = b (θ)a(ϕ). = b (θ) a(ϕ) 41

42 Examples Normal Distribution: b(θ) = θ 2 /2 E(Y ) = b (θ) = θ = µ var(y ) = b (θ)a(ϕ) = ϕ = σ 2. Bernoulli Distribution: b(θ) = log(1 + e θ ) E(Y ) = b (θ) = e θ /(1 + e θ ) = π. var(y ) = b (θ) = e θ /(1 + e θ ) 2 = π(1 π) Poisson Distribution: b(θ) = exp(θ) E(Y ) = b (θ) = exp(θ) = λ. var(y ) = b (θ) = exp(θ) = λ. 42

43 Likelihood Equations in GLMs Let (y 1,, y N ) denote responses for N independent observations. Let (x i1,, x ip ) denote values of p explanatory variables for observation i. The systematic component for the i-th observation η i = p β j x ij j=1 If E(y i ) = µ i, then the link for the i-th observation is: η i = g(µ i ) 43

44 Likelihood Equations in GLMs For N independent observations, the log-likelihood function is: L(β)= N L i = N log f(y i ; θ, ϕ) = N + N c(y i, ϕ) i=1 i=1 i=1 y i θ i b(θ i ) a(ϕ) The notation L(β) reflects the dependence of θ on the model parameters β. i=1 44

45 Likelihood Equations in GLMs The likelihood equations are: for j = 1,, p. Simplifying, we have L(β) β j = N i=1 L i β j = 0, N i=1 (y i µ i )x ij var(y i ) µ i η i = 0 for j = 1,, p. Notice that µ i η i = 1/g (µ i ). 45

46 Logit Model: Examples N (y i π i )x ij = 0, where π i = i=1 i=1 k=1 exp( 1+exp( p j=1 p j=1 β j x ij ) β j x ij ) Probit Model: N (y i π i )x ij π i (1 π i ) ϕ( p β k x ik ) = 0, where π i = Φ( p Log-Linear Model: N (y i exp( p β k x ik ))x ij = 0 i=1 k=1 j=1 β j x ij ) 46

47 Maximum Likelihood Estimates ML estimates of β j s are obtained by solving the likelihood equations using numerical methods. The ML estimates ˆβ j s are approximately normally distributed. Thus, a confidence interval for a model parameter β j equals ˆβ j ± z α/2 ASE where ASE is the asymptotic standard error of ˆβ j. 47

48 To test: H 0 : β j = 0. Testing For Significance The test statistic, Z = ˆβ j /ASE has an approximate standard normal distribution, when H 0 is true. Equivalently, Z 2 has a chi-squared distribution with d.f. = 1, which can be used for two-sided alternatives This type of test is known as Wald s test. 48

49 Likelihood Ratio Test The likelihood-ratio test statistic equals 2 log(l 0 /L 1 ) = 2[log L 0 log L 1 ] = 2[l 0 l 1 ] where L 0 and L 1 are the maximized likelihood functions under the null hypothesis and under the full model, respectively. Under H 0, this test statistic also has a large sample chi-squared distribution with d.f. = 1. 49

50 Score Test The score statistic or efficient score statistic uses the size of the derivative of the log-likelihood function evaluated at β j = 0. The score statistic is the square of the ratio of this derivative to its ASE. It also has an approximate chi-squared distribution. 50

51 Example In simple logistic regression with one explanatory variable, the log-likelihood function is: l(α, β) = N {y i (α + βx i ) log(1 + exp(α + βx i ))} i=1 Therefore for H 0 : β = 0 and H 1 : β 0, we have l 0 = N {y i log ȳ 1 ȳ log 1 i=1 1 ȳ }, l 1 = N {y i (ˆα + ˆβx i ) log(1 + exp(ˆα + ˆβx i ))} i=1 51

52 Model Residuals For i-th observation, the raw residual is: r i = y i ˆµ i = Observed fitted, where y i is the observed response and ˆµ i is the fitted value from the model. The Pearson residual is defined as Pearson residual= Oberved-fitted = var(observed) ˆ For Poisson GLMs, it simplifies to e i = y i ˆµ i ˆµi y i ˆµ i. var(yi ˆ ) 52

53 Adjusted Residuals The Pearson residuals divided by its estimated standard error are called adjusted residuals. Adjusted residuals have an approximate standard normal distribution. For Poisson GLMs, the general form of the adjusted residual is: (y i ˆµ i ) ˆµi (1 h i ) = e i 1 hi where h i is called the leverage of observation i. 53

STA216: Generalized Linear Models. Lecture 1. Review and Introduction

STA216: Generalized Linear Models Lecture 1. Review and Introduction Let y 1,..., y n denote n independent observations on a response Treat y i as a realization of a random variable Y i In the general