Binary Response: Logistic Regression. STAT 526 Professor Olga Vitek

Size: px

Start display at page:

Download "Binary Response: Logistic Regression. STAT 526 Professor Olga Vitek"

Paula Sutton
6 years ago
Views:

1 Binary Response: Logistic Regression STAT 526 Professor Olga Vitek March 29,

2 Model Specification and Interpretation 4-1

3 Probability Distribution of a Binary Outcome Y In many situations, the response variable has only two possible outcomes Disease (Y = 1) vs Not diseased (Y = 0) Employed (Y = 1) vs Unemployed (Y = 0) Response is binary or dichotomous Can model response using Bernoulli dist Y i Probability 1 Pr{Y 1 = 1} = π i 0 Pr{Y 1 = 0} = 1 π i E{Y i } = π i V ar{y i } = π i (1 π i ) 4-2

4 Goal: express E{Y } as function of a covariate X The simple regression is not appropriate E{Y i } = β 0 + β 1 X i It violates several assumptions: (1) Does not enforce the constraint 0 E{Y i } 1 is (2) Non-normal (binary) distribution of ε X: When Y i = 0 : ε i = 0 β 0 β 1 X i When Y i = 1 : ε i = 1 β 0 β 1 X i (3) Non-constant variance V ar{y i } = π i (1 π i ) = (β 0 + β 1 X i )(1 β 0 β 1 X i ) 4-3

5 Solution: a Generalized Linear Model A generalized linear model is E{Y i } = g(β 0 + β 1 X i ), or g 1 (E{Y i }) = β 0 + β 1 X i where g is a sigmoid function in (0,1). g is called the mean response function g 1 is called the link function A choice of g produces different models g(t) = Identity linear regression g(t) = Φ(t) = standard Normal CDF probit regresison g(t) = exp(t) = CDF of the logistic distrib. 1+exp(t) logistic regresison 4-4

6 Motivation for Probit Regression: Latent Variable Assume that the binary response is guided by a non-observed continuous variable Example: linear model for blood pressure: Only observe bp = β 0 + β 1 age + ε Y = { 1 (disease), if blood pressure > c 0 (healthy), if blood pressure c Pr{Y = 1} = Pr{bp > c} = Pr{β 0 + β 1 age + ε > c} = Pr{ε < β 0 + β 1 age c} = Pr{ ε σ < β 0 c + β 1 σ σ age} = Pr{z < β 0 + β 1 age} = Φ(β 0 + β 1 age) 4-5

7 Logistic Response Function and Logistic Regression A sigmoidal response function E{Y i } = exp(β 0 + β 1 X i ) 1 + exp(β 0 + β 1 X i ) = exp( β 0 β 1 X i )) A monotonic increasing/decreasing function Explicit functional form Restricts 0 E(Y i ) 1 Example of a nonlinear model Logit link function ( ) E{Yi } log = log 1 E{Y i } ( πi 1 π i ) = β 0 + β 1 X i 4-6

8 Probability Distribution of Y in Logistic Regression Y i are independent but not identically distributed Bernoulli random variables Y i ind π i = Bernoulli(π i ) where exp(β 0 + β 1 X i ) 1 + exp(β 0 + β 1 X i ) note no more error term! Probability density of Y i f(y i ) = π Y i i (1 π i) 1 Y i Least Squares Estimates are inappropriate use maximum likelihood for parameter estimation 4-7

9 Simple Logistic Regression: Interpretation of b 1 Fitted value for the individual i ˆπ i = eb 0 +b 1 X i 1+e b 0 +b 1 X i Fitted logistic response (i.e. fitted log odds) ( log e odds(x i ) ) ˆπ = log i (X i ) e = b 0 + b 1 X i 1 ˆπ i (X i ) b 1 is the slope of the fitted logistic response b 1 is interpreted as log(odds ratio) ( ) ˆπi (X i + 1) b 1 = log e 1 ˆπ i (X i + 1) ( ) odds(xi + 1) = log e odds(x i ) odds ratio(x i ) = exp(b 1 ) log e ( ˆπi (X i ) 1 ˆπ i (X i ) ) 4-8

10 Estimation by Maximum Likelihood Y i ind Bernoulli(π i ) where π i = exp(β 0 + β 1 X i ) 1 + exp(β 0 + β 1 X i ) Log likelihood: log e (L) = = log = = = { n n Y i log(π i ) + π Y i i (1 π i) 1 Y i } n (1 Y i ) log(1 π i ) n π i Y i log( ) + 1 π i n Y i (β 0 + β 1 X i ) n log(1 π i ) n log(1 + exp(β 0 + β 1 X i )) MLEs do not have closed forms 4-9

11 Equivalent specification: Binomial distribution Change in notation Data: (Y ij, n i, X i ), i = 1, 2,, c X i : predictor for observation i n i : # of Bernoulli trials in observation i Y i := n i Y ij j=1 Model: Y i ind π i = Binomial(n i, π i ), where exp(β 0 + β 1 X i ) 1 + exp(β 0 + β 1 X i ) Log-Likelihood: log e (L) = c {( ) } ni = log Y i π Y i (1 πi ) n i Y i c { = Y i log(π i) + (n i Y i ) log(1 π i) + log = c { Y i log π ( i ni + n i log(1 π i ) + log 1 π i Y i ( ni Y i )} 4-10 )}

12 Equivalence of Bernouilli and Binomial Models Binomial Log-Likelihood equals Bernouilli Log-Likelihood, up to a constant: log e (L) Binomial = = = = c { Y i log(π i) + (n i Y i ) log(1 π i) } + constant c n i Y ij log(π i ) + (n i c n i j=1 j=1 { π i n i j=1 Y ij log( ) + log(1 π i ) 1 π i Y ij ) log(1 π i ) + constant } + constant = log e (L) Bernouilli + constant Both models lead to same parameter estimates and inferences, but have different deviances. 4-11

13 Prospective and Retrospective Studies Prospective study: the outcome fix predictors, observe Recruit patients with 2 genotypes, compare occurrence of disease. Expensive; large variance for rare diseases Retrospective (=case-control) study: outcome, observe predictors fix Recruit patients with and without the disease, compare the genotypes. Cheaper Log odds ratio based on logistic regression is the same for both studies. Not true for other link functions 4-12

14 Prospective Model With Retrospective Data: Formulation Assume a prospective model: π(x i ) = P (Y i = 1 X i ) = eβ 0+β 1 X i 1 + e β 0+β 1 X i However, the subjects are selected retrospectively. The random variable is Z i = 1 {including subject i}, conditional on Y. Denote θ 0 = P (Z i = 1 Y i = 0) and θ 1 = P (Z 1 = 1 Y i = 1) Note that θ 0 and θ 1 are indenpent of X i, otherwise introduce selection bias. Of interest in retrospective study is P (Y i = 1 X i, Z i = 1) 4-13

15 Prospective Model With Retrospective Data: Estimation of log(or) With retrospective sampling, we model: P (Y i = 1 X i, Z i = 1) = P (Y i = 1, Z i = 1 X i ) P (Z i = 1 X i ) P (Y i = 1, Z i = 1 X i ) = P (Y i = 0, Z i = 1 X i ) + P (Y i = 1, Z i = 1 X i ) θ 1 π(x i ) = θ 0 {1 π(x i )} + θ 1 π(x i ) = θ 1 e β 0+β 1 X i θ 0 + θ 1 e β 0+β 1 X i = elog(θ1/θ0)+β0+β1xi 1 + e log(θ 1/θ 0 )+β 0 +β 1 X i Uncover the same β 1 as in the prospective study { } P (Yi = 1 X i, Z i ) = log = [log(θ 1 /θ 0 ) + β 0 ]+β 1 X i P (Y i = 0 X i, Z i ) 4-14

16 Multiple Logistic Regression Extension of the response function: exp(x i E{Y i } = β) 1 + exp(x i β) Extension of the logit link function: ( ) ( ) E{Yi } πi log e = log e 1 E{Y i } 1 π i = X i β The multivariate likelihood function: log e L(β) = n Y i (X i β) n log e [1 + exp(x i β)] Interpretation of b j : log e ( odds(xj + 1) odds(x j ) while other predictors are held fixed ) 4-15

17 Testing 4-16

18 Asymptotic Properties of β Asymptotic existence and uniqueness: P { β exists and is unique 1} as n Consistency: β β as n Asymptotic Normality: β Ass. N (β, I( β) 1 ) as n Asymptotic efficiency: The MLE has asymptotically smaller variance than many other estimators 4-17

19 Inference About Individual β j : Wald Test Test H 0 : β j = 0 versus H a : β j 0. Test statistic z = b j 0 s{b j } Approximate variance s 2 {b} s 2 {b} = 2 log e L(β) β j β j β=b 1 Approximate distribution of z z N (0, 1). Alternatively, (z ) 2 χ 2 1 reject H 0 if z > z 1 α/2 CI for β j : b j ± z 1 α/2 s{b j } 4-18

20 Simultaneous Inference About Several β j = 0: Likelihood Ratio Test Multivariate logistic regression ( ) E{Yi } log = β 0 + β 1 X i1 + + β p 1 X i,p 1 1 E{Y i } Test H 0 : β 1 = β 2 = = β q = 0 versus H a : not all β 1, β 2,, β q = 0 Test statistic [ ] L(reduced model) G 2 = 2 log e L(full model) = 2 [log e L(reduced model) log e L(full model)] Approximate distr of G 2 for large n Reject H 0 if G 2 > χ 2 (1 α, q) 4-19

21 Comments: Wald Test Versus Likelihood Ratio Test Both tests are approximate, for large n Likelihood Ratio test: β j = 0 simultaneously β j = 0, or several no other values of β j no one-sided tests, or linear com- Wald test: β j = β binations of β j of interest j When testing a single H 0 : β j = 0, the tests may lead to different conclusions due to the approximate nature of the tests unlike in linear regression 4-20

22 Quality of Fit: Replicated data 4-21

23 H 0 : H a : Lack of Fit for Replicated Data: Pearson χ 2 ( ) E{Yij } log = β 0 + β 1 X i1 + + β p 1 X i,p 1 vs 1 E{Y ij } ( ) E{Yij } log β 0 + β 1 X i1 + + β p 1 X i,p 1 1 E{Y ij } c covariate configurations, each with n j cases Observed counts O i0 : observed # of 0 s in configuration i O i1 : observed # of 1 s in configuration i Expected counts E i0 = n i (1 ˆπ i ): expected # of 0 s in conf. i E i1 = n i (ˆπ i ): expected # of 1 s in conf. i Test statistic: reject H 0 if c 1 X 2 (O jk E jk ) 2 = > χ 2 (1 α, c p) E jk k=0 4-22

24 Lack of Fit for Replicated Data: Deviance ( ) E{Yij } H 0 : log = β 0 + β 1 X i1 + + β p 1 X i,p 1 vs 1 E{Y ij } ( ) E{Yij } H a : log β 0 + β 1 X i1 + + β p 1 X i,p 1 1 E{Y ij } c covariate configurations, each with n i cases H 0 : model of interest; H a : saturated model Model of interest E{Y ij } = π i ; E{Yij } = ˆπ i Saturated model E{Y ij } = p i ; E{Yij } = ˆp i = Y ij j n i Test statistic: LR, also called deviance Deviance of a saturated model always =

25 Lack of Fit for Replicated Data: Deviance G 2 = DEV (X 0, X 1,..., X p 1 ) = 2 [ log e L(current model) log e L(saturated model) ] = 2 +2 = 2 c c c n i j=1 n i j=1 n i j=1 Y ij log e ˆπ i + (n i Y ij log e ˆp i + (n i n i j=1 n i j=1 ) (ˆπi Y ij log e + (n i ˆp i Y ij ) log e (1 ˆπ i ) Y ij ) log e (1 ˆp i ) n i j=1 Reject H 0 if G 2 > χ 2 (1 α, c p) ( ) 1 ˆπi Y ij ) log e 1 ˆp i Approximation χ 2 (1 α, c p) can be poor The closer the distribution to Gaussian, and the closer the link to identity, the better the approximation Unlike with the LR test, the quality of approximation does not improve with the sample size 4-24

26 Note: Can Use Deviance for LR Test of Nested Models log ( E{Yij } 1 E{Y ij } ) = β 0 + β 1 X i1 + + β p 1 X i,p 1 Test H 0 : β 1 = β 2 = = β q = 0 versus H a : not all β 1, β 2,, β q = 0 Likelihood Ratio test statistic G 2 = [ ] L(reduced model) = 2 log e L(full model) = 2 [log e L(reduced model) log e L(full model)] = 2 [log e L(reduced model) log e L(saturated model)] +2 [log e L(full model) log e L(saturated model)] = Deviance(reduced model) Deviance(full model) Approximate distr of G 2 for large n Reject H 0 if G 2 > χ 2 (1 α, q) 4-25

27 Quality of Fit: Individual Observations 4-26

28 H 0 : H a : Non-Replicated Data: Hosmer-Lemeshow Goodness of Fit ( ) E{Yi } log = β 0 + β 1 X i1 + + β p 1 X i,p 1 vs 1 E{Y i } ( ) E{Yi } log β 0 + β 1 X i1 + + β p 1 X i,p 1 1 E{Y i } Group cases based on values of estimated probabilities ˆπ i into c groups E.g., find c = 9 groups based on percentiles Apply Pearson χ 2 test to the groups Reject H 0 if X 2 > χ 2 (1 α, c 2) showed by simulation that this distribution is appropriate 4-27

29 Can Write Pearson χ 2 (But Not Use for Tests) Test statistic: X 2 = = c 1 k=0 c (O ik E ik ) 2 E ik (O i0 E i0 ) 2 E i0 + c (O i1 E i1 ) 2 E i1 The corresponding quantities (KNNL p. 591): c = n, n i = 1 O i0 = 1 Y i, O i1 = Y i E i0 = 1 ˆπ i, E i1 = ˆπ i Test statistic: X 2 = = n n [(1 Y i ) (1 ˆπ i )] 2 1 ˆπ i + (Y i ˆπ i ) 2 1 ˆπ i + n n (Y i ˆπ i ) 2 (Y i ˆπ i ) 2 ˆπ i = ˆπ i n (Y i ˆπ i ) 2 ˆπ i (1 ˆπ i ) 4-28

30 Can Write Deviance (But Not Use for Tests) Test statistic G 2 = DEV (X 0, X 1,..., X p 1 ): c = 2 n i j=1 ) (ˆπi Y ij log + (n i ˆp i n i j=1 Y ij ) log ( ) 1 ˆπi 1 ˆp i The corresponding quantities (KNNL p. 592): c = n, n i = 1, n i j=1 Y ij = Y i, ˆp i = n i j=1 Y ij /n i = Y i Test statistic: n [ ) ( ) ] (ˆπi 1 G 2 ˆπi = 2 Y i log + (1 Y i ) log Y i 1 Y i n = 2 [ Y i log ˆπ i + (1 Y i ) log(1 ˆπ i ) Y i log Y i (1 Y i ) log(1 Y i ) ] n = 2 [ Y i log ˆπ i + (1 Y i ) log(1 ˆπ i ) ] 4-29

31 Diagnostics: Residuals Logistic Regression Residuals e i = Pearson residual { 1 ˆπi, if Y i = 1 ˆπ i, if Y i = 0 r Pi = Y i ˆπ i ˆπi (1 ˆπ i ) e i divided by the standard error of Y i n r 2 P i equals non-replicated Pearson X 2 Studentized Pearson Residual r Pi = Y i ˆπ i ˆπi (1 ˆπ i ) (1 h ii ) e i divided by the SE of e i unit variance h ii is the diagonal element of the hat matrix H = Ŵ 1 2X(X ŴX) 1 X Ŵ 1 2, where Ŵ = diag ( ˆπ i (1 ˆπ i )) 4-30

32 Diagnostics: Residuals Deviance Residuals d i = sign(y i ˆπ i ) 2 [ Y i log ˆπ i + (1 Y i ) log 1 ˆπ ] i Y i 1 Y i = sign(y i ˆπ i ) 2 [Y i log ˆπ i + (1 Y i ) log(1 ˆπ i )] the signed square root of the contribution of Y i to the model deviance Analysis of residuals unknown distribution of residuals under true model plot residual by predicted value. A flat lowess smooth to this plot suggests good model Other summaries as in linear regression DFFITS, DFBETAS χ 2, dev, Cook s distance (see KNNL p. 598) 4-31

33 Graphical Check of the Fit Partition the observations into groups of covariate patterns x i. Haldane (1956) recommended to plot y ˆη i = log i n i y i against covariate patterns x i. The plot should be roughly linear if the model is appropriate for the data When all n i = 1 or all n i are small, one can group the data with nearby x values to make the plot 4-32

34 Overdispersion 4-33

35 Overdispersion Y ind Binomial(n, π), π = exp(x β) 1 + exp(x β) Implies E{Y } = π, V ar{y } = nπ(1 π) Overdispersion: V ar{y } > nπ(1 π) Underdispersion: V ar{y } < nπ(1 π) Mechanisms of overdispersion: Suppose Y 1, Y 2,..., Y n are Bernoulli r.v., E{Y i } = π. define Y = n Y i Can think of at least two situations when Y does not have a Binomial distribution (and therefore a different variance) 4-34

36 Overdispersion from correlation Suppose Y 1, Y 2,..., Y n are not independent Suppose all pairs (Y i, Y j ) have a same correlation ρ ( n V ar(y ) = Cov Y i, = n ) n Y i V ar(y i ) + i j = nπ(1 π) + n(n 1)ρπ(1 π) > nπ(1 π) Corr(Y i, Y j ) V ar(y i ) V ar(y j ) The variance exceeds the variance of the Binomial distribution 4-35

37 Overdispersion from clustered data Suppose Y = n Y i π Binomial(π) Suppose π is a random variable, E{π} = p, V ar{π} = p(1 p) Special case: π Beta(α, β) E{Y } = E{E{Y π}} = p V ar{y } = V ar{e{y π}} + E{V ar{y π}} = V ar{nπ} + E{nπ(1 π)} = n 2 V ar{π} + np n[v ar{π} + p 2 ] = np(1 p) + n(n 1)V ar{π} > np(1 p) The variance exceeds the variance of the Binomial distribution 4-36

38 Modeling Overdispersion Introduce additional parameter E{Y i } = n i π i, V ar{y i } = φn i π i {1 π i } When φ 1, Y i follows a quasi-binomial distribution. The distribution is characterized by its expectation and variance. The probability distribution function is unspecified. ˆφ = χ 2 /(n p) χ 2 is the Pearson lack of fit statistic (= sum of squared Pearson residuals with nonreplicated data) p is the number of parameters in the model ˆφ >> 1 indicates evidence of overdispersion. Since φ does not affect E{Y i }, modeling overdispersion does not change ˆβ. SE{ˆβ} is multiplied by ˆφ. 4-37

39 Comparing Nested Models in Presence of Overdispersion Regular likelihood-based approaches (e.g. LRT, AIC) are not applicable. F test approximates deviance-based LR test F = D reduced D full /ˆφ ass., H 0 df reduced df full F dfreduced df full,df full Assumes roughly equal covariate classes Modeling strategy: fit the full model (with all predictors) estimate ˆφ compare nested models with F test to reduce the number of predictors 4-38

40 Prediction and Classification 4-39

41 Prediction of the Mean Point estimate for the link ˆπ h : ˆπ h = X hb Point estimate for the response ˆπ h : ˆπ h = exp( ˆπ h ) = exp( X h b) Interval estimate for the link ˆπ h : s 2 {ˆπ h } = X h s2 {b}x h (1 α)% CI for ˆπ h : ˆπ h ± z1 α/2 s{ˆπ } = (L, U) Approximate interval estimate for the response ˆπ h : (1 α)% CI for ˆπ h : ( exp( L), exp( U) ) Use Bonferroni if multiple X are of interest 4-40

42 Measures of Agreement Have N observations Consider all pairs of distinct responses In this example t = = 224 Compare predicted probabilities Concordant if ˆπ Y =1 > ˆπ Y =0 Discordant if ˆπ Y =1 < ˆπ Y =0 Tie if ˆπ Y =1 = ˆπ Y =0 Measures of agreement Somers D : (#C - #D)/t Goodman-Kruskal Gamma : (#C - #D)/(#C + #D) Kendall s Tau-a : (#C - #D)/(.5N(N-1)) c : (#C +.5(t-#C-#D))/t 4-41

43 Prediction of a New Observation (i.e. Classification) Choose a cutoff c (0, 1) Ŷ h = { 1, if ˆπh > c 0, if ˆπ h c Sensitivity: # predicted 1 & true 1 # true 1 = n n Ŷ i Y i Y i Specificity: # predicted 0 & true 0 # true 0 = n (1 Ŷ i ) (1 Y i ) n 1 Y i Vary the cut-off c (0, 1), and choose c to optimize sensitivity and specificity 4-42

44 ROC curve for classification Vary c, and plot sensitivity vs 1-specificity. Evaluate models by area under the curve. Sensitivity: # predicted '1' / # true '1' Specificity: # predicted '1' / # true '0' 4-43

45 Evaluation of the Predictive Ability of the Model Area under ROC can be used to compare models Area = 1 perfect classification Area =.5 random classification Classification on the training set is overly optimistic Use cross-validation to construct a more accurate ROC curve 4-44

46 Variable Selection 4-45

47 Automatic Variable Selection Exhaustive search. Minimize: 2 log e L(b) AIC p = 2 log e L(b) + 2p BIC p = 2 log e L(b) + p log e (n) Heuristic search forward selection; backward elimination; stepwise selection based on Wald statistic and Normal distribution 4-46

48 Variable Selection Should be Done as Part of Cross-Validation Example from Simon et al., JNCI, Simulated data with no structure 20 observations with random labels 6,000 possible but unrelated predictors Repeated 200 times Estimated predictive accuracy using no cross-validation selecting features on full dataset, then using cross-validation selecting features at each step of cross-validation 4-47

49 Variable Selection Should be Done as Part of Cross-Validation ature selection in logistic regressi Example from SimonSimon, et al., J. of JNCI, the National Cancer Institute 1.00 Proportion of Simulated Data Sets Cross validation: none (resubstitution method) Cross validation: after gene selection Cross validation: before gene selection No. of Misclassifications imulated 20 samples with random labels + 6,000 gene Conclusion repeated 200 times Incorporating selection of predictors within the cross-validation procedure is key stimated predictive accuracy using no cross-validation; set aside validation set after gene select 4-48 onclusion: cross-validation at all steps of feature selectio

Chapter 14 Logistic Regression, Poisson Regression, and Generalized Linear Models

Chapter 14 Logistic Regression, Poisson Regression, and Generalized Linear Models 許湘伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 29 14.1 Regression Models