Biostat Methods STAT 5500/6500 Handout #12: Methods and Issues in (Binary Response) Logistic Regression


Caroline Randall

Recall the general $\chi^2$ test setup:

          Y = 0   Y = 1
  Trt 0     a       b
  Trt 1     c       d

I. Basic logistic regression

Previously (Handout 7): $\chi^2$ test of independence; for Handout 11 Example 1:
  $\pi_i = P(Y = 1 \mid Trt = i)$
  $H_0: \pi_0 = \pi_1$
  p-value here: 0.11
  conclusion here: no evidence of a difference in relapse rate between the ACT and PBO groups

Now: Think of the Trt group as part of an individual's covariate profile.
  covariate profile i: $x_i$ (specific values in each predictor variable)
  $R_i$ = # of obs. with Y = 1 at covariate profile i
  $N_i$ = # of obs. at covariate profile i
  $\pi_i = P(Y = 1 \mid \text{covariate profile } i)$, $i = 1, \dots, k$
  $R_i \sim \text{Binomial}(N_i, \pi_i)$

The likelihood function is
  $f(r \mid \pi) = \prod_{i=1}^{k} f(r_i \mid \pi_i) = \prod_{i=1}^{k} \frac{N_i!}{(N_i - R_i)!\,R_i!}\,\pi_i^{R_i}(1 - \pi_i)^{N_i - R_i}$

Basic question: Is $\pi_i$ the same for all covariate profiles?

Need to relate $\pi_i$ to covariates:
  $h(\pi_i) = \beta_0 + \beta_1 X_i + \text{other covariates} = x_i^T\beta$
  (h = link function; $\beta$ = covariate effects)
  logit link (common link function): $h(\pi) = \log\frac{\pi}{1 - \pi}$
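As a rough illustration of the binomial likelihood and logit link above, the following Python sketch evaluates $\log f(r \mid \pi)$ for grouped data. The function names and the two-profile counts are hypothetical and are not taken from the handout's data.

```python
import math

def logit(p):
    """Logit link: h(pi) = log(pi / (1 - pi))."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse of the logit link: pi = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def binom_loglik(beta, X, R, N):
    """Binomial log-likelihood with logit(pi_i) = x_i^T beta."""
    total = 0.0
    for x_i, r_i, n_i in zip(X, R, N):
        eta = sum(b * x for b, x in zip(beta, x_i))
        pi_i = inv_logit(eta)
        total += (math.log(math.comb(n_i, r_i))
                  + r_i * math.log(pi_i)
                  + (n_i - r_i) * math.log(1.0 - pi_i))
    return total

# Hypothetical grouped data: two covariate profiles (Trt = 0 and Trt = 1).
X = [(1, 0), (1, 1)]   # (intercept, Trt indicator)
R = [12, 7]            # number of Y = 1 at each profile
N = [20, 20]           # number of observations at each profile

# Common-pi model (beta_1 = 0) versus a separate pi for each profile.
ll_common = binom_loglik((logit(19 / 40), 0.0), X, R, N)
ll_separate = binom_loglik((logit(12 / 20), logit(7 / 20) - logit(12 / 20)), X, R, N)
print(ll_common, ll_separate)
```

The separate-profile fit can never have a lower log-likelihood than the common-$\pi$ fit; the size of the gap is what a likelihood ratio test examines.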

Logistic regression models how a two-level response $Y \in \{0, 1\}$ depends on covariates:

  $\pi = P(Y = 1)$
  $\log\frac{\pi}{1 - \pi} = \beta_0 + \beta_1 X = x_i^T\beta \;\Rightarrow\; \pi = \frac{1}{1 + \exp[-x_i^T\beta]}$

Substitute this into $\log f(r \mid \pi)$, and call it $l(\beta)$.

Want to find $\hat\beta$ to maximize $l(\beta)$, but there is no closed-form solution; need iterative procedures (Fisher scoring, Newton-Raphson). SAS does this and reports:
  - parameter estimates $\hat\beta$
  - covariance matrix $Cov(\hat\beta)$
  - test of $H_0: \beta_j = 0$
  - odds ratio (most common interpretation of a parameter estimate)

  odds = $\pi / (1 - \pi) = P(Y = 1) / P(Y = 0)$

  odds ratio (OR) for predictor $X_j$:
    $e^{\beta_j} = \frac{\text{odds of } Y = 1 \text{ at } X_j + 1}{\text{odds of } Y = 1 \text{ at } X_j}$

  Compare $\widehat{OR}$ to 1.

Example: $\widehat{OR}$ for trt in the model with trt and x is ____.
  So the odds of relapse (Y = 1) in the ACT group (trt = 1) is about ____ of that in the PBO group (trt = 0), after controlling for # months in remission.
  The odds of relapse (Y = 1) in the ACT group (trt = 1) is about ____ than the odds of relapse in the PBO group (trt = 0), after controlling for # months in remission.
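The iterative fit mentioned above can be sketched in a few lines of Python for the two-parameter model $\text{logit}(\pi_i) = \beta_0 + \beta_1 x_i$. This is a minimal Newton-Raphson sketch (identical to Fisher scoring for the logit link), not what SAS does internally; the function names and the counts are hypothetical.

```python
import math

def score_and_info(b0, b1, x, R, N):
    """Score vector and information matrix of l(beta) for grouped binomial data."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x_i, r_i, n_i in zip(x, R, N):
        pi_i = 1.0 / (1.0 + math.exp(-(b0 + b1 * x_i)))
        resid = r_i - n_i * pi_i          # observed minus expected successes
        w = n_i * pi_i * (1.0 - pi_i)     # binomial variance at this profile
        g0 += resid
        g1 += resid * x_i
        h00 += w
        h01 += w * x_i
        h11 += w * x_i * x_i
    return (g0, g1), (h00, h01, h11)

def fit_logistic(x, R, N, n_iter=25):
    """Newton-Raphson: beta <- beta + information^{-1} * score."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        (g0, g1), (h00, h01, h11) = score_and_info(b0, b1, x, R, N)
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Estimated Cov(beta-hat) = inverse information at the final estimate.
    _, (h00, h01, h11) = score_and_info(b0, b1, x, R, N)
    det = h00 * h11 - h01 * h01
    cov = [[h11 / det, -h01 / det], [-h01 / det, h00 / det]]
    return (b0, b1), cov

# Hypothetical 2-by-2 data: Trt 0 has 12 of 20 with Y = 1, Trt 1 has 7 of 20.
(b0_hat, b1_hat), cov_hat = fit_logistic([0, 1], [12, 7], [20, 20])
odds_ratio = math.exp(b1_hat)
print(b0_hat, b1_hat, odds_ratio, cov_hat[1][1])
```

With a single binary predictor, the fitted odds ratio should match the sample cross-product ratio ad/(bc) of the 2-by-2 table, which gives a quick sanity check on the iteration.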

How good is a logistic model fit?

Likelihood ratio $\chi^2$ tests $H_0: \beta_1 = \dots = \beta_p = 0$.

Concordance: how well predicted probabilities agree with observed outcomes.

Notation: look at all pairs of obs. with different Y; $\pi = P(Y = 1)$
  $n_c$: number of concordant pairs (Y = 1 obs. has larger $\hat\pi$)
  $n_d$: number of discordant pairs (Y = 1 obs. has smaller $\hat\pi$)
  $n_t$: number of tied pairs (Y = 1 obs. has same $\hat\pi$)

Measures (n = total # of obs.):
  Somers' D = $\frac{n_c - n_d}{n_c + n_d + n_t}$
  Gamma = $\frac{n_c - n_d}{n_c + n_d}$
  Tau-a = $\frac{n_c - n_d}{0.5\,(n - 1)\,n}$
  c = $\frac{n_c + 0.5\,n_t}{n_c + n_d + n_t}$

These are rank correlation indices; in general, a model with values closer to 1 has better predictive ability. It is easiest to compare concordance measures for different models on the same data.

Visualizing concordance:
  Get $\hat\pi$ for each obs.; get $\hat{Y}$ based on some threshold.
  If $\hat{Y} = 1$ when Y = 1, call this a true positive.
  If $\hat{Y} = 1$ when Y = 0, call this a false positive.
  Sort all obs. from smallest to biggest $\hat\pi$. At each position in the list, use that $\hat\pi_i$ as the threshold for predicting Y = 1:
    true positive rate = $\frac{\#\{\hat{Y} = 1,\ Y = 1\}}{\#\{Y = 1\}}$
    false positive rate = $\frac{\#\{\hat{Y} = 1,\ Y = 0\}}{\#\{Y = 0\}}$

  Sensitivity = P(detect Y = 1 | Y = 1) = true positive rate
  Specificity = P(detect Y = 0 | Y = 0) = 1 - false positive rate
  1 - Specificity = P(detect Y = 1 | Y = 0) = false positive rate

  Plot these (sensitivity against 1 - specificity): Receiver Operating Characteristic (ROC) curve
    higher curve: better predictive ability
    maximum Area Under Curve (AUC): 1
    when the model has no predictive ability: AUC = 0.5
    AUC reported in concordance table: c
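The pair counts and the four rank-correlation measures above can be computed directly from observed outcomes and fitted probabilities. The Python sketch below uses a made-up four-observation example; the function name `concordance` is hypothetical.

```python
def concordance(y, p_hat):
    """Count concordant/discordant/tied pairs over all (Y=1, Y=0) pairs."""
    ones = [p for yi, p in zip(y, p_hat) if yi == 1]
    zeros = [p for yi, p in zip(y, p_hat) if yi == 0]
    n_c = n_d = n_t = 0
    for p1 in ones:
        for p0 in zeros:
            if p1 > p0:
                n_c += 1      # Y=1 obs. has the larger fitted probability
            elif p1 < p0:
                n_d += 1      # Y=1 obs. has the smaller fitted probability
            else:
                n_t += 1      # tie
    n = len(y)
    pairs = n_c + n_d + n_t
    return {
        "n_c": n_c, "n_d": n_d, "n_t": n_t,
        "somers_d": (n_c - n_d) / pairs,
        "gamma": (n_c - n_d) / (n_c + n_d),
        "tau_a": (n_c - n_d) / (0.5 * (n - 1) * n),
        "c": (n_c + 0.5 * n_t) / pairs,   # equals the area under the ROC curve
    }

# Hypothetical outcomes and fitted probabilities.
result = concordance([0, 0, 1, 1], [0.10, 0.40, 0.35, 0.80])
print(result)
```

In this toy example three of the four (Y=1, Y=0) pairs are concordant, so c = 0.75 and Somers' D = 2c - 1 = 0.5.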

II. Overdispersion

Recall the binomial model in logistic regression:
  $R_i$ = #{Y = 1 at covariate profile i} = # successes at profile i
  $N_i$ = #{obs. at covariate profile i} = # independent trials at profile i
  $\pi_i = P(Y = 1 \mid \text{covariate profile } i)$
  $R_i \sim \text{Binomial}(N_i, \pi_i)$, so $E[R_i] = N_i\pi_i$ and $Var[R_i] = N_i\pi_i(1 - \pi_i)$

What to do when trials are not independent? (Here, each subject is one stratum.)
  With dependent trials, $Var[R_i] > N_i\pi_i(1 - \pi_i)$; this larger variance is called overdispersion.
  It can also result from omitted predictors or an incorrect link function (mis-specified model).

Solution here: add to the model a scale parameter $\sigma$, with $Var[R_i] = N_i\pi_i(1 - \pi_i)\,\sigma^2$.

One way to estimate $\sigma$ is to aggregate data into subpopulations (maybe covariate profiles, or patients here), and assess goodness of fit within subpopulations.

Let m be the # of subpopulations, r + 1 be the # of response levels, $w_{ik}$ be the total weight (# obs.) for response level k in profile i, $w_i = \sum_{k=1}^{r+1} w_{ik}$ be the total weight (# obs. across all response levels) at profile i, and $\hat\pi_{ik} = \hat{P}(Y = k \mid \text{profile } i)$.

  Pearson: $\chi^2_P = \sum_{i=1}^{m}\sum_{k=1}^{r+1} \frac{(w_{ik} - w_i\hat\pi_{ik})^2}{w_i\hat\pi_{ik}}$
  Deviance: $\chi^2_D = 2\sum_{i=1}^{m}\sum_{k=1}^{r+1} w_{ik}\log\left(\frac{w_{ik}}{w_i\hat\pi_{ik}}\right)$

  A common estimate: $\hat\sigma^2 = \chi^2_P / df$
  SAS alternative: scale = williams (see SAS documentation; based on Pearson $\chi^2_P$)

Test for overdispersion: $H_0: \sigma = 1$
  if rejected, then: ____

Alternative approach (coming in Handout #14), and yielding essentially equivalent results: ____
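To make the Pearson and deviance statistics above concrete, here is a small Python sketch that evaluates both sums and the scale estimate $\hat\sigma^2 = \chi^2_P / df$. The counts, fitted probabilities, and the degrees of freedom are all hypothetical; in practice df depends on the number of subpopulations and fitted parameters, so it is passed in rather than derived.

```python
import math

def pearson_and_deviance(w, pi_hat):
    """w[i][k]: observed count at profile i, level k; pi_hat[i][k]: fitted prob."""
    chi2_p = 0.0
    chi2_d = 0.0
    for w_row, pi_row in zip(w, pi_hat):
        w_i = sum(w_row)                       # total weight at profile i
        for w_ik, pi_ik in zip(w_row, pi_row):
            expected = w_i * pi_ik
            chi2_p += (w_ik - expected) ** 2 / expected
            if w_ik > 0:                       # 0 * log(0) is taken as 0
                chi2_d += 2.0 * w_ik * math.log(w_ik / expected)
    return chi2_p, chi2_d

# Hypothetical data: m = 3 profiles, r + 1 = 2 response levels.
w = [[3, 7], [6, 4], [2, 8]]
pi_hat = [[0.5, 0.5], [0.5, 0.5], [0.3, 0.7]]
df = 2  # assumed degrees of freedom, for illustration only

chi2_p, chi2_d = pearson_and_deviance(w, pi_hat)
sigma2_hat = chi2_p / df
print(chi2_p, chi2_d, sigma2_hat)
```

A value of `sigma2_hat` well above 1 would point toward overdispersion; a value near 1 is what the plain binomial model predicts.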

III. Separation of Points

What if we have separation of points in logistic regression?
  Menopause example: what if menopause = 1 iff age > 56?
  Dose effect becomes less significant:
    - larger variance of estimates
    - non-convergence of maximum likelihood iterations

Solution (Heinze and Schemper, 2002 / 2006):
  Rather than maximizing the log-likelihood $l(\beta)$ from Section I, maximize (via an iterative procedure)
    $l(\beta) + \frac{1}{2}\log\left|Var(\hat\beta)^{-1}\right|$

Separation may not be based on just one predictor (but one often is the main culprit).
If there is separation with one continuous predictor, could make a dummy variable and do a $\chi^2$ test.

IV. Inverse Interval Estimation

Logistic regression is sometimes called dose-response modeling.
  What dose is required to achieve $\hat\pi = \pi$? Often $\pi = 0.5$.
    LD50: lethal dose for 50% of subjects
    ED50: effective dose for 50% of subjects
  If only wanted a point estimate: ____
  What about interval estimation?

AML Example: In the ACT group, how many months in remission are necessary to achieve a 40% relapse rate?
  (Want a 95% confidence interval for ED40 when trt = 1.)
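For the point-estimate question above, one natural approach is to solve the fitted logit model for dose: $\text{dose} = \left(\log\frac{\pi}{1-\pi} - x^T\hat\beta\right) / \hat\beta_d$. The Python sketch below assumes a model with an intercept, a treatment indicator, and a dose term; the coefficient values are hypothetical and are not the AML estimates.

```python
import math

def dose_for_target(pi_target, beta_d, other_terms):
    """Dose at which the fitted probability equals pi_target.

    other_terms is the linear-predictor contribution of everything except dose.
    """
    return (math.log(pi_target / (1.0 - pi_target)) - other_terms) / beta_d

# Hypothetical coefficients (not from the handout's fitted model).
beta_0, beta_trt, beta_dose = 1.2, 0.3, -0.2

# "ED40" for trt = 1: dose giving a fitted probability of 0.40.
ed40 = dose_for_target(0.40, beta_dose, beta_0 + beta_trt * 1)

# Plugging the dose back into the model should reproduce the target.
eta = beta_0 + beta_trt * 1 + beta_dose * ed40
pi_check = 1.0 / (1.0 + math.exp(-eta))
print(ed40, pi_check)
```

This gives only a point estimate; it says nothing about uncertainty, because the estimate is a ratio of two random quantities.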

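Inverse Interval Estimation Methods

  $\log\frac{\pi}{1 - \pi} = \beta_d \cdot \text{dose} + x^T\beta \;\Rightarrow\; \text{dose} = \frac{\log\frac{\pi}{1 - \pi} - x^T\beta}{\beta_d}$

  $a = \log\frac{\pi}{1 - \pi} - x^T\hat\beta$
  $b = \hat\beta_d$

Let $r = \mu_a / \mu_b$; want a $(1 - \alpha)100\%$ CI for r, for a given $\pi$.

Fieller's Theorem: When
  $\begin{pmatrix} a \\ b \end{pmatrix} \sim N\left(\begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \begin{pmatrix} V_{aa} & V_{ab} \\ V_{ab} & V_{bb} \end{pmatrix}\right)$,
then
  $P\left(\frac{(a - rb)^2}{V_{aa} - 2rV_{ab} + r^2V_{bb}} > z^2\right) = \alpha$,
where $r = \mu_a / \mu_b$, and $z = z_{(1 - \alpha/2)}$ is the $(1 - \alpha/2)100$th percentile of N(0, 1).

Solve the inequality for r to get the $(1 - \alpha)100\%$ CI:

  eqn := (a-r*b)^2 / (V_aa-2*r*V_ab+r^2*V_bb) - z^2 ;
  sol := solve(eqn,r) ;

  $\frac{ab - z^2V_{ab} \pm z\sqrt{b^2V_{aa} + a^2V_{bb} - z^2V_{aa}V_{bb} + z^2V_{ab}^2 - 2abV_{ab}}}{b^2 - z^2V_{bb}}$

The resulting interval is called the Fieller interval, or sometimes fiducial limits.

AML Example:
  $\log\frac{\pi}{1 - \pi} = \beta_0 + \beta_1 I_{trt=1} + \beta_2 x$
  $\hat\beta$ = ____
  $Cov(\hat\beta)$ = ____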
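The closed-form Fieller limits above translate directly into a short Python function. This is a sketch under the assumption that the interval is bounded (that is, $b^2 - z^2V_{bb} > 0$); the function name and the input values are hypothetical.

```python
import math

def fieller_interval(a, b, v_aa, v_ab, v_bb, z=1.96):
    """Fieller limits for r = mu_a / mu_b from the quadratic in r."""
    denom = b * b - z * z * v_bb
    disc = (b * b * v_aa + a * a * v_bb - z * z * v_aa * v_bb
            + z * z * v_ab * v_ab - 2.0 * a * b * v_ab)
    if denom <= 0 or disc < 0:
        # b is not clearly different from 0: no bounded interval exists.
        raise ValueError("Fieller interval is not a bounded interval here")
    center = a * b - z * z * v_ab
    half = z * math.sqrt(disc)
    return (center - half) / denom, (center + half) / denom

# Hypothetical estimates and (co)variances, for illustration only.
a, b = 2.0, 0.5
v_aa, v_ab, v_bb = 0.04, 0.002, 0.0025
lower, upper = fieller_interval(a, b, v_aa, v_ab, v_bb)
print(lower, a / b, upper)
```

Each endpoint makes the statistic $(a - rb)^2 / (V_{aa} - 2rV_{ab} + r^2V_{bb})$ exactly equal to $z^2$, which is a convenient way to check the algebra numerically. Note that the interval is generally not symmetric about the point estimate a/b.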

Rules for Variance & Covariance:

Let X, Y, and Z be random variables, and c be some constant. Then the following are necessary rules:
  E(cX) = c E(X)
  Var(cX) = c^2 Var(X)
  E(X + Y) = E(X) + E(Y)
  Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
  Cov(X, X) = Var(X)
  Cov(cX, Y) = c Cov(X, Y)
  Cov(X + Y, Z) = Cov(X, Z) + Cov(Y, Z)
  Cov(X, Y) = Cov(Y, X)

AML Example: Want a 95% confidence interval for ED40 when trt = 1

  z = 1.96
  $a = \log\frac{\pi}{1 - \pi} - X^T\hat\beta = \log\frac{.4}{.6} - ( \_\_\_\_ ) = -.4055 - ( \_\_\_\_ ) = \_\_\_\_$
  $b = \hat\beta_d = \hat\beta_2 = .1998$
  $V_{aa} = Var(a) = Var(-(\hat\beta_0 + \hat\beta_1)) = Var(\hat\beta_0) + Var(\hat\beta_1) + 2Cov(\hat\beta_0, \hat\beta_1) = \_\_\_\_ + \_\_\_\_ + 2(-.2095) = .3100$
  $V_{bb} = Var(b) = Var(\hat\beta_2) = .0032$
  $V_{ab} = Cov(a, b) = Cov(-\hat\beta_0 - \hat\beta_1, \hat\beta_2) = -Cov(\hat\beta_0, \hat\beta_2) - Cov(\hat\beta_1, \hat\beta_2) = \_\_\_\_$

  $z\sqrt{b^2V_{aa} + a^2V_{bb} - z^2V_{aa}V_{bb} + z^2V_{ab}^2 - 2abV_{ab}} = .1038$
  $ab - z^2V_{ab} = .2786$
  $b^2 - z^2V_{bb} = .0276$

  Fieller interval: $\frac{.2786 \pm .1038}{.0276}$ = (6.33, ____)

PROC PROBIT gives:

  Probit Analysis on x
  Probability     x     95% Fiducial Limits
  ____            ____  ____   ____

  (Difference due only to rounding)
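The covariance rules above amount to the identity $Cov(c_1^T\hat\beta, c_2^T\hat\beta) = c_1^T\,Cov(\hat\beta)\,c_2$, which the Python sketch below applies with $a = -(\hat\beta_0 + \hat\beta_1) + \text{const}$ and $b = \hat\beta_2$. The covariance matrix is hypothetical (the handout's matrix is not reproduced here); only the three intermediate values .2786, .1038, and .0276 in the final step are taken from the handout.

```python
def cov_linear(c1, c2, cov):
    """Cov(c1^T beta, c2^T beta) = sum_j sum_k c1[j] * c2[k] * cov[j][k]."""
    return sum(c1[j] * c2[k] * cov[j][k]
               for j in range(len(c1)) for k in range(len(c2)))

# Hypothetical Cov(beta-hat) for (beta_0, beta_1, beta_2); illustration only.
cov_beta = [[0.500, -0.200, -0.030],
            [-0.200, 0.250, 0.005],
            [-0.030, 0.005, 0.004]]

c_a = [-1, -1, 0]   # a = const - beta_0 - beta_1 (the constant adds no variance)
c_b = [0, 0, 1]     # b = beta_2

v_aa = cov_linear(c_a, c_a, cov_beta)   # Var(b0) + Var(b1) + 2 Cov(b0, b1)
v_ab = cov_linear(c_a, c_b, cov_beta)   # -Cov(b0, b2) - Cov(b1, b2)
v_bb = cov_linear(c_b, c_b, cov_beta)   # Var(b2)
print(v_aa, v_ab, v_bb)

# Final arithmetic using the intermediate values printed in the handout.
center, half_width, denom = 0.2786, 0.1038, 0.0276
lower = (center - half_width) / denom
upper = (center + half_width) / denom
print(lower, upper)
```

Writing a and b as contrast vectors avoids expanding each variance term by hand and extends unchanged to models with more covariates.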

V. Two Misc. Asides

1. How is logistic regression related to the chi-square/binomial test?

Consider a simple 2-by-2 table, and only a single covariate Trt with coefficient $\beta_1$:

          Y = 0   Y = 1
  Trt 0     a       b
  Trt 1     c       d

Then at iteration k of logistic model-fitting:
  $\hat\beta_1^{(k+1)} = \hat\beta_1^{(k)} + \frac{b - E[b \mid \beta_1 = \hat\beta_1^{(k)}]}{Var[b \mid \beta_1 = \hat\beta_1^{(k)}]}$

When $\beta_1 \approx 0$, start with $\hat\beta_1^{(0)} = 0$; then the one-step estimate is
  $\hat\beta_1^{(1)} = \hat\beta_1^{(0)} + \frac{b - E[b \mid \beta_1 = 0]}{Var[b \mid \beta_1 = 0]} = (O - E)/V$
where O = observed = b, and E and V are the expected value and variance from the hypergeometric distribution:
  Think of n = a + b + c + d balls in an urn, with a + c black (Y = 0) and b + d red (Y = 1); a + b balls are drawn at random without replacement (Trt = 0). Count the # of red (Y = 1) balls drawn (Trt = 0).

  O = b
  $E = \frac{(a + b)(b + d)}{n}$
  $V = \frac{(a + b)(b + d)(a + c)(c + d)}{n^2(n - 1)}$

Then $Var[\hat\beta_1] = 1/V$, and test $H_0: \beta_1 = 0$ using $\hat\beta_1 / \sqrt{Var[\hat\beta_1]} \sim N(0, 1)$.

This one-step method is called Peto's method, and is equivalent to the two-sample binomial & $\chi^2$ tests (as n gets large).

2. Why logistic regression?

If $T \sim \text{logistic}(\mu, \lambda)$, the cdf is $F_T(t) = [1 + \exp(-(t - \mu)/\lambda)]^{-1}$.
  $F_T(t)$ is the solution to a particular differential equation
    simplest form: $\frac{dP}{dt} = P(1 - P)$, $P(0) = 0.5$
    related to population growth and carrying capacity
  $F_T(t)$ is called the logistic equation
    from French "équation logistique" (Verhulst)
    especially providing necessary support for ongoing (growing) military operations
    describes self-limiting population growth
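The link between the one-step (Peto) estimate in aside 1 and the usual $\chi^2$ test can be checked numerically. In the Python sketch below, the squared Peto z-statistic equals the Pearson $\chi^2$ statistic times (n - 1)/n, so the two agree as n gets large. The table counts and function names are hypothetical.

```python
import math

def peto_one_step(a, b, c, d):
    """One-step estimate (O - E)/V and its z-statistic for a 2-by-2 table."""
    n = a + b + c + d
    observed = b
    expected = (a + b) * (b + d) / n
    variance = (a + b) * (b + d) * (a + c) * (c + d) / (n ** 2 * (n - 1))
    beta1 = (observed - expected) / variance
    z = beta1 / math.sqrt(1.0 / variance)   # Var[beta1-hat] = 1/V
    return beta1, z

def pearson_chi2(a, b, c, d):
    """Usual (uncorrected) chi-square statistic for a 2-by-2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts: rows are Trt 0 / Trt 1, columns are Y = 0 / Y = 1.
a, b, c, d = 8, 12, 13, 7
n = a + b + c + d
beta1, z = peto_one_step(a, b, c, d)
chi2 = pearson_chi2(a, b, c, d)
print(beta1, z ** 2, chi2 * (n - 1) / n)
```

The factor (n - 1)/n comes from the hypergeometric variance, which uses n - 1 where the binomial-based $\chi^2$ statistic uses n.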
