Variable selection

Suppose for the i-th observational unit (case) you record a binary response

    Y_i = 1 (success) or 0 (failure)

and explanatory variables Z_1i, Z_2i, ..., Z_ri. Consider a logistic regression model with

    π_i = Pr(Y_i = 1 | Z_1i, Z_2i, ..., Z_ri)

and

    log(π_i / (1 − π_i)) = β_0 + β_1 Z_1i + β_2 Z_2i + ... + β_r Z_ri.

Which explanatory variables should be included in the model?

Approaches to variable (or model) selection:
- subject matter theory and expert opinion
- fit models to test hypotheses of interest
- stepwise methods (backward elimination, forward selection, stepwise selection), where the user supplies a list of potential variables
- consider "diagnostics"
- maximize a "penalized" likelihood

Example: Retrospective study of bronchopulmonary dysplasia (BPD) in newborn infants.

Observational units: 248 infants treated for respiratory distress syndrome (RDS) at Stanford Medical Center between 1962 and 1973, who received ventilatory assistance by intubation for more than 24 hours, and who survived for 3 or more days.

Binary response:

    BPD = 1 if stage III or IV, 2 if stage I or II

Suspected causes: duration and level of exposure to oxygen therapy and intubation.

Background variables:
- SEX: 0 = female, 1 = male
- YOB: year of birth
- APGAR: one-minute APGAR score
- GEST: gestational age (weeks)
- BWT: birth weight (grams)
- AGSYM: age at onset of respiratory symptoms (hrs)
- RDS: severity of initial X-ray for RDS, on a 6-point scale from 0 = no RDS seen to 5 = very severe case of RDS
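To make the model concrete, here is a minimal sketch of fitting a logistic regression by Newton-Raphson (iteratively reweighted least squares). The data below are invented for illustration, since the BPD data set itself is not reproduced here; numpy is assumed to be available.

```python
import numpy as np

def fit_logistic(Z, y, n_iter=25):
    """Newton-Raphson (IRLS) for log(pi/(1-pi)) = Z beta; Z includes the intercept column."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ beta)))   # current fitted probabilities
        W = p * (1.0 - p)                       # Bernoulli variances (IRLS weights)
        # Newton step: beta <- beta + (Z'WZ)^{-1} Z'(y - p)
        beta = beta + np.linalg.solve(Z.T @ (W[:, None] * Z), Z.T @ (y - p))
    return beta

# hypothetical toy data: one explanatory variable, binary response
z = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])
Z = np.column_stack([np.ones_like(z), z])
beta = fit_logistic(Z, y)
p_hat = 1.0 / (1.0 + np.exp(-(Z @ beta)))
```

The same Newton iteration underlies the fits reported by PROC LOGISTIC; only the data and the number of columns of Z change.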
Treatment variables:
- INTUB: duration of endotracheal intubation (hrs)
- VENTL: duration of assisted ventilation (hrs)
- LOWO2: hours of exposure to a 22-49% level of elevated oxygen
- MED2: hours of exposure to a 40-79% level of elevated oxygen
- HI2: hours of exposure to an 80-100% level of elevated oxygen
- AGVEN: age at onset of ventilatory assistance (hours)

Methods for assessing the adequacy of models:
1. Goodness-of-fit tests.
2. Procedures aimed at specific alternatives, such as fitting extended or modified models.
3. Residuals and other diagnostic statistics designed to detect anomalous or influential cases or unexpected patterns:
   - inspection of graphs
   - case deletion diagnostics
   - measures of fit

Overall assessment of fit: an overall measure similar to R² is

    R² = (L_0 − L_M) / (L_0 − L_S) = (G²_0 − G²_M) / G²_0,

where L_M is the log-likelihood for the fitted model, L_S is the log-likelihood for the saturated model, and L_0 is the log-likelihood for the model containing only an intercept (G² denotes the corresponding deviance). This becomes (L_0 − L_M)/L_0 (since L_S = 0) when there are n observational units that provide n independent Bernoulli observations.

For the BPD data and the model selected with backward elimination,

    R² = (−2L_0 − (−2L_M)) / (−2L_0) = (306.52 − 97.77) / 306.52 = .681.

The fitted model is

    log(π̂_i / (1 − π̂_i)) = −12.729 + .7686 LNL − 1.787 LNM + .3543 (LNM)²
                             − 2.438 LNH + .5954 (LNH)² + 1.648 LNV

with concordant = 97.0%, discordant = 2.9%, GAMMA = .941.
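As a quick check of the arithmetic, the R²-type measure quoted above can be reproduced from the two −2 log-likelihoods:

```python
# R^2-type measure of fit from the -2 log-likelihoods quoted above
neg2_L0 = 306.52   # -2 x log-likelihood, intercept-only model
neg2_LM = 97.77    # -2 x log-likelihood, model chosen by backward elimination
r_squared = (neg2_L0 - neg2_LM) / neg2_L0
print(round(r_squared, 3))   # 0.681
```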
Goodness-of-fit tests

H_0: the proposed model is correct vs. H_A: any other model.

Pearson chi-squared test:

    X² = Σ_{i=1}^n Σ_{j=1}^2 (Y_ij − n_i π̂_ij)² / (n_i π̂_ij)

Deviance:

    G² = 2 Σ_{i=1}^n Σ_{j=1}^2 Y_ij log(Y_ij / (n_i π̂_ij))

When each response is an independent Bernoulli observation,

    Y_i = 1 (success) or 0 (failure),   π_i = Pr(Y_i = 1 | Z_1i, ..., Z_ri),   i = 1, 2, ..., n,

and there is only one response for each pattern of covariates (Z_1i, ..., Z_ri), then, for testing

    H_0: log(π_i / (1 − π_i)) = β_0 + β_1 Z_1i + ... + β_k Z_ki

versus

    H_A: 0 < π_i < 1   (general alternative),

neither

    G² = 2 Σ_{i=1}^n Y_i log(Y_i / π̂_i) + 2 Σ_{i=1}^n (1 − Y_i) log((1 − Y_i) / (1 − π̂_i))

nor

    X² = Σ_{i=1}^n (Y_i − π̂_i)² / (π̂_i (1 − π̂_i))

is well approximated by a chi-square distribution when H_0 is true, even for large n.

In this situation, G² and X² tests for comparing two (nested) models may still be well approximated by chi-squared distributions when β_{k+1} = ... = β_r = 0, i.e., for testing

    H_0: log(π_i / (1 − π_i)) = β_0 + β_1 Z_1i + ... + β_k Z_ki

versus

    H_A: log(π_i / (1 − π_i)) = β_0 + β_1 Z_1i + ... + β_k Z_ki + β_{k+1} Z_{k+1,i} + ... + β_r Z_ri.
For this model comparison the deviance is

    G² = 2 Σ_{i=1}^n Y_i log(m̂_i,A / m̂_i,0)

and the Pearson statistic is

    X² = Σ_{i=1}^n (m̂_i,A − m̂_i,0)² / m̂_i,0,

where m̂_i,A and m̂_i,0 are the fitted counts under the alternative and null models. These have approximate chi-squared distributions with r − k degrees of freedom when (r − k)/n is small and β_{k+1} = ... = β_r = 0 (S. Haberman, 1977, Annals of Statistics).

Insignificant values of G² and X²:
1. only indicate that the alternative model offers little or no improvement over the null model;
2. do not imply that either model is adequate.

These are the types of comparisons you make with stepwise variable selection procedures.

Hosmer-Lemeshow test: Collect the n cases into g groups and make a 2 × g contingency table:

                      Groups
                   1      2     ...    g
    (i=1) Y = 1   O_11   O_12   ...  O_1g
    (i=2) Y = 2   O_21   O_22   ...  O_2g
    totals        n_1    n_2    ...  n_g

Compute a Pearson statistic

    C = Σ_{i=1}^2 Σ_{k=1}^g (O_ik − E_ik)² / E_ik,

where the "expected" counts are

    E_1k = n_k π̄_k,   E_2k = n_k (1 − π̄_k),   with   π̄_k = (1/n_k) Σ_{j=1}^{n_k} π̂_j.
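A sketch of the statistic just defined, with groups formed from the sorted fitted probabilities (numpy assumed; this grouping rule is one of several in use):

```python
import numpy as np

def hosmer_lemeshow_C(y, p_hat, g=10):
    """Hosmer-Lemeshow C: Pearson-type sum over g groups formed from sorted fitted probabilities."""
    order = np.argsort(p_hat, kind="stable")
    C = 0.0
    for idx in np.array_split(order, g):         # g groups of nearly equal size
        n_k = len(idx)
        pi_bar = p_hat[idx].mean()               # average fitted probability in group k
        E1, E2 = n_k * pi_bar, n_k * (1.0 - pi_bar)
        O1 = y[idx].sum()                        # observed count of Y = 1 in group k
        C += (O1 - E1) ** 2 / E1 + ((n_k - O1) - E2) ** 2 / E2
    return C                                     # refer to chi-squared with g - 2 d.f.
```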
Hosmer and Lemeshow recommend g = 10 groups formed as:

    group 1:  all observational units with 0 < π̂_j ≤ .1
    group 2:  all observational units with .1 < π̂_j ≤ .2
      ...
    group 10: all observational units with .9 < π̂_j < 1.0

Reject the proposed model if C > χ²_{g−2,α}.

For the BPD example:

    π̂_i interval:   0-.1  .1-.2  .2-.3  .3-.4  .4-.5  .5-.6  .6-.7  .7-.8  .8-.9  .9-1.0
    Observed counts:
      Y = 1            3      3      5      3      6      2      6      2      1     46
      Y = 2          128     18      6      8      1      1      5      1      1      1
      totals         131     21     11     11      7      3     11      3      2     47
    "Expected" counts:
      Y = 1         4.95   3.14   2.67   3.56   3.11   1.67   7.18   2.38   1.72  46.59
      Y = 2        126.5  17.86   8.33   7.44   3.89   1.33   3.82    .62    .28    .41

    C = 12.46 on 8 d.f. (p-value = .132)

* This test often has little power.
* Even when this test indicates that the model does not fit well, it says little about how to improve the model.

The "lackfit" option to the model statement in PROC LOGISTIC makes 10 groups of nearly equal size. For the BPD data:

    Observed counts:
      BPD = 1 (yes)    0      0      0      0      1      4      9     16     25     22
      BPD = 2 (no)    25     25     25     25     24     21     16      9      0      0
      totals          25     25     25     25     25     25     25     25     25     22
    "Expected" counts:
      BPD = 1        .03    .09    .25    .53   1.25   2.67   7.64   18.1   24.5   22.0
      BPD = 2       25.0   24.9   24.7   24.5   23.7   22.3   17.4    6.9     .5     .1

    C = 3.41 on 8 d.f. (p-value = .91)

Diagnostics:
- Cook, R. D. and Weisberg, S. (1982) Residuals and Influence in Regression, Chapman and Hall.
- Belsley, D. A., Kuh, E., and Welsch, R. E. (1980) Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, Wiley.
- Pregibon, D. (1981) Annals of Statistics 9, 705-724.
- Hosmer, D. W. and Lemeshow, S. (2000) Applied Logistic Regression, 2nd edition, Wiley.
- Collett, D. (1991) Modelling Binary Data, Chapman and Hall, London.
- Lloyd, C. (1999) Statistical Analysis of Categorical Data, Wiley, Section 4.2.
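The first C statistic reported for the BPD example can be reproduced directly from the tabulated observed and "expected" counts; the small discrepancy comes from rounding in the printed expected counts:

```python
# observed and "expected" counts from the 2 x 10 Hosmer-Lemeshow table for the BPD data
obs_y1 = [3, 3, 5, 3, 6, 2, 6, 2, 1, 46]
obs_y2 = [128, 18, 6, 8, 1, 1, 5, 1, 1, 1]
exp_y1 = [4.95, 3.14, 2.67, 3.56, 3.11, 1.67, 7.18, 2.38, 1.72, 46.59]
exp_y2 = [126.5, 17.86, 8.33, 7.44, 3.89, 1.33, 3.82, 0.62, 0.28, 0.41]

C = sum((o - e) ** 2 / e for o, e in zip(obs_y1 + obs_y2, exp_y1 + exp_y2))
print(round(C, 2))   # about 12.45, matching the reported 12.46 on 8 d.f.
```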
Further case studies and references on diagnostics:
- Kay, R. and Little, S. (1986) Applied Statistics 35, 16-30 (case study).
- Fowlkes, E. B. (1987) Biometrika 74, 503-515.
- Cook, R. D. (1986) JRSS-B 48, 133-155.
- Miller, M. E., Hui, S. L., and Tierney, W. M. (1991) Statistics in Medicine 10, 1213-1226.

Residuals and other diagnostics

Pearson residuals:

    r_i = (Y_i − n_i π̂_i) / √(n_i π̂_i (1 − π̂_i)),   so that X² = Σ_{i=1}^n r_i².

Deviance residuals:

    d_i = sign(Y_i − n_i π̂_i) √|g_i|,   so that G² = Σ_{i=1}^n d_i²,

where

    g_i = 2 [ Y_i log(Y_i / (n_i π̂_i)) + (n_i − Y_i) log((n_i − Y_i) / (n_i (1 − π̂_i))) ].

Adjusted residuals have the form [observed − predicted] / S.E.(residual).

Adjusted Pearson residual:

    r*_i = r_i / √(1 − h_i) = (Y_i − n_i π̂_i) / √(V̂(Y_i − n_i π̂_i))
         = (Y_i − n_i π̂_i) / √(n_i π̂_i (1 − π̂_i)(1 − h_i)).

Adjusted deviance residual:

    d*_i = d_i / √(1 − h_i),

where h_i is the "leverage" of the i-th observation.
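A sketch of these residuals for binomial counts Y_i out of n_i (numpy assumed; the guards set the deviance terms to zero when Y_i = 0 or Y_i = n_i):

```python
import numpy as np

def logistic_residuals(y, n, p_hat, h):
    """Pearson (r), deviance (d), and leverage-adjusted residuals for binomial counts."""
    r = (y - n * p_hat) / np.sqrt(n * p_hat * (1 - p_hat))
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = np.where(y > 0, y * np.log(y / (n * p_hat)), 0.0)
        t2 = np.where(y < n, (n - y) * np.log((n - y) / (n * (1 - p_hat))), 0.0)
    g = 2.0 * (t1 + t2)                               # deviance contribution g_i
    d = np.sign(y - n * p_hat) * np.sqrt(np.abs(g))
    return r, d, r / np.sqrt(1 - h), d / np.sqrt(1 - h)

# tiny illustrative case: two Bernoulli observations, fitted probability 0.5, zero leverage
y = np.array([1.0, 0.0])
n = np.array([1.0, 1.0])
p = np.array([0.5, 0.5])
h = np.zeros(2)
r, d, r_adj, d_adj = logistic_residuals(y, n, p, h)
```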
Compare residuals to percentiles of the standard normal distribution: cases with residuals larger than 3 or smaller than −3 are suspicious. Note, however, that none of these "residuals" may be well approximated by a standard normal distribution; they are too "discrete".

Residual plots:
- versus each explanatory variable
- versus order (look for outliers or patterns across time)
- versus expected counts n_i π̂_i
- smoothed residual plots

What is leverage? In linear regression the "hat matrix" is

    H = X(X'X)⁻¹X',

which is a projection operator onto the column space of X, and

    Ŷ = HY,   residuals = (I − H)Y,   V(residuals) = (I − H)σ².

For the logistic regression model

    log(π_i / (1 − π_i)) = β_0 + β_1 X_1i + ... + β_k X_ki,   i = 1, ..., n,

the vector of logits satisfies

    [log(π_1/(1−π_1)), log(π_2/(1−π_2)), ..., log(π_n/(1−π_n))]' = Xβ,

where X is the model matrix with i-th row (1, X_1i, ..., X_ki) and β = (β_0, β_1, ..., β_k)'.
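These hat-matrix facts are easy to verify numerically (numpy assumed; the data are arbitrary):

```python
import numpy as np

# H = X (X'X)^{-1} X' projects Y onto the column space of X
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])     # model matrix: intercept and slope
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                                # leverages

y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y_fit = H @ y                                 # fitted values, HY
resid = y - y_fit                             # residuals, (I - H)Y
```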
Pregibon (1981) uses a generalized least squares approach to logistic regression which yields a hat matrix

    H = V^{1/2} X (X'VX)⁻¹ X' V^{1/2},

where V is an n × n diagonal matrix with i-th diagonal element

    V_ii = n_i π̂_i (1 − π̂_i)

and n_i is the number of cases with the i-th covariate pattern. The i-th diagonal element of H is called a leverage value; call this element h_i. Note that

    Σ_{i=1}^n h_i = k + 1,   the number of coefficients.

When there is one individual for each covariate pattern, the upper bound on h_i is 1.

Cases with large values of h_i may be cases with vectors of covariates that are far away from the mean of the covariates. However, such cases can have small h_i values if π̂_i is very close to 0 or very close to 1 (say π̂_i < .1 or π̂_i > .9). An alternative quantity that gets larger as the vector of covariates gets farther from the mean of the covariates is

    b_i = h_i / (n_i π̂_i (1 − π̂_i));

see Hosmer and Lemeshow, pages 153-155. Look for cases with large leverage values and see what happens to the estimated coefficients when the case is deleted.

INFLUENCE: analogous to Cook's D for linear regression. Define

    b     = the m.l.e. for β using all n observations,
    b_(i) = the m.l.e. for β when the i-th case is deleted.

A "standardized" distance between b and b_(i) is approximately

    Influence_(i) = (b − b_(i))' (X'VX) (b − b_(i))
                  ≈ r_i² h_i / (1 − h_i)²              (called C_i in PROC LOGISTIC)
                  = (r*_i)² · h_i / (1 − h_i),

i.e., the squared adjusted residual times a monotone function of the leverage.
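Pregibon's leverages and the influence measure C_i can be sketched as follows; this assumes one response per covariate pattern (n_i = 1), invented fitted probabilities, and numpy:

```python
import numpy as np

def leverage_and_influence(X, y, p_hat):
    """h_i from H = V^{1/2} X (X'VX)^{-1} X' V^{1/2} with n_i = 1; C_i = r_i^2 h_i / (1 - h_i)^2."""
    v = p_hat * (1.0 - p_hat)                         # diagonal of V
    A = np.linalg.inv(X.T @ (v[:, None] * X))         # (X'VX)^{-1}
    h = v * np.einsum("ij,jk,ik->i", X, A, X)         # leverage values h_i
    r = (y - p_hat) / np.sqrt(v)                      # Pearson residuals
    C = r ** 2 * h / (1.0 - h) ** 2                   # Cook-type influence
    return h, r, C

X = np.column_stack([np.ones(6), np.arange(6.0)])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
p_hat = np.array([0.1, 0.3, 0.5, 0.5, 0.7, 0.9])      # hypothetical fitted values
h, r, C = leverage_and_influence(X, y, p_hat)
```

Note that the leverages sum to k + 1 for any diagonal V, since trace(H) = trace((X'VX)⁻¹ X'VX).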
V ar(y i ^ß i ) _=n i ß i (1 ß i )(1 h i ) and an adjusted Pearson residual is PROC LOGISTIC approximates the m.l.e. for with the i-th case deleted as b Λ (i) = b bλ (i) where b Λ (i) = (X V X) 1 x i @ Y 1 i n i^ß i A 1 h i r Λ i = Y i ^ß i qni^ß i (1 ^ß i )(1 h i ) An approximate measure of influence is = r i 1 h i C i = r 2 i h i (1 h i ) 2 % square of the Pearson residual 1133 1134