Logistic Regression. Building, Interpreting and Assessing the Goodness-of-fit for a logistic regression model


Logistic Regression

In previous lectures, we have seen how to use linear regression analysis when the outcome/response/dependent variable is measured on a continuous scale. In this lecture, we will assume that the outcome variable (call it Y for general purposes; for example, Y indicates whether a disease or any other characteristic is present or absent) is binary, where Y = 1 indicates success and Y = 0 indicates failure; see Appendix A for a discussion of the Binomial distribution. The explanatory (independent) variables (usually denoted by X; for example, age, height, marital status) can be either categorical or continuous.

If we were to use the same approach for predicting Y as that for continuous data, we would most definitely encounter situations where the predicted value is not 0 or 1 but a value on a straight line stretching from -∞ to +∞. To overcome this difficulty, rather than modelling the binary outcome itself we model the probability of success of the outcome, usually denoted by π (probability of disease, of recovery, etc.). However, this probability follows an s-shaped curve rather than a line, with high/low values of the outcome associated with the presence/absence of the disease, and it is constrained to lie between 0 and 1. Nevertheless, we can apply a transformation to π that allows us to study the variation of the transformed π using a linear combination of the independent variables. This transformation is given by

log[ p(Y = 1) / (1 - p(Y = 1)) ] = α + βx

and is called the logit link function. Recall that p(Y = 1) / (1 - p(Y = 1)) is the odds of having the disease, where p(Y = 1) is the probability of the disease (presence of a certain characteristic) and 1 - p(Y = 1) is the probability of no disease (absence of a certain characteristic). With some algebraic manipulation, we can show that

p(Y = 1) = 1 / (1 + exp[-(α + βx)]).

See Appendix B for some graphical examples of the above model.
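To make the link function concrete, here is a small Python sketch (Python is used purely for illustration; the intercept and slope are made-up values, not estimates from any data). It shows that the inverse logit maps any value on the real line back into a probability between 0 and 1:

```python
import math

def logit(p):
    """Log-odds of p: the logit link, log(p / (1 - p))."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Inverse logit: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Made-up intercept and slope, for illustration only
alpha, beta = -4.0, 0.1

for x in (0, 20, 40, 60, 80):
    # Each linear predictor alpha + beta*x is squeezed into (0, 1)
    print(x, round(inv_logit(alpha + beta * x), 3))
# → 0 0.018 / 20 0.119 / 40 0.5 / 60 0.881 / 80 0.982
```

Note the s-shaped behaviour: the probabilities rise slowly at first, quickly through the middle, and flatten as they approach 1.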
Building, Interpreting and Assessing the Goodness-of-fit for a Logistic Regression Model

In what follows, we will explore how to build, interpret and assess the goodness-of-fit of a logistic regression model for a binary response through an example. Hosmer and Lemeshow (2000) reported the following data set 1: the LOW BIRTH WEIGHT DATA, collected at Baystate Medical Center, Springfield, Massachusetts. The aim of the study was to determine risk factors associated with giving birth to a low birth weight baby. A low birth weight baby is defined as weighing less than 2500 grams at birth.

1 John Wiley & Sons Inc.

Table 1 is an extract from this study. The variables presented in this table are:

1. ID: Identification Code
2. LOW: Low Birth Weight (0 = Birth Weight >= 2500g, 1 = Birth Weight < 2500g)
3. AGE: Age of the mother in years
4. LWT: Weight in pounds at the last menstrual period
5. RACE: Race (1 = White, 2 = Black, 3 = Other)
6. SMOKE: Smoking status during Pregnancy (1 = Yes, 0 = No)
7. PTL: History of Premature Labor (0 = None, 1 = One, etc.)
8. HT: History of Hypertension (1 = Yes, 0 = No)
9. UI: Presence of Uterine Irritability (1 = Yes, 0 = No)
10. FTV: Number of Physician Visits During the First Trimester (0 = None, 1 = One, 2 = Two, etc.)
11. BWT: Birth Weight in Grams

Table 1: Low birth weight extract from data reported by Hosmer and Lemeshow (2000).

ID LOW AGE LWT RACE SMOKE PTL HT UI FTV BWT

Explanatory categorical variables (Qualitative)

. describe
Contains data from \Lecture_4mo\lowbwt.dta
obs: 189
vars: 11

. tab low
low Freq. Percent Cum.
BWT >= 2500g
BWT < 2500g
Total

. tab low smoke, chi lr
smoke
low No Yes Total
BWT >= 2500g
BWT < 2500g
Total
Pearson chi2(1) = Pr =
likelihood-ratio chi2(1) = Pr =

. display (30/44)/(29/86)

The odds ratio of low birth weight for smokers versus non-smokers is (30/44)/(29/86), approximately 2.02.
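The display command above computes the odds ratio by hand from the 2 x 2 table: 30 low and 44 normal birth weight babies among smokers, 29 low and 86 normal among non-smokers. A Python sketch of the same arithmetic:

```python
# 2 x 2 table for low birth weight by smoking status (counts from the text)
low_smoke, notlow_smoke = 30, 44        # smokers: low, not low
low_nonsmoke, notlow_nonsmoke = 29, 86  # non-smokers: low, not low

odds_smoke = low_smoke / notlow_smoke
odds_nonsmoke = low_nonsmoke / notlow_nonsmoke

# The odds ratio is the ratio of the two odds
odds_ratio = odds_smoke / odds_nonsmoke
print(round(odds_ratio, 2))  # → 2.02
```

So the odds of a low birth weight baby are roughly doubled for smokers compared with non-smokers.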

3 . cs low smoke, or smoke Exposed Unexposed Total Cases Noncases Total Risk Point estimate [95% Conf. Int] Risk difference Risk ratio Attr. frac. ex Attr. frac. pop.19 Odds ratio chi2(1) = 4.92 Pr>chi2 = logit low smoke Iteration 0: log likelihood = Iteration 1: log likelihood = Iteration 2: log likelihood = Logistic regression Number of obs = 189 LR chi2(1) = 4.87 Prob > chi2 = Pseudo R2 = Log likelihood = low Coef. S.E. z P> z [95% CI] smoke _cons log(odds of low bwt) = smoke + error term log(odds of low bwt for smokers) = log(odds of low bwt for non-smokers) = log(or smokers versus non-smokers) = log(odds of low bwt for smokers/ odds of low bwt for nonsmokers)= log(odds of low bwt for smokers) log(odds of low bwt for non-smokers)= 0.70 OR smokers versus non-smokers = exp(0.70). display exp( ) display exp( ) display exp( ) logit low smoke, or Iteration 0: log likelihood = Iteration 1: log likelihood = Iteration 2: log likelihood = Logistic regression Number of obs = 189 LR chi2(1) = 4.87 Prob > chi2 = Pseudo R2 = Log likelihood = low Odds Ratio S.E. z P> z [95% CI] smoke est store A. logit low Iteration 0: log likelihood =
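The link between the logit coefficient for smoke (about 0.70, as in the derivation above) and the odds ratio is just exponentiation, which is what the `or` option reports directly. A Python check:

```python
import math

# Coefficient of smoke from the logit model (rounded value quoted in the text)
b_smoke = 0.70

# The odds ratio is the exponentiated coefficient, as `logit low smoke, or` shows
or_smoke = math.exp(b_smoke)
print(round(or_smoke, 2))  # → 2.01
```

Up to rounding of the coefficient, this reproduces the odds ratio obtained directly from the 2 x 2 table.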

4 Logistic regression Number of obs = 189 LR chi2(0) = 0.00 Prob > chi2 =. Log likelihood = Pseudo R2 = low Coef. S.E. z P> z [95% C. I] _cons est store B lrtest A B Likelihood-ratio test LR chi2(1) = 4.87 (Assumption: B nested in A) Prob > chi2 = The likelihood-ratio test compares the model with the variables to that with constant term only. Since we have 1 variable only, this results in one df. Based on a P-value of we have good evidence that the term smoke is needed in our model.. tab low race, chi race low White Black Other Total BWT >= BWT < Total Pearson chi2(2) = Pr = logit low race Iteration 0: log likelihood = Iteration 1: log likelihood = Iteration 2: log likelihood = Logistic regression Number of obs = 189 LR chi2(1) = 3.57 Prob > chi2 = Log likelihood = Pseudo R2 = low Coef. S. E. z P> z [95% CI] race _cons xi: logit low i.race i.race _Irace_1-3 (naturally coded; _Irace_1 omitted) Iteration 0: log likelihood = Iteration 1: log likelihood = Iteration 2: log likelihood = Iteration 3: log likelihood = Logistic regression Number of obs = 189 LR chi2(2) = 5.01 Prob > chi2 = Log likelihood = Pseudo R2 =
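The P-value for the likelihood-ratio statistic reported above (LR chi2(1) = 4.87 for the smoke model versus the constant-only model) can be recovered from the chi-squared distribution with 1 df using only the Python standard library, via the identity that the 1-df upper-tail probability equals erfc(sqrt(x/2)):

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of
    freedom: P(X > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

# LR chi2(1) statistic comparing the smoke model with the constant-only model
lr_stat = 4.87
print(round(chi2_sf_1df(lr_stat), 3))  # → 0.027
```

A P-value of about 0.027 is what "good evidence that the term smoke is needed" refers to.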

5 low Coef. S. E. z P> z [95% CI] _Irace_ _Irace_ _cons log(odds of low bwt) = black other+ error term. xi: logit low i.race, or nolog i.race _Irace_1-3 (naturally coded; _Irace_1 omitted) Logistic regression Number of obs = 189 LR chi2(2) = 5.01 Prob > chi2 = Log likelihood = Pseudo R2 = low Odds Ratio S. E. z P> z [95% CI] _Irace_ _Irace_ Adjusted Odds ratios. xi: logit low i.smoke i.race, or nolog i.smoke _Ismoke_0-1 (naturally coded; _Ismoke_0 omitted) i.race _Irace_1-3 (naturally coded; _Irace_1 omitted) Logistic regression Number of obs = 189 LR chi2(3) = Prob > chi2 = Log likelihood = Pseudo R2 = low OR S. E. z P> z [95% CI] _Ismoke_ _Irace_ _Irace_ log(odds of low bwt) = smoke black other+ error term 5

Explanatory Continuous Data (Quantitative)

. logit low age
Iteration 0: log likelihood =
Iteration 1: log likelihood =
Iteration 2: log likelihood =
Iteration 3: log likelihood =
Logistic regression Number of obs = 189 LR chi2(1) = 2.76 Prob > chi2 =
Log likelihood = Pseudo R2 =
low Coef. S.E. z P>|z| [95% CI]
age
_cons

. logit low age, or nolog
low OR S.E. z P>|z| [95% CI]
age

The OR here represents the increase in the odds for an increase of one unit (one year) in age.

. quietly xi: logit low i.smoke age, nolog
. est store A
. est table A, b(%9.2f)
Variable A
_Ismoke_1
age
_cons

log(odds of low bwt) = smoke + age + error term

predict e6, xb
predict see6, stdp
gen ule6 = e6+1.96*see6
gen lle6 = e6-1.96*see6
sort age
scatter e6 age if smoke ==0, msymbol(o) || line ule6 lle6 age if smoke ==0, xlabel(10(5)45) title("Non-smokers")

[Figure: linear prediction with upper and lower 95% bounds versus age, non-smokers]

scatter e6 age if smoke ==1, msymbol(o) || line ule6 lle6 age if smoke ==1, xlabel(10(5)40) title("Smokers")

[Figure: linear prediction with upper and lower 95% bounds versus age, smokers]

Including interaction terms

gen lwd = (lwt<110)
tab low lwd, col
Key: frequency, column percentage
lwd
low 0 1 Total
BWT >= 2500g
BWT < 2500g
Total

. xi: logit low i.lwd*age, nolog
i.lwd _Ilwd_0-1 (naturally coded; _Ilwd_0 omitted)
i.lwd*age _IlwdXage_# (coded as above)
Logistic regression Number of obs = 189 LR chi2(3) = Prob > chi2 =
Log likelihood = Pseudo R2 =
low Coef. S.E. z P>|z| [95% CI]
_Ilwd_1
age
_IlwdXage_1
_cons

log(odds of low bwt) = lwd + age + lwd x age + error term

predict el, xb
quietly xi: logit low i.lwd age
predict e2, xb
scatter el age, msymbol(o) || scatter e2 age, msymbol(x), xlabel(10(5)45)

[Figure: linear predictions from the models with and without the lwd x age interaction]

Calculating the OR in the presence of an interaction:

(i) one continuous and one categorical variable:

log(odds of low bwt) = lwd + age + lwd x age + error term

Given age 20:
log(odds of low bwt among lwd >= 110) = constant + (age coefficient) x 20, since the lwd and interaction terms vanish when lwd = 0
log(odds of low bwt among lwd < 110) = constant + (lwd coefficient) x 1 + (age coefficient) x 20 + (interaction coefficient) x 1 x 20

log(OR) = log(odds of low bwt among lwd < 110) - log(odds of low bwt among lwd >= 110)
= (lwd coefficient) + (interaction coefficient) x 1 x 20
= 1.94 + (interaction coefficient) x 1 x 20 = 0.66
OR = exp(0.66) = 1.93

. lincom _Ilwd_1 + 20*_IlwdXage_1, or
( 1) _Ilwd_1 + 20 _IlwdXage_1 = 0
low OR S.E. z P>|z| [95% CI]
(1)
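The hand calculation above, which lincom automates, can be sketched in Python. The interaction coefficient (-0.064) is back-solved here from the two numbers quoted in the text (1.94 for lwd and a log-OR of 0.66 at age 20), so treat both values as illustrative rather than as the exact fitted coefficients:

```python
import math

# b_lwd = 1.94 is quoted in the text; b_int = -0.064 is back-solved so that
# b_lwd + 20 * b_int = 0.66, the log-OR at age 20 given in the text.
b_lwd, b_int = 1.94, -0.064

def or_lwd(age):
    """OR of low birth weight, lwd < 110 lb versus lwd >= 110 lb, at `age`.
    With an interaction, the OR depends on the continuous covariate:
    exp(b_lwd + age * b_int)."""
    return math.exp(b_lwd + age * b_int)

print(round(or_lwd(20), 2))  # → 1.93
```

This is why a single OR for lwd cannot be quoted when the interaction is in the model: the lincom commands at ages 20 and 25 give different answers.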

. lincom _Ilwd_1 + 25*_IlwdXage_1, or
( 1) _Ilwd_1 + 25 _IlwdXage_1 = 0
low OR S.E. z P>|z| [95% CI]
(1)

(ii) two categorical variables:

. xi: logit low i.smoke*i.race, nolog
LR chi2(5) = Prob > chi2 =
low Coef. S.E. P>|z|
_Ismoke_1
_Irace_2
_Irace_3
_IsmoXrac_~2
_IsmoXrac_~3
_cons

log(odds of low bwt) = smoke + black + other + smoke x black + smoke x other + error term

Given a non-smoker:
log(odds of low bwt among white) = 0.77
log(odds of low bwt among black) =
log(OR black vs white) = 1.51

Given a smoker:
log(odds of low bwt among white) =
log(odds of low bwt among black) =
log(OR black vs white) =

. lincom _Irace_2 + _IsmoXrac_1_2
( 1) _Irace_2 + _IsmoXrac_1_2 = 0
low Coef. Std. Err. P>|z| [95% CI]
(1)

. lincom _Irace_2 + _IsmoXrac_1_2, or
low Odds Ratio Std. Err. [95% CI]
(1)

Testing coefficients equal to zero using Wald statistics:

. test _IsmoXrac_1_2 _IsmoXrac_1_3
( 1) _IsmoXrac_1_2 = 0
( 2) _IsmoXrac_1_3 = 0
chi2( 2) = 3.02
Prob > chi2 =

Diagnostics

Goodness-of-fit

To assess lack of fit of a model when continuous explanatory variables are present, one needs to group observations as discussed below. This grouping results in chi-squared statistics with better validity. However, if there are several continuous variables in your model, then simultaneously grouping these covariates will lead to a large contingency table with small counts. One way to avoid such a situation is to group observations according to their predicted values. Hosmer and Lemeshow (1989) devised a Pearson-like statistic based on such a partitioning. Their statistic can be referred to a chi-squared distribution with df = number of groups - 2. Using lfit or estat gof after you fit your model, and specifying the group option, you can obtain the Hosmer-Lemeshow type statistics.

. xi: logit low i.smoke i.race lwt
low Coef. Std. Err. P>|z|
_Ismoke_1
_Irace_2
_Irace_3
lwt
_cons

. estat gof
Logistic model for low, goodness-of-fit test
number of observations = 189
number of covariate patterns = 132
Pearson chi2(127) = Prob > chi2 =

. estat gof, table group(10)
Logistic model for low, goodness-of-fit test
(Table collapsed on quantiles of estimated probabilities)
Group Prob Obs_1 Exp_1 Obs_0 Exp_0 Total
number of observations = 189
number of groups = 10
Hosmer-Lemeshow chi2(8) = 7.35 Prob > chi2 =
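A minimal sketch of the Hosmer-Lemeshow computation: within each group, compare the observed number of successes with the expected number (group size times mean fitted probability). The group summaries below are hypothetical, not the actual lowbwt deciles:

```python
def hosmer_lemeshow(groups):
    """Hosmer-Lemeshow statistic from per-group summaries.

    Each group is (n, observed_successes, mean_predicted_probability); the
    statistic sums (O - E)^2 / (E * (1 - pbar)) over groups and is referred
    to a chi-squared distribution with (number of groups - 2) df."""
    stat = 0.0
    for n, obs, pbar in groups:
        expected = n * pbar
        stat += (obs - expected) ** 2 / (expected * (1.0 - pbar))
    return stat

# Hypothetical deciles-of-risk summaries (NOT the actual lowbwt groups)
groups = [(19, 2, 0.12), (19, 4, 0.20), (19, 6, 0.31), (19, 7, 0.35), (19, 9, 0.45)]
print(round(hosmer_lemeshow(groups), 2))  # → 0.13
```

A small statistic relative to its chi-squared reference (as with the 7.35 on 8 df above) indicates no evidence of lack of fit.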

ROC (Receiver Operating Characteristic) & Discrimination

Once a logit model is fitted to a data set, one can compute the probability of success at different levels of the predictors. Remember that we can show that

p(Y = 1) = 1 / (1 + exp[-(α + βx)]),

if we assume that

log[ p(Y = 1) / (1 - p(Y = 1)) ] = α + βx.

However, the outcome variable is binary. Therefore, when we calculate the sensitivity and specificity of the model 2, we use the following prediction rule: those with p(Y = 1) >= 0.5 are predicted to have the disease, and those with probability less than 0.5 are predicted not to have the disease. The choice of the cut-off point 0.5 is based on statistical considerations. However, we can try other cut-off points and calculate, in each case, the corresponding specificity and sensitivity. The optimal cut-off point is then the point that maximizes both specificity and sensitivity. One way to decide on this is to plot the cut-off points versus sensitivity and specificity and see where the two curves intersect; the point of intersection represents the optimal cut-off point.

To determine how good the model is at discriminating between the two categories (that is, if the association between the predictor and the outcome is positive, then when we observe a higher value of the predictor we expect the outcome), we use the ROC curve. This curve is obtained by plotting sensitivity versus 1 - specificity for all possible values of the cut-off point. We then determine the area under this curve. If the area is 0.5, the model does not discriminate well; it is no better than basing your decision on the flip of a coin. If the area ranges from 0.7 to 0.8, this is considered moderate discrimination.

2 Sensitivity = proportion of times the model predicts a positive when it is actually a positive. Specificity = proportion of times the model predicts a negative when it is actually a negative.
If the area ranges from 0.8 to 0.9, this is considered very good/excellent discrimination. If the area is greater than or equal to 0.9, this is considered outstanding discrimination (see Hosmer and Lemeshow, 2000). First, let us look at how good our model is at discriminating between the two categories. To achieve this we use estat classification (Stata 9) or lstat (Stata 8).

. estat classification
Logistic model for low
True
Classified D ~D Total
+
-
Total

Classified + if predicted Pr(D) >= .5
True D defined as low != 0
Sensitivity Pr( +| D) 15.25%
Specificity Pr( -|~D) 93.85%
Positive predictive value Pr( D| +) 52.94%
Negative predictive value Pr(~D| -) 70.93%
False + rate for true ~D Pr( +|~D) 6.15%
False - rate for true D Pr( -| D) 84.75%
False + rate for classified + Pr(~D| +) 47.06%
False - rate for classified - Pr( D| -) 29.07%
Correctly classified 69.31%

. lsens, nograph genprob(p1) gensens(se1) genspec(sp1)

. estat classification, cutoff( )
Logistic model for low
True
Classified D ~D Total
+
-
Total
Classified + if predicted Pr(D) >=
True D defined as low != 0
Sensitivity Pr( +| D) 64.41%
Specificity Pr( -|~D) 62.31%
Positive predictive value Pr( D| +) 43.68%
Negative predictive value Pr(~D| -) 79.41%
False + rate for true ~D Pr( +|~D) 37.69%
False - rate for true D Pr( -| D) 35.59%
False + rate for classified + Pr(~D| +) 56.32%
False - rate for classified - Pr( D| -) 20.59%
Correctly classified 62.96%

. lsens

[Figure: sensitivity and specificity versus probability cutoff]
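The reported percentages at cutoff 0.5 can be reproduced from the underlying confusion counts. The counts below (9 true positives, 50 false negatives, 122 true negatives, 8 false positives) are reconstructed from the percentages and the marginals (189 observations, 59 low birth weight cases), so treat them as a consistency check rather than as raw Stata output:

```python
# Reconstructed confusion counts at cutoff 0.5 (see lead-in for assumptions)
tp, fn = 9, 50    # true D (59 cases): predicted +, predicted -
tn, fp = 122, 8   # true ~D (130 non-cases): predicted -, predicted +

sensitivity = tp / (tp + fn)               # Pr(+ | D)
specificity = tn / (tn + fp)               # Pr(- | ~D)
ppv = tp / (tp + fp)                       # Pr(D | +)
npv = tn / (tn + fn)                       # Pr(~D | -)
accuracy = (tp + tn) / (tp + fn + tn + fp)

print(f"{sensitivity:.2%} {specificity:.2%} {accuracy:.2%}")
# → 15.25% 93.85% 69.31%
```

Note how the default cutoff of 0.5 trades sensitivity for specificity here: few cases are caught, but few non-cases are flagged.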

. lroc

[Figure: ROC curve, sensitivity versus 1 - specificity]

Area under ROC curve =
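The area under the ROC curve has a useful rank interpretation: it is the probability that a randomly chosen case receives a higher predicted probability than a randomly chosen non-case. A small Python sketch with hypothetical scores (not output from the lowbwt model):

```python
def auc(case_scores, noncase_scores):
    """Area under the ROC curve via the rank (Mann-Whitney) identity: the
    probability that a randomly chosen case gets a higher predicted
    probability than a randomly chosen non-case, counting ties as half."""
    wins = 0.0
    for c in case_scores:
        for n in noncase_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(case_scores) * len(noncase_scores))

# Hypothetical predicted probabilities
print(auc([0.9, 0.6, 0.55], [0.5, 0.4, 0.3]))  # → 1.0 (perfect separation)
print(auc([0.5, 0.5], [0.5, 0.5]))             # → 0.5 (no discrimination)
```

An area of 0.5 corresponds exactly to the coin-flip benchmark described above.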

Outliers/Influential Observations

Outliers are observations that have large residuals, that is, observations that fit poorly. Note that outliers are not necessarily influential.

Pregibon's dbeta (Cook's distance)

Influential observations are observations that have a large effect on the estimated parameters. The effect of such observations is determined by examining the change in the estimated regression coefficients that occurs when they are deleted. This effect is measured using what is known as Cook's distance. For logistic regression, Stata reports an approximation of this measure, proposed by Pregibon (1981), via the predict command after fitting the regression model. You can plot these measures versus the predicted probability in order to determine which observations are influential. The Stata commands that you need are as follows:

predict dbnewvar, dbeta
predict pnewvar, p
predict cnnewvar, number
scatter dbnewvar pnewvar, mlabel(cnnewvar) mlabposition(0) msymbol(i)

For our example, we used the following commands:

predict plowsrlwt, p
predict dblowsrlwt, dbeta
predict cnlowsrlwt, number
scatter dblowsrlwt plowsrlwt, mlabel(cnlowsrlwt) mlabposition(0) msymbol(i)

[Figure: Pregibon's dbeta versus Pr(low)]

One can also look at two other measures: one is the change in the Pearson chi-square statistic and the other is the change in the deviance (difference of log-likelihood) statistic when an observation is deleted. The larger the change, the greater the influence of the observation. The Stata commands needed to generate these measures are:

Change in the deviance (difference of log-likelihood) statistic when an observation is deleted:

predict ddlowsrlwt, dd
scatter ddlowsrlwt plowsrlwt, mlabel(cnlowsrlwt) mlabposition(0) msymbol(i)

[Figure: H-L dd versus Pr(low)]

Change in the Pearson chi-square statistic when an observation is deleted:

predict dx2lowsrlwt, dx2
scatter dx2lowsrlwt plowsrlwt, mlabel(cnlowsrlwt) mlabposition(0) msymbol(i)

[Figure: H-L dx^2 versus Pr(low)]

Pearson's Residuals

Pearson residuals are given by

P_i = (o_i - n_i π_i) / sqrt( n_i π_i (1 - π_i) ).

The Pearson statistic is the sum of the squares of all such components. If n_i is large, then P_i has an approximate normal distribution. When the model holds, its expected value is zero; however, its variance is smaller than that of a standard normal variable. If the number of parameters in the model is small compared to the sample size, then Pearson residuals are treated like standard normal deviates.

The standardised Pearson residual is given by

St P_i = P_i / sqrt(1 - h_ii),

where h_ii is the leverage (influence) of observation i on the estimates. It is worth noting that the standardised Pearson residual is slightly larger in absolute value and has an approximate standard normal distribution.

To obtain Pearson residuals and standardized Pearson residuals from Stata, use the following commands, respectively, after a logit command:

predict rnewvar, r
predict rstnewvar, rst

The following are the graphs of the Pearson residuals and standardized Pearson residuals versus predicted probability.

predict rlowsrlwt, r
scatter rlowsrlwt plowsrlwt, mlabel(cnlowsrlwt) mlabposition(0) msymbol(i)

[Figure: Pearson residual versus Pr(low)]
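A direct Python transcription of the two residual formulas, with purely illustrative numbers:

```python
import math

def pearson_residual(o, n, pi):
    """Pearson residual for a covariate pattern: o observed successes out of
    n trials against fitted probability pi."""
    return (o - n * pi) / math.sqrt(n * pi * (1.0 - pi))

def std_pearson_residual(o, n, pi, h):
    """Standardised Pearson residual: divide by sqrt(1 - h), where h is the
    leverage of the covariate pattern."""
    return pearson_residual(o, n, pi) / math.sqrt(1.0 - h)

# Illustrative numbers: 7 successes in 10 trials with fitted pi = 0.5
print(round(pearson_residual(7, 10, 0.5), 2))           # → 1.26
print(round(std_pearson_residual(7, 10, 0.5, 0.2), 2))  # → 1.41
```

As the text notes, the standardised version is slightly larger in absolute value than the raw residual whenever the leverage h is positive.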

predict rstlowsrlwt, rst
scatter rstlowsrlwt plowsrlwt, mlabel(cnlowsrlwt) mlabposition(0) msymbol(i)

[Figure: standardized Pearson residual versus Pr(low)]

Deviance Residuals

The deviance is a measure of the discrepancy between the observed data and the expected values under the proposed model. It is the sum of the squares of the residuals d_i, obtained as

d_i = ± sqrt( 2 [ o_i ln(o_i / e_i) + (n_i - o_i) ln( (n_i - o_i) / (n_i - e_i) ) ] ),

where o_i is the observed number of positive responses, e_i is the expected number under the model, and the sign of d_i matches the sign of (o_i - e_i). As we add more terms to our regression, we continue to reduce the value of the deviance.

To obtain deviance residuals from Stata, use the following command after a logit command:

predict devnewvar, dev

The following is the graph of the deviance residuals versus predicted probability.

predict devlowsrlwt, dev
scatter devlowsrlwt plowsrlwt, mlabel(cnlowsrlwt) mlabposition(0) msymbol(i)

[Figure: deviance residual versus Pr(low)]
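A Python sketch of the deviance residual formula, signed by (o_i - e_i) and treating the 0 ln 0 terms as zero (the numbers are illustrative only):

```python
import math

def deviance_residual(o, n, e):
    """Deviance residual for a covariate pattern: o observed successes, n
    trials, e expected successes under the model. Terms with a zero count
    are taken as zero (the 0*log(0) convention), and the residual carries
    the sign of (o - e)."""
    d2 = 0.0
    if o > 0:
        d2 += 2.0 * o * math.log(o / e)
    if n - o > 0:
        d2 += 2.0 * (n - o) * math.log((n - o) / (n - e))
    return math.copysign(math.sqrt(max(d2, 0.0)), o - e)

# Illustrative numbers: 7 successes observed where the model expects 5
print(round(deviance_residual(7, 10, 5.0), 2))  # → 1.28
```

When the observed count equals the expected count, the residual is exactly zero, and it grows in magnitude as the pattern fits worse.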

Pseudo-R2

In linear regression, R measures the correlation between predicted and observed values and R2 measures the variation explained by the fitted model. The Pseudo-R2 reported here is McFadden's R2, also known as the likelihood-ratio index. It is

1 - (log-likelihood of the full model) / (log-likelihood of the model with the intercept only).

It increases as the number of parameters increases. There is an adjusted version that accounts for the number of parameters in the model.

Selection of models

This is done using one of the following approaches: backward, forward, or a combination. In Stata this can be done using the command sw and its options, namely:

pr(#) backward selection
pe(#) forward selection
pr(#) pe(#) stepwise

For example: xi: sw logit low i.smoke (i.race) lwt, pr(0.02)

Information Criteria

The likelihood-ratio test is used when we compare nested models. However, when comparing models that cannot be nested within each other, we need different criteria. We will list two that are widely used: Akaike's information criterion (AIC) and the Bayesian information criterion (BIC). Both are based on the likelihood function and the number of parameters involved.

AIC = -2 * log-likelihood + 2 * number of parameters
BIC = -2 * log-likelihood + log(n) * number of parameters

Here n is the sample size. The model with the smaller AIC/BIC is considered the better-fitting model. Raftery (1996) suggested the following guidelines for BIC: if the absolute difference between the two models ranges from 0 to 2, the evidence that one model is better than the other is weak; if the difference is from 2 to 6, the evidence is positive; if the difference is from 6 to 10, the evidence is strong; if the difference is greater than 10, the evidence is very strong.

. xi: logit low i.smoke (i.race) lwt
. est store A
. xi: logit low i.smoke (i.race) age
. est store B
. xi: logit low age ui ftv smoke
. est store C
. est stat
Model Obs ll(null) ll(model) df AIC BIC
A
B
C
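The AIC and BIC formulas above are one-liners; a Python sketch with a made-up log-likelihood (not a value from the est stat table) for a 4-parameter model on n = 189:

```python
import math

def aic(loglik, n_params):
    """Akaike's information criterion: -2*ll + 2*k."""
    return -2.0 * loglik + 2.0 * n_params

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: -2*ll + log(n)*k."""
    return -2.0 * loglik + math.log(n_obs) * n_params

# Hypothetical fitted log-likelihood for a 4-parameter model on n = 189
ll, k, n = -111.5, 4, 189
print(round(aic(ll, k), 1), round(bic(ll, k, n), 2))  # → 231.0 243.97
```

Because log(189) is larger than 2, BIC penalises each extra parameter more heavily than AIC, which is why it tends to favour smaller models.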

Looking Further (For your information only)

Logistic Regression for Polytomous Data

A polytomous variable is a variable with more than two categories. If the response variable is polytomous, one can generalize the ideas of binary logistic regression to cater for polytomous response variables. Suppose that the response variable has K categories; then the probability of belonging to one and only one category is Pr(Y = i | explanatory variables x) = π_i(x), where i = 1, ..., K. In addition, π_1(x) + π_2(x) + ... + π_K(x) = 1. To proceed in the analysis, we need to choose a reference category; suppose we choose the first category. Then, based on this category, one can apply the logistic regression approach by fitting

log[ π_i(x) / π_1(x) ] = β_i0 + β_i1 x_1 + β_i2 x_2 + ... + β_im x_m, for i = 2, ..., K.

Therefore, we can fit K - 1 equations for the association between the 1st category and the other K - 1 categories. Hence, we need to estimate K - 1 sets of the m regression parameters. The Stata command that is used to analyze such data is the mlogit command.

Logistic Regression for Ordinal Data

For ordinal categorical outcomes, there are several models one can look at that exploit the fact that the outcome is ordinal. The following is a list of the most commonly used ones:

I. Adjacent category.
II. Continuation ratio.
III. Proportional odds models.

However, there are others that are not commonly used; here is a list of some of them:

Unconstrained partial-proportional odds.
Constrained partial-proportional odds.
Stereotype.

In what follows, I provide a brief description of each of the three commonly used models. Assume here that the outcome has (K + 1) categories, indexed by k = 0, ..., K.

I- The adjacent category model compares each outcome to the next larger (smaller) outcome:

a_k(x) = ln[ P(Y = k | x) / P(Y = k - 1 | x) ] = ln[ π_k(x) / π_{k-1}(x) ] = α_k + x^T β.

Note that this is a constrained version of the multinomial logit (sometimes known as the baseline logit), since we force the corresponding betas across the different models to be equal.

II- The continuation ratio model compares each outcome to all lower (higher) outcomes, Y = k versus Y < k:

b_k(x) = ln[ P(Y = k | x) / P(Y < k | x) ] = θ_k + x^T β_k.

Here it is as if we are fitting K ordinary binary logistic regression models; note that this model can be constrained such that β_k = β.

Remark: There are instances where you might want the higher group as your reference group and others where you might want the lower group as your reference group. Three such examples are: a depression scale (low as reference), the Apgar scale (high as reference), and categorized low birth weight (higher weight category as reference category).

III- The proportional odds model compares the outcome belonging to category k or below to the outcome belonging to a category greater than k, or vice versa:

c_k(x) = ln[ P(Y <= k | x) / P(Y > k | x) ] = ζ_k - x^T β.

The negative sign is used in order to be consistent with Stata; see Hosmer and Lemeshow (2000). Note that Pr(Y <= k) = π_0(x) + ... + π_k(x) and Pr(Y > k) = 1 - Pr(Y <= k) = π_{k+1}(x) + ... + π_K(x). Note also that when the betas are not constrained in the above model, we refer to the model as the cumulative logit.

The following gives the commands that you need to use in Stata to fit the above models.

I: Fitting the adjacent category model:

mlogit outcome predictor1 predictor2
constraint define 1 [2]outcome = 2*[1]outcome
constraint define 2 [3]outcome = 3*[1]outcome
...
mlogit outcome predictor1 predictor2, constraint(1 2 )

II: Fitting the unconstrained continuation model: a series of logit models.

III: Proportional odds model:

ologit outcome predictor1 predictor2

Appendix A: Binomial Distribution

Example: A study reported that the probability of conceiving after undergoing an IVF treatment is 80 out of 100. Given a sample of size 10, the probability of observing 7 women conceiving is calculated as follows:

1. The probability of a woman conceiving is 0.8; this implies that the probability of 7 women conceiving is 0.8 x 0.8 x 0.8 x 0.8 x 0.8 x 0.8 x 0.8 = 0.8^7 = 0.2097.

2. If 7 women conceived, this means that 3 women did not conceive. The probability of a woman not conceiving is 1 - 0.8 = 0.2. Therefore, the probability of 3 women not conceiving is 0.2 x 0.2 x 0.2 = 0.2^3 = 0.008.

3. If we have 10 women, then there are 120 ways in which the 7 that conceived and the 3 that did not conceive could have been arranged. To get the 120 we use the formula n! / [ y! (n - y)! ], which gives the number of ways you could have observed y out of n observations. Note that n! = 1 x 2 x 3 x ... x n; for example, 3! = 1 x 2 x 3 = 6.

4. Therefore, the probability of observing 7 women conceiving is 120 x 0.2097 x 0.008 = 0.201.

If we have 10 women, then we could observe one of the following outcomes: no women conceiving (which is highly unlikely, given that the rate of success is 0.8, but possible), 1 woman conceiving, 2 women conceiving, 3 women conceiving, ..., or 10 women conceiving. For each of these cases we can calculate the probability of observing such an outcome; Table 1 lists all these probabilities and Figure 1 gives the graphical representation.

The probability distribution listed in Table 1 and presented in Figure 1 is referred to as the Binomial distribution with 10 trials (usually denoted by n) and probability of success 0.8 (usually denoted by p when we are talking about a sample, where p stands for proportion, and π, pronounced pi, a Greek letter, when we are talking about a population).

Table 1: Binomial distribution with n = 10 and π = 0.8
y Probability
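The whole calculation is one line with the binomial formula; in Python, math.comb supplies the 120 arrangements:

```python
import math

n, p, y = 10, 0.8, 7  # 10 women, success probability 0.8, 7 conceiving

arrangements = math.comb(n, y)              # n! / (y! (n - y)!) = 120
prob = arrangements * p**y * (1 - p)**(n - y)

print(arrangements, round(prob, 3))         # → 120 0.201
print(n * p, round(n * p * (1 - p), 1))     # mean 8.0 and variance 1.6
```

Looping y over 0 to 10 in the same way reproduces the full distribution shown in Table 1 and Figure 1.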

[Figure 1: Binomial distribution with n = 10 and pi = 0.8]

Based on Table 1 and Figure 1, we note that the most likely events are those around 8, which is the expected value if the probability of success is 0.8 and we have 10 women (10 x 0.8 = 8). In general, for a binomial distribution, the mean/average number of successes (or expected value) is obtained by n x p. The variance is obtained by n x p x (1 - p).

STATA: Figure 1 and the contents of Table 1 were obtained using the following Stata commands. The starred lines are comments explaining what the previous command is doing; if you want to run the commands, there is no need to include them.

clear
* clears any previous data from STATA's memory that you might have been using
set obs 11
* allocates 11 spaces in STATA's memory. Can you guess why we need 11, not 10?
egen y = seq(), from(0) to(10)
* generates a sequence from 0 to 10 and puts it in the variable y
generate CBinY = Binomial(10,y,80/100)
* STATA does not have a function that directly generates the Binomial
* probabilities that we need, but it has one that generates the upper
* cumulative probabilities. For example, the upper cumulative probability
* at 6 is the probability that we observe 6 or more women conceiving. This
* is usually computed by adding the probability of observing 6 women to
* that of observing 7, 8, 9, and 10 women conceiving. This is what the
* function Binomial(10,y,80/100) does. The information that you should
* provide is: n = 10 in this case, then y, the sequence of all possible
* outcomes (or the ones you are interested in), and then p = 80/100. STATA
* then calculates for each value of y the upper cumulative probability.
label var CBinY "Pr(Y >= y)"
* label for the newly generated variable, to remind us what it is
generate CBinYp1 = Binomial(10,y+1,80/100)
* Next, we generate a variable that holds similar information to CBinY but
* for y+1 instead of y; for example, instead of 6 successes we look at 7.
* The reason is as follows: since Binomial creates upper cumulative
* probabilities, for y = 6 it adds the probabilities of 6, 7, 8, 9, and 10,
* and for y = 7 it adds the probabilities of 7, 8, 9, and 10. So if we
* subtract these two values, we get the probability of observing exactly 6.
* This is what this command and the next generate command achieve.
label var CBinYp1 "Pr(Y >= y+1)"
generate BinY = CBinY - CBinYp1
label var BinY "Pr(Y=y)"
l y BinY
* lists y and its probability
scatter BinY y
* scatter plot of y and its probability

Exercise (optional): Given a sample of size 50, what will be the probability that
(i) forty-four women will conceive?
(ii) six women will not conceive?
(iii) a maximum of ten women will not conceive?
(iv) between four and six women will not conceive?

Figure 2 represents a number of Binomial distributions with varying numbers of trials across columns (the number of trials can be read from the x-axis of each graph) and varying probabilities of success across rows. The graph was created using R.

Figure 2: A series of Binomial distributions.

Hypothesis testing for a proportion

Example 1: If in a sample of 7000 adults 2000 were smokers, test the hypothesis H0: π = 0.3 against the alternative HA: π ≠ 0.3. Using STATA's command bitesti we get the following:

Output 1
N Observed k Expected k Assumed p Observed p
7000 2000 2100 0.3000 0.2857
Pr(k >= 2000) = (one-sided test)
Pr(k <= 2000) = (one-sided test)
Pr(k <= 2000 or k >= 2201) = (two-sided test)

In the above, bitesti instructs STATA that you want a binomial test, as we are dealing with proportions. The i at the end of bitesti tells STATA that you have immediate input: STATA should not expect a variable, but rather the number of trials (n = 7000), the number of successes (2000), and the assumed (hypothetical) proportion (0.3 in this case). Note that this is one of the advantages that STATA has over SPSS: for some commands you need not have a dataset available and can use immediate forms.

In case you had the original dataset available, with the variable smoke coded 0 for non-smokers and 1 for smokers, you would have 2000 ones and 5000 zeros. To test whether the proportion of smokers is 0.3 you would use the command

bitest smoke = 0.3

The output for this command would be exactly like the previous one, possibly with the name of the variable added somewhere.

In Output 1, N represents the number of trials (observations). Observed k represents the number of successes, in this case smokers, which is 2000. Assumed p is the hypothetical value, 0.3. Expected k is the number of smokers that you would expect if the true proportion were 0.3, so 7000 * 0.3 = 2100. Observed p is the proportion that you observe based on your sample, 2000/7000 = 0.2857. The lines starting with Pr are P-values associated with the three possible sets of hypotheses that you might want to look at. In our case we are interested in whether the proportion is different from 0.3, that is, a two-sided test.
Therefore, the P-value for our test is the two-sided one. The numbers reported in Pr(k <= 2000 or k >= 2201) relate to the calculation of the P-value, where we look at the probability of obtaining a statistic that is as or more extreme than what we observed (2000 successes, or a proportion of 0.286) if the null hypothesis is true. So if we plot the Binomial

26 distribution with n = 7000 and p = 0.3 and calculate the probability of observing 2000 smokers it turns out to be (I calculated this probability by using the STATA command display Binomial( 7000, 2000, 0.3) - Binomial( 7000, 2001, 0.3) Note that the command display makes STATA behaves like a calculator). Now if we look at the graph of the Binomial distribution with n = 7000 and p = 0.3 (see Figure 2 for similar examples) we will find out that any number less than or equal to 2000 and any number that is greater than or equal to 2201 has a probability less than or equal to Note that the probability for 2201 is which is less than and was computed by display Binomial( 7000, 2201, 0.3) - Binomial( 7000, 2202, 0.3) whereas the probability for 2200 is which is greater than and was computed by display Binomial( 7000, 2200, 0.3) - Binomial( 7000, 2201, 0.3) This is the reason you see Pr(k <= 2000 or k >= 2201). Based on the P-value of we have strong evidence to reject the null hypothesis that the population proportion which this sample represent is equal to 0.3. Hence, we conclude that the population proportion is significantly different than 0.3 and is actually higher. You could have used an option of bitest in order to see these details. bitesti , detail N Observed k Expected k Assumed p Observed p Pr(k >= 2000) = (one-sided test) Pr(k <= 2000) = (one-sided test) Pr(k <= 2000 or k >= 2201) = (two-sided test) Pr(k == 2000) = (observed) Pr(k == 2200) = Pr(k == 2201) = (opposite extreme) If you were interested in a one tailed hypothesis then one could have explored on of the following A. H : π 0. 3 against the alternative H : π. > 0. 3 The P-value 0 associated with this set of hypotheses would be Pr(k <= 2000) = (one-sided test). I will leave it to you to interpret this P-value. B. H : π 0. 3 against the alternative H : π. < 0. 3 The P-value 0 associated with this set of hypotheses would be Pr(k >= 2000) = (one-sided test). 
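Each display Binomial(...) call above computes a point probability Pr(K = k) as a difference of two upper-tail probabilities. The same comparison can be sketched in Python (binom_pmf is our own helper, not a Stata function):

```python
from math import lgamma, exp, log

def binom_pmf(k, n, p):
    """Pr(K = k) for K ~ Binomial(n, p); the log scale avoids overflow in the factorials."""
    return exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))

n, p0 = 7000, 0.3
p_2000 = binom_pmf(2000, n, p0)   # the observed count
p_2200 = binom_pmf(2200, n, p0)   # still more probable than the observed count
p_2201 = binom_pmf(2201, n, p0)   # the first count in the opposite tail
print(p_2200 > p_2000 > p_2201)   # hence the region Pr(k <= 2000 or k >= 2201)
```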
I will leave it to you to interpret this P-value.

The above is the exact test, as we are using the Binomial distribution. However, before the advent of super-fast computers, a calculation that you have just performed by pressing a button would have taken a substantial amount of time. Therefore, other alternatives were needed. In what follows I list some.

Three test statistics

Score statistic, which uses a standard error (S.E.) based on the hypothetical value and is computed as follows:

z = (p − π0) / √( π0 (1 − π0) / n )

As n increases, the distribution of z tends to the standard normal. This is more powerful if the null hypothesis is true.

Wald statistic, which uses an S.E. based on the data and is computed as follows:

z = (p − π0) / √( p (1 − p) / n )

As n increases, the distribution of z tends to the standard normal. Note that the score statistic's sampling distribution is closer to the standard normal than that of the Wald statistic.

Example 2: If in a sample of 7000 adults 700 were smokers, test the hypothesis H0: π = 0.3 against the alternative HA: π ≠ 0.3. Now, p = 700/7000 = 0.1. Using the score statistic, we have

z = (0.1 − 0.3) / √( 0.3 × 0.7 / 7000 ) = −36.51

Using the Wald statistic, we have

z = (0.1 − 0.3) / √( 0.1 × 0.9 / 7000 ) = −55.78

Referring both results to the standard normal leads to rejection of the null hypothesis.

The likelihood ratio test: see the section on maximum likelihood estimation.

Confidence interval for a proportion

You can also construct confidence intervals, using either the exact Binomial distribution or the Wald approach for a Binomial distribution. You can employ STATA to do that for you.

. cii 7000 2000

                                  -- Binomial Exact --
Variable |   Obs     Mean   Std. Err.   [95% Conf. Interval]

. cii 7000 2000, wald
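The two statistics in Example 2 can be reproduced directly from their formulas. A quick Python check (the helper names score_z and wald_z are ours):

```python
from math import sqrt

def score_z(p, pi0, n):
    """Score statistic: the S.E. is evaluated at the hypothesised proportion pi0."""
    return (p - pi0) / sqrt(pi0 * (1 - pi0) / n)

def wald_z(p, pi0, n):
    """Wald statistic: the S.E. is evaluated at the observed proportion p."""
    return (p - pi0) / sqrt(p * (1 - p) / n)

n, pi0 = 7000, 0.3
p = 700 / n
print(round(score_z(p, pi0, n), 2))  # -36.51
print(round(wald_z(p, pi0, n), 2))   # -55.78
```

Both statistics lie far in the tail of the standard normal, matching the rejection in the text; note how using p rather than pi0 in the standard error makes the Wald statistic even more extreme here.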

                                  -- Binomial Wald --
Variable |   Obs     Mean   Std. Err.   [95% Conf. Interval]

In the above two commands we have employed the immediate form of ci, so if the variable were available we would have used

ci smoke, binomial

The two outputs are similar except for the CI, which varies slightly, as one is based on the exact distribution whereas the other uses the normal approximation. The Obs column gives how many observations were seen, 7000; the Mean column gives the proportion observed, 2000/7000 ≈ 0.286; the Std. Err. column gives the associated standard error, computed as √( p (1 − p) / n ) ≈ 0.0054; and then a CI is computed based on either the Binomial or the Normal distribution. The confidence interval can also aid us in making a decision about whether the population proportion is different from 0.3 (or any other value that we propose). If 0.3 is in the interval, then we fail to reject that the population proportion is equal to 0.3. However, if 0.3 is not in the interval, which is the case here, we reject the null hypothesis of no difference from 0.3 and conclude that the population proportion is significantly different from 0.3.

The Wald 100(1 − α)% confidence interval for π can be constructed as p ± z_{α/2} S.E.(p) if n is large. For the above example, a 95% Wald CI is 0.2857 ± 1.96 × 0.0054 = (0.2751, 0.2963). Note, if π < 0.2 or π > 0.8 then constructing a Wald C.I. does not work well.

(You can ignore this. For your information only.) Formal definition of the Binomial distribution: Given n independent and identically distributed (IID) trials with two possible outcomes, a success and a failure (Bernoulli trials), the number of successes, Y, in the n trials follows a Binomial distribution. If π is the probability of success, then

Pr(Y = y) = [ n! / ( y! (n − y)! ) ] π^y (1 − π)^(n − y),  for y = 0, …, n.   (1)

Note that mean(Y) = nπ and Var(Y) = nπ(1 − π). The quantity n! / ( y! (n − y)! ), where n! = 1 × 2 × 3 × … × n, gives you the number of ways you could have observed y successes out of n observations. Given that the probability of a success is π, if we observe y successes we do so with probability π^y. If we have n trials and y of them are successes, then n − y are failures; since the probability of failure is 1 − π, the probability of n − y failures is (1 − π)^(n − y). Hence, in a sample of n, we observe y successes (or n − y failures) with the probability given in (1).
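The Wald interval reported by cii above can be reproduced in a few lines. A Python sketch (wald_ci is our own helper, not a Stata command):

```python
from math import sqrt

def wald_ci(k, n, z=1.96):
    """95% Wald confidence interval for a proportion: p +/- z * sqrt(p(1-p)/n)."""
    p = k / n
    se = sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

lo, hi = wald_ci(2000, 7000)
print(round(lo, 4), round(hi, 4))  # 0.2751 0.2963
# 0.3 lies outside the interval, so we reject H0: pi = 0.3 at the 5% level.
```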

Appendix B: In this section, we will investigate the effect of both α and β on the general shape of the above function using graphical displays.

Figure 3 represents 1 / (1 + exp[−(α + βx)]), where x = count, β = 1 and α = 0; note the s-shape of the graph.

Figure 3: Logit model with α = 0 and β = 1

Figure 4 graphs represent 1 / (1 + exp[−(α + βx)]), where x = count, β = 1 and α varies over α = −5, −3, −2, −1, 0, 1, 2, 3, 5, with the curve at the far right corresponding to α = −5 and the curve at the far left corresponding to α = 5. Therefore, as we can see from Figure 4, varying α results in a shift of the curve, but the shape remains the same.

Figure 4: The graphs represent the logit function with varying α, see text for details.

Figure 5 graphs represent 1 / (1 + exp[−(α + βx)]), where α = 0 and β = −3, −0.5, 0.5, 1, 3. The bold dashed and dash-dotted graphs correspond to β = −3 and β = 3, where the s-shaped graph corresponds to β = 3 and the inverted s-shaped graph corresponds to β = −3. The dotted graphs correspond to β = −0.5 (inverted s-shape) and β = 0.5 (s-shape). The solid graph corresponds to β = 1 (s-shape). The higher the absolute value of β, the steeper the slope of the graph. Positive β's result in s-shaped graphs and negative β's result in inverted s-shaped graphs. For a positive β, higher values of the count are associated with a higher probability of having the disease, and lower values of the count are associated with a lower probability of having the disease. Whereas for a negative β, lower values are associated with a higher probability of having the disease, and higher values of the count are associated with a lower probability of having the disease.

Figure 5: The graphs represent the logit function with varying β. See text for details.
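The two properties illustrated in Figures 4 and 5 (varying α shifts the curve; β controls the steepness and direction of the s-shape) can be verified numerically. A small Python check, where the function name logistic is our own:

```python
from math import exp

def logistic(x, a, b):
    """p(x) = 1 / (1 + exp(-(a + b*x))), the inverse of the logit link."""
    return 1.0 / (1.0 + exp(-(a + b * x)))

xs = [x / 10 for x in range(-100, 101)]
rising  = [logistic(x, 0.0, 1.0) for x in xs]    # beta > 0: s-shape
falling = [logistic(x, 0.0, -1.0) for x in xs]   # beta < 0: inverted s-shape
print(all(u < v for u, v in zip(rising, rising[1:])))    # strictly increasing
print(all(u > v for u, v in zip(falling, falling[1:])))  # strictly decreasing

# increasing a by d shifts the beta = 1 curve left by d without changing its shape
d = 2.0
print(abs(logistic(1.0, -3.0 + d, 1.0) - logistic(1.0 + d, -3.0, 1.0)) < 1e-12)
```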


More information

Lab 3: Two levels Poisson models (taken from Multilevel and Longitudinal Modeling Using Stata, p )

Lab 3: Two levels Poisson models (taken from Multilevel and Longitudinal Modeling Using Stata, p ) Lab 3: Two levels Poisson models (taken from Multilevel and Longitudinal Modeling Using Stata, p. 376-390) BIO656 2009 Goal: To see if a major health-care reform which took place in 1997 in Germany was

More information

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011)

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011) Ron Heck, Fall 2011 1 EDEP 768E: Seminar in Multilevel Modeling rev. January 3, 2012 (see footnote) Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October

More information

How To Do Piecewise Exponential Survival Analysis in Stata 7 (Allison 1995:Output 4.20) revised

How To Do Piecewise Exponential Survival Analysis in Stata 7 (Allison 1995:Output 4.20) revised WM Mason, Soc 213B, S 02, UCLA Page 1 of 15 How To Do Piecewise Exponential Survival Analysis in Stata 7 (Allison 1995:Output 420) revised 4-25-02 This document can function as a "how to" for setting up

More information

Introduction to Linear Regression Rebecca C. Steorts September 15, 2015

Introduction to Linear Regression Rebecca C. Steorts September 15, 2015 Introduction to Linear Regression Rebecca C. Steorts September 15, 2015 Today (Re-)Introduction to linear models and the model space What is linear regression Basic properties of linear regression Using

More information

Categorical data analysis Chapter 5

Categorical data analysis Chapter 5 Categorical data analysis Chapter 5 Interpreting parameters in logistic regression The sign of β determines whether π(x) is increasing or decreasing as x increases. The rate of climb or descent increases

More information

ssh tap sas913, sas https://www.statlab.umd.edu/sasdoc/sashtml/onldoc.htm

ssh tap sas913, sas https://www.statlab.umd.edu/sasdoc/sashtml/onldoc.htm Kedem, STAT 430 SAS Examples: Logistic Regression ==================================== ssh abc@glue.umd.edu, tap sas913, sas https://www.statlab.umd.edu/sasdoc/sashtml/onldoc.htm a. Logistic regression.

More information

Chapter 11. Regression with a Binary Dependent Variable

Chapter 11. Regression with a Binary Dependent Variable Chapter 11 Regression with a Binary Dependent Variable 2 Regression with a Binary Dependent Variable (SW Chapter 11) So far the dependent variable (Y) has been continuous: district-wide average test score

More information

Assessing Calibration of Logistic Regression Models: Beyond the Hosmer-Lemeshow Goodness-of-Fit Test

Assessing Calibration of Logistic Regression Models: Beyond the Hosmer-Lemeshow Goodness-of-Fit Test Global significance. Local impact. Assessing Calibration of Logistic Regression Models: Beyond the Hosmer-Lemeshow Goodness-of-Fit Test Conservatoire National des Arts et Métiers February 16, 2018 Stan

More information

Latent class analysis and finite mixture models with Stata

Latent class analysis and finite mixture models with Stata Latent class analysis and finite mixture models with Stata Isabel Canette Principal Mathematician and Statistician StataCorp LLC 2017 Stata Users Group Meeting Madrid, October 19th, 2017 Introduction Latent

More information

Binomial Model. Lecture 10: Introduction to Logistic Regression. Logistic Regression. Binomial Distribution. n independent trials

Binomial Model. Lecture 10: Introduction to Logistic Regression. Logistic Regression. Binomial Distribution. n independent trials Lecture : Introduction to Logistic Regression Ani Manichaikul amanicha@jhsph.edu 2 May 27 Binomial Model n independent trials (e.g., coin tosses) p = probability of success on each trial (e.g., p =! =

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur

Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur Lecture No. # 36 Sampling Distribution and Parameter Estimation

More information

Multiple linear regression

Multiple linear regression Multiple linear regression Course MF 930: Introduction to statistics June 0 Tron Anders Moger Department of biostatistics, IMB University of Oslo Aims for this lecture: Continue where we left off. Repeat

More information

8 Analysis of Covariance

8 Analysis of Covariance 8 Analysis of Covariance Let us recall our previous one-way ANOVA problem, where we compared the mean birth weight (weight) for children in three groups defined by the mother s smoking habits. The three

More information

Lecture 10: Introduction to Logistic Regression

Lecture 10: Introduction to Logistic Regression Lecture 10: Introduction to Logistic Regression Ani Manichaikul amanicha@jhsph.edu 2 May 2007 Logistic Regression Regression for a response variable that follows a binomial distribution Recall the binomial

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Nonlinear Regression Functions

Nonlinear Regression Functions Nonlinear Regression Functions (SW Chapter 8) Outline 1. Nonlinear regression functions general comments 2. Nonlinear functions of one variable 3. Nonlinear functions of two variables: interactions 4.

More information

Lecture 01: Introduction

Lecture 01: Introduction Lecture 01: Introduction Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina Lecture 01: Introduction

More information

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form: Outline for today What is a generalized linear model Linear predictors and link functions Example: fit a constant (the proportion) Analysis of deviance table Example: fit dose-response data using logistic

More information

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification, Likelihood Let P (D H) be the probability an experiment produces data D, given hypothesis H. Usually H is regarded as fixed and D variable. Before the experiment, the data D are unknown, and the probability

More information

11. Generalized Linear Models: An Introduction

11. Generalized Linear Models: An Introduction Sociology 740 John Fox Lecture Notes 11. Generalized Linear Models: An Introduction Copyright 2014 by John Fox Generalized Linear Models: An Introduction 1 1. Introduction I A synthesis due to Nelder and

More information

Simple logistic regression

Simple logistic regression Simple logistic regression Biometry 755 Spring 2009 Simple logistic regression p. 1/47 Model assumptions 1. The observed data are independent realizations of a binary response variable Y that follows a

More information

Statistics in medicine

Statistics in medicine Statistics in medicine Lecture 4: and multivariable regression Fatma Shebl, MD, MS, MPH, PhD Assistant Professor Chronic Disease Epidemiology Department Yale School of Public Health Fatma.shebl@yale.edu

More information

Regression: Main Ideas Setting: Quantitative outcome with a quantitative explanatory variable. Example, cont.

Regression: Main Ideas Setting: Quantitative outcome with a quantitative explanatory variable. Example, cont. TCELL 9/4/205 36-309/749 Experimental Design for Behavioral and Social Sciences Simple Regression Example Male black wheatear birds carry stones to the nest as a form of sexual display. Soler et al. wanted

More information

Linear Modelling in Stata Session 6: Further Topics in Linear Modelling

Linear Modelling in Stata Session 6: Further Topics in Linear Modelling Linear Modelling in Stata Session 6: Further Topics in Linear Modelling Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 14/11/2017 This Week Categorical Variables Categorical

More information