Logistic Regression. Building, Interpreting and Assessing the Goodness-of-fit for a logistic regression model


Logistic Regression

In previous lectures, we have seen how to use linear regression analysis when the outcome/response/dependent variable is measured on a continuous scale. In this lecture, we will assume that the outcome variable (call it Y for general purposes; for example, Y indicates whether a disease or any other characteristic is present or absent) is binary, where Y = 1 indicates success and Y = 0 indicates failure; see Appendix A for a discussion of the Binomial distribution. The explanatory (independent) variables (usually denoted by X; for example, age, height, marital status) can be either categorical or continuous.

If we were to use the same approach for predicting Y as that for continuous data, we would most definitely encounter situations where the predicted value is not 0 or 1 but a value on a straight line stretching from -∞ to +∞. To overcome this difficulty, rather than modelling the binary outcome itself we model the probability of success of the outcome, usually denoted by π (probability of disease, of recovery, etc.). However, this probability follows an s-shaped curve rather than a line, with high/low values of the outcome associated with the presence/absence of the disease, and it is constrained to lie between 0 and 1. Nevertheless, we can apply a transformation to π that allows us to study the variation of the transformed π using a linear combination of the independent variables. This transformation is given by

log[ p(Y = 1) / (1 - p(Y = 1)) ] = α + βx

and is called the logit link function. Recall that p(Y = 1) / (1 - p(Y = 1)) is the odds of having the disease, where p(Y = 1) is the probability of the disease (presence of a certain characteristic) and 1 - p(Y = 1) is the probability of no disease (absence of a certain characteristic). With some algebraic manipulation, we can show that

p(Y = 1) = 1 / (1 + exp[-(α + βx)]).

See Appendix B for some graphical examples of the above model.
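To make the link function concrete, here is a small Python sketch (Python is used purely for illustration; the intercept and slope are made-up values, not estimates from any data). It shows that the inverse logit maps any value on the real line back into a probability between 0 and 1:

```python
import math

def logit(p):
    """Log-odds of p: the logit link, log(p / (1 - p))."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Inverse logit: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Made-up intercept and slope, for illustration only
alpha, beta = -4.0, 0.1

for x in (0, 20, 40, 60, 80):
    # Each linear predictor alpha + beta*x is squeezed into (0, 1)
    print(x, round(inv_logit(alpha + beta * x), 3))
# → 0 0.018 / 20 0.119 / 40 0.5 / 60 0.881 / 80 0.982
```

Note the s-shaped behaviour: the probabilities rise slowly at first, quickly through the middle, and flatten as they approach 1.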
Building, Interpreting and Assessing the Goodness-of-fit for a Logistic Regression Model

In what follows, we will explore how to build, interpret and assess the goodness-of-fit of a logistic regression model for a binary response through an example. Hosmer and Lemeshow (2000) reported the following data set 1: the LOW BIRTH WEIGHT DATA, collected at Baystate Medical Center, Springfield, Massachusetts. The aim of the study was to determine risk factors associated with giving birth to a low birth weight baby. A low birth weight baby is defined as weighing less than 2500 grams at birth.

1 John Wiley & Sons Inc.

Table 1 is an extract from this study. The variables presented in this table are:

1. ID: Identification Code
2. LOW: Low Birth Weight (0 = Birth Weight >= 2500g, 1 = Birth Weight < 2500g)
3. AGE: Age of the mother in years
4. LWT: Weight in pounds at the last menstrual period
5. RACE: Race (1 = White, 2 = Black, 3 = Other)
6. SMOKE: Smoking status during Pregnancy (1 = Yes, 0 = No)
7. PTL: History of Premature Labor (0 = None, 1 = One, etc.)
8. HT: History of Hypertension (1 = Yes, 0 = No)
9. UI: Presence of Uterine Irritability (1 = Yes, 0 = No)
10. FTV: Number of Physician Visits During the First Trimester (0 = None, 1 = One, 2 = Two, etc.)
11. BWT: Birth Weight in Grams

Table 1: Low birth weight extract from data reported by Hosmer and Lemeshow (2000).

ID LOW AGE LWT RACE SMOKE PTL HT UI FTV BWT

Explanatory categorical variables (Qualitative)

. describe
Contains data from \Lecture_4mo\lowbwt.dta
obs: 189
vars: 11

. tab low
low Freq. Percent Cum.
BWT >= 2500g
BWT < 2500g
Total

. tab low smoke, chi lr
smoke
low No Yes Total
BWT >= 2500g
BWT < 2500g
Total
Pearson chi2(1) = Pr =
likelihood-ratio chi2(1) = Pr =

. display (30/44)/(29/86)

The odds ratio of low birth weight for smokers versus non-smokers is (30/44)/(29/86), approximately 2.02.
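The display command above computes the odds ratio by hand from the 2 x 2 table: 30 low and 44 normal birth weight babies among smokers, 29 low and 86 normal among non-smokers. A Python sketch of the same arithmetic:

```python
# 2 x 2 table for low birth weight by smoking status (counts from the text)
low_smoke, notlow_smoke = 30, 44        # smokers: low, not low
low_nonsmoke, notlow_nonsmoke = 29, 86  # non-smokers: low, not low

odds_smoke = low_smoke / notlow_smoke
odds_nonsmoke = low_nonsmoke / notlow_nonsmoke

# The odds ratio is the ratio of the two odds
odds_ratio = odds_smoke / odds_nonsmoke
print(round(odds_ratio, 2))  # → 2.02
```

So the odds of a low birth weight baby are roughly doubled for smokers compared with non-smokers.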

3 . cs low smoke, or smoke Exposed Unexposed Total Cases Noncases Total Risk Point estimate [95% Conf. Int] Risk difference Risk ratio Attr. frac. ex Attr. frac. pop.19 Odds ratio chi2(1) = 4.92 Pr>chi2 = logit low smoke Iteration 0: log likelihood = Iteration 1: log likelihood = Iteration 2: log likelihood = Logistic regression Number of obs = 189 LR chi2(1) = 4.87 Prob > chi2 = Pseudo R2 = Log likelihood = low Coef. S.E. z P> z [95% CI] smoke _cons log(odds of low bwt) = smoke + error term log(odds of low bwt for smokers) = log(odds of low bwt for non-smokers) = log(or smokers versus non-smokers) = log(odds of low bwt for smokers/ odds of low bwt for nonsmokers)= log(odds of low bwt for smokers) log(odds of low bwt for non-smokers)= 0.70 OR smokers versus non-smokers = exp(0.70). display exp( ) display exp( ) display exp( ) logit low smoke, or Iteration 0: log likelihood = Iteration 1: log likelihood = Iteration 2: log likelihood = Logistic regression Number of obs = 189 LR chi2(1) = 4.87 Prob > chi2 = Pseudo R2 = Log likelihood = low Odds Ratio S.E. z P> z [95% CI] smoke est store A. logit low Iteration 0: log likelihood =
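The link between the logit coefficient for smoke (about 0.70, as in the derivation above) and the odds ratio is just exponentiation, which is what the `or` option reports directly. A Python check:

```python
import math

# Coefficient of smoke from the logit model (rounded value quoted in the text)
b_smoke = 0.70

# The odds ratio is the exponentiated coefficient, as `logit low smoke, or` shows
or_smoke = math.exp(b_smoke)
print(round(or_smoke, 2))  # → 2.01
```

Up to rounding of the coefficient, this reproduces the odds ratio obtained directly from the 2 x 2 table.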

4 Logistic regression Number of obs = 189 LR chi2(0) = 0.00 Prob > chi2 =. Log likelihood = Pseudo R2 = low Coef. S.E. z P> z [95% C. I] _cons est store B lrtest A B Likelihood-ratio test LR chi2(1) = 4.87 (Assumption: B nested in A) Prob > chi2 = The likelihood-ratio test compares the model with the variables to that with constant term only. Since we have 1 variable only, this results in one df. Based on a P-value of we have good evidence that the term smoke is needed in our model.. tab low race, chi race low White Black Other Total BWT >= BWT < Total Pearson chi2(2) = Pr = logit low race Iteration 0: log likelihood = Iteration 1: log likelihood = Iteration 2: log likelihood = Logistic regression Number of obs = 189 LR chi2(1) = 3.57 Prob > chi2 = Log likelihood = Pseudo R2 = low Coef. S. E. z P> z [95% CI] race _cons xi: logit low i.race i.race _Irace_1-3 (naturally coded; _Irace_1 omitted) Iteration 0: log likelihood = Iteration 1: log likelihood = Iteration 2: log likelihood = Iteration 3: log likelihood = Logistic regression Number of obs = 189 LR chi2(2) = 5.01 Prob > chi2 = Log likelihood = Pseudo R2 =
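The P-value for the likelihood-ratio statistic reported above (LR chi2(1) = 4.87 for the smoke model versus the constant-only model) can be recovered from the chi-squared distribution with 1 df using only the Python standard library, via the identity that the 1-df upper-tail probability equals erfc(sqrt(x/2)):

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of
    freedom: P(X > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

# LR chi2(1) statistic comparing the smoke model with the constant-only model
lr_stat = 4.87
print(round(chi2_sf_1df(lr_stat), 3))  # → 0.027
```

A P-value of about 0.027 is what "good evidence that the term smoke is needed" refers to.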

5 low Coef. S. E. z P> z [95% CI] _Irace_ _Irace_ _cons log(odds of low bwt) = black other+ error term. xi: logit low i.race, or nolog i.race _Irace_1-3 (naturally coded; _Irace_1 omitted) Logistic regression Number of obs = 189 LR chi2(2) = 5.01 Prob > chi2 = Log likelihood = Pseudo R2 = low Odds Ratio S. E. z P> z [95% CI] _Irace_ _Irace_ Adjusted Odds ratios. xi: logit low i.smoke i.race, or nolog i.smoke _Ismoke_0-1 (naturally coded; _Ismoke_0 omitted) i.race _Irace_1-3 (naturally coded; _Irace_1 omitted) Logistic regression Number of obs = 189 LR chi2(3) = Prob > chi2 = Log likelihood = Pseudo R2 = low OR S. E. z P> z [95% CI] _Ismoke_ _Irace_ _Irace_ log(odds of low bwt) = smoke black other+ error term 5

Explanatory Continuous Data (Quantitative)

. logit low age
Iteration 0: log likelihood =
Iteration 1: log likelihood =
Iteration 2: log likelihood =
Iteration 3: log likelihood =
Logistic regression Number of obs = 189 LR chi2(1) = 2.76 Prob > chi2 =
Log likelihood = Pseudo R2 =
low Coef. S.E. z P>|z| [95% CI]
age
_cons

. logit low age, or nolog
low OR S.E. z P>|z| [95% CI]
age

The OR here represents the increase in the odds for an increase of one unit (one year) in age.

. quietly xi: logit low i.smoke age, nolog
. est store A
. est table A, b(%9.2f)
Variable A
_Ismoke_1
age
_cons

log(odds of low bwt) = smoke + age + error term

predict e6, xb
predict see6, stdp
gen ule6 = e6+1.96*see6
gen lle6 = e6-1.96*see6
sort age
scatter e6 age if smoke ==0, msymbol(o) || line ule6 lle6 age if smoke ==0, xlabel(10(5)45) title("Non-smokers")

[Figure: linear prediction with upper and lower 95% bounds versus age, non-smokers]

scatter e6 age if smoke ==1, msymbol(o) || line ule6 lle6 age if smoke ==1, xlabel(10(5)40) title("Smokers")

[Figure: linear prediction with upper and lower 95% bounds versus age, smokers]

Including interaction terms

gen lwd = (lwt<110)
tab low lwd, col
Key: frequency, column percentage
lwd
low 0 1 Total
BWT >= 2500g
BWT < 2500g
Total

. xi: logit low i.lwd*age, nolog
i.lwd _Ilwd_0-1 (naturally coded; _Ilwd_0 omitted)
i.lwd*age _IlwdXage_# (coded as above)
Logistic regression Number of obs = 189 LR chi2(3) = Prob > chi2 =
Log likelihood = Pseudo R2 =
low Coef. S.E. z P>|z| [95% CI]
_Ilwd_1
age
_IlwdXage_1
_cons

log(odds of low bwt) = lwd + age + lwd x age + error term

predict el, xb
quietly xi: logit low i.lwd age
predict e2, xb
scatter el age, msymbol(o) || scatter e2 age, msymbol(x), xlabel(10(5)45)

[Figure: linear predictions from the models with and without the lwd x age interaction]

Calculating the OR in the presence of an interaction:

(i) one continuous and one categorical variable:

log(odds of low bwt) = lwd + age + lwd x age + error term

Given age 20:
log(odds of low bwt among lwd >= 110) = constant + (age coefficient) x 20, since the lwd and interaction terms vanish when lwd = 0
log(odds of low bwt among lwd < 110) = constant + (lwd coefficient) x 1 + (age coefficient) x 20 + (interaction coefficient) x 1 x 20

log(OR) = log(odds of low bwt among lwd < 110) - log(odds of low bwt among lwd >= 110)
= (lwd coefficient) + (interaction coefficient) x 1 x 20
= 1.94 + (interaction coefficient) x 1 x 20 = 0.66
OR = exp(0.66) = 1.93

. lincom _Ilwd_1 + 20*_IlwdXage_1, or
( 1) _Ilwd_1 + 20 _IlwdXage_1 = 0
low OR S.E. z P>|z| [95% CI]
(1)
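The hand calculation above, which lincom automates, can be sketched in Python. The interaction coefficient (-0.064) is back-solved here from the two numbers quoted in the text (1.94 for lwd and a log-OR of 0.66 at age 20), so treat both values as illustrative rather than as the exact fitted coefficients:

```python
import math

# b_lwd = 1.94 is quoted in the text; b_int = -0.064 is back-solved so that
# b_lwd + 20 * b_int = 0.66, the log-OR at age 20 given in the text.
b_lwd, b_int = 1.94, -0.064

def or_lwd(age):
    """OR of low birth weight, lwd < 110 lb versus lwd >= 110 lb, at `age`.
    With an interaction, the OR depends on the continuous covariate:
    exp(b_lwd + age * b_int)."""
    return math.exp(b_lwd + age * b_int)

print(round(or_lwd(20), 2))  # → 1.93
```

This is why a single OR for lwd cannot be quoted when the interaction is in the model: the lincom commands at ages 20 and 25 give different answers.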

. lincom _Ilwd_1 + 25*_IlwdXage_1, or
( 1) _Ilwd_1 + 25 _IlwdXage_1 = 0
low OR S.E. z P>|z| [95% CI]
(1)

(ii) two categorical variables:

. xi: logit low i.smoke*i.race, nolog
LR chi2(5) = Prob > chi2 =
low Coef. S.E. P>|z|
_Ismoke_1
_Irace_2
_Irace_3
_IsmoXrac_~2
_IsmoXrac_~3
_cons

log(odds of low bwt) = smoke + black + other + smoke x black + smoke x other + error term

Given a non-smoker:
log(odds of low bwt among white) = 0.77
log(odds of low bwt among black) =
log(OR black vs white) = 1.51

Given a smoker:
log(odds of low bwt among white) =
log(odds of low bwt among black) =
log(OR black vs white) =

. lincom _Irace_2 + _IsmoXrac_1_2
( 1) _Irace_2 + _IsmoXrac_1_2 = 0
low Coef. Std. Err. P>|z| [95% CI]
(1)

. lincom _Irace_2 + _IsmoXrac_1_2, or
low Odds Ratio Std. Err. [95% CI]
(1)

Testing coefficients equal to zero using Wald statistics:

. test _IsmoXrac_1_2 _IsmoXrac_1_3
( 1) _IsmoXrac_1_2 = 0
( 2) _IsmoXrac_1_3 = 0
chi2( 2) = 3.02
Prob > chi2 =

Diagnostics

Goodness-of-fit

To assess lack of fit of a model when continuous explanatory variables are present, one needs to group observations as discussed below. This grouping results in chi-squared statistics with better validity. However, if there are several continuous variables in your model, then simultaneously grouping these covariates will lead to a large contingency table with small counts. One way to avoid such a situation is to group observations according to their predicted values. Hosmer and Lemeshow (1989) devised a Pearson-like statistic based on such a partitioning. Their statistic can be referred to a chi-squared distribution with df = number of groups - 2. Using lfit or estat gof after you fit your model, and specifying the group option, you can obtain the Hosmer-Lemeshow type statistics.

. xi: logit low i.smoke i.race lwt
low Coef. Std. Err. P>|z|
_Ismoke_1
_Irace_2
_Irace_3
lwt
_cons

. estat gof
Logistic model for low, goodness-of-fit test
number of observations = 189
number of covariate patterns = 132
Pearson chi2(127) = Prob > chi2 =

. estat gof, table group(10)
Logistic model for low, goodness-of-fit test
(Table collapsed on quantiles of estimated probabilities)
Group Prob Obs_1 Exp_1 Obs_0 Exp_0 Total
number of observations = 189
number of groups = 10
Hosmer-Lemeshow chi2(8) = 7.35 Prob > chi2 =
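A minimal sketch of the Hosmer-Lemeshow computation: within each group, compare the observed number of successes with the expected number (group size times mean fitted probability). The group summaries below are hypothetical, not the actual lowbwt deciles:

```python
def hosmer_lemeshow(groups):
    """Hosmer-Lemeshow statistic from per-group summaries.

    Each group is (n, observed_successes, mean_predicted_probability); the
    statistic sums (O - E)^2 / (E * (1 - pbar)) over groups and is referred
    to a chi-squared distribution with (number of groups - 2) df."""
    stat = 0.0
    for n, obs, pbar in groups:
        expected = n * pbar
        stat += (obs - expected) ** 2 / (expected * (1.0 - pbar))
    return stat

# Hypothetical deciles-of-risk summaries (NOT the actual lowbwt groups)
groups = [(19, 2, 0.12), (19, 4, 0.20), (19, 6, 0.31), (19, 7, 0.35), (19, 9, 0.45)]
print(round(hosmer_lemeshow(groups), 2))  # → 0.13
```

A small statistic relative to its chi-squared reference (as with the 7.35 on 8 df above) indicates no evidence of lack of fit.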

ROC (Receiver Operating Characteristic) & Discrimination

Once a logit model is fitted to a data set, one can compute the probability of success at different levels of the predictors. Remember that we can show that

p(Y = 1) = 1 / (1 + exp[-(α + βx)]),

if we assume that

log[ p(Y = 1) / (1 - p(Y = 1)) ] = α + βx.

However, the outcome variable is binary. Therefore, when we calculate the sensitivity and specificity of the model 2, we use the following prediction rule: those with p(Y = 1) >= 0.5 are predicted to have the disease, and those with probability less than 0.5 are predicted not to have the disease. The choice of the cut-off point 0.5 is based on statistical considerations. However, we can try other cut-off points and calculate, in each case, the corresponding specificity and sensitivity. The optimal cut-off point is then the point that maximizes both specificity and sensitivity. One way to decide on this is to plot the cut-off points versus sensitivity and specificity and see where the two curves intersect; the point of intersection represents the optimal cut-off point.

To determine how good the model is at discriminating between the two categories (that is, if the association between the predictor and the outcome is positive, then when we observe a higher value of the predictor we expect the outcome), we use the ROC curve. This curve is obtained by plotting sensitivity versus 1 - specificity for all possible values of the cut-off point. We then determine the area under this curve. If the area is 0.5, the model does not discriminate well; it is no better than basing your decision on the flip of a coin. If the area ranges from 0.7 to 0.8, this is considered moderate discrimination.

2 Sensitivity = proportion of times the model predicts a positive when it is actually a positive. Specificity = proportion of times the model predicts a negative when it is actually a negative.
If the area ranges from 0.8 to 0.9, this is considered very good/excellent discrimination. If the area is greater than or equal to 0.9, this is considered outstanding discrimination (see Hosmer and Lemeshow, 2000). First, let us look at how good our model is at discriminating between the two categories. To achieve this we use estat classification (Stata 9) or lstat (Stata 8).

. estat classification
Logistic model for low
True
Classified D ~D Total
+
-
Total

Classified + if predicted Pr(D) >= .5
True D defined as low != 0
Sensitivity Pr( +| D) 15.25%
Specificity Pr( -|~D) 93.85%
Positive predictive value Pr( D| +) 52.94%
Negative predictive value Pr(~D| -) 70.93%
False + rate for true ~D Pr( +|~D) 6.15%
False - rate for true D Pr( -| D) 84.75%
False + rate for classified + Pr(~D| +) 47.06%
False - rate for classified - Pr( D| -) 29.07%
Correctly classified 69.31%

. lsens, nograph genprob(p1) gensens(se1) genspec(sp1)

. estat classification, cutoff( )
Logistic model for low
True
Classified D ~D Total
+
-
Total
Classified + if predicted Pr(D) >=
True D defined as low != 0
Sensitivity Pr( +| D) 64.41%
Specificity Pr( -|~D) 62.31%
Positive predictive value Pr( D| +) 43.68%
Negative predictive value Pr(~D| -) 79.41%
False + rate for true ~D Pr( +|~D) 37.69%
False - rate for true D Pr( -| D) 35.59%
False + rate for classified + Pr(~D| +) 56.32%
False - rate for classified - Pr( D| -) 20.59%
Correctly classified 62.96%

. lsens

[Figure: sensitivity and specificity versus probability cutoff]
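The reported percentages at cutoff 0.5 can be reproduced from the underlying confusion counts. The counts below (9 true positives, 50 false negatives, 122 true negatives, 8 false positives) are reconstructed from the percentages and the marginals (189 observations, 59 low birth weight cases), so treat them as a consistency check rather than as raw Stata output:

```python
# Reconstructed confusion counts at cutoff 0.5 (see lead-in for assumptions)
tp, fn = 9, 50    # true D (59 cases): predicted +, predicted -
tn, fp = 122, 8   # true ~D (130 non-cases): predicted -, predicted +

sensitivity = tp / (tp + fn)               # Pr(+ | D)
specificity = tn / (tn + fp)               # Pr(- | ~D)
ppv = tp / (tp + fp)                       # Pr(D | +)
npv = tn / (tn + fn)                       # Pr(~D | -)
accuracy = (tp + tn) / (tp + fn + tn + fp)

print(f"{sensitivity:.2%} {specificity:.2%} {accuracy:.2%}")
# → 15.25% 93.85% 69.31%
```

Note how the default cutoff of 0.5 trades sensitivity for specificity here: few cases are caught, but few non-cases are flagged.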

. lroc

[Figure: ROC curve, sensitivity versus 1 - specificity]

Area under ROC curve =
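The area under the ROC curve has a useful rank interpretation: it is the probability that a randomly chosen case receives a higher predicted probability than a randomly chosen non-case. A small Python sketch with hypothetical scores (not output from the lowbwt model):

```python
def auc(case_scores, noncase_scores):
    """Area under the ROC curve via the rank (Mann-Whitney) identity: the
    probability that a randomly chosen case gets a higher predicted
    probability than a randomly chosen non-case, counting ties as half."""
    wins = 0.0
    for c in case_scores:
        for n in noncase_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(case_scores) * len(noncase_scores))

# Hypothetical predicted probabilities
print(auc([0.9, 0.6, 0.55], [0.5, 0.4, 0.3]))  # → 1.0 (perfect separation)
print(auc([0.5, 0.5], [0.5, 0.5]))             # → 0.5 (no discrimination)
```

An area of 0.5 corresponds exactly to the coin-flip benchmark described above.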

Outliers/Influential Observations

Outliers are observations that have large residuals, that is, observations that fit poorly. Note that outliers are not necessarily influential.

Pregibon's dbeta (Cook's distance)

Influential observations are observations that have a large effect on the estimated parameters. The effect of such observations is determined by examining the change in the estimated regression coefficients that occurs when they are deleted. This effect is measured using what is known as Cook's distance. For logistic regression, Stata reports an approximation of this measure, proposed by Pregibon (1981), via the predict command after fitting the regression model. You can plot these measures versus the predicted probability in order to determine which observations are influential. The Stata commands that you need are as follows:

predict dbnewvar, dbeta
predict pnewvar, p
predict cnnewvar, number
scatter dbnewvar pnewvar, mlabel(cnnewvar) mlabposition(0) msymbol(i)

For our example, we used the following commands:

predict plowsrlwt, p
predict dblowsrlwt, dbeta
predict cnlowsrlwt, number
scatter dblowsrlwt plowsrlwt, mlabel(cnlowsrlwt) mlabposition(0) msymbol(i)

[Figure: Pregibon's dbeta versus Pr(low)]

One can also look at two other measures: one is the change in the Pearson chi-square statistic and the other is the change in the deviance (difference of log-likelihood) statistic when an observation is deleted. The larger the change, the greater the influence of the observation. The Stata commands needed to generate these measures are:

Change in the deviance (difference of log-likelihood) statistic when an observation is deleted:

predict ddlowsrlwt, dd
scatter ddlowsrlwt plowsrlwt, mlabel(cnlowsrlwt) mlabposition(0) msymbol(i)

[Figure: H-L dd versus Pr(low)]

Change in the Pearson chi-square statistic when an observation is deleted:

predict dx2lowsrlwt, dx2
scatter dx2lowsrlwt plowsrlwt, mlabel(cnlowsrlwt) mlabposition(0) msymbol(i)

[Figure: H-L dx^2 versus Pr(low)]

Pearson's Residuals

Pearson residuals are given by

P_i = (o_i - n_i π_i) / sqrt( n_i π_i (1 - π_i) ).

The Pearson statistic is the sum of the squares of all such components. If n_i is large, then P_i has an approximate normal distribution. When the model holds, its expected value is zero; however, its variance is smaller than that of a standard normal variable. If the number of parameters in the model is small compared to the sample size, then Pearson residuals are treated like standard normal deviates.

The standardised Pearson residual is given by

St P_i = P_i / sqrt(1 - h_ii),

where h_ii is the leverage (influence) of observation i on the estimates. It is worth noting that the standardised Pearson residual is slightly larger in absolute value and has an approximate standard normal distribution.

To obtain Pearson residuals and standardized Pearson residuals from Stata, use the following commands, respectively, after a logit command:

predict rnewvar, r
predict rstnewvar, rst

The following are the graphs of the Pearson residuals and standardized Pearson residuals versus predicted probability.

predict rlowsrlwt, r
scatter rlowsrlwt plowsrlwt, mlabel(cnlowsrlwt) mlabposition(0) msymbol(i)

[Figure: Pearson residual versus Pr(low)]
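A direct Python transcription of the two residual formulas, with purely illustrative numbers:

```python
import math

def pearson_residual(o, n, pi):
    """Pearson residual for a covariate pattern: o observed successes out of
    n trials against fitted probability pi."""
    return (o - n * pi) / math.sqrt(n * pi * (1.0 - pi))

def std_pearson_residual(o, n, pi, h):
    """Standardised Pearson residual: divide by sqrt(1 - h), where h is the
    leverage of the covariate pattern."""
    return pearson_residual(o, n, pi) / math.sqrt(1.0 - h)

# Illustrative numbers: 7 successes in 10 trials with fitted pi = 0.5
print(round(pearson_residual(7, 10, 0.5), 2))           # → 1.26
print(round(std_pearson_residual(7, 10, 0.5, 0.2), 2))  # → 1.41
```

As the text notes, the standardised version is slightly larger in absolute value than the raw residual whenever the leverage h is positive.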

predict rstlowsrlwt, rst
scatter rstlowsrlwt plowsrlwt, mlabel(cnlowsrlwt) mlabposition(0) msymbol(i)

[Figure: standardized Pearson residual versus Pr(low)]

Deviance Residuals

The deviance is a measure of the discrepancy between the observed data and the expected values under the proposed model. It is the sum of the squares of the residuals d_i, obtained as

d_i = ± sqrt( 2 [ o_i ln(o_i / e_i) + (n_i - o_i) ln( (n_i - o_i) / (n_i - e_i) ) ] ),

where o_i is the observed number of positive responses, e_i is the expected number under the model, and the sign of d_i matches the sign of (o_i - e_i). As we add more terms to our regression, we continue to reduce the value of the deviance.

To obtain deviance residuals from Stata, use the following command after a logit command:

predict devnewvar, dev

The following is the graph of the deviance residuals versus predicted probability.

predict devlowsrlwt, dev
scatter devlowsrlwt plowsrlwt, mlabel(cnlowsrlwt) mlabposition(0) msymbol(i)

[Figure: deviance residual versus Pr(low)]
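A Python sketch of the deviance residual formula, signed by (o_i - e_i) and treating the 0 ln 0 terms as zero (the numbers are illustrative only):

```python
import math

def deviance_residual(o, n, e):
    """Deviance residual for a covariate pattern: o observed successes, n
    trials, e expected successes under the model. Terms with a zero count
    are taken as zero (the 0*log(0) convention), and the residual carries
    the sign of (o - e)."""
    d2 = 0.0
    if o > 0:
        d2 += 2.0 * o * math.log(o / e)
    if n - o > 0:
        d2 += 2.0 * (n - o) * math.log((n - o) / (n - e))
    return math.copysign(math.sqrt(max(d2, 0.0)), o - e)

# Illustrative numbers: 7 successes observed where the model expects 5
print(round(deviance_residual(7, 10, 5.0), 2))  # → 1.28
```

When the observed count equals the expected count, the residual is exactly zero, and it grows in magnitude as the pattern fits worse.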

Pseudo-R2

In linear regression, R measures the correlation between predicted and observed values and R2 measures the variation explained by the fitted model. The Pseudo-R2 reported here is McFadden's R2, also known as the likelihood-ratio index. It is

1 - (log-likelihood of the full model) / (log-likelihood of the model with the intercept only).

It increases as the number of parameters increases. There is an adjusted version that accounts for the number of parameters in the model.

Selection of models

This is done using one of the following approaches: backward, forward, or a combination. In Stata this can be done using the command sw and its options, namely:

pr(#) backward selection
pe(#) forward selection
pr(#) pe(#) stepwise

For example: xi: sw logit low i.smoke (i.race) lwt, pr(0.02)

Information Criteria

The likelihood-ratio test is used when we compare nested models. However, when comparing models that cannot be nested within each other, we need different criteria. We will list two that are widely used: Akaike's information criterion (AIC) and the Bayesian information criterion (BIC). Both are based on the likelihood function and the number of parameters involved.

AIC = -2 * log-likelihood + 2 * number of parameters
BIC = -2 * log-likelihood + log(n) * number of parameters

Here n is the sample size. The model with the smaller AIC/BIC is considered the better-fitting model. Raftery (1996) suggested the following guidelines for BIC: if the absolute difference between the two models ranges from 0 to 2, the evidence that one model is better than the other is weak; if the difference is from 2 to 6, the evidence is positive; if the difference is from 6 to 10, the evidence is strong; if the difference is greater than 10, the evidence is very strong.

. xi: logit low i.smoke (i.race) lwt
. est store A
. xi: logit low i.smoke (i.race) age
. est store B
. xi: logit low age ui ftv smoke
. est store C
. est stat
Model Obs ll(null) ll(model) df AIC BIC
A
B
C
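The AIC and BIC formulas above are one-liners; a Python sketch with a made-up log-likelihood (not a value from the est stat table) for a 4-parameter model on n = 189:

```python
import math

def aic(loglik, n_params):
    """Akaike's information criterion: -2*ll + 2*k."""
    return -2.0 * loglik + 2.0 * n_params

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: -2*ll + log(n)*k."""
    return -2.0 * loglik + math.log(n_obs) * n_params

# Hypothetical fitted log-likelihood for a 4-parameter model on n = 189
ll, k, n = -111.5, 4, 189
print(round(aic(ll, k), 1), round(bic(ll, k, n), 2))  # → 231.0 243.97
```

Because log(189) is larger than 2, BIC penalises each extra parameter more heavily than AIC, which is why it tends to favour smaller models.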

Looking Further (For your information only)

Logistic Regression for Polytomous Data

A polytomous variable is a variable with more than two categories. If the response variable is polytomous, one can generalize the ideas of binary logistic regression to cater for polytomous response variables. Suppose that the response variable has K categories; then the probability of belonging to one and only one category is Pr(Y = i | explanatory variables x) = π_i(x), where i = 1, ..., K. In addition, π_1(x) + π_2(x) + ... + π_K(x) = 1. To proceed in the analysis, we need to choose a reference category; suppose we choose the first category. Then, based on this category, one can apply the logistic regression approach by fitting

log[ π_i(x) / π_1(x) ] = β_i0 + β_i1 x_1 + β_i2 x_2 + ... + β_im x_m, for i = 2, ..., K.

Therefore, we can fit K - 1 equations for the association between the 1st category and the other K - 1 categories. Hence, we need to estimate K - 1 sets of the m regression parameters. The Stata command that is used to analyze such data is the mlogit command.

Logistic Regression for Ordinal Data

For ordinal categorical outcomes, there are several models one can look at that exploit the fact that the outcome is ordinal. The following is a list of the most commonly used ones:

I. Adjacent category.
II. Continuation ratio.
III. Proportional odds models.

However, there are others that are not commonly used; here is a list of some of them:

Unconstrained partial-proportional odds.
Constrained partial-proportional odds.
Stereotype.

In what follows, I provide a brief description of each of the three commonly used models. Assume here that the outcome has (K + 1) categories, indexed by k = 0, ..., K.

I- The adjacent category model compares each outcome to the next larger (smaller) outcome:

a_k(x) = ln[ P(Y = k | x) / P(Y = k - 1 | x) ] = ln[ π_k(x) / π_{k-1}(x) ] = α_k + x^T β.

Note that this is a constrained version of the multinomial logit (sometimes known as the baseline logit), since we force the corresponding betas across the different models to be equal.

II- The continuation ratio model compares each outcome to all lower (higher) outcomes, Y = k versus Y < k:

b_k(x) = ln[ P(Y = k | x) / P(Y < k | x) ] = θ_k + x^T β_k.

Here it is as if we are fitting K ordinary binary logistic regression models; note that this model can be constrained such that β_k = β.

Remark: There are instances where you might want the higher group as your reference group and others where you might want the lower group as your reference group. Three such examples are: a depression scale (low as reference), the Apgar scale (high as reference), and categorized low birth weight (higher weight category as reference category).

III- The proportional odds model compares the outcome belonging to category k or below to the outcome belonging to a category greater than k, or vice versa:

c_k(x) = ln[ P(Y <= k | x) / P(Y > k | x) ] = ζ_k - x^T β.

The negative sign is used in order to be consistent with Stata; see Hosmer and Lemeshow (2000). Note that Pr(Y <= k) = π_0(x) + ... + π_k(x) and Pr(Y > k) = 1 - Pr(Y <= k) = π_{k+1}(x) + ... + π_K(x). Note also that when the betas are not constrained in the above model, we refer to the model as the cumulative logit.

The following gives the commands that you need to use in Stata to fit the above models.

I: Fitting the adjacent category model:

mlogit outcome predictor1 predictor2
constraint define 1 [2]outcome = 2*[1]outcome
constraint define 2 [3]outcome = 3*[1]outcome
...
mlogit outcome predictor1 predictor2, constraint(1 2 )

II: Fitting the unconstrained continuation model: a series of logit models.

III: Proportional odds model:

ologit outcome predictor1 predictor2

Appendix A: Binomial Distribution

Example: A study reported that the probability of conceiving after undergoing an IVF treatment is 80 out of 100. Given a sample of size 10, the probability of observing 7 women conceiving is calculated as follows:

1. The probability of a woman conceiving is 0.8; this implies that the probability of 7 women conceiving is 0.8 x 0.8 x 0.8 x 0.8 x 0.8 x 0.8 x 0.8 = 0.8^7 = 0.2097.

2. If 7 women conceived, this means that 3 women did not conceive. The probability of a woman not conceiving is 1 - 0.8 = 0.2. Therefore, the probability of 3 women not conceiving is 0.2 x 0.2 x 0.2 = 0.2^3 = 0.008.

3. If we have 10 women, then there are 120 ways in which the 7 that conceived and the 3 that did not conceive could have been arranged. To get the 120 we use the formula n! / [ y! (n - y)! ], which gives the number of ways you could have observed y out of n observations. Note that n! = 1 x 2 x 3 x ... x n; for example, 3! = 1 x 2 x 3 = 6.

4. Therefore, the probability of observing 7 women conceiving is 120 x 0.2097 x 0.008 = 0.201.

If we have 10 women, then we could observe one of the following outcomes: no women conceiving (which is highly unlikely, given that the rate of success is 0.8, but possible), 1 woman conceiving, 2 women conceiving, 3 women conceiving, ..., or 10 women conceiving. For each of these cases we can calculate the probability of observing such an outcome; Table 1 lists all these probabilities and Figure 1 gives the graphical representation.

The probability distribution listed in Table 1 and presented in Figure 1 is referred to as the Binomial distribution with 10 trials (usually denoted by n) and probability of success 0.8 (usually denoted by p when we are talking about a sample, where p stands for proportion, and π, pronounced pi, a Greek letter, when we are talking about a population).

Table 1: Binomial distribution with n = 10 and π = 0.8
y Probability
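The whole calculation is one line with the binomial formula; in Python, math.comb supplies the 120 arrangements:

```python
import math

n, p, y = 10, 0.8, 7  # 10 women, success probability 0.8, 7 conceiving

arrangements = math.comb(n, y)              # n! / (y! (n - y)!) = 120
prob = arrangements * p**y * (1 - p)**(n - y)

print(arrangements, round(prob, 3))         # → 120 0.201
print(n * p, round(n * p * (1 - p), 1))     # mean 8.0 and variance 1.6
```

Looping y over 0 to 10 in the same way reproduces the full distribution shown in Table 1 and Figure 1.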

[Figure 1: Binomial distribution with n = 10 and pi = 0.8]

Based on Table 1 and Figure 1, we note that the most likely events are those around 8, which is the expected value if the probability of success is 0.8 and we have 10 women (10 x 0.8 = 8). In general, for a binomial distribution, the mean/average number of successes (or expected value) is obtained by n x p. The variance is obtained by n x p x (1 - p).

STATA: Figure 1 and the contents of Table 1 were obtained using the following Stata commands. The starred lines are comments explaining what the previous command is doing; if you want to run the commands, there is no need to include them.

clear
* clears any previous data from STATA's memory that you might have been using
set obs 11
* allocates 11 spaces in STATA's memory. Can you guess why we need 11, not 10?
egen y = seq(), from(0) to(10)
* generates a sequence from 0 to 10 and puts it in the variable y
generate CBinY = Binomial(10,y,80/100)
* STATA does not have a function that directly generates the Binomial
* probabilities that we need, but it has one that generates the upper
* cumulative probabilities. For example, the upper cumulative probability
* at 6 is the probability that we observe 6 or more women conceiving. This
* is usually computed by adding the probability of observing 6 women to
* that of observing 7, 8, 9, and 10 women conceiving. This is what the
* function Binomial(10,y,80/100) does. The information that you should
* provide is: n = 10 in this case, then y, the sequence of all possible
* outcomes (or the ones you are interested in), and then p = 80/100. STATA
* then calculates for each value of y the upper cumulative probability.
label var CBinY "Pr(Y >= y)"
* label for the newly generated variable, to remind us what it is
generate CBinYp1 = Binomial(10,y+1,80/100)
* Next, we generate a variable that holds similar information to CBinY but
* for y+1 instead of y; for example, instead of 6 successes we look at 7.
* The reason is as follows: since Binomial creates upper cumulative
* probabilities, for y = 6 it adds the probabilities of 6, 7, 8, 9, and 10,
* and for y = 7 it adds the probabilities of 7, 8, 9, and 10. So if we
* subtract these two values, we get the probability of observing exactly 6.
* This is what this command and the next generate command achieve.
label var CBinYp1 "Pr(Y >= y+1)"
generate BinY = CBinY - CBinYp1
label var BinY "Pr(Y=y)"
l y BinY
* lists y and its probability
scatter BinY y
* scatter plot of y and its probability

Exercise (optional): Given a sample of size 50, what will be the probability that
(i) forty-four women will conceive?
(ii) six women will not conceive?
(iii) a maximum of ten women will not conceive?
(iv) between four and six women will not conceive?

Figure 2 represents a number of Binomial distributions with varying numbers of trials across columns (the number of trials can be read from the x-axis of each graph) and varying probabilities of success across rows. The graph was created using R.

Figure 2: A series of Binomial distributions.

Hypothesis testing for a proportion

Example 1: If in a sample of 7000 adults 2000 were smokers, test the hypothesis H0: π = 0.3 against the alternative HA: π ≠ 0.3. Using STATA's command bitesti we get the following:

Output 1
N Observed k Expected k Assumed p Observed p
7000 2000 2100 0.3000 0.2857
Pr(k >= 2000) = (one-sided test)
Pr(k <= 2000) = (one-sided test)
Pr(k <= 2000 or k >= 2201) = (two-sided test)

In the above, bitesti instructs STATA that you want a binomial test, as we are dealing with proportions. The i at the end of bitesti tells STATA that you have immediate input: STATA should not expect a variable, but rather the number of trials (n = 7000), the number of successes (2000), and the assumed (hypothetical) proportion (0.3 in this case). Note that this is one of the advantages that STATA has over SPSS: for some commands you need not have a dataset available and can use immediate forms.

In case you had the original dataset available, with the variable smoke coded 0 for non-smokers and 1 for smokers, you would have 2000 ones and 5000 zeros. To test whether the proportion of smokers is 0.3 you would use the command

bitest smoke = 0.3

The output for this command would be exactly like the previous one, possibly with the name of the variable added somewhere.

In Output 1, N represents the number of trials (observations). Observed k represents the number of successes, in this case smokers, which is 2000. Assumed p is the hypothetical value, 0.3. Expected k is the number of smokers that you would expect if the true proportion were 0.3, so 7000 * 0.3 = 2100. Observed p is the proportion that you observe based on your sample, 2000/7000 = 0.2857. The lines starting with Pr are P-values associated with the three possible sets of hypotheses that you might want to look at. In our case we are interested in whether the proportion is different from 0.3, that is, a two-sided test.
Therefore, the P-value for our test is the two-sided one. The numbers reported in Pr(k <= 2000 or k >= 2201) relate to the calculation of the P-value, where we look at the probability of obtaining a statistic that is as or more extreme than what we observed (2000 successes, or a proportion of 0.286) if the null hypothesis is true. So if we plot the Binomial

26 distribution with n = 7000 and p = 0.3 and calculate the probability of observing 2000 smokers it turns out to be (I calculated this probability by using the STATA command display Binomial( 7000, 2000, 0.3) - Binomial( 7000, 2001, 0.3) Note that the command display makes STATA behaves like a calculator). Now if we look at the graph of the Binomial distribution with n = 7000 and p = 0.3 (see Figure 2 for similar examples) we will find out that any number less than or equal to 2000 and any number that is greater than or equal to 2201 has a probability less than or equal to Note that the probability for 2201 is which is less than and was computed by display Binomial( 7000, 2201, 0.3) - Binomial( 7000, 2202, 0.3) whereas the probability for 2200 is which is greater than and was computed by display Binomial( 7000, 2200, 0.3) - Binomial( 7000, 2201, 0.3) This is the reason you see Pr(k <= 2000 or k >= 2201). Based on the P-value of we have strong evidence to reject the null hypothesis that the population proportion which this sample represent is equal to 0.3. Hence, we conclude that the population proportion is significantly different than 0.3 and is actually higher. You could have used an option of bitest in order to see these details. bitesti , detail N Observed k Expected k Assumed p Observed p Pr(k >= 2000) = (one-sided test) Pr(k <= 2000) = (one-sided test) Pr(k <= 2000 or k >= 2201) = (two-sided test) Pr(k == 2000) = (observed) Pr(k == 2200) = Pr(k == 2201) = (opposite extreme) If you were interested in a one tailed hypothesis then one could have explored on of the following A. H : π 0. 3 against the alternative H : π. > 0. 3 The P-value 0 associated with this set of hypotheses would be Pr(k <= 2000) = (one-sided test). I will leave it to you to interpret this P-value. B. H : π 0. 3 against the alternative H : π. < 0. 3 The P-value 0 associated with this set of hypotheses would be Pr(k >= 2000) = (one-sided test). 
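Each display Binomial(...) call above computes a point probability Pr(K = k) as a difference of two upper-tail probabilities. The same comparison can be sketched in Python (binom_pmf is our own helper, not a Stata function):

```python
from math import lgamma, exp, log

def binom_pmf(k, n, p):
    """Pr(K = k) for K ~ Binomial(n, p); the log scale avoids overflow in the factorials."""
    return exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))

n, p0 = 7000, 0.3
p_2000 = binom_pmf(2000, n, p0)   # the observed count
p_2200 = binom_pmf(2200, n, p0)   # still more probable than the observed count
p_2201 = binom_pmf(2201, n, p0)   # the first count in the opposite tail
print(p_2200 > p_2000 > p_2201)   # hence the region Pr(k <= 2000 or k >= 2201)
```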
I will leave it to you to interpret this P-value.

The above is the exact test, as we are using the Binomial distribution. However, before the advent of super-fast computers, a calculation that you have just performed by pressing a button would have taken a substantial amount of time. Therefore, other alternatives were needed. In what follows I list some.

Three test statistics

Score statistic, which uses a standard error (S.E.) based on the hypothetical value and is computed as follows:

z = (p − π0) / √( π0 (1 − π0) / n )

As n increases, the distribution of z tends to the standard normal. This is more powerful if the null hypothesis is true.

Wald statistic, which uses an S.E. based on the data and is computed as follows:

z = (p − π0) / √( p (1 − p) / n )

As n increases, the distribution of z tends to the standard normal. Note that the score statistic's sampling distribution is closer to the standard normal than that of the Wald statistic.

Example 2: If in a sample of 7000 adults 700 were smokers, test the hypothesis H0: π = 0.3 against the alternative HA: π ≠ 0.3. Now, p = 700/7000 = 0.1. Using the score statistic, we have

z = (0.1 − 0.3) / √( 0.3 × 0.7 / 7000 ) = −36.51

Using the Wald statistic, we have

z = (0.1 − 0.3) / √( 0.1 × 0.9 / 7000 ) = −55.78

Referring both results to the standard normal leads to rejection of the null hypothesis.

The likelihood ratio test: see the section on maximum likelihood estimation.

Confidence interval for a proportion

You can also construct confidence intervals, using either the exact Binomial distribution or the Wald approach for a Binomial distribution. You can employ STATA to do that for you.

. cii 7000 2000

                                  -- Binomial Exact --
Variable |   Obs     Mean   Std. Err.   [95% Conf. Interval]

. cii 7000 2000, wald
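The two statistics in Example 2 can be reproduced directly from their formulas. A quick Python check (the helper names score_z and wald_z are ours):

```python
from math import sqrt

def score_z(p, pi0, n):
    """Score statistic: the S.E. is evaluated at the hypothesised proportion pi0."""
    return (p - pi0) / sqrt(pi0 * (1 - pi0) / n)

def wald_z(p, pi0, n):
    """Wald statistic: the S.E. is evaluated at the observed proportion p."""
    return (p - pi0) / sqrt(p * (1 - p) / n)

n, pi0 = 7000, 0.3
p = 700 / n
print(round(score_z(p, pi0, n), 2))  # -36.51
print(round(wald_z(p, pi0, n), 2))   # -55.78
```

Both statistics lie far in the tail of the standard normal, matching the rejection in the text; note how using p rather than pi0 in the standard error makes the Wald statistic even more extreme here.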

                                  -- Binomial Wald --
Variable |   Obs     Mean   Std. Err.   [95% Conf. Interval]

In the above two commands we have employed the immediate form of ci, so if the variable were available we would have used

ci smoke, binomial

The two outputs are similar except for the CI, which varies slightly, as one is based on the exact distribution whereas the other uses the normal approximation. The Obs column gives how many observations were seen, 7000; the Mean column gives the proportion observed, 2000/7000 ≈ 0.286; the Std. Err. column gives the associated standard error, computed as √( p (1 − p) / n ) ≈ 0.0054; and then a CI is computed based on either the Binomial or the Normal distribution. The confidence interval can also aid us in making a decision about whether the population proportion is different from 0.3 (or any other value that we propose). If 0.3 is in the interval, then we fail to reject that the population proportion is equal to 0.3. However, if 0.3 is not in the interval, which is the case here, we reject the null hypothesis of no difference from 0.3 and conclude that the population proportion is significantly different from 0.3.

The Wald 100(1 − α)% confidence interval for π can be constructed as p ± z_{α/2} S.E.(p) if n is large. For the above example, a 95% Wald CI is 0.2857 ± 1.96 × 0.0054 = (0.2751, 0.2963). Note, if π < 0.2 or π > 0.8 then constructing a Wald C.I. does not work well.

(You can ignore this. For your information only.) Formal definition of the Binomial distribution: Given n independent and identically distributed (IID) trials with two possible outcomes, a success and a failure (Bernoulli trials), the number of successes, Y, in the n trials follows a Binomial distribution. If π is the probability of success, then

Pr(Y = y) = [ n! / ( y! (n − y)! ) ] π^y (1 − π)^(n − y),  for y = 0, …, n.   (1)

Note that mean(Y) = nπ and Var(Y) = nπ(1 − π). The quantity n! / ( y! (n − y)! ), where n! = 1 × 2 × 3 × … × n, gives you the number of ways you could have observed y successes out of n observations. Given that the probability of a success is π, if we observe y successes we do so with probability π^y. If we have n trials and y of them are successes, then n − y are failures; since the probability of failure is 1 − π, the probability of n − y failures is (1 − π)^(n − y). Hence, in a sample of n, we observe y successes (or n − y failures) with the probability given in (1).
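The Wald interval reported by cii above can be reproduced in a few lines. A Python sketch (wald_ci is our own helper, not a Stata command):

```python
from math import sqrt

def wald_ci(k, n, z=1.96):
    """95% Wald confidence interval for a proportion: p +/- z * sqrt(p(1-p)/n)."""
    p = k / n
    se = sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

lo, hi = wald_ci(2000, 7000)
print(round(lo, 4), round(hi, 4))  # 0.2751 0.2963
# 0.3 lies outside the interval, so we reject H0: pi = 0.3 at the 5% level.
```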

Appendix B: In this section, we will investigate the effect of both α and β on the general shape of the above function using graphical displays.

Figure 3 represents 1 / (1 + exp[−(α + βx)]), where x = count, β = 1 and α = 0; note the s-shape of the graph.

Figure 3: Logit model with α = 0 and β = 1

Figure 4 graphs represent 1 / (1 + exp[−(α + βx)]), where x = count, β = 1 and α varies over α = −5, −3, −2, −1, 0, 1, 2, 3, 5, with the curve at the far right corresponding to α = −5 and the curve at the far left corresponding to α = 5. Therefore, as we can see from Figure 4, varying α results in a shift of the curve, but the shape remains the same.

Figure 4: The graphs represent the logit function with varying α, see text for details.

Figure 5 graphs represent 1 / (1 + exp[−(α + βx)]), where α = 0 and β = −3, −0.5, 0.5, 1, 3. The bold dashed and dash-dotted graphs correspond to β = −3 and β = 3, where the s-shaped graph corresponds to β = 3 and the inverted s-shaped graph corresponds to β = −3. The dotted graphs correspond to β = −0.5 (inverted s-shape) and β = 0.5 (s-shape). The solid graph corresponds to β = 1 (s-shape). The higher the absolute value of β, the steeper the slope of the graph. Positive β's result in s-shaped graphs and negative β's result in inverted s-shaped graphs. For a positive β, higher values of the count are associated with a higher probability of having the disease, and lower values of the count are associated with a lower probability of having the disease. Whereas for a negative β, lower values are associated with a higher probability of having the disease, and higher values of the count are associated with a lower probability of having the disease.

Figure 5: The graphs represent the logit function with varying β. See text for details.
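The two properties illustrated in Figures 4 and 5 (varying α shifts the curve; β controls the steepness and direction of the s-shape) can be verified numerically. A small Python check, where the function name logistic is our own:

```python
from math import exp

def logistic(x, a, b):
    """p(x) = 1 / (1 + exp(-(a + b*x))), the inverse of the logit link."""
    return 1.0 / (1.0 + exp(-(a + b * x)))

xs = [x / 10 for x in range(-100, 101)]
rising  = [logistic(x, 0.0, 1.0) for x in xs]    # beta > 0: s-shape
falling = [logistic(x, 0.0, -1.0) for x in xs]   # beta < 0: inverted s-shape
print(all(u < v for u, v in zip(rising, rising[1:])))    # strictly increasing
print(all(u > v for u, v in zip(falling, falling[1:])))  # strictly decreasing

# increasing a by d shifts the beta = 1 curve left by d without changing its shape
d = 2.0
print(abs(logistic(1.0, -3.0 + d, 1.0) - logistic(1.0 + d, -3.0, 1.0)) < 1e-12)
```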


More information

Lab 3: Two levels Poisson models (taken from Multilevel and Longitudinal Modeling Using Stata, p )

Lab 3: Two levels Poisson models (taken from Multilevel and Longitudinal Modeling Using Stata, p ) Lab 3: Two levels Poisson models (taken from Multilevel and Longitudinal Modeling Using Stata, p. 376-390) BIO656 2009 Goal: To see if a major health-care reform which took place in 1997 in Germany was

More information

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011)

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011) Ron Heck, Fall 2011 1 EDEP 768E: Seminar in Multilevel Modeling rev. January 3, 2012 (see footnote) Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October

More information

How To Do Piecewise Exponential Survival Analysis in Stata 7 (Allison 1995:Output 4.20) revised

How To Do Piecewise Exponential Survival Analysis in Stata 7 (Allison 1995:Output 4.20) revised WM Mason, Soc 213B, S 02, UCLA Page 1 of 15 How To Do Piecewise Exponential Survival Analysis in Stata 7 (Allison 1995:Output 420) revised 4-25-02 This document can function as a "how to" for setting up

More information

Introduction to Linear Regression Rebecca C. Steorts September 15, 2015

Introduction to Linear Regression Rebecca C. Steorts September 15, 2015 Introduction to Linear Regression Rebecca C. Steorts September 15, 2015 Today (Re-)Introduction to linear models and the model space What is linear regression Basic properties of linear regression Using

More information

Categorical data analysis Chapter 5

Categorical data analysis Chapter 5 Categorical data analysis Chapter 5 Interpreting parameters in logistic regression The sign of β determines whether π(x) is increasing or decreasing as x increases. The rate of climb or descent increases

More information

ssh tap sas913, sas https://www.statlab.umd.edu/sasdoc/sashtml/onldoc.htm

ssh tap sas913, sas https://www.statlab.umd.edu/sasdoc/sashtml/onldoc.htm Kedem, STAT 430 SAS Examples: Logistic Regression ==================================== ssh abc@glue.umd.edu, tap sas913, sas https://www.statlab.umd.edu/sasdoc/sashtml/onldoc.htm a. Logistic regression.

More information

Chapter 11. Regression with a Binary Dependent Variable

Chapter 11. Regression with a Binary Dependent Variable Chapter 11 Regression with a Binary Dependent Variable 2 Regression with a Binary Dependent Variable (SW Chapter 11) So far the dependent variable (Y) has been continuous: district-wide average test score

More information

Assessing Calibration of Logistic Regression Models: Beyond the Hosmer-Lemeshow Goodness-of-Fit Test

Assessing Calibration of Logistic Regression Models: Beyond the Hosmer-Lemeshow Goodness-of-Fit Test Global significance. Local impact. Assessing Calibration of Logistic Regression Models: Beyond the Hosmer-Lemeshow Goodness-of-Fit Test Conservatoire National des Arts et Métiers February 16, 2018 Stan

More information

Latent class analysis and finite mixture models with Stata

Latent class analysis and finite mixture models with Stata Latent class analysis and finite mixture models with Stata Isabel Canette Principal Mathematician and Statistician StataCorp LLC 2017 Stata Users Group Meeting Madrid, October 19th, 2017 Introduction Latent

More information

Binomial Model. Lecture 10: Introduction to Logistic Regression. Logistic Regression. Binomial Distribution. n independent trials

Binomial Model. Lecture 10: Introduction to Logistic Regression. Logistic Regression. Binomial Distribution. n independent trials Lecture : Introduction to Logistic Regression Ani Manichaikul amanicha@jhsph.edu 2 May 27 Binomial Model n independent trials (e.g., coin tosses) p = probability of success on each trial (e.g., p =! =

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur

Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur Lecture No. # 36 Sampling Distribution and Parameter Estimation

More information

Multiple linear regression

Multiple linear regression Multiple linear regression Course MF 930: Introduction to statistics June 0 Tron Anders Moger Department of biostatistics, IMB University of Oslo Aims for this lecture: Continue where we left off. Repeat

More information

8 Analysis of Covariance

8 Analysis of Covariance 8 Analysis of Covariance Let us recall our previous one-way ANOVA problem, where we compared the mean birth weight (weight) for children in three groups defined by the mother s smoking habits. The three

More information

Lecture 10: Introduction to Logistic Regression

Lecture 10: Introduction to Logistic Regression Lecture 10: Introduction to Logistic Regression Ani Manichaikul amanicha@jhsph.edu 2 May 2007 Logistic Regression Regression for a response variable that follows a binomial distribution Recall the binomial

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Nonlinear Regression Functions

Nonlinear Regression Functions Nonlinear Regression Functions (SW Chapter 8) Outline 1. Nonlinear regression functions general comments 2. Nonlinear functions of one variable 3. Nonlinear functions of two variables: interactions 4.

More information

Lecture 01: Introduction

Lecture 01: Introduction Lecture 01: Introduction Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina Lecture 01: Introduction

More information

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form: Outline for today What is a generalized linear model Linear predictors and link functions Example: fit a constant (the proportion) Analysis of deviance table Example: fit dose-response data using logistic

More information

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification, Likelihood Let P (D H) be the probability an experiment produces data D, given hypothesis H. Usually H is regarded as fixed and D variable. Before the experiment, the data D are unknown, and the probability

More information

11. Generalized Linear Models: An Introduction

11. Generalized Linear Models: An Introduction Sociology 740 John Fox Lecture Notes 11. Generalized Linear Models: An Introduction Copyright 2014 by John Fox Generalized Linear Models: An Introduction 1 1. Introduction I A synthesis due to Nelder and

More information

Simple logistic regression

Simple logistic regression Simple logistic regression Biometry 755 Spring 2009 Simple logistic regression p. 1/47 Model assumptions 1. The observed data are independent realizations of a binary response variable Y that follows a

More information

Statistics in medicine

Statistics in medicine Statistics in medicine Lecture 4: and multivariable regression Fatma Shebl, MD, MS, MPH, PhD Assistant Professor Chronic Disease Epidemiology Department Yale School of Public Health Fatma.shebl@yale.edu

More information

Regression: Main Ideas Setting: Quantitative outcome with a quantitative explanatory variable. Example, cont.

Regression: Main Ideas Setting: Quantitative outcome with a quantitative explanatory variable. Example, cont. TCELL 9/4/205 36-309/749 Experimental Design for Behavioral and Social Sciences Simple Regression Example Male black wheatear birds carry stones to the nest as a form of sexual display. Soler et al. wanted

More information

Linear Modelling in Stata Session 6: Further Topics in Linear Modelling

Linear Modelling in Stata Session 6: Further Topics in Linear Modelling Linear Modelling in Stata Session 6: Further Topics in Linear Modelling Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 14/11/2017 This Week Categorical Variables Categorical

More information