STAT 7030: Categorical Data Analysis

1 STAT 7030: Categorical Data Analysis 5. Logistic Regression Peng Zeng Department of Mathematics and Statistics Auburn University Fall 2012 Peng Zeng (Auburn University) STAT 7030 Lecture Notes Fall / 66

2 Logistic Regression for Binary Response Data

Logistic regression models binary response variables, for which the outcome for each subject is a success or a failure.

Banks predict the probability that a person pays a bill on time using predictors such as the size of the bill, annual income, occupation, mortgage and debt obligations, percentage of bills paid on time in the past, and other aspects of an applicant's credit history.

A company that relies on catalog sales may decide whether to send a catalog to a potential customer by modeling the probability of a sale as a function of indices of past buying behavior.

3 Outline

1 Logistic regression
   Horseshoe crabs (one continuous variable)
   Maternal alcohol consumption (one categorical predictor)
   Checking model adequacy
   Horseshoe crab, revisited
   Neuralgia

4 Horseshoe crabs (one continuous variable) Example: Horseshoe Crabs

The data come from a study of nesting horseshoe crabs. Each female horseshoe crab had a male crab resident in her nest. Satellites are other male crabs residing nearby. Define a binary response:

\[ Y = \begin{cases} 1, & \text{if a female crab has at least one satellite} \\ 0, & \text{if a female crab has no satellite} \end{cases} \]

We want to model this binary response on one continuous predictor (the width of the female crab),

\[ \pi(x) = P(Y = 1 \mid X = x) = 1 - P(Y = 0 \mid X = x). \]

5 Horseshoe crabs (one continuous variable) Nonlinear Relationship Between π(x) and x

Usually, binary data result from a nonlinear relationship between π(x) and x. A fixed change in x often has less impact when π(x) is near 0 or 1 than when π(x) is near 0.5.

[Figure: two panels of π(x) against x, one for β > 0 and one for β < 0]

For example, in the purchase of an automobile, consider the choice between buying new or used. Let π(x) denote the probability of selecting new when annual family income = x. An increase of $50,000 in annual income would have less effect when x = $1,000,000 (for which π(x) is near 1) than when x = $50,000.

6 Horseshoe crabs (one continuous variable) Logistic Regression

Logistic regression assumes a nonlinear relationship between π(x) and x,

\[ \pi(x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}. \]

It implies that π(x) increases (β > 0) or decreases (β < 0) as an S-shaped function of x.

This nonlinear regression model can be written as a linear model for a transformed response, called the logit model. Note that the transformed response is the log odds.

\[ \operatorname{logit}[\pi(x)] = \log \frac{\pi(x)}{1 - \pi(x)} = \eta = \alpha + \beta x. \]
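As a quick numerical illustration of the two equivalent forms above, the following Python sketch evaluates the logistic curve and checks that the logit transform recovers the linear predictor. The function names `expit` and `logit` and the parameter values α = 1.0, β = 1.0 are illustrative assumptions, not estimates from any data set.

```python
import math

def expit(eta):
    """Inverse logit: exp(eta) / (1 + exp(eta)), written in an equivalent form."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

# Illustrative parameter values (assumed, not fitted).
alpha, beta = 1.0, 1.0
xs = [-6.0, -3.0, -1.0, 0.0, 2.0]
probs = [expit(alpha + beta * x) for x in xs]

for x, p in zip(xs, probs):
    # logit(p) should reproduce the linear predictor alpha + beta * x
    print(f"x = {x:5.1f}  pi(x) = {p:.4f}  logit = {logit(p):.4f}")
```

With β > 0 the printed probabilities rise along an S-shaped curve while the logit column is linear in x.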

7 Horseshoe crabs (one continuous variable) Influence of β

The sign of β determines whether π(x) is increasing (β > 0) or decreasing (β < 0) as x increases. The rate of change increases as |β| increases.

When β → 0, the curve flattens to a horizontal straight line. When β = 0, Y is independent of X.

[Figure: six panels of π(x) against x with α = 1.0 and different values of β, including β = 0.1 and β = 1.0]

8 Horseshoe crabs (one continuous variable) Interpret β Using Odds Ratio

The logit increases by β for every one-unit increase in x.

\[ \beta = \operatorname{logit}[\pi(x+1)] - \operatorname{logit}[\pi(x)] = \log \frac{\text{odds at } x+1}{\text{odds at } x} \]

Equivalently, the odds are multiplied by \(e^{\beta}\) for every one-unit increase in x.

\[ e^{\beta} = \frac{\text{odds at } x+1}{\text{odds at } x} = \frac{\pi(x+1)/(1 - \pi(x+1))}{\pi(x)/(1 - \pi(x))} \]

Therefore, \(e^{\beta}\) is an odds ratio.

Most of us do not think naturally on a logit or odds scale, so we need to consider alternative interpretations.
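The claim that the odds ratio for a one-unit increase is the same at every x can be checked numerically. This is a minimal sketch with hypothetical values α = -3.0 and β = 0.5; the helper `odds` is an illustrative name.

```python
import math

def odds(alpha, beta, x):
    """Odds pi(x) / (1 - pi(x)) under the logit model."""
    p = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
    return p / (1.0 - p)

# Hypothetical parameter values, chosen only for illustration.
alpha, beta = -3.0, 0.5

# Odds ratio for x + 1 versus x, evaluated at several starting points.
ratios = [odds(alpha, beta, x + 1) / odds(alpha, beta, x) for x in (0.0, 2.0, 5.0, 8.0)]
print(ratios, math.exp(beta))
```

Every entry of `ratios` agrees with exp(β) up to floating-point error, even though the change in probability differs from one x to the next.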

9 Horseshoe crabs (one continuous variable) Interpret β Using Probability

The function π(x) is a curve, and its rate of change (first derivative)

\[ \pi'(x) = \beta \pi(x) [1 - \pi(x)] \]

depends on the value of x. When π(x) = 0.5, the rate is 0.25β. When π(x) = 0.1, the rate is 0.09β. When π(x) approaches 1.0 or 0, the rate approaches 0.

The steepest slope (rate of change) occurs at the x for which π(x) = 0.5, which corresponds to x = -α/β. This x value is called the median effective level. It represents the level at which each outcome has a 50% chance.

Near the x where π(x) = 0.5, a change in x of 1/β corresponds to a change in π(x) of roughly (1/β)(β/4) = 0.25. Therefore, 1/β approximates the distance between the x values where π(x) = 0.25 or 0.75 and the x value where π(x) = 0.5.

10 Horseshoe crabs (one continuous variable)

[Figure: plot of π(x) = exp(2x)/(1 + exp(2x)) against x]

11 Horseshoe crabs (one continuous variable) Scatter Plot for Horseshoe Crabs

Plotting Y against X is not informative because Y is binary.

[Figure: scatter plot of presence of satellites against width]

12 Horseshoe crabs (one continuous variable) Visualizing Data with Binary Responses

When there are \(n_i\) observations at setting i of X, with \(y_i\) successes and sample proportion \(p_i = y_i/n_i\), plot

\[ \log \frac{p_i}{1 - p_i} = \log \frac{y_i}{n_i - y_i} \]

against x. If \(y_i = 0\) or \(y_i = n_i\), plot the adjusted value

\[ \log \frac{y_i + 1/2}{n_i - y_i + 1/2} \]

against x.

If the plot shows a linear pattern, logistic regression is appropriate.

When x is continuous and \(n_i = 1\) (or very small), group the data with nearby x values into categories.

If we use another link function g(·), we can plot g(π(x)) against x to check whether the points lie around a straight line.
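A small Python sketch of the grouped plot coordinates described above. The grouped counts are hypothetical, and adding 1/2 to both counts is a common adjustment assumed here so that groups with y = 0 or y = n still give a finite value.

```python
import math

def empirical_logit(y, n, adjust=0.5):
    """Adjusted empirical logit log((y + adjust) / (n - y + adjust))."""
    return math.log((y + adjust) / (n - y + adjust))

# Hypothetical grouped data: (midpoint of x group, successes y, group size n).
groups = [(23.0, 2, 12), (25.0, 6, 14), (27.0, 10, 15), (29.0, 9, 9)]

# Coordinates to plot: a roughly linear pattern supports the logit link.
points = [(x, empirical_logit(y, n)) for x, y, n in groups]
for x, value in points:
    print(f"x = {x:.1f}  empirical logit = {value:.3f}")
```

The last group has y = n, where the unadjusted logit would be infinite; the adjusted version remains usable for plotting.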

13 Horseshoe crabs (one continuous variable) Plot for Horseshoe Crabs Data

A logistic regression is appropriate for the horseshoe crab data.

[Figure: two panels, π(x) against width and logit(π) against width]

14 Horseshoe crabs (one continuous variable) SAS Code and Output

    proc logistic data = SAS-Dataset;
      model yvar (event = 'level') = list-of-variables;
    run;

The option event = tells SAS that we want to model the probability of that particular level.

15 Horseshoe crabs (one continuous variable) SAS Output

The SAS output shows \(\hat{\alpha}\) with \(\mathrm{SE}(\hat{\alpha})\), and \(\hat{\beta} = 0.4972\) with \(\mathrm{SE}(\hat{\beta}) = 0.1017\). Therefore, the fitted model is

\[ \operatorname{logit}[\hat{\pi}(x)] = \hat{\alpha} + 0.4972\,x. \]

16 Horseshoe crabs (one continuous variable) Fitted Regression Function

Therefore, the estimated probability at x is

\[ \hat{\pi}(x) = \frac{\exp(\hat{\alpha} + \hat{\beta} x)}{1 + \exp(\hat{\alpha} + \hat{\beta} x)}. \]

The estimated odds of having a satellite increase by 64.4% for each 1-cm increase in width (\(\exp(\hat{\beta}) = 1.644\)).

At the minimum width x = 21.0, the estimated probability is \(\hat{\pi}(21.0)\), and at the maximum width x = 33.5, it is \(\hat{\pi}(33.5)\); both are obtained by substituting x into the fitted function.

17 Horseshoe crabs (one continuous variable) Interpretation Using Probability

At the sample mean width of 26.3 cm, \(\hat{\pi}(x) = 0.674\). The estimated incremental rate of change in the fitted probability at this point is

\[ \hat{\beta}\,\hat{\pi}(x)[1 - \hat{\pi}(x)] = (0.497)(0.674)(0.326) = 0.11. \]

For female crabs near the mean width, the estimated probability of a satellite increases at the rate of 0.11 per 1-cm increase in width.

The estimated rate of change is greatest at the median effective level x = 24.8, at which \(\hat{\pi}(x) = 0.5\):

\[ \log \frac{0.5}{0.5} = 0 = \hat{\alpha} + \hat{\beta} x \quad \Rightarrow \quad x = -\hat{\alpha}/\hat{\beta} = 24.8. \]

There, the estimated probability increases at the rate of (0.497)(0.50)(0.50) = 0.12 per 1-cm increase in width.
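The two rates quoted on this slide follow directly from the derivative formula βπ(1 - π). A minimal Python check, using the slope 0.497 and the fitted probability 0.674 from the slide (the function name `rate_of_change` is illustrative):

```python
beta_hat = 0.497  # slope estimate quoted in the notes

def rate_of_change(beta, p):
    """Slope of the logistic curve at a point where the fitted probability is p."""
    return beta * p * (1.0 - p)

rate_at_mean = rate_of_change(beta_hat, 0.674)        # at the sample mean width
rate_at_median_level = rate_of_change(beta_hat, 0.5)  # steepest point of the curve
print(round(rate_at_mean, 2), round(rate_at_median_level, 2))  # prints 0.11 0.12
```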

18 Horseshoe crabs (one continuous variable) Comparing Different Variables

Suppose we want to compare the effect of width and weight.

The fitted function for width is \(\operatorname{logit}[\hat{\pi}(x)] = \hat{\alpha} + 0.4972\,x\).

The fitted function with weight as the predictor is \(\operatorname{logit}[\hat{\pi}(x)] = \hat{\alpha} + 1.815\,x\), where \(\hat{\alpha}\) now denotes the intercept of the weight model. The estimated odds of having a satellite increase by 514.1% for each 1-kg increase in weight (exp(1.815) = 6.141).

Notice that the relationship between π(x) and x is nonlinear, and that the two predictors are measured in different units (cm and kg). It may be misleading to look only at the estimates of β.
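The percentage changes in the odds quoted for the two predictors can be reproduced from the slopes alone. A short Python sketch, using the width slope 0.4972 that appears in the Wald test later in these notes and the weight slope 1.815 (the function name is illustrative):

```python
import math

def percent_change_in_odds(beta):
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (math.exp(beta) - 1.0) * 100.0

width_change = percent_change_in_odds(0.4972)   # per 1-cm increase in width
weight_change = percent_change_in_odds(1.815)   # per 1-kg increase in weight
print(round(width_change, 1), round(weight_change, 1))  # prints 64.4 514.1
```

The large gap between the two percentages mostly reflects the units: 1 kg is a much bigger step for these crabs than 1 cm, which is why comparing the raw slopes is misleading.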

19 Horseshoe crabs (one continuous variable) Comparing Estimated Probabilities

The probabilities π(x) at quartiles are useful for comparing the effects of predictors having different units.

The lower quartile, median, and upper quartile for width are 24.9, 26.1, and 27.7, and \(\hat{\pi}(x)\) at those values equals 0.51, 0.65, and \(\hat{\pi}(27.7)\). For example,

\[ \hat{\pi}(26.1) = \frac{\exp(\hat{\alpha} + \hat{\beta}(26.1))}{1 + \exp(\hat{\alpha} + \hat{\beta}(26.1))} = 0.65. \]

The quartiles for weight are 2.00, 2.35, and 2.85, and \(\hat{\pi}(x)\) at those values equals 0.48, 0.64, and \(\hat{\pi}(2.85)\), where \(\hat{\pi}\) is now the fitted function of the weight model. For example, \(\hat{\pi}(2.35) = 0.64\).

Therefore, the effect of weight is similar to that of width.

20 Horseshoe crabs (one continuous variable) Inference for Parameters

The SAS output gives the estimate of β and the corresponding standard error.

A Wald 100(1 - α)% confidence interval for β is

\[ \hat{\beta} \pm z_{\alpha/2}\,\mathrm{SE}(\hat{\beta}), \]

where α here denotes the error probability, not the intercept.

To test the hypothesis \(H_0: \beta = \beta_0\), the Wald test statistic is

\[ z = (\hat{\beta} - \beta_0)/\mathrm{SE}(\hat{\beta}). \]

Under \(H_0\), z approximately follows N(0, 1). The p-value and critical value are calculated according to \(H_a\).

21 Horseshoe crabs (one continuous variable) Example of Wald Test

For the logistic regression with a single predictor, logit[π(x)] = α + βx, we want to know whether the probability π(x) depends on x:

\[ H_0: \beta = 0, \qquad H_a: \beta \neq 0. \]

The Wald test statistic is

\[ z = \hat{\beta}/\mathrm{SE}(\hat{\beta}). \]

Under \(H_0\), z approximately follows N(0, 1).

The SAS output shows the Wald chi-squared statistic \(z^2\), and the p-value is \(P(\chi^2_1 \geq z^2)\).

In the example, the statistic is \(z^2 = (0.4972/0.1017)^2 = 23.9\). The p-value is < 0.0001, and thus β is significantly different from 0.

22 Horseshoe crabs (one continuous variable) Confidence Interval for Parameters

An approximate 100(1 - α)% confidence interval for β is

\[ \hat{\beta} \pm z_{\alpha/2}\,\mathrm{SE}(\hat{\beta}). \]

The Wald 95% confidence interval for β is 0.4972 ± 1.96(0.1017), with lower limit 0.2978 and upper limit approximately 0.697.

The confidence interval for the effect on the odds per 1-cm increase in width, obtained by exponentiating the limits of a confidence interval for β, equals (1.36, 2.03). We infer that a 1-cm increase in width produces at least a 36% increase and at most a doubling in the odds of a satellite.
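The Wald statistic, its p-value, and the Wald interval can be reproduced from the two numbers β̂ = 0.4972 and SE(β̂) = 0.1017 quoted in these notes. The sketch below uses only the Python standard library; the identity P(χ²₁ ≥ z²) = erfc(|z|/√2) is used for the 1-degree-of-freedom tail, and 1.96 is the usual approximate normal quantile.

```python
import math

beta_hat, se = 0.4972, 0.1017   # estimate and standard error quoted in the notes
z = beta_hat / se
wald_chi2 = z ** 2
# For 1 degree of freedom, P(chi2_1 >= z^2) = P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
p_value = math.erfc(abs(z) / math.sqrt(2.0))

z_crit = 1.96                   # approximate 97.5th percentile of N(0, 1)
lower = beta_hat - z_crit * se
upper = beta_hat + z_crit * se
odds_lower, odds_upper = math.exp(lower), math.exp(upper)

print(round(wald_chi2, 1), p_value < 0.0001)
print(round(lower, 3), round(upper, 3), round(odds_lower, 2), round(odds_upper, 2))
```

Exponentiating these Wald limits gives roughly (1.35, 2.01), close to but not identical to the (1.36, 2.03) quoted on the slide. The small difference may come from a different interval construction (for example a likelihood-based interval) or from rounding; that explanation is an assumption, not something the notes state.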

23 Horseshoe crabs (one continuous variable) Full Model and Reduced Model

The likelihood ratio test can be used to compare a reduced model with a full model.

Full model: larger model, more parameters, considered to be appropriate for the data.

Reduced model: smaller model, fewer parameters, simplified under \(H_0\).

For example, in the logistic regression logit(π) = α + βx, if we want to test \(H_0: \beta = 0\), then the full and reduced models are

full model: logistic regression with logit(π) = α + βx
reduced model: logistic regression with logit(π) = α

24 Horseshoe crabs (one continuous variable) Likelihood Ratio Test

Denote

\(l_0\) = maximized log likelihood under \(H_0\)
\(l_1\) = maximized log likelihood under \(H_a\)

It is clear that \(l_0 \leq l_1\).

The likelihood ratio test statistic is

\[ G^2 = -2(l_0 - l_1). \]

Under \(H_0\), \(G^2\) approximately follows \(\chi^2_d\), where d is the difference between the numbers of parameters under \(H_a\) and \(H_0\).

25 Horseshoe crabs (one continuous variable) Some Comments

The relationship between the Wald test and the likelihood ratio test is similar to that between the t-test and the F-test in linear regression.

When testing a single parameter, the Wald test and the likelihood ratio test are asymptotically equivalent, so they give similar p-values in large samples. However, \(H_a\) of the Wald test can be either one-sided or two-sided, while \(H_a\) of the likelihood ratio test is usually two-sided.

The likelihood ratio test can be used for more complicated testing problems.

26 Horseshoe crabs (one continuous variable) Example of Likelihood Ratio Test

The likelihood ratio test of \(H_0: \beta = 0\) essentially compares the following two models:

\[ H_0: \operatorname{logit}[\pi(x)] = \alpha, \qquad H_a: \operatorname{logit}[\pi(x)] = \alpha + \beta x. \]

The likelihood ratio test statistic is \(G^2 = -2(l_0 - l_1)\). Under \(H_0\), \(G^2\) approximately follows \(\chi^2_1\).

With \(l_1\) the maximized log likelihood for logit[π(x)] = α + βx and \(l_0\) the maximized log likelihood for logit[π(x)] = α,

\[ G^2 = -2(l_0 - l_1) = 31.3. \]

We should reject \(H_0\), which means the probability that a female crab has at least one satellite depends on her width.
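A minimal Python sketch of the likelihood ratio calculation. The two log-likelihood values passed to `lr_statistic` are hypothetical and only exercise the formula; the p-value is computed for the statistic G² = 31.3 reported above, using the fact that a chi-squared variable with 1 degree of freedom is a squared standard normal.

```python
import math

def lr_statistic(loglik_reduced, loglik_full):
    """G^2 = -2 (l0 - l1), where l0 <= l1."""
    return -2.0 * (loglik_reduced - loglik_full)

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical log likelihoods, used only to exercise the formula.
g2_demo = lr_statistic(-110.0, -100.0)

# p-value for the statistic G^2 = 31.3 reported in the notes.
p_value = chi2_sf_df1(31.3)
print(g2_demo, p_value)
```

The resulting p-value is far below any conventional significance level, consistent with rejecting H0.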

27 Horseshoe crabs (one continuous variable) Confidence Interval for Estimated Probabilities

The confidence interval for π(x) is obtained by transforming the confidence interval for logit[π(x)].

An approximate 100(1 - α)% confidence interval for logit[π(x)] is

\[ (\hat{\alpha} + \hat{\beta} x) \pm z_{\alpha/2}\,\mathrm{SE}(\hat{\alpha} + \hat{\beta} x), \]

where

\[ \mathrm{SE}(\hat{\alpha} + \hat{\beta} x) = \sqrt{\operatorname{var}(\hat{\alpha}) + x^2 \operatorname{var}(\hat{\beta}) + 2x \operatorname{cov}(\hat{\alpha}, \hat{\beta})}. \]

Denote the confidence interval for logit[π(x)] by (a, b). Then an approximate 100(1 - α)% confidence interval for π(x) is

\[ \left( \frac{\exp(a)}{1 + \exp(a)}, \; \frac{\exp(b)}{1 + \exp(b)} \right). \]

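28 Horseshoe crabs (one continuous variable) Example

Consider a crab with width x = 26.5. The estimated logit is \(\hat{\alpha} + (0.4972)(26.5) = 0.825\).

From the SAS output, \(\operatorname{var}(\hat{\alpha}) = 6.910\), together with \(\operatorname{var}(\hat{\beta})\) and \(\operatorname{cov}(\hat{\alpha}, \hat{\beta})\), so that

\[ \operatorname{var}(\hat{\alpha} + \hat{\beta} x) = 6.910 + (26.5)^2 \operatorname{var}(\hat{\beta}) + (2)(26.5) \operatorname{cov}(\hat{\alpha}, \hat{\beta}). \]

A 95% confidence interval for logit[π(26.5)] is

\[ 0.825 \pm 1.96\,\mathrm{SE}(\hat{\alpha} + \hat{\beta} x) = (0.457, 1.193). \]

The estimated probability is

\[ \hat{\pi}(26.5) = \exp(0.825)/(1 + \exp(0.825)) = 0.695, \]

and a 95% confidence interval is (0.612, 0.768).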
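The back-transformation from the logit scale to the probability scale can be sketched in a few lines of Python. The logit estimate 0.825 and the interval (0.457, 1.193) are taken from the slide; the helper names `expit` and `se_linear_predictor` are illustrative, and the latter is shown only as a direct transcription of the standard error formula.

```python
import math

def expit(eta):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))

def se_linear_predictor(var_a, var_b, cov_ab, x):
    """SE of alpha_hat + beta_hat * x from the estimated variances and covariance."""
    return math.sqrt(var_a + x ** 2 * var_b + 2.0 * x * cov_ab)

# Estimated logit at x = 26.5 and its 95% interval, as reported in the notes.
logit_hat, logit_lower, logit_upper = 0.825, 0.457, 1.193

prob_hat = expit(logit_hat)
prob_interval = (expit(logit_lower), expit(logit_upper))
print(round(prob_hat, 3), [round(p, 3) for p in prob_interval])
```

Because the inverse logit is increasing, transforming the endpoints preserves the coverage of the interval. The result agrees with the slide up to rounding in the last digit.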

29 Horseshoe crabs (one continuous variable) SAS Code for Estimated Probability

    proc logistic data = SAS-Dataset;
      model response (event = 'level') = list-of-variables;
      output out = mydata prob = prob lower = lower upper = upper;
    run;

The above SAS code creates a new dataset, called mydata, with variables prob, lower, and upper, which give the estimated probabilities and their 95% confidence intervals.

30 Horseshoe crabs (one continuous variable) How about Ignoring the Model?

Six female crabs in the sample had x = 26.5, and four of them had satellites. If we ignore the model, a 95% confidence interval is

\[ (4/6) \pm 1.96 \sqrt{(4/6)(1 - 4/6)/6} = (0.29, 1.04), \]

which is much wider than the one obtained from the model.

When the logistic model truly holds, the model-based estimator of a probability is considerably better than the sample proportion.

The model has only two parameters to estimate, whereas the saturated model has a separate parameter for every distinct value of x.
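The model-free interval is simple arithmetic on the counts 4 out of 6, and it can be compared with the model-based interval (0.612, 0.768) reported for x = 26.5. A short Python check:

```python
import math

y, n = 4, 6                      # crabs with satellites, crabs with x = 26.5
p = y / n
half_width = 1.96 * math.sqrt(p * (1.0 - p) / n)
lower, upper = p - half_width, p + half_width

# Width of the model-based interval (0.612, 0.768) reported in the notes.
model_width = 0.768 - 0.612
print(round(lower, 2), round(upper, 2), round(upper - lower, 2), round(model_width, 2))
```

The sample-proportion interval works out to about (0.29, 1.04); its upper limit exceeds 1, and it is several times wider than the model-based interval.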

31 Maternal alcohol consumption (one categorical predictor) Example: Maternal Alcohol Consumption

The following table summarizes the results of a study of maternal alcohol consumption (average number of drinks per day) and child's congenital malformations.

[Table: alcohol consumption level (rows) by congenital malformation, present or absent (columns)]

The response is binary and the predictor is a categorical variable.

32 Maternal alcohol consumption (one categorical predictor) Categorical Predictors

Categorical explanatory variables are also called factors. First consider a single factor X with I categories (levels).

alcohol consumption   success   failure      total
level 1               y_1       n_1 - y_1    n_1
level 2               y_2       n_2 - y_2    n_2
...
level I               y_I       n_I - y_I    n_I

In row i of the I × 2 table, \(y_i\) is the number of outcomes in the first column (successes) out of \(n_i\) trials.

It is analogous to the one-way ANOVA model.

33 Maternal alcohol consumption (one categorical predictor) Saturated Model

A saturated model has a separate parameter for each observation (distinct value of x), and it provides a perfect fit to the data. Heuristically, a saturated model assumes there is no relationship between the means at different values of x.

We treat \(y_i\) as binomial with parameter \(\pi_i\).

\[ \operatorname{logit}(\pi_i) = \log \frac{\pi_i}{1 - \pi_i} = \alpha + \beta_i. \]

There are in total I + 1 unknown parameters \(\{\alpha, \beta_1, \ldots, \beta_I\}\).

It is a saturated model because \(\pi_i\) is different for each level of x.

34 Maternal alcohol consumption (one categorical predictor) Comparing Probabilities

The higher \(\beta_i\) is, the higher the value of \(\pi_i\) (probability of success).

The question of interest is whether the factor has any effect,

\[ H_0: \beta_1 = \beta_2 = \cdots = \beta_I, \]

or equivalently,

\[ H_0: \pi_1 = \pi_2 = \cdots = \pi_I. \]

35 Maternal alcohol consumption (one categorical predictor) Constraint on Parameters

There is one redundant parameter in \(\{\alpha, \beta_i\}\). We can solve this problem by adding one constraint. One popular constraint is to choose one level of x as the reference level and set \(\beta_I = 0\).

level 1:     logit(π_1) = α + β_1
...
level I - 1: logit(π_{I-1}) = α + β_{I-1}
level I:     logit(π_I) = α

Therefore, α is the logit in row I, and \(\beta_i\) is the log odds ratio for row i and row I.

\[ \alpha = \log \frac{\pi_I}{1 - \pi_I}, \qquad \beta_i = \operatorname{logit}(\pi_i) - \operatorname{logit}(\pi_I) = \log \frac{\pi_i/(1 - \pi_i)}{\pi_I/(1 - \pi_I)} \]

36 Maternal alcohol consumption (one categorical predictor) Other Choice of Constraint

The constraint on the parameters and the coding scheme are not unique. However, for different constraints and coding schemes, \(\{\hat{\alpha} + \hat{\beta}_i\}\) and \(\{\hat{\pi}_i\}\) are the same. The differences \(\hat{\beta}_i - \hat{\beta}_j\) for any two levels of X are identical and represent estimated log odds ratios.

When a factor has two levels, a common alternative constraint is \(\sum_i \beta_i = \beta_1 + \beta_2 = 0\). In this case, α is the average logit, and \(\beta_i\) is the difference between the logit in row i and the average logit.

The corresponding regression function is

\[ \operatorname{logit}(\pi_i) = \alpha + \beta x, \]

where x = 1 for one level and x = -1 for the other.

37 Maternal alcohol consumption (one categorical predictor) Estimate of Parameters

Let us focus on the constraint \(\beta_I = 0\). The meaning of the parameters sheds light on their estimation.

Notice that population proportions are estimated by sample proportions,

\[ \hat{\pi}_i = p_i = y_i/n_i, \]

and thus the estimates of α and \(\beta_i\) are

\[ \hat{\alpha} = \log \frac{p_I}{1 - p_I}, \qquad \hat{\beta}_i = \log \frac{p_i/(1 - p_i)}{p_I/(1 - p_I)}. \]

It can be verified that these estimates are the MLEs of α and \(\beta_i\).
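A minimal Python sketch of these closed-form estimates under the reference constraint (last level as reference). The counts below are hypothetical and are not the data from the alcohol study; the point is only that α̂ + β̂_i reproduces each sample proportion exactly.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def expit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical counts for a factor with I = 3 levels (not the study data).
successes = [12, 9, 4]
totals = [200, 90, 20]
props = [y / n for y, n in zip(successes, totals)]

alpha_hat = logit(props[-1])                       # logit in the reference row I
beta_hat = [logit(p) - alpha_hat for p in props]   # log odds ratios versus row I

# The saturated model reproduces the sample proportions.
fitted = [expit(alpha_hat + b) for b in beta_hat]
print(alpha_hat, beta_hat, fitted)
```

The last entry of `beta_hat` is zero by construction, matching the constraint β_I = 0.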

38 Maternal alcohol consumption (one categorical predictor) SAS Code and Results

    proc logistic data = SAS-Dataset;
      class xvar (ref = 'level') / param = reference;
      model yvar / total = list-of-variables;
    run;

The SAS output shows

[Table: estimate, standard error, and 95% confidence interval for α and each \(\beta_i\)]

For example, \(\hat{\alpha} = \log(1/37)\), and \(\hat{\beta}_1\) is the log of the ratio of the sample odds in row 1 to the sample odds in the reference row.

39 Maternal alcohol consumption (one categorical predictor) Equivalent Regression Model

With dummy variables, the model can be represented in terms of a regression model. A factor with I levels needs I - 1 dummy variables.

Define \(x_1, \ldots, x_{I-1}\) by \(x_i = 1\) for row i and \(x_i = 0\) otherwise (i = 1, ..., I - 1). Therefore, if \(x_1 = \cdots = x_{I-1} = 0\), the observation is in row I.

The regression model is

\[ \operatorname{logit}(\pi_i) = \alpha + \beta_1 x_1 + \cdots + \beta_{I-1} x_{I-1}. \]

α and \(\beta_i\) have the same interpretations.

40 Maternal alcohol consumption (one categorical predictor) Results for Example

Consider the hypothesis

\[ H_0: \beta_1 = \cdots = \beta_5. \]

Notice that \(H_0\) means \(\pi_1 = \cdots = \pi_5\), that is, the response is independent of the predictor.

Under \(H_0\), the estimated probability \(\hat{\pi}_0\) is the same for all levels of x: it is the total number of malformations present divided by the total sample size.

The Pearson chi-squared statistic is \(X^2 = 12.1\) (p-value is 0.02) and the likelihood ratio statistic is \(G^2 = 6.2\) (p-value is 0.19).

41 Maternal alcohol consumption (one categorical predictor) Test of Independence

The observed and expected frequencies are

[Table: observed and expected frequencies of congenital malformation (present, absent) by alcohol consumption level]

Each expected frequency is the row total multiplied by \(\hat{\pi}_0\) (for present) or by \(1 - \hat{\pi}_0\) (for absent).

Under \(H_0\), both \(X^2\) and \(G^2\) approximately follow \(\chi^2_4\).

42 Maternal alcohol consumption (one categorical predictor) Ordinal

Notice that the variable alcohol consumption is ordinal. Assume that the scores {0, 0.5, 1.5, 4.0, 7.0} properly describe the distances between levels of X. Consider the following model:

\[ \operatorname{logit}(\pi_i) = \alpha + \beta x_i. \]

The SAS output shows \(\hat{\alpha}\) and \(\hat{\beta}\) with their standard errors \(\mathrm{SE}(\hat{\alpha})\) and \(\mathrm{SE}(\hat{\beta})\).

The estimated multiplicative effect of a unit increase in daily alcohol consumption on the odds of malformation is exp(0.317) = 1.37.

43 Maternal alcohol consumption (one categorical predictor) Inference for β

The test of independence corresponds to \(H_0: \beta = 0\). The Wald statistic is \(z = \hat{\beta}/\mathrm{SE}(\hat{\beta})\), with the p-value obtained from N(0, 1).

The model seems to fit well from the table.

[Table: for each alcohol consumption level, the counts present and absent and the observed and fitted proportions malformed]

44 Checking model adequacy Goodness-of-fit

The goodness-of-fit test is used to check whether a model adequately describes the variability in the data.

If there is more than one subject for each setting of x, compare the model with the saturated model.

[Table: for each alcohol consumption level, the observed counts present and absent, the fitted probability, and the fitted counts present and absent]

For example, a fitted count present is the row total times the fitted probability (such as 0.0026), and the fitted count absent is the row total times one minus the fitted probability.

45 Checking model adequacy Two Chi-squared Statistics

First calculate the fitted frequencies for each category, and then calculate the Pearson chi-squared statistic or the likelihood ratio statistic using the following formulas.

\[ X^2 = \sum_i \frac{(\text{observed} - \text{fitted})^2}{\text{fitted}}, \qquad G^2 = 2 \sum_i \text{observed} \, \log \frac{\text{observed}}{\text{fitted}} \]

The Pearson chi-squared statistic for goodness-of-fit is \(X^2 = 2.05\); the likelihood ratio statistic \(G^2\) is obtained from the second formula.

Under \(H_0\), both \(X^2\) and \(G^2\) approximately follow \(\chi^2_3\), with critical value \(\chi^2_{3, 0.05} = 7.81\).
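The two statistics can be computed with a few lines of Python once observed and fitted counts are available for every cell. The counts below are hypothetical (the table values are not reproduced here), and the function names are illustrative.

```python
import math

def pearson_x2(observed, fitted):
    """Pearson statistic: sum of (observed - fitted)^2 / fitted over all cells."""
    return sum((o - f) ** 2 / f for o, f in zip(observed, fitted))

def deviance_g2(observed, fitted):
    """Likelihood ratio statistic: 2 * sum of observed * log(observed / fitted)."""
    # A cell with observed count 0 contributes 0 (limit of o * log(o / f)).
    return 2.0 * sum(o * math.log(o / f) for o, f in zip(observed, fitted) if o > 0)

# Hypothetical observed and fitted counts over all cells (same grand total).
observed = [30, 70, 45, 55]
fitted = [33.0, 67.0, 42.0, 58.0]

x2 = pearson_x2(observed, fitted)
g2 = deviance_g2(observed, fitted)
print(round(x2, 3), round(g2, 3))
```

When the fitted counts are close to the observed counts, the two statistics are small and close to each other, as in this example.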

46 Checking model adequacy Goodness-of-fit, More

If there is only one subject for each setting of x, it may be misleading to compare the model with the saturated model because the chi-squared approximation does not hold (think of the linear regression model).

We can group the data and check the goodness-of-fit. If the test does not reject, we can feel more comfortable about using the model for the original ungrouped data.

Compare the model with a more complicated one (for example, \(\operatorname{logit}[\pi(x)] = \alpha + \beta_1 x + \beta_2 x^2\)). If more complex models do not fit better (the test does not reject), this provides some assurance that the model chosen is reasonable.

47 Checking model adequacy Goodness-of-fit For Horseshoe Crabs

[Table: for each width category, the number of observations, the observed numbers of Yes and No, and the fitted numbers of Yes and No]

\[ \text{Fitted yes of } k\text{th category} = \sum_{x_i \text{ in } k\text{th category}} \hat{\pi}(x_i), \qquad \text{Fitted no of } k\text{th category} = \sum_{x_i \text{ in } k\text{th category}} \{1 - \hat{\pi}(x_i)\}. \]

\(X^2 = 5.3\) and \(G^2 = 6.2\). Both approximately follow \(\chi^2_6\), and the P-value is about 0.4.

48 Checking model adequacy Compare to a More Complicated Model

Fit a quadratic logistic regression for width; the fitted model has the form

\[ \operatorname{logit}[\hat{\pi}(x)] = \hat{\alpha} + \hat{\beta}_1 x + \hat{\beta}_2 x^2. \]

The Wald test for the quadratic term is not significant. The likelihood-ratio statistic is \(-2(l_0 - l_1)\), where \(l_0\) and \(l_1\) are the maximized log likelihoods of the linear and quadratic models.

There is no evidence to support adding a quadratic term. Therefore, a linear logistic model is adequate for width.

49 Horseshoe crab, revisited Multiple Logistic Regression

When there are p predictors \(x_1, \ldots, x_p\), the logistic regression models π(x) = P(Y = 1) by

\[ \pi(x) = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}, \]

or equivalently,

\[ \operatorname{logit}[\pi(x)] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p. \]

The parameter \(\beta_i\) refers to the effect of \(x_i\) on the log odds that Y = 1, controlling for the other \(x_j\). And \(e^{\beta_i}\) is the multiplicative effect on the odds of a one-unit increase in \(x_i\), at fixed levels of the other x's.

50 Horseshoe crab, revisited Horseshoe Crab Data

Logistic regression can have a mixture of quantitative and qualitative predictors.

In this example, we analyze the horseshoe crab data by using both the female crab's shell width and color as predictors.

\[ \operatorname{logit}(\pi) = \beta_0 + \beta_1 c_1 + \beta_2 c_2 + \beta_3 c_3 + \beta_4 x \]

where π = P(Y = 1) is the probability that a female crab has a satellite, x = width in centimeters, and

c_1 = 1 for medium-light color, and 0 otherwise
c_2 = 1 for medium color, and 0 otherwise
c_3 = 1 for medium-dark color, and 0 otherwise.

The crab color is dark when \(c_1 = c_2 = c_3 = 0\).

51 Horseshoe crab, revisited Fitted Regression Model

The fitted model is

\[ \operatorname{logit}(\hat{\pi}) = \hat{\beta}_0 + \hat{\beta}_1 c_1 + \hat{\beta}_2 c_2 + \hat{\beta}_3 c_3 + \hat{\beta}_4 x. \]

It is informative to write the fitted model for different colors:

medium-light: logit(π̂) = (β̂_0 + β̂_1) + β̂_4 x
medium:       logit(π̂) = (β̂_0 + β̂_2) + β̂_4 x
medium-dark:  logit(π̂) = (β̂_0 + β̂_3) + β̂_4 x
dark:         logit(π̂) = β̂_0 + β̂_4 x

The exponentiated difference between two color parameter estimates is an odds ratio comparing those two colors. At any given width, the estimated odds that a medium-light crab has a satellite are \(e^{\hat{\beta}_1} = 3.8\) times the estimated odds for a dark crab.

52 Horseshoe crab, revisited Plot of Fitted Regression Curves

Any one curve equals any other curve shifted to the right or left.

The parallelism of the curves in the horizontal dimension implies that no two curves ever cross. (No interaction between color and width.)

[Figure: predicted probability against width; four curves, from left to right: medium (c2), medium-light (c1), medium-dark (c3), dark]

53 Horseshoe crab, revisited Check Model Adequacy

A more complicated model allowing color × width interaction has three additional terms, the cross-products of width with the color dummy variables.

\[ \operatorname{logit}(\pi) = \beta_0 + \beta_1 c_1 + \beta_2 c_2 + \beta_3 c_3 + \beta_4 x + \beta_5 c_1 x + \beta_6 c_2 x + \beta_7 c_3 x \]

It is equivalent to fitting a logistic regression with width as the predictor separately for crabs of each color.

medium-light: logit(π) = (β_0 + β_1) + (β_4 + β_5)x
medium:       logit(π) = (β_0 + β_2) + (β_4 + β_6)x
medium-dark:  logit(π) = (β_0 + β_3) + (β_4 + β_7)x
dark:         logit(π) = β_0 + β_4 x

54 Horseshoe crab, revisited Result of Testing

Consider the hypothesis

\[ H_0: \beta_5 = \beta_6 = \beta_7 = 0, \qquad H_a: \text{not all of } \beta_5, \beta_6, \beta_7 \text{ are zero}. \]

Notice that the full model is the model with interaction terms, while the reduced model is the model without interaction terms.

The likelihood ratio test statistic \(G^2 = -2(l_0 - l_1)\) has df = 3, and the test is not significant.

Therefore, the model without interaction terms is adequate to model the data.

55 Horseshoe crab, revisited Test Effect of Color

In the model without interaction terms,

\[ \operatorname{logit}(\pi) = \beta_0 + \beta_1 c_1 + \beta_2 c_2 + \beta_3 c_3 + \beta_4 x, \]

to test whether color contributes significantly to the model, consider

\[ H_0: \beta_1 = \beta_2 = \beta_3 = 0, \qquad H_a: \text{not all of } \beta_1, \beta_2, \beta_3 \text{ are zero}. \]

The reduced model is \(\operatorname{logit}(\pi) = \beta_0 + \beta_4 x\).

The likelihood ratio test statistic \(G^2 = -2(l_0 - l_1)\) has df = 3.

Therefore, we do not reject \(H_0\): controlling for width, there is no significant evidence that the probability of a satellite depends on color.

56 Horseshoe crab, revisited More Consideration for Color

The color has ordered categories, from lightest to darkest. We can assign scores to the color categories and fit

\[ \operatorname{logit}(\pi) = \beta_0 + \beta_1 c + \beta_2 x. \]

Use scores c = {1, 2, 3, 4} for the color categories. The standard errors are 0.224 for \(\hat{\beta}_1\) and 0.104 for \(\hat{\beta}_2\). Comparing this model to the one treating color as nominal gives \(G^2 = 1.7\) with df = 2.

Use scores c = {1, 1, 1, 0} for the color categories. The standard errors are 0.526 for \(\hat{\beta}_1\) and 0.104 for \(\hat{\beta}_2\). The comparison with the nominal model gives \(G^2 = 0.5\) with df = 2.

A much larger sample is needed to determine which color scoring is more appropriate.

It is advantageous to treat ordinal predictors in a quantitative manner when such models fit well.
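The p-values for the two comparisons can be computed from the reported statistics, because the upper-tail probability of a chi-squared variable with 2 degrees of freedom has the closed form exp(-x/2). A short Python sketch (the function name is illustrative):

```python
import math

def chi2_sf_df2(x):
    """Upper-tail probability of a chi-squared variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)

p_equal_spacing = chi2_sf_df2(1.7)   # scores {1, 2, 3, 4} versus nominal color
p_binary_score = chi2_sf_df2(0.5)    # scores {1, 1, 1, 0} versus nominal color
print(round(p_equal_spacing, 2), round(p_binary_score, 2))
```

Both p-values are large (roughly 0.43 and 0.78 from the rounded statistics), so neither simpler scoring fits significantly worse than treating color as nominal.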

57 Neuralgia Example: Neuralgia

Consider a study of the analgesic effects of treatments on elderly patients with neuralgia. Two test treatments (A and B) and a placebo (P) are compared. The response variable is whether the patient reported pain or not. Researchers recorded the age and gender of the patients and the duration of complaint before the treatment began.

The data consist of 60 patients. The four predictor variables are

variable    values
treatment   A, B, P
sex         F, M
age         continuous
duration    continuous

58 Neuralgia Logistic Model

Consider the following logistic model

\[ \operatorname{logit}(\pi) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} \]

where π is the probability of reporting pain and the predictor variables are

x_1 = 1 for treatment A and = 0 otherwise
x_2 = 1 for treatment B and = 0 otherwise
x_3 = 1 for female and = 0 for male
x_4 = age
x_5 = duration

Notice that \(x_1\) and \(x_2\) are two dummy variables constructed for treatment.

59 Neuralgia Model Fitting

Is the model significant?

\[ H_0: \beta_1 = \cdots = \beta_5 = 0, \qquad H_a: \text{not all } \beta_i \text{ are zero} \]

The likelihood ratio statistic \(G^2\) has 5 degrees of freedom. The p-value is < .0001 and the critical value is \(\chi^2_{5, 0.05} = 11.07\). We should reject \(H_0\).

[Table: estimate, standard error, Wald chi-square, and Pr > ChiSq for Intercept, Treatment A, Treatment B, Sex F, Age, and Duration]

60 Neuralgia Refined Model

The variable duration is not significant. After removing this variable, the logistic model becomes

\[ \operatorname{logit}(\pi) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} \]

and the fitted parameters are

[Table: estimate, standard error, Wald chi-square, and Pr > ChiSq for Intercept, Treatment A, Treatment B, Sex F, and Age]

61 Neuralgia Odds Ratio in SAS output

The SAS output of proc logistic displays the odds ratio estimates and their confidence intervals for those variables that are not involved in any interaction terms.

For a categorical variable (one that appears in a class statement), the odds ratio comparing each level with the last level is computed regardless of the coding scheme.

For a continuous explanatory variable, the odds ratio corresponds to a one-unit increase of this variable.

[Tables: two 2 × 2 layouts of success and failure, one comparing one level with the last level and one comparing x + 1 with x]

62 Neuralgia Interpretation of the Parameters

The odds of female patients reporting pain are 16.1% of the odds for male patients. The 95% confidence interval is (0.034, 0.762).

The odds of reporting pain for patients treated by A are 4.2% of the odds for patients treated by placebo. The 95% confidence interval is (0.006, 0.303).

The odds of reporting pain for patients treated by B are 2.4% of the odds for patients treated by placebo. The 95% confidence interval is (0.003, 0.222).

The odds of reporting pain increase by 30.0% if the patient is one year older. The 95% confidence interval is (1.080, 1.773).

63 Neuralgia Contrast Statement in SAS

The contrast statement enables us to conduct flexible tests involving categorical variables.

Suppose we want to compare the two treatments A and B.

\[ \beta_A = \beta_B \iff \beta_A - \beta_B = 0 \]

The corresponding SAS code is

    contrast 'A vs B' Treatment 1 -1;

Suppose we want to compare the treatments with the placebo.

\[ (\beta_A + \beta_B)/2 = \beta_P \iff 0.5\beta_A + 0.5\beta_B - \beta_P = 0 \]

Because \(\beta_P = 0\) for reference coding, the contrast reduces to \(0.5\beta_A + 0.5\beta_B = 0\), and the corresponding contrast statement, labeled AB vs P, places these coefficients on Treatment.

64 Neuralgia Results
There is not much difference between treatments A and B. The test is not significant (p-value = ) and the 95% confidence interval for the odds ratio is (0.2786, ), which contains 1.
The two treatments are quite different from the placebo. The test is significant (p-value = ) and the 95% confidence interval for the odds ratio is (0.0047, ).
Notice that
The syntax of the contrast statement in proc logistic is different from that in proc glm.
If a different coding scheme is used (e.g. effect coding), the SAS code should be modified accordingly, because the constraint of effect coding is $\beta_P = -\beta_A - \beta_B$.

65 Neuralgia Model Failure Instead of Success
The logistic regression is
$$\operatorname{logit}(\pi) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4}$$
where $\pi$ is the probability of reporting pain. Let $\tilde{\pi}$ be the probability of reporting no pain. Then $\tilde{\pi} = 1 - \pi$ and
$$\operatorname{logit}(\tilde{\pi}) = \log\frac{\tilde{\pi}}{1 - \tilde{\pi}} = \log\frac{1 - \pi}{\pi} = -\log\frac{\pi}{1 - \pi} = -\operatorname{logit}(\pi)$$
Therefore, the logistic regression for $\tilde{\pi}$ is
$$\operatorname{logit}(\tilde{\pi}) = -\operatorname{logit}(\pi) = -\beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \beta_3 x_{i3} - \beta_4 x_{i4}$$
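The sign flip can be checked numerically. The following sketch verifies the identity logit(1 − p) = −logit(p) and shows that negating every coefficient yields the complementary probability. The coefficients and covariate values are hypothetical.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse of logit: map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# The identity holds for any probability strictly between 0 and 1.
for p in (0.1, 0.25, 0.5, 0.9):
    assert math.isclose(logit(1.0 - p), -logit(p), abs_tol=1e-12)

# Hypothetical coefficients (intercept first) and covariate values.
beta = [-1.0, 0.8, -0.5]
x = [2.0, 1.0]
eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))

p_pain = inv_logit(eta)        # model for "pain"
p_no_pain = inv_logit(-eta)    # same model with every coefficient negated
print(p_pain, p_no_pain, p_pain + p_no_pain)
```

In practice this means that switching which outcome is modeled changes only the signs of the estimates, while standard errors and test statistics are unaffected.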

66 Neuralgia Comparing with a Larger Model
In order to assess model adequacy, we compare this model with a larger model that includes all the pairwise interactions. There are 14 predictors in total in the full model:
$x_1, x_2, x_3, x_4, x_5$
$x_1 x_3, x_1 x_4, x_1 x_5, x_2 x_3, x_2 x_4, x_2 x_5, x_3 x_4, x_3 x_5, x_4 x_5$
The hypotheses are
$H_0$: model with $x_1, \ldots, x_5$; $H_a$: model with all 14 predictors.
The likelihood ratio test statistic is $G^2$ = = with 9 degrees of freedom. The p-value is $P(\chi^2_9 \ge G^2)$ = .
We do not reject $H_0$, which suggests that the model with $x_1, \ldots, x_5$ is adequate.
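The likelihood ratio comparison can be sketched as below. The log-likelihood values are hypothetical, and the helper `chi2_sf` is written here with only the standard library, using the recurrence Q(a + 1, z) = Q(a, z) + z^a e^(−z) / Γ(a + 1) for the upper incomplete gamma ratio. A statistics package would normally supply this tail probability.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with integer df >= x)."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)               # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))    # Q(1/2, z)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def lr_test(loglik_reduced, loglik_full, df):
    """Return G^2 = 2 * (loglik_full - loglik_reduced) and its p-value."""
    g2 = 2.0 * (loglik_full - loglik_reduced)
    return g2, chi2_sf(g2, df)

# Hypothetical maximized log-likelihoods for the 5- and 14-predictor models.
g2, p_value = lr_test(loglik_reduced=-20.0, loglik_full=-15.0, df=14 - 5)
print(f"G^2 = {g2:.2f} on 9 df, p-value = {p_value:.4f}")
```

The degrees of freedom equal the number of extra parameters in the larger model, here 14 − 5 = 9. A large p-value means the data give no evidence that the interaction terms are needed.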

Categorical data analysis Chapter 5

Categorical data analysis Chapter 5 Categorical data analysis Chapter 5 Interpreting parameters in logistic regression The sign of β determines whether π(x) is increasing or decreasing as x increases. The rate of climb or descent increases

More information

Chapter 5: Logistic Regression-I

Chapter 5: Logistic Regression-I : Logistic Regression-I Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM [Acknowledgements to Tim Hanson and Haitao Chu] D. Bandyopadhyay

More information

ST3241 Categorical Data Analysis I Logistic Regression. An Introduction and Some Examples

ST3241 Categorical Data Analysis I Logistic Regression. An Introduction and Some Examples ST3241 Categorical Data Analysis I Logistic Regression An Introduction and Some Examples 1 Business Applications Example Applications The probability that a subject pays a bill on time may use predictors

More information

Sections 4.1, 4.2, 4.3

Sections 4.1, 4.2, 4.3 Sections 4.1, 4.2, 4.3 Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1/ 32 Chapter 4: Introduction to Generalized Linear Models Generalized linear

More information

ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples

ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples ST3241 Categorical Data Analysis I Generalized Linear Models Introduction and Some Examples 1 Introduction We have discussed methods for analyzing associations in two-way and three-way tables. Now we will

More information

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION (SOLUTIONS) ST3241 Categorical Data Analysis. (Semester II: )

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION (SOLUTIONS) ST3241 Categorical Data Analysis. (Semester II: ) NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION (SOLUTIONS) Categorical Data Analysis (Semester II: 2010 2011) April/May, 2011 Time Allowed : 2 Hours Matriculation No: Seat No: Grade Table Question 1 2 3

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) T In 2 2 tables, statistical independence is equivalent to a population

More information

The material for categorical data follows Agresti closely.

The material for categorical data follows Agresti closely. Exam 2 is Wednesday March 8 4 sheets of notes The material for categorical data follows Agresti closely A categorical variable is one for which the measurement scale consists of a set of categories Categorical

More information

Multinomial Logistic Regression Models

Multinomial Logistic Regression Models Stat 544, Lecture 19 1 Multinomial Logistic Regression Models Polytomous responses. Logistic regression can be extended to handle responses that are polytomous, i.e. taking r>2 categories. (Note: The word

More information

ST3241 Categorical Data Analysis I Multicategory Logit Models. Logit Models For Nominal Responses

ST3241 Categorical Data Analysis I Multicategory Logit Models. Logit Models For Nominal Responses ST3241 Categorical Data Analysis I Multicategory Logit Models Logit Models For Nominal Responses 1 Models For Nominal Responses Y is nominal with J categories. Let {π 1,, π J } denote the response probabilities

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) (b) (c) (d) (e) In 2 2 tables, statistical independence is equivalent

More information

Chapter 4: Generalized Linear Models-I

Chapter 4: Generalized Linear Models-I : Generalized Linear Models-I Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM [Acknowledgements to Tim Hanson and Haitao Chu] D. Bandyopadhyay

More information

8 Nominal and Ordinal Logistic Regression

8 Nominal and Ordinal Logistic Regression 8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on

More information

(c) Interpret the estimated effect of temperature on the odds of thermal distress.

(c) Interpret the estimated effect of temperature on the odds of thermal distress. STA 4504/5503 Sample questions for exam 2 1. For the 23 space shuttle flights that occurred before the Challenger mission in 1986, Table 1 shows the temperature ( F) at the time of the flight and whether

More information

9 Generalized Linear Models

9 Generalized Linear Models 9 Generalized Linear Models The Generalized Linear Model (GLM) is a model which has been built to include a wide range of different models you already know, e.g. ANOVA and multiple linear regression models

More information

Logistic Regressions. Stat 430

Logistic Regressions. Stat 430 Logistic Regressions Stat 430 Final Project Final Project is, again, team based You will decide on a project - only constraint is: you are supposed to use techniques for a solution that are related to

More information

Simple logistic regression

Simple logistic regression Simple logistic regression Biometry 755 Spring 2009 Simple logistic regression p. 1/47 Model assumptions 1. The observed data are independent realizations of a binary response variable Y that follows a

More information

Linear Regression With Special Variables

Linear Regression With Special Variables Linear Regression With Special Variables Junhui Qian December 21, 2014 Outline Standardized Scores Quadratic Terms Interaction Terms Binary Explanatory Variables Binary Choice Models Standardized Scores:

More information

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION. ST3241 Categorical Data Analysis. (Semester II: ) April/May, 2011 Time Allowed : 2 Hours

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION. ST3241 Categorical Data Analysis. (Semester II: ) April/May, 2011 Time Allowed : 2 Hours NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION Categorical Data Analysis (Semester II: 2010 2011) April/May, 2011 Time Allowed : 2 Hours Matriculation No: Seat No: Grade Table Question 1 2 3 4 5 6 Full marks

More information

ˆπ(x) = exp(ˆα + ˆβ T x) 1 + exp(ˆα + ˆβ T.

ˆπ(x) = exp(ˆα + ˆβ T x) 1 + exp(ˆα + ˆβ T. Exam 3 Review Suppose that X i = x =(x 1,, x k ) T is observed and that Y i X i = x i independent Binomial(n i,π(x i )) for i =1,, N where ˆπ(x) = exp(ˆα + ˆβ T x) 1 + exp(ˆα + ˆβ T x) This is called the

More information

STA6938-Logistic Regression Model

STA6938-Logistic Regression Model Dr. Ying Zhang STA6938-Logistic Regression Model Topic 2-Multiple Logistic Regression Model Outlines:. Model Fitting 2. Statistical Inference for Multiple Logistic Regression Model 3. Interpretation of

More information

Single-level Models for Binary Responses

Single-level Models for Binary Responses Single-level Models for Binary Responses Distribution of Binary Data y i response for individual i (i = 1,..., n), coded 0 or 1 Denote by r the number in the sample with y = 1 Mean and variance E(y) =

More information

Lecture 14: Introduction to Poisson Regression

Lecture 14: Introduction to Poisson Regression Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu 8 May 2007 1 / 52 Overview Modelling counts Contingency tables Poisson regression models 2 / 52 Modelling counts I Why

More information

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview Modelling counts I Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu Why count data? Number of traffic accidents per day Mortality counts in a given neighborhood, per week

More information

Lecture 12: Effect modification, and confounding in logistic regression

Lecture 12: Effect modification, and confounding in logistic regression Lecture 12: Effect modification, and confounding in logistic regression Ani Manichaikul amanicha@jhsph.edu 4 May 2007 Today Categorical predictor create dummy variables just like for linear regression

More information

Explanatory variables are: weight, width of shell, color (medium light, medium, medium dark, dark), and condition of spine.

Explanatory variables are: weight, width of shell, color (medium light, medium, medium dark, dark), and condition of spine. Horseshoe crab example: There are 173 female crabs for which we wish to model the presence or absence of male satellites dependant upon characteristics of the female horseshoe crabs. 1 satellite present

More information

Testing Independence

Testing Independence Testing Independence Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM 1/50 Testing Independence Previously, we looked at RR = OR = 1

More information

Logistic Regression. Some slides from Craig Burkett. STA303/STA1002: Methods of Data Analysis II, Summer 2016 Michael Guerzhoy

Logistic Regression. Some slides from Craig Burkett. STA303/STA1002: Methods of Data Analysis II, Summer 2016 Michael Guerzhoy Logistic Regression Some slides from Craig Burkett STA303/STA1002: Methods of Data Analysis II, Summer 2016 Michael Guerzhoy Titanic Survival Case Study The RMS Titanic A British passenger liner Collided

More information

Binary Logistic Regression

Binary Logistic Regression The coefficients of the multiple regression model are estimated using sample data with k independent variables Estimated (or predicted) value of Y Estimated intercept Estimated slope coefficients Ŷ = b

More information

Cohen s s Kappa and Log-linear Models

Cohen s s Kappa and Log-linear Models Cohen s s Kappa and Log-linear Models HRP 261 03/03/03 10-11 11 am 1. Cohen s Kappa Actual agreement = sum of the proportions found on the diagonals. π ii Cohen: Compare the actual agreement with the chance

More information

Multiple Logistic Regression for Dichotomous Response Variables

Multiple Logistic Regression for Dichotomous Response Variables Multiple Logistic Regression for Dichotomous Response Variables Edps/Psych/Soc 589 Carolyn J. Anderson Department of Educational Psychology c Board of Trustees, University of Illinois Fall 2018 Outline

More information

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture!

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture! Hierarchical Generalized Linear Models ERSH 8990 REMS Seminar on HLM Last Lecture! Hierarchical Generalized Linear Models Introduction to generalized models Models for binary outcomes Interpreting parameter

More information

Analysing data: regression and correlation S6 and S7

Analysing data: regression and correlation S6 and S7 Basic medical statistics for clinical and experimental research Analysing data: regression and correlation S6 and S7 K. Jozwiak k.jozwiak@nki.nl 2 / 49 Correlation So far we have looked at the association

More information

Sections 3.4, 3.5. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

Sections 3.4, 3.5. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis Sections 3.4, 3.5 Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1 / 22 3.4 I J tables with ordinal outcomes Tests that take advantage of ordinal

More information

STAT5044: Regression and Anova

STAT5044: Regression and Anova STAT5044: Regression and Anova Inyoung Kim 1 / 18 Outline 1 Logistic regression for Binary data 2 Poisson regression for Count data 2 / 18 GLM Let Y denote a binary response variable. Each observation

More information

Lecture 10: Introduction to Logistic Regression

Lecture 10: Introduction to Logistic Regression Lecture 10: Introduction to Logistic Regression Ani Manichaikul amanicha@jhsph.edu 2 May 2007 Logistic Regression Regression for a response variable that follows a binomial distribution Recall the binomial

More information

Logistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression

Logistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression Logistic Regression Usual linear regression (repetition) y i = b 0 + b 1 x 1i + b 2 x 2i + e i, e i N(0,σ 2 ) or: y i N(b 0 + b 1 x 1i + b 2 x 2i,σ 2 ) Example (DGA, p. 336): E(PEmax) = 47.355 + 1.024

More information

Basic Medical Statistics Course

Basic Medical Statistics Course Basic Medical Statistics Course S7 Logistic Regression November 2015 Wilma Heemsbergen w.heemsbergen@nki.nl Logistic Regression The concept of a relationship between the distribution of a dependent variable

More information

Model Based Statistics in Biology. Part V. The Generalized Linear Model. Chapter 18.1 Logistic Regression (Dose - Response)

Model Based Statistics in Biology. Part V. The Generalized Linear Model. Chapter 18.1 Logistic Regression (Dose - Response) Model Based Statistics in Biology. Part V. The Generalized Linear Model. Logistic Regression ( - Response) ReCap. Part I (Chapters 1,2,3,4), Part II (Ch 5, 6, 7) ReCap Part III (Ch 9, 10, 11), Part IV

More information

Binary Dependent Variables

Binary Dependent Variables Binary Dependent Variables In some cases the outcome of interest rather than one of the right hand side variables - is discrete rather than continuous Binary Dependent Variables In some cases the outcome

More information

Multiple Regression: Chapter 13. July 24, 2015

Multiple Regression: Chapter 13. July 24, 2015 Multiple Regression: Chapter 13 July 24, 2015 Multiple Regression (MR) Response Variable: Y - only one response variable (quantitative) Several Predictor Variables: X 1, X 2, X 3,..., X p (p = # predictors)

More information

BIOS 625 Fall 2015 Homework Set 3 Solutions

BIOS 625 Fall 2015 Homework Set 3 Solutions BIOS 65 Fall 015 Homework Set 3 Solutions 1. Agresti.0 Table.1 is from an early study on the death penalty in Florida. Analyze these data and show that Simpson s Paradox occurs. Death Penalty Victim's

More information

BMI 541/699 Lecture 22

BMI 541/699 Lecture 22 BMI 541/699 Lecture 22 Where we are: 1. Introduction and Experimental Design 2. Exploratory Data Analysis 3. Probability 4. T-based methods for continous variables 5. Power and sample size for t-based

More information

Section Poisson Regression

Section Poisson Regression Section 14.13 Poisson Regression Timothy Hanson Department of Statistics, University of South Carolina Stat 705: Data Analysis II 1 / 26 Poisson regression Regular regression data {(x i, Y i )} n i=1,

More information

Short Course Introduction to Categorical Data Analysis

Short Course Introduction to Categorical Data Analysis Short Course Introduction to Categorical Data Analysis Alan Agresti Distinguished Professor Emeritus University of Florida, USA Presented for ESALQ/USP, Piracicaba Brazil March 8-10, 2016 c Alan Agresti,

More information

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models SCHOOL OF MATHEMATICS AND STATISTICS Linear and Generalised Linear Models Autumn Semester 2017 18 2 hours Attempt all the questions. The allocation of marks is shown in brackets. RESTRICTED OPEN BOOK EXAMINATION

More information

LOGISTIC REGRESSION. Lalmohan Bhar Indian Agricultural Statistics Research Institute, New Delhi

LOGISTIC REGRESSION. Lalmohan Bhar Indian Agricultural Statistics Research Institute, New Delhi LOGISTIC REGRESSION Lalmohan Bhar Indian Agricultural Statistics Research Institute, New Delhi- lmbhar@gmail.com. Introduction Regression analysis is a method for investigating functional relationships

More information

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3 STA 303 H1S / 1002 HS Winter 2011 Test March 7, 2011 LAST NAME: FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 303 STA 1002 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator. Some formulae

More information

Review of Multinomial Distribution If n trials are performed: in each trial there are J > 2 possible outcomes (categories) Multicategory Logit Models

Review of Multinomial Distribution If n trials are performed: in each trial there are J > 2 possible outcomes (categories) Multicategory Logit Models Chapter 6 Multicategory Logit Models Response Y has J > 2 categories. Extensions of logistic regression for nominal and ordinal Y assume a multinomial distribution for Y. 6.1 Logit Models for Nominal Responses

More information

Logistic Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

Logistic Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University Logistic Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Logistic Regression 1 / 38 Logistic Regression 1 Introduction

More information

Solutions for Examination Categorical Data Analysis, March 21, 2013

Solutions for Examination Categorical Data Analysis, March 21, 2013 STOCKHOLMS UNIVERSITET MATEMATISKA INSTITUTIONEN Avd. Matematisk statistik, Frank Miller MT 5006 LÖSNINGAR 21 mars 2013 Solutions for Examination Categorical Data Analysis, March 21, 2013 Problem 1 a.

More information

MATH c UNIVERSITY OF LEEDS Examination for the Module MATH1725 (May-June 2009) INTRODUCTION TO STATISTICS. Time allowed: 2 hours

MATH c UNIVERSITY OF LEEDS Examination for the Module MATH1725 (May-June 2009) INTRODUCTION TO STATISTICS. Time allowed: 2 hours 01 This question paper consists of 11 printed pages, each of which is identified by the reference. Only approved basic scientific calculators may be used. Statistical tables are provided at the end of

More information

" M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2

 M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2 Notation and Equations for Final Exam Symbol Definition X The variable we measure in a scientific study n The size of the sample N The size of the population M The mean of the sample µ The mean of the

More information

Beyond GLM and likelihood

Beyond GLM and likelihood Stat 6620: Applied Linear Models Department of Statistics Western Michigan University Statistics curriculum Core knowledge (modeling and estimation) Math stat 1 (probability, distributions, convergence

More information

R Hints for Chapter 10

R Hints for Chapter 10 R Hints for Chapter 10 The multiple logistic regression model assumes that the success probability p for a binomial random variable depends on independent variables or design variables x 1, x 2,, x k.

More information

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data Ronald Heck Class Notes: Week 8 1 Class Notes: Week 8 Probit versus Logit Link Functions and Count Data This week we ll take up a couple of issues. The first is working with a probit link function. While

More information

Acknowledgements. Outline. Marie Diener-West. ICTR Leadership / Team INTRODUCTION TO CLINICAL RESEARCH. Introduction to Linear Regression

Acknowledgements. Outline. Marie Diener-West. ICTR Leadership / Team INTRODUCTION TO CLINICAL RESEARCH. Introduction to Linear Regression INTRODUCTION TO CLINICAL RESEARCH Introduction to Linear Regression Karen Bandeen-Roche, Ph.D. July 17, 2012 Acknowledgements Marie Diener-West Rick Thompson ICTR Leadership / Team JHU Intro to Clinical

More information

Binomial Model. Lecture 10: Introduction to Logistic Regression. Logistic Regression. Binomial Distribution. n independent trials

Binomial Model. Lecture 10: Introduction to Logistic Regression. Logistic Regression. Binomial Distribution. n independent trials Lecture : Introduction to Logistic Regression Ani Manichaikul amanicha@jhsph.edu 2 May 27 Binomial Model n independent trials (e.g., coin tosses) p = probability of success on each trial (e.g., p =! =

More information

Ch 6: Multicategory Logit Models

Ch 6: Multicategory Logit Models 293 Ch 6: Multicategory Logit Models Y has J categories, J>2. Extensions of logistic regression for nominal and ordinal Y assume a multinomial distribution for Y. In R, we will fit these models using the

More information

Homework 1 Solutions

Homework 1 Solutions 36-720 Homework 1 Solutions Problem 3.4 (a) X 2 79.43 and G 2 90.33. We should compare each to a χ 2 distribution with (2 1)(3 1) 2 degrees of freedom. For each, the p-value is so small that S-plus reports

More information

Statistical Methods III Statistics 212. Problem Set 2 - Answer Key

Statistical Methods III Statistics 212. Problem Set 2 - Answer Key Statistical Methods III Statistics 212 Problem Set 2 - Answer Key 1. (Analysis to be turned in and discussed on Tuesday, April 24th) The data for this problem are taken from long-term followup of 1423

More information

Regression Methods for Survey Data

Regression Methods for Survey Data Regression Methods for Survey Data Professor Ron Fricker! Naval Postgraduate School! Monterey, California! 3/26/13 Reading:! Lohr chapter 11! 1 Goals for this Lecture! Linear regression! Review of linear

More information

Logistic regression: Miscellaneous topics

Logistic regression: Miscellaneous topics Logistic regression: Miscellaneous topics April 11 Introduction We have covered two approaches to inference for GLMs: the Wald approach and the likelihood ratio approach I claimed that the likelihood ratio

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Stat 704: Data Analysis I, Fall 2010

Stat 704: Data Analysis I, Fall 2010 Stat 704: Data Analysis I, Fall 2010 Generalized linear models Generalize regular regression to non-normal data {(Y i,x i )} N i=1, most often Bernoulli or Poisson Y i. The general theory of GLMs has been

More information

2. We care about proportion for categorical variable, but average for numerical one.

2. We care about proportion for categorical variable, but average for numerical one. Probit Model 1. We apply Probit model to Bank data. The dependent variable is deny, a dummy variable equaling one if a mortgage application is denied, and equaling zero if accepted. The key regressor is

More information

Correlation and regression

Correlation and regression 1 Correlation and regression Yongjua Laosiritaworn Introductory on Field Epidemiology 6 July 2015, Thailand Data 2 Illustrative data (Doll, 1955) 3 Scatter plot 4 Doll, 1955 5 6 Correlation coefficient,

More information

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form: Outline for today What is a generalized linear model Linear predictors and link functions Example: fit a constant (the proportion) Analysis of deviance table Example: fit dose-response data using logistic

More information

Classification. Chapter Introduction. 6.2 The Bayes classifier

Classification. Chapter Introduction. 6.2 The Bayes classifier Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode

More information

Inference. ME104: Linear Regression Analysis Kenneth Benoit. August 15, August 15, 2012 Lecture 3 Multiple linear regression 1 1 / 58

Inference. ME104: Linear Regression Analysis Kenneth Benoit. August 15, August 15, 2012 Lecture 3 Multiple linear regression 1 1 / 58 Inference ME104: Linear Regression Analysis Kenneth Benoit August 15, 2012 August 15, 2012 Lecture 3 Multiple linear regression 1 1 / 58 Stata output resvisited. reg votes1st spend_total incumb minister

More information

Homework 5: Answer Key. Plausible Model: E(y) = µt. The expected number of arrests arrests equals a constant times the number who attend the game.

Homework 5: Answer Key. Plausible Model: E(y) = µt. The expected number of arrests arrests equals a constant times the number who attend the game. EdPsych/Psych/Soc 589 C.J. Anderson Homework 5: Answer Key 1. Probelm 3.18 (page 96 of Agresti). (a) Y assume Poisson random variable. Plausible Model: E(y) = µt. The expected number of arrests arrests

More information

A discussion on multiple regression models

A discussion on multiple regression models A discussion on multiple regression models In our previous discussion of simple linear regression, we focused on a model in which one independent or explanatory variable X was used to predict the value

More information

Logistic regression. 11 Nov Logistic regression (EPFL) Applied Statistics 11 Nov / 20

Logistic regression. 11 Nov Logistic regression (EPFL) Applied Statistics 11 Nov / 20 Logistic regression 11 Nov 2010 Logistic regression (EPFL) Applied Statistics 11 Nov 2010 1 / 20 Modeling overview Want to capture important features of the relationship between a (set of) variable(s)

More information

BIOSTATS Intermediate Biostatistics Spring 2017 Exam 2 (Units 3, 4 & 5) Practice Problems SOLUTIONS

BIOSTATS Intermediate Biostatistics Spring 2017 Exam 2 (Units 3, 4 & 5) Practice Problems SOLUTIONS BIOSTATS 640 - Intermediate Biostatistics Spring 2017 Exam 2 (Units 3, 4 & 5) Practice Problems SOLUTIONS Practice Question 1 Both the Binomial and Poisson distributions have been used to model the quantal

More information

General Linear Model (Chapter 4)

General Linear Model (Chapter 4) General Linear Model (Chapter 4) Outcome variable is considered continuous Simple linear regression Scatterplots OLS is BLUE under basic assumptions MSE estimates residual variance testing regression coefficients

More information

12 Modelling Binomial Response Data

12 Modelling Binomial Response Data c 2005, Anthony C. Brooms Statistical Modelling and Data Analysis 12 Modelling Binomial Response Data 12.1 Examples of Binary Response Data Binary response data arise when an observation on an individual

More information

(Where does Ch. 7 on comparing 2 means or 2 proportions fit into this?)

(Where does Ch. 7 on comparing 2 means or 2 proportions fit into this?) 12. Comparing Groups: Analysis of Variance (ANOVA) Methods Response y Explanatory x var s Method Categorical Categorical Contingency tables (Ch. 8) (chi-squared, etc.) Quantitative Quantitative Regression

More information

Introducing Generalized Linear Models: Logistic Regression

Introducing Generalized Linear Models: Logistic Regression Ron Heck, Summer 2012 Seminars 1 Multilevel Regression Models and Their Applications Seminar Introducing Generalized Linear Models: Logistic Regression The generalized linear model (GLM) represents and

More information

STAC51: Categorical data Analysis

STAC51: Categorical data Analysis STAC51: Categorical data Analysis Mahinda Samarakoon April 6, 2016 Mahinda Samarakoon STAC51: Categorical data Analysis 1 / 25 Table of contents 1 Building and applying logistic regression models (Chap

More information

Section IX. Introduction to Logistic Regression for binary outcomes. Poisson regression

Section IX. Introduction to Logistic Regression for binary outcomes. Poisson regression Section IX Introduction to Logistic Regression for binary outcomes Poisson regression 0 Sec 9 - Logistic regression In linear regression, we studied models where Y is a continuous variable. What about

More information

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014 LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R Liang (Sally) Shan Nov. 4, 2014 L Laboratory for Interdisciplinary Statistical Analysis LISA helps VT researchers

More information

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator UNIVERSITY OF TORONTO Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS Duration - 3 hours Aids Allowed: Calculator LAST NAME: FIRST NAME: STUDENT NUMBER: There are 27 pages

More information

Generalised linear models. Response variable can take a number of different formats

Generalised linear models. Response variable can take a number of different formats Generalised linear models Response variable can take a number of different formats Structure Limitations of linear models and GLM theory GLM for count data GLM for presence \ absence data GLM for proportion

More information

Unit 11: Multiple Linear Regression

Lecture (chapter 13): Association between variables measured at the interval-ratio level

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011)

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing

Binary Regression. GH Chapter 5, ISL Chapter 4. January 31, 2017

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

Generalized logit models for nominal multinomial responses. Local odds ratios

Analysis of Categorical Data. Nick Jackson University of Southern California Department of Psychology 10/11/2013

Assessing the Calibration of Dichotomous Outcome Models with the Calibration Belt

Good Confidence Intervals for Categorical Data Analyses. Alan Agresti

Generalized linear models

Regression so far... Lecture 21 - Logistic Regression. Odds. Recap of what you should know how to do... At this point we have covered: Sta102 / BME102

STAT 705 Chapter 16: One-way ANOVA

More Statistics tutorial at Logistic Regression and the new:

McGill University. Faculty of Science. Department of Mathematics and Statistics. Statistics Part A Comprehensive Exam Methodology Paper

Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links

9. Linear Regression and Correlation
