Chapter 12.8 Logistic Regression


Logistic regression is an example of a large class of regression models called generalized linear models (GLMs).

An Observational Case Study: The Donner Party

(Grayson, D.K., 1990, Donner Party deaths: a demographic assessment, Journal of Anthropological Research, 46; and Ramsey, F.L. and Schafer, D.W., 2002, The Statistical Sleuth, 2nd Ed., Duxbury Press.)

In 1846, the Donner and Reed families left Illinois for California by covered wagon (87 people, 20 wagons). They attempted a new and untried crossing of the region between Ft. Bridger, Wyoming and the Sacramento Valley. After numerous problems and delays in Utah, they reached the eastern Sierra Nevada in late October. They were stranded near Lake Tahoe by a series of snowstorms that left as much as 8 feet of snow by some accounts. By the time they were rescued in April of the following year, 40 members had died. Some (or perhaps all) of those that survived did so by resorting to cannibalism.

The researchers attempted to address questions such as whether females are better able to withstand harsh conditions than men, and whether the odds of survival varied with age. Grayson was able to reconstruct records on survival, age and gender for 45 individuals.

Summary of Findings: The odds of survival for females were estimated to be 4.9 times the odds of survival for men of the same age. An approximate 95% confidence interval for this odds ratio is 1.1 to 21.7.

We call π/(1 - π) the odds of the event of interest. Here, π is the probability of the event of interest (e.g., survival), so π = P(S). The complement of S is denoted by S̄.

If π = 0.5, the odds of S (relative to S̄) are 0.5/0.5 = 1
If π = 0.75, the odds of S (relative to S̄) are 0.75/0.25 = 3, or 3 to 1
If π = 0.25, the odds of S (relative to S̄) are 0.25/0.75 = 0.33, or 1 to 3

The logistic regression model is a model of log(odds of S), though it is easy to recover the estimated odds of S, and the probability of S, from the fitted model.

Donner Party revisited

Consider a plot of survivorship (= 1 implies survival and = 0 implies death) for each of the 45 participants, against age, by gender (Figure 1).

Figure 1: Survivorship for the Donner Party participants. Fitted least squares lines (for each gender) are shown. n = 45

Figure 1 illustrates two common failures when using ordinary least squares regression with binary response variables:

1. First, the predicted values are inconsistent with the possible expected values of the response variable. Specifically, E(Y_i) = π_i is the expected value of the ith (binary) response. Here π_i is also the probability of surviving, so 0 ≤ π_i ≤ 1, yet the fitted lines fall below 0 and above 1 even within the observed range of age.

2. Secondly, it was observed last semester that the survivorship rate for males was consistently lower than that for females, but the regression lines indicate that older males have a greater survivorship rate than older females. The model is not consistent with a straightforward contingency table analysis.

The two problems noted above are absent from Figure 2, which shows estimated probabilities from a logistic regression model. Figure 2 shows the data with the fitted logistic regression model superimposed. The lines show the estimated probability of survival as a function of age and gender.

Scope of Inference: Because the data are observational, the results cannot be used to infer that women are more apt to survive than men. There may be confounding variables that account for the apparent difference (e.g., behavior). Moreover, these 45 individuals are not a random sample from any identifiable population to which inference may be made.
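To make the failure of least squares concrete, here is a minimal sketch that fits both an ordinary least squares line and a logistic regression to simulated binary survival data. The data, coefficients, and seed are invented for illustration; they are not the actual Donner records.

```python
# A minimal sketch: least squares vs. logistic regression on a binary
# response. All data below are simulated, not the actual Donner records.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
age = rng.uniform(15, 65, size=45)
p_true = 1 / (1 + np.exp(-(1.6 - 0.08 * age)))   # hypothetical true model
y = rng.binomial(1, p_true)                      # 1 = survived, 0 = died

X = sm.add_constant(age)
ols_fit = sm.OLS(y, X).fit()
logit_fit = sm.Logit(y, X).fit(disp=False)

grid = sm.add_constant(np.linspace(15, 65, 5))
print("OLS predictions:     ", np.round(ols_fit.predict(grid), 2))    # may leave [0, 1]
print("Logistic predictions:", np.round(logit_fit.predict(grid), 2))  # always in (0, 1)
```

The least squares predictions can fall outside [0, 1] within the observed range of age, while the logistic fit always produces values strictly between 0 and 1.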

Figure 2: Survivorship for the Donner Party participants. Fitted logistic regression lines (for each gender) are shown. n = 45

A Retrospective Case Study (see Ramsey and Schafer, The Statistical Sleuth, 2nd ed.)

(Holst, P.A., Kromhout, D. and Brand, R., 1988, For debate: pet birds as an independent risk factor for lung cancer, British Medical Journal, 297.)

A health survey in The Hague, Netherlands presented evidence of an association between keeping pet birds and increased risk of lung cancer. To investigate this link further, researchers conducted a case-control study of patients at four hospitals in The Hague. They identified 49 cases of lung cancer among patients that were younger than 65 and were long-term residents of the city. They also selected 98 controls from the population of city residents having the same general age structure as the cancer cases. Data were gathered on the following variables:

1. Sex (1 = F, 0 = M)
2. Age, in years
3. Socioeconomic status (1 = High, 0 = Low), determined by the occupation of the household's principal wage earner
4. Years of smoking prior to diagnosis or examination
5. Average rate of smoking (cigarettes/day)
6. Indicator of birdkeeping. Birdkeeping was defined as keeping caged birds in the home for more than 6 consecutive months from 5 to 14 years before diagnosis (cases) or examination (controls)

Scope of Inference: Inference extends to the population of lung cancer patients and unaffected individuals in The Hague in 1985 (the study year). Statistical analysis of these observational data cannot be used as evidence that birdkeeping causes lung cancer. However, there is medical rationale supporting this conclusion: people who keep birds inhale and expectorate excess allergens and dust particles, which increases the likelihood of dysfunction of lung macrophages, which in turn may lead to a diminished immune system response.

This is a retrospective study. A retrospective study is one in which two (sub)populations (cases and controls) are sampled at different rates. A comparison of the number of individuals in each sample provides no information regarding the probability that a randomly sampled birdkeeper has lung cancer. However, a comparison of the cases and controls provides a relative measure of increased risk (the actual level of risk is not addressed). In essence, if the proportion of birdkeepers among the cases is twice the proportion of birdkeepers among the controls, then the data indicate that the odds of lung cancer are twice as great for birdkeepers compared to nonbirdkeepers.

Generalized Linear Models (GLMs)

A GLM is a probability model in which the mean, or expected value, of a response variable is related to a set of explanatory variables through a regression equation. The usual multiple regression model is an example:

µ = E(Y) = β_0 + β_1 x_1 + ··· + β_{p-1} x_{p-1}

To handle distributions besides the normal distribution, we must model a nonlinear function of µ. For example, if the response variable is Poisson in distribution, then the GLM is

log µ = β_0 + β_1 x_1 + ··· + β_{p-1} x_{p-1}

If the response variable is Binomial in distribution, then the GLM is

log[µ/(n - µ)] = β_0 + β_1 x_1 + ··· + β_{p-1} x_{p-1}

where n is the number of trials. If the response variable is Binary (or Bernoulli) in distribution, then the GLM is

log[π/(1 - π)] = β_0 + β_1 x_1 + ··· + β_{p-1} x_{p-1}

because µ = nπ = π when the distribution is Bernoulli.
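As a sketch of how these mean functions correspond to different response distributions, the following simulates Poisson and Bernoulli responses from the same form of linear predictor and fits each with the matching link. All numbers are invented for illustration.

```python
# A minimal sketch of GLMs with different link functions (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=200)
X = sm.add_constant(x)
eta = 0.5 + 0.3 * x                               # linear predictor

y_pois = rng.poisson(np.exp(eta))                 # log link: log(mu) = eta
y_bern = rng.binomial(1, 1 / (1 + np.exp(-eta)))  # logit link: log(pi/(1-pi)) = eta

pois_fit = sm.GLM(y_pois, X, family=sm.families.Poisson()).fit()
bern_fit = sm.GLM(y_bern, X, family=sm.families.Binomial()).fit()
print(np.round(pois_fit.params, 2))   # both should land near (0.5, 0.3)
print(np.round(bern_fit.params, 2))   # on the link (linear predictor) scale
```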

In general, there is some link function, say g(µ), that links the linear model to µ. Provided that the link function is appropriate, nearly all of the ordinary multiple regression methods (residual analysis, hypothesis testing, confidence intervals) carry over to the GLM with only minor modifications.

Logistic regression is used for modeling the mean of a binary, or Bernoulli, random variable. A Bernoulli random variable is a Binomial random variable for which the number of trials is n = 1. Only one of two outcomes (S or S̄) is possible, and the probability of S is denoted by P(S) = π. Consequently, P(S̄) = 1 - P(S) = 1 - π. The mean of a Binomial random variable is µ = nπ, and the variance is σ² = nπ(1 - π). Thus, for a binary random variable Y defined as

Y = 1, if the outcome is S
Y = 0, if the outcome is S̄,

E(Y) = µ = π and Var(Y) = π(1 - π).

The logit function is the link between µ and the linear model. Let η denote the linear portion of the model, i.e.,

η = β_0 + β_1 x_1 + ··· + β_{p-1} x_{p-1}

The logit function of π is

logit(π) = log[π/(1 - π)]

Thus, for logistic regression, logit(π) = η. Recall that π/(1 - π) is the odds of S. If π = 0.5, the odds of S (relative to S̄) are 1.

The logit function has the effect of stretching the possible values of π from (0, 1) to (-∞, ∞). This is very helpful with respect to the computational aspects of model fitting.

The inverse of the logit is important. We calculate it as follows:

η = log[π/(1 - π)]
e^η = π/(1 - π)
e^{-η} = (1 - π)/π = 1/π - 1
1 + e^{-η} = 1/π
π = 1/(1 + e^{-η}) = e^η/(1 + e^η)

Figure 3 shows the logit and its inverse.

When logistic regression is used to model the probability of an outcome, model fitting produces estimates of the parameters β_0, ..., β_{p-1}. It is simple, then, to obtain an estimate of the linear predictor η given the parameter estimates and specific values of x_1, ..., x_{p-1}.
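A small sketch of the odds and logit computations, reproducing the π = 0.25, 0.5, 0.75 examples above:

```python
# Odds and logit for a few probabilities (values from the examples above).
import numpy as np

def odds(pi):
    return pi / (1 - pi)

def logit(pi):
    return np.log(odds(pi))

for pi in (0.25, 0.5, 0.75):
    print(f"pi = {pi:.2f}   odds = {odds(pi):.2f}   logit = {logit(pi):+.2f}")
# The logit maps probabilities in (0, 1) onto the whole real line.
```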

Figure 3: The logit function (left panel) and its inverse (right panel).

It is necessary to be able to determine the estimated probability of the outcome of interest given specific values x_1, ..., x_{p-1}. This can be done in two steps:

1. Compute η̂ = β̂_0 + β̂_1 x_1 + ··· + β̂_{p-1} x_{p-1}

2. Compute

π̂ = exp(η̂) / [1 + exp(η̂)]    (1)

or, more directly, by computing

π̂ = exp(β̂_0 + β̂_1 x_1 + ··· + β̂_{p-1} x_{p-1}) / [1 + exp(β̂_0 + β̂_1 x_1 + ··· + β̂_{p-1} x_{p-1})]    (2)
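A sketch of the two-step computation, with made-up coefficient estimates standing in for the β̂'s:

```python
# Two-step computation of an estimated probability (Eqs. 1 and 2).
# The coefficients below are made-up placeholders, not fitted values.
import numpy as np

def predicted_probability(beta_hat, x):
    eta_hat = beta_hat[0] + np.dot(beta_hat[1:], x)   # Step 1: linear predictor
    return np.exp(eta_hat) / (1 + np.exp(eta_hat))    # Step 2: inverse logit

beta_hat = np.array([-1.6, 0.08, -1.6])   # hypothetical intercept, age, gender
x = np.array([25.0, 1.0])                 # age 25, gender = 1
print(round(predicted_probability(beta_hat, x), 3))
```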

A useful interpretation of β_1 is obtained by considering the odds ratio. The odds ratio expresses how much more (or less) likely event A is to occur than event B. Suppose that A and B are two outcomes. To be concrete, suppose that A is the outcome that a heavy smoker has lung cancer, and B is the outcome that a nonsmoker has lung cancer. The increase in risk of lung cancer can be quantified by comparing the odds of having lung cancer for the smoker to the odds of having lung cancer for the nonsmoker.

Suppose that the probability of A is π_A = P(A). Specifically, given that an individual is a heavy smoker, the probability of having lung cancer is π_A = P(A). Similarly, given that an individual is a nonsmoker, the probability of having lung cancer is the probability of the outcome B, which is denoted by π_B = P(B).

The odds that A will occur are π_A/(1 - π_A), and the odds of B are π_B/(1 - π_B). The odds ratio is

[π_A/(1 - π_A)] / [π_B/(1 - π_B)] = π_A(1 - π_B) / [π_B(1 - π_A)]

For example, if P(A) = π_A = 0.75 and P(B) = π_B = 0.25, then the odds of A relative to B are

π_A(1 - π_B) / [π_B(1 - π_A)] = 0.75(1 - 0.25) / [0.25(1 - 0.75)] = 3 × 3 = 9

We say that the odds of the outcome A are 9 times the odds of the outcome B.

For example, if P(A) = π_A = 0.75 and P(B) = π_B = 0.6, then the odds of A relative to B are

π_A(1 - π_B) / [π_B(1 - π_A)] = 0.75(1 - 0.6) / [0.6(1 - 0.75)] = 0.30/0.15 = 2

Said another way, the odds of A are 3 : 1 and the odds of B are 0.6/0.4 = 1.5, or 3 : 2; hence, the odds of A are twice those of B.
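The two worked examples can be checked with a one-line odds-ratio function:

```python
# Odds ratio pi_A(1 - pi_B) / [pi_B(1 - pi_A)], checking the examples above.
def odds_ratio(pi_a, pi_b):
    return (pi_a * (1 - pi_b)) / (pi_b * (1 - pi_a))

print(odds_ratio(0.75, 0.25))   # 9.0: odds of A are 9 times the odds of B
print(odds_ratio(0.75, 0.60))   # 2.0: odds of A are twice the odds of B
```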

Interpretation of logistic regression coefficients

Consider the logit model of π as a function of a single explanatory variable x:

log[π/(1 - π)] = η = β_0 + β_1 x

Let π_A denote the probability when x = x_A:

π_A/(1 - π_A) = e^{β_0 + β_1 x_A}

Let π_B denote the probability when x = x_B:

π_B/(1 - π_B) = e^{β_0 + β_1 x_B}

The ratio of the odds of success (i.e., of S) when x = x_A relative to x = x_B is

[π_A/(1 - π_A)] / [π_B/(1 - π_B)] = e^{β_0 + β_1 x_A} / e^{β_0 + β_1 x_B} = e^{β_0 + β_1 x_A - (β_0 + β_1 x_B)} = e^{β_1(x_A - x_B)}

This means that if x_A differs from x_B by one unit, i.e., x_A - x_B = 1, then e^{β_1(x_A - x_B)} = e^{β_1}, and the odds of success change by a multiplicative factor of e^{β_1} when x = x_A compared to x = x_B.

For example, suppose that β_1 = 0.5; then e^{β_1} = e^{0.5} = 1.649. So, a one-unit change in x will increase the odds of success by a factor of 1.649.

For example, suppose that β_1 = -0.5; then e^{β_1} = e^{-0.5} = 1/1.649 = 0.607. Thus, a one-unit change in x will decrease the odds of success by a (multiplicative) factor of 0.607. The odds of success are 1 - 0.607 = 0.393, or about 39.3%, less as a result of a one-unit change in x.

Figure 4 plots β on the x-axis and e^β on the y-axis, and shows how a one-unit change in x changes the odds of success.

Figure 4: The function e^β plotted against β.

Suppose that more than a single explanatory variable is used to model the logit of π. A particular estimated coefficient β̂_k is interpreted in much the same manner as in linear regression with multiple explanatory variables. Suppose that all variables x_2, ..., x_{p-1} are held constant, and we compare the effect of a one-unit change in x_1 on the odds of S. The change is best measured with respect to the odds ratio. Let π_1 denote the probability of S for some value, say x_1, and π_2 denote the probability of S at x_1 + 1. The odds ratio is

[π_2/(1 - π_2)] / [π_1/(1 - π_1)] = e^{β_0 + β_1(x_1 + 1) + ··· + β_{p-1} x_{p-1}} / e^{β_0 + β_1 x_1 + ··· + β_{p-1} x_{p-1}} = e^{β_1(x_1 + 1)} / e^{β_1 x_1} = e^{β_1}

We say that if x_1 changes by one unit, then the odds of S change by a multiplicative factor of e^{β_1}, provided that all other variables are held constant.

More generally, if x_1 changes from A to B, then the odds of S change by a multiplicative factor of e^{β_1(A - B)}, provided that all other variables are held constant, because

[π_A/(1 - π_A)] / [π_B/(1 - π_B)] = e^{β_0 + β_1 A + ··· + β_{p-1} x_{p-1}} / e^{β_0 + β_1 B + ··· + β_{p-1} x_{p-1}} = e^{β_1 A}/e^{β_1 B} = e^{β_1(A - B)}

Example: logistic regression produces a fitted model for the Donner Party. The fitted model is of the log-odds of π = Pr(Death), and it is

logit(π̂) = -1.633 + 0.078 x_AGE - 1.597 x_GENDER,

where x_GENDER = 1 if Female, 0 if Male.

The odds of death increase by a multiplicative factor of e^{0.078} = 1.08 for every year of life if gender is held constant (for both males and females). In other words, the risk increases by about 8% for every year of life. An approximate 95% CI for this factor is 1.005 to 1.164.
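Using the age coefficient reported above, a short sketch of the multiplicative-odds interpretation:

```python
# Multiplicative change in the odds implied by a coefficient.
import numpy as np

beta_age = 0.078                 # age coefficient reported in the notes
print(np.exp(beta_age))          # ~1.08: odds of death rise about 8% per year
print(np.exp(beta_age * 10))     # ~2.18: comparing ages 10 years apart
print(np.exp(-0.5))              # a negative coefficient shrinks the odds (~0.61)
```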

The odds of death for females are estimated to be e^{-1.597} = 0.20 times the odds of death for males of the same age. An approximate 95% confidence interval for this odds ratio is 0.046 to 0.89.

We also can say that the odds of death for males are estimated to be 1/e^{-1.597} = e^{1.597} = 4.9 times the odds of death for females of the same age. An approximate 95% confidence interval for this odds ratio is 1/0.89 = 1.12 to 1/0.046 = 21.7.

To get a direct comparison of probabilities, we need to specify either the probability of death for females (at a specific age) to get the probability of death for males at the same age, or specify the probability of death for males to get the corresponding probability for females. For example, suppose that the probability of death for females is π_F = 0.5. Then

π_M/(1 - π_M) = 4.9 × π_F/(1 - π_F) = 4.9

Solving the righthand equation for π_M yields π_M = 4.9/5.9 = 0.83. More generally,

π_M = [1 + e^{-1.597} (1 - π_F)/π_F]^{-1}

This formula yields π_M = 0.71 when π_F = 1/3 and π_M = 0.91 when π_F = 2/3.

Estimation of Logistic Regression Coefficients

In contrast to ordinary multiple regression, the regression parameters are estimated by maximizing the likelihood of the data (instead of minimizing the sum of the squared residuals). Logistic regression parameter estimates are maximum likelihood estimates.
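The conversion from one group's probability to the other's, given the odds ratio, can be sketched as:

```python
# Convert pi_F and an odds ratio into pi_M, as in the example above.
def male_probability(pi_f, ratio=4.9):
    odds_m = ratio * pi_f / (1 - pi_f)    # odds_M = 4.9 * odds_F
    return odds_m / (1 + odds_m)

for pi_f in (1/3, 0.5, 2/3):
    print(round(male_probability(pi_f), 2))   # 0.71, 0.83, 0.91
```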

The likelihood of the data, in the case of discrete data (for continuous data the definition is somewhat more complicated), is the probability of obtaining the sample, expressed as a function of the parameters. By maximizing the likelihood through the choice of the parameter estimates, we make the probability of obtaining the sample as large as possible. Any other choice of parameter estimates results in a lesser probability.

An example involving one parameter is a random sample of n = 20 Bernoulli observations. Suppose the sample is y = (0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1), and that each observation is generated from the Bin(1, π) distribution. Probabilities from the Bin(1, π) distribution can be computed according to the formula

P(Y_1 = y_1) = π^{y_1}(1 - π)^{1 - y_1}, for y_1 = 0 or 1

Because the sample is random, the observations are independent, and the likelihood L(π; y) of observing this sample of 20 observations is the product of the individual probabilities; that is,

L(π; y) = P(Y_1 = 0, Y_2 = 0, ..., Y_20 = 1)
        = P(Y_1 = 0) × P(Y_2 = 0) × ··· × P(Y_20 = 1)
        = π^0(1 - π)^1 × π^0(1 - π)^1 × ··· × π^1(1 - π)^0
        = π^4(1 - π)^{16}

because there are four 1's and sixteen 0's in the sample y.

Figure 5 (left panel) graphs L(p; y) versus p and shows how the likelihood of the sample varies with 0 < p < 1.
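A grid evaluation of this likelihood locates its maximum numerically:

```python
# Evaluate L(p; y) = p^4 (1 - p)^16 on a grid and locate the maximum.
import numpy as np

p = np.linspace(0.001, 0.999, 999)
likelihood = p**4 * (1 - p)**16
print(p[np.argmax(likelihood)])   # ~0.20, the sample proportion 4/20
```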

Note that the value of p that maximizes the likelihood is 0.2. Consequently, the maximum likelihood estimate of p is 0.2.

Figure 5: The left panel shows the likelihood L(p; y) = p^4(1 - p)^{16} as a function of 0 < p < 1. The right panel shows log[L(p; y)] as a function of 0 < p < 1.

In this example, the MLE is the same as our previous estimator of π, namely, the sample proportion

p̂ = (# successes)/(# trials) = 4/20

It is easier to work with the log-likelihood, i.e., log L(π; y), instead of the likelihood, because the logarithm converts products to sums, and the likelihood function is a product of individual terms arising from each observation. Note that the value of p that maximizes log L(p; y) also maximizes L(p; y) (right panel of Figure 5). Because log(x) is a strictly increasing function of x, the maximizing value of log L(p; y) is the same as the maximizing value of L(p; y).

For the logistic regression model, the probability of survival for the ith individual is a function of the parameters β_0, β_1 and β_2. Two preliminary calculations help simplify the log-likelihood:

1. First,

π_i = e^{β_0 + β_1 x_{AGE,i} + β_2 x_{GENDER,i}} / [1 + e^{β_0 + β_1 x_{AGE,i} + β_2 x_{GENDER,i}}] = e^{η_i}/(1 + e^{η_i}),

which leads to

1 - π_i = [1 + e^{η_i}]^{-1}

2. The log-probability for the ith observation can then be expressed as

log[P(Y_i = y_i)] = log[π_i^{y_i}(1 - π_i)^{1 - y_i}]
 = y_i log π_i + (1 - y_i) log(1 - π_i)
 = y_i log[π_i/(1 - π_i)] + log(1 - π_i)
 = y_i η_i + log[1/(1 + e^{η_i})]
 = y_i(β_0 + β_1 x_{AGE,i} + β_2 x_{GENDER,i}) - log[1 + e^{β_0 + β_1 x_{AGE,i} + β_2 x_{GENDER,i}}]

Then,

log[L(β_0, β_1, β_2; y)] = Σ_{i=1}^{45} log[P(Y_i = y_i)]
 = Σ_{i=1}^{45} { y_i(β_0 + β_1 x_{AGE,i} + β_2 x_{GENDER,i}) - log[1 + e^{β_0 + β_1 x_{AGE,i} + β_2 x_{GENDER,i}}] }
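This log-likelihood can be maximized numerically. Here is a sketch using simulated stand-in data (not the Donner records) and a generic optimizer:

```python
# Maximize the logistic log-likelihood numerically (simulated data).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n),                  # intercept
                     rng.uniform(15, 65, n),      # age
                     rng.binomial(1, 0.5, n)])    # gender indicator
beta_true = np.array([-1.6, 0.08, -1.6])          # hypothetical truth
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

def neg_loglik(beta):
    eta = X @ beta
    # log L = sum_i [ y_i * eta_i - log(1 + e^{eta_i}) ]
    return -np.sum(y * eta - np.log1p(np.exp(eta)))

fit = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
print(np.round(fit.x, 2))   # should land near beta_true
```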

Numerical optimization methods are used to determine the values of β̂_0, β̂_1 and β̂_2 that maximize log[L(β_0, β_1, β_2; y)]. For the Donner Party data, Table 1 (from Ramsey, F.L. and Schafer, D.W., The Statistical Sleuth, 2nd Ed.) shows log[L(β_0, β_1, β_2; y)] for a few of the possible values of β_0, β_1 and β_2.

Table 1: log[L(β_0, β_1, β_2; y)] as a function of β_0, β_1, β_2 (columns: β_0, β_1, β_2, log[L(β_0, β_1, β_2; y)]).

The last line in the table shows the value of log[L(β_0, β_1, β_2; y)] at the MLEs. These MLEs differ from those above because Ramsey and Schafer set π = P(Survival), whereas I set π = P(Death).

Properties of MLEs

If the model is correct and the sample size is large, then

1. MLEs are nearly unbiased.
2. The standard errors of MLEs can be estimated, and the estimates are nearly unbiased.
3. MLEs are more precise than nearly all other estimators.
4. The distribution of an MLE is approximately normal (a consequence of the Central Limit Theorem).

Tests and Confidence Intervals Involving a Single Parameter

The last property above implies that if β̂_i is the MLE of β_i, then β̂_i ∼ N(β_i, σ_{β̂_i}), approximately, where σ_{β̂_i} is the standard error of β̂_i. This gives the Wald test of H_0: β_i = 0 versus H_1: β_i ≠ 0. The test statistic is

Z = β̂_i / σ̂_{β̂_i}

If H_0 is true, then Z ∼ N(0, 1). Suppose that the observed value of the test statistic is z; then the p-value is 2P(Z ≥ |z|). The normal approximation may be poor in some instances, and so the test must be treated with some caution. In essence, the p-value may not be very accurate.

For example, in the Donner Party data analysis, the linear portion of a logistic regression model of P(Death) is

η_i = β_0 + β_1 x_{AGE,i} + β_2 x_{GENDER,i} + β_3 x_{AGE×GENDER,i},

and SPSS reports β̂_3 = 0.162, σ̂_{β̂_3} = 0.092, and Sig. = 0.084.

The p-value is approximately 0.08 because (ignoring rounding error)

Z = β̂_3/σ̂_{β̂_3} = 0.162/0.092 = 1.73

and 2P(Z ≥ 1.73) = 0.084.

Also, SPSS reports that e^{β̂_3} = e^{0.162} = 1.18. This means that the difference in the odds of death resulting from a one-year increase in age is greater for females (GENDER = 1) than for males by a factor of 1.18, given that age is held constant. An approximate 95% CI for this factor is 0.98 to 1.41.

In my opinion, the interaction between gender and age is not supported by the data, and I adopt the no-interaction model. Earlier, it was reported that the odds of death increase by a multiplicative factor of e^{0.078} = 1.08 for every year of life if gender is held constant. So, if I compare a female at age 20 versus a female at age 30, the odds of death are e^{0.078 × 10} = 2.18 times greater for the older female. An approximate 95% CI for this increase is obtained by computing the anti-logs of the upper and lower bound estimates on the parameter; specifically, the interval is 1.05 to 4.56.

The Drop-in-Deviance Test

More generally, testing the significance of a covariate or a factor in logistic regression is conducted using the same principles as in linear regression. The following test is preferred to the Wald statistic because the p-values associated with the test statistic are more accurate. Further, more than one variable can be tested for significance, and so this test must be used if a factor with more than 2 levels is to be tested for significance.

Specifically, we compare the fit of two models: the full model, which contains the covariate or the factor of interest, and a reduced model that is the same as the full model except that the covariate or factor is not in the model. If we are assessing the importance of a factor with k levels, so that k - 1 indicator variables are used to account for it, then all k - 1 indicators are removed from the full model to get the reduced model.

Suppose that g variables are used to account for the term (if the term is a covariate, g = 1), and the parameters that multiply the variables are β_1, β_2, ..., β_g. Then the null hypothesis of interest is

H_0: β_1 = β_2 = ··· = β_g = 0

and the research hypothesis is H_1: at least one of β_1, β_2, ..., β_g is not 0.

The drop-in-deviance test uses the likelihood ratio statistic. This statistic is 2 times the logarithm of the ratio of the two likelihoods; specifically,

LRT = 2 log{ L(full model; y) / L(reduced model; y) }
    = 2 log[L(full model; y)] - 2 log[L(reduced model; y)]

If H_0 is true, then the LRT is approximately chi-square in distribution with g degrees of freedom, i.e., LRT ∼ χ²_g.
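A sketch of the Wald calculation for the interaction coefficient above:

```python
# Wald z-statistic and two-sided p-value for beta_3 = 0.162, se = 0.092.
from scipy.stats import norm

z = 0.162 / 0.092
p_value = 2 * norm.sf(abs(z))           # 2 * P(Z >= |z|)
print(round(z, 2), round(p_value, 3))   # ~1.76 and ~0.08
```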

The entire procedure is often called the drop-in-deviance test (analogous to the extra-sums-of-squares test, which uses an F-statistic). The deviance of a model, say model M, is -2 times the log-likelihood:

D(M) = -2 log[L(M; y)]

Recall that the likelihood is the probability of obtaining the sample, as a function of a set of parameters. A large likelihood corresponds to a small deviance because of the sign change induced by the factor -2. Hence, the larger the deviance, the worse the fit of the model. Similar to the error sums of squares in ordinary regression, we cannot interpret the deviance in isolation (without comparing it to some other model).

The likelihood ratio statistic comparing two models can be expressed in terms of the deviance:

LRT = D(reduced) - D(full)

Example: Donner Party data. Table 2 shows an analysis of deviance table.

Table 2: Likelihood Ratio Tests (columns: Effect, -2 Log Likelihood of Reduced Model, Chi-Square, df, Sig.; rows: Intercept, AGE, GENDER).

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are equal to 0.
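The deviance form of the test is a one-liner. The deviances below are placeholders chosen so that the LRT matches the AGE test (6.030) in the next example:

```python
# Drop-in-deviance test: LRT = D(reduced) - D(full), compared to chi-square_g.
from scipy.stats import chi2

def drop_in_deviance(dev_reduced, dev_full, g):
    lrt = dev_reduced - dev_full
    return lrt, chi2.sf(lrt, g)              # P(chi-square_g >= LRT)

# Placeholder deviances whose difference equals the LRT reported below:
print(drop_in_deviance(57.29, 51.26, g=1))   # (6.03, ~0.014)
```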

The test of significance for AGE compares the deviance of the full model to the deviance of the model without AGE but including GENDER. The test statistic is LRT = 6.030, and P(χ²_1 > 6.030) = 0.014. Hence, there is strong evidence (p-value = 0.014) that there are differences in the probability of survival at different ages.

A Case-Control Study: Birdkeeping and Lung Cancer

The objective was to determine if there is an increased probability of lung cancer associated with birdkeeping, even after accounting for other factors (e.g., smoking). Factors (and covariates) are:

1. Sex (1 = F, 0 = M)
2. Age, in years
3. Socioeconomic status (1 = High, 0 = Low), determined by the occupation of the household's principal wage earner
4. Years of smoking prior to diagnosis or examination
5. Average rate of smoking (cigarettes/day)
6. Indicator of birdkeeping. Birdkeeping was defined as keeping caged birds in the home for more than 6 consecutive months from 5 to 14 years before diagnosis (cases) or examination (controls)

The first step is to examine the effect of smoking on lung cancer. Figure 6 compares the relationship between age and smoking, and whether the individuals kept birds, for the cases and the controls.

Figure 6: Age versus years of smoking, by case (lung cancer) and control (no lung cancer), in separate panels for birdkeeping and no birdkeeping. n = 147

Figure 6 shows that smoking is strongly associated with lung cancer: of the cases, all but one had been smoking for more than 10 years, and all but 4 had been smoking for more than 20 years. Comparing cases (LUNGCANCER) to controls (NOCANCER) reveals that relatively more individuals were birdkeepers among the cases than among the controls. Birdkeeping does not appear to be associated with years of smoking or age.

The researchers' (main) objective was to determine whether there is an association between birdkeeping and the incidence of lung cancer. This statement implies the model fitting strategy should be to fit a rich model with all available explanatory variables besides birdkeeping, and determine whether the indicator of birdkeeping significantly improves the fit of the rich model. If birdkeeping is significant, then the difference in the odds of contracting lung cancer between birdkeepers and others should be estimated.

The drop-in-deviance statistic obtained from adding birdkeeping to the main-effects-only model is LRT = 11.7, and the associated p-value is 0.0006. The odds of contracting cancer are estimated to be 3.77 times greater when comparing two individuals that differ according to whether they keep birds or not, given that all other variables are held constant. An approximate 95% confidence interval for this increase is 1.7 to 8.5.

We may be interested in the significance of the other variables, though it must be admitted that there may be other variables associated with the incidence of lung cancer that were not measured. I would not try to describe, in general, what variables are associated with lung cancer, because of these unobserved, or latent, variables. However, looking among those variables that are available, it is of interest which show an association with lung cancer. A good-fitting model (found by fitting all main effects and sequentially eliminating non-significant variables) is shown in Table 3.

Table 3: Parameter estimates from the final model (Birdkeeping data) (columns: Variable, Parameter estimate, Estimated standard error, Wald statistic, Approx. p-value; rows: Intercept, Years smoking, Bird keeping).
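The reported p-value follows directly from the chi-square tail area:

```python
# Tail area for the birdkeeping drop-in-deviance: LRT = 11.7 on 1 df.
from scipy.stats import chi2

print(chi2.sf(11.7, 1))   # ~0.0006
```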

The odds of contracting cancer are estimated to be 1.7 times greater when comparing two individuals that differ by 10 years in the length of time that they have smoked, given that all other factors are the same.

The odds of contracting cancer are estimated to be 4.38 times greater when comparing two individuals that differ according to whether they keep birds or not, given that all other factors are the same.

AIC and BIC

Two measures of model fit (and variants on each) are sometimes used in model fitting when the researcher has a large collection of explanatory variables and little scientific understanding of the problem to guide model fitting. These are Akaike's information criterion (AIC) and the Bayesian information criterion (BIC).

These measures use a penalty term that depends on the number of variables in the model. The difference between the two is the size of the penalty. The BIC penalty is greater than the AIC penalty when the sample size is large.

The AIC penalty is 2p, where p is the number of model parameters. The AIC measure of the fit of model M_1 is AIC(M_1) = D(M_1) + 2p, where D(M_1) is the deviance associated with model M_1.

The BIC penalty is p ln(n), where n is the number of observations. The BIC measure of the fit of model M_1 is BIC(M_1) = D(M_1) + p ln(n).

Figure 7 shows the penalty functions as they vary with n. AIC does not depend on n, and so its penalty function is constant.

Figure 7: Penalty functions for AIC (horizontal line) and BIC (curve) when p = 5.

A larger penalty is imposed by BIC to compensate for an increase in test power with larger sample size. If the researcher believes that some, but not all, explanatory variables exert an influence on the response variable, they may choose the model with the smallest BIC (versus AIC). On the other hand, if the researcher is inclined to believe that all explanatory variables exert some influence on the response and wants to find a parsimonious model, then the researcher will choose the model with the smallest AIC.

Cautionary Remarks

These criteria (AIC and BIC) are sometimes useful, and sometimes not. It is incorrect to assume that they relieve the scientist of the obligation of careful and considered model fitting. Many statisticians use these methods sparingly, if at all, in part because there has been abundant criticism of AIC for being too liberal (that is, it tends to include variables that are not associated with the response in certain situations).
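A sketch of the two criteria computed from a model's deviance; the deviance value is a placeholder, and n = 147 matches the birdkeeping study:

```python
# AIC and BIC from a deviance; the deviance here is a placeholder value.
import numpy as np

def aic(deviance, p):
    return deviance + 2 * p            # penalty 2p

def bic(deviance, p, n):
    return deviance + p * np.log(n)    # penalty p ln(n)

dev, p, n = 100.0, 5, 147
print(aic(dev, p), round(bic(dev, p, n), 1))   # BIC penalty exceeds AIC's once ln(n) > 2
```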

In general, it is best to keep in mind that finding a best model is an unrealistic goal (given the many variables that may have some influence on the response variable but have not been measured). A realistic objective of model fitting is to learn which variables, among those that have been measured, are associated with the response variable, and to obtain a description of how the explanatory variables relate to the response.

For example, in the birdkeeping analysis, from among the available variables, it was found that only years of smoking and birdkeeping had a significant effect on the odds of contracting lung cancer, and that gender, age, socioeconomic status, and smoking rate were not significant. The odds of contracting lung cancer were much greater (about 4.4 times greater) for birdkeepers versus those that did not keep birds. This association is larger than the change in the odds associated with 10 years of smoking, as the odds of contracting lung cancer are estimated to be 1.7 times greater when comparing individuals that differ by 10 years in the length of time that they smoked.

A goodness of fit test when the number of trials is greater than one

It is sometimes possible to test whether the assumed model fits the data. Generally, these are situations in which a separate variance parameter is not used in the model. In addition, there must be multiple observations observed at each combination of factor levels and covariate values. These conditions are satisfied if the assumed model is binomial and the number of trials is greater than one.

For example, we can carry out a goodness of fit test for the Challenger data because there were 6 trials associated with each launch, and the number of failed o-rings (out of the 6) is the observed response variable.

Let m denote the number of trials (m = 6 for the Challenger data). The idea is that a model-free, good estimate of the probability of failure associated with each launch (and hence, with each observed number of failures) is the sample proportion of the m that failed. For example, if y = 2 o-rings failed when temperature = 53, then the estimated probability of failure is 2/6 when temperature is 53.

There are two ways to get a set of estimates that correspond to this best model:

1. Compute each of the sample proportions by hand.
2. Fit a saturated model that has the same number of parameters as observations (say, with an intercept and n - 1 indicator variables that identify each observation). Then compute the fitted probabilities according to the saturated model.

The term saturated is used because the model cannot contain any other parameters. For this model, p = n.

Then, compare the fit of the saturated model to the simpler model. If the fits are nearly the same, then we conclude that the model fit is good. Formally, we set H_0: the model fits the data versus H_a: the model does not fit the data. The test statistic is the likelihood ratio statistic

LRT = 2 log{ L(saturated model; y) / L(simple model; y) }
    = 2 log[L(saturated model; y)] - 2 log[L(simple model; y)]

If H_0 is true, then the LRT is approximately chi-square in distribution with n - p degrees of freedom, i.e., LRT ∼ χ²_{n-p}. We reject H_0 if Pr(χ²_{n-p} ≥ LRT) is smaller than some α (e.g., 0.05).

Example: the Challenger data using the logit model and temperature. SPSS reports the LRT goodness of fit statistic as a deviance with 21 degrees of freedom; the associated p-value (not reported by SPSS) is P(χ²_{21} ≥ Deviance). Specifically, SPSS's definition and calculation of the deviance differ from what I described above. The SPSS calculation is

Deviance = 2 log{ L(saturated model; y) / L(simple model; y) }
         = 2 log[L(saturated model; y)] - 2 log[L(simple model; y)]

While the SPSS definition is somewhat unconventional, it lends itself to the goodness of fit test. Moreover, their definition can be used for significance testing of variables without any additional work. When testing the significance of a variable, we use the likelihood ratio statistic, and I advocate using the difference in deviances between the models with and without the variable. Using the SPSS form, with M_1 the reduced model and M_2 the full model, we obtain

LRT = D(M_1) - D(M_2)
    = 2 log[L(saturated model; y)] - 2 log[L(M_1; y)] - {2 log[L(saturated model; y)] - 2 log[L(M_2; y)]}
    = 2 log[L(M_2; y)] - 2 log[L(M_1; y)]

This is the correct test statistic.

A cautionary remark: most of the time, the deviance statistic that appears in SPSS output cannot be used in a goodness of fit test. The conditions stated above must hold. Most notably, the number of trials must be greater than one for the majority of the observations.
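A sketch of the grouped-binomial goodness-of-fit computation in the spirit of the Challenger example, with simulated launches; the temperatures, coefficients, and counts below are invented, not the actual Challenger data:

```python
# Deviance goodness-of-fit test for grouped binomial data (simulated,
# Challenger-style: m = 6 o-rings per launch; all numbers invented).
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(3)
temp = np.linspace(53, 81, 23)
m = 6
pi_true = 1 / (1 + np.exp(-(5.0 - 0.1 * temp)))      # hypothetical failure model
failures = rng.binomial(m, pi_true)

X = sm.add_constant(temp)
endog = np.column_stack([failures, m - failures])    # (successes, failures) format
fit = sm.GLM(endog, X, family=sm.families.Binomial()).fit()

dev, df = fit.deviance, fit.df_resid                 # deviance vs. saturated model
print(round(dev, 2), df, round(chi2.sf(dev, df), 2)) # small p-value => lack of fit
```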


More information

A discussion on multiple regression models

A discussion on multiple regression models A discussion on multiple regression models In our previous discussion of simple linear regression, we focused on a model in which one independent or explanatory variable X was used to predict the value

More information

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 Introduction to Generalized Univariate Models: Models for Binary Outcomes EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 EPSY 905: Intro to Generalized In This Lecture A short review

More information

STA102 Class Notes Chapter Logistic Regression

STA102 Class Notes Chapter Logistic Regression STA0 Class Notes Chapter 0 0. Logistic Regression We continue to study the relationship between a response variable and one or more eplanatory variables. For SLR and MLR (Chapters 8 and 9), our response

More information

STA6938-Logistic Regression Model

STA6938-Logistic Regression Model Dr. Ying Zhang STA6938-Logistic Regression Model Topic 2-Multiple Logistic Regression Model Outlines:. Model Fitting 2. Statistical Inference for Multiple Logistic Regression Model 3. Interpretation of

More information

Statistics in medicine

Statistics in medicine Statistics in medicine Lecture 4: and multivariable regression Fatma Shebl, MD, MS, MPH, PhD Assistant Professor Chronic Disease Epidemiology Department Yale School of Public Health Fatma.shebl@yale.edu

More information

Chapter 5: Logistic Regression-I

Chapter 5: Logistic Regression-I : Logistic Regression-I Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM [Acknowledgements to Tim Hanson and Haitao Chu] D. Bandyopadhyay

More information

LOGISTIC REGRESSION Joseph M. Hilbe

LOGISTIC REGRESSION Joseph M. Hilbe LOGISTIC REGRESSION Joseph M. Hilbe Arizona State University Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of

More information

Sections 4.1, 4.2, 4.3

Sections 4.1, 4.2, 4.3 Sections 4.1, 4.2, 4.3 Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1/ 32 Chapter 4: Introduction to Generalized Linear Models Generalized linear

More information

Solution to Tutorial 7

Solution to Tutorial 7 1. (a) We first fit the independence model ST3241 Categorical Data Analysis I Semester II, 2012-2013 Solution to Tutorial 7 log µ ij = λ + λ X i + λ Y j, i = 1, 2, j = 1, 2. The parameter estimates are

More information

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification, Likelihood Let P (D H) be the probability an experiment produces data D, given hypothesis H. Usually H is regarded as fixed and D variable. Before the experiment, the data D are unknown, and the probability

More information

Regression: Main Ideas Setting: Quantitative outcome with a quantitative explanatory variable. Example, cont.

Regression: Main Ideas Setting: Quantitative outcome with a quantitative explanatory variable. Example, cont. TCELL 9/4/205 36-309/749 Experimental Design for Behavioral and Social Sciences Simple Regression Example Male black wheatear birds carry stones to the nest as a form of sexual display. Soler et al. wanted

More information

Exam Applied Statistical Regression. Good Luck!

Exam Applied Statistical Regression. Good Luck! Dr. M. Dettling Summer 2011 Exam Applied Statistical Regression Approved: Tables: Note: Any written material, calculator (without communication facility). Attached. All tests have to be done at the 5%-level.

More information

An Introduction to Mplus and Path Analysis

An Introduction to Mplus and Path Analysis An Introduction to Mplus and Path Analysis PSYC 943: Fundamentals of Multivariate Modeling Lecture 10: October 30, 2013 PSYC 943: Lecture 10 Today s Lecture Path analysis starting with multivariate regression

More information

MSH3 Generalized linear model

MSH3 Generalized linear model Contents MSH3 Generalized linear model 5 Logit Models for Binary Data 173 5.1 The Bernoulli and binomial distributions......... 173 5.1.1 Mean, variance and higher order moments.... 173 5.1.2 Normal limit....................

More information

PubHlth Intermediate Biostatistics Spring 2015 Exam 2 (Units 3, 4 & 5) Study Guide

PubHlth Intermediate Biostatistics Spring 2015 Exam 2 (Units 3, 4 & 5) Study Guide PubHlth 640 - Intermediate Biostatistics Spring 2015 Exam 2 (Units 3, 4 & 5) Study Guide Unit 3 (Discrete Distributions) Take care to know how to do the following! Learning Objective See: 1. Write down

More information

Logistic regression analysis. Birthe Lykke Thomsen H. Lundbeck A/S

Logistic regression analysis. Birthe Lykke Thomsen H. Lundbeck A/S Logistic regression analysis Birthe Lykke Thomsen H. Lundbeck A/S 1 Response with only two categories Example Odds ratio and risk ratio Quantitative explanatory variable More than one variable Logistic

More information

36-309/749 Experimental Design for Behavioral and Social Sciences. Sep. 22, 2015 Lecture 4: Linear Regression

36-309/749 Experimental Design for Behavioral and Social Sciences. Sep. 22, 2015 Lecture 4: Linear Regression 36-309/749 Experimental Design for Behavioral and Social Sciences Sep. 22, 2015 Lecture 4: Linear Regression TCELL Simple Regression Example Male black wheatear birds carry stones to the nest as a form

More information

22s:152 Applied Linear Regression. Example: Study on lead levels in children. Ch. 14 (sec. 1) and Ch. 15 (sec. 1 & 4): Logistic Regression

22s:152 Applied Linear Regression. Example: Study on lead levels in children. Ch. 14 (sec. 1) and Ch. 15 (sec. 1 & 4): Logistic Regression 22s:52 Applied Linear Regression Ch. 4 (sec. and Ch. 5 (sec. & 4: Logistic Regression Logistic Regression When the response variable is a binary variable, such as 0 or live or die fail or succeed then

More information

Sociology 362 Data Exercise 6 Logistic Regression 2

Sociology 362 Data Exercise 6 Logistic Regression 2 Sociology 362 Data Exercise 6 Logistic Regression 2 The questions below refer to the data and output beginning on the next page. Although the raw data are given there, you do not have to do any Stata runs

More information

Modern Methods of Statistical Learning sf2935 Lecture 5: Logistic Regression T.K

Modern Methods of Statistical Learning sf2935 Lecture 5: Logistic Regression T.K Lecture 5: Logistic Regression T.K. 10.11.2016 Overview of the Lecture Your Learning Outcomes Discriminative v.s. Generative Odds, Odds Ratio, Logit function, Logistic function Logistic regression definition

More information

High-Throughput Sequencing Course

High-Throughput Sequencing Course High-Throughput Sequencing Course DESeq Model for RNA-Seq Biostatistics and Bioinformatics Summer 2017 Outline Review: Standard linear regression model (e.g., to model gene expression as function of an

More information

Varieties of Count Data

Varieties of Count Data CHAPTER 1 Varieties of Count Data SOME POINTS OF DISCUSSION What are counts? What are count data? What is a linear statistical model? What is the relationship between a probability distribution function

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF TORONTO Faculty of Arts and Science UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator

More information

Binary Response: Logistic Regression. STAT 526 Professor Olga Vitek

Binary Response: Logistic Regression. STAT 526 Professor Olga Vitek Binary Response: Logistic Regression STAT 526 Professor Olga Vitek March 29, 2011 4 Model Specification and Interpretation 4-1 Probability Distribution of a Binary Outcome Y In many situations, the response

More information

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis Review Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1 / 22 Chapter 1: background Nominal, ordinal, interval data. Distributions: Poisson, binomial,

More information

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 Statistics Boot Camp Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 March 21, 2018 Outline of boot camp Summarizing and simplifying data Point and interval estimation Foundations of statistical

More information

Testing and Model Selection

Testing and Model Selection Testing and Model Selection This is another digression on general statistics: see PE App C.8.4. The EViews output for least squares, probit and logit includes some statistics relevant to testing hypotheses

More information

STAT5044: Regression and Anova

STAT5044: Regression and Anova STAT5044: Regression and Anova Inyoung Kim 1 / 18 Outline 1 Logistic regression for Binary data 2 Poisson regression for Count data 2 / 18 GLM Let Y denote a binary response variable. Each observation

More information