Logistic Regression - problem 6.14

Let $x_1, x_2, \ldots, x_m$ be given values of an input variable $x$ and let $Y_1, \ldots, Y_m$ be independent binomial random variables whose distributions depend on the corresponding values of $x$. They depend on $x$ through a function $p(x)$ that gives the success probability in the binomial experiment for each given value of $x$. We assume that the numbers of trials are predetermined, so that each $Y_i \sim \mathrm{Binom}(n_i, p(x_i))$. In logistic regression, we assume that the log-odds on success is a linear function of the variable $x$:

$$\log\frac{p(x)}{1-p(x)} = \beta_0 + \beta_1 x,$$

where $\beta_0$ and $\beta_1$ are unknown constants. $x$ is a numeric variable, although if it is a two-level factor variable it will be numerically coded as 0 for the first level and 1 for the second level. For example, if it is male or female, it will be coded as 0 for female and 1 for male simply because f comes before m in the alphabet. Thus,

$$\log\frac{p(0)}{1-p(0)} = \beta_0 \quad\text{and}\quad \log\frac{p(1)}{1-p(1)} = \beta_0 + \beta_1.$$

Therefore, $\beta_1$ is the log odds ratio on success for the two populations, males and females, and $e^{\beta_1}$ is the odds ratio.

Estimates of the parameters $\beta_0$ and $\beta_1$ are found by the method of maximum likelihood. The estimates $\hat\beta_0$ and $\hat\beta_1$ maximize the log-likelihood function, which up to an additive constant that does not involve the parameters is

$$\ell(\beta_0, \beta_1) = \sum_{j=1}^{m} \left[ Y_j \log p(x_j) + (n_j - Y_j) \log\bigl(1 - p(x_j)\bigr) \right],$$

which simplifies to

$$\ell(\beta_0, \beta_1) = \sum_{j=1}^{m} Y_j (\beta_0 + \beta_1 x_j) - \sum_{j=1}^{m} n_j \log\left(1 + e^{\beta_0 + \beta_1 x_j}\right).$$

The maximum likelihood estimates cannot be found by elementary calculus and must be numerically approximated. Theorems in advanced mathematical statistics ensure that for large sample sizes $m$ the estimators are approximately normally distributed with means equal to the true parameters. Furthermore, their variances approach zero as $m \to \infty$.

R has a function glm, which stands for generalized linear model, that calculates the maximum likelihood estimates, their standard errors, and many other things. We will illustrate it with the data set stenosis, which is in the Van Belle data folder. Part of the data set is given below.

       smoke disease sex
    1    yes       1   m
    2    yes       1   m
    3    yes       1   m
    4    yes       1   m
    5    yes       1   m
    6    yes       1   m
    7    yes       1   m
    8    yes       1   m
    9    yes       1   m
    10   yes       1   m
    ...
    62    no       1   m
    63   yes       0   m
    64   yes       0   m
    65   yes       0   m
    66   yes       0   m
    67   yes       0   m

Disease is the response variable and has values either 0 for no disease or 1 for the presence of disease. Each $n_i$ here is equal to 1 and the response $Y$ is Bernoulli. For the moment, we ignore the variable sex and treat the disease probability as a function of smoking status only. For Bernoulli responses, glm is simple to use.

    > stenosis.glm=glm(disease~smoke,data=stenosis,family=binomial)
    > summary(stenosis.glm)

    Call:
    glm(formula = disease ~ smoke, family = binomial, data = stenosis)

    Deviance Residuals:
       Min      1Q  Median      3Q     Max
    -1.251  -1.087  -1.087   1.188   1.270

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -0.2157     0.1829  -1.180    0.238
    smokeyes      0.3863     0.2762   1.399    0.162

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 297.94  on 214  degrees of freedom
    Residual deviance: 295.97  on 213  degrees of freedom
    AIC: 299.97

    Number of Fisher Scoring iterations: 3

The variable smoke has two levels, no and yes. The intercept parameter, with an estimated value of -0.2157, is the log-odds on disease for nonsmokers. The parameter estimate labelled smokeyes is the difference in log-odds between smokers and nonsmokers. In other words, it is the log odds ratio on disease for smokers versus nonsmokers. You can get the odds and the odds ratio by exponentiating these estimates:

    > exp(coef(stenosis.glm))
    (Intercept)    smokeyes
      0.8059701   1.4715762

The odds on disease for nonsmokers is estimated to be 0.80597 and the odds on disease for smokers is 1.472 times as great. However, with a p-value of 0.162 this difference is not statistically significant. We cannot conclude that the true odds ratio is different from 1. Confidence intervals for the log-odds and the log odds ratio are obtained with confint, and exponentiating them gives intervals for the odds and the odds ratio.

    > confint(stenosis.glm)
    Waiting for profiling to be done...
                     2.5 %    97.5 %
    (Intercept) -0.5774201 0.1413800
    smokeyes    -0.1536262 0.9309511
    > exp(.Last.value)
                    2.5 %   97.5 %
    (Intercept) 0.5613447 1.151862
    smokeyes    0.8575925 2.536921

The interval for the odds ratio, from 0.858 to 2.537, contains 1, which agrees with the nonsignificant p-value.

If both smoking status and sex are included in the model, the results are as follows.

    > stenosis.glm2=glm(disease~smoke+sex,data=stenosis,family=binomial)
    > summary(stenosis.glm2)

    Call:
    glm(formula = disease ~ smoke + sex, family = binomial, data = stenosis)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.3630  -1.0555  -0.9783   1.0807   1.3905

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -0.4882     0.2159  -2.261   0.0238 *
    smokeyes      0.1946     0.2903   0.670   0.5026
    sexm          0.7199     0.2881   2.499   0.0125 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 297.94  on 214  degrees of freedom
    Residual deviance: 289.64  on 212  degrees of freedom
    AIC: 295.64

    Number of Fisher Scoring iterations: 4

Now the intercept parameter is the log-odds on disease for nonsmoking females. To get the log-odds on disease for smokers, add 0.1946. To get the log-odds on disease for males, add 0.7199. The estimated odds and odds ratios are:

    > exp(coef(stenosis.glm2))
    (Intercept)    smokeyes        sexm
      0.6137494   1.2148232   2.0542127

The odds ratio for smokers vs. nonsmokers is 1.215. The odds ratio for males vs. females is 2.054. It appears that being a male is riskier than being a smoker. Because the model has no interaction term, the odds ratio for male smokers vs. female nonsmokers is the product 1.215 × 2.054 = 2.496.

In problem 6.14 the data in the book is given in a format suitable for the Mantel-Haenszel test. I have arranged it in tabular form for importation as a data frame. The data file is in the Van Belle data folder. It looks like this once you have imported it.

       MI Control Coffee   Cigs
    1   7      31    >=5  never
    2  55     269     <5  never
    3   7      18    >=5 former
    4  20     112     <5 former
    5   7      24    >=5  1to14
    6  33     114     <5  1to14
    7  40      45    >=5 15to24
    8  88     172     <5 15to24
    9  34      24    >=5 25to34
    10 50      55     <5 25to34
    11 27      24    >=5 35to44
    12 55      58     <5 35to44
    13 30      17    >=5    45+
    14 34      17     <5    45+
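If the data file is not at hand, the table can also be typed in directly. The following is only a sketch: the counts and labels are copied from the table, but the choice of "<5" and "never" as reference levels is an assumption made here for illustration, and an imported file may order the factor levels differently.

```r
# Counts and labels copied from the problem 6.14 table in these notes.
prob6.14 <- data.frame(
  MI      = c(7, 55, 7, 20, 7, 33, 40, 88, 34, 50, 27, 55, 30, 34),
  Control = c(31, 269, 18, 112, 24, 114, 45, 172, 24, 55, 24, 58, 17, 17),
  Coffee  = rep(c(">=5", "<5"), times = 7),
  Cigs    = rep(c("never", "former", "1to14", "15to24",
                  "25to34", "35to44", "45+"), each = 2)
)

# Assumption: "<5" cups and "never" smoked are taken as the reference levels.
prob6.14$Coffee <- factor(prob6.14$Coffee, levels = c("<5", ">=5"))
prob6.14$Cigs   <- factor(prob6.14$Cigs,
                          levels = c("never", "former", "1to14", "15to24",
                                     "25to34", "35to44", "45+"))

str(prob6.14)  # 14 rows: two count columns and two factors
```

Setting the levels explicitly keeps the coefficient labels predictable; without it, R orders the levels according to the sort order of the labels.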

Here the response is not given as individual Bernoulli responses as in the stenosis example. Rather, the first two columns are the numbers of successes (MI) and failures (Control) corresponding to each combination of factor levels. For data in this form, the response in the glm formula must be a two-column matrix of successes and failures. Invoke it like this:

    > prob6.14.glm=glm(cbind(MI,Control)~.,data=prob6.14,family=binomial)

The cbind function puts vectors together as columns of a matrix. The dot . on the right of the model formula after ~ is a shortcut meaning to include all the variables in the data frame that are not named on the left. You can name the object anything you like.

If you don't want to do the Mantel-Haenszel calculations by hand, you can put the data into the format given in the book and use the mantelhaen.test function in R. It takes some work. Here are the steps.

    > attach(prob6.14)
    > prob6.14b=array(rbind(MI,Control),dim=c(2,2,7))
    > for(k in 1:7) prob6.14b[,,k]=t(prob6.14b[,,k])
    > dimnames(prob6.14b)=list(c(">=5","<5"),c("mi","control"),c("never",
    + "former","1to14","15to24","25to34","35to44","45+"))
    > prob6.14b

After the transposition, the rows of each 2 × 2 table index coffee consumption and the columns index MI versus control, so the dimnames are listed in that order. The result should look like the data in the book. After this, all you have to do is call the mantelhaen.test function:

    > mantelhaen.test(prob6.14b)

Compare the results. They should agree.
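For reference, the hand calculation can itself be sketched in R. The block below uses the standard Mantel-Haenszel formulas (common odds ratio estimate and chi-square statistic with continuity correction); the book's presentation may differ, for example by omitting the continuity correction. The names tab, a, r1, r2, c1, c2, or_hand and X2_hand are chosen here for illustration only.

```r
# Counts copied from the problem 6.14 table (>=5 row first within each stratum).
MI      <- c(7, 55, 7, 20, 7, 33, 40, 88, 34, 50, 27, 55, 30, 34)
Control <- c(31, 269, 18, 112, 24, 114, 45, 172, 24, 55, 24, 58, 17, 17)

# 2 x 2 x 7 array: rows = coffee (>=5, <5), columns = (MI, Control),
# third dimension = smoking stratum.
tab <- array(rbind(MI, Control), dim = c(2, 2, 7))
for (k in 1:7) tab[, , k] <- t(tab[, , k])

a  <- tab[1, 1, ]              # MI cases among heavy coffee drinkers
n  <- apply(tab, 3, sum)       # stratum totals
r1 <- colSums(tab[1, , ])      # row totals per stratum
r2 <- colSums(tab[2, , ])
c1 <- colSums(tab[, 1, ])      # column totals per stratum
c2 <- colSums(tab[, 2, ])

# Mantel-Haenszel estimate of the common odds ratio.
or_hand <- sum(tab[1, 1, ] * tab[2, 2, ] / n) /
           sum(tab[1, 2, ] * tab[2, 1, ] / n)

# Chi-square statistic: observed minus expected count in cell (1,1),
# summed over strata, with hypergeometric variances and a 0.5 correction.
expected <- r1 * c1 / n
variance <- r1 * r2 * c1 * c2 / (n^2 * (n - 1))
X2_hand  <- (abs(sum(a - expected)) - 0.5)^2 / sum(variance)

# Compare with R's built-in test (continuity correction is its default).
mh <- mantelhaen.test(tab)
all.equal(or_hand, unname(mh$estimate))
all.equal(X2_hand, unname(mh$statistic))
```

The common odds ratio from this calculation adjusts for smoking by stratification, so it can be set beside the exponentiated Coffee coefficient from the logistic regression fit, which adjusts for smoking through the model.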