Logistic Regression


Usual linear regression (repetition)

    y_i = b_0 + b_1·x_1i + b_2·x_2i + e_i,   e_i ~ N(0, σ²)

or:

    y_i ~ N(b_0 + b_1·x_1i + b_2·x_2i, σ²)

Example (DGA, p. 336):

    E(PEmax) = 47.355 + 1.024·weight + 0.147·height

Interpretation of linear regression

For a given height, PEmax grows by 1.024 cmH2O per kg body weight. For a given weight, PEmax grows by 0.147 cmH2O per cm body height. The effect of a single explanatory variable is conditional on the other variables present in the model. The effect of each explanatory variable is linear.

Other types of outcomes

A 0-1 variable, or a number/frequency: these are integers, so the error term cannot be normal. Instead we look at the mean value, but still we have the problem

    E(y) = b_0 + b_1·x_1i + b_2·x_2i

where the mean of a 0-1 variable X is E(X) = P(X = 1) = p ∈ [0, 1], and a count has its mean value in [0, +∞).

0-1 response variable: Wound infection

(dependence on age and on operation time?)

    p   inf  optime  age     p   inf  optime  age     p    inf  optime  age     p    inf  optime  age
    1    1    140     76    51    0     25     30    101    0    120     63    151    0    120     66
    2    0    190     71    52    1    240     73    102    0     90     63    152    0    120     64
    3    0    150     80    53    0    180     79    103    0     40     25    153    0     90     77

[The listing continues in this four-block layout for all 194 patients (p = patient number, inf = wound infection 0/1, optime = operation time in minutes, age in years); the complete data set is entered into SAS later in the notes.]

    p   inf  optime  age     p   inf  optime  age     p    inf  optime  age     p    inf  optime  age
    23   0     50     20    73    0    150     61    123    0     80     79    173    1    205     61

[The listing continues: patients 23-50, 73-100, 123-150 and 173-194.]

Analysis of a 0-1 response variable

The response variable is binary (0/1): how is the dependence on operation time (optime) and age (age) described?

A model for p = P{wound infection} (∈ [0, 1])? It is not good to use

    p = a + b_1·x_1 + b_2·x_2

since this would usually not stick to [0, 1]...

Logistic regression

Binary outcome (e.g. 1 for "success"): Y ∈ {0, 1}
Probability of success: p = P{Y = 1} ∈ [0, 1]
Odds of success: ω = p/(1 − p) ∈ [0, +∞), so that p = ω/(1 + ω)
Odds ratio (2 groups): OR = [p_1/(1 − p_1)] / [p_2/(1 − p_2)] ∈ [0, +∞)
Log-odds: logit(p) = ln(p/(1 − p)) ∈ (−∞, +∞)

Logistic regression (ctd.)

logit is the link function.

Linear predictor: logit(p) = b_0 + b_1·x_1 + b_2·x_2 = η
Predicted odds: ω = exp(η)
Predicted probability: p = ω/(1 + ω) = exp(η)/(1 + exp(η))
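The link-function relations above can be checked numerically. A minimal sketch in Python (the notes themselves use SAS); the group probabilities p1 and p2 below are made-up illustration values:

```python
import math

def logit(p):
    """Log-odds (the logit link): ln(p / (1 - p))."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Predicted probability from the linear predictor: exp(eta) / (1 + exp(eta))."""
    return math.exp(eta) / (1 + math.exp(eta))

def odds(p):
    """Odds of success: p / (1 - p)."""
    return p / (1 - p)

# Odds ratio for two hypothetical group probabilities
p1, p2 = 0.2, 0.1
OR = odds(p1) / odds(p2)
```

Note that inv_logit undoes logit, exactly as p = ω/(1 + ω) undoes ω = p/(1 − p).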

Logistic regression: interpretation

Two groups, with probabilities p_1 and p_2:

    logit(p_1) − logit(p_2) = ln(p_1/(1 − p_1)) − ln(p_2/(1 − p_2))
                            = ln{ [p_1/(1 − p_1)] / [p_2/(1 − p_2)] }
                            = ln(OR)

A linear model for logit(p) yields comparisons via odds ratios.

Logistic regression in the wound infection data

    Y = 1 for post-operative wound infection, 0 for no post-operative wound infection
    p = P{post-operative wound infection}
    x_1 = operation time in minutes
    x_2 = age in years

Estimated model:

    logit(p) = −5.1144 + 0.00753·x_1 + 0.0353·x_2
    p = exp(−5.1144 + 0.00753·x_1 + 0.0353·x_2) / [1 + exp(−5.1144 + 0.00753·x_1 + 0.0353·x_2)]

Interpretation of logistic regression

Same operation time (T), age difference of 10 years (A + 10 vs. A):

    logit(p_1) = −5.1144 + 0.00753·T + 0.0353·(A + 10)
    logit(p_2) = −5.1144 + 0.00753·T + 0.0353·A
    ln(OR_{A+10,A}) = 0.0353·10
    OR_{A+10,A} = exp(0.353) = 1.423

What does that mean? If age increases by 10 years, the odds of getting a wound infection increase by a factor of 1.423, i.e. by 42.3%. The odds ratio refers to the difference in odds for disease between two levels of an explanatory variable.
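A quick numerical check of the 10-year odds ratio, sketched in Python rather than SAS, using the age coefficient from the fitted model above:

```python
import math

b_age = 0.0353          # estimated age coefficient from the fitted model
ln_or = b_age * 10      # ln(OR) for a 10-year age difference
OR_10y = math.exp(ln_or)  # about 1.423, matching the notes
```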

Calculation of probabilities:

    logit(p) = ln(p/(1 − p)) = b_0 + b_1·x_1i + b_2·x_2i
    p = exp(b_0 + b_1·x_1i + b_2·x_2i) / [1 + exp(b_0 + b_1·x_1i + b_2·x_2i)]
    1 − p = 1 / [1 + exp(b_0 + b_1·x_1i + b_2·x_2i)]

The example yields:

    logit(p{optime = 200 min, age = 60 years}) = −5.1144 + 0.00753·200 + 0.0353·60
                                               = −5.1144 + 1.506 + 2.118 = −1.490
    p = e^(−1.490) / (1 + e^(−1.490)) = 0.2254 / 1.2254 = 0.1839

[Figure: Dependence of p on age for different operation times. Predicted probability (0.0 to 1.0) plotted against age (0-100 years) for operation times of 30, 120 and 240 min.]

[Figure: Dependence of p on operation time for different ages. Predicted probability (0.0 to 1.0) plotted against operation time (0-600 min) for ages 50, 60 and 75 years.]
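The same worked example, sketched in Python with the coefficients from the fitted model:

```python
import math

def predict_p(optime, age):
    """Predicted infection probability from the fitted model in the notes."""
    eta = -5.1144 + 0.00753 * optime + 0.0353 * age
    return math.exp(eta) / (1 + math.exp(eta))

# 200-minute operation on a 60-year-old: about 0.184
p = predict_p(200, 60)
```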

What does the intercept mean here?

    logit(p) = ln(p/(1 − p)) = b_0 + b_1·x_1i + b_2·x_2i
    x_1i = x_2i = 0  ⇒  ln(p/(1 − p)) = b_0

The intercept is the log-odds for disease in a person with 0 in all covariates. In the wound infection case this would be a person of 0 years who is operated on for 0 minutes, which is not very meaningful!

Wound infection data analyzed in SAS

Direct input of data:

    data brem ;
      input inf optime age ;
      cards ;
    1 140 76
    0 190 71
    0 150 80
    : :
    0 50 13
    0 45 86
    ;
    run ;

Analyst: Open...

Direct programming:

    proc logistic data = brem descend;
      model inf = optime age;
    run;

(or:

    proc genmod data = brem descending ;
      model inf = optime age / dist = binomial link = logit ;
      estimate "Operation" optime 1 / exp ;
      estimate "age" age 1 / exp ;
    run ;)

Analyst: Statistics/Regression/Logistic; click Single trial under "Dependent type"; choose inf as Dependent; specify "Model Pr{..}" as 1; choose optime and age as Quantitative.

SAS output:

    The LOGISTIC Procedure

    Model Information
    Data Set                    WORK.BREM
    Response Variable           inf
    Number of Response Levels   2
    Number of Observations      194
    Model                       binary logit
    Optimization Technique      Fisher's scoring

    Response Profile
    Ordered            Total
    Value    inf   Frequency
        1      1          23
        2      0         171

    Probability modeled is inf=1.

    Model Fit Statistics
                    Intercept    Intercept and
    Criterion            Only       Covariates
    AIC               143.247          130.007
    SC                146.515          139.811
    -2 Log L          141.247          124.007

    Analysis of Maximum Likelihood Estimates
                             Standard        Wald
    Parameter  DF  Estimate     Error  Chi-Square  Pr > ChiSq
    Intercept   1   -5.1142    1.1041     21.4568      <.0001
    optime      1   0.00753   0.00316      5.6815      0.0171
    age         1    0.0353    0.0145      5.9023      0.0151

    Odds Ratio Estimates
               Point       95% Wald
    Effect  Estimate  Confidence Limits
    optime     1.008    1.001    1.014
    age        1.036    1.007    1.066

(or, with PROC GENMOD:)

    The GENMOD Procedure

    Model Information
    Data Set             WORK.BREM
    Distribution         Binomial
    Link Function        Logit
    Dependent Variable   inf
    Observations Used    194

    PROC GENMOD is modeling the probability that inf=1.

    Criteria For Assessing Goodness Of Fit
    Criterion         DF      Value   Value/DF
    Deviance         191   124.0070     0.6493
    Scaled Deviance  191   124.0070     0.6493

    Analysis Of Parameter Estimates
                          Standard      Wald 95%           Chi-
    Parameter  Estimate      Error  Confidence Limits    Square  Pr > ChiSq
    Intercept   -5.1144     1.1041  -7.2785  -2.9504      21.46      <.0001
    optime       0.0075     0.0032   0.0013   0.0137       5.68      0.0171
    age          0.0353     0.0145   0.0068   0.0638       5.90      0.0151
    Scale        1.0000     0.0000   1.0000   1.0000

    Contrast Estimate Results
                             Standard                        Chi-
    Label           Estimate    Error  Confidence Limits   Square  Pr > ChiSq
    Operation         0.0075   0.0032   0.0013   0.0137      5.68      0.0171
    Exp(Operation)    1.0076   0.0032   1.0013   1.0138
    age               0.0353   0.0145   0.0068   0.0638      5.90      0.0151
    Exp(age)          1.0360   0.0151   1.0069   1.0659

Confidence intervals

    (1 − α) c.i. = estimate ± z_{1−α/2}·std.error

95% confidence interval for the OR associated with a difference of 1 year in age at operation:

For ln(OR): 0.035325 ± 1.96·0.014516 = (0.006874; 0.063776)
For OR: exp[(0.006874; 0.063776)] = (1.006897; 1.065854)
or: e^0.035325 · e^(±1.96·0.014516) = (1.006897; 1.065854)

95% confidence interval for the OR associated with a difference of 10 years in age at operation:

For ln(OR): 10·0.035325 ± 1.96·10·0.014516 = (0.068736; 0.637764)
For OR: exp[(0.068736; 0.637764)] = (1.071154; 1.892244)
or: e^(10·0.035325) · e^(±1.96·10·0.014516) = (1.071154; 1.892244) = (1.006897^10; 1.065854^10)
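The Wald confidence interval arithmetic above can be reproduced directly; a sketch in Python, using the age estimate and standard error from the SAS output:

```python
import math

b, se = 0.035325, 0.014516   # age estimate and standard error from the SAS output
z = 1.96                     # approximate 97.5% standard-normal quantile

lo, hi = b - z * se, b + z * se                      # 95% CI for ln(OR), 1-year difference
or_lo, or_hi = math.exp(10 * lo), math.exp(10 * hi)  # 95% CI for OR, 10-year difference
```

Note how the 10-year interval is just the 1-year interval on the log scale multiplied by 10, hence the identity (1.006897^10; 1.065854^10) in the notes.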

Confidence intervals in SAS using PROC GENMOD

Program:

    proc genmod data = brem descending ;
      model inf = optime age / dist = binomial ;
      estimate "Op60" optime 60 / exp ;
      estimate "A10" age 10 / exp ;
    run ;

Output:

                          Standard     Wald 95%          Chi-
    Parameter  Estimate      Error  Conf. Limits       Square  Pr > ChiSq
    Op60         0.4518     0.1896  0.0803  0.8234       5.68      0.0171
    Exp(Op60)    1.5712     0.2978  1.0836  2.2781
    A10          0.3533     0.1454  0.0683  0.6382       5.90      0.0151
    Exp(A10)     1.4237     0.2070  1.0707  1.8931

Effect of scaling and centering of covariates

Program:

    data brem ;
      set brem ;
      a50 = ( age - 50 ) / 10 ;
      op1 = ( optime - 60 ) / 60 ;
    run ;
    proc logistic data = brem descend;
      model inf = op1 a50;
    run;

Output:

    Analysis of Maximum Likelihood Estimates
                            Standard        Wald
    Parameter  DF  Estimate    Error  Chi-Square  Pr > ChiSq
    Intercept   1   -2.8962   0.4228     46.9216      <.0001
    op1         1    0.4518   0.1896      5.6815      0.0171
    a50         1    0.3532   0.1454      5.9023      0.0151

    Odds Ratio Estimates
              Point       95% Wald
    Effect  Estimate  Confidence Limits
    op1        1.571    1.084    2.278
    a50        1.424    1.071    1.893

Scaling and centering

The intercept refers to the log-odds for a person with value 0 in all covariates, but this is now a person of 50 years, operated on for 1 hour.

If the covariates are divided by a factor:
- the estimates are multiplied by this factor,
- the standard deviations are multiplied by this factor,
- Wald's test and the p-values remain the same.

If the covariates are centered around a value:
- the estimates are not changed,
- the standard deviations are not changed,
- Wald's test and the p-values remain the same.
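The scaling rule can be verified from the two SAS fits: dividing a covariate by a factor multiplies its coefficient by that factor. A sketch in Python:

```python
b_optime, b_age = 0.00753, 0.0353   # estimates on the original scale

# op1 = (optime - 60) / 60  =>  coefficient is multiplied by 60
b_op1 = b_optime * 60   # 0.4518, as in the op1 row of the rescaled fit
# a50 = (age - 50) / 10   =>  coefficient is multiplied by 10
b_a50 = b_age * 10      # 0.353, as in the a50 row (up to rounding)
```

Centering (the "- 60" and "- 50" parts) affects only the intercept, which is why the op1 and a50 estimates depend only on the divisors.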

The intercept refers to the log-odds for a person with covariate values equal to those used for centering.

    odds_{50,60} = exp(−2.8962) = 0.05523
    p̂_{50,60} = 0.05523/(1 + 0.05523) = 0.05234
    c.i.(odds_{50,60}) = exp(−2.8962 ± 1.96·0.4228) = (0.02412, 0.12650)
    c.i.(p_{50,60}) = (0.02412/(1 + 0.02412), 0.1265/(1 + 0.1265)) = (0.02355, 0.11229)

The infection probability for a "0-person" (50 years old, operated on for 1 hour) is 0.052, with a 95% c.i. of (0.024, 0.112).

Model reduction: Wald's test

For testing the importance of a single covariate, e.g. H_0: β_k = 0. Under H_0, we have approximately:

    estimate/std.err. ~ N(0, 1)

or:

    (estimate/std.err.)² ~ χ²_1

This is calculated in SAS by default, for each parameter separately.

Model reduction: likelihood-ratio test

    −2·ln(likelihood ratio) ~ χ²_df

The likelihood ratio is the ratio between the maximized likelihood functions under two different models, where the smaller model lacks df (one or more) parameters.

The deviance is the likelihood-ratio test statistic for comparing the current model vs. a model with one parameter per observation. Thus, the corresponding df is the number of observations minus the number of parameters in the current model. E.g., in our example we have 194 observations and 3 parameters in our model (intercept, optime, age), so the deviance has df = 191.

The deviance on its own is not meaningful! However, the difference in deviances between two (nested) models corresponds to the likelihood-ratio test between the two models. It is assessed with the help of the χ² distribution with df equal to the difference in the numbers of parameters in the two models.

E.g., test of the model with both optime and age vs. the model with only optime: 124.007 (df 191) vs. 131.876 (df 192):

    χ² = 131.876 − 124.007 = 7.869,  df = 1,  p = 0.005

(a bit different from the Wald test...)
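The likelihood-ratio test above can be reproduced with plain arithmetic; a sketch in Python, using the identity P(χ²_1 > x) = erfc(√(x/2)) so that no statistics library is needed:

```python
import math

def chi2_sf_df1(x):
    """Upper tail of the chi-square distribution with 1 df, via the normal tail."""
    return math.erfc(math.sqrt(x / 2))

dev_full, dev_reduced = 124.007, 131.876   # deviances from the SAS output
lr = dev_reduced - dev_full                # likelihood-ratio statistic, df = 1
p_value = chi2_sf_df1(lr)                  # about 0.005, matching the notes
```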

Data from DGA: a 2 × k table with ordered categories

CS = caesarean section; columns are shoe sizes.

    Shoe size    < 4     4   4.5     5   5.5     6   Total
    CS yes         5     7     6     7     8    10      43
    CS no         17    28    36    41    46   140     308
    Total         22    35    42    48    54   150     351

Recall (lecture on categorical data): χ² test for independence: 9.29, with 5 df; p = 0.098.

Partition of the χ² test into tests for linearity and for trend:

    χ²_total(5) = χ²_lin(4) + χ²_trend(1)
    9.29 = 1.27 + 8.02

Logistic regression:

    Model                     deviance   df      p
    logit(p_i) = β_i              0.00    0
      Test for linearity          1.78    4  0.775
    logit(p_i) = α + β·s_i        1.78    4  0.775
      Test for trend              7.56    1  0.005
    logit(p_i) = µ                9.34    5  0.096

Analysis of the shoe size data:

    data shoe ;
      input cs $ shoeno number ;
      cards ;
    Y 3.5   5
    Y 4.0   7
    Y 4.5   6
    Y 5.0   7
    Y 5.5   8
    Y 6.0  10
    N 3.5  17
    N 4.0  28
    N 4.5  36
    N 5.0  41
    N 5.5  46
    N 6.0 140
    ;
    run;

Direct programming.

Shoeno as class variable:

    proc logistic data=shoe descend;
      weight number;
      class shoeno / param=ref ref=last;
      model cs = shoeno;
    run;

Shoeno as quantitative variable:

    proc logistic data=shoe descend;
      weight number;
      model cs = shoeno;
    run;
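The χ² independence statistic for the 2 × 6 table can be recomputed from the counts; a sketch in Python:

```python
# Pearson chi-square test for independence in the 2 x 6 shoe-size table
yes = [5, 7, 6, 7, 8, 10]      # CS yes, by shoe-size category
no = [17, 28, 36, 41, 46, 140] # CS no, by shoe-size category

n = sum(yes) + sum(no)
row_totals = [sum(yes), sum(no)]
col_totals = [y + m for y, m in zip(yes, no)]

chi2 = 0.0
for obs_row, row_total in zip([yes, no], row_totals):
    for obs, col_total in zip(obs_row, col_totals):
        expected = row_total * col_total / n
        chi2 += (obs - expected) ** 2 / expected
# chi2 is about 9.29 with (2 - 1) * (6 - 1) = 5 df, as quoted in the notes
```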

Analyst.

Shoeno as class variable: Statistics/Regression/Logistic; choose cs as Dependent and shoeno as Class under Variables; choose number as Weight. Double-click the Code node, copy the program to the editor, and add two options to the class statement:

    class shoeno / param=ref ref=last

For a direct interpretation of the parameter estimates, the last 2 steps are essential! (If the 1st level should be the reference, use ref=first instead...)

Shoeno as quantitative variable: select shoeno as Quantitative instead of Class.

Full model (shoe size as a class variable)

    Response Profile
    Ordered            Total       Total
    Value    cs    Frequency      Weight
        1    Y             6    43.00000
        2    N             6   308.00000

    Probability modeled is cs=Y.

    Class Level Information
                      Design Variables
    Class   Value     1  2  3  4  5
    shoeno  4         1  0  0  0  0
            5         0  1  0  0  0
            6         0  0  1  0  0
            3.5       0  0  0  1  0
            4.5       0  0  0  0  1
            5.5       0  0  0  0  0

    Model Fit Statistics
                    Intercept    Intercept and
    Criterion            Only       Covariates
    AIC               263.067          263.723
    SC                263.552          266.632
    -2 Log L          261.067          251.723

    Type III Analysis of Effects
                      Wald
    Effect  DF  Chi-Square  Pr > ChiSq
    shoeno   5      8.6369      0.1245

    Analysis of Maximum Likelihood Estimates
                                Standard        Wald
    Parameter    DF  Estimate      Error  Chi-Square  Pr > ChiSq
    Intercept     1   -1.7492     0.3831     20.8513      <.0001
    shoeno 4      1    0.3629     0.5704      0.4049      0.5246
    shoeno 5      1   -0.0185     0.5603      0.0011      0.9737
    shoeno 6      1   -0.8898     0.5039      3.1187      0.0774
    shoeno 3.5    1    0.5255     0.6368      0.6808      0.4093
    shoeno 4.5    1   -0.0426     0.5841      0.0053      0.9419

    Odds Ratio Estimates
                         Point       95% Wald
    Effect            Estimate  Confidence Limits
    shoeno 4 vs 5.5      1.438    0.470    4.396
    shoeno 5 vs 5.5      0.982    0.327    2.944
    shoeno 6 vs 5.5      0.411    0.153    1.103
    shoeno 3.5 vs 5.5    1.691    0.485    5.892
    shoeno 4.5 vs 5.5    0.958    0.305    3.011

p := P(cs = Y | shoeno = 3.5) = ?

    logit(p̂) = −1.7492 + 0.5255 = −1.2237
    p̂ = exp(−1.2237)/(1 + exp(−1.2237)) = 0.2273

Model with linear effect of shoe size

    Model Fit Statistics
                    Intercept    Intercept and
    Criterion            Only       Covariates
    AIC               263.067          257.508
    SC                263.552          258.477
    -2 Log L          261.067          253.508

    Testing Global Null Hypothesis: BETA=0
    Test                Chi-Square  DF  Pr > ChiSq
    Likelihood Ratio        7.5597   1      0.0060
    Score                   8.0237   1      0.0046
    Wald                    7.6971   1      0.0055

    Analysis of Maximum Likelihood Estimates
                             Standard        Wald
    Parameter  DF  Estimate     Error  Chi-Square  Pr > ChiSq
    Intercept   1    0.6877    0.9462      0.5283      0.4673
    shoeno      1   -0.5194    0.1872      7.6971      0.0055

    Odds Ratio Estimates
              Point       95% Wald
    Effect  Estimate  Confidence Limits
    shoeno     0.595    0.412    0.859
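The predicted probability for shoe size 3.5 from the class-variable model, sketched in Python: the intercept is the reference level (shoeno = 5.5), and 0.5255 is the estimated contrast for shoeno = 3.5:

```python
import math

# Fitted class-variable model, estimates from the SAS output above
eta = -1.7492 + 0.5255                    # logit for shoeno = 3.5
p_hat = math.exp(eta) / (1 + math.exp(eta))  # about 0.227
```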

Model comparisons

                                            Difference
    model            deviance  df    deviance  df       p
    full             251.7230   6      1.7845   4  0.7753
    linear           253.5075  10      7.5598   1  0.0060
    intercept only   261.0673  11

The test in the second-last line is the trend test.

Exercise: use the output to calculate the predicted values for the probability of a caesarean section for women with shoe sizes 4, 5 and 6, respectively, from the model with a linear effect of shoe size.

Case-control studies

In a case-control study, one chooses:
- cases (diseased), as verified from a register or similar;
- controls: persons representing the population to which the cases belong.

Thus, persons in case-control studies are chosen according to the outcome. Typically, the proportion of cases and controls will be specified beforehand.

If a variable is important for the development of the disease, the distribution of that variable will differ between cases and controls.

The probability of being a case (in the population), P{disease}, cannot be estimated from a case-control study. But the effects of covariates on the disease probability can be estimated!
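One way to carry out the exercise's computation, sketched in Python with the estimates from the linear-effect model output (logit(p) = 0.6877 − 0.5194·shoeno):

```python
import math

def p_cs(shoeno):
    """Predicted caesarean-section probability from the linear-effect model."""
    eta = 0.6877 - 0.5194 * shoeno
    return math.exp(eta) / (1 + math.exp(eta))

# Predicted probabilities for shoe sizes 4, 5 and 6; they decrease with shoe size,
# consistent with the negative slope estimate
probs = {s: p_cs(s) for s in (4, 5, 6)}
```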

Case-control studies (ctd.)

Prevalence in the population:

    p = P{case},    p/(1 − p) = odds(case)

Selection fractions, i.e. inclusion probabilities π_0 and π_1:

    P{inclusion in study | case} = π_1
    P{inclusion in study | control} = π_0

In a case-control study one observes the numbers of cases and controls, conditional on their actually being in the study. These depend on diverse covariates (which one is interested in) and on the inclusion probabilities (which one is not interested in).

[Diagram: with probability p a person is a case, with probability 1 − p a control; a case is included in the study with probability π_1 (not included: 1 − π_1), a control with probability π_0 (not included: 1 − π_0).]

    P{case & included} = p·π_1
    P{control & included} = (1 − p)·π_0
    odds(case | included) = p·π_1 / [(1 − p)·π_0] = [p/(1 − p)] · (π_1/π_0)

Logistic regression

Model for the population:

    ln[p/(1 − p)] = b_0 + b_1·x_1 + b_2·x_2

Model for the observed:

    ln[odds(case | incl.)] = ln[p/(1 − p)] + ln(π_1/π_0)
                           = ln(π_1/π_0) + b_0 + b_1·x_1 + b_2·x_2
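A numerical sketch of the key identity, in Python; the prevalence and inclusion probabilities below are hypothetical illustration values, not from any study:

```python
p = 0.01                 # population prevalence (assumed, for illustration)
pi1, pi0 = 0.9, 0.001    # inclusion probabilities for cases and controls (assumed)

odds_pop = p / (1 - p)                 # odds(case) in the population
odds_included = odds_pop * pi1 / pi0   # odds(case | included in study)
# Selection multiplies the odds by pi1 / pi0, i.e. it only shifts the
# intercept of the logit model by ln(pi1 / pi0)
```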

Analysis of P(case | inclusion), i.e. binary observations:

    Y = 1 for a case, 0 for a control

Effects of covariates are estimated correctly! The intercept has no meaning: it depends on π_0 and π_1, which are usually unknown.