Variable selection:

Suppose for the $i$-th observational unit (case) you record

\[ Y_i = \begin{cases} 1 & \text{success} \\ 0 & \text{failure} \end{cases} \]

and explanatory variables $Z_{1i}, Z_{2i}, \ldots, Z_{ri}$. Consider a logistic regression model with

\[ \pi_i = \Pr(Y_i = 1 \mid Z_{1i}, Z_{2i}, \ldots, Z_{ri}) \]

and

\[ \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 Z_{1i} + \beta_2 Z_{2i} + \cdots + \beta_r Z_{ri}. \]

Which explanatory variables should be included in the model?

Variable (or model) selection:
- subject matter theory and expert opinion
- fit models to test hypotheses of interest
- stepwise methods (backward elimination, forward selection, stepwise selection), where the user supplies a list of potential variables; a sketch of backward elimination follows the example below
- consider "diagnostics"
- maximize a "penalized" likelihood

Example: retrospective study of bronchopulmonary dysplasia (BPD) in newborn infants.

Observational units: 248 infants treated for respiratory distress syndrome (RDS) at Stanford Medical Center between 1962 and 1973, who received ventilatory assistance by intubation for more than 24 hours, and who survived for 3 or more days.

Binary response:

\[ \text{BPD} = \begin{cases} 1 & \text{stage III or IV} \\ 2 & \text{stage I or II} \end{cases} \]

Suspected causes: duration and level of exposure to oxygen therapy and intubation.

Background variables:
  SEX    0 = female, 1 = male
  YOB    year of birth
  APGAR  one-minute APGAR score (0-10)
  GEST   gestational age (weeks)
  BWT    birth weight (grams)
  AGSYM  age at onset of respiratory symptoms (hrs)
  RDS    severity of initial X-ray for RDS, on a 6-point scale from 0 = no RDS seen to 5 = very severe case of RDS
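
The stepwise bullet above can be made concrete with a small sketch: a minimal backward-elimination loop for a logistic regression, using AIC as the drop criterion (one of several reasonable choices). The data frame and its column names are simulated stand-ins borrowed from the variable list, not the actual BPD data.

```python
# Minimal sketch of backward elimination for a logistic regression, using
# AIC as the drop criterion. All data below are simulated stand-ins for
# the BPD variables; none of this reproduces the actual study data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 248
df = pd.DataFrame({
    "INTUB": rng.gamma(2.0, 50.0, n),       # hrs of intubation (simulated)
    "VENTL": rng.gamma(2.0, 40.0, n),       # hrs of ventilation (simulated)
    "BWT":   rng.normal(1500.0, 400.0, n),  # birth weight, grams (simulated)
})
eta = -4.0 + 0.01 * df["INTUB"] + 0.005 * df["VENTL"]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

def backward_eliminate(y, X):
    """Greedily drop a predictor whenever doing so lowers AIC."""
    cols = list(X.columns)
    best = sm.Logit(y, sm.add_constant(X[cols])).fit(disp=0).aic
    improved = True
    while improved and len(cols) > 1:
        improved = False
        for c in cols:
            trial = [v for v in cols if v != c]
            aic = sm.Logit(y, sm.add_constant(X[trial])).fit(disp=0).aic
            if aic < best:
                best, cols, improved = aic, trial, True
                break
    return cols, best

print(backward_eliminate(y, df))
```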

Treatment variables:
  INTUB   duration of endotracheal intubation (hrs)
  VENTL   duration of assisted ventilation (hrs)
  LOWO2   hours of exposure to a 22-49% level of elevated oxygen
  MED2    hours of exposure to a 40-79% level of elevated oxygen
  HI2     hours of exposure to an 80-100% level of elevated oxygen
  AGVEN   age at onset of ventilatory assistance (hrs)

Methods for assessing the adequacy of models:
1. Goodness-of-fit tests.
2. Procedures aimed at specific alternatives, such as fitting extended or modified models.
3. Residuals and other diagnostic statistics designed to detect anomalous or influential cases or unexpected patterns:
   - inspection of graphs
   - case deletion diagnostics
   - measures of fit

Overall assessment of fit: an overall measure similar to $R^2$,

\[ R^2 = \frac{L_0 - L_M}{L_0 - L_S} = \frac{G^2_0 - G^2_M}{G^2_0}, \]

where $L_M$ is the log-likelihood for the fitted model, $L_S$ is the log-likelihood for the saturated model, and $L_0$ is the log-likelihood for the model containing only an intercept. This becomes $(L_0 - L_M)/L_0$ (since $L_S = 0$) when there are $n$ observational units that provide $n$ independent Bernoulli observations.

For the BPD data and the model selected with backward elimination,

\[ \frac{L_0 - L_M}{L_0} = \frac{-2L_0 - (-2L_M)}{-2L_0} = \frac{306.52 - 97.77}{306.52} = 0.681. \]

Logistic regression model for the BPD data, fitted model:

\[ \log\left(\frac{\hat\pi_i}{1-\hat\pi_i}\right) = -12.729 + 0.7686\,\text{LNL} - 1.787\,\text{LNM} + 0.3543\,(\text{LNM})^2 - 2.438\,\text{LNH} + 0.5954\,(\text{LNH})^2 + 1.648\,\text{LNV} \]

concordant = 97.0%, discordant = 2.9%, GAMMA = 0.941
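
A quick check of the 0.681 figure from the quoted $-2$ log-likelihoods; with a fitted statsmodels Logit result `res`, the same quantity is `(res.llnull - res.llf) / res.llnull`, which is McFadden's pseudo-$R^2$ in modern terminology.

```python
# R^2-type measure for the BPD fit, from the quoted -2 log-likelihoods.
m2ll_0 = 306.52   # -2 * L_0, intercept-only model
m2ll_M = 97.77    # -2 * L_M, model chosen by backward elimination
print(round((m2ll_0 - m2ll_M) / m2ll_0, 3))   # 0.681
```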

Goodness-of-fit tests:

$H_0$: proposed model is correct
$H_A$: any other model

Pearson chi-squared test:

\[ X^2 = \sum_{i=1}^{n}\sum_{j=1}^{2} \frac{(Y_{ij} - n_i\hat\pi_{ij})^2}{n_i\hat\pi_{ij}} = \sum_{i=1}^{n} \frac{(Y_{i1} - n_i\hat\pi_{i1})^2}{n_i\hat\pi_{i1}(1-\hat\pi_{i1})} \]

Deviance:

\[ G^2 = 2\sum_{i=1}^{n}\sum_{j=1}^{2} Y_{ij}\log\left(\frac{Y_{ij}}{n_i\hat\pi_{ij}}\right) \]

When each response is an independent Bernoulli observation,

\[ Y_i = \begin{cases} 1 & \text{success} \\ 0 & \text{failure} \end{cases} \qquad\text{and}\qquad \pi_i = \Pr(Y_i = 1 \mid Z_{1i}, \ldots, Z_{ri}), \quad i = 1, 2, \ldots, n, \]

and there is only one response for each pattern of covariates $(Z_{1i}, \ldots, Z_{ri})$, then, for testing

\[ H_0: \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 Z_{1i} + \cdots + \beta_k Z_{ki} \]

versus $H_A: 0 < \pi_i < 1$ (general alternative), neither

\[ G^2 = 2\sum_{i=1}^{n}\left[ Y_i\log\left(\frac{Y_i}{\hat\pi_i}\right) + (1 - Y_i)\log\left(\frac{1 - Y_i}{1 - \hat\pi_i}\right) \right] \]

nor

\[ X^2 = \sum_{i=1}^{n} \frac{(Y_i - \hat\pi_i)^2}{\hat\pi_i(1-\hat\pi_i)} \]

is well approximated by a chi-squared distribution when $H_0$ is true, even for large $n$.

In this situation, $G^2$ and $X^2$ tests for comparing two (nested) models,

\[ H_0: \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 Z_{1i} + \cdots + \beta_k Z_{ki} \]
\[ H_A: \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 Z_{1i} + \cdots + \beta_k Z_{ki} + \beta_{k+1} Z_{k+1,i} + \cdots + \beta_r Z_{ri}, \]

may still be well approximated by chi-squared distributions when $\beta_{k+1} = \cdots = \beta_r = 0$.
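
A short numeric sketch of the two grouped statistics; the counts and fitted probabilities below are made up (and every cell is kept positive so the logarithms are defined).

```python
# Grouped Pearson X^2 and deviance G^2 from the formulas above; y, n, and
# pihat are made-up values standing in for data and fitted probabilities.
import numpy as np

y = np.array([3, 8, 15])              # successes per covariate pattern
n = np.array([20, 20, 20])            # trials per pattern
pihat = np.array([0.20, 0.45, 0.70])  # fitted probabilities

# Pearson statistic, in its equivalent one-sum form
x2 = np.sum((y - n * pihat) ** 2 / (n * pihat * (1 - pihat)))

# Deviance: 2 * sum of observed * log(observed / fitted) over both columns
obs = np.array([y, n - y])
fit = np.array([n * pihat, n * (1 - pihat)])
g2 = 2.0 * np.sum(obs * np.log(obs / fit))
print(x2, g2)
```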

Deviance:

\[ G^2 = 2\sum_{i=1}^{n} Y_i \log\left(\frac{\hat m_{i,A}}{\hat m_{i,0}}\right) \]

and Pearson statistic

\[ X^2 = \sum_{i=1}^{n} \frac{(\hat m_{i,A} - \hat m_{i,0})^2}{\hat m_{i,0}}, \]

where $\hat m_{i,A}$ and $\hat m_{i,0}$ denote fitted counts under the alternative and null models, have approximate chi-squared distributions with $r - k$ degrees of freedom when $(r-k)/n$ is small and $\beta_{k+1} = \cdots = \beta_r = 0$ (S. Haberman, 1977, Annals of Statistics).

Insignificant values of $G^2$ and $X^2$:
1. Only indicate that the alternative model offers little or no improvement over the null model.
2. Do not imply that either model is adequate.

These are the types of comparisons you make with stepwise variable selection procedures.

Hosmer-Lemeshow test: collect the $n$ cases into $g$ groups and make a $2 \times g$ contingency table:

                Groups:   1       2      ...     g
  (i=1)  Y = 1          O_11    O_12    ...    O_1g
  (i=2)  Y = 2          O_21    O_22    ...    O_2g
         total           n_1     n_2    ...     n_g

Compute a Pearson statistic

\[ C = \sum_{i=1}^{2}\sum_{k=1}^{g} \frac{(O_{ik} - E_{ik})^2}{E_{ik}}, \]

where the "expected counts" are

\[ E_{1k} = n_k\bar\pi_k, \qquad E_{2k} = n_k(1 - \bar\pi_k), \qquad \bar\pi_k = \frac{1}{n_k}\sum_{j=1}^{n_k}\hat\pi_j. \]
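
A sketch of the statistic $C$ with the fixed cutpoints described on the next slide. The fitted probabilities and responses here are simulated (the responses are drawn from the fitted probabilities, so $C$ should behave like a draw from its null $\chi^2_8$ distribution); they are not the BPD values.

```python
# Hosmer-Lemeshow statistic C with fixed cutpoints 0 < .1 < ... < .9 < 1;
# phat and y are simulated stand-ins for fitted probabilities and responses.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
phat = rng.uniform(0.01, 0.99, 247)
y = rng.binomial(1, phat)

cuts = np.linspace(0.1, 0.9, 9)            # interior cutpoints
grp = np.digitize(phat, cuts, right=True)  # group k: cuts[k-1] < phat <= cuts[k]
C = 0.0
for k in range(10):
    members = grp == k
    nk = members.sum()
    if nk == 0:
        continue                           # skip empty groups
    e1 = phat[members].sum()               # E_1k = n_k * pibar_k
    o1 = y[members].sum()
    C += (o1 - e1) ** 2 / e1 + ((nk - o1) - (nk - e1)) ** 2 / (nk - e1)
print(C, stats.chi2.sf(C, df=10 - 2))      # compare C to chi^2 with g-2 df
```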

Hosmer and Lemeshow recommend $g = 10$ groups, formed as

  group 1:  all observational units with $0 < \hat\pi_j \le 0.1$
  group 2:  all observational units with $0.1 < \hat\pi_j \le 0.2$
  ...
  group 10: all observational units with $0.9 < \hat\pi_j < 1.0$

Reject the proposed model if $C > \chi^2_{g-2,\alpha}$.

For the BPD example, observed counts:

  $\hat\pi_i$ values:  0-.1  .1-.2  .2-.3  .3-.4  .4-.5  .5-.6  .6-.7  .7-.8  .8-.9  .9-1.
  BPD = 1 (yes)          3      3      5      3      6      2      6      2      1     46
  BPD = 2 (no)         128     18      6      8      1      1      5      1      1      1
  total                131     21     11     11      7      3     11      3      2     47

"Expected" counts:

  BPD = 1             4.95   3.14   2.67   3.56   3.11   1.67   7.18   2.38   1.72  46.59
  BPD = 2           126.05  17.86   8.33   7.44   3.89   1.33   3.82    .62    .28    .41

C = 12.46 on 8 d.f. (p-value = .132)

* This test often has little power.
* Even when this test indicates that the model does not fit well, it says little about how to improve the model.

The "lackfit" option to the model statement in PROC LOGISTIC makes 10 groups of nearly equal size:

  BPD = 1 (yes)     0     0     0     0     1     4     9    16    25    22
  BPD = 2 (no)     25    25    25    25    24    21    16     9     0     0
  total            25    25    25    25    25    25    25    25    25    22

"Expected" counts ($E_{1k} = \sum_{i \in \text{group } k} \hat\pi_i$):

  BPD = 1          .03   .09   .25   .53  1.25  2.67  7.64  18.1  24.5  22.0
  BPD = 2         25.0  24.9  24.7  24.5  23.7  22.3  17.4   6.9    .5    .1

C = 3.41 on 8 d.f. (p-value = .91)

Diagnostics:

Cook, R. D. and Weisberg, S. (1982) Residuals and Influence in Regression, Chapman and Hall.
Belsley, D. A., Kuh, E., and Welsch, R. E. (1980) Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, Wiley.
Pregibon, D. (1981) Annals of Statistics, 9, 705-724.
Hosmer, D. W. and Lemeshow, S. (2000) Applied Logistic Regression, 2nd edition, Wiley.
Collett, D. (1991) Modelling Binary Data, Chapman and Hall, London.
Lloyd, C. (1999) Statistical Analysis of Categorical Data, Wiley, Section 4.2.
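
For comparison with the lackfit output above, here is a rough Python analogue of the "deciles of risk" grouping (nearly equal-sized groups of sorted $\hat\pi$). SAS's handling of ties differs, so the counts need not match its groups exactly; phat and y are the same kind of simulated stand-ins as before.

```python
# Rough analogue of lackfit's nearly-equal-sized "deciles of risk" grouping;
# only an approximation, since SAS resolves ties in phat differently.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
phat = rng.uniform(0.01, 0.99, 247)
y = rng.binomial(1, phat)

C = 0.0
for idx in np.array_split(np.argsort(phat), 10):   # 10 groups of ~25 cases
    nk, o1, e1 = idx.size, y[idx].sum(), phat[idx].sum()
    C += (o1 - e1) ** 2 / e1 + ((nk - o1) - (nk - e1)) ** 2 / (nk - e1)
print(C, stats.chi2.sf(C, df=8))
```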

Kay, R. and Little, S. (1986) Applied Statistics, 35, 16-30 (case study).
Fowlkes, E. B. (1987) Biometrika, 74, 503-515.
Cook, R. D. (1986) JRSS-B, 48, 133-155.
Miller, M. E., Hui, S. L., and Tierney, W. M. (1991) Statistics in Medicine, 10, 1213-1226.

Residuals and other diagnostics:

Pearson residuals:

\[ r_i = \frac{Y_i - n_i\hat\pi_i}{\sqrt{n_i\hat\pi_i(1-\hat\pi_i)}} \qquad \left( X^2 = \sum_{i=1}^{n} r_i^2 \right) \]

Deviance residuals:

\[ d_i = \mathrm{sign}(Y_i - n_i\hat\pi_i)\sqrt{|g_i|} \qquad \left( G^2 = \sum_{i=1}^{n} d_i^2 \right), \]

where

\[ g_i = 2\left[ Y_i\log\left(\frac{Y_i}{n_i\hat\pi_i}\right) + (n_i - Y_i)\log\left(\frac{n_i - Y_i}{n_i(1-\hat\pi_i)}\right) \right]. \]

Adjusted residuals: (OBS - PRED)/S.E.(RESID).

Adjusted Pearson residual:

\[ r^*_i = \frac{r_i}{\sqrt{1 - h_i}} = \frac{Y_i - n_i\hat\pi_i}{\sqrt{\hat V(Y_i - n_i\hat\pi_i)}} = \frac{Y_i - n_i\hat\pi_i}{\sqrt{n_i\hat\pi_i(1-\hat\pi_i)(1 - h_i)}} \]

Adjusted deviance residual:

\[ d^*_i = \frac{d_i}{\sqrt{1 - h_i}}, \]

where $h_i$ is the "leverage" of the $i$-th observation.
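
The residual formulas translate directly into array arithmetic; a sketch with made-up binomial counts, fitted probabilities, and leverage values (in practice $h_i$ comes from the hat matrix defined on the next slides).

```python
# Pearson, deviance, and adjusted residuals from the formulas above;
# y, n, pihat, and h are all made-up inputs for illustration.
import numpy as np

y = np.array([3, 8, 15])
n = np.array([20, 20, 20])
pihat = np.array([0.20, 0.45, 0.70])
h = np.array([0.10, 0.05, 0.12])     # hypothetical leverage values

r = (y - n * pihat) / np.sqrt(n * pihat * (1 - pihat))   # Pearson
g = 2 * (y * np.log(y / (n * pihat))
         + (n - y) * np.log((n - y) / (n * (1 - pihat))))
d = np.sign(y - n * pihat) * np.sqrt(np.abs(g))          # deviance
r_star = r / np.sqrt(1 - h)                              # adjusted Pearson
d_star = d / np.sqrt(1 - h)                              # adjusted deviance
print(r, d, r_star, d_star)
```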

Compare residuals to percentiles of the standard normal distribution. Cases with residuals larger than 3 or smaller than -3 are suspicious. None of these "residuals" may be well approximated by a standard normal distribution; they are too "discrete."

Residual plots:
- versus each explanatory variable
- versus order (look for outliers or patterns across time)
- versus expected counts $n_i\hat\pi_i$
- smoothed residual plots

What is leverage? In linear regression the "hat matrix" is

\[ H = X(X'X)^{-1}X', \]

which is a projection operator onto the column space of $X$, and

\[ \hat Y = HY, \qquad \text{residuals} = (I - H)Y, \qquad V(\text{residuals}) = (I - H)\sigma^2. \]

For the logistic regression model

\[ \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki}, \qquad i = 1, \ldots, n, \]

we have

\[ \begin{bmatrix} \log\frac{\pi_1}{1-\pi_1} \\ \log\frac{\pi_2}{1-\pi_2} \\ \vdots \\ \log\frac{\pi_n}{1-\pi_n} \end{bmatrix} = \begin{bmatrix} 1 & X_{11} & \cdots & X_{k1} \\ 1 & X_{12} & \cdots & X_{k2} \\ \vdots & & & \vdots \\ 1 & X_{1n} & \cdots & X_{kn} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} = X\beta, \]

where $X$ is the model matrix.
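
A sketch verifying the two basic facts about the linear-regression hat matrix used here, on a made-up model matrix: $H$ is idempotent (a projection) and its diagonal sums to the number of coefficients.

```python
# The linear-regression hat matrix H = X (X'X)^{-1} X' on a made-up model
# matrix with an intercept and two predictors (k + 1 = 3 coefficients).
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])

H = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(H @ H, H))   # True: H projects onto the column space of X
print(np.diag(H).sum())        # 3.0: leverages sum to k + 1
```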

Pregibon (1981) uses a generalized least squares approach to logistic regression which yields a hat matrix

\[ H = V^{1/2}X(X'VX)^{-1}X'V^{1/2}, \]

where $V$ is an $n \times n$ diagonal matrix with $i$-th diagonal element $V_{ii} = n_i\hat\pi_i(1-\hat\pi_i)$, and $n_i$ is the number of cases with the $i$-th covariate pattern.

The $i$-th diagonal element of $H$ is called a leverage value; call this element $h_i$. Note that

\[ \sum_{i=1}^{n} h_i = k + 1, \]

the number of coefficients. When there is one individual for each covariate pattern, the upper bound on $h_i$ is 1.

Cases with large values of $h_i$ may be cases with vectors of covariates that are far away from the mean of the covariates. However, such cases can have small $h_i$ values if $\hat\pi_i < 0.1$ or $\hat\pi_i > 0.9$. An alternative quantity that gets larger as the vector of covariates gets farther from the mean of the covariates is

\[ b_i = \frac{h_i}{n_i\hat\pi_i(1-\hat\pi_i)} \]

(see Hosmer and Lemeshow, pages 153-155). Look for cases with large leverage values and see what happens to the estimated coefficients when the case is deleted.

INFLUENCE: analogous to Cook's D for linear regression. Define

  $b$       the m.l.e. for $\beta$ using all $n$ observations
  $b_{(i)}$ the m.l.e. for $\beta$ when the $i$-th case is deleted

A "standardized" distance between $b$ and $b_{(i)}$ is approximately

\[ \text{Influence}(i) = (b - b_{(i)})'(X'VX)(b - b_{(i)}) \doteq \frac{r_i^2 h_i}{(1 - h_i)^2} = (r^*_i)^2\,\frac{h_i}{1 - h_i}, \]

the squared adjusted residual times a monotone function of the leverage. This quantity is called $C_i$ in PROC LOGISTIC.
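
Pregibon's hat matrix and the $C_i$ formula in array form; the model matrix and fitted probabilities below are hypothetical (in practice $\hat\pi_i$ would come from the fitted model).

```python
# Pregibon's logistic hat matrix H = V^{1/2} X (X'VX)^{-1} X' V^{1/2} and
# the influence measure C_i; X and pihat are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
pihat = 1.0 / (1.0 + np.exp(-X @ np.array([-0.5, 1.0, -0.8])))
y = rng.binomial(1, pihat)          # one case per covariate pattern (n_i = 1)

V = np.diag(pihat * (1 - pihat))    # V_ii = n_i * pihat_i * (1 - pihat_i)
Vh = np.sqrt(V)
H = Vh @ X @ np.linalg.inv(X.T @ V @ X) @ Vh
h = np.diag(H)                      # leverages; h.sum() equals k + 1

r = (y - pihat) / np.sqrt(pihat * (1 - pihat))   # Pearson residuals
C = r ** 2 * h / (1 - h) ** 2                    # C_i as in PROC LOGISTIC
print(h.sum(), C.round(3))
```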

V ar(y i ^ß i ) _=n i ß i (1 ß i )(1 h i ) and an adjusted Pearson residual is PROC LOGISTIC approximates the m.l.e. for with the i-th case deleted as b Λ (i) = b bλ (i) where b Λ (i) = (X V X) 1 x i @ Y 1 i n i^ß i A 1 h i r Λ i = Y i ^ß i qni^ß i (1 ^ß i )(1 h i ) An approximate measure of influence is = r i 1 h i C i = r 2 i h i (1 h i ) 2 % square of the Pearson residual 1133 1134