STAT 526 Spring Final Exam. Thursday May 5, 2011


STAT 526 Spring 2011 Final Exam
Thursday May 5, 2011
Time: 2 hours

Name (please print):

Show all your work and calculations. Partial credit will be given for work that is partially correct. Points will be deducted for false statements, even if the final answer is correct. Please circle your final answer where appropriate.

This exam is closed-book. You may consult two pages with your hand-written notes. Calculators are permitted.

Honor code: I promise not to cheat on this exam. I will neither give nor receive any unauthorized assistance. I will not share information about the exam with anyone who may be taking it at a different time. I have not been told anything about the exam by someone who has taken it earlier.

Signature:                    Date:

Question   Possible Points   Actual Points
1          24
2          24
3          24
4          36
5          12

1. The following data are counts of road accidents in Maine, classified according to location and gender of the patients, and whether the person was wearing a seat belt. The response categories are (1) not injured, (2) injured but not transported by emergency medical services, (3) injured and transported by emergency medical services, (4) injured, hospitalized and survived, (5) injured and not survived.

                                        Response
    Gender   Location   Seat Belt       1     2     3     4    5
    Female   Urban      No           7287   175   720    91   10
                        Yes         11587   126   577    48    8
             Rural      No           3246    73   710   159   31
                        Yes          6134    94   564    82   17
    Male     Urban      No          10381   136   566    96   14
                        Yes         10969    83   259    37    1
             Rural      No           6123   141   710   188   45
                        Yes          6693    74   353    74   12

A proportional odds model was fit to the response, using the remaining variables (gender, location and seat belt use) as predictors. Parameter estimates and standard errors of the fit are given below. All parameters have the correct signs (i.e. you do not need to take the negative value of parameter estimates, as is usually done with the polr output).

    Coefficients:
                                 Value   Std. Error    t value
    (Intercept):1              3.30742     0.035102    94.2233
    (Intercept):2              3.48185     0.035562    97.9103
    (Intercept):3              5.34938     0.046950   113.9385
    (Intercept):4              7.25633     0.091428    79.3664
    genderfemale              -0.54625     0.027211   -20.0748
    seatbeltno                -0.76016     0.039382   -19.3021
    locationrural             -0.69885     0.042388   -16.4867
    seatbeltno:locationrural  -0.12442     0.054765    -2.2718

(a) (6 pts) State the model and the distributional assumptions.

$Y \mid X \sim \text{Multinomial}(\pi_1, \pi_2, \pi_3, \pi_4, \pi_5)$. If we define $P_j = \Pr(Y \le j \mid X)$, then the model is

\[ \log\left(\frac{P_j}{1 - P_j}\right) = \alpha_j + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_{23} X_2 X_3, \quad j = 1, \ldots, 4, \]

where $X_1 = I\{\text{gender} = \text{female}\}$, $X_2 = I\{\text{seatbelt} = \text{no}\}$ and $X_3 = I\{\text{location} = \text{rural}\}$.

(b) (6 pts) For males in urban areas wearing seat belts, calculate the estimated probabilities of the first two response categories.

The cumulative probabilities are

\[ P_j = \frac{\exp(\alpha_j + x^\top \beta)}{1 + \exp(\alpha_j + x^\top \beta)}, \]

where all indicators in $x$ equal 0 for this group, so the values for all the categories are

\[ \left( \frac{\exp(3.30742)}{1 + \exp(3.30742)},\ \frac{\exp(3.48185)}{1 + \exp(3.48185)},\ \frac{\exp(5.34938)}{1 + \exp(5.34938)},\ \frac{\exp(7.25633)}{1 + \exp(7.25633)},\ 1 \right). \]

The response probabilities are $\pi_1 = P_1$ and $\pi_j = P_j - P_{j-1}$ for $j = 2, \ldots, 5$, therefore the values for all the categories are

(0.9647, 0.0055, 0.0251, 0.0040, 0.0007)

(c) (6 pts) Give a point estimate of the cumulative odds ratio of the gender, given seat belt use and location. Interpret this estimate.

The odds ratio is defined as

\[ OR = \frac{\text{odds}\, P\{Y \le j \mid \text{female}\}}{\text{odds}\, P\{Y \le j \mid \text{male}\}}. \]

The point estimate is $\widehat{OR} = \exp(\hat\beta_1) = \exp(-0.54625) = 0.579$. Given any status of seat belt use and location, the estimated odds that the response falls at or below any fixed category (i.e. that the outcome is no more severe than that level) for a female are 0.579 times the estimated odds for a male.
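The arithmetic in (b) and (c) can be checked with a short Python sketch. The intercepts and the gender coefficient are copied from the output in Question 1; the function name `category_probs` and its `eta` argument are illustrative choices, not part of the exam.

```python
import math

# Intercepts (cutpoints) copied from the fitted proportional odds model.
alphas = [3.30742, 3.48185, 5.34938, 7.25633]


def category_probs(alphas, eta=0.0):
    """Category probabilities from cumulative logits alpha_j + eta."""
    # Cumulative probabilities P_j = P(Y <= j); the last one is 1 by definition.
    cumulative = [1.0 / (1.0 + math.exp(-(a + eta))) for a in alphas] + [1.0]
    # pi_1 = P_1 and pi_j = P_j - P_{j-1} for j >= 2.
    return [cumulative[0]] + [
        cumulative[j] - cumulative[j - 1] for j in range(1, len(cumulative))
    ]


# Male, urban, seat belt worn: every indicator is 0, so eta = 0.
probs = category_probs(alphas)
print([round(p, 4) for p in probs])

# Cumulative odds ratio for gender (female vs. male) from part (c).
gender_or = math.exp(-0.54625)
print(round(gender_or, 3))
```

Passing a non-zero `eta` (the sum of the relevant coefficients) gives the probabilities for any other covariate pattern.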

(d) (6 pts) Find and interpret the estimated cumulative odds ratio between response and seat belt use, given that the accidents occurred in a rural location. Why is this estimate different from the estimate for the accidents in urban locations?

The odds ratio is defined as

\[ OR = \frac{\text{odds}\, P\{Y \le j \mid \text{seatbelt} = \text{no}, \text{rural}\}}{\text{odds}\, P\{Y \le j \mid \text{seatbelt} = \text{yes}, \text{rural}\}}. \]

The point estimate is $\widehat{OR} = \exp(\hat\beta_2 + \hat\beta_{23}) = \exp(-0.76016 - 0.12442) = 0.41$ in rural locations and $\exp(\hat\beta_2) = \exp(-0.76016) = 0.47$ in urban locations. The interaction effect $-0.1244$ represents the difference between the log odds ratios in rural and urban locations.
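The two seat belt odds ratios in (d) can be reproduced as follows. This is a minimal sketch with the coefficients copied from the output in Question 1; the variable names are illustrative.

```python
import math

beta_seatbelt_no = -0.76016   # main effect of not wearing a seat belt
beta_interaction = -0.12442   # seatbeltno:locationrural

# Cumulative odds ratio (no seat belt vs. seat belt) in each location.
or_urban = math.exp(beta_seatbelt_no)
or_rural = math.exp(beta_seatbelt_no + beta_interaction)

# The two odds ratios differ by the multiplicative factor exp(interaction).
ratio = or_rural / or_urban

print(round(or_urban, 2), round(or_rural, 2), round(ratio, 3))
```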

2. Consider an $I \times J \times 2$ table with variables $X$, $W$ and $Y$ respectively. Assume that the counts in the table are consistent with Poisson sampling.

(a) (6 pts) Specify a log-linear model that assumes an association between each pair of the variables for each level of the third variable, i.e. $(XY, WY, XW)$. State the model, the assumptions, the constraints and the model degrees of freedom.

We assume that the count in each cell is $Y_{ijk} \overset{ind}{\sim} \text{Poisson}(\lambda_{ijk})$, where $i = 1, \ldots, I$, $j = 1, \ldots, J$ and $k = 1, 2$, and

\[ \log \lambda_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk}, \]

\[ \mu = \log E\{Y_{111}\}, \quad \alpha_1 = \beta_1 = \gamma_1 = (\alpha\gamma)_{1k} = (\alpha\gamma)_{i1} = (\beta\gamma)_{1k} = (\beta\gamma)_{j1} = (\alpha\beta)_{1j} = (\alpha\beta)_{i1} = 0. \]

The model df = 1 + (I-1) + (J-1) + (2-1) + (I-1)(J-1) + (I-1)(2-1) + (J-1)(2-1) = IJ + I + J - 1.

Equivalently, the model df can be obtained as 2IJ - (I-1)(J-1)(2-1).

(b) (6 pts) Specify the equivalent logistic regression model with Y as the response. State the model, the assumptions, the constraints and the model degrees of freedom.

The equivalent logistic regression model is $Y_{2|ij} \overset{ind}{\sim} \text{Binomial}(n_{ij}, \pi_{2|ij})$, where

\[ \log\left(\frac{\pi_{2|ij}}{1 - \pi_{2|ij}}\right) = \mu + \alpha_i + \beta_j; \quad \mu = \log\left(\frac{\pi_{2|11}}{1 - \pi_{2|11}}\right) \text{ and } \alpha_1 = \beta_1 = 0. \]

The model df = 1 + (I-1) + (J-1) = I + J - 1. The logistic model conditions on the totals $n_{ij}$, which the log-linear model has to fit with IJ parameters; this accounts for the additional IJ degrees of freedom. Therefore the two models are equivalent.
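The degrees-of-freedom bookkeeping in 2(a) and 2(b) can be verified numerically for a few table sizes. The function names below are hypothetical helpers introduced only for this check.

```python
def loglinear_df(I, J, K=2):
    """Free parameters of the (XY, WY, XW) log-linear model for an I x J x K table."""
    main = 1 + (I - 1) + (J - 1) + (K - 1)
    pairwise = (I - 1) * (J - 1) + (I - 1) * (K - 1) + (J - 1) * (K - 1)
    return main + pairwise


def logistic_df(I, J):
    """Free parameters of the additive logistic model with Y as the response."""
    return 1 + (I - 1) + (J - 1)


for I, J in [(2, 2), (3, 4), (5, 2)]:
    # Closed forms quoted in the solution, and the gap of IJ between the two models.
    print(I, J, loglinear_df(I, J), I * J + I + J - 1,
          2 * I * J - (I - 1) * (J - 1), loglinear_df(I, J) - logistic_df(I, J))
```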

(c) (6 pts) Denote the fitted cell counts according to the log-linear model $\hat\lambda_{ijk}$ and the fitted probabilities according to the logistic regression $\hat\pi_{1|ij}$ and $\hat\pi_{2|ij}$. Write $\hat\pi_{2|ij}$ in terms of $\hat\lambda_{ijk}$.

\[ \hat\pi_{2|ij} = \frac{\hat\lambda_{ij2}}{\hat\lambda_{ij1} + \hat\lambda_{ij2}} \]

(d) (6 pts) Show that according to this model (in either formulation (a) or (b), since these are equivalent) the odds ratios of levels of Y given different levels of X are independent of W.

E.g., using the formulation in (b):

\[ \log OR = \log \frac{\text{odds}(Y = 2 \mid X = i', W = j)}{\text{odds}(Y = 2 \mid X = i, W = j)} = \log\left(\frac{\pi_{2|i'j} / (1 - \pi_{2|i'j})}{\pi_{2|ij} / (1 - \pi_{2|ij})}\right) = \alpha_{i'} - \alpha_i, \]

which does not depend on $j$.
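The correspondence between the two formulations can also be written out directly. The derivation below is a sketch that assumes the reference-cell constraints $\gamma_1 = (\alpha\gamma)_{i1} = (\beta\gamma)_{j1} = 0$ for the log-linear model; the superscript "logit" marking the logistic parameters is notation introduced here to keep the two sets of symbols apart.

```latex
\begin{align*}
\log\frac{\pi_{2|ij}}{1-\pi_{2|ij}}
  &= \log\frac{\lambda_{ij2}}{\lambda_{ij1}}
   = \log\lambda_{ij2} - \log\lambda_{ij1} \\
  &= (\gamma_2 - \gamma_1)
     + \bigl[(\alpha\gamma)_{i2} - (\alpha\gamma)_{i1}\bigr]
     + \bigl[(\beta\gamma)_{j2} - (\beta\gamma)_{j1}\bigr] \\
  &= \gamma_2 + (\alpha\gamma)_{i2} + (\beta\gamma)_{j2},
\end{align*}
so that $\mu^{\text{logit}} = \gamma_2$, $\alpha_i^{\text{logit}} = (\alpha\gamma)_{i2}$
and $\beta_j^{\text{logit}} = (\beta\gamma)_{j2}$.
```

The terms $\mu$, $\alpha_i$, $\beta_j$ and $(\alpha\beta)_{ij}$ of the log-linear model cancel in the difference, which is why the $XW$ association plays no role in the logistic model.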

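3. The geometric probability distribution can be interpreted as the distribution of the number of iid Bernoulli trials observed until a first success. Its probability mass function is

\[ f(y \mid \pi) = \pi^y (1 - \pi), \quad y = 0, 1, 2, \ldots \]

(a) (6 pts) Show that this is a member of the exponential family of distributions. According to the notation used in class, state $\phi$, $a(\phi)$, $\theta$, $b(\theta)$ and $c(y, \phi)$.

\[ f(y \mid \pi) = \exp\left( y \log \pi + \log(1 - \pi) \right) \]

Therefore

\[ \theta = \log(\pi), \quad \pi = \exp(\theta), \quad \phi = 1, \quad a(\phi) = 1, \quad c(y, \phi) = 0, \]
\[ b(\theta) = -\log(1 - \pi) = -\log(1 - \exp\theta). \]

(b) (6 pts) State $E\{Y\}$, $Var\{Y\}$ and the canonical link.

\[ E\{Y\} = \frac{\partial b(\theta)}{\partial \theta} = \frac{e^\theta}{1 - e^\theta} = \frac{\pi}{1 - \pi} = \mu \ \text{(notation)} \]

\[ Var\{Y\} = \frac{\partial^2 b(\theta)}{\partial \theta^2}\, a(\phi) = \frac{e^\theta}{(1 - e^\theta)^2} = \frac{\pi}{(1 - \pi)^2} \]

Since $\mu = \dfrac{e^\theta}{1 - e^\theta}$, the canonical link is

\[ \theta = \log\left(\frac{\mu}{1 + \mu}\right). \]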
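The mean, variance and canonical link in (b) can be sanity-checked numerically by summing the probability mass function directly. This is only an illustrative sketch: the value `pi = 0.3`, the truncation point and the function name are arbitrary assumptions.

```python
import math


def geometric_moments(pi, terms=2000):
    """Approximate E[Y] and Var[Y] for f(y) = pi**y * (1 - pi) by a truncated sum."""
    probs = [(pi ** y) * (1.0 - pi) for y in range(terms)]
    mean = sum(y * p for y, p in enumerate(probs))
    second_moment = sum(y * y * p for y, p in enumerate(probs))
    return mean, second_moment - mean ** 2


pi = 0.3
mean, var = geometric_moments(pi)
theta = math.log(pi)

print(mean, pi / (1 - pi))                  # E{Y} = pi / (1 - pi)
print(var, pi / (1 - pi) ** 2)              # Var{Y} = pi / (1 - pi)^2
print(math.log(mean / (1 + mean)), theta)   # canonical link recovers theta
```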

(c) Consider a two-arm randomized clinical trial where we study remission of a disease. The patients are randomly assigned to a treatment (arm 1) or to a control (arm 2). The protocol of the trial says that each arm stops whenever a remission is observed.

i. (6 pts) Specify a probability model that would be appropriate for this study. Use the canonical link function.

The number of patients seen until the first remission follows a geometric distribution, with the probability of an event different between the arms. Since the canonical link is $\log\left(\frac{\mu}{1 + \mu}\right)$, using the notation $\mu = \pi / (1 - \pi)$ the model is

\[ \log\left(\frac{\mu}{1 + \mu}\right) = \log\left(\frac{\pi / (1 - \pi)}{1 / (1 - \pi)}\right) = \log \pi = \beta_0 + \beta_1\, \text{trt}, \]

where trt takes the value 1 for treatment, and 0 otherwise.

ii. (6 pts) Suppose that the control arm stops at the fifth patient, and the treatment arm stops at the third patient. Write the log-likelihood function that will be maximized to obtain parameter estimates.

Writing $R$ for a remission and $\bar{R}$ for no remission, the observed sequences of events are $\bar{R}, \bar{R}, \bar{R}, \bar{R}, R$ for the control, and $\bar{R}, \bar{R}, R$ for the treatment, i.e. $y = 4$ and $y = 2$ respectively. Using the exponential family representation in 3(a), the log-likelihood is

\[ l(\beta_0, \beta_1) = \left[ 4 \beta_0 + \log(1 - e^{\beta_0}) \right] + \left[ 2 (\beta_0 + \beta_1) + \log(1 - e^{\beta_0 + \beta_1}) \right]. \]
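The log-likelihood in (c, ii) is simple enough to evaluate directly. The sketch below assumes the counts y = 4 (control) and y = 2 (treatment); because the model has two parameters and two observations, each arm's likelihood is maximized separately at pi = y / (y + 1), which is a standard calculus result rather than something stated in the exam. Function and variable names are illustrative.

```python
import math


def log_likelihood(beta0, beta1, y_control=4, y_treatment=2):
    """Geometric log-likelihood with log(pi) = beta0 + beta1 * trt."""
    pi_control = math.exp(beta0)
    pi_treatment = math.exp(beta0 + beta1)
    # Both probabilities must lie strictly below 1 for log(1 - pi) to exist.
    if pi_control >= 1 or pi_treatment >= 1:
        return float("-inf")
    return (y_control * beta0 + math.log(1 - pi_control)
            + y_treatment * (beta0 + beta1) + math.log(1 - pi_treatment))


# Setting d/dpi [y log(pi) + log(1 - pi)] = 0 gives pi_hat = y / (y + 1) in each arm.
beta0_hat = math.log(4 / 5)
beta1_hat = math.log(2 / 3) - beta0_hat

print(beta0_hat, beta1_hat, log_likelihood(beta0_hat, beta1_hat))
```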

4. Researchers conducted a study to determine whether female breast cancer patients can be more accurately classified into subtypes using immunohistochemical (IH) examination. The study reported survival times (in months) of patients with negative and with positive outcomes of the exam. The data are shown below, where + indicates a censored observation.

          Survival times t_i                                                         n    sum of t_i
    IH-   19 25 30 30+ 30+ 38 47 51 56 57 78 86 122+ 123+ 130+ 130+ 133+ 134+       18    1319
    IH+   22 22+ 38 42 73 77 89 115 144+                                             9     622

(a) The researchers would like to estimate the probability of survival for each group, without making any parametric distributional assumptions.

i. (6 pts) State the assumptions of the Kaplan-Meier model for survival, and report $\hat{S}_{KM}(38)$.

The survival function is $S(t) = c_i$ for $t_i \le t < t_{i+1}$, for each observed event time $t_i$, where $1 \ge c_i \ge c_{i+1} \ge 0$. Its Kaplan-Meier estimator is

\[ \hat{S}_{KM}(t) = \prod_{i:\, t_i \le t} \frac{n_i - d_i}{n_i}, \]

so that, for the IH- group,

\[ \hat{S}_{KM}(38) = (1 - 1/18)(1 - 1/17)(1 - 1/16)(1 - 1/13) = 0.7692. \]

ii. (6 pts) Report the 95% confidence interval of $\hat{S}_{KM}(38)$.

\[ Var\{\log \hat{S}_{KM}(38)\} = \sum_{i:\, t_i \le t} \frac{d_i}{n_i (n_i - d_i)} = \frac{1}{18 \cdot 17} + \frac{1}{17 \cdot 16} + \frac{1}{16 \cdot 15} + \frac{1}{13 \cdot 12} = 0.0175 \]

Then the 95% confidence interval for $\log \hat{S}_{KM}(38)$ is

\[ \log 0.7692 \pm 1.96 \sqrt{0.0175}, \quad [L, R] = [-0.5218, -0.003], \]

and the approximate 95% confidence interval for $\hat{S}_{KM}(38)$ is

\[ [e^L, e^R] = [0.5934, 0.9970]. \]
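The Kaplan-Meier calculation in (a) can be reproduced from the raw IH- data. This is a minimal sketch, not a general-purpose implementation: it assumes that observations censored at a death time are still at risk at that time, and it does not guard against the case where everyone at risk dies. The function name is illustrative.

```python
import math

# IH- group as (time, event) pairs: event = 1 for a death, 0 for censoring.
ih_negative = [
    (19, 1), (25, 1), (30, 1), (30, 0), (30, 0), (38, 1), (47, 1), (51, 1),
    (56, 1), (57, 1), (78, 1), (86, 1), (122, 0), (123, 0), (130, 0),
    (130, 0), (133, 0), (134, 0),
]


def kaplan_meier(data, t):
    """Return (S_hat(t), Var{log S_hat(t)}), the variance via Greenwood's formula."""
    surv, var_log = 1.0, 0.0
    death_times = sorted({ti for ti, d in data if d == 1 and ti <= t})
    for time in death_times:
        at_risk = sum(1 for ti, _ in data if ti >= time)
        deaths = sum(1 for ti, d in data if ti == time and d == 1)
        surv *= (at_risk - deaths) / at_risk
        var_log += deaths / (at_risk * (at_risk - deaths))
    return surv, var_log


s_hat, var_log = kaplan_meier(ih_negative, 38)
half_width = 1.96 * math.sqrt(var_log)
ci_low = math.exp(math.log(s_hat) - half_width)
ci_high = math.exp(math.log(s_hat) + half_width)

print(round(s_hat, 4), round(var_log, 4), round(ci_low, 4), round(ci_high, 4))
```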

(b) The researchers now assume that the survival function for each group follows an exponential distribution.

i. (6 pts) State the assumptions of the model, and report $\hat{S}_{exponential}(38)$.

The survival function is

\[ S_{IH-}(t) = e^{-\lambda_{IH-} t}, \quad S_{IH+}(t) = e^{-\lambda_{IH+} t}, \]

where each $\lambda$ is the reciprocal of the expected survival time in its group. For the IH- group, the MLE of $\lambda_{IH-}$ is

\[ \hat\lambda_{IH-} = \frac{\sum_{i=1}^{18} \delta_i}{\sum_{i=1}^{18} t_i} = \frac{10}{1319} = 0.0075 \ \text{(truncated to four decimal places)}, \]

and

\[ \hat{S}_{exponential}(38) = e^{-0.0075 \cdot 38} = 0.7520, \]

comparable to the K-M estimate.

ii. (6 pts) Report the 95% confidence interval of $\hat{S}_{exponential}(38)$. How does this confidence interval compare to the confidence interval in (a, ii)? Explain. (Hint: use the observed information to estimate $Var\{\hat{S}_{exponential}(38)\}$.)

\[ SE\{\hat\lambda_{IH-}\} = \sqrt{\frac{\sum_{i=1}^{18} \delta_i}{\left(\sum_{i=1}^{18} t_i\right)^2}} = \sqrt{\frac{10}{1319^2}} = 0.0023 \]

\[ SE[\log S(t)] = \sqrt{Var\{\log S(t)\}} = \sqrt{Var\{-\lambda t\}} = \sqrt{t^2\, Var\{\lambda\}} = t \cdot SE(\lambda) = t \cdot 0.0023 \]

Then the 95% confidence interval for $\log \hat{S}_{exponential}(38)$ is

\[ \log 0.7520 \pm 1.96 \cdot 38 \cdot 0.0023, \quad [L, R] = [-0.4563, -0.1137], \]

and the approximate 95% confidence interval for $\hat{S}_{exponential}(38)$ is

\[ [e^L, e^R] = [0.6336, 0.8925]. \]

The confidence interval is narrower due to the parametric nature of the estimation.
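The exponential-model estimate and its interval in (b) follow from two summary numbers, the event count and the total follow-up time in the IH- group. The sketch below carries full floating-point precision, so its results differ in the third decimal place from the figures in the solution, which uses the truncated values 0.0075 and 0.0023. Variable names are illustrative.

```python
import math

events = 10        # number of uncensored observations in the IH- group
total_time = 1319  # sum of all observed times in the IH- group
t = 38

# MLE of the rate and its standard error from the observed information.
rate_hat = events / total_time
se_rate = math.sqrt(events) / total_time

# Point estimate and interval, built on the log scale as in the solution.
surv_hat = math.exp(-rate_hat * t)
half_width = 1.96 * t * se_rate
ci_low = math.exp(math.log(surv_hat) - half_width)
ci_high = math.exp(math.log(surv_hat) + half_width)

print(round(rate_hat, 5), round(se_rate, 5))
print(round(surv_hat, 4), round(ci_low, 4), round(ci_high, 4))
```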

(c) To determine the effectiveness of the immunohistochemical examination, the researchers fit the model given in the partial output below. In the output, the variable ih=0 for the IH negative and ih=1 for the IH positive group.

    Call:
    survreg(formula = Surv(time, delta) ~ ih, data = X, dist = "exponential")
                 Value Std. Error      z        p
    (Intercept)  4.882      0.316 15.438 9.03e-54
    ih          -0.395      0.493 -0.802 4.23e-01

    Scale fixed at 1

    Exponential distribution
    Loglik(model)= -97.2   Loglik(intercept only)= -97.5
            Chisq= 0.62 on 1 degrees of freedom, p= 0.43

i. (6 pts) State the assumptions of the model, and interpret the parameters.

This is an accelerated failure time model, which assumes that the covariates have the effect of rescaling the time to event:

\[ S(t) = S_0\left(t\, e^{-(\beta_0 + \beta_1\, ih)}\right), \]

where $S_0(t) = e^{-t}$ is the survival function of the standard exponential random variable. Under this model $e^{\beta_0}$ is the expected survival time in the IH- group, and $e^{\beta_1}$ is the factor by which the time scale is multiplied in the IH+ group.

ii. (6 pts) Report the estimate of $\hat{S}(38)$ according to this model. Does the estimate agree with the result in (b)? Explain.

\[ \hat{S}_{AF}(38) = \exp\left(-38\, e^{-4.882}\right) = 0.7496 \]

Up to numeric rounding, $\hat{S}_{AF}(38) = \hat{S}_{exponential}(38)$. The two models specify the same likelihood function for each of the groups, and therefore yield the same parameter estimates.
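The fitted survival probabilities in (c) can be computed from the two reported coefficients. This is a sketch under the parameterization above, in which exp(linear predictor) is the expected survival time; the function name is hypothetical, and the coefficients are the rounded values from the printed output, so the last digit may differ from the solution.

```python
import math

intercept = 4.882   # (Intercept) from the survreg output
coef_ih = -0.395    # coefficient of ih


def aft_exponential_survival(t, ih):
    """S(t) = exp(-t / scale), with scale = exp(intercept + coef_ih * ih)."""
    scale = math.exp(intercept + coef_ih * ih)  # expected survival time
    return math.exp(-t / scale)


s_negative = aft_exponential_survival(38, ih=0)
s_positive = aft_exponential_survival(38, ih=1)

print(round(s_negative, 4), round(s_positive, 4))
# The implied mean survival time in the IH- group is close to 1319 / 10.
print(round(math.exp(intercept), 1))
```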

5. Consider a proportional odds model for an ordered multinomial response Y, as a function of the predictor variable X. For each question below, circle TRUE or FALSE, and provide the rationale.

(a) (6 pts) Suppose that the cumulative odds ratio of a particular category of Y for two levels of X is 3. If we invert the order of the response categories, then the odds ratio will be equal to 1/3.

TRUE FALSE

True. The model is defined as

\[ \log\left(\frac{P\{Y \le j\}}{1 - P\{Y \le j\}}\right) = \alpha_j + \beta x. \]

In the new order of the response categories the cumulative event is $\{Y > j\}$, and the model can be expressed in terms of the old order as

\[ \log\left(\frac{P\{Y > j\}}{1 - P\{Y > j\}}\right) = \log\left(\frac{1 - P\{Y \le j\}}{P\{Y \le j\}}\right) = -\alpha_j - \beta x, \]

and the odds ratio is

\[ OR = \frac{\text{odds}\, P\{Y > j \mid X + 1\}}{\text{odds}\, P\{Y > j \mid X\}} = e^{-\beta} = \frac{1}{e^{\beta}}. \]

(b) (6 pts) A positive value of the parameter $\beta$ associated with the predictor X can be interpreted as the positive association between the predictor and the latent continuous variable underlying the response.

TRUE FALSE

False. A positive parameter indicates a negative association. The model expresses the fact that an increase in X is associated with higher odds of lower categories of Y.
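The claim in 5(a) can be illustrated numerically. In the sketch below the intercept `alpha` and the covariate value `x` are arbitrary hypothetical numbers; only `beta = log(3)` comes from the question, which states a cumulative odds ratio of 3.

```python
import math

alpha = 0.4          # hypothetical cutpoint for some category j
beta = math.log(3)   # chosen so that the cumulative odds ratio equals 3
x = 1.7              # hypothetical covariate value


def odds_at_or_below(x):
    """Odds of Y <= j under log-odds = alpha + beta * x."""
    return math.exp(alpha + beta * x)


def odds_above(x):
    """Odds of Y > j, the cumulative event after reversing the category order."""
    return 1.0 / odds_at_or_below(x)


or_original = odds_at_or_below(x + 1) / odds_at_or_below(x)
or_reversed = odds_above(x + 1) / odds_above(x)

print(or_original, or_reversed)
```

Neither ratio depends on `alpha` or `x`, which is the proportional odds property.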