Solutions for Examination Categorical Data Analysis, March 21, 2013


Stockholm University, Department of Mathematics, Division of Mathematical Statistics
Frank Miller
MT 5006, Solutions, March 21, 2013

Problem 1

a. Fisher's exact test evaluates the distribution of $n_{11}$ under $H_0$, conditional on the observed marginal frequencies:

$$P_{H_0}(n_{11} = t \mid n_{1+}, n_{2+}, n_{+1}, n_{+2}) = \frac{\binom{n_{1+}}{t}\binom{n - n_{1+}}{n_{+1} - t}}{\binom{n}{n_{+1}}}.$$

b. One-sided p-value for Fisher's exact test:

$$P(n_{11} \le t_o \mid n_{i+}, n_{+j}) = \sum_{t=0}^{4} P(n_{11} = t \mid n_{i+}, n_{+j}) = 0 + 0 + 0.001 + 0.004 + 0.019 = 0.024,$$

where $t_o = 4$ is the observed value of $n_{11}$. The one-sided p-value is below 5% (even below 2.5%), and therefore the study shows a hypertension-preventing effect of aspirin compared to placebo.

c. The two-sided p-value can be defined as

$$P(n_{11} \le t_o \mid n_{i+}, n_{+j}) + P(n_{11} \ge t^* \mid n_{i+}, n_{+j})$$

with $t^* = \min\{t : P(n_{11} = s \mid n_{i+}, n_{+j}) \le P(n_{11} = t_o \mid n_{i+}, n_{+j}) \text{ for all } s \ge t\}$. As the hypergeometric distribution is unimodal, this definition is equal to

$$\sum_{t \in T} P(n_{11} = t \mid n_{i+}, n_{+j}) \quad \text{with} \quad T = \{s : P(n_{11} = s \mid n_{i+}, n_{+j}) \le P(n_{11} = t_o \mid n_{i+}, n_{+j})\}.$$

For the dataset, the two-sided p-value is

$$P(n_{11} \le 4 \mid n_{i+}, n_{+j}) + P(n_{11} \ge 12 \mid n_{i+}, n_{+j}) = 0.024 + 0.012 + 0.002 = 0.038.$$

The null hypothesis is rejected by the two-sided test at the 5% level, indicating that aspirin has some effect compared to placebo.

d. Mid p-value:

$$P(n_{11} < t_o \mid n_{i+}, n_{+j}) + \tfrac{1}{2} P(n_{11} = t_o \mid n_{i+}, n_{+j}) = 0.005 + 0.019/2 = 0.0145.$$

Advantage: The type I error of a test based on the mid p-value is closer to the desired level $\alpha$. Disadvantage: The type I error might in some cases be slightly above $\alpha$ (the test is anti-conservative).
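The three p-values of Problem 1 can be checked numerically. The sketch below is not part of the original solutions; it assumes the 2x2 table is the one used in Problem 2 (counts 4, 11, 30, 20), and the names `pmf`, `support`, `p_one_sided`, `p_two_sided` and `p_mid` are illustrative only.

```python
from math import comb

# Assumption: same 2x2 table as in Problem 2.
n11, n12, n21, n22 = 4, 11, 30, 20
n1p, n2p = n11 + n12, n21 + n22   # row totals n_{1+}, n_{2+}
np1 = n11 + n21                   # first column total n_{+1}
n = n1p + n2p                     # grand total


def pmf(t):
    """Hypergeometric P(n11 = t | margins) under H0."""
    return comb(n1p, t) * comb(n - n1p, np1 - t) / comb(n, np1)


# Values of n11 that are possible given the margins.
support = range(max(0, np1 - n2p), min(n1p, np1) + 1)
p_obs = pmf(n11)

# One-sided: observed value or anything smaller.
p_one_sided = sum(pmf(t) for t in support if t <= n11)
# Two-sided: all outcomes no more likely than the observed one
# (small relative tolerance guards against floating-point ties).
p_two_sided = sum(pmf(t) for t in support if pmf(t) <= p_obs * (1 + 1e-9))
# Mid p-value: strictly smaller outcomes plus half the observed probability.
p_mid = sum(pmf(t) for t in support if t < n11) + 0.5 * p_obs

print(round(p_one_sided, 3), round(p_two_sided, 3), round(p_mid, 4))
```

Because the sketch sums unrounded probabilities, the mid p-value comes out slightly below the 0.0145 obtained above from terms rounded to three decimals.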

Problem 2

a. Estimate for the odds ratio:

$$\hat\theta = \frac{n_{11} n_{22}}{n_{12} n_{21}} = \frac{4 \cdot 20}{11 \cdot 30} = \frac{8}{33} = 0.242.$$

The odds of getting hypertension after treatment with aspirin were 0.242 times the odds of getting hypertension after placebo treatment.

b. Standard error:

$$SE = \sqrt{\hat V(\log \hat\theta)} = \sqrt{1/n_{11} + 1/n_{12} + 1/n_{21} + 1/n_{22}} = \sqrt{1/4 + 1/11 + 1/30 + 1/20} = 0.651.$$

$\log \hat\theta = \log 0.242 = -1.417$. 95% Wald CI for $\log \theta$: $-1.417 \pm 1.96 \cdot 0.651$; CI $[-2.693, -0.141]$. 95% Wald CI for $\theta$: $[e^{-2.693}, e^{-0.141}] = [0.068, 0.868]$.

c. For example, the likelihood ratio test:

$$G^2 = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} n_{ij} \log \frac{n_{ij}}{\hat\mu_{ij}},$$

where $\hat\mu_{ij} = n_{i+} n_{+j} / n$ is the ML estimate in the independence model.

$$G^2 = 2\,\bigl(4 \log(4/7.85) + 11 \log(11/7.15) + 30 \log(30/26.15) + 20 \log(20/23.85)\bigr) = 5.28.$$

The full model has 4 parameters, the independence model 3 parameters. Therefore the asymptotic distribution of $G^2$ is the $\chi^2$-distribution with 1 (= 4 - 3) degree of freedom. The null hypothesis of independence can be rejected at the 5% level by the two-sided test, since $5.28 > 3.84$.

The problem should state one-sided instead of two-sided. For the one-sided test at the 5% level, one has to compare $G^2 = 5.28$ with $\chi^2_{1;0.90} = 2.71$ and check the direction of the effect. Since $\hat\theta = 0.242 < 1$ and $G^2 > \chi^2_{1;0.90}$, the one-sided test of independence against the alternative of a hypertension-preventing effect of aspirin rejects $H_0$.

Shorter alternative to the LR test: Wald test (here: one-sided):

$$\frac{\log \hat\theta}{SE} = \frac{-1.42}{0.651} = -2.18 < -1.64.$$

Therefore the Wald test at the 5% level rejects the null hypothesis.

Problem 3

a. $P(Y_i = y_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}$, $y_i \in \{0, 1\}$, and $p_i = \alpha + \beta x_i$, $i = 1, \dots, n$. The likelihood function is

$$\prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}.$$

The loglikelihood is

$$L(\alpha, \beta) = \sum_{i=1}^{n} \bigl[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \bigr] = \sum_{i=1}^{n} \bigl[ y_i \log(\alpha + \beta x_i) + (1 - y_i) \log(1 - \alpha - \beta x_i) \bigr].$$
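As a numerical check of the Problem 2 figures, the following sketch recomputes the odds ratio, the Wald interval, $G^2$ and the Wald statistic from the four counts. It is an illustration added here, not part of the original solutions; all variable names are made up for the example.

```python
from math import exp, log, sqrt

n = [[4, 11], [30, 20]]  # counts n_ij as used in Problem 2

# Odds ratio and standard error of its logarithm.
theta_hat = n[0][0] * n[1][1] / (n[0][1] * n[1][0])
se = sqrt(sum(1 / count for row in n for count in row))

# 95% Wald interval on the log scale, then back-transformed.
log_lo = log(theta_hat) - 1.96 * se
log_hi = log(theta_hat) + 1.96 * se
ci_theta = (exp(log_lo), exp(log_hi))

# Fitted values under independence: mu_ij = n_i+ * n_+j / n.
row = [sum(r) for r in n]
col = [sum(c) for c in zip(*n)]
total = sum(row)
mu = [[row[i] * col[j] / total for j in range(2)] for i in range(2)]

# Likelihood ratio statistic and Wald statistic.
g2 = 2 * sum(n[i][j] * log(n[i][j] / mu[i][j])
             for i in range(2) for j in range(2))
z_wald = log(theta_hat) / se

print(round(theta_hat, 3), round(se, 3), round(g2, 2), round(z_wald, 2))
```

With unrounded fitted values, $G^2$ comes out at about 5.27 rather than 5.28; the difference stems from the two-decimal rounding of $\hat\mu_{ij}$ in the hand calculation and does not affect the test decision.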

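b. The likelihood equations are the derivatives of the loglikelihood with respect to $\alpha$ and $\beta$, set to 0. Derivatives:

$$\frac{\partial L(\alpha, \beta)}{\partial \alpha} = \sum_{i=1}^{n} \left[ \frac{y_i}{\alpha + \beta x_i} - \frac{1 - y_i}{1 - \alpha - \beta x_i} \right] = \sum_{i=1}^{n} \frac{y_i - \alpha y_i - \beta x_i y_i - \alpha - \beta x_i + y_i \alpha + y_i \beta x_i}{(\alpha + \beta x_i)(1 - \alpha - \beta x_i)} = \sum_{i=1}^{n} \frac{y_i - \alpha - \beta x_i}{(\alpha + \beta x_i)(1 - \alpha - \beta x_i)},$$

$$\frac{\partial L(\alpha, \beta)}{\partial \beta} = \sum_{i=1}^{n} \left[ \frac{y_i x_i}{\alpha + \beta x_i} - \frac{(1 - y_i) x_i}{1 - \alpha - \beta x_i} \right] = \sum_{i=1}^{n} \frac{x_i (y_i - \alpha - \beta x_i)}{(\alpha + \beta x_i)(1 - \alpha - \beta x_i)}.$$

The likelihood equations are then $\partial L(\alpha, \beta)/\partial \alpha = 0$ and $\partial L(\alpha, \beta)/\partial \beta = 0$. Their solution (e.g. obtained numerically with an algorithm) is the ML estimate for $\alpha$ and $\beta$.

c. Asymptotic variance matrix for the parameter estimates: The vector of ML estimates $(\hat\alpha, \hat\beta)$ is asymptotically normally distributed with mean $(\alpha, \beta)$ and covariance matrix $I(\alpha, \beta)^{-1}$, where $I(\alpha, \beta)$ is the information matrix

$$I(\alpha, \beta) = \begin{pmatrix} -E\,\dfrac{\partial^2 L(\alpha, \beta)}{\partial \alpha^2} & -E\,\dfrac{\partial^2 L(\alpha, \beta)}{\partial \alpha\, \partial \beta} \\[2ex] -E\,\dfrac{\partial^2 L(\alpha, \beta)}{\partial \alpha\, \partial \beta} & -E\,\dfrac{\partial^2 L(\alpha, \beta)}{\partial \beta^2} \end{pmatrix}.$$

The information matrix, and therefore the asymptotic covariance matrix, contains the unknown parameters $(\alpha, \beta)$. Replacing them by their ML estimates $(\hat\alpha, \hat\beta)$ leads to an estimated covariance matrix which can be used for constructing CIs.

d. Advantage: Results from the analysis are on the probability scale and hence directly interpretable in terms of probabilities or differences of probabilities without further transformation. Disadvantage: Estimated/predicted probabilities are not necessarily between 0 and 1.

Problem 4

a. Maximum likelihood estimates: $\hat\alpha = 4.0343$ and $\hat\beta_W = -0.0042$. Further, $e^{\hat\beta_W} = 0.996$. With each additional gram of weight, the odds of developing BPD are reduced by 0.4% (1 - 0.996). (A better interpretation: with each additional 100 grams of weight, the odds of BPD decrease by 34%, since $e^{100 \hat\beta_W} = 0.66$.)

b. Let $\pi_0$ be the proportion of interest.

$$\operatorname{logit}(\hat\pi_0) = \hat\alpha + \hat\beta_W \cdot 1000 = 4.0343 - 4.2 \approx -0.2, \qquad \hat\pi_0 = \frac{e^{-0.2}}{1 + e^{-0.2}} = 0.45.$$

The predicted probability that a newborn baby weighing 1000 grams develops BPD by day 28 of life is 45%.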
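The interpretation steps in Problem 4a and 4b amount to exponentiating the slope and applying the inverse logit. A minimal sketch, using only the two estimates quoted above (the helper `inv_logit` and the variable names are illustrative, not from the original):

```python
from math import exp

alpha_hat, beta_w = 4.0343, -0.0042  # estimates quoted in Problem 4a


def inv_logit(eta):
    """Map a linear predictor (log odds) to a probability."""
    return 1 / (1 + exp(-eta))


odds_factor_per_gram = exp(beta_w)        # multiplicative change in odds per gram
odds_factor_per_100g = exp(100 * beta_w)  # multiplicative change per 100 grams
eta_1000 = alpha_hat + beta_w * 1000      # logit at a birth weight of 1000 g
pi_1000 = inv_logit(eta_1000)             # predicted probability of BPD

print(round(odds_factor_per_gram, 3), round(odds_factor_per_100g, 2),
      round(eta_1000, 2), round(pi_1000, 2))
```

Without the intermediate rounding of the logit to -0.2, the predicted probability is about 0.46 instead of 0.45; the solution's value follows from rounding the logit first.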

c. Variance of the estimated logit:

$$V(\operatorname{logit}(\hat\pi_0)) = V(\hat\alpha + \hat\beta_W \cdot 1000) = V(\hat\alpha) + 2\,\mathrm{Cov}(\hat\alpha, \hat\beta_W) \cdot 1000 + V(\hat\beta_W) \cdot 10^6 = 0.484 - 0.866 + 0.411 = 0.029.$$

95% Wald CI for $\operatorname{logit}(\pi_0)$: $-0.2 \pm 1.96 \sqrt{0.029}$; CI $[-0.53, 0.13]$. 95% Wald CI for $\pi_0$: $[e^{-0.53}/(1 + e^{-0.53}),\ e^{0.13}/(1 + e^{0.13})] = [0.37, 0.53]$.

d. Estimates: $\hat\beta_W = -0.0024$, $\hat\beta_A = -0.398$ and $e^{\hat\beta_W} = 0.9976$, $e^{\hat\beta_A} = 0.672$. With each additional gram of weight, the odds of developing BPD are reduced by 0.24% ($1 - e^{\hat\beta_W} = 0.0024$) if age is kept constant. With each additional week of age, the odds of developing BPD are reduced by 33% if weight is kept constant. The estimate for $\beta_W$ in model (W+A) is around half of that in model (W). Since weight and age are correlated, adding age as an explanatory variable reduces the estimate for $\beta_W$. However, weight is still important in this model (it is a significant variable in model (W+A)).

e. Definition of AIC: AIC = -2 (maximized loglikelihood - number of parameters in the model). Based on AIC, (W+A) is preferable to (W), since AIC = 215.2 for (W+A), which is smaller than AIC = 227.7 for (W).

LR test statistic: $-2(L_0 - L_1)$, where $L_i$ are the maximized loglikelihoods of the two models.

$$-2(L_0 - L_1) = -2(-111.86 + 104.61) \approx 14.46.$$

The test statistic is approximately $\chi^2$-distributed with 1 degree of freedom (1 = difference in the number of parameters between the two models). Since $14.46 > \chi^2_{1;0.95} = 3.84$, we can reject $H_0$ that (W) holds. Therefore it is justified to use (W+A).

Problem 5

Denote school by S, gender by G and age group by A, and the frequencies in the table by $n_{sga}$, $s = 1, 2$, $g = 1, 2$, $a = 1, 2, 3$.

a. Saturated loglinear model for the expected number of students:

$$\log \mu_{sga} = \lambda + \lambda^S_s + \lambda^G_g + \lambda^A_a + \lambda^{SG}_{sg} + \lambda^{SA}_{sa} + \lambda^{GA}_{ga} + \lambda^{SGA}_{sga}.$$

The parameters with at least one of $s = 2$, $g = 2$ or $a = 3$ are set to 0 to avoid over-parametrisation. The remaining parameters not set to 0 are: $\lambda$, $\lambda^S_1$, $\lambda^G_1$, $\lambda^A_1$, $\lambda^A_2$, $\lambda^{SG}_{11}$, $\lambda^{SA}_{11}$, $\lambda^{SA}_{12}$, $\lambda^{GA}_{11}$, $\lambda^{GA}_{12}$, $\lambda^{SGA}_{111}$, $\lambda^{SGA}_{112}$. These are 12 parameters (= number of cells in the 2x2x3 contingency table).

b. Age jointly independent of both gender and school. Model (SG, A):

$$\log \mu_{sga} = \lambda + \lambda^S_s + \lambda^G_g + \lambda^A_a + \lambda^{SG}_{sg}.$$
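The parameter counts for the two loglinear models (12 for the saturated model, fewer for (SG, A)) follow mechanically from the constraint that one level per factor is the reference level. The sketch below illustrates that counting rule; it is an added example with hypothetical helper names (`hierarchical_terms`, `n_params`), not part of the original solutions.

```python
from itertools import combinations
from math import prod

levels = {"S": 2, "G": 2, "A": 3}  # number of categories per factor


def hierarchical_terms(generators):
    """Intercept plus all lower-order terms implied by each generator."""
    terms = {()}  # the empty tuple stands for the intercept
    for gen in generators:
        for k in range(1, len(gen) + 1):
            terms.update(combinations(gen, k))
    return terms


def n_params(generators):
    """Free parameters when one reference level per factor is set to 0."""
    # A term over factors F contributes prod(levels[f] - 1 for f in F)
    # parameters; the empty product is 1, which counts the intercept.
    return sum(prod(levels[f] - 1 for f in term)
               for term in hierarchical_terms(generators))


saturated = n_params(["SGA"])   # model (SGA)
sg_a = n_params(["SG", "A"])    # model (SG, A)
print(saturated, sg_a, saturated - sg_a)
```

The difference between the two counts is the number of degrees of freedom of a likelihood ratio test comparing model (SG, A) with the saturated model.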

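c. In the saturated model, the ML estimate is the observation, i.e. $\hat\mu_{111} = n_{111} = 42$. In model (SG, A),

$$\hat\mu_{111} = \frac{n_{11+}\, n_{++1}}{n} = \frac{259 \cdot 184}{1373} = 34.7$$

(see also Table 5 in Appendix B).

d. Likelihood ratio test:

$$G^2 = 2 \sum_{s=1}^{2} \sum_{g=1}^{2} \sum_{a=1}^{3} n_{sga} \log(n_{sga} / \hat\mu_{sga}),$$

with $\hat\mu_{sga} = n_{sg+}\, n_{++a} / n$, which is available in Table 5. Consequently,

$$G^2 = 2 \left( 42 \log \frac{42}{34.7} + \dots \right) = 2 \cdot 9.34 = 18.68.$$

Asymptotically, $G^2$ is $\chi^2$-distributed with 6 = 12 - 6 degrees of freedom (the saturated model has 12 parameters; the (SG, A) model has 6 parameters). According to the table of the $\chi^2$-distribution, $G^2 > 12.59 = \chi^2_{6;0.95}$, and therefore the hypothesis that the (SG, A) model holds can be rejected. Age group can therefore not be considered jointly independent of school and gender.

e. Consider now $a = 1, 2$ only. Equivalent logistic regression models have age group as the dependent variable and school and gender as explanatory variables. For the loglinear model in a. (saturated model):

$$\begin{aligned}
\log \frac{P(A = 1)}{P(A = 2)} = \log \frac{\mu_{sg1}}{\mu_{sg2}}
&= \bigl(\lambda + \lambda^S_s + \lambda^G_g + \lambda^A_1 + \lambda^{SG}_{sg} + \lambda^{SA}_{s1} + \lambda^{GA}_{g1} + \lambda^{SGA}_{sg1}\bigr) \\
&\quad - \bigl(\lambda + \lambda^S_s + \lambda^G_g + \lambda^A_2 + \lambda^{SG}_{sg} + \lambda^{SA}_{s2} + \lambda^{GA}_{g2} + \lambda^{SGA}_{sg2}\bigr) \\
&= \underbrace{(\lambda^A_1 - \lambda^A_2)}_{\beta} + \underbrace{(\lambda^{SA}_{s1} - \lambda^{SA}_{s2})}_{\beta^S_s} + \underbrace{(\lambda^{GA}_{g1} - \lambda^{GA}_{g2})}_{\beta^G_g} + \underbrace{(\lambda^{SGA}_{sg1} - \lambda^{SGA}_{sg2})}_{\beta^{SG}_{sg}} \\
&= \beta + \beta^S_s + \beta^G_g + \beta^{SG}_{sg},
\end{aligned}$$

and for the model in b. (gender and school jointly independent of age):

$$\log \frac{P(A = 1)}{P(A = 2)} = \log \frac{\mu_{sg1}}{\mu_{sg2}} = \bigl(\lambda + \lambda^S_s + \lambda^G_g + \lambda^A_1 + \lambda^{SG}_{sg}\bigr) - \bigl(\lambda + \lambda^S_s + \lambda^G_g + \lambda^A_2 + \lambda^{SG}_{sg}\bigr) = \underbrace{(\lambda^A_1 - \lambda^A_2)}_{\beta} = \beta.$$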
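The fitted value in Problem 5c, the test decision in 5d and the constant logit in 5e can be illustrated with a few lines of Python. This sketch is an addition, not part of the original solutions: the counts and test quantities are the ones quoted above, while the `lam_*` values are arbitrary hypothetical parameters chosen only to demonstrate the algebra of 5e.

```python
from math import exp, log

# Problem 5c: fitted value for cell (1, 1, 1) under model (SG, A).
n_111, n_11p, n_pp1, n_total = 42, 259, 184, 1373  # counts quoted in 5c
mu_111 = n_11p * n_pp1 / n_total

# Problem 5d: decision using the quoted statistic and critical value.
g2, df, crit = 18.68, 12 - 6, 12.59
reject = g2 > crit

# Problem 5e: under (SG, A) the logit of A=1 versus A=2 does not depend on
# school or gender. All lam_* values below are hypothetical illustrations.
lam = 1.0
lam_S = {1: 0.3, 2: 0.0}
lam_G = {1: -0.2, 2: 0.0}
lam_A = {1: 0.5, 2: 0.1, 3: 0.0}
lam_SG = {(1, 1): 0.4, (1, 2): 0.0, (2, 1): 0.0, (2, 2): 0.0}


def mu(s, g, a):
    """Expected count under the loglinear model (SG, A)."""
    return exp(lam + lam_S[s] + lam_G[g] + lam_A[a] + lam_SG[(s, g)])


logits = {(s, g): log(mu(s, g, 1) / mu(s, g, 2))
          for s in (1, 2) for g in (1, 2)}

print(round(mu_111, 1), df, reject)
print({cell: round(value, 6) for cell, value in logits.items()})
```

Every entry of `logits` equals `lam_A[1] - lam_A[2]`, matching the intercept-only logistic model with $\beta = \lambda^A_1 - \lambda^A_2$ derived for the model in b.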