NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION. ST3241 Categorical Data Analysis. (Semester II: 2010/2011) April/May, 2011. Time Allowed: 2 Hours


NATIONAL UNIVERSITY OF SINGAPORE

EXAMINATION

ST3241 Categorical Data Analysis

(Semester II: 2010/2011)

April/May, 2011        Time Allowed: 2 Hours

Matriculation No:              Seat No:

Grade Table

    Question        1    2    3    4    5    6    Total
    Full marks     10   28   15   20   15   12     100
    Earned marks

INSTRUCTIONS TO CANDIDATES

1. This examination paper contains SIX (6) questions and comprises TWELVE (12) printed pages.
2. Answer ALL the questions for TOTAL 100 marks.
3. Read the questions CAREFULLY.
4. All NOTATIONS used here are the same as those used in the lecture notes.
5. Write your answers NEATLY following the associated questions.
6. This is a Closed textbook, Closed notes examination but calculators are allowed.
7. Candidates may bring in TWO A4 size (210 × 297 mm) help sheets.

1. [10 pts, each 1 pt] Circle T or F for each of the statements.

(1) [T, F] To test for independence in two-way contingency tables, likelihood ratio tests and Pearson's $\chi^2$ tests are equivalent for small sample sizes.

(2) [T, F] Fisher's exact test uses the negative binomial distribution to compute p-values.

(3) [T, F] Diagnosis of type of mental illness (schizophrenia, neurosis, depression) is an ordinal variable.

(4) [T, F] If the odds of success in a binary response is 0.5, the probability of success is 0.25.

(5) [T, F] Suppose that $P(Y_i = 1) = 1 - P(Y_i = 0) = 0.2$, $i = 1, \ldots, n$, where the $Y_i$'s are independent. Let $Y = \sum_{i=1}^{50} Y_i$. Then the distribution of $Y$ is Binomial with mean 10.

(6) [T, F] Test of independence for a linear trend alternative cannot be used for nominal categorical data.

(7) [T, F] In a logistic regression model, $\operatorname{logit}[\pi(x)] = \alpha + \beta x$, $e^{\alpha}$ equals the odds of success when $x = 1$.

(8) [T, F] In a logit model $\operatorname{logit}[\pi(x)] = \alpha + \beta x$, the probability increases at the rate of $0.16\beta$ when $\pi(x) = 0.4$.

(9) [T, F] A classical linear regression model with errors having normal distribution is a special case of generalized linear model with probit link.

(10) [T, F] Fitting a saturated model often results in nonzero residual deviance.

2. [28 pts] For a study using logistic regression to examine the data on rheumatoid arthritis, we consider the age of the patient as the predictor variable. The response Y measured whether the patient showed any improvement at all (1 = yes). The following computer output is for a logistic regression model using age to predict the probability of improvement.

    Model Fit Statistics

                    Intercept    Intercept and
    Criterion       Only         Covariates
    -2 Log L        116.449      109.164

                                Standard    Wald
    Parameter   DF   Estimate   Error       Chi-Square   Pr > ChiSq
    Intercept    1   -2.6421    1.0732      6.0611       0.0138
    age          1    0.0492    0.0194      6.4733       0.0110

    Odds Ratio Estimates

              Point       95% Wald
    Effect    Estimate    Confidence Limits
    age       1.050       1.011    1.091

    Estimated Covariance Matrix

    Parameter    Intercept     age
    Intercept     1.15169     -0.02030
    age          -0.02030      0.00038

The 90, 95, 97.5 and 99.5-th percentiles of the standard normal distribution are 1.28, 1.645, 1.96, and 2.576 respectively.

(a) Find the rates of change in the predicted probability of improvement when age = 25 and when the estimated probability of improvement is 0.3, respectively. [8 pts]

(b) Find out the age at which the estimated probability of improvement is 0.3. [4 pts]

(c) Obtain a 95% confidence interval for the true odds ratio of improvement for a half-year increase in age. [6 pts]

(d) Obtain a 95% confidence interval for the probability of improvement at age = 25. [10 pts]
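As a way of checking hand calculations for Question 2, the quantities in parts (a) to (d) can be sketched in base R from the numbers in the output alone. This is only an illustrative sketch: the variable names (`alpha_hat`, `beta_hat`, `V`, and so on) are assumptions made here and are not part of the exam paper, and the intervals are ordinary Wald intervals built on the logit scale.

```r
# Values copied from the Question 2 output (estimates, standard error, covariance matrix).
alpha_hat <- -2.6421
beta_hat  <- 0.0492
se_beta   <- 0.0194
V <- matrix(c( 1.15169, -0.02030,
              -0.02030,  0.00038), nrow = 2, byrow = TRUE)
z975 <- 1.96  # 97.5th percentile of the standard normal, as given in the question

# (a) Estimated probability and rate of change beta * pi * (1 - pi) at age 25.
eta25  <- alpha_hat + beta_hat * 25
pi25   <- plogis(eta25)
rate25 <- beta_hat * pi25 * (1 - pi25)

# (a) Rate of change when the estimated probability is 0.3.
rate03 <- beta_hat * 0.3 * (1 - 0.3)

# (b) Age at which the estimated probability is 0.3:
#     solve alpha + beta * age = logit(0.3).
age03 <- (qlogis(0.3) - alpha_hat) / beta_hat

# (c) 95% Wald interval for the odds ratio of a half-year increase in age.
or_half_ci <- exp(0.5 * beta_hat + c(-1, 1) * z975 * 0.5 * se_beta)

# (d) 95% Wald interval for pi(25): interval on the logit scale, then back-transform.
x25      <- c(1, 25)
se_eta25 <- sqrt(drop(t(x25) %*% V %*% x25))
pi25_ci  <- plogis(eta25 + c(-1, 1) * z975 * se_eta25)

print(round(c(pi25 = pi25, rate25 = rate25, rate03 = rate03, age03 = age03), 4))
print(round(or_half_ci, 4))
print(round(pi25_ci, 4))
```

The variance of the linear predictor at age 25 uses the full covariance matrix, including the negative covariance between intercept and slope; ignoring that term would overstate the width of the interval in part (d).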

3. [15 pts] The following table was taken from the 1991 General Social Survey.

                     Party Identification
    Race     Democrat   Independent   Republican   Total
    White    341        105           405          851
    Black    103        15            11           129
    Total    444        120           416          980

Final Examination Q3 R code and Output

    Racew  <- c(1, 1, 1, 0, 0, 0)   # White = 1; Black = 0
    PartyD <- c(1, 0, 0, 1, 0, 0)   # Democrat = 1; others 0
    PartyI <- c(0, 1, 0, 0, 1, 0)   # Independent = 1; others 0
    count  <- c(341, 105, 405, 103, 15, 11)
    RacewPartyD <- Racew * PartyD
    RacewPartyI <- Racew * PartyI
    fit <- glm(count ~ Racew + PartyD + RacewPartyD + RacewPartyI,
               family = poisson(link = "log"))
    summary(fit)

    #### R output
    Call:
    glm(formula = count ~ Racew + PartyD + RacewPartyD + RacewPartyI,
        family = poisson(link = "log"))

    Coefficients:
                 Estimate   Std. Error   z value   Pr(>|z|)
    (Intercept)    2.5649       0.1961    13.079     <2e-16 ***
    Racew          3.4389       0.2023    16.998     <2e-16 ***
    PartyD         2.0698       0.2195     9.430     <2e-16 ***
    RacewPartyD   -2.2418       0.2315    -9.686     <2e-16 ***
    RacewPartyI   -1.3499       0.1095   -12.327     <2e-16 ***

    Null deviance: 918.8
    Residual deviance: 0.61784

Let X and Y denote the race and party respectively. The 95-th percentiles of the $\chi^2$ distribution with 1, 2, 3, 4, 5, 6 degrees of freedom are 3.841, 5.99, 7.81, 9.49, 11.07, and 12.59 respectively. Based on the R code and R output,

(a) Write down the loglinear regression model and identify the associated estimates. [3 pts]

(b) Compute all the estimated cell counts. [6 pts]

(c) Comment on whether the intercept-only model fits the data well. [3 pts]

(d) Comment on whether the loglinear model fits the data well. [3 pts]
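The Question 3 fit can be reproduced in base R to check the estimated cell counts and the two deviance comparisons. The sketch below is illustrative only: the data frame name `party` and the helper variables are assumptions introduced here, and the goodness-of-fit check simply compares each deviance with the 95th percentile of the $\chi^2$ distribution on the matching degrees of freedom.

```r
# Data from the Question 3 table, ordered White D/I/R, then Black D/I/R.
party <- data.frame(
  Racew  = c(1, 1, 1, 0, 0, 0),   # White = 1; Black = 0
  PartyD = c(1, 0, 0, 1, 0, 0),   # Democrat = 1; others 0
  PartyI = c(0, 1, 0, 0, 1, 0),   # Independent = 1; others 0
  count  = c(341, 105, 405, 103, 15, 11)
)
party$RacewPartyD <- party$Racew * party$PartyD
party$RacewPartyI <- party$Racew * party$PartyI

fit <- glm(count ~ Racew + PartyD + RacewPartyD + RacewPartyI,
           family = poisson(link = "log"), data = party)

# Estimated cell counts: exp() of the linear predictor for each cell.
fitted_counts <- fitted(fit)

# Residual deviance tests the fitted model against the saturated model;
# null deviance does the same for the intercept-only model.
res_dev  <- deviance(fit)
res_df   <- df.residual(fit)
null_dev <- fit$null.deviance
null_df  <- fit$df.null

print(round(fitted_counts, 2))
print(c(res_dev = res_dev, res_df = res_df, crit = qchisq(0.95, res_df)))
print(c(null_dev = null_dev, null_df = null_df, crit = qchisq(0.95, null_df)))
```

Because the model has no PartyI main effect, the two remaining cells for Black respondents (Independent and Republican) share one fitted value, which is where the single residual degree of freedom comes from.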

4. [20 pts] The following table is taken from Lecture 8.

    Alcohol, Cigarette and Marijuana Use for High School Seniors

    Alcohol    Cigarette    Marijuana Use
    Use        Use          Yes      No
    Yes        Yes          911      538
               No            44      456
    No         Yes            3       43
               No             2      279

    ## Final Examination Q4 R code and output
    A <- c(1, 1, 1, 1, 0, 0, 0, 0)   ## 1 = Alcohol use; 0 = otherwise
    C <- c(1, 1, 0, 0, 1, 1, 0, 0)   ## 1 = Cigarette use; 0 = otherwise
    M <- c(1, 0, 1, 0, 1, 0, 1, 0)   ## 1 = Marijuana use; 0 = otherwise
    count <- c(911, 538, 44, 456, 3, 43, 2, 279)
    AC <- A * C
    AM <- A * M
    CM <- C * M
    ACM <- A * C * M
    ## Model (AM, CM, AC) fit
    drug.log <- glm(count ~ A + C + M + AM + CM + AC, family = poisson(link = "log"))
    summary(drug.log)

    ## output
    Call:
    glm(formula = count ~ A + C + M + AM + CM + AC, family = poisson(link = "log"))

    Coefficients:
                 Estimate   Std. Error   z value   Pr(>|z|)
    (Intercept)   5.63342      0.05970    94.361    < 2e-16 ***
    A             0.48772      0.07577     6.437   1.22e-10 ***
    C            -1.88667      0.16270   -11.596    < 2e-16 ***
    M            -5.30904      0.47520   -11.172    < 2e-16 ***
    AM            2.98601      0.46468     6.426   1.31e-10 ***
    CM            2.84789      0.16384    17.382    < 2e-16 ***
    AC            2.05453      0.17406    11.803    < 2e-16 ***

    Null deviance: 2851.46098
    Residual deviance: 0.37399

    ## Estimated covariance matrix between AM and CM
                  AM              CM
    AM     0.215925578    -0.004968391
    CM    -0.004968391     0.026843349

Let X, Y and Z denote the variables Alcohol, Cigarette and Marijuana use respectively. The 90, 95, 97.5 and 99.5-th percentiles of the standard normal distribution are 1.28, 1.645, 1.96, and 2.576 respectively. Based on the R code and R output,

(a) Write down the loglinear regression model and identify the associated estimates. [5 pts]

(b) Compute the estimated odds ratio between any two variables of Alcohol, Cigarette, and Marijuana use controlling for the third variable. [3 pts]

(c) Construct the 95% confidence interval for the true odds ratio between Alcohol and Cigarette use controlling for Marijuana use. [4 pts]

(d) Test if the true odds ratio between Alcohol and Marijuana use controlling for Cigarette use equals the true odds ratio between Cigarette and Marijuana use controlling for Alcohol use at α = 5%. [8 pts]
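For Question 4, the conditional odds ratios, the Wald interval in (c) and the comparison in (d) can all be read off the two-factor coefficients of the (AM, CM, AC) model. The base R sketch below refits that model as a check on hand calculations. The data frame `drug` and the other variable names are assumptions made for illustration, and part (d) is treated here as a Wald z test of AM = CM, which is one reasonable approach rather than necessarily the intended one.

```r
# Data from the Question 4 table, in the same cell order as the exam code.
drug <- data.frame(
  A = c(1, 1, 1, 1, 0, 0, 0, 0),   # 1 = alcohol use
  C = c(1, 1, 0, 0, 1, 1, 0, 0),   # 1 = cigarette use
  M = c(1, 0, 1, 0, 1, 0, 1, 0),   # 1 = marijuana use
  count = c(911, 538, 44, 456, 3, 43, 2, 279)
)
drug$AM <- drug$A * drug$M
drug$CM <- drug$C * drug$M
drug$AC <- drug$A * drug$C

drug_fit <- glm(count ~ A + C + M + AM + CM + AC,
                family = poisson(link = "log"), data = drug)
b <- coef(drug_fit)
V <- vcov(drug_fit)

# (b) Conditional odds ratios: exp() of each two-factor coefficient.
cond_or <- exp(b[c("AM", "CM", "AC")])

# (c) 95% Wald interval for the AC conditional odds ratio.
ac_ci <- exp(unname(b["AC"]) + c(-1, 1) * 1.96 * sqrt(V["AC", "AC"]))

# (d) Wald test of H0: AM = CM, using
#     Var(AM - CM) = Var(AM) + Var(CM) - 2 * Cov(AM, CM).
diff_est <- unname(b["AM"] - b["CM"])
diff_se  <- sqrt(V["AM", "AM"] + V["CM", "CM"] - 2 * V["AM", "CM"])
z_stat   <- diff_est / diff_se

print(round(cond_or, 3))
print(round(ac_ci, 3))
print(round(c(diff = diff_est, se = diff_se, z = z_stat), 4))
```

The z statistic is compared with ±1.96 for a two-sided test at the 5% level; equal log odds ratios are equivalent to equal odds ratios, so the test can be carried out on the coefficient scale.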

5. [15 pts] Consider a three-way contingency table with categorical variables X having 2 categories, Y having 2 categories and Z having $K \ge 2$ categories. (Hint: To show "A if and only if B", you need to show both that A implies B and that B implies A.)

(a) Show that the loglinear model (XY, XZ, YZ) holds if and only if X and Y have homogeneous association controlling for Z. [9 pts]

(b) Show that the loglinear model (XZ, YZ) holds if and only if X and Y are conditionally independent controlling for Z. [6 pts]

6. [12 pts]

(a) Let $P(Y = 1) = 1 - P(Y = 0) = p$. For the population of subjects having $Y = j$, X has a probability density function

$$f_j(x) = \lambda_j \exp(-\lambda_j x), \quad x \ge 0, \quad j = 0, 1.$$

Show that $\pi(x) = P(Y = 1 \mid x)$ satisfies the logistic regression model with some $\alpha$ and $\beta$. [7 pts]

(b) For known $n \ge 2$, show that the negative binomial distribution with probability mass function

$$f(y \mid n, \mu) = \binom{y + n - 1}{n - 1} \left(\frac{n}{\mu + n}\right)^{n} \left(1 - \frac{n}{\mu + n}\right)^{y}, \quad y = 0, 1, 2, \ldots$$

belongs to the exponential family of distributions. Find out the natural parameter for this distribution. [5 pts]

- End of the Paper -