BIOS 625 Fall 2015 Homework Set 3 Solutions

Similar documents
Lecture 8: Summary Measures

ST3241 Categorical Data Analysis I Multicategory Logit Models. Logit Models For Nominal Responses

ST3241 Categorical Data Analysis I Two-way Contingency Tables. Odds Ratio and Tests of Independence

STAT 705: Analysis of Contingency Tables

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3

Sections 2.3, 2.4. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis 1 / 21

Chapter 2: Describing Contingency Tables - II

Ch 6: Multicategory Logit Models

Homework 1 Solutions

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

2 Describing Contingency Tables

Solution to Tutorial 7

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION. ST3241 Categorical Data Analysis. (Semester II: ) April/May, 2011 Time Allowed : 2 Hours

Cohen s s Kappa and Log-linear Models

Optimal exact tests for complex alternative hypotheses on cross tabulated data

ST3241 Categorical Data Analysis I Two-way Contingency Tables. 2 2 Tables, Relative Risks and Odds Ratios

Testing Independence

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

Ordinal Variables in 2 way Tables

Inference for Binomial Parameters

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

3 Way Tables Edpsy/Psych/Soc 589

Department of Mathematics The University of Toledo. Master of Science Degree Comprehensive Examination Applied Statistics.

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION (SOLUTIONS) ST3241 Categorical Data Analysis. (Semester II: )

Lecture 23. November 15, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University.

13.1 Categorical Data and the Multinomial Experiment

n y π y (1 π) n y +ylogπ +(n y)log(1 π).

STAT 7030: Categorical Data Analysis

Means or "expected" counts: j = 1 j = 2 i = 1 m11 m12 i = 2 m21 m22 True proportions: The odds that a sampled unit is in category 1 for variable 1 giv

Unit 9: Inferences for Proportions and Count Data

Multinomial Logistic Regression Models

Review of Multinomial Distribution If n trials are performed: in each trial there are J > 2 possible outcomes (categories) Multicategory Logit Models

Unit 9: Inferences for Proportions and Count Data

GROUPED DATA E.G. FOR SAMPLE OF RAW DATA (E.G. 4, 12, 7, 5, MEAN G x / n STANDARD DEVIATION MEDIAN AND QUARTILES STANDARD DEVIATION

Epidemiology Wonders of Biostatistics Chapter 11 (continued) - probability in a single population. John Koval

(Where does Ch. 7 on comparing 2 means or 2 proportions fit into this?)

Epidemiology Wonders of Biostatistics Chapter 13 - Effect Measures. John Koval

Exercise 7.4 [16 points]

Data are sometimes not compatible with the assumptions of parametric statistical tests (i.e. t-test, regression, ANOVA)

Sections 3.4, 3.5. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

Introduction to Statistical Data Analysis Lecture 7: The Chi-Square Distribution

ST3241 Categorical Data Analysis I Logistic Regression. An Introduction and Some Examples

Categorical Data Analysis Chapter 3

CDA Chapter 3 part II

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

ij i j m ij n ij m ij n i j Suppose we denote the row variable by X and the column variable by Y ; We can then re-write the above expression as

Simple logistic regression

The Multinomial Model

16.400/453J Human Factors Engineering. Design of Experiments II

BMI 541/699 Lecture 22

Section 4.6 Simple Linear Regression

Investigating Models with Two or Three Categories

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Homework 2: Simple Linear Regression

8 Nominal and Ordinal Logistic Regression

Log-linear Models for Contingency Tables

Binary Dependent Variables

Categorical data analysis Chapter 5

ESP 178 Applied Research Methods. 2/23: Quantitative Analysis

Chapter 5: Logistic Regression-I

1 Comparing two binomials

(c) Interpret the estimated effect of temperature on the odds of thermal distress.

Q30b Moyale Observed counts. The FREQ Procedure. Table 1 of type by response. Controlling for site=moyale. Improved (1+2) Same (3) Group only

Binary Logistic Regression

Three-Way Contingency Tables

CHAPTER 1: BINARY LOGIT MODEL

Discrete Multivariate Statistics

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

MA : Introductory Probability

Session 3 The proportional odds model and the Mann-Whitney test

15: CHI SQUARED TESTS

STAC51: Categorical data Analysis

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

Statistics 3858 : Contingency Tables

Homework 5: Answer Key. Plausible Model: E(y) = µt. The expected number of arrests arrests equals a constant times the number who attend the game.

STA6938-Logistic Regression Model

Review of Statistics 101

Lecture 41 Sections Mon, Apr 7, 2008

STP 226 EXAMPLE EXAM #3 INSTRUCTOR:

Good Confidence Intervals for Categorical Data Analyses. Alan Agresti

Inferences for Proportions and Count Data

In Class Review Exercises Vartanian: SW 540

Lecture 12: Effect modification, and confounding in logistic regression

STA Outlines of Solutions to Selected Homework Problems

Chapter 11: Models for Matched Pairs

1) Answer the following questions with one or two short sentences.

Statistics Handbook. All statistical tables were computed by the author.

Quantitative Analysis and Empirical Methods

Statistics of Contingency Tables - Extension to I x J. stat 557 Heike Hofmann

Module 4 Practice problem and Homework answers

Statistics for Managers Using Microsoft Excel

6 Single Sample Methods for a Location Parameter

Lecture 3.1 Basic Logistic LDA

Logistic regression: Miscellaneous topics

Statistics 135 Fall 2008 Final Exam

Epidemiology Principle of Biostatistics Chapter 14 - Dependent Samples and effect measures. John Koval

Analysis of Categorical Data Three-Way Contingency Table

IP WEIGHTING AND MARGINAL STRUCTURAL MODELS (CHAPTER 12) BIOS IPW and MSM

You can specify the response in the form of a single variable or in the form of a ratio of two variables denoted events/trials.

Section IX. Introduction to Logistic Regression for binary outcomes. Poisson regression

Transcription:

BIOS 65 Fall 015 Homework Set 3 Solutions 1. Agresti.0 Table.1 is from an early study on the death penalty in Florida. Analyze these data and show that Simpson s Paradox occurs. Death Penalty Victim's Race Defendant's Race Yes o White White 19 13 Black 11 5 Black White 0 9 Black 6 97 Let X, Y, Z denote defendant s race, death penalty, and victim s race, respectively. For white victims, the odds ratio of death penalty for white defendant to black is: XY Z W 19 5 0.68 1 1113 For black victims, the corresponding odds ratio is 0 1. XY Z B However, the marginal odds ratio is XY 19 149 1.18 1. 17 141 The result shows that Simpson s paradox occurs since the two conditional odds ratios are both less than 1 while the marginal odds ratio is greater than 1. In other words, we have the reversal in the association after controlling for the victim s race. OTE: Can also show this by calculating probabilities and then stratified probabilities.. Agresti.4 Table.14 cross classifies ob satisfaction by race. Determine whether the groups are stochastically ordered, and estimate the difference between the probability that ob satisfaction is higher for blacks than whites and the probability that ob satisfactions is higher for whites than blacks. Job Satisfaction Race Dissatisfied eutral Fairly Satisfied Very or Completely Satisfied Black 19 13 4 59 White 47 40 15 430 Sample conditional distributions are: 1/7

Black: (19/133, 13/133, 4/133, 59/133) White: (47/73, 40/73, 15/73, 40/73) i.e., Black: (0.14, 0.098, 0.316, 0.443) White: (0.064, 0.055, 0.94, 0.587) The groups are stochastically ordered, with the black group tending to have more dissatisfaction with their work. 59 47 40 15 4 47 40 13 47 19 40 15 430 13 15 430 4 430 133 73 0.1784 estimates the difference between the probability that the satisfaction level is higher for black than white groups and the probability that the satisfaction level is higher for white than black groups. 3. Agresti 3.11 Refer to Table 3.11, GSS data on party ID and race. Party Identification Race Democrat Independent Republican Black 19 75 8 White 459 586 471 a. Using X and G, test the hypothesis of independence between party identification and race. Report the p-values and interpret. X 177.3, df, p0.001 G 197.4, df, p0.001 Strong evidence of association. b. Use standardized residuals to describe the evidence of association. Table of race by party race party Frequency Std Residual Black Democrat 19 1.541 Independent 75-3.5986 Republican 8-9.7063 Total 75 White Democrat 459-1.541 Independent 586 3.5986 Republican 471 9.7063 /7

Table of race by party race party Frequency Std Residual Total 1516 Party identification leans towards democrat for blacks and republican for whites. c. Partition chi-squared into components regarding the choice between Democrat and Independent and between these two combined and Republican. Between Democrat and Independent: X df p 66.63, 1, 0.001 Between Democrat/Independent and Republican: X df p 94.1,, 0.001 4. Agresti 3.1 Using the 008 GSS, we cross-classified party ID with gender. Table 3.1 shows some results. Explain how to interpret all the results on this printout. The values X 8.9df, p 0.0158 and G 8.31df, p 0.0157 show considerable evidence against the hypothesis of independence. The standardized Pearson residuals show Female Democrats and Male Independents are much greater than expected under independence, and the number of Female Independents and Male Democrats is significantly less than expected under independence. 5. Agresti 3.16 A study on educational aspirations of high school students measured aspirations with the scale (some high school, high school graduate, some college, college graduate). The student counts in these categories were (9, 44, 13, 10) when family income was low, (11, 5, 3, ) when family income was middle, and (9, 41, 1, 7) when family income was high. a. Test the independence of educational aspirations and family income using Explain the deficiency of this test for these data. X 8.8709, df 6, p 0.181 G 8.9165, df 6, p 0.1783 X or Both tests indicate not reecting the null hypothesis. There is no significant association between family income level and student educational aspirations. Deficiency of the tests for these data: When Pearson or Likelihood ratio chi-square tests are used to test independence between X and Y, the methods themselves ignore ordinal information, so it is possible that the tests gives us a large p-value even when there is truly a trend between X and Y. b. Find the standardized residuals. Do they suggest any association pattern? G. 3/7

Table of income by education income education Frequency Std Residual low <HS 9 0.4061 HS 44 1.588 <College 13-0.186 College 10 -.1078 Total 76 middle <HS 11-0.1898 HS 5-0.5441 <College 3 1.304 College -0.403 Total 108 high <HS 9-0.1903 HS 41-0.9459 <College 1-1.374 College 7.4360 Total 89 The standardized residuals show higher income could be associated with higher educational aspirations. c. Conduct an alternative test that may be more powerful. Interpret. Considering the ordinal feature for these data, M would be appropriate to test for independence. Thus, we have M n r 1 73 1 0.131 4.75 with df=1. This shows strong evidence of association (p=0.09). Also you can report the Mantel-Haenszel Chi-square test from SAS. The test statistic is 4.7489 with df=1 and p=0.093, leading to a very similar result. 6. Agresti 3.19 A study in the Department of Wildlife Ecology at the University of Florida sampled wild common carp fish from a wetland in central Chile. One analysis investigated whether the fish muscle had lead pollutant and whether there was evident malformation in the fish. Of 5 fish without lead, 7 had malformation. Of 14 with lead, 7 had malformation. Report and interpret the p-value for Fisher s exact test for a one-sided alternative of a greater chance of malformation when there is lead pollutant. Sample SAS code looks like: data table; input lead$ malformation$ count @@; datalines; no yes 7 no no 18 4/7

yes yes 7 yes no 7 ; proc freq order=data; weight count; tables lead*malformation; exact fisher; run; Try to figure out the correct p-value. Fisher's Exact Test Cell (1,1) Frequency (F) 7 Left-sided Pr <= F 0.156 Right-sided Pr >= F 0.9568 Table Probability (P) 0.1094 Two-sided Pr <= P 0.966 7. Agresti 3.6 Using the delta method as in Section 3.1.6, show that the Wald confidence interval for the logit of a binomial parameter π is z n log 1 1. Explain how to use this interval to obtain one for π itself. For the binomial parameter π, we have its MLE ow based on the delta method, the estimated variance for g var g g var log var 1 1 1 1 1 1 n 1 n 1 and estimated variance log 1 is var 1 n Thus the 95% CI for the logit of a binomial parameter is: z n log 1 1. Based on the 95% CI for the logit of π, we could derive the 95% CI for ust π as follows: 5/7

1 exp y logit log logit y 1 1exp ote: the inverse-logit function is also called the expit function. Thus the 95% CI for π is: 1 logit log z n 1 y y exp log 1 z n 1 1 1explog z n 1 1 After some algebra, we could see that the Wald (1-α) CI for π is: 8. Agresti 3.34 For a x table, show that: z exp n 1, z z 1 exp 1 exp n 1 n 1 a. The four Pearson residuals may take different values. Suppose we have a x table, where the column total is labeled as n 1, ni i 1,. The overall total is n. and row total is The Pearson residuals for any of these 4 cells would be calculated as follows, indicating that it s likely that the Pearson residuals may take different values: e nin n n n n n n n i i i n n b. All four standardized residuals have the same absolute value. The absolute value of the standardized residual would be calculated like this: 6/7

r nin n 1 n i nn n n nin 1 1 1 1 1 pi p n i n n n i 1 1 nin ni' n ' nin ni' n ', the denominator includes all four marginal totals (two row totals and two column totals) and the overall totals for this x table. Therefore, the denominator would be the same for four r i 1, ; 1,, no matter which cell is chosen. Regarding the numerator, we can have the following: From the formula above, for every four r i 1, ; 1, ' ' nn n n n n n n n i i i i i nn nn n n n n i i ' i i i ' nn n n i ' i i ' n n n n n n i' i' ' ' i' nn n n i' ' ' i' The numerator is ust the absolute value of the cross product for this x table no matter which cell you choose. In other words, the numerator is the same when you calculate any of four absolute standardized residuals. Therefore, all four standardized residuals would have the same absolute values. c. The square of each standardized residual equals For the x table we have the formula: X X. n n n n n n n n n Also the square of each standardized residual is: r 11 1 1 1 1 nn i ' ' n ' ni ' n n n n 11 1 1 1 n 1 1 i n n n n n n i' n ' Therefore, we can see that the square of each standardized residual equals X. 7/7