Homework 1 Solutions


Ambrose Higgins

Problem 3.4

(a) The test statistics are $X^2$ and $G^2$. We should compare each to a $\chi^2$ distribution with $(2-1)(3-1) = 2$ degrees of freedom. For each, the p-value is so small that S-plus reports it as 0. We therefore reject the null hypothesis of independence between party identification and race.

(b) Under the null hypothesis, the standardized Pearson residuals have approximately a standard normal distribution. The magnitude of the standardized residuals below is therefore evidence against the null hypothesis of independence.

             Democrat   Independent   Republican
    Black
    White

(c) Using the subtable excluding Republicans, the test statistics for the choice between Democrat and Independent are $X_1^2$ and $G_1^2$. Using the subtable comparing Democrats and Independents as a group to Republicans, the test statistics are $X_2^2$ and $G_2^2$. Note that $G_1^2 + G_2^2 = G^2$ but $X_1^2 + X_2^2 \neq X^2$, as we can partition $G^2$ exactly but not $X^2$. Although both sets of tests lead us to reject the corresponding null hypotheses, the association between race and party is stronger when we compare Republicans to Democrats and Independents as a group than when we compare Democrats and Independents to each other.

(d) The MLE for the log odds ratio is the log of the sample odds ratio, whose denominator is $341 \times 11$. By the delta method, its asymptotic standard error is the square root of the sum of the reciprocals of the four cell counts involved. A 95% confidence interval for the log odds ratio is the estimate plus or minus 1.96 standard errors, $[1.7706, 3.0470]$, and so a 95% confidence interval for the odds ratio is $[e^{1.7706}, e^{3.0470}] = [5.87, 21.05]$. Because this interval does not contain 1, we reject the null hypothesis of independence between race and party affiliation.

Problem 3.5

The table shows actual and expected counts in each cell. The statistics reported are $X^2$ and $G^2$, each of which is compared to a $\chi^2$ distribution with $(2-1)(3-1) = 2$ degrees of freedom, giving p-values in each case of about 0.03. We therefore reject, at the 0.05 level, the hypothesis of independence between gender and party identification.

The residuals for each cell are given at the bottom of the output. The first column (Resraw) contains the raw residuals: the observed minus the expected counts. The next two columns are the Pearson and standardized Pearson residuals. The magnitude of the standardized residuals, compared with their standard normal null distribution, is further evidence against the null hypothesis of independence.

Problem 3.6

The table of counts used in the analysis was the following.

                     Advanced   Local
    Alone
    Spouse Others

This gives test statistics $X^2$ and $G^2$. Comparing each to a $\chi^2$ distribution with $(2-1)(2-1) = 1$ degree of freedom, we get a p-value for each statistic. Either could be rounded up to get the value of 0.02 reported in the article.

Problem 3.11

(a) The statistics for testing the null hypothesis of independence between family income and educational aspirations are $X^2$ and $G^2$. Comparing each to a $\chi^2$ distribution with $(3-1)(4-1) = 6$ degrees of freedom, we get p-values (0.178 for $G^2$), neither of which would lead us to reject the null hypothesis. However, both of the variables in the table are ordinal, and neither of these tests uses this information.

(b) The standardized Pearson residuals are below. Roughly speaking, as income increases, so do aspirations.

              Some high school   High school grad   Some college   College grad
    Low
    Middle
    High

(c) To use the information about the ordering, I assigned values of 1 through 4 to the aspiration categories and 1 through 3 to the income categories. The correlation between income and aspirations is then $r = 0.132$, and the test statistic is $M^2 = (n-1)r^2$. Comparing this to a $\chi^2$ distribution with 1 degree of freedom, we get a p-value of 0.029, which leads us to reject the null hypothesis of independence.
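The Wald interval of Problem 3.4(d) and the linear trend statistic of Problem 3.11(c) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the counts 103 and 405 are assumptions; together with the 341 and 11 that appear in part (d), they reproduce the reported interval [1.7706, 3.0470]. The 1-df tail probability uses the identity $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$.

```python
import math

def log_odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Wald interval for the log odds ratio of a 2x2 table (delta method)."""
    log_or = math.log((n11 * n22) / (n12 * n21))
    # Delta-method standard error: root of the summed reciprocal counts.
    se = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    return log_or, se, (log_or - z * se, log_or + z * se)

def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

def linear_trend_test(r, n):
    """M^2 = (n - 1) r^2 and its 1-df chi-square p-value."""
    m2 = (n - 1) * r ** 2
    return m2, chi2_sf_1df(m2)

# Assumed counts (hypothetical reconstruction, see the note above).
log_or, se, (lo, hi) = log_odds_ratio_ci(103, 11, 341, 405)
print("log odds ratio CI:", lo, hi)
print("odds ratio CI:", math.exp(lo), math.exp(hi))
```

Exponentiating the endpoints, as in the last line, is what turns the interval for the log odds ratio into the interval for the odds ratio itself.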

Problem 3.12

Using the delta method (Problem 3.27) to obtain a standard error for the point estimate of gamma gives a 95% confidence interval of $[0.315, 0.459]$. Please see the Appendix for calculations in S-plus. Because this interval does not contain zero, we reject the null hypothesis of independence between schooling and attitude toward abortion.

Problem 3.22

Suppose $Y \sim \mathrm{Bin}(n, \pi)$. The MLE for $\pi$ is $\hat\pi = y/n \sim AN(\pi, \pi(1-\pi)/n)$. We apply the delta method to $g(x) = \log(x/(1-x))$. The first derivative of $g$ evaluated at $\pi$ is $g'(\pi) = 1/(\pi(1-\pi))$, so using $g(\hat\pi) \sim AN(g(\pi), g'(\pi)^2 \pi(1-\pi)/n)$, we have

\[ \log\frac{\hat\pi}{1-\hat\pi} \sim AN\left(\log\frac{\pi}{1-\pi},\ \frac{1}{n\pi(1-\pi)}\right). \]

This gives a Wald confidence interval of

\[ \log\frac{\hat\pi}{1-\hat\pi} \pm \frac{z_{\alpha/2}}{\sqrt{n\hat\pi(1-\hat\pi)}}, \]

where the unknown $\pi$ in the variance is replaced by $\hat\pi$. Because $g$ is monotone, applying $g^{-1}(y) = e^y/(1+e^y)$ to each endpoint of this interval gives a confidence interval for $\pi$ itself.

Problem 3.32

(a) There are many counterexamples here. For instance, for the $2 \times 2$ table

          j1   j2
    i1     1    2
    i2     3    4

the expected counts are 1.2, 1.8, 2.8, and 4.2, so the Pearson residuals $(n_{ij} - \hat\mu_{ij})/\sqrt{\hat\mu_{ij}}$ are

            j1       j2
    i1   -0.183    0.149
    i2    0.120   -0.098

which do not all have the same absolute value.

(b) Using the usual notation for $2 \times 2$ tables, each standardized residual is

\[ e_{ij} = \frac{n_{ij} - \hat\mu_{ij}}{\sqrt{\hat\mu_{ij}(1-p_{i+})(1-p_{+j})}}. \]

Write $n_{i^C+}$ for the sum of the row not equal to $i$ and likewise $n_{+j^C}$ for the sum of the column not equal to $j$. Then the quantity under the square root in the denominator is

\[ \hat\mu_{ij}(1-p_{i+})(1-p_{+j}) = \frac{n_{i+}n_{+j}}{n}\left(1-\frac{n_{i+}}{n}\right)\left(1-\frac{n_{+j}}{n}\right) = \frac{n_{i+}n_{+j}}{n}\cdot\frac{n_{i^C+}}{n}\cdot\frac{n_{+j^C}}{n} = \frac{n_{1+}n_{2+}n_{+1}n_{+2}}{n^3}, \]

which doesn't depend on $i$ or $j$. So we need only show that the raw residuals $n_{ij} - \hat\mu_{ij}$ have the same absolute value. Calculating them separately for each $i$ and $j$, we have the following raw residuals, which all have the same absolute value.

    i   j   $n_{ij} - \hat\mu_{ij} = (n\,n_{ij} - n_{i+}n_{+j})/n$
    1   1   $(n_{11}n_{22} - n_{12}n_{21})/n$
    1   2   $(n_{12}n_{21} - n_{11}n_{22})/n$
    2   1   $(n_{12}n_{21} - n_{11}n_{22})/n$
    2   2   $(n_{11}n_{22} - n_{12}n_{21})/n$

(c) Using the fact given in the problem, that $X^2 = n(n_{11}n_{22} - n_{12}n_{21})^2/(n_{1+}n_{2+}n_{+1}n_{+2})$, it's clear that

\[ e_{ij}^2 = \frac{(n_{11}n_{22} - n_{12}n_{21})^2/n^2}{n_{1+}n_{2+}n_{+1}n_{+2}/n^3} = X^2. \]

Problem 4.9

(a) When creating my dummy variables, I chose to make light medium the default color category. The model is then $Y_i \sim \mathrm{Poisson}(\lambda_i)$, where

\[ \log(\lambda_i) = \beta_0 + \beta_1\,\mathrm{weight} + \beta_2 I_{\{\mathrm{color} = \mathrm{medium}\}} + \beta_3 I_{\{\mathrm{color} = \mathrm{dark\ medium}\}} + \beta_4 I_{\{\mathrm{color} = \mathrm{dark}\}}. \]

The parameter estimates are $\hat\beta_0$, $\hat\beta_1$, $\hat\beta_2$, $\hat\beta_3$, and $\hat\beta_4$. (If your dummy variables correspond to the first 3 color categories, your parameter estimates will differ, because the baseline category is then dark.)

The intercept doesn't have a meaningful interpretation; it corresponds to a light medium colored crab weighing zero kilograms. The coefficient for weight is interpreted as follows: for every increase of one kilogram, $\log(\lambda)$ increases by 0.5462, which means that the mean number of satellites is multiplied by $\exp(0.5462) \approx 1.73$. The coefficients for color should all be interpreted with reference to the baseline category. At the same weight, medium colored crabs have on average $\exp(\hat\beta_2)$ times as many satellites as light medium crabs, dark medium crabs have $\exp(\hat\beta_3)$ times as many, and dark crabs have $\exp(\hat\beta_4)$ times as many.

(b) The estimate of $E(Y)$ is computed from the fitted model separately for the light medium crab and for the dark crab.

(c) The difference in deviances between the full model and the model without color is compared to a $\chi^2$ distribution with 3 degrees of freedom (the difference in the number of parameters between the two models) to obtain a p-value. This is evidence against the null hypothesis that a model without color is sufficient.
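The likelihood-ratio comparison in part (c) refers a difference in deviances to a $\chi^2$ distribution whose degrees of freedom equal the difference in the number of parameters. A minimal Python sketch of that tail-probability calculation follows. The names `chi2_sf` and `lr_test` are hypothetical, and the code relies on closed-form expressions that hold only for positive integer degrees of freedom, not on a statistics library.

```python
import math

def chi2_sf(x, df):
    """P(chi-square with integer df > x), via closed forms for integer df."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df = 2m: exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!
        terms = (half ** j / math.factorial(j) for j in range(df // 2))
        return math.exp(-half) * sum(terms)
    # Odd df = 2m + 1:
    # erfc(sqrt(x/2)) + exp(-x/2) * sum_{j=1}^{m} (x/2)^(j - 1/2) / Gamma(j + 1/2)
    m = (df - 1) // 2
    tail = sum(half ** (j - 0.5) / math.gamma(j + 0.5) for j in range(1, m + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * tail

def lr_test(deviance_reduced, deviance_full, df_diff):
    """Likelihood-ratio statistic and p-value for two nested models."""
    stat = deviance_reduced - deviance_full
    return stat, chi2_sf(stat, df_diff)
```

For example, `chi2_sf(0.989, 2)` is about 0.61 and `chi2_sf(0.442, 1)` is about 0.51, in line with the p-values quoted for the deviance differences in parts (d) and (e) of this problem.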

(d) Treating color as quantitative, I assign values 1 through 4 to the color categories. The systematic component of the model is now $\log(\lambda_i) = \beta_0 + \beta_1\,\mathrm{weight} + \beta_2\,\mathrm{color}$, with parameter estimates $\hat\beta_0$, $\hat\beta_1$, and $\hat\beta_2$. Now the interpretation of the color effect is that for each increase of one color category away from light medium, the mean number of satellites is multiplied by $\exp(\hat\beta_2)$. Under this model, the estimate of $E(Y)$ is again computed for the light medium crab and for the dark crab.

Comparing this model to the one without a color effect, the difference in deviances is 8.073, which has a p-value of about 0.0045 when compared to a $\chi^2$ distribution with 1 degree of freedom. This is again evidence against the null hypothesis of no color effect.

Finally, comparing the fit of this model to the model using dummy variables for color (a more flexible model), the difference in deviances is 0.989, which has a p-value of 0.61 when compared to a $\chi^2$ distribution with 2 degrees of freedom. So we fail to reject the null hypothesis that the simpler, linear model provides an adequate fit. (Note that this model is a special case of the previous model in which the differences between adjacent categories are constrained to be equal. Therefore the null $\chi^2$ distribution of the test statistic is valid.)

(e) The output for the coefficients in this model is given below. The effect of width does not appear to be significant when weight is already in the model (small t value).

                   Value   Std. Error   t value
    (Intercept)
    weight
    color
    width

The likelihood ratio test confirms this: the difference in deviances for this model and the model without width is 0.442, which has a p-value of about 0.5 when compared to a $\chi^2$ distribution with 1 degree of freedom. Therefore we fail to reject the null hypothesis that a model with weight and color only is adequate.

Problem 4.19

Suppose $Y_i \sim \mathrm{Bernoulli}(\pi(x_i))$ for $x_i \in \{0, 1\}$, with $\log \pi(x_i) = \alpha + \beta x_i$. Then $\log \pi(0) = \alpha$ and $\log \pi(1) = \alpha + \beta$, so solving for $\beta$ we have

\[ \beta = \log \pi(1) - \log \pi(0) = \log\left(\frac{\pi(1)}{\pi(0)}\right), \]

and $\pi(1)/\pi(0)$ is the relative risk, so $\beta$ is the log relative risk. This link function is not often used because if $\alpha + \beta > 0$, then $\pi(1) = \exp(\alpha + \beta) > 1$, which is not a valid probability.

Problem

Suppose $n_i Y_i \sim \mathrm{Binomial}(n_i, \pi_i)$, and $\pi_i = \Phi(\sum_j \beta_j x_{ij})$. Define $\eta_i = \sum_j \beta_j x_{ij}$ and $\mu_i = E[Y_i] = \pi_i = \Phi(\eta_i)$. Then $\partial\mu_i/\partial\eta_i = \phi(\eta_i)$, where $\phi(u) = \partial\Phi(u)/\partial u$. The variance of $Y_i$ is the variance of $n_i Y_i$ divided by $n_i^2$, that is, $\pi_i(1-\pi_i)/n_i$, which gives

\[ w_i = \frac{(\partial\mu_i/\partial\eta_i)^2}{\mathrm{Var}(Y_i)} = \frac{n_i\,\phi^2(\eta_i)}{\Phi(\eta_i)[1 - \Phi(\eta_i)]}. \]

So $\widehat{\mathrm{Cov}}(\hat\beta) = (X'\hat{W}X)^{-1}$, where $\hat{W}$ is a diagonal matrix whose diagonal elements are equal to the $w_i$ evaluated at $\hat\beta$.

In the case of logistic regression, $\Phi(u) = e^u/(1+e^u)$, so $\phi(u) = e^u/(1+e^u)^2$ and

\[ \phi(\eta_i) = \frac{e^{\eta_i}}{(1+e^{\eta_i})^2} = \frac{e^{\eta_i}}{1+e^{\eta_i}}\cdot\frac{1}{1+e^{\eta_i}} = \pi_i(1-\pi_i). \]

Finally,

\[ w_i = \frac{n_i[\pi_i(1-\pi_i)]^2}{\Phi(\eta_i)[1-\Phi(\eta_i)]} = \frac{n_i[\pi_i(1-\pi_i)]^2}{\pi_i(1-\pi_i)} = n_i\pi_i(1-\pi_i). \]

Problem 4.27

(a) Suppose $Y_{ij}$ are independent Poisson with $E(Y_{ij}) = \mu_i$ for $i = 1, \ldots, I$ and $j = 1, \ldots, n_i$. Then the log-likelihood is

\[ L(\mu_1, \mu_2, \ldots, \mu_I) = \sum_{i=1}^{I}\sum_{j=1}^{n_i}\left[-\mu_i + y_{ij}\log\mu_i - \log(y_{ij}!)\right] = \sum_{i=1}^{I}\left[-n_i\mu_i + \Big(\sum_{j=1}^{n_i} y_{ij}\Big)\log\mu_i - \sum_{j=1}^{n_i}\log(y_{ij}!)\right], \]

so $\partial L/\partial\mu_i = -n_i + \sum_j y_{ij}/\mu_i$, and setting this equal to zero we get $\hat\mu_i = \bar{y}_i = \sum_j y_{ij}/n_i$ as the MLE. (Note that the second derivative, $-\sum_j y_{ij}/\mu_i^2$, is negative at $\hat\mu_i$, so this is indeed a maximum.)

(b) Note that

\[ L(\hat\mu; y) = \sum_{i=1}^{I}\sum_{j=1}^{n_i}\left[-\bar{y}_i + y_{ij}\log\bar{y}_i - \log(y_{ij}!)\right], \qquad L(y; y) = \sum_{i=1}^{I}\sum_{j=1}^{n_i}\left[-y_{ij} + y_{ij}\log y_{ij} - \log(y_{ij}!)\right], \]

and that

\[ \sum_{i=1}^{I}\sum_{j=1}^{n_i}\bar{y}_i = \sum_{i=1}^{I} n_i\bar{y}_i = \sum_{i=1}^{I}\sum_{j=1}^{n_i} y_{ij}, \]

so

\[ -2\left(L(\hat\mu; y) - L(y; y)\right) = -2\sum_{i=1}^{I}\sum_{j=1}^{n_i} y_{ij}\log\left(\frac{\bar{y}_i}{y_{ij}}\right) = 2\sum_{i=1}^{I}\sum_{j=1}^{n_i} y_{ij}\log\left(\frac{y_{ij}}{\bar{y}_i}\right). \]
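The result of Problem 4.27 can be checked numerically. The sketch below, with hypothetical function names and made-up counts, computes the group means $\hat\mu_i = \bar{y}_i$ and the deviance $2\sum_i\sum_j y_{ij}\log(y_{ij}/\bar{y}_i)$, then compares it with $-2(L(\hat\mu; y) - L(y; y))$ evaluated directly from the Poisson log-likelihood. Terms with $y_{ij} = 0$ are taken to contribute zero, by the convention $0 \log 0 = 0$.

```python
import math

def poisson_loglik(ys, mus):
    """Poisson log-likelihood: sum of -mu + y*log(mu) - log(y!), with 0*log(0) = 0."""
    total = 0.0
    for y, mu in zip(ys, mus):
        log_term = y * math.log(mu) if y > 0 else 0.0
        total += -mu + log_term - math.lgamma(y + 1)
    return total

def group_means(groups):
    """MLE of each group's Poisson mean: the sample mean of that group."""
    return [sum(g) / len(g) for g in groups]

def deviance_closed_form(groups):
    """2 * sum_i sum_j y_ij * log(y_ij / ybar_i)."""
    dev = 0.0
    for g, ybar in zip(groups, group_means(groups)):
        dev += sum(2.0 * y * math.log(y / ybar) for y in g if y > 0)
    return dev

def deviance_from_loglik(groups):
    """-2 * (L(mu_hat; y) - L(y; y)), computed from the log-likelihood itself."""
    ys = [y for g in groups for y in g]
    fitted = [ybar for g, ybar in zip(groups, group_means(groups)) for _ in g]
    return -2.0 * (poisson_loglik(ys, fitted) - poisson_loglik(ys, ys))

# Made-up counts for three groups (illustration only).
groups = [[2, 0, 3], [5, 7], [1, 1, 4, 2]]
print(deviance_closed_form(groups), deviance_from_loglik(groups))
```

The two printed values agree because the $-\bar{y}_i$ and $-y_{ij}$ terms cancel when summed within each group, which is the identity used in part (b).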

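Problem 4.29

Suppose $\pi(x) = \Phi(\alpha + \beta x)$, where the standard cdf $\Phi$ corresponds to a pdf $\phi$ that is symmetric around 0.

(a) Because $\phi$ is symmetric around 0, $\Phi^{-1}(0.5) = 0$, so

\[ \pi(x) = 0.5 \;\Rightarrow\; \Phi(\alpha + \beta x) = 0.5 \;\Rightarrow\; \alpha + \beta x = \Phi^{-1}(0.5) = 0 \;\Rightarrow\; x = -\alpha/\beta. \]

(b) The slope at the point where $\pi(x) = 0.5$ is

\[ \pi'(x)\big|_{\pi(x) = 0.5} = \pi'(x)\big|_{x = -\alpha/\beta} = \frac{\partial\Phi(\alpha + \beta x)}{\partial x}\bigg|_{x = -\alpha/\beta} = \beta\phi(\alpha + \beta x)\big|_{x = -\alpha/\beta} = \beta\phi(0). \]

For the logit link, $\Phi(x) = e^x/(1+e^x)$ and $\phi(x) = e^x/(1+e^x)^2$, so $\beta\phi(0) = \beta/4$. For the probit link, $\phi(x) = \exp(-x^2/2)/\sqrt{2\pi}$, so $\beta\phi(0) = \beta/\sqrt{2\pi}$.

(c) The regression curve is $\pi(x)$, a function of $x$ given, for $\beta > 0$, by

\[ \pi(x) = \Phi(\alpha + \beta x) = P(Z \le \alpha + \beta x) = P\left(\frac{Z - \alpha}{\beta} \le x\right), \quad \text{where } Z \sim N(0, 1). \]

Since $Z \sim N(0, 1)$, $(Z - \alpha)/\beta \sim N(-\alpha/\beta, 1/\beta^2)$, and $\pi(x)$ is therefore the cdf corresponding to this distribution.

Appendix: S-plus code for Problem 3.12

    f.gamma <- function(x) {
      # x is a matrix of counts
      n <- nrow(x); m <- ncol(x)
      pihat.mat <- x / sum(x)
      # For each cell: proportion of observations concordant (.c) or discordant (.d) with it
      pihat.mat.c <- pihat.mat.d <- matrix(NA, nrow = n, ncol = m)
      for (i in 1:n) {
        for (j in 1:m) {
          pihat.mat.c[i, j] <- sum(pihat.mat[row(pihat.mat) < i & col(pihat.mat) < j]) +
            sum(pihat.mat[row(pihat.mat) > i & col(pihat.mat) > j])
          pihat.mat.d[i, j] <- sum(pihat.mat[row(pihat.mat) < i & col(pihat.mat) > j]) +
            sum(pihat.mat[row(pihat.mat) > i & col(pihat.mat) < j])
        }
      }
      pihat.c <- sum(pihat.mat * pihat.mat.c)
      C <- pihat.c * sum(x)^2 / 2   # number of concordant pairs
      pihat.d <- sum(pihat.mat * pihat.mat.d)
      D <- pihat.d * sum(x)^2 / 2   # number of discordant pairs
      gamma <- (C - D) / (C + D)
      psi.mat <- 4 * (pihat.d * pihat.mat.c - pihat.c * pihat.mat.d) / (pihat.c + pihat.d)^2
      # Delta-method variance of the gamma estimate
      sigma2 <- 16 * sum(pihat.mat * (pihat.d * pihat.mat.c - pihat.c * pihat.mat.d)^2) /
        (pihat.c + pihat.d)^4 / sum(x)
      list(gamma = gamma, sigma2 = sigma2, CI = gamma + c(-1, 1) * 1.96 * sqrt(sigma2))
    }

    counts <- c(209, 101, 237, 151, 126, 426, 16, 21, 138)
    table.3.12 <- cbind(expand.grid(list(attitude = c("GD", "MP", "GA"),
                                         School = c("<HS", "HS", ">HS"))),
                        count = counts)
    table.3.12.array <- t(design.table(table.3.12))
    f.gamma(table.3.12.array)
    #$gamma:
    #[1]
    #$sigma2:
    #[1]
    #$CI:
    #[1]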
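For readers without S-plus, the same calculation can be sketched in Python. This is a hedged translation of `f.gamma`, not the author's code: the name `goodman_kruskal_gamma` and the list-of-lists layout (one row per schooling level, built from `counts` with attitude varying fastest, as `expand.grid` does) are assumptions. Gamma and its variance are unchanged by transposing the table, so the orientation does not affect the result.

```python
import math

def goodman_kruskal_gamma(table, z=1.96):
    """Gamma, its delta-method variance, and a Wald interval for an ordinal two-way table."""
    n_rows, n_cols = len(table), len(table[0])
    total = sum(sum(row) for row in table)
    p = [[cell / total for cell in row] for row in table]

    def block_sum(i, j, row_sign, col_sign):
        # Sum of proportions strictly above/below (row_sign) and left/right (col_sign) of (i, j).
        return sum(
            p[a][b]
            for a in range(n_rows)
            for b in range(n_cols)
            if (a - i) * row_sign > 0 and (b - j) * col_sign > 0
        )

    # For each cell: probability mass concordant (conc) or discordant (disc) with it.
    conc = [[block_sum(i, j, -1, -1) + block_sum(i, j, 1, 1) for j in range(n_cols)]
            for i in range(n_rows)]
    disc = [[block_sum(i, j, -1, 1) + block_sum(i, j, 1, -1) for j in range(n_cols)]
            for i in range(n_rows)]

    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    pi_c = sum(p[i][j] * conc[i][j] for i, j in cells)
    pi_d = sum(p[i][j] * disc[i][j] for i, j in cells)
    gamma = (pi_c - pi_d) / (pi_c + pi_d)

    # Same expression as sigma2 in f.gamma: asymptotic variance divided by n.
    var = 16 * sum(
        p[i][j] * (pi_d * conc[i][j] - pi_c * disc[i][j]) ** 2 for i, j in cells
    ) / (pi_c + pi_d) ** 4 / total
    half_width = z * math.sqrt(var)
    return gamma, var, (gamma - half_width, gamma + half_width)

counts = [209, 101, 237, 151, 126, 426, 16, 21, 138]
table = [counts[0:3], counts[3:6], counts[6:9]]  # assumed layout: one row per schooling level
gamma, var, ci = goodman_kruskal_gamma(table)
print(gamma, var, ci)
```

On these counts the sketch gives a gamma of roughly 0.387 and an interval close to the $[0.315, 0.459]$ reported in Problem 3.12.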
