STAC51H3: Categorical Data Analysis
Assignment 2. Due: Thu Feb 9, 2016 in class
All relevant work must be shown for credit.


Note: In any question, if you are using R, all R code and R output must be included in your answers. You should assume that the reader is not familiar with R output, so explain all your findings, quoting the necessary values from your output.

Please note that academic integrity is fundamental to learning and scholarship. You may discuss questions with other students. However, the work you submit should be your own. If I feel suspicious of any assignment (e.g. if your work doesn't appear to be consistent with what we have discussed in class), I will not mark the assignment. Instead, I will ask you to present your work in my office, and your grade will be assigned based on your presentation.

1. (Agresti) In the United States, the estimated annual probability that a woman over the age of 35 dies of lung cancer equals [...] for current smokers and [...] for nonsmokers [M. Pagano and K. Gauvreau, Principles of Biostatistics, Belmont, CA: Duxbury Press (1993), p. 134].

(a) (3 points) Calculate and interpret the difference of proportions and the relative risk.

Denoting smokers by 1 and non-smokers by 2, the difference of proportions is $\hat\pi_1 - \hat\pi_2$ = [...]. For women over 35 years of age, the annual probability of dying of lung cancer is greater (by [...]) for smokers than for non-smokers. The relative risk is $\hat\pi_1 / \hat\pi_2$ = [...]. For women over 35 years of age, the probability of dying of lung cancer for smokers is [...] times that for non-smokers.

(b) (3 points) Calculate and interpret the odds ratio. Explain why the relative risk and odds ratio take similar values. Is this always the case or only in some cases? Explain.

The odds ratio is $\hat\theta = \frac{\hat\pi_1/(1-\hat\pi_1)}{\hat\pi_2/(1-\hat\pi_2)} \approx 10.8$. For women over 35 years of age, the odds of dying from lung cancer for smokers are about 10.8 times the odds for non-smokers. The relative risk and the odds ratio are very similar in this example. This is not always the case: since $\hat\theta = \frac{\hat\pi_1}{\hat\pi_2} \times \frac{1-\hat\pi_2}{1-\hat\pi_1}$, the two agree closely only when both probabilities are small, as is usual for rare diseases.

> # part a
> p1 <- [...]
> p2 <- [...]
> diff <- p1-p2 # Risk difference

> diff
[1]
> rr <- p1/p2 # Relative risk
> rr
[1]
> # part b
> odds1 <- p1/(1-p1)
> odds2 <- p2/(1-p2)
> thetahat <- odds1/odds2
> odds1
[1]
> odds2
[1]
> thetahat
[1]
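The comparison asked for in 1(b) can be checked numerically. The sketch below assumes the figures usually quoted for this Agresti exercise, 0.001304 for current smokers and 0.000121 for nonsmokers; these inputs are an assumption and are not taken from the transcript. The helper `rr_vs_or` and the second, made-up pair of probabilities are hypothetical and serve only to show how the odds ratio drifts away from the relative risk once the outcome is no longer rare.

```r
# Assumed inputs (not from the transcript): annual death probabilities
p1 <- 0.001304  # current smokers (assumption)
p2 <- 0.000121  # nonsmokers (assumption)

# Hypothetical helper: risk difference, relative risk and odds ratio
rr_vs_or <- function(p1, p2) {
  rr <- p1 / p2
  or <- (p1 / (1 - p1)) / (p2 / (1 - p2))
  c(diff = p1 - p2, rr = rr, or = or)
}

rare <- rr_vs_or(p1, p2)        # rare outcome: rr and or nearly equal
common <- rr_vs_or(0.6, 0.3)    # made-up common outcome: rr and or differ
print(round(rare, 6))
print(round(common, 3))
```

Because the odds ratio equals the relative risk multiplied by (1 - p2)/(1 - p1), the two can only be close when that factor is near 1, which is what small probabilities guarantee.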

2. Drivers at an intersection are classified by gender (Female or Male) and seat-belt usage (Yes or No). After one hour of observation, the following table was collected:

          Seat-belt use
Gender    Yes     No
F         110     60
M         65      45

(a) (4 points) Compute and interpret the odds ratio (odds of not wearing a seat belt) for this example.

$\hat\theta = \frac{60/110}{45/65} \approx 0.79$. The odds of not wearing a seat belt among female drivers are about 0.79 times the odds among male drivers. Equivalently, the odds of not wearing a seat belt among male drivers are about 1/0.79 ≈ 1.27 times the odds among female drivers.

(b) (3 points) Which sampling model (Poisson, Binomial, Multinomial, Product Multinomial) seems most appropriate here? Give reasons for your answer.

The design didn't have a fixed sample size. The number of drivers passing the intersection in a one-hour period is a random variable, typically modeled by a Poisson distribution. Under this model, the cell counts of the contingency table have independent Poisson distributions.

(c) (2 points) Is one of the variables a response variable? Which one? Explain.

Seat-belt use can depend on gender, but not the other way round, so seat-belt use is the response variable and gender is the explanatory variable.

3. A survey estimated that 20% of all Americans aged 16 to 20 drove under the influence of drugs or alcohol. A similar survey is planned for Canada. They want a 95% confidence interval to have a margin of error of 0.04 (for the Wald confidence interval).

(a) (4 points) Find the necessary sample size if they expect to find results similar to those in the United States.

We want $1.96\sqrt{\frac{0.2(1-0.2)}{n}} \le 0.04$, i.e. $n \ge \frac{1.96^2 \times 0.2 \times 0.8}{0.04^2} = 384.16$, and so n = 385.

(b) (2 points) Suppose instead they used the conservative formula based on $\hat p = 0.5$. What is now the required sample size?

We want $1.96\sqrt{\frac{0.5(1-0.5)}{n}} \le 0.04$, i.e. $n \ge \frac{1.96^2 \times 0.25}{0.04^2} = 600.25$, and so n = 601 after rounding up (about 600).

4. In this question we will do a simulation study of the confidence intervals for odds ratios for 2 × 2 contingency tables based on multinomial sampling.

(a) (8 points) Use R to generate n = 10 2 × 2 contingency tables with total count (i.e. grand total) N = 100 and known cell probabilities $(\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22}) = (0.2, 0.3, 0.3, 0.2)$ from a multinomial distribution, i.e. from multinomial$(N, \pi_{11}, \pi_{12}, \pi_{21}, \pi_{22})$. For each of these generated tables, calculate the odds ratio and a 95 percent large-sample confidence interval. What is the true odds ratio θ (i.e. the population odds ratio) for these tables? How many of the 10 intervals you calculated contain θ? Note: in this part please print all your table cell counts (i.e. for the 10 tables), the estimated odds ratios (i.e. $\hat\theta$) and the confidence intervals.

> # R code Q4 Assign 2
> N <- 100 # the grand total for each table
> n <- 10 # number of tables
> pi11 <- 0.2
> pi12 <- 0.3
> pi21 <- 0.3
> pi22 <- 0.2
> alpha <- 0.05
> #
> table <- rmultinom(n, size = N, prob = c(pi11, pi12, pi21, pi22))
> theta <- (pi11*pi22)/(pi12*pi21)
> table <- t(table)
> a <- table[,1]
> b <- table[,2]
> c <- table[,3]
> d <- table[,4]
> # add 0.5 if any cell count is 0 to avoid division by zero
> a <- (a == 0)*(a+0.5)+(a > 0)*a
> b <- (b == 0)*(b+0.5)+(b > 0)*b
> c <- (c == 0)*(c+0.5)+(c > 0)*c
> d <- (d == 0)*(d+0.5)+(d > 0)*d
> thetahat <- (a*d)/(b*c)
> logthetahat <- log(thetahat)
> SE <- sqrt(1/a+1/b+1/c+1/d)
> LLlog <- logthetahat - qnorm(1-alpha/2)*SE
> ULlog <- logthetahat + qnorm(1-alpha/2)*SE

> LL <- exp(LLlog)
> UL <- exp(ULlog)
> results <- cbind(a, b, c, d, thetahat, LL, UL, theta)
> results
      a b c d thetahat LL UL theta
 [1,] [2,] [3,] [4,] [5,] [6,] [7,] [8,] [9,] [10,]
>

(b) (7 points) Repeat part (a), but this time with n = 1,000,000. Do not print the tables etc. this time; instead calculate the proportion of the intervals (i.e. of the million intervals) containing θ. Comment on your value.

Note: Any table with a zero cell count has an odds ratio equal to 0 or ∞. Replace any zero cell counts by 0.5 (this is often done when dealing with zero cell counts).

> # Now repeat the above simulation for a larger number of tables
> # and calculate the coverage probability
> N <- 100 # the grand total for each table
> n <- 1000000 # number of tables
> pi11 <- 0.2
> pi12 <- 0.3
> pi21 <- 0.3
> pi22 <- 0.2
> alpha <- 0.05
> #
> table <- rmultinom(n, size = N, prob = c(pi11, pi12, pi21, pi22))
> theta <- (pi11*pi22)/(pi12*pi21)
> table <- t(table)
> a <- table[,1]
> b <- table[,2]
> c <- table[,3]
> d <- table[,4]
> # add 0.5 if any cell count is 0 to avoid division by zero

> a <- (a == 0)*(a+0.5)+(a > 0)*a
> b <- (b == 0)*(b+0.5)+(b > 0)*b
> c <- (c == 0)*(c+0.5)+(c > 0)*c
> d <- (d == 0)*(d+0.5)+(d > 0)*d
> thetahat <- (a*d)/(b*c)
> logthetahat <- log(thetahat)
> SE <- sqrt(1/a+1/b+1/c+1/d)
> LLlog <- logthetahat - qnorm(1-alpha/2)*SE
> ULlog <- logthetahat + qnorm(1-alpha/2)*SE
> LL <- exp(LLlog)
> UL <- exp(ULlog)
> thetainCI <- (LL < theta)*(UL > theta)
> observed_cof_level <- mean(thetainCI)
> observed_cof_level
[1]
>

(c) (3 points) Repeat part (b), but this time with N = 20 (i.e. still a million tables, but each table with grand total 20).

> N <- 20 # the grand total for each table
> n <- 1000000 # number of tables
> pi11 <- 0.2
> pi12 <- 0.3
> pi21 <- 0.3
> pi22 <- 0.2
> alpha <- 0.05
> #
> table <- rmultinom(n, size = N, prob = c(pi11, pi12, pi21, pi22))
> theta <- (pi11*pi22)/(pi12*pi21)
> table <- t(table)
> a <- table[,1]
> b <- table[,2]
> c <- table[,3]
> d <- table[,4]
> # add 0.5 if any cell count is 0 to avoid division by zero
> a <- (a == 0)*(a+0.5)+(a > 0)*a
> b <- (b == 0)*(b+0.5)+(b > 0)*b
> c <- (c == 0)*(c+0.5)+(c > 0)*c
> d <- (d == 0)*(d+0.5)+(d > 0)*d

> thetahat <- (a*d)/(b*c)
> logthetahat <- log(thetahat)
> SE <- sqrt(1/a+1/b+1/c+1/d)
> LLlog <- logthetahat - qnorm(1-alpha/2)*SE
> ULlog <- logthetahat + qnorm(1-alpha/2)*SE
> LL <- exp(LLlog)
> UL <- exp(ULlog)
> thetainCI <- (LL < theta)*(UL > theta)
> observed_cof_level <- mean(thetainCI)
> observed_cof_level
[1]
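Parts (b) and (c) repeat the same steps with different settings, so the simulation can be wrapped in one function. The following is a minimal sketch under stated assumptions: the function name `coverage_or`, the seed and the reduced number of tables (20,000 rather than a million, to keep the run short) are illustrative choices, not part of the assignment solution.

```r
# Hypothetical helper: Monte Carlo coverage of the Wald interval for the odds ratio
coverage_or <- function(n_tables, N, probs, alpha = 0.05) {
  theta <- (probs[1] * probs[4]) / (probs[2] * probs[3])    # true odds ratio
  counts <- t(rmultinom(n_tables, size = N, prob = probs))  # one table per row
  counts[counts == 0] <- 0.5                                # replace zero cells by 0.5
  log_or <- log(counts[, 1] * counts[, 4] / (counts[, 2] * counts[, 3]))
  se <- sqrt(rowSums(1 / counts))                           # SE of the log odds ratio
  z <- qnorm(1 - alpha / 2)
  # proportion of intervals that contain the true odds ratio
  mean(exp(log_or - z * se) < theta & theta < exp(log_or + z * se))
}

set.seed(51)  # arbitrary seed, for reproducibility only
probs <- c(0.2, 0.3, 0.3, 0.2)
cov100 <- coverage_or(20000, N = 100, probs)
cov20 <- coverage_or(20000, N = 20, probs)
print(c(N100 = cov100, N20 = cov20))
```

Comparing the two printed proportions with the nominal 0.95 shows how the large-sample interval behaves when the grand total drops from 100 to 20.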

5. Suppose that we would like to know whether there is an association between voter gender and candidate choice in an election between, say, candidate A and candidate B. An investigator has decided to conduct an exit poll with 50 voters. He classified the results by gender and the candidate they voted for. The counts are given in the table below:

          Candidate A   Candidate B
Female    10            15
Male      5             20

(a) (2 points) Which sampling model is appropriate for the counts in this table? Give reasons for your answer.

In this design the grand total is fixed (50 voters), and so multinomial sampling is the appropriate sampling model.

(b) (4 points) What are the estimated cell probabilities under the assumption of independence of gender and the preference for the candidate?

> a <- 10
> b <- 15
> c <- 5
> d <- 20
> n <- a+b+c+d
> n
[1] 50
> pi11 <- ((a+b)/n)*((a+c)/n)
> pi12 <- ((a+b)/n)*((b+d)/n)
> pi21 <- ((c+d)/n)*((a+c)/n)
> pi22 <- ((c+d)/n)*((b+d)/n)
> prob <- c(pi11, pi12, pi21, pi22)
> prob
[1] 0.15 0.35 0.15 0.35

(c) (4 points) Using the estimated cell probabilities as the actual values of the probabilities (i.e. $\pi_{ij}$), calculate the probability of observing the counts in the table above.

> pobstable <- dmultinom(c(a, b, c, d), size = n, prob = prob)
> pobstable
[1]
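The estimates in parts (b) and (c) can also be written with the table kept as a matrix, which makes the "product of margins" structure explicit. This is a sketch only; the object names `tab`, `pi_hat` and `p_obs` are illustrative and do not appear in the solution.

```r
# Observed exit-poll counts from the table above
tab <- matrix(c(10, 15, 5, 20), nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("Female", "Male"),
                              candidate = c("A", "B")))
n <- sum(tab)

# Estimated cell probabilities under independence:
# (row proportion) x (column proportion) for every cell
pi_hat <- outer(rowSums(tab) / n, colSums(tab) / n)
print(pi_hat)

# Probability of the observed table if pi_hat were the true cell probabilities
# (cells passed row by row, matching the order a, b, c, d used above)
p_obs <- dmultinom(as.vector(t(tab)), size = n, prob = as.vector(t(pi_hat)))
print(p_obs)
```

Using `outer` on the two marginal proportion vectors gives all four products at once and extends unchanged to larger tables.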

(d) (5 points) Calculate the probability of observing table counts as surprising as or more surprising than the counts in the table above.

> a <- 10
> b <- 15
> c <- 5
> d <- 20
> n <- a+b+c+d
> n
[1] 50
> X <- t(as.matrix(expand.grid(0:n, 0:n, 0:n)))
> X <- X[, colSums(X) <= n]
> X <- rbind(X, n - colSums(X))
> # Let's use the estimated proportions under independence as the probabilities and calculate
> # the probability of observing a table as surprising as or more surprising than the
> # table observed
> pi11 <- ((a+b)/n)*((a+c)/n)
> pi12 <- ((a+b)/n)*((b+d)/n)
> pi21 <- ((c+d)/n)*((a+c)/n)
> pi22 <- ((c+d)/n)*((b+d)/n)
> prob <- c(pi11, pi12, pi21, pi22)
> prob
[1] 0.15 0.35 0.15 0.35
> sum(prob)
[1] 1
> #p <- round(apply(X, 2, function(x) dmultinom(x, size = n, prob = prob)), 3)
> p <- apply(X, 2, function(x) dmultinom(x, size = n, prob = prob))
> pobstable <- dmultinom(c(a, b, c, d), size = n, prob = prob)
> pextreme <- subset(p, p <= pobstable)
> pvalue <- sum(pextreme)
> pobstable
[1]
> pvalue
[1]
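The enumeration in part (d) can be packaged as a function and checked on a case small enough to verify by hand. The sketch below is an illustration under stated assumptions: the name `exact_multinom_pvalue` is hypothetical, and the small relative tolerance in the comparison is an added design choice (so that tables tied with the observed one are not dropped because of rounding error), not something the solution uses.

```r
# Hypothetical helper: exact multinomial p-value by full enumeration.
# Sums the probabilities of all tables no more likely than the observed one.
exact_multinom_pvalue <- function(obs, prob) {
  n <- sum(obs)
  k <- length(obs)
  # All values of the first k-1 cells; the last cell is fixed by the total
  grid <- as.matrix(expand.grid(rep(list(0:n), k - 1)))
  grid <- grid[rowSums(grid) <= n, , drop = FALSE]
  tables <- cbind(grid, n - rowSums(grid))
  p <- apply(tables, 1, dmultinom, size = n, prob = prob)
  p_obs <- dmultinom(obs, size = n, prob = prob)
  # Relative tolerance so that ties are not lost to rounding error
  sum(p[p <= p_obs * (1 + 1e-7)])
}

# Hand-checkable case with two cells: tables (0,2), (1,1), (2,0) have
# probabilities 0.25, 0.5, 0.25, so observing (2,0) gives 0.25 + 0.25
pv_small <- exact_multinom_pvalue(c(2, 0), c(0.5, 0.5))
print(pv_small)

# The exit-poll table with the independence estimates as cell probabilities
pv_poll <- exact_multinom_pvalue(c(10, 15, 5, 20), c(0.15, 0.35, 0.15, 0.35))
print(pv_poll)
```

For the exit-poll table, only the three free cells are enumerated; the fourth is determined by the grand total of 50, which is the same idea as the `rbind(X, n - colSums(X))` step above.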

6. Consider the 2 × 2 contingency table with the cell probabilities as shown below:

          Y = 1     Y = 2            Total
X = 1     x         a - x            a
X = 2     b - x     1 - a - b + x    1 - a
Total     b         1 - b            1

Let $\theta = \frac{P(Y=1 \mid X=1)/P(Y=2 \mid X=1)}{P(Y=1 \mid X=2)/P(Y=2 \mid X=2)}$.

Note: Do not use any statistical ideas in parts (a) and (b) of this question. Just treat a, b and x as real numbers between 0 and 1 and use only simple algebra (nothing more than high school algebra).

(a) (5 points) Show that if θ = 1, then x = ab.

$$\theta = \frac{P(Y=1 \mid X=1)/P(Y=2 \mid X=1)}{P(Y=1 \mid X=2)/P(Y=2 \mid X=2)} = \frac{\dfrac{x}{a} \Big/ \dfrac{a-x}{a}}{\dfrac{b-x}{1-a} \Big/ \dfrac{1-a-b+x}{1-a}} = \frac{x(1-a-b+x)}{(a-x)(b-x)} = \frac{x - ax - bx + x^2}{ab - ax - bx + x^2}$$

$$\theta = 1 \implies \frac{x - ax - bx + x^2}{ab - ax - bx + x^2} = 1 \implies x - ax - bx + x^2 = ab - ax - bx + x^2 \implies x = ab$$

(b) (5 points) Show that if x = ab, then θ = 1.

If x = ab, then

$$\theta = \frac{\dfrac{x}{a} \Big/ \dfrac{a-x}{a}}{\dfrac{b-x}{1-a} \Big/ \dfrac{1-a-b+x}{1-a}} = \frac{\dfrac{ab}{a} \Big/ \dfrac{a-ab}{a}}{\dfrac{b-ab}{1-a} \Big/ \dfrac{1-a-b+ab}{1-a}} = \frac{b/(1-b)}{\dfrac{b(1-a)}{1-a} \Big/ \dfrac{(1-a)(1-b)}{1-a}} = \frac{b/(1-b)}{b/(1-b)} = 1$$

(c) (2 points) What is the meaning of the above result (from a statistical point of view)?

Note that $x = \pi_{11}$, $a = \pi_{1+}$ and $b = \pi_{+1}$, so $x = ab$ means $\pi_{11} = \pi_{1+}\pi_{+1}$. Also, when $x = ab$,

$\pi_{12} = a - x = a - ab = a(1-b) = \pi_{1+}\pi_{+2}$,
$\pi_{21} = b - x = b - ab = (1-a)b = \pi_{2+}\pi_{+1}$,
$\pi_{22} = 1 - a - b + x = 1 - a - b + ab = (1-a)(1-b) = \pi_{2+}\pi_{+2}$.

In other words, $\pi_{ij} = \pi_{i+}\pi_{+j}$ for all (i, j). That is, $x = ab$ is the same as saying $P(X = i, Y = j) = P(X = i)P(Y = j)$ for all (i, j), i.e. X and Y are independent. In parts (a) and (b) above we proved that θ = 1 iff x = ab. So what is shown here is that θ = 1 iff X and Y are independent.
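The equivalence proved in question 6 can be checked numerically for a particular choice of margins. The sketch below is only an illustration: the function name `theta_fun` and the values a = 0.3, b = 0.6 are arbitrary assumptions. It evaluates θ at x = ab for part (b), and solves θ(x) = 1 numerically for part (a).

```r
# Hypothetical helper: odds ratio of the 2 x 2 table with cells
# x, a - x, b - x, 1 - a - b + x
theta_fun <- function(x, a, b) {
  (x * (1 - a - b + x)) / ((a - x) * (b - x))
}

a <- 0.3; b <- 0.6  # arbitrary margins (assumption for illustration)

# (b): x = ab gives theta = 1
theta_at_ab <- theta_fun(a * b, a, b)

# (a): solving theta(x) = 1 numerically recovers x = ab.
# All four cells are positive only for max(0, a + b - 1) < x < min(a, b).
lower <- max(0, a + b - 1) + 1e-6
upper <- min(a, b) - 1e-6
root <- uniroot(function(x) theta_fun(x, a, b) - 1,
                lower = lower, upper = upper, tol = 1e-12)$root
print(c(theta_at_ab = theta_at_ab, root = root, ab = a * b))
```

A numerical check for one pair (a, b) is not a proof, but it is a quick way to catch an algebra slip in the derivation.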
