University of North Carolina Chapel Hill
Soci252-002 Data Analysis in Sociological Research
Spring 2013
Professor François Nielsen

Homework 4 Computer Handout

Readings

This handout covers computer issues related to Chapters 18, 19, 20, 21 and 22 in De Veaux et al. 2012. Stats: Data and Models. 3e. Addison-Wesley. (STATSDM3)

Chapter 18 Sampling Distribution Models

See the Computer Handout for Homework 3 and Activities 12 and 13 for a discussion of how to simulate sampling distributions using R.

Chapter 19 Confidence Intervals for Proportions

Calculating a CI for a Proportion by hand

I illustrate calculating a confidence interval for a proportion with the example of 510 randomly sampled adults who, in October 2008, responded to the question "Generally speaking, do you believe the death penalty is applied fairly or unfairly in this country today?"; 275 (54%) answered "Fairly" (STATSDM3, pp. 464-466). Using R as a calculator, one would proceed as follows.

> n <- 510
> phat <- 275/510
> SE <- sqrt(phat*(1 - phat)/n)
> SE
[1] 0.02207217
> alpha <- .05
> zstar <- qnorm(1 - alpha/2) # z for p = .975
> zstar
[1] 1.959964
> ME <- zstar*SE
> c(phat - ME, phat + ME)
[1] 0.4959550 0.5824763

Thus we are 95% confident that between 49.6% and 58.2% of adults think that the death penalty is applied fairly.

Calculating a CI for a Proportion with prop.test

The R function prop.test calculates the CI for a proportion. It is used as

prop.test(x, n, conf.level = 0.95, correct = TRUE)

where x is the number of successes, n is the sample size, conf.level is the desired confidence level (95% by default) and correct indicates whether a continuity correction is used.

For the death penalty example, prop.test is used as follows, specifying correct = FALSE.
> prop.test(275, 510, correct=FALSE)

    1-sample proportions test without continuity correction

data:  275 out of 510, null probability 0.5
X-squared = 3.1373, df = 1, p-value = 0.07652
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.4958229 0.5820222
sample estimates:
        p
0.5392157

We see that this confidence interval is very close to the one calculated by hand. Note that I am using the option correct = FALSE here only to make the results most comparable to those in the text. In general, however, the continuity correction does no harm and we would leave the default option correct = TRUE as is.

CI for a Proportion for a Factor in a Dataframe

In practice we often want to calculate a CI for a proportion from the original, ungrouped data stored as a factor in a data frame. To illustrate, I calculate a CI for the proportion of depressed (as opposed to normal) respondents in Afifi and Clark's depress data set. The (confusingly named) variable cases is a factor taking the value depressed if the respondent has cesd >= 16 and normal otherwise. Before doing anything I need to change the order of the levels of factor cases so that depressed comes first (footnote 1). I then use prop.test after first tabulating the values of cases with the table function.
> # reading the Afifi and Clark data
> library(foreign)
> depress <- read.dta("depress.dta") # read Stata data set
> depress$cases <- factor(depress$cases, levels = c("depressed", "normal"))
> attach(depress) # to make variable names accessible
> head(cases, 10) # look at first 10 observations
 [1] normal    normal    normal    normal    normal    normal    normal
 [8] normal    depressed normal
Levels: depressed normal
> tab <- table(cases)
> tab
cases
depressed    normal
       50       244
> prop.test(tab, correct=FALSE)

    1-sample proportions test without continuity correction

data:  tab, null probability 0.5
X-squared = 128.0136, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.1314451 0.2172017
sample estimates:
       p
0.170068

We can be 95% confident that the proportion of depressed respondents in the population sampled is between 13.1% and 21.7%.

Note that I have spelled out the steps in detail. Once we understand what is going on we can just enter prop.test(table(cases)) to obtain our CI in one fell swoop (just making sure that the level corresponding to "success" is listed first). Note also that the hypothesis-testing part of the output can be ignored here, as it tests the default hypothesis p = .5, which is not meaningful in this context.

Footnote 1: This is because the table input into prop.test() must contain the numbers of successes and failures, in that order. In this context "success" means being depressed. If I didn't change the order of the levels, prop.test() would give me a CI for the proportion of normal, rather than depressed, respondents.

Chapter 20 Testing Hypotheses About Proportions

Hypothesis Test for One Proportion by hand

To illustrate a hypothesis test for one proportion I use the example of the home field advantage hypothesis in the 2009 Major League Baseball season (STATSDM3, pp. 485-486), in which the home team won 1333 (54.9%) of the 2430 games. Could this deviation from 50% be due to chance, or is there really a home field advantage in professional baseball? We set up the hypotheses to be tested as

H0: p = .50; HA: p > .50

and proceed as follows.
> p0 <- .5
> n <- 2430
> phat <- 1333/2430
> phat
[1] 0.5485597
> SD <- sqrt(p0*(1-p0)/n) # note this is SD, not SE; why?
> z <- (phat - p0)/SD # test statistic
> z
[1] 4.787501
> 1 - pnorm(z)
[1] 8.44355e-07

The very small p-value indicates that the 54.86% proportion of wins by the home team is unlikely to arise by chance if the probability of winning is .5. Thus we reject the hypothesis that the home team has no advantage.

Hypothesis Test for One Proportion with prop.test

The R function prop.test can test hypotheses as well as calculate confidence intervals. To test the hypothesis that the proportion of wins by the home team is greater than .5 we proceed as follows.

> prop.test(x=1333, n=2430, p=.5, alternative="greater", correct=FALSE)

    1-sample proportions test without continuity correction

data:  1333 out of 2430, null probability 0.5
X-squared = 22.9202, df = 1, p-value = 8.444e-07
alternative hypothesis: true p is greater than 0.5
95 percent confidence interval:
 0.5319099 1.0000000
sample estimates:
        p
0.5485597

I specified alternative="greater" because the alternative hypothesis HA: p > p0 = .5 is one-sided in the positive direction. The two-sided hypothesis or the one-sided hypothesis in the negative direction would be indicated by alternative="two.sided" (the default) and alternative="less", respectively.

Because the alternative hypothesis is one-sided ("greater"), the confidence interval (.532, 1.000) provided by prop.test is also one-sided. However, we will not consider one-sided confidence intervals further in this course.

Chapter 21 More About Tests and Intervals

One-sided and Two-sided P-Values

The prop.test function in R provides the correct p-value according to whether the test is one-sided (alternative="greater" or alternative="less") or two-sided (alternative="two.sided"). For a given p0 the p-value of the two-sided test is twice that of the corresponding one-sided test (the one whose direction matches the observed deviation from p0). For example, for the home team advantage example, the two-sided test of the hypothesis that the probability of a home team win is actually .5 is as follows.

> prop.test(x=1333, n=2430, p=.5, alternative="two.sided", correct=FALSE)

    1-sample proportions test without continuity correction

data:  1333 out of 2430, null probability 0.5
X-squared = 22.9202, df = 1, p-value = 1.689e-06
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5287125 0.5682535
sample estimates:
        p
0.5485597

We see that the p-value 1.689e-06 of the two-sided test is twice the p-value 8.444e-07 found above for the one-sided test.

The Agresti-Coull "Plus Four" Interval

When the sample has fewer than 10 successes or failures, the Agresti-Coull "Plus Four" interval can be calculated by adding 2 successes and 2 failures (thus 4 cases to the total count) and calculating the confidence interval with prop.test. The example of the 45 surgical operations with 3 failures (STATSDM3, p. 511) does not satisfy the Success/Failure Condition. Thus we calculate the Agresti-Coull interval as follows.
> prop.test(x = 3+2, n = 45+4, correct=FALSE)

    1-sample proportions test without continuity correction

data:  3 + 2 out of 45 + 4, null probability 0.5
X-squared = 31.0408, df = 1, p-value = 2.527e-08
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.04437955 0.21756362
sample estimates:
        p
0.1020408

The confidence interval (0.04437955, 0.21756362) reported by prop.test differs from the interval (.017, .187) reported in the text (p. 511). I do not know why at the moment.
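One plausible explanation for the discrepancy, offered as a sketch rather than a definitive diagnosis: prop.test does not apply the p-hat plus or minus z* times SE formula used in the "by hand" calculations of this handout. For a single proportion it reports a score (Wilson) interval, obtained by inverting the test. The code below applies the by-hand formula to the adjusted counts and compares the result with the prop.test interval. The names x_adj, n_adj, p_tilde, se_tilde, ci_plus4 and ci_score are illustrative choices of mine, not names used in the text.

```r
# Plus-four interval computed with the by-hand formula on the adjusted counts
x_adj <- 3 + 2     # observed count plus 2, as in the prop.test call above
n_adj <- 45 + 4    # sample size plus 4
p_tilde <- x_adj / n_adj
se_tilde <- sqrt(p_tilde * (1 - p_tilde) / n_adj)
zstar <- qnorm(0.975)
ci_plus4 <- c(p_tilde - zstar * se_tilde, p_tilde + zstar * se_tilde)
round(ci_plus4, 3)                 # 0.017 0.187

# Interval reported by prop.test for the same adjusted counts
ci_score <- prop.test(x_adj, n_adj, correct = FALSE)$conf.int
round(as.numeric(ci_score), 3)     # 0.044 0.218
```

Under this reading, the by-hand formula applied to 5 out of 49 gives the (.017, .187) interval of the text, while prop.test gives a different interval because it uses a different formula. The same difference in method would account for the small gap between the by-hand and prop.test intervals in the death penalty example of Chapter 19.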
Chapter 22 Comparing Two Proportions

Comparing Two Proportions by hand

To illustrate the comparison of two proportions I use the example of seat-belt use by male drivers depending on whether a woman is sitting next to them. In these data, of 4208 male drivers with female passengers, 2777 (66.0%) used their seat belt. Among 2763 male drivers with male passengers only, 1363 (49.3%) wore seat belts (STATSDM3, p. 525). Using R as a calculator, one could proceed as follows (STATSDM3, pp. 529-531).

> nf <- 4208
> nm <- 2763
> phatf <- 2777/4208
> phatm <- 1363/2763
> SE <- sqrt(phatf*(1-phatf)/nf + phatm*(1-phatm)/nm)
> SE
[1] 0.01199155
> zstar <- qnorm(.975)
> zstar
[1] 1.959964
> ME <- zstar*SE
> dif <- phatf - phatm
> dif
[1] 0.1666291
> c(dif - ME, dif + ME) # CI for difference
[1] 0.1431261 0.1901321

This corresponds closely to the result in the text.

Comparing Two Proportions with prop.test

To compare the two proportions with prop.test we need to create vectors with the numbers of successes and the sample sizes, respectively. These vectors then serve as input to prop.test.

> y <- c(2777, 1363) # the 2 numbers of successes
> n <- c(4208, 2763) # the 2 sample sizes
> prop.test(y, n, correct=FALSE)

    2-sample test for equality of proportions without continuity correction

data:  y out of n
X-squared = 192.0052, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
 0.1431261 0.1901321
sample estimates:
   prop 1    prop 2
0.6599335 0.4933044

The result is identical to that produced by the "by hand" method.

Comparing Two Proportions in a Dataframe

Are women more likely to be depressed than men? This conjecture can be investigated by comparing the proportions of men and women who are diagnosed as depressed (as opposed to normal) on the basis of their cesd score in the Afifi and Clark depress data. We do this by constructing a table of factor cases (with categories depressed and normal) with factor sex (with categories male and female), and inputting the table into prop.test, as follows.

> table(sex, cases)
        cases
sex      depressed normal
  male          10    101
  female        40    143
> prop.test(table(sex, cases), correct=FALSE)

    2-sample test for equality of proportions without continuity correction

data:  table(sex, cases)
X-squared = 8.0815, df = 1, p-value = 0.004472
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.20862865 -0.04834964
sample estimates:
    prop 1     prop 2
0.09009009 0.21857923

We see that the proportion depressed differs significantly between men and women (p-value = 0.004472). The estimated difference in proportions is 12.8 percentage points. We can be 95% confident that the difference between the sexes is between 4.8 and 20.9 percentage points.

Note that it is important for interpretation to enter the explanatory variable first in the table function (i.e., table(sex, cases) rather than table(cases, sex)), so that prop.test returns the conditional proportions of cases given sex, rather than the other way around. The p-value of the test, however, would be the same if we had entered cases first.

Note that the CI for the difference in proportions has negative bounds; this is because the levels for sex are in the order male, female, so prop 1 is assigned to male and prop 2 to female. We could change this by reordering the levels of sex as female, male, as follows (footnote 2).

> sex <- factor(sex, levels = c("female", "male"))
> prop.test(table(sex, cases), correct=FALSE)

    2-sample test for equality of proportions without continuity correction

data:  table(sex, cases)
X-squared = 8.0815, df = 1, p-value = 0.004472
alternative hypothesis: two.sided
95 percent confidence interval:
 0.04834964 0.20862865
sample estimates:
    prop 1     prop 2
0.21857923 0.09009009
> detach(depress) # cleanup, done last so that sex and cases stay accessible above

The CI now has positive bounds. You can check for yourself that the first row of table(sex, cases) now corresponds to female and the second row to male.

Footnote 2: Note too that we had earlier changed the order of factor cases to depressed, normal. This change is still in effect.
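The interval for the difference between women and men in the proportion depressed can be cross-checked without the data file, using only the four counts shown in the table(sex, cases) output. The following is a minimal sketch under that assumption; the object names counts, n_by_sex, phat, dif, se_dif, ci_dif and ci_prop are illustrative and do not appear in the original handout.

```r
# Counts copied from the table(sex, cases) output
counts <- matrix(c(10, 101,
                   40, 143),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(sex = c("male", "female"),
                                 cases = c("depressed", "normal")))

n_by_sex <- rowSums(counts)                    # sample size for each sex
phat <- counts[, "depressed"] / n_by_sex       # proportion depressed by sex
dif <- unname(phat["female"] - phat["male"])   # female minus male
se_dif <- sqrt(sum(phat * (1 - phat) / n_by_sex))
ci_dif <- dif + c(-1, 1) * qnorm(0.975) * se_dif
round(ci_dif, 3)                               # 0.048 0.209

# Same interval from prop.test, with the female row listed first
ci_prop <- prop.test(counts[c("female", "male"), ], correct = FALSE)$conf.int
round(as.numeric(ci_prop), 3)                  # 0.048 0.209
```

Passing a two-column matrix of successes and failures to prop.test avoids attach() and makes the row order, and hence the sign of the difference, explicit in the call.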