Soci Data Analysis in Sociological Research. Homework 4 Computer Handout. Chapter 19 Confidence Intervals for Proportions

University of North Carolina Chapel Hill
Soci252-002 Data Analysis in Sociological Research
Spring 2013
Professor François Nielsen

Homework 4 Computer Handout

Readings

This handout covers computer issues related to Chapters 18, 19, 20, 21 and 22 in De Veaux et al. 2012. Stats: Data and Models. 3e. Addison-Wesley. (STATSDM3)

Chapter 18 Sampling Distribution Models

See Computer Handout for Homework 3 and Activity 12 and 13 for discussion on how to simulate sampling distributions using R.

Chapter 19 Confidence Intervals for Proportions

Calculating a CI for a Proportion by hand

I illustrate calculating a confidence interval for a proportion with the example of 510 randomly sampled adults in October 2008 responding to the question "Generally speaking, do you believe the death penalty is applied fairly or unfairly in this country today?", in which 275 (54%) answered "Fairly" (STATSDM3, pp. 464-466). Using R as a calculator, one would proceed as follows.

    > n <- 510
    > phat <- 275/510
    > SE <- sqrt(phat*(1 - phat)/n)
    > SE
    [1] 0.02207217
    > alpha <- .05
    > zstar <- qnorm(1 - alpha/2)  # z for p = .975
    > zstar
    [1] 1.959964
    > ME <- zstar*SE
    > c(phat - ME, phat + ME)
    [1] 0.4959550 0.5824763

Thus we are 95% confident that between 49.6% and 58.2% of adults think that the death penalty is applied fairly.

Calculating a CI for a Proportion with prop.test

The R function prop.test calculates the CI for a proportion. It is used as

    prop.test(x, n, conf.level = 0.95, correct = TRUE)

where x is the number of successes, n is the sample size, conf.level is the desired confidence level (95% by default) and correct indicates whether a continuity correction is used. For the death penalty example, prop.test is used as follows, specifying correct = FALSE.

    > prop.test(275, 510, correct=FALSE)

            1-sample proportions test without continuity correction

    data:  275 out of 510, null probability 0.5
    X-squared = 3.1373, df = 1, p-value = 0.07652
    alternative hypothesis: true p is not equal to 0.5
    95 percent confidence interval:
     0.4958229 0.5820222
    sample estimates:
            p
    0.5392157

We see that this confidence interval is very close to the one calculated by hand. Note that I am using the option correct = FALSE here only to make the results most comparable to those in the text. In general, however, the continuity correction does no harm and we would leave the default option correct = TRUE as is.

CI for a Proportion for a Factor in a Dataframe

In practice we often want to calculate a CI for a proportion from the original, ungrouped data stored as a factor in a data frame. To illustrate I calculate a CI for the proportion of depressed (as opposed to normal) respondents in Afifi and Clark's depress data set. The (confusingly named) variable cases is a factor taking the value depressed if the respondent has cesd >= 16 and normal otherwise. Before doing anything I need to change the order of the levels of factor cases so that depressed comes first (see footnote 1). I then use prop.test after first tabulating the values of cases with the table function.

    > # reading the Afifi and Clark data
    > library(foreign)
    > depress <- read.dta("depress.dta")  # read Stata data set
    > depress$cases <- factor(depress$cases, levels = c("depressed", "normal"))
    > attach(depress)  # to make variable names accessible
    > head(cases, 10)  # look at first 10 observations
     [1] normal    normal    normal    normal    normal    normal    normal
     [8] normal    depressed normal
    Levels: depressed normal
    > tab <- table(cases)
    > tab
    cases
    depressed    normal
           50       244
    > prop.test(tab, correct=FALSE)

            1-sample proportions test without continuity correction

    data:  tab, null probability 0.5
    X-squared = 128.0136, df = 1, p-value < 2.2e-16
    alternative hypothesis: true p is not equal to 0.5
    95 percent confidence interval:
     0.1314451 0.2172017
    sample estimates:
           p
    0.170068

Footnote 1: This is because the two-entry table input into prop.test() must contain the numbers of successes and failures, in that order. In this context "success" means being depressed. If I didn't change the order of the levels, prop.test() would give me a CI for the proportion of normal, rather than depressed, respondents.

We can be 95% confident that the proportion of depressed respondents in the population sampled is between 13.1% and 21.7%. Note that I have spelled out the steps in detail. Once we understand what is going on we can just enter prop.test(table(cases)) to obtain our CI in one fell swoop (just making sure that the level corresponding to "success" is listed first). Note also that the hypothesis-testing part of the output can be ignored here, as it tests the default hypothesis p = .5, which is not meaningful in this context.

Chapter 20 Testing Hypotheses About Proportions

Hypothesis Test for One Proportion by hand

To illustrate a hypothesis test for one proportion I use the example of the home field advantage hypothesis in the 2009 Major League Baseball season (STATSDM3, pp. 485-486), in which the home team won 1333 (54.8%) of the 2430 games. Could this deviation from 50% be due to chance or is there really a home field advantage in professional baseball? We set up the hypothesis to be tested as H0: p = .50; HA: p > .50 and proceed as follows.

    > p0 <- .5
    > n <- 2430
    > phat <- 1333/2430
    > phat
    [1] 0.5485597
    > SD <- sqrt(p0*(1-p0)/n)  # note this is SD, not SE; why?
    > z <- (phat - p0)/SD      # test statistic
    > z
    [1] 4.787501
    > 1 - pnorm(z)
    [1] 8.44355e-07

The very small p-value indicates that the 54.86% proportion of wins by the home team is unlikely to occur by chance if the probability of winning is .5. Thus we reject the hypothesis that the home team has no advantage.

Hypothesis Test for One Proportion with prop.test

R function prop.test can test hypotheses as well as calculate confidence intervals. To test the hypothesis that the proportion of wins by the home team is greater than .5 we proceed as follows.

    > prop.test(x=1333, n=2430, p=.5, alternative="greater", correct=FALSE)

            1-sample proportions test without continuity correction

    data:  1333 out of 2430, null probability 0.5
    X-squared = 22.9202, df = 1, p-value = 8.444e-07
    alternative hypothesis: true p is greater than 0.5
    95 percent confidence interval:
     0.5319099 1.0000000
    sample estimates:
            p
    0.5485597

I specified alternative="greater" because the alternative hypothesis HA: p > p0 = .5 is one-sided in the positive direction. The two-sided hypothesis or the one-sided hypothesis in the negative direction would be indicated by alternative="two.sided" (default) and alternative="less", respectively. Because the alternative hypothesis is one-sided ("greater"), the confidence interval (.532, 1.000) provided by prop.test is also one-sided. However, we will not consider one-sided confidence intervals further in this course.

Chapter 21 More About Tests and Intervals

One-sided and Two-sided P-Values

The prop.test function in R provides the correct p-value according to whether the test is one-sided (alternative = "greater" or alternative = "less"), or two-sided (alternative = "two.sided"). For a given p0 the p-value of the two-sided test is twice that of the corresponding one-sided test (provided the observed proportion deviates from p0 in the direction of the one-sided alternative). For example, for the home team advantage example, the two-sided test that the probability of home team win is actually .5 is as follows.

    > prop.test(x=1333, n=2430, p=.5, alternative="two.sided", correct=FALSE)

            1-sample proportions test without continuity correction

    data:  1333 out of 2430, null probability 0.5
    X-squared = 22.9202, df = 1, p-value = 1.689e-06
    alternative hypothesis: true p is not equal to 0.5
    95 percent confidence interval:
     0.5287125 0.5682535
    sample estimates:
            p
    0.5485597

We see that the p-value 1.689e-06 of the two-sided test is twice the p-value 8.444e-07 found above for the one-sided test.

The Agresti-Coull "Plus Four" Interval

When the sample has fewer than 10 successes or failures, the Agresti-Coull "Plus Four" interval can be calculated by adding 2 successes and 2 failures (thus 4 cases to the total count) and calculating the confidence interval with prop.test. The example of the 45 surgical operations with 3 failures (STATSDM3, p. 511) does not satisfy the Success/Failure Condition. Thus we calculate the Agresti-Coull interval as follows.

    > prop.test(x = 3+2, n = 45+4, correct=FALSE)

            1-sample proportions test without continuity correction

    data:  3 + 2 out of 45 + 4, null probability 0.5
    X-squared = 31.0408, df = 1, p-value = 2.527e-08
    alternative hypothesis: true p is not equal to 0.5
    95 percent confidence interval:
     0.04437955 0.21756362
    sample estimates:
            p
    0.1020408

The confidence interval (0.04437955, 0.21756362) reported by prop.test differs from the interval (.017, .187) reported in the text (p. 511). I do not know why at the moment.
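One likely explanation, offered here as a sketch rather than as the handout's own conclusion: prop.test builds its interval by inverting the score test (the Wilson interval), whereas the plus-four interval in the text appears to be the ordinary "estimate plus or minus z* times SE" interval applied to the adjusted counts. Under that assumption, repeating the by-hand method of Chapter 19 on 3+2 failures out of 45+4 operations gives bounds that round to the (.017, .187) reported in the text. The variable names below are illustrative only.

```r
# Plus-four interval computed by hand.
# Assumption: the text uses the Wald form (estimate +/- z* x SE) on the
# adjusted counts; variable names are illustrative.
x_adj <- 3 + 2                       # observed failures plus 2
n_adj <- 45 + 4                      # operations plus 4
p_adj <- x_adj / n_adj               # adjusted sample proportion
se_adj <- sqrt(p_adj * (1 - p_adj) / n_adj)
zstar <- qnorm(0.975)                # critical value for 95% confidence
ci_plus4 <- c(p_adj - zstar * se_adj, p_adj + zstar * se_adj)
round(ci_plus4, 3)
```

The same difference in method would also account for the small gap between the by-hand interval and the prop.test interval in the death penalty example of Chapter 19.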

Chapter 22 Comparing Two Proportions

Comparing Two Proportions by hand

To illustrate comparison of two proportions I use the example of seat-belt use by male drivers depending on whether a woman is sitting next to them. In these data, of 4208 male drivers with female passengers 2777 (66.0%) used their seat belt. Among 2763 male drivers with male passengers only, 1363 (49.3%) wore seat belts (STATSDM3, p. 525). Using R as a calculator, one could proceed as follows (STATSDM3, pp. 529-531).

    > nf <- 4208
    > nm <- 2763
    > phatf <- 2777/4208
    > phatm <- 1363/2763
    > SE <- sqrt(phatf*(1-phatf)/nf + phatm*(1-phatm)/nm)
    > SE
    [1] 0.01199155
    > zstar <- qnorm(.975)
    > zstar
    [1] 1.959964
    > ME <- zstar*SE
    > dif <- phatf - phatm
    > dif
    [1] 0.1666291
    > c(dif - ME, dif + ME)  # CI for difference
    [1] 0.1431261 0.1901321

This corresponds closely to the result in the text.

Comparing Two Proportions with prop.test

To compare the two proportions with prop.test we need to create vectors with the numbers of successes and sample sizes, respectively. These vectors then serve as input to prop.test.

    > y <- c(2777, 1363)  # the 2 numbers of successes
    > n <- c(4208, 2763)  # the 2 sample sizes
    > prop.test(y, n, correct=FALSE)

            2-sample test for equality of proportions without continuity correction

    data:  y out of n
    X-squared = 192.0052, df = 1, p-value < 2.2e-16
    alternative hypothesis: two.sided
    95 percent confidence interval:
     0.1431261 0.1901321
    sample estimates:
       prop 1    prop 2
    0.6599335 0.4933044

The result is identical to that produced by the "by hand" method.

Comparing Two Proportions in a Dataframe

Are women more likely to be depressed than men? This conjecture can be investigated by comparing the proportions of men and women who are diagnosed as depressed (as opposed to normal) on the basis of their cesd score in the Afifi and Clark depress data. We do this by constructing a table of factor cases (with categories depressed and normal) with factor sex (with categories male and female), and inputting the table into prop.test, as follows.

    > table(sex, cases)
            cases
    sex      depressed normal
      male          10    101
      female        40    143
    > prop.test(table(sex, cases), correct=FALSE)

            2-sample test for equality of proportions without continuity correction

    data:  table(sex, cases)
    X-squared = 8.0815, df = 1, p-value = 0.004472
    alternative hypothesis: two.sided
    95 percent confidence interval:
     -0.20862865 -0.04834964
    sample estimates:
        prop 1     prop 2
    0.09009009 0.21857923

We see that the proportion depressed differs significantly between men and women (p-value = 0.004472). The estimated difference in proportions is 12.8%. We can be 95% confident that the difference between the sexes is between 4.8% and 20.9%. Note that it is important for interpretation to enter the explanatory variable first in the table function (i.e., table(sex, cases) rather than table(cases, sex)), so prop.test returns the conditional proportions of cases given sex, rather than the other way around. The p-value of the test, however, would be the same if we had entered cases first.

Note that the CI for the difference in proportion has negative bounds; this is because levels for sex are in the order male, female, so prop 1 is assigned to male and prop 2 to female. We could change this by reordering the levels of sex as female, male, as follows (see footnote 2).

    > sex <- factor(sex, levels = c("female", "male"))
    > prop.test(table(sex, cases), correct=FALSE)

            2-sample test for equality of proportions without continuity correction

    data:  table(sex, cases)
    X-squared = 8.0815, df = 1, p-value = 0.004472
    alternative hypothesis: two.sided
    95 percent confidence interval:
     0.04834964 0.20862865
    sample estimates:
        prop 1     prop 2
    0.21857923 0.09009009

    > detach(depress)  # cleanup, done last so that sex and cases stay accessible above

The CI now has positive bounds. You can check for yourself that the first row of table(sex, cases) now corresponds to female and the second row to male.

Footnote 2: Note too that we had earlier changed the order of factor cases to depressed, normal. This change is still in effect.
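
As a check on the two-sample output, the X-squared statistic that prop.test reports with correct=FALSE is the square of the usual two-proportion z statistic computed with the pooled proportion. The sketch below recomputes it by hand from the counts in table(sex, cases). The counts are typed in directly rather than read from the depress data frame, and the variable names are illustrative.

```r
# Two-proportion z test by hand, using the counts from table(sex, cases).
# Variable names are illustrative; counts are entered manually.
x <- c(female = 40, male = 10)       # numbers of depressed respondents
n <- c(female = 183, male = 111)     # group sizes (40 + 143 and 10 + 101)
phat <- x / n                        # sample proportions by sex
ppool <- sum(x) / sum(n)             # pooled proportion under H0: p1 = p2
se_pool <- sqrt(ppool * (1 - ppool) * sum(1 / n))
z <- unname((phat["female"] - phat["male"]) / se_pool)
z^2                                  # compare with X-squared = 8.0815
2 * (1 - pnorm(abs(z)))              # two-sided p-value, compare with 0.004472
```

Note that the pooled standard error is used for the test, while the confidence interval for the difference uses the unpooled standard error, as in the seat-belt example computed by hand.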