Stat 504, Lecture 7

Review of One-way Tables and SAS

In-class exercises: Ex1, Ex2, and Ex3 from http://v8doc.sas.com/sashtml/proc/z0146708.htm

To calculate the p-value for an X^2 or G^2 statistic in SAS, use the PROBCHI function: http://v8doc.sas.com/sashtml/lgref/z0245929.htm#z0845409. For example, if X^2 = 0.47 with df = 2, then the p-value = 1 - probchi(0.47, 2).
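For a quick sanity check outside SAS, the same upper-tail probability can be computed by hand: for df = 2 the chi-square survival function has the closed form exp(-x/2), and for df = 1 it is erfc(sqrt(x/2)). A minimal Python sketch (the function name chisq_pvalue is ours, not a SAS function):

```python
import math

def chisq_pvalue(x, df):
    """Upper-tail p-value P(chi^2_df > x), closed forms for df = 1 or 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))  # equals 2 * (1 - Phi(sqrt(x)))
    if df == 2:
        return math.exp(-x / 2)             # chi^2_2 is Exponential(mean 2)
    raise ValueError("sketch only covers df = 1 or 2")

print(round(chisq_pvalue(0.47, 2), 4))  # same value as 1 - probchi(0.47, 2) in SAS
```

As a cross-check, chisq_pvalue(3.841, 1) is approximately 0.05, matching the usual chi-square critical value for one degree of freedom.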
Introduction to Two-Way Tables

Example 1: A 2 × 2 table of counts and/or proportions.

Table 1: Incidence of common colds among French skiers (Pauling (1971), as reported in Fienberg (1980)).

                Cold   No Cold   Totals
Placebo           31       109      140
Ascorbic Acid     17       122      139
Totals            48       231      279
Table 2: Incidence of common colds among French skiers (Pauling (1971), as reported in Fienberg (1980)), as proportions of the total.

                Cold   No Cold   Totals
Placebo        0.111     0.391    0.502
Ascorbic Acid  0.061     0.437    0.498
Totals         0.172     0.828    1.000

Q1: Compare the relative frequency of occurrence of some characteristic across two groups, e.g. is the probability that a member of the placebo group contracts a cold the same as the probability that a member of the ascorbic acid group contracts a cold?

Q2: Are two characteristics independent, e.g. are the type of treatment and contracting a cold associated?

Q3: Is one characteristic a cause of another, e.g. does ascorbic acid (vitamin C) have a therapeutic value that prevents contracting a cold?
Suppose that we collect data on two binary variables, Y and Z. Binary means that these variables take two possible values, say 1 (e.g. cold) and 2 (e.g. no cold). Suppose we collect values of Y (e.g. treatment) and Z (e.g. contracting a cold) for n sample units. The data then consist of n pairs, (y_1, z_1), (y_2, z_2), ..., (y_n, z_n). We can summarize the data in a frequency table. Let x_ij be the number of sample units having Y = i and Z = j. Then x = (x_11, x_12, x_21, x_22) is a summary of all n responses, e.g. x_11 = 31. We could display x as a one-way table with four cells, but it is customary to display x as a square table with two rows and two columns:

        Z = 1   Z = 2
Y = 1   x_11    x_12
Y = 2   x_21    x_22
Marginal totals. When a subscript in a cell count x_ij is replaced by a plus sign (+), it means that we have summed the cell counts over that subscript. The row totals are

x_1+ = x_11 + x_12,    x_2+ = x_21 + x_22,

the column totals are

x_+1 = x_11 + x_21,    x_+2 = x_12 + x_22,

and the grand total is

x_++ = x_11 + x_12 + x_21 + x_22 = n.

These quantities are often called marginal totals, because they are conveniently placed in the margins of the table, like this:

        Z = 1   Z = 2   total
Y = 1   x_11    x_12    x_1+
Y = 2   x_21    x_22    x_2+
total   x_+1    x_+2    x_++
If the sample units are randomly sampled from a large population, then x = (x_11, x_12, x_21, x_22) will have a multinomial distribution with index n = x_++ and parameter vector π = (π_11, π_12, π_21, π_22), where π_ij = P(Y = i, Z = j).

        Z = 1   Z = 2   total
Y = 1   π_11    π_12    π_1+
Y = 2   π_21    π_22    π_2+
total   π_+1    π_+2    π_++ = 1

The probability distribution {π_ij} is the joint distribution of Y and Z. When you sum the joint probabilities, you get a marginal distribution, e.g. the probability distribution {π_i+} is the marginal distribution of Y, where P(Y = 1) = π_1+ and P(Y = 2) = π_2+. How does the distribution of Z change as the category of Y changes? The conditional distribution of Z given Y = i is

π_j|i = π_ij / π_i+,

which satisfies Σ_j π_j|i = 1.
In-class exercise: What is the observed conditional probability distribution P(cold | treatment)?
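One way to check your answer: the observed conditional distribution of cold status given treatment is just each cell count divided by its row total. A minimal Python sketch using the counts from Table 1:

```python
# Observed counts from Table 1 (rows: treatment; columns: Cold, No Cold)
counts = {
    "Placebo": (31, 109),
    "Ascorbic Acid": (17, 122),
}

for treatment, (cold, no_cold) in counts.items():
    row_total = cold + no_cold
    p_cold = cold / row_total  # observed P(Cold | treatment)
    print(f"P(Cold | {treatment}) = {p_cold:.3f}")
```

This prints about 0.221 for the placebo row (31/140) and 0.122 for the ascorbic acid row (17/139).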
Under the general multinomial model, the π vector contains three unknown parameters. The general multinomial model is often called the saturated model, because it contains the maximum number of unknown parameters.

Explore the geometry of 2 × 2 tables: http://www-2.cs.cmu.edu/~eairoldi/tetrahedron3d/
The independence model

Given a 2 × 2 table, it is natural to ask how Y and Z are related. Suppose for the moment that there is no relationship between Y and Z, i.e. that they are independent. Independence means that

π_ij = P(Y = i, Z = j) = P(Y = i) P(Z = j)  for i, j = 1, 2.

Let P(Y = 1) = α and P(Z = 1) = β, so that P(Y = 2) = 1 − α and P(Z = 2) = 1 − β. Under independence, we have

π_11 = P(Y = 1) P(Z = 1) = αβ,               (1)
π_12 = P(Y = 1) P(Z = 2) = α(1 − β),         (2)
π_21 = P(Y = 2) P(Z = 1) = (1 − α)β,         (3)
π_22 = P(Y = 2) P(Z = 2) = (1 − α)(1 − β).   (4)
Note that

α = π_1+ = π_11 + π_12,
1 − α = π_2+ = π_21 + π_22,
β = π_+1 = π_11 + π_21,
1 − β = π_+2 = π_12 + π_22,

so the condition of independence can be conveniently written as

π_ij = π_i+ π_+j,  i, j = 1, 2.   (5)

The primary reason that we introduced the symbols α and β for π_1+ and π_+1 is to emphasize that under the independence model there are only two unknown parameters. Once α and β are known, the vector π can be found using (1)–(4). The independence model is a submodel of (i.e. a special case of) the saturated model that satisfies the constraints (5).
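Equations (1)–(4) and constraint (5) can be verified numerically: pick any α and β, build the four cell probabilities, and check that each π_ij equals the product of its margins. A short sketch (the values of α and β are arbitrary):

```python
alpha, beta = 0.3, 0.7  # arbitrary illustrative values

# Cell probabilities under independence, equations (1)-(4)
pi = {
    (1, 1): alpha * beta,
    (1, 2): alpha * (1 - beta),
    (2, 1): (1 - alpha) * beta,
    (2, 2): (1 - alpha) * (1 - beta),
}

row = {i: pi[(i, 1)] + pi[(i, 2)] for i in (1, 2)}  # pi_{i+}
col = {j: pi[(1, j)] + pi[(2, j)] for j in (1, 2)}  # pi_{+j}

# Constraint (5): pi_ij = pi_{i+} * pi_{+j}; the table also sums to 1
for (i, j), p in pi.items():
    assert abs(p - row[i] * col[j]) < 1e-12
print(sum(pi.values()))  # sums to 1 (up to floating-point rounding)
```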
Test of independence

The hypothesis of independence can be tested using the general method described in Lecture 4. To test H_0: the independence model is true, versus H_1: the saturated model is true, do the following. First, estimate α and β, the unknown parameters of the independence model. Second, calculate estimated cell probabilities and expected frequencies from the estimated α and β. Third, calculate X^2 and/or G^2 and compare them to the appropriate chi-square distribution.
How can we estimate α and β? Under H_0, Y (e.g. treatment) and Z (e.g. cold) provide no information about one another, so we can estimate the parameters of their distributions separately. Note that

x_1+ ~ Bin(n, α)   (6)

and

x_+1 ~ Bin(n, β),  (7)

and under H_0, (6) and (7) are independent.
Therefore, the ML estimates of α and β are

α̂ = x_1+ / n  and  β̂ = x_+1 / n.

Plugging these estimates into (1)–(4) gives estimated probabilities

π̂_11 = (x_1+ / n)(x_+1 / n),   π̂_12 = (x_1+ / n)(x_+2 / n),
π̂_21 = (x_2+ / n)(x_+1 / n),   π̂_22 = (x_2+ / n)(x_+2 / n),

and estimated expected cell counts

E_11 = n π̂_11 = x_1+ x_+1 / n,   E_12 = n π̂_12 = x_1+ x_+2 / n,
E_21 = n π̂_21 = x_2+ x_+1 / n,   E_22 = n π̂_22 = x_2+ x_+2 / n.

These four formulas are conveniently summarized as

E_ij = x_i+ x_+j / n,  i, j = 1, 2,

which can be easily remembered as

expected frequency = (row total × column total) / grand total.
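Applied to the skier data in Table 1, the rule E_ij = (row total × column total) / grand total gives, for example, E_11 = 140 × 48 / 279 ≈ 24.09. A short Python sketch of the computation:

```python
# Observed 2x2 table from Table 1 (rows: Placebo, Ascorbic Acid; cols: Cold, No Cold)
obs = [[31, 109],
       [17, 122]]

n = sum(sum(row) for row in obs)                                 # grand total, 279
row_tot = [sum(row) for row in obs]                              # 140, 139
col_tot = [sum(obs[i][j] for i in range(2)) for j in range(2)]   # 48, 231

# Expected counts under independence: E_ij = row total * column total / n
expected = [[row_tot[i] * col_tot[j] / n for j in range(2)] for i in range(2)]
for row in expected:
    print([round(e, 2) for e in row])  # [24.09, 115.91] then [23.91, 115.09]
```

Note that the expected counts preserve the margins: each row and column of the expected table sums to the same totals as the observed table.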
Under H_0, both X^2 and G^2 are approximately χ^2 distributed, provided that the expected counts E_ij are sufficiently large. Under H_0 the model has 2 unknown parameters, whereas under H_1 there are 3 unknowns. The degrees of freedom are therefore ν = 3 − 2 = 1. A large value of X^2 or G^2 indicates that the independence model is not plausible, and thus that Y and Z are related. The 95th percentile of χ^2_1 is 3.84, so an observed value of X^2 or G^2 greater than 3.84 means that we can reject the null hypothesis of independence at the .05 level.
The test for independence in a 2 × 2 table is a special case of the general goodness-of-fit test discussed in Lectures 5 and 6. Therefore, all of the caveats regarding goodness-of-fit tests discussed there apply to this test also. For the chi-square approximation to work well, the E_ij's need to be sufficiently large. The iid assumption for the n sample units must be satisfied; there should be no clustering in the data.
Example. Suppose that in a sample of n = 300 hospital patients, 90 are overweight, 90 are hypertensive, and 30 are both overweight and hypertensive. Is there evidence of a relationship between these two conditions? The observed data are shown below.

                 hypertensive   not hypertensive   total
overweight             30               60            90
not overweight         60              150           210
total                  90              210           300

The expected cell counts for the four cells are

E_11 = 90 × 90 / 300 = 27,     E_12 = 90 × 210 / 300 = 63,
E_21 = 210 × 90 / 300 = 63,    E_22 = 210 × 210 / 300 = 147.

The goodness-of-fit statistics are

X^2 = (30 − 27)^2/27 + (60 − 63)^2/63 + (60 − 63)^2/63 + (150 − 147)^2/147 = 0.68,
and

G^2 = 2 [30 log(30/27) + 60 log(60/63) + 60 log(60/63) + 150 log(150/147)] = 0.67.

These do not exceed 3.84, so we cannot reject the independence model at the .05 level. An approximate p-value is P(χ^2_1 ≥ 0.68) = .41. On the basis of these data, there is little evidence of a relationship between the two conditions.
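The whole calculation above can be replicated in a few lines. This Python sketch (ours, not part of the course software) recomputes X^2, G^2, and the approximate p-value, using the closed-form chi-square tail erfc(sqrt(x/2)) for df = 1:

```python
import math

obs = [30, 60, 60, 150]   # observed counts, row by row
exp = [27, 63, 63, 147]   # expected counts under independence

X2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))         # Pearson statistic
G2 = 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp))  # deviance

# P(chi^2_1 > x) = erfc(sqrt(x/2)) for one degree of freedom
p_value = math.erfc(math.sqrt(X2 / 2))

print(round(X2, 3), round(G2, 3), round(p_value, 2))  # 0.68 0.673 0.41
```

The values agree with the hand calculation above and with the Minitab and R output shown on the following slides.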
The test for independence in a 2 × 2 table can be done in Minitab using the chisq command:

MTB > read c1-c2
DATA> 30 60
DATA> 60 150
DATA> end
      2 rows read.
MTB > chisq c1-c2

Expected counts are printed below observed counts

            C1       C2    Total
    1       30       60       90
         27.00    63.00
    2       60      150      210
         63.00   147.00
Total       90      210      300

ChiSq = 0.333 + 0.143 +
        0.143 + 0.061 = 0.680
df = 1

Note that Minitab gives only Pearson's X^2. Calculating the deviance G^2 in Minitab is a little more tedious. One way to do it is to enter the cell counts in a single column, say, C1. Then enter the row sums and column sums in C2 and C3, respectively. Then calculate the expected cell counts and put them
into C4.

MTB > set c1        # enter observed counts
DATA> 30 60 60 150
DATA> end
MTB > set c2        # enter row sums
DATA> 90 90 210 210
DATA> end
MTB > set c3        # enter column sums
DATA> 90 210 90 210
DATA> end
MTB > let c4 = c2*c3/300             # calculate expected counts
MTB > let k1 = 2*sum(c1*log(c1/c4))  # calculate G^2
MTB > print k1

K1    0.672805
In R or S-PLUS the Pearson X^2 test is easily carried out using the chisq.test() function. By default, this function employs the continuity correction proposed by Yates (1934) for a 2 × 2 table. This correction is not universally regarded as appropriate, however, so we will not use it. To turn off the Yates correction, include correct=F as an argument to chisq.test().

> x_c(30,60,60,150)   # enter data
> x_matrix(x,2,2)     # convert to a matrix
> chisq.test(x,correct=F)

        Pearson's chi-square test without Yates' continuity correction

data:  x
X-squared = 0.6803, df = 1, p-value = 0.4095

To calculate G^2 in R or S-PLUS, you need to go through essentially the same steps as in Minitab.

> ob_c(30,60,60,150)
> rsum_c(90,90,210,210)
> csum_c(90,210,90,210)
> ex_rsum*csum/300
> G2_2*sum(ob*log(ob/ex))
> G2
[1] 0.6728037
In SAS, the chisq option of the TABLES statement in PROC FREQ will, for two-way tables and above, give you both the Pearson X^2 statistic and the deviance G^2. See: http://v8doc.sas.com/sashtml/proc/zreq-ex3.htm
Multinomial sampling: In one type of experiment, we draw a sample of n = x_++ subjects from a population and record (Y, Z) for each subject. Then the joint distribution of {x_ij} is multinomial with index n and parameter π = {π_ij}, where π_ij = P(Y = i, Z = j), and the grand total n is fixed and known. Sometimes we express the parameters as the cell means m_ij = E(x_ij) = n π_ij.
Poisson sampling: x_ij ~ Poisson(m_ij) independently for i = 1, ..., I and j = 1, ..., J. In this scheme, the overall total n is not fixed. Example: You sit by the roadside for one hour with a radar gun, checking the speed of each car as it passes by. You record Y = color of the car (1 = black, 2 = white, 3 = red, 4 = other) and Z = whether the car's speed exceeds the legal limit (1 = yes, 2 = no).
In Lecture 4, we argued that the likelihood function may be factored into the product of a Poisson likelihood for n,

n ~ Poisson(m_++),

and a multinomial likelihood for {x_ij} given n, with parameters π_ij = m_ij / m_++. The total n provides no information about π = {π_ij}. From a likelihood standpoint, we get the same inferences about π whether n is regarded as fixed or random. Therefore, if m_++ is not of interest, Poisson data may be analyzed as if they were multinomial. Conversely, if data are multinomial, we may analyze them as if they were Poisson. The inferences for π are valid, and the inferences for m_++ should be ignored.
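This factorization can be checked numerically: the product of independent Poisson probabilities for the cells equals the Poisson probability of the total times the multinomial probability of the cells given that total. A small sketch with arbitrary means and counts (the helper functions are ours, written from the standard pmf formulas):

```python
import math

def poisson_pmf(k, m):
    """P(X = k) for X ~ Poisson(m)."""
    return math.exp(-m) * m ** k / math.factorial(k)

def multinomial_pmf(x, p):
    """P(X = x) for X ~ Multinomial(sum(x), p)."""
    coef = math.factorial(sum(x))
    for k in x:
        coef //= math.factorial(k)
    return coef * math.prod(pi ** k for pi, k in zip(p, x))

m = [2.0, 5.0, 3.0, 10.0]   # arbitrary cell means m_ij
x = [1, 4, 2, 12]           # arbitrary observed counts x_ij
n, m_tot = sum(x), sum(m)

lhs = math.prod(poisson_pmf(k, mi) for k, mi in zip(x, m))   # independent Poissons
rhs = poisson_pmf(n, m_tot) * multinomial_pmf(x, [mi / m_tot for mi in m])  # factored form
assert abs(lhs - rhs) <= 1e-12 * lhs
```

The two sides agree to floating-point precision, illustrating why inference about π is the same under the two sampling schemes.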
Product-multinomial sampling: Decide beforehand that we will draw x_i+ subjects with characteristic Y = i (i = 1, ..., I) and record the Z-value for each one. In this scenario, each row of the table, (x_i1, x_i2, ..., x_iJ)^T, is multinomial with probabilities π_j|i = π_ij / π_i+, and the rows are independent. Viewing the data as product-multinomial is appropriate when the row totals truly are fixed by design, as in

- stratified random sampling (strata defined by Y),
- an experiment where Y = treatment group.

It's also appropriate when the row totals are not fixed, but we are interested in P(Z | Y) and not P(Y). That is, when Z is the outcome of interest, and Y is an explanatory variable that we do not wish to model.
Suppose the data are multinomial. Then by results from Lecture 4, we may factor the likelihood into two parts:

- a multinomial likelihood for the row totals, (x_1+, x_2+, ..., x_I+)^T, with index n and parameters {π_i+},
- I independent multinomial likelihoods for the rows, (x_i1, x_i2, ..., x_iJ)^T, with parameters {π_j|i = π_ij / π_i+}.

Therefore, if the parameters of interest to us can be expressed as functions only of the π_j|i's and not the π_i+'s, then correct likelihood-based inferences may be obtained by treating the data as if they were product-multinomial. Conversely, if the data are product-multinomial, then correct likelihood-based inferences about functions of the π_j|i's will be obtained if we analyze the data as if they were multinomial. We may also treat them as Poisson, ignoring any inferences about m_++ or m_i+.
Hypergeometric sampling: In a few rare examples, we may encounter data where both the row totals, (x_1+, ..., x_I+)^T, and the column totals, (x_+1, ..., x_+J)^T, are fixed by design. The best-known example of this is Fisher's hypothetical example of the lady tasting tea, which we will discuss soon. In a 2 × 2 table, the resulting sampling distribution is hypergeometric. Even when both sets of marginal totals are not fixed by design, some statisticians like to condition on them and perform exact inference when the sample size is small and asymptotic approximations are unlikely to work well. Methods for exact inference will be discussed later.
Next lecture:

Suggested reading: Ch. 2 and Ch. 3 of Agresti.

Next week we'll cover the test of independence, measures of association, and exact tests for 2 × 2 and I × J tables.

There is no regular homework assignment due next week. However, there is an EXTRA credit assignment due on Tuesday, Feb. 8, 2005.

1. For the French skier example, are the two variables independent; i.e., are the treatment and the response independent?

2. What seems to be the most reasonable sampling scheme for this problem? E.g., if you were to design the study, which sampling model discussed today would you apply, and why?

3. Read the on-line information (example) on the analysis of 2 × 2 tables in SAS (see slide 21). Run the analysis of the overweight example in SAS. Submit your code and compare your results to what we got in class today. What is the most appropriate sampling model for this example, and why?