STAC51: Categorical data Analysis


STAC51: Categorical data Analysis
Mahinda Samarakoon
January 26, 2016

Table of contents
1. Contingency Tables

Contingency Tables

The two-way table below shows the distribution of the members of a fitness club classified by two variables.

                    Women    Men
Vegetarian              9      3
Non-vegetarian          8     10

Does an association exist between gender and food habits (being a vegetarian or a non-vegetarian)? Are women in this club more likely to be vegetarians than men?

Joint distribution

The three probability distributions below are often used when analyzing data from a contingency table.

Joint distribution: The joint distribution of $X$ and $Y$ is defined by the collection of probabilities $\pi_{ij} = P(X = i, Y = j)$ for $i = 1, \ldots, I$ and $j = 1, \ldots, J$. If $n_{ij}$ denotes the observed number of observations in the $i$th row and $j$th column, $\pi_{ij}$ is estimated by $p_{ij} = n_{ij}/n$, where $n = \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}$.

Marginal distribution

Marginal distribution: The probability distribution of $X$ is called the marginal distribution of $X$. The probability mass function of $X$, i.e. $f_X(i) = P(X = i)$ for $i = 1, \ldots, I$, identifies the marginal distribution of $X$. It is estimated by $\hat{f}_X(i) = n_{i+}/n$ for $i = 1, \ldots, I$, where $n_{i+} = \sum_{j=1}^{J} n_{ij}$.

Similarly, the probability mass function of $Y$, i.e. $f_Y(j) = P(Y = j)$ for $j = 1, \ldots, J$, identifies the marginal distribution of $Y$. It is estimated by $\hat{f}_Y(j) = n_{+j}/n$ for $j = 1, \ldots, J$, where $n_{+j} = \sum_{i=1}^{I} n_{ij}$.
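The estimates $p_{ij} = n_{ij}/n$, $n_{i+}/n$ and $n_{+j}/n$ can be checked numerically. The following is a minimal Python sketch using the fitness club counts; the function names and the choice of food habits as rows and gender as columns are assumptions made for illustration only.

```python
# Fitness club counts: rows are food habits, columns are gender (women, men).
counts = [
    [9, 3],    # vegetarian
    [8, 10],   # non-vegetarian
]

def total(table):
    """Grand total n = sum of all cell counts."""
    return sum(sum(row) for row in table)

def joint_proportions(table):
    """Estimated joint distribution p_ij = n_ij / n."""
    n = total(table)
    return [[cell / n for cell in row] for row in table]

def row_marginals(table):
    """Estimated row marginal distribution n_i+ / n."""
    n = total(table)
    return [sum(row) / n for row in table]

def column_marginals(table):
    """Estimated column marginal distribution n_+j / n."""
    n = total(table)
    return [sum(col) / n for col in zip(*table)]

print(joint_proportions(counts))
print(row_marginals(counts))
print(column_marginals(counts))
```

Each marginal proportion is the sum of the joint proportions in the corresponding row or column, so both marginal distributions sum to 1.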

Conditional distribution

Conditional distribution: A conditional distribution refers to the probability distribution of $Y$ at a fixed level of $X$. The conditional distribution of $Y$ given $X = i$ can be estimated by $n_{ij}/n_{i+}$, $j = 1, \ldots, J$. Here $Y$ is treated as the dependent (response) variable and $X$ as the independent (explanatory) variable.

Conditional distribution

In the fitness club data above, $Y$ is food habits (being a vegetarian or a non-vegetarian) and $X$ is gender. The estimated conditional distribution (i.e. conditional proportions) for women (say $X = 1$) is given below:

Y                 Probability
Vegetarian        9/(9 + 8) = 0.53
Non-vegetarian    8/(9 + 8) = 0.47

The estimated conditional distribution (i.e. conditional proportions) for men (say $X = 2$) is given below:

Y                 Probability
Vegetarian        3/(3 + 10) = 0.23
Non-vegetarian    10/(3 + 10) = 0.77

The proportion of vegetarians among women is more than twice that among men.
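The conditional proportions above can be reproduced with a short Python sketch. The dictionary layout and the helper name `conditional_distribution` are illustrative assumptions, not part of the course material.

```python
# Fitness club counts grouped by gender (the conditioning variable X).
counts = {
    "women": {"vegetarian": 9, "non-vegetarian": 8},
    "men": {"vegetarian": 3, "non-vegetarian": 10},
}

def conditional_distribution(cell_counts):
    """Estimate P(Y = j | X = i) by n_ij / n_i+ within one level of X."""
    row_total = sum(cell_counts.values())
    return {level: count / row_total for level, count in cell_counts.items()}

for gender, cells in counts.items():
    proportions = conditional_distribution(cells)
    print(gender, {level: round(p, 2) for level, p in proportions.items()})
```

Dividing by the row total $n_{i+}$ rather than the grand total $n$ is what distinguishes a conditional proportion from a joint proportion.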

Sensitivity and Specificity in Diagnostic Tests

Diagnostic testing is used to detect many medical conditions. For example, a test can detect cancer in a population. The result of a diagnostic test is said to be positive if it states that the disease is present and negative if it states that the disease is absent. The accuracy of diagnostic tests is often assessed with two conditional probabilities:

Given that a subject has the disease, the probability that the diagnostic test is positive is called the sensitivity.

Given that the subject does not have the disease, the probability that the test is negative is called the specificity.

Sensitivity and Specificity in Diagnostic Tests

If $X$ denotes the true state of a person, with categories 1 = the person has the disease and 0 = the person does not have the disease, and $Y$ denotes the outcome of the diagnostic test, with categories 1 = positive and 0 = negative, then

sensitivity = $P(Y = 1 \mid X = 1)$, specificity = $P(Y = 0 \mid X = 0)$.

The higher the sensitivity and specificity, the better the diagnostic test. If you get a positive result on a diagnostic test, you might be interested in the probability that you really have the disease, i.e. $P(X = 1 \mid Y = 1)$. This may be low even if sensitivity and specificity are both high.

Sensitivity and Specificity in Diagnostic Tests: Example (Agresti)

The data are from a screening test for HIV that was performed on a group of 100,000 people. Note that the prevalence rate of HIV in this group was very low.

                        Test result
HIV status      Positive    Negative      Total
Positive             475          25        500
Negative           4,975      94,525     99,500
Total              5,450      94,550    100,000

In this study the estimated sensitivity = 475/500 = 0.95 and the estimated specificity = 94525/99500 = 0.95.
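The HIV screening figures also illustrate why $P(X = 1 \mid Y = 1)$ can be low when prevalence is low. The sketch below uses only the four counts quoted in the example (475, 500, 94525, 99500); the variable names, and the label "positive predictive value" for $P(X = 1 \mid Y = 1)$, are assumptions added here.

```python
# Counts quoted in the HIV screening example.
true_positives = 475       # HIV positive and test positive
hiv_positive = 500         # all HIV-positive subjects
true_negatives = 94525     # HIV negative and test negative
hiv_negative = 99500       # all HIV-negative subjects

# HIV-negative subjects who nevertheless tested positive.
false_positives = hiv_negative - true_negatives

sensitivity = true_positives / hiv_positive
specificity = true_negatives / hiv_negative

# Estimated P(X = 1 | Y = 1): the share of positive tests that are true positives.
positive_predictive_value = true_positives / (true_positives + false_positives)

print(f"sensitivity = {sensitivity:.2f}")
print(f"specificity = {specificity:.2f}")
print(f"P(HIV | positive test) = {positive_predictive_value:.4f}")
```

Although both sensitivity and specificity are 0.95, fewer than one in ten positive results belongs to an HIV-positive person, because the 4,975 false positives outnumber the 475 true positives.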

Sensitivity and Specificity in Diagnostic Tests: Example (Agresti)

Breast cancer is the most common form of cancer in women. Of women who get mammograms at any given time, it has been estimated that 1% truly have breast cancer. Typical values reported for mammograms are sensitivity = 0.86 and specificity = 0.88. If these values are correct, then given that a mammogram has a positive result, what is the probability that this person truly has breast cancer?

Solution: Let $C$ denote the event that the woman has breast cancer, and let $+$ and $-$ denote a positive and a negative mammogram. We have $P(C) = 0.01$, $P(+ \mid C) = 0.86$ and $P(- \mid C^c) = 0.88$. Then

$$P(C \cap +) = P(+ \mid C) P(C) = 0.86 \times 0.01 = 0.0086,$$

$$P(C^c \cap -) = P(- \mid C^c) P(C^c) = 0.88 \times (1 - 0.01) = 0.8712.$$

Since $P(C^c \cap +) + P(C^c \cap -) = P(C^c) = 0.99$, we get $P(C^c \cap +) = 0.99 - 0.8712 = 0.1188$, and

$$P(+) = P(C \cap +) + P(C^c \cap +) = 0.0086 + 0.1188 = 0.1274,$$

$$P(C \mid +) = \frac{P(C \cap +)}{P(+)} = \frac{0.0086}{0.1274} = 0.0675.$$
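The same calculation can be written as a small reusable function. This is a sketch under the values used in the solution (prevalence 0.01, sensitivity 0.86, specificity 0.88); the function name and argument names are hypothetical.

```python
def probability_of_disease_given_positive(prevalence, sensitivity, specificity):
    """Bayes' theorem: P(C | +) = P(+ | C) P(C) / P(+)."""
    # P(C and +): diseased and testing positive.
    true_positive_prob = sensitivity * prevalence
    # P(C^c and +): not diseased but testing positive.
    false_positive_prob = (1.0 - specificity) * (1.0 - prevalence)
    # P(+) is the sum of the two ways of getting a positive result.
    return true_positive_prob / (true_positive_prob + false_positive_prob)

posterior = probability_of_disease_given_positive(
    prevalence=0.01, sensitivity=0.86, specificity=0.88
)
print(round(posterior, 4))
```

Varying `prevalence` while holding the other two arguments fixed shows how strongly the answer depends on how common the disease is.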

Independence of Two Categorical Variables

Definition: The random variables $X$ and $Y$ are said to be related if the conditional distribution of $Y$ given that $X = x$ changes as $x$ changes.

Definition: The random variables $X$ and $Y$ are said to be statistically independent if the conditional distribution of $Y$ given that $X = x$ is identical at each level of $x$.

Note that $X$ and $Y$ are statistically independent if and only if $\pi_{ij} = \pi_{i+} \pi_{+j}$ for all $i$ and $j$.
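The factorization criterion $\pi_{ij} = \pi_{i+} \pi_{+j}$ is easy to check for a given joint distribution. The Python sketch below does so up to a small numerical tolerance; the two joint distributions are hypothetical examples invented for illustration and are not taken from the slides.

```python
def is_independent(joint, tol=1e-9):
    """Check pi_ij == pi_i+ * pi_+j for every cell of a joint distribution."""
    row_marginals = [sum(row) for row in joint]
    col_marginals = [sum(col) for col in zip(*joint)]
    return all(
        abs(joint[i][j] - row_marginals[i] * col_marginals[j]) <= tol
        for i in range(len(joint))
        for j in range(len(joint[0]))
    )

# Hypothetical joint distributions (each sums to 1).
independent_example = [[0.12, 0.28], [0.18, 0.42]]   # marginals (0.4, 0.6) and (0.3, 0.7)
dependent_example = [[0.30, 0.10], [0.10, 0.50]]

print(is_independent(independent_example))
print(is_independent(dependent_example))
```

A single cell that violates the product rule is enough to rule out independence, which is why the check runs over all cells.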

Independence of Two Categorical Variables: Example

The joint distribution of the two random variables $X$ and $Y$ is given below:

[Table of joint probabilities with rows $x$ and columns $y$; the cell values are missing.]

Are $X$ and $Y$ independent? Why?

Poisson, binomial, and multinomial sampling

What is the distribution of counts in a contingency table? There are four possible cases.

Poisson sampling: In some cases we can treat the cells of an $I \times J$ contingency table as independent Poisson random variables; i.e., the numbers of observations in the cells are $N_{ij} \sim \text{Poisson}(\mu_{ij})$, independently. Thus,

$$P(N_{ij} = n_{ij}) = \frac{e^{-\mu_{ij}} \mu_{ij}^{n_{ij}}}{n_{ij}!}, \quad n_{ij} = 0, 1, 2, \ldots$$

Independent Poisson sampling is appropriate when the total sample size $n$ is not fixed.
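To make the Poisson sampling model concrete, the sketch below evaluates the cell probability $e^{-\mu} \mu^{n} / n!$ and, using independence across cells, the probability of a whole table. The cell means and observed counts are hypothetical values chosen only for illustration.

```python
import math

def poisson_pmf(n, mu):
    """P(N = n) for N ~ Poisson(mu)."""
    return math.exp(-mu) * mu ** n / math.factorial(n)

def table_probability(counts, means):
    """Joint probability of a table whose cells are independent Poisson counts."""
    prob = 1.0
    for count_row, mean_row in zip(counts, means):
        for n_ij, mu_ij in zip(count_row, mean_row):
            prob *= poisson_pmf(n_ij, mu_ij)
    return prob

# Hypothetical cell means and one hypothetical observed 2 x 2 table.
hypothetical_means = [[2.0, 5.0], [1.0, 4.0]]
observed = [[1, 6], [0, 3]]
print(table_probability(observed, hypothetical_means))
```

Because no total is fixed under this model, the grand total of the table is itself random.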

Poisson sampling: Example

These data are from records of accidents in 1988 compiled by the Department of Highway Safety and Motor Vehicles in Florida (Agresti, 1990, 1996).

                       Injury
Seat belt use     Fatal     Nonfatal
No                …         …,527
Yes               …         …,368

Multinomial sampling

When $n$ is fixed (or conditional on the sample size), multinomial sampling occurs over all of the cells of the contingency table; i.e.,

$$(N_{11}, N_{12}, \ldots, N_{IJ}) \sim \text{Multinomial}(n, \pi_{11}, \pi_{12}, \ldots, \pi_{IJ}).$$

(Agresti 3rd ed, p. 41) Suppose the researchers randomly sample 200 police records of accidents and classify each according to seat-belt use and outcome of the accident (fatal or non-fatal). For this study the total sample size $n$ is fixed ($n = 200$). They might treat the numbers of observations at the four combinations of seat-belt use and outcome of accident as a multinomial random vector with unknown joint probabilities $(\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22})$.
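For a fixed total $n$, the multinomial probability of a table is $\frac{n!}{\prod n_{ij}!} \prod \pi_{ij}^{n_{ij}}$. The following Python sketch evaluates it with the cells flattened into a list; the counts and probabilities in the usage lines are hypothetical and small on purpose, since the floating-point conversion can overflow for very large tables.

```python
import math

def multinomial_pmf(counts, probs):
    """Probability of the cell counts under Multinomial(n, probs), n = sum(counts)."""
    n = sum(counts)
    # Multinomial coefficient n! / (n_1! n_2! ... n_c!), kept as an exact integer.
    coefficient = math.factorial(n)
    for count in counts:
        coefficient //= math.factorial(count)
    prob = float(coefficient)
    for count, pi in zip(counts, probs):
        prob *= pi ** count
    return prob

# Hypothetical 2 x 2 table flattened to (n11, n12, n21, n22) with equal cell probabilities.
print(multinomial_pmf([1, 1, 1, 1], [0.25, 0.25, 0.25, 0.25]))
```

With only two cells the function reduces to the binomial probability, which gives a quick sanity check.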

Independent multinomial sampling

Sometimes the row totals $n_{1+}, n_{2+}, \ldots, n_{I+}$ are fixed by the sampling design. For example, in a clinical trial there may be only 10 people available for the placebo group and 12 people available for the drug group. This type of sampling is appropriate for case-control studies and cohort studies.

If there are only two possible outcomes for the trial, cured and not cured, we have binomial sampling within each row of the contingency table. This is often called independent binomial sampling, since the random variables are independent across the rows.

Independent multinomial sampling

When more than two outcomes are possible, say cured, partially cured, and not cured, independent multinomial sampling occurs within each row of the contingency table:

$$(N_{11}, N_{12}, \ldots, N_{1J}) \sim \text{Multinomial}(n_{1+}, \pi_{1|1}, \pi_{2|1}, \ldots, \pi_{J|1}),$$

$$(N_{21}, N_{22}, \ldots, N_{2J}) \sim \text{Multinomial}(n_{2+}, \pi_{1|2}, \pi_{2|2}, \ldots, \pi_{J|2}),$$

and so on, where $\pi_{j|i} = P(Y = j \mid X = i)$.

Sometimes both row and column totals are fixed. In that case hypergeometric sampling is appropriate.
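Because the rows are sampled independently, the probability of the whole table under independent multinomial sampling is the product of the row-wise multinomial probabilities. Writing $\pi_{j|i} = P(Y = j \mid X = i)$ for the conditional probabilities (a notational assumption made here), this can be sketched as:

```latex
P(N_{ij} = n_{ij} \text{ for all } i, j)
  = \prod_{i=1}^{I} \left[ \frac{n_{i+}!}{\prod_{j=1}^{J} n_{ij}!}
    \prod_{j=1}^{J} \pi_{j|i}^{\, n_{ij}} \right],
\qquad \sum_{j=1}^{J} \pi_{j|i} = 1 \text{ for each } i .
```

With $J = 2$ each factor is a binomial probability, which recovers independent binomial sampling.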

Conditional Association in Stratified 2 × 2 tables

More than two categorical variables may be of interest. Let $X$, $Y$ and $Z$ be three categorical variables. Let's assume that $X$ and $Y$ each have two levels and $Z$ has $k$ levels. The table showing the counts in each cell is now a three-dimensional table, but we can also display it as $k$ two-dimensional tables.

Conditional Association in Stratified 2 × 2 tables

Z = 1              Y
              1          2          Total
X    1        $n_{111}$  $n_{121}$  $n_{1+1}$
     2        $n_{211}$  $n_{221}$  $n_{2+1}$
Total         $n_{+11}$  $n_{+21}$  $n_{++1}$

Z = 2              Y
              1          2          Total
X    1        $n_{112}$  $n_{122}$  $n_{1+2}$
     2        $n_{212}$  $n_{222}$  $n_{2+2}$
Total         $n_{+12}$  $n_{+22}$  $n_{++2}$

and so on, and for Z = k,

Z = k              Y
              1          2          Total
X    1        $n_{11k}$  $n_{12k}$  $n_{1+k}$
     2        $n_{21k}$  $n_{22k}$  $n_{2+k}$
Total         $n_{+1k}$  $n_{+2k}$  $n_{++k}$

Conditional Association in Stratified 2 × 2 tables

This idea can be extended to $I \times J \times K$ tables.

Notation: $P(X = i, Y = j, Z = k) = \pi_{ijk}$ and $\mu_{ijk} = E(n_{ijk})$.

Note 1: $\sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} \pi_{ijk} = 1$.

Note 2: An unbiased estimator of $\pi_{ijk}$ is $\hat{\pi}_{ijk} = n_{ijk}/n$, i.e. $E(\hat{\pi}_{ijk}) = \pi_{ijk}$.

Note 3: A table showing the counts based on $X$ and $Y$ for a particular level of the third variable $Z$ is called a partial table. These tables can be used to study the relationship between $X$ and $Y$ for a particular level of the variable $Z$. The associations in partial tables are called conditional associations, because they refer to the effect of $X$ on $Y$ conditional on fixing $Z$ at some level.

Conditional Association in Stratified 2 × 2 tables

Note 4: The two-way $XY$ contingency table obtained by combining the partial tables for all levels of the variable $Z$ is called the $XY$ marginal table. The marginal table contains no information about $Z$. It is simply a two-way table relating $X$ and $Y$, but it may reflect the effects of $Z$ on $X$ and $Y$.

Z as a control variable

Sometimes $Z$ plays the role of a control variable. In this case, the purpose is to understand the relationship between $X$ and $Y$ while controlling for $Z$. $Z$ is then sometimes called a layer variable, and sometimes a stratification variable.

Example

Let's assume that a university consists of only two professional schools, a law school and a business school, and that we are interested in studying the association between the two variables $X$ = gender and $Y$ = admission decision. $Z$ = school is another variable that might influence the admission decision.

$X$ has two levels: 1 = male, 2 = female.
$Y$ has two levels: 1 = accepted, 2 = rejected.
$Z$ has two levels: 1 = law school, 2 = business school.

Example: Partial tables

Z = 1 = Law school            Y = Decision
X = gender       Accepted            Rejected            Total
Male             $n_{111}$ = 10      $n_{121}$ = 90      $n_{1+1}$ = 100
Female           $n_{211}$ = 100     $n_{221}$ = 200     $n_{2+1}$ = 300
Total            $n_{+11}$ = 110     $n_{+21}$ = 290     $n_{++1}$ = 400

Z = 2 = Business school       Y = Decision
X = gender       Accepted            Rejected            Total
Male             $n_{112}$ = 480     $n_{122}$ = 120     $n_{1+2}$ = 600
Female           $n_{212}$ = 180     $n_{222}$ = 20      $n_{2+2}$ = 200
Total            $n_{+12}$ = 660     $n_{+22}$ = 140     $n_{++2}$ = 800

Marginal XY table

Both schools                  Y = Decision
X = gender       Accepted            Rejected            Total
Male             $n_{11+}$ = 490     $n_{12+}$ = 210     $n_{1++}$ = 700
Female           $n_{21+}$ = 280     $n_{22+}$ = 220     $n_{2++}$ = 500
Total            $n_{+1+}$ = 770     $n_{+2+}$ = 430     $n_{+++}$ = 1200
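The marginal table is obtained by adding the partial tables cell by cell, $n_{ij+} = \sum_k n_{ijk}$. A minimal Python sketch using the admissions counts follows; the dictionary layout and the function name `marginal_table` are illustrative assumptions.

```python
# Partial tables: rows are gender (male, female), columns are decision (accepted, rejected).
partial_tables = {
    "law": [[10, 90], [100, 200]],
    "business": [[480, 120], [180, 20]],
}

def marginal_table(tables):
    """Sum the partial tables over the levels of Z: n_ij+ = sum_k n_ijk."""
    tables = list(tables)
    n_rows, n_cols = len(tables[0]), len(tables[0][0])
    return [
        [sum(table[i][j] for table in tables) for j in range(n_cols)]
        for i in range(n_rows)
    ]

xy_marginal = marginal_table(partial_tables.values())
print(xy_marginal)   # [[490, 210], [280, 220]]
```

Once the tables are summed, the school a given applicant applied to can no longer be recovered, which is the sense in which the marginal table carries no information about $Z$.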

Conditional and Marginal Odds Ratios

Odds ratios can describe marginal and conditional associations. Let $\{\mu_{ijk}\}$ denote cell expected frequencies for some sampling model, such as binomial, multinomial, or Poisson sampling. Then the conditional odds ratio for category $k$ of $Z$ is given by

$$\theta_{XY(k)} = \frac{\mu_{11k} \mu_{22k}}{\mu_{21k} \mu_{12k}}.$$

The marginal odds ratio is given by

$$\theta_{XY} = \frac{\mu_{11+} \mu_{22+}}{\mu_{21+} \mu_{12+}}.$$

Conditional and Marginal Odds Ratios

The estimate of the conditional odds ratio is given by

$$\hat{\theta}_{XY(k)} = \frac{n_{11k} n_{22k}}{n_{21k} n_{12k}}.$$

The estimate of the marginal odds ratio is given by

$$\hat{\theta}_{XY} = \frac{n_{11+} n_{22+}}{n_{21+} n_{12+}}.$$

Conditional and Marginal Odds Ratios: Example

In the admissions example above,

$$\hat{\theta}_{XY(1)} = \frac{10 \times 200}{100 \times 90} = \frac{2}{9}.$$

This means that, at the law school, the odds of being accepted for females are 4.5 times the odds for males.

$$\hat{\theta}_{XY(2)} = \frac{480 \times 20}{180 \times 120} = \frac{4}{9}.$$

This means that, at the business school, the odds of being accepted for females are 2.25 times the odds for males.

The marginal odds ratio is given by

$$\hat{\theta}_{XY} = \frac{n_{11+} n_{22+}}{n_{21+} n_{12+}} = \frac{490 \times 220}{280 \times 210} = 1.83.$$

This means that, ignoring school, the odds of being accepted for males are 1.83 times the odds for females.

Conditional and Marginal Odds Ratios: Example

The result that a marginal association can have a different direction from each conditional association is called Simpson's paradox.
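The reversal can be verified directly from the cell counts of the admissions tables. The sketch below computes each sample odds ratio $n_{11} n_{22} / (n_{21} n_{12})$ exactly with Python's `fractions` module; the function and variable names are assumptions made for illustration.

```python
from fractions import Fraction

def odds_ratio(table):
    """Sample odds ratio n11 * n22 / (n21 * n12) of a 2 x 2 table."""
    (n11, n12), (n21, n22) = table
    return Fraction(n11 * n22, n21 * n12)

# Rows: male, female. Columns: accepted, rejected.
law = [[10, 90], [100, 200]]
business = [[480, 120], [180, 20]]

# Marginal table: cell-by-cell sum of the two partial tables.
marginal = [[a + b for a, b in zip(row_law, row_bus)]
            for row_law, row_bus in zip(law, business)]

theta_law = odds_ratio(law)             # 2/9
theta_business = odds_ratio(business)   # 4/9
theta_marginal = odds_ratio(marginal)   # 11/6

print(theta_law, theta_business, theta_marginal)
print("Simpson's paradox:", theta_law < 1 and theta_business < 1 and theta_marginal > 1)
```

Both conditional odds ratios are below 1 (favouring females within each school) while the marginal odds ratio is above 1 (favouring males overall), because most males applied to the business school, where acceptance rates are high for everyone.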

Marginal Independence versus Conditional Independence

Definition: If $X$ and $Y$ are independent in partial table $k$, then $X$ and $Y$ are called conditionally independent at level $k$ of $Z$. $X$ and $Y$ are said to be conditionally independent given $Z$ when they are conditionally independent at every level of $Z$.

Marginal Independence versus Conditional Independence: Example

The data shown below are from Agresti, Table 2.7, p. 52.

                          Response
Clinic    Treatment    Success    Failure
1         A            …          …
          B            …          …
2         A            2          8
          B            8          32
Total     A            …          …
          B            …          …

Marginal Independence versus Conditional Independence: Example

$$\hat{\theta}_{XY(1)} = \frac{n_{111} n_{221}}{n_{211} n_{121}} = 1,$$

and so $X$ and $Y$ are conditionally independent at $Z = 1$.

$$\hat{\theta}_{XY(2)} = \frac{n_{112} n_{222}}{n_{212} n_{122}} = \frac{2 \times 32}{8 \times 8} = 1,$$

and so $X$ and $Y$ are conditionally independent at $Z = 2$.

The marginal odds ratio is

$$\hat{\theta}_{XY} = \frac{n_{11+} n_{22+}}{n_{21+} n_{12+}} = 2,$$

and so $X$ and $Y$ are not marginally independent.
